Prof. Jayant Rohankar

Tulsiramji Gaikwad-Patil College of Engineering & Technology,

Nagpur

Department of Information Technology

University Paper Solution Winter-2019

Subject: Database Management System Semester: VI

1. a) Differentiate between file processing & DBMS.

Ans :

There are the following main differences between a DBMS and a file system:

1. A DBMS is a collection of data in which the user is not required to write procedures for managing the data; in a file system, the user has to write those procedures.

2. A DBMS gives an abstract view of data that hides the details; a file system exposes the details of data representation and storage.

3. A DBMS provides a crash-recovery mechanism that protects the user from system failure; a file system has no such mechanism, so if the system crashes while data is being entered, the contents of the file may be lost.

4. A DBMS provides a good protection mechanism; it is very difficult to protect a file under the file system.

5. A DBMS contains a wide variety of sophisticated techniques to store and retrieve data; a file system cannot store and retrieve data efficiently.

6. A DBMS takes care of concurrent access to data using some form of locking; in a file system, concurrent access causes many problems, such as one user reading a file while another is deleting or updating some of its information.

b) Explain four relational algebra operations in detail with example.

Ans :

The relational algebra is a relation-at-a-time (or set) language in which all tuples are processed in one statement without the use of a loop. There are several variations of syntax for relational algebra commands; a common symbolic notation is used here and the operations are presented informally.

The primary operations of relational algebra are as follows:

Select

Project

Union

Set difference

Cartesian product

Rename

Select Operation (σ)

It selects tuples that satisfy the given predicate from a relation.

Notation − σp(r)

Here σ stands for selection, r stands for the relation, and p is a propositional logic formula which may use connectives such as and, or, and not.

σ predicate(R): This selection operation functions on a single relation R and describes a relation that contains

only those tuples of R that satisfy the specified condition (predicate).

Example:

σteacher = "database"(Names)

Output - It selects tuples from names where the teacher is 'database.'

Project Operation (∏)

The Projection operation works on a single relation R and defines a relation that contains a vertical subset

of R, extracting the values of specified attributes and eliminating duplicates.

Produce a list of salaries for all staff, showing only the staffNo, fName, lName, and

salary details.

ΠstaffNo, fName, lName, salary(Staff)

In this example, the Projection operation defines a relation that contains only the designated Staff attributes staffNo, fName, lName, and salary, in the specified order. The result of this operation is shown in the figure below.

Union Operation

For R ∪ S, the union of two relations R and S defines a relation that contains all the tuples of R, or S, or both R and S, with duplicate tuples eliminated. R and S must be union-compatible.

For a union operation to be applied, the following rules must hold −

R and S must have the same number of attributes.

Attribute domains must be compatible.

Duplicate tuples get automatically eliminated.

Set difference

For R − S, the set difference operation defines a relation consisting of the tuples that are in relation R, but

not in S. R and S must be union-compatible.

Example:

∏ writer (Nobels) − ∏ writer (papers)

Cartesian product

For R × S, the Cartesian product operation defines a relation that is the concatenation of every tuple of

relation R with every tuple of relation S.

Example:

σwriter = 'gauravray'(Articles Χ Notes)
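
A minimal sketch (not part of the original solution) of these primary operations in Python, modelling a relation as a list of dictionaries; the Staff rows below are hypothetical and only illustrate the idea.

def select(relation, predicate):
    # sigma_p(r): keep only the tuples satisfying the predicate p
    return [t for t in relation if predicate(t)]

def project(relation, attrs):
    # pi_attrs(r): keep only the named attributes, eliminating duplicates
    seen, result = set(), []
    for t in relation:
        row = tuple(t[a] for a in attrs)
        if row not in seen:
            seen.add(row)
            result.append(dict(zip(attrs, row)))
    return result

def union(r, s):
    # r U s: all tuples of r or s, duplicates eliminated (union-compatible relations)
    return [dict(t) for t in {tuple(sorted(d.items())) for d in r + s}]

def difference(r, s):
    # r - s: tuples in r but not in s
    s_set = {tuple(sorted(d.items())) for d in s}
    return [t for t in r if tuple(sorted(t.items())) not in s_set]

def cartesian_product(r, s):
    # r x s: concatenation of every tuple of r with every tuple of s
    return [{**tr, **ts} for tr in r for ts in s]

Staff = [
    {"staffNo": "S1", "fName": "Ann", "lName": "Lee", "salary": 30000},
    {"staffNo": "S2", "fName": "Bob", "lName": "Ray", "salary": 24000},
]
print(select(Staff, lambda t: t["salary"] > 25000))                 # selection
print(project(Staff, ["staffNo", "fName", "lName", "salary"]))      # projection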

Join Operations

Typically, we want only those combinations of the Cartesian product which satisfy certain conditions, and so

you can normally use a Join operation instead of the Cartesian product operation. The Join operation,

which combines two relations to form a new relation, is one of the essential operations in the relational

algebra. There are various types of Join operation, each with subtle differences, some more useful than

others:

Theta join

Equijoin (a particular type of Theta join)

Natural join

Outer join

Semijoin

Rename Operation (ρ)

The results of relational algebra expressions are also relations, but without any name. The rename operation allows database designers to rename the output relation. The rename operation is denoted using the lowercase Greek letter rho (ρ).

It is written as:

ρ x (E)

4 . a) List various file organization methods and explain different ways of organizing records in a

file.

Ans :

Types of file organization:

File organization can be done using various methods. Each method has pros and cons with respect to access and selection, and the programmer chooses the file organization method best suited to the requirement.

Types of file organization are as follows:

o Sequential file organization

o Heap file organization

o Hash file organization

o B+ file organization

o Indexed sequential access method (ISAM)

o Cluster file organization

Sequential File Organization

This method is the easiest method for file organization. In this method, files are stored sequentially. This

method can be implemented in two ways:

1. Pile File Method:

o It is a quite simple method. In this method, we store the records in a sequence, i.e., one after another, in the order in which they are inserted into the table.

In case of updating or deleting any record, the record is first searched for in the memory blocks. When it is found, it is marked for deletion and the new record is inserted.

Insertion of the new record:

Suppose we have records R1, R3, and so on up to R9 and R8 stored in a sequence (a record is simply a row in the table). If we want to insert a new record R2 into the sequence, it will be placed at the end of the file.

2. Sorted File Method:

o In this method, the new record is always inserted at the file's end, and then it will sort the sequence

in ascending or descending order. Sorting of records is based on any primary key or any other key.

o In the case of modification of any record, it will update the record and then sort the file, and lastly,

the updated record is placed in the right place.

Insertion of the new record:

Suppose there is a pre-existing sorted sequence of records R1, R3, and so on up to R6 and R7. If a new record R2 has to be inserted into the sequence, it is first inserted at the end of the file, and the sequence is then sorted again.
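
A minimal sketch of the sorted file method described above (the record ids are hypothetical): the new record is appended at the end of the file and the sequence is then sorted again on the key.

def insert_sorted(file_records, new_record, key="id"):
    file_records.append(new_record)           # new record first goes to the end of the file
    file_records.sort(key=lambda r: r[key])   # then the whole sequence is sorted again
    return file_records

records = [{"id": 1}, {"id": 3}, {"id": 6}, {"id": 7}]   # R1, R3, R6, R7
insert_sorted(records, {"id": 2})                        # insert R2
print([r["id"] for r in records])                        # [1, 2, 3, 6, 7]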

Heap file organization

o It is the simplest and most basic type of organization. It works with data blocks. In heap file

organization, the records are inserted at the file's end. When the records are inserted, it doesn't

require the sorting and ordering of records.

o When the data block is full, the new record is stored in some other block. This new data block need not be the very next data block; the DBMS can select any data block in memory to store the new record. The heap file is also known as an unordered file.

o In the file, every record has a unique id, and every page in a file is of the same size. It is the DBMS

responsibility to store and manage the new records.

Insertion of a new record

Suppose we have five records R1, R3, R6, R4 and R5 in a heap, and we want to insert a new record R2. If data block 3 is full, R2 will be inserted into any other data block selected by the DBMS, say data block 1.

If we want to search, update or delete data in heap file organization, then we need to traverse the data from the start of the file till we get the requested record.

If the database is very large then searching, updating or deleting of record will be time-consuming because

there is no sorting or ordering of records. In the heap file organization, we need to check all the data until

we get the requested record.

Hash File Organization

Hash File Organization uses the computation of hash function on some fields of the records. The hash

function's output determines the location of disk block where the records are to be placed.

When a record has to be retrieved using the hash key columns, then the address is generated, and the whole

record is retrieved using that address. In the same way, when a new record has to be inserted, then the

address is generated using the hash key and record is directly inserted. The same process is applied in the

case of delete and update.

In this method, there is no need to search or sort the entire file; each record is stored at an effectively random location in memory, determined by its hash value.
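
A minimal sketch of the idea (the block count and record fields are hypothetical): the hash of the key column decides which block a record goes to, so insertion and retrieval need no scanning or sorting.

NUM_BLOCKS = 4
blocks = [[] for _ in range(NUM_BLOCKS)]            # stand-ins for disk blocks

def block_for(key):
    return hash(key) % NUM_BLOCKS                   # hash function gives the block address

def insert(record):
    blocks[block_for(record["id"])].append(record)

def lookup(key):
    return [r for r in blocks[block_for(key)] if r["id"] == key]

insert({"id": 101, "name": "R1"})
insert({"id": 205, "name": "R2"})
print(lookup(205))    # the record is found by recomputing the same hash address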

B+ File Organization

o B+ tree file organization is an advanced form of the indexed sequential access method. It uses a tree-like structure to store records in a file.

o It uses the same concept of key-index where the primary key is used to sort the records. For each

primary key, the value of the index is generated and mapped with the record.

o The B+ tree is similar to a binary search tree (BST), but it can have more than two children. In this

method, all the records are stored only at the leaf node. Intermediate nodes act as a pointer to the

leaf nodes. They do not contain any records.

The above B+ tree shows that:

o There is one root node of the tree, i.e., 25.

o There is an intermediary layer with nodes. They do not store the actual record. They have only

pointers to the leaf node.

o The intermediate node to the left of the root contains values smaller than the root and the node to the right contains values larger than the root, i.e., 15 and 30 respectively.

o The leaf nodes contain only the data values, i.e., 10, 12, 17, 20, 24, 27 and 29.

o Searching for any record is easier as all the leaf nodes are balanced.

o In this method, searching any record can be traversed through the single path and accessed easily.

Indexed sequential access method (ISAM)

ISAM method is an advanced sequential file organization. In this method, records are stored in the file

using the primary key. An index value is generated for each primary key and mapped with the record. This

index contains the address of the record in the file.

If any record has to be retrieved based on its index value, then the address of the data block is fetched and

the record is retrieved from the memory.

Cluster file organization

o When related records from two or more tables are stored in the same file, it is known as a cluster. These files will have two or more tables in the same data block, and the key attributes which are used to map these tables together are stored only once.

o This method reduces the cost of searching for various records in different files.

o The cluster file organization is used when there is a frequent need for joining the tables with the

same condition. These joins will give only a few records from both tables. In the given example,

we are retrieving the record for only particular departments. This method can't be used to retrieve

the record for the entire department.

In this method, we can directly insert, update or delete any record. Data is sorted based on the key with

which searching is done. Cluster key is a type of key with which joining of the table is performed.

Types of Cluster file organization:

Cluster file organization is of two types:

1. Indexed Clusters: In an indexed cluster, records are grouped based on the cluster key and stored together. The EMPLOYEE and DEPARTMENT relationship above is an example of an indexed cluster: all the records are grouped based on the cluster key DEP_ID and stored together.

2. Hash Clusters: It is similar to the indexed cluster. In hash cluster, instead of storing the records based on

the cluster key, we generate the value of the hash key for the cluster key and store the records with the

same hash key value.

b)Compare primary & secondary index.

Ans :

Primary Index

i) It is an ordered file whose records are of fixed length with two fields.

ii) Only based on the primary key.

iii) The total number of entries in the index is the same as the number of disk blocks in the ordered data

file.

iv) Primary index is a kind of nondense (sparse) index.

v) There may be at most one primary index for a file.

vi) Needs less storage space.

Secondary index

i) It provides a secondary means of accessing a file for which some primary access already exists.

ii) May be based on candidate key or secondary key.

iii) It has a larger number of entries due to duplication.

iv) Secondary index is a kind of dense index.

v) There may be more than one secondary index for the same file.

vi) Needs more storage space and longer search time.

6. What is normalization and why is it needed? Explain the process in detail. Also explain 1NF, 2NF

and 3NF with example.

Ans :

Normalization is the process of organizing the data in a database so as to minimize redundancy and avoid insertion, update and deletion anomalies. It is carried out by decomposing larger relations into smaller, well-structured relations using a series of normal forms.

Here are the most commonly used normal forms:

First normal form(1NF)

Second normal form(2NF)

Third normal form(3NF)

First normal form (1NF)

As per the rule of first normal form, an attribute (column) of a table cannot hold multiple values. It should

hold only atomic values.

Sample Employee table; it shows employees working in multiple departments.

Employee Age Department

Melvin 32 Marketing, Sales

Edward 45 Quality Assurance

Alex 36 Human Resource

Employee table following 1NF:

Employee Age Department

Melvin 32 Marketing

Melvin 32 Sales

Edward 45 Quality Assurance

Alex 36 Human Resource
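
A small sketch (illustration only, using the rows above) of how the non-1NF Employee table is converted to 1NF by emitting one row per atomic Department value:

raw = [
    {"Employee": "Melvin", "Age": 32, "Department": "Marketing, Sales"},
    {"Employee": "Edward", "Age": 45, "Department": "Quality Assurance"},
    {"Employee": "Alex",   "Age": 36, "Department": "Human Resource"},
]
normalized = [
    {"Employee": r["Employee"], "Age": r["Age"], "Department": d.strip()}
    for r in raw
    for d in r["Department"].split(",")
]
for row in normalized:
    print(row)    # Melvin now appears twice, once per department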

Second normal form (2NF)

A table is said to be in 2NF if both the following conditions hold:

Table is in 1NF (First normal form)

No non-prime attribute is dependent on any proper subset of any candidate key of the table.

An attribute that is not part of any candidate key is known as non-prime attribute.

Third Normal form (3NF)

A table design is said to be in 3NF if both the following conditions hold:

Table must be in 2NF

Transitive functional dependency of non-prime attribute on any super key should be removed.

An attribute that is not part of any candidate key is known as non-prime attribute.

In other words 3NF can be explained like this: A table is in 3NF if it is in 2NF and for each functional

dependency X-> Y at least one of the following conditions hold:

X is a super key of table

Y is a prime attribute of table

An attribute that is a part of one of the candidate keys is known as prime attribute.

7. a) Explain the different phases involved in Query processing

Ans :

Query processing is the translation of high-level queries into low-level expressions.

It is a step-wise process that involves translation down to the physical level of the file system, query optimization, and the actual execution of the query to get the result.

It requires the basic concepts of relational algebra and file structure.

It refers to the range of activities that are involved in extracting data from the database.

It includes translation of queries in high-level database languages into expressions that can be

implemented at the physical level of the file system.

In query processing, we will actually understand how these queries are processed and how they

are optimized.

The steps of query processing are as follows:

The first step is to transform the query into a standard form.

The query, written in a high-level language such as SQL, is translated into a relational algebra expression. During this process, the parser checks the syntax and verifies the relations and the attributes used in the query.

The second step is query optimization. The optimizer transforms the query into equivalent expressions that are more efficient to execute.

The third step is Query evaluation. It executes the above query execution plan and returns the

result.

b) What do you mean by materialization? How does pipelining overcome the drawbacks of materialization?

Ans :

Materialization

Materialized evaluation walks the parse or expression tree of the relational algebra operation, and

performs the innermost or leaf-level operations first

The intermediate result of each operation is materialized — an actual, but temporary, relation —

and becomes input for subsequent operations.

The cost of materialization is the sum of the individual operations plus the cost of writing the

intermediate results to disk — a function of the blocking factor (number of records per block) of

the temporaries.

The problem with materialization is that it produces lots of temporary files and requires lots of I/O.

Pipelining

With pipelined evaluation, operations form a queue, and results are passed from one operation to

another as

they are calculated, hence the technique’s name.

General approach: restructure the individual operation algorithms so that they take streams of

tuples as both input and output.

Limitation: not every operator can be restructured this way; some operators cannot produce any output until they have seen their entire input.

So for instance, algorithms that require sorting can only use pipelining if the input is already

sorted beforehand, since sorting by nature cannot be performed until all tuples to be sorted are

known.
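
A minimal sketch (not from the original answer) contrasting the two evaluation styles on an in-memory selection-then-projection: materialized evaluation builds a complete temporary result, while pipelined evaluation streams tuples through generators.

rows = [{"id": i, "salary": 1000 * i} for i in range(1, 10001)]

# Materialized: the selection writes out a full temporary relation
# before the projection starts.
temp = [r for r in rows if r["salary"] > 5000]        # temporary (intermediate) relation
materialized_result = [r["id"] for r in temp]

# Pipelined: each operator is a generator that passes tuples on as they are
# produced, so no intermediate relation is ever built.
def select_stream(source, predicate):
    for r in source:
        if predicate(r):
            yield r

def project_stream(source, attr):
    for r in source:
        yield r[attr]

pipelined_result = list(project_stream(select_stream(rows, lambda r: r["salary"] > 5000), "id"))
assert materialized_result == pipelined_result        # same answer, different evaluation strategy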

9 .a) Explain the states of a transaction with the help of state transition diagram

Ans :

• A transaction is the sequence of one or more SQL statements that are combined together to form a single

unit of work.

Fig -Transaction State Diagram:

Active state

o The active state is the first state of every transaction. In this state, the transaction is being executed.

o For example: Insertion or deletion or updating a record is done here. But all the records are still not

saved to the database.

Partially committed

o In the partially committed state, a transaction executes its final operation, but the data is still not

saved to the database.

o In the total mark calculation example, a final display of the total marks step is executed in this

state.

Committed

A transaction is said to be in a committed state if it executes all its operations successfully. In this state, all

the effects are now permanently saved on the database system.

Failed state

o If any of the checks made by the database recovery system fails, then the transaction is said to be

in the failed state.

o In the example of total mark calculation, if the database is not able to fire a query to fetch the

marks, then the transaction will fail to execute.

Aborted

o If any of the checks fail and the transaction has reached a failed state then the database recovery

system will make sure that the database is in its previous consistent state. If not then it will abort or

roll back the transaction to bring the database into a consistent state.

o If the transaction fails in the middle of its execution, then all the operations executed so far are rolled back to bring the database back to its consistent state.

o After aborting the transaction, the database recovery module will select one of the two operations:

1. Re-start the transaction

2. Kill the transaction

b) What is conflict serializability? Explain the different forms of schedule equivalence, i.e., conflict serializability.

Ans :

o A schedule is called conflict serializable if, after swapping its non-conflicting operations, it can be transformed into a serial schedule.

o The schedule will be a conflict serializable if it is conflict equivalent to a serial schedule.

In DBMS, schedules may have the following three different kinds of equivalence relations among them-

1. Result Equivalence

2. Conflict Equivalence

3. View Equivalence

1. Result Equivalent Schedules-

If any two schedules generate the same result after their execution, then they are called result equivalent schedules.

This equivalence relation is considered of least significance.

This is because some schedules might produce same results for some set of values and different

results for some other set of values.

2. Conflict Equivalent Schedules-

If any two schedules satisfy the following two conditions, then they are called conflict equivalent schedules:

1. The set of transactions present in both schedules is the same.

2. The order of every pair of conflicting operations is the same in both schedules.

3. View Equivalent Schedules-

Consider two schedules S1 and S2 each consisting of two transactions T1 and T2.

Schedules S1 and S2 are called view equivalent if the following three conditions hold true for them-

Condition-01:

For each data item X, if transaction Ti reads X from the database initially in schedule S1, then in schedule

S2 also, Ti must perform the initial read of X from the database.

Condition-02:

If transaction Ti reads a data item that has been updated by the transaction Tj in schedule S1, then in

schedule S2 also, transaction Ti must read the same data item that has been updated by the transaction Tj.

Condition-03

For each data item X, if X has been updated at last by transaction Ti in schedule S1, then in schedule S2

also, X must be updated at last by transaction Ti.

Checking Whether a Schedule is View Serializable Or Not-

Method-01:

Check whether the given schedule is conflict serializable or not.

If the given schedule is conflict serializable, then it is surely view serializable. Stop and report

your answer.

If the given schedule is not conflict serializable, then it may or may not be view serializable. Go

and check using other methods.

Method-02:

Check if there exists any blind write operation.

(Writing without reading is called as a blind write).

If there does not exist any blind write, then the schedule is surely not view serializable. Stop and

report your answer.

If there exists any blind write, then the schedule may or may not be view serializable. Go and

check using other methods.

Method-03:

In this method, try finding a view equivalent serial schedule.

By using the above three conditions, write all the dependencies.

Then, draw a graph using those dependencies.

If there exists no cycle in the graph, then the schedule is view serializable otherwise not.
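
The dependency-graph idea above can also be applied to conflict serializability: the standard precedence-graph test adds an edge Ti → Tj whenever an operation of Ti conflicts with a later operation of Tj, and the schedule is conflict serializable if and only if the graph is acyclic. A minimal Python sketch (the schedule format and transaction names are illustrative only):

def conflict_serializable(schedule):
    # schedule: list of (transaction, operation, item), e.g. ("T1", "R", "A")
    edges = set()
    for i, (ti, op_i, x) in enumerate(schedule):
        for tj, op_j, y in schedule[i + 1:]:
            # two operations conflict if they belong to different transactions,
            # access the same item, and at least one of them is a write
            if ti != tj and x == y and "W" in (op_i, op_j):
                edges.add((ti, tj))

    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)

    def has_cycle(node, visiting, done):
        visiting.add(node)
        for nxt in graph.get(node, ()):
            if nxt in visiting or (nxt not in done and has_cycle(nxt, visiting, done)):
                return True
        visiting.discard(node)
        done.add(node)
        return False

    return not any(has_cycle(n, set(), set()) for n in graph)

s = [("T1", "R", "A"), ("T2", "R", "A"), ("T1", "W", "A"), ("T2", "W", "A")]
print(conflict_serializable(s))   # False: edges T1->T2 and T2->T1 form a cycle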

11. b) Define integrity constraints. Explain the different types of integrity constraints with suitable examples.

Ans :

Integrity Constraints

o Integrity constraints are a set of rules used to maintain the quality of information.

o Integrity constraints ensure that the data insertion, updating, and other processes have to be

performed in such a way that data integrity is not affected.

o Thus, integrity constraint is used to guard against accidental damage to the database.

Types of Integrity Constraint

1. Domain constraints

o Domain constraints can be defined as the definition of a valid set of values for an attribute.

o The data type of domain includes string, character, integer, time, date, currency, etc. The value of

the attribute must be available in the corresponding domain.

Example:

2. Entity integrity constraints

o The entity integrity constraint states that primary key value can't be null.

o This is because the primary key value is used to identify individual rows in relation and if the

primary key has a null value, then we can't identify those rows.

o A table can contain null values in fields other than the primary key field.

Example:

3. Referential Integrity Constraints

o A referential integrity constraint is specified between two tables.

o In the Referential integrity constraints, if a foreign key in Table 1 refers to the Primary Key of

Table 2, then every value of the Foreign Key in Table 1 must be null or be available in Table 2.

Example:

4. Key constraints

o Keys are attributes used to uniquely identify an entity within its entity set.

o An entity set can have multiple keys, but one of them is chosen as the primary key. A primary key must contain unique values and cannot contain a null value in the relational table.

Example:
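
The figures for these examples are not reproduced in this extraction. As an illustrative sketch only (the DEPARTMENT/EMPLOYEE schema and values are hypothetical), the four kinds of constraint can be declared and enforced through SQL, here via Python's sqlite3 module:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")               # SQLite enforces foreign keys only when enabled

con.execute("""
CREATE TABLE DEPARTMENT (
    DEP_ID   INTEGER PRIMARY KEY,                     -- key constraint / entity integrity
    DEP_NAME TEXT NOT NULL
)""")

con.execute("""
CREATE TABLE EMPLOYEE (
    EMP_ID  INTEGER PRIMARY KEY,                      -- entity integrity: primary key cannot be NULL
    NAME    TEXT NOT NULL,
    AGE     INTEGER CHECK (AGE BETWEEN 18 AND 65),    -- domain constraint on AGE
    DEP_ID  INTEGER REFERENCES DEPARTMENT(DEP_ID)     -- referential integrity
)""")

con.execute("INSERT INTO DEPARTMENT VALUES (1, 'IT')")
con.execute("INSERT INTO EMPLOYEE VALUES (101, 'Melvin', 32, 1)")      # satisfies every constraint

try:
    con.execute("INSERT INTO EMPLOYEE VALUES (102, 'Edward', 45, 99)") # department 99 does not exist
except sqlite3.IntegrityError as e:
    print("rejected:", e)                             # FOREIGN KEY constraint failed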

Prof. Parul Bhanarkar

Tulsiramji Gaikwad-Patil College of Engineering & Technology,

Nagpur

Department of Information Technology

University Paper Solution Summer-2019

Subject: Database Management System Semester: VI

Q.1. (a). What do you mean by data abstraction? Explain three level of abstraction and also three

level of architecture of database systems with reference to above levels.

6M

Ans.

For the system to be usable, it must retrieve data efficiently. The need for efficiency has led designers

to use complex data structures to represent data in the database.

Since many database-systems users are not computer trained, developers hide the complexity from

users through several levels of abstraction, to simplify users’ interactions with the system. The Figure

below shows the various levels of abstraction.

1. Physical Level :

• The lowest level of abstraction describes how the data are actually stored.

• The physical level describes complex low-level data structures in detail.

• The physical schema describes details of how data is stored: files, indices, etc. on the random

access disk system.

• It also typically describes the record layout of files and type of files (hash, b-tree, flat).

• Early applications worked at this level - explicitly dealt with details. E.g., minimizing physical

distances between related data and organizing the data structures within the file (blocked records,

linked lists of blocks, etc.)

Problems:

Routines are hardcoded to deal with physical representation.

Changes to data structures are difficult to make.

Application code becomes complex since it must deal with details.

Rapid implementation of new features very difficult.

2. Logical Level :

• The next-higher level of abstraction describes what data are stored in the database, and what

relationships exist among those data.

• The logical level thus describes the entire database in terms of a small number of relatively simple

structures.

• Although implementation of the simple structures at the logical level may involve complex

physical-level structures, the user of the logical level does not need to be aware of this complexity.

• Database administrators, who must decide what information to keep in the database, use the logical

level of abstraction.

• This level hides details of the physical level and is also called the conceptual level.

• In the relational model, the conceptual schema presents data as a set of tables.

• The DBMS maps data access between the conceptual to physical schemas automatically.

• Physical schema can be changed without changing application

DBMS must change mapping from conceptual to physical.

Referred to as physical data independence.

3. View Level :

• The highest level of abstraction describes only part of the entire database.

• Even though the logical level uses simpler structures, complexity remains because of the variety of

information stored in a large database.

• Many users of the database system do not need all this information; instead, they need to access

only a part of the database.

• The view level of abstraction exists to simplify their interaction with the system.

• The system may provide many views for the same database.

• In the relational model, the external schema also presents data as a set of relations.

• An external schema specifies a view of the data in terms of the conceptual level.

• It is designed to fulfil the needs of a particular category of users.

• Portions of the stored data should not be seen by some users; this begins to implement a level of security and simplifies the view for these users.

Examples:

Students should not see faculty salaries.

Faculty should not see billing or payment data.

-Applications are written in terms of an external schema.

-The external view is computed when accessed. It is not stored.

-Different external schemas can be provided to different categories of users.

-Translation from external level to conceptual level is done automatically by DBMS at run time.

-The conceptual schema can be changed without changing application:

Mapping from external to conceptual must be changed.

Referred to as conceptual data independence.

Fig. 5.2. DBMS Levels of Abstraction (Schemas)

Q.1.(b). Explain four relational algebra operations in detail with example.

6M

Ans.

The relational algebra is a relation-at-a-time (or set) language in which all tuples are processed in one statement without the use of a loop. There are several variations of syntax for relational algebra commands; a common symbolic notation is used here and the operations are presented informally.

The primary operations of relational algebra are as follows:

Select

Project

Union

Set difference

Cartesian product

Rename

Select Operation (σ)

It selects tuples that satisfy the given predicate from a relation.

Notation − σp(r)

Here σ stands for selection, r stands for the relation, and p is a propositional logic formula which may use connectives such as and, or, and not.

σ predicate(R): This selection operation functions on a single relation R and describes a relation that contains

only those tuples of R that satisfy the specified condition (predicate).

Example:

σteacher = "database"(Names)

Output - It selects tuples from names where the teacher is 'database.'

Project Operation (∏)

The Projection operation works on a single relation R and defines a relation that contains a vertical subset

of R, extracting the values of specified attributes and eliminating duplicates.

Produce a list of salaries for all staff, showing only the staffNo, fName, lName, and

salary details.

ΠstaffNo, fName, lName, salary(Staff)

In this example, the Projection operation defines a relation that contains only the designated Staff attributes staffNo, fName, lName, and salary, in the specified order. The result of this operation is shown in the figure below.

Union Operation

For R ∪ S, the union of two relations R and S defines a relation that contains all the tuples of R, or S, or both R and S, with duplicate tuples eliminated. R and S must be union-compatible.

For a union operation to be applied, the following rules must hold −

R and S must have the same number of attributes.

Attribute domains must be compatible.

Duplicate tuples get automatically eliminated.

Set difference

For R − S, the set difference operation defines a relation consisting of the tuples that are in relation R, but

not in S. R and S must be union-compatible.

Example:

∏ writer (Nobels) − ∏ writer (papers)

Cartesian product

For R × S, the Cartesian product operation defines a relation that is the concatenation of every tuple of

relation R with every tuple of relation S.

Example:

σwriter = 'gauravray'(Articles Χ Notes)

Join Operations

Typically, we want only those combinations of the Cartesian product which satisfy certain conditions, and so

you can normally use a Join operation instead of the Cartesian product operation. The Join operation,

which combines two relations to form a new relation, is one of the essential operations in the relational

algebra. There are various types of Join operation, each with subtle differences, some more useful than

others:

Theta join

Equijoin (a particular type of Theta join)

Natural join

Outer join

Semijoin

Rename Operation (ρ)

The results of relational algebra expressions are also relations, but without any name. The rename operation allows database designers to rename the output relation. The rename operation is denoted using the lowercase Greek letter rho (ρ).

It is written as:

ρ x (E)

Q.2.(b). Differentiate between file processing system and DBMS 8M

Ans.

A file system is a method for storing and organizing computer files and the data they contain to make it

easy to find and access them.

File systems may use a storage device such as a hard disk or CD-ROM and involve maintaining the

physical location of the files.

A typical example of file processing system is a system used to store and manage data of each

department or area within an organization having its own set of files, often creating data redundancy

and data isolation.

Before the advent of DBMS the data or records were stored in permanent system files using the

conventional operating system.

Application programs were then created independently to access the data stored in these files.

The following are the drawbacks of traditional File System:

1. Difficulty in accessing data: It is not easy to retrieve information using a conventional file processing system. Getting the exact

result matching the query is difficult.

2. Duplication of data:

• Often the same information is stored in more than one file. This uncontrolled duplication of data is undesirable for several reasons:

• Duplication is wasteful. It costs time and money to enter the data more than once

• It takes up additional storage space, again with associated costs.

• Duplication can lead to loss of data integrity; in other words the data is no longer consistent.

3. Separated and Isolated Data:

• To make a decision, a user might need data from two separate files.

• First, the files were evaluated by analysts and programmers to determine the specific data

required from each file and then applications were written in a programming language to

process and extract the needed data.

• The amount of work involved increased because data from several files was needed.

• Since data is scattered in various files, and files may be in different formats, it is difficult to

write new application programs to retrieve the appropriate data.

4. Data Security:

• The security of data is low in file based system because, the data maintained in the flat file(s) is

easily accessible.

Example: Consider the Banking System. The Customer Transaction file has details about the total

available balance of all customers. A Customer wants information about his account balance. In a

file system it is difficult to give the Customer access to only his data in the file. Thus enforcing

security constraints for the entire file or for certain data items is difficult.

5. Data Dependence:

• In file processing systems, files and records were described by specific physical formats that were

coded into the application program by programmers.

• If the format of a certain record was changed, the code in each file containing that format must be

updated.

• Furthermore, instructions for data storage and access were written into the application's code.

• Therefore, changes in storage structure or access methods could greatly affect the processing or

results of an application.

6. Data Inflexibility:

Program-data interdependency and data isolation, limited the flexibility of file processing systems

in providing users with results of information requests.

7. Incompatible file formats:

• As the structure of files is embedded in the application programs, the structures are dependent

on the application programming language.

Example, the structure of a file generated by a COBOL program may be different from the

structure of a file generated by a 'C' program.

The direct incompatibility of such files makes them difficult to process jointly.

8. Concurrency problems.

When multiple users access the same piece of data at the same time, it is called concurrency.

When two or more users read the data simultaneously there is no problem, but when they try to update a file simultaneously, it may result in a serious problem.

Example:

Let us consider a scenario in which transaction T1 transfers an amount of 1,000 from account A to account B (the initial value of A is 5,000 and of B is 8,000). Meanwhile, another transaction T2, which displays the sum of accounts A and B, is also executed. If both transactions run in parallel, the schedule may leave the database inconsistent: it can show Rs. 12,000 as the sum of accounts A and B instead of Rs. 13,000. The problem occurs because the concurrently running transaction T2 reads A and B at an intermediate point and computes their sum, which yields an inconsistent value.
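
The schedule figure from the original answer is not reproduced here; the following minimal sketch simply replays that interleaving to show where the inconsistent value of 12,000 comes from.

accounts = {"A": 5000, "B": 8000}

accounts["A"] -= 1000                       # T1, step 1: debit A by 1000

# T2 runs in between and reads both balances at this intermediate point
print(accounts["A"] + accounts["B"])        # 12000 -- inconsistent sum seen by T2

accounts["B"] += 1000                       # T1, step 2: credit B, transfer now complete
print(accounts["A"] + accounts["B"])        # 13000 -- the consistent total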

9. Integrity Problems: The data values may need to satisfy some integrity constraints.

For example, the balance field value must be greater than 5000.

We have to handle this through program code in file processing systems, but in a database we can declare the integrity constraints along with the schema definition itself.

10. Atomicity Problem:

It is difficult to ensure atomicity in file processing system.

Example: Transferring $100 from Account A to account B. If a failure occurs during execution

there could be situation like $100 is deducted from Account A and not credited in Account B.

Q.2.(b). Explain the following in detail :- 5M

(i). Cost Estimation

Ans.

(i). Cost Estimation

The optimizer attempts to generate the best execution plan for a SQL statement.

The best execution plan is defined as the plan with the lowest cost among all considered candidate

plans.

The cost computation accounts for factors of query execution such as I/O, CPU, and

communication.

The best method of execution depends on myriad conditions including how the query is written,

the size of the data set, the layout of the data, and which access structures exist.

The optimizer determines the best plan for a SQL statement by examining multiple access

methods, such as full table scan or index scans, and different join methods such as nested loops

and hash joins.

Because the database has many internal statistics and tools at its disposal, the optimizer is usually

in a better position than the user to determine the best method of statement execution.

For this reason, all SQL statements use the optimizer.

Consider a user who queries records for employees who are managers.

If the database statistics indicate that 80% of employees are managers, then the optimizer may

decide that a full table scan is most efficient.

However, if statistics indicate that few employees are managers, then reading an index followed by

a table access by rowid may be more efficient than a full table scan.

Query optimization is the overall process of choosing the most efficient means of executing a SQL

statement.

SQL is a nonprocedural language, so the optimizer is free to merge, reorganize, and process in any

order.

The database optimizes each SQL statement based on statistics collected about the accessed data.

When generating execution plans, the optimizer considers different access paths and join methods.

Factors considered by the optimizer include: system resources (I/O, CPU, and memory), the number of rows returned, and the size of the initial data sets.

The cost is a number that represents the estimated resource usage for an execution plan.

The optimizer assigns a cost to each possible plan, and then chooses the plan with the lowest cost.

For this reason, the optimizer is sometimes called the cost-based optimizer (CBO) to contrast it

with the legacy rule-based optimizer (RBO).

Q.3.(a). List various file organization methods & explain different ways of organizing records in a file.

6M

Ans.

A file is sequence of records stored in binary format.

A disk drive is formatted into several blocks, which are capable for storing records.

File records are mapped onto those disk blocks. The fig below shows the file structure.

The blocks are of a fixed size, determined by the physical properties of the disk and the operating system, though the size may vary from system to system.

The records which make up the block can be of fixed size or variable size. Files with fixed length records are easier to manage & implement.

a) File Organization

The method of mapping file records to disk blocks defines file organization, i.e., how the file records are organized.

The following are the types of file organization

(i). Heap File Organization:

When a file is created using the heap file organization mechanism, the operating system allocates a memory area to that file without any further accounting details.

File records can be placed anywhere in that memory area.

It is the responsibility of software to manage the records.

Heap File does not support any ordering, sequencing or indexing on its own.

(ii). Sequential File Organization:

Every file record contains a data field (attribute) to uniquely identify that record.

In the sequential file organization mechanism, records are placed in the file in some sequential order based on a unique key field or search key.

Practically, it is not possible to store all the records sequentially in physical form.

(iii). Hash File Organization:

This mechanism uses a Hash function computation on some field of the records.

As we know, a file is a collection of records, which have to be mapped onto some blocks of the disk space allocated to the file. This mapping is defined by the hash computation.

The output of hash determines the location of disk block where the records may exist.

(iv). Clustered File Organization:

Clustered file organization is not considered good for large databases.

In this mechanism, related records from one or more relations are kept in a same disk block, that is,

the ordering of records is not based on primary key or search key.

This organization helps to retrieve data easily based on particular join condition.

Q.3.(b). Differentiate between the following

i) B and B+ Tree

ii) Sparse index and dense index 6M

Ans.

ii) Sparse Index

In sparse index, index records are not created for every search key.

An index record here contains search key and actual pointer to the data on the disk.

To search for a record, we first follow the index record to reach the approximate location of the data. If the data we are looking for is not exactly where the index entry points, the system performs a sequential search from that point until the desired data is found.

Dense Index:

In dense index, there is an index record for every search key value in the database.

This makes searching faster but requires more space to store index records itself.

Index record contains search key value and a pointer to the actual record on the disk.
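
A minimal sketch (the block contents are hypothetical) contrasting the two: the dense index maps every search-key value to its block, while the sparse index keeps one entry per block and falls back to a short sequential scan inside the chosen block.

data_blocks = [
    [10, 12, 17],     # block 0
    [20, 24, 27],     # block 1
    [29, 33, 38],     # block 2
]

dense_index = {key: b for b, block in enumerate(data_blocks) for key in block}
sparse_index = [(block[0], b) for b, block in enumerate(data_blocks)]   # first key of each block

def lookup_dense(key):
    return data_blocks[dense_index[key]]              # one index entry per search-key value

def lookup_sparse(key):
    # follow the largest index entry whose key is <= the target, then scan that block
    candidates = [b for first, b in sparse_index if first <= key]
    if not candidates:
        return None
    block = data_blocks[max(candidates)]
    return block if key in block else None

print(lookup_dense(24))     # [20, 24, 27]
print(lookup_sparse(24))    # [20, 24, 27]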

Q.4.a) What is query processing? What are the join strategies in the join operation? Explain in detail.

8 M

Ans. Query processing is the set of activities involved in getting the result of a query expressed in a high-level language.

These activities include parsing the queries and translating them into expressions that can be implemented at the physical level of the file system, optimizing the query in its internal form to obtain a suitable execution strategy, and then actually executing the query to get the results.

The cost of query processing is dominated by disk access.

For a given query, several possible processing strategies exist, especially when the query is complex.

The difference between a good strategy and a bad one may be several orders of magnitude, so it is worthwhile for the system to spend some time selecting a good strategy for processing the query.

There are several join strategies for computing the join of relations, and we analyze their

respective costs.

The cardinality of Join operations can be calculated as under:

Assume the join R ⋈ S.

1. If R and S have no common attributes: nr ∗ ns

2. If R and S have an attribute A in common: nr ∗ ns / V(A, s) or nr ∗ ns / V(A, r) (take the minimum of the two estimates).

3. If R and S have attribute A in common and:

1. A is a candidate key for R: ≤ ns

2. A is a candidate key in R and a candidate key in S: ≤ min(nr, ns)

3. A is a key for R and a foreign key for S: = ns

Size and plans for join operation

Running example: depositor ⋈ customer

Metadata:

ncustomer = 10,000 ndepositor = 5000

fcustomer = 25 fdepositor = 50

bcustomer= 400 bdepositor= 100

V(cname, depositor) = 2500 (each customer has on average 2 accts)

cname in depositor is foreign key

Nested-loop join:

1. Figure below shows a simple algorithm to compute the theta join, r ⋈θ s, of two relations r and s.

2. This algorithm is called the nested-loop join algorithm, since it basically consists

of a pair of nested for loops.

3. Relation r is called the outer relation and relation s the inner relation of the join,

since the loop for r encloses the loop for s.

The algorithm uses the notation tr · ts, where tr and ts are tuples; tr · ts denotes the

tuple constructed by concatenating the attribute values of tuples tr and ts.

for each tuple tr in r do begin
    for each tuple ts in s do begin
        test pair (tr, ts) to see if they satisfy the join condition θ
        if they do, add tr · ts to the result
    end
end

Block nested loop join:

1. If the buffer is too small to hold either relation entirely in memory, saving in block accesses can

be done if we process the relations on a per-block basis, rather than on a per-tuple basis.

2. Figure below shows block nested-loop join, which is a variant of the nested-loop join where

every block of the inner relation is paired with every block of the outer relation.

3. Within each pair of blocks, every tuple in one block is paired with every tuple in

the other block, to generate all pairs of tuples.

4. As before, all pairs of tuples that satisfy the join condition are added to the result.

5. The primary difference in cost between the block nested-loop join and the basic

nested-loop join is that, in the worst case, each block in the inner relation s is read

only once for each block in the outer relation, instead of once for each tuple in the

outer relation.

6. Thus, in the worst case, there will be a total of br * bs + br block accesses, where

br and bs denote the number of blocks containing records of r and s respectively.

7. Clearly, it is more efficient to use the smaller relation as the outer relation, in case neither of the

relations fits in memory.

8. In the best case, there will be br + bs block accesses.

for each block Br of r do begin
    for each block Bs of s do begin
        for each tuple tr in Br do begin
            for each tuple ts in Bs do begin
                test pair (tr, ts) to see if they satisfy the join condition
                if they do, add tr · ts to the result
            end
        end
    end
end

Cost:

1. Worst case estimate: br ∗ bs + br block accesses.

2. Improvements to nested loop and block nested loop algorithms for a buffer with M blocks:

In block nested-loop, use M − 2 disk blocks as the blocking unit for the outer relation, where M = memory size in blocks; use the remaining two blocks to buffer the inner relation and the output.

Cost = ⌈br / (M − 2)⌉ ∗ bs + br

If equi-join attribute forms a key on inner relation, stop inner loop on first match

Scan inner loop forward and backward alternately, to make use of the blocks remaining in buffer .

Indexed-Nested loop join:

1. In a nested-loop join , if an index is available on the inner loop’s join attribute,

index lookups can replace file scans.

2. For each tuple tr in the outer relation r, the index is used to look up tuples in s that

will satisfy the join condition with tuple tr.

3. This join method is called an indexed nested-loop join; it can be used with existing indices, as

well as with temporary indices created for the sole purpose of evaluating the join.

4. Looking up tuples in s that will satisfy the join conditions with a given tuple tr is

essentially a selection on s.

5. The cost of an indexed nested-loop join can be computed as follows.

6. For each tuple in the outer relation r, a lookup is performed on the index for s, and the relevant

tuples are retrieved.

7. In the worst case, there is space in the buffer for only one page of r and one page of the index.

8. Then, br disk accesses are needed to read relation r, where br denotes the number of blocks

containing records of r.

9. For each tuple in r, we perform an index lookup on s.

10. Then, the cost of the join can be computed as br + nr ∗ c, where nr is the number of records in relation r, and c is the cost of a single selection on s using the join condition.

• For each tuple tR in the outer relation R, use the index to look up tuples in S that satisfy the join condition with tuple tR.
• Worst case: the buffer has space for only one page of R, and, for each tuple in R, we perform an index lookup on S.
• Cost of the join: br + nr ∗ c
1. where c is the cost of traversing the index and fetching all matching S tuples for one tuple of R;
2. c can be estimated as the cost of a single selection on S using the join condition.
If indices are available on the join attributes of both R and S, use the relation with fewer tuples as the outer relation.
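A rough Python sketch of the indexed nested-loop idea (illustrative only, not from the original solution), where the index on the inner relation s is modelled simply as a dictionary from join-attribute value to the matching tuples:

# Indexed nested-loop join sketch: the "index" is a dict built on s's join attribute.
def build_index(s, key=0):
    index = {}
    for ts in s:
        index.setdefault(ts[key], []).append(ts)
    return index

def indexed_nested_loop_join(r, s, key=0):
    index = build_index(s, key)          # index lookups replace scans of s
    result = []
    for tr in r:                         # one lookup per tuple of the outer relation r
        for ts in index.get(tr[key], []):
            result.append(tr + ts)
    return result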

Merge Join:

1. The merge join algorithm (also called the sort–merge join algorithm) can be used

to compute natural joins and equi-joins.

2. Let r(R) and s(S) be the relations whose natural join is to be computed, and let R∩S

denote their common attributes.

3. Suppose that both relations are sorted on the attributes R∩S.

4. Then, their join can be computed by a process much like the merge stage in the merge–sort

algorithm.

5. The merge join algorithm requires that the set Ss of all tuples with the same value for the join

attributes must fit in main memory.


1. Each block needs to be read only once (assuming all tuples for any given value of the join attributes fit in memory).
2. Thus the number of block accesses for a merge join is bR + bS.
3. But what if one or both of R, S are not sorted on the join attribute? It may be worth sorting first and then performing the merge join (sort-merge join).
Cost: bR + bS + (cost of sorting R) + (cost of sorting S)
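A simplified Python sketch of the merge step (illustrative only); it assumes both inputs are lists of tuples already sorted on the join attribute at position 0, and that each group of equal join values in s fits in memory, as stated above:

# Merge join sketch: both inputs must already be sorted on the join attribute (index 0).
def merge_join(r, s):
    result = []
    i = j = 0
    while i < len(r) and j < len(s):
        if r[i][0] < s[j][0]:
            i += 1
        elif r[i][0] > s[j][0]:
            j += 1
        else:
            # collect the group of s tuples sharing this join value
            value, group_start = r[i][0], j
            while j < len(s) and s[j][0] == value:
                j += 1
            # pair every r tuple with this value against the whole s group
            while i < len(r) and r[i][0] == value:
                for ts in s[group_start:j]:
                    result.append(r[i] + ts)
                i += 1
    return result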

Hash join:

1.Like the merge join algorithm, the hash join algorithm can be used to implement

natural joins and equi-joins.

2. In the hash join algorithm, a hash function h is used to partition tuples of both relations.

3. The basic idea is to partition the tuples of each of the relations into sets that have the same hash

value on the join attributes.

4. We assume that

• h is a hash function mapping JoinAttrs values to {0, 1, . . . , nh}, where JoinAttrs

denotes the common attributes of r and s used in the natural join.

• Hr0, Hr1, . . ., Hrnh denote partitions of r tuples, each initially empty.
Each tuple tr ∈ r is put in partition Hri, where i = h(tr[JoinAttrs]).
• Hs0, Hs1, . . ., Hsnh denote partitions of s tuples, each initially empty.
Each tuple ts ∈ s is put in partition Hsi, where i = h(ts[JoinAttrs]).
5. The hash function h should have the "goodness" properties of randomness and uniformity.

6. The idea behind the hash join algorithm is this: Suppose that an r tuple and an

s tuple satisfy the join condition; then, they will have the same value for the join

attributes.

7. If that value is hashed to some value i, the r tuple has to be in Hri and the

s tuple in Hsi .

8. Therefore, r tuples in Hri need only to be compared with s tuples in

Hsi ; they do not need to be compared with s tuples in any other partition.
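The partitioning idea can be sketched in Python as follows (illustrative only; Python's built-in hash stands in for h, and the join attribute is assumed to be at position 0):

# Hash join sketch: partition both relations on the join attribute, then
# join only the matching partitions Hri and Hsi.
def hash_join(r, s, n_h=4, key=0):
    Hr = [[] for _ in range(n_h)]
    Hs = [[] for _ in range(n_h)]
    for tr in r:                        # partition r tuples
        Hr[hash(tr[key]) % n_h].append(tr)
    for ts in s:                        # partition s tuples
        Hs[hash(ts[key]) % n_h].append(ts)
    result = []
    for i in range(n_h):                # tuples can only match within the same partition
        for tr in Hr[i]:
            for ts in Hs[i]:
                if tr[key] == ts[key]:
                    result.append(tr + ts)
    return result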

Q.4.a) What are bitmap indices? What is its use? Explain with example. 5 M

Ans: Bitmap index

A bitmap index is a special kind of index that stores the bulk of its data as bit arrays (bitmaps).

It answers most of the queries by performing bitwise logical operations on these bitmaps.


The most commonly used indexes, such as B+trees, are most efficient if the values they index do

not repeat or repeat a smaller number of times.

In contrast, the bitmap index is designed for cases where the values of a variable repeat very

frequently.

For example, the gender field in a customer database usually contains at most three distinct values:

male, female or other.

For such variables, the bitmap index can have a significant performance advantage over the

commonly used trees.

Example: A bitmap index may be logically viewed as follows:

Identifier   HasInternet   Bitmap Y   Bitmap N
1            Yes           1          0
2            No            0          1
3            No            0          1
4            unspecified   0          0
5            yes           1          0

In this table, Identifier refers to the unique number assigned to each record, HasInternet is the data to be indexed, and the content of the bitmap index is shown as two columns under the heading Bitmaps.
Each of these two columns is a bitmap in the bitmap index.

In this case, there are two such bitmaps, one for "has internet" Yes and one for "has internet" No.

It is easy to see that each bit in bitmap Y shows whether a particular row refers to a person who has

internet access. This is the simplest form of bitmap index.
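The example above can be reproduced with a small Python sketch (illustrative only), representing each bitmap as a list of bits and answering queries with bitwise operations:

# Bitmap index sketch for the HasInternet column shown above.
rows = ["Yes", "No", "No", "unspecified", "yes"]

# one bitmap per distinct value (case-normalised)
bitmap_y = [1 if v.lower() == "yes" else 0 for v in rows]
bitmap_n = [1 if v.lower() == "no" else 0 for v in rows]

# "HasInternet = Yes" -> positions of set bits in bitmap Y (identifiers start at 1)
print([i + 1 for i, bit in enumerate(bitmap_y) if bit])      # [1, 5]

# "Yes OR No" is answered by a bitwise OR of the two bitmaps
print([y | n for y, n in zip(bitmap_y, bitmap_n)])           # [1, 1, 1, 0, 1]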

Q5.(a) What are strong entities and weak entities? Draw an ER diagram illustrating the use of strong

entity, weak entity, composite attribute, multivalued attribute and derived attributes.

Ans. The entity set which does not have sufficient attributes to form a primary key is called as Weak

entity set.

An entity set that has a primary key is called as Strong entity set.

Consider an entity set Payment which has three attributes: payment_number, payment_date and

payment_amount.

Although each payment entity is distinct, payments for different loans may share the same payment number. Thus, this entity set does not have a primary key; it is a weak entity set.
Each weak entity set must be part of a one-to-many relationship set.

A member of a strong entity set is called dominant entity and member of weak entity set is called as

subordinate entity.

A weak entity set does not have a primary key but we need a means of distinguishing among all those

entries in the entity set that depend on one particular strong entity set.

The discriminator of a weak entity set is a set of attributes that allows this distinction be made.

Example, payment_number acts as discriminator for payment entity set. It is also called as the Partial

key of the entity set.

The primary key of a weak entity set is formed by the primary key of the strong entity set on which the

weak entity set is existence dependent plus the weak entity sets discriminator.


In the above example {loan_number, payment_number} acts as primary key for payment entity set.

The relationship between weak entity and strong entity set is called as Identifying Relationship.

In example, loan-payment is the identifying relationship for payment entity.

A weak entity set is represented by a doubly outlined box, and the corresponding identifying relationship by a doubly outlined diamond, as shown in the figure.
Here, the double lines indicate total participation of the weak entity set in the identifying relationship; this means that every payment must be related via loan-payment to some loan.

The arrow from loan-payment to loan indicates that each payment is for a single loan.

The discriminator of a weak entity set is underlined with dashed lines rather than solid line.

Q5.(a) Explain E. F. Codd’s relational database rules.

Ans: Codd's twelve rules are a set of thirteen rules (numbered zero to twelve) proposed by Edgar F.

Codd, a pioneer of the relational model for databases, designed to define what is required from a database

management system in order for it to be considered relational, i.e., a relational database management

system (RDBMS). They are sometimes jokingly referred to as "Codd's Twelve Commandments".

Rule 0: The foundation rule:

For any system that is advertised as, or claimed to be, a relational data base management system, that

system must be able to manage data bases entirely through its relational capabilities.

Rule 1: The information rule:

All information in a relational data base is represented explicitly at the logical level and in exactly one

way – by values in tables.

Rule 2: The guaranteed access rule:

Each and every datum (atomic value) in a relational data base is guaranteed to be logically accessible by

resorting to a combination of table name, primary key value and column name.

Rule 3: Systematic treatment of null values:

Null values (distinct from the empty character string or a string of blank characters and distinct from zero or

any other number) are supported in fully relational DBMS for representing missing information and

inapplicable information in a systematic way, independent of data type.

Rule 4: Dynamic online catalog based on the relational model:


The data base description is represented at the logical level in the same way as ordinary data, so that

authorized users can apply the same relational language to its interrogation as they apply to the regular data.

Rule 5: The comprehensive data sublanguage rule:

A relational system may support several languages and various modes of terminal use (for example, the fill-

in-the-blanks mode). However, there must be at least one language whose statements are expressible, per

some well-defined syntax, as character strings and that is comprehensive in supporting all of the following

items:

1. Data definition.

2. View definition.

3. Data manipulation (interactive and by program).

4. Integrity constraints.

5. Authorization.

6. Transaction boundaries (begin, commit and rollback).

Rule 6: The view updating rule:

All views that are theoretically updatable are also updatable by the system.

Rule 7: Possible for high-level insert, update, and delete:

The capability of handling a base relation or a derived relation as a single operand applies not only to the

retrieval of data but also to the insertion, update and deletion of data.

Rule 8: Physical data independence:

Application programs and terminal activities remain logically unimpaired whenever any changes are made

in either storage representations or access methods.

Rule 9: Logical data independence:

Application programs and terminal activities remain logically unimpaired when information-preserving

changes of any kind that theoretically permit unimpairment are made to the base tables.

Rule 10: Integrity independence:

Integrity constraints specific to a particular relational data base must be definable in the relational data

sublanguage and storable in the catalog, not in the application programs.

Rule 11: Distribution independence:

The end-user must not be able to see that the data is distributed over various locations. Users should always

get the impression that the data is located at one site only.

Rule 12: The nonsubversion rule:

If a relational system has a low-level (single-record-at-a-time) language, that low level cannot be used to

subvert or bypass the integrity rules and constraints expressed in the higher level relational language

(multiple-records-at-a-time).


Q.6.(b) What is data dictionary? Explain its use with example. Ans:

Q.6.(c) Explain BCNF with example.

Ans: Boyce–Codd Normal Form

Using functional dependencies, we can define several normal forms that represent "good" database designs.

One of the more desirable normal forms that we can obtain is Boyce–Codd normal form (BCNF). A

relation schema R is in BCNF with respect to a set F of functional dependencies if, for all functional

dependencies in F+ of the form α → β, where α ⊆ R and β ⊆ R, at least one of the following holds:

• α → β is a trivial functional dependency (that is, β ⊆ α).

• α is a superkey for schema R.

A database design is in BCNF if each member of the set of relation schemas that constitutes the design is

in BCNF.

As an illustration, consider the following relation schemas and their respective functional dependencies:

• Customer-schema = (customer-name, customer-street, customer-city)

customer-name → customer-street customer-city

• Branch-schema = (branch-name, assets, branch-city)

branch-name → assets branch-city

• Loan-info-schema = (branch-name, customer-name, loan-number, amount)

loan-number → amount branch-name

We claim that Customer-schema is in BCNF. We note that a candidate key for the

schema is customer-name. The only nontrivial functional dependencies that hold on

Customer-schema have customer-name on the left side of the arrow. Since customer-name

is a candidate key, functional dependencies with customer-name on the left side do

not violate the definition of BCNF. Similarly, it can be shown easily that the relation

schema Branch-schema is in BCNF.

The schema Loan-info-schema, however, is not in BCNF. First, note that loan-number is not a superkey for Loan-info-schema, since we could have a pair of tuples representing a single loan made to two people, for example:

(Downtown, John Bell, L-44, 1000)

(Downtown, Jane Bell, L-44, 1000)
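Because loan-number is not a superkey yet loan-number → amount branch-name holds, Loan-info-schema violates BCNF. The following Python sketch (an illustration added here, not part of the original answer) applies the BCNF test described above by computing attribute closures; the relation and functional dependency are the ones from the example:

# BCNF check sketch: alpha -> beta violates BCNF unless beta is a subset of
# alpha or alpha is a superkey (its closure under F contains all of R).
def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def bcnf_violations(R, fds):
    return [(lhs, rhs) for lhs, rhs in fds
            if not rhs <= lhs and closure(lhs, fds) != set(R)]

R = {"branch-name", "customer-name", "loan-number", "amount"}
fds = [(frozenset({"loan-number"}), frozenset({"amount", "branch-name"}))]
print(bcnf_violations(R, fds))
# loan-number -> {amount, branch-name} is reported as a violation,
# because the closure of loan-number does not include customer-name.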

Q.7.(a) What are different equivalence rules present in transformation of relational expression?

Ans: Given a query, there are generally a variety of methods for computing the answer.

For example, we have seen that, in SQL, a query could be expressed in several different ways.

Each SQL query can itself be translated into a relational-algebra expression in one of several ways.

Furthermore, the relational-algebra representation of a query specifies only partially how to

evaluate a query.

There are usually several ways to evaluate relational-algebra expressions.

As an illustration, consider the query

select balance

from account

where balance < 2500

This query can be translated into either of the following relational-algebra expressions:

• σ balance<2500 (Π balance (account))

• Π balance (σ balance<2500 (account))

Further, we can execute each relational-algebra operation by one of several different algorithms.

For example, to implement the preceding selection, we can search every tuple in account to find

tuples with balance less than 2500.


If a B+-tree index is available on the attribute balance, we can use the index instead to locate the

tuples.

To specify fully how to evaluate a query, we need not only to provide the relational algebra

expression, but also to annotate it with instructions specifying how to evaluate each operation.

Annotations may state the algorithm to be used for a specific operation, or the particular index or

indices to use.

A relational-algebra operation annotated with instructions on how to evaluate it is called an

evaluation primitive.

A sequence of primitive operations that can be used to evaluate a query is a query execution plan

or query-evaluation plan.

Figure illustrates an evaluation plan for our example query, in which a particular index (denoted in the figure as "index 1") is specified for the selection operation.

The query-execution engine takes a query-evaluation plan, executes that plan, and returns the

answers to the query.

The different evaluation plans for a given query can have different costs.

It is the responsibility of the system to construct a query-evaluation plan that minimizes the cost of

query evaluation.

Once the query plan is chosen, the query is evaluated with that plan, and the result of the query is output.

Q.7.(b)What are materialized views? Explain in details.

Ans: It is easiest to understand intuitively how to evaluate an expression by looking at a

pictorial representation of the expression in an operator tree.

• Consider the expression

Πcustomer-name (σbalance<2500 (account) ⋈ customer)

in Figure below:

• If we apply the materialization approach, we start from the lowest-level operations in the

expression (at the bottom of the tree).

• In our example, there is only one such operation: the selection operation on account.

• The inputs to the lowest-level operations are relations in the database.


• We execute these operations using the algorithms described earlier, and we store the results in temporary relations.

• We can use these temporary relations to execute the operations at the next level up in the tree,

where the inputs now are either temporary relations or relations stored in the database.

• In our example, the inputs to the join are the customer relation and the temporary relation created

by the selection on account.

• The join can now be evaluated, creating another temporary relation.

• By repeating the process, we will eventually evaluate the operation at the root of the tree, giving

the final result of the expression.

• In our example, we get the final result by executing the projection operation at the root of the tree,

using as input the temporary relation created by the join.

• Evaluation as just described is called materialized evaluation, since the results of

each intermediate operation are created (materialized) and then are used for evaluation of the next-

level operations.

• The cost of a materialized evaluation is not simply the sum of the costs of the operations involved.

• When we computed the cost estimates of algorithms, we ignored the cost of writing the result of

the operation to disk.

• To compute the cost of evaluating an expression as done here, we have to add the costs of all the

operations, as well as the cost of writing the intermediate results to disk.

• We assume that the records of the result accumulate in a buffer, and, when the buffer is full, they

are written to disk.

• The cost of writing out the result can be estimated as nr/fr, where nr is the estimated number of

tuples in the result relation r, and fr is the blocking factor of the result relation, that is, the number

of records of r that will fit in a block.

• Double buffering (using two buffers, with one continuing execution of the algorithm while the

other is being written out) allows the algorithm to execute more quickly by performing CPU

activity in parallel with I/O activity.

Q.8.(a). Define query optimization. What are the various measures to evaluate the cost of query?

Ans. 7M

Query optimization is the process of selecting the most efficient query-evaluation plan from

among the many strategies usually possible for processing a given query, especially if the query is

complex.

The system constructs a query-evaluation plan that minimizes the cost of query evaluation.

This is where query optimization comes into play.

One aspect of optimization occurs at the relational-algebra level, where the system attempts to find

an expression that is equivalent to the given expression, but more efficient to execute.

Another aspect is selecting a detailed strategy for processing the query, such as choosing the

algorithm to use for executing an operation, choosing the specific indices to use, and so on.

For query optimization, we find the "cheapest" execution plan for a query.
Consider a relational algebra expression that may have many equivalent expressions, as given below.
Representation as a logical query plan (a tree):


Where, Non-leaf nodes = operations of relational algebra (with

parameters); Leaf nodes = relations

A relational algebra expression can be evaluated in many ways.

An annotated expression specifying detailed evaluation strategy is called the execution plan

(includes, e.g., whether index is used, join algorithms, . . . )

Among all semantically equivalent expressions, the one with the least costly evaluation plan is

chosen.

Cost estimate of a plan is based on statistical information in the system catalogs as given below:

Query optimizers use the statistic information stored in DBMS catalog to estimate the cost of a

plan.

The relevant catalog information about the relation includes:

1. Number of tuples in a relation r; denote by nr

2. Number of blocks containing tuple of relation r: br

3. Size of the tuple in a relation r ( assume records in a file are all of same types): sr

4. Blocking factor of relation r which is the number of tuples that fit into one block: fr

5. V(A,r) is the number of distinct values of an attribute A in a relation r. This value is the same as the size of πA(r). If A is a key attribute, then V(A,r) = nr.

6. SC(A,r) is the selection cardinality of attribute A of relation r. This is the average number of

records that satisfy an equality condition on attribute A.

7. In addition to relation information, some information about indices is also used:

Number of levels in index i.

Number of lowest –level index blocks in index i ( number of blocks in leaf level of the index)

With the statistical information maintained in DBMS catalog and the measures of query cost based

on number of disk accesses, we can estimate the cost for different relational algebra operations

The cost of a query execution plan includes the following components:

Access cost to secondary storage: This is the cost of searching for, reading, writing data blocks of

secondary storage such as disk.

Computation cost: This is the cost of performing in-memory operation on the data buffer during

execution. This can be considered as CPU time to execute a query

Storage cost: This is the cost of storing intermediate files that are generated during execution.
Communication cost: This is the cost of transferring the query and its result from site to site (in a distributed or parallel database system).

Memory usage cost: Number of buffers needed during execution.

In a large database, access cost is usually the most important cost since disk accesses are slow

compared to in-memory operations.

In a small database, when almost all data reside in memory, the emphasis is on computation cost.

In the distributed system, communication cost should be minimized.

It is difficult to include all the cost components in a cost function. Therefore, some cost functions

consider only disk access cost as the reasonable measure of the cost of a query-evaluation plan.
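As a small illustration of how these statistics are used, the following Python sketch (with made-up example values, not real catalog data) derives the blocking factor, the number of blocks, and the selection cardinality under the usual uniform-distribution assumption:

import math

# Illustrative catalog statistics for a relation r (made-up values).
n_r = 10000        # number of tuples in r
s_r = 100          # size of one tuple of r, in bytes
block_size = 4096  # size of one disk block, in bytes
V_A_r = 50         # number of distinct values of attribute A in r

f_r = block_size // s_r      # blocking factor: tuples of r per block -> 40
b_r = math.ceil(n_r / f_r)   # blocks containing records of r        -> 250
SC_A_r = n_r / V_A_r         # selection cardinality of A            -> 200.0

# A full file scan for an equality selection on A therefore costs about
# b_r = 250 block accesses and is expected to return about 200 tuples.
print(f_r, b_r, SC_A_r)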

Q.8.(b) How an expression can be evaluated with help of materialization and pipeline approach. Explain

in detail.

Ans. It is easiest to understand intuitively how to evaluate an expression by looking at a

pictorial representation of the expression in an operator tree.

• Consider the expression

Πcustomer-name (σbalance<2500 (account) ⋈ customer)


in Figure below:

• If we apply the materialization approach, we start from the lowest-level operations in the

expression (at the bottom of the tree).

• In our example, there is only one such operation: the selection operation on account.

• The inputs to the lowest-level operations are relations in the database.

• We execute these operations using the algorithms described earlier, and we store the results in temporary relations.

• We can use these temporary relations to execute the operations at the next level up in the tree,

where the inputs now are either temporary relations or relations stored in the database.

• In our example, the inputs to the join are the customer relation and the temporary relation created

by the selection on account.

• The join can now be evaluated, creating another temporary relation.

• By repeating the process, we will eventually evaluate the operation at the root of the tree, giving

the final result of the expression.

• In our example, we get the final result by executing the projection operation at the root of the tree,

using as input the temporary relation created by the join.

• Evaluation as just described is called materialized evaluation, since the results of

each intermediate operation are created (materialized) and then are used for evaluation of the next-

level operations.

• The cost of a materialized evaluation is not simply the sum of the costs of the operations involved.

• When we computed the cost estimates of algorithms, we ignored the cost of writing the result of

the operation to disk.

• To compute the cost of evaluating an expression as done here, we have to add the costs of all the

operations, as well as the cost of writing the intermediate results to disk.

• We assume that the records of the result accumulate in a buffer, and, when the buffer is full, they

are written to disk.

• The cost of writing out the result can be estimated as nr/fr, where nr is the estimated number of

tuples in the result relation r, and fr is the blocking factor of the result relation, that is, the number

of records of r that will fit in a block.

• Double buffering (using two buffers, with one continuing execution of the algorithm while the

other is being written out) allows the algorithm to execute more quickly by performing CPU

activity in parallel with I/O activity.

Q.9.(a) Explain transaction with neat sketch diagram. Explain ACID properties in brief.

Ans: Transaction State Diagram:


Transaction must be in one of these states:

1. Active:
• It is the initial state of a transaction.
• Execution of a transaction starts in the active state.
• A transaction remains in the active state as long as its execution is in progress.

2. Partially Committed:

• When the last operation of a transaction has executed, it goes to the partially committed state.
• Here there is a possibility that the transaction may be aborted, or else it goes to the committed state.

3. Failed:
• A transaction goes to the failed state if it is determined that it can no longer proceed with its normal execution.

4. Aborted:

• Failed transaction when rolled back is in an aborted state.

• In this stage system has two options:

1) Restart the transaction: A restarted transaction is considered to be new transaction which may recover

from possible failure.

2) Kill the transaction: A transaction can be killed to recover from failure.

5. Committed:


• A transaction that has completed successfully comes to this state.
• A transaction is said to have terminated once it is either committed or aborted.

Properties of Transaction:

A database guarantees the following four properties to ensure database reliability, as follows:

• Atomicity:

A database follows the all-or-nothing rule, i.e., the database considers all transaction operations as one whole unit or atom. Thus, when a database processes a transaction, it is either fully completed or not executed at all. Suppose A is transferring Rs 100 to B's account. Computers are electronic devices and are prone to failure. Assume A initially has Rs 300 and B has Rs 500. It may happen that after A has initiated the transfer, in the midst of transferring from A to B, the system fails. The balance has been deducted from A's account but has not been added to B's account. Hence we need the transaction either to execute fully or to revert back to the initial state.

• Consistency:

Ensures that only valid data following all rules and constraints is written in the database. When a

transaction results in invalid data, the database reverts to its previous state, which abides by all customary

rules and constraints. This must be totally ensured by the programmer. Referring to above example, this

basically means sum of balances of both A's and B's account are same before and after transaction.

• Isolation:

Ensures that transactions are securely and independently processed at the same time without interference,

but it does not ensure the order of transactions. For example, user A withdraws Rs 100 and user B

withdraws Rs 250 from user Z’s account, which has a balance of Rs 1000. Since both A and B draw from

Z’s account, one of the users is required to wait until the other user transaction is completed, avoiding

inconsistent data. If B is required to wait, then B must wait until A’s transaction is completed, and Z’s

account balance changes to Rs 900. Now, B can withdraw Rs 250 from this Rs 900 balance.

• Durability:

Durability ensures that once a transaction has committed, its changes persist even if the system subsequently fails. It may happen that even though a transaction completes successfully, the system fails before its changes are recorded. In the above example, user B may withdraw Rs 100 only after user A's transaction is completed and is updated in the database. If the system fails before A's transaction is logged in the database, the effects of A's transaction are lost and Z's account returns to its previous consistent state.
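Atomicity can be seen in practice with any transactional engine; the following illustrative sketch uses Python's built-in sqlite3 module, with a made-up account table, to show the all-or-nothing transfer described above:

import sqlite3

# Atomicity sketch: transfer Rs 100 from A to B inside one transaction.
# Either both updates are applied (commit) or neither is (rollback).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 300), ("B", 500)])
conn.commit()

try:
    conn.execute("UPDATE account SET balance = balance - 100 WHERE name = 'A'")
    conn.execute("UPDATE account SET balance = balance + 100 WHERE name = 'B'")
    conn.commit()          # both updates become durable together
except Exception:
    conn.rollback()        # on any failure, the database reverts to the old state

print(dict(conn.execute("SELECT name, balance FROM account")))
# {'A': 200, 'B': 600}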

Q.9.(b) What is serializability? Discuss various types of serializability.

Ans: Serializability:

The database system must control concurrent execution of transactions, to ensure that the database

state remains consistent.

In the fields of databases and transaction processing (transaction management),

a schedule describes execution of transactions running in the system.

Often it is a list of operations (actions) ordered by time, performed by a set of transactions that are

executed together in the system.

If order in time between certain operations is not determined by the system, then a partial order is

used.


Examples of such operations are requesting a read operation, reading, writing, aborting,

committing, requesting lock, locking, etc.

Not all transaction operation types should be included in a schedule.

Types of Schedules:

1. Serial Schedule:

The transactions are executed non-interleaved (i.e., a serial schedule is one in which no

transaction starts until a running transaction has ended).

2. Serializable Schedule:

A schedule that is equivalent (in its outcome) to a serial schedule has

the serializability property.

Example: In schedule E shown below, the order in which the actions of the transactions are executed is not the same as in D, but in the end, E gives the same result as D.

3. Conflict-serializable schedules

A schedule is said to be conflict-serializable when the schedule is conflict-equivalent to one or

more serial schedules.

Another definition for conflict-serializability is that a schedule is conflict-serializable if and only if

its precedence graph/serializability graph, when only committed transactions are considered, is

acyclic.

Consider a schedule S in which there are two consecutive instructions Ii and Ij, of transactions Ti and Tj, respectively (i ≠ j).

If Ii and Ij refer to different data items, then we can swap Ii and Ij without affecting the results of

any instruction in the schedule.

However, if Ii and Ij refer to the same data item Q, then the order of the two steps may matter.

Since we are dealing with only read and write instructions, there are four cases that we need to

consider:

1. Ii = read(Q), Ij = read(Q). The order of Ii and Ij does not matter, since the

same value of Q is read by Ti and Tj , regardless of the order.

2. Ii = read(Q), Ij = write(Q). If Ii comes before Ij, then Ti does not read the value

of Q that is written by Tj in instruction Ij. If Ij comes before Ii, then Ti reads

the value of Q that is written by Tj. Thus, the order of Ii and Ij matters.

3. Ii = write(Q), Ij = read(Q). The order of Ii and Ij matters for reasons similar

to those of the previous case.

4. Ii = write(Q), Ij = write(Q). Since both instructions are write operations, the

order of these instructions does not affect either Ti or Tj . However, the value

obtained by the next read(Q) instruction of S is affected, since the result of

only the latter of the two write instructions is preserved in the database.

Thus, only in the case where both Ii and Ij are read instructions does the relative order of their

execution not matter.

We say that Ii and Ij conflict if they are operations by different transactions on the same data item,

and at least one of these instructions is a write operation.


To illustrate the concept of conflicting instructions, we consider schedule 1 given below.

Schedule 1

The write(A) instruction of T1 conflicts with the read(A) instruction of T2.

However, the write(A) instruction of T2 does not conflict with the read(B) instruction of T1,

because the two instructions access different data items.

Let Ii and Ij be consecutive instructions of a schedule S.

If Ii and Ij are instructions of different transactions and Ii and Ij do not conflict, then we can swap

the order of Ii and Ij to produce a new schedule S’.

We expect S to be equivalent to S’, since all instructions appear in the same order in both schedules

except for Ii and Ij, whose order does not matter.

Since the write(A) instruction of T2 in schedule 1 does not conflict with the read(B) instruction of

T1, we can swap these instructions to generate an equivalent schedule, schedule 2 shown below.

Schedule 2

Regardless of the initial system state, schedules 1 and 2 both produce the same final system state.

We continue to swap nonconflicting instructions:

• Swap the read(B) instruction of T1 with the read(A) instruction of T2.

• Swap the write(B) instruction of T1 with the write(A) instruction of T2.

• Swap the write(B) instruction of T1 with the read(A) instruction of T2.

The final result of these swaps, schedule 3 of Figure shown below, is a serial schedule.

Thus, we have shown that schedule 1 is equivalent to a serial schedule.

This equivalence implies that, regardless of the initial system state, schedule 1 will produce

the same final state as will some serial schedule.

Schedule 3


If a schedule S can be transformed into a schedule S’ by a series of swaps of nonconflicting

instructions, we say that S and S’ are conflict equivalent.

The concept of conflict equivalence leads to the concept of conflict serializability.

We say that a schedule S is conflict serializable if it is conflict equivalent to a serial

schedule.

4. View Serializable Schedule:

View equivalence that is less stringent than conflict equivalence, but that, like conflict

equivalence, is based on only the read and write operations of transactions.

Schedule 1

Consider two schedules S and S’, where the same set of transactions participates in both schedules.

The schedules S and S’ are said to be view equivalent if three conditions are met:

1. For each data item Q, if transaction Ti reads the initial value of Q in schedule S, then

transaction Ti must, in schedule S’, also read the initial value of Q.

2. For each data item Q, if transaction Ti executes read(Q) in schedule S, and if that

value was produced by a write(Q) operation executed by transaction Tj , then the read(Q)

operation of transaction Ti must, in schedule S’, also read the value of Q that

was produced by the same write(Q) operation of transaction Tj .

3. For each data item Q, the transaction (if any) that performs the final write(Q)

operation in schedule S must perform the final write(Q) operation in schedule S’.

Conditions 1 and 2 ensure that each transaction reads the same values in both schedules and,

therefore, performs the same computation.

Condition 3, coupled with conditions 1 and 2, ensures that both schedules result in the same final

system state.

Consider the following Schedule 1:

Schedule 1


Schedule 2

The schedule 1 is not view equivalent to schedule 2, since, in schedule 1, the value of account A

read by transaction T2 was produced by T1, whereas this case does not hold in schedule 2.

The concept of view equivalence leads to the concept of view serializability.

We say that a schedule S is view serializable if it is view equivalent to a serial schedule.

Every conflict-serializable schedule is also view serializable, but there are view serializable

schedules that are not conflict serializable.
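The conflict-serializability test described above (build the precedence graph, then check it for cycles) can be sketched in Python as follows; the schedule format, a list of (transaction, action, data item) triples, is an assumption made for this illustration:

# Precedence-graph sketch: an edge Ti -> Tj is added when an operation of Ti
# conflicts with a later operation of Tj (same item, at least one write).
def precedence_graph(schedule):
    edges = set()
    for k, (ti, op_i, x) in enumerate(schedule):
        for tj, op_j, y in schedule[k + 1:]:
            if ti != tj and x == y and "W" in (op_i, op_j):
                edges.add((ti, tj))
    return edges

def has_cycle(edges):
    # simple DFS-based cycle detection over the transaction nodes
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
    visited, on_stack = set(), set()
    def dfs(node):
        visited.add(node); on_stack.add(node)
        for nxt in graph.get(node, ()):
            if nxt in on_stack or (nxt not in visited and dfs(nxt)):
                return True
        on_stack.discard(node)
        return False
    return any(dfs(n) for n in graph if n not in visited)

# a schedule in the spirit of schedule 1: T1 and T2 read and write A, then B
s = [("T1", "R", "A"), ("T1", "W", "A"), ("T2", "R", "A"), ("T2", "W", "A"),
     ("T1", "R", "B"), ("T1", "W", "B"), ("T2", "R", "B"), ("T2", "W", "B")]
edges = precedence_graph(s)
print(edges, has_cycle(edges))
# {('T1', 'T2')} False  -> conflict serializable (equivalent to T1 followed by T2)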

Q.10.(a) What are deadlocks? How is deadlock detection and prevention achieved in DBMS.

Ans: Deadlock is a state of a database system having two or more transactions, when each transaction is

waiting for a data item that is being locked by some other transaction. A deadlock can be indicated by a

cycle in the wait-for-graph. This is a directed graph in which the vertices denote transactions and the

edges denote waits for data items.

For example, in the following wait-for-graph, transaction T1 is waiting for data item X which is locked by

T3. T3 is waiting for Y which is locked by T2 and T2 is waiting for Z which is locked by T1. Hence, a

waiting cycle is formed, and none of the transactions can proceed executing.

Deadlock Handling in Centralized Systems

There are three classical approaches for deadlock handling, namely:
1. Deadlock prevention
2. Deadlock avoidance
3. Deadlock detection and removal

All of the three approaches can be incorporated in both a centralized and a distributed database system.


Deadlock Prevention

The deadlock prevention approach does not allow any transaction to acquire locks that will lead to

deadlocks. The convention is that when more than one transaction requests a lock on the same data item, only one of them is granted the lock.

One of the most popular deadlock prevention methods is pre-acquisition of all the locks. In this method, a

transaction acquires all the locks before starting to execute and retains the locks for the entire duration of

transaction. If another transaction needs any of the already acquired locks, it has to wait until all the locks

it needs are available. Using this approach, the system is prevented from being deadlocked since none of

the waiting transactions are holding any lock.

Deadlock Avoidance

The deadlock avoidance approach handles deadlocks before they occur. It analyzes the transactions and

the locks to determine whether or not waiting leads to a deadlock.

The method can be briefly stated as follows. Transactions start executing and request data items that they

need to lock. The lock manager checks whether the lock is available. If it is available, the lock manager

allocates the data item and the transaction acquires the lock. However, if the item is locked by some other

transaction in incompatible mode, the lock manager runs an algorithm to test whether keeping the

transaction in waiting state will cause a deadlock or not. Accordingly, the algorithm decides whether the

transaction can wait or one of the transactions should be aborted.

There are two algorithms for this purpose, namely wait-die and wound-wait. Let us assume that there are

two transactions, T1 and T2, where T1 tries to lock a data item which is already locked by T2. The

algorithms are as follows −

Wait-Die − If T1 is older than T2, T1 is allowed to wait. Otherwise, if T1 is younger than T2, T1

is aborted and later restarted.

Wound-Wait − If T1 is older than T2, T2 is aborted and later restarted. Otherwise, if T1 is

younger than T2, T1 is allowed to wait.
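The two policies can be written as a small Python sketch (illustrative only); here a transaction's age is represented by a start timestamp, so an older transaction has a smaller timestamp:

# Deadlock-prevention sketch: decide what happens when transaction T1 requests
# a lock held by T2. A smaller timestamp means an older transaction.
def wait_die(ts1, ts2):
    return "T1 waits" if ts1 < ts2 else "T1 aborts (dies) and restarts later"

def wound_wait(ts1, ts2):
    return "T2 aborts (is wounded) and restarts later" if ts1 < ts2 else "T1 waits"

print(wait_die(5, 9))    # T1 older   -> T1 waits
print(wound_wait(9, 5))  # T1 younger -> T1 waits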

Deadlock Detection and Removal

The deadlock detection and removal approach runs a deadlock detection algorithm periodically and

removes deadlock in case there is one. It does not check for deadlock when a transaction places a request

for a lock. When a transaction requests a lock, the lock manager checks whether it is available. If it is

available, the transaction is allowed to lock the data item; otherwise the transaction is allowed to wait.

Since there are no precautions while granting lock requests, some of the transactions may be deadlocked.

To detect deadlocks, the lock manager periodically checks if the wait-for graph has cycles. If the system is deadlocked, the lock manager chooses a victim transaction from each cycle.

Q.10.(b) Explain different concurrency problems and give solutions for it.

Ans: Concurrency

The ability of a database system to handle a number of transactions simultaneously, by interleaving or overlapping parts of their actions, is called concurrency of the system.


Advantages of concurrency

The goal is to serve many users and provide better throughput by sharing resources.
Reduced waiting time, response time, or turnaround time.
Increased throughput and resource utilization.
If we run only one transaction at a time, the ACID properties are sufficient, but when multiple transactions are executed concurrently the database may become inconsistent.
Overlapping input-output activity with CPU activity also improves response time.
But interleaving of instructions between transactions may also lead to many problems, due to which concurrency control is required.

Problems due to concurrency

There are many problems which may occur due to concurrency:

1) Dirty read problem

If a transaction reads an uncommitted, temporary value written by some other transaction, it is called the dirty read problem. Here, one transaction reads a data item updated by another uncommitted transaction that may later be aborted or fail. In such cases the read value disappears from the database upon abort; this is termed a dirty read, and the reading transaction ends up with incorrect results.

Example

T1            T2
R(A)
W(A)
              R(A)

The value of item A read by T2 is dirty data, because it was written by a transaction (T1) that has not been committed yet.

2) Lost update problem / write-write problem
This problem occurs when two transactions access the same data item and have their operations interleaved in a way that makes the value of some database items incorrect.
If there are two write operations by different transactions on the same data value, and there are no read operations in between, then the second write overwrites the first. Consider the schedule below.

Example

T1            T2
R(A)
W(A)
              W(A)

Here T2 performs a blind write, that is, a write without a read. The changes made by transaction T1 are lost, because they are overwritten by transaction T2.
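A small Python sketch (illustrative only) of one classic interleaving that loses an update: both transactions read the same old value of A before either writes back:

# Lost update sketch: T1 and T2 both want to add to A, but their reads and
# writes are interleaved, so T1's update is overwritten by T2's.
A = 100

a_read_by_t1 = A          # T1: R(A)
a_read_by_t2 = A          # T2: R(A)  (reads the same old value)
A = a_read_by_t1 + 50     # T1: W(A)  -> 150
A = a_read_by_t2 + 20     # T2: W(A)  -> 120, T1's +50 is lost

print(A)                  # 120 instead of the expected 170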

Q.11. (a) Explain various types of JOIN expressions with example.

Ans: Nested-loop join:


1. Figure below shows a simple algorithm to compute the theta join, r ⋈θ s, of two relations r and s.

2. This algorithm is called the nested-loop join algorithm, since it basically consists

of a pair of nested for loops.

3. Relation r is called the outer relation and relation s the inner relation of the join,

since the loop for r encloses the loop for s.

The algorithm uses the notation tr · ts, where tr and ts are tuples; tr · ts denotes the

tuple constructed by concatenating the attribute values of tuples tr and ts.

for each tuple tr in r do begin
    for each tuple ts in s do begin
        test pair (tr, ts) to see if they satisfy the join condition θ
        if they do, add tr · ts to the result.
    end
end

Block nested loop join:

1. If the buffer is too small to hold either relation entirely in memory, savings in block accesses can be achieved if we process the relations on a per-block basis, rather than on a per-tuple basis.

2. Figure below shows block nested-loop join, which is a variant of the nested-loop join where

every block of the inner relation is paired with every block of the outer relation.

3. Within each pair of blocks, every tuple in one block is paired with every tuple in

the other block, to generate all pairs of tuples.

4. As before, all pairs of tuples that satisfy the join condition are added to the result.

5. The primary difference in cost between the block nested-loop join and the basic

nested-loop join is that, in the worst case, each block in the inner relation s is read

only once for each block in the outer relation, instead of once for each tuple in the

outer relation.

6. Thus, in the worst case, there will be a total of br * bs + br block accesses, where

br and bs denote the number of blocks containing records of r and s respectively.

7. Clearly, it is more efficient to use the smaller relation as the outer relation, in case neither of the

relations fits in memory.

8. In the best case, there will be br + bs block accesses.

for each block Br of r do begin
    for each block Bs of s do begin
        for each tuple tr in Br do begin
            for each tuple ts in Bs do begin
                test pair (tr, ts) to see if they satisfy the join condition
                if they do, add tr · ts to the result.
            end
        end
    end
end

Cost:
1. Worst-case estimate: br * bs + br block accesses.
2. Improvements to the nested-loop and block nested-loop algorithms for a buffer with M blocks:
In block nested-loop join, use M - 2 disk blocks as the blocking unit for the outer relation, where M = memory size in blocks; use the remaining two blocks to buffer the inner relation and the output.
Cost = ⌈br / (M - 2)⌉ * bs + br block accesses.
If the equi-join attribute forms a key on the inner relation, stop the inner loop on the first match.
Scan the inner relation forward and backward alternately, to make use of the blocks remaining in the buffer.

Indexed-Nested loop join:

1. In a nested-loop join, if an index is available on the inner loop's join attribute, index lookups can replace file scans.

2. For each tuple tr in the outer relation r, the index is used to look up tuples in s that

will satisfy the join condition with tuple tr.

3. This join method is called an indexed nested-loop join; it can be used with existing indices, as

well as with temporary indices created for the sole purpose of evaluating the join.

4. Looking up tuples in s that will satisfy the join conditions with a given tuple tr is

essentially a selection on s.

5. The cost of an indexed nested-loop join can be computed as follows.

6. For each tuple in the outer relation r, a lookup is performed on the index for s, and the relevant

tuples are retrieved.

7. In the worst case, there is space in the buffer for only one page of r and one page of the index.

8. Then, br disk accesses are needed to read relation r, where br denotes the number of blocks

containing records of r.

9. For each tuple in r, we perform an index lookup on s.

10. Then, the cost of the join can be computed as br + nr ∗ c, where nr is the number of records in relation r, and c is the cost of a single selection on s using the join condition.


• For each tuple tR in the outer relation R, use the index to look up tuples in S that satisfy the join condition with tuple tR.
• Worst case: the buffer has space for only one page of R, and, for each tuple in R, we perform an index lookup on S.
• Cost of the join: br + nr ∗ c
1. where c is the cost of traversing the index and fetching all matching S tuples for one tuple of R;
2. c can be estimated as the cost of a single selection on S using the join condition.
If indices are available on the join attributes of both R and S, use the relation with fewer tuples as the outer relation.

Merge Join:

1. The merge join algorithm (also called the sort–merge join algorithm) can be used

to compute natural joins and equi-joins.

2. Let r(R) and s(S) be the relations whose natural join is to be computed, and let R∩S

denote their common attributes.

3. Suppose that both relations are sorted on the attributes R∩S.

4. Then, their join can be computed by a process much like the merge stage in the merge–sort

algorithm.

5. The merge join algorithm requires that the set Ss of all tuples with the same value for the join

attributes must fit in main memory.

1. Each block needs to be read only once (assuming all tuples for any given value of the join attributes fit in memory).
2. Thus the number of block accesses for a merge join is bR + bS.
3. But what if one or both of R, S are not sorted on the join attribute? It may be worth sorting first and then performing the merge join (sort-merge join).
Cost: bR + bS + (cost of sorting R) + (cost of sorting S)

Hash join:

1.Like the merge join algorithm, the hash join algorithm can be used to implement

natural joins and equi-joins.

2. In the hash join algorithm, a hash function h is used to partition tuples of both relations.

3. The basic idea is to partition the tuples of each of the relations into sets that have the same hash

value on the join attributes.

4. We assume that

• h is a hash function mapping JoinAttrs values to {0, 1, . . . , nh}, where JoinAttrs

denotes the common attributes of r and s used in the natural join.

• Hr0, Hr1, . . ., Hrnh denote partitions of r tuples, each initially empty.
Each tuple tr ∈ r is put in partition Hri, where i = h(tr[JoinAttrs]).
• Hs0, Hs1, . . ., Hsnh denote partitions of s tuples, each initially empty.
Each tuple ts ∈ s is put in partition Hsi, where i = h(ts[JoinAttrs]).

Q.11. (b) What is nested sub query? Explain with the help of example.

Ans: Subqueries in SQL

Subqueries provide a powerful means to combine data from two tables into a single result. You can also

call these nested queries. As the name implies, subqueries contain one or more queries, one inside the

other.

Subqueries are very versatile, and that can make them somewhat hard to understand. In most cases you can use them anywhere you can use an expression or table specification.

For example, you can use subqueries in the SELECT, FROM, WHERE, or HAVING clauses. Depending

on how they are used, a subquery may return a single value or multiple rows.

Subqueries make it possible for you to write queries that are more dynamic and data driven. For instance

using a subquery you can return all products whose ListPrice is greater than the average ListPrice for all

products.

You can do this by having it first calculate the average price and then use this to compare against each

product’s price.

Subquery Breakdown

Let's break down the following query so you can see how it works:

SELECT ProductID, Name, ListPrice
FROM Production.Product
WHERE ListPrice > (SELECT AVG(ListPrice) FROM Production.Product)

Step 1: First let’s run the subquery:

SELECT AVG(ListPrice)

FROM Production.Product

It returns 438.6662 as the average ListPrice

Step 2: Find products greater than the average price by plugging in the average ListPrice value into our

query’s comparison


SELECT ProductID,

Name,

ListPrice

FROM production.Product

WHERE ListPrice > 438.6662

As you can see, by using the subquery we combined the two steps together. The subquery eliminated the

need for us to find the average ListPrice and then plug it into our query.

This is huge! It means our query automatically adjusts itself to changing data and new averages.

Hopefully you’re seeing a glimpse into how subqueries can make your statements more flexible. In this

case, by using a subquery we don’t need to know the value for the average list price.

We let the subquery do the work for us! The average value is calculated on-the-fly; there is no need for us to "update" the average value within the query.

Being able to dynamically create the criterion for a query is very handy. Here we use a subquery to list all

customers whose territories have sales below $5,000,000.

Q.12.(b) Enlist and explain with example, various DDL commands.

Ans: Data Definition language(DDL) in DBMS with Examples: Data Definition Language can be

defined as a standard for commands through which data structures are defined. It is a computer language

that is used for creating and modifying the structure of database objects, such as schemas, tables, views,

indexes, etc. Additionally, it assists in storing the metadata details in the database.


Some of the common Data Definition Language commands are:

CREATE

ALTER

DROP

1. CREATE- Data Definition language(DDL)

The main use of the create command is to build a new table and it comes with a predefined syntax. It

creates a component in a relational database management system. There are many implementations that

extend the syntax of the command to create the additional elements, like user profiles and indexes.

For Example

CREATE TABLE PUPIL (PUPIL_ID CHAR(10), STUDENT_NAME CHAR(10));
A PUPIL table with an ID column and a name column is created by this DDL statement.

Generally, the data types used while creating a table consist of strings and dates. Every system varies in how the data type is specified.


2. ALTER- Data Definition language(DDL)

An existing database object can be modified by the ALTER statement. Using this command, users can add additional columns and drop existing columns. Additionally, the data type of columns involved in a database table can be changed by the ALTER command.

The general syntax of the ALTER command is mentioned below:

ALTER TABLE table_name ADD column_name (for adding a new column)

ALTER TABLE table_name RENAME To new_table_name (for renaming a table)

ALTER TABLE table_name MODIFY column_name data type (for modifying a column)

ALTER TABLE table_name DROP COLUMN column_name (for deleting a column)

Q.12.(c) Explain dynamic SQL and embedded SQL.

Ans: SQL queries can be of two types i.e. embedded or static SQL and dynamic SQL. So, in this blog, we

will be learning about these two types of SQL statements.

Embedded / Static SQL

Embedded or static SQL refers to SQL statements that are fixed and can't be changed at runtime in an application. These statements are compiled at compile time only. The benefit of using such statements is that you know the path of execution of the statements because you have the SQL statements with you, so you

can optimize your SQL query and can execute the query in the best and fastest possible way. The way of

accessing the data is predefined and these static SQL statements are generally used on those databases that

are uniformly distributed.

These statements are hardcoded in the application, so if you want to build some application in which you

need some dynamic or run-time SQL statements, then you should use the Dynamic SQL statement.

Dynamic SQL

Dynamic SQL statements are those SQL statements that are created or executed at the run-time. The users

can execute their own query in some application. These statements are compiled at the run-time. These

kinds of SQL statements are used where there is a non-uniformity in the data stored in the database. It is

more flexible as compared to the static SQL and can be used in some flexible applications.

Since the compilation is done at run-time, the system will know how to access the database at run-time

only. So, no proper planning for execution and optimization can be done previously. This will reduce the

performance of the system. Also, if you are taking the database query from the user at run-time, then there

are possibilities that the users might enter some wrong queries and this is very dangerous because here you

are dealing with lots and lots of data.
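As a small illustration (a sketch only, assuming an Oracle-style PL/SQL environment and the PUPIL table from Q.12(b); the value bound to the placeholder is hypothetical), a statement built and executed at run time might look like this:

DECLARE
    v_name PUPIL.STUDENT_NAME%TYPE;
BEGIN
    -- The SQL text is only a character string here; it is parsed and executed at run time.
    EXECUTE IMMEDIATE 'SELECT STUDENT_NAME FROM PUPIL WHERE PUPIL_ID = :id'
        INTO v_name
        USING '1001';
END;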


Tulsiramji Gaikwad-Patil College of Engineering & Technology, Nagpur

Department of Information Technology

Model Solution (Summer-17) Academic Session: 2018 - 2019

Subject: Database Management System Semester: VI

Q.1. (a). Explain Query processing? Explain various steps in query processing with the help of neat

sketch. 6M

Ans.

Query processing refers to the range of activities involved in extracting data from a database.

The activities include translation of queries in high-level database languages into expressions that can be

used at the physical level of the file system, a variety of query-optimizing transformations, and actual

evaluation of queries.

A given SQL query is translated by the query processor into a low level program called an execution

plan.

An execution plan is a program in a functional language which is called the physical relational algebra,

specialized for internal storage representation in the DBMS.

The physical relational algebra extends the relational algebra with Primitives to search through the

internal storage structures of the DBMS.

The steps involved in processing a query are shown in the figure below.

The basic steps are

1. Parsing and translation

2. Optimization

3. Evaluation

Before query processing can begin, the system must translate the query into a usable form.

A language such as SQL is suitable for human use, but is not well suited to the system's internal representation of a query.

A more useful internal representation is one based on the extended relational algebra.

Thus, the first action the system must take in query processing is to translate a given query into its

internal form.

This translation process is similar to the work performed by the parser of a compiler.


In generating the internal form of the query, the parser checks the syntax of the user’s query, verifies

that the relation names appearing in the query are names of the relations in the database, and so on.

The system constructs a parse-tree representation of the query, which it then translates into a relational-

algebra expression.

If the query was expressed in terms of a view, the translation phase also replaces all uses of the view by

the relational-algebra expression that defines the view.

Given a query, there are generally a variety of methods for computing the answer.

For example, we have seen that, in SQL, a query could be expressed in several different ways.

Each SQL query can itself be translated into a relational-algebra expression in one of several ways.

Furthermore, the relational-algebra representation of a query specifies only partially how to evaluate a

query.

There are usually several ways to evaluate relational-algebra expressions.

Q.1.(b). Write short notes on Query evaluation. 6M

Ans.

Given a query, there are generally a variety of methods for computing the answer.

For example, we know that, in SQL, a query could be expressed in several different ways.

Each SQL query can itself be translated into a relational-algebra expression in one of several ways.

The relational-algebra representation of a query specifies only partially how to evaluate a query; there

are usually several ways to evaluate relational-algebra expressions.

Consider the query select balance

from account

where balance < 2500

This query can be translated into either of the following relational-algebra expressions:

• σ balance<2500 (Π balance (account))

• Π balance (σ balance<2500 (account))

Further, we can execute each relational-algebra operation by one of several different algorithms.

For example, to implement the preceding selection, we can search every tuple in account to find tuples

with balance less than 2500.

If a B+-tree index is available on the attribute balance, we can use the index instead to locate the tuples.

To specify fully how to evaluate a query, we need not only to provide the relational algebra expression,

but also to annotate it with instructions specifying how to evaluate each operation.

Annotations may state the algorithm to be used for a specific operation, or the particular index or indices

to use.

A relational-algebra operation annotated with instructions on how to evaluate it is called an evaluation

primitive.

A sequence of primitive operations that can be used to evaluate a query is a query execution plan or

query-evaluation plan.

Figure above illustrates an evaluation plan for our example query, in which a particular index is

specified for the selection operation.


The query-execution engine takes a query-evaluation plan, executes that plan, and returns the answers

to the query.

The different evaluation plans for a given query can have different costs.

The system constructs a query-evaluation plan that minimizes the cost of query evaluation.

Once the query plan is chosen, the query is evaluated with that plan, and the result of the query is

output.

The cost of query evaluation can be measured in terms of a number of different resources, including

disk accesses, CPU time to execute a query, and, in a distributed or parallel database system, the cost of

communication.

The response time for a query-evaluation plan (that is, the clock time required to execute the plan),

assuming no other activity is going on on the computer, would account for all these costs.

We use the number of block transfers from disk as a measure of the actual cost.

To simplify our computation of disk-access cost, we assume that all transfers of blocks have the same

cost.

A more accurate measure would therefore estimate

1. The number of seek operations performed

2. The number of blocks read

3. The number of blocks written

and then add up these numbers after multiplying them by the average seek time,

average transfer time for reading a block, and average transfer time for writing a

block, respectively.
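Written out as a formula (the symbols below are ours, introduced only to summarise the three quantities just listed):

$$\text{cost} \;\approx\; N_{\text{seek}}\, t_{\text{seek}} \;+\; N_{\text{read}}\, t_{\text{read}} \;+\; N_{\text{write}}\, t_{\text{write}}$$

where $N_{\text{seek}}$, $N_{\text{read}}$, $N_{\text{write}}$ are the numbers of seeks, blocks read and blocks written, and $t_{\text{seek}}$, $t_{\text{read}}$, $t_{\text{write}}$ are the corresponding average times.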

Q.2.(a). What is meant by the term heuristic optimization? Discuss the main heuristic that are applied

during query optimization. 8M

Ans.

A drawback of cost-based optimization is the cost of optimization itself.

Although the cost of query processing can be reduced by clever optimizations, cost-based optimization

is still expensive.

Hence, many systems use heuristics to reduce the number of choices that must be made in a cost-based

fashion.

Some systems even choose to use only heuristics, and do not use cost-based optimization at all.

An example of a heuristic rule is the following rule for transforming relational algebra

queries:

• Perform selection operations as early as possible.

• A heuristic optimizer would use this rule without finding out whether the cost is reduced by this

transformation.

• For an example where it can result in an increase in cost, consider an expression

• σθ(r ⊳⊲ s), where the condition θ refers only to attributes in s.

• The selection can certainly be performed before the join.

• However, if r is extremely small compared to s, and if there is an index on the join attributes of s, but no

index on the attributes used by θ, then it is probably a bad idea to perform the selection early.

• Performing the selection early—that is, directly on s—would require doing a scan of all tuples in s.

• It is probably cheaper, in this case, to compute the join by using the index, and then to reject tuples that

fail the selection.

• Heuristic optimization applies the rules to the initial query expression and produces the heuristically

transformed query expressions.

• However, there are cases where performing the selection before the join is not a good idea.


• Assume that r is a small relation, s is very large, s has an index on the join attribute, and there is no index on the attributes of s used in the selection condition; then computing the join using the index and applying the selection afterwards might be better than scanning the whole of s to do the selection first.

• The heuristic rules can be used to convert an initial query expression to an equivalent one.

Transforming Relational Algebra:

• One aspect of optimization occurs at relational algebra level.

• This involves transforming an initial expression (tree) into an equivalent expression (tree) which is more

efficient to execute.

• Two relational algebra expressions are said to be equivalent if the two expressions generate two relation

of the same set of attributes and contain the same set of tuples although their attributes may be ordered

differently.

• The query tree is a data structure that represents the relational algebra expression in the query

optimization process.

• The leaf nodes in the query tree correspond to the input relations of the query.

• The internal nodes represent the operators in the query.

• When executing the query, the system will execute an internal node operation whenever its operands are available; the internal node is then replaced by the relation obtained from that execution.

• Equivalence Rules for Transforming the Queries.

• There are many rules which can be used to transform relational algebra operations to equivalent ones.

• Some useful rules for query optimization are as under:

• we use the following notation:

1. E1, E2, E3,… : denote relational algebra expressions

2. X, Y, Z : denote set of attributes

3. F, F1, F2, F3 ,… : denote predicates (selection or join conditions)

1. Commutativity of Join and Cartesian Product operations

E1 ⊳⊲F E2 ≡ E2 ⊳⊲F E1
E1 × E2 ≡ E2 × E1

2. Associativity of Join and Cartesian Product operations

(E1 ∗ E2) ∗ E3 ≡ E1 ∗ (E2 ∗ E3)
(E1 × E2) × E3 ≡ E1 × (E2 × E3)
(E1 ⊳⊲F1 E2) ⊳⊲F2 E3 ≡ E1 ⊳⊲F1 (E2 ⊳⊲F2 E3)

The join operation is associative in this manner provided F1 involves attributes from only E1 and E2, and F2 involves only attributes from E2 and E3.

3. Cascade of Projection

πX1(πX2(...(πXn(E))...))≡πX1(E)

4. Cascade of Selection

σF1∧F2∧...∧Fn(E)≡σF1(σF2(...(σFn(E))...))

5. Commutativity of Selection

σF1(σF2(E))≡σF2(σF1(E))

6. Commuting Selection with Projection

πX(σF(E))≡σF(πX(E))

This rule holds if the selection condition F involves only the attributes in set X.

7. Selection with Cartesian Product and Join

If all the attributes in the selection condition F involve only the attributes of one of the expression say

E1, then the selection and Join can be combined as follows:

σF(E1⊳⊲CE2)≡(σF(E1))⊳⊲CE2


If the selection condition F = F1 AND F2 where F1 involves only attributes of expression E1 and F2

involves only attribute of expression E2 then we have:

σF1∧F2(E1⊳⊲CE2)≡(σF1(E1))⊳⊲C(σF2(E2))

If the selection condition F = F1 AND F2 where F1 involves only attributes of expression E1 and F2

involves attributes from both E1 and E2 then we have:

σF1∧F2(E1⊳⊲CE2)≡σF2((σF1(E1))⊳⊲CE2)

8. Commuting Selection with set operations

The Selection commutes with all three set operations (Union, Intersect, Set Difference) .

σF(E1∪E2)≡(σF(E1))∪(σF(E2))

The same rule apply when replace Union by Intersection or Set Difference

9. Commuting Projection with Union

πX(E1∪E2)≡(πX(E1))∪(πX(E2))

10. Commutativity of set operations: The Union and Intersection are commutative but Set Difference is not.

E1 ∪ E2 ≡ E2 ∪ E1
E1 ∩ E2 ≡ E2 ∩ E1

11. Associativity of set operations: Union and Intersection are associative but Set Difference is not

(E1 ∪ E2) ∪ E3 ≡ E1 ∪ (E2 ∪ E3)
(E1 ∩ E2) ∩ E3 ≡ E1 ∩ (E2 ∩ E3)

12. Converting a Cartesian Product followed by a Selection into a Join.

If the selection condition corresponds to a join condition, we can convert as follows:

σF(E1×E2)≡E1⊳⊲FE2
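As a small illustration of rule 7 (our own example, using the account and customer relations that appear elsewhere in this paper), a selection whose condition mentions only attributes of account can be pushed inside the join:

$$\sigma_{balance<2500}(account \bowtie customer) \;\equiv\; (\sigma_{balance<2500}(account)) \bowtie customer$$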

Q.2.(b). Explain the following in detail :- 5M

(i). Cost Estimation

Ans.

(i). Cost Estimation

The optimizer attempts to generate the best execution plan for a SQL statement.

The best execution plan is defined as the plan with the lowest cost among all considered candidate

plans.

The cost computation accounts for factors of query execution such as I/O, CPU, and communication.

The best method of execution depends on myriad conditions including how the query is written, the size

of the data set, the layout of the data, and which access structures exist.

The optimizer determines the best plan for a SQL statement by examining multiple access methods,

such as full table scan or index scans, and different join methods such as nested loops and hash joins.

Because the database has many internal statistics and tools at its disposal, the optimizer is usually in a

better position than the user to determine the best method of statement execution.

For this reason, all SQL statements use the optimizer.

Consider a user who queries records for employees who are managers.

If the database statistics indicate that 80% of employees are managers, then the optimizer may decide

that a full table scan is most efficient.

However, if statistics indicate that few employees are managers, then reading an index followed by a

table access by rowid may be more efficient than a full table scan.


Query optimization is the overall process of choosing the most efficient means of executing a SQL

statement.

SQL is a nonprocedural language, so the optimizer is free to merge, reorganize, and process in any

order.

The database optimizes each SQL statement based on statistics collected about the accessed data.

When generating execution plans, the optimizer considers different access paths and join methods.

Factors considered by the optimizer include: System resources, which includes I/O, CPU, and memory

Number of rows returned Size of the initial data sets.

The cost is a number that represents the estimated resource usage for an execution plan.

The optimizer assigns a cost to each possible plan, and then chooses the plan with the lowest cost.

For this reason, the optimizer is sometimes called the cost-based optimizer (CBO) to contrast it with the

legacy rule-based optimizer (RBO).

Q.3.(a). Explain Materialization with example. 6M

Ans.

• It is easiest to understand intuitively how to evaluate an expression by looking at a

pictorial representation of the expression in an operator tree.

• Consider the expression

Πcustomer-name (σbalance<2500 (account) ⊳⊲ customer)

in Figure below:

• If we apply the materialization approach, we start from the lowest-level operations in the expression (at

the bottom of the tree).

• In our example, there is only one such operation; the selection operation on account.

• The inputs to the lowest-level operations are relations in the database.

• We execute these operations using the appropriate algorithms, and we store the results in temporary relations.

• We can use these temporary relations to execute the operations at the next level up in the tree, where the

inputs now are either temporary relations or relations stored in the database.

• In our example, the inputs to the join are the customer relation and the temporary relation created by the

selection on account.

• The join can now be evaluated, creating another temporary relation.

• By repeating the process, we will eventually evaluate the operation at the root of the tree, giving the

final result of the expression.

• In our example, we get the final result by executing the projection operation at the root of the tree, using

as input the temporary relation created by the join.

• Evaluation as just described is called materialized evaluation, since the results of


each intermediate operation are created (materialized) and then are used for evaluation of the next-level

operations.

• The cost of a materialized evaluation is not simply the sum of the costs of the operations involved.

• When we computed the cost estimates of algorithms, we ignored the cost of writing the result of the

operation to disk.

• To compute the cost of evaluating an expression as done here, we have to add the costs of all the

operations, as well as the cost of writing the intermediate results to disk.

• We assume that the records of the result accumulate in a buffer, and, when the buffer is full, they are

written to disk.

• The cost of writing out the result can be estimated as nr/fr, where nr is the estimated number of tuples in the result relation r, and fr is the blocking factor of the result relation, that is, the number of records of r that will fit in a block (a small numeric illustration follows at the end of this answer).

• Double buffering (using two buffers, with one continuing execution of the algorithm while the other is

being written out) allows the algorithm to execute more quickly by performing CPU activity in parallel

with I/O activity.
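As the numeric illustration of the write-out estimate above (the figures are assumed purely for illustration): if an intermediate join result is expected to contain $n_r = 10{,}000$ tuples and $f_r = 25$ tuples fit in a block, then materializing it costs roughly

$$n_r / f_r = 10000 / 25 = 400 \text{ extra block writes,}$$

plus about the same again to read the temporary relation back in for the next operation.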

Q.3.(b). Explain the pipelining with example. 6M

Ans.

• We can improve query-evaluation efficiency by reducing the number of temporary files that are

produced.

• We achieve this reduction by combining several relational operations into a pipeline of operations, in

which the results of one operation are passed along to the next operation in the pipeline.

• Evaluation as just described is called pipelined evaluation.

• Combining operations into a pipeline eliminates the cost of reading and writing temporary relations.

• For example, consider the expression (Πa1,a2(r ⊳⊲ s)).

• If materialization were applied, evaluation would involve creating a temporary relation to hold the result

of the join, and then reading back in the result to perform the projection.

• These operations can be combined: When the join operation generates a tuple of its result, it passes that

tuple immediately to the project operation for processing.

• By combining the join and the projection, we avoid creating the intermediate result, and instead create

the final result directly.

• We can implement a pipeline by constructing a single, complex operation that combines the operations

that constitute the pipeline.

• Although this approach may be feasible for various frequently occurring situations, it is desirable in

general to reuse the code for individual operations in the construction of a pipeline.

• Therefore, each operation in the pipeline is modeled as a separate process or thread within the system,

which takes a stream of tuples from its pipelined inputs, and generates a stream of tuples for its output.

• For each pair of adjacent operations in the pipeline, the system creates a buffer to hold tuples being

passed from one operation to the next.

• In the example of Figure shown below, all three operations can be placed in a pipeline, which passes the

results of the selection to the join as they are generated. In turn, it passes the results of the join to the

projection as they are generated.


• The memory requirements are low, since results of an operation are not stored for long.

• However, as a result of pipelining, the inputs to the operations are not available all at once for

processing.

• Pipelines can be executed in either of two ways:

1. Demand driven

2. Producer driven

• In a demand-driven pipeline, the system makes repeated requests for tuples from

the operation at the top of the pipeline.

• Each time that an operation receives a request for tuples, it computes the next tuple (or tuples) to be

returned, and then returns that tuple.

• In a producer-driven pipeline, operations do not wait for requests to produce

tuples, but instead generate the tuples eagerly.

• Each operation at the bottom of a pipeline continually generates output tuples, and puts them in its

output buffer, until the buffer is full.

Q.4. What is Query Processing. What are the joint strategies in Joint operation. Explain in detail. 13 M

Ans. Query processing is a set of activities involving in getting the result of a query expressed in a high-level

language.

These activities include parsing the queries and translating them into expressions that can be implemented at the physical level of the file system, optimizing the internal form of the query to get a suitable execution strategy for processing, and then doing the actual execution of the query to get the results.

The cost of processing of query is dominated by the disk access.

For a given query, several possible processing strategies exist, especially when the query is complex.

The difference between a good strategy and a bad one may be several orders of magnitude.

Therefore, it is worthwhile for the system to spend some time on selecting a good strategy for processing a query.

There are several join strategies for computing the join of relations, and we analyze their respective

costs.

The cardinality of Join operations can be calculated as under:

Assume the join: R ⊳⊲ S

1. If R, S have no common attributes: nr*ns

Page 64: Tulsiramji Gaikwad-Patil College of Engineering ...tgpcet.com › IT-QP › 6 › DBMS.pdf · University Paper Solution Winter-2019 Subject: Database Management System Semester: VI

Page 9 of 41

2. If R, S have attribute A in common: nr ∗ ns / V(A,s) or nr ∗ ns / V(A,r) (take the minimum of the two)

3. If R, S have attribute A in common and:

1. A is a candidate key for R: ≤ ns

2. A is candidate key in R and candidate key in S : ≤ min(nr, ns)

3. A is a key for R, foreign key for S: = ns

Size and plans for join operation

Running example: depositor ⊳⊲ customer

Metadata:

ncustomer = 10,000 ndepositor = 5000

fcustomer = 25 fdepositor = 50

bcustomer= 400 bdepositor= 100

V(cname, depositor) = 2500 (each customer has on average 2 accts)

cname in depositor is foreign key

Nested-loop join:

1. Figure below shows a simple algorithm to compute the theta join, r ⊳⊲θ s, of two relations r and s.

2. This algorithm is called the nested-loop join algorithm, since it basically consists

of a pair of nested for loops.

3. Relation r is called the outer relation and relation s the inner relation of the join,

since the loop for r encloses the loop for s.

The algorithm uses the notation tr · ts, where tr and ts are tuples; tr · ts denotes the

tuple constructed by concatenating the attribute values of tuples tr and ts.

for each tuple tr in r do begin

for each tuple ts in s do begin

test pair (tr, ts) to see if they satisfy the join condition θ

if they do, add tr · ts to the result.

end

end
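For a rough feel of the cost, using the depositor ⊳⊲ customer statistics listed above and the worst-case estimate nr ∗ bs + br (our own arithmetic; it assumes neither relation fits in memory):

$$5000 \times 400 + 100 = 2{,}000{,}100 \text{ block accesses (depositor as the outer relation)}$$
$$10000 \times 100 + 400 = 1{,}000{,}400 \text{ block accesses (customer as the outer relation)}$$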

Block nested loop join:



1. If the buffer is too small to hold either relation entirely in memory, saving in block accesses can be

done if we process the relations on a per-block basis, rather than on a per-tuple basis.

2. Figure below shows block nested-loop join, which is a variant of the nested-loop join where every

block of the inner relation is paired with every block of the outer relation.

3. Within each pair of blocks, every tuple in one block is paired with every tuple in

the other block, to generate all pairs of tuples.

4. As before, all pairs of tuples that satisfy the join condition are added to the result.

5. The primary difference in cost between the block nested-loop join and the basic

nested-loop join is that, in the worst case, each block in the inner relation s is read

only once for each block in the outer relation, instead of once for each tuple in the

outer relation.

6. Thus, in the worst case, there will be a total of br * bs + br block accesses, where

br and bs denote the number of blocks containing records of r and s respectively.

7. Clearly, it is more efficient to use the smaller relation as the outer relation, in case neither of the

relations fits in memory.

8. In the best case, there will be br + bs block accesses.

for each block Br of r do begin

for each block Bs of s do begin

for each tuple tr in Br do begin

for each tuple ts in Bs do begin

test pair (tr, ts) to see if they satisfy the join condition

if they do, add tr · ts to the result.

end end

end end

Cost:

1. Worst case estimate: br ∗ bs + br block accesses.

2. Improvements to nested loop and block nested loop algorithms for a buffer with M blocks:

In block nested-loop, use M − 2 disk blocks as the blocking unit for the outer relation, where M = memory size in blocks; use the remaining two blocks to buffer the inner relation and the output.

Cost = ⌈br / (M − 2)⌉ ∗ bs + br

If equi-join attribute forms a key on inner relation, stop inner loop on first match

Scan inner loop forward and backward alternately, to make use of the blocks remaining in buffer .
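Again using the depositor ⊳⊲ customer statistics (our own arithmetic), the worst-case block nested-loop estimate br ∗ bs + br gives:

$$100 \times 400 + 100 = 40{,}100 \text{ block accesses (depositor as the outer relation),}$$

and the best case is $b_r + b_s = 100 + 400 = 500$ block accesses.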

Indexed-Nested loop join:

1. In a nested-loop join , if an index is available on the inner loop’s join attribute,

index lookups can replace file scans.

2. For each tuple tr in the outer relation r, the index is used to look up tuples in s that

will satisfy the join condition with tuple tr.


3. This join method is called an indexed nested-loop join; it can be used with existing indices, as well

as with temporary indices created for the sole purpose of evaluating the join.

4. Looking up tuples in s that will satisfy the join conditions with a given tuple tr is

essentially a selection on s.

5. The cost of an indexed nested-loop join can be computed as follows.

6. For each tuple in the outer relation r, a lookup is performed on the index for s, and the relevant tuples

are retrieved.

7. In the worst case, there is space in the buffer for only one page of r and one page of the index.

8. Then, br disk accesses are needed to read relation r, where br denotes the number of blocks

containing records of r.

9. For each tuple in r, we perform an index lookup on s.

10.Then, the cost of the join can be computed as br +nr ∗ c, where nr is the number of records in

relation r, and c is the cost of a single selection on s using the join condition.

• For each tuple tR in the outer relation R, use the index to look up tuples in S that satisfy the join

condition with tuple tR.

• Worst case: buffer has space for only one page of R, and, for each tuple in R, we perform an index

lookup on s.

• Cost of the join: br + nr ∗ c

1. Where c is the cost of traversing the index and fetching all matching s tuples for one tuple from r

2. c can be estimated as cost of a single selection on s using the join condition.

If indices are available on join attributes of both R and S,

use the relation with fewer tuples as the outer relation.
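For instance, taking depositor as the outer relation in the running example and assuming (our own illustrative figure) that each index lookup plus fetch on customer costs about c = 5 block accesses:

$$b_{depositor} + n_{depositor} \times c = 100 + 5000 \times 5 = 25{,}100 \text{ block accesses.}$$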

Merge Join:

1. The merge join algorithm (also called the sort–merge join algorithm) can be used

to compute natural joins and equi-joins.

2. Let r(R) and s(S) be the relations whose natural join is to be computed, and let R∩S

denote their common attributes.

3. Suppose that both relations are sorted on the attributes R∩S.

4. Then, their join can be computed by a process much like the merge stage in the merge–sort

algorithm.

5. The merge join algorithm requires that the set Ss of all tuples with the same value for the join

attributes must fit in main memory.


1. Each block needs to be read only once (assuming all tuples for any given value of the join attributes

fit in memory)

2. Thus the number of block accesses for merge-join is bR + bS.

3. But what if one or both of R, S are not sorted on A?

Ans: It may be worth sorting first and then performing the merge join (sort-merge join).

Cost: bR + bS + sortR + sortS
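If depositor and customer are already sorted on the join attribute, the running example therefore needs only about

$$b_R + b_S = 100 + 400 = 500 \text{ block accesses,}$$

otherwise the two sorting costs are added on top (our own arithmetic, for illustration).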

Hash join:

1.Like the merge join algorithm, the hash join algorithm can be used to implement

natural joins and equi-joins.

2. In the hash join algorithm, a hash function h is used to partition tuples of both relations.

3. The basic idea is to partition the tuples of each of the relations into sets that have the same hash value

on the join attributes.

4. We assume that

• h is a hash function mapping JoinAttrs values to {0, 1, . . . , nh}, where JoinAttrs

denotes the common attributes of r and s used in the natural join.

• Hr0 , Hr1, . . .,Hrnh denote partitions of r tuples, each initially empty.

Each tuple tr ∈ r is put in partition Hri, where i = h(tr[JoinAttrs]).

• Hs0, Hs1, . . ., Hsnh denote partitions of s tuples, each initially empty.

Each tuple ts ∈ s is put in partition Hsi, where i = h(ts[JoinAttrs]).

5.The hash function h should have the “goodness” properties of randomness and

uniformity.

6. The idea behind the hash join algorithm is this: Suppose that an r tuple and an

s tuple satisfy the join condition; then, they will have the same value for the join

attributes.

7. If that value is hashed to some value i, the r tuple has to be in Hri and the

s tuple in Hsi .

8. Therefore, r tuples in Hri need only to be compared with s tuples in

Hsi ; they do not need to be compared with s tuples in any other partition.
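Assuming no recursive partitioning is needed and ignoring partially filled blocks, the usual estimate of 3(br + bs) block transfers gives, for the running example (our own arithmetic):

$$3(b_{depositor} + b_{customer}) = 3(100 + 400) = 1{,}500 \text{ block accesses.}$$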


Q5.(a) Let relations r1(A,B,C) and r2(C,D,E) have the following properties: 7M

r1 has 20000 tuples

r2 has 45000 tuples

25 tuples of r1 fit on one block and 30 tuples of r2 fit on one block.

Estimate the number of block accesses required using each of the following join strategies for r1 and r2.

1. Nested Loop Join

2. Block Nested Join

3. Merge Join

4. Hash Join

Ans.

r1 needs 800 blocks, and r2 needs 1500 blocks.

Let us assume M pages of memory.

If M > 800, the join can easily be done in 1500 + 800 disk accesses, using even plain nested-loop join.

So we consider only the case where M ≤ 800 pages.

a. Nested-loop join:

Using r1 as the outer relation we need 20000 ∗ 1500 + 800 = 30,000,800 disk accesses; if r2 is the outer relation we need 45000 ∗ 800 + 1500 = 36,001,500 disk accesses.

b. Block nested-loop join:

If r1 is the outer relation, we need ⌈800 / (M − 2)⌉ ∗ 1500 + 800 disk accesses;

if r2 is the outer relation we need ⌈1500 / (M − 2)⌉ ∗ 800 + 1500 disk accesses.

c. Merge-join:

Assuming that r1 and r2 are not initially sorted on the join key, the total sorting cost inclusive of writing the outputs is Bs = 1500 (2⌈logM−1(1500/M)⌉ + 2) + 800 (2⌈logM−1(800/M)⌉ + 2) disk accesses. Assuming all tuples with the same value for the join attributes fit in memory, the total cost is Bs + 1500 + 800 disk accesses.

d. Hash join:

We assume no overflow occurs. Since r1 is smaller, we use it as the build relation and r2 as the probe

relation. If M > 800/M, i.e. no need for recursive partitioning, then the cost is 3(1500+800) = 6900 disk

accesses, else the cost is 2(1500 + 800)⌈logM−1(800) − 1⌉ + 1500 + 800 disk accesses.
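To make these formulas concrete (our own illustration, assuming M = 100 buffer blocks, a value not given in the question):

$$\text{Block nested-loop (} r_1 \text{ outer): } \lceil 800/98 \rceil \times 1500 + 800 = 9 \times 1500 + 800 = 14{,}300$$
$$\text{Merge join: } 1500(2\lceil\log_{99} 15\rceil + 2) + 800(2\lceil\log_{99} 8\rceil + 2) + 1500 + 800 = 6000 + 3200 + 2300 = 11{,}500$$
$$\text{Hash join: } 3(1500 + 800) = 6{,}900 \text{ disk accesses.}$$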

Q.5.(b). Define query optimization. What are the various measures to evaluate the cost of query?

Ans. 7M

Query optimization is the process of selecting the most efficient query-evaluation plan from among the

many strategies usually possible for processing a given query, especially if the query is complex.

The system constructs a query-evaluation plan that minimizes the cost of query evaluation.

This is where query optimization comes into play.

One aspect of optimization occurs at the relational-algebra level, where the system attempts to find an

expression that is equivalent to the given expression, but more efficient to execute.

Another aspect is selecting a detailed strategy for processing the query, such as choosing the algorithm

to use for executing an operation, choosing the specific indices to use, and so on.


For query optimization, we want to find the "cheapest" execution plan for a query.

A relational algebra expression may have many equivalent expressions.

Such an expression can be represented as a logical query plan (a tree), where the non-leaf nodes are operations of relational algebra (with parameters) and the leaf nodes are relations.

A relational algebra expression can be evaluated in many ways.

An annotated expression specifying detailed evaluation strategy is called the execution plan (includes,

e.g., whether index is used, join algorithms, . . . )

Among all semantically equivalent expressions, the one with the least costly evaluation plan is chosen.

Cost estimate of a plan is based on statistical information in the system catalogs as given below:

Query optimizers use the statistic information stored in DBMS catalog to estimate the cost of a plan.

The relevant catalog information about the relation includes:

1. Number of tuples in a relation r; denote by nr

2. Number of blocks containing tuple of relation r: br

3. Size of the tuple in a relation r ( assume records in a file are all of same types): sr

4. Blocking factor of relation r which is the number of tuples that fit into one block: fr

5. V(A,r) is the number of distinct values of an attribute A in a relation r. This value is the same as the size of πA(r). If A is a key attribute then V(A,r) = nr.

6. SC(A,r) is the selection cardinality of attribute A of relation r. This is the average number of records

that satisfy an equality condition on attribute A.

7. In addition to relation information, some information about indices is also used:

Number of levels in index i.

Number of lowest-level index blocks in index i (number of blocks in the leaf level of the index).

With the statistical information maintained in DBMS catalog and the measures of query cost based on

number of disk accesses, we can estimate the cost for different relational algebra operations

The cost of a query execution plan includes the following components:

Access cost to secondary storage: This is the cost of searching for, reading, writing data blocks of

secondary storage such as disk.

Computation cost: This is the cost of performing in-memory operation on the data buffer during

execution. This can be considered as CPU time to execute a query

Storage cost: This is the cost of storing intermediate files that are generated during execution.

Communication cost: This is the cost of transferring the query and its result from site to site (in a distributed or parallel database system).

Memory usage cost: Number of buffers needed during execution.

In a large database, access cost is usually the most important cost since disk accesses are slow compared

to in-memory operations.

In a small database, when almost all data reside in memory, the emphasis is on computation cost.

In the distributed system, communication cost should be minimized.

It is difficult to include all the cost components in a cost function. Therefore, some cost functions

consider only disk access cost as the reasonable measure of the cost of a query-evaluation plan.


Q.6.(a). List the properties of a transaction must have. Briefly explain it. 6M

Ans.

A transaction is a logical unit of work that contains one or more SQL statements.

It is a collection of operations that form a single logical unit of work.

A database system must ensure proper execution of transactions despite failures that is either the entire

transaction executes, or none of it does.

Furthermore, it must manage concurrent execution of transactions in a way that avoids the introduction

of inconsistency.

Ideally, a database System will guarantee the properties of Atomicity, Consistency, Isolation and

Durability (ACID) for each transaction.

The effects of all the SQL statements in a transaction can be either all committed or all rolled back.

To ensure integrity of the data, we require that the database system maintain the following properties of

the transactions:

Atomicity. Either all operations of the transaction are reflected properly in the database, or none are.

Example: A transaction to transfer funds from one account to another involves making a withdrawal

operation from the first account and a deposit operation on the second. If the deposit operation failed,

you don’t want the withdrawal operation to happen either.

Consistency. Execution of a transaction in isolation (that is, with no other transaction executing

concurrently) preserves the consistency of the database.

Example: A database tracking a checking account may only allow unique check numbers to exist for

each transaction.

Isolation. Even though multiple transactions may execute concurrently, the system guarantees that, for

every pair of transactions Ti and Tj , it appears to Ti that either Tj finished execution before Ti started, or

Tj started execution after Ti finished. Thus, each transaction is unaware of other transactions executing

concurrently in the system.

Example: A teller looking up a balance must be isolated from a concurrent transaction involving a

withdrawal from the same account. Only when the withdrawal transaction commits successfully and the

teller looks at the balance again will the new balance be reported.

Durability. After a transaction completes successfully, the changes it has made to the database persist,

even if there are system failures.

Example: A system crash or any other failure must not be allowed to lose the results of a transaction or

the contents of the database. Durability is often achieved through separate transaction logs that can "re-

create" all transactions from some picked point in time (like a backup).

These properties are often called the ACID properties; the acronym is derived from the first letter of

each of the four properties.

Q.6.(b). Explain the state diagram with neat sketch that a transaction goes through during execution. 7M

Ans.


A transaction in a database can be in one of the above states.

In the absence of failures, all transactions complete successfully.

However, a transaction may not always complete its execution successfully.

Such a transaction is termed aborted.

If we are to ensure the atomicity property, an aborted transaction must have no effect on the state of the

database.

Thus, once any changes caused by an aborted transaction have been undone, we say that the transaction has been rolled back.

It is part of the responsibility of the recovery scheme to manage transaction aborts.

A transaction that completes its execution successfully is said to be committed.

A committed transaction that has performed updates transforms the database into a new consistent state,

which must persist even if there is a system failure.

Once a transaction has committed, we cannot undo its effects by aborting it.

The only way to undo the effects of a committed transaction is to execute a compensating transaction.

For instance, if a transaction added $20 to an account, the compensating transaction would subtract $20

from the account.

However, it is not always possible to create such a compensating transaction.

Therefore, the responsibility of writing and executing a compensating transaction is left to the user, and

is not handled by the database system.

A transaction must be in one of the following states: Active, the initial state; the transaction stays in this

state while it is executing

• Partially committed, after the final statement has been executed

• Failed, after the discovery that normal execution can no longer proceed

• Aborted, after the transaction has been rolled back and the database has been

restored to its state prior to the start of the transaction.

• Committed, after successful completion.

The state diagram corresponding to a transaction is shown above.

We say that a transaction has committed only if it has entered the committed state.

Similarly, we say that a transaction has aborted only if it has entered the aborted state.

A transaction is said to have terminated if it has either committed or aborted.

A transaction starts in the active state.

When it finishes its final statement, it enters the partially committed state.

At this point, the transaction has completed its execution, but it is still possible that it may have to be

aborted, since the actual output may still be temporarily residing in main memory, and thus a hardware

failure may preclude its successful completion.

The database system then writes out enough information to disk that, even in the event of a failure, the

updates performed by the transaction can be re-created when the system restarts after the failure.

When the last of this information is written out, the transaction enters the committed state.

A transaction enters the failed state after the system determines that the transaction can no longer

proceed with its normal execution (for example, because of hardware or logical errors).

Such a transaction must be rolled back.

Then, it enters the aborted state.

At this point, the system has two options: It can restart the transaction, but only if the transaction was

aborted as a result of some hardware or software error that was not created through the internal logic of

the transaction.

A restarted transaction is considered to be a new transaction.

It can kill the transaction.

It usually does so because of some internal logical error that can be corrected only by rewriting the

application program, or because the input was bad, or because the desired data were not found in the

database.


Q.7. Explain schedule, serializability. Also explain what is conflict & view serializability. 13M

Ans.

Serializability:

The database system must control concurrent execution of transactions, to ensure that the database state

remains consistent.

In the fields of databases and transaction processing (transaction management), a schedule describes

execution of transactions running in the system.

Often it is a list of operations (actions) ordered by time, performed by a set of transactions that are

executed together in the system.

If order in time between certain operations is not determined by the system, then a partial order is used.

Examples of such operations are requesting a read operation, reading, writing, aborting, committing,

requesting lock, locking, etc.

Not all transaction operation types should be included in a schedule.

Types of Schedules:

1. Serial Schedule:

The transactions are executed non-interleaved (i.e., a serial schedule is one in which no

transaction starts until a running transaction has ended).

2. Serializable Schedule:

A schedule that is equivalent (in its outcome) to a serial schedule has the serializability property.

Example-In schedule E shown below, the order in which the actions of the transactions are

executed is not the same as in D, but in the end, E gives the same result as D.

3. Conflict-serializable schedules

A schedule is said to be conflict-serializable when the schedule is conflict-equivalent to one or more

serial schedules.

Another definition for conflict-serializability is that a schedule is conflict-serializable if and only if

its precedence graph/serializability graph, when only committed transactions are considered, is acyclic.

Consider a schedule S in which there are two consecutive instructions Ii and Ij, of transactions Ti and Tj, respectively (i ≠ j).

If Ii and Ij refer to different data items, then we can swap Ii and Ij without affecting the results of any

instruction in the schedule.

However, if Ii and Ij refer to the same data item Q, then the order of the two steps may matter.

Since we are dealing with only read and write instructions, there are four cases that we need to consider:

1. Ii = read(Q), Ij = read(Q). The order of Ii and Ij does not matter, since the

same value of Q is read by Ti and Tj , regardless of the order.

2. Ii = read(Q), Ij = write(Q). If Ii comes before Ij, then Ti does not read the value

of Q that is written by Tj in instruction Ij. If Ij comes before Ii, then Ti reads

the value of Q that is written by Tj. Thus, the order of Ii and Ij matters.

3. Ii = write(Q), Ij = read(Q). The order of Ii and Ij matters for reasons similar

to those of the previous case.


4. Ii = write(Q), Ij = write(Q). Since both instructions are write operations, the

order of these instructions does not affect either Ti or Tj . However, the value

obtained by the next read(Q) instruction of S is affected, since the result of

only the latter of the two write instructions is preserved in the database.

Thus, only in the case where both Ii and Ij are read instructions does the relative order of their execution

not matter.

We say that Ii and Ij conflict if they are operations by different transactions on the same data item, and

at least one of these instructions is a write operation.

To illustrate the concept of conflicting instructions, we consider schedule 1 given below.

Schedule 1

The write(A) instruction of T1 conflicts with the read(A) instruction of T2.

However, the write(A) instruction of T2 does not conflict with the read(B) instruction of T1, because the

two instructions access different data items.

Let Ii and Ij be consecutive instructions of a schedule S.

If Ii and Ij are instructions of different transactions and Ii and Ij do not conflict, then we can swap the

order of Ii and Ij to produce a new schedule S’.

We expect S to be equivalent to S’, since all instructions appear in the same order in both schedules

except for Ii and Ij, whose order does not matter.

Since the write(A) instruction of T2 in schedule 1 does not conflict with the read(B) instruction of T1,

we can swap these instructions to generate an equivalent schedule, schedule 2 shown below.

Schedule 2

Regardless of the initial system state, schedules 1 and 2 both produce the same final system state. We

continue to swap nonconflicting instructions:

• Swap the read(B) instruction of T1 with the read(A) instruction of T2.

• Swap the write(B) instruction of T1 with the write(A) instruction of T2.

• Swap the write(B) instruction of T1 with the read(A) instruction of T2.

The final result of these swaps, schedule 3 of Figure shown below, is a serial schedule.

Thus, we have shown that schedule 1 is equivalent to a serial schedule.

This equivalence implies that, regardless of the initial system state, schedule 1 will produce the

same final state as will some serial schedule.


Schedule 3

If a schedule S can be transformed into a schedule S’ by a series of swaps of nonconflicting

instructions, we say that S and S’ are conflict equivalent.

The concept of conflict equivalence leads to the concept of conflict serializability.

We say that a schedule S is conflict serializable if it is conflict equivalent to a serial schedule.

4. View Serializable Schedule:

View equivalence that is less stringent than conflict equivalence, but that, like conflict equivalence, is

based on only the read and write operations of transactions.

Schedule 1

Consider two schedules S and S’, where the same set of transactions participates in both schedules.

The schedules S and S’ are said to be view equivalent if three conditions are met:

1. For each data item Q, if transaction Ti reads the initial value of Q in schedule S, then

transaction Ti must, in schedule S’, also read the initial value of Q.

2. For each data item Q, if transaction Ti executes read(Q) in schedule S, and if that

value was produced by a write(Q) operation executed by transaction Tj , then the read(Q) operation of

transaction Ti must, in schedule S’, also read the value of Q that

was produced by the same write(Q) operation of transaction Tj .

3. For each data item Q, the transaction (if any) that performs the final write(Q)

operation in schedule S must perform the final write(Q) operation in schedule S’.

Conditions 1 and 2 ensure that each transaction reads the same values in both schedules and, therefore,

performs the same computation.

Condition 3, coupled with conditions 1 and 2, ensures that both schedules result in the same final system

state.

Consider the following Schedule 1:


Schedule 1

Schedule 2

The schedule 1 is not view equivalent to schedule 2, since, in schedule 1, the value of account A read by

transaction T2 was produced by T1, whereas this case does not hold in schedule 2.

The concept of view equivalence leads to the concept of view serializability.

We say that a schedule S is view serializable if it is view equivalent to a serial schedule.

Every conflict-serializable schedule is also view serializable, but there are view serializable schedules

that are not conflict serializable.

Q.8. (a). Which of the following schedules is conflict serializable? For each serializable schedule,

determine the equivalent serial schedules: 7M

(i). r1 (X); r3 (X); w1 (X); r2 (X); w3 (X);

(ii). r1 (X); r3 (X); w3 (X); w1 (X); r2 (X);

(iii). r3 (X); r2 (X); w3 (X); r1 (X); w1 (X);

(iv). r3 (X); r2 (X); r1 (X); w3 (X); w1 (X);

Ans.

Recall from Q.7 above: two instructions conflict if they belong to different transactions, operate on the same data item, and at least one of them is a write. A schedule is conflict serializable when it is conflict equivalent to some serial schedule; equivalently, a schedule is conflict serializable if and only if its precedence graph (with an edge Ti → Tj whenever a conflicting operation of Ti precedes one of Tj) is acyclic. Applying this test to the four given schedules:

(i). r1(X); r3(X); w1(X); r2(X); w3(X);

This schedule is not conflict serializable. r3(X) appears before w1(X), so T3 must precede T1, while r1(X) appears before w3(X), so T1 must precede T3. The precedence graph therefore contains the cycle T1 → T3 → T1, and conflicting instructions cannot be swapped.

(ii). r1(X); r3(X); w3(X); w1(X); r2(X);

This schedule is not conflict serializable either. As in (i), r1(X) before w3(X) requires T1 to precede T3, and r3(X) before w1(X) requires T3 to precede T1, which again forms a cycle.

(iii). r3(X); r2(X); w3(X); r1(X); w1(X);

This schedule is conflict serializable. The conflicts require T2 to precede T3 (r2 before w3), T2 to precede T1 (r2 before w1) and T3 to precede T1 (r3 before w1, w3 before r1, w3 before w1); the precedence graph has no cycle. Swapping the nonconflicting instructions r3(X) and r2(X) gives the equivalent serial schedule T2, T3, T1:

r2(X); r3(X); w3(X); r1(X); w1(X);

(iv). r3(X); r2(X); r1(X); w3(X); w1(X);

This schedule is not conflict serializable. r1(X) before w3(X) requires T1 to precede T3, while r3(X) before w1(X) and w3(X) before w1(X) require T3 to precede T1, which is a cycle.
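To double-check answers like these, one can build the precedence graph mechanically. The following is a small illustrative sketch in Python (the schedule representation and function name are my own, not part of any standard library): it adds an edge Ti → Tj for every conflicting pair and reports the schedule as conflict serializable only when the graph is acyclic.

from itertools import combinations

def conflict_serializable(schedule):
    """schedule: list of (transaction, operation, item) triples, in execution order.
    Build the precedence graph and report whether it is acyclic."""
    edges = set()
    for (ti, opi, xi), (tj, opj, xj) in combinations(schedule, 2):
        # A conflict needs different transactions, the same item and at least one write.
        if ti != tj and xi == xj and "w" in (opi, opj):
            edges.add((ti, tj))          # ti must precede tj in any equivalent serial order

    def reachable(src, dst, path=()):
        return any(a == src and (b == dst or reachable(b, dst, path + (a,)))
                   for a, b in edges if a not in path)

    return not any(reachable(t, t) for t, _, _ in schedule)   # serializable iff no cycle

# Schedule (i): r1(X); r3(X); w1(X); r2(X); w3(X)  ->  not conflict serializable
print(conflict_serializable([("T1", "r", "X"), ("T3", "r", "X"), ("T1", "w", "X"),
                             ("T2", "r", "X"), ("T3", "w", "X")]))   # False

# Schedule (iii): r3(X); r2(X); w3(X); r1(X); w1(X)  ->  conflict serializable (T2, T3, T1)
print(conflict_serializable([("T3", "r", "X"), ("T2", "r", "X"), ("T3", "w", "X"),
                             ("T1", "r", "X"), ("T1", "w", "X")]))   # True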


Q.8.(b). What is log based recovery? What is the information in log records & how can it be used in

recovery? 6M

Ans.

When a system crashes, it may have several transactions being executed and various files opened for them to modify data items.

As we know that transactions are made of various operations, which are atomic in nature.

But according to the ACID properties of a DBMS, the atomicity of each transaction as a whole must be maintained; that is, either all of its operations are executed or none.

When DBMS recovers from a crash it should maintain the following:

It should check the states of all transactions, which were being executed.

A transaction may be in the middle of some operation; DBMS must ensure the atomicity of

transaction in this case.

It should check whether the transaction can be completed now or needs to be rolled back.

No transaction should be allowed to leave the DBMS in an inconsistent state.

There are two types of techniques which can help the DBMS in recovering as well as maintaining the atomicity of transactions:

Maintaining the logs of each transaction, and writing them onto some stable storage before actually

modifying the database.

Maintaining shadow paging, where the changes are done in volatile memory and the actual database is updated later.

Log-Based Recovery

1. The most widely used structure for recording database modifications is the log.

2. The log is a sequence of log records, recording all the update activities in the database.

3. There are several types of log records.

4. An update log record describes a single database write.

5. It has these fields:

• Transaction identifier is the unique identifier of the transaction that performed

the write operation.

• Data-item identifier is the unique identifier of the data item written. Typically,

it is the location on disk of the data item.

• Old value is the value of the data item prior to the write.

• New value is the value that the data item will have after the write.

6. Other special log records exist to record significant events during transaction processing, such as the

start of a transaction and the commit or abort of a transaction.

7. We denote the various types of log records as:

• <Ti start>. Transaction Ti has started.

• <Ti, Xj, V1, V2>. Transaction Ti has performed a write on data item Xj . Xj

had value V1 before the write, and will have value V2 after the write.

• <Ti commit>. Transaction Ti has committed.

• <Ti abort>. Transaction Ti has aborted.

Whenever a transaction performs a write, it is essential that the log record for that write be created

before the database is modified.

Once a log record exists, we can output the modification to the database if that is desirable.

Also, we have the ability to undo a modification that has already been output to the database.

We undo it by using the old-value field in log records.

For log records to be useful for recovery from system and disk failures, the log must reside in stable

storage.
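As an illustration only (the class and field names below are assumptions for this sketch, not the internal format of any real DBMS), the structure of these log records can be pictured as:

from dataclasses import dataclass
from typing import Any

@dataclass
class UpdateLogRecord:
    """<Ti, Xj, V1, V2>: one database write by transaction txn_id."""
    txn_id: str          # transaction identifier, e.g. "T0"
    item_id: str         # data-item identifier (typically its location on disk)
    old_value: Any       # value before the write, used for undo
    new_value: Any       # value after the write, used for redo

@dataclass
class TransactionLogRecord:
    """<Ti start>, <Ti commit> or <Ti abort>."""
    txn_id: str
    event: str           # "start", "commit" or "abort"

# The log is simply an append-only sequence of such records on stable storage.
log = [
    TransactionLogRecord("T0", "start"),
    UpdateLogRecord("T0", "A", 1000, 950),
    UpdateLogRecord("T0", "B", 2000, 2050),
    TransactionLogRecord("T0", "commit"),
]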


Q.9.(a). Explain lock based protocols. 7M

Ans. Lock based protocols

One way to ensure serializability is to require that data items be accessed in a mutually exclusive

manner; that is, while one transaction is accessing a data item, no other transaction can modify that data

item.

The most common method used to implement this requirement is to allow a transaction to access a data

item only if it is currently holding a lock on that item.

Locks

There are various modes in which a data item may be locked.

1. Shared. If a transaction Ti has obtained a shared-mode lock (denoted by S)

on item Q, then Ti can read, but cannot write, Q.

2. Exclusive. If a transaction Ti has obtained an exclusive-mode lock (denoted

by X) on item Q, then Ti can both read and write Q.

We require that every transaction request a lock in an appropriate mode on data item Q, depending on

the types of operations that it will perform on Q.

The transaction makes the request to the concurrency-control manager.

The transaction can proceed with the operation only after the concurrency-control manager grants the

lock to the transaction.

Given a set of lock modes, we can define a compatibility function on them as follows.

Let A and B represent arbitrary lock modes.

Suppose that a transaction Ti requests a lock of mode A on item Q on which transaction Tj (Ti ≠ Tj) currently holds a lock of mode B.

If transaction Ti can be granted a lock on Q immediately, in spite of the presence of the mode B lock,

then we say mode A is compatible with mode B.

Such a function can be represented conveniently by a matrix. An element comp(A, B) of the matrix has the value true if and only if mode A is compatible with mode B.

For the two modes above, comp(S, S) is true, while comp(S, X), comp(X, S) and comp(X, X) are all false.

Note that shared mode is compatible with shared mode, but not with exclusive mode.

At any time, several shared-mode locks can be held simultaneously (by different transactions) on a

particular data item.

A subsequent exclusive-mode lock request has to wait until the currently held shared-mode locks are

released.

A transaction requests a shared lock on data item Q by executing the lock-S(Q) instruction.

Similarly, a transaction requests an exclusive lock through the lock-X(Q) instruction.

A transaction can unlock a data item Q by the unlock(Q) instruction.

To access a data item, transaction Ti must first lock that item.

If the data item is already locked by another transaction in an incompatible mode, the concurrency

control manager will not grant the lock until all incompatible locks held by other transactions have been

released.

Thus, Ti is made to wait until all incompatible locks held by other transactions have been released.

Transaction Ti may unlock a data item that it had locked at some earlier point.

Note that a transaction must hold a lock on a data item as long as it accesses that item.

Moreover, for a transaction to unlock a data item immediately after its final access of that data item is

not always desirable, since serializability may not be ensured.


There are four types of lock protocols available:

1. Simplistic
Simplistic lock-based protocols allow a transaction to obtain a lock on every object before a 'write' operation is performed.
As soon as the 'write' has been done, the transaction may unlock the data item.

2. Pre-claiming
In this protocol, a transaction evaluates its operations and creates a list of data items on which it needs locks.
Before starting execution, the transaction requests the system for all the locks it needs beforehand.
If all the locks are granted, the transaction executes and releases all the locks when all its operations are over.
If all the locks are not granted, the transaction rolls back and waits until all the locks are granted.

3. Two-Phase Locking (2PL)
This locking protocol divides the transaction execution phase into three parts.
In the first part, when the transaction starts executing, it seeks grants for the locks it needs as it executes.
The second part is where the transaction has acquired all its locks and no further lock is required; the transaction keeps executing its operations.
As soon as the transaction releases its first lock, the third phase starts.
In this phase, the transaction cannot demand any new lock; it only releases the acquired locks.
Two-phase locking thus has two phases: a growing phase, where locks are being acquired by the transaction, and a shrinking phase, where the locks held by the transaction are released.
In some variants, to claim an exclusive (write) lock, a transaction first acquires a shared (read) lock and then upgrades it to an exclusive lock.

4. Strict Two-Phase Locking
The first phase of Strict-2PL is the same as in 2PL. After acquiring all its locks in the first phase, the transaction continues to execute normally.
But in contrast to 2PL, Strict-2PL does not release a lock as soon as it is no longer required; it holds all locks until the commit point arrives.
Strict-2PL releases all locks at once at the commit point.
Strict-2PL therefore does not suffer from cascading aborts as 2PL can.
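As an illustrative sketch of the shared/exclusive compatibility rule and of Strict-2PL's release-at-commit behaviour (the class and method names are assumptions, not a real DBMS API), consider:

class LockManager:
    """Toy lock table: item -> (mode, set of holding transactions).
    Requests that cannot be granted immediately return False
    (a real concurrency-control manager would make the transaction wait)."""
    def __init__(self):
        self.table = {}                          # item -> ("S" or "X", {txn, ...})

    def lock(self, txn, item, mode):
        if item not in self.table:
            self.table[item] = (mode, {txn})
            return True
        held_mode, holders = self.table[item]
        if mode == "S" and held_mode == "S":     # only S is compatible with S
            holders.add(txn)
            return True
        if holders == {txn}:                     # sole holder: re-grant, upgrading S to X if asked
            self.table[item] = ("X" if "X" in (mode, held_mode) else "S", holders)
            return True
        return False                             # incompatible: the requester must wait

    def release_all(self, txn):
        """Strict 2PL: release every lock of txn at once, at commit (or abort)."""
        for item in list(self.table):
            _, holders = self.table[item]
            holders.discard(txn)
            if not holders:
                del self.table[item]

lm = LockManager()
print(lm.lock("T1", "Q", "S"))   # True: first shared lock on Q
print(lm.lock("T2", "Q", "S"))   # True: shared is compatible with shared
print(lm.lock("T3", "Q", "X"))   # False: exclusive must wait for the shared locks
lm.release_all("T1"); lm.release_all("T2")
print(lm.lock("T3", "Q", "X"))   # True: granted once the shared locks are released

A real concurrency-control manager would block the requesting transaction instead of returning False, and would also detect deadlocks; the sketch only shows the grant/deny decision.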

Q.9.(b). Write short notes on two phase locking protocol. 6M

Ans.

Two-phase locking protocol
One way to ensure serializability is to require that data items be accessed in a mutually exclusive

manner; that is, while one transaction is accessing a data item, no other transaction can modify that data

item.

The most common method used to implement this requirement is to allow a transaction to access a data

item only if it is currently holding a lock on that item.

Locks

There are various modes in which a data item may be locked.

1. Shared. If a transaction Ti has obtained a shared-mode lock (denoted by S)

on item Q, then Ti can read, but cannot write, Q.

2. Exclusive. If a transaction Ti has obtained an exclusive-mode lock (denoted

by X) on item Q, then Ti can both read and write Q.


We require that every transaction request a lock in an appropriate mode on data item Q, depending on

the types of operations that it will perform on Q.

One protocol that ensures serializability is the two-phase locking protocol.

This protocol requires that each transaction issue lock and unlock requests in two phases:

1. Growing phase. A transaction may obtain locks, but may not release any lock.

2. Shrinking phase. A transaction may release locks, but may not obtain any new locks.

Initially, a transaction is in the growing phase.

The transaction acquires locks as needed.

Once the transaction releases a lock, it enters the shrinking phase, and it can issue no more lock

requests.

For example, transactions T3 and T4 are two phase.

On the other hand, transactions T1 and T2 are not two phase.

Note that the unlock instructions do not need to appear at the end of the transaction.

For example, in the case of transaction T3, we could move the unlock(B) instruction to just after the

lock-X(A) instruction, and still retain the two-phase locking property.

We can show that the two-phase locking protocol ensures conflict serializability.

Consider any transaction. The point in the schedule where the transaction has obtained its final lock (the

end of its growing phase) is called the lock point of the transaction.

Now, transactions can be ordered according to their lock points—this ordering is, in fact, a

serializability ordering for the transactions.

Two-phase locking does not ensure freedom from deadlock.


Q.10.(a). Explain Lock based protocol for concurrency control in database transactions. Consider the

following transactions: 6M

T31: read(A); read(B); if A = 0 then B := B + 1; write(B)

T32: read(B); read(A); if B = 0 then A := A + 1; write(A)

Add lock and unlock instructions to transactions T31 and T32 so that they observe two phase locking

protocol. Can the execution of these transactions result in a deadlock? Explain your answer.

Ans.

Lock and unlock instructions:

T31: lock-S(A)

read(A)

lock-X(B)

read(B)

if A = 0

then B := B + 1

write(B)

unlock(A)

unlock(B)

T32: lock-S(B)

read(B)

lock-X(A)

read(A)

if B = 0

then A := A + 1

write(A)

unlock(B)

unlock(A)

Execution of these transactions can result in deadlock. For example, consider

the following partial schedule:

T31                     T32
lock-S(A)
                        lock-S(B)
                        read(B)
read(A)
lock-X(B)
                        lock-X(A)

The transactions are now deadlocked: T31 waits for T32 to release its lock on B, while T32 waits for T31 to release its lock on A.


Q.10.(b). Write short notes on Time stamp based protocol. 7M

Ans.

Time stamp ordering Protocol

With each transaction Ti in the system, we associate a unique fixed timestamp, denoted by TS(Ti).

This timestamp is assigned by the database system before the transaction Ti starts execution.

If a transaction Ti has been assigned timestamp TS(Ti), and a new transaction Tj enters the system, then

TS(Ti) < TS(Tj ).

There are two simple methods for implementing this scheme:

1. Use the value of the system clock as the timestamp; that is, a transaction's timestamp is equal to the value of the clock when the transaction enters the system.

2. Use a logical counter that is incremented after a new timestamp has been

assigned; that is, a transaction’s timestamp is equal to the value of the counter

when the transaction enters the system.

The timestamps of the transactions determine the serializability order.

Thus, if TS(Ti) < TS(Tj ), then the system must ensure that the produced schedule is equivalent to a

serial schedule in which transaction Ti appears before transaction Tj .

To implement this scheme, we associate with each data item Q two timestamp values:

• W-timestamp(Q) denotes the largest timestamp of any transaction that executed

write(Q) successfully.

• R-timestamp(Q) denotes the largest timestamp of any transaction that executed

read(Q) successfully.

These timestamps are updated whenever a new read(Q) or write(Q) instruction is executed.

The timestamp-ordering protocol ensures serializability among transactions in their conflicting read and write operations. It is the responsibility of the protocol to ensure that every conflicting pair of operations is executed in the order of the transactions' timestamp values.

Time-stamp of Transaction Ti is denoted as TS(Ti).

Read time-stamp of data-item X is denoted by R-timestamp(X).

Write time-stamp of data-item X is denoted by W-timestamp(X).

Timestamp ordering protocol works as follows:

If a transaction Ti issues a read(X) operation:
o If TS(Ti) < W-timestamp(X), the operation is rejected and Ti is rolled back.
o If TS(Ti) >= W-timestamp(X), the operation is executed, and R-timestamp(X) is set to the maximum of R-timestamp(X) and TS(Ti).

If a transaction Ti issues a write(X) operation:
o If TS(Ti) < R-timestamp(X), the operation is rejected and Ti is rolled back.
o If TS(Ti) < W-timestamp(X), the operation is rejected and Ti is rolled back.
o Otherwise, the operation is executed, and W-timestamp(X) is set to TS(Ti).

If a transaction Ti is rolled back by the concurrency-control scheme as a result of issuing either a read or a write operation, the system assigns it a new timestamp and restarts it.
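A minimal sketch of these read and write rules, assuming the R- and W-timestamps are kept in in-memory dictionaries (an illustration, not a complete protocol implementation):

R_ts, W_ts = {}, {}          # data item -> largest read / write timestamp so far

def read(ts_ti, x):
    """Return True if read(x) by a transaction with timestamp ts_ti may execute."""
    if ts_ti < W_ts.get(x, 0):
        return False                         # Ti would read an overwritten value: reject, roll back Ti
    R_ts[x] = max(R_ts.get(x, 0), ts_ti)     # update the read timestamp
    return True

def write(ts_ti, x):
    """Return True if write(x) by a transaction with timestamp ts_ti may execute."""
    if ts_ti < R_ts.get(x, 0) or ts_ti < W_ts.get(x, 0):
        return False                         # the write comes "too late": reject, roll back Ti
    W_ts[x] = ts_ti
    return True

print(read(5, "Q"))    # True,  R-timestamp(Q) becomes 5
print(write(3, "Q"))   # False: a younger reader (ts 5) has already read Q
print(write(7, "Q"))   # True,  W-timestamp(Q) becomes 7
print(read(6, "Q"))    # False: Q was overwritten by a transaction with ts 7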


Q.11.(a). Discuss any one multiversion techniques used for concurrency control. 8M

Ans.

The concurrency-control schemes discussed thus far ensure serializability by either delaying an

operation or aborting the transaction that issued the operation.

For example, a read operation may be delayed because the appropriate value has not been written yet; or

it may be rejected (that is, the issuing transaction must be aborted) because the value that it was

supposed to read has already been overwritten.

These difficulties could be avoided if old copies of each data item were kept in a system.

In multiversion concurrency control schemes, each write(Q) operation creates a new version of Q.

When a transaction issues a read(Q) operation, the concurrency control manager selects one of the

versions of Q to be read.

The concurrency-control scheme must ensure that the version to be read is selected in a manner that

ensures serializability.

It is also crucial, for performance reasons, that a transaction be able to determine easily and quickly

which version of the data item should be read.

1. Multiversion Timestamp Ordering

The most common transaction ordering technique used by multiversion schemes is timestamping.

With each transaction Ti in the system, we associate a unique static timestamp, denoted by TS(Ti).

The database system assigns this timestamp before the transaction starts execution.

With each data item Q, a sequence of versions <Q1, Q2, . . .,Qm> is associated.

Each version Qk contains three data fields:

• Content is the value of version Qk.

• W-timestamp(Qk) is the timestamp of the transaction that created version Qk.

• R-timestamp(Qk) is the largest timestamp of any transaction that successfully

read version Qk.

A transaction—say, Ti—creates a new version Qk of data item Q by issuing a write(Q) operation.

The content field of the version holds the value written by Ti.

The system initializes the W-timestamp and R-timestamp to TS(Ti).

It updates the R-timestamp value of Qk whenever a transaction Tj reads the content of Qk, and R-

timestamp(Qk) < TS(Tj ).

The scheme operates as follows.

Suppose that transaction Ti issues a read(Q) or write(Q) operation. Let Qk denote the version of Q whose write timestamp is the largest write timestamp less than or equal to TS(Ti).

1. If transaction Ti issues a read(Q), then the value returned is the content of version Qk.

2. If transaction Ti issues write(Q), and if TS(Ti) < R-timestamp(Qk), then the system rolls back transaction Ti.

On the other hand, if TS(Ti) = W-timestamp(Qk), the system overwrites the contents of Qk; otherwise it creates

a new version of Q.

The justification for rule 1 is clear.

A transaction reads the most recent version that comes before it in time.

The second rule forces a transaction to abort if it is “too late” in doing a write.

More precisely, if Ti attempts to write a version that some other transaction would have read, then we

cannot allow that write to succeed.

Versions that are no longer needed are removed according to the following rule.

Suppose that there are two versions, Qk and Qj , of a data item, and that both versions have a W-

timestamp less than the timestamp of the oldest transaction in the system.

Then, the older of the two versions Qk and Qj will not be used again, and can be deleted.

The multiversion timestamp-ordering scheme has the desirable property that a read request never fails

and is never made to wait.


In typical database systems, where reading is a more frequent operation than writing, this advantage may be of major practical significance.
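A small sketch of the version-selection, read and write rules above (the Version class and helper names are assumptions made for illustration):

from dataclasses import dataclass

@dataclass
class Version:
    content: object
    w_ts: int        # timestamp of the transaction that created the version
    r_ts: int        # largest timestamp of any transaction that read it

def pick_version(versions, ts_ti):
    """Qk: the version whose write timestamp is the largest one <= TS(Ti)."""
    return max((v for v in versions if v.w_ts <= ts_ti), key=lambda v: v.w_ts)

def mv_read(versions, ts_ti):
    qk = pick_version(versions, ts_ti)
    qk.r_ts = max(qk.r_ts, ts_ti)            # record the read
    return qk.content

def mv_write(versions, ts_ti, value):
    qk = pick_version(versions, ts_ti)
    if ts_ti < qk.r_ts:
        return "rollback"                     # a younger transaction already read Qk
    if ts_ti == qk.w_ts:
        qk.content = value                    # overwrite Ti's own version
    else:
        versions.append(Version(value, ts_ti, ts_ti))   # create a new version
    return "ok"

q = [Version(100, 0, 0)]
print(mv_read(q, 10))         # 100, and the r_ts of the initial version becomes 10
print(mv_write(q, 5, 42))     # "rollback": ts 5 is older than the reader with ts 10
print(mv_write(q, 12, 42))    # "ok": a new version with w_ts = 12 is created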

2. Multiversion Two-Phase Locking
The multiversion two-phase locking protocol attempts to combine the advantages of multiversion

concurrency control with the advantages of two-phase locking.

This protocol differentiates between read-only transactions and update transactions.

Update transactions perform rigorous two-phase locking; that is, they hold all locks up to the end of the

transaction.

Thus, they can be serialized according to their commit order.

Each version of a data item has a single timestamp.

The timestamp in this case is not a real clock-based timestamp, but rather is a counter, which we will

call the ts-counter, that is incremented during commit processing.

Read-only transactions are assigned a timestamp by reading the current value of ts-counter before they

start execution; they follow the multiversion timestamp-ordering protocol for performing reads.

Thus, when a read-only transaction Ti issues a read(Q), the value returned is the contents of the version

whose timestamp is the largest timestamp less than TS(Ti).

When an update transaction reads an item, it gets a shared lock on the item, and reads the latest version

of that item.

When an update transaction wants to write an item, it first gets an exclusive lock on the item, and then

creates a new version of the data item.

The write is performed on the new version, and the timestamp of the new version is initially set to a

value ∞, a value greater than that of any possible timestamp.

When the update transaction Ti completes its actions, it carries out commit processing:

First, Ti sets the timestamp on every version it has created to 1 more than the value of ts-counter; then,

Ti increments ts-counter by 1.

Only one update transaction is allowed to perform commit processing at a time.

As a result, read-only transactions that start after Ti increments ts-counter will see the values updated by

Ti, whereas those that start before Ti increments ts-counter will see the value before the updates by Ti.

In either case, read-only transactions never need to wait for locks.

Multiversion two-phase locking also ensures that schedules are recoverable and cascadeless.

Q.11.(b).Explain various types of failures that occur in the system and also explain recovery method

used? 6M

Ans.

A computer system, like any other device, is subject to failure from a variety of causes: disk crash,

power outage, software error, a fire in the machine room.

In any failure, information may be lost.

Therefore, the database system must take actions in advance to ensure that the atomicity and durability

properties of transactions are preserved.

An integral part of a database system is a recovery scheme that can restore the database to the

consistent state that existed before the failure.

The recovery scheme must also provide high availability; that is, it must minimize the time for which

the database is not usable after a crash.

Failure Classification

There are various types of failure that may occur in a system, each of which needs to be dealt with in a

different manner.

The simplest type of failure is one that does not result in the loss of information in the system.

The failures that are more difficult to deal with are those that result in loss of information.


The following types of failure can occur:

A. Transaction failure.

There are two types of errors that may cause a transaction to fail:

1. Logical error. The transaction can no longer continue with its normal execution because of some internal

condition, such as bad input, data not found, overflow, or resource limit exceeded.

2. System error. The system has entered an undesirable state (for example, deadlock), as a result of which a

transaction cannot continue with its normal execution.

The transaction, however, can be re-executed at a later time.

B. System crash.

There is a hardware malfunction, or a bug in the database software or the operating system, that causes

the loss of the content of volatile storage, and brings transaction processing to a halt.

The content of non-volatile storage remains intact, and is not corrupted.

The assumption that hardware errors and bugs in the software bring the system to a halt, but do not

corrupt the nonvolatile storage contents, is known as the fail-stop assumption.

Well-designed systems have numerous internal checks, at the hardware and the software level, that

bring the system to a halt when there is an error.

Hence, the fail-stop assumption is a reasonable one.

C. Disk failure.

A disk block loses its content as a result of either a head crash or failure during a data transfer operation.

Copies of the data on other disks, or archival backups on tertiary media, such as tapes, are used to

recover from the failure.

To determine how the system should recover from failures, we need to identify the failure modes of

those devices used for storing data.

Next, we must consider how these failure modes affect the contents of the database.

We can then propose algorithms to ensure database consistency and transaction atomicity despite

failures.

These algorithms, known as recovery algorithms, have two parts:

1. Actions taken during normal transaction processing to ensure that enough

information exists to allow recovery from failures.

2. Actions taken after a failure to recover the database contents to a state that ensures

database consistency, transaction atomicity, and durability.

Q.12.(a). What is deadlock in DBMS? Explain with an example. What are the deadlock prevention strategies?

7M

Ans.

A system is in a deadlock state if there exists a set of transactions such that every transaction in the set is

waiting for another transaction in the set.

More precisely, there exists a set of waiting transactions {T0, T1, . . ., Tn} such that T0 is waiting for a

data item that T1 holds, and T1 is waiting for a data item that T2 holds, and . . ., and Tn−1 is waiting for

a data item that Tn holds, and Tn is waiting for a data item that T0 holds.

None of the transactions can make progress in such a situation.

The only remedy to this undesirable situation is for the system to invoke some drastic action, such as

rolling back some of the transactions involved in the deadlock.

Rollback of a transaction may be partial: That is, a transaction may be rolled back to the point where it

obtained a lock whose release resolves the deadlock.

There are two principal methods for dealing with the deadlock problem.

We can use a deadlock prevention protocol to ensure that the system will never enter a deadlock state.


Alternatively, we can allow the system to enter a deadlock state, and then try to recover by using a

deadlock detection and deadlock recovery scheme.

These methods may result in transaction rollback.

Prevention is commonly used if the probability that the system would enter a deadlock state is relatively

high; otherwise, detection and recovery are more efficient.

If a system does not employ some protocol that ensures deadlock freedom, then a detection and

recovery scheme must be used.

An algorithm that examines the state of the system is invoked periodically to determine whether a

deadlock has occurred.

If one has, then the system must attempt to recover from the deadlock.

To do so, the system must:

• Maintain information about the current allocation of data items to transactions,

as well as any outstanding data item requests.

• Provide an algorithm that uses this information to determine whether the system

has entered a deadlock state.

• Recover from the deadlock when the detection algorithm determines that a

deadlock exists.
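A minimal sketch of such a detection algorithm, representing the outstanding requests as a wait-for graph and looking for a cycle (the graph representation is an assumption made for illustration):

def has_deadlock(wait_for):
    """wait_for maps each transaction to the set of transactions it is waiting for.
    A deadlock exists iff this wait-for graph contains a cycle."""
    def on_cycle(txn, path):
        if txn in path:
            return True
        return any(on_cycle(nxt, path | {txn}) for nxt in wait_for.get(txn, ()))
    return any(on_cycle(t, set()) for t in wait_for)

# T0 waits for T1, T1 waits for T2, T2 waits for T0: a cycle, hence a deadlock.
print(has_deadlock({"T0": {"T1"}, "T1": {"T2"}, "T2": {"T0"}}))   # True
print(has_deadlock({"T0": {"T1"}, "T1": {"T2"}}))                 # False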

Deadlock Prevention

There are two approaches to deadlock prevention.

One approach ensures that no cyclic waits can occur by ordering the requests for locks, or requiring all

locks to be acquired together.

The other approach is closer to deadlock recovery, and performs transaction rollback instead of waiting

for a lock, whenever the wait could potentially result in a deadlock.

The simplest scheme under the first approach requires that each transaction locks all its data items

before it begins execution.

Moreover, either all are locked in one step or none are locked.

There are two main disadvantages to this protocol:

(1) it is often hard to predict, before the transaction begins, what data items need to be locked;

(2) data-item utilization may be very low, since many of the data items may be locked

but unused for a long time.

Another approach for preventing deadlocks is to impose an ordering of all data items, and to require that

a transaction lock data items only in a sequence consistent with the ordering.

A variation of this approach is to use a total order of data items, in conjunction with two-phase locking.

Once a transaction has locked a particular item, it cannot request locks on items that precede that item in

the ordering.

This scheme is easy to implement, as long as the set of data items accessed by a transaction is known

when the transaction starts execution.

There is no need to change the underlying concurrency-control system if two-phase locking is used: all that is needed is to ensure that locks are requested in the right order.

The second approach for preventing deadlocks is to use preemption and transaction rollbacks.

In preemption, when a transaction T2 requests a lock that transaction T1 holds, the lock granted to T1

may be preempted by rolling back of T1, and granting of the lock to T2.

To control the preemption, we assign a unique timestamp to each transaction.

The system uses these timestamps only to decide whether a transaction should wait or roll back.

Locking is still used for concurrency control.

If a transaction is rolled back, it retains its old timestamp when restarted.

Two different deadlock prevention schemes using timestamps have been proposed:

1. The wait–die scheme is a nonpreemptive technique.

When transaction Ti requests a data item currently held by Tj , Ti is allowed to wait only if it has

a timestamp smaller than that of Tj (that is, Ti is older than Tj ).

Otherwise, Ti is rolled back (dies).


For example, suppose that transactions T22, T23, and T24 have timestamps 5, 10, and 15,

respectively. If T22 requests a data item held by T23, then T22 will wait.

If T24 requests a data item held by T23, then T24 will be rolled back.

2. The wound–wait scheme is a preemptive technique.

It is a counterpart to the wait–die scheme.

When transaction Ti requests a data item currently held by Tj , Ti is allowed to wait only if it has

a timestamp larger than that of Tj (that is, Ti is younger than Tj ). Otherwise, Tj is rolled back (Tj

is wounded by Ti).

Returning to our example, with transactions T22, T23, and T24, if T22 requests a data item held

by T23, then the data item will be preempted from T23, and T23 will be rolled back.

If T24 requests a data item held by T23, then T24 will wait.

Whenever the system rolls back transactions, it is important to ensure that there is no starvation—

that is, no transaction gets rolled back repeatedly and is never allowed to make progress.

Both the wound–wait and the wait–die schemes avoid starvation: At any time, there is a transaction

with the smallest timestamp.

This transaction cannot be required to roll back in either scheme.

Since timestamps always increase, and since transactions are not assigned new timestamps when they are rolled back, a transaction that is rolled back repeatedly will eventually have the smallest timestamp, at which point it will not be rolled back again.
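Both decisions can be sketched as a single function (an illustration only; TS is taken to be a number, with smaller meaning older):

def on_lock_conflict(ts_requester, ts_holder, scheme):
    """Transaction Ti (ts_requester) requests an item held by Tj (ts_holder)."""
    if scheme == "wait-die":                      # nonpreemptive
        return "wait" if ts_requester < ts_holder else "roll back requester (dies)"
    if scheme == "wound-wait":                    # preemptive
        return "roll back holder (wounded)" if ts_requester < ts_holder else "wait"
    raise ValueError("unknown scheme")

# T22, T23, T24 have timestamps 5, 10, 15 as in the example above.
print(on_lock_conflict(5, 10, "wait-die"))     # T22 requests from T23 -> "wait"
print(on_lock_conflict(15, 10, "wait-die"))    # T24 requests from T23 -> requester dies
print(on_lock_conflict(5, 10, "wound-wait"))   # T22 requests from T23 -> T23 is wounded
print(on_lock_conflict(15, 10, "wound-wait"))  # T24 requests from T23 -> "wait"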

Q.12.(b). How are buffering and caching techniques used by the recovery systems? 7M

Ans. Buffering technique used by the recovery system:

DBMS application programs require input/output (I/O) operations, which are performed by a

component of operating system.

These I/O operations normally use buffers to match the speed of the processor and the relatively fast

main (or primary) memories with the slower secondary storages and also to minimize the number of I/O

operations between the main and secondary memories wherever possible.

The buffers are the reserved blocks of the main memory.

The assignment and management of memory blocks is called buffer management, and the component of the operating system that performs this task is called the buffer manager.

The buffer manager is responsible for the efficient management of the database buffers that are used to

transfer (flushing) pages between buffer and secondary storage.

It ensures that as many data requests made by programs as possible are satisfied from data copied

(flushed) from secondary storage into the buffers.

The buffer manager takes care of reading of pages from the disk (secondary storage) into the buffers

(physical memory) until the buffers become full and then using a replacement strategy to decide which

buffer(s) to force-write to disk to make space for new pages that need to be read from disk.

Some of the replacement strategies used by the buffer manager are (a) first-in-first-out (FIFO) and (b)

least recently used (LRU).

A computer system uses buffers that are in effect virtual memory buffers.

Thus, a mapping is required between a virtual memory buffer and the physical memory.

The physical memory is managed by the memory management component of operating system of

computer system.


In a virtual memory management, the buffers containing pages of the database undergoing modification

by a transaction could be written out to secondary storage.

The timing of this premature writing of a buffer is decided by the memory management component of

the operating system and is independent of the state of the transaction.

To decrease the number of buffer faults, the least recently used (LRU) algorithm is used for buffer

replacement.
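A minimal sketch of LRU replacement, assuming a fixed-size pool and a caller-supplied page-read function (the names and the capacity are illustrative assumptions):

from collections import OrderedDict

class BufferPool:
    """Tiny LRU buffer pool: keeps at most `capacity` pages in memory."""
    def __init__(self, capacity, read_page_from_disk):
        self.capacity = capacity
        self.read_page_from_disk = read_page_from_disk
        self.frames = OrderedDict()               # page_id -> page contents, in LRU order

    def get(self, page_id):
        if page_id in self.frames:
            self.frames.move_to_end(page_id)      # mark as most recently used
            return self.frames[page_id]
        if len(self.frames) >= self.capacity:
            # Buffer full: evict the least recently used page
            # (a real buffer manager would force-write it first if it is dirty).
            self.frames.popitem(last=False)
        page = self.read_page_from_disk(page_id)
        self.frames[page_id] = page
        return page

pool = BufferPool(2, read_page_from_disk=lambda pid: f"contents of {pid}")
pool.get("P1"); pool.get("P2"); pool.get("P1")
pool.get("P3")                                   # evicts P2, the least recently used page
print(list(pool.frames))                         # ['P1', 'P3']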

Caching Techniques used in Recovery system:

Whenever a transaction needs to update the database, the disk pages (or disk blocks) containing the data

items to be modified are first cached (buffered) by the cache manager into the main memory and then

modified in the memory before being written back to the disk.

A cache directory is maintained to keep track of all the data items present in the buffers.

When an operation needs to be performed on a data item, the cache directory is first searched to

determine whether the disk page containing the data item resides in the cache.

If it is not present in the cache, the data item is searched on the disk and the appropriate disk page is

copied in the cache.

Sometimes it may be necessary to replace some of the disk pages to create space for the new pages.

Any page-replacement strategy such as least recently used (LRU) or first-in-first-out (FIFO) can be used

for replacing the disk page.

Each memory buffer has a free bit associated with it which indicates whether the buffer is free

(available for allocation) or not.

Other associated bits are dirty bit and pin/unpin bit.

For the purpose of efficient recovery, the caching of disk pages is handled by the DBMS instead of the

OS.

Typically, a collection of in-memory buffers, called the DBMS cache, is kept under the control of the DBMS.

A directory for the cache is used to keep track of which DB items are in the buffers.

It is in the form of a table of <disk page address, buffer location> entries.

The DBMS cache holds the database disk blocks including

• Data blocks

• Index blocks

• Log blocks

When DBMS requests action on some item, it first checks the cache directory to determine if the

corresponding disk page is in the cache.

If no, the item must be located on disk and the appropriate disk pages are copied into the cache.

It may be necessary to replace (flush) some of the cache buffers to make space available for the new

item.

Dirty bit.

– Associated with each buffer in the cache is a dirty bit.

– The dirty bit can be included in the directory entry.

– It indicates whether or not the buffer has been modified.

• Set dirty bit=0 when the page is first read from disk to the buffer cache.

• Set dirty bit=1 as soon as the corresponding buffer is modified.

– When the buffer content is replaced (flushed) from the cache, write it back to the corresponding disk page only if dirty bit = 1.

Pin-unpin bit.

– A page is pinned (i.e., pin-unpin bit value = 1) if it cannot be written back to disk as yet.

• Strategies that can be used when flushing occurs.

– In-place updating

• Writes the buffer back to the same original disk location (overwriting the old value on disk).


– Shadowing

• Writes the updated buffer at a different disk location.

– Multiple versions of data items can be maintained.

– The old value is called the BFIM (before image).
– The new value is called the AFIM (after image).
• The new value and the old value are kept on disk, so there is no need of a log for recovery.
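The role of the dirty and pin/unpin bits during flushing can be pictured with a small sketch (the buffer layout and function names are assumptions for illustration):

def flush(buffers, write_page_to_disk):
    """Flush unpinned buffers: write back only those whose dirty bit is set."""
    for page_id, buf in list(buffers.items()):
        if buf["pinned"]:
            continue                              # pinned pages cannot be written back yet
        if buf["dirty"]:
            write_page_to_disk(page_id, buf["data"])   # in-place update of the disk page
        del buffers[page_id]                      # the frame is now free for reallocation

buffers = {
    "P1": {"data": "...", "dirty": True,  "pinned": False},   # modified: must be written back
    "P2": {"data": "...", "dirty": False, "pinned": False},   # clean: can simply be dropped
    "P3": {"data": "...", "dirty": True,  "pinned": True},    # pinned: must stay in memory
}
flush(buffers, lambda pid, data: print("writing", pid, "back to disk"))
print(list(buffers))    # ['P3']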

Q.13.(a). Explain in brief about log based recovery. 6M

Ans. Log-Based Recovery

The most widely used structure for recording database modifications is the log.

The log is a sequence of log records, recording all the update activities in the database.

There are several types of log records.

An update log record describes a single database write.

It has these fields:

• Transaction identifier is the unique identifier of the transaction that performed

the write operation.

• Data-item identifier is the unique identifier of the data item written. Typically,

it is the location on disk of the data item.

• Old value is the value of the data item prior to the write.

• New value is the value that the data item will have after the write.

Other special log records exist to record significant events during transaction processing, such as the

start of a transaction and the commit or abort of a transaction.

We denote the various types of log records as:

• <Ti start>. Transaction Ti has started.

• <Ti, Xj, V1, V2>. Transaction Ti has performed a write on data item Xj . Xj

had value V1 before the write, and will have value V2 after the write.

• <Ti commit>. Transaction Ti has committed.

• <Ti abort>. Transaction Ti has aborted.

Whenever a transaction performs a write, it is essential that the log record for that write be created

before the database is modified.

Once a log record exists, we can output the modification to the database if that is desirable.

Also, we have the ability to undo a modification that has already been output to the database.

We undo it by using the old-value field in log records.

For log records to be useful for recovery from system and disk failures, the log must reside in stable

storage.

Every log record is written to the end of the log on stable storage as soon as it is created

Observe that the log contains a complete record of all database activity.

As a result, the volume of data stored in the log may become unreasonably large.

Types of Logs: All log records include the general log attributes above, and also other attributes depending on their type (which is recorded in the Type attribute, as above).

1. Update Log Record notes an update (change) to the database. It includes this extra information:

PageID: A reference to the Page ID of the modified page.

Length and Offset: Length in bytes and offset of the page are usually included.

Before and After Images: Includes the value of the bytes of page before and after the page

change. Some databases may have logs which include one or both images.

2. Compensation Log Record notes the rollback of a particular change to the database. Each corresponds to exactly one Update Log Record (although the corresponding update log record is not typically stored in the Compensation Log Record). It includes this extra information:
a. undoNextLSN: This field contains the LSN of the next log record that is to be undone for the transaction that wrote the last Update Log Record.

3. Commit Record notes a decision to commit a transaction.
4. Abort Record notes a decision to abort and hence roll back a transaction.


5. Checkpoint Record notes that a checkpoint has been made. These are used to speed up recovery. They

record information that eliminates the need to read a long way into the log's past. This varies according

to checkpoint algorithm. If all dirty pages are flushed while creating the checkpoint (as in PostgreSQL),

it might contain:

a. redoLSN: This is a reference to the first log record that corresponds to a dirty page. i.e. the first update

that wasn't flushed at checkpoint time. This is where redo must begin on recovery.

b. undoLSN: This is a reference to the oldest log record of the oldest in-progress transaction. This is the

oldest log record needed to undo all in-progress transactions.

6. Completion Record notes that all work has been done for this particular transaction (it has been fully committed or aborted).

Q.13.(b). Discuss the immediate update recovery technique in both single user and multiuser

environments. What are the advantages and disadvantages of immediate update? 8M

Or

Describe a recovery scheme that works in single user environment if system fails:-

(i). After transaction starts and before the read.

(ii). After the read and before the write.

(iii). After the commit and before all database entries are flushed to disk.

or

Describe a recovery technique that employ the immediate update scheme.

Ans.

Immediate Database Modification

The immediate-modification technique allows database modifications to be output to the database

while the transaction is still in the active state.

Data modifications written by active transactions are called uncommitted modifications.

In the event of a crash or a transaction failure, the system must use the old-value field of the log records

to restore the modified data items to the value they had prior to the start of the transaction.

The undo operation accomplishes this restoration.

Before a transaction Ti starts its execution, the system writes the record <Ti start> to the log.

During its execution, any write(X) operation by Ti is preceded by the writing of the appropriate new

update record to the log.

When Ti partially commits, the system writes the record <Ti commit> to the log.

Since the information in the log is used in reconstructing the state of the database, we cannot allow the

actual update to the database to take place before the corresponding log record is written out to stable

storage.

We therefore require that, before execution of an output(B) operation, the log records corresponding to

B be written onto stable storage.

As an illustration, let us reconsider our simplified banking system, with transactions T0 and T1 executed

one after the other in the order T0 followed by T1.

The portion of the log containing the relevant information concerning these two transactions appears

below:

<T0 start>

<T0 , A, 1000, 950>

<T0 , B, 2000, 2050>

<T0 commit>

<T1 start>

<T1 , C, 700, 600>

<T1 commit>


Portion of the system log corresponding to T0 and T1.

One possible order in which the actual outputs took place in both the database system and the log as a

result of the execution of T0 and T1 is shown below:

Log                          Database
<T0 start>
<T0, A, 1000, 950>
<T0, B, 2000, 2050>
                             A = 950
                             B = 2050
<T0 commit>
<T1 start>
<T1, C, 700, 600>
                             C = 600
<T1 commit>

State of system log and database corresponding to T0 and T1.

Using the log, the system can handle any failure that does not result in the loss of information in

nonvolatile storage.

The recovery scheme uses two recovery procedures:

• undo(Ti) restores the value of all data items updated by transaction Ti to the

old values.

• redo(Ti) sets the value of all data items updated by transaction Ti to the new

values.

The set of data items updated by Ti and their respective old and new values can be found in the log.

The undo and redo operations must be idempotent to guarantee correct behaviour even if a failure

occurs during the recovery process.

After a failure has occurred, the recovery scheme consults the log to determine which transactions need

to be redone, and which need to be undone:

• Transaction Ti needs to be undone if the log contains the record <Ti start>,

but does not contain the record <Ti commit>.

• Transaction Ti needs to be redone if the log contains both the record <Ti start>

and the record <Ti commit>.

As an illustration, return to our banking example, with transaction T0 and T1 executed one after the

other in the order T0 followed by T1.

Suppose that the system crashes before the completion of the transactions.

The state of the logs for each of these cases appears below:

First, let us assume that the crash occurs just after the log record for the step write(B) of transaction T0

has been written to stable storage.

(a) <T0 start>
    <T0, A, 1000, 950>
    <T0, B, 2000, 2050>

(b) <T0 start>
    <T0, A, 1000, 950>
    <T0, B, 2000, 2050>
    <T0 commit>
    <T1 start>
    <T1, C, 700, 600>

(c) <T0 start>
    <T0, A, 1000, 950>
    <T0, B, 2000, 2050>
    <T0 commit>
    <T1 start>
    <T1, C, 700, 600>
    <T1 commit>

The same log, shown at three different times.


When the system comes back up, it finds the record <T0 start> in the log, but no corresponding <T0

commit> record.

Thus, transaction T0 must be undone, so an undo(T0) is performed.

As a result, the values in accounts A and B (on the disk) are restored to $1000 and $2000, respectively.

Next, let us assume that the crash comes just after the log record for the step write(C) of transaction T1

has been written to stable storage.

When the system comes back up, two recovery actions need to be taken.

The operation undo(T1) must be performed, since the record <T1 start> appears in the log, but there is

no record <T1 commit>.

The operation redo(T0) must be performed, since the log contains both the record <T0 start> and the

record <T0 commit>.

At the end of the entire recovery procedure, the values of accounts A, B, and C are $950, $2050, and

$700, respectively.

Note that the undo(T1) operation is performed before the redo(T0).

In this example, the same outcome would result if the order were reversed.

However, the order of doing undo operations first, and then redo operations, is important for the

recovery algorithm .

Finally, let us assume that the crash occurs just after the log record <T1 commit> has been written to

stable storage.

When the system comes back up, both T0 and T1 need to be redone, since the records <T0 start> and

<T0 commit> appear in the log, as do the records <T1 start> and <T1 commit>.

After the system performs the recovery procedures redo(T0) and redo(T1), the values in accounts A, B,

and C are $950, $2050, and $600, respectively.
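The undo/redo decision itself can be sketched directly from these rules (log records are represented here as simple tuples; this is an illustration only):

def classify(log):
    """Return (undo_list, redo_list) after a crash, per the rules above:
    redo Ti if both <Ti start> and <Ti commit> are in the log,
    undo Ti if <Ti start> is in the log but <Ti commit> is not."""
    started = {rec[0] for rec in log if rec[1] == "start"}
    committed = {rec[0] for rec in log if rec[1] == "commit"}
    return sorted(started - committed), sorted(started & committed)

# Crash just after <T1, C, 700, 600> was written to stable storage:
log = [("T0", "start"), ("T0", "write", "A", 1000, 950), ("T0", "write", "B", 2000, 2050),
       ("T0", "commit"), ("T1", "start"), ("T1", "write", "C", 700, 600)]
undo, redo = classify(log)
print(undo, redo)    # ['T1'] ['T0']  -> undo(T1) is performed, then redo(T0)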

Q.14.(a). Describe Write Ahead Logging protocol. 5M

Ans.

In computer science, write-ahead logging (WAL) is a family of techniques for providing atomicity and

durability (two of the ACID properties) in database systems.

In a system using WAL, all modifications are written to a log before they are applied.

Usually both redo and undo information is stored in the log.

Example: Imagine a program that is in the middle of performing some operation when the machine it is

running on loses power.

Upon restart, that program might well need to know whether the operation it was performing

succeeded, half-succeeded, or failed.

If a write-ahead log were used, the program could check this log and compare what it was supposed to

be doing when it unexpectedly lost power to what was actually done.

On the basis of this comparison, the program could decide to undo what it had started, complete what it

had started, or keep things as they are.

WAL allows updates of a database to be done in-place.

Another way to implement atomic updates is with shadow paging, which is not in-place.

The main advantage of doing updates in-place is that it reduces the need to modify indexes and block

lists.

ARIES is a popular algorithm in the WAL family.

File systems typically use a variant of WAL for at least file system metadata called journaling.

The PostgreSQL database system also uses WAL to provide point-in-time recovery and database

replication features. SQLite database also uses WAL.

MongoDB uses write-ahead logging to provide consistency and crash safety.

Apache HBase uses WAL in order to provide recovery after disaster.
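The write-ahead rule itself, "create and force the log record before modifying the database", can be sketched as follows (function names are assumptions; force_log stands in for forcing the log to stable storage):

def wal_write(item, new_value, txn_id, log, database, force_log):
    """Apply one update under write-ahead logging."""
    old_value = database.get(item)
    log.append((txn_id, item, old_value, new_value))   # 1. create the log record first
    force_log(log)                                     # 2. force it to stable storage
    database[item] = new_value                         # 3. only then modify the database

database, log = {"A": 1000}, []
wal_write("A", 950, "T0", log, database, force_log=lambda l: None)  # stand-in for a real fsync
print(log, database)    # [('T0', 'A', 1000, 950)] {'A': 950}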


Q.14.(b). Explain the process of buffer management with a suitable example. 7M

Ans.

DBMS application programs require input/output (I/O) operations, which are performed by a

component of operating system.

These I/O operations normally use buffers to match the speed of the processor and the relatively fast

main (or primary) memories with the slower secondary storages and also to minimize the number of I/O

operations between the main and secondary memories wherever possible.

The buffers are the reserved blocks of the main memory.

The assignment and management of memory blocks is called buffer management, and the component of the operating system that performs this task is called the buffer manager.

The buffer manager is responsible for the efficient management of the database buffers that are used to

transfer (flushing) pages between buffer and secondary storage.

It ensures that as many data requests made by programs as possible are satisfied from data copied

(flushed) from secondary storage into the buffers.

The buffer manager takes care of reading of pages from the disk (secondary storage) into the buffers

(physical memory) until the buffers become full and then using a replacement strategy to decide which

buffer(s) to force-write to disk to make space for new pages that need to be read from disk.

Buffer Manager needs to make a critical choice of which block to keep and which block to discard when

buffer is needed for newly requested blocks.

Then buffer manager uses buffer replacement strategies. Some common strategies are:

Least-Recently-Used (LRU)

First-In-First-Out (FIFO)

The Clock Algorithm (Second Chance)

System Control

A computer system uses buffers that are in effect virtual memory buffers.

Thus, a mapping is required between a virtual memory buffer and the physical memory.

The physical memory is managed by the memory management component of operating system of

computer system.

In a virtual memory management, the buffers containing pages of the database undergoing modification

by a transaction could be written out to secondary storage.

The timing of this premature writing of a buffer is decided by the memory management component of

the operating system and is independent of the state of the transaction.

To decrease the number of buffer faults, the least recently used (LRU) algorithm is used for buffer

replacement.

Irrespective of the approach, there is a problem that the buffer manager has to limit the number of buffers to fit in the available main memory.

When buffer manager controls main memory directly:

If requests exceeds available space then buffer manager has to select a buffer to empty by returning its

contents to disk.

When blocks have not been changed then they are simply erased from main memory. But, when blocks

have been changed then they are written back to its place on disk.

When buffer manager allocates space in virtual memory :

Buffer manager has the option of allocating more buffers, which can actually fit into main memory.

When all these buffers will be in use then there will be thrashing.

It is an operating system problem where many blocks are moved in and out of disk’s swap space.

Therefore, system will end up spending most of time in swapping blocks and getting very little work

done.

Other algorithms are also impacted by the fact that the availability of buffers can vary, and by the buffer-replacement strategy used by the buffer manager.


Q.15. (a). Write short notes on ARIES Recovery method. 6M

Or

Describe three phases of Aries recovery method.

Or

Explain Aries Recovery Algorithm in detail for database.

Ans.

The state of the art in recovery methods is best illustrated by the ARIES recovery method.

The advanced recovery technique is modeled after ARIES, but has been simplified significantly to bring

out key concepts.

In contrast, ARIES uses a number of techniques to reduce the time taken for recovery, and to reduce the

overheads of checkpointing.

In particular, ARIES is able to avoid redoing many logged operations that have already been applied

and to reduce the amount of information logged.

The price paid is greater complexity; the benefits are worth the price.

ARIES recovers from a system crash in three passes.

• Analysis pass: This pass determines which transactions to undo, which pages

were dirty at the time of the crash, and the LSN from which the redo pass

should start.

• Redo pass: This pass starts from a position determined during analysis, and

performs a redo, repeating history, to bring the database to a state it was in

before the crash.

• Undo pass: This pass rolls back all transactions that were incomplete at the

time of crash.

1. Analysis Pass:

• The analysis pass finds the last complete checkpoint log record, and reads in the DirtyPageTable from

this record.

• It then sets RedoLSN to the minimum of the RecLSNs of the pages in the DirtyPageTable.

• If there are no dirty pages, it sets RedoLSN to the LSN of the checkpoint log record.

• The redo pass starts its scan of the log from RedoLSN.

• All the log records earlier than this point have already been applied to the database pages on disk.

• The analysis pass initially sets the list of transactions to be undone, undo-list, to the list of transactions

in the checkpoint log record.

• The analysis pass also reads from the checkpoint log record the LSNs of the last log record for each

transaction in undo-list.

• The analysis pass continues scanning forward from the checkpoint.

• Whenever it finds a log record for a transaction not in the undo-list, it adds the transaction to undo-list.

• Whenever it finds a transaction end log record, it deletes the transaction from undo-list.

• All transactions left in undo-list at the end of analysis have to be rolled back later, in the undo pass.

• The analysis pass also keeps track of the last record of each transaction in undo-list, which is used in the

undo pass.

• The analysis pass also updates DirtyPageTable whenever it finds a log record for an update on a page.

• If the page is not in DirtyPageTable, the analysis pass adds it to DirtyPageTable, and sets the RecLSN

of the page to the LSN of the log record.

2. Redo Pass:

• The redo pass repeats history by replaying every action that is not already reflected in the page on disk.

• The redo pass scans the log forward from RedoLSN.

• Whenever it finds an update log record, it takes this action:

1. If the page is not in DirtyPageTable, or the LSN of the update log record is less

than the RecLSN of the page in DirtyPageTable, then the redo pass skips the

log record.

2. Otherwise the redo pass fetches the page from disk, and if the PageLSN is less

than the LSN of the log record, it redoes the log record.


• Note that if either of the tests is negative, then the effects of the log record have already appeared on the

page.

• If the first test is negative, it is not even necessary to fetch the page from disk.

• Undo Pass and Transaction Rollback: The undo pass is relatively straightforward.

• It performs a backward scan of the log, undoing all transactions in undo-list.

• If a CLR is found, it uses the UndoNextLSN field to skip log records that have already been rolled back.

• Otherwise, it uses the PrevLSN field of the log record to find the next log record to be undone.

• Whenever an update log record is used to perform an undo (whether for transaction rollback during

normal processing, or during the restart undo pass), the undo pass generates a CLR containing the undo

action performed (which must be physiological).

• It sets the UndoNextLSN of the CLR to the PrevLSN value of the update log record.
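
The following is a minimal Python sketch of the three passes, assuming an in-memory log of dict records with lsn/type/txn/page fields, a checkpoint record carrying the DirtyPageTable and the list of active transactions, and an in-memory page table with a PageLSN per page. These structures, and the simplified handling of CLRs, are illustrative assumptions rather than the actual ARIES data layout.

# A minimal sketch of the three ARIES passes, not the real algorithm: the log
# record, checkpoint and page-table layouts below are illustrative assumptions.

def analysis(log, checkpoint):
    # Rebuild DirtyPageTable and undo-list by scanning forward from the checkpoint.
    dirty = dict(checkpoint["dirty_page_table"])       # page -> RecLSN
    undo_list = set(checkpoint["active_txns"])
    for rec in log:
        if rec["lsn"] <= checkpoint["lsn"]:
            continue
        if rec["type"] == "update":
            undo_list.add(rec["txn"])
            dirty.setdefault(rec["page"], rec["lsn"])  # RecLSN = LSN of first update
        elif rec["type"] == "end":
            undo_list.discard(rec["txn"])
    redo_lsn = min(dirty.values(), default=checkpoint["lsn"])
    return dirty, undo_list, redo_lsn

def redo_pass(log, pages, dirty, redo_lsn):
    # Repeat history: reapply every update not already reflected in its page.
    for rec in log:
        if rec["type"] != "update" or rec["lsn"] < redo_lsn:
            continue
        pg = rec["page"]
        if pg not in dirty or rec["lsn"] < dirty[pg]:
            continue                                   # effect is already on disk
        if pages[pg]["page_lsn"] < rec["lsn"]:
            pages[pg]["value"] = rec["new"]            # redo the logged action
            pages[pg]["page_lsn"] = rec["lsn"]

def undo_pass(log, pages, undo_list):
    # Roll back, newest first, the updates of all incomplete transactions.
    for rec in reversed(log):
        if rec["type"] == "update" and rec["txn"] in undo_list:
            pages[rec["page"]]["value"] = rec["old"]
            # a real system would also append a CLR carrying UndoNextLSN here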

Q.15.(b). Explain Transaction Rollback in detail. 5M

Ans.

Consider transaction rollback during normal operation (that is, not during recovery from system failure).

The system scans the log backward and uses log records belonging to the transaction to restore the old

values of data items.

Unlike rollback in normal operation, however, rollback in our advanced recovery scheme writes out

special redo-only log records of the form <Ti, Xj, V>containing the value V being restored to data item

Xj during the rollback.

These log records are sometimes called compensation log records. Such records do not need undo

information, since we will never need to undo such an undo operation.

Whenever the system finds a log record <Ti,Oj , operation-end, U>, it takes special actions:

1. It rolls back the operation by using the undo information U in the log record.

• It logs the updates performed during the rollback of the operation just like updates performed when the

operation was first executed.

• In other words, the system logs physical undo information for the updates performed during rollback,

instead of using compensation log records.

• This is because a crash may occur while a logical undo is in progress, and on recovery the system has to

complete the logical undo; to do so, restart recovery will undo the partial effects of the earlier undo,

using the physical undo information, and then perform the logical undo again.

• At the end of the operation rollback, instead of generating a log record < Ti,Oj , operation-end, U >, the

system generates a log record < Ti,Oj ,operation-abort>.

2. When the backward scan of the log continues, the system skips all log records

of the transaction until it finds the log record <Ti,Oj , operation-begin>.

• After it finds the operation-begin log record, it processes log records of the transaction in the normal

manner again.

• Observe that skipping over physical log records when the operation-end log record is found during

rollback ensures that the old values in the physical log record are not used for rollback, once the

operation completes.

• If the system finds a record < Ti,Oj , operation-abort>, it skips all preceding records until it finds the

record< Ti,Oj , operation-begin>.

• These preceding log records must be skipped to prevent multiple rollback of the same operation, in case

there had been a crash during an earlier rollback, and the transaction had already been partly rolled

back.

• When the transaction Ti has been rolled back, the system adds a record <Ti abort> to the log.

• If failures occur while a logical operation is in progress, the operation-end log record for the operation

will not be found when the transaction is rolled back. However, for every update performed by the

operation, undo information—in the form of the old value in the physical log records—is available in

the log. The physical log records will be used to roll back the incomplete operation.
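
A minimal Python sketch of this backward scan is given below, assuming update records of the form {"txn", "item", "old", "new"}; logical operations (operation-begin / operation-end) are deliberately omitted, since the text above treats them separately.

# Sketch of rolling back one transaction during normal operation.
def rollback(log, db, ti):
    for rec in reversed(log[:]):                    # backward scan over a snapshot
        if rec.get("txn") != ti or rec.get("type") != "update":
            continue
        db[rec["item"]] = rec["old"]                # restore the old value
        # write a redo-only compensation record <Ti, Xj, V>; it needs no undo info
        log.append({"type": "redo-only", "txn": ti,
                    "item": rec["item"], "value": rec["old"]})
    log.append({"type": "abort", "txn": ti})        # finally, <Ti abort>

db  = {"A": 80}
log = [{"type": "update", "txn": "T1", "item": "A", "old": 100, "new": 80}]
rollback(log, db, "T1")                             # db["A"] is restored to 100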


Page 1 of 41 Prof. Parul Bhanarkar

Tulsiramji Gaikwad-Patil College of Engineering & Technology, Nagpur

Department of Information Technology

Model Solution (End- Term Examination) Academic Session: 2014 - 2015

Subject: Database Management System Semester: VI

Q.1. (a). Explain Query processing? Explain various steps in query processing with the help of neat

sketch. 6M

Ans.

Query processing refers to the range of activities involved in extracting data from a database.

The activities include translation of queries in high-level database languages into expressions that can be

used at the physical level of the file system, a variety of query-optimizing transformations, and actual

evaluation of queries.

A given SQL query is translated by the query processor into a low level program called an execution

plan.

An execution plan is a program in a functional language which is called the physical relational algebra,

specialized for internal storage representation in the DBMS.

The physical relational algebra extends the relational algebra with primitives to search through the

internal storage structures of the DBMS.

The steps involved in processing a query are shown in the figure below.

The basic steps are

1. Parsing and translation

2. Optimization

3. Evaluation

Before query processing can begin, the system must translate the query into a usable form.

A language such as SQL is suitable for human use, but is not useful for system’s internal representation

of a query.

A more useful internal representation is one based on the extended relational algebra.

Thus, the first action the system must take in query processing is to translate a given query into its

internal form.

This translation process is similar to the work performed by the parser of a compiler.


In generating the internal form of the query, the parser checks the syntax of the user’s query, verifies

that the relation names appearing in the query are names of the relations in the database, and so on.

The system constructs a parse-tree representation of the query, which it then translates into a relational-

algebra expression.

If the query was expressed in terms of a view, the translation phase also replaces all uses of the view by

the relational-algebra expression that defines the view.

Given a query, there are generally a variety of methods for computing the answer.

For example, we have seen that, in SQL, a query could be expressed in several different ways.

Each SQL query can itself be translated into a relational-algebra expression in one of several ways.

Furthermore, the relational-algebra representation of a query specifies only partially how to evaluate a

query.

There are usually several ways to evaluate relational-algebra expressions.

Q.1.(b). Write short notes on Query evaluation. 6M

Ans.

Given a query, there are generally a variety of methods for computing the answer.

For example, we know that, in SQL, a query could be expressed in several different ways.

Each SQL query can itself be translated into a relational-algebra expression in one of several ways.

The relational-algebra representation of a query specifies only partially how to evaluate a query; there

are usually several ways to evaluate relational-algebra expressions.

Consider the query:

select balance

from account

where balance < 2500

This query can be translated into either of the following relational-algebra expressions:

• σ balance<2500 (Π balance (account))

• Π balance (σ balance<2500 (account))

Further, we can execute each relational-algebra operation by one of several different algorithms.

For example, to implement the preceding selection, we can search every tuple in account to find tuples

with balance less than 2500.

If a B+-tree index is available on the attribute balance, we can use the index instead to locate the tuples.

To specify fully how to evaluate a query, we need not only to provide the relational algebra expression,

but also to annotate it with instructions specifying how to evaluate each operation.

Annotations may state the algorithm to be used for a specific operation, or the particular index or indices

to use.

A relational-algebra operation annotated with instructions on how to evaluate it is called an evaluation

primitive.

A sequence of primitive operations that can be used to evaluate a query is a query execution plan or

query-evaluation plan.

Figure above illustrates an evaluation plan for our example query, in which a particular index is

specified for the selection operation.


The query-execution engine takes a query-evaluation plan, executes that plan, and returns the answers

to the query.

The different evaluation plans for a given query can have different costs.

The system constructs a query-evaluation plan that minimizes the cost of query evaluation.

Once the query plan is chosen, the query is evaluated with that plan, and the result of the query is

output.

The cost of query evaluation can be measured in terms of a number of different resources, including

disk accesses, CPU time to execute a query, and, in a distributed or parallel database system, the cost of

communication.

The response time for a query-evaluation plan (that is, the clock time required to execute the plan),

assuming no other activity is going on on the computer, would account for all these costs.

We use the number of block transfers from disk as a measure of the actual cost.

To simplify our computation of disk-access cost, we assume that all transfers of blocks have the same

cost.

A more accurate measure would therefore estimate

1. The number of seek operations performed

2. The number of blocks read

3. The number of blocks written

and then add up these numbers after multiplying them by the average seek time,

average transfer time for reading a block, and average transfer time for writing a

block, respectively.
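
A small sketch of this more accurate cost measure follows; the timing constants are placeholder values, not measurements.

# Sketch of the cost measure above; the timing constants are illustrative only.
def disk_cost(seeks, blocks_read, blocks_written,
              avg_seek_ms=4.0, read_ms=0.1, write_ms=0.1):
    return seeks * avg_seek_ms + blocks_read * read_ms + blocks_written * write_ms

# e.g. a plan doing 100 seeks, 1500 block reads and 200 block writes:
print(disk_cost(100, 1500, 200))   # estimated time in milliseconds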

Q.2.(a). What is meant by the term heuristic optimization? Discuss the main heuristic that are applied

during query optimization. 8M

Ans.

A drawback of cost-based optimization is the cost of optimization itself.

Although the cost of query processing can be reduced by clever optimizations, cost-based optimization

is still expensive.

Hence, many systems use heuristics to reduce the number of choices that must be made in a cost-based

fashion.

Some systems even choose to use only heuristics, and do not use cost-based optimization at all.

An example of a heuristic rule is the following rule for transforming relational algebra

queries:

• Perform selection operations as early as possible.

• A heuristic optimizer would use this rule without finding out whether the cost is reduced by this

transformation.

• For an example where it can result in an increase in cost, consider an expression

• σθ(r ⊳⊲ s), where the condition θ refers only to attributes in s.

• The selection can certainly be performed before the join.

• However, if r is extremely small compared to s, and if there is an index on the join attributes of s, but no

index on the attributes used by θ, then it is probably a bad idea to perform the selection early.

• Performing the selection early—that is, directly on s—would require doing a scan of all tuples in s.

• It is probably cheaper, in this case, to compute the join by using the index, and then to reject tuples that

fail the selection.

• Heuristic optimization applies the rules to the initial query expression and produces the heuristically

transformed query expressions.

• However, there are cases where performing the selection before the join is not a good idea.


• Assume that r is a small relation, s is very large, s has an index on the join attribute, and there is no index on the attributes of s used in the selection condition; then computing the join using the index and doing the selection afterwards might be better than scanning the whole of s to do the selection first.

• The heuristic rules can be used to convert an initial query expression to an equivalent one.

Transforming Relational Algebra:

• One aspect of optimization occurs at relational algebra level.

• This involves transforming an initial expression (tree) into an equivalent expression (tree) which is more

efficient to execute.

• Two relational algebra expressions are said to be equivalent if the two expressions generate relations over the same set of attributes and containing the same set of tuples, although their attributes may be ordered differently.

• The query tree is a data structure that represents the relational algebra expression in the query

optimization process.

• The leaf nodes in the query tree correspond to the input relations of the query.

• The internal nodes represent the operators in the query.

• When executing the query, the system executes an internal node's operation whenever its operands are available; the internal node is then replaced by the relation obtained from that execution.

• Equivalence Rules for Transforming the Queries.

• There are many rules which can be used to transform relational algebra operations to equivalent ones.

• Some useful rules for query optimization are as under:

• we use the following notation:

1. E1, E2, E3,… : denote relational algebra expressions

2. X, Y, Z : denote set of attributes

3. F, F1, F2, F3 ,… : denote predicates (selection or join conditions)

1. Commutativity of Join, Cartesian Product operations

E1 ⊳⊲F E2 ≡ E2 ⊳⊲F E1

E1 × E2 ≡ E2 × E1

2. Associativity of Join, Cartesian Product operations

(E1 ∗ E2) ∗ E3 ≡ E1 ∗ (E2 ∗ E3)

(E1 × E2) × E3 ≡ E1 × (E2 × E3)

(E1 ⊳⊲F1 E2) ⊳⊲F2 E3 ≡ E1 ⊳⊲F1 (E2 ⊳⊲F2 E3)

The Join operation is associative in the following manner: F1 involves attributes from only E1 and E2, and F2 involves only attributes from E2 and E3.

3. Cascade of Projection

πX1(πX2(...(πXn(E))...))≡πX1(E)

4. Cascade of Selection

σF1∧F2∧...∧Fn(E)≡σF1(σF2(...(σFn(E))...))

5. Commutativity of Selection

σF1(σF2(E))≡σF2(σF1(E))

6. Commuting Selection with Projection

πX(σF(E))≡σF(πX(E))

This rule holds if the selection condition F involves only the attributes in set X.

7. Selection with Cartesian Product and Join

If all the attributes in the selection condition F involve only the attributes of one of the expression say

E1, then the selection and Join can be combined as follows:

σF(E1⊳⊲CE2)≡(σF(E1))⊳⊲CE2


If the selection condition F = F1 AND F2 where F1 involves only attributes of expression E1 and F2

involves only attribute of expression E2 then we have:

σF1∧F2(E1⊳⊲CE2)≡(σF1(E1))⊳⊲C(σF2(E2))

If the selection condition F = F1 AND F2 where F1 involves only attributes of expression E1 and F2

involves attributes from both E1 and E2 then we have:

σF1∧F2(E1⊳⊲CE2)≡σF2((σF1(E1))⊳⊲CE2)

8. Commuting Selection with set operations

The Selection commutes with all three set operations (Union, Intersect, Set Difference) .

σF(E1∪E2)≡(σF(E1))∪(σF(E2))

The same rule applies when Union is replaced by Intersection or Set Difference.

9. Commuting Projection with Union

πX(E1∪E2)≡(πX(E1))∪(πX(E2))

10. Commutativity of set operations: Union and Intersection are commutative but Set Difference is not.

E1 ∪ E2 ≡ E2 ∪ E1

E1 ∩ E2 ≡ E2 ∩ E1

11. Associativity of set operations: Union and Intersection are associative but Set Difference is not.

(E1 ∪ E2) ∪ E3 ≡ E1 ∪ (E2 ∪ E3)

(E1 ∩ E2) ∩ E3 ≡ E1 ∩ (E2 ∩ E3)

12. Converting a Cartesian Product followed by a Selection into a Join.

If the selection condition corresponds to a join condition we can do the convert as follows:

σF(E1×E2)≡E1⊳⊲FE2
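
As a sanity check of rule 7, the sketch below represents two toy relations as lists of dicts (the attribute names and tuples are invented for illustration) and verifies that selecting after the join gives the same tuples as pushing the selection below the join when the predicate mentions only E1's attributes.

# Toy check of rule 7: sigma_F(E1 join E2) == (sigma_F(E1)) join E2 when the
# predicate F mentions only attributes of E1.
def select(rel, pred):
    return [t for t in rel if pred(t)]

def natural_join(r, s, attr):
    return [{**tr, **ts} for tr in r for ts in s if tr[attr] == ts[attr]]

E1 = [{"C": 1, "A": 10}, {"C": 2, "A": 99}]          # r1(A, C)
E2 = [{"C": 1, "D": "x"}, {"C": 2, "D": "y"}]        # r2(C, D)
F  = lambda t: t["A"] < 50                           # mentions only E1's attribute A

lhs = select(natural_join(E1, E2, "C"), F)           # selection after the join
rhs = natural_join(select(E1, F), E2, "C")           # selection pushed below the join
assert lhs == rhs                                    # both plans give the same tuples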

Q.2.(b). Explain the following in detail :- 5M

(i). Cost Estimation

Ans.

(i). Cost Estimation

The optimizer attempts to generate the best execution plan for a SQL statement.

The best execution plan is defined as the plan with the lowest cost among all considered candidate

plans.

The cost computation accounts for factors of query execution such as I/O, CPU, and communication.

The best method of execution depends on myriad conditions including how the query is written, the size

of the data set, the layout of the data, and which access structures exist.

The optimizer determines the best plan for a SQL statement by examining multiple access methods,

such as full table scan or index scans, and different join methods such as nested loops and hash joins.

Because the database has many internal statistics and tools at its disposal, the optimizer is usually in a

better position than the user to determine the best method of statement execution.

For this reason, all SQL statements use the optimizer.

Consider a user who queries records for employees who are managers.

If the database statistics indicate that 80% of employees are managers, then the optimizer may decide

that a full table scan is most efficient.

However, if statistics indicate that few employees are managers, then reading an index followed by a

table access by rowid may be more efficient than a full table scan.


Query optimization is the overall process of choosing the most efficient means of executing a SQL

statement.

SQL is a nonprocedural language, so the optimizer is free to merge, reorganize, and process in any

order.

The database optimizes each SQL statement based on statistics collected about the accessed data.

When generating execution plans, the optimizer considers different access paths and join methods.

Factors considered by the optimizer include:

• System resources, which include I/O, CPU, and memory

• Number of rows returned

• Size of the initial data sets

The cost is a number that represents the estimated resource usage for an execution plan.

The optimizer assigns a cost to each possible plan, and then chooses the plan with the lowest cost.

For this reason, the optimizer is sometimes called the cost-based optimizer (CBO) to contrast it with the

legacy rule-based optimizer (RBO).
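
A tiny sketch of the access-path decision described above is given below; the cost formulas and statistics are deliberately crude placeholders, not the optimizer's real cost model.

# Sketch of choosing between a full table scan and an index scan based on
# selectivity statistics. All numbers and formulas here are illustrative only.
def choose_access_path(num_rows, blocks, selectivity, index_height=3):
    full_scan_cost = blocks                              # read every block once
    matching = num_rows * selectivity
    index_cost = index_height + matching                 # index probe + one fetch per row
    return "full table scan" if full_scan_cost <= index_cost else "index scan"

print(choose_access_path(100_000, 2_000, 0.80))   # 80% are managers -> full table scan
print(choose_access_path(100_000, 2_000, 0.01))   # few managers     -> index scan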

Q.3.(a). Explain Materialization with example. 6M

Ans.

It is easiest to understand intuitively how to evaluate an expression by looking at a

pictorial representation of the expression in an operator tree.

• Consider the expression

Πcustomer-name (σbalance<2500 (account) ⊳⊲ customer)

in Figure below:

• If we apply the materialization approach, we start from the lowest-level operations in the expression (at

the bottom of the tree).

• In our example, there is only one such operation; the selection operation on account.

• The inputs to the lowest-level operations are relations in the database.

• We execute these operations using the appropriate algorithms, and we store the results in temporary relations.

• We can use these temporary relations to execute the operations at the next level up in the tree, where the

inputs now are either temporary relations or relations stored in the database.

• In our example, the inputs to the join are the customer relation and the temporary relation created by the

selection on account.

• The join can now be evaluated, creating another temporary relation.

• By repeating the process, we will eventually evaluate the operation at the root of the tree, giving the

final result of the expression.

• In our example, we get the final result by executing the projection operation at the root of the tree, using

as input the temporary relation created by the join.

• Evaluation as just described is called materialized evaluation, since the results of

each intermediate operation are created (materialized) and then are used for evaluation of the next-level

operations.


• The cost of a materialized evaluation is not simply the sum of the costs of the operations involved.

• When we computed the cost estimates of algorithms, we ignored the cost of writing the result of the

operation to disk.

• To compute the cost of evaluating an expression as done here, we have to add the costs of all the

operations, as well as the cost of writing the intermediate results to disk.

• We assume that the records of the result accumulate in a buffer, and, when the buffer is full, they are

written to disk.

• The cost of writing out the result can be estimated as nr/fr, where nr is the estimated number of tuples in

the result relation r, and fr is the blocking factor of the result relation, that is, the number of records of r

that will fit in a block.

• Double buffering (using two buffers, with one continuing execution of the algorithm while the other is

being written out) allows the algorithm to execute more quickly by performing CPU activity in parallel

with I/O activity.
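
A sketch of materialised evaluation of this expression in Python follows, with in-memory lists standing in for the temporary relations that would be written to disk; the sample tuples are invented for illustration.

# Sketch of materialised evaluation of
#   Pi_customer-name( sigma_balance<2500(account) JOIN customer )
# Each intermediate result is fully built before the next operation runs.
account  = [{"account_no": "A-101", "customer_name": "Ann", "balance": 500},
            {"account_no": "A-102", "customer_name": "Bob", "balance": 9000}]
customer = [{"customer_name": "Ann", "city": "Pune"},
            {"customer_name": "Bob", "city": "Nagpur"}]

temp1 = [t for t in account if t["balance"] < 2500]              # selection -> first temporary relation
temp2 = [{**a, **c} for a in temp1 for c in customer             # join with customer
         if a["customer_name"] == c["customer_name"]]            # -> second temporary relation
result = [{"customer_name": t["customer_name"]} for t in temp2]  # final projection at the root
print(result)                                                    # [{'customer_name': 'Ann'}]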

Q.3.(b). Explain the pipelining with example. 6M

Ans.

• We can improve query-evaluation efficiency by reducing the number of temporary files that are

produced.

• We achieve this reduction by combining several relational operations into a pipeline of operations, in

which the results of one operation are passed along to the next operation in the pipeline.

• Evaluation as just described is called pipelined evaluation.

• Combining operations into a pipeline eliminates the cost of reading and writing temporary relations.

• For example, consider the expression (Πa1,a2(r ⊳⊲ s)).

• If materialization were applied, evaluation would involve creating a temporary relation to hold the result

of the join, and then reading back in the result to perform the projection.

• These operations can be combined: When the join operation generates a tuple of its result, it passes that

tuple immediately to the project operation for processing.

• By combining the join and the projection, we avoid creating the intermediate result, and instead create

the final result directly.

• We can implement a pipeline by constructing a single, complex operation that combines the operations

that constitute the pipeline.

• Although this approach may be feasible for various frequently occurring situations, it is desirable in

general to reuse the code for individual operations in the construction of a pipeline.

• Therefore, each operation in the pipeline is modeled as a separate process or thread within the system,

which takes a stream of tuples from its pipelined inputs, and generates a stream of tuples for its output.

• For each pair of adjacent operations in the pipeline, the system creates a buffer to hold tuples being

passed from one operation to the next.

• In the example of Figure shown below, all three operations can be placed in a pipeline, which passes the

results of the selection to the join as they are generated. In turn, it passes the results of the join to the

projection as they are generated.


• The memory requirements are low, since results of an operation are not stored for long.

• However, as a result of pipelining, the inputs to the operations are not available all at once for

processing.

• Pipelines can be executed in either of two ways:

1. Demand driven

2. Producer driven

• In a demand-driven pipeline, the system makes repeated requests for tuples from

the operation at the top of the pipeline.

• Each time that an operation receives a request for tuples, it computes the next tuple (or tuples) to be

returned, and then returns that tuple.

• In a producer-driven pipeline, operations do not wait for requests to produce

tuples, but instead generate the tuples eagerly.

• Each operation at the bottom of a pipeline continually generates output tuples, and puts them in its

output buffer, until the buffer is full.
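
For contrast with the materialised sketch earlier, the same selection-join-projection expression can be written as a demand-driven pipeline using Python generators, so that each operator hands tuples to the next as they are produced instead of building temporary relations; the toy relations are again invented.

# Demand-driven pipeline: every operator is a generator that yields tuples on request.
account  = [{"customer_name": "Ann", "balance": 500},
            {"customer_name": "Bob", "balance": 9000}]
customer = [{"customer_name": "Ann", "city": "Pune"},
            {"customer_name": "Bob", "city": "Nagpur"}]

def scan(rel):
    for t in rel:
        yield t

def select_op(child, pred):
    for t in child:
        if pred(t):
            yield t

def join_op(child, inner, attr):
    inner = list(inner)                  # inner relation assumed small enough to buffer
    for tr in child:
        for ts in inner:
            if tr[attr] == ts[attr]:
                yield {**tr, **ts}

def project(child, attrs):
    for t in child:
        yield {a: t[a] for a in attrs}

plan = project(
    join_op(select_op(scan(account), lambda t: t["balance"] < 2500),
            customer, "customer_name"),
    ["customer_name"])
print(list(plan))                        # tuples flow through the pipeline on demand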

Q.4. What is Query Processing? What are the join strategies for the Join operation? Explain in detail. 13 M

Ans. Query processing is the set of activities involved in getting the result of a query expressed in a high-level language.

These activities include parsing the queries and translating them into expressions that can be implemented at the physical level of the file system, optimizing the internal form of the query to obtain a suitable execution strategy, and then actually executing the query to get the results.

The cost of processing a query is dominated by disk access.

For a given query, several possible processing strategies exist, especially when the query is complex.

The difference between a good strategy and a bad one may be several orders of magnitude.

Therefore, it is worthwhile for the system to spend some time selecting a good strategy for processing a query.

There are several join strategies for computing the join of relations, and we analyze their respective

costs.

The cardinality of Join operations can be calculated as under:

Assume the join: R ⊳⊲ S

1. If R, S have no common attributes: nr*ns


2. If R, S have attribute A in common: nr ∗ ns / V(A,s), or nr ∗ ns / V(A,r) (take the minimum of the two)

3. If R, S have attribute A in common and:

1. A is a candidate key for R: ≤ ns

2. A is candidate key in R and candidate key in S : ≤ min(nr, ns)

3. A is a key for R, foreign key for S: = ns

Size and plans for join operation

Running example: depositor ⊳⊲ customer

Metadata:

ncustomer = 10,000 ndepositor = 5000

fcustomer = 25 fdepositor = 50

bcustomer= 400 bdepositor= 100

V(cname, depositor) = 2500 (each customer has on average 2 accts)

cname in depositor is foreign key

Nested-loop join:

1. Figure below shows a simple algorithm to compute the theta join, r ⊳⊲θ s, of two

Relations r and s.

2. This algorithm is called the nested-loop join algorithm, since it basically consists

of a pair of nested for loops.

3. Relation r is called the outer relation and relation s the inner relation of the join,

since the loop for r encloses the loop for s.

The algorithm uses the notation tr · ts, where tr and ts are tuples; tr · ts denotes the

tuple constructed by concatenating the attribute values of tuples tr and ts.

for each tuple tr in r do begin

for each tuple ts in s do begin

test pair (tr, ts) to see if they satisfy the join condition θ

if they do, add tr · ts to the result.

end

end

Block nested loop join:



1. If the buffer is too small to hold either relation entirely in memory, saving in block accesses can be

done if we process the relations on a per-block basis, rather than on a per-tuple basis.

2. Figure below shows block nested-loop join, which is a variant of the nested-loop join where every

block of the inner relation is paired with every block of the outer relation.

3. Within each pair of blocks, every tuple in one block is paired with every tuple in

the other block, to generate all pairs of tuples.

4. As before, all pairs of tuples that satisfy the join condition are added to the result.

5. The primary difference in cost between the block nested-loop join and the basic

nested-loop join is that, in the worst case, each block in the inner relation s is read

only once for each block in the outer relation, instead of once for each tuple in the

outer relation.

6. Thus, in the worst case, there will be a total of br * bs + br block accesses, where

br and bs denote the number of blocks containing records of r and s respectively.

7. Clearly, it is more efficient to use the smaller relation as the outer relation, in case neither of the

relations fits in memory.

8. In the best case, there will be br + bs block accesses.

for each block Br of r do begin

for each block Bs of s do begin

for each tuple tr in Br do begin

for each tuple ts in Bs do begin

test pair (tr, ts) to see if they satisfy the join condition

if they do, add tr · ts to the result.

end end

end end

Cost:

1. Worst case estimate: br ∗ bs + br block accesses.

2. Improvements to nested loop and block nested loop algorithms for a buffer with M blocks:

In block nested-loop join, use M − 2 disk blocks as the blocking unit for the outer relation, where M = memory size in blocks; use the remaining two blocks to buffer the inner relation and the output.

Cost = ⌈br / (M − 2)⌉ ∗ bs + br block accesses.

If equi-join attribute forms a key on inner relation, stop inner loop on first match

Scan the inner loop forward and backward alternately, to make use of the blocks remaining in the buffer.
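
A Python sketch of the block-at-a-time pairing described above follows; the block size and toy relations are illustrative assumptions, and a real system would of course operate on disk blocks rather than list slices.

# Sketch of block nested-loop join: every block of the inner relation is paired
# with every block of the outer relation, and tuples are matched within each pair.
def blocks(rel, block_size):
    for i in range(0, len(rel), block_size):
        yield rel[i:i + block_size]

def block_nested_loop_join(r, s, theta, block_size=2):
    result = []
    for Br in blocks(r, block_size):          # outer relation, one block at a time
        for Bs in blocks(s, block_size):      # each inner block read once per outer block
            for tr in Br:
                for ts in Bs:
                    if theta(tr, ts):
                        result.append({**tr, **ts})
    return result

r = [{"C": i} for i in range(4)]
s = [{"C": i, "D": i * 10} for i in range(4)]
print(block_nested_loop_join(r, s, lambda tr, ts: tr["C"] == ts["C"]))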

Indexed-Nested loop join:

1. In a nested-loop join , if an index is available on the inner loop’s join attribute,

index lookups can replace file scans.

2. For each tuple tr in the outer relation r, the index is used to look up tuples in s that

will satisfy the join condition with tuple tr.


3. This join method is called an indexed nested-loop join; it can be used with existing indices, as well

as with temporary indices created for the sole purpose of evaluating the join.

4. Looking up tuples in s that will satisfy the join conditions with a given tuple tr is

essentially a selection on s.

5. The cost of an indexed nested-loop join can be computed as follows.

6. For each tuple in the outer relation r, a lookup is performed on the index for s, and the relevant tuples

are retrieved.

7. In the worst case, there is space in the buffer for only one page of r and one page of the index.

8. Then, br disk accesses are needed to read relation r, where br denotes the number of blocks

containing records of r.

9. For each tuple in r, we perform an index lookup on s.

10. Then, the cost of the join can be computed as br + nr ∗ c, where nr is the number of records in

relation r, and c is the cost of a single selection on s using the join condition.

• For each tuple tR in the outer relation R, use the index to look up tuples in S that satisfy the join

condition with tuple tR.

• Worst case: buffer has space for only one page of R, and, for each tuple in R, we perform an index

lookup on s.

• Cost of the join: br + nr ∗ c

1. Where c is the cost of traversing the index and fetching all matching s tuples for one tuple from r

2. c can be estimated as cost of a single selection on s using the join condition.

If indices are available on join attributes of both R and S,

use the relation with fewer tuples as the outer relation.

Merge Join:

1. The merge join algorithm (also called the sort–merge join algorithm) can be used

to compute natural joins and equi-joins.

2. Let r(R) and s(S) be the relations whose natural join is to be computed, and let R∩S

denote their common attributes.

3. Suppose that both relations are sorted on the attributes R∩S.

4. Then, their join can be computed by a process much like the merge stage in the merge–sort

algorithm.

5. The merge join algorithm requires that the set Ss of all tuples with the same value for the join

attributes must fit in main memory.


1. Each block needs to be read only once (assuming all tuples for any given value of the join attributes

fit in memory)

2. Thus the number of block accesses for merge-join is

bR + bS

3. But what if one or both of R, S are not sorted on A?

It may be worth sorting first and then performing the merge join (sort-merge join).

Cost: bR + bS + sortR + sortS
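
A Python sketch of the merge stage follows, assuming both inputs are already sorted on the join attribute and buffering the group of s-tuples that share one join value (the set Ss), as the algorithm requires; the data is invented.

# Sketch of merge join on relations already sorted on join attribute A.
def merge_join(r, s, attr):
    result, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        if r[i][attr] < s[j][attr]:
            i += 1
        elif r[i][attr] > s[j][attr]:
            j += 1
        else:
            key = r[i][attr]
            group = []                           # Ss: all s-tuples with this key
            while j < len(s) and s[j][attr] == key:
                group.append(s[j]); j += 1
            while i < len(r) and r[i][attr] == key:
                for ts in group:
                    result.append({**r[i], **ts})
                i += 1
    return result

r = [{"A": 1, "x": "p"}, {"A": 2, "x": "q"}, {"A": 2, "x": "r"}]
s = [{"A": 2, "y": 10}, {"A": 3, "y": 20}]
print(merge_join(r, s, "A"))     # two result tuples, both with A = 2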

Hash join:

1.Like the merge join algorithm, the hash join algorithm can be used to implement

natural joins and equi-joins.

2. In the hash join algorithm, a hash function h is used to partition tuples of both relations.

3. The basic idea is to partition the tuples of each of the relations into sets that have the same hash value

on the join attributes.

4. We assume that

• h is a hash function mapping JoinAttrs values to {0, 1, . . . , nh}, where JoinAttrs

denotes the common attributes of r and s used in the natural join.

• Hr0 , Hr1, . . .,Hrnh denote partitions of r tuples, each initially empty.

Each tuple tr ∈ r is put in partition Hri, where i = h(tr[JoinAttrs]).

• Hs0 ,Hs1 , ...,Hsnh denote partitions of s tuples, each initially empty.

Each tuple ts ∈ s is put in partition Hsi, where i = h(ts[JoinAttrs]).

5. The hash function h should have the "goodness" properties of randomness and

uniformity.

6. The idea behind the hash join algorithm is this: Suppose that an r tuple and an

s tuple satisfy the join condition; then, they will have the same value for the join

attributes.

7. If that value is hashed to some value i, the r tuple has to be in Hri and the

s tuple in Hsi .

8. Therefore, r tuples in Hri need only to be compared with s tuples in

Hsi ; they do not need to be compared with s tuples in any other partition.
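
A Python sketch of the partitioning idea follows: tuples of r and s are hashed on the join attribute into matching partitions, and only tuples in the same partition pair are compared. The number of partitions and the toy data are assumptions; a real hash join would also build an in-memory hash index on each build partition.

# Sketch of hash join via partitioning on the join attribute.
def hash_join(r, s, attr, nh=3):
    h = lambda v: hash(v) % nh
    Hr = [[] for _ in range(nh)]
    Hs = [[] for _ in range(nh)]
    for tr in r:
        Hr[h(tr[attr])].append(tr)           # partition r
    for ts in s:
        Hs[h(ts[attr])].append(ts)           # partition s
    result = []
    for i in range(nh):                      # compare partition i of r only with Hs[i]
        for tr in Hr[i]:
            for ts in Hs[i]:
                if tr[attr] == ts[attr]:
                    result.append({**tr, **ts})
    return result

r = [{"C": 1, "A": "a"}, {"C": 2, "A": "b"}]
s = [{"C": 2, "D": 10}, {"C": 3, "D": 20}]
print(hash_join(r, s, "C"))                  # [{'C': 2, 'A': 'b', 'D': 10}]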


Q5.(a) Let relations r1(A,B,C) and r2(C,D,E) have the following properties: 7M

r1 has 20000 tuples

r2 has 45000 tuples

25 tuples of r1 fit on one block and 30 tuples of r2 fit on one block.

Estimate the number of block accesses required using each of the following join strategies for r1 and r2.

1. Nested Loop Join

2. Block Nested Join

3. Merge Join

4. Hash Join

Ans.

r1 needs 800 blocks, and r2 needs 1500 blocks.

Let us assume M pages of memory.

If M > 800, the join can easily be done in 1500 + 800 disk accesses, using even plain nested-loop join.

So we consider only the case where M ≤ 800 pages.

a. Nested-loop join:

Using r1 as the outer relation, we need 20000 ∗ 1500 + 800 = 30,000,800 disk accesses; if r2 is the outer relation, we need 45000 ∗ 800 + 1500 = 36,001,500 disk accesses.

b. Block nested-loop join:

If r1 is the outer relation, we need ⌈800 / (M − 2)⌉ ∗ 1500 + 800 disk accesses; if r2 is the outer relation, we need ⌈1500 / (M − 2)⌉ ∗ 800 + 1500 disk accesses.

c. Merge-join:

Assuming that r1 and r2 are not initially sorted on the join key, the total sorting cost, inclusive of writing the output, is Bs = 1500 (2⌈logM−1(1500/M)⌉ + 2) + 800 (2⌈logM−1(800/M)⌉ + 2) disk accesses. Assuming all tuples with the same value for the join attributes fit in memory, the total cost is Bs + 1500 + 800 disk accesses.

d. Hash join:

We assume no overflow occurs. Since r1 is smaller, we use it as the build relation and r2 as the probe

relation. If M > 800/M, i.e. no need for recursive partitioning, then the cost is 3(1500+800) = 6900 disk

accesses, else the cost is 2(1500 + 800)⌈logM−1(800) − 1⌉ + 1500 + 800 disk accesses.
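
These estimates can be reproduced with a short sketch; M = 40 is an arbitrary illustrative buffer size (any M ≤ 800 works), and the formulas are the ones used in the answer above.

# Sketch reproducing the block-access estimates for r1 (800 blocks, 20000 tuples)
# and r2 (1500 blocks, 45000 tuples), given M buffer pages (M = 40 is assumed).
from math import ceil, log

br1, nr1 = 800, 20000
br2, nr2 = 1500, 45000
M = 40

nested_loop     = nr1 * br2 + br1                        # r1 as outer: 30,000,800
block_nested    = ceil(br1 / (M - 2)) * br2 + br1
sort_r1         = br1 * (2 * ceil(log(br1 / M, M - 1)) + 2)
sort_r2         = br2 * (2 * ceil(log(br2 / M, M - 1)) + 2)
merge_join_cost = sort_r1 + sort_r2 + br1 + br2
hash_join_cost  = 3 * (br1 + br2)                        # 6900, no recursive partitioning

print(nested_loop, block_nested, merge_join_cost, hash_join_cost)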

Q.5.(b). Define query optimization. What are the various measures to evaluate the cost of a query?

Ans. 7M

Query optimization is the process of selecting the most efficient query-evaluation plan from among the

many strategies usually possible for processing a given query, especially if the query is complex.

The system constructs a query-evaluation plan that minimizes the cost of query evaluation.

This is where query optimization comes into play.

One aspect of optimization occurs at the relational-algebra level, where the system attempts to find an

expression that is equivalent to the given expression, but more efficient to execute.

Another aspect is selecting a detailed strategy for processing the query, such as choosing the algorithm

to use for executing an operation, choosing the specific indices to use, and so on.


For query optimization, we want to find the "cheapest" execution plan for a query.

A given relational algebra expression may have many equivalent expressions.

Each can be represented as a logical query plan (a tree), where the non-leaf nodes are operations of relational algebra (with parameters) and the leaf nodes are relations.

A relational algebra expression can be evaluated in many ways.

An annotated expression specifying detailed evaluation strategy is called the execution plan (includes,

e.g., whether index is used, join algorithms, . . . )

Among all semantically equivalent expressions, the one with the least costly evaluation plan is chosen.

Cost estimate of a plan is based on statistical information in the system catalogs as given below:

Query optimizers use the statistical information stored in the DBMS catalog to estimate the cost of a plan.

The relevant catalog information about the relation includes:

1. Number of tuples in a relation r; denoted by nr

2. Number of blocks containing tuples of relation r: br

3. Size of a tuple of relation r (assuming records in a file are all of the same type): sr

4. Blocking factor of relation r, which is the number of tuples that fit into one block: fr

5. V(A,r) is the number of distinct values of an attribute A in a relation r. This value is the same as the size of πA(r). If A is a key attribute then V(A,r) = nr

6. SC(A,r) is the selection cardinality of attribute A of relation r. This is the average number of records

that satisfy an equality condition on attribute A.

7. In addition to relation information, some information about indices is also used:

Number of levels in index i.

Number of lowest-level index blocks in index i (number of blocks at the leaf level of the index)

With the statistical information maintained in DBMS catalog and the measures of query cost based on

number of disk accesses, we can estimate the cost for different relational algebra operations

The cost of a query execution plan includes the following components:

Access cost to secondary storage: This is the cost of searching for, reading, writing data blocks of

secondary storage such as disk.

Computation cost: This is the cost of performing in-memory operation on the data buffer during

execution. This can be considered as CPU time to execute a query

Storage cost: This is the cost of storing intermediate files that are generated during execution

Communication cost: This is the cost of transferring the query and its result from site to site (in a

distributed or parallel database system)

Memory usage cost: Number of buffers needed during execution.

In a large database, access cost is usually the most important cost since disk accesses are slow compared

to in-memory operations.

In a small database, when almost all the data resides in memory, the emphasis is on the computation cost.

In a distributed system, the communication cost should be minimized.

It is difficult to include all the cost components in a cost function. Therefore, some cost functions

consider only disk access cost as the reasonable measure of the cost of a query-evaluation plan.
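
A small sketch of how these statistics feed a cost estimate for an equality selection is given below, using the common estimate SC(A, r) = nr / V(A, r); the numbers are illustrative.

# Sketch of estimating an equality selection from catalog statistics.
from math import ceil

nr, fr, V_A = 10000, 25, 200          # tuples, blocking factor, distinct values of A
br = ceil(nr / fr)                    # blocks containing tuples of r

selection_cardinality = nr / V_A                   # SC(A, r): average matching tuples
blocks_touched = ceil(selection_cardinality / fr)  # if matching tuples are stored together
full_scan_cost = br                                # fallback: scan every block

print(selection_cardinality, blocks_touched, full_scan_cost)   # 50.0 2 400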


Q.6.(a). List the properties of a transaction must have. Briefly explain it. 6M

Ans.

A transaction is a logical unit of work that contains one or more SQL statements.

It is a collection of operations that form a single logical unit of work.

A database system must ensure proper execution of transactions despite failures; that is, either the entire transaction executes, or none of it does.

Furthermore, it must manage concurrent execution of transactions in a way that avoids the introduction

of inconsistency.

Ideally, a database System will guarantee the properties of Atomicity, Consistency, Isolation and

Durability (ACID) for each transaction.

The effects of all the SQL statements in a transaction can be either all committed or all rolled back.

To ensure integrity of the data, we require that the database system maintain the following properties of

the transactions:

Atomicity. Either all operations of the transaction are reflected properly in the database, or none are.

Example: A transaction to transfer funds from one account to another involves making a withdrawal

operation from the first account and a deposit operation on the second. If the deposit operation failed,

you don’t want the withdrawal operation to happen either.

Consistency. Execution of a transaction in isolation (that is, with no other transaction executing

concurrently) preserves the consistency of the database.

Example: A database tracking a checking account may only allow unique check numbers to exist for

each transaction.

Isolation. Even though multiple transactions may execute concurrently, the system guarantees that, for

every pair of transactions Ti and Tj , it appears to Ti that either Tj finished execution before Ti started, or

Tj started execution after Ti finished. Thus, each transaction is unaware of other transactions executing

concurrently in the system.

Example: A teller looking up a balance must be isolated from a concurrent transaction involving a

withdrawal from the same account. Only when the withdrawal transaction commits successfully and the

teller looks at the balance again will the new balance be reported.

Durability. After a transaction completes successfully, the changes it has made to the database persist,

even if there are system failures.

Example: A system crash or any other failure must not be allowed to lose the results of a transaction or

the contents of the database. Durability is often achieved through separate transaction logs that can "re-

create" all transactions from some picked point in time (like a backup).

These properties are often called the ACID properties; the acronym is derived from the first letter of

each of the four properties.
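
A minimal sketch of atomicity for the funds-transfer example, using Python's built-in sqlite3 module: either both updates commit, or the rollback undoes the partial work. The table name and amounts are invented for illustration.

# Atomicity sketch: a simulated failure between withdrawal and deposit.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100), ("B", 50)])
conn.commit()

try:
    conn.execute("UPDATE account SET balance = balance - 20 WHERE name = 'A'")  # withdrawal
    simulated_crash = True
    if simulated_crash:
        raise RuntimeError("failure before the matching deposit")
    conn.execute("UPDATE account SET balance = balance + 20 WHERE name = 'B'")  # deposit
    conn.commit()
except RuntimeError:
    conn.rollback()                      # atomicity: the withdrawal is undone as well

print(conn.execute("SELECT name, balance FROM account ORDER BY name").fetchall())
# [('A', 100), ('B', 50)] -- neither operation took effect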

Q.6.(b). Explain the state diagram with neat sketch that a transaction goes through during execution. 7M

Ans.


A transaction in a database can be in one of the above states.

In the absence of failures, all transactions complete successfully.

However, a transaction may not always complete its execution successfully.

Such a transaction is termed aborted.

If we are to ensure the atomicity property, an aborted transaction must have no effect on the state of the

database.

Thus, any changes caused by an aborted transaction must be undone; once this has been done, we say that the transaction has been rolled back.

It is part of the responsibility of the recovery scheme to manage transaction aborts.

A transaction that completes its execution successfully is said to be committed.

A committed transaction that has performed updates transforms the database into a new consistent state,

which must persist even if there is a system failure.

Once a transaction has committed, we cannot undo its effects by aborting it.

The only way to undo the effects of a committed transaction is to execute a compensating transaction.

For instance, if a transaction added $20 to an account, the compensating transaction would subtract $20

from the account.

However, it is not always possible to create such a compensating transaction.

Therefore, the responsibility of writing and executing a compensating transaction is left to the user, and

is not handled by the database system.

A transaction must be in one of the following states:

• Active, the initial state; the transaction stays in this state while it is executing.

• Partially committed, after the final statement has been executed

• Failed, after the discovery that normal execution can no longer proceed

• Aborted, after the transaction has been rolled back and the database has been

restored to its state prior to the start of the transaction.

• Committed, after successful completion.

The state diagram corresponding to a transaction is shown above.

We say that a transaction has committed only if it has entered the committed state.

Similarly, we say that a transaction has aborted only if it has entered the aborted state.

A transaction is said to have terminated if it has either committed or aborted.

A transaction starts in the active state.

When it finishes its final statement, it enters the partially committed state.

At this point, the transaction has completed its execution, but it is still possible that it may have to be

aborted, since the actual output may still be temporarily residing in main memory, and thus a hardware

failure may preclude its successful completion.

The database system then writes out enough information to disk that, even in the event of a failure, the

updates performed by the transaction can be re-created when the system restarts after the failure.

When the last of this information is written out, the transaction enters the committed state.

A transaction enters the failed state after the system determines that the transaction can no longer

proceed with its normal execution (for example, because of hardware or logical errors).

Such a transaction must be rolled back.

Then, it enters the aborted state.

At this point, the system has two options: It can restart the transaction, but only if the transaction was

aborted as a result of some hardware or software error that was not created through the internal logic of

the transaction.

A restarted transaction is considered to be a new transaction.

It can kill the transaction.

It usually does so because of some internal logical error that can be corrected only by rewriting the

application program, or because the input was bad, or because the desired data were not found in the

database.
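
A small sketch of the state diagram as a transition table; the event names are illustrative assumptions, while the five states are the ones listed above.

# Transaction states as a simple transition table (event names are illustrative).
TRANSITIONS = {
    ("active",              "last statement executed"): "partially committed",
    ("active",              "error detected"):          "failed",
    ("partially committed", "log forced to disk"):      "committed",
    ("partially committed", "failure"):                 "failed",
    ("failed",              "rolled back"):             "aborted",
}

def step(state, event):
    return TRANSITIONS.get((state, event), state)

s = "active"
for e in ["last statement executed", "log forced to disk"]:
    s = step(s, e)
print(s)    # committed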


Q.7. Explain schedule, serializability. Also explain what is conflict & view serializability. 13M

Ans.

Serializability:

The database system must control concurrent execution of transactions, to ensure that the database state

remains consistent.

In the fields of databases and transaction processing (transaction management), a schedule describes

execution of transactions running in the system.

Often it is a list of operations (actions) ordered by time, performed by a set of transactions that are

executed together in the system.

If order in time between certain operations is not determined by the system, then a partial order is used.

Examples of such operations are requesting a read operation, reading, writing, aborting, committing,

requesting lock, locking, etc.

Not all transaction operation types should be included in a schedule.

Types of Schedules:

1. Serial Schedule:

The transactions are executed non-interleaved (i.e., a serial schedule is one in which no

transaction starts until a running transaction has ended).

2. Serializable Schedule:

A schedule that is equivalent (in its outcome) to a serial schedule has the serializability property.

Example-In schedule E shown below, the order in which the actions of the transactions are

executed is not the same as in D, but in the end, E gives the same result as D.

3. Conflict-serializable schedules

A schedule is said to be conflict-serializable when the schedule is conflict-equivalent to one or more

serial schedules.

Another definition for conflict-serializability is that a schedule is conflict-serializable if and only if

its precedence graph/serializability graph, when only committed transactions are considered, is acyclic.

Consider a schedule S in which there are two consecutive instructions Ii and Ij, of transactions Ti and Tj

, respectively (i ≠ j).

If Ii and Ij refer to different data items, then we can swap Ii and Ij without affecting the results of any

instruction in the schedule.

However, if Ii and Ij refer to the same data item Q, then the order of the two steps may matter.

Since we are dealing with only read and write instructions, there are four cases that we need to consider:

1. Ii = read(Q), Ij = read(Q). The order of Ii and Ij does not matter, since the

same value of Q is read by Ti and Tj , regardless of the order.

2. Ii = read(Q), Ij = write(Q). If Ii comes before Ij, then Ti does not read the value

of Q that is written by Tj in instruction Ij. If Ij comes before Ii, then Ti reads

the value of Q that is written by Tj. Thus, the order of Ii and Ij matters.

3. Ii = write(Q), Ij = read(Q). The order of Ii and Ij matters for reasons similar

to those of the previous case.


4. Ii = write(Q), Ij = write(Q). Since both instructions are write operations, the

order of these instructions does not affect either Ti or Tj . However, the value

obtained by the next read(Q) instruction of S is affected, since the result of

only the latter of the two write instructions is preserved in the database.

Thus, only in the case where both Ii and Ij are read instructions does the relative order of their execution

not matter.

We say that Ii and Ij conflict if they are operations by different transactions on the same data item, and

at least one of these instructions is a write operation.

To illustrate the concept of conflicting instructions, we consider schedule 1 given below.

Schedule 1

The write(A) instruction of T1 conflicts with the read(A) instruction of T2.

However, the write(A) instruction of T2 does not conflict with the read(B) instruction of T1, because the

two instructions access different data items.

Let Ii and Ij be consecutive instructions of a schedule S.

If Ii and Ij are instructions of different transactions and Ii and Ij do not conflict, then we can swap the

order of Ii and Ij to produce a new schedule S’.

We expect S to be equivalent to S’, since all instructions appear in the same order in both schedules

except for Ii and Ij, whose order does not matter.

Since the write(A) instruction of T2 in schedule 1 does not conflict with the read(B) instruction of T1,

we can swap these instructions to generate an equivalent schedule, schedule 2 shown below.

Schedule 2

Regardless of the initial system state, schedules 1 and 2 both produce the same final system state. We

continue to swap nonconflicting instructions:

• Swap the read(B) instruction of T1 with the read(A) instruction of T2.

• Swap the write(B) instruction of T1 with the write(A) instruction of T2.

• Swap the write(B) instruction of T1 with the read(A) instruction of T2.

The final result of these swaps, schedule 3 of Figure shown below, is a serial schedule.

Thus, we have shown that schedule 1 is equivalent to a serial schedule.

This equivalence implies that, regardless of the initial system state, schedule 1 will produce the

same final state as will some serial schedule.


Schedule 3

If a schedule S can be transformed into a schedule S’ by a series of swaps of nonconflicting

instructions, we say that S and S’ are conflict equivalent.

The concept of conflict equivalence leads to the concept of conflict serializability.

We say that a schedule S is conflict serializable if it is conflict equivalent to a serial schedule.

4. View Serializable Schedule:

View equivalence is less stringent than conflict equivalence but, like conflict equivalence, is based only on the read and write operations of transactions.


Consider two schedules S and S’, where the same set of transactions participates in both schedules.

The schedules S and S’ are said to be view equivalent if three conditions are met:

1. For each data item Q, if transaction Ti reads the initial value of Q in schedule S, then

transaction Ti must, in schedule S’, also read the initial value of Q.

2. For each data item Q, if transaction Ti executes read(Q) in schedule S, and if that

value was produced by a write(Q) operation executed by transaction Tj , then the read(Q) operation of

transaction Ti must, in schedule S’, also read the value of Q that

was produced by the same write(Q) operation of transaction Tj .

3. For each data item Q, the transaction (if any) that performs the final write(Q)

operation in schedule S must perform the final write(Q) operation in schedule S’.

Conditions 1 and 2 ensure that each transaction reads the same values in both schedules and, therefore,

performs the same computation.

Condition 3, coupled with conditions 1 and 2, ensures that both schedules result in the same final system

state.

Consider the following Schedule 1:


Schedule 1

Schedule 2

The schedule 1 is not view equivalent to schedule 2, since, in schedule 1, the value of account A read by

transaction T2 was produced by T1, whereas this case does not hold in schedule 2.

The concept of view equivalence leads to the concept of view serializability.

We say that a schedule S is view serializable if it is view equivalent to a serial schedule.

Every conflict-serializable schedule is also view serializable, but there are view serializable schedules

that are not conflict serializable.

Q.8. (a). Which of the following schedules is conflict serializable? For each serializable schedule,

determine the equivalent serial schedules: 7M

(i). r1 (X); r3 (X); w1 (X); r2 (X); w3 (X);

(ii). r1 (X); r3 (X); w3 (X); w1 (X); r2 (X);

(iii). r3 (X); r2 (X); w3 (X); r1 (X); w1 (X);

(iv). r3 (X); r2 (X); r1 (X); w3 (X); w1 (X);

Ans.

Conflict-serializable schedules:

A schedule is said to be conflict-serializable when the schedule is conflict-equivalent to one or more

serial schedules.

Another definition for conflict-serializability is that a schedule is conflict-serializable if and only if

its precedence graph/serializability graph, when only committed transactions are considered, is acyclic.

Consider a schedule S in which there are two consecutive instructions Ii and Ij, of transactions Ti and Tj, respectively (i ≠ j).

If Ii and Ij refer to different data items, then we can swap Ii and Ij without affecting the results of any

instruction in the schedule.

However, if Ii and Ij refer to the same data item Q, then the order of the two steps may matter.


Since we are dealing with only read and write instructions, there are four cases that we need to consider:

1. Ii = read(Q), Ij = read(Q). The order of Ii and Ij does not matter, since the

same value of Q is read by Ti and Tj , regardless of the order.

2. Ii = read(Q), Ij = write(Q). If Ii comes before Ij, then Ti does not read the value

of Q that is written by Tj in instruction Ij. If Ij comes before Ii, then Ti reads

the value of Q that is written by Tj. Thus, the order of Ii and Ij matters.

3. Ii = write(Q), Ij = read(Q). The order of Ii and Ij matters for reasons similar

to those of the previous case.

4. Ii = write(Q), Ij = write(Q). Since both instructions are write operations, the

order of these instructions does not affect either Ti or Tj . However, the value

obtained by the next read(Q) instruction of S is affected, since the result of

only the latter of the two write instructions is preserved in the database.

Thus, only in the case where both Ii and Ij are read instructions does the relative order of their execution

not matter.

We say that Ii and Ij conflict if they are operations by different transactions on the same data item, and

at least one of these instructions is a write operation.

To illustrate the concept of conflicting instructions, we consider schedule 1 given below.

Schedule 1

The write(A) instruction of T1 conflicts with the read(A) instruction of T2.

However, the write(A) instruction of T2 does not conflict with the read(B) instruction of T1, because the

two instructions access different data items.

Let Ii and Ij be consecutive instructions of a schedule S.

If Ii and Ij are instructions of different transactions and Ii and Ij do not conflict, then we can swap the

order of Ii and Ij to produce a new schedule S’.

We expect S to be equivalent to S’, since all instructions appear in the same order in both schedules

except for Ii and Ij, whose order does not matter.

Since the write(A) instruction of T2 in schedule 1 does not conflict with the read(B) instruction of T1,

we can swap these instructions to generate an equivalent schedule, schedule 2 shown below.

Schedule 2


Regardless of the initial system state, schedules 1 and 2 both produce the same final system state. We

continue to swap nonconflicting instructions:

• Swap the read(B) instruction of T1 with the read(A) instruction of T2.

• Swap the write(B) instruction of T1 with the write(A) instruction of T2.

• Swap the write(B) instruction of T1 with the read(A) instruction of T2.

The final result of these swaps, schedule 3 of Figure shown below, is a serial schedule.

Thus, we have shown that schedule 1 is equivalent to a serial schedule.

This equivalence implies that, regardless of the initial system state, schedule 1 will produce the

same final state as will some serial schedule.

Schedule 3

If a schedule S can be transformed into a schedule S’ by a series of swaps of nonconflicting

instructions, we say that S and S’ are conflict equivalent.

The concept of conflict equivalence leads to the concept of conflict serializability.

We say that a schedule S is conflict serializable if it is conflict equivalent to a serial schedule.

(i). r1 (X); r3 (X); w1 (X); r2 (X); w3 (X);

This schedule is not conflict serializable. r3(X) appears before the conflicting w1(X), so T3 must precede T1, but w1(X) also appears before the conflicting w3(X), so T1 must precede T3. The precedence graph therefore contains the cycle T1 → T3 → T1, and no equivalent serial schedule exists.

(ii). r1 (X); r3 (X); w3 (X); w1 (X); r2 (X);

This schedule is not conflict serializable. r1(X) appears before the conflicting w3(X), giving the edge T1 → T3, and r3(X) appears before the conflicting w1(X), giving the edge T3 → T1, so the precedence graph again contains a cycle.

(iii). r3 (X); r2 (X); w3 (X); r1 (X); w1 (X);

This schedule is conflict serializable. The conflicting pairs give only the edges T2 → T3, T2 → T1 and T3 → T1, so the precedence graph is acyclic, and the nonconflicting instructions can be swapped to obtain the serial order T2, T3, T1. The equivalent serial schedule is:

r2 (X); r3 (X); w3 (X); r1 (X); w1 (X);

(iv). r3 (X); r2 (X); r1 (X); w3 (X); w1 (X);

This schedule is not conflict serializable. r1(X) appears before the conflicting w3(X), giving the edge T1 → T3, while w3(X) appears before the conflicting w1(X), giving the edge T3 → T1, so the precedence graph contains a cycle.
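The same conclusions can be reached mechanically by building the precedence graph and testing it for a cycle. Below is a minimal Python sketch (precedence_graph and has_cycle are illustrative names, not part of the paper); a schedule is written as a list of (transaction, operation, item) tuples, with 'r' and 'w' as the operations.

def precedence_graph(schedule):
    # Add an edge Ti -> Tj whenever an operation of Ti precedes a conflicting
    # operation of Tj (same item, different transactions, at least one write).
    edges = set()
    for i, (ti, op_i, item_i) in enumerate(schedule):
        for tj, op_j, item_j in schedule[i + 1:]:
            if ti != tj and item_i == item_j and "w" in (op_i, op_j):
                edges.add((ti, tj))
    return edges

def has_cycle(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    visited, on_stack = set(), set()
    def visit(node):
        if node in on_stack:
            return True
        if node in visited:
            return False
        visited.add(node)
        on_stack.add(node)
        if any(visit(succ) for succ in graph[node]):
            return True
        on_stack.discard(node)
        return False
    return any(visit(node) for node in graph)

# Schedule (iii): r3(X); r2(X); w3(X); r1(X); w1(X)
s_iii = [("T3", "r", "X"), ("T2", "r", "X"), ("T3", "w", "X"),
         ("T1", "r", "X"), ("T1", "w", "X")]
print(has_cycle(precedence_graph(s_iii)))   # False, so (iii) is conflict serializable

# Schedule (i): r1(X); r3(X); w1(X); r2(X); w3(X)
s_i = [("T1", "r", "X"), ("T3", "r", "X"), ("T1", "w", "X"),
       ("T2", "r", "X"), ("T3", "w", "X")]
print(has_cycle(precedence_graph(s_i)))     # True, so (i) is not conflict serializable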


Q.8.(b). What is log based recovery? What is the information in log records & how can it be used in

recovery? 6M

Ans.

When a system crashes, it may have several transactions being executed and various files opened for them to modify data items.

Transactions are made of various operations, which are atomic in nature.

But according to the ACID properties of a DBMS, the atomicity of a transaction as a whole must be maintained; that is, either all of its operations are executed or none are.

When the DBMS recovers from a crash, it should do the following:

• Check the states of all transactions that were being executed.

• A transaction may have been in the middle of some operation; the DBMS must ensure the atomicity of the transaction in this case.

• Check whether the transaction can be completed now or needs to be rolled back.

• No transaction should be allowed to leave the DBMS in an inconsistent state.

There are two types of techniques that can help the DBMS in recovering as well as maintaining the atomicity of transactions:

• Maintaining the log of each transaction, and writing the log records onto stable storage before actually modifying the database.

• Maintaining shadow paging, where the changes are made on volatile memory and the actual database is updated later.

Log-Based Recovery

1. The most widely used structure for recording database modifications is the log.

2. The log is a sequence of log records, recording all the update activities in the database.

3. There are several types of log records.

4. An update log record describes a single database write.

5. It has these fields:

• Transaction identifier is the unique identifier of the transaction that performed

the write operation.

• Data-item identifier is the unique identifier of the data item written. Typically,

it is the location on disk of the data item.

• Old value is the value of the data item prior to the write.

• New value is the value that the data item will have after the write.

6. Other special log records exist to record significant events during transaction processing, such as the

start of a transaction and the commit or abort of a transaction.

7. We denote the various types of log records as:

• <Ti start>. Transaction Ti has started.

• <Ti, Xj, V1, V2>. Transaction Ti has performed a write on data item Xj . Xj

had value V1 before the write, and will have value V2 after the write.

• <Ti commit>. Transaction Ti has committed.

• <Ti abort>. Transaction Ti has aborted.

Whenever a transaction performs a write, it is essential that the log record for that write be created

before the database is modified.

Once a log record exists, we can output the modification to the database if that is desirable.

Also, we have the ability to undo a modification that has already been output to the database.

We undo it by using the old-value field in log records.

For log records to be useful for recovery from system and disk failures, the log must reside in stable

storage.
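As a small illustration of how the old-value field supports undo, here is a minimal Python sketch (UpdateRecord and undo are illustrative names chosen here, not part of the answer); the log entries use the banking example values that also appear later in this paper.

from collections import namedtuple

# <Ti, Xj, V1, V2>: transaction, data item, old value, new value.
UpdateRecord = namedtuple("UpdateRecord", ["txn", "item", "old_value", "new_value"])

log = [
    ("start", "T0"),
    UpdateRecord("T0", "A", 1000, 950),
    UpdateRecord("T0", "B", 2000, 2050),
    # no <T0 commit> record: T0 was still active when the system failed
]

def undo(txn, log, database):
    # Scan the log backwards and restore the old value of every item written by txn.
    for record in reversed(log):
        if isinstance(record, UpdateRecord) and record.txn == txn:
            database[record.item] = record.old_value

db = {"A": 950, "B": 2050}      # modifications already output to the database
undo("T0", log, db)
print(db)                        # {'A': 1000, 'B': 2000}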


Q.9.(a). Explain lock based protocols. 7M

Ans. Lock based protocols

One way to ensure serializability is to require that data items be accessed in a mutually exclusive

manner; that is, while one transaction is accessing a data item, no other transaction can modify that data

item.

The most common method used to implement this requirement is to allow a transaction to access a data

item only if it is currently holding a lock on that item.

Locks

There are various modes in which a data item may be locked.

1. Shared. If a transaction Ti has obtained a shared-mode lock (denoted by S)

on item Q, then Ti can read, but cannot write, Q.

2. Exclusive. If a transaction Ti has obtained an exclusive-mode lock (denoted

by X) on item Q, then Ti can both read and write Q.

We require that every transaction request a lock in an appropriate mode on data item Q, depending on

the types of operations that it will perform on Q.

The transaction makes the request to the concurrency-control manager.

The transaction can proceed with the operation only after the concurrency-control manager grants the

lock to the transaction.

Given a set of lock modes, we can define a compatibility function on them as follows.

Let A and B represent arbitrary lock modes.

Suppose that a transaction Ti requests a lock of mode A on item Q on which transaction Tj (Ti ≠ Tj) currently holds a lock of mode B.

If transaction Ti can be granted a lock on Q immediately, in spite of the presence of the mode B lock,

then we say mode A is compatible with mode B.

Such a function can be represented conveniently by a matrix.

The compatibility relation between the two modes of locking is given by the matrix comp below:

          S       X
   S      true    false
   X      false   false

An element comp(A, B) of the matrix has the value true if and only if mode A is compatible with mode B.

Note that shared mode is compatible with shared mode, but not with exclusive mode.

At any time, several shared-mode locks can be held simultaneously (by different transactions) on a

particular data item.

A subsequent exclusive-mode lock request has to wait until the currently held shared-mode locks are

released.

A transaction requests a shared lock on data item Q by executing the lock-S(Q) instruction.

Similarly, a transaction requests an exclusive lock through the lock-X(Q) instruction.

A transaction can unlock a data item Q by the unlock(Q) instruction.

To access a data item, transaction Ti must first lock that item.

If the data item is already locked by another transaction in an incompatible mode, the concurrency

control manager will not grant the lock until all incompatible locks held by other transactions have been

released.

Thus, Ti is made to wait until all incompatible locks held by other transactions have been released.

Transaction Ti may unlock a data item that it had locked at some earlier point.

Note that a transaction must hold a lock on a data item as long as it accesses that item.


Moreover, for a transaction to unlock a data item immediately after its final access of that data item is

not always desirable, since serializability may not be ensured.

There are four types of lock protocols available:

1. Simplistic

Simplistic lock-based protocols allow a transaction to obtain a lock on every object before a 'write' operation is performed.

As soon as the 'write' has been done, the transaction may unlock the data item.

2. Pre-claiming

In this protocol, a transaction evaluates its operations and creates a list of data items on which it needs locks.

Before starting execution, the transaction requests the system for all the locks it needs.

If all the locks are granted, the transaction executes and releases all the locks when all its operations are over.

If all the locks are not granted, the transaction rolls back and waits until all the locks are granted.

3. Two-Phase Locking (2PL)

This locking protocol divides the execution of a transaction into three parts.

In the first part, when the transaction starts executing, it seeks grants for the locks it needs as it executes.

In the second part, the transaction has acquired all its locks and no other lock is required; the transaction keeps executing its operations.

As soon as the transaction releases its first lock, the third part starts.

In this part the transaction cannot demand any new locks; it only releases the locks it has acquired.

Two-phase locking thus has two phases: a growing phase, where all locks are being acquired by the transaction, and a shrinking phase, where the locks held by the transaction are being released.

If a transaction already holds a shared (read) lock on an item and needs to write it, it must upgrade that lock to an exclusive (write) lock; such upgrading can be done only in the growing phase.

4. Strict Two Phase Locking

The first phase of Strict-2PL is the same as in 2PL. After acquiring all its locks in the first phase, the transaction continues to execute normally.

But in contrast to 2PL, Strict-2PL does not release a lock as soon as it is no longer required; it holds all locks until the transaction commits.

Strict-2PL releases all locks at once at the commit point.

Strict-2PL therefore does not suffer from cascading aborts, as basic 2PL can.
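The grant decision described above can be expressed with the compatibility matrix. The following is a minimal Python sketch (COMPATIBLE, can_grant and the sample data are illustrative, not from the paper) of how a concurrency-control manager could test whether a requested mode is compatible with every lock currently held on the item.

# comp(A, B) is true iff a lock of mode A can be granted while a mode B lock is held.
COMPATIBLE = {("S", "S"): True, ("S", "X"): False,
              ("X", "S"): False, ("X", "X"): False}

def can_grant(requested_mode, held_locks):
    # held_locks: list of (transaction, mode) pairs currently held on the data item.
    return all(COMPATIBLE[(requested_mode, mode)] for _, mode in held_locks)

held_on_Q = [("T1", "S"), ("T2", "S")]        # two shared locks held on Q
print(can_grant("S", held_on_Q))              # True: another shared lock can be granted
print(can_grant("X", held_on_Q))              # False: the exclusive request must wait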

Q.9.(b). Write short notes on two phase locking protocol. 6M

Ans.

Two-phase locking protocol:

One way to ensure serializability is to require that data items be accessed in a mutually exclusive

manner; that is, while one transaction is accessing a data item, no other transaction can modify that data

item.

The most common method used to implement this requirement is to allow a transaction to access a data

item only if it is currently holding a lock on that item.

Locks

There are various modes in which a data item may be locked.

1. Shared. If a transaction Ti has obtained a shared-mode lock (denoted by S)

on item Q, then Ti can read, but cannot write, Q.

2. Exclusive. If a transaction Ti has obtained an exclusive-mode lock (denoted

by X) on item Q, then Ti can both read and write Q.


We require that every transaction request a lock in an appropriate mode on data item Q, depending on

the types of operations that it will perform on Q.

One protocol that ensures serializability is the two-phase locking protocol.

This protocol requires that each transaction issue lock and unlock requests in two phases:

1. Growing phase. A transaction may obtain locks, but may not release any lock.

2. Shrinking phase. A transaction may release locks, but may not obtain any new locks.

Initially, a transaction is in the growing phase.

The transaction acquires locks as needed.

Once the transaction releases a lock, it enters the shrinking phase, and it can issue no more lock

requests.

For example, transactions T3 and T4 are two phase.

On the other hand, transactions T1 and T2 are not two phase.

Note that the unlock instructions do not need to appear at the end of the transaction.

For example, in the case of transaction T3, we could move the unlock(B) instruction to just after the

lock-X(A) instruction, and still retain the two-phase locking property.

We can show that the two-phase locking protocol ensures conflict serializability.

Consider any transaction. The point in the schedule where the transaction has obtained its final lock (the

end of its growing phase) is called the lock point of the transaction.

Now, transactions can be ordered according to their lock points—this ordering is, in fact, a

serializability ordering for the transactions.

Two-phase locking does not ensure freedom from deadlock.
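A minimal Python sketch of the two-phase property itself (is_two_phase is an illustrative name, not from the paper): for one transaction's sequence of lock and unlock instructions, no lock request may appear after the first unlock.

def is_two_phase(instructions):
    # instructions: e.g. ["lock-S(A)", "lock-X(B)", "unlock(A)", "unlock(B)"]
    shrinking = False
    for instruction in instructions:
        if instruction.startswith("unlock"):
            shrinking = True                      # the shrinking phase has begun
        elif instruction.startswith("lock") and shrinking:
            return False                          # a lock requested after an unlock
    return True

print(is_two_phase(["lock-X(B)", "unlock(B)", "lock-X(A)", "unlock(A)"]))  # False
print(is_two_phase(["lock-X(B)", "lock-X(A)", "unlock(B)", "unlock(A)"]))  # True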


Q.10.(a). Explain Lock based protocol for concurrency control in database transactions. Consider the

following transactions: 6M

T31: read(A);
     read(B);
     if A = 0 then B := B + 1;
     write(B).

T32: read(B);
     read(A);
     if B = 0 then A := A + 1;
     write(A).

Add lock and unlock instructions to transactions T31 and T32 so that they observe two phase locking

protocol. Can the execution of these transactions result in a deadlock? Explain your answer.

Ans.

Lock and unlock instructions:

T31: lock-S(A)

read(A)

lock-X(B)

read(B)

if A = 0

then B := B + 1

write(B)

unlock(A)

unlock(B)

T32: lock-S(B)

read(B)

lock-X(A)

read(A)

if B = 0

then A := A + 1

write(A)

unlock(B)

unlock(A)

Execution of these transactions can result in deadlock. For example, consider

the following partial schedule:

T31                    T32
lock-S(A)
                       lock-S(B)
                       read(B)
read(A)
lock-X(B)
                       lock-X(A)

The transactions are now deadlocked: T31 is waiting for an exclusive lock on B, which T32 holds in shared mode, while T32 is waiting for an exclusive lock on A, which T31 holds in shared mode, so neither can proceed.


Q.10.(b). Write short notes on Time stamp based protocol. 7M

Ans.

Time stamp ordering Protocol

With each transaction Ti in the system, we associate a unique fixed timestamp, denoted by TS(Ti).

This timestamp is assigned by the database system before the transaction Ti starts execution.

If a transaction Ti has been assigned timestamp TS(Ti), and a new transaction Tj enters the system, then

TS(Ti) < TS(Tj ).

There are two simple methods for implementing this scheme:

1. Use the value of the system clock as the timestamp; that is, a transaction’s

Timestamp is equal to the value of the clock when the transaction enters the system.

2. Use a logical counter that is incremented after a new timestamp has been

assigned; that is, a transaction’s timestamp is equal to the value of the counter

when the transaction enters the system.

The timestamps of the transactions determine the serializability order.

Thus, if TS(Ti) < TS(Tj ), then the system must ensure that the produced schedule is equivalent to a

serial schedule in which transaction Ti appears before transaction Tj .

To implement this scheme, we associate with each data item Q two timestamp values:

• W-timestamp(Q) denotes the largest timestamp of any transaction that executed

write(Q) successfully.

• R-timestamp(Q) denotes the largest timestamp of any transaction that executed

read(Q) successfully.

These timestamps are updated whenever a new read(Q) or write(Q) instruction is executed.

The timestamp-ordering protocol ensures serializability among transaction in their conflicting read and

write operations.

This is the responsibility of the protocol system that the conflicting pair of tasks should be executed

according to the timestamp values of the transactions.

Time-stamp of Transaction Ti is denoted as TS(Ti).

Read time-stamp of data-item X is denoted by R-timestamp(X).

Write time-stamp of data-item X is denoted by W-timestamp(X).

The timestamp-ordering protocol works as follows:

If a transaction Ti issues a read(X) operation:
   o If TS(Ti) < W-timestamp(X), the read is rejected and Ti is rolled back, because Ti would read a value that has already been overwritten.
   o If TS(Ti) >= W-timestamp(X), the read is executed, and R-timestamp(X) is updated to the maximum of R-timestamp(X) and TS(Ti).

If a transaction Ti issues a write(X) operation:
   o If TS(Ti) < R-timestamp(X), the write is rejected and Ti is rolled back.
   o If TS(Ti) < W-timestamp(X), the write is rejected and Ti is rolled back.
   o Otherwise, the write is executed, and W-timestamp(X) is set to TS(Ti).

If a transaction Ti is rolled back by the concurrency-control scheme as a result of the issuance of either a read or a write operation, the system assigns it a new timestamp and restarts it.
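The rules above translate directly into a small check on each operation. Here is a minimal Python sketch (ts_read and ts_write are illustrative names, not from the paper); each data item carries its R-timestamp and W-timestamp, and a rejected operation stands for rolling back and restarting the transaction.

def ts_read(ts, item):
    if ts < item["w_ts"]:
        return "rejected, roll back Ti"          # Ti would read an already overwritten value
    item["r_ts"] = max(item["r_ts"], ts)         # record the largest successful read timestamp
    return "executed"

def ts_write(ts, item):
    if ts < item["r_ts"] or ts < item["w_ts"]:
        return "rejected, roll back Ti"          # Ti's write arrives too late
    item["w_ts"] = ts
    return "executed"

X = {"r_ts": 0, "w_ts": 0}
print(ts_read(5, X))     # executed; R-timestamp(X) becomes 5
print(ts_write(3, X))    # rejected, since TS(Ti) = 3 < R-timestamp(X) = 5
print(ts_write(7, X))    # executed; W-timestamp(X) becomes 7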


Q.11.(a). Discuss any one multiversion techniques used for concurrency control. 8M

Ans.

The concurrency-control schemes discussed thus far ensure serializability by either delaying an

operation or aborting the transaction that issued the operation.

For example, a read operation may be delayed because the appropriate value has not been written yet; or

it may be rejected (that is, the issuing transaction must be aborted) because the value that it was

supposed to read has already been overwritten.

These difficulties could be avoided if old copies of each data item were kept in a system.

In multiversion concurrency control schemes, each write(Q) operation creates a new version of Q.

When a transaction issues a read(Q) operation, the concurrency control manager selects one of the

versions of Q to be read.

The concurrency-control scheme must ensure that the version to be read is selected in a manner that

ensures serializability.

It is also crucial, for performance reasons, that a transaction be able to determine easily and quickly

which version of the data item should be read.

1. Multiversion Timestamp Ordering

The most common transaction ordering technique used by multiversion schemes is timestamping.

With each transaction Ti in the system, we associate a unique static timestamp, denoted by TS(Ti).

The database system assigns this timestamp before the transaction starts execution.

With each data item Q, a sequence of versions <Q1, Q2, . . .,Qm> is associated.

Each version Qk contains three data fields:

• Content is the value of version Qk.

• W-timestamp(Qk) is the timestamp of the transaction that created version Qk.

• R-timestamp(Qk) is the largest timestamp of any transaction that successfully

read version Qk.

A transaction—say, Ti—creates a new version Qk of data item Q by issuing a write(Q) operation.

The content field of the version holds the value written by Ti.

The system initializes the W-timestamp and R-timestamp to TS(Ti).

It updates the R-timestamp value of Qk whenever a transaction Tj reads the content of Qk, and R-

timestamp(Qk) < TS(Tj ).

The scheme operates as follows.

Suppose that transaction Ti issues a read(Q) or write(Q) operation. Let Qk denote the version of Q whose write timestamp is the largest write timestamp less than or equal to TS(Ti).

1. If transaction Ti issues a read(Q), then the value returned is the content of version Qk.

2. If transaction Ti issues a write(Q), and if TS(Ti) < R-timestamp(Qk), then the system rolls back transaction Ti.

On the other hand, if TS(Ti) = W-timestamp(Qk), the system overwrites the contents of Qk; otherwise it creates a new version of Q.

The justification for rule 1 is clear.

A transaction reads the most recent version that comes before it in time.

The second rule forces a transaction to abort if it is "too late" in doing a write.

More precisely, if Ti attempts to write a version that some other transaction would have read, then we

cannot allow that write to succeed.

Versions that are no longer needed are removed according to the following rule.

Suppose that there are two versions, Qk and Qj , of a data item, and that both versions have a W-

timestamp less than the timestamp of the oldest transaction in the system.

Then, the older of the two versions Qk and Qj will not be used again, and can be deleted.

The multiversion timestamp-ordering scheme has the desirable property that a read request never fails and is never made to wait.


In typical database systems, where reading is a more frequent operation than writing, this advantage may be of major practical significance.
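The two rules can be sketched in a few lines of Python (latest_version, mv_read and mv_write are illustrative names, not from the paper); each version of Q is a dict holding its content, W-timestamp and R-timestamp.

def latest_version(versions, ts):
    # The version whose write timestamp is the largest one <= TS(Ti).
    return max((v for v in versions if v["w_ts"] <= ts), key=lambda v: v["w_ts"])

def mv_read(ts, versions):
    qk = latest_version(versions, ts)
    qk["r_ts"] = max(qk["r_ts"], ts)
    return qk["content"]                   # a read never fails and never waits

def mv_write(ts, versions, value):
    qk = latest_version(versions, ts)
    if ts < qk["r_ts"]:
        return "roll back"                 # a younger transaction has already read Qk
    if ts == qk["w_ts"]:
        qk["content"] = value              # overwrite Ti's own version
    else:
        versions.append({"content": value, "w_ts": ts, "r_ts": ts})
    return "ok"

Q = [{"content": 100, "w_ts": 0, "r_ts": 0}]
print(mv_read(5, Q))            # 100, and the R-timestamp of that version becomes 5
print(mv_write(3, Q, 200))      # roll back: TS = 3 < R-timestamp = 5
print(mv_write(7, Q, 300))      # ok: a new version with W-timestamp 7 is created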

2. Multiversion Two-Phase Locking

The multiversion two-phase locking protocol attempts to combine the advantages of multiversion

concurrency control with the advantages of two-phase locking.

This protocol differentiates between read-only transactions and update transactions.

Update transactions perform rigorous two-phase locking; that is, they hold all locks up to the end of the

transaction.

Thus, they can be serialized according to their commit order.

Each version of a data item has a single timestamp.

The timestamp in this case is not a real clock-based timestamp, but rather is a counter, which we will

call the ts-counter, that is incremented during commit processing.

Read-only transactions are assigned a timestamp by reading the current value of ts-counter before they start execution; they follow the multiversion timestamp-ordering protocol for performing reads.

Thus, when a read-only transaction Ti issues a read(Q), the value returned is the contents of the version

whose timestamp is the largest timestamp less than TS(Ti).

When an update transaction reads an item, it gets a shared lock on the item, and reads the latest version

of that item.

When an update transaction wants to write an item, it first gets an exclusive lock on the item, and then

creates a new version of the data item.

The write is performed on the new version, and the timestamp of the new version is initially set to a

value ∞, a value greater than that of any possible timestamp.

When the update transaction Ti completes its actions, it carries out commit processing:

First, Ti sets the timestamp on every version it has created to 1 more than the value of ts-counter; then,

Ti increments ts-counter by 1.

Only one update transaction is allowed to perform commit processing at a time.

As a result, read-only transactions that start after Ti increments ts-counter will see the values updated by Ti, whereas those that start before Ti increments ts-counter will see the value before the updates by Ti.

In either case, read-only transactions never need to wait for locks.

Multiversion two-phase locking also ensures that schedules are recoverable and cascadeless.

Q.11.(b).Explain various types of failures that occur in the system and also explain recovery method

used? 6M

Ans.

A computer system, like any other device, is subject to failure from a variety of causes: disk crash,

power outage, software error, a fire in the machine room.

In any failure, information may be lost.

Therefore, the database system must take actions in advance to ensure that the atomicity and durability

properties of transactions are preserved.

An integral part of a database system is a recovery scheme that can restore the database to the

consistent state that existed before the failure.

The recovery scheme must also provide high availability; that is, it must minimize the time for which

the database is not usable after a crash.

Failure Classification

There are various types of failure that may occur in a system, each of which needs to be dealt with in a

different manner.

The simplest type of failure is one that does not result in the loss of information in the system.

The failures that are more difficult to deal with are those that result in loss of information.


The following types of failure can occur:

A. Transaction failure.

There are two types of errors that may cause a transaction to fail:

1. Logical error. The transaction can no longer continue with its normal execution because of some internal

condition, such as bad input, data not found, overflow, or resource limit exceeded.

2. System error. The system has entered an undesirable state (for example, deadlock), as a result of which a

transaction cannot continue with its normal execution.

The transaction, however, can be re-executed at a later time.

B. System crash.

There is a hardware malfunction, or a bug in the database software or the operating system, that causes

the loss of the content of volatile storage, and brings transaction processing to a halt.

The content of non-volatile storage remains intact, and is not corrupted.

The assumption that hardware errors and bugs in the software bring the system to a halt, but do not

corrupt the nonvolatile storage contents, is known as the fail-stop assumption.

Well-designed systems have numerous internal checks, at the hardware and the software level, that

bring the system to a halt when there is an error.

Hence, the fail-stop assumption is a reasonable one.

C. Disk failure.

A disk block loses its content as a result of either a head crash or failure during a data transfer operation.

Copies of the data on other disks, or archival backups on tertiary media, such as tapes, are used to

recover from the failure.

To determine how the system should recover from failures, we need to identify the failure modes of

those devices used for storing data.

Next, we must consider how these failure modes affect the contents of the database.

We can then propose algorithms to ensure database consistency and transaction atomicity despite

failures.

These algorithms, known as recovery algorithms, have two parts:

1. Actions taken during normal transaction processing to ensure that enough

information exists to allow recovery from failures.

2. Actions taken after a failure to recover the database contents to a state that ensures

database consistency, transaction atomicity, and durability.

Q.12.(a). What is deadlock in DBMS? Explain with example. What are deadlock prevention strategies?

7M

Ans.

A system is in a deadlock state if there exists a set of transactions such that every transaction in the set is

waiting for another transaction in the set.

More precisely, there exists a set of waiting transactions {T0, T1, . . ., Tn} such that T0 is waiting for a

data item that T1 holds, and T1 is waiting for a data item that T2 holds, and . . ., and Tn−1 is waiting for

a data item that Tn holds, and Tn is waiting for a data item that T0 holds.

None of the transactions can make progress in such a situation.

The only remedy to this undesirable situation is for the system to invoke some drastic action, such as

rolling back some of the transactions involved in the deadlock.

Rollback of a transaction may be partial: That is, a transaction may be rolled back to the point where it

obtained a lock whose release resolves the deadlock.

There are two principal methods for dealing with the deadlock problem.

We can use a deadlock prevention protocol to ensure that the system will never enter a deadlock state.


Alternatively, we can allow the system to enter a deadlock state, and then try to recover by using a

deadlock detection and deadlock recovery scheme.

These methods may result in transaction rollback.

Prevention is commonly used if the probability that the system would enter a deadlock state is relatively

high; otherwise, detection and recovery are more efficient.

If a system does not employ some protocol that ensures deadlock freedom, then a detection and

recovery scheme must be used.

An algorithm that examines the state of the system is invoked periodically to determine whether a

deadlock has occurred.

If one has, then the system must attempt to recover from the deadlock.

To do so, the system must:

• Maintain information about the current allocation of data items to transactions,

as well as any outstanding data item requests.

• Provide an algorithm that uses this information to determine whether the system

has entered a deadlock state.

• Recover from the deadlock when the detection algorithm determines that a

deadlock exists.

Deadlock Prevention

There are two approaches to deadlock prevention.

One approach ensures that no cyclic waits can occur by ordering the requests for locks, or requiring all

locks to be acquired together.

The other approach is closer to deadlock recovery, and performs transaction rollback instead of waiting

for a lock, whenever the wait could potentially result in a deadlock.

The simplest scheme under the first approach requires that each transaction locks all its data items

before it begins execution.

Moreover, either all are locked in one step or none are locked.

There are two main disadvantages to this protocol:

(1) it is often hard to predict, before the transaction begins, what data items need to be locked;

(2) data-item utilization may be very low, since many of the data items may be locked

but unused for a long time.

Another approach for preventing deadlocks is to impose an ordering of all data items, and to require that

a transaction lock data items only in a sequence consistent with the ordering.

A variation of this approach is to use a total order of data items, in conjunction with two-phase locking.

Once a transaction has locked a particular item, it cannot request locks on items that precede that item in the ordering.

This scheme is easy to implement, as long as the set of data items accessed by a transaction is known

when the transaction starts execution.

There is no need to change the underlying concurrency-control system if two-phase locking is used: all that is needed is to ensure that locks are requested in the right order.

The second approach for preventing deadlocks is to use preemption and transaction rollbacks.

In preemption, when a transaction T2 requests a lock that transaction T1 holds, the lock granted to T1

may be preempted by rolling back of T1, and granting of the lock to T2.

To control the preemption, we assign a unique timestamp to each transaction.

The system uses these timestamps only to decide whether a transaction should wait or roll back.

Locking is still used for concurrency control.

If a transaction is rolled back, it retains its old timestamp when restarted.

Two different deadlock prevention schemes using timestamps have been proposed:

1. The wait–die scheme is a nonpreemptive technique.

When transaction Ti requests a data item currently held by Tj , Ti is allowed to wait only if it has

a timestamp smaller than that of Tj (that is, Ti is older than Tj ).

Otherwise, Ti is rolled back (dies).


For example, suppose that transactions T22, T23, and T24 have timestamps 5, 10, and 15, respectively. If T22 requests a data item held by T23, then T22 will wait.

If T24 requests a data item held by T23, then T24 will be rolled back.

2. The wound–wait scheme is a preemptive technique.

It is a counterpart to the wait–die scheme.

When transaction Ti requests a data item currently held by Tj , Ti is allowed to wait only if it has

a timestamp larger than that of Tj (that is, Ti is younger than Tj ). Otherwise, Tj is rolled back (Tj

is wounded by Ti).

Returning to our example, with transactions T22, T23, and T24, if T22 requests a data item held

by T23, then the data item will be preempted from T23, and T23 will be rolled back.

If T24 requests a data item held by T23, then T24 will wait.

Whenever the system rolls back transactions, it is important to ensure that there is no starvation—

that is, no transaction gets rolled back repeatedly and is never allowed to make progress.

Both the wound–wait and the wait–die schemes avoid starvation: At any time, there is a transaction

with the smallest timestamp.

This transaction cannot be required to roll back in either scheme.

Since timestamps always increase, and since transactions are not assigned new timestamps when they are rolled back, a transaction that is rolled back repeatedly will eventually have the smallest timestamp, at which point it will not be rolled back again.
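The two schemes reduce to a single comparison of timestamps when Ti requests an item that Tj holds. A minimal Python sketch (wait_die and wound_wait are illustrative names, not from the paper), using the example timestamps from the answer:

# Smaller timestamp = older transaction. Ti requests a data item currently held by Tj.

def wait_die(ts_i, ts_j):
    return "Ti waits" if ts_i < ts_j else "Ti is rolled back (dies)"

def wound_wait(ts_i, ts_j):
    return "Ti waits" if ts_i > ts_j else "Tj is rolled back (wounded)"

# TS(T22) = 5, TS(T23) = 10, TS(T24) = 15; the data item is held by T23.
print(wait_die(5, 10))      # T22 is older   -> Ti waits
print(wait_die(15, 10))     # T24 is younger -> Ti is rolled back (dies)
print(wound_wait(5, 10))    # T22 is older   -> Tj is rolled back (T23 is wounded)
print(wound_wait(15, 10))   # T24 is younger -> Ti waits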

Q.12.(b). How are buffering and caching techniques used by the recovery systems? 7M

Ans. Buffering technique used by the recovery system:

DBMS application programs require input/output (I/O) operations, which are performed by a

component of operating system.

These I/O operations normally use buffers to match the speed of the processor and the relatively fast

main (or primary) memories with the slower secondary storages and also to minimize the number of I/O

operations between the main and secondary memories wherever possible.

The buffers are the reserved blocks of the main memory.

The assignment and management of memory blocks is called buffer management, and the component of the operating system that performs this task is called the buffer manager.

The buffer manager is responsible for the efficient management of the database buffers that are used to

transfer (flushing) pages between buffer and secondary storage.

It ensures that as many data requests made by programs as possible are satisfied from data copied (flushed) from secondary storage into the buffers.

The buffer manager takes care of reading of pages from the disk (secondary storage) into the buffers

(physical memory) until the buffers become full and then using a replacement strategy to decide which

buffer(s) to force-write to disk to make space for new pages that need to be read from disk.

Some of the replacement strategies used by the buffer manager are (a) first-in-first-out (FIFO) and (b)

least recently used (LRU).

A computer system uses buffers that are in effect virtual memory buffers.

Thus, a mapping is required between a virtual memory buffer and the physical memory.

The physical memory is managed by the memory management component of operating system of

computer system.


In a virtual memory management, the buffers containing pages of the database undergoing modification

by a transaction could be written out to secondary storage.

The timing of this premature writing of a buffer is decided by the memory management component of

the operating system and is independent of the state of the transaction.

To decrease the number of buffer faults, the least recently used (LRU) algorithm is used for buffer

replacement.

Caching Techniques used in Recovery system:

Whenever a transaction needs to update the database, the disk pages (or disk blocks) containing the data

items to be modified are first cached (buffered) by the cache manager into the main memory and then

modified in the memory before being written back to the disk.

A cache directory is maintained to keep track of all the data items present in the buffers.

When an operation needs to be performed on a data item, the cache directory is first searched to

determine whether the disk page containing the data item resides in the cache.

If it is not present in the cache, the data item is searched on the disk and the appropriate disk page is

copied in the cache.

Sometimes it may be necessary to replace some of the disk pages to create space for the new pages.

Any page-replacement strategy such as least recently used (LRU) or first-in-first-out (FIFO) can be used

for replacing the disk page.

Each memory buffer has a free bit associated with it which indicates whether the buffer is free

(available for allocation) or not.

Other associated bits are dirty bit and pin/unpin bit.

For the efficiency of recovery purpose, the caching of disk pages is handled by the DBMS instead of the

OS.

Typically, a collection of in-memory buffers, called the DBMS cache, is kept under the control of the DBMS.

A directory for the cache is used to keep track of which DB items are in the buffers.

It is in the form of a table of <disk page address, buffer location> entries.

The DBMS cache holds the database disk blocks including

• Data blocks

• Index blocks

• Log blocks

When DBMS requests action on some item, it first checks the cache directory to determine if the

corresponding disk page is in the cache.

If no, the item must be located on disk and the appropriate disk pages are copied into the cache.

It may be necessary to replace (flush) some of the cache buffers to make space available for the new

item.

Dirty bit.

– Associated with each buffer in the cache is a dirty bit.

– The dirty bit can be included in the directory entry.

– It indicates whether or not the buffer has been modified.

• Set dirty bit=0 when the page is first read from disk to the buffer cache.

• Set dirty bit=1 as soon as the corresponding buffer is modified.

– When the buffer content is replaced (flushed) from the cache, it is written back to the corresponding disk page only if dirty bit = 1.

Pin-unpin bit.

– A page is pinned (i.e., pin-unpin bit value = 1) if it cannot be written back to disk yet.

• Strategies that can be used when flushing occurs.

– In-place updating

• Writes the buffer back to the same original disk location (overwriting the old value on disk).


– Shadowing

• Writes the updated buffer at a different disk location.

– Multiple versions of data items can be maintained.

– The old value is called the BFIM (before image).

– The new value is called the AFIM (after image).

• Since both the new value and the old value are kept on disk, no log is needed for recovery.
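The cache directory, dirty bit and pin bit described above can be sketched in a few lines of Python (cache, fetch, modify and flush are illustrative names, not from the paper); the flush follows the in-place-updating strategy, writing the buffer back to disk only when the dirty bit is set and the page is not pinned.

cache = {}        # disk page address -> {"data": ..., "dirty": 0, "pinned": 0}

def fetch(page_addr, disk):
    if page_addr not in cache:                         # cache miss: copy the page from disk
        cache[page_addr] = {"data": disk[page_addr], "dirty": 0, "pinned": 0}
    return cache[page_addr]

def modify(page_addr, value, disk):
    entry = fetch(page_addr, disk)
    entry["data"] = value
    entry["dirty"] = 1                                 # buffer now differs from the disk page

def flush(page_addr, disk):
    entry = cache[page_addr]
    if entry["pinned"]:
        return "cannot flush: page is pinned"          # pin bit = 1
    if entry["dirty"]:
        disk[page_addr] = entry["data"]                # in-place update of the disk page
    del cache[page_addr]
    return "flushed"

disk = {"P1": "old value"}
modify("P1", "new value", disk)
print(flush("P1", disk), disk)                         # flushed {'P1': 'new value'}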

Q.13.(a). Explain in brief about log based recovery. 6M

Ans. Log-Based Recovery

The most widely used structure for recording database modifications is the log.

The log is a sequence of log records, recording all the update activities in the database.

There are several types of log records.

An update log record describes a single database write.

It has these fields:

• Transaction identifier is the unique identifier of the transaction that performed

the write operation.

• Data-item identifier is the unique identifier of the data item written. Typically,

it is the location on disk of the data item.

• Old value is the value of the data item prior to the write.

• New value is the value that the data item will have after the write.

Other special log records exist to record significant events during transaction processing, such as the

start of a transaction and the commit or abort of a transaction.

We denote the various types of log records as:

• <Ti start>. Transaction Ti has started.

• <Ti, Xj, V1, V2>. Transaction Ti has performed a write on data item Xj . Xj

had value V1 before the write, and will have value V2 after the write.

• <Ti commit>. Transaction Ti has committed.

• <Ti abort>. Transaction Ti has aborted.

Whenever a transaction performs a write, it is essential that the log record for that write be created

before the database is modified.

Once a log record exists, we can output the modification to the database if that is desirable.

Also, we have the ability to undo a modification that has already been output to the database.

We undo it by using the old-value field in log records.

For log records to be useful for recovery from system and disk failures, the log must reside in stable

storage.

Every log record is written to the end of the log on stable storage as soon as it is created.

Observe that the log contains a complete record of all database activity.

As a result, the volume of data stored in the log may become unreasonably large.

Types of log records: All log records include the general log attributes, and also other attributes depending on their type (which is recorded in the Type attribute).

2. Update Log Record notes an update (change) to the database. It includes this extra information:

PageID: A reference to the Page ID of the modified page.

Length and Offset: Length in bytes and offset of the page are usually included.

Before and After Images: Includes the value of the bytes of page before and after the page

change. Some databases may have logs which include one or both images.

3. Compensation Log Record notes the rollback of a particular change to the database. Each corresponds to exactly one Update Log Record (although the corresponding update log record is not typically stored in the Compensation Log Record). It includes this extra information:

a. undoNextLSN: This field contains the LSN of the next log record that is to be undone for the transaction that wrote the last Update Log Record.

4. Commit Record notes a decision to commit a transaction.


5. Abort Record notes a decision to abort and hence roll back a transaction.

6. Checkpoint Record notes that a checkpoint has been made. These are used to speed up recovery. They

record information that eliminates the need to read a long way into the log's past. This varies according

to checkpoint algorithm. If all dirty pages are flushed while creating the checkpoint (as in PostgreSQL),

it might contain:

a. redoLSN: This is a reference to the first log record that corresponds to a dirty page. i.e. the first update

that wasn't flushed at checkpoint time. This is where redo must begin on recovery.

b. undoLSN: This is a reference to the oldest log record of the oldest in-progress transaction. This is the

oldest log record needed to undo all in-progress transactions.

7. Completion Record notes that all work has been done for this particular transaction. (It has been fully

committed or aborted)

Q.13.(b). Discuss the immediate update recovery technique in both single user and multiuser

environments. What are the advantages and disadvantages of immediate update? 8M

Or

Describe a recovery scheme that works in single user environment if system fails:-

(i). After transaction starts and before the read.

(ii). After the read and before the write.

(iii). After the commit and before all database entries are flushed to disk.

or

Describe a recovery technique that employ the immediate update scheme.

Ans.

Immediate Database Modification

The immediate-modification technique allows database modifications to be output to the database

while the transaction is still in the active state.

Data modifications written by active transactions are called uncommitted modifications.

In the event of a crash or a transaction failure, the system must use the old-value field of the log records

to restore the modified data items to the value they had prior to the start of the transaction.

The undo operation accomplishes this restoration.

Before a transaction Ti starts its execution, the system writes the record <Ti start> to the log.

During its execution, any write(X) operation by Ti is preceded by the writing of the appropriate new

update record to the log.

When Ti partially commits, the system writes the record <Ti commit> to the log.

Since the information in the log is used in reconstructing the state of the database, we cannot allow the

actual update to the database to take place before the corresponding log record is written out to stable

storage.

We therefore require that, before execution of an output(B) operation, the log records corresponding to

B be written onto stable storage.

As an illustration, let us reconsider our simplified banking system, with transactions T0 and T1 executed one after the other in the order T0 followed by T1.

The portion of the log containing the relevant information concerning these two transactions appears

below:

<T0 start>

<T0 , A, 1000, 950>

<T0 , B, 2000, 2050>

<T0 commit>

<T1 start>

<T1 , C, 700, 600>


<T1 commit>

Portion of the system log corresponding to T0 and T1.

One possible order in which the actual outputs took place in both the database system and the log as a

result of the execution of T0 and T1 is shown below:

Log                          Database

<T0 start>
<T0, A, 1000, 950>
<T0, B, 2000, 2050>
                             A = 950
                             B = 2050
<T0 commit>
<T1 start>
<T1, C, 700, 600>
                             C = 600
<T1 commit>

State of the system log and database corresponding to T0 and T1 (one possible order of outputs).

Using the log, the system can handle any failure that does not result in the loss of information in

nonvolatile storage.

The recovery scheme uses two recovery procedures:

• undo(Ti) restores the value of all data items updated by transaction Ti to the

old values.

• redo(Ti) sets the value of all data items updated by transaction Ti to the new

values.

The set of data items updated by Ti and their respective old and new values can be found in the log.

The undo and redo operations must be idempotent to guarantee correct behaviour even if a failure

occurs during the recovery process.

After a failure has occurred, the recovery scheme consults the log to determine which transactions need

to be redone, and which need to be undone:

• Transaction Ti needs to be undone if the log contains the record <Ti start>,

but does not contain the record <Ti commit>.

• Transaction Ti needs to be redone if the log contains both the record <Ti start>

and the record <Ti commit>.

As an illustration, return to our banking example, with transaction T0 and T1 executed one after the

other in the order T0 followed by T1.

Suppose that the system crashes before the completion of the transactions.

The state of the logs for each of these cases appears below:

First, let us assume that the crash occurs just after the log record for the step write(B) of transaction T0

has been written to stable storage.

(a) Crash just after write(B) of T0:
<T0 start>
<T0, A, 1000, 950>
<T0, B, 2000, 2050>

(b) Crash just after write(C) of T1:
<T0 start>
<T0, A, 1000, 950>
<T0, B, 2000, 2050>
<T0 commit>
<T1 start>
<T1, C, 700, 600>

(c) Crash just after <T1 commit>:
<T0 start>
<T0, A, 1000, 950>
<T0, B, 2000, 2050>
<T0 commit>
<T1 start>
<T1, C, 700, 600>
<T1 commit>


The same log, shown at three different times.

When the system comes back up, it finds the record <T0 start> in the log, but no corresponding <T0

commit> record.

Thus, transaction T0 must be undone, so an undo(T0) is performed.

As a result, the values in accounts A and B (on the disk) are restored to $1000 and $2000, respectively.

Next, let us assume that the crash comes just after the log record for the step write(C) of transaction T1

has been written to stable storage.

When the system comes back up, two recovery actions need to be taken.

The operation undo(T1) must be performed, since the record <T1 start> appears in the log, but there is

no record <T1 commit>.

The operation redo(T0)must be performed, since the log contains both the record <T0 start> and the

record <T0 commit>.

At the end of the entire recovery procedure, the values of accounts A, B, and C are $950, $2050, and

$700, respectively.

Note that the undo(T1) operation is performed before the redo(T0).

In this example, the same outcome would result if the order were reversed.

However, the order of doing undo operations first, and then redo operations, is important for the

recovery algorithm.

Finally, let us assume that the crash occurs just after the log record<T1 commit> has been written to

stable storage.

When the system comes back up, both T0 and T1 need to be redone, since the records <T0 start> and <T0 commit> appear in the log, as do the records <T1 start> and <T1 commit>.

After the system performs the recovery procedures redo(T0) and redo(T1), the values in accounts A, B,

and C are $950, $2050, and $600, respectively.
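The rule used in all three cases can be captured in a few lines. Here is a minimal Python sketch (classify is an illustrative name, not from the paper; update records are omitted for brevity) that decides, from the start and commit records alone, which transactions to undo and which to redo.

def classify(log):
    # Redo Ti if both <Ti start> and <Ti commit> are in the log;
    # undo Ti if <Ti start> is present but <Ti commit> is not.
    started = {t for kind, t in log if kind == "start"}
    committed = {t for kind, t in log if kind == "commit"}
    return {"redo": started & committed, "undo": started - committed}

# Second case above: the crash occurs just after <T1, C, 700, 600> is written.
log = [("start", "T0"), ("commit", "T0"), ("start", "T1")]
print(classify(log))     # {'redo': {'T0'}, 'undo': {'T1'}}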

Q.14.(a). Describe Write Ahead Logging protocol. 5M

Ans.

In computer science, write-ahead logging (WAL) is a family of techniques for providing atomicity and

durability (two of the ACID properties) in database systems.

In a system using WAL, all modifications are written to a log before they are applied.

Usually both redo and undo information is stored in the log.

Example: Imagine a program that is in the middle of performing some operation when the machine it is

running on loses power.

Upon restart, that program might well need to know whether the operation it was performing

succeeded, half-succeeded, or failed.

If a write-ahead log were used, the program could check this log and compare what it was supposed to

be doing when it unexpectedly lost power to what was actually done.

On the basis of this comparison, the program could decide to undo what it had started, complete what it

had started, or keep things as they are.

WAL allows updates of a database to be done in-place.

Another way to implement atomic updates is with shadow paging, which is not in-place.

The main advantage of doing updates in-place is that it reduces the need to modify indexes and block

lists.

ARIES is a popular algorithm in the WAL family.

File systems typically use a variant of WAL for at least file system metadata called journaling.

The PostgreSQL database system also uses WAL to provide point-in-time recovery and database

replication features. SQLite database also uses WAL.

MongoDB uses write-ahead logging to provide consistency and crash safety.

Apache HBase uses WAL in order to provide recovery after disaster.
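The core rule is simply "log first, modify second". A minimal Python sketch (stable_log, database and wal_write are illustrative names, not from any of the systems mentioned above): the record carrying both the old and the new value is appended to the log before the in-place update is applied.

stable_log = []              # stands in for the log on stable storage
database = {"A": 1000}

def wal_write(txn, item, new_value):
    old_value = database[item]
    stable_log.append((txn, item, old_value, new_value))   # 1. write the log record first
    database[item] = new_value                              # 2. only then modify the database

wal_write("T0", "A", 950)
print(stable_log)            # [('T0', 'A', 1000, 950)]
print(database)              # {'A': 950}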


Q.14.(b). Explain the process of buffer management with a suitable example. 7M

Ans.

DBMS application programs require input/output (I/O) operations, which are performed by a

component of operating system.

These I/O operations normally use buffers to match the speed of the processor and the relatively fast

main (or primary) memories with the slower secondary storages and also to minimize the number of I/O

operations between the main and secondary memories wherever possible.

The buffers are the reserved blocks of the main memory.

The assignment and management of memory blocks is called buffer management, and the component of the operating system that performs this task is called the buffer manager.

The buffer manager is responsible for the efficient management of the database buffers that are used to

transfer (flushing) pages between buffer and secondary storage.

It ensures that as many data requests made by programs as possible are satisfied from data copied (flushed) from secondary storage into the buffers.

The buffer manager takes care of reading of pages from the disk (secondary storage) into the buffers

(physical memory) until the buffers become full and then using a replacement strategy to decide which

buffer(s) to force-write to disk to make space for new pages that need to be read from disk.

Buffer Manager needs to make a critical choice of which block to keep and which block to discard when

buffer is needed for newly requested blocks.

Then buffer manager uses buffer replacement strategies. Some common strategies are:

Least-Recently-Used (LRU)

First-In-First-Out (FIFO)

The Clock Algorithm (Second Chance)

System Control

A computer system uses buffers that are in effect virtual memory buffers.

Thus, a mapping is required between a virtual memory buffer and the physical memory.

The physical memory is managed by the memory management component of operating system of

computer system.

In a virtual memory management, the buffers containing pages of the database undergoing modification

by a transaction could be written out to secondary storage.

The timing of this premature writing of a buffer is decided by the memory management component of

the operating system and is independent of the state of the transaction.

To decrease the number of buffer faults, the least recently used (LRU) algorithm is used for buffer

replacement.

Irrespective of the approach, there is a problem: the buffer manager has to limit the number of buffers so that they fit in the available main memory.

When the buffer manager controls main memory directly:

If requests exceed the available space, then the buffer manager has to select a buffer to empty by returning its contents to disk.

Blocks that have not been changed are simply erased from main memory, but blocks that have been changed are written back to their place on disk.

When the buffer manager allocates space in virtual memory:

The buffer manager has the option of allocating more buffers than can actually fit into main memory.

When all these buffers are in use, thrashing occurs.

Thrashing is an operating system problem in which many blocks are repeatedly moved in and out of the disk's swap space.

As a result, the system ends up spending most of its time swapping blocks and getting very little work done.

Other algorithms are likewise affected by the fact that the number of available buffers can vary, and by the buffer-replacement strategy used by the buffer manager.


Q.15.(a). Write short notes on the ARIES recovery method. 6M

Or

Describe the three phases of the ARIES recovery method.

Or

Explain the ARIES recovery algorithm for databases in detail.

Ans.

The state of the art in recovery methods is best illustrated by the ARIES recovery method.

The advanced recovery technique is modeled after ARIES, but has been simplified significantly to bring

out key concepts.

In contrast, ARIES uses a number of techniques to reduce the time taken for recovery, and to reduce the

overheads of checkpointing.

In particular, ARIES is able to avoid redoing many logged operations that have already been applied

and to reduce the amount of information logged.

The price paid is greater complexity; the benefits are worth the price.

ARIES recovers from a system crash in three passes.

• Analysis pass: This pass determines which transactions to undo, which pages

were dirty at the time of the crash, and the LSN from which the redo pass

should start.

• Redo pass: This pass starts from a position determined during analysis, and

performs a redo, repeating history, to bring the database to a state it was in

before the crash.

• Undo pass: This pass rolls back all transactions that were incomplete at the

time of crash.
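
As a rough sketch of how the three passes fit together (not the actual ARIES pseudocode; the three pass functions are passed in here and elaborated in the sketches that follow):

def aries_recover(log, checkpoint, buffer_pool,
                  analysis_pass, redo_pass, undo_pass):
    # Pass 1: find where redo must start, which pages were dirty, which txns to undo.
    redo_lsn, dirty_page_table, undo_list = analysis_pass(log, checkpoint)
    # Pass 2: repeat history from redo_lsn so pages reflect every logged update.
    redo_pass(log, redo_lsn, dirty_page_table, buffer_pool)
    # Pass 3: roll back every transaction that was incomplete at the crash.
    undo_pass(log, undo_list, buffer_pool)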

1. Analysis Pass:

• The analysis pass finds the last complete checkpoint log record and reads in the DirtyPageTable from this record.

• It then sets RedoLSN to the minimum of the RecLSNs of the pages in the DirtyPageTable.

• If there are no dirty pages, it sets RedoLSN to the LSN of the checkpoint log record.

• The redo pass starts its scan of the log from RedoLSN.

• All the log records earlier than this point have already been applied to the database pages on disk.

• The analysis pass initially sets the list of transactions to be undone, undo-list, to the list of transactions

in the checkpoint log record.

• The analysis pass also reads from the checkpoint log record the LSNs of the last log record for each

transaction in undo-list.

• The analysis pass continues scanning forward from the checkpoint.

• Whenever it finds a log record for a transaction not in the undo-list, it adds the transaction to undo-list.

• Whenever it finds a transaction end log record, it deletes the transaction from undo-list.

• All transactions left in undo-list at the end of analysis have to be rolled back later, in the undo pass.

• The analysis pass also keeps track of the last record of each transaction in undo-list, which is used in the

undo pass.

• The analysis pass also updates DirtyPageTable whenever it finds a log record for an update on a page.

• If the page is not in DirtyPageTable, the analysis pass adds it to DirtyPageTable, and sets the RecLSN

of the page to the LSN of the log record.
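
A minimal sketch of this pass follows; the checkpoint and log record layouts used here (dirty_pages, active_txns, scan_forward, and the record fields) are assumptions made for illustration only.

def analysis_pass(log, checkpoint):
    dirty_page_table = dict(checkpoint["dirty_pages"])    # page_id -> RecLSN
    undo_list = dict(checkpoint["active_txns"])           # txn_id -> last LSN

    for rec in log.scan_forward(from_lsn=checkpoint["lsn"]):
        txn = rec.get("txn")
        if txn is not None and txn not in undo_list:
            undo_list[txn] = rec["lsn"]                   # newly seen transaction
        if rec["type"] == "update":
            undo_list[txn] = rec["lsn"]                   # remember its last log record
            if rec["page"] not in dirty_page_table:
                dirty_page_table[rec["page"]] = rec["lsn"]  # becomes the page's RecLSN
        elif rec["type"] == "txn_end":
            undo_list.pop(txn, None)                      # completed: nothing to undo

    # Redo starts at the oldest update that may not have reached disk.
    redo_lsn = min(dirty_page_table.values()) if dirty_page_table else checkpoint["lsn"]
    return redo_lsn, dirty_page_table, undo_list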

2. Redo Pass:

• The redo pass repeats history by replaying every action that is not already reflected in the page on disk.

• The redo pass scans the log forward from RedoLSN.

• Whenever it finds an update log record, it takes this action:

1. If the page is not in DirtyPageTable, or the LSN of the update log record is less than the RecLSN of the page in DirtyPageTable, then the redo pass skips the log record.

2. Otherwise the redo pass fetches the page from disk, and if the PageLSN is less

than the LSN of the log record, it redoes the log record.


• Note that if either of the tests is negative, then the effects of the log record have already appeared on the

page.

• If the first test is negative, it is not even necessary to fetch the page from disk.
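
The following sketch shows these two tests applied during one forward scan of the log; the record fields (page, lsn, item, new), the page dictionary with its page_lsn, and the buffer_pool.fetch call are illustrative assumptions, not a real interface.

def redo_pass(log, redo_lsn, dirty_page_table, buffer_pool):
    for rec in log.scan_forward(from_lsn=redo_lsn):
        if rec["type"] != "update":
            continue
        page_id = rec["page"]
        # Test 1: page not in DirtyPageTable, or record older than its RecLSN,
        # means the effect is already on disk; skip without fetching the page.
        if page_id not in dirty_page_table or rec["lsn"] < dirty_page_table[page_id]:
            continue
        page = buffer_pool.fetch(page_id)            # Test 2 needs the page's PageLSN
        if page["page_lsn"] < rec["lsn"]:            # update not yet on this page
            page["data"][rec["item"]] = rec["new"]   # repeat history
            page["page_lsn"] = rec["lsn"]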

• Undo Pass and Transaction Rollback: The undo pass is relatively straightforward.

• It performs a backward scan of the log, undoing all transactions in undo-list.

• If a CLR is found, it uses the UndoNextLSN field to skip log records that have already been rolled back.

• Otherwise, it uses the PrevLSN field of the log record to find the next log record to be undone.

• Whenever an update log record is used to perform an undo (whether for transaction rollback during

normal processing, or during the restart undo pass), the undo pass generates a CLR containing the undo

action performed (which must be physiological).

• It sets the UndoNextLSN of the CLR to the PrevLSN value of the update log record.
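
A hedged sketch of the undo pass follows; the record fields (prev_lsn, undo_next_lsn, old) and the log interface (read, append) are assumed for illustration.

def undo_pass(log, undo_list, buffer_pool):
    # undo_list maps each incomplete transaction to the LSN of its last log record.
    to_undo = dict(undo_list)
    while to_undo:
        # Backward scan: always process the record with the largest LSN next.
        txn, lsn = max(to_undo.items(), key=lambda kv: kv[1])
        rec = log.read(lsn)
        if rec["type"] == "CLR":
            nxt = rec["undo_next_lsn"]               # skip work already rolled back
        else:
            page = buffer_pool.fetch(rec["page"])
            page["data"][rec["item"]] = rec["old"]   # undo the logged update
            log.append({"type": "CLR", "txn": txn,   # compensation log record
                        "page": rec["page"], "item": rec["item"],
                        "value": rec["old"],
                        "undo_next_lsn": rec["prev_lsn"]})
            nxt = rec["prev_lsn"]
        if nxt is None:                              # first record of the txn reached
            log.append({"type": "txn_end", "txn": txn})
            del to_undo[txn]
        else:
            to_undo[txn] = nxt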

Q.15.(b). Explain Transaction Rollback in detail. 5M

Ans.

Consider transaction rollback during normal operation (that is, not during recovery from system failure).

The system scans the log backward and uses log records belonging to the transaction to restore the old

values of data items.

Unlike rollback in the basic recovery scheme, however, rollback in the advanced recovery scheme writes out special redo-only log records of the form <Ti, Xj, V>, containing the value V being restored to data item Xj during the rollback.

These log records are sometimes called compensation log records. Such records do not need undo

information, since we will never need to undo such an undo operation.

Whenever the system finds a log record <Ti, Oj, operation-end, U>, it takes special actions:

1. It rolls back the operation by using the undo information U in the log record.

• It logs the updates performed during the rollback of the operation just like updates performed when the

operation was first executed.

• In other words, the system logs physical undo information for the updates performed during rollback,

instead of using compensation log records.

• This is because a crash may occur while a logical undo is in progress, and on recovery the system has to

complete the logical undo; to do so, restart recovery will undo the partial effects of the earlier undo,

using the physical undo information, and then perform the logical undo again.

• At the end of the operation rollback, instead of generating a log record <Ti, Oj, operation-end, U>, the system generates a log record <Ti, Oj, operation-abort>.

2. When the backward scan of the log continues, the system skips all log records of the transaction until it finds the log record <Ti, Oj, operation-begin>.

• After it finds the operation-begin log record, it processes log records of the transaction in the normal

manner again.

• Observe that skipping over physical log records when the operation-end log record is found during

rollback ensures that the old values in the physical log record are not used for rollback, once the

operation completes.

• If the system finds a record <Ti, Oj, operation-abort>, it skips all preceding records until it finds the record <Ti, Oj, operation-begin>.

• These preceding log records must be skipped to prevent multiple rollbacks of the same operation, in case there had been a crash during an earlier rollback and the transaction had already been partly rolled back.

• When the transaction Ti has been rolled back, the system adds a record <Ti abort> to the log.

• If failures occur while a logical operation is in progress, the operation-end log record for the operation

will not be found when the transaction is rolled back. However, for every update performed by the

operation, undo information—in the form of the old value in the physical log records—is available in

the log. The physical log records will be used to roll back the incomplete operation.
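
Putting the above together, the sketch below rolls back a single transaction; the record layout, the log interface (scan_backward, append), and the helpers restore_old_value and perform_logical_undo are hypothetical stand-ins for the physical and logical undo actions described above.

def rollback(txn_id, log, restore_old_value, perform_logical_undo):
    skip_op = None                                  # operation whose records we must skip
    for rec in log.scan_backward(txn=txn_id):       # backward scan of Ti's log records
        if skip_op is not None:
            if rec["type"] == "operation-begin" and rec["op"] == skip_op:
                skip_op = None                      # resume normal processing
            continue                                # skip records of the completed operation
        if rec["type"] == "update":
            restore_old_value(rec)                  # physical undo of the update
            # Redo-only compensation record <Ti, Xj, V>: no undo information needed.
            log.append({"type": "redo-only", "txn": txn_id,
                        "item": rec["item"], "value": rec["old"]})
        elif rec["type"] == "operation-end":
            perform_logical_undo(rec["undo_info"])  # itself logs physical undo information
            log.append({"type": "operation-abort", "txn": txn_id, "op": rec["op"]})
            skip_op = rec["op"]                     # skip back to the operation-begin record
        elif rec["type"] == "operation-abort":
            skip_op = rec["op"]                     # already rolled back in an earlier attempt
    log.append({"type": "abort", "txn": txn_id})    # finally record <Ti abort>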