Academic Year 2014 Spring


MODULE CC3005NI: Advanced Database Systems

“Distributed Database (DDB) and Data Mining (DM)”

Definition of Distributed Database:

A Distributed Database (DDB) is a single logical database that is physically distributed across computers in multiple locations connected by a computer network.

A Distributed Database Management System (DDBMS) is the software system that manages the DDB and makes the distribution transparent to users.

Distributed vs. Decentralized DBs:

A decentralised database is also stored on computers at multiple locations. However, the computers are not connected by a network, so data cannot be shared by users at different locations.

A decentralised database is therefore best regarded as a collection of independent databases rather than a single database that is geographically distributed.

Advantages of Distributed Databases:

Reflecting the distributed nature of some database applications: Many database applications are naturally distributed over different locations (e.g. a company may have offices in different cities).

Increased reliability and availability: When a centralised system fails, the database is unavailable to all users. A distributed system, however, will continue to function at a reduced level even when a component fails.

Advantages of Distributed Databases:

Local Control: Distributing data encourages local groups to exercise greater control over “their” data. This promotes improved data integrity and administration, and users can still access non-local data when necessary.

Lower Communication Costs: In a distributed system, data can be located closer to its point of use. This can reduce communication costs compared with a centralised system.

Advantages of Distributed Databases:

Fast response and improved performance: Distributing a large database over multiple sites leaves a smaller database at each site. Some user queries at a particular site may only need to access the smaller database stored locally, which speeds up query processing and improves database performance.

It may also be possible to decompose complex queries into sub-queries that can be processed in parallel at several different sites.
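
As a rough illustration of decomposing a query into sub-queries that run in parallel, the Python sketch below computes a total salary by asking each site for a partial sum and then combining the partial results. The site names, the SITE_DATA dictionary and the local_sum helper are hypothetical, and threads merely stand in for real remote sites.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each site stores only its own (smaller) part of the database.
SITE_DATA = {
    "site_a": [30000, 42000],
    "site_b": [51000],
    "site_c": [28000, 39000, 45000],
}


def local_sum(site):
    """Sub-query executed at one site: sum the salaries stored locally."""
    return sum(SITE_DATA[site])


def global_sum():
    """Global query: run the sub-queries in parallel and combine the results."""
    with ThreadPoolExecutor() as pool:
        partial_sums = list(pool.map(local_sum, SITE_DATA))
    return sum(partial_sums)


total = global_sum()
```

Only the small partial sums travel back to the coordinating site, not the underlying rows.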

Advantages of Distributed Databases:

Interconnection of existing databases: When several databases already exist in an organisation and the need to run global applications arises, a distributed database often offers a natural solution by integrating and interconnecting the pre-existing local databases.

Disadvantages of Distributed Databases:

Software Cost and Complexity: These are far greater in a distributed environment than in a centralised system.

Communication / Processing Overheads: These are much greater because the various sites must exchange messages and perform additional calculations to ensure proper co-ordination among the sites. There is always a risk that the communication overhead will degrade system responsiveness and performance.

Disadvantages of Distributed Databases:

Data Integrity and Consistency: Given the increased complexity of the system and the need for co-ordination among multiple sites, additional control mechanisms are required to prevent improper updating of data and to avoid other data integrity and consistency problems.


Centralized Schematic View:

Centralized Schematic View:

The External schema describes the database views of a set of user groups. Each view typically describes the part of the database that a particular user group is interested in and hides the rest of the database from that group.

The Conceptual schema is a global description of the database that hides the details of physical storage structures and concentrates on describing entities, data attributes, relationships and constraints.

Centralized Schematic View:

The Internal schema is a global description of the physical storage structures of the database.


Distributed Schematic View:

Distributed Schematic View:

At the top level, a distributed database acts conceptually as a single centralised database. The global conceptual schema, which defines all data contained in the distributed database, therefore presents end users with a unified view of the data for the complete distributed system.

Distributed Schematic View:

Each set of end users interacts with the global schema through its own local schema, each conforming to the ANSI-SPARC three-level architecture. These users' external schemas are subsets of the global conceptual schema.

Note that in a centralised database the external schemas describe subsets of the conceptual schema, whilst in a distributed environment the local external schemas describe subsets of the global conceptual schema.

Issues of Distributing a Database:

Data Replication: A DDBMS may hold multiple copies of data at several different sites.

Data Fragmentation (Partition): A relation may be divided into a number of sub-relations (fragments), which are then distributed.

Distribution Transparency: This allows users to perceive the distributed database as a single, logical entity.

Data Replication:

A desirable property of DDBs is the ability to keep a local repository of frequently used data while still being able to access data stored at other network sites.

Replicated Database: a distributed database in which some of the stored data is duplicated at various sites.

Fully Replicated Database: a distributed database in which all stored data is duplicated and allocated to all sites.

Data Replication – Key Points:

Replication affects read-only and update applications differently:

Read-only applications benefit from replication, which makes it more likely that they can reference data locally.

Update applications may face problems due to replication, since they must update all copies in order to preserve data consistency.

Data Replication – Key Points:

Replicated data enhances locality of reference by satisfying more read-only queries locally. This reduces query response time and traffic on the communication network.

Replicated data provides greater reliability through backup copies from which data can be recovered in the event of media failure.
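
The effect of replication on locality of reference can be sketched in a few lines of Python. The example counts how many read-only requests have to cross the network, first without and then with replication. The site names, data items, placements and workload are all hypothetical.

```python
# Hypothetical placement: which sites hold a copy of each data item.
NO_REPLICATION = {"customers": {"site_a"}, "orders": {"site_b"}}
WITH_REPLICATION = {
    "customers": {"site_a", "site_b"},
    "orders": {"site_a", "site_b"},
}

# Hypothetical read-only workload: (site issuing the read, item read).
READS = [
    ("site_a", "customers"),
    ("site_a", "orders"),
    ("site_b", "customers"),
    ("site_b", "orders"),
]


def remote_reads(reads, placement):
    """Count the reads that cannot be answered from a local copy."""
    return sum(1 for site, item in reads if site not in placement[item])


before = remote_reads(READS, NO_REPLICATION)
after = remote_reads(READS, WITH_REPLICATION)
```

In this toy workload, half of the reads are remote without replication and none are remote once both items are copied to both sites. The price is that every update must now reach two copies.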

Data Replication – Update Strategies:

Proper provision must be made for update operations in a Distributed Database where data replication is used. Two of the update strategies are:

1. Unanimous Agreement Update Strategy

Updates are refused unless they are accepted unanimously by all sites containing a replica.

To present a single-copy image of replicated files, updates are propagated to all replicas immediately.

Because every site holding a replica must accept the proposed update, all of those sites must be available for a modification to be made.
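
A minimal Python sketch of the unanimous agreement strategy follows. It assumes a simplified model in which a site refuses an update only when it is unavailable. The Replica class and the unanimous_update function are hypothetical names introduced for illustration, not part of any DDBMS.

```python
class Replica:
    """One site holding a copy of the replicated data (hypothetical model)."""

    def __init__(self, name, available=True):
        self.name = name
        self.available = available
        self.data = {}

    def accepts(self, key, value):
        # An unavailable site cannot vote, so it counts as a refusal.
        return self.available

    def apply(self, key, value):
        self.data[key] = value


def unanimous_update(replicas, key, value):
    """Apply the update everywhere, or nowhere if any replica refuses."""
    if not all(replica.accepts(key, value) for replica in replicas):
        return False
    for replica in replicas:
        replica.apply(key, value)
    return True


sites = [Replica("A"), Replica("B"), Replica("C")]
first = unanimous_update(sites, "balance", 100)   # all sites available
sites[2].available = False
second = unanimous_update(sites, "balance", 200)  # one site is down
```

The second update is refused because one replica is unavailable, so every copy still holds the value written by the first update.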

Data Replication – Update Strategies:

2. Single Primary Update Strategy

One replica is designated as the PRIMARY and the remaining replicas as SECONDARIES.

Update requests are issued to the primary replica, which serialises the updates and passes them on to the secondary replicas, thereby preserving data consistency.
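
The single primary strategy can be sketched in Python as below. The Primary and Secondary classes and the use of a sequence number to express the serial order are assumptions made for illustration. A real system would also have to handle failure of the primary, which this sketch ignores.

```python
class Secondary:
    """A secondary replica that applies updates in the order it receives them."""

    def __init__(self):
        self.data = {}
        self.log = []  # sequence numbers in the order they were applied

    def apply(self, sequence, key, value):
        self.log.append(sequence)
        self.data[key] = value


class Primary:
    """The primary replica: every update request goes through it."""

    def __init__(self, secondaries):
        self.data = {}
        self.secondaries = secondaries
        self.sequence = 0

    def update(self, key, value):
        # Number the update so that all replicas see the same serial order.
        self.sequence += 1
        self.data[key] = value
        for secondary in self.secondaries:
            secondary.apply(self.sequence, key, value)
        return self.sequence


secondaries = [Secondary(), Secondary()]
primary = Primary(secondaries)
primary.update("stock", 10)
primary.update("stock", 7)
```

Both secondaries end up with the same value and the same update order as the primary.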

Data Fragmentation:

Data fragmentation is the division of relations into fragments for distribution.

Some Advantages:

Usage: Since applications usually work with views rather than entire relations, it makes sense to use subsets of relations as the unit of distribution.

Efficiency: If a relation can be decomposed into fragments, a number of transactions can execute concurrently.

Parallelism: A single query can be split into a set of subqueries that operate on fragments and execute in parallel.

Data Fragmentation:

Some Disadvantages:

Integrity: Integrity checking can become more complex if data and functional dependencies are fragmented and distributed to different sites.

Performance: Global applications that require data from fragments at different sites can be slower to process.

Data Fragmentation – 3 Rules:

Completeness: Each data item from a global relation R must appear in at least one of its fragments. This rule ensures no loss of data during fragmentation.

Reconstruction: It must always be possible to reconstruct each global relation from its fragments. This rule ensures no loss of functional dependencies.

Disjointness: Each data item from a global relation should appear in only one of its fragments, except in vertical fragmentation, where the primary key attributes must be repeated to allow reconstruction. This rule ensures minimal data redundancy.

Data Fragmentation – Options:

Horizontal Fragmentation

A horizontal fragment of a relation R is a subset of the tuples in that relation.

Using the RESTRICT operation, horizontal fragmentation divides a relation horizontally by grouping subsets of tuples, where each subset (fragment) is specified by a condition on one or more attributes of the relation.

These fragments can then be assigned to different sites in the distributed system.

The relation can be reconstructed by applying the UNION operation to its fragments.

Data Fragmentation – Example:

We may define three horizontal fragments on the EMPLOYEE relation with the following conditions:

(DNO = 10)

(DNO = 20)

(DNO = 30)

This fragmentation satisfies the three rules: Completeness, Reconstruction and Disjointness.
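
The horizontal fragmentation of EMPLOYEE by DNO can be sketched in Python as follows. The employee tuples are invented for illustration, and restrict is a hypothetical helper standing in for the RESTRICT operation. The three rule checks hold here only because every sample tuple has a DNO of 10, 20 or 30.

```python
# Hypothetical EMPLOYEE tuples: (ENO, ENAME, DNO).
EMPLOYEE = {
    (1, "Asha", 10),
    (2, "Bikash", 20),
    (3, "Chandra", 30),
    (4, "Dipa", 10),
}


def restrict(relation, dno):
    """RESTRICT: keep only the tuples whose DNO matches."""
    return {row for row in relation if row[2] == dno}


# One horizontal fragment per condition (DNO = 10), (DNO = 20), (DNO = 30).
fragments = [restrict(EMPLOYEE, dno) for dno in (10, 20, 30)]

# UNION of the fragments rebuilds the relation.
rebuilt = set().union(*fragments)

# Completeness: every tuple appears in at least one fragment.
completeness = all(any(row in f for f in fragments) for row in EMPLOYEE)
# Reconstruction: the union of the fragments equals the original relation.
reconstruction = rebuilt == EMPLOYEE
# Disjointness: no tuple is counted in more than one fragment.
disjointness = sum(len(f) for f in fragments) == len(rebuilt)
```

If an employee had a DNO outside the three conditions, the completeness and reconstruction checks would fail, which is exactly what those rules are meant to catch.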

Data Fragmentation – Options:

Vertical Fragmentation

A vertical fragment of a relation R groups together certain attributes of the relation using the PROJECT operation.

With vertical fragmentation, some columns of a relation are projected into one fragment and the other columns are projected into one or more other fragments.

A set of vertical fragments whose projection lists L1, L2, ... include all attributes of R but share only the primary key attribute of R is called a complete vertical fragmentation of R.

Data Fragmentation – Options:

Vertical Fragmentation

To reconstruct relation R from a complete vertical fragmentation, we apply the natural JOIN operation to the fragments. The fragments must therefore share a common attribute (normally the primary key) so that the original relation can be reconstructed if required.

Data Fragmentation – Example:

We fragment the EMPLOYEE relation into two vertical fragments: the first contains personal information (ENAME, BDATE, ADDRESS) and the second contains work-related information (ENO, SALARY, DNO).

* The primary key attribute ENO is also needed in the personal information fragment.
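
A possible Python sketch of this vertical fragmentation is shown below. The sample rows are invented, and project and natural_join are hypothetical helpers standing in for the PROJECT and natural JOIN operations. ENO is included in both fragments so that the join has a common attribute to work with.

```python
# Hypothetical EMPLOYEE rows.
EMPLOYEE = [
    {"ENO": 1, "ENAME": "Asha", "BDATE": "1990-01-15",
     "ADDRESS": "Kathmandu", "SALARY": 50000, "DNO": 10},
    {"ENO": 2, "ENAME": "Bikash", "BDATE": "1988-07-02",
     "ADDRESS": "Pokhara", "SALARY": 42000, "DNO": 20},
]


def project(relation, attributes):
    """PROJECT: keep only the listed attributes of every row."""
    return [{a: row[a] for a in attributes} for row in relation]


def natural_join(left, right, key):
    """Natural JOIN of two fragments on their shared key attribute."""
    by_key = {row[key]: row for row in right}
    return [{**row, **by_key[row[key]]} for row in left if row[key] in by_key]


# ENO appears in both fragments so that the relation can be rebuilt.
personal = project(EMPLOYEE, ["ENO", "ENAME", "BDATE", "ADDRESS"])
work = project(EMPLOYEE, ["ENO", "SALARY", "DNO"])

rebuilt = natural_join(personal, work, "ENO")
```

Without ENO in the personal fragment there would be no way to tell which salary belongs to which name.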

Data Fragmentation – Options:

Mixed Fragmentation

Mixed fragmentation combines the two types of fragmentation discussed above.

The original relation can be reconstructed by applying UNION and JOIN operations to the fragments in the appropriate order.
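
One way to picture mixed fragmentation is the Python sketch below, which first splits a relation horizontally by DNO and then splits each horizontal fragment vertically. The sample rows and the helper names restrict, project and join_on_eno are hypothetical. Reconstruction here applies JOIN first and UNION second, which is the appropriate order for this particular fragmentation.

```python
# Hypothetical EMPLOYEE rows.
EMPLOYEE = [
    {"ENO": 1, "ENAME": "Asha", "SALARY": 50000, "DNO": 10},
    {"ENO": 2, "ENAME": "Bikash", "SALARY": 42000, "DNO": 20},
    {"ENO": 3, "ENAME": "Chandra", "SALARY": 61000, "DNO": 10},
]


def restrict(relation, dno):
    """RESTRICT: horizontal fragment holding one department's rows."""
    return [row for row in relation if row["DNO"] == dno]


def project(relation, attributes):
    """PROJECT: vertical fragment holding only the listed attributes."""
    return [{a: row[a] for a in attributes} for row in relation]


def join_on_eno(left, right):
    """JOIN two vertical fragments on the shared primary key ENO."""
    by_eno = {row["ENO"]: row for row in right}
    return [{**row, **by_eno[row["ENO"]]} for row in left if row["ENO"] in by_eno]


# Fragment horizontally by DNO, then vertically within each horizontal fragment.
fragments = {}
for dno in (10, 20):
    horizontal = restrict(EMPLOYEE, dno)
    fragments[dno] = (
        project(horizontal, ["ENO", "ENAME"]),
        project(horizontal, ["ENO", "SALARY", "DNO"]),
    )

# Reconstruct: JOIN the vertical pieces first, then UNION the horizontal ones.
rebuilt = []
for names, pay in fragments.values():
    rebuilt.extend(join_on_eno(names, pay))

same = sorted(rebuilt, key=lambda r: r["ENO"]) == sorted(EMPLOYEE, key=lambda r: r["ENO"])
```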


Data Fragmentation – Example:

Transparency in Distributed Databases:

An important aspect of a distributed database is hiding the details of data distribution from its users. This allows them to perceive the distributed database as a single, logical entity.

Types of Transparency:

Distribution Transparency

Replication Transparency

Fragmentation Transparency

Transparency in Distributed Databases:

Distribution Transparency: Users should be able to write global queries and transactions as though the database were centralised, without having to specify the sites at which the data referenced in the query resides.

Replication Transparency: Where data is replicated, the system should manage the copies, and users should normally act as if there were a single copy of the data. However, there may be situations where users should be made aware of the existence of copies (but not of their placement).

Transparency in Distributed Databases:

Fragmentation Transparency: When database relations are fragmented, the DDBMS must handle user queries that were specified on entire relations but now have to be performed on sub-relations. In other words, the issue is one of finding a query processing strategy based on fragments rather than on relations.
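
A small Python sketch may help to show what a fragment-based query strategy looks like. The user calls one function as if EMPLOYEE were a single relation, and the function decides which fragments to consult. The FRAGMENTS dictionary, the sample rows and the select_employees function are hypothetical.

```python
# Hypothetical horizontal fragments of EMPLOYEE, keyed by the DNO they hold.
FRAGMENTS = {
    10: [{"ENO": 1, "SALARY": 50000, "DNO": 10},
         {"ENO": 4, "SALARY": 38000, "DNO": 10}],
    20: [{"ENO": 2, "SALARY": 42000, "DNO": 20}],
    30: [{"ENO": 3, "SALARY": 61000, "DNO": 30}],
}


def select_employees(predicate, dno=None):
    """Answer a query on the whole EMPLOYEE relation using only its fragments.

    If the query fixes DNO, only the matching fragment is consulted;
    otherwise every fragment is scanned and the partial results are combined.
    """
    wanted = [dno] if dno is not None else list(FRAGMENTS)
    result = []
    for fragment_dno in wanted:
        rows = FRAGMENTS.get(fragment_dno, [])
        result.extend(row for row in rows if predicate(row))
    return result


well_paid = select_employees(lambda row: row["SALARY"] > 40000)
dept_10 = select_employees(lambda row: True, dno=10)
```

The first query touches all three fragments, while the second touches only the fragment for department 10. The caller never names a fragment or a site in either case.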

Other Issues in Distributed Databases (1):

Distributed Query Processing

One of the most important additional factors to consider is the cost of transferring data (including intermediate relations and final results) over the network.

Many DDBMS query optimisation algorithms therefore treat reducing the amount of data transferred as the main criterion when choosing a distributed query execution strategy.
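
The role of data transfer cost can be illustrated with a small worked example in Python. It assumes EMPLOYEE is stored at site 1, DEPARTMENT at site 2, and the result of joining them is needed at site 3. All relation sizes below are invented figures, and the three strategies are only a sample of the options a real optimiser would consider.

```python
# Hypothetical sizes in bytes (rows x bytes per row).
employee_bytes = 10_000 * 100    # EMPLOYEE, stored at site 1
department_bytes = 100 * 35      # DEPARTMENT, stored at site 2
result_bytes = 10_000 * 40       # join result, needed at site 3

# Bytes sent over the network under each candidate strategy.
strategies = {
    "ship both relations to site 3": employee_bytes + department_bytes,
    "ship EMPLOYEE to site 2, then result to site 3": employee_bytes + result_bytes,
    "ship DEPARTMENT to site 1, then result to site 3": department_bytes + result_bytes,
}

# Choose the strategy that transfers the least data.
best = min(strategies, key=strategies.get)
```

With these figures the three strategies transfer 1,003,500, 1,400,000 and 403,500 bytes respectively, so moving the small DEPARTMENT relation to the site of the large one is cheapest.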

Other Issues in Distributed Databases (2):

Distributed Concurrency and Recovery

Dealing with multiple copies of data items:

An extension of centralised locking, in which a particular copy of each data item is designated as the distinguished copy.

Distributed concurrency control based on voting, in which there is no distinguished copy and lock requests are sent to all sites.

Distributed commit
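
The voting approach can be sketched in Python as below. The sketch assumes that a transaction obtains the lock when a majority of the sites holding a copy vote yes. That majority rule, the CopySite class and the acquire_by_voting function are assumptions made for illustration. Timeouts and message failures are ignored.

```python
class CopySite:
    """A site holding one copy of a data item, with its own lock (hypothetical model)."""

    def __init__(self):
        self.locked_by = None

    def request_lock(self, txn):
        # Vote yes if the copy is free or already locked by the same transaction.
        if self.locked_by in (None, txn):
            self.locked_by = txn
            return True
        return False

    def release(self, txn):
        if self.locked_by == txn:
            self.locked_by = None


def acquire_by_voting(sites, txn):
    """Send the lock request to every site; succeed on a majority of yes votes."""
    votes = sum(site.request_lock(txn) for site in sites)
    if votes > len(sites) // 2:
        return True
    for site in sites:  # not enough votes: give back any locks obtained
        site.release(txn)
    return False


sites = [CopySite(), CopySite(), CopySite()]
t1_granted = acquire_by_voting(sites, "T1")  # all three copies are free
t2_granted = acquire_by_voting(sites, "T2")  # T1 holds every copy
```

Because no copy is distinguished, the failure of any single site does not by itself prevent a lock from being granted.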

Other Issues in Distributed Databases (3):

Distributed Concurrency and Recovery

Distributed deadlock

Recovery from failure of individual sites

Recovery from failure of communication links

etc.


Thank you!!!

Questions are WELCOME
