Academic Year 2014 Spring


Transcript of Academic Year 2014 Spring

  • Slide 1
  • Academic Year 2014 Spring
  • Slide 2
  • MODULE CC3005NI: Advanced Database Systems. Distributed Database (DDB) and Data Mining (DM). Academic Year 2014 Spring
  • Slide 3
  • Definition of Distributed Database: A Distributed Database (DDB) is a single logical database that is physically distributed across computers in multiple locations connected by a computer network. A Distributed Database Management System (DDBMS) is the software system that manages the DDB and makes the distribution transparent to users.
  • Slide 4
  • Distributed vs. Decentralized DBs: A decentralised database is also stored on computers at multiple locations. However, the computers are not connected by a network, so data cannot be shared by users at different locations. A decentralised database is therefore best regarded as a collection of independent databases rather than as the geographical distribution of a single database.
  • Slide 5
  • Advantages of Distributed Databases: Reflecting the distributed nature of some database applications. Many database applications are naturally distributed over different locations (e.g. a company may have locations in different cities). Increased reliability and availability. When a centralised system fails, the database is unavailable to all users, whereas a distributed system will continue to function at some reduced level even when a component fails.
  • Slide 6
  • Advantages of Distributed Databases: Local Control. Data distribution in a distributed database encourages local groups to exercise greater control over their own data. This promotes improved data integrity and administration, and users can still access non-local data when necessary. Lower Communication Costs. With a distributed system, data can be located closer to its point of use, which can reduce communication costs compared to a centralised system.
  • Slide 7
  • Advantages of Distributed Databases: Fast response and improved performance. Distributing a large database over multiple sites leaves a smaller database at each site. Some user queries at a particular site may only need to access the smaller database stored locally, which speeds up query processing and enhances database performance. It may also be possible to decompose complex queries into sub-queries that can be processed in parallel at several different sites.
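The idea of decomposing a query into sub-queries that run in parallel at different sites can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the site names, the order amounts and the functions `local_count` and `global_count` are hypothetical, and threads stand in for separate database sites.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each "site" holds only its own local order amounts.
SITES = {
    "site_a": [120, 80, 300],
    "site_b": [50, 150],
    "site_c": [500],
}


def local_count(site, threshold):
    """Sub-query: count the orders above the threshold at one site."""
    return sum(1 for amount in SITES[site] if amount > threshold)


def global_count(threshold):
    """Global query: run one sub-query per site and add up the partial counts."""
    with ThreadPoolExecutor() as pool:
        return sum(pool.map(lambda site: local_count(site, threshold), SITES))


print(global_count(100))  # 4
```

Each sub-query touches only local data, so the only values that would cross the network in a real system are the small partial counts.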
  • Slide 8
  • Advantages of Distributed Databases: Interconnection of existing databases. When several databases already exist in an organisation and the need to run global applications arises, a distributed database often offers a natural solution by integrating and interconnecting the pre-existing local databases.
  • Slide 9
  • Disadvantages of Distributed Databases: Software Cost and Complexity. These are far greater for a distributed environment than for a centralised system. Communication / Processing Overheads. These are much greater because the various sites must exchange messages and perform additional calculations to ensure proper co-ordination among the sites. There is always a risk that the communication overhead will degrade system responsiveness and performance.
  • Slide 10
  • Disadvantages of Distributed Databases: Data Integrity and Consistency. Given the increased complexity of the system and the need for co-ordination across multiple sites, additional control mechanisms are required to prevent improper updating of data and to avoid other data integrity / consistency problems.
  • Slide 11
  • Centralized Schematic View:
  • Slide 12
  • Centralized Schematic View: The External Schema. The external schema describes the database views of a set of user groups. Each view typically describes the part of the database that a particular user group is interested in and hides the rest of the database from that group. The Conceptual Schema. The conceptual schema is a global description of the database that hides the details of physical storage structures and concentrates on describing entities, data attributes, relationships and constraints.
  • Slide 13
  • Centralized Schematic View: The Internal Schema. The internal schema is a global description of the physical storage structures of the database.
  • Slide 14
  • Distributed Schematic View:
  • Slide 15
  • Distributed Schematic View: At the top level, a distributed database acts conceptually as a single centralised database. The global conceptual schema, which defines all data contained in the distributed database, therefore presents end users with a unified view of the data for the complete distributed system.
  • Slide 16
  • Distributed Schematic View: Sets of end users interact with the global schema through their own local schemas, each conforming to the ANSI-SPARC three-level architecture. These users' external schemas are subsets of the global conceptual schema. Note that in a centralised database external schemas describe subsets of the conceptual schema, whilst in a distributed environment local external schemas describe subsets of the global conceptual schema.
  • Slide 17
  • Issues of Distributing a Database: Data Replication. A DDBMS may hold multiple copies of data at several different sites. Data Fragmentation (Partition). A relation may be divided into a number of sub-relations (fragments), which are then distributed. Distribution Transparency. This allows users to perceive the distributed database as a single, logical entity.
  • Slide 18
  • Data Replication: A desirable property of DDBs is the ability to keep a local repository of frequently used data while still being able to access data stored at other network sites. A Replicated Database is a distributed database in which some of the stored data is duplicated at various sites. A Fully Replicated Database is a distributed database in which all stored data is duplicated and allocated to all sites.
  • Slide 19
  • Data Replication Key Points: Replication affects read-only and update applications differently. Read-only applications benefit from replication, which makes it more likely that they can reference data locally. Update applications may encounter problems due to replication, since they must update all copies in order to preserve data consistency.
  • Slide 20
  • Data Replication Key Points: Replicated data enhances locality of reference by satisfying more read-only queries locally. This reduces query response time and reduces traffic on the communication network. Replicated data also provides greater reliability through backup copies from which data can be recovered in the event of media failures.
  • Slide 21
  • Data Replication Update Strategies: Proper provision must be made for update operations in a distributed database where data replication is used. Two of the update strategies are: 1. Unanimous Agreement Update Strategy. Updates are refused unless they have unanimous acceptance from all sites containing a replica. In order to present a single-copy image of replicated files, updates are propagated to all replicas immediately. Unanimous acceptance of the proposed update by all sites holding replicas is necessary before modifications are made, and all of those sites must be available for this to happen.
  • Slide 22
  • Data Replication Update Strategies: 2. Single Primary Update Strategy. One replica is designated as the PRIMARY and the remaining replicas as SECONDARIES. Update requests are issued to the primary replica, which serialises the updates to the secondary replicas and thereby preserves data consistency.
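The two update strategies can be contrasted with a minimal Python sketch. The `Replica` class and both functions are hypothetical names introduced for illustration, and the handling of unavailable secondaries is an assumption, since the slides do not say what happens to them.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One site's copy of the replicated data (hypothetical model)."""
    name: str
    available: bool = True
    data: dict = field(default_factory=dict)

    def apply(self, key, value):
        self.data[key] = value


def unanimous_update(replicas, key, value):
    """Unanimous agreement: apply the update at every site, or at none."""
    # In this sketch a site accepts an update simply by being available.
    if not all(r.available for r in replicas):
        return False  # refused: at least one replica could not accept
    for r in replicas:
        r.apply(key, value)
    return True


def primary_update(primary, secondaries, key, value):
    """Single primary: the primary takes the update and forwards it in order."""
    if not primary.available:
        return False
    primary.apply(key, value)
    for r in secondaries:
        # Assumption: an unavailable secondary is skipped and catches up later.
        if r.available:
            r.apply(key, value)
    return True


sites = [Replica("A"), Replica("B"), Replica("C", available=False)]
print(unanimous_update(sites, "x", 1))              # False
print(primary_update(sites[0], sites[1:], "x", 1))  # True
print(sites[1].data, sites[2].data)                 # {'x': 1} {}
```

With one site down, the unanimous strategy refuses the update outright, while the single-primary strategy still accepts it as long as the primary is reachable.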
  • Slide 23
  • Data Fragmentation: Data fragmentation is basically the division of relations into fragments for distribution. Some Advantages: Usage. Since applications usually work with views rather than entire relations, it makes sense to use subsets of relations as the unit of distribution. Efficiency. If a relation can be decomposed into fragments, a number of transactions can execute concurrently. Parallelism. Parallel execution can be realised by splitting a single query into a set of sub-queries that operate on fragments.
  • Slide 24
  • Data Fragmentation: Some Disadvantages: Integrity. Integrity checking can become more complex if data and functional dependencies are fragmented and distributed to different sites. Performance. Global applications that require data from fragments at different sites can be slower to process.
  • Slide 25
  • Data Fragmentation 3 Rules: Completeness: Each data item from a global relation R must appear in at least one of its fragments. This rule ensures no loss of data during fragmentation. Reconstruction: It must always be possible to reconstruct each global relation from its fragments. This rule ensures no loss of functional dependencies. Disjointness: Each data item from a global relation should appear in only one of its fragments, except in vertical fragmentation, where primary key attributes must be repeated to allow reconstruction. This rule ensures minimal data redundancy.
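The three rules can be checked mechanically. The sketch below is an illustration for the simple case where a relation is split by rows, with each fragment being a subset of tuples. The relation `staff`, its values and the three function names are hypothetical.

```python
def is_complete(relation, fragments):
    """Completeness: every tuple of R appears in at least one fragment."""
    return set(relation) <= set().union(*fragments)


def is_reconstructible(relation, fragments):
    """Reconstruction: the union of the fragments gives back exactly R."""
    return set().union(*fragments) == set(relation)


def is_disjoint(fragments):
    """Disjointness: no tuple appears in more than one fragment."""
    total = sum(len(set(f)) for f in fragments)
    return total == len(set().union(*fragments))


# Hypothetical global relation: (staff_id, name, city).
staff = [
    (1, "Asha", "Kathmandu"),
    (2, "Bikash", "Pokhara"),
    (3, "Chandra", "Kathmandu"),
]

# Split the rows by city, one fragment per site.
kathmandu = [t for t in staff if t[2] == "Kathmandu"]
pokhara = [t for t in staff if t[2] == "Pokhara"]
fragments = [kathmandu, pokhara]

print(is_complete(staff, fragments))         # True
print(is_reconstructible(staff, fragments))  # True
print(is_disjoint(fragments))                # True
print(is_disjoint([kathmandu, staff]))       # False
```

The last call shows a violation: the Kathmandu tuples appear in both fragments, so the split is redundant rather than disjoint.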
  • Slide 26
  • Horizontal Fragmentation: A horizontal fragment of a relation R is a subset of the tuples in that relation. Using the RESTRICT operation, horizontal fragmentation divides a relation horizon