About the Instructor


Page 1: About the Instructor

About the Instructor

Name: Gong Zhiguo
Office: N512
Phone: 83974962
E-Mail: [email protected]

Remark: Some of the slides are tailored from the slides by Prof. Hector Garcia-Molina

Page 2: About the Instructor

From File Processing to DBMS

[Figure: two architectures compared.
File processing: Program 1 (deposit, withdraw), Program 2 (transfer), Program 3 (printing statements), and Program 4 (customer information) work on three separate files: current accounts, saving accounts, and customers.
DBMS: the same four programs reach a single bank database through the DBMS.]

Page 3: About the Instructor

DDBS = Database + Networking

Distributed database system technology is the union of what appear to be diametrically opposed approaches to data processing: database system and computer network technologies.

A database system aims at integrating the operational data of an enterprise and at providing centralized, controlled access to that data.

The technology of computer networks promotes a mode of work that goes against all centralization efforts and facilitates distributed computing.

Page 4: About the Instructor

Distributed Computing System

A distributed computing system consists of a number of autonomous processing elements (not necessarily homogeneous) that are interconnected by a computer network and cooperate in performing their assigned tasks.

What is distributed?
Processing logic
Function
Data
Control

All of these are necessary and important for distributed database technology.

Page 5: About the Instructor

Distributed DBMS Environment

[Figure: Sites 1 through 6 connected by a communication network.]

Page 6: About the Instructor

Distributed Database System

A distributed database is a collection of multiple, logically interrelated databases distributed over a computer network. It stores data on multiple computers (nodes) over the network and permits access from any node to the joint data.

A distributed database management system (DDBMS) is a software system that permits the management of the distributed databases and makes the distribution transparent to the users.
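The idea of making distribution transparent can be illustrated with a minimal sketch. The class names `Site` and `DistributedDB` and the placement rule are assumptions made for illustration only; they are not taken from the slides and do not describe any real DDBMS.

```python
# Hypothetical illustration: the caller uses put/get and never names a site.
class Site:
    """One node holding a local key-value store."""
    def __init__(self, name):
        self.name = name
        self.data = {}

class DistributedDB:
    """Facade that hides which site stores each key."""
    def __init__(self, sites):
        self.sites = sites

    def _site_for(self, key):
        # Assumed placement rule: sum of character codes modulo the site count.
        return self.sites[sum(map(ord, key)) % len(self.sites)]

    def put(self, key, value):
        self._site_for(key).data[key] = value

    def get(self, key):
        return self._site_for(key).data.get(key)

sites = [Site("site1"), Site("site2"), Site("site3")]
db = DistributedDB(sites)
db.put("acct-1", 100)
db.put("acct-2", 250)
print(db.get("acct-1"))
```

The user-facing calls look the same as for a single local store; only `_site_for` knows about the nodes.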

Page 7: About the Instructor

What is not a Distributed Database System?

A DDBS is not a "collection of files" that can be individually stored at each node of a computer network:
files are not logically related
no access via common interface

Page 8: About the Instructor

Centralized DBMS on a Network

data resides only at one node
the database management is no different from a centralized DBMS
remote processing, single server, multiple clients

[Figure: Sites 1 through 6 connected by a communication network.]

Page 9: About the Instructor

Client-Server Systems (or how to partition software)

Software layers: Application, Front End, Query Processor, Transaction Processing, File Access

[Figure: the layers are divided between client and server.]


Page 12: About the Instructor

Transaction Servers

Clients ship transactions consisting of one or more SQL commands

E.g., Open Database Connectivity (ODBC), a standard API
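A transaction made of several SQL commands can be sketched as follows. This uses Python's built-in `sqlite3` module as an in-process stand-in for a remote transaction server; the `account` table, its contents, and the `transfer` function are assumptions for illustration, and an actual ODBC client would go through an ODBC driver instead.

```python
import sqlite3

# sqlite3 stands in for a remote server here (assumption for the example).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 500), (2, 100)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Ship one transaction made of two SQL commands; both apply or neither."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src)
        )
        conn.execute(
            "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst)
        )

transfer(conn, 1, 2, 200)
balances = dict(conn.execute("SELECT id, balance FROM account"))
print(balances)
```

The two `UPDATE` statements travel as one unit of work, which is the point of a transaction server: the client sends commands, not pages or records.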

Page 13: About the Instructor

Data Servers

Client requests pages or records
Popular for OODB systems

Page 14: About the Instructor

Multiprocessor Systems (Parallel Server)

Shared Memory (SMP): Sequent, SGI, Sun
Shared Disk: VMScluster, Sysplex
Shared Nothing (network): Tandem, Teradata, SP2

[Figure: one diagram per architecture, each showing clients, processors, and memory.]

Page 15: About the Instructor

Parallel or distributed DB system?

More similarities than differences!

Page 16: About the Instructor

Typically, parallel DBs:

Fast interconnect
Homogeneous software
High performance is goal
Transparency is goal

Page 17: About the Instructor

Typically, distributed DBs:

Geographically distributed
Data sharing is goal (may run into heterogeneity, autonomy)
Disconnected operation possible

Page 18: About the Instructor

Query processing in parallel DBs:

Typically: we can distribute/partition/sort... data to make certain DB operations (e.g., Join) fast
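As a rough sketch of how partitioning can speed up a join, the example below hash-partitions both inputs on the join key so that each partition pair can be joined independently (and, in a parallel DBMS, on a separate processor). The function names and the sample rows are assumptions for illustration; the partitions are processed in a plain loop here rather than in parallel.

```python
from collections import defaultdict

def hash_partition(rows, key, n):
    """Split rows into n buckets by hashing the join key."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts

def local_join(r_part, s_part, key):
    """Plain hash join on one pair of partitions."""
    index = defaultdict(list)
    for r in r_part:
        index[r[key]].append(r)
    return [{**r, **s} for s in s_part for r in index.get(s[key], [])]

def partitioned_join(r, s, key, n=3):
    """Rows with equal keys land in the same bucket, so pairs join independently."""
    out = []
    for rp, sp in zip(hash_partition(r, key, n), hash_partition(s, key, n)):
        out.extend(local_join(rp, sp, key))  # each pair could run on its own node
    return out

# Made-up sample data.
customers = [{"cid": 1, "name": "Ann"}, {"cid": 2, "name": "Bo"}]
accounts = [
    {"cid": 1, "balance": 10},
    {"cid": 2, "balance": 20},
    {"cid": 1, "balance": 30},
    {"cid": 3, "balance": 5},
]
result = partitioned_join(customers, accounts, "cid")
print(sorted((row["name"], row["balance"]) for row in result))
```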

Page 19: About the Instructor

Query processing in distributed DBs:

Typically: we are given the data distribution; we need to find a query processing strategy that minimizes cost (e.g., communication cost)
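A toy cost model makes the "given distribution, minimize communication" idea concrete. The sketch below assumes that cost is simply the number of bytes shipped between sites; the relation names `EMP` and `PROJ` and all sizes are invented for the example, and real optimizers use far richer cost models.

```python
def plan_costs(relations, result_site, result_size):
    """Bytes shipped if the join runs at each candidate site.

    relations: {name: (site, size_in_bytes)}; placement is given, not chosen.
    """
    sites = {site for site, _ in relations.values()} | {result_site}
    costs = {}
    for join_site in sites:
        # Ship every relation that is not already at the join site.
        shipped = sum(size for site, size in relations.values() if site != join_site)
        # Ship the result if it is produced somewhere other than where it is needed.
        if join_site != result_site:
            shipped += result_size
        costs[join_site] = shipped
    return costs

# Hypothetical relations and sizes.
relations = {"EMP": ("Macau", 1_000_000), "PROJ": ("Beijing", 50_000)}
costs = plan_costs(relations, result_site="Beijing", result_size=20_000)
best = min(costs, key=costs.get)
print(best, costs[best])
```

Under these assumed sizes it is cheaper to join at the site of the large relation and ship the small result back than to join where the result is needed.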

Page 20: About the Instructor

Cloud Computing

Is CC just a marketing term?
utility (like power)
data or CPU cycles?
many processors, many storage units
business model

Page 21: About the Instructor

Cloud Computing (M. Armbrust, A View of Cloud Computing, Communications of the ACM)

Larry Ellison (Oracle CEO): “The interesting thing about cloud computing is that we’ve redefined cloud computing to include everything that we already do….”

Cloud computing: both the applications delivered as services over the internet and the hardware and systems software in the data centers that provide those services.

Grid computing: protocols to offer shared computation and storage over long distances, but within a community.

Page 22: About the Instructor

Is CC a subset, superset, disjoint from, or overlaps with:

grid computing
distributed computing
Web 2.0
Cluster Computing
Peer-to-peer computing
software as a service
client-server computing
data center as a computer
massively parallel computing

[Figure: four diagrams labeled (A) to (D), each showing a different relationship to CC.]

Page 23: About the Instructor

Distributed Database System Technology

The key is integration, not centralization
Distributed database technology attempts to achieve integration without centralization

[Figure: database technology (integration) and computer networks (distributed computing) combine into distributed database systems (integration without centralization).]

Page 24: About the Instructor

Example

Multinational manufacturing company:
headquarters in Macau
manufacturing plants in Nanning and Kunming
warehouses in Zhongshan and Dongguan
R&D facilities in Beijing

Data and Information:
employee records (working location)
projects (R&D)
engineering data (manufacturing plants, R&D)
inventory (manufacturing, warehouse)

Page 25: About the Instructor

Promises of Distributed DBMS

transparent management of distributed, fragmented, and replicated data
improved reliability and availability through distributed transactions
improved performance
higher system extendibility

Page 26: About the Instructor

Transparency

Transparency refers to the separation of the higher-level semantics of a system from lower-level implementation details.

From data independence in centralized DBMS to fragmentation transparency in DDBMS.

Issues:
Who should provide transparency?
What is the state of the art in the industry?

Page 27: About the Instructor

Improved Reliability

A distributed DBMS can use replicated components to eliminate single points of failure.

The users can still access part of the distributed database with “proper care” even though some of the data is unreachable.

Distributed transactions facilitate maintenance of consistent database state even when failures occur.

Page 28: About the Instructor

Improved Performance

Since each site handles only a portion of a database, the contention for CPU and I/O resources is not that severe. Data localization reduces communication overheads.

Inherent parallelism of distributed systems may be exploited:
inter-query parallelism
intra-query parallelism

Performance models are not sufficiently developed.

Page 29: About the Instructor

Easier System Expansion

Ability to add new sites, data, and users over time without major restructuring.

Huge centralized database systems (mainframes) are history (almost!).

The PC revolution (Compaq buying Digital, 1998) will create natural distributed processing environments.

New applications (such as supply chain) are naturally distributed; centralized systems will just not work.

Page 30: About the Instructor

Disadvantages of DDBSs

Lack of experience: no true distributed database systems are in operation
Complexity: DDBS problems are inherently more complex than centralized DBMS ones
Cost: more hardware, software and people costs
Distribution of control: problems of synchronization and coordination to maintain data consistency
Security: database security + network security
Difficult to convert: no tools to convert centralized DBMSs to DDBSs

Page 31: About the Instructor

Complicating Factors

Data may be replicated in a distributed environment. Consequently, the DDBS is responsible for:
choosing one of the stored copies of the requested data for access in case of retrievals
making sure that the effect of an update is reflected on each and every copy of that data item

If there is a site or link failure while an update is being executed, the DDBS must make sure that the effects will be reflected on the data residing at the failing or unreachable sites as soon as the system recovers from the failure.
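The three responsibilities above (pick a copy for reads, update every copy, catch up failed sites on recovery) can be sketched in a few lines. This is a simplified illustration under strong assumptions: a single coordinator, no concurrent writers, and an in-memory log of missed updates. The class names `Replica` and `ReplicatedStore` are hypothetical.

```python
class Replica:
    """One stored copy of the data at some site."""
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}

class ReplicatedStore:
    def __init__(self, replicas):
        self.replicas = replicas
        # Updates that each replica missed while it was unreachable.
        self.pending = {r.name: [] for r in replicas}

    def read(self, key):
        # Retrieval: choose one reachable copy (here simply the first one).
        for r in self.replicas:
            if r.up:
                return r.data.get(key)
        raise RuntimeError("no replica reachable")

    def write(self, key, value):
        # Update: apply to every reachable copy, remember it for the others.
        for r in self.replicas:
            if r.up:
                r.data[key] = value
            else:
                self.pending[r.name].append((key, value))

    def recover(self, replica):
        # On recovery, replay the missed updates in their original order.
        replica.up = True
        for key, value in self.pending[replica.name]:
            replica.data[key] = value
        self.pending[replica.name].clear()

a, b = Replica("a"), Replica("b")
store = ReplicatedStore([a, b])
store.write("x", 1)
b.up = False          # site b fails
store.write("x", 2)   # b misses this update
stale = b.data["x"]
store.recover(b)      # b catches up
print(stale, b.data["x"])
```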

Page 32: About the Instructor

Complicating Factors

Maintaining consistency of distributed/replicated data.

Since each site cannot have instantaneous information on the actions currently carried out at other sites, the synchronization of transactions at multiple sites is harder than in a centralized system.

Page 33: About the Instructor

Distributed DBMS Issues

Distributed Database Design
Distributed Query Processing
Distributed Directory Management
Distributed Concurrency Control
Distributed Deadlock Management
Reliability of Distributed Databases
Operating Systems Support
Heterogeneous Databases

Page 34: About the Instructor

Distributed Database Design

The problem is how the database and the applications that run against it should be placed across the sites.

The two fundamental design issues are fragmentation (the separation of the database into partitions called fragments) and allocation (the optimum distribution of fragments across sites). The general problem is NP-hard.
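Fragmentation and allocation can be illustrated with a small sketch of horizontal fragmentation, where rows are split by the value of one attribute and each fragment is placed at the matching site. The employee rows are made-up sample data, and the "store each fragment where its employees work" rule is an assumed allocation, not an optimal one.

```python
# Made-up sample data for illustration.
employees = [
    {"id": 1, "name": "Li", "location": "Macau"},
    {"id": 2, "name": "Chan", "location": "Beijing"},
    {"id": 3, "name": "Wong", "location": "Macau"},
]

def horizontal_fragments(rows, attribute):
    """Fragmentation: group rows by the value of one attribute."""
    fragments = {}
    for row in rows:
        fragments.setdefault(row[attribute], []).append(row)
    return fragments

def reconstruct(fragments):
    """The union of all fragments gives back the original relation."""
    rows = [row for frag in fragments.values() for row in frag]
    return sorted(rows, key=lambda r: r["id"])

fragments = horizontal_fragments(employees, "location")

# Allocation (assumed rule): keep each fragment at the site it describes.
allocation = {value: value for value in fragments}

print(allocation)
print(reconstruct(fragments) == employees)
```

The hard part that makes the general problem NP-hard is choosing fragments and sites to optimize cost across many queries; this sketch only shows the mechanics.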

Page 35: About the Instructor

Distributed Query Processing

Query processing deals with designing algorithms that analyze queries and convert them into a series of data manipulation operations.

The problem is how to decide on a strategy for executing each query over the network in the most cost-effective way, however the cost is defined. The objective is to optimize the strategy while exploiting the inherent parallelism to improve the performance of executing the transaction.

Page 36: About the Instructor

Distributed Directory Management

A directory contains information (such as descriptions and locations) about data items in the database.

A directory may be global to the entire DDBS or local to each site; it may be distributed, kept in multiple copies, etc.
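A directory of this kind can be sketched as a mapping from data item to its description and the sites that hold it. The `Directory` class and its method names are hypothetical; the example models one global directory and derives what a site-local directory would contain.

```python
class Directory:
    """Maps each data item to a description and the sites holding a copy."""

    def __init__(self):
        self._entries = {}

    def register(self, item, description, site):
        entry = self._entries.setdefault(
            item, {"description": description, "sites": []}
        )
        if site not in entry["sites"]:
            entry["sites"].append(site)

    def locate(self, item):
        """Sites where the item is stored, in registration order."""
        return list(self._entries[item]["sites"])

    def local_view(self, site):
        """Items a purely local directory at `site` would list."""
        return sorted(
            item for item, e in self._entries.items() if site in e["sites"]
        )

directory = Directory()
directory.register("inventory", "stock levels per warehouse", "Zhongshan")
directory.register("inventory", "stock levels per warehouse", "Dongguan")
directory.register("projects", "R&D project records", "Beijing")
print(directory.locate("inventory"))
```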

Page 37: About the Instructor

Distributed Concurrency Control

Concurrency control involves the synchronization of accesses to the distributed database, such that the integrity of the database is maintained.

One has to worry not only about the integrity of a single database, but also about the consistency of multiple copies of the database (mutual consistency).

Page 38: About the Instructor

Reliability of Distributed DBMS

It is important that mechanisms be provided to ensure the consistency of the database as well as to detect failures and recover from them.

This may be extremely difficult in the case of network partitioning, where the sites are divided into two or more groups with no communication among them.
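To make network partitioning concrete, the sketch below groups sites into connected components from the list of working links, then applies a majority rule to decide which group may keep accepting updates. The majority rule is one common approach chosen for this example and is not taken from the slides; site names and links are invented.

```python
def connected_groups(sites, links):
    """Group sites that can still reach each other over working links."""
    neighbours = {s: set() for s in sites}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    groups, seen = [], set()
    for start in sites:
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            s = stack.pop()
            if s in group:
                continue
            group.add(s)
            stack.extend(neighbours[s] - group)
        seen |= group
        groups.append(group)
    return groups

def writable_groups(groups, total_sites):
    """Assumed policy: only a group holding a strict majority may update."""
    return [g for g in groups if len(g) > total_sites / 2]

# Invented topology: the link between s3 and s4 has failed.
sites = ["s1", "s2", "s3", "s4", "s5"]
links = [("s1", "s2"), ("s2", "s3"), ("s4", "s5")]
groups = connected_groups(sites, links)
print(groups)
print(writable_groups(groups, len(sites)))
```

With at most one majority group, two isolated groups cannot both apply conflicting updates, at the price of blocking updates in the minority group.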

Page 39: About the Instructor

Relationship among Topics

[Figure: diagram relating Distributed DB Design, Query Processing, Directory Management, Concurrency Control, Deadlock Management, and Reliability.]