Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 ›...

37
1 Mobile and Heterogeneous databases Heterogeneous Distributed Databases Query Processing A.R. Hurson Computer Science Missouri Science & Technology

Transcript of Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 ›...

Page 1: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

1

Mobile and Heterogeneous databases Heterogeneous Distributed Databases

Query Processing

A.R. HursonComputer Science

Missouri Science & Technology

Page 2: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

Note, this unit will be covered in twolectures. In case you finish it earlier, thenyou have the following options:

1) Take the early test and start CS6302.module62) Study the supplement module

(supplement CS6302.module5)3) Act as a helper to help other students in

studying CS6302.module5Note, options 2 and 3 have extra credits as noted in courseoutline.

Heterogeneous Distributed Databases

2

Page 3: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

Glossary of prerequisite topics

Familiar with the topics?No Review CS6302

module5background

Yes

Remedial actionYes

NoTake the Module

Take Test

Pass?

Glossary of topics

Familiar with the topics?Yes

Pass?

Take Test

Yes

Options

Lead a group of students in this module (extra credits)?

Study more advanced related topics (extra credits)?

Study next module?

No

{

Extra Curricular activities

Enforcement of background

{Current Module

At the end: take exam, record the

score, impose remedial action if not

successful

No

Heterogeneous Distributed Databases

3

Page 4: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

You are expected to be familiar with: Heterogeous Distributed Databases, Query processing and Transaction processing in

homogeneous distributed databases If not, you need to study CS6302.module4

Heterogeneous Distributed Databases

4

Page 5: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

At this point you should be fully familiarwith the concept of Heterogeneous DistributedDatabases (multidatabases) and parametersthat distinguish multidatabases from the socalled homogeneous distributed databases.

As a reminder, autonomy and heterogeneityare two major factors that distinguish theaforementioned platforms from each other.

5

Heterogeneous Distributed Databases

Page 6: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

Within the scope of multidatabases you are alsoexpected to be familiar with the “issues” ofconcern.

If you recall, many of these “issues” are tracedback to the “homogeneous distributed databases”.However, autonomy and heterogeneous nature ofmultidatabases make it much harder to deal withthese “issues”.

6

Heterogeneous Distributed Databases

Page 7: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

Finally, You are excepted to be familiarwith different approached to “globalinformation sharing” process and differentsolutions for handling “multidatabases”.

7

Heterogeneous Distributed Databases

Page 8: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

In this module, we will concentrate onquery processing in multidatabases. Thegoal is to identify aspects of queryprocessing in “homogeneous distributeddatabases” that can be transported anddifficulties that are due to the heterogeneousand autonomy characteristics ofmultidatabases.

8

Heterogeneous Distributed Databases

Page 9: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

9

Heterogeneous Distributed Databases

MultiDatabase Systems ─ Query processing

Many of the distribution query processing andoptimization techniques within the scope ofdistributed systems can be carried over tomultidatabases. However, there are someimportant differences.

Let us review query processing in centralizedand distributed databases.

Page 10: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

10

Heterogeneous Distributed Databases

MultiDatabase Systems ─ Query processing Query processing in centralized databases involves

three steps: Query decomposition, Query optimization, and Query execution.

Query processing in distributed databases involves foursteps: Query decomposition/Data localization, Global optimization, Local optimization, and Query execution

Page 11: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

11

Heterogeneous Distributed Databases

MultiDatabase Systems ─ Query processing

Assume the following query and the tworelations involved:Find names of employees who are managing a projectEMP

ENO ENAME Title

ASG

ENO PNO RESP ADDRESS

Page 12: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

12

Heterogeneous Distributed Databases

MultiDatabase Systems ─ Query processing

In SQL the aforementioned query isrepresented as:

SELECT ENAMEFROM EMP, ASGWHERE EMP.ENO = ASG.ENOAND RESP = ‘Manager’;

Page 13: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

13

Heterogeneous Distributed Databases

MultiDatabase Systems ─ Query processing

In relational algebra form the query can berepresented in two forms as follows:Πname(σRESP = “Manager”∧ EMP.NO = ASG.NO(EMP X ASG))

Πname(EMP ENO (σRESP = “Manager” (ASG))

Page 14: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

14

Heterogeneous Distributed Databases

MultiDatabase Systems ─ Query processing

In a centralized database environment, the choice isclear. Second strategy avoids Cartesian product andhence it is much less computing resource intensive thanthe first strategy.

In distributed environment, as we discussed before,other parameters need to be taken into considerations inorder to define a suitable strategy, i.e., Data Transfercost, site computational capability, …

Page 15: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

15

Heterogeneous Distributed Databases

MultiDatabase Systems ─ Query processing Based on the query, location of data sets, size

of the data sets, communication cost,processing capability, … a dynamic strategyshould be laid out.

According to the strategy, then the query isdecomposed into sub-queries.

Sub-queries are sent to the designated sites forexecution.

Page 16: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

16

Heterogeneous Distributed Databases

MultiDatabase Systems ─ Query processing

Furthermore, assume that the relations arehorizontally fragmented as follows: EMP1: σNo ≤ “E3” (EMP) site3

EMP2: σNo > “E3” (EMP) site4

ASG1: σNo ≤ “E3” (ASG) site1

ASG2: σNo > “E3” (ASG) site2

Page 17: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

17

Heterogeneous Distributed Databases

MultiDatabase Systems ─ Query processing Now there are choices to execute this query:

ASG’1 = σRESP= “manager” (ASG1)

Site1

ASG’2 = σRESP= “manager” (ASG2)

Site2

ASG’2ASG’

1

Site3 EMP’1 = EMP1 ENO ASG’

1 Site4 EMP’2 = EMP2 ENO ASG’

2

Result = EMP’1 ∪ EMP’

2Site5

EMP’1 EMP’

2

Page 18: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

18

Heterogeneous Distributed Databases

MultiDatabase Systems ─ Query processing

Result = (EMP1 ∪ EMP2) ENO (σRESP= “manager” (ASG1 ∪ASG2)Site5

ASG1

Site1

ASG2

Site2

EMP1

Site3

EMP2

Site4

Page 19: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

19

Heterogeneous Distributed Databases

MultiDatabase Systems ─ Query processing

Query processing in multidatabases is moredifferent and complicated than the one westudied in traditional distributed databases.This complexity is due to the heterogeneity andlocal autonomy aspects of multidatabases.

Page 20: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

MultiDatabase Systems ─ Query processing

Because of heterogeneity: Data representation in different local databases may be

different, The capability of component databases may be different, Cost of processing queries on different local databases may be

different, There may be difficulties in moving data between local

databases, The local optimization capability of local databases might be

quite different.•••

20

Heterogeneous Distributed Databases

Page 21: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

21

Heterogeneous Distributed Databases

MultiDatabase Systems ─ Query processing

Local autonomy poses additional problems. Forexample: As a result of communication autonomy and/or

association autonomy the local database may terminateits services at any time. This requires query processingmethods that are tolerant to system unavailability.Therefore, the challenge is to respond to a user querywhen the component database is unavailable, unwilling,and uncooperative.

Page 22: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

22

Heterogeneous Distributed Databases

MultiDatabase Systems ─ Query optimization The design autonomy may restrict the

availability and accuracy of statisticalinformation needed in order to carry out thequery optimization.

The execution autonomy may limit theapplication of some query processing andoptimization strategies. For example, it may notbe possible to perform semi-join operation.

Page 23: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

23

Heterogeneous Distributed Databases

MultiDatabase Systems ─ Query processing

Global query is resolved (split) with thehelp of the global schema (schemaintegration). Resolution of the global queryresults in a set of sub-queries to be executedat the local sites.

Page 24: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

24

Heterogeneous Distributed Databases

MultiDatabase Systems ─ Query processing

Query processing at the global level is asequence of four step process: Compilation and translation, Unification decomposition, Optimization, and Translation and execution.

Page 25: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

25

Heterogeneous Distributed Databases

MultiDatabase Systems ─ Query processing Compilation and translation: Query is compiled

and transformed into an internal form. Unification decomposition: Integrated data items

are replaced from corresponding local data itemsalong with inconsistency resolution functions (ifany).

Optimization: The query tree is optimized andanalyzed. At this stage, sub-trees to be resolved bylocal databases are identified.

Translation and execution: Executable sub-trees tobe executed at local databases are constructed andpassed to local systems for execution.

Page 26: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

26

Heterogeneous Distributed Databases

MultiDatabase Systems ─ Query processing

Steps one and four are similar to those intraditional data base systems(centralized/distributed) and hence willnot be discussed further.

Page 27: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

27

Heterogeneous Distributed Databases

MultiDatabase Systems ─ Query processing Unification decomposition: The issue is to

determine how the integrated data can beconstructed and which local data should be usedfor its construction.

The process could be complicated due to theequivalent data items at different local databases─ A simple query that accesses a local data itemmay have to access an arbitrary number of dataitems at other sites because of direct or indirectequivalence relationships.

Page 28: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

28

Heterogeneous Distributed Databases

MultiDatabase Systems ─ Query processing

Similar to traditional distributed databases,two optimization techniques could be used: Heuristic based optimization Cost based optimization

Page 29: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

29

Heterogeneous Distributed Databases

MultiDatabase Systems ─ Query processing Heuristic based optimization: Decompose the

global query into the smallest possible sub-queries where each sub-query is executed at onelocal database (here multiple sub-queries may besent to the same site). Decomposition is relatively easier, More chances to perform global optimization, More work at the global optimizer, More communication between global and

local components.

Page 30: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

30

Heterogeneous Distributed Databases

MultiDatabase Systems ─ Query processing

Heuristic based optimization: Decomposethe global query into the largest possiblesub-queries where each sub-query can beexecuted at one local database. Less work at global optimizer, Fewer messages between global and local

components, More work at the local databases.

Page 31: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

31

Heterogeneous Distributed Databases

MultiDatabase Systems ─ Query processing Cost based optimization: Given a query

Q, its execution plans “execution spaceEQ”, and cost function C on EQ, we want tofind an execution plan eQ∈ EQ that has theminimum cost.

Local autonomy is the key factor thatcomplicates the task beyond itscomplexity in traditional distributeddatabases for two reasons:

Page 32: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

32

Heterogeneous Distributed Databases

MultiDatabase Systems ─ Query processing Cost based optimization

Global database management system may nothave complete cost information about globalsub-queries in order to perform the globaloptimization.

Global database management system interactwith the local database management system atits application program interface level. As aresult, it is unaware of internal data structureand functions of the local database managementsystems.

Page 33: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

33

Heterogeneous Distributed Databases

MultiDatabase Systems ─ Query processing Cost based optimization: Three alternatives can be

used to determine the cost of executing queries atthe local nodes: Treat local nodes as a black box, run some test queries on

them, and from these determine the necessary costinformation.

Use previous knowledge about local node and theirexternal characteristics to determine the cost information,

Monitor the run-time behavior of the local node anddynamically collect the cost information.

Page 34: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

34

Heterogeneous Distributed Databases

MultiDatabase Systems ─ Query processing

Cost estimation of Global sub-queries We can use a logical cost model to estimate

cost of sub-queries: Cost of a simple query (Q) on a relation is: Cost (Q) = C0 + C1 +C2

C0 is the initialization cost C1 is the cost of finding qualifying tuples C2 is the cost of processing selected tuples.

Page 35: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

35

Heterogeneous Distributed Databases

MultiDatabase Systems ─ Query processing C0 is a function of local data base management system C1 is a function of the relation being accessed, and C2 is a function of the number of tuples being returned.

Cost (Q) = c0 + c1 * R+ c2 * R* s

Cardinality of therelation being accessed Selectivity of

the query

R and s are unknown to the global database management system.

Page 36: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

36

Heterogeneous Distributed Databases

MultiDatabase Systems ─ Query processing Cost coefficients c0, c1, and c2 can be derived

by a calibration process ─ run a set of suit ofspecially designed calibration queries, inisolation, on a specially designed calibrationdatabase (synthetic database) on the local site.

This strategy can be extended to the domain ofmore complicated queries and non relationaldatabase systems.

Page 37: Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 › 49225097 › ... · Heterogeneous Distributed. Databases (multidatabases) and parameters

37

Heterogeneous Distributed Databases

MultiDatabase Systems ─ Query processing Cost based optimization: As an alternative

to calibration queries and databases, onecan use probing queries on componentnodes to determine cost information. Thisapproach can be extended to the domain ofthe so called “sample queries” wherequeries can be classified based on differentcriteria and sample queries for each classare issued to derive and measure costinformation.