Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 ›...

1

Mobile and Heterogeneous databases Heterogeneous Distributed Databases

Query Processing

A.R. HursonComputer Science

Missouri Science & Technology

Note, this unit will be covered in twolectures. In case you finish it earlier, thenyou have the following options:

1) Take the early test and start CS6302.module62) Study the supplement module

(supplement CS6302.module5)3) Act as a helper to help other students in

studying CS6302.module5Note, options 2 and 3 have extra credits as noted in courseoutline.

Heterogeneous Distributed Databases

2

Glossary of prerequisite topics

Familiar with the topics?No Review CS6302

module5background

Yes

Remedial actionYes

NoTake the Module

Take Test

Pass?

Glossary of topics

Familiar with the topics?Yes

Pass?

Take Test

Yes

Options

Lead a group of students in this module (extra credits)?

Study more advanced related topics (extra credits)?

Study next module?

No

{

Extra Curricular activities

Enforcement of background

{Current Module

At the end: take exam, record the

score, impose remedial action if not

successful

No


3

You are expected to be familiar with: Heterogeous Distributed Databases, Query processing and Transaction processing in

homogeneous distributed databases If not, you need to study CS6302.module4


4

At this point you should be fully familiarwith the concept of Heterogeneous DistributedDatabases (multidatabases) and parametersthat distinguish multidatabases from the socalled homogeneous distributed databases.

As a reminder, autonomy and heterogeneityare two major factors that distinguish theaforementioned platforms from each other.

5


Within the scope of multidatabases you are alsoexpected to be familiar with the “issues” ofconcern.

If you recall, many of these “issues” are tracedback to the “homogeneous distributed databases”.However, autonomy and heterogeneous nature ofmultidatabases make it much harder to deal withthese “issues”.

6


Finally, You are excepted to be familiarwith different approached to “globalinformation sharing” process and differentsolutions for handling “multidatabases”.

7


In this module, we will concentrate onquery processing in multidatabases. Thegoal is to identify aspects of queryprocessing in “homogeneous distributeddatabases” that can be transported anddifficulties that are due to the heterogeneousand autonomy characteristics ofmultidatabases.

8


9


MultiDatabase Systems ─ Query processing

Many of the distribution query processing andoptimization techniques within the scope ofdistributed systems can be carried over tomultidatabases. However, there are someimportant differences.

Let us review query processing in centralizedand distributed databases.

10


MultiDatabase Systems ─ Query processing Query processing in centralized databases involves

three steps: Query decomposition, Query optimization, and Query execution.

Query processing in distributed databases involves foursteps: Query decomposition/Data localization, Global optimization, Local optimization, and Query execution

11



Assume the following query and the tworelations involved:Find names of employees who are managing a projectEMP

ENO ENAME Title

ASG

ENO PNO RESP ADDRESS

12



In SQL the aforementioned query isrepresented as:

SELECT ENAMEFROM EMP, ASGWHERE EMP.ENO = ASG.ENOAND RESP = ‘Manager’;

13



In relational algebra form the query can berepresented in two forms as follows:Πname(σRESP = “Manager”∧ EMP.NO = ASG.NO(EMP X ASG))

Πname(EMP ENO (σRESP = “Manager” (ASG))

14



In a centralized database environment, the choice isclear. Second strategy avoids Cartesian product andhence it is much less computing resource intensive thanthe first strategy.

In distributed environment, as we discussed before,other parameters need to be taken into considerations inorder to define a suitable strategy, i.e., Data Transfercost, site computational capability, …

15


MultiDatabase Systems ─ Query processing Based on the query, location of data sets, size

of the data sets, communication cost,processing capability, … a dynamic strategyshould be laid out.

According to the strategy, then the query isdecomposed into sub-queries.

Sub-queries are sent to the designated sites forexecution.

16



Furthermore, assume that the relations arehorizontally fragmented as follows: EMP1: σNo ≤ “E3” (EMP) site3

EMP2: σNo > “E3” (EMP) site4

ASG1: σNo ≤ “E3” (ASG) site1

ASG2: σNo > “E3” (ASG) site2

17


MultiDatabase Systems ─ Query processing Now there are choices to execute this query:

ASG’1 = σRESP= “manager” (ASG1)

Site1

ASG’2 = σRESP= “manager” (ASG2)

Site2

ASG’2ASG’

1

Site3 EMP’1 = EMP1 ENO ASG’

1 Site4 EMP’2 = EMP2 ENO ASG’

2

Result = EMP’1 ∪ EMP’

2Site5

EMP’1 EMP’

2

18



Result = (EMP1 ∪ EMP2) ENO (σRESP= “manager” (ASG1 ∪ASG2)Site5

ASG1

Site1

ASG2

Site2

EMP1

Site3

EMP2

Site4

19



Query processing in multidatabases is moredifferent and complicated than the one westudied in traditional distributed databases.This complexity is due to the heterogeneity andlocal autonomy aspects of multidatabases.


Because of heterogeneity: Data representation in different local databases may be

different, The capability of component databases may be different, Cost of processing queries on different local databases may be

different, There may be difficulties in moving data between local

databases, The local optimization capability of local databases might be

quite different.•••

20


21



Local autonomy poses additional problems. Forexample: As a result of communication autonomy and/or

association autonomy the local database may terminateits services at any time. This requires query processingmethods that are tolerant to system unavailability.Therefore, the challenge is to respond to a user querywhen the component database is unavailable, unwilling,and uncooperative.

22


MultiDatabase Systems ─ Query optimization The design autonomy may restrict the

availability and accuracy of statisticalinformation needed in order to carry out thequery optimization.

The execution autonomy may limit theapplication of some query processing andoptimization strategies. For example, it may notbe possible to perform semi-join operation.

23



Global query is resolved (split) with thehelp of the global schema (schemaintegration). Resolution of the global queryresults in a set of sub-queries to be executedat the local sites.

24



Query processing at the global level is asequence of four step process: Compilation and translation, Unification decomposition, Optimization, and Translation and execution.

25


MultiDatabase Systems ─ Query processing Compilation and translation: Query is compiled

and transformed into an internal form. Unification decomposition: Integrated data items

are replaced from corresponding local data itemsalong with inconsistency resolution functions (ifany).

Optimization: The query tree is optimized andanalyzed. At this stage, sub-trees to be resolved bylocal databases are identified.

Translation and execution: Executable sub-trees tobe executed at local databases are constructed andpassed to local systems for execution.

26



Steps one and four are similar to those intraditional data base systems(centralized/distributed) and hence willnot be discussed further.

27


MultiDatabase Systems ─ Query processing Unification decomposition: The issue is to

determine how the integrated data can beconstructed and which local data should be usedfor its construction.

The process could be complicated due to theequivalent data items at different local databases─ A simple query that accesses a local data itemmay have to access an arbitrary number of dataitems at other sites because of direct or indirectequivalence relationships.

28



Similar to traditional distributed databases,two optimization techniques could be used: Heuristic based optimization Cost based optimization

29


MultiDatabase Systems ─ Query processing Heuristic based optimization: Decompose the

global query into the smallest possible sub-queries where each sub-query is executed at onelocal database (here multiple sub-queries may besent to the same site). Decomposition is relatively easier, More chances to perform global optimization, More work at the global optimizer, More communication between global and

local components.

30



Heuristic based optimization: Decomposethe global query into the largest possiblesub-queries where each sub-query can beexecuted at one local database. Less work at global optimizer, Fewer messages between global and local

components, More work at the local databases.

31


MultiDatabase Systems ─ Query processing Cost based optimization: Given a query

Q, its execution plans “execution spaceEQ”, and cost function C on EQ, we want tofind an execution plan eQ∈ EQ that has theminimum cost.

Local autonomy is the key factor thatcomplicates the task beyond itscomplexity in traditional distributeddatabases for two reasons:

32


MultiDatabase Systems ─ Query processing Cost based optimization

Global database management system may nothave complete cost information about globalsub-queries in order to perform the globaloptimization.

Global database management system interactwith the local database management system atits application program interface level. As aresult, it is unaware of internal data structureand functions of the local database managementsystems.

33


MultiDatabase Systems ─ Query processing Cost based optimization: Three alternatives can be

used to determine the cost of executing queries atthe local nodes: Treat local nodes as a black box, run some test queries on

them, and from these determine the necessary costinformation.

Use previous knowledge about local node and theirexternal characteristics to determine the cost information,

Monitor the run-time behavior of the local node anddynamically collect the cost information.

34



Cost estimation of Global sub-queries We can use a logical cost model to estimate

cost of sub-queries: Cost of a simple query (Q) on a relation is: Cost (Q) = C0 + C1 +C2

C0 is the initialization cost C1 is the cost of finding qualifying tuples C2 is the cost of processing selected tuples.

35


MultiDatabase Systems ─ Query processing C0 is a function of local data base management system C1 is a function of the relation being accessed, and C2 is a function of the number of tuples being returned.

Cost (Q) = c0 + c1 * R+ c2 * R* s

Cardinality of therelation being accessed Selectivity of

the query

R and s are unknown to the global database management system.

36


MultiDatabase Systems ─ Query processing Cost coefficients c0, c1, and c2 can be derived

by a calibration process ─ run a set of suit ofspecially designed calibration queries, inisolation, on a specially designed calibrationdatabase (synthetic database) on the local site.

This strategy can be extended to the domain ofmore complicated queries and non relationaldatabase systems.

37


MultiDatabase Systems ─ Query processing Cost based optimization: As an alternative

to calibration queries and databases, onecan use probing queries on componentnodes to determine cost information. Thisapproach can be extended to the domain ofthe so called “sample queries” wherequeries can be classified based on differentcriteria and sample queries for each classare issued to derive and measure costinformation.

Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 ›...

Documents

Transcript of Heterogeneous Distributed Databases Query Processing › uploads › 4 › 9 › 2 › 2 ›...