On Query Optimization in Relational Databases

by

John Ngubiri

PGDCS(Mak), BSc/Ed(Mak)

[email protected], 071921969

A Dissertation Submitted in Partial Fulfillment of the Requirements

for the Award of a Degree of Master of Science in Computer Science

of Makerere University

May, 2004


Declaration

I, John Ngubiri do hereby declare that this dissertation is my original work and has

never been submitted for any award of a degree in any institution of higher learning.

Where quotations have been used, they are acknowledged through references.

Signed .....................................Date......................................

John Ngubiri

(2002/HD18/423U)

Candidate.


Approval

I certify that this is the original work of John Ngubiri and that it has been done under my supervision. The work has never been submitted for any award of a degree in any institution of higher learning.

Signed......................................Date.......................................

Dr. Venansius Baryamureeba, Ph.D

Supervisor.


Dedication

This Dissertation is dedicated to:

• The two ladies in my life: my mother and my fiancée Eve.

• The two gentlemen in my life: my father and my son Matthew.


Acknowledgments

My sincere appreciation goes to Dr. Venansius Baryamureeba, Director of the Institute of Computer Science and my supervisor, for all his advice, guidance and encouragement. Without you, this dissertation would not be the way it is now. Most likely, it would not exist at all!

I also wish to thank the staff of the Institute of Computer Science, Makerere University, with whom I associate and hence grow academically. Notable on the list are Mr Habibu Atib, for the very initial discussions; Dr Vincent Ssembatya (Department of Mathematics), for the algorithm design background; and Ms Josephine Nabukenya, for technical writing assistance.

To you all, I say Thank you.

John Ngubiri - June 2004.


Abstract

Query optimization is an important process in relational databases. With the load on databases increasing, optimizing queries in batches is a promising way forward. Studies have shown that the savings from sharing common sub-expressions can go beyond what the optimal plans of the constituent queries offer. However, the challenges of an excessively large sample space, buffer management, establishing the optimal order of optimization, and identifying disjoint queries remain. In this dissertation, we propose how to efficiently establish the extent of inter-query shareability and exploit it, so that common sub-expressions are computed once and their output is shared among the queries. We also propose the optimal order of optimization so that the sharing is done in a more cost-saving and time-conserving manner.


List of Figures

1.1 The stages of query processing . . . . . . . . . . . . . . . . . . . . . . 7

1.2 Query Tree Representation . . . . . . . . . . . . . . . . . . . . . . . . 16

1.3 The optimal plan of the query . . . . . . . . . . . . . . . . . . . . . . 17

1.4 Conceptual Framework . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.1 The different tree configurations: bushy, complex deep, left deep and

right deep trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

2.2 A pipeline Schedule . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

2.3 Random nature of the II algorithm . . . . . . . . . . . . . . . . . . . 41

2.4 Query representation: Tree, DAG and extended DAG . . . . . . . . . 46


Symbols

Symbol   Equivalent Relational Algebra Operation

σ        Select

Π        Project

×        Cartesian Product

⋈        Join

θ        A comparison operator


Contents

1 Introduction 1

1.1 Background to Databases . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 The Relational Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.4 Query Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

1.4.1 Systematic Query Optimization . . . . . . . . . . . . . . . . . 9

1.4.2 Heuristic Query Optimization . . . . . . . . . . . . . . . . . . 10

1.4.3 Semantic Query Optimization . . . . . . . . . . . . . . . . . . 11

1.5 Single Query Heuristic Optimization . . . . . . . . . . . . . . . . . . 13

1.5.1 The Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13


1.5.2 Challenges to Single Query Optimizers . . . . . . . . . . . . . 18

1.6 Multi-Query Optimization . . . . . . . . . . . . . . . . . . . . . . . . 18

1.6.1 The Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

1.6.2 Challenges of Multi-Query Optimizers . . . . . . . . . . . . . 19

1.7 Statement of the Problem . . . . . . . . . . . . . . . . . . . . . . . . 21

1.8 Justification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

1.9 Objectives: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

1.10 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

1.11 Conceptual Framework . . . . . . . . . . . . . . . . . . . . . . . . . . 24

1.12 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

1.13 Dissertation Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2 Literature Review 27

2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

2.2 Approaches to Query Optimization . . . . . . . . . . . . . . . . . . . 30

2.2.1 Single Query Optimization . . . . . . . . . . . . . . . . . . . . 30


2.2.2 Multi-Query Optimization (MQO) . . . . . . . . . . . . . . . 32

2.3 The Effect of Pipelining . . . . . . . . . . . . . . . . . . . . . . . . . 35

2.4 Single Query Optimization Algorithms . . . . . . . . . . . . . . . . . 38

2.4.1 Dynamic Programming Algorithms . . . . . . . . . . . . . . . 38

2.4.2 Randomized Algorithms . . . . . . . . . . . . . . . . . . . . . 39

2.5 Multi-Query Optimization Algorithms . . . . . . . . . . . . . . . . . 43

2.5.1 The MQO Problem . . . . . . . . . . . . . . . . . . . . . . . . 44

2.5.2 Reuse Based Optimization . . . . . . . . . . . . . . . . . . . . 46

2.6 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

3 Query Shareability Establishment 55

3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

3.2 The Shareability Problem . . . . . . . . . . . . . . . . . . . . . . . . 58

3.3 Previous Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

3.4 The Greedy Search Algorithm . . . . . . . . . . . . . . . . . . . . . . 61

3.5 Improvements to the Greedy Algorithm . . . . . . . . . . . . . . . . . 63


3.6 The Improved Greedy Searching Algorithm . . . . . . . . . . . . . . . 65

4 Optimizing the Traversed Plans 67

4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

4.2.1 The Basic Volcano Algorithm . . . . . . . . . . . . . . . . . . 70

4.2.2 The Volcano-SH . . . . . . . . . . . . . . . . . . . . . . . . . . 71

4.2.3 The Volcano-RU . . . . . . . . . . . . . . . . . . . . . . . . . 73

4.3 The Proposed Optimizing Algorithm . . . . . . . . . . . . . . . . . . 75

4.3.1 The Background . . . . . . . . . . . . . . . . . . . . . . . . . 75

4.3.2 The Optimizing Algorithm . . . . . . . . . . . . . . . . . . . . 77

4.3.3 Benefits of the new Algorithm . . . . . . . . . . . . . . . . . . 79

5 Discussion and Future Work 81

5.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83


Chapter 1

Introduction

1.1 Background to Databases

Right from the early times of programming, programmers relied on well-organized

data structures to simplify programming tasks [11]. Though it is normal to include variables in program code, it is nearly impossible to include large data sets. Including data in code implies that the code owns the data. The code therefore has to provide facilities for other programs to access that data when required. This brings programming hardships, since data normally undergoes many processes, such as insertion, updates and deletion, which do not necessarily take place simultaneously. The hardships are more prominent where a program has to access data from another program. For a process to allow access to its data, it must itself be running, since data cannot be got from a dormant program. If the data owner is not running,


then it has to initiate some dummy operations, wasting processor cycles and memory. Programs owning data come with many programming bottlenecks; the most outstanding ones, according to Johnson [11], are:

• Unhealthy dependence between data and programs;

• Repetition of data elements;

• Opportunities for inconsistencies;

• Unorganized scattering of related data across many files;

• Distributed ownership of data;

• Decentralized security of data;

• Unregulated interactions between programs using the same data;

• Inadequate multiuser access;

• Ad hoc approach to error recovery; and

• Over-dependence on physical considerations such as disc track and sector addresses.

During the 1950s and 1960s, these bottlenecks led to the development of database systems, so that data could be independent and application programs could simply access it. With the development of databases, data and application software


became independent but interacting components. A database is a collection of logically related data, and a Database Management System (DBMS) is a software product that helps in defining, creating, maintaining and controlling access to a database. DBMSs are grouped according to the model of their development. A database model is an organizing principle that specifies particular mechanisms for data storage and retrieval. There are five database models, namely the Hierarchical, Network, Relational, Object and Deductive models, the Relational model being the most popular of all.

The Hierarchical and Network models, which were developed before the Relational model, are referred to as the pre-Relational models, while the Object and Deductive models, which were developed after it, are referred to as the post-Relational models.

1.2 The Relational Model

The Relational model is the most popular database model currently in use. It uses tables (relations) to organize the data elements stored. A relation represents an application entity, while a row in the relation represents an instance of the entity. It replaced the Hierarchical model (whose organizing principle was the tree-structured organization of data) and the Network model (whose organizing principle was the graphical representation of data) [11]. It was because of the popularity of


the Relational model that the Hierarchical and Network models were phased out. Johnson [11] attributes the popularity of the Relational model to:

i The existence of a standard, easy and flexible querying language called Structured Query Language (SQL), which is universal to all Relational Database Management Systems (RDBMSs);

ii The existence of a simple data structure (relations), which makes the model easily understandable even by non-technical users;

iii The existence of a strong mathematical base (Relational Algebra and Relational Calculus) for the model's operation.

Ramakrishnan and Gehrke [18] however observe that the Relational model has two strong setbacks:

i Performance:

A single SQL query can be written in many different ways, each of which can have a different cost. Many queries are so expensive that executing them degrades the computer system and greatly slows the rate of output generation.

ii Flatness:

The Relational model relies on inbuilt primitive data types, yet data in real-life situations is increasingly complex. Representing complex fields as a set of primitive attributes leads to too many fields, often with null entries.


Motivated by the above weaknesses, two post-Relational models emerged: the Object Oriented model and the Deductive model. The Object Oriented model represents an entity as a class and an entity occurrence as an object. It is based on the three principles of Object Oriented Programming: encapsulation, inheritance and polymorphism. It gives the programmer freedom to create classes as dictated by the real-life situation and the way he envisages the entities being modeled. This not only makes the database more problem-focused, but also eases reuse. It is suitable for creating complex systems. The Deductive model (also called the Inferential model) aims at storing minimal data in combinations called axioms. It allows unstored combinations to be deduced from the existing ones. For example, if the axiom for a student being a member of a class is Member(studentName, class), individual students' memberships can be represented as Member("Mark", "FormTwo") and Member("Matthew", "FormTwo"). An axiom like ClassMates(class, studentList) is not stored but is deduced from the Member axiom.

Intensive research took place to address the limited data types and the query performance problems. The limited data types problem was addressed by incorporating object oriented characteristics into the Relational model. This led to the introduction of a hybrid model, the Object-Relational model; DBMSs like Oracle and PostgreSQL have object oriented characteristics. The inefficient queries problem was addressed by passing relational (and object-relational) queries through a series of steps between the query source and the physical data storage level so as to have them


execute in a cost effective manner. This process is called Query Processing.

1.3 Query Processing

This is the process of transforming a high-level query into a plan that executes and retrieves data from the database (Figure 1.1). It involves four phases: query decomposition, query optimization, code generation and run-time query execution [2].

(a) Query decomposition

In this phase, the query is checked as to whether or not it conforms to the syntax of the language used (mostly SQL); if it does not, an error message is generated. If the query conforms to the syntax, it is broken into small pieces and represented as an equivalent relational algebra expression (parsing). The system catalog is used to cross-check the consistency of the query with the schema.

(b) Query optimization

In this phase, the best execution plan is generated. This is done by taking into account the resources required to execute the query as well as the resources required to get the plan. Database statistics are used to make appropriate decisions.


(c) Code generation

After the optimizer has found the best execution plan, the code generator creates the code equivalent to the plan. This is sent to the internal level of the database's ANSI-SPARC architecture for execution.

(d) Query execution

In this phase, the code interacts with the database and retrieves the data for consumption by the process or individual who sent the query.

Figure 1.1: The stages of query processing

The query processing activity therefore acts as an interface between the querying individual or process and the database. It relieves the querier of the burden of deciding the best execution strategy: while the querier specifies what, the query processor determines how [4].

In query processing, the least straightforward stage is query optimization. The efficiency of the query optimizer determines how many resources are used, and is a measure of how suitable the DBMS is for critical and real-time applications.
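The four phases above can be sketched as a pipeline of functions. This is a minimal illustrative sketch, not a real DBMS component: the function names, the parse-tree stand-in and the toy database are all invented for the example.

```python
# Minimal sketch of the four query-processing phases as a pipeline.
# A real DBMS implements each phase with a parser, system catalog,
# cost model and executor.

def decompose(sql):
    """Phase (a): syntax check, then map to a relational algebra form."""
    if not sql.strip().upper().startswith("SELECT"):
        raise SyntaxError("query does not conform to SQL syntax")
    return {"op": "scan", "text": sql}      # stand-in for a parse tree

def optimize(algebra):
    """Phase (b): choose the cheapest equivalent plan (identity here)."""
    return algebra

def generate_code(plan):
    """Phase (c): emit an executable form of the chosen plan."""
    return lambda database: list(database)  # stand-in executor

def execute(code, database):
    """Phase (d): run the generated code against the database."""
    return code(database)

# Usage: push a trivial query through all four phases.
db = [("PA14", "Main St"), ("PG36", "High St")]
plan = optimize(decompose("SELECT * FROM property"))
rows = execute(generate_code(plan), db)
assert rows == db
```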

1.4 Query Optimization

Query optimization is the process of choosing an efficient execution strategy for a query [2], and it is one of the most important tasks of any RDBMS. Ramakrishnan and Gehrke [18] observe that SQL, the de facto standard for data definition and data manipulation in RDBMSs, offers a variety of ways in which a user can express, and therefore a system can evaluate, a query. The query optimizer is therefore responsible for finding the best execution strategy so that fewer resources are used to retrieve data.

There are three main approaches to query optimization. These are Systematic,

Heuristic and Semantic query optimization.


1.4.1 Systematic Query Optimization

In systematic query optimization, the system estimates the cost of every plan and then chooses the best one. The best-cost plan is not universal, since it depends on the constraints put on the data. For example, joining on a primary key may be done more easily than joining on a foreign key: primary keys are always unique, so once a joining partner is found no other match is expected, and the system breaks out of the loop without scanning the whole table. Though in many cases effective, exhaustive cost estimation is time-consuming, and it can therefore sometimes be dispensed with [4].

The costs considered in systematic query optimization include the access cost to secondary storage, storage cost, computation cost for intermediate relations and communication costs. The importance placed on each cost depends on the type of database. For large databases, emphasis is put on minimizing access cost to storage and memory usage. For small databases, where outputs can be stored in memory, emphasis is put on minimizing computational cost. In distributed databases, where many sites are involved, communication cost is of paramount importance and has to be minimized, since it normally involves the costs of channel coding and security coding as well as other network-related limitations like bandwidth and noise.
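The weighting of these cost components can be sketched as a simple weighted sum. The weights and component values below are invented for illustration; real optimizers derive such figures from database statistics.

```python
# Hypothetical weighted cost model combining the cost components named
# above. All weights and plan figures are assumptions for the sketch.

def plan_cost(access_io, storage, cpu, communication,
              w_io=1.0, w_store=0.2, w_cpu=0.5, w_comm=2.0):
    """Collapse the component costs into one comparable figure."""
    return (w_io * access_io + w_store * storage
            + w_cpu * cpu + w_comm * communication)

# A large centralized database weights secondary-storage access heavily;
# a distributed database would raise w_comm instead.
plan_a = plan_cost(access_io=1000, storage=50, cpu=400, communication=0)
plan_b = plan_cost(access_io=600, storage=80, cpu=900, communication=0)
best = min([("A", plan_a), ("B", plan_b)], key=lambda p: p[1])
assert best[0] == "B"   # fewer disk accesses win under these weights
```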


1.4.2 Heuristic Query Optimization

In the heuristic approach, operator ordering is used in a manner that economizes resource usage while conserving the form and content of the query output. The principal aims are to:

(i) Keep the size of the intermediate relations to a minimum and increase the rate at which the intermediate relation size tends towards that of the final relation, so as to optimize memory use.

(ii) Minimize the amount of processing that has to be done on the data without affecting the output.

Connolly and Begg [2] state five main rules used in heuristic query optimization:

(a) Perform selection operations as early as possible. This reduces the cardinality of the intermediate relation, and hence reduces the resources used to process a column as well as the memory occupied per column.

(b) Combine a Cartesian product with a subsequent selection operation whose predicate represents a join condition into a join operation, i.e. σ_{R.a θ S.b}(R × S) = R ⋈_{R.a θ S.b} S. Elmasri and Navathe [4] observe that this reduces the complexity of the joining algorithm (one of the most expensive operations in data retrieval), for example in cases where the individual relations are first sorted on the joining fields.

(c) Use the associativity of binary operations to rearrange the query so that the most restrictive selection operation is done first. This increases the rate at which the intermediate relation size tends to the final relation size, hence minimizing the memory occupied and the resources required to process a column.

(d) Perform projection operations as early as possible. This reduces the degree (number of attributes) of the intermediate relation. It therefore reduces the memory occupied by the relation together with the amount of resources required to process a row.

(e) Compute common expressions once. If a certain expression appears more than once and is not too large, it is kept in memory so that when it is required again it can be reused. In case the expression is too big to fit in memory, it can be stored on disk and later retrieved when wanted, so long as the cost of retrieval is not greater than the cost of recomputing it.
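Rules (a) and (d) can be illustrated with a small sketch: pushing a selection below a Cartesian product shrinks the intermediate relation without changing the answer. The relations and values below are invented for the example.

```python
# Illustrative sketch of rule (a): selecting early shrinks the
# intermediate relation a product/join has to process.

client = [  # (clientNo, prefType, maxRent) -- invented sample rows
    ("C1", "Flat", 400), ("C2", "House", 700), ("C3", "Flat", 300),
]
viewing = [  # (clientNo, propertyNo)
    ("C1", "P1"), ("C2", "P2"), ("C3", "P1"),
]

# Naive order: build the full product, then filter everything at once.
naive = [(c, v) for c in client for v in viewing
         if c[1] == "Flat" and c[0] == v[0]]

# Heuristic order: select on prefType first, then match on clientNo.
flats = [c for c in client if c[1] == "Flat"]        # rule (a)
pushed = [(c, v) for c in flats for v in viewing if c[0] == v[0]]

assert naive == pushed   # same answer...
# ...but the pushed version pairs 2 x 3 rows instead of 3 x 3.
```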

1.4.3 Semantic Query Optimization

This is a combination of heuristic and systematic optimization. The constraints specified in the database schema can be used to modify the procedures of the heuristic rules, making the optimal plan selection highly creative. This leads to heuristic rules that are locally valid but cannot be taken as rules of thumb. For example, consider a query such as


SELECT Employee.lname, Supervisor.lname
FROM Employee, Supervisor
WHERE Employee.supervisorNo = Supervisor.No
AND Employee.salary ≥ Supervisor.salary

This is a very unlikely event, and it is likely to be ruled out, directly or indirectly, by the database constraints. The restrictions may be of the form CHECK Employee.salary BETWEEN (S1, S2) and CHECK Supervisor.salary BETWEEN (S3, S4), where S3 > S2. This shows that a supervisor can never earn less than an employee, so the query yields no results.

A heuristic optimizer would go ahead and parse, optimize and execute the query, yielding no output, which is a worst-case scenario [9]. A semantic optimizer would recognize the contradiction from the constraints, respond "Empty set", and save the resources.
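The semantic check in the salary example can be sketched in a few lines. The salary ranges and the helper function are invented for the illustration; the point is only that a schema constraint can prove the predicate unsatisfiable before any data is touched.

```python
# Sketch of the semantic check described above: if the schema constrains
# Employee.salary to [S1, S2] and Supervisor.salary to [S3, S4] with
# S3 > S2, the predicate Employee.salary >= Supervisor.salary can never
# hold, so the optimizer can answer "Empty set" without executing.
# The ranges S1..S4 are assumptions for the example.

def can_salary_exceed(emp_range, sup_range):
    """True only if some employee salary can reach some supervisor salary."""
    _, e_hi = emp_range
    s_lo, _ = sup_range
    return e_hi >= s_lo

emp = (200, 500)   # (S1, S2)
sup = (600, 900)   # (S3, S4), with S3 > S2

result = None      # None would mean "must execute the query"
if not can_salary_exceed(emp, sup):
    result = []    # "Empty set": answered from the constraints alone
assert result == []
```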

1.5 Single Query Heuristic Optimization

1.5.1 The Process

When a single syntactically correct query is broken down and expressed as a relational algebra expression, a query tree is created [2] [4], with the intermediate operations as nodes, the source relations as leaves and the output as the root. For example, given the table components below:

Client

Field      Data type                          Description
clientNo   four fixed characters              Client identity number, primary key
name       twenty five variable characters    Client name
prefType   ten variable characters            Property type of preference
maxRent    float rounded to 2 decimal places  Maximum rent affordable by the client

Property

Field       Data type                          Description
propertyNo  five variable characters           Property identification number, primary key
street      twelve variable characters         Street where the property is located
rent        float rounded to 2 decimal places  Monthly rent of the property
ownerNo     four fixed characters              Identification number of owner, foreign key

Viewing

Field       Data type              Description
propertyNo  five fixed characters  Property identification number, partial primary key
clientNo    four fixed characters  Identification number of viewing client, partial primary key


and with a query

SELECT p.propertyNo, p.street
FROM client c, viewing v, propertyForRent p
WHERE c.prefType = 'Flat' AND c.clientNo = v.clientNo AND v.propertyNo = p.propertyNo
AND c.maxRent ≥ p.rent AND c.prefType = p.type AND p.ownerNo = 'C093';

In the process of getting a suitable plan, the query is decomposed into a relational algebra expression,

π_{p.propertyNo, p.street}(σ_{c.prefType='Flat' ∧ c.clientNo=v.clientNo ∧ v.propertyNo=p.propertyNo ∧ c.maxRent≥p.rent ∧ c.prefType=p.type ∧ p.ownerNo='C093'}((C × V) × P))

which is expressed as a query tree in Figure 1.2.

A plan formed like the one above is in most cases not optimal. The optimizer therefore looks for a more cost-effective plan, either by using the heuristic rules to adjust the plan into a cost-effective one, or by searching for the most effective plan among the many available. In this case, the optimal tree is shown in Figure 1.3.


Figure 1.2: Query Tree Representation


Figure 1.3: The optimal plan of the query

It is this strategy that is sent to the code generator for code generation and data fetching.


1.5.2 Challenges to Single Query Optimizers

Since single query optimizers handle one query at a time, they are unsuitable for handling high query traffic. Because they work on a large sample space, they cannot be exhaustive, since that would be expensive and time-wasting. The algorithm therefore has to adopt a search strategy such that a thorough search is done and the optimal plan is found, without traversing all the options, in an acceptably small interval of time.

1.6 Multi-Query Optimization

1.6.1 The Process

In multi-query optimization, queries are optimized and executed in batches. Individual queries are transformed into relational algebra expressions and are represented as graphs [13, 16, 20]. According to Roy et al. [20], the graphs are created in such a way that:

(i) Common sub-expressions can be detected and unified;

(ii) Related sub-expressions are identified so that the more encompassing sub-expression is executed and the other sub-expressions are derived from it. For example, given σ_{A≤5}(E) and σ_{A≤10}(E), σ_{A≤10}(E) is executed and σ_{A≤5}(E) is derived from it.

The optimizer therefore concentrates on:

(i) Identifying the common expressions so that database accesses and computational costs are minimized;

(ii) Identifying groups of sub-expressions where internal derivations are possible, so that derivations can be made and database access is minimized;

(iii) Finding a search strategy so that the search is not too long.
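The derivation in point (ii) can be sketched concretely: once σ_{A≤10}(E) is cached, the subsumed predicate σ_{A≤5}(E) can be answered from the cached (smaller) result instead of rescanning E. The relation E below is invented for the example.

```python
# Sketch of deriving a subsumed selection from a cached, more
# encompassing one, instead of scanning the base relation again.

E = [3, 7, 12, 5, 9, 15, 1]   # invented relation: values of attribute A

cache = {}
cache["A<=10"] = [a for a in E if a <= 10]   # executed once against E

# Derive the subsumed predicate from the cache, not from E.
derived = [a for a in cache["A<=10"] if a <= 5]
direct = [a for a in E if a <= 5]

assert derived == direct   # same rows, fewer tuples scanned
```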

1.6.2 Challenges of Multi-Query Optimizers

The optimal global plan is not necessarily generated from the individually optimal plans; it is generated from the group as a whole. This leads to a large sample space from which the composite optimal plan has to be got. For example, suppose that for four relations A, B, C and D there are two queries whose optimal states are Q1 = (A ⋈ B) ⋈ C and Q2 = (B ⋈ C) ⋈ D, with execution costs ε1 and ε2 respectively, so the total cost is ε1 + ε2. Though these queries are individually optimal, the sum is not necessarily optimal. Q1, for example, can be rearranged into an individually non-optimal state Q' = A ⋈ (B ⋈ C) whose cost, say ε, is greater than ε1. The combination of Q' and Q2 may make a more optimal plan globally at run time if they cooperate. Since

19

Page 32: On Query Optimization in Relational Databasescit.mak.ac.ug/staff/ngubiri/files/mscDissertation.pdf · ... it is nearly impossible to include large data sets. Includ-ing data in codes

by the two queries. This leads to a cost of ε + ε2 - ξ where ξ is the cost of evaluating

(BonC) which can be less than ε1 + ε2. If sharing was tried on individual optimal

plans, sharing would be impossible hence the saving opportunity would be lost. To

achieve cost savings using sharing, both optimal and non-optimal plans are needed

so that the sharing possibilities are fully explored. This however increase the sample

space for the search hence a more search cost. The search strategy therefore needs

to be efficient enough to be cost effective. Though this approach can lead to a lot

of improvement on the efficiency of a query, it may still have some bottlenecks that

have to be overcome if a global state is at all times to be achieved. The bottlenecks

include:-

(a) The cost of Q' may be so high that the sum of the independently optimal
plans is still the global optimum;

(b) There may be no possibility at all of sharable components, in which case a
search for them is a waste of resources;

(c) The new query plans may have a lower resource requirement than the previous
ones, but the resources taken to identify them (the search cost) may exceed the
trade-off, leaving no net saving. This is very likely given the large sample
space.
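The arithmetic above can be sketched as a small decision routine (the cost
figures stand in for ε1, ε2, ε and ξ and are purely illustrative):

```python
# Sketch: decide whether to share a common sub-expression between two queries.
# All cost figures are hypothetical illustrations of the epsilon/xi notation.

def total_cost(independent_costs, shared_alternative=None):
    """Return the cheaper of running the plans independently or via sharing.

    shared_alternative is a tuple (costs_with_sharing, shared_subexpr_cost):
    each plan's cost includes the shared sub-expression once, so its cost is
    subtracted once so that it is paid for only a single time.
    """
    independent = sum(independent_costs)
    if shared_alternative is None:
        return independent
    costs_with_sharing, shared_cost = shared_alternative
    shared = sum(costs_with_sharing) - shared_cost
    return min(independent, shared)

# epsilon1 + epsilon2 = 100 + 120 = 220 without sharing.
# Q1 rearranged to Q' costs epsilon = 130, but the shared sub-expression
# costing xi = 60 is executed once: 130 + 120 - 60 = 190 < 220, so sharing wins.
print(total_cost([100, 120], shared_alternative=([130, 120], 60)))
```

If ε is too high (bottleneck (a)), the routine simply falls back to the sum of
the independent plans.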


1.7 Statement of the Problem

Parallel real-time demands are putting a big load on relational query
optimizers, especially in Decision Support and Business Intelligence tools,
remote access architectures like databases on networks, as well as fragmented
architectures like distributed databases. This overworks the hardware,
increasing the rate of wear and tear and hence the maintenance costs.
Multi-query optimizers identify common sub-expressions, execute them once and
share the outputs among the queries. This leads to significant performance
gains [22]. Current optimization approaches make no effort to find out whether
the queries in a batch really have common sub-expressions, and neither do they
have an intelligent way of predicting the absence or exhaustion of common
sub-expressions. Their cost-saving abilities are therefore not assured.

1.8 Justification

Though relational database optimizers are currently available, with a lot of
effort spent on them (an average relational database optimizer represents
50-60 man-years of development effort [18]), they are highly inefficient in an
algorithmic sense. For example, the commonly used bottom-up tree optimizers are
O(2^n) in the worst case [5]. There is therefore a need for further studies
since:

(a) The saving made depends on how many common sub-expressions are present and
how easy it is to find them. We therefore need to establish to what extent
queries are similar, as well as to detect cases of disjoint queries.

(b) Queries come from different sources and at different times, so the
similarity between them is unpredictable. There is a need to study how they can
be scheduled so that they are executed in an optimal order. Though some of the
current algorithms [20] acknowledge the importance of order, they do not
attempt to derive the optimal order systematically.

1.9 Objectives

The objectives of the study were:-

(a) To study ways of establishing the extent of sharing among the queries, as
well as establishing the order of optimal execution.

(b) To study existing algorithms and identify their weaknesses and strengths,
with a view to minimizing the weaknesses and integrating the strengths of the
different algorithms into more efficient hybrid algorithms.

(c) To analyze the improved and hybrid algorithms for efficiency and
effectiveness.


1.10 Scope

(a) Optimization approach

The research concentrated on the combination of the heuristic and semantic
approaches to query optimization.

(b) Database Model

The research concentrated on the relational model. It can, however, be applied
in an object-relational environment.

(c) Optimization Philosophy

The research was based on the multi-query approach to query optimization.

1.11 Conceptual Framework

Figure 1.4: Conceptual Framework


1.12 Methodology

In the process of coming up with the results of this study, the following methods

were used.

1. Literature Evaluation

This involved reading literature related to the concepts of query optimization.
It also involved studying the criticisms of that literature as well as the
improvements made on it by successive researchers.

2. Comparison

This involved comparing the ways in which similar optimization styles (like
single-query or multi-query optimization) were handled by different approaches
(algorithms). This was used to identify similarities and hence to improve on
the algorithms, or to develop algorithms that combine the strengths of the
individual algorithms.

1.13 Dissertation Overview

In Chapter 2, we look into the previous research that has taken place in the
field of query optimization. We examine the rationale for query optimization,
the approaches to query optimization, as well as the use of pipelining to
minimize materialization. We then examine the different algorithms that have
been proposed by different researchers, identify their merits and demerits, and
formulate research questions to guide the process of proposing improvements.

In Chapter 3, we lay a foundation for exploiting the similarities between the
different queries in a batch in a multi-query environment. We propose an
algorithm that traverses the whole batch of query plans so as to establish the
extent of the similarities. We make the greedy algorithm intelligent enough to
increase its efficiency, and we summarize the inter-query shareability in a
query sharing matrix M.

In Chapter 4, we use the information in the query sharing matrix and traverse
the batch while merging common sub-expressions. The matrix is used when
choosing the order in which query plans are to be considered for optimization.
It also provides guidelines on the extent of sharing between any two plans so
that the optimizer makes appropriate decisions; for example, we do not search
for sharable sub-expressions between disjoint query plans.

We then present our conclusion, recommendations and future work in Chapter 5.


Chapter 2

Literature Review

2.1 Overview

When the relational model was first launched in the late 1970s, one of the
major criticisms often cited was the inadequate performance of queries [2].
This was because queries used a lot of resources, such as processor cycles and
memory, compared to other models for equal amounts of data. SQL, the de facto
standard language for data definition and data manipulation in RDBMSs [2],
offers a variety of ways in which a query can be structured to achieve the same
output. The more complex the query, the higher the number of ways it can be
represented. On average, the best measure of the complexity of a relational
query is the number of relations it joins. In fact, even at the design level,
developers introduce redundancy or merge some relations so that the joins in
frequently invoked queries

are minimized. Surajit and Kyuseok [1] state that the complexity of a query is
exponential in the number of joins involved. Complex queries, like those in
data warehouses that routinely join tens of tables, are too expensive to
process in a reasonable time. Since the structural differences between queries
depend largely on the way the joins are ordered, the more tables joined, the
more ways there are of writing a query that produces a given output. These
alternatives, if executed, have varying costs, and in most cases (if not all)
only one is optimal. The probability of writing the optimal query tends to zero
as query complexity increases, and the computer is then likely to waste a lot
of resources. It is therefore the complex queries that must be optimized if a
computer system is to work efficiently. An extensive query optimization phase
must select the most efficient plan among the many available in an acceptably
short time. Without query optimization, RDBMSs would be inefficient and hence
impractical [18]. Query optimization is itself an expensive process because it
mostly relies on evaluating the different plans (access paths) and choosing an
optimal one among them. The number of alternative access paths grows at least
exponentially with the number of relations participating in the query [15].

The optimizer, which is nearly certain that the plan sent by a user is not
optimal, therefore has to search for an optimal plan and forward it for
execution. This has to be done within the time constraint and in a
resource-conserving manner. It would not be worthwhile if the cost difference
between the optimized and the pre-optimized query were less than the cost of
finding the optimal plan. The user, likewise, is supposed to get what he
requested, in the same logical presentation and in an acceptable time interval.
Therefore, as the user specifies the what, the optimizer determines the how [4]
while still conserving the what. The process of optimization should have no
effect whatsoever on the final query output.

The search strategy, therefore, on top of conserving the form and content of
the query request, must be efficient. Optimization is not a matter of simply
transferring the resources that would execute the query into looking for the
execution plan. The ability of the optimizer to reach the optimal plan at the
earliest opportunity, with substantial resource savings, is therefore of
paramount importance. Kroger et al [13] summarize the goal of an optimizer as
follows: "A plan as cost-effective as possible is looked for as soon as
possible". Kroger et al [13] further observe that the job of a query optimizer
is not necessarily to get the cheapest plan (though the cheapest plan would of
course be the best). In fact, if a stage is reached where the cost of further
optimizing is higher than the resource savings, it is worthwhile to terminate
the search.

The optimizer is supposed to economize the resources spent on looking for a
plan, as well as to take into consideration the total processing time (time of
execution + time of optimization). Depending on the nature of the problem,
therefore, a sub-optimal plan may be preferred, especially in a real-time
scenario.


Given the large number of possible plans, traversing them one by one and
establishing the cost of each might seem the ideal strategy, but it wastes
time: the options are too many, and it is likely to produce a low-cost query
only after spending a lot of resources to find it.

2.2 Approaches to Query Optimization

Broadly, computers optimize queries either individually (Single-query optimization)

or as batches (Multi-query optimization).

2.2.1 Single Query Optimization

In single query optimization, a syntactically correct query is broken down and
expressed as a relational algebra expression, and a query plan, represented as
a tree, is created [4, 18]. This is the traditional approach to query
optimization and is used in most commercially available optimizers. It is
suitable where a database receives a low traffic of simple queries.

Depending on the algorithm used, either the different representations of the
original query are generated and the best one searched for, or the supplied
query is adjusted to the optimal one. If the option of choosing the best tree
from the different trees available is used, there is a high possibility of
logical duplicates (two physically different trees doing the same thing, the
same way). Since the number of options is likely to be high, they tend to
overload the memory and to require an exhaustive algorithm. There is also a
likelihood of groups of plans with the same cost, implying that more searching
is done with no practical advantage. The use of exhaustive algorithms is one
reason for the inefficiency of many query optimization research works [20].
Single processor optimizers therefore limit the search space by considering
only some tree configurations [12]: left deep, right deep, complex deep or
bushy.

Figure 2.1: The different tree configurations: bushy, complex deep, left deep and

right deep trees

The IBM System R optimizer, for example, considers only left deep trees.
Heuristics may also be used to eliminate some obviously expensive plans before
optimization so as to reduce the search space. The rules may be static or
dynamic; dynamic rules are applied based on available data like database
statistics or the system catalog. Failure to reduce the search space may cause
effects that lead to a decline in the performance and cost effectiveness of the
optimizer. These are:-


(i) Too many options may overload the memory. In cases where the query is

complex, the computer may not have enough memory space to perform other

routine activities.

(ii) In the process of conserving the memory, the computer may have to store

some of the plans on disk and retrieve them when required. This then brings

in costs of writing and reading from disk which increase optimization costs.

(iii) The many options put a bigger load on the processor during cost estimation

and comparison.

Restricting the tree types used in optimization, and thereby eliminating
duplicates, reduces the number of possible options. This shrinks the search
space, saving both the traversal of many options and memory usage.

2.2.2 Multi-Query Optimization (MQO)

In MQO, queries are executed in batches. Some MQO techniques act in such a way
that if some queries have a common sub-expression, that sub-expression is
executed once and the output shared. In some cases, the sharing does not
necessarily take place on the individually optimal plans [20]; instead,
sub-optimal plans are used. A decision may also have to be taken as to whether
the common sub-expressions should be pipelined or materialized [16]. Some
multi-query optimization techniques, like those described by Kyuseok et al
[14], basically aim at the parallel optimization of many queries: the queries
pass through the different optimization steps together, and the output, a set
of optimal plans, one for each query, is generated. Roy et al [20] criticize
this approach on the basis that further cooperation can take place between the
queries that make up the batch. If a certain sub-expression is common, then the
computer should execute it once and share out the results. This is the guiding
principle of the Basic Volcano algorithm proposed by Graefe and McKenna [7],
and of the Volcano-SH and Volcano-RU optimizer algorithms proposed by Roy et al
[20]. Roy et al [20] attach such importance to the sharing of sub-expressions
that even if the sharing takes place on a non-optimal plan of a query, so long
as the total resource requirement is optimal, it is acceptable.

If, for example, the queries Q1 and Q2 have locally optimal plans
Q1 = (R ⋈ S) ⋈ P and Q2 = (R ⋈ T) ⋈ S, there exists no common sub-expression
in them and they therefore need to be executed independently. However, if Q2
takes a locally sub-optimal plan Q′2 = (R ⋈ S) ⋈ T, then (R ⋈ S) is a common
sub-expression. If it is executed once and Q1 and Q′2 share it, the amount of
resources required to execute the two queries is likely to fall below the sum
of the costs of the two optimal plans. This is, however, done on the
sub-optimal plan of Q2.

A multi-query optimizer, therefore, according to Roy et al [20], is responsible
for recognizing the possibilities of shared computation and for modifying the
search strategy to explicitly account for shared computation so as to find a
globally optimal plan.

This sharing, however, may not necessarily be optimal since:-


(i) The cost of Q′ may be so high that the sum of the independent optimal plans
is still the global optimum;

(ii) The shared plan may have a lower resource requirement than the non-shared
ones, but the resources taken to find the plan may exceed the saving, yielding
an efficient query in an inefficient system;

(iii) The sharable components may be too few, with the ratio of sharable to
total components too low. The search may then produce so few sharable
components that the saving is far less than the searching cost.

The inclusion of sub-optimal plans increases the sample space, hence a more
efficient search technique is required.

Sharing in multi-query optimization hinges on the availability of sharable
sub-expressions. The elimination of some plan types, in the way it is done in
the System R optimizer, may not be useful since the eliminated types may have
sharable components. Elimination may instead depend on the cost of the
sub-expressions in the plan itself, since excessively expensive expressions,
even if shared, are very unlikely to lead to global optimality. Elimination of
expressions is done so as to reduce the search sample space; in reuse-based
cases, cost, not configuration, should be the basis.


2.3 The Effect of Pipelining

Dalvi et al [16] observed that most multi-query optimization techniques assume
that all shared sub-expressions have to be materialized and then read from disk
when they need to be used. This is indeed assumed in the contributions of Roy
et al [20] as well as Graefe and McKenna [7]. Dalvi et al [16] suggest that the
materialization of a sub-expression need not be a rule of thumb: if the
sub-expression is pipelined instead, the cost of writing to disk while
materializing and the cost of reading from disk while reusing are both saved.
The technique of pipelining, however, brings in new challenges, like limited
buffer space. The difference between the rate at which a source operator
produces output and the rate at which the destination operator consumes it may
make the pipeline schedule unrealizable.

Figure 2.2: A pipeline Schedule


If, for example, we take the schedule shown in Figure 2.2, the output of A is
shared between B and C. Let the rate at which A produces tuples be rA, and the
rates at which B and C consume them be rB and rC respectively. The schedule is
not realizable for all instances of these rates. Suppose, for example, that
rB = rC and rA is less than rB (hence also less than rC); then the schedule is
unrealizable, since A cannot produce tuples at a rate satisfying the demands of
B and C. If the reverse is true, A produces tuples at a rate greater than what
B and C can consume. Tuples then pile up in the output buffer of A (if the pull
model is used) or in the input buffers of B and C (if the push model is used).
The piling up of tuples may fill the buffers and hence cause a deadlock. It
should be noted that rB is not necessarily equal to rC, so the situation may be
more complicated still. Dalvi et al [16] derived the necessary and sufficient
conditions for a schedule to be pipelinable; otherwise it is materialized.
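The rate argument above can be sketched as a simple check (the rates and the
realizability rule used here are an illustrative simplification, not Dalvi et
al's actual conditions):

```python
# Sketch: a simplified realizability check for the one-producer,
# two-consumer schedule of Figure 2.2. Rates are in tuples per unit time.
# The rule below (producer rate must exactly match the consumers' demand)
# is a deliberate simplification for illustration.

def realizable(r_a, r_b, r_c):
    """A's output feeds both B and C in lock-step.

    If A is slower than either consumer, the consumers starve; if A is
    faster, tuples pile up in the buffers until they fill (deadlock).
    """
    demand = max(r_b, r_c)
    if r_a < demand:
        return False          # consumers starve
    if r_a > demand:
        return False          # buffers fill up
    return True               # rates balance; the pipeline can run

print(realizable(100, 100, 100))  # rates balance
print(realizable(50, 100, 100))   # producer too slow
```

With buffering, a real test would also bound how far the rates may diverge
before the buffers overflow, which is where the full conditions come in.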

Rao and Ross [19] propose a pipeline schedule that is less conservative than
that of Dalvi et al [16], in that it allows the simultaneous input of multiple
tuples.

Much as some sub-expressions may satisfy the necessary and sufficient
conditions for pipelining, it may still be impossible to pipeline them. This is
because pipelining a sub-expression implies that it has to be kept in cache,
and given the limited cache size, it may be impossible to keep all the
sub-expressions there.


Hulgeri et al [10] aim at optimizing the use of memory by appropriately
dividing it among the pipelinable sub-expressions in the plan. The amount of
memory given to a sub-expression determines the cost estimate associated with
it. Given that memory is finite, not all the sub-expressions may be
accommodated; Hulgeri et al [10], however, use the memory division to
accommodate as many pipelinable schedules as possible.

Faced with a scenario where not all the pipelinable schedules can be pipelined
due to memory limitations, there is a need to decide what to pipeline so as to
obtain the higher benefit. Urhan and Franklin [24] propose a dynamic approach
in which the choice of what to pipeline is guided by the perceived importance
(cost-wise) of the output. This way, even though the total number of pipelined
sub-expressions for two similar schemes may be the same, the savings may
differ.

Gupta et al [8] use a rather different approach to the optimal use of the
cache: they use query (and hence sub-expression) scheduling to achieve optimal
use of memory. A sub-expression may be evicted while it still has other
processes to serve, so long as it is replaced with one whose benefit is higher.
With pipelining, the algorithm has to decide what to pipeline and what to
materialize where the buffer space is not enough. Moreover, not all reused
sub-expressions are pipelinable. The approach is therefore limited to the
pipelinable schedules, and where these cannot all fit in memory, the pipelining
has to be done in a buffer-conserving manner.


2.4 Single Query Optimization Algorithms

Single query optimization algorithms explore one query at a time, find an optimal plan

and execute it. They mostly employ dynamic programming techniques and are

commonly used in commercial optimizers such as the System R optimizer.

2.4.1 Dynamic Programming Algorithms

In dynamic programming algorithms, queries are turned into, and optimized as,
query trees. In a tree, an internal node represents a join operation, a leaf
represents a relation, and an edge indicates data flow [15]. The execution
space, which is the set of all join trees for a query, is normally restricted
to deep trees, either right deep or left deep, but sometimes bushy and complex
deep trees are allowed. This is done to lower the sample space, and the applied
heuristics eliminate duplicates further.

The Left Deep Algorithm

The left deep algorithm works well for a space that allows only nested loop and
hash joins as join methods. However, the algorithm is poor for sort-merge
joins, since it has to keep multiple plans for sub-queries with different
interesting orders [15]. This algorithm, being selective, should be applied
when:-

(i) The types of join algorithm to be used are known in advance.

(ii) There is a facility to find out whether compatible join algorithms are the
ones used and, if they are not, to fall back on a standby alternative
algorithm.
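The left deep restriction pairs naturally with a Selinger-style dynamic program
over join orders. A minimal sketch follows; the relation sizes and the
multiplicative cost model with a fixed selectivity are invented for
illustration, not System R's actual model:

```python
# Sketch: dynamic programming over left-deep join trees. The cost model
# (cost of a join = size of its intermediate result, sizes multiplied and
# scaled by an assumed uniform selectivity) is a deliberate toy.

from itertools import combinations

sizes = {"A": 1000, "B": 10, "C": 100}   # hypothetical relation cardinalities
SELECTIVITY = 0.01                        # assumed for every join predicate

def best_left_deep(relations):
    # best[S] = (cost, order, result_size) for joining the set S left-deep.
    best = {frozenset([r]): (0, [r], sizes[r]) for r in relations}
    for k in range(2, len(relations) + 1):
        for subset in combinations(relations, k):
            s = frozenset(subset)
            candidates = []
            for r in s:                   # r is the right-most (inner) relation
                left = s - {r}
                lcost, lorder, lsize = best[left]
                out = lsize * sizes[r] * SELECTIVITY
                candidates.append((lcost + out, lorder + [r], out))
            best[s] = min(candidates)
    return best[frozenset(relations)]

cost, order, _ = best_left_deep(["A", "B", "C"])
print(order)  # the cheapest left-deep join order under this toy model
```

Because only left-deep shapes are enumerated, each subset of relations keeps a
single best plan, which is what keeps the table small.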

2.4.2 Randomized Algorithms

These come in to handle the rather complex queries which cannot be handled well
by the dynamic programming algorithms [1]. For complex queries, there are too
many possible alternatives, and a dynamic programming approach is likely to
cause a combinatorial explosion [14]. These algorithms define different states
and, for each state, use a cost function to find the cost of executing the
corresponding plan. The algorithm performs random walks from one state to
another directly accessible state (a neighbor). A move from a cheaper state to
a more expensive one is referred to as an uphill move; otherwise it is a
downhill move. The algorithm basically computes the costs of alternatives and
makes downhill moves so as to reach the minimum-cost plan.

There are three popular randomized algorithms: Iterative Improvement (II),
Simulated Annealing (SA) and Two Phase Optimization (2PO).

Iterative Improvement (II)

This algorithm repeatedly picks a state at random, descends to the local
minimum in its neighborhood, and compares the local minima found so as to come
up with the global minimum.


The Iterative Improvement Algorithm

procedure II
  minS = a dummy state with cost(minS) = infinity
  while not (stopping condition) do {
    S = random state
    # inner loop: descend to a local minimum
    while not (S is a local minimum) do {
      S' = random neighbour(S)
      if (cost(S') < cost(S))
        S = S'
    }
    # keep the cheapest local minimum found so far
    if (cost(S) < cost(minS))
      minS = S
  }
  return minS
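The procedure above can be sketched as runnable code on a toy plan space (the
cost landscape and the neighbourhood structure are invented for illustration):

```python
import random

# Toy plan space: plans 0..9 on a line, neighbours differ by 1.
# The cost landscape is invented; it has two local minima, one of them global.
COST = [9, 7, 5, 1, 6, 8, 4, 2, 7, 9]

def neighbours(s):
    return [n for n in (s - 1, s + 1) if 0 <= n < len(COST)]

def is_local_minimum(s):
    return all(COST[n] >= COST[s] for n in neighbours(s))

def iterative_improvement(starts=20, seed=0):
    rng = random.Random(seed)
    min_s, min_cost = None, float("inf")
    for _ in range(starts):                 # stopping condition: fixed restarts
        s = rng.randrange(len(COST))
        while not is_local_minimum(s):      # inner loop: walk downhill
            n = rng.choice(neighbours(s))
            if COST[n] < COST[s]:
                s = n
        if COST[s] < min_cost:              # keep cheapest local minimum
            min_s, min_cost = s, COST[s]
    return min_s, min_cost

best_plan, best_cost = iterative_improvement()
print(best_plan, best_cost)
```

Note that a restart landing near the expensive local minimum still ends there;
only restarts in the basin of the global minimum find it, which is exactly the
weakness discussed below.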

This algorithm has advantages in that:-

(i) It is very likely that the global minimum will eventually be found, and

(ii) The algorithm need not explore the whole sample space of plans, since it
makes downhill moves.

The algorithm is therefore random and selective, but always focused on the
identification of local, and hence global, minima. It has, however, some
shortcomings:-

(i) Since the process of finding a starting point for the search for a local
minimum is random, the same region around a local minimum may be repeatedly
selected; over-iteration around the same local minimum is therefore likely. If,
for example, in Figure 2.3 each circle represents a plan and the cost
associated with it, it can be seen that plan P1, with a cost of 12, is the
global minimum. Each of the three random walks ends at P7, which is one of the
local minima but not the global minimum. This shows the over-traversing of the
region around P7 without any improvement in the cost of the output, leading to
a higher search cost and a longer time taken to identify the minimum.

Figure 2.3: Random nature of the II algorithm


(ii) It is hard to prove that the local minima are really exhausted, since a
local minimum can exist within highly expensive neighbors. Such a local minimum
(which may well be the global minimum) can only be reached if the random
starting point is on an expensive neighbor and the move is towards it, or on
the minimum itself. In Figure 2.3, for example, the optimal plan is shielded in
this way.

Simulated Annealing (SA)

The Simulated Annealing algorithm allows some uphill moves so as to try to
reach a possibly shielded local minimum, but it does so with a continuously
reducing probability, to avoid being caught in a higher-cost local minimum
[15].

The Simulated Annealing Algorithm

procedure SA
  S = initial state
  T = initial temperature
  minS = S
  while not (frozen) do {
    # inner loop: one stage, run at temperature T
    while not (equilibrium) do {
      S' = random neighbour(S)
      c = cost(S') - cost(S)
      if (c < 0)
        S = S'                              # downhill move
      else
        S = S' with probability exp(-c/T)   # occasional uphill move
      if (cost(S) < cost(minS))
        minS = S
    }
    T = reduce(T)
  }
  return minS


Local minima are located by a sub-module called a stage. Each stage is
performed at a value of a parameter T called the temperature. T controls the
probability of performing an uphill move, and it keeps reducing until it
reaches zero, at which point the algorithm terminates.
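A runnable sketch of these stages on a toy plan space (the costs, cooling
schedule and stage length are invented; "frozen" and "equilibrium" are
simplified to fixed bounds):

```python
import math
import random

# Toy plan space: plans 0..9 on a line, neighbours differ by 1.
# The cost landscape is invented; it has a shielded global minimum.
COST = [9, 7, 5, 1, 6, 8, 4, 2, 7, 9]

def neighbours(s):
    return [n for n in (s - 1, s + 1) if 0 <= n < len(COST)]

def simulated_annealing(t0=10.0, cooling=0.8, steps_per_stage=30, seed=1):
    rng = random.Random(seed)
    s = rng.randrange(len(COST))
    min_s = s
    t = t0
    while t > 0.1:                          # "frozen" once T is tiny
        for _ in range(steps_per_stage):    # "equilibrium": fixed stage length
            n = rng.choice(neighbours(s))
            c = COST[n] - COST[s]
            # downhill moves always taken; uphill with probability exp(-c/T)
            if c < 0 or rng.random() < math.exp(-c / t):
                s = n
            if COST[s] < COST[min_s]:
                min_s = s
        t *= cooling                        # reduce the temperature
    return min_s, COST[min_s]

plan, cost = simulated_annealing()
print(plan, cost)
```

Early stages, at high T, accept most uphill moves and so can climb out of a
shielded region; late stages behave like pure downhill search.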

Two phase Optimization (2PO)

Two-phase Optimization uses both the Iterative Improvement and Simulated

Annealing approach [15]. In the first phase, Iterative Improvement runs for a small

period of time, making a few local optimizations. The output of the phase is the

initial state of the Simulated Annealing which runs with low initial probability for

uphill moves.

2.5 Multi-Query Optimization Algorithms

In multi-query optimization, queries are not optimized one at a time; instead, they are optimized, and hence executed, in batches. Kyuseok et al [14], Sellis and Gosh [23], Cosar et al [3] and Park and Segar [17] carried out research on multi-query optimization using basically exhaustive algorithms. These algorithms traverse a large number of the different options, find the minimum-cost plan for each query, and output the set of optimal plans.

Though optimized together, the plans of the queries, whether final or intermediate, do not


intervene or interfere in the generation of other plans. In fact, after individual query optimization, the queries compete for computer resources at execution time. Roy et al [20] advocate cooperation and do not entirely optimize the individual queries; they aim at finding the cheapest way of retrieving all the data so that all the query requests are serviced.

2.5.1 The MQO Problem

Kyuseok et al [14] formulate the MQO problem as follows:

Given n sets of plans {P1, P2, · · · , Pn}, with Pi = {Pi1, Pi2, · · · , Piki} being the set of possible plans for processing Qi;

Find a global access plan GP by selecting one plan from each Pi such that the cost of GP is minimal.
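For a small batch, the formulation can be solved by brute force. The sketch below illustrates the problem statement only (real optimizers avoid this enumeration); `gp_cost` is an assumed caller-supplied function that prices a combination of plans, and may charge less than the sum of the individual plan costs when the plans share work.

```python
from itertools import product

def optimal_global_plan(plan_sets, gp_cost):
    """Pick one plan from each P_i so that the global plan GP is cheapest."""
    best, best_cost = None, float("inf")
    for combo in product(*plan_sets):         # every combination <P_1j1, ..., P_njn>
        c = gp_cost(combo)
        if c < best_cost:
            best, best_cost = combo, c
    return best, best_cost
```

With an additive `gp_cost` this reduces to picking each query's cheapest plan; a sub-additive `gp_cost` can make the globally cheapest combination differ from the per-query optima.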

State-Space Search Algorithm for MQO

The search space for the MQO problem, according to Kyuseok [15], is constructed by defining one state for each possible combination of plans among the queries. For n queries Q1, Q2, · · · , Qn, Qi has plans Pi = {Pi1, Pi2, · · · , Piki} for i = 1, 2, · · · , n

such that:

1. Every state S is an n-tuple <P1j1, P2j2, · · · , Pnjn>. For each i, either Piji ∈ Pi, in which case S suggests that Piji should be used to process Qi, or Piji = NULL, in which case S does not suggest any plan for processing Qi. The cost of S, denoted scost(S), is the total cost of processing all the queries suggested in S.

2. The initial state is S0 = <NULL, NULL, · · · , NULL>, while a final state is SF = <P1j1, P2j2, · · · , Pnjn> with Piji ≠ NULL for all i.

3. Given that S = <P1j1, P2j2, · · · , Pnjn>, let

   next(S) = min{i | Piji = NULL}   if {i | Piji = NULL} ≠ ∅
           = n + 1                  otherwise

4. Let the state S have at least one NULL entry and let m = next(S). Then the immediate successors of S are the states S′ = <P1k1, P2k2, · · · , Pnkn> satisfying the following properties:

   Piki = Piji for 1 ≤ i < m
   Pmkm ∈ Pm
   Piki = NULL for m + 1 ≤ i ≤ n

   The cost of a transition from S to S′ is the additional cost required to process the new plan Pmkm given the results of processing the plans in S.


This algorithm is not sharing-oriented, either during optimization or during execution. The global cost is therefore the sum of the costs of the individual optimal plans: for queries Qi with optimal plans Pi, i = 1, 2, · · · , n, the cost of execution is Σ_{all i} cost(Pi).

2.5.2 Reuse-Based Optimization

None of the multi-query algorithms explored so far takes into consideration the possibility that common sub-expressions exist and could yield further savings in overall cost. In reuse-based optimization, such sharing possibilities are explored. Volcano [7], Volcano-SH [20] and Volcano-RU [20] use the same principle, though with different philosophies. Most reuse-based algorithms use a Directed Acyclic Graph (DAG) to represent the search space; in some cases, the search space is represented as an AND-OR DAG.

Figure 2.4: Query representation: Tree, DAG and extended DAG


An AND-OR DAG is a DAG whose nodes are divided into two groups: AND nodes and OR nodes. AND nodes have only OR nodes as their children, and OR nodes have only AND nodes as their children. An AND node represents an algebraic operation such as select (σ) or project (π); AND nodes are therefore referred to as operation nodes. An OR node represents a logical expression that generates the same result set as any of its child operation nodes applied to their children; OR nodes are therefore referred to as equivalence nodes. The expanded DAG is used as the representation in modern optimizers because it is easily extensible.

Graefe and McKenna [7] proposed an algorithm for generating an AND-OR DAG, and Roy et al [21] improved its efficiency.

AND-OR DAG Adjustments for Multi-Query Optimization

Roy et al [20] proposed three main adjustments that need to be made to an AND-OR DAG in order to get an optimal plan:

(i) Merging

Since each query has a single DAG and the queries are to be optimized simultaneously, the DAGs need to be merged into a single DAG. A pseudo root equivalence node is created with all the individual DAG roots as its children.


(ii) Identification of common sub-expressions

If two common sub-expressions exist in the pseudo-rooted DAG, there will be two equivalence nodes which are exactly the same, or the same after applying join reordering rules, as in the case of X ⋈ Y and Y ⋈ X.

(iii) Handling of nodes derivable from others

In case expressions that can be derived from others exist, the optimizer adds some nodes appropriately as children or as parents. For example, if two equivalence nodes e1 = σA≤5(E) and e2 = σA≤10(E) exist, e1 can be derived from e2 by applying σA≤5 to its output; e2's result is therefore computed once and shared, with e1 derived from e2. If instead there were nodes e3 = σA=5(E) and e4 = σA=10(E), then a node e5 = σA=5∨A=10(E) is created and the two are derived from it.

The Basic Volcano Algorithm

This determines the cost of the nodes using a depth-first traversal of the DAG. The cost of an operation node o is given by

cost(o) = costofexecuting(o) + Σ_{ei ∈ children(o)} cost(ei)        (2.1)

and the cost of an equivalence node e is given by

cost(e) = min{cost(oi) | oi ∈ children(e)}        (2.2)

If the equivalence node has no children, then cost(e) = 0. In case a certain node has to be materialized, then the equation for cost(o) is adjusted to incorporate


materialization. For a materialized equivalence node, the minimum between the cost of reusing the node and the cost of recomputing the node is used. The equation therefore becomes

cost(o) = costofexecuting(o) + Σ_{ei ∈ children(o)} C(ei)        (2.3)

where

C(ei) = cost(ei)                        if ei is not materialized
      = min(cost(ei), reusecost(ei))    if ei is materialized
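Equations 2.1 to 2.3 can be captured by a short recursive traversal. The sketch below is an illustration, not the Volcano code: nodes are assumed to be plain dicts, where an operation node carries 'exec_cost' and its input equivalence nodes under 'children', and an equivalence node carries its alternative operation nodes under 'children', plus optional 'materialized' and 'reuse_cost' fields.

```python
def op_cost(o, memo):
    """Eq. 2.3: cost of executing o plus C(e_i) for each input equivalence node."""
    total = o["exec_cost"]
    for e in o["children"]:
        c = eq_cost(e, memo)
        if e.get("materialized"):
            c = min(c, e["reuse_cost"])       # reuse the materialized result if cheaper
        total += c
    return total

def eq_cost(e, memo):
    """Eq. 2.2: cheapest alternative operation; 0 for a childless (base) node."""
    if id(e) not in memo:
        memo[id(e)] = min((op_cost(o, memo) for o in e.get("children", [])),
                          default=0)
    return memo[id(e)]
```

The memo table makes the depth-first traversal visit each equivalence node once even when it is shared by several operation nodes.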

The Volcano-SH Algorithm

In Volcano-SH, the queries are first optimized using the Basic Volcano algorithm, and the resulting best plans are then merged by creating a pseudo root. The optimal query plans may have common sub-expressions which can be materialized and reused. If a certain equivalence node has computation cost cost(e), materialization cost matcost(e), reuse cost reusecost(e), and is to be used numuses(e) times, the optimizer decides whether it should be materialized or not. For materialization to be chosen, there must be a saving in cost as a result. Therefore

cost(e) + matcost(e) + (numuses(e) − 1) × reusecost(e) < numuses(e) × cost(e)        (2.4)

which simplifies to

reusecost(e) + matcost(e)/(numuses(e) − 1) < cost(e)        (2.5)
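The test in inequality 2.5 is a one-liner; the sketch below (an assumption-level illustration with scalar costs) makes the break-even reasoning explicit: the node is computed once, materialized once, and reused on its remaining numuses(e) − 1 uses.

```python
def worth_materializing(cost, matcost, reusecost, numuses):
    """Inequality 2.5: materialize only if reuse plus amortized
    materialization cost beats recomputing the node each time."""
    if numuses <= 1:
        return False                  # a node used once can never pay for itself
    return reusecost + matcost / (numuses - 1) < cost
```

For example, a node costing 100 to compute, 20 to materialize and 5 to reuse is worth materializing at three uses (5 + 20/2 = 15 < 100), while a cheap node used twice may not be.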

This condition has one practical limitation. The Volcano-SH algorithm works from the bottom upwards, and the number of times a node is used depends on whether or not its parents have been materialized. The cost of a node depends on its children and is therefore known, but the number of times the node will be used requires first traversing the DAG, which is an expensive process [20].

The Volcano-SH Algorithm

Procedure VOLCANO-SH(P)
Input: consolidated Volcano best plan P for the virtual root of the DAG
Output: set of nodes to materialize M and the corresponding best plan P
Global variable: M, the set of nodes chosen to be materialized

M = { }
perform a prepass on P to introduce subsumption derivations
let costRoot = COMPUTEMATSET(root)
set costRoot = costRoot + Σ_{d ∈ M} (cost(d) + matcost(d))
undo all subsumptions on P where the subsumption node is not
    chosen to be materialized
return (M, P)

Procedure COMPUTEMATSET(e)
if e is already memoized, return cost(e)
let operation node o be the child of e in P
for each input equivalence node ei of o
    let C(ei) = COMPUTEMATSET(ei)       // computation cost of ei
    if ei is materialized, let C(ei) = reusecost(ei)
compute cost(e) = costofoperation(o) + Σ C(ei)
if reusecost(e) + matcost(e)/(numuses(e) − 1) < cost(e)
    if e is not introduced by a subsumption derivation
        add e to M                      // decide to materialize e
    else if reusecost(e) + matcost(e)/(numuses(e) − 1) is less
            than the savings to the parents of e due to the
            introduction of a materialized e
        add e to M
memoize and return cost(e)


In such a scenario,

numuses(e) = Σ_{p ∈ parents(e)} U(p),   where

U(p) = numuses(p)   if p is not materialized
     = 1            if p is materialized.

Instead of using numuses(e), an underestimate numuses⁻(e) is used, and the condition for materialization becomes

reusecost(e) + matcost(e)/(numuses⁻(e) − 1) < cost(e).        (2.6)

Since an underestimate is used, whenever the condition holds for numuses⁻(e) it must also hold for the actual value. Cases where the condition fails for the underestimate but holds for the actual value have low savings.

The Volcano-RU Algorithm

Volcano-RU exploits sharing well beyond the optimal plans of the individual queries. Though the Volcano-SH algorithm considers sharing, it does so only on the individually optimal plans, so sharable components that appear only in sub-optimal plans are left out. Including sub-optimal plans, however, means that the search space of nodes grows; the search algorithm must take this into consideration so that the searching cost stays below the extra savings made.


The Volcano-RU Algorithm

Procedure VOLCANO-RU
Input: expanded DAG on queries Q1, ..., Qn
       (including subsumption derivations)
Output: set of nodes to materialize M and the corresponding best plan P

N = { }                                 // set of potentially materializable nodes
for each equivalence node e, set count[e] = 0
for i = 1 to n
    compute Pi, the best plan for Qi, assuming the nodes in N are materialized
    for every equivalence node e in Pi
        set count[e] = count[e] + 1
        if reusecost(e) + matcost(e)/(count[e] − 1) < cost(e)
            add e to N                  // worth materializing
combine P1, P2, ..., Pn to get a single DAG-structured plan P
(M, P) = VOLCANO-SH(P)                  // Volcano-SH makes the final materialization decision

The Volcano-RU algorithm aims at reusing and sharing sub-expressions which are not necessarily in the individual query optimal plans. Volcano-RU is sequential, considering possibilities of reusing expressions of previously optimized queries in subsequent queries. For a set of queries under the same pseudo root, after optimizing Qi, the nodes in the plan of Qi are identified. Since at that moment the algorithm has no knowledge of the structure of the subsequent queries, it checks whether it would be beneficial if a certain node were materialized and reused one extra time. While optimizing the next query, such cost-saving expressions are considered to be present. Volcano-SH is then applied to further detect and exploit more sharing opportunities. In this way, a query is able to share sub-expressions within itself, and all materializable plans are identified.


Volcano-RU depends on the order in which the queries are optimized. Optimization can be done in a certain sequence, then in reverse order, and the cheaper alternative chosen. Considering more orders increases the probability of finding a cheaper plan, but it also increases the optimization time [20].

Reuse-based algorithms depend on the number of economically reusable components in a batch of queries. Their efficiency is therefore derived from the proportion of searches that result in materializable and sharable nodes. In some cases, the order of optimization is also of paramount importance. The materializability of a node depends on the nature of the node and is hence out of the control of the optimizer. The shareability of a node, however, depends on how much the queries in a batch have in common. In this research, we investigate the indicators of shareability as well as how to efficiently establish the extent of sharing, so as to make a more guided search.

2.6 Research Questions

1. Since queries are sent from different sources, they are different and random, but since they are addressed to the same schema, there are possible similarities. How can we efficiently detect sharing opportunities (and the lack of them) so that the search for common sub-expressions is done with a high probability of resource savings?


2. Materialization is a good sharing option but not the best. Pipelining is limited to specific schedules, and extensive pipelining is impossible given limited buffer space. Can we find a more cost-saving schedule with minimal materialization for an AND-OR DAG plan?


Chapter 3

Query Shareability Establishment

3.1 Motivation

With the load on database applications increasing, there is a need to improve the efficiency of query optimizers so as to accommodate the load together with the non-functional constraints on them, like speed. The costs incurred come mostly from physical storage access, storage of intermediate relations, and computations on intermediate data. To minimize the frequency at which the database is accessed, we have to ensure that common outputs are reused rather than accessing the storage disk multiple times for the same data.

Previous research in multi-query optimization [7, 8, 17, 20, 24] acknowledges the


need to exploit sharable sub-expressions, but depends heavily on the possibility that common sub-expressions exist. Since the queries come from different sources, there is a possibility that they have nothing in common, leading to a worthless search. No effort is made to establish whether common sub-expressions actually exist or not.

When queries have been checked for syntactic correctness and parsed into plans, they are sent to the query optimizer to generate the most cost-effective plan. Multi-query optimizers traverse the different query plans (DAGs or AND-OR DAGs) and identify sub-expressions among which sharing can be done. Sharing of node outcomes saves memory, disk access and computation costs, hence lowering the global resource requirements for the execution of the queries.

Though the search for sub-expressions to be shared takes some resources (time and processor cycles), if a good search strategy is used the savings exceed the searching cost, giving a global advantage. Our aim therefore is to keep the cost of searching as low as possible while finding as many common sub-expressions as possible, so as to yield maximum benefit.

Generally, there are three cases in which sharing is possible. Considering an AND-OR DAG, if W is any equivalence node of a plan, then sharing is possible when:

(i) Nodes produce identical results, e.g. s1 = σx=a(W) and s2 = σx=a(W). In this case, only one is executed and the other simply uses the output of the


executed node.

(ii) The output of one node is a superset of the other's. For example, with nodes s3 = σx≤6(W) and s4 = σx≤10(W), s4 is executed and s3 is derived from it, i.e. s3 = σx≤6(s4).

(iii) Nodes retrieve data from the same (intermediate) relation or view but with different instances of the outermost constraint. For example, with s5 = σx=10(W) and s6 = σx=15(W), a new node s7 = σx=10∨x=15(W) is created so that the two are derived from it, i.e. s5 = σx=10(s7) and s6 = σx=15(s7).
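As a toy illustration of the three cases (not part of the optimizer itself), single-attribute selections over the same relation W can be classified as follows; the encoding of predicates as ('op', value) tuples is an assumption made for the example.

```python
def sharing_case(pred1, pred2):
    """Classify two selection predicates on the same input W into the
    three sharing cases above, or report that no sharing applies."""
    op1, v1 = pred1
    op2, v2 = pred2
    if pred1 == pred2:
        return "identical"           # case (i): execute once, share the output
    if op1 == op2 == "<=":
        return "subsumed"            # case (ii): derive the narrower from the wider
    if op1 == op2 == "=":
        return "union-derivable"     # case (iii): create sigma(v1 or v2), derive both
    return "not sharable"
```

A full optimizer would of course handle multi-attribute predicates and joins; the sketch only mirrors the three cases listed above.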

The efficiency of the multi-query optimizer does not depend on how aggressively common sub-expressions are looked for, but rather on the search strategy [20]. The search strategy may need to exploit issues like:

i The ability to tell that there are no more sub-expressions common between a pair of queries without having to exhaustively search all the available sub-expressions.

ii The ability to choose the order in which the queries in a batch [8, 20] should be processed so as to have a cost-effective process.

iii The ability to identify which nodes to materialize or pipeline [16], and the order in which those to pipeline should be assembled [8].


iv The ability to mix materialization and pipelining for specific sub-expressions so as to optimize the use of the cache.

Given that multi-query optimization takes place on many complex queries over many relations, comparing sub-expressions exhaustively leads to too many comparisons, hence high comparison cost and time. Our aim is therefore to establish the extent of sharing without necessarily traversing all the nodes. The output is then used while optimizing: if we know that two queries have no node in common, we do not attempt to exploit a similarity that is not there.

3.2 The Shareability Problem

While searching for the sharable sub-expressions between queries, we are guided by the three cases in which sub-expressions can be reused. If the sub-expressions are exactly the same, then we can use any one of them in all the instances where it appears. If that is not the case, and one of the sub-expressions is a superset of the rest, we need to decide which sub-expression should actually be executed and which is to be derived from it. Furthermore, if the sub-expressions are not subsets of equivalent nodes, we have to create another sub-expression from which the rest are derived. For the purpose of identification, once a pair of queries has any such sub-expressions, we say they are sharable.


Since the information as to whether or not the queries are sharable is to be used during the actual optimization, we need to establish a way of summarizing the outcome. We therefore introduce some terminology to be used in this research.

(a) Query-Sharing Matrix

For n queries in a batch, a query-sharing matrix M is an n×n matrix with integer entries in which the entry M[i,j] shows the number of sub-expressions sharable between the ith and the jth query.

(b) Query Popularity

This is the number of instances in which the query's plan sub-expressions are found to be sharable with those of other queries in the batch. It should be noted that popularity may partly arise out of intra-query sharing opportunities. For a given query-sharing matrix M, the popularity of the ith query is Σ_{k ∈ [1,n]} M[k, i] = Σ_{k ∈ [1,n]} M[i, k].

(c) Order of a node

This is the number of (not necessarily distinct) relations, combined by at least one relational operation, that make up a node. It should be noted that if there is no relational operation (i.e. the node is a whole relation), then the order is zero.
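Given the matrix, the popularity of definition (b) is a row or column sum. The sketch below assumes M is represented as a symmetric list-of-lists of integers.

```python
def popularity(M, i):
    """Popularity of query i: total sharable instances recorded in row i,
    which equals the column sum since the sharing matrix is symmetric."""
    row = sum(M[i])
    col = sum(M[k][i] for k in range(len(M)))
    assert row == col                 # sanity check: M is symmetric
    return row
```

For a three-query batch where queries 1 and 2 share two nodes, queries 2 and 3 share three, and query 2 shares one node internally, query 2's popularity is 2 + 1 + 3 = 6.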

Our aim therefore is to establish the extent to which the sub-expressions of any two queries are sharable. This helps in deciding whether or not an attempt to look for sharing opportunities between any two queries should go on.


3.3 Previous Research

The need to know the number of times a sub-expression is used in a query batch is evidently of paramount importance; the research on multi-query optimization [1, 7, 16, 20, 21] employs it. However, none of it makes a deliberate attempt to establish this number with any accuracy.

The approach of the Volcano-SH optimizer [20] is to count the number of parents a node has and to take this as an underestimate of the number of times the sub-expression is used. The parents exist in the parent DAG of the query plan, so inter-query sharing is not attempted at all. Moreover, the underestimate is not accurate: in some cases it is actually greater than the actual number of times a node is used. For example, consider a DAG in which node a is used twice to produce node b, which is used three times to produce node c (c has two parents, and c together with its parents are all used once). The number of times a is used is 6. Using the underestimate of Roy et al [20], we come up with 4. Since the underestimate is less than the actual value, it poses no decision contradiction. However, assuming all nodes were used once, the number of times a is used would be 1, yet the underestimate remains 4. This is dangerous, since the underestimate may permit materialization that the actual value would refuse. In this case, for example, since a appears only once, the decision to materialize it should not come up at all. Moreover, this approximation uses a single DAG; it does not consider cases of inter-query sharing.


In Volcano-RU [20], a node is chosen for materialization if it would cause savings when reused once. This implies that all nodes that could cause a saving are materialized. But not all nodes that would cause savings when reused once actually occur more than once; therefore, some nodes are chosen for materialization even though they do not occur multiple times. Furthermore, a node may occur, say, four times and cause savings on materialization, yet it would not cause savings if it occurred only twice. This implies that, on top of Volcano-RU choosing nodes that appear once for materialization, it may leave out nodes that are able to cause savings, only because a more accurate approximation of their frequency is not known. Based on these weaknesses, and on the importance of a more accurate approximation of the frequency with which a node is to be used, we establish the extent of inter-query sharing before we proceed to optimize.

3.4 The Greedy Search Algorithm

In the greedy search algorithm, we compare two query plans at a time and, for each pair of nodes, establish whether or not they are sharable. If they are sharable, the appropriate entry in the query-sharing matrix is incremented. Consider a batch of n queries Q1, Q2, · · · , Qn. For any query Qi, the set of nodes in its Volcano optimal plan can be obtained by the method proposed by Graefe and McKenna [7]. The input, as in Volcano-SH [20], is a set of Volcano


best plans. Let Si be the set of equivalence nodes in the plan of Qi. The greedy algorithm checks nodes pairwise and establishes whether sharing is possible. If it is possible for queries Qi and Qj, the query popularities and M[i,j] are incremented. This is done until all the sets are exhausted.

The Proposed Greedy Search Algorithm:

for (i=1; i<=n; i++)
    S = set of nodes in the ith plan
    for (j=1; j<=n; j++)
        P = set of nodes in the jth plan
        while (S still has nodes)
            a = next node in S
            while (P still has nodes)
                b = next node in P
                if (Sharable(a,b))
                    increment M[i,j]
                endif
            endwhile
        endwhile
    endfor
endfor

Though this algorithm is simple, it is too greedy to be optimal or feasible, because it compares indiscriminately and therefore makes too many comparisons. Assume a query Qi has ki nodes in its Volcano best plan and consider the first node of the Q1 plan: it makes k1 − 1 comparisons during intra-query comparison, k2 comparisons while being compared with Q2, and so on, up to kn comparisons for query Qn. This node alone therefore undergoes (Σ_{all i} ki) − 1 comparisons, while the whole Q1 plan undergoes k1 × ((Σ_{all i} ki) − 1) comparisons. The whole batch needs (Σ_{all i} ki) × ((Σ_{all i} ki) − 1) comparisons, which is too many.
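The count can be checked numerically: with K = Σᵢ kᵢ nodes in the batch, indiscriminate pairwise checking performs K(K − 1) ordered comparisons, as the small helper below illustrates.

```python
def naive_comparisons(plan_sizes):
    """Ordered comparison count of the unimproved greedy search:
    every node is compared against every other node in the batch."""
    K = sum(plan_sizes)               # total nodes across all plans
    return K * (K - 1)
```

Even a modest batch of five plans with 20 nodes each (K = 100) already requires 9,900 comparisons, which is why the improvements of Section 3.5 matter.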


We therefore propose improvements to the greedy algorithm so as to have a less expensive search, both in processor resources and in time.

3.5 Improvements to the Greedy Algorithm

Though the greedy algorithm is able to find all the sharable nodes, we need to make some improvements to it so as to minimize the search cost. In fact, the search cost can be reduced without interfering with the output. This is done by adding some intelligence to the search algorithm so that some decisions can be deduced. The following observations can be noted, and improvements addressing them can be made, without substantially affecting the cost-effectiveness of the scheme.

a. Elimination of duplicate searches

The output of the search between Qi and Qj is equivalent to that between Qj and Qi, since query shareability is independent of the order in which the queries are checked. We therefore need to make sure that once such a pair has been compared, it is never compared again. To ensure this, we compare Qi and Qj if and only if i ≤ j.

b. Search by node order

From the conditions that have to be satisfied before query sharing takes place, sharing can occur between nodes of different queries, but strictly between nodes of the same order. We therefore group the nodes by order and, for each node, search for shareability only within the same order. This reduces the sample space for each node when the search for sharable sub-expressions takes place.

c. Null sharing prediction

Moving up the orders makes the nodes more specific. If, for a pair of queries, there is no sharing between nodes of order m, then there is no sharing between nodes of order n for any n > m. We therefore terminate the search for that query pair as soon as we finish the nodes of a certain order and find no sharing opportunity.

d. Zero-order tip

Relations (zero-order nodes) are too broad to be considered for sharing; we only use them to find whether two queries are disjoint. If we get a sharing opportunity at this order, we do not update M but go on to the next order (order one). If, however, no zero-order nodes are sharable, then the queries are disjoint and we need not continue searching for shareability at higher orders.


3.6 The Improved Greedy Searching Algorithm

We now present the enhanced greedy algorithm, in which the observations above are taken into consideration in order to have a more optimal search. It outputs the sharing matrix.

for (i=1; i<=n; i++)
    for (j=i; j<=n; j++)
        ORDER = 0
        repeat
            nextOrderPermission = false
            Si = set of nodes of order ORDER in query i
            node1 = starting node in Si
            Sj = set of nodes of order ORDER in query j
            node2 = starting node in Sj
            while (Si still has nodes)
                while (Sj still has nodes)
                    if (node1 and node2 are sharable)
                        nextOrderPermission = true
                        if (ORDER = 0)
                            break out of the two while loops
                        else
                            increment M[i,j] and M[j,i]
                            mark node1 and node2
                        endif
                    endif
                    node2 = next(node2)
                endwhile
                node1 = next(node1)
            endwhile
            ORDER = next(ORDER)
        until (nextOrderPermission = false or orders are exhausted)
    endfor
endfor
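A runnable rendering of the pseudocode is sketched below. It is an illustration under simplifying assumptions: each plan is given as a dict mapping a node order to the list of nodes of that order, `sharable(a, b)` is a caller-supplied test, and intra-query self-comparison is not specially handled (as in the pseudocode).

```python
def sharing_matrix(plans, sharable):
    """Build the query-sharing matrix M using the four improvements:
    (a) each pair is compared once, (b) comparison stays within one order,
    (c) an order with no sharing stops the search for the pair, and
    (d) order 0 only probes for disjointness and never updates M."""
    n = len(plans)
    M = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):                             # (a) i <= j only
            max_order = max(list(plans[i]) + list(plans[j]), default=0)
            order, proceed = 0, True
            while proceed and order <= max_order:
                proceed = False
                for a in plans[i].get(order, []):         # (b) same-order nodes only
                    for b in plans[j].get(order, []):
                        if sharable(a, b):
                            proceed = True                # (c) permission to go deeper
                            if order > 0:                 # (d) zero order: no update
                                M[i][j] += 1
                                if i != j:
                                    M[j][i] += 1
                order += 1
    return M
```

For example, two plans reading relation R and selecting x = 10 share one order-1 node, while a plan over a different relation S is detected as disjoint at order 0 and never searched further.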


The algorithm traverses the pseudo-rooted DAG and gives a summary of the extent of sharing between any two queries. Using this information, we can deduce the popularity.

The matrix acts as a guide to establish which queries have sub-expressions in common and which queries are disjoint. With such information, the search for sharable nodes in disjoint queries can be done away with, targeting the search effort at cases where there is a high chance of making cost savings. However, the search does not take into consideration the cases where multiple use of a parent results in multiple use of its children. Much as the algorithm can tell which query shares nodes with which other query, it cannot tell how many times a node will be used. The entries in M therefore represent the number of times a node or its sharing partners appear, not the number of times they are used.


Chapter 4

Optimizing the Traversed Plans

4.1 Background

From the very principle of multi-query optimization, the optimizer has to look for common sub-expressions, or any sub-expressions where cooperation at execution time can lead to cost savings. Tests are made to establish whether costs will really be saved. If the tests are promising, such a sub-expression or group of sub-expressions is explored.

In Chapter 3, we traversed the query plans using a greedy but intelligent algorithm, and the output was put in a query sharing matrix M. Consider the general query sharing matrix M below:


M =

    | m11  m12  m13  m14  ...  m1n |
    | m21  m22  m23  m24  ...  m2n |
    | ...  ...  ...  ...  ...  ... |
    | mn1  mn2  mn3  mn4  ...  mnn |

the entry M[i, j] = mij is an integer that shows how many sub-expressions (equivalence nodes in AND-OR DAGs) are sharable between queries Qi and Qj. This is the same as the value of M[j, i]. If we sum up the entries in a column (or a row), we get the total number of instances in which nodes in the plan have sharable partners. This is called the query popularity. All nodes that have partners are marked so that the optimizer can identify which nodes necessitate checking other plans while searching for common sub-expressions and which nodes do not. This helps to eliminate null searches, hence a more efficient strategy. After identifying the extent of sharing among the queries, we can tell which pair or pairs of queries have nothing in common. For such queries, we do not need to search for common sub-expressions since they do not exist. Likewise, we can tell, for each query, which other queries in the batch share nodes with it and to what extent. When we start searching for common sub-expressions, we start with one query and search for its sharable sub-expressions with other queries in the DAG. The query plan, which


is at the center of the searching process, is called the focal plan. Since the plans are already traversed at the searching stage, we have nodes which are marked (those with sharable sub-expressions elsewhere) and those which are not. It is only for marked nodes that we search other plans for sharable nodes. It should be noted that a node being marked does not mean that its partners will be searched for in all plans. Since the query sharing matrix has a summary of which query shares with which, only plans with a non-zero entry in the column/row representing the extent of sharing with the focal plan are searched. The algorithm therefore searches only in optimistic cases.

In this chapter, we use the information in the query sharing matrix as a guide so as to exploit the sharing opportunities among the queries that make up the batch.
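As a small illustration, once the sharing matrix M is available (here as a plain list of lists; the three-query batch is hypothetical), the popularities and the disjoint pairs can be read off directly:

```python
def popularities(M):
    """Popularity of each query: the sum of its row in the sharing matrix."""
    return [sum(row) for row in M]

def disjoint_pairs(M):
    """Pairs of distinct queries with no sharable sub-expressions."""
    n = len(M)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if M[i][j] == 0]

# Hypothetical 3-query batch: Q0 and Q1 share two nodes; Q2 shares nothing.
M = [[0, 2, 0],
     [2, 0, 0],
     [0, 0, 0]]
```

Here `popularities(M)` gives `[2, 2, 0]`, so Q2 has zero popularity and is never involved in the search.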

4.2 Related Work

Looking for common sub-expressions in a query batch is done in most multi-query optimization algorithms, though following different approaches and principles. The algorithms in [7, 20], however, seem not to make enough preparations to exploit them to the full.


4.2.1 The Basic Volcano Algorithm

This was proposed by Graefe and McKenna [7] as a reaction to the previously proposed Exodus optimizer [6]. It uses a DAG as the representation of the query plans. It has a problem of extensibility, since AND-OR DAGs are easier to extend than DAGs [21].

The Basic Volcano algorithm materializes all nodes that appear more than once. This brings in the problem that not all nodes that appear more than once cause savings when materialized. As observed in [20], for some nodes it is cheaper to recompute them than to materialize and reuse them. This is because materialization involves writing to and reading from disk, which is costly. The Basic Volcano lays a foundation for cost-effective reuse, but:

i It does not establish the cost effectiveness of a candidate node before choosing it for materialization.

ii Its search is exhaustive; the optimizer therefore incurs a high search cost, which has a negative impact on the overall cost effectiveness of the query processor.

The Basic Volcano algorithm therefore incurs a lot of costs in searching for sharable sub-expressions, which may render it inefficient, especially for large (and therefore complex) queries.


4.2.2 The Volcano-SH

The Volcano-SH [20] is an extension of the Basic Volcano algorithm in that it uses the Basic Volcano optimal plans as an input. The Volcano-SH computes the cost of each node and decides whether or not it is cost effective to materialize it. This is done by weighing materialization and reuse against recomputation. Suppose, for example, we have an equivalence node e with the following characteristics:

Number of times it is to be used = numuses(e)
Cost of computing the node = cost(e)
Cost of materializing the node = matcost(e)
Cost of reusing the node = reusecost(e)

A decision has to be made whether to materialize and reuse the node or to recompute the node whenever it is needed. If every use is computed from the database, the cost incurred would be

cost(e) × numuses(e)

and if the node is computed once and then materialized so that it is subsequently just reused, the cost incurred would be

cost(e) + matcost(e) + reusecost(e) × (numuses(e) − 1)


Materialization is cost effective if

cost(e) + matcost(e) + reusecost(e) × (numuses(e) − 1) < cost(e) × numuses(e)   (4.1)

or, more simply,

reusecost(e) + matcost(e)/(numuses(e) − 1) < cost(e)   [20]   (4.2)

Volcano-SH therefore selects whether or not to materialize depending on the cost effectiveness of the scheme.

The Volcano-SH traverses the DAG from the leaves towards the root. Since the cost of a node is computed from its children, the cost of a node can be accurately established as the algorithm traverses the DAG. The number of times a node is used, however, depends on the materialization status of its parents. Since a node is reached before its parents, this cannot be easily established. Roy et al. [20] use an underestimate numuses−(e), which is got by counting the number of parents of the node. The condition for materialization is therefore modified to

reusecost(e) + matcost(e)/(numuses−(e) − 1) < cost(e)   (4.3)
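The materialization test above is straightforward to express in code. The sketch below is a minimal illustration, assuming the per-node cost figures are already available:

```python
def should_materialize(cost, matcost, reusecost, numuses):
    """Volcano-SH style materialization test.

    Materializing pays off when reading the result back costs less, per
    reuse, than recomputing it: the reuse cost plus the materialization
    cost amortized over the (numuses - 1) reuses must beat the
    computation cost.
    """
    if numuses <= 1:
        return False  # a node used at most once gains nothing
    return reusecost + matcost / (numuses - 1) < cost
```

For example, a node costing 100 to compute, 40 to materialize and 10 to reuse, used 3 times, satisfies the test (10 + 40/2 = 30 < 100), so it is materialized.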

It is advantageous in that it eliminates blind materialization, hence saving more resources. It however has some shortcomings that need to be addressed. The shortcomings are:


i Inter-query Sharing Elimination

Its estimate of the frequency with which a node appears takes no account of the other plans in the pseudo-rooted DAG, yet inter-query sharing makes a lot of savings on resources.

ii Accuracy of Estimates

Its estimate is inaccurate. In fact, in some cases the underestimate is higher than the actual value, which may lead to wrong decisions.

iii Order of Processing

It does not attempt to exploit the inter-query extents of similarity. This makes it unable to decide on the optimal order in which the queries should be processed.

iv DAG Trimming

It does not trim already catered-for nodes, hence the search works on a fixed sample space, leading to wasted search effort.

4.2.3 The Volcano-RU

Unlike the Volcano-SH [20], the Volcano-RU [20] does not take in the Basic Volcano outputs, and neither does it attempt to establish the number of times a node is used. It optimizes one query at a time, and any node (whether on the optimal plan or not) that would cause savings if reused once is chosen for materialization.


The subsequent queries are optimized putting into consideration the fact that some nodes are already materialized. Its strength lies in exploiting shareability beyond the Basic Volcano optimal plans. Given its approach, the order of optimization is of paramount importance, since the nodes to be materialized depend on which queries have been optimized so far.

It however has the following weaknesses:

i Order of Processing

It does not go into the details of establishing the exact optimal order of optimization. Roy et al. [20] propose that after optimizing in a specific order, we optimize in the reverse order and the cheaper option is chosen. However, since the first order was arbitrary, a more optimal order is very likely to exist. Attempting to choose other orders at random while searching for the optimal order wastes time [20].

ii Over Materialization

It also has a problem of excessive materialization, since not all nodes that would cause savings if reused actually exist more than once. The excessive materialization leads to further costs incurred at materialization, hence a more costly query processor.

iii Inaccurate Materialization Criteria

In a DAG, some nodes appear more than twice and cause savings, yet they


would not make the savings if they appeared only twice. Such nodes are left out, since the criterion only considers those which would cause savings when reused once. The materialization therefore may be insufficient.

4.3 The Proposed Optimizing Algorithm

4.3.1 The Background

In designing this algorithm, we attempt to address the weaknesses identified in the Basic Volcano, Volcano-SH and Volcano-RU. It is against these addressed weaknesses, coupled with the strengths already exhibited by the algorithms, that we lay the foundation for the new algorithm.

a. Optimization order

The Volcano-RU [20] acknowledges the role of the order of optimization, since by its philosophy the content of the materialized set is of high importance. Though this algorithm, unlike the Volcano-RU, uses the Basic Volcano optimal plans as its inputs, it uses order in another way. We identify the plan with the highest popularity and establish its details (node costs and frequencies). The common sub-expressions are searched for in the rest of the queries following the stable marriage preference list of the focal plan. If, according to the sharing matrix M, a certain plan has no node to share with the


focal plan, we do not search it.

b. Sample space management

If we reduce the sample space without affecting the output and the optimal plan, we minimize the optimization cost of the algorithm. If we identify more than one node to be sharable, a decision is made as to which of the nodes will remain in the plan and which (the rest) are to be removed from the plan. Action is taken such that the plans are supplied with the output of the retained node, so that the output of the removed nodes is compensated for. In an AND-OR DAG, the retained equivalence node will then have multiple parents. This reduces the sample space for the subsequent searches without affecting the output of the query batch. It also saves the cost of recomputation.
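The compensation step described above amounts to re-pointing parents in the DAG. A toy sketch (the adjacency representation and all names are hypothetical) of removing sharable duplicates in favour of one retained node:

```python
def merge_sharable_nodes(parents, retained, removed):
    """Redirect the parents of removed nodes to the retained node.

    `parents` maps each node to the list of its parent nodes in a toy
    DAG representation.  After the merge, every parent of a removed node
    consumes the retained node's output instead, so the retained
    equivalence node ends up with multiple parents.
    """
    for node in removed:
        for p in parents.pop(node, []):
            if p not in parents[retained]:
                parents[retained].append(p)
    return parents
```

For instance, if nodes `a` and `b` are sharable and `b` is removed, `b`'s former parents now read `a`'s output.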

c. Estimate of numuses(e)

When a focal plan is identified, the cost of each node and the number of times it is used in the plan, count, are established. If a node is marked, it means that it has sharable partners in other plans. The sharable nodes are searched for in the plans that share at least one node with the focal plan. Each time a sharable node is found, count is incremented by one, and the node is assumed to be of the same cost as the node in the focal plan. The test for materialization used in the Volcano-SH is then applied using the new estimated cost and the node count in the pseudo-rooted plan.


d. Direction of Optimization

In this algorithm, we choose nodes to materialize from the high-order nodes downwards. In this way, if say two nodes are sharable and one of them has to be removed, the sample space of subsequent searches is reduced and the cost of computing already catered-for nodes is saved.

e. Elimination of Repetitive Search

We already established that it is not worthwhile to search between Qi and Qj and then search between Qj and Qi, as this is a repetition. Since we search plans in decreasing order of their popularity, any plan is searched only if the focal plan has a higher popularity; this eliminates duplicate searches which are already catered for by previous searches.

f. Disjoint Plans Identification

If a plan has zero popularity, then it will not participate in the inter-query search since it is unique. This helps to save resources that would be spent searching for non-existent partner nodes.

4.3.2 The Optimizing Algorithm

The new algorithm inputs the DAG made up of the Basic Volcano plans for each query. The inter-query shareability information and individual query popularities are got from the query sharing matrix M. In this algorithm, plans are assigned focal


roles in order of decreasing popularity. Searching for sharable partners for any marked node in a focal plan is done in the stable marriage preference order of the less popular plans. Searching starts from higher-order nodes, and plans of zero popularity are not assigned focal roles.


S = set of plans that make up the virtual DAG, in decreasing
    order of their popularity
focalPlan = first plan in S
repeat
  establish node cost and numuses for each node in focalPlan
  S* = subset of S that share at least a node with the focalPlan query
  candidateOrder = highest order of focalPlan
  repeat
    e = node in candidateOrder
    if(e is marked)
      repeat
        traverse S* searching for marked equivalent nodes of
        the same order as, and sharable with, e
        increment numuses(e) whenever a sharable node is met
      until(S* is exhausted)
      if(sharableCondition(e))
        choose which node is to be materialized and add it to
        the materializable set
        remove the rest
        update the DAG for the materialized node to cater
        for the removed nodes' parents
        unmark the chosen node
      endif
    else
      if(sharableCondition(e))
        add e to the materializable set
      endif
    endif
  until(nodes in candidateOrder are exhausted)
  focalPlan = next(focalPlan)
until(plans of non-zero popularity are exhausted)
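Under heavy simplification (plans reduced to their popularity scores, with all the per-node work abstracted away), the focal-role ordering used by the algorithm can be sketched as:

```python
def focal_order(popularity):
    """Order in which plans receive focal roles.

    `popularity` is a hypothetical list: popularity[i] is the row (or
    column) sum of plan i in the sharing matrix M.  Plans are processed
    in decreasing popularity, and zero-popularity (disjoint) plans are
    never assigned focal roles.
    """
    candidates = [i for i, p in enumerate(popularity) if p > 0]
    return sorted(candidates, key=lambda i: popularity[i], reverse=True)
```

For four plans with popularities 3, 0, 5 and 2, the focal order is plan 2, then plan 0, then plan 3; plan 1 is skipped entirely.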

4.3.3 Benefits of the new Algorithm

1. Better estimation of numuses−(e)

The algorithm uses the exact number of times the node exists in the focal plan


and increments it by one whenever a sharable node is found outside the focal plan. The actual number of uses may be greater if the non-focal-plan nodes are used multiple times in their parent plans. Therefore numuses−(e) ≤ numuses(e) ∀e.

2. Elimination of null searches

The sharing matrix has details of the extent of sharing for any pair of plans. If the entry for a pair is zero, then we need not search for shareability between them. If the popularity of a certain query is zero, then its plan is not involved in the search at all.

3. DAG trimming

If we have, say, three sharable nodes of order five, and it is decided that one has to be materialized and the rest of the plans use it, then it is not worthwhile to process the other nodes, since the ultimate goal is to get the root of the tree. This algorithm removes the sub-DAGs whose output can be got from common/sharable nodes, so that such children do not enter the optimization process, since their output is catered for.

4. Optimal order of optimization

Since the strategy eliminates catered-for sub-DAGs, it is better if it does so as early as possible, so that the subsequent search space is reduced without affecting the outcome. Starting with the most popular query does this. It saves time, memory and processor cycles.


Chapter 5

Discussion and Future Work

5.1 Discussion

In this dissertation, we studied the strategies and approaches to query optimization. The duty of the query optimizer is to establish the most cost-effective execution plan for a query. This has to be done within several limitations, notable among which are:

(a) Time.

The process should be timely. Exhaustive searches that take a lot of time have either to be eliminated or improved to make intelligent deductions so that they have a low runtime.


(b) Form and Content.

The transformation should have no effect whatsoever on the form and content of the request made by the user.

(c) Net Savings.

The optimization process should not be a mere transfer of resources from executing a complex form of a query to searching for a cheap form of the query and executing it. It has to make substantial savings.

We therefore examined existing approaches, studied how they order queries for optimization, how they optimize, and how they exploit the geometry of query plan representations (trees, DAGs and AND-OR DAGs) to make the scheme more cost effective.

We then proposed a greedy search algorithm that traverses the composite plan in search of common sub-expressions. We used the geometry of the plan representation (AND-OR DAG) to propose improvements to the greedy algorithm so that the searching is more intelligent: it searches equally exhaustively but with fewer operations and in less time. We summarised the shareability in a query sharing matrix.

We also proposed an optimization algorithm that exploits the extents of shareability in an optimal order so as to minimize null searches for common sub-expressions. This way we reduce runtime and increase efficiency. We also propose trimming of catered-for


sub-DAGs so as to eliminate catered-for nodes and reduce the sample space of subsequent searches.

Lastly, we make comparisons of the proposed optimization schemes with the existing ones.

5.2 Future Work

This research suggests some future research work in the field of query optimization:

(a) Maintenance of the Sharing Matrix.

The sharing matrix is used to get the order of assigning focal roles among the plans during optimization, and for each focal plan we get the order in which the plan should exploit shareability. We however trim plans whenever nodes get catered for due to shareability. This trimming takes away all the children of the catered-for node. Some of these children may be sharable elsewhere, hence removing them leads to an exaggerated sharing matrix. There is need for research on how the matrix can be cost-effectively updated whenever we trim the composite DAG.

(b) Incorporating Pipelining.

Multi-query optimization is materialization intensive, and the current schedules of materialization use DAGs, not AND-OR DAGs. Besides, they are too strict on the qualifications for pipelining. An AND-OR DAG can use the equivalence nodes as staging areas for intermediate results and hence can work on a less strict schedule. This, however, may lead to filling up the memory. There is therefore a need to incorporate pipelining and memory management in AND-OR DAG structured plans so that materialization costs are reduced.

(c) Numerical Evaluation.

The evaluation done on the proposed algorithms in this dissertation is not by coding but rather by comparison at the algorithm level. There is need to code the proposed algorithms and run them alongside the existing ones (such as Basic Volcano, Volcano-RU and Volcano-SH) on similar data and similar hardware. This will give a runtime comparison with the existing schemes.


References

[1] Chaudhuri, S. and Kyuseok, S. (1999). Optimization of queries with user defined predicates. ACM Transactions on Database Systems, Vol. 24, No. 2, pages 177-228.

[2] Connolly, T. and Begg, C. (2001). Database Systems: A Practical Approach to Design, Implementation and Management, Third Edition. Addison-Wesley.

[3] Cosar, A., Lim, E. and Jaideep, S. (2001). Multiple query optimization with depth-first branch-and-bound and dynamic query ordering. International Conference on Information and Knowledge Management.

[4] Elmasri, R. and Navathe, B.S. (1994). Fundamentals of Database Systems, Second Edition. Benjamin Cummings.

[5] Fegaras, L. (1998). A New Heuristic for Optimising Large Queries. Research paper, Department of Computer Science and Engineering, The University of Texas at Arlington.


[6] Graefe, G. and DeWitt, D.J. (1987). The EXODUS Optimizer Generator. ACM SIGMOD Record, Volume 16, Issue 3, pages 160-172.

[7] Graefe, G. and McKenna, W.J. (1991). Extensibility and Search Efficiency in the Volcano Optimizer Generator. Technical Report CU-CS-91-563, University of Colorado.

[8] Gupta, A., Sudarshan, S. and Viswanathan, S. (2001). Query Scheduling in Multi-query Optimization. Research paper, Indian Institute of Technology, Bombay.

[9] Horowitz, E., Sahni, S. and Rajasekaran, S. (1996). Computer Algorithms in C++. Computer Science Press.

[10] Hurigeri, S., Seshadri, S. and Sudarshan, S. (2001). Memory Cognizant Query Optimization. Research paper, Indian Institute of Technology, Bombay.

[11] Johnson, J.L. (1997). Database: Models, Languages, Design. Oxford University Press.

[12] Kremer, M. and Gryz, J. (1999). A Survey of Query Optimization in Parallel Databases. Technical Report CS-1999-04, Department of Computer Science, York University.

[13] Kroger, J., Stefan, P. and Heuer, A. (2001). Query Optimisation: On the Ordering of Rules. Research paper, Cost- and Rule-based Optimisation of Object-oriented Queries (CROQUE) project, University of Rostock and University of Hamburg, Germany.

[14] Kyuseok, S., Sellis, T.K. and Nau, D. (1994). Improvements on a Heuristic Algorithm for Multiple-query Optimization. Technical report, Department of Computer Science, University of Maryland.

[15] Kyuseok, S. (1993). Advanced Query Optimization Techniques for Relational Database Systems. PhD dissertation, University of Maryland.

[16] Nilesh, N.V., Sumit, K.S., Roy, P. and Sudarshan, S. (2001). Pipelining in Multi-query Optimisation. Research paper, Indian Institute of Technology, Bombay.

[17] Park, P. and Segar, A. (1988). Using Common Sub-expressions to Optimise Multiple Queries. Proceedings of the IEEE International Conference on Data Engineering.

[18] Ramakrishnan, R. and Gehrke, J. (2000). Database Management Systems, Third Edition. McGraw-Hill.

[19] Rao, J. and Ross, R.K. (2000). Power Pipelining for Enhanced Query Performance. Technical Report CUCS-007-00, Columbia University.

[20] Roy, P., Seshadri, S., Sudarshan, S. and Bhobe, S. (2001). Efficient and Extensible Algorithms for Multi-query Optimisation. Research paper, SIGMOD International Conference on Management of Data.


[21] Roy, P., Seshadri, S., Sudarshan, S. and Bhobe, S. (1998). Practical Algorithms for Multi-query Optimisation. Technical report, Indian Institute of Technology, Bombay.

[22] Sallis, L. (1998). A New Heuristic for Optimising Large Queries. Research paper, Department of Computer Science and Engineering, The University of Texas at Arlington.

[23] Sellis, T.K. and Gosh, S. (1990). On the Multi-query Optimization Problem. IEEE Transactions on Knowledge and Data Engineering, pages 262-266.

[24] Urhan, T. and Franklin, M.J. (2001). Dynamic Pipelining Scheduling for Improving Interactive Query Performance. In Proceedings of the 27th Very Large Databases Conference, Roma, Italy.


Appendix A: Paper
