On Query Optimization in Relational Databases
by
John Ngubiri
PGDCS(Mak), BSc/Ed(Mak)
ngubiri@ics.mak.ac.ug, 071921969
A Dissertation Submitted in Partial Fulfillment of the Requirements
for the Award of a Degree of Master of Science in Computer Science
of Makerere University
May, 2004
Declaration
I, John Ngubiri, do hereby declare that this dissertation is my original work and has
never been submitted for any award of a degree in any institution of higher learning.
Where quotations have been used, they are acknowledged through references.
Signed .....................................Date......................................
John Ngubiri
(2002/HD18/423U)
Candidate.
Approval
I certify that this is the original work of John Ngubiri and has been done under my
supervision. The work has never been submitted for any award of a degree in any
Institution of higher learning.
Signed......................................Date.......................................
Dr. Venansius Baryamureeba, Ph.D
Supervisor.
Dedication
This Dissertation is dedicated to:
• The two ladies in my life: my mother and my fiancée Eve.
• The two gentlemen in my life: my father and my son Matthew.
Acknowledgments
My sincere appreciation goes to Dr. Venansius Baryamureeba, Director, Institute of Computer Science, and my supervisor, for all his advice, guidance and encouragement. Without you, this dissertation would not be the way it is now. Most likely, it would not be there!
I would also like to thank the staff at the Institute of Computer Science, Makerere University, with whom I associate and hence grow academically. Notable on the list are Mr Habibu Atib, for the very initial discussions; Dr Vincent Ssembatya (Department of Mathematics), for the algorithm design background; and Ms Josephine Nabukenya, for technical writing assistance.
To you all, I say Thank you.
John Ngubiri - June 2004.
Abstract
Query optimization is an important process in relational databases. With the load on databases increasing, optimizing queries in batches is a promising way forward. Studies have shown that sharing of common sub-expressions can extend even beyond the optimal plans of the constituent queries. But challenges of an excessively large sample space, buffer management, establishment of the optimal order of optimization, and identification of disjoint queries remain in place. In this dissertation, we propose how to efficiently establish the extent of inter-query shareability and exploit it so as to compute common sub-expressions once and share the output among the queries. We also propose the optimal order of optimization so that the sharing is done in a more cost-saving and time-conserving manner.
List of Figures
1.1 The stages of query processing . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Query Tree Representation . . . . . . . . . . . . . . . . . . . . . . . . 16
1.3 The optimal plan of the query . . . . . . . . . . . . . . . . . . . . . . 17
1.4 Conceptual Framework . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.1 The different tree configurations: bushy, complex deep, left deep and
right deep trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2 A pipeline Schedule . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.3 Random nature of the II algorithm . . . . . . . . . . . . . . . . . . . 41
2.4 Query representation: Tree, DAG and extended DAG . . . . . . . . . 46
Symbols
Symbol Equivalent Relational Algebra Operation
σ Select
Π Project
× Cartesian Product
⋈ Join
θ A comparison operator
Contents
1 Introduction 1
1.1 Background to Databases . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 The Relational Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Query Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.1 Systematic Query Optimization . . . . . . . . . . . . . . . . . 9
1.4.2 Heuristic Query Optimization . . . . . . . . . . . . . . . . . . 10
1.4.3 Semantic Query Optimization . . . . . . . . . . . . . . . . . . 11
1.5 Single Query Heuristic Optimization . . . . . . . . . . . . . . . . . . 13
1.5.1 The Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5.2 Challenges to Single Query Optimizers . . . . . . . . . . . . . 18
1.6 Multi-Query Optimization . . . . . . . . . . . . . . . . . . . . . . . . 18
1.6.1 The Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.6.2 Challenges of Multi-Query Optimizers . . . . . . . . . . . . . 19
1.7 Statement of the Problem . . . . . . . . . . . . . . . . . . . . . . . . 21
1.8 Justification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.9 Objectives: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.10 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.11 Conceptual Framework . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.12 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.13 Dissertation Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2 Literature Review 27
2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2 Approaches to Query Optimization . . . . . . . . . . . . . . . . . . . 30
2.2.1 Single Query Optimization . . . . . . . . . . . . . . . . . . . . 30
2.2.2 Multi-Query Optimization (MQO) . . . . . . . . . . . . . . . 32
2.3 The Effect of Pipelining . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.4 Single Query Optimization Algorithms . . . . . . . . . . . . . . . . . 38
2.4.1 Dynamic Programming Algorithms . . . . . . . . . . . . . . . 38
2.4.2 Randomized Algorithms . . . . . . . . . . . . . . . . . . . . . 39
2.5 Multi-Query Optimization Algorithms . . . . . . . . . . . . . . . . . 43
2.5.1 The MQO Problem . . . . . . . . . . . . . . . . . . . . . . . . 44
2.5.2 Reuse Based Optimization . . . . . . . . . . . . . . . . . . . . 46
2.6 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3 Query Shareability Establishment 55
3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.2 The Shareability Problem . . . . . . . . . . . . . . . . . . . . . . . . 58
3.3 Previous Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.4 The Greedy Search Algorithm . . . . . . . . . . . . . . . . . . . . . . 61
3.5 Improvements to the Greedy Algorithm . . . . . . . . . . . . . . . . . 63
3.6 The Improved Greedy Searching Algorithm . . . . . . . . . . . . . . . 65
4 Optimizing the Traversed Plans 67
4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2.1 The Basic Volcano Algorithm . . . . . . . . . . . . . . . . . . 70
4.2.2 The Volcano-SH . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2.3 The Volcano-RU . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3 The Proposed Optimizing Algorithm . . . . . . . . . . . . . . . . . . 75
4.3.1 The Background . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.3.2 The Optimizing Algorithm . . . . . . . . . . . . . . . . . . . . 77
4.3.3 Benefits of the new Algorithm . . . . . . . . . . . . . . . . . . 79
5 Discussion and Future Work 81
5.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Chapter 1
Introduction
1.1 Background to Databases

Right from the early times of programming, programmers relied on well-organized
data structures to simplify programming tasks [11]. Though it is normal to include variables in program code, it is nearly impossible to include large data sets. Including data in code implies that the code owns the data. The code therefore has to provide facilities for other programs to access that data when required. This brings in programming hardships, since data is normally subject to many processes, like insertion, update and deletion, which do not necessarily take place simultaneously. The hardships are more prominent in cases where a program has to access data from another program. For a process to allow access to its data, it must itself be running, since data cannot be got from a dormant program. If the data owner is not running, it has to initiate some dummy operations, hence wasting processor cycles and memory. Programs owning data come with many programming bottlenecks, and the most outstanding ones, according to Johnson [11], are:-
• Unhealthy dependence between data and programs;
• Repetition of data elements;
• Opportunities for inconsistencies;
• Unorganized scattering of related data across many files;
• Distributed ownership of data;
• Decentralized security of data;
• Unregulated interactions between programs using the same data;
• Inadequate multiuser access;
• Ad hoc approach to error recovery; and
• Over-dependence on physical considerations such as disc track and sector addresses.
During the 1950s and 1960s, the above bottlenecks led to the development of database systems, so that data could be independent and application programs could simply access it. With the development of databases, data and application software became independent but interacting components. A database is a collection of logically related data, and a Database Management System (DBMS) is a software product that helps in defining, creating, maintaining and controlling access to a database. DBMSs are grouped according to the model of their development. A database model is an organizing principle that specifies particular mechanisms for data storage and retrieval. There are five database models, namely the Hierarchical, Network, Relational, Object and Deductive models; the Relational model being the most popular of all.
The Hierarchical and Network models, which were developed before the Relational model, are referred to as the pre-Relational models, while the Object and Deductive models, which were developed after it, are referred to as the post-Relational models.
1.2 The Relational Model
The Relational model is the most popular database model currently in use. It uses tables (relations) to organize the data elements stored. A relation represents an application entity, while a row in the relation represents an instance of the entity. It replaced the Hierarchical model (whose organizing principle was the tree-structured organization of data) and the Network model (whose organizing principle was the graphical representation of data) [11]. It was because of the popularity of the Relational model that the Hierarchical and Network models were phased out. Johnson [11] attributes the popularity of the Relational model to:-
i The existence of a standard, easy and flexible querying language called Structured Query Language (SQL), which is universal to all Relational Database Management Systems (RDBMSs);
ii The existence of a simple data structure (relations) which makes it easily
understandable even by non-technical users;
iii The existence of a strong mathematical base in the model (Relational Algebra
and Relational Calculus) for its operation.
Ramakrishnan and Gehrke [18], however, observe that the Relational model has two strong setbacks, which are:-
i Performance:
A single SQL query can be written in many different ways, each of which can have a different cost. Most of these formulations are so expensive that executing them degrades the computer system and greatly slows the rate of output generation.
ii Flatness:
The Relational model relies on inbuilt primitive data types, yet data in real-life situations is becoming increasingly complex. Representing complex fields as a set of primitive attributes leads to too many fields, often with null entries.
Motivated by the above weaknesses, two post-Relational models emerged: the Object Oriented model and the Deductive model. The Object Oriented model represents an entity as a class and an entity occurrence as an object. It is based on the three principles of Object Oriented Programming, which are encapsulation, inheritance and polymorphism. It gives the programmer freedom to create classes as dictated by the real-life situation and the way he envisages the entities being modeled. This not only makes the database more problem-focused, but also eases reuse. It is suitable for creating complex systems. The Deductive model (also called the Inferential model) aims at storing minimal data in combinations called axioms. It allows unstored combinations to be deduced from the existing ones. For example, if the axiom for a student being a member of a class is Member(studentName, class), the individual students' memberships can be represented as Member("Mark", "FormTwo") and Member("Matthew", "FormTwo"). An axiom like ClassMates(class, studentList) is not stored but is deduced from the Member axiom.
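The deduction of ClassMates from the stored Member axioms can be sketched in a few lines (a hypothetical illustration of the idea, not part of any DBMS):

```python
from collections import defaultdict

# Stored axioms: Member(studentName, class)
member_facts = [
    ("Mark", "FormTwo"),
    ("Matthew", "FormTwo"),
]

def class_mates(facts):
    """Derive ClassMates(class, studentList) on demand instead of storing it."""
    classes = defaultdict(list)
    for student, cls in facts:
        classes[cls].append(student)
    return dict(classes)

print(class_mates(member_facts))   # → {'FormTwo': ['Mark', 'Matthew']}
```

Nothing beyond the Member facts is stored; the ClassMates combination is recomputed from them whenever it is requested.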
Intensive research took place to address the limited data type and query performance problems. The limited data type problem was addressed by incorporating object oriented characteristics into the Relational model. This led to the introduction of a hybrid model, the Object-Relational model. DBMSs like Oracle and PostgreSQL have object oriented characteristics. The inefficient queries problem was addressed by passing relational (and object-relational) queries through a series of steps between the query source and the physical data storage level so as to have them execute in a cost effective manner. This process is called Query Processing.
1.3 Query Processing
This is the process of transforming a high level query into a plan that executes and retrieves data from the database (Figure 1.1). It involves four phases, which are Query decomposition, Query optimization, Code generation and Run-time query execution [2].
(a) Query decomposition
In this phase, a query is checked as to whether or not it conforms to the syntax of the language used (mostly SQL), and in case it does not, an error message is generated. If the query conforms to the syntax, it is broken into small pieces and represented as an equivalent relational algebra expression (parsing). A system catalog is used to cross-check the consistency of the query with the schema.
(b) Query optimization
In this phase, the best execution plan is generated. This is done by taking into account the resources required to execute the query as well as the resources required to get the plan. Database statistics are used to make appropriate decisions.
(c) Code generation
After the optimizer has got the best execution plan, the code generator creates the code equivalent to the plan. This is sent to the internal level of the database (in ANSI-SPARC architecture terms) for execution.
(d) Query execution
In this phase, the code interacts with the database and retrieves the data for the consumption of the process or individual who sent the query.
Figure 1.1: The stages of query processing
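The four phases run inside the DBMS, but Python's built-in sqlite3 module lets two of them be observed in miniature (this example is our own illustration, not from the text): decomposition rejects a malformed query before anything executes, and EXPLAIN QUERY PLAN reports the plan the optimizer chose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE client (clientNo TEXT PRIMARY KEY, prefType TEXT)")

# (a) Decomposition: a syntax error is caught before any data is touched.
try:
    conn.execute("SELEC * FROM client")            # malformed keyword
except sqlite3.OperationalError as err:
    print("rejected at decomposition:", err)

# (b) Optimization: the plan chosen for a valid query.
for row in conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM client WHERE clientNo = 'CR76'"):
    print(row)
```

For the valid query, SQLite reports a search on the primary-key index rather than a full scan — the optimizer's decision, made before any code is executed.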
The query processing activity therefore acts as an interface between the querying individual/process and the database. It relieves the querying individual/process of the burden of deciding the best execution strategy. So while the querying individual/process specifies what, the query processor determines how [4].
In query processing, the least straightforward stage is Query Optimization. It is the efficiency of the query optimizer that determines how many resources are used, and it is a measure of how suitable the DBMS is for critical and real-time applications.
1.4 Query Optimization
Query Optimization is the process of choosing an efficient execution strategy for executing a query [2], and it is one of the most important tasks of any RDBMS. Ramakrishnan and Gehrke [18] observe that SQL, which is the de facto standard for data definition and data manipulation in RDBMSs, offers a variety of ways in which a user can express, and therefore a system can evaluate, a query. The query optimizer therefore is responsible for finding the best execution strategy so that fewer resources are used to retrieve the data.
There are three main approaches to query optimization. These are Systematic,
Heuristic and Semantic query optimization.
1.4.1 Systematic Query Optimization
In systematic query optimization, the system estimates the cost of every plan and then chooses the best one. The best-cost plan is not always universal, since it depends on the constraints put on the data. For example, joining on a primary key may be done more easily than joining on a foreign key: since primary keys are always unique, after getting a joining partner no other match is expected, so the system breaks out of the loop and hence does not scan the whole table. Though in many cases efficient, costing every plan is a time-wasting practice, and therefore it can sometimes be done away with [4].
The costs considered in systematic query optimization include the access cost to secondary storage, storage cost, computation cost for intermediate relations, and communication costs. The importance attached to these costs depends on the type of database. For example, for large databases, emphasis is put on minimizing access cost to storage and memory usage. For small databases, however, where outputs can be stored in memory, emphasis is put on minimizing computational cost. In distributed databases, on the other hand, where many sites are involved, communication cost is of paramount importance and has to be minimized, since it normally involves the costs of channel coding and security coding, as well as other network-related limitations like bandwidth and noise.
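The idea of costing every plan can be shown with a toy sketch (the relations, cardinalities and cost model below are our own assumptions; a real optimizer uses database statistics): enumerate the join orders of three relations, estimate a cost for each, and keep the cheapest.

```python
from itertools import permutations

# Assumed cardinalities of three relations (invented for illustration).
cardinality = {"A": 1000, "B": 10, "C": 100}

def plan_cost(order):
    """Estimated cost of joining the relations left-to-right.

    Assumed cost model: a pairwise join costs the product of its input
    cardinalities; output size is crudely estimated as the larger input.
    """
    cost, size = 0, cardinality[order[0]]
    for rel in order[1:]:
        cost += size * cardinality[rel]
        size = max(size, cardinality[rel])
    return cost

plans = {p: plan_cost(p) for p in permutations("ABC")}
best = min(plans, key=plans.get)
print(best, plans[best])   # the cheapest join order found
```

Even on three relations the orders differ markedly in estimated cost; joining the two small relations first is the cheapest under this model, which is exactly the kind of decision a systematic optimizer makes from statistics.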
1.4.2 Heuristic Query Optimization
In the heuristic approach, the operators are ordered in a manner that economizes resource usage while conserving the form and content of the query output. The principal aims are to:
(i) Keep the size of the intermediate relations to a minimum, and increase the rate at which the intermediate relation size tends towards that of the final relation, so as to optimize memory usage.
(ii) Minimize the amount of processing that has to be done on the data without affecting the output.
Connolly and Begg [2] state five main rules which are used in heuristic query optimization:-
(a) Perform selection operations as early as possible. This reduces the cardinality of the intermediate relation, hence reducing the resources used to process a column as well as the memory occupied per column.
(b) Combine a Cartesian product with a subsequent selection operation whose predicate represents a join condition into a join operation, i.e. σ_{R.a θ S.b}(R × S) = R ⋈_{R.a θ S.b} S. Elmasri and Navathe [4] observe that this reduces the complexity of the joining algorithm (one of the most expensive operations in data retrieval), for example in cases where the individual relations are first sorted on the joining fields.
(c) Use the associativity of binary operations to rearrange the query so that the most restrictive selection operation is done first. This increases the rate at which the intermediate relation size tends to the final relation size, hence minimizing the memory occupied and the resources required to process a column.
(d) Perform projection operations as early as possible. This reduces the order of
the intermediate relation. It therefore reduces the memory occupied by the
relation together with the amount of resources required to process a row.
(e) Compute common expressions once. If a certain expression appears more than once and is not too large, it is kept in memory so that when it is required again it is reused. In case the expression is too big to fit in memory, it can be stored on disk and later retrieved when wanted, so long as the cost of retrieval is not greater than the cost of recomputing it.
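Rule (a) can be illustrated with a small sketch on invented data: performing the selection before the join shrinks the intermediate relation by a factor of ten, while the final answer stays the same.

```python
# Invented relations: 1000 employees, 2 departments.
employees = [(i, "Sales" if i % 10 == 0 else "Other") for i in range(1000)]
depts = [("Sales", "Kampala"), ("Other", "Entebbe")]

# Naive order: join first, select afterwards (large intermediate relation).
joined = [(e, d) for e in employees for d in depts if e[1] == d[0]]
late = [row for row in joined if row[0][1] == "Sales"]

# Heuristic order: select first, then join the much smaller input.
sales = [e for e in employees if e[1] == "Sales"]
early = [(e, d) for e in sales for d in depts if e[1] == d[0]]

print(len(joined), len(early))          # intermediate sizes: 1000 vs 100
assert sorted(late) == sorted(early)    # same answer either way
```

The selection's output replaces a 1000-row intermediate relation with a 100-row one before the join is ever attempted, which is precisely the memory and processing saving the rule describes.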
1.4.3 Semantic Query Optimization
This is a combination of the Heuristic and Systematic approaches. The constraints specified in the database schema can be used to modify the procedures of the heuristic rules, making the optimal plan selection highly creative. This leads to heuristic rules that are locally valid, though they cannot be taken as rules of thumb. For example, consider a query such as

SELECT Employee.lname, Supervisor.lname
FROM Employee, Supervisor
WHERE Employee.supervisorNo = Supervisor.No
AND Employee.salary >= Supervisor.salary

The condition is a very unlikely event, and its impossibility is likely to be expressed directly or indirectly in the database constraints. The database restrictions may be of the form CHECK Employee.salary BETWEEN (S1, S2) and CHECK Supervisor.salary BETWEEN (S3, S4), where S3 > S2. This shows that a supervisor can never earn less than an employee, and therefore the query yields no results.
A heuristic optimizer would go ahead and parse, optimize and execute the query, resulting in no output, which is a worst-case scenario [9]. A semantic optimizer would recognize this by use of the constraints, respond "Empty set", and save the resources.
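The semantic check above can be sketched as follows (the constraint bounds S1..S4 are invented for illustration): the optimizer compares the constraint ranges and answers "Empty set" without reading any data.

```python
# CHECK constraints, as assumed bounds:
# S1 <= Employee.salary <= S2 and S3 <= Supervisor.salary <= S4, with S3 > S2.
EMP_RANGE = (100, 500)   # (S1, S2)
SUP_RANGE = (600, 900)   # (S3, S4)

def semantically_empty(emp_range, sup_range):
    """True if Employee.salary >= Supervisor.salary can never hold."""
    emp_max, sup_min = emp_range[1], sup_range[0]
    return emp_max < sup_min

if semantically_empty(EMP_RANGE, SUP_RANGE):
    print("Empty set")   # answered from the constraints alone; no data touched
```

The cost of the check is a single comparison of two constants, against the full parse-optimize-execute cycle a purely heuristic optimizer would spend to produce the same empty result.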
1.5 Single Query Heuristic Optimization
1.5.1 The Process
When a single syntactically correct query is broken down and expressed as a relational algebra expression, a query tree is created [2][4], with the intermediate operations as nodes, the source relations as leaves and the output as root. For example, given the table components below:-
Client
Field    | Data type                         | Description
clientNo | four fixed characters             | Client identity number, primary key
name     | twenty-five variable characters   | Client name
prefType | ten variable characters           | Property type of preference
maxRent  | float rounded to 2 decimal places | Maximum rent affordable by the client

Property
Field      | Data type                         | Description
propertyNo | five variable characters          | Property identification number, primary key
street     | twelve variable characters        | Street where the property is located
rent       | float rounded to 2 decimal places | Monthly rent of the property
ownerNo    | four fixed characters             | Identification number of the owner, foreign key

Viewing
Field      | Data type             | Description
propertyNo | five fixed characters | Property identification number, partial primary key
clientNo   | four fixed characters | Identification number of the viewing client, partial primary key
and with a query

SELECT p.propertyNo, p.street
FROM client c, viewing v, propertyForRent p
WHERE c.prefType = 'Flat' AND c.clientNo = v.clientNo AND v.propertyNo = p.propertyNo
AND c.maxRent >= p.rent AND c.prefType = p.type AND p.ownerNo = 'C093';
In the process of getting a suitable plan, it is decomposed into a relational algebra expression:

π_{p.propertyNo, p.street}(σ_{c.prefType='Flat' ∧ c.clientNo=v.clientNo ∧ v.propertyNo=p.propertyNo ∧ c.maxRent≥p.rent ∧ c.prefType=p.type ∧ p.ownerNo='C093'}((C × V) × P))

which is expressed as a query tree in Figure 1.2.
The plan formed in this way is in most cases not optimal. The optimizer therefore looks for a more cost effective plan, either by using the heuristic rules to adjust the plan into a cost effective one, or by looking for the most effective plan among the many available. In this case, the optimal tree is shown in Figure 1.3.
Figure 1.2: Query Tree Representation
Figure 1.3: The optimal plan of the query
It is this strategy which is sent to the code generator for code generation and
data fetching.
1.5.2 Challenges to Single Query Optimizers
Since single query optimizers handle one query at a time, they are unsuitable for handling a high traffic of queries. Since they work on a large sample space, they cannot be exhaustive in nature, as that would be expensive and time-wasting. The algorithm therefore has to adopt a search strategy such that a thorough search is done and the optimal plan is got, without traversing all the options, in an acceptably small interval of time.
1.6 Multi-Query Optimization
1.6.1 The Process
In multi-query optimization, queries are optimized and executed in batches. Individual queries are transformed into relational algebra expressions and are represented as graphs [13, 16, 20]. According to Roy et al [20], the graphs are created in such a way that:
(i) Common sub-expressions can be detected and unified;
(ii) Related sub-expressions are identified, so that the more encompassing sub-expression is executed and the other sub-expressions are derived from it. For example, if we have σ_{A≤5}(E) and σ_{A≤10}(E), then σ_{A≤10}(E) is executed and σ_{A≤5}(E) is derived from it.
The optimizer therefore concentrates on:-
(i) Identifying the common expressions, so that database accesses and computational costs are minimized;
(ii) Identifying groups of sub-expressions where internal derivations are possible, so that derivations can be made and database access is minimized;
(iii) Getting a search strategy so that the search is not too long.
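The derivation idea — executing σ_{A≤10}(E) once and deriving σ_{A≤5}(E) from its result — can be sketched on an invented relation E:

```python
# Invented relation E with a single attribute A.
E = [{"A": a} for a in range(1, 21)]

wide = [t for t in E if t["A"] <= 10]       # σ_{A<=10}(E): the one scan of E
narrow = [t for t in wide if t["A"] <= 5]   # σ_{A<=5}(E) derived from the cache

# The derived result matches what a direct scan of E would give.
assert narrow == [t for t in E if t["A"] <= 5]
print(len(wide), len(narrow))               # → 10 5
```

The base relation is scanned once; the narrower selection runs over the much smaller cached result, which is where the database-access saving comes from.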
1.6.2 Challenges of Multi-Query Optimizers
The generation of optimal global queries is not necessarily done on the individual optimal plans; it is done on the group of them. This leads to a large sample space from which the composite optimal plan has to be got. For example, suppose that for four relations A, B, C and D there are two queries whose optimal states are Q1 = (A ⋈ B) ⋈ C and Q2 = (B ⋈ C) ⋈ D, with execution costs ε1 and ε2 respectively; the total cost is ε1 + ε2. Though these queries are individually optimal, the sum is not necessarily optimal. The query Q1, for example, can be rearranged to an individually non-optimal state Q' = A ⋈ (B ⋈ C) whose cost, say ε, is greater than ε1. The combination of Q' and Q2 may make a more optimal plan globally at run time in case they cooperate: since there is a common expression (B ⋈ C), it can be executed once and the result shared by the two queries. This leads to a cost of ε + ε2 − ξ, where ξ is the cost of evaluating (B ⋈ C), which can be less than ε1 + ε2. If sharing were tried on the individual optimal plans, it would be impossible, and the saving opportunity would be lost. To achieve cost savings using sharing, both optimal and non-optimal plans are therefore needed, so that the sharing possibilities are fully explored. This, however, increases the sample space for the search, and hence the search cost. The search strategy therefore needs to be efficient enough to be cost effective. Though this approach can lead to a lot of improvement in the efficiency of a query, it still has some bottlenecks that have to be overcome if a globally optimal state is always to be achieved. The bottlenecks include:-
(a) The cost of Q' may be so high that the sum of the independent optimal states is still the global optimal state;
(b) There may be no possibility at all of sharable components, and therefore a search for sharable components is a waste of resources;
(c) The new query plans may have a lower resource requirement than the previous ones, but the resources taken to identify them (the search cost) may exceed the saving, leaving no net saving of resources. This is very likely given the large sample space.
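Plugging invented numbers into the example above shows how sharing (B ⋈ C) can beat the sum of the individually optimal plans:

```python
# Invented costs for the Q1/Q2 example; only the relationships matter.
e1, e2 = 100, 120       # costs of the individually optimal plans of Q1 and Q2
e_prime = 110           # cost of the non-optimal rewrite Q' = A join (B join C)
xi = 60                 # cost of evaluating the shared (B join C) once

independent = e1 + e2            # no sharing between the optimal plans
shared = e_prime + e2 - xi       # Q' and Q2 share (B join C)

print(independent, shared)       # → 220 170
assert shared < independent      # the global optimum uses a non-optimal Q1
```

Note that the saving exists only because the individually worse plan Q' was kept in the search space; had the optimizer discarded everything but the per-query optima, the shared sub-expression would never have been found.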
1.7 Statement of the Problem
Parallel real-time demands are putting a big load on relational query optimizers, especially in Decision Support and Business Intelligence tools, remote access architectures like databases on networks, as well as fragmented architectures like distributed databases. This leads to overworking of hardware, increasing the rate of wear and tear and hence the maintenance costs. Multi-query optimizers identify common sub-expressions, execute them once and share the outputs among the queries. This leads to significant performance gains [22]. Current optimization approaches, however, make no effort to find out whether the queries in a batch really have common sub-expressions, and neither do they have an intelligent way of predicting the absence or exhaustion of common sub-expressions. Their cost-saving abilities are therefore not assured.
1.8 Justification
Though relational database optimizers are currently available, with a lot of effort spent on them (an average relational database optimizer represents 50-60 man-years of development effort [18]), they are highly inefficient in an algorithmic sense. For example, commonly used bottom-up tree optimizers are O(2^n) in the worst case [5]. There is therefore a need for further studies since:
(a) The saving made depends on how many common sub-expressions are present and how easy it is to find them. We therefore need to establish to what extent queries are similar. We also need to detect cases of disjoint queries.
(b) Queries come from different sources and at different times; the similarities between them are therefore unpredictable. There is a need to study how they can be scheduled so that they are executed in an optimal order. Though some of the current algorithms [20] acknowledge the importance of order, they do not attempt to systematically get the optimal order.
1.9 Objectives
The objectives of the study were:-
(a) To study ways of establishing the extent of sharing among the queries as well
as establishing the order of optimal execution.
(b) To study existing algorithms and identify their weaknesses and strengths, with a view to improving them so as to minimize the weaknesses and integrate the strengths of the different algorithms into more efficient hybrid algorithms.
(c) To analyze the improved and hybrid algorithms for efficiency and effectiveness.
1.10 Scope
(a) Optimization approach
The research concentrated mainly on the combination of the Heuristic and Semantic approaches to query optimization.
(b) Database Model
The research concentrated on the Relational model. It can, however, also be applied in an Object-Relational environment.
(c) Optimization Philosophy
The research was based on the multi-query approach to query optimization.
1.11 Conceptual Framework
Figure 1.4: Conceptual Framework
1.12 Methodology
In the process of coming up with the results of this study, the following methods
were used.
1. Literature Evaluation
This involved reading literature related to the concepts of query optimization. It also involved studying the criticisms of that work, as well as the improvements made on it by successive researchers.
2. Comparison
This involved comparing the ways in which similar optimization styles (like single query or multi-query optimization) were handled by different approaches (algorithms). This was used in identifying similarities, and hence in improving on the algorithms or developing algorithms that put together the individual algorithms' strengths.
1.13 Dissertation Overview
In Chapter 2, we look into the previous research that has taken place in the field of query optimization. We examine the rationale for query optimization, the approaches to query optimization, as well as the use of pipelining so that materialization is minimized. We then examine the different algorithms that have been proposed by different researchers, and identify their merits and demerits. We then formulate research questions to guide us in the process of proposing improvements.
In Chapter 3, we lay a foundation for exploiting the similarities between the different queries in a batch in a multi-query environment. We propose an algorithm that traverses the whole batch of query plans so as to establish the extent of the similarities. We improve the greedy algorithm to be intelligent enough to increase its efficiency, and we summarize the inter-query shareability in a query sharing matrix M.
In Chapter 4, we use the information in the query sharing matrix and traverse the batch while merging common sub-expressions. The matrix is used when choosing the order in which query plans are to be considered for optimization. It also provides guidelines on the extent of sharing between any two plans, so that the optimizer makes appropriate decisions; for example, we do not search for sharable sub-expressions between disjoint query plans.
We then present our conclusions, recommendations and future work in Chapter 5.
Chapter 2
Literature Review
2.1 Overview
When the relational model was first launched in the late 1970s, one of the major criticisms often cited was the inadequate performance of queries [2]. This was because, for equal amounts of data, queries consumed more resources such as processor cycles and memory than under other models. SQL, the de facto standard language for data definition and data manipulation in RDBMSs [2], offers a variety of ways in which a query can be structured to achieve the same output. The more complex the query, the higher the number of ways it can be represented.
On average, the best measure of a relational query's complexity is the number of relations it joins. In fact, even at the design level, developers introduce redundancy or merge some relations so that the joins in frequently invoked queries are minimized. Surajit and Kyuseok [1] state that the complexity of a query is exponential in the number of joins involved. Complex queries, such as those in data warehouses, which routinely join tens of tables, are too expensive to process in reasonable time. Since the structural differences between queries depend largely on the way the joins are ordered, the more tables joined, the more ways there are of writing a query that produces a given output. These alternative formulations, if executed, have varying costs, and in most cases (if not all) only one is optimal. The probability of writing the optimal query tends to zero as query complexity increases, so the computer is likely to waste a lot of resources. It is therefore the complex queries that must be optimized if a computer system is to work efficiently. An extensive query optimization phase must select the most efficient plan among the many available, and must do so in an acceptably short time. Without query optimization, RDBMSs would be inefficient and hence impractical [18]. Query optimization is itself an expensive process because it mostly relies on evaluating the different plans (access paths) and choosing an optimal one among them. The number of alternative access paths grows at least exponentially with the number of relations participating in the query [15].
The optimizer, which is almost certain that the plan sent by a user is not optimal, therefore has to search for an optimal plan and forward it for execution. This has to be done within the time constraint and in a resource-conserving manner. It would not be worthwhile if the cost difference between the optimized query and the original query were less than the cost of finding the optimal plan. The user, likewise, is supposed to get what was requested, in the same logical presentation and within an acceptable time interval. Therefore, while the user specifies the what, the optimizer determines the how [4], but it must still preserve the what: the process of optimization should have no effect whatsoever on the final query output.
The search strategy, therefore, must be efficient on top of conserving the form and content of the query request. Optimization is not a matter of transferring the resources that would execute the query to looking for the execution plan. The ability of the optimizer to reach the optimal plan at the earliest opportunity, with substantial resource savings, is therefore of paramount importance. Kroger et al [13] summarize the goal of an optimizer as follows: "A plan as cost-effective as possible is looked for as soon as possible". Kroger et al [13] further observe that the job of a query optimizer is not necessarily to find the cheapest plan (though the cheapest plan would of course be the best). In fact, if a stage is reached where the cost of further optimizing is higher than the resource savings, it is worthwhile to terminate the search.
The optimizer is supposed to economize the resources spent on looking for a plan as well as to consider the total processing time (execution time plus optimization time). Depending on the nature of the problem, a sub-optimal plan may therefore be preferred, especially in a real-time scenario.
Given the large number of possible plans, traversing them one by one and establishing the cost of each may seem the ideal strategy, but it wastes time since the options are too many: it is likely to produce a low-cost plan only after spending a lot of resources to find it.
2.2 Approaches to Query Optimization
Broadly, computers optimize queries either individually (Single-query optimization)
or as batches (Multi-query optimization).
2.2.1 Single Query Optimization
In single query optimization, a query that is syntactically correct is broken down, expressed as a relational algebra expression, and a query plan, represented as a tree, is created [4, 18]. This is the traditional approach to query optimization and is used in most commercially available optimizers. It is suitable where a database receives a low traffic of simple queries.
Depending on the algorithm used, either the different representations of the original query are generated and the best one searched for, or the supplied query is adjusted towards the optimal one. If the option of choosing the best tree from the different trees available is used, there is a high possibility of logical duplicates (two physically different trees doing the same thing, the same way). Since the number of options is likely to be high, such options normally overload the memory and require an exhaustive algorithm. There are likely to be groups of plans with the same cost, implying that more searching is done with no practical advantage. The use of exhaustive algorithms is one reason for the inefficiency of many query optimization research efforts [20]. Single-processor optimizers therefore limit the search space by considering only some tree configurations [12]: left deep, right deep, complex deep or bushy tree configurations.
Figure 2.1: The different tree configurations: bushy, complex deep, left deep and
right deep trees
The IBM System R optimizer, for example, considers only left deep trees. Heuristics may also be used to eliminate some obviously expensive plans before optimization so as to reduce the search space. The rules may be static or dynamic; dynamic rules are applied based on available data such as database statistics or the system catalog. Failure to reduce the search space may cause effects that lead to a decline in the performance and cost effectiveness of the optimizer. These are:-
(i) Too many options may overload the memory. In cases where the query is
complex, the computer may not have enough memory space to perform other
routine activities.
(ii) In the process of conserving the memory, the computer may have to store
some of the plans on disk and retrieve them when required. This then brings
in costs of writing and reading from disk which increase optimization costs.
(iii) The many options put a bigger load on the processor during cost estimation
and comparison.
Restricting the tree types used in optimization, and thereby eliminating duplicates, reduces the number of possible options. This shrinks the search space, saving both the traversal of many options and memory usage.
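To illustrate why restricting the search space helps, the following rough sketch (not from the dissertation; the counting formulas are standard combinatorics) compares the number of left-deep join orders with the number of arbitrary bushy join trees for the same relations:

```python
from itertools import permutations
from math import comb, factorial

def left_deep_orders(relations):
    """Enumerate all left-deep join orders: each order is simply a
    permutation of the relations, joined strictly left to right."""
    return list(permutations(relations))

def bushy_tree_count(n):
    """Number of distinct bushy join trees over n relations:
    n! orderings of the leaves times Catalan(n - 1) tree shapes."""
    return factorial(n) * comb(2 * (n - 1), n - 1) // n

orders = left_deep_orders(["R", "S", "T", "U"])
print(len(orders))          # 24 left-deep orders for 4 relations
print(bushy_tree_count(4))  # 120 bushy trees for the same 4 relations
```

Even at four relations the restriction cuts the space fivefold; the gap widens rapidly with the number of joins, which is the point made above.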
2.2.2 Multi-Query Optimization (MQO)
In MQO, queries are executed in batches. Some MQO techniques work in such a way that if some queries have a common sub-expression, that sub-expression is executed once and its output shared. In some cases, the sharing does not necessarily take place on the individually optimal plans [20]; instead, sub-optimal plans are used. A decision may also have to be taken on whether the common sub-expressions should be pipelined or materialized [16]. Some multi-query optimization techniques, like those described by Kyuseok et al [14], basically aim at optimizing many queries in parallel. The queries pass through the different optimization steps together, and the output, which is a set of optimal plans, one per query, is generated. Roy et al [20] criticize this approach on the basis that further cooperation
can be made between the queries that make up the batch. If a certain sub-expression
is common, then the computer should execute it once and share out the results.
This is a guiding principle of the Basic Volcano algorithm proposed by Graefe and McKenna [7], and of the Volcano-SH and Volcano-RU optimizer algorithms proposed by Roy et al [20]. Roy et al [20] attach such importance to the sharing of sub-expressions that even sharing on a non-optimal plan of a query is acceptable, so long as the total resource requirement is optimal.
If, for example, the queries Q1 and Q2 have locally optimal plans Q1 = (R ⋈ S) ⋈ P and Q2 = (R ⋈ T) ⋈ S, there is no common sub-expression between them, and they therefore need to be executed independently. However, if Q2 takes the locally sub-optimal plan Q2′ = (R ⋈ S) ⋈ T, then (R ⋈ S) is a common sub-expression. If it is executed once and shared by Q1 and Q2′, the amount of resources required to execute the two queries is likely to drop below the sum of the costs of the two optimal plans, even though this is achieved on the sub-optimal plan of Q2.
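The trade-off in this example can be made concrete with invented numbers. The sketch below assumes, purely for illustration, that the cost of a plan splits into the cost of the shared sub-expression plus a residual cost on top of it; real cost models are more involved:

```python
# Hypothetical plan costs: every number here is invented for illustration.
cost = {
    "(R⋈S)⋈P": 100,   # locally optimal plan for Q1
    "(R⋈T)⋈S": 90,    # locally optimal plan for Q2
    "(R⋈S)⋈T": 110,   # locally sub-optimal plan Q2' that shares (R⋈S)
    "R⋈S": 40,        # the common sub-expression, computed once
}

# Independent execution of the two locally optimal plans:
independent = cost["(R⋈S)⋈P"] + cost["(R⋈T)⋈S"]

# Shared execution: (R⋈S) is computed once; each parent plan then pays
# only its residual cost on top of the shared result (a simplifying
# assumption about how costs decompose).
residual_q1 = cost["(R⋈S)⋈P"] - cost["R⋈S"]
residual_q2 = cost["(R⋈S)⋈T"] - cost["R⋈S"]
shared = cost["R⋈S"] + residual_q1 + residual_q2

print(independent, shared)   # 190 170
```

With these numbers, sharing on the sub-optimal plan of Q2 wins (170 versus 190) even though Q2′ alone is more expensive than Q2's optimal plan, which is exactly the point made above.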
A multi-query optimizer, according to Roy et al [20], is therefore responsible for recognizing the possibilities of shared computation and for modifying the optimizer's search strategy to explicitly account for shared computation so as to find a globally optimal plan.
This sharing, however, is not necessarily optimal since:-
(i) The cost of Q′ may be so high that the sum of the independent optimal plans is still the global optimum;
(ii) The shared plan may have a lower resource requirement than the non-shared ones, but the resources taken to find that plan may exceed the saving, giving an efficient query in an inefficient system;
(iii) The sharable components may be too few, with the ratio of sharable to total components too low. The search for sharable components may then yield so few of them that the saving is far less than the cost of searching.
The inclusion of sub-optimal plans increases the search space; hence a more efficient search technique is required.
Sharing in multi-query optimization hinges on the availability of sharable sub-expressions. Eliminating some plan types, as done in the System R optimizer, may not be useful, since the eliminated types may contain sharable components. The elimination should instead depend on the cost of the sub-expressions in the plan itself, since excessively expensive expressions, even if shared, are very unlikely to lead to global optimality. Some expressions are eliminated so as to reduce the search space; in reuse-based cases, cost, not tree configuration, should be the basis.
2.3 The Effect of Pipelining
Dalvi et al [16] observed that most multi-query optimization techniques assume that all shared sub-expressions have to be materialized, so that they are read from disk when they need to be used. This is indeed assumed in the contributions of Roy et al [20] as well as Graefe and McKenna [7]. Dalvi et al [16] suggest that materialization of a sub-expression need not be made a rule of thumb: if the sub-expression is pipelined instead, the cost of writing to disk while materializing and the cost of reading from disk while reusing are both saved. Pipelining, however, brings in new challenges such as limited buffer space. The difference between the rate at which a source operator produces output and the rate at which the destination operator consumes it may make a pipeline schedule unrealizable.
Figure 2.2: A pipeline Schedule
Take, for example, the schedule shown in Figure 2.2, where the output of A is shared between B and C. Let the rate at which A produces tuples be rA, and the rates at which B and C consume tuples be rB and rC respectively. The schedule is not realizable for all combinations of these rates. Suppose, for example, that rB = rC and rA is less than rB (and hence also less than rC). The schedule is then unrealizable, since A cannot produce tuples at a rate that satisfies the demands of B and C. If the reverse is true, A produces tuples at a rate greater than what B and C can consume. Tuples then pile up in the output buffer of A (if the pull model is used) or in the input buffers of B and C (if the push model is used). The piling up of tuples may fill the buffers and hence cause a deadlock. Note that rB is not necessarily equal to rC, so the situation may be even more complicated. Dalvi et al [16] derived the necessary and sufficient conditions for a schedule to be pipelinable; a sub-expression that fails them is materialized.
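The rate argument above can be sketched with a toy discrete simulation (our own illustration, not the actual conditions of Dalvi et al [16]): A produces tuples at a fixed rate, B and C each need a copy of every tuple, and we track the backlog left behind by the slower consumer:

```python
def buffer_growth(r_a, r_b, r_c, steps=100):
    """Toy simulation of the schedule in Figure 2.2: A produces r_a
    tuples per step; B and C each consume their own copy of A's output
    at rates r_b and r_c. Returns the backlog of tuples A has produced
    but the slower consumer has not yet taken after `steps` steps: a
    growing backlog means bounded buffers eventually fill (deadlock
    risk); a negative value means demand exceeds supply (starvation)."""
    slower = min(r_b, r_c)
    backlog = 0
    for _ in range(steps):
        backlog += r_a - slower   # net tuples left in A's output buffer
    return backlog

print(buffer_growth(10, 10, 10))   # 0: matched rates, trivially realizable
print(buffer_growth(5, 10, 10))    # -500: A too slow, B and C starve
print(buffer_growth(20, 10, 15))   # 1000: tuples pile up for the slower consumer
```

Only the matched-rate case leaves the buffer stable; the other two illustrate the starvation and pile-up situations described in the text.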
Rao and Ross [19] propose a less conservative pipeline schedule than that of Dalvi et al [16]: it allows simultaneous input of multiple tuples.
Even when some sub-expressions satisfy the necessary and sufficient conditions for pipelining, it may still be impossible to pipeline them. This is because pipelining a sub-expression implies that it has to be kept in cache, and given the limited cache size, it may be impossible to keep all the sub-expressions there.
Hulgeri et al [10] aim at optimizing the use of memory by dividing it appropriately among the pipelinable sub-expressions in the plan. The amount of memory given to a sub-expression determines the cost estimate associated with it. Given that memory is finite, not all sub-expressions may be accommodated; Hulgeri et al [10], however, use the division of memory to accommodate as many pipelinable schedules as possible.
Faced with a scenario where not all pipelinable schedules can be pipelined due to memory limitations, there is a need to decide what to pipeline so as to obtain the higher benefit. Urhan and Franklin [24] propose a dynamic approach to pipelining in which the choice to pipeline is guided by the perceived (cost-wise) importance of the output. This way, even if the total number of pipelined sub-expressions is the same for two similar schemes, the savings may differ.
Gupta et al [8] use a rather different approach to optimal use of the cache: they use query (and hence sub-expression) scheduling to achieve optimal use of memory. A sub-expression may be evicted while it still has other processes to serve, so long as it is replaced with one whose benefit is higher. With pipelining, the algorithm has to decide what to pipeline and what to materialize when the buffer space is not enough. Moreover, not all reused sub-expressions are pipelinable. Scheduling is therefore limited to the pipelinable schedules, and where these cannot all fit in memory, pipelining has to be done in a buffer-conserving manner.
2.4 Single Query Optimization Algorithms
Single query optimization algorithms explore one query at a time, find an optimal plan and execute it. They mostly employ dynamic programming techniques and are commonly used in commercial optimizers such as the System R optimizer.
2.4.1 Dynamic Programming Algorithms
In dynamic programming algorithms, queries are turned into query trees and optimized in that form. In a tree, a node represents a join operation, a leaf represents a relation, and an edge indicates data flow [15]. The execution space, which is the set of all join trees for a query, is normally restricted to deep trees, either right deep or left deep, and sometimes bushy and complex deep trees. This is done to lower the search space, and the applied heuristics eliminate duplicates further.
The Left Deep Algorithm
The Left Deep algorithm works well for a space that allows only nested-loop and hash joins as join methods. However, the algorithm is poor for sort-merge joins, since it has to keep multiple plans for sub-queries with different interesting orders [15]. Being selective, this algorithm should be applied when:-
(i) The types of join algorithm to be used are known in advance.
(ii) There is a facility to find out whether the compatible join algorithms are the ones used and, if they are not, a standby alternative algorithm is available.
2.4.2 Randomized Algorithms
These handle the rather complex queries which cannot be handled well by the dynamic programming algorithms [1]. For complex queries, there are too many possible alternatives, and a dynamic programming approach is therefore likely to cause a combinatorial explosion [14]. Randomized algorithms view the search space as a set of states; for each state, a cost function gives the cost of using it to execute the plan. The algorithm performs random walks from one state to a directly accessible state (a neighbor). A move from a cheaper state to a more expensive one is referred to as an uphill move; otherwise it is a downhill move. The algorithm basically computes the costs of alternatives and makes downhill moves so as to reach the minimum cost plan.
There are three popular randomized algorithms: the Iterative Improvement (II), Simulated Annealing (SA) and Two-Phase Optimization (2PO) algorithms.
Iterative Improvement (II)
This algorithm chooses a region at random, finds the local minimum and com-
pares the local minima to come up with the global minimum. It randomly picks a
state and looks for a minimum in the neighborhood.
The Iterative Improvement Algorithm
Procedure II
minS = infinity
while not(stopping condition) do {
    S = random state
    # start of inner loop
    while not(local minimum) do {
        S' = random neighbour of S
        if (cost(S') < cost(S))
            S = S'
    }
    # end of inner loop
    if (cost(S) < cost(minS))
        minS = S
}
return minS
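The pseudocode above can be sketched in Python as follows. The search-space functions (`random_state`, `neighbours`, `cost`) and the toy integer search space are illustrative assumptions on our part, not part of the algorithm's description:

```python
import random

def iterative_improvement(random_state, neighbours, cost, tries=50):
    """Sketch of Iterative Improvement: repeatedly pick a random
    starting state, walk strictly downhill to a local minimum, and
    keep the cheapest local minimum seen so far."""
    best = None
    for _ in range(tries):               # stopping condition: fixed tries
        s = random_state()
        improved = True
        while improved:                  # walk downhill until local minimum
            improved = False
            for n in neighbours(s):
                if cost(n) < cost(s):
                    s, improved = n, True
                    break
        if best is None or cost(s) < cost(best):
            best = s
    return best

# Toy search space: states are the integers 0..99, neighbours differ
# by 1, and the cost function has several local minima.
cost = lambda x: (x - 30) ** 2 * ((x % 7) + 1)
best = iterative_improvement(
    random_state=lambda: random.randrange(100),
    neighbours=lambda x: [y for y in (x - 1, x + 1) if 0 <= y < 100],
    cost=cost,
)
print(best, cost(best))
```

The result is always a local minimum; with enough random restarts it is very likely, but not guaranteed, to be the global one, which mirrors the advantages and shortcomings discussed next.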
This algorithm has the advantages that:-
(i) It is very likely that the global minimum will eventually be found, and
(ii) The algorithm need not explore the whole space of plans, since it makes only downhill moves.
The algorithm is therefore random and selective, but always focused on identifying the local, and hence the global, minima. It has some shortcomings, however, in that:-
(i) Since the starting point of each search for a local minimum is chosen at random, the same region around a local minimum may be selected repeatedly. Over-iteration around the same local minimum is therefore likely. If, for example, in Figure 2.3, each circle represents a plan and the cost associated with it, it can be seen that plan P1, with a cost of 12, is the global minimum. Each of the three random walks ends at P7, which is one of the local minima but not the global minimum. This shows over-traversal of the region around P7 without any improvement in the cost of the output, leading to a higher search cost and a longer time to identify the minimum.
Figure 2.3: Random nature of the II algorithm
(ii) It is hard to prove that the local minima have really been exhausted, since a local minimum can exist within highly expensive neighbors. Such a local minimum (which may well be the global minimum) is only reached if the random starting point is on the expensive neighbor, with the walk moving towards the minimum, or on the minimum itself. In Figure 2.3, for example, the optimal plan is shielded in this way.
Simulated Annealing (SA)
The Simulated Annealing algorithm allows some uphill moves so as to reach a possibly shielded local minimum, but it does so with a continuously decreasing probability to avoid being trapped in a higher cost local minimum [15].
The Simulated Annealing Algorithm
Procedure SA
S = initial state
T = initial temperature
minS = S
while not(frozen) do {
    # start of inner loop
    while not(equilibrium) do {
        S' = random state in neighbours(S)
        c = cost(S') - cost(S)
        if (c < 0)
            S = S'
        else
            S = S' with probability exp(-c/T)
        if (cost(S) < cost(minS))
            minS = S
    }
    # end of inner loop
    T = reduce(T)
}
Local minima are located by a sub-module called a stage. Each stage is performed at a parameter T called the temperature. T controls the probability of performing an uphill move and keeps reducing until it reaches zero, at which point the algorithm terminates.
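The stage/temperature mechanism can be sketched in Python as follows; the parameter values (initial temperature, cooling factor, stage length, freezing threshold) are illustrative assumptions, as is the toy search space:

```python
import math
import random

def simulated_annealing(initial, neighbours, cost, T=100.0,
                        cooling=0.9, inner=20, T_min=0.1):
    """Sketch of SA: downhill moves are always taken; uphill moves are
    taken with probability exp(-c/T); the temperature T is reduced
    after each stage until the system is considered frozen."""
    s = best = initial
    while T > T_min:                      # "frozen" once T drops below T_min
        for _ in range(inner):            # one "stage" at temperature T
            n = random.choice(neighbours(s))
            c = cost(n) - cost(s)
            if c < 0 or random.random() < math.exp(-c / T):
                s = n                     # downhill, or an accepted uphill move
            if cost(s) < cost(best):
                best = s                  # remember the cheapest state seen
        T *= cooling                      # reduce the temperature
    return best

# The same toy search space as for Iterative Improvement.
cost = lambda x: (x - 30) ** 2 * ((x % 7) + 1)
nbrs = lambda x: [y for y in (x - 1, x + 1) if 0 <= y < 100]
best = simulated_annealing(50, nbrs, cost)
print(best, cost(best))
```

Because uphill moves are occasionally accepted while the temperature is high, the walk can escape a shielded local minimum that Iterative Improvement would be stuck in.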
Two-Phase Optimization (2PO)
Two-Phase Optimization uses both the Iterative Improvement and the Simulated Annealing approaches [15]. In the first phase, Iterative Improvement runs for a small period of time, performing a few local optimizations. The output of this phase becomes the initial state of Simulated Annealing, which then runs with a low initial probability of uphill moves.
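The two phases can be combined in a compact sketch. The low initial temperature in phase two, and all other parameter values, are illustrative assumptions:

```python
import math
import random

def downhill(s, neighbours, cost):
    """Phase 1 helper: a plain downhill walk to a local minimum."""
    improved = True
    while improved:
        improved = False
        for n in neighbours(s):
            if cost(n) < cost(s):
                s, improved = n, True
                break
    return s

def two_phase(random_state, neighbours, cost, restarts=10, T=5.0):
    """Sketch of 2PO: a short Iterative Improvement phase picks the best
    of a few local minima; that state then seeds a Simulated Annealing
    run whose initial temperature is deliberately low, so uphill moves
    are rare from the start."""
    # Phase 1: a few II restarts, keep the cheapest local minimum.
    s = min((downhill(random_state(), neighbours, cost)
             for _ in range(restarts)), key=cost)
    best = s
    # Phase 2: SA seeded with the II result.
    while T > 0.1:
        for _ in range(10):
            n = random.choice(neighbours(s))
            c = cost(n) - cost(s)
            if c < 0 or random.random() < math.exp(-c / T):
                s = n
            if cost(s) < cost(best):
                best = s
        T *= 0.9
    return best

cost = lambda x: (x - 30) ** 2 * ((x % 7) + 1)
nbrs = lambda x: [y for y in (x - 1, x + 1) if 0 <= y < 100]
best = two_phase(lambda: random.randrange(100), nbrs, cost)
print(best, cost(best))
```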
2.5 Multi-Query Optimization Algorithms
In multi-query optimization, queries are not optimized one by one; instead, they are optimized, and hence executed, in batches. Kyuseok et al [14], Sellis and Ghosh [23], Cosar et al [3] and Park and Segev [17] carried out research on multi-query optimization using basically exhaustive algorithms. These algorithms traverse a good number of the different options, look for the minimum among the different query plans per query, and then output the set of optimal plans.
Although optimized together, the plans of the queries, whether final or intermediate, do not intervene in the generation of each other's plans. In fact, after individual query optimization, the queries compete for computer resources at execution. Roy et al [20] advocate cooperation and do not optimize the individual queries entirely in isolation: they aim at getting the cheapest way of retrieving all the data so that all the query requests are serviced.
2.5.1 The MQO Problem
Kyuseok et al [14] formulate the MQO problem as follows:
Given n sets of plans {P1, P2, ..., Pn}, with Pi = {Pi1, Pi2, ..., Piki} being the set of possible plans for processing Qi, find a global access plan GP by selecting one plan from each Pi such that the cost of GP is minimal.
State-Space Search Algorithm for MQO
The search space for the MQO problem, according to Kyuseok [15], is constructed by defining one state for each possible combination of plans among the queries. For n queries Q1, Q2, ..., Qn, let Qi have plans Pi = {Pi1, Pi2, ..., Piki} for i = 1, 2, ..., n. Then:
1. Every state S is an n-tuple < P1j1, P2j2, ..., Pnjn >. For each i, either Piji ∈ Pi, in which case S suggests that Piji should be used to process Qi, or Piji = NULL, in which case S does not suggest any plan for processing Qi. The cost of S, denoted scost(S), is the total cost of processing all the queries in S.
2. The initial state is S0 = < NULL, NULL, ..., NULL >, while a final state is SF = < P1j1, P2j2, ..., Pnjn > with Piji ≠ NULL for all i.
3. Given a state S = < P1j1, P2j2, ..., Pnjn >, let
next(S) = min{ i | Piji = NULL } if { i | Piji = NULL } ≠ ∅, and next(S) = n + 1 otherwise.
4. Let the state S have at least one NULL entry, and let m = next(S). Then the immediate successors of S are the states S′ = < P1k1, P2k2, ..., Pnkn > satisfying the following properties:
Piki = Piji for 1 ≤ i < m;
Pmkm ∈ Pm;
Piki = NULL for m + 1 ≤ i ≤ n.
The cost of a transition from S to S′ is the additional cost required to process the new plan Pmkm given the results of processing the plans in S.
45
This algorithm is not sharing-oriented, either during optimization or during execution. The global cost is therefore the sum of the individually optimal plan costs: for queries Qi with optimal plans Pi, i = 1, 2, ..., n, the cost of execution is Σi cost(Pi).
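When no sharing is considered, a final state simply picks one plan per query and its cost is the sum of the chosen plan costs. A minimal sketch of that exhaustive enumeration (the plan names and costs are invented):

```python
from itertools import product

def mqo_exhaustive(plan_sets):
    """Sketch of the state-space formulation above without sharing:
    enumerate every final state (one plan per query) and keep the one
    with minimal scost(S) = sum of its plan costs."""
    best_state, best_cost = None, float("inf")
    for state in product(*plan_sets):          # one plan chosen per query
        c = sum(plan_cost for _, plan_cost in state)
        if c < best_cost:
            best_state, best_cost = state, c
    return best_state, best_cost

# Hypothetical plans for two queries, as (plan_name, cost) pairs.
plans_q1 = [("P11", 100), ("P12", 120)]
plans_q2 = [("P21", 90), ("P22", 80)]
best_state, best_cost = mqo_exhaustive([plans_q1, plans_q2])
print(best_state, best_cost)   # (('P11', 100), ('P22', 80)) 180
```

Without sharing, the search decomposes: the best state always picks each query's individually cheapest plan, which is exactly why the reuse-based algorithms below are needed to do better.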
2.5.2 Reuse Based Optimization
All the multi-query algorithms explored so far do not take into consideration the possibility that common sub-expressions may exist and yield greater overall savings. In reuse-based optimization, sharing possibilities are explored. The Volcano [7], Volcano-SH [20] and Volcano-RU [20] algorithms use the same principle, though with different philosophies. Most reuse-based algorithms use Directed Acyclic Graphs (DAGs) to represent the search space; in some cases, the search space is represented as an AND-OR DAG.
Figure 2.4: Query representation: Tree, DAG and extended DAG
An AND-OR DAG is a DAG whose nodes are divided into two kinds: AND nodes and OR nodes. AND nodes have only OR nodes as their children, and OR nodes have only AND nodes as their children. An AND node represents an algebraic operation such as select (σ) or project (π); AND nodes are therefore referred to as operation nodes. An OR node represents a logical expression that generates the same result set as its child operation node applied to that node's children; OR nodes are therefore referred to as equivalence nodes. The expanded DAG is used as the representation in modern optimizers because it is easily extensible.
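The alternating AND/OR structure can be captured with two small node types. This is a minimal sketch with names of our own choosing, not the representation of any particular optimizer:

```python
from dataclasses import dataclass, field

@dataclass
class OrNode:
    """Equivalence node: one logical result set, with the alternative
    operation (AND) nodes that can produce it as children."""
    name: str
    children: list = field(default_factory=list)   # AndNode alternatives

@dataclass
class AndNode:
    """Operation node: an algebraic operation (σ, π, ⋈, ...), with the
    equivalence (OR) nodes it consumes as children."""
    op: str
    children: list = field(default_factory=list)   # OrNode inputs

# The expression R ⋈ S, with two alternative join orders grouped under
# a single equivalence node.
r, s = OrNode("R"), OrNode("S")
rs = OrNode("R⋈S", children=[
    AndNode("⋈", [r, s]),      # R ⋈ S
    AndNode("⋈", [s, r]),      # S ⋈ R: same result, different plan
])
print(len(rs.children))        # 2 alternative plans for one logical result
```

Grouping alternatives under one equivalence node is what makes the representation compact: every plan that produces the same result shares a single OR node.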
Graefe and McKenna [7] proposed an algorithm for generating an AND-OR DAG, and Roy et al [21] improved it into a more efficient one.
AND-OR DAG Adjustments for Multi-Query Optimization
Roy et al [20] proposed three main adjustments that need to be made to an AND-OR DAG in order to get an optimal plan:
(i) Merging
Since each query has a single DAG and the queries are to be optimized simultaneously, the DAGs need to be merged into a single DAG. A pseudo root equivalence node is created with all the individual DAG roots as its children.
(ii) Identification of common sub-expressions
If two common sub-expressions exist in the pseudo-rooted DAG, then there will be two equivalence nodes which are exactly the same, or the same after applying join commutativity, as in the case of X ⋈ Y and Y ⋈ X.
(iii) Handling of nodes derivable from others
Where expressions that can be derived from others exist, the optimizer adds nodes appropriately as children or as parents. For example, if two equivalence nodes e1 = σA≤5(E) and e2 = σA≤10(E) exist, e2 is made a parent of e1: its output is computed once and shared, and e1 is derived from e2 rather than from E. If instead there were nodes e3 = σA=5(E) and e4 = σA=10(E), then a node e5 = σA=5∨A=10(E) is created and the two are derived from it.
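The derivability adjustment in (iii) can be illustrated with a toy relation (our own example; predicates and data are invented): the narrower selection is answered from the wider one's result instead of rescanning the base relation.

```python
# Stand-in for relation E with attribute A taking values 1..15.
rows = [{"A": a} for a in range(1, 16)]

def select_leq(source, bound):
    """σ_{A <= bound} over an in-memory list of rows."""
    return [r for r in source if r["A"] <= bound]

e2 = select_leq(rows, 10)    # σ_{A<=10}(E): computed once from E
e1 = select_leq(e2, 5)       # σ_{A<=5}(E): derived from e2, not from E

assert e1 == select_leq(rows, 5)   # same answer, but from a smaller input
print(len(e2), len(e1))            # 10 5
```

The saving is that e1 scans 10 rows instead of 15; in a real optimizer the wider node would additionally be a candidate for materialization so that both selections share one disk scan.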
The Basic Volcano Algorithm
This determines the cost of the nodes using a depth-first traversal of the DAG. The cost of an operation node o is given by
cost(o) = cost of executing o + Σ_{ei ∈ children(o)} cost(ei)    (2.1)
and the cost of an equivalence node e is given by
cost(e) = min{ cost(oi) | oi ∈ children(e) }    (2.2)
If the equivalence node has no children, then cost(e) = 0. In case a certain node has to be materialized, the equation for cost(o) is adjusted to incorporate materialization. For a materialized equivalence node, the minimum of the cost of reusing the node and the cost of recomputing it is used. The equation therefore becomes
cost(o) = cost of executing o + Σ_{ei ∈ children(o)} C(ei)    (2.3)
where
C(ei) = cost(ei) if the node is not materialized, and
C(ei) = min(cost(ei), reusecost(ei)) if the node is materialized.
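The mutual recursion of equations (2.1)–(2.3) can be sketched as follows; the dictionary-based node representation and all the costs are invented for illustration:

```python
def op_cost(op, materialized, reuse_cost):
    """Equation (2.3): cost of an operation node is its execution cost
    plus, for each child equivalence node, the cheaper of recomputing
    it and (if materialized) reusing its stored result."""
    total = op["exec_cost"]
    for child in op["children"]:
        c = eq_cost(child, materialized, reuse_cost)
        if child["name"] in materialized:
            c = min(c, reuse_cost[child["name"]])
        total += c
    return total

def eq_cost(e, materialized, reuse_cost):
    """Equation (2.2): cost of an equivalence node is the minimum over
    its child operation nodes; a leaf (base relation) costs 0."""
    if not e["children"]:
        return 0
    return min(op_cost(o, materialized, reuse_cost) for o in e["children"])

# A tiny DAG: top = π(R⋈S), where R⋈S costs 40 to compute.
r = {"name": "R", "children": []}
s = {"name": "S", "children": []}
join = {"exec_cost": 40, "children": [r, s]}
rs = {"name": "R⋈S", "children": [join]}
proj = {"exec_cost": 10, "children": [rs]}
top = {"name": "π(R⋈S)", "children": [proj]}

print(eq_cost(top, set(), {}))                # 50: recompute R⋈S (40 + 10)
print(eq_cost(top, {"R⋈S"}, {"R⋈S": 5}))      # 15: reuse materialized R⋈S (5 + 10)
```

Marking R⋈S as materialized with a cheap reuse cost drops the plan cost from 50 to 15, which is the effect the C(ei) term captures.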
The Volcano-SH Algorithm
In Volcano-SH, the plan is first optimized using the Basic Volcano algorithm, and the Basic Volcano best plans are then merged by creating a pseudo root. The optimal query plans may have common sub-expressions which should be materialized and reused. If, for example, a certain equivalence node has computation cost cost(e), materialization cost matcost(e) and reuse cost reusecost(e), and is to be used numuses(e) times, then the optimizer decides whether it should be materialized or not. For it to be materialized, there must be a resulting saving on costs: computing the node once, materializing it, and reusing it numuses(e) − 1 times must be cheaper than recomputing it on every use. Therefore
cost(e) + matcost(e) + (numuses(e) − 1) × reusecost(e) < numuses(e) × cost(e)    (2.4)
which simplifies to
reusecost(e) + matcost(e) / (numuses(e) − 1) < cost(e)    (2.5)
This equation has one practical limitation. The Volcano-SH algorithm works from the bottom upwards, and the number of times a node is used depends on whether or not its parents have been materialized. The cost depends on the children and is therefore known, but determining the number of uses requires first traversing the DAG, which is an expensive process [20].
The Volcano-SH Algorithm
Procedure VOLCANO-SH(P)
Input: consolidated Volcano best plan for the virtual root of the DAG
Output: set of nodes to materialize and the corresponding best plan P
Global variable: M, the set of nodes chosen to be materialized
M = { }
Perform a prepass on P to introduce subsumption derivations
let Costroot = COMPUTEMATSET(root)
set Costroot = Costroot + Σd∈M (cost(d) + matcost(d))
undo all subsumptions on P where the subsumption node is not
chosen to be materialized
return (M, P)

Procedure COMPUTEMATSET(e)
if e is already memoised, return cost(e)
let operator o be the child of e in P
for each input equivalence node ei of o
    let Ci = COMPUTEMATSET(ei)   // returns the computation cost of ei
    if ei is materialised, let Ci = reusecost(ei)
compute cost(e) = cost of operation o + Σi Ci
if reusecost(e) + matcost(e)/(numuses(e) − 1) < cost(e)
    if e is not introduced by a subsumption derivation
        add e to M   // decide to materialise e
    else if reusecost(e) + matcost(e)/(numuses(e) − 1) is less than
            the savings to the parents of e due to the introduction
            of a materialised e
        add e to M
memoise and return cost(e)
In such a scenario,
numuses(e) = Σp∈parents(e) U(p), where U(p) = numuses(p) if p is not materialised, and U(p) = 1 if p is materialised.
Instead of using numuses(e), its underestimate numuses⁻(e) is used, and the condition for materialization becomes
reusecost(e) + matcost(e) / (numuses⁻(e) − 1) < cost(e).    (2.6)
Since an underestimate is used, whenever the condition holds for numuses⁻(e), it must also hold for the actual value. Cases where it fails for the underestimate but holds for the actual value yield low savings anyway.
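The materialization test of equation (2.6) is easy to state as a small function; the numbers in the usage lines are invented:

```python
def worth_materializing(cost, matcost, reusecost, numuses_under):
    """The Volcano-SH test of equation (2.6), using the underestimate
    numuses⁻(e): materialize only if, per additional use, reusing the
    stored result plus the amortized materialization cost beats
    recomputing the node."""
    if numuses_under <= 1:
        return False   # no reuse expected: nothing to amortize matcost over
    return reusecost + matcost / (numuses_under - 1) < cost

# 3 expected uses: 20 + 60/2 = 50 < 100, so materializing pays off.
print(worth_materializing(cost=100, matcost=60, reusecost=20, numuses_under=3))   # True
# Only 2 uses and a high matcost: 20 + 120/1 = 140 > 100, so it does not.
print(worth_materializing(cost=100, matcost=120, reusecost=20, numuses_under=2))  # False
```

Because the test uses an underestimate of the use count, a True answer here is safe: the true numuses(e) can only make the left-hand side smaller.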
The Volcano-RU Algorithm
Volcano-RU exploits sharing well beyond the optimal plans of the individual queries. Though the Volcano-SH algorithm considers sharing, it does so only on individually optimal plans, so sharable components that appear only in sub-optimal plans are left out. Including sub-optimal plans, however, means that the space of nodes grows, and the search algorithm must take this into consideration so that the cost of searching stays below the extra savings made.
The Volcano-RU Algorithm
Procedure VOLCANO-RU
Input: expanded DAG on queries Q1, ..., Qn (including subsumption derivations)
Output: set of nodes to materialize M and the corresponding best plan P
N = { }   // set of potentially materializable nodes
for each equivalence node e, set count[e] = 0
for i = 1 to n
    compute Pi, the best plan for Qi, assuming the nodes in N are materialized
    for every equivalence node e in Pi
        set count[e] = count[e] + 1
        if reusecost(e) + matcost(e)/(count[e] − 1) < cost(e)   // worth materialising
            add e to set N
combine P1, P2, ..., Pn to get a single DAG-structured plan P
(M, P) = VOLCANO-SH(P)   // Volcano-SH makes the final materialization decision
The Volcano-RU algorithm aims at reusing and sharing sub-expressions which are not necessarily in the individually optimal query plans. Volcano-RU is sequential, considering the possibility of reusing expressions of previously optimized queries in subsequent queries. For a set of queries under the same pseudo root, after optimizing Qi, the nodes in the plan of Qi are identified. Since at that moment the algorithm does not know the structure of the subsequent queries, it checks whether materializing a given node would pay off if the node were reused one extra time. While optimizing the next query, such cost-saving expressions are treated as present. Volcano-SH is then applied to detect and exploit further sharing opportunities; in this way, a query is able to share sub-expressions within itself, and all materializable plans are identified.
Volcano-RU depends on the order in which the queries are optimized. Optimization can be done in a certain sequence, then in the reverse order, and the cheaper alternative chosen. Considering more orders raises the probability of finding a cheaper one, but it increases the optimization time [20].
Reuse-based algorithms rely on the number of economically reusable components in a batch of queries. Their efficiency is therefore derived from the proportion of searches that result in materializable and sharable nodes. In some cases, moreover, the order of optimization is of paramount importance. The materializability of a node depends on the nature of the node and is hence outside the control of the optimizer; the shareability of a node, however, depends on how much the queries in a batch have in common. In this research, we investigate the indicators of shareability, as well as how to efficiently establish the extent of sharing, so as to make a more guided search.
2.6 Research Questions
1. Since queries are sent from different sources, they are different and random, but
since they are addressed to the same schema, there are possible similarities.
How can we efficiently detect sharing opportunities (and the lack of them) so
that the search for common sub-expressions is done with a high probability of
resource savings?
2. Materialization is a good sharing option but not the best. Pipelining is limited
to specific schedules, and much pipelining is impossible given limited buffer
space. Can we have a more cost-saving schedule with minimal materialization
for an AND-OR DAG plan?
Chapter 3
Query Shareability Establishment
3.1 Motivation
With the load on database applications increasing, there is a need to improve the
efficiency of query optimizers so as to accommodate the load together with the
non-functional constraints on them, like speed. The costs incurred mostly arise
during physical storage access, storage of intermediate relations and computations
on intermediate data. To minimize the frequency at which the database is accessed,
we have to ensure that common outputs are reused rather than accessing the storage
disk multiple times for the same data.
Previous research in multi-query optimization [7, 8, 17, 20, 24] acknowledges the
need to exploit sharable sub-expressions but depends heavily on the possibility
that common sub-expressions exist. Since the queries come from different
sources, there is a possibility that they have nothing in common, leading to a
worthless search. No effort is made to establish whether sharable sub-expressions
actually exist or not.
When queries are checked for syntax correctness and parsed into plans, they are
sent to the query optimizer to generate the most cost-effective plan. Multi-query
optimizers traverse the different query plans (DAGs or AND-OR DAGs) and identify
sub-expressions among which sharing can be done. Sharing of node outcomes
saves memory, disk access and computation costs, hence lowering the global resource
requirements for the execution of the queries.
Though the search for sub-expressions to be shared takes some resources (time
and processor cycles), if a good search strategy is used, the savings exceed the
searching cost, hence a global advantage. Our aim therefore is to keep the
cost of searching as low as possible while finding as many common sub-expressions
as possible so as to yield maximum benefits.
Generally, there are three cases in which sharing is possible. Considering an
AND-OR DAG, if W is any equivalence node of a plan, then sharing is possible
when:
(i) Nodes produce exactly the same results, e.g. s1 = σx=a(W) and s2 = σx=a(W).
In this case, only one is executed and the other just uses the output of the
executed node.
(ii) The output of one node is an exact superset of the other's. For example, if we
have nodes s3 = σx≤6(W) and s4 = σx≤10(W), then s4 is executed and s3 is
derived from it, i.e. s3 = σx≤6(s4).
(iii) Nodes retrieve data from the same (intermediate) relation or view but on
different instances of the outermost constraint. For example, if we have s5 =
σx=10(W) and s6 = σx=15(W), then a different node s7 = σx=10∨x=15(W) is
created so that the two are derived from it, i.e. s5 = σx=10(s7) and s6 =
σx=15(s7).
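The three cases above can be illustrated with a toy classifier for selections on the same equivalence node W. The predicate encoding and the function name are assumptions for illustration only; real predicates are far richer.

```python
def sharing_case(p1, p2):
    """Classify the sharing relationship between two selection predicates
    on the same input W. A predicate is ("<=", c) for a range selection
    or ("=", v) for a point selection."""
    (op1, v1), (op2, v2) = p1, p2
    if p1 == p2:
        # case (i): identical results - execute one, reuse its output
        return "case (i)"
    if op1 == "<=" and op2 == "<=":
        # case (ii): the wider range is a superset; derive the tighter
        # selection from the wider result
        return "case (ii)"
    if op1 == "=" and op2 == "=":
        # case (iii): create s7 = sigma_{x=v1 or x=v2}(W), derive both from it
        return "case (iii)"
    return "no sharing"
```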
The efficiency of the multi-query optimizer does not depend on how aggressively
common sub-expressions are looked for but rather on the search strategy [20]. The
search strategy may need to exploit issues like:
i The ability to tell that there are no more sub-expressions common between
a pair of queries without having to exhaustively search all the available sub-
expressions.
ii The ability to choose the order in which the queries in a batch [8, 20] should
be processed so as to have a cost-effective process.
iii The ability to identify which nodes to materialize or pipeline [16], and the
order in which those to pipeline should be assembled [8].
iv The ability to mix materialization and pipelining for specific sub-expressions
so as to optimize the use of cache.
Given that multi-query optimization takes place on many complex queries with many
relations, comparing sub-expressions exhaustively leads to too many comparisons,
hence a high comparison cost and time. Our aim is therefore to establish the extent
of sharing without necessarily traversing all the nodes. The output is to be used
while optimizing. If we know that the queries have no node in common, we do not
attempt to exploit similarity that is not there.
3.2 The Shareability Problem
While searching for the sharable sub-expressions between queries, we are guided by
the three cases in which sub-expressions can be reused. If the sub-expressions
are exactly the same, then we can use any one of them in all the instances where it
occurs. If that is not the case, and one of the sub-expressions is a superset of the
rest, we then need to decide which sub-expression should actually be executed and
which sub-expression is to be derived from it. Moreover, if we have cases where the
sub-expressions are not subsets of equivalent nodes, we then have to create another
sub-expression so that the rest are derived from it. For ease of identification, once
a pair of queries has any such sub-expressions, we say they are sharable.
Since the information as to whether or not the queries are sharable is to be used
during the actual optimization, we need to establish a way of summarizing the
outcome. We therefore introduce some terminology to be used in this research.
(a) Query-Sharing Matrix
For n queries in a batch, a query-sharing matrix M is an n×n matrix with
integral entries in which the entry M[i,j] shows the number of sub-expressions
sharable between the ith and the jth query.
(b) Query Popularity
This is the number of instances in which the query plan sub-expressions
are found to be sharable with other queries in the batch. It should be
noted that popularity may partially arise from intra-query sharing opportunities.
For a given query-sharing matrix M, the popularity of the ith query is
Σk∈[1,n] M[k, i] = Σk∈[1,n] M[i, k].
(c) Order of a Node
This is the number of (not necessarily distinct) relations, with at least one
relational operation, that make up a node. It should be noted that if there is no
relational operation (a whole relation), then the order is zero.
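The terminology above can be illustrated with a small sketch; the list-of-lists representation of M is an assumption made purely for illustration.

```python
def popularity(M, i):
    """Popularity of query i: the row sum of the query-sharing matrix,
    which equals the column sum since M is symmetric by construction."""
    return sum(M[i])

# A 3-query example: Q2 shares 2 nodes with Q1, 1 within itself
# (intra-query sharing) and 3 with Q3.
M = [[0, 2, 0],
     [2, 1, 3],
     [0, 3, 0]]
```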
Our aim therefore is to establish the extent to which sub-expressions of any two
queries are sharable. This helps in deciding whether or not an attempt to look for
sharing opportunities between any two queries should proceed.
3.3 Previous Research
The need to know the number of times a sub-expression is used in a query batch
is evidently of paramount importance. Research on multi-query optimization
[1, 7, 16, 20, 21] employs it. However, none of it makes a deliberate attempt to
establish it with any accuracy.
The approach of the Volcano-SH optimizer [20] is to count the number of parents
a node has; this is taken as an underestimate of the number of times a sub-expression
is used. The parents exist in the parent DAG of the query plan, so
inter-query sharing is not attempted at all. Moreover, the underestimate is not
accurate. In some cases it is actually greater than the actual number of times a
node is used. For example, consider a DAG in which node a is used twice to
produce node b, which is used thrice to produce node c (c has two parents, and c
together with its parents are all used once). The number of times a is used is 6.
Using the underestimate of Roy et al [20], we come up with 4. Since the
underestimate is less than the actual value, it does not pose any decision contradiction.
However, assuming all nodes were used once, the number of times a is used is 1,
yet the underestimate remains 4. This is dangerous since the underestimate
may permit materialization yet the actual value forbids it. In this example,
since a appears only once, the decision to materialize it should not come up at all.
Moreover, this approximation uses a single DAG; it does not consider cases of
inter-query sharing.
In Volcano-RU [20], a node is chosen for materialization if it would cause
savings if reused once. This implies that all nodes that can cause a saving are
materialized. But not all nodes that would cause savings when
reused once actually occur more than once. Therefore, some nodes are chosen for
materialization yet they do not occur multiple times. Furthermore, a node may
occur, say, four times and cause savings on materialization, yet it would not cause
savings if it occurred twice. This implies that on top of Volcano-RU choosing nodes
that appear once for materialization, it may leave out nodes that are able to cause
savings, only because a more accurate approximation of their frequency is not known.
Based on these weaknesses and the importance of a more accurate approximation of
the frequency with which a node is to be used, we establish the extent of inter-query
sharing before we proceed to optimize.
3.4 The Greedy Search Algorithm
In the greedy search algorithm, we compare two query plans at a go, and for each
pair of nodes, we establish whether or not they are sharable. If they are sharable,
the appropriate entry in the query-sharing matrix is incremented. Let us consider
a situation where we have a batch of n queries Q1, Q2, · · ·, Qn. For any query Qi,
the set of nodes in the Volcano optimal plan can be obtained by the method proposed
by Graefe and McKenna [7]. The input, as in Volcano-SH [20], is a set of Volcano
best plans. Let Si be the set of equivalence nodes in the plan of Qi. The greedy
algorithm checks nodes pairwise and establishes whether sharing is possible.
If it is possible for any queries Qi and Qj, the query popularities and M[i,j] are
incremented. This is done until all sets are exhausted.
The Proposed Greedy Search Algorithm:

for (i=1; i<=n; i++)
    S = set of nodes in the ith plan
    for (j=1; j<=n; j++)
        P = set of nodes in the jth plan
        for each node a in S
            for each node b in P
                if (Sharable(a,b))
                    increment M[i,j]
                endif
            endfor
        endfor
    endfor
endfor
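The greedy search above can be transcribed directly into a short sketch. The plans-as-node-lists representation and the sharable predicate are assumptions standing in for the optimizer's internals.

```python
def greedy_search(plans, sharable):
    """Fill the full n x n query-sharing matrix by comparing every node of
    every plan against every node of every plan, exactly as in the listing."""
    n = len(plans)
    M = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for a in plans[i]:
                for b in plans[j]:
                    if sharable(a, b):
                        M[i][j] += 1
    return M
```

The four nested loops make the quadratic blow-up discussed next easy to see: every node is compared against every node, including itself and already-checked pairs.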
Though this algorithm is simple, it is too greedy to be feasible, because it compares
nodes indiscriminately and therefore performs too many comparisons.
Assuming a query Qi has ki nodes in its Volcano best plan and we consider the
first node of the Q1 plan, it makes k1 − 1 intra-query comparisons, k2 comparisons
against Q2, and so on up to kn comparisons against Qn. This node alone therefore
undergoes (Σi ki) − 1 comparisons, while the whole Q1 plan undergoes
k1 × ((Σi ki) − 1) comparisons. The whole batch needs
(Σi ki) × ((Σi ki) − 1) comparisons, which is prohibitively many.
We therefore propose improvements to the greedy algorithm so as to have a search
that is less expensive in both processor resources and time.
3.5 Improvements to the Greedy Algorithm
Though the greedy algorithm is able to find all the sharable nodes, we need
to make some improvements to it so as to minimize the search cost. In fact, the
search cost can be reduced without interfering with the output. This is done by adding
some intelligence to the search algorithm so that some decisions can be deduced.
The following observations can be noted, and improvements to address them can
be made without substantially affecting the cost-effectiveness of the scheme.
a. Elimination of duplicate searches
The output of the search between Qi and Qj is equivalent to that between Qj
and Qi, since query shareability is independent of the order in which the
queries are checked. We therefore need to make sure that once such a pair is
compared, it is never compared again. To ensure that,
we compare Qi and Qj if and only if i ≤ j.
b. Search by node order
From the conditions that have to be satisfied before query sharing takes place,
sharing can take place between nodes of different queries but strictly of the
same order. We therefore group nodes by order, and for each node we
search for shareability within the same order only. This reduces the sample space
for each node when the search for sharable sub-expressions takes place.
c. Null sharing prediction
Moving up the orders makes the nodes more specific. If for a pair of queries
there is no sharing between nodes of order m, then there is no sharing between
nodes of order n where n > m. We therefore terminate the search in higher
orders for the query pair as soon as we finish the nodes of a certain order and
get no sharing opportunity.
d. Zero-order tip
Relations (zero-order nodes) are too broad to be considered for sharing. We
only use them to find whether any two queries are disjoint. If we get a sharing
opportunity at this order, we do not update M and proceed to the next order (order
one). If, however, no zero-order nodes are sharable, then the queries are
disjoint and we need not continue searching for shareability in higher orders.
3.6 The Improved Greedy Searching Algorithm
We now present the enhanced greedy algorithm, which takes the observations above
into consideration so as to have a more optimal search. It outputs the
sharing matrix.
for (i=1; i<=n; i++)
    for (j=i; j<=n; j++)
        ORDER = 0
        repeat
            nextOrderPermission = false
            Si = set of nodes of order ORDER in query i
            Sj = set of nodes of order ORDER in query j
            for each node1 in Si
                for each node2 in Sj
                    if (node1 and node2 are sharable)
                        nextOrderPermission = true
                        if (ORDER = 0)
                            break out of the two inner loops
                        else
                            increment M[i,j] and M[j,i]
                            mark node1 and node2
                        endif
                    endif
                endfor
            endfor
            ORDER = next(ORDER)
        until (nextOrderPermission = false or orders are exhausted)
    endfor
endfor
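The improved search can be sketched as follows, folding in observations (a)-(d): plans are given as dictionaries mapping a node order to the nodes of that order, pairs are compared only for i ≤ j, zero-order hits merely grant permission to continue, and a pair's search stops at the first order with no sharing. The representation and the sharable predicate are assumptions for illustration.

```python
def improved_search(plans, sharable, max_order):
    """Fill the query-sharing matrix using the order-grouped search."""
    n = len(plans)
    M = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):                 # (a) skip duplicate pairs
            for order in range(max_order + 1):  # (b) compare equal orders only
                found = False
                for a in plans[i].get(order, []):
                    for b in plans[j].get(order, []):
                        if sharable(a, b):
                            found = True
                            # (d) zero-order hits only grant permission;
                            # they are not counted in M
                            if order > 0:
                                M[i][j] += 1
                                if i != j:
                                    M[j][i] += 1
                if not found:
                    break  # (c) no sharing at this order => none higher up
    return M
```

A first zero-order hit could break out of the two inner loops immediately, as in the listing; that early exit is omitted here for clarity.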
The algorithm traverses the pseudo-rooted DAG and gives a summary of the extent
of sharing between any two queries. Using this information, we can deduce the
popularity.
The matrix acts as a guide to establish which queries have sub-expressions
in common and which queries are disjoint. With such information, searching for
sharable nodes in disjoint queries can be dispensed with, targeting the search
efforts at cases where there is a high chance of making cost savings. However, the
search does not take into consideration cases where multiple use of a parent results
in multiple use of its children. Much as the algorithm can tell which query
shares nodes with which query, it cannot tell how many times a node will be used.
The entries in M therefore represent the number of times a node or its sharing
partners appear, not the number of times they are used.
Chapter 4
Optimizing the Traversed Plans
4.1 Background
From the very principle of multi-query optimization, the optimizer has to look for
common sub-expressions, or any sub-expressions where cooperation at execution
time can lead to cost savings. Tests are made to establish whether costs will really be
saved. If the tests are promising, such a sub-expression or group of sub-expressions
is explored.
In Chapter 3, we were able to traverse the query plans using a greedy but
intelligent algorithm, and the output was put in a query-sharing matrix M.
Consider the general query-sharing matrix M below:
M =
| m11  m12  m13  m14  · · ·  m1n |
| m21  m22  m23  m24  · · ·  m2n |
|  .    .    .    .          .   |
| mn1  mn2  mn3  mn4  · · ·  mnn |
the entry M[i, j] = mij is an integer that shows how many sub-expressions (equivalence
nodes in AND-OR DAGs) are sharable between queries Qi and Qj. This
is the same as the value of M[j, i]. If we sum up the entries in a column (or a row),
we get the total number of instances in which nodes in the plan have sharable
partners. This is called the query popularity. All nodes that have partners are marked so
that the optimizer can identify which nodes necessitate checking other plans while
searching for common sub-expressions and which nodes do not. This helps in
eliminating null searches, hence a more efficient strategy. After identifying the extent of
sharing among the queries, we are able to tell which pair or pairs of queries have
nothing in common. For such queries, we do not need to search for common sub-
expressions since they do not exist. Likewise, we are able to tell, for each query,
which other queries in the batch share nodes with it and to what extent. When we
start searching for common sub-expressions, we start with one query and search for
sharable sub-expressions in the other plans in the DAG. The query plan at the
center of the searching process is called the focal plan. Since the plans are
already traversed at the searching stage, we have nodes which are marked (those
with sharable sub-expressions elsewhere) and those which are not. It is only for
marked nodes that we search other plans for sharable nodes. It should be noted
that a node being marked does not mean that its partners will be searched for in
all plans. Since the query-sharing matrix has a summary of which query shares
with which, only plans with non-zero entries in the row/column representing the
extent of sharing with the focal plan are checked. The algorithm therefore searches
only in optimistic cases.
In this chapter, we use the information in the query-sharing matrix to guide us
so as to exploit the sharing opportunities among the queries that make up a batch.
4.2 Related Work
Looking for common sub-expressions in a query batch has been done in most multi-
query optimization algorithms, though following different approaches and principles.
The algorithms [7, 20], however, do not seem to make enough preparations to exploit
them to the full.
4.2.1 The Basic Volcano Algorithm
This was proposed by Graefe and McKenna [7] as a reaction to the previously
proposed Exodus optimizer [6]. It uses a DAG as the representation of the query plans.
It has a problem of extensibility, since AND-OR DAGs are easier to extend than
DAGs [21].
The Basic Volcano algorithm materializes all nodes that appear more than once.
This brings in the problem that not all nodes that appear more than once cause savings
when materialized. As observed in [20], for some nodes it is cheaper to recompute
them than to materialize and reuse them. This is because materialization involves writing
to and reading from disk, which is costly. The Basic Volcano lays a foundation for cost-
effective reuse but:
i It does not establish the cost effectiveness of a candidate node before choosing
it for materialization.
ii Its search is exhaustive; the optimizer therefore incurs a high search cost, which
has a negative impact on the overall cost effectiveness of the query processor.
The Basic Volcano algorithm therefore incurs a lot of costs in searching for
sharable sub-expressions, which may render it inefficient, especially for large (and
therefore complex) queries.
4.2.2 The Volcano-SH
Volcano-SH [20] is an extension of the Basic Volcano algorithm in that it uses
the Basic Volcano optimal plans as input. Volcano-SH computes the cost of
each node and decides whether or not it is cost effective to materialize it. This is
done by weighing a scenario of materialization and reuse against recomputation.
Suppose we have an equivalence node e with the following characteristics:
Number of times it is to be used = numuses(e)
Cost of computing the node = cost(e)
Cost of materializing the node = matcost(e)
Cost of reusing the node = reusecost(e)
A decision has to be made whether to materialize and reuse the node or to recompute
the node whenever it is needed. If the node is always computed from the database,
the cost incurred would be
cost(e) × numuses(e)
and if the node is computed once, then materialized so that it is subsequently
just reused, the cost incurred would be
cost(e) + matcost(e) + reusecost(e) × (numuses(e) − 1).
Materialization is cost effective if
cost(e) + matcost(e) + reusecost(e) × (numuses(e) − 1) < cost(e) × numuses(e)   (4.1)
or, more simply,
reusecost(e) + matcost(e)/(numuses(e) − 1) < cost(e) [20]   (4.2)
Volcano-SH therefore decides whether or not to materialize depending on the cost
effectiveness of the scheme.
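Inequality (4.2) can be expressed as a small predicate; this is a sketch, with the cost figures as plain numbers supplied by the caller.

```python
def worth_materializing(cost_e, matcost_e, reusecost_e, numuses_e):
    """Volcano-SH materialization test: materialize e when reusing it beats
    recomputing it, i.e. inequality (4.2)."""
    if numuses_e <= 1:
        return False  # a node used once can never repay materialization
    return reusecost_e + matcost_e / (numuses_e - 1) < cost_e
```

For example, with cost 10, materialization cost 4, reuse cost 1 and three uses, the test gives 1 + 4/2 = 3 < 10, so materialization pays off; expanding it back gives 10 + 4 + 1×2 < 10×3, which agrees term by term with inequality (4.1).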
Volcano-SH traverses the DAG from the leaves towards the root. Since the
cost of a node is computed from its children, the cost of a node can be
accurately established as the algorithm traverses the DAG. The number of times a
node is used, however, depends on the materialization status of its parents. Since a
node is reached before its parents are reached, this cannot be easily established. Roy
et al [20] use an underestimate numuses⁻(e), obtained by counting the number
of parents of the node. The condition for materialization is therefore modified to
reusecost(e) + matcost(e)/(numuses⁻(e) − 1) < cost(e)   (4.3)
Volcano-SH has the advantage of eliminating blind materialization, hence saving
more resources. It however has some shortcomings that need to be addressed:
i Inter-query Sharing Elimination
Its estimate of the frequency with which a node appears takes no account of
other plans in the pseudo-rooted DAG, yet inter-query sharing makes a lot of
savings on resources.
ii Accuracy of Estimates
Its estimate is inaccurate. In fact, in some cases the underestimate is higher
than the actual value, which may lead to wrong decisions.
iii Order of Processing
It does not attempt to exploit the inter-query extents of similarity. This
makes it unable to decide on the optimal order in which the queries should be
processed.
iv DAG Trimming
It does not trim already-catered-for nodes, so the search works on a fixed
sample space, leading to wasted search effort.
4.2.3 The Volcano-RU
Unlike Volcano-SH [20], Volcano-RU [20] does not take in the Basic Volcano
outputs, and neither does it attempt to establish the number of times a node is
used. It optimizes one query at a go, and any node (whether on the optimal
plan or not) that would cause savings if reused once is chosen for materialization.
The subsequent queries are optimized taking into consideration the fact that some
nodes are already materialized. Its strength lies in exploiting shareability beyond
the Basic Volcano optimal plans. Given its approach, the order of optimization is of
paramount importance, since the nodes to be materialized depend on which queries
have been optimized so far.
It however has the following weaknesses:
i Order of Processing
It does not go into details of establishing the exact optimal order of optimization.
Roy et al [20] propose that after optimizing in a specific order, we
optimize in the reverse order and the cheaper option is chosen. However, since
the first order was arbitrary, a more optimal order is very likely to exist.
Attempting to randomly choose other orders while searching for the optimal
order wastes time [20].
ii Over-Materialization
It also has a problem of excessive materialization, since not all nodes that would
cause savings if reused once actually occur more than once. The excessive
materialization leads to further costs incurred at materialization, hence a more
costly query processor.
iii Inaccurate Materialization Criteria
In a DAG, some nodes appear more than twice and cause savings, yet they
would not make the savings if they appeared only twice. Such nodes are left out,
since the criterion only considers those which would cause savings when reused
once. The materialization may therefore be insufficient.
4.3 The Proposed Optimizing Algorithm
4.3.1 The Background
In designing this algorithm, we attempt to address the weaknesses identified in the
Basic Volcano, Volcano-SH and Volcano-RU. It is on these addressed weaknesses,
coupled with the strengths already exhibited by those algorithms, that we lay the
foundation for the new algorithm.
a. Optimization order
Volcano-RU [20] acknowledges the role of the order of optimization since,
by its philosophy, the content of the materialized set is of high importance.
Though this algorithm, unlike Volcano-RU, uses the Basic Volcano optimal
plans as inputs, it uses order in another way. We identify the plan with
the highest popularity and establish its details (node costs and frequencies).
The common sub-expressions are then searched for in the rest of the queries,
following the preference list of the focal plan as in the stable marriage problem.
If, according to the sharing matrix M, a certain plan has no node to share with
the focal plan, we do not search it.
b. Sample space management
If we are to reduce the sample space without affecting the output and the optimal
plan, we minimize the optimization cost in the algorithm. If we identify more
than one node to be sharable, a decision is made as to which of the nodes
will remain in the plan and which (the rest) are to be removed from it. The
removal is done such that the plans are supplied with the output of the
retained node, so the output of the removed nodes is compensated for. In an
AND-OR DAG, the equivalence node will simply have multiple parents. This reduces
the sample space for the subsequent searches without affecting the output of
the query batch. It also saves the cost of recomputation.
c. Estimate of numuses(e)
When a focal plan is identified, the cost and the number of times each node is used
in the plan (its count) are established. If a node is marked, then it has
sharable partners in other plans. The sharable nodes are searched for in
the plans that share at least a node with the focal plan. Each time a sharable
node is found, the count is incremented by one, and it is assumed to be of the
same cost as the node in the focal plan. The materialization test used
in Volcano-SH is then applied using the estimated cost and the node
count in the pseudo-rooted plan.
d. Direction of optimization
In this algorithm, we choose nodes to materialize from the high-order nodes
downwards. In this way, if several nodes are sharable and all but one are to
be removed, doing so at high orders first reduces the sample space of subsequent
searches and saves the cost of computing already-catered-for nodes.
e. Elimination of repetitive search
We already established that it is not worthwhile to search between Qi and Qj
and then between Qj and Qi, as this is a repetition. Since we search plans in
decreasing order of their popularity, any plan is searched only if the focal plan has
a higher popularity. This eliminates duplicate searches that are already catered
for by previous searches.
f. Disjoint plan identification
If a plan has zero popularity, then it does not participate in the inter-query search
since it is unique. This saves the resources that would be spent searching
for non-existent partner nodes.
4.3.2 The Optimizing Algorithm
The new algorithm inputs the DAG made up of the Basic Volcano plans for each
query. The inter-query shareability information and the individual query popularities
are obtained from the query-sharing matrix M. In this algorithm, plans are assigned focal
roles in decreasing order of their popularity. Searching for sharable partners
for any marked node in a focal plan is done in the stable-marriage preference order
over the less popular plans. Searching starts from higher-order nodes, and plans of
zero popularity are not assigned focal roles.
S = set of plans that make up the virtual DAG, in decreasing
        order of their popularity
focalPlan = first plan in S
repeat
    establish node cost and numuses for each node of focalPlan
    S* = subset of S whose plans share at least a node with focalPlan
    candidateOrder = highest order in focalPlan
    repeat
        for each node e of order candidateOrder in focalPlan
            if (e is marked)
                repeat
                    traverse S* searching for marked equivalence nodes
                        of the same order as, and sharable with, e
                    increment numuses(e) whenever a sharable node is met
                until (S* is exhausted)
                if (materializationCondition(e))
                    choose which node is to be materialized and add it
                        to the materializable set
                    remove the rest
                    update the DAG so that the materialized node caters
                        for the removed nodes' parents
                    unmark the chosen node
                endif
            else
                if (materializationCondition(e))
                    add e to the materializable set
                endif
            endif
        endfor
        candidateOrder = next lower order
    until (orders in focalPlan are exhausted)
    focalPlan = next(focalPlan)
until (plans of non-zero popularity are exhausted)
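The algorithm above can be condensed into a sketch that shows the focal-role ordering, the null-search elimination and the numuses estimation. The helper callables (in_plan_count, partners_in, worth_materializing) are hypothetical stand-ins for the optimizer's internals, and DAG trimming is omitted for brevity.

```python
def optimize(plans, popularity, M, in_plan_count, partners_in,
             worth_materializing):
    """Assign focal roles in decreasing popularity; estimate numuses(e) as
    the in-plan count plus one per sharable partner found in plans that
    share with the focal plan; apply the materialization test."""
    materialized = set()
    order = sorted(range(len(plans)), key=lambda i: -popularity[i])
    for i in order:
        if popularity[i] == 0:
            continue                    # disjoint plans take no focal role
        for e in plans[i]:              # high orders first in the full scheme
            numuses = in_plan_count(i, e)
            for j in order:
                if j != i and M[i][j] > 0:   # search only optimistic cases
                    numuses += partners_in(j, e)
            if worth_materializing(e, numuses):
                materialized.add(e)
    return materialized
```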
4.3.3 Benefits of the new Algorithm
1. Better estimation of numuses⁻(e)
The algorithm uses the exact number of times a node occurs in the focal plan
and increments it by one whenever a sharable node is found outside the focal
plan. The actual number of uses may be greater if the non-focal-plan nodes are
used multiple times in their parent plans. Therefore numuses⁻(e) ≤ numuses(e)
for all e.
2. Elimination of null searches
The sharing matrix has details of the extent of sharing for any pair of plans.
If the entry for a pair is zero, then we need not search for shareability
between them. If the popularity of a certain query is zero, then its plan is not
involved in the search at all.
3. DAG trimming
If we have, say, three sharable nodes of order five, and it is decided
that one is to be materialized and the rest of the plans use it, then it is not
worthwhile to process the other nodes, since the ultimate goal is to get the
root of the tree. The algorithm removes the sub-DAGs whose output can
be obtained from common/sharable nodes, so that such children do not enter the
optimization process since their output is catered for.
4. Optimal order of optimization
Since the strategy eliminates catered-for sub-DAGs, it is better if it does so
as early as possible so that the subsequent search space is reduced without
affecting the outcome. Starting with the most popular query does this. It
saves time, memory and processor cycles.
Chapter 5
Discussion and Future Work
5.1 Discussion
In this dissertation, we studied the strategies and approaches to query optimization.
The duty of the query optimizer is to establish the most cost-effective execution
plan of a query. This has to be done within several limitations, notable among
which are:
(a) Time.
The process should be timely. Exhaustive searches that take a lot of time have
to either be eliminated or improved to make intelligent deductions so that they
have a low runtime.
(b) Form and Content.
The transformation should have no effect whatsoever on the form and content
of the request made by the user.
(c) Net Savings.
The optimization process should not be a mere transfer of resources from
executing a complex form of a query to searching for a cheap form of the
query and executing it. It has to make substantial savings.
We therefore examined existing approaches and studied how they order queries for
optimization, how they optimize, and how they exploit the geometry of query plan
representations (trees, DAGs and AND-OR DAGs) to make the scheme more cost
effective.
We Then Proposed a greedy search algorithm that traverses the composite plan in
search for common sub-expressions. We used the geometry of the plan representation
(AND-OR DAG) to propose improvements on the greedy algorithm so that the
searching is more intelligent and therefore searches equally exhaustively but with
fewer operations and in less time. We sumerised the sharability in a query sharing
matrix.
We also proposed an optimization algorithm that exploits the sharability extents
in an optimal order to minimize null searches for common sub-expressions. This way
we reduce run time and increase efficiency. We also proposed trimming catered-for
sub-DAGs so as to eliminate catered-for nodes and reduce the search space of
subsequent searches.
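The trimming step can be sketched as follows, under simplifying assumptions: the composite DAG is a mapping from each node to its child nodes, and once a node is catered for by a shared result, it and everything reachable from it are dropped from subsequent searches. The node names are illustrative only.

```python
# Hedged sketch of trimming a catered-for sub-DAG: remove the node and
# all its descendants, then drop dangling edges from surviving nodes.

def trim_catered_for(dag, node):
    """Return a new DAG with `node` and all its descendants removed."""
    removed, stack = set(), [node]
    while stack:
        current = stack.pop()
        if current in removed:
            continue
        removed.add(current)
        stack.extend(dag.get(current, []))
    return {n: [c for c in children if c not in removed]
            for n, children in dag.items() if n not in removed}

dag = {
    "join(A,B,C)": ["join(A,B)", "scan(C)"],
    "join(A,B)": ["scan(A)", "scan(B)"],
    "scan(A)": [], "scan(B)": [], "scan(C)": [],
}
print(sorted(trim_catered_for(dag, "join(A,B)")))  # ['join(A,B,C)', 'scan(C)']
```

Note that this naive version removes every descendant unconditionally, which is exactly the over-trimming behaviour revisited in the future-work discussion below.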
Lastly, we compared the proposed optimization schemes with the existing ones.
5.2 Future Work
This research suggests some future research work in the field of Query
Optimization:
(a) Maintenance of the Sharing Matrix.
The sharing matrix is used to get the order of assigning focal roles among the
plans during optimization, and for each focal plan, we get the order in which
the plan should exploit sharability. We however trim plans whenever we find
nodes catered for due to sharability. This trimming takes away all the children
of the catered-for node. Some of these children may be sharable elsewhere,
hence removing them leads to an exaggerated sharing matrix. There is need
for research on how the matrix can be cost-effectively updated whenever we
trim the composite DAG.
(b) Incorporating Pipelining.
Multi-query optimization is materialization intensive, and the current schedules
of materialization use DAGs rather than AND-OR DAGs. Besides, they are too
strict on the qualifications for pipelining. An AND-OR DAG can use the equiv-
alence nodes as staging areas for intermediate results and hence can work on
a less strict schedule. This may however fill the memory. There is therefore
a need to incorporate pipelining and memory management in AND-OR DAG
structured plans so that materialization costs are reduced.
(c) Numerical Evaluation.
The evaluation of the proposed algorithms in this dissertation is not based
on implementation but on comparison at the algorithm level. There is need
to code the proposed algorithms and run them alongside the existing ones
(such as Basic Volcano, Volcano-RU and Volcano-SH) on similar data and similar
hardware. This will give a runtime comparison with the existing schemes.
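The maintenance problem raised in point (a) can be made concrete with a small sketch. Everything here is hypothetical: plans are modelled as sets of sub-expression signatures, and when trimming removes signatures from the focal plan, each other plan that shares a removed signature must have its matrix entry decremented, otherwise the sharing matrix overstates the remaining sharability.

```python
# Illustrative sketch of the open problem: keeping the sharing matrix
# consistent when trimming removes sub-expressions from the focal plan i.

def update_matrix_on_trim(matrix, plans, i, removed):
    """Decrement sharing counts for sub-expressions trimmed from plan i."""
    for j, plan in enumerate(plans):
        if j == i:
            continue
        # Only sub-expressions that plan i actually held, that were
        # removed, and that plan j also holds, affect the count.
        overlap = len(plans[i] & removed & plan)
        matrix[i][j] -= overlap
        matrix[j][i] -= overlap
    plans[i] = plans[i] - removed

plans = [{"a", "b", "c"}, {"a", "b"}, {"c"}]
matrix = [[0, 2, 1],
          [2, 0, 0],
          [1, 0, 0]]
update_matrix_on_trim(matrix, plans, 0, {"b"})
print(matrix)  # entry (0,1) drops from 2 to 1
```

A cost-effective version would avoid rescanning every plan on each trim, which is precisely the open question.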
Appendix A: Paper