
SPRING 2004 CENG 352 1

Query Evaluation

Chapters 12, 14


Basic Steps in Query Processing

1. Parsing and translation
2. Optimization
3. Evaluation


Basic Steps in Query Processing (Cont.)

• Parsing and translation
  – Translate the query into its internal form. This is then translated into relational algebra.
  – Parser checks syntax, verifies relations.
• Evaluation
  – The query-execution engine takes a query-evaluation plan, executes that plan, and returns the answers to the query.


Basic Steps: Optimization

• A relational algebra expression may have many equivalent expressions.
  – E.g., σ_{balance<2500}(π_{balance}(account)) is equivalent to π_{balance}(σ_{balance<2500}(account)).
• Each relational algebra operation can be evaluated using one of several different algorithms.
  – Correspondingly, a relational-algebra expression can be evaluated in many ways.
• An annotated expression specifying a detailed evaluation strategy is called an evaluation plan.
  – E.g., can use an index on balance to find accounts with balance < 2500,
  – or can perform a complete relation scan and discard accounts with balance ≥ 2500.
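The equivalence of the two expressions can be illustrated on a toy in-memory relation. This is only a sketch: the `account` rows and the function names are made up for illustration, and Python lists stand in for relations.

```python
# Toy relation: (account_number, balance). The rows are invented for illustration.
account = [("A-101", 500), ("A-102", 2500), ("A-215", 700), ("A-305", 3500)]

def select_then_project(rows):
    """Apply the selection balance < 2500 first, then project onto balance."""
    selected = [row for row in rows if row[1] < 2500]
    return [balance for (_, balance) in selected]

def project_then_select(rows):
    """Project onto balance first, then apply the selection balance < 2500."""
    balances = [balance for (_, balance) in rows]
    return [balance for balance in balances if balance < 2500]

print(select_then_project(account))
print(project_then_select(account))
```

Both orders return the same balances; which one is cheaper depends on the algorithms chosen for each step, which is what an evaluation plan records.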


Basic Steps: Optimization (Cont.)

• Query optimization: Among all equivalent evaluation plans, choose the one with the lowest cost.
  – Cost is estimated using statistical information from the database catalog
    • e.g., number of tuples in each relation, size of tuples, etc.
• We first study
  – How to measure query costs
  – Algorithms for evaluating relational algebra operations
  – How to combine algorithms for individual operations in order to evaluate a complete expression
• Then
  – We study how to optimize queries, that is, how to find an evaluation plan with the lowest estimated cost


Measures of Query Cost

• Cost is generally measured as total elapsed time for answering a query.
  – Many factors contribute to time cost
    • disk accesses, CPU, or even network communication
• Typically disk access is the predominant cost, and is also relatively easy to estimate. It is measured by taking into account
  – Number of seeks * average-seek-cost
  – Number of blocks read * average-block-transfer-cost
  – Number of blocks written * average-block-transfer-cost
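The three terms above add up to a single disk-cost estimate. A minimal sketch, assuming the same average transfer cost applies to reads and writes (as the slide does); the function and parameter names are illustrative only.

```python
def disk_cost(seeks, blocks_read, blocks_written, avg_seek_cost, avg_transfer_cost):
    """Estimated disk cost: seek term plus read and write transfer terms."""
    seek_term = seeks * avg_seek_cost
    read_term = blocks_read * avg_transfer_cost
    write_term = blocks_written * avg_transfer_cost
    return seek_term + read_term + write_term

# Example with made-up unit costs: 2 seeks, 100 blocks read, 50 blocks written.
print(disk_cost(seeks=2, blocks_read=100, blocks_written=50,
                avg_seek_cost=10, avg_transfer_cost=1))
```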


Some Common Techniques

• Algorithms for evaluating relational operators use some simple ideas extensively:
  – Indexing: Can use WHERE conditions to retrieve a small set of tuples (selections, joins).
  – Iteration: Sometimes it is faster to scan all tuples even if there is an index. (And sometimes, we can scan the data entries in an index instead of the table itself.)
  – Partitioning: By using sorting or hashing, we can partition the input tuples and replace an expensive operation by similar operations on smaller inputs.


Statistics and Catalogs

• Need information about the relations and indexes involved. Catalogs typically contain at least:
  – # tuples (NTuples) and # pages (NPages) for each relation.
  – # distinct key values (NKeys) and NPages for each index.
  – Index height, low/high key values (Low/High) for each tree index.
• Catalogs are updated periodically.
  – Updating whenever data changes is too expensive; there is lots of approximation anyway, so slight inconsistency is OK.
• More detailed information (e.g., histograms of the values in some field) is sometimes stored.


Relational Operations

• We will consider how to implement:
  – Selection (σ): Selects a subset of rows from a relation.
  – Projection (π): Deletes unwanted columns from a relation.
  – Join (⋈): Allows us to combine two relations.
  – Set-difference (−): Tuples in reln. 1, but not in reln. 2.
  – Union (∪): Tuples in reln. 1 or in reln. 2.
  – Aggregation (SUM, MIN, etc.) and GROUP BY

• Since each op returns a relation, ops can be composed! After we cover the operations, we will discuss how to optimize queries formed by composing them.


Selection Operation

• File scan: search algorithms that locate and retrieve records that fulfill a selection condition.
• Algorithm A1 (linear search). Scan each file block and test all records to see whether they satisfy the selection condition.
  – Cost estimate (number of disk blocks scanned) = b_r
    • b_r denotes the number of blocks containing records from relation r
  – If the selection is on a key attribute, average cost = b_r / 2
    • stop on finding the record
  – Linear search can be applied regardless of
    • selection condition, or
    • ordering of records in the file, or
    • availability of indices


Selection Operation (Cont.)

• A2 (binary search). Applicable if the selection is an equality comparison on the attribute on which the file is ordered.
  – Assume that the blocks of a relation are stored contiguously.
  – Cost estimate (number of disk blocks to be scanned):
    • ⌈log2(b_r)⌉: cost of locating the first tuple by a binary search on the blocks
    • Plus the number of blocks containing records that satisfy the selection condition
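The A1 and A2 estimates can be written down directly. This is a sketch under the slides' assumptions only; the function names are illustrative, and `matching_blocks` follows the slide's wording literally (whether the block found by the search is counted again is left to the caller).

```python
import math

def linear_search_cost(b_r, key_attribute=False):
    """A1: scan all b_r blocks, or b_r / 2 on average when searching on a key."""
    return b_r / 2 if key_attribute else b_r

def binary_search_cost(b_r, matching_blocks):
    """A2: ceil(log2(b_r)) to locate the first tuple, plus blocks holding matches."""
    return math.ceil(math.log2(b_r)) + matching_blocks

# A relation stored in 1000 blocks.
print(linear_search_cost(1000))
print(linear_search_cost(1000, key_attribute=True))
print(binary_search_cost(1000, matching_blocks=0))
```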


Selections Using Indices

• Index scan: search algorithms that use an index
  – selection condition must be on the search-key of the index.
• A3 (primary index on candidate key, equality). Retrieve a single record that satisfies the corresponding equality condition.
  – Cost = HT_i + 1
• A4 (primary index on nonkey, equality). Retrieve multiple records.
  – Records will be on consecutive blocks.
  – Cost = HT_i + number of blocks containing retrieved records
• A5 (equality on search-key of secondary index).
  – Retrieve a single record if the search-key is a candidate key
    • Cost = HT_i + 1
  – Retrieve multiple records if the search-key is not a candidate key
    • Cost = HT_i + number of records retrieved
  – Can be very expensive!
    • each record may be on a different block
    • one block access for each retrieved record
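The A3 to A5 formulas can be compared side by side. A minimal sketch, assuming the symbol written HTi on the slide stands for the height (number of levels) of index i; the slide does not define it, and the function names are illustrative.

```python
def primary_index_key_cost(height):
    """A3: traverse the index, then read the single matching block."""
    return height + 1

def primary_index_nonkey_cost(height, matching_blocks):
    """A4: traverse the index, then read the consecutive blocks holding matches."""
    return height + matching_blocks

def secondary_index_cost(height, matching_records, candidate_key=False):
    """A5: one block access per retrieved record unless the key is a candidate key."""
    return height + (1 if candidate_key else matching_records)

# With a 3-level index and 100 matching records spread over 5 consecutive blocks:
print(primary_index_nonkey_cost(3, matching_blocks=5))
print(secondary_index_cost(3, matching_records=100))
```

The gap between the two printed values is the reason a secondary index on a non-key attribute "can be very expensive".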


Selections Involving Comparisons

• Can implement selections of the form σ_{A≤V}(r) or σ_{A≥V}(r) by using
  – a linear file scan or binary search,
  – or by using indices in the following ways:
• A6 (primary index, comparison). (Relation is sorted on A.)
  – For σ_{A≥V}(r), use the index to find the first tuple ≥ v and scan the relation sequentially from there.
  – For σ_{A≤V}(r), just scan the relation sequentially till the first tuple > v; do not use the index.
• A7 (secondary index, comparison).
  – For σ_{A≥V}(r), use the index to find the first index entry ≥ v and scan the index sequentially from there, to find pointers to records.
  – For σ_{A≤V}(r), just scan leaf pages of the index finding pointers to records, till the first entry > v.
  – In either case, retrieve the records that are pointed to
    • requires an I/O for each record
    • Linear file scan may be cheaper if many records are to be fetched!
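The A6 idea can be sketched on a sorted Python list, where `bisect_left` stands in for the index lookup. This is an illustration only; a real system works on disk blocks rather than an in-memory list, and the function names are made up.

```python
from bisect import bisect_left

def select_ge(sorted_values, v):
    """A >= v: locate the first value >= v (index lookup), then scan to the end."""
    start = bisect_left(sorted_values, v)
    return sorted_values[start:]

def select_le(sorted_values, v):
    """A <= v: scan from the beginning and stop at the first value > v; no index."""
    result = []
    for value in sorted_values:
        if value > v:
            break
        result.append(value)
    return result

values = [1, 3, 3, 5, 8]  # relation sorted on attribute A
print(select_ge(values, 3))
print(select_le(values, 3))
```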


Implementation of Complex Selections

• Conjunction: σ_{θ1 ∧ θ2 ∧ ... ∧ θn}(r)
• A8 (conjunctive selection using one index).
  – Select a combination of θi and algorithms A1 through A7 that results in the least cost for σ_{θi}(r).
  – Test other conditions on the tuple after fetching it into the memory buffer.
• A9 (conjunctive selection using multiple-key index).
  – Use an appropriate composite (multiple-key) index if available.
• A10 (conjunctive selection by intersection of identifiers).
  – Requires indices with record pointers.
  – Use the corresponding index for each condition, and take the intersection of all the obtained sets of record pointers.
  – Then fetch records from the file.
  – If some conditions do not have appropriate indices, apply the test in memory.


Algorithms for Complex Selections

• Disjunction: σ_{θ1 ∨ θ2 ∨ ... ∨ θn}(r)
• A11 (disjunctive selection by union of identifiers).
  – Applicable if all conditions have available indices.
    • Otherwise use linear scan.
  – Use the corresponding index for each condition, and take the union of all the obtained sets of record pointers.
  – Then fetch records from the file.
• Negation: σ_{¬θ}(r)
  – Use linear scan on the file.
  – If very few records satisfy ¬θ, and an index is applicable to θ
    • Find satisfying records using the index and fetch from the file.


Schema for Examples

Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)

• Similar to old schema; rname added for variations.
• Reserves:
  – Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
• Sailors:
  – Each tuple is 50 bytes long, 80 tuples per page, 500 pages.


Simple Selections

SELECT *
FROM Reserves R
WHERE R.rname = 'Joe'

• With no index, unsorted: Must essentially scan the whole relation; cost is 1000 I/Os (# pages in R).
• With no index, sorted data: Utilize the sort order on rname by doing a binary search to locate the first Joe. Cost is log2 1000 ≈ 10 I/Os.
• With a B+ tree index on the selection attribute: Use the index to find qualifying data entries, then retrieve the corresponding data records. Cost of finding the starting page is 2 or 3 I/Os; for a clustered index add one more I/O; for an unclustered index add one page per qualifying tuple.
• Hash index: 1 or 2 I/Os to retrieve the index pages. If there are 100 reservations by Joe, then an additional 1-100 disk accesses, depending on how these records are distributed.


Using an Index for Selections

SELECT *
FROM Reserves R
WHERE R.rname < 'C%'

• Cost depends on # qualifying tuples, and clustering.
• Assume we estimate that roughly 10% of Reserves tuples will be in the result (= 10,000 tuples, or 100 pages).
  – With a clustered index: cost is 100 I/Os + 1 or 2 disk accesses for the index.
  – With an unclustered index: cost could be as high as 10,000 I/Os in the worst case. (It might be cheaper to simply scan the entire relation.)
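The arithmetic behind these numbers can be checked with a short sketch. It counts data-page I/Os only (the 1 or 2 index accesses are left out), treats the unclustered case as the worst case of one I/O per qualifying tuple, and uses illustrative function and parameter names.

```python
def index_selection_data_io(n_tuples, tuples_per_page, percent_qualifying, clustered):
    """Data-page I/Os for an index-based selection (index accesses not included)."""
    qualifying = n_tuples * percent_qualifying // 100
    if clustered:
        # Qualifying tuples are stored together: read only the pages they fill.
        return -(-qualifying // tuples_per_page)  # integer ceiling division
    # Worst case for an unclustered index: every qualifying tuple is on its own page.
    return qualifying

# Reserves: 1000 pages * 100 tuples per page = 100,000 tuples; 10% qualify.
print(index_selection_data_io(100_000, 100, 10, clustered=True))
print(index_selection_data_io(100_000, 100, 10, clustered=False))
```

Since a full scan of Reserves costs 1000 I/Os, the unclustered worst case is ten times more expensive than ignoring the index.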


A Note on Complex Selections

• Selection conditions are first converted to conjunctive normal form (CNF). For example, the condition

  (day<8/9/94 AND rname='Paul') OR bid=5 OR sid=3

  is converted to

  (day<8/9/94 OR bid=5 OR sid=3) AND (rname='Paul' OR bid=5 OR sid=3)
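That the two forms are equivalent can be confirmed by brute force over all truth assignments. In this sketch each atomic term (day<8/9/94, rname='Paul', bid=5, sid=3) is replaced by a boolean variable; the function names are illustrative.

```python
from itertools import product

def original_form(day_ok, is_paul, bid_is_5, sid_is_3):
    """(day<8/9/94 AND rname='Paul') OR bid=5 OR sid=3"""
    return (day_ok and is_paul) or bid_is_5 or sid_is_3

def cnf_form(day_ok, is_paul, bid_is_5, sid_is_3):
    """(day<8/9/94 OR bid=5 OR sid=3) AND (rname='Paul' OR bid=5 OR sid=3)"""
    return (day_ok or bid_is_5 or sid_is_3) and (is_paul or bid_is_5 or sid_is_3)

# Check all 16 combinations of truth values for the four atomic terms.
equivalent = all(
    original_form(*values) == cnf_form(*values)
    for values in product([False, True], repeat=4)
)
print(equivalent)
```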


Two Approaches to General Selections

• Consider the selection condition:

  day<8/9/94 AND bid=5 AND sid=3

• First approach: Find the most selective access path, retrieve tuples using it, and apply any remaining terms that don’t match the index:
  1. A B+ tree index on day can be used; then, bid=5 and sid=3 must be checked for each retrieved tuple.
  2. Similarly, a hash index on <bid, sid> could be used; day<8/9/94 must then be checked.
  – Terms that match the index reduce the number of tuples retrieved; other terms are used to discard some retrieved tuples, but do not affect the number of tuples/pages fetched.


Intersection of Rids

• Second approach (if we have 2 or more matching indexes):
  – Get sets of rids of data records using each matching index.
  – Then intersect these sets of rids.
  – Retrieve the records and apply any remaining terms.
• For the given example condition:
  – If we have a B+ tree index on day and an index on sid, we can retrieve rids of records satisfying day<8/9/94 using the first, rids of records satisfying sid=3 using the second, intersect, retrieve records and check bid=5.
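The rid-intersection steps can be sketched with Python sets. Everything here is a stand-in: the `records` table is invented, dates are simplified to small integers (so `day < 10` plays the role of day<8/9/94), and `rids_from_index` imitates an index lookup by filtering in memory.

```python
# Hypothetical Reserves records keyed by rid; "day" is simplified to an integer.
records = {
    1: {"day": 5, "bid": 5, "sid": 3},
    2: {"day": 6, "bid": 7, "sid": 3},
    3: {"day": 20, "bid": 5, "sid": 3},
    4: {"day": 2, "bid": 5, "sid": 9},
}

def rids_from_index(predicate):
    """Stand-in for an index lookup: return the rids whose records match."""
    return {rid for rid, rec in records.items() if predicate(rec)}

def select_by_rid_intersection():
    day_rids = rids_from_index(lambda rec: rec["day"] < 10)   # index on day
    sid_rids = rids_from_index(lambda rec: rec["sid"] == 3)   # index on sid
    candidates = day_rids & sid_rids                          # intersect the rid sets
    # Only now fetch the records and apply the remaining term bid=5.
    return sorted(rid for rid in candidates if records[rid]["bid"] == 5)

print(select_by_rid_intersection())
```

Only the records in the intersection are fetched, so the term without a matching index (bid=5) is checked on far fewer tuples.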


The Projection Operation

• To implement projection we have to do the following:
  1. Remove unwanted attributes.
  2. Eliminate any duplicate tuples produced.
• The expensive part is removing duplicates.
  – SQL systems don’t remove duplicates unless the keyword DISTINCT is specified in a query.
• There are two basic algorithms:
  1. Sorting approach.
  2. Hashing approach.


Approach based on sorting

• Modify Pass 1 of external sort to eliminate unwanted fields. If B buffer pages are available, runs of about 2B pages can be produced, but tuples in runs are smaller than input tuples. (Size ratio depends on # and size of fields that are dropped.)

• Modify merging passes to eliminate duplicates. Thus, number of result tuples smaller than input. (Difference depends on # of duplicates.)


Example

SELECT DISTINCT R.sid, R.bid
FROM Reserves R

Cost:
• In Pass 1, read the original relation (1000 pages) and write out the same number of smaller tuples.
  – Assume that each smaller tuple is 10 bytes long.
  – Thus the cost of writing out the temporary relation is 250 I/Os.
• In merging passes, fewer tuples are written out in each pass.
  – Assuming we have 20 buffer pages, the temporary relation can be sorted in 2 passes.
  – In the first pass, 250 pages are written out as 7 runs about 40 pages long.
  – In the second pass we read the runs at a cost of 250 I/Os and merge them.
• The total cost is 1000 + 250 + 250 = 1500 I/Os.
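The numbers in this example can be reproduced with a small calculation. The sketch assumes runs of about 2B pages, a single merging pass, and (as in the example) does not count writing the final result; the function and parameter names are illustrative.

```python
import math

def sort_projection_cost(input_pages, tuple_bytes, projected_bytes, buffer_pages):
    """Return (temporary pages, number of runs, total I/Os) for sort-based projection."""
    # Projected tuples are smaller, so the temporary relation shrinks proportionally.
    temp_pages = math.ceil(input_pages * projected_bytes / tuple_bytes)
    # Pass 1 produces sorted runs of about 2B pages each.
    runs = math.ceil(temp_pages / (2 * buffer_pages))
    if runs > buffer_pages - 1:
        raise ValueError("more than one merging pass needed; not covered by this sketch")
    pass1_io = input_pages + temp_pages  # read the original, write the runs
    merge_io = temp_pages                # read the runs once while merging
    return temp_pages, runs, pass1_io + merge_io

# Reserves: 1000 pages of 40-byte tuples, projected to 10 bytes, 20 buffer pages.
print(sort_projection_cost(1000, 40, 10, 20))
```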


Projection Based on Hashing

• Partitioning phase: Read R using one input buffer. For each tuple, discard unwanted fields, apply hash function h1 to choose one of B-1 output buffers.
  – Result is B-1 partitions (of tuples with no unwanted fields). 2 tuples from different partitions are guaranteed to be distinct.
• Duplicate elimination phase: For each partition, read it and build an in-memory hash table, using hash fn h2 (≠ h1) on all fields, while discarding duplicates.
  – If a partition does not fit in memory, can apply the hash-based projection algorithm recursively to this partition.
• Cost: For partitioning, read R, write out each tuple, but with fewer fields. This is read in the next phase.
  – In our projection example this cost is 1000 + 2 * 250 = 1500 I/Os.
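The two phases can be sketched in memory. This is a simplification: partitions are Python lists rather than disk files, Python's built-in `hash` modulo the partition count plays the role of h1, and a `set` stands in for the in-memory hash table built with h2. All names are illustrative.

```python
def hash_project_distinct(rows, wanted, num_partitions):
    """Project rows onto the column positions in `wanted` and drop duplicates."""
    # Partitioning phase: drop unwanted fields, then route each tuple with h1.
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        projected = tuple(row[i] for i in wanted)
        partitions[hash(projected) % num_partitions].append(projected)

    # Duplicate elimination phase: handle one partition at a time.
    result = []
    for partition in partitions:
        seen = set()  # in-memory hash table for this partition
        for projected in partition:
            if projected not in seen:
                seen.add(projected)
                result.append(projected)
    return result

# Hypothetical Reserves tuples: (sid, bid, day, rname)
reserves = [(1, 101, "d1", "a"), (1, 101, "d2", "b"), (2, 101, "d1", "c")]
print(sorted(hash_project_distinct(reserves, wanted=(0, 1), num_partitions=3)))
```

Duplicates always hash to the same partition, which is why each partition can be de-duplicated independently.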


Discussion of Projection

• Sort-based approach is the standard; better handling of duplicate elimination and result is sorted.

• If an index on the relation contains all wanted attributes in its search key, can do an index-only scan.
  – Apply projection techniques to data entries (much smaller!)
• If an ordered (i.e., tree) index contains all wanted attributes as a prefix of the search key, can do even better:
  – Retrieve data entries in order (index-only scan), discard unwanted fields, compare adjacent tuples to check for duplicates.
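When entries arrive in search-key order, duplicate elimination needs only a comparison with the previous tuple. A minimal sketch, assuming the data entries have already been retrieved in order and stripped of unwanted fields; the function name and sample data are illustrative.

```python
def distinct_from_sorted(entries):
    """Drop duplicates from an already sorted sequence by comparing neighbours."""
    result = []
    for entry in entries:
        # Equal tuples are adjacent in sorted order, so one comparison suffices.
        if not result or result[-1] != entry:
            result.append(entry)
    return result

# (sid, bid) data entries as they might come out of an ordered index scan.
entries = [(1, 101), (1, 101), (1, 102), (2, 101)]
print(distinct_from_sorted(entries))
```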