ArchDBMS-3-2x2


  • Module 3: File Organizations and Indexes

    Module Outline

    3.1 Comparison of file organizations

    3.2 Overview of indexes

    3.3 Properties of indexes

    3.4 Indexes and SQL

    ! This is a repetition of material from "Information Systems".

    [Figure: DBMS architecture diagram. Components shown: Applications, Web Forms, SQL Interface, SQL Commands, Query Processor (Parser, Optimizer, Operator Evaluator, Plan Executor), Files and Index Structures (marked "You are here!"), Buffer Manager, Disk Space Manager, Concurrency Control (Transaction Manager, Lock Manager), Recovery Manager, and the Database (Index Files, Data Files, System Catalog).]

    - A heap file provides just enough structure to maintain a collection of records (of a table).

    - The heap file supports sequential scans (openScan) over the collection, e.g.

        SELECT A, B
        FROM R

      No further operations receive specific support from the heap file.

    - For queries like

        SELECT A, B
        FROM R
        WHERE C > 42

      or

        SELECT A, B
        FROM R
        ORDER BY C ASC

      it would definitely be helpful if the SQL query processor could rely on a particular organization of the records in the file for table R.

    File organization for table R: Which organization of records in the file for table R could speed up the evaluation of the two queries above?

    This section ...

    - ... presents a comparison of 3 file organizations:

        1. files of randomly ordered records (heap files)
        2. files sorted on some record field(s)
        3. files hashed on some record field(s).

    - ... introduces the index concept: A file organization is tuned to make a certain query (class) efficient, but if we have to support more than one query class, we may be in trouble. Consider:

        Q:  SELECT A, B, C
            FROM R
            WHERE A > 0 AND A < 100

      If the file for table R is sorted on C, this does not buy us anything for query Q.

      If Q is an important query but is not supported by R's file organization, we can build a support data structure, an index, to speed up (queries similar to) Q.

    3.1 Comparison of file organizations

    - We will now enter a competition in which 3 file organizations are assessed in 5 disciplines:

        1. Scan: fetch all records in a given file.
        2. Search with equality test: needed to implement SQL queries like

             SELECT *
             FROM R
             WHERE C = 42

        3. Search with range selection: needed to implement SQL queries like the following (the upper or lower bound might be unspecified):

             SELECT *
             FROM R
             WHERE A > 0 AND A < 100

        4. Insert a given record in the file, respecting the file's organization.
        5. Delete a record (identified by its rid), fix up the file's organization if needed.

  • 3.1.1 Cost model

    - Performing these 5 database operations clearly involves block I/O, the major cost factor.

    - However, we additionally have to pay for the CPU time used to search inside a page, compare a record field to a selection constant, etc.

    - To analyze cost more accurately, we introduce the following parameters:

        Parameter   Description
        b           # of pages in the file
        r           # of records on a page
        D           time needed to read/write a disk page
        C           CPU time needed to process a record (e.g., compare a field value)
        H           CPU time taken to apply a hash function to a record

    - Remarks:
        - D ≈ 15 ms
        - C ≈ H ≈ 0.1 µs
        - This is a coarse model to estimate the actual execution time (we do not model network access, cache effects, burst I/O, ...).

    ! Aside: Hashing

      A hashed file uses a hash function h to map a given record onto a specific page of the file.

      Example: h uses the lower 3 bits of the first field (of type INTEGER) of the record to compute the corresponding page number:

        h(42, true, "foo")  = 2    (42 = 101010 in binary)
        h(14, true, "bar")  = 6    (14 = 1110 in binary)
        h(26, false, "baz") = 2    (26 = 11010 in binary)

    - The hash function determines the page number only; record placement inside a page is not prescribed by the hashed file.

    - If a page p is filled to capacity, a chain of overflow pages is maintained (hanging off page p) to store additional records with h(...) = p.

    - To avoid immediate overflow when a new record is inserted into a hashed file, pages are typically filled to only 80% when a heap file is initially (re)organized into a hashed file.

    (We will come back to hashing later.)
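    The hashing aside can be tried out with a minimal sketch. The class name `HashedFile`, the page capacity of 2 records, and the extra record with key 10 are assumptions made for illustration; only the "lower 3 bits of the first field" hash function and the overflow chains come from the slide.

```python
def h(record, bits=3):
    """Page number = lower `bits` bits of the record's first (INTEGER) field."""
    return record[0] & ((1 << bits) - 1)


class HashedFile:
    """Toy hashed file: one chain of pages (primary page + overflow pages) per hash value."""

    def __init__(self, page_capacity=2, bits=3):
        self.page_capacity = page_capacity
        self.bits = bits
        # buckets[p] is a list of pages; buckets[p][0] is the primary page.
        self.buckets = [[[]] for _ in range(1 << bits)]

    def insert(self, record):
        chain = self.buckets[h(record, self.bits)]
        if len(chain[-1]) >= self.page_capacity:
            chain.append([])  # last page is full: hang a new overflow page off the chain
        chain[-1].append(record)

    def search(self, key):
        """Equality search: only the chain selected by the hash function is inspected."""
        chain = self.buckets[key & ((1 << self.bits) - 1)]
        return [rec for page in chain for rec in page if rec[0] == key]


f = HashedFile(page_capacity=2)
for rec in [(42, True, "foo"), (14, True, "bar"), (26, False, "baz"), (10, True, "qux")]:
    f.insert(rec)

print(h((42, True, "foo")), h((14, True, "bar")), h((26, False, "baz")))
print(len(f.buckets[2]))  # number of pages in the chain for hash value 2
print(f.search(26))
```

    Keys 42, 26, and 10 all end in the bits 010, so the third of them no longer fits on the primary page and lands on an overflow page.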

    3.1.2 Scan

    1. Heap file
       Scanning the records of a file involves reading all b pages as well as processing each of the r records on each page:

         $\text{Scan}_{\text{heap}} = b \cdot (D + r \cdot C)$

    2. Sorted file
       The sort order does not help much here. However, the scan retrieves the records in sorted order (which can be a big plus):

         $\text{Scan}_{\text{sort}} = b \cdot (D + r \cdot C)$

    3. Hashed file
       Again, the hash function does not help. We simply scan from the beginning (skipping over the spare free space typically found in hashed files):

         $\text{Scan}_{\text{hash}} = \underbrace{(100/80)}_{=\,1.25} \cdot b \cdot (D + r \cdot C)$
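    To get a feeling for the magnitudes, the three scan formulas can be evaluated directly. This is a rough sketch: the name `CostParams`, the default values D = 15 ms and C = H = 0.1 µs (taken from the remarks in 3.1.1), and the sample file of 1000 pages with 100 records each are assumptions.

```python
from dataclasses import dataclass


@dataclass
class CostParams:
    b: int            # number of pages in the file
    r: int            # number of records on a page
    D: float = 15e-3  # seconds to read/write a disk page (about 15 ms, assumed)
    C: float = 1e-7   # CPU seconds to process a record (about 0.1 microseconds, assumed)
    H: float = 1e-7   # CPU seconds to hash a record (about 0.1 microseconds, assumed)


def scan_heap(p):
    return p.b * (p.D + p.r * p.C)


def scan_sort(p):
    # Same cost as the heap file; the benefit is the sorted output order.
    return p.b * (p.D + p.r * p.C)


def scan_hash(p):
    # Pages are only 80% full, so 100/80 = 1.25 times as many pages are read.
    return 1.25 * p.b * (p.D + p.r * p.C)


p = CostParams(b=1000, r=100)
print(f"heap:   {scan_heap(p):.4f} s")
print(f"sorted: {scan_sort(p):.4f} s")
print(f"hashed: {scan_hash(p):.4f} s")
```

    With these values the disk term b · D dominates by far; the CPU term contributes only about 0.01 s to a scan of roughly 15 s.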

    Scanning a hashed file: In which order does a scan of a hashed file retrieve its records?

    3.1.3 Search with equality test (A = const)

    1. Heap file
       The equality test is (a) on a primary key, (b) not on a primary key. In case (a) the scan can stop at the first hit, so on average half of the file is read:

         (a) $\text{Search}_{\text{heap}} = \tfrac{1}{2} \cdot b \cdot (D + r \cdot C)$
         (b) $\text{Search}_{\text{heap}} = b \cdot (D + r \cdot C)$

    2. Sorted file (sorted on A)
       We assume the equality test to be on the field determining the sort order. The sort order enables us to use binary search:

         $\text{Search}_{\text{sort}} = \log_2 b \cdot D + \log_2 r \cdot C$

       (If more than one record qualifies, all other matches are stored right after the first hit.)

    3. Hashed file (hashed on A)
       Hashed files support equality searching best. The hash function directly leads us to the page containing the hit (overflow chains ignored here):

         (a) $\text{Search}_{\text{hash}} = H + D + \tfrac{1}{2} \cdot r \cdot C$
         (b) $\text{Search}_{\text{hash}} = H + D + r \cdot C$

       (All qualifying records are on the same page or, if present, in its overflow chain.)
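    The equality-search formulas can be compared numerically. The sketch below assumes D = 15 ms and C = H = 0.1 µs and a sample file with b = 1000 pages and r = 100 records per page; the function names and the `unique` flag (case (a) versus case (b)) are illustrative.

```python
from math import log2

D, C, H = 15e-3, 1e-7, 1e-7  # assumed: 15 ms per page I/O, 0.1 microseconds CPU per record/hash


def search_heap(b, r, unique=True):
    full_scan = b * (D + r * C)
    return full_scan / 2 if unique else full_scan  # (a) primary key, (b) otherwise


def search_sort(b, r):
    # Binary search over the pages, then binary search inside the page.
    return log2(b) * D + log2(r) * C


def search_hash(r, unique=True):
    # One hash computation, one page read, then a (half) page of comparisons.
    return H + D + (r / 2 if unique else r) * C


b, r = 1000, 100
print(f"heap   (a): {search_heap(b, r):.6f} s")
print(f"sorted    : {search_sort(b, r):.6f} s")
print(f"hashed (a): {search_hash(r):.6f} s")
```

    Note that the hashed-file cost does not depend on b at all, while the sorted file grows only logarithmically with the file size.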

  • 3.1.4 Search with range selection (A ≥ lower AND A ≤ upper)

    1. Heap file
       Qualifying records can appear anywhere in the file:

         $\text{Range}_{\text{heap}} = b \cdot (D + r \cdot C)$

    2. Sorted file (sorted on A)
       Use equality search (with A = lower), then sequentially scan the file until a record with A > upper is found:

         $\text{Range}_{\text{sort}} = \log_2 b \cdot D + \log_2 r \cdot C + n/r \cdot D + n \cdot C$

       (n denotes the number of hits in the range)

    3. Hashed file (hashed on A)
       Hashing offers no help here as hash functions are designed to scatter records all over the hashed file (e.g., h(7, ...) = 7, h(8, ...) = 0):

         $\text{Range}_{\text{hash}} = 1.25 \cdot b \cdot (D + r \cdot C)$
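    The sorted-file strategy (binary search for the lower bound, then a sequential scan up to the upper bound) can be sketched on an in-memory list. The function name and the sample keys are made up; a real sorted file would apply the same idea page by page.

```python
from bisect import bisect_left


def range_select_sorted(keys, lower, upper):
    """Return all keys k with lower <= k <= upper from an ascendingly sorted list."""
    i = bisect_left(keys, lower)  # binary search: position of the first key >= lower
    hits = []
    while i < len(keys) and keys[i] <= upper:  # sequential scan until a key > upper shows up
        hits.append(keys[i])
        i += 1
    return hits


keys = [1, 3, 3, 7, 8, 12, 20]
print(range_select_sorted(keys, 3, 8))
print(range_select_sorted(keys, 13, 19))
```

    The work done is one binary search plus one step per hit, which mirrors the two parts of the sorted-file range formula.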

    3.1.5 Insert

    1. Heap file
       We can add the record to some arbitrary page (e.g., the last page). This involves reading and writing the page:

         $\text{Insert}_{\text{heap}} = 2 \cdot D + C$

    2. Sorted file
       On average, the new record will belong in the middle of the file. After insertion, we have to shift all subsequent records (in the latter half of the file):

         $\text{Insert}_{\text{sort}} = \underbrace{\log_2 b \cdot D + \log_2 r \cdot C}_{\text{search}} + \underbrace{\tfrac{1}{2} \cdot b \cdot (2 \cdot D + r \cdot C)}_{\text{shift latter half}}$

    3. Hashed file
       We pretend to search for the record, then read and write the page determined by the hash function (we assume the spare 20% space on the page is sufficient to hold the new record):

         $\text{Insert}_{\text{hash}} = \underbrace{H + D}_{\text{search}} + C + D$
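    Plugging numbers into the insert formulas shows how expensive the shift in a sorted file is. As before, D = 15 ms and C = H = 0.1 µs are assumed, and the sample size (b = 1000, r = 100) and function names are illustrative.

```python
from math import log2

D, C, H = 15e-3, 1e-7, 1e-7  # assumed: 15 ms per page I/O, 0.1 microseconds CPU per record/hash


def insert_heap():
    return 2 * D + C  # read the page, add the record, write the page back


def insert_sort(b, r):
    search = log2(b) * D + log2(r) * C
    shift = 0.5 * b * (2 * D + r * C)  # read and write every page in the latter half
    return search + shift


def insert_hash():
    return (H + D) + C + D  # hash and read the page, add the record, write it back


b, r = 1000, 100
print(f"heap:   {insert_heap():.7f} s")
print(f"sorted: {insert_sort(b, r):.7f} s")
print(f"hashed: {insert_hash():.7f} s")
```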

    3.1.6 Delete (record specified by its rid)

    1. Heap file
       If we do not try to compact the file after we have found and removed the record (because the file uses free space management), the cost is:

         $\text{Delete}_{\text{heap}} = \underbrace{D}_{\text{search by rid}} + C + D$

    2. Sorted file
       Again, we access the record's page and then (on average) shift the latter half of the file to compact the file:

         $\text{Delete}_{\text{sort}} = D + \underbrace{\tfrac{1}{2} \cdot b \cdot (2 \cdot D + r \cdot C)}_{\text{shift latter half}}$

    3. Hashed file
       Accessing the page using the rid is even faster than the hash function, so the hashed file behaves like the heap file:

         $\text{Delete}_{\text{hash}} = D + C + D$

    - There is no single file organization that serves all 5 operations equally fast.
      Dilemma: more advanced file organizations make a real difference in speed!

      [Plot: range selections for increasing file sizes (D = 15 ms, C = 0.1 µs, r = 100, n = 10). x-axis: b [pages], 10 to 100000 (log scale); y-axis: time [s], 0.01 to 10000 (log scale); curves: sorted file, heap/hashed file.]

      [Plot: deletions for increasing file sizes (as above, n = 1). x-axis: b [pages], 10 to 100000 (log scale); y-axis: time [s], 0.01 to 10000 (log scale); curves: sorted file, heap/hashed file.]

    - There exist index structures which offer all the advantages of a sorted file and support insertions/deletions efficiently (at the cost of a modest space overhead): B+ trees.
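    The numbers behind the two comparisons can be recomputed from the formulas of 3.1.4 and 3.1.6. The sketch assumes the parameter values D = 15 ms, C = 0.1 µs, r = 100, and n = 10 hits for the range selection; the function names are illustrative, and only the heap file is tabulated for the heap/hashed case (the hashed file differs by the factor 1.25 for range selections and is identical for deletions).

```python
from math import log2

D, C, R = 15e-3, 1e-7, 100  # assumed: 15 ms page I/O, 0.1 microseconds CPU, 100 records/page


def range_sorted(b, n=10):
    return log2(b) * D + log2(R) * C + (n / R) * D + n * C


def range_heap(b):
    return b * (D + R * C)


def delete_sorted(b):
    return D + 0.5 * b * (2 * D + R * C)


def delete_heap():
    return D + C + D


print(f"{'b':>7} {'range sorted':>13} {'range heap':>12} {'delete sorted':>14} {'delete heap':>12}")
for b in (10, 100, 1000, 10000, 100000):
    print(f"{b:>7} {range_sorted(b):>13.4f} {range_heap(b):>12.4f} "
          f"{delete_sorted(b):>14.4f} {delete_heap():>12.4f}")
```

    For range selections the sorted file stays well below one second even for 100000 pages, whereas for deletions it is the sorted file whose cost grows linearly with b.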

  • 3.2 Overview of indexes

    - If the basic organization of a file does not support a particular operation, we can additionally maintain an auxiliary structure, an index, which adds the needed support.

    - We will use indexes like guides. Each guide is specialized to accelerate searches on a specific attribute A (or a combination of attributes) of the records in its associated file:

        1. Query the index for the location of a record with A = k (k is the search key).
        2. The index responds with an associated index entry k* (k* contains enough information to access the actual record in the file).
        3. Read the actual record by using the guiding information in k*; the record will have an A-field with value k.

        k --(1)--> index --(2)--> k* --(3)--> ⟨..., A = k, ...⟩

      (Step 3 holds for so-called exact match indexes. In the more general case with similarity indexes, the records are not guaranteed to contain the value k; they are only candidates for having this value.)

    - We can design the index entries, i.e., the k*, in various ways:

        Variant   Index entry k*
        a         ⟨k, ⟨..., A = k, ...⟩⟩
        b         ⟨k, rid⟩
        c         ⟨k, [rid1, rid2, ...]⟩

    - Remarks:
        - With variant a, there is no need to store the data records in addition to the index: the index itself is a special file organization.
        - If we build multiple indexes for a file, at most one of these should use variant a to avoid redundant storage of records.
        - Variants b and c use rid(s) to point into the actual data file.
        - Variant c leads to fewer index entries if multiple records match a search key k, but index entries are of variable length.
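    Variants b and c, together with the three-step lookup scheme, can be mimicked with plain Python containers. This is only a sketch: the data file is modelled as a dict from rid to record, the rids 0 to 3 are made up, and the helper names are hypothetical.

```python
from collections import defaultdict

# Toy data file: rid -> (name, age, sal). The rids are invented for this example.
data_file = {
    0: ("Smith", 44, 3000),
    1: ("Ashby", 25, 3000),
    2: ("Jones", 40, 6003),
    3: ("Tracy", 44, 5004),
}
AGE = 1  # position of the indexed field inside a record


def build_variant_b(data_file, field):
    """Variant b: one <k, rid> entry per record, kept sorted on k."""
    return sorted((rec[field], rid) for rid, rec in data_file.items())


def build_variant_c(data_file, field):
    """Variant c: one <k, [rid1, rid2, ...]> entry per distinct key k."""
    index = defaultdict(list)
    for rid, rec in data_file.items():
        index[rec[field]].append(rid)
    return dict(index)


def lookup(index_c, data_file, k):
    """Steps 1 to 3: query the index with k, obtain k*, follow its rids into the data file."""
    return [data_file[rid] for rid in index_c.get(k, [])]


index_b = build_variant_b(data_file, AGE)
index_c = build_variant_c(data_file, AGE)
print(index_b)
print(index_c)
print(lookup(index_c, data_file, 44))
```

    Two records share age 44, so variant b holds four entries while variant c holds three, one of them with a list of two rids.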

    Example:

    - The data file contains ⟨name, age, sal⟩ records; the file itself (index entry variant a) is hashed on field age (hash function h1).

    - The index file contains ⟨sal, rid⟩ index entries (variant b), pointing into the data file.

    - This file organization + index efficiently supports equality searches on the age and sal keys.

    [Figure: data file hashed on age (hash function h1; buckets h(age) = 0, h(age) = 1, h(age) = 2) with the records Jones, 40, 6003; Tracy, 44, 5004; Basu, 33, 4003; Cass, 50, 5004; Smith, 44, 3000; Ashby, 25, 3000; Daniels, 22, 6003; Bristow, 29, 2007. Index file hashed on sal (hash function h2; buckets h(sal) = 0, h(sal) = 3) with the ⟨sal, rid⟩ entries 6003, 6003, 2007, 4003, 3000, 3000, 5004, 5004. The circled numbers 1 to 3 in the figure refer to the index lookup scheme at the beginning of this section.]

    3.3 Properties of indexes

    3.3.1 Clustered vs. unclustered indexes

    - Suppose we have to support range selections on records such that lower ≤ A ≤ upper for field A.

    - If we maintain an index on the A-field, we can

        1. query the index once for a record with A = lower, and then
        2. sequentially scan the data file from there until we encounter a record with field A > upper.

    - This will work provided that the data file is sorted on the field A:

      [Figure: B+ tree index file with index entries (k*) pointing to the data records in the data file.]

  • If the data file associated with an index is sorted on the index search key, the index is said to be clustered.

    - In general, the cost for a range selection grows tremendously if the index on A is unclustered. In this case, proximity of index entries does not imply proximity in the data file; in the worst case, every qualifying record requires its own page access.

    - As before, we can query the index for a record with A = lower. To continue the scan, however, we have to revisit the index entries, which point us to data pages scattered all over the data file:

      [Figure: B+ tree index file with index entries (k*) pointing to data records scattered across the data file.]

    - Remarks:
        - If the index entries (k*) are of variant a, the index is obviously clustered by definition.
        - A data file can have at most one clustered index (but any number of unclustered indexes).
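    The difference between the two cases can be illustrated by counting page fetches for a range selection. The model below is a deliberate simplification and rests on several assumptions: 10000 records with 100 records per page, a buffer that holds only the most recently read page, and an unclustered file modelled as a random permutation of record positions.

```python
import random


def page_fetches(positions, records_per_page):
    """Count page reads when records are fetched in index order with a one-page buffer."""
    fetches, current_page = 0, None
    for pos in positions:
        page = pos // records_per_page
        if page != current_page:  # the needed page is not the one currently buffered
            fetches += 1
            current_page = page
    return fetches


N, r = 10_000, 100

# positions[i] = position in the data file of the record with the i-th smallest key
clustered = list(range(N))   # data file sorted on the search key
unclustered = clustered[:]   # same records, placed in random order
random.Random(0).shuffle(unclustered)

lo, hi = 2000, 2500          # range selection hitting 500 consecutive keys
print("clustered:  ", page_fetches(clustered[lo:hi], r))
print("unclustered:", page_fetches(unclustered[lo:hi], r))
```

    In the clustered case the 500 hits sit on 5 adjacent pages; in the unclustered case almost every hit triggers a page read of its own.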

    3.3.2 Dense vs. sparse indexes

    - A clustered index comes with more advantages than the improved speed for range selections presented above. We can additionally design the index to be space efficient:

        To keep the size of the index file small, we maintain one index entry k* per data file page (not one index entry per data record). Key k is the smallest search key on that page.

        Indexes of this kind are called sparse (otherwise indexes are dense).

    - To search a record with field A = k in a sparse A-index, we

        1. locate the largest index entry k'* such that k' ≤ k, then
        2. access the page pointed to by k'*, and
        3. scan this page (and the following pages, if needed) to find records with ⟨..., A = k, ...⟩.

      Since the data file is clustered (i.e., sorted) on field A, we are guaranteed to find matching records in the proximity.
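    The three lookup steps for a sparse index map naturally onto a binary search over the page-level keys. The sketch below assumes unique search keys, three records per page, and an in-memory list of pages standing in for the data file; the function name is hypothetical.

```python
from bisect import bisect_right

# Data file clustered (sorted) on name; 3 records per page is an assumption.
pages = [
    [("Ashby", 25, 3000), ("Basu", 33, 4003), ("Bristow", 30, 2007)],
    [("Cass", 50, 5004), ("Daniels", 22, 6003), ("Jones", 40, 6003)],
    [("Smith", 44, 3000), ("Tracy", 44, 5004)],
]

# Sparse index: one entry per page, holding the smallest name on that page.
sparse_index = [page[0][0] for page in pages]


def sparse_lookup(k):
    """Find the records with name = k (names are assumed to be unique)."""
    i = bisect_right(sparse_index, k) - 1  # step 1: largest entry k' with k' <= k
    if i < 0:
        return []                          # k is smaller than every key in the file
    hits = []
    for page in pages[i:]:                 # steps 2 and 3: read that page, scan onwards
        for rec in page:
            if rec[0] == k:
                hits.append(rec)
            elif rec[0] > k:               # sorted file: no further match is possible
                return hits
    return hits


print(sparse_index)
print(sparse_lookup("Jones"))
print(sparse_lookup("Doe"))
```

    The index holds 3 entries for 8 records. With duplicate keys that span a page boundary, step 1 would have to start one page earlier; the sketch sidesteps this by assuming unique keys.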

    Example:

    - Again, the data file contains ⟨name, age, sal⟩ records. We maintain a clustered sparse index on field name and an unclustered dense index on field age. Both use index entry variant b to point into the data file:

      [Figure: data file with the records Ashby, 25, 3000; Basu, 33, 4003; Bristow, 30, 2007; Cass, 50, 5004; Daniels, 22, 6003; Jones, 40, 6003; Smith, 44, 3000; Tracy, 44, 5004. Sparse index on name with the entries Ashby, Cass, Smith. Dense index on age with the entries 22, 25, 30, 33, 40, 44, 44, 50.]

    - Remarks:
        - Sparse indexes need 2 to 3 orders of magnitude less space than dense indexes.
        - We cannot build a sparse index that is unclustered (i.e., there is at most one sparse index per file).

    SQL queries and index exploitation: How do you propose to evaluate the query

        SELECT MAX(age) FROM employees

    ? How about

        SELECT MAX(name) FROM employees

    ?

    3.3.3 Primary vs. secondary indexes

    Terminology: In the literature, you may often find a distinction between primary (mostly used for indexes on the primary key) and secondary (mostly used for indexes on other attributes) indexes. The terminology, however, is not very uniform, so some textbooks may use those terms for different properties.

    You might as well find primary denoting variant a of the index entries according to Section 3.2, while secondary may be used to characterize the other two variants b and c.

  • 3.3.4 Multi-attribute indexes

    Each of the index techniques sketched so far and discussed in the sequel can be applied to a combination of attribute values in a straightforward way:

    - concatenate indexed attributes to form an index key, e.g. ⟨lastname, firstname⟩ ↦ searchkey
    - define index on searchkey
    - index will support lookup based on both attribute values, e.g. ... WHERE lastname='Doe' AND firstname='John' ...
    - possibly will also support lookup based on prefix of values, e.g. ... WHERE lastname='Doe' ...

    There are more sophisticated index structures that can provide support for symmetric lookups for both attributes alone (or, to be more general, for all subsets of indexed attributes). These are often called multi-dimensional indexes. A large number of such indexes have been proposed, especially for geometric applications.
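    A concatenated search key behaves like a tuple under lexicographic ordering, which is why full-key and prefix lookups work but a lookup on firstname alone does not. The following sketch uses a sorted list of ⟨(lastname, firstname), rid⟩ entries; the names, rids, and function names are invented for the example.

```python
from bisect import bisect_left

# Index entries <(lastname, firstname), rid>, sorted on the concatenated key.
index = sorted([
    (("Doe", "Jane"), 7),
    (("Doe", "John"), 3),
    (("Roe", "Ann"), 5),
    (("Smith", "Al"), 1),
])
keys = [key for key, _ in index]


def lookup_full(lastname, firstname):
    """... WHERE lastname = ? AND firstname = ?"""
    i = bisect_left(keys, (lastname, firstname))
    rids = []
    while i < len(keys) and keys[i] == (lastname, firstname):
        rids.append(index[i][1])
        i += 1
    return rids


def lookup_prefix(lastname):
    """... WHERE lastname = ?  (uses only the prefix of the concatenated key)"""
    # The 1-tuple (lastname,) sorts directly before every (lastname, x).
    i = bisect_left(keys, (lastname,))
    rids = []
    while i < len(keys) and keys[i][0] == lastname:
        rids.append(index[i][1])
        i += 1
    return rids


print(lookup_full("Doe", "John"))
print(lookup_prefix("Doe"))
```

    A query on firstname alone finds its matches scattered across the sort order and would have to scan all entries, which is the gap that multi-dimensional indexes aim to close.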

    3.4 Indexes and SQL

    The SQL-92 standard does not include any statement for the specification (creation, dropping) of index structures. In fact, SQL does not even require SQL systems to provide indexes at all!

    Nonetheless, practically all SQL implementations support one or more kinds of indexes. A typical SQL statement to create an index would look like this:

    Index specification in SQL:

        CREATE INDEX IndAgeRating ON Students
          WITH STRUCTURE = BTREE,
          KEY = (age, gpa)

    N.B. SQL-99 does not include indexes either.
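    Since index creation is not standardized, the exact syntax differs from system to system. As one concrete instance, the sketch below uses SQLite through Python's built-in `sqlite3` module, where the key columns are listed in parentheses after the table name. The table contents and the sample query are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (name TEXT, age INTEGER, gpa REAL)")
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?)",
    [("Ann", 19, 3.4), ("Bob", 20, 3.1), ("Cy", 19, 3.9)],
)

# SQLite's form of the index specification: a multi-attribute index on (age, gpa).
conn.execute("CREATE INDEX IndAgeRating ON Students (age, gpa)")

# The catalog lists the new index.
index_names = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'Students'"
)]
print(index_names)

# Ask SQLite how it intends to evaluate a query that matches the index key.
query = "SELECT name FROM Students WHERE age = 19 AND gpa > 3.5"
for step in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(step)

rows = conn.execute(query).fetchall()
print(rows)
```

    The query result is the same with or without the index; the index only changes how the system finds the qualifying rows, which the query plan output makes visible.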
