Select

42
Query Tree Question • Should we do a pname, pnumber then pname = ‘Aquarius’ then pnumber ? • No, since the operations are done together – the processor would read a row of project, see if pname = ‘Aquarius’ then use pnumber to perform the join. • Our query tree has only 2 groups, not 3

Transcript of Select

Page 1: Select

Query Tree Question

• Should we do a pname, pnumber then pname = ‘Aquarius’ then pnumber ?

• No, since the operations are done together– the processor would read a row of project,

see if pname = ‘Aquarius’ then use pnumber to perform the join.

• Our query tree has only 2 groups, not 3

Page 2: Select

Select Operation Strategies

And Indexing

(Chapter 8)

*Some info on slides from Dr. S. Son, U. Va

Page 3: Select

Disk access

• DBs traditionally stored on disk

• Cheaper to store on disk than in memory

• Costs for:– Seek time, latency, data transfer time

•  Disk access is page oriented

• 2 - 4 KB page size

Page 4: Select

Access time

• Time to randomly access a page :– 12-20 ms which is 50-83 I/O's per second

• Large disparity between disk access and memory access (10-200 ns)

• System initially determines if page in memory buffer (page tables, etc.)

Page 5: Select

Table scan

• Linear search - all data rows read in – I/O parallelism can be used

• multiple I/O read requests satisfied at the same time

• stripe the data across different disks       

– Problems with parallelism?• must balance disk arm load to gain maximum

parallelism • requires the same total number of random I/O's,

but using devices for a shorter time

Page 6: Select

Sequential prefetch I/O

• Retrieve one disk page after another (on same track) - typically 32

• Seek time no longer a problem

• Must know in advance to read 32 successive pages

• Speed up of I/O by a factor of ≈10 (500 I/O's per second vs. 70)

Page 7: Select

Access time

• Seek time – 10-15ms

• Latency time – 2-5 ms

• Data transfer time – 10-200 ns

Page 8: Select

Access time for fast I/O

RIO            Seq. Prefetch .010             .010                    Seek - disk arm to cylinder .002             .002                    Latency - platter to sector .0015           .048                 Data transfer - Page .0135           .060                   1 page vs. 32 pages

.43 seconds  .060 seconds for 32 pages for both

Page 9: Select

Textbook access time

RIO            Seq. Prefetch .008             .008                   Seek - disk arm to cylinder .004            .004                  Latency - platter to sector .0005           .016               Data transfer - Page .0125           .028                   1 page vs. 32 pages

.40 seconds  .028 seconds for 32 pages for both

Page 10: Select

Disk allocation

• Disk Resource Allocation for Databases (DBA has control)

• Goal – contiguous sectors on disk - want data as close together as possible  to minimize seek time

• No standard SQL approach, but general way to deal with allocation

• Some OS allow specification of size of file and disk device

Page 11: Select

Tablespace

• Allocation medium for tables and indexes for ORACLE, DB2, etc.

• Usually relations (files) cannot span disk devices • Can put >1 table in a table space if accessed

together • Tablespace corresponds to 1 or more OS files

and can span disk devices

Page 12: Select

Query Language

• ORACLE DB's contain several tablespaces, including one called system -     data description +  indexes + user-defined tables

Create tablespace tspace1 datafile 'fname1', 'fname2';

• default tablespace given to each user • if multiple tablespaces - better control over load

balancing • can take some disk space off-line

Page 13: Select

Extent

• extent - contiguous storage on disk • when data segment or index segment first created,

given an initial extent from tablespace 10KB (5 pages) • if need more space given next contiguous extent

• can increase the size by a positive % (cannot decrease)                     initial n - size of initial extent                     next n - size of next                     max extents - maximum number of extents                     min extents - number of extents initially

allocated                     pct increase n - % by which next extent

grows over previous one

Page 14: Select

Create table

• Create table statement - can specify tablespace, no. of extents– When initial extent full, new extent allocated

• pctfree - determine how much space can be used for inserts of new rows – if pctfree =10%, inserts stop when page is 90% full

• pctused – determines when new inserts start again – if fall below certain percentage of total, default

pctused = 40%                  pctfree + pctused < 100

Page 15: Select

Rows

• Row layout on each disk page (see figure) • Row directory – row number and page byte

offset– Row number is row number in page – book calls it

slot#– Page byte offset – with varchar, row size not constant

• To identify a particular row use RID (RowID) –

page #, slot # [file#]

slot# is number in row directory (logical #)

Page 16: Select

Differences in DBMSs

• RID can be retrieved in ORACLE but not DB2 (violates relational model rule)

• ORACLE • rows can be slit between pages (row record

fragmentation) • Can have rows from multiple tables on same page,

more info

• DB2, no splitting, entire row moved to new page, need forwarding pointer

Page 17: Select

Binary Search

• ``Find all students with gpa > 3.0’’– If data is in sorted file, do binary search to find

first such student, then scan to find others.– Cost of binary search can be quite high.

• Simple idea: Create an `index’ file.

Page 1 Page 2 Page NPage 3 Data File

k2 kNk1 Index File

Page 18: Select

Binary Search

• Binary search on disk – optimal for comparisons - not optimal for disk-

based look-up – must keep data in order – may be reading values from same page at

different times

•  Instead use B+-tree index

Page 19: Select

Indexing

• Keyed access retrieval method • index is a sorted file - sorted by index key • index entries:

index key pointer  (RID)

   • pointer is RID   • index resides on disk, partially memory resident when

accessed

Page 20: Select

Indexing

• As for any index, 3 alternatives for data entries k*:– Data record with key value k– <k, rid of data record with search key value k>– <k, list of rids of data records with search key k>

• Choice is orthogonal to the indexing technique used to locate data entries k*.

• Tree-structured indexing techniques support both range searches and equality searches.

• B+ tree: dynamic, adjusts gracefully under inserts and deletes.

Page 21: Select

B+-tree

• Most commonly used index structure type in DBs today

• Based on B-tree

• Used to minimize disk I/O

• available in DB2, ORACLE also has hash cluster, Ingres has heap structure, B-tree, isam (chain together new nodes) Example

Page 22: Select

Structure of B+ Trees

• leaf level pointers to data (RIDs)

• the remaining are directory (index) nodes that point to other index nodes

Index Entries

Data Entries("Sequence set")

(Direct search)

Page 23: Select

Characteristics of B+ Tree

• Insert/delete at log F N cost; keep tree height-balanced. (F = fanout, N = # leaf pages)

• Minimum 50% occupancy (except for root). Each node contains d <= m <= 2d entries. The parameter d is called the order of the tree.

• Supports equality and range-searches efficiently

Page 24: Select

Cost of I/O for B+-tree

• Assume number of entries in each index node fits on one page - one node is one page

• If tree with depth of 3, 3 I/Os to get pointer to data B+-tree structured to get most out of every disk page read

• Read in index node, can make multiple probes to same page if remains in memory

– likely since frequent access to upper -level nodes of actively used B+-trees

Page 25: Select

B+ Trees in Practice

• Typical order: 100. Typical fill-factor: 67%.– average fanout = 133

• Typical capacities:– Height 4: 1334 = 312,900,700 records– Height 3: 1333 = 2,352,637 records

• Can often hold top levels in buffer pool:– Level 1 = 1 page = 8 Kbytes– Level 2 = 133 pages = 1 Mbyte– Level 3 = 17,689 pages = 133 MBytes

Page 26: Select

B+-tree

• Index has a directory structure that allows retrieval of a range of values efficiently – search for leftmost index entry Si such that

X <= Si • Index entries always placed in sequence

by value - can use sequential prefetch on index

• Index entries shorter than data rows and require proportionately less I/O

Page 27: Select

B+-tree

• Balancing of B+-trees - insert, delete

• nodes usually not full

• utilities to reorganize to lower disk I/O

• most systems allow nodes to become depopulated- no automatic algorithm to balance

• average node below root level 71% full in active growing B+-trees

Page 28: Select

Inserting into B+ Tree

• Find correct leaf L. • Put data entry onto L.

– If L has enough space, done!– Else, must split L (into L and a new node L2)

• Redistribute entries evenly, copy up middle key.• Insert index entry pointing to L2 into parent of L.

• This can happen recursively– To split index node, redistribute entries evenly, but

push up middle key. (Contrast with leaf splits.)• Splits “grow” tree; root split increases height.

– Tree growth: gets wider or one level taller at top.– Algorithm from MSU

Page 29: Select

Deleting from B+ tree

• Start at root, find leaf L where entry belongs.• Remove the entry.

– If L is at least half-full, done! – If L has only d-1 entries,

• Try to re-distribute, borrowing from sibling (adjacent node with same parent as L).

• If re-distribution fails, merge L and sibling.

• If merge occurred, must delete entry (pointing to L or sibling) from parent of L.

• Merge could propagate to root, decreasing height.• Algorithm from MSU

Page 30: Select

Duplicate key values

• Duplicate key values in index • leaf nodes have sibling pointers • but a delete of a row that has a heavily

duplicated key entails a long search through the leaf-level of the B+-tree

• Index compression - with multiple duplicates | header info | PrX keyval RID RID ... RID | PrX keyval RID…RID|

where PrX is count of RID values

Page 31: Select

Create Index

   Options:         multiple columns

        tablespace         storage - initial extents, etc.         percent free default = 10

% of each page left unfilled free page (1 free page for every n

index pages)     Can control % of B+-tree node pages left

unfilled when index created, refers to initial creation

Page 32: Select

Clustering

• Placing rows on disk in order by some common index key value        (remember the index itself is always sorted)

• clustered (clustering) index - index with rows in the same order as the key values

• efficiency advantage        read in a page, get all of the rows with

the same value • clustering is useful for range queries

        e.g.  between keyval1 and keyval2

Page 33: Select

Clustering

• can only cluster table by 1 clustering index at a time

• In DB2 – – if the table is empty, rows sorted as placed on

disk – subsequent insertions not clustered, must use

REORG

Page 34: Select

Indexes vs. table scan

• To illustrate the difference between table scan, secondary index (non clustered)     and clustered index

Assume 10 M customers, 200 cities2KB/page, row = 100 bytes, 20 rows/page             Select *

            From Customers             Where city = Birmingham

1/200 * 10M if assume selectivity = 1/200 50,000 customers in a city

Page 35: Select

Table Scan

Table Scan - read entire table

10,000,000/20 = 500,000 pages   

If use prefetch?

500000/32 * .? =

Page 36: Select

Clustering Index

• Clustering Index –

• All entries for B'ham clustered on same pages

• 50,000/20 = 2500 pages (with 20 rows per page)  

• (3 + 50 + 2500)*?=

Page 37: Select

Secondary Index

Secondary Index–

• In the worst case 1 entry for B'ham per page • 50,000 pages (10M/200)• 3 upper nodes of the tree   • Assume 1000 index entries per leaf node, read

50000/1000 index pages • (3 + 50 + 50,000)*?=

Page 38: Select

List Prefetch

• Create list of data pages to access

• system orders pages to minimize disk I/O– E.g. elevator algorithm for disk request

scheduling

Page 39: Select

% Free

• Redo the previous calculations assuming relations created with 50% free option specified.

Page 40: Select

Multiple Indexes

• More than one index on a relation             – e.g. class - one index, gender - one index

Page 41: Select

Composite Index

• One index based on more than one attribute  Create Index index_name on Table (col1, col2,... coln)

•    Composite index entry - values for each attribute             class, gender             entry in index is:  C1, C2, RID

• What would B+ tree look like?

Page 42: Select

Creating Indexes

When determining what indexes to create consider: workload - mix of queries and frequencies of requests             20% of requests are updates, etc.

            can create lots of indexes but:                 cost to create                 insertions                 initial load time high if a large table                 index entries can become longer and

longer as multiple columns included