Database Indexes

DATABASE INDEXES

Nimesh Shah (nimesh.s) , Amit Bhawnani (amit.b)

Outline Why Indexes ? Demo Types of Single-level Ordered Indexes

• Primary Indexes• Clustering Indexes• Secondary Indexes

Multilevel Indexes Dynamic Multilevel Indexes Using B-Trees and B+-Trees Indexes on Multiple Keys Disadvantages of Indexes Indexes in SQL SERVER

Clustered Index Non Clustered Index

Why Indexes? Indexing mechanisms used to speed up access to desired data.

They are much the same as book indexes, providing the database with quick jump points on where to find the full reference.

Search Key - attribute or set of attributes used to look up records in a file.

An index file consists of records (called index entries) of the form

Index files are typically much smaller than the original file Access types supported efficiently. E.g.,

Records with a specified value in the attribute Records with an attribute value falling in a specified range of values.

Search-Key

Pointer

Demo Given a CUSTOMER table, find all

customers whose name = ‘Justine’ Query :

SELECT * FROM CUSTOMERWHERE name ='Justine‘

Single Level Indexes - Indexes as Access Paths A single-level index is an auxiliary file that makes it

more efficient to search for a record in the data file.

The index is usually specified on one field of the file (although it could be specified on several fields)

One form of an index is a file of entries <field value, pointer to record>, which is ordered by field value

The index is called an access path on the field.

Indexes as Access Paths (contd.) The index file usually occupies considerably less disk

blocks than the data file because its entries are much smaller

A binary search on the index yields a pointer to the file record

Indexes can also be characterized as dense or sparse.

A dense index has an index entry for every search key value (and hence every record) in the data file.

A sparse (or nondense) index, on the other hand, has index entries for only some of the search values

Indexes as Access Paths (contd.)Example: Given the following data file:

EMPLOYEE(NAME, SSN, ADDRESS, JOB, SAL, ... )Suppose that:Record size R = 150 bytesBlock size B = 512 bytesNo of Records r = 30000 records

Then, we get:Blocking factor Bfr = B div R = 512 div 150 = 3 records/blockNumber of file blocks b = (r/Bfr) = (30000/3) = 10000 blocks

For an index on the SSN field, assume the field size VSSN= 9 bytes,Assume the record pointer size PR= 7 bytes. Then:Index entry size RI=(VSSN+ PR)=(9+7)=16 bytesIndex blocking factor BfrI= B div RI= 512 div 16= 32 entries/blockNumber of index blocks b= (r/ BfrI)= (30000/32)= 938 blocksBinary search needs log2bI= log2938= 10 block accesses

This is compared to an average linear search cost of: (b/2)= 30000/2= 15000 block accessesIf the file records are ordered, the binary search cost would be: log2b= log230000= 15 block accesses

Types of Single-Level Indexes Primay Index

Defined on an ordered data file

The data file is ordered on a key field

Includes one index entry for each block in the data file; the index entry has the key field value for the first record in the block, which is called the block anchor

A similar scheme can use the last record in a block.

A primary index is a nondense (sparse) index, since it includes an entry for each disk block of the data file and the keys of its anchor record rather than for every search value.

Primary index on

the ordering key field

of an ordered data file

Primary Index (contd.)

Types of Single-Level Indexes Clustering Index

Defined on an ordered data file

The data file is ordered on a non-key field unlike primary index, which requires that the ordering field of the data file have a distinct value for each record.

Includes one index entry for each distinct value of the field; the index entry points to the first data block that contains records with that field value.

It is another example of nondense index where Insertion and Deletion is relatively straightforward with a clustering index.

Clustering Index (contd.)A clustering index on the DEPTNUMBER ordering nonkey field of an EMPLOYEE file.

Clustering Index (contd.)Clustering index with a separate block cluster for each group of records that share the same value for the clustering field.

Types of Single-Level Indexes Secondary indexes

A secondary index provides a secondary means of accessing a file for which some primary access already exists.

The secondary index may be on a field which is a candidate key and has a unique value in every record, or a nonkey with duplicate values.

The index is an ordered file with two fields. The first field is of the same data type as some nonordering

field of the data file that is an indexing field. The second field is either a block pointer or a record pointer.

There can be many secondary indexes (and hence, indexing fields) for the same file.

Includes one entry for each record in the data file; hence, it is a dense index

Secondary Indexes (contd.)

A dense secondary index (with block pointers) on a nonordering key field of a file.

Secondary Indexes (contd.)A secondary index (with recored pointers) on a nonkey field implemented using one level of indirection so that index entries are of fixed length and have unique field values.

Types of Single-Level Indexes

Multi-Level Indexes Because a single-level index is an ordered file, we can create a

primary index to the index itself ; in this case, the original index file is called the first-level index and the index to the index is called the second-level index.

We can repeat the process, creating a third, fourth, ..., top level until all entries of the top level fit in one disk block

A multi-level index can be created for any type of first-level index (primary, secondary, clustering) as long as the first-level index consists of more than one disk block

Multi-Level Indexes (contd.)A two-level primary index resembling ISAM (Indexed Sequential Access Method) organization.

Multi-Level Indexes (contd.) Such a multi-level index is a form of search tree ;

however, insertion and deletion of new index entries is a severe problem because every level of the index is an ordered file.

A node in a search tree with pointers to subtrees below it.

Dynamic Multilevel Indexes Using B-Trees and B+ -Trees

Because of the insertion and deletion problem, most multi-level indexes use B-tree or B+-tree data structures, which leave space in each tree node (disk block) to allow for new index entries

These data structures are variations of search trees that allow efficient insertion and deletion of new search values.

In B-Tree and B+-Tree data structures, each node corresponds to a disk block

Each node is kept between half-full and completely full

Dynamic Multilevel Indexes Using B-Trees and B+ -Trees

An insertion into a node that is not full is quite efficient; if a node is full the insertion causes a split into two nodes

Splitting may propagate to other tree levels

A deletion is quite efficient if a node does not become less than half full

If a deletion causes a node to become less than half full, it must be merged with neighboring nodes

Difference between B-tree and B+-tree

In a B-tree, pointers to data records exist at all levels of the tree

In a B+-tree, all pointers to data records exists at the leaf-level nodes

A B+-tree can have less levels (or higher capacity of search values) than the corresponding B-tree

B-tree allows search-key values to appear only once; eliminates redundant storage of search keys.

Search keys in nonleaf nodes appear nowhere else in the Btree; an additional pointer field for each search key in a nonleaf node must be included.

B-tree structures. (a) A node in a B-tree with q – 1 search values. (b) A B-tree of order p = 3. The values were inserted in the order 8, 5, 1, 7, 3, 12, 9, 6.

The nodes of a B+-tree. (a) Internal node of a B+-tree with q –1 search values. (b) Leaf node of a B+-tree with q – 1 search values and q – 1 data pointers.

Indexes on Multiple Keys Use multiple indices for certain types of queries. Example:

select account-number from account where branch-name = “Perryridge” and balance = 1000

Possible strategies for processing query using indices on single attributes:1. Use index on branch-name to find accounts with branch

name of Perryridge; test balance = 1000.2. Use index on balance to find accounts with balances of

$1000; test branch-name = “Perryridge”.3. Use branch-name index to find pointers to all records

pertaining to the Perryridge branch. Similarly use index on balance. Take intersection of both sets of pointers obtained.

Indexes on Multiple Keys All of these alternatives eventually give the correct result.

However, if the set of records that meet each condition (branch-name = “Perryridge” OR balance = 1000) individually are large, yet only a few records satisfy the combined condition, then none of the above is a very efficient technique for the given search request.

Suppose we have an index on combined search-key (branch-name, balance). With the where clause

where branch-name = “Perryridge” and balance = 1000 the index on the combined search-key will fetch only records that satisfy both conditions.

Can also efficiently handle where branch-name = “Perryridge” and balance < 1000

But cannot efficiently handle where branch-name < “Perryridge” and balance = 1000. May fetch many records that satisfy the first but not the second condition.

Disadvantages of Indexes Indexes require additional storage space

on disk, just as the index in a book require additional pages.

Too many indexes can actually slow your database down. Each time a page or database row is updated or removed, the reference or index also has to be updated

Indexes in Microsoft SQL Server

Indexes are organized as B-trees. Each page in an index B-tree is called an index node. The top node of the B-tree is called the root node. The bottom level of nodes in the index is called the leaf

nodes. Any index levels between the root and the leaf nodes

are collectively known as intermediate levels. The pages in each level of the index are linked in a

doubly-linked list. SQL Server primarily supports 2 types of indexes

CLUSTERED Index NON-CLUSTERED Index

Clustered Index A clustered index is a special type of index that reorders

the way records in the table are physically stored. The root and intermediate level nodes contain index

pages holding index rows Each index row contains a key value and a pointer to

either an intermediate level page in the B-tree, or a data row in the leaf level of the index.

Leaf node contains the actual data pages. The data row of the table are sorted and stored in the

table based on their clustered index key (i.e. based on the index column(s)).

You can have only one clustered index per table.

Clustered Index

Clustered Index CREATE CLUSTERED INDEX <index

name> ON <table name> (<column name list>)

Eg: CREATE CLUSTERED INDEX

IX_CUSTOMER_id ON CUSTOMER(id)

Clustered Index Potential candidates for the clustered index:

Columns with a number of duplicate values that are searched frequently, for example, WHERE last_name = ‘Smith’.

Columns that are often specified in the ORDER BY / GROUP BY clause

Columns that are often searched for within a range of values, for example, WHERE price between $10 and $20.

Columns, other than the primary key, that are frequently used in join clauses.

Queries that return large result sets.

Clustered Index Clustered indexes are not a good choice for:

Columns that undergo frequent changes: This results in the entire row moving (because SQL Server must keep the data values of a row in physical order). This is an important consideration in high-volume transaction processing systems where data tends to be volatile.

Wide keys: The key values from the clustered index are used by all nonclustered indexes as lookup keys and therefore are stored in each nonclustered index leaf entry.

Non-Clustered Index A nonclustered index is a special type of index in which the

logical order of the index does not match the physical stored order of the rows on disk.

Leaf node contains index pages instead of data pages. Each index row in the nonclustered index contains the

nonclustered key value and a row locator. This locator points to the data row in the clustered index or heap having the key value Heap: The row locator is a pointer to the row. Row locator is

built based on the following. ROW ID (RowLocator)= file identifier + page number + row number on the page

Clustered Index : Row locator is the clustered index key for the row

You can have up to 249 Non Clustered Index per table.

Non-Clustered Index

Non-Clustered Index CREATE [NONCLUSTERED] INDEX

<index name> ON <table name> (<column name list>)

Eg: CREATE NONCLUSTERED INDEX

IX_CUSTOMER_name ON CUSTOMER(name)

Non-Clustered Index Potential candidates for nonclustered

indexes for your environment: Columns referenced in SARGs or join clauses

that have a relatively high selectivity(the density value is low).

Columns referenced in both the WHERE clause and the ORDER BY clause

Nonclustered indexes are useful for single-row lookups, joins, queries on columns that are highly selective, or queries with small range retrievals.

Non-Clustered Index Index with Included Columns

Functionality of nonclustered indexes can be extended by adding nonkey columns to the leaf level of the nonclustered index

An index with included nonkey columns can significantly improve query performance when all columns in the query are included in the index either as key or nonkey columns.

Questions

?

Database Indexes

Documents

Transcript of Database Indexes