CS346: Advanced Databases Graham Cormode [email protected] Storage, Files and Indexing.


Storage, Files and Indexing

Outline

Part 1:
¨ Disk properties and file storage
¨ File organizations: ordered, unordered, and hashed
¨ Storage topics: RAID and Storage area networks
¨ Chapter: “Disk Storage, Basic File Structures and Hashing” in Elmasri and Navathe

Part 2: Indexes


Why?

¨ Important to understand how high-level abstractions (databases) map down to low-level concepts (disks, files)
– Get a sense of the scale of the quantities involved (seek times, overhead of inefficient solutions)
– Appreciate the difference that smart solutions can bring
– Understand where the bottlenecks lie
¨ Give a “bottom-up” perspective on data management
– See the whole picture starting from the low-level
– Demystify some aspects that can seem opaque (B-trees, hashing, file organization)
– Apply to many areas of computer science (OS, algorithms…)


The Memory Hierarchy


Flash Storage

Data on Disks

¨ Databases ultimately rely on non-volatile disk storage
– Data typically does not fit in (volatile) memory
¨ Physical properties of disks affect performance of the DBMS
– Need to understand some basics of disks
¨ A few exceptions to disk-based databases:
– Some real-time applications use “in-memory databases”
– Some legacy/massive applications use tape storage as well

¨ Different tradeoffs with flash-based storage
– Much faster to read, but limits on the number of erase (write) cycles
– No major difference between random access and linear scan
– “Flash databases” are a niche, but growing area


Rotating Disk: 5000 – 10000 RPM

¨ Sector size: 0.5KB – 4KB, basic unit of data transfer from disk
¨ Seek time: move read head into position, currently ~4ms
– Includes rotational delay: wait for sector to come under read head
– Random access: (1 / 0.004s) × 4KB ≈ 1MB/second: quite slow
¨ Track-to-track move, currently ~0.4ms: 10 times faster
– Sustained read/write rate: 100MB/second (caching can improve)
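The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch using the slide's round figures (4ms per random access, 4KB sectors, 100MB/s sustained); the constant and function names are illustrative, and real drives vary.

```python
# Figures taken from the slide; real drives differ.
SEEK_TIME_S = 0.004                     # ~4 ms per random access (seek + rotational delay)
SECTOR_BYTES = 4 * 1024                 # 4 KB transferred per random access
SEQUENTIAL_BYTES_PER_S = 100 * 10**6    # ~100 MB/s sustained transfer


def random_access_throughput(seek_time_s: float, sector_bytes: int) -> float:
    """Bytes per second when every sector read pays a full seek."""
    return sector_bytes / seek_time_s


random_bps = random_access_throughput(SEEK_TIME_S, SECTOR_BYTES)
ratio = SEQUENTIAL_BYTES_PER_S / random_bps

print(f"random access: {random_bps / 10**6:.2f} MB/s")
print(f"sequential:    {SEQUENTIAL_BYTES_PER_S / 10**6:.0f} MB/s")
print(f"sequential is roughly {ratio:.0f}x faster")
```

With these numbers the gap between sequential and random access is about two orders of magnitude.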


Disk properties: the fundamental contrast

¨ Random access is slow, sequential access is fast
– By factors of up to 100s
– Want to design storage of data to avoid or minimize random access and make data access as fast as possible
¨ Buffering can help in multithreaded systems:
– Work on other processes while waiting for data to arrive
– Double buffering: maintain two buffers of data; work on current buffer of data, while other buffer fills from disk
– Maximizes parallel utilization, but doesn’t make my thread faster


Records: the basic unit of the database

¨ Databases fundamentally composed of records
– Each record describes an object with a number of fields
¨ Fields have a type (integer, float, string, time, compound…)
– Fixed or variable length
¨ Need to know when one field ends and the next begins
– Field length codes
– Field separators (special characters)
¨ Leads to variable length records
– How to effectively search through data with variable length records?


Records and Blocks

¨ Records get stored on disks organized into blocks
¨ Small records: pack an integer number into each block
– Leaves some space left over in blocks
– Blocking Factor: (average) number of records per block
¨ Large records: may not be effective to leave slack
– Records may span across multiple blocks (spanned organization)
– May use a pointer at end of block to point to next block


Files

¨ A sequence of records is stored as a file
– Either using OS file system support, or handled by DBMS
¨ Database requires support for various file operations:
– Open file, return new file handle
– Scan for the next record that satisfies a search condition
– Read the next record from disk into memory
– Delete the current record and (eventually) update file on disk
– Modify the current record and (eventually) update file on disk
– Insert a new record at the current location
– Close the file, flush any buffers and postponed operations
¨ Need suitable file layout and indices to allow fast scan operation


File organization: unordered

¨ Just dump the records on disk in no particular order

¨ Insert is very efficient: just add to last block

¨ Scan is very inefficient: need to do a linear search
– Read half the file on average
¨ Delete could be inefficient:
– Read whole file, write it back with deleted record omitted
– Instead, just “mark” record as deleted
– Periodically remove marked records


File organization: ordered

¨ Keep records ordered on some (key) attribute
¨ Can scan through records in that order very easily
¨ Can search for a value (or range of values) by binary search
– Binary search: log_2 b seeks to find desired record out of b blocks
– Linear search: b/2 seeks on average to find record
¨ Insertion is rather more expensive and complex to do well
– Keep recent records in “overflow buffer” for periodic merge
¨ If modifying the key field, treat as a deletion and an insertion
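A minimal sketch of binary search over the blocks of an ordered file, counting one block read per probe. The in-memory list of lists stands in for disk blocks, and `find_block` is an illustrative name, not part of any DBMS.

```python
import math


def find_block(blocks, key):
    """Binary search over the blocks of a key-ordered file.

    `blocks` is a list of blocks; each block is a sorted, non-empty list of keys.
    Returns (block_number, blocks_read); block_number is None if key is absent.
    """
    lo, hi = 0, len(blocks) - 1
    reads = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        block = blocks[mid]          # one block read (one seek) per probe
        reads += 1
        if key < block[0]:
            hi = mid - 1             # key can only be in an earlier block
        elif key > block[-1]:
            lo = mid + 1             # key can only be in a later block
        else:
            return (mid if key in block else None), reads
    return None, reads


# 3000 blocks of 10 records each, keys 0..29999
blocks = [list(range(i, i + 10)) for i in range(0, 30000, 10)]
block_no, reads = find_block(blocks, 12345)
print(block_no, reads, math.ceil(math.log2(len(blocks))))
```

The number of reads never exceeds the ceiling of log_2 b (12 for b = 3000), against b/2 = 1500 on average for a linear scan.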

File organization: hashed

¨ Use hashing to ensure records with same key are grouped together
¨ Arrange file blocks into M equal sized buckets
– Often, 1 block = 1 bucket
¨ Apply hash function to key field to determine its bucket
¨ Usual hash table concerns emerge
– Need to deal with collisions, e.g. by open addressing, or chaining
– Deletions also get messy, depending on collision method used


External hashing


¨ Don’t store records directly in buckets, store pointers to records
– Pointers are small, fit more in a block
– “All problems in computer science can be solved by another level of indirection” – David Wheeler

External hashing: issues

¨ Aim for 70-90% occupancy of the hash table
– Not too much wastage, not too many collisions
¨ Hash function should spread records evenly across buckets
– If very skewed distribution, we lose benefits of hashing
¨ Still costly if access to records ordered by key is required
– And doesn’t help with accessing records not by key
¨ Main disadvantage: hard to adjust if number of records grows
– Need to resize the hash table
¨ What if too many records hash to the same bucket?
– Can handle extra records by “chaining” to overflow buckets
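A small sketch of a hashed file with M fixed-capacity buckets and overflow chaining. The `HashedFile` class is hypothetical, Python's built-in `hash` stands in for the file's hash function, and in-memory lists stand in for disk blocks.

```python
class HashedFile:
    """M primary buckets of fixed capacity; extra records go to overflow chains."""

    def __init__(self, num_buckets, bucket_capacity):
        self.capacity = bucket_capacity
        self.buckets = [[] for _ in range(num_buckets)]    # primary blocks
        self.overflow = [[] for _ in range(num_buckets)]   # chained overflow records

    def _bucket(self, key):
        # Built-in hash() is an assumption standing in for a real hash function.
        return hash(key) % len(self.buckets)

    def insert(self, key, record):
        b = self._bucket(key)
        if len(self.buckets[b]) < self.capacity:
            self.buckets[b].append((key, record))
        else:
            self.overflow[b].append((key, record))   # bucket full: chain to overflow

    def search(self, key):
        b = self._bucket(key)
        # Check the primary bucket first, then follow the overflow chain.
        for k, record in self.buckets[b] + self.overflow[b]:
            if k == key:
                return record
        return None

    def occupancy(self):
        """Fraction of primary bucket slots in use."""
        used = sum(len(b) for b in self.buckets)
        return used / (self.capacity * len(self.buckets))


f = HashedFile(num_buckets=4, bucket_capacity=2)
for k in range(10):
    f.insert(k, f"rec{k}")
print(f.search(8), f.occupancy(), f.overflow[0])
```

Here ten records are squeezed into eight primary slots, so occupancy is 100% and two records spill into overflow chains, which is the situation the 70-90% target is meant to avoid.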


Hashing: Overflow buckets


Extendible hashing

¨ Hashing scheme that allows the hash table to grow and shrink
– Avoid wasted space and avoid excessive collisions
¨ Makes use of a directory of bucket addresses
– Directory size is a power of two, 2^d
– So can double or halve the directory size as needed
– The first d bits of the hash value are used to index into the directory
¨ Directory entries point to disk blocks storing records
– Contiguous directory entries can point to same disk block
– Disk blocks can have a local value of d, d’
¨ Insertions into a block may cause it to overflow and split in two
– The directory is then updated accordingly



¨ Extendible hashing example
– Some values of d’ less than global d

Extendible Hashing: Updating d

¨ If a bucket becomes full, may need to increase d → d + 1
– Double the size of the directory
¨ Similarly, if all buckets have local d’ < d, can decrease d → d – 1
– Halve the size of the directory
¨ Other adaptive hashing variants exist
– Dynamic hashing: binary tree directory
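The directory doubling and bucket splitting can be sketched as follows. This is a minimal illustration under several assumptions: keys are non-negative integers, the hash is a 32-bit multiplicative hash chosen only so that the leading bits are well spread, and shrinking the directory is left out. The names `ExtendibleHash`, `Bucket`, and `h` are hypothetical.

```python
HASH_BITS = 32


def h(key: int) -> int:
    """Illustrative 32-bit multiplicative hash (an assumption, not from the slides)."""
    return (key * 2654435761) % (1 << HASH_BITS)


class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth   # d' on the slides
        self.items = {}


class ExtendibleHash:
    def __init__(self, bucket_capacity=2):
        self.capacity = bucket_capacity
        self.global_depth = 0            # d on the slides
        self.directory = [Bucket(0)]     # 2^d entries

    def _index(self, key):
        # The first d bits of the hash value index into the directory.
        return h(key) >> (HASH_BITS - self.global_depth)

    def search(self, key):
        return self.directory[self._index(key)].items.get(key)

    def insert(self, key, value):
        while True:
            bucket = self.directory[self._index(key)]
            if key in bucket.items or len(bucket.items) < self.capacity:
                bucket.items[key] = value
                return
            self._split(bucket)          # bucket overflow: split, then retry

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            # d -> d + 1: old entry i becomes entries 2i and 2i + 1.
            self.directory = [b for b in self.directory for _ in range(2)]
            self.global_depth += 1
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth)
        shift = self.global_depth - bucket.local_depth
        for i, b in enumerate(self.directory):
            # Entries whose newly significant bit is 1 move to the new bucket.
            if b is bucket and (i >> shift) & 1:
                self.directory[i] = new_bucket
        old_items, bucket.items = bucket.items, {}
        for k, v in old_items.items():
            self.directory[self._index(k)].items[k] = v


table = ExtendibleHash(bucket_capacity=2)
for k in range(20):
    table.insert(k, f"rec{k}")
print(table.global_depth, len(table.directory), table.search(7))
```

After the inserts, buckets whose local depth d’ is below the global depth d are shared by 2^(d – d’) contiguous directory entries.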


RAID disk technology

¨ RAID originally a way to combine multiple cheap disks for reliability
– “Redundant Array of Inexpensive Disks” (1980s)
¨ Now general purpose approach to providing reliability
– “Redundant Array of Independent Disks”
– Sets of different levels of replication
¨ RAID 0: spread data over multiple disks (striping)
– Increases throughput, but increases risk of data loss


Important RAID levels

¨ RAID 1: duplication of data across multiple disks (mirroring)
– Data copied to 2 (or more) disks
– Disk reliability measured in “mean time between failures” (MTBF)
– Typical MTBF is 100K hours – 1M hours (~ 1 century)
– Chance of both disks failing at same time is small
– So enough time to recover a copy
¨ RAID 5: block level striping and parity coding spread over disks
– Parity coding: allows recovery of 1 missing disk


[Figure: parity coding example. Data bits: 1 0 1 1 0; parity bit: 1]
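Parity coding can be sketched with XOR: the parity block is the XOR of the data blocks, so any one missing block is the XOR of everything that survives. The sketch below covers a single stripe only; the function names are illustrative, and the way RAID 5 rotates the parity block across disks is not modelled.

```python
from functools import reduce


def parity_block(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover_missing(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors plus the parity block."""
    return parity_block(list(surviving_blocks) + [parity])


# Bit-level example from the figure: data bits 1 0 1 1 0 give parity bit 1.
bits = [1, 0, 1, 1, 0]
parity_bit = reduce(lambda a, b: a ^ b, bits)

# Block-level example: three data disks and one parity block.
data = [b"disk", b"fail", b"safe"]
parity = parity_block(data)
recovered = recover_missing([data[0], data[2]], parity)   # the second disk is lost
print(parity_bit, recovered)
```

Losing two blocks from the same stripe cannot be repaired this way, since one parity block only supplies one equation per position.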

RAID levels

¨ RAID 6: Reed-Solomon coding allows multiple disk losses
¨ Other RAID levels (2, 3, 4) not in common usage


Storage Area Networks

¨ Storage Area Networks: virtual disks
– Disks attached to “headless” server
– Easy to configure, low maintenance overhead
¨ Many advantages to SANs:
– Flexible configuration: hot-swap new disks in/out
– Can be physically remote from other network elements; provided on fast (fibre-based) network
– Separate storage for server configuration, OS updates etc.


Outline

Part 2:
¨ Indexes: primary and secondary
¨ Multilevel indexes and B-trees
¨ Chapter: “Indexing Structure for Files” in Elmasri and Navathe


Indexing for Files

¨ Chapter: “Indexing Structure for Files” in Elmasri and Navathe
– Move focus from how file is stored on disk to how file is accessed / indexed by the DBMS
¨ Index: an auxiliary file that makes it faster to find certain records
– An index is usually for one field of the record (e.g. index by name)
– Can have multiple indexes, each for different fields
¨ A basic form of an index is a sorted list of pointers
– <field value, pointer to record>, ordered by field value
– “An access path” for the indexed field


Indexes as access paths

¨ Indexes usually take up much less space than the original file
– Each index entry is much smaller than the full record
– Just need a field value, and a pointer (few bytes)
¨ Efficient to look up matching records
– Binary search on the index, then follow pointer
¨ The index may be dense or sparse
– Dense index: contains an entry for every possible search value
– Sparse index: contains entries only for some search values
¨ Can have an index on the field that the file is sorted on! Why?
– Can be faster to search via index than do binary search on file


Primary Index

¨ A primary index applies when the file is ordered by a key field
¨ A sparse index: one entry for each block of the data file
– An index for the first record in the block (the block anchor)
– Can be much fewer entries in index than in the data file
¨ Straightforward to search for a record
– Use the index to find the block that the record should be in
– Retrieve the block and see if the record is there
¨ Insertion and deletion of records in the main file is a pain!
– Almost all the pointers change!
¨ Some standard tricks to mitigate the pain
– Buffer updates in an “overflow” file and check against this
– Linked list of overflow records for each block as needed
– Mark records as deleted, and only purge periodically
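Lookup through a sparse primary index can be sketched with the standard `bisect` module: find the last block anchor that is not greater than the search key, then read that one data block. The data layout (lists of `(key, record)` tuples) and the function names are assumptions made for illustration.

```python
from bisect import bisect_right


def build_primary_index(blocks):
    """One entry per data block: (key of the block's first record, block number)."""
    return [(block[0][0], block_no) for block_no, block in enumerate(blocks)]


def lookup(index, blocks, key):
    anchors = [anchor for anchor, _ in index]
    pos = bisect_right(anchors, key) - 1    # last anchor <= key
    if pos < 0:
        return None                         # key is smaller than every anchor
    _, block_no = index[pos]
    for k, record in blocks[block_no]:      # a single data-block read
        if k == key:
            return record
    return None


# 50 records sorted on key, packed 10 per block, giving 5 index entries.
records = [(k, f"employee-{k}") for k in range(0, 100, 2)]
blocks = [records[i:i + 10] for i in range(0, len(records), 10)]
index = build_primary_index(blocks)
print(len(index), lookup(index, blocks, 42), lookup(index, blocks, 43))
```

The index holds 5 entries for 50 records, which is the space saving that makes the sparse index worth searching first.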


Indexing Example


¨ Example: Given a data file EMPLOYEE(NAME, SSN, ADDRESS, JOB, SAL, ... )

¨ Suppose that:
– record size R = 100 bytes (fixed size)
– block size B = 1024 bytes
– file size r = 30000 records
¨ Blocking factor Bfr = ⌊B / R⌋ = ⌊1024 / 100⌋ = 10 records/block
¨ Number of file blocks b = ⌈r / Bfr⌉ = (30000 / 10) = 3000 blocks

Indexing Example

¨ For an index on the SSN field, assume the field size V_SSN = 9 bytes and the record pointer size P_R = 6 bytes. Then:
– index entry size R_I = (V_SSN + P_R) = (9 + 6) = 15 bytes
– index blocking factor Bfr_I = ⌊B / R_I⌋ = ⌊1024 / 15⌋ = 68 entries/block
– number of index blocks b_I = ⌈b / Bfr_I⌉ = ⌈3000 / 68⌉ = 45 blocks (one index entry per data block)
– binary search needs ⌈log_2(b_I)⌉ = ⌈log_2(45)⌉ = 6 block accesses [In practice, likely that these 45 blocks would end up in cache]
¨ This is compared to an average linear search cost of:
– (b/2) = 3000/2 = 1500 block accesses
¨ If the file records are ordered, the binary search cost would be:
– ⌈log_2(b)⌉ = ⌈log_2(3000)⌉ = 12 block accesses
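The figures in this example can be reproduced in Python. The variable names below mirror the slide's symbols; the sketch assumes the primary index is sparse, with one entry per data block, so the index has b = 3000 entries.

```python
import math

B = 1024       # block size in bytes
R = 100        # record size in bytes
r = 30000      # number of records
V_SSN = 9      # size of the indexed field in bytes
P_R = 6        # size of a record pointer in bytes

bfr = B // R                                     # records per data block
b = math.ceil(r / bfr)                           # data blocks in the file
R_i = V_SSN + P_R                                # bytes per index entry
bfr_i = B // R_i                                 # index entries per block
b_i = math.ceil(b / bfr_i)                       # index blocks (one entry per data block)

index_search = math.ceil(math.log2(b_i))         # binary search on the index
linear_search = b // 2                           # average linear scan of the data file
file_binary_search = math.ceil(math.log2(b))     # binary search on the ordered data file

print(bfr, b, R_i, bfr_i, b_i)
print(index_search, linear_search, file_binary_search)
```

Going through the index costs 6 block accesses, plus one more to fetch the data block itself, against 12 for binary search on the file and 1500 on average for a linear scan.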


Clustering Index

¨ Clustering index applies when data is ordered on a non-key field
– The field on which data is ordered is called the clustering field
– The data file is described as a clustered file
– Clustering index is sorted list of <field value, pointer> pairs
¨ Why make a distinction between clustering and primary index?
– Field values can appear in many consecutive records
– Only one entry in index for each distinct field value; no point having multiple entries
– Index points to first data block containing the matching value
¨ Same issues with insertion and deletion as for primary index


¨ Cluster index where each distinct value is allocated a whole disk block

¨ Linked list if more than one block is needed


Secondary indexes

¨ Secondary indexes provide a secondary means of access to data
– For when some primary access already exists (e.g. index on key)
¨ A secondary index is on some other field(s)
– Either other candidate key fields which are unique for every record
– Or non-key field with duplicate values
¨ Secondary index is an ordered file of <field value, pointer> pairs
– Pointer can be to a file block, or record within a file
– A dense index: must be one pointer per record
¨ Many secondary indexes can be created for a file
– Allowing access based on different fields
– By contrast, there can be only one primary index


¨ Secondary index with block pointers

¨ Unique data values so structure is simple


Secondary index example

¨ Same set up as previous example: r = 30000 records of size R = 100 bytes, block size B = 1024 bytes
¨ File is stored in 3000 blocks as worked out before
¨ Search for a record based on a field of V = 9 bytes
– Linear search would read 1500 blocks on average
¨ Secondary index on target attribute (9 + 6) = 15 bytes/record
– Blocking factor for index is ⌊1024/15⌋ = 68 entries per block
– Need ⌈30000/68⌉ = 442 blocks to store the (dense) index
– Binary search on index takes ⌈log_2 442⌉ = 9 block accesses
– Slightly more than the primary index (why?)


Secondary index for non-key, non-ordering

¨ Secondary index for a non-key non-ordering field
– I.e. a field that has duplicate values in many records
– Several possible approaches
1. Include duplicate index entries for the same field value (dense)
2. Have variable length entries in the index: a list of pointers to all blocks containing the target value
3. Use an extra level of indirection: fixed length index entries point to list of pointers, arranged as list of disk blocks
¨ Option 3 is most commonly used
– All options are painful when data file is subject to insert/deletes
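Option 3 can be sketched as a sorted list of (value, pointer list) entries, where each pointer list names the data blocks that contain that value. The record layout (dicts with a `dept` field), the sample data, and the function names are hypothetical; on disk the pointer lists would themselves be stored in blocks.

```python
from bisect import bisect_left
from collections import defaultdict


def build_secondary_index(blocks, field):
    """Sorted (value, [block numbers]) entries: one entry per distinct field value."""
    pointer_lists = defaultdict(list)
    for block_no, block in enumerate(blocks):
        for record in block:
            plist = pointer_lists[record[field]]
            if not plist or plist[-1] != block_no:   # list each block only once
                plist.append(block_no)
    return sorted(pointer_lists.items())


def lookup(index, blocks, field, value):
    values = [v for v, _ in index]
    pos = bisect_left(values, value)                 # binary search on the index
    if pos == len(index) or index[pos][0] != value:
        return []
    matches = []
    for block_no in index[pos][1]:                   # follow the pointer list
        matches.extend(rec for rec in blocks[block_no] if rec[field] == value)
    return matches


blocks = [
    [{"name": "Ann", "dept": 3}, {"name": "Bob", "dept": 1}],
    [{"name": "Cy", "dept": 3}, {"name": "Di", "dept": 2}],
    [{"name": "Ed", "dept": 1}, {"name": "Flo", "dept": 3}],
]
index = build_secondary_index(blocks, "dept")
print(index)
print([rec["name"] for rec in lookup(index, blocks, "dept", 1)])
```

The index entries stay fixed in number as duplicates are added; only the pointer lists grow, which is the appeal of the extra level of indirection.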



¨ “Option 3” secondary index

Single Level Indexing Summary

¨ Primary index: on the field that the data is sorted by
– Allows faster access than searching the file directly
¨ Secondary index: on any field(s) in the data
– Can have multiple secondary indexes
– Typically dense
¨ All indexes require extra effort to maintain if the data is subject to frequent updates (insert/delete operations)


Multilevel Indexing

¨ The indexes described so far miss a trick: they do binary search
– But we can read a block of k index records at a time
– Can do a k-way split instead of a 2-way split
– Improves cost from log_2 N to log_k N
¨ Another way to look at it: if index is large, build index on index…
– Original index is first level index, then there is second level index
– Can repeat, creating third level index, fourth level index…
– Until top level of index fits into one disk block
– For all realistic file sizes, a constant number of levels is needed
¨ Apply this idea to any index type (primary, secondary, cluster)
– Assume first level index has fixed length, distinct valued entries


Two-level index


Example

¨ Convert previous example into a multilevel index
– Blocking factor for indexes remains 68
– 442 blocks of first level index
– Second level index: ⌈442/68⌉ = 7 blocks
– Third level index fits in 1 block: stop here!
¨ Hence, need three levels of index: three accesses to find (pointer to) target record
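The level-by-level calculation can be written as a short loop: each level needs one entry per block of the level below, and the process stops when a level fits in a single block. `index_levels` is an illustrative helper, not a standard function.

```python
import math


def index_levels(first_level_entries, fan_out):
    """Blocks needed at each index level, from the first level up to the top."""
    levels = []
    entries = first_level_entries
    while True:
        blocks = math.ceil(entries / fan_out)
        levels.append(blocks)
        if blocks == 1:
            return levels            # top level fits in one block: stop
        entries = blocks             # next level: one entry per block below


levels = index_levels(first_level_entries=30000, fan_out=68)
print(levels, len(levels))   # [442, 7, 1] 3
```

The number of levels grows like log_68 of the number of entries, which is why a handful of levels covers any realistic file size.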


Dynamic multilevel indexes

¨ Can we modify our storage of indices to make handling inserts/deletes less painful?

¨ Use tree-structure to directly access data
– Keep some space in file blocks to reduce cost of updates

¨ Use the language of trees to describe the structure:


Search trees

¨ A search tree: a tree where each node contains at most p-1 search values and p pointers as P_1, K_1, P_2, K_2, …, K_{q-1}, P_q, with q ≤ p
– The values are in order: K_1 < K_2 < … < K_{q-1}
– Each pointer P_i points to a subtree so that K_{i-1} < X ≤ K_i for all keys X in the subtree
¨ Rules allow efficient search for any key value
– Search within the only subtree it can be in at each level


Search tree example

¨ Leaf-level entries have the full record
¨ Insertion is easier: we can add a new block without having to rewrite the rest of the tree
¨ If tree is unbalanced (some very deep paths), searches are long
– Try to avoid by using rules to avoid tree getting unbalanced
– Perform occasional rebalancing or “self-balancing” trees


B-trees and B+-trees

¨ B-trees add the constraint that the tree should be balanced
– The root to leaf path should be about the same length for all leaves
– Avoid wasted space: each node should be between half full and full
¨ B+-tree is a slight modification of B-tree that is now the standard
– B-trees: allow pointers to data at all levels of the tree
– B+-tree: pointers to data only at the leaf level
– B+-tree slightly simpler (fewer cases to handle with updates)
¨ The trees can be used for (primary, secondary) multi-level indexes
– Updates to data can be reflected in tree easily
¨ These trees are widely used in file systems and database systems
– File systems: NTFS [Windows], NSS, XFS, JFS – for directory entries
– DBMSs: IBM DB2, Informix, MS SQL Server, Oracle, SQLite


B+-tree


¨ Internal nodes: P_1, K_1, P_2, K_2, …, K_{q-1}, P_q, where p/2 < q ≤ p
¨ Leaf nodes: K_1, Pr_1, K_2, Pr_2, …, K_{q-1}, Pr_{q-1}, P_next, where p/2 < q ≤ p
– K_i, Pr_i: Pr_i points to record with value K_i
– P_next points to the next leaf node in the tree (for linear access)

B+-tree: Search

¨ Search on a B+-tree is fairly straightforward
– Start at root block
– While not at a leaf block: determine between which values in the block the key falls, and follow the relevant pointer to the new block
– Search current leaf block for desired value
– If found, follow pointer to retrieve record
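The search procedure can be sketched on a small hand-built tree. The `Leaf` and `Internal` classes are hypothetical in-memory stand-ins for disk blocks, strings stand in for record pointers, and the child choice follows the search-tree rule that the subtree under the i-th pointer holds keys X with K_{i-1} < X ≤ K_i.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Leaf:
    keys: List[int]
    records: List[str]                  # stand-ins for the record pointers
    next: Optional["Leaf"] = None       # pointer to the next leaf


@dataclass
class Internal:
    keys: List[int]
    children: list                      # len(children) == len(keys) + 1


def search(root, key):
    node = root
    while isinstance(node, Internal):
        # Number of keys strictly below `key` picks the only possible subtree.
        node = node.children[bisect_left(node.keys, key)]
    i = bisect_left(node.keys, key)     # search within the leaf block
    if i < len(node.keys) and node.keys[i] == key:
        return node.records[i]
    return None


leaf3 = Leaf([9, 12], ["r9", "r12"])
leaf2 = Leaf([6, 7, 8], ["r6", "r7", "r8"], next=leaf3)
leaf1 = Leaf([1, 3, 5], ["r1", "r3", "r5"], next=leaf2)
root = Internal(keys=[5, 8], children=[leaf1, leaf2, leaf3])

print(search(root, 7), search(root, 5), search(root, 4))

# The leaf chain gives ordered access without revisiting internal nodes.
leaf, in_order = leaf1, []
while leaf is not None:
    in_order.extend(leaf.keys)
    leaf = leaf.next
print(in_order)
```

Each loop iteration corresponds to one block read, so the cost of a search is the height of the tree.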


B-tree: insertion

¨ As with many tree algorithms, insertion is based on search
– Start by searching for where the record should be
– If room in the leaf block, insert a pointer to the new record
– Else, split the leaf block into two, and insert the pointer
¨ Now there are two leaf blocks: need to update parent
– Similar process to update parent: may need to split parent
– May propagate back to root
¨ Note that we do not explicitly attempt to keep tree balanced
– The condition p/2 < q ≤ p ensures that it can’t be too unbalanced
¨ Algorithms fans: condition ensures height is O(log n) for n keys
– Worst case time for {insert, delete, search} is O(log n)


B+-tree: deletion

¨ Essentially the inverse of insertion
– Find the record to delete from the B+-tree
– Remove the pointer and if block is still large enough, halt
– Else, try to redistribute: move entries from sibling block
– If can’t redistribute, merge the two siblings
– Then delete one pointer from parent and recurse up tree


Summary

¨ Disk properties and file storage
¨ File organizations: ordered, unordered, and hashed
¨ Storage topics: RAID and Storage area networks
¨ Indexes: primary and secondary
¨ Multilevel indexes and B-trees
¨ Chapter: “Disk Storage, Basic File Structures and Hashing” in Elmasri and Navathe
¨ Chapter: “Indexing Structure for Files” in Elmasri and Navathe

