“Developing a Step-by-Step Checklist for Content Creation in Different Formats”
Review How DBMS physically organizes data Different file organizations or access methods Record...
-
Upload
maia-caufield -
Category
Documents
-
view
221 -
download
0
Transcript of Review How DBMS physically organizes data Different file organizations or access methods Record...
Lecture # 7
Review◦ How DBMS physically organizes data◦ Different file organizations or access methods
Record formats Page Formats What is Indexing? Different indexing methods How to create indexes using SQL
Agenda
DBMS has to store data somewhere Choices:
◦ Main memory Expensive – compared to secondary and tertiary storage Fast – in memory operations are fast Volatile – not possible to save data from one run to its next Used for storing current data
◦ Secondary storage (hard disk) Less expensive – compared to main memory Slower – compared to main memory, faster compared to tapes Persistent – data from one run can be saved to the disk to be used in
the next run Used for storing the database
◦ Tertiary storage (tapes) Cheapest Slowest – sequential data access Used for data archives
Review previous lecture
This means that data needs to be◦ read from the hard disk into memory (RAM)◦ Written from the memory onto the hard disk
Because I/O disk operations are slow query performance depends upon how data is stored on hard disks
The lowest component of the DBMS performs storage management activities
Other DBMS components need not know how these low level activities are performed
DBMS stores data on hard disks
A disk is organized into a number of blocks or pages
A page is the unit of exchange between the disk and the main memory
A collection of pages is known as a file DBMS stores data in one or more files on
the hard disk
Basics of Data storage on hard disk
Database tables are made up of one or more tuples (rows)
Each tuple has one or more attributes One or more tuples from a table are
written into a page on the hard disk◦ Larger tuples may need more than one page!◦ Tuples on the disk are known as records◦ Records are separated by record delimiter◦ Attributes on the hard disk are known as fields◦ Fields are separated by field delimiter
Database Tables on Hard Disk
Page : abstraction is used for I/O Record : data granularity for higher level of
DBMS
How to arrange records in pages?◦ Identify a record:
<page_id, slot_number>, where slot_number = rid Most cases, use <page_id, slot_number> as rid.
Alternative approaches to manage slots on a page
How to support insert/deleting/searching?
Page Formats
Information about field types same for all records in a file
Stored record format in system catalogs.+ Finding i’th field does not require scan of
record, just offset calculation.
Records Formats: Fixed Length Record
Base address (B)
L1 L2 L3 L4
F1 F2 F3 F4
Address = B+L1+L2
Record id = <page id, slot #>.
Note: In first alternative, moving records for free space management changes rid; may not be acceptable if existing external references to the record that is moved.
Page Formats: Fixed Length Records
Slot 1Slot 2
Slot N
. . . . . .
N M10. . .
M ... 3 2 1PACKED UNPACKED, BITMAP
Slot 1Slot 2
Slot N
FreeSpace
Slot M
11
number of records
numberof slots
Two alternative formats (# fields is fixed):
Record Formats: Variable Length
+ Second offers direct access to i’th field+ efficient storage of nulls ; - small directory overhead.
4 $ $ $ $
FieldCount
Fields Delimited by Special Symbols
F1 F2 F3 F4
F1 F2 F3 F4
Array of Field Offsets
Slot directory = {<record_offset, record_length>}
Page Formats: Variable Length Records
Page iRid = (i,N)
Rid = (i,2)
Rid = (i,1)
Pointerto startof freespace
SLOT DIRECTORY
N . . . 2 120 16 24 N
# slots
Offset of record from start of data area
Length = 24
Length = 16
Length = 20
Slot directory = {<record_offset, record_length>} Dis/Advantages:
+ Moving: rid is not changed+ Deletion: offset = -1 (rid changed?
Can we delete slot? Why?)+ Insertion: Reuse deleted slot.
Only insert if none available.
Free space? Free space pointer? Recycle after deletion?
Page Formats: Variable Length Records
Meta information stored in system catalogs.
For each index:◦ structure (e.g., B+ tree) and search key fields
For each relation:◦ name, file name, file structure (e.g., Heap file)◦ attribute name and type, for each attribute◦ index name, for each index◦ integrity constraints
For each view:◦ view name and definition
Plus statistics, authorization, buffer pool size, etc.
System Catalogs
Catalogs are themselves stored as relations!
Attr_Cat(attr_name, rel_name, type, position)
attr_name rel_name type positionattr_name Attribute_Cat string 1rel_name Attribute_Cat string 2type Attribute_Cat string 3position Attribute_Cat integer 4sid Students string 1name Students string 2login Students string 3age Students integer 4gpa Students real 5fid Faculty string 1fname Faculty string 2sal Faculty real 3
File Organization & Indexing
The physical arrangement of data in a file into records and pages on the disk
File organization determines the set of access methods for◦ Storing and retrieving records from a file
Therefore, ‘file organization’ synonymous with ‘access method’
We study three types of file organization◦ Unordered or Heap files◦ Ordered or sequential files◦ Hash files
We examine each of them in terms of the operations we perform on the database◦ Insert a new record◦ Search for a record (or update a record)◦ Delete a record
File Organization
Records are stored in the same order in which they are created
Insert operation◦ Fast – because the incoming record is written at the end
of the last page of the file Search (or update) operation
◦ Slow – because linear search is performed on pages Delete Operation
◦ Slow – because the record to be deleted is first searched for
◦ Deleting the record creates a hole in the page◦ Periodic file compacting work required to reclaim the
wasted space
Unordered Or Heap File
Records are sorted on the values of one or more fields◦ Ordering field – the field on which the records are sorted◦ Ordering key – the key of the file when it is used for record sorting
Search (or update) Operation◦ Fast – because binary search is performed on sorted records◦ Update the ordering field?
Delete Operation◦ Fast – because searching the record is fast◦ Periodic file compacting work is, of course, required
Insert Operation◦ Poor – because if we insert the new record in the correct position we
need to shift all the subsequent records in the file◦ Alternatively an ‘overflow file’ is created which contains all the new
records as a heap◦ Periodically overflow file is merged with the main file◦ If overflow file is created search and delete operations for records in the
overflow file have to be linear!
Ordered or Sequential File
Is an array of buckets◦ Given a record, r a hash function, h(r) computes the index
of the bucket in which record r belongs◦ h uses one or more fields in the record called hash fields◦ Hash key - the key of the file when it is used by the hash
function Example hash function
◦ Assume that the staff last name is used as the hash field◦ Assume also that the hash file size is 26 buckets - each
bucket corresponding to each of the letters from the alphabet
◦ Then a hash function can be defined which computes the bucket address (index) based on the first letter in the last name.
Hash File
Insert Operation◦ Fast – because the hash function computes the
index of the bucket to which the record belongs If that bucket is full you go to the next free one
Search Operation◦ Fast – because the hash function computes the
index of the bucket Performance may degrade if the record is not found in
the bucket suggested by hash function Delete Operation
◦ Fast – once again for the same reason of hashing function being able to locate the record quick
Hash File (2)
Can we do anything else to improve query performance other than selecting a good file organization?
Yes, the answer lies in indexing Index - a data structure that allows the DBMS to locate
particular records in a file more quickly◦ Very similar to the index at the end of a book to locate various
topics covered in the book Types of Index
◦ Primary index – one primary index per file◦ Clustering index – one clustering index per file – data file is
ordered on a non-key field and the index file is built on that non-key field
◦ Secondary index – many secondary indexes per file Sparse index – has only some of the search key values in
the file Dense index – has an index corresponding to every search
key value in the file
Indexing
Primary Indexes
The data file is sequentially ordered on the key field
Index file stores all (dense) or some (sparse) values of the key field and the page number of the data file in which the corresponding record is storedB002 1
B003 1
B004 2
B005 2
B007 3
Branch B002 recordBranch B003 record
Branch B004 recordBranch B005 record
Branch B007 record
Branch BranchNo Street City Postcode
B002 56 Clover Dr London NW10 6EU
B003 163 Main St Glasgow G11 9QX
B004 32 Manse Rd Bristol BS99 1NZ
B005 22 Deer Rd London SW1 4EH
B007 16 Argyll St Aberdeen AB2 3SU
1
2
3
4
ISAM – Indexed sequential access method is based on primary index
Default access method or table type in MySQL, MyISAM is an extension of ISAM
Insert and delete operations disturb the sorting◦ You need an overflow file which periodically needs
to be merged with the main file
Indexed Sequential Access Method
An index file that uses a non primary field as an index e.g. City field in the branch table
They improve the performance of queries that use attributes other than the primary key
You can use a separate index for every attribute you wish to use in the WHERE clause of your select query
But there is the overhead of maintaining a large number of these indexes
Secondary Indexes
You can create an index for every table you create in SQL
For example◦ CREATE INDEX branchNoIndex on
branch(branchNo);◦ CREATE INDEX numberCityIndex on
branch(branchNo,city);◦ DROP INDEX branchNoIndex;
Creating indexes in SQL
Disks provide cheap, non-volatile storage.◦ Random access, but cost depends on location of
page on disk◦ Important to arrange data sequentially to
minimize seek and rotation delays. Buffer manager brings pages into RAM.
◦ Page stays in RAM until released by requestor.◦ Written to disk when frame chosen for
replacement. ◦ Frame to replace based on replacement policy.◦ Tries to pre-fetch several pages at a time.
Summary
DBMS vs. OS File Support◦ DBMS needs features not found in many OSs.
forcing a page to disk controlling the order of page writes to disk files spanning disks ability to control pre-fetching and page
replacement policy based on predictable access patterns
Formats for Records and Pages :◦ Slotted page format : supports variable length
records and allows records to move on page.◦ Variable length record format : field offset
directory offers support for direct access to i’th field and null values.
More Summary
File layer keeps track of pages in a file, and supports abstraction of a collection of records.◦ Pages with free space identified using linked
list or directory structure
Indexes support efficient retrieval of records based on the values in some fields.
Catalog relations store information about relations, indexes and views. ◦ Information common to all records in
collection.
Even More Summary
File organization or access method determines the performance of search, insert and delete operations.◦ Access methods are the primary means to
achieve improved performance Index structures help to improve the
performance further◦ More index structures in the next lecture
Summary