The Memory Hierarchy. Computer systems have several different
components in which data may be stored. Data capacities and
access speeds range over at least seven orders of magnitude. Devices
with the smallest capacity also offer the fastest access speed.
Slide 4
Description of Levels 1. Cache: a megabyte or more of cache
storage. On-board cache is on the same chip as the processor; level-2
cache is on another chip. Cache data is accessed in a few nanoseconds.
Data is moved from main memory to the cache when needed by the
processor. Volatile.
Slide 5
Description of Levels 2. Main Memory: 1 GB or more of main
memory. Instruction execution and data manipulation involve
information resident in main memory. The time to move data from main
memory to the processor or cache is in the 10-100 nanosecond range.
Volatile. 3. Secondary Storage: typically a magnetic disk, with
capacity up to 1 TB. One machine can have several disk units. The time
to transfer a single byte between disk and main memory is around 10
milliseconds.
Slide 6
Description of Levels 4. Tertiary Storage: holds data volumes
measured in terabytes, with significantly higher read/write times
but a lower cost per byte. Retrieval takes seconds or minutes, and
capacities in the petabyte range are possible.
Slide 7
Transfer of Data Between Levels. Data moves between adjacent
levels of the hierarchy. Each level is organized to transfer large
amounts of data to or from the level below. A key technique for
speeding up database operations is to arrange data so that when one
piece of a disk block is needed, it is likely that other data on
the same block will also be needed at about the same time.
Slide 8
Volatile & Non-Volatile Storage. A volatile device forgets
what is stored in it when the power goes off (example: main memory).
A nonvolatile device, on the other hand, is expected to keep its
contents intact even for long periods when the device is turned off
or there is a power failure (examples: secondary and tertiary
storage). Note: no change to the database can be considered final
until it has migrated to nonvolatile, secondary storage.
Slide 9
Virtual Memory. Managed by the operating system. Some data is kept
in main memory and the rest on disk; transfer between the two is in
units of disk blocks (pages). Virtual memory is not a level of the
memory hierarchy.
Slide 10
CS-257 Database System Principles Avinash Anantharamu 102
Slide 11
Index 13.2 Disks 13.2.1 Mechanics of Disks 13.2.2 The Disk
Controller 13.2.3 Disk Access Characteristics
Slide 12
Disks: The use of secondary storage is one of the important
characteristics of a DBMS, and secondary storage is almost
exclusively based on magnetic disks.
Slide 13
Structure of a Disk
Slide 14
Data in Disk 0s and 1s are represented by different patterns in
the magnetic material. A common diameter for the disk platters is
3.5 inches.
Slide 15
Mechanics of Disks. The two principal moving pieces of a hard drive
are (1) the head assembly and (2) the disk assembly. The disk
assembly has one or more circular platters that rotate around a
central spindle; the platters are covered with a thin layer of
magnetic material.
Slide 16
Top View of Disk Surface
Slide 17
Mechanics of Disks. Tracks are concentric circles on a platter.
Tracks are organized into sectors, which are segments of the circle
of a track. Sectors are indivisible as far as errors are concerned.
Blocks are logical units of data transfer.
Slide 18
Disk Controller: controls the actuator that moves the head assembly,
selects the surface from which to read or write, and transfers bits
from the desired sector to main memory.
Slide 19
Simple Single Processor Computer
Slide 20
Disk Access characteristics Seek time: The disk controller
positions the head assembly at the cylinder containing the track on
which the block is located. The time to do so is the seek time.
Rotational latency: The disk controller waits while the first
sector of the block moves under the head. This time is called the
rotational latency.
Slide 21
Disk Access Characteristics. Transfer time: all the sectors of the
block and the gaps between them pass under the head while the disk
controller reads or writes data in these sectors; this delay is
called the transfer time. Latency of the disk: the sum of the seek
time, rotational latency, and transfer time is the latency of the
disk.
Slide 22
13.3 Accelerating Access to Secondary Storage San Jose State
University Spring 2012
Slide 23
13.3 Accelerating Access to Secondary Storage Section Overview
13.3.1: The I/O Model of Computation 13.3.2: Organizing Data by
Cylinders 13.3.3: Using Multiple Disks 13.3.4: Mirroring Disks
13.3.5: Disk Scheduling and the Elevator Algorithm 13.3.6:
Prefetching and Large-Scale Buffering
Slide 24
13.3 Introduction. Average block access time is ~10 ms. Disks may be
busy, and requests may arrive faster than they can be serviced,
leading to unbounded scheduling latency. There are various strategies
to increase disk throughput. The I/O model is the right model for
determining the speed of database operations.
Slide 25
13.3 Introduction (Contd.) Actions that improve database access
speed: place blocks close together, within the same cylinder;
increase the number of disks; mirror disks; use an improved
disk-scheduling algorithm; use prefetching.
Slide 26
13.3.1 The I/O Model of Computation. Suppose we have a computer
running a DBMS that is trying to serve a number of users, has 1
processor, 1 disk controller, and 1 disk, and each user is accessing
different parts of the DB. It can then be assumed that the time
required for disk access is much larger than the time to access main
memory; as a result, the number of block accesses is a good
approximation of the time required by a DB algorithm.
Slide 27
13.3.2 Organizing Data by Cylinders It is more efficient to
store data that might be accessed together in the same or adjacent
cylinder(s). In a relational database, related data should be
stored in the same cylinder.
Slide 28
13.3.3 Using Multiple Disks. If the disk controller supports the
addition of multiple disks and has efficient scheduling, using
multiple disks can improve performance significantly. By striping a
relation across multiple disks, each chunk of data can be retrieved
in parallel, improving performance by up to a factor of n, where n
is the total number of disks the data is striped over.
Slide 29
13.3.4 Mirroring Disks. A drawback of striping data across multiple
disks is that it increases the chance of disk failure. To mitigate
this risk, some DBMSs use a disk-mirroring configuration. Disk
mirroring makes each disk a copy of the other disks, so that if any
disk fails, the data is not lost. Since all the data is in multiple
places, access speed can be increased by more than a factor of n,
since the disk with the head closest to the requested block can be
chosen.
Slide 30
            Advantages                            Disadvantages
Striping    Read/write speedup ~n;                Higher risk of failure
            capacity increased by ~n
Mirroring   Read speedup ~n; reduced failure      High cost per bit;
            risk; fast initial access             slow writes compared to striping
Slide 31
13.3.5 Disk Scheduling. One way to improve disk throughput is to
improve disk scheduling, prioritizing requests so that they are
served more efficiently. The elevator algorithm is a simple yet
effective disk-scheduling algorithm: it makes the heads of a disk
oscillate back and forth, similar to how an elevator goes up and
down. The access request closest to the head's current position, in
the direction of travel, is processed first.
Slide 32
13.3.5 Disk Scheduling. When sweeping outward, the direction of head
movement changes only after the largest cylinder request has been
processed. When sweeping inward, the direction of head movement
changes only after the smallest cylinder request has been processed.
Example:

Cylinder   Time Requested (ms)      Cylinder   Time Completed (ms)
 8000         0                      8000          4.3
24000         0                     24000         13.6
56000         0                     56000         26.9
16000        10                     64000         34.2
64000        20                     40000         45.5
40000        30                     16000         56.8
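The sweep behavior above can be sketched in a short simulation. This is
illustrative only: the seek-time model (1 ms plus 1 ms per 2000 cylinders
traveled) is an assumption chosen for the toy numbers, not a figure from the
slides; with the six requests from the example it reproduces the service
order shown in the table.

```python
# Sketch of the elevator disk-scheduling algorithm. The seek-time model
# below (1 ms + distance/2000 ms) is an assumed illustration.
def elevator(requests, start=0):
    """requests: list of (arrival_ms, cylinder); returns service order."""
    pending = sorted(requests)           # sorted by arrival time
    time, head, direction = 0.0, start, +1
    order = []
    while pending:
        arrived = [(a, c) for (a, c) in pending if a <= time]
        if not arrived:                  # idle until the next request arrives
            time = pending[0][0]
            continue
        # requests lying in the current direction of head movement
        ahead = [(a, c) for (a, c) in arrived if (c - head) * direction >= 0]
        if not ahead:                    # none ahead: reverse direction
            direction = -direction
            continue
        a, c = min(ahead, key=lambda ac: abs(ac[1] - head))  # closest ahead
        time += 1 + abs(c - head) / 2000.0     # assumed seek-time model
        head = c
        order.append(c)
        pending.remove((a, c))
    return order

reqs = [(0, 8000), (0, 24000), (0, 56000), (10, 16000), (20, 64000), (30, 40000)]
print(elevator(reqs))   # same service order as the example table
```

Note that the completion times depend on the seek-time model, but the
service order matches the table: 8000, 24000, 56000, 64000, 40000, 16000.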
Slide 33
13.3.6 Prefetching and Large-Scale Buffering. In some cases we can
anticipate what data will be needed. We can take advantage of this by
prefetching data from disk before the DBMS requests it. Since the
data is then already in memory, the DBMS receives it immediately.
Slide 34
Chapter 13.4 Presented by Timothy Chen Spring 2013
Slide 35
Index 13.4 Disk Failures 13.4.1 Intermittent Failures 13.4.2
Checksums 13.4.3 Stable Storage 13.4.4 Error-Handling Capabilities
of Stable Storage 13.4.5 Recovery from Disk Crashes 13.4.6 Mirroring
as a Redundancy Technique 13.4.7 Parity Blocks 13.4.8 An Improvement:
RAID 5 13.4.9 Coping with Multiple Disk Crashes
Slide 36
Intermittent Failures. An intermittent failure occurs when we try to
read a sector but the correct content of that sector is not delivered
to the disk controller. The controller can tell good reads from bad,
so a bad read can simply be retried. A write can be verified by
reading the sector back and checking that the write was correct.
Slide 37
Checksum: extra information stored with each sector that lets a read
operation determine the sector's good or bad status.
Slide 38
How checksums work. Each sector has some additional bits, set
depending on the values of the data bits stored in that sector. If
the data bits do not agree with the checksum bits on a read, we know
there was an error. The simplest checksum is a single parity bit: a
bit string with an odd number of 1s (e.g., 01101000) has odd parity;
one with an even number of 1s (e.g., 111011100) has even parity. With
the parity bit chosen so every sector has even parity, any single-bit
error is detected.
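A one-bit parity checksum can be sketched in a few lines (a minimal
illustration; real disks use longer checksums with many parity bits or CRCs):

```python
# One-bit parity checksum: the parity bit makes the number of 1s even,
# so any single-bit error in the stored sector is detected.
def parity_bit(bits):
    """Return the bit that makes the total number of 1s even."""
    return sum(bits) % 2

def check(bits_with_parity):
    """A sector is good iff its bits (data + parity) have even parity."""
    return sum(bits_with_parity) % 2 == 0

data = [0, 1, 1, 0, 1, 0, 0, 0]        # 01101000: three 1s, odd parity
stored = data + [parity_bit(data)]      # parity bit 1 -> even overall
assert check(stored)
corrupted = stored[:]
corrupted[2] ^= 1                       # flip one bit: error is detected
assert not check(corrupted)
```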
Slide 39
Stable Storage. Stable storage deals with disk errors that checksums
can detect but not correct. Sectors are paired, and each pair
represents one sector-content X, with left and right copies XL and
XR. A read checks the checksum of XL and, if it is bad, falls back to
XR, retrying until a good value is returned; a write updates XL,
verifies it, and then updates XR.
Slide 40
Error-Handling Capabilities of Stable Storage. Since X has copies XL
and XR, if one of them fails we can still read the other; the chance
that both fail is very small. A write failure, which typically
happens during a power outage, can also be repaired: the surviving
good copy is used to restore the other.
Slide 41
Recovery from Disk Crashes. The most serious mode of failure for
disks is a head crash, in which data is permanently destroyed. To
recover from such a crash, we use RAID methods.
Slide 42
Mirroring as a Redundancy Technique. This is called RAID 1: simply
mirror each disk.
Slide 43
RAID 1 (diagram)
Slide 44
Parity Blocks. This technique is often called RAID 4: one redundant
disk holds, in each bit position, the modulo-2 sum of the
corresponding bits of all the data disks. Example: disk 1: 11110000,
disk 2: 10101010, disk 3: 00111000. The redundant disk holds, for
each column, 0 if the column has an even number of 1s and 1 if it has
an odd number: disk 4: 01100010.
Slide 45
RAID 4 (diagram)
Slide 46
Parity Blocks: Failure Recovery. The parity scheme can recover from
only one disk failure. If two or more disks fail, the data cannot be
recovered using the modulo-2 sum.
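The parity computation and single-disk recovery can be demonstrated with the
slide's example bit strings (a sketch; a real controller operates on whole
blocks, not 8-bit strings):

```python
# RAID 4 parity: the redundant disk is the bitwise modulo-2 sum (XOR) of
# the data disks; any single failed disk is recovered by XOR-ing all the
# surviving disks, since parity = d1 ^ d2 ^ d3.
from functools import reduce

def xor_disks(disks):
    return [reduce(lambda a, b: a ^ b, col) for col in zip(*disks)]

d1 = [int(b) for b in "11110000"]
d2 = [int(b) for b in "10101010"]
d3 = [int(b) for b in "00111000"]
parity = xor_disks([d1, d2, d3])
assert "".join(map(str, parity)) == "01100010"   # matches the slide

# Recover disk 2 after a crash: XOR the remaining data disks with parity.
recovered = xor_disks([d1, d3, parity])
assert recovered == d2
```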
Slide 47
An Improvement: RAID 5
Slide 48
Coping with Multiple Disk Crashes. If more than one disk fails,
neither RAID 4 nor RAID 5 can recover the data. For that we need
RAID 6, which requires at least 2 redundant disks.
Slide 49
Raid 6
Slide 50
Secondary Storage Management 13.5 Arranging data on disk
Mangesh Dahale ID-105 CS 257
Slide 51
Outline Fixed-Length Records Example of Fixed-Length Records
Packing Fixed-Length Records into Blocks Example of Packing
Fixed-Length Records into Blocks Details of Block header
Slide 52
Arranging Data on Disk A data element such as a tuple or object
is represented by a record, which consists of consecutive bytes in
some disk block.
Slide 53
Fixed-Length Records. The simplest records consist of fixed-length
fields. The record begins with a header, a fixed-length region where
information about the record itself is kept. The fixed-length record
header contains: 1. A pointer to the record's schema. 2. The length
of the record. 3. A timestamp indicating when the record was
created.
Slide 54
Example CREATE TABLE employee( name CHAR(30) PRIMARY KEY,
address VARCHAR(255), gender CHAR(1), birthdate DATE );
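As a rough illustration of how such a tuple could be laid out as a
fixed-length record, the sketch below uses Python's struct module. The
4-byte schema pointer, length, and timestamp fields are assumed sizes chosen
for the example, and VARCHAR(255) is given its maximum space, as a
fixed-length layout requires:

```python
# Sketch of a fixed-length record for the employee table above.
# Header: schema pointer (4 bytes), record length (4), timestamp (4);
# body: name CHAR(30), address padded to 256, gender CHAR(1), date (10).
# All field sizes are illustrative assumptions.
import struct
import time

FMT = "<III30s256s1s10s"

def make_record(name, address, gender, birthdate, schema_ptr=0):
    length = struct.calcsize(FMT)        # every record has the same length
    return struct.pack(FMT, schema_ptr, length, int(time.time()),
                       name.encode(), address.encode(),
                       gender.encode(), birthdate.encode())

rec = make_record("Alice", "12 Main St", "F", "1990-01-01")
assert len(rec) == struct.calcsize(FMT)  # 12-byte header + 297-byte body
```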
Slide 55
Packing Fixed-Length Records into Blocks. Records are stored in
blocks of the disk and moved into main memory when we need to access
or update them. A block header is written first, and it is followed
by a series of records.
Slide 56
Example. Along with the header, we can pack as many records as will
fit in one block, as shown in the figure; the remaining space is
unused.
Slide 57
The block header contains the following information: links to one or
more other blocks that are part of a network of blocks; information
about the role played by this block in such a network; information
about which relation the tuples of this block belong to; a directory
giving the offset of each record in the block; and timestamp(s)
indicating the time of the block's last modification and/or
access.
Slide 58
Chapter 13.7 Ashwin Kalbhor Class ID : 107
Slide 59
Agenda Records with Variable Length Fields Records with
Repeating Fields Variable Format Records Records that do not fit in
a block
Slide 60
Example of a record: | name | address | gender | birth date |, with
field offsets 0, 30, 286, and 287, and total record length 297
bytes.
Slide 61
Records with Variable-Length Fields. A simple and effective way to
represent variable-length records is as follows: 1. Fixed-length
fields are kept ahead of the variable-length fields. 2. A header
is put in front of the record. 3. The record header contains the
length of the record and pointers to the beginnings of all
variable-length fields except the first one.
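A minimal sketch of this layout (the field names and the 2-byte header
fields are illustrative assumptions):

```python
# Variable-length record: fixed-length fields (gender, birthdate) come
# first, then the variable-length fields (name, address). The header
# stores the record length and an offset to address, the only
# variable-length field after the first.
import struct

def encode(name, address, gender, birthdate):
    name_b, addr_b = name.encode(), address.encode()
    fixed = gender.encode() + birthdate.encode()   # fixed-length part
    rec_len = 4 + len(fixed) + len(name_b) + len(addr_b)
    addr_off = 4 + len(fixed) + len(name_b)        # pointer to 2nd var field
    return struct.pack("<HH", rec_len, addr_off) + fixed + name_b + addr_b

def decode(rec):
    rec_len, addr_off = struct.unpack_from("<HH", rec)
    gender = rec[4:5].decode()
    birthdate = rec[5:15].decode()
    name = rec[15:addr_off].decode()               # runs up to the pointer
    address = rec[addr_off:rec_len].decode()
    return name, address, gender, birthdate

r = encode("Clint Eastwood", "Carmel, CA", "M", "1930-05-31")
assert decode(r) == ("Clint Eastwood", "Carmel, CA", "M", "1930-05-31")
```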
Slide 62
Example: a record with name and address as variable-length fields.
The header holds the record length and a pointer to the start of
address; then come the fixed-length fields birth date and gender,
followed by name and address.
Slide 63
Records with Repeating Fields. A repeating field is a field that
occurs a variable number of times, each occurrence having the same
length L. All occurrences of field F are grouped together, and a
pointer to the first occurrence of F is put in the header. Given the
length L, the starting offset of any occurrence of the repeating
field can be computed.
Slide 64
Example of a record with repeating fields: a movie-star record with
movies as the repeating field. The header holds other header
information, the record length, a pointer to address, and a pointer
to the movie pointers; the body holds name, address, and the pointers
to movies.
Slide 65
An alternative representation keeps the record itself at a fixed
length and stores the variable-length fields on a separate block. The
record itself keeps track of: 1. pointers to the place where each
repeating field begins, and 2. either how many repetitions there are,
or where the repetitions end.
Slide 66
Storing variable length fields separately from the record.
Slide 67
Variable-Format Records. These are records that do not have a fixed
schema. They are represented by a sequence of tagged fields, each of
which consists of: the attribute or field name, the type of the
field, the length of the field, and the value of the field.
Slide 68
Variable-Format Records example: N S 14 Clint Eastwood R S 16 Hog's
Breath Inn, where N is the code for name, R the code for restaurant
owned, S the code for the string type, and the numbers give each
field's length.
Slide 69
Records That Do Not Fit in a Block. When the length of a record is
greater than the block size, the record is divided and placed into
two or more blocks. The portion of the record in each block is
referred to as a record fragment. A record with two or more fragments
is called a spanned record; a record that does not cross a block
boundary is called an unspanned record.
Slide 70
Spanned Records. Spanned records require the following extra header
information: a bit indicating whether the piece is a fragment; bits
indicating whether it is the first or the last fragment of its
record; and pointers to the next and/or previous fragment of the same
record.
Slide 71
Spanned Records (diagram): block 1 holds record 1 and record 2-a;
block 2 holds record 2-b and record 3; record 2 spans the block
boundary. Each block has a block header, and each record a record
header.
Slide 72
CS257 Lok Kei Leong ( 108 )
Slide 73
Outline Record Insertion Record Deletion Record Update
Slide 74
Insertion. New records may be inserted into a relation whose records
are kept in no particular order, or into one whose records are kept
in a fixed order (e.g., sorted by primary key). A pointer to a record
from outside the block is a structured address. Diagram: a block
containing a header, an offset table, unused space, and Records 4, 3,
2, 1.
Slide 75
What If the Block Is Full? If we need to insert a record into a
particular block but the block is full, we must find room outside the
block. There are two solutions: I. find space on a nearby block, or
II. create an overflow block.
Slide 76
Insertion (solution 1): find space on a nearby block. If block B1 has
no space and space is available on block B2, move records of B1 to
B2. If there are external pointers to the records of B1 that moved to
B2, leave a forwarding address in the offset table of B1.
Slide 77
Insertion (solution 2): create an overflow block. Each block B has in
its header a pointer to an overflow block where additional records of
B can be placed.
Slide 78
Deletion. After deleting a record, we can slide the remaining records
around the block to reclaim the space. If records cannot slide,
maintain an available-space list in the block header to keep track of
the space available. Pointers to a deleted record must not dangle or
wind up pointing to a new record placed in its spot.
Slide 79
Tombstone. What about pointers to deleted records? A tombstone is
placed in place of each deleted record: a bit placed at the first
byte of the deleted record to indicate that the record was deleted
(0 = not deleted, 1 = deleted). A tombstone is permanent.
Slide 80
Update. For fixed-length records, an update has no effect on the
storage system. For variable-length records, updates raise the same
issues as insertion and deletion, except that we never need to create
a tombstone for the old record. If the updated record is longer, we
must create more space on its block, by sliding records or by
creating an overflow block.
Slide 81
Slide 82
Sweta Shah CS257: Database Systems ID: 118
Slide 83
Agenda Query Processor Query compilation Physical Query Plan
Operators Scanning Tables Table Scan Index scan Sorting while
scanning tables Model of computation for physical operators
Parameters for measuring cost Iterators
Slide 84
Query Processor. The query processor is the group of components of a
DBMS that turns user queries and data-modification commands into a
sequence of database operations and executes those operations. The
query processor is responsible for supplying the details of how the
query is to be executed.
Slide 85
The major parts of the query processor
Slide 86
Query compilation. Query compilation itself is a multi-step process
consisting of: Parsing, in which a parse tree representing the query
and its structure is constructed; Query rewrite, in which the parse
tree is converted to an initial query plan; and Physical plan
generation, where the logical query plan is turned into a physical
query plan by selecting algorithms.
Slide 87
Outline of query compilation
Slide 88
Physical Query Plan Operators. Physical query plans are built from
operators, each of which implements one step of the plan. They are
particular implementations of the operators of relational algebra, or
non-relational-algebra operators such as scan, which scans
tables.
Slide 89
Scanning Tables. One of the most basic operations in a physical
query plan. It is necessary, for example, when we want to compute the
join or union of a relation with another relation.
Slide 90
Two basic approaches to locating the tuples of a relation R:
Table-scan: relation R is stored in secondary memory with its tuples
arranged in blocks, and it is possible to get the blocks one by one.
This operation is called a table scan.
Slide 91
Two basic approaches to locating the tuples of a relation R:
Index-scan: if there is an index on some attribute of relation R, we
can use this index to get all the tuples of R. This operation is
called an index scan.
Slide 92
Sorting While Scanning Tables. Why do we need sorting while scanning?
The query could include an ORDER BY clause requiring that a relation
be sorted, and various algorithms for relational-algebra operations
require one or both of their arguments to be sorted relations.
Sort-scan takes a relation R and a specification of the attributes on
which the sort is to be made, and produces R in that sorted
order.
Slide 93
Model of Computation for Physical Operators. Choosing physical plan
operators wisely is essential for a good query processor. The cost of
an operation is measured in the number of disk I/O operations. If an
operator requires the final answer to a query to be written back to
disk, the total cost also depends on the length of the answer and
includes that final write-back.
Slide 94
Improvements in Cost. Major improvements in the cost of the physical
operators can be achieved by avoiding or reducing the number of disk
I/O operations. This can be done by passing the answer of one
operator to the next in main memory, without writing it to
disk.
Slide 95
Parameters for Measuring Costs. Parameters that affect the
performance of a query: the buffer space available in main memory at
the time the query executes; the size of the input and the size of
the output generated; and the size of a disk block and the amount of
main memory available.
Slide 96
Iterators for Implementation of Physical Operators. Many physical
operators can be implemented as an iterator: a group of three
functions that allows a consumer of the result of the physical
operator to get the result one tuple at a time.
Slide 97
Iterator The three functions forming the iterator are: Open:
This function starts the process of getting tuples. It initializes
any data structures needed to perform the operation
Slide 98
Iterator. GetNext: this function returns the next tuple in the
result, adjusting data structures as necessary to allow subsequent
tuples to be obtained. If there are no more tuples to return, GetNext
returns the special value NotFound.
Slide 99
Iterator. Close: this function ends the iteration after all tuples
have been obtained, or after the consumer no longer wants them; it
calls Close on any arguments of the operator.
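The three-function interface can be sketched as a table-scan iterator (a toy
in-memory version; a real implementation would fetch disk blocks inside Open
and GetNext):

```python
# Minimal Open/GetNext/Close iterator for a table scan.
# NotFound is modeled as a sentinel object.
NotFound = object()

class TableScan:
    def __init__(self, tuples):
        self.tuples = tuples

    def Open(self):
        self.pos = 0                 # initialize the scan's data structures

    def GetNext(self):
        if self.pos >= len(self.tuples):
            return NotFound          # no more tuples
        t = self.tuples[self.pos]
        self.pos += 1                # adjust state for subsequent calls
        return t

    def Close(self):
        self.pos = None              # release any resources

scan = TableScan([(1, "a"), (2, "b")])
scan.Open()
out = []
while (t := scan.GetNext()) is not NotFound:
    out.append(t)
scan.Close()
assert out == [(1, "a"), (2, "b")]
```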
Slide 100
One-Pass Algorithms for Database Operations. Chetan Sharma
008565661.
We can divide algorithms for operators into three degrees of
difficulty and cost: 1) One-pass: methods that require at least one
of the arguments to fit in main memory. 2) Two-pass: methods that
work for data too large to fit in available main memory, but not for
the largest imaginable data sets. 3) Multipass: methods that work
without a limit on the size of the data; these are recursive
generalizations of the two-pass algorithms.
Slide 103
One-Pass Algorithms. These read the data only once from disk.
Usually, they require at least one of the arguments to fit in main
memory.
Slide 104
Tuple-at-a-Time Operations. These operations do not require an entire
relation, or even a large part of it, in memory at once. Thus, we can
read a block at a time, use one main-memory buffer, and produce our
output. Examples: selection and projection.
Slide 105
Tuple-at-a-Time A selection or projection being performed on a
relation R
Slide 106
Full-relation, unary operations Now, let us consider the unary
operations that apply to relations as a whole, rather than to one
tuple at a time: a)Duplicate elimination. b)Grouping.
Slide 107
a) Duplicate elimination
Slide 108
b) Grouping. For MIN(a) and MAX(a) aggregates, record the minimum or
maximum value, respectively, of attribute a seen for any tuple in the
group so far. For COUNT aggregation, add one for each tuple of the
group that is seen. For SUM(a), add the value of attribute a to the
accumulated sum for its group. AVG(a) is the hard case: we must
maintain two accumulations, the count of the number of tuples in the
group and the sum of the a-values of these tuples.
Slide 109
b) Grouping. When all tuples of R have been read into the input
buffer and contributed to the aggregation(s) for their group, we can
produce the output by writing one tuple for each group. Note that
until the last tuple is seen, we cannot begin to create output for a
grouping operation. Thus, this algorithm does not fit the iterator
framework very well: the entire grouping has to be done by the Open
method before the first tuple can be retrieved.
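The accumulators described above can be sketched as a one-pass grouping over
(key, value) pairs:

```python
# One-pass grouping: one entry per group holding the accumulators for
# MIN, MAX, COUNT, and SUM; AVG is derived from COUNT and SUM, and the
# output can only be produced after the last tuple has been read.
def one_pass_grouping(tuples):
    """tuples: (group_key, a_value) pairs; returns per-group aggregates."""
    groups = {}
    for key, a in tuples:
        g = groups.setdefault(key, {"min": a, "max": a, "count": 0, "sum": 0})
        g["min"] = min(g["min"], a)
        g["max"] = max(g["max"], a)
        g["count"] += 1
        g["sum"] += a
    # AVG is the hard case: it needs both the COUNT and SUM accumulators.
    for g in groups.values():
        g["avg"] = g["sum"] / g["count"]
    return groups

r = one_pass_grouping([("x", 4), ("y", 10), ("x", 2), ("x", 6)])
assert r["x"] == {"min": 2, "max": 6, "count": 3, "sum": 12, "avg": 4.0}
```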
Slide 110
One-Pass Algorithms for Binary Operations. All other operations are
in this class: set and bag versions of union, intersection,
difference, joins, and products. These binary operations require
reading the smaller of the operands R and S into main memory and
building a suitable data structure, so that tuples can be both
inserted quickly and found quickly. The condition for the operation
to be performed in one pass is: min(B(R), B(S)) <= M, where M is the
number of available memory buffers.
Joining by Using an Index (Algorithm 1). Consider the natural join
R(X,Y) ⋈ S(Y,Z).
Joining by Using an Index (Algorithm 1): Analysis. Consider
R(X,Y) ⋈ S(Y,Z).
Join Using a Sorted Index. Consider R(X,Y) ⋈ S(Y,Z).
Join Using a Sorted Index (Zig-zag join). Consider R(X,Y)
⋈ S(Y,Z).
Multipass Sort-Based Algorithm. INDUCTION (B(R) > M): 1. If R does
not fit into main memory, partition the blocks holding R into M
groups, called R1, R2, ..., RM. 2. Recursively sort each Ri, for
i = 1 to M. 3. Once the sorting is done, the algorithm merges the M
sorted sublists.
Slide 195
Slide 196
Performance: Multipass Sort-Based Algorithms. 1) Each pass of a
sorting algorithm: 1. reads data from the disk, 2. sorts the data
with any main-memory sorting algorithm, and 3. writes the data back
to the disk. 2-1) A k-pass sorting algorithm therefore needs 2kB(R)
disk I/Os. 2-2) An operation on R and S using a k-pass sort-based
algorithm needs A + B disk I/Os, where A = 2(k-1)(B(R) + B(S)) is the
cost of sorting the sublists, and B = B(R) + B(S) is the cost of
reading the sorted sublists in the final pass. Total:
(2k-1)(B(R) + B(S)) disk I/Os.
Slide 197
Multipass Hash-Based Algorithms. 1. Hash the relation into M-1
buckets, where M is the number of memory buffers. 2. Unary case:
apply the operation to each bucket individually. This covers
duplicate elimination (δ) and grouping (γ): grouping computes MIN,
MAX, COUNT, SUM, and AVG over the groups, and duplicate elimination
implements DISTINCT. Basis: if the relation fits in M memory blocks,
read it into memory and perform the operation there. 3. Binary case:
apply the operation to each corresponding pair of buckets. This
covers union, intersection, difference, and join. Basis: if either
relation fits in M-1 memory blocks, read that relation into M-1
blocks of main memory, then read the other relation one block at a
time into the Mth block, performing the operation.
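The basis/induction structure can be sketched for duplicate elimination.
Toy sizes: BLOCK and M are illustrative, and Python's built-in hash (salted
with the recursion depth) stands in for the family of bucket hash
functions:

```python
# Multipass hash-based duplicate elimination: if the data fits in M
# blocks of memory, do it in one pass (basis); otherwise hash into M-1
# buckets and recurse on each bucket (induction).
BLOCK = 2          # tuples per block (tiny, for illustration)
M = 4              # number of memory buffers

def distinct(tuples, depth=0):
    if len(tuples) <= M * BLOCK:          # basis: fits in memory
        seen, out = set(), []
        for t in tuples:
            if t not in seen:
                seen.add(t)
                out.append(t)
        return out
    buckets = [[] for _ in range(M - 1)]  # induction: M-1 buckets
    for t in tuples:
        buckets[hash((t, depth)) % (M - 1)].append(t)
    out = []
    for b in buckets:                     # recurse, accumulate output
        out.extend(distinct(b, depth + 1))
    return out

data = [1, 2, 3, 1, 2, 3, 4, 5, 4, 5, 6, 7]
assert sorted(distinct(data)) == [1, 2, 3, 4, 5, 6, 7]
```

All copies of a value land in the same bucket, so no duplicates survive
across buckets at the same level.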
Slide 198
INDUCTION. If the relation (unary case) or neither relation (binary
case) fits into the main-memory buffers: 1. hash each relation into
M-1 buckets; 2. recursively perform the operation on each bucket or
corresponding pair of buckets; 3. accumulate the output from each
bucket or pair.
Slide 199
Hash-Based Algorithms: Unary Operators
Slide 200
Performance: Hash-Based Algorithms. Let R be a relation, with
operations such as γ and δ, and M the number of buffers. Define
u(M, k) as the number of blocks in the largest relation that a k-pass
hashing algorithm can handle.
Slide 201
Performance: Induction. 1. Assume that the first step divides
relation R into M-1 equal buckets. 2. The buckets for the next pass
must be small enough to handle in k-1 passes. 3. Since R is divided
into M-1 buckets, we have u(M, k) = (M-1)u(M, k-1).
Slide 202
Sort-Based vs. Hash-Based. 1. Sort-based algorithms can produce
output in sorted order, which may be helpful later, e.g., by reducing
rotational latency or seek time. 2. Hash-based algorithms depend on
the buckets being of equal size. For binary operations, hash-based
algorithms limit only the size of the smaller relation; therefore,
hash-based can be faster than sort-based when the smaller relation is
small.
Query Processing. A query flows through query compilation and then
query execution, using the metadata and the data. The query is
compiled; this involves extensive optimization using the operations
of relational algebra. It is first compiled into a logical query
plan, e.g., using expressions of relational algebra, and then
converted to a physical query plan, by steps such as selecting an
implementation for each operator and ordering joins. The query is
then executed.
Slide 206
Outline of Query Compilation. SQL query -> (Parse query) ->
expression tree -> (Select logical plan) -> logical query plan tree
-> (Select physical plan) -> physical query plan tree -> (Execute
plan). Parsing: a parse tree for the query is constructed. Query
Rewrite: the parse tree is converted to an initial query plan and
transformed into a logical query plan. Physical Plan Generation: the
logical plan is converted into a physical plan by selecting
algorithms and an order of execution.
Slide 207
Table Scanning. There are two approaches for locating the tuples of
relation R: Table-scan: get the blocks one by one. Index-scan: use an
index to lead us to all blocks holding R. Sort-scan takes a relation
R and sorting specifications, and produces R in sorted order; in SQL
this is requested with the ORDER BY clause.
Slide 208
Cost Measures. Estimates of cost are essential for query
optimization; they allow us to determine the slow and fast parts of a
query plan. Reading many consecutive blocks on a track is extremely
important, since disk I/Os are expensive in terms of time. Example:
EXPLAIN SELECT * FROM a JOIN b ON a.id = b.id;
Slide 209
Optimizing Queries: Cost Measures. EXPLAIN SELECT snp.* FROM snp JOIN
chr ON snp.chr_key = chr.chr_key WHERE snp_name ''
Slide 210
One-pass Methods. Tuple-at-a-time: selection and projection, which do
not require an entire relation in memory at once. Full-relation,
unary operations: must see all or most of the tuples in memory at
once; used for the grouping and duplicate-elimination operators,
where a hash table (O(n)) or a balanced binary search tree
(O(n log n)) speeds up duplicate detection. Full-relation, binary
operations: these include union, intersection, difference, product,
and join. Review of Algorithms
Slide 211
Nested-Loop Joins. In a sense, this is one-and-a-half passes, since
one argument has its tuples read only once, while the other is read
repeatedly. It can use relations of any size; the arguments do not
all have to fit in main memory. Two variations of nested-loop joins:
Tuple-based: the simplest form; can be very slow, since it takes
T(R)*T(S) disk I/Os if we are joining R(X,Y) with S(Y,Z).
Block-based: organizes access to both argument relations by blocks
and uses as much main memory as possible to store tuples. Review of
Algorithms
Slide 212
Two-pass Algorithms Usually enough even for large relations.
Based on Sorting: Partition the arguments into memory-sized, sorted
sublists. Sorted sublists are then merged appropriately to produce
desired results. Based on Hashing: Partition the arguments into
buckets. Useful if data is too big to store in memory. Review of
Algorithms
Slide 213
Two-pass Algorithms. Sort-based vs. hash-based: hash-based algorithms
are often superior to sort-based ones, since they require only one of
the arguments to be small. Sort-based algorithms work well when there
is a reason to keep some of the data sorted. Review of
Algorithms
Slide 214
Index-based Algorithms Index-based joins are excellent when one
of the relations is small, and the other has an index on join
attributes. Clustering and non-clustering indexes: Clustering index
has all tuples with fixed value packed into minimum number of
blocks. A clustered relation can have non-clustering indexes.
Review of Algorithms
Slide 215
Multi-pass Algorithms. The two-pass algorithms based on sorting or
hashing generalize naturally to three or more passes and will work
for larger data sets. Each pass of a sorting algorithm reads all data
from disk and writes it out again; thus, a k-pass sorting algorithm
requires 2kB(R) disk I/Os. Review of Algorithms
Slide 216
Chapter 18
Slide 217
Dona Baysa ID: 127 CS 257 Spring 2013
Slide 218
Intro Concurrency Control Scheduler Serializability Schedules
Serial and Serializable
Slide 219
Intro: Concurrency Control & Scheduler. Concurrently executing
transactions can cause an inconsistent database state. Concurrency
control assures that transactions preserve consistency. The scheduler
regulates the individual steps of different transactions: it takes
read/write requests from the transactions and either executes or
delays them.
Slide 220
Intro: Scheduler. Transaction requests are passed to the scheduler,
which determines when they execute. Flow: Transaction manager ->
(read/write requests) -> Scheduler -> (reads and writes) ->
Buffers.
Slide 221
Serializability. How do we assure that concurrently executing
transactions preserve database-state correctness? Serializability:
schedule the transactions' actions so the effect is as if the
transactions were executed one at a time.
Slide 222
Schedules. A schedule is a sequence of the important actions (reads
and writes) performed by transactions. Example transactions and
actions:

T1:  READ(A,t)   t := t+100   WRITE(A,t)   READ(B,t)   t := t+100   WRITE(B,t)
T2:  READ(A,s)   s := s*2     WRITE(A,s)   READ(B,s)   s := s*2     WRITE(B,s)
Slide 223
Serial Schedules. All actions of one transaction are followed by all
the actions of another transaction, and so on; there is no mixing of
actions. The effect depends only on the order of the transactions.
The serial schedules here: T1 precedes T2, or T2 precedes
T1.
Slide 224
Serial Schedule: Example. T1 precedes T2; notation: (T1, T2).
Consistency constraint: A = B, initially A = B = 25.

T1: READ(A,t); t := t+100; WRITE(A,t)   -> A = 125
T1: READ(B,t); t := t+100; WRITE(B,t)   -> B = 125
T2: READ(A,s); s := s*2;   WRITE(A,s)   -> A = 250
T2: READ(B,s); s := s*2;   WRITE(B,s)   -> B = 250

Final value: A = B = 250; consistency is preserved.
Slide 225
Serializable Schedules. Serial schedules preserve consistency. Are
there other schedules that also guarantee consistency? Yes:
serializable schedules. Definition: a schedule S is serializable if
there is a serial schedule S' such that, for every initial database
state, the effects of S and S' are the same.
Slide 226
Serializable Schedule: Example A serializable, but not serial, schedule: T2 acts on A after T1 does, but before T1 acts on B. The effect is the same as the serial schedule (T1, T2). Starting from A = B = 25:
T1: READ(A,t); t := t+100; WRITE(A,t)   -- A = 125
T2: READ(A,s); s := s*2;   WRITE(A,s)   -- A = 250
T1: READ(B,t); t := t+100; WRITE(B,t)   -- B = 125
T2: READ(B,s); s := s*2;   WRITE(B,s)   -- B = 250
Slide 227
Notation: Transactions and Schedules Transaction: Ti (for example T1, T2, ...). Database element: X. Actions: a read is rTi(X), abbreviated ri(X); a write is wTi(X), abbreviated wi(X). Examples:
Transactions: T1: r1(A); w1(A); r1(B); w1(B)  and  T2: r2(A); w2(A); r2(B); w2(B)
Schedule: r1(A); w1(A); r2(A); w2(A); r1(B); w1(B); r2(B); w2(B)
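This notation maps directly onto a small data representation; a minimal sketch (the tuple encoding is illustrative, not from the slides):

```python
# Encode each action as a tuple (kind, transaction_id, element),
# e.g. r1(A) -> ('r', 1, 'A') and w2(B) -> ('w', 2, 'B').

T1 = [('r', 1, 'A'), ('w', 1, 'A'), ('r', 1, 'B'), ('w', 1, 'B')]
T2 = [('r', 2, 'A'), ('w', 2, 'A'), ('r', 2, 'B'), ('w', 2, 'B')]

# The schedule from the slide interleaves T1 and T2:
schedule = [('r', 1, 'A'), ('w', 1, 'A'), ('r', 2, 'A'), ('w', 2, 'A'),
            ('r', 1, 'B'), ('w', 1, 'B'), ('r', 2, 'B'), ('w', 2, 'B')]

def is_serial(sched):
    """A schedule is serial if no transaction's actions resume after
    another transaction's actions have intervened."""
    finished = set()          # transactions whose run of actions has ended
    current = None
    for _, tid, _ in sched:
        if tid != current:
            if tid in finished:
                return False  # tid resumed after being interrupted
            if current is not None:
                finished.add(current)
            current = tid
    return True
```

Here is_serial(T1 + T2) is true, while the interleaved schedule above is not serial (though, as the following slides show, it is still serializable).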
Slide 228
Geetha Ranjini Viswanathan ID: 121
Slide 229
18.2 Conflict-Serializability 18.2.1 Conflicts 18.2.2
Precedence Graphs and a Test for Conflict-Serializability 18.2.3
Why the Precedence-Graph Test Works
Slide 230
18.2.1 Conflicts Conflict - a pair of consecutive actions in a
schedule such that, if their order is interchanged, the final state
produced by the schedule is changed.
Slide 231
18.2.1 Conflicts Non-conflicting situations (assuming Ti and Tj are different transactions, i.e., i ≠ j):
ri(X); rj(Y) never conflict, even if X = Y.
ri(X); wj(Y) do not conflict for X ≠ Y.
wi(X); rj(Y) do not conflict for X ≠ Y.
wi(X); wj(Y) do not conflict for X ≠ Y.
Slide 232
18.2.1 Conflicts Conflicting situations: three situations where actions may not be swapped:
Two actions of the same transaction always conflict: ri(X); wi(Y).
Two writes of the same database element by different transactions conflict: wi(X); wj(X).
A read and a write of the same database element by different transactions conflict: ri(X); wj(X) and wi(X); rj(X).
Slide 233
18.2.1 Conflicts Conclusions: any two actions of different transactions may be swapped unless they involve the same database element and at least one of them is a write. Schedules S and S' are conflict-equivalent if S can be transformed into S' by a sequence of non-conflicting swaps of adjacent actions. A schedule is conflict-serializable if it is conflict-equivalent to a serial schedule.
Slide 234
18.2.1 Conflicts Example 18.6: the conflict-serializable schedule S:
r1(A); w1(A); r2(A); w2(A); r1(B); w1(B); r2(B); w2(B)
is converted to the serial schedule S' = (T1, T2) through a sequence of swaps of adjacent non-conflicting actions:
r1(A); w1(A); r2(A); w2(A); r1(B); w1(B); r2(B); w2(B)
r1(A); w1(A); r2(A); r1(B); w2(A); w1(B); r2(B); w2(B)
r1(A); w1(A); r1(B); r2(A); w2(A); w1(B); r2(B); w2(B)
r1(A); w1(A); r1(B); r2(A); w1(B); w2(A); r2(B); w2(B)
S': r1(A); w1(A); r1(B); w1(B); r2(A); w2(A); r2(B); w2(B)
Slide 235
18.2.2 Precedence Graphs and a Test for Conflict-Serializability Given a schedule S involving transactions T1 and T2, T1 takes precedence over T2 (written T1 <S T2) if there is an action A1 of T1 and an action A2 of T2 such that A1 is ahead of A2 in S and A1 and A2 conflict. In the precedence graph for S, the nodes are the transactions of S, and there is an arc from Ti to Tj whenever Ti <S Tj. The test: S is conflict-serializable if and only if its precedence graph is acyclic.
18.2.3 Why the Precedence-Graph Test Works Consider a cycle involving n transactions: T1 -> T2 -> ... -> Tn -> T1. In the hypothetical serial order, the actions of T1 must precede those of T2, which precede those of T3, and so on, up to Tn. But the actions of Tn, which therefore come after those of T1, are also required to precede those of T1. This is a contradiction: the conflicting actions cannot be swapped past one another, so no equivalent serial order exists. Thus, if there is a cycle in the precedence graph, the schedule is not conflict-serializable.
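The precedence-graph test can be sketched directly in code; this is an illustration, not the textbook's algorithm, using the tuple encoding ri(X) -> ('r', i, 'A'):

```python
def conflicts(a1, a2):
    """Two actions conflict if they come from different transactions,
    touch the same element, and at least one is a write."""
    k1, t1, x1 = a1
    k2, t2, x2 = a2
    return t1 != t2 and x1 == x2 and 'w' in (k1, k2)

def precedence_graph(schedule):
    """Arc Ti -> Tj whenever an action of Ti conflicts with a
    later action of Tj in the schedule."""
    edges = set()
    for i, a1 in enumerate(schedule):
        for a2 in schedule[i + 1:]:
            if conflicts(a1, a2):
                edges.add((a1[1], a2[1]))
    return edges

def is_conflict_serializable(schedule):
    """The schedule is conflict-serializable iff the precedence graph
    is acyclic (checked here with Kahn-style node removal)."""
    nodes = {t for _, t, _ in schedule}
    edges = precedence_graph(schedule)
    while nodes:
        free = {n for n in nodes if not any(v == n for _, v in edges)}
        if not free:
            return False        # every remaining node is on a cycle
        nodes -= free
        edges = {(u, v) for u, v in edges if u in nodes and v in nodes}
    return True

S = [('r', 1, 'A'), ('w', 1, 'A'), ('r', 2, 'A'), ('w', 2, 'A'),
     ('r', 1, 'B'), ('w', 1, 'B'), ('r', 2, 'B'), ('w', 2, 'B')]
print(is_conflict_serializable(S))   # True
```

For the schedule of Example 18.6 the only arc is T1 -> T2, so the graph is acyclic and the schedule is conflict-serializable.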
Slide 240
Shailesh Padave ID 111 CS257 Spring 2013 18.3 Enforcing
Serializability by locks
Slide 241
INTRODUCTION Enforcing serializability by locks Locks Locking
scheduler Two phase locking
Slide 242
Locks A lock is maintained on a database element to prevent unserializable behavior. It works as follows: a transaction sends a request; the scheduler consults the lock table to guide its decision; the result is a serializable schedule of actions.
Slide 243
Consistency of transactions Actions and locks must relate to each other: a transaction can read or write an element only if it holds a lock on that element, and unlocking every locked element is compulsory. Legality of schedules No two transactions may hold a lock on the same element at the same time; one must release the lock before the other may acquire it.
Slide 244
Locking scheduler Grants a lock request only if the result is a legal schedule. The lock table stores information about the current locks on the elements. Notation: li(X): transaction Ti requests a lock on database element X; ui(X): transaction Ti releases its lock on database element X.
Slide 245
Locking scheduler (contd.) A legal schedule of consistent transactions that is nevertheless not serializable. Starting from A = B = 25:
T1: l1(A); r1(A); A := A+100; w1(A); u1(A)   -- A = 125
T2: l2(A); r2(A); A := A*2;   w2(A); u2(A)   -- A = 250
T2: l2(B); r2(B); B := B*2;   w2(B); u2(B)   -- B = 50
T1: l1(B); r1(B); B := B+100; w1(B); u1(B)   -- B = 150
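Replaying this interleaving (initial values A = B = 25 as in the slide; the order of the database actions is inferred from the values shown) demonstrates how the consistency constraint A = B is lost even though every lock request is legal:

```python
# Simulate the legal-but-unserializable interleaving from the slide.
# Initial state assumed from the slide: A = B = 25, constraint A = B.
db = {'A': 25, 'B': 25}

# Interleaved order of the database actions (locks elided; all were granted):
db['A'] += 100   # T1: A := A + 100  -> 125
db['A'] *= 2     # T2: A := A * 2    -> 250
db['B'] *= 2     # T2: B := B * 2    -> 50
db['B'] += 100   # T1: B := B + 100  -> 150

print(db)  # {'A': 250, 'B': 150} -- constraint A = B is violated
```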
Slide 246
Locking scheduler (contd.) The locking scheduler delays requests in order to maintain a consistent database state. Here T1 takes its lock on B before releasing A, so T2's request l2(B) is denied until T1 is done with B. Starting from A = B = 25:
T1: l1(A); r1(A); A := A+100; w1(A); l1(B); u1(A)   -- A = 125
T2: l2(A); r2(A); A := A*2; w2(A)                   -- A = 250
T2: l2(B) -- denied, since T1 holds the lock on B
T1: r1(B); B := B+100; w1(B); u1(B)                 -- B = 125
T2: l2(B); u2(A); r2(B); B := B*2; w2(B); u2(B)     -- B = 250
The final state has A = B = 250; consistency is preserved.
Slide 247
Two-phase locking (2PL) Guarantees that a legal schedule of consistent transactions is conflict-serializable. In each transaction, all lock requests precede all unlock requests. The growing phase: obtain locks; no unlocks allowed. The shrinking phase: release locks; no new locks allowed.
Slide 248
Working of Two-Phase Locking Assures serializability. Two stricter protocols for 2PL: Strict two-phase locking: a transaction holds all its write locks until commit/abort. Rigorous two-phase locking: a transaction holds all its locks until commit/abort. Two-phase transactions are serialized in the same order as their first unlocks.
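The two-phase rule itself is easy to enforce per transaction; a minimal sketch (illustrative, not a full lock manager):

```python
class TwoPhaseError(Exception):
    """Raised when a transaction requests a lock after its first unlock."""
    pass

class Transaction2PL:
    """Tracks one transaction's lock/unlock calls and rejects any
    lock request issued after the first unlock (the two-phase rule)."""
    def __init__(self):
        self.held = set()
        self.shrinking = False   # becomes True at the first unlock

    def lock(self, element):
        if self.shrinking:
            raise TwoPhaseError("lock after unlock violates 2PL")
        self.held.add(element)

    def unlock(self, element):
        self.shrinking = True    # shrinking phase begins; no more locks
        self.held.discard(element)

t = Transaction2PL()
t.lock('A'); t.lock('B')   # growing phase
t.unlock('A')              # shrinking phase begins
# t.lock('C') would now raise TwoPhaseError
```

Strict and rigorous 2PL further constrain when unlocks may happen (only at commit/abort); the sketch enforces only the basic two-phase rule.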
Slide 249
Two Phase Locking [Figure: number of locks held plotted against time, growing to a peak and then shrinking, with the peak marked "instantaneously executes now".] Every two-phase-locked transaction has a point at which it may be thought to execute instantaneously.
18. Concurrency Control 18.4. Locking Systems With Several Lock
Modes by Kiruthika Sivaraman ID: 129
Slide 252
Lock Types Shared lock (read lock): to read database element X we use a shared lock; there can be more than one shared lock on X. Exclusive lock (write lock): to write database element X we use an exclusive lock; there can be only one exclusive lock on X.
Slide 253
Notation Used sli(X): transaction Ti requests a shared lock on database element X. xli(X): transaction Ti requests an exclusive lock on database element X. ui(X): transaction Ti unlocks X.
Slide 254
Requirements Consistency of transactions: a transaction may not write without an exclusive lock, and may not read without a lock of some kind. Two-phase locking of transactions: locking must precede unlocking; xli(X) or sli(X) cannot be preceded by ui(Y) for any Y. Legality of schedules: an element may be locked exclusively by one transaction or by several in shared mode, but not both.
Slide 255
Compatibility Matrices A compatibility matrix describes the lock-management policy. It has a row and a column for each lock mode: the row corresponds to a lock already held on element X by another transaction, and the column corresponds to the mode of the lock requested on X:
           Request S   Request X
Held S        Yes          No
Held X        No           No
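The matrix reads as a simple lookup table; a sketch:

```python
# Shared/exclusive compatibility matrix from the slide:
# key = (mode already held on X, mode requested on X).
COMPAT = {
    ('S', 'S'): True,   # many readers may share X
    ('S', 'X'): False,  # a writer must wait for readers
    ('X', 'S'): False,  # readers must wait for the writer
    ('X', 'X'): False,  # only one writer at a time
}

def can_grant(held_modes, requested):
    """Grant the request only if it is compatible with every held lock."""
    return all(COMPAT[(h, requested)] for h in held_modes)

print(can_grant(['S', 'S'], 'S'))  # True
print(can_grant(['S'], 'X'))       # False
```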
Slide 256
Upgrading Locks Transaction T first takes a shared lock on X. Later, when T is ready to write, it upgrades its lock to an exclusive lock on X. ui(X) releases all locks that transaction Ti holds on X. In this way T remains friendly toward other transactions: it does not shut out readers of X until it actually needs to write.
Slide 257
Upgrading Locks - Drawback T1 first establishes a shared lock on X; T2 also establishes a shared lock on X. T1 tries to upgrade its lock to an exclusive lock; T2 tries the same. Neither upgrade can be granted while the other shared lock is held: deadlock! Compatibility matrix with update locks:
           Request S   Request X   Request U
Held S        Yes          No          Yes
Held X        No           No          No
Held U        No           No          No
Slide 258
Update Locks To avoid this deadlock, an update lock is introduced. It is similar to a shared lock, with the difference that only the transaction holding the update lock may upgrade to an exclusive lock. Once a transaction holds an update lock on X, no further locks are granted on X. The deadlock with plain shared locks:
T1: sl1(A)
T2: sl2(A)
T1: xl1(A) -- denied
T2: xl2(A) -- denied
With update locks, T1 would request an update lock on A instead, and T2's subsequent update-lock request would simply be denied, so the deadlock cannot arise.
Slide 259
Increment Lock Two transactions can hold an increment lock on the same database element at the same time. Useful when the order of writes does not matter: increments commute. Example, starting from A = 5: INC(A,2) then INC(A,10) gives 7 then 17, while INC(A,10) then INC(A,2) gives 15 then 17. Either order ends with A = 17.
Slide 260
Increment Lock Compatibility Matrix
           Request S   Request X   Request I
Held S        Yes          No          No
Held X        No           No          No
Held I        No           No          Yes
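The increment-lock matrix, and the commutativity that justifies its I/I entry, can be sketched together:

```python
# Increment-lock compatibility matrix from the slide:
# outer key = mode held on X, inner key = mode requested.
INC_COMPAT = {
    'S': {'S': True,  'X': False, 'I': False},
    'X': {'S': False, 'X': False, 'I': False},
    'I': {'S': False, 'X': False, 'I': True},  # increments commute
}

# Why two I locks may coexist: the two increments from the slide
# reach the same final value in either order.
A = 5
order1 = (A + 2) + 10   # INC(A,2) then INC(A,10)
order2 = (A + 10) + 2   # INC(A,10) then INC(A,2)
print(order1, order2)   # 17 17
```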
Slide 261
Presented By: Akash Patel ID: 113
Slide 262
Overview Overview of Locking Scheduler; Scheduler That Inserts Lock Actions; The Lock Table; Handling Locking and Unlocking Requests
Slide 263
Principles of simple scheduler architecture The transactions
themselves do not request locks, or cannot be relied upon to do so.
It is the job of the scheduler to insert lock actions into the
stream of reads, writes, and other actions that access data.
Transactions do not release locks. Rather, the scheduler releases
the locks when the transaction manager tells it that the
transaction will commit or abort.
Slide 264
Scheduler That Inserts Lock Actions into the transactions' request stream. [Figure: requests such as Read(A), Write(B), and Commit(T) arrive from the transactions at Scheduler Part I, which inserts lock actions (e.g. Lock(A); Read(A)) and passes the stream to Scheduler Part II, which consults the lock table and issues the reads and writes.]
Slide 265
The scheduler maintains a lock table, which, although it is
shown as secondary-storage data, may be partially or completely in
main memory Actions requested by a transaction are generally
transmitted through the scheduler and executed on the database.
Under some circumstances a transaction is delayed, waiting for a
lock, and its requests are not (yet) transmitted to the
database.
Slide 266
The two parts of the scheduler perform the following: Part I takes the stream of requests generated by the transactions and inserts appropriate lock actions ahead of all database-access operations, such as read, write, increment, or update. Part II takes the sequence of lock and database-access actions passed to it by Part I and executes each appropriately.
Slide 267
Part II determines the transaction T to which the action belongs and whether T is delayed. If T is not delayed: 1. a database-access action is transmitted to the database and executed; 2. if a lock action is received, Part II checks the lock table to see whether the lock can be granted: (i) if granted, the lock table is modified to include the granted lock; (ii) if not granted, the lock table is updated to record the requested lock, and Part II delays transaction T.
Slide 268
3. When a transaction T commits or aborts, Part I is notified by the transaction manager and releases all of T's locks. If any transactions are waiting for those locks, Part I notifies Part II. 4. Part II, when notified that a lock on some database element has become available, determines the next transaction or transactions that can be given the lock and allowed to continue.
Slide 269
The Lock Table A relation that associates database elements with locking information about each element. Implemented as a hash table using database elements as the hash key. Its size is proportional to the number of locked elements only, not to the size of the entire database. [Figure: database element A maps to the lock information for A.]
Slide 270
Slide 271
Lock-table entry fields:
Group mode: S means that only shared locks are held; U means that there is one update lock and perhaps one or more shared locks; X means there is one exclusive lock and no other locks.
Waiting: the waiting bit tells whether there is at least one transaction waiting for a lock on A.
A list: describes all transactions that either currently hold locks on A or are waiting for a lock on A.
Slide 272
Handling Lock Requests Suppose transaction T requests a lock on
A If there is no lock table entry for A, then there are no locks on
A, so create the entry and grant the lock request If the lock table
entry for A exists, use the group mode to guide the decision about
the lock request
Slide 273
How to deal with existing locks: 1) If the group mode is U (update) or X (exclusive), no other lock can be granted: deny the lock request by T, and place an entry on the list saying T requests a lock, with Wait? = yes. 2) If the group mode is S (shared), another shared or update lock can be granted: grant the request for an S or U lock, create an entry for T on the list with Wait? = no, and change the group mode to U if the new lock is an update lock.
Slide 274
Handling Unlock Requests Now suppose transaction T unlocks A. Delete T's entry on the list for A. If T's lock is not the same as the group mode, there is no need to change the group mode; otherwise, examine the entire list to determine the new group mode.
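The group-mode logic of the last two slides can be sketched as follows (a simplified lock table; the waiting bit and upgrade handling are reduced to their essentials):

```python
class LockTable:
    """Simplified lock-table handling keyed on the group mode.
    Modes: 'S' (shared), 'U' (update), 'X' (exclusive)."""
    def __init__(self):
        # element -> {'group': mode, 'list': [(txn, mode, waiting)]}
        self.entries = {}

    def request(self, txn, element, mode):
        """Return True if the lock is granted, False if txn must wait."""
        entry = self.entries.get(element)
        if entry is None:
            # No lock-table entry for the element: create it and grant.
            self.entries[element] = {'group': mode,
                                     'list': [(txn, mode, False)]}
            return True
        if entry['group'] in ('U', 'X'):
            # No other lock can be granted: record the waiter.
            entry['list'].append((txn, mode, True))
            return False
        # Group mode S: another shared or update lock can be granted.
        if mode in ('S', 'U'):
            entry['list'].append((txn, mode, False))
            if mode == 'U':
                entry['group'] = 'U'
            return True
        entry['list'].append((txn, mode, True))   # X request must wait
        return False

lt = LockTable()
print(lt.request('T1', 'A', 'S'))  # True  -- first lock on A
print(lt.request('T2', 'A', 'S'))  # True  -- shared locks are compatible
print(lt.request('T3', 'A', 'X'))  # False -- must wait
```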
Slide 275 Rules for Timestamp-Based Scheduling Scheduler receives a request rT(X). If TS(T) ≥ WT(X), the read is physically realizable: if C(X) is true, grant the request, and if TS(T) > RT(X), set RT(X) := TS(T); otherwise do not change RT(X). If C(X) is false, delay T until C(X) becomes true or the transaction that wrote X aborts. If TS(T) < WT(X), the read is physically unrealizable. Rollback T.
Slide 312
Rules for Timestamp-Based Scheduling (cont.) Scheduler receives a request wT(X). If TS(T) ≥ RT(X) and TS(T) ≥ WT(X), the write is physically realizable and must be performed: write the new value for X, set WT(X) := TS(T), and set C(X) := false. If TS(T) ≥ RT(X) but TS(T) < WT(X), the write is physically realizable, but there is already a later value in X: if C(X) is true, the previous writer of X has committed, so ignore the write by T; if C(X) is false, we must delay T. If TS(T) < RT(X), the write is physically unrealizable, and T must be rolled back.
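The read and write rules can be written as one function each, under the slides' definitions (RT and WT are the read and write times of X, C its commit bit; the dict representation is illustrative):

```python
def read(T_ts, elem):
    """Timestamp rule for a read request r_T(X); returns the action taken."""
    if T_ts < elem['WT']:
        return 'rollback'                 # physically unrealizable
    if not elem['C']:
        return 'delay'                    # writer of X not yet committed
    elem['RT'] = max(elem['RT'], T_ts)    # grant, advancing RT(X) if needed
    return 'grant'

def write(T_ts, elem):
    """Timestamp rule for a write request w_T(X); returns the action taken."""
    if T_ts < elem['RT']:
        return 'rollback'                 # physically unrealizable
    if T_ts < elem['WT']:
        # A later value already exists in X.
        return 'skip' if elem['C'] else 'delay'
    elem['WT'], elem['C'] = T_ts, False   # perform the write
    return 'grant'

X = {'RT': 0, 'WT': 0, 'C': True}
print(write(5, X))   # 'grant' -- sets WT(X) := 5, C(X) := false
print(read(3, X))    # 'rollback' -- TS(T) < WT(X)
```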
Slide 313
Rules for Timestamp-Based Scheduling (cont.) Scheduler receives a request to commit T: it must find all database elements X written by T and set C(X) := true. If any transactions are waiting for X to be committed, those transactions are allowed to proceed. Scheduler receives a request to abort T, or decides to roll back T: any transaction that was waiting on an element X that T wrote must repeat its attempt to read or write.
Slide 314
Three transactions executing under a timestamp-based
scheduler
Slide 315
Timestamps and Locking Generally, timestamping performs better than locking in situations where most transactions are read-only, or where it is rare that concurrent transactions try to read and write the same element. In high-conflict situations, locking performs better than timestamping. The argument for this rule of thumb: locking will frequently delay transactions as they wait for locks, but if concurrent transactions frequently read and write elements in common, then rollbacks will be frequent in a timestamp scheduler, introducing even more delay than a locking system.
Slide 316
Anusha Damodaran ID : 130 CS 257 : Database System Principles
Section 18.9
Slide 317
At a Glance What is Validation? Architecture of Validation
based Scheduler Validation Rules Comparison between Concurrency
Control Mechanisms
Slide 318
Validation Another type of optimistic concurrency control: transactions access data without locks. The validation scheduler keeps a record of what active transactions are doing, and each transaction goes through a validation phase before it starts to write values of database elements. If there would be physically unrealizable behavior, the transaction is rolled back.
Slide 319
18.9.1 Architecture of a Validation-Based Scheduler The scheduler must be told, for each transaction T: the read set RS(T), the set of database elements T reads; and the write set WS(T), the set of database elements T writes. Three phases of the validation scheduler:
Read: the transaction reads from the database all elements in its read set, and computes in its local address space all the results it is going to write.
Validate: the scheduler validates the transaction by comparing its read and write sets with those of other transactions; if validation fails, the transaction is rolled back, otherwise it proceeds to the write phase.
Write: the transaction writes to the database its values for the elements in its write set.
Slide 320
Validation-Based Scheduler The scheduler has an assumed serial order of the transactions to work with, and maintains three sets:
START: transactions that have started but not yet completed validation; for these, START(T), the time at which T started, is recorded.
VAL: transactions that have been validated but not yet finished the writing of phase 3; for these, START(T) and VAL(T), the time at which T validated, are recorded.
FIN: transactions that have completed phase 3; for these, START(T), VAL(T), and FIN(T), the time at which T finished, are recorded.
Slide 321
18.9.2 Validation Rules Case 1: U is in VAL or FIN (that is, U is validated), FIN(U) > START(T) (that is, U did not finish before T started), and RS(T) ∩ WS(U) is not empty (say it contains database element X). Since we don't know whether or not T got to read U's value of X, we must roll back T to avoid the risk that the actions of T and U will not be consistent with the assumed serial order. [Timeline: U starts and writes X; T starts and reads X; U validates; T is validating.]
Slide 322
18.9.2 Validation Rules Case 2: U is in VAL (i.e., U has successfully validated), FIN(U) > VAL(T) (i.e., U did not finish before T entered its validation phase), and WS(T) ∩ WS(U) is not empty (say database element X is in both write sets). T and U must both write values of X, and if we let T validate, it is possible that T will write X before U does. Since we cannot be sure, we roll back T to make sure it does not violate the assumed serial order, in which it follows U. [Timeline: U writes X and validates; T is validating and will write X; U finishes later.]
Slide 323
Rules for Validating a Transaction T:
1. Check that RS(T) ∩ WS(U) = ∅ for any previously validated U that did not finish before T started, i.e., for which FIN(U) > START(T).
2. Check that WS(T) ∩ WS(U) = ∅ for any previously validated U that did not finish before T validated, i.e., for which FIN(U) > VAL(T).
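The two checks can be coded directly; a sketch (the dict fields and time values are illustrative; 'fin' is None while U has not yet finished phase 3):

```python
def validate(T, others):
    """Validate transaction T against previously validated transactions.

    T and each U are dicts with keys 'RS', 'WS', 'start', 'val', 'fin';
    sets for the read/write sets, numbers for the times."""
    for U in others:
        # "U did not finish before ..." -- unfinished U always qualifies.
        after_start = U['fin'] is None or U['fin'] > T['start']
        after_val = U['fin'] is None or U['fin'] > T['val']
        # Rule 1: T may have read an element U was still writing.
        if after_start and (T['RS'] & U['WS']):
            return False
        # Rule 2: T might write an element before U finishes writing it.
        if after_val and (T['WS'] & U['WS']):
            return False
    return True

# Timing values below are hypothetical, chosen to match the slide's example.
T = {'RS': {'A', 'B'}, 'WS': {'A', 'C'}, 'start': 2, 'val': 3, 'fin': 6}
W = {'RS': {'A', 'D'}, 'WS': {'A', 'C'}, 'start': 4, 'val': 5, 'fin': None}
print(validate(W, [T]))   # False -- W reads A, which T wrote and had not finished
```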
Slide 324
Example 18.2.9 Four transactions T, U, V, W attempt to execute and validate:
T: RS = {A,B}, WS = {A,C}
U: RS = {B}, WS = {D}
W: RS = {A,D}, WS = {A,C}
V: RS = {B}, WS = {D,E}
Each passes through the phases read, validate, write.
Slide 325
Example 18.2.9 Validation of U [RS = {B}, WS = {D}]: nothing to check; U validates successfully, reads {B}, and writes {D}.
Validation of T [RS = {A,B}, WS = {A,C}]: FIN(U) > START(T), so RS(T) ∩ WS(U) must be empty: {A,B} ∩ {D} = ∅. FIN(U) > VAL(T), so WS(T) ∩ WS(U) must be empty: {A,C} ∩ {D} = ∅. T validates.
Validation of V [RS = {B}, WS = {D,E}]: FIN(T) > START(V), and RS(V) ∩ WS(T) = {B} ∩ {A,C} = ∅. FIN(T) > VAL(V), and WS(V) ∩ WS(T) = {D,E} ∩ {A,C} = ∅. FIN(U) > START(V), and RS(V) ∩ WS(U) = {B} ∩ {D} = ∅. V validates.
Validation of W [RS = {A,D}, WS = {A,C}]: FIN(T) > START(W), but RS(W) ∩ WS(T) = {A,D} ∩ {A,C} = {A}. FIN(V) > START(W), but RS(W) ∩ WS(V) = {A,D} ∩ {D,E} = {D}. FIN(V) > VAL(W), and WS(W) ∩ WS(V) = {A,C} ∩ {D,E} = ∅. W is not validated; it is rolled back and hence does not write values for A and C.
Slide 326
18.9.3 Comparison of Concurrency-Control Mechanisms: Storage Utilization
Locks: space in the lock table is proportional to the number of database elements locked.
Timestamps: space is needed for read- and write-times with every database element, whether or not it is currently accessed.
Validation: space is used for timestamps and read/write sets for each currently active transaction, plus a few more transactions that finished after some currently active transaction began.
Timestamping and validation may use slightly more space than locking. A potential problem with validation is that the write set for a transaction must be known before the writes occur.
Slide 327
18.9.3 Comparison of Concurrency-Control Mechanisms: Delay The performance of the three methods depends on whether interaction among transactions is high or low (interaction: the likelihood that a transaction will access an element that is also being accessed by a concurrent transaction). Locking delays transactions but avoids rollbacks, even when interaction is high. Timestamps and validation do not delay transactions, but can cause them to roll back, which is a more serious form of delay and also wastes resources. If interference is low, then neither timestamps nor validation will cause many rollbacks, and either is preferable to locking.