
Page 1:

Changes

• 13.1 – Added content to slide #8
• 13.2 – Added extra slide
• 13.4 – Added content to slide #9
• 13.8 – Added content to slide #7
• 15.1 – Added content to slide #8
• 15.2 – Added content to slide #3
• 15.4 – Added title to slide #12
• 18.3 – Added content to slides #2 and 3
• 18.4 – Added extra slide

Page 2:

Secondary Storage Management
The Memory Hierarchy

Page 3:

The Memory Hierarchy

• Computer systems have several different components in which data may be stored.

• Data capacities and access speeds range over at least seven orders of magnitude.

• Devices with the smallest capacity also offer the fastest access speed.

Page 4:

Description of Levels

1. Cache

• A megabyte or more of cache storage.

• On-board cache: on the same chip as the processor.

• Level-2 cache: on a separate chip.

• Cache data is accessed in a few nanoseconds.

• Data is moved from main memory to the cache when it is needed by the processor.

• Volatile.

Page 5:

Description of Levels

2. Main Memory

• 1 GB or more of main memory.

• Instruction execution and data manipulation involve information resident in main memory.

• Time to move data from main memory to the processor or cache is in the 10–100 nanosecond range.

• Volatile.

3. Secondary Storage

• Typically a magnetic disk.

• Capacity up to 1 TB.

• One machine can have several disk units.

• Time to transfer a single byte between disk and main memory is around 10 milliseconds.

Page 6:

Description of Levels

4. Tertiary Storage

• Holds data volumes measured in terabytes.

• Significantly higher read/write times.

• Smaller cost per byte.

• Retrieval takes seconds or minutes, but capacities in the petabyte range are possible.

Page 7:

Transfer of Data Between Levels

• Data moves between adjacent levels of the hierarchy.

• Each level is organized to transfer large amounts of data to or from the level below

• Key technique for speeding up database operations is to arrange data so that when one piece of a disk block is needed, it is likely that other data on the same block will also be needed at about the same time.

Page 8:

Volatile & Non-Volatile Storage

• A volatile device “forgets” what is stored in it when the power goes off.

• Example: main memory.

• A nonvolatile device, on the other hand, is expected to keep its contents intact even for long periods when the device is turned off or there is a power failure.

• Example: secondary and tertiary storage.

Note: No change to the database can be considered final until it has migrated to nonvolatile, secondary storage.

Page 9:

Virtual Memory

• Managed by the operating system.

• Some memory in main memory & rest on disk.

• Transfer between the two is in units of disk blocks (pages).

• Not a level of the memory hierarchy, it is an artifact of the operating system and its use of the machine’s hardware.

Page 10:

Thank you!

Page 11:

Section 13.2 – Secondary storage management

CS-257 Database System Principles
Avinash Anantharamu (ID: 102)

Page 12:

Index

• 13.2 Disks
• 13.2.1 Mechanics of Disks
• 13.2.2 The Disk Controller
• 13.2.3 Disk Access Characteristics

Page 13:

Disks

• The use of secondary storage is one of the important characteristics of a DBMS, and secondary storage is almost exclusively based on magnetic disks.

Page 14:

Structure of a Disk

Page 15:

Data on Disk

• 0’s and 1’s are represented by different patterns in the magnetic material.

• A common diameter for disk platters is 3.5 inches.

Page 16:

Mechanics of Disks

• Two principal moving pieces of a hard drive:
  1. Head assembly
  2. Disk assembly

• The disk assembly has one or more circular platters that rotate around a central spindle.

• The platters are covered with thin magnetic material.

Page 17:

Top View of Disk Surface

Page 18:

Mechanics of Disks

• Tracks are concentric circles on a platter.

• Tracks are organized into sectors, which are segments of the circular track.

• Sectors are indivisible as far as errors are concerned.

• Blocks are logical units of data transfer.

Page 19:

Corruptions

• Should a portion of the magnetic layer become corrupted in some way, so that it cannot store information, then the entire sector containing this portion cannot be used.

Page 20:

Disk Controller

• Controls the actuator that moves the head assembly.

• Selects the surface from which to read or write.

• Transfers bits from the desired sector to main memory.

Page 21:

Simple Single Processor Computer

Page 22:

Disk Access Characteristics

• Seek time: The disk controller positions the head assembly at the cylinder containing the track on which the block is located. The time to do so is the seek time.

• Rotational latency: The disk controller waits while the first sector of the block moves under the head. This time is called the rotational latency.

Page 23:

Disk Access Characteristics (contd.)

• Transfer time: All the sectors of the block and the gaps between them pass under the head while the disk controller reads or writes data in these sectors. This delay is called the transfer time.

• Latency of the disk: The sum of the seek time, rotational latency, and transfer time is the latency of the disk.

Page 24:

Reference

• Database Systems: The Complete Book, Second Edition.

Page 25:

Thank you Any Questions?

Page 26:

? Questions ?

Page 27:

13.3 Accelerating Access to Secondary Storage
San Jose State University

Spring 2012


Page 29:

13.3 Accelerating Access to Secondary Storage

Section Overview

13.3.1: The I/O Model of Computation
13.3.2: Organizing Data by Cylinders
13.3.3: Using Multiple Disks
13.3.4: Mirroring Disks
13.3.5: Disk Scheduling and the Elevator Algorithm
13.3.6: Prefetching and Large-Scale Buffering

Page 30:

13.3 Introduction

• Average block access time is ~10 ms.
• Disks may be busy.
• Requests may outpace access delays, leading to infinite scheduling latency.
• There are various strategies to increase disk throughput.
• The “I/O Model” is the correct model to determine the speed of database operations.

Page 31:

13.3 Introduction (Contd.)

Actions that improve database access speed:

– Place blocks closer, within the same cylinder

– Increase the number of disks

– Mirror disks

– Use an improved disk-scheduling algorithm

– Use prefetching

Page 32:

13.3.1 The I/O Model of Computation

If we have a computer running a DBMS that:

– Is trying to serve a number of users

– Has 1 processor, 1 disk controller, and 1 disk

– Each user is accessing different parts of the DB

It can be assumed that:

– Time required for disk access is much larger than access to main memory; and as a result:

– The number of block accesses is a good approximation of time required by a DB algorithm

Page 33:

13.3.2 Organizing Data by Cylinders

It is more efficient to store data that might be accessed together in the same or adjacent cylinder(s).

In a relational database, related data should be stored in the same cylinder.

Page 34:

13.3.3 Using Multiple Disks

If the disk controller supports the addition of multiple disks and has efficient scheduling, using multiple disks can improve performance significantly

By striping a relation across multiple disks, each chunk of data can be retrieved in a parallel fashion, improving performance by up to a factor of n, where n is the total number of disks the data is striped over

Page 35:

A drawback of striping data across multiple disks is that you increase your chances of disk failure.

To mitigate this risk, some DBMS use a disk mirroring configuration

Disk mirroring makes each disk a copy of the other disks, so that if any disk fails, the data is not lost

Since all the data is in multiple places, access speedup can be increased by more than n since the disk with the head closest to the requested block can be chosen

13.3.4 Mirroring Disks

Page 36:

13.3.4 Mirroring Disks

Striping
  – Advantages: read/write speedup of ~n; capacity increased by ~n
  – Disadvantage: higher risk of failure

Mirroring
  – Advantages: read speedup of ~n; reduced failure risk; fast initial access
  – Disadvantages: high cost per bit; slower writes compared to striping

Page 37:

13.3.5 Disk Scheduling

• One way to improve disk throughput is to improve disk scheduling, prioritizing requests so that they are served more efficiently.

– The elevator algorithm is a simple yet effective disk-scheduling algorithm.

– The algorithm makes the heads of a disk oscillate back and forth, similar to how an elevator goes up and down.

– The access requests closest to the head’s current position are processed first.

Page 38:

13.3.5 Disk Scheduling (Example)

• When sweeping outward, the direction of head movement changes only after the largest cylinder request has been processed.

• When sweeping inward, the direction of head movement changes only after the smallest cylinder request has been processed.

Example:

Cylinder   Time Requested (ms)        Cylinder   Time Completed (ms)
8000       0                          8000       4.3
24000      0                          24000      13.6
56000      0                          56000      26.9
16000      10                         64000      34.2
64000      20                         40000      45.5
40000      30                         16000      56.8
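A small simulation sketch that reproduces the table above. The timing model is inferred from the completion times (it is an assumption, reverse-engineered from the example): the heads start at cylinder 8000, a seek over d cylinders takes 1 + d/4000 ms (0 ms if the heads are already there), and each block access adds 4.3 ms of rotational latency plus transfer time.

def elevator(start, requests, seek=lambda d: 0 if d == 0 else 1 + d / 4000, access=4.3):
    pending = sorted(requests)                      # list of (arrival_time, cylinder)
    pos, direction, t, done = start, +1, 0.0, []
    while pending:
        arrived = [c for (a, c) in pending if a <= t]
        ahead = [c for c in arrived if (c - pos) * direction >= 0]
        if not ahead:
            if arrived:
                direction *= -1                     # reverse at the end of a sweep
            else:
                t = min(a for (a, c) in pending)    # idle until the next request arrives
            continue
        nxt = min(ahead, key=lambda c: abs(c - pos))
        t += seek(abs(nxt - pos)) + access
        done.append((nxt, round(t, 1)))
        pos = nxt
        pending = [(a, c) for (a, c) in pending if c != nxt]
    return done

print(elevator(8000, [(0, 8000), (0, 24000), (0, 56000),
                      (10, 16000), (20, 64000), (30, 40000)]))
# [(8000, 4.3), (24000, 13.6), (56000, 26.9), (64000, 34.2), (40000, 45.5), (16000, 56.8)]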

Page 39:

In some cases we can anticipate what data will be needed

We can take advantage of this by prefetching data from the disk before the DBMS requests it

Since the data is already in memory, the DBMS receives it instantly

13.3.6 Prefetching and Large-Scale Buffering

Page 40:

? Questions ?

Page 41:

Disk Failures

Presented by Timothy Chen
Spring 2013

Page 42:

Index

• 13.4 Disk Failures
  13.4.1 Intermittent Failures
  13.4.2 Checksums
  13.4.3 Stable Storage
  13.4.4 Error-Handling Capabilities of Stable Storage
  13.4.5 Recovery from Disk Crashes
  13.4.6 Mirroring as a Redundancy Technique
  13.4.7 Parity Blocks
  13.4.8 An Improvement: RAID 5
  13.4.9 Coping with Multiple Disk Crashes

Page 43:

Intermittent Failures

• An intermittent failure occurs when we try to read a sector, but the correct content of that sector is not delivered to the disk controller.

• The controller can distinguish good reads from bad reads (using the checksum, next slide) and retry until a good read occurs.

• Writes are verified the same way: the sector is read back, and that read determines whether the write was correct.

Page 44:

Checksum

• A checksum is the extra information a read operation uses to determine whether the sector’s status is good or bad.

Page 45:

How the Checksum Is Performed

• Each sector has some additional bits, set depending on the values of the data bits stored in that sector.

• If the data bits do not agree with the checksum bits, we know there was an error reading the sector.

• A bit sequence with an odd number of 1’s has odd parity (e.g., 01101000); a sequence with an even number of 1’s has even parity (e.g., 111011100).

• With a single parity bit chosen so that the total number of 1’s is even, any one-bit error is detected.
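A tiny sketch of the single-parity-bit checksum just described: the parity bit is chosen so that the data bits plus the parity bit have an even number of 1's, so any single flipped bit is detected on read.

def parity_bit(bits: str) -> str:
    return '1' if bits.count('1') % 2 == 1 else '0'

def check(bits_with_parity: str) -> bool:
    return bits_with_parity.count('1') % 2 == 0   # even parity means the sector looks good

data = '01101000'                     # odd parity (three 1's), as on the slide
stored = data + parity_bit(data)      # -> '011010001', now even parity
print(check(stored))                  # True
print(check('011010101'))             # a one-bit error is detected -> False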

Page 46:

Stable Storage

• A policy for dealing with disk errors that checksums alone cannot fix.

• Sectors are paired; each pair represents one sector-content X, with a left copy XL and a right copy XR.

• Reads and writes check the parity of XL and XR, and fall back to the other copy (or a substituted spare sector) until a good value is returned.

Page 47:

Error-Handling Capabilities of Stable Storage

• Since there are two copies, XL and XR, if one of them fails we can still read the other.

• The chance that both copies fail is very small.

• A write failure (e.g., during a power outage) can be repaired afterwards from the copy that was not being written.

Page 48:

Recovery from Disk Crashes

• The most serious mode of failure for disks is a “head crash,” in which data is permanently destroyed.

• To recover from such crashes, we use one of the RAID schemes described next.

Page 49:

Mirroring as a Redundancy Technique

• This scheme is called RAID 1.

• Simply mirror each disk.

• We call one disk the data disk and the second the redundant disk.

Page 50:

Raid 1 graph

Page 51:

Parity Blocks

• This scheme is often called RAID 4.

• One redundant disk holds, in each bit position, the modulo-2 sum (parity) of the corresponding bits of all the data disks.

Example:
  disk 1: 11110000
  disk 2: 10101010
  disk 3: 00111000

  The redundant disk 4 gets, in each column, 0 for an even number of 1’s and 1 for an odd number:
  disk 4: 01100010
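The short sketch below checks the parity computation above and shows how the same modulo-2 sum rebuilds a crashed data disk from the surviving disks.

def parity_block(blocks):
    # blocks: list of equal-length bit strings; returns their column-wise XOR
    width = len(blocks[0])
    out = []
    for i in range(width):
        ones = sum(b[i] == '1' for b in blocks)
        out.append('1' if ones % 2 == 1 else '0')
    return ''.join(out)

data = ['11110000', '10101010', '00111000']
print(parity_block(data))                                    # '01100010', as on the slide

# Recovery works the same way: XOR the surviving disks (including the parity
# disk) to rebuild the missing one.
print(parity_block(['11110000', '00111000', '01100010']))    # rebuilds disk 2: '10101010'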

Page 52:

Raid 4 graphic

Page 53:

Parity Blocks – Failure Recovery

• RAID 4 can recover from only one disk failure.

• If more than one disk fails (say, two disks), the data cannot be recovered using the modulo-2 sum alone.

Page 54:

An Improvement: RAID 5

Page 55:

Coping with Multiple Disk Crashes

• When more than one disk fails, neither RAID 4 nor RAID 5 can recover the data.

• For that we need RAID 6, which requires at least two redundant disks.

Page 56:

RAID 6

Page 57:

Reference

• http://www.definethecloud.net/wp-content/uploads/2010/12/325px-RAID_1.svg_.png

• http://en.wikipedia.org/wiki/RAID

Page 58:

Secondary Storage Management

13.5 Arranging data on disk

Mangesh Dahale

ID-105

CS 257

Page 59:

Outline

• Fixed-Length Records
• Example of Fixed-Length Records
• Packing Fixed-Length Records into Blocks
• Example of Packing Fixed-Length Records into Blocks
• Details of the Block Header

Page 60:

Arranging Data on Disk

• A data element such as a tuple or object is represented by a record, which consists of consecutive bytes in some disk block.

Page 61:

Fixed Length Records

The simplest records consist of fixed-length fields.

The record begins with a header, a fixed-length region where information about the record itself is kept.

The fixed-length record header contains:
1. A pointer to the record’s schema.
2. The length of the record.
3. A timestamp indicating when the record was created.

Page 62:

Example

CREATE TABLE employee(

name CHAR(30) PRIMARY KEY,

address VARCHAR(255),

gender CHAR(1),

birthdate DATE

);
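As a rough sketch, the byte offsets of the fixed-length fields for this schema can be computed from assumed field sizes: CHAR(30) = 30 bytes, VARCHAR(255) = 256 bytes (a length byte plus the characters), CHAR(1) = 1 byte, DATE = 10 bytes. These assumed sizes reproduce the offsets 0, 30, 286, 287 and the record length 297 shown later in the deck (the record header is excluded here).

FIELDS = [("name", 30), ("address", 256), ("gender", 1), ("birthdate", 10)]

def field_offsets(fields):
    offsets, pos = {}, 0
    for fname, size in fields:
        offsets[fname] = pos     # field starts where the previous one ended
        pos += size
    return offsets, pos          # pos is the total record length (header excluded)

print(field_offsets(FIELDS))
# ({'name': 0, 'address': 30, 'gender': 286, 'birthdate': 287}, 297)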

Page 63:

Packing Fixed Length Records into Blocks

• Records are stored in blocks of the disk and moved into main memory when we need to access or update them.

• A block header is written first, and it is followed by a series of records.

Page 64:

Example

• Along with the header, we pack as many records as we can into one block, as shown in the figure; the remaining space is unused.

Page 65:

The block header contains the following information:

• Links to one or more other blocks that are part of a network of blocks.

• Information about the role played by this block in such a network.

• Information about which relation the tuples of this block belong to.

• A “directory” giving the offset of each record in the block.

• Timestamp(s) indicating the time of the block’s last modification and/or access.

Page 66:

•Thank You



Page 69:

Variable Length Data and Records

- Ashwin Kalbhor Class ID : 107

Page 70:

Agenda

• Records with Variable-Length Fields
• Records with Repeating Fields
• Variable-Format Records
• Records That Do Not Fit in a Block

Page 71:

• Example of a fixed-length record layout (byte offsets): name starts at 0, address at 30, gender at 286, birth date at 287; the record ends at offset 297.

Page 72:

Records with Variable Length Fields

• A simple and effective way to represent variable-length records is as follows:
1. Fixed-length fields are kept ahead of the variable-length fields.
2. A header is put in front of the record.
3. The record header contains:
   • The length of the record.
   • Pointers to the beginning of all variable-length fields except the first one.
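A minimal sketch of such a record, using an assumed header layout (a 2-byte total length plus a 2-byte offset of the second variable-length field, address); it is not the book's exact byte format.

import struct

def pack_record(gender: bytes, birthdate: bytes, name: bytes, address: bytes) -> bytes:
    # header: total length + offset of 'address' (the second variable-length field)
    body = gender + birthdate + name + address
    header_size = struct.calcsize("<HH")
    address_offset = header_size + len(gender) + len(birthdate) + len(name)
    total = header_size + len(body)
    return struct.pack("<HH", total, address_offset) + body

def unpack_address(record: bytes) -> bytes:
    total, address_offset = struct.unpack_from("<HH", record, 0)
    return record[address_offset:total]

rec = pack_record(b"M", b"1930-05-31", b"Clint Eastwood", b"Hog's Breath Inn")
print(unpack_address(rec))   # b"Hog's Breath Inn"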

Page 73:

Example

Record with name and address as variable-length fields (figure): the record header holds other header information, the record length, and a pointer to the start of address; the fixed-length fields birth date and gender come first, followed by name and address.

Page 74:

Records with repeating fields

• A repeating field is a field F that occurs a variable number of times, where every occurrence has the same length L.

• All occurrences of field F are grouped together.

• A pointer to the first occurrence of F is put in the header.

• Based on the length L, the starting offset of any occurrence of the repeating field can be computed.

Page 75:

Example of a record with Repeating Fields

Movie star record with “movies” as the repeating field.

(Figure: the record header holds other header information, the record length, a pointer to address, and a pointer to the movie pointers; the body holds name, address, and the pointers to movies.)

Page 76:

Alternative representation

• The record itself is of fixed length.

• Variable-length fields are stored on a separate block.

• The record keeps track of:
1. Pointers to the place where each repeating field begins, and
2. Either how many repetitions there are, or where the repetitions end.

Page 77:

Storing variable length fields separately from the record.

Page 78:

Variable Format Records

• Records that do not have a fixed schema.

• Represented as a sequence of tagged fields.

• Each tagged field carries:
  • The attribute or field name
  • The type of the field
  • The length of the field
  • The value of the field
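A minimal sketch of a variable-format record as a sequence of tagged fields. The one-letter name and type codes and the two-digit length field are assumptions chosen to mirror the figure on the next slide; a real encoding would use binary codes.

FIELDS = [
    # (name code, type code, length, value)
    ("N", "S", 14, "Clint Eastwood"),     # N = name, S = string
    ("R", "S", 16, "Hog's Breath Inn"),   # R = restaurant owned
]

def encode(fields):
    parts = []
    for name_code, type_code, length, value in fields:
        assert len(value) == length
        parts.append(f"{name_code}{type_code}{length:02d}{value}")
    return "".join(parts)

def decode(record):
    fields, i = [], 0
    while i < len(record):
        name_code, type_code = record[i], record[i + 1]
        length = int(record[i + 2:i + 4])
        value = record[i + 4:i + 4 + length]
        fields.append((name_code, type_code, value))
        i += 4 + length
    return fields

print(decode(encode(FIELDS)))   # [('N', 'S', 'Clint Eastwood'), ('R', 'S', "Hog's Breath Inn")]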

Page 79:

Variable Format Records

(Figure: two tagged fields. First field: N, the code for name; S, the code for string type; length 14; value “Clint Eastwood”. Second field: R, the code for restaurant owned; S, string type; length 16; value “Hog’s Breath Inn”.)

Page 80:

Records that do not fit in a block

• When the length of a record is greater than the block size, the record is divided and placed into two or more blocks.

• The portion of the record in each block is referred to as a RECORD FRAGMENT.

• A record with two or more fragments is called a SPANNED RECORD.

• A record that does not cross a block boundary is called an UNSPANNED RECORD.

Page 81:

Spanned Records

• Spanned records require the following extra header information:

• A bit indicating whether the piece is a fragment or not.

• A bit indicating whether it is the first or last fragment of a record.

• Pointers to the next and/or previous fragment of the same record.

Page 82:

Spanned Records

(Figure: block 1 holds record 1 and fragment “record 2-a”; block 2 holds fragment “record 2-b” and record 3. Each block has a block header and each record a record header.)

Page 83:

Thank You.

Page 84:

13.8 Record Modifications

CS257
Lok Kei Leong (108)

Page 85:

Outline

• Record Insertion

• Record Deletion

• Record Update

Page 86:

Insertion

• Insert new records into a relation:
  - records of a relation kept in no particular order
  - records of a relation kept in a fixed order (e.g., sorted by primary key)

• A pointer to a record from outside the block is a “structured address”.

(Figure: a block with a header, an offset table, records 1–4, and unused space.)

Page 87:

What If The Block is Full?

• If we need to insert a record into a particular block but the block is full, what should we do?

• Find room outside the block. There are two solutions:
  I. Find space on a nearby block
  II. Create an overflow block

Page 88:

Insertion (solution 1)

• Find space on a “nearby” block.

• If block B1 has no space and space is available on block B2, move records of B1 to B2.

• If there are external pointers to the records of B1 that moved to B2, leave a forwarding address in the offset table of B1.

Page 89:

Insertion (solution 2)

• Create an overflow block.

• Each block B has, in its header, a pointer to an overflow block where additional records of B can be placed.

(Figure: block B pointing to its overflow block.)

Page 90:

Deletion

• Slide the remaining records around the block, using the offset table, to reclaim the space.

• If we cannot slide records, maintain an available-space list in the block header to keep track of the space available.

• Since there may be pointers to the deleted record, we need to avoid dangling pointers or pointers that wind up pointing to a new record.

Page 91:

Tombstone

• What about pointers to deleted records?

• A tombstone is left in place of each deleted record.

• A tombstone can be a bit placed at the first byte of the deleted record to indicate that the record was deleted (0 – not deleted, 1 – deleted).

• A tombstone is permanent.

(Figure: two records, each preceded by its tombstone bit.)

Page 92:

Update

• For fixed-length records, an update has no effect on the storage system.

• For variable-length records, an update raises the same issues as insertion and deletion, except that we never need to create a tombstone for the old version of the record.

• If the updated record is longer, we must create more space on its block by:
  - sliding records, or
  - creating an overflow block.

Page 93:

Question?

Page 94:

Query Execution
Section 15.1

Sweta Shah
CS257: Database Systems

ID: 118

Page 95:

Agenda

• Query processor
• Query compilation
• Physical query plan operators
• Scanning tables: table scan, index scan
• Sorting while scanning tables
• Model of computation for physical operators
• Parameters for measuring cost
• Iterators

Page 96:

Query Processor

• The query processor is the group of components of a DBMS that turns user queries and data-modification commands into a sequence of database operations and executes those operations.

• The query processor is responsible for supplying the details of how the query is to be executed.

Page 97:

The major parts of the query processor

Page 98:

Query Compilation

Query compilation itself is a multi-step process consisting of:

• Parsing: a parse tree representing the query and its structure is constructed.

• Query rewrite: the parse tree is converted to an initial query plan.

• Physical plan generation: the logical query plan is turned into a physical query plan by selecting algorithms.

Page 99:

Outline of query compilation

Page 100:

Physical Query Plan Operators

• Physical query plans are built from operators.

• Each operator implements one step of the plan.

• They are particular implementations of one of the operators of relational algebra.

• They can also be non-relational-algebra operators like “scan”, which scans a table.

Page 101:

Scanning Tables

• One of the most basic operations in a physical query plan.

• We read the entire contents of a relation R.

• Necessary, for example, when we want to take the join or union of R with another relation.

Page 102:

Two basic approaches to locating the tuples of a relation R:

• Table scan:
  - Relation R is stored in secondary memory with its tuples arranged in blocks.
  - It is possible to get the blocks one by one.
  - This operation is called a table scan.

Page 103:

• Index scan:
  - There is an index on some attribute of relation R.
  - Use this index to get all the tuples of R.
  - This operation is called an index scan.

Page 104:

Sorting While Scanning Tables

• Why do we need sorting while scanning?
  - The query could include an ORDER BY clause requiring that a relation be sorted.
  - Various algorithms for relational-algebra operations require one or both of their arguments to be sorted relations.

• Sort-scan takes a relation R and a specification of the attributes on which the sort is to be made, and produces R in that sorted order.

Page 105:

Model of Computation for Physical Operators

• Choosing physical plan operators wisely is essential for a good query processor.

• The cost of an operation is measured in the number of disk I/O operations.

• If an operator requires the final answer to a query to be written back to disk, the total cost also depends on the length of the answer and includes that final write-back in the total cost of the query.

Page 106:

Improvements in Cost

• Major improvements in the cost of physical operators can be achieved by avoiding or reducing the number of disk I/O operations.

• This can be achieved by passing the answer of one operator to the next in main memory itself, without writing it to disk.

Page 107:

Parameters for Measuring Costs

Parameters that affect the performance of a query:

• The buffer space available in main memory at the time the query is executed.

• The size of the input and the size of the output generated.

• The size of a block on disk and the size of a block in main memory.

Page 108:

Iterators for Implementation of Physical Operators

• Many physical operators can be implemented as an iterator.

• An iterator is a group of three functions that allows a consumer of the result of the physical operator to get the result one tuple at a time.

Page 109:

Iterator

The three functions forming the iterator are:

• Open: starts the process of getting tuples. It initializes any data structures needed to perform the operation.

Page 110:

Iterator

• GetNext: returns the next tuple in the result.
  - Adjusts data structures as necessary to allow subsequent tuples to be obtained.
  - If there are no more tuples to return, GetNext returns the special value NotFound.

Page 111:

Iterator

• Close: ends the iteration after all tuples have been obtained. It calls Close on any arguments of the operator.
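The Open/GetNext/Close interface can be sketched as follows; a table-scan iterator over an in-memory list of blocks stands in for a real disk-based operator (an assumption made so the example is self-contained).

NotFound = object()   # sentinel returned when the scan is exhausted

class TableScan:
    def __init__(self, blocks):
        self.blocks = blocks          # list of blocks, each a list of tuples

    def Open(self):
        self.block_no, self.slot = 0, 0

    def GetNext(self):
        while self.block_no < len(self.blocks):
            block = self.blocks[self.block_no]
            if self.slot < len(block):
                t = block[self.slot]
                self.slot += 1
                return t
            self.block_no += 1        # move to the next block
            self.slot = 0
        return NotFound

    def Close(self):
        pass                          # nothing to release in this sketch

scan = TableScan([[("a", 1), ("b", 2)], [("c", 3)]])
scan.Open()
while (t := scan.GetNext()) is not NotFound:
    print(t)
scan.Close()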

Page 112:

Thank You !!!

Page 113:

Query Execution

One-pass algorithm for database operations

Chetan Sharma
008565661

Page 114:

Overview

One-Pass Algorithm

One-Pass Algorithm Methods:

1) Tuple-at-a-time, unary operations.

2) Full-relation, unary operations.

3) Full-relation, binary operations.

Page 115:

Various Algorithms

We can divide algorithms for operators into three “degrees” of difficulty and cost:

1) One-Pass Algorithm: Require at least one of the arguments to fit in main memory.

2) Two-pass Algorithm: Some methods work for data that is too large to fit in available main memory but not for the largest imaginable data sets.

3) Three-or-more-pass algorithms: these methods work without a limit on the size of the data; they are multipass, recursive generalizations of the two-pass algorithms.

Page 116:

One-Pass Algorithm

• Reading the data only once from disk.

• Usually, they require at least one of the arguments to fit in main memory

Page 117:

Tuple-at-a-Time

• These operations do not require an entire relation, or even a large part of it, in memory at once. Thus, we can read a block at a time, use one main memory buffer, and produce our output.

• Examples: selection and projection.

Page 118:

Tuple-at-a-Time

A selection or projection being performed on a relation R

Page 119:

Full-relation, unary operations

• Now let us consider the unary operations that apply to relations as a whole, rather than to one tuple at a time:

a) Duplicate elimination
b) Grouping

Page 120:

a) Duplicate elimination

Page 121:

b) Grouping

• MIN (a),MAX (a) aggregate, record the minimum or maximum value, respectively, of attribute a seen for any tuple in the group so far.

• COUNT aggregation, add one for each tuple of the group that is seen.

• SUM (a), add the value of attribute a to the accumulated sum for its group.

• AVG (a) is the hard case. We must maintain two accumulations: the count of the number of tuples in the group and the sum of the a-values of these tuples.

Page 122:

b) Grouping

When all tuples of R have been read into the input buffer and have contributed to the aggregation(s) for their group, we can produce the output by writing one tuple for each group. Note that until the last tuple is seen, we cannot begin to create output for a grouping operation. Thus, this algorithm does not fit the iterator framework very well; the entire grouping has to be done by the Open method before the first tuple can be retrieved by GetNext.
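A minimal sketch of one-pass grouping and aggregation: a dictionary keyed by the grouping attribute accumulates a count, sum, minimum, and maximum per group, so COUNT, SUM, AVG, MIN, and MAX can all be produced once the input is exhausted. Tuples are assumed to be (group key, value) pairs for brevity.

def one_pass_group(tuples):
    groups = {}                               # group_key -> [count, sum, min, max]
    for key, value in tuples:
        acc = groups.setdefault(key, [0, 0, value, value])
        acc[0] += 1
        acc[1] += value
        acc[2] = min(acc[2], value)
        acc[3] = max(acc[3], value)
    # output one tuple per group, only after all input has been seen
    return {k: {"COUNT": c, "SUM": s, "AVG": s / c, "MIN": lo, "MAX": hi}
            for k, (c, s, lo, hi) in groups.items()}

print(one_pass_group([("a", 4), ("b", 7), ("a", 2)]))
# {'a': {'COUNT': 2, 'SUM': 6, 'AVG': 3.0, 'MIN': 2, 'MAX': 4}, 'b': {'COUNT': 1, ...}}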

Page 123:

One-Pass Algorithms for Binary Operations

• All other operations are in this class: set and bag versions of union, intersection, difference, joins, and products.

• Binary operations require reading the smaller of the operands R and S into main memory and building a suitable data structure, so that tuples can be both inserted quickly and found quickly.

• The condition for the operation to be performed in one pass is: min(B(R), B(S)) ≤ M, where B(X) is the number of blocks of X and M is the number of available main-memory buffers.

Page 124:

Some examples

In each case, we assume R is the larger of the relations, and we house S in main memory.

• Set union:
  - Read S into M − 1 buffers of main memory and build a search structure whose search key is the entire tuple.
  - All these tuples are also copied to the output.
  - Read each block of R into the M-th buffer, one at a time.
  - For each tuple t of R, see if t is in S; if not, copy t to the output. If t is also in S, skip t.

• Set intersection:
  - Read S into M − 1 buffers and build a search structure with full tuples as the search key.
  - Read each block of R, and for each tuple t of R, see if t is also in S. If so, copy t to the output; if not, ignore t.
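The two algorithms above can be sketched as follows, with a Python set standing in for the search structure built in the M − 1 buffers, and R streamed past it one tuple at a time.

def one_pass_union(r_stream, s_relation):
    s_set = set(s_relation)          # S fits in memory by assumption
    out = list(s_set)                # copy S to the output
    for t in r_stream:               # stream R past the in-memory structure
        if t not in s_set:
            out.append(t)
    return out

def one_pass_intersection(r_stream, s_relation):
    s_set = set(s_relation)
    return [t for t in r_stream if t in s_set]

R = [(1,), (2,), (3,), (4,)]
S = [(3,), (4,), (5,)]
print(sorted(one_pass_union(R, S)))         # [(1,), (2,), (3,), (4,), (5,)]
print(one_pass_intersection(R, S))          # [(3,), (4,)]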

Page 125:

Examples continued

• Product:
  - Read S into M − 1 buffers of main memory; no special data structure is needed.
  - Then read each block of R, and for each tuple t of R, concatenate t with each tuple of S in main memory. Output each concatenated tuple as it is formed.

Page 126:

Summary

One-Pass Algorithm

One-Pass Algorithm Methods:

1) Tuple-at-a-time, unary operations.

2) Full-relation, unary operations.

3) Full-relation, binary operations.

Page 127:

Questions

&

Page 128:

Nested Loops Joins

Book section: Chapter 15.3

Guide: Prof. Dr. T.Y. Lin
Name: Sanya Valsan

Roll: 120

Page 129:

Topic to be covered

• Tuple-Based Nested-Loop Join
• An Iterator for Tuple-Based Nested-Loop Join
• A Block-Based Nested-Loop Join Algorithm
• Analysis of Nested-Loop Join

Page 130:

Introduction

• Nested-loop joins are, in a sense, “one-and-a-half-pass” algorithms, since in each variation one of the two arguments has its tuples read only once, while the other argument is read repeatedly.

• Nested-loop joins can be used for relations of any size; it is not necessary that one relation fit in main memory.

Page 131:

15.3.1 Tuple-Based Nested-Loop Join

• The simplest variation of nested-loop join has loops that range over individual tuples of the relations involved. In this algorithm, which we call tuple-based nested-loop join, we compute the join R ⋈ S as follows.

Page 132:

Continued

FOR each tuple s in S DO
    FOR each tuple r in R DO
        IF r and s join to make a tuple t THEN
            output t;

• If we are careless about how we buffer the blocks of relations R and S, then this algorithm could require as many as T(R)T(S) disk I/O’s. There are many situations where this algorithm can be modified to have much lower cost.

Note: T(R) is the number of tuples in relation R.

Page 133:

Continued

• One improvement looks much more carefully at the way tuples of R and S are divided among blocks, and uses as much of the memory as it can to reduce the number of disk I/O's as we go through the inner loop. We shall consider this block-based version of nested-loop join.

Page 134:

15.3.2 An Iterator for Tuple-Based Nested-Loop Join

Open() {
    R.Open();
    S.Open();
    s := S.GetNext();
}

GetNext() {
    REPEAT {
        r := R.GetNext();
        IF (r = NotFound) {   /* R is exhausted for the current s */
            R.Close();
            s := S.GetNext();
            IF (s = NotFound) RETURN NotFound;   /* both R and S are exhausted */
            R.Open();
            r := R.GetNext();
        }
    } UNTIL (r and s join);
    RETURN the join of r and s;
}

Close() {
    R.Close();
    S.Close();
}

Page 136:

15.3.3 A Block-Based Nested-Loop Join Algorithm

We can improve the nested-loop join for computing R ⋈ S by:

1. Organizing access to both argument relations by blocks.

2. Using as much main memory as we can to store tuples belonging to S, the relation of the outer loop.

Page 137:

The nested-loop join algorithm:

FOR each chunk of M-1 blocks of S DO BEGIN
    read these blocks into main-memory buffers;
    organize their tuples into a search structure whose
        search key is the common attributes of R and S;
    FOR each block b of R DO BEGIN
        read b into main memory;
        FOR each tuple t of b DO BEGIN
            find the tuples of S in main memory that join with t;
            output the join of t with each of these tuples;
        END;
    END;
END;

Page 138:

15.3.4 Analysis of Nested-Loop Join

Assuming S is the smaller relation, the number of chunks, i.e., iterations of the outer loop, is B(S)/(M - 1). At each iteration, we read M - 1 blocks of S and B(R) blocks of R. The number of disk I/O's is thus

    (B(S)/(M - 1)) * (M - 1 + B(R))  =  B(S) + B(S)B(R)/(M - 1)

Assuming all of M, B(S), and B(R) are large, but M is the smallest of these, an approximation to the above formula is B(S)B(R)/M. That is, the cost is proportional to the product of the sizes of the two relations, divided by the amount of available main memory.

Page 139:

Example

• B(R) = 1000, B(S) = 500, M = 101
  - Important aside: 101 buffer blocks is not as unrealistic as it sounds. There may be many queries running at the same time, competing for main-memory buffers.

• The outer loop iterates 5 times.
• At each iteration we read M - 1 (i.e., 100) blocks of S and all of R (i.e., 1000 blocks).
• Total cost: 5 * (100 + 1000) = 5500 I/O’s.

• Question: what if we reversed the roles of R and S?
  - We would iterate 10 times, and in each iteration we would read 100 + 500 blocks, for a total of 6000 I/O’s.

• Compare with a one-pass join, if it could be done!
  - We would need only 1500 disk I/O’s if B(S) ≤ M - 1.

Page 140:

Continued…….

1. The cost of the nested-loop join is not much greater than the cost of a one-pass join, which is 1500 disk I/O’s for this example.

2. Nested-loop join is generally not the most efficient join algorithm.

Page 141:

Summary of the topic

In this topic we learned how nested-loop joins are used during query execution and how the algorithm proceeds.

Page 142:

Any Questions

?

Page 143:

Two-Pass Algorithms Based on Sorting

Section 15.4
CS257 Spring 2013
Swapna Vemparala

Class ID : 131

Page 144:

Contents

• Two-Pass Algorithms
• Two-Phase, Multiway Merge-Sort
• Duplicate Elimination Using Sorting
• Grouping and Aggregation Using Sorting
• A Sort-Based Union Algorithm
• Sort-Based Intersection and Difference
• A Simple Sort-Based Join Algorithm
• A More Efficient Sort-Based Join

Page 145:

Two-Pass Algorithms

• Data from operand relation is read into main memory, processed, written out to disk again, and reread from disk to complete the operation.

Page 146:

15.4.1 Two-Phase, Multiway Merge-Sort

• To sort very large relations in two passes.

• Phase 1: Repeatedly fill the M buffers with new tuples from R and sort them, using any main-memory sorting algorithm. Write out each sorted sublist to secondary storage.

• Phase 2 : Merge the sorted sublists. For this phase to work, there can be at most M — 1 sorted sublists, which limits the size of R. We allocate one input block to each sorted sublist and one block to the output.

Page 147:

Merging

• Find the smallest key among the first remaining elements of all the sublists.

• Move the smallest element to the first available position of the output block.

• If the output block is full, write it to disk and reinitialize the same buffer in main memory to hold the next output block.

• If the block from which the smallest element was taken is exhausted of records, read the next block from the same sorted sublist into the same buffer.

• If no blocks remain, stop.
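A compact sketch of the two phases, with Python lists standing in for disk blocks and buffers (an assumption for brevity; a real implementation reads and writes M-block chunks of the relation and merges block by block).

import heapq

def two_phase_mms(blocks, M):
    # Phase 1: sort each run of M blocks in "memory" and write it out as a sorted sublist.
    sublists = []
    for i in range(0, len(blocks), M):
        chunk = [t for block in blocks[i:i + M] for t in block]
        sublists.append(sorted(chunk))
    assert len(sublists) <= M - 1, "R is too large to sort in two passes"

    # Phase 2: merge the sorted sublists, repeatedly taking the smallest key
    # among the heads of all sublists (heapq.merge does exactly this).
    return list(heapq.merge(*sublists))

blocks = [[9, 3], [7, 1], [4, 8], [2, 6]]   # four "blocks" of two tuples each
print(two_phase_mms(blocks, M=3))           # [1, 2, 3, 4, 6, 7, 8, 9]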

Page 148:

15.4.2 Duplicate Elimination Using Sorting

• The first pass is the same as in 2PMMS: create sorted sublists of R.
• On the second pass, instead of merging into sorted order, repeatedly select the first unconsidered tuple t among all the sorted sublists.
• Write one copy of t to the output and eliminate from the input blocks all occurrences of t.
• The output is therefore exactly one copy of every tuple in R.
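A compact sketch of this second pass, assuming the sorted sublists from the first pass are given as Python lists (the same toy representation used above):

import heapq

def sort_based_distinct(sorted_sublists):
    """Second pass of duplicate elimination: emit one copy of each tuple."""
    out = []
    last = object()                    # sentinel that compares unequal to any tuple
    for t in heapq.merge(*sorted_sublists):
        if t != last:                  # first unconsidered tuple among the sublists
            out.append(t)              # write one copy to the output
            last = t                   # further occurrences of t are eliminated
    return out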

Page 149: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

15.4.3 Grouping and Aggregation Using Sorting

• Read the tuples of R into memory, M blocks at a time. Sort the tuples in each set of M blocks, using the grouping attributes of L as the sort key. Write each sorted sublist to disk.

• Use one main-memory buffer for each sublist, and initially load the first block of each sublist into its buffer.

• Repeatedly find the least value of the sort key present among the first available tuples in the buffers.

Page 150: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

15.4.4 A Sort-Based Union Algorithm

• In the first phase, create sorted sublists from both R and S.

• Use one main-memory buffer for each sublist of R and S. Initialize each with the first block from the corresponding sublist.

• Repeatedly find the first remaining tuple t among all the buffers, copy t once to the output, and remove every other copy of t from the fronts of the buffers, refilling a buffer from its sublist whenever it is exhausted.

Page 151: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

15.4.5 Sort-Based Intersection and Difference

• For both the set and bag versions, the algorithm is the same as for set union, except in how we handle the copies of a tuple t at the fronts of the sorted sublists.
• Set intersection: output t if it appears in both R and S.
• Bag intersection: output t the minimum of the number of times it appears in R and in S.
• Set difference (R - S): output t if and only if it appears in R but not in S.
• Bag difference (R - S): output t the number of times it appears in R minus the number of times it appears in S.

Page 152: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

15.4.6 A Simple Sort-Based Join Algorithm

• Given relations R(X, Y) and S(Y, Z) to join, and given M blocks of main memory for buffers

• Sort R, using 2PMMS, with Y as the sort key.
• Sort S similarly.
• Merge the sorted R and S, using only two buffers: one for the current block of R and one for the current block of S. A sketch of this merge step follows.
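A minimal Python illustration of the merge step, assuming both relations are represented as lists of dicts joined on an attribute named y; the calls to sorted() stand in for the 2PMMS passes.

def simple_sort_join(R, S, y):
    """Join R(X, Y) with S(Y, Z) on the common attribute named y."""
    R = sorted(R, key=lambda t: t[y])          # stands in for sorting R with 2PMMS
    S = sorted(S, key=lambda t: t[y])          # stands in for sorting S with 2PMMS
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i][y] < S[j][y]:
            i += 1
        elif R[i][y] > S[j][y]:
            j += 1
        else:
            yval = R[i][y]
            # gather all R-tuples and S-tuples sharing this Y-value
            i2 = i
            while i2 < len(R) and R[i2][y] == yval:
                i2 += 1
            j2 = j
            while j2 < len(S) and S[j2][y] == yval:
                j2 += 1
            for r in R[i:i2]:
                for s in S[j:j2]:
                    out.append({**r, **s})     # output the joined tuple
            i, j = i2, j2
    return out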

Page 153: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

15.4.8 A More Efficient Sort-Based Join

• If we do not have to worry about very large numbers of tuples with a common value for the join attribute(s), then we can save two disk I/O's per block by combining the second phase of the sorts with the join itself.

To compute R(X, Y) ⋈ S(Y, Z) using M main-memory buffers:

Create sorted sublists of size M, using Y as the sort key, for both R and S.

Bring the first block of each sublist into a buffer.

Page 154: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Repeatedly find the least Y-value y among the first available tuples of all the sublists. Identify all the tuples of both relations that have Y-value y, and output the join of all such tuples from R with all such tuples from S.

Page 155: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Thank you

Page 156: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Query Execution
15.5 Two-Pass Algorithms Based on Hashing

By Avinash Anantharamu

Page 157: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

At a glimpse

• Introduction
• Partitioning Relations by Hashing
• Algorithm for Duplicate Elimination
• Grouping and Aggregation
• Union, Intersection, and Difference
• Hash-Join Algorithm
• Sort-Based vs. Hash-Based
• Summary

Page 158: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Introduction

Hashing is used when the data is too big to fit in the main-memory buffers.
– Hash all the tuples of the argument(s) using an appropriate hash key.
– For all the common operations, there is a way to choose the hash key so that all the tuples that need to be considered together when we perform the operation have the same hash value.
– This reduces the size of the operand(s) by a factor equal to the number of buckets.

Page 159: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Partitioning Relations by Hashing

Algorithm:

initialize M-1 buckets using M-1 empty buffers;
FOR each block b of relation R DO BEGIN
    read block b into the Mth buffer;
    FOR each tuple t in b DO BEGIN
        IF the buffer for bucket h(t) has no room for t THEN BEGIN
            copy the buffer to disk;
            initialize a new empty block in that buffer;
        END;
        copy t to the buffer for bucket h(t);
    END;
END;
FOR each bucket DO
    IF the buffer for this bucket is not empty THEN
        write the buffer to disk;
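The same loop, rendered as a small Python sketch for concreteness; the block size BLOCK, the hash function h, and the in-memory lists standing in for disk blocks are all assumptions of the illustration.

def hash_partition(R_blocks, M, h, BLOCK=100):
    """Partition relation R (a list of blocks of tuples) into M-1 buckets."""
    buffers = [[] for _ in range(M - 1)]        # one in-memory buffer per bucket
    buckets = [[] for _ in range(M - 1)]        # blocks already "written to disk"
    for b in R_blocks:                          # read block b (the M-th buffer)
        for t in b:
            i = h(t) % (M - 1)
            if len(buffers[i]) >= BLOCK:        # no room for t in bucket i's buffer
                buckets[i].append(buffers[i])   # copy the buffer to disk
                buffers[i] = []                 # start a new empty block
            buffers[i].append(t)
    for i in range(M - 1):                      # finally, flush non-empty buffers
        if buffers[i]:
            buckets[i].append(buffers[i])
    return buckets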

Page 160: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Duplicate Elimination

• For the operation δ(R), hash R into M-1 buckets. (Note that two copies of the same tuple t will hash to the same bucket.)
• Do duplicate elimination on each bucket Ri independently, using the one-pass algorithm.
• The result is the union of δ(Ri), where Ri is the portion of R that hashes to the ith bucket.

Page 161: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Requirements

• Number of disk I/O's: 3·B(R).
– The two-pass, hash-based algorithm works only if B(R) ≤ M(M-1), i.e., roughly B(R) ≤ M².
• In order for this to work, we also need:
– the hash function h to distribute the tuples evenly among the buckets, and
– each bucket Ri to fit in main memory, so that the one-pass algorithm applies.

Page 162: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Grouping and Aggregation

Hash all the tuples of relation R into M-1 buckets, using a hash function that depends only on the grouping attributes. (Note: all tuples in the same group end up in the same bucket.)

Use the one-pass algorithm to process each bucket independently.

Uses 3·B(R) disk I/O's and requires B(R) ≤ M².

Page 163: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Union, Intersection, and Difference

• For a binary operation we use the same hash function to hash the tuples of both arguments.
• R ∪ S: hash both R and S into M-1 buckets each.
• R ∩ S: hash both R and S, for 2(M-1) buckets in total.
• R - S: hash both R and S, for 2(M-1) buckets in total.
• Requires 3(B(R) + B(S)) disk I/O's.
• The two-pass, hash-based algorithm requires min(B(R), B(S)) ≤ M².

Page 164: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Hash-Join Algorithm

• Use same hash function for both relations; hash function should depend only on the join attributes

• Hash R to M-1 buckets R1, R2, …, RM-1

• Hash S to M-1 buckets S1, S2, …, SM-1

• Do a one-pass join of Ri and Si, for each i.
• Cost: 3·(B(R) + B(S)) disk I/O's; requires min(B(R), B(S)) ≤ M².
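A small Python sketch of the idea, assuming the relations are lists of dicts joined on an attribute named y and that each bucket pair fits in memory; Python's built-in hash stands in for the hash function.

def hash_join(R, S, y, M=11):
    """Two-pass hash join of R(X, Y) and S(Y, Z): partition both, then join bucket pairs."""
    def partition(rel):
        buckets = [[] for _ in range(M - 1)]
        for t in rel:
            buckets[hash(t[y]) % (M - 1)].append(t)
        return buckets

    out = []
    for Ri, Si in zip(partition(R), partition(S)):
        index = {}                                   # one-pass join of Ri and Si:
        for s in Si:                                 # build an index on Si,
            index.setdefault(s[y], []).append(s)
        for r in Ri:                                 # then probe it with the tuples of Ri
            for s in index.get(r[y], []):
                out.append({**r, **s})
    return out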

Page 165: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Sort based Vs Hash based

• For binary operations, hash-based only limits size to min of arguments, not sum

• Sort-based can produce output in sorted order, which can be helpful

• Hash-based depends on buckets being of equal size

• Sort-based algorithms can experience reduced rotational latency or seek time

Page 166: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Summary

• Partitioning Relations by Hashing
• Algorithm for Duplicate Elimination
• Grouping and Aggregation
• Union, Intersection, and Difference
• Hash-Join Algorithm
• Sort-Based vs. Hash-Based

Page 167: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Thank you

Page 168: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

15.6 Index Based Algorithms

By: Tomas Tupy (123)

Page 169: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Outline

• Terminology
• Clustered Indexes
– Example
• Non-Clustered Indexes
• Index-Based Selection
• Joining Using an Index
• Join Using a Sorted Index

Page 170: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

What is an Index?

• A data structure which improves the speed of data retrieval ops on a relation, at the cost of slower writes and the use of more storage space.

• Enables sub-linear time lookup.
• Data is stored in arbitrary order, while logical ordering is achieved by using the index.

Page 171: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Terminology Recap

• B(R) – number of blocks needed to hold R.
• T(R) – number of tuples in R.
• V(R, a) – number of distinct values of attribute a in R.
• Clustered relation – tuples are packed into as few blocks as possible.
• Clustering index – an index on attribute(s) such that all tuples with a fixed value of the search key appear on as few blocks as possible.

Page 172: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Clustering Indexes

• A relation is clustered if its tuples are packed into relatively few blocks.
• Clustering indexes are indexes on an attribute or attributes such that all the tuples with a fixed value for the search key of this index appear in as few blocks as possible.
• Tuples are stored to match the index order.
• A relation that isn't clustered cannot have a clustering index.

Page 173: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Clustering Indexes

• Let R(a, b) be a relation sorted on attribute a.
• Let the index on a be a clustering index.
• Let a1 be a specific value for a.
• A clustering index keeps all tuples with a fixed value packed into the minimum number of blocks.

[Figure: consecutive blocks holding all the tuples with a = a1]

Page 174: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Pros/Cons

• Pros
– Faster reads for particular selections.
• Cons
– Writing to a table with a clustered index can be slower, since data may need to be rearranged.
– Only one clustered index is possible per table.

Page 175: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Clustered Index Example

Customer(ID, Name, Address)
Order(ID, CustomerID, Price)

Problem: We want to quickly retrieve all orders for a particular customer.

How do we do this?

Page 176: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Clustered Index Example

• Solution: Create a clustered index on the “CustomerID” column of the Order table.

• Now the tuples with the same CustomerID will be physically stored close to one another on the disk.

Page 177: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Non Clustered Indexes

• There can be many non-clustered indexes per table.
• They are quicker for insert and update operations.
• The physical order of the tuples is not the same as the index order.

Page 178: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Index Based Algorithms

• Indexes are especially useful for the selection operator.
• Joins and other binary operators also benefit.

Page 179: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Index-Based Selection

• No index
– Without an index on relation R, we have to read all the tuples of R to implement the selection σC(R) and see which tuples match condition C.
– What is the cost, in disk I/O's, of implementing σC(R)? (Consider both clustered and non-clustered relations.)

Page 180: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Index-Based Selection

• No index
– Answer:
• B(R) if the relation is clustered.
• Up to T(R) if the relation is not clustered.

Page 181: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Index-Based Selection

• Let us consider an index on attribute a, where our condition C is a = v, i.e., the selection σa=v(R).
• In this case we simply search the index for the value v and obtain pointers to exactly the tuples we need.

Page 182: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Index-Based Selection

• Suppose the index for our selection σa=v(R) is clustering.
• What is the cost, in disk I/O's, of retrieving the set σa=v(R)?

Page 183: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Index-Based Selection

• Answer
– On average: B(R) / V(R, a).
• A few additional I/O's may be needed because:
– the index itself might not be in main memory,
– tuples with a = v might not be block-aligned, and
– even a clustered relation might not be packed as tightly as possible (extra space is left for insertions).

Page 184: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Index-Based Selection

• Now suppose the index for our selection σa=v(R) is non-clustering.
• What is the cost, in disk I/O's, of retrieving the set σa=v(R)?

Page 185: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Index-Based Selection

• Answer
– In the worst case: T(R) / V(R, a) disk I/O's, since each matching tuple may live on a different block.
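A quick worked example with hypothetical statistics (the numbers are illustrative, not from the slides): take B(R) = 1,000, T(R) = 20,000 and V(R, a) = 100. A clustering index makes the selection cost about B(R)/V(R, a) = 1,000/100 = 10 disk I/O's, while a non-clustering index may cost up to T(R)/V(R, a) = 20,000/100 = 200 disk I/O's.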

Page 186: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Joining by Using an Index(Algorithm 1)

• Consider the natural join R(X,Y) |><| S(Y,Z).
• Suppose S has an index on attribute Y.
• Examine each block of R, and within each block consider each tuple t; let tY be the component of t corresponding to attribute Y.
• Use the index to find the tuples of S that have tY in their Y component.
• Joining each such tuple of S with t produces the output of the join.

Page 187: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Joining by Using an Index(Algorithm 1) Analysis

• Consider R(X,Y) |><| S(Y,Z).
• If R is clustered, then we have to read B(R) blocks to get all the tuples of R. If R is not clustered, then up to T(R) disk I/O's are required.
• For each tuple t of R, we must read an average of T(S) / V(S,Y) tuples of S.
• Total for the index lookups: roughly T(R)·T(S) / V(S,Y) disk I/O's with a non-clustering index on S, or about T(R)·B(S) / V(S,Y) with a clustering index, in addition to the cost of reading R.
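For example, with hypothetical statistics T(R) = 10,000, B(R) = 1,000 (R clustered), T(S) = 5,000, B(S) = 500 and V(S,Y) = 100, the join costs about 1,000 + 10,000 × 5,000 / 100 = 501,000 disk I/O's with a non-clustering index on S, and about 1,000 + 10,000 × 500 / 100 = 51,000 with a clustering index.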

Page 188: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Join Using a Sorted Index

• Consider R(X,Y) |><| S(Y,Z).
• Data structures such as B-trees provide the best sorted indexes.
• In the best case, if we have sorted indexes on Y for both R and S, we perform only the final merge step of the simple sort-based join (sometimes called a zig-zag join).

Page 189: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Join Using a Sorted Index(Zig-zag join)

• Consider R(X,Y) |><| S(Y,Z), where we have indexes on Y for both R and S.
• Tuples from R with a Y-value that does not appear in S never need to be retrieved, and vice versa.

[Figure: the join zig-zags between the index on Y in R and the index on Y in S]

Page 190: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Thank You!

• Questions?

Page 191: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

15.7 Buffer Management

By Snigdha Rao Parvatneni
SJSU ID: 008648978

Class Roll Number: 124

Course: CS257

Page 192: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Agenda

• Introduction
• Role of Buffer Manager
• Architecture of Buffer Management
• Buffer Management Strategies
• Relation between Physical Operator Selection and Buffer Management
• Example

Page 193: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Introduction

• Generally, we assume that operators on relations have some main-memory buffers available in which to store the data they need.

• It is very rare that these buffers are allocated in advance to the operator.

• Task of assigning main memory buffers to processes is given to the Buffer Manager.

• The Buffer Manager is responsible for allocating the main memory to the process as per the need and minimizing the delays and unsatisfiable requests.

Page 194: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Role of Buffer Manager

• The Buffer Manager responds to requests from processes for main-memory access to disk blocks (the slide illustrates this with a figure).

Page 195: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Architecture of Buffer Management

• There are two broad architectures for a buffer manager:
– The buffer manager controls main memory directly, as in many relational DBMSs.
– The buffer manager allocates buffers in virtual memory and lets the OS decide which buffers are in main memory and which are in the OS-managed swap space on disk, as in many object-oriented DBMSs and main-memory DBMSs.

Page 196: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Problem

• Whichever approach is used, the buffer manager has to limit the number of buffers in use so that they fit in the available main memory.
– When the buffer manager controls main memory directly:
• If requests exceed the available space, the buffer manager has to select a buffer to empty by returning its contents to disk.
• If that block has not been changed, it is simply erased from main memory; if it has been changed, it is written back to its place on disk.
– When the buffer manager allocates space in virtual memory:
• The buffer manager has the option of allocating more buffers than can actually fit in main memory. When all of these buffers are in use, thrashing results.
• Thrashing is an operating-system problem in which many blocks are moved in and out of the disk's swap space; the system ends up spending most of its time swapping blocks and gets very little work done.

Page 197: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Solution

• To avoid this problem, the number of buffers is fixed when the DBMS is initialized.
• Users need not worry about which mode of buffering is used.
• From the users' point of view there is simply a fixed-size buffer pool, i.e., a set of buffers available to queries and other database actions.

Page 198: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Buffer Management Strategies

• Buffer Manager needs to make a critical choice of which block to keep and which block to discard when buffer is needed for newly requested blocks.

• For this, the buffer manager uses a buffer-replacement strategy. Some common strategies are:

– Least-Recently-Used (LRU)

– First-In-First-Out (FIFO)

– The Clock Algorithm (Second Chance)

– System Control

Page 199: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Least-Recently Used (LRU)

• The rule is to throw out the block that has not been read or written for the longest time.
• To do this the buffer manager maintains a table indicating the last time the block in each buffer was accessed.
• Each database access must make an entry in this table, so a significant amount of effort is involved in maintaining this information.
• However, buffers that have not been used for a long time are less likely to be accessed soon than buffers that have been accessed recently, so LRU is an effective strategy.
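A minimal sketch of LRU replacement in Python, assuming hypothetical read_block/write_block callbacks for the disk; an OrderedDict keeps the buffers in access order so the least-recently-used one can be evicted.

from collections import OrderedDict

class LRUBufferPool:
    """Toy buffer pool: holds at most `capacity` blocks, evicting the LRU block."""

    def __init__(self, capacity, read_block, write_block):
        self.capacity = capacity
        self.read_block = read_block        # callback: fetch a block from disk
        self.write_block = write_block      # callback: write a dirty block to disk
        self.buffers = OrderedDict()        # block_id -> (data, dirty), oldest first

    def access(self, block_id, dirty=False):
        if block_id in self.buffers:
            data, was_dirty = self.buffers.pop(block_id)    # refresh "last used" time
            dirty = dirty or was_dirty
        else:
            if len(self.buffers) >= self.capacity:
                victim, (vdata, vdirty) = self.buffers.popitem(last=False)  # LRU block
                if vdirty:
                    self.write_block(victim, vdata)   # changed blocks go back to disk
            data = self.read_block(block_id)
        self.buffers[block_id] = (data, dirty)        # most-recently-used position
        return data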

Page 200: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

First-In-First-Out (FIFO)

• In this rule, when a buffer is needed, the buffer that has held the same block the longest is emptied and reused for the new block.

• To do this Buffer Manager needs to know only the time at which block occupying the buffer was loaded into the buffer.

• Entry in the table is made when block is read from disk, not every time it is accessed.

• Involves less maintenance than LRU but it is more prone to mistakes.

Page 201: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

The Clock Algorithm

• It is an efficient approximation of LRU and is commonly implemented.

• Buffers are treated to be arranged in circle where arrow points to one of the buffers. Arrow will rotate clockwise if it needs to find a buffer to place a disk block.

• Each buffer has an associated flag with value 0 or 1. Buffers with flag value 0 are vulnerable to content transfer to disk whereas buffer with flag value 1 are not vulnerable.

• Whenever block is read into buffer or contents of buffer are accessed, flag associated with it is set to 1.

Page 202: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Working of the Clock Algorithm

• Whenever a buffer is needed for a block, the arrow looks for the first 0 it can find, moving clockwise.
• As the arrow passes a buffer whose flag is 1, it changes that flag to 0.
• A block is thrown out of its buffer only if it remains unaccessed, i.e., its flag stays 0, for the time between two passes of the arrow: on the first pass the flag is set from 1 to 0, and the block is replaced if the flag is still 0 when the arrow comes back.
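A sketch of the clock (second-chance) policy in Python; the buffer contents themselves are omitted and only the block-to-buffer assignment and the flags are modelled.

class ClockBufferPool:
    """Toy clock replacement over a fixed ring of buffers."""

    def __init__(self, n_buffers):
        self.blocks = [None] * n_buffers     # which disk block each buffer holds
        self.flags = [0] * n_buffers         # 1 = recently used, 0 = vulnerable
        self.hand = 0                        # the rotating "arrow"

    def access(self, block_id):
        if block_id in self.blocks:                      # block already buffered:
            self.flags[self.blocks.index(block_id)] = 1  # mark it recently used
            return
        while self.flags[self.hand] == 1:                # sweep clockwise for a 0
            self.flags[self.hand] = 0                    # first pass: second chance
            self.hand = (self.hand + 1) % len(self.blocks)
        self.blocks[self.hand] = block_id                # evict and reuse this buffer
        self.flags[self.hand] = 1
        self.hand = (self.hand + 1) % len(self.blocks)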

Page 203: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

System Control

• The query processor and other DBMS components can advise the buffer manager, to avoid some of the mistakes that occur with LRU, FIFO or the clock algorithm.
• Some blocks cannot be moved out of main memory without modifying other blocks that point to them. Such blocks are called pinned blocks.
• The buffer manager must modify its replacement strategy to avoid expelling pinned blocks, so some blocks remain in main memory even though the replacement strategy alone would have written them out to disk.

Page 204: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Relation Between Physical Operator Selection And Buffer Management

• The query optimizer selects the physical operators used to execute the query, and each physical operator expects a certain number of buffers for its execution.
• However, the buffer manager does not guarantee the availability of these buffers when the query is executed.
• In this situation two questions arise:
– Can an algorithm adapt to changes in the number of available main-memory buffers?
– When fewer buffers are available than expected, some blocks must be kept on disk instead of in main memory. How does the buffer-replacement strategy affect performance?

Page 205: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Example

• Block-based nested-loop join: the algorithm itself does not depend on the number of available buffers M, but its performance does.
• For each group of M-1 blocks of the outer-loop relation S, read those blocks into main memory and organize their tuples into a search structure whose key is the attribute(s) common to R and S.
• Then, for each block b of the inner-loop relation R, read b into main memory, and for each tuple t of b find the tuples of S that join with t.
• S thus uses M-1 buffers, and performance depends on the average number of buffers available during each iteration; one buffer is reserved for R.
• If we pin the M-1 blocks used for S in one iteration of the outer loop, we cannot lose those buffers during that round. If more buffers become available, more blocks of R can be kept in memory. Will that improve the running time?

Page 206: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Cases with LRU

• Case 1
– LRU is used as the buffer-replacement strategy, and k buffers are available to hold blocks of R.
– If R is always read in the same order, the blocks that remain in the buffers at the end of an iteration of the outer loop are the last k blocks of R.
– The next iteration starts again from the beginning of R, so those k buffered blocks are of no use and all k buffers for R must be replaced.
• Case 2
– A better implementation of nested-loop join, when LRU is used, visits the blocks of R in an order that alternates: first-to-last, then last-to-first.
– This saves k disk I/O's on each iteration except the first.

Page 207: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

With Other Algorithms

• Other algorithms are also impacted by the fact that the number of available buffers can vary, and by the buffer-replacement strategy used by the buffer manager.
• In sort-based algorithms, when fewer buffers are available we can shrink the sorted sublists. The main limitation is that we may be forced to create so many sublists that we cannot allocate a buffer to each of them during the merging pass.
• In hash-based algorithms, when fewer buffers are available we can reduce the number of buckets, provided the buckets do not then become so large that they fail to fit in the allotted main memory.

Page 208: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

References

• DATABASE SYSTEMS: The Complete Book, Second Edition by Hector Garcia-Molina, Jeffrey D. Ullman & Jennifer Widom

Page 209: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Thank You

Page 210: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

15.8 Algorithms using more than two passes

Presented By: Seungbeom Ma (ID 125)

Professor: Dr. T. Y. Lin

Computer Science Department

San Jose State University

Page 211: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Multipass Algorithms

• Previously, most of the algorithms we saw required at most two passes.
• In some cases we need more than two passes.
• Case: the data is too big to store in main memory.
– We then have to hash or sort the relation with a multipass algorithm.

Page 212: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Agenda

• 1. Multipass Sort-Based Algorithm • 2. Multipass Hash-Based Algorithm

Page 213: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Multipass sort-based algorithm.

• M: number of memory buffers.
• R: the relation to sort.
• B(R): number of blocks holding the relation.
• BASIS:
1. If R fits in M blocks (B(R) ≤ M):
2. Read R into main memory.
3. Sort R in main memory with any sorting algorithm.
4. Write the sorted relation to disk.

Page 214: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Multipass sort-based algorithm.

• INDUCTION (B(R) > M):
1. If R does not fit into main memory, partition the blocks holding R into M groups, called R1, R2, …, RM.
2. Recursively sort Ri, for i = 1 to M.
3. Once the sorting is done, merge the M sorted sublists.

Page 215: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.
Page 216: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Performance: Multipass Sort-Based Algorithms

1) Each pass of a sorting algorithm:
1. reads the data from disk,
2. sorts it with any main-memory sorting algorithm, and
3. writes it back to disk.

2) A k-pass sorting algorithm therefore needs 2k·B(R) disk I/O's.

3) A multipass sort-based join of R and S needs A + B disk I/O's, where
A = 2(k-1)(B(R) + B(S)), the disk I/O's to sort the sublists, and
B = B(R) + B(S), the disk I/O's to read the sorted sublists in the final pass.

Total: (2k-1)(B(R) + B(S)) disk I/O's.

Page 217: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Multipass Hash-Based Algorithms

• 1. Hash the relation(s) into M-1 buckets, where M is the number of memory buffers.
• 2. Unary case: apply the operation to each bucket individually.
– Duplicate elimination (δ) and grouping (γ):
1) Grouping: MIN, MAX, COUNT, SUM, AVG, which group the data in the table.
2) Duplicate elimination: DISTINCT.
– Basis: if the relation fits in M memory blocks, read it into memory and perform the operation in one pass.
• 3. Binary case: apply the operation to each corresponding pair of buckets.
– Query operations: union, intersection, difference and join.
– Basis: if either relation fits in M-1 memory blocks, read that relation into M-1 blocks of main memory, read the other relation one block at a time into the Mth block, and perform the operation.

Page 218: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

INDUCTION

• If the relation(s), in either the unary or the binary case, do not fit into the main-memory buffers:
1. Hash each relation into M-1 buckets.
2. Recursively perform the operation on each bucket, or on each corresponding pair of buckets.
3. Accumulate the output from each bucket or pair.

Page 219: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Hash-Based Algorithms: Unary Operators

Page 220: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Performance: Hash-Based Algorithms

• R: the relation; the operations considered are unary ones such as δ and γ.
• M: number of buffers.
• u(M, k): the number of blocks in the largest relation that a k-pass hashing algorithm can handle.

Page 221: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Performance: Induction

Induction:

1. Assume that the first pass divides relation R into M-1 equally sized buckets.
2. Each bucket must be small enough to handle in k-1 passes, i.e., at most u(M, k-1) blocks.
3. Since R is divided into M-1 such buckets, u(M, k) = (M-1)·u(M, k-1); with the basis u(M, 1) = M this gives u(M, k) = M(M-1)^(k-1).
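For instance, with a hypothetical M = 101 buffers, a two-pass hash-based unary operation can handle relations of up to u(101, 2) = 101 × 100 = 10,100 blocks, and a three-pass version up to 101 × 100² = 1,010,000 blocks.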

Page 222: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Sort-Based VS Hash-Based

1. Sort-based can produce output in sorted order. It might be helpful to reduce rotational latency or seek time

2. Hash-based algorithms depend on the buckets being of roughly equal size. For binary operations, hash-based algorithms limit only the size of the smaller relation, not the sum, so they can be faster than sort-based algorithms when one relation is much smaller than the other.

Page 223: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

THANKS

Page 224: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

David Le, CS257, ID: 126, Feb 28, 2013

15.9 Query Execution Summary

Page 225: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Overview

Query Processing
Outline of Query Compilation
Table Scanning
Cost Measures
Review of Algorithms
– One-pass methods
– Nested-loop join
– Two-pass: sort-based, hash-based
– Index-based
– Multi-pass

Page 226: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Query Processing

[Figure: a query is passed to query compilation, which produces a query plan; query execution then carries out the plan, consulting metadata and data]

The query is compiled. This involves extensive optimization using the operations of relational algebra.

It is first compiled into a logical query plan, e.g. using expressions of relational algebra.

The logical plan is then converted to a physical query plan, by selecting an implementation for each operator, ordering joins, and so on.

The query is then executed.

Page 227: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Outline of Query Compilation

Parsing: a parse tree for the query is constructed from the SQL text.

Query rewrite: the parse tree is converted to an initial query plan and transformed into a logical query plan.

Physical plan generation: the logical plan is converted into a physical plan by selecting algorithms and an order of execution.

[Figure: SQL query → parse query (expression tree) → select logical plan (logical query plan tree) → select physical plan (physical query plan tree) → execute plan]

Page 228: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Table Scanning

There are two approaches for locating the tuples of a relation R:
• Table-scan: get the blocks one by one.
• Index-scan: use an index to lead us to all blocks holding R.

Sort-scan takes a relation R and sorting specifications, and produces R in sorted order. This can be requested with the SQL clause ORDER BY.

Page 229: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Cost Measures

Estimates of cost are essential for query optimization: they allow us to determine the slow and fast parts of a query plan.

Reading many consecutive blocks on a track is extremely important, since disk I/O's are expensive in terms of time.

EXPLAIN SELECT * FROM a JOIN b ON a.id = b.id;

Page 230: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Cost Measures: Optimizing Queries

EXPLAIN SELECT snp.* FROM snp JOIN chr ON snp.chr_key = chr.chr_key WHERE snp_name <> '';

Page 231: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

One-pass Methods

Tuple-at-a-time: Selection and projection that do not require an entire relation in memory at once.

Full-relation, unary operations: these must see all or most of the tuples in memory at once, and include the grouping and duplicate-elimination operators. A hash table (O(n)) or a balanced binary search tree (O(n log n)) is used to speed up duplicate detection.

Full-relation, binary operations. These include union, intersection, difference, product and join.

Review of Algorithms

Page 232: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Nested-Loop Joins

In a sense, it is a 'one-and-a-half'-pass method, since one argument has its tuples read only once, while the other is read repeatedly.

It can be used for relations of any size; it is not necessary that one relation fit in main memory.

Two variations of nested-loop join:
• Tuple-based: the simplest form; it can be very slow, taking T(R)·T(S) disk I/O's when joining R(X, Y) with S(Y, Z).
• Block-based: organizes access to both argument relations by blocks and uses as much main memory as possible to store tuples of the outer relation.

Review of Algorithms

Page 233: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Two-pass Algorithms

Usually enough, even for large relations.

Based on sorting:
• Partition the arguments into memory-sized, sorted sublists.
• The sorted sublists are then merged appropriately to produce the desired results.

Based on hashing:
• Partition the arguments into buckets.
• Useful when the data is too big to fit in memory.

Review of Algorithms

Page 234: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Two-pass Algorithms

Sort-based vs. hash-based:
• Hash-based algorithms are often superior to sort-based ones, since they require only one of the arguments to be small.
• Sort-based algorithms work well when there is a reason to keep some of the data sorted.

Review of Algorithms

Page 235: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Index-based Algorithms

Index-based joins are excellent when one of the relations is small and the other has an index on the join attribute(s).

Clustering and non-clustering indexes:
• A clustering index has all tuples with a fixed search-key value packed into the minimum number of blocks.
• A clustered relation can also have non-clustering indexes.

Review of Algorithms

Page 236: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Multi-pass Algorithms

The two-pass algorithms based on sorting or hashing have natural generalizations to three or more passes, which work for larger data sets.

Each pass of a sorting algorithm reads all data from disk and writes it out again.

Thus, a k-pass sorting algorithm requires 2·k·B(R) disk I/O’s.

Review of Algorithms

Page 237: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Database Systems: The Complete Book, 2nd Edition. Chapter 15, sections 1 to 9.

Reference

Thank You.

Page 238: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Concurrency Control

18.1 Serial and Serializable Schedules

Dona BaysaID: 127

CS 257Spring 2013

Page 239: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Chapter 18.1 Topics

• Intro: Concurrency Control & Scheduler
• Serializability
• Schedules: Serial and Serializable

Page 240: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Intro: Concurrency Control & Scheduler

• Concurrently executing transactions can cause an inconsistent database state.
• Concurrency control assures that transactions preserve consistency.
• Scheduler:
– Regulates the individual steps of different transactions.
– Takes read/write requests from transactions and executes or delays them.

Page 241: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Intro: Scheduler

• Transaction requests are passed to the scheduler.
• The scheduler determines the execution of the requests.

[Figure: the transaction manager passes read/write requests to the scheduler, which issues reads and writes against the buffers]

Page 242: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Serializability

• How do we assure that concurrently executing transactions preserve database correctness?
– Serializability: schedule the transactions so that the effect is as if they were executed one at a time.
– This is made precise by the notion of a schedule.

Page 243: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Schedules

• Schedule – a sequence of the important actions performed by transactions; the actions are reads and writes.
• Example: two transactions and their actions:

T1: READ(A, t); t := t + 100; WRITE(A, t); READ(B, t); t := t + 100; WRITE(B, t);
T2: READ(A, s); s := s * 2; WRITE(A, s); READ(B, s); s := s * 2; WRITE(B, s);

Page 244: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Serial Schedules

• All actions of one transaction are followed by all actions of another transaction, and so on.
• There is no mixing of actions.
• The result depends only on the order of the transactions.
• The possible serial schedules here are:
– T1 precedes T2
– T2 precedes T1

Page 245: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Serial Schedule: Example

• T1 precedes T2; notation: (T1, T2).
• Consistency constraint: A = B.
• Final values: A = B = 250, so consistency is preserved.

Initially A = B = 25.
T1: READ(A, t); t := t + 100; WRITE(A, t)   → A = 125
T1: READ(B, t); t := t + 100; WRITE(B, t)   → B = 125
T2: READ(A, s); s := s * 2; WRITE(A, s)     → A = 250
T2: READ(B, s); s := s * 2; WRITE(B, s)     → B = 250

Page 246: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Serializable Schedules

• Serial schedules preserve consistency.
• Are there other schedules that also guarantee consistency? Yes: serializable schedules.
• Definition: a schedule S is serializable if there is a serial schedule S' such that, for every initial database state, the effects of S and S' are the same.

Page 247: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Serializable Schedule: Example

• A serializable, but not serial, schedule.
• T2 acts on A after T1 does, but before T1 acts on B.
• The effect is the same as the serial schedule (T1, T2).

Initially A = B = 25.
T1: READ(A, t); t := t + 100; WRITE(A, t)   → A = 125
T2: READ(A, s); s := s * 2; WRITE(A, s)     → A = 250
T1: READ(B, t); t := t + 100; WRITE(B, t)   → B = 125
T2: READ(B, s); s := s * 2; WRITE(B, s)     → B = 250

Page 248: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Notation: Transactions and Schedules

• Transactions: Ti (for example T1, T2, …)
• Database elements: X
• Actions: reads and writes, written rTi(X) = ri(X) and wTi(X) = wi(X)
• Examples
– Transactions:
T1: r1(A); w1(A); r1(B); w1(B);
T2: r2(A); w2(A); r2(B); w2(B);
– Schedule:
r1(A); w1(A); r2(A); w2(A); r1(B); w1(B); r2(B); w2(B);

Page 249: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Thank you!

Page 250: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Concurrency Control

18.2 Conflict Serializability

Geetha Ranjini ViswanathanID: 121

Page 251: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

18.2 Conflict-Serializability

• 18.2.1 Conflicts
• 18.2.2 Precedence Graphs and a Test for Conflict-Serializability
• 18.2.3 Why the Precedence-Graph Test Works

Page 252: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

18.2.1 Conflicts

• Conflict - a pair of consecutive actions in a schedule such that, if their order is interchanged, the final state produced by the schedule is changed.

Page 253: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

18.2.1 Conflicts

• Non-conflicting situations: Assuming Ti and Tj are different transactions, i.e., i ≠ j:

• ri(X); rj(Y) will never conflict, even if X = Y.

• ri(X); wj(Y) will not conflict for X ≠ Y.

• wi(X); rj(Y) will not conflict for X ≠ Y.

• wi(X); wj(Y) will not conflict for X ≠ Y.

Page 254: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

18.2.1 Conflicts

• Conflicting situations: three situations in which actions may not be swapped:
– Two actions of the same transaction always conflict: ri(X); wi(Y).
– Two writes of the same database element by different transactions conflict: wi(X); wj(X).
– A read and a write of the same database element by different transactions conflict: ri(X); wj(X) and wi(X); rj(X).

Page 255: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

18.2.1 Conflicts

• Conclusions: any two actions of different transactions may be swapped unless:
– they involve the same database element, and
– at least one of them is a write.

• The schedules S and S’ are conflict-equivalent, if S can be transformed into S’ by a sequence of non-conflicting swaps of adjacent actions.

• A schedule is conflict-serializable if it is conflict-equivalent to a serial schedule.

Page 256: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

18.2.1 Conflicts

• Example 18.6: a conflict-serializable schedule
S: r1(A); w1(A); r2(A); w2(A); r1(B); w1(B); r2(B); w2(B);
The schedule is converted to the serial schedule S' = (T1, T2) through a sequence of swaps of adjacent non-conflicting actions:

r1(A); w1(A); r2(A); w2(A); r1(B); w1(B); r2(B); w2(B);r1(A); w1(A); r2(A); r1(B); w2(A); w1(B); r2(B); w2(B);r1(A); w1(A); r1(B); r2(A); w2(A); w1(B); r2(B); w2(B);r1(A); w1(A); r1(B); r2(A); w1(B); w2(A); r2(B); w2(B);

S’: r1(A); w1(A); r1(B); w1(B); r2(A); w2(A); r2(B); w2(B);

Page 257: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

18.2.2 Precedence Graphs and a Test for Conflict-Serializability

• Given a schedule S, involving transactions T1 and T2, T1 takes precedence over T2 (T1 <s T2), if there are actions A1 of T1 and A2 of T2, such that:

• A1 is ahead of A2 in S,

• Both A1 and A2 involve the same database element, and

• At least one of A1 and A2 is a write action

• We cannot swap the order of A1 and A2.

• A1 will appear before A2 in any schedule that is conflict-equivalent to S.

• A conflict-equivalent serial schedule must have T1 before T2.

Page 258: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

18.2.2 Precedence Graphs and a Test for Conflict-Serializability

• Precedence graph:
– Nodes represent the transactions of S.
– There is an arc from node i to node j if Ti <S Tj.
• Example 18.7: given the schedule
S: r2(A); r1(B); w2(A); r3(A); w1(B); w3(A); r2(B); w2(B);

[Precedence graph: T1 → T2 → T3]

The graph is acyclic, so the schedule is conflict-serializable.

Page 259: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

18.2.2 Precedence Graphs and a Test for Conflict-Serializability

• Example 18.8S: r2(A); r1(B); w2(A); r3(A); w1(B); w3(A); r2(B); w2(B);

Convert S to serial schedule S’ (T1, T2, T3).

r1(B); r2(A); w2(A); r3(A); w1(B); w3(A); r2(B); w2(B);

r1(B); r2(A); w2(A); w1(B); r3(A); w3(A); r2(B); w2(B);

r1(B); r2(A); w1(B); w2(A); r3(A); w3(A); r2(B); w2(B);

r1(B); w1(B); r2(A); w2(A); r3(A); w3(A); r2(B); w2(B);

r1(B); w1(B); r2(A); w2(A); r3(A); r2(B); w3(A); w2(B);

r1(B); w1(B); r2(A); w2(A); r2(B); r3(A); w3(A); w2(B);

r1(B); w1(B); r2(A); w2(A); r2(B); r3(A); w2(B); w3(A);

r1(B); w1(B); r2(A); w2(A); r2(B); w2(B); r3(A); w3(A);

S’: r1(B); w1(B); r2(A); w2(A); r2(B); w2(B); r3(A); w3(A);

Page 260: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

18.2.2 Precedence Graphs and a Test for Conflict-Serializability

• Example 18.9: given the schedule
S: r2(A); r1(B); w2(A); r2(B); r3(A); w1(B); w3(A); w2(B);

[Precedence graph: T1 → T2, T2 → T1, T2 → T3]

The graph is cyclic (T1 and T2 precede each other), so the schedule is NOT conflict-serializable.

Page 261: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

18.2.3 Why the Precedence-Graph Test Works

• Consider a cycle involving n transactions T1 —> T2 ... —> Tn —> T1

• In the hypothetical serial order, the actions of T1 must precede those of T2, which precede those of T3, and so on, up to Tn.

• But the actions of Tn, which therefore come after those of T1, are also required to precede those of T1. This puts constraints on legal swaps between T1 and Tn.

• Thus, if there is a cycle in the precedence graph, then the schedule is not conflict-serializable.
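The test itself is easy to mechanize. Below is a small sketch (not the book's code) that builds the precedence graph from a schedule, given as (transaction, action, element) triples, and checks it for cycles with Kahn's algorithm:

from itertools import combinations

def conflict_serializable(schedule):
    """schedule: list of (txn, action, element) triples, e.g. (2, 'r', 'A') for r2(A)."""
    nodes = {t for t, _, _ in schedule}
    edges = set()
    for (ti, ai, xi), (tj, aj, xj) in combinations(schedule, 2):
        # earlier action first; add an arc Ti -> Tj whenever the pair conflicts
        if ti != tj and xi == xj and 'w' in (ai, aj):
            edges.add((ti, tj))
    while nodes:                                   # Kahn's algorithm: peel off sources
        sources = {n for n in nodes if not any(v == n for (_, v) in edges)}
        if not sources:
            return False       # every remaining node lies on a cycle: not serializable
        nodes -= sources
        edges = {(u, v) for (u, v) in edges if u in nodes}
    return True                # acyclic precedence graph: conflict-serializable

# Example 18.7: r2(A); r1(B); w2(A); r3(A); w1(B); w3(A); r2(B); w2(B);
S = [(2, 'r', 'A'), (1, 'r', 'B'), (2, 'w', 'A'), (3, 'r', 'A'),
     (1, 'w', 'B'), (3, 'w', 'A'), (2, 'r', 'B'), (2, 'w', 'B')]
print(conflict_serializable(S))   # True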

Page 262: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Questions?

Page 263: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Shailesh Padave

ID 111

CS257Spring 2013

18.3 Enforcing Serializability by locks

Page 264: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

INTRODUCTION

• Enforcing serializability by locks
– Locks
– Locking scheduler
– Two-phase locking

Page 265: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Locks

• The most common architecture for a scheduler is one in which locks are maintained on database elements to prevent unserializable behavior.
• It works as follows:
– A transaction issues a request.
– The scheduler consults the lock table to guide its decision.
– A serializable schedule of actions is generated.

Page 266: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Consistency of transactions

• Actions and locks must relate to each other in the expected ways:
– A transaction can read or write a database element only if it holds a lock on that element.
– Every element a transaction locks must eventually be unlocked.
• Legality of schedules
– No two transactions may hold a lock on the same element at the same time; one must release its lock before the other can acquire it.

Page 267: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Locking scheduler

• Grants a lock request only if doing so keeps the schedule legal.
• A lock table stores the information about the current locks on elements.
• Notation:
– li(X): transaction Ti requests a lock on database element X.
– ui(X): transaction Ti releases its lock on database element X.

Page 268: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Locking scheduler (contd.)

• A legal schedule of consistent transactions that is, unfortunately, not serializable.

Initially A = B = 25.
T1: l1(A); r1(A); A := A+100; w1(A); u1(A);    → A = 125
T2: l2(A); r2(A); A := A*2; w2(A); u2(A);      → A = 250
T2: l2(B); r2(B); B := B*2; w2(B); u2(B);      → B = 50
T1: l1(B); r1(B); B := B+100; w1(B); u1(B);    → B = 150

The final state A = 250, B = 150 violates the consistency constraint A = B.

Page 269: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Locking schedule (contd.)

• The locking scheduler delays requests in order to maintain a consistent database state.

Initially A = B = 25.
T1: l1(A); r1(A); A := A+100; w1(A); l1(B); u1(A);    → A = 125
T2: l2(A); r2(A); A := A*2; w2(A);                    → A = 250
T2: l2(B) is Denied (T1 still holds the lock on B)
T1: r1(B); B := B+100; w1(B); u1(B);                  → B = 125
T2: l2(B); u2(A); r2(B); B := B*2; w2(B); u2(B);      → B = 250

The final state A = B = 250 is consistent.

Page 270: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Two-phase locking(2PL)

• Guarantees a legal schedule of consistent transactions is conflict-serializable.

• All lock requests precede all unlock requests.

• The growing phase:– Obtain all the locks and no unlocks allowed.

• The shrinking phase:– Release all the locks and no locks allowed.

Page 271: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Working of Two-Phase locking

• Assures serializability.
• Two stricter protocols based on 2PL:
– Strict two-phase locking: a transaction holds all its write locks until it commits or aborts.
– Rigorous two-phase locking: a transaction holds all its locks until it commits or aborts.
• In the equivalent serial order, two-phase-locked transactions are ordered by their first unlock actions.

Page 272: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Two Phase Locking

[Figure: the number of locks held by a transaction over time rises during the growing phase and falls during the shrinking phase]

Every two-phase-locked transaction has a point at which it may be thought to execute instantaneously.

Page 273: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Failure of 2PL.

• 2PL does not protect against deadlocks.
• T1: l1(A); r1(A); A:=A+100; w1(A); l1(B); u1(A); r1(B); B:=B+100; w1(B); u1(B);
• T2: l2(B); r2(B); B:=B*2; w2(B); l2(A); u2(B); r2(A); A:=A*2; w2(A); u2(A);

T1                        T2                        A        B
                                                    25       25
l1(A); r1(A);
                          l2(B); r2(B);
A := A+100;
                          B := B*2;
w1(A);                                              125
                          w2(B);                             50
l1(B) Denied
                          l2(A) Denied

Neither transaction can proceed: each is waiting for a lock held by the other.

Page 274: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Thank You

Page 275: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

18. Concurrency Control
18.4 Locking Systems With Several Lock Modes

by Kiruthika Sivaraman

ID: 129008663109

Page 276: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Introduction

• Problem:
• A transaction T must take a lock on a database element X even if it only wants to read X and not write it.
• This prevents another transaction from modifying the element while we read it, which would otherwise cause unserializable behavior.
• Therefore we have two kinds of locks:
• Shared lock ("read lock")
• Exclusive lock ("write lock")

Page 277: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Lock Types

• Shared Lock or Read Lock
– To read database element X we take a shared lock.
– There can be more than one shared lock on X.

• Exclusive Lock or Write Lock
– To write database element X we take an exclusive lock.
– There can be only one exclusive lock on X.

Page 278: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Notations Used

• sli(X) – Transaction Ti requests shared lock on database element X.

• xli(X) – Transaction Ti requests exclusive lock on database element X.

• ui(X) – Transaction Ti unlocks X.

Page 279: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Requirements

• Consistency of transactions
– A transaction may not write without an exclusive lock and may not read without some kind of lock.

• Two-phase locking of transactions
– Locking must precede unlocking: xli(X) or sli(X) cannot be preceded by ui(Y) for any Y.

• Legality of schedules
– An element may be locked exclusively by one transaction or in shared mode by several, but not both.

Page 280: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Compatibility Matrices

• Compatibility matrix describes lock management policy.

• Has row and column for each lock mode

• Each row corresponds to a mode of lock already held on element X by some transaction.

• Each column corresponds to the mode of lock being requested on X. (A small code sketch follows.)
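As a sketch of how a scheduler might consult such a matrix, the dictionary below encodes the shared/exclusive matrix described here; the function name and calling convention are illustrative assumptions, not the book's code.

```python
# Rows: mode of a lock already held on X; columns: mode being requested on X.
COMPATIBLE = {
    ("S", "S"): True,  ("S", "X"): False,
    ("X", "S"): False, ("X", "X"): False,
}

def can_grant(requested, held_modes):
    """Grant only if the requested mode is compatible with every lock held."""
    return all(COMPATIBLE[(held, requested)] for held in held_modes)

print(can_grant("S", ["S", "S"]))  # True: any number of shared locks may coexist
print(can_grant("X", ["S"]))       # False: exclusive conflicts with shared
print(can_grant("S", []))          # True: no locks held, so the request is granted
```

A request is granted only if it is compatible with every lock currently held on the element; with only S and X modes, that means any number of readers or a single writer.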

Page 281: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Upgrading Locks

• Transaction T first takes a shared lock on X.

• Later, when T is ready to write, it upgrades its lock on X to an exclusive lock.

• ui(X) releases whatever locks Ti holds on X.

• In this way T remains friendly toward other transactions: readers of X are not blocked until T actually needs to write.

Page 282: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Upgrading Locks - Drawback

• T1 first establishes a shared lock on X.

• T2 also establishes a shared lock on X.

• T1 tries to upgrade its lock to exclusive lock.

• T2 also tries the same.

• Deadlock!

Page 283: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Update Locks

• Update locks avoid this deadlock.

• An update lock allows reading, like a shared lock; the difference is that only the holder of an update lock may upgrade to an exclusive lock on X.

• Once an update lock has been granted on X, no further locks of any kind are granted on X.

Page 284: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Increment Lock

• Two transactions can hold an increment lock on the same database element at the same time.

• Useful because increments commute: the order in which they are applied does not matter.

INC(A,2); INC(A,10)  and  INC(A,10); INC(A,2)  leave A with the same final value.

Page 285: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Increment Lock Compatibility Matrix
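The matrix itself appears only as an image in the original slide. The dictionary below is a reconstruction of the shared/exclusive/increment matrix as described in the text (increment locks conflict with both S and X, but not with each other); variable names are illustrative.

```python
# Rows: mode of the lock already held on X; columns: mode being requested.
# S = shared, X = exclusive, I = increment.
INCREMENT_MATRIX = {
    "S": {"S": True,  "X": False, "I": False},
    "X": {"S": False, "X": False, "I": False},
    "I": {"S": False, "X": False, "I": True},
}

# Two increment locks are compatible because increments commute:
# INC(A,2); INC(A,10) and INC(A,10); INC(A,2) leave A with the same value.
assert INCREMENT_MATRIX["I"]["I"] is True
# But an increment conflicts with a read or a write of the same element.
assert INCREMENT_MATRIX["I"]["S"] is False and INCREMENT_MATRIX["S"]["I"] is False
```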

Page 286: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

THANK YOU!

Page 287: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Concurrency Control
Architecture for a Locking Scheduler

Section 18.5

Presented By: Akash Patel

ID: 113

Page 288: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Overview

• Overview of Locking Scheduler

• Scheduler That Inserts Lock Actions

• The Lock Table

• Handling Locking and Unlocking Request

Page 289: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Principles of simple scheduler architecture

• The transactions themselves do not request locks, or cannot be relied upon to do so. It is the job of the scheduler to insert lock actions into the stream of reads, writes, and other actions that access data.

• Transactions do not release locks. Rather, the scheduler releases the locks when the transaction manager tells it that the transaction will commit or abort.

Page 290: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Scheduler That Inserts Lock Actions into the Transactions' Request Stream

[Figure: actions such as Read(A); Write(B); Commit(T) flow from the transactions into Scheduler Part I, which inserts lock actions (e.g., Lock(A); Read(A)) ahead of them and passes the stream to Scheduler Part II, which consults the lock table.]

Page 291: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

• The scheduler maintains a lock table, which, although it is shown as secondary-storage data, may be partially or completely in main memory

• Actions requested by a transaction are generally transmitted through the scheduler and executed on the database.

• Under some circumstances a transaction is delayed, waiting for a lock, and its requests are not (yet) transmitted to the database.

Page 292: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

The two parts of the scheduler perform

• Part I takes the stream of requests generated by the transactions and inserts appropriate lock actions ahead of all database-access operations, such as read, write, increment, or update.

• Part II takes the sequence of lock and database-access actions passed to it by Part I, and executes each appropriately

Page 293: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Determine the transaction T to which the action belongs and whether T is delayed. If T is not delayed:

1. A database-access action is transmitted to the database and executed.

2. If a lock action is received by Part II, it checks the lock table to see whether the lock can be granted:
– i) If granted, the lock table is modified to include the new lock.
– ii) If not granted, the lock table is updated to record the requested lock, and Part II delays transaction T.

Page 294: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

3. When a transaction T commits or aborts, Part I is notified by the transaction manager and releases all of T's locks. If any transactions are waiting for those locks, Part I notifies Part II.

4. When Part II is notified that a lock on some database element is available, it determines the next transaction or transactions T' that can be given the lock and allowed to continue.

Page 295: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

The Lock Table

• A relation that associates database elements with locking information about that element

• Implemented with a hash table using database elements as the hash key

• Size is proportional to the number of lock elements only, not to the size of the entire database

[Figure: hash table mapping DB element A to the lock information for A.]

Page 296: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.
Page 297: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Lock table Entry Field

• Group mode
– S means that only shared locks are held.
– U means that there is one update lock and perhaps one or more shared locks.
– X means there is one exclusive lock and no other locks.

• Waiting
– The waiting bit tells whether at least one transaction is waiting for a lock on A.

• A list
– A list describing all the transactions that either currently hold locks on A or are waiting for a lock on A.
(A sketch of such an entry as a data structure follows.)
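A minimal sketch of one lock-table entry as a data structure; the field names are illustrative assumptions, since the book does not prescribe a concrete layout.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class LockTableEntry:
    """One lock-table entry for a single database element."""
    group_mode: Optional[str] = None   # 'S', 'U', or 'X'; None if no locks held
    waiting: bool = False              # True if at least one transaction waits
    # Each item: (transaction id, lock mode, still waiting?)
    lock_list: List[Tuple[str, str, bool]] = field(default_factory=list)
```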

Page 298: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Handling Lock Requests

• Suppose transaction T requests a lock on A

• If there is no lock table entry for A, then there are no locks on A, so create the entry and grant the lock request

• If the lock table entry for A exists, use the group mode to guide the decision about the lock request

Page 299: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

How to deal with an existing lock:

1) If the group mode is U (update) or X (exclusive), no other lock can be granted:
• Deny the lock request by T.
• Place an entry on the list saying that T requests a lock, with Wait? = 'yes'.

2) If the group mode is S (shared), another shared or update lock can be granted:
• Grant the request for an S or U lock.
• Create an entry for T on the list with Wait? = 'no'.
• Change the group mode to U if the new lock is an update lock.

(A sketch of this decision in code follows.)
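A sketch of this decision procedure, building on the LockTableEntry sketch above; the function name and return convention are illustrative, not the book's code.

```python
def request_lock(table, element, txn, mode):
    """Decide a lock request by `txn` on `element` in mode 'S', 'U', or 'X'.

    `table` maps database elements to LockTableEntry objects (sketch above).
    Returns True if the lock is granted, False if the request must wait.
    """
    entry = table.get(element)
    if entry is None:                              # no locks on the element yet
        table[element] = LockTableEntry(group_mode=mode,
                                        lock_list=[(txn, mode, False)])
        return True

    if entry.group_mode in ("U", "X"):             # no other lock can be granted
        entry.lock_list.append((txn, mode, True))
        entry.waiting = True
        return False

    if mode in ("S", "U"):                         # group mode S: grant S or U
        entry.lock_list.append((txn, mode, False))
        if mode == "U":
            entry.group_mode = "U"
        return True

    entry.lock_list.append((txn, mode, True))      # an X request must wait
    entry.waiting = True
    return False


table = {}
print(request_lock(table, "A", "T1", "S"))   # True: first lock on A
print(request_lock(table, "A", "T2", "U"))   # True: U joins S, group mode -> U
print(request_lock(table, "A", "T3", "S"))   # False: group mode U blocks it
```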

Page 300: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Handling Unlock Requests

• Now suppose transaction T unlocks A

• Delete T’s entry on the list for A

• If T’s lock is not the same as the group mode, no need to change group mode

• Otherwise check entire list for new group mode

Page 301: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Handling Unlock Requests

o If the value of Waiting is "yes", we need to grant one or more of the waiting locks, using one of the following approaches:

• First-come-first-served: grant the lock to the longest-waiting request; no starvation (no transaction waits forever for a lock).
• Priority to shared locks: grant all waiting S locks, then one U lock; grant an X lock only if nothing else is waiting.
• Priority to upgrading: if there is a U lock waiting to upgrade to an X lock, grant that first.

(A sketch of the last policy follows.)
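A sketch of the "priority to upgrading" policy, reusing the LockTableEntry sketch from earlier; it is illustrative only and assumes waiting requests were appended in arrival order.

```python
def pick_next_waiter(entry):
    """Choose which waiting request to wake after an unlock on this element.

    Implements 'priority to upgrading': a current lock holder waiting to
    upgrade to X goes first; otherwise fall back to first-come-first-served.
    """
    waiters = [(txn, mode) for (txn, mode, waiting) in entry.lock_list if waiting]
    holders = {txn for (txn, mode, waiting) in entry.lock_list if not waiting}
    for txn, mode in waiters:
        if mode == "X" and txn in holders:     # e.g. a U holder waiting for X
            return (txn, mode)
    return waiters[0] if waiters else None     # longest-waiting request
```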

Page 302: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Thank you

Page 303: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

18. Concurrency Control

18.6 Hierarchies of Database Elements


Mehal Patel (ID: 114)

Page 304: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Hierarchies of Database Elements

• Two problems arise when a tree structure is encountered in the data; in each case we must decide how to provide locking.
1) A hierarchy of lockable elements (this section): large elements (relations) contain smaller elements (blocks or individual tuples).
2) Data itself organized as a tree (next section, 18.7).

Page 305: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Locks with Multiple Granularity

• A database element can be a relation, block or a tuple

• Different database systems use different sizes of database elements for locking purposes.

Page 306: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Example

• Consider a database for a bank.
– Choosing relations as database elements means one lock for an entire relation.
– For a relation holding account balances, this kind of lock is very inflexible and provides very little concurrency.
– Only one deposit or withdrawal could be made at a time.
– A better way is to lock individual blocks or pages, so that two accounts in different blocks can be updated simultaneously.

Page 307: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Warning (Intention) Locks

• The warning protocol helps manage locks at different levels of the hierarchy.

Page 308: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Database Elements Organized in Hierarchy

Page 309: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Warning Protocol Rules

• The protocol involves both ordinary (S and X) locks and warning (IS and IX) locks.

• The rules:
– Begin at the root of the hierarchy.
– Request the S or X lock if we are at the desired element.
– If the desired element is further down the hierarchy, place a warning lock (IS if we want S, IX if we want X).
– When the warning lock is granted, proceed to the appropriate child node and repeat until the desired node is reached.

• Grants are decided using the compatibility matrix on the next slide; a code sketch of the root-to-leaf walk follows the matrix.

Page 310: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Compatibility Matrix for Shared, Exclusive and Intention Locks

        IS     IX     S      X
IS      Yes    Yes    Yes    No
IX      Yes    Yes    No     No
S       Yes    No     Yes    No
X       No     No     No     No

(Rows: lock already held on the element; columns: lock requested.)
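A sketch of the root-to-leaf walk prescribed by the warning protocol; the Node class and function names are illustrative assumptions, not the book's code.

```python
class Node:
    """A lockable element in the hierarchy (relation, block, tuple, ...)."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

def warning_protocol_locks(target, mode):
    """Return the locks to request, root first, in order to lock `target`.

    `mode` is 'S' or 'X'. Every ancestor gets the warning lock (IS for S,
    IX for X); the desired element itself gets the ordinary lock.
    """
    path = []
    node = target
    while node is not None:                 # climb from the target to the root
        path.append(node)
        node = node.parent
    path.reverse()                          # root first

    warning = "IS" if mode == "S" else "IX"
    return [(n.name, warning) for n in path[:-1]] + [(target.name, mode)]


movies = Node("Movie")                      # relation
block = Node("block B1", parent=movies)     # block inside the relation
row = Node("tuple t1", parent=block)        # tuple inside the block

# To X-lock tuple t1: IX on Movie, IX on block B1, then X on t1.
print(warning_protocol_locks(row, "X"))
```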

Page 311: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Example

• Movie(title, year, length, studioName)

Transaction T1:
  SELECT * FROM Movie WHERE title = 'King Kong'

Transaction T2:
  UPDATE Movie SET year = 1939 WHERE title = 'Gone with the Wind'

Page 312: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

• If T2 had updated the tuples for 'King Kong', it would have had to wait until T1 released its S lock.

• S and X are not compatible

Page 313: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Phantoms and Handling Insertions Correctly

• This problem arises when a transaction creates new subelements of a lockable element.

• Since we can lock only existing elements, the new elements are never locked.

• An example follows.

Page 314: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Example

• Consider a transaction T3:
– SELECT SUM(length) FROM Movie WHERE studioName = 'Disney'
– This computes the total length of all movies whose studioName is Disney.
– Thus T3 acquires IS on the relation and S on the targeted tuples.
– Now, if another transaction T4 inserts a new tuple with studioName = 'Disney', the result of T3 becomes incorrect.

Page 315: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

…contd

• This is not by itself a concurrency problem, since the serial order (T3, T4) is maintained.

• Now consider T4 writing its new tuple while T3 is reading tuples of the same relation:
– r3(d1); r3(d2); w4(d3); w4(X); w3(L); w3(X)
– We want the total length of all Disney movies, and the outcome should be consistent with some serial order.
– The schedule above is not: T3 does not compute the correct sum, because no lock on the new element (d3) was ever obtained.

Page 316: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

…contd

• The problem is that the relation has a phantom tuple (the newly inserted one), which should have been locked but did not exist when T3 acquired its locks.

• Phantoms can be avoided if every insertion or deletion transaction is treated as a write operation (exclusive lock, X) on the whole relation.

Page 317: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Thank You

Page 318: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

18.7 THE TREE PROTOCOL

MALLIKA PEREPA, ID: 115

Page 319: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Outline-

• Introduction
• Motivation
• Rules for Access to Tree-Structured Data
• Why the Tree Protocol Works

Page 320: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Introduction-

• In this section, we deal with tree structures that are formed by the link pattern.– Database elements are disjoint pieces of data, but the only

way to get to a node is through its parent.

– B-trees are an important example of this sort of data.

– Traversing a particular path to an element gives us freedom to manage locks differently from the two-phase locking approaches we have seen previously.

Page 321: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

B-tree Details-

• Basic data structure:
– Keeps records in sorted order; allows searches, sequential access, insertions, and deletions in logarithmic time.
– Used in databases and file systems.

• Locking structure:
– Granularity is at the node level. Treating smaller pieces as lockable elements is not beneficial, and treating the entire B-tree as a single element makes concurrency infeasible.

Page 322: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Motivation for Tree-Based Locking

• If we use a standard set of lock modes (shared, update

and exclusive) and two phase locking, then concurrent use of the B-tree is almost impossible.

• The reason is that every transaction must begin by locking the root node of the B-tree.

• Any transaction that inserts or deletes could wind up rewriting the root of the B-tree.

• Thus, only one transaction that is not read-only can access the B-tree at any time

• However, in most situations a B-tree node will not be rewritten, even if the transaction inserts or deletes a tuple.

Page 323: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

• Thus, as soon as a transaction moves to a child of the root and observes the situation that rules out a rewrite of the root, we would like to release the lock on the root.

• Releasing the lock on the root early will violate two phase locking, so we cannot be sure that the schedule of several transactions accessing the B-tree will be serializable.

• The solution is a specialized protocol for transactions that access tree structured data like B-trees.

Page 324: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Rules for Access to Tree-Structured Data-

• The following restrictions on locks form the tree protocol.

Assumptions:
– There is only ONE kind of lock, represented by li(X).
– Transactions are consistent and schedules must be legal, but there is no two-phase locking requirement on transactions.

Rules:
– A transaction's first lock may be at any node of the tree.
– Subsequent locks may only be acquired if the transaction currently holds a lock on the parent node.
– Nodes may be unlocked at any time.
– A transaction may not relock a node on which it has released a lock, even if it still holds a lock on the node's parent.

(A small sketch of these rules in code follows.)
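A minimal sketch that checks these rules for a single transaction; the class and method names are illustrative, and a real scheduler would also arbitrate between transactions.

```python
class TreeProtocolTxn:
    """Track one transaction's locks and check the tree-protocol rules."""

    def __init__(self):
        self.started = False     # True after the first lock
        self.held = set()        # nodes currently locked
        self.released = set()    # nodes unlocked earlier; may not be relocked

    def lock(self, node, parent=None):
        if node in self.released:
            raise RuntimeError("may not relock a node whose lock was released")
        if self.started and parent not in self.held:
            raise RuntimeError("must hold the parent's lock to lock this node")
        self.started = True      # the first lock may be at any node
        self.held.add(node)

    def unlock(self, node):
        self.held.discard(node)  # nodes may be unlocked at any time
        self.released.add(node)


t = TreeProtocolTxn()
t.lock("B")                      # first lock: any node of the tree
t.lock("D", parent="B")          # allowed: the parent B is still locked
t.unlock("B")
# t.lock("E", parent="B")        # would raise: B's lock has been released
```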

Page 325: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Example…

A tree of lockable elements:

            A
           / \
          B   C
         / \
        D   E
           / \
          F   G

Page 326: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

3 transactions following the protocol…

T1                        T2                        T3
l1(A); r1(A);
l1(B); r1(B);
l1(C); r1(C);
w1(A); u1(A);
l1(D); r1(D);
w1(B); u1(B);
                          l2(B); r2(B);
                                                    l3(E); r3(E);
w1(D); u1(D);
w1(C); u1(C);
                          l2(E) Denied
                                                    l3(F); r3(F);
                                                    w3(F); u3(F);
                                                    l3(G); r3(G);
                                                    w3(E); u3(E);
                          l2(E); r2(E);
                                                    w3(G); u3(G);
                          w2(B); u2(B);
                          w2(E); u2(E);

Page 327: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Why Tree Protocol works-

• The tree protocol implies a serial order on the transactions involved in a schedule.

• The order of precedence is defined as: Ti <S Tj if, in a schedule S, transactions Ti and Tj lock a node in common and Ti locks it first.

• If the precedence graph drawn from these precedence relations has no cycles, then any topological order of the transactions is an equivalent serial schedule.

• For example, either (T1, T3, T2) or (T3, T1, T2) is an equivalent serial schedule; the reason is that, in either order, the nodes shared by two transactions are touched in the same order as in the original schedule.

Page 328: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

• If two transactions lock several elements in common, then they are all locked in same order.

Precedence graph derived from the schedule: T1 → T2 and T3 → T2 (nodes 1, 2, 3; there are no cycles).

Page 329: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

[Figure: a path of elements locked by two transactions T and U; the labels indicate which transaction locks each element first (T locks X first; U locks Y and Z first).]

• Now consider an arbitrary set of transactions T1, T2, ..., Tn that obey the tree protocol and lock some of the nodes of a tree according to a schedule S.

• Among the transactions that lock the root, consider the order in which they do so.

• If Ti locks the root before Tj, then Ti locks every node it has in common with Tj before Tj does. That is, Ti <S Tj, but not Tj <S Ti.

Page 330: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

THANK YOU

Page 331: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

CS-257

CONCURRENCY CONTROL

SECTION 18.8: Timestamps

Page 332: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

What is Timestamping?

The scheduler assigns each transaction T a unique number, its timestamp TS(T).

Timestamps must be issued in ascending order, at the time a transaction first notifies the scheduler that it is beginning.

Page 333: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Timestamp TS(T)

Two methods of generating timestamps:

Use the value of the system clock as the timestamp.
Use a logical counter that is incremented each time a new timestamp is assigned.

Whichever method is used, the scheduler maintains a table of currently active transactions and their timestamps.

Page 334: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Timestamps for database element X and commit bit

RT(X): the read time of X, which is the highest timestamp of any transaction that has read X.

WT(X): the write time of X, which is the highest timestamp of any transaction that has written X.

C(X): the commit bit for X, which is true if and only if the most recent transaction to write X has already committed.

Page 335: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Physically Unrealizable Behavior

Read too late: a transaction U that started after transaction T wrote a value for X before T got to read X.

[Timeline: T starts; U starts; U writes X; T tries to read X.]

Page 336: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Physically Unrealizable Behavior

Write too late: a transaction U that started after T read X before T got a chance to write X.

[Timeline: T starts; U starts; U reads X; T tries to write X too late.]

Page 337: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Dirty Read

It is possible that after T reads the value of X written by U, transaction U will abort.

[Timeline: U starts; T starts; U writes X; T reads X; U aborts. T performs a dirty read if it reads X at the point shown.]

Page 338: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Thomas Write Rule

Too late to repair the damage: T's write is skipped because a write by U with a later timestamp is already in place, but U then aborts after T has committed, and T's skipped value cannot be reinstated.

[Timeline: T starts; U starts; U writes X; T writes X; T commits; U aborts.]

Page 339: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Rules for Timestamps-Based scheduling

The scheduler, in response to a read or write request from a transaction T has the choice of:

a) Granting the request,

b) Aborting T (if T would violate physical reality) and restarting T with a new timestamp (abort followed by restart is often called rollback), or

c) Delaying T and later deciding whether to abort T or to grant the request(if the request is a read, and the read might be dirty)

Page 340: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Rules for Timestamps-Based scheduling

Scheduler receives a request rT(X):

If TS(T) ≥ WT(X), the read is physically realizable.
  If C(X) is true, grant the request; if TS(T) > RT(X), set RT(X) := TS(T), otherwise do not change RT(X).
  If C(X) is false, delay T until C(X) becomes true or the transaction that wrote X aborts.

If TS(T) < WT(X), the read is physically unrealizable. Roll back T.

Page 341: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Rules for Timestamps-Based scheduling (Cont.)

Scheduler receives a request wT(X):

If TS(T) ≥ RT(X) and TS(T) ≥ WT(X), the write is physically realizable and must be performed:
  write the new value for X, set WT(X) := TS(T), and set C(X) := false.

If TS(T) ≥ RT(X) but TS(T) < WT(X), the write is physically realizable, but a later value is already in X:
  if C(X) is true, the previous writer of X has committed, so ignore the write by T (Thomas write rule);
  if C(X) is false, we must delay T.

If TS(T) < RT(X), the write is physically unrealizable, and T must be rolled back.

(A sketch of these rules in code follows.)
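A sketch of the read and write rules above; the dictionaries TS, RT, WT, and C stand in for the scheduler's tables, and 'grant', 'delay', 'rollback', 'skip', and 'write' are returned rather than acted upon. The function names and calling convention are illustrative, not the book's code.

```python
def handle_read(T, X, TS, RT, WT, C):
    """Rule for a read request r_T(X): 'grant', 'delay', or 'rollback'."""
    if TS[T] < WT[X]:
        return "rollback"                  # read is physically unrealizable
    if not C[X]:
        return "delay"                     # the last writer of X may still abort
    RT[X] = max(RT[X], TS[T])              # record the later read time
    return "grant"

def handle_write(T, X, TS, RT, WT, C):
    """Rule for a write request w_T(X)."""
    if TS[T] < RT[X]:
        return "rollback"                  # a later transaction already read X
    if TS[T] < WT[X]:                      # a later value is already in place
        return "skip" if C[X] else "delay" # Thomas write rule / wait for writer
    WT[X] = TS[T]                          # perform the write
    C[X] = False                           # tentative until T commits
    return "write"


TS = {"T1": 200, "T2": 150}
RT, WT, C = {"A": 0}, {"A": 0}, {"A": True}
print(handle_read("T1", "A", TS, RT, WT, C))   # 'grant'; RT[A] becomes 200
print(handle_write("T2", "A", TS, RT, WT, C))  # 'rollback': TS(T2)=150 < RT(A)=200
```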

Page 342: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Rules for Timestamps-Based scheduling (Cont.)

Scheduler receives a request to commit T. It must find all the database elements X written by T and set C(X) := true. If any transactions are waiting for X to be committed, these transactions are allowed to proceed.

Scheduler receives a request to abort T or decides to rollback T, then any transaction that was waiting on an element X that T wrote must repeat its attempt to read or write.

Page 343: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Three transactions executing under a timestamp-based scheduler

T1 (TS=200)     T2 (TS=150)     T3 (TS=175)     A             B             C
                                                RT=0, WT=0    RT=0, WT=0    RT=0, WT=0
r1(B);                                                        RT=200
                r2(A);                          RT=150
                                r3(C);                                      RT=175
w1(B);                                                        WT=200
w1(A);                                          WT=200
                w2(C);
                Abort (TS 150 < RT(C) = 175)
                                w3(A); (write skipped: TS 175 < WT(A) = 200 and C(A) is true)

Page 344: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Timestamps and Locking

Generally, timestamping performs better than locking in situations where:
  most transactions are read-only, or
  it is rare that concurrent transactions try to read and write the same element.

In high-conflict situations, locking performs better than timestamps.

The argument for this rule of thumb:
  Locking frequently delays transactions while they wait for locks.
  But if concurrent transactions frequently read and write elements in common, then rollbacks will be frequent under a timestamp scheduler, introducing even more delay than a locking system.

Page 345: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Thank You

?

Page 346: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Concurrency Control by Validation
Anusha Damodaran

ID : 130

CS 257 : Database System Principles

Section 18.9

Page 347: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

At a Glance

• What is Validation?
• Architecture of a Validation-Based Scheduler
• Validation Rules
• Comparison between Concurrency Control Mechanisms

Page 348: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Validation (p1)

• Another type of optimistic concurrency control.
• Allows transactions to access data without locks.
• The validation scheduler keeps a record of what active transactions are doing.
• A transaction goes through a validation phase before it starts to write values of database elements.
• If there would be physically unrealizable behavior, the transaction is rolled back.

Page 349: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

18.9.1 Architecture of Validation based Scheduler (p1)

The scheduler must be told, for each transaction T:
– Read set RS(T): the database elements T reads.
– Write set WS(T): the database elements T writes.

The three phases of the validation scheduler:
• Read
– The transaction reads from the database all elements in its read set.
– It also computes, in its local address space, all the results it is going to write.
• Validate
– The scheduler validates the transaction by comparing its read and write sets with those of other transactions.
– If validation fails, the transaction is rolled back; otherwise it proceeds to the write phase.
• Write
– The transaction writes to the database its values for the elements in its write set.

Page 350: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Validation based Scheduler

– The scheduler works with an assumed serial order of the transactions.

– It maintains three sets:
• START: transactions that have started but not yet completed validation.
  – For each such T, the scheduler records START(T), the time at which T started.
• VAL: transactions that have validated but not yet finished the writing of phase 3.
  – For each such T, the scheduler records START(T) and VAL(T), the time at which T validated.
• FIN: transactions that have completed phase 3.
  – For each such T, the scheduler records START(T), VAL(T), and FIN(T), the time at which T finished.

Page 351: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

18.9.2 Validation Rules

– Case 1:
• U is in VAL or FIN; that is, U has validated.
• FIN(U) > START(T); that is, U did not finish before T started.
• RS(T) ∩ WS(U) is not empty (say it contains database element X).
• Since we don't know whether or not T got to read U's value of X, we must roll back T to avoid the risk that the actions of T and U are not consistent with the assumed serial order.

[Timeline showing: U starts, T starts, U validated, U writes X, T reads X, T validating.]

Page 352: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

18.9.2 Validation Rules

– Case 2:
• U is in VAL; i.e., U has successfully validated.
• FIN(U) > VAL(T); i.e., U did not finish before T entered its validation phase.
• WS(T) ∩ WS(U) is not empty (say database element X is in both write sets).
• T and U must both write values of X, and if we let T validate, it is possible that it will write X before U does. Since we cannot be sure, we roll back T to make sure it does not violate the assumed serial order, in which it follows U.

[Timeline showing: U validated, T validating, U writes X, T writes X, U finishes.]

Page 353: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Rules for Validating a transaction T

• Check that RS(T) ∩ WS(U) = ∅ for any previously validated U that did not finish before T started, i.e., for which FIN(U) > START(T).

• Check that WS(T) ∩ WS(U) = ∅ for any previously validated U that did not finish before T validated, i.e., for which FIN(U) > VAL(T).

1. RS(T) ∩ WS(U) = ∅ if FIN(U) > START(T)
2. WS(T) ∩ WS(U) = ∅ if FIN(U) > VAL(T)

(A sketch of these checks in code follows.)
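The two checks translate directly into set operations. The sketch below assumes each transaction is represented as a dictionary carrying its RS, WS, START, VAL, and FIN values; these field names are illustrative, not the book's notation for code.

```python
def validate(T, validated):
    """Apply the two validation checks for T against previously validated U's.

    Each transaction is a dict with sets 'RS' and 'WS' and times 'START',
    'VAL', 'FIN' (FIN is None if the transaction has not finished writing).
    Returns True if T may proceed to its write phase.
    """
    for U in validated:
        not_done_before_T_started = U["FIN"] is None or U["FIN"] > T["START"]
        if not_done_before_T_started and (T["RS"] & U["WS"]):
            return False          # rule 1: T may have missed U's writes
        not_done_before_T_validated = U["FIN"] is None or U["FIN"] > T["VAL"]
        if not_done_before_T_validated and (T["WS"] & U["WS"]):
            return False          # rule 2: T might write X before U does
    return True


T = {"RS": {"A", "B"}, "WS": {"A", "C"}, "START": 1, "VAL": 3, "FIN": None}
W = {"RS": {"A", "D"}, "WS": {"A", "C"}, "START": 4, "VAL": 6, "FIN": None}
print(validate(W, [T]))   # False: RS(W) ∩ WS(T) = {A} and T had not finished
```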

Page 354: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Example 18.2.9
• Four transactions T, U, V, W attempt to execute and validate:

T: RS = {A,B}   WS = {A,C}
U: RS = {B}     WS = {D}
W: RS = {A,D}   WS = {A,C}
V: RS = {B}     WS = {D,E}

[Figure: timeline of each transaction's read, validate, and write phases; the validation order is U, then T, then V, then W.]

Page 355: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Example 18.2.9
• Validation of U [RS = {B}; WS = {D}]
– Nothing to check; U reads {B}, validates successfully, and writes {D}.

• Validation of T [RS = {A,B}; WS = {A,C}]
– FIN(U) > START(T): RS(T) ∩ WS(U) must be empty; {A,B} ∩ {D} = ∅
– FIN(U) > VAL(T): WS(T) ∩ WS(U) must be empty; {A,C} ∩ {D} = ∅

• Validation of V [RS = {B}; WS = {D,E}]
– FIN(T) > START(V): RS(V) ∩ WS(T) must be empty; {B} ∩ {A,C} = ∅
– FIN(T) > VAL(V): WS(V) ∩ WS(T) must be empty; {D,E} ∩ {A,C} = ∅
– FIN(U) > START(V): RS(V) ∩ WS(U) must be empty; {B} ∩ {D} = ∅

• Validation of W [RS = {A,D}; WS = {A,C}]
– FIN(T) > START(W): RS(W) ∩ WS(T) must be empty; {A,D} ∩ {A,C} = {A}
– FIN(V) > START(W): RS(W) ∩ WS(V) must be empty; {A,D} ∩ {D,E} = {D}
– FIN(V) > VAL(W): WS(W) ∩ WS(V) must be empty; {A,C} ∩ {D,E} = ∅
– W does not validate; it is rolled back and therefore does not write A and C.

Page 356: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

18.9.3 Comparison between Concurrency Control Mechanisms – Storage Utilization

Mechanism      Storage utilization
Locks          Space in the lock table is proportional to the number of database elements locked.
Timestamps     Space is needed for read- and write-times for every database element, whether or not it is currently accessed.
Validation     Space is used for timestamps and read/write sets for each currently active transaction, plus a few transactions that finished after some currently active transaction began.

• Timestamping and validation may use slightly more space than locking.
• A potential problem with validation is that the write set of a transaction must be known before the writes occur.

Page 357: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

18.9.3 Comparison between Concurrency Mechanisms - Delay

• The performance of the three methods depends on whether interaction among transactions is high or low (interaction: the likelihood that a transaction will access an element that is also being accessed by a concurrent transaction).
– Locking delays transactions but avoids rollbacks, even when interaction is high.
– Timestamps and validation do not delay transactions, but they can cause rollbacks, which are a more serious form of delay and also waste resources.

• If interaction is low, then neither timestamps nor validation will cause many rollbacks, and either is preferable to locking.

Page 358: Changes 13.1 – Added content to slide #8 13.2 – Added extra slide 13.4 – Added content to slide #9 13.8 – Added content to slide #7 15.1 – Added content.

Questions ?

Thank you