The Memory Hierarchy. Computer systems have several different
components in which data may be stored. Data capacities and
access speeds range over at least seven orders of magnitude. Devices
with the smallest capacity also offer the fastest access speed.
Slide 4
Description of Levels 1. Cache: a megabyte or more of cache
storage. On-board cache is on the same chip as the processor; level-2
cache is on another chip. Cache data is accessed in a few nanoseconds.
Data is moved from main memory to the cache when needed by the
processor. Volatile.
Slide 5
Description of Levels 2. Main Memory: 1 GB or more of main
memory. Instruction execution and data manipulation involve
information resident in main memory. The time to move data from main
memory to the processor or cache is in the 10-100 nanosecond range.
Volatile. 3. Secondary Storage: typically a magnetic disk, with
capacity up to 1 TB. One machine can have several disk units. The time
to transfer a single byte between disk and main memory is around 10
milliseconds.
Slide 6
Description of Levels 4. Tertiary Storage: holds data volumes
measured in terabytes, with significantly higher read/write times
but a lower cost per byte. Retrieval takes seconds or minutes, and
capacities in the petabyte range are possible.
Slide 7
Transfer of Data Between Levels. Data moves between adjacent
levels of the hierarchy. Each level is organized to transfer large
amounts of data to or from the level below. A key technique for
speeding up database operations is to arrange data so that when one
piece of a disk block is needed, it is likely that other data on
the same block will also be needed at about the same time.
Slide 8
Volatile & Non-Volatile Storage. A volatile device forgets
what is stored in it when the power goes off (example: main memory).
A nonvolatile device, on the other hand, is expected to keep its
contents intact even for long periods when the device is turned off
or there is a power failure (examples: secondary and tertiary
storage). Note: no change to the database can be considered final
until it has migrated to nonvolatile, secondary storage.
Slide 9
Virtual Memory. Managed by the operating system. Some data is kept
in main memory and the rest on disk; transfer between the two is in
units of disk blocks (pages). Virtual memory is not a level of the
memory hierarchy.
Slide 10
CS-257 Database System Principles Avinash Anantharamu 102
Slide 11
Index 13.2 Disks 13.2.1 Mechanics of Disks 13.2.2 The Disk
Controller 13.2.3 Disk Access Characteristics
Slide 12
Disks: The use of secondary storage is one of the important
characteristics of a DBMS, and secondary storage is almost
exclusively based on magnetic disks.
Slide 13
Structure of a Disk
Slide 14
Data in Disk 0s and 1s are represented by different patterns in
the magnetic material. A common diameter for the disk platters is
3.5 inches.
Slide 15
Mechanics of Disks. The two principal moving pieces of a hard drive
are (1) the head assembly and (2) the disk assembly. The disk
assembly has one or more circular platters that rotate around a
central spindle; the platters are covered with a thin layer of
magnetic material.
Slide 16
Top View of Disk Surface
Slide 17
Mechanics of Disks. Tracks are concentric circles on a platter.
Tracks are organized into sectors, which are segments of the circle
of a track. Sectors are indivisible as far as errors are concerned.
Blocks are logical units of data transfer.
Slide 18
Disk Controller: controls the actuator that moves the head assembly,
selects the surface from which to read or write, and transfers bits
from the desired sector to main memory.
Slide 19
Simple Single Processor Computer
Slide 20
Disk Access characteristics Seek time: The disk controller
positions the head assembly at the cylinder containing the track on
which the block is located. The time to do so is the seek time.
Rotational latency: The disk controller waits while the first
sector of the block moves under the head. This time is called the
rotational latency.
Slide 21
Disk Access Characteristics. Transfer time: all the sectors of the
block and the gaps between them pass under the head while the disk
controller reads or writes data in these sectors; this delay is
called the transfer time. Latency of the disk: the sum of the seek
time, rotational latency, and transfer time is the latency of the
disk.
Slide 22
13.3 Accelerating Access to Secondary Storage San Jose State
University Spring 2012
Slide 23
13.3 Accelerating Access to Secondary Storage Section Overview
13.3.1: The I/O Model of Computation 13.3.2: Organizing Data by
Cylinders 13.3.3: Using Multiple Disks 13.3.4: Mirroring Disks
13.3.5: Disk Scheduling and the Elevator Algorithm 13.3.6:
Prefetching and Large-Scale Buffering
Slide 24
13.3 Introduction. Average block access time is ~10 ms. Disks may be
busy, and requests may arrive faster than they can be serviced,
leading to unbounded scheduling latency. There are various strategies
to increase disk throughput. The I/O model is the right model for
determining the speed of database operations.
Slide 25
13.3 Introduction (Contd.) Actions that improve database access
speed: place blocks close together, within the same cylinder;
increase the number of disks; mirror disks; use an improved
disk-scheduling algorithm; use prefetching.
Slide 26
13.3.1 The I/O Model of Computation. Suppose we have a computer
running a DBMS that is trying to serve a number of users, has 1
processor, 1 disk controller, and 1 disk, and each user is accessing
different parts of the DB. It can then be assumed that the time
required for disk access is much larger than the time to access main
memory; as a result, the number of block accesses is a good
approximation of the time required by a DB algorithm.
Slide 27
13.3.2 Organizing Data by Cylinders It is more efficient to
store data that might be accessed together in the same or adjacent
cylinder(s). In a relational database, related data should be
stored in the same cylinder.
Slide 28
13.3.3 Using Multiple Disks. If the disk controller supports the
addition of multiple disks and has efficient scheduling, using
multiple disks can improve performance significantly. By striping a
relation across multiple disks, each chunk of data can be retrieved
in parallel, improving performance by up to a factor of n, where n
is the total number of disks the data is striped over.
Slide 29
13.3.4 Mirroring Disks. A drawback of striping data across multiple
disks is that it increases the chance of disk failure. To mitigate
this risk, some DBMSs use a disk-mirroring configuration. Disk
mirroring makes each disk a copy of the other disks, so that if any
disk fails, the data is not lost. Since all the data is in multiple
places, access speed can be increased by more than a factor of n,
since the disk with the head closest to the requested block can be
chosen.
Slide 30
            Advantages                            Disadvantages
Striping    Read/write speedup ~n;                Higher risk of failure
            capacity increased by ~n
Mirroring   Read speedup ~n; reduced failure      High cost per bit;
            risk; fast initial access             slow writes compared to striping
Slide 31
13.3.5 Disk Scheduling. One way to improve disk throughput is to
improve disk scheduling, prioritizing requests so that they are
served more efficiently. The elevator algorithm is a simple yet
effective disk-scheduling algorithm: it makes the heads of a disk
oscillate back and forth, similar to how an elevator goes up and
down. The access request closest to the head's current position, in
the direction of travel, is processed first.
Slide 32
13.3.5 Disk Scheduling. When sweeping outward, the direction of head
movement changes only after the largest cylinder request has been
processed. When sweeping inward, the direction of head movement
changes only after the smallest cylinder request has been processed.
Example:

Cylinder   Time Requested (ms)      Cylinder   Time Completed (ms)
 8000         0                      8000          4.3
24000         0                     24000         13.6
56000         0                     56000         26.9
16000        10                     64000         34.2
64000        20                     40000         45.5
40000        30                     16000         56.8
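The sweep behavior above can be sketched in a short simulation. This is
illustrative only: the seek-time model (1 ms plus 1 ms per 2000 cylinders
traveled) is an assumption chosen for the toy numbers, not a figure from the
slides; with the six requests from the example it reproduces the service
order shown in the table.

```python
# Sketch of the elevator disk-scheduling algorithm. The seek-time model
# below (1 ms + distance/2000 ms) is an assumed illustration.
def elevator(requests, start=0):
    """requests: list of (arrival_ms, cylinder); returns service order."""
    pending = sorted(requests)           # sorted by arrival time
    time, head, direction = 0.0, start, +1
    order = []
    while pending:
        arrived = [(a, c) for (a, c) in pending if a <= time]
        if not arrived:                  # idle until the next request arrives
            time = pending[0][0]
            continue
        # requests lying in the current direction of head movement
        ahead = [(a, c) for (a, c) in arrived if (c - head) * direction >= 0]
        if not ahead:                    # none ahead: reverse direction
            direction = -direction
            continue
        a, c = min(ahead, key=lambda ac: abs(ac[1] - head))  # closest ahead
        time += 1 + abs(c - head) / 2000.0     # assumed seek-time model
        head = c
        order.append(c)
        pending.remove((a, c))
    return order

reqs = [(0, 8000), (0, 24000), (0, 56000), (10, 16000), (20, 64000), (30, 40000)]
print(elevator(reqs))   # same service order as the example table
```

Note that the completion times depend on the seek-time model, but the
service order matches the table: 8000, 24000, 56000, 64000, 40000, 16000.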
Slide 33
13.3.6 Prefetching and Large-Scale Buffering. In some cases we can
anticipate what data will be needed. We can take advantage of this by
prefetching data from disk before the DBMS requests it. Since the
data is then already in memory, the DBMS receives it immediately.
Slide 34
Chapter 13.4 Presented by Timothy Chen Spring 2013
Slide 35
Index 13.4 Disk Failures 13.4.1 Intermittent Failures 13.4.2
Checksums 13.4.3 Stable Storage 13.4.4 Error-Handling Capabilities
of Stable Storage 13.4.5 Recovery from Disk Crashes 13.4.6 Mirroring
as a Redundancy Technique 13.4.7 Parity Blocks 13.4.8 An Improvement:
RAID 5 13.4.9 Coping with Multiple Disk Crashes
Slide 36
Intermittent Failures. An intermittent failure occurs when we try to
read a sector but the correct content of that sector is not delivered
to the disk controller. The controller can tell good reads from bad,
so a bad read can simply be retried. A write can be verified by
reading the sector back and checking that the write was correct.
Slide 37
Checksum: extra information stored with each sector that lets a read
operation determine the sector's good or bad status.
Slide 38
How checksums work. Each sector has some additional bits, set
depending on the values of the data bits stored in that sector. If
the data bits do not agree with the checksum bits on a read, we know
there was an error. The simplest checksum is a single parity bit: a
bit string with an odd number of 1s (e.g., 01101000) has odd parity;
one with an even number of 1s (e.g., 111011100) has even parity. With
the parity bit chosen so every sector has even parity, any single-bit
error is detected.
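A one-bit parity checksum can be sketched in a few lines (a minimal
illustration; real disks use longer checksums with many parity bits or CRCs):

```python
# One-bit parity checksum: the parity bit makes the number of 1s even,
# so any single-bit error in the stored sector is detected.
def parity_bit(bits):
    """Return the bit that makes the total number of 1s even."""
    return sum(bits) % 2

def check(bits_with_parity):
    """A sector is good iff its bits (data + parity) have even parity."""
    return sum(bits_with_parity) % 2 == 0

data = [0, 1, 1, 0, 1, 0, 0, 0]        # 01101000: three 1s, odd parity
stored = data + [parity_bit(data)]      # parity bit 1 -> even overall
assert check(stored)
corrupted = stored[:]
corrupted[2] ^= 1                       # flip one bit: error is detected
assert not check(corrupted)
```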
Slide 39
Stable Storage. Stable storage deals with disk errors that checksums
can detect but not correct. Sectors are paired, and each pair
represents one sector-content X, with left and right copies XL and
XR. A read checks the checksum of XL and, if it is bad, falls back to
XR, retrying until a good value is returned; a write updates XL,
verifies it, and then updates XR.
Slide 40
Error-Handling Capabilities of Stable Storage. Since X has copies XL
and XR, if one of them fails we can still read the other; the chance
that both fail is very small. A write failure, which typically
happens during a power outage, can also be repaired: the surviving
good copy is used to restore the other.
Slide 41
Recovery from Disk Crashes. The most serious mode of failure for
disks is a head crash, in which data is permanently destroyed. To
recover from such a crash, we use RAID methods.
Slide 42
Mirroring as a Redundancy Technique. This is called RAID 1: simply
mirror each disk.
Slide 43
RAID 1 (diagram)
Slide 44
Parity Blocks. This technique is often called RAID 4: one redundant
disk holds, in each bit position, the modulo-2 sum of the
corresponding bits of all the data disks. Example: disk 1: 11110000,
disk 2: 10101010, disk 3: 00111000. The redundant disk holds, for
each column, 0 if the column has an even number of 1s and 1 if it has
an odd number: disk 4: 01100010.
Slide 45
RAID 4 (diagram)
Slide 46
Parity Blocks: Failure Recovery. The parity scheme can recover from
only one disk failure. If two or more disks fail, the data cannot be
recovered using the modulo-2 sum.
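The parity computation and single-disk recovery can be demonstrated with the
slide's example bit strings (a sketch; a real controller operates on whole
blocks, not 8-bit strings):

```python
# RAID 4 parity: the redundant disk is the bitwise modulo-2 sum (XOR) of
# the data disks; any single failed disk is recovered by XOR-ing all the
# surviving disks, since parity = d1 ^ d2 ^ d3.
from functools import reduce

def xor_disks(disks):
    return [reduce(lambda a, b: a ^ b, col) for col in zip(*disks)]

d1 = [int(b) for b in "11110000"]
d2 = [int(b) for b in "10101010"]
d3 = [int(b) for b in "00111000"]
parity = xor_disks([d1, d2, d3])
assert "".join(map(str, parity)) == "01100010"   # matches the slide

# Recover disk 2 after a crash: XOR the remaining data disks with parity.
recovered = xor_disks([d1, d3, parity])
assert recovered == d2
```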
Slide 47
An Improvement: RAID 5
Slide 48
Coping with Multiple Disk Crashes. If more than one disk fails,
neither RAID 4 nor RAID 5 can recover the data. For that we need
RAID 6, which requires at least 2 redundant disks.
Slide 49
Raid 6
Slide 50
Secondary Storage Management 13.5 Arranging data on disk
Mangesh Dahale ID-105 CS 257
Slide 51
Outline Fixed-Length Records Example of Fixed-Length Records
Packing Fixed-Length Records into Blocks Example of Packing
Fixed-Length Records into Blocks Details of Block header
Slide 52
Arranging Data on Disk A data element such as a tuple or object
is represented by a record, which consists of consecutive bytes in
some disk block.
Slide 53
Fixed-Length Records. The simplest records consist of fixed-length
fields. The record begins with a header, a fixed-length region where
information about the record itself is kept. The fixed-length record
header contains: 1. A pointer to the record's schema. 2. The length
of the record. 3. A timestamp indicating when the record was
created.
Slide 54
Example CREATE TABLE employee( name CHAR(30) PRIMARY KEY,
address VARCHAR(255), gender CHAR(1), birthdate DATE );
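As a rough illustration of how such a tuple could be laid out as a
fixed-length record, the sketch below uses Python's struct module. The
4-byte schema pointer, length, and timestamp fields are assumed sizes chosen
for the example, and VARCHAR(255) is given its maximum space, as a
fixed-length layout requires:

```python
# Sketch of a fixed-length record for the employee table above.
# Header: schema pointer (4 bytes), record length (4), timestamp (4);
# body: name CHAR(30), address padded to 256, gender CHAR(1), date (10).
# All field sizes are illustrative assumptions.
import struct
import time

FMT = "<III30s256s1s10s"

def make_record(name, address, gender, birthdate, schema_ptr=0):
    length = struct.calcsize(FMT)        # every record has the same length
    return struct.pack(FMT, schema_ptr, length, int(time.time()),
                       name.encode(), address.encode(),
                       gender.encode(), birthdate.encode())

rec = make_record("Alice", "12 Main St", "F", "1990-01-01")
assert len(rec) == struct.calcsize(FMT)  # 12-byte header + 297-byte body
```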
Slide 55
Packing Fixed-Length Records into Blocks. Records are stored in
blocks of the disk and moved into main memory when we need to access
or update them. A block header is written first, and it is followed
by a series of records.
Slide 56
Example. Along with the header, we can pack as many records as will
fit in one block, as shown in the figure; the remaining space is
unused.
Slide 57
The block header contains the following information: links to one or
more other blocks that are part of a network of blocks; information
about the role played by this block in such a network; information
about which relation the tuples of this block belong to; a directory
giving the offset of each record in the block; and timestamp(s)
indicating the time of the block's last modification and/or
access.
Slide 58
Chapter 13.7 Ashwin Kalbhor Class ID : 107
Slide 59
Agenda Records with Variable Length Fields Records with
Repeating Fields Variable Format Records Records that do not fit in
a block
Slide 60
Example of a record: | name | address | gender | birth date |, with
field offsets 0, 30, 286, and 287, and total record length 297
bytes.
Slide 61
Records with Variable-Length Fields. A simple and effective way to
represent variable-length records is as follows: 1. Fixed-length
fields are kept ahead of the variable-length fields. 2. A header
is put in front of the record. 3. The record header contains the
length of the record and pointers to the beginnings of all
variable-length fields except the first one.
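A minimal sketch of this layout (the field names and the 2-byte header
fields are illustrative assumptions):

```python
# Variable-length record: fixed-length fields (gender, birthdate) come
# first, then the variable-length fields (name, address). The header
# stores the record length and an offset to address, the only
# variable-length field after the first.
import struct

def encode(name, address, gender, birthdate):
    name_b, addr_b = name.encode(), address.encode()
    fixed = gender.encode() + birthdate.encode()   # fixed-length part
    rec_len = 4 + len(fixed) + len(name_b) + len(addr_b)
    addr_off = 4 + len(fixed) + len(name_b)        # pointer to 2nd var field
    return struct.pack("<HH", rec_len, addr_off) + fixed + name_b + addr_b

def decode(rec):
    rec_len, addr_off = struct.unpack_from("<HH", rec)
    gender = rec[4:5].decode()
    birthdate = rec[5:15].decode()
    name = rec[15:addr_off].decode()               # runs up to the pointer
    address = rec[addr_off:rec_len].decode()
    return name, address, gender, birthdate

r = encode("Clint Eastwood", "Carmel, CA", "M", "1930-05-31")
assert decode(r) == ("Clint Eastwood", "Carmel, CA", "M", "1930-05-31")
```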
Slide 62
Example: a record with name and address as variable-length fields.
The header holds the record length and a pointer to the start of
address; then come the fixed-length fields birth date and gender,
followed by name and address.
Slide 63
Records with Repeating Fields. A repeating field is a field that
occurs a variable number of times, each occurrence having the same
length L. All occurrences of field F are grouped together, and a
pointer to the first occurrence of F is put in the header. Given the
length L, the starting offset of any occurrence of the repeating
field can be computed.
Slide 64
Example of a record with repeating fields: a movie-star record with
movies as the repeating field. The header holds other header
information, the record length, a pointer to address, and a pointer
to the movie pointers; the body holds name, address, and the pointers
to movies.
Slide 65
An alternative representation keeps the record itself at a fixed
length and stores the variable-length fields on a separate block. The
record itself keeps track of: 1. pointers to the place where each
repeating field begins, and 2. either how many repetitions there are,
or where the repetitions end.
Slide 66
Storing variable length fields separately from the record.
Slide 67
Variable-Format Records. These are records that do not have a fixed
schema. They are represented by a sequence of tagged fields, each of
which consists of: the attribute or field name, the type of the
field, the length of the field, and the value of the field.
Slide 68
Variable-Format Records example: N S 14 Clint Eastwood R S 16 Hog's
Breath Inn, where N is the code for name, R the code for restaurant
owned, S the code for the string type, and the numbers give each
field's length.
Slide 69
Records That Do Not Fit in a Block. When the length of a record is
greater than the block size, the record is divided and placed into
two or more blocks. The portion of the record in each block is
referred to as a record fragment. A record with two or more fragments
is called a spanned record; a record that does not cross a block
boundary is called an unspanned record.
Slide 70
Spanned Records. Spanned records require the following extra header
information: a bit indicating whether the piece is a fragment; bits
indicating whether it is the first or the last fragment of its
record; and pointers to the next and/or previous fragment of the same
record.
Slide 71
Spanned Records (diagram): block 1 holds record 1 and record 2-a;
block 2 holds record 2-b and record 3; record 2 spans the block
boundary. Each block has a block header, and each record a record
header.
Slide 72
CS257 Lok Kei Leong ( 108 )
Slide 73
Outline Record Insertion Record Deletion Record Update
Slide 74
Insertion. New records may be inserted into a relation whose records
are kept in no particular order, or into one whose records are kept
in a fixed order (e.g., sorted by primary key). A pointer to a record
from outside the block is a structured address. Diagram: a block
containing a header, an offset table, unused space, and Records 4, 3,
2, 1.
Slide 75
What If the Block Is Full? If we need to insert a record into a
particular block but the block is full, we must find room outside the
block. There are two solutions: I. find space on a nearby block, or
II. create an overflow block.
Slide 76
Insertion (solution 1): find space on a nearby block. If block B1 has
no space and space is available on block B2, move records of B1 to
B2. If there are external pointers to the records of B1 that moved to
B2, leave a forwarding address in the offset table of B1.
Slide 77
Insertion (solution 2): create an overflow block. Each block B has in
its header a pointer to an overflow block where additional records of
B can be placed.
Slide 78
Deletion. After deleting a record, we can slide the remaining records
around the block to reclaim the space. If records cannot slide,
maintain an available-space list in the block header to keep track of
the space available. Pointers to a deleted record must not dangle or
wind up pointing to a new record placed in its spot.
Slide 79
Tombstone. What about pointers to deleted records? A tombstone is
placed in place of each deleted record: a bit placed at the first
byte of the deleted record to indicate that the record was deleted
(0 = not deleted, 1 = deleted). A tombstone is permanent.
Slide 80
Update. For fixed-length records, an update has no effect on the
storage system. For variable-length records, updates raise the same
issues as insertion and deletion, except that we never need to create
a tombstone for the old record. If the updated record is longer, we
must create more space on its block, by sliding records or by
creating an overflow block.
Slide 81
Slide 82
Sweta Shah CS257: Database Systems ID: 118
Slide 83
Agenda Query Processor Query compilation Physical Query Plan
Operators Scanning Tables Table Scan Index scan Sorting while
scanning tables Model of computation for physical operators
Parameters for measuring cost Iterators
Slide 84
Query Processor. The query processor is the group of components of a
DBMS that turns user queries and data-modification commands into a
sequence of database operations and executes those operations. The
query processor is responsible for supplying the details of how the
query is to be executed.
Slide 85
The major parts of the query processor
Slide 86
Query compilation. Query compilation itself is a multi-step process
consisting of: Parsing, in which a parse tree representing the query
and its structure is constructed; Query rewrite, in which the parse
tree is converted to an initial query plan; and Physical plan
generation, where the logical query plan is turned into a physical
query plan by selecting algorithms.
Slide 87
Outline of query compilation
Slide 88
Physical Query Plan Operators. Physical query plans are built from
operators, each of which implements one step of the plan. They are
particular implementations of the operators of relational algebra, or
non-relational-algebra operators such as scan, which scans
tables.
Slide 89
Scanning Tables. One of the most basic operations in a physical
query plan. It is necessary, for example, when we want to compute the
join or union of a relation with another relation.
Slide 90
Two basic approaches to locating the tuples of a relation R:
Table-scan: relation R is stored in secondary memory with its tuples
arranged in blocks, and it is possible to get the blocks one by one.
This operation is called a table scan.
Slide 91
Two basic approaches to locating the tuples of a relation R:
Index-scan: if there is an index on some attribute of relation R, we
can use this index to get all the tuples of R. This operation is
called an index scan.
Slide 92
Sorting While Scanning Tables. Why do we need sorting while scanning?
The query could include an ORDER BY clause requiring that a relation
be sorted, and various algorithms for relational-algebra operations
require one or both of their arguments to be sorted relations.
Sort-scan takes a relation R and a specification of the attributes on
which the sort is to be made, and produces R in that sorted
order.
Slide 93
Model of Computation for Physical Operators. Choosing physical plan
operators wisely is essential for a good query processor. The cost of
an operation is measured in the number of disk I/O operations. If an
operator requires the final answer to a query to be written back to
disk, the total cost also depends on the length of the answer and
includes that final write-back.
Slide 94
Improvements in Cost. Major improvements in the cost of the physical
operators can be achieved by avoiding or reducing the number of disk
I/O operations. This can be done by passing the answer of one
operator to the next in main memory, without writing it to
disk.
Slide 95
Parameters for Measuring Costs. Parameters that affect the
performance of a query: the buffer space available in main memory at
the time the query executes; the size of the input and the size of
the output generated; and the size of a disk block and the amount of
main memory available.
Slide 96
Iterators for Implementation of Physical Operators. Many physical
operators can be implemented as an iterator: a group of three
functions that allows a consumer of the result of the physical
operator to get the result one tuple at a time.
Slide 97
Iterator The three functions forming the iterator are: Open:
This function starts the process of getting tuples. It initializes
any data structures needed to perform the operation
Slide 98
Iterator. GetNext: this function returns the next tuple in the
result, adjusting data structures as necessary to allow subsequent
tuples to be obtained. If there are no more tuples to return, GetNext
returns the special value NotFound.
Slide 99
Iterator. Close: this function ends the iteration after all tuples
have been obtained, or after the consumer no longer wants them; it
calls Close on any arguments of the operator.
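The three-function interface can be sketched as a table-scan iterator (a toy
in-memory version; a real implementation would fetch disk blocks inside Open
and GetNext):

```python
# Minimal Open/GetNext/Close iterator for a table scan.
# NotFound is modeled as a sentinel object.
NotFound = object()

class TableScan:
    def __init__(self, tuples):
        self.tuples = tuples

    def Open(self):
        self.pos = 0                 # initialize the scan's data structures

    def GetNext(self):
        if self.pos >= len(self.tuples):
            return NotFound          # no more tuples
        t = self.tuples[self.pos]
        self.pos += 1                # adjust state for subsequent calls
        return t

    def Close(self):
        self.pos = None              # release any resources

scan = TableScan([(1, "a"), (2, "b")])
scan.Open()
out = []
while (t := scan.GetNext()) is not NotFound:
    out.append(t)
scan.Close()
assert out == [(1, "a"), (2, "b")]
```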
Slide 100
One-Pass Algorithms for Database Operations. Chetan Sharma
008565661.
We can divide algorithms for operators into three degrees of
difficulty and cost: 1) One-pass: methods that require at least one
of the arguments to fit in main memory. 2) Two-pass: methods that
work for data too large to fit in available main memory, but not for
the largest imaginable data sets. 3) Multipass: methods that work
without a limit on the size of the data; these are recursive
generalizations of the two-pass algorithms.
Slide 103
One-Pass Algorithms. These read the data only once from disk.
Usually, they require at least one of the arguments to fit in main
memory.
Slide 104
Tuple-at-a-Time Operations. These operations do not require an entire
relation, or even a large part of it, in memory at once. Thus, we can
read a block at a time, use one main-memory buffer, and produce our
output. Examples: selection and projection.
Slide 105
Tuple-at-a-Time A selection or projection being performed on a
relation R
Slide 106
Full-relation, unary operations Now, let us consider the unary
operations that apply to relations as a whole, rather than to one
tuple at a time: a)Duplicate elimination. b)Grouping.
Slide 107
a) Duplicate elimination
Slide 108
b) Grouping. For MIN(a) and MAX(a) aggregates, record the minimum or
maximum value, respectively, of attribute a seen for any tuple in the
group so far. For COUNT aggregation, add one for each tuple of the
group that is seen. For SUM(a), add the value of attribute a to the
accumulated sum for its group. AVG(a) is the hard case: we must
maintain two accumulations, the count of the number of tuples in the
group and the sum of the a-values of these tuples.
Slide 109
b) Grouping. When all tuples of R have been read into the input
buffer and contributed to the aggregation(s) for their group, we can
produce the output by writing one tuple for each group. Note that
until the last tuple is seen, we cannot begin to create output for a
grouping operation. Thus, this algorithm does not fit the iterator
framework very well: the entire grouping has to be done by the Open
method before the first tuple can be retrieved.
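The accumulators described above can be sketched as a one-pass grouping over
(key, value) pairs:

```python
# One-pass grouping: one entry per group holding the accumulators for
# MIN, MAX, COUNT, and SUM; AVG is derived from COUNT and SUM, and the
# output can only be produced after the last tuple has been read.
def one_pass_grouping(tuples):
    """tuples: (group_key, a_value) pairs; returns per-group aggregates."""
    groups = {}
    for key, a in tuples:
        g = groups.setdefault(key, {"min": a, "max": a, "count": 0, "sum": 0})
        g["min"] = min(g["min"], a)
        g["max"] = max(g["max"], a)
        g["count"] += 1
        g["sum"] += a
    # AVG is the hard case: it needs both the COUNT and SUM accumulators.
    for g in groups.values():
        g["avg"] = g["sum"] / g["count"]
    return groups

r = one_pass_grouping([("x", 4), ("y", 10), ("x", 2), ("x", 6)])
assert r["x"] == {"min": 2, "max": 6, "count": 3, "sum": 12, "avg": 4.0}
```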
Slide 110
One-Pass Algorithms for Binary Operations. All other operations are
in this class: set and bag versions of union, intersection,
difference, joins, and products. These binary operations require
reading the smaller of the operands R and S into main memory and
building a suitable data structure, so that tuples can be both
inserted quickly and found quickly. The condition for the operation
to be performed in one pass is: min(B(R), B(S)) <= M, where M is the
number of available memory buffers.
Joining by Using an Index (Algorithm 1). Consider the natural join
R(X,Y) ⋈ S(Y,Z).
Joining by Using an Index (Algorithm 1): Analysis. Consider
R(X,Y) ⋈ S(Y,Z).
Join Using a Sorted Index. Consider R(X,Y) ⋈ S(Y,Z).
Join Using a Sorted Index (Zig-zag join). Consider R(X,Y)
⋈ S(Y,Z).
Multipass Sort-Based Algorithm. INDUCTION (B(R) > M): 1. If R does
not fit into main memory, partition the blocks holding R into M
groups, called R1, R2, ..., RM. 2. Recursively sort each Ri, for
i = 1 to M. 3. Once the sorting is done, the algorithm merges the M
sorted sublists.
Slide 195
Slide 196
Performance: Multipass Sort-Based Algorithms. 1) Each pass of a
sorting algorithm: 1. reads data from the disk, 2. sorts the data
with any main-memory sorting algorithm, and 3. writes the data back
to the disk. 2-1) A k-pass sorting algorithm therefore needs 2kB(R)
disk I/Os. 2-2) An operation on R and S using a k-pass sort-based
algorithm needs A + B disk I/Os, where A = 2(k-1)(B(R) + B(S)) is the
cost of sorting the sublists, and B = B(R) + B(S) is the cost of
reading the sorted sublists in the final pass. Total:
(2k-1)(B(R) + B(S)) disk I/Os.
Slide 197
Multipass Hash-Based Algorithms. 1. Hash the relation into M-1
buckets, where M is the number of memory buffers. 2. Unary case:
apply the operation to each bucket individually. This covers
duplicate elimination (δ) and grouping (γ): grouping computes MIN,
MAX, COUNT, SUM, and AVG over the groups, and duplicate elimination
implements DISTINCT. Basis: if the relation fits in M memory blocks,
read it into memory and perform the operation there. 3. Binary case:
apply the operation to each corresponding pair of buckets. This
covers union, intersection, difference, and join. Basis: if either
relation fits in M-1 memory blocks, read that relation into M-1
blocks of main memory, then read the other relation one block at a
time into the Mth block, performing the operation.
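The basis/induction structure can be sketched for duplicate elimination.
Toy sizes: BLOCK and M are illustrative, and Python's built-in hash (salted
with the recursion depth) stands in for the family of bucket hash
functions:

```python
# Multipass hash-based duplicate elimination: if the data fits in M
# blocks of memory, do it in one pass (basis); otherwise hash into M-1
# buckets and recurse on each bucket (induction).
BLOCK = 2          # tuples per block (tiny, for illustration)
M = 4              # number of memory buffers

def distinct(tuples, depth=0):
    if len(tuples) <= M * BLOCK:          # basis: fits in memory
        seen, out = set(), []
        for t in tuples:
            if t not in seen:
                seen.add(t)
                out.append(t)
        return out
    buckets = [[] for _ in range(M - 1)]  # induction: M-1 buckets
    for t in tuples:
        buckets[hash((t, depth)) % (M - 1)].append(t)
    out = []
    for b in buckets:                     # recurse, accumulate output
        out.extend(distinct(b, depth + 1))
    return out

data = [1, 2, 3, 1, 2, 3, 4, 5, 4, 5, 6, 7]
assert sorted(distinct(data)) == [1, 2, 3, 4, 5, 6, 7]
```

All copies of a value land in the same bucket, so no duplicates survive
across buckets at the same level.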
Slide 198
INDUCTION. If the relation (unary case) or neither relation (binary
case) fits into the main-memory buffers: 1. hash each relation into
M-1 buckets; 2. recursively perform the operation on each bucket or
corresponding pair of buckets; 3. accumulate the output from each
bucket or pair.
Slide 199
Hash-Based Algorithms: Unary Operators
Slide 200
Performance: Hash-Based Algorithms. Let R be a relation, with
operations such as γ and δ, and M the number of buffers. Define
u(M, k) as the number of blocks in the largest relation that a k-pass
hashing algorithm can handle.
Slide 201
Performance: Induction. 1. Assume that the first step divides
relation R into M-1 equal buckets. 2. The buckets for the next pass
must be small enough to handle in k-1 passes. 3. Since R is divided
into M-1 buckets, we have u(M, k) = (M-1)u(M, k-1).
Slide 202
Sort-Based vs. Hash-Based. 1. Sort-based algorithms can produce
output in sorted order, which may be helpful later, e.g., by reducing
rotational latency or seek time. 2. Hash-based algorithms depend on
the buckets being of equal size. For binary operations, hash-based
algorithms limit only the size of the smaller relation; therefore,
hash-based can be faster than sort-based when the smaller relation is
small.
Query Processing. A query flows through query compilation and then
query execution, using the metadata and the data. The query is
compiled; this involves extensive optimization using the operations
of relational algebra. It is first compiled into a logical query
plan, e.g., using expressions of relational algebra, and then
converted to a physical query plan, by steps such as selecting an
implementation for each operator and ordering joins. The query is
then executed.
Slide 206
Outline of Query Compilation. SQL query -> (Parse query) ->
expression tree -> (Select logical plan) -> logical query plan tree
-> (Select physical plan) -> physical query plan tree -> (Execute
plan). Parsing: a parse tree for the query is constructed. Query
Rewrite: the parse tree is converted to an initial query plan and
transformed into a logical query plan. Physical Plan Generation: the
logical plan is converted into a physical plan by selecting
algorithms and an order of execution.
Slide 207
Table Scanning. There are two approaches for locating the tuples of
relation R: Table-scan: get the blocks one by one. Index-scan: use an
index to lead us to all blocks holding R. Sort-scan takes a relation
R and sorting specifications, and produces R in sorted order; in SQL
this is requested with the ORDER BY clause.
Slide 208
Cost Measures. Estimates of cost are essential for query
optimization; they allow us to determine the slow and fast parts of a
query plan. Reading many consecutive blocks on a track is extremely
important, since disk I/Os are expensive in terms of time. Example:
EXPLAIN SELECT * FROM a JOIN b ON a.id = b.id;
Slide 209
Optimizing Queries: Cost Measures. EXPLAIN SELECT snp.* FROM snp JOIN
chr ON snp.chr_key = chr.chr_key WHERE snp_name ''
Slide 210
One-pass Methods. Tuple-at-a-time: selection and projection, which do
not require an entire relation in memory at once. Full-relation,
unary operations: must see all or most of the tuples in memory at
once; used for the grouping and duplicate-elimination operators,
where a hash table (O(n)) or a balanced binary search tree
(O(n log n)) speeds up duplicate detection. Full-relation, binary
operations: these include union, intersection, difference, product,
and join. Review of Algorithms
Slide 211
Nested-Loop Joins. In a sense, this is one-and-a-half passes, since
one argument has its tuples read only once, while the other is read
repeatedly. It can use relations of any size; the arguments do not
all have to fit in main memory. Two variations of nested-loop joins:
Tuple-based: the simplest form; can be very slow, since it takes
T(R)*T(S) disk I/Os if we are joining R(X,Y) with S(Y,Z).
Block-based: organizes access to both argument relations by blocks
and uses as much main memory as possible to store tuples. Review of
Algorithms
Slide 212
Two-pass Algorithms Usually enough even for large relations.
Based on Sorting: Partition the arguments into memory-sized, sorted
sublists. Sorted sublists are then merged appropriately to produce
desired results. Based on Hashing: Partition the arguments into
buckets. Useful if data is too big to store in memory. Review of
Algorithms
Slide 213
Two-pass Algorithms. Sort-based vs. hash-based: hash-based algorithms
are often superior to sort-based ones, since they require only one of
the arguments to be small. Sort-based algorithms work well when there
is a reason to keep some of the data sorted. Review of
Algorithms
Slide 214
Index-based Algorithms Index-based joins are excellent when one
of the relations is small, and the other has an index on join
attributes. Clustering and non-clustering indexes: Clustering index
has all tuples with fixed value packed into minimum number of
blocks. A clustered relation can have non-clustering indexes.
Review of Algorithms
Slide 215
Multi-pass Algorithms. The two-pass algorithms based on sorting or
hashing generalize naturally to three or more passes and will work
for larger data sets. Each pass of a sorting algorithm reads all data
from disk and writes it out again; thus, a k-pass sorting algorithm
requires 2kB(R) disk I/Os. Review of Algorithms
Slide 216
Chapter 18
Slide 217
Dona Baysa ID: 127 CS 257 Spring 2013
Slide 218
Intro Concurrency Control Scheduler Serializability Schedules
Serial and Serializable
Slide 219
Intro: Concurrency Control & Scheduler. Concurrently executing
transactions can cause an inconsistent database state. Concurrency
control assures that transactions preserve consistency. The scheduler
regulates the individual steps of different transactions: it takes
read/write requests from the transactions and either executes or
delays them.
Slide 220
Intro: Scheduler. Transaction requests are passed to the scheduler,
which determines when they execute. Flow: Transaction manager ->
(read/write requests) -> Scheduler -> (reads and writes) ->
Buffers.
Slide 221
Serializability. How do we assure that concurrently executing
transactions preserve database-state correctness? Serializability:
schedule the transactions' actions so the effect is as if the
transactions were executed one at a time.
Slide 222
Schedules. A schedule is a sequence of the important actions (reads
and writes) performed by transactions. Example transactions and
actions:

T1:  READ(A,t)   t := t+100   WRITE(A,t)   READ(B,t)   t := t+100   WRITE(B,t)
T2:  READ(A,s)   s := s*2     WRITE(A,s)   READ(B,s)   s := s*2     WRITE(B,s)
Slide 223
Serial Schedules. All actions of one transaction are followed by all
the actions of another transaction, and so on; there is no mixing of
actions. The effect depends only on the order of the transactions.
The serial schedules here: T1 precedes T2, or T2 precedes
T1.
Slide 224
Serial Schedule: Example. T1 precedes T2; notation: (T1, T2).
Consistency constraint: A = B, initially A = B = 25.

T1: READ(A,t); t := t+100; WRITE(A,t)   -> A = 125
T1: READ(B,t); t := t+100; WRITE(B,t)   -> B = 125
T2: READ(A,s); s := s*2;   WRITE(A,s)   -> A = 250
T2: READ(B,s); s := s*2;   WRITE(B,s)   -> B = 250

Final value: A = B = 250; consistency is preserved.
Slide 225
Serializable Schedules. Serial schedules preserve consistency. Are
there other schedules that also guarantee consistency? Yes:
serializable schedules. Definition: a schedule S is serializable if
there is a serial schedule S' such that, for every initial database
state, the effects of S and S' are the same.
Slide 226
Serializable Schedule: Example A serializable, but not serial, schedule: T2 acts on A after T1 does, but before T1 acts on B. The effect is the same as the serial schedule (T1, T2). Starting from A = B = 25:
T1: READ(A,t); t := t+100; WRITE(A,t)   -- A = 125
T2: READ(A,s); s := s*2;   WRITE(A,s)   -- A = 250
T1: READ(B,t); t := t+100; WRITE(B,t)   -- B = 125
T2: READ(B,s); s := s*2;   WRITE(B,s)   -- B = 250
Slide 227
Notation: Transactions and Schedules Transaction: Ti (for example T1, T2, ...). Database element: X. Actions: a read is rTi(X), abbreviated ri(X); a write is wTi(X), abbreviated wi(X). Examples:
Transactions: T1: r1(A); w1(A); r1(B); w1(B)  and  T2: r2(A); w2(A); r2(B); w2(B)
Schedule: r1(A); w1(A); r2(A); w2(A); r1(B); w1(B); r2(B); w2(B)
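This notation maps directly onto a small data representation; a minimal sketch (the tuple encoding is illustrative, not from the slides):

```python
# Encode each action as a tuple (kind, transaction_id, element),
# e.g. r1(A) -> ('r', 1, 'A') and w2(B) -> ('w', 2, 'B').

T1 = [('r', 1, 'A'), ('w', 1, 'A'), ('r', 1, 'B'), ('w', 1, 'B')]
T2 = [('r', 2, 'A'), ('w', 2, 'A'), ('r', 2, 'B'), ('w', 2, 'B')]

# The schedule from the slide interleaves T1 and T2:
schedule = [('r', 1, 'A'), ('w', 1, 'A'), ('r', 2, 'A'), ('w', 2, 'A'),
            ('r', 1, 'B'), ('w', 1, 'B'), ('r', 2, 'B'), ('w', 2, 'B')]

def is_serial(sched):
    """A schedule is serial if no transaction's actions resume after
    another transaction's actions have intervened."""
    finished = set()          # transactions whose run of actions has ended
    current = None
    for _, tid, _ in sched:
        if tid != current:
            if tid in finished:
                return False  # tid resumed after being interrupted
            if current is not None:
                finished.add(current)
            current = tid
    return True
```

Here is_serial(T1 + T2) is true, while the interleaved schedule above is not serial (though, as the following slides show, it is still serializable).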
Slide 228
Geetha Ranjini Viswanathan ID: 121
Slide 229
18.2 Conflict-Serializability 18.2.1 Conflicts 18.2.2
Precedence Graphs and a Test for Conflict-Serializability 18.2.3
Why the Precedence-Graph Test Works
Slide 230
18.2.1 Conflicts Conflict - a pair of consecutive actions in a
schedule such that, if their order is interchanged, the final state
produced by the schedule is changed.
Slide 231
18.2.1 Conflicts Non-conflicting situations (assuming Ti and Tj are different transactions, i.e., i ≠ j):
ri(X); rj(Y) never conflict, even if X = Y.
ri(X); wj(Y) do not conflict for X ≠ Y.
wi(X); rj(Y) do not conflict for X ≠ Y.
wi(X); wj(Y) do not conflict for X ≠ Y.
Slide 232
18.2.1 Conflicts Conflicting situations: three situations where actions may not be swapped:
Two actions of the same transaction always conflict: ri(X); wi(Y).
Two writes of the same database element by different transactions conflict: wi(X); wj(X).
A read and a write of the same database element by different transactions conflict: ri(X); wj(X) and wi(X); rj(X).
Slide 233
18.2.1 Conflicts Conclusions: any two actions of different transactions may be swapped unless they involve the same database element and at least one of them is a write. Schedules S and S' are conflict-equivalent if S can be transformed into S' by a sequence of non-conflicting swaps of adjacent actions. A schedule is conflict-serializable if it is conflict-equivalent to a serial schedule.
Slide 234
18.2.1 Conflicts Example 18.6: the conflict-serializable schedule S:
r1(A); w1(A); r2(A); w2(A); r1(B); w1(B); r2(B); w2(B)
is converted to the serial schedule S' = (T1, T2) through a sequence of swaps of adjacent non-conflicting actions:
r1(A); w1(A); r2(A); w2(A); r1(B); w1(B); r2(B); w2(B)
r1(A); w1(A); r2(A); r1(B); w2(A); w1(B); r2(B); w2(B)
r1(A); w1(A); r1(B); r2(A); w2(A); w1(B); r2(B); w2(B)
r1(A); w1(A); r1(B); r2(A); w1(B); w2(A); r2(B); w2(B)
S': r1(A); w1(A); r1(B); w1(B); r2(A); w2(A); r2(B); w2(B)
Slide 235
18.2.2 Precedence Graphs and a Test for Conflict-Serializability Given a schedule S involving transactions T1 and T2, T1 takes precedence over T2 (written T1 <S T2) if there is an action A1 of T1 and an action A2 of T2 such that A1 is ahead of A2 in S and A1 and A2 conflict. In the precedence graph for S, the nodes are the transactions of S, and there is an arc from Ti to Tj whenever Ti <S Tj. The test: S is conflict-serializable if and only if its precedence graph is acyclic.
18.2.3 Why the Precedence-Graph Test Works Consider a cycle involving n transactions: T1 -> T2 -> ... -> Tn -> T1. In the hypothetical serial order, the actions of T1 must precede those of T2, which precede those of T3, and so on, up to Tn. But the actions of Tn, which therefore come after those of T1, are also required to precede those of T1. This is a contradiction: the conflicting actions cannot be swapped past one another, so no equivalent serial order exists. Thus, if there is a cycle in the precedence graph, the schedule is not conflict-serializable.
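The precedence-graph test can be sketched directly in code; this is an illustration, not the textbook's algorithm, using the tuple encoding ri(X) -> ('r', i, 'A'):

```python
def conflicts(a1, a2):
    """Two actions conflict if they come from different transactions,
    touch the same element, and at least one is a write."""
    k1, t1, x1 = a1
    k2, t2, x2 = a2
    return t1 != t2 and x1 == x2 and 'w' in (k1, k2)

def precedence_graph(schedule):
    """Arc Ti -> Tj whenever an action of Ti conflicts with a
    later action of Tj in the schedule."""
    edges = set()
    for i, a1 in enumerate(schedule):
        for a2 in schedule[i + 1:]:
            if conflicts(a1, a2):
                edges.add((a1[1], a2[1]))
    return edges

def is_conflict_serializable(schedule):
    """The schedule is conflict-serializable iff the precedence graph
    is acyclic (checked here with Kahn-style node removal)."""
    nodes = {t for _, t, _ in schedule}
    edges = precedence_graph(schedule)
    while nodes:
        free = {n for n in nodes if not any(v == n for _, v in edges)}
        if not free:
            return False        # every remaining node is on a cycle
        nodes -= free
        edges = {(u, v) for u, v in edges if u in nodes and v in nodes}
    return True

S = [('r', 1, 'A'), ('w', 1, 'A'), ('r', 2, 'A'), ('w', 2, 'A'),
     ('r', 1, 'B'), ('w', 1, 'B'), ('r', 2, 'B'), ('w', 2, 'B')]
print(is_conflict_serializable(S))   # True
```

For the schedule of Example 18.6 the only arc is T1 -> T2, so the graph is acyclic and the schedule is conflict-serializable.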
Slide 240
Shailesh Padave ID 111 CS257 Spring 2013 18.3 Enforcing
Serializability by locks
Slide 241
INTRODUCTION Enforcing serializability by locks Locks Locking
scheduler Two phase locking
Slide 242
Locks A lock is maintained on a database element to prevent unserializable behavior. It works as follows: a transaction sends a request; the scheduler consults the lock table to guide its decision; the result is a serializable schedule of actions.
Slide 243
Consistency of transactions Actions and locks must relate to each other: a transaction can read or write an element only if it holds a lock on that element, and unlocking every locked element is compulsory. Legality of schedules No two transactions may hold a lock on the same element at the same time; one must release the lock before the other may acquire it.
Slide 244
Locking scheduler Grants a lock request only if the result is a legal schedule. The lock table stores information about the current locks on the elements. Notation: li(X): transaction Ti requests a lock on database element X; ui(X): transaction Ti releases its lock on database element X.
Slide 245
Locking scheduler (contd.) A legal schedule of consistent transactions that is nevertheless not serializable. Starting from A = B = 25:
T1: l1(A); r1(A); A := A+100; w1(A); u1(A)   -- A = 125
T2: l2(A); r2(A); A := A*2;   w2(A); u2(A)   -- A = 250
T2: l2(B); r2(B); B := B*2;   w2(B); u2(B)   -- B = 50
T1: l1(B); r1(B); B := B+100; w1(B); u1(B)   -- B = 150
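Replaying this interleaving (initial values A = B = 25 as in the slide; the order of the database actions is inferred from the values shown) demonstrates how the consistency constraint A = B is lost even though every lock request is legal:

```python
# Simulate the legal-but-unserializable interleaving from the slide.
# Initial state assumed from the slide: A = B = 25, constraint A = B.
db = {'A': 25, 'B': 25}

# Interleaved order of the database actions (locks elided; all were granted):
db['A'] += 100   # T1: A := A + 100  -> 125
db['A'] *= 2     # T2: A := A * 2    -> 250
db['B'] *= 2     # T2: B := B * 2    -> 50
db['B'] += 100   # T1: B := B + 100  -> 150

print(db)  # {'A': 250, 'B': 150} -- constraint A = B is violated
```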
Slide 246
Locking scheduler (contd.) The locking scheduler delays requests in order to maintain a consistent database state. Here T1 takes its lock on B before releasing A, so T2's request l2(B) is denied until T1 is done with B. Starting from A = B = 25:
T1: l1(A); r1(A); A := A+100; w1(A); l1(B); u1(A)   -- A = 125
T2: l2(A); r2(A); A := A*2; w2(A)                   -- A = 250
T2: l2(B) -- denied, since T1 holds the lock on B
T1: r1(B); B := B+100; w1(B); u1(B)                 -- B = 125
T2: l2(B); u2(A); r2(B); B := B*2; w2(B); u2(B)     -- B = 250
The final state has A = B = 250; consistency is preserved.
Slide 247
Two-phase locking (2PL) Guarantees that a legal schedule of consistent transactions is conflict-serializable. In each transaction, all lock requests precede all unlock requests. The growing phase: obtain locks; no unlocks allowed. The shrinking phase: release locks; no new locks allowed.
Slide 248
Working of Two-Phase Locking Assures serializability. Two stricter protocols for 2PL: Strict two-phase locking: a transaction holds all its write locks until commit/abort. Rigorous two-phase locking: a transaction holds all its locks until commit/abort. Two-phase transactions are serialized in the same order as their first unlocks.
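The two-phase rule itself is easy to enforce per transaction; a minimal sketch (illustrative, not a full lock manager):

```python
class TwoPhaseError(Exception):
    """Raised when a transaction requests a lock after its first unlock."""
    pass

class Transaction2PL:
    """Tracks one transaction's lock/unlock calls and rejects any
    lock request issued after the first unlock (the two-phase rule)."""
    def __init__(self):
        self.held = set()
        self.shrinking = False   # becomes True at the first unlock

    def lock(self, element):
        if self.shrinking:
            raise TwoPhaseError("lock after unlock violates 2PL")
        self.held.add(element)

    def unlock(self, element):
        self.shrinking = True    # shrinking phase begins; no more locks
        self.held.discard(element)

t = Transaction2PL()
t.lock('A'); t.lock('B')   # growing phase
t.unlock('A')              # shrinking phase begins
# t.lock('C') would now raise TwoPhaseError
```

Strict and rigorous 2PL further constrain when unlocks may happen (only at commit/abort); the sketch enforces only the basic two-phase rule.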
Slide 249
Two Phase Locking [Figure: number of locks held plotted against time, growing to a peak and then shrinking, with the peak marked "instantaneously executes now".] Every two-phase-locked transaction has a point at which it may be thought to execute instantaneously.
18. Concurrency Control 18.4. Locking Systems With Several Lock
Modes by Kiruthika Sivaraman ID: 129
Slide 252
Lock Types Shared lock (read lock): to read database element X we use a shared lock; there can be more than one shared lock on X. Exclusive lock (write lock): to write database element X we use an exclusive lock; there can be only one exclusive lock on X.
Slide 253
Notation Used sli(X): transaction Ti requests a shared lock on database element X. xli(X): transaction Ti requests an exclusive lock on database element X. ui(X): transaction Ti unlocks X.
Slide 254
Requirements Consistency of transactions: a transaction may not write without an exclusive lock, and may not read without a lock of some kind. Two-phase locking of transactions: locking must precede unlocking; xli(X) or sli(X) cannot be preceded by ui(Y) for any Y. Legality of schedules: an element may be locked exclusively by one transaction or by several in shared mode, but not both.
Slide 255
Compatibility Matrices A compatibility matrix describes the lock-management policy. It has a row and a column for each lock mode: the row corresponds to a lock already held on element X by another transaction, and the column corresponds to the mode of the lock requested on X:
           Request S   Request X
Held S        Yes          No
Held X        No           No
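The matrix reads as a simple lookup table; a sketch:

```python
# Shared/exclusive compatibility matrix from the slide:
# key = (mode already held on X, mode requested on X).
COMPAT = {
    ('S', 'S'): True,   # many readers may share X
    ('S', 'X'): False,  # a writer must wait for readers
    ('X', 'S'): False,  # readers must wait for the writer
    ('X', 'X'): False,  # only one writer at a time
}

def can_grant(held_modes, requested):
    """Grant the request only if it is compatible with every held lock."""
    return all(COMPAT[(h, requested)] for h in held_modes)

print(can_grant(['S', 'S'], 'S'))  # True
print(can_grant(['S'], 'X'))       # False
```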
Slide 256
Upgrading Locks Transaction T first takes a shared lock on X. Later, when T is ready to write, it upgrades its lock to an exclusive lock on X. ui(X) releases all locks that transaction Ti holds on X. In this way T remains friendly toward other transactions: it does not shut out readers of X until it actually needs to write.
Slide 257
Upgrading Locks - Drawback T1 first establishes a shared lock on X; T2 also establishes a shared lock on X. T1 tries to upgrade its lock to an exclusive lock; T2 tries the same. Neither upgrade can be granted while the other shared lock is held: deadlock! Compatibility matrix with update locks:
           Request S   Request X   Request U
Held S        Yes          No          Yes
Held X        No           No          No
Held U        No           No          No
Slide 258
Update Locks To avoid this deadlock, an update lock is introduced. It is similar to a shared lock, with the difference that only the transaction holding the update lock may upgrade to an exclusive lock. Once a transaction holds an update lock on X, no further locks are granted on X. The deadlock with plain shared locks:
T1: sl1(A)
T2: sl2(A)
T1: xl1(A) -- denied
T2: xl2(A) -- denied
With update locks, T1 would request an update lock on A instead, and T2's subsequent update-lock request would simply be denied, so the deadlock cannot arise.
Slide 259
Increment Lock Two transactions can hold an increment lock on the same database element at the same time. Useful when the order of writes does not matter: increments commute. Example, starting from A = 5: INC(A,2) then INC(A,10) gives 7 then 17, while INC(A,10) then INC(A,2) gives 15 then 17. Either order ends with A = 17.
Slide 260
Increment Lock Compatibility Matrix
           Request S   Request X   Request I
Held S        Yes          No          No
Held X        No           No          No
Held I        No           No          Yes
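The increment-lock matrix, and the commutativity that justifies its I/I entry, can be sketched together:

```python
# Increment-lock compatibility matrix from the slide:
# outer key = mode held on X, inner key = mode requested.
INC_COMPAT = {
    'S': {'S': True,  'X': False, 'I': False},
    'X': {'S': False, 'X': False, 'I': False},
    'I': {'S': False, 'X': False, 'I': True},  # increments commute
}

# Why two I locks may coexist: the two increments from the slide
# reach the same final value in either order.
A = 5
order1 = (A + 2) + 10   # INC(A,2) then INC(A,10)
order2 = (A + 10) + 2   # INC(A,10) then INC(A,2)
print(order1, order2)   # 17 17
```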
Slide 261
Presented By: Akash Patel ID: 113
Slide 262
Overview Overview of Locking Scheduler; Scheduler That Inserts Lock Actions; The Lock Table; Handling Locking and Unlocking Requests
Slide 263
Principles of simple scheduler architecture The transactions
themselves do not request locks, or cannot be relied upon to do so.
It is the job of the scheduler to insert lock actions into the
stream of reads, writes, and other actions that access data.
Transactions do not release locks. Rather, the scheduler releases
the locks when the transaction manager tells it that the
transaction will commit or abort.
Slide 264
Scheduler That Inserts Lock Actions into the transactions' request stream. [Figure: requests such as Read(A), Write(B), and Commit(T) arrive from the transactions at Scheduler Part I, which inserts lock actions (e.g. Lock(A); Read(A)) and passes the stream to Scheduler Part II, which consults the lock table and issues the reads and writes.]
Slide 265
The scheduler maintains a lock table, which, although it is
shown as secondary-storage data, may be partially or completely in
main memory Actions requested by a transaction are generally
transmitted through the scheduler and executed on the database.
Under some circumstances a transaction is delayed, waiting for a
lock, and its requests are not (yet) transmitted to the
database.
Slide 266
The two parts of the scheduler perform the following: Part I takes the stream of requests generated by the transactions and inserts appropriate lock actions ahead of all database-access operations, such as read, write, increment, or update. Part II takes the sequence of lock and database-access actions passed to it by Part I and executes each appropriately.
Slide 267
Part II determines the transaction T to which the action belongs and whether T is delayed. If T is not delayed: 1. a database-access action is transmitted to the database and executed; 2. if a lock action is received, Part II checks the lock table to see whether the lock can be granted: (i) if granted, the lock table is modified to include the granted lock; (ii) if not granted, the lock table is updated to record the requested lock, and Part II delays transaction T.
Slide 268
3. When a transaction T commits or aborts, Part I is notified by the transaction manager and releases all of T's locks. If any transactions are waiting for those locks, Part I notifies Part II. 4. Part II, when notified that a lock on some database element has become available, determines the next transaction or transactions that can be given the lock and allowed to continue.
Slide 269
The Lock Table A relation that associates database elements with locking information about each element. Implemented as a hash table using database elements as the hash key. Its size is proportional to the number of locked elements only, not to the size of the entire database. [Figure: database element A maps to the lock information for A.]
Slide 270
Slide 271
Lock-table entry fields:
Group mode: S means that only shared locks are held; U means that there is one update lock and perhaps one or more shared locks; X means there is one exclusive lock and no other locks.
Waiting: the waiting bit tells whether there is at least one transaction waiting for a lock on A.
A list: describes all transactions that either currently hold locks on A or are waiting for a lock on A.
Slide 272
Handling Lock Requests Suppose transaction T requests a lock on
A If there is no lock table entry for A, then there are no locks on
A, so create the entry and grant the lock request If the lock table
entry for A exists, use the group mode to guide the decision about
the lock request
Slide 273
How to deal with existing locks: 1) If the group mode is U (update) or X (exclusive), no other lock can be granted: deny the lock request by T, and place an entry on the list saying T requests a lock, with Wait? = yes. 2) If the group mode is S (shared), another shared or update lock can be granted: grant the request for an S or U lock, create an entry for T on the list with Wait? = no, and change the group mode to U if the new lock is an update lock.
Slide 274
Handling Unlock Requests Now suppose transaction T unlocks A. Delete T's entry on the list for A. If T's lock is not the same as the group mode, there is no need to change the group mode; otherwise, examine the entire list to determine the new group mode.
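The group-mode logic of the last two slides can be sketched as follows (a simplified lock table; the waiting bit and upgrade handling are reduced to their essentials):

```python
class LockTable:
    """Simplified lock-table handling keyed on the group mode.
    Modes: 'S' (shared), 'U' (update), 'X' (exclusive)."""
    def __init__(self):
        # element -> {'group': mode, 'list': [(txn, mode, waiting)]}
        self.entries = {}

    def request(self, txn, element, mode):
        """Return True if the lock is granted, False if txn must wait."""
        entry = self.entries.get(element)
        if entry is None:
            # No lock-table entry for the element: create it and grant.
            self.entries[element] = {'group': mode,
                                     'list': [(txn, mode, False)]}
            return True
        if entry['group'] in ('U', 'X'):
            # No other lock can be granted: record the waiter.
            entry['list'].append((txn, mode, True))
            return False
        # Group mode S: another shared or update lock can be granted.
        if mode in ('S', 'U'):
            entry['list'].append((txn, mode, False))
            if mode == 'U':
                entry['group'] = 'U'
            return True
        entry['list'].append((txn, mode, True))   # X request must wait
        return False

lt = LockTable()
print(lt.request('T1', 'A', 'S'))  # True  -- first lock on A
print(lt.request('T2', 'A', 'S'))  # True  -- shared locks are compatible
print(lt.request('T3', 'A', 'X'))  # False -- must wait
```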
Slide 275 Rules for Timestamp-Based Scheduling Scheduler receives a request rT(X). If TS(T) ≥ WT(X), the read is physically realizable: if C(X) is true, grant the request, and if TS(T) > RT(X), set RT(X) := TS(T); otherwise do not change RT(X). If C(X) is false, delay T until C(X) becomes true or the transaction that wrote X aborts. If TS(T) < WT(X), the read is physically unrealizable. Rollback T.
Slide 312
Rules for Timestamp-Based Scheduling (cont.) Scheduler receives a request wT(X). If TS(T) ≥ RT(X) and TS(T) ≥ WT(X), the write is physically realizable and must be performed: write the new value for X, set WT(X) := TS(T), and set C(X) := false. If TS(T) ≥ RT(X) but TS(T) < WT(X), the write is physically realizable, but there is already a later value in X: if C(X) is true, the previous writer of X has committed, so ignore the write by T; if C(X) is false, we must delay T. If TS(T) < RT(X), the write is physically unrealizable, and T must be rolled back.
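The read and write rules can be written as one function each, under the slides' definitions (RT and WT are the read and write times of X, C its commit bit; the dict representation is illustrative):

```python
def read(T_ts, elem):
    """Timestamp rule for a read request r_T(X); returns the action taken."""
    if T_ts < elem['WT']:
        return 'rollback'                 # physically unrealizable
    if not elem['C']:
        return 'delay'                    # writer of X not yet committed
    elem['RT'] = max(elem['RT'], T_ts)    # grant, advancing RT(X) if needed
    return 'grant'

def write(T_ts, elem):
    """Timestamp rule for a write request w_T(X); returns the action taken."""
    if T_ts < elem['RT']:
        return 'rollback'                 # physically unrealizable
    if T_ts < elem['WT']:
        # A later value already exists in X.
        return 'skip' if elem['C'] else 'delay'
    elem['WT'], elem['C'] = T_ts, False   # perform the write
    return 'grant'

X = {'RT': 0, 'WT': 0, 'C': True}
print(write(5, X))   # 'grant' -- sets WT(X) := 5, C(X) := false
print(read(3, X))    # 'rollback' -- TS(T) < WT(X)
```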
Slide 313
Rules for Timestamp-Based Scheduling (cont.) Scheduler receives a request to commit T: it must find all database elements X written by T and set C(X) := true. If any transactions are waiting for X to be committed, those transactions are allowed to proceed. Scheduler receives a request to abort T, or decides to roll back T: any transaction that was waiting on an element X that T wrote must repeat its attempt to read or write.
Slide 314
Three transactions executing under a timestamp-based
scheduler
Slide 315
Timestamps and Locking Generally, timestamping performs better than locking in situations where most transactions are read-only, or where it is rare that concurrent transactions try to read and write the same element. In high-conflict situations, locking performs better than timestamping. The argument for this rule of thumb: locking will frequently delay transactions as they wait for locks, but if concurrent transactions frequently read and write elements in common, then rollbacks will be frequent in a timestamp scheduler, introducing even more delay than a locking system.
Slide 316
Anusha Damodaran ID : 130 CS 257 : Database System Principles
Section 18.9
Slide 317
At a Glance What is Validation? Architecture of Validation
based Scheduler Validation Rules Comparison between Concurrency
Control Mechanisms
Slide 318
Validation Another type of optimistic concurrency control: transactions access data without locks. The validation scheduler keeps a record of what active transactions are doing, and each transaction goes through a validation phase before it starts to write values of database elements. If there would be physically unrealizable behavior, the transaction is rolled back.
Slide 319
18.9.1 Architecture of a Validation-Based Scheduler The scheduler must be told, for each transaction T: the read set RS(T), the set of database elements T reads; and the write set WS(T), the set of database elements T writes. Three phases of the validation scheduler:
Read: the transaction reads from the database all elements in its read set, and computes in its local address space all the results it is going to write.
Validate: the scheduler validates the transaction by comparing its read and write sets with those of other transactions; if validation fails, the transaction is rolled back, otherwise it proceeds to the write phase.
Write: the transaction writes to the database its values for the elements in its write set.
Slide 320
Validation-Based Scheduler The scheduler has an assumed serial order of the transactions to work with, and maintains three sets:
START: transactions that have started but not yet completed validation; for these, START(T), the time at which T started, is recorded.
VAL: transactions that have been validated but not yet finished the writing of phase 3; for these, START(T) and VAL(T), the time at which T validated, are recorded.
FIN: transactions that have completed phase 3; for these, START(T), VAL(T), and FIN(T), the time at which T finished, are recorded.
Slide 321
18.9.2 Validation Rules Case 1: U is in VAL or FIN (that is, U is validated), FIN(U) > START(T) (that is, U did not finish before T started), and RS(T) ∩ WS(U) is not empty (say it contains database element X). Since we don't know whether or not T got to read U's value of X, we must roll back T to avoid the risk that the actions of T and U will not be consistent with the assumed serial order. [Timeline: U starts and writes X; T starts and reads X; U validates; T is validating.]
Slide 322
18.9.2 Validation Rules Case 2: U is in VAL (i.e., U has successfully validated), FIN(U) > VAL(T) (i.e., U did not finish before T entered its validation phase), and WS(T) ∩ WS(U) is not empty (say database element X is in both write sets). T and U must both write values of X, and if we let T validate, it is possible that T will write X before U does. Since we cannot be sure, we roll back T to make sure it does not violate the assumed serial order, in which it follows U. [Timeline: U writes X and validates; T is validating and will write X; U finishes later.]
Slide 323
Rules for Validating a Transaction T:
1. Check that RS(T) ∩ WS(U) = ∅ for any previously validated U that did not finish before T started, i.e., for which FIN(U) > START(T).
2. Check that WS(T) ∩ WS(U) = ∅ for any previously validated U that did not finish before T validated, i.e., for which FIN(U) > VAL(T).
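The two checks can be coded directly; a sketch (the dict fields and time values are illustrative; 'fin' is None while U has not yet finished phase 3):

```python
def validate(T, others):
    """Validate transaction T against previously validated transactions.

    T and each U are dicts with keys 'RS', 'WS', 'start', 'val', 'fin';
    sets for the read/write sets, numbers for the times."""
    for U in others:
        # "U did not finish before ..." -- unfinished U always qualifies.
        after_start = U['fin'] is None or U['fin'] > T['start']
        after_val = U['fin'] is None or U['fin'] > T['val']
        # Rule 1: T may have read an element U was still writing.
        if after_start and (T['RS'] & U['WS']):
            return False
        # Rule 2: T might write an element before U finishes writing it.
        if after_val and (T['WS'] & U['WS']):
            return False
    return True

# Timing values below are hypothetical, chosen to match the slide's example.
T = {'RS': {'A', 'B'}, 'WS': {'A', 'C'}, 'start': 2, 'val': 3, 'fin': 6}
W = {'RS': {'A', 'D'}, 'WS': {'A', 'C'}, 'start': 4, 'val': 5, 'fin': None}
print(validate(W, [T]))   # False -- W reads A, which T wrote and had not finished
```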
Slide 324
Example 18.2.9 Four transactions T, U, V, W attempt to execute and validate:
T: RS = {A,B}, WS = {A,C}
U: RS = {B}, WS = {D}
W: RS = {A,D}, WS = {A,C}
V: RS = {B}, WS = {D,E}
Each passes through the phases read, validate, write.
Slide 325
Example 18.2.9 Validation of U [RS = {B}, WS = {D}]: nothing to check; U validates successfully, reads {B}, and writes {D}.
Validation of T [RS = {A,B}, WS = {A,C}]: FIN(U) > START(T), so RS(T) ∩ WS(U) must be empty: {A,B} ∩ {D} = ∅. FIN(U) > VAL(T), so WS(T) ∩ WS(U) must be empty: {A,C} ∩ {D} = ∅. T validates.
Validation of V [RS = {B}, WS = {D,E}]: FIN(T) > START(V), and RS(V) ∩ WS(T) = {B} ∩ {A,C} = ∅. FIN(T) > VAL(V), and WS(V) ∩ WS(T) = {D,E} ∩ {A,C} = ∅. FIN(U) > START(V), and RS(V) ∩ WS(U) = {B} ∩ {D} = ∅. V validates.
Validation of W [RS = {A,D}, WS = {A,C}]: FIN(T) > START(W), but RS(W) ∩ WS(T) = {A,D} ∩ {A,C} = {A}. FIN(V) > START(W), but RS(W) ∩ WS(V) = {A,D} ∩ {D,E} = {D}. FIN(V) > VAL(W), and WS(W) ∩ WS(V) = {A,C} ∩ {D,E} = ∅. W is not validated; it is rolled back and hence does not write values for A and C.
Slide 326
18.9.3 Comparison of Concurrency-Control Mechanisms: Storage Utilization
Locks: space in the lock table is proportional to the number of database elements locked.
Timestamps: space is needed for read- and write-times with every database element, whether or not it is currently accessed.
Validation: space is used for timestamps and read/write sets for each currently active transaction, plus a few more transactions that finished after some currently active transaction began.
Timestamping and validation may use slightly more space than locking. A potential problem with validation is that the write set for a transaction must be known before the writes occur.
Slide 327
18.9.3 Comparison of Concurrency-Control Mechanisms: Delay The performance of the three methods depends on whether interaction among transactions is high or low (interaction: the likelihood that a transaction will access an element that is also being accessed by a concurrent transaction). Locking delays transactions but avoids rollbacks, even when interaction is high. Timestamps and validation do not delay transactions, but can cause them to roll back, which is a more serious form of delay and also wastes resources. If interference is low, then neither timestamps nor validation will cause many rollbacks, and either is preferable to locking.