Physical Design: Types of Indexes & Files University of Manitoba Asper School of Business 3500 DBMS Bob Travica Based on G. Post, DBMS: Designing & Building Business Applications Updated 2015

DB SYSTEMS

Physical Data Storage

Topics of interest:

• File types for storing data
• Index types: data structures for retrieving data (index to sequential file, linked list, B+-tree, hash table)
• Additional physical design methods (file partitioning, clustering)

Terminology

Data entry (or data element) = a special short record that usually contains the key attribute, the address of the rest of the data, and sometimes other addresses

Pointer = address of data, designation of data location

DBMS task = any of CRUD operations

DBMS Tasks (CRUD)

• Store (write, create) data: insert a row.
• Retrieve (read) data: read the entire table (scan all rows), or read an arbitrary/random row.
• Modify (update) data (e.g., change “Crag” into “Craig”).
• Delete data (2 steps: mark + “pack”).

  LastName    FirstName   Phone
  Adams       Kimberly    (406) 987-9338
  Allbright   Searoba     (619) 281-2485
  Anderson    Charlotte   (701) 384-5623
  Baez        Bessie      (606) 661-2765
  Baez        Lou Ann     (502) 029-3909
  Bailey      Gayle       (360) 649-9754
  Bell        Luther      (717) 244-3484
  Carter      Phillip     (219) 263-2040
  Carver      Bernice     (804) 020-5842
  Crag        Melinda     (502) 691-7565
x Duvall      Pierre      (502) 595-1052

  Adkins      Inga        (706) 977-4337

(x marks a row flagged for deletion; the Adkins row is the one to insert.)
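The two-step delete (mark, then pack) can be sketched as follows. This is only an illustration under stated assumptions: the list of dictionaries stands in for a data file, and the `deleted` flag and the function names `mark_deleted` and `pack` are hypothetical, not taken from the slides.

```python
# Hypothetical in-memory stand-in for a data file: each row carries a "deleted" flag.
rows = [
    {"last": "Crag", "first": "Melinda", "deleted": False},
    {"last": "Duvall", "first": "Pierre", "deleted": False},
]

def mark_deleted(rows, last_name):
    """Step 1: flag matching rows; nothing is physically removed yet."""
    for row in rows:
        if row["last"] == last_name:
            row["deleted"] = True

def pack(rows):
    """Step 2: rewrite the file without the flagged rows."""
    return [row for row in rows if not row["deleted"]]

mark_deleted(rows, "Duvall")
packed = pack(rows)
```

Marking is cheap because it touches one row in place; packing rewrites the file and would typically be deferred.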

File Types & Access Methods (Indexes)

• Indexed Sequential Access Method (ISAM) & sequential file
• Linked list index
• B+-Tree index
• Hash index

Sequential File

Uses:

• When data don’t change much
• When data are retrieved in the same order
• When the table is huge and space is expensive
• When transporting/converting data to a different system

Two forms: random order of records (heap file), and sorted file.

Operations on Sequential Files

• Read entire file sequentially: easy and fast
• Read next record: fast
• Random read (sequential search, pattern matching): slow. Each row is equally likely to be the one sought (probability 1/N), and reaching row i takes i reads.
• Delete, insert, modify: first find, then do. So, slow and costly.

Row   Prob.   # Reads
A     1/N     1
B     1/N     2
C     1/N     3
D     1/N     4
E     1/N     5
…     1/N     i
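The probabilities and read counts listed above give the average cost of a random read. A short derivation, assuming every row is equally likely to be requested and that reaching row i costs i reads:

```latex
E[\text{reads}] = \sum_{i=1}^{N} \frac{1}{N}\, i
                = \frac{1}{N} \cdot \frac{N(N+1)}{2}
                = \frac{N+1}{2}
```

For N = 10 this gives 5.5 reads on average.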

Sequential Access to Sorted Sequential File

Sequential search: go one by one from the top.

• Find Brown: 2 lookups
• Find Jones: 10 lookups
• Min lookups = 1, Max = 10
• On average = (N+1)/2 = (10+1)/2 = 11/2 = 5.5, i.e. about 6 lookups

Record#   Key
1         Adams
2         Brown
3         Cadiz
4         Dorfmann
5         Eaton
6         Farris
7         Goetz
8         Hanson
9         Inez
10        Jones

N = 10 entries
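The lookup counts for this slide's ten keys can be checked with a small sketch. The function name `sequential_lookups` is an assumption made for illustration, and a Python list stands in for the sorted file.

```python
keys = ["Adams", "Brown", "Cadiz", "Dorfmann", "Eaton",
        "Farris", "Goetz", "Hanson", "Inez", "Jones"]

def sequential_lookups(keys, target):
    """Scan from the top; return how many entries were read to find target."""
    for count, key in enumerate(keys, start=1):
        if key == target:
            return count
    return None  # not found after reading every entry

brown = sequential_lookups(keys, "Brown")
jones = sequential_lookups(keys, "Jones")
# Average over all ten keys, each assumed equally likely to be sought.
average = sum(sequential_lookups(keys, k) for k in keys) / len(keys)
```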

Insertion into Sorted Sequential File

Insert record Inez:

1. Find the insert location; mark the top and bottom parts of the old file.
2. Copy the top to the new file.
3. Add the new row.
4. Copy the bottom to the new file.
5. Delete the old file.

Old file:

         ID   LastName    FirstName   DateHired
Top      8    Carpenter   Carlos      12/29/2001
         6    Eaton       Anissa      8/23/2001
         7    Farris      Dustin      3/28/2001
         2    Gibson      Bill        3/31/2001
Insert here
Bottom   5    James       Leisha      1/6/2001
         9    O’Connor    Jessica     7/23/2001
         3    Reasoner    Katy        2/17/2001
         1    Reeves      Keith       1/29/2001
         10   Shields     Howard      7/13/2001

New file = top + new row + bottom.
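The copy-based insert can be sketched as below. It is a simplified illustration: lists of tuples stand in for the old and new files, the function name `insert_sorted` is hypothetical, and the ID 11 given to Inez is an assumption, since the slide does not state one.

```python
def insert_sorted(old_file, new_row, key=lambda row: row[1]):
    """Build a new file: copy the top, add the new row, copy the bottom."""
    # Find the first row whose key sorts after the new row's key.
    position = len(old_file)
    for i, row in enumerate(old_file):
        if key(row) > key(new_row):
            position = i
            break
    top, bottom = old_file[:position], old_file[position:]
    new_file = []
    new_file.extend(top)       # copy the top part
    new_file.append(new_row)   # add the new row
    new_file.extend(bottom)    # copy the bottom part
    return new_file            # the caller then discards the old file

old_file = [(8, "Carpenter"), (6, "Eaton"), (7, "Farris"), (2, "Gibson"),
            (5, "James"), (9, "O'Connor")]
new_file = insert_sorted(old_file, (11, "Inez"))  # ID 11 is hypothetical
```

Every insert rewrites the whole file, which is why the slides rate inserts into a sorted sequential file as slow and costly.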

Indexed Sequential Access Method (ISAM)

• Commonly used.
• Uses an index, like a table with the columns key and pointer.
• Multiple columns can be indexed, and multiple indexes built.
• The index is sorted; the search is performed on it, and the pointer is used to fetch the record sought.

Records (table), sorted or unsorted:

Record address   ID   LastName    FirstName   DateHired
A11              1    Reeves      Keith       1/29/2001
A22              2    Gibson      Bill        3/31/2001
A32              3    Reasoner    Cathy       2/17/2001
A42              4    Hopkins     Alan        2/8/2001
A47              5    James       Leisha      1/6/2001
A58              6    Eaton       Anissa      8/23/2001
A63              7    Farris      Dustin      3/28/2001
A67              8    Carpenter   Carlos      12/29/2001
A78              9    O'Connor    Jessica     7/23/2001
A83              10   Shields     Howard      7/13/2001

Index on ID:

ID   Pointer
1    A11
2    A22
3    A32
4    A42
5    A47
6    A58
7    A63
8    A67
9    A78
10   A83

Index on LastName:

LastName    Pointer
Carpenter   A67
Eaton       A58
Farris      A63
Gibson      A22
Hopkins     A42
James       A47
O'Connor    A78
Reasoner    A32
Reeves      A11
Shields     A83
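A minimal sketch of the key/pointer idea, using a few of the records and addresses from this slide. The dictionary layout and the name `fetch_by_last_name` are assumptions for illustration; a real ISAM file keeps the index and the records on disk.

```python
from bisect import bisect_left

# Records keyed by storage address (addresses as on the slide).
records = {
    "A11": (1, "Reeves", "Keith"),
    "A22": (2, "Gibson", "Bill"),
    "A58": (6, "Eaton", "Anissa"),
    "A63": (7, "Farris", "Dustin"),
    "A67": (8, "Carpenter", "Carlos"),
}

# Index on LastName: sorted (key, pointer) pairs built from the records.
last_name_index = sorted((row[1], address) for address, row in records.items())
index_keys = [key for key, _ in last_name_index]

def fetch_by_last_name(name):
    """Search the sorted index, then follow the pointer to the record."""
    i = bisect_left(index_keys, name)
    if i < len(index_keys) and index_keys[i] == name:
        pointer = last_name_index[i][1]
        return records[pointer]
    return None
```

The records themselves stay in whatever order they were stored; only the small index is kept sorted.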

Binary Search with ISAM

Task: Find Jones using binary search.

1) Split & test the middle value: Goetz vs. Jones. Result: Jones comes after Goetz (Jones > Goetz), so look down and discard the upper half.
2) Split & test: Jones < Kalida, so look up.
3) Split & test: Jones > Inez, so look down.
4) Split & test: Jones = Jones, so match!

Index (key field), 14 entries (N = 14):

Adams
Brown
Cadiz
Dorfmann
Eaton
Farris
Goetz      (lookup 1)
Hanson
Inez       (lookup 3)
Jones      (lookup 4: match!)
Kalida     (lookup 2)
Lomax
Miranda
Norman
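The four lookups can be traced with a short sketch. The function name `binary_search` and the `tested` list are illustrative assumptions; the keys are the 14 index entries from this slide.

```python
index_keys = ["Adams", "Brown", "Cadiz", "Dorfmann", "Eaton", "Farris", "Goetz",
              "Hanson", "Inez", "Jones", "Kalida", "Lomax", "Miranda", "Norman"]

def binary_search(keys, target):
    """Return (position, keys_tested) for target in a sorted list."""
    low, high = 0, len(keys) - 1
    tested = []
    while low <= high:
        middle = (low + high) // 2
        tested.append(keys[middle])
        if keys[middle] == target:
            return middle, tested
        if keys[middle] < target:
            low = middle + 1      # target sorts later: drop the earlier entries
        else:
            high = middle - 1     # target sorts earlier: drop the later entries
    return None, tested

position, tested = binary_search(index_keys, "Jones")
```

With this midpoint rule, `tested` records Goetz, Kalida, Inez, and Jones, matching the four steps on the slide.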

Binary Search (Cont.)

4 lookups in total, halving the search range each time:

1) 14/2 = 7
2) 7/2 = 3.5, round to 4
3) 4/2 = 2
4) 2/2 = 1

Or, by doubling: 2 x 2 = 4; 4 x 2 = 8; 8 x 2 = 16, that is, approximately $2^4$. The number of lookups is the exponent to which 2 must be raised to reach the number of entries searched, or a bit over: $\log_2 14 \approx 3.8$, between 3 and 4, so at most 4 lookups.

Linked List Index

• The index consists of data entries with 3 pieces: key value, pointer to the next element, and pointer to a stored record.

Linked list index:

Element address   Key         Next element   Record pointer
B87               Carpenter   B29            A67
B29               Eaton       B71            A58
B71               Farris      B38            A63
B38               Gibson      00             A22

(00 indicates the last record.)

Complete records:

Address   ID   LastName    FirstName   DateHired
A63       7    Farris      Dustin      3/28/2001
A67       8    Carpenter   Carlos      12/29/2001
A58       6    Eaton       Anissa      8/23/2001
A22       2    Gibson      Bill        3/31/2001
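Reading the list in key order means following the next-pointers. A minimal sketch, assuming the elements sit in a dictionary keyed by their own address and that the list starts at the Carpenter element (B87); the name `walk` is hypothetical.

```python
# Index elements keyed by their own address: (key, next_pointer, record_pointer).
# "00" marks the last element, as on the slide.
elements = {
    "B87": ("Carpenter", "B29", "A67"),
    "B29": ("Eaton", "B71", "A58"),
    "B71": ("Farris", "B38", "A63"),
    "B38": ("Gibson", "00", "A22"),
}
HEAD = "B87"  # assumed start of the list (the Carpenter element)

def walk(elements, head):
    """Follow next-pointers from the head; collect (key, record pointer) pairs."""
    result = []
    address = head
    while address != "00":
        key, next_pointer, record_pointer = elements[address]
        result.append((key, record_pointer))
        address = next_pointer
    return result

in_order = walk(elements, HEAD)
```

The physical addresses (B87, B29, B71, B38) are in no particular order; the sort order lives entirely in the pointers.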

Linked List: Insert Task

Task: Insert the Eccles row

Procedure:

1. Identify the place of the Eccles element in the sorting order (Eccles comes after Eaton and before Farris). The location is logical (pointer-related).
2. Store the Eccles element at an available location (B14).
3. Move the pointer B71 (referencing the Farris element) from the Eaton element to the Eccles element.
4. Insert a pointer in Eaton to point to the Eccles element at its new location, B14.

Index elements after the insert:

Element address   Key      Next element    Record pointer
B29               Eaton    B14 (was B71)   A58
B14               Eccles   B71             A97
B71               Farris   B38             A63
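The pointer changes in steps 2 to 4 can be sketched as follows. The dictionary layout and the name `insert_after` are assumptions for illustration; the addresses are the ones shown on the slide.

```python
# Index elements keyed by their own address: (key, next_pointer, record_pointer).
elements = {
    "B29": ("Eaton", "B71", "A58"),
    "B71": ("Farris", "B38", "A63"),
    "B38": ("Gibson", "00", "A22"),
}

def insert_after(elements, prev_address, new_address, key, record_pointer):
    """Insert a new element after prev_address by rewriting two pointers."""
    prev_key, prev_next, prev_record = elements[prev_address]
    # The new element takes over the predecessor's old next-pointer (B71).
    elements[new_address] = (key, prev_next, record_pointer)
    # The predecessor now points at the new element (B14).
    elements[prev_address] = (prev_key, new_address, prev_record)

insert_after(elements, "B29", "B14", "Eccles", "A97")
```

No existing element moves: the Farris and Gibson entries are untouched, which is what makes inserts cheap in a linked list index.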

Tree (hierarchical) Indexes

Tree = a hierarchical structure with a root element on top, branches, nodes, and leaves.

• Root = start point
• Node = data entry
• Leaf = bottom node with no children
• Depth (n) = number of levels
• Degree (m) = max. no. of children per node (2 or more)

Node layout: [ < | value | <= ]

• Left pointer (<): to keys with lower values
• Right pointer (<=): to keys with higher or equal values

B+-Tree

• Increased retrieval power, and performs well on the other tasks.
• The typical index in modern DBMSs.

Characteristics:

• Root
• Non-leaf nodes: hold some values of the key attribute, used for navigating through the tree
• Leaf nodes: hold all key values and point to records
• Degree m >= 3
• Every non-leaf node (except the root) has between m/2 (rounded up) and m children
• Leaf nodes (leaves) are all at the same level/depth and in sequential order

B+-Tree Example

• Degree = 3
• At least m/2 = 1.5, rounded up to 2 children per non-leaf node; maximum 3 children.
• Search procedure (e.g., find 692) uses comparisons: less than (<), equal or greater than (<=), between.
• Note the sequential order at leaf level (156…792), as in an ISAM index.

Root:      [315]
Level 2:   [231 | 287]   [458 | 792]
Leaves:    [156] [231] [287]   [315 | 347] [458 | 692] [792]

The leaves point to the records.
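The search for 692 in this example tree can be sketched as below. It is an illustration under stated assumptions: nodes are plain dictionaries, the helper names `leaf` and `search` are hypothetical, and the tree shape follows the slide's figure (root 315, two second-level nodes, six leaves).

```python
from bisect import bisect_right

# Non-leaf nodes: {"keys": [...], "children": [...]}; leaves: {"keys": [...]}.
def leaf(*keys):
    return {"keys": list(keys)}

tree = {
    "keys": [315],
    "children": [
        {"keys": [231, 287], "children": [leaf(156), leaf(231), leaf(287)]},
        {"keys": [458, 792], "children": [leaf(315, 347), leaf(458, 692), leaf(792)]},
    ],
}

def search(node, key):
    """Return (found, nodes_visited), descending from the root to one leaf."""
    visited = 1
    while "children" in node:
        # Keys lower than a separator go left; equal or higher keys go right.
        node = node["children"][bisect_right(node["keys"], key)]
        visited += 1
    return key in node["keys"], visited

found_692 = search(tree, 692)
```

Every search visits the same number of nodes (one per level), because all leaves sit at the same depth.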

B+-Tree: Insert Task

Insert 257:

• Find the location, starting from the root.
• Easy with extra space: just insert 257 in the appropriate sequence.

Test 1: 257 vs. 315 (root)
Test 2: 257 vs. 231 and 287 (level 2)
Test 3: 257 vs. 231 (leaf)

Root:      [315]
Level 2:   [231 | 287]   [458 | 792]
Leaves:    [156] [231 | 257] [287]   [315 | 347] [458 | 692] [792]

B+-Tree Strengths

• Designed to give good performance for any type of data and usage.
• Lookup speed is based on degree/depth.
• Random and sequential retrieval are fast.
• Insert, delete, and modify are fast.
• Many changes are easy.
• Occasionally, large sections must be reorganized to balance the tree.

Direct Access / Hashed

Convert the key value directly to a location address (relative or absolute).

Prime modulus algorithm:

• Choose a prime number greater than the expected database size.
• Divide the key by the prime number and use the remainder as the address of the storage location: 528 / 101 = 5, remainder 23.

Properties:

• Very fast random retrieval (use: POS to retrieve the price for a product no.).
• Slower sequential access.
• Needs collision/overflow space for keys that hash to the same location (a disadvantage of hashing). Reorganize if out of space.

Example: Prime = 101, Key = 528, Remainder (key modulo prime) = 23, so key 528 is stored in location 23.

[Figure: hash table of storage locations with an overflow/collisions area.]
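The prime modulus calculation and the overflow idea can be sketched as follows. This is a simplified illustration: the dictionary `table`, the list `overflow`, and the function names are assumptions, and the second key 629 is a hypothetical value chosen because it also leaves remainder 23.

```python
PRIME = 101  # prime modulus from the slide's example

def address(key):
    """Prime modulus hashing: the remainder is the storage location."""
    return key % PRIME

table = {}      # location -> key stored in the primary area
overflow = []   # keys whose location was already taken (collisions)

def store(key):
    location = address(key)
    if location in table and table[location] != key:
        overflow.append(key)   # collision: place the key in overflow space
    else:
        table[location] = key
    return location

store(528)   # 528 = 5 * 101 + 23
store(629)   # hypothetical second key; 629 = 6 * 101 + 23, so it collides
```

Retrieval recomputes the same remainder, so a key is found in one step unless it landed in the overflow area.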

Comparison of Index Types

Choice depends on data usage:

• How often do data change?
• What percent of the data is used at one time?
• How big are the tables?
• How many transactions are processed per second?

Guidelines:

• B+-Tree is best overall.
• Hashing is good for high-speed random access.
• Sequential/ISAM is good if entire larger tables are often used.

Storing Data Columns

Different methods of storing data within each row:

• Fixed (positional): simple, common.
• Fixed with overflow: for memo/highly variable text (VARCHAR data type).

Overflow text:

A101: -Extra Large
A321: an-Premium
A532: r-Cat

Data Clustering

• Grouping related data together to improve retrieval.
• Data should be close to each other on one disk, preferably within the same disk page or cylinder.
• Minimize disk reads and seeks.
• Example: cluster each invoice with the matching order.

Data Clustering

• Keeping data on the same drive
• Keeping data close together: same cylinder, same I/O page, consecutive sectors

Order rows clustered with their OrderItem rows:

Order       Order# 1123   Customer# 8876   OrderDate
OrderItem   Order# 1123   Item# 240        Quantity 2
OrderItem   Order# 1123   Item# 987        Quantity 1
OrderItem   Order# 1123   Item# 078        Quantity 3
Order       Order# 1124
OrderItem   Order# 1124   Item
OrderItem   Order# 1124   Item

Data Partitioning

Split table: horizontally or vertically.

Horizontal partition:

• Some rows are accessed infrequently
• Large tables
• Move less used rows to slower/cheaper storage

Current (active) customers: high speed hard disk
Customers w/ no purchase in last 3 years: low cost optical disk

Customer#   Name       Address          Phone
2234        Inouye     9978 Kahlea Dr.  555-555-2222
5532        Jones      887 Elm St.      666-777-3333
0087        Hardaway   112 West 2000    888-222-1111
0109        Pippen     873 Lake Shore   333-111-2235
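A horizontal split by customer activity might look like the sketch below. It rests on assumptions not in the slides: the `last_purchase` field and its dates are hypothetical, the cutoff date is passed in by the caller, and two Python lists stand in for the fast and cheap storage tiers.

```python
from datetime import date

def partition_customers(customers, cutoff):
    """Split rows: purchases on/after cutoff are active, the rest inactive."""
    active = [c for c in customers if c["last_purchase"] >= cutoff]
    inactive = [c for c in customers if c["last_purchase"] < cutoff]
    return active, inactive

# Hypothetical last-purchase dates for two of the slide's customers.
customers = [
    {"id": "2234", "name": "Inouye", "last_purchase": date(2015, 3, 14)},
    {"id": "5532", "name": "Jones", "last_purchase": date(2010, 7, 2)},
]

# Cutoff three years before an assumed "today" of 2015-06-01.
active, inactive = partition_customers(customers, date(2012, 6, 1))
```

Both partitions keep the full set of columns; only the rows are divided.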

Data Partitioning

Vertical partition:

• Some columns are less used and large (long).
• Store often used data on high speed hard disk.
• Store less used data on low cost optical disk.
• The DBMS retrieves both automatically as needed.

Item#   Name       QOH   Description     TechnicalSpecifications
875     Bolt       268   1/4” x 10       Hardened, meets standards ...
937     Injector   104   Fuel injector   Designed 1995, specs . . .

RAID and Disk Striping

• Redundant Array of Independent Drives (RAID): instead of one massive drive, use many smaller drives.
• Striping: split the table to store parts on different drives.
• Drives can simultaneously retrieve portions of data (parallel processing).

CustID   Name      Phone
115      Jones     555-555-1111
225      Inez      666-666-2222
333      Shigeta   777-777-1357
938      Smith     888-888-2225
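Striping can be pictured with a round-robin sketch. This is a simplification under stated assumptions: real RAID stripes fixed-size blocks rather than table rows, the function name `stripe` is hypothetical, and Python lists stand in for drives.

```python
def stripe(rows, drive_count):
    """Distribute rows across drives round-robin (row i goes to drive i mod n)."""
    drives = [[] for _ in range(drive_count)]
    for i, row in enumerate(rows):
        drives[i % drive_count].append(row)
    return drives

rows = [(115, "Jones"), (225, "Inez"), (333, "Shigeta"), (938, "Smith")]
drives = stripe(rows, 2)
```

With the rows spread evenly, each drive holds half of the table, so the two halves can be read at the same time.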