CS 542 Putting it all together -- Storage Management


Transcript of CS 542 Putting it all together -- Storage Management

Page 1: CS 542 Putting it all together -- Storage Management

CS 542 Database Management Systems

J Singh March 14, 2011

Putting it all together

Storage Hierarchy Secondary Storage Management

System Catalogs Data Modeling

Page 2: CS 542 Putting it all together -- Storage Management

2© J Singh, 2011 2

Plan for today

• Putting it all together
  – Storage Hierarchy
  – Secondary Storage Management
  – System Catalogs
• By the second break, we will have arrived at a place where we have most of the tools to build our own database
• After the second break
  – Data Modeling
    • This topic does not fit with the other three, but some of you will need it for your project, so I am including it

Page 3: CS 542 Putting it all together -- Storage Management


Storage Hierarchy

• A “typical” system is shown here
• Many different levels
  – Main Memory (2011)
    • 1 μsec access / word, $12.50/GB
  – Disk (2011, ref)
    • 3.4 msec access / page, $1.35/GB
• Further away from primary storage,
  – Cost per MB decreases
  – Access speed decreases
  – Storage capacity increases
• Secondary and tertiary storage must be non-volatile

Source: Wikipedia

Page 4: CS 542 Putting it all together -- Storage Management


DBMS vs. OS File System

The OS already does disk space and buffer management, so why not let the OS manage these tasks?
• Differences in OS support: portability issues
• Some limitations, e.g., files don’t span multiple disk devices.
• Buffer management in a DBMS requires the ability to:
  – pin a page in the buffer pool,
  – force a page to disk (important for implementing concurrency control (CC) and recovery),
  – adjust the replacement policy, and pre-fetch pages based on access patterns in typical DB operations.

Page 5: CS 542 Putting it all together -- Storage Management


Structure of a DBMS

• A typical DBMS has a layered architecture.
• Disk: storage hierarchy, RAID
• Disk Space Management: roles, free blocks
• Buffer Management: buffer pool, replacement policy
• Files and Access Methods: file organization
  – heaps, sorted files, indexes
  – file- and page-level storage

[Figure: DBMS layers, top to bottom: Query Optimization and Execution; Relational Operators; Files and Access Methods; Buffer Management; Disk Space Management; DB (index files, system catalog, data files). These layers must consider concurrency control and recovery.]

Page 6: CS 542 Putting it all together -- Storage Management


Five-minute rule (p1)

• Jim Gray, 1985, 1997
• When it comes to improving database performance,
  – Pay per MB when buying RAM
    • If RAM is cheaper, we can afford to buy more
  – Pay for speed when buying disk
    • Faster disk, can get away with less RAM
• The critical question:
  – At what point does the cost of keeping data in memory balance the cost of getting it from disk?

Source: ACM Queue

Page 7: CS 542 Putting it all together -- Storage Management


Five-minute Rule (p2)

• In 1985 context: (ref)
  – Block size: 1KB
  – Disk:
    • 15 I/Os per second = $15K → 1 I/O per second = $1K
    • With overhead → $2K per I/O per second
  – Memory
    • $5K/MB → $5/KB → 1KB = $5
  – Trade-off
    • Spend $5 for 1KB in memory to save $2K in I/O cost? Sure!
    • Any RAM cost up to $2K is good!
    • $2K/$5 = 400, so the break-even access interval is 400 seconds
• Five-minute rule:
  – Have enough RAM to keep any data that will be used within 5 minutes
• In 1997 context: (ref)
  – Re-validated and re-published
• In 2008 context: (ref)
  – Still valid, but for much larger block sizes (64KB). Why?
    • Disk speeds have not increased at the same rate as RAM capacity, making it more economical to bring in bigger blocks.
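The 1985 arithmetic above can be reproduced in a few lines. This is a minimal sketch that uses only the figures quoted on the slide; the function and variable names are illustrative assumptions, and the slide's rounding of 1 MB to 1,000 blocks of 1KB is kept.

```python
def break_even_seconds(disk_cost_per_io_per_sec, ram_cost_per_block):
    """Access interval (in seconds) at which holding a block in RAM costs
    the same as buying the disk throughput needed to re-read it."""
    return disk_cost_per_io_per_sec / ram_cost_per_block


# 1985 figures as quoted on the slide (dollars).
disk_cost_per_io_per_sec = 2_000      # $1K of drive per I/O per second, $2K with overhead
ram_cost_per_mb = 5_000
blocks_per_mb = 1_000                 # 1KB blocks, rounded as on the slide
ram_cost_per_block = ram_cost_per_mb / blocks_per_mb   # $5

break_even = break_even_seconds(disk_cost_per_io_per_sec, ram_cost_per_block)
print(f"break-even interval: {break_even:.0f} s (about {break_even / 60:.1f} minutes)")
```

Plugging in newer prices and larger block sizes is how the rule gets re-evaluated for later years.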

Page 8: CS 542 Putting it all together -- Storage Management


Extending the Hierarchy (p1)

• Flash memories (Solid-State Disks) (ref)
  – Fit nicely at the half-way point between memory and disk
  – 2011 price: 80¢/GB
  – Persistent storage
  – Low power
  – Implications of the 5-minute rule:
    • RAM-Flash buffer size 4KB
    • Flash-Disk buffer size 256KB
    • Special uses?
      – Index structures?
      – Materialized views?

Page 9: CS 542 Putting it all together -- Storage Management


Extending the Hierarchy (P2)

• Principles of caching also extend into the cloud. Comparison:
  – Disk (2011)
    • 3.4 msec access, $1.35/GB
  – Amazon S3 (2011)
    • Highly redundant storage constructed from cheap disks
    • 100-250 msec access (across the network), $0.14/GB
• Potential applications:
  – How to turn a key-value store (e.g., Amazon SimpleDB) into a document store
    • Use SimpleDB for its indexing capabilities
    • Use S3 for storing documents
  – Database backups / checkpoints into the cloud

Page 10: CS 542 Putting it all together -- Storage Management


Disks

• Secondary storage device of choice.

• Data is stored and retrieved in units called disk blocks or pages.

• Unlike RAM, time to retrieve a disk page varies depending upon location on disk.

– Therefore, relative placement of pages on disk has major impact on DBMS performance

Page 11: CS 542 Putting it all together -- Storage Management


Components of a Disk

• The platters spin.
  – 15,000 rpm available
  – 7,200 or 5,400 rpm are more typical

• The arm assembly is moved in or out to position a head on a desired track.

• Tracks under heads make a cylinder (imaginary!).

• Only one head reads/writes at any one time.

Block size is a multiple of sector size (which is fixed).

[Figure: components of a disk, with labels for platters, spindle, disk head, arm assembly, arm movement, tracks, and sector]

Page 12: CS 542 Putting it all together -- Storage Management


Accessing a Disk Page

• Time to access (read/write) a disk block:
  – seek time (moving arms to position disk head on track)
  – rotational delay (waiting for block to rotate under head)
  – transfer time (actually moving data to/from disk surface)
• Seek time and rotational delay dominate.
  – Seek time varies from about 1 to 20 msec
  – Rotational delay varies from 0 to 10 msec
  – Transfer time is about 1 msec per 4KB page
• To lower I/O cost, reduce seek and rotational delays
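To see why seek and rotational delay dominate, the three components can be added up for random versus sequential reads. This is a rough sketch: the 9 msec seek time and the 7,200 rpm spindle speed are assumed values chosen from within the ranges above, and the function names are hypothetical.

```python
def avg_rotational_delay_ms(rpm):
    """On average the head waits half a revolution for the block."""
    ms_per_revolution = 60_000 / rpm
    return ms_per_revolution / 2


def access_time_ms(seek_ms, rpm, pages=1, transfer_ms_per_page=1.0):
    """One seek + one rotational delay + transfer of `pages` contiguous pages."""
    return seek_ms + avg_rotational_delay_ms(rpm) + pages * transfer_ms_per_page


# Ten 4KB pages read at ten random locations vs. one contiguous run of ten.
random_10 = 10 * access_time_ms(seek_ms=9, rpm=7200)
sequential_10 = access_time_ms(seek_ms=9, rpm=7200, pages=10)
print(f"random: {random_10:.1f} ms, sequential: {sequential_10:.1f} ms")
```

Under these assumptions the sequential read pays the seek and rotational delay once instead of ten times, which is the motivation for careful block placement.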

Page 13: CS 542 Putting it all together -- Storage Management


Disk Access Optimizations

• Instead of responding to access requests in FIFO order, respond to them in an order that takes the disk characteristics into account
  – Disk Scheduling
• Distribute data accesses among several disks and have them return data in parallel
  – also known as Disk Striping
• Mirror disks
• Prefetch for sequential accesses
  – Locate blocks strategically on disk to minimize seek times and rotational delay

Page 14: CS 542 Putting it all together -- Storage Management


Disk Scheduling

• The Elevator Algorithm (like an elevator in a building)

1. Sort all requests by cylinder outer to inner,

2. Move arm inward and return results for each request

3. Sort all requests inner to outer,

4. Move arm outward and return results for each request

5. Repeat
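The steps above can be sketched as follows. This is a simplified illustration, not a production scheduler: it orders one batch of pending requests for a single inward sweep followed by the return sweep, and it assumes cylinders are numbered from 0 at the outer edge, so "inward" means increasing cylinder number. The function names are hypothetical.

```python
def elevator_order(pending, head, moving_inward=True):
    """Order in which an elevator-style scheduler services cylinder requests."""
    if moving_inward:
        ahead = sorted(c for c in pending if c >= head)
        behind = sorted((c for c in pending if c < head), reverse=True)
    else:
        ahead = sorted((c for c in pending if c <= head), reverse=True)
        behind = sorted(c for c in pending if c > head)
    return ahead + behind   # finish the current sweep, then reverse direction


def arm_travel(order, head):
    """Total number of cylinders the arm crosses when servicing `order`."""
    travel = 0
    for cylinder in order:
        travel += abs(cylinder - head)
        head = cylinder
    return travel


requests = [10, 95, 52, 40, 60]            # arrival (FIFO) order
fifo_travel = arm_travel(requests, head=50)
scan = elevator_order(requests, head=50)
scan_travel = arm_travel(scan, head=50)
print(scan, fifo_travel, scan_travel)
```

For this made-up request queue the sweep order crosses fewer cylinders than FIFO order, which is the point of scheduling by arm position.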


Page 15: CS 542 Putting it all together -- Storage Management


Disk Striping

• Distribute the data among several disks

• Seek time goes up, not down!

• But data transfer time can go down

[Figure: records striped across four disks]

Disk 1   Disk 2   Disk 3   Disk 4
R1       R2       R3       R4
R5       R6       R7       R8
R9       R10      R11      R12
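The layout in the figure corresponds to round-robin placement. A minimal sketch, assuming records are numbered from 1 and disks from 0 (both names and numbering are illustrative):

```python
def disk_for_record(record_no, num_disks):
    """Round-robin striping: R1 -> disk 0, R2 -> disk 1, and so on."""
    return (record_no - 1) % num_disks


def striped_layout(num_records, num_disks):
    """List the records stored on each disk, in placement order."""
    disks = [[] for _ in range(num_disks)]
    for record_no in range(1, num_records + 1):
        disks[disk_for_record(record_no, num_disks)].append(f"R{record_no}")
    return disks


layout = striped_layout(num_records=12, num_disks=4)
for disk_no, records in enumerate(layout):
    print(disk_no, records)
```

A scan of R1 through R4 touches all four disks, so the transfers can proceed in parallel, but every disk still has to seek.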

Page 16: CS 542 Putting it all together -- Storage Management


Mirror disks

• Can read from either copy, whichever is faster

• Must write to both copies

Copy 1 Copy 2

Page 17: CS 542 Putting it all together -- Storage Management


Locating Blocks for Sequential access

• ‘Next’ block concept:
  – blocks on same track, followed by
  – blocks on same cylinder, followed by
  – blocks on adjacent cylinder
• Blocks in a file should be arranged sequentially on disk (by ‘next’), to minimize seek and rotational delay.

• For a sequential scan, pre-fetching several pages at a time is a big win

Page 18: CS 542 Putting it all together -- Storage Management

CS-542 Database Management Systems
Redundant Array of Independent Disks (RAID)

Page 19: CS 542 Putting it all together -- Storage Management


Introduction to RAID

• Arrangement of several disks that gives the abstraction of a single large disk

• Goals: Increase Performance and Reliability

• Techniques
  1. Data striping
    • Data is partitioned
    • Definition: the size of each partition is called the striping unit
    • Partitions are distributed over several disks
  2. Redundancy
    • More disks → increased reliability
    • Redundant information allows reconstruction of data if a disk fails

Page 20: CS 542 Putting it all together -- Storage Management


RAID Levels 0 and 1

• Level 0: No redundancy
  – Best write performance
  – Not best in reading. (Why?)
• Level 1: Mirrored (two identical copies)
  – Each disk has a mirror image
  – Parallel reads; a write involves two disks.
  – Maximum transfer rate = transfer rate of one disk

Page 21: CS 542 Putting it all together -- Storage Management


RAID Level 4

• Arrangement
  – Uses n data disks and 1 parity disk
  – Block is striped across the n disks.
  – The parity disk holds the XOR of the blocks.
• Read: from the n data disks
• Write: update the data disks and also the parity disk
• Upon crash: remaining disks are used to reconstruct the data
• Problem: performance bottleneck on the parity disk

Example (three data disks and their XOR parity):
  Data disk 1:  11110000
  Data disk 2:  10101010
  Data disk 3:  00111000
  Parity disk:  01100010
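The bit patterns above can be checked directly. This sketch treats each disk's block as a single byte (a simplification; real blocks are many bytes, XORed position by position) and shows that XOR both produces the parity and rebuilds a lost block. The helper name `xor_blocks` is hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bitwise XOR of all the given blocks."""
    return reduce(lambda a, b: a ^ b, blocks)


data = [0b11110000, 0b10101010, 0b00111000]   # the three data disks from the slide
parity = xor_blocks(data)
print(f"parity    = {parity:08b}")

# Suppose the second data disk crashes: XOR the survivors with the parity.
recovered = xor_blocks([data[0], data[2], parity])
print(f"recovered = {recovered:08b}")
```

The same XOR is what makes every write touch the parity disk, which is the bottleneck noted above.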

Page 22: CS 542 Putting it all together -- Storage Management


RAID Level 5

• Arrangement
  – 1/n of the cylinders of each disk are set aside for parity

• Thus the bottleneck is distributed evenly

Page 23: CS 542 Putting it all together -- Storage Management


Disk Space Management

• Lowest layer of DBMS software manages space on disk.

• Higher levels call upon this layer to:
  – allocate/de-allocate a page
  – read/write a page

• Higher levels don’t need to know how this is done, or how free space is managed.

Page 24: CS 542 Putting it all together -- Storage Management


Buffer Management in a DBMS

• Data must be in RAM for DBMS to operate on it!
• Table of <frame#, pageid> pairs is maintained.

[Figure: page requests from higher levels arrive at the buffer pool in main memory; frames hold disk pages or are free; pages move between the buffer pool and the DB on disk; the choice of frame is dictated by the replacement policy]

Page 25: CS 542 Putting it all together -- Storage Management


When a Page is Requested ...

• If requested page is not in buffer pool:
  – Choose a frame for replacement
  – If frame is dirty, write it to disk
  – Read requested page into chosen frame

• Pin the page and return its address.

• If requests can be predicted (e.g., sequential scans), pages can be pre-fetched

Page 26: CS 542 Putting it all together -- Storage Management


More on Buffer Management

• Requestor of page must unpin it, and indicate whether page has been modified:
  – dirty bit is used for this.
• Page in pool may be requested many times,
  – a pin count is used.
  – A page is a candidate for replacement iff pin count = 0.

• CC & recovery may entail additional I/O when a frame is chosen for replacement. (Write-Ahead Log protocol; more later.)
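The request, pin, unpin, and dirty-bit rules from the last two slides fit into a small class. This is a toy sketch under stated assumptions: the "disk" is a plain dictionary, the replacement policy is LRU, and the class and method names are hypothetical. It ignores concurrency control and the write-ahead log entirely.

```python
from collections import OrderedDict


class BufferPool:
    """Toy buffer manager: LRU replacement, pin counts, dirty bits."""

    def __init__(self, num_frames, disk):
        self.num_frames = num_frames
        self.disk = disk                 # stand-in for the disk: page_id -> data
        self.frames = OrderedDict()      # page_id -> frame, least recently used first

    def request(self, page_id):
        """Pin the page and return its frame, reading it from disk if needed."""
        if page_id in self.frames:
            self.frames.move_to_end(page_id)       # mark as most recently used
        else:
            if len(self.frames) >= self.num_frames:
                self._evict()
            self.frames[page_id] = {"data": self.disk[page_id], "pin": 0, "dirty": False}
        frame = self.frames[page_id]
        frame["pin"] += 1
        return frame

    def unpin(self, page_id, modified=False):
        """The requestor releases the page and says whether it changed it."""
        frame = self.frames[page_id]
        frame["pin"] -= 1
        frame["dirty"] = frame["dirty"] or modified

    def _evict(self):
        """Replace the least recently used frame whose pin count is 0."""
        victim = next((pid for pid, f in self.frames.items() if f["pin"] == 0), None)
        if victim is None:
            raise RuntimeError("all frames are pinned")
        frame = self.frames.pop(victim)
        if frame["dirty"]:
            self.disk[victim] = frame["data"]      # write back before reusing the frame
```

A caller would `request` a page, work on `frame["data"]`, then `unpin` it with `modified=True` if it wrote to the page; a dirty page reaches the disk only when its frame is chosen for replacement.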

Page 27: CS 542 Putting it all together -- Storage Management


Buffer Replacement Policy

• Frame is chosen for replacement by a replacement policy:

– Least-recently-used (LRU), Clock, MRU etc.

• Policy can have big impact on # of I/O’s; depends on access pattern.

• Sequential flooding: Nasty situation caused by LRU + repeated sequential scans.

– # buffer frames < # pages in file means each page request causes an I/O.

– MRU much better in this situation (but not in all situations, of course).
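Sequential flooding is easy to reproduce in a small simulation. The sketch below counts page misses for repeated scans of a file that is one page larger than the buffer pool; the function name and the sizes (3 frames, 4 pages, 3 scans) are arbitrary assumptions for illustration.

```python
from collections import OrderedDict


def count_misses(num_frames, num_pages, scans, policy):
    """Page misses for `scans` sequential scans under 'LRU' or 'MRU'."""
    pool = OrderedDict()                 # resident pages, least recently used first
    misses = 0
    for _ in range(scans):
        for page in range(num_pages):
            if page in pool:
                pool.move_to_end(page)   # hit: now the most recently used page
                continue
            misses += 1
            if len(pool) >= num_frames:
                # popitem(last=True) drops the most recently used page (MRU),
                # popitem(last=False) drops the least recently used one (LRU).
                pool.popitem(last=(policy == "MRU"))
            pool[page] = None
    return misses


lru_misses = count_misses(num_frames=3, num_pages=4, scans=3, policy="LRU")
mru_misses = count_misses(num_frames=3, num_pages=4, scans=3, policy="MRU")
print(lru_misses, mru_misses)
```

With LRU every one of the 12 requests misses, because each page is evicted just before the scan comes back to it; MRU keeps most of the file resident across scans.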

Page 28: CS 542 Putting it all together -- Storage Management


Representing addresses (p1)

• We need pointers, especially in object-oriented databases.
• Two kinds of addresses:
  – Physical (e.g. host, driveID, cylinder, surface, sector (block), offset)
  – Logical (unique ID).
• Physical addresses are very long
  – 8B is the minimum; up to 16B in some systems
• Example: A database that is designed to last 100 years.
  – If the database grows to encompass 1 million machines and each machine creates 1 object every nanosecond, then we could have 2^77 objects.
  – 10 bytes are needed to represent addresses for that many objects.

Page 29: CS 542 Putting it all together -- Storage Management


Representing Addresses (p2)

• We need a map table for flexibility.
• The level of indirection gives the flexibility.
• For example, we often move records around, either within a block or from block to block.
• What about the programs that are pointing to these records?
  – They are going to have dangling pointers if they work with physical addresses.
  – With logical addresses, we only need to update the map table!

[Figure: map table with columns logical | physical, translating a logical address into a physical address]
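A map table is essentially a dictionary from logical to physical addresses. In this minimal sketch the logical ID format ("rec-17") and the (block, offset) form of a physical address are assumptions made for illustration, as are the class and method names.

```python
class MapTable:
    """Indirection layer: logical address -> current physical address."""

    def __init__(self):
        self._physical = {}

    def add(self, logical_id, physical_addr):
        self._physical[logical_id] = physical_addr

    def move(self, logical_id, new_physical_addr):
        """Relocating a record only changes its map-table entry."""
        self._physical[logical_id] = new_physical_addr

    def resolve(self, logical_id):
        return self._physical[logical_id]


table = MapTable()
table.add("rec-17", ("block 4", 128))
pointer_held_by_program = "rec-17"        # programs store the logical address

table.move("rec-17", ("block 9", 0))      # the record moves to another block
current = table.resolve(pointer_held_by_program)
print(current)
```

The program's pointer never changes, so it cannot dangle; the cost is one extra lookup per dereference.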

Page 30: CS 542 Putting it all together -- Storage Management


Pointer Swizzling (p1)

Typical DB structure:
• Data maintained by server process, using physical or logical addresses of perhaps 8 bytes.
• Application programs are clients with their own (conventional memory) address spaces.
• When blocks and records are copied to client's memory, DB addresses must be swizzled, i.e., translated to virtual memory addresses.
  – Allows conventional pointer following.
  – Especially important in OODBMS, where pointers as data are common.
• DBMS uses a translation table

[Figure: translation table with columns DB address | memory address]

Page 31: CS 542 Putting it all together -- Storage Management


Pointer swizzling (p2)

• DBMS uses a translation table
  – Map Table vs. Translation Table
  – Logical and physical addresses are both representations of the database address.
  – In contrast, memory addresses in the translation table are for copies of the corresponding object in memory.
  – All addressable items in the database have entries in the map table, while only those items currently in memory are mentioned in the translation table.

[Figure: translation table with columns DBaddr (database address) | Mem-addr (memory address)]

Page 32: CS 542 Putting it all together -- Storage Management


Swizzling Example

[Figure: disk (Block 1, Block 2) and memory; a block is read into memory, with pointers labeled swizzled and unswizzled]

Page 33: CS 542 Putting it all together -- Storage Management


Pointer Swizzling (p3)

• Swizzling options:
  – Never swizzle. Keep a translation table of DB pointers → local pointers; consult the map to follow any DB pointer.
    • Problem: time to follow pointers.
  – Automatic swizzling. When a block is copied to memory, replace all its DB pointers by local pointers.
    • Problem: requires knowing where every pointer is (use block and record headers for schema info).
    • Problem: large investment if not too many pointer followings occur.
  – Swizzle on demand. When a block is copied to memory, enter its own address and those of member records into translation table, but do not translate pointers within the block. If we follow a pointer, translate it the first time.
    • Problem: requires a bit in pointer fields for DB/local.
    • Problem: extra decision at each pointer following.
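Swizzle on demand can be sketched with a pointer field that holds either a database address or a direct in-memory reference. Everything here is illustrative: the class names are hypothetical, the database addresses (100, 200) are made up, and the `isinstance` check stands in for the DB/local bit in the pointer field.

```python
class DbPointer:
    """Unswizzled pointer: stores a database address, not a memory reference."""

    def __init__(self, db_addr):
        self.db_addr = db_addr


class Record:
    def __init__(self, value, next_ptr=None):
        self.value = value
        self.next_ptr = next_ptr      # DbPointer (unswizzled), Record (swizzled), or None


def follow(record, translation_table):
    """Follow record.next_ptr, swizzling it the first time it is used."""
    ptr = record.next_ptr
    if isinstance(ptr, DbPointer):                 # the extra decision on every follow
        ptr = translation_table[ptr.db_addr]       # DB address -> in-memory copy
        record.next_ptr = ptr                      # swizzle: keep the direct reference
    return ptr


a = Record("a", next_ptr=DbPointer(200))
b = Record("b")
translation_table = {100: a, 200: b}               # only records currently in memory

first = follow(a, translation_table)               # translated through the table
second = follow(a, {})                             # already swizzled: no lookup needed
print(first.value, second.value)
```

The second call succeeds with an empty table, showing that the lookup cost is paid only on the first traversal.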

Page 34: CS 542 Putting it all together -- Storage Management


Pinned records

• Pinned record = some swizzled pointer points to it
• Pointers to pinned records have to be unswizzled before the pinned record is returned to disk
• We need to know where the pointers to it are
• Implementation: keep a linked list of all (swizzled) records pointing to a record.

[Figure: records x and y, with swizzled pointers to y]

Page 35: CS 542 Putting it all together -- Storage Management


Variable-Length Data

• Skipped discussion of fixed-length records, please read in the book.

• Real complexity is with variable-length records
  – Varying-size data items (e.g., address)
  – Repeating fields (stars-to-movie relationship)
• Sliding Records
  – Use offset table in a block, pointing to current records.
  – If a record grows, slide records around the block.
• Not enough space? Create overflow block; offset table must indicate “record moved.”
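The sliding-records idea can be illustrated with a block that recomputes its offset table from the records it holds. This is a simplified sketch: the class name, the byte capacities, and the `("moved", slot)` marker are assumptions for illustration, and a real block would store the offset table and records in one byte array.

```python
class Block:
    """Toy block of variable-length records addressed through an offset table."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = []               # slot -> record bytes, or ("moved", overflow_slot)

    def _used(self):
        return sum(len(s) for s in self.slots if isinstance(s, bytes))

    def offsets(self):
        """Offset table: byte offset of each record, or 'moved' for relocated ones."""
        table, position = [], 0
        for s in self.slots:
            if isinstance(s, bytes):
                table.append(position)
                position += len(s)
            else:
                table.append("moved")
        return table

    def insert(self, data):
        if self._used() + len(data) > self.capacity:
            raise ValueError("block is full")
        self.slots.append(data)
        return len(self.slots) - 1

    def update(self, slot, data, overflow):
        """Grow or shrink a record; later records slide, or the record moves out."""
        old = self.slots[slot]
        old_len = len(old) if isinstance(old, bytes) else 0
        if self._used() - old_len + len(data) <= self.capacity:
            self.slots[slot] = data                               # neighbours slide over
        else:
            self.slots[slot] = ("moved", overflow.insert(data))   # leave a forwarding mark


block, overflow = Block(capacity=10), Block(capacity=100)
block.insert(b"abc")
block.insert(b"defg")
before = block.offsets()
block.update(0, b"abcde", overflow)       # fits: the second record slides from 3 to 5
after_slide = block.offsets()
block.update(1, b"defghijk", overflow)    # does not fit: goes to the overflow block
after_move = block.offsets()
print(before, after_slide, after_move)
```

Because callers address records by slot number rather than byte offset, sliding a record inside the block does not invalidate any pointer to it.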

Page 36: CS 542 Putting it all together -- Storage Management


System Catalogs

• Meta information stored in system catalogs.
  – For each index:
    • structure (e.g., B+ tree) and search key fields
  – For each relation:
    • name, file name, file structure (e.g., Heap file)
    • attribute name and type, for each attribute
    • index name, for each index
    • integrity constraints
  – For each view:
    • view name and definition
  – Plus statistics, authorization, buffer pool size, etc.

• Catalogs are themselves stored as relations

Page 37: CS 542 Putting it all together -- Storage Management


Example: Attribute Catalog

• Conceptually speaking…

attr_name      rel_name         type       position
attr_name      Attribute_Cat    string     1
rel_name       Attribute_Cat    string     2
type           Attribute_Cat    string     3
position       Attribute_Cat    integer    4
employee_id    Employee         string     1
DoB            Employee         Date       2
…              …                …          …
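Because the catalog is itself a relation, the schema of any table, including the catalog's own, can be recovered by querying it. A small sketch with the rows above held as Python tuples; the function name `schema_of` is hypothetical.

```python
# Rows of Attribute_Cat as (attr_name, rel_name, type, position), from the slide.
attribute_cat = [
    ("attr_name", "Attribute_Cat", "string", 1),
    ("rel_name", "Attribute_Cat", "string", 2),
    ("type", "Attribute_Cat", "string", 3),
    ("position", "Attribute_Cat", "integer", 4),
    ("employee_id", "Employee", "string", 1),
    ("DoB", "Employee", "Date", 2),
]


def schema_of(rel_name):
    """Attributes of a relation as (name, type) pairs, ordered by position."""
    rows = [row for row in attribute_cat if row[1] == rel_name]
    return [(attr, typ) for attr, _, typ, _ in sorted(rows, key=lambda row: row[3])]


print(schema_of("Employee"))
print(schema_of("Attribute_Cat"))     # the catalog describes itself
```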

Page 38: CS 542 Putting it all together -- Storage Management


Example: MySQL Information Schema

• INFORMATION_SCHEMA Tables
  – The INFORMATION_SCHEMA SCHEMATA Table
  – The INFORMATION_SCHEMA TABLES Table
  – The INFORMATION_SCHEMA COLUMNS Table
  – The INFORMATION_SCHEMA STATISTICS Table
  – The INFORMATION_SCHEMA USER_PRIVILEGES Table
  – The INFORMATION_SCHEMA SCHEMA_PRIVILEGES Table
  – The INFORMATION_SCHEMA TABLE_PRIVILEGES Table
  – The INFORMATION_SCHEMA COLUMN_PRIVILEGES Table
  – The INFORMATION_SCHEMA CHARACTER_SETS Table
  – … (Total 18 tables)

Page 39: CS 542 Putting it all together -- Storage Management


Summary

• Disks provide cheap, non-volatile storage.
  – Random access, but cost depends on location of page on disk
  – Important to arrange data sequentially to minimize seek and rotation delays.
• Buffer manager brings pages into RAM.
  – Page stays in RAM until released by requestor.
  – Written to disk when frame chosen for replacement.
  – Frame to replace based on replacement policy.
  – Tries to pre-fetch several pages at a time.

Page 40: CS 542 Putting it all together -- Storage Management


More Summary

• DBMS vs. OS File Support
  – DBMS needs features not found in many OSs.
    • forcing a page to disk
    • controlling the order of page writes to disk
    • files spanning disks
    • ability to control pre-fetching and page replacement policy based on predictable access patterns
• Two mapping structures help us map addresses
  – Map tables take us from logical addresses to physical addresses
  – Translation tables take us from physical addresses to in-memory addresses (where applicable)
• Swizzling helps keep track of where in memory

Page 41: CS 542 Putting it all together -- Storage Management


Even More Summary

• Catalog relations store information about relations, indexes and views.

– Information common to all records in collection.

Page 42: CS 542 Putting it all together -- Storage Management

CS 542 – Database Management Systems
Data Modeling

Page 43: CS 542 Putting it all together -- Storage Management


Data Modeling Techniques

• Entity-Relationship Modeling
  – E/R Diagrams allow us to sketch database schema designs
    • Designs are pictures called entity-relationship diagrams
  – Weak Entity Sets
    • Skipping, please read in book if interested, not on exam
  – Converting E/R Diagrams to Relations
• Unified Modeling Language (UML)
  – Skipping, please read in book if interested, not on exam
• Object Definition Language (ODL)
  – Skipping, please read in book if interested, not on exam

Page 44: CS 542 Putting it all together -- Storage Management


Framework for E/R

• Design is a serious business.

• The “boss” (or customer) knows they want a database, but they don’t know what they want in it.

• Sketching the key components is an efficient way to develop a working database.


Page 45: CS 542 Putting it all together -- Storage Management


Entity Sets

• Entity = “thing” or object.

• Entity set = collection of similar entities.
  – Similar to a class in object-oriented languages.
• Attribute = property of (the entities of) an entity set.
  – Attributes are simple values, e.g. integers or character strings, not structs, sets, etc.
• In an entity-relationship diagram:
  – Entity set = rectangle.
  – Attribute = oval, with a line to the rectangle representing its entity set.

Page 46: CS 542 Putting it all together -- Storage Management


Example: Beer Manufacturers

• Entity set Beers has two attributes, name and manf (manufacturer).

• Each Beers entity has values for these two attributes, e.g. (Bud, Anheuser-Busch)

Beers

name manf

Page 47: CS 542 Putting it all together -- Storage Management


Relationships

• A relationship connects two or more entity sets.
• It is represented by a diamond, with lines to each of the entity sets involved.

[Figure: E/R diagram with entity sets Drinkers (name, addr), Beers (name, manf), and Bars (name, license, addr), connected by three relationships]
  – Sells: Bars sell some beers.
  – Likes: Drinkers like some beers.
  – Frequents: Drinkers frequent some bars.
  – Note: license = beer, full, none

Page 48: CS 542 Putting it all together -- Storage Management


Relationship Set

• The current “value” of an entity set is the set of entities that belong to it.

– Example: the set of all bars in our database.

• The “value” of a relationship is a relationship set, a set of tuples with one component for each related entity set.

• For the relationship Sells, we might have a relationship set like:

Bar          Beer
Joe’s Bar    Bud
Joe’s Bar    Miller
Sue’s Bar    Bud
Sue’s Bar    Pete’s Ale
Sue’s Bar    Bud Lite

Page 49: CS 542 Putting it all together -- Storage Management


Multiway Relationships

• Sometimes, we need a relationship that connects more than two entity sets.

• Suppose that drinkers will only drink certain beers at certain bars.

– Our three binary relationships Likes, Sells, and Frequents do not allow us to make this distinction.

– But a 3-way relationship would.

Page 50: CS 542 Putting it all together -- Storage Management


Example: 3-Way Relationship

[Figure: the relationship Preferences connects three entity sets: Bars (name, addr, license), Beers (name, manf), and Drinkers (name, addr)]

Page 51: CS 542 Putting it all together -- Storage Management


A Typical Relationship Set

Bar          Drinker    Beer
Joe’s Bar    Ann        Miller
Sue’s Bar    Ann        Bud
Sue’s Bar    Ann        Pete’s Ale
Joe’s Bar    Bob        Bud
Joe’s Bar    Bob        Miller
Joe’s Bar    Cal        Miller
Sue’s Bar    Cal        Bud Lite

• Each row of a relationship set typically consists of foreign keys to other tables
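The relationship set above also shows why three binary relationships cannot replace the 3-way one. In this sketch the binary relationships Frequents, Likes, and Sells are derived as projections of the 3-way set, which is an assumption made for the demonstration; joining them back produces tuples that were never in the original.

```python
# The 3-way relationship set from the slide, as (bar, drinker, beer) tuples.
preferences = {
    ("Joe's Bar", "Ann", "Miller"),
    ("Sue's Bar", "Ann", "Bud"),
    ("Sue's Bar", "Ann", "Pete's Ale"),
    ("Joe's Bar", "Bob", "Bud"),
    ("Joe's Bar", "Bob", "Miller"),
    ("Joe's Bar", "Cal", "Miller"),
    ("Sue's Bar", "Cal", "Bud Lite"),
}

# Binary relationships obtained by projecting the 3-way set.
frequents = {(drinker, bar) for bar, drinker, beer in preferences}
likes = {(drinker, beer) for bar, drinker, beer in preferences}
sells = {(bar, beer) for bar, drinker, beer in preferences}

# Best reconstruction available from the binary relationships alone.
rebuilt = {
    (bar, drinker, beer)
    for drinker, bar in frequents
    for liker, beer in likes
    if drinker == liker and (bar, beer) in sells
}

spurious = rebuilt - preferences
print(sorted(spurious))
```

For example, Ann frequents Joe's Bar, likes Bud, and Joe's Bar sells Bud, yet the original set never says Ann drinks Bud at Joe's Bar.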

Page 52: CS 542 Putting it all together -- Storage Management


Many-Many Relationships

• Focus: binary relationships, such as Sells between Bars and Beers.

• In a many-many relationship, an entity of either set can be connected to many entities of the other set.

– E.g., a bar sells many beers; a beer is sold by many bars.

Page 53: CS 542 Putting it all together -- Storage Management


Many-One Relationships

• Some binary relationships are many-one from one entity set to another.

• Each entity of the first set is connected to at most one entity of the second set.

• But an entity of the second set can be connected to zero, one, or many entities of the first set.

• FavBeer (Drinkers → Beers) is many-one
  – A drinker has at most one favBeer
  – A beer can be the favorite of any number of drinkers, including zero

Page 54: CS 542 Putting it all together -- Storage Management


One-One Relationships

• In a one-one relationship, each entity of either entity set is related to at most one entity of the other set.

• Example: Relationship Best-seller between entity sets Manfs (manufacturer) and Beers.

– A beer cannot be made by more than one manufacturer

– No manufacturer can have more than one best-seller (assume no ties).

Page 55: CS 542 Putting it all together -- Storage Management


Representing “Multiplicity”

• Show a many-one relationship by an arrow entering the “one” side.

• Show a one-one relationship by arrows entering both entity sets.

• Rounded arrow = “exactly one,” i.e., each entity of the first set is related to exactly one entity of the target set.

Page 56: CS 542 Putting it all together -- Storage Management


Example: Many-One Relationship

[Figure: Drinkers and Beers connected by two relationships, Likes and Favorite]
• Notice: two relationships connect the same entity sets, but are different.

Page 57: CS 542 Putting it all together -- Storage Management


Example: One-One Relationship

• Consider Best-seller between Manfs and Beers.
• But a beer manufacturer has to have a best-seller.
  – Shown with a rounded arrow
• Some beers are not the best-seller of any manufacturer
  – A rounded arrow to Manfs would be inappropriate

[Figure: Manfs and Beers connected by Best-seller. A manufacturer has exactly one best-seller. A beer is the best-seller for 0 or 1 manufacturer.]

Page 58: CS 542 Putting it all together -- Storage Management


Attributes on Relationships

• Sometimes it is useful to attach an attribute to a relationship.

• Think of this attribute as a property of tuples in the relationship set.

[Figure: Bars and Beers connected by Sells, with the attribute price attached to Sells]
Price is a function of both the bar and the beer, not of one alone.

Page 59: CS 542 Putting it all together -- Storage Management


Subclasses in E/R Diagrams

• Subclass – fewer entities, more properties.

• Example: Ales are a kind of beer.
  – Not every beer is an ale, but some are.
  – In addition to all the properties (attributes and relationships) of beers, suppose ales also have the attribute color.
• Assume subclasses form a tree.
  – I.e., no multiple inheritance.
• Isa triangles indicate the subclass relationship.

[Figure: Beers (name, manf) with subclass Ales (color), connected by an isa triangle]

Page 60: CS 542 Putting it all together -- Storage Management


E/R vs. Object-Oriented Subclasses

• In OO, objects are in one class only.

– Subclasses inherit from superclasses.

• In contrast, E/R entities have representatives in all subclasses to which they belong.

– Rule: if entity e is represented in a subclass, then e is represented in the superclass (and recursively up the tree).

[Figure: Beers (name, manf) with subclass Ales (color) via isa; the entity Pete’s Ale is shown as an example]

Page 61: CS 542 Putting it all together -- Storage Management


Designating keys

• Show keys by underlining the attribute

[Figure: Beers (name, manf) with subclass Ales (color) via isa, with the key attribute underlined]

Page 62: CS 542 Putting it all together -- Storage Management


Example: Good

[Figure: Beers (name) connected to Manfs (name, addr) by ManfBy]
This design gives the address of each manufacturer exactly once.

Page 63: CS 542 Putting it all together -- Storage Management


Example: Bad

[Figure: Beers (name, manf) connected to Manfs (name, addr) by ManfBy]
This design states the manufacturer of a beer twice: as an attribute and as a related entity.

Page 64: CS 542 Putting it all together -- Storage Management


Example: Bad

[Figure: Beers (name, manf, manfAddr), with no separate Manfs entity set]
This design repeats the manufacturer’s address once for each beer and loses the address if there are temporarily no beers for a manufacturer.

Page 65: CS 542 Putting it all together -- Storage Management


Example: Good

• Manfs deserves to be an entity set because of the nonkey attribute addr.

• Beers deserves to be an entity set because it is the “many” of the many-one relationship ManfBy.

[Figure: Beers (name) connected to Manfs (name, addr) by ManfBy]

Page 66: CS 542 Putting it all together -- Storage Management


Example: Bad

• Since the manufacturer is nothing but a name, and is not at the “many” end of any relationship, it should not be an entity set.

[Figure: Beers (name) connected to Manfs (name) by ManfBy]

Page 67: CS 542 Putting it all together -- Storage Management


From E/R Diagrams to Relations

• Entity set → relation.
  – Attributes → attributes.
• Relationships → relations whose attributes are only:
  – The keys of the connected entity sets.
  – Attributes of the relationship itself.

Page 68: CS 542 Putting it all together -- Storage Management


Entity Set → Relation

[Figure: entity set Beers with attributes name and manf]
Relation: Beers(name, manf)

Page 69: CS 542 Putting it all together -- Storage Management


Relationship → Relation

[Figure: Drinkers (name, addr) and Beers (name, manf) connected by Likes and Favorite; Married with roles husband and wife; Buddies with roles 1 and 2]
• Likes(drinker, beer)
• Favorite(drinker, beer)
• Married(husband, wife)
• Buddies(name1, name2)

Page 70: CS 542 Putting it all together -- Storage Management


Combining Relations

• OK to combine into one relation:
  1. The relation for an entity set E
  2. The relations for many-one relationships of which E is the “many.”
• Example: Drinkers(name, addr) and Favorite(drinker, beer) combine to make Drinker1(name, addr, favBeer).

Page 71: CS 542 Putting it all together -- Storage Management


Risk with Many-Many Relationships

• Combining Drinkers with Likes would be a mistake. It leads to redundancy, as:

name     addr         beer
Sally    123 Maple    Bud
Sally    123 Maple    Miller

Redundancy: the address 123 Maple is repeated.
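The difference between the two cases can be checked with a small join. This sketch uses Sally's rows from the table above; her favorite beer (Bud) is a made-up value for illustration, and the helper name `combine` is hypothetical.

```python
drinkers = {("Sally", "123 Maple")}
favorite = {("Sally", "Bud")}                      # many-one; hypothetical sample row
likes = {("Sally", "Bud"), ("Sally", "Miller")}    # many-many, as in the table above


def combine(drinkers, relationship):
    """Join Drinkers with a (drinker, beer) relationship into one relation."""
    beers_by_drinker = {}
    for drinker, beer in sorted(relationship):
        beers_by_drinker.setdefault(drinker, []).append(beer)
    return [
        (name, addr, beer)
        for name, addr in sorted(drinkers)
        for beer in beers_by_drinker.get(name, [None])
    ]


drinker1 = combine(drinkers, favorite)     # one row per drinker: no redundancy
with_likes = combine(drinkers, likes)      # one row per liked beer: addr repeats
print(drinker1)
print(with_likes)
```

With the many-one relationship each drinker still occupies one row, while the many-many relationship stores the address once per liked beer.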

Page 72: CS 542 Putting it all together -- Storage Management


Subclasses: Three Approaches

1. Object-oriented: One relation per subset of subclasses, with all relevant attributes.
2. Use nulls: One relation; entities have NULL in attributes that don’t belong to them.
3. E/R style: One relation for each subclass:
  – Key attribute(s).
  – Attributes of that subclass.

Page 73: CS 542 Putting it all together -- Storage Management


Example: Subclass Relations

[Figure: Beers (name, manf) with subclass Ales (color), connected by an isa triangle]

Page 74: CS 542 Putting it all together -- Storage Management


Object-Oriented

• Good for queries like “find the color of ales made by Pete’s”

Beers
Name          Manf
Bud           Anheuser-Busch

Ales
Name          Manf           Color
Summerbrew    Pete’s Ales    Dark

Page 75: CS 542 Putting it all together -- Storage Management


E/R Style

• Good for queries like “find all beers, including ales, made by Pete’s”

Beers
Name          Manf
Bud           Anheuser-Busch
Summerbrew    Pete’s

Ales
Name          Color
Summerbrew    Dark

Page 76: CS 542 Putting it all together -- Storage Management


Using NULLS

• Saves space and does everything with one table

Beers
Name          Manf              Color
Summerbrew    Pete’s Ales       Dark
Bud           Anheuser-Busch    NULL
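The three approaches can be compared side by side by generating each set of relations from the same entities. This is an illustrative sketch: the entities are the two beers from the example tables, an ale is recognized by the presence of a color attribute, and the function names are hypothetical.

```python
entities = [
    {"name": "Bud", "manf": "Anheuser-Busch"},
    {"name": "Summerbrew", "manf": "Pete's", "color": "Dark"},   # an ale
]


def is_ale(entity):
    return "color" in entity


def object_oriented(entities):
    """Each entity appears in exactly one relation, with all its attributes."""
    return {
        "Beers": [(e["name"], e["manf"]) for e in entities if not is_ale(e)],
        "Ales": [(e["name"], e["manf"], e["color"]) for e in entities if is_ale(e)],
    }


def er_style(entities):
    """Every beer is in Beers; ales add a row holding the key and color."""
    return {
        "Beers": [(e["name"], e["manf"]) for e in entities],
        "Ales": [(e["name"], e["color"]) for e in entities if is_ale(e)],
    }


def use_nulls(entities):
    """One relation; attributes that do not apply are NULL (None here)."""
    return {"Beers": [(e["name"], e["manf"], e.get("color")) for e in entities]}


for approach in (object_oriented, er_style, use_nulls):
    print(approach.__name__, approach(entities))
```

Which layout is preferable depends on the queries: the E/R style answers "all beers by a manufacturer" from one relation, while the object-oriented style answers "color of ales by a manufacturer" from one relation.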

Page 77: CS 542 Putting it all together -- Storage Management


Data Models for NoSQL Databases

• Class discussion at next meeting.
  – How would you represent a many-to-many relationship in:
    • Amazon SimpleDB?
    • Cassandra?
    • Google App Engine?
    • MongoDB?
    • Redis?
    • Other?
  – Inviting a 3-minute presentation (on 3/21) for 20 bonus points
    • Only one presentation per DB
    • Please volunteer by Tuesday noon if interested
    • I will let you know by Wednesday noon if you were selected

Page 78: CS 542 Putting it all together -- Storage Management


Summary

• Data Modeling is an essential part of designing an application
  – Intersects business and technology
  – Essential elements
    • Entities
    • Relationships
      – Is-a relationships
      – Multiplicity
  – Has to be done with an eye toward the long term
    • (But has to avoid analysis paralysis)
    • Attributes can be added later, but Entities and Relationships are baked in at the beginning and very hard to change later
  – Pay particular attention to multiplicity of relationships
• Best to separate modeling from “table design”
  – Needed for all databases, relational or not.

Page 79: CS 542 Putting it all together -- Storage Management


Next meetings

• March 21: Sort and Join Processing
  – Sort: Chapter 15
  – Join: Sections 16.1 – 16.4