9/9/1999 Information Organization and Retrieval

Database Design: From Conceptual Design to Physical Implementation

University of California, Berkeley

School of Information Management and Systems

SIMS 202: Information Organization and Retrieval

Review

• Database Design Process

• Normalization

Database Design Process

[Figure: database design process diagram. Labels: conceptual requirements for Applications 1 to 4, Conceptual Model, Logical Model, Internal Model, and External Models for Applications 1 to 4.]

Normalization

• Normalization theory is based on the observation that relations with certain properties are more effective in inserting, updating and deleting data than other sets of relations containing the same data

• Normalization is a multi-step process beginning with an “unnormalized” relation

– Hospital example from Atre, S. Data Base: Structured Techniques for Design, Performance, and Management.

Normal Forms

• First Normal Form (1NF)

• Second Normal Form (2NF)

• Third Normal Form (3NF)

• Boyce-Codd Normal Form (BCNF)

• Fourth Normal Form (4NF)

• Fifth Normal Form (5NF)

Normalization

• Unnormalized relation to 1NF: functional dependency of nonkey attributes on the primary key; atomic values only

• 1NF to 2NF: full functional dependency of nonkey attributes on the primary key

• 2NF to 3NF: no transitive dependency between nonkey attributes

• Boyce-Codd and higher: all determinants are candidate keys; single multivalued dependency
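The dependency conditions above can be checked against sample data. What follows is a minimal Python sketch, not part of the original slides: the helper name `determines` and the dictionary keys are assumptions made for illustration, and the rows are a few taken from the hospital example. Note that a sample can only refute a functional dependency; it cannot prove that the dependency holds in general.

```python
def determines(rows, lhs, rhs):
    """Return True if equal lhs values always come with equal rhs values in rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # setdefault stores the first value seen for this key and returns it
        if seen.setdefault(key, value) != value:
            return False
    return True


# A few rows of the hospital relation in first normal form (column names are assumed).
rows = [
    {"patient": 1111, "surgeon": 145, "date": "01-Jan-95", "name": "John White",
     "drug": "Penicillin", "side_effect": "rash"},
    {"patient": 1111, "surgeon": 311, "date": "12-Jun-95", "name": "John White",
     "drug": "none", "side_effect": "none"},
    {"patient": 1234, "surgeon": 243, "date": "05-Apr-94", "name": "Mary Jones",
     "drug": "Tetracycline", "side_effect": "Fever"},
    {"patient": 6845, "surgeon": 243, "date": "05-Apr-94", "name": "Ann Hood",
     "drug": "Tetracycline", "side_effect": "Fever"},
]

# Partial dependency: the patient name depends on only part of the key (a 2NF problem).
partial = determines(rows, ["patient"], ["name"])
# Transitive dependency: the side effect depends on the drug, a nonkey attribute (a 3NF problem).
transitive = determines(rows, ["drug"], ["side_effect"])
# The patient number alone does not determine the surgeon.
patient_to_surgeon = determines(rows, ["patient"], ["surgeon"])
print(partial, transitive, patient_to_surgeon)
```

The first two checks correspond to the dependencies that the 2NF and 3NF steps remove by splitting the relation.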

Unnormalized Relations

• First step in normalization is to convert the data into a two-dimensional table

• In unnormalized relations data can repeat within a column

Unnormalized Relation

Patient # | Surgeon # | Surg. date | Patient Name | Patient Addr | Surgeon | Surgery | Postop drug | Drug side effects
1111 | 145; 311 | Jan 1, 1995; June 12, 1995 | John White | 15 New St. New York, NY | Beth Little; Michael Diamond | Gallstones removal; Kidney stones removal | Penicillin; none | rash; none
1234 | 243; 467 | Apr 5, 1994; May 10, 1995 | Mary Jones | 10 Main St. Rye, NY | Charles Field; Patricia Gold | Eye Cataract removal; Thrombosis removal | Tetracycline; none | Fever; none
2345 | 189 | Jan 8, 1996 | Charles Brown | Dogwood Lane Harrison, NY | David Rosen | Open Heart Surgery | Cephalosporin | none
4876 | 145 | Nov 5, 1995 | Hal Kane | 55 Boston Post Road, Chester, CN | Beth Little | Cholecystectomy | Demicillin | none
5123 | 145 | May 10, 1995 | Paul Kosher | Blind Brook Mamaroneck, NY | Beth Little | Gallstones Removal | none | none
6845 | 243 | Apr 5, 1994; Dec 15, 1984 | Ann Hood | Hilton Road Larchmont, NY | Charles Field | Eye Cornea Replacement; Eye cataract removal | Tetracycline | Fever

First Normal Form

Patient # | Surgeon # | Surgery Date | Patient Name | Patient Addr | Surgeon Name | Surgery | Drug admin | Side Effects
1111 | 145 | 01-Jan-95 | John White | 15 New St. New York, NY | Beth Little | Gallstones removal | Penicillin | rash
1111 | 311 | 12-Jun-95 | John White | 15 New St. New York, NY | Michael Diamond | Kidney stones removal | none | none
1234 | 243 | 05-Apr-94 | Mary Jones | 10 Main St. Rye, NY | Charles Field | Eye Cataract removal | Tetracycline | Fever
1234 | 467 | 10-May-95 | Mary Jones | 10 Main St. Rye, NY | Patricia Gold | Thrombosis removal | none | none
2345 | 189 | 08-Jan-96 | Charles Brown | Dogwood Lane Harrison, NY | David Rosen | Open Heart Surgery | Cephalosporin | none
4876 | 145 | 05-Nov-95 | Hal Kane | 55 Boston Post Road, Chester, CN | Beth Little | Cholecystectomy | Demicillin | none
5123 | 145 | 10-May-95 | Paul Kosher | Blind Brook Mamaroneck, NY | Beth Little | Gallstones Removal | none | none
6845 | 243 | 05-Apr-94 | Ann Hood | Hilton Road Larchmont, NY | Charles Field | Eye Cornea Replacement | Tetracycline | Fever
6845 | 243 | 15-Dec-84 | Ann Hood | Hilton Road Larchmont, NY | Charles Field | Eye cataract removal | none | none
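The step from the unnormalized relation to first normal form can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the nested-list representation of the repeating group, and the function name `to_first_normal_form` are all hypothetical; the values come from the hospital example.

```python
# One unnormalized record per patient; the repeating group is held as a list.
# The field names are assumptions, the values come from the hospital example.
unnormalized = [
    {
        "patient": 1111,
        "name": "John White",
        "surgeries": [
            {"surgeon": 145, "date": "01-Jan-95", "surgery": "Gallstones removal"},
            {"surgeon": 311, "date": "12-Jun-95", "surgery": "Kidney stones removal"},
        ],
    },
    {
        "patient": 2345,
        "name": "Charles Brown",
        "surgeries": [
            {"surgeon": 189, "date": "08-Jan-96", "surgery": "Open Heart Surgery"},
        ],
    },
]


def to_first_normal_form(records):
    """Emit one flat row per surgery, repeating the patient columns on each row."""
    flat = []
    for record in records:
        for item in record["surgeries"]:
            flat.append({"patient": record["patient"], "name": record["name"], **item})
    return flat


first_nf = to_first_normal_form(unnormalized)
for row in first_nf:
    print(row)
```

Every value in the result is atomic, at the cost of repeating the patient name on each of that patient's rows; that redundancy is what the later normal forms address.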

Second Normal Form

Patient # | Patient Name | Patient Address
1111 | John White | 15 New St. New York, NY
1234 | Mary Jones | 10 Main St. Rye, NY
2345 | Charles Brown | Dogwood Lane Harrison, NY
4876 | Hal Kane | 55 Boston Post Road, Chester,
5123 | Paul Kosher | Blind Brook Mamaroneck, NY
6845 | Ann Hood | Hilton Road Larchmont, NY

Second Normal Form

Surgeon # | Surgeon Name
145 | Beth Little
189 | David Rosen
243 | Charles Field
311 | Michael Diamond
467 | Patricia Gold

Second Normal Form

Patient # | Surgeon # | Surgery Date | Surgery | Drug Admin | Side Effects
1111 | 145 | 01-Jan-95 | Gallstones removal | Penicillin | rash
1111 | 311 | 12-Jun-95 | Kidney stones removal | none | none
1234 | 243 | 05-Apr-94 | Eye Cataract removal | Tetracycline | Fever
1234 | 467 | 10-May-95 | Thrombosis removal | none | none
2345 | 189 | 08-Jan-96 | Open Heart Surgery | Cephalosporin | none
4876 | 145 | 05-Nov-95 | Cholecystectomy | Demicillin | none
5123 | 145 | 10-May-95 | Gallstones Removal | none | none
6845 | 243 | 15-Dec-84 | Eye cataract removal | none | none
6845 | 243 | 05-Apr-94 | Eye Cornea Replacement | Tetracycline | Fever

Third Normal Form

Patient # | Surgeon # | Surgery Date | Surgery | Drug Admin
1111 | 145 | 01-Jan-95 | Gallstones removal | Penicillin
1111 | 311 | 12-Jun-95 | Kidney stones removal | none
1234 | 243 | 05-Apr-94 | Eye Cataract removal | Tetracycline
1234 | 467 | 10-May-95 | Thrombosis removal | none
2345 | 189 | 08-Jan-96 | Open Heart Surgery | Cephalosporin
4876 | 145 | 05-Nov-95 | Cholecystectomy | Demicillin
5123 | 145 | 10-May-95 | Gallstones Removal | none
6845 | 243 | 15-Dec-84 | Eye cataract removal | none
6845 | 243 | 05-Apr-94 | Eye Cornea Replacement | Tetracycline

Third Normal Form

Drug Admin | Side Effects
Cephalosporin | none
Demicillin | none
none | none
Penicillin | rash
Tetracycline | Fever
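One way to see that the third-normal-form split loses no information is to join the two tables back together. The sketch below uses Python's standard `sqlite3` module as a stand-in database; the table names, column names, and the choice of (patient, surgeon, date) as the key are assumptions for illustration, and only three of the surgery rows are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE drug (
    drug_admin   TEXT PRIMARY KEY,
    side_effects TEXT
);
CREATE TABLE surgery (
    patient_no   INTEGER,
    surgeon_no   INTEGER,
    surgery_date TEXT,
    surgery      TEXT,
    drug_admin   TEXT REFERENCES drug(drug_admin),
    PRIMARY KEY (patient_no, surgeon_no, surgery_date)
);
""")
conn.executemany("INSERT INTO drug VALUES (?, ?)", [
    ("Cephalosporin", "none"), ("Demicillin", "none"), ("none", "none"),
    ("Penicillin", "rash"), ("Tetracycline", "Fever"),
])
conn.executemany("INSERT INTO surgery VALUES (?, ?, ?, ?, ?)", [
    (1111, 145, "01-Jan-95", "Gallstones removal", "Penicillin"),
    (1111, 311, "12-Jun-95", "Kidney stones removal", "none"),
    (6845, 243, "05-Apr-94", "Eye Cornea Replacement", "Tetracycline"),
])

# Joining on the drug name restores the side-effects column of the 2NF table.
rejoined = conn.execute("""
    SELECT s.patient_no, s.surgeon_no, s.surgery_date, s.surgery,
           s.drug_admin, d.side_effects
    FROM surgery AS s JOIN drug AS d ON d.drug_admin = s.drug_admin
    ORDER BY s.patient_no, s.surgeon_no
""").fetchall()
for row in rejoined:
    print(row)
```

Each side effect is now stored once per drug rather than once per surgery, so correcting it means updating a single row.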

Most 3NF Relations are also BCNF

Patient # | Patient Name | Patient Address
1111 | John White | 15 New St. New York, NY
1234 | Mary Jones | 10 Main St. Rye, NY
2345 | Charles Brown | Dogwood Lane Harrison, NY
4876 | Hal Kane | 55 Boston Post Road, Chester,
5123 | Paul Kosher | Blind Brook Mamaroneck, NY
6845 | Ann Hood | Hilton Road Larchmont, NY

ER Diagram Symbols

• Entity: Rectangles are used to indicate entities (that is, the representatives or records describing persons, things, or events in the database).

• Attribute: Ovals are used to indicate the attributes associated with an entity or relationship (that is, the pieces of information recorded in the database about the entity or relationship).

• Primary key: An underlined name indicates that the attribute is a primary key (that is, it can uniquely identify the entity).

• Relationship: Diamonds are used to indicate relationships between entities (that is, some association between the data records of different entities).

Today: New Design

• Today we will build the COOKIE database from needs (rough) through the conceptual model, logical model and finally physical implementation in Access.

Cookie Requirements

• Cookie is a bibliographic database that contains information about a hypothetical union catalog of several libraries.

• Need to record which books are held by which libraries

• Need to search on bibliographic information

– Author, title, subject, call number for a given library, etc.

• Need to know who publishes the books for ordering, etc.

Cookie Database

• There are currently 5 main types of entities in the database, plus one file that links two of them

– Books (bibfile)

– Local Call numbers (callfile)

– Libraries (libfile)

– Publishers (pubfile)

– Subject headings (subfile)

– Links between subject and books (indxfile)

BIBFILE

• Books (BIBFILE) contains information about particular books. It includes one record for each book. The attributes are:

– accno -- an “accession” or serial number

– author -- The author’s name (not realistic -- one author per book)

– title -- The title of the book

– loc -- Location of publication (where published)

– date -- Date of publication

– price -- Price of the book

– pagination -- Number of pages

– ill -- What type of illustrations (maps, etc) if any

– height -- Height of the book in centimeters

Books/BIBFILE

[Figure: Books entity with attributes accno, Author, Title, Loc, Date, Price, Pagination, Ill, Height.]

CALLFILE

• CALLFILE contains call numbers and holdings information linking particular books with particular libraries. Its attributes are:

– accno -- the book accession number

– libid -- the id of the holding library

– callno -- the call number of the book in the particular library

– copies -- the number of copies held by the particular library

LocalInfo/CALLFILE

[Figure: CALLFILE entity with attributes accno, libid, Callno, Copies.]

LIBFILE

• LIBFILE contains information about the libraries participating in this union catalog. Its attributes include:

– libid -- Library id number

– library -- Name of the library

– laddress -- Street address for the library

– lcity -- City name

– lstate -- State code (postal abbreviation)

– lzip -- zip code

– lphone -- Phone number

– mop - suncl -- Library opening and closing times for each day of the week.

Libraries/LIBFILE

[Figure: LIBFILE entity with attributes Libid, Library, laddress, lcity, lstate, lzip, lphone, and the opening and closing times SunOp, Suncl, MOp, Mcl, TuOp, TuCl, WOp, WCl, ThOp, ThCl, FOp, FCl, SatOp, SatCl.]

PUBFILE

• PUBFILE contains information about the publishers of books. Its attributes include:

– pubid -- The publisher’s id number

– publisher -- Publisher name

– paddress -- Publisher street address

– pcity -- Publisher city

– pstate -- Publisher state

– pzip -- Publisher zip code

– pphone -- Publisher phone number

– ship -- standard shipping time in days

Publisher/PUBFILE

[Figure: PUBFILE entity with attributes pubid, Publisher, paddress, pcity, pstate, pzip, pphone, Ship.]

SUBFILE

• SUBFILE contains each unique subject heading that can be assigned to books. Its attributes are:

– subcode -- Subject identification number

– subject -- the subject heading/description

Subjects/SUBFILE

[Figure: SUBFILE entity with attributes subid, Subject.]

INDXFILE

• INDXFILE provides a way to allow many-to-many mapping of subject headings to books. Its attributes consist entirely of links to other tables:

– subcode -- link to subject id

– accno -- link to book accession number
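The many-to-many mapping described above can be tried out with Python's standard `sqlite3` module. This is a sketch under assumptions, not the course's Access implementation: only the columns needed for the link are created, and the accession numbers, subject codes, and subject assignments are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bibfile  (accno INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE subfile  (subcode INTEGER PRIMARY KEY, subject TEXT);
-- Each INDXFILE row pairs one subject with one book.
CREATE TABLE indxfile (
    subcode INTEGER REFERENCES subfile(subcode),
    accno   INTEGER REFERENCES bibfile(accno),
    PRIMARY KEY (subcode, accno)
);
""")
# The accno and subcode values and the subject assignments are invented.
conn.executemany("INSERT INTO bibfile VALUES (?, ?)", [
    (1, "Microcosmographia Academica"),
    (2, "The Aims of Education and Other Essays"),
])
conn.executemany("INSERT INTO subfile VALUES (?, ?)", [
    (10, "Higher education"),
    (20, "Satire"),
])
conn.executemany("INSERT INTO indxfile VALUES (?, ?)", [
    (10, 1), (20, 1),   # one book carries two subjects
    (10, 2),            # one subject covers two books
])

# Follow the links from a subject heading through INDXFILE to the books.
titles = [row[0] for row in conn.execute("""
    SELECT b.title
    FROM subfile AS s
    JOIN indxfile AS i ON i.subcode = s.subcode
    JOIN bibfile  AS b ON b.accno = i.accno
    WHERE s.subject = 'Higher education'
    ORDER BY b.accno
""")]
print(titles)
```

Because INDXFILE holds only the two linking columns, adding a subject to a book is a single insert and never touches BIBFILE or SUBFILE.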

Linking Subjects and Books

[Figure: INDXFILE entity with attributes accno, subid.]

Some examples of Cookie Searches

• Who wrote Microcosmographia Academica?

• How many pages long is Alfred Whitehead’s The Aims of Education and Other Essays?

• Which branches in Berkeley’s public library system are open on Sunday?

• What is the call number of Moffitt Library’s copy of Abraham Flexner’s book Universities: American, English, German?

• What books on the subject of higher education are among the holdings of Berkeley (both UC and City) libraries?

• Print a list of the Mechanics Library holdings, in descending order by height.

• What would it cost to replace every copy of each book that contains illustrations (including graphs, maps, portraits, etc.)?

• Which library closes earliest on Friday night?
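Three of the searches above can be sketched as queries using Python's standard `sqlite3` module. Everything here is illustrative: the placeholder titles, authors, prices, and holdings are invented sample data, the column types are guesses, and storing "no illustrations" as NULL is an assumption about how the `ill` field would be filled in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bibfile  (accno INTEGER PRIMARY KEY, author TEXT, title TEXT,
                       price REAL, pagination INTEGER, ill TEXT, height INTEGER);
CREATE TABLE callfile (accno INTEGER, libid INTEGER, callno TEXT, copies INTEGER,
                       PRIMARY KEY (accno, libid));
""")
# All values below are invented sample data.
conn.executemany("INSERT INTO bibfile VALUES (?, ?, ?, ?, ?, ?, ?)", [
    (1, "Author A", "Book One", 10.0, 120, "maps", 22),
    (2, "Author B", "Book Two", 25.0, 300, None, 24),
    (3, "Author C", "Book Three", 8.0, 90, "ports.", 19),
])
conn.executemany("INSERT INTO callfile VALUES (?, ?, ?, ?)", [
    (1, 1, "AA 1", 2), (1, 2, "AA 1.5", 1), (2, 1, "BB 2", 4), (3, 2, "CC 3", 3),
])

# "Who wrote ...?" is a lookup on a single table.
author = conn.execute(
    "SELECT author FROM bibfile WHERE title = ?", ("Book Two",)).fetchone()[0]

# "Print a list of one library's holdings, in descending order by height."
holdings = conn.execute("""
    SELECT b.title, b.height FROM bibfile AS b
    JOIN callfile AS c ON c.accno = b.accno
    WHERE c.libid = ? ORDER BY b.height DESC
""", (1,)).fetchall()

# "What would it cost to replace every copy of each book that contains illustrations?"
cost = conn.execute("""
    SELECT SUM(b.price * c.copies) FROM bibfile AS b
    JOIN callfile AS c ON c.accno = b.accno
    WHERE b.ill IS NOT NULL
""").fetchone()[0]
print(author, holdings, cost)
```

The first search needs only BIBFILE, while the other two have to join BIBFILE to CALLFILE on `accno`, since copies and holding libraries are recorded per library rather than per book.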

Cookie ER diagram

[Figure: ER diagram of the Cookie database. Entities: BIBFILE, CALLFILE, LIBFILE, PUBFILE, SUBFILE, INDXFILE. Relationships: publishes, has call, has copy, has index, has subject. Linking attributes: pubid, accno, libid, subcode.]

Note: diagram contains only attributes used for linking

What Problems?

• What sorts of problems and missing features arise given the previous ER diagram?

Problems Identified

• Field sizes inappropriate

• Author doesn’t allow multiple authors (editors, etc).

• Subtitles, parallel titles

• Edition information

• Series information

• Lending status

• Material type designation

• Genre, class information

• Better codes (ISBN?)

• Missing information (ISBN)

• Authority control for authors

• Missing/incomplete data

• Data entry problems

• Ordering information

• Illustrations

• Subfield separation (such as last_name, first_name)

• Separate personal and corporate authors

Problems (Cont.)

• Location field inconsistent

• No notes field

• No language field

• Zipcode doesn’t support plus-4

• No publisher shipping addresses

• No (indexable) keyword search capability

• No support for multivolume works

• No support for URLs

– to online version

– to libraries

– to publishers

Original Cookie ER diagram

[Figure: ER diagram of the original Cookie database. Entities: BIBFILE, CALLFILE, LIBFILE, PUBFILE, SUBFILE, INDXFILE. Relationships: publishes, has call, has copy, has index, has subject. Attributes shown: pubid, accno, libid, Library, Address, etc, Callno, subid, subject.]

Cookie2: Separate Name Authorities

[Figure: the Cookie ER diagram (BIBFILE, CALLFILE, LIBFILE, PUBFILE, SUBFILE, INDXFILE) extended with two new entities, AUTHFILE and AUTHBIB. New attributes shown: nameid, name, accno, authtype.]

Cookie3: Keywords

[Figure: the Cookie2 ER diagram extended with two new entities, KEYMAP and TERMS. New attributes shown: accno, termid.]

Cookie 4: Series

[Figure: the Cookie3 ER diagram extended with a new SERIES entity. New attributes shown: seriesid, ser_title.]

Cookie 5: Circulation

[Figure: the Cookie 4 ER diagram extended with two new entities, CIRC and PATRON. New attributes shown: circid, copynum, patronid.]

Mapping to Relations

• Take each entity

– BIBFILE

– LIBFILE

– CALLFILE

– SUBFILE

– PUBFILE

– INDXFILE

• And make it a table...
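As a rough illustration of this mapping, the six entities can be declared as tables with Python's standard `sqlite3` module. SQLite stands in for Access here, which is an assumption of this sketch rather than what the lecture uses; the column types, the key choices, and the placement of `pubid` in BIBFILE (suggested by the ER diagram) are also assumptions, and the library opening-hours columns are left out for brevity.

```python
import sqlite3

# SQLite stands in for Access here; that substitution is an assumption of this sketch.
SCHEMA = """
CREATE TABLE pubfile (pubid INTEGER PRIMARY KEY, publisher TEXT, paddress TEXT,
                      pcity TEXT, pstate TEXT, pzip TEXT, pphone TEXT, ship INTEGER);
CREATE TABLE libfile (libid INTEGER PRIMARY KEY, library TEXT, laddress TEXT,
                      lcity TEXT, lstate TEXT, lzip TEXT, lphone TEXT);
CREATE TABLE subfile (subcode INTEGER PRIMARY KEY, subject TEXT);
CREATE TABLE bibfile (accno INTEGER PRIMARY KEY, author TEXT, title TEXT, loc TEXT,
                      "date" TEXT, price REAL, pagination TEXT, ill TEXT,
                      height INTEGER,
                      pubid INTEGER REFERENCES pubfile(pubid));
CREATE TABLE callfile (accno INTEGER REFERENCES bibfile(accno),
                       libid INTEGER REFERENCES libfile(libid),
                       callno TEXT, copies INTEGER,
                       PRIMARY KEY (accno, libid));
CREATE TABLE indxfile (subcode INTEGER REFERENCES subfile(subcode),
                       accno INTEGER REFERENCES bibfile(accno),
                       PRIMARY KEY (subcode, accno));
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# List the tables that were created, one per entity.
tables = [name for (name,) in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```

The two linking entities, CALLFILE and INDXFILE, get composite primary keys built from the identifiers of the tables they connect.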

Implementing the Physical Database...

• For each of the entities, we will build a table…

• Start up Access…

• Use “New” in Tables…

• Loading data

• Entering data

• Data entry forms