Effective design and analysis of bioinformation
Unit 3
BIOL221T: Advanced Bioinformatics for Biotechnology
Irene Gabashvili, PhD
Course availability
Lectures & Lab: every Wednesday, Duncan Hall, Room 550, 6:00 pm to 9:45 pm
Office hours: Wednesday, 4pm-6pm (Room 554, phone: 92404831) and by appointment
Lecture notes will be posted at: http://home.comcast.net/~igabashvili/221T.htm
Or the SJSU page. The user name is “ewok\biostudents” and the password is “4biolecture” (don’t enter the quotation marks).
Consumer genomics gets crowded
http://www.seqwright.com/ (SOLiD, ABI)
http://www.decodeme.com/ (Illumina)
https://www.23andme.com/ (Illumina)
http://www.navigenics.com/ (Affymetrix)
http://www.knome.com/ (ABI, Amersham, Illumina)
In the News
https://www.23andme.com/experts/letters/science/
Phenotype: Genes/Regions
Breast Cancer: FGFR2, 16q12 region
Crohn's Disease: NOD2(1), NOD2(2), NOD2(3), ATG16L1, IL23R, NKX2-3, 5p13 region, PTPN2
Heart Attack: 9p21 region, MTHFD1L
Multiple Sclerosis: IL7RA, HLA-DRB1
Obesity: FTO
Prostate Cancer: 8q24(1), 8q24(2), 8q24(3)
Restless Legs Syndrome: BTBD9
Rheumatoid Arthritis: HLA region, PADI4, PTPN22, MMEL1, 6q23 region, TRAF1/C5
Type 1 Diabetes: HLA region, CTLA4, IFIH1, INS, PTPN2, PTPN22, SH2B3, KIAA0350
Type 2 Diabetes: TCF7L2, PPARG, KCNJ11, IGF2BP2, HHEX, CDKAL1, SLC30A8, WFS1, CDKN2A/B
Venous Thromboembolism: F5, F2
Alcohol Flush Reaction: ALDH2
Bitter Taste Perception: TAS2R38
Earwax Type: ABCC11
Lactose Intolerance: LCT
Muscle Performance: ACTN3
List from DeCODE genetics
Our current list of diseases includes: Age-related Macular Degeneration, Asthma, Alzheimer's Disease, Atrial Fibrillation, Breast Cancer, Celiac Disease, Colorectal Cancer, Exfoliation Glaucoma XFG, Crohn's Disease, Multiple Sclerosis, Myocardial Infarction, Obesity, Prostate Cancer, Psoriasis, Restless Legs, Rheumatoid Arthritis, Type 1 Diabetes and Type 2 Diabetes.
Three important sub-disciplines within bioinformatics
the development of new algorithms and statistics with which to assess relationships among members of large data sets
the analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures
the development and implementation of tools that enable efficient access and management of different types of biological information.
Main tasks of biomedical informatics
Storage, Analysis, Visualization and Management of biomedical data
Mining for new knowledge, hypothesis formulation and testing
Development of tools and resources for the above
Brief History of Bioinformatics
1920 - The term genome was introduced by H. Winkler to denote the complete set of chromosomal and extrachromosomal genes
1933 - A new technique, electrophoresis, is introduced by Tiselius for separating proteins in solution.
1951 - Pauling and Corey propose the structure for the alpha-helix and beta-sheet
1953 - Watson & Crick propose the double helix model for DNA (data by Franklin & Wilkins)
1954 - Perutz's group develops methods to solve the phase problem in protein crystallography.
1955 - The sequence of the first protein to be analyzed, bovine insulin, is announced by F. Sanger.
1956 - The first protein sequence reported was that of bovine insulin, consisting of 51 residues
1962 - Pauling's theory of molecular evolution
1965 - M. Dayhoff's Atlas of Protein Sequences
1970 - Needleman-Wunsch algorithm
1972 - The Protein Data Bank
1980 - The first complete gene sequence for an organism (ΦX174): 5,386 bp, nine proteins.
1981 - The Smith-Waterman algorithm
IBM introduces its PC to the market.
The concept of a sequence motif (Doolittle)
1983 - Sequence DB searching (Wilbur-Lipman)
1986 - Human Genome Initiative announcement
1987 - SWISS-PROT protein sequence database
1988 - NCBI created at NIH/NLM (databases)
1988 - FASTA by Pearson and Lipman
EMBL establishes sequence database network
1990 - BLAST by Altschul et al.
2003 - Human Genome Project completion
The data of biomedical informatics
Public & Private Databases store biological data in various formats
Sequences: DNA, RNA, proteins
Structures: X-ray, NMR, microscopy
Expression: microarrays, gels
Interaction: 2 hybrid, mass spec
Metabolism: GC-MS, NMR
Physiology: medical images, PK/PD
Search Engines
AND, OR, NOT
Specifying database fields (Organism, Author)
Order of words: neonatal pre/3 screening (neonatal within 3 words before screening)
Wildcards: wom?n, cat*s
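The wildcard and word-order operators above can be tried out in a few lines of Python. This is only a sketch: it uses the standard-library fnmatch module for ? and *, and pre_within is a hypothetical helper written for this illustration. Real search engines each define these operators in their own way, so the exact meaning of pre/3 here (the second word appears among the next 3 words) is an assumption.

```python
from fnmatch import fnmatchcase


def pre_within(text, first, second, n):
    """Hypothetical helper: True if `second` appears among the n words after `first`."""
    words = text.lower().split()
    for i, word in enumerate(words):
        if word == first and second in words[i + 1 : i + 1 + n]:
            return True
    return False


# '?' stands for exactly one character, '*' for any run of characters (including none)
wildcard_hits = [w for w in ["woman", "women", "womn"] if fnmatchcase(w, "wom?n")]
star_hits = [w for w in ["cats", "catfishes", "cat"] if fnmatchcase(w, "cat*s")]

# Word order matters: 'neonatal' must come before 'screening'
near = pre_within("neonatal hearing screening program", "neonatal", "screening", 3)
far = pre_within("screening of neonatal patients", "neonatal", "screening", 3)
```

Here wom?n matches woman and women but not womn, because ? must stand for one character.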
Search & Download
Entrez: integrated, text-based search and retrieval system for PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, Taxonomy, etc. + batch download
http://www.ncbi.nlm.nih.gov/sites/batchentrez
Query syntax: term [field] OPERATOR term [field]
Example: 1:10[ESTC] AND Homo sapiens[ORGN] AND deafness[dis] (BSND: Bartter syndrome, infantile, with sensorineural deafness)
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=unigene
More on the course’s website
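The "term [field] OPERATOR term [field]" pattern can be assembled programmatically. The sketch below only builds the query string from the example on this slide; entrez_query is a hypothetical helper name, and no request is sent to NCBI.

```python
from urllib.parse import quote_plus


def entrez_query(clauses, operator="AND"):
    """Join (term, field) pairs into 'term[field] OPERATOR term[field] ...'."""
    return f" {operator} ".join(f"{term}[{field}]" for term, field in clauses)


query = entrez_query(
    [("1:10", "ESTC"), ("Homo sapiens", "ORGN"), ("deafness", "dis")]
)
# URL-safe form, for pasting into the query part of a search URL
encoded = quote_plus(query)
```

Keeping the terms as (term, field) pairs makes it easy to swap the operator or add a clause without retyping the brackets.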
DATA FORMATS AND DATA INTEGRATION
It is widely recognized that successful data integration is one of the keys to improved productivity in biopharmaceutical R&D
Success in most bioinformatics-related activities, from functional characterization of genomic sequences to prioritization of drug targets, requires an integrated view of all relevant data in a drug discovery R&D program
Bioinformatics data sources often have large, complex data structures, reflecting the richness of the scientific concepts they model. Many bioinformatics data sources cover similar domains, such as genes, proteins, sequence annotations or microarray results.
Database design links
http://www.devx.com/ibm/Article/20702
http://www.campus.ncl.ac.uk/databases/design/
http://www.dbazine.com/mullins_datamodel.shtml
http://www.extropia.com/tutorials/sql/toc.html
http://www.surfermall.com/relational/lesson_1.htm
Database: Definition
A collection of data that:
  is organized
  usually computer-based
  represents repetitive information implicitly
  supports retrieval
A set of rules to manipulate data
A method to mold information into knowledge
Database: applications
Who uses computerized databases:
  Stores, to keep track of inventory
  Hospitals, to keep track of patient info
  Travel agents, to keep up with their customers and reservations
  Biologists, to efficiently manage and manipulate their data
DATA → INFORMATION → KNOWLEDGE
Paper Database as Expert System
HISTORY
1960's: Two main data models are developed: the network model (CODASYL) and the hierarchical model (IMS). A user would need to know the physical structure of the database in order to query for information. SABRE (IBM/AA).
1970-72: E.F. Codd proposed the relational model. He disconnects the schema (logical organization) of a database from the physical storage methods.
1970's:
  Ingres (UCB), which led to Ingres Corp., Sybase, MS SQL Server, Britton-Lee, Wang's PACE. This system used QUEL as query language.
  System R (IBM), which led to IBM's SQL/DS & DB2, Oracle, HP's Allbase, Tandem's Non-Stop SQL. This system used SEQUEL as query language.
  The term Relational Database Management System (RDBMS) is coined
1976: P. Chen proposed the Entity-Relationship (ER) model for database design
Early 1980's: Commercialization of relational systems begins as a boom…
Mid-1980's: SQL (Structured Query Language) becomes "intergalactic standard". DB2 becomes IBM's flagship product. Network and hierarchical models fade into the background
Early 1990's: Application and personal productivity tool development: PowerBuilder (Sybase), Oracle Developer, VB (Microsoft), Excel/Access (MS) and ODBC. First Object Database Management Systems (ODBMS) prototypes.
Mid-1990's: Internet/WWW. Web/DB grows exponentially, usable for average users
Late-1990's: Boom for Web/Internet/DB connectors. Open source solutions with widespread use of gcc, cgi, Apache, MySQL, etc. Online transaction processing (OLTP) and online analytic processing (OLAP) come of age
Early 21st century: Burst of the dot-com bubble, but solid growth of DB applications. PDAs, POS transactions, IBM, Microsoft, Oracle.
FUTURE
Terabyte and Petabyte databases of everything
Mobile databases
Semantic Web
Object Oriented Everything, including databases
Object Database Management Group (ODMG) standards are proposed and accepted
Security issues
Database: advantages
Advantages of a database program:
  Can find a specific file quickly
  Can easily add records
  Can alphabetize and sort data faster than most people
  Is as accurate as the data that is entered
  Can make many different types of reports
  Is invaluable for large amounts of data
Database: Parts
Parts of a relational database:
  Fields = categories of information
  Entry = data in a field
  Record = all of the information about one item (row)
  File = document of all of the records
  To sort: pick a field, ascending or descending (Excel, Works)
Database types
Flat (spreadsheet)
Hierarchical
Network (two fundamental constructs, called records and sets)
Relational
Relational Databases
Relational databases started to get to be a big deal in the 1970's, and they're still a big deal today, which is a little peculiar, because they're a 1960's technology.
A relational database is a bunch of rectangular tables. Each row of a table is a record about one person or thing; the record contains several pieces of information called fields.
Entities and Relationships
Entities: things we store information about
Relationships: links between the entities
  Many-to-many
  One-to-one
  One-to-many
  …
A Table is a Relation
EMPLOYEE ID | NAME          | JOB                   | DEPARTMENT ID | CUBE
7513        | Nora Edwards  | Programmer            | 128           | 1
9842        | Ben Smith     | DBA                   | 42            | 15
6651        | Ajay Patel    | Programmer            | 128           | 2
9006        | Candy Burnett | Systems Administrator | 128           | 3
Columns, Fields, Attributes; Rows, Records, Tuples, Entities.
Records of data, composed of fields, stored in tables
Keys and Functional Dependencies
Key field (superkey, key): a field that uniquely identifies a record
If there is a functional dependency between column A and column B in a given table (A → B), then the value of column A determines the value of column B, e.g. employeeID → name.
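A functional dependency can be checked against sample rows with a short Python function. This is a minimal sketch; holds is a hypothetical helper name, and the rows are taken from the employee table on the previous slide. Note that data can only refute a dependency: a dependency that holds in the sample is a design decision, not something the sample proves.

```python
def holds(rows, lhs, rhs):
    """True if no two rows agree on column `lhs` but differ on column `rhs`."""
    seen = {}
    for row in rows:
        key = row[lhs]
        if key in seen and seen[key] != row[rhs]:
            return False  # same lhs value, different rhs value
        seen[key] = row[rhs]
    return True


employees = [
    {"employeeID": 7513, "name": "Nora Edwards", "job": "Programmer"},
    {"employeeID": 9842, "name": "Ben Smith", "job": "DBA"},
    {"employeeID": 6651, "name": "Ajay Patel", "job": "Programmer"},
]

id_determines_name = holds(employees, "employeeID", "name")
job_determines_name = holds(employees, "job", "name")
```

The second check fails because two different people share the job Programmer, so job cannot serve as a key.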
Schema
Database schema is the structure or design of the database, a blueprint for the data in the database.
employee(employeeID, name, job, cube, departmentID)
What information needs to be stored? (things or entities)
What questions will we ask of the database? (queries)
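The employee schema above can be written down and queried with Python's built-in sqlite3 module. This is a sketch under assumptions: the column types and the choice of employeeID as primary key are not stated on the slide, and the rows are the ones from the "A Table is a Relation" slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    """
    CREATE TABLE employee (
        employeeID   INTEGER PRIMARY KEY,  -- key: uniquely identifies a row
        name         TEXT NOT NULL,
        job          TEXT,
        cube         INTEGER,
        departmentID INTEGER
    )
    """
)
conn.executemany(
    "INSERT INTO employee (employeeID, name, job, departmentID, cube)"
    " VALUES (?, ?, ?, ?, ?)",
    [
        (7513, "Nora Edwards", "Programmer", 128, 1),
        (9842, "Ben Smith", "DBA", 42, 15),
        (6651, "Ajay Patel", "Programmer", 128, 2),
        (9006, "Candy Burnett", "Systems Administrator", 128, 3),
    ],
)

# A question we ask of the database: who are the programmers?
programmers = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM employee WHERE job = ? ORDER BY employeeID",
        ("Programmer",),
    )
]
```

The schema answers "what is stored" and the SELECT statement answers "what do we ask", which are the two design questions on the slide.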
Flawed schemas
EMPLOYEE ID | NAME          | JOB                   | DEPART. ID | DEPART. NAME
7513        | Nora Edwards  | Programmer            | 128        | R&D
9842        | Ben Smith     | DBA                   | 42         | Finance
6651        | Ajay Patel    | Programmer            | 128        | R&D
9006        | Candy Burnett | Systems Administrator | 128        | R&D
This schema design leads to redundancies: the department name is repeated for every employee in the department.
Splitting the table into two relations removes the redundancy:
Employee(employee ID, name, job, department ID)
Department(department ID, department name)
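The two-relation design can be sketched with sqlite3 to show what the split buys. Table and column names follow the slide; the column types and the foreign-key declaration are assumptions added for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE department (
        departmentID   INTEGER PRIMARY KEY,
        departmentName TEXT NOT NULL
    );
    CREATE TABLE employee (
        employeeID   INTEGER PRIMARY KEY,
        name         TEXT NOT NULL,
        job          TEXT,
        departmentID INTEGER REFERENCES department(departmentID)
    );
    INSERT INTO department VALUES (128, 'R&D'), (42, 'Finance');
    INSERT INTO employee VALUES
        (7513, 'Nora Edwards', 'Programmer', 128),
        (9842, 'Ben Smith', 'DBA', 42),
        (6651, 'Ajay Patel', 'Programmer', 128),
        (9006, 'Candy Burnett', 'Systems Administrator', 128);
    """
)

# Each department name is stored once, so a rename touches exactly one row.
renamed = conn.execute(
    "UPDATE department SET departmentName = 'Emerging Tech' WHERE departmentID = 128"
).rowcount

# The wide table is still available on demand through a join.
view = conn.execute(
    """
    SELECT e.employeeID, e.name, d.departmentName
    FROM employee AS e JOIN department AS d USING (departmentID)
    ORDER BY e.employeeID
    """
).fetchall()
```

After the rename, every employee of department 128 shows the new name in the joined view, with no chance of one row being missed.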
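Flawed schemas: Insertion Anomaly
EMPLOYEE ID | NAME          | JOB                   | DEPART. ID | DEPART. NAME
7513        | Nora Edwards  | Programmer            | 128        | R&D
9842        | Ben Smith     | DBA                   | 42         | Finance
6651        | Ajay Patel    | Programmer            | 128        | R&D
9006        | Candy Burnett | Systems Administrator | 128        | R&D
9901        | Steve Smith   | Engineer              | 42         | Development
Insertion anomaly: the new row (9901) records department 42 as Development, contradicting the existing row that records it as Finance.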
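Deletion Anomaly
EMPLOYEE ID | NAME          | JOB                   | DEPART. ID | DEPART. NAME
7513        | Nora Edwards  | Programmer            | 128        | R&D
6651        | Ajay Patel    | Programmer            | 128        | R&D
9006        | Candy Burnett | Systems Administrator | 128        | R&D
Deletion anomaly: deleting Ben Smith's row also deletes the only record that department 42 is Finance.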
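Update Anomaly
EMPLOYEE ID | NAME          | JOB                   | DEPART. ID | DEPART. NAME
7513        | Nora Edwards  | Programmer            | 128        | Emerging Tech
9842        | Ben Smith     | DBA                   | 42         | Finance
6651        | Ajay Patel    | Programmer            | 128        | Emerging Tech
9006        | Candy Burnett | Systems Administrator | 128        | R&D
Update anomaly: department 128 was renamed in two rows but not in the third, so the table now gives it two names.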
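An update anomaly like the one above can be detected with a grouping query. The sketch below loads the inconsistent table into sqlite3 and lists departments that carry more than one name; the table name employee_flat is made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee_flat (employeeID INTEGER, name TEXT, job TEXT,"
    " departmentID INTEGER, departmentName TEXT)"
)
conn.executemany(
    "INSERT INTO employee_flat VALUES (?, ?, ?, ?, ?)",
    [
        (7513, "Nora Edwards", "Programmer", 128, "Emerging Tech"),
        (9842, "Ben Smith", "DBA", 42, "Finance"),
        (6651, "Ajay Patel", "Programmer", 128, "Emerging Tech"),
        (9006, "Candy Burnett", "Systems Administrator", 128, "R&D"),
    ],
)

# Departments whose ID maps to more than one name violate departmentID -> departmentName
conflicts = conn.execute(
    """
    SELECT departmentID, COUNT(DISTINCT departmentName)
    FROM employee_flat
    GROUP BY departmentID
    HAVING COUNT(DISTINCT departmentName) > 1
    """
).fetchall()
```

A check like this finds the damage after the fact; a schema that stores each department name once prevents it.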
Avoid Null Values
ID     | Engineering Skills                              | Astronaut’s Skills
001001 | Solidworks                                      | NULL
001002 | Implantable Medical Devices                     | NULL
001003 | FDA QS regulations                              | NULL
001004 | Design Controls, FDA ISO, QC regulations        | NULL
001005 | Digital Circuits: logic devices, state machines | NULL
001006 | CAD/CAM programming                             | Walking on the moon
001007 | HDLs: VHDL and Verilog                          | NULL
001008 | Microcontrollers: ARM, H8, 8051, PIC            | NULL
001009 | Medical manufacturing, 3D CAD                   | NULL
Normalization
EMPLOYEE ID | NAME          | JOB                   | DEPART. ID | SKILLS
7513        | Nora Edwards  | Programmer            | 128        | C, Perl, Java
9842        | Ben Smith     | DBA                   | 42         | DB2
6651        | Ajay Patel    | Programmer            | 128        | VB, Java
9006        | Candy Burnett | Systems Administrator | 128        | NT, Linux
Unnormalized table: lists instead of atomic values. This violates the rules of first normal form.
Normalization: first normal form
EMPLOYEE ID | NAME          | JOB                   | DEPART. ID | SKILLS
7513        | Nora Edwards  | Programmer            | 128        | C
7513        | Nora Edwards  | Programmer            | 128        | Perl
7513        | Nora Edwards  | Programmer            | 128        | Java
9842        | Ben Smith     | DBA                   | 42         | DB2
6651        | Ajay Patel    | Programmer            | 128        | VB
6651        | Ajay Patel    | Programmer            | 128        | Java
9006        | Candy Burnett | Systems Administrator | 128        | NT
9006        | Candy Burnett | Systems Administrator | 128        | Linux
This schema is in first normal form, 1NF
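The step from the unnormalized table to 1NF is mechanical: each comma-separated skill list is split so that every row carries one atomic value. A minimal Python sketch, using the rows from the slides:

```python
unnormalized = [
    (7513, "Nora Edwards", "Programmer", 128, "C, Perl, Java"),
    (9842, "Ben Smith", "DBA", 42, "DB2"),
    (6651, "Ajay Patel", "Programmer", 128, "VB, Java"),
    (9006, "Candy Burnett", "Systems Administrator", 128, "NT, Linux"),
]

# One output row per (employee, skill) pair; the other columns are repeated.
first_nf = [
    (emp_id, name, job, dept, skill.strip())
    for emp_id, name, job, dept, skills in unnormalized
    for skill in skills.split(",")
]
```

The result has eight rows, one per skill, and the repeated name, job, and department values are the redundancy that the next normal form addresses.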
Second Normal Form, 2NF
EMPL. ID | NAME          | JOB                   | DEP. ID
7513     | Nora Edwards  | Programmer            | 128
9842     | Ben Smith     | DBA                   | 42
6651     | Ajay Patel    | Programmer            | 128
9006     | Candy Burnett | Systems Administrator | 128

EMPL. ID | SKILLS
7513     | C
7513     | Perl
7513     | Java
9842     | DB2
6651     | VB
6651     | Java
9006     | NT
9006     | Linux
2NF: Attributes must depend on the whole key
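In the 1NF table the key is the pair (employee ID, skill), but name, job, and department depend on the employee ID alone. The sketch below performs the 2NF decomposition in plain Python by projecting the 1NF rows onto two sets of columns, then checks that joining them gives back the original rows. Variable names are chosen for this illustration.

```python
first_nf = [
    (7513, "Nora Edwards", "Programmer", 128, "C"),
    (7513, "Nora Edwards", "Programmer", 128, "Perl"),
    (7513, "Nora Edwards", "Programmer", 128, "Java"),
    (9842, "Ben Smith", "DBA", 42, "DB2"),
    (6651, "Ajay Patel", "Programmer", 128, "VB"),
    (6651, "Ajay Patel", "Programmer", 128, "Java"),
    (9006, "Candy Burnett", "Systems Administrator", 128, "NT"),
    (9006, "Candy Burnett", "Systems Administrator", 128, "Linux"),
]

# Projection 1: attributes that depend only on the employee ID
employee = sorted({(emp_id, name, job, dept) for emp_id, name, job, dept, _ in first_nf})
# Projection 2: the (employee ID, skill) pairs
employee_skill = sorted({(emp_id, skill) for emp_id, _, _, _, skill in first_nf})

# Joining the projections on employee ID reproduces the 1NF table (nothing is lost).
by_id = {row[0]: row for row in employee}
rejoined = sorted(by_id[emp_id] + (skill,) for emp_id, skill in employee_skill)
```

Each employee's details now appear once, and the skills table holds only the key pair.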
3NF and BCNF (Boyce-Codd)
EMPL. ID | NAME          | JOB                   | DEP. ID
7513     | Nora Edwards  | Programmer            | 128
9842     | Ben Smith     | DBA                   | 42
6651     | Ajay Patel    | Programmer            | 128
9006     | Candy Burnett | Systems Administrator | 128

EMPL. ID | SKILLS
7513     | C
7513     | Perl
7513     | Java
9842     | DB2
6651     | VB
6651     | Java
9006     | NT
9006     | Linux
3NF: Attributes must depend on nothing but the key
BCNF: all the functional dependencies must have a superkey on the left side
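The BCNF condition can be tested with the attribute-closure algorithm. The sketch below is an illustration under assumptions: the function names are hypothetical, the dependencies are the ones implied by the earlier department-name example (employeeID determines name, job, and departmentID; departmentID determines departmentName), and only the listed dependencies are checked rather than every dependency they imply.

```python
def closure(attrs, fds):
    """All attributes functionally determined by `attrs` under the dependencies `fds`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def is_bcnf(attributes, fds):
    """True if every listed non-trivial dependency has a superkey on its left side."""
    return all(rhs <= lhs or closure(lhs, fds) >= attributes for lhs, rhs in fds)


flat = frozenset({"employeeID", "name", "job", "departmentID", "departmentName"})
flat_fds = [
    (frozenset({"employeeID"}), frozenset({"name", "job", "departmentID"})),
    (frozenset({"departmentID"}), frozenset({"departmentName"})),
]

employee = frozenset({"employeeID", "name", "job", "departmentID"})
department = frozenset({"departmentID", "departmentName"})

flat_ok = is_bcnf(flat, flat_fds)                  # departmentID is not a superkey
employee_ok = is_bcnf(employee, flat_fds[:1])
department_ok = is_bcnf(department, flat_fds[1:])
```

The single wide table fails because departmentID determines departmentName without determining the whole row; each of the two split tables passes.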
Concepts
Entities are things, and relationships are the links between them.
Relations or tables hold a set of data in tabular form.
Columns belonging to tables describe the attributes that each data item possesses.
Rows in tables hold data items with values for each column in a table.
Keys are used to identify a single row.
Functional dependencies identify which attributes determine the values of other attributes.
Schemas are the blueprints for a database.
Design Principles
Minimize redundancy without losing data.
Insertion, deletion, and update anomalies are problems that occur when trying to insert, delete, or update data in a table with a flawed structure.
Avoid designs that will lead to large quantities of null values.
Normalization: summary
Normalization is a formal process for improving database design.
First normal form (1NF) means atomic column or attribute values.
Second normal form (2NF) means that all attributes outside the key must depend on the whole key.
Third normal form (3NF) means no transitive dependencies.
Boyce-Codd normal form (BCNF) means that all attributes must be functionally determined by a superkey.
Hierarchical Databases
1234567
  Sandiego, Carmen
  123 Main Street
  Labs
    Chem7
      Na 136
      K 4.3
    Chem7
      Na 142
      K 3.9
Easy to use
Efficient storage
“Tree walking” is fast
Queries across trees are slow
Flexible
Too flexible: chaos is allowed
Too easy to modify
Difficult to document complex structures
^EMR(1234567)="Sandiego, Carmen"
^EMR(1234567, "Address")="123 Main Street"
^EMR(1234567, "Chem7", "2/2/02", "Na")=136
^EMR(1234567, "Chem7", "2/2/02", "K")=4.3
^EMR(1234567, "Chem7", "2/3/02", "Na")=142
^EMR(1234567, "Chem7", "2/3/02", "K")=3.9
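The ^EMR entries above can be mimicked with nested Python dictionaries to show why tree walking is fast and queries across trees are slow. This is only an analogy, not the hierarchical database itself: a nested dict cannot hold a value and children at the same node, so the patient name is stored under an assumed "name" key.

```python
emr = {
    1234567: {
        "name": "Sandiego, Carmen",  # assumed key; see note above
        "Address": "123 Main Street",
        "Chem7": {
            "2/2/02": {"Na": 136, "K": 4.3},
            "2/3/02": {"Na": 142, "K": 3.9},
        },
    }
}


def walk(node, path=()):
    """Yield (path, value) for every leaf below `node`."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from walk(child, path + (key,))
    else:
        yield path, node


# Tree walking: following one known path is a handful of lookups.
sodium = emr[1234567]["Chem7"]["2/2/02"]["Na"]

# A query across trees has to visit every leaf of every patient.
leaves = list(walk(emr))
low_potassium = sorted(
    {path[0] for path, value in leaves if path[-1] == "K" and value < 4.0}
)
```

Looking up one patient's sodium follows a single path, while finding all patients with low potassium scans the whole structure.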
Hierarchical Chaos
1234567
  Admissions
    Admission 1
      Admit Date: 2/2/02
      Primary DX: CHF
      Other DX
        AODM (Flag: S)
        A Fib (Flag: P)
[Figure: further trees with nodes labeled 1234567, Gyn Clinic, Pap, Dr. Jones, Sandiego, Gyn Visit, Gyn Clinic, Secretary, 305-1000, Service, Ms Smith, Beeper 34, 2 Main St., 8AM-5PM, 305-2500]
Network Databases
Extensible Markup Language (XML) Databases
SGML is a metalanguage
SGML is used to write Document Type Definitions (DTDs) that define languages
HTML is a language with an SGML DTD
  Tags are for formatting/presentation syntax
XML is a proper subset of SGML
  XML defines tags that convey semantics
We could write “Health Markup Language” (“HML”) in XML (if we could agree on the semantics and tags)
Tags may or may not be stored with data
<document>
  <document.id>CXR001</document.id>
  <doc.date>19991101</doc.date>
  <document.type>
    <identifier>P5-00010</identifier>
    <text>Chest X-Ray</text>
  </document.type>
  <patient>
    <patient.id>
      <id.value>1234789</id.value>
    </patient.id>
    <patient.name>
      <family.name>Sandiego</family.name>
      <given.name>Carmen</given.name>
      <suffix>M.D.</suffix>
    </patient.name>
    <patient.dob>19230113</patient.dob>
    <patient.sex value="male"/>
    <inpatient/>
  </patient>
  <document.body>
    <findings>No infiltrate, cardiac shadow not enlarged...</findings>
    <impression>Normal X-ray</impression>
  </document.body>
</document>
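Because XML tags carry the field names with the data, a program can pull values out by tag. The sketch below parses a shortened version of the chest X-ray document with Python's standard xml.etree.ElementTree module; the nesting of the elements is an assumption, and first is a hypothetical helper written for this example.

```python
import xml.etree.ElementTree as ET

xml_text = """
<document>
  <document.id>CXR001</document.id>
  <patient>
    <patient.name>
      <family.name>Sandiego</family.name>
      <given.name>Carmen</given.name>
    </patient.name>
    <patient.dob>19230113</patient.dob>
    <patient.sex value="male"/>
  </patient>
  <document.body>
    <impression>Normal X-ray</impression>
  </document.body>
</document>
"""

root = ET.fromstring(xml_text)


def first(element, tag):
    """Hypothetical helper: first descendant of `element` with the given tag."""
    return next(element.iter(tag))


family = first(root, "family.name").text      # data stored as element text
sex = first(root, "patient.sex").get("value")  # data stored as an attribute
impression = first(root, "impression").text
```

The example also shows the two places XML can keep a value, element text and attributes, which is part of the flexibility (and verbosity) discussed on the next slide.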
Extensible Markup Language (XML) Databases
Strengths
  Flexibility to represent wide range of data
  Data carries its field assignment
  Sparse data handled compactly
  Tags can have platform-specific display
Weaknesses
  Immature database tools
  Verbose
  I/O intensive
  A trade-off of decreased efficiency for increased flexibility; scalability?
Relational Databases - Advantages
Comprehensible
Multiple “views” possible
Easy to modify
New elements don’t “break” programs
Database management systems (DBMS)
  Referential integrity
  “Reorg” for efficiency
  Access control
  Locking for multiple simultaneous use
Relational Databases - Disadvantages
Storage overhead
I/O-intense
Cost