Effective design and analysis of bioinformation Unit 3 BIOL221T: Advanced Bioinformatics for...


Page 1: Effective design and analysis of bioinformation Unit 3 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD igabashvili@yahoo.com.

Effective design and analysis of bioinformation

Unit 3, BIOL221T: Advanced Bioinformatics for Biotechnology

Irene Gabashvili, PhD

Page 2:

Course availability

Lectures & Lab: every Wednesday, Duncan Hall, Room 550, 6:00 pm to 9:45 pm

Office hours: Wednesday, 4pm-6pm (Room 554, phone: 92404831) and by appointment

Lecture notes will be posted at: http://home.comcast.net/~igabashvili/221T.htm

Or the SJSU page. The user name is “ewok\biostudents” (don’t enter the quotation marks) and the password is “4biolecture” (don’t enter the quotation marks).

Page 3:

In the News

Consumer genomics gets crowded

http://www.seqwright.com/ (SOLiD, ABI)
http://www.decodeme.com/ (Illumina)
https://www.23andme.com/ (Illumina)
http://www.navigenics.com/ (Affymetrix)
http://www.knome.com/ (ABI, Amersham, Illumina)

Page 4:

https://www.23andme.com/experts/letters/science/

Phenotype | Genes/Regions
Breast Cancer | FGFR2, 16q12 region
Crohn's Disease | NOD2(1), NOD2(2), NOD2(3), ATG16L1, IL23R, NKX2-3, 5p13 region, PTPN2
Heart Attack | 9p21 region, MTHFD1L
Multiple Sclerosis | IL7RA, HLA-DRB1
Obesity | FTO
Prostate Cancer | 8q24(1), 8q24(2), 8q24(3)
Restless Legs Syndrome | BTBD9
Rheumatoid Arthritis | HLA region, PADI4, PTPN22, MMEL1, 6q23 region, TRAF1/C5
Type 1 Diabetes | HLA region, CTLA4, IFIH1, INS, PTPN2, PTPN22, SH2B3, KIAA0350
Type 2 Diabetes | TCF7L2, PPARG, KCNJ11, IGF2BP2, HHEX, CDKAL1, SLC30A8, WFS1, CDKN2A/B
Venous Thromboembolism | F5, F2
Alcohol Flush Reaction | ALDH2
Bitter Taste Perception | TAS2R38
Earwax Type | ABCC11
Lactose Intolerance | LCT
Muscle Performance | ACTN3

Page 5:

List from deCODE genetics

Our current list of diseases includes: Age-related Macular Degeneration, Asthma, Alzheimer's Disease, Atrial Fibrillation, Breast Cancer, Celiac Disease, Colorectal Cancer, Exfoliation Glaucoma XFG, Crohn's Disease, Multiple Sclerosis, Myocardial Infarction, Obesity, Prostate Cancer, Psoriasis, Restless Legs, Rheumatoid Arthritis, Type 1 Diabetes and Type 2 Diabetes.

Page 6:

Three important sub-disciplines within bioinformatics

the development of new algorithms and statistics with which to assess relationships among members of large data sets

the analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures

the development and implementation of tools that enable efficient access and management of different types of biological information.

Page 7:

Main tasks of biomedical informatics

Storage, Analysis, Visualization and Management of biomedical data

Mining for new knowledge, hypothesis formulation and testing

Development of tools and resources for the above

Page 8:

Brief History of Bioinformatics

1920 - The term "genome" is introduced by H. Winkler to denote the complete set of chromosomal and extrachromosomal genes

1933 - A new technique, electrophoresis, is introduced by Tiselius for separating proteins in solution.

1951 - Pauling and Corey propose the structure for the alpha-helix and beta-sheet

Page 9:

Brief History of Bioinformatics

1953 - Watson & Crick propose the double helix model for DNA (data by Franklin & Wilkins)

1954 - Perutz's group develops methods to solve the phase problem in protein crystallography.

1955 - The sequence of the first protein to be analyzed, bovine insulin, is announced by F. Sanger.

1956 - The first protein sequence reported was that of bovine insulin, consisting of 51 residues

Page 10:

Brief History of Bioinformatics

1962 - Pauling's theory of molecular evolution
1965 - M. Dayhoff’s Atlas of Protein Sequences
1970 - Needleman-Wunsch algorithm
1972 - The Protein Data Bank
1980 - The first complete gene sequence for an organism (ΦX174): 5,386 bp, nine proteins.
1981 - The Smith-Waterman algorithm. IBM introduces its PC to the market. The concept of a sequence motif (Doolittle)

Page 11:

Brief History of Bioinformatics

1983 - Sequence DB searching (Wilbur-Lipman)
1986 - Human Genome Initiative announcement
1987 - SWISSPROT protein sequence database
1988 - NCBI created at NIH/NLM (databases)
1988 - FASTA by Pearson and Lipman. EMBL establishes sequence database network
1990 - BLAST by Altschul et al.
2003 - Human Genome Project completion

Page 12:

The data of biomedical informatics

Public & Private Databases store biological data in various formats
Sequences: DNA, RNA, proteins
Structures: X-ray, NMR, microscopy
Expression: microarrays, gels
Interaction: 2 hybrid, mass spec
Metabolism: GC-MS, NMR
Physiology: medical images, PK/PD

Page 13:

Search Engines

Boolean operators: AND, OR, NOT
Specifying database fields (Organism, Author)
Order of words: neonatal pre/3 screening (neonatal at most 3 words before screening)
Wildcards: wom?n, cat*s
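The wildcard patterns on this slide can be tried out locally. The sketch below is only an illustration, assuming shell-style semantics (`?` stands for exactly one character, `*` for any run of characters, including none); individual search engines may define these symbols differently, and the word list is made up.

```python
from fnmatch import fnmatchcase

# Hypothetical vocabulary to search; not taken from any real database.
WORDS = ["woman", "women", "womn", "cats", "catalogs", "cat"]

def match_words(pattern, words=WORDS):
    """Return the words that match a shell-style wildcard pattern."""
    return [w for w in words if fnmatchcase(w, pattern)]

print(match_words("wom?n"))  # "womn" is excluded: ? needs one character
print(match_words("cat*s"))  # "cat" is excluded: the pattern must end in s
```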

Page 14:

Search & Download

Entrez: integrated, text-based search and retrieval system for PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, Taxonomy, etc. + batch download

http://www.ncbi.nlm.nih.gov/sites/batchentrez

term [field] OPERATOR term [field]

1:10[ESTC] AND Homo sapiens[ORGN] AND deafness[dis] (BSND: Bartter syndrome, infantile, with sensorineural deafness)

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=unigene

More on the course’s website
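The `term [field] OPERATOR term [field]` pattern is easy to assemble programmatically. The following is a minimal sketch that only builds and URL-encodes the query text; the helper name `entrez_term` is hypothetical, and no request is sent to NCBI.

```python
from urllib.parse import quote_plus

def entrez_term(clauses, operator="AND"):
    """Join (term, field) pairs into 'term[field] OPERATOR term[field]'."""
    return f" {operator} ".join(f"{term}[{field}]" for term, field in clauses)

# The example query from the slide, rebuilt from its three clauses.
query = entrez_term([
    ("1:10", "ESTC"),
    ("Homo sapiens", "ORGN"),
    ("deafness", "dis"),
])
print(query)

# Percent-encode the query so it can be placed in a URL query string.
encoded = quote_plus(query)
print(encoded)
```

Keeping the clauses as (term, field) pairs makes it simple to add or drop a restriction without editing the query string by hand.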

Page 15:

DATA FORMATS AND DATA INTEGRATION

It is widely recognized that successful data integration is one of the keys to improved productivity in biopharmaceutical R&D

Success in most bioinformatics-related activities, from functional characterization of genomic sequences to prioritization of drug targets, requires an integrated view of all relevant data in a drug discovery R&D program

Bioinformatics data sources often have large, complex data structures, reflecting the richness of the scientific concepts they model. Many bioinformatics data sources cover similar domains, such as genes, proteins, sequence annotations or microarray results.

Page 16:

Database design links

http://www.devx.com/ibm/Article/20702
http://www.campus.ncl.ac.uk/databases/design/
http://www.dbazine.com/mullins_datamodel.shtml
http://www.extropia.com/tutorials/sql/toc.html
http://www.surfermall.com/relational/lesson_1.htm

Page 17:

Database: Definition

A collection of data that:
is organized
usually computer-based
represents repetitive information implicitly
supports retrieval

A set of rules to manipulate data
A method to mold information into knowledge

Page 18:

Database: applications

Who uses Computerized Databases:
Stores – to keep track of inventory
Hospitals – to keep track of patient info
Travel agents – to keep up with their customers and reservations
Biologists – to efficiently manage and manipulate their data

DATA → INFORMATION → KNOWLEDGE

Page 19:

Paper Database as Expert System

Page 20:

HISTORY

1960's: Two main data models are developed: network model (CODASYL) and hierarchical (IMS). A user would need to know the physical structure of the database in order to query for information. SABRE (IBM/AA).

1970-72: E.F. Codd proposes the relational model. He disconnects the schema (logical organization) of a database from the physical storage methods.

Page 21:

HISTORY

1970's:

Ingres: UCB → Ingres Corp., Sybase, MS SQL Server, Britton-Lee, Wang's PACE. This system used QUEL as query language.

System R: IBM → IBM's SQL/DS & DB2, Oracle, HP's Allbase, Tandem's Non-Stop SQL. This system used SEQUEL as query language.

The term Relational Database Management System (RDBMS) is coined

Page 22:

HISTORY

1976: P. Chen proposed the Entity-Relationship (ER) model for database design

Early 1980's: Commercialization of relational systems begins as a boom…

Mid-1980's: SQL (Structured Query Language) becomes "intergalactic standard". DB2 becomes IBM's flagship product. Network and hierarchical models fade into the background

Page 23:

HISTORY

Early 1990's: Application and personal productivity tool development: PowerBuilder (Sybase), Oracle Developer, VB (Microsoft), Excel/Access (MS) and ODBC. First Object Database Management Systems (ODBMS) prototypes.

Mid-1990's: Internet/WWW. Web/DB grows exponentially, usable for average users

Page 24:

HISTORY

Late-1990's: Boom for Web/Internet/DB connectors. Open source solutions with widespread use of gcc, cgi, Apache, MySQL, etc. Online transaction processing (OLTP) and online analytic processing (OLAP) come of age

Early 21st century: The dot-com bubble bursts, but DB applications keep growing solidly. PDAs, POS transactions, IBM, Microsoft, Oracle.

Page 25:

FUTURE

Terabyte and Petabyte databases of everything
Mobile databases
Semantic Web
Object Oriented Everything, including databases
Object Database Management Group (ODMG) standards are proposed and accepted
Security issues

Page 26:

Database: advantages

Advantages of a database program:
Can find a specific file quickly
Can easily add records
Can alphabetize and sort data faster than most people
Is as accurate as the data that is entered
Can make many different types of reports
Is invaluable for large amounts of data

Page 27:

Database: Parts

Parts of a relational database:
Field = a category of information (a column of a table)
Entry = data in a field
Record = all of the information about one item (row)
File = document of all of the records
To sort: choose a field, then ascending or descending order (Excel, Works)

Page 28:

Database types

Flat (spreadsheet)
Hierarchical
Network (two fundamental constructs, called records and sets)
Relational

Page 29:

Relational Databases

Relational databases started to get to be a big deal in the 1970's, and they're still a big deal today, which is a little peculiar, because they're a 1960's technology.

A relational database is a bunch of rectangular tables. Each row of a table is a record about one person or thing; the record contains several pieces of information called fields.

Page 30:

Entities and Relationships

Entities – things we store information about

Relationships – links between the entities

Many-to-many
One-to-one
One-to-many
…

Page 31:

A Table is a Relation

EMPLOYEE ID | NAME | Job | Department ID | Cube
7513 | Nora Edwards | Programmer | 128 | 1
9842 | Ben Smith | DBA | 42 | 15
6651 | Ajay Patel | Programmer | 128 | 2
9006 | Candy Burnett | Systems Administrator | 128 | 3

Columns, Fields, Attributes; Rows, Records, Tuples, Entities.

Records of data, comprised of fields, stored in tables
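The employee relation above can be loaded into a real relational engine in a few lines. This is a minimal sketch using Python's built-in `sqlite3` module and an in-memory database; the column types are assumptions, since the slide only gives names and values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    " employeeID INTEGER PRIMARY KEY, name TEXT, job TEXT,"
    " departmentID INTEGER, cube INTEGER)"
)
rows = [
    (7513, "Nora Edwards", "Programmer", 128, 1),
    (9842, "Ben Smith", "DBA", 42, 15),
    (6651, "Ajay Patel", "Programmer", 128, 2),
    (9006, "Candy Burnett", "Systems Administrator", 128, 3),
]
conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?, ?)", rows)

# Each result row is a tuple (record); each position is a field (attribute).
programmers = conn.execute(
    "SELECT name, cube FROM employee WHERE job = ? ORDER BY employeeID",
    ("Programmer",),
).fetchall()
print(programmers)
```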

Page 32:

Keys and Functional Dependencies

Key field (superkey, key): a field that uniquely identifies a record

If there is a functional dependency between column A and column B in a given table (A → B), then the value of column A determines the value of column B, for example employeeID → name.
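A functional dependency can be tested against sample data. The sketch below is an illustration only: the helper `holds` is a hypothetical name, and a check on a handful of rows can refute a dependency but never prove that it holds for all future data.

```python
def holds(rows, lhs, rhs):
    """True if, in these rows, each value of column lhs maps to one rhs value."""
    seen = {}
    for row in rows:
        # Remember the first rhs value seen for this lhs value, then compare.
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True

employees = [
    {"employeeID": 7513, "name": "Nora Edwards", "job": "Programmer"},
    {"employeeID": 9842, "name": "Ben Smith", "job": "DBA"},
    {"employeeID": 6651, "name": "Ajay Patel", "job": "Programmer"},
]

print(holds(employees, "employeeID", "name"))  # employeeID determines name
print(holds(employees, "job", "name"))         # job does not determine name
```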

Page 33:

Schema

Database schema is the structure or design of the database, a blueprint for the data in the database.

employee(employeeID, name, job, cube, departmentID)

What information needs to be stored? (things or entities)
What questions will we ask of the database? (queries)

Page 34:

Flawed schemas

EMPLOYEE ID | NAME | Job | Depart. ID | Depart. Name
7513 | Nora Edwards | Programmer | 128 | R&D
9842 | Ben Smith | DBA | 42 | Finance
6651 | Ajay Patel | Programmer | 128 | R&D
9006 | Candy Burnett | Systems Administrator | 128 | R&D

This schema design leads to redundancies: the department name is repeated for every employee. Splitting it into two relations removes them:

Employee(employee ID, name, job, department ID)
Department(department ID, department name)

Page 35:

Flawed schemas

Original table:

EMPLOYEE ID | NAME | Job | Depart. ID | Depart. Name
7513 | Nora Edwards | Programmer | 128 | R&D
9842 | Ben Smith | DBA | 42 | Finance
6651 | Ajay Patel | Programmer | 128 | R&D
9006 | Candy Burnett | Systems Administrator | 128 | R&D

Insertion Anomaly: a new row can contradict existing ones. Department 42 is now both Finance and Development.

EMPLOYEE ID | NAME | Job | Depart. ID | Depart. Name
7513 | Nora Edwards | Programmer | 128 | R&D
9842 | Ben Smith | DBA | 42 | Finance
6651 | Ajay Patel | Programmer | 128 | R&D
9006 | Candy Burnett | Systems Administrator | 128 | R&D
9901 | Steve Smith | Engineer | 42 | Development

Deletion Anomaly: deleting Ben Smith also deletes the only record that department 42 is Finance.

EMPLOYEE ID | NAME | Job | Depart. ID | Depart. Name
7513 | Nora Edwards | Programmer | 128 | R&D
6651 | Ajay Patel | Programmer | 128 | R&D
9006 | Candy Burnett | Systems Administrator | 128 | R&D

Update Anomaly: renaming department 128 in only some rows leaves the table inconsistent.

EMPLOYEE ID | NAME | Job | Depart. ID | Depart. Name
7513 | Nora Edwards | Programmer | 128 | Emerging Tech
9842 | Ben Smith | DBA | 42 | Finance
6651 | Ajay Patel | Programmer | 128 | Emerging Tech
9006 | Candy Burnett | Systems Administrator | 128 | R&D
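These anomalies go away once the department name is stored in a table of its own. The sketch below, using Python's built-in `sqlite3` module, is one possible way to show this under assumed table and column names (`employee`, `department`, `departmentName`); it is not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE department (
    departmentID INTEGER PRIMARY KEY, departmentName TEXT);
CREATE TABLE employee (
    employeeID INTEGER PRIMARY KEY, name TEXT, job TEXT,
    departmentID INTEGER REFERENCES department(departmentID));
INSERT INTO department VALUES (128, 'R&D'), (42, 'Finance');
INSERT INTO employee VALUES
    (7513, 'Nora Edwards', 'Programmer', 128),
    (9842, 'Ben Smith', 'DBA', 42),
    (6651, 'Ajay Patel', 'Programmer', 128),
    (9006, 'Candy Burnett', 'Systems Administrator', 128);
""")

# Update: the name is stored once, so no copy can be missed.
conn.execute(
    "UPDATE department SET departmentName = 'Emerging Tech' "
    "WHERE departmentID = 128"
)
# Deletion: removing the only Finance employee keeps the department row.
conn.execute("DELETE FROM employee WHERE employeeID = 9842")

# Department names seen by the remaining employees, via a join.
names = conn.execute(
    "SELECT DISTINCT d.departmentName FROM employee e "
    "JOIN department d ON d.departmentID = e.departmentID"
).fetchall()
departments = conn.execute(
    "SELECT departmentName FROM department ORDER BY departmentID"
).fetchall()
print(names, departments)
```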

Page 36:

Avoid Null Values

ID | Engineering Skills | Astronaut’s Skills
001001 | Solidworks | NULL
001002 | Implantable Medical Devices | NULL
001003 | FDA QS regulations | NULL
001004 | Design Controls, FDA ISO, QC regulations | NULL
001005 | Digital Circuits: logic devices, state machines | NULL
001006 | CAD/CAM programming | Walking on the moon
001007 | HDLs: VHDL and Verilog | NULL
001008 | Microcontrollers: ARM, H8, 8051, PIC | NULL
001009 | Medical manufacturing, 3D CAD | NULL

Page 37:

Normalization

EMPLOYEE ID | NAME | Job | Depart. ID | Skills
7513 | Nora Edwards | Programmer | 128 | C, Perl, Java
9842 | Ben Smith | DBA | 42 | DB2
6651 | Ajay Patel | Programmer | 128 | VB, Java
9006 | Candy Burnett | Systems Administrator | 128 | NT, Linux

Unnormalized table: lists instead of atomic values. This violates the rules of first normal form

Page 38:

Normalization

EMPLOYEE ID | NAME | Job | Depart. ID | Skills
7513 | Nora Edwards | Programmer | 128 | C
7513 | Nora Edwards | Programmer | 128 | Perl
7513 | Nora Edwards | Programmer | 128 | Java
9842 | Ben Smith | DBA | 42 | DB2
6651 | Ajay Patel | Programmer | 128 | VB
6651 | Ajay Patel | Programmer | 128 | Java
9006 | Candy Burnett | Systems Administrator | 128 | NT
9006 | Candy Burnett | Systems Administrator | 128 | Linux

This schema is in first normal form, 1NF
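The step from the unnormalized table to 1NF is mechanical: every comma-separated skill list is expanded into one row per skill. A minimal sketch, with the hypothetical helper name `to_first_normal_form`:

```python
unnormalized = [
    (7513, "Nora Edwards", "Programmer", 128, "C, Perl, Java"),
    (9842, "Ben Smith", "DBA", 42, "DB2"),
    (6651, "Ajay Patel", "Programmer", 128, "VB, Java"),
    (9006, "Candy Burnett", "Systems Administrator", 128, "NT, Linux"),
]

def to_first_normal_form(rows):
    """Yield one row per skill so that every field holds an atomic value."""
    for emp_id, name, job, dept, skills in rows:
        for skill in skills.split(","):
            yield (emp_id, name, job, dept, skill.strip())

first_nf = list(to_first_normal_form(unnormalized))
for row in first_nf:
    print(row)
```

The result has eight rows, matching the 1NF table on the slide, at the cost of repeating name, job and department for each skill.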

Page 39:

Second Normal Form, 2NF

Empl. ID | NAME | Job | Dep. ID
7513 | Nora Edwards | Programmer | 128
9842 | Ben Smith | DBA | 42
6651 | Ajay Patel | Programmer | 128
9006 | Candy Burnett | Systems Administrator | 128

Empl. ID | Skills
7513 | C
7513 | Perl
7513 | Java
9842 | DB2
6651 | VB
6651 | Java
9006 | NT
9006 | Linux

2NF: Attributes must depend on the whole key
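In the 1NF table the key is the pair (employee ID, skill), yet name, job and department depend on the employee ID alone. The 2NF tables above are projections of the 1NF table onto two column sets. The sketch below shows that projection under the assumption that rows are plain tuples; `project` is a hypothetical helper.

```python
first_nf = [
    (7513, "Nora Edwards", "Programmer", 128, "C"),
    (7513, "Nora Edwards", "Programmer", 128, "Perl"),
    (7513, "Nora Edwards", "Programmer", 128, "Java"),
    (9842, "Ben Smith", "DBA", 42, "DB2"),
    (6651, "Ajay Patel", "Programmer", 128, "VB"),
    (6651, "Ajay Patel", "Programmer", 128, "Java"),
    (9006, "Candy Burnett", "Systems Administrator", 128, "NT"),
    (9006, "Candy Burnett", "Systems Administrator", 128, "Linux"),
]

def project(rows, positions):
    """Keep only the given column positions and drop duplicate tuples."""
    result = []
    for row in rows:
        projected = tuple(row[i] for i in positions)
        if projected not in result:
            result.append(projected)
    return result

employee = project(first_nf, (0, 1, 2, 3))    # depends on employee ID only
employee_skill = project(first_nf, (0, 4))    # key is (employee ID, skill)
print(employee)
print(employee_skill)
```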

Page 40:

3NF and BCNF (Boyce-Codd)

Empl. ID | NAME | Job | Dep. ID
7513 | Nora Edwards | Programmer | 128
9842 | Ben Smith | DBA | 42
6651 | Ajay Patel | Programmer | 128
9006 | Candy Burnett | Systems Administrator | 128

Empl. ID | Skills
7513 | C
7513 | Perl
7513 | Java
9842 | DB2
6651 | VB
6651 | Java
9006 | NT
9006 | Linux

3NF: Attributes must depend on nothing but the key

BCNF: all the functional dependencies must have a superkey on the left side
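The BCNF condition can be checked with the standard attribute-closure computation: a set of attributes is a superkey if its closure under the functional dependencies covers the whole schema. The sketch below assumes the dependencies are supplied by the designer (here employeeID → name, job, departmentID and departmentID → departmentName); the function names are hypothetical.

```python
def closure(attrs, fds):
    """All attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def is_bcnf(schema, fds):
    """True if every non-trivial dependency has a superkey on its left side."""
    return all(
        set(rhs) <= set(lhs) or closure(lhs, fds) >= set(schema)
        for lhs, rhs in fds
    )

# Assumed dependencies for the flawed single-table schema.
fds = [
    (("employeeID",), ("name", "job", "departmentID")),
    (("departmentID",), ("departmentName",)),
]
flawed = ("employeeID", "name", "job", "departmentID", "departmentName")
employee = ("employeeID", "name", "job", "departmentID")

print(is_bcnf(flawed, fds))        # departmentID is not a superkey
print(is_bcnf(employee, fds[:1]))  # employeeID is the key
```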

Page 41:

Concepts

Entities are things, and relationships are the links between them.
Relations or tables hold a set of data in tabular form.
Columns belonging to tables describe the attributes that each data item possesses.
Rows in tables hold data items with values for each column in a table.
Keys are used to identify a single row.
Functional dependencies identify which attributes determine the values of other attributes.
Schemas are the blueprints for a database.

Page 42:

Design Principles

Minimize redundancy without losing data.

Insertion, deletion, and update anomalies are problems that occur when trying to insert, delete, or update data in a table with a flawed structure.

Avoid designs that will lead to large quantities of null values.

Page 43

Normalization

Normalization is a formal process for improving database design.
First normal form (1NF) means atomic column or attribute values.
Second normal form (2NF) means that all attributes outside the key must depend on the whole key.
Third normal form (3NF) means no transitive dependencies.
Boyce-Codd normal form (BCNF) means that every nontrivial functional dependency must have a superkey on its left side.
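The normal forms are defined in terms of functional dependencies, so it can help to test a dependency against sample data. The sketch below is an assumption-laden illustration: the rows and attribute names are hypothetical, and a check on sample rows can only show that the data does not contradict a dependency, not that the dependency holds for the schema in general.

```python
def fd_holds(rows, lhs, rhs):
    """True if, in these rows, equal values for lhs always imply equal values for rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # setdefault stores the first value seen for this key and returns it afterwards
        if seen.setdefault(key, val) != val:
            return False
    return True

def is_superkey(rows, attrs):
    """True if attrs determine every attribute of the (sample) relation."""
    all_attrs = sorted(rows[0])
    return fd_holds(rows, attrs, all_attrs)

# Hypothetical relation with key (emp_id, skill).
rows = [
    {"emp_id": 1, "skill": "Java",  "dept": "IT",  "dept_head": "Jones"},
    {"emp_id": 1, "skill": "SQL",   "dept": "IT",  "dept_head": "Jones"},
    {"emp_id": 2, "skill": "Linux", "dept": "Ops", "dept_head": "Smith"},
    {"emp_id": 3, "skill": "Linux", "dept": "Ops", "dept_head": "Smith"},
]

# emp_id -> dept depends on only part of the key: a 2NF violation.
partial_dependency = fd_holds(rows, ["emp_id"], ["dept"])
# dept -> dept_head holds, but dept is not a superkey: a 3NF/BCNF violation.
bcnf_violation = fd_holds(rows, ["dept"], ["dept_head"]) and not is_superkey(rows, ["dept"])
```

Splitting this relation into (emp_id, skill), (emp_id, dept), and (dept, dept_head) would remove both violations, since each remaining dependency then has a key on its left side.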

Page 44

Hierarchical Databases

1234567
    Sandiego, Carmen
    123 Main Street
    Labs
        Chem7
            Na 136, K 4.3
        Chem7
            Na 142, K 3.9

Page 45

Hierarchical Databases

Easy to use
Efficient storage
"Tree walking" is fast
Queries across trees are slow
Flexible
Too flexible: chaos is allowed
Too easy to modify
Difficult to document complex structures

Page 46

Hierarchical Databases

^EMR(1234567)="Sandiego, Carmen"
^EMR(1234567, "Address")="123 Main Street"
^EMR(1234567, "Chem7", "2/2/02", "Na")=136
^EMR(1234567, "Chem7", "2/2/02", "K")=4.3
^EMR(1234567, "Chem7", "2/3/02", "Na")=142
^EMR(1234567, "Chem7", "2/3/02", "K")=3.9
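The subscripted nodes above behave like a tree addressed by a path of subscripts. The following Python sketch imitates that idea with nested dictionaries; it is not how any particular hierarchical DBMS is implemented, and the helper names set_node, get_node, walk, and all_values are invented for this example.

```python
def set_node(tree, subscripts, value):
    """Store a value at the node addressed by a sequence of subscripts."""
    node = tree
    for s in subscripts:
        node = node.setdefault("children", {}).setdefault(s, {})
    node["value"] = value

def get_node(tree, subscripts):
    """'Tree walking': follow the subscripts straight down to one node."""
    node = tree
    for s in subscripts:
        node = node["children"][s]
    return node.get("value")

def walk(tree, prefix=()):
    """Yield (subscripts, value) for every node that holds a value, depth-first."""
    if "value" in tree:
        yield prefix, tree["value"]
    for s, child in tree.get("children", {}).items():
        yield from walk(child, prefix + (s,))

def all_values(tree, analyte):
    """A query across trees: every node of every patient has to be visited."""
    return [v for subs, v in walk(tree) if subs[-1] == analyte]

emr = {}
set_node(emr, (1234567,), "Sandiego, Carmen")
set_node(emr, (1234567, "Address"), "123 Main Street")
set_node(emr, (1234567, "Chem7", "2/2/02", "Na"), 136)
set_node(emr, (1234567, "Chem7", "2/2/02", "K"), 4.3)
set_node(emr, (1234567, "Chem7", "2/3/02", "Na"), 142)
set_node(emr, (1234567, "Chem7", "2/3/02", "K"), 3.9)
```

get_node touches only one node per subscript, while all_values has to traverse the whole structure, which mirrors the contrast between fast tree walking and slow queries across trees.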

Page 47

Hierarchical Chaos

1234567
    Admissions
        Admission 1
            Admit Date: 2/2/02
            Primary DX: CHF
            Other DX
                AODM
                    Flag: S
                A Fib
                    Flag: P

Page 48

Network Databases

[Figure: network of linked records. Recoverable node labels: 1234567; Sandiego; Gyn Visit; Pap; Dr. Jones; Gyn Clinic; Secretary; Ms Smith; 305-1000; Service; Beeper 34; 2 Main St.; 8AM-5PM; 305-2500]

Page 49

Extensible Markup Language (XML) Databases

SGML is a metalanguage
SGML is used to write Document Type Definitions (DTDs) that define languages
HTML is a language with an SGML DTD
    Tags are for formatting/presentation syntax
XML is a proper subset of SGML
    XML lets authors define tags that convey semantics
We could write "Health Markup Language" ("HML") in XML (if we could agree on the semantics and tags)
Tags may or may not be stored with data

Page 50

<document>
    <document.id>CXR001</document.id>
    <doc.date>19991101</doc.date>
    <document.type>
        <identifier>P5-00010</identifier>
        <text>Chest X-Ray</text>
    </document.type>
    <document.body>
        <findings>No infiltrate, cardiac shadow not enlarged...</findings>
        <impression>Normal X-ray</impression>
    </document.body>
</document>

Page 51

<patient>
    <patient.id>
        <id.value>1234789</id.value>
    </patient.id>
    <patient.name>
        <family.name>Sandiego</family.name>
        <given.name>Carmen</given.name>
        <suffix>M.D.</suffix>
    </patient.name>
    <patient.dob>19230113</patient.dob>
    <patient.sex value="male"/>
    <inpatient/>
</patient>
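Because the tags carry the meaning of each field, a program can pull values out of a patient record by name. The sketch below uses Python's standard xml.etree.ElementTree module; the nesting of the elements inside the PATIENT_XML string is an assumed reading of the slide, and parse_patient is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Assumed nesting of the patient record shown on the slide.
PATIENT_XML = """
<patient>
  <patient.id><id.value>1234789</id.value></patient.id>
  <patient.name>
    <family.name>Sandiego</family.name>
    <given.name>Carmen</given.name>
    <suffix>M.D.</suffix>
  </patient.name>
  <patient.dob>19230113</patient.dob>
  <patient.sex value="male"/>
  <inpatient/>
</patient>
"""

def parse_patient(xml_text):
    """Extract a few fields by tag name; the data carries its own field assignment."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.findtext("patient.id/id.value"),
        "family_name": root.findtext("patient.name/family.name"),
        "given_name": root.findtext("patient.name/given.name"),
        "dob": root.findtext("patient.dob"),
        # patient.sex stores its value in an attribute, not in element text
        "sex": root.find("patient.sex").get("value"),
        # an empty element acts as a flag: present means inpatient
        "inpatient": root.find("inpatient") is not None,
    }

patient = parse_patient(PATIENT_XML)
```

Fields that are absent from a record simply come back as None from findtext, which is one way sparse data stays compact in this representation.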

Page 52

Extensible Markup Language (XML) Databases

Strengths
    Flexibility to represent a wide range of data
    Data carries its field assignment
    Sparse data handled compactly
    Tags can have platform-specific display
Weaknesses
    Immature database tools
    Verbose
    I/O intensive
    A trade-off of decreased efficiency for increased flexibility; scalability is an open question

Page 53

Relational Databases - Advantages

Comprehensible
Multiple "views" possible
Easy to modify
    New elements don't "break" programs
Database management systems (DBMS)
    Referential integrity
    "Reorg" for efficiency
    Access control
    Locking for multiple simultaneous use
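Three of these advantages (views, referential integrity, and adding new elements without breaking programs) can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: the table, view, and column names are hypothetical, and SQLite stands in for whatever DBMS is actually used.

```python
import sqlite3

def demo_relational_features():
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE patient (
            patient_id INTEGER PRIMARY KEY,
            name       TEXT NOT NULL
        );
        CREATE TABLE lab_result (
            patient_id INTEGER NOT NULL REFERENCES patient(patient_id),
            analyte    TEXT NOT NULL,
            value      REAL NOT NULL
        );
        -- A "view": one of several possible windows onto the same data.
        CREATE VIEW sodium AS
            SELECT patient_id, value FROM lab_result WHERE analyte = 'Na';
    """)
    conn.execute("INSERT INTO patient VALUES (1234567, 'Sandiego, Carmen')")
    conn.execute("INSERT INTO lab_result VALUES (1234567, 'Na', 136)")
    conn.execute("INSERT INTO lab_result VALUES (1234567, 'K', 4.3)")

    # Referential integrity: a result for an unknown patient is rejected.
    try:
        conn.execute("INSERT INTO lab_result VALUES (999, 'Na', 140)")
        orphan_rejected = False
    except sqlite3.IntegrityError:
        orphan_rejected = True

    # A new element (column) does not break the existing view or query.
    conn.execute("ALTER TABLE lab_result ADD COLUMN units TEXT")
    sodium_rows = conn.execute("SELECT patient_id, value FROM sodium").fetchall()
    conn.close()
    return orphan_rejected, sodium_rows
```

The function returns whether the orphan row was rejected together with the rows visible through the sodium view after the table was altered. Access control and multi-user locking are not shown, since an in-memory SQLite database has a single user.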

Page 54

Relational Databases - Disadvantages

Storage overhead
I/O-intense
Cost