Post on 18-Dec-2015
Overview of Genome Databases
Peter D. Karp, Ph.D.
SRI International
pkarp@ai.sri.com
www-db.stanford.edu/dbseminar/seminar.html
Talk Overview
Definition of bioinformatics
Motivations for genome databases
Issues in building genome databases
Definition of Bioinformatics
Computational techniques for management and analysis of biological data and knowledge
Methods for disseminating, archiving, interpreting, and mining scientific information
Computational theories of biology
Genome databases are a subfield of bioinformatics
Motivations for Bioinformatics
Growth in molecular-biology knowledge (literature)
Genomics
1. Study of genomes through DNA sequencing
2. Industrial Biology
Example Genomics Datatypes
Genome sequences
  DOE Joint Genome Institute
    511M bases in Dec 2001
    11.97G bases since Mar 1999
Gene and protein expression data
Protein-protein interaction data
Protein 3-D structures
Genome Databases
Experimental data
  Archive experimental datasets
  Retrieving past experimental results should be faster than repeating the experiment
  Capture alternative analyses
  Lots of data, simpler semantics
Computational symbolic theories
  Complex theories become too large to be grasped by a single mind
  The database is the theory
  Biology is very much concerned with qualitative relationships
  Less data, more complex semantics
Bioinformatics
Distinct intellectual field at the intersection of CS and molecular biology
Distinct field because researchers in the field must know CS, biology, and bioinformatics
Spectrum from CS research to biology service
Rich source of challenging CS problems
Large, noisy, complex data-sets and knowledge-sets
Biologists and funding agencies demand working solutions
Bioinformatics Research
algorithms + data structures = programs
algorithms + databases = discoveries
Combine sophisticated algorithms with the right content:
  Properly structured
  Carefully curated
  Relevant data fields
  Proper amount of data
Reference on Major Genome Databases
Nucleic Acids Research Database Issue
http://nar.oupjournals.org/content/vol30/issue1/
112 databases
Questions to Ask of a New Genome Database
What are Database Goals and Requirements?
What problems will database be used to solve?
Who are the users and what is their expertise?
What is its Organizing Principle?
Different DBs partition the space of genome information in different dimensions
Experimental methods (Genbank, PDB)
Organism (EcoCyc, Flybase)
What is its Level of Interpretation?
Laboratory data
Primary literature (Genbank)
Review (SwissProt, MetaCyc)
Does DB model disagreement?
What are its Semantics and Content?
What entities and relationships does it model?
How does its content overlap with similar DBs?
How many entities of each type are present?
Sparseness of attributes and statistics on attribute values
What are Sources of its Data?
Potential information sources
  Laboratory instruments
  Scientific literature
    Manual entry
    Natural-language text mining
  Direct submission from the scientific community
    Genbank
Modification policy
  DB staff only
  Submission of new entries by scientific community
  Update access by scientific community
What DBMS is Employed?
None
Relational
Object oriented
Frame knowledge representation system
Distribution / User Access
Multiple distribution forms enhance access
Browsing access with visualization tools
API
Portability
What Validation Approaches are Employed?
None
Declarative consistency constraints
Programmatic consistency checking
Internal vs external consistency checking
What types of systematic errors might DB contain?
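A minimal sketch of the difference between declarative and programmatic consistency checking, using Python's built-in sqlite3 module rather than any DBMS named in this talk. The table layout is a hypothetical, cut-down variant of a protein table, and the "standard residues only" rule is an assumption chosen purely for illustration. Declarative constraints (NOT NULL, CHECK, foreign key) are enforced by the DBMS at insert time; the sequence check is ordinary code run over the stored rows.

```python
import sqlite3

# The 20 standard amino-acid one-letter codes (illustrative validation rule).
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")

def build_db():
    """Create an in-memory DB whose schema carries declarative constraints."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.execute("CREATE TABLE DataSet (WID INTEGER PRIMARY KEY, Name TEXT NOT NULL)")
    conn.execute("""
        CREATE TABLE Protein (
            WID        INTEGER PRIMARY KEY,
            Name       TEXT NOT NULL,
            Fragment   TEXT CHECK (Fragment IN ('T', 'F')),
            AASequence TEXT,
            DataSetWID INTEGER NOT NULL REFERENCES DataSet(WID)
        )""")
    return conn

def try_insert(conn, row):
    """Declarative checking: return False if the DBMS rejects the row."""
    try:
        conn.execute("INSERT INTO Protein VALUES (?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

def invalid_sequences(conn):
    """Programmatic checking: WIDs whose sequence is missing or has odd letters."""
    bad = []
    for wid, seq in conn.execute("SELECT WID, AASequence FROM Protein ORDER BY WID"):
        if seq is None or not set(seq) <= STANDARD_RESIDUES:
            bad.append(wid)
    return bad
```

Both checks here are internal: they look only at the database itself. An external check would compare stored values against an independent source.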
Database Documentation
Schema and its semantics
Format
API
Data acquisition techniques
Validation techniques
Size of different classes
Coverage of subject matter
Sparseness of attributes
Error rates
Update frequency
Relationship of Database Field to Bioinformatics
Scientists generally unaware of basic DB principles
  Complex queries vs click-at-a-time access
  Data model
  Defined semantics for DB fields
  Controlled vocabularies
  Regular syntax for flatfiles
  Automated consistency checking
Most biologists take one programming class
Evolution of typical genome database
Finer points of DB research off their radar screen
Handful of DB researchers work in bioinformatics
Database Field
For many years, the majority of bioinformatics DBs did not employ a DBMS
  Flatfiles were the rule
  Scientists want to see the data directly
  Commercial DBMSs too expensive, too complex
  DBAs too expensive
Most scientists do not understand
  Differences between BA, MS, PhD in CS
  CS research vs applications
  Implications for project planning, funding, bioinformatics research
Recommendation
Teaching scientists programming is not enough
Teaching scientists how to build a DBMS is irrelevant
Teach scientists basic aspects of databases and symbolic computing
  Database requirements analysis
  Data models, schema design
  Knowledge representation, ontologies
  Formal grammars
  Complex queries
  Database interoperability
BioSPICE Bioinformatics Database Warehouse
Peter Karp, Dave Stringer-Calvert, Tom Lee, Kemal Sonmez
SRI International
http://www.BioSPICE.org/
Project Goal
Create a toolkit for constructing bioinformatics database warehouses that collect together a set of bioinformatics databases into one physical DBMS
Motivations
Important bioinformatics problems require access to multiple bioinformatics databases
Hundreds of bioinformatics databases exist
  Nucleic Acids Research 30(1) 2002 – DB issue
  Nucleic Acids Research DB list: 350 DBs at http://www3.oup.co.uk/nar/database/a/
Different problems require different sets of databases
Motivations
Combining multiple databases allows for data verification and complementation
Simulation problems require access to data on pathways, enzymes, reactions, genetic regulation
Why is the Multidatabase Approach Not Sufficient?
Multidatabase query approaches assume databases are in a DBMS
Internet bandwidth limits query throughput
Most sites that do operate DBMSs do not allow remote SQL access because of security and loading concerns
Control data stability
Need to capture, integrate and publish locally produced data of different types
Multidatabase and Warehouse approaches complementary
Scenario 1
BioSPICE scientist wants to model multiple metabolic pathways in a given organism
Enumerate pathways and reactions
What enzymes catalyze each reaction?
What genes code for each enzyme?
What control regions regulate each gene?
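Once the data sit in one DBMS, the questions in this scenario chain into a single join. The sketch below is illustrative only: it uses Python's sqlite3 module, and every table name, column name, and row (pathway-A, rxn-1, enzyme-X, gene-x) is a hypothetical placeholder, not the warehouse's actual schema or contents. Control regions are left out to keep it short.

```python
import sqlite3

# Hypothetical, simplified tables; link tables connect the object types.
SCHEMA = """
CREATE TABLE Pathway           (WID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Reaction          (WID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Protein           (WID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Gene              (WID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE PathwayReaction   (PathwayWID INTEGER, ReactionWID INTEGER);
CREATE TABLE EnzymaticReaction (ReactionWID INTEGER, ProteinWID INTEGER);
CREATE TABLE GeneProduct       (GeneWID INTEGER, ProteinWID INTEGER);
"""

# LEFT JOINs keep reactions that have no known enzyme or gene.
QUERY = """
SELECT pw.Name, r.Name, p.Name, g.Name
FROM Pathway pw
JOIN PathwayReaction pr        ON pr.PathwayWID = pw.WID
JOIN Reaction r                ON r.WID = pr.ReactionWID
LEFT JOIN EnzymaticReaction er ON er.ReactionWID = r.WID
LEFT JOIN Protein p            ON p.WID = er.ProteinWID
LEFT JOIN GeneProduct gp       ON gp.ProteinWID = p.WID
LEFT JOIN Gene g               ON g.WID = gp.GeneWID
ORDER BY pw.Name, r.Name, p.Name, g.Name
"""

def build_toy_warehouse():
    """Load placeholder rows: one pathway, two reactions, one enzyme, one gene."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO Pathway VALUES (1, 'pathway-A')")
    conn.executemany("INSERT INTO Reaction VALUES (?, ?)", [(2, "rxn-1"), (3, "rxn-2")])
    conn.executemany("INSERT INTO PathwayReaction VALUES (?, ?)", [(1, 2), (1, 3)])
    conn.execute("INSERT INTO Protein VALUES (4, 'enzyme-X')")
    conn.execute("INSERT INTO EnzymaticReaction VALUES (2, 4)")
    conn.execute("INSERT INTO Gene VALUES (5, 'gene-x')")
    conn.execute("INSERT INTO GeneProduct VALUES (5, 4)")
    return conn

def pathway_report(conn):
    """Return (pathway, reaction, enzyme, gene) rows; missing links come back as None."""
    return conn.execute(QUERY).fetchall()
```

Rows with None in the enzyme or gene column mark gaps in the data, which is often exactly what a modeler needs to find.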
Approach
Oracle and MySQL implementations
Warehouse schema defines many bioinformatics datatypes
Create loaders for public bioinformatics DBs
  Parse file format for the DB
  Semantic transformations
  Insert database into warehouse tables
Warehouse query access mechanisms
  SQL queries via Perl, ODBC, OAA
Example: Swiss-Prot DB
Version 40.0 describes 101K proteins in a 320MB file
Each protein described as one block of records (an entry) in a large text file
Loader tool parses file one entry at a time
Creates new entries in a set of warehouse tables
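Entry-at-a-time processing can be sketched with a small generator. This is not the loader described in the talk, which is built with a parser generator. It is a hand-written stand-in that relies on two Swiss-Prot flat-file conventions: each line starts with a two-letter line code, and each entry ends with a `//` line. The function names and the "SEQ" key for sequence lines (whose line code is blank) are assumptions made for this example.

```python
import io
from collections import defaultdict

def iter_entries(handle):
    """Yield one entry at a time as {line_code: [text, ...]}; '//' closes an entry."""
    entry = defaultdict(list)
    for raw in handle:
        line = raw.rstrip("\n")
        if line.startswith("//"):
            if entry:
                yield dict(entry)
            entry = defaultdict(list)
            continue
        if not line.strip():
            continue  # skip blank lines
        code = line[:2].strip() or "SEQ"  # sequence lines have a blank line code
        entry[code].append(line[2:].strip())
    if entry:  # tolerate a final entry with no terminator
        yield dict(entry)

def accessions(entry):
    """Split the AC lines of one entry into individual accession numbers."""
    out = []
    for text in entry.get("AC", []):
        out.extend(part.strip() for part in text.split(";") if part.strip())
    return out

# A few lines of the entry shown later in this talk, plus the entry terminator.
SAMPLE = "ID 1A11_CUCMA STANDARD; PRT; 493 AA.\nAC P23599;\nGN ACS1 OR ACCW.\n//\n"

for parsed in iter_entries(io.StringIO(SAMPLE)):
    print(parsed["ID"][0], accessions(parsed))
```

Because the generator holds only one entry in memory, the same loop works unchanged on a file of several hundred megabytes.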
Warehouse Schema
Manages many bioinformatics datatypes simultaneously
  Pathways, Reactions, Chemicals
  Proteins, Genes, Replicons
  Citations, Organisms
  Links to external databases
Each type of warehouse object implemented through one or more relational tables (currently 43)
Warehouse Schema
Databases on our wish list:
  Genbank (nucleotide sequences)
  Protein expression database
  Protein-protein interactions database
  Gene expression database
  NCBI Taxonomy database
  Gene Ontology
  CMR
Warehouse Schema
Manages multiple datasets simultaneously
  Dataset = Single version of a database
Support alternative measurements and viewpoints
  Version comparison
  Multiple software tools or experiments that require access to different versions
Each dataset is a warehouse entity
Every warehouse object is registered in a dataset
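A rough sketch of how registering every object in a dataset makes version comparison a plain query. It uses Python's sqlite3 module. The DataSetWID column follows the Protein table shown in this talk, while the DataSet columns, the Accession column, and the function names are hypothetical. For brevity each table assigns its own row IDs here.

```python
import sqlite3

def build():
    """Create a tiny two-table warehouse: datasets and the proteins they contain."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE DataSet (WID INTEGER PRIMARY KEY, Name TEXT, Version TEXT);
        CREATE TABLE Protein (WID INTEGER PRIMARY KEY, Accession TEXT,
                              DataSetWID INTEGER REFERENCES DataSet(WID));
    """)
    return conn

def load_version(conn, name, version, accession_list):
    """Register one version of a database as a dataset, then load its proteins."""
    cur = conn.execute("INSERT INTO DataSet (Name, Version) VALUES (?, ?)",
                       (name, version))
    dataset_wid = cur.lastrowid
    conn.executemany("INSERT INTO Protein (Accession, DataSetWID) VALUES (?, ?)",
                     [(acc, dataset_wid) for acc in accession_list])
    return dataset_wid

def added_between(conn, old_wid, new_wid):
    """Accessions present in the newer dataset but absent from the older one."""
    rows = conn.execute("""
        SELECT Accession FROM Protein WHERE DataSetWID = ?
        EXCEPT
        SELECT Accession FROM Protein WHERE DataSetWID = ?
        ORDER BY Accession""", (new_wid, old_wid))
    return [row[0] for row in rows]
```

Two versions of the same source sit side by side in the same tables, so a tool pinned to the older version simply filters on its DataSetWID.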
Warehouse Schema
Different databases storing the same biological types are coerced into same warehouse tables
Design of most datatypes inspired by multiple databases
Representational tricks to decrease schema bloat
  Single space of primary keys
  Single set of satellite tables such as for synonyms, citations, comments, etc.
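The two tricks work together: if warehouse IDs are drawn from one shared counter, a single satellite table can hold synonyms for any object type without recording which table the object lives in. The sketch below illustrates the idea with Python's sqlite3 module. The table names (SynonymTable, OtherWID), the helper functions, and the placeholder rows are hypothetical, not the warehouse's real schema.

```python
import itertools
import sqlite3

OBJECT_TABLES = {"Protein", "Gene"}  # tables allowed to receive objects

def build():
    """Create two object tables plus one satellite table shared by both."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Protein (WID INTEGER PRIMARY KEY, Name TEXT);
        CREATE TABLE Gene    (WID INTEGER PRIMARY KEY, Name TEXT);
        -- One satellite table serves every object type, keyed by the shared WID.
        CREATE TABLE SynonymTable (OtherWID INTEGER NOT NULL, Syn TEXT NOT NULL);
    """)
    return conn

def add_object(conn, wids, table, name, synonyms=()):
    """Insert an object using the next WID from the shared counter `wids`."""
    if table not in OBJECT_TABLES:
        raise ValueError(f"unknown object table: {table}")
    wid = next(wids)
    conn.execute(f"INSERT INTO {table} (WID, Name) VALUES (?, ?)", (wid, name))
    conn.executemany("INSERT INTO SynonymTable VALUES (?, ?)",
                     [(wid, syn) for syn in synonyms])
    return wid

def synonyms_of(conn, wid):
    """Look up synonyms by WID alone; no object type is needed."""
    rows = conn.execute(
        "SELECT Syn FROM SynonymTable WHERE OtherWID = ? ORDER BY Syn", (wid,))
    return [row[0] for row in rows]
```

A typical call sequence would create `wids = itertools.count(1)` once and pass it to every `add_object` call, so a protein and a gene can never share a WID.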
Warehouse Schema
Examples
  Protein data from Swiss-Prot, TrEMBL, KEGG, and EcoCyc all loaded into same relational tables
  Pathway data from MetaCyc and KEGG are loaded into the same relational tables
Example: Swiss-Prot DB
ID 1A11_CUCMA STANDARD; PRT; 493 AA.
AC P23599;
DT 01-NOV-1991 (Rel. 20, Created)
DT 01-NOV-1991 (Rel. 20, Last sequence update)
DT 15-DEC-1998 (Rel. 37, Last annotation update)
DE 1-AMINOCYCLOPROPANE-1-CARBOXYLATE SYNTHASE CMW33 (EC 4.4.1.14) (ACC
DE SYNTHASE) (S-ADENOSYL-L-METHIONINE METHYLTHIOADENOSINE-LYASE).
GN ACS1 OR ACCW.
How Swiss-Prot is Loaded into The Warehouse
Register Swiss-Prot in Datasets table
Create entry in Entry and Protein tables for each Swiss-Prot protein
Satellite tables store
  Protein synonyms, citations, comments, accession numbers, organism, sequence features, subunits/complexes, DB links
Protein Table
CREATE TABLE Protein (
  WID                  NUMBER,          -- The warehouse ID of this protein
  Name                 VARCHAR2(500),   -- Common name of the protein
  AASequence           VARCHAR2(4000),  -- Amino-acid sequence for this protein
  Charge               NUMBER,          -- Charge of the chemical
  Fragment             CHAR(1),         -- Is this protein a fragment or not, T or F
  MolecularWeightCalc  NUMBER,          -- Molecular weight calculated from sequence. Units: Daltons.
  MolecularWeightExp   NUMBER,          -- Molecular weight determined through experimentation. Units: Daltons.
  PICalc               VARCHAR2(50),    -- pI calculated from its sequence.
  PIExp                VARCHAR2(50),    -- pI value determined through experimentation.
  DataSetWID           NUMBER           -- Reference to the data set from which the entity came
);
Database Loaders
Loader tool defined for each DB to be loaded into Warehouse
Example loaders available in several languages
Loaders
  KEGG (C)
  BioCyc collection of 15 pathway DBs (C)
  Swiss-Prot (Java)
  ENZYME (Java)
Terminology
Model Organism Database (MOD) – DB describing genome and other information about an organism
Pathway/Genome Database (PGDB) – MOD that combines information about
  Pathways, reactions, substrates
  Enzymes, transporters
  Genes, replicons
  Transcription factors, promoters, operons, DNA binding sites
BioCyc – Collection of 15 PGDBs at BioCyc.org
  EcoCyc, AgroCyc, YeastCyc
Loader Architecture
[Diagram labels: Grammar for Swiss-Prot; ANTLR Parser Generator; Parser for Swiss-Prot; Swiss-Prot Datafile; SQL Insert Commands; Oracle Loadable File]
Current Warehouse Contents
                       KEGG  ENZYME  SwissProt  BsubCyc  Warehouse Total
Chemicals             7,284   2,952          0      576           10,812
Genes                 5,714       0     88,605    4,221           98,540
Organisms                60       0    103,807        1          103,868
Proteins              3,829   3,870    101,602    4,150          113,451
Enzymatic Reactions   3,509       0          0      717            4,226
Pathways              4,517       0          0      138            4,655
Pathway Reactions    36,271       0          0      530           36,801
Example Warehouse Uses
Check completeness of data sources
Count reactions in ENZYME database with (and without) associated protein sequences in SWISS-PROT database:
  3870 reactions in ENZYME
  1662 reactions (43%) with a sequence in SWISS-PROT
  2208 reactions (57%) without a sequence in SWISS-PROT
Count # of distinct non-partial EC numbers in SWISS-PROT:
  1554 distinct EC numbers in SWISS-PROT (non-partial)
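The completeness check above amounts to set arithmetic over EC numbers. The sketch below shows one way to express it in Python. The function names are hypothetical, and it assumes the EC numbers have already been pulled out of the two sources as strings. It treats an EC number as partial when any of its fields is a dash, as in 4.4.1.-.

```python
def is_partial(ec):
    """True if any field of the EC number is unspecified ('-'), e.g. '4.4.1.-'."""
    return "-" in ec.split(".")

def coverage(enzyme_ecs, sequenced_ecs):
    """Return (total, with_sequence, without_sequence) for the ENZYME-side EC numbers."""
    enzyme_set = set(enzyme_ecs)
    with_seq = enzyme_set & set(sequenced_ecs)
    without_seq = enzyme_set - with_seq
    return len(enzyme_set), len(with_seq), len(without_seq)

def distinct_non_partial(ecs):
    """Count distinct, fully specified EC numbers."""
    return len({ec for ec in ecs if not is_partial(ec)})

def percent(part, total):
    """Whole-number percentage, as quoted on the slide."""
    return round(100 * part / total)
```

Applied to the counts reported above, `percent(1662, 3870)` and `percent(2208, 3870)` give the quoted 43% and 57%, and the two counts sum to the 3870 reactions in ENZYME.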