BioWarehouse: A Bioinformatics Database Warehouse
Peter D. Karp, Thomas J. Lee, Valerie Wagner
[Figure: BioWarehouse, hosted on Oracle (10g) or MySQL (4.1.11), populated from UniProt, ENZYME, Genbank, NCBI Taxonomy, BioCyc, BioPAX, GO, MAGE-ML, KEGG, CMR, and Eco2DBase]
Overview
• Motivations for BioWarehouse
  • Facile programmatic access to individual DBs
  • Capture locally produced data
  • Database integration
• BioWarehouse technical approach
  • Loaders
  • Schema overview
• Applications of BioWarehouse
• Join the BioWarehouse project
Motivations: Computing with Individual Databases
• Most bioinformatics DBs are not queryable via a database management system
  • Whether accessed via the Internet or installed locally
• Having relational database versions of individual bioinformatics DBs facilitates complex queries against individual DBs
• What is the alternative? Perl scripts? Awkward to program, slow to execute
Motivations: Manage/Integrate Locally Produced Data
• Need schema to capture locally produced data
• Integrate locally produced data with public databases
Why is the Multidatabase Approach Alone Not Sufficient?
• Multidatabase query approaches assume databases are in a queryable DBMS
• Most sites that do operate DBMSs do not allow remote query access because of security and loading concerns
• Users want to control data stability
• Users want to control speed of their queries
• Multidatabase query systems are limited by Internet bandwidth and by the speed of the slowest data source that they query
• Users need to capture, integrate and publish locally produced data of different types
• Multidatabase and Warehouse approaches are complementary
Key Challenges / Results for BioWarehouse
• Design schema that accurately captures the contents of source DBs
• Design schema that is understandable and scalable
• Address poorly-specified syntax & semantics of source DBs
• Balance the preservation of source data with mapping into common semantics
• Clearly document data mappings performed by loaders
Technical Approach
• Multi-platform support: Oracle (10g) and MySQL
• Schema support for multitude of bioinformatics datatypes
• Create loaders for public bioinformatics DBs
  • Parse file format of the source DB
    • Some loaders parse interchange formats (BioPAX)
  • Semantic transformations
  • Insert DB contents into warehouse tables
BMC Bioinformatics 7:170 2006
http://bioinformatics.ai.sri.com/biowarehouse/
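The three loader steps on this slide (parse the source format, apply semantic transformations, insert into warehouse tables) can be sketched roughly as follows. This is a minimal illustration, not BioWarehouse loader code: the real loaders are written in C and Java, the record layout is only loosely modeled on an attribute-value flat file, the `Chemical` table with its `WID`, `Name` and `Formula` columns is an assumed schema, and an in-memory SQLite database stands in for Oracle or MySQL.

```python
import sqlite3

# Hypothetical attribute-value input: "ATTRIBUTE - value" lines,
# with "//" closing each record. The sample records are made up.
SAMPLE = """\
UNIQUE-ID - WATER
COMMON-NAME - water
FORMULA - H2O
//
UNIQUE-ID - PROTON
COMMON-NAME - H+
//
"""


def parse_records(text):
    """Step 1: parse the flat file into one dict per record."""
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if line == "//":
            if current:
                records.append(current)
            current = {}
        elif " - " in line:
            key, value = line.split(" - ", 1)
            current[key] = value
    if current:  # tolerate a final record with no terminator
        records.append(current)
    return records


def transform(record):
    """Step 2: map source attributes onto the (assumed) warehouse columns."""
    return (record["COMMON-NAME"], record.get("FORMULA"))


def load(conn, records):
    """Step 3: insert the transformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS Chemical ("
        "WID INTEGER PRIMARY KEY, Name TEXT NOT NULL, Formula TEXT)"
    )
    conn.executemany(
        "INSERT INTO Chemical (Name, Formula) VALUES (?, ?)",
        [transform(r) for r in records],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, parse_records(SAMPLE))
rows = conn.execute("SELECT Name, Formula FROM Chemical ORDER BY WID").fetchall()
```

Keeping the three steps in separate functions makes the semantic mapping in `transform` easy to document on its own, which matches the stated goal of clearly documenting the mappings that loaders perform.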
Technical Approach
• Provide Warehouse query access mechanisms
  • SQL queries via ODBC, JDBC, OAA
• High-quality documentation for schema and loader transformations
• No graphical query interface yet
How to Use BioWarehouse?
• Create your own local instance of BioWarehouse
• Query an existing BioWarehouse instance, such as publichouse
PublicHouse Server
• Publicly queryable BioWarehouse server operated by SRI
• Manages a set of biological DBs constructed using BioWarehouse
  • CMR
  • BioCyc Pathway/Genome DBs
  • ENZYME
  • NCBI Taxonomy
• Will be transitioning publichouse to contain
  • BioCyc
  • E. coli gene expression, proteomics, and ChIP-chip datasets
• See: http://bioinformatics.ai.sri.com/biowarehouse/publichouse.html
• Note: publichouse will become a BioCyc/EcoliHub BioWarehouse server
Host: publichouse.sri.com
Port: 3306
Database: biospice
BioWarehouse Schema
• Manages many bioinformatics datatypes simultaneously
  • Pathways, Reactions, Chemicals
  • Proteins, Genes, Replicons
  • Sequences, Sequence Features
  • Gene expression data
  • Protein expression data
  • Flow cytometry data
  • Organisms, Taxonomic relationships
  • Computations (sequence matches)
  • Citations, Controlled vocabularies
  • Links to external databases
• Each type of warehouse object implemented through one or more relational tables
BioWarehouse Schema
• Manages multiple datasets simultaneously
  • Dataset = Single version of a database
• Version comparison
  • Multiple software tools or experiments that require access to different versions
  • Each dataset is a warehouse entity
  • Every warehouse object is registered in a dataset
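Because every warehouse object is registered in a dataset, two versions of the same source database can sit side by side and be compared with plain SQL. The sketch below illustrates the idea under stated assumptions: the `DataSet` and `Protein` tables, the `DataSetWID` column, and all rows are invented for illustration, and SQLite stands in for the production DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DataSet (WID INTEGER PRIMARY KEY, Name TEXT, Version TEXT);
-- Assumed layout: each object row points at the dataset it belongs to.
CREATE TABLE Protein (WID INTEGER PRIMARY KEY, Name TEXT,
                      DataSetWID INTEGER NOT NULL REFERENCES DataSet(WID));
INSERT INTO DataSet VALUES (1, 'ExampleDB', '1.0'), (2, 'ExampleDB', '2.0');
INSERT INTO Protein VALUES
  (10, 'protein A', 1), (11, 'protein B', 1),
  (12, 'protein A', 2), (13, 'protein B', 2), (14, 'protein C', 2);
""")


def names_added(conn, old_dataset_wid, new_dataset_wid):
    """Return protein names present in the new dataset but not in the old one."""
    sql = """
    SELECT Name FROM Protein WHERE DataSetWID = ?
    EXCEPT
    SELECT Name FROM Protein WHERE DataSetWID = ?
    ORDER BY Name
    """
    return [row[0] for row in conn.execute(sql, (new_dataset_wid, old_dataset_wid))]


added = names_added(conn, 1, 2)
```

Swapping the two arguments gives the names that were dropped between versions, so a single query template covers both directions of the comparison.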
BioWarehouse Schema
• Different databases storing the same biological datatypes are coerced into the same warehouse tables
• Design of most datatypes inspired by multiple databases
• Representational tricks to decrease schema bloat
  • Single space of primary keys
  • Single set of satellite tables such as for synonyms, citations, comments, etc.
• Schema size
  • Core schema: 70 tables
  • Gene expression schema: 109 tables
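The two representational tricks named on this slide work together: if primary keys are unique across the whole warehouse rather than per table, one satellite table can serve objects of every type. The following is a sketch of that idea only. The names `WIDCounter`, `SynonymTable`, `OtherWID` and `Syn`, the two object tables, and the sample rows are assumptions made for illustration and are not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One counter hands out keys for every object table,
-- so a key value identifies exactly one object warehouse-wide.
CREATE TABLE WIDCounter (NextWID INTEGER NOT NULL);
INSERT INTO WIDCounter VALUES (1);
CREATE TABLE Protein  (WID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Chemical (WID INTEGER PRIMARY KEY, Name TEXT);
-- One satellite table holds synonyms for objects of any type.
CREATE TABLE SynonymTable (OtherWID INTEGER NOT NULL, Syn TEXT NOT NULL);
""")


def next_wid(conn):
    """Allocate the next warehouse-wide identifier."""
    (wid,) = conn.execute("SELECT NextWID FROM WIDCounter").fetchone()
    conn.execute("UPDATE WIDCounter SET NextWID = NextWID + 1")
    return wid


def add_object(conn, table, name, synonyms=()):
    """Insert an object and its synonyms; returns the new identifier."""
    if table not in ("Protein", "Chemical"):  # table name is interpolated below
        raise ValueError(f"unknown object table: {table}")
    wid = next_wid(conn)
    conn.execute(f"INSERT INTO {table} (WID, Name) VALUES (?, ?)", (wid, name))
    conn.executemany(
        "INSERT INTO SynonymTable (OtherWID, Syn) VALUES (?, ?)",
        [(wid, s) for s in synonyms],
    )
    return wid


def synonyms_of(conn, wid):
    """Look up synonyms without needing to know the object's type."""
    sql = "SELECT Syn FROM SynonymTable WHERE OtherWID = ? ORDER BY Syn"
    return [row[0] for row in conn.execute(sql, (wid,))]


protein_wid = add_object(conn, "Protein", "example protein", ["ExP"])
chemical_wid = add_object(conn, "Chemical", "water", ["H2O", "oxidane"])
```

Without the shared key space, each object type would need its own synonym, citation and comment tables, which is the schema bloat the slide refers to.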
BioWarehouse Loaders
| Database | Loader Language | Input Format | Comments |
|---|---|---|---|
| BioCyc | C | BioCyc attribute-value | Pathway/Genome Databases |
| BioPAX | Java | BioPAX format | Protein interactions data |
| CMR | C | CMR column-delimited | Comprehensive Microbial Resource: 350+ microbial genomes |
| Eco2DBase | Java | Relational table dumps | E. coli 2-D gel data |
| ENZYME | Java | ENZYME attribute-value | Enzyme Commission set of reactions |
| Genbank | Java | XML derived from ASN.1 | Bacterial subset of Genbank |
| Gene Ontology | Java | OBO XML | Hierarchical controlled vocabulary |
| KEGG | C | KEGG format | Metabolic pathway data |
| MAGE-ML | Java | MAGE-ML format | Microarray gene expression data |
| NCBI Taxonomy | C | Taxonomy format | Organism taxonomy |
| UniProt | Java | UniProt XML | SWISS-PROT and TrEMBL |
BioWarehouse Schema Overview
• Schema manages many bioinformatics datatypes, including links to external databases
• Main biological objects: Taxon, BioSource, NucleicAcid, SubSequence, Gene, Feature, Protein, Function, EnzymaticReaction, Reaction, Pathway, Chemical
• Each type of warehouse object implemented through one or more relational tables (70)
Pathway Data
• BioCyc
• KEGG
• BioPAX format
  • Physical interaction data only
• ENZYME
  • Populates these tables: Reaction, Protein, Chemical
Pathway Schema Neighborhood
[Schema diagram of the tables in the pathway neighborhood: Reaction, Product, Substrate, Chemical, EnzymaticReaction, Protein, PathwayReaction, Pathway]
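A typical query over this neighborhood walks from a pathway to its reactions and on to the proteins that catalyze them. The sketch below uses the table names from the diagram, but the column names (`WID`, `PathwayWID`, `ReactionWID`, `ProteinWID`) and every row are assumptions chosen for illustration; the slides do not show the actual column definitions or how the tables are linked.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Pathway  (WID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Reaction (WID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Protein  (WID INTEGER PRIMARY KEY, Name TEXT);
-- Assumed link tables: pathway-to-reaction and reaction-to-enzyme.
CREATE TABLE PathwayReaction   (PathwayWID INTEGER, ReactionWID INTEGER);
CREATE TABLE EnzymaticReaction (WID INTEGER PRIMARY KEY,
                                ReactionWID INTEGER, ProteinWID INTEGER);
INSERT INTO Pathway  VALUES (1, 'toy pathway');
INSERT INTO Reaction VALUES (10, 'step 1'), (11, 'step 2');
INSERT INTO Protein  VALUES (20, 'enzyme X'), (21, 'enzyme Y');
INSERT INTO PathwayReaction   VALUES (1, 10), (1, 11);
INSERT INTO EnzymaticReaction VALUES (30, 10, 20), (31, 11, 21), (32, 11, 20);
""")

ENZYMES_OF_PATHWAY = """
SELECT r.Name, p.Name
FROM Pathway pw
JOIN PathwayReaction pr   ON pr.PathwayWID = pw.WID
JOIN Reaction r           ON r.WID = pr.ReactionWID
JOIN EnzymaticReaction er ON er.ReactionWID = r.WID
JOIN Protein p            ON p.WID = er.ProteinWID
WHERE pw.Name = ?
ORDER BY r.Name, p.Name
"""


def enzymes_of_pathway(conn, pathway_name):
    """Return (reaction name, enzyme name) pairs for one pathway."""
    return conn.execute(ENZYMES_OF_PATHWAY, (pathway_name,)).fetchall()


result = enzymes_of_pathway(conn, "toy pathway")
```

Since sources such as BioCyc and KEGG load into the same tables, a query of this shape would not need to change when the source database does.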
Pathway Data: BioCyc Loader
• Each BioCyc DB can be loaded into a separate BioWarehouse dataset, or into one common dataset
• Loads data from 13 BioCyc source files:
  • pubs.dat [not present for all BioCyc PGDBs]
  • compounds.dat
  • proteins.dat
  • protseq.fasta [not present for all BioCyc PGDBs]
  • transunits.dat [not present for all BioCyc PGDBs]
  • genes.dat
  • promoters.dat [not present for the MetaCyc PGDB]
  • terminators.dat [not present for the MetaCyc PGDB]
  • dnabindsites.dat [not present for the MetaCyc PGDB]
  • reactions.dat
  • enzrxns.dat
  • regulation.dat
  • pathways.dat
• http://biowarehouse.ai.sri.com/repos/biocyc-loader/flatfile/doc/index.html
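Before running a loader over a PGDB directory, it can help to check which of the 13 source files are actually there. The sketch below is a hypothetical pre-flight check, not part of the BioCyc loader: it assumes that the files annotated as "not present for ..." are optional and that the remaining seven are required, which the slide implies but does not state.

```python
import tempfile
from pathlib import Path

# Assumption: files without a "not present for ..." note are required.
REQUIRED = [
    "compounds.dat", "proteins.dat", "genes.dat", "reactions.dat",
    "enzrxns.dat", "regulation.dat", "pathways.dat",
]
# Assumption: files carrying such a note may legitimately be absent.
OPTIONAL = [
    "pubs.dat", "protseq.fasta", "transunits.dat",
    "promoters.dat", "terminators.dat", "dnabindsites.dat",
]


def check_source_files(directory):
    """Return (files present, required files missing) for a PGDB directory."""
    base = Path(directory)
    present = [name for name in REQUIRED + OPTIONAL if (base / name).is_file()]
    missing_required = [name for name in REQUIRED if not (base / name).is_file()]
    return present, missing_required


# Demonstration on a throwaway directory that lacks pathways.dat.
with tempfile.TemporaryDirectory() as tmp:
    for name in REQUIRED[:-1] + ["pubs.dat"]:
        (Path(tmp) / name).touch()
    present, missing = check_source_files(tmp)
```

A loader wrapper could stop early when `missing` is non-empty and simply skip any optional file that does not appear in `present`.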
BioCyc Loader: Chemical Compounds
BioCyc Loader: Reactions
BioCyc Loader: Products
Comparative Analysis with BioWarehouse: Compare MetaCyc to KEGG
• KEGG pathways are larger than MetaCyc pathways
• MetaCyc has a larger number of pathways
• Which database has a larger collection of pathway data?
• Prior result: KEGG pathways are on average 4.2 times larger than MetaCyc pathways
“The outcomes of pathway database computations depend on pathway ontology”, Green and Karp, Nucleic Acids Research 2006:34 3687-97
MetaCyc contains 5.1 times as many pathways as does KEGG
MetaCyc contains 1.4 times as many reactions within its pathways as does KEGG
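Once two pathway databases are loaded as separate datasets, comparisons of this kind reduce to aggregate queries. The sketch below shows one way such a query might look. The schema (`DataSet`, `Pathway.DataSetWID`, `PathwayReaction`) is assumed, and the two toy datasets are invented, so the numbers it produces have nothing to do with the MetaCyc and KEGG figures quoted on the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DataSet (WID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Pathway (WID INTEGER PRIMARY KEY, DataSetWID INTEGER);
CREATE TABLE PathwayReaction (PathwayWID INTEGER, ReactionWID INTEGER);
INSERT INTO DataSet VALUES (1, 'toy-A'), (2, 'toy-B');
-- toy-A: three small pathways; toy-B: one larger pathway.
INSERT INTO Pathway VALUES (100, 1), (101, 1), (102, 1), (200, 2);
INSERT INTO PathwayReaction VALUES
  (100, 1), (100, 2), (101, 2), (101, 3), (102, 4), (102, 5),
  (200, 11), (200, 12), (200, 13), (200, 14);
""")

# Per dataset: number of pathways, number of distinct reactions that occur
# in at least one pathway, and mean reactions per pathway. Pathways with no
# reactions are dropped by the inner join and so are not counted here.
STATS = """
SELECT d.Name,
       COUNT(DISTINCT pw.WID)                  AS pathways,
       COUNT(DISTINCT pr.ReactionWID)          AS reactions_in_pathways,
       1.0 * COUNT(*) / COUNT(DISTINCT pw.WID) AS mean_pathway_size
FROM DataSet d
JOIN Pathway pw         ON pw.DataSetWID = d.WID
JOIN PathwayReaction pr ON pr.PathwayWID = pw.WID
GROUP BY d.Name
ORDER BY d.Name
"""

stats = {name: (n_pathways, n_reactions, size)
         for name, n_pathways, n_reactions, size in conn.execute(STATS)}
```

Counting distinct reactions matters because a reaction shared by several pathways would otherwise be counted once per pathway, which inflates the total for a database with many small, overlapping pathways.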
Gene Expression Data in BioWarehouse
• Goals
  • Experimentalist loads locally produced data into BioWarehouse
  • Computational biologist loads remotely downloaded data into BioWarehouse
  • For processing and/or integration with other data
• Source data format: MAGE-ML
  • http://mged.org/
• BioWarehouse and ArrayExpress are the only MAGE-ML compliant data models we could find
Our Approach
• Translate MAGE-OM into a relational database schema
  • One table per class gives too large a schema (ArrayExpress)
  • Instead, one table per inheritance hierarchy, which reduces table count by half
  • Use MAGE SDK tool for XML->Object; use Castor for Object->Relational mapping
• Merge the resulting schema into BioWarehouse schema to eliminate redundancy
• Result: 109 tables
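The "one table per inheritance hierarchy" mapping can be illustrated with a small hand-written example. In this sketch a base class and two subclasses share a single table, with a discriminator column recording the concrete class. It is only an illustration of the mapping style: the project used Castor for this step, and the class attributes and column names here (`ClassName`, `Sequence`, `FeatureRow`, `FeatureCol`) are assumptions rather than the MAGE-OM definitions.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class DesignElement:          # base of the (illustrative) hierarchy
    name: str


@dataclass
class Reporter(DesignElement):
    sequence: str = ""


@dataclass
class Feature(DesignElement):
    row: int = 0
    col: int = 0


conn = sqlite3.connect(":memory:")
# One table for the whole hierarchy: ClassName is the discriminator, and
# subclass-specific columns stay NULL for rows of other classes.
conn.execute("""
CREATE TABLE DesignElement (
    WID        INTEGER PRIMARY KEY,
    ClassName  TEXT NOT NULL,
    Name       TEXT NOT NULL,
    Sequence   TEXT,
    FeatureRow INTEGER,
    FeatureCol INTEGER
)""")


def save(conn, obj):
    """Store any object of the hierarchy in the shared table."""
    conn.execute(
        "INSERT INTO DesignElement (ClassName, Name, Sequence, FeatureRow, FeatureCol) "
        "VALUES (?, ?, ?, ?, ?)",
        (type(obj).__name__, obj.name, getattr(obj, "sequence", None),
         getattr(obj, "row", None), getattr(obj, "col", None)),
    )


def load_all(conn):
    """Rebuild objects, using the discriminator to pick the class."""
    objects = []
    sql = ("SELECT ClassName, Name, Sequence, FeatureRow, FeatureCol "
           "FROM DesignElement ORDER BY WID")
    for class_name, name, sequence, row, col in conn.execute(sql):
        if class_name == "Reporter":
            objects.append(Reporter(name, sequence))
        elif class_name == "Feature":
            objects.append(Feature(name, row, col))
        else:
            objects.append(DesignElement(name))
    return objects


originals = [Reporter("rep-1", "ACGT"), Feature("feat-1", 3, 7)]
for obj in originals:
    save(conn, obj)
restored = load_all(conn)
table_count = conn.execute(
    "SELECT COUNT(*) FROM sqlite_master WHERE type = 'table'").fetchone()[0]
```

A one-table-per-class mapping would need three tables for this hierarchy; the trade-off for using one is a wider table with nullable columns.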
ChIP-Chip Data
• Current project to extend MAGE-ML loader and BioWarehouse to accommodate ChIP-chip data
• Metadata, gene expression data, transcription factor(s), antibodies
Protein Interactions Data
• Schema support
• Load via BioPAX
Contribute to BioWarehouse
• Open Source project
• Ways to contribute:
  • Maintain/update an existing loader
  • Implement a new loader
  • Port to new compiler or platform or DBMS
Acknowledgments
• Funded by
  • NIH/NIGMS EcoliHub project
  • NIH/NIGMS BioCyc project
  • DARPA Bio-SPICE program
SRI Colleagues
• Valerie Wagner, Tom Lee, Tomer Altman
Learn more
• http://bioinformatics.ai.sri.com/biowarehouse/
• BMC Bioinformatics 7:170 2006