GUS: The Genomics Unified Schema and Application Framework

Michael Saffitz
Computational Biology and Informatics Laboratory (CBIL), University of Pennsylvania Center for Bioinformatics

Presentation Overview

Schema
Application Framework
GUS In Use
GUS and Oracle
Future Work
Obtaining GUS

Motivation

Functional Genomics: The analysis of gene, RNA, and protein information and its biological function

Represent diversity of functional genomics data
Integrate and establish relationships between these data
Provide facilities for the utilization of these data and their relationships

The creation of an extensible system for the storage, integration, and analysis of functional genomic data.

GUS Overview

Relational Schema Overview

7 major divisions representing approximately 50 concepts in over 400 tables and views:

Central Dogma (Genes, RNAs, Proteins)
Sequences and Features
Reagents
Microarray Experiments
Transcription Regulation
Controlled Vocabularies
Misc: Bibliographic, External Database, Administration

Strongly typed, i.e. few key/value pairs
View-based subclassing
Extensive use of Controlled Vocabularies
Support for tracking, versioning, permissions

GUS Schemas

DoTS (Database of Transcribed Sequences): Genes, RNAs, Proteins, Sequences

RAD (RNA Abundance Database): Gene Expression and Microarray Experiments

TESS (Transcription Element Search System): Transcriptional regulation

SRes (Shared Resources): Controlled vocabularies, ontologies

Core: Non-Biological Tracking and Overhead

DoTS Schema Overview

Central Dogma
  Genes
  RNAs
  Proteins

Sequences and Features
  DNA
  Amino Acid
  Assemblies
  Alignment

Reagents
  Fingerprint
  Mapping
  Gene Traps
  Clones

DoTS Schema: Central Dogma

[Diagram: Gene, RNA, Protein]

Central Dogma of Biology: a single gene gives rise to RNAs, which in turn give rise to proteins
Foundational organizing structure

Central Dogma: Sequences

[Diagram: Gene, RNA, Protein; NASequence, AASequence]

Genes, RNAs, and Proteins all have sequences, either Nucleic Acid or Amino Acid
Sequences are stored independently of any other object

Central Dogma: Features

[Diagram: Gene, RNA, Protein; NAFeature, AAFeature; NALocation, AALocation; NASequence, AASequence]

Features are used to represent interesting regions of a sequence
Features may be hierarchical: multiple exon features share a parent gene feature
Features may have absolute or relative locations on the sequence, and be noncontiguous
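
The feature ideas above (hierarchy and noncontiguous locations) can be sketched in a few lines of Python. This is only an illustration: the `Feature` class, its fields, and the 1-based inclusive coordinate convention are assumptions made for the example, not the GUS tables or object layer, and relative locations are left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Feature:
    """Hypothetical feature: a named region made of one or more spans."""
    name: str
    # Assumed convention: 1-based, inclusive (start, end) spans on the sequence.
    locations: List[Tuple[int, int]]
    parent: Optional["Feature"] = None
    children: List["Feature"] = field(default_factory=list)

    def add_child(self, child: "Feature") -> "Feature":
        # Hierarchy: several child features may share one parent feature.
        child.parent = self
        self.children.append(child)
        return child

    def extract(self, sequence: str) -> str:
        # A noncontiguous feature is the concatenation of its spans.
        return "".join(sequence[start - 1:end] for start, end in self.locations)


sequence = "AAATTTGGGCCCAAATTT"
gene = Feature("gene", [(1, 18)])
exon1 = gene.add_child(Feature("exon 1", [(1, 6)]))
exon2 = gene.add_child(Feature("exon 2", [(13, 18)]))
spliced = Feature("spliced region", [(1, 6), (13, 18)])

print([child.name for child in gene.children])
print(spliced.extract(sequence))
```

Here the two exon features share the gene feature as parent, and the last feature covers two separate spans of the same sequence.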

Central Dogma: Instances

[Diagram: Gene, RNA, Protein; GeneInstance, RNAInstance, ProteinInstance; NAFeature, AAFeature; NALocation, AALocation; NASequence, AASequence]

Genes, RNAs, and Proteins are canonical objects with instances
Instances allow for a many-to-many relationship between objects and sequences
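
A minimal sketch of the instance idea, using an in-memory SQLite database. The table and column names below are hypothetical and heavily simplified; in particular, the real schema may route instances through features rather than linking them directly to sequences as done here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables: one canonical gene, independently stored sequences,
# and an instance table that links the two.
conn.executescript("""
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE na_sequence (
    na_sequence_id INTEGER PRIMARY KEY, organism TEXT, sequence TEXT);
CREATE TABLE gene_instance (
    gene_instance_id INTEGER PRIMARY KEY,
    gene_id INTEGER REFERENCES gene(gene_id),
    na_sequence_id INTEGER REFERENCES na_sequence(na_sequence_id));
""")
conn.execute("INSERT INTO gene VALUES (1, 'example_gene')")
conn.executemany(
    "INSERT INTO na_sequence VALUES (?, ?, ?)",
    [(10, "Organism A", "ATGGCC"), (11, "Organism B", "ATGGCA")],
)
conn.executemany(
    "INSERT INTO gene_instance (gene_id, na_sequence_id) VALUES (?, ?)",
    [(1, 10), (1, 11)],
)

# One canonical gene resolves to a sequence in each organism.
rows = conn.execute("""
    SELECT g.name, s.organism
    FROM gene g
    JOIN gene_instance gi ON gi.gene_id = g.gene_id
    JOIN na_sequence s ON s.na_sequence_id = gi.na_sequence_id
    ORDER BY s.organism
""").fetchall()
print(rows)
```

Because the link lives in its own table, a sequence could equally be attached to more than one gene without changing either the gene or the sequence row.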

Central Dogma

[Figure: Gene, Gene Instances, Gene Features, and Sequences for Organism A and Organism B]

RAD Schema Overview

Representation and management of high-throughput gene expression data

Microarray
Serial Analysis of Gene Expression (SAGE)

Supports:
  Study Design
  Platform / Array
  Assay to Quantification (Hybridization, Scanning, Feature Extraction)
  Biomaterials
  Data Preprocessing (e.g. Normalization)
  Analysis Results (e.g. Clustering, Differential Expression)
  Misc: Ontologies, Protocol, Contact, Versioning, Privacy

MGED Standards Compliant: MIAME, MAGE

TESS Schema Overview

Represents the analysis and prediction of functional transcription factor binding sites

Supports:
  Proteins, Complex
  Activity: Binding
  Model: Weight Matrices
  Analysis: Training & Learning

Integrates TRANSFAC, a public database of transcription factors
Partially designed to provide a bridge between DoTS and RAD
  e.g. RNA Sequences in DoTS and their expression levels in RAD are regulated by the transcription factors in TESS

Ontologies / Controlled Vocabularies

Explicit formal specification of terms and concepts
Represented individually to accommodate differences in structure (flat, tree, graph) and attributes (fields)
Provides an explicit relationship between a biological concept and a given controlled vocabulary

Supports about 15 in total, including:
  NCBI Taxonomy
  Gene Ontology Function Terms
  Sequence Ontology Terms
  Anatomy

GUS Evidence

Evidence may be provided for any item of any row in any table
Implemented as a relationship between that item and any other row (the evidence)

Evidence table provides linking and attributes:
  target_table, target_row, target_attribute
  evidence_table, evidence_row

Example:
  An assembly of ESTs and mRNAs containing a RefSeq uses the RefSeq as evidence to support that the assembly is full-length coding.
  An RNA’s description uses comments, similarities, or a sequence as supporting evidence.

CBIL: 9 evidence tables provide support to 11 tables in 26 combinations
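
The linking idea can be sketched with SQLite. The five evidence columns are the ones named on the slide; everything else (the `assembly` and `external_sequence` tables, their columns, the stored values, and the `evidence_for` helper) is hypothetical and exists only to make the example runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE assembly (assembly_id INTEGER PRIMARY KEY, full_length_cds INTEGER);
CREATE TABLE external_sequence (
    external_sequence_id INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE evidence (
    evidence_id      INTEGER PRIMARY KEY,
    target_table     TEXT,
    target_row       INTEGER,
    target_attribute TEXT,
    evidence_table   TEXT,
    evidence_row     INTEGER);
""")
conn.execute("INSERT INTO assembly VALUES (1, 1)")
conn.execute("INSERT INTO external_sequence VALUES (7, 'placeholder RefSeq entry')")
# Row 7 of external_sequence supports the full_length_cds value of assembly row 1.
conn.execute(
    "INSERT INTO evidence (target_table, target_row, target_attribute,"
    " evidence_table, evidence_row) VALUES (?, ?, ?, ?, ?)",
    ("assembly", 1, "full_length_cds", "external_sequence", 7),
)


def evidence_for(table, row, attribute):
    """Return (evidence_table, evidence_row) pairs supporting one item."""
    return conn.execute(
        "SELECT evidence_table, evidence_row FROM evidence"
        " WHERE target_table = ? AND target_row = ? AND target_attribute = ?",
        (table, row, attribute),
    ).fetchall()


support = evidence_for("assembly", 1, "full_length_cds")
print(support)
```

Since both ends of the link are identified by table name and row, one evidence table can connect any pair of tables without a dedicated foreign key for each combination.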

Subclassing

One-level subclassing providing conceptual clarity and query simplification for tables with core commonality and slight divergence in attributes.

e.g. NAFeature superclass has GeneFeature, ExonFeature, RNAFeature, etc. subclasses

Implemented as views on an “implementation” table containing:
  Columns common to all subclasses
  Generic columns available for subclass-specific attributes
  A column indicating the subclass a given row belongs to

There are 19 superclasses and 111 subclasses in GUS
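
A small SQLite sketch of view-based subclassing as described above. The table name `na_feature_imp`, the generic columns `string1` and `int1`, the discriminator column `subclass_view`, and the subclass attributes `gene_type` and `order_number` are all assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One implementation table: a common column (name), two generic columns,
# and a column naming the subclass each row belongs to.
conn.executescript("""
CREATE TABLE na_feature_imp (
    na_feature_id INTEGER PRIMARY KEY,
    subclass_view TEXT NOT NULL,
    name          TEXT,
    string1       TEXT,
    int1          INTEGER);
CREATE VIEW gene_feature AS
    SELECT na_feature_id, name, string1 AS gene_type
    FROM na_feature_imp WHERE subclass_view = 'GeneFeature';
CREATE VIEW exon_feature AS
    SELECT na_feature_id, name, int1 AS order_number
    FROM na_feature_imp WHERE subclass_view = 'ExonFeature';
""")
conn.executemany(
    "INSERT INTO na_feature_imp (subclass_view, name, string1, int1)"
    " VALUES (?, ?, ?, ?)",
    [
        ("GeneFeature", "gene A", "protein coding", None),
        ("ExonFeature", "exon 1", None, 1),
        ("ExonFeature", "exon 2", None, 2),
    ],
)

# Each view exposes only its own rows, with generic columns renamed.
gene_rows = conn.execute("SELECT name, gene_type FROM gene_feature").fetchall()
exon_rows = conn.execute(
    "SELECT name, order_number FROM exon_feature ORDER BY order_number"
).fetchall()
print(gene_rows)
print(exon_rows)
```

Queries against a subclass view read like queries against a dedicated table, while all rows still share one primary key space in the implementation table.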

Data Provenance

Permissions: Row-level Unix-style read/write
Versioning: Simple preservation of modified rows
Data Source: Tracking of external databases and their releases
Project Tracking: Data grouping by project
Algorithm: Tracking of algorithms, their execution and parameters, row-level impact, and result status
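
As an illustration of what row-level Unix-style read/write permissions could look like, the sketch below attaches user, group, and other flags to a row and checks them. The field names, the default values, and the `can_access` helper are assumptions, not the GUS columns or API.

```python
from dataclasses import dataclass


@dataclass
class RowPermissions:
    """Hypothetical per-row permission flags in the style of Unix file modes."""
    owner: str
    group: str
    user_read: bool = True
    user_write: bool = True
    group_read: bool = True
    group_write: bool = False
    other_read: bool = True
    other_write: bool = False


def can_access(perm, user, groups, write=False):
    # The most specific class that matches decides: owner, then group, then other.
    if user == perm.owner:
        return perm.user_write if write else perm.user_read
    if perm.group in groups:
        return perm.group_write if write else perm.group_read
    return perm.other_write if write else perm.other_read


row = RowPermissions(owner="alice", group="dots")
print(can_access(row, "alice", {"dots"}, write=True))
print(can_access(row, "bob", {"dots"}, write=True))
print(can_access(row, "carol", {"rad"}))
```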

Application Framework Overview

Provides consistent, reusable access, management, and display of data

Object Relational Layer
  Perl
  Java
Data Loading API & Plugins
Pipeline API
Web Development Kit (WDK)

[Diagram: GUS Database; Perl Object Layer; Java Object Layer; Data Loading API; Plugins; Pipeline; WDK; GUI Applications; Websites]

Object Layer

One-to-one relationship between objects and tables/views
Lightweight: centered primarily around data loading
  Limited support for object-specific logic
  Applications generally define additional object models

Provides:
  Simple constructors and full accessors
  Smart update/insert
  Parent/child relationship management
  Cascading insert and delete
  Cache management

Automation of Data Provenance and Evidence
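
One way to picture "smart update/insert" with cascading to children is the sketch below, written against SQLite. The `Row` base class, its `submit` method, and the two tables are hypothetical and are not the GUS Perl or Java object layer; the sketch only shows the idea of inserting new rows, updating changed rows, skipping unchanged rows, and passing the parent key down.

```python
import sqlite3


class Row:
    """Hypothetical one-object-per-table wrapper."""
    table = None
    key = None

    def __init__(self, **attrs):
        self.attrs = dict(attrs)
        self.saved = {}     # values as last written to the database
        self.children = []  # (child, foreign_key_column) pairs

    def add_child(self, child, fk_column):
        self.children.append((child, fk_column))
        return child

    def submit(self, conn):
        """Insert, update, or skip this row, then cascade to its children."""
        if self.key not in self.attrs:
            cols = list(self.attrs)
            sql = "INSERT INTO %s (%s) VALUES (%s)" % (
                self.table, ", ".join(cols), ", ".join("?" for _ in cols))
            cur = conn.execute(sql, [self.attrs[c] for c in cols])
            self.attrs[self.key] = cur.lastrowid
            action = "insert"
        elif self.attrs != self.saved:
            cols = [c for c in self.attrs if c != self.key]
            sql = "UPDATE %s SET %s WHERE %s = ?" % (
                self.table, ", ".join(c + " = ?" for c in cols), self.key)
            conn.execute(sql, [self.attrs[c] for c in cols] + [self.attrs[self.key]])
            action = "update"
        else:
            action = "skip"
        self.saved = dict(self.attrs)
        actions = [(self.table, action)]
        for child, fk_column in self.children:
            child.attrs[fk_column] = self.attrs[self.key]  # parent key flows down
            actions += child.submit(conn)
        return actions


class Gene(Row):
    table, key = "gene", "gene_id"


class GeneInstance(Row):
    table, key = "gene_instance", "gene_instance_id"


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE gene_instance (
    gene_instance_id INTEGER PRIMARY KEY, gene_id INTEGER, note TEXT);
""")
gene = Gene(name="example_gene")
gene.add_child(GeneInstance(note="first instance"), "gene_id")
first = gene.submit(conn)           # both rows are new
gene.attrs["name"] = "renamed_gene"
second = gene.submit(conn)          # only the gene row changed
print(first)
print(second)
```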

Data Loading API & Plugins

API provides:
  Data Provenance
  Object layer and database connectivity
  Standardized documentation
  Command line argument processing
  Logging
  Error Handling

Plugins are objects which utilize the Data Loading API

Example GUS plugins:
  Loading Data:
    Loading sequences from flat (FASTA) files
    Parsing GenBank records and storing results
  Analysis: Predicting RNA/Protein function using Gene Function Ontology
  General: Updating records from XML
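
To make the division of labour between an API and its plugins concrete, here is a rough Python sketch. The `Plugin` base class, its method names, and the `--fasta` option are invented for this example and do not reproduce the GUS plugin API; the base class handles argument processing, logging, and error handling, while the plugin only parses FASTA records (storing them in a database is left out).

```python
import argparse
import logging
import os
import tempfile


class Plugin:
    """Hypothetical base class: shared argument, logging, and error handling."""
    description = ""

    def declare_args(self, parser):
        pass

    def run(self, args, log):
        raise NotImplementedError

    def main(self, argv):
        parser = argparse.ArgumentParser(description=self.description)
        self.declare_args(parser)
        args = parser.parse_args(argv)
        log = logging.getLogger(type(self).__name__)
        try:
            result = self.run(args, log)
        except Exception:
            log.exception("plugin failed")
            raise
        log.info("plugin finished")
        return result


def parse_fasta(handle):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


class LoadFasta(Plugin):
    description = "Read sequences from a FASTA file (illustrative only)"

    def declare_args(self, parser):
        parser.add_argument("--fasta", required=True)

    def run(self, args, log):
        with open(args.fasta) as handle:
            return list(parse_fasta(handle))


# Usage: write a small FASTA file, then invoke the plugin as a command line would.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">seq1 example\nATGC\nGGTA\n>seq2\nTTAA\n")
loaded = LoadFasta().main(["--fasta", tmp.name])
os.remove(tmp.name)
print(loaded)
```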

Pipeline API

Allows for a chain of plugins and other Perl programs to be strung together for the automation of complex protocols

CBIL Example: DoTS Build
  Downloads and inserts data, assembles transcripts, produces consensus sequences, and performs annotation
  About 150 total steps
  About six weeks of processing (Human: 5-6M Sequences)
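
The chaining idea can be sketched as an ordered list of named steps that share a context and stop at the first failure. The `run_pipeline` function and the three toy steps are assumptions for illustration; the actual Pipeline API is Perl-based and is not reproduced here.

```python
def run_pipeline(steps, context):
    """Run (name, function) steps in order; stop at the first failure."""
    completed = []
    for name, step in steps:
        try:
            step(context)
        except Exception as error:
            return completed, "%s failed: %s" % (name, error)
        completed.append(name)
    return completed, None


# Toy stand-ins for real steps such as downloading, assembling, and annotating.
def download(context):
    context["sequences"] = ["ATGC", "GGTA"]


def assemble(context):
    context["consensus"] = "".join(context["sequences"])


def annotate(context):
    context["annotation"] = "length %d" % len(context["consensus"])


context = {}
completed, error = run_pipeline(
    [("download", download), ("assemble", assemble), ("annotate", annotate)],
    context,
)
print(completed, error, context["annotation"])
```

For a build of many steps running over weeks, recording which steps have completed is what would let a failed run be inspected and continued, although restart behaviour is not something the slide describes.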

Web Development Kit (WDK)

Facilitates development of data mining oriented websites:
  Multiple parameterized canned queries
  Sophisticated records
  Graphical views
  Boolean query facility
  Query history
  Session management, process pooling, flow control

Model, View, Controller (MVC) Design
  Separates application logic (Model) from website layout (View) and application flow (Controller)
  Model: XML-based queries and records
  View: JSP
  Controller: Struts

New WDK under development, scheduled for release by the end of summer
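
Parameterized canned queries and a Boolean query facility can be pictured as named SQL statements whose result sets are combined with set operations. The sketch below uses a Python dictionary and SQLite purely for illustration; the WDK itself defines queries in XML, and the `rna` table, query names, and parameters here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rna (rna_id INTEGER PRIMARY KEY, organism TEXT, length INTEGER)")
conn.executemany(
    "INSERT INTO rna VALUES (?, ?, ?)",
    [(1, "human", 500), (2, "human", 2500), (3, "mouse", 3000)],
)

# Canned queries: fixed SQL with named parameters supplied at run time.
CANNED_QUERIES = {
    "by_organism": "SELECT rna_id FROM rna WHERE organism = :organism",
    "longer_than": "SELECT rna_id FROM rna WHERE length > :min_length",
}


def run_query(name, **params):
    """Run one canned query and return the matching ids as a set."""
    return {row[0] for row in conn.execute(CANNED_QUERIES[name], params)}


human = run_query("by_organism", organism="human")
long_rnas = run_query("longer_than", min_length=1000)

# Boolean combinations of two query results.
print(sorted(human & long_rnas))  # AND
print(sorted(human | long_rnas))  # OR
print(sorted(human - long_rnas))  # AND NOT
```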

GUS In Use

GUS In Use: Versatility

Large-scale sites associated with sequencing centers
  GeneDB: Pathogen Sequencing Unit at the Sanger Institute

Lightly staffed genomics projects
  TcruziDB, CryptoDB: Kissinger Lab, University of Georgia

Data mining projects
  Multiple plant projects: Brett Tyler, Virginia Bioinformatics Institute and collaborators

Expression-based projects
  dbDirt: Allen Okey, University of Toronto

Bioinformatics Core Facilities
  University of Pennsylvania Bioinformatics Core Facility

GUS In Use: Modularity

Several instances of GUS exclusively use RAD or DoTS
Allows for a small initial investment of time and energy, while providing significant potential for future growth

GUS and Relational Database Systems

Oracle and PostgreSQL Support
  Oracle supplanted Sybase in 2001
  PostgreSQL added in 2004

GUS-compatible RDBMSs require:
  Schemas
  Views
  Sequences
  Primary and Foreign Key Constraints

“Enhanced” functionality when using Oracle
  Under development: Workspace Manager Integration
  Implemented through database module, triggers, and GUS Projects

GUS: Adding PostgreSQL compatibility

Database module provides alternate SQL for use in the object layer:
  SQL-Function Calls
  Date functions
  Sequences
  Metadata:
    Constraint relations
    Table attributes
    Table definition views

Third-party utilities (SQL::Translator) and hand-editing to convert table definitions from Oracle to PostgreSQL
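
A database module of this kind can be pictured as one small class per database that returns the vendor-specific SQL, so the calling code never embeds a dialect. The class and method names below are hypothetical; the SQL fragments are standard Oracle and PostgreSQL syntax for reading a sequence and the current date.

```python
class OracleDialect:
    """Hypothetical database module for Oracle."""

    def next_id_sql(self, sequence_name):
        return "SELECT %s.NEXTVAL FROM dual" % sequence_name

    def current_date_sql(self):
        return "SYSDATE"


class PostgresDialect:
    """Hypothetical database module for PostgreSQL."""

    def next_id_sql(self, sequence_name):
        return "SELECT nextval('%s')" % sequence_name

    def current_date_sql(self):
        return "now()"


def insert_sql(dialect, table):
    # Calling code asks the dialect for the date expression instead of
    # hard-coding one vendor's function name.
    return "INSERT INTO %s (id, modification_date) VALUES (?, %s)" % (
        table, dialect.current_date_sql())


for dialect in (OracleDialect(), PostgresDialect()):
    print(dialect.next_id_sql("gene_sq"))
    print(insert_sql(dialect, "gene"))
```

The sequence name `gene_sq`, the `modification_date` column, and the `?` placeholder style are placeholders chosen for the example.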

GUS & GUS Projects at CBIL

Multiple GUS-based projects sharing the same database instance
Project-specific extensions use their own schemas for application-specific functionality
These extensions may use Oracle-specific functionality:
  Query optimization, hints
  Materialized views
  Advanced storage: table compression
  Database links

New concepts are introduced within projects and migrate to GUS

GUS Future Development

Extension of GUS to include proteomics and other domains (e.g. in situ hybridization)
Improved distribution: documentation, installation, API
Oracle 10g Migration / Integration:
  Integrated analysis: BLAST, Regex
  Data loading: Upsert
  Implementing obvious performance features

Workspace Manager Support

GUS & Workspace Manager

Multiple GUS-based projects, all sharing the same instance
Each project maintains its own release schedule and data build process
  Project data releases range from weekly to once every six weeks
  Data is unreliable during the build process

Oracle Workspace Manager
  Each project may manipulate data independently
  Upon completion of a build cycle, the data is committed back to the primary workspace
  Projects may release at any time because only the primary workspace is made available

Workspace Manager provides functionality to GUS by allowing more powerful manipulation of data among many concurrent projects and researchers

Obtaining GUS

www.gusdb.org
Open Source
Documentation: Wiki, Installation Guides
SourceForge: gusdev
  Mailing Lists
  Trackers

Coming Soon: Demonstration Instance

Acknowledgements

Steve Fischer
Jonathan Schug
Chris Stoeckert

The Computational Biology and Informatics Laboratory Group

GUS is funded by grants from the National Institutes of Health