GMODTools, Argos & cetera A Replicable Genome infOrmation System of Common Components
GMOD Meeting, Oct. 2004
Don Gilbert, gilbertd@indiana.edu
Genome DB building blocks
• GMOD Tools for public data releases
• Argos framework for genome databases
• LuceGene fast document/object search
• Genome Directory System for genome data mining
• Unified Gene Pages (XML, web page)
GMOD Tools: Bulkfiles
cvs.sourceforge.net:/cvsroot/gmod checkout schema/GMODTools
Genome Data Tools
• Support common data update and public release tasks.
• GmodTools to load and extract reagent sequences (EST, cDNA, GSS) to/from Chado databases.
• GMOD Bulkfiles creates bulk genome sequence and feature files for public distribution from a Chado database.
• Citrina is a workflow tool to automate external databank updates, such as GenBank and Gene Ontologies.
12 New genomes to go
• Need to publish numerous new genomes
• Bulk files are standard public access:
  • Sequence (fasta, …), features (gff, …), searches (BLAST, …)
• 11 new Drosophila genomes; Daphnia genome; many more
• Chado database; XORT & other GMOD Tools to export data
• http://flybase.net/species
Bulkfiles
• Build release files from Chado DB
• Standardized files, headers
• DNA - fasta, raw
• Features - GFF3, gnomap
• BLAST indices
• Lucene file indices
• Config files (blast, gbrowse, …)
Bulkfiles - BLAST indices
Bulkfiles - Map features
Bulkfiles OUTPUTS
• DNA files (full chromosomes) in raw and fasta formats
• GFF (v3) and FFF (used in FlyBase) feature files
• Fasta sequence for each feature set, with standardized headers (ID, names, db_xref, …) from feature files
• NCBI BLAST indices & configs
• GBrowse config files with feature sets matching db
• Others added as needed (more easily than before)
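To make the "standardized headers" idea concrete, here is a minimal Python sketch of writing one feature sequence as a fasta record. The header layout (`ID=…; name=…; db_xref=…`), the function name, and the example identifiers are assumptions for illustration only; the actual Bulkfiles header format is not shown on the slide.

```python
def fasta_record(feature_id, sequence, name="", db_xrefs=(), width=60):
    """Format one feature as a fasta record with a key=value header.

    The header layout here is hypothetical, not the Bulkfiles format.
    """
    parts = [f"ID={feature_id}"]
    if name:
        parts.append(f"name={name}")
    if db_xrefs:
        parts.append("db_xref=" + ",".join(db_xrefs))
    header = f">{feature_id} " + "; ".join(parts)
    # Wrap the sequence at a fixed line width, as fasta files usually do.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + lines) + "\n"


# Hypothetical feature: the ID, name and cross-reference are made up.
record = fasta_record("FBgn0000001", "ACGT" * 20, name="geneA", db_xrefs=["GB:X1"])
print(record, end="")
```

Keeping the same header keys across every feature set is what lets downstream tools (BLAST reports, search indices) parse all files the same way.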
Bulkfiles Logic
• Organism/database logic (mostly) in configuration files
• Dump all Chado db features using simple SQL to common intermediate table files
• Feature info is simple: type, location, name/id, and a few attributes (db_xrefs, …; GFF-like)
• Easier checking of SQL to get all features desired
• Fast (30 - 60 min for full fly genome)
• Postprocess table files to create public use formats
• Tested with FOUR different Chado dbs (Dmel, Dmel_hetero, Dpse_Dmel, and SGDLite)
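The postprocess step can be pictured with a small Python sketch that turns one intermediate table row into a GFF3 line. The tab-delimited column order of the row (type, sequence, start, end, strand, id, name, db_xrefs) and the `source` value are assumptions made for this example; the real Bulkfiles table layout may differ. The nine GFF3 columns and the `ID`, `Name`, and `Dbxref` attribute tags come from the GFF3 specification.

```python
from urllib.parse import quote


def _esc(value):
    # GFF3 attribute values must escape ';', '=', '&' and ',' (URL-style).
    return quote(value, safe=":/ ")


def table_row_to_gff3(row, source="example"):
    """Convert one intermediate table row (hypothetical layout) to GFF3."""
    ftype, seqid, start, end, strand, fid, name, dbxrefs = row.rstrip("\n").split("\t")
    attrs = [f"ID={_esc(fid)}"]
    if name:
        attrs.append(f"Name={_esc(name)}")
    if dbxrefs:
        attrs.append("Dbxref=" + ",".join(_esc(x) for x in dbxrefs.split(",")))
    # Columns: seqid, source, type, start, end, score, strand, phase, attributes
    return "\t".join([seqid, source, ftype, start, end, ".", strand, ".", ";".join(attrs)])


# Hypothetical row: identifiers are made up for the example.
row = "gene\t2L\t100\t200\t+\tFBgn0000001\tgeneA\tGB:X1,GB:X2"
print(table_row_to_gff3(row))
```

Because the intermediate row is this simple, the same table can feed several writers (GFF, fasta headers, map files) without going back to the database.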
Bulkfiles stages
• Postprocess table files in stages
• Recode feature “oddities” to public view needs
• Better debugging of steps in the process
• Engineering time and configuration here
• Stages are loosely coupled; go back, tweak configurations, re-run partially as needed.
• Convert common feature table + DNA to several output formats in one step.
• Combine features from several dbs and other sources like cytology here.
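The loosely coupled stages described above can be sketched in Python as follows. This is an illustration of the idea, not the Bulkfiles code: the stage names, the `state` dictionary (standing in for intermediate files on disk), and the `hidden_types` setting are all hypothetical.

```python
def dump_tables(state):
    # Stand-in for the SQL dump of Chado features to intermediate table files.
    state["dumps"] = state.get("dumps", 0) + 1
    state["table"] = [
        ("gene", "2L", 100, 200, "geneA"),
        ("oddity", "2L", 150, 160, "x1"),
    ]


def recode_features(state):
    # Drop feature types that the public view should not show.
    hidden = state["config"]["hidden_types"]
    state["public"] = [row for row in state["table"] if row[0] not in hidden]


def write_formats(state):
    # Convert the common feature table to several outputs in one step.
    state["outputs"] = {
        "names": [row[4] for row in state["public"]],
        "locations": [f"{row[1]}:{row[2]}..{row[3]}" for row in state["public"]],
    }


STAGES = [("dump", dump_tables), ("recode", recode_features), ("write", write_formats)]


def run_stages(state, start_at=None):
    """Run stages in order, optionally resuming from a named stage."""
    names = [name for name, _ in STAGES]
    first = names.index(start_at) if start_at else 0
    for _, stage in STAGES[first:]:
        stage(state)
    return names[first:]


state = {"config": {"hidden_types": set()}}
run_stages(state)                              # full run
state["config"]["hidden_types"] = {"oddity"}   # tweak configuration
rerun = run_stages(state, start_at="recode")   # partial re-run, no new dump
print(rerun, state["outputs"]["names"])
```

The slow database dump runs once; configuration changes only cost a re-run of the later, cheaper stages.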
Bulkfiles config example
<opt name="fbbulk-r3" relid="3"
  ROOT="${GMOD_ROOT}/" TMP="${GMOD_ROOT}/tmp"
  datadir="genomes/Drosophila_melanogaster">
  <title>FlyBase Chado DB r3.2</title>
  <about>
    Configuration for feature and sequence bulk files from FlyBase chado data release 3.2.1
  </about>
  <org>dmel</org>
  <species>Drosophila melanogaster</species>
  <doc name="README">
    D. melanogaster euchromatin genome data from
    FlyBase Release 3.2.1. See http://flybase.net/annot/dmel_r3.2.1.txt
  </doc>
  <include>fbreleases</include>
  <db driver="Pg" name="dmel_chado" host="localhost" port="7302" user="" password="" />
  <idpattern>(FBgn|FBti)\d+</idpattern>
  <include>filesets</include>
  <include>featuresets</include>
</opt>
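As a rough illustration of how such a configuration might be consumed, the Python sketch below parses a trimmed copy of the example, expands `${GMOD_ROOT}`, and applies the `idpattern` regular expression. Bulkfiles itself is not written in Python; the `load_config` helper, the treatment of `${GMOD_ROOT}` as a simple variable substitution, and the test identifiers are assumptions for this example.

```python
import os
import re
import string
import xml.etree.ElementTree as ET

# Trimmed copy of the config example above (the <include> elements are omitted).
CONFIG = r"""<opt name="fbbulk-r3" relid="3" ROOT="${GMOD_ROOT}/"
  datadir="genomes/Drosophila_melanogaster">
  <org>dmel</org>
  <db driver="Pg" name="dmel_chado" host="localhost" port="7302" user="" password="" />
  <idpattern>(FBgn|FBti)\d+</idpattern>
</opt>"""


def load_config(xml_text, env=None):
    """Parse the config and return the settings a dump script would need."""
    env = os.environ if env is None else env
    root = ET.fromstring(xml_text)
    db = root.find("db").attrib
    return {
        "name": root.get("name"),
        # Assumption: ${GMOD_ROOT} is filled in from the environment.
        "root": string.Template(root.get("ROOT")).safe_substitute(env),
        "org": root.findtext("org"),
        "db": dict(db, port=int(db["port"])),
        "idpattern": re.compile(root.findtext("idpattern")),
    }


cfg = load_config(CONFIG, env={"GMOD_ROOT": "/bio/gmod"})
print(cfg["root"], cfg["db"]["name"], cfg["db"]["port"])
# The idpattern accepts FBgn/FBti identifiers and rejects other names.
print(bool(cfg["idpattern"].fullmatch("FBgn0000490")), bool(cfg["idpattern"].fullmatch("CG1234")))
```

Keeping the database connection and ID pattern in the config is what lets the same scripts run against different Chado databases.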
ARGOS http://www.gmod.org/argos
ARGOS Genome DBs
ARGOS Focus
• Automate genome database install & update
• Eliminate { fetch, compile, install, configure, … } cycle
• Developers test, compile, config once; others copy/run
• Start new project quickly - copy existing project & edit to suit
• Clone servers easily (local cluster; global mirrors; company/lab; laptop)
• Compatible with most GMOD projects
• Secure collaborative genome db features
• Goal: easy for biologists to use with minimal informatics expertise
ARGOS Components
| Section | Components |
| --- | --- |
| FlyBase (e.g.) | Data, database indices, documents, web tools specific to genome service |
| Java | Chado database tools, genome sequence reports, LuceGene search, Ant build system, database interfaces, XML tools, Tomcat web server, Axis web services |
| Perl | BioPerl, GBrowse, Chado database tools, Cmap comparative maps, database interfaces, Web tools, XML tools |
| Servers | BLAST (NCBI), Apache web server, PostgreSQL, and BerkeleyDB databases |
| Systems | Compiled portions for supported operating systems |
| Install & Root | Common configurations, web server, installation scripts and instructions |
ARGOS INSTALL
ARGOS INSTALL
Edit wFleaBase
LuceGene (‘Lucy Jean’) for Genome Information Search and Retrieval
LuceGene
Document/Object Search and Retrieval in Genome Databases
• High-volume data search and retrieval system for genomics and bioinformatics databases
• Standard search features: booleans, phrase, near, relevance
• Performance exceeds and extends relational databases
• Suited to range of genome data: genes, literature, sequences, XML annotations, Medline abstracts, HTML, PDF and text documents.
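For readers unfamiliar with the search features listed above, the toy Python index below shows what boolean, phrase, and "near" queries mean over a positional inverted index. It is a teaching sketch only: it is not LuceGene or Lucene code, the class and method names are invented here, the documents are made up, and relevance ranking is left out.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy positional inverted index: term -> {doc_id: [word positions]}."""

    def __init__(self):
        self.postings = defaultdict(dict)

    def add(self, doc_id, text):
        for pos, term in enumerate(re.findall(r"\w+", text.lower())):
            self.postings[term].setdefault(doc_id, []).append(pos)

    def all_of(self, *terms):
        """Boolean AND: documents containing every term."""
        sets = [set(self.postings.get(t.lower(), {})) for t in terms]
        return set.intersection(*sets) if sets else set()

    def phrase(self, text):
        """Documents where the words occur consecutively, in order."""
        terms = re.findall(r"\w+", text.lower())
        hits = set()
        for doc_id in self.all_of(*terms):
            starts = self.postings[terms[0]][doc_id]
            if any(all(p + k in self.postings[t][doc_id]
                       for k, t in enumerate(terms)) for p in starts):
                hits.add(doc_id)
        return hits

    def near(self, first, second, max_gap):
        """Documents where the two words are at most max_gap positions apart."""
        hits = set()
        for doc_id in self.all_of(first, second):
            a = self.postings[first.lower()][doc_id]
            b = self.postings[second.lower()][doc_id]
            if any(abs(i - j) <= max_gap for i in a for j in b):
                hits.add(doc_id)
        return hits


index = TinyIndex()
index.add("doc1", "alcohol dehydrogenase activity in the fat body")
index.add("doc2", "dehydrogenase assays; alcohol tolerance")
index.add("doc3", "heat shock protein expressed after thermal shock")

print(sorted(index.all_of("alcohol", "dehydrogenase")))
print(sorted(index.phrase("alcohol dehydrogenase")))
print(sorted(index.near("heat", "protein", 2)))
```

All three query types are answered from the term postings alone, without scanning document text, which is the basic reason an index of this kind can serve document searches quickly.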
Example LuceGene libraries
• FlyBase database
  • Annotation GAME XML, Medline XML (gamexml, medxml)
  • Genes, Annotation, References (fbgn, fban, fbrf)
  • Web, literature PDF Documents (docs)
  • Unified Gene Page XML (ugpxml)
  • Sequences, Genome Features (seqs)
• euGenes database
  • Gene summaries, Sequences, Genome Features
  • Unified Gene Page XML
  • Web Documents
• wFleaBase database
  • Sequences, Medline XML, Web documents
Thanks to these folks
• Josh Goodman (gmod)
• Paul Poole (gmod/iubio)
• Hardik Sheth (flybase)
• Nihar Sheth (flybase)
• Vasanth Singan (gmod)
• Victor Strelets (flybase)
And to many developers whose work we learn from and borrow from