GMODTools, Argos & cetera: A Replicable Genome infOrmation System of Common Components. GMOD Meeting, Oct. 2004. Don Gilbert, gilbertd@indiana.edu

Page 1:

GMODTools, Argos & cetera A Replicable Genome infOrmation System of Common Components

GMOD Meeting, Oct. 2004

Don Gilbert, gilbertd@indiana.edu

Page 2:

Genome DB building blocks

• GMOD Tools for public data releases
• Argos framework for genome databases
• LuceGene fast document/object search
• Genome Directory System for genome data mining
• Unified Gene Pages (XML, web page)

Page 3:

GMOD Tools: Bulkfiles

cvs.sourceforge.net:/cvsroot/gmod checkout schema/GMODTools

Page 4:

Genome Data Tools

• Support common data update and public release tasks.
• GmodTools to load and extract reagent sequences (EST, cDNA, GSS) to/from Chado databases.
• GMOD Bulkfiles creates bulk genome sequence and feature files for public distribution from a Chado database.
• Citrina is a workflow tool to automate external databank updates, such as GenBank and Gene Ontologies.

Page 5:

12 New genomes to go

• Need to publish numerous new genomes
• Bulk files are standard public access:
  • Sequence (fasta, …), features (gff, …), searches (Blast, …)
• 11 new Drosophila genomes; Daphnia genome; many more
• Chado database; XORT & other GMOD Tools to export data
• http://flybase.net/species

Page 6:

Bulkfiles

• Build release files from Chado DB
• Standardized files, headers
• DNA - fasta, raw
• Features - GFF3, gnomap
• Blast indices
• Lucene file indices
• Config files (blast, gbrowse, …)

Page 7:

Bulkfiles - BLAST indices

Page 8:

Bulkfiles - Map features

Page 9:

Bulkfiles OUTPUTS

• DNA files (full chromosomes) in raw and fasta formats
• GFF (v3) and FFF (used in FlyBase) feature files
• Fasta sequence for each feature set, with standardized headers (ID, names, db_xref, …) from feature files
• NCBI BLAST indices & configs
• Gbrowse config files with feature sets matching db
• Others added as needed (more easily than before)
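As a rough illustration of the feature outputs listed above, the sketch below writes one feature as a GFF3 line and as a FASTA record whose header carries ID, name, location, and db_xref. The slides do not show the header layout Bulkfiles actually uses, so the `key=value` layout, the `Feature` class, and the sample values are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from urllib.parse import quote

# Translation table for reverse-complementing minus-strand features.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


@dataclass
class Feature:
    """Hypothetical feature record; field names are illustrative only."""
    seqid: str
    ftype: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"
    fid: str
    name: str
    db_xrefs: list = field(default_factory=list)


def gff3_line(f: Feature, source: str = "example") -> str:
    """Format one feature as a nine-column, tab-separated GFF3 line."""
    attrs = [f"ID={quote(f.fid, safe='')}", f"Name={quote(f.name, safe='')}"]
    if f.db_xrefs:
        attrs.append("Dbxref=" + ",".join(quote(x, safe=":") for x in f.db_xrefs))
    cols = [f.seqid, source, f.ftype, str(f.start), str(f.end),
            ".", f.strand, ".", ";".join(attrs)]
    return "\t".join(cols)


def fasta_record(f: Feature, dna: str, width: int = 60) -> str:
    """Cut the feature's sequence out of `dna` and wrap it as FASTA.

    The header layout here is an assumption, not the Bulkfiles format.
    """
    seq = dna[f.start - 1:f.end]
    if f.strand == "-":
        seq = seq.translate(COMPLEMENT)[::-1]
    header = (f">{f.fid} type={f.ftype}; name={f.name}; "
              f"loc={f.seqid}:{f.start}..{f.end}; db_xref={','.join(f.db_xrefs)}")
    body = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([header] + body)
```

Both functions take the same record, which mirrors the idea of deriving every public format from one feature description.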

Page 10:

Bulkfiles Logic

• Organism/database logic (mostly) in configuration files
• Dump all chado db features using simple sql to common intermediate table files
• Feature info is simple: type, location, name/id, and a few attributes (db_xrefs, .. GFF-like)
• Easier checking of SQL to get all features desired
• Fast (30 - 60 min for full fly genome)
• Postprocess table files to create public use formats
• Tested with FOUR different Chado dbs (Dmel, Dmel_hetero, Dpse_Dmel, and SGDLite)
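The "simple SQL to intermediate table files" step might look roughly like the sketch below. It uses an in-memory SQLite database with a heavily pared-down subset of the Chado `feature`, `featureloc`, and `cvterm` tables so that it runs anywhere. The actual Bulkfiles SQL, the column order of its table files, and the sample rows are not given in these slides and are assumptions here.

```python
import csv
import io
import sqlite3

# Pared-down stand-ins for three Chado tables (most real columns omitted).
SCHEMA = """
CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, uniquename TEXT,
                      name TEXT, type_id INTEGER);
CREATE TABLE featureloc (feature_id INTEGER, srcfeature_id INTEGER,
                         fmin INTEGER, fmax INTEGER, strand INTEGER);
"""

# One flat query: source sequence, type, location, id, name.
DUMP_SQL = """
SELECT src.uniquename, t.name, fl.fmin, fl.fmax, fl.strand, f.uniquename, f.name
FROM featureloc fl
JOIN feature f   ON f.feature_id = fl.feature_id
JOIN feature src ON src.feature_id = fl.srcfeature_id
JOIN cvterm t    ON t.cvterm_id = f.type_id
ORDER BY src.uniquename, fl.fmin
"""


def dump_features(conn, out) -> int:
    """Write every located feature as one tab-delimited row; return row count."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    count = 0
    for row in conn.execute(DUMP_SQL):
        writer.writerow(row)
        count += 1
    return count


def demo_db():
    """Build a tiny database with made-up rows for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO cvterm VALUES (?, ?)",
                     [(1, "chromosome_arm"), (2, "gene")])
    conn.executemany("INSERT INTO feature VALUES (?, ?, ?, ?)",
                     [(10, "2L", "2L", 1),
                      (11, "GENE0002", "beta", 2),
                      (12, "GENE0001", "alpha", 2)])
    # Chado stores fmin/fmax as interbase (0-based) coordinates.
    conn.executemany("INSERT INTO featureloc VALUES (?, ?, ?, ?, ?)",
                     [(11, 10, 5000, 6200, -1), (12, 10, 100, 900, 1)])
    return conn
```

Calling `dump_features(demo_db(), io.StringIO())` writes two rows sorted by position. Keeping the dump this flat is what makes it easy to check that the SQL returned every feature intended.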

Page 11:

Bulkfiles stages

• Postprocess table files in stages
• Recode feature “oddities” to public view needs
• Better debugging of steps in the process
• Engineering time and configuration here
• Stages are loosely coupled; go back, tweak configurations, re-run partially as needed.
• Convert common feature table + dna to several output formats in one step.
• Combine features from several dbs and other sources like cytology here.
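The loosely coupled stages described above can be sketched as a small runner in which each stage reads the previous stage's file and writes its own, so a later stage can be re-run after a configuration tweak without repeating earlier ones. The stage names, file names, the `mRNA_internal` type, and the `run_stages` helper are hypothetical and not taken from Bulkfiles.

```python
import tempfile
from pathlib import Path
from typing import List, Optional


def recode(src: Path, dst: Path) -> None:
    """Stage 1: rename an internal feature type to its public name."""
    dst.write_text(src.read_text().replace("mRNA_internal", "mRNA"))


def to_gff(src: Path, dst: Path) -> None:
    """Stage 2: convert the recoded table (6 tab-separated columns) to GFF3."""
    lines = ["##gff-version 3"]
    for row in src.read_text().splitlines():
        seqid, ftype, start, end, strand, fid = row.split("\t")
        lines.append("\t".join(
            [seqid, "example", ftype, start, end, ".", strand, ".", f"ID={fid}"]))
    dst.write_text("\n".join(lines) + "\n")


# (stage name, function, input file, output file)
STAGES = [
    ("recode", recode, "features.tsv", "features.recoded.tsv"),
    ("gff", to_gff, "features.recoded.tsv", "features.gff"),
]


def run_stages(workdir: Path, start_at: Optional[str] = None) -> List[str]:
    """Run all stages, or reuse existing outputs of stages before `start_at`."""
    ran = []
    started = start_at is None
    for name, func, src, dst in STAGES:
        if name == start_at:
            started = True
        out = workdir / dst
        if not started and out.exists():
            continue  # earlier stage already has output from a previous run
        func(workdir / src, out)
        ran.append(name)
    return ran


def demo() -> List[List[str]]:
    """Full run, then a partial re-run of only the last stage."""
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp)
        (work / "features.tsv").write_text("2L\tmRNA_internal\t101\t900\t+\tT1\n")
        return [run_stages(work), run_stages(work, start_at="gff")]
```

Because every stage communicates only through files, a failed or misconfigured stage can be inspected and repeated on its own.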

Page 12:

Bulkfiles config example

<opt name="fbbulk-r3" relid="3" ROOT="${GMOD_ROOT}/" TMP="${GMOD_ROOT}/tmp"
     datadir="genomes/Drosophila_melanogaster">
  <title>FlyBase Chado DB r3.2</title>
  <about>
    Configuration for feature and sequence bulk files from
    FlyBase chado data release 3.2.1
  </about>
  <org>dmel</org>
  <species>Drosophila melanogaster</species>
  <doc name="README">
    D. melanogaster euchromatin genome data from
    FlyBase Release 3.2.1. See http://flybase.net/annot/dmel_r3.2.1.txt
  </doc>

  <include>fbreleases</include>
  <db driver="Pg" name="dmel_chado" host="localhost" port="7302" user="" password="" />

  <idpattern>(FBgn|FBti)\d+</idpattern>

  <include>filesets</include>
  <include>featuresets</include>
</opt>
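To illustrate how such a configuration might be consumed, the sketch below reads a trimmed copy of the example with Python's standard `xml.etree.ElementTree`, expands `${GMOD_ROOT}` from a supplied value, and applies the `idpattern` to sample IDs. Bulkfiles has its own configuration reader; this Python version, the `load_config` helper, the `/opt/gmod` path, and the sample IDs are assumptions for illustration only.

```python
import re
import xml.etree.ElementTree as ET
from string import Template

# Trimmed copy of the slide's example (title, about, and doc omitted).
CONFIG_XML = r"""
<opt name="fbbulk-r3" relid="3" ROOT="${GMOD_ROOT}/" TMP="${GMOD_ROOT}/tmp"
     datadir="genomes/Drosophila_melanogaster">
  <org>dmel</org>
  <species>Drosophila melanogaster</species>
  <include>fbreleases</include>
  <db driver="Pg" name="dmel_chado" host="localhost" port="7302" user="" password="" />
  <idpattern>(FBgn|FBti)\d+</idpattern>
  <include>filesets</include>
  <include>featuresets</include>
</opt>
"""


def load_config(xml_text: str, env: dict) -> dict:
    """Parse the <opt> element into a plain dict, expanding ${VAR} in attributes."""
    root = ET.fromstring(xml_text)
    cfg = {key: Template(value).safe_substitute(env)
           for key, value in root.attrib.items()}
    cfg["org"] = root.findtext("org")
    cfg["species"] = root.findtext("species")
    cfg["db"] = dict(root.find("db").attrib)
    cfg["idpattern"] = re.compile(root.findtext("idpattern"))
    # Included config names are only listed here, not resolved.
    cfg["includes"] = [elem.text for elem in root.findall("include")]
    return cfg
```

With `cfg = load_config(CONFIG_XML, {"GMOD_ROOT": "/opt/gmod"})`, the expression `cfg["idpattern"].fullmatch(some_id)` tells whether an ID has the FBgn or FBti form named in the config.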

Page 13:

ARGOS http://www.gmod.org/argos

Page 14:

ARGOS Genome DBs

Page 15:

ARGOS Focus

• Automate genome database install & update
• Eliminate { fetch, compile, install, configure, … } cycle
• Developers test, compile, config once; others copy/run
• Start new project quickly - copy existing project & edit to suit
• Clone servers easily (local cluster; global mirrors; company/lab; laptop)
• Compatible with most GMOD projects
• Secure collaborative genome db features
• Goal: easy for biologists to use with minimal informatics expertise

Page 16:

ARGOS Components

Section          Components
FlyBase (e.g.)   Data, database indices, documents, web tools specific to genome service
Java             Chado database tools, genome sequence reports, LuceGene search, Ant build system, database interfaces, XML tools, Tomcat web server, Axis web services
Perl             BioPerl, GBrowse, Chado database tools, Cmap comparative maps, database interfaces, Web tools, XML tools
Servers          BLAST (NCBI), Apache web server, PostgreSQL, and BerkeleyDB databases
Systems          Compiled portions for supported operating systems
Install & Root   Common configurations, web server, installation scripts and instructions

Page 17:

ARGOS INSTALL

Page 18:

ARGOS INSTALL

Page 19:

Edit wFleaBase

Page 20:

LuceGene (‘Lucy Jean’) for Genome Information Search and Retrieval

Page 21:

LuceGene

Document/Object Search and Retrieval in Genome Databases

• High-volume data search and retrieval system for genomics and bioinformatics databases
• Standard search features: booleans, phrase, near, relevance
• Performance exceeds and extends relational databases
• Suited to range of genome data: genes, literature, sequences, XML annotations, Medline abstracts, HTML, PDF and text documents.
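To make the boolean, phrase, and near search features concrete, here is a toy positional inverted index in plain Python. It is not LuceGene or Lucene code and does no relevance ranking. The `TinyIndex` class, its method names, and the sample documents are invented for illustration.

```python
from collections import defaultdict


class TinyIndex:
    """Toy positional inverted index: term -> doc id -> list of word positions."""

    def __init__(self):
        self.postings = defaultdict(lambda: defaultdict(list))

    def add(self, doc_id, text):
        """Index a document by lowercased, whitespace-separated words."""
        for pos, term in enumerate(text.lower().split()):
            self.postings[term][doc_id].append(pos)

    def all_of(self, *terms):
        """Boolean AND: documents containing every term."""
        sets = [set(self.postings.get(t.lower(), {})) for t in terms]
        return set.intersection(*sets) if sets else set()

    def phrase(self, text):
        """Phrase search: the terms must appear consecutively, in order."""
        terms = text.lower().split()
        hits = set()
        for doc in self.all_of(*terms):
            starts = self.postings[terms[0]][doc]
            if any(all(p + k in self.postings[t][doc] for k, t in enumerate(terms))
                   for p in starts):
                hits.add(doc)
        return hits

    def near(self, a, b, slop):
        """Proximity search: a and b occur within `slop` positions of each other."""
        a, b = a.lower(), b.lower()
        hits = set()
        for doc in self.all_of(a, b):
            if any(abs(i - j) <= slop
                   for i in self.postings[a][doc]
                   for j in self.postings[b][doc]):
                hits.add(doc)
        return hits
```

Storing word positions alongside document ids is what lets one index answer all three query types. A boolean-only index would need just the document ids.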

Page 22:

Example LuceGene libraries

• FlyBase database
  • Annotation GAME XML, Medline XML (gamexml, medxml)
  • Genes, Annotation, References (fbgn, fban, fbrf)
  • Web, literature PDF Documents (docs)
  • Unified Gene Page XML (ugpxml)
  • Sequences, Genome Features (seqs)
• euGenes database
  • Gene summaries, Sequences, Genome Features
  • Unified Gene Page XML
  • Web Documents
• wFleaBase database
  • Sequences, Medline XML, Web documents

Page 23:

Thanks to these folks

• Josh Goodman (gmod)
• Paul Poole (gmod/iubio)
• Hardik Sheth (flybase)
• Nihar Sheth (flybase)
• Vasanth Singan (gmod)
• Victor Strelets (flybase)

And to many developers whose work we learn from and borrow from