(The Encyclopedia of Life (EOL))



Transcript of (The Encyclopedia of Life (EOL))

Page 1: (The Encyclopedia of Life (EOL))

(The Encyclopedia of Life (EOL))

The Annotation and Cataloging of Proteins, Life's Building Blocks for… medicine, research, education

The Open Notebook

Page 2: (The Encyclopedia of Life (EOL))

A Multitude of Data Sites

Page 3: (The Encyclopedia of Life (EOL))

Current Problem Using Data Sites

• Difficult to keep track of data files

• Data often returned in various formats

• Searches are frequently repeated in their entirety, tying up server resources

Page 4: (The Encyclopedia of Life (EOL))

Developments in Data Transfer

• XML is increasingly being used to encapsulate data

• SOAP-based access to data services is springing up; SOAP is an XML-based method for exchanging information

• SOAP consumer invokes a SOAP method over the HTTP protocol, e.g. string[] getGenomeAnnotationStatus(int Format_option)

• SOAP server processes the request and returns any data in an XML-formatted SOAP packet, e.g. <?xml version="1.0"?><notebook-data></notebook-data>
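The consumer side of this exchange can be sketched in Java. This is a minimal illustration, not the EOL implementation: it only builds the SOAP 1.1 request envelope for getGenomeAnnotationStatus and checks that it is well-formed. The service namespace urn:example:eol and the class name are assumptions made for the example; a real consumer would POST the envelope over HTTP to the SOAP server.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class SoapRequestSketch {
    // Standard SOAP 1.1 envelope namespace.
    static final String SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/";
    // Hypothetical service namespace; the slides do not give one.
    static final String SERVICE_NS = "urn:example:eol";

    /** Builds the XML request a consumer would send for getGenomeAnnotationStatus. */
    static String buildRequest(int formatOption) {
        return "<?xml version=\"1.0\"?>"
            + "<soap:Envelope xmlns:soap=\"" + SOAP_NS + "\">"
            + "<soap:Body>"
            + "<m:getGenomeAnnotationStatus xmlns:m=\"" + SERVICE_NS + "\">"
            + "<Format_option>" + formatOption + "</Format_option>"
            + "</m:getGenomeAnnotationStatus>"
            + "</soap:Body>"
            + "</soap:Envelope>";
    }

    /** Parses XML text (namespace-aware); fails if the text is not well-formed. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("XML is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        String request = buildRequest(1);
        Document doc = parse(request);
        // Print the root element name and the request itself.
        System.out.println(doc.getDocumentElement().getLocalName());
        System.out.println(request);
    }
}
```

In practice a SOAP toolkit such as Apache AXIS (used on the EOL server side) would generate this envelope from a method call; the hand-built string only makes the wire format visible.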

Page 5: (The Encyclopedia of Life (EOL))

Notebook Overview

XML/RDF store

Background SOAP Queries

BLAST data, keyword data, stored queries, annotations

SOAP Server

Session info

Scheduler

BLAST, keyword queries

Metadata sharing

Virtual community messaging

Application invoked by mime type

Web Services Interface

Open Notebook

Notebook link

getIncrementalUpdate(string sequence, string date)

<?xml version="1.0"?><notebook_data><data> …

Annotations

Page 6: (The Encyclopedia of Life (EOL))

Open Notebook Protocol

• An agreed set of protocols for invoking a client-side application and then feeding it data, enabling client-side data persistence

• Not tied to one programming language

Page 7: (The Encyclopedia of Life (EOL))

Invocation of Client-side Application

• Experimental mime type (as per RFC 2048): application/x-opennotebook

• Application registers with web browser/OS to handle this mime type.

• Data then streams to application in agreed XML schema format

<?xml version="1.0"?><notebook_data><data> …

Page 8: (The Encyclopedia of Life (EOL))

Data would describe required data viewers

• Specialized viewers and their current availability specified in XML data download

<?xml version="1.0"?>
<notebook_data>
  <basic-viewer>blast</basic-viewer>
  <advanced-viewer>
    <availability>available</availability>
    <platforms>Java;win32;macosx</platforms>
    <download>http://www.xxx.com/…</download>
  </advanced-viewer>
</notebook_data>
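A client could read this viewer description with the JDK's DOM parser. The following is a sketch under stated assumptions: the element names come from the slide, the download element is left out of the sample because its URL is elided in the source, and the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.Arrays;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ViewerDescriptorSketch {
    // Sample data following the slide; the <download> element is omitted here.
    static final String SAMPLE =
        "<?xml version=\"1.0\"?><notebook_data>"
        + "<basic-viewer>blast</basic-viewer>"
        + "<advanced-viewer>"
        + "<availability>available</availability>"
        + "<platforms>Java;win32;macosx</platforms>"
        + "</advanced-viewer>"
        + "</notebook_data>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("XML is not well-formed", e);
        }
    }

    /** Text of the first element with the given tag name, or null if absent. */
    static String text(Document doc, String tag) {
        NodeList nodes = doc.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    /** True if the advanced viewer is marked available and lists the given platform. */
    static boolean advancedViewerUsable(Document doc, String platform) {
        String platforms = text(doc, "platforms");
        if (!"available".equals(text(doc, "availability")) || platforms == null) {
            return false;
        }
        return Arrays.asList(platforms.split(";")).contains(platform);
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        System.out.println("basic viewer: " + text(doc, "basic-viewer"));
        System.out.println("advanced viewer on win32: " + advancedViewerUsable(doc, "win32"));
    }
}
```

A notebook application could fall back to the built-in basic viewer whenever advancedViewerUsable returns false for the local platform.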

Page 9: (The Encyclopedia of Life (EOL))

Data updates

• Indication whether data is updatable

<?xml version="1.0"?>
<notebook_data>
  <updatable>yes</updatable>
  <SOAP-proxy>http://www.xxx.org/soapservice</SOAP-proxy>
  <update-method>getGenomes(string seq)</update-method>
  <incrementally-updatable>yes</incrementally-updatable>
  …
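How a client might act on these update flags can be sketched with the JDK's XPath API. The element names and the "yes" value follow the slide; the three-way UpdateMode enum, the class name and the closed sample document are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class UpdateDescriptorSketch {
    /** Hypothetical classification of what the client may do with a data set. */
    enum UpdateMode { NONE, FULL, INCREMENTAL }

    // Sample following the slide, closed here so that it parses.
    static final String SAMPLE =
        "<?xml version=\"1.0\"?><notebook_data>"
        + "<updatable>yes</updatable>"
        + "<SOAP-proxy>http://www.xxx.org/soapservice</SOAP-proxy>"
        + "<update-method>getGenomes(string seq)</update-method>"
        + "<incrementally-updatable>yes</incrementally-updatable>"
        + "</notebook_data>";

    /** Evaluates an XPath expression to a string; a missing element yields "". */
    static String value(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(xml))).trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot evaluate " + expression, e);
        }
    }

    /** Decides whether the data can be refreshed, and if so whether incrementally. */
    static UpdateMode updateMode(String xml) {
        if (!"yes".equals(value(xml, "/notebook_data/updatable"))) {
            return UpdateMode.NONE;
        }
        boolean incremental = "yes".equals(value(xml, "/notebook_data/incrementally-updatable"));
        return incremental ? UpdateMode.INCREMENTAL : UpdateMode.FULL;
    }

    public static void main(String[] args) {
        System.out.println(updateMode(SAMPLE));
        System.out.println(value(SAMPLE, "/notebook_data/SOAP-proxy"));
    }
}
```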

Page 10: (The Encyclopedia of Life (EOL))

Programming Language-Neutral

• Important to specify only the protocols and activation scenarios

• Enables development of a variety of different and branded versions

• Java is envisaged as an excellent choice of programming language for starting development of an open source version

Page 11: (The Encyclopedia of Life (EOL))

Encyclopedia of Life

• The Encyclopedia of Life (EOL) project is a joint development of the San Diego Supercomputer Center (SDSC) and scientists and biological resources worldwide

• EOL involves SDSC staff from HPC (High Performance Computing), DAKS (Distributed Annotation and Knowledge System), Grids, Clusters and Visualization

• EOL has three parts:
– Putative functional and 3-D structure assignment through the largest computation ever attempted in biology
– Integration of key biological resources
– Making this data available to end users through an intuitive interface

• Opportunity to start from ground up

Page 12: (The Encyclopedia of Life (EOL))

integrated Genomic Annotation Pipeline - iGAP

Input: deduced protein sequences

Pipeline steps shown in the diagram:

• Prediction of: signal peptides (SignalP, PSORT); transmembrane regions (TMHMM, PSORT); coiled coils (COILS); low complexity regions (SEG)

• Structural assignment of domains by PSI-BLAST on FOLDLIB

• Structural assignment of domains by 123D on FOLDLIB

• Create PSI-BLAST profiles for protein sequences

• Store assigned regions in the DB

• Functional assignment by PFAM, NR, PSIPred assignments

• Domain location prediction by sequence

Filter shown twice in the diagram: only sequences without A-prediction

Data sources: FOLDLIB; NR, PFAM; SCOP, PDB (structure info, sequence info)

Building FOLDLIB:

• Sources: PDB chains, SCOP domains, PDP domains, CE matches PDB vs. SCOP

• Criteria: 90% sequence non-identical; minimum size 25 aa; coverage (90%, gaps <30, ends <30)

Page 13: (The Encyclopedia of Life (EOL))

integrated Genomic Annotation Pipeline - iGAP

Same pipeline diagram as on Page 12, with these figures added:

• ~800 genomes @ 10k-20k per = ~10^7 ORFs

• 10^4 entries

• CPU-year labels on the pipeline steps: 4, 228, 3, 9, 252 and 3 CPU years

Page 14: (The Encyclopedia of Life (EOL))

EOL Data Flow

MySQL DataMart(s)

Structure assignment by PSI-BLAST

Structure assignment by 123D

Domain location prediction

Data warehouse

Pipeline data

Load/update scripts

Integrated Genome Annotation Pipeline (iGAP)

Sequence data from genomic sequencing projects

Normalized DB2 schema

Web Server/Web Services

Application Server

JBOSS v3.1

Apache AXIS

Query databases

Return data

Web Services consumers

Web Interface

Retrieve Web pages & Invoke SOAP methods

Putative Functional and 3D Assignment

Integrated with Other Resources

Page 15: (The Encyclopedia of Life (EOL))

Local Data Aggregation

EOL Registry

iGAP

Oracle db

Java Application Server

Local lookup tables

Temporary session search data

PHProjekt

Keyword search

BLAST

NLQ search

Page 16: (The Encyclopedia of Life (EOL))

EOL Front End: Web Interface

Page 17: (The Encyclopedia of Life (EOL))

Interactive Data Rendering

• Need for interactive client side graphical data rendering

• Flash used in EOL prototype but…
– development time high
– thin client capabilities limited by player parsing capabilities

• Scalable Vector Graphics (SVG)
– Described by an XML-based text file
– Graphic description can be created server-side
– Standards based
– Interactivity provided by embedded ECMA scripting

• Negatives:
– Little native support in web browsers
– Must use proprietary plugin (Adobe) in practice

Page 18: (The Encyclopedia of Life (EOL))

SVG Data Rendering

Embedded ECMA Script makes calls to EOL server for data

Data is returned to the SVG component

EOL Web Server

EOL Data

SVG XML-based graphic is generated in real-time on the server

<svg><rect x="0" y="0">…</svg>
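Server-side generation of such a graphic can be sketched as plain string building in Java. This is an illustration only: the Domain class, the pixel layout and the colours are assumptions, not the EOL rendering code. Each domain becomes one rectangle scaled to its position along the protein sequence.

```java
import java.util.List;

final class SvgDomainSketch {
    /** Hypothetical domain record: 1-based start and end residues plus a label. */
    static final class Domain {
        final int start;
        final int end;
        final String label;

        Domain(int start, int end, String label) {
            this.start = start;
            this.end = end;
            this.label = label;
        }
    }

    /** Escapes the characters that are unsafe in XML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Renders a sequence as a grey bar with one blue rectangle per domain. */
    static String render(int sequenceLength, List<Domain> domains, int widthPx) {
        double scale = (double) widthPx / sequenceLength; // pixels per residue
        StringBuilder svg = new StringBuilder();
        svg.append("<svg xmlns=\"http://www.w3.org/2000/svg\" width=\"")
           .append(widthPx).append("\" height=\"30\">");
        svg.append("<rect x=\"0\" y=\"13\" width=\"").append(widthPx)
           .append("\" height=\"4\" fill=\"grey\"/>");
        for (Domain d : domains) {
            int x = (int) Math.round((d.start - 1) * scale);
            int w = (int) Math.round((d.end - d.start + 1) * scale);
            svg.append("<rect x=\"").append(x).append("\" y=\"5\" width=\"").append(w)
               .append("\" height=\"20\" fill=\"steelblue\"><title>")
               .append(escape(d.label)).append("</title></rect>");
        }
        return svg.append("</svg>").toString();
    }

    public static void main(String[] args) {
        // Hypothetical 100-residue protein with two domains.
        List<Domain> domains = List.of(new Domain(1, 50, "domain A"), new Domain(51, 100, "domain B"));
        System.out.println(render(100, domains, 200));
    }
}
```

The returned text can be served directly with the image/svg+xml content type; the embedded scripting mentioned on the slide would be added inside the same document.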

Page 19: (The Encyclopedia of Life (EOL))

Session Data Persistence

EOL Server

Temp Data

Session Object retains pointers to temp data

Page 20: (The Encyclopedia of Life (EOL))

EOL Front End: Web Services (cont)

• Web Server

• Application Server: JBOSS v3.1, Apache AXIS

• SOAP method: org.eolproject.ejbPackage:getDomains(int id, int format_option)

• Example calls: getDomains(33499519, 1); getDomains(33499519, 0)

• Uses shown: Flash XML rendering; HTML rendering; Open Notebook; integration into enterprise applications; general data access

Page 21: (The Encyclopedia of Life (EOL))

Open Notebook Software Wish List

• Multi-platform application

• Easy installation and update

• Local search functionality

• Data annotation

• Built-in basic data viewers for popular data, e.g. BLAST, sequence alignments, basic molecular rendering

• Automated download of specialized data viewers

• Automatic data updates via background use of web services

• User notification of new data

• Point-and-click interface to support new breed of PDAs and Tablets

• Peer-to-peer querying of annotation data

Page 22: (The Encyclopedia of Life (EOL))

Easy Installation and Update

• Idiot-proof installation

• Java Network Launching Protocol (JNLP) is a good contender, i.e. Java Web Start

• JNLP can also deliver application updates

Page 23: (The Encyclopedia of Life (EOL))

Local search functionality

• Whatever kind of database is used, it needs to be able to support some kind of search functionality

• For the Open Notebook project we would seek an open source XML-based database, and look to the XML:DB API as a means of interacting with a native XML database

• eXist is one example of an open source, native XML database
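The kind of local search meant here can be sketched without a database. Note the swap: this example uses the XPath API that ships with the JDK over an in-memory XML string, not the XML:DB API or eXist, which are external libraries. The entry and annotation element names, the id attribute and the sample data are all hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class LocalSearchSketch {
    // Hypothetical locally stored notebook data.
    static final String SAMPLE =
        "<notebook_data>"
        + "<entry id=\"e1\"><annotation>kinase domain</annotation></entry>"
        + "<entry id=\"e2\"><annotation>transporter</annotation></entry>"
        + "</notebook_data>";

    /** Returns the ids of entries whose annotation text contains the keyword. */
    static List<String> search(String xml, String keyword) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Bind the keyword as the XPath variable $kw instead of concatenating it.
            xpath.setXPathVariableResolver(name -> "kw".equals(name.getLocalPart()) ? keyword : null);
            NodeList hits = (NodeList) xpath.evaluate(
                "//entry[contains(annotation, $kw)]/@id",
                new InputSource(new StringReader(xml)),
                XPathConstants.NODESET);
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                ids.add(hits.item(i).getNodeValue());
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalArgumentException("search failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(search(SAMPLE, "kinase"));
    }
}
```

A native XML database would answer the same XPath query against persisted collections, which is what makes the XML:DB route attractive for larger local stores.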

Page 24: (The Encyclopedia of Life (EOL))

Data annotation & Peer-to-peer querying of annotation data

• Personal annotations on local data are a useful and relatively easy feature to implement

• Peer-to-peer access is contentious and needs to be well controlled

• Potentially could create a real community of online scientists

• Effectively a scientific “Napster”

Page 25: (The Encyclopedia of Life (EOL))

Built-in Basic Data Viewers

• Need to have minimum built-in capability
– Text viewer
– SVG Graphics viewer
– NCBI DTD-based BLAST browser
– Multiple sequence alignment viewer
– Molecule renderer

Page 26: (The Encyclopedia of Life (EOL))

Automatic data updates via SOAP calls

• The server side must be set up to provide SOAP method calls

• Potential to drastically reduce server load by performing incremental searches

getBlastData(string sequence, string last_queried)
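The saving from an incremental search can be illustrated with a small Java sketch of the server-side filter. Everything here is assumed for the example: the Hit class, the accession strings, and the choice of an ISO yyyy-MM-dd string for the last-queried parameter. Only hits added after the client's previous query are returned, so the full search need not be repeated.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

final class IncrementalUpdateSketch {
    /** Hypothetical stored search hit with the date it entered the database. */
    static final class Hit {
        final String accession;
        final LocalDate added;

        Hit(String accession, LocalDate added) {
            this.accession = accession;
            this.added = added;
        }
    }

    /**
     * Returns only the hits added strictly after the last query date.
     * lastQueried is assumed to be an ISO date such as "2003-02-01".
     */
    static List<Hit> newHitsSince(List<Hit> allHits, String lastQueried) {
        LocalDate cutoff = LocalDate.parse(lastQueried);
        List<Hit> fresh = new ArrayList<>();
        for (Hit hit : allHits) {
            if (hit.added.isAfter(cutoff)) {
                fresh.add(hit);
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        List<Hit> stored = new ArrayList<>();
        stored.add(new Hit("hitA", LocalDate.of(2003, 1, 1)));
        stored.add(new Hit("hitB", LocalDate.of(2003, 3, 1)));
        // Only hits newer than the client's last query are sent back.
        for (Hit hit : newHitsSince(stored, "2003-02-01")) {
            System.out.println(hit.accession + " added " + hit.added);
        }
    }
}
```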

Page 27: (The Encyclopedia of Life (EOL))

Point-and-click interface

• Intuitive interface

• Constructed with an eye on developments in personal computing, e.g. PDAs and Tablet computers

Page 28: (The Encyclopedia of Life (EOL))

What Next…?

• Upload a seed Java-based project onto the Bioinformatics.org site together with an RFC

• Discuss online the merits of the project

Page 29: (The Encyclopedia of Life (EOL))

Summary

• A genuine need for a means to:
– Collate data
– Automatically update data
– Enable shared data annotations
– Perform specialized data processing

• Java provides a compelling platform to develop an open version of this client-side application

Page 30: (The Encyclopedia of Life (EOL))

EOL Team

Dave Archbell, Kim Baldridge, Chaitanya Baru, Fran Berman, Philip Bourne, Robert Byrnes, Henri Casanova, Eliot Clingman, Neil Cotofana, Cassie Ferguson, Tony Fountain, Jerry Greenberg, Michael Gribskov, Dana Jermanis, Wilfred Li, Jennifer Matthews, Mark Miller, Julie Mitchell, Coleman Mosley, Greg Quinn, Vicente Reyes, Jerry Rowley, Peter Shin, Ilya Shindyalov, Chris Smith, David Stoner, Stella Veretnik

Page 31: (The Encyclopedia of Life (EOL))

Further information:

http://www.eolproject.info

http://www.bioinformatics.org/opennotebook