
The Encyclopedia of Life (EOL) Project
An initiative to analyze and provide annotation for putative protein sequences from all publicly available genome data

Baldridge, K.; Baru, C.; Bourne, P.; Clingman, E.; Cotofana, C.; Ferguson, C.; Fountain, A.; Greenberg, J.; Jermanis, D.; Li, W.; Matthews, J.; Miller, M.; Mitchell, J.; Mosley, M.; Pekurovsky, D.; Quinn, G.B.; Rowley, J.; Shindyalov, I.; Smith, C.; Stoner, D.; Veretnik, S.

San Diego Supercomputer Center, MC 0505, 9500 Gilman Drive, La Jolla, CA 92093-0505, USA

Genomic Pipeline

Protein sequences from all available genomes

Prediction of: signal peptides (SignalP, PSORT), transmembrane regions (TMHMM, PSORT), coiled coils (COILS), low-complexity regions (SEG)

Structural assignment of domains by PSI-BLAST on FOLDLIB

Only sequences without an A-level prediction

Only sequences without an A-level prediction

Structural assignment of domains by 123D on FOLDLIB

Create PSI-BLAST profiles for protein sequences

Store assigned regions in the DB

Functional assignment by PFAM, NR, and PSI-Pred

FOLDLIB

NR, PFAM

Building FOLDLIB:

• PDB chains
• SCOP domains
• PDP domains
• CE matches PDB vs. SCOP
• 90% sequence non-identical
• minimum size 25 aa
• coverage 90% (gaps < 30, ends < 30)
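The thresholds above can be read as a simple inclusion filter over candidate entries. The sketch below illustrates that filter; the class and field names are hypothetical, and only the numeric cutoffs come from the poster:

```java
/** Hypothetical filter illustrating the FOLDLIB inclusion thresholds above. */
public class FoldLibFilter {

    /** Candidate domain/chain entry; field names are illustrative only. */
    public static class Candidate {
        int lengthAa;          // sequence length in amino acids
        double coverage;       // fraction of the structure covered by the alignment
        int largestGap;        // largest internal alignment gap, in residues
        int unalignedEnds;     // unaligned residues at either end
        double maxIdentity;    // highest identity to entries already in the library
    }

    /** Apply the poster's stated cutoffs. */
    public static boolean accept(Candidate c) {
        return c.maxIdentity < 0.90       // keep the set 90% sequence non-identical
            && c.lengthAa >= 25           // minimum size 25 aa
            && c.coverage >= 0.90         // coverage >= 90%
            && c.largestGap < 30          // gaps < 30 residues
            && c.unalignedEnds < 30;      // ends < 30 residues
    }
}
```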

Domain location prediction by sequence

structure info
sequence info

SCOP, PDB

Data warehouse
MySQL DataMart(s)
Structure assignment by PSI-BLAST

Structure assignment by 123D

Domain location prediction

SOAP/Web Server

Load/update scripts

Application server

UDDI directory

Publish Web Services & API

Automated data downloads to mirrors and researchers

WWW

Data incorporated into third party web pages

EOL Web pages served via JSP

EOL Notebook

For further information about EOL, please visit us online at: http://www.eolproject.info or contact Mark Miller at [email protected], +1-858-822-0866

The Sequence Analysis Pipeline

The Need for Protein Annotation

Accompanying the massive supply of genomic data is a need to annotate proteins from both structural and functional points of view. Questions that researchers look to answer using this wealth of new genomic data include:

- What other genomic proteins are similar to the protein that I am researching?

- What level of conservation is there for a particular protein sequence across species?

- Which protein domains are common to various protein sequences?

- What is the likely cellular location of a specific protein or class of proteins?

On a limited basis, researchers can manually perform BLAST searches, sequence analysis, and data collation for small collections of protein sequences of interest, but for the very large number of sequences (10,000 to 15,000 or more) encoded in an individual genome, this becomes impractical.

Therefore, key to large-scale genomic sequence analysis is the creation of a reliable, automated software “pipeline” to handle both the analysis itself and the collation of its output.
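As an illustration of what such a pipeline does, the sketch below runs a set of analysis steps over a batch of sequences and collates the per-sequence output. The interfaces and class names are hypothetical; each AnalysisStep stands in for a wrapper around a real tool such as SignalP or TMHMM:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical orchestration of per-sequence analysis and result collation. */
public class AnnotationPipeline {

    /** One analysis step, e.g. a wrapper around SignalP or TMHMM. */
    interface AnalysisStep {
        String name();
        String analyze(String proteinSequence); // returns the step's raw report
    }

    private final List<AnalysisStep> steps = new ArrayList<>();

    public void register(AnalysisStep step) {
        steps.add(step);
    }

    /** Run every registered step on every sequence; collate output per sequence. */
    public Map<String, Map<String, String>> run(Map<String, String> sequencesById) {
        Map<String, Map<String, String>> collated = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : sequencesById.entrySet()) {
            Map<String, String> reports = new LinkedHashMap<>();
            for (AnalysisStep step : steps) {
                reports.put(step.name(), step.analyze(e.getValue()));
            }
            collated.put(e.getKey(), reports); // ready to load into the database
        }
        return collated;
    }
}
```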

The EOL Model

Figure 1

Genomic analysis pipeline used to analyze Arabidopsis thaliana sequence data

The Proteins of Arabidopsis thaliana (PAT) project was a prototype initiative to establish a reliable and accurate pipeline for genome annotation (iGAP) (Figure 1). Using homology modeling, iGAP provides functional annotations and predicts three-dimensional structures (where possible) for proteins encoded in the Arabidopsis thaliana genome. The results from iGAP (WU-BLAST, PSI-BLAST, 123D+, COILS, TMHMM, SignalP) were combined and organized into a relational database with a web-based GUI.

Steps in Protein Annotation

• Structural assignment by sequence similarity and fold recognition.
  – Fold assignment.
  – Function assignment.
  – Modeling by aligning with template.
• Functional assignment by sequence similarity.
• Assignment of special classes (filtering).
• Assignment of protein features.

An important issue in this process is automation and the automated quality assessment that must accompany it. In the pipeline model, this was addressed by:

• Introduction of six reliability categories.

• Introduction of benchmark based on 1000 non-redundant SCOP folds [Murzin, AG; Brenner, SE; Hubbard, T; Chothia, C. J. Mol. Biol., 1995, 247:536].

• Testing a variety of search conditions and methods within this benchmark.

Further information about the PAT project may be found at the PAT web site:

http://arabidopsis.sdsc.edu

Reliability Categories (based on selectivity benchmark):

A. Certain (99.9% of true positives among predicted positives)
B. Reliable (99%)
C. Probable (90%)
D. Possible (50%)
E. Potential (10%)
F. No annotation

Sensitivity = tp/(tp+fn); Selectivity = tp/(tp+fp)
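To make the metrics concrete, here is a small worked sketch. The formulas and category thresholds are the poster's; mapping a measured selectivity directly onto a category is an illustrative assumption, since in practice the categories were calibrated against the SCOP benchmark:

```java
/** Worked example of the benchmark metrics and category thresholds above. */
public class ReliabilityMetrics {

    public static double sensitivity(int tp, int fn) {
        return (double) tp / (tp + fn);
    }

    public static double selectivity(int tp, int fp) {
        return (double) tp / (tp + fp);
    }

    /** Illustrative mapping from a selectivity estimate onto categories A-F. */
    public static char category(double selectivity) {
        if (selectivity >= 0.999) return 'A'; // Certain
        if (selectivity >= 0.99)  return 'B'; // Reliable
        if (selectivity >= 0.90)  return 'C'; // Probable
        if (selectivity >= 0.50)  return 'D'; // Possible
        if (selectivity >= 0.10)  return 'E'; // Potential
        return 'F';                           // No annotation
    }

    public static void main(String[] args) {
        // e.g. 990 true positives, 10 false positives, 100 false negatives
        System.out.println(sensitivity(990, 100));                // 0.908...
        System.out.println(selectivity(990, 10));                 // 0.99
        System.out.println(category(selectivity(990, 10)));       // B (Reliable)
    }
}
```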

Figure 2

The EOL Data Analysis and Delivery Model

Large-Scale Computing Resources and Data Storage

Stages in EOL Data Processing and Delivery

• Publicly available genomic sequence data are obtained via a high-speed Internet2 connection from NCBI to the San Diego Supercomputer Center.
• Sequence data are distributed to several large-scale computing resources at partner institutions, such as the BII in Singapore and the TeraGrid at SDSC (see below), to which the PAT software pipeline has been ported.

• Data from the pipeline is deposited into a DB2-based multi-species version of the PAT data warehouse schema, and federated with data from a number of other local database projects.

• Multiple complex queries on the data are run and the results are stored in the database.

• Data is loaded into multiple data marts for fast, read-only query access and distribution both to end users (via a Web interface and a SOAP-based Web services paradigm) and to EOL data mirror sites (a query sketch follows this list).

• Researchers throughout the world are able to access the data by pointing their Web browser to the EOL data Web site or one of its mirrors. Additionally, the World Wide Web Consortium (W3C) standards-based Web Service protocol allows for peer-to-peer automated computer data access for a variety of uses.
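As a sketch of the read-only data mart access described above: the connection URL, credentials, and the table and column names below are all invented placeholders; only the use of MySQL as the data mart engine comes from the poster (a MySQL JDBC driver would also need to be on the classpath):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

/** Hypothetical read-only query against an EOL data mart (schema invented). */
public class DataMartQuery {
    public static void main(String[] args) throws Exception {
        // URL, credentials, table and column names are all placeholders.
        try (Connection con = DriverManager.getConnection(
                 "jdbc:mysql://localhost/eol_mart", "reader", "secret");
             PreparedStatement ps = con.prepareStatement(
                 "SELECT accession, fold_id, reliability"
               + " FROM domain_assignment WHERE species = ?")) {
            ps.setString(1, "Arabidopsis thaliana");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("%s %s %s%n",
                        rs.getString(1), rs.getString(2), rs.getString(3));
                }
            }
        }
    }
}
```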

Figure 3

Book Metaphor Web Interface

Ported pipeline applications

Sequence data from genomic sequencing projects

Pipeline data

Key to the success of the EOL project has been the ability to partner with computing projects that provide the resources to drive the software pipeline across the more than 800 available genomes. Large-scale computing resources being recruited for the EOL project include the TeraGrid, the world's largest, fastest, and most comprehensive distributed infrastructure for open scientific research (http://www.teragrid.org); PRAGMA, an open organization in which Pacific Rim institutions formally collaborate to develop grid-enabled applications and to deploy Grid infrastructure throughout the Pacific region (http://pragma.ucsd.edu); and NRAC resources, including SDSC's Blue Horizon, the University of Michigan's AMD cluster, and the University of Wisconsin Condor Flock.

Another factor in the development of EOL has been the ability to deploy large-scale mass storage to handle the enormous amount of data generated by iGAP analyses and loaded into the EOL data warehouse schema and data marts. Ultimately, more than 10 terabytes of storage will be deployed for genome annotation alone.

Multiple EOL Data Mirror Sites

Data mirrors will be a major component of the EOL data distribution system. A software package that can be downloaded from the EOL interface allows researchers to store selected EOL data on local machines and, if desired, to act as a public EOL data mirror. The mirror package will be based on a freely available relational database management system (MySQL) and application server (JBoss). This ensures the widest possible deployment of EOL mirror data repositories, from major university and biotech sites to the smallest research institutions, even including high schools.

The end-user experience of accessing data processed in this manner is fast, comprehensive and flexible.

The EOL model (Figure 2) applies the iGAP pipeline (proven in the PAT project) to all available (currently 800+) genomes. It is a key goal of the project to provide the computational and storage resources necessary to analyze sequence data of this magnitude (current estimates are 300 CPU-years with available hardware). Ongoing efforts are aimed at obtaining more CPU resources and improving the efficiency of computational resource utilization.
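For scale (an illustrative calculation, not a project figure): if 1,000 CPUs were concurrently available, 300 CPU-years of work would take roughly 300/1,000 = 0.3 calendar years, or about four months, assuming perfect parallel efficiency.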

Innovative Data Access

A unique aspect of the EOL model is its ability to deliver data through multiple routes. One arm of this data delivery system is the Web interface, driven by Java Server Pages (JSP). Building on the “Encyclopedia of Life” concept, the interface provides fast access to EOL data through a book-metaphor design. Data is cataloged alphabetically by species, and the user is provided with multiple additional tools to search sequence data, including:

• BLAST search with a protein query sequence against one or more species' data.
• Keyword search.
• Natural Language Query search.
• Sequence identifier (accession ID) search.
• SCOP Fold browser.
• Putative function browser.

Query results will be returned in multiple forms, including a Web page summary at the genome, sequence, and structure data levels, as well as links to the same information in XML, a printer-friendly PDF output, an EOL notebook version (see below), and a narrated summary in Flash.

The Web interfaces make extensive use of Scalable Vector Graphics (SVG) components to deliver fast, client-side graphical data renderings using XML-encapsulated data according to W3C standards. An example is the SVG “chromosome mapper” shown in Figure 4. SVG molecular rendering is used on the client side to provide fast, interactive, and visually informative molecular graphics.
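As a rough illustration of this kind of client-side SVG rendering, the snippet below emits a minimal SVG fragment that places gene markers along a chromosome bar. The layout and coordinates are invented for the example and do not reflect the actual EOL chromosome mapper:

```java
import java.util.Locale;

/** Emit a minimal SVG "chromosome map": one bar plus a tick per gene locus. */
public class ChromosomeSvg {

    /** @param lociKb gene positions in kilobases; chromosome drawn 500 px wide. */
    public static String render(double chromosomeLengthKb, double[] lociKb) {
        StringBuilder svg = new StringBuilder();
        svg.append("<svg xmlns=\"http://www.w3.org/2000/svg\" width=\"520\" height=\"60\">\n");
        svg.append("  <rect x=\"10\" y=\"25\" width=\"500\" height=\"10\" fill=\"#ccc\"/>\n");
        for (double kb : lociKb) {
            double x = 10 + 500 * kb / chromosomeLengthKb; // scale position to pixels
            svg.append(String.format(Locale.ROOT,
                "  <line x1=\"%.1f\" y1=\"15\" x2=\"%.1f\" y2=\"45\" stroke=\"red\"/>%n",
                x, x));
        }
        svg.append("</svg>\n");
        return svg.toString();
    }

    public static void main(String[] args) {
        System.out.println(render(30000, new double[] {1200, 8500, 22100}));
    }
}
```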

Web Services and the EOL Notebook

In addition to access via the Web, other components of data delivery include publication of a Web Services-based API and the SDSC Blue Titan web services network direction system. Through Web Services, any researcher or data service is able to access EOL data automatically and with minimal programming effort.
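A minimal sketch of programmatic access over SOAP, assuming a hypothetical endpoint URL and operation name (the real EOL service addresses and message formats would come from its published WSDL/UDDI entries):

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Hypothetical SOAP 1.1 call; the endpoint URL and operation are invented. */
public class EolSoapClient {

    /** POST a SOAP envelope to the given endpoint and return the response. */
    public static String call(String endpoint, String envelope) throws Exception {
        HttpURLConnection con = (HttpURLConnection) new URL(endpoint).openConnection();
        con.setRequestMethod("POST");
        con.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
        con.setRequestProperty("SOAPAction", "\"\"");
        con.setDoOutput(true);
        try (OutputStream out = con.getOutputStream()) {
            out.write(envelope.getBytes(StandardCharsets.UTF_8));
        }
        try (InputStream in = con.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws Exception {
        // Both the URL and the <getAnnotation> operation are placeholders.
        String envelope =
            "<soap:Envelope xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\">"
          + "<soap:Body><getAnnotation><accession>EXAMPLE_ID</accession>"
          + "</getAnnotation></soap:Body></soap:Envelope>";
        System.out.println(call("http://example.org/eol/soap", envelope));
    }
}
```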

The EOL notebook is a subproject within EOL (and bioinformatics.org) to create a Java-based application, distributed via JNLP, that will act as a local repository for EOL data. In addition to storing and searching data locally, the EOL notebook will be a consumer of EOL Web Services and, via automation, will keep locally held data (stored in XML for interoperability) in sync with the main EOL repository.

Figure 4

Client-side data rendering using SVG
