Protein Information Resource

Protein Information Resource Oversight and Scientific Advisory Board Meeting, November 14, 2005, Georgetown University Medical Center

Page 1: Protein  Information Resource

Protein Information Resource

Oversight and Scientific Advisory Board Meeting

November 14, 2005
Georgetown University Medical Center

Page 2: Protein  Information Resource

Welcome and Introduction

Vassilios Papadopoulos, Ph.D.
Associate Vice President & Director, Biomedical Graduate Research Organization
Georgetown University Medical Center

David States, M.D., Ph.D.
Chair, PIR Oversight and Scientific Advisory Board
Professor & Director of Bioinformatics, University of Michigan

Page 3: Protein  Information Resource

PIR/UniProt Overview

Project Overview, Organization, Infrastructure

Cathy H. Wu, Ph.D.

Director, PIR

Professor, Georgetown University Medical Center

Page 4: Protein  Information Resource

Protein Information Resource (PIR)

UniProt Universal Protein Resource: Central Resource of Protein Sequence and Function

PIRSF Family Classification System: Protein Classification and Functional Annotation

iProClass Integrated Protein Database: Data Integration and Protein Mapping

Cyber Infrastructure (Interoperability and Dissemination): Ontology, XML, Object/Relational DB, J2EE Architecture

Integrated Protein Informatics Resource for Genomic/Proteomic Research

http://pir.georgetown.edu

Page 5: Protein  Information Resource

UniProt: Universal Protein Resource

International Consortium
• Protein Information Resource (PIR)
• European Bioinformatics Institute (EBI)
• Swiss Institute of Bioinformatics (SIB)

NIH U01 Grant (NHGRI/NIGMS/NLM/NIMH/NCRR/NIDCR) Phase I (09/02-08/05): $6 Million Annual Bridge (09/05-?/06): $6.6M Phase II (?/06-?/09): $6.6-8.0(?)M

Central Resource of Protein Sequence and Function

http://www.uniprot.org

NHGRI

Page 6: Protein  Information Resource

UniProt Databases
• UniProt Archive (UniParc)
    • Comprehensive sequence archive with sequence history
    • Produced at EBI
• UniProt Reference Clusters (UniRef)
    • Non-redundant reference clusters for sequence search
    • Produced at PIR
• UniProt Knowledgebase (UniProtKB)
    • Integration of PIR-PSD, Swiss-Prot and TrEMBL databases
    • Stable, comprehensive, fully classified, richly and accurately annotated knowledgebase
    • UniProtKB/Swiss-Prot: Produced at SIB
    • UniProtKB/TrEMBL: Produced at EBI
    • Literature-based and automated annotation at SIB, PIR, EBI

Page 7: Protein  Information Resource

UniProt Management Structure

Scientific Advisory Panel (SAP) to be established by NHGRI

Page 8: Protein  Information Resource

UniProt Project Coordination
• UniProt email discussion groups
• Project Liaisons and Ad hoc teams
• Tri-weekly teleconference calls
• Tri-annual face-to-face Consortium meetings
    • January 12-13, 2006 at Geneva
    • April 10-11, 2006 at Georgetown University
• Exchange visits of scientific and technical staff
    • Five PIR staff at SIB (1-2 weeks, Nov 05) for annotation integration
• Retreats
    • France, 2004

Page 9: Protein  Information Resource

UniProt Activities at PIR
• Integration of PIR-PSD into UniProtKB Swiss-Prot/TrEMBL
    • Incorporation of unique PIR entries
    • Incorporation of PIR annotations: references, experimental features with literature evidence tag
• Functional annotation of UniProtKB proteins
    • Development of PIRSF family classification system & PIRSF curation => Comprehensive coverage of all UniProtKB proteins
    • Development of rule-based annotation system & PIRNR (name rule)/PIRSR (site rule) curation => Rule curation and integration into Swiss-Prot/TrEMBL annotation pipelines & propagation of annotations (e.g., name, GO, site)
• Production of UniRef100/90/50 databases => Enhancement & scaling
• Creation of UniProt web site and help system => Unified UniProt web site & user community interaction

Page 10: Protein  Information Resource

PIRSF Classification System

• PIRSF: Evolutionary relationships of proteins from super- to sub-families
• Curated families with name rules and site rules
• Curation platform with classification/visualization tools
• Deliverables: UniProtKB annotations, InterPro families, PIRSF reports, PIRSF curation platform

Protein Classification and Functional Annotation

PIRSF001499: Bifunctional CM/PDH (T-protein)

PIRSF006786: PDH, feedback inhibition-insensitive

PIRSF005547: PDH, feedback inhibition-sensitive

PF02153: Prephenate dehydrogenase (PDH)

PIRSF017318: CM of AroQ class, eukaryotic type

PIRSF001501: CM of AroQ class, prokaryotic type

PIRSF026640: Periplasmic CM

PIRSF001500: Bifunctional CM/PDT (P-protein)

PIRSF001499: Bifunctional CM/PDH (T-protein)

PF01817: Chorismate mutase (CM)

PIRSF006493: Ku, prokaryotic type

PIRSF500001: IGFBP-1

PIRSF500006: IGFBP-6

PIRSF Homeomorphic Subfamily

• 0 or more levels

• Functional specialization

PIRSF018239: IGFBP-related protein, MAC25 type

PIRSF001969: IGFBP

PIRSF003033: Ku70 autoantigen

PIRSF016570: Ku80 autoantigen

PIRSF Homeomorphic Family

• Exactly one level

• Full-length sequence similarity and common domain architecture

PIRSF Superfamily

• 0 or more levels

• One or more common domains

PF00219: Insulin-like growth factor binding protein (IGFBP)

PIRSF800001: Ku70/80 autoantigen

PF02735: Ku70/Ku80 beta-barrel domain

Domain Superfamily

• One common Pfam domain

PIRSF Work Group Meeting, April 2003

Page 11: Protein  Information Resource

iProClass Integrated Protein Database

• Data integration from >90 databases
• Underlying data warehouse for protein ID/name/bibliography mapping
• Integration of protein family, function, structure for functional annotation
• Rich link (link + summary) for value-added reports of UniProt proteins

Data Integration and Protein Mapping

• Disease/Variation: OMIM, HapMap, …
• Ontology: GO
• Protein Sequence: UniProt, UniRef, UniParc, RefSeq, GenPept, …
• Gene/Genome: GenBank/EMBL/DDBJ, LocusLink, UniGene, MGI, TIGR
• Gene Expression: GEO, GXD, ArrayExpress, CleanEx, SOURCE
• Structure: PDB, SCOP, CATH, PDBSum, MMDB
• Family: PIRSF, InterPro, Pfam, Prosite, COG
• Interaction: DIP, BIND
• Taxonomy: NCBI Taxon, NEWT
• Protein Expression: Swiss-2DPAGE, PMG
• Literature: PubMed
• Function/Pathway: EC-IUBMB, KEGG, BioCarta, EcoCyc, WIT, …
• Modification: RESID, PhosphoBase

iProClass

Integrated Protein Knowledgebase

• NCBI X-Refs
• Gene/Genome
• Gene Ontology
• KEGG Pathway
• Structure Homolog
• PTM
• EC
• Additional Refs

Funded by NSF

Page 12: Protein  Information Resource

iProLINK Literature Mining Resource

iProLINK: integrated Protein Literature, INformation and Knowledge

http://pir.georgetown.edu/iprolink

Funded by NSF

[Figure: iProLINK overview. Labels: NLP Research; Literature-Based Curation; Bibliography Mapping & Annotation Extraction; Protein Name Ontology; Named Entity Recognition & Ontology Induction; Databases (UniProt, PIRSF, iProClass, GO); Bibliography (PubMed); Literature Mining & Protein Curation.]

Dictionary and Ontology
• Protein Names and Synonyms
• PIRSF Family Names in DAG

Guidelines
• Protein Naming Rules
• Name Tagging Guidelines

Literature Corpus
• Name-Tagged
• Annotation-Tagged

Bibliography Display
• Mapping of Protein ID to PubMed ID
• Papers Categorized by Annotations
• Papers Tagged with Annotations

Bibliography Submission
• Protein Mapping
• Annotation Categorization

Deliverables
• Bibliography report: Annotated bibliography for UniProtKB proteins
• BioThesaurus reports: Protein and gene names for UniProtKB proteins
• RLIMS-P program: Tag PubMed abstracts for phosphorylation objects
• Protein ontology DAG: PIRSF-based ontology

Page 13: Protein  Information Resource

NIAID Proteomic Admin Center

Funded by NIAID

• NIAID Proteomic Master Catalog & Complete Proteomes
• iProXpress for Protein Function and Pathway Analysis
    • Gene/Peptide-Protein Mapping
    • Sequence Analysis & Data Mining
    • Function/Pathway Discovery

[Figure: iProXpress (integrated Protein eXpression Analysis System). Labels: IP/2D/MS Proteomic Data; Gene Expression; iProClass; Gene/Peptide-Protein Mapping; Sequence Analysis & Data Mining; Function and Pathway Analysis; Protein Information Matrix; Interaction Map; Clustered Matrix; Clustered Graph; Pathway Map.]

http://pir.georgetown.edu/proteomics/

Page 14: Protein  Information Resource

Bioinformatics Infrastructure
• NCI caBIG: PIR grid-enablement (Programming access to UniProtKB)
• NSF TeraGrid: All-against-all BLAST (UniProtKB related sequences)
• PIR Bioinformatics Framework
    • Software Framework: J2EE n-Tier Architecture with Object Models
    • Database Distribution: XML, FASTA, Relational (Oracle 9i, MySQL)
    • Other Deliverables: Object Models, Web Services

Funded by NCI

[Figure: J2EE n-tier architecture.
Clients: Web Browser; Applications (Java Web Start).
Middle Tier: Servlet [Controller]; JSP, HTML, XML (XSLT) [Presentation]; Domain Objects [Model]; DAOManager with SQLDAO, FLATDAO, XMLDAO.
Adapters: JDBC; FlatFileAdapter; XMLAdapter.
Data Source: Oracle, MySql, DB2; Legacy Databases; XML Repositories.]
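The data-access layer named on this slide (a DAOManager in front of SQLDAO, FLATDAO and XMLDAO) can be sketched roughly as follows. This is a minimal illustration in Python rather than the J2EE code the slide refers to; the `ProteinEntry` fields, the table layout, the tab-delimited flat-file format, the XML element names, and the accessions are all assumptions made for the example, with `sqlite3` standing in for Oracle/MySQL/DB2.

```python
import sqlite3
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProteinEntry:
    """Domain object [Model]: a hypothetical, minimal protein record."""
    accession: str
    name: str


class ProteinDAO(ABC):
    """Common interface implemented by every data-source adapter."""

    @abstractmethod
    def find(self, accession: str) -> Optional[ProteinEntry]:
        ...


class SQLDAO(ProteinDAO):
    """Relational source (sqlite3 is a stand-in for the production RDBMS)."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn

    def find(self, accession):
        row = self.conn.execute(
            "SELECT accession, name FROM protein WHERE accession = ?",
            (accession,),
        ).fetchone()
        return ProteinEntry(*row) if row else None


class FlatDAO(ProteinDAO):
    """Flat-file source: one 'accession<TAB>name' record per line (assumed format)."""

    def __init__(self, text: str):
        self.rows = dict(line.split("\t", 1) for line in text.splitlines() if line)

    def find(self, accession):
        name = self.rows.get(accession)
        return ProteinEntry(accession, name) if name is not None else None


class XMLDAO(ProteinDAO):
    """XML source with <entry accession="..."><name>...</name></entry> (assumed schema)."""

    def __init__(self, xml_text: str):
        self.root = ET.fromstring(xml_text)

    def find(self, accession):
        for node in self.root.iter("entry"):
            if node.get("accession") == accession:
                return ProteinEntry(accession, node.findtext("name"))
        return None


class DAOManager:
    """Queries each registered DAO in order and returns the first hit."""

    def __init__(self, *daos: ProteinDAO):
        self.daos = daos

    def find(self, accession: str) -> Optional[ProteinEntry]:
        for dao in self.daos:
            entry = dao.find(accession)
            if entry is not None:
                return entry
        return None


# Demo data: three sources, one made-up entry each.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE protein (accession TEXT PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO protein VALUES ('ACC001', 'Example kinase')")
manager = DAOManager(
    SQLDAO(conn),
    FlatDAO("ACC002\tExample phosphatase\n"),
    XMLDAO("<entries><entry accession='ACC003'><name>Example mutase</name></entry></entries>"),
)
```

The controller only talks to `manager.find(...)`, so a data source can be swapped without touching presentation code, which appears to be the point of the tiered design.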

Page 15: Protein  Information Resource

Computing Environment
• Computers: Two Sun V880, IBM P690, 100-CPU Linux Cluster, Compaq 4100 Alpha
• Networking: Internet2, GU Network (1Gbps)
• GU UIS Advanced Research Computing

[Figure: PIR network diagram. Labels: GU Cisco Switch (10/100 mbs); PC's; Alpha Server 4100 (PIR Website, Development System, Oracle Database); IBM P690 (UniProt Mirror, DB2, Oracle, FTP Site #2); GU Gateway; Outside World; Windows 2K Server (Printing, Virus Protection, Backups); 1 Gbit/sec; Sun Fire V880 (UniProt Website, Oracle); Time Logic; Production System; FTP Site #1; Linux Server (UniProt Mail Server, Jitterbug); Network Printers; Linux Cluster (50 Linux PC's with 100 CPU; Blast/Fasta); Linux NFS File Server; Portable PC's; GU Cisco Switch (10/100/1000 mbs); Sun Fire V880 (Development System); FTP Site #3.]

Page 16: Protein  Information Resource

PIR Environment
• Funding: ~$3 Million Annual Total (2/3 UniProt, 1/3 Other)
• Home Institution: Georgetown University Medical Center (GUMC)
• Subcontract: National Biomedical Research Foundation (NBRF)
• New Location: Off-Campus (GU North Campus), 6250 SQFT
    • Suite 1200, 3300 Whitehaven Street NW, Washington, DC 20007

Page 17: Protein  Information Resource

PIR Organization

• 25 Staff Members: 14 GU, 11 NBRF
• 22 FTEs: 12.7 GU, 9.3 NBRF
• 17 with Doctorate Degree
• 11 GU Faculty
    • 2 Professors
    • 1 Research Associate Professor
    • 6 Research Assistant Professors
    • 2 Research Instructors

Informatics Team (12) (10.7 FTE)

Executive Team Members
• Dr. Peter McGarvey, Project Manager & Research Associate Professor (GU)
• Dr. Hongzhan Huang, Bioinformatics Team Lead & Research Assistant Professor (GU)
• Baris Suzek, Associate Team Lead, Bioinformatics & Research Associate (GU)

Staff Members
• Dr. Leslie Arminski, System Manager (NBRF)
• Dr. Hsing-Kuo Hua, Software Engineer (NBRF)
• Dr. Xin Yuan, Bioinformatics Scientist & Research Instructor (GU)
• Dr. Robel Y. Kahsay, Bioinformatics Scientist & Research Instructor (GU)
• Yongxing Chen, Bioinformatics Programmer (NBRF)
• Jing Zhang, Bioinformatics Programmer (NBRF)
• Sehee Chung, Software Engineer (GU)
• Natalia Petrova, PhD Student (GU) (0.5)
• Jess Catana, System Manager (GU) (0.2)

Protein Science Team (12) (10.3 FTE)

Executive Team Members
• Dr. Winona Barker, Director Emeritus of PIR (NBRF) (0.55)
• Dr. Darren Natale, Team Lead, Protein Science & Research Assistant Professor (GU)
• Dr. Zhangzhi Hu, Associate Team Lead, Protein Science & Research Assistant Professor (GU)
• Dr. Lai-Su L. Yeh, Administrative Coordinator (NBRF)

Staff Members
• Dr. Robert S. Ledley, NBRF President, Professor (NBRF/GU) (0.05)
• Dr. Anastasia Nikolskaya, Senior Protein Scientist & Research Assistant Professor (GU)
• Dr. Raja Mazumder, Scientific Coordinator & Research Assistant Professor (GU)
• Dr. C.R. Vinayaka, Senior Protein Scientist (NBRF)
• Dr. Sona Vasudevan, Senior Protein Scientist (NBRF)
• Dr. Cecilia Arighi, Senior Protein Scientist & Research Assistant Professor (GU)
• Vincent Hermoso, Protein Research Assistant (NBRF) (0.7)
• Christina Fang, Project Coordinator & Protein Research Assistant (NBRF)

PIR Director

Dr. Cathy Wu, Professor (GU)

Page 18: Protein  Information Resource

PIR Community Interactions (since 2004)

• Presentations and Invited Seminars
    • NIH Proteomics Workshop (Bi-Annual) – Bioinformatics Day
    • Conference Demos/Posters: ISMB-05, US HUPO-05, SOFG04
    • Over 20 Invited Presentations: Keystone, Human Brain Project Satellite Symposium, PDB Symposium, HUPO-05
• Policy Forums, Committees: NSF Plant Cyberinfrastructure, NIH Protein Structure Initiative, HUPO Proteomics Standards Initiative
• Publications: Over 25 Refereed Papers and Book Chapters
• Collaborations and Interactions
    • Collaborated and interacted with over 10 research institutions
    • Hosted face-to-face meetings for NIAID/caBIG projects
• Paper and Grant Reviews
    • Reviewed over 20 papers for refereed journals and conferences
    • Served on NSF/NIH grant review panels

Page 19: Protein  Information Resource

PIR-Georgetown Interactions

• Teaching
    • Courses: Bioinformatics (BCHB 521), Advanced Bioinformatics (BCHB 621)
    • Lectures: Medical Biochemistry, Protein Biomarker, Introductory Biology
• Mentoring
    • Mentored 9 graduate students (PhD students, MS Internship projects)
• Intercampus Seminars
• Proposal Submission by PIR Young Investigators as PI
    • Six proposals to federal and other agencies

Page 20: Protein  Information Resource

PIR/UniProt – Summary & Statistics

• Database Growth
• Database Usage
• Unified UniProt Web Site
• PIR UniProt Consortium Interactions

Peter McGarvey, Ph.D.

Page 21: Protein  Information Resource

UniProt: Universal Protein Resource
http://www.uniprot.org

UniProt: the world's most comprehensive catalog of information on proteins

UniProt Archive (UniParc)
A stable, comprehensive archive of all publicly available protein sequences for sequence tracking from: Swiss-Prot, TrEMBL, PIR-PSD, EMBL, Ensembl, IPI, PDB, RefSeq, FlyBase, WormBase, Patent Offices, etc.

UniProt Reference Clusters (UniRef): UniRef100, UniRef90, UniRef50
Non-redundant reference sequences clustered from UniProtKB and UniParc for comprehensive or fast sequence searches at 100%, 90%, or 50% identity

UniProt Knowledgebase (UniProtKB)
Integration of Swiss-Prot, TrEMBL and PIR-PSD. Fully classified, richly and accurately annotated protein sequences with minimal redundancy and extensive cross-references
• Swiss-Prot section: Manually-annotated protein sequences
• TrEMBL section: Computer-annotated protein sequences

Page 22: Protein  Information Resource

Database Growth

[Figure: number of entries (0 to 6,000,000) by major release, from Rel 1.0 (Dec-03) through Rel 2.0, 3.0, 4.0, 5.0, 6.0 to Rel 6.4 (Nov-05). Series: UniParc, UniRef100, UniProtKB, UniProtKB/TrEMBL, UniRef90, UniRef50, UniProtKB/Swiss-Prot. Chart annotations: +EVN, -EVN.]

Page 23: Protein  Information Resource

[Figure: FTP Downloads 2005. Axis: 0 to 7,000. Series: UniRef50, UniRef90, UniRef100, UniProt/SwissProt, UniProt/TrEMBL.]

[Figure: Unique Domains. Axis: 0 to 50,000. Series: PIR.Georgetown.Edu, PIR.UniProt.Org.]

[Figure: Hits. Axis: 0 to 2,500,000. Series: PIR.UniProt.Org, PIR.Georgetown.Edu.]

Page 24: Protein  Information Resource

Customer Email [email protected] & [email protected]

550 UniProt emails, 720 PIR emails
1 Day Turnaround

          UniProtKB  UniRef  UniParc  iProClass  PSD    NREF   PIRSF
UniProt   ~75%       12%     8%       < 1%       < 1%   < 1%   < 1%
PIR       22%        1%      1%       21%        ~15%   18%    10%

          FTP Site  Web Site  XML   ID Mapping
UniProt   16%       29%       16%   ~18%
PIR       11%       27%       15%   ~3%

“PIR is a wonderful resource.” – Craig
“Thank you for your prompt response, as always UniProt is on the ball!” – Fiona

Page 25: Protein  Information Resource

PIR/UniProt – Unified UniProt Web Site

Dec. 03, Three Synchronized Sites based on PIR Design

Nov. 04, Established Goals for Unified Web Sites.

2005, Back-end Data and Software Platform Developed.

Nov. 05, PIR Playing a Lead Role in Developing Specifications for the Interface.

June 06, Release of Unified UniProt Web Site Hosted by PIR and EBI

Page 26: Protein  Information Resource

PIR/UniProt - Consortium Interactions

• UniProt liaison group (discussion of high-level issues)
• UniProt web site committee (Unified UniProt web site planning)
• UniProt Link committee (working with external databases)
• UniProt help-mail (answering user inquiries)
• UniProt document committee (documentation, tutorials and FAQs)
• UniProt XML group (XML documentation and maintenance)
• UniProt group for automatic annotation pipeline
• Manual curation of Swiss-Prot template sequences
• Manual curation of site rules and controlled vocabularies
• Development of automatic annotation rules
• Development of protein naming guidelines
• Incorporation of new protein families into InterPro
• PIR routinely visits or hosts colleagues from EBI and SIB for discussions.
• Biweekly update of UniRef, UniParc and UniProtKB databases

Page 27: Protein  Information Resource

Protein Classification and Annotation

Darren Natale, Ph.D.

Team Lead, Protein Science, PIR

Research Assistant Professor, GUMC

Page 28: Protein  Information Resource

Protein Curation Activities

PIRSF – classification of homeomorphic proteins based on evolutionary relationships

PIRNR – family-based “Name Rules” that define the parameters for propagating specific name, EC and GO annotation to members

PIRSR – family-based “Site Rules” that define the parameters for propagating specific feature annotation to members

Page 29: Protein  Information Resource

Specialized Tools (I)

DAG

Preserves these three features in a navigable format
• Pfam/PIRSF Hierarchy
• Domain Relatives
• Domain Composition

In edit mode, allows easy creation, destruction, and movement of PIRSFs

Page 30: Protein  Information Resource

Specialized Tools (II)

PIR Tree and Alignment Viewer (PIRTAV)

HPS = 3-hexulose-6-phosphate synthase

KGPDC = 3-keto-L-gulonate 6-phosphate decarboxylase

[Figure panels: Phylogenetic Tree; Classification/Annotation; Alignment]

Page 31: Protein  Information Resource

PIRSF Curation Pipeline
• Uncurated level – computer-generated
• Preliminary Curation Level
    • Curate membership (principal tools: BLAST results, iterative blastclust, on-the-fly HMM)
    • Curate domain architecture
    • Select seeds
• Full Curation Level
    • Curate name and some references
    • Optional: write abstract indicating function, structure, etc.

After name review session and HMM performance check, all information (HMM, membership, annotation) is sent to EBI for integration into InterPro (Full level only).

Page 32: Protein  Information Resource

PIRNR Curation Pipeline

• Start with PIRSF curated to Full level
• Define match criteria for application of the rule
• Review protein name, synonyms, EC numbers, GO terms
• Find those that are appropriate to propagate to members that match rule criteria

After review of propagable information, send match conditions, exclusion conditions, and propagated fields to EBI for inclusion into automatic annotation pipeline. Results are displayed in EBI’s UniProt entry extended view.
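As a rough illustration of what a name rule carries (match conditions, exclusion conditions, and propagated fields), the sketch below applies one rule to a few family members. It is not the PIR or EBI pipeline: the rule identifier, the minimum-length match condition, the accessions, and the EC/GO values are placeholders invented for the example; only the family ID and name PIRSF001499 (Bifunctional CM/PDH, T-protein) come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Protein:
    accession: str
    pirsf: str                      # curated family membership
    length: int
    annotations: dict = field(default_factory=dict)


@dataclass(frozen=True)
class NameRule:
    rule_id: str                    # hypothetical identifier
    pirsf: str                      # match condition: family membership
    min_length: int                 # match condition (assumed): skip fragments
    excluded: frozenset             # exclusion condition: specific accessions
    propagated: tuple               # (field, value) pairs to propagate

    def matches(self, protein: Protein) -> bool:
        return (
            protein.pirsf == self.pirsf
            and protein.length >= self.min_length
            and protein.accession not in self.excluded
        )


def apply_rule(rule: NameRule, proteins: list) -> list:
    """Propagate rule fields to matching proteins; return the accessions touched."""
    touched = []
    for protein in proteins:
        if rule.matches(protein):
            for key, value in rule.propagated:
                # Existing annotation is kept; the rule only fills gaps.
                protein.annotations.setdefault(key, value)
            touched.append(protein.accession)
    return touched


rule = NameRule(
    rule_id="PIRNR-EXAMPLE",
    pirsf="PIRSF001499",
    min_length=300,
    excluded=frozenset({"ACC_EXCLUDED"}),
    propagated=(
        ("name", "Bifunctional CM/PDH (T-protein)"),
        ("EC", "EC placeholder"),
        ("GO", "GO placeholder"),
    ),
)

proteins = [
    Protein("ACC_FULL", "PIRSF001499", 373),
    Protein("ACC_FRAGMENT", "PIRSF001499", 120),
    Protein("ACC_EXCLUDED", "PIRSF001499", 380),
    Protein("ACC_OTHER", "PIRSF006786", 350),
    Protein("ACC_NAMED", "PIRSF001499", 375, {"name": "Curated name"}),
]
touched = apply_rule(rule, proteins)
```

Using `setdefault` means a curated value already on the entry wins over the rule; whether the production pipeline resolves conflicts that way is not stated in the slides.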

Page 33: Protein  Information Resource

PIRSR Curation Pipeline
• Start with PIRSF with curated membership and seeds. At least one member must have solved structure.
• Edit seed-to-structure alignment to define and retain conserved regions covering pertinent residues
• Build Site HMM from concatenated conserved regions
• Define feature annotation using controlled vocabulary with evidence attribution

Apply rules to PIRSF members, create log files to send to SIB (UniProtKB/Swiss-Prot) or EBI (UniProtKB/TrEMBL). Results are incorporated into UniProtKB flat files.

Page 34: Protein  Information Resource

Progress on Protein Curation Activities

                      Nov-2004   Nov-2005
PIRSF (Families)      5876       7083
PIRNR (Name Rules)    320        1321
PIRSR (Site Rules)    81         164

[Figure: bar charts of curation progress. Extracted labels and values:
PIRSF: 4222, 1595, 1266; 162 Preliminary, 693 Full, 352 Full + Desc; 1207.
PIRNR: 428 DE/GO/EC, 342 DE/GO, 157 DE; 561, 420, 251; 1001.
PIRSR: 35 Active, 34 Metal/Binding, 14 Misc.; 112, 38, 14; 83.]
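The year-over-year change can be worked out directly from the Nov-2004 and Nov-2005 totals on this slide. The short calculation below uses only those six numbers; the variable names are illustrative.

```python
# Totals from the slide: (Nov-2004, Nov-2005)
counts = {
    "PIRSF (Families)": (5876, 7083),
    "PIRNR (Name Rules)": (320, 1321),
    "PIRSR (Site Rules)": (81, 164),
}

# Absolute increase and percentage growth relative to Nov-2004.
growth = {
    label: (new - old, round(100 * (new - old) / old, 1))
    for label, (old, new) in counts.items()
}

for label, (added, percent) in growth.items():
    print(f"{label}: +{added} ({percent}% growth)")
```

The increases of 1207, 1001 and 83 match the standalone figures that appear on the slide.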

Page 35: Protein  Information Resource

Impact Measurements

• PIRSFs integrated into InterPro
    • Sent: 1,775
    • PIRSF-unique: 840
• PIRNR touches on UniProtKB/TrEMBL
    • Entries: 60,300
    • Annotation lines: 281,400
• PIRSR touches on UniProtKB
    • Entries: 41,000 (9,800)
    • Feature lines: 100,000 (27,000)

Page 36: Protein  Information Resource

Increasing Throughput & Impact

PIRSF PIRNR PIRSR

Curated

Full

To InterPro

AutoAnno

With Structure

Active

•Comprehensive coverage

•Curation “push”

•Propagation at PIR

•Add ligand-binding

Increased specificity

Active +

Ligand

•Emphasize Full/InterPro •Rules to EBI •Active sites

All three will be integrated into the Swiss-Prot annotation platform

Page 37: Protein  Information Resource

UniRef Databases

Hongzhan Huang, Ph.D.

Bioinformatics Team Lead

Protein Information Resource, GUMC

Page 38: Protein  Information Resource

UniRef (UniProt Reference Clusters)
• Non-Redundant Reference Clusters for Sequence Searching
• Derived from UniProtKB and Selected UniParc Sources
    • UniRef100: 100% sequence identity
    • UniRef90: 90% sequence identity (1/3 size reduction from UniRef100)
    • UniRef50: 50% sequence identity (2/3 size reduction)

Release 6.4 (Nov 05)

Page 39: Protein  Information Resource

UniRef100
• The most comprehensive sequence dataset for sequence similarity search
    • 3,176K sequences in UniRef100 vs. 3,022K sequences in NCBI nr
• Source Sequences
    • Complete UniProtKB - Splice Variants as separate entries
    • Selected UniParc (e.g. Ensembl and RefSeq)
• Non-Redundancy
    • Combine identical sequences from all species
    • Merge sub-fragments
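The two non-redundancy steps (combine identical sequences, then merge sub-fragments) can be illustrated with a small sketch. It assumes a sub-fragment is an exact substring of a longer retained sequence and uses a simple quadratic scan; the accessions and sequences are made up, and the production UniRef100 procedure may define fragments and scale very differently.

```python
def build_nonredundant(seqs: dict) -> dict:
    """Map each retained sequence to the accessions merged into it."""
    # Step 1: combine identical sequences regardless of source.
    by_sequence = {}
    for accession, sequence in seqs.items():
        by_sequence.setdefault(sequence, []).append(accession)

    # Step 2: merge sub-fragments. Longest sequences are processed first,
    # so a fragment always meets its containing sequence before itself.
    clusters = {}
    for sequence in sorted(by_sequence, key=lambda s: (-len(s), s)):
        parent = next((kept for kept in clusters if sequence in kept), None)
        if parent is None:
            clusters[sequence] = list(by_sequence[sequence])
        else:
            clusters[parent].extend(by_sequence[sequence])
    return clusters


example = {
    "A1": "MKTAYIAKQR",   # full-length sequence
    "A2": "MKTAYIAKQR",   # identical copy from another source
    "A3": "TAYIAK",       # sub-fragment of A1/A2
    "A4": "MKVLLA",       # unrelated sequence
}
clusters = build_nonredundant(example)
```

Here four input records collapse to two entries, one of which absorbs both the duplicate and the fragment.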

Page 40: Protein  Information Resource

UniRef90 & UniRef50
• Reduced sequence datasets for faster sequence similarity search
• Representative sequence for each cluster
• Clustering Algorithm
    • CD-HIT: Fast, top down, non-overlapping
    • PIR’s parallelized version running on Linux Cluster
• UniRef90: 1/3 size reduction
• UniRef50: 2/3 size reduction
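The "top down, non-overlapping" behaviour can be illustrated with a greedy incremental clustering sketch: sequences are visited longest first, and each one either joins the first representative it is similar enough to or becomes a new representative. This is not CD-HIT itself and not PIR's parallelized version; in particular, the `identity` function below is a crude stand-in built on `difflib.SequenceMatcher`, whereas CD-HIT uses short-word filtering and alignment. Accessions and sequences are invented.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Matched residues divided by the shorter length (stand-in for alignment identity)."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def greedy_cluster(seqs: dict, threshold: float) -> dict:
    """Return {representative accession: [member accessions]}."""
    clusters = {}
    # Longest first, so each representative is the longest member of its cluster.
    for accession in sorted(seqs, key=lambda acc: (-len(seqs[acc]), acc)):
        for representative in clusters:
            if identity(seqs[representative], seqs[accession]) >= threshold:
                clusters[representative].append(accession)
                break
        else:
            # No representative is similar enough: start a new cluster.
            clusters[accession] = [accession]
    return clusters


example = {
    "P1": "ACDEFGHIKLMNPQRSTVWY",
    "P2": "ACDEFGHIKLMNPQRSTVWA",   # one substitution relative to P1
    "P3": "WWWWWWWWWWWWWWWWWWWW",   # unrelated
    "P4": "ACDEFGHIKLWWWWWWWWWW",   # first half shared with P1
}
clusters_90 = greedy_cluster(example, 0.90)
clusters_50 = greedy_cluster(example, 0.50)
```

Each sequence lands in exactly one cluster, and lowering the threshold from 90% to 50% merges P4 into P1's cluster, which is the kind of size reduction the slide describes.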

Page 41: Protein  Information Resource

UniRef50 Sequence Classification

Completely automated, biweekly-updated classification of all proteins

• How good are the UniRef50 clusters?
    • Evaluated by all-against-all BLAST search results
    • 98% of the clusters are of good quality: each sequence matches every other sequence within the cluster
• Problematic clusters
    • One long sequence bridges two or more unrelated sub-clusters.
    • This may result from incorrect gene models, domain fusion, or polyproteins.
    • A new algorithm will be developed with length/overlap parameters to detect and regroup such clusters.
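The quality criterion described here (every sequence matches every other sequence in the cluster) and the bridging problem can be expressed as a small check over all-against-all hits. The sketch assumes hits are available as a set of accession pairs; the identifiers and the `bridge_candidates` heuristic are illustrative and are not the planned length/overlap algorithm.

```python
from itertools import combinations


def missing_pairs(members, hits):
    """Member pairs with no hit in either direction."""
    return [
        (a, b)
        for a, b in combinations(sorted(members), 2)
        if (a, b) not in hits and (b, a) not in hits
    ]


def bridge_candidates(members, hits):
    """In an incomplete cluster, members that still hit every other member."""
    if not missing_pairs(members, hits):
        return []

    def linked(a, b):
        return (a, b) in hits or (b, a) in hits

    return [
        m for m in sorted(members)
        if all(linked(m, other) for other in members if other != m)
    ]


def good_fraction(clusters, hits):
    """Fraction of clusters in which all members match each other."""
    good = sum(1 for members in clusters.values() if not missing_pairs(members, hits))
    return good / len(clusters)


# Hypothetical hits: LONG matches two groups that do not match each other.
hits = {
    ("LONG", "A1"), ("LONG", "A2"), ("A1", "A2"),
    ("LONG", "B1"), ("LONG", "B2"), ("B1", "B2"),
    ("C1", "C2"),
}
clusters = {
    "cluster_1": ["LONG", "A1", "A2", "B1", "B2"],
    "cluster_2": ["C1", "C2"],
}
```

In this toy input, `cluster_1` is flagged because the A and B members never match each other, and `LONG` is reported as the likely bridging sequence.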

Page 42: Protein  Information Resource

Usages of UniRef Clusters
• UniRef90/50 for comprehensive automated classification of proteins
    • Faster searches and less cluttered similarity search outputs
    • More even sampling of sequence space and reduction of search bias
• UniRef for integrity check of database annotation
    • UniRef100 to annotate EST sequences
    • UniRef50 to detect incorrect gene models
• UniRef90/50 for PIRSF family classification
    • UniRef90 to recruit new PIRSF family members
    • UniRef50 to create new PIRSF families

[Figure: UniRef50 Clusters -> (merge related clusters) -> PIRCF Families (Computer-generated Families) -> (checked by curator) -> PIRSF Families]

Page 43: Protein  Information Resource

Literature Mining

Zhang-Zhi Hu, M.D.

Associate Team Lead, Protein Science, PIR

Research Assistant Professor, GUMC

Page 44: Protein  Information Resource

iProLINK: An Integrated Resource for Protein Literature Mining

Complete UniProtKB bibliography mapping

RLIMS-P text mining tool for protein phosphorylation

BioThesaurus: protein/gene names

Page 45: Protein  Information Resource

PIR/UniProt Protein Bibliography

355,629 unique citations (PMID) are in iProClass for 2.4 million UniProtKB entries.

166,950 (47%) citations are currently in UniProtKB.

The additional 188,679 (53%) unique citations are taken from sources such as GeneRIF, SGD, MGI.

Bibliography report: curated citations, user submitted, computationally mapped

Page 46: Protein  Information Resource

BioThesaurus report

BioThesaurus – comprehensive collection of gene/protein names from multiple sources and their associations with database entities.

Applications of BioThesaurus
• Gene/protein names mapping
    • Search synonyms
    • Resolve name ambiguity
• Database annotation
    • Error detection: conflicting names in UniProtKB
• Literature mining
    • Query expansion: synonyms and text-variants allow for expanded search results

Example: IAPP named in 18 entries
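Two of the applications listed for BioThesaurus, query expansion and detection of ambiguous names, reduce to lookups over a name-to-entry mapping. The sketch below shows the idea on made-up records; the entry identifiers and the second entry's names are hypothetical, and the real resource covers far more sources and name variants.

```python
from collections import defaultdict


def build_index(records):
    """records: iterable of (entry_id, name) pairs."""
    names_by_entry = defaultdict(set)
    entries_by_name = defaultdict(set)
    for entry, name in records:
        names_by_entry[entry].add(name)
        entries_by_name[name.lower()].add(entry)   # case-insensitive lookup
    return names_by_entry, entries_by_name


def expand_query(name, names_by_entry, entries_by_name):
    """All names carried by any entry that uses the query name."""
    expanded = set()
    for entry in entries_by_name.get(name.lower(), ()):
        expanded |= names_by_entry[entry]
    return sorted(expanded)


def ambiguous_names(entries_by_name):
    """Names attached to more than one entry."""
    return sorted(n for n, entries in entries_by_name.items() if len(entries) > 1)


# Illustrative records only.
records = [
    ("ENTRY_1", "IAPP"),
    ("ENTRY_1", "Amylin"),
    ("ENTRY_1", "Islet amyloid polypeptide"),
    ("ENTRY_2", "IAPP"),
    ("ENTRY_2", "Hypothetical protein X"),
]
names_by_entry, entries_by_name = build_index(records)
expanded = expand_query("iapp", names_by_entry, entries_by_name)
ambiguous = ambiguous_names(entries_by_name)
```

A literature search would then use every name in `expanded`, while names in `ambiguous` are candidates for curator review.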

Page 47: Protein  Information Resource

RLIMS-P: Rule-based LIterature Mining System for Protein Phosphorylation

[Figure: RLIMS-P report for PMID:1939059 (labels: kinase, substrate, sites; P12957)]

Workflow: MEDLINE abstract (PubMed ID) -> RLIMS-P phosphorylation feature extraction -> UniProtKB entry mapping (PMID mapping) -> UniProtKB site feature annotation & evidence attribution

• 1876 UniProtKB entries are currently annotated with 4042 phosphorylation sites.
• 105K unique citations (PMID) are in UniProtKB/Swiss-Prot
• Batch processing by RLIMS-P yielded 4690 abstracts with phosphorylation information, 913 of them with site information, including 214 in UniProtKB entries with no annotated phosphorylation features.
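To give a feel for rule-based extraction of kinase, substrate and site, the toy sketch below matches two sentence templates with regular expressions. It is in no way the RLIMS-P rule set, which the slides do not describe in detail; the patterns, the name token definition, and the example sentences are assumptions for illustration.

```python
import re

# A site such as Ser-133 or Thr45 (assumed notation).
SITE = r"(?P<site>(?:Ser|Thr|Tyr)-?\d+)"
# A protein name token: letters, digits, underscore, hyphen (assumed).
NAME = r"[A-Za-z0-9][\w-]*"

PATTERNS = [
    # Active voice: "<kinase> phosphorylates <substrate> at <site>"
    re.compile(rf"(?P<kinase>{NAME}) phosphorylates (?P<substrate>{NAME}) (?:at|on) {SITE}"),
    # Passive voice: "<substrate> is phosphorylated by <kinase> on <site>"
    re.compile(rf"(?P<substrate>{NAME}) is phosphorylated by (?P<kinase>{NAME}) (?:at|on) {SITE}"),
]


def extract(text: str) -> list:
    """Return (kinase, substrate, site) tuples found by either template."""
    found = []
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((match["kinase"], match["substrate"], match["site"]))
    return found


abstract = (
    "PKA phosphorylates CREB at Ser-133. "
    "Substrate2 is phosphorylated by KinaseB on Thr45."
)
results = extract(abstract)
```

Each extracted tuple would then need to be mapped to a UniProtKB entry before it could become a site feature with evidence attribution, as the workflow on the slide indicates.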

Page 48: Protein  Information Resource

NIAID Biodefense Proteomics Program

Peter McGarvey, Ph.D.

Page 49: Protein  Information Resource

NIAID Biodefense Proteomics Program

7 Proteomics Research Centers: Identifying Targets for Therapeutic Interventions “..discovering targets for potential candidates for the next generation of vaccines, therapeutics, and diagnostics”

Administrative Resource Center: Support research centers, public distribution of results and protocols

..establish a Scientific Working Group, Interoperability Working Group, Data infrastructure and promote awareness of the project so scientists worldwide can utilize these resources.

Page 50: Protein  Information Resource

Administrative Resource
• Project Management - Social & Scientific Systems (SSS)
    • Meetings and Communications
    • Web Portal
    • NIAID Annual Meeting at PIR May 2006
• Scientific Coordination - PIR & VBI
    • Scientific Advisory Working Group (SWG)
    • Interoperability Working Group (IWG)
• Data Infrastructure – PIR & VBI
    • Proteomic Database: Storage and Retrieval (VBI)
    • Data Management and Analysis Tools (PIR/VBI)
    • Integrated Protein Knowledge System (PIR)

Page 51: Protein  Information Resource

Proteomics Program Interaction Map

Page 52: Protein  Information Resource

[Figure: data flow. Labels: Multiple Data Types from Proteomics Research Centers; Data Integration at Admin Center (Data Exchange Format, Controlled Vocabulary, Ontology); Integrated Data at VBI; Master Catalog & Complete Proteomes at GU-PIR (iProClass, UniProt, PIRSF); Protein ID; Peptide/Protein Sequence Mapping.]

Page 53: Protein  Information Resource

NCI caBIG™ Projects

Baris E. Suzek

Associate Bioinformatics Team Lead

Protein Information Resource, GUMC

Page 54: Protein  Information Resource

About caBIG
• The cancer Biomedical Informatics Grid - WWW of cancer research
• National Cancer Institute (NCI) and over 50 cancer centers
• Goals:
    • Breaking down technical and collaborative barriers within the cancer community
    • Facilitating connectivity and sharing of information through common standards and unifying architecture
    • Addressing not only syntactic but also semantic interoperability

https://cabig.nci.nih.gov

Page 55: Protein  Information Resource

PIR Activities in caBIG

• Domain Workspaces
    • Clinical Trial Management Systems
    • Integrative Cancer Research Workspace
        • PIR Developer Project: Grid Enablement of PIR
        • PIR Adopter Project (Tester): SEED Genome Annotation Tool
        • PIR Participant (Consultant): Protein informatics tools, databases
    • Tissue Banks and Pathology Tools Workspace
• Cross Cutting Workspaces
    • Architecture
    • Vocabularies and Common Data Elements
        • PIR Participant: Protein models, objects, vocabularies, ontologies

Page 56: Protein  Information Resource

Grid-Enablement of PIR
• Goal: UniProt Knowledgebase (UniProtKB) serves as the central protein information resource for cancer research
• One of four caBIG reference projects
    • PIR (Georgetown University)
    • caTIES (University of Pittsburgh)
    • rProteomics (Duke University)
    • caArray (NCICB/Georgetown)
• First phase completed
    • UniProtKB is searchable through caGrid browser
• Second phase to be developed
    • Expose more information from PIR/UniProt databases to caBIG
    • Increase semantic/syntactic interoperability with other services
• Current Architecture
    • caGrid 0.5

Page 57: Protein  Information Resource

PIR SEED Adoption

• SEED Genome Annotation Tool
    • Developer: U Chicago/Argonne National Lab
    • Open source and distributed framework for genome annotation
    • Support subsystems annotation and metabolic reconstructions
    • Explore functional coupling based on genome context, metabolic pathway, and phylogenetic profile
• PIR roles
    • Assist development of use cases
    • Create test procedures and test the system
    • Develop user manual