EMBL Outstation — The European Bioinformatics Institute

Added-Value Proteome Databases: SWISS-PROT, TrEMBL, InterPro

Large-Scale Characterization of Protein Sequence Data: The Integrative Approach of SWISS-PROT + TrEMBL

Times are changing

‘Data Waves’

Biological sequences

Mutation

Metabolism

Polymorphism

Signaling

Expression

Size, Complexity, Integration

The Challenge of the Genome Era

Rapidly growing amounts of data that lack experimental determination of biological function increase the need for computational analysis of the data

Need for Bioinformatics

Bioinformatics: 5 years ago.....

Pharmaceutical companies were not interested

Life scientists believed that it was an outlet for failed biologists who like to play with computers

Computer scientists did not even know of its existence

Bioinformatics: today.....

Pharmaceutical companies believe that it is a way to streamline the drug discovery process

Some life scientists believe that it is the solution to all problems in life sciences

Computer scientists find it most useful as a new way to get grants

Bioinformatics: In 5 years.....

Pharmaceutical companies use it routinely as a complement to experimental work

Life scientists use it efficiently and therefore forget that it exists

Computer scientists have jumped on another hot subject

Bioinformatics

is a complement to, but no substitute for, experimental research: it can help to plan experiments, but cannot replace them

is not cheap

takes a significant amount of time to be any good

Quality control is crucial: some garbage in, a lot of garbage out!

Materials and Methods

Materials: biological data

Methods: a wide range of computational techniques

Essential in Bioinformatics: Databases as a tool for computational analysis and data-mining

(with SWISS-PROT being the gold-standard)

SWISS-PROT is a curated protein sequence data bank

established in July 1986 by Amos Bairoch in Geneva and maintained collaboratively with EMBL since June 1987

currently contains 76 000 protein sequence entries

Essential criteria for a sequence data bank

it must be complete with minimal redundancy

it must contain as much up-to-date information as possible on each sequence

all the information items must be retrievable by computer programs in a consistent manner

it should be integrated (cross-referenced) with other sequence related data banks

Integration with other databases

76 000 SWISS-PROT entries

abstracted from > 60 000 references

linked by > 275 000 direct pointers to 30 related or specialized data collections
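
In SWISS-PROT flat-file entries, pointers of this kind are stored as DR (database cross-reference) lines. As a minimal sketch, assuming an entry is available as plain text, they can be tallied per target database as shown here. The sample entry, its identifiers, and the function name `count_cross_references` are invented for illustration, and the DR line layout is simplified.

```python
from collections import Counter


def count_cross_references(entry_text: str) -> Counter:
    """Count DR (cross-reference) lines per target database in one entry."""
    counts = Counter()
    for line in entry_text.splitlines():
        if not line.startswith("DR   "):
            continue
        # A DR line has the form "DR   DATABASE; identifier; ...".
        database = line[5:].split(";", 1)[0].strip()
        if database:
            counts[database] += 1
    return counts


# Invented sample entry (hypothetical identifiers, simplified layout).
SAMPLE_ENTRY = """\
ID   EXAMPLE_HUMAN   STANDARD;   PRT;   120 AA.
AC   P00000;
DR   EMBL; X00000; CAA00000.1; -.
DR   EMBL; X00001; CAA00001.1; -.
DR   PDB; 1ABC; 01-JAN-98.
DR   PROSITE; PS00001; ASN_GLYCOSYLATION; 1.
//
"""

if __name__ == "__main__":
    for database, n in count_cross_references(SAMPLE_ENTRY).most_common():
        print(database, n)
```

Summing such counts over all entries would give a total comparable to the pointer figure quoted on the slide.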

Integration with other databases

EMBL Nucleotide Sequence Database

PDB

Genomic databases (FlyBase, SubtiList, MaizeDB, EcoGene, LISTA, SGD, StyGene)

2D-Gel databases (ECO2DBASE, SWISS-2DPAGE, Aarhus/Ghent, YEPD, Harefield)

Specialized collections (OMIM, PROSITE, ENZYME, GCRDB, Transfac, HSSP)

Connections between databases

Figure: diagram connecting EMBL Nucleotide Sequences and SWISS-PROT Protein Sequences with WormPep [C. elegans], EPD [Euk. Prom], FlyBase, SubtiList, MaizeDB, EcoGene [E. coli], LISTA [Yeast], Transfac, GCRDb [7TM recep.], REBASE [Restriction Enzymes], StyGene [S. typhimurium], Prosite [Patterns], ECD [E. coli map], SWISS-2DPAGE [2D], Aarhus/Ghent [2D], ECO2DBASE [2D], ENZYME [Nomencl.], DictyDB [D. disco.], OMIM [Diseases], YEPD [yeast], HSSP [3D simil.], and PDB [3D structures]

SWISS-PROT growth

Figure: amino acids (millions, axis from 0 to 25) plotted by year, 1987 to 1996

Nucleotide sequence database growth

Figure: megabases (axis from 0 to 600) plotted by year, 1982 to 1996

The Bottleneck: Annotation

Annotation consists of the description of:

Function(s) of the protein

Post-translational modification(s)

Domains and sites

Secondary structure

Quaternary structure

Similarities to other proteins

Disease(s) associated with deficiency(ies) in the protein

Sequence conflicts, variants, etc.

Annotation sources:

publications that report new sequence data

review articles to periodically update the annotation of families or groups of proteins

external experts

TrEMBL

is a computer-annotated supplement to SWISS-PROT

consists of entries in SWISS-PROT format

translations of CDS in the Nucleotide Sequence Database not in SWISS-PROT

August 1998: SWISS-PROT 36 + TrEMBL 7

327 000 CDS in corresponding EMBL release

74 000 SWISS-PROT entries

109 000 CDS integrated in SWISS-PROT

the remaining 216 000 CDS were merged whenever possible to reduce redundancy

TrEMBL release 7

194 000 TrEMBL entries

54 000 000 amino acids

linked by > 300 000 direct pointers to 14 related or specialized data collections

The Production of TrEMBL

translation and entry creation

sorting the entries

post-processing the SP-TrEMBL entries

Translation and entry creation

translation of every CDS not yet cross-referenced to SWISS-PROT

parsing of information in EMBL entries into TrEMBL entries
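
The translation step can be sketched as follows. This is a minimal illustration that assumes the standard genetic code and a CDS supplied as a plain DNA string in the correct reading frame. A production pipeline would also have to handle joined feature locations, alternative genetic codes, and partial features, none of which is shown. The function name `translate_cds` is hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate a CDS into a protein sequence, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    # Step through complete codons only; trailing bases are ignored.
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate_cds("ATGGCCTAA"))
```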

Sorting the entries

into SP-TrEMBL and REM-TrEMBL

SP-TrEMBL is split into taxonomic divisions

Post-processing

reducing redundancy

enhancing the information content
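
One simple form of redundancy reduction is to merge entries that describe the same protein. The sketch below assumes, purely for illustration, that two entries are redundant when they have identical sequences and the same source organism. The actual merging criteria used for TrEMBL are not stated in these slides and may well differ. The `Entry` class and its fields are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    accession: str
    organism: str
    sequence: str
    secondary_accessions: list = field(default_factory=list)


def merge_redundant(entries):
    """Merge entries sharing organism and sequence; the first one seen is kept."""
    merged = {}
    for entry in entries:
        key = (entry.organism, entry.sequence)
        if key in merged:
            kept = merged[key]
            # Record the absorbed entry's accession numbers on the kept entry.
            kept.secondary_accessions.append(entry.accession)
            kept.secondary_accessions.extend(entry.secondary_accessions)
        else:
            # Copy so that the caller's objects are not modified.
            merged[key] = Entry(
                entry.accession,
                entry.organism,
                entry.sequence,
                list(entry.secondary_accessions),
            )
    return list(merged.values())


if __name__ == "__main__":
    sample = [
        Entry("O00001", "Homo sapiens", "MKT"),
        Entry("O00002", "Homo sapiens", "MKT"),
        Entry("O00003", "Mus musculus", "MKT"),
    ]
    for entry in merge_redundant(sample):
        print(entry.accession, entry.secondary_accessions)
```

Keeping the absorbed accession numbers as secondary accessions means that references to a merged entry can still be resolved.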

Improving Automatic Annotation

will streamline flow into TrEMBL

will bring TrEMBL nearer to SWISS-PROT quality

will make the transition from TrEMBL to SWISS-PROT easier

Figure: diagram with the labels Hands-on Curation, Removal of redundancy, PROSITE pattern Searching, Enhancement, Reliable Prosite Matches, Enzyme Numbers, SP-TrEMBL, SWISS-PROT, and Hot Spot for Development

Demands on a system for automated data analysis and annotation

Correctness

Scalability

Updateable

Low level of redundant information

Completeness

Standardized vocabulary

Standardized transfer of annotation from characterized proteins in SWISS-PROT to TrEMBL entries

TrEMBL entry is reliably recognized by a given method as a member of a certain group of proteins

corresponding group of proteins in SWISS-PROT shares certain annotation

common annotation is transferred to the TrEMBL entry and flagged as annotated by similarity
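
These three steps can be sketched as a small function. Everything here is an illustrative assumption: the dictionary layout, the function names, and the use of a set intersection to find the common annotation. The recognition method itself (for example a pattern or profile match) is represented only by a precomputed boolean, `is_member`.

```python
def common_annotation(group_members):
    """Return the annotation lines shared by every member of a protein group."""
    annotation_sets = [set(member["annotation"]) for member in group_members]
    if not annotation_sets:
        return set()
    return set.intersection(*annotation_sets)


def transfer_annotation(trembl_entry, group_members, is_member):
    """Copy the group's common annotation to a TrEMBL entry, flagged by similarity.

    Returns a new dictionary; the input entry is left unchanged.
    """
    updated = dict(trembl_entry)
    updated["annotation"] = list(trembl_entry["annotation"])
    if not is_member:
        return updated
    existing = set(updated["annotation"])
    for line in sorted(common_annotation(group_members)):
        flagged = f"{line} (By similarity)"
        # Skip anything the entry already states, flagged or not.
        if line not in existing and flagged not in existing:
            updated["annotation"].append(flagged)
    return updated


if __name__ == "__main__":
    # Hypothetical SWISS-PROT group members and TrEMBL entry.
    group = [
        {"annotation": ["FUNCTION: Hydrolase", "SUBCELLULAR LOCATION: Cytoplasmic"]},
        {"annotation": ["FUNCTION: Hydrolase", "SUBUNIT: Homodimer"]},
    ]
    entry = {"accession": "O00000", "annotation": []}
    print(transfer_annotation(entry, group, is_member=True)["annotation"])
```

Only annotation shared by all group members is transferred, which is a conservative choice: anything that varies within the group is left for manual curation.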

Environment for Distributed Information Transfer to TrEMBL (EDITtoTrEMBL)

RuleBase

Analyzers

Dispatchers

EDITtoTrEMBL

EDITtoTrEMBL: RuleBase

SWISS-PROT as source of annotation: correctness and controlled vocabulary

Rules can be semi-automatically and/or manually created

Rules can be updated

EDITtoTrEMBL: Analyzers

Directly implement an algorithm or communicate with external programs

Query other databases

Use rules to add information to TrEMBL entries

EDITtoTrEMBL: Examples of Analyzers

sequence analysis tools (PROSITE, PFAM, PRINTS, TM, Coiled Coils, Signal etc)

sequence similarity searching (FASTA, SW, BLAST)

database scanning/parsing (MGD, FlyBase, ENZYME, etc)
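
As an illustration of the first kind of analyzer, a PROSITE pattern can be converted to a regular expression and scanned against a sequence. The sketch below covers only the basic pattern syntax (`x`, `[...]`, `{...}`, repeat counts, and the `<` and `>` anchors at the ends of a pattern). It is not the EDITtoTrEMBL implementation, and the function names are hypothetical. The example pattern `N-{P}-[ST]-{P}` is the PROSITE N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a PROSITE pattern such as 'N-{P}-[ST]-{P}.' to a Python regex."""
    regex_parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):   # '<' anchors the pattern at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):     # '>' anchors the pattern at the C-terminus
            suffix, element = "$", element[:-1]
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":               # x matches any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # {ABC}: any residue except A, B, C
        # '[ABC]' and single residues are already valid regex fragments.
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        regex_parts.append(prefix + core + suffix)
    return "".join(regex_parts)


def find_pattern(pattern: str, sequence: str):
    """Return (1-based start position, matched text) for non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    print(prosite_to_regex("N-{P}-[ST]-{P}."))
    print(find_pattern("N-{P}-[ST]-{P}.", "AANGSAA"))
```

Because `re.finditer` reports non-overlapping matches only, overlapping sites would need a lookahead-based search instead.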

EDITtoTrEMBL: Dispatchers

Control of annotation flow

Error checking

Removal of redundant information

Automated post-processing of TrEMBL entries

redundancy removal: currently affects around 20% of the entries

improvement of annotation: currently affects around 25% of the entries

SWISS-PROT + TrEMBL

complete and up-to-date protein sequence collection

minimal redundancy: SP_TR_NRDB

linked by > 500 000 direct pointers to 30 related or specialized data collections

deeper integration between the EMBL Nucleotide Sequence Database and SWISS-PROT + TrEMBL by using PID numbers

Integrated resource of Protein domain and functional sites (InterPro)

Integration of different pattern recognition methods (PROSITE, PRINTS and PFAM)

Incorporation of new families and domains into InterPro

Enhancing the functional annotation of TrEMBL entries

Enhancing genome annotation

The InterPro project participants

Co-ordinated by EBI (R. Apweiler)

PROSITE (A. Bairoch, P. Bucher)

PRINTS (T. Attwood)

PFAM (R. Durbin, E. Birney, A. Bateman, E. Sonnhammer)

PRODOM (D. Kahn)

PRATT (I. Jonassen)

GENE-IT (J.-J. Codani)

LION bioscience AG (R. Schneider)

1.9.1998: SWISS-PROT ceased to be in the public domain

What has changed

No changes for academic users

Almost no restrictions on the redistribution of SWISS-PROT by academic servers or software companies

Commercial users are required to pay yearly subscription fees. These fees will be used to complement the existing grants in order to provide stable long-term funding

Credits

SWISS-PROT at EBI: Rolf Apweiler, Sergio Contrino, Wolfgang Fleischmann, Gill Fraser, Henning Hermjakob, Viv Junker, Alexander Kanapin, Youla Karavidopoulou, Evguenia Kriventseva, Fiona Lang, Claire O'Donovan, Michele Magrane, Maria Jesus Martin, Nicoletta Mitaritonna, Steffen Moeller, Evgenui Zdobnov

Collaborators: Amos Bairoch, Jean-Jacques Codani, Keith Tipton, Marvin Edelman, Compugen, Paracel, Sue Povey and Julia White, MGD, Flybase, Neil Rawlings, Network of > 200 external experts

Take-home message:

Bioinformatics is not essential for biologists, since 2 months in the lab can easily save you an afternoon at the computer