EBI is an Outstation of the European Molecular Biology Laboratory. Amaia Sangrador InterPro curator...

Post on 31-Dec-2015


EBI is an Outstation of the European Molecular Biology Laboratory.

Amaia Sangrador, InterPro curator, amaia@ebi.ac.uk

Introduction to InterPro

What is InterPro?

Diagnostic resource:

InterPro uses signatures from several different databases (referred to as member databases) to predict information about proteins

• Provides functional analysis of proteins by classifying them into families and predicting domains and important sites

• Adds information about the signatures and the types of proteins they match

InterPro Consortium

Consortium of 11 major signature databases

Why do we need predictive annotation tools?

What is UniProt?

Collaboration between EBI, SIB and PIR

Based on the original work on PIR, Swiss-Prot and TrEMBL

The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information.

UniParc: Sequence archive (current and obsolete sequences)

UniMES: Metagenomic and environmental sample sequences

UniProtKB/Swiss-Prot

Reviewed

UniProtKB/TrEMBL

Unreviewed

UniProtKB: Protein knowledgebase

EMBL/GenBank/DDBJ, Ensembl, RefSeq, PDB, other resources

UniRef: Sequence clusters

UniRef100

UniRef90

UniRef50

High-quality manual annotation

Automatic annotation

Annotation using InterPro

Swiss-Prot

groups of related proteins

(same family or share domains)

TrEMBL

uncharacterised sequence

protein signatures

InterPro

automatic annotation pipeline

manually annotated sequence

Protein family classification

• Given a set of sequences, we usually want to know:

– what are these proteins; to what family do they belong?

– what is their function; how can we explain this in structural terms?

Protein family classification: BLAST (pairwise comparisons)

Limitations with Pairwise comparisons

• BLAST alignment of 2 proteins: 60S acidic ribosomal protein P0 from 2 species

Protein family classification: signature databases

• Alternatively, we can seek ‘patterns’ that will allow us to infer relationships with previously-characterised sequences

• This is the approach taken by ‘signature’ databases

Protein signatures

• More sensitive homology searches

• Each member database creates signatures using different methods and methodologies:

– manually-created sequence alignments

– automatic processes with some human input and correction

– entirely automatically

What are protein signatures?

Multiple sequence alignment

Protein family/domain

Build model

Search

Mature model

ITWKGPVCGLDGKTYRNECALL

AVPRSPVCGSDDVTYANECELK

UniProt

Significant match

Protein analysis

Member databases

Hidden Markov Models

Fingerprints

Profiles

Patterns

Sequence Clusters

Structural Domains

Functional annotation of families/domains

Prediction of conserved domains

Protein features (active sites…)

Diagnostic approaches (sequence-based)

METHODS

Single motif methods: Regex patterns (PROSITE)

Multiple motif methods: Identity matrices (PRINTS)

Full domain alignment methods: Profiles (Profile Library), HMMs (Pfam)

Patterns

Sequence alignment

Define pattern: Motif

Extract pattern sequences: xxxxxxxxxxxxxxxxxxxxxxxx

Build regular expression: C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C

Pattern signature: PS00000

Patterns

Patterns are mostly directed against functional residues: active sites, PTM, disulfide bridges, binding sites

Advantages

• Anchoring the match to the extremity of a sequence, e.g. <M-R-[DE]-x(2,4)-[ALT]-{AM}

• Some amino acids can be forbidden at specific positions, which can help to distinguish closely related subfamilies

• Short motif handling: a pattern with very little variability and with forbidden positions can produce significant matches, e.g. conotoxins (very short toxins with few conserved cysteines): C-{C}(6)-C-{C}(5)-C-C-x(1,3)-C-C-x(2,4)-C-x(3,10)-C

Drawbacks

• Simple but less powerful

>sp|P29197|CH60A_ARATH Chaperonin CPN60, mitochondrial OS=Arabidopsis thaliana
MYRFASNLASKARIAQNARQVSSRMSWSRNYAAKEIKFGVEARALMLKGVEDLADAVKVT
MGPKGRNVVIEQSWGAPKVTKDGVTVAKSIEFKDKIKNVGASLVKQVANATNDVAGDGTT
CATVLTRAIFAEGCKSVAAGMNAMDLRRGISMAVDAVVTNLKSKARMISTSEEIAQVGTI
SANGEREIGELIAKAMEKVGKEGVITIQDGKTLFNELEVVEGMKLDRGYTSPYFITNQKT
QKCELDDPLILIHEKKISSINSIVKVLELALKRQRPLLIVSEDVESDALATLILNKLRAG
IKVCAIKAPGFGENRKANLQDLAALTGGEVITDELGMNLEKVDLSMLGTCKKVTVSKDDT
VILDGAGDKKGIEERCEQIRSAIELSTSDYDKEKLQERLAKLSGGVAVLKIGGASEAEVG
EKKDRVTDALNATKAAVEEGILPGGGVALLYAARELEKLPTANFDQKIGVQIIQNALKTP
VYTIASNAGVEGAVIVGKLLEQDNPDLGYDAAKGEYVDMVKAGIIDPLKVIRTALVDAAS
VSSLLTTTEAVVVDLPKDESESGAAGAGMGGMGGMDY

EXAMPLE: PS00296; Chaperonins cpn60 signature (PATTERN)

A-[AS]-{L}-[DEQ]-E-{A}-{Q}-{R}-x-G(2)-[GA]

Pattern/motif in sequence regular expression

Prosite patterns
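The step from a PROSITE-style pattern to a regular expression can be sketched in a few lines of Python. This is a minimal illustration, not the tool PROSITE itself uses: the function name `prosite_to_regex` is made up for this example, and it only handles the syntax elements that appear on these slides (`x`, `[..]`, `{..}`, repeat counts, and the `<` / `>` anchors).

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Hypothetical helper for illustration; covers only x, [..], {..},
    (n), (n,m), '<' and '>'.
    """
    regex = []
    for element in pattern.strip().rstrip(".").split("-"):
        element = element.strip()
        if not element:
            continue
        prefix = suffix = ""
        if element.startswith("<"):      # '<' anchors the match to the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # '>' anchors the match to the C-terminus
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional repeat count.
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":                  # x = any residue
            core = "."
        elif core.startswith("{"):       # {..} = forbidden residues
            core = "[^" + core[1:-1] + "]"
        # [..] (allowed residues) and single letters are already valid regex.
        if low and high:
            core += "{" + low + "," + high + "}"
        elif low:
            core += "{" + low + "}"
        regex.append(prefix + core + suffix)
    return "".join(regex)

# The cpn60 pattern from the slide, searched against one line of the
# CH60A_ARATH sequence shown above.
CPN60 = "A-[AS]-{L}-[DEQ]-E-{A}-{Q}-{R}-x-G(2)-[GA]"
fragment = "EKKDRVTDALNATKAAVEEGILPGGGVALLYAARELEKLPTANFDQKIGVQIIQNALKTP"

cpn60_regex = prosite_to_regex(CPN60)
hit = re.search(cpn60_regex, fragment)
print(cpn60_regex)               # A[AS][^L][DEQ]E[^A][^Q][^R].G{2}[GA]
print(hit.group(), hit.start())  # AAVEEGILPGGG 14
```

The same helper turns the anchored example `<M-R-[DE]-x(2,4)-[ALT]-{AM}` into `^MR[DE].{2,4}[ALT][^AM]`, which shows how the `<` anchor and the forbidden-residue braces map onto ordinary regex syntax.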

Fingerprints

Sequence alignment

Define motifs: Motif 1, Motif 2, Motif 3

Extract motif sequences:

xxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxxxxxxxxxxxxxx

Weight matrices

Correct order

Correct spacing

Fingerprint signature (motifs 1, 2, 3): PR00000

The significance of motif context

order

interval

• Identify small conserved regions in proteins

• Several motifs characterise family

• Offer improved diagnostic reliability over single motifs by virtue of the biological context provided by motif neighbours
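The idea that a fingerprint only counts when its motifs appear in the correct order and with plausible spacing can be sketched as follows. This is a deliberately simplified illustration: the motifs here are toy regular expressions and the matching is greedy (first hit only), whereas PRINTS scores motifs with matrices. The function name, the example sequence, the motifs and the gap limits are all invented for the example.

```python
import re
from typing import List, Optional, Tuple

def match_fingerprint(
    sequence: str, motifs: List[str], max_gaps: List[int]
) -> Optional[List[Tuple[int, int]]]:
    """Return (start, end) for each motif if all occur in order and each
    interval between neighbouring motifs is within its limit, else None.

    Greedy sketch: only the first hit after the previous motif is tried.
    """
    positions: List[Tuple[int, int]] = []
    search_from = 0
    for index, motif in enumerate(motifs):
        hit = re.compile(motif).search(sequence, search_from)
        if hit is None:
            return None  # motif missing, or only present out of order
        if index > 0 and hit.start() - positions[-1][1] > max_gaps[index - 1]:
            return None  # interval to the previous motif is too long
        positions.append((hit.start(), hit.end()))
        search_from = hit.end()
    return positions

# Toy data (hypothetical): three motifs separated by short linkers.
toy_sequence = "MKCGHWAAAAPLDEYTTTTGNNRK"
toy_motifs = ["CGHW", "PLDEY", "GNNR"]

in_order = match_fingerprint(toy_sequence, toy_motifs, max_gaps=[6, 6])
wrong_order = match_fingerprint(toy_sequence, toy_motifs[::-1], max_gaps=[6, 6])
too_far = match_fingerprint(toy_sequence, toy_motifs, max_gaps=[2, 6])
print(in_order)     # [(2, 6), (10, 15), (19, 23)]
print(wrong_order)  # None
print(too_far)      # None
```

A single motif such as `GNNR` could easily occur by chance; requiring all three in the right order and at the right intervals is what gives the fingerprint its extra diagnostic reliability.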

PRINTS families are hierarchical: different motifs describe subfamilies

G protein-coupled receptors

rhodopsin-like

secretin-like

cAMP receptors

metabotropic glutamate receptors

etc

adenosine receptors

opsin receptors

dopamine receptors

somatostatin receptors

histamine receptors

etc

somatostatin receptor type 1

somatostatin receptor type 2

somatostatin receptor type 3

etc

Profiles & HMMs

Sequence alignment

Define coverage: entire domain or whole protein

Use entire alignment for domain or protein

Build model (models insertions and deletions)

Profile or HMM signature

Hidden Markov Models (HMM)

Models insertions and deletions

More flexible (can use partial alignments)

Profiles

Built using weight matrices

More sophisticated algorithm
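To make "built using weight matrices" concrete, here is a minimal sketch of a position-specific weight matrix built from an ungapped alignment and used to score a sequence. It assumes a uniform background over the 20 amino acids and a pseudocount of 1, both chosen only for illustration; real profile libraries use more elaborate weighting. The two aligned sequences are the ones shown on the "What are protein signatures?" slide. Because this toy matrix has no notion of gaps, it cannot model the insertions and deletions that HMMs handle.

```python
import math
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_weight_matrix(
    alignment: List[str], pseudocount: float = 1.0
) -> List[Dict[str, float]]:
    """Build log-odds scores per alignment column (ungapped sequences only)."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    matrix = []
    for column in range(length):
        residues = [seq[column] for seq in alignment]
        total = len(residues) + pseudocount * len(AMINO_ACIDS)
        matrix.append({
            aa: math.log((residues.count(aa) + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return matrix

def score(matrix: List[Dict[str, float]], segment: str) -> float:
    """Sum the column scores for a segment of the same length as the matrix."""
    return sum(column[aa] for column, aa in zip(matrix, segment))

alignment = [
    "ITWKGPVCGLDGKTYRNECALL",
    "AVPRSPVCGSDDVTYANECELK",
]
matrix = build_weight_matrix(alignment)

member_score = score(matrix, alignment[0])
unrelated_score = score(matrix, "W" * 22)
print(member_score > 0, unrelated_score < 0)
```

Residues seen in a column score above zero and unseen residues score below zero, so a family member accumulates a positive total while an unrelated sequence does not.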

PROSITE and HAMAP profiles: a functional annotation perspective

• PROSITE domains: high quality manually curated seeds (using biologically characterized UniProtKB/Swiss-Prot entries), documentation and annotation rules. Oriented toward functional domain discrimination.

• HAMAP families: manually curated bacterial, archaeal and plastid protein families (represented by profiles and associated rules), covering some highly conserved proteins and functions.

HMM databases

Sequence-based

• PIR SUPERFAMILY: families/subfamilies reflect the evolutionary relationship

• PANTHER: families/subfamilies model the divergence of specific functions

• TIGRFAM: microbial functional family classification

• PFAM: families & domains based on conserved sequence

• SMART: functional domain annotation

Structure-based

• SUPERFAMILY: models correspond to SCOP domains

• GENE3D: models correspond to CATH domains

Why we created InterPro

By uniting the member databases, InterPro capitalises on their individual strengths, producing a powerful diagnostic tool & integrated database

– to simplify & rationalise protein analysis

– to facilitate automatic functional annotation of uncharacterised proteins

– to provide concise information about the signatures and the proteins they match, including consistent names, abstracts (with links to original publications), GO terms and cross-references to other databases

InterPro entry

The InterPro entry: types

Family: proteins that share a common evolutionary origin, as reflected in their related functions, sequences or structure

Domain: distinct functional, structural or sequence units that may exist in a variety of biological contexts

Repeats: short sequences typically repeated within a protein

Sites: PTM, active site, binding site, conserved site

InterPro Entry

Adds extensive annotation

Links to other databases

Structural information and viewers

Groups similar signatures together


Quality control

Removes redundancy


Hierarchical classification

InterPro hierarchies: Families

FAMILIES can have parent/child relationships with other Families

Parent/Child relationships are based on:

• Comparison of protein hits

– child should be a subset of parent

– siblings should not have matches in common

• Existing hierarchies in member databases

• Biological knowledge of curators
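The protein-hit comparison behind parent/child relationships is essentially set arithmetic, which a short sketch can make explicit. The function name and the accessions below are placeholders invented for illustration; this is not InterPro's curation code, and in practice the other two criteria (member database hierarchies and curator knowledge) also apply.

```python
from itertools import combinations
from typing import Dict, List, Set

def check_hierarchy(
    parent_hits: Set[str], children: Dict[str, Set[str]]
) -> List[str]:
    """Report violations of the two protein-hit rules for a family hierarchy."""
    problems = []
    # Rule 1: every child should match a subset of the parent's proteins.
    for name, hits in children.items():
        extra = hits - parent_hits
        if extra:
            problems.append(
                f"{name} matches proteins outside the parent: {sorted(extra)}"
            )
    # Rule 2: siblings should not have matches in common.
    for (name_a, hits_a), (name_b, hits_b) in combinations(children.items(), 2):
        shared = hits_a & hits_b
        if shared:
            problems.append(f"{name_a} and {name_b} share matches: {sorted(shared)}")
    return problems

# Placeholder accessions (hypothetical), not real UniProtKB entries.
parent = {"P1", "P2", "P3", "P4"}
good_children = {"child_A": {"P1", "P2"}, "child_B": {"P3"}}
bad_children = {"child_A": {"P1", "P2"}, "child_B": {"P2", "P5"}}

good_report = check_hierarchy(parent, good_children)
bad_report = check_hierarchy(parent, bad_children)
print(good_report)  # []
for line in bad_report:
    print(line)
```

In the second case `child_B` both reaches outside the parent (`P5`) and overlaps its sibling (`P2`), so it would fail both checks.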

InterPro hierarchies: Domains

DOMAINS can have parent/child relationships with other domains

Domains and Families may be linked through Domain Organisation Hierarchy


InterPro Entry: adds extensive annotation

The Gene Ontology project provides a controlled vocabulary of terms for describing gene product characteristics

InterPro Entry: links to other databases

UniProt

KEGG ... Reactome ... IntAct ...

UniProt taxonomy

PANDIT ... MEROPS ... Pfam clans ...

Pubmed

InterPro Entry: structural information and viewers

PDB 3-D Structures

SCOP Structural domains

CATH Structural domain classification

Understanding signatures:

Non-overlapping signatures can be describing the same thing

Not always possible to use signature overlap to determine how family signatures are related

PF03157: 336 protein hits

PR00210: 331 protein hits

Two very different signatures both describing the same thing!

e.g. High molecular weight glutenins

PFAM shows domain is composed of two types of repeated sequence motifs

SUPERFAMILY shows the potential domain boundaries

www.ebi.ac.uk/interpro

Some signatures give us similar, but complementary information

Discontinuous Signatures Require Interpretation

1) Signature method

• e.g. PRINTS – discrete motifs

2) Duplicated domains

• e.g. SSF – duplication consisting of 2 domains with same fold

3) Repeated elements

• e.g. Kringle, WD40

4) Non-contiguous domains

• Structural domains can consist of non-contiguous sequence

Searching InterPro:

WHEN TO USE INTERPRO

Use InterPro to predict family, domain or active site information for a given protein or amino acid sequence.

You can search InterPro if you have

• a protein sequence

• a UniProtKB protein identifier

• a Gene Ontology term

• a protein structure code

• a general search term (keyword or short phrase)

and require further information regarding your protein of interest.

http://www.ebi.ac.uk/interpro/

Search tools include:

• Text Search

• InterProScan (sequence search)

• BioMart (builds queries)

Beta version: http://wwwdev.ebi.ac.uk/interpro/

InterPro Search

Search using:

• text

• protein ID

• InterPro ID

• GO term (e.g. ID: GO:0006915, Name: apoptosis)

Search results for GO:0006915 (apoptosis)

Search using: protein ID

InterPro Search Results

Structural data

Link to PDBe

Unintegrated signatures

Domains and sites

Family

Structural information

CATH and SCOP divide PDB structures into domains

Swiss-Model and ModBase can predict structure for regions not covered by PDB

Note that one domain is discontiguous

Searching InterPro:

InterProScan

InterProScan – Searching New Sequence


Paste in unknown sequence

Additional options

InterProScan New Search Results

Links to signature databases

Link to InterPro entry

Searching InterPro:

BioMart

• Large volumes of data can be queried efficiently

• The interface is shared with many other bioinformatics resources

• It allows federation with other databases

– PRIDE (mass spectrometry-derived proteins and peptides)

– REACTOME (biological pathways)

BioMart Search

BioMart allows more powerful and flexible queries

1) Choose Dataset

a. Choose InterPro BioMart

b. Choose InterPro entries or protein matches

2) Choose Filters: search specific entries, signatures or proteins (e.g. filter by specific proteins)

3) Choose Attributes: what results you want

4) Choose additional Dataset (optional): this is where you link results to PRIDE and Reactome

BioMart Search Results

User manual

HTML = web-formatted table

CSV = comma-separated values

TSV = tab-separated values

XLS = Excel spreadsheet

Click to view results

InterPro – the numbers

Our member databases all have their particular niche or focus... but InterPro is a combination of all their areas of expertise!

• InterPro 32.0: 21516 entries

101175 signatures covering 85.5% of UniProtKB

• Frequent releases – both protein and method updates

• 45 000 unique visitors per month

• The database has grown almost 10-fold in ~11 years

Caveats

InterPro is a predictive protein signature database. Small changes with a large impact may not be well represented.

• for example, inactive peptidases, such as Q8N3Z0, Q9W3H0

InterPro entries are based on signatures supplied to us by our member databases

• ...this means no signature, no entry!

We need your feedback!

• missing/additional references

• reporting problems

• requests

EBI support page.

Acknowledgements

InterPro Team:

Amaia Sangrador

David Lonsdale

Craig McAnulla

Matthew Fraser

Anthony Quinn

Maxim Scheremetjew

Phil Jones

Siew-Yit Yong

Alex Mitchell

Sebastien Pesseat

Prudence Mutowo

Sarah Hunter

Christopher Hunter