Post on 31-Dec-2015
EBI is an Outstation of the European Molecular Biology Laboratory.
Amaia Sangrador, InterPro curator, amaia@ebi.ac.uk
Introduction to InterPro
What is InterPro?
DIAGNOSTICS RESOURCE:
• InterPro uses signatures from several different databases (referred to as member databases) to predict information about proteins
• Provides functional analysis of proteins by classifying them into families and predicting domains and important sites
• Adds information about the signatures and the types of proteins they match
InterPro Consortium
Consortium of 11 major signature databases
Why do we need predictive annotation tools?
What is UniProt?
• Based on the original work on PIR, Swiss-Prot and TrEMBL
• Collaboration between EBI, SIB and PIR
• The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information.
• UniParc: sequence archive (current and obsolete sequences)
• UniMES: metagenomic and environmental sample sequences
• UniProtKB: protein knowledgebase
– UniProtKB/Swiss-Prot: reviewed (high-quality manual annotation)
– UniProtKB/TrEMBL: unreviewed (automatic annotation)
• UniRef: sequence clusters (UniRef100, UniRef90, UniRef50)
• Sources: EMBL/GenBank/DDBJ, Ensembl, RefSeq, PDB, other resources
Annotation using InterPro
• Swiss-Prot: manually annotated sequences are collected into groups of related proteins (same family or shared domains)
• Protein signatures are built from these groups and held in InterPro
• TrEMBL: uncharacterised sequences are matched against the signatures by the automatic annotation pipeline
Protein family classification
• Given a set of sequences, we usually want to know:
– what are these proteins; to what family do they belong?
– what is their function; how can we explain this in structural terms?
Protein family classification: BLAST (pairwise comparisons)
Limitations with pairwise comparisons
• BLAST alignment of 2 proteins: 60S acidic ribosomal protein P0 from 2 species
Protein family classification: signature databases
• Alternatively, we can seek ‘patterns’ that will allow us to infer relationships with previously-characterised sequences
• This is the approach taken by ‘signature’ databases
Protein signatures
• More sensitive homology searches
• Each member database creates signatures using different methods and methodologies:
– manually-created sequence alignments
– automatic processes with some human input and correction
– entirely automatic processes
What are protein signatures?
1) Multiple sequence alignment of a protein family/domain, e.g.
ITWKGPVCGLDGKTYRNECALL
AVPRSPVCGSDDVTYANECELK
2) Build model
3) Search UniProt
4) Significant match
5) Mature model
Protein analysis
Member databases
Signature types: hidden Markov models, fingerprints, profiles, patterns, sequence clusters, structural domains
• Functional annotation of families/domains
• Prediction of conserved domains
• Protein features (active sites…)
METHODS
Diagnostic approaches (sequence-based)
• Single motif methods: regex patterns (PROSITE)
• Multiple motif methods: identity matrices (PRINTS)
• Full domain alignment methods: profiles (Profile Library), HMMs (Pfam)
Patterns
1) Sequence alignment
2) Define pattern (motif)
3) Extract pattern sequences
4) Build regular expression, e.g. C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C
5) Pattern signature (e.g. PS00000)
Patterns
Patterns are mostly directed against functional residues: active sites, PTMs, disulfide bridges, binding sites
Advantages
• Anchoring the match to the extremity of a sequence, e.g. <M-R-[DE]-x(2,4)-[ALT]-{AM}
• Some amino acids can be forbidden at specific positions, which can help to distinguish closely related subfamilies
• Short motif handling: a pattern with very little variability and forbidden positions can produce significant matches, e.g. conotoxins (very short toxins with few conserved cysteines): C-{C}(6)-C-{C}(5)-C-C-x(1,3)-C-C-x(2,4)-C-x(3,10)-C
Drawbacks
• Simple but less powerful
Prosite patterns
EXAMPLE: PS00296; Chaperonins cpn60 signature (PATTERN)
Regular expression: A-[AS]-{L}-[DEQ]-E-{A}-{Q}-{R}-x-G(2)-[GA]
Pattern/motif in sequence:
>sp|P29197|CH60A_ARATH Chaperonin CPN60, mitochondrial OS=Arabidopsis thaliana
MYRFASNLASKARIAQNARQVSSRMSWSRNYAAKEIKFGVEARALMLKGVEDLADAVKVT
MGPKGRNVVIEQSWGAPKVTKDGVTVAKSIEFKDKIKNVGASLVKQVANATNDVAGDGTT
CATVLTRAIFAEGCKSVAAGMNAMDLRRGISMAVDAVVTNLKSKARMISTSEEIAQVGTI
SANGEREIGELIAKAMEKVGKEGVITIQDGKTLFNELEVVEGMKLDRGYTSPYFITNQKT
QKCELDDPLILIHEKKISSINSIVKVLELALKRQRPLLIVSEDVESDALATLILNKLRAG
IKVCAIKAPGFGENRKANLQDLAALTGGEVITDELGMNLEKVDLSMLGTCKKVTVSKDDT
VILDGAGDKKGIEERCEQIRSAIELSTSDYDKEKLQERLAKLSGGVAVLKIGGASEAEVG
EKKDRVTDALNATKAAVEEGILPGGGVALLYAARELEKLPTANFDQKIGVQIIQNALKTP
VYTIASNAGVEGAVIVGKLLEQDNPDLGYDAAKGEYVDMVKAGIIDPLKVIRTALVDAAS
VSSLLTTTEAVVVDLPKDESESGAAGAGMGGMGGMDY
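The pattern syntax used on these slides maps closely onto ordinary regular expressions. The sketch below is a minimal illustration, not PROSITE's own scanning software: the function name `prosite_to_regex` is hypothetical, and it handles only the elements shown in this talk (`x`, `[...]`, `{...}`, repeat counts, and the `<` / `>` anchors at the ends of a pattern). It searches one 60-residue line of the P29197 sequence for PS00296.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Simplified: anchors inside square brackets are not supported.
    """
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        element = element.strip()
        if not element:
            continue
        prefix = suffix = ""
        if element.startswith("<"):      # match anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # match anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split "core(n)" or "core(n,m)" into residue part and repeat count.
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":                  # any amino acid
            core = "."
        elif core.startswith("{"):       # forbidden residues
            core = "[^" + core[1:-1] + "]"
        parts.append(prefix + core + ("{" + repeat + "}" if repeat else "") + suffix)
    return "".join(parts)

CPN60 = "A-[AS]-{L}-[DEQ]-E-{A}-{Q}-{R}-x-G(2)-[GA]"
regex = prosite_to_regex(CPN60)

# One line of the P29197 sequence shown on the slide.
fragment = "EKKDRVTDALNATKAAVEEGILPGGGVALLYAARELEKLPTANFDQKIGVQIIQNALKTP"
hit = re.search(regex, fragment)

print(regex)          # A[AS][^L][DEQ]E[^A][^Q][^R].G{2}[GA]
print(hit.group(0))   # AAVEEGILPGGG
```

The same function turns the anchored example `<M-R-[DE]-x(2,4)-[ALT]-{AM}` into `^MR[DE].{2,4}[ALT][^AM]`.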
Fingerprints
1) Sequence alignment
2) Define motifs (Motif 1, Motif 2, Motif 3)
3) Extract motif sequences
4) Build weight matrices
5) Fingerprint signature (e.g. PR00000): motifs 1, 2, 3 in the correct order with the correct spacing
The significance of motif context: order and interval
• Identify small conserved regions in proteins
• Several motifs characterise family
• Offer improved diagnostic reliability over single motifs by virtue of the biological context provided by motif neighbours
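The idea of motif context (order and interval) can be sketched in a few lines. This is an illustration only, not the PRINTS scanning algorithm: PRINTS scores motifs with matrices, whereas here each motif is a plain regular expression, and the sequence, motifs and gap limits are all made-up toy values. The function name `match_fingerprint` is hypothetical.

```python
import re
from typing import List, Optional, Sequence, Tuple

def match_fingerprint(
    sequence: str,
    motifs: Sequence[str],
    max_gaps: Sequence[int],
) -> Optional[List[Tuple[int, int]]]:
    """Return (start, end) spans for every motif, or None if the fingerprint fails.

    motifs are regular expressions in N- to C-terminal order; max_gaps[i] is the
    largest number of residues allowed between motif i and motif i + 1.
    """
    compiled = [re.compile(m) for m in motifs]

    def extend(index: int, first: int, last: int):
        if index == len(compiled):
            return []
        # Try every allowed start position so an early false hit cannot block a later true one.
        for start in range(first, min(last, len(sequence)) + 1):
            hit = compiled[index].match(sequence, start)
            if hit is None:
                continue
            limit = hit.end() + max_gaps[index] if index < len(max_gaps) else len(sequence)
            rest = extend(index + 1, hit.end(), limit)
            if rest is not None:
                return [(hit.start(), hit.end())] + rest
        return None

    return extend(0, 0, len(sequence))

# Toy data (hypothetical sequence and motifs).
toy_sequence = "MKTCGHWLLAAPDEFGSSNNRWYKL"
toy_motifs = ["CGH", "D[EQ]F", "RW[YF]"]

in_context = match_fingerprint(toy_sequence, toy_motifs, max_gaps=[10, 10])
wrong_order = match_fingerprint(toy_sequence, ["D[EQ]F", "CGH", "RW[YF]"], max_gaps=[10, 10])
wrong_spacing = match_fingerprint(toy_sequence, toy_motifs, max_gaps=[3, 10])
print(in_context, wrong_order, wrong_spacing)
```

All three motifs occur in the toy sequence, but only the first call reports a fingerprint match; the other two fail because the order or the interval is wrong, which is the extra diagnostic power a single motif cannot provide.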
PRINTS families are hierarchical: different motifs describe subfamilies
• G protein-coupled receptors
– rhodopsin-like
   · adenosine receptors
   · opsin receptors
   · dopamine receptors
   · somatostatin receptors: somatostatin receptor type 1, type 2, type 3, etc
   · histamine receptors
   · etc
– secretin-like
– cAMP receptors
– metabotropic glutamate receptors
– etc
Profiles & HMMs
1) Sequence alignment
2) Define coverage: entire domain or whole protein
3) Use entire alignment for domain or protein
4) Build model (models insertions and deletions)
5) Profile or HMM signature
Hidden Markov Models (HMM)
Models insertions and deletions
More flexible (can use partial alignments)
Profiles
Built using weight matrices
More sophisticated algorithm
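To make "built using weight matrices" concrete, here is a minimal sketch of a position-specific scoring matrix. It is a deliberate simplification and not how PROSITE profiles or Pfam HMMs are actually built: it has no gap (insertion/deletion) handling, assumes a uniform background frequency and uses a flat pseudocount of 1. The two seed sequences are the alignment fragment shown earlier in these slides; the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment: Sequence[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Log-odds weight matrix, one dict per alignment column (ungapped alignment assumed)."""
    background = 1.0 / len(ALPHABET)          # assumption: uniform background
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return pssm

def score(pssm: List[Dict[str, float]], segment: str) -> float:
    """Sum of per-position weights for a segment as long as the matrix."""
    return sum(column[aa] for column, aa in zip(pssm, segment))

def best_window(pssm: List[Dict[str, float]], sequence: str) -> Tuple[float, int]:
    """Highest-scoring ungapped window and its 0-based offset."""
    width = len(pssm)
    return max(
        (score(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )

SEED = ["ITWKGPVCGLDGKTYRNECALL", "AVPRSPVCGSDDVTYANECELK"]
pssm = build_pssm(SEED)

query = "GGGG" + SEED[0] + "GGGG"             # toy query with the motif embedded
best_score, offset = best_window(pssm, query)
print(round(best_score, 2), offset)
```

Conserved columns (for example the cysteines) contribute the largest weights, so a window that lines up with the seed alignment scores far higher than any shifted window.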
PROSITE and HAMAP profiles: a functional annotation perspective
• PROSITE domains: high quality manually curated seeds (using biologically characterized UniProtKB/Swiss-Prot entries), documentation and annotation rules. Oriented toward functional domain discrimination.
• HAMAP families: manually curated bacterial, archaeal and plastid protein families (represented by profiles and associated rules), covering some highly conserved proteins and functions.
HMM databases
Sequence-based
• PIR SUPERFAMILY: families/subfamilies reflect the evolutionary relationship
• PANTHER: families/subfamilies model the divergence of specific functions
• TIGRFAM: microbial functional family classification
• PFAM: families & domains based on conserved sequence
• SMART: functional domain annotation
Structure-based
• SUPERFAMILY: models correspond to SCOP domains
• GENE3D: models correspond to CATH domains
Why we created InterPro
By uniting the member databases, InterPro capitalises on their individual strengths, producing a powerful diagnostic tool & integrated database
– to simplify & rationalise protein analysis
– to facilitate automatic functional annotation of uncharacterised proteins
– to provide concise information about the signatures and the proteins they match, including consistent names, abstracts (with links to original publications), GO terms and cross-references to other databases
InterPro entry
The InterPro entry: types
• Family: proteins that share a common evolutionary origin, as reflected in their related functions, sequences or structure
• Domain: distinct functional, structural or sequence units that may exist in a variety of biological contexts
• Repeats: short sequences typically repeated within a protein
• Sites: PTM, active site, binding site, conserved site
InterPro Entry
• Groups similar signatures together
• Quality control
• Removes redundancy
• Adds extensive annotation
• Links to other databases
• Structural information and viewers
Hierarchical classification
InterPro hierarchies: Families
FAMILIES can have parent/child relationships with other families
Parent/child relationships are based on:
• Comparison of protein hits
– child should be a subset of parent
– siblings should not have matches in common
• Existing hierarchies in member databases
• Biological knowledge of curators
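The two protein-hit rules above are plain set relationships, which a short sketch can make explicit. This is illustrative only and not InterPro's curation tooling: the entry names and protein identifiers are invented toy values, and the function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

def is_valid_child(parent_hits: Set[str], child_hits: Set[str]) -> bool:
    """A child entry should match a subset of the proteins matched by its parent."""
    return child_hits <= parent_hits

def overlapping_siblings(siblings: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Pairs of sibling entries that share at least one matched protein."""
    return [
        (name_a, name_b)
        for (name_a, hits_a), (name_b, hits_b) in combinations(sorted(siblings.items()), 2)
        if hits_a & hits_b
    ]

# Toy data: hypothetical entries and protein identifiers.
parent = {"prot1", "prot2", "prot3", "prot4", "prot5"}
children = {
    "child_A": {"prot1", "prot2"},
    "child_B": {"prot3", "prot4"},
    "child_C": {"prot4", "prot6"},   # breaks both rules
}

for name, hits in children.items():
    print(name, "subset of parent:", is_valid_child(parent, hits))
print("siblings with shared matches:", overlapping_siblings(children))
```

In practice such checks only flag candidates; as the slide notes, existing member database hierarchies and curator knowledge also feed into the final decision.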
InterPro hierarchies: Domains
DOMAINS can have parent/child relationships with other domains
Domains and Families may be linked through Domain Organisation Hierarchy
InterPro Entry: adds extensive annotation
The Gene Ontology project provides a controlled vocabulary of terms for describing gene product characteristics
InterPro Entry: links to other databases
• UniProt
• KEGG ... Reactome ... IntAct ...
• UniProt taxonomy
• PANDIT ... MEROPS ... Pfam clans ...
• PubMed
InterPro Entry: structural information and viewers
PDB 3-D Structures
SCOP Structural domains
CATH Structural domain classification
Understanding signatures:
Non-overlapping signatures can be describing the same thing
Not always possible to use signature overlap to determine how family signatures are related
PF03157: 336 protein hits; PR00210: 331 protein hits
Two very different signatures both describing the same thing!
e.g. High molecular weight glutenins
PFAM shows domain is composed of two types of repeated sequence motifs
SUPERFAMILY shows the potential domain boundaries
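The point that non-overlapping signatures can describe the same thing can be illustrated by comparing two signatures in two ways: by the residues they cover and by the set of proteins they hit. The sketch below uses invented toy matches (one span per protein, hypothetical identifiers), not the real PF03157 / PR00210 data, and the function names are hypothetical.

```python
from typing import Dict, Tuple

Span = Tuple[int, int]  # 1-based, inclusive residue positions

def residues_shared(a: Span, b: Span) -> int:
    """Number of residue positions covered by both spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

def compare_signatures(
    matches_a: Dict[str, Span],
    matches_b: Dict[str, Span],
) -> Tuple[float, int]:
    """Return (Jaccard index of the protein sets, proteins where the spans overlap).

    Assumes at least one signature has a match.
    """
    proteins_a, proteins_b = set(matches_a), set(matches_b)
    shared = proteins_a & proteins_b
    jaccard = len(shared) / len(proteins_a | proteins_b)
    overlapping = sum(
        1 for protein in shared
        if residues_shared(matches_a[protein], matches_b[protein]) > 0
    )
    return jaccard, overlapping

# Toy matches: both signatures hit protA and protB, but in different regions.
signature_1 = {"protA": (10, 60), "protB": (5, 50), "protC": (20, 70)}
signature_2 = {"protA": (100, 150), "protB": (90, 140), "protD": (1, 40)}

jaccard, overlapping = compare_signatures(signature_1, signature_2)
print(jaccard, overlapping)
```

Here the positional overlap is zero even though half of the combined protein set is shared, so looking at signature overlap on the sequence alone would miss the relationship.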
www.ebi.ac.uk/interpro
Some signatures give us similar, but complementary information
Discontinuous Signatures Require Interpretation
1) Signature method
• e.g. PRINTS: discrete motifs
2) Duplicated domains
• e.g. SSF: duplication consisting of 2 domains with same fold
3) Repeated elements
• e.g. Kringle, WD40
4) Non-contiguous domains
• Structural domains can consist of non-contiguous sequence
Searching InterPro:
WHEN TO USE INTERPRO
Use InterPro to predict family, domain or active site information for a given protein or amino acid sequence.
You can search InterPro if you have
• a protein sequence
• a UniProtKB protein identifier
• a Gene Ontology term
• a protein structure code
• a general search term (keyword or short phrase)
and require further information regarding your protein of interest.
http://www.ebi.ac.uk/interpro/
Search tools include:
• Text Search
• InterProScan (sequence search)
• BioMart (builds queries)
Beta version: http://wwwdev.ebi.ac.uk/interpro/
InterPro Search
wwwdev.ebi.ac.uk/interpro
Search using:
• text
• protein ID
• InterPro ID
• GO term, e.g. ID: GO:0006915, Name: apoptosis
Search results for GO:0006915 (apoptosis)
InterPro Search Results
Structural data
Link to PDBe
Unintegrated signatures
Domains and sites
Family
Structural information
CATH and SCOP divide PDB structures into domains
Swiss-Model and ModBase can predict structure for regions not covered by PDB
Note that one domain is discontiguous
Searching InterPro:
InterProScan
InterProScan – Searching New Sequence
wwwdev.ebi.ac.uk/interpro
Paste in unknown sequence
Additional options
InterProScan New Search Results
Links to signature databases
Link to InterPro entry
Searching InterPro:
BioMart
• Large volumes of data can be queried efficiently
• The interface is shared with many other bioinformatics resources
• It allows federation with other databases
– PRIDE (mass spectrometry-derived proteins and peptides)
– REACTOME (biological pathways)
BioMart Search
BioMart allows more powerful and flexible queries
1) Choose Dataset
a. Choose InterPro BioMart
b. Choose InterPro entries or protein matches
2) Choose Filters: search specific entries, signatures or proteins, e.g. filter by specific proteins
3) Choose Attributes: what results you want
4) Choose additional Dataset (optional): this is where you link results to PRIDE and Reactome
BioMart Search Results
User manual
• HTML = web-formatted table
• CSV = comma-separated values
• TSV = tab-separated values
• XLS = Excel spreadsheet
Click to view results
InterPro – the numbers
Our member databases all have their particular niche or focus... but InterPro is a combination of all their areas of expertise!
• InterPro 32.0: 21516 entries
101175 signatures covering 85.5% of UniProtKB
• Frequent releases – both protein and method updates
• 45 000 unique visitors per month
• The database has grown almost 10-fold in ~11 years
Caveats
InterPro is a predictive protein signature database. Small changes with a large impact may not be well represented.
• for example, inactive peptidases, such as Q8N3Z0, Q9W3H0
InterPro entries are based on signatures supplied to us by our member databases
• ... this means no signature, no entry!
We need your feedback!
• missing/additional references
• reporting problems
• requests
EBI support page.
Acknowledgements
InterPro Team:
Amaia Sangrador
David Lonsdale
Craig McAnulla
Matthew Fraser
Anthony Quinn
Maxim Scheremetjew
Phil Jones
Siew-Yit Yong
Alex Mitchell
Sebastien Pesseat
Prudence Mutowo
Sarah Hunter
Christopher Hunter