EBI is an Outstation of the European Molecular Biology Laboratory. Amaia Sangrador InterPro curator...

Introduction to InterPro

Amaia Sangrador, InterPro curator, [email protected]

EBI is an Outstation of the European Molecular Biology Laboratory.


  • What is InterPro?

    DIAGNOSTICS RESOURCE: InterPro uses signatures from several different databases (referred to as member databases) to predict information about proteins

    Provides functional analysis of proteins by classifying them into families and predicting domains and important sites

    Adds information about the signatures and the types of proteins they match

  • InterPro Consortium

    Consortium of 11 major signature databases

  • Why do we need predictive annotation tools?

  • What is UniProt?

    Based on the original work on PIR, Swiss-Prot and TrEMBL

    Collaboration between EBI, SIB and PIR

    The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information.

  • UniParc - Sequence archive: current and obsolete sequences

    UniMES: metagenomic and environmental sample sequences

    UniProtKB - Protein knowledgebase

    UniProtKB/Swiss-Prot: reviewed, high-quality manual annotation

    UniProtKB/TrEMBL: unreviewed, automatic annotation

    UniRef - Sequence clusters: UniRef100, UniRef90, UniRef50

    Sources: EMBL/GenBank/DDBJ, Ensembl, RefSeq, PDB, other resources

  • Annotation using InterPro

    [Figure labels: Swiss-Prot; manually annotated sequence; example nucleotide sequence CGCGCCTGTACGCTGAACGCTCGTGACGTGTAGTGCGCG, shown twice]

  • Protein family classification

    Given a set of sequences, we usually want to know:

    what are these proteins; to what family do they belong?

    what is their function; how can we explain this in structural terms?

  • Protein family classification: BLAST (pairwise comparisons)

  • Protein family classification: BLAST

  • Limitations with pairwise comparisons

    BLAST alignment of 2 proteins: 60S acidic ribosomal protein P0 from 2 species

  • Limitations with Pairwise comparisons

  • Protein family classification: signature databases

    Alternatively, we can seek patterns that will allow us to infer relationships with previously-characterised sequences

    This is the approach taken by signature databases

  • Protein signatures

    More sensitive homology searches

    Each member database creates signatures in a different way:

    from manually-created sequence alignments

    by automatic processes with some human input and correction

    entirely automatically

  • What are protein signatures?

    [Diagram labels: protein family/domain; build model; search (UniProt); significant match; mature model; protein analysis]

    Example aligned sequences: ITWKGPVCGLDGKTYRNECALL and AVPRSPVCGSDDVTYANECELK

  • Member databases

    METHODS: Hidden Markov Models, Fingerprints, Profiles, Patterns, Sequence Clusters, Structural Domains

    Functional annotation of families/domains

    Prediction of conserved domains

    Protein features (active sites)

  • Diagnostic approaches (sequence-based)

    Single motif methods: Regex patterns (PROSITE)

    Multiple motif methods: Identity matrices (PRINTS)

    Full domain alignment methods: Profiles (Profile Library), HMMs (Pfam)

  • Patterns

    Sequence alignment

  • Patterns

    Patterns are mostly directed against functional residues: active sites, PTM, disulfide bridges, binding sites

    Anchoring the match to the extremity of a sequence

  • PROSITE patterns

    Pattern/motif in sequence: regular expression

    EXAMPLE: PS00296; Chaperonins cpn60 signature (PATTERN)

    A-[AS]-{L}-[DEQ]-E-{A}-{Q}-{R}-x-G(2)-[GA]

    >sp|P29197|CH60A_ARATH Chaperonin CPN60, mitochondrial OS=Arabidopsis thaliana
    MYRFASNLASKARIAQNARQVSSRMSWSRNYAAKEIKFGVEARALMLKGVEDLADAVKVT
    MGPKGRNVVIEQSWGAPKVTKDGVTVAKSIEFKDKIKNVGASLVKQVANATNDVAGDGTT
    CATVLTRAIFAEGCKSVAAGMNAMDLRRGISMAVDAVVTNLKSKARMISTSEEIAQVGTI
    SANGEREIGELIAKAMEKVGKEGVITIQDGKTLFNELEVVEGMKLDRGYTSPYFITNQKT
    QKCELDDPLILIHEKKISSINSIVKVLELALKRQRPLLIVSEDVESDALATLILNKLRAG
    IKVCAIKAPGFGENRKANLQDLAALTGGEVITDELGMNLEKVDLSMLGTCKKVTVSKDDT
    VILDGAGDKKGIEERCEQIRSAIELSTSDYDKEKLQERLAKLSGGVAVLKIGGASEAEVG
    EKKDRVTDALNATKAAVEEGILPGGGVALLYAARELEKLPTANFDQKIGVQIIQNALKTP
    VYTIASNAGVEGAVIVGKLLEQDNPDLGYDAAKGEYVDMVKAGIIDPLKVIRTALVDAAS
    VSSLLTTTEAVVVDLPKDESESGAAGAGMGGMGGMDY
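
The PS00296 example can be tried out with a small sketch that rewrites a PROSITE-style pattern as a Python regular expression and searches a sequence with it. This is an illustration only, not the scanning code used by PROSITE or InterPro: the function name `prosite_to_regex` is made up for this example, only the basic syntax is handled (`x`, `[..]`, `{..}`, `(n)`, `(n,m)`, `<`, `>`), and the sequence is a fragment of the P29197 entry shown on the slide.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a basic PROSITE pattern into Python regex syntax.

    Hypothetical helper for illustration; anchors inside brackets
    (e.g. "[G>]") are not handled.
    """
    regex = pattern.strip().rstrip(".").replace("-", "")
    regex = regex.replace("{", "[^").replace("}", "]")   # {L}  -> [^L]  (any residue except L)
    regex = regex.replace("(", "{").replace(")", "}")    # G(2) -> G{2}  (repetition)
    regex = regex.replace("x", ".")                      # x    -> .     (any residue)
    regex = regex.replace("<", "^").replace(">", "$")    # sequence start / end anchors
    return regex


PS00296 = "A-[AS]-{L}-[DEQ]-E-{A}-{Q}-{R}-x-G(2)-[GA]"

# Fragment of sp|P29197|CH60A_ARATH copied from the slide above.
FRAGMENT = "EKKDRVTDALNATKAAVEEGILPGGGVALLYAARELEKLPTANFDQKIGVQIIQNALKTP"

match = re.search(prosite_to_regex(PS00296), FRAGMENT)
if match:
    print(match.group(), "at fragment offset", match.start())
```

A pattern either matches or it does not, which is why patterns suit short, strictly conserved sites better than whole divergent domains.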

  • Fingerprints

    Sequence alignment

  • The significance of motif context

    Identify small conserved regions in proteins

    Several motifs characterise family

    Offer improved diagnostic reliability over single motifs by virtue of the biological context provided by motif neighbours

  • PRINTS families are hierarchical

    Different motifs describe subfamilies

    G protein-coupled receptors

  • Profiles & HMMs

    Sequence alignment

  • Hidden Markov Models (HMM)

    Models insertions and deletions

    More flexible (can use partial alignments)

    Profiles

    Built using weight matrices

    More sophisticated algorithm

  • PROSITE and HAMAP profiles: a functional annotation perspective

    PROSITE domains: high quality manually curated seeds (using biologically characterized UniProtKB/Swiss-Prot entries), documentation and annotation rules. Oriented toward functional domain discrimination.

    HAMAP families: manually curated bacterial, archaeal and plastid protein families (represented by profiles and associated rules), covering some highly conserved proteins and functions.

  • HMM databases

    Sequence-based

    PIR SUPERFAMILY: families/subfamilies reflect the evolutionary relationship

    PANTHER: families/subfamilies model the divergence of specific functions

    TIGRFAM: microbial functional family classification

    Pfam: families & domains based on conserved sequence

    SMART: functional domain annotation

    Structure-based

    SUPERFAMILY: models correspond to SCOP domains

    Gene3D: models correspond to CATH domains

  • Why we created InterPro

    By uniting the member databases, InterPro capitalises on their individual strengths, producing a powerful diagnostic tool & integrated database

    to simplify & rationalise protein analysis

    to facilitate automatic functional annotation of uncharacterised proteins

    to provide concise information about the signatures and the proteins they match, including consistent names, abstracts (with links to original publications), GO terms and cross-references to other databases

  • InterPro entry

  • InterPro entry

  • The InterPro entry: types

  • InterPro Entry

    Quality control

    Removes redundancy

  • InterPro Entry Hierarchical classification

  • InterPro hierarchies: Families

    FAMILIES can have parent/child relationships with other Families

    Parent/Child relationships are based on:

    Comparison of protein hits: child should be a subset of parent; siblings should not have matches in common

    Existing hierarchies in member databases

    Biological knowledge of curators
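
The protein-hit comparison can be sketched with plain sets. This is a toy illustration under stated assumptions, not InterPro's curation tooling: `check_hierarchy` is a hypothetical helper, the accessions are placeholders, and the check is a strict subset test with no percentage thresholds.

```python
from itertools import combinations


def check_hierarchy(parent_hits, children_hits):
    """Return a list of rule violations for one parent and its child entries.

    parent_hits   -- set of protein accessions matched by the parent entry
    children_hits -- dict mapping child entry name -> set of matched accessions
    (Hypothetical helper; strict set comparison, no tolerance thresholds.)
    """
    problems = []
    # Rule 1: each child should be a subset of the parent.
    for name, hits in sorted(children_hits.items()):
        outside = hits - parent_hits
        if outside:
            problems.append(f"{name}: {len(outside)} protein(s) not matched by the parent")
    # Rule 2: siblings should not have matches in common.
    for first, second in combinations(sorted(children_hits), 2):
        shared = children_hits[first] & children_hits[second]
        if shared:
            problems.append(f"{first}/{second}: {len(shared)} shared protein(s)")
    return problems


# Placeholder accessions, made up for this example.
parent = {"PROT_A", "PROT_B", "PROT_C", "PROT_D"}
children = {"child_1": {"PROT_A", "PROT_B"}, "child_2": {"PROT_C"}}
print(check_hierarchy(parent, children) or "hierarchy is consistent")
```

A check like this only covers the first criterion; existing member database hierarchies and curator knowledge still decide whether a relationship is biologically meaningful.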

  • InterPro hierarchies: Domains

    DOMAINS can have parent/child relationships with other domains

  • Domains and Families may be linked through Domain Organisation

    Hierarchy

  • InterPro Entry

  • InterPro Entry

    The Gene Ontology project provides a controlled vocabulary of terms for describing gene product characteristics

  • InterPro Entry

    UniProt

    KEGG ... Reactome ... IntAct ...

    UniProt taxonomy

    PANDIT ... MEROPS ... Pfam clans ...

    PubMed

  • InterPro Entry

    PDB: 3-D structures

    SCOP: structural domains

    CATH: structural domain classification

  • Understanding signatures:

  • Non-overlapping signatures can be describing the same thing

    Two very different signatures both describing the same thing!

    e.g. High molecular weight glutenins

  • Some signatures give us similar, but complementary information

    www.ebi.ac.uk/interpro

  • Discontinuous Signatures Require Interpretation

    1) Signature method

    2) Duplicated domains

    3) Repeats

    4) Non-contiguous domains

  • Searching InterPro:

  • WHEN TO USE INTERPRO

    Use InterPro to predict family, domain or active site information for a given protein or amino acid sequence.

    You can search InterPro if you have

    a protein sequence

    a UniProtKB protein identifier

    a Gene Ontology term

    a protein structure code

    a general search term (keyword, short phrase)

    and require further information regarding your protein of interest.

  • http://www.ebi.ac.uk/interpro/

    Search tools include:

    Text Search

    InterProScan (sequence search)

    BioMart (builds queries)

    Beta version: http://wwwdev.ebi.ac.uk/interpro/

  • InterPro Search

    Search using: text, protein ID, InterPro ID, GO term

  • InterPro Search

  • InterPro Search

    protein ID

  • InterPro Search Results

    [Figure labels: Family; Domains and sites; Unintegrated signatures; Structural data; Link to PDBe]

  • Structural information

    CATH and SCOP divide PDB structures into domains

    Swiss-Model and ModBase can predict structure for regions not covered by PDB

    Note that one domain is discontiguous

  • Searching InterPro:

    InterProScan

  • InterProScan: Searching New Sequence

    Paste in unknown sequence

    Additional options

  • InterProScan: New Search Results

    Links to signature databases

    Link to InterPro entry

  • Searching InterPro:

    BioMart

  • BioMart Search

    BioMart allows more powerful and flexible queries

    Large volumes of data can be queried efficiently

    The interface is shared with many other bioinformatics resources

    It allows federation with other databases

    PRIDE (mass spectrometry-derived proteins and peptides)

    REACTOME (biological pathways)

  • BioMart Search

    Choose Dataset

    Choose InterPro BioMart

    Choose InterPro entries or protein matches

  • BioMart Search

    Choose Filters

    Search specific entries, signatures or proteins

  • BioMart Search

    Choose Filters

    e.g. Filter by specific proteins

  • BioMart Search

    Choose Attributes

    What results you want

  • BioMart Search

    Choose additional Dataset (optional)

    This is where you link results to PRIDE and Reactome

  • BioMart Search Results

    User manual

    HTML = web-formatted table

    CSV = comma-separated values

    TSV = tab-separated values

    XLS = Excel spreadsheet

    Click to view results

  • InterPro: the numbers

    Our member databases all have their particular niche or focus...

    ...but InterPro is a combination of all their areas of expertise!

    InterPro 32.0:

    21516 entries

    101175 signatures

    covering 85.5% of UniProtKB

    Frequent releases: both protein and method updates

    45 000 unique visitors per month

    The database has grown almost 10-fold in ~11 years

  • Caveats

    We need your feedback!

    missing/additional references

    reporting problems

    requests

    InterPro is a predictive protein signature database. Small changes with a large impact may not be well represented.

    for example, inactive peptidases, such as Q8N3Z0, Q9W3H0

    InterPro entries are based on signatures supplied to us by our member databases

    ....this means no signature, no entry!

    EBI support page.

  • Acknowledgements

    InterPro Team:

    NOTE: Talk more about UniProt, for example how an entry looks and the information found there. Ask Sandra.

    InterPro uses signatures from several different databases (referred to as member databases) to predict information about proteins, such as possible function and the potential location of functionally important sites and domains. Each member database creates signatures in different ways: some groups build them from manually-created sequence alignments, some use automatic processes with some human input and correction, and others build their signatures entirely automatically. The signatures are represented using a variety of different model types (HMMs, Profiles, Regular Expressions, etc.).

    The member databases all have their own particular niche or focus; at InterPro we aim to be a combination of their individual strengths. To do this we integrate signatures from the member databases that represent the same protein family, domain or site into a single InterPro entry. We check the biological accuracy of the individual signatures and add concise information about the signatures and the types of proteins they match, including consistent names, descriptive abstracts (with links to original publications) and GO terms.

    This is a good summary of how Profiles and HMMs are made.

    Take a multiple sequence alignment and either use the entire alignment (family model) or define the domain of interest (domain model).

    If a domain model, then extract the sequence from the alignment defining the domain.

    Use the alignment to build a Profile matrix or an HMM.

    A signature match is either non-positional and defines family membership, or it defines the position of the domain on the protein.

    The view of a Profile or HMM hit in InterPro.
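
The "build a Profile matrix" step can be illustrated with a minimal sketch. It assumes a gap-free alignment, a uniform background frequency and a simple pseudocount, none of which come from the notes; real profile and HMM builders use sequence weighting, substitution-based priors and, for HMMs, explicit insert and delete states. The two aligned sequences are the ones shown on the "What are protein signatures?" slide, and the function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Turn a gap-free alignment into one dict of log-odds scores per column."""
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("aligned sequences must all have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequency
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return profile


def best_hit(profile, sequence):
    """Slide the profile along the sequence without gaps; return (score, start)."""
    width = len(profile)
    hits = [
        (sum(col.get(aa, 0.0) for col, aa in zip(profile, sequence[i:i + width])), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(hits) if hits else (float("-inf"), -1)


alignment = ["ITWKGPVCGLDGKTYRNECALL", "AVPRSPVCGSDDVTYANECELK"]
profile = build_profile(alignment)
score, start = best_hit(profile, "MK" + alignment[0] + "AA")
print(f"best window starts at {start} with score {score:.2f}")
```

The start position of the best-scoring window is what gives a positional match for a domain model; for a family model only the score against a cut-off would matter.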

    Mention why this needs to be InterPro specific: we have to cover a lot of different member database definitions.

    TALK MORE ABOUT HOW WE DO GO MAPPING IN INTERPRO

    C/F only between Family and Domain, not domain/domain or family/family.

    Domains usually have well defined boundaries; it's not always easy to tell for family signatures. Just because a signature does not cover the full length of a protein does not mean it's not diagnosing a family. So signature overlap is not an ideal way of creating

    Signature match length not fixed

    The protein/full model matches made by the Child entry must be a complete (>75%) subset of the Parent entry.

    All signatures in the Child entry must overlap by >50% of their sequence (when comparing individual full matches) with all signatures in the Parent entry (in both directions) in >90% of cases.

    The Parent entry must make all the relationships of the Child entry, or at least be present (>90% of cases) when the Child signatures are present in the case of Contains/Found In and Overlapping relationships.

    There must be no signature adjacent (i.e. >50% overlap either direction) to the Child that is covered by the Parent in >90% of cases.

    There must be no signature adjacent (i.e. >50% overlap either direction) to the Parent that is covered by the Child in >90% of cases.

    Note: adjacent means the signatures overlap by