[email protected] Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics
![Page 1: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/1.jpg)
Protein sequence databases: dissemination of protein knowledge
http://education.expasy.org/cours/UniProt/
![Page 2: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/2.jpg)
Menu
Introduction
Nucleic acid sequence databases ENA, GenBank, DDBJ
Protein sequence databases
UniProt databases (UniProtKB)
NCBI protein databases (NCBInr, RefSeq…)
![Page 3: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/3.jpg)
Protein sequences are the fundamental determinants
of biological structure and function.
http://www.ncbi.nlm.nih.gov/protein
![Page 4: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/4.jpg)
Challenge
Flood of data -> it needs to be stored, curated and made available for analysis and knowledge discovery
![Page 5: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/5.jpg)
Challenge (1)
Many different protein sequence databases
TrEMBL, GenPept, Swiss-Prot, RefSeq, PRF, Ensembl, CCDS, UniParc, UniProtKB, PDB, (PIR), (IPI), UniMES, TPA, NCBInr
![Page 6: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/6.jpg)
![Page 7: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/7.jpg)
These identifiers are all pointing to the same TP53 protein sequence (p53) !
P04637, NP_000537, ENSG00000141510, CCDS11118, UPI000002ED67, IPI00025087, HIT000320921, XP_001172091, DD954676, JT0436, etc.
Challenge (1bis)
Different protein sequence databases : many identifiers for the same protein sequence
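A first practical step when facing such a mix of identifiers is to recognize which database each one probably comes from. The sketch below is illustrative only: the regular expressions encode common accession formats as assumptions (the UniProtKB pattern follows the accession format published by UniProt), the name `guess_source` is hypothetical, and anything unrecognized is reported as unknown rather than guessed. Recognizing the format does not by itself prove that two identifiers describe the same sequence; that still requires a cross-reference lookup.

```python
import re

# Assumed, non-exhaustive accession formats; order matters because the
# most specific prefixes are tested first.
PATTERNS = [
    ("UniParc", re.compile(r"UPI[0-9A-F]{10}")),
    ("RefSeq protein", re.compile(r"[NXY]P_\d+")),
    ("Ensembl gene", re.compile(r"ENSG\d{11}")),
    ("CCDS", re.compile(r"CCDS\d+")),
    ("IPI", re.compile(r"IPI\d{8}")),
    ("UniProtKB", re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}")),
]


def guess_source(identifier):
    """Return the name of the first pattern matching the whole identifier."""
    for name, pattern in PATTERNS:
        if pattern.fullmatch(identifier):
            return name
    return "unknown"


if __name__ == "__main__":
    for ident in ["P04637", "NP_000537", "ENSG00000141510", "CCDS11118",
                  "UPI000002ED67", "IPI00025087", "JT0436"]:
        print(ident, "->", guess_source(ident))
```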
![Page 8: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/8.jpg)
A HUPO test sample study reveals common problems in mass spectrometry–based proteomics
PubMed 19448641 (2009)
• A single mass spectrometry experiment can identify up to about 4000 proteins (15’000 peptides)
• Protein databases vary greatly in terms of their curation, completeness and comprehensiveness (searching different protein databases can give different results).
• Only 7 labs (out of 27) were able to identify the 20 human proteins present in a sample, mainly because the search engines used cannot distinguish among different identifiers for the same protein…
![Page 9: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/9.jpg)
Nucleic Acids Res. 2010 ; 38(Database issue): D633–D639.
‘Examining links from the perspective of PubMed, we found that only a small fraction of published articles are linked to human genes (Entrez Gene).’
Challenge (3)
(protein) sequence annotation
![Page 10: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/10.jpg)
Journals do not (SHOULD NOT) accept a paper dealing with a nucleic acid sequence if the ENA/GenBank/DDBJ accession (AC) number is not available…
‘journal publishers generally require deposition prior to publication so that an accession number can be included in the paper.’ http://www.ncbi.nlm.nih.gov/books/bv.fcgi?highlight=refseq&rid=handbook.section.GenBank_ASM#GenBank_ASM.RefSeq
…this is not the case for protein sequences
!!! and no longer the case for many genomes !!!
![Page 11: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/11.jpg)
Protein sequence origin…
![Page 12: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/12.jpg)
More than 99 % of the protein sequences are derived from the translation of nucleotide sequences
(genomes and/or cDNAs)
sequencing quality
coding sequence (CDS) annotation accuracy
gene prediction quality
![Page 13: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/13.jpg)
![Page 14: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/14.jpg)
… ~ 2500 genomes sequenced (single organism, varying sizes, including viruses)
… ~ 5’000 ongoing genome sequencing projects
![Page 15: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/15.jpg)
http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html
http://www.ncbi.nlm.nih.gov/genomes/GenomesHome.cgi?taxid=10239&hopt=stat
~ 50-100 genomes/month
+ ~2’500 viral genomes
=> Total ~ 5’000 genomes
![Page 16: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/16.jpg)
… ~ 2500 genomes sequenced (single organism, varying sizes, including viruses)
… ~ 5’000 ongoing genome sequencing projects
… cDNA sequencing projects (ESTs or cDNAs)
… metagenome sequencing projects = environmental samples: multiple ‘unknown’ organisms
![Page 17: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/17.jpg)
Metagenomics
study of genetic material recovered directly from environmental samples
• Global Ocean Sampling (C. Venter): 1 ml of sea water contains 1 million bacteria and 10 million viruses
• Whale fall (AAFZ00000000.1)
• Soil, sand beach, New York air, …
• Human fluids, mouse gut (millions of bacteria within the human body)
• Water treatment industry…
• Lists of projects: http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi
Venter’s Sorcerer II
![Page 18: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/18.jpg)
… ~ 2500 genomes sequenced (single organism, varying sizes)
… ~ 5’000 ongoing genome sequencing projects
… cDNA sequencing projects (ESTs or cDNAs)
… metagenome sequencing projects
… personal human genomes
new-generation sequencers: Illumina: 25 billion bp/day
![Page 19: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/19.jpg)
http://www.youtube.com/watch?v=mVZI7NBgcWM
…2700 genomes in 2010, 30’000 genomes in 2011 ?
3’000’000’000 $ (public consortium, 2000)
300’000’000 $ (Celera, 2000)
70’000’000 $ (diploid, 2007)
2’000’000 $ (2007)
2010
![Page 20: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/20.jpg)
But… we now know that his apoE allele is the one associated with an increased risk of Alzheimer's disease and that he has the ‘blue eye’ allele…
![Page 21: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/21.jpg)
apoE gene (Ensembl genome browser)
![Page 22: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/22.jpg)
New projects (Homo sapiens)
• 1000 genomes (first publication, October 2010)
• Multiple personal genomes (sexual cells, lymphoid cells, cancer cells…)
• International cancer genome consortium (www.icgc.org).
They look at the most common cancers and for each they sequence the genome of 500 patients with cancer and 500 healthy individuals….
How to define the human proteome ??? Which sequences ???
![Page 23: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/23.jpg)
How many protein-coding genes in the end?
![Page 24: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/24.jpg)
190'500'025'042
1st estimate: ~30 million species (1.8 million named)
2nd estimate:
20 million bacteria/archaea x 4'000 genes
1 million protists x 6'000 genes
5 million insects x 14'000 genes
2 million fungi x 6'000 genes
0.5 million plants x 20'000 genes
0.5 million molluscs, worms, arachnids, etc. x 20'000 genes
0.1 million vertebrates x 25'000 genes
The calculation: 2x10^7 x 4000 + 1x10^6 x 6000 + 5x10^6 x 14000 + 2x10^6 x 6000 + 5x10^5 x 20000 + 5x10^5 x 20000 + 1x10^5 x 25000
+ 20000 (Craig Venter) + 42 (Douglas Adams) + …
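The arithmetic behind the second estimate can be reproduced with a short sketch. The species counts and genes-per-species figures are the ones listed on the slide; the names `ESTIMATES` and `total_genes` are illustrative, and the tongue-in-cheek extra terms (Craig Venter, Douglas Adams, …) are deliberately left out.

```python
# (group, estimated number of species, assumed genes per species),
# taken from the slide; these are rough estimates, not measurements.
ESTIMATES = [
    ("bacteria/archaea", 20_000_000, 4_000),
    ("protists", 1_000_000, 6_000),
    ("insects", 5_000_000, 14_000),
    ("fungi", 2_000_000, 6_000),
    ("plants", 500_000, 20_000),
    ("molluscs, worms, arachnids, etc.", 500_000, 20_000),
    ("vertebrates", 100_000, 25_000),
]


def total_genes(estimates):
    """Sum species x genes-per-species over all groups."""
    return sum(species * genes for _, species, genes in estimates)


if __name__ == "__main__":
    for group, species, genes in ESTIMATES:
        print(f"{group:35s} {species * genes:>18,}")
    print(f"{'total':35s} {total_genes(ESTIMATES):>18,}")  # 190,500,000,000
```

The listed groups alone give about 190.5 billion protein-coding genes, which is where the "about 190 billion" order of magnitude comes from.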
![Page 25: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/25.jpg)
About 190 billion proteins (?)
About 14.0 million ‘known’ protein sequences in 2011 (from ~300’000 species)
More than 99 % of the protein sequences are derived from the translation of nucleotide sequences
Less than 1 % direct protein sequencing (Edman, MS/MS…)
-> It is important that protein database users know where the protein sequence comes from…
![Page 26: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/26.jpg)
cDNAs, ESTs, genes, genomes, …
Nucleic acid sequence databases
The ideal life of a sequence …
Protein sequence databases
![Page 27: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/27.jpg)
Menu
Introduction
Nucleic acid sequence databases ENA/GenBank, DDBJ
Protein sequence databases
UniProt databases (UniProtKB)
NCBI protein databases
![Page 28: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/28.jpg)
ENA (EMBL-Bank): European Nucleotide Archive
GenBank
DDBJ: DNA Data Bank of Japan
Archive of primary sequence data and corresponding annotation submitted by the laboratories that did the sequencing.
![Page 29: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/29.jpg)
http://www.insdc.org/
ENA/GenBank/DDBJ
![Page 30: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/30.jpg)
• Serve as archives: ‘nothing goes out’
• Contain all public sequences derived from:
– Genome projects (> 80 % of entries)
– Sequencing centers (cDNAs, ESTs…)
– Individual scientists (15 % of entries)
– Patent offices (e.g. European Patent Office, EPO)
• Currently: ~210 x 10^6 sequences, ~320 x 10^9 bp
• Sequences from > 300’000 different species
ENA/GenBank/DDBJ
![Page 31: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/31.jpg)
Archival databases:
- Can be very redundant for some loci
- Sequence records are owned by the original submitter and cannot be altered by a third party (except TPA)
![Page 32: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/32.jpg)
taxonomy
Cross-references
references
accession number
![Page 33: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/33.jpg)
CDS annotation
(Prediction or experimentally determined)
sequence
CDS: CoDing Sequence
(proposed by submitters)
![Page 34: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/34.jpg)
The hectic life of a sequence …
cDNAs, ESTs, genes, genomes, …
ENA, GenBank, DDBJ
Data not submitted to public databases, delayed or cancelled…
with or without annotated CDS
provided by authors
CDS: CoDing Sequence
portion of DNA/RNA translated into protein (from Met to STOP)
Experimentally proven or derived from gene prediction
!!! not so well documented !!!
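To make the "from Met to STOP" definition concrete, here is a minimal sketch of how an annotated CDS is turned into a protein sequence. It assumes the standard genetic code and an ATG start codon; real entries may use other genetic codes or alternative start codons, which this sketch does not handle. The names `CODON_TABLE` and `translate_cds` are illustrative.

```python
# Standard genetic code, built from the usual compact TCAG-ordered string.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate_cds(cds):
    """Translate a CDS from its initial Met codon to the first in-frame STOP."""
    cds = cds.upper().replace("U", "T")  # accept RNA as well as DNA
    if not cds.startswith("ATG"):
        raise ValueError("CDS does not start with an ATG (Met) codon")
    protein = []
    for pos in range(0, len(cds) - 2, 3):
        # Codons containing ambiguous bases (e.g. N) are translated as X.
        residue = CODON_TABLE.get(cds[pos:pos + 3], "X")
        if residue == "*":
            return "".join(protein)
        protein.append(residue)
    raise ValueError("no in-frame STOP codon found")


if __name__ == "__main__":
    print(translate_cds("ATGGCCAAATAA"))  # MAK
```

Because the protein is derived mechanically from the annotated coordinates, any error in the start position, the reading frame or the exon boundaries propagates directly into the protein databases.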
![Page 35: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/35.jpg)
CoDing Sequence
Alignment between a mRNA (CONTIG) and a genomic sequence (Genomic): regions present in both rows are exons, regions present only in the genomic row are introns.
CONTIG  --------------------------------------------------------------------------------------CGANGGCCTATCAACAATGAAAGGTCGAAACCTG
Genomic AGCTACAAACAGATCCTTGATAATTGTCGTTGATTTTACTTTATCCTAAATTTATCTCAAAAATGTTGAAATTCAGATTCGTCAAGCGAGGGCCTATCAACAATG-AAGGTCGAAACCTG

CONTIG  CGTTTACTCCGGATACAAGATCCACCCAGGACACGGNAAAGAGACTTGTCCGTACTGACGGAAAG-------------------------------------------------------
Genomic CGTTTACTCCGGATACAAGATCCACCCAGGACACGG-AAAGAGACTTGTCCGTACTGACGGAAAGGTGAGTTCAGTTTCTCTTTGAAAGGCGTTAGCATGCTGTTAGAGCTCGTAAGGTA

CONTIG  ------------------------------------------------------------------------------------------------------------------------
Genomic TATTGTAATTTTACGAGTGTTGAAGTATTGCAAAAGTAAAGCATAATCACCTTATGTATGTGTTGGTGCTATATCTTCTAGTTTTTAGAAGTTATACCATCGTTAAGCATGCCACGTGTT

CONTIG  ----------------------------------------------GTCCAAATCTTCCTCAGTGGAAAGGCACTCAAGGGAGCCAAGCTTCGCCGTAACCCACGTGACATCAGATGGAC
Genomic GAGTGCGACAAACTACCGTTTCATGATTTATTTATTCAAATTTCAGGTCCAAATCTTCCTCAGTGGAAAGGCACTCAAGGGAGCCAAGCTTCGCCGTAACCCACGTGACATCAGATGGAC

CONTIG  TGTCCTCTACAGAATCAAGAACAAGAAG---------------------------------------------GGAACCCACGGACAAGAGCAAGTCACCAGAAAGAAGACCAAGAAGTC
Genomic TGTCCTCTACAGAATCAAGAACAAGAAGGTACTTGAGATCCTTAAACGCAGTTGAAAATTGGTAATTTTACAGGGAACCCACGGACAAGAGCAAGTCACCAGAAAGAAGACCAAGAAGTC

CONTIG  CGTCCAGGTTGTTAACCGCGCCGTCGCTGGACTTTCCCTTGATGCTATCCTTGCCAAGAGAAACCAGACCGAAGACTTCCGTCGCCAACAGCGTGAACAAGCCGCTAAGATCGCCAAGGA
Genomic CGTCCAGGTTGTTAACCGCGCCGTCGCTGGACTTTCCCTTGATGCTATCCTTGCCAAGAGAAACCAGACCGAAGACTTCCGTCGCCAACAGCGTGAACAAGCCGCTAAGATCGCCAAGGA

CONTIG  TGCCAACAAGGCTGTCCGTGCCGCCAAGGCTGCTNCCAACAAG-----------------------------------------------------------------------------
Genomic TGCCAACAAGGCTGTCCGTGCCGCCAAGGCTGCTGCCAACAAGGTAAACTTTCTACAATATTTATTATAAACTTTAGCATGCTGTTAGAGCTTGTAAGGTATATGTGATTTTACGAGTGT

CONTIG  -------------------------------------------------------------------------------------------------------------------GNAAA
Genomic GTTATTTGAAGCTGTAATATCAATAAGCATGTCTCGTGTGAAGTCCGACAATTTACCATATGCATGAAATTTAAAAACAAGTTAATTTTGTCAATTCTTTATCATTGGTTTTCAGGAAAA

CONTIG  GAAGGCCTCTCAGCCAAAGACCCAGCAAAAGACCGCCAAGAATNTNAAGACTGCTGCTCCNCGTGTCGGNGGAAANCGATAAACGTTCTCGGNCCCGTTATTGTAATAAATTTTGTTGAC
Genomic GAAGGCCTCTCAGCCAAAGACCCAGCAAAAGACCGCCAAGAATGTGAAGACTGCTGCTCCACGTGTCGGAGGAAAGCGATAAACGTTCTCGGTCCCGTTATTGTAATAAATTTTGTTGAC

CONTIG  C-----------------------------------------------------------------------------------------------------------------------
Genomic CGTTAAAGTTTTAATGCAAGACATCCAACAAGAAAAGTATTCTCAAATTATTATTTTAACAGAACTATCCGAATCTGTTCATTTGAGTTTGTTTAGAATGAGGACTCTTCGAATAGCCCA
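The way exons and introns are read off such an mRNA-to-genome alignment can be sketched in a few lines. This is an illustration under stated assumptions, not how production spliced aligners work: both rows are assumed to be gapped strings of equal length, an intron is taken to be any internal run of at least `min_gap` gap characters in the mRNA row (the threshold of 20 is arbitrary), and the name `find_introns` is hypothetical. Real tools also check splice-site signals such as GT...AG.

```python
import re


def find_introns(mrna_row, genomic_row, min_gap=20):
    """Return 1-based genomic (start, end) intervals of putative introns."""
    if len(mrna_row) != len(genomic_row):
        raise ValueError("alignment rows must have the same length")
    introns = []
    for gap in re.finditer(r"-+", mrna_row):
        # Gaps at either end of the alignment are unaligned flanks, not introns.
        if gap.start() == 0 or gap.end() == len(mrna_row):
            continue
        # Short gaps are more likely sequencing errors than introns.
        if gap.end() - gap.start() < min_gap:
            continue
        # Convert alignment columns to genomic positions by skipping
        # any gap characters present in the genomic row.
        start = len(genomic_row[:gap.start()].replace("-", "")) + 1
        end = len(genomic_row[:gap.end()].replace("-", ""))
        introns.append((start, end))
    return introns


if __name__ == "__main__":
    mrna = "ACGT" + "-" * 25 + "TTGA"
    genome = "ACGT" + "GT" + "A" * 21 + "AG" + "TTGA"
    print(find_introns(mrna, genome))  # [(5, 29)]
```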
![Page 36: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/36.jpg)
CDS translation provided by ENA
CDS provided by the submitters
The first Met !
![Page 37: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/37.jpg)
UCSC: human EPO
5’ 3’
mRNAs and their corresponding CDS annotation (from EMBL/GenBank/DDBJ)
contig
![Page 38: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/38.jpg)
mRNAs and their corresponding CDS annotation (from EMBL/GenBank/DDBJ)
![Page 39: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/39.jpg)
Very rarely done…
![Page 40: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/40.jpg)
Complete genome (submitted)
but only ~ 2,000 CDS/proteins available !
![Page 41: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/41.jpg)
http://www.ebi.ac.uk/swissprot/sptr_stats/index.html
…annotated CDS in UniProtKB
![Page 42: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/42.jpg)
From nucleic acid to amino acid sequence databases…
![Page 43: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/43.jpg)
The hectic life of a protein sequence …
cDNAs, ESTs, genomes, …
ENA, GenBank, DDBJ
Data not submitted to public databases, delayed or cancelled…
…if the submitters provide an annotated Coding Sequence (CDS) (1/10 ENA entries)
Protein sequence databases
Nucleic acid databases
Gene prediction
RefSeq, Ensembl
no CDS
RefSeq, Ensembl and other
![Page 44: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/44.jpg)
Why do things in a simple way when you can do them in a very complex one?
![Page 45: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/45.jpg)
The hectic life of a sequence …
cDNAs, ESTs, genomes, …
ENA, GenBank, DDBJ
Data not submitted to public databases, delayed or cancelled…
CoDing Sequences provided by submitters
CoDing Sequences provided by submitters and gene prediction
Scientific publications derived sequences
TrEMBL, GenPept, Swiss-Prot, RefSeq, PRF, Ensembl, CCDS, UniParc, UniProtKB, PDB, (PIR), (IPI), UniMES, TPA
+ all ‘species’ specific databases (EcoGene, TAIR, …)
![Page 46: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/46.jpg)
Major ‘general’ protein sequence database ‘sources’
UniProtKB: Swiss-Prot + TrEMBL
NCBI-nr: Swiss-Prot + TrEMBL + GenPept + PIR + PDB + PRF + RefSeq + TPA
PIR PDB PRF
UniProtKB/Swiss-Prot: manually annotated protein sequences (12’000 species)
UniProtKB/TrEMBL: submitted CDS (ENA) + automated annotation; non-redundant with Swiss-Prot (300’000 species)
GenPept: submitted CDS (GenBank); redundant with Swiss-Prot (300’000 species ?)
PIR: Protein Information Resource; archived since 2003; integrated into UniProtKB
PDB: Protein Data Bank; 3D data and associated sequences
PRF: Protein Research Foundation; journal scan of ‘published’ peptide sequences
RefSeq: Reference Sequences for DNA, RNA and protein + gene prediction + some manual annotation (11’000 species)
TPA: Third Party Annotation
Integrated resources
‘cross-references’
Resources kept separated
TPA
not complete !!! (only entries created before 2007 ?)
![Page 48: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/48.jpg)
Look for EPO (Homo sapiens)
Swiss-Prot
Swiss-Prot
GenPept
RefSeq
RefSeq
GenPept
![Page 49: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/49.jpg)
Menu
Introduction
Nucleic acid sequence databases ENA/GenBank, DDBJ
Protein sequence databases
UniProt databases (UniProtKB)
NCBI protein databases
![Page 50: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/50.jpg)
UniProt
UniProt consortium
EBI: European Bioinformatics Institute (UK)
SIB: Swiss Institute of Bioinformatics (CH)
PIR: Protein Information Resource (US)
![Page 51: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/51.jpg)
UniProt databases
![Page 52: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/52.jpg)
UniProtKB: protein sequence knowledgebase with 2 sections, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, BLAST, download) (~14 million entries)
UniParc: protein sequence archive (ENA equivalent at the protein level). Each entry contains a protein sequence with cross-links to the other databases where the sequence is found (active or not). Not annotated (query, BLAST, download) (~25 million entries)
UniRef: 3 sets of clusters of protein sequences with 100, 90 and 50 % identity; useful to speed up sequence similarity searches (BLAST) (query, BLAST, download) (UniRef100: 10 million entries; UniRef90: 7 million entries; UniRef50: 3.3 million entries)
UniMES: protein sequences derived from metagenomic projects (mostly Global Ocean Sampling (GOS)) (download) (8 million entries, included in UniParc)
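The idea behind identity-based clusters such as UniRef100/90/50 can be illustrated with a toy greedy clustering. This is only a sketch of the general principle and not the procedure UniProt uses: `difflib.SequenceMatcher.ratio()` is swapped in as a crude stand-in for alignment-based percent identity, and the names `similarity` and `greedy_cluster` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Crude similarity in [0, 1]; a stand-in for true percent identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(sequences, threshold):
    """Assign each sequence to the first cluster whose representative
    (the first, longest member) is at least `threshold` similar."""
    clusters = []
    for seq in sorted(sequences, key=len, reverse=True):
        for cluster in clusters:
            if similarity(cluster[0], seq) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters


if __name__ == "__main__":
    seqs = ["MKTAYIAKQR", "MKTAYIAKQR", "MKTAYIAKQW", "GGGGGGGGGG"]
    for threshold in (1.0, 0.9, 0.5):
        print(threshold, len(greedy_cluster(seqs, threshold)), "clusters")
```

Lowering the threshold merges more sequences into fewer clusters, which is why searching a 50 % set is faster than searching the full knowledgebase.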
![Page 53: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/53.jpg)
UniProtKB
an encyclopedia on proteins
composed of 2 sections: UniProtKB/TrEMBL (unreviewed, automatically annotated) and UniProtKB/Swiss-Prot (reviewed, manually annotated)
released every 4 weeks
![Page 54: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/54.jpg)
UniProtKB
from EMBL to TrEMBL
UniProtKB protein sequence data are mainly derived from EMBL (CDS) but also from Ensembl, RefSeq, model organism databases (MODs; e.g. TAIR) and PDB.
Data from the PIR database have been integrated in UniProtKB since 2003.
![Page 55: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/55.jpg)
UniProtKB/Swiss-Prot and UniProtKB/TrEMBL give access to all the protein sequences which are available to the public.
However, UniProtKB excludes the following protein sequences:
- Most non-germline immunoglobulins and T-cell receptors
- Synthetic sequences
- Most patent application sequences
- Small fragments encoded from nucleotide sequence (<8 amino acids)
- Pseudogenes*
- Fusion/truncated proteins
- Not real proteins
* many putative pseudogene sequences (which are tagged as potential pseudogenes) may be expected to remain in UniProtKB for some time as it can be difficult to prove the non-existence of a protein
![Page 56: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/56.jpg)
Data increase in UniProtKB
[Figure: number of sequences (0 to 14,000,000) plotted against date, for UniProtKB and UniProtKB/Swiss-Prot]
![Page 57: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/57.jpg)
TrEMBL
EMBL
Automated extraction of protein sequence (translated CDS), gene name and references.
+ Automated annotation
![Page 58: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/58.jpg)
UniProtKB/TrEMBL
www.uniprot.org
One protein sequence
One species
Protein and gene names
Taxonomic information
References
Automated annotation: Function, Subcellular location, Catalytic activity, Sequence similarities…
Automated annotation: transmembrane domains, signal peptide…
Automated annotation: Keywords and Gene Ontology
Cross-references to over 125 databases
![Page 59: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/59.jpg)
UniProtKB: from EMBL to TrEMBL
Automated annotation
1. Protein sequence
2. Biological information
![Page 60: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/60.jpg)
Protein sequence
- The quality of UniProtKB/TrEMBL protein sequences depends on the information provided by the submitter of the original nucleotide entry (CDS).
- 100% identical sequences (same length, same organism) are merged automatically.
Biological information
Sources of annotation
- Provided by the submitter (EMBL, PDB, TAIR…)
- From automated annotation (SAAS: automatically generated annotation rules)
- From automated annotation (UniRule: manually generated annotation rules)
UniProtKB/TrEMBL
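The merging rule for 100% identical sequences from the same organism amounts to grouping entries by the pair (organism, sequence). A minimal sketch, assuming each source record is a hypothetical `(accession, organism, sequence)` tuple; the function name `merge_identical` and the example accessions are made up for illustration.

```python
from collections import defaultdict


def merge_identical(entries):
    """Group source accessions that share the same organism and exactly
    the same sequence (which implies the same length)."""
    merged = defaultdict(list)
    for accession, organism, sequence in entries:
        merged[(organism, sequence.upper())].append(accession)
    return dict(merged)


if __name__ == "__main__":
    records = [
        ("SRC0001", "Homo sapiens", "MGVHECPAWL"),
        ("SRC0002", "Homo sapiens", "MGVHECPAWL"),   # identical: merged
        ("SRC0003", "Mus musculus", "MGVHECPAWL"),   # other species: kept apart
        ("SRC0004", "Homo sapiens", "MGVHECPAWLW"),  # one residue longer: kept apart
    ]
    for (organism, sequence), accessions in merge_identical(records).items():
        print(organism, sequence, accessions)
```

Sequences that differ by a single residue, or that come from different species, stay as separate entries under this rule.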
![Page 61: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/61.jpg)
![Page 62: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/62.jpg)
![Page 63: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/63.jpg)
Automatic annotation in UniProtKB/TrEMBL
| System | Rule creation | Trigger | Annotations | Scope |
| --- | --- | --- | --- | --- |
| SAAS | automatic | InterPro | comments, KW | all taxa |
| UniRules | manual | InterPro* | protein names, comments, features, KW, GO terms | all taxa |

* Flexibility to create custom signatures for InterPro as required
![Page 64: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/64.jpg)
SAAS
• Rules are derived from the UniProtKB/Swiss-Prot manual annotation.
• Fully automated rule generation based on the C4.5 decision tree algorithm.
• One annotation, one rule.
• Precision calculated for each rule vs UniProtKB/Swiss-Prot.
• High stringency – require 99% or greater estimated precision on UniProtKB/TrEMBL to generate annotation.
• Rules are produced, updated and validated at each release.
UniProtKB/TrEMBL
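The precision check described for SAAS can be sketched as follows. This is a simplified illustration, not the SAAS implementation: each rule is assumed to be summarized by the set of reviewed entries it would annotate and the set of reviewed entries that really carry that annotation, and the names `rule_precision` and `select_rules` are hypothetical. How the estimate is carried over to UniProtKB/TrEMBL is not specified in the slides and is not modelled here.

```python
def rule_precision(predicted, reference):
    """Fraction of entries annotated by the rule that truly carry the
    annotation in the manually annotated reference set."""
    if not predicted:
        return 0.0
    return len(predicted & reference) / len(predicted)


def select_rules(rules, min_precision=0.99):
    """Keep only the rules whose precision reaches the threshold.

    `rules` maps a rule name to a (predicted, reference) pair of sets.
    """
    return [
        name
        for name, (predicted, reference) in rules.items()
        if rule_precision(predicted, reference) >= min_precision
    ]


if __name__ == "__main__":
    rules = {
        # 199 of 200 predictions agree with the reference: 99.5 %
        "rule_A": (set(range(200)), set(range(199))),
        # 3 of 4 predictions agree with the reference: 75 %
        "rule_B": ({1, 2, 3, 4}, {1, 2, 3}),
    }
    print(select_rules(rules))  # ['rule_A']
```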
![Page 65: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/65.jpg)
![Page 66: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/66.jpg)
![Page 67: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/67.jpg)
UniRules (RuleBase, HAMAP, PIRSF)
• Rules of varying complexity: annotation varies from simple KW attribution to complete annotation as for UniProtKB/Swiss-Prot
• Rules are manually curated:
– From SAAS rules as input
– From UniProtKB/Swiss-Prot annotation and InterPro match data, taxonomy information – continuously reported to curators
– From literature-based curation of characterized families - with the possibility to create new signatures for specific functional groups
• Rules are continuously monitored – validation on UniProtKB/Swiss-Prot – 97% confidence
• HAMAP is also used for the annotation of some UniProtKB/Swiss-Prot entries
![Page 68: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/68.jpg)
UniRule – HAMAP
![Page 69: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/69.jpg)
UniRule – HAMAP
![Page 70: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/70.jpg)
Automatic annotation in UniProtKB/TrEMBL - Summary
• SAAS – automatically generated annotation rules for comments and KWs
– tested on UniProtKB/Swiss-Prot
• UniRule – manually curated annotation rules (e.g. HAMAP)
– annotation varies from simple KWs to full annotation
– start point can be SAAS rules, InterPro reports or literature-based curation of protein families
– possibility to create custom signatures -> InterPro
• Automatic annotation of UniProtKB/TrEMBL is refreshed and validated at each UniProtKB release
– validation uses UniProtKB/Swiss-Prot as reference
– ~10% of the rules are ‘refreshed’ at each release
• The source of each annotation is indicated - users can access rule logic
![Page 71: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/71.jpg)
Current status – coverage of UniProtKB/TrEMBL
| System | Rules | Coverage* |
|---|---|---|
| SAAS | 1684 | 17.2% |
| UniRules: RuleBase | 1108 | 23.0% |
| UniRules: PIR name / site rules | 142 | 0.26% |
| UniRules: HAMAP | 1087 | 4.7% |

* Proportion of entries with at least one annotation from the specified system

UniProtKB/TrEMBL 2010_12: 12,769,092 entries; all systems combined: 33%
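As a worked check of these figures, the short sketch below converts the percentages into approximate entry counts and compares the sum of the per-system coverages with the combined figure. The numbers are those quoted on the slide; the variable and function names are only illustrative.

```python
TOTAL_ENTRIES = 12_769_092  # UniProtKB/TrEMBL 2010_12, as quoted above

coverage_percent = {
    "SAAS": 17.2,
    "RuleBase": 23.0,
    "PIR name / site rules": 0.26,
    "HAMAP": 4.7,
}
COMBINED_PERCENT = 33.0  # all systems combined


def entries_covered(percent, total=TOTAL_ENTRIES):
    """Approximate number of entries corresponding to a coverage percentage."""
    return round(total * percent / 100)


per_system = {name: entries_covered(p) for name, p in coverage_percent.items()}
combined = entries_covered(COMBINED_PERCENT)

# If no entry were annotated by two systems, the percentages would simply add up.
sum_of_parts = sum(coverage_percent.values())
overlap_points = sum_of_parts - COMBINED_PERCENT

print(per_system)
print(combined)
print(round(sum_of_parts, 2), round(overlap_points, 2))
```

The combined 33% corresponds to roughly 4.2 million entries. The individual percentages add up to about 45%, more than the combined figure, which is what one expects when some entries receive annotations from more than one system.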
![Page 72: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/72.jpg)
Current status - coverage of UniProtKB/TrEMBL
[Bar chart: % coverage of UniProtKB/TrEMBL (y-axis, 0 to 35) by annotation type (All, CC, DE, FT, GN, KW), with series UniRule, UniRule + SAAS and SAAS]
UniProt release 2010_10 included annotations from 7767 SAAS rules, 1814 UniRules
![Page 73: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/73.jpg)
GO annotation
- KW2GO
- InterPro2GO
- HAMAP2GO
![Page 74: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/74.jpg)
UniProtKB
from TrEMBL to Swiss-Prot
Once manually annotated and integrated into Swiss-Prot, the entry is deleted from TrEMBL
-> minimal redundancy
![Page 75: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/75.jpg)
TrEMBL
EMBL
Automated extraction of protein sequence (translated CDS), gene name and references
+ Automated annotation
Manual annotation of the sequence and associated
biological information
Swiss-Prot
![Page 76: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/76.jpg)
MSKEKFERTKPHVNVGTIGHVDHGKTTLTAAITTVLAKTYGGAARAFDQIDNAPEEKARGITINTSHVEYDTPTRHYAHVDCPGHADYVKNMITGAAQMDGAILVVAATDGPMPQTREHILLGRQVGVPYIIVFLNKCDMVDDEELLELVEMEVRELLSQYDFPGDDTPIVRGSALKALE GDAEWEAKILELAGFLDSYIPEPERAIDKPFLLPIEDVFSISGRGTVVTGRVERGIIKVGEEVEIVGIKETQKSTCTGVEMFRKLLDEGRAGENVGVLLRGIKREEIERGQVLAKPGTIKPHTKFESEVYILSKDEGGRHTPFFKGYRPQFYFRTTDVTGTIELPEGVEMVMPGDNIKMV VTLIHPIAMDDGLRFAIREGGRTVGAGVVAKVLG
One protein sequence
One gene
One species
Manual annotation: Keywords and Gene Ontology
Manual annotation: Function, Subcellular location, Catalytic activity, Disease, Tissue specificity, Pathway…
Manual annotation: Post-translational modifications, variants, transmembrane domains, signal peptide…
Cross-references to over 125 databases
References
Protein and gene names
Taxonomic information
Alternative products: protein sequences produced by alternative splicing, alternative promoter usage, alternative initiation…
UniProtKB/Swiss-Prot
www.uniprot.org
![Page 77: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/77.jpg)
UniProtKB: from TrEMBL to Swiss-Prot
Manual annotation
1. Protein sequence (merge available CDS, annotate sequence discrepancies, report sequencing mistakes…)
2. Biological information (sequence analysis, extract literature information, ortholog data propagation, …)
![Page 78: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/78.jpg)
UniProtKB
1- Sequence curation
![Page 79: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/79.jpg)
The displayed protein sequence
…canonical, representative, consensus…
![Page 80: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/80.jpg)
The displayed sequence is the most prevalent protein sequence and/or the protein sequence which is also found in orthologous species.
The displayed sequence is generally derived from the translation of the genomic sequence (when available).
Sequence differences are documented.
1 entry <-> 1 gene (1 species) 1 displayed sequence
(annotation of alternative sequences, when available)
UniProtKB/Swiss-Prot protein sequence annotation‘Merging/Redundancy policy’:
a gene-centric view of protein space
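The "1 entry <-> 1 gene (1 species)" policy can be pictured with a small sketch. It implements only one of the criteria named above (the most prevalent sequence becomes the displayed one); real curation also weighs orthologs and the genomic translation. The record layout, gene names and toy sequences are hypothetical.

```python
from collections import Counter, defaultdict


def merge_records(records):
    """Group (gene, species, sequence) records into one entry per gene and species.

    The most frequent sequence is taken as the displayed sequence; every other
    distinct sequence is kept as an alternative.
    """
    grouped = defaultdict(list)
    for gene, species, sequence in records:
        grouped[(gene, species)].append(sequence)

    entries = {}
    for key, sequences in grouped.items():
        counts = Counter(sequences)
        displayed, _ = counts.most_common(1)[0]
        alternatives = sorted(s for s in counts if s != displayed)
        entries[key] = {"displayed": displayed, "alternatives": alternatives}
    return entries


# Toy data: three submissions for one human gene, one for its mouse counterpart.
records = [
    ("GENE1", "Homo sapiens", "MKT"),
    ("GENE1", "Homo sapiens", "MKT"),
    ("GENE1", "Homo sapiens", "MKS"),
    ("GENE1", "Mus musculus", "MRT"),
]
entries = merge_records(records)
print(entries)
```

Four submitted sequences collapse into two entries, one per species, and the differing human sequence is retained as an alternative rather than discarded.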
![Page 81: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/81.jpg)
What is the current status?
• At least 20% of Swiss-Prot entries required a minimal amount of curation effort so as to obtain the “correct” sequence.
• Typical problems
– unsolved conflicts;
– uncorrected initiation sites;
– frameshifts;
– other ‘problems’
![Page 82: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/82.jpg)
![Page 83: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/83.jpg)
… once a gene on chromosome 11…
![Page 84: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/84.jpg)
Quality of protein information from genome projects
• Let's look at proteins originating from genome projects:
– Drosophila: the paradigm of what a curated genome should look like (thanks to FlyBase): only 1.8% of the gene models conflict with Swiss-Prot sequences;
– Arabidopsis: a typical example of a genome where a lot of annotation was done when it was sequenced, but no update since then (at least in the public view): 20% of the gene models are erroneous;
– Tetraodon nigroviridis: the typical example of a quick and dirty automatic run through a genome with no manual intervention: >90% of the gene models produce incorrect proteins.
– Bacteria and Archaea have almost no splicing, so predictions are “easier”, however errors are still made… Start codons, missed small proteins (<100aa)…
![Page 85: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/85.jpg)
UniProtKB/Swiss-Prot
Protein sequence annotation
![Page 86: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/86.jpg)
Example of problem (derived from gene prediction pipeline)
Ensembl completes the human ‘proteome’ by predicting/annotating missing genes according to orthologous sequences.
ID   URAD_HUMAN   Unreviewed;   171 AA.
AC   A6NGE7;
DT   24-JUL-2007, integrated into UniProtKB/TrEMBL.
DT   24-JUL-2007, sequence version 1.
DT   02-OCT-2007, entry version 3.
DE   2-oxo-4-hydroxy-4-carboxy-5-ureidoimidazoline decarboxylase homolog
DE   (OHCU decarboxylase homolog) (Parahox neighbour).
GN   Name=PRHOXNB;
…
DR   EMBL; AL591024; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   Ensembl; ENSG00000183463; Homo sapiens.
DR   HGNC; HGNC:17785; PRHOXNB.
PE   4: Predicted;
In primates the genes coding for the enzymes for the degradation of uric acid were inactivated and converted to pseudogenes.
![Page 87: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/87.jpg)
• Producing a clean set of sequences is not a trivial task;
• It is not getting easier as more and more types of sequence data are submitted;
![Page 88: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/88.jpg)
‘Protein existence’ tag
• The ‘Protein existence’ tag indicates the evidence for the existence of a given protein;
• Different qualifiers:
1. Evidence at protein level (~18%) (MS, western blot (tissue specificity), immuno (subcellular location), …)
2. Evidence at transcript level (~19%)
3. Inferred from homology (~58%)
4. Predicted (~5%)
5. Uncertain (mainly in TrEMBL)
http://www.uniprot.org/docs/pe_criteria
![Page 89: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/89.jpg)
![Page 90: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/90.jpg)
To avoid ‘pseudogenes’ and most improbable protein sequences, you can filter your query to exclude sequences whose ‘protein existence’ tag is ‘Uncertain’.
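For entries already downloaded in flat file format, the same filter can be applied locally. The sketch below assumes that each entry carries a line of the form `PE   4: Predicted;`, as in the URAD_HUMAN example, and that level 5 means ‘Uncertain’; the second entry is a made-up record used only to exercise the filter.

```python
import re

# Matches a flat-file protein existence line such as "PE   4: Predicted;"
PE_LINE = re.compile(r"^PE\s+(\d):\s*([^;]+);", re.MULTILINE)
UNCERTAIN_LEVEL = 5


def protein_existence(flat_text):
    """Return (level, label) from the PE line, or None if the entry has no PE line."""
    match = PE_LINE.search(flat_text)
    if match is None:
        return None
    return int(match.group(1)), match.group(2).strip()


def drop_uncertain(entries):
    """Keep entries that have a PE line whose level is not 'Uncertain'."""
    kept = []
    for text in entries:
        pe = protein_existence(text)
        if pe is not None and pe[0] != UNCERTAIN_LEVEL:
            kept.append(text)
    return kept


predicted_entry = "ID   URAD_HUMAN   Unreviewed;   171 AA.\nAC   A6NGE7;\nPE   4: Predicted;\n"
# Hypothetical entry, for illustration only.
uncertain_entry = "ID   EXAMPLE_HUMAN   Reviewed;   100 AA.\nAC   X00000;\nPE   5: Uncertain;\n"

kept = drop_uncertain([predicted_entry, uncertain_entry])
print(len(kept), protein_existence(predicted_entry))
```

Entries without a PE line are dropped here as well; whether to keep or discard them is a design choice that depends on how strict the downstream analysis needs to be.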
![Page 91: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/91.jpg)
![Page 92: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/92.jpg)
The ‘alternative’ sequence(s)
![Page 93: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/93.jpg)
UniProtKB/Swiss-Prot
1 entry <-> 1 gene (1 species)
Annotation of the sequence differences
(including conflicts, polymorphisms, splice variants etc..)
-> annotation of protein diversity
![Page 94: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/94.jpg)
Multiple alignment of the end of the available GCR sequences
Annotation of the sequence differences (protein diversity)
1 entry <-> 1 gene (1 species)
…and natural variants
![Page 96: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/96.jpg)
UniProtKB (and RefSeq) under-represent alternatively spliced products
According to PMID:21307931, alternative splicing seems to occur in more than 90% of protein-coding genes (though it might not always modify the protein sequence).
Transcript variants are only made when there is information available on the full-length nature of the product; if multiple alternate exons are found through the length of the gene, no assumption is made about the combination of the alternate exons that may exist in vivo.
Uncertain alternative sequences (confirmed by only one cDNA) are tagged with ‘No experimental confirmation available’
http://www.ncbi.nlm.nih.gov/books/NBK50679/#RefSeqFAQ.what_does_a_reviewed_status_me
![Page 97: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/97.jpg)
Available in separated files!
Important remark
> 30’000 additional sequences (total)
![Page 98: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/98.jpg)
The ‘alternative’ sequence(s)
not ‘directly available’ for many tools, including protein identification tools and Blast, depending on the server!…
Not included yet in the UniProtKB complete proteome sets !
![Page 99: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/99.jpg)
Depending upon the organism, the inclusion of alternative sequences in the basic set of protein sequences can make a tremendous difference. For instance, in Homo sapiens, alternative sequences currently represent close to 40% of the total number of annotated human sequences described in UniProtKB/Swiss-Prot.
http://www.uniprot.org/faq/38
![Page 100: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/100.jpg)
Blast P04150 against Swiss-Prot / homo sapiens @ UniProt
Isoform sequences
![Page 101: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/101.jpg)
Blast P04150 against Swiss-Prot / homo sapiens @ NCBI
The isoform sequences are not present in the NCBI protein databases!
The .x number (P06401.4) corresponds to the version number of the sequence, not to an alternatively spliced sequence!
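The difference between the two suffixes can be made explicit in code. This sketch assumes two identifier shapes: `ACCESSION.N` for a sequence version, as in P06401.4, and `ACCESSION-N` for an isoform, which is the naming UniProt uses for isoform sequences but which is not shown on the slide and is therefore an assumption here. The function name is hypothetical.

```python
import re

VERSIONED = re.compile(r"^(?P<acc>[A-Z0-9]+)\.(?P<version>\d+)$")
ISOFORM = re.compile(r"^(?P<acc>[A-Z0-9]+)-(?P<isoform>\d+)$")


def describe(identifier):
    """Split an identifier into accession, sequence version and isoform number."""
    match = VERSIONED.match(identifier)
    if match:
        return {
            "accession": match.group("acc"),
            "sequence_version": int(match.group("version")),
            "isoform": None,
        }
    match = ISOFORM.match(identifier)
    if match:
        return {
            "accession": match.group("acc"),
            "sequence_version": None,
            "isoform": int(match.group("isoform")),
        }
    return {"accession": identifier, "sequence_version": None, "isoform": None}


for identifier in ("P06401.4", "P04150-2", "P04150"):
    print(identifier, describe(identifier))
```

A dot suffix is therefore read as version 4 of the canonical sequence, while a hyphen suffix points at a distinct isoform sequence of the same entry.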
![Page 102: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/102.jpg)
How to track sequence changes ?
• The sequence version number applies to the canonical sequence only
• There is no easy way yet to track sequence updates of isoforms
![Page 103: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/103.jpg)
UniProtKB
2- Biological data curation
![Page 104: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/104.jpg)
UniProtKB/Swiss-Prot gathers data from multiple sources:
- publications (literature/PubMed)
- prediction programs (Prosite, TMHMM, …)
- contacts with experts
- other databases
- nomenclature committees
An evidence attribution system makes it easy to trace the source of each annotation
Extract literature information and protein sequence analysis
maximum usage of controlled vocabulary
![Page 105: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/105.jpg)
Protein and gene names
![Page 106: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/106.jpg)
…enable researchers to obtain a summary of what is known about a protein…
General annotation
(Comments)
www.uniprot.org
![Page 107: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/107.jpg)
Human protein manual annotation: some statistics (Aug 2010)
![Page 108: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/108.jpg)
Sequence annotation
(Features)
…enable researchers to obtain a summary of what is known about a protein…
www.uniprot.org
![Page 109: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/109.jpg)
Human protein manual annotation: some statistics
(PTM)
![Page 110: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/110.jpg)
Non-experimental qualifiers
UniProtKB/Swiss-Prot considers both experimental and predicted data and makes a clear distinction between the two.
| Level | Type of evidence | Qualifier |
|---|---|---|
| 1st | Strong experimental evidence | Ref.X |
| 2nd | Light experimental evidence | Probable |
| 3rd | Inferred by similarity with homologous protein (data of 1st or 2nd level) | By similarity |
| 4th | Inferred by sequence prediction | Potential |
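To show how these qualifiers can drive a query restricted to experimentally proven data, here is a minimal sketch. The protein names, the annotation tuples and the decision to treat only level 1 as "experimentally proven" are assumptions made for illustration; a `Ref.X` qualifier is modelled as any string starting with `Ref.`.

```python
# Qualifier -> evidence level, following the four levels listed above.
EVIDENCE_LEVEL = {"Probable": 2, "By similarity": 3, "Potential": 4}


def evidence_level(qualifier):
    """Map an annotation qualifier to its evidence level (1 = strongest)."""
    if qualifier.startswith("Ref."):
        return 1
    return EVIDENCE_LEVEL[qualifier]


def has_proven(annotations, kind, value, max_level=1):
    """True if an annotation of this kind and value has evidence at or above max_level."""
    return any(
        k == kind and v == value and evidence_level(q) <= max_level
        for k, v, q in annotations
    )


def cytoplasmic_phosphoserine(proteins):
    """Proteins with proven cytoplasmic location and proven serine phosphorylation."""
    return sorted(
        name
        for name, annotations in proteins.items()
        if has_proven(annotations, "Subcellular location", "Cytoplasm")
        and has_proven(annotations, "Modified residue", "Phosphoserine")
    )


# Hypothetical entries: (annotation kind, value, qualifier).
proteins = {
    "PROT_A": [("Subcellular location", "Cytoplasm", "Ref.1"),
               ("Modified residue", "Phosphoserine", "Ref.2")],
    "PROT_B": [("Subcellular location", "Cytoplasm", "By similarity"),
               ("Modified residue", "Phosphoserine", "Ref.3")],
    "PROT_C": [("Subcellular location", "Cytoplasm", "Ref.4"),
               ("Modified residue", "Phosphoserine", "Potential")],
}
print(cytoplasmic_phosphoserine(proteins))
```

Only `PROT_A` passes, because both of its annotations are backed by a reference. Raising `max_level` to 2 would also accept annotations marked ‘Probable’.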
![Page 111: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/111.jpg)
Find all the proteins localized in the cytoplasm (experimentally proven) which are phosphorylated on a serine (experimentally proven)
![Page 112: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/112.jpg)
UniProtKB
Additional information can be found in the cross-references (to more than 140 databases)
![Page 113: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/113.jpg)
2D gel: 2DBase-Ecoli, ANU-2DPAGE, Aarhus/Ghent-2DPAGE (no server), COMPLUYEAST-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE (no server), OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE, UCD-2DPAGE, World-2DPAGE
Family and domain: Gene3D, HAMAP, InterPro, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPFAM, TIGRFAMs
Organism-specific: AGD, ArachnoServer, CGD, ConoServer, CTD, CYGD, dictyBase, EchoBASE, EcoGene, euHCVdb, EuPathDB, FlyBase, GeneCards, GeneDB_Spombe, GeneFarm, GenoList, Gramene, H-InvDB, HGNC, HPA, LegioList, Leproma, MaizeGDB, MGI, MIM, neXtProt, Orphanet, PharmGKB, PseudoCAP, RGD, SGD, TAIR, TubercuList, WormBase, Xenbase, ZFIN
Protein family/group: Allergome, CAZy, MEROPS, PeroxiBase, PptaseDB, REBASE, TCDB
Genome annotation: Ensembl, EnsemblBacteria, EnsemblFungi, EnsemblMetazoa, EnsemblPlants, EnsemblProtists, GeneID, GenomeReviews, KEGG, NMPDR, TIGR, UCSC, VectorBase
Enzyme and pathway: BioCyc, BRENDA, Pathway_Interaction_DB, Reactome
Other: BindingDB, DrugBank, NextBio, PMAP-CutDB
Sequence: EMBL, IPI, PIR, RefSeq, UniGene
3D structure: DisProt, HSSP, PDB, PDBsum, ProteinModelPortal, SMR
PTM: GlycoSuiteDB, PhosphoSite, PhosSite
Proteomic: PeptideAtlas, PRIDE, ProMEX
PPI: DIP, IntAct, MINT, STRING
Phylogenomic dbs: eggNOG, GeneTree, HOGENOM, HOVERGEN, InParanoid, OMA, OrthoDB, PhylomeDB, ProtClustDB
Polymorphism: dbSNP
Gene expression: ArrayExpress, Bgee, CleanEx, Genevestigator, GermOnline
Ontologies: GO
UniProtKB/Swiss-Prot: 129 explicit links and 14 implicit links!
![Page 114: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/114.jpg)
Protein sequence origin
http://www.uniprot.org/faq/35
![Page 116: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/116.jpg)
The UniProt web site - www.uniprot.org
• Powerful search engine, Google-like and easy to use, which also supports very directed field searches (similar to SRS)
• Scoring mechanism presenting relevant matches first
• Entry views, search result views and downloads are customizable
• The URL of a result page reflects the query; all pages and queries are bookmarkable, supporting programmatic access
• Tools: Blast, Align, IDmapping, Batch retrieval (Retrieve)
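Because the URL reflects the query, a result page can in principle be reconstructed from code. The sketch below only builds and decodes such a URL and never contacts the server. The `/uniprot/` path, the `query`, `format` and `columns` parameter names, and the field syntax of the example query are assumptions made for illustration and should be checked against the site's own documentation before use.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed endpoint; only the host name www.uniprot.org comes from the slides.
BASE_URL = "http://www.uniprot.org/uniprot/"


def build_query_url(query, fmt="tab", columns=None):
    """Encode a search as a bookmarkable URL (parameter names are assumptions)."""
    params = {"query": query, "format": fmt}
    if columns:
        params["columns"] = ",".join(columns)
    return BASE_URL + "?" + urlencode(params)


def read_query_url(url):
    """Recover the parameters from a URL, as one would when editing a bookmark."""
    return {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}


url = build_query_url("organism:9606 AND reviewed:yes", columns=["id", "genes"])
print(url)
print(read_query_url(url))
```

Keeping the query in the URL means a saved link doubles as a reproducible description of the search, and editing one parameter by hand is enough to derive a related query.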
![Page 117: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/117.jpg)
Search
A very powerful text search tool with autocompletion and refinement options, for finding UniProt entries and documentation by biological information
![Page 118: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/118.jpg)
Search
A very powerful text search tool with autocompletion and refinement options, for finding UniProt entries and documentation by biological information
![Page 119: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/119.jpg)
The search interface guides users with helpful suggestions and hints
![Page 120: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/120.jpg)
![Page 121: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/121.jpg)
Advanced Search
A very powerful search tool
To be used when you know in which entry section the information is stored
First have a look at annotation examples.
![Page 122: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/122.jpg)
Find all human proteins with experimental evidence for their
location in the nucleus
![Page 123: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/123.jpg)
The information is stored in the ‘General annotation’ section, Subcellular location
![Page 124: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/124.jpg)
![Page 125: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/125.jpg)
Find all human proteins with experimental evidence for their
location in the nucleus
![Page 126: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/126.jpg)
Result pages: Highly customizable
![Page 127: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/127.jpg)
Custom downloads….
| Accession | Genes | Domains | Protein Existence |
|---|---|---|---|
| P02768 | ALB (GIG20) (GIG42) (PRO0903) (PRO1708) (PRO2044) (PRO2619) (PRO2675) (UNQ696/PRO1341) | Albumin domains (3) | Evidence at protein level |
| P02769 | ALB | Albumin domains (3) | Evidence at protein level |
| P02770 | Alb | Albumin domains (3) | Evidence at protein level |
| P07724 | Alb (Alb-1) (Alb1) | Albumin domains (3) | Evidence at protein level |
| P08759 | alb-A | Albumin domains (3) | Evidence at transcript level |
| P14872 | alb-B | Albumin domains (3) | Evidence at transcript level |
| P43652 | AFM (ALB2) (ALBA) | Albumin domains (3) | Evidence at protein level |
| P08835 | ALB | Albumin domains (3) | Evidence at protein level |
| P49822 | ALB | Albumin domains (3) | Evidence at protein level |
| P19121 | ALB | Albumin domains (3) | Evidence at protein level |

Open with Excel etc.
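Besides a spreadsheet, such a download can be read with a few lines of code. The sketch assumes the file is tab-delimited with the four column headers shown in the example; the sample text reproduces a handful of the rows above rather than a real download.

```python
import csv
import io
from collections import Counter

# A few rows from the example above, joined with tabs (the delimiter is an assumption).
ROWS = [
    ("Accession", "Genes", "Domains", "Protein Existence"),
    ("P02769", "ALB", "Albumin domains (3)", "Evidence at protein level"),
    ("P08759", "alb-A", "Albumin domains (3)", "Evidence at transcript level"),
    ("P14872", "alb-B", "Albumin domains (3)", "Evidence at transcript level"),
    ("P43652", "AFM (ALB2) (ALBA)", "Albumin domains (3)", "Evidence at protein level"),
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS)


def read_download(text):
    """Parse a tab-delimited download into a list of dicts keyed by column header."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def count_by_existence(records):
    """Count entries per 'Protein Existence' value."""
    return Counter(record["Protein Existence"] for record in records)


records = read_download(SAMPLE)
print(records[0]["Accession"], dict(count_by_existence(records)))
```

For a file on disk, `io.StringIO(text)` would be replaced by an open file handle; the rest of the code stays the same.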
![Page 128: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/128.jpg)
The URL (results) can be bookmarked and manually modified.
![Page 129: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/129.jpg)
Blast
A tool with the standard options for searching sequences in the UniProt databases
![Page 130: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/130.jpg)
Blast results: customize display
![Page 131: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/131.jpg)
Blast: use of UniProt annotation
amino-acid highlighting options and feature annotation highlighting option in the local alignment
![Page 132: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/132.jpg)
Align
A ClustalW multiple alignment tool with amino-acid highlighting options and a feature annotation highlighting option
![Page 133: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/133.jpg)
ClustalW multiple alignment of insulin sequences
amino-acid highlighting options and feature annotation highlighting option in the local alignment
![Page 134: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/134.jpg)
Retrieve
A UniProt-specific tool that retrieves a list of entries given in several standard identifier formats.
You can then query your ‘personal database’ with the UniProt search tool.
![Page 135: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/135.jpg)
Your dataset: results of a Scan Prosite
![Page 136: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/136.jpg)
ID Mapping
Makes it possible to map between identifiers from different databases for a given protein
![Page 137: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/137.jpg)
These identifiers are all pointing to TP53 (p53) !
P04637, NP_000537, ENSG00000141510, CCDS11118, UPI000002ED67, IPI00025087, etc.
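Before mapping, it can help to recognise which database an identifier comes from. The sketch below classifies the six identifiers listed above with regular expressions. The UniProtKB pattern follows the accession format published by UniProt; the other patterns are simplified assumptions based only on the prefixes and lengths visible in these examples and will not cover every valid identifier.

```python
import re

# Order is irrelevant here because the patterns do not overlap on full matches.
PATTERNS = {
    "UniProtKB": re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
    ),
    "RefSeq protein": re.compile(r"[A-Z]P_\d+(?:\.\d+)?"),   # simplified
    "Ensembl gene": re.compile(r"ENSG\d{11}"),               # human genes only
    "CCDS": re.compile(r"CCDS\d+(?:\.\d+)?"),                # simplified
    "UniParc": re.compile(r"UPI[0-9A-F]{10}"),
    "IPI": re.compile(r"IPI\d{8}(?:\.\d+)?"),                # simplified
}


def classify(identifier):
    """Return the name of the first database whose pattern matches the whole identifier."""
    for database, pattern in PATTERNS.items():
        if pattern.fullmatch(identifier):
            return database
    return "unknown"


identifiers = ["P04637", "NP_000537", "ENSG00000141510",
               "CCDS11118", "UPI000002ED67", "IPI00025087"]
for identifier in identifiers:
    print(identifier, "->", classify(identifier))
```

Classification only tells where an identifier belongs; the correspondence between identifiers still has to come from a mapping resource such as the ID Mapping tool.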
![Page 138: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/138.jpg)
![Page 139: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/139.jpg)
Download
![Page 142: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/142.jpg)
Canonical and isoform sequences
![Page 143: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/143.jpg)
Complete proteome
‘gene’ centred or all known proteins?
![Page 144: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/144.jpg)
http://www.uniprot.org/faq/38
![Page 145: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/145.jpg)
![Page 146: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/146.jpg)
Remark: Some peptides are not associated with the keyword ‘Complete proteome’ because they do not match with the human genome
![Page 147: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/147.jpg)
UniProt proteome sets, if downloaded in UniProt flat file or XML format, contain one sequence per UniProt record !
‘gene’ centred
all protein sequences in UniProtKB/Swiss-Prot…
Missing: other alternatively spliced protein sequences in UniProtKB/TrEMBL
![Page 148: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/148.jpg)
IPI closure
• The complete MOUSE proteome will be composed of all MOUSE sequences in UniProtKB/Swiss-Prot plus those MOUSE sequences in UniProtKB/TrEMBL that have a cross-reference to an Ensembl protein.
• The complete HUMAN proteome will therefore be composed of all HUMAN sequences in UniProtKB/Swiss-Prot plus those HUMAN sequences in UniProtKB/TrEMBL that have a cross-reference to an Ensembl protein.
• News: 30th March 2011: next UniProtKB release.
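The selection rule in the bullets above can be sketched in a few lines of Python. This is only an illustration: the `Entry` class, its field names and the accession `X00001` are hypothetical, and the real proteome sets are built by UniProt, not by a filter like this one.

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    accession: str
    organism: str
    reviewed: bool                      # True = Swiss-Prot, False = TrEMBL
    xref_dbs: set = field(default_factory=set)

def complete_proteome(entries, organism):
    """Keep all Swiss-Prot entries of the organism, plus TrEMBL entries
    that carry a cross-reference to Ensembl."""
    return [
        e for e in entries
        if e.organism == organism and (e.reviewed or "Ensembl" in e.xref_dbs)
    ]

entries = [
    Entry("P04150", "HUMAN", True),                       # Swiss-Prot
    Entry("A6NGE7", "HUMAN", False, {"EMBL", "Ensembl"}),  # TrEMBL + Ensembl
    Entry("X00001", "HUMAN", False, {"EMBL"}),             # hypothetical, no Ensembl
    Entry("X00002", "MOUSE", True),                        # hypothetical
]
human = [e.accession for e in complete_proteome(entries, "HUMAN")]
print(human)
```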
![Page 149: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/149.jpg)
UniProtKB
Statistics
![Page 150: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/150.jpg)
520’000 + 14’000’000 14’000’000
Swiss-Prot & TrEMBL introduce a new arithmetical concept!
Redundancy in TrEMBL & redundancy between TrEMBL and Swiss-Prot
Swiss-Prot: 12’000 species
TrEMBL: 130’000 species
![Page 151: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/151.jpg)
12’000 species, mainly model organisms
![Page 152: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/152.jpg)
Not yet available
![Page 153: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/153.jpg)
- ~200 new entries / day
- New release every 4 weeks
- Annotation is useful, good annotation is better, update is essential !
- Some entries have gone through more than 120 versions since their integration in UniProtKB/Swiss-Prot
![Page 154: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/154.jpg)
UniProtKB entry history
Always cite the primary accession number (AC) !
![Page 155: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/155.jpg)
UniParc
![Page 156: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/156.jpg)
UniParc
- non-redundant protein sequence archive, containing both active and inactive sequences (including sequences which are not in UniProtKB, e.g. immunoglobulins…)
- the equivalent of ENA/GenBank/DDBJ at the protein level
- species-merged: merge sequences between species when 100% identical over the whole length.
- no annotation (only taxonomy)
- can be searched only with database names, taxonomy, checksum (CRC64) and accession numbers (ACs) or UniProtKB, UniRef and UniParc IDs.
- Beware: contains wrong predictions, pseudogenes, etc.
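The ‘species-merged’ idea can be pictured as grouping records by their exact sequence. A minimal sketch, assuming made-up source identifiers and short made-up sequences; it shows the principle only and is not how UniParc is implemented:

```python
from collections import defaultdict

def merge_identical(records):
    """Group (source_id, species, sequence) records by exact sequence.

    Sequences that are 100% identical over their whole length end up in
    one archive entry, whatever the species or source database.
    """
    archive = defaultdict(list)
    for source_id, species, sequence in records:
        archive[sequence.upper()].append((source_id, species))
    return dict(archive)

# Hypothetical records: two identical sequences from different species,
# and a third one differing at the last residue.
records = [
    ("db1:A", "Homo sapiens", "MSGRGKGGKGLGKGGAKRHR"),
    ("db2:B", "Mus musculus", "MSGRGKGGKGLGKGGAKRHR"),
    ("db1:C", "Homo sapiens", "MSGRGKGGKGLGKGGAKRHK"),
]
merged = merge_identical(records)
for sequence, sources in merged.items():
    print(sequence, sources)
```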
![Page 157: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/157.jpg)
Query UniParc
![Page 158: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/158.jpg)
UniRef
![Page 159: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/159.jpg)
‘UniRef is useful for comprehensive BLAST similarity searches by providing
sets of representative sequences’
![Page 160: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/160.jpg)
«Collapsing BLAST results»
Three collections of sequence clusters from UniProtKB and selected UniParc entries:
One UniRef100 entry -> all identical sequences (identical sequences and sub-fragments are grouped in a single record) -> reduction of 12 %
One UniRef90 entry -> sequences that have at least 90 % identity -> reduction of 40 %
One UniRef50 entry -> sequences that have at least 50 % identity -> reduction of 65 %
Based on sequence identity -> Independent of the species !
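To see why lowering the identity threshold shrinks the number of records, here is a toy greedy clustering in Python. It is a sketch under strong assumptions: identity is counted position by position without any alignment, the sequences are invented, and the procedure is not the one UniRef actually uses.

```python
def identity(a, b):
    """Fraction of matching positions over the shorter sequence (no gaps)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / n

def greedy_cluster(sequences, threshold):
    """Assign each sequence to the first representative it matches at or
    above the threshold; otherwise it becomes a new representative."""
    clusters = []  # list of (representative, members)
    for seq in sorted(sequences, key=len, reverse=True):
        for representative, members in clusters:
            if identity(representative, seq) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
    return clusters

# Invented 10-residue sequences, species deliberately ignored.
sequences = ["MKTAYIAKQR", "MKTAYIAKQR", "MKTAYIAKQK", "MKTAYLLLLL", "GGGGGGGGGG"]
counts = {t: len(greedy_cluster(sequences, t)) for t in (1.0, 0.9, 0.5)}
print(counts)
```

With these five sequences the cluster count drops as the threshold goes from 100 % to 90 % to 50 %, which mirrors the growing reductions quoted above.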
![Page 161: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/161.jpg)
![Page 162: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/162.jpg)
Independent of species and
sequence length
UniRef 90
![Page 163: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/163.jpg)
UniMES
![Page 164: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/164.jpg)
The UniProt Metagenomic and Environmental Sequences (UniMES) database is a repository specifically developed for metagenomic and environmental protein data (only GOS data for the moment).
Download only (but included in UniParc -> Blast).
- UniMES FASTA sequences
- UniMES matches to InterPro methods
ftp.uniprot.org/pub/databases/uniprot
![Page 165: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/165.jpg)
![Page 166: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/166.jpg)
UniMES: sequences in fasta format
![Page 167: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/167.jpg)
![Page 168: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/168.jpg)
Menu
Introduction
Nucleic acid sequence databases ENA, GenBank, DDBJ
Protein sequence databases
UniProt databases (UniProtKB)
NCBI protein databases
![Page 169: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/169.jpg)
NCBI protein databases
(Entrez protein, NCBI nr)
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein
![Page 170: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/170.jpg)
Major ‘general’ protein sequence database ‘sources’
UniProtKB: Swiss-Prot + TrEMBL
NCBI-nr: Swiss-Prot + TrEMBL + GenPept + PIR + PDB + PRF + RefSeq + TPA
PIR PDB PRF
UniProtKB/Swiss-Prot: manually annotated protein sequences (12’000 species)
UniProtKB/TrEMBL: submitted CDS (ENA) + automated annotation; non redundant with Swiss-Prot (300’000 species)
GenPept: submitted CDS (GenBank); redundant with Swiss-Prot (300’000 species ?)
PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB
PDB: Protein Databank: 3D data and associated sequences
PRF: Protein Research Foundation journal scan of ‘published’ peptide sequences
RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction + some manual annotation (11’000 species)
TPA: Third Party Annotation
Integrated resources
‘cross-references’
Resources kept separated
TPA
not complete !!! (only entries created before 2007 ?)
![Page 171: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/171.jpg)
Query at Entrez protein
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein
![Page 172: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/172.jpg)
Typical result of a query at « Entrez protein »
RefSeq
Swiss-Prot
GenPept
![Page 173: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/173.jpg)
![Page 174: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/174.jpg)
![Page 175: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/175.jpg)
A Swiss-Prot entry with the NCBI look
![Page 176: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/176.jpg)
A TrEMBL entry with the NCBI look
!!!!
![Page 177: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/177.jpg)
GI number ‘GenInfo identifier’ number
- In addition to the AC number assigned by the original database, each protein sequence in the NCBI nr database (including Swiss-Prot entries) has a GI number.
![Page 178: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/178.jpg)
AC
![Page 179: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/179.jpg)
GI number: ‘GenInfo identifier’ number
- If the sequence changes in any way, a new GI number will be assigned:
GI identifiers provide a mechanism for identifying the exact sequence that was used or retrieved in a given search.
- A separate GI number is assigned to each protein translation (alternative products)
- A Sequence Revision History tool is available to track the various GI numbers, version numbers, and update dates for sequences that appeared in a specific GenBank record:
http://www.ncbi.nlm.nih.gov/entrez/sutils/girevhist.cgi
![Page 180: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/180.jpg)
ID/AC mapping
![Page 181: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/181.jpg)
![Page 182: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/182.jpg)
http://www.ebi.ac.uk/Tools/picr/
![Page 183: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/183.jpg)
GenPept
Translation from annotated CDS in GenBank
- Contains all translated CDS annotated in GenBank/ENA/DDBJ sequences
- Equivalent to UniProtKB/TrEMBL, except that it is redundant with other databases (Swiss-Prot, RefSeq, PIR…)
![Page 184: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/184.jpg)
GenPept: ‘translations from all annotated coding regions (CDS) in GenBank’
![Page 185: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/185.jpg)
RefSeq
Produced by NCBI and NLM
http://www.ncbi.nlm.nih.gov/books/bookres.fcgi/handbook/ch18.pdf
FAQ: http://www.ncbi.nlm.nih.gov/books/NBK50679/
http://www.ncbi.nlm.nih.gov/RefSeq/
![Page 186: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/186.jpg)
The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.
Protein – mRNA – genomic sequence
Also chromosomes, organelle genomes, plasmids, intermediate assembled genomic contigs, ncRNAs.
- tightly linked to Entrez Gene (« interdependent curated resources »)
![Page 187: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/187.jpg)
Example: NP_000790
Beware: NeXtProt accession number: NX_P00918
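Because the `NP_` and `NX_` prefixes are easy to confuse, a small helper can tell the identifier families apart. This is a sketch: the UniProtKB accession pattern follows the format published in the UniProt documentation, while the function name, the labels and the restriction of RefSeq to the `NP_`/`XP_` protein prefixes are simplifying assumptions.

```python
import re

# UniProtKB accession format (6 or 10 characters).
UNIPROT_AC = r"(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"

PATTERNS = [
    ("neXtProt", re.compile(r"NX_" + UNIPROT_AC)),
    ("RefSeq protein", re.compile(r"[NX]P_[0-9]+(?:\.[0-9]+)?")),
    ("UniProtKB", re.compile(UNIPROT_AC)),
]

def classify(identifier):
    """Return the identifier family, or 'unknown' if no pattern matches."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(identifier):
            return label
    return "unknown"

for identifier in ("NP_000790", "NX_P00918", "P04150", "A6NGE7"):
    print(identifier, "->", classify(identifier))
```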
![Page 188: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/188.jpg)
![Page 189: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/189.jpg)
KW
AC
Taxonomy
References
![Page 190: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/190.jpg)
GenBank source and status
Annotation and ontologies
![Page 191: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/191.jpg)
Curated records
![Page 192: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/192.jpg)
UniProtKB vs RefSeq
![Page 193: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/193.jpg)
![Page 194: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/194.jpg)
UniProtKB/Swiss-Prot merges all CDS available for a given gene and describes the sequence differences
UniProtKB/Swiss-Prot P04150 (GCR_HUMAN):
![Page 195: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/195.jpg)
RefSeq chooses one or several protein reference sequences for a given gene: they do not annotate the sequence differences.
- If there is an alternative splicing event, there will be several distinct entries for a given gene
Example: GCR_HUMAN
GCR_HUMAN (UniProtKB/Swiss-Prot)
1 UniProtKB entry cross-linked with 7 RefSeq entries
![Page 196: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/196.jpg)
Protein feature annotation found in RefSeq
- Conserved domains
- Signal and mature peptides
- Propagation of a subset of features from Swiss-Prot
![Page 197: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/197.jpg)
PTM annotation: Swiss-Prot vs RefSeq
GCR_HUMAN
![Page 198: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/198.jpg)
RefSeq statistics
The numbers are not comparable: one entry per sequence (RefSeq) vs one entry per gene (UniProtKB/Swiss-Prot)
![Page 199: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/199.jpg)
Summary: UniProtKB vs NCBI protein
![Page 200: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/200.jpg)
| ENA/GenBank/DDBJ | RefSeq (www.ncbi.nlm.nih.gov/RefSeq/) | UniProt (www.uniprot.org) |
| --- | --- | --- |
| Protein and nucleotide data | Genomic, RNA and protein data | Protein data only |
| Biological data added by the submitters (gene name, tissue…) | Biological data annotated by curators, also found in the corresponding Entrez Gene entry | Biological data annotated by curators (Swiss-Prot), within the entry |
| Not curated | Partially manually curated (‘reviewed’ entries) | Manually curated in Swiss-Prot, not in TrEMBL |
| Author submission | NCBI creates from existing data + gene prediction | UniProt creates from existing data |
| Only author can revise (except TPA) | NCBI revises as new data emerge | UniProt revises as new data emerge |
| Multiple records for same loci common | Single records for each molecule of major organisms | Single records for each protein from one gene of major organisms (in Swiss-Prot, TrEMBL is redundant) |
| Records can contradict each other | | Identification and annotation of discrepancy |
| No limit to species included | Limited to model organisms | Priority (but not limited) to model organisms |
| Data exchanged among INSDC members | NCBI database; collaboration with UniProt | UniProt database; collaboration with NCBI (RefSeq, CCDS) |
![Page 201: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/201.jpg)
PIR
![Page 202: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/202.jpg)
PIR: the Protein Identification Resource
PIR-PSD is no longer updated, but still exists as an archive
![Page 203: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/203.jpg)
PDB
![Page 204: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/204.jpg)
PDB
• PDB (Protein Data Bank), 3D structure
• Contains the spatial coordinates of macromolecule atoms whose 3D structure has been obtained by X-ray or NMR studies
• Also contains the corresponding protein sequences
*The PIR-NRL3D database makes the sequence information in PDB available for similarity searches and other tools
• Includes protein sequences which are mutated, chimeric, etc. (created specifically to study the effect of a mutation on the 3D structure)
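The ‘spatial coordinates’ are stored in fixed-width ATOM records. The sketch below reads atom name, residue, chain and x/y/z from one such record, using the column positions of the PDB file format (atom name in columns 13-16, residue name 18-20, chain 22, residue number 23-26, coordinates 31-54). The record itself is generated with illustrative values and does not come from a real structure; both function names are assumptions.

```python
def make_atom_line(serial, name, res_name, chain, res_seq, x, y, z):
    """Build a minimal ATOM record (columns 1-54 only) for the demo."""
    return (
        f"{'ATOM':<6}{serial:>5} {name:<4} {res_name:>3} {chain}"
        + f"{res_seq:>4}"
        + " " * 4
        + f"{x:8.3f}{y:8.3f}{z:8.3f}"
    )

def parse_atom_line(line):
    """Extract a few fields from an ATOM record by column position."""
    return {
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }

# Illustrative values only.
line = make_atom_line(1, " N", "MET", "A", 1, 27.340, 24.430, 2.614)
atom = parse_atom_line(line)
print(line)
print(atom)
```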
![Page 205: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/205.jpg)
PDB: Protein Data Bank
www.rcsb.org/pdb/
• Managed by the Research Collaboratory for Structural Bioinformatics (RCSB) (USA).
• Associated specialized programs allow visualization of the corresponding 3D structure (e.g., SwissPDB-viewer, Chime, Rasmol).
• Currently there are ~68’000 structures for about 15’000 different proteins, but far fewer protein families (highly redundant)!
![Page 206: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/206.jpg)
PDB: example
![Page 207: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/207.jpg)
Coordinates of each atom
Sequence
![Page 208: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/208.jpg)
Visualisation with Jmol
![Page 209: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/209.jpg)
PRF
Protein Research Foundation
![Page 210: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/210.jpg)
http://www.genome.jp/dbget-bin/www_bfind?prf
Looks for peptide sequences described in publications (and which are not submitted to databases !!!)
![Page 211: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/211.jpg)
![Page 212: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/212.jpg)
Other protein databases
![Page 213: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/213.jpg)
Ensembl http://www.ensembl.org/
Review: http://nar.oxfordjournals.org/cgi/content/full/35/suppl_1/D610
Annotation pipeline: http://www.genome.org/cgi/content/full/14/5/942
![Page 214: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/214.jpg)
- Ensembl: aligns the genomic sequences with all the sequences found in ENA, UniProtKB/Swiss-Prot, RefSeq and UniProtKB/TrEMBL (-> known genes)
- Also does gene prediction (-> novel genes)
Ensembl = UniProtKB + RefSeq + gene prediction
- DNA, RNA and protein sequences available for several species.
- Ensembl concentrates on vertebrate genomes, but other groups have adapted the system for use with plant, fungal and metazoa genomes.
![Page 215: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/215.jpg)
![Page 216: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/216.jpg)
![Page 217: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/217.jpg)
Example of problem (derived from gene prediction pipeline)
Ensembl completes the human ‘proteome’ by predicting/annotating missing genes according to orthologous sequences.
ID   URAD_HUMAN   Unreviewed;   171 AA.
AC   A6NGE7;
DT   24-JUL-2007, integrated into UniProtKB/TrEMBL.
DT   24-JUL-2007, sequence version 1.
DT   02-OCT-2007, entry version 3.
DE   2-oxo-4-hydroxy-4-carboxy-5-ureidoimidazoline decarboxylase homolog
DE   (OHCU decarboxylase homolog) (Parahox neighbour).
GN   Name=PRHOXNB;
…
DR   EMBL; AL591024; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   Ensembl; ENSG00000183463; Homo sapiens.
DR   HGNC; HGNC:17785; PRHOXNB.
PE   4: Predicted;
In primates the genes coding for the enzymes for the degradation of uric acid were inactivated and converted to pseudogenes.
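Entries in this flat-file format can be read line by line, since each line starts with a two-letter code (ID, AC, DR, …). A minimal Python sketch, using an abridged copy of the A6NGE7 lines quoted on the slide; the function name and the returned dictionary layout are assumptions, and only the ID, AC and DR lines are interpreted:

```python
from collections import defaultdict

ENTRY = """\
ID   URAD_HUMAN   Unreviewed;   171 AA.
AC   A6NGE7;
DT   24-JUL-2007, integrated into UniProtKB/TrEMBL.
GN   Name=PRHOXNB;
DR   EMBL; AL591024; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   Ensembl; ENSG00000183463; Homo sapiens.
DR   HGNC; HGNC:17785; PRHOXNB.
PE   4: Predicted;
"""

def parse_entry(text):
    """Collect lines by line code, then pull out name, ACs and cross-references."""
    fields = defaultdict(list)
    for raw in text.splitlines():
        parts = raw.split(None, 1)       # line code, then the rest of the line
        if len(parts) == 2:
            fields[parts[0]].append(parts[1].strip())

    accessions = [
        ac.strip() for value in fields["AC"] for ac in value.split(";") if ac.strip()
    ]
    xrefs = {}
    for value in fields["DR"]:
        items = [item.strip() for item in value.rstrip(".").split(";")]
        xrefs[items[0]] = items[1]       # database name -> primary identifier
    return {
        "entry_name": fields["ID"][0].split()[0],
        "accessions": accessions,
        "xrefs": xrefs,
    }

entry = parse_entry(ENTRY)
print(entry)
```

The primary accession number is the first item of `accessions`; here the entry has only one.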
![Page 219: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/219.jpg)
![Page 220: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/220.jpg)
Automatic approach that builds clusters through combining knowledge already present in the primary data source (UniProtKB, RefSeq, Ensembl) and sequence similarity.
IPI = UniProtKB + RefSeq + Ensembl (+ H-InvDB, TAIR + VEGA).
!!! Complete proteome sets include all alternative splicing sequences….
Available for human, mouse, rat, zebrafish, Arabidopsis, chicken and cow
![Page 221: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/221.jpg)
![Page 222: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/222.jpg)
![Page 223: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/223.jpg)
![Page 224: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/224.jpg)
CCDS
![Page 225: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/225.jpg)
http://www.ncbi.nlm.nih.gov/CCDS/
![Page 226: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/226.jpg)
CCDS (human, mouse)
Combining different approaches – ab initio, by
similarity - and taking advantage of the expertise
acquired by different institutes, including manual
annotation…
Consensus between 4 institutions…
![Page 227: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/227.jpg)
![Page 228: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/228.jpg)
Gene Ontology (GO)
![Page 229: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/229.jpg)
Standards: why are they so important?
• ‘The ever-increasing number of sequencing projects necessitates a standardized system (…) to ensure that the flood of information produced can be effectively utilized.’ (PMID 19577473)
• Standardization of biological data/information (data sharing and computational analysis).
• Aim: extract and compare annotation between different resources or species (semantic similarity).
![Page 230: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/230.jpg)
Secreted or not secreted ?
PMID: 19299134
![Page 231: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/231.jpg)
Gene Ontology (GO)
• The Gene Ontology is a controlled vocabulary, a set of standard terms (words and phrases) used for indexing and retrieving information. In addition to defining terms, GO also defines the relationships between the terms, making it a structured vocabulary. Contains ~30’000 terms.
![Page 232: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/232.jpg)
Gene Ontology (GO) terms
biological process
• broad biological phenomena, e.g. mitosis, growth, digestion
molecular function
• molecular role, e.g. catalytic activity, binding
cellular component
• subcellular location, e.g. nucleus, ribosome, origin recognition complex
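What makes GO a structured vocabulary is that terms are linked to broader terms. The sketch below stores a tiny ‘is a’ hierarchy as a dictionary and walks up to collect all ancestors of a term. Only four molecular function terms are included and GO identifiers are left out on purpose; the dictionary layout and function name are assumptions for illustration.

```python
# Child term -> list of parent terms ("is a" relationships only).
IS_A = {
    "protein binding": ["binding"],
    "binding": ["molecular_function"],
    "catalytic activity": ["molecular_function"],
    "molecular_function": [],
}

def ancestors(term, is_a=IS_A):
    """Return every broader term reachable from `term`."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen

print(sorted(ancestors("protein binding")))
```

A protein annotated with ‘protein binding’ is therefore also retrieved by a query for ‘binding’, which is what allows annotation to be compared across resources.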
![Page 233: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/233.jpg)
GO terms associated with human Erythropoietin
![Page 234: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/234.jpg)
![Page 235: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/235.jpg)
http://www.geneontology.org
![Page 236: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/236.jpg)
Caveats
• Annotation is the process of assigning/mapping GO terms to gene products…
• Electronic vs Manual annotation…
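Each GO annotation carries an evidence code, which is how electronic and experimentally supported annotations can be told apart. A minimal sketch, assuming annotations are available as (protein, term, evidence code) tuples; the protein names are invented, while IEA and the experimental codes listed are standard GO evidence codes.

```python
# Experimental evidence codes; here IPI means "Inferred from Physical
# Interaction", not the International Protein Index.
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}
ELECTRONIC = {"IEA"}  # Inferred from Electronic Annotation

def split_by_evidence(annotations):
    """Return (proteins with experimental support, proteins with IEA only)."""
    with_experiment, with_electronic = set(), set()
    for protein, _term, code in annotations:
        if code in EXPERIMENTAL:
            with_experiment.add(protein)
        elif code in ELECTRONIC:
            with_electronic.add(protein)
    return with_experiment, with_electronic - with_experiment

# Invented annotations for illustration.
annotations = [
    ("protA", "hormone activity", "IDA"),
    ("protA", "extracellular region", "IEA"),
    ("protB", "binding", "IEA"),
    ("protC", "catalytic activity", "IEA"),
]
experimental, electronic_only = split_by_evidence(annotations)
print(sorted(experimental), sorted(electronic_only))
```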
![Page 237: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/237.jpg)
Example with EPO
![Page 238: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/238.jpg)
![Page 239: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/239.jpg)
![Page 240: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/240.jpg)
Histone H4
!!! Large scale derived data (‘proteome’)
![Page 241: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/241.jpg)
GO terms: essential link between biological knowledge and high-throughput genomic and proteomic datasets…
PMID: 15514041
‘summary of the gene ontology classifications for all mapped ESTs…’
![Page 242: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/242.jpg)
Human proteins functional distribution
Maybe
Potentially
Putative
Expected
Probably
Hopefully
~40 % of human proteins have no known function (experimental data)…but many more are associated with GO terms…(computer-assigned).
![Page 243: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics](https://reader037.fdocuments.net/reader037/viewer/2022110405/56813321550346895d99f6bf/html5/thumbnails/243.jpg)
All documents (including practicals) are online
http://education.expasy.org/cours/UniProt/