Genome, Protein and Model Organism Databases
![Page 1: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/1.jpg)
Genome, Protein and Model Organism Databases
Anne Estreicher, Swiss-Prot Group
Swiss Institute of Bioinformatics, Geneva – Switzerland
Bioinformatic and Comparative Genome Analysis Course
HKU-Pasteur Research Centre - Hong Kong, China
August 17 - August 29, 2009
![Page 2: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/2.jpg)
Outline
1. Introduction (definitions, history…)
2. From DNA sequence to genomic tools
3. The flow of information: from DNA to proteins
4. Protein sequence databases
5. MODs at a glance
![Page 3: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/3.jpg)
What is a database?
• A collection of related data, which are
  – structured
  – searchable
  – updated periodically
  – cross-referenced
• Also includes the associated tools necessary for access/query, download, etc.
![Page 4: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/4.jpg)
Why do we need databases?
Data need to be stored, curated and made available for analysis and knowledge discovery
Efficient way of sharing data, independently of regular publications
Essential resources for both experimental and computational biologists
![Page 5: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/5.jpg)
Databases in biology: not a new issue…
• 1954 First protein sequence (insulin by F. Sanger)
• 1965 Atlas of Protein Sequence and Structure (65 proteins)
![Page 6: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/6.jpg)
The first protein sequence "database" by Margaret Dayhoff (1965)
contained 65 proteins
![Page 7: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/7.jpg)
Databases: not a new issue…
• 1954 First protein sequence (insulin by F. Sanger)
• 1965 Atlas of Protein Sequence and Structure (65 proteins)
• Mid 70s Improvements in DNA sequencing
• 1979 Los Alamos Sequence Library (Walter Goad)
• 1980 ~80 genes fully sequenced
-> Need to store the data and to make them available for analysis (in a format acceptable for human eyes and machines)
-> ARCHIVE
-> RACE for the central position in life sciences… And the winner is…
![Page 8: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/8.jpg)
Databases: not a new issue…
EMBL-Bank - Europe 1980
GenBank - USA 1982
DDBJ - Asia 1986
leading to the establishment of the INSDC (International Nucleotide Sequence Database Collaboration) -> daily exchanges of data
![Page 9: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/9.jpg)
www.insdc.org
![Page 10: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/10.jpg)
EMBL-Bank - GenBank - DDBJ
• Main resources for DNA and RNA sequences;
• Sequences used to be retrieved from publications -> now direct submissions from individual researchers, genome sequencing projects and patent applications:
“Journal publishers generally require sequence deposition prior to publication so that an accession number can be included in the paper.”
1. True for nucleic acid, not for protein sequences;
2. Not always put into practice
=> Sequences that are not submitted are LOST!!!
• Archives (primary databases)
• Data belong to the submitters
![Page 11: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/11.jpg)
EMBL-Bank - GenBank - DDBJ
Archive (primary databases) => data belong to the submitter
Minimal checks, such as vector contamination
Annotation by the submitters
![Page 12: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/12.jpg)
Databases: not a new issue…
• 1954 First protein sequence (insulin by F. Sanger)
• 1965 Atlas of Protein Sequence and Structure (65 proteins)
• 1979 Los Alamos Sequence Library (Walter Goad) – DNA
• 1982 EMBL-Bank - DNA
• 1984 GenBank – DNA
• 1986 DDBJ - DNA
![Page 13: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/13.jpg)
Databases: not a new issue…
• 1954 First protein sequence (insulin by F. Sanger)
• 1965 Atlas of Protein Sequence and Structure (65 proteins)
• 1979 Los Alamos Sequence Library (Walter Goad) – DNA
• 1982 EMBL-Bank - DNA
• 1984 GenBank – DNA
• 1986 DDBJ - DNA
-> ARCHIVES (primary databases) may not be sufficient
-> need to annotate the data to produce KNOWLEDGE
• 1986 Swiss-Prot – protein sequences – a paradigm for annotated (secondary) databases
![Page 14: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/14.jpg)
The Swiss-Prot concept
Non-redundant: protein products of 1 gene / 1 species -> 1 entry,
Manually annotated (=> curator judgement on data!),
Highly cross-referenced (1st life-science database to provide cross-references; links to > 130 databases from www.uniprot.org).
![Page 15: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/15.jpg)
Databases: not a new issue…
• 1954 First protein sequence (insulin by F. Sanger)
• 1965 Atlas of Protein Sequence and Structure (65 proteins)
• 1979 Los Alamos Sequence Library (Walter Goad) – DNA
• 1982 EMBL-Bank - DNA
• 1984 GenBank – DNA
       Protein Information Resource (PIR) – protein sequences
• 1986 DDBJ – DNA
       Swiss-Prot – protein sequences
• 1996 TrEMBL (Translated EMBL) – protein sequences
       Complement of Swiss-Prot to cope with the increasing amount of new sequences; AUTOMATIC ANNOTATION!
![Page 16: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/16.jpg)
[Figure: UniProtKB/Swiss-Prot growth. Number of entries (0 to 500'000) plotted against release number (2 to 57)]
1986: 3’939 entries
1996: creation of TrEMBL; Swiss-Prot: 52’205 entries, TrEMBL: 61’137 entries
Swiss-Prot rel. 57.5 (07-Jul-2009): 470’369 entries
![Page 17: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/17.jpg)
[Figure: UniProtKB growth, 1986 to 2009. Number of entries (0 to 9'000'000) plotted against release number; TrEMBL: automated curation, Swiss-Prot: manual curation]
TrEMBL rel. 40.5 (07-Jul-2009): 8’594’382 entries
Swiss-Prot rel. 57.5 (07-Jul-2009): 470’369 entries
TrEMBL growth (sequences/day):
2004: 1’500
2006-2007: 3’500
2008: >5’000
2009: ~8’000
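To get a feeling for these growth rates, the short sketch below converts the sequences/day figures quoted on this slide into yearly totals and compares them with the size of Swiss-Prot. It is only back-of-the-envelope arithmetic: the function and constant names are illustrative, a 365-day year is assumed, and the 2008 value (">5'000") is treated as a lower bound.

```python
# Sequences added to TrEMBL per day, as quoted on the slide.
# The 2008 figure is given as ">5'000", so 5_000 is a lower bound.
TREMBL_DAILY_GROWTH = {
    "2004": 1_500,
    "2006-2007": 3_500,
    "2008": 5_000,
    "2009": 8_000,
}

# Swiss-Prot rel. 57.5 (07-Jul-2009), as quoted on the slide.
SWISS_PROT_ENTRIES_JUL_2009 = 470_369


def entries_per_year(per_day: int, days: int = 365) -> int:
    """Yearly number of new entries for a constant daily rate."""
    return per_day * days


def days_to_match(total_entries: int, per_day: int) -> float:
    """Days needed, at a constant daily rate, to add `total_entries`."""
    return total_entries / per_day


if __name__ == "__main__":
    for period, per_day in TREMBL_DAILY_GROWTH.items():
        print(f"{period}: ~{entries_per_year(per_day):,} new entries/year")
    days = days_to_match(SWISS_PROT_ENTRIES_JUL_2009, TREMBL_DAILY_GROWTH["2009"])
    print(f"At the 2009 rate, TrEMBL grows by one Swiss-Prot in ~{days:.0f} days")
```

At the 2009 rate, automated annotation adds the equivalent of the whole manually curated Swiss-Prot in roughly two months, which is the gap the slide illustrates.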
![Page 18: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/18.jpg)
New challenge
Flood of data -> need to be stored, curated and made available for analysis and knowledge discovery
![Page 19: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/19.jpg)
(R)evolution of these last 20 years
Life sciences used to be rich in hypotheses, well-off in knowledge and poor in data;
Today they are very rich in data, not so well-off in knowledge and very poor in hypotheses.
List of parts
??
Complex system
![Page 20: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/20.jpg)
Science (1993) 262, 502
![Page 21: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/21.jpg)
EMBL Database Growth
http://www.ebi.ac.uk/embl/Services/DBStats/
Danger!
![Page 22: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/22.jpg)
http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html
http://www.ncbi.nlm.nih.gov/genomes/GenomesHome.cgi?taxid=10239&hopt=stat
In 4 months: 374 new genomes, and 77 were completed
~ 100 genomes/month
(in 2008 -> ~50 genomes/month)
+ ~2’360 viral (& viroid) genomes
=> Total ~ 5’600 genomes
![Page 23: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/23.jpg)
http://genomesonline.org/index2.htm
![Page 24: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/24.jpg)
http://www.genomesonline.org/gold.cgi
![Page 25: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/25.jpg)
![Page 26: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/26.jpg)
http://www.genomesonline.org/gold.cgi
![Page 27: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/27.jpg)
Metagenomics:
study of genetic material recovered directly from environmental samples
• Global Ocean Sampling (C. Venter)
• Whale fall
• Soil, sand beach, New-York air, …
• Human fluids, mouse gut
• …
Venter’s Sorcerer II
![Page 28: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/28.jpg)
Flood in the world of proteins…
1965: first protein sequence "database" by Margaret Dayhoff (65 proteins)
July 2009: ~20 million unique protein sequences (source: UniParc - http://www.uniprot.org/uniparc/)
UniParc: non-redundant database that contains most of the publicly available protein sequences in the world (includes sequences from EMBL-Bank/DDBJ/GenBank nucleotide sequence databases, Ensembl, FlyBase, H-Invitational Database (H-Inv), International Protein Index (IPI), Patent Offices (EPO, JPO and USPTO), PIR-PSD, Protein Data Bank (PDB), Protein Research Foundation (PRF), RefSeq, Saccharomyces Genome Database (SGD), The Arabidopsis Information Resource (TAIR), TROME, UniProtKB/Swiss-Prot and TrEMBL, Vertebrate Genome Annotation database (VEGA) and WormBase).
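The idea of a non-redundant sequence archive can be illustrated with a minimal sketch: each distinct sequence is stored once, together with the list of source records in which it was seen. This is not UniParc's actual implementation; the function name, the source database names and the identifiers below are all made up for illustration.

```python
from collections import defaultdict


def build_nonredundant_archive(records):
    """Group (source_db, identifier, sequence) records by identical sequence.

    Returns a dict mapping each distinct sequence to the list of
    (source_db, identifier) pairs in which that sequence was found.
    """
    archive = defaultdict(list)
    for source_db, identifier, sequence in records:
        # Normalise case and drop whitespace so that formatting
        # differences do not create artificial duplicates.
        normalized = "".join(sequence.split()).upper()
        archive[normalized].append((source_db, identifier))
    return dict(archive)


# Hypothetical toy records: the first two carry the same sequence.
TOY_RECORDS = [
    ("SourceA", "A0001", "MKTAYIAKQR"),
    ("SourceB", "B0042", "mktayiakqr"),
    ("SourceC", "C0007", "MSDNELLKQA"),
]

if __name__ == "__main__":
    archive = build_nonredundant_archive(TOY_RECORDS)
    print(f"{len(TOY_RECORDS)} records -> {len(archive)} unique sequences")
    for sequence, sources in archive.items():
        print(sequence, sources)
```

Keeping the cross-references next to each unique sequence is what allows one archive entry to point back to every database that holds the same protein.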
![Page 29: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/29.jpg)
New challenge
Flood of data
Flood of databases…
![Page 30: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/30.jpg)
The first NAR issue of each year is always dedicated to databases, and a "clean" list of databases is provided (! not exhaustive !)
![Page 31: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/31.jpg)
The NAR Online Molecular Biology Database collection in 2009
A total of 1’170 databases (19 obsolete removed)
http://www.oxfordjournals.org/nar/database/a/
![Page 32: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/32.jpg)
NAR "clean" list of databases
http://www.oxfordjournals.org/nar/database/a/
![Page 33: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/33.jpg)
Most recent NAR paper about the database
(not available for all db, some described in
other journals)
![Page 34: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/34.jpg)
A "clean" list of databases can be found in the NAR online molecular biology database collection
http://www.oxfordjournals.org/nar/database/a/
![Page 35: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/35.jpg)
![Page 36: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/36.jpg)
BIOLOGICAL DATABASE CATEGORIES
• Databases of nucleic acid sequences (RNA, DNA)
• Databases of protein sequences
• Databases of protein motifs and protein domains
• Databases of structures
• Databases of genomes
• Databases of genes
• Databases of expression profiles
• Databases of SNPs and mutations
• Databases of metabolic pathways
• Databases of protein interactions
• Databases of taxonomy
• …
Databases containing sequences or data directly derived from sequences.
![Page 37: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/37.jpg)
DNA sequences:
What? Where? How?
& genomic tools
NCBI
UCSC
![Page 38: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/38.jpg)
GenBank entry AF415175
http://www.ncbi.nlm.nih.gov/nuccore/16589063
Accession number
Molecule type
Date of submission
Definition
Nucleotide sequence
Stable accession number (should always be cited in publications)
Possible molecule types:
• genomic DNA and RNA
• mRNA
• rRNA
• tRNA
• viral cRNA
• other DNA and RNA
• transcribed RNA
• unassigned DNA and RNA
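Besides the web page, a GenBank record can also be retrieved programmatically by its accession number. The sketch below uses NCBI's E-utilities `efetch` service, which is not mentioned in the lecture and is brought in here only as an illustration; the helper names are hypothetical, and the network call is kept in a separate function so the URL can be inspected without contacting NCBI.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the NCBI E-utilities efetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_flatfile_url(accession: str) -> str:
    """Build the efetch URL that returns a nucleotide record as a GenBank flat file."""
    params = {
        "db": "nuccore",     # NCBI nucleotide database
        "id": accession,     # accession number, e.g. AF415175
        "rettype": "gb",     # GenBank flat-file format
        "retmode": "text",   # plain text
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


def fetch_genbank_flatfile(accession: str, timeout: float = 30.0) -> str:
    """Download the flat file (requires network access)."""
    with urlopen(genbank_flatfile_url(accession), timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only prints the URL; call fetch_genbank_flatfile() to download the record.
    print(genbank_flatfile_url("AF415175"))
```

The returned flat file carries the same fields as the web view: accession number, definition, taxonomy, references, features and the sequence itself.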
![Page 39: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/39.jpg)
Accession number
Molecule type
Date of submission
Definition
Nucleotide sequence
Taxonomy
![Page 40: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/40.jpg)
Accession number
Molecule type
Date of submission
Definition
Nucleotide sequence
Taxonomy
References
![Page 41: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/41.jpg)
Accession number
Molecule type
Date of submission
Definition
Nucleotide sequence
Taxonomy
References
Features:
Information provided by the submitter
May include annotation of the sequence
• Organism
• Molecule type
• Chromosomal location
• Tissue type
• Gene name
• CDS annotation => protein sequence + Protein IDentifier (PID: stable identifier & version number)
![Page 42: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/42.jpg)
Protein sequence
Gives access to the nucleic acid sequence of the CDS (not of the entire mRNA)
![Page 43: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/43.jpg)
"Features" may provide much more information, depending upon the sequence and the submitter…
3’end of chromosome Y
EMBL #AJ271736
![Page 44: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/44.jpg)
Very similar view, links and options from the 3 sites:
EMBL-Bank – GenBank - DDBJ
http://www.ddbj.nig.ac.jp/
http://www.ebi.ac.uk/embl/
http://www.ncbi.nlm.nih.gov/
![Page 45: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/45.jpg)
How to find a DNA sequence at the NCBI…
![Page 46: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/46.jpg)
http://www.ncbi.nlm.nih.gov/
![Page 47: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/47.jpg)
Databases @ NCBI
http://www.ncbi.nlm.nih.gov/Database/datamodel/index.html
The Entrez system:
integrated, text-based search and retrieval system used at NCBI for the major databases, including PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, Taxonomy, and others
=> Maximal interconnectivity
![Page 48: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/48.jpg)
Databases @ NCBI
http://www.ncbi.nlm.nih.gov/Database/datamodel/index.html
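Entrez text searches across these NCBI databases can also be issued from a script. The sketch below builds a query URL for the E-utilities `esearch` service, which is an addition to the lecture material; the function name and the default of 20 results are illustrative choices. The database is selected with the `db` parameter, for example `nuccore`, `protein` or `pubmed`.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities esearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def entrez_search_url(database: str, term: str, max_results: int = 20) -> str:
    """Build an esearch URL for a text query against one Entrez database."""
    params = {
        "db": database,         # Entrez database name, e.g. "nuccore"
        "term": term,           # query text, e.g. an accession or a gene name
        "retmax": max_results,  # maximum number of record IDs to return
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Same accession number as the GenBank example entry in this lecture.
    print(entrez_search_url("nuccore", "AF415175"))
    # A gene-name query; urlencode takes care of spaces and brackets.
    print(entrez_search_url("nuccore", "BDNF[Gene Name]"))
```

The service answers with a list of record identifiers, which can then be used to retrieve the records themselves.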
![Page 49: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/49.jpg)
Simple search with an EMBL-Bank/GenBank/DDBJ accession number
![Page 50: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/50.jpg)
![Page 51: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/51.jpg)
![Page 52: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/52.jpg)
Searching from a bibliographic reference…
![Page 53: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/53.jpg)
![Page 54: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/54.jpg)
Search results 2 and 3 -> accession numbers provided by the authors in the article -> GenBank records
Search result 1 -> corresponds to the RefSeq database…
![Page 55: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/55.jpg)
RefSeq (Reference Sequence)
• Provides a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins;
• Most data extracted from GenBank -> choice of a reference sequence and annotation (no documented comparison between sequences)
• Some entries based on predictions (accession: XM_; XR_; XP_; ZP_);
• Currently, 8'665 species represented;
• Annotation: Manual annotation (only in entries tagged as "reviewed"); Collaboration; Propagation from other sources; Computation.
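The accession prefixes listed above make it easy to flag prediction-based RefSeq entries in a list of accessions. A minimal sketch, using only the four prefixes named on the slide; the function name is illustrative and the `XM_` accession in the example is made up.

```python
# Prefixes that, according to the slide, mark entries based on predictions.
PREDICTED_PREFIXES = ("XM_", "XR_", "XP_", "ZP_")


def is_predicted_refseq(accession: str) -> bool:
    """Return True if the RefSeq accession carries a 'predicted' prefix."""
    return accession.strip().upper().startswith(PREDICTED_PREFIXES)


if __name__ == "__main__":
    # NM_015595 is the RefSeq mRNA entry used later in this lecture;
    # XM_000001 is a made-up accession for illustration only.
    for accession in ("NM_015595", "XM_000001"):
        label = "predicted" if is_predicted_refseq(accession) else "not flagged as predicted"
        print(f"{accession}: {label}")
```

A check like this says nothing about the curation status of an entry, which is recorded separately in the record itself.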
![Page 56: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/56.jpg)
RefSeq (Reference Sequence)
Status code: CURATION
GENOME ANNOTATION: No
INFERRED: No
MODEL: No
PREDICTED: No
PROVISIONAL: No
REVIEWED: Yes (sequence + functional information and features)
VALIDATED: Yes (initial sequence)
WGS: No
![Page 57: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/57.jpg)
RefSeq entry NM_015595: SGEF mRNA
Accession number
Definition
Taxonomy
List of references
![Page 58: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/58.jpg)
RefSeq entry NM_015595: SGEF mRNA
Gene name
Exon annotation
CDS annotation and sequence
![Page 59: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/59.jpg)
RefSeq entry NM_015595: SGEF mRNA
Sequence
![Page 60: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/60.jpg)
Searching with the gene name…
![Page 61: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/61.jpg)
Etc.
![Page 62: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/62.jpg)
Etc.
GenBank
RefSeq
![Page 63: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/63.jpg)
NCBI Entrez system
Looks for the request in all NCBI databases
Cannot be ignored -> no simple way to search only in your favourite NCBI database
![Page 64: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/64.jpg)
Searching using BLAST…
![Page 65: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/65.jpg)
![Page 66: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/66.jpg)
![Page 67: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/67.jpg)
![Page 68: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/68.jpg)
![Page 69: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/69.jpg)
RefSeq
![Page 70: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/70.jpg)
![Page 71: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/71.jpg)
![Page 72: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/72.jpg)
UniGene: clusters of transcript sequences that appear to come from the same transcription locus
![Page 73: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/73.jpg)
!?UniSTS:62643 maps to multiple loci in Homo sapiens
Information on tissue expression
![Page 74: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/74.jpg)
UniGene
Mapping of known genes
![Page 75: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/75.jpg)
UniGene
Mapping of known genes
Mapping of RNA (EMBL/GenBank/DDBJ & RefSeq)
![Page 76: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/76.jpg)
UniGene
Mapping of RNA (EMBL/GenBank/DDBJ & RefSeq)
Mapping of RefSeq RNA
Mapping of known genes
![Page 77: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/77.jpg)
UniGene
Mapping of RNA (EMBL/GenBank/DDBJ & RefSeq)
Mapping of RefSeq RNA
This default view can be customized
Mapping of known genes
![Page 78: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/78.jpg)
1. Choose the desired option;
2. Add it (and/or remove undesired ones);
3. Apply the new display
![Page 79: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/79.jpg)
Zoom out -> a better view of the genomic
context of the sequence of interest
![Page 80: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/80.jpg)
Original view
![Page 81: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/81.jpg)
Map viewer
~ 110 organisms represented in Genome database.
(www.ncbi.nlm.nih.gov/sites/entrez?db=genome)
![Page 82: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/82.jpg)
Genomic tools on the UCSC server: BLAT search
![Page 83: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/83.jpg)
And: A. gambiae, A. mellifera, S. cerevisiae
a total of 47 organisms
![Page 84: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/84.jpg)
http://genome.ucsc.edu/cgi-bin/hgBlat
Feb. 2009 assembly: not all data implemented! May be better to use the former assembly for the time being.
Genome browser @ UCSC
cDNA sequence
![Page 85: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/85.jpg)
![Page 86: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/86.jpg)
![Page 87: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/87.jpg)
Chromosomal location
Consensus CDS & other sequences from reliable resources
gDNA sequence
![Page 88: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/88.jpg)
http://www.ncbi.nlm.nih.gov/projects/CCDS/CcdsBrowse.cgi
Annotation of genes is provided by multiple public resources, using different methods, and resulting in information that is similar but not always identical.
CCDS database goal: provide a standard set of gene annotations.
Collaborative project involving teams (manual and automated annotation):
* European Bioinformatics Institute (EBI)
* National Center for Biotechnology Information (NCBI)
* Wellcome Trust Sanger Institute (WTSI)
* University of California, Santa Cruz (UCSC)
Currently available only for human and mouse genomes (July 2009):
20'159 human CCDS (including isoforms) -> 17'054 CCDS genes
17'707 mouse CCDS (including isoforms) -> 16'889 CCDS genes
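The CCDS figures quoted above give a rough idea of how many consensus coding sequences exist per gene. The sketch below only divides the numbers from this slide; the names are illustrative.

```python
# Counts quoted on the slide (CCDS, July 2009).
CCDS_COUNTS = {
    "human": {"ccds_ids": 20_159, "genes": 17_054},
    "mouse": {"ccds_ids": 17_707, "genes": 16_889},
}


def ccds_per_gene(species: str) -> float:
    """Average number of CCDS (isoforms included) per CCDS gene."""
    counts = CCDS_COUNTS[species]
    return counts["ccds_ids"] / counts["genes"]


if __name__ == "__main__":
    for species in CCDS_COUNTS:
        print(f"{species}: {ccds_per_gene(species):.2f} CCDS per gene")
```

Both ratios are close to 1, so at this date most CCDS genes had a single consensus coding sequence and only a minority carried annotated isoforms.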
![Page 89: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/89.jpg)
Chromosomal location
Consensus CDS & other sequences from reliable resources
gDNA sequence
(Human) ESTs (including unspliced)
(Human) spliced ESTs
(Human) mRNAs
All sequences can be retrieved
![Page 90: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/90.jpg)
The view can be completely
customized…
![Page 91: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/91.jpg)
…including with various tools
allowing comparative
genomics
![Page 92: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/92.jpg)
…and including your own data !
http://genome.ucsc.edu/
![Page 93: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/93.jpg)
Back to the BLAT viewer
![Page 94: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/94.jpg)
Arrows >>>> show the direction of transcription
![Page 95: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/95.jpg)
2 transcripts from the same locus:
BDNF (Brain-Derived Neurotrophic Factor)
BDNFOS (BDNF Opposite Strand)
![Page 96: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/96.jpg)
Exons
![Page 97: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/97.jpg)
View of alternative exons
Alternative exons
Constitutive exons
![Page 98: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/98.jpg)
Interested in this exon?
Just zoom in…
![Page 99: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/99.jpg)
![Page 100: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/100.jpg)
Genome browser @ UCSC has many great options, give it a try!
http://genome.ucsc.edu/
![Page 101: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/101.jpg)
Typical problems
or
Why wonderful tools will never replace the brain of a life scientist!
![Page 102: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/102.jpg)
![Page 103: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/103.jpg)
… Once upon a time, there was a gene on chromosome 11…
![Page 104: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/104.jpg)
2 essential genome resources are missing from this lecture:
Ensembl (http://www.ensembl.org/index.html): automated annotation of many genomes;
Vega (http://vega.sanger.ac.uk/index.html): high-quality manual annotation of genomes (currently Homo sapiens, Mus musculus, Danio rerio, Gorilla gorilla, Macropus eugenii, Sus scrofa, Canis familiaris).
Please go and visit them!
![Page 105: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/105.jpg)
The flow of information
From DNA sequences to protein sequences:
A little biology and a few databases
![Page 106: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/106.jpg)
From genome to proteome: the example of human
Genome: ~ 20’500 human protein-encoding genes
Genome -> Transcriptome: increase in complexity 2-5 x (alternative promoter usage, alternative splicing, trans-splicing, mRNA editing, …)
Transcriptome: ~ 100’000 human transcripts
Transcriptome -> Proteome: increase in complexity 5-10 x (post-translational modifications, PTMs)
Proteome: ~ 1'000'000 human proteins
Most PTMs cannot be predicted from DNA sequences
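The orders of magnitude on this slide can be checked with a little arithmetic. The sketch below assumes that the 2-5 x factor applies to the genome-to-transcriptome step and the 5-10 x factor to the transcriptome-to-proteome step, which is the reading that matches the approximate counts quoted on the slide. The function name is only illustrative.

```python
# Approximate figures quoted on the slide (orders of magnitude, not measurements).
HUMAN_GENES = 20_500
HUMAN_TRANSCRIPTS = 100_000


def scaled_range(count, low_factor, high_factor):
    """Return the (low, high) range obtained by multiplying count by each factor."""
    return count * low_factor, count * high_factor


# Assumption: 2-5 x is the genome -> transcriptome increase in complexity.
transcript_range = scaled_range(HUMAN_GENES, 2, 5)
# Assumption: 5-10 x is the transcriptome -> proteome increase in complexity.
protein_range = scaled_range(HUMAN_TRANSCRIPTS, 5, 10)

print(transcript_range)  # (41000, 102500)
print(protein_range)     # (500000, 1000000)
```

The upper bounds land close to the ~100’000 transcripts and ~1'000'000 proteins given on the slide.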
![Page 107: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/107.jpg)
The hectic life of a protein sequence…
cDNAs, ESTs, genomes, …
Data not submitted to public databases, delayed or cancelled…
Nucleic acid databases: EMBL, GenBank, DDBJ (www.insdc.org, International Nucleotide Sequence Database Collaboration)
…if a Coding Sequence (CDS) is submitted -> Protein sequence databases
no CDS -> Gene prediction (RefSeq, Ensembl + some MODs)
Sequences from publications (journal scan)
Direct submissions
![Page 108: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/108.jpg)
!!!!
99% of the protein sequences found in databases come from the translation of nucleotide sequences
=> Experimental evidence may be lacking!
![Page 109: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/109.jpg)
EMBL (DNA): translated CDS, product name, tissue, reference
TrEMBL (Translated EMBL): translated CDS, protein name, reference + tissue
Automated extraction of protein sequence (translated CDS), gene name and references + automated annotation.
A similar pipeline is used at the NCBI to go from GenBank to GenPept
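The core step of such a pipeline, turning a submitted CDS into a protein sequence, can be sketched in a few lines. This is a minimal illustration and not the code used by EMBL or the NCBI: it assumes the standard genetic code, a CDS that starts in frame, and no special handling of ambiguous bases beyond writing "X". The name `translate_cds` is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Translate an in-frame coding sequence, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")  # accept RNA as well as DNA input
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")  # X = unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate_cds("ATGGCCAAGTAA"))  # MAK
```

Nothing in this step checks that the protein is actually made, which is why a translated CDS alone is not experimental evidence for a protein.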
![Page 110: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/110.jpg)
!!!!
The quality of UniProtKB/TrEMBL (& GenPept) entries depends upon the quality of the submissions in the original EMBL-Bank/GenBank/DDBJ entry.
![Page 111: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/111.jpg)
EMBL
TrEMBL
![Page 112: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/112.jpg)
![Page 113: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/113.jpg)
EMBL (DNA): translated CDS, product name, tissue, reference
TrEMBL: translated CDS, protein name, reference
Automated extraction of protein sequence (translated CDS), gene name and references. Automated annotation.
Swiss-Prot: protein names, many more references, translated CDS + SAPs + isoforms + …, full annotation
Manual annotation of the sequence and review of associated biological information
![Page 114: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/114.jpg)
Sequence
Sequence features
Ontologies
References
Nomenclature
Splice variants
Annotations
![Page 115: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/115.jpg)
Evidence for protein existence: annotation in UniProtKB
5 levels of evidence:
1. evidence at protein level,
2. evidence at transcript level,
3. inferred by homology,
4. predicted,
5. uncertain.
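Because the five levels are ordered, they are convenient to handle as an ordered enumeration when filtering entries. The sketch below is only an illustration: the class, the helper function and the entry names are hypothetical, and the levels themselves are taken from the slide.

```python
from enum import IntEnum


class ProteinExistence(IntEnum):
    """The 5 levels of evidence, from strongest (1) to weakest (5)."""
    PROTEIN_LEVEL = 1
    TRANSCRIPT_LEVEL = 2
    HOMOLOGY = 3
    PREDICTED = 4
    UNCERTAIN = 5


def at_least_transcript_level(level):
    """True when the evidence is at protein or transcript level."""
    return level <= ProteinExistence.TRANSCRIPT_LEVEL


# Hypothetical entries (made-up names), each tagged with an evidence level.
entries = {
    "entry_A": ProteinExistence.PROTEIN_LEVEL,
    "entry_B": ProteinExistence.PREDICTED,
    "entry_C": ProteinExistence.TRANSCRIPT_LEVEL,
    "entry_D": ProteinExistence.UNCERTAIN,
}

kept = sorted(name for name, level in entries.items()
              if at_least_transcript_level(level))
print(kept)  # ['entry_A', 'entry_C']
```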
![Page 116: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/116.jpg)
http://www.uniprot.org/uniprot/P35613
![Page 117: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/117.jpg)
![Page 118: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/118.jpg)
http://www.uniprot.org/uniprot/Q9Y471
![Page 119: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/119.jpg)
http://www.uniprot.org/uniprot/Q9Y471
![Page 120: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/120.jpg)
UniProtKB/Swiss-Prot: 115 explicit links and 19 implicit links!
2D-gel dbs: 2DBase-Ecoli, ANU-2DPAGE, Aarhus/Ghent-2DPAGE (no server), COMPLUYEAST-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE (no server), HSC-2DPAGE, OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE, World-2DPAGE
Family and domain dbs: Gene3D, HAMAP, InterPro, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, TIGRFAMs
Organism-specific dbs: AGD, BuruList, CGD, CTD, CYGD, DictyBase, EchoBASE, EcoGene, euHCVdb, FlyBase, GenAtlas, GeneCards, GeneDB_Spombe, GeneFarm, Gramene, H-InvDB, HGNC, HPA, LegioList, Leproma, ListiList, MaizeGDB, MGI, MIM, MypuList, Orphanet, PharmGKB, PhotoList, PseudoCAP, RGD, SagaList, SGD, SubtiList, TAIR, TubercuList, WormBase, WormPep, Xenbase, ZFIN
Protein family/group dbs: CAZy, MEROPS, PeroxiBase, PptaseDB, REBASE, TCDB
Genome annotation dbs: Ensembl, GeneID, GenomeReviews, KEGG, NMPDR, TIGR, UCSC, VectorBase
Enzyme and pathway dbs: BioCyc, BRENDA, Pathway_Interaction_DB, Reactome
Sequence dbs: EMBL, IPI, PIR, UniGene, RefSeq
3D structure dbs: DisProt, HSSP, PDB, PDBsum, SMR
PTM dbs: GlycoSuiteDB, PhosphoSite, PhosSite
Proteomic dbs: PeptideAtlas, PRIDE, ProMEX
Protein-protein interaction dbs: DIP, IntAct
Phylogenomic dbs: HOGENOM, HOVERGEN, OMA
Polymorphism dbs: dbSNP
Gene expression dbs: ArrayExpress, Bgee, CleanEx, GermOnline
Ontologies: GO
Others: BindingDB, PMAP-CutDB, DrugBank, NextBio
![Page 121: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/121.jpg)
![Page 122: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/122.jpg)
The UniProt consortium
Protein Information Resource
European Bioinformatics Institute, European Molecular Biology Laboratory
Swiss Institute of Bioinformatics
![Page 123: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/123.jpg)
UniProt mission:
Provide a comprehensive high-quality and freely accessible resource of protein sequence and functional annotation.
![Page 124: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/124.jpg)
New release every 3 weeks
![Page 125: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/125.jpg)
Update frequency: a crucial issue!!
• Sometimes very difficult, or even impossible, to find;
• Crucial not only for the database itself, but also for tools using databases.
![Page 126: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/126.jpg)
Update frequency
![Page 127: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/127.jpg)
![Page 128: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/128.jpg)
http://www.matrixscience.com/search_intro.html
![Page 129: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/129.jpg)
Mascot MS/MS identification tool is fine, but it cannot be used from this website !
Solution: Download the database of interest and make sure you work with an up-to-date version.
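One simple way to keep an eye on this is to record the release date of the local copy and estimate how many release cycles have passed since. The sketch below assumes the 3-week release cycle mentioned earlier for UniProt; the function names and the example dates are hypothetical, and other databases would need their own interval.

```python
from datetime import date, timedelta

# Assumption: a fixed 3-week release cycle, as quoted for UniProt in this lecture.
RELEASE_INTERVAL = timedelta(weeks=3)


def releases_behind(local_release, today, interval=RELEASE_INTERVAL):
    """Estimate how many release cycles have passed since the local copy was released."""
    if today < local_release:
        raise ValueError("local release date lies in the future")
    return (today - local_release) // interval


def is_outdated(local_release, today, interval=RELEASE_INTERVAL):
    """True when at least one newer release is expected to exist."""
    return releases_behind(local_release, today, interval) >= 1


# Hypothetical example: a copy released on 1 June 2009, checked on 17 August 2009.
print(releases_behind(date(2009, 6, 1), date(2009, 8, 17)))  # 3
print(is_outdated(date(2009, 8, 10), date(2009, 8, 17)))     # False
```

This only estimates staleness from the calendar; the release notes of the database remain the reference.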
![Page 130: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/130.jpg)
Never hesitate to ask for an update
![Page 131: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/131.jpg)
![Page 132: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/132.jpg)
UniProtKB: protein sequence knowledgebase, 2 sections: UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, Blast, download) (9’232’223 entries)
UniParc: protein sequence archive (equivalent to EMBL-Bank/GenBank/DDBJ at the protein level). Each entry contains a protein sequence with cross-links to other databases where you find the sequence (active or not). Not annotated. (query, no Blast on www.uniprot.org, Blast @ EBI, not downloadable) (20’070’606 entries)
![Page 133: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/133.jpg)
UniParc entry contains all records for a unique sequence in major publicly available databases.
TrEMBL entry merged into Swiss-Prot => does not exist anymore
![Page 134: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/134.jpg)
UniProtKB: protein sequence knowledgebase, 2 sections: UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, Blast, download) (9’232’223 entries)
UniParc: protein sequence archive (EMBL equivalent at the protein level). Each entry contains a protein sequence with cross-links to other databases where you find the sequence (active or not). Not annotated. (query, no Blast on www.uniprot.org, Blast @ EBI, not downloadable) (20’070’606 entries)
UniRef: 3 sets of clusters of protein sequences at 100, 90 and 50 % identity; useful to speed up sequence similarity search (BLAST) (query, Blast, download) (UniRef100 8’474’689 entries; UniRef90 5’668’669 entries; UniRef50 2’729’565 entries)
![Page 135: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/135.jpg)
UniRef100, 90 and 50
One UniRef100 entry -> merge of identical sequences (including subfragments, splice variants). Based on UniProtKB sequences and selected UniParc records (such as Ensembl & RefSeq).
One UniRef90 entry -> sequences that are at least 90% identical. Built from UniRef100.
One UniRef50 entry -> sequences that are at least 50% identical. Built from UniRef100.
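To make the thresholds concrete, the sketch below computes the percent identity of two sequences and reports which UniRef levels that value would satisfy. It is a deliberately simplified illustration, not the procedure used to build UniRef: it assumes the two sequences are already aligned and of equal length, and the toy sequences and function names are made up.

```python
UNIREF_THRESHOLDS = {"UniRef100": 100.0, "UniRef90": 90.0, "UniRef50": 50.0}


def percent_identity(seq_a, seq_b):
    """Percent of identical positions between two pre-aligned, equal-length sequences."""
    if not seq_a or len(seq_a) != len(seq_b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def levels_satisfied(seq_a, seq_b):
    """Names of the UniRef identity thresholds met by this pair of sequences."""
    identity = percent_identity(seq_a, seq_b)
    return [name for name, threshold in UNIREF_THRESHOLDS.items()
            if identity >= threshold]


# Toy sequences differing at one position out of ten.
print(percent_identity("MKTAYIAKQR", "MKTAYIAKQK"))  # 90.0
print(levels_satisfied("MKTAYIAKQR", "MKTAYIAKQK"))  # ['UniRef90', 'UniRef50']
```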
![Page 136: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/136.jpg)
![Page 137: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/137.jpg)
UniProtKB: protein sequence knowledgebase, 2 sections: UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, Blast, download) (7’097’874 entries)
UniParc: protein sequence archive (EMBL equivalent at the protein level). Each entry contains a protein sequence with cross-links to other databases where you find the sequence (active or not). Not annotated. (query, no Blast on www.uniprot.org, Blast @ EBI, not downloadable) (17’646’564 entries)
UniRef: 3 sets of clusters of protein sequences at 100, 90 and 50 % identity; useful to speed up sequence similarity search (BLAST) (query, Blast, download) (UniRef100 6’652’983 entries; UniRef90 4’438’653 entries; UniRef50 2’104’702 entries)
UniMES: protein sequences derived from metagenomic projects (Global Ocean Sampling (GOS)) (Blast, download) (UniMES 6’028’191 entries)
![Page 138: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/138.jpg)
What is "Non-Redundancy"?
• UniParc
– One UniParc entry for all entries corresponding to 100% identical sequences (100% identity over the entire length) (from many different databases).
• UniRef
– One UniRef100 entry for all entries corresponding to 100% identical sequences (including fragments) from UniProtKB, Ensembl, RefSeq, PDB.
• UniProtKB/Swiss-Prot
– One Swiss-Prot entry for all the protein products of one gene, including fragments, variations/polymorphisms, splice variants, sequencing errors…
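The first two definitions can be contrasted on toy data. The sketch below groups records by exact full-length identity (the UniParc notion) and then also folds fragments into a longer sequence that contains them (the UniRef100 notion). It is an illustration under simplifying assumptions, not the actual procedure of either resource: the record identifiers and sequences are made up, and a fragment is taken to be an exact substring.

```python
from collections import defaultdict

# Hypothetical records: (source database, made-up identifier, sequence).
records = [
    ("UniProtKB", "rec1", "MKTAYIAKQR"),
    ("RefSeq", "rec2", "MKTAYIAKQR"),   # identical to rec1 over the entire length
    ("Ensembl", "rec3", "TAYIAK"),      # fragment of rec1
    ("PDB", "rec4", "MSTNPKPQRK"),
]


def group_identical(records):
    """UniParc-style: one group per sequence that is identical over its entire length."""
    groups = defaultdict(list)
    for _database, record_id, sequence in records:
        groups[sequence].append(record_id)
    return dict(groups)


def group_with_fragments(records):
    """UniRef100-style: a fragment joins the group of a longer sequence containing it."""
    groups = {}
    # Longest sequences first, so that representatives exist before their fragments.
    for _database, record_id, sequence in sorted(
            records, key=lambda record: len(record[2]), reverse=True):
        for representative in groups:
            if sequence in representative:
                groups[representative].append(record_id)
                break
        else:
            groups[sequence] = [record_id]
    return groups


print(len(group_identical(records)))       # 3
print(len(group_with_fragments(records)))  # 2
```

The Swiss-Prot notion (one entry per gene, with variants and isoforms annotated inside it) depends on curation and cannot be reduced to sequence comparison in this way.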
![Page 139: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/139.jpg)
Comparing searches: NCBI and UniProt
![Page 140: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/140.jpg)
Search for the human Toll-like receptor 4 in Entrez Protein (NCBI)
Swiss-Prot: O00206
RefSeq: NP_612564
GenPept, identical sequences: AAC34135, CAH72619
GenPept, identical sequences: AAF05316, BAG55035, CAH72618, AAI17423, AAF89753
![Page 141: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/141.jpg)
Search for the human Toll-like receptor 4 in UniProtKB
Swiss-Prot
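The same search can also be expressed as a URL rather than typed into the web form. The sketch below only builds the URL and does not contact any server. It assumes the present-day UniProt REST endpoint and its query field names (`gene`, `organism_id`, `reviewed`), none of which appear in this lecture, so check them against the current UniProt documentation before relying on them. The function name is hypothetical; TLR4 is the gene symbol for Toll-like receptor 4 and 9606 is the NCBI taxonomy identifier for human.

```python
from urllib.parse import urlencode

# Assumption: current UniProt REST search endpoint (not part of the original slides).
UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def build_search_url(gene, taxon_id, reviewed_only=True, fmt="tsv"):
    """Build a UniProtKB search URL for one gene in one organism."""
    terms = [f"gene:{gene}", f"organism_id:{taxon_id}"]
    if reviewed_only:
        # Restrict the search to the Swiss-Prot (reviewed) section.
        terms.append("reviewed:true")
    params = {
        "query": " AND ".join(terms),
        "format": fmt,
        "fields": "accession,id,protein_name",
    }
    return f"{UNIPROT_SEARCH}?{urlencode(params)}"


url = build_search_url("TLR4", 9606)
print(url)
```

Dropping `reviewed_only` widens the search to TrEMBL as well, which is the programmatic counterpart of comparing Swiss-Prot hits with the full UniProtKB result list.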
![Page 142: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/142.jpg)
Sequences retrieved in Entrez Protein:
O00206, AAF05316, CAH72618, CAH72619, BAG55035, AAI17423, AAF89753, NP_612564*, AAC34135
*Based on A126770, BC117422, AL160272 and AA598398
![Page 143: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/143.jpg)
Major protein sequence resources
UniProtKB: Swiss-Prot + TrEMBL, with PIR, PDB and PRF (resources integrated in the entries)
EntrezProtein: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq (resources integrated in the search engine)
UniProtKB/Swiss-Prot: manually annotated protein sequences (~12’000 species)
UniProtKB/TrEMBL: submitted CDS (EMBL); automated annotation (~202’000 species)
GenPept: submitted CDS (GenBank)
PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB
PDB: Protein Data Bank; 3D data and associated sequences
PRF: journal scan of ‘published’ peptide sequences
RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction + some manual annotation
![Page 144: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/144.jpg)
Model Organism Databases (MODs) at a glance
![Page 145: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/145.jpg)
Model organism
Species extensively studied to understand particular biological phenomena, with the expectation that discoveries made in the model organism will provide insight into the workings of other organisms.
| Model organism | MOD | URL |
| --- | --- | --- |
| Mus musculus | MGI | http://www.informatics.jax.org/ |
| Rattus norvegicus | RGD | http://rgd.mcw.edu/ |
| Oryza sativa | RAP-DB | http://rapdb.dna.affrc.go.jp/ |
| Arabidopsis thaliana | TAIR | http://www.arabidopsis.org/ |
| Drosophila melanogaster | FlyBase | http://flybase.org/ |
| Schizosaccharomyces pombe | S. pombe GeneDB | http://www.genedb.org/genedb/pombe/ |
| Saccharomyces cerevisiae | SGD | http://www.yeastgenome.org/ |
| Caenorhabditis elegans | WormBase | http://www.wormbase.org/ |
| Dictyostelium discoideum | dictyBase | http://dictybase.org/ |
| Bacillus subtilis | SubtiList | http://genolist.pasteur.fr/SubtiList/ |
| Escherichia coli | EcoGene | http://ecogene.org/ |
| Danio rerio (zebrafish) | ZFIN | http://zfin.org/ |
Just a few examples, not an exhaustive list!
Methanocaldococcus jannaschii -> no MOD
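The organism-to-MOD correspondence is a plain lookup table, and it is worth noticing that a lookup can legitimately come back empty. The sketch below encodes the examples listed on this slide (URLs as given in 2009, which may have changed since); the function name is hypothetical.

```python
# Organism -> (MOD name, URL), copied from the examples on this slide.
MODS = {
    "Mus musculus": ("MGI", "http://www.informatics.jax.org/"),
    "Rattus norvegicus": ("RGD", "http://rgd.mcw.edu/"),
    "Oryza sativa": ("RAP-DB", "http://rapdb.dna.affrc.go.jp/"),
    "Arabidopsis thaliana": ("TAIR", "http://www.arabidopsis.org/"),
    "Drosophila melanogaster": ("FlyBase", "http://flybase.org/"),
    "Schizosaccharomyces pombe": ("S. pombe GeneDB", "http://www.genedb.org/genedb/pombe/"),
    "Saccharomyces cerevisiae": ("SGD", "http://www.yeastgenome.org/"),
    "Caenorhabditis elegans": ("WormBase", "http://www.wormbase.org/"),
    "Dictyostelium discoideum": ("dictyBase", "http://dictybase.org/"),
    "Bacillus subtilis": ("SubtiList", "http://genolist.pasteur.fr/SubtiList/"),
    "Escherichia coli": ("EcoGene", "http://ecogene.org/"),
    "Danio rerio": ("ZFIN", "http://zfin.org/"),
}


def find_mod(organism):
    """Return (MOD name, URL) for an organism, or None when no MOD is listed."""
    return MODS.get(organism)


print(find_mod("Danio rerio"))                    # ('ZFIN', 'http://zfin.org/')
print(find_mod("Methanocaldococcus jannaschii"))  # None
```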
![Page 146: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/146.jpg)
Model organism databases (MODs)
Key resources for information on a given organism
Service provided to/from a given community
• Genome annotation;
• Gene models;
• Gene mapping;
• Official nomenclature;
• Gene expression;
• Functional annotation;
• Interactions;
• Information about mutants/knockout/transgenic animals;
• Phenotypes;
• (cross-)references;
• Species-specific reagents…
MODs do not necessarily store sequences, but give access to them
![Page 147: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/147.jpg)
![Page 148: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/148.jpg)
![Page 149: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/149.jpg)
![Page 150: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/150.jpg)
![Page 151: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/151.jpg)
![Page 152: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/152.jpg)
Link to cDNA sequences
![Page 153: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/153.jpg)
http://gmod.org/wiki/Main_Page
![Page 154: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/154.jpg)
The world of databases is a jungle
![Page 155: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/155.jpg)
A few points to remember when using databases
- Content;
- Primary / secondary / meta-databases;
- Curated / non-curated;
- Manual / automated curation;
- Redundant / non-redundant;
- Update frequency;
- Stable identifiers;
- Strategy;
- Dataflow;
- Collaborations between databases.
![Page 156: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/156.jpg)
Test a few genomic databases and tools
![Page 157: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/157.jpg)
Genomes and genomic tools: a few sites
NCBI: http://www.ncbi.nlm.nih.gov/sites/entrez?db=genome
EBI: http://www.ebi.ac.uk/genomes/
TIGR: http://cmr.jcvi.org/tigr-scripts/CMR/shared/Genomes.cgi
Genome annotation and analysis tools:
http://www.ensembl.org/index.html
http://vega.sanger.ac.uk/index.html
http://genome.ucsc.edu/ -> BLAT, Galaxy, Custom tracks, …
http://www.jgi.doe.gov/software/ -> Genome portal, Integrated Microbial Genomes (IMG) and other tools
Generic Model Organism Database: http://gmod.org/wiki/Main_Page
![Page 158: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/158.jpg)
Genomes and genomic tools: Hands-on
Find your favorite (completely sequenced) organism in a genome db;
Follow the links to see the options on different sites;
Find the sequences;
Look at the annotation of your favorite gene;
Compare the entries corresponding to this gene across sites;
Test search engines (restrict searches, compare results, …)
Whenever possible use on-line tutorials, such as: http://www.ensembl.org/info/website/tutorials/index.html
Visit GMOD, see the tools (http://gmod.org/wiki/GMOD_Components)
Play around with the BLAT search, customize display, follow the links, …
![Page 159: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/159.jpg)
Genomes and genomic tools: Hands-on
Go and visit databases cited in this lecture;
The databases/tools that should be "familiar" to all are:
http://genome.ucsc.edu/cgi-bin/hgBlat
http://www.ensembl.org/index.html
gene/genome databases/tools on http://www.ncbi.nlm.nih.gov/
If none of the databases is of interest to you, go to the NAR database (http://www.oxfordjournals.org/nar/database/a/) and find databases that are closest to your interests;
Play around…
Hands-on protein sequence databases and UniProt: http://education.expasy.org/cours/HK09/Protein_database_TP.html
(corrections: http://education.expasy.org/cours/HK09/Protein_database_TP_correction.html)
![Page 160: Genome, Protein and Model Organism Databases](https://reader033.fdocuments.net/reader033/viewer/2022051115/568146c4550346895db3fdca/html5/thumbnails/160.jpg)
Thank You!