Bioinformatica 06-10-2011-t2-databases
![Page 1: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/1.jpg)
![Page 2: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/2.jpg)
FBW 06-10-2011
Wim Van Criekinge
![Page 3: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/3.jpg)
Course Contents: Bioinformatics
• Thu 29-09-2011: 1* Bioinformatics (practical session 8.30-11.00)
• Thu 06-10-2011: 2* Biological Databases (practical session 9.00-11.30)
• Thu 20-10-2011: 3 Sequence Similarity (Scoring Matrices)
• Thu 27-10-2011: 4 Sequence Alignments
• Thu 10-11-2011: 5 Database Searching Fasta/Blast
• Thu 17-11-2011: 6 Phylogenetics
• Thu 24-11-2011: 7 Protein Structure
• Thu 01-12-2011: 8 Gene Prediction, Gene Ontologies & HMM
• Thu 08-12-2011: 9 ncRNA, Chip Data Analysis, AI
• Thu 15-12-2011: 10 Bio- & Cheminformatics in Drug Discovery (catch-up week)
• Note: no class on Thu 13-10-2011 and Thu 3-11-2011
![Page 4: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/4.jpg)
Outline
• Molecular Biology
• Flat files “sequence” databases
  – DNA
  – Protein
  – Structure
• Relational Databases
  – What ?
  – Why ?
• Biological Relational Databases
  – Howto ?
![Page 5: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/5.jpg)
![Page 6: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/6.jpg)
![Page 7: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/7.jpg)
![Page 8: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/8.jpg)
![Page 9: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/9.jpg)
Flat Files
What is a “flat file” ?
• A “flat file” is data stored in a plain, ordinary file on the hard disk
• Example: RefSeq (see CD-ROM)
  – FILE: hs.GBFF
  – Hs: Homo sapiens
  – GBFF: GenBank flat file format
  – (associate the file with TextPad and use a monospaced font, e.g. Courier)
![Page 10: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/10.jpg)
Sequence entries
gene            10317..12529
                /gene="ZK822.4"
CDS             join(10317..10375,10714..10821,10874..10912,10960..11013,
                11061..11114,11169..11222,11346..11739,11859..11912,
                11962..12195,12242..12529)
                /gene="ZK822.4"
                /codon_start=1
                /protein_id="CAA98068.1"
                /db_xref="PID:g3881817"
                /db_xref="GI:3881817"
                /db_xref="SPTREMBL:Q23615"
                /translation="MHRHTYRKLYWNLGADGFSQGNADASVSAGSSGSNFLSGLQNSS
                FGQAVMGGINTYNQAKNSSGGNWQTAVANSSVGNFFQNGIDFFNGMKNGTQNFLDTDT
                IQETIGNSSFGEVVQTGVEFFNNIKNGNSPFQGDASSVMSQFVPFLANASAEAKAEFY
                TILPNFGNMTIAEFETAVNAWAAKYNLTDEVEAFNERSKNATVVAEEHANVVVMNLPN
                VLNNLKAISSDKNQTVVEMHTRMMAYVNSLDDDTRDIVFIFFRNLLPPQFKKSKCVDQ
                GNFLTNMYNKASDFFAGRNNRTDGEGSFWSGQGQNGNSGGSGFSSFFNNFNGQGNGNG
                NGAQNPMIGMFNNFMKKNNITADEANAAMADGGASIQILPAISAGWGDVAQVKIGGDF
                KIAVEEETKTTKKNKKQQQQANKNKNKNKKKTTIAPEAAIDANIAAEVHTQVL"
![Page 11: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/11.jpg)
Nucleotide Databases
EMBL Nucleotide Sequence Database (European Molecular Biology Laboratory) http://www.ebi.ac.uk/ebi_docs/embl_db/ebi/topembl.html
GenBank at NCBI (National Center for Biotechnology Information) http://www.ncbi.nlm.nih.gov/Web/Genbank/index.html
DDBJ (DNA Database of Japan) http://www.ddbj.nig.ac.jp/
DDBJ, the Center for operating DDBJ, National Institute of Genetics (NIG), Japan, established in April 1995.
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
Release Notes (ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt)
Genetic Sequence Data Bank - August 15 2003
NCBI-GenBank Flat File Release 137.0
Distribution Release Notes
33 865 022 251 bases, from 27 213 748 reported sequences
![Page 12: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/12.jpg)
GenBank Format
LOCUS       LISOD 756 bp DNA BCT 30-JUN-1993
DEFINITION  L.ivanovii sod gene for superoxide dismutase.
ACCESSION   X64011.1 GI:37619753
NID         g44010
KEYWORDS    sod gene; superoxide dismutase.
SOURCE      Listeria ivanovii.
  ORGANISM  Listeria ivanovii
            Eubacteria; Firmicutes; Low G+C gram-positive bacteria; Bacillaceae; Listeria.
REFERENCE   1 (bases 1 to 756)
  AUTHORS   Haas,A. and Goebel,W.
  TITLE     Cloning of a superoxide dismutase gene from Listeria ivanovii by functional complementation in Escherichia coli and characterization of the gene product
  JOURNAL   Mol. Gen. Genet. 231 (2), 313-322 (1992)
  MEDLINE   92140371
REFERENCE   2 (bases 1 to 756)
  AUTHORS   Kreft,J.
  TITLE     Direct Submission
  JOURNAL   Submitted (21-APR-1992) J. Kreft, Institut f. Mikrobiologie, Universitaet Wuerzburg, Biozentrum Am Hubland, 8700 Wuerzburg, FRG
![Page 13: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/13.jpg)
FEATURES             Location/Qualifiers
     source          1..756
                     /organism="Listeria ivanovii"
                     /strain="ATCC 19119"
                     /db_xref="taxon:1638"
     RBS             95..100
                     /gene="sod"
     gene            95..746
                     /gene="sod"
     CDS             109..717
                     /gene="sod"
                     /EC_number="1.15.1.1"
                     /codon_start=1
                     /product="superoxide dismutase"
                     /db_xref="PID:g44011"
                     /db_xref="SWISS-PROT:P28763"
                     /transl_table=11
                     /translation="MTYELPKLPYTYDALEPNFDKETMEIHYTKHHNIYVTKL
                     NEAVSGHAELASKPGEELVANLDSVPEEIRGAVRNHGGGHANHTLFWSSLSPN
                     GGGAPTGNLKAAIESEFGTFDEFKEKFNAAAAARFGSGWAWLVVNNGKLEIVS
                     TANQDSPLSEGKTPVLGLDVWEHAYYLKFQNRRPEYIDTFWNVINWDERNKRF
                     DAAK"
     terminator      723..746
                     /gene="sod"
![Page 14: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/14.jpg)
Example of location descriptors

| Location | Description |
|---|---|
| 476 | Points to a single base in the presented sequence |
| 340..565 | Points to a continuous range of bases bounded by and including the starting and ending bases |
| <345..500 | The exact lower boundary point of a feature is unknown |
| (102.110) | Indicates that the exact location is unknown but that it is one of the bases between bases 102 and 110 |
| (23.45)..600 | Specifies that the starting point is one of the bases between bases 23 and 45, inclusive, and the end base is 600 |
| 123^124 | Points to a site between bases 123 and 124 |
| 145^177 | Points to a site anywhere between bases 145 and 177 |
| J00193:hladr | Points to a feature whose location is described in another entry: the feature labeled 'hladr' in the entry (in this database) with primary accession 'J00193' |
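
The descriptor forms listed above can be recognised with a few regular expressions. The following is a minimal sketch, not a full feature-table parser: the `Location` record and the `parse_location` function are hypothetical names introduced here, and compound forms such as `(23.45)..600` or `join(...)` are deliberately left unsupported.

```python
import re
from typing import NamedTuple


class Location(NamedTuple):
    """Illustrative record for one parsed location descriptor."""
    kind: str                  # "base", "range", "one_of", "between" or "remote"
    start: int = 0
    end: int = 0
    fuzzy_start: bool = False  # True for forms such as "<345..500"
    target: str = ""           # "accession:label" for remote features


def parse_location(text: str) -> Location:
    """Parse the simple descriptor forms; raise ValueError for anything else."""
    text = text.strip()
    if m := re.fullmatch(r"(\d+)", text):                 # 476
        return Location("base", int(m[1]), int(m[1]))
    if m := re.fullmatch(r"(<?)(\d+)\.\.(\d+)", text):    # 340..565, <345..500
        return Location("range", int(m[2]), int(m[3]), fuzzy_start=bool(m[1]))
    if m := re.fullmatch(r"\((\d+)\.(\d+)\)", text):      # (102.110)
        return Location("one_of", int(m[1]), int(m[2]))
    if m := re.fullmatch(r"(\d+)\^(\d+)", text):          # 123^124
        return Location("between", int(m[1]), int(m[2]))
    if re.fullmatch(r"[A-Z]+\d+:\w+", text):              # J00193:hladr
        return Location("remote", target=text)
    raise ValueError(f"unsupported location descriptor: {text!r}")
```

For example, `parse_location("<345..500")` yields a range from 345 to 500 with `fuzzy_start` set, while `parse_location("(23.45)..600")` raises `ValueError` because that form is outside the scope of this sketch.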
![Page 15: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/15.jpg)
BASE COUNT 247 a 136 c 151 g 222 t
ORIGIN
1 cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat
61 gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa
121 ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg
181 gaaattcact atacaaagca ccacaatatt tatgtaacaa aactaaatga agcagtctca
241 ggacacgcag aacttgcaag taaacctggg gaagaattag ttgctaatct agatagcgtt
301 cctgaagaaa ttcgtggcgc agtacgtaac cacggtggtg gacatgctaa ccatacttta
361 ttctggtcta gtcttagccc aaatggtggt ggtgctccaa ctggtaactt aaaagcagca
421 atcgaaagcg aattcggcac atttgatgaa ttcaaagaaa aattcaatgc ggcagctgcg
481 gctcgttttg gttcaggatg ggcatggcta gtagtgaaca atggtaaact agaaattgtt
541 tccactgcta accaagattc tccacttagc gaaggtaaaa ctccagttct tggcttagat
601 gtttgggaac atgcttatta tcttaaattc caaaaccgtc gtcctgaata cattgacaca
661 ttttggaatg taattaactg ggatgaacga aataaacgct ttgacgcagc aaaataatta
721 tcgaaaggct cacttaggtg ggtcttttta tttcta
//
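The sequence block of a flat-file record is easy to read back programmatically. The sketch below assumes the layout shown above (a running position followed by space-separated chunks of bases, terminated by `//`); `read_origin` and `base_count` are illustrative helper names, not part of any GenBank tool.

```python
from collections import Counter


def read_origin(lines):
    """Join the sequence lines of an ORIGIN block into one lowercase string."""
    chunks = []
    for line in lines:
        line = line.strip()
        if line == "//":                      # end-of-record marker
            break
        if not line or line.upper().startswith("ORIGIN"):
            continue
        parts = line.split()
        # parts[0] is the running base position; the rest are base chunks
        chunks.extend(parts[1:])
    return "".join(chunks).lower()


def base_count(seq):
    """Count how often each base letter occurs in the sequence."""
    return Counter(seq)


# Tiny made-up block in the same layout (not the full record above)
sample = [
    "ORIGIN",
    "        1 cgttatttaa ggtgttacat",
    "       21 agttc",
    "//",
]
sequence = read_origin(sample)
counts = base_count(sequence)
```

Running the same two functions over the full record would give counts that can be compared with the record's own BASE COUNT line as a consistency check.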
![Page 16: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/16.jpg)
EMBL format
ID   LISOD standard; DNA; PRO; 756 BP.                 IDentification
XX
AC   X64011; S78972;                                   Accession (Axxxxx, Afxxxxxx), GUID
XX
NI   g44010                                            Nucleotide Identifier --> x.x
XX
DT   28-APR-1992 (Rel. 31, Created)                    DaTe
DT   30-JUN-1993 (Rel. 36, Last updated, Version 6)
XX
DE   L.ivanovii sod gene for superoxide dismutase.     DEscription
XX
KW   sod gene; superoxide dismutase.                   KeyWord
XX
OS   Listeria ivanovii                                 Organism Species
OC   Eubacteria; Firmicutes; Low G+C gram-positive bacteria; Bacillaceae;
OC   Listeria.                                         Organism Classification
XX
RN   [1]
RA   Haas A., Goebel W.;                               Reference
RT   "Cloning of a superoxide dismutase gene from Listeria ivanovii by
RT   functional complementation in Escherichia coli and
RT   characterization of the gene product.";
RL   Mol. Gen. Genet. 231:313-322(1992).
XX
![Page 17: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/17.jpg)
GenBank,EMBL & DDBJ: Comments
• Collaboration GenBank/EMBL/DDBJ
  – Effort: identical within 24 hours
• Redundant information
• Historical graveyard
  – BANKIT (responsibility of the submitter)
  – Version conflicts
• IDIOSYNCRATIC (peculiar to the individual)
  – Heterogeneous annotation
  – No consistent quality check
• Vectors, sequence errors etc.
![Page 18: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/18.jpg)
Other Genbank Formats
• ASN.1
  – Computer friendly, human unfriendly
• FASTA
  – Brief, loses information
  – Easy to use
  – Compatible with multiple sequences
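Because FASTA is just a `>` header line followed by sequence lines, reading and writing it takes only a few lines of code. The following is a minimal sketch; `parse_fasta` and `format_fasta` are illustrative names, and the 60-character line width is an assumption rather than a requirement of the format.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:            # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:              # sequence line of current record
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def format_fasta(header, seq, width=60):
    """Write one record as FASTA text, wrapping the sequence at `width`."""
    lines = [">" + header]
    lines += [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines) + "\n"
```

Several records can simply be concatenated in one file, which is what makes the format compatible with multiple sequences; everything except the header text and the residues is lost.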
![Page 19: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/19.jpg)
Web Query tools & Programming Query tools
• NCBI website example:
  – http://www.ncbi.nlm.nih.gov/entrez/query/static/advancedentrez.html
• EBI UniProtKB website example:
  – http://www.ebi.ac.uk/uniprot/index.html
  – http://www.ebi.uniprot.org/search/SearchTools.shtml
![Page 20: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/20.jpg)
Batch download (FTP server)
• Data available via a website can usually also be downloaded as a complete batch from an FTP server.
• Examples:
  – ftp://ftp.ncbi.nih.gov/
  – ftp://ftp.ebi.ac.uk/pub/
![Page 21: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/21.jpg)
Sequence file format tips
• When saving a sequence for use in an email message or pasting into a web page, use an unannotated text format such as FASTA
• When retrieving from a database or exchanging between programs, use an annotated text format such as GenBank
• When using a sequence again with the same program, use that program’s annotated binary format (or annotated text if binary is not available)
  – ASN.1 (NCBI)
  – GBFF (Sanger)
  – XML
![Page 22: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/22.jpg)
Expressed Sequence Tags
• Sequence that codes for protein is < 5% of the genome.
• Coding sequence can be obtained from mRNA by reverse transcription.
• Tags for that sequence can be obtained by end-sequencing.
• Incyte and HGS gambled on this being the useful part:
  – Search for homologies to known proteins, motifs.
  – Search for changed levels of expression and tissue specificity (“virtual/electronic northern” used in GeneCards)
• ESTs have driven the huge expansion of GenBank:
  – UniGene now contains some sequence from most genes.
  – > 4,000,000 human EST sequences
  – http://www.ncbi.nlm.nih.gov/dbEST/
![Page 23: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/23.jpg)
dbEST release 100303 Summary by Organism - October 3, 2003
Number of public entries: 18,762,324
Homo sapiens (human) 5,426,001
Mus musculus + domesticus (mouse) 3,881,878
Rattus sp. (rat) 538,073
Triticum aestivum (wheat) 500,898
Ciona intestinalis 492,488
Gallus gallus (chicken) 451,565
Zea mays (maize) 383,416
Danio rerio (zebrafish) 362,362
Hordeum vulgare + subsp. vulgare (barley) 348,233
Xenopus laevis (African clawed frog) 344,695
Glycine max (soybean) 341,573
Bos taurus (cattle) 322,074
Drosophila melanogaster (fruit fly) 261,404
![Page 24: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/24.jpg)
Traces <-> strings
• Traces contain much more information
  – TraceDB: http://www.ncbi.nlm.nih.gov/Traces/
Example
![Page 25: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/25.jpg)
Traces <-> strings
• Phred
  – base calling, vector trimming, end of sequence read trimming
• Phrap
  – Phrap uses Phred’s base calling scores to determine the consensus sequences. Phrap examines all individual sequences at a given position, and uses the highest scoring sequence (if it exists) to extend the consensus sequence
• Consed
  – graphical interface extension that controls both Phred and Phrap
![Page 26: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/26.jpg)
What is Phred?
• Phred is a program that reads the base trace, makes base calls, and assigns quality values (qv) to the bases in the sequence.
• It then writes the base calls and qv to output files that will be used for Phrap assembly. The qv are used for consensus sequence construction.
• For example,
    ATGCATGC string1
    ATTCATGC string2
    AT-CATGC superstring
• Here we have a mismatch between ‘G’ and ‘T’; the qv decide which base fills the dash in the superstring. The base with the higher qv replaces the dash.
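The rule "the base with the higher qv wins" can be sketched in a few lines. This is a deliberate simplification for illustration and not Phrap's actual algorithm: it assumes the two reads are already aligned and of equal length, and the function name `consensus` and the quality values used below are made up.

```python
def consensus(read1, qual1, read2, qual2):
    """At each aligned position keep the base with the higher quality value.

    Ties are resolved in favour of the first read (an arbitrary choice).
    """
    if not (len(read1) == len(qual1) == len(read2) == len(qual2)):
        raise ValueError("reads and quality lists must have equal length")
    chosen = []
    for b1, q1, b2, q2 in zip(read1, qual1, read2, qual2):
        chosen.append(b1 if q1 >= q2 else b2)
    return "".join(chosen)


# The two strings from the slide, with hypothetical quality values:
# the 'G' at the third position has qv 35, the competing 'T' only qv 12.
string1, qv1 = "ATGCATGC", [40, 40, 35, 40, 40, 40, 40, 40]
string2, qv2 = "ATTCATGC", [40, 40, 12, 40, 40, 40, 40, 40]
superstring = consensus(string1, qv1, string2, qv2)
```

With these assumed values the dash in the superstring is filled with ‘G’; swapping the two quality values at that position would select ‘T’ instead.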
![Page 27: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/27.jpg)
How does Phred calculate qv?
• From the base trace, Phred knows the number of peaks and the actual peak locations.
• Phred predicts the peak locations.
• Phred reads the actual peak locations from the base trace.
• Phred matches the actual locations with the predicted locations using dynamic programming.
• The qv is related to the base call error probability (ep) by the formula qv = -10 * log10(ep)
• Example: an error probability of 1 in 10,000 corresponds to qv = 40
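The relation between quality value and error probability is simple enough to compute directly. A minimal sketch of the formula and its inverse follows; the function names are illustrative only.

```python
import math


def error_prob_to_qv(ep):
    """qv = -10 * log10(ep), for an error probability 0 < ep <= 1."""
    if not 0 < ep <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(ep)


def qv_to_error_prob(qv):
    """Inverse of the formula above: ep = 10 ** (-qv / 10)."""
    return 10 ** (-qv / 10)
```

An error probability of 1 in 10,000 gives a qv of 40, as on the slide; by the same formula a qv of 20 corresponds to one expected error in 100 base calls.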
![Page 28: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/28.jpg)
Why Phred?
• Output sequence might contain errors.
• Vector contamination might occur.
• Dye-terminator reaction might not occur.
• Fragment migration in gel electrophoresis might be abnormal.
• The peak corresponding to a base might have weak or variable signal strength.
![Page 29: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/29.jpg)
Vector Trimming
![Page 30: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/30.jpg)
End of Sequence Cropping
• The ends of sequencing reads commonly contain poor data. This is due to the difficulty of resolving larger fragments of ~1 kb (it is easier to resolve 21 bp from 20 bp than it is to resolve 1001 bp from 1000 bp).
• Phred assigns a non-value of ‘x’ to this data by comparing peak separation and peak intensity to internal standards. If the standard threshold score is not reached, the data will not be used.
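A much simplified version of end-of-read cropping can be expressed with quality values alone. The sketch below is not Phred's actual procedure (which works on peak separation and intensity); it only drops trailing bases whose qv falls below a cut-off. The function name `trim_read_end` and the default threshold of 20 are assumptions made for illustration.

```python
def trim_read_end(bases, quals, min_qv=20):
    """Drop bases from the 3' end until one reaches the quality threshold."""
    if len(bases) != len(quals):
        raise ValueError("bases and quals must have equal length")
    end = len(bases)
    while end > 0 and quals[end - 1] < min_qv:
        end -= 1
    return bases[:end], quals[:end]
```

For a read `"ACGTAC"` with qualities `[30, 30, 25, 20, 10, 5]` this keeps `"ACGT"` and discards the two low-quality bases at the end.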
![Page 31: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/31.jpg)
Traces <-> strings
• Handle traces
  – Abi-view (EMBOSS)
  – BioEdit
  – Acembly, …
• EXAMPLE
![Page 32: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/32.jpg)
NCBI reference sequences
The RefSeq database is a non-redundant set of reference standards that includes chromosomes, complete genomic molecules, intermediate assembled genomic contigs, curated genomic regions, mRNAs, RNAs, and proteins.
![Page 33: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/33.jpg)
RefSeq nomenclature
NC_#### complete genomic
NG_#### incomplete genomic
NM_#### mRNA
NR_#### noncoding transcripts
NP_#### proteins
NT_#### intermediate genomic contigs
![Page 34: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/34.jpg)
RefSeq nomenclature - models
XM_#### mRNA
XR_#### RNA
XP_#### protein
Automated Homo sapiens models provided by the Genome Annotation process; sequence corresponds to the genomic contig.
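Since the molecule type is encoded in the accession prefix, it can be looked up mechanically. The following sketch only covers the prefixes listed on the two slides above; the dictionary, the function name `describe_refseq` and the example accession are illustrative placeholders.

```python
# Prefix table taken from the two nomenclature slides above
REFSEQ_PREFIXES = {
    "NC": "complete genomic",
    "NG": "incomplete genomic",
    "NM": "mRNA",
    "NR": "noncoding transcript",
    "NP": "protein",
    "NT": "intermediate genomic contig",
    "XM": "model mRNA",
    "XR": "model RNA",
    "XP": "model protein",
}


def describe_refseq(accession):
    """Return the molecule type for a RefSeq-style accession, or None."""
    prefix, sep, _rest = accession.partition("_")
    if not sep:                      # no underscore: not a RefSeq-style accession
        return None
    return REFSEQ_PREFIXES.get(prefix.upper())
```

`describe_refseq("NM_000000")` (a made-up accession number) returns `"mRNA"`, whereas a GenBank-style accession such as `"X64011"` returns `None`.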
![Page 35: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/35.jpg)
![Page 36: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/36.jpg)
![Page 37: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/37.jpg)
![Page 38: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/38.jpg)
Open reading frame
• Definition:
  – A stretch of triplet codons with an initiator codon at one end and a stop codon at the other, as identifiable by nucleotide sequences.
• Example
  – http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=nucleotide&list_uids=6688473&dopt=GenBank&term=Y18948.1&qty=1
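The definition translates directly into a scan over the three forward reading frames. The sketch below makes several assumptions that are not on the slide: ATG is the only initiator codon, the stop codons are those of the standard genetic code, only the forward strand is searched, and `find_orfs` with its `min_codons` parameter is an illustrative name.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) tuples for ATG...stop stretches.

    start is 0-based, end is exclusive and includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first initiator in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ORF
    return orfs
```

On the toy sequence `"CCATGAAATAGCC"` the only open reading frame is `ATG AAA TAG` in the third frame, reported as `(2, 11, 2)`.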
![Page 39: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/39.jpg)
Protein sequence database
SWISS-PROT & TREMBL
SwissProt - http://expasy.hcuge.ch/sprot/
SWISS-PROT is an annotated protein sequence database
The sequences are translated from the EMBL Nucleotide Sequence Database
Sequence entries are composed of different lines. For standardization purposes the format of SWISS-PROT follows as closely as possible that of the EMBL Nucleotide Sequence Database.
Continuously updated (daily).
![Page 40: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/40.jpg)
Different Features of SWISS-PROT
• Format follows as closely as possible that of EMBL
• Curated protein sequence database
• Three differences:
  1. Strives to provide a high level of annotations
  2. Minimal level of redundancy
  3. High level of integration with other databases
![Page 41: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/41.jpg)
Three Distinct Criteria
1. Annotation
Core data: the sequence data, the citation information (bibliographical references) and the taxonomic data (description of the biological source of the protein).
Annotation: items such as protein functions, post-translational modifications, domains and sites, secondary structure, quaternary structure, similarities to other proteins, diseases associated with deficiencies in the protein, sequence conflicts, variants, etc.
![Page 42: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/42.jpg)
2. Minimal Redundancy
Many sequence databases contain, for a given protein sequence, separate entries which correspond to different literature reports. SWISS-PROT tries as much as possible to merge all these data so as to minimize the redundancy. If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding entry.
![Page 43: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/43.jpg)
3. Integration With Other Databases
• SWISS-PROT and TrEMBL - Protein sequences
• PROSITE - Protein families and domains
• SWISS-2DPAGE - Two-dimensional polyacrylamide gel electrophoresis
• SWISS-3DIMAGE - 3D images of proteins and other biological macromolecules
• SWISS-MODEL Repository - Automatically generated protein models
• CD40Lbase - CD40 ligand defects
• ENZYME - Enzyme nomenclature
• SeqAnalRef - Sequence analysis bibliographic references
![Page 44: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/44.jpg)
TREMBL- http://expasy.hcuge.ch/sprot/
Translated EMBL sequences not (yet) in Swissprot.
Updated faster than SWISS-PROT.
TREMBL - two parts
1. SP-TREMBL
   Will eventually be incorporated into Swissprot
   Divided into FUN, HUM, INV, MAM, MHC, ORG, PHG, PLN, PRO, ROD, UNC, VRL and VRT.
2. REM-TREMBL (remaining)
   Will NOT be incorporated into Swissprot
   Divided into: immunoglobulins and T-cell receptors, synthetic sequences, patent application sequences, small fragments, CDS not coding for real proteins
![Page 45: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/45.jpg)
SWISS-PROT/TrEMBL
• TrEMBL is a computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS-PROT
• SWISS-PROT Release 39.15 of 19-Mar-2001: 94,152 entries
• TrEMBL Release 16.2 of 23-Mar-2001: 436,924 entries
![Page 46: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/46.jpg)
Example of a SwissProt entry
ID   TNFA_HUMAN STANDARD; PRT; 233 AA.                 IDentification
AC   P01375;                                           ACcession
DT   21-JUL-1986 (REL. 01, CREATED)                    DaTe
DT   21-JUL-1986 (REL. 01, LAST SEQUENCE UPDATE)
DT   15-JUL-1998 (REL. 36, LAST ANNOTATION UPDATE)
DE   TUMOR NECROSIS FACTOR PRECURSOR (TNF-ALPHA) (CACHECTIN).
GN   TNFA.                                             Gene name
OS   HOMO SAPIENS (HUMAN).                             Organism Species
OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;
OC   EUTHERIA; PRIMATES.                               Organism Classification
RN   [1]                                               Reference
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 87217060.
RA   NEDOSPASOV S.A., SHAKHOV A.N., TURETSKAYA R.L., METT V.A.,
RA   AZIZOV M.M., GEORGIEV G.P., KOROBKO V.G., DOBRYNIN V.N.,
RA   FILIPPOV S.A., BYSTROV N.S., BOLDYREVA E.F., CHUVPILO S.A.,
RA   CHUMAKOV A.M., SHINGAROVA L.N., OVCHINNIKOV Y.A.;
RL   COLD SPRING HARB. SYMP. QUANT. BIOL. 51:611-624(1986).
RN   [2]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 85086244.
RA   PENNICA D., NEDWIN G.E., HAYFLICK J.S., SEEBURG P.H., DERYNCK R.,
RA   PALLADINO M.A., KOHR W.J., AGGARWAL B.B., GOEDDEL D.V.;
RL   NATURE 312:724-729(1984).
...
![Page 47: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/47.jpg)
CC   -!- FUNCTION: CYTOKINE WITH A WIDE VARIETY OF FUNCTIONS: IT CAN
CC       CAUSE CYTOLYSIS OF CERTAIN TUMOR CELL LINES, IT IS IMPLICATED
CC       IN THE INDUCTION OF CACHEXIA, IT IS A POTENT PYROGEN CAUSING
CC       FEVER BY DIRECT ACTION OR BY STIMULATION OF IL-1 SECRETION, IT
CC       CAN STIMULATE CELL PROLIFERATION & INDUCE CELL DIFFERENTIATION
CC       UNDER CERTAIN CONDITIONS.                     Comments
CC   -!- SUBUNIT: HOMOTRIMER.
CC   -!- SUBCELLULAR LOCATION: TYPE II MEMBRANE PROTEIN. ALSO EXISTS AS
CC       AN EXTRACELLULAR SOLUBLE FORM.
CC   -!- PTM: THE SOLUBLE FORM DERIVES FROM THE MEMBRANE FORM BY
CC       PROTEOLYTIC PROCESSING.
CC   -!- DISEASE: CACHEXIA ACCOMPANIES A VARIETY OF DISEASES, INCLUDING
CC       CANCER AND INFECTION, AND IS CHARACTERIZED BY GENERAL ILL
CC       HEALTH AND MALNUTRITION.
CC   -!- SIMILARITY: BELONGS TO THE TUMOR NECROSIS FACTOR FAMILY.
DR   EMBL; X02910; G37210; -.                          Database Cross-references
DR   EMBL; M16441; G339741; -.
DR   EMBL; X01394; G37220; -.
DR   EMBL; M10988; G339738; -.
DR   EMBL; M26331; G339764; -.
DR   EMBL; Z15026; G37212; -.
DR   PIR; B23784; QWHUN.
DR   PIR; A44189; A44189.
DR   PDB; 1TNF; 15-JAN-91.
DR   PDB; 2TUN; 31-JAN-94.
![Page 48: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/48.jpg)
KW   CYTOKINE; CYTOTOXIN; TRANSMEMBRANE; GLYCOPROTEIN; SIGNAL-ANCHOR;
KW   MYRISTYLATION; 3D-STRUCTURE.                      KeyWord
FT   PROPEP 1 76                                       Feature Table
FT   CHAIN 77 233 TUMOR NECROSIS FACTOR.
FT   TRANSMEM 36 56 SIGNAL-ANCHOR (TYPE-II PROTEIN).
FT   LIPID 19 19 MYRISTATE.
FT   LIPID 20 20 MYRISTATE.
FT   DISULFID 145 177
FT   MUTAGEN 105 105 L->S: LOW ACTIVITY.
FT   MUTAGEN 108 108 R->W: BIOLOGICALLY INACTIVE.
FT   MUTAGEN 112 112 L->F: BIOLOGICALLY INACTIVE.
FT   MUTAGEN 162 162 S->F: BIOLOGICALLY INACTIVE.
FT   MUTAGEN 167 167 V->A,D: BIOLOGICALLY INACTIVE.
FT   MUTAGEN 222 222 E->K: BIOLOGICALLY INACTIVE.
FT   CONFLICT 63 63 F -> S (IN REF. 5).
FT   STRAND 89 93
FT   TURN 99 100
FT   TURN 109 110
FT   STRAND 112 113
FT   TURN 115 116
FT   STRAND 118 119
FT   STRAND 124 125
![Page 49: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/49.jpg)
FT   STRAND 130 143
FT   STRAND 152 159
FT   STRAND 166 170
FT   STRAND 173 174
FT   TURN 183 184
FT   STRAND 189 202
FT   TURN 204 205
FT   STRAND 207 212
FT   HELIX 215 217
FT   STRAND 218 218
FT   STRAND 227 232
SQ   SEQUENCE 233 AA; 25644 MW; 666D7069 CRC32;
     MSTESMIRDV ELAEEALPKK TGGPQGSRRC LFLSLFSFLI VAGATTLFCL LHFGVIGPQR
     EEFPRDLSLI SPLAQAVRSS SRTPSDKPVA HVVANPQAEG QLQWLNRRAN ALLANGVELR
     DNQLVVPSEG LYLIYSQVLF KGQGCPSTHV LLTHTISRIA VSYQTKVNLL SAIKSPCQRE
     TPEGAEAKPW YEPIYLGGVF QLEKGDRLSA EINRPDYLDF AESGQVYFGI IAL
//
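Because every line of such an entry starts with a two-letter code, the entry can be split into fields with very little code. The sketch below assumes the classic layout of code, three spaces, then data, with the sequence following the SQ line until `//`; `parse_swissprot_entry` is an illustrative name and the sample entry is heavily abbreviated (its sequence is truncated on purpose).

```python
from collections import defaultdict


def parse_swissprot_entry(text):
    """Group the lines of one flat-file entry by their two-letter line code."""
    fields = defaultdict(list)
    seq_chunks = []
    in_sequence = False
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("//"):              # end of entry
            break
        if in_sequence:                        # sequence lines carry no code
            seq_chunks.append(line.replace(" ", ""))
            continue
        code, value = line[:2], line[5:].strip()
        fields[code].append(value)
        if code == "SQ":
            in_sequence = True
    fields["sequence"] = ["".join(seq_chunks)]
    return dict(fields)


# Abbreviated sample in the assumed layout (sequence truncated for brevity)
sample_entry = """\
ID   TNFA_HUMAN     STANDARD;      PRT;   233 AA.
AC   P01375;
DE   TUMOR NECROSIS FACTOR PRECURSOR (TNF-ALPHA) (CACHECTIN).
SQ   SEQUENCE   233 AA;  25644 MW;  666D7069 CRC32;
     MSTESMIRDV ELAEEALPKK
//
"""
entry = parse_swissprot_entry(sample_entry)
```

After parsing, `entry["AC"]` holds the accession line and `entry["sequence"][0]` the residues with the spacing removed. Multi-line fields such as CC or FT would need further, field-specific handling that this sketch does not attempt.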
![Page 50: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/50.jpg)
Protein searching
3 levels of protein searching
1. Swissprot: little noise, annotated entries
2. Swissprot + TREMBL: more noisy, all probable entries
3. Translated EMBL (tblast or tfasta): most noisy, all possible entries
![Page 51: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/51.jpg)
New initiatives
• IPI: International Protein Index
  – http://www.ebi.ac.uk/IPI/IPIhelp.html
• UNIPROT: Universal Protein Knowledgebase
  – http://www.pir.uniprot.org/
• HPRD: Human Protein Reference Database
  – http://www.hprd.org/
![Page 52: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/52.jpg)
UniProt
UniProt Consortium
• European Bioinformatics Institute (EBI)
• Swiss Institute of Bioinformatics (SIB)
• Protein Information Resource (PIR)
UniProt Databases
• UniProt Knowledgebase (UniProtKB)
• UniProt Reference Clusters (UniRef)
• UniProt Archive (UniParc)
UniProtKB
• Swiss-Prot (annotated protein sequence db, gold standard)
• TrEMBL (translated EMBL + automated electronic annotations)
![Page 53: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/53.jpg)
Understanding molecular structure is critical to the understanding of biology because structure determines function
![Page 54: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/54.jpg)
![Page 55: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/55.jpg)
![Page 56: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/56.jpg)
From Structure to Function
• the drug morphine has chemical groups that are functionally equivalent to the natural endorphins found in the human body
• the receptor molecules located at the synapse (between two neurons) bind morphine much the same way as endorphins
• therefore, morphine is able to attenuate the pain response
![Page 57: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/57.jpg)
Structure databases
Protein Data Bank (PDB)
Protein Data Bank - http://www.rcsb.org/pdb
Diffraction: 7373 structures determined by X-ray diffraction
NMR: 388 structures determined by NMR spectroscopy
Theoretical Model: 201 structures proposed by modeling
![Page 58: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/58.jpg)
• PDB is an archive of three-dimensional structures of proteins, with some nucleic acids included
• PDB is operated by RCSB (Research Collaboratory for Structural Bioinformatics), funded by NSF, DOE, and two units of NIH: NIGMS (National Institute of General Medical Sciences) and NLM (National Library of Medicine).
• Established at BNL (Brookhaven National Laboratory) in 1971 as an archive for biological macromolecular crystal structures
• In the 1980s, the number of deposited structures began to increase dramatically.
• In October 1998, the management of the PDB became the responsibility of RCSB.
• Website http://www.rcsb.org
![Page 59: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/59.jpg)
PDB Holdings List: 27-Mar-2001

| Exp. Tech. / Molecule Type | Proteins, Peptides, and Viruses | Protein/Nucleic Acid Complexes | Nucleic Acids | Carbohydrates | total |
|---|---|---|---|---|---|
| X-ray Diffraction and other | 11045 | 526 | 552 | 14 | 12137 |
| NMR | 1832 | 71 | 366 | 4 | 2273 |
| Theoretical Modeling | 281 | 19 | 21 | 0 | 321 |
| total | 13158 | 616 | 939 | 18 | 14731 |

5032 Structure Factor Files
968 NMR Restraint Files
![Page 60: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/60.jpg)
PDB Content Growth
![Page 61: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/61.jpg)
PDB Growth in New Folds
![Page 62: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/62.jpg)
Other structure databases
BioMagResBank http://www.bmrb.wisc.edu/
A Repository for Data from NMR Spectroscopy on Proteins, Peptides, and Nucleic Acids
Biological Macromolecule Crystallization Database (BMCD) http://h178133.carb.nist.gov:4400/bmcd/bmcd.html
Contains crystal data and the crystallization conditions, which have been compiled from literature
Nucleic Acid Database (NDB) http://ndbserver.rutgers.edu:80/
Assembles and distributes structural information about nucleic acids
Structural Classification of Proteins (SCOP) http://scop.mrc-lmb.cam.ac.uk/scop/
Structure similarity search. Hierarchic organization.
MOOSE http://db2.sdsc.edu/moose/
Macromolecular Structure Query
Cambridge Structural Database (CSD) http://www.ccdc.cam.ac.uk/
Small molecules.
![Page 63: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/63.jpg)
![Page 64: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/64.jpg)
Protein Splicing?
• Protein splicing is defined as the excision of an intervening protein sequence (the INTEIN) from a protein precursor and the concomitant ligation of the flanking protein fragments (the EXTEINS) to form a mature extein protein and the free intein
• http://www.neb.com/inteins/intein_intro.html
![Page 65: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/65.jpg)
Biological databases
• NAR Database Issue
– Every year: NAR DB Issue
– The 2006 update includes 858 databases
– Citation top 5 are:
• Pfam
• Gene Ontology
• UniProt
• SMART
• KEGG
– Primary Nucleotide DB’s and PDB are not cited anymore
![Page 66: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/66.jpg)
Outline
• Molecular Biology
• Flat files “sequence” databases
– DNA
– Protein
– Structure
• Relational Databases
– What ?
– Why ?
• Biological Relational Databases
– Howto ?
![Page 67: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/67.jpg)
Why biological databases ?
• Explosive growth in biological data
• Data (sequences, 3D structures, 2D gel analysis, MS analysis, …) are no longer published in a conventional manner, but submitted directly to databases
• Essential tools for biological research, as classical publications used to be!
![Page 68: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/68.jpg)
Problems with Flat files …
• Wasted storage space
• Wasted processing time
• Data control problems
• Problems caused by changes to data structures
• Access to data difficult
• Data out of date
• Constraints are system based
• Limited querying, e.g. all single exon GPCRs (<1000 bp)
![Page 69: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/69.jpg)
Relational
• The relational model is not only very mature; a strong body of knowledge has developed around it on how to make a relational back-end fast and reliable, and how to exploit different technologies such as massive SMP, optical jukeboxes, clustering, etc. Object databases are nowhere near this, and I do not expect them to get there in the short or medium term.
• Relational databases have a well-known and proven underlying mathematical theory, a simple one (set theory), that makes possible
– automatic cost-based query optimization,
– schema generation from high-level models and
– many other features that are now vital for mission-critical Information Systems development and operations.
![Page 70: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/70.jpg)
• What is a relational database ?
– Sets of tables and links (the data)
– A language to query the database (Structured Query Language)
– A program to manage the data (RDBMS)
• Flat files are not relational
– Data type (attribute) is part of the data
– Record order matters
– Multiline records
– Massive duplication
• E.g. Organism: Homo sapiens, Eukaryota, …
– Some records are hierarchical
• Xrefs
– Records contain multiple “sub-records”
– Implicit “Key”
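The flat-file properties listed above can be made concrete with a small parsing sketch. The record below is invented for illustration and only imitates an EMBL-like layout (a two-letter line code followed by a value); the accession names and cross-references are hypothetical. It shows that the attribute name travels inside the data, that one field can span several lines, and that one record carries several cross-reference "sub-records".

```python
from collections import defaultdict

# A made-up record in an EMBL-like layout: line code first, then the value.
FLAT_RECORD = """\
ID   EXAMPLE1
OS   Homo sapiens
OC   Eukaryota; Metazoa;
OC   Chordata.
DR   DB_A; X0001
DR   DB_B; Y0002
//
"""


def parse_record(text):
    """Collect the values of each line code; repeated codes keep their order."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):  # record terminator
            break
        code, _, value = line.partition(" ")
        fields[code].append(value.strip())
    return dict(fields)


record = parse_record(FLAT_RECORD)

# A multiline field has to be stitched back together by the reader.
lineage = " ".join(record["OC"])

# Each DR line is a "sub-record" that would become a row in a separate table.
xrefs = record["DR"]
```

In a relational design the organism and lineage would be stored once in their own table, and every entry would point to them by key instead of repeating the text.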
![Page 71: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/71.jpg)
The Benefits of Databases
• Redundancy can be reduced
• Inconsistency can be avoided
• Conflicting requirements can be balanced
• Standards can be enforced
• Data can be shared
• Data independence
• Integrity can be maintained
• Security restrictions can be applied
![Page 72: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/72.jpg)
Disadvantages
• size
• complexity
• cost
• Additional hardware costs
• Higher impact of failure
• Recovery more difficult
![Page 73: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/73.jpg)
Relational Terminology
CUSTOMER Table (Relation)

| ID | NAME | PHONE | EMP_ID |
|---|---|---|---|
| 201 | Unisports | 55-2066101 | 12 |
| 202 | Simms Atheletics | 81-20101 | 14 |
| 203 | Delhi Sports | 91-10351 | 14 |
| 204 | Womansport | 1-206-104-0103 | 11 |

Row (Tuple)
Column (Attribute)
![Page 74: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/74.jpg)
Relational Database Terminology
• Each row of data in a table is uniquely identified by a primary key (PK)
• Information in multiple tables can be logically related by foreign keys (FK)

Table Name: CUSTOMER (Primary Key: ID, Foreign Key: EMP_ID)

| ID | NAME | PHONE | EMP_ID |
|---|---|---|---|
| 201 | Unisports | 55-2066101 | 12 |
| 202 | Simms Atheletics | 81-20101 | 14 |
| 203 | Delhi Sports | 91-10351 | 14 |
| 204 | Womansport | 1-206-104-0103 | 11 |

Table Name: EMP (Primary Key: ID)

| ID | LAST_NAME | FIRST_NAME |
|---|---|---|
| 10 | Havel | Marta |
| 11 | Magee | Colin |
| 12 | Giljum | Henry |
| 14 | Nguyen | Mai |
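As a minimal sketch of how primary and foreign keys behave, the two tables can be rebuilt in an in-memory SQLite database using Python's standard sqlite3 module. SQLite is an assumption made for convenience here (the slides do not prescribe an engine), and the column types are guesses.

```python
import sqlite3

# In-memory database; table and column names follow the EMP/CUSTOMER example.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only on request
conn.executescript("""
CREATE TABLE emp (
    id         INTEGER PRIMARY KEY,
    last_name  TEXT NOT NULL,
    first_name TEXT NOT NULL
);
CREATE TABLE customer (
    id     INTEGER PRIMARY KEY,
    name   TEXT NOT NULL,
    phone  TEXT,
    emp_id INTEGER REFERENCES emp(id)
);
""")
conn.executemany("INSERT INTO emp VALUES (?, ?, ?)", [
    (10, "Havel", "Marta"), (11, "Magee", "Colin"),
    (12, "Giljum", "Henry"), (14, "Nguyen", "Mai"),
])
conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)", [
    (201, "Unisports", "55-2066101", 12),
    (202, "Simms Atheletics", "81-20101", 14),
    (203, "Delhi Sports", "91-10351", 14),
    (204, "Womansport", "1-206-104-0103", 11),
])

# Follow the foreign key to find the employee linked to each customer.
rows = conn.execute("""
    SELECT customer.name, emp.last_name
    FROM customer
    INNER JOIN emp ON customer.emp_id = emp.id
    ORDER BY customer.id
""").fetchall()

# A customer pointing at a non-existent employee violates the FK constraint.
try:
    conn.execute("INSERT INTO customer VALUES (205, 'Ghost', NULL, 99)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
```

The join recovers the employee for every customer without storing the employee's name in the CUSTOMER table, and the rejected insert shows the database itself guarding consistency.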
![Page 75: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/75.jpg)
• RDBMS products
– Free
• MySQL: very fast, widely used, easy to jump into, but limited, non-standard SQL
• PostgreSQL: full SQL, limited OO, higher learning curve than MySQL
– Commercial
• MS Access: great query builder, GUI interfaces
• MS SQL Server: full SQL, NT only
• Oracle: everything, including the kitchen sink
• IBM DB2, Sybase
![Page 76: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/76.jpg)
A simple datamodel (tables and relations)
| Prot_id | name | seq | Species_id |
|---|---|---|---|
| 1 | GTM1_HUMAN | MGTDHG… | 1 |
| 2 | GTM1_RAT | MGHJADSW.. | 2 |
| 3 | GTM2_HUMAN | MVSDBSVD.. | 1 |

| Species_id | name | Full Lineage |
|---|---|---|
| 1 | human | Homo Sapiens … |
| 2 | rat | Rattus rattus |
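The protein/species model above could be sketched as follows, again with SQLite through Python's sqlite3 module as an assumed stand-in engine. The table names `protein` and `species`, the column name `lineage`, and the column types are illustrative choices, and the sequences are only the truncated prefixes shown on the slide, used as placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE species (
    species_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    lineage    TEXT NOT NULL      -- assumed name for the lineage column
);
CREATE TABLE protein (
    prot_id    INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    seq        TEXT NOT NULL,     -- placeholder prefixes, not full sequences
    species_id INTEGER NOT NULL REFERENCES species(species_id)
);
""")
conn.executemany("INSERT INTO species VALUES (?, ?, ?)", [
    (1, "human", "Homo sapiens"),
    (2, "rat", "Rattus rattus"),
])
conn.executemany("INSERT INTO protein VALUES (?, ?, ?, ?)", [
    (1, "GTM1_HUMAN", "MGTDHG", 1),
    (2, "GTM1_RAT", "MGHJADSW", 2),
    (3, "GTM2_HUMAN", "MVSDBSVD", 1),
])

# Species details are stored once; proteins refer to them through species_id.
human_proteins = [row[0] for row in conn.execute("""
    SELECT protein.name
    FROM protein
    INNER JOIN species ON protein.species_id = species.species_id
    WHERE species.name = 'human'
    ORDER BY protein.prot_id
""")]
```

Because the species row exists only once, correcting a lineage means updating a single row rather than every protein record that mentions it.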
![Page 77: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/77.jpg)
Relational Database Fundamentals
• Basic SQL
– SELECT
– FROM
– WHERE
– JOIN: NATURAL, INNER, OUTER
• Other SQL functions
– COUNT()
– MAX(), MIN(), AVG()
– DISTINCT
– ORDER BY
– GROUP BY
– LIMIT
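A short sketch of the aggregate and ordering keywords listed above, run on a small invented table through Python's sqlite3 module. The table name `entry`, its columns, and all values are hypothetical; in SQL the averaging aggregate is spelled `AVG()`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (name TEXT, species TEXT, length INTEGER)")
conn.executemany("INSERT INTO entry VALUES (?, ?, ?)", [
    ("A", "human", 900), ("B", "human", 1500),
    ("C", "rat", 1200), ("D", "rat", 600),
    ("E", "mouse", 1000),
])

# COUNT: number of rows in the table.
total = conn.execute("SELECT COUNT(*) FROM entry").fetchone()[0]

# MIN, MAX, AVG: summary statistics over one column.
shortest, longest, mean = conn.execute(
    "SELECT MIN(length), MAX(length), AVG(length) FROM entry"
).fetchone()

# DISTINCT with ORDER BY: each species once, alphabetically.
species = [r[0] for r in conn.execute(
    "SELECT DISTINCT species FROM entry ORDER BY species"
)]

# GROUP BY: one count per species.
per_species = conn.execute(
    "SELECT species, COUNT(*) FROM entry GROUP BY species ORDER BY species"
).fetchall()

# ORDER BY ... DESC with LIMIT: the two longest entries.
top2 = [r[0] for r in conn.execute(
    "SELECT name FROM entry ORDER BY length DESC LIMIT 2"
)]
```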
![Page 78: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/78.jpg)
BioSQL
![Page 79: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/79.jpg)
![Page 80: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/80.jpg)
![Page 81: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/81.jpg)
![Page 82: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/82.jpg)
• Query: a request to retrieve data from a database is called a query
• e.g. MyGPCRdb
– Bioentry
– Taxid (include full lineage)
– Linking table (bioentry_tax)
![Page 83: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/83.jpg)
MyGPCR;
Give me all GPCRs that are shorter than 1000 bp
select * from bioentry;
select count(*) from bioentry;
select * from bioentry inner join biosequence on bioentry.bioentry_id=biosequence.bioentry_id;
select * from bioentry inner join biosequence on bioentry.bioentry_id=biosequence.bioentry_id where length(biosequence_str)<1000;
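A minimal sketch of the last query, run against SQLite via Python's sqlite3 module rather than the MySQL back-end used in the course. The two-table schema is a deliberately simplified assumption (only the columns the query touches), and the entry names and sequences are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bioentry (
    bioentry_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE biosequence (
    bioentry_id     INTEGER PRIMARY KEY REFERENCES bioentry(bioentry_id),
    biosequence_str TEXT NOT NULL
);
""")
# Two hypothetical entries: one of 800 bases, one of 1200 bases.
conn.executemany("INSERT INTO bioentry VALUES (?, ?)",
                 [(1, "short_gpcr"), (2, "long_gpcr")])
conn.executemany("INSERT INTO biosequence VALUES (?, ?)",
                 [(1, "ACGT" * 200), (2, "ACGT" * 300)])

# Join each entry to its sequence and keep those shorter than 1000 bp.
short_entries = conn.execute("""
    SELECT bioentry.name, length(biosequence_str)
    FROM bioentry
    INNER JOIN biosequence
        ON bioentry.bioentry_id = biosequence.bioentry_id
    WHERE length(biosequence_str) < 1000
""").fetchall()
```

This is the kind of question the flat-file slide called hard to ask: the length filter is computed by the database at query time, with no custom parser.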
![Page 84: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/84.jpg)
Example 3-tier model in biological database
http://www.bioinformatics.be
Example of different interfaces to the same back-end database (MySQL)
![Page 85: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/85.jpg)
Overview
• DataBases
– Flat files (FF)
• *.txt
• Indexed version
– Relational (RDBMS)
• Access, MySQL, PostGRES, Oracle
– OO (OODBMS)
• AceDB, ObjectStore
– Hierarchical
• XML
– Frame based system
• Eg. DAML+OIL
– Hybrid systems
![Page 86: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/86.jpg)
Object
• The object paradigm is already proven for application design and development, but it may simply not be an adequate paradigm for the data store.
• Object databases are modelled by graphs. Graph theory plays a great role in computer science, but it is also a rich source of intractable problems, the NP-complete class: problems for which no computationally efficient solution is known, and for which no way of escaping exponential complexity has been found. This is not merely a limit of current technology; it is widely believed to be inherent to the problem domain.
• Hybrid object-relational databases will probably be the long-term solution for the industry. They put a thin object layer above the relational structure, thus providing a syntax and semantics closer to object-oriented design and programming tools. They simply make it easier to build the data layer classes.
![Page 87: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/87.jpg)
Conclusions
• A database is a central component of any contemporary information system
• The operations on the database and the maintenance of database consistency are handled by a DBMS
• There exist stand alone query languages or embedded languages but both deal with definition (DDL) and manipulation (DML) aspects
• The structural properties, constraints and operations permitted within a DBMS are defined by a data model - hierarchical, network, relational
• Recovery and concurrency control are essential
• Linking of heterogeneous data sources is a central theme in modern bioinformatics
![Page 88: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/88.jpg)
![Page 89: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/89.jpg)
• How do you know which database exists ?
• NAR list
• Web links on Nexus
– Searchable
– Maintainable
![Page 90: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/90.jpg)
![Page 91: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/91.jpg)
• Tools available in the public domain for simultaneous access
– Entrez
– SRS
• Batch queries to download data into local databases for subsequent analysis (see further)
![Page 92: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/92.jpg)
![Page 93: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/93.jpg)
![Page 94: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/94.jpg)
• What if you want to search the complete human genome (golden path coordinates) instead of separate NCBI entries ?
• ENSEMBL
![Page 95: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/95.jpg)
BioMart
• Joint project between EBI and CSHL, http://www.biomart.org/
• Aim is to develop a generic, query-oriented data management system capable of integrating distributed data sources
• 3 step system:
– Start by selecting a dataset to query
– Filter this dataset by applying the appropriate filters
– Generate the output by selecting the attributes and output format
• Available public biomart websites: http://www.biomart.org/biomart/martview
![Page 96: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/96.jpg)
BioMart - Single access point - Generic interface
![Page 97: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/97.jpg)
BioMart - ‘Out of the box’ website
![Page 98: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/98.jpg)
BioMart – 3 step system
Dataset
Attribute
Filter
![Page 99: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/99.jpg)
BioMart - 3 step system
Dataset: all Ensembl genes
Filter: located on chromosome 1, expressed in lung, associated with human homologues
Attribute: name, chromosome position, description
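The dataset, filter, attribute pattern can be imitated in a few lines of plain Python. This is only a toy sketch of the idea and does not call BioMart or use its API: the `DATASETS` dictionary, the `run_query` helper, the field names, and every gene record are hypothetical.

```python
from typing import Any

# Hypothetical stand-in for a queryable dataset; all records are invented.
DATASETS: dict[str, list[dict[str, Any]]] = {
    "genes": [
        {"name": "geneA", "chromosome": "1", "position": 1000,
         "description": "example gene A", "tissue": "lung"},
        {"name": "geneB", "chromosome": "1", "position": 5000,
         "description": "example gene B", "tissue": "liver"},
        {"name": "geneC", "chromosome": "2", "position": 700,
         "description": "example gene C", "tissue": "lung"},
    ],
}


def run_query(dataset: str, filters: dict[str, Any],
              attributes: list[str]) -> list[tuple]:
    """Step 1: pick a dataset. Step 2: apply filters. Step 3: project attributes."""
    rows = DATASETS[dataset]
    selected = [r for r in rows
                if all(r.get(key) == value for key, value in filters.items())]
    return [tuple(r[a] for a in attributes) for r in selected]


result = run_query(
    dataset="genes",
    filters={"chromosome": "1", "tissue": "lung"},
    attributes=["name", "position", "description"],
)
```

The filters decide which records survive and the attributes decide which fields of those records are returned, which mirrors a WHERE clause and a SELECT list in SQL.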
![Page 100: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/100.jpg)
BioMart - EnsMart
• The first in line was EnsMart, a powerful data mining toolset for retrieving customized data sets from annotated genomes. EnsMart integrates data from Ensembl and various worldwide data sources.
• EnsMart provides ....
– Gene and protein annotation
– Disease information
– Cross-species analyses
– SNPs affecting proteins
– Allele frequency data
– Retrieval by external identifiers
– Retrieval by Gene Ontology
– Customized sequence datasets
– Microarray annotation tools
![Page 101: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/101.jpg)
Other BioMart implementations
• Other data resources also implemented a BioMart interface:
– Wormbase
– Gramene
– HapMap
– DictyBase
– euGenes
![Page 102: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/102.jpg)
Single interface
![Page 103: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/103.jpg)
![Page 104: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/104.jpg)
![Page 105: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/105.jpg)
![Page 106: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/106.jpg)
BioBar
• A toolbar for browsing biological data and databases http://biobar.mozdev.org/
• The following databases are included http://biobar.mozdev.org/Databases.html
• a toolbar for Mozilla-based browsers including Firefox and Netscape 7+
![Page 107: Bioinformatica 06-10-2011-t2-databases](https://reader035.fdocuments.net/reader035/viewer/2022062418/554e93c4b4c90526358b4fcb/html5/thumbnails/107.jpg)
Weblems
Weblems Online (example posting)
W2.1: Which isolate of Tabac was used in the record with accession Z71230, and which human sample in the GenBank entry with accession AJ311677 ?
W2.2: Find all structures of GFP in the Protein Data Bank and draw a histogram of their dates of deposition.
W2.3: What is the chromosomal location of the human gene for insulin ?
W2.4: How many different human NHRs (nuclear hormone receptors) exist ? How many of these are single exon genes ? Are there any drugs working on this class of receptors ?
W2.5: The gene for Berardinelli-Seip syndrome was initially localized between two markers on chromosome band 11q13, D11S4191 and D11S987.
a. How many base pairs are there in the interval between these two markers ?
b. How many known genes are there ?
c. List the gene ontology terms for that region.