An introduction to biological databases MCB, Janv 2003 Yes, if you train quickly, you can create a new database of databases, but first eat your dinner !
Date posted: 18-Dec-2015

Page 1

An introduction to biological databases

MCB, Jan 2003

Yes, if you train quickly, you can create a new database of databases, but first eat your dinner!

Page 2

What is a database ?

A collection of data that is:

• structured
• searchable (index) -> table of contents
• updated periodically (release) -> new edition
• cross-referenced (hyperlinks) -> links with other db

Also includes the associated tools (software) needed for db access/query, db updating, and insertion or deletion of db information.

Data storage/resource management: flat files, relational databases, object-oriented databases, …

Page 3

Database: a « flat file » example

« Introduction To Databases » Teacher Database (flat file, 3 entries)

Accession number: 1
First Name: Amos
Last Name: Bairoch
Course: DEA 2000; DEA 2001; DEA 2002;
http://www.expasy.org/people/amos.html
//
Accession number: 2
First Name: Laurent
Last name: Falquet
Course: EMBnet 2000; EMBnet 2001; EMBnet 2002; DEA 2000; DEA 2001; DEA 2002
//
Accession number: 3
First Name: Marie-Claude
Last name: Blatter
Course: EMBnet 2000; EMBnet 2001; EMBnet 2002; DEA 2000; DEA 2001; DEA 2002
http://www.expasy.org/people/Marie-Claude.Blatter.html
//

Easy to manage: all the entries are visible at the same time !

-> human readable, implicit data
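To make the "implicit data" point concrete, here is a minimal Python sketch of how a program might read such a flat file. It assumes one "Field: value" pair per line and "//" as the end-of-entry marker, as in the teacher example; the sample text is an abridged copy (URL lines left out), and the function name `parse_flat_file` is hypothetical.

```python
# Abridged sample in the flat-file layout of the teacher example:
# "Field: value" lines, with "//" closing each entry (assumed layout).
FLAT_FILE = """\
Accession number: 1
First Name: Amos
Last Name: Bairoch
Course: DEA 2000; DEA 2001; DEA 2002
//
Accession number: 2
First Name: Laurent
Last Name: Falquet
Course: EMBnet 2000; EMBnet 2001; EMBnet 2002; DEA 2000; DEA 2001; DEA 2002
//
"""


def parse_flat_file(text):
    """Split the text into entries, each a dict mapping field name to value."""
    entries = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "//":  # end-of-entry marker
            entries.append(current)
            current = {}
            continue
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    return entries


entries = parse_flat_file(FLAT_FILE)

# The course list is implicit structure: it is only a list once we split it.
courses = [c.strip() for c in entries[1]["Course"].split(";") if c.strip()]
print(entries[0]["Last Name"], len(courses))
```

Nothing in the file itself says that "Course" holds several values or that years are numbers; the reading program has to know this, which is what the relational layout makes explicit.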

Page 4

Database: a « relational » example

Teacher_ID   Teacher    Education
1            Amos       Biochemistry
2            Laurent    Biochemistry
3            M-Claude   Biochemistry

Course_ID   Course
1           DEA
2           EMBnet

Course_ID   Date
1           2000
1           2001
1           2002
2           2000
2           2001
2           2002

Teacher_ID   Course_ID
1            1
2            1
2            2
3            1
3            2

Easier to manage; important to know the schema; choice of the output
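The four tables of the relational example can be tried out with Python's built-in `sqlite3` module. This is only a sketch: the table and column names are assumptions chosen to mirror the slide (the "Date" column is called `year` here), and the query is one possible "choice of the output".

```python
import sqlite3

# In-memory database holding the four tables of the relational example.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE teacher (teacher_id INTEGER PRIMARY KEY, teacher TEXT, education TEXT);
CREATE TABLE course (course_id INTEGER PRIMARY KEY, course TEXT);
CREATE TABLE course_date (course_id INTEGER, year INTEGER);
CREATE TABLE teacher_course (teacher_id INTEGER, course_id INTEGER);
""")
cur.executemany(
    "INSERT INTO teacher VALUES (?, ?, ?)",
    [(1, "Amos", "Biochemistry"), (2, "Laurent", "Biochemistry"),
     (3, "M-Claude", "Biochemistry")],
)
cur.executemany("INSERT INTO course VALUES (?, ?)", [(1, "DEA"), (2, "EMBnet")])
cur.executemany(
    "INSERT INTO course_date VALUES (?, ?)",
    [(1, 2000), (1, 2001), (1, 2002), (2, 2000), (2, 2001), (2, 2002)],
)
cur.executemany(
    "INSERT INTO teacher_course VALUES (?, ?)",
    [(1, 1), (2, 1), (2, 2), (3, 1), (3, 2)],
)

# Which teacher gives which course? Answering needs the schema: two joins.
rows = cur.execute("""
SELECT t.teacher, c.course
FROM teacher AS t
JOIN teacher_course AS tc ON tc.teacher_id = t.teacher_id
JOIN course AS c ON c.course_id = tc.course_id
ORDER BY t.teacher_id, c.course_id
""").fetchall()

for teacher, course in rows:
    print(teacher, course)
```

Each fact is stored once (a course name, a teacher name), and the link table `teacher_course` replaces the repeated course lists of the flat file.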

Page 5

Why biological databases ?

Exponential growth in biological data.

Data (genomic sequences, 3D structures, 2D gel analysis, MS analysis, Microarrays….) are no longer published in a conventional manner, but directly submitted to databases.

Essential tools for biological research.

Page 6

Distribution of databases

Books, articles     1968 -> 1985
Computer tapes      1982 -> 1992
Floppy disks        1984 -> 1990
CD-ROM              1989 -> ?
FTP                 1989 -> ?
On-line services    1982 -> 1994
WWW                 1993 -> ?
DVD                 2001 -> ?

Page 7

Some statistics

• More than 1000 different ‘biological’ databases
• Variable size: <100Kb to >10Gb
    DNA: > 10 Gb
    Protein: 1 Gb
    3D structure: 5 Gb
    Other: smaller
• Update frequency: daily to annually
• Usually accessible through the web (free !?)
    Amos’ links: www.expasy.org/alinks.html
    Biohunt: http://www.expasy.org/BioHunt/
    Google: http://www.google.com/

Page 8

Some databases in the field of molecular biology…

AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb, ARR, AsDb, BBDB, BCGD, Beanref, BioImage,
BioMagResBank, BIOMDB, BLOCKS, BovGBASE, BOVMAP, BSORF, BTKbase, CANSITE, CarbBank,
CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP, ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG,
CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb, Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract,
ECDC, ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db,
ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView, GCRDB, GDB, GENATLAS, GenBank, GeneCards,
Genline, GenLink, GENOTK, GenProtEC, GIFTS, GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB,
HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD, HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN, HPDB,
HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat,
KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB, Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP,
Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us, MPDB, MRR, MutBase, MycDB, NDB, NRSub, O-lycBase,
OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB,
PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD, PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE,
PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE, SCOP, SeqAnalRef, SGD, SGP, SheepMap, Soybase,
SPAD, SRNA db, SRPDB, STACK, StyGene, Sub2D,
SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE, SWISS-MODEL Repository, SWISS-PROT, TelDB, TGN, tmRDB,
TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE, VDRR, VectorDB, WDCM, WIT, WormPep, YEPD, YPD,
YPM, etc. !!!!

Page 9

Categories of databases for Life Sciences

• Sequences (DNA, protein)
• Genomics
• Mutation/polymorphism
• Protein domain/family (----> tools)
• Proteomics (2D gel, Mass Spectrometry)
• 3D structure
• Metabolism
• Bibliography
• ‘Others’ (Microarrays, Protein-protein interaction…)

Page 10

Sequence databases

1. DNA/RNA
2. Proteins

Page 11

Ideal minimal content of a sequence database entry

• Sequences !!
• Accession number (AC) (unique identifier)
• Taxonomic data
• References
• ANNOTATION/CURATION
• Keywords
• Cross-references
• Documentation

Page 12

Sequence database : example

SWISS-PROT (protein db) (flat file)

ID   EPO_HUMAN STANDARD; PRT; 193 AA.
AC   P01588; Q9UHA0; Q9UEZ5; Q9UDZ0;
DT   21-JUL-1986 (Rel. 01, Created)
DT   21-JUL-1986 (Rel. 01, Last sequence update)
DT   20-AUG-2001 (Rel. 40, Last annotation update)
DE   Erythropoietin precursor.
GN   EPO.
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606;
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE=85137899; PubMed=3838366;
RA   Jacobs K., Shoemaker C., Rudersdorf R., Neill S.D., Kaufman R.J.,
RA   Mufson A., Seehra J., Jones S.S., Hewick R., Fritsch E.F.,
RA   Kawakita M., Shimizu T., Miyake T.;
RT   "Isolation and characterization of genomic and cDNA clones of human
RT   erythropoietin.";
RL   Nature 313:806-810(1985).
….
CC   -!- FUNCTION: ERYTHROPOIETIN IS THE PRINCIPAL HORMONE INVOLVED IN THE
CC       REGULATION OF ERYTHROCYTE DIFFERENTIATION AND THE MAINTENANCE OF A
CC       PHYSIOLOGICAL LEVEL OF CIRCULATING ERYTHROCYTE MASS.
CC   -!- SUBCELLULAR LOCATION: SECRETED.
CC   -!- TISSUE SPECIFICITY: PRODUCED BY KIDNEY OR LIVER OF ADULT MAMMALS
CC       AND BY LIVER OF FETAL OR NEONATAL MAMMALS.
CC   -!- PHARMACEUTICAL: Available under the names Epogen (Amgen) and
CC       Procrit (Ortho Biotech).
…
DR   EMBL; X02158; CAA26095.1; -.
DR   EMBL; X02157; CAA26094.1; -.
DR   EMBL; M11319; AAA52400.1; -.
DR   EMBL; AF053356; AAC78791.1; -.
DR   EMBL; AF202308; AAF23132.1; -.
DR   EMBL; AF202306; AAF23132.1; JOINED.
….
KW   Erythrocyte maturation; Glycoprotein; Hormone; Signal; Pharmaceutical.

Parts of the entry pointed out on the slide: Accession number, Taxonomy, Reference, Annotations (comments), Cross-references, Keywords
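Because every line of a SWISS-PROT flat-file entry starts with a two-letter line code (ID, AC, DE, OS, KW, ...), a short program can pull out individual fields. The following Python sketch groups lines by their code; it uses an abridged copy of the EPO_HUMAN entry, assumes the content starts after the code and some whitespace, and the function name `parse_entry` is hypothetical.

```python
from collections import defaultdict

# Abridged copy of the EPO_HUMAN entry (only a few line types kept).
ENTRY = """\
ID   EPO_HUMAN STANDARD; PRT; 193 AA.
AC   P01588; Q9UHA0; Q9UEZ5; Q9UDZ0;
DE   Erythropoietin precursor.
GN   EPO.
OS   Homo sapiens (Human).
OX   NCBI_TaxID=9606;
KW   Erythrocyte maturation; Glycoprotein; Hormone; Signal; Pharmaceutical.
//
"""


def parse_entry(text):
    """Group the lines of one entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("//"):  # end of entry
            break
        code, content = line[:2], line[2:].strip()
        fields[code].append(content)
    return fields


fields = parse_entry(ENTRY)

# AC may list several accession numbers; the first one is the primary one.
accessions = [a.strip() for a in " ".join(fields["AC"]).split(";") if a.strip()]
primary_accession = accessions[0]

# KW is a ";"-separated list terminated by a full stop.
keywords = [k.strip() for k in " ".join(fields["KW"]).rstrip(".").split(";")]

print(primary_accession, keywords)
```

The same idea applies to EMBL entries, which use the same style of two-letter line codes.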

Page 13

Sequence database: example (cont.)

FT   SIGNAL        1     27
FT   CHAIN        28    193       ERYTHROPOIETIN.
FT   PROPEP      190    193       MAY BE REMOVED IN PROCESSED PROTEIN.
FT   DISULFID     34    188
FT   DISULFID     56     60
FT   CARBOHYD     51     51       N-LINKED (GLCNAC...).
FT   CARBOHYD     65     65       N-LINKED (GLCNAC...).
FT   CARBOHYD    110    110       N-LINKED (GLCNAC...).
FT   CARBOHYD    153    153       O-LINKED (GALNAC...).
FT   VARIANT     131    132       SL -> NF (IN AN HEPATOCELLULAR
FT                                CARCINOMA).
FT                                /FTId=VAR_009870.
FT   VARIANT     149    149       P -> Q (IN AN HEPATOCELLULAR CARCINOMA).
FT                                /FTId=VAR_009871.
FT   CONFLICT     40     40       E -> Q (IN REF. 1; CAA26095).
FT   CONFLICT     85     85       Q -> QQ (IN REF. 5).
FT   CONFLICT    140    140       G -> R (IN REF. 1; CAA26095).
**
**   ################# INTERNAL SECTION ##################
**CL 7q22;
SQ   SEQUENCE 193 AA; 21306 MW; C91F0E4C26A52033 CRC64;
     MGVHECPAWL WLLLSLLSLP LGLPVLGAPP RLICDSRVLE RYLLEAKEAE NITTGCAEHC
     SLNENITVPD TKVNFYAWKR MEVGQQAVEV WQGLALLSEA VLRGQALLVN SSQPWEPLQL
     HVDKAVSGLR SLTTLLRALG AQKEAISPPD AASAAPLRTI TADTFRKLFR VYSNFLRGKL
     KLYTGEACRT GDR
//

Parts of the entry pointed out on the slide: Annotations (features), Sequence
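The FT lines give positions counted from 1, with both ends included. A small Python sketch (the helper name `feature` is hypothetical) shows how such coordinates map onto the sequence of the entry, and checks that the annotated disulfide bonds really fall on cysteines:

```python
# Sequence from the SQ block of the EPO_HUMAN entry (blocks of 10 residues).
SEQUENCE = (
    "MGVHECPAWL WLLLSLLSLP LGLPVLGAPP RLICDSRVLE RYLLEAKEAE NITTGCAEHC "
    "SLNENITVPD TKVNFYAWKR MEVGQQAVEV WQGLALLSEA VLRGQALLVN SSQPWEPLQL "
    "HVDKAVSGLR SLTTLLRALG AQKEAISPPD AASAAPLRTI TADTFRKLFR VYSNFLRGKL "
    "KLYTGEACRT GDR"
).replace(" ", "")


def feature(seq, start, end):
    """Return residues start..end (1-based, inclusive), as used in FT lines."""
    return seq[start - 1:end]


signal = feature(SEQUENCE, 1, 27)    # FT SIGNAL 1 27
chain = feature(SEQUENCE, 28, 193)   # FT CHAIN 28 193

# FT DISULFID 34 188 and FT DISULFID 56 60: each end should be a cysteine.
disulfide_positions = [34, 188, 56, 60]
disulfide_residues = [feature(SEQUENCE, p, p) for p in disulfide_positions]

print(len(SEQUENCE), len(signal), len(chain), disulfide_residues)
```

The only subtlety is the shift between 1-based, inclusive feature coordinates and Python's 0-based, end-exclusive slices.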

Page 14

Sequence Databases: some « technical » definitions

Data storage management:
    flat file: text file, human readable
    relational database (e.g., Oracle, Postgres)
    object oriented database

Format: fasta, GCG, NBRF/PIR, MSF…. standardized format ?
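Of the formats listed, fasta is the simplest: a header line starting with ">" followed by the sequence. A minimal Python reader might look like the sketch below. The sample record is the EPO_HUMAN entry used in this deck, wrapped at 60 residues per line (the wrapping is an assumption), and the function name `parse_fasta` is hypothetical.

```python
FASTA = """\
>sp|P01588|EPO_HUMAN ERYTHROPOIETIN PRECURSOR - Homo sapiens (Human).
MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLERYLLEAKEAENITTGCAEHC
SLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVLRGQALLVNSSQPWEPLQL
HVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTITADTFRKLFRVYSNFLRGKL
KLYTGEACRTGDR
"""


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from fasta-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:  # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


records = parse_fasta(FASTA)
header, sequence = records[0]

# In this header style the fields are separated by "|": db, accession, name.
db, accession, rest = header.split("|", 2)
entry_name = rest.split()[0]

print(accession, entry_name, len(sequence))
```

Compared with the full flat-file entry, everything except the identifier, a one-line description and the sequence is lost.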

Page 15

Sequence database: example

…a SWISS-PROT entry, in fasta format:

>sp|P01588|EPO_HUMAN ERYTHROPOIETIN PRECURSOR - Homo sapiens (Human).
MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLERYLLEAKEAENITTGCAEHC
SLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVLRGQALLVNSSQPWEPLQL
HVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTITADTFRKLFRVYSNFLRGKL
KLYTGEACRTGDR

Page 16

Database 1: nucleotide sequences

The 3 main nucleic acid sequence databases are EMBL (Europe)/GenBank (USA) /DDBJ (Japan)

EMBL: since 1982

Specialized databases for the different types of RNAs (e.g. tRNA, rRNA, tmRNA, uRNA, etc.)

3D structure (DNA and RNA) - PDB

Others: Aberrant splicing db; Eukaryotic promoter db (EPD); RNA editing sites, Multimedia Telomere Resource ……

Page 17

Nucleotide and associated topics databases (Amos’ links)

EMBL - EMBL Nucleotide sequence db (EBI)
GenBank - GenBank Nucleotide Sequence db (NCBI)
DDBJ - DNA Data Bank of Japan
dbEST - dbEST (Expressed Sequence Tags) db (NCBI)
dbSTS - dbSTS (Sequence Tagged Sites) db (NCBI)

NDB - Nucleic Acid Databank (3D structures)
BNASDB - Nucleic acid structure db from University of Pune

AsDb - Aberrant Splicing db
ACUTS - Ancient conserved untranslated DNA sequences db
Codon Usage Db
EPD - Eukaryotic Promoter db
HOVERGEN - Homologous Vertebrate Genes db
IMGT - ImMunoGeneTics db [Mirror at EBI]
ISIS - Intron Sequence and Information System
RDP - Ribosomal db Project
gRNAs db - Guide RNA db
PLACE - Plant cis-acting regulatory DNA elements db
PlantCARE - Plant cis-acting regulatory DNA elements db
sRNA db - Small RNA db
ssu rRNA - Small ribosomal subunit db
lsu rRNA - Large ribosomal subunit db
5S rRNA - 5S ribosomal RNA db
tmRNA Website
tmRDB - tmRNA dB
tRNA - tRNA compilation from the University of Bayreuth
uRNADB - uRNA db
RNA editing - RNA editing site
RNAmod db - RNA modification db
SOS-DGBD - Db of Drosophila DNA sequences annotated with regulatory binding sites
TelDB - Multimedia Telomere Resource
TRADAT - TRAnscription Databases and Analysis Tools
Subviral RNA db - Small circular RNAs db (viroid and viroid-like)

MPDB - Molecular probe db
OPD - Oligonucleotide probe db
VectorDB - Vector sequence db (seems dead!)

Page 18

EMBL/GenBank/DDBJ

• These 3 db contain mainly the same information within 2-3 days (few differences in the format and syntax)
• Contribution: EMBL 10 %; GenBank 73 %; DDBJ 17 %
• Serve as archives containing all sequences (single genes, ESTs, complete genomes, etc.) derived from:
    Genome projects (> 80 % of entries)
    Sequencing centers
    Individual scientists (15 % of entries)
    Patent offices (e.g. European Patent Office, EPO)
• Non-confidential data are exchanged daily
• Currently: 18 x 10^6 sequences, ~30 x 10^9 bp
• Sequences from > 50’000 different species

Page 19

The tremendous increase in nucleotide sequences

[Figure: growth of EMBL data over time. Annotations: first increase in data due to the development of PCR; 1980: 80 genes fully sequenced !; organism labels: human, mouse, rat; High throughput genomes (HTG).]

Page 20

EMBL/GenBank/DDBJ

• Heterogeneous sequence qualities and lengths: ESTs, genomes, variants, fragments…
• Sequence sizes:
    max 350’000 bp/entry (! genomic sequences*, overlapping)
    min 10 bp/entry
• Archive: nothing goes out -> highly redundant !
• Full of errors: in sequences, in annotations, in CDS attribution…
• No consistency of annotations; most annotations are done by the submitters; heterogeneous quality, completeness and updating of the information

*entries contain only the assembly data

Page 21
Page 22

EMBL/GenBank/DDBJ

Unexpected information you can find in these db:

FT   source          1..124
FT                   /db_xref="taxon:4097"
FT                   /organelle="plastid:chloroplast"
FT                   /organism="Nicotiana tabacum"
FT                   /isolate="Cuban cahibo cigar, gift from President Fidel
FT                   Castro"

Or:

FT   source          1..17084
FT                   /chromosome="complete mitochondrial genome"
FT                   /db_xref="taxon:9267"
FT                   /organelle="mitochondrion"
FT                   /organism="Didelphis virginiana"
FT                   /dev_stage="adult"
FT                   /isolate="fresh road killed individual"
FT                   /tissue_type="liver"

Page 23

FT   CDS             complement(45959..47332)
FT                   /db_xref="SPTREMBL:Q9UZ71"
FT                   /note="PAB2386"
FT                   /transl_table=11
FT                   /product="4-AMINOBUTYRATE qui se dilate AMINOTRANSFERASE
FT                   (EC 2.6.1.19)"
FT                   /protein_id="CAB50188.1"
FT                   /translation="MDYPRIVVNPPGPKAKELIEREKRVLSTGIGVKLFPLVPKRGFGP
FT                   FIEDVDGNVFIDFLAGAAAASTGYSHPKLVKAVKEQVELIQHSMIGYTHSERAIRVAEK
FT                   LVKISPIKNSKVLFGLSGSDAVDMAIKVSKFSTRRPWILAFIGAYHGQTLGATSVASFQ
FT                   VSQKRGYSPLMPNVFWVPYPNPYRNPWGINGYEEPQELVNRVVEYLEDYVFSHVVPPDE
FT                   VAAFFAEPIQGDAGIVVPPENFFKELKKLLDEHGILLVMDEVQTGIGRTGKWFASEWFE
FT                   VKPDMIIFGKGVASGMGLSGVIGREDIMDITSGSALLTPAANPVISAAADATLEIIEEE
FT                   NLLKNAIEVGSFIMKRLNELKEQFDIIGDVRGKGLMIGVEIVKENGRPDPEMTGKICWR
FT                   AFELGLILPSYGMFGNVIRITPPLVLTKEVAEKGLEIIEKAIKDAIAGKVERKVVTWH"

Page 24

EMBL entry: example

ID   HSERPG standard; DNA; HUM; 3398 BP.
XX
AC   X02158;
XX
SV   X02158.1
XX
DT   13-JUN-1985 (Rel. 06, Created)
DT   22-JUN-1993 (Rel. 36, Last updated, Version 2)
XX
DE   Human gene for erythropoietin
XX
KW   erythropoietin; glycoprotein hormone; hormone; signal peptide.
XX
OS   Homo sapiens (human)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Primates; Catarrhini; Hominidae; Homo.
XX
RN   [1]
RP   1-3398
RX   MEDLINE; 85137899.
RA   Jacobs K., Shoemaker C., Rudersdorf R., Neill S.D., Kaufman R.J.,
RA   Mufson A., Seehra J., Jones S.S., Hewick R., Fritsch E.F., Kawakita M.,
RA   Shimizu T., Miyake T.;
RT   Isolation and characterization of genomic and cDNA clones of human
RT   erythropoietin;
RL   Nature 313:806-810(1985).
XX
DR   GDB; 119110; EPO.
DR   GDB; 119615; TIMP1.
DR   SWISS-PROT; P01588; EPO_HUMAN.
XX
…

Parts of the entry pointed out on the slide: keyword, taxonomy, references, cross-references (link to protein sequence db, if CDS)

Page 25

EMBL entry (cont.)

CC   Data kindly reviewed (24-FEB-1986) by K. Jacobs
FH   Key             Location/Qualifiers
FH
FT   source          1..3398
FT                   /db_xref=taxon:9606
FT                   /organism=Homo sapiens
FT   mRNA            join(397..627,1194..1339,1596..1682,2294..2473,2608..3327)
FT   CDS             join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)
FT                   /db_xref=SWISS-PROT:P01588
FT                   /product=erythropoietin
FT                   /protein_id=CAA26095.1
FT                   /translation=MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLQRYLLE
FT                   AKEAENITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVLRG
FT                   QALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTITAD
FT                   TFRKLFRVYSNFLRGKLKLYTGEACRTGDR
FT   mat_peptide     join(1262..1339,1596..1682,2294..2473,2608..2763)
FT                   /product=erythropoietin
FT   sig_peptide     join(615..627,1194..1261)
FT   exon            397..627
FT                   /number=1
FT   intron          628..1193
FT                   /number=1
FT   exon            1194..1339
FT                   /number=2
FT   intron          1340..1595
FT                   /number=2
FT   exon            1596..1682
FT                   /number=3
FT   intron          1683..2293
FT                   /number=3
FT   exon            2294..2473
FT                   /number=4
FT   intron          2474..2607
FT                   /number=4
FT   exon            2608..3327
FT                   /note=3' untranslated region
FT                   /number=5
XX
SQ   Sequence 3398 BP; 698 A; 1034 C; 991 G; 675 T; 0 other;
     agcttctggg cttccagacc cagctacttt gcggaactca gcaacccagg catctctgag        60
     tctccgccca agaccgggat gccccccagg aggtgtccgg gagcccagcc tttcccagat       120

Parts of the entry pointed out on the slide: annotation, sequence, CDS (coding sequence)
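The CDS location is a join() of exon segments, each given as start..end with both ends included. A short Python sketch (the helper name `parse_join` is hypothetical, and it only handles this simple join(a..b,...) form) adds up the segment lengths and checks them against the 193-residue protein:

```python
import re

# CDS location copied from the EMBL entry X02158.
CDS_LOCATION = "join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)"


def parse_join(location):
    """Extract (start, end) pairs from a simple join(a..b,c..d,...) location."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


segments = parse_join(CDS_LOCATION)

# Coordinates are 1-based and inclusive, so a segment has end - start + 1 bases.
segment_lengths = [end - start + 1 for start, end in segments]
cds_length = sum(segment_lengths)

codons = cds_length // 3
residues = codons - 1  # the last codon is the stop codon

print(segment_lengths, cds_length, codons, residues)
```

The five segments add up to 582 bases, i.e. 194 codons: 193 amino acids plus the stop codon, which matches the length of the SWISS-PROT entry EPO_HUMAN.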

Page 26

GenBank entry: same entry

LOCUS       HSERPG 3398 bp DNA PRI 22-JUN-1993
DEFINITION  Human gene for erythropoietin.
ACCESSION   X02158
VERSION     X02158.1 GI:31224
KEYWORDS    erythropoietin; glycoprotein hormone; hormone; signal peptide.
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Vertebrata; Mammalia; Eutheria;
            Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1 (bases 1 to 3398)
  AUTHORS   Jacobs,K., Shoemaker,C., Rudersdorf,R., Neill,S.D., Kaufman,R.J.,
            Mufson,A., Seehra,J., Jones,S.S., Hewick,R., Fritsch,E.F.,
            Kawakita,M., Shimizu,T. and Miyake,T.
  TITLE     Isolation and characterization of genomic and cDNA clones of human
            erythropoietin
  JOURNAL   Nature 313 (6005), 806-810 (1985)
  MEDLINE   85137899
COMMENT     Data kindly reviewed (24-FEB-1986) by K. Jacobs.
FEATURES             Location/Qualifiers
     source          1..3398
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
     mRNA            join(397..627,1194..1339,1596..1682,2294..2473,2608..3327)
     exon            397..627
                     /number=1
     sig_peptide     join(615..627,1194..1261)
     CDS             join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)
                     /codon_start=1
                     /product="erythropoietin"
                     /protein_id="CAA26095.1"
                     /db_xref="GI:312304"
                     /db_xref="SWISS-PROT:P01588"
                     /translation="MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLQRYLL
                     EAKEAENITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVL
                     RGQALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTI
…

Page 27

GenBank entry (cont.)

…
                     TADTFRKLFRVYSNFLRGKLKLYTGEACRTGDR"
     intron          628..1193
                     /number=1
     exon            1194..1339
                     /number=2
     mat_peptide     join(1262..1339,1596..1682,2294..2473,2608..2760)
                     /product="erythropoietin"
     intron          1340..1595
                     /number=2
     exon            1596..1682
                     /number=3
     intron          1683..2293
                     /number=3
     exon            2294..2473
                     /number=4
     intron          2474..2607
                     /number=4
     exon            2608..3327
                     /note="3' untranslated region"
                     /number=5
BASE COUNT      698 a   1034 c    991 g    675 t
ORIGIN
        1 agcttctggg cttccagacc cagctacttt gcggaactca gcaacccagg catctctgag
       61 tctccgccca agaccgggat gccccccagg aggtgtccgg gagcccagcc tttcccagat
      121 agcagctccg ccagtcccaa gggtgcgcaa ccggctgcac tcccctcccg cgacccaggg
      181 cccgggagca gcccccatga cccacacgca cgtctgcagc agccccgtca gccccggagc
      241 ctcaacccag gcgtcctgcc cctgctctga ccccgggtgg cccctacccc tggcgacccc

Page 28

EMBL: The Genome divisions

http://www.ebi.ac.uk/genomes/

Schizosaccharomyces pombe strain 972h- complete genome

Page 29

Human genome

• The completion of the draft human genome sequence was announced on 26 June 2000.

• Publication of the public Human Genome Sequence in Nature on 15 February 2001. Approx. 30,000 genes are analysed, 1.4 million SNPs and much more.

• The draft sequence data is available at EMBL/GenBank/DDBJ.

• Finished: the clone insert is contiguously sequenced to a high quality standard, with an error rate of 0.01%. There are usually no gaps in the sequence.

• The general assumption is that about 50% of the bases are redundant.

2002

Page 30

Nucleotide databases and « associated » genomic projects/databases

Problem: redundancy makes BLAST searches of the complete databases useless for detecting anything beyond the closest homologs.

Solutions:

• assemblies of genomic sequence data (contigs) and corresponding RNA and protein sequences -> dataset of genomic contigs, RNAs and proteins

• annotation of genes, RNAs, proteins, variation (SNPs), STS markers, gene prediction, nomenclature and chromosomal location.

• computed connections to other resources (cross-references)

Examples: RefSeq/LocusLink (drosophila, human, mouse, rat and zebrafish), TIGR (bacteria and plants), EnsEMBL (Eukaryota)…

Page 31

LocusLink: Focal point for genes and associated information (fruit fly, human, mouse, rat, zebrafish)

RefSeq: NCBI Reference mRNAs and proteins for human, mouse, rat

UniGene: UniGene clusters, expression data

Ensembl: Provides a bioinformatics framework to organise biology around the sequences of large genomes. Available now are human, mouse, rat, fugu, zebrafish, mosquito, Drosophila, C. elegans, and C. briggsae.

Page 32

LocusLink / RefSeq

Erythropoietin receptor

Page 33
Page 34
Page 35

Database 2: protein sequences

SWISS-PROT: created in 1986 (A.Bairoch) http://www.expasy.org/sprot/

TrEMBL: created in 1996; complement to SWISS-PROT; derived from EMBL CDS translations (« proteomic » version of EMBL)

PIR-PSD: Protein Information Resource http://pir.georgetown.edu/

Genpept: « proteomic » version of GenBank

Many specialized protein databases for specific families or groups of proteins.

Examples: AMSDb (antibacterial peptides), GPCRDB (7 TM receptors), IMGT (immune system) YPD (Yeast) etc.

Page 36

SWISS-PROT Collaboration between the SIB (CH) and EMBL/EBI (UK)

Fully manually annotated, non-redundant, cross-referenced, documented protein sequence database.

~113’000 sequences from more than 6’800 different species; 70’000 references (publications); 550’000 cross-references (databases); ~200 Mb of annotations.

Weekly releases; available from about 50 servers across the world, the main source being ExPASy

Page 37

TrEMBL (Translation of EMBL)

It is impossible to cope with the quantity of newly generated data AND to maintain the high quality of SWISS-PROT -> TrEMBL, created in 1996.

TrEMBL is automatically generated (from annotated EMBL coding sequences (CDS)) and annotated using software tools.

Contains everything that is not in SWISS-PROT. SWISS-PROT + TrEMBL = all known protein sequences.

Well-structured SWISS-PROT-like resource.

Page 38

The simplified story of a SWISS-PROT entry

cDNAs, genomes, … -> EMBLnew -> EMBL -> (CDS) -> TrEMBLnew -> TrEMBL -> SWISS-PROT

« Automated »
• Redundancy check (merge)
• Family attribution (InterPro)
• Annotation (computer)

« Manual »
• Redundancy (merge, conflicts)
• Annotation (manual)
• SWISS-PROT tools (macros…)
• SWISS-PROT documentation
• Medline
• Databases (MIM, MGD….)
• Brain storming

Once in SWISS-PROT, the entry is no longer in TrEMBL, but it is still in EMBL (archive).

CDS: proposed and submitted to EMBL by authors or by genome projects (can be experimentally proven or derived from gene prediction programs). TrEMBL neither translates DNA sequences nor uses gene prediction programs: it only takes the CDS proposed by the submitting authors in the EMBL entry.

Some data are not submitted to the public databases !! (delayed or cancelled…)

Page 39

Remark 1: about 30 % of the genes annotated in newly sequenced genomes such as Arabidopsis thaliana are, at present (Sept 2001), purely the result of computational predictions.

Pertea et al., Nucleic Acids Research (2001), 29, 1185-1190

Remark 2: Human chromosome 21: none of the roughly 200 already known protein sequences could be correctly predicted by gene prediction programs.

Page 40

Drosophila

~13’000 genes; ~5000 proved

43 % of sequences have changed

The largest protein: 18’074 aa

Page 41

Some nomenclature

Example: SRS6 at the Sanger Center

http://www.sanger.ac.uk/srs6bin/cgi-bin/wgetz?-page+top

Page 42

TrEMBL= SPTrEMBL + REMTrEMBL

SPTrEMBL contains TrEMBL entries which will be integrated into SWISS-PROT.

REMTrEMBL contains TrEMBL entries which will never be integrated into SWISS-PROT (immunoglobulins and T-cell receptors, synthetic sequences, patent application sequences, small fragments, CDS not coding for real proteins).

TrEMBLnew contains entries which have not yet been integrated into TrEMBL (weekly update to TrEMBL).

SPTR (SWall) = SWISS-PROT + (SP)TrEMBL + TrEMBLnew

! Usually what we call TrEMBL is (SP)TrEMBL and does not include REMTrEMBL !

SWISS-PROT (Standard) + (SP)TrEMBL (Preliminary) + TrEMBLnew = SWALL (SPTR)
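The nomenclature can be restated with Python sets. This is only an illustration: every identifier below is a made-up placeholder, not a real accession number, and the set sizes mean nothing.

```python
# Hypothetical placeholder identifiers, for illustration only.
swiss_prot = {"SP_1", "SP_2"}
sp_trembl = {"TR_1", "TR_2", "TR_3"}   # will be integrated into SWISS-PROT
rem_trembl = {"REM_1"}                 # will never be integrated
trembl_new = {"NEW_1"}                 # not yet integrated into TrEMBL

# TrEMBL = SPTrEMBL + REMTrEMBL
trembl = sp_trembl | rem_trembl

# SPTR (SWall) = SWISS-PROT + (SP)TrEMBL + TrEMBLnew
sptr = swiss_prot | sp_trembl | trembl_new

# REMTrEMBL entries are part of TrEMBL in the strict sense, but not of SPTR.
print(sorted(trembl - sptr))
```

Written this way it is easy to see why "TrEMBL" in everyday use (SPTrEMBL only) is smaller than TrEMBL as defined by the first equation.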

Page 43

a Swiss-Prot entry…overview

sequence

Accession number

Entry name

Page 44

Protein name

Gene name

Taxonomy

Page 45

References

Page 46

Comments

Page 47

Cross-references

Page 48

Keywords

Page 49

Feature table (sequence description)

Page 50
Page 51

TrEMBL: example

Original TrEMBL entry, which has been integrated into the SWISS-PROT EPO_HUMAN entry and is therefore no longer found in TrEMBL.

Page 52
Page 53

SWISS-PROT / TrEMBL: minimal redundancy

• SWISS-PROT and TrEMBL introduce some degree of redundancy

• Only 100 % identical sequences are automatically merged between SWISS-PROT and TrEMBL;

• Complete sequences or fragments with 1-3 conflicts will be automatically merged soon (genome projects; check for chromosomal location and gene names)

Page 54

SWISS-PROT / TrEMBL: minimal redundancy

Human EPO: Blastp results

Page 55

SWISS-PROT and TrEMBL introduce a new arithmetical concept !

How many sequences in SWISS-PROT + TrEMBL ?

113’000 + 670’000 = about 450’000 (Sept 2002)

Reasons: redundancy within TrEMBL, and redundancy between SWISS-PROT and TrEMBL.

In 3 years… more than 2’000’000. But in the future, redundancy is going to decrease:

« new » genome sequencing -> « new » proteins (AB, Sept 2002)

Page 56

SWISS-PROT and TrEMBL introduce a new arithmetical concept !

In the case of human data, the redundancy is still very high:

8’400 + 41’000 = about 20’000
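
The « new arithmetic » can be made explicit: when the summed entry counts exceed the number of distinct sequences, the difference is the redundant share. A minimal Python sketch, using only the rounded figures quoted on these two slides (the function name is illustrative, not part of any database tool):

```python
def redundant_fraction(swissprot_entries, trembl_entries, distinct_sequences):
    """Share of the combined entries that duplicate another entry."""
    total = swissprot_entries + trembl_entries
    return (total - distinct_sequences) / total

# Figures quoted on the slides (Sept 2002); the "about" values are rough estimates.
all_species = redundant_fraction(113_000, 670_000, 450_000)
human_only = redundant_fraction(8_400, 41_000, 20_000)

print(f"all species: {all_species:.0%} redundant")
print(f"human only:  {human_only:.0%} redundant")
```

With these rounded figures, roughly 43 % of the combined entries, and roughly 60 % of the human ones, duplicate another entry.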



SWISS-PROT and the cross-references (X-ref)

• SWISS-PROT was the 1st database with X-ref.;

• Explicitly X-referenced to 36 databases; X-ref to DNA (EMBL/GenBank/DDBJ), 3D-structure (PDB), literature (Medline), genomic (MIM, MGD, FlyBase, SGD, SubtiList, etc.), 2D-gel (SWISS-2DPAGE), specialized db (PROSITE, TRANSFAC);

• Implicitly X-referenced to 17 additional db added by the ExPASy servers on the WWW (e.g. GeneCards, ProDom, HUGE, etc.)

Gasteiger et al., Curr. Issues Mol. Biol. (2001), 3(3): 47-55


Domains, functional sites, protein families: PROSITE, InterPro, Pfam, PRINTS, SMART, Mendel-GFDb

Nucleotide sequence db: EMBL, GenBank, DDBJ

2D and 3D structural dbs: HSSP, PDB

Organism-spec. dbs: DictyDb, EcoGene, FlyBase, HIV, MaizeDB, MGD, SGD, StyGene, SubtiList, TIGR, TubercuList, WormPep, Zebrafish

Protein-specific dbs: GCRDb, MEROPS, REBASE, TRANSFAC

SWISS-PROT

2D-gel protein databases: SWISS-2DPAGE, ECO2DBASE, HSC-2DPAGE, Aarhus and Ghent, MAIZE-2DPAGE

Human diseases: MIM

PTM: CarbBank, GlycoSuiteDB


Database 2: Protein sequence

What else ?


http://pir.georgetown.edu/


PIR-PSD: example

« well annotated »


UniProt

United Protein database

SWISS-PROT + TrEMBL + PIR

• Born in Oct 2002

• NIH pledges cash for global protein database

The United States is turning to European bioinformatics facilities to help it meet its researchers' future needs for databases of protein sequences.

European institutions are set to be the main recipients of a $15-million, three-year grant from the US National Institutes of Health (NIH), to set up a global database of information on protein sequence and function known as the United Protein Databases, or UniProt (Nature, 419, 101 (2002))


Databases 3: ‘genomics’

Contain information on gene chromosomal location (mapping) and nomenclature, and provide links to sequence databases; usually contain no sequences;

Exist for most organisms important in life science research; usually species specific.

Examples: MIM, GDB (human), MGD (mouse), FlyBase (Drosophila), SGD (yeast), MaizeDB (maize), SubtiList (B.subtilis), etc.;

Generally relational db (Oracle, SyBase or AceDb).


MIM

OMIM™: Online Mendelian Inheritance in Man

catalog of human genes and genetic disorders

contains a summary of literature and reference information. It also contains links to publications and sequence information.


GeneCards: an electronic encyclopedia of biological and medical information based on intelligent knowledge navigation technology


http://www.genelynx.org/


Collections of hyperlinks for each human gene


Databases 4: mutation/polymorphism

Contain information on sequence variations, whether or not linked to genetic diseases;

Mainly human, but: OMIA - Online Mendelian Inheritance in Animals

General db:
OMIM
HGMD - Human Gene Mutation db
SVD - Sequence variation db
HGBASE - Human Genic Bi-Allelic Sequences db
dbSNP - Human single nucleotide polymorphism (SNP) db

Disease-specific db: most of these databases are either linked to a single gene or to a single disease;
p53 mutation db
ADB - Albinism db (Mutations in human genes causing albinism)
Asthma and Allergy gene db
….


For human (Amos’ links)


Mutation/polymorphism: definitions

SNPs: single nucleotide polymorphisms; occur approximately once every 100 to 300 bases

(distinction between sequencing error and polymorphism !)

c-SNPs: coding single nucleotide polymorphisms (Single Nucleotide Polymorphisms within cDNA sequences)

SAPs: single amino-acid polymorphisms

Missense mutation: -> SAP
Nonsense mutation: -> STOP
Insertion/deletion of nucleotides -> frameshift…

! Numbering of the mutated amino acid depends on the db (aa no 1 is not necessarily the initiator Met !)
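
The definitions above can be illustrated with a short Python sketch that classifies a single-nucleotide change inside one codon, using the standard genetic code. The function names and the example codons are illustrative choices, and the "silent" category (no amino-acid change) is added for completeness; the sketch assumes the original codon is not itself a stop codon.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def classify_substitution(codon, position, new_base):
    """Classify a single-nucleotide change inside one codon (position is 0, 1 or 2)."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if after == before:
        return "silent"
    if after == "*":
        return "nonsense (-> STOP)"
    return f"missense (-> SAP {before}->{after})"

def classify_indel(length_change):
    """An insertion/deletion shifts the reading frame unless its length is a multiple of 3."""
    return "in-frame" if length_change % 3 == 0 else "frameshift"

print(classify_substitution("GAG", 1, "T"))  # GAG (Glu) becomes GTG (Val)
print(classify_substitution("TGG", 2, "A"))  # TGG (Trp) becomes TGA (stop)
print(classify_substitution("CTT", 2, "C"))  # CTT (Leu) becomes CTC (Leu)
print(classify_indel(-1))                    # deletion of one nucleotide
```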


Mutation/polymorphism

The SNP Consortium (TSC) http://snp.cshl.org/
Public/private collaboration: Bayer, Roche, IBM, Pfizer, Novartis, Motorola…
Has to date discovered and characterized nearly 1.5 million SNPs; in addition, the allele frequencies in three major world populations have been determined on a subset of ~57,000 SNPs.

dbSNP at NCBI http://www.ncbi.nlm.nih.gov/SNP/
Collaboration between the National Human Genome Research Institute and the National Center for Biotechnology Information (NCBI)
Mission: central repository for both single-base nucleotide substitutions and short deletion and insertion polymorphisms (several species)
August 2002: dbSNP has submissions for 4’700’000 SNPs.

Chromosome 21 dbSNP http://csnp.isb-sib.ch/
A joint project between the Division of Medical Genetics of the University of Geneva Medical School and the SIB
Mission: comprehensive cSNP (Single Nucleotide Polymorphisms within cDNA sequences) database and map of chromosome 21


Mutation/polymorphism

Generally of modest size; the lack of coordination and standards among these databases makes it difficult to access the data.

There are initiatives to unify these databases: Mutation Database Initiative (4th July 1996).

-> SVD - Sequence Variation Database project at EBI (HMutDB) http://www2.ebi.ac.uk/mutations/

-> HUGO Mutation Database Initiative (MDI), Human Genome Variation Society http://www.genomic.unimelb.edu.au/mdi/dblist/dblist.html


Database 5: protein domain/family


Protein domain/family: some definitions

Most proteins have « modular » structures
Estimation: ~ 3 domains / protein
Domains (conserved sequences or structures) are identified by multiple sequence alignments

Domains can be defined by different methods:
Pattern (regular expression); used for very conserved domains
Profiles (weighted matrices): two-dimensional tables of position-specific match, gap and insertion scores, derived from aligned sequence families; used for less conserved domains
Hidden Markov Model (HMM); probabilistic models; another method to generate profiles.
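
To make the idea of a weighted matrix concrete, here is a minimal Python sketch of profile scoring: each position has its own table of residue scores, the scores are summed over a window, and the best window is compared with a cut-off. All scores, the default penalty and the threshold are made-up values for illustration, and gap/insertion scores are left out, so this is a simplification of what real profile software does.

```python
# Toy profile: one dict of residue scores per position (hypothetical values).
PROFILE = [
    {"L": 4, "I": 4, "V": 3, "M": 2},
    {"S": 5, "T": 5},
    {"A": 6, "G": 1},
    {"H": 8},
]
DEFAULT_SCORE = -3   # penalty for a residue not listed at a position (assumption)
THRESHOLD = 15       # cut-off score (assumption)

def score_window(window):
    """Sum the position-specific scores of an ungapped window."""
    return sum(col.get(res, DEFAULT_SCORE) for col, res in zip(PROFILE, window))

def best_hit(sequence):
    """Return (start, score) of the best-scoring ungapped window.

    Assumes the sequence is at least as long as the profile.
    """
    width = len(PROFILE)
    hits = [(score_window(sequence[i:i + width]), i)
            for i in range(len(sequence) - width + 1)]
    score, start = max(hits)
    return start, score

start, score = best_hit("GGVTAHCK")
print(start, score, score >= THRESHOLD)
```

Unlike a pattern, a near-miss still gets a score; whether it counts as a match depends only on where the threshold is set.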


Pattern-Profile

• Pattern:

[LIVM]-[ST]-A-[STAG]-H-C

Yes or no

• Profile:

ID TRYPSIN_DOM; MATRIX.
AC PS50240;
DT DEC-2001 (CREATED); DEC-2001 (DATA UPDATE); DEC-2001 (INFO UPDATE).
DE Serine proteases, trypsin domain profile.
MA /GENERAL_SPEC: ALPHABET='ABCDEFGHIKLMNPQRSTVWYZ'; LENGTH=234;
MA /DISJOINT: DEFINITION=PROTECT; N1=6; N2=229;
MA /NORMALIZATION: MODE=1; FUNCTION=LINEAR; R1=0.0169; R2=0.00836256; TEXT='-LogE';
MA /CUT_OFF: LEVEL=0; SCORE=1134; N_SCORE=9.5; MODE=1; TEXT='!';
MA /CUT_OFF: LEVEL=-1; SCORE=775; N_SCORE=6.5; MODE=1; TEXT='?';
MA /DEFAULT: M0=-9; D=-20; I=-20; B1=-60; E1=-60; MI=-105; MD=-105; IM=-105; DM=-105;
MA /I: B1=0; BI=-105; BD=-105;
MA A B D E F G H I K L M N P Q R S T V W Y
MA /M: SY='I'; M= -8,-29,-34,-26, 3,-34,-24, 34,-26, 19, 15,-24,-21,-21,-24,-19, -8, 25,-19, 3;
MA /M: SY='N'; M= 0, 14, 10, 1,-22, -1, 6,-23, -4,-26,-17, 20,-14, -1, -6, 13, 2,-20,-34,-15;
MA /M: SY='E'; M= -4, 4, 7, 14,-26,-13, -7,-23, 3,-22,-16, 2, 7, 3, -3, 2, -2,-21,-30,-18;
MA /M: SY='R'; M=-12, 5, 5, 7,-23,-17, 3,-24, 8,-20,-12, 7,-16, 10, 12, -2, -6,-21,-27, -9;
MA /M: SY='W'; M=-16,-33,-35,-27, 13,-22,-24,-11,-18,-13,-13,-31,-27,-20,-18,-30,-21,-18, 97, 25;
MA /M: SY='V'; M= 1,-29,-31,-28, -1,-30,-29, 31,-22, 13, 11,-27,-27,-26,-22,-12, -2, 41,-27, -8;
MA /M: SY='L'; M= -8,-29,-31,-22, 9,-30,-21, 23,-27, 37, 20,-28,-28,-21,-20,-25, -8, 17,-20, -1;
MA /M: SY='T'; M= 2, -1, -9, -9,-11,-17,-19,-10,-10,-13,-11, 1,-11, -9,-10, 23, 43, 0,-32,-12;
MA /M: SY='A'; M= 45, -9,-19,-10,-20, -2,-15,-11,-10,-11,-10, -9,-11, -9,-19, 10, 1, -1,-21,-18;
MA /M: SY='A'; M= 40, -9,-17, -8,-21, 5,-18,-14, -9,-13,-12, -8,-11, -9,-16, 9, -2, -5,-21,-21;
MA /M: SY='H'; M=-18, 0, 0, 1,-21,-19, 89,-29, -8,-21, -1, 9,-19, 11, 0, -7,-17,-29,-30, 16;
MA /M: SY='C'; M= -9,-18,-28,-29,-20,-29,-29,-29,-29,-20,-19,-18,-39,-29,-29, -9, -9, -9,-49,-29;
MA /I: E1=0; IE=-105; DE=-105;
//

score/threshold
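
The « yes or no » nature of a pattern is easy to see once it is rewritten as a regular expression. The Python sketch below translates the pattern shown on this slide, [LIVM]-[ST]-A-[STAG]-H-C. It handles only the most common PROSITE pattern syntax (allowed residues, forbidden residues, x, simple repeats), the helper name is illustrative, and the two test sequences are invented.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into a Python regular expression.

    Handles the common syntax only: '-' separators, [..] allowed residues,
    {..} forbidden residues, 'x' for any residue, and (n) or (n,m) repeats.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = re.fullmatch(
            r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # forbidden residues
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)

regex = prosite_to_regex("[LIVM]-[ST]-A-[STAG]-H-C")
print(regex)
print(bool(re.search(regex, "GGWVVSAAHCYK")))  # contains V-S-A-A-H-C
print(bool(re.search(regex, "GGWVVSAAHSYK")))  # final C replaced by S
```

A single substitution at a fixed position turns a match into a non-match, which is why patterns suit only very conserved regions.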


Some statistics: 15 most common domains for H. sapiens (incomplete)

Immunoglobulin and major histocompatibility complex domain
Zinc finger, C2H2 type
Eukaryotic protein kinase
Rhodopsin-like GPCR superfamily
Pleckstrin homology (PH) domain
Zinc finger, RING type
Src homology 3 (SH3) domain
RNA-binding region RNP-1 (RNA recognition motif)
EF-hand family
Homeobox domain
Krab box
PDZ domain (also known as DHR or GLGF)
Fibronectin type III domain
EGF-like domain
Cadherin domain
…

http://www.ebi.ac.uk/proteome/HUMAN/interpro/top15d.html


Database 5: protein domain/family

Contains biologically significant « patterns / profiles / HMMs » formulated in such a way that, with appropriate computational tools, one can rapidly and reliably determine to which known family of proteins (if any) a new sequence belongs

Used as a tool to identify the function of uncharacterized proteins translated from genomic or cDNA sequences (« functional diagnostic »)

Either manually curated (e.g. PROSITE, Pfam, etc.) or automatically generated (e.g. ProDom, DOMO)


Protein domain/family db

Secondary databases are the fruit of analyses of the sequences found in the primary sequence db

Some depend on the method used to detect if a protein belongs to a particular domain/family (patterns, profiles, HMM, PSI-BLAST)


Protein domain/family db

PROSITE: Patterns / Profiles
ProDom: Aligned motifs (PSI-BLAST) (Pfam B)
PRINTS: Aligned motifs
Pfam: HMM (Hidden Markov Models)

SMART: HMM
TIGRfam: HMM

DOMO: Aligned motifs
BLOCKS: Aligned motifs (PSI-BLAST)
CDD (CDART): PSI-BLAST (PSSM) of Pfam and SMART

InterPro


Prosite

Created in 1988 (SIB)

Contains functional domains fully annotated, based on two methods: patterns and profiles

Entries are deposited in PROSITE in two distinct files:
Pattern/profiles with the list of all matches in SWISS-PROT
Documentation

19-Oct-2002: contains 1152 documentation entries that describe 1574 different patterns, rules and profiles/matrices.


Diagnostic performance

List of matches


Prosite (profile): example


PFAM (HMMs): an entry


PFAM (HMMs): query output


Most protein families are characterized by several conserved motifs

Fingerprint: set of motif(s) (simple or composite, such as multidomains) = signature of family membership

True family members exhibit all elements of the fingerprint, while subfamily members may possess only part of it
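
The full/partial distinction can be sketched in a few lines of Python. Here a fingerprint is reduced to a list of three hypothetical motifs written as regular expressions, and the sequences are invented; real fingerprint databases score aligned motifs rather than using exact regular expressions, so this only illustrates the counting logic.

```python
import re

# Hypothetical fingerprint: three short motifs written as regular expressions.
FINGERPRINT = ["C..C", "H[ST]G", "W.P"]

def fingerprint_match(sequence, motifs=FINGERPRINT):
    """Return the motifs found in the sequence and a coarse verdict."""
    found = [m for m in motifs if re.search(m, sequence)]
    if len(found) == len(motifs):
        verdict = "full fingerprint"
    elif found:
        verdict = "partial fingerprint"
    else:
        verdict = "no match"
    return found, verdict

print(fingerprint_match("AACKLCGGHSGAAWLP"))  # all three motifs present
print(fingerprint_match("AACKLCGGAAA"))       # only the first motif present
```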


ProDom

Consists of an automated compilation of homologous domain alignments.

Jan. 2002: 390 ProDom families were generated automatically using PSI-BLAST, built from non-fragmentary sequences from SWISS-PROT 39 + TrEMBL (Sept 2001)


ProDom: query output example

Your query


Protein domain/family: Composite databases

Example: InterPro

Single set of documents linked to the various methods;

Will be used to improve the functional annotation of SWISS-PROT (classification of unknown proteins…)

The release (Sept 2002) contains 5875 entries, representing 1272 domains, 4491 families, 97 repeats and 15 post-translational modification sites.


InterPro: www.ebi.ac.uk/interpro


Databases 6: proteomics

Contain information obtained by 2D-PAGE: images of master gels and description of identified proteins

Examples: SWISS-2DPAGE, ECO2DBASE, Maize-2DPAGE, Sub2D, Cyano2DBase, etc.

Composed of image and text files

There is currently no protein Mass Spectrometry (MS) database (not for long…)


This protein does not exist in the current release of SWISS-2DPAGE.

EPO_HUMAN (human plasma)


Databases 7: 3D structure

Contain the spatial coordinates of macromolecules whose 3D structure has been obtained by X-ray or NMR studies

Proteins represent more than 90% of available structures (others are DNA, RNA, sugars, viruses, protein/DNA complexes…)

PDB (Protein Data Bank), SCOP (structural classification of proteins (according to the secondary structures)), BMRB (BioMagResBank; NMR results)

DSSP: Database of Secondary Structure Assignments.
HSSP: Homology-derived secondary structure of proteins.
FSSP: Fold Classification based on Structure-Structure Assignments.

Future: Homology-derived 3D structure db.


PDB: Protein Data Bank

Managed by Research Collaboratory for Structural Bioinformatics (RCSB) (USA).

Contains macromolecular structure data on proteins, nucleic acids, protein-nucleic acid complexes, and viruses.

Associated specialized programs allow the visualization of the corresponding 3D structure (e.g., SwissPDB-viewer, Cn3D).

Currently there are ~19’000 structures for about 6’000 molecules, but far fewer protein families (highly redundant) !


PDB: example

HEADER LYASE(OXO-ACID) 01-OCT-91 12CA 12CA 2
COMPND CARBONIC ANHYDRASE /II (CARBONATE DEHYDRATASE) (/HCA II) 12CA 3
COMPND 2 (E.C.4.2.1.1) MUTANT WITH VAL 121 REPLACED BY ALA (/V121A) 12CA 4
SOURCE HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN 12CA 5
AUTHOR S.K.NAIR,D.W.CHRISTIANSON 12CA 6
REVDAT 1 15-OCT-92 12CA 0 12CA 7
JRNL AUTH S.K.NAIR,T.L.CALDERONE,D.W.CHRISTIANSON,C.A.FIERKE 12CA 8
JRNL TITL ALTERING THE MOUTH OF A HYDROPHOBIC POCKET. 12CA 9
JRNL TITL 2 STRUCTURE AND KINETICS OF HUMAN CARBONIC ANHYDRASE 12CA 10
JRNL TITL 3 /II$ MUTANTS AT RESIDUE VAL-121 12CA 11
JRNL REF J.BIOL.CHEM. V. 266 17320 1991 12CA 12
JRNL REFN ASTM JBCHA3 US ISSN 0021-9258 071 12CA 13
REMARK 1 12CA 14
REMARK 2 12CA 15
REMARK 2 RESOLUTION. 2.4 ANGSTROMS. 12CA 16
REMARK 3 12CA 17
REMARK 3 REFINEMENT. 12CA 18
REMARK 3 PROGRAM PROLSQ 12CA 19
REMARK 3 AUTHORS HENDRICKSON,KONNERT 12CA 20
REMARK 3 R VALUE 0.170 12CA 21
REMARK 3 RMSD BOND DISTANCES 0.011 ANGSTROMS 12CA 22
REMARK 3 RMSD BOND ANGLES 1.3 DEGREES 12CA 23
REMARK 4 12CA 24
REMARK 4 N-TERMINAL RESIDUES SER 2, HIS 3, HIS 4 AND C-TERMINAL 12CA 25
REMARK 4 RESIDUE LYS 260 WERE NOT LOCATED IN THE DENSITY MAPS AND, 12CA 26
REMARK 4 THEREFORE, NO COORDINATES ARE INCLUDED FOR THESE RESIDUES. 12CA 27
………


PDB (cont.)

SHEET 3 S10 PHE 66 PHE 70 -1 O ASN 67 N LEU 60 12CA 68
SHEET 4 S10 TYR 88 TRP 97 -1 O PHE 93 N VAL 68 12CA 69
SHEET 5 S10 ALA 116 ASN 124 -1 O HIS 119 N HIS 94 12CA 70
SHEET 6 S10 LEU 141 VAL 150 -1 O LEU 144 N LEU 120 12CA 71
SHEET 7 S10 VAL 207 LEU 212 1 O ILE 210 N GLY 145 12CA 72
SHEET 8 S10 TYR 191 GLY 196 -1 O TRP 192 N VAL 211 12CA 73
SHEET 9 S10 LYS 257 ALA 258 -1 O LYS 257 N THR 193 12CA 74
SHEET 10 S10 LYS 39 TYR 40 1 O LYS 39 N ALA 258 12CA 75
TURN 1 T1 GLN 28 VAL 31 TYPE VIB (CIS-PRO 30) 12CA 76
TURN 2 T2 GLY 81 LEU 84 TYPE II(PRIME) (GLY 82) 12CA 77
TURN 3 T3 ALA 134 GLN 137 TYPE I (GLN 136) 12CA 78
TURN 4 T4 GLN 137 GLY 140 TYPE I (ASP 139) 12CA 79
TURN 5 T5 THR 200 LEU 203 TYPE VIA (CIS-PRO 202) 12CA 80
TURN 6 T6 GLY 233 GLU 236 TYPE II (GLY 235) 12CA 81
CRYST1 42.700 41.700 73.000 90.00 104.60 90.00 P 21 2 12CA 82
ORIGX1 1.000000 0.000000 0.000000 0.00000 12CA 83
ORIGX2 0.000000 1.000000 0.000000 0.00000 12CA 84
ORIGX3 0.000000 0.000000 1.000000 0.00000 12CA 85
SCALE1 0.023419 0.000000 0.006100 0.00000 12CA 86
SCALE2 0.000000 0.023981 0.000000 0.00000 12CA 87
SCALE3 0.000000 0.000000 0.014156 0.00000 12CA 88
ATOM 1 N TRP 5 8.519 -0.751 10.738 1.00 13.37 12CA 89
ATOM 2 CA TRP 5 7.743 -1.668 11.585 1.00 13.42 12CA 90
ATOM 3 C TRP 5 6.786 -2.502 10.667 1.00 13.47 12CA 91
ATOM 4 O TRP 5 6.422 -2.085 9.607 1.00 13.57 12CA 92
ATOM 5 CB TRP 5 6.997 -0.917 12.645 1.00 13.34 12CA 93
ATOM 6 CG TRP 5 5.784 -0.209 12.221 1.00 13.40 12CA 94
ATOM 7 CD1 TRP 5 5.681 1.084 11.797 1.00 13.29 12CA 95
ATOM 8 CD2 TRP 5 4.417 -0.667 12.221 1.00 13.34 12CA 96
ATOM 9 NE1 TRP 5 4.388 1.418 11.515 1.00 13.30 12CA 97
ATOM 10 CE2 TRP 5 3.588 0.375 11.797 1.00 13.35 12CA 98
ATOM 11 CE3 TRP 5 3.837 -1.877 12.645 1.00 13.39 12CA 99
ATOM 12 CZ2 TRP 5 2.216 0.208 11.656 1.00 13.39 12CA 100
ATOM 13 CZ3 TRP 5 2.465 -2.043 12.504 1.00 13.33 12CA 101
ATOM 14 CH2 TRP 5 1.654 -1.001 12.009 1.00 13.34 12CA 102
…….

Coordinates of each atom
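
As an illustration of what can be done with these coordinates, the Python sketch below parses the first three ATOM records of the 12CA excerpt and computes the distance between the backbone N and CA atoms of TRP 5. It splits the records on whitespace because that is how they appear on the slide; real PDB files are fixed-column, so a robust parser would slice by column position instead. The field names in the dictionary are illustrative.

```python
import math

# Three ATOM records copied from the 12CA excerpt (whitespace-separated as shown).
RECORDS = """\
ATOM 1 N TRP 5 8.519 -0.751 10.738 1.00 13.37 12CA 89
ATOM 2 CA TRP 5 7.743 -1.668 11.585 1.00 13.42 12CA 90
ATOM 3 C TRP 5 6.786 -2.502 10.667 1.00 13.47 12CA 91
"""

def parse_atom(line):
    """Parse one whitespace-separated ATOM record into a dict."""
    f = line.split()
    return {
        "serial": int(f[1]), "name": f[2], "residue": f[3], "res_seq": int(f[4]),
        "xyz": (float(f[5]), float(f[6]), float(f[7])),   # coordinates in angstroms
        "occupancy": float(f[8]), "b_factor": float(f[9]),
    }

atoms = [parse_atom(l) for l in RECORDS.splitlines() if l.startswith("ATOM")]
by_name = {a["name"]: a for a in atoms}

# Euclidean distance between two atoms (math.dist needs Python 3.8+).
n_ca = math.dist(by_name["N"]["xyz"], by_name["CA"]["xyz"])
print(f"N-CA distance: {n_ca:.2f} angstroms")
```

The result, about 1.47 angstroms, is the length expected for a covalent N-CA bond.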


Databases 8: metabolic

Contain information that describes enzymes, biochemical reactions and metabolic pathways;

ENZYME and BRENDA: nomenclature databases that store information on enzyme names and reactions;

Metabolic databases: EcoCyc (specialized on Escherichia coli), KEGG, EMP/WIT;

Usually these databases are tightly coupled with query software that allows the user to visualise reaction schemes.


BRENDA: example


Databases 9: bibliographic

Bibliographic reference databases contain citations and abstracts of published life science articles;

Example: Medline

Other more specialized databases also exist (example: Agricola).


Medline

MEDLINE covers the fields of medicine, nursing, dentistry, veterinary medicine, the health care system, and the biological sciences

Covers more than 4,000 biomedical journals published in the United States and 70 other countries

Contains over 12 million citations, from 1966 to the present

Contains links to biological db and to some journals

New records are added to PreMEDLINE daily !
Many papers not dealing with humans are not in Medline !
Before 1970, keeps only the first 10 authors !
Not all journals have citations since 1966 !


PubMed

Search tool for accessing literature citations, developed at NCBI.

Provides access to bibliographic information such as MEDLINE, PreMEDLINE, HealthSTAR, and to integrated molecular biology databases (composite db).

Also gives access to:
NLM (National Library of Medicine), i.e. to citations before publication ([MEDLINE record in process])
Publisher-supplied citations: citations directly submitted to PubMed ([Record as supplied by publisher]).

PMID (PubMed ID)
UI (Medline ID)


Databases 10: others

There are many databases that cannot be classified in the categories listed previously;

Examples: ReBase (restriction enzymes), TRANSFAC (transcription factors), CarbBank, GlycoSuiteDB (linked sugars), protein-protein interaction db (DIR, ProNet, Intact, BIND), protease db (MEROPS), biotechnology patents db, etc.;

As well as many other resources concerning new aspects of macromolecules and molecular biology (e.g. microarrays).


Amos links: Microarrays


Database retrieval tools

Query tools associated with the databases

Sequence Retrieval System (SRS, Europe): allows any flat-file db to be indexed to any other; allows queries to be formulated across a wide range of different db types via a single interface, without any worry about data structure, query languages…

Entrez (NCBI): less flexible than SRS but exploits the concept of « neighbouring », which allows related articles in different db to be linked together, whether or not they are cross-referenced directly

ATLAS: specific for macromolecular sequence db (e.g. NRL-3D)

….
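
The idea of indexing one flat-file db against another can be sketched in Python. The two tiny flat files below are invented (two-letter line codes, entries closed by a « // » line, cross-references on DR lines), and the function names are illustrative; this shows the principle of an accession index plus cross-reference lookup, not how SRS or Entrez are actually implemented.

```python
# Two tiny, made-up flat files; each entry ends with a '//' line.
PROTEIN_DB = """\
AC P1
DE Example protein one
DR STRUCT; S9
//
AC P2
DE Example protein two
//
"""
STRUCTURE_DB = """\
AC S9
DE Example structure
//
"""

def index_flat_file(text):
    """Map each accession (AC line) to a dict of line code -> list of values."""
    index, entry = {}, {}
    for line in text.splitlines():
        if line.strip() == "//":            # end-of-entry marker
            if "AC" in entry:
                index[entry["AC"][0]] = entry
            entry = {}
            continue
        code, _, value = line.partition(" ")
        entry.setdefault(code, []).append(value.strip())
    return index

def follow_xrefs(accession, source, targets):
    """Resolve DR lines ('DBNAME; ACCESSION') against other indexed databases."""
    linked = []
    for xref in source[accession].get("DR", []):
        db_name, target_ac = [part.strip() for part in xref.split(";")]
        linked.append(targets[db_name][target_ac]["DE"][0])
    return linked

proteins = index_flat_file(PROTEIN_DB)
structures = index_flat_file(STRUCTURE_DB)
print(follow_xrefs("P1", proteins, {"STRUCT": structures}))
```

Once every db is indexed by accession, a query can hop from one db to another through its cross-references without knowing how each file is laid out.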


SRS

Depending on the server, SRS gives access to different databases

Example: ExPASy: SWISS-PROT, TrEMBL (SPTR)


Entrez-databases

compiled from a variety of sources, including Protein db, DNA db, 3D-structure, OMIM, PubMed, Taxonomy, maps & genomes, LocusLink.

Also gives access to BLAST results.

Exploits links between databases.


Entrez-protein


Proliferation of databases

What is the best db for sequence analysis ?
Which contains the highest quality data ?
Which is the most comprehensive ?
Which is the most up-to-date ?
Which is the least redundant ?
Which is the most indexed (allows complex queries) ?
Which Web server responds most quickly ?
…….??????


Some important practical remarks

Databases: many errors (automated annotation) !

Not all db are available on all servers

The update frequency is not the same for all servers; creation of db_new between releases (example: EMBLnew; TrEMBLnew….)

Some servers automatically add useful cross-references to an entry (implicit links) in addition to already existing links (explicit links)


Before the introduction to databases…

After the introduction to databases…