Major Databases in Bioinformatics
By Ms. M. Vidya Kalaivani
Dept. of Zoology, G.A.C.W
What is a database?
• A database is a convenient system for properly storing, searching and retrieving any type of data.
• A database makes it easy to handle and share large amounts of data, and supports large-scale analysis through easy access and data updating.
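The store, search, retrieve and update operations described above can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration: the table layout and the DEMO0001 record are made up, and the X64011 values are taken from the EMBL example entry shown later in these slides.

```python
import sqlite3

# In-memory database; the table layout is a hypothetical illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequences ("
    "accession TEXT PRIMARY KEY, organism TEXT, length_bp INTEGER)"
)

# Store: insert two records (the second one is a made-up placeholder).
conn.executemany(
    "INSERT INTO sequences VALUES (?, ?, ?)",
    [
        ("X64011", "Listeria ivanovii", 756),
        ("DEMO0001", "Example organism", 1200),
    ],
)

# Search and retrieve: find accessions by organism name prefix.
rows = conn.execute(
    "SELECT accession FROM sequences WHERE organism LIKE ?",
    ("Listeria%",),
).fetchall()

# Update: correct the length of one record, then read it back.
conn.execute(
    "UPDATE sequences SET length_bp = ? WHERE accession = ?",
    (1250, "DEMO0001"),
)
updated = conn.execute(
    "SELECT length_bp FROM sequences WHERE accession = ?",
    ("DEMO0001",),
).fetchone()[0]

print(rows, updated)
```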
What is a Biological Database?
• Biological databases are libraries of life sciences information, collected from scientific experiments, published literature, high-throughput experimental technology and computational analysis.
• They contain information from genomics, proteomics and microarray gene expression studies.
• Information contained in biological databases includes gene function, structure, localization (both cellular and chromosomal), biological sequences and structures.
Database Architecture
[Figure: layers of an information system, with examples]
• Query system (e.g. Google, Entrez, SRS): takes your search keywords
• Storage system: Oracle, MySQL, PC binary files, Unix text files, bookshelves
• Data: GenBank flat file, PDB file, interaction record, title of a book, book
A Sequence Retrieving and Manipulation Network
[Figure: components of a sequence retrieval and manipulation network]
• Databases: DNA (NCBI-GenBank, DDBJ, EBI-EMBL); protein (PIR, SWISS-PROT, ExPASy, PDB)
• Software: GCG, SeqWeb, Vector NTI, GenoMax
• Retrieval systems: Entrez, SRS
• Information: sequence, PDB, image
• Formats: GenBank, GCG, FASTA, Staden, image; sequence converter
Types of biological databases
• Primary database
• Secondary database
Primary databases
• These are the primary sources of data, used to store nucleic acid sequences, protein sequences and structural information of biological macromolecules.
• The sequence collections in these databases result from the efforts of basic research in academic, industrial and sequencing laboratories.
Some primary databases:
• NCBI (The National Center for Biotechnology Information)
• GenBank
• DDBJ (DNA Data Bank of Japan)
• SWISS-PROT (Swiss-Prot)
• PIR (Protein Information Resource)
• PDB (Protein Data Bank)
GenBank/EMBL/DDBJ: International Nucleotide Sequence Database
• NCBI: National Center for Biotechnology Information; NLM: National Library of Medicine
• EMBL: European Molecular Biology Laboratory; EBI: European Bioinformatics Institute
• DDBJ: DNA Data Bank of Japan; CIB: Center for Information Biology and DNA Data Bank of Japan; NIG: National Institute of Genetics
• IAM: International Advisory Meeting; ICM: International Collaborative Meeting
Secondary Database
• A secondary database contains additional information derived from the analysis of data available in primary sources.
• The data in secondary databases are analysed in a variety of ways, so they contain different information in different formats.
Some secondary databases:
• TrEMBL
• Pfam
• PROSITE
• Profiles
• SCOP
• CATH
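To illustrate how derived information can be applied back to primary sequences, the sketch below translates a PROSITE-style pattern into a Python regular expression and scans a sequence with it. The pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site pattern; the helper name prosite_to_regex and the test sequence are hypothetical, and the converter covers only the basic syntax elements (x, [..], {..}, repeat counts, < and > anchors).

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a basic PROSITE-style pattern into a Python regex."""
    pattern = pattern.rstrip(".")
    parts = []
    for element in pattern.split("-"):
        anchor_start = element.startswith("<")
        anchor_end = element.endswith(">")
        element = element.strip("<>")
        # Split off an optional repeat count such as (2) or (2,4).
        repeat = ""
        if element.endswith(")") and "(" in element:
            element, count = element[:-1].split("(")
            repeat = "{" + count + "}"
        if element == "x":
            core = "."                          # any residue
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"   # any residue except these
        else:
            core = element                      # one residue or a [..] set
        parts.append(
            ("^" if anchor_start else "") + core + repeat
            + ("$" if anchor_end else "")
        )
    return "".join(parts)

regex = prosite_to_regex("N-{P}-[ST]-{P}")

# Made-up sequence, used only to demonstrate the scan (1-based positions).
hits = [(m.start() + 1, m.group()) for m in re.finditer(regex, "MKNATLNPSA")]
print(regex, hits)
```

The second asparagine in the made-up sequence is followed by proline, so only the first site is reported.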
Primary vs. Secondary Sequence Databases
[Figure: GenBank (primary) collects overlapping sequence submissions from labs and sequencing centers and is updated ONLY by submitters. Secondary resources such as RefSeq, UniGene and genome assemblies are derived from it by curators and algorithms and are updated continually by NCBI.]
Flat File Storage Data Formats
• By the time GenBank, EMBL and DDBJ formed a collaboration (1986), the sequence databases had moved to a defined flat file format with a shared feature table format and shared annotation standards.
• The flat file formats from the sequence databases are still used to access and display sequence and annotation. They are also convenient for storage of local copies.
The National Center for Biotechnology Information
Created in 1988 as a part of the National Library of Medicine at NIH (Bethesda, MD)
– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
NCBI Databases and Services
• GenBank: primary sequence database
• Free public access to biomedical literature
– PubMed: free Medline (3 million searches per day)
– PubMed Central: full text online access
• Entrez: integrated molecular and literature databases
• BLAST: highest volume sequence search service (100-200 K searches per day)
• VAST: structure similarity searches
• Software and databases
GenBank (Genetic Sequence Databank)
• GenBank® is the genetic sequence database at the National Center for Biotechnology Information (NCBI).
• It was established in 1982 and is now maintained by the National Center for Biotechnology Information (NCBI).
• DNA sequences can be submitted to GenBank using several different methods.
• It contains publicly available nucleotide sequences for more than 240 000 named organisms, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects.
• It has a flat file structure: each entry is an ASCII text file that is readable and downloadable by both humans and computers.
• There are two main ways of making batch sequence submissions to GenBank: NCBI’s Barcode Submission Tool (BarSTool) and Sequin.
EMBL
• The European Molecular Biology Laboratory (EMBL) is a molecular biology research institution supported by 22 member states, four prospect and two associate member states.
• EMBL was created in 1974 and is an intergovernmental organisation funded by public research money from its member states.
• The Laboratory operates from five sites: the main laboratory in Heidelberg, and outstations in Hinxton (the European Bioinformatics Institute (EBI), in England), Grenoble (France), Hamburg (Germany), and Monterotondo (near Rome).
• EMBL groups and laboratories perform basic research in molecular biology and molecular medicine as well as training for scientists, students and visitors.
• Israel is the only Asian state that has full membership.
• The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/), maintained at the European Bioinformatics Institute (EBI), incorporates and distributes nucleotide sequences from public sources.
• The database is a part of an international collaboration with DDBJ (Japan) and GenBank (USA).
• Data are exchanged between the collaborating databases on a daily basis.
• The web-based tool, Webin, is the preferred system for individual submission of nucleotide sequences, including Third Party Annotation (TPA) and alignment data.
• Automatic submission procedures are used for submission of data from large-scale genome sequencing projects.
• The latest data collection can be accessed via FTP, email and WWW interfaces.
• The EBI's Sequence Retrieval System (SRS) integrates and links the main nucleotide and protein databases as well as many other specialist molecular biology databases.
• For sequence similarity searching, a variety of tools (e.g. FASTA and BLAST) are available that allow external users to compare their own sequences against the data in the EMBL Nucleotide Sequence Database and other databases.
• All available resources can be accessed via the EBI home page at http://www.ebi.ac.uk.
EMBL format
ID   LISOD      standard; DNA; PRO; 756 BP.
XX
AC   X64011; S78972;
XX
SV   X64011.1
XX
DT   28-APR-1992 (Rel. 31, Created)
DT   30-JUN-1993 (Rel. 36, Last updated, Version 6)
XX
DE   L.ivanovii sod gene for superoxide dismutase
XX
KW   sod gene; superoxide dismutase.
XX
OS   Listeria ivanovii
OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
OC   Bacillus/Staphylococcus group; Listeria.
XX
RN   [1]
RX   MEDLINE; 92140371.
RA   Haas A., Goebel W.;
RT   "Cloning of a superoxide dismutase gene from Listeria ivanovii by
RT   functional complementation in Escherichia coli and characterization of the
RT   gene product.";
RL   Mol. Gen. Genet. 231:313-322(1992).
XX
RN   [2]
RP   1-756
RA   Kreft J.;
RT   ;
RL   Submitted (21-APR-1992) to the EMBL/GenBank/DDBJ databases.
RL   J. Kreft, Institut f. Mikrobiologie, Universitaet Wuerzburg, Biozentrum Am
RL   Hubland, 8700 Wuerzburg, FRG
XX
DR   SWISS-PROT; P28763; SODM_LISIV.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..756
FT                   /db_xref="taxon:1638"
FT                   /organism="Listeria ivanovii"
FT                   /strain="ATCC 19119"
FT   RBS             95..100
FT                   /gene="sod"
FT   terminator      723..746
FT                   /gene="sod"
FT   CDS             109..717
FT                   /db_xref="SWISS-PROT:P28763"
FT                   /transl_table=11
FT                   /gene="sod"
FT                   /EC_number="1.15.1.1"
FT                   /product="superoxide dismutase"
FT                   /protein_id="CAA45406.1"
FT                   /translation="MTYELPKLPYTYDALEPNFDKETMEIHYTKHHNIYVTKLNEAVSG
FT                   HAELASKPGEELVANLDSVPEEIRGAVRNHGGGHANHTLFWSSLSPNGGGAPTGNLKAA
FT                   IESEFGTFDEFKEKFNAAAAARFGSGWAWLVVNNGKLEIVSTANQDSPLSEGKTPVLGL
FT                   DVWEHAYYLKFQNRRPEYIDTFWNVINWDERNKRFDAAK"
XX
SQ   Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;
     cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat        60
     gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa       120
     ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg       180
     gaaattcact atacaaagca ccacaatatt tatgtaacaa aactaaatga agcagtctca       240
     ggacacgcag aacttgcaag taaacctggg gaagaattag ttgctaatct agatagcgtt       300
     cctgaagaaa ttcgtggcgc agtacgtaac cacggtggtg gacatgctaa ccatacttta       360
     ttctggtcta gtcttagccc aaatggtggt ggtgctccaa ctggtaactt aaaagcagca       420
     atcgaaagcg aattcggcac atttgatgaa ttcaaagaaa aattcaatgc ggcagctgcg       480
     gctcgttttg gttcaggatg ggcatggcta gtagtgaaca atggtaaact agaaattgtt       540
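The numbers in the LISOD entry above can be cross-checked against each other. The sketch below, using values copied from that entry, checks that the base counts on the SQ line add up to the stated length and that the CDS location 109..717 is consistent with the length of the /translation qualifier. The variable names are illustrative only.

```python
import re

# Values copied from the LISOD entry shown above.
sq_line = "SQ   Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;"
cds_location = "109..717"
translation = (
    "MTYELPKLPYTYDALEPNFDKETMEIHYTKHHNIYVTKLNEAVSG"
    "HAELASKPGEELVANLDSVPEEIRGAVRNHGGGHANHTLFWSSLSPNGGGAPTGNLKAA"
    "IESEFGTFDEFKEKFNAAAAARFGSGWAWLVVNNGKLEIVSTANQDSPLSEGKTPVLGL"
    "DVWEHAYYLKFQNRRPEYIDTFWNVINWDERNKRFDAAK"
)

# Total length and per-base counts from the SQ header line.
total_bp = int(re.search(r"Sequence (\d+) BP", sq_line).group(1))
counts = {
    base: int(n)
    for n, base in re.findall(r"(\d+) (A|C|G|T|other);", sq_line)
}

# CDS length from the feature location (both ends inclusive).
start, end = map(int, cds_location.split(".."))
cds_length = end - start + 1
codons = cds_length // 3

print(sum(counts.values()) == total_bp)    # base counts match the length
print(cds_length, codons, len(translation))
```

The CDS spans 609 bases, that is 203 codons: 202 amino acid residues in the translation plus one stop codon.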
ID - Identification.
AC - Accession number(s).
DT - Date.
DE - Description.
GN - Gene name(s).
OS - Organism species.
OG - Organelle.
OC - Organism classification.
RN - Reference number.
RP - Reference position.
RC - Reference comments.
RX - Reference cross-references.
RA - Reference authors.
RL - Reference location.
CC - Comments or notes.
DR - Database cross-references.
KW - Keywords.
FT - Feature table data.
SQ - Sequence header.
(blanks) - Sequence data.
// - Termination line.
Some entries do not contain all of the line types, and some line types occur many times in a single
entry. Each entry must begin with an identification line (ID) and end with a terminator line (//).
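The line-code layout described above lends itself to a simple parser. The following is a minimal sketch, not an official EMBL parser: it groups the lines of each entry by their two-letter code, treats lines that begin with blanks as sequence data, and closes an entry at the // terminator. The function name parse_embl_entries is hypothetical, and the sample is abridged from the LISOD entry with a // line added so that the entry is complete. The XX spacer lines that appear in the sample entry are skipped.

```python
from collections import defaultdict

def parse_embl_entries(text):
    """Yield one dict per entry, mapping each line code to its lines."""
    entry = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):          # terminator line closes the entry
            if entry:
                yield dict(entry)
            entry = defaultdict(list)
            continue
        if not line.strip():
            continue
        code, rest = line[:2], line[2:].strip()
        if code == "XX":                   # spacer line, carries no data
            continue
        if code == "  ":                   # sequence data starts with blanks
            code = "seq"
        entry[code].append(rest)
    # An entry without a closing // line is ignored in this sketch.

sample = """\
ID   LISOD      standard; DNA; PRO; 756 BP.
XX
AC   X64011; S78972;
XX
DE   L.ivanovii sod gene for superoxide dismutase
XX
OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
OC   Bacillus/Staphylococcus group; Listeria.
XX
SQ   Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;
     cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat        60
//
"""

entries = list(parse_embl_entries(sample))
first = entries[0]
accessions = [a.strip() for a in first["AC"][0].split(";") if a.strip()]
print(accessions, len(first["OC"]))
```

Because a line type such as OC may occur many times in one entry, each code maps to a list of lines rather than to a single string.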
PubMed
• PubMed is a free search engine accessing primarily the MEDLINE database of references and abstracts on life sciences and biomedical topics.
• The PubMed system was offered free to the public in June 1997.
• The United States National Library of Medicine (NLM) at the National Institutes of Health maintains the database as part of the Entrez system of information retrieval.
• The PMID is the unique identifier number used in PubMed.
• A PMID is assigned to each article record when it enters the PubMed system.
• The PMID is always found at the end of a PubMed citation.
• PubMed Central (PMC) is a free digital system that archives publicly accessible full-text scholarly articles that have been published within the biomedical and life sciences journal literature.
• A "PubMed Mobile" option, providing access to a mobile friendly, simplified PubMed version, is also available.
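Since the PMID sits at the end of a PubMed citation, it can be picked out with a short regular expression. The sketch below assumes the citation ends with the text "PMID:" followed by digits; the function name extract_pmid and the citation itself (including the number 12345678) are made-up placeholders, not a real record.

```python
import re

def extract_pmid(citation: str):
    """Return the PMID found at the end of a citation string, or None."""
    match = re.search(r"PMID:\s*(\d+)\.?\s*$", citation)
    return match.group(1) if match else None

# Made-up citation used only to exercise the function.
citation = (
    "Author A, Author B. Example title. "
    "Example Journal. 2000;1:1-10. PMID: 12345678."
)
pmid = extract_pmid(citation)
print(pmid)
```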
Entrez
• WWW-based data retrieval system.
• Developed by NCBI (National Center for Biotechnology Information).
• Integrates information held in different databases.
Databases covered by Entrez are:
• Nucleic acid: GenBank, RefSeq, PDB
• Protein sequences: SWISS-PROT, PIR
• 3D structures: MMDB
• Genomes: many sources
• PopSet: from GenBank
• OMIM: OMIM
• Taxonomy: NCBI taxonomy database
• Books: Bookshelf
• ProbeSet: GEO (Gene Expression Omnibus)
• Literature: PubMed
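Entrez can also be queried from a program through NCBI's E-utilities web service, which these slides do not cover. The sketch below only builds an ESearch request URL and does not contact the server; the helper name build_esearch_url and the search term are illustrative, and the current E-utilities documentation should be checked before relying on the parameters.

```python
from urllib.parse import urlencode

# Base URL of the E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for one Entrez database and one search term."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_ESEARCH}?{query}"

# Search the literature database (PubMed) for an example term.
url = build_esearch_url(
    "pubmed", "superoxide dismutase Listeria ivanovii", retmax=5
)
print(url)
```

Changing the db argument (for example to "protein") directs the same query to a different database covered by Entrez.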
SRS
• SRS is a Sequence Retrieval System
– Data retrieval tool developed by EBI
– Integrates 80 molecular biology DBs
– Open source software (can be installed locally)
• SRS has an associated scripting language called Icarus
• Central resource for molecular biology data
– More than 250 databanks have been indexed; more than 35 SRS servers are available over the WWW (worldwide)
• Information retrieval
– Easy way to retrieve information from sequence and sequence-related databases
– Possibility to search for multiple words/other criteria
• Linkage between different databases
– E.g. find all primary structures with known three-dimensional structure
• Different types of database in SRS
– Sequence & structure: DNA, protein, three-dimensional structures
– Sequence-related
– Gene-related: genome, mapping, mutations, transcription factors
– SNP
– Bibliographic
• SRS main toolbar tabs:
– Top Page: displays databases in different database groups
– Query: displays either the standard or extended query form
– Results or "the query manager": maintains a history of all the results obtained during a session
– Projects or "the project manager": maintains a history of all queries and views used during a session
– Views: allows a user to define a user-specific view for one or more databases
– Databanks: contains a list and some facts about the databases available in the system
• Search terms in SRS: SRS indexed fields can be searched using any of the following:
– Single word search
– Multiple word phrases
– Numbers and dates
– Regular expressions
– Wildcards
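The difference between single word, wildcard and regular expression searches can be sketched over a tiny in-memory index. This is not SRS or its Icarus language, only an illustration with Python's standard library; the DEMO entries are made up, and the X64011 description is taken from the EMBL example entry.

```python
import re
from fnmatch import fnmatchcase

# Tiny stand-in for an indexed description field.
descriptions = {
    "X64011": "L.ivanovii sod gene for superoxide dismutase",
    "DEMO0001": "hypothetical kinase gene",
    "DEMO0002": "hypothetical superoxide reductase",
}

def words(text):
    return text.lower().split()

def search_word(word):
    """Single word search: exact match against whole words."""
    return sorted(k for k, v in descriptions.items() if word.lower() in words(v))

def search_wildcard(pattern):
    """Wildcard search: * and ? are matched against each word."""
    return sorted(
        k for k, v in descriptions.items()
        if any(fnmatchcase(w, pattern.lower()) for w in words(v))
    )

def search_regex(pattern):
    """Regular expression search over the whole field."""
    rx = re.compile(pattern, re.IGNORECASE)
    return sorted(k for k, v in descriptions.items() if rx.search(v))

print(search_word("superoxide"))
print(search_wildcard("dismut*"))
print(search_regex(r"superoxide (dismutase|reductase)"))
```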
LocusLink
• LocusLink (http://www.ncbi.nlm.nih.gov/LocusLink) is a National Center for Biotechnology Information (NCBI) online resource.
• It is principally intended for use by graduate students and professional researchers in the biomedical sciences.
• It is designed to bring together related information on genetic loci and gene products from several sources.
• LocusLink provides a central point of access for basic biomedical information and molecular data for genes, transcripts, and proteins from model organisms, including human, rat, mouse, fruit fly, and zebrafish.
• LocusLink is no longer available at NCBI; it has been superseded by Entrez Gene.