Download - Major databases in bioinformatics

Major databases in Bioinformatics

ByMs.M.Vidya Kalaivani

Dept. of zoologyG.A.C.W

What is database????• Database are convenient system to properly store,

search and retrieve any type of data. • A database helps to easily handle and share large

amount of data and supports large scale analysis by easy access and data updating.

What is Biological Database• Biological databases are libraries of life sciences

information ,collected from scientific experiments, published literature, high-throughput experiment technology and computational analysis.

• They contain information from genomics, proteomics, microarray gene expression.

• Information contained in biological databases includes gene function, structure, localization(both cellular and chromosomal),biological sequences and structures.

Databases Architecture

Information system

) Query system

Storage SystemData

(The Google, EntrezSRS)

Your search key words

Oracle,MySQL,PC binary files,Unix text files,Bookshelves

GenBank flat file PDB fileInteraction RecordTitle of a bookBook

A Sequence Retrieving andManipulation Network

DNA ProteinNCBI-GenBANK PIRDDBJ SWISSPROTEBI-EMBL EXPASY, PDB

GCGSeqWEBVector NTIGenoMAX

EntrezSRS

Sequnece, Pdb, Image

GenBANKGCGFASTAStadenImage Sequence

Converter

Databases

Softwares

Formats

RetrivalSystem

Information

Types of biological databases Primary Database.Secondary database.

Primary databases Theses are the primary sources of data used to store nucleic acid, protein sequences and structural information of biological macromolecules.Some primary databases-• NCBI(The National Centre for Biotechnology Information)• GenBank• DDBJ (DNA data bank of Japan)• SWISS-PROT(Swiss-Prot )• PIR (Protein Information Resource)• PDB(Protein Data Bank)This sequence collection of this database is due to the efforts of basic research from academic industrial and sequencing lab)

IAM: International Advisory Meeting ICM: International Collaborative Meeting

GenBank/EMBL/DDBJInternational Nucleotide Sequence Database

EMBL: European Molecular Biology LaboratoryEBI: European Bioinformatics Institute

DDBJ: DNA Data Bank of JapanCIB: Center for Information Biology and DNA Data Bank of JapanNIG: National Institute of Genetics

NCBI: National Center for Biotechnology InformationNLM: National Library of Medicine

Secondary Database• A Secondary database contain additional information derived from the

analysis of data available in primary sources. • Secondary databases are analysed in a variety of ways and contain

different information in different formats.• Some secondary databases• TrEMBL• Pfam• PROSITE• Profiles• SCOP• CATH

PRIMARY VS. Secondary SEQUENCE DATABASES

GenBank

SequencingCenters

GA

GAGA

ATTAT

TCCGAGA

ATTAT

TCC

AT

GAGA

ATTCC GAGA

ATTCC

TTGACAATT

GACTA

ACGTGC

TTGACA

CGTGAATTGAC

TA

TATAGCCG

ACGTGC

ACGTGCACGTGCTTGACA

TTGACA

CGTGA

CGTGA

CGTGA

ATTGACTAATTGACTA AT

TGACTA

ATTGACTA

TATAGC

CG

TATAGCCG

TATAGCCGTATAGCCGTATAGCCG TATAGCCGTATAGCCG TATAGCCG CAT

T

GAGA

ATTCC GAGA

ATTCC Labs

Algorithms

UniGene

Curators

RefSeq

GenomeAssembly

TATAGCCGAGCTCCGATACCGATGACAA

Updated continually by NCBI

Updated ONLY by submitters

Flat File Storage Data Formats

• When GenBank, EMBL and DDBJ formed a collaboration (1986), sequence databases had moved to a defined flat file format with a shared feature table format and annotation standards.

• The flat file formats from the sequence databases are still used to access and display sequence and annotation. They are also convenient for storage of local copies.

The National Center for Biotechnology Information

Created in 1988 as a part of theNational Library of Medicine at NIH

– Establish public databases– Research in computational biology– Develop software tools for sequence analysis– Disseminate biomedical information

Bethesda,MD

NCBI Databases and Services• GenBank primary sequence database• Free public access to biomedical literature

• PubMed free Medline (3 million searches per day)• PubMed Central full text online access

• Entrez integrated molecular and literature databases• BLAST highest volume sequence search service (100 – 200 K searches per day)• VAST structure similarity searches• Software and Databases

GenBank (Genetic Sequence Databank)

• GenBank® is the genetic sequence database at the National Center for Biotechnology Information (NCBI).

• It was established in the year 1982 and now maintained by the National Center for Biotechnology (NCBI).

• DNA sequences can be submitted to GenBank using several different methods.

• It contains publicly available nucleotide sequences for more than 240 000 named organisms, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects.

• It has a flat file structure that is an ASCII text file, readable & downloadable by both humans and computers.

• There are two main ways of making batch sequence submissions to GenBank: NCBI’s Barcode Submission Tool (BarSTool) and Sequin.

EMBL• The European Molecular Biology Laboratory (EMBL) is a molecular biology

research institution supported by 22 member states, four prospect and two associate member states.

• EMBL was created in 1974 and is an intergovernmental organisation funded by public research money from its member states.

• The Laboratory operates from five sites: the main laboratory in Heidelberg, and outstations in Hinxton (the European Bioinformatics Institute (EBI), in England), Grenoble (France), Hamburg (Germany), and Monterotondo (near Rome).

• EMBL groups and laboratories perform basic research in molecular biology and molecular medicine as well as training for scientists, students and visitors.

• Israel is the only Asian state that has full membership.• The EMBL Nucleotide Sequence Database (http:// www.ebi.ac.uk/embl/),

maintained at the European Bioinformatics Institute (EBI),

http://www.ebi.ac.uk/embl/

• It is used to incorporate and distributes nucleotide sequences from public sources.

• The database is a part of an international collaboration with DDBJ (Japan) and GenBank (USA).

• Data are exchanged between the collaborating databases on a daily basis.

• The web-based tool, Webin, is the preferred system for individual submission of nucleotide sequences, including Third Party Annotation (TPA) and alignment data.

• Automatic submission procedures are used for submission of data from large-scale genome sequencing

• The latest data collection can be accessed via FTP, email and WWW interfaces.

• The EBI's Sequence Retrieval System (SRS) integrates and links the main nucleotide and protein databases as well as many other specialist molecular biology databases.

• For sequence similarity searching, a variety of tools (e.g. FASTA and BLAST) are available that allow external users to compare their own sequences against the data in the EMBL Nucleotide Sequence Database and other databases.

• All available resources can be accessed via the EBI home page at http://www.ebi.ac.uk.

ID LISOD standard; DNA; PRO; 756 BP.

DT 30-JUN-1993 (Rel. 36, Last updated, Version 6)

DE L.ivanovii sod gene for superoxide dismutase

KW sod gene; superoxide dismutase.

OC Bacteria; Firmicutes; Bacillus/Clostridium group; OC Bacillus/Staphylococcus group; Listeria.

RT "Cloning of a superoxide dismutase gene from Listeria ivanovii by RT functional complementation in Escherichia coli and

EMBL format

ID LISOD standard; DNA; PRO; 756 BP. XX AC X64011; S78972; XX SV X64011.1 XX DT 28-APR-1992 (Rel. 31, Created) DT 30-JUN-1993 (Rel. 36, Last updated, Version 6) XX DE L.ivanovii sod gene for superoxide dismutase XX KW sod gene; superoxide dismutase. XX OS Listeria ivanovii OC Bacteria; Firmicutes; Bacillus/Clostridium group; OC Bacillus/Staphylococcus group; Listeria. XX RN [1] RX MEDLINE; 92140371. RA Haas A., Goebel W.; RT "Cloning of a superoxide dismutase gene from Listeria ivanovii by RT functional complementation in Escherichia coli and characterization of the RT gene product.";

RL Mol. Gen. Genet. 231:313-322(1992). XX RN [2] RP 1-756 RA Kreft J.; RT ; RL Submitted (21-APR-1992) to the EMBL/GenBank/DDBJ databases. RL J. Kreft, Institut f. Mikrobiologie, Universitaet Wuerzburg, Biozentrum Am RL Hubland, 8700 Wuerzburg, FRG XX DR SWISS-PROT; P28763; SODM_LISIV. XX FH Key Location/Qualifiers FH FT source 1..756 FT /db_xref="taxon:1638" FT /organism="Listeria ivanovii" FT /strain="ATCC 19119" FT RBS 95..100 FT /gene="sod" FT terminator 723..746 FT /gene="sod" FT CDS 109..717 FT /db_xref="SWISS-PROT:P28763" FT /transl_table=11 FT /gene="sod" FT /EC_number="1.15.1.1" FT /product="superoxide dismutase" FT /protein_id="CAA45406.1" FT /translation="MTYELPKLPYTYDALEPNFDKETMEIHYTKHHNIYVTKLNEAVSG FT HAELASKPGEELVANLDSVPEEIRGAVRNHGGGHANHTLFWSSLSPNGGGAPTGNLKAA FT IESEFGTFDEFKEKFNAAAAARFGSGWAWLVVNNGKLEIVSTANQDSPLSEGKTPVLGL FT DVWEHAYYLKFQNRRPEYIDTFWNVINWDERNKRFDAAK" XX SQ Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other; cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat 60 gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa 120 ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg 180 gaaattcact atacaaagca ccacaatatt tatgtaacaa aactaaatga agcagtctca 240 ggacacgcag aacttgcaag taaacctggg gaagaattag ttgctaatct agatagcgtt 300 cctgaagaaa ttcgtggcgc agtacgtaac cacggtggtg gacatgctaa ccatacttta 360 ttctggtcta gtcttagccc aaatggtggt ggtgctccaa ctggtaactt aaaagcagca 420 atcgaaagcg aattcggcac atttgatgaa ttcaaagaaa aattcaatgc ggcagctgcg 480 gctcgttttg gttcaggatg ggcatggcta gtagtgaaca atggtaaact agaaattgtt 540

ID - Identification.

AC - Accession number(s).

DT - Date.

DE - Description.

GN - Gene name(s).

OS - Organism species.

OG - Organelle.

OC - Organism classification.

RN - Reference number.

RP - Reference position.

RC - Reference comments.

RX - Reference cross-references.

RA - Reference authors.

RL - Reference location.

CC - Comments or notes.

DR - Database cross-references.

KW - Keywords.

FT - Feature table data.

SQ - Sequence header.

- (blanks) sequence data.

// - Termination line.

Some entries do not contain all of the line types, and some line types occur many times in a single

entry. Each entry must begin with an identification line (ID) and end with a terminator line (//).

PubMed• PubMed is a free search engine accessing primarily the

MEDLINE database of references and abstracts on life sciences and biomedical topics.

• The PubMed system was offered free to the public in June 1997.

• The United States National Library of Medicine (NLM) at the National Institutes of Health maintains the database as part of the Entrez system of information retrieval.

• PMID is the unique identifier number used in PubMed.

• They are assigned to each article record when it enters the PubMed system.

• The PMID# is always found at the end of a PubMed citation.

• PubMed Central (PMC) is a free digital system that archives publicly accessible full-text scholarly articles that have been published within the biomedical and life sciences journal literature.

• A "PubMed Mobile" option, providing access to a mobile friendly, simplified PubMed version, is also available.

Entrez

• WWW-based data retrieval system.• Developed by NCBI (National Centre for Biotechnology

Information).• - Integrates information held in different DBs.

Data bases covered by Entrez are

• Nucleic acid - GenBank, RefSeq, PDB.

• Protein seqs - SWISS-PROT, PIR.

• 3D structures – MMDB• Genomes – Many

sources

• PopSet – From GenBank• OMIM – OMIM• Taxonomy – NCBI

taxonomy database• Books- Bookshelf• ProbeSet – GEO (Gene

Expression Omnibus)• Literature - PubMed

SRS• SRS is a Sequence Retrieval System• - Data retrieval tool developed by EBI• - Integrates 80 molecular biology DBs• - An Open source software (Can be installed locally)• SRS has an associated scripting language called

Icarus• Central resource for molecular biology data• - more than 250 databanks have been indexed. More

than 35 SRS servers over the WWW(world wide)

• Information retrieval• Easy way to retrieve information from sequence and sequence-related

databases• Possibility to search for multiple words/other criteria

• Linkage between different databases• E.g. Find all primary structures with known three-dimensional

structure.• Different types of database in SRS• Sequence & structure

• DNA, protein, three-dimensional structures• Sequence-related• Gene-related

• Genome, mapping, mutations, transcription factors• SNP

• Bibliographic

• SRS main toolbar tabs:• Top Page: displays databases in different database groups• Query: displays either the standard or extended query form• Results or “the query manager”: maintains a history of all the

results obtained during a session• Projects or “the project manager”: maintains a history of all

queries and views used during a session• Views: allows a user to define a user specific view for one or more

databases• Databanks: contains a list and some facts about the databases

available in the system

• Search terms in SRS• SRS indexed fields can be searched using any of the

following:• Single word search• Multiple word phrases• Numbers and dates• Regular expressions• Wildcards

•

LocusLink• LocusLink (http://www.ncbi.nlm.nih.gov/LocusLink) is a National

Center for Biotechnology Information (NCBI) online resource. • It is principally intended for use by graduate students and

professional researchers in the biomedical sciences. • It is designed to bring together related information on genetic loci

and gene products from several sources. • LocusLink provides a central point of access for basic biomedical

information and molecular data for genes, transcripts, and proteins from model organisms, currently including human, rat, mouse, fruit fly, and zebrafish.

• Now it is not available in NCBI.