Organizing information in the post-genomic era The rise of bioinformatics.

18
Organizing information in the post- genomic era The rise of bioinformatics

Transcript of Organizing information in the post-genomic era The rise of bioinformatics.

Organizing information in the post-genomic era

The rise of bioinformatics

An information explosion!

Bioinformatics

Computational tools are developed to collect, organize and analyze a wide variety of biological data

Advances in DNA sequencing technologies have accelerated the pace of discovery. Much of the process is now automated.

What is a database?

Which databases are important for molecular cell biology research?

How is information processed in databases?

Literature

Nucleotide

Protein

Organism

Structure

Function

Biological databases use different organizing principles

Hyperlinks connect records in different databases

Databases are organized collections of information

Information is stored in records

Accession #

Field 1 ........................

Field 2 .......................

Data........................................................................................

Databases assign each record a unique accession number using their own numbering system

Fields are used to cross-reference the data. Records can be searched by fields.

Data is entered in the record using a defined format

Bioinformaticians work with computer scientists to set up the database structure

Curators review and link records within and between databases

The information in databases ultimately derives from experimental data

Researchers do experiments

Researchers analyze data and write

papers

Data is published in

journals

to PubMe

d

nucleotide sequencesmuta

nts

, phen

otyp

e in

fo structural

coordinates

Curators will process the submissions and link entries in different databases

What is a database?

Which databases are important for molecular cell biology research?

How is information processed in databases?

Largest collection is housed at the National Center for Biotechnology Information (NCBI), part of the National Library of Medicine

NLM-NCBI complex in Bethesda MD

Large staff of curators process the information and compile information into derivative databases

Biologists use hundreds of different databases from around the world, some with similar foci

NCBI maintains both primary and derivative databasesWe’ll look at three of them

PubMed is the premier literature database in the world

SGD is a derivative database serving the yeast research community

Grew out of decades of research

Genome project provided a systematic organization for genes

What is a database?

Which databases are important for molecular cell biology research?

How is information processed in databases?

Questions for today:

Most records in the Protein database have been derived by automated translation of nucleotide sequences

GenBankNucleotide sequences

Annotated nucleic acid sequences are submitted to GenBank from many sources, including genome projects, individual investigators, and other databases – there is considerable REDUNDANCY in the information

ProteinAmino acid sequences

Automated translations of nucleotide sequences

Experimentally determined amino acid sequences and information from other protein databases

RefSeqNon-redundant nucleotide and

protein sequences

Sequences are compiled to generate non-redundant reference sequences

Curators are responsible for data flow between the NCBI databases

On a larger scale: Genome projects have produced the reference sequences in

nucleotide databases

(robots and computers do much of the work)

1. Pieces of chromosomal DNA are sequenced, each ~1000 bp long

S. cerevisiae genome is ~12 Mbp – how many reads would be necessary to cover each base pair in the genome once?

2. Overlapping sequence reads are aligned until sequences of entire chromosomes were complete

Computer algorithms identify areas of sequence overlap

Process is repeated to align long stretches of sequence

Complete chromosome sequences are submitted to GenBank

GenBank NC_####### (non-redundant chromosome) sequences

3. Chromosomal sequences are analyzed for the presence of potential transcripts (open reading frames; ORFs)

ORFs are characterized by an under-representation of stop codons

ORF-finding computer algorithms look for sequences that• begin with a methionine• methionine is separated from a stop codon in the same reading

frame by a large number of amino acids (often 100, equiv. to 300bp)

GenBank NM_####### records are predicted ORFs

4. Protein sequences are computationally predicted from ORF sequences

GenBank NP_###### records

Genes were given systematic (locus) names by their positions on chromosomes

Systematic name for MET1: YKR069W

Y (A-P) (L or R) (ORF number) (W or C)

yeast left or right arm of chromosome

sense strand is Watson or Crick strand (coding sequence is read 5’ to 3’)

ORF number, counting away from the centromere (position = 0)

chromosome1 = A2 = Betc.

Left arm Right arm

centromere

W W

C C

Literature

Nucleotide

Protein

Organism

Structure

Function