Tentative definition of bioinformatics Bioinformatics, often also called genomics, computational...

Tentative definition of bioinformatics

Bioinformatics, often also called genomics, computational genomics, or computational biology, is a new interdisciplinary field at the intersection of biology, computer science, statistics, and mathematics. Its subject matter is the extraction of biologically useful information from large sets of molecular data, such as DNA or protein sequence data or gene expression data. The term “bioinformatics” is currently used mainly to refer to the extraction of information from sequence data, while the creation and analysis of gene expression data is called functional genomics.

Biology’s dilemma: There is too much to know about living things

Roughly 1.5 million species of organisms have been described and given scientific names to date. Some biologists estimate that the total number of all living species may be several times higher. It is impossible to learn everything about all these organisms. Biologists solve the dilemma by focusing on some species, so-called model organisms, and trying to find out as much as they can about these model organisms.

Some important model organisms

Mammals: Human, chimpanzee, mouse, rat
Fish: Zebrafish, pufferfish
Insects: Fruitfly (Drosophila melanogaster)
Roundworms: Caenorhabditis elegans
Protista: Malaria parasite (Plasmodium falciparum)
Fungi: Baker’s yeast (Saccharomyces cerevisiae)
Plants: Thale cress (Arabidopsis thaliana), corn, rice
Bacteria: Escherichia coli, Mycoplasma genitalium
Archaea: Methanococcus jannaschii

Let’s find out everything about some species

What would it mean to learn everything about a given species? All available evidence indicates that the complete blueprint for making an organism is encoded in the organism’s genome. Chemically, the genome consists of one or several DNA molecules. These are long strings composed of pairs of nucleotides. There are only four different nucleotides, denoted by A, C, G, T. The information about how to make the organism is encoded by the order in which the nucleotides appear.
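
The "string over four letters" view can be made concrete with a few lines of Python. This is only a sketch: the function name and the example sequence are made up for illustration, while the pairing rules (A with T, C with G) are the standard ones behind the "pairs of nucleotides" mentioned above.

```python
# Standard base pairing: A pairs with T, C pairs with G.
PAIRING = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(strand: str) -> str:
    """Return the partner strand, written in its own reading direction."""
    return "".join(PAIRING[base] for base in reversed(strand))

example = "ATGGCC"  # hypothetical sequence, not taken from any real genome
print(reverse_complement(example))
```

Because each base determines its partner, storing one strand is enough to recover the other.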

Some genome sizes

HIV2 virus                   9671 bp
Mycoplasma genitalium        5.8 · 10^5 bp
Haemophilus influenzae       1.83 · 10^6 bp
Saccharomyces cerevisiae     1.21 · 10^7 bp
Caenorhabditis elegans       10^8 bp
Drosophila melanogaster      1.65 · 10^8 bp
Homo sapiens                 3.14 · 10^9 bp
Some amphibians              8 · 10^10 bp
Amoeba dubia                 6.7 · 10^11 bp

Sequencing Genomes

Contemporary technology makes it possible to completely sequence entire genomes, that is, determine the sequence of A’s, C’s, G’s, and T’s in the organism’s genome. The first virus was sequenced in the 1970s, the first bacterium (Haemophilus influenzae) in 1995, the first multicellular organism (Caenorhabditis elegans) in 1998. A draft of the human genome was announced in 2000.

Where to store all these data?

In databases, of course. Some of the sequence data are stored in proprietary databases, but most of them are stored in the public database Genbank and can be accessed via the World Wide Web. In fact, most relevant journals require proof of submission to Genbank before an article discussing sequence data will be published. The URL for Genbank is:

http://www.ncbi.nlm.nih.gov/Genbank/

What’s in the databases?

In 1981, Genbank contained less than 500,000 bp of info.

In 1986, Genbank contained 9,615,371 bp of info.

In 1991, Genbank contained 71,947,426 bp of info.

In 1996, Genbank contained 651,972,984 bp of info.

In 2001, Genbank contained 15,849,921,438 bp of info.

In 2004, Genbank contained 37,893,844,733 bp of info.

In 2009, Genbank contained 106,533,156,756 bp of info.
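
To get a feel for how fast these numbers grow, one can estimate an average doubling time between consecutive data points. The sketch below assumes exponential growth within each interval, which is a modelling assumption and not a claim from the figures themselves; the variable and function names are made up for illustration.

```python
import math

# Genbank sizes in base pairs, copied from the figures listed above.
genbank_bp = {
    1986: 9_615_371,
    1991: 71_947_426,
    1996: 651_972_984,
    2001: 15_849_921_438,
    2004: 37_893_844_733,
    2009: 106_533_156_756,
}

def doubling_time_years(year_a: int, year_b: int) -> float:
    """Average doubling time between two years, assuming exponential growth."""
    growth = genbank_bp[year_b] / genbank_bp[year_a]
    return (year_b - year_a) * math.log(2) / math.log(growth)

years = sorted(genbank_bp)
for a, b in zip(years, years[1:]):
    print(f"{a}-{b}: doubling roughly every {doubling_time_years(a, b):.1f} years")
```

The 1981 entry is left out because the source gives only an upper bound for that year.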

What’s in the databases?

On March 18, 2005 there were 1791 completely sequenced viruses, 204 completely sequenced bacteria, 21 completely sequenced archaea, and 9 complete genomes of Eukaryotes, among them two yeasts, the roundworm C. elegans, the fruitfly Drosophila melanogaster, the mosquito A. gambiae, the malaria parasite P. falciparum, and the plant Arabidopsis thaliana (thale cress). There are also drafts of 11 other genomes of eukaryotes, most notably of the human genome.

What’s in the databases?

On December 17, 2010 there were 3518 completely sequenced viruses, 952 completely sequenced bacteria, 68 completely sequenced archaea, and 73 complete genomes of Eukaryotes, among them cow, wolf, horse, human, a monkey, pig, chimpanzee.

First challenge: Sequencing large genomes

Currently, much of the sequencing process is automated. However, contemporary sequencing machines can only sequence stretches of DNA that are a few hundred base pairs long at a time. The process of assembling these stretches of sequence into a whole genome poses some interesting mathematical problems.

First challenge: Sequencing large genomes

For example, the publicly financed Human Genome Project used an approach called genome mapping to facilitate sequence assembly. Celera Genomics, a private enterprise, announced that it would be able to complete the sequencing of the entire human genome much faster by using an approach called shotgun sequencing. There was much debate over the feasibility of the latter approach, but it apparently worked. At its core, this was a debate over the mathematics of sequence assembly.
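
To illustrate why assembly is a mathematical problem, here is a toy greedy assembler that repeatedly merges the two fragments with the longest suffix-prefix overlap. It is a minimal sketch under strong assumptions (error-free reads, one strand, no repeats) and is not the method used by either project; the function names and reads are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are fully contained in another read.
    contigs = [r for r in contigs if not any(r != s and r in s for s in contigs)]
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # no remaining overlaps: leave several contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs

# Hypothetical reads covering a made-up 15-letter "genome".
reads = ["ATGGCGTA", "GCGTACGT", "ACGTTAGC"]
print(greedy_assemble(reads))
```

Repeated stretches of sequence can mislead a greedy strategy like this one, which is one reason real assembly is hard.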

You have sequenced your genome - what do you do with it?

This is known as genome analysis or sequence analysis. At present, most of bioinformatics is concerned with sequence analysis. Here are some of the questions studied in sequence analysis:

gene finding
protein 3D structure prediction
gene function prediction
prediction of important sites in proteins
reconstruction of phylogenies

Genes and proteins

The genome controls the making and workings of an organism by telling the cell which proteins to manufacture under which conditions. Proteins are the workhorses of biochemistry and play a variety of roles.

A gene is a stretch of DNA that codes for a given protein.

Where are the genes?

The objective of gene finding is to identify the regions of DNA that are genes. Ideally, we want to make statements like: “Positions 28,354 through 29,536 of this genome code for a protein.”

The mathematical challenge here is to identify patterns in DNA that reliably indicate where a gene starts and ends, especially in eukaryotes.
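
The simplest pattern of this kind is an open reading frame (ORF): a start codon ATG followed, in the same reading frame, by a stop codon. The sketch below scans one strand for such stretches and reports positions in the "positions X through Y" style used above. It is a deliberately naive illustration with hypothetical names and input; it ignores the opposite strand and the introns that make eukaryotic gene finding hard, and practical gene finders rely on statistical models instead.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna: str, min_codons: int = 3) -> list[tuple[int, int]]:
    """Return (start, end) positions, 1-based inclusive, of ORFs on one strand.

    An ORF here runs from an ATG to the first in-frame stop codon and must
    contain at least min_codons codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start + 1, i + 3))
                start = None
    return sorted(orfs)

# Hypothetical sequence containing one short ORF.
for start, end in find_orfs("CCATGAAACCCGGGTAACC"):
    print(f"Positions {start} through {end} may code for a protein.")
```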

Protein structure prediction

When a protein is manufactured in the cell, it assumes a characteristic 3D structure or fold. It is very costly to determine the 3D structure of a protein experimentally (by NMR or X-ray crystallography). It would be much cheaper if we could predict the 3D structure of a protein directly from its primary structure, i.e., from the sequence of its amino acids. This is known as the protein folding problem. Many approaches have been proposed to develop algorithms for solving this problem; so far results are mixed.

Prediction of protein function

Suppose you have identified a gene. What is its role in the biochemistry of its organism? Sequence databases can help us in formulating reasonable hypotheses.

Search the database for proteins with similar amino acid sequences in other organisms.

If the functions of the most similar proteins are known and if they tend to be the same function (e.g., “enzyme involved in glucose metabolism”), then it is reasonable to conjecture that your gene also codes for an enzyme involved in glucose metabolism.

Prediction of protein function: homology searches

Given a DNA or protein sequence, searching the database(s) for similar sequences is known as a “homology search”. The most popular software tool for performing these searches is called BLAST; therefore biologists often speak of “BLAST searches”. There are two interesting problems here:

How to measure “similarity” of two sequences.

How much similarity constitutes evidence of biologically meaningful homology as opposed to random chance?
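
One classical answer to the first question is a local alignment score: reward matching letters, penalize mismatches and gaps, and take the best-scoring pair of substrings. The sketch below computes such a score by dynamic programming (the Smith-Waterman recurrence with a linear gap penalty). The scoring values are arbitrary assumptions chosen for illustration, and this is not how BLAST works internally; BLAST uses faster heuristics to approximate this kind of score.

```python
def local_alignment_score(s: str, t: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score of s and t (Smith-Waterman, linear gaps)."""
    prev = [0] * (len(t) + 1)  # scores for the previous row of the DP table
    best = 0
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

# Hypothetical sequences sharing the stretch "ACGT".
print(local_alignment_score("TTACGTTT", "GGACGTGG"))
```

The second question, whether a given score is more than chance, requires a statistical model of the scores expected between unrelated sequences, which this sketch does not attempt.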

Prediction of important sites in proteins

Not all parts of a protein are equally important; the function of most of its amino acids is often just to maintain an appropriate 3D structure, and mutations of those less crucial amino acids often don't have much effect. However, most proteins have crucial parts such as binding sites. Mutations occurring at binding sites tend to be lethal and will be weeded out by evolution.

How to predict binding sites from sequence data:

Get a collection of proteins of similar amino acid sequences and analogous biochemical function from your database.

Align these sequences amino acid by amino acid.

Check which regions of the protein are highly conserved in the course of evolution.

The binding site should be in one of the highly conserved regions.
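
The last two steps can be sketched in code once an alignment is available. The example below scores each alignment column by the fraction of sequences that share its most common residue and then reports runs of fully conserved columns. The four aligned protein fragments are invented for illustration, and the scoring rule, threshold and function names are assumptions; real analyses typically use more refined conservation measures.

```python
from collections import Counter

def column_conservation(aligned: list[str]) -> list[float]:
    """Fraction of sequences sharing the most common residue in each column.

    Gaps ('-') count against conservation and are never the consensus residue.
    """
    n = len(aligned)
    scores = []
    for column in zip(*aligned):
        counts = Counter(res for res in column if res != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / n)
    return scores

def conserved_regions(scores: list[float], threshold: float = 1.0,
                      min_len: int = 3) -> list[tuple[int, int]]:
    """Runs of at least min_len consecutive columns scoring >= threshold (1-based)."""
    regions, start = [], None
    for pos, score in enumerate(scores + [-1.0], start=1):  # sentinel closes the last run
        if score >= threshold and start is None:
            start = pos
        elif score < threshold and start is not None:
            if pos - start >= min_len:
                regions.append((start, pos - 1))
            start = None
    return regions

# Hypothetical aligned protein fragments (one letter per amino acid).
aligned = ["MKTAHGDSLV", "MRTAHGDALI", "MKSAHGD-LV", "LKTAHGDSMV"]
scores = column_conservation(aligned)
print(conserved_regions(scores))
```

In this toy input the run of identical columns is the candidate region in which to look for a binding site.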

The importance of being aligned

DNA and protein molecules evolve mostly by three processes: point mutations (exchange of a single letter for another), insertions, and deletions. If a group of homologous proteins from different organisms has been identified, it is assumed that these proteins have evolved from a common ancestor. The process of multiple sequence alignment aims at identifying loci in the individual molecules that are derived from a common ancestral locus. These form the columns of the alignment.

Example of a multiple alignment

A T G - - T T C G G A C T
A C G A A T C C A G - C T
- C G A A T C C T A A C C
- T G A G C A C T A A C C

Reconstruction of phylogenetic trees

A phylogenetic tree depicts the evolutionary history of a group of species. By observing similarities and differences between species, we may be able to reconstruct their phylogeny. Classically, the degree of similarity between two species has been assessed from morphological characters. By comparing genomic sequence data, we actually can quantify the degree of similarity between any two species, and use these degrees of similarity as a basis for reconstructing phylogenetic trees.

Reconstruction of phylogenetic trees

The most common approach to using genomic data for reconstruction of phylogenetic trees is to look at genes with analogous function and thus supposedly common ancestry and see how far the genes taken from the extant organisms have diverged. The observed differences in the amino acid (or nucleotide) sequences are then used to reconstruct the phylogeny. The current partition of organisms into eubacteria, archaea and eukarya was discovered in this way by analyzing rRNA.
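
As a rough illustration of how sequence divergence turns into a tree, the sketch below takes the four rows of the multiple alignment example shown earlier, computes the fraction of differing sites for every pair, and joins the closest groups first (average-linkage clustering, often called UPGMA). The labels seq1 to seq4 are made up, UPGMA is only one of several tree-building methods and assumes roughly constant rates of change, and 13 columns are far too few for a real inference.

```python
from itertools import combinations

# Rows of the example alignment; the labels are hypothetical.
alignment = {
    "seq1": "ATG--TTCGGACT",
    "seq2": "ACGAATCCAG-CT",
    "seq3": "-CGAATCCTAACC",
    "seq4": "-TGAGCACTAACC",
}

def p_distance(a: str, b: str) -> float:
    """Fraction of differing sites among columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)

def upgma(names, dist):
    """Average-linkage clustering; returns the tree as nested tuples."""
    sizes = {name: 1 for name in names}
    dist = dict(dist)
    while len(sizes) > 1:
        a, b = min(combinations(sizes, 2), key=lambda p: dist[frozenset(p)])
        merged = (a, b)
        for c in sizes:
            if c not in (a, b):
                # Distance to the new group: size-weighted average of the old ones.
                dist[frozenset((merged, c))] = (
                    sizes[a] * dist[frozenset((a, c))]
                    + sizes[b] * dist[frozenset((b, c))]
                ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes))

distances = {
    frozenset((a, b)): p_distance(alignment[a], alignment[b])
    for a, b in combinations(alignment, 2)
}
print(upgma(list(alignment), distances))
```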

The new frontier: Functional genomics

It is fashionable nowadays to talk about functional genomics. Many people use this term as if it were a new discipline separate from bioinformatics, but I think it is more appropriate to consider it a new subfield of bioinformatics. The ultimate aim of functional genomics is to understand what genes do, when they do it, and how they do it. Ideally, we would like to understand the cell, or organism, as a giant network of chemical pathways that regulate each other.

Microarrays (gene chips)

Microarrays or Gene Chips make it possible to monitor the level of activity of all the genes represented on the chip simultaneously under a variety of environmental conditions, in various organs, and at various stages of development.

There are two types of challenges here: To determine when a change in activity level detected by the chip is statistically significant, and to use the data so obtained to make inferences about gene regulation.
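
For the first challenge, one simple way to ask whether a change is more than noise is a permutation test: compare the observed difference between two conditions with the differences obtained when the measurements are reassigned to the conditions in every possible way. The sketch below does this for a single gene with made-up replicate measurements; the function name and data are hypothetical, and a real microarray analysis would also have to correct for testing thousands of genes at once.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(control: list[float], treated: list[float]) -> float:
    """Two-sided exact permutation test on the difference of group means."""
    pooled = control + treated
    observed = abs(mean(treated) - mean(control))
    n = len(control)
    extreme = total = 0
    for idx in combinations(range(len(pooled)), n):
        chosen = set(idx)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance so that ties are not lost to floating-point rounding.
        if abs(mean(group_b) - mean(group_a)) >= observed - 1e-12:
            extreme += 1
    return extreme / total

# Hypothetical activity levels of one gene, four replicates per condition.
control = [1.0, 1.1, 0.9, 1.0]
treated = [2.0, 2.2, 1.9, 2.1]
print(permutation_p_value(control, treated))
```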

What do we do with all these data?

The bread and butter method of microarray data analysis is clustering. For a sequence of experiments on the same set of genes under various conditions, it makes it possible to identify groups of genes that are up- or down-regulated simultaneously. It is believed that genes acting in the same chemical pathway would normally belong to the same cluster. Some algorithms for clustering will be discussed in this course.
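
As a first taste of what clustering means here, the sketch below groups genes whose activity profiles across experiments are strongly correlated. It is a minimal single-linkage scheme with an arbitrary correlation threshold, written for illustration only; it is not necessarily one of the algorithms covered later, and the gene names and measurements are invented.

```python
from math import sqrt

def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two profiles (profiles must not be constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cluster_by_correlation(profiles: dict[str, list[float]],
                           threshold: float = 0.9) -> list[set[str]]:
    """A gene joins (and merges) every cluster containing a gene it correlates with."""
    clusters: list[set[str]] = []
    for gene in profiles:
        linked = [c for c in clusters
                  if any(pearson(profiles[gene], profiles[g]) >= threshold for g in c)]
        merged = {gene}.union(*linked)
        clusters = [c for c in clusters if c not in linked] + [merged]
    return clusters

# Hypothetical activity levels of four genes in four experiments.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [1.1, 2.1, 2.9, 4.2],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [3.9, 3.1, 2.2, 0.8],
}
print(cluster_by_correlation(profiles))
```

Here geneA and geneB rise together while geneC and geneD fall together, so the four genes split into two groups.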