Basics of bioinformatics

Need & Emergence of the Field Speaker Shashi Shekhar Head of computational Section Biowits Life Sciences

Transcript of Basics of bioinformatics

Page 1: Basics of bioinformatics

Need & Emergence of the Field

Speaker: Shashi Shekhar, Head of Computational Section, Biowits Life Sciences

Page 2: Basics of bioinformatics

The marriage between computer science and molecular biology

◦ The algorithms and techniques of computer science are being used to solve the problems faced by molecular biologists

‘Information technology applied to the management and analysis of biological data’

◦ Storage and analysis are two of the important functions; bioinformaticians build tools for each.

Page 3: Basics of bioinformatics

Biology

Chemistry

Statistics

Computer Science

Bioinformatics

Page 4: Basics of bioinformatics

The need for bioinformatics has arisen from the recent explosion of publicly available genomic information, such as that resulting from the Human Genome Project.

To gain a better understanding of gene analysis, taxonomy, & evolution.

To work efficiently on rational drug design and reduce the time taken to develop drugs manually.

Page 5: Basics of bioinformatics

To uncover the wealth of biological information hidden in the mass of sequence, structure, literature and other biological data.

It is being used now, and will be for the foreseeable future, in the areas of molecular medicine.

It has environmental benefits in identifying waste clean-up bacteria.

In agriculture, it can be used to produce high-yield, low-maintenance crops.

Page 6: Basics of bioinformatics

Molecular Medicine

Gene Therapy

Drug Development

Microbial genome applications

Crop Improvement

Forensic Analysis of Microbes

Biotechnology

Evolutionary Studies

Bio-Weapon Creation

Page 7: Basics of bioinformatics

In Experimental Molecular Biology

In Genetics and Genomics

In generating Biological Data

Analysis of gene and protein expression

Comparison of genomic data

Understanding the evolutionary aspects of biology

Understanding biological pathways and networks in Systems Biology

In Simulation & Modeling of DNA, RNA & Protein

Page 8: Basics of bioinformatics

Bioinformatics lecture March 5, 2002

Organisation of knowledge (sequences, structures, functional data), e.g. homology searches

Page 9: Basics of bioinformatics

Prediction of structure from sequence

◦ secondary structure

◦ homology modelling, threading

◦ ab initio 3D prediction

Analysis of 3D structure

◦ structure comparison/ alignment

◦ prediction of function from structure

◦ molecular mechanics/ molecular dynamics

◦ prediction of molecular interactions, docking

Structure databases (RCSB)

Page 10: Basics of bioinformatics
Page 11: Basics of bioinformatics

Sequence Similarity

Tools used for sequence similarity searching

Their uses in biology

Databases

Different types of databases

Page 12: Basics of bioinformatics

One could align the sequence so that many corresponding residues match.

Strong similarity between two sequences is a strong argument for their homology.

Homology: Two (or more) sequences have a common ancestor.

Similarity: Two (or more) sequences are similar by some criterion; the term does not refer to any historical process.

Page 13: Basics of bioinformatics

To find out how related proteins or genes are, and whether or not they have a common ancestor.

Mutations bring about changes, or divergence, in the sequences.

It can also reveal the part of the sequence which is crucial for the functioning of the gene or protein.

Page 14: Basics of bioinformatics

Optimal Alignment: The alignment that is the best, given a defined set of rules and parameter values for comparing different alignments.

Global Alignment: An alignment that assumes that the two proteins are basically similar over the entire length of one another. The alignment attempts to match them to each other from end to end.

Local Alignment: An alignment that searches for segments of the two sequences that match well. There is no attempt to force entire sequences into an alignment, just those parts that appear to have good similarity.

(contd.)

Page 15: Basics of bioinformatics

Gaps & Insertions: In an alignment, one may achieve much better correspondence between two sequences if one allows a gap to be introduced in one sequence. Equivalently, one could allow an insertion in the other sequence. Biologically this corresponds to a mutation event (an insertion or a deletion).

Substitution matrix: A substitution matrix describes how likely two residue types are to mutate into each other over evolutionary time. This is used to estimate how well two residues of given types would match if they were aligned in a sequence alignment.

Gap Penalty: The gap penalty is used to help decide whether or not to accept a gap or insertion in an alignment when it is possible to achieve a good residue-to-residue alignment at some other neighboring point in the sequence.

Page 16: Basics of bioinformatics

Similarity indicates conserved function

Human and mouse genes are more than 80% similar at sequence level

But these genes are a small fraction of the genome

Most sequences in the genome are not recognizably similar

Comparing sequences helps us understand function

◦ Locate similar gene in another species to understand your new gene

Page 17: Basics of bioinformatics

Match score: +1

Mismatch score: +0

Gap penalty: –1

ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||    || ||||||||
----CTGATTCGC---ATCGTCTATCT

Matches: 18 × (+1)

Mismatches: 2 × 0

Gaps: 7 × (– 1)

Score = +11
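The arithmetic on this slide can be checked with a few lines of Python. This is only a minimal sketch: the function name `score_alignment` is made up for illustration, and it assumes the two rows are already aligned (equal length, `-` marking a gap).

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap=-1):
    """Score a given gapped alignment column by column (linear gap penalty).

    Returns (matches, mismatches, gaps, total). The default scores are the
    ones on the slide: match +1, mismatch 0, gap -1.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = gaps = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gaps += 1          # a gap in either row costs one gap penalty
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    total = matches * match + mismatches * mismatch + gaps * gap
    return matches, mismatches, gaps, total


top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
print(score_alignment(top, bottom))  # (18, 2, 7, 11)
```

The result reproduces the slide: 18 matches, 2 mismatches, 7 gap positions, score +11.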

Page 18: Basics of bioinformatics

We want to find alignments that are evolutionarily likely. Which of the following alignments seems more likely to you?

ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGAT-------ATAGTCTATCT

ACGTCTGATACGCCGTATAGTCTATCT
AC-T-TGA--CG-CGT-TA-TCTATCT

We can favor the more likely alignment by penalizing more for a new gap than for extending an existing gap.

Page 19: Basics of bioinformatics

Match/mismatch score: +1/+0

Origination/length penalty: –2/–1

ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||    || ||||||||
----CTGATTCGC---ATCGTCTATCT

Matches: 18 × (+1)

Mismatches: 2 × 0

Origination: 2 × (–2)

Length: 7 × (–1)

Score = +7
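The same calculation with separate origination and length penalties can be sketched in Python. The function name `affine_score` is an illustrative assumption. As a simplification, a gap in one row followed immediately by a gap in the other row is counted as one continuous gap.

```python
def affine_score(top, bottom, match=1, mismatch=0, gap_open=-2, gap_extend=-1):
    """Score a given alignment with an origination (open) penalty charged once
    per gap and a length (extend) penalty charged for every gap position."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    in_gap = False
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            if not in_gap:
                score += gap_open   # a new gap starts here
            score += gap_extend     # every gap position pays the length penalty
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


top = "ACGTCTGATACGCCGTATAGTCTATCT"
print(affine_score(top, "----CTGATTCGC---ATCGTCTATCT"))  # 7
print(affine_score(top, "ACGTCTGAT-------ATAGTCTATCT"))  # 11
print(affine_score(top, "AC-T-TGA--CG-CGT-TA-TCTATCT"))  # 1
```

The last two calls score a single 7-residue gap against seven gap positions scattered over six separate gaps. Both alignments have 20 matches and 7 gap positions, but the origination penalty strongly favors the single long gap (+11 versus +1).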

Page 20: Basics of bioinformatics

Alignment scoring and substitution matrices

Aligning two sequences

◦ Dotplots

◦ The dynamic programming algorithm

◦ Significance of the results

Heuristic methods

◦ FASTA

◦ BLAST

◦ Interpreting the output

Page 21: Basics of bioinformatics

Examples:

Staden: simple text file, lines <= 80 characters

FASTA: simple text file, lines <= 80 characters, one line header marked by ">"

GCG: structured format with header and formatted sequence

Sequence format descriptions e.g. on http://www.infobiogen.fr/doc/tutoriel/formats.html
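As an illustration of the FASTA format described above, here is a minimal reader in Python. It is a sketch only: the function name `parse_fasta` and the example records are made up, and it assumes each record is a single ">" header line followed by one or more sequence lines.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header -> sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:]             # header line, without the ">" marker
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical example records, not taken from any database.
example = """>seq1 hypothetical example record
ACGTCTGATACGCC
GTATAGTCTATCT
>seq2
CTGATTCGCATCG
"""
for name, seq in parse_fasta(example).items():
    print(name, len(seq))
```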

Page 22: Basics of bioinformatics

Local sequence comparison:

assumption of evolution by point mutations

◦ amino acid replacement (by base replacement)

◦ amino acid insertion

◦ amino acid deletion

scores:

◦ positive for identical or similar

◦ negative for different

◦ negative for insertion in one of the two sequences

Page 23: Basics of bioinformatics

Simple comparison without alignment

Similarities between sequences show up in 2D diagram

Page 24: Basics of bioinformatics

identity (i=j)

similarity of sequence with other parts of itself
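A dotplot of this kind can be produced with a very small Python sketch. The function name `dotplot` and the test sequence are illustrative assumptions. A `*` is printed wherever the residue in column i equals the residue in row j, with no windowing or filtering.

```python
def dotplot(seq_x, seq_y):
    """Return a text dotplot: '*' where seq_x[i] == seq_y[j], '.' elsewhere."""
    lines = ["  " + seq_x]                       # column labels
    for b in seq_y:
        row = "".join("*" if a == b else "." for a in seq_x)
        lines.append(b + " " + row)              # row label, then the dots
    return "\n".join(lines)


# Comparing a sequence with itself: the main diagonal (i = j) is always
# filled, and off-diagonal stars mark residues repeated elsewhere in it.
print(dotplot("GATTACA", "GATTACA"))
```

Real dotplot programs usually compare short windows rather than single residues to suppress background noise; that refinement is left out here.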

Page 25: Basics of bioinformatics

The 1st alignment: highly significant

The 2nd: plausible

The 3rd: spurious

Distinguish by alignment score

Similarities increase score

Mismatches decrease score

Gaps decrease score

substitution matrix

gap penalties

Page 26: Basics of bioinformatics

Substitution matrix weights replacement of one residue by another:

◦ Similar -> high score (positive)

◦ Different -> low score (negative)

Simplest is identity matrix (e.g. for nucleic acids)

  A C G T
A 1 0 0 0
C 0 1 0 0
G 0 0 1 0
T 0 0 0 1

Page 27: Basics of bioinformatics

PAM matrix series (PAM1 ... PAM250):

◦ Derived from alignment of very similar sequences

◦ PAM1 = mutation events that change 1% of AA

◦ PAM2, PAM3, ... extrapolated by matrix multiplication

e.g.: PAM2 = PAM1*PAM1; PAM3 = PAM2 * PAM1 etc

Problems with PAM matrices:

◦ Inaccurate modelling of long-time substitutions: over short times, conservative mutations are dominated by single-nucleotide changes (e.g. L <–> I, L <–> V, Y <–> F)

◦ Over long times, any amino acid change can occur
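The extrapolation PAM2 = PAM1*PAM1 can be sketched in plain Python. The 3-letter matrix below is a made-up toy, not real PAM data. It is set up so that each residue changes with probability 1% per step, mirroring the definition of PAM1. The names `matmul` and `pam_n` are illustrative.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(pam1, n):
    """Extrapolate PAMn by multiplying PAM1 with itself n times in total."""
    result = pam1
    for _ in range(n - 1):
        result = matmul(result, pam1)
    return result


# Hypothetical mutation-probability matrix for a 3-letter alphabet:
# each residue stays unchanged with probability 0.99 (1% change).
toy_pam1 = [[0.990, 0.005, 0.005],
            [0.005, 0.990, 0.005],
            [0.005, 0.005, 0.990]]

toy_pam2 = pam_n(toy_pam1, 2)
print(toy_pam2[0])  # diagonal about 0.98015, off-diagonal about 0.009925
```

The matrices multiplied here are mutation probabilities. The scoring matrices used in alignments are log-odds values derived from them, a step this sketch leaves out.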

Page 28: Basics of bioinformatics

positive and negative values

identity score depends on residue

Page 29: Basics of bioinformatics

BLOSUM series (BLOSUM50, BLOSUM62, ...)

derived from alignments of distantly related sequences

BLOCKS database:

◦ ungapped multiple alignments of protein families at a given identity

BLOSUM50 better for gapped alignments

BLOSUM62 better for ungapped alignments

Page 30: Basics of bioinformatics

Blosum62 substitution matrix

Page 31: Basics of bioinformatics

Significance of alignment:

Depends critically on gap penalty

Need to adjust to given sequence

Gap penalties influenced by knowledge of structure etc.

Simple rules when nothing is known (linear or affine)

Page 32: Basics of bioinformatics

Dynamic programming = build up optimal alignment using previous solutions for optimal alignments of subsequences.

Dynamic programming relies on a principle of optimality. This principle states that in an optimal sequence of decisions or choices, each subsequence must also be optimal.

The principle can be restated as follows: the optimal solution to a problem is a combination of optimal solutions to some of its sub-problems.

Page 33: Basics of bioinformatics

Construct a two-dimensional matrix whose axes are the two sequences to be compared.

The scores are calculated one row at a time. This starts with the first residue of one sequence, which is scanned against the entire length of the other sequence to fill the first row, followed by scanning of the second row.

The scanning of the second row takes into account the scores already obtained in the first round. The best score is put into the bottom right corner of an intermediate matrix.

This process is iterated until values for all the cells are filled.

Page 34: Basics of bioinformatics

Contd.

Page 35: Basics of bioinformatics

Contd.

Page 36: Basics of bioinformatics

The results are traced back through the matrix in reverse order from the lower right-hand corner of the matrix toward the origin of the matrix in the upper left-hand corner.

The best matching path is the one that has the maximum total score.

If two or more paths reach the same highest score, one is chosen arbitrarily to represent the best alignment.

The path can also move horizontally or vertically at a certain point, which corresponds to introduction of a gap or an insertion or deletion for one of the two sequences.
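The matrix fill and traceback described on these slides can be sketched as follows. This is a minimal global alignment with a linear gap penalty; the function name `needleman_wunsch` and the default scores (+1, 0, -1, borrowed from the earlier scoring slide) are assumptions for illustration. Ties are broken in a fixed order (diagonal first), which is one arbitrary choice among equally good paths.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Global alignment by dynamic programming; returns (score, row_a, row_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap              # prefix of a aligned to gaps only
    for j in range(1, cols):
        score[0][j] = j * gap              # prefix of b aligned to gaps only

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the matrix row by row, reusing scores of smaller sub-problems.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),  # align residues
                              score[i - 1][j] + gap,            # gap in b
                              score[i][j - 1] + gap)            # gap in a

    # Trace back from the lower right-hand corner to the upper left.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1   # vertical move
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1   # horizontal move
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


total, row_a, row_b = needleman_wunsch("ACGT", "AGT")
print(total)
print(row_a)
print(row_b)
```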

Page 37: Basics of bioinformatics
Page 38: Basics of bioinformatics

Global alignment (ends aligned)

◦ Needleman & Wunsch, 1970

Local alignment (subsequences aligned)

◦ Smith & Waterman, 1981

Searching for repetitions

Searching for overlap
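For comparison with global alignment, a local alignment score can be sketched by letting every cell fall back to zero, so an alignment may start and end anywhere. The function name `smith_waterman_score` and the scores (+1, -1, -1) are assumed values for illustration; the sketch returns only the best score, not the aligned segments.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of a and b (linear gap penalty)."""
    best = 0
    prev = [0] * (len(b) + 1)          # previous matrix row
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets a local alignment restart instead of going negative.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The shared segment GATTACA is found despite unrelated flanking residues.
print(smith_waterman_score("TTTTGATTACATTTT", "CCCGATTACACCC"))  # 7
```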

Page 39: Basics of bioinformatics
Page 40: Basics of bioinformatics

Multi-step approach to find high-scoring alignments

Exact short word matches

Maximal scoring ungapped extensions

Identify gapped alignments
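The first step, finding exact short word matches, can be sketched with a lookup table of words. The function name `word_matches`, the word length `k=3` and the example sequences are assumptions for illustration. Hits are grouped by diagonal (subject position minus query position), since several hits on one diagonal suggest a region worth extending.

```python
from collections import defaultdict


def word_matches(query, subject, k=3):
    """Find exact k-letter word matches; returns {diagonal: [(q_pos, s_pos), ...]}."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)          # word -> positions in query
    hits = defaultdict(list)
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            hits[j - i].append((i, j))           # same diagonal = same offset
    return dict(hits)


hits = word_matches("GATTACA", "TTGATTACATT")
for diagonal, pairs in sorted(hits.items()):
    print(diagonal, pairs)
```

In this example five consecutive word hits fall on diagonal 2 and one isolated hit on diagonal 7, so diagonal 2 is the obvious candidate for extension.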

Page 41: Basics of bioinformatics

Contd.

Page 42: Basics of bioinformatics
Page 43: Basics of bioinformatics

FASTA also uses E-values and bit scores. The FASTA output provides one more statistical parameter, the Z-score.

This describes the number of standard deviations from the mean score for the database search.

Because most of the alignments with the query sequence are with unrelated sequences, the higher the Z-score for a reported match, the further it lies from the mean of the score distribution and hence the more significant the match.

For a Z-score > 15, the match can be considered extremely significant, with certainty of a homologous relationship.

If Z is in the range of 5 to 15, the sequence pair can be described as highly probable homologs.

If Z < 5, their relationship is described as less certain.
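The definition and thresholds above can be written out as a short Python sketch. The function names and the background scores are made up for illustration, and the sketch does not attempt to reproduce how FASTA itself estimates the score distribution.

```python
from statistics import mean, pstdev


def z_score(score, background_scores):
    """Number of standard deviations by which score exceeds the background mean."""
    sigma = pstdev(background_scores)
    if sigma == 0:
        raise ValueError("background scores have zero spread")
    return (score - mean(background_scores)) / sigma


def interpret_z(z):
    """Apply the rule-of-thumb ranges quoted on the slide."""
    if z > 15:
        return "extremely significant"
    if z >= 5:
        return "highly probable homologs"
    return "less certain"


# Hypothetical scores of the query against unrelated database sequences.
background = [18, 22, 18, 22]          # mean 20, standard deviation 2
z = z_score(30, background)
print(z, interpret_z(z))
```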

Page 44: Basics of bioinformatics

Multi-step approach to find high-scoring alignments

List words of fixed length (3AA) expected to give score larger than threshold

For every word, search database and extend ungapped alignment in both directions

New versions of BLAST allow gaps
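The ungapped extension step can be sketched as below. This is not the BLAST implementation: the word list and score threshold are skipped (the seed position is assumed to be known already), and the names `extend_seed` and `x_drop`, the scores, and the example sequences are illustrative assumptions. Extension in each direction stops once the running score falls more than `x_drop` below the best score seen.

```python
def _extend(query, subject, qi, sj, step, match, mismatch, x_drop):
    """Best ungapped extension in one direction; returns (gain, length)."""
    gain = best_gain = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        gain += match if query[qi] == subject[sj] else mismatch
        length += 1
        if gain > best_gain:
            best_gain, best_len = gain, length
        elif best_gain - gain > x_drop:
            break                      # score dropped too far, give up
        qi += step
        sj += step
    return best_gain, best_len


def extend_seed(query, subject, i, j, k, match=1, mismatch=-1, x_drop=2):
    """Extend a word hit of length k at query[i], subject[j] in both directions."""
    seed = sum(match if query[i + t] == subject[j + t] else mismatch
               for t in range(k))
    right_gain, right_len = _extend(query, subject, i + k, j + k, +1,
                                    match, mismatch, x_drop)
    left_gain, left_len = _extend(query, subject, i - 1, j - 1, -1,
                                  match, mismatch, x_drop)
    segment = query[i - left_len:i + k + right_len]
    return seed + left_gain + right_gain, segment


# Seed word "TTA" found at position 4 in both sequences.
print(extend_seed("AAGATTACAGG", "TTGATTACACC", 4, 4, 3))  # (7, 'GATTACA')
```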

Page 45: Basics of bioinformatics

Contd.

Page 46: Basics of bioinformatics
Page 47: Basics of bioinformatics

The E-value provides information about the likelihood that a given sequence match is purely by chance. The lower the E-value, the less likely the database match is a result of random chance and therefore the more significant the match is.

If E < 1e-50 (or 1 × 10^-50), there should be an extremely high confidence that the database match is a result of homologous relationships.

If E is between 1e-50 and 0.01, the match can be considered a result of homology.

If E is between 0.01 and 10, the match is considered not significant, but may hint at a tentative remote homology relationship. Additional evidence is needed.

If E > 10, the sequences under consideration are either unrelated or related by extremely distant relationships that fall below the limit of detection with the current method.
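These ranges can be turned into a small helper for reading search output. The function name `interpret_e_value` is an assumption, and so is the handling of values that fall exactly on a boundary, which the slide does not specify.

```python
def interpret_e_value(e):
    """Map an E-value to the rule-of-thumb categories quoted on the slide."""
    if e < 1e-50:
        return "extremely high confidence of homology"
    if e < 0.01:
        return "homology"
    if e <= 10:
        return "not significant; possible remote homology, more evidence needed"
    return "unrelated, or too distant to detect with this method"


for e in (1e-80, 1e-5, 0.5, 50):
    print(e, "->", interpret_e_value(e))
```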

Page 48: Basics of bioinformatics

Various versions:

blastn: nucleotide query - nucleotide database

blastp: protein query - protein database

tblastn: protein query - translated nucleotide database

blastx: translated nucleotide query - protein database

tblastx: translated nucleotide query - translated nucleotide database

Page 49: Basics of bioinformatics

Very fast growth of biological data

Diversity of biological data:

◦ Primary sequences

◦ 3D structures

◦ Functional data

Database entry usually required for publication:

◦ Sequences

◦ Structures

Database entry may replace primary publication:

◦ Genomic approaches

Page 50: Basics of bioinformatics

Nucleic Acid      Protein
EMBL (Europe)     PIR (Protein Information Resource)
GenBank (USA)     MIPS
DDBJ (Japan)      SWISS-PROT (University of Geneva, now with EBI)
                  TrEMBL (a supplement to SWISS-PROT)
                  NRL-3D

Page 51: Basics of bioinformatics

The three databanks exchange data on a daily basis

Data can be submitted and accessed at any of the three locations

GenBank

◦ www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html

EMBL

◦ www.ebi.ac.uk/embl/index.html

DNA Databank of Japan (DDBJ)

◦ www.nig.ac.jp/home.html

Page 52: Basics of bioinformatics

With so many databases, which one should be searched? Each is strong in some aspects and weak in others.

Composite databases are the answer: each one draws on several primary databases for its base data.

Searches of these databases are indexed and streamlined so that the same stored sequence is not searched twice in different databases.

Page 53: Basics of bioinformatics

OWL has these as its primary databases:

◦ SWISS PROT (top priority)

◦ PIR

◦ GenBank

◦ NRL-3D

Page 54: Basics of bioinformatics

Secondary databases store secondary structure info or results of searches of the primary databases.

Secondary database    Primary source
PROSITE               SWISS-PROT
PRINTS                OWL

Page 55: Basics of bioinformatics

We have sequenced and identified genes. So we know what they do.

The sequences are stored in databases.

So if we find a new gene in the human genome we compare it with the already found genes which are stored in the databases.

Since there are a large number of databases, we cannot perform a full sequence alignment for each and every sequence.

So heuristics must be used again.

Page 56: Basics of bioinformatics

Applications: Bioinformatics combines mathematics, statistics, computer science and information technology to solve complex biological problems.

Sequence Analysis: Sequence analysis uses sequencing data to identify the genes that encode peptides, as well as regulatory sequences. Computational tools also detect DNA mutations in an organism and identify sequences which are related. Special software is used to find overlapping fragments and assemble them.

Contd.

Page 57: Basics of bioinformatics

Prediction of Protein Structure: It is relatively easy to determine the primary structure of a protein, that is, the amino acid sequence encoded by the DNA, but it is difficult to determine the secondary, tertiary or quaternary structures of proteins. Tools of bioinformatics can be used to predict these complex protein structures.

Genome Annotation: In genome annotation, genomes are marked up to identify the regulatory sequences and protein-coding regions. It is a very important part of the Human Genome Project as it determines the regulatory sequences.

Page 58: Basics of bioinformatics

Comparative Genomics: Comparative genomics is the branch of bioinformatics which studies the relationships in genome structure and function between different biological species. For this purpose, intergenomic maps are constructed which enable scientists to trace the processes of evolution that occur in the genomes of different species.

Health and Drug Discovery: The tools of bioinformatics are also helpful in drug discovery, diagnosis and disease management. Complete sequencing of human genes has enabled scientists to make medicines and drugs which can target more than 500 genes.

Page 59: Basics of bioinformatics