PAM & Blosum

PAM & Blosum

Bioinformatics in Biosophy

Park, Jong Hwa

MRC-DUNN Hills Road Cambridge

CB2 2XYEngland

1

Next:02/06/2001

Family business and Matrix1. Proteins can always be grouped together.2. As they can be grouped (clustered), there

must be a way to represent a whole group of proteins.

3. How are we going to represent the groups of proteins to represent some important information effectively?

4. In Bioinformatics, multiple sequence alignment is one of the most common way of representing the whole groups. (Why?)

5. Once alignments are made, there are so many things to do, to know the protein families better. (Why?)

6. Clustering proteins in various other ways.

Some Blocks in Alignments

From sequence Blocks PAM matrix

We use PAM units to measure the amount of evolutionary distance between two amino acid sequences (at least 85% identical)

Two sequences S1 and S2 are said to be one PAM unit diverged if a series of accepted point mutations (and no insertions of deletions) has converted S1 to S2 with an average of one accepted point-mutation event per 100 amino acids in the two sequences.

The term ``accepted'' here means a mutation that was incorporated into the protein and passed to its progeny.

Therefore, either the mutation did not change the function of the protein or the change in the protein was beneficial to the organism.

PAM

Note that two sequences which are one PAM unit diverged do not necessarily differ in one percent, as often mistakenly thought, because a single position may undergo more than a single mutation. The difference between the two notions grows as the number of units does.

2 problems with PAMThere are two main problems with the notion of the PAM units: 1. First, practically all the sequences we can obtain today are

extracted from extant organisms. We almost do not know any protein sequences where one is actually derived from the other. The lack of ancestral protein sequences is handled by assuming that amino acid mutations are reversible and equally likely in either direction. This assumption, together with the additivity property of the PAM units derived from its definition, imply that given two amino acid sequences: Si and Sj whose mutual ancestor is Sij we have:

d(Si,Sj) = d(Si,Sij) + d(Sij,Sj)

when d(i,j) is the PAM distance between amino acid sequences i and j.

The InDel problem

2. The second problem, which is more difficult to overcome, is that we disregard here insertions and deletions which may occur

during evolution, hence we can not be sure of the correct correspondence between sequence positions. In order to know the exact correspondence one has to be able to identify the true historical gaps, or, at least to identify large intervals along the two

sequences where the correspondence is correct. This can not always be done with certainty, especially when the two

sequences are distantly diverged.

From seq. Blocks to BLOSUMThe first stage of building a matrix is

eliminating sequences, which are identical in more than x% of their amino acid sequence. This is done to avoid bias of the result in favor of a certain protein. The elimination is done either by removing sequences from the block, or by finding a cluster of similar sequences and replacing it by a new sequence that represents the cluster. The matrix built from blocks with no more the x% of similarity is called BLOSUM-x (e.g. the matrix built using sequences with no more then 50% similarity is called BLOSUM-50.)

http://www.math.tau.ac.il/~rshamir/algmb/scribe/html/lec03/node10.html

Building Blocks Matrix (Blosum)The second stage is counting the pairs of amino acids in each

column of the multiple alignment. For example in a column with the acids AABACA (as in the first column in the block in figure 3.2), there are 6 AA pairs, 4 AB pairs, 4 AC, and one BC. The probability qi,j for a pair of amino acids in the same column to be Ai and Aj is calculated, as well as the probability pi of a certain amino acid to be Ai.

In the third stage the log odd ratio is calculated. As final result we consider the rounded 2si,j, this value is stored inthe (i,j) entry of the BLOSUM-x matrix.

We take the log value of the probability in order to allow computing the total score of all substitutions using summation rather than multiplication.

Making Blosum In contrast to the PAM matrices, more sequences are examined

in the process of computing the BLOSUM matrix. Moreover, the sequences are of specific nature of resemblance, and therefore the two sets of matrices differ.

Comparing the efficiency of two matrices is done by calculating the ratio between the number of pairs of similar sequences discovered by a certain matrix but not discovered by another one and the number of pairs missed by the first but found by the other.

According to this comparison BLOSUM-62 is found to be better than other BLOSUM-x matrices as well as than PAM-x matrices.

Making New FASTA exchange matrix

1. Get some structural or sequence alignments2. Find non-gap blocks in the alignments3. Count the occurrences of mutations for single

amino acid4. Calculate the log odd ratio of expected

frequency vs observed frequency5. Produce a FASTA format Matrix6. Use your own matrix to run a FASTA

program (Fasta -s option). 7. fasta -s my.mat query.spfa lib.spfa

Making your own Matrix 2

http://bioinfo.mbb.yale.edu/align/scop/tables/allStrAlignments.txt

Let’s make a Multiple Residue Column (MRC) Matrix.

1. Get the alignment file: bioinfo-238/teaching mat/practical_ resource/

2. Parse the alignments to get dipeptide entries and occurrences

3. Make 400x400 matrix with occurrences(frequency)

4. Calculate the log odd ratio of all the dipeptides

5. Display the rankings of similar non-identical dipeptides.

http://bioinfo.mbb.yale.edu/align/scop/tables/allStrAlignments.txt

How to improve exchange Matrix?

• Use sequence alignment from the best people or structure/sequence align algorithm

• Use very large set of alignments

• Apply some protein family specific matrix (for example, transmembrane proteins)

• Use multiple residue columns to make the Matrix

PAM to quantify homology automatically

• Problem: we have two sequences– SEQVENCE– STRVCTVT

How similar or Homologous are they?

1. A simple way: Compositional Identity

2. Another Simple way: Sequence Identity

3. Advanced way: Similarity Score

4. Most advanced: Statistical Significance

Simple Sequence Similarity

1. A simple way: Compositional Identity

Compositional ID: SEQVENCE <> STRVCTVT

Shows how similar the sequences without giving order information

1. Another Simple way: Sequence Identity

Simple to understand, but far from accurate

To have more meaningful similarity, we can use PAM, BLOSUM, your own MATRICES,,,

Similarity Scores for a pair of sequences

1. Advanced way: Similarity Score Align the two sequences

Calculate the score based on Matrices used.

2. Most advanced: Statistical Significance1. Produces some random sequences for one of the

sequences and compare them to the other sequences many times (~100) and check what is the control level alignment score.

2. Onc can use reversed sequences instead of random sequences.

How to count similarity?

• If two sequences are very similar, we can just count the identical and similar sequences and calculate score

– ABCDEKKKMLOVE– ** *.**:.: *– ABEDCKKRNLIKE

• However, biological sequences have Insertions and Deletions.

– ABCDEKKKABCDEMLOVE– ABEDCKKR-----NLOVE

How do we know the correct Insertions and Deletions positions?

Normally it is not easy when sequences are very distant.Below twilight zone => Nearly impossible to have correct alignment.

What is the ‘correct’ alignment? This is a non-sense for diverged sequences

Around and below 40% sequence identity, most sequence alignments are WRONG.The wrong regions are mostly Insertion and Deletion locations (gaps)

The best we can do is to find where the gaps are and give up perfect Alignment.Putting gaps correctly in any sequence alignment is NOT solved completely.

Finding gaps as best as we can

• Usually, the best method is using NNN.

• NNN is time consuming.

• So, Needleman & Wunsch made a computer algorithm. Simple Dynamic Programming

(J. Mol. Biol. 48, 443-453 [1970])

Let's do a simple example, adapted from Needleman & Wunsch's original paper. First, place the sequences on a matrix of cells. At each

cell where the amino acids are identical, enter a value of 1. All the other cells are implicitly given a score of 0 (zero).

A D C N S R Q C L C R P MA 1 S 1 C 1 1 1 S 1 N 1 R 1 1 C 1 1 1 K C 1 1 1 R 1 1 D 1 P 1

Ref: http://www.finchcms.edu/biochem/Walters/nw.html

Now, starting at the C-terminal ends of the sequences and working toward the origins, add to each cell the maximum value from among all the cells downstream from it (not including cells directly below or directly to the right. Let's do a few cycles of this and see how the matrix develops. Start with the last column and last row, adding in the zeros.

A D C N S R Q C L C R P MA 1 0S 1 0C 1 1 1 0S 1 0N 1 0R 1 1 0C 1 1 1 0K 0 C 1 1 1 0R 1 1 0D 1 0P 0 0 0 0 0 0 0 0 0 0 0 1 0

A D C N S R Q C L C R P MA 8 7 6 6 5 4 4 3 3 2 1 0 0S 7 7 6 6 6 4 4 3 3 2 1 0 0C 6 6 7 6 5 4 4 4 3 3 1 0 0S 6 6 6 5 6 4 4 3 3 2 1 0 0N 5 5 5 6 5 4 4 3 3 2 1 0 0R 4 4 4 4 4 5 4 3 3 2 2 0 0C 3 3 4 3 3 3 3 4 3 3 1 0 0K 3 3 3 3 3 3 3 3 3 2 1 0 0 C 2 2 3 2 2 2 2 3 2 3 1 0 0R 2 1 1 1 1 2 1 1 1 1 2 0 0D 1 2 1 1 1 1 1 1 1 1 1 0 0P 0 0 0 0 0 0 0 0 0 0 0 1 0

•ADCNS-RQCLCR-PM•| | | | | || |•ASC-SNR-CKCRDP-

•ADC-NSRQCLCR-PM•| | | | | || |•ASCSN-R-CKCRDP-

http://www.perlmonth.com/columns/modules/Perl/Align/NW/NW.htm

Additional ref: Smith, T.F. and Waterman, M.S. 1981. ``Identification of common molecular subsequences'' Journal of Molecular Biology. 147: 195-197

http://www.perlmonth.com/columns/modules/Perl/Align/NW/NW.htm

FASTA

• FASTA is a program for rapid alignment of pairs of protein and DNA sequences. Rather than comparing individual residues in the two sequences, FASTA instead looks for matching sequence patterns or words, called k-tuples, and then attempts to build a local alignment based upon these word matches.

FASTAIn the initial search for regions of similarity, FASTA uses a

computer method known as hash coding. In this method, a lookup table showing the positions of each sequence word of length k, called a k-tuple, is constructed for each sequence. The relative positions of each word in the two sequences is then calculated by subtracting the position in the first sequence from that in the second. Words that have the

same offset position reveal a region of alignment between the two sequences. The number of comparisons increases linearly in proportion to average sequence length. In contrast, the time taken in dot matrix and dynamic programming methods increases as the square of the average sequence length. The k-tuple length is user-defined and is usually 1 or 2 for protein sequences (i.e. either the positions of each of the individual

20 amino acids or the positions of each of the 400 possible dipeptides are located).

FASTAposition 1 2 3 4 5 6 7 8 9 10 11protein 1 n c s p t a . . . . . protein 2 . . . . . a c s p r k position in offsetamino acid protein A protein B pos A - posB-----------------------------------------------------a 6 6 0c 2 7 -5k - 11n 1 -p 4 9 -5r - 10s 3 8 -5t 5 ------------------------------------------------------Note the common offset for the 3 amino acids c,s and pA possible alignment is thus quickly found -

protein 1 n c s p t a | | | protein 2 a c s p r k

PAM & Blosum

Documents

Transcript of PAM & Blosum