Post on 15-Nov-2014
BIOTOOLS
AND
DATABASES
BY
C.GAYATHRI
(I M.Sc.BIOINFORMATICS)
Needleman Wunsch Algorithm
Smith Waterman Algorithm
Published in 1970 by SAUL NEEDLEMAN and CHRISTIAN WUNSCH
General algorithm for sequence comparison
Commonly used in bioinformatics to align protein or nucleotide sequences
Example of dynamic programming, and was the first application of dynamic programming to biological sequence comparison.
Scores for aligned characters are specified by a SIMILARITY MATRIX. Here, S(i, j) is the similarity of characters i and j. It uses a LINEAR GAP PENALTY, here called d.
Maximizes a similarity score, to give ‘MAXIMUM MATCH’
Maximum match = largest number of residues of one sequence that can be matched with another allowing for all possible deletions
Finds the best GLOBAL alignment of any two sequences
N-W involves an iterative matrix method of calculationAll possible pairs of residues (bases or
amino acids) - one from each sequence - are represented in a 2-dimensional array
All possible alignments (comparisons) are represented by pathways through this array
Three main steps
1. Assign similarity values
2. For each cell, allowing insertions and deletions give the maximum possible scoring value
3. Construct an alignment (pathway) back from the highest scoring cell
Similarity values Numerical value is assigned to
every cell (depending on the similarity/dissimilarity of the two residues)
simple scores or more complicated, (chemical similarities or frequency of observed substitutions)
The example shown here has match = +1 mismatch = 0
M P R C L C Q R J N C B AP 1B 1R 1 1C 1 1 1KC 1 1 1R 1N 1J 1C 1 1 1J 1A 1
Score pathways through array
to know the maximum possible score for an alignment
Searches sub rows and sub columns, for the highest score
Adds this to the score for the current cell
Proceeds row by row through the array
Gap penalty for the introduction of gaps in the alignment = 0
M P R C L C Q R J N C B AP 0 1 0 0 0 0 0 0 0 0 0 0 0B 0 0 1 1 1 1 1 1 1 1 1 2 1R 0 0 2 1 1 1 1 2 1 1 1 1 2C 0 0 1 3 2 3 2 2 2 2 3 2 2K 0 0 1 2 3 3 3 3 3 3 3 3 3C 0 0 1 3 3 4 3 3 3 3 4 3 3R 0 0 2 2 3 3 4 ?NJ 1C 1 1 1J 1A 1
Hij=max{Hi-1, j-1 +s(ai,bj), max{Hi-k,j-1 -Wk +s(ai,bj)}, max{Hi-1, j-l -Wl +s(ai,bj)}}
Construct alignment The alignment score is
cumulative by adding along a path through the array
The best alignment has the highest score i.e. the maximum match
Maximum match = largest number resulting from summing the cell values of every pathway
The maximum match will ALWAYS be somewhere in the outer row or column shown
The alignment is constructed by working backwards from the maximum match
M P R C L C Q R J N C B AP 0 1 0 0 0 0 0 0 0 0 0 0 0B 0 0 1 1 1 1 1 1 1 1 1 2 1R 0 0 2 1 1 1 1 2 1 1 1 1 2C 0 0 1 3 2 3 2 2 2 2 3 2 2K 0 0 1 2 3 3 3 3 3 3 3 3 3C 0 0 1 3 3 4 3 3 3 3 4 3 3R 0 0 2 2 3 3 4 5 4 4 4 4 4N 0 0 1 2 3 3 4 4 5 6 5 5 5J 0 0 1 2 3 3 4 4 6 5 6 6 6C 0 0 1 3 3 4 4 4 5 6 7 6 6J 0 0 1 2 3 3 4 4 6 6 6 7 7A 0 0 1 2 3 3 4 4 5 6 6 7 8
MP –RCLCQR - JNCBA | | | | | | | | -PBRCKC –RNJ - CJA
Statistical Significance
Maximum match is a function of sequence relationship and composition
Useful to know probability of obtaining result (maximum match) from a pair of random sequences
Estimate this experimentally Sequences from random proteins are taken(I.e.
having same composition as the real proteins) if the value for the random proteins is
significantly different from that for the real proteins then the difference is a function of the sequences alone and not of their composition
Proposed by Temple Smith and Michael Waterman in 1981
Smith-Waterman algorithm is useful for performing local sequence alignment
Determining similar regions between two nucleotide or protein sequences
Instead of looking at entire sequence, it compares segments of all possible lengths and optimizes the similarity measure.
For every cell the algorithm calculates ALL possible paths that can be of any length and contain insertions, deletions and gaps
Works effectively, only when gap penalties are used
In example shown match = +1 mismatch = -1/3 gap = -1+1/3k (k=extent
of gap) Start with all cell values =
0 Looks in sub column and
sub row shown and in direct diagonal for a score that is the highest when you take alignment score or gap penalty into account
C A G C C U C G C U U A GA 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0A 0.0 1.0 0.7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.7U 0.0 0.0 0.8 0.3 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.7G 0.0 0.0 1.0 0.3 0.0 0.0 0.7 1.0 0.0 0.0 0.7 0.7 1.0C 1.0 0.0 0.0 2.0 1.3 0.3 1.0 0.3 2.0 0.7 0.3 0.3 0.3C 1.0 0.7 0.0 1.0 3.0 1.7 ?AUUGACGG
Hij=max{Hi-1, j-1 +s(ai,bj), max{Hi-k,j -Wk}, max{Hi, j-l -Wl}, 0}
Four possible ways of forming a path
For every residue in the query sequence
1. To align with next residue, score =previous score +similarity score2. Deletion (i.e. match residue of query with a gap), score =previous score - gap penalty dependent on size of
the gap Insertion (i.e. match residue of db sequence with a gap, score =previous score - gap penalty dependent on size of
the gap4. Stop when the score is zero
Choose whichever of these which has the highest score
Construct Alignment The score in each cell is
the maximum possible score for an alignment of ANY LENGTH ending at those coordinates
Trace pathway back from highest scoring cell
This cell can be anywhere in the array
Align highest scoring segment
C A G C C U C G C U U A GA 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0A 0.0 1.0 0.7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.7U 0.0 0.0 0.8 0.3 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.7G 0.0 0.0 1.0 0.3 0.0 0.0 0.7 1.0 0.0 0.0 0.7 0.7 1.0C 1.0 0.0 0.0 2.0 1.3 0.3 1.0 0.3 2.0 0.7 0.3 0.3 0.3C 1.0 0.7 0.0 1.0 3.0 1.7 1.3 1.0 1.3 1.7 0.3 0.0 0.0A 0.0 2.0 0.7 0.3 1.7 2.7 1.3 1.0 0.7 1.0 1.3 1.3 0.0U 0.0 0.7 1.7 0.3 1.3 2.7 2.3 1.0 0.7 1.7 2.0 1.0 1.0U 0.0 0.3 0.3 1.3 1.0 2.3 2.3 2.0 0.7 1.7 2.7 1.7 1.0G 0.0 0.0 1.3 0.0 1.0 1.0 2.0 3.3 2.0 1.7 1.3 2.3 2.7A 0.0 1.0 0.0 1.0 0.3 0.7 0.7 2.0 3.0 1.7 1.3 2.3 2.0C 1.0 0.0 0.7 1.0 2.0 0.7 1.7 1.7 3.0 2.7 1.3 1.0 2.0G 0.0 0.7 1.0 0.3 0.7 1.7 0.3 2.7 1.7 2.7 2.3 1.0 2.0G 0.0 0.0 1.7 0.7 0.3 0.3 1.3 1.3 2.3 1.3 2.3 2.0 2.0
GCC-UCGGCCAUUG
Needleman-Wunsch
1. Global alignments
2. Requires alignment score for a pair of residues to be >=0
3. No gap penalty required
4. Score cannot decrease between two cells of a pathway
5. Trace back is mostly from the last cell that has the highest score
Smith-Waterman
1. Local alignments
2. Residue alignment score may be positive or negative
3. Requires a gap penalty to work effectively
4. Score can increase, decrease or stay level between two cells of a pathway
5. Trace back is from the cell that has the highest score
CONCLUSION
Hence from calculating and working many times on these algorithms considering different organisms, it is found that NW and SW algorithms are excellent methods for finding the similarity and dissimilarity between the different organisms