CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.
CS 5263 Bioinformatics
Lecture 5: Local Sequence Alignment Algorithms
Poll
• Who has learned, and still remembers, finite state machines/automata, regular grammars, and context-free grammars?
Roadmap
• Review of last lecture
• Local Sequence Alignment
• Statistics of sequence alignment
  – Substitution matrix
  – Significance of alignment
Bounded Dynamic Programming
• O(kM) time
• O(kM) memory
  – Possibly O(M+k)
[Figure: DP matrix with x1 … xM along one axis and y1 … yN along the other; only a band of width k around the main diagonal is filled in]
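A minimal sketch of the bounded DP idea, not taken from the slides: only cells with |i - j| <= k are stored and filled. The function name `banded_global_score`, the dictionary-based storage, and the default scores (match 1, mismatch -1, gap -1) are assumptions made for illustration.

```python
NEG_INF = float("-inf")

def banded_global_score(x, y, k, match=1, mismatch=-1, gap=-1):
    """Global alignment score restricted to the band |i - j| <= k."""
    m, n = len(x), len(y)
    if abs(m - n) > k:
        raise ValueError("band too narrow to reach the final cell")
    F = {(0, 0): 0}  # only cells inside the band are stored
    for i in range(m + 1):
        for j in range(max(0, i - k), min(n, i + k) + 1):
            if i == 0 and j == 0:
                continue
            best = NEG_INF
            if i > 0 and j > 0:
                s = match if x[i - 1] == y[j - 1] else mismatch
                best = max(best, F.get((i - 1, j - 1), NEG_INF) + s)
            if i > 0:  # x[i-1] aligned to a gap
                best = max(best, F.get((i - 1, j), NEG_INF) + gap)
            if j > 0:  # y[j-1] aligned to a gap
                best = max(best, F.get((i, j - 1), NEG_INF) + gap)
            F[(i, j)] = best
    return F[(m, n)]

print(banded_global_score("AGTA", "ATA", k=1))
```

Cells outside the band are treated as unreachable, so the result is optimal only among alignments that stay within k of the diagonal.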
Linear-space alignment
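[Figure: the DP matrix is split at column M/2; the optimal path crosses that column at row k*, leaving two sub-problems of heights k* and N-k*]
• O(M+N) memory
• 2MN time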
Graph representation of seq alignment
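       A   G   T   A
   0  -1  -2  -3  -4
A -1   1   0  -1  -2
T -2   0   0   1   0
A -3  -1  -1   0   2
Nodes run from (0,0) at the top left to (3,4) at the bottom right. Diagonal edges have weight 1 (match) or -1 (mismatch); horizontal and vertical edges have weight -1 (gap).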
An optimal alignment is a longest path from (0, 0) to (m,n) on the alignment graph
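The table for AGTA vs. ATA can be reproduced with a short sketch. It assumes match = 1, mismatch = -1, gap = -1, which are the edge weights that appear in the graph; the function name `nw_table` is made up here.

```python
def nw_table(x, y, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score table; columns follow x, rows follow y."""
    F = [[0] * (len(x) + 1) for _ in range(len(y) + 1)]
    for j in range(1, len(x) + 1):
        F[0][j] = j * gap
    for i in range(1, len(y) + 1):
        F[i][0] = i * gap
    for i in range(1, len(y) + 1):
        for j in range(1, len(x) + 1):
            s = match if y[i - 1] == x[j - 1] else mismatch
            # Longest path into node (i, j): diagonal, vertical, or horizontal edge
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
    return F

for row in nw_table("AGTA", "ATA"):
    print(row)
```

The bottom-right entry is the weight of the longest path from (0,0) to the final node, i.e. the optimal global alignment score.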
Question
• If I change the scoring scheme, will it change the alignment?
  – From: match = 1, mismatch = gap = -2
  – To: match = 2, mismatch = gap = -1
• Answer: Yes
Proof
• Let F1 be the score of an optimal alignment under the scoring scheme
  – Match = m > 0
  – Mismatch = s < 0
  – Gap = d < 0
• Let a1, b1, c1 be the number of matches, mismatches, and gaps in the alignment
• F1 = a1m + b1s + c1d
Proof (cont’)
• Let F2 be the score of a sub-optimal alignment under the same scoring scheme
• Let a2, b2, c2 be the number of matches, mismatches, and gaps in the alignment
• F2 = a2m + b2s + c2d
• Let F1 = F2 + k, where k > 0
Proof (cont’)
• Now we change the scoring scheme, so that
  – Match = m + 1
  – Mismatch = s + 1
  – Gap = d + 1
Proof (cont’)
• The new scores for the two alignments become:
F1’ = a1 * (m+1) + b1 * (s+1) + c1 * (d+1)
    = a1m + b1s + c1d + (a1+b1+c1)
    = F1 + (a1+b1+c1)
    = F1 + L1
F2’ = a2 * (m+1) + b2 * (s+1) + c2 * (d+1)
    = F2 + (a2+b2+c2)
    = F2 + L2
where L1 = a1+b1+c1 is the length of alignment 1 and L2 = a2+b2+c2 is the length of alignment 2.
Proof (cont’)
• F1’ – F2’ = F1 – F2 + (a1+b1+c1) – (a2+b2+c2)
            = k + (a1+b1+c1) – (a2+b2+c2)
            = k + L1 – L2
  where L1 and L2 are the lengths of alignment 1 and alignment 2.
• In order for F1’ < F2’, we need to have: k + L1 – L2 < 0, i.e. L2 – L1 > k
Proof (cont’)
• This means, if under the original scoring scheme, F1 is greater than F2 by k, but the length of alignment 2 is at least (k+1) greater than that of alignment 1, F2’ will be greater than F1’ under the new scoring scheme.
• We only need one example showing that two such alignments can exist
[Figure: two paths on the alignment graph. Alignment 1 uses 2 matches (m) and 3 mismatches (s); alignment 2 uses 3 matches (m) and 4 gaps (d).]
F1 = 2m + 3s
F2 = 3m + 4d

With m = 1, s = d = –2:
F1 = 2 – 6 = –4
F2 = 3 – 8 = –5
F1 > F2

With m = 2, s = d = –1:
F1’ = 4 – 3 = 1
F2’ = 6 – 4 = 2
F2’ > F1’

Concrete sequences: AACAG and ATCGT

Alignment 1:
AACAG
| |
ATCGT

Alignment 2:
AA-CAG-
 | | |
-ATC-GT

F1 = 2x1 – 3x2 = –4, F1’ = 2x2 – 3x1 = 1
F2 = 3x1 – 4x2 = –5, F2’ = 3x2 – 4x1 = 2
• On the other hand, if we had doubled our scores, such that
  – m’ = 2m
  – s’ = 2s
  – d’ = 2d
• F1’ = 2F1
• F2’ = 2F2
• Our alignment won’t be changed
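The shift-versus-scale argument can be checked by rescoring the two example alignments of AACAG and ATCGT. This is a sketch; the helper name `score` and the variable names are made up for illustration.

```python
def score(aligned_x, aligned_y, m, s, d):
    """Score a gapped alignment column by column: match m, mismatch s, gap d."""
    total = 0
    for a, b in zip(aligned_x, aligned_y):
        if a == "-" or b == "-":
            total += d
        elif a == b:
            total += m
        else:
            total += s
    return total

aln1 = ("AACAG", "ATCGT")      # 2 matches, 3 mismatches
aln2 = ("AA-CAG-", "-ATC-GT")  # 3 matches, 4 gaps

for label, (m, s, d) in [("original", (1, -2, -2)),
                         ("shifted", (2, -1, -1)),
                         ("doubled", (2, -4, -4))]:
    print(label, score(*aln1, m, s, d), score(*aln2, m, s, d))
```

Adding a constant to every score rewards the longer alignment, so the ranking flips; multiplying every score by a positive constant scales both totals equally and keeps the ranking.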
Today
• How to model gaps more accurately?
• Local sequence alignment
• Statistics of alignment
What’s a better alignment?
GACGCCGAACG
|||||   |||
GACGC---ACG
Score = 8 x m – 3 x d

GACGCCGAACG
|||| | | ||
GACG-C-A-CG
Score = 8 x m – 3 x d

However, gaps usually occur in bunches.
- During evolution, chunks of DNA may be lost entirely
- Aligning genomic sequence vs. cDNA (reverse complementary to mRNA)
Model gaps more accurately
• Current model:
  – Gap of length n incurs penalty nd
• General:
  – Convex function
  – E.g. γ(n) = c * sqrt(n)
[Figure: gap penalty γ(n) plotted against gap length n for the two models]
General gap dynamic programming
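Initialization: same
Iteration:
\[
F(i, j) = \max \begin{cases}
F(i-1, j-1) + \sigma(x_i, y_j) \\
\max_{k=0 \ldots i-1} F(k, j) - \gamma(i-k) \\
\max_{k=0 \ldots j-1} F(i, k) - \gamma(j-k)
\end{cases}
\]
where σ(xi, yj) is the substitution score and γ(n) is the penalty for a gap of length n.
Termination: same
Running time: O(N^2 M) (cubic)
Space: O(NM) (linear-space algorithm not applicable)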
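A sketch of the general-gap recurrence, assuming the gap penalty is passed in as a Python function `gamma(n)` and that "same" initialization means F(i, 0) = -gamma(i) and F(0, j) = -gamma(j). The function name and the default match/mismatch scores are illustrative choices, not values from the slides.

```python
import math

def general_gap_score(x, y, gamma, match=1, mismatch=-1):
    """Global alignment score with an arbitrary gap penalty gamma(n); cubic time."""
    m, n = len(x), len(y)
    F = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = -gamma(i)
    for j in range(1, n + 1):
        F[0][j] = -gamma(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            best = F[i - 1][j - 1] + s
            for k in range(i):  # gap of length i - k ending at x_i
                best = max(best, F[k][j] - gamma(i - k))
            for k in range(j):  # gap of length j - k ending at y_j
                best = max(best, F[i][k] - gamma(j - k))
            F[i][j] = best
    return F[m][n]

# With a linear penalty this reduces to the basic algorithm
print(general_gap_score("AGTA", "ATA", lambda n: n))
# A penalty of the form c * sqrt(n), here with c = 2
print(general_gap_score("GACGCCGAACG", "GACGCACG", lambda n: 2 * math.sqrt(n), match=2, mismatch=-3))
```

The two inner loops over k are what make the running time cubic, and they are also why each cell needs whole rows and columns rather than just its three neighbors.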
Compromise: affine gaps
γ(n) = d + (n – 1)e
  – d: gap open penalty
  – e: gap extension penalty
[Figure: plot of γ(n) against n, annotated with d and e]

Match: 2
Gap open: 5
Gap extension: 1

GACGCCGAACG
|||||   |||
GACGC---ACG
8x2-5-2 = 9

GACGCCGAACG
|||| | | ||
GACG-C-A-CG
8x2-3x5 = 1
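The two totals can be reproduced by scoring each gapped alignment column by column. This is a sketch: the function name is made up, and the mismatch score is an assumption (the example contains no mismatches, so its value does not affect the result).

```python
def affine_alignment_score(a, b, match=2, mismatch=-1, gap_open=5, gap_extend=1):
    """Score a given gapped alignment under gamma(n) = gap_open + (n - 1) * gap_extend."""
    score = 0
    prev_gap = None  # which row held the gap in the previous column: "a", "b", or None
    for ca, cb in zip(a, b):
        if ca == "-" or cb == "-":
            cur = "a" if ca == "-" else "b"
            # Continuing a gap in the same row is an extension; otherwise a new gap opens
            score -= gap_extend if cur == prev_gap else gap_open
            prev_gap = cur
        else:
            score += match if ca == cb else mismatch
            prev_gap = None
    return score

print(affine_alignment_score("GACGCCGAACG", "GACGC---ACG"))
print(affine_alignment_score("GACGCCGAACG", "GACG-C-A-CG"))
```

Both alignments have 8 matches and 3 gap positions, but the affine penalty charges the single run of three far less than three separate openings.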
Additional states
• The amount of state needed increases
  – In scoring a single entry in our matrix, we need to remember an extra piece of information
    • Are we continuing a gap in x? (if no, start is more expensive)
    • Are we continuing a gap in y? (if no, start is more expensive)
    • Are we continuing from a match between xi and yj?
Finite State Automaton
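[Figure: finite state automaton with three states]
• Xi and Yj aligned
• Xi aligned to a gap
• Yj aligned to a gap
• Transitions into a gap state are labeled d; transitions that stay in a gap state are labeled e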
Dynamic programming
• We encode this information in three different matrices
• For each element (i,j) we use three variables
  – F(i,j): best alignment of x1..xi & y1..yj if xi aligns to yj
  – Ix(i,j): best alignment of x1..xi & y1..yj if yj aligns to gap
  – Iy(i,j): best alignment of x1..xi & y1..yj if xi aligns to gap
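\[
F(i, j) = \sigma(x_i, y_j) + \max \begin{cases}
F(i-1, j-1) & \text{continuing alignment} \\
I_x(i-1, j-1) & \text{closing gaps in } x \\
I_y(i-1, j-1) & \text{closing gaps in } y
\end{cases}
\]
\[
I_x(i, j) = \max \begin{cases}
F(i, j-1) - d & \text{opening a gap in } x \\
I_y(i, j-1) - d \\
I_x(i, j-1) - e & \text{gap extension in } x
\end{cases}
\]
\[
I_y(i, j) = \max \begin{cases}
F(i-1, j) - d & \text{opening a gap in } y \\
I_x(i-1, j) - d \\
I_y(i-1, j) - e & \text{gap extension in } y
\end{cases}
\]
[Figure: dependencies among the F, Ix, and Iy matrices]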
• If we stack all three matrices
  – No cyclic dependency
  – We can fill in all three matrices in order
Algorithm
• for i = 1:M
  – for j = 1:N
    • Fill in F(i, j), Ix(i, j), Iy(i, j)
  – end
• end
• F(M, N) = max (F(M, N), Ix(M, N), Iy(M, N))
• Time: O(MN)
• Space: O(MN), or O(N) when combined with the linear-space algorithm
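The loop above can be sketched in code with three matrices F, Ix, and Iy. The boundary values (leading gaps charged with the affine penalty, everything else unreachable) and the mismatch score are assumptions made for this illustration; the slides do not spell them out.

```python
NEG_INF = float("-inf")

def affine_global_score(x, y, match=2, mismatch=-3, d=5, e=1):
    """Global alignment score with gap open penalty d and gap extension penalty e."""
    m, n = len(x), len(y)
    F = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # x_i aligned to y_j
    Ix = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # y_j aligned to a gap
    Iy = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # x_i aligned to a gap
    F[0][0] = 0
    for j in range(1, n + 1):
        Ix[0][j] = -d - (j - 1) * e
    for i in range(1, m + 1):
        Iy[i][0] = -d - (i - 1) * e
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = s + max(F[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1])
            Ix[i][j] = max(F[i][j - 1] - d, Iy[i][j - 1] - d, Ix[i][j - 1] - e)
            Iy[i][j] = max(F[i - 1][j] - d, Ix[i - 1][j] - d, Iy[i - 1][j] - e)
    return max(F[m][n], Ix[m][n], Iy[m][n])

print(affine_global_score("GACGCCGAACG", "GACGCACG"))
```

Each cell reads only cells from the previous row or earlier in the same row, which is why the three matrices can be filled together in a single pass.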
To simplify
\[
F(i, j) = \max \begin{cases}
F(i-1, j-1) + \sigma(x_i, y_j) \\
I(i-1, j-1) + \sigma(x_i, y_j)
\end{cases}
\]
\[
I(i, j) = \max \begin{cases}
F(i, j-1) - d \\
I(i, j-1) - e \\
F(i-1, j) - d \\
I(i-1, j) - e
\end{cases}
\]
I(i, j): best alignment between x1…xi and y1…yj if either xi or yj is aligned to a gap
This is possible because alternating gaps are not allowed
To summarize
• Global alignment
  – Basic algorithm: Needleman-Wunsch
  – Variants:
    • Overlapping detection
    • Longest common subsequences
    • Achieved by varying initial conditions or scoring
  – Bounded DP (pruning search space)
  – Linear space (divide-and-conquer)
  – Affine gap penalty
Local alignment
The local alignment problem
Given two strings X = x1……xM,
Y = y1……yN
Find substrings x’, y’ whose similarity (optimal global alignment value) is maximum
e.g. X = abcxdex, X’ = cxde
     Y = xxxcde,  Y’ = c-de
Why local alignment
• Conserved regions may be a small part of the whole
  – “Active site” of a protein
  – Scattered genes or exons among “junks”
  – Don’t have whole sequence
• Global alignment might miss them if flanking “junk” outweighs similar regions
• Genes are shuffled between genomes
[Figure: the blocks A, B, C, D appear in a different order in the two genomes]
Naïve algorithm
for all substrings X’ of X and Y’ of Y
    Align X’ & Y’ via dynamic programming
    Retain pair with max value
end
Output the retained pair
• Time: O(n^2) choices for X’, O(m^2) for Y’, O(nm) for DP, so O(n^3 m^3) total.
Reminder
• The overlap detection algorithm
  – We do not give penalty to gaps in the ends
[Figure: two overlapping sequences; the overhanging ends are labeled "Free gap"]
Similar here
• We are free of penalty for the unaligned regions
The big idea
• Whenever we get to some bad region (negative score), we ignore the previous alignment
  – Reset score to zero
The Smith-Waterman algorithm
Initialization: F(0, j) = F(i, 0) = 0
Iteration:
\[
F(i, j) = \max \begin{cases}
0 \\
F(i-1, j) - d \\
F(i, j-1) - d \\
F(i-1, j-1) + \sigma(x_i, y_j)
\end{cases}
\]
The Smith-Waterman algorithm
Termination:
1. If we want the best local alignment: FOPT = max over all i, j of F(i, j)
2. If we want all local alignments scoring > t: for all i, j with F(i, j) > t, trace back
• The correctness of the algorithm can be proved by induction using the alignment graph
[Figure: alignment graph fragment with edge weights -10, 100, 0, 0]
Example: X = abcxdex (rows), Y = xxxcde (columns)
Match: 2
Mismatch: -1
Gap: -1
The first row and column are initialized to 0, and the matrix is then filled in row by row. Rows a and b contain no matches and stay at 0.
     x  x  x  c  d  e
  0  0  0  0  0  0  0
a 0  0  0  0  0  0  0
b 0  0  0  0  0  0  0
c 0  0  0  0  2  1  0
x 0  2  2  2  1  1  0
d 0  1  1  1  1  3  2
e 0  0  0  0  0  2  5
x 0  2  2  2  1  1  4
Trace back
Starting from the highest-scoring cell (score 5, row e, column e) and following the maximizing choices back to a 0 gives two optimal local alignments:
cxde
| ||
c-de

x-de
| ||
xcde
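The worked example can be reproduced with a short Smith-Waterman sketch using match = 2, mismatch = -1, gap = -1. The function name and the tie-breaking order in the traceback (diagonal, then up, then left) are choices made for this illustration, so it reports only one of the two optimal alignments.

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap=-1):
    """Return (best score, aligned x', aligned y', score matrix); rows follow x."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0, F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
            if F[i][j] > best:
                best, bi, bj = F[i][j], i, j
    # Trace back from the best cell until a zero is reached
    ax, ay = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay)), F

best, ax, ay, F = smith_waterman("abcxdex", "xxxcde")
print(best)
print(ax)
print(ay)
```

Preferring the left move over the up move at the tie would yield the other alignment, x-de over xcde, with the same score.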
• No negative values in local alignment DP array
• Optimal local alignment will never have a gap on either end
• Local alignment: “Smith-Waterman”
• Global alignment: “Needleman-Wunsch”
Analysis
• Time:
  – O(MN) for finding the best alignment
  – Depending on the number of sub-optimal alignments
• Memory:
  – O(MN)
  – O(M+N) possible
The statistics of alignment
Where does σ(xi, yj) come from?
Are two aligned sequences actually related?
Probabilistic model of alignments
• We’ll focus on protein alignments without gaps
• Given an alignment, we can consider two possibilities
  – R: the sequences are related by evolution
  – U: the sequences are unrelated
• How can we distinguish these possibilities?
• How is this view related to the amino-acid substitution matrix?
Model for unrelated sequences
• Assume each position of the alignment is independently sampled from some distribution of amino acids
• ps: probability of amino acid s in the sequences
• Probability of seeing an amino acid s aligned to an amino acid t by chance is
  – Pr(s, t | U) = ps * pt
• Probability of seeing an ungapped alignment between x = x1…xn and y = y1…yn randomly is
\[
\Pr(x, y \mid U) = \prod_{i=1}^{n} p_{x_i} \prod_{i=1}^{n} p_{y_i}
\]
Model for related sequences
• Assume each pair of aligned amino acids evolved from a common ancestor
• Let qst be the probability that amino acid s in one sequence is related to t in another sequence
• The probability of an alignment of x and y is given by
\[
\Pr(x, y \mid R) = \prod_{i=1}^{n} q_{x_i y_i}
\]
Probabilistic model of Alignments
• How can we decide which possibility (U or R) is more likely?
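• One principled way is to consider the relative likelihood of the two possibilities (the odds ratio)
\[
\frac{\Pr(x, y \mid R)}{\Pr(x, y \mid U)} = \frac{\prod_{i} q_{x_i y_i}}{\prod_{i} p_{x_i} \prod_{i} p_{y_i}} = \prod_{i=1}^{n} \frac{q_{x_i y_i}}{p_{x_i} p_{y_i}}
\]
  – A higher ratio means that R is more likely than U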
Log odds ratio
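• Taking the log, we get
\[
\log \frac{\Pr(x, y \mid R)}{\Pr(x, y \mid U)} = \sum_{i=1}^{n} \log \frac{q_{x_i y_i}}{p_{x_i} p_{y_i}}
\]
• Recall that the score of an alignment is given by
\[
S = \sum_{i=1}^{n} \sigma(x_i, y_i)
\]
• Therefore, if we define
\[
\sigma(s, t) = \log \frac{q_{st}}{p_s p_t}
\]
• We are actually defining the alignment score as the log odds ratio (log likelihood ratio) between the two models R and U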
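A small numerical check of this identity: with σ(s, t) = log(q_st / (p_s p_t)), the sum of per-column scores equals the log of Pr(x, y | R) / Pr(x, y | U). The two-letter alphabet and all probabilities below are hypothetical values chosen only for illustration.

```python
import math

# Hypothetical background frequencies and pair probabilities (not real data)
p = {"a": 0.6, "b": 0.4}
q = {("a", "a"): 0.5, ("a", "b"): 0.1, ("b", "a"): 0.1, ("b", "b"): 0.3}

def sigma(s, t):
    """Log-odds substitution score for aligning s with t."""
    return math.log(q[(s, t)] / (p[s] * p[t]))

def alignment_score(x, y):
    """Sum of per-column substitution scores for an ungapped alignment."""
    return sum(sigma(s, t) for s, t in zip(x, y))

def log_odds(x, y):
    """log of Pr(x, y | R) / Pr(x, y | U), computed directly from the two models."""
    pr_related = math.prod(q[(s, t)] for s, t in zip(x, y))
    pr_unrelated = math.prod(p[s] for s in x) * math.prod(p[t] for t in y)
    return math.log(pr_related / pr_unrelated)

x, y = "aabba", "abbba"
print(alignment_score(x, y), log_odds(x, y))
```

The two printed numbers agree up to floating-point rounding, which is the sense in which an additive alignment score is a log odds ratio.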
This is indeed how biologists have defined the substitution matrices for proteins
• ps can be counted from the available protein sequences
• But how do we get qst? (the probability that s and t have a common ancestor)
• Counted from trusted alignments of related sequences
Protein Substitution Matrices
• Two popular sets of matrices for protein sequences
  – PAM matrices [Dayhoff et al, 1978]
    • Better for aligning closely related sequences
  – BLOSUM matrices [Henikoff & Henikoff, 1992]
    • For both closely or remotely related sequences
Positive for chemically similar substitution
Common amino acids get low weights
Rare amino acids get high weights
BLOSUM-N matrices
• Constructed from a database called BLOCKS
• Contains many closely related sequences
  – Conserved amino acids may be over-counted
• N = 62: the probabilities qst were computed using trusted alignments with no more than 62% identity
  – identity: % of matched columns
• Using this matrix, the Smith-Waterman algorithm is most effective in detecting real alignments with a similar identity level (i.e. ~62%)
• If you want to detect homologous genes with high identity, you may want a BLOSUM matrix with higher N, say BLOSUM75
• On the other hand, if you want to detect remote homology, you may want to use lower N, say BLOSUM50
• BLOSUM62 is the standard
For DNAs
• No database of trusted alignments to start with
• Specify the percentage identity you would like to detect
• You can then get the substitution matrix by some calculation
For example
• Suppose pA = pC = pT = pG = 0.25
• We want 88% identity
• qAA = qCC = qTT = qGG = 0.22, the rest = 0.12/12 = 0.01
σ(A, A) = σ(C, C) = σ(G, G) = σ(T, T) = log (0.22 / (0.25*0.25)) = 1.26
σ(s, t) = log (0.01 / (0.25*0.25)) = -1.83 for s ≠ t.
Substitution matrix
A C G T
A 1.26 -1.83 -1.83 -1.83
C -1.83 1.26 -1.83 -1.83
G -1.83 -1.83 1.26 -1.83
T -1.83 -1.83 -1.83 1.26
• Scale won’t change the alignment
• Multiply by 4 and then round off to get integers
A C G T
A 5 -7 -7 -7
C -7 5 -7 -7
G -7 -7 5 -7
T -7 -7 -7 5
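The calculation for the 88% identity example can be sketched as follows. It assumes uniform base frequencies of 0.25, equal probability for each of the 12 mismatched pairs, and the natural logarithm (which is what reproduces 1.26 and -1.83); the function name is made up.

```python
import math

def dna_substitution_matrix(identity, scale=1.0, p=0.25):
    """Log-odds scores for a DNA model with the given target identity."""
    bases = "ACGT"
    q_match = identity / 4             # four matched pairs share the identity mass
    q_mismatch = (1 - identity) / 12   # twelve mismatched pairs share the rest
    return {
        (s, t): scale * math.log((q_match if s == t else q_mismatch) / (p * p))
        for s in bases for t in bases
    }

raw = dna_substitution_matrix(0.88)
print(round(raw[("A", "A")], 2), round(raw[("A", "C")], 2))

scaled = dna_substitution_matrix(0.88, scale=4)
print(round(scaled[("A", "A")]), round(scaled[("A", "C")]))
```

Changing `identity` gives the matrix tuned to a different divergence level; a lower target identity makes mismatches less costly relative to matches.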
Arbitrary substitution matrix
• Say you have a substitution matrix provided by someone
• It’s important to know what you are actually looking for when you use the matrix
• What’s the difference?
• Which one should I use?
A C G T
A 1 -2 -2 -2
C -2 1 -2 -2
G -2 -2 1 -2
T -2 -2 -2 1
A C G T
A 5 -4 -4 -4
C -4 5 -4 -4
G -4 -4 5 -4
T -4 -4 -4 5
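• We had
\[
\sigma(s, t) = \log \frac{q_{st}}{p_s p_t}
\]
• Scale it, so that
\[
\lambda \, \sigma(s, t) = \ln \frac{q_{st}}{p_s p_t}
\]
• Reorganize:
\[
q_{st} = p_s p_t e^{\lambda \sigma(s, t)}
\]
• Since all probabilities must sum to 1,
\[
\sum_{s, t} q_{st} = 1
\]
• We have
\[
\sum_{s, t} p_s p_t e^{\lambda \sigma(s, t)} = 1
\]
• Suppose again ps = 0.25 for any s
• We know σ(s, t) from the substitution matrix
• We can solve the equation for λ
• Plug λ into q_st = p_s p_t e^{λσ(s, t)} to get qst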
Matrix 1:
   A  C  G  T
A  1 -2 -2 -2
C -2  1 -2 -2
G -2 -2  1 -2
T -2 -2 -2  1
λ = 1.33
qst = 0.24 for s = t, and 0.004 for s ≠ t
Translate: 95% identity

Matrix 2:
   A  C  G  T
A  5 -4 -4 -4
C -4  5 -4 -4
G -4 -4  5 -4
T -4 -4 -4  5
λ ≈ 0.19 (e^λ ≈ 1.21)
qst = 0.16 for s = t, and 0.03 for s ≠ t
Translate: 65% identity
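The values of λ and the implied identity can be recovered numerically by solving sum over s, t of p_s p_t e^{λσ(s, t)} = 1 for its positive root. The sketch below assumes uniform base frequencies of 0.25 and a matrix with a single match score and a single mismatch score; the function names and the bisection bounds are illustrative choices.

```python
import math

def solve_lambda(match, mismatch, p=0.25, n_letters=4):
    """Positive root of sum_{s,t} p_s p_t exp(lambda * sigma(s,t)) = 1, by bisection."""
    def total(lam):
        same = n_letters * p * p * math.exp(lam * match)
        diff = n_letters * (n_letters - 1) * p * p * math.exp(lam * mismatch)
        return same + diff - 1.0
    # lambda = 0 is always a trivial root, so the search starts just above it
    lo, hi = 1e-6, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if total(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

def implied_identity(match, mismatch, p=0.25, n_letters=4):
    """Return (lambda, expected fraction of matched columns) for the matrix."""
    lam = solve_lambda(match, mismatch, p, n_letters)
    q_same = p * p * math.exp(lam * match)
    return lam, n_letters * q_same

for match, mismatch in [(1, -2), (5, -4)]:
    lam, ident = implied_identity(match, mismatch)
    print(match, mismatch, round(lam, 2), round(ident, 2))
```

Bisection works here because the expected score under the background model is negative for both matrices, so the left-hand side dips below 1 just above λ = 0 and crosses 1 again exactly once.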