CS 5263 Bioinformatics


Lecture 5: Local Sequence Alignment Algorithms

Poll

• Who has learned, and still remembers, finite state machines/automata, regular grammars, and context-free grammars?

Roadmap

• Review of last lecture

• Local Sequence Alignment

• Statistics of sequence alignment
– Substitution matrix
– Significance of alignment

Bounded Dynamic Programming

• O(kM) time

• O(kM) memory
– Possibly O(M+k)

[Figure: DP matrix with x1 … xM along one axis and y1 … yN along the other; only a band of width k around the main diagonal is filled in]

Linear-space alignment

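[Figure: DP matrix split at column M/2; the optimal path crosses that column at row k*, leaving two sub-problems of size M/2 by k* and M/2 by (N-k*)]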

• O(M+N) memory

• 2MN time

Graph representation of seq alignment

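          A    G    T    A
     0   -1   -2   -3   -4
A   -1    1    0   -1   -2
T   -2    0    0    1    0
A   -3   -1   -1    0    2

The top-left cell is node (0,0) and the bottom-right cell is node (3,4).

[Figure: alignment graph over this table; diagonal edges are weighted 1 (match) or -1 (mismatch), horizontal and vertical edges are weighted -1 (gap)]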

An optimal alignment is a longest path from (0, 0) to (m,n) on the alignment graph
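The table above can be reproduced with a short dynamic-programming sketch. This Python is only an illustration, not the lecture's own code. It assumes match = 1, mismatch = -1, and gap = -1 (values inferred from the table entries, not stated on the slide), and the function name `global_alignment_table` is made up for this example.

```python
def global_alignment_table(x, y, match=1, mismatch=-1, gap=-1):
    """Fill a Needleman-Wunsch score table; rows follow y, columns follow x."""
    rows, cols = len(y) + 1, len(x) + 1
    F = [[0] * cols for _ in range(rows)]
    # First row and column: a prefix aligned entirely against gaps.
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        F[i][0] = i * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if y[i - 1] == x[j - 1] else mismatch
            # Each cell keeps the best of the three incoming graph edges.
            F[i][j] = max(F[i - 1][j - 1] + s,   # diagonal edge
                          F[i - 1][j] + gap,     # vertical edge
                          F[i][j - 1] + gap)     # horizontal edge
    return F


table = global_alignment_table("AGTA", "ATA")
for row in table:
    print(row)
```

The bottom-right entry is the weight of the longest path from (0,0) to the last node, which is the optimal global alignment score.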

Question

• If I change the scoring scheme, will it change the alignment?
– From: match = 1, mismatch = gap = -2
– To: match = 2, mismatch = gap = -1

• Answer: Yes

Proof

• Let F1 be the score of an optimal alignment under the scoring scheme
– Match = m > 0
– Mismatch = s < 0
– Gap = d < 0

• Let a1, b1, c1 be the number of matches, mismatches, and gaps in the alignment

• F1 = a1m + b1s + c1d

Proof (cont’)

• Let F2 be the score of a sub-optimal alignment under the same scoring scheme

• Let a2, b2, c2 be the number of matches, mismatches, and gaps in the alignment

• F2 = a2m + b2s + c2d

• Let F1 = F2 + k, where k > 0

Proof (cont’)

• Now we change the scoring scheme, so that
– Match = m + 1
– Mismatch = s + 1
– Gap = d + 1

Proof (cont’)

• The new scores for the two alignments become:

F1’ = a1 * (m+1) + b1 * (s+1) + c1 * (d+1)
= a1m + b1s + c1d + (a1+b1+c1)
= F1 + (a1+b1+c1)
= F1 + L1, where L1 is the length of alignment 1

F2’ = a2 * (m+1) + b2 * (s+1) + c2 * (d+1)
= F2 + (a2+b2+c2)
= F2 + L2, where L2 is the length of alignment 2

Proof (cont’)

• F1’ – F2’ = F1 – F2 + (a1+b1+c1) – (a2+b2+c2)
= k + (a1+b1+c1) – (a2+b2+c2)
= k + L1 – L2, where L1 and L2 are the lengths of alignments 1 and 2

In order for F1’ < F2’, we need to have:

k + L1 – L2 < 0, i.e. L2 – L1 > k

Proof (cont’)

• This means, if under the original scoring scheme, F1 is greater than F2 by k, but the length of alignment 2 is at least (k+1) greater than that of alignment 1, F2’ will be greater than F1’ under the new scoring scheme.

• We only need one example to show that two such alignments can exist

[Figure: two alignments drawn as paths; alignment 1 has 2 matches (m) and 3 mismatches (s), alignment 2 has 3 matches (m) and 4 gaps (d)]

F1 = 2m + 3s
F2 = 3m + 4d

m = 1, s = d = –2

F1 = 2 – 6 = –4

F2 = 3 – 8 = –5

F1 > F2

The same two alignments under the new scheme:

F1 = 2m + 3s
F2 = 3m + 4d

m = 2, s = d = –1

F1’ = 4 – 3 = 1

F2’ = 6 – 4 = 2

F2’ > F1’

A concrete pair of sequences, AACAG and ATCGT:

Alignment 1 (2 matches, 3 mismatches):

AACAG
| |
ATCGT

Alignment 2 (3 matches, 4 gaps):

AA-CAG-
 | | |
-ATC-GT

F1 = 2x1 – 3x2 = -4, F1’ = 2x2 – 3x1 = 1

F2 = 3x1 – 4x2 = -5, F2’ = 3x2 – 4x1 = 2

• On the other hand, if we had doubled our scores, such that
– m’ = 2m
– s’ = 2s
– d’ = 2d

• F1’ = 2F1

• F2’ = 2F2

• Our alignment won’t be changed
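The arithmetic in this example is easy to check mechanically. The sketch below scores the two alignments of AACAG and ATCGT under the original scheme, the shifted scheme, and a doubled scheme. The helper `alignment_score` and the dictionary names are made up for illustration.

```python
def alignment_score(top, bottom, match, mismatch, gap):
    """Score a gapped alignment column by column; '-' marks a gap."""
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


aln1 = ("AACAG", "ATCGT")        # 2 matches, 3 mismatches
aln2 = ("AA-CAG-", "-ATC-GT")    # 3 matches, 4 gaps

schemes = {
    "original": dict(match=1, mismatch=-2, gap=-2),
    "shifted": dict(match=2, mismatch=-1, gap=-1),   # every score + 1
    "doubled": dict(match=2, mismatch=-4, gap=-4),   # every score * 2
}

scores = {name: (alignment_score(*aln1, **s), alignment_score(*aln2, **s))
          for name, s in schemes.items()}
for name, (f1, f2) in scores.items():
    print(name, f1, f2, "alignment 1 wins" if f1 > f2 else "alignment 2 wins")
```

Adding a constant to every score rewards longer alignments, so the ranking flips; multiplying every score by a positive constant scales both totals equally and keeps the ranking.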

Today

• How to model gaps more accurately?

• Local sequence alignment

• Statistics of alignment

What’s a better alignment?

GACGCCGAACG
|||||   |||
GACGC---ACG
Score = 8 x m – 3 x d

GACGCCGAACG
|||| | | ||
GACG-C-A-CG
Score = 8 x m – 3 x d

However, gaps usually occur in bunches.

- During evolution, chunks of DNA may be lost entirely
- Aligning genomic sequence vs. cDNA (reverse complementary to mRNA)

Model gaps more accurately

• Current model:
– Gap of length n incurs penalty nd

• General:
– Convex function
– E.g. γ(n) = c * sqrt(n)

[Figure: two plots of gap penalty γ(n) against gap length n, one linear and one convex]

General gap dynamic programming

Initialization: same

Iteration:

$$F(i,j) = \max \begin{cases} F(i-1,j-1) + \sigma(x_i,y_j) \\ \max_{k=0 \ldots i-1} F(k,j) - \gamma(i-k) \\ \max_{k=0 \ldots j-1} F(i,k) - \gamma(j-k) \end{cases}$$

Termination: same

Running time: O(N²M) (cubic)

Space: O(NM) (linear-space algorithm not applicable)
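A minimal Python sketch of this general-gap recurrence follows. It takes the substitution score and the gap function as arguments, both hypothetical parameter names, and assumes the border cells are initialised to the cost of a single leading gap, which is one reading of "Initialization: same". The inner loops over k are what make the running time cubic.

```python
import math


def general_gap_score(x, y, sigma, gamma):
    """Global alignment score with an arbitrary gap penalty gamma(n)."""
    m, n = len(x), len(y)
    F = [[0.0] * (n + 1) for _ in range(m + 1)]
    # Assumed initialisation: a prefix aligned against one gap of that length.
    for i in range(1, m + 1):
        F[i][0] = -gamma(i)
    for j in range(1, n + 1):
        F[0][j] = -gamma(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = F[i - 1][j - 1] + sigma(x[i - 1], y[j - 1])
            # A gap of length i-k ending at row i, for every possible start k.
            for k in range(i):
                best = max(best, F[k][j] - gamma(i - k))
            # A gap of length j-k ending at column j.
            for k in range(j):
                best = max(best, F[i][k] - gamma(j - k))
            F[i][j] = best
    return F[m][n]


unit = lambda a, b: 1 if a == b else -1
linear_score = general_gap_score("AGTA", "ATA", unit, lambda n: n)
sqrt_score = general_gap_score("AGTA", "ATA", unit, lambda n: 2 * math.sqrt(n))
print(linear_score, sqrt_score)
```

With the linear penalty γ(n) = n the function gives the same score as the ordinary recurrence; the square-root penalty is included only to show that any gap function can be plugged in.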

Compromise: affine gaps

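γ(n) = d + (n – 1)e

where d is the gap-open penalty and e is the gap-extension penalty.

[Figure: plot of γ(n) against n; the first gap position costs d and each further position costs e]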

Match: 2

Gap open: 5

Gap extension: 1

GACGCCGAACG
|||||   |||
GACGC---ACG
Score = 8x2 - 5 - 2 = 9

GACGCCGAACG
|||| | | ||
GACG-C-A-CG
Score = 8x2 - 3x5 = 1

Additional states

• The amount of state needed increases
– In scoring a single entry in our matrix, we need to remember an extra piece of information
• Are we continuing a gap in x? (if not, starting one is more expensive)
• Are we continuing a gap in y? (if not, starting one is more expensive)
• Are we continuing from a match between xi and yj?

Finite State Automaton

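[Figure: three-state automaton with states "xi and yj aligned", "xi aligned to a gap", and "yj aligned to a gap"; entering a gap state costs d, staying in a gap state costs e]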

Dynamic programming

• We encode this information in three different matrices

• For each element (i, j) we use three variables
– F(i, j): best alignment of x1..xi & y1..yj if xi aligns to yj
– Ix(i, j): best alignment of x1..xi & y1..yj if yj aligns to a gap
– Iy(i, j): best alignment of x1..xi & y1..yj if xi aligns to a gap

$$F(i,j) = \sigma(x_i,y_j) + \max \begin{cases} F(i-1,j-1) & \text{(continuing alignment)} \\ I_x(i-1,j-1) & \text{(closing a gap in } x\text{)} \\ I_y(i-1,j-1) & \text{(closing a gap in } y\text{)} \end{cases}$$

$$I_x(i,j) = \max \begin{cases} F(i,j-1) - d & \text{(opening a gap in } x\text{)} \\ I_y(i,j-1) - d \\ I_x(i,j-1) - e & \text{(gap extension in } x\text{)} \end{cases}$$

$$I_y(i,j) = \max \begin{cases} F(i-1,j) - d & \text{(opening a gap in } y\text{)} \\ I_x(i-1,j) - d \\ I_y(i-1,j) - e & \text{(gap extension in } y\text{)} \end{cases}$$

[Figure: the three matrices F, Ix, and Iy stacked on top of one another]

• If we stack all three matrices
– No cyclic dependency
– We can fill in all three matrices in order

Algorithm

for i = 1:M
    for j = 1:N
        Fill in F(i, j), Ix(i, j), Iy(i, j)
    end
end
Optimal score = max (F(M, N), Ix(M, N), Iy(M, N))

• Time: O(MN)
• Space: O(MN), or O(N) when combined with the linear-space algorithm
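The loop above can be sketched in Python as follows. This is an illustrative sketch, not the course's reference code. It takes d and e as positive penalties, uses the slide's example values (match 2, gap open 5, gap extension 1), and assumes a mismatch score of -3, which the slides do not specify. The border initialisation is also an assumption.

```python
import math


def affine_global_score(x, y, match=2, mismatch=-3, d=5, e=1):
    """Global alignment score with affine gaps, using matrices F, Ix, Iy."""
    NEG = -math.inf
    m, n = len(x), len(y)
    F = [[NEG] * (n + 1) for _ in range(m + 1)]
    Ix = [[NEG] * (n + 1) for _ in range(m + 1)]   # y_j aligned to a gap
    Iy = [[NEG] * (n + 1) for _ in range(m + 1)]   # x_i aligned to a gap
    F[0][0] = 0
    # Assumed borders: one leading gap of length j (or i).
    for j in range(1, n + 1):
        Ix[0][j] = -(d + (j - 1) * e)
    for i in range(1, m + 1):
        Iy[i][0] = -(d + (i - 1) * e)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = s + max(F[i - 1][j - 1],
                              Ix[i - 1][j - 1],
                              Iy[i - 1][j - 1])
            Ix[i][j] = max(F[i][j - 1] - d,      # open a gap
                           Iy[i][j - 1] - d,     # switch gap direction
                           Ix[i][j - 1] - e)     # extend the gap
            Iy[i][j] = max(F[i - 1][j] - d,
                           Ix[i - 1][j] - d,
                           Iy[i - 1][j] - e)
    return max(F[m][n], Ix[m][n], Iy[m][n])


best = affine_global_score("GACGCCGAACG", "GACGCACG")
print(best)
```

For the example pair, the best score corresponds to the alignment with a single gap of length three: eight matches minus one gap open and two extensions.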

To simplify

$$F(i,j) = \max \begin{cases} F(i-1,j-1) + \sigma(x_i,y_j) \\ I(i-1,j-1) + \sigma(x_i,y_j) \end{cases}$$

$$I(i,j) = \max \begin{cases} F(i,j-1) - d \\ I(i,j-1) - e \\ F(i-1,j) - d \\ I(i-1,j) - e \end{cases}$$

I(i, j): best alignment between x1…xi and y1…yj if either xi or yj is aligned to a gap

This is possible because alternating gaps are not allowed

To summarize

• Global alignment
– Basic algorithm: Needleman-Wunsch
– Variants:
• Overlapping detection
• Longest common subsequences
• Achieved by varying initial conditions or scoring
– Bounded DP (pruning search space)
– Linear space (divide-and-conquer)
– Affine gap penalty

Local alignment

The local alignment problem

Given two strings X = x1……xM,

Y = y1……yN

Find substrings x’, y’ whose similarity (optimal global alignment value) is maximum

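e.g.
X = abcxdex, X’ = cxde
Y = xxxcde, Y’ = c-de

[Figure: DP matrix with x and y on the axes; the local alignment is a short diagonal segment inside it]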

Why local alignment

• Conserved regions may be a small part of the whole
– “Active site” of a protein
– Scattered genes or exons among “junk”
– Don’t have whole sequence

• Global alignment might miss them if flanking “junk” outweighs similar regions

• Genes are shuffled between genomes

[Figure: the same segments A, B, C, D appear in a different order in two genomes]

Naïve algorithm

for all substrings X’ of X and Y’ of Y
    Align X’ & Y’ via dynamic programming
    Retain the pair with max value
end
Output the retained pair

• Time: O(n²) choices for X’, O(m²) choices for Y’, O(nm) for each DP, so O(n³m³) total.

Reminder

• The overlap detection algorithm
– We do not penalize gaps at the ends

[Figure: two overlapping sequences; the overhanging ends are marked "free gap"]

Similar here

• The unaligned regions incur no penalty

The big idea

• Whenever we get to some bad region (negative score), we ignore the previous alignment
– Reset score to zero

The Smith-Waterman algorithm

Initialization: F(0, j) = F(i, 0) = 0

Iteration:

$$F(i,j) = \max \begin{cases} 0 \\ F(i-1,j) - d \\ F(i,j-1) - d \\ F(i-1,j-1) + \sigma(x_i,y_j) \end{cases}$$

The Smith-Waterman algorithm

Termination:

1. If we want the best local alignment: F_OPT = max over i, j of F(i, j)

2. If we want all local alignments scoring > t: for all i, j with F(i, j) > t, trace back

• The correctness of the algorithm can be proved by induction using the alignment graph

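[Figure: alignment graph fragment with edge weights -10, 100, 0, and 0, illustrating the zero-weight restart edges]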

x x x c d e

0 0 0 0 0 0 0

a 0

b 0

c 0

x 0

d 0

e 0

x 0

Match: 2

Mismatch: -1

Gap: -1

x x x c d e

0 0 0 0 0 0 0

a 0 0 0 0 0 0 0

b 0 0 0 0 0 0 0

c 0

x 0

d 0

e 0

x 0

Match: 2

Mismatch: -1

Gap: -1

x x x c d e

0 0 0 0 0 0 0

a 0 0 0 0 0 0 0

b 0 0 0 0 0 0 0

c 0 0 0 0 2 1 0

x 0

d 0

e 0

x 0

Match: 2

Mismatch: -1

Gap: -1

x x x c d e

0 0 0 0 0 0 0

a 0 0 0 0 0 0 0

b 0 0 0 0 0 0 0

c 0 0 0 0 2 1 0

x 0 2 2 2 1 1 0

d 0

e 0

x 0

Match: 2

Mismatch: -1

Gap: -1

x x x c d e

0 0 0 0 0 0 0

a 0 0 0 0 0 0 0

b 0 0 0 0 0 0 0

c 0 0 0 0 2 1 0

x 0 2 2 2 1 1 0

d 0 1 1 1 1 3 2

e 0

x 0

Match: 2

Mismatch: -1

Gap: -1

x x x c d e

0 0 0 0 0 0 0

a 0 0 0 0 0 0 0

b 0 0 0 0 0 0 0

c 0 0 0 0 2 1 0

x 0 2 2 2 1 1 0

d 0 1 1 1 1 3 2

e 0 0 0 0 0 2 5

x 0

Match: 2

Mismatch: -1

Gap: -1

x x x c d e

0 0 0 0 0 0 0

a 0 0 0 0 0 0 0

b 0 0 0 0 0 0 0

c 0 0 0 0 2 1 0

x 0 2 2 2 1 1 0

d 0 1 1 1 1 3 2

e 0 0 0 0 0 2 5

x 0 2 2 2 1 1 4

Match: 2

Mismatch: -1

Gap: -1

Trace back

x x x c d e

0 0 0 0 0 0 0

a 0 0 0 0 0 0 0

b 0 0 0 0 0 0 0

c 0 0 0 0 2 1 0

x 0 2 2 2 1 1 0

d 0 1 1 1 1 3 2

e 0 0 0 0 0 2 5

x 0 2 2 2 1 1 4

Match: 2

Mismatch: -1

Gap: -1


cxde
| ||
c-de

x-de
| ||
xcde
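The fill and trace-back steps walked through above can be sketched in Python as follows. This is a minimal illustration using the example's scores (match 2, mismatch -1, gap -1, with the gap value added as a negative score). The function name and the tie-breaking order (diagonal, then up, then left) are choices made for this sketch, so only one of the two equally good alignments is returned.

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap=-1):
    """Return (best score, aligned x', aligned y', table); rows follow x."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # The extra 0 resets the score whenever the alignment turns bad.
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Trace back from the best cell until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + gap:
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom)), F


score, ax, ay, F = smith_waterman("abcxdex", "xxxcde")
print(score)
print(ax)
print(ay)
```

Preferring the left move over the up move when both tie would yield the other alignment, x-de over xcde.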

• No negative values in local alignment DP array

• Optimal local alignment will never have a gap on either end

• Local alignment: “Smith-Waterman”

• Global alignment: “Needleman-Wunsch”

Analysis

• Time:
– O(MN) for finding the best alignment
– Additional time depends on the number of sub-optimal alignments reported

• Memory:
– O(MN)
– O(M+N) possible

The statistics of alignment

Where does σ(xi, yj) come from?

Are two aligned sequences actually related?

Probabilistic model of alignments

• We’ll focus on protein alignments without gaps
• Given an alignment, we can consider two possibilities
– R: the sequences are related by evolution
– U: the sequences are unrelated
• How can we distinguish these possibilities?
• How is this view related to the amino-acid substitution matrix?

Model for unrelated sequences

• Assume each position of the alignment is independently sampled from some distribution of amino acids

• ps: probability of amino acid s in the sequences

• Probability of seeing an amino acid s aligned to an amino acid t by chance is
– Pr(s, t | U) = ps * pt

• Probability of seeing an ungapped alignment between x = x1…xn and y = y1…yn by chance is

$$\Pr(x, y \mid U) = \prod_{i=1}^{n} p_{x_i} \prod_{i=1}^{n} p_{y_i}$$

Model for related sequences

• Assume each pair of aligned amino acids evolved from a common ancestor

• Let qst be the probability that amino acid s in one sequence is related to t in another sequence

• The probability of an alignment of x and y is given by

$$\Pr(x, y \mid R) = \prod_{i=1}^{n} q_{x_i y_i}$$

Probabilistic model of Alignments

• How can we decide which possibility (U or R) is more likely?

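• One principled way is to consider the relative likelihood of the two possibilities (the odds ratio)

$$\frac{\Pr(x, y \mid R)}{\Pr(x, y \mid U)} = \prod_{i=1}^{n} \frac{q_{x_i y_i}}{p_{x_i} p_{y_i}}$$

– A higher ratio means that R is more likely than U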

Log odds ratio

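• Taking the log, we get

$$\log \frac{\Pr(x, y \mid R)}{\Pr(x, y \mid U)} = \sum_{i=1}^{n} \log \frac{q_{x_i y_i}}{p_{x_i} p_{y_i}}$$

• Recall that the score of an alignment is given by

$$S = \sum_{i=1}^{n} \sigma(x_i, y_i)$$

• Therefore, if we define

$$\sigma(s, t) = \log \frac{q_{st}}{p_s p_t}$$

• We are actually defining the alignment score as the log odds ratio (log-likelihood ratio) between the two models R and U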

This is indeed how biologists have defined the substitution matrices for proteins

• ps can be counted from the available protein sequences

• But how do we get qst? (the probability that s and t have a common ancestor)

• Counted from trusted alignments of related sequences

Protein Substitution Matrices

• Two popular sets of matrices for protein sequences
– PAM matrices [Dayhoff et al, 1978]
• Better for aligning closely related sequences
– BLOSUM matrices [Henikoff & Henikoff, 1992]
• For both closely or remotely related sequences

Positive for chemically similar substitution

Common amino acids get low weights

Rare amino acids get high weights

BLOSUM-N matrices

• Constructed from a database called BLOCKS
• Contain many closely related sequences
– Conserved amino acids may be over-counted
• N = 62: the probabilities qst were computed using trusted alignments with no more than 62% identity
– identity: % of matched columns

• Using this matrix, the Smith-Waterman algorithm is most effective in detecting real alignments with a similar identity level (i.e. ~62%)

• If you want to detect homologous genes with high identity, you may want a BLOSUM matrix with higher N, say BLOSUM75

• On the other hand, if you want to detect remote homology, you may want to use lower N, say BLOSUM50

• BLOSUM62 is the standard

For DNAs

• No database of trusted alignments to start with

• Specify the percentage identity you would like to detect

• You can then get the substitution matrix by some calculation

For example

• Suppose pA = pC = pT = pG = 0.25

• We want 88% identity

• qAA = qCC = qTT = qGG = 0.22, the rest = 0.12/12 = 0.01

σ(A, A) = σ(C, C) = σ(G, G) = σ(T, T) = log (0.22 / (0.25*0.25)) = 1.26

σ(s, t) = log (0.01 / (0.25*0.25)) = -1.83 for s ≠ t

(log is the natural logarithm here)

Substitution matrix

A C G T

A 1.26 -1.83 -1.83 -1.83

C -1.83 1.26 -1.83 -1.83

G -1.83 -1.83 1.26 -1.83

T -1.83 -1.83 -1.83 1.26

• Scaling won’t change the alignment
• Multiply by 4 and then round off to get integers

A C G T

A 5 -7 -7 -7

C -7 5 -7 -7

G -7 -7 5 -7

T -7 -7 -7 5
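The calculation behind both matrices can be sketched in a few lines of Python. It assumes uniform base frequencies (0.25 each), splits the target identity evenly over the four matching pairs and the remainder over the twelve mismatching pairs, and uses natural logarithms. The function name `dna_log_odds` is made up for this example.

```python
import math


def dna_log_odds(identity, p=0.25):
    """Return (match score, mismatch score) as natural-log odds."""
    q_match = identity / 4            # four identical pairs share `identity`
    q_mismatch = (1 - identity) / 12  # twelve non-identical pairs share the rest
    background = p * p                # chance of any ordered pair when unrelated
    return (math.log(q_match / background),
            math.log(q_mismatch / background))


match, mismatch = dna_log_odds(0.88)
scaled = (round(4 * match), round(4 * mismatch))
print(round(match, 2), round(mismatch, 2), scaled)
```

Calling `dna_log_odds` with a different target identity gives the matrix tuned to that level of divergence.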

Arbitrary substitution matrix

• Say you have a substitution matrix provided by someone

• It’s important to know what you are actually looking for when you use the matrix

• What’s the difference?

• Which one should I use?

A C G T

A 1 -2 -2 -2

C -2 1 -2 -2

G -2 -2 1 -2

T -2 -2 -2 1

A C G T

A 5 -4 -4 -4

C -4 5 -4 -4

G -4 -4 5 -4

T -4 -4 -4 5

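• We had

$$\sigma(s, t) = \log \frac{q_{st}}{p_s p_t}$$

• Scale it, so that

$$\lambda \, \sigma(s, t) = \log \frac{q_{st}}{p_s p_t}$$

• Reorganize:

$$q_{st} = p_s p_t e^{\lambda \sigma(s, t)}$$

• Since all probabilities must sum to 1,

$$\sum_{s,t} q_{st} = 1$$

• We have

$$\sum_{s,t} p_s p_t e^{\lambda \sigma(s, t)} = 1$$

• Suppose again ps = 0.25 for any s

• We know σ(s, t) from the substitution matrix

• We can solve the equation for λ

• Plug λ into q_st = p_s p_t e^{λσ(s,t)} to get qst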

A C G T

A 1 -2 -2 -2

C -2 1 -2 -2

G -2 -2 1 -2

T -2 -2 -2 1

A C G T

A 5 -4 -4 -4

C -4 5 -4 -4

G -4 -4 5 -4

T -4 -4 -4 5

First matrix (match 1, mismatch -2):

λ = 1.33

qst = 0.24 for s = t, and 0.004 for s ≠ t

Translate: 95% identity

Second matrix (match 5, mismatch -4):

λ ≈ 0.19 (e^λ ≈ 1.21)

qst = 0.16 for s = t, and 0.03 for s ≠ t

Translate: 65% identity
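These numbers can be checked numerically. The sketch below assumes the scaled relation q_st = p_s p_t e^{λσ(s,t)} with p_s = 0.25 and the requirement that the q_st sum to 1, and finds the positive root λ by bisection (λ = 0 always satisfies the equation trivially, so the search starts just above zero). The function names are hypothetical.

```python
import math


def solve_lambda(match, mismatch, p=0.25, tol=1e-12):
    """Positive root of sum_{s,t} p_s p_t exp(lambda * sigma(s,t)) = 1."""
    def excess(lam):
        same = 4 * p * p * math.exp(lam * match)       # 4 identical pairs
        diff = 12 * p * p * math.exp(lam * mismatch)   # 12 non-identical pairs
        return same + diff - 1.0

    lo, hi = 1e-6, 1.0
    while excess(hi) <= 0:      # grow the bracket until the sum exceeds 1
        hi *= 2.0
    while hi - lo > tol:        # plain bisection
        mid = (lo + hi) / 2.0
        if excess(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0


def implied_identity(match, mismatch, p=0.25):
    """Return (lambda, fraction of identical columns the matrix targets)."""
    lam = solve_lambda(match, mismatch, p)
    q_match = p * p * math.exp(lam * match)
    return lam, 4 * q_match


lam_a, identity_a = implied_identity(1, -2)
lam_b, identity_b = implied_identity(5, -4)
print(lam_a, identity_a)
print(lam_b, identity_b)
```

The first matrix gives λ near 1.33 and an identity near 95%; the second gives λ near 0.19 and an identity near 65%. So the +1/-2 matrix suits very similar sequences and the +5/-4 matrix suits more diverged ones.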
