CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.


Page 1: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

CS 5263 Bioinformatics
CS 4593 AT: Bioinformatics

Lectures 3-6: Pair-wise Sequence Alignment

Page 2: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Outline

• Part I: Algorithms
  – Biological problem
  – Intro to dynamic programming
  – Global sequence alignment
  – Local sequence alignment
  – More efficient algorithms

• Part II: Biological issues
  – Model gaps more accurately
  – Alignment statistics

• Part III: BLAST

Page 3: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Evolution at the DNA level

…ACGGTGCAGTCACCA…

…ACGTTGC-GTCCACCA…

C

DNA evolutionary events (sequence edits): mutation, deletion, insertion

Page 4: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Sequence conservation implies function

[Figure labels: OK, OK, OK, X, X, “Still OK?”, “next generation”]

Page 5: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Why sequence alignment?

• Conserved regions are more likely to be functional
  – Can be used for finding genes, regulatory elements, etc.

• Similar sequences often have similar origin and function
  – Can be used to predict functions for new genes / proteins

• Sequence alignment is one of the most widely used computational tools in biology

Page 6: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Global Sequence Alignment

S  = AGGCTATCACCTGACCTCCAGGCCGATGCCC
T  = TAGCTATCACGACCGCGGTCGATTTGCCCGAC

S’ = -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
T’ = TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Definition: An alignment of two strings S, T is a pair of strings S’, T’ (with spaces) such that

(1) |S’| = |T’| (where |S| denotes the length of S), and
(2) removing all spaces in S’, T’ leaves S, T

Page 7: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

What is a good alignment?

Alignment: The “best” way to match the letters of one sequence with those of the other

How do we define “best”?

Page 8: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

• The score of aligning two symbols x and y (characters or spaces) is σ(x, y).

• Score of an alignment: the sum of σ(S’[i], T’[i]) over all columns i = 1 … |S’|

• An optimal alignment: one with max score

S’: -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
T’: TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Page 9: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Scoring Function

• Sequence edits: AGGCCTC
  – Mutations:  AGGACTC
  – Insertions: AGGGCCTC
  – Deletions:  AGG-CTC

Scoring function:
  Match: +m
  Mismatch: -s
  Gap (indel): -d

  ~~~AAC~~~
  ~~~A-A~~~

Page 10: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

• Match = 2, mismatch = -1, gap = -1

• Score = 3 x 2 – 2 x 1 – 1 x 1 = 3
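The arithmetic above can be checked with a small scoring routine. This is a minimal sketch: the function name `alignment_score` and the two aligned strings are hypothetical (the slide's own alignment is not recoverable from the transcript); they were chosen to contain 3 matches, 2 mismatches, and 1 gap.

```python
def alignment_score(s_aln, t_aln, match=2, mismatch=-1, gap=-1):
    """Sum the per-column scores of two aligned strings ('-' marks a space)."""
    if len(s_aln) != len(t_aln):
        raise ValueError("aligned strings must have equal length")
    total = 0
    for a, b in zip(s_aln, t_aln):
        if a == "-" or b == "-":
            total += gap        # a letter aligned to a space
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


# Hypothetical alignment: 3 matches, 2 mismatches, 1 gap
print(alignment_score("AGGTC-", "AGCTAT"))  # 3 x 2 - 2 x 1 - 1 x 1 = 3
```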

Page 11: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

More complex scoring function

• Substitution matrix
  – Similarity score of matching two letters a, b should reflect the probability of a, b being derived from the same ancestor
  – It is usually defined by a log likelihood ratio
  – Active research area, especially for proteins
  – Commonly used: PAM, BLOSUM

Page 12: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

An example substitution matrix

     A    C    G    T
A    3   -2   -1   -2
C         3   -2   -1
G              3   -2
T                   3

Page 13: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

How to find an optimal alignment?

• A naïve algorithm:

for all subseqs A of S, B of T s.t. |A| = |B| do
    align A[i] with B[i], 1 ≤ i ≤ |A|
    align all other chars to spaces
    compute its value
    retain the max
end
output the retained alignment

Example: S = abcd, A = cd; T = wxyz, B = xz

-abc-d    a-bc-d
w--xyz    -w-xyz

Page 14: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Analysis

• Assume |S| = |T| = n
• Cost of evaluating one alignment: ≥ n
• How many alignments are there: C(2n, n)
  – pick n chars of S, T together
  – say k of them are in S
  – match these k to the k unpicked chars of T

• Total time: ≥ n · C(2n, n) > 2^(2n), for n > 3

• E.g., for n = 20, time is > 2^40 > 10^12 operations
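The counting argument can be checked numerically. The sketch below assumes the naïve algorithm enumerates every pair of equal-length subsequences (k characters chosen from S and k from T); the function name is hypothetical. Summing C(n, k)² over k gives C(2n, n), which for n = 20 already pushes the total work past 2^40.

```python
from math import comb


def count_candidate_alignments(n):
    """Number of (A, B) pairs with |A| = |B|, for two strings of length n."""
    return sum(comb(n, k) ** 2 for k in range(n + 1))


n = 20
count = count_candidate_alignments(n)
print(count == comb(2 * n, n))              # the sum collapses to C(2n, n)
print(n * count > 2 ** (2 * n) > 10 ** 12)  # lower bound on total operations
```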

Page 15: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Intro to Dynamic Programming

Page 16: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Dynamic programming

• What is dynamic programming?
  – A method for solving problems exhibiting the properties of overlapping subproblems and optimal substructure
  – Key idea: tabulating sub-problem solutions rather than re-computing them repeatedly

• Two simple examples:
  – Computing Fibonacci numbers
  – Find the special shortest path in a grid

Page 17: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Example 1: Fibonacci numbers

• 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, …

F(0) = 1;

F(1) = 1;

F(n) = F(n-1) + F(n-2)

• How to compute F(n)?

Page 18: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

A recursive algorithm

function fib(n)

if (n == 0 or n == 1) return 1;

else return fib(n-1) + fib(n-2);

Recursion tree, level by level:

F(9)
F(8) F(7)
F(7) F(6) F(6) F(5)
F(6) F(5) F(5) F(4) F(5) F(4) F(4) F(3)

Page 19: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

• Time complexity:
  – Between 2^(n/2) and 2^n
  – O(1.62^n), i.e. exponential

• Why is the recursive Fib algorithm inefficient?
  – Overlapping subproblems

[Figure: the recursion tree for F(9) again, with depths n/2 and n marked.]

Page 20: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

An iterative algorithm

function fib(n)

F[0] = 1; F[1] = 1;

for i = 2 to n

F[i] = F[i-1] + F[i-2];

Return F[n];

[Figure: array F holding 55 34 21 13 8 5 3 2 1 1]

Complexity: time O(n), space O(n)
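Both versions can be written out and compared directly. A minimal Python sketch using the slides' convention F(0) = F(1) = 1; the function names are illustrative only.

```python
def fib_recursive(n):
    """Direct translation of the recurrence; recomputes subproblems repeatedly."""
    if n == 0 or n == 1:
        return 1
    return fib_recursive(n - 1) + fib_recursive(n - 2)


def fib_iterative(n):
    """Tabulated version: each F[i] is computed once, in O(n) time and space."""
    F = [1] * (n + 1)
    for i in range(2, n + 1):
        F[i] = F[i - 1] + F[i - 2]
    return F[n]


print([fib_iterative(i) for i in range(11)])
```

Only the two most recent table entries are ever read, so the table could be replaced by two variables if O(1) space were needed.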

Page 21: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Example 2: shortest path in a grid

[Figure: an m × n grid with start S and goal G at opposite corners.]

Each edge has a length (cost). We need to get to G from S. Can only move right or down. Aim: find a path with the minimum total length

Page 22: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Optimal substructures

• Naïve algorithm: enumerate all possible paths and compare costs
  – Exponential number of paths

• Key observation:
  – If a path P(S, G) is the shortest from S to G, any of its sub-paths P(S, x), where x is on P(S, G), is the shortest from S to x

Page 23: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Proof

• Proof by contradiction
  – Suppose P(S,x) is not the shortest path from S to x, i.e., there is a path P’(S,x) with P’(S,x) < P(S,x)
  – Construct a new path P’(S,G) = P’(S,x) + P(x,G)
  – Then P’(S,G) < P(S,G), so P(S,G) is not the shortest
  – Contradiction
  – Therefore, P(S,x) is the shortest

[Figure: a path from S to G passing through x.]

Page 24: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Recursive solution

• Index each intersection by two indices, (i, j)
• Let F(i, j) be the total length of the shortest path from (0, 0) to (i, j). Therefore, F(m, n) is the length of the shortest path we wanted.
• To compute F(m, n), we need to compute both F(m-1, n) and F(m, n-1)

F(m, n) = min { F(m-1, n) + length((m-1, n), (m, n)),
                F(m, n-1) + length((m, n-1), (m, n)) }

[Figure: the m × n grid from (0,0) to (m, n).]

Page 25: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Recursive solution

• But: if we use recursive calls, many subpaths will be recomputed many times

• Strategy: pre-compute F values starting from the upper-left corner. Fill in row by row (what other order will also do?)

F(i, j) = min { F(i-1, j) + length((i-1, j), (i, j)),
                F(i, j-1) + length((i, j-1), (i, j)) }

[Figure: the grid from (0,0) to (m, n), highlighting node (i, j) and its predecessors (i-1, j) and (i, j-1).]

Page 26: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

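Dynamic programming illustration

[Figure: a 5 × 5 grid of nodes from S to G. The numbers recovered from the figure are listed below.]

Horizontal edge lengths (one line per grid row, left to right):
3 9 1 2
3 2 5 2
2 4 2 3
3 6 3 3
1 2 3 2

Vertical edge lengths (one line per step down, left to right):
5 3 3 3 3
2 3 3 9 3
6 2 3 7 4
4 6 3 1 3

F values (length of the shortest path from S to each node):
 0  3 12 13 15
 5  6  8 13 15
 7  9 11 13 16
13 11 14 17 20
17 17 17 18 20

F(i, j) = min { F(i-1, j) + length(i-1, j, i, j),
                F(i, j-1) + length(i, j-1, i, j) }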

Page 27: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Traceback

[Figure: the same grid, edge lengths, and F values as on the previous slide, with the shortest path traced back from G to S.]
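The grid example can be reproduced in a few lines. In this sketch the edge lengths are the numbers that survive in the transcript; their assignment to horizontal edges `H` and vertical edges `V` is an assumption, chosen because it reproduces the F values shown on the slide. The function name `shortest_path` is illustrative.

```python
# H[i][j]: length of the edge (i, j) -> (i, j+1); V[i][j]: edge (i, j) -> (i+1, j)
H = [[3, 9, 1, 2], [3, 2, 5, 2], [2, 4, 2, 3], [3, 6, 3, 3], [1, 2, 3, 2]]
V = [[5, 3, 3, 3, 3], [2, 3, 3, 9, 3], [6, 2, 3, 7, 4], [4, 6, 3, 1, 3]]


def shortest_path(H, V):
    rows, cols = len(H), len(V[0])
    F = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):                      # first row: only "right" moves
        F[0][j] = F[0][j - 1] + H[0][j - 1]
    for i in range(1, rows):
        F[i][0] = F[i - 1][0] + V[i - 1][0]       # first column: only "down" moves
        for j in range(1, cols):
            F[i][j] = min(F[i - 1][j] + V[i - 1][j],
                          F[i][j - 1] + H[i][j - 1])
    # Traceback: walk from G to S along the edges that realized each minimum.
    i, j = rows - 1, cols - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        if i > 0 and F[i][j] == F[i - 1][j] + V[i - 1][j]:
            i -= 1
        else:
            j -= 1
        path.append((i, j))
    return F, path[::-1]


F, path = shortest_path(H, V)
print(F[-1][-1])   # length of the shortest S-to-G path
print(path)
```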

Page 28: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Elements of dynamic programming

• Optimal sub-structures
  – Optimal solutions to the original problem contain optimal solutions to sub-problems

• Overlapping sub-problems
  – Some sub-problems appear in many solutions

• Memoization and reuse
  – Carefully choose the order in which sub-problems are solved

Page 29: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Dynamic Programming for sequence alignment

Suppose we wish to align
x1……xM
y1……yN

Let F(i,j) = optimal score of aligning
x1……xi
y1……yj

Scoring function:
  Match: +m
  Mismatch: -s
  Gap (indel): -d

Page 30: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Elements of dynamic programming

• Optimal sub-structures
  – Optimal solutions to the original problem contain optimal solutions to sub-problems

• Overlapping sub-problems
  – Some sub-problems appear in many solutions

• Memoization and reuse
  – Carefully choose the order in which sub-problems are solved

Page 31: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Optimal substructure

• If x[i] is aligned to y[j] in the optimal alignment between x[1..M] and y[1..N], then

• The alignment between x[1..i] and y[1..j] is also optimal

• Easy to prove by contradiction

[Figure: x (positions 1, 2, …, i, …, M) drawn above y (positions 1, 2, …, j, …, N).]

Page 32: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Recursive solution

Notice three possible cases:

1. xM aligns to yN
   ~~~~~~~ xM
   ~~~~~~~ yN
   F(M,N) = F(M-1, N-1) + (m, if xM = yN; -s, if not)

2. xM aligns to a gap
   ~~~~~~~ xM
   ~~~~~~~ -
   F(M,N) = F(M-1, N) - d

3. yN aligns to a gap
   ~~~~~~~ -
   ~~~~~~~ yN
   F(M,N) = F(M, N-1) - d

F(M,N) is the max over the three cases.

Page 33: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Recursive solution

• Generalize:

F(i,j) = max { F(i-1, j-1) + σ(xi, yj),
               F(i-1, j) – d,
               F(i, j-1) – d }

σ(xi, yj) = m if xi = yj, and –s otherwise

• Boundary conditions:
  – F(0, 0) = 0
  – F(0, j) = -jd: y[1..j] aligned to gaps
  – F(i, 0) = -id: x[1..i] aligned to gaps

Page 34: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

What order to fill?

[Figure: the DP table from F(0,0) to F(M,N). Cell F(i, j) depends on F(i-1, j-1), F(i-1, j), and F(i, j-1).]

Page 36: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Example

x = AGTA, y = ATA; m = 1, s = 1, d = 1

F(i,j) = max { F(i-1, j-1) + σ(xi, yj),
               F(i-1, j) – d,
               F(i, j-1) – d }

The slides fill the table in steps: first the boundary row and column, then rows j = 1, 2, 3 from left to right. The completed table:

F(i,j)    i = 0    1    2    3    4
                   A    G    T    A
j = 0        0    -1   -2   -3   -4
j = 1  A    -1     1    0   -1   -2
j = 2  T    -2     0    0    1    0
j = 3  A    -3    -1   -1    0    2

Optimal alignment: F(4,3) = 2

This only tells us the best score

Page 42: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Trace-back

x = AGTA, y = ATA; m = 1, s = 1, d = 1

F(i,j)    i = 0    1    2    3    4
                   A    G    T    A
j = 0        0    -1   -2   -3   -4
j = 1  A    -1     1    0   -1   -2
j = 2  T    -2     0    0    1    0
j = 3  A    -3    -1   -1    0    2

Starting from F(4,3), step back to the cell that produced each value under the recurrence; the alignment is built from right to left:

Step 1, at F(4,3):   x: A       y: A
Step 2, at F(3,2):   x: TA      y: TA
Step 3, at F(2,1):   x: GTA     y: -TA
Step 4, at F(1,1):   x: AGTA    y: A-TA

Optimal alignment: F(4,3) = 2

AGTA
A-TA

Page 47: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Using trace-back pointers

x = AGTA, y = ATA; m = 1, s = 1, d = 1

[Figure sequence: the same table is filled in again row by row, this time with a pointer stored in each cell; the last frames follow the pointers from F(4,3) back to F(0,0).]

Optimal alignment: F(4,3) = 2

AGTA
A-TA

Page 55: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

The Needleman-Wunsch Algorithm

1. Initialization.
   a. F(0, 0) = 0
   b. F(0, j) = - j d
   c. F(i, 0) = - i d

2. Main iteration. Filling in scores
   a. For each i = 1……M
        For each j = 1……N

          F(i, j) = max { F(i-1, j) – d             [case 1]
                          F(i, j-1) – d             [case 2]
                          F(i-1, j-1) + σ(xi, yj)   [case 3] }

          Ptr(i, j) = UP    if [case 1]
                      LEFT  if [case 2]
                      DIAG  if [case 3]

3. Termination. F(M, N) is the optimal score, and from Ptr(M, N) we can trace back the optimal alignment
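The three steps above translate almost line by line into code. This is a minimal sketch, not a reference implementation: the function name and the tie-breaking rule (DIAG preferred, then UP, then LEFT) are assumptions, and m, s, d are the match score, mismatch penalty, and gap penalty from the slides.

```python
def needleman_wunsch(x, y, m=1, s=1, d=1):
    """Global alignment; returns (score, aligned x, aligned y)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):                      # x[1..i] aligned to gaps
        F[i][0], ptr[i][0] = -i * d, "UP"
    for j in range(1, N + 1):                      # y[1..j] aligned to gaps
        F[0][j], ptr[0][j] = -j * d, "LEFT"
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            candidates = [(F[i - 1][j - 1] + sigma, "DIAG"),
                          (F[i - 1][j] - d, "UP"),
                          (F[i][j - 1] - d, "LEFT")]
            # max() keeps the first maximal candidate, so ties prefer DIAG.
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
    # Follow the pointers from (M, N) back to (0, 0).
    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        if ptr[i][j] == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif ptr[i][j] == "UP":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))


print(needleman_wunsch("AGTA", "ATA"))
```

On the running example (x = AGTA, y = ATA, m = s = d = 1) this yields the score 2 and the alignment AGTA / A-TA.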

Page 56: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Complexity

• Time:O(NM)

• Space:O(NM)

• Linear-space algorithms do exist (with the same time complexity)

Page 57: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Equivalent graph problem

[Figure: the alignment grid for S1 = AGTA (along the top) and S2 = ATA (along the side), from (0,0) to (3,4); diagonal edges where the letters match are labeled 1.]

Edge types:
  – horizontal / vertical edge: a gap in the 2nd / 1st sequence (the legend does not survive well enough to say which direction is which); value -d
  – diagonal edge: match / mismatch; value m or -s

• Number of steps: length of the alignment

• Path length: alignment score

• Optimal alignment: find the longest path from (0, 0) to (3, 4)

• The general longest path problem cannot be solved with DP. The longest path on this graph can be found by DP since no cycle is possible.

Page 58: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Question

• If we change the scoring scheme, will the optimal alignment be changed?
  – Old: match = 1, mismatch = gap = -1
  – New: match = 2, mismatch = gap = 0
  – New: match = 2, mismatch = gap = -2?

Page 59: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Question

• What kind of alignment is represented by these paths?

[Figure: five paths through the alignment grid for the strings A and BC.] The alignments they represent:

A-     A--    --A    -A-    -A
BC     -BC    BC-    B-C    BC

Alternating gaps are impossible if –s > –2d

Page 60: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

A variant of the basic algorithm

Scoring scheme: m = s = d = 1

Seq1: CAGCA-CTTGGATTCTCGG
         || |:|||
Seq2: ---CAGCGTGG--------
Score = -7

Seq1: CAGCACTTGGATTCTCGG
      ||||     | |    ||
Seq2: CAGC-----G-T----GG
Score = -2

The first alignment may be biologically more realistic in some cases (e.g. if we know s2 is a subsequence of s1)

Page 61: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

A variant of the basic algorithm

• Maybe it is OK to have an unlimited # of gaps in the beginning and end:

----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC
GCGAGTTCATCTATCAC--GACCGC--GGTCG--------------

• Then, we don’t want to penalize gaps in the ends

Page 62: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

The Overlap Detection variant

Changes:

1. Initialization
   For all i, j:
     F(i, 0) = 0
     F(0, j) = 0

2. Termination
   FOPT = max { max_i F(i, N),
                max_j F(M, j) }

[Figure: the DP table with x1…xM on one axis and y1…yN on the other.]
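The two changes are small enough to show in full. A minimal sketch (the function name `overlap_score` is hypothetical) that returns only the score: the recurrence is the same as in the global algorithm, the boundary is all zeros, and the answer is read from the last row or last column of the table.

```python
def overlap_score(x, y, m=1, s=1, d=1):
    """Best alignment score when gaps at either end of x or y are free."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]      # F(i, 0) = F(0, j) = 0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sigma,
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    best_end_of_y = max(F[i][N] for i in range(M + 1))   # max_i F(i, N)
    best_end_of_x = max(F[M][j] for j in range(N + 1))   # max_j F(M, j)
    return max(best_end_of_y, best_end_of_x)


# The suffix CCC of the first string overlaps the prefix CCC of the second.
print(overlap_score("AAACCC", "CCCGGG"))
```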

Page 63: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Different types of overlaps

[Figure: two diagrams showing different ways x and y can overlap.]

Page 64: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

The local alignment problem

Given two strings X = x1……xM, Y = y1……yN

Find substrings X’, Y’ whose similarity (optimal global alignment value) is maximum

e.g. X = abcxdex    X’ = cxde
     Y = xxxcde     Y’ = c-de

[Figure: substrings of x and y highlighted.]

Page 65: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Why local alignment

• Conserved regions may be a small part of the whole
  – Global alignment might miss them if flanking “junk” outweighs similar regions

• Genes are shuffled between genomes

[Figure: two genomes containing the blocks A, B, C, D in different arrangements.]

Page 66: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Naïve algorithm

for all substrings X’ of X and Y’ of Y
    Align X’ & Y’ via dynamic programming
    Retain pair with max value
end;
Output the retained pair

• Time: O(n^2) choices for X’, O(m^2) for Y’, O(nm) for DP, so O(n^3 m^3) total.

Page 67: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Reminder

• The overlap detection algorithm– We do not give penalty to gaps at either end

Free gap

Free gap

Page 68: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

The local alignment idea

• Do not penalize the unaligned regions (gaps or mismatches)
• The alignment can start anywhere and end anywhere
• Strategy: whenever we get to some low-similarity region (negative score), we restart a new alignment
  – By resetting the alignment score to zero

Page 69: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

The Smith-Waterman algorithm

Initialization: F(0, j) = F(i, 0) = 0

Iteration: F(i, j) = max { 0,
                           F(i – 1, j) – d,
                           F(i, j – 1) – d,
                           F(i – 1, j – 1) + σ(xi, yj) }

Page 70: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

The Smith-Waterman algorithm

Termination:

1. If we want the best local alignment:
   FOPT = max_{i,j} F(i, j)

2. If we want all local alignments scoring > t:
   For all i, j with F(i, j) > t, trace back

Page 71: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Example: X = abcxdex (rows), Y = xxxcde (columns)

Match: 2
Mismatch: -1
Gap: -1

The first row and column are initialized to 0; the table is then filled row by row. The completed table:

         x  x  x  c  d  e
      0  0  0  0  0  0  0
   a  0  0  0  0  0  0  0
   b  0  0  0  0  0  0  0
   c  0  0  0  0  2  1  0
   x  0  2  2  2  1  1  0
   d  0  1  1  1  1  3  2
   e  0  0  0  0  0  2  5
   x  0  2  2  2  1  1  4

Trace back

The maximum entry is 5 (row e, column e). Tracing back from it until a 0 is reached gives two optimal local alignments:

cxde      x-de
| ||      | ||
c-de      xcde
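The worked example can be reproduced with a short Smith-Waterman sketch. The function name, the use of signed scores as parameters, and the tie-breaking order in the trace-back (diagonal first, then a gap in y, then a gap in x) are assumptions; with a different order the trace-back could return the other optimal alignment.

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap=-1):
    """Best local alignment; returns (score, aligned x', aligned y')."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0,
                          F[i - 1][j - 1] + sigma,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Trace back from the maximum entry until a zero is reached.
    i, j = best_pos
    ax, ay = [], []
    while i > 0 and j > 0 and F[i][j] > 0:
        sigma = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + sigma:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(smith_waterman("abcxdex", "xxxcde"))
```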

Page 80: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

• No negative values in local alignment DP array

• Optimal local alignment will never have a gap on either end

• Local alignment: “Smith-Waterman”

• Global alignment: “Needleman-Wunsch”

Page 81: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Analysis

• Time:
  – O(MN) for finding the best alignment
  – Time to report all alignments depends on the number of sub-optimal alignments

• Memory:
  – O(MN)
  – O(M+N) possible

Page 82: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

More efficient alignment algorithms

Page 83: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

• Given two sequences of length M, N

• Time: O(MN)
  – OK, but still slow for long sequences

• Space: O(MN)
  – bad
  – 1 Mb seq x 1 Mb seq = 1 TB memory

• Can we do better?

Page 84: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Bounded alignment

Good alignment should appear near the diagonal

Page 85: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Bounded Dynamic Programming

If we know that x and y are very similar

Assumption: # gaps(x, y) < k

Then, xi aligned to yj implies | i – j | < k

Page 86: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Bounded Dynamic Programming

Initialization:
  F(i,0), F(0,j) undefined for i, j > k

Iteration:
  For i = 1…M
    For j = max(1, i – k)…min(N, i + k)

      F(i, j) = max { F(i – 1, j – 1) + σ(xi, yj),
                      F(i, j – 1) – d,   if j > i – k
                      F(i – 1, j) – d,   if j < i + k }

Termination: same

[Figure: the DP table with x1…xM on one axis and y1…yN on the other; only a band of width k around the diagonal is filled.]

Page 87: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Analysis

• Time: O(kM) << O(MN)

• Space: O(kM) with some tricks

[Figure: the diagonal band of width 2k in a table with M rows is stored as an M × 2k array.]
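A banded version of the global recurrence might look as follows. This is a sketch under stated assumptions: the band is taken as |i – j| ≤ k (matching the loop bounds of the iteration rather than the strict inequality of the assumption), cells outside the band are treated as minus infinity, and a dictionary holds only the O(kM) cells that are visited. The function name is hypothetical.

```python
NEG_INF = float("-inf")


def banded_global_score(x, y, k, m=1, s=1, d=1):
    """Global alignment score restricted to cells with |i - j| <= k."""
    M, N = len(x), len(y)
    if abs(M - N) > k:
        raise ValueError("band too narrow to reach F(M, N)")
    F = {(0, 0): 0}                          # only in-band cells are stored
    for i in range(M + 1):
        for j in range(max(0, i - k), min(N, i + k) + 1):
            if i == 0 and j == 0:
                continue
            # Missing neighbors (outside the band or the table) count as -inf.
            candidates = [F.get((i, j - 1), NEG_INF) - d,
                          F.get((i - 1, j), NEG_INF) - d]
            if i > 0 and j > 0:
                sigma = m if x[i - 1] == y[j - 1] else -s
                candidates.append(F.get((i - 1, j - 1), NEG_INF) + sigma)
            F[(i, j)] = max(candidates)
    return F[(M, N)]


print(banded_global_score("AGTA", "ATA", k=1))
```

The result equals the unrestricted optimum only when some optimal path stays inside the band, which is exactly the similarity assumption made on the slides.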

Page 88: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.
Page 89: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

• Given two sequences of length M, N

• Time: O(MN)
  – OK

• Space: O(MN)
  – bad
  – 1 Mb seq x 1 Mb seq = 1 TB memory

• Can we do better?

Page 90: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Linear space algorithm

• If all we need is the alignment score but not the alignment, easy!

We only need to keep two rows

(You only need one row, with a little trick)

But how do we get the alignment?
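Computing the score alone with two rows can be sketched as below; the function name is illustrative, and m, s, d are the scoring parameters used throughout the slides. Because no pointers are kept, the alignment itself cannot be recovered from this version.

```python
def global_score_two_rows(x, y, m=1, s=1, d=1):
    """Optimal global alignment score using O(len(y)) memory."""
    prev = [-j * d for j in range(len(y) + 1)]     # row i - 1 (starts as row 0)
    for i in range(1, len(x) + 1):
        cur = [-i * d]                             # F(i, 0)
        for j in range(1, len(y) + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            cur.append(max(prev[j - 1] + sigma,    # diagonal
                           prev[j] - d,            # from row i - 1
                           cur[j - 1] - d))        # from the same row
        prev = cur                                 # discard the older row
    return prev[-1]


print(global_score_two_rows("AGTA", "ATA"))
```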

Page 91: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Linear space algorithm

• When we finish, we know how we have aligned the ends of the sequences (xM and yN)

Naïve idea: repeat on the smaller subproblem F(M-1, N-1)

Time complexity: O((M+N)(MN))

Page 92: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

[Figure: the DP grid from (0, 0) to (M, N) with row M/2 marked.]

Key observation: optimal alignment (longest path) must use an intermediate point on the M/2-th row. Call it (M/2, k), where k is unknown.

Page 93: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

• Longest path from (0, 0) to (6, 6) is max_k (LP(0,0,3,k) + LP(6,6,3,k))

[Figure: a grid from (0,0) to (6,6) with candidate midpoints (3,0), (3,2), (3,4), (3,6) marked on row 3.]

Page 94: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Hirschberg’s idea

• Divide and conquer!

Forward algorithm: align x1x2…xM/2 with Y

F(M/2, k) represents the best alignment between x1x2…xM/2 and y1y2…yk

[Figure: the DP grid for X and Y with row M/2 marked.]

Page 95: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Backward Algorithm

Backward algorithm: align reverse(xM/2+1…xM) with reverse(Y)

B(M/2, k) represents the best alignment between reverse(xM/2+1…xM) and reverse(yk+1…yN)

[Figure: the DP grid for X and Y with row M/2 marked.]

Page 96: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Linear-space alignment

Using 2 (4) rows of space, we can compute F(M/2, k) and B(M/2, k) for k = 1…N

Page 97: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Linear-space alignment

Now, we can find k* maximizing F(M/2, k) + B(M/2, k)

Also, we can trace the path exiting row M/2 from k*

Conclusion: In O(NM) time, O(N) space, we found optimal alignment path at row M/2

Page 98: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

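Linear-space alignment

• Iterate this procedure to the two sub-problems!

[Figure: the table split at (M/2, k*) into two sub-problems, one of size M/2 × k* and one of size M/2 × (N – k*).]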

Page 99: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Analysis

• Memory: O(N) for computation, O(N+M) to store the optimal alignment

• Time:
  – MN for first iteration
  – k M/2 + (N-k) M/2 = MN/2 for second
  – …

[Figure: the two sub-problems, of sizes M/2 × k and M/2 × (N – k).]

Page 100: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

[Figure: work per level of the recursion: MN, MN/2, MN/4, MN/8, …]

MN + MN/2 + MN/4 + MN/8 + … = MN (1 + ½ + ¼ + 1/8 + 1/16 + …) = 2MN = O(MN)
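Putting the forward pass, the backward pass, and the recursion together gives the following sketch of the divide-and-conquer scheme. It is one possible arrangement, not the lecture's own code: the helper `nw_last_row`, the base case for a single character of x, and the choice of the first maximizing k* are assumptions.

```python
def nw_last_row(x, y, m=1, s=1, d=1):
    """Scores of aligning all of x with every prefix of y, keeping two rows."""
    prev = [-j * d for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [-i * d]
        for j in range(1, len(y) + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            cur.append(max(prev[j - 1] + sigma, prev[j] - d, cur[j - 1] - d))
        prev = cur
    return prev


def hirschberg(x, y, m=1, s=1, d=1):
    """Return one optimal global alignment (x', y') with linear working memory."""
    if not x:
        return "-" * len(y), y
    if not y:
        return x, "-" * len(x)
    if len(x) == 1:
        # Base case: pair the single character with a matching y[j] if one
        # exists (otherwise y[0]), unless leaving it unpaired scores better.
        j = max(y.find(x), 0)
        sigma = m if y[j] == x else -s
        if sigma - (len(y) - 1) * d >= -(len(y) + 1) * d:
            return "-" * j + x + "-" * (len(y) - j - 1), y
        return x + "-" * len(y), "-" + y
    mid = len(x) // 2
    n = len(y)
    fwd = nw_last_row(x[:mid], y, m, s, d)                # F(M/2, k)
    bwd = nw_last_row(x[mid:][::-1], y[::-1], m, s, d)    # B(M/2, k), reversed index
    k = max(range(n + 1), key=lambda c: fwd[c] + bwd[n - c])   # k*
    left = hirschberg(x[:mid], y[:k], m, s, d)
    right = hirschberg(x[mid:], y[k:], m, s, d)
    return left[0] + right[0], left[1] + right[1]


print(hirschberg("AGTA", "ATA"))
```

On the running example the split point is k* = 1, and the two halves AG / A- and TA / TA concatenate to AGTA / A-TA.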

Page 101: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Outline

• Part I: Algorithms
  – Biological problem
  – Intro to dynamic programming
  – Global sequence alignment
  – Local sequence alignment
  – More efficient algorithms

• Part II: Biological issues
  – Model gaps more accurately
  – Alignment statistics

• Part III: BLAST

Page 102: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

How to model gaps more accurately

Page 103: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

What’s a better alignment?

GACGCCGAACG        GACGCCGAACG
|||||   |||        |||| | | ||
GACGC---ACG        GACG-C-A-CG

Score = 8 x m – 3 x d    Score = 8 x m – 3 x d

However, gaps usually occur in bunches.

- During evolution, chunks of DNA may be lost entirely
- Aligning genomic sequences vs. cDNAs (reverse complementary to mRNAs)

Page 104: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Model gaps more accurately

• Current model:
  – Gap of length n incurs penalty nd

• General:
  – Convex function
  – E.g. γ(n) = c * sqrt(n)

[Figure: two plots of gap penalty against gap length n.]

Page 105: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

General gap dynamic programming

Initialization: same

Iteration:

F(i, j) = max { F(i-1, j-1) + s(xi, yj),
                max_{k=0…i-1} F(k, j) – γ(i-k),
                max_{k=0…j-1} F(i, k) – γ(j-k) }

Termination: same

Running time: O((M+N)MN) (cubic)
Space: O(NM) (linear-space algorithm not applicable)

Page 106: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Compromise: affine gaps

γ(n) = d + (n – 1)e
  d: gap open
  e: gap extension

[Figure: plot of γ(n), with the open penalty d and the extension slope e marked.]

Match: 2
Gap open: -5
Gap extension: -1

GACGCCGAACG        GACGCCGAACG
|||||   |||        |||| | | ||
GACGC---ACG        GACG-C-A-CG

8x2-5-2 = 9        8x2-3x5 = 1

We want to find the optimal alignment with affine gap penalty in

• O(MN) time

• O(MN) or better O(M+N) memory
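To make the arithmetic of the two example scores concrete, here is a small sketch that scores an already-gapped alignment under the affine model. The function name and the mismatch score of -2 are assumptions for illustration (the slide only gives match, gap-open and gap-extension values).

```python
def affine_alignment_score(a, b, match=2, mismatch=-2, gap_open=-5, gap_extend=-1):
    """Score two equal-length gapped strings under an affine gap model."""
    assert len(a) == len(b), "aligned strings must have equal length"
    total = 0
    prev_gap_row = None  # which row held the gap in the previous column
    for ca, cb in zip(a, b):
        if ca == "-" or cb == "-":
            row = "a" if ca == "-" else "b"
            # First column of a gap run pays gap_open, later columns gap_extend.
            total += gap_extend if prev_gap_row == row else gap_open
            prev_gap_row = row
        else:
            total += match if ca == cb else mismatch
            prev_gap_row = None
    return total

print(affine_alignment_score("GACGCCGAACG", "GACGC---ACG"))  # one gap of length 3
print(affine_alignment_score("GACGCCGAACG", "GACG-C-A-CG"))  # three gaps of length 1
```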

Page 107: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Allowing affine gap penalties

• Still three cases
– xi aligned with yj
– xi aligned to a gap
• Are we continuing a gap in x? (if no, start is more expensive)
– yj aligned to a gap
• Are we continuing a gap in y? (if no, start is more expensive)

• We can use a finite state machine to represent the three cases as three states
– The machine has two heads, reading the chars on the two strings separately
– At every step, each head reads 0 or 1 char from each sequence
– Depending on what it reads, goes to a different state, and produces different scores

Page 108: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Finite State Machine

F: have just read 1 char from each seq (xi aligned to yj)

Ix: have read 0 char from x (yj aligned to a gap)

Iy: have read 0 char from y (xi aligned to a gap)

[Figure: three-state machine with states F, Ix and Iy; each transition carries an Input / Output label, shown as "? / ?" on this slide]

Page 109: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

[Figure: the state machine with its transition labels filled in; F is the start state]

Here σ(xi, yj) denotes the substitution score for aligning xi with yj.

Current state   Input      Output      Next state
F               (xi,yj)    σ(xi,yj)    F
F               (-,yj)     d           Ix
F               (xi,-)     d           Iy
Ix              (-,yj)     e           Ix
…               …          …           …

Page 110: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Example: x = AAC, y = ACT

State path: F-F-F-F

AAC
|||
ACT

State path: F-Iy-F-F-Ix

AAC-
 ||
-ACT

State path: F-F-Iy-F-Ix

AAC-
| |
A-CT

Given a pair of sequences, an alignment (not necessarily optimal) corresponds to a state path in the FSM.

Optimal alignment: find a state path to read the two sequences such that the total output score is the highest

Page 111: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Dynamic programming

• We encode this information in three different matrices

• For each element (i, j) we use three variables
– F(i, j): best alignment (score) of x1..xi & y1..yj if xi aligns to yj
– Ix(i, j): best alignment of x1..xi & y1..yj if yj aligns to a gap
– Iy(i, j): best alignment of x1..xi & y1..yj if xi aligns to a gap

[Figure: the last alignment column in each case: xi over yj for F(i, j), a gap over yj for Ix(i, j), xi over a gap for Iy(i, j)]

Page 112: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Transitions into state F (xi aligned to yj):

$$F(i,j) = \max \begin{cases} F(i-1,\,j-1) + \sigma(x_i, y_j) \\ I_x(i-1,\,j-1) + \sigma(x_i, y_j) \\ I_y(i-1,\,j-1) + \sigma(x_i, y_j) \end{cases}$$

Page 113: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Transitions into state Ix (yj aligned to a gap):

$$I_x(i,j) = \max \begin{cases} F(i,\,j-1) + d \\ I_x(i,\,j-1) + e \end{cases}$$

Page 114: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Transitions into state Iy (xi aligned to a gap):

$$I_y(i,j) = \max \begin{cases} F(i-1,\,j) + d \\ I_y(i-1,\,j) + e \end{cases}$$

Page 115: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

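$$F(i,j) = \sigma(x_i, y_j) + \max \begin{cases} F(i-1,\,j-1) & \text{continuing alignment} \\ I_x(i-1,\,j-1) & \text{closing gaps in } x \\ I_y(i-1,\,j-1) & \text{closing gaps in } y \end{cases}$$

$$I_x(i,j) = \max \begin{cases} F(i,\,j-1) + d & \text{opening a gap in } x \\ I_x(i,\,j-1) + e & \text{gap extension in } x \end{cases}$$

$$I_y(i,j) = \max \begin{cases} F(i-1,\,j) + d & \text{opening a gap in } y \\ I_y(i-1,\,j) + e & \text{gap extension in } y \end{cases}$$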

Page 116: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Data dependency

[Figure: F(i, j) depends on the cells at (i-1, j-1) in F, Ix and Iy]

Page 117: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Data dependency

[Figure: Ix(i, j) depends on F(i, j-1) and Ix(i, j-1); Iy(i, j) depends on F(i-1, j) and Iy(i-1, j)]

Page 118: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Data dependency

• If we stack all three matrices
– No cyclic dependency
– Therefore, we can fill in all three matrices in order

Page 119: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Algorithm

• for i = 1:M
– for j = 1:N
• Fill in F(i, j), Ix(i, j), Iy(i, j)
– end
• end
• Optimal score = max (F(M, N), Ix(M, N), Iy(M, N))

• Time: O(MN)
• Space: O(MN), or O(N) when combined with the linear-space algorithm
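The loop above, together with the three recurrences, can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `affine_global` and the boundary values (Ix(0, j) = d + (j-1)e, Iy(i, 0) = d + (i-1)e, everything else on the border set to -∞) are assumptions chosen to match the initialization used in the exercise that follows. Here d and e are negative scores, as in the exercise.

```python
import math

NEG = -math.inf

def affine_global(x, y, sigma, d, e):
    """Fill F, Ix, Iy for global alignment with affine gaps (d, e are negative)."""
    M, N = len(x), len(y)
    F = [[NEG] * (N + 1) for _ in range(M + 1)]
    Ix = [[NEG] * (N + 1) for _ in range(M + 1)]
    Iy = [[NEG] * (N + 1) for _ in range(M + 1)]
    F[0][0] = 0
    for j in range(1, N + 1):       # leading gap in x: y1..yj aligned to gaps
        Ix[0][j] = d + (j - 1) * e
    for i in range(1, M + 1):       # leading gap in y: x1..xi aligned to gaps
        Iy[i][0] = d + (i - 1) * e
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = sigma(x[i - 1], y[j - 1])
            F[i][j] = s + max(F[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1])
            Ix[i][j] = max(F[i][j - 1] + d, Ix[i][j - 1] + e)
            Iy[i][j] = max(F[i - 1][j] + d, Iy[i - 1][j] + e)
    best = max(F[M][N], Ix[M][N], Iy[M][N])
    return best, F, Ix, Iy

# Usage with the parameters of the exercise below (m = 2, s = -2, d = -5, e = -1).
sigma = lambda a, b: 2 if a == b else -2
best, F, Ix, Iy = affine_global("GCAC", "GCC", sigma, d=-5, e=-1)
print(best)
```

Keeping only the previous row of each matrix would be enough to compute the score; the full matrices are kept here so that a traceback could be added.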

Page 120: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Exercise

• x = GCAC

• y = GCC

• m = 2

• s = -2

• d = -5

• e = -1

Page 121: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

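Initialization (rows: x = GCAC, columns: y = GCC; "-" stands for -∞; m = 2, s = -2, d = -5, e = -1)

F: aligned on both
F    -    G    C    C
-    0    -    -    -
G    -
C    -
A    -
C    -

Iy: insertion on y (xi aligned to a gap)
Iy   -    G    C    C
-    -    -    -    -
G   -5
C   -6
A   -7
C   -8

Ix: insertion on x (yj aligned to a gap)
Ix   -    G    C    C
-    -   -5   -6   -7
G    -
C    -
A    -
C    -

Dependencies: F(i, j) comes from F(i-1, j-1), Ix(i-1, j-1) or Iy(i-1, j-1), plus σ(xi, yj); Ix(i, j) comes from F(i, j-1) + d or Ix(i, j-1) + e; Iy(i, j) comes from F(i-1, j) + d or Iy(i-1, j) + e.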

Page 122: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Step: F(1, 1) = σ(G, G) + max(F(0, 0), Ix(0, 0), Iy(0, 0)) = 2 + 0 = 2

Page 123: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Step: F(1, 2) = σ(G, C) + max(F(0, 1), Ix(0, 1), Iy(0, 1)) = -2 + (-5) = -7

Page 124: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Step: F(1, 3) = σ(G, C) + max(F(0, 2), Ix(0, 2), Iy(0, 2)) = -2 + (-6) = -8

Page 125: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Step: Ix(1, 1) = -∞ (both predecessors are -∞); Ix(1, 2) = max(F(1, 1) + d, Ix(1, 1) + e) = 2 - 5 = -3

Page 126: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Step: Ix(1, 3) = max(F(1, 2) + d, Ix(1, 2) + e) = max(-12, -4) = -4

Page 127: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Step: Iy(1, j) = max(F(0, j) + d, Iy(0, j) + e) = -∞ for j = 1, 2, 3

Page 128: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Step: F(2, 1) = σ(C, G) + max(F(1, 0), Ix(1, 0), Iy(1, 0)) = -2 + (-5) = -7

Page 129: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Step: F(2, 2) = σ(C, C) + max(F(1, 1), Ix(1, 1), Iy(1, 1)) = 2 + 2 = 4

Page 130: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Step: F(2, 3) = σ(C, C) + max(F(1, 2), Ix(1, 2), Iy(1, 2)) = 2 + (-3) = -1

Page 131: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Step: Ix(2, 2) = max(F(2, 1) + d, Ix(2, 1) + e) = -12; Ix(2, 3) = max(F(2, 2) + d, Ix(2, 2) + e) = max(-1, -13) = -1

Page 132: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Step: Iy(2, 1) = max(F(1, 1) + d, Iy(1, 1) + e) = 2 - 5 = -3

Page 133: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Step: Iy(2, 2) = F(1, 2) + d = -12; Iy(2, 3) = F(1, 3) + d = -13. Rows 1 and 2 of all three matrices are now complete.

Page 134: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Step (row 3, x3 = A): F(3, 1) = -8, F(3, 2) = -5, F(3, 3) = -2 + 4 = 2; Ix(3, 2) = F(3, 1) + d = -13, Ix(3, 3) = max(F(3, 2) + d, Ix(3, 2) + e) = max(-10, -14) = -10. (F(2, 3) remains -1.)

Page 135: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Step: Iy(3, 1) = max(F(2, 1) + d, Iy(2, 1) + e) = max(-12, -4) = -4; Iy(3, 2) = max(F(2, 2) + d, Iy(2, 2) + e) = max(-1, -13) = -1

Page 136: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Step: Iy(3, 3) = max(F(2, 3) + d, Iy(2, 3) + e) = max(-6, -14) = -6

Page 137: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Step (row 4, x4 = C):
F(4, 1) = -2 + Iy(3, 0) = -9; F(4, 2) = 2 + max(F(3, 1), Ix(3, 1), Iy(3, 1)) = 2 + (-4) = -2; F(4, 3) = 2 + max(F(3, 2), Ix(3, 2), Iy(3, 2)) = 2 + (-1) = 1
Iy(4, 1) = max(F(3, 1) + d, Iy(3, 1) + e) = max(-13, -5) = -5; Iy(4, 2) = max(-10, -2) = -2; Iy(4, 3) = max(-3, -7) = -3
Ix(4, 2) = F(4, 1) + d = -14; Ix(4, 3) = max(F(4, 2) + d, Ix(4, 2) + e) = max(-7, -15) = -7

Page 138: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Termination: optimal score = max(F(4, 3), Ix(4, 3), Iy(4, 3)) = max(1, -7, -3) = 1

Traceback from F(4, 3): F(4, 3) ← Iy(3, 2) ← F(2, 2) ← F(1, 1) ← F(0, 0)

Page 139: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Completed matrices (rows: x = GCAC, columns: y = GCC; "-" stands for -∞; m = 2, s = -2, d = -5, e = -1)

F    -    G    C    C
-    0    -    -    -
G    -    2   -7   -8
C    -   -7    4   -1
A    -   -8   -5    2
C    -   -9   -2    1

Iy   -    G    C    C
-    -    -    -    -
G   -5    -    -    -
C   -6   -3  -12  -13
A   -7   -4   -1   -6
C   -8   -5   -2   -3

Ix   -    G    C    C
-    -   -5   -6   -7
G    -    -   -3   -4
C    -    -  -12   -1
A    -    -  -13  -10
C    -    -  -14   -7

Optimal alignment (score 1):

GCAC
|| |
GC-C

Page 140: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Statistics of alignment

Where does σ(xi, yj) come from?

Are two aligned sequences actually related?

Page 141: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Probabilistic model of alignments

• We’ll first focus on protein alignments without gaps

• Given an alignment, we can consider two possible models
– R: the sequences are related by evolution
– U: the sequences are unrelated

• How can we distinguish these two models?
• How is this view related to the amino-acid substitution matrix?

Page 142: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Model for unrelated sequences

• Assume each position of the alignment is independently sampled from some distribution of amino acids

• ps: probability of amino acid s in the sequences

• Probability of seeing an amino acid s aligned to an amino acid t by chance is
– Pr(s, t | U) = ps * pt

• Probability of seeing an ungapped alignment between x = x1…xn and y = y1…yn randomly is

$$\Pr(x, y \mid U) = \prod_i p_{x_i}\, p_{y_i}$$

Page 143: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Model for related sequences

• Assume each pair of aligned amino acids evolved from a common ancestor

• Let qst be the probability that amino acid s in one sequence is related to t in another sequence

• The probability of an alignment of x and y is given by

$$\Pr(x, y \mid R) = \prod_i q_{x_i y_i}$$

Page 144: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Probabilistic model of Alignments

• How can we decide which model (U or R) is more likely?

• One principled way is to consider the relative likelihood of the two models (the odds ratio)

$$\frac{\Pr(x, y \mid R)}{\Pr(x, y \mid U)} = \prod_i \frac{q_{x_i y_i}}{p_{x_i}\, p_{y_i}}$$

– A higher ratio means that R is more likely than U

Page 145: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Log odds ratio

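• Taking the logarithm, we get

$$\log \frac{\Pr(x, y \mid R)}{\Pr(x, y \mid U)} = \sum_i \log \frac{q_{x_i y_i}}{p_{x_i}\, p_{y_i}}$$

• Recall that the score of an alignment is given by

$$S = \sum_i \sigma(x_i, y_i)$$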

Page 146: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

• Therefore, if we define

$$\sigma(s, t) = \log \frac{q_{st}}{p_s\, p_t}$$

• We are actually defining the alignment score as the log odds ratio between the two models R and U

Page 147: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

How to get the probabilities?

• ps can be counted from the available protein sequences

• But how do we get qst? (the probability that s and t have a common ancestor)

• Counted from trusted alignments of related sequences

Page 148: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Protein Substitution Matrices

• Two popular sets of matrices for protein sequences
– PAM matrices [Dayhoff et al, 1978]
• Better for aligning closely related sequences
– BLOSUM matrices [Henikoff & Henikoff, 1992]
• For both closely and remotely related sequences

Page 149: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

BLOSUM-N matrices

• Constructed from a database called BLOCKS
• Contain many closely related sequences
– Conserved amino acids may be over-counted

• N = 62: the probabilities qst were computed using trusted alignments with no more than 62% identity
– identity: % of matched columns

• Using this matrix, the Smith-Waterman algorithm is most effective in detecting real alignments with a similar identity level (i.e. ~62%)

Page 150: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

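[Figure: a BLOSUM matrix with the following annotations]

• Positive for chemically similar substitution
• Common amino acids get low weights
• Rare amino acids get high weights

$$\sigma(s, t) = \frac{1}{\lambda} \log \frac{q_{st}}{p_s\, p_t}$$

λ: scaling factor to convert the score to an integer. Important: when you are told that a scoring matrix is in half-bits => λ = ½ ln 2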

Page 151: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

BLOSUM-N matrices

• If you want to detect homologous genes with high identity, you may want a BLOSUM matrix with higher N, say BLOSUM75

• On the other hand, if you want to detect remote homology, you may want to use lower N, say BLOSUM50

• BLOSUM-62: good for most purposes

[Figure: scale of N from 45 (weak homology) through 62 to 90 (strong homology)]

Page 152: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

For DNAs

• No database of trusted alignments to start with

• Specify the percentage identity you would like to detect

• You can then get the substitution matrix by some calculation

Page 153: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

For example

• Suppose pA = pC = pT = pG = 0.25

• We want 88% identity

• qAA = qCC = qTT = qGG = 0.22, the rest = 0.12/12 = 0.01

• σ(A, A) = σ(C, C) = σ(G, G) = σ(T, T) = log (0.22 / (0.25*0.25)) = 1.26

• σ(s, t) = log (0.01 / (0.25*0.25)) = -1.83 for s ≠ t

Page 154: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Substitution matrix

A C G T

A 1.26 -1.83 -1.83 -1.83

C -1.83 1.26 -1.83 -1.83

G -1.83 -1.83 1.26 -1.83

T -1.83 -1.83 -1.83 1.26

Page 155: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

• Scale won’t change the alignment
• Multiply by 4 and then round off to get integers

A C G T

A 5 -7 -7 -7

C -7 5 -7 -7

G -7 -7 5 -7

T -7 -7 -7 5
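The calculation on the last three slides can be reproduced with a short sketch. The function name `log_odds_matrix` and its parameters are hypothetical; it assumes a uniform background (ps = 1/4), equal probability for every match pair and for every mismatch pair, and natural logarithms, as in the worked example.

```python
import math

def log_odds_matrix(identity, alphabet="ACGT", scale=1.0):
    """Log-odds scores for a target identity, assuming uniform background."""
    n = len(alphabet)
    p = 1.0 / n                                    # background probability p_s
    q_match = identity / n                         # q_ss, e.g. 0.88 / 4 = 0.22
    q_mismatch = (1.0 - identity) / (n * (n - 1))  # q_st, e.g. 0.12 / 12 = 0.01
    matrix = {}
    for s in alphabet:
        for t in alphabet:
            q = q_match if s == t else q_mismatch
            matrix[(s, t)] = scale * math.log(q / (p * p))
    return matrix

raw = log_odds_matrix(0.88)
scaled = {pair: round(v) for pair, v in log_odds_matrix(0.88, scale=4).items()}
print(round(raw[("A", "A")], 2), round(raw[("A", "C")], 2))
print(scaled[("A", "A")], scaled[("A", "C")])
```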

Page 156: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Arbitrary substitution matrix

• Say you have a substitution matrix provided by someone

• It’s important to know what you are actually looking for when you use the matrix

Page 157: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

• What’s the difference?
• Which one should I use for my sequences?

NCBI-BLAST

A C G T

A 1 -2 -2 -2

C -2 1 -2 -2

G -2 -2 1 -2

T -2 -2 -2 1

WU-BLAST

A C G T

A 5 -4 -4 -4

C -4 5 -4 -4

G -4 -4 5 -4

T -4 -4 -4 5

Page 158: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

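• We had

$$\sigma(s, t) = \log \frac{q_{st}}{p_s\, p_t}$$

• Scale it, so that

$$\sigma(s, t) = \frac{1}{\lambda} \log \frac{q_{st}}{p_s\, p_t}$$

• Reorganize:

$$q_{st} = p_s\, p_t\, e^{\lambda \sigma(s, t)}$$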

Page 159: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

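• Since all probabilities must sum to 1,

$$\sum_{s,t} q_{st} = 1$$

• We have

$$\sum_{s,t} p_s\, p_t\, e^{\lambda \sigma(s, t)} = 1$$

• Suppose again ps = 0.25 for any s

• We know σ(s, t) from the substitution matrix

• We can solve the equation for λ

• Plug λ into $q_{st} = p_s\, p_t\, e^{\lambda \sigma(s, t)}$ to get qst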

Page 160: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

NCBI-BLAST

A C G T

A 1 -2 -2 -2

C -2 1 -2 -2

G -2 -2 1 -2

T -2 -2 -2 1

λ = 1.33

qst = 0.24 for s = t, and 0.004 for s ≠ t

Translate: 95% identity

WU-BLAST

A C G T

A 5 -4 -4 -4

C -4 5 -4 -4

G -4 -4 5 -4

T -4 -4 -4 5

λ = 0.19

qst = 0.16 for s = t, and 0.03 for s ≠ t

Translate: 65% identity

Page 161: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Details for solving

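Known: σ(s, t) = 1 for s = t, and σ(s, t) = -2 for s ≠ t. Since

$$q_{st} = p_s\, p_t\, e^{\lambda \sigma(s, t)}$$

and $\sum_{s,t} q_{st} = 1$, we have

$$12 \cdot \tfrac{1}{4} \cdot \tfrac{1}{4} \cdot e^{-2\lambda} + 4 \cdot \tfrac{1}{4} \cdot \tfrac{1}{4} \cdot e^{\lambda} = 1$$

Let $e^{\lambda} = x$; we have $\tfrac{3}{4} x^{-2} + \tfrac{1}{4} x = 1$. Hence $x^3 - 4x^2 + 3 = 0$.
• x has three solutions: 3.8, 1, -0.8
• Only the first solution leads to a positive λ = ln (3.8) = 1.33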

A C G T

A 1 -2 -2 -2

C -2 1 -2 -2

G -2 -2 1 -2

T -2 -2 -2 1
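The same equation can also be solved numerically, which helps for matrices where no closed form is convenient. The sketch below uses plain bisection; the function names, the search interval and the uniform-background assumption (ps = 1/4, one match score, one mismatch score) are choices made for illustration only.

```python
import math

def solve_lambda(match, mismatch, n=4, lo=1e-6, hi=10.0, iters=100):
    """Positive root of sum_{s,t} p_s p_t exp(lambda * sigma(s,t)) = 1."""
    p = 1.0 / n

    def f(lam):
        return (n * p * p * math.exp(lam * match)
                + n * (n - 1) * p * p * math.exp(lam * mismatch) - 1.0)

    # lambda = 0 is always a trivial root. Because the expected score is
    # negative, f is below 0 just above 0 and above 0 for large lambda.
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def target_identity(match, mismatch, n=4):
    """Fraction of identical columns implied by the matrix: sum of q_ss."""
    lam = solve_lambda(match, mismatch, n)
    return n * (1.0 / n) ** 2 * math.exp(lam * match)

for name, m, s in [("NCBI-BLAST", 1, -2), ("WU-BLAST", 5, -4)]:
    print(name, round(solve_lambda(m, s), 2), round(target_identity(m, s), 2))
```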

Page 162: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Statistics of alignment

Where does σ(xi, yj) come from?

Are two aligned sequences actually related?

Page 163: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Statistics of Alignment Scores

• Q: How do we assess whether an alignment provides good evidence for homology (i.e., the two sequences are evolutionarily related)?
– Is a score 82 good? What about 180?

• A: determine how likely it is that such an alignment score would result from chance

Page 164: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

P-value of alignment

• p-value
– The probability that the alignment score can be obtained from aligning random sequences
– Small p-value means the score is unlikely to happen by chance

• The most common thresholds are 0.01 and 0.05
– Also depend on purpose of comparison and cost of misclaim

Page 165: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Statistics of global seq alignment

• Theory only applies to local alignment
• For global alignment, your best bet is to do Monte-Carlo simulation
– What’s the chance you can get a score as high as the real alignment by aligning two random sequences?

• Procedure
– Given sequences X, Y
– Compute a global alignment (score = S)
– Randomly shuffle sequence X (or Y) N times, obtain X1, X2, …, XN
– Align each Xi with Y (score = Ri)
– P-value: the fraction of Ri >= S
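The procedure above can be sketched as follows. To keep the example self-contained it uses a simple linear-gap Needleman-Wunsch scorer with made-up parameters (match 1, mismatch -1, gap -2); the function names, the scoring values and the fixed random seed are all assumptions for illustration, and any global aligner could be substituted.

```python
import random

def nw_score(x, y, match=1, mismatch=-1, gap=-2):
    """Global alignment score with a linear gap penalty, one row at a time."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * gap]
        for j in range(1, len(y) + 1):
            diag = prev[j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def shuffle_p_value(x, y, n_shuffles=200, seed=0):
    """Fraction of shuffled copies of x that align to y at least as well as x."""
    rng = random.Random(seed)
    s = nw_score(x, y)
    letters = list(x)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # shuffling preserves the composition of x
        if nw_score("".join(letters), y) >= s:
            hits += 1
    return s, hits / n_shuffles
```

With N shuffles the smallest non-zero p-value that can be reported is 1/N, so a result of 0 should be read as "less than 1/N".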

Page 166: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Human HEXA

Fly HEXO1

Score = -74

Page 167: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

[Figure: histogram of alignment scores; x-axis: Alignment Score (-95 to -50), y-axis: Number of Sequences (0 to 45); the observed score -74 is marked]

Distribution of the alignment scores between fly HEXO1 and 200 randomly shuffled human HEXA sequences

There are 88 random sequences with alignment score >= -74. So: p-value = 88 / 200 = 0.44 => alignment is not significant

Page 168: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

……………………………………………………

Mouse HEXA

Human HEXA

Score = 732

Page 169: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

[Figure: histogram of alignment scores; x-axis: Alignment Score (-200 to 800), y-axis: Number of Sequences (0 to 45); the observed score 732 is marked]

Distribution of the alignment scores between mouse HEXA and 200 randomly shuffled human HEXA sequences

[Figure: the same distribution zoomed in; x-axis: Alignment Score (-230 to -150), y-axis: Number of Sequences (0 to 45)]

• No random sequences with alignment score >= 732
– So: the P-value is less than 1 / 200 = 0.005

• To get a smaller p-value, we have to align more random sequences
– Very slow

• Unless we can fit a distribution (e.g. normal distribution)
– Such a distribution may not be generalizable
– No theory exists for the global alignment score distribution

Page 170: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Statistics for local alignment

• Elegant theory exists
• Score for ungapped local alignment follows extreme value distribution (Gumbel distribution)

[Figure: a normal distribution next to an extreme value distribution]

An example extreme value distribution:

• Randomly sample 100 numbers from a normal distribution, and compute max

• Repeat 100 times.

• The max values will follow extreme value distribution
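The sampling experiment described above is easy to try. This is a minimal sketch using only the standard library; the function name and the fixed seed are illustrative choices, and the standard normal (mean 0, standard deviation 1) is an assumption since the slide does not specify the parameters.

```python
import random
import statistics

def sample_maxima(n_values=100, n_repeats=100, seed=0):
    """Return the maximum of n_values standard-normal draws, n_repeats times."""
    rng = random.Random(seed)
    return [max(rng.gauss(0.0, 1.0) for _ in range(n_values))
            for _ in range(n_repeats)]

maxima = sample_maxima()
# The maxima cluster well above the mean of the underlying normal (0) and
# their histogram is skewed with a longer right tail.
print(round(statistics.mean(maxima), 2), round(min(maxima), 2), round(max(maxima), 2))
```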

Page 171: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Statistics for local alignment

• Given two unrelated sequences of lengths M, N
• Expected number of ungapped local alignments with score at least S can be calculated by
– E(S) = KMN exp[-λS]
– Known as E-value
– λ: scaling factor as computed in last lecture
– K: empirical parameter ~ 0.1
• Depends on sequence composition and substitution matrix

Page 172: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

P-value for local alignment score

• P-value for a local alignment with score S

$$P(x \ge S) = 1 - \exp(-E(S)) = 1 - \exp\!\left(-KMN\,e^{-\lambda S}\right) \approx E(S)$$

when P is small.

Page 173: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Example

• You are aligning two sequences, each has 1000 bases

• m = 1, s = -1, d = -inf (ungapped alignment)

• You obtain a score 20

• Is this score significant?

Page 174: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

• λ = ln 3 = 1.1 (computed as discussed on slide #41)
• E(S) = KMN exp{-λS}
• E(20) = 0.1 * 1000 * 1000 * 3^-20 = 3 x 10^-5
• P-value = 3 x 10^-5 << 0.05
• The alignment is significant

[Figure: histogram of alignment scores; x-axis: Alignment Score (9 to 18), y-axis: Number of Sequences (0 to 400); the observed score 20 lies beyond the right edge of the distribution]

Distribution of 1000 random sequence pairs

Page 175: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Multiple-testing problem

• Searching a 1000-base sequence against a database of 10^6 sequences (each of length 1000)

• How significant is a score 20 now?
• You are essentially comparing 1000 bases with 1000 x 10^6 = 10^9 bases (ignore edge effect)
• E(20) = 0.1 * 1000 * 10^9 * 3^-20 = 30
• By chance we would expect to see 30 matches
– The P-value (probability of seeing at least one match with score >= 20) is 1 – e^-30 = 0.9999999999
– The alignment is not significant
– Caution: it does NOT mean that the two sequences are unrelated. Rather, it simply means that you have NO confidence to say whether the two sequences are related.
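Both worked examples (pairwise and database search) follow from the same two formulas, which the sketch below evaluates. The function names are hypothetical; K = 0.1 and λ = ln 3 are the values used on the slides.

```python
import math

def e_value(S, M, N, lam, K=0.1):
    """Expected number of ungapped local alignments with score >= S."""
    return K * M * N * math.exp(-lam * S)

def p_value(S, M, N, lam, K=0.1):
    """P(at least one alignment with score >= S) = 1 - exp(-E(S))."""
    # expm1 keeps precision when E(S) is tiny.
    return -math.expm1(-e_value(S, M, N, lam, K))

lam = math.log(3)
# 1000 bases against 1000 bases
print(e_value(20, 1000, 1000, lam), p_value(20, 1000, 1000, lam))
# 1000 bases against a database of 10^6 sequences of 1000 bases each
print(e_value(20, 1000, 1000 * 10**6, lam), p_value(20, 1000, 1000 * 10**6, lam))
```

The first E-value is close to 3 x 10^-5 and the second is close to 30, matching the rounded figures on the slides.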

Page 176: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Score threshold to determine significance

• You want a p-value that is very small (even after taking into consideration multiple-testing)

• What S will guarantee you a significant p-value?

E(S) ≈ P(S) << 1

=> KMN exp[-λS] << 1

=> log(KMN) - λS < 0

=> S > T + log(MN) / λ    (T = log(K) / λ, usually small)

Page 177: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Score threshold to determine significance

• In the previous example
– m = 1, s = -1, d = -inf => λ = 1.1

• Aligning 1000bp vs 1000bp
S > log(10^6) / 1.1 = 13.
So 20 is significant.

• Searching 1000bp against 10^6 x 1000bp
S > log(10^12) / 1.1 = 25.
So 20 is not significant.
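The two thresholds above can be checked with a couple of lines. The helper names are hypothetical; as on the slides, the search-space term log(MN) / λ is computed separately from the small correction T = log(K) / λ, and natural logarithms are assumed.

```python
import math

def search_space_term(M, N, lam):
    """log(MN) / lambda: the part of the threshold that grows with search space."""
    return math.log(M * N) / lam

def k_term(K, lam):
    """T = log(K) / lambda: a small correction (negative when K < 1)."""
    return math.log(K) / lam

lam = 1.1
print(round(search_space_term(1000, 1000, lam)))          # pairwise comparison
print(round(search_space_term(1000, 1000 * 10**6, lam)))  # database search
print(round(k_term(0.1, lam), 2))
```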

Page 178: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Statistics for gapped local alignment

• Theory not well developed
• Extreme value distribution works well empirically
• Need to estimate K and λ empirically
– Given the database and substitution matrix, generate some random sequence pairs
– Do local alignment
– Fit an extreme value distribution to obtain K and λ

Page 179: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Alignment statistics summary

• How to obtain a substitution matrix?
– Obtain qst and ps from established alignments (for DNA: from your knowledge)
– Computing score: $\sigma(s, t) = \frac{1}{\lambda} \log \frac{q_{st}}{p_s\, p_t}$

• How to understand arbitrary substitution matrix?
– Solve $\sum_{s,t} p_s\, p_t\, e^{\lambda \sigma(s, t)} = 1$ to obtain λ and target qst
– Which tells you what percent identity you are expecting

• How to understand alignment score?
– Probability that a score can be expected from chance
– Global alignment: Monte-Carlo simulation
– Local alignment: Extreme Value Distribution
• Estimate p-value from a score
• Determine a score threshold without computing a p-value

Page 180: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Part III: Heuristic Local Sequence Alignment: BLAST

Page 181: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

State of biological databases

Sequenced Genomes (size in bp):

Human        3 x 10^9      Yeast        1.2 x 10^7
Mouse        2.7 x 10^9    Rat          2.6 x 10^9
Neurospora   4 x 10^7      Fugu fish    3.3 x 10^8
Tetraodon    3 x 10^8      Mosquito     2.8 x 10^8
Drosophila   1.2 x 10^8    Worm         1.0 x 10^8
Rice         1.0 x 10^9    Arabidopsis  1.2 x 10^8
sea squirts  1.6 x 10^8

Current rate of sequencing (before new-generation sequencing):
4 big labs x 3 x 10^9 bp/year/lab
10s of small labs
Private sectors

With new-generation sequencing: easily generating billions of reads daily

Page 182: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Some useful applications of alignments

Given a newly discovered gene,

- Does it occur in other species?

Assume we try Smith-Waterman:

Our new gene: 10^4 bases

The entire genomic database: 10^10 - 10^11 bases

May take several weeks!

Page 183: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Some useful applications of alignments

Given a newly sequenced organism,

- Which subregions align with other organisms?
- Potential genes
- Other functional units

Assume we try Smith-Waterman:

Our newly sequenced mammal: 3 x 10^9 bases

The entire genomic database: 10^10 - 10^11 bases

> 1000 years ???

Page 184: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

BLAST

• Basic Local Alignment Search Tool
– Altschul, Gish, Miller, Myers, Lipman, J Mol Biol 1990
– The most widely used bioinformatics tool

• Which is better: a long mediocre match or a few nearby, short, strong matches with the same total score?
– Score-wise, exactly equivalent
– Biologically, the latter may be more interesting, & is common
– At least, if we must miss some, we would rather miss the former

• BLAST is a heuristic algorithm emphasizing the latter
– speed/sensitivity tradeoff: BLAST may miss the former, but gains greatly in speed

Page 185: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

BLAST

• Available at NCBI (National Center for Biotechnology Information) for download and online use. http://blast.ncbi.nlm.nih.gov/
• Along with many sequence databases

Main idea:
1. Construct a dictionary of all the words in the query
2. Initiate a local alignment for each word match between query and DB

Running Time: O(MN)
However, orders of magnitude faster than Smith-Waterman

[Figure: the query scanned against the DB]

Page 186: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

BLAST Original Version

Dictionary:

All words of length k (~11 for DNA, 3 for proteins)

Alignment initiated between words of alignment score ≥ T (typically T = k)

Alignment:

Ungapped extensions until score below statistical threshold

Output:

All local alignments with score > statistical threshold

[Figure: the query is scanned against the DB; word matches seed the extensions]

Page 187: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

BLAST Original Version

[Figure: dot matrix with the sequence A C G A A G T A A G G T C C A G T along the top and the second sequence along the side]

Example:

k = 4, T = 4

The matching word GGTC initiates an alignment

Extension to the left and right with no gaps until alignment falls < 50%

Output:

GTAAGGTCC
GTTAGGTCC
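The seed-and-extend idea can be sketched in a few lines. This is a toy illustration, not BLAST itself: it requires exact word matches, scores +1 / -1, and stops each extension with a simple X-drop rule (stop once the running score falls 2 below the best seen) instead of the 50% rule mentioned on the slide. The 17-base sequence from the slide is used as the query; the database string `db` is made up for the example, and all function names are hypothetical.

```python
def build_word_index(query, k):
    """Dictionary of every length-k word in the query -> start positions."""
    index = {}
    for i in range(len(query) - k + 1):
        index.setdefault(query[i:i + k], []).append(i)
    return index

def extend_ungapped(query, db, qi, di, k, match=1, mismatch=-1, x_drop=2):
    """Extend an exact seed at query[qi:qi+k] / db[di:di+k] without gaps."""
    def walk(q, d, step):
        score = best = 0
        length = best_len = 0
        while 0 <= q < len(query) and 0 <= d < len(db):
            score += match if query[q] == db[d] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score >= x_drop:
                break
            q += step
            d += step
        return best, best_len

    right_score, right_len = walk(qi + k, di + k, +1)
    left_score, left_len = walk(qi - 1, di - 1, -1)
    q_start, d_start = qi - left_len, di - left_len
    total_len = left_len + k + right_len
    return (k * match + left_score + right_score,
            query[q_start:q_start + total_len],
            db[d_start:d_start + total_len])

def blast_like_search(query, db, k=4):
    """Scan db, look up each word in the query index, extend every hit."""
    index = build_word_index(query, k)
    hits = []
    for di in range(len(db) - k + 1):
        for qi in index.get(db[di:di + k], []):
            hits.append(extend_ungapped(query, db, qi, di, k))
    return hits

query = "ACGAAGTAAGGTCCAGT"       # sequence shown on the slide
db = "CTGATCCTGTTAGGTCCTTG"       # hypothetical database sequence
for hit in sorted(set(blast_like_search(query, db))):
    print(hit)
```

Several overlapping seeds on the same diagonal extend to the same segment pair, which is why the hits are de-duplicated before printing.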

Page 188: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Gapped BLAST

[Figure: dot matrix with the sequence A C G A A G T A A G G T C C A G T along the top and the second sequence along the side]

Added features:

• Pairs of words can initiate alignment

• Extensions with gaps in a band around anchor

Output:

GTAAGGTCCAGT
GTTAGGTC-AGT

Page 189: CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Example

Query: gattacaccccgattacaccccgattaca (29 letters) [2 mins]

Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences)
1,726,556 sequences; 8,074,398,388 total letters

>gi|28570323|gb|AC108906.9| Oryza sativa chromosome 3 BAC OSJNBa0087C10 genomic sequence, complete sequence
Length = 144487

Score = 34.2 bits (17), Expect = 4.5
Identities = 20/21 (95%)
Strand = Plus / Plus

Query: 4      tacaccccgattacaccccga 24
              ||||||| |||||||||||||
Sbjct: 125138 tacacccagattacaccccga 125158

Score = 34.2 bits (17), Expect = 4.5
Identities = 20/21 (95%)
Strand = Plus / Plus

Query: 4      tacaccccgattacaccccga 24
              ||||||| |||||||||||||
Sbjct: 125104 tacacccagattacaccccga 125124

>gi|28173089|gb|AC104321.7| Oryza sativa chromosome 3 BAC OSJNBa0052F07 genomic sequence, complete sequence
Length = 139823

Score = 34.2 bits (17), Expect = 4.5
Identities = 20/21 (95%)
Strand = Plus / Plus

Query: 4    tacaccccgattacaccccga 24
            ||||||| |||||||||||||
Sbjct: 3891 tacacccagattacaccccga 3911


Example

Query: Human atoh enhancer, 179 letters [1.5 min]

Result: 57 blast hits (columns: hit, score in bits, E-value)

1. gi|7677270|gb|AF218259.1|AF218259 Homo sapiens ATOH1 enhanc...   355   1e-95
2. gi|22779500|gb|AC091158.11| Mus musculus Strain C57BL6/J ch...   264   4e-68
3. gi|7677269|gb|AF218258.1|AF218258 Mus musculus Atoh1 enhanc...   256   9e-66
4. gi|28875397|gb|AF467292.1| Gallus gallus CATH1 (CATH1) gene...    78   5e-12
5. gi|27550980|emb|AL807792.6| Zebrafish DNA sequence from clo...    54   7e-05
6. gi|22002129|gb|AC092389.4| Oryza sativa chromosome 10 BAC O...    44   0.068
7. gi|22094122|ref|NM_013676.1| Mus musculus suppressor of Ty ...    42   0.27
8. gi|13938031|gb|BC007132.1| Mus musculus, Similar to suppres...    42   0.27

gi|7677269|gb|AF218258.1|AF218258 Mus musculus Atoh1 enhancer sequence
Length = 1517

Score = 256 bits (129), Expect = 9e-66
Identities = 167/177 (94%), Gaps = 2/177 (1%)
Strand = Plus / Plus

Query: 3    tgacaatagagggtctggcagaggctcctggccgcggtgcggagcgtctggagcggagca 62
            ||||||||||||| ||||||||||||||||||| ||||||||||||||||||||||||||
Sbjct: 1144 tgacaatagaggggctggcagaggctcctggccccggtgcggagcgtctggagcggagca 1203

Query: 63   cgcgctgtcagctggtgagcgcactctcctttcaggcagctccccggggagctgtgcggc 122
            ||||||||||||||||||||||||||  ||||||||| |||||||||||||||| |||||
Sbjct: 1204 cgcgctgtcagctggtgagcgcactc-gctttcaggccgctccccggggagctgagcggc 1262

Query: 123  cacatttaacaccatcatcacccctccccggcctcctcaacctcggcctcctcctcg 179
            ||||||||||||| || ||| |||||||||||||||||||| |||||||||||||||
Sbjct: 1263 cacatttaacaccgtcgtca-ccctccccggcctcctcaacatcggcctcctcctcg 1318


BLAST Score: bit score vs raw score

• Bit score is converted from raw score by taking into account K and λ:
  $S' = (\lambda S - \ln K) / \ln 2$

• To compute E-value from bit score:
  $E = K M' N' e^{-\lambda S} = M' N' \, 2^{-S'}$

• Critical score is now:
  $S^* = \log_2(M' N')$
  If $S' \gg S^*$: significant
  If $S' \ll S^*$: not significant

(M’ ~ M, N’ ~ N)
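
The conversions above can be checked numerically against the gattaca example (raw score 17, reported as 34.2 bits). This is a small sketch under assumptions: the values λ = 1.374 and K = 0.711 are ones often quoted for blastn with +1/-3 scoring and are not given in these slides, and the raw lengths M and N stand in for the effective lengths M' and N'. The function names are hypothetical.

```python
import math


def bit_score(raw, lam, K):
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw - math.log(K)) / math.log(2)


def e_value(bits, m, n):
    """E = M'N' * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


def critical_bits(m, n):
    """S* = log2(M'N'): the bit score at which E is about 1."""
    return math.log2(m * n)


lam, K = 1.374, 0.711        # assumed parameters, see the note above
m, n = 29, 8_074_398_388     # raw query and database lengths from the example

bits = bit_score(17, lam, K)
print(round(bits, 1))         # close to the reported 34.2 bits
print(e_value(bits, m, n))    # E-value using raw lengths
print(critical_bits(m, n))    # compare with the bit score
```

With raw lengths the E-value comes out somewhat above the reported 4.5, which is consistent with BLAST using effective lengths shorter than M and N. The bit score also falls below the critical score, so this hit is not significant.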


Different types of BLAST

• blastn: search nucleic acid databases
• blastp: search protein databases
• blastx: you give a nucleic acid sequence, search protein databases
• tblastn: you give a protein sequence, search nucleic acid databases
• tblastx: you give a nucleic acid sequence, search nucleic acid databases, implicitly translate both into protein sequences


BLAST pros and cons

• Advantages
  – Fast!!!!
  – A few minutes to search a database of 10^11 bases

• Disadvantages
  – Sensitivity may be low
  – Often misses weak homologies

• New improvements
  – Make it even faster
    • Mainly for aligning very similar sequences or really long sequences
      – E.g. whole genome vs whole genome
  – Make it more sensitive
    • PSI-BLAST: iteratively add more homologous sequences
    • PatternHunter: discontinuous seeds
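
The idea of a discontinuous (spaced) seed can be sketched as follows. A mask such as 1101 requires matches only at the positions marked 1, so a hit survives a mismatch at a 0 position. The mask, the test sequences, and the function name are hypothetical, and the mask is much shorter than seeds used in practice.

```python
def spaced_hits(query, db, mask):
    """Return (i, j) where query and db agree at every '1' position of mask."""
    care = [p for p, c in enumerate(mask) if c == "1"]
    span = len(mask)
    index = {}
    for i in range(len(query) - span + 1):
        key = "".join(query[i + p] for p in care)
        index.setdefault(key, []).append(i)
    hits = []
    for j in range(len(db) - span + 1):
        key = "".join(db[j + p] for p in care)
        hits.extend((i, j) for i in index.get(key, []))
    return hits


query = "ACGTACGT"
db = "ACCTACGT"   # differs from the query at position 2 (G vs C)

# The contiguous seed 1111 misses the window at (0, 0); the spaced seed 1101 finds it.
print((0, 0) in spaced_hits(query, db, "1111"))  # False
print((0, 0) in spaced_hits(query, db, "1101"))  # True
```

Both masks still find the exact match at (4, 4). The spaced mask gains sensitivity at mismatched positions while keeping the same dictionary lookup.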


Variants of BLAST

NCBI-BLAST: most widely used version

WU-BLAST (Washington University BLAST): another popular version
  Optimized, added features

MEGABLAST:
  Optimized to align very similar sequences. Linear gap penalty

BLAT: Blast-Like Alignment Tool

BlastZ:
  Optimized for aligning two genomes

PSI-BLAST:
  BLAST produces many hits
  Those are aligned, and a pattern is extracted
  Pattern is used for next search; above steps iterated
  Sensitive for weak homologies
  Slower


Summary

• Part I: Algorithms
  – Global sequence alignment: Needleman-Wunsch
  – Local sequence alignment: Smith-Waterman
  – Improvement on space and time

• Part II: Biological issues
  – Model gaps more accurately: affine gap penalty
  – Alignment statistics

• Part III: Heuristic algorithms
  – BLAST family