CS 5263 Bioinformatics

CS 5263 Bioinformatics Lecture 4: Global Sequence Alignment Algorithms


Page 1: CS 5263 Bioinformatics

CS 5263 Bioinformatics

Lecture 4: Global Sequence Alignment Algorithms

Page 2: CS 5263 Bioinformatics

Roadmap

• Review of last lecture

• More global sequence alignment algorithms

Page 3: CS 5263 Bioinformatics

• Given a scoring scheme,
– Match: m
– Mismatch: -s
– Gap: -d

• We can easily compute an optimal alignment by dynamic programming

Page 4: CS 5263 Bioinformatics

• In a completed alignment between a pair of sequences X = x1x2…xM, Y = y1y2…yN

• If we look at any column of the alignment, there are only three possibilities

– xi is aligned to yj

– xi is aligned to a gap

– yj is aligned to a gap

Page 5: CS 5263 Bioinformatics

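• Since the alignment score F(M, N) is the sum of the scores of all aligned columns, it can be broken down according to the last column:

$$F(M, N) = \max \begin{cases} F(M-1, N-1) + \sigma(x_M, y_N) \\ F(M-1, N) - d \\ F(M, N-1) - d \end{cases}$$

where σ(a, b) is the match/mismatch score: m if a = b, and -s otherwise.

Page 6: CS 5263 Bioinformatics

• And recursively:

$$F(i, j) = \max \begin{cases} F(i-1, j-1) + \sigma(x_i, y_j) \\ F(i-1, j) - d \\ F(i, j-1) - d \end{cases}$$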
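The recurrence can be sketched in a few lines of Python. This is a minimal illustration, assuming the simple scheme from the earlier slide (match m, mismatch -s, gap -d); the function name `nw_matrix` and the default values m = s = d = 1 are choices made for this example only.

```python
def nw_matrix(x, y, m=1, s=1, d=1):
    """Fill the global alignment matrix F, with x along the rows and y along the columns."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    # First column and first row: a prefix aligned entirely to gaps.
    for i in range(1, M + 1):
        F[i][0] = -i * d
    for j in range(1, N + 1):
        F[0][j] = -j * d
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(
                F[i - 1][j - 1] + sigma,  # x_i aligned to y_j
                F[i - 1][j] - d,          # x_i aligned to a gap
                F[i][j - 1] - d,          # y_j aligned to a gap
            )
    return F


F = nw_matrix("ATA", "AGTA")
print(F[-1][-1])  # optimal global alignment score F(M, N)
```

Each cell depends only on its upper-left, upper, and left neighbours, so filling the matrix row by row is enough.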

Page 7: CS 5263 Bioinformatics

[Figure: dynamic programming matrix with corner cells F(0,0) and F(M,N)]

Page 8: CS 5263 Bioinformatics

[Figure: dynamic programming matrix with corner cells F(0,0) and F(M,N)]

Page 9: CS 5263 Bioinformatics

Trace-back

F(i,j)   j =   0    1    2    3    4
                    A    G    T    A
i = 0          0   -1   -2   -3   -4
    1   A     -1    1    0   -1   -2
    2   T     -2    0    0    1    0
    3   A     -3   -1   -1    0    2

Alignment recovered by trace-back from F(3,4):

AGTA
A-TA
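The trace-back on this slide can be reproduced with a short sketch. It assumes match = 1, mismatch = gap = -1 (the values that generate the matrix above); the function name `align` and the tie-breaking order (diagonal first, then up, then left) are choices made for this example, so other equally good alignments may exist.

```python
def align(x, y, m=1, s=1, d=1):
    """Return (score, aligned_x, aligned_y) for a global alignment of x and y."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -i * d
    for j in range(1, N + 1):
        F[0][j] = -j * d
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sigma, F[i - 1][j] - d, F[i][j - 1] - d)

    # Walk back from F(M, N) to F(0, 0), re-deriving which case produced each cell.
    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sigma = m if x[i - 1] == y[j - 1] else -s
            if F[i][j] == F[i - 1][j - 1] + sigma:
                ax.append(x[i - 1]); ay.append(y[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-")  # x_i aligned to a gap
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])  # y_j aligned to a gap
            j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))


score, ax, ay = align("ATA", "AGTA")
print(score)
print(ay)
print(ax)
```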

Page 10: CS 5263 Bioinformatics

Graph representation

[Figure: alignment graph for the sequences AGTA and ATA, with nodes from (0,0) to (3,4) and a score on each edge]

• Number of steps: length of the alignment

• Path length: alignment score

• Alignment: find the longest path from (0, 0) to (3, 4)

• The general longest-path problem cannot be solved by DP. The longest path on this graph can be found by DP because the graph contains no cycles.

Edge types:
– a gap in the 2nd sequence
– a gap in the 1st sequence
– match / mismatch

Values on vertical/horizontal edges: -d (here -1)
Values on diagonal edges: m or -s

Page 11: CS 5263 Bioinformatics

Question

• If we change the scoring scheme, will the optimal alignment change?
– Original: match = 1, mismatch = gap = -1
– New: match = 2, mismatch = gap = 0
– New: match = 2, mismatch = gap = -2?
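One possible way to reason about this question is to express each new score in terms of the original one. The symbols S(A), for the score of an alignment A under the original scheme, and L(A), for its number of columns, are introduced here for illustration and do not appear on the slide.

```latex
S_{(2,\,0,\,0)}(A) = S(A) + L(A), \qquad S_{(2,\,-2,\,-2)}(A) = 2\,S(A)
```

Doubling every score keeps the ranking of alignments, so the optimal alignment stays the same. Adding 1 to every column rewards alignments with more columns (that is, more gaps), so the optimal alignment can change.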

Page 12: CS 5263 Bioinformatics

Number of alignments

• Is equal to the number of distinct paths from (0, 0) to (m, n)

[Figure: the five alignment paths for A versus BC, with the corresponding alignments]

A-    A--    --A    -A-    -A
BC    -BC    BC-    B-C    BC

Page 13: CS 5263 Bioinformatics

• How to count?
– Homework assignment
– Hint: dynamic programming
– Or analytically
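As a hedged illustration of the dynamic programming hint, the number of paths into a node is the sum of the path counts of its three predecessors. The function name `count_alignments` is hypothetical, and this is one possible sketch rather than the intended homework solution.

```python
def count_alignments(m, n):
    """Count paths from (0, 0) to (m, n) using diagonal, vertical and horizontal steps."""
    # Along the first row and first column there is exactly one path (all gaps).
    C = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            C[i][j] = C[i - 1][j - 1] + C[i - 1][j] + C[i][j - 1]
    return C[m][n]


print(count_alignments(1, 2))  # A versus BC
```

For sequences of length 1 and 2 this gives 5, matching the five alignments of A versus BC listed on the previous slide.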

Page 14: CS 5263 Bioinformatics

However

• Biologically meaningful “distinct” alignments may be far fewer
– All three alignments below may be considered equivalent
– A, B, and C are all aligned to gaps

A--    --A    -A-
-BC    BC-    B-C

Page 15: CS 5263 Bioinformatics

Number of alignments

• We only care about who is aligned to whom, not the gaps

• For two sequences of length m, n, there may be k matches, k = 0 to min(m, n)

• Number of alignments:
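The formula itself is missing from this transcript. A reconstruction that follows from the two bullets above (choose which k positions of each sequence are matched, for every possible k) would be:

```latex
\sum_{k=0}^{\min(m,n)} \binom{m}{k}\binom{n}{k} = \binom{m+n}{n}
```

The closed form on the right is Vandermonde's identity. For m = 1, n = 2 this gives 3, consistent with the A versus BC example: A aligned to B, A aligned to C, or A aligned to neither.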

Page 16: CS 5263 Bioinformatics

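Furthermore

A-    A--
BC    -BC

• Alternating gaps are discouraged / prohibited.

• With most scoring schemes, alternating gaps never appear in an optimal alignment as long as 2d > s: a gap in one sequence immediately followed by a gap in the other scores -d - d, whereas the single aligned column that can replace them scores m or -s, which is higher.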

Page 17: CS 5263 Bioinformatics

[Figure: the five alignments of A versus BC]

A-    A--    --A    -A-    -A
BC    -BC    BC-    B-C    BC

• Special trick?

• No. In most scoring schemes this is achieved automatically
– 2d > s

Page 20: CS 5263 Bioinformatics

Number of alignments

• Homework assignment

• Dynamic programming
– Multiple matrices
– Three states:
• Came from diagonal: can go in any of the three directions
• Came from left: cannot go down
• Came from above: cannot turn right
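The three-state idea can be sketched with one counting matrix per state. The names `D`, `H`, `V` and `count_no_alternating` are assumptions made for this example, as is the choice to treat the start node like a diagonal arrival; this is an illustration, not the official homework solution.

```python
def count_no_alternating(m, n):
    """Count paths from (0, 0) to (m, n) in which a gap in one sequence
    is never immediately followed by a gap in the other."""
    # D, H, V[i][j]: paths reaching (i, j) whose last step was
    # diagonal, horizontal (came from the left) or vertical (came from above).
    D = [[0] * (n + 1) for _ in range(m + 1)]
    H = [[0] * (n + 1) for _ in range(m + 1)]
    V = [[0] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 1  # assumption: the start behaves like a diagonal arrival
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:
                # A diagonal step may follow any state.
                D[i][j] = D[i - 1][j - 1] + H[i - 1][j - 1] + V[i - 1][j - 1]
            if j > 0:
                # A horizontal step may not follow a vertical one.
                H[i][j] = D[i][j - 1] + H[i][j - 1]
            if i > 0:
                # A vertical step may not follow a horizontal one.
                V[i][j] = D[i - 1][j] + V[i - 1][j]
    return D[m][n] + H[m][n] + V[m][n]


print(count_no_alternating(1, 2))  # A versus BC
```

For A versus BC only two of the five alignments survive, A-/BC and -A/BC, since the other three each switch directly from a gap in one sequence to a gap in the other.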

Page 21: CS 5263 Bioinformatics

• Given two sequences of length M, N

• Time: O(MN)
– ok

• Space: O(MN)
– bad
– 1 Mb seq x 1 Mb seq = 10^12 matrix cells, roughly 1000 GB of memory at about one byte per cell

• Can we do better?

Page 22: CS 5263 Bioinformatics

In biology, this kind of alignment is unlikely to be meaningful

abcde--------vwxyz

Page 23: CS 5263 Bioinformatics

Good alignment should appear near the diagonal

Page 24: CS 5263 Bioinformatics

Bounded Dynamic Programming

If we know that x and y are very similar

Assumption: # gaps(x, y) < k

Then xi aligned to yj implies | i – j | < k

Page 25: CS 5263 Bioinformatics

Bounded Dynamic Programming

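Initialization:

F(i, 0), F(0, j) undefined for i, j > k

Iteration:

For i = 1…M
  For j = max(1, i – k)…min(N, i + k)

    F(i, j) = max of:
      F(i – 1, j – 1) + σ(xi, yj)
      F(i, j – 1) – d, if j > i – k
      F(i – 1, j) – d, if j < i + k

(σ(xi, yj) is the match/mismatch score, m or –s.)

Termination: same

[Figure: DP matrix with x1…xM on one axis and y1…yN on the other; only a band of width k on each side of the diagonal is filled]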
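The bounded iteration above can be sketched as follows. The function name `banded_score` and the default scores are assumptions for this example. For readability the sketch still allocates the full matrix and only computes the cells inside the band, so it shows the O(kM) work but not the O(kM) storage.

```python
NEG = float("-inf")  # marks cells outside the band ("undefined")


def banded_score(x, y, k, m=1, s=1, d=1):
    """Best global alignment score among paths that stay within |i - j| <= k."""
    M, N = len(x), len(y)
    if abs(M - N) > k:
        raise ValueError("band too narrow to reach cell (M, N)")
    F = [[NEG] * (N + 1) for _ in range(M + 1)]
    for i in range(min(M, k) + 1):
        F[i][0] = -i * d
    for j in range(min(N, k) + 1):
        F[0][j] = -j * d
    for i in range(1, M + 1):
        for j in range(max(1, i - k), min(N, i + k) + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            best = F[i - 1][j - 1] + sigma
            if j > i - k:                      # left neighbour is inside the band
                best = max(best, F[i][j - 1] - d)
            if j < i + k:                      # upper neighbour is inside the band
                best = max(best, F[i - 1][j] - d)
            F[i][j] = best
    return F[M][N]


print(banded_score("ATA", "AGTA", k=1))
```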

Page 26: CS 5263 Bioinformatics

Analysis

• Time: O(kM) << O(MN)

• Space: O(kM) with some tricks

[Figure: the band of width 2k along the diagonal of the M-row matrix, stored as an M x 2k array]

Page 27: CS 5263 Bioinformatics
Page 28: CS 5263 Bioinformatics

• What if we don’t know k?

• Iterate:
– For k = 2, 4, 8, 16, …
– For each k, we can compute an optimal bounded alignment with score Sk
– Stop when ((min(N, M) – k) * m – 2kd) < Sk, since we will not be able to get a higher score with a larger k
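A rough sketch of this doubling loop is shown below, using the stopping rule exactly as stated on the slide. The function names are hypothetical, and the extra exit once the band covers the whole matrix is an assumption added so the loop always terminates.

```python
def banded_score(x, y, k, m=1, s=1, d=1):
    """Best score among alignments whose path stays within |i - j| <= k
    (returns -inf if the band cannot reach cell (M, N))."""
    M, N = len(x), len(y)
    NEG = float("-inf")
    F = [[NEG] * (N + 1) for _ in range(M + 1)]
    for i in range(min(M, k) + 1):
        F[i][0] = -i * d
    for j in range(min(N, k) + 1):
        F[0][j] = -j * d
    for i in range(1, M + 1):
        for j in range(max(1, i - k), min(N, i + k) + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            # Cells outside the band are still -inf, so they never win the max.
            F[i][j] = max(F[i - 1][j - 1] + sigma, F[i][j - 1] - d, F[i - 1][j] - d)
    return F[M][N]


def doubling_band_score(x, y, m=1, s=1, d=1):
    """Try k = 2, 4, 8, ... and return (S_k, k) once the slide's stopping rule holds."""
    M, N = len(x), len(y)
    k = 2
    while True:
        S_k = banded_score(x, y, k, m, s, d)
        covers_all = k >= max(M, N)  # assumption: the band is now the full matrix
        if covers_all or (min(M, N) - k) * m - 2 * k * d < S_k:
            return S_k, k
        k *= 2


print(doubling_band_score("ATA", "AGTA"))
```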

Page 29: CS 5263 Bioinformatics

• Given two sequences of length M, N

• Time: O(MN)
– ok

• Space: O(MN)
– bad
– 1 Mb seq x 1 Mb seq = 10^12 matrix cells, roughly 1000 GB of memory at about one byte per cell

• Can we do better?

Page 30: CS 5263 Bioinformatics

Linear space algorithm

• If all we need is the alignment score but not the alignment, easy!

We only need to keep two rows

(if you are crafty enough, you only need one row)

But how do we get the alignment?
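The score-only computation with two rows can be sketched like this. It is a minimal example, assuming match m, mismatch -s and gap -d as before; the name `nw_score` is chosen for this illustration.

```python
def nw_score(x, y, m=1, s=1, d=1):
    """Global alignment score using only two rows of the DP matrix."""
    prev = [-j * d for j in range(len(y) + 1)]   # row i - 1
    for i in range(1, len(x) + 1):
        cur = [-i * d]                           # F(i, 0)
        for j in range(1, len(y) + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            cur.append(max(prev[j - 1] + sigma,  # diagonal
                           prev[j] - d,          # from above
                           cur[j - 1] - d))      # from the left
        prev = cur                               # the older row is no longer needed
    return prev[-1]


print(nw_score("ATA", "AGTA"))
```

Because the earlier rows are discarded, the trace-back pointers are gone, which is why the alignment itself cannot be recovered this way.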

Page 31: CS 5263 Bioinformatics

Linear space algorithm

• When we finish, we know how we have aligned the ends of the sequences

Naïve idea: Repeat on the smaller subproblem F(M-1, N-1)

Time complexity: O((M+N)(MN))

[Figure: DP matrix with xM and yN labeled]

Page 32: CS 5263 Bioinformatics

Hirschberg’s idea

• Divide and conquer!

Forward algorithm: align x1x2…xM/2 with Y

F(M/2, k) represents the score of the best alignment between x1x2…xM/2 and y1y2…yk

[Figure: DP matrix of X against Y with row M/2 marked]

Page 33: CS 5263 Bioinformatics

Backward Algorithm

Backward algorithm: align reverse(xM/2+1xM/2+2…xM) with reverse(Y)

B(M/2, k) represents the score of the best alignment between reverse(xM/2+1xM/2+2…xM) and reverse(yk+1yk+2…yN)

[Figure: DP matrix of X against Y with row M/2 marked]

Page 34: CS 5263 Bioinformatics

Lemma

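• F(M/2, k) + B(M/2, k) is the score of the best alignment whose path passes through cell (M/2, k), i.e. x1…xM/2 is aligned with y1…yk and the rest of X with the rest of Y

• F(M, N) = max over k = 0…N of ( F(M/2, k) + B(M/2, k) )

[Figure: matrix of x against y; the optimal path crosses row M/2 at column k*, with F(M/2, k) covering the part before the crossing and B(M/2, k) the part after]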

Page 35: CS 5263 Bioinformatics

• Longest path from (0, 0) to (6, 6) is max_k ( LP(0,0,3,k) + LP(3,k,6,6) )

[Figure: grid from (0,0) to (6,6) with candidate midpoints (3,0), (3,2), (3,4), (3,6) on row 3]

Page 36: CS 5263 Bioinformatics

Linear-space alignment

Now, using 2 rows of space, we can compute F(M/2, k) and B(M/2, k) for k = 0…N

Page 37: CS 5263 Bioinformatics

Linear-space alignment

Now, we can find k* maximizing F(M/2, k) + B(M/2, k)

Also, we can trace the path exiting row M/2 from k*

Conclusion: In O(NM) time and O(N) space, we have found the point where the optimal alignment path crosses row M/2

Page 38: CS 5263 Bioinformatics

Linear-space alignment

• Apply this procedure recursively to the two sub-problems!

[Figure: the two sub-problems, of size M/2 x k* and M/2 x (N-k*)]
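Putting the forward pass, the backward pass and the recursion together gives the following sketch of the divide-and-conquer scheme. The helper names `last_row` and `hirschberg`, the default scores, and the handling of the single-character base case are assumptions made for this example; ties are broken by taking the smallest k*.

```python
def last_row(x, y, m=1, s=1, d=1):
    """Last row of the DP matrix for x versus y, computed with two rows."""
    prev = [-j * d for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [-i * d]
        for j in range(1, len(y) + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            cur.append(max(prev[j - 1] + sigma, prev[j] - d, cur[j - 1] - d))
        prev = cur
    return prev


def hirschberg(x, y, m=1, s=1, d=1):
    """Return (aligned_x, aligned_y) for an optimal global alignment in linear space."""
    M, N = len(x), len(y)
    if M == 0:
        return "-" * N, y
    if N == 0:
        return x, "-" * M
    if M == 1:
        # Base case: pair the single character with its best partner, or with a gap.
        j = max(range(N), key=lambda t: m if x == y[t] else -s)
        sigma = m if x == y[j] else -s
        if sigma - (N - 1) * d >= -(N + 1) * d:
            return "-" * j + x + "-" * (N - j - 1), y
        return x + "-" * N, "-" + y
    mid = M // 2
    fwd = last_row(x[:mid], y, m, s, d)                # F(M/2, k) for k = 0..N
    bwd = last_row(x[mid:][::-1], y[::-1], m, s, d)    # B(M/2, k) is bwd[N - k]
    k = max(range(N + 1), key=lambda t: fwd[t] + bwd[N - t])
    left_x, left_y = hirschberg(x[:mid], y[:k], m, s, d)
    right_x, right_y = hirschberg(x[mid:], y[k:], m, s, d)
    return left_x + right_x, left_y + right_y


ax, ay = hirschberg("ATA", "AGTA")
print(ay)
print(ax)
```

Only the two one-dimensional arrays `fwd` and `bwd` are held at each level of the recursion, which is what keeps the working memory linear in N.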

Page 39: CS 5263 Bioinformatics

Analysis

• Memory: O(N) for computation, O(N+M) to store the optimal alignment

• Time:
– MN for first iteration
– k M/2 + (N-k) M/2 = MN/2 for second
– …

[Figure: two sub-problems of size M/2 x k and M/2 x (N-k)]

Page 40: CS 5263 Bioinformatics

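[Figure: areas MN, MN/2, MN/4, MN/8 covered in successive iterations]

$$MN + \frac{MN}{2} + \frac{MN}{4} + \frac{MN}{8} + \dots = MN\left(1 + \tfrac{1}{2} + \tfrac{1}{4} + \tfrac{1}{8} + \tfrac{1}{16} + \dots\right) = 2MN = O(MN)$$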