CS 5263 Bioinformatics


Page 1: CS 5263 Bioinformatics

CS 5263 Bioinformatics

Lecture 5: Affine Gap Penalties

Page 2: CS 5263 Bioinformatics

Last lecture

• Local Sequence Alignment

• Bounded Dynamic Programming

• Linear Space Sequence Alignment

Page 3: CS 5263 Bioinformatics

The Smith-Waterman algorithm

Initialization: F(0, j) = F(i, 0) = 0

Iteration:

$$F(i, j) = \max \begin{cases} 0 \\ F(i-1, j) - d \\ F(i, j-1) - d \\ F(i-1, j-1) + s(x_i, y_j) \end{cases}$$

Page 4: CS 5263 Bioinformatics

The Smith-Waterman algorithm

Termination:

1. If we want the best local alignment: $F_{OPT} = \max_{i,j} F(i, j)$

2. If we want all local alignments scoring > t: for all i, j with F(i, j) > t, trace back
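The Smith-Waterman recurrence and the first termination rule can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference code: the function name, the default scores (match 1, mismatch -1) and the gap penalty d = 2 are assumptions chosen for the example, and only the score and its end cell are returned (no traceback).

```python
def smith_waterman_score(x, y, match=1, mismatch=-1, d=2):
    """Best local alignment score with a linear gap penalty d (d > 0).

    The scoring values are illustrative assumptions, not from the lecture.
    """
    M, N = len(x), len(y)
    # F[i][j]: best score of a local alignment ending at x[i-1], y[j-1].
    # Row 0 and column 0 stay 0, which is the initialization F(0, j) = F(i, 0) = 0.
    F = [[0] * (N + 1) for _ in range(M + 1)]
    best, best_cell = 0, (0, 0)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0,                      # start a new local alignment
                          F[i - 1][j] - d,        # x_i aligned to a gap
                          F[i][j - 1] - d,        # y_j aligned to a gap
                          F[i - 1][j - 1] + s)    # x_i aligned to y_j
            if F[i][j] > best:
                best, best_cell = F[i][j], (i, j)
    return best, best_cell
```

For the second termination rule one would instead collect every cell with F(i, j) > t and trace back from each of them.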

Page 5: CS 5263 Bioinformatics

Bounded Dynamic Programming

• O(kM) time

• O(kN) memory

[Figure: DP matrix with x1 … xM along one axis and y1 … yN along the other; only a diagonal band of width k is filled in.]

Page 6: CS 5263 Bioinformatics

Linear-space alignment

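[Figure: DP matrix divided at column M/2; the optimal path crosses that column at row k*, leaving two subproblems of width M/2 with k* and N-k* rows.]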

• O(M+N) memory

• 2MN time

Page 7: CS 5263 Bioinformatics

Homework Problem 5 hints

Dot matrix for visualizing seq similarities

• Seq1: x[1..m]

• Seq2: y[1..n]

[Figure: three dot-matrix plots of Sequence 1 (vertical axis) against Sequence 2 (horizontal axis), both axes running from 50 to 300.]

A(i, j) = 1 if $\delta(x_i, y_j) = 1$

A(i, j) = 1 if $\sum_{k=1}^{10} \delta(x_{i+k}, y_{j+k}) > 7$

A(i, j) = 1 if $\sum_{k=1}^{20} \delta(x_{i+k}, y_{j+k}) > 15$

Here $\delta(a, b)$ is 1 when a and b are the same character and 0 otherwise.

A dot matrix does not do any alignment (global or local). It helps to detect strongly conserved regions.
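The three dot-matrix rules differ only in window length and threshold, so one function can cover them. The sketch below is an assumption-laden illustration: the name `dot_matrix` and its parameters are hypothetical, and the window starts at offset 0 (positions i .. i+window-1) rather than at k = 1 as written on the slide.

```python
def dot_matrix(x, y, window=1, threshold=0):
    """Return A with A[i][j] = 1 if more than `threshold` of the `window`
    pairs (x[i+k], y[j+k]), k = 0 .. window-1, are identical characters.

    window=1, threshold=0 gives the plain rule "x_i equals y_j";
    window=10, threshold=7 and window=20, threshold=15 give the filtered plots.
    """
    rows = len(x) - window + 1
    cols = len(y) - window + 1
    A = [[0] * max(cols, 0) for _ in range(max(rows, 0))]
    for i in range(rows):
        for j in range(cols):
            # Count identical characters along the diagonal window.
            matches = sum(x[i + k] == y[j + k] for k in range(window))
            if matches > threshold:
                A[i][j] = 1
    return A
```

Longer windows with a high threshold suppress isolated chance matches, which is why conserved regions show up as clean diagonals.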

Page 8: CS 5263 Bioinformatics

[Figure: dot matrix of Seq1 (vertical axis, labeled Sequence 1) against Seq2 (horizontal axis, labeled Sequence 2), both axes running from 50 to 300.]

Page 9: CS 5263 Bioinformatics

Today

• How to model gaps more accurately?

• Statistics of alignments
  – Where does s(xi, yj) come from?
  – Are two aligned sequences actually related? (not today)

Page 10: CS 5263 Bioinformatics

What’s a better alignment?

GACGCCGAACG
|||||   |||
GACGC---ACG

Score = 8 x m - 3 x d

GACGCCGAACG
|||| | | ||
GACG-C-A-CG

Score = 8 x m - 3 x d

Under the linear gap model both alignments get the same score. However, gaps usually occur in bunches.

- During evolution, chunks of DNA may be lost entirely
- Aligning genomic sequences vs. cDNAs (reverse complementary to mRNAs)

Page 11: CS 5263 Bioinformatics

Model gaps more accurately

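• Current model:
  – Gap of length n incurs penalty n x d

• General:
  – Convex function (in the sense used for gap penalties: each additional gap position costs no more than the previous one)
  – E.g. $\gamma(n) = c \sqrt{n}$

[Figure: gap penalty plotted against gap length n for each model.]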

Page 12: CS 5263 Bioinformatics

General gap dynamic programming

Initialization: same

Iteration:

$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ \max_{k = 0 \ldots i-1} F(k, j) - \gamma(i - k) \\ \max_{k = 0 \ldots j-1} F(i, k) - \gamma(j - k) \end{cases}$$

Termination: same

Running time: O((M+N)MN) (cubic)

Space: O(NM) (linear-space algorithm not applicable)
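To make the cubic cost concrete, here is a minimal Python sketch of the general-gap recurrence. It assumes global alignment with border cells F(i, 0) = -gamma(i) and F(0, j) = -gamma(j), which is one reading of "Initialization: same"; the function name and the callables `s` (substitution score) and `gamma` (positive gap penalty) are hypothetical.

```python
def general_gap_score(x, y, s, gamma):
    """Global alignment score with an arbitrary gap penalty gamma(n) > 0.

    s(a, b) scores an aligned pair; gamma(n) is the penalty for a gap of
    length n. The border initialization is an assumption (see lead-in).
    """
    M, N = len(x), len(y)
    F = [[float("-inf")] * (N + 1) for _ in range(M + 1)]
    F[0][0] = 0
    for i in range(1, M + 1):
        F[i][0] = -gamma(i)
    for j in range(1, N + 1):
        F[0][j] = -gamma(j)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            best = F[i - 1][j - 1] + s(x[i - 1], y[j - 1])
            # A gap covering x[k+1..i]: try every possible start k.
            for k in range(i):
                best = max(best, F[k][j] - gamma(i - k))
            # A gap covering y[k+1..j]: try every possible start k.
            for k in range(j):
                best = max(best, F[i][k] - gamma(j - k))
            F[i][j] = best
    return F[M][N]
```

The two inner loops over k are what raise the running time from O(MN) to O((M+N)MN): every cell looks back along its whole row and column instead of at three neighbors.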

Page 13: CS 5263 Bioinformatics

Compromise: affine gaps

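$$\gamma(n) = d + (n - 1)\,e$$

where d is the gap-open score and e is the gap-extension score. From here on, d and e are negative scores that are added to the alignment score.

[Figure: plot of γ(n) against gap length n, annotated with d and e.]

Match: 2

Gap open: -5

Gap extension: -1

GACGCCGAACG
|||||   |||
GACGC---ACG

Score: 8x2 - 5 - 2 = 9

GACGCCGAACG
|||| | | ||
GACG-C-A-CG

Score: 8x2 - 3x5 = 1

We want to find the optimal alignment with affine gap penalty in

• O(MN) time

• O(MN) or better O(M+N) memory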
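The two example scores can be checked mechanically. The helper below is a small sketch (the name `alignment_score` and its parameters are made up for this example) that scores an already gapped alignment, charging gap_open for the first position of each gap run and gap_extend for every further position.

```python
from itertools import groupby

def alignment_score(top, bottom, match, mismatch, gap_open, gap_extend):
    """Score a gapped alignment; a gap run of length n contributes
    gap_open + (n - 1) * gap_extend (both are negative scores here)."""
    assert len(top) == len(bottom)
    # Label each column: 'M' aligned pair, 'X' gap in top row, 'Y' gap in bottom row.
    labels = ["X" if a == "-" else "Y" if b == "-" else "M"
              for a, b in zip(top, bottom)]
    score, pos = 0, 0
    for label, run in groupby(labels):
        n = len(list(run))
        if label == "M":
            for a, b in zip(top[pos:pos + n], bottom[pos:pos + n]):
                score += match if a == b else mismatch
        else:
            score += gap_open + (n - 1) * gap_extend
        pos += n
    return score

# Affine model from the slide: match 2, gap open -5, gap extension -1.
print(alignment_score("GACGCCGAACG", "GACGC---ACG", 2, -2, -5, -1))  # 9
print(alignment_score("GACGCCGAACG", "GACG-C-A-CG", 2, -2, -5, -1))  # 1
```

Setting gap_extend equal to gap_open recovers the linear model, under which both alignments score the same.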

Page 14: CS 5263 Bioinformatics

Allowing affine gap penalties

• Still three cases
  – xi aligned with yj
  – xi aligns to a gap
    • Are we continuing a gap in y? (if no, start is more expensive)
  – yj aligns to a gap
    • Are we continuing a gap in x? (if no, start is more expensive)

• We can use a finite state machine to represent the three cases as three states
  – The machine has two heads, reading the chars on the two strings separately
  – At every step, each head reads 0 or 1 char from its sequence
  – Depending on what it reads, the machine goes to a different state and produces a different score

Page 15: CS 5263 Bioinformatics

Finite State Machine

F: have just read 1 char from each seq (xi aligned to yj)

Ix: have read 0 char from x (yj aligned to a gap)

Iy: have read 0 char from y (xi aligned to a gap)

[Figure: state diagram over the states F, Ix and Iy. Each transition carries an "input / output" label; on this slide all seven labels are still shown as "? / ?".]

Page 16: CS 5263 Bioinformatics

[Figure: the state diagram over F, Ix and Iy with F as the start state and every transition labeled "input / output".]

Current state   Input      Output      Next state
F               (xi, yj)   s(xi, yj)   F
F               (-, yj)    d           Ix
F               (xi, -)    d           Iy
Ix              (-, yj)    e           Ix
…               …          …           …

Page 17: CS 5263 Bioinformatics

Example: x = AAC, y = ACT

State path F-F-F-F:

AAC
|||
ACT

State path F-Iy-F-F-Ix:

AAC-
 ||
-ACT

State path F-F-Iy-F-Ix:

AAC-
| |
A-CT

[Figure: the state diagram over F, Ix and Iy, with F as the start state.]

Given a pair of sequences, an alignment (not necessarily optimal) corresponds to a state path in the FSM.

Optimal alignment: find a state path to read the two sequences such that the total output score is the highest
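The correspondence between alignments and state paths can be written down directly. This is a sketch with hypothetical function names; it assumes, as in the seven-transition diagram, that there is no direct transition between Ix and Iy, and that the machine starts in F.

```python
def state_path(top, bottom):
    """Map each alignment column to an FSM state, starting in state F.

    top is the gapped x row, bottom the gapped y row.
    """
    path = ["F"]  # start state
    for a, b in zip(top, bottom):
        if a == "-":
            path.append("Ix")   # read 0 chars from x: y_j aligned to a gap
        elif b == "-":
            path.append("Iy")   # read 0 chars from y: x_i aligned to a gap
        else:
            path.append("F")    # read one char from each sequence
    return path

def path_score(top, bottom, s, d, e):
    """Sum the outputs along the state path (d: gap open, e: gap extension)."""
    path = state_path(top, bottom)
    total = 0
    for prev, cur, a, b in zip(path, path[1:], top, bottom):
        if cur == "F":
            total += s(a, b)
        elif prev == cur:
            total += e          # staying in Ix or Iy extends the gap
        elif prev == "F":
            total += d          # entering Ix or Iy from F opens a gap
        else:
            raise ValueError("no direct transition between Ix and Iy")
    return total

print("-".join(state_path("AAC-", "-ACT")))  # F-Iy-F-F-Ix
```

Finding the optimal alignment then amounts to searching over all valid state paths for the one with the highest total output.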

Page 18: CS 5263 Bioinformatics

Dynamic programming

• We encode this information in three different matrices

• For each element (i, j) we use three variables
  – F(i, j): best alignment (score) of x1..xi & y1..yj if xi aligns to yj
  – Ix(i, j): best alignment of x1..xi & y1..yj if yj aligns to a gap
  – Iy(i, j): best alignment of x1..xi & y1..yj if xi aligns to a gap

[Figure: the last alignment column in each case: (xi, yj) for F(i, j), (-, yj) for Ix(i, j), (xi, -) for Iy(i, j).]

Page 19: CS 5263 Bioinformatics

[Figure: the state diagram over F, Ix and Iy; F(i, j) covers alignments that end with xi aligned to yj.]

$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ I_x(i-1, j-1) + s(x_i, y_j) \\ I_y(i-1, j-1) + s(x_i, y_j) \end{cases}$$

Page 20: CS 5263 Bioinformatics

[Figure: the state diagram over F, Ix and Iy; Ix(i, j) covers alignments that end with yj aligned to a gap.]

$$I_x(i, j) = \max \begin{cases} F(i, j-1) + d \\ I_x(i, j-1) + e \end{cases}$$

Page 21: CS 5263 Bioinformatics

[Figure: the state diagram over F, Ix and Iy; Iy(i, j) covers alignments that end with xi aligned to a gap.]

$$I_y(i, j) = \max \begin{cases} F(i-1, j) + d \\ I_y(i-1, j) + e \end{cases}$$

Page 22: CS 5263 Bioinformatics

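$$F(i, j) = s(x_i, y_j) + \max \begin{cases} F(i-1, j-1) & \text{continuing alignment} \\ I_x(i-1, j-1) & \text{closing gaps in } x \\ I_y(i-1, j-1) & \text{closing gaps in } y \end{cases}$$

$$I_x(i, j) = \max \begin{cases} F(i, j-1) + d & \text{opening a gap in } x \\ I_x(i, j-1) + e & \text{gap extension in } x \end{cases}$$

$$I_y(i, j) = \max \begin{cases} F(i-1, j) + d & \text{opening a gap in } y \\ I_y(i-1, j) + e & \text{gap extension in } y \end{cases}$$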

Page 23: CS 5263 Bioinformatics

Data dependency

[Figure: F(i, j) depends on cell (i-1, j-1) of the matrices F, Ix and Iy.]

Page 24: CS 5263 Bioinformatics

Data dependency

[Figure: data dependencies of cell (i, j) in the matrices Ix, Iy and F.]

Page 25: CS 5263 Bioinformatics

Data dependency

• If we stack all three matrices
  – No cyclic dependency
  – Therefore, we can fill in all three matrices in order

Page 26: CS 5263 Bioinformatics

Algorithm

• for i = 1:M
  – for j = 1:N
    • Fill in F(i, j), Ix(i, j), Iy(i, j)
  – end
• end

• Optimal score = max(F(M, N), Ix(M, N), Iy(M, N))

• Time: O(MN)

• Space: O(MN), or O(N) when combined with the linear-space algorithm
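The loop above translates almost line for line into Python. The sketch below is one possible rendering, not the lecture's own code: the function name is made up, d and e are passed as negative scores, and the border cells are initialized so that a leading gap of length n scores d + (n - 1)e.

```python
NEG_INF = float("-inf")

def affine_align_score(x, y, s, d, e):
    """Global alignment score with affine gaps (d: gap open, e: gap extension,
    both negative). Returns the optimal score and the three filled matrices."""
    M, N = len(x), len(y)
    F = [[NEG_INF] * (N + 1) for _ in range(M + 1)]
    Ix = [[NEG_INF] * (N + 1) for _ in range(M + 1)]
    Iy = [[NEG_INF] * (N + 1) for _ in range(M + 1)]
    F[0][0] = 0
    for j in range(1, N + 1):
        Ix[0][j] = d + (j - 1) * e      # y_1..y_j aligned to one leading gap
    for i in range(1, M + 1):
        Iy[i][0] = d + (i - 1) * e      # x_1..x_i aligned to one leading gap
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            F[i][j] = s(x[i - 1], y[j - 1]) + max(
                F[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1])
            Ix[i][j] = max(F[i][j - 1] + d, Ix[i][j - 1] + e)
            Iy[i][j] = max(F[i - 1][j] + d, Iy[i - 1][j] + e)
    return max(F[M][N], Ix[M][N], Iy[M][N]), F, Ix, Iy

score, F, Ix, Iy = affine_align_score(
    "GCAC", "GCC", lambda a, b: 2 if a == b else -2, d=-5, e=-1)
print(score)  # 1
```

Each cell does a constant amount of work, so the running time stays O(MN) even though three matrices are filled instead of one.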

Page 27: CS 5263 Bioinformatics

Exercise

• x = GCAC

• y = GCC

• m = 2 (match score)

• s = -2 (mismatch score)

• d = -5 (gap open)

• e = -1 (gap extension)

Page 28: CS 5263 Bioinformatics

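Initialization (m = 2, s = -2, d = -5, e = -1). Rows are x = G, C, A, C; columns are y = G, C, C; -inf marks an impossible cell and "." a cell not yet filled.

F: aligned on both

              G     C     C
        0     -inf  -inf  -inf
G       -inf  .     .     .
C       -inf  .     .     .
A       -inf  .     .     .
C       -inf  .     .     .

Iy: insertion on y

              G     C     C
        -inf  -inf  -inf  -inf
G       -5    .     .     .
C       -6    .     .     .
A       -7    .     .     .
C       -8    .     .     .

Ix: insertion on x

              G     C     C
        -inf  -5    -6    -7
G       -inf  .     .     .
C       -inf  .     .     .
A       -inf  .     .     .
C       -inf  .     .     .

Dependencies:

F(i, j) = s(xi, yj) + max{F(i-1, j-1), Ix(i-1, j-1), Iy(i-1, j-1)}

Ix(i, j) = max{F(i, j-1) + d, Ix(i, j-1) + e}

Iy(i, j) = max{F(i-1, j) + d, Iy(i-1, j) + e}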

Page 29: CS 5263 Bioinformatics

Fill F(1, 1): x1 = G, y1 = G, so s(x1, y1) = 2

F(1, 1) = 2 + max{F(0, 0), Ix(0, 0), Iy(0, 0)} = 2 + max{0, -inf, -inf} = 2

Page 30: CS 5263 Bioinformatics

Fill F(1, 2): x1 = G, y2 = C, so s(x1, y2) = -2

F(1, 2) = -2 + max{F(0, 1), Ix(0, 1), Iy(0, 1)} = -2 + max{-inf, -5, -inf} = -7

Page 31: CS 5263 Bioinformatics

Fill F(1, 3): x1 = G, y3 = C, so s(x1, y3) = -2

F(1, 3) = -2 + max{F(0, 2), Ix(0, 2), Iy(0, 2)} = -2 + max{-inf, -6, -inf} = -8

Page 32: CS 5263 Bioinformatics

Fill Ix(1, 1) and Ix(1, 2) with d = -5, e = -1

Ix(1, 1) = max{F(1, 0) + d, Ix(1, 0) + e} = -inf

Ix(1, 2) = max{F(1, 1) + d, Ix(1, 1) + e} = max{2 - 5, -inf} = -3

Page 33: CS 5263 Bioinformatics

Fill Ix(1, 3) with d = -5, e = -1

Ix(1, 3) = max{F(1, 2) + d, Ix(1, 2) + e} = max{-7 - 5, -3 - 1} = -4

Page 34: CS 5263 Bioinformatics

Fill row 1 of Iy with d = -5, e = -1

Iy(1, j) = max{F(0, j) + d, Iy(0, j) + e} = -inf for j = 1, 2, 3

Page 35: CS 5263 Bioinformatics

Fill F(2, 1): x2 = C, y1 = G, so s(x2, y1) = -2

F(2, 1) = -2 + max{F(1, 0), Ix(1, 0), Iy(1, 0)} = -2 + max{-inf, -inf, -5} = -7

Page 36: CS 5263 Bioinformatics

Fill F(2, 2): x2 = C, y2 = C, so s(x2, y2) = 2

F(2, 2) = 2 + max{F(1, 1), Ix(1, 1), Iy(1, 1)} = 2 + max{2, -inf, -inf} = 4

Page 37: CS 5263 Bioinformatics

Fill F(2, 3): x2 = C, y3 = C, so s(x2, y3) = 2

F(2, 3) = 2 + max{F(1, 2), Ix(1, 2), Iy(1, 2)} = 2 + max{-7, -3, -inf} = -1

Page 38: CS 5263 Bioinformatics

Fill row 2 of Ix with d = -5, e = -1

Ix(2, 1) = max{F(2, 0) + d, Ix(2, 0) + e} = -inf

Ix(2, 2) = max{F(2, 1) + d, Ix(2, 1) + e} = max{-7 - 5, -inf} = -12

Ix(2, 3) = max{F(2, 2) + d, Ix(2, 2) + e} = max{4 - 5, -12 - 1} = -1

Page 39: CS 5263 Bioinformatics

Fill Iy(2, 1) with d = -5, e = -1

Iy(2, 1) = max{F(1, 1) + d, Iy(1, 1) + e} = max{2 - 5, -inf} = -3

Page 40: CS 5263 Bioinformatics

Finish row 2 of Iy

Iy(2, 2) = max{F(1, 2) + d, Iy(1, 2) + e} = max{-7 - 5, -inf} = -12

Iy(2, 3) = max{F(1, 3) + d, Iy(1, 3) + e} = max{-8 - 5, -inf} = -13

Page 41: CS 5263 Bioinformatics

Fill row 3 of F and Ix: x3 = A matches no character of y, so s(x3, yj) = -2

F(3, 1) = -2 + max{F(2, 0), Ix(2, 0), Iy(2, 0)} = -2 + max{-inf, -inf, -6} = -8

F(3, 2) = -2 + max{F(2, 1), Ix(2, 1), Iy(2, 1)} = -2 + max{-7, -inf, -3} = -5

F(3, 3) = -2 + max{F(2, 2), Ix(2, 2), Iy(2, 2)} = -2 + max{4, -12, -12} = 2

Ix(3, 1) = -inf

Ix(3, 2) = max{F(3, 1) + d, Ix(3, 1) + e} = max{-8 - 5, -inf} = -13

Ix(3, 3) = max{F(3, 2) + d, Ix(3, 2) + e} = max{-5 - 5, -13 - 1} = -10

Page 42: CS 5263 Bioinformatics

Fill Iy(3, 1) and Iy(3, 2) with d = -5, e = -1

Iy(3, 1) = max{F(2, 1) + d, Iy(2, 1) + e} = max{-7 - 5, -3 - 1} = -4

Iy(3, 2) = max{F(2, 2) + d, Iy(2, 2) + e} = max{4 - 5, -12 - 1} = -1

Page 43: CS 5263 Bioinformatics

Fill Iy(3, 3) with d = -5, e = -1

Iy(3, 3) = max{F(2, 3) + d, Iy(2, 3) + e} = max{-1 - 5, -13 - 1} = -6

Page 44: CS 5263 Bioinformatics

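Fill row 4 of all three matrices: x4 = C, so s(x4, y1) = -2 and s(x4, y2) = s(x4, y3) = 2

F(4, 1) = -2 + max{F(3, 0), Ix(3, 0), Iy(3, 0)} = -2 + max{-inf, -inf, -7} = -9

F(4, 2) = 2 + max{F(3, 1), Ix(3, 1), Iy(3, 1)} = 2 + max{-8, -inf, -4} = -2

F(4, 3) = 2 + max{F(3, 2), Ix(3, 2), Iy(3, 2)} = 2 + max{-5, -13, -1} = 1

Ix(4, 1) = -inf

Ix(4, 2) = max{F(4, 1) + d, Ix(4, 1) + e} = max{-9 - 5, -inf} = -14

Ix(4, 3) = max{F(4, 2) + d, Ix(4, 2) + e} = max{-2 - 5, -14 - 1} = -7

Iy(4, 1) = max{F(3, 1) + d, Iy(3, 1) + e} = max{-8 - 5, -4 - 1} = -5

Iy(4, 2) = max{F(3, 2) + d, Iy(3, 2) + e} = max{-5 - 5, -1 - 1} = -2

Iy(4, 3) = max{F(3, 3) + d, Iy(3, 3) + e} = max{2 - 5, -6 - 1} = -3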

Page 45: CS 5263 Bioinformatics

[Figure: the three filled matrices F, Iy and Ix, unchanged from the previous slide.]

Page 46: CS 5263 Bioinformatics

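Final matrices (m = 2, s = -2, d = -5, e = -1). Rows are x = G, C, A, C; columns are y = G, C, C.

F

              G     C     C
        0     -inf  -inf  -inf
G       -inf  2     -7    -8
C       -inf  -7    4     -1
A       -inf  -8    -5    2
C       -inf  -9    -2    1

Iy

              G     C     C
        -inf  -inf  -inf  -inf
G       -5    -inf  -inf  -inf
C       -6    -3    -12   -13
A       -7    -4    -1    -6
C       -8    -5    -2    -3

Ix

              G     C     C
        -inf  -5    -6    -7
G       -inf  -inf  -3    -4
C       -inf  -inf  -12   -1
A       -inf  -inf  -13   -10
C       -inf  -inf  -14   -7

Optimal score = max{F(4, 3), Ix(4, 3), Iy(4, 3)} = max{1, -7, -3} = 1

Traceback: F(4, 3), Iy(3, 2), F(2, 2), F(1, 1), F(0, 0)

x = GCAC
    || |
y = GC-C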
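The traceback for this exercise can also be reproduced in code. The following Python sketch is an illustration under stated assumptions rather than the lecture's implementation: the function name is hypothetical, ties are broken in favor of state F, and the scores are assumed to be integers so that the comparisons during traceback are exact.

```python
def affine_align(x, y, s, d, e):
    """Global alignment with affine gaps; returns (score, gapped x, gapped y)."""
    NEG = float("-inf")
    M, N = len(x), len(y)
    F = [[NEG] * (N + 1) for _ in range(M + 1)]
    Ix = [[NEG] * (N + 1) for _ in range(M + 1)]
    Iy = [[NEG] * (N + 1) for _ in range(M + 1)]
    F[0][0] = 0
    for j in range(1, N + 1):
        Ix[0][j] = d + (j - 1) * e
    for i in range(1, M + 1):
        Iy[i][0] = d + (i - 1) * e
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            F[i][j] = s(x[i - 1], y[j - 1]) + max(
                F[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1])
            Ix[i][j] = max(F[i][j - 1] + d, Ix[i][j - 1] + e)
            Iy[i][j] = max(F[i - 1][j] + d, Iy[i - 1][j] + e)

    # Start in the best-scoring state at (M, N) and walk back to (0, 0).
    final = {"F": F[M][N], "Ix": Ix[M][N], "Iy": Iy[M][N]}
    state = max(final, key=final.get)
    score = final[state]
    i, j, top, bottom = M, N, [], []
    while i > 0 or j > 0:
        if state == "F":                      # x_i aligned to y_j
            top.append(x[i - 1]); bottom.append(y[j - 1])
            prev = {"F": F[i - 1][j - 1], "Ix": Ix[i - 1][j - 1],
                    "Iy": Iy[i - 1][j - 1]}
            state = max(prev, key=prev.get)
            i, j = i - 1, j - 1
        elif state == "Ix":                   # y_j aligned to a gap
            top.append("-"); bottom.append(y[j - 1])
            state = "F" if F[i][j - 1] + d >= Ix[i][j - 1] + e else "Ix"
            j -= 1
        else:                                 # state Iy: x_i aligned to a gap
            top.append(x[i - 1]); bottom.append("-")
            state = "F" if F[i - 1][j] + d >= Iy[i - 1][j] + e else "Iy"
            i -= 1
    return score, "".join(reversed(top)), "".join(reversed(bottom))

print(affine_align("GCAC", "GCC", lambda a, b: 2 if a == b else -2, -5, -1))
# (1, 'GCAC', 'GC-C')
```

The traceback has to remember which matrix it is in, because the same cell (i, j) can be visited in state F, Ix or Iy with different scores.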