CSE 182-L2:Blast & variants I Dynamic Programming

42
FA05 CSE182 CSE 182-L2:Blast & variants I Dynamic Programming www.cse.ucsd.edu/classes/fa05/cse182 www.cse.ucsd.edu/classes/fa05/cse182 www.cse.ucsd.edu/~vbafna www.cse.ucsd.edu/~vbafna

description

CSE 182-L2:Blast & variants I Dynamic Programming. www.cse.ucsd.edu/classes/fa05/cse182. www.cse.ucsd.edu/~vbafna. Searching Sequence databases. http://www.ncbi.nlm.nih.gov/BLAST/. Query:. >gi|26339572|dbj|BAC33457.1| unnamed protein product [Mus musculus] - PowerPoint PPT Presentation

Transcript of CSE 182-L2:Blast & variants I Dynamic Programming

Page 1: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

CSE 182-L2:Blast & variants IDynamic Programming

www.cse.ucsd.edu/classes/fa05/cse182www.cse.ucsd.edu/classes/fa05/cse182

www.cse.ucsd.edu/~vbafnawww.cse.ucsd.edu/~vbafna

Page 2: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

Searching Sequence databases

http://www.ncbi.nlm.nih.gov/BLAST/

Page 3: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

Query:

>gi|26339572|dbj|BAC33457.1| unnamed protein product [Mus musculus]MSSTKLEDSLSRRNWSSASELNETQEPFLNPTDYDDEEFLRYLWREYLHPKEYEWVLIAGYIIVFVVALIGNVLVCVAVWKNHHMRTVTNYFIVNLSLADVLVTITCLPATLVVDITETWFFGQSLCKVIPYLQTVSVSVSVLTLSCIALDRWYAICHPLMFKSTAKRARNSIVVIWIVSCIIMIPQAIVMECSSMLPGLANKTTLFTVCDEHWGGEVYPKMYHICFFLVTYMAPLCLMILAYLQIFRKLWCRQIPGTSSVVQRKWKQQQPVSQPRGSGQQSKARISAVAAEIKQIRARRKTARMLMVVLLVFAICYLPISILNVLKRVFGMFTHTEDRETVYAWFTFSHWLVYANSAANPIIYNFLSGKFREEFKAAFSCCLGVHHRQGDRLARGRTSTESRKSLTTQISNFDNVSKLSEHVVLTSISTLPAANGAGPLQNWYLQQGVPSSLLSTWLEV

• What is the function of this sequence?• Is there a human homolog?• Which organelle does it work in? (Secreted/membrane bound)• Idea: Search a database of known proteins to see if you can

find similar sequences which have a known function

Page 4: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

Querying with Blast

Page 5: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

Blast Output

• The output (Blastp query) is a series of protein sequences, ranked according to similarity with the query

• Each database hit is aligned to a subsequence of the query

Page 6: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

Blast Output

query26

19 405

422

Schematic

db

Q beg

S beg

Q end

S end

S Id

Page 7: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

Blast Output

Q beg

S beg

Q end

S end

S Id

Page 8: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

The technological question

• How do we measure similarity between sequences?• Percent identity?

– Hard to compute without indels.

Number of sequence edit operations?Number of sequence edit operations?Implies a notion of alignment.Implies a notion of alignment.

A T C A A C GT C A A T G G T

A T C A A - C G -- T C A A T G G T

Page 9: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

The biology question

• How do we interpret these results?– Similar sequence in the 3 species implies that the common

ancestor of the 3 had that sequence.– The sequence accumulates mutations over time. These

mutations may be indels, or substitutions.– Hum and mus diverged more recently and so the sequences

are more likely to be similar.

hum

mus

dros

hummus?

Page 10: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

Computing alignments

• What is an alignment?• 2Xm table. • Each sequence is a row, with interspersed gaps• Columns describe the edit operations

•What is the score of an alignment?What is the score of an alignment?•Score every column, and sum up the scores. Let C be the score Score every column, and sum up the scores. Let C be the score function for the columnfunction for the column

•How do we compute the alignment with the best score?How do we compute the alignment with the best score?

A A - T C G G A

A C T C G - A

Page 11: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

Optimum scoring alignments, and score of optimum alignment

• Instead of computing an optimum scoring alignment, we attempt to compute the score of an optimal alignment.

• Later, we will show that the two are equivalent

Page 12: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

Computing Optimal Alignment score

• Observations: The optimum alignment has nice recursive properties:– The alignment score is the sum of the scores of columns. – If we break off at cell k, the left part and right part must be

optimal sub-alignments.– The left part contains prefixes s[1..i], and t[1..j] for some i

and some j (we don’t know the values of i and j).

1 21

k

ts

Page 13: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

Optimum prefix alignments

• Consider an optimum alignment of the prefix s[1..i], and t[1..j]

• Look at the last cell, indexed by k. It can only have 3 possibilities.

1 kst

Page 14: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

3 possibilities for rightmost cell

1. s[i] is aligned to t[j]

2. s[i] is aligned to ‘-’

3. t[j] is aligned to ‘-’

s[i]

t[j]

s[i]t[j]

Optimum alignment of s[1..i-1], and t[1..j-1]

Optimum alignment of s[1..i-1], and t[1..j]

Optimum alignment of s[1..i], and t[1..j-1]

Page 15: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

Optimal score of an alignment

• Let S[i,j] be the score of an optimal alignment of the prefix s[1..i], and t[1..j]. It must be one of 3 possibilities.

s[i]t[j]

Optimum alignment of s[1..i-1], and t[1..j-1]

s[i]

-

Optimum alignment of s[1..i-1], and t[1..j]

-

Optimum alignment of s[1..i], and t[1..j-1]

t[j]

S[i,j] = C(si,tj)+S(i-1,j-1)

S[i,j] = C(si,-)+S(i-1,j)

S[i,j] = C(-,tj)+S(i,j-1)

Page 16: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

Optimal alignment score

• Which prefix pairs (i,j) should we use? For now, simply use all.

• If the strings are of length m, and n, respectively, what is the score of the optimal alignment?

S[i, j] = maxS[i −1, j −1] +C(si, t j )S[i −1, j] +C(si,−)S[i, j −1] +C(−, t j )

⎧ ⎨ ⎪

⎩ ⎪

Page 17: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

An O(nm) algorithm for score computation

• The iteration ensures that all values on the right are computed in earlier steps.€

S[i, j] = maxS[i −1, j −1] +C(si, t j )S[i −1, j] +C(si,−)S[i, j −1] +C(−, t j )

⎧ ⎨ ⎪

⎩ ⎪

For i = 1 to n For j = 1 to m

Page 18: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

Initialization

S[0,0] = 0

S[i,0] =C(si,−) + S[i −1,0] ∀ i

S[0, j] =C(−,s j ) + S[0, j −1] ∀ j

Page 19: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

A tableaux approach

s

n

1

i

1 j n

MS[i,j-1] S[i,j]

S[i-1,j]S[i-1,j-1]

t

Cell (i,j) contains the score S[i,j]. Each cell only looks at 3 neighboring cells

S[i, j] = maxS[i −1, j −1] +C(si, t j )S[i −1, j] +C(si,−)S[i, j −1] +C(−, t j )

⎧ ⎨ ⎪

⎩ ⎪

Page 20: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

An Example

• Align s=TCAT with t=TGCAA• Match Score = 1• Mismatch score = -1, Indel Score = -1• Score A1?, Score A2?

T C A T -T G C A A

T C A TT G C A A

A1 A2

Page 21: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

0 -1 -2 -3 -4 -5

-1 1 0 -1 -2 -3

-2 0 0 1 0 -1

-3 -1 -1 0 2 1

-4 -2 -2 -1 1 1

T G C A A

T

C

A

T

Alignment Table

Page 22: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

1 1 -1 -2 -2 -4

1 2 0 -1 -1 -3

-1 0 1 0 0 -2

-3 -2 -1 0 1 -1

-5 -4 -3 -2 -1 0

T G C A A

T

C

A

T

Alignment Table

• S[4,5] = 1 is the score of an optimum alignment

• Therefore, A2 is an optimum alignment

• We know how to obtain the optimum Score. How do we get the best alignment?

Page 23: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

Computing Optimum Alignment

• At each cell, we have 3 choices• We maintain additional information to record the choice

at each step.

For i = 1 to n For j = 1 to m

S[i, j] = maxS[i −1, j −1] +C(si, t j )S[i −1, j] +C(si,−)S[i, j −1] +C(−, t j )

⎧ ⎨ ⎪

⎩ ⎪

If (S[i,j]= S[i-1,j-1] + C(si,tj)) M[i,j] =

If (S[i,j]= S[i-1,j] + C(si,-)) M[i,j] =

If (S[i,j]= S[i,j-1] + C(-,tj) ) M[i,j] =

j-1

i-1

j

i

Page 24: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

T G C A A

T

C

A

T 1 1 -1 -2 -2 -4

1 2 0 -1 -1 -3

-1 0 1 0 0 -2

-3 -2 -1 0 1 -1

-5 -4 -3 -2 -1 0

Computing Optimal Alignments

Page 25: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

Retrieving Opt.Alignment

1 1 -1 -2 -2 -4

1 2 0 -1 -1 -3

-1 0 1 0 0 -2

-3 -2 -1 0 1 -1

-5 -4 -3 -2 -1 0

T G C A A

T

C

A

T

• M[4,5]= Implies that

S[4,5]=S[3,4]+C(A,T) or

A

T

M[3,4]= Implies that

S[3,4]=S[2,3] +C(A,A) or

A

T

A

A

1 2 3 4 5

1

3

2

4

Page 26: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

Retrieving Opt.Alignment

1 1 -1 -2 -2 -4

1 2 0 -1 -1 -3

-1 0 1 0 0 -2

-3 -2 -1 0 1 -1

-5 -4 -3 -2 -1 0

T G C A A

T

C

A

T

• M[2,3]= Implies that

S[2,3]=S[1,2]+C(C,C) or

A

T

M[1,2]= Implies that

S[1,2]=S[1,1] +C(-,G) or

A

T

A

A

A

A

C

C

C

C

-

GT

T

1 2 3 4 5

1

3

2

4

Page 27: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

Algorithm to retrieve optimal alignment

RetrieveAl(i,j)if (M[i,j] == `\’)

return (RetrieveAl (i-1,j-1) . ) else if (M[i,j] == `|’)

return (RetrieveAl (i-1,j) . )

si

tj

si

-

-

tj

else if (M[i,j] == `--’)else if (M[i,j] == `--’)

return (RetrieveAl (i,j-1) . return (RetrieveAl (i,j-1) . ))

Page 28: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

Summary

• An optimal alignment of strings of length n and m can be computed in O(nm) time

• There is a tight connection between computation of optimal score, and computation of opt. Alignment– True for all DP based solutions

Page 29: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

Global versus Local Alignment

Consider s = ACCACCCCTT

t = ATCCCCACAT

ACCACCCCTTACCACCCCTT

A TCCCCACATA TCCCCACATATCCCCACATATCCCCACAT

ACCACCCCT TACCACCCCT T

Page 30: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

Blast Outputs Local Alignment

query26

19 405

422

Schematic

db

Page 31: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

Local Alignment

• Compute maximum scoring interval over all sub-intervals (a,b), and (a’,b’)

• How can we compute this efficiently?

a

b

a’ b’

Page 32: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

Local Alignment

• Recall that in global alignment, we compute the best score for all prefix pairs s(1..i) and t(1..j).

• Instead, compute the best score for all sub-alignments that end in s(i) and t(j).

• What changes in the recurrence?

a

i

a’ j

Page 33: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

Local Alignment

• The original recurrence still works, except when the optimum score S[i,j] is negative

• When S[i,j] <0, it means that the optimum local alignment cannot include the point (i,j).

• So, we must reset the score to 0.

ii-1

jj-1

si

tj

Page 34: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

Local Alignment Trick (Smith-Waterman algorithm)

S[i, j] = max

0S[i −1, j −1] +C(si, t j )S[i −1, j] +C(si,−)S[i, j −1] +C(−, t j )

⎨ ⎪

⎩ ⎪

How can we compute the local alignment itself?

Page 35: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

Generalizing Gap Cost

• It is more likely for gaps to be contiguous

• The penalty for a gap of length l should be

go+ ge∗l

Page 36: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

Using affine gap penalties

S[i, j] = maxl

0S[i −1, j −1] +C(si, t j )S[i − l, j] + go+ ge* lS[i, j − l] + go+ ge* l

⎨ ⎪

⎩ ⎪

• What is the time taken for this?• What are the values that l can take?• Can we get rid of the extra Dimension?

Page 37: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

Affine gap penalties

• Define D[i,j] : Score of the best alignment, given that the final column is a ‘deletion’ (si is aligned to a gap)

• Define I[i,j]: Score of the best alignment, given that the final column is an insertion (tj is aligned to a gap)

S[i, j] = max

0S[i −1, j −1] +C(si, t j )

D[i, j]I[i, j]

⎨ ⎪

⎩ ⎪

s[i]

Optimum alignment of s[1..i-1], and t[1..j]

-

Optimum alignment of s[1..i], and t[1..j-1]

t[j]

Page 38: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

O(nm) solution for affine gap costs

D[i, j] = maxD[i −1, j] + ge

S[i −1, j] + go+ ge ⎧ ⎨ ⎩

I[i, j] = maxI[i, j −1] + ge

S[i, j −1] + go+ ge ⎧ ⎨ ⎩

Page 39: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

Blast variants

• Blastn: DNA query, DNA database• Blastp: Protein Query, DNA database• Blastx,Tblastn,Tblastx

– all require a 6 frame translation

Page 40: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

The genetic code

• The DNA can be translated conceptually using triplets, and looking up the table.

• How many translations are possible?

ATCGGATCGTGAT

Page 41: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

Blast variants

• Blastx: DNA query, protein database• Tblastn: Protein query, DNA database• Tblastx: DNA query, DNA database

– How is it different from Blastn?

Page 42: CSE 182-L2:Blast & variants I Dynamic Programming

FA05 CSE182

Summary

• A critical part of BLAST is local sequence alignment.• Optimum Local alignments between two sequences of

length n and m can be computed in O(nm) time.• The algorithm can be extended to deal with affine gap

penalties.• What is the memory requirement for this computation?• Q: If one of the sequences was a large database, would

it be feasible to compute these alignments?• Q: What if the query and database were large?