Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational...
-
Upload
leticia-woolcott -
Category
Documents
-
view
214 -
download
0
Transcript of Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational...
![Page 1: Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader037.fdocuments.net/reader037/viewer/2022110304/551c1d02550346b24f8b59d4/html5/thumbnails/1.jpg)
Sequence comparison: Dynamic programming
Genome 559: Introduction to Statistical and Computational
GenomicsProf. James H. Thomas
![Page 2: Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader037.fdocuments.net/reader037/viewer/2022110304/551c1d02550346b24f8b59d4/html5/thumbnails/2.jpg)
Sequence comparison overview
• Problem: Find the “best” alignment between a query sequence and a target sequence.
• To solve this problem, we need– a method for scoring alignments, and– an algorithm for finding the alignment with the
best score.• The alignment score is calculated using
– a substitution matrix– gap penalties.
• The algorithm for finding the best alignment is dynamic programming (DP).
![Page 3: Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader037.fdocuments.net/reader037/viewer/2022110304/551c1d02550346b24f8b59d4/html5/thumbnails/3.jpg)
A simple alignment problem.
• Problem: find the best pairwise alignment of GAATC and CATAC.
• Use a linear gap penalty of -4.• Use the following substitution matrix:
A C G T
A 10 -5 0 -5
C -5 10 -5 0
G 0 -5 10 -5
T -5 0 -5 10
![Page 4: Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader037.fdocuments.net/reader037/viewer/2022110304/551c1d02550346b24f8b59d4/html5/thumbnails/4.jpg)
How many possibilities?
• How many different possible alignments of two sequences of length n exist?
GAATCCATAC
GAATC-CA-TAC
GAAT-CC-ATAC
GAAT-CCA-TAC
-GAAT-CC-A-TAC
GA-ATCCATA-C
![Page 5: Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader037.fdocuments.net/reader037/viewer/2022110304/551c1d02550346b24f8b59d4/html5/thumbnails/5.jpg)
How many possibilities?
• How many different alignments of two
sequences of length n exist?
GAATCCATAC
GAATC-CA-TAC
GAAT-CC-ATAC
GAAT-CCA-TAC
-GAAT-CC-A-TAC
GA-ATCCATA-C
2!
!22
n
n
n
n
5 2.5x102
10 1.8x105
20 1.4x1011
30 1.2x1017
40 1.1x1023
FYI for two sequences of length m and n, possible alignments number:
2
( )!
min( , ) (min( , )!)
mn mn
m n m n
![Page 6: Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader037.fdocuments.net/reader037/viewer/2022110304/551c1d02550346b24f8b59d4/html5/thumbnails/6.jpg)
DP matrix
G A A T C
C
A
T
A
C
A C G T
A 10 -5 0 -5
C -5 10 -5 0
G 0 -5 10 -5
T -5 0 -5 10
i
0
1
2
3
4
5
j 0 1 2 3 etc.
5
The value in position (i,j) is the score of the best alignment of the first i
positions of one sequence versus the first j positions of the other sequence.
GACA
init. row and column
![Page 7: Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader037.fdocuments.net/reader037/viewer/2022110304/551c1d02550346b24f8b59d4/html5/thumbnails/7.jpg)
DP matrix
G A A T C
C
A 5 1T
A
C
Moving horizontally in the matrix introduces a
gap in the sequence along the left edge.
GAACA-
A C G T
A 10 -5 0 -5
C -5 10 -5 0
G 0 -5 10 -5
T -5 0 -5 10
![Page 8: Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader037.fdocuments.net/reader037/viewer/2022110304/551c1d02550346b24f8b59d4/html5/thumbnails/8.jpg)
DP matrix
G A A T C
C
A 5T 1A
C
Moving vertically in the matrix introduces a gap in the sequence along
the top edge.
GA-CAT
A C G T
A 10 -5 0 -5
C -5 10 -5 0
G 0 -5 10 -5
T -5 0 -5 10
![Page 9: Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader037.fdocuments.net/reader037/viewer/2022110304/551c1d02550346b24f8b59d4/html5/thumbnails/9.jpg)
DP matrix
G A A T C
C
A 5T 0A
C
Moving diagonally in the matrix aligns two
residues
GAACAT
A C G T
A 10 -5 0 -5
C -5 10 -5 0
G 0 -5 10 -5
T -5 0 -5 10
![Page 10: Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader037.fdocuments.net/reader037/viewer/2022110304/551c1d02550346b24f8b59d4/html5/thumbnails/10.jpg)
Initialization
G A A T C
0
C
A
T
A
C
A C G T
A 10 -5 0 -5
C -5 10 -5 0
G 0 -5 10 -5
T -5 0 -5 10
Start at top left and move progressively
![Page 11: Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader037.fdocuments.net/reader037/viewer/2022110304/551c1d02550346b24f8b59d4/html5/thumbnails/11.jpg)
Introducing a gap
G A A T C
0 -4
C
A
T
A
C
G-
A C G T
A 10 -5 0 -5
C -5 10 -5 0
G 0 -5 10 -5
T -5 0 -5 10
![Page 12: Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader037.fdocuments.net/reader037/viewer/2022110304/551c1d02550346b24f8b59d4/html5/thumbnails/12.jpg)
G A A T C
0 -4
C -4
A
T
A
C
-C
A C G T
A 10 -5 0 -5
C -5 10 -5 0
G 0 -5 10 -5
T -5 0 -5 10
Introducing a gap
![Page 13: Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader037.fdocuments.net/reader037/viewer/2022110304/551c1d02550346b24f8b59d4/html5/thumbnails/13.jpg)
Complete first rowand column
G A A T C
0 -4 -8 -12 -16 -20
C -4
A -8
T -12
A -16
C -20
A C G T
A 10 -5 0 -5
C -5 10 -5 0
G 0 -5 10 -5
T -5 0 -5 10
-----CATAC
![Page 14: Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader037.fdocuments.net/reader037/viewer/2022110304/551c1d02550346b24f8b59d4/html5/thumbnails/14.jpg)
Three ways to get to i=1, j=1
G A A T C
0 -4
C -8
A
T
A
C
A C G T
A 10 -5 0 -5
C -5 10 -5 0
G 0 -5 10 -5
T -5 0 -5 10
G--C
j 0 1 2 3 etc.
i
0
1
2
3
4
5
![Page 15: Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader037.fdocuments.net/reader037/viewer/2022110304/551c1d02550346b24f8b59d4/html5/thumbnails/15.jpg)
G A A T C
0
C -4 -8
A
T
A
C
A C G T
A 10 -5 0 -5
C -5 10 -5 0
G 0 -5 10 -5
T -5 0 -5 10
-GC-
Three ways to get to i=1, j=1
j 0 1 2 3 etc.
i
0
1
2
3
4
5
![Page 16: Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader037.fdocuments.net/reader037/viewer/2022110304/551c1d02550346b24f8b59d4/html5/thumbnails/16.jpg)
G A A T C
0
C -5
A
T
A
C
GC
A C G T
A 10 -5 0 -5
C -5 10 -5 0
G 0 -5 10 -5
T -5 0 -5 10
Three ways to get to i=1, j=1
i
0
1
2
3
4
5
j 0 1 2 3 etc.
![Page 17: Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader037.fdocuments.net/reader037/viewer/2022110304/551c1d02550346b24f8b59d4/html5/thumbnails/17.jpg)
Accept the highest scoring of the three
G A A T C
0 -4 -8 -12 -16 -20
C -4 -5
A -8
T -12
A -16
C -20
A C G T
A 10 -5 0 -5
C -5 10 -5 0
G 0 -5 10 -5
T -5 0 -5 10
Then simply repeat the same rule progressively across the matrix
![Page 18: Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader037.fdocuments.net/reader037/viewer/2022110304/551c1d02550346b24f8b59d4/html5/thumbnails/18.jpg)
DP matrix
G A A T C
0 -4 -8 -12 -16 -20
C -4 -5
A -8 ?
T -12
A -16
C -20
A C G T
A 10 -5 0 -5
C -5 10 -5 0
G 0 -5 10 -5
T -5 0 -5 10
![Page 19: Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader037.fdocuments.net/reader037/viewer/2022110304/551c1d02550346b24f8b59d4/html5/thumbnails/19.jpg)
DP matrix
G A A T C
0 -4 -8 -12 -16 -20
C -4 -5
A -8 ?
T -12
A -16
C -20
-4
0 -4
-GCA
G-CA
--GCA-
-4 -9 -12
A C G T
A 10 -5 0 -5
C -5 10 -5 0
G 0 -5 10 -5
T -5 0 -5 10
![Page 20: Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader037.fdocuments.net/reader037/viewer/2022110304/551c1d02550346b24f8b59d4/html5/thumbnails/20.jpg)
DP matrix
G A A T C
0 -4 -8 -12 -16 -20
C -4 -5
A -8 -4
T -12
A -16
C -20
-4
0 -4
-GCA
G-CA
--GCA-
-4 -9 -12
A C G T
A 10 -5 0 -5
C -5 10 -5 0
G 0 -5 10 -5
T -5 0 -5 10
![Page 21: Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader037.fdocuments.net/reader037/viewer/2022110304/551c1d02550346b24f8b59d4/html5/thumbnails/21.jpg)
DP matrix
G A A T C
0 -4 -8 -12 -16 -20
C -4 -5
A -8 -4
T -12 ?
A -16 ?
C -20 ?
A C G T
A 10 -5 0 -5
C -5 10 -5 0
G 0 -5 10 -5
T -5 0 -5 10
![Page 22: Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader037.fdocuments.net/reader037/viewer/2022110304/551c1d02550346b24f8b59d4/html5/thumbnails/22.jpg)
DP matrix
G A A T C
0 -4 -8 -12 -16 -20
C -4 -5
A -8 -4
T -12 -8
A -16 -12
C -20 -16
A C G T
A 10 -5 0 -5
C -5 10 -5 0
G 0 -5 10 -5
T -5 0 -5 10
![Page 23: Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader037.fdocuments.net/reader037/viewer/2022110304/551c1d02550346b24f8b59d4/html5/thumbnails/23.jpg)
DP matrix
G A A T C
0 -4 -8 -12 -16 -20
C -4 -5 ?
A -8 -4 ?
T -12 -8 ?
A -16 -12 ?
C -20 -16 ?
A C G T
A 10 -5 0 -5
C -5 10 -5 0
G 0 -5 10 -5
T -5 0 -5 10
![Page 24: Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader037.fdocuments.net/reader037/viewer/2022110304/551c1d02550346b24f8b59d4/html5/thumbnails/24.jpg)
DP matrix
G A A T C
0 -4 -8 -12 -16 -20
C -4 -5 -9
A -8 -4 5
T -12 -8 1
A -16 -12 2
C -20 -16 -2
What is the alignment
associated with this entry?
A C G T
A 10 -5 0 -5
C -5 10 -5 0
G 0 -5 10 -5
T -5 0 -5 10
(just follow the arrows back - this is called the traceback)
![Page 25: Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader037.fdocuments.net/reader037/viewer/2022110304/551c1d02550346b24f8b59d4/html5/thumbnails/25.jpg)
DP matrix
G A A T C
0 -4 -8 -12 -16 -20
C -4 -5 -9
A -8 -4 5
T -12 -8 1
A -16 -12 2
C -20 -16 -2
A C G T
A 10 -5 0 -5
C -5 10 -5 0
G 0 -5 10 -5
T -5 0 -5 10
-G-ACATA
![Page 26: Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader037.fdocuments.net/reader037/viewer/2022110304/551c1d02550346b24f8b59d4/html5/thumbnails/26.jpg)
G A A T C
0 -4 -8 -12 -16 -20
C -4 -5 -9
A -8 -4 5
T -12 -8 1
A -16 -12 2
C -20 -16 -2 ?
A C G T
A 10 -5 0 -5
C -5 10 -5 0
G 0 -5 10 -5
T -5 0 -5 10
Continue and find the optimal
global alignment, and
its score.
DP matrix
![Page 27: Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader037.fdocuments.net/reader037/viewer/2022110304/551c1d02550346b24f8b59d4/html5/thumbnails/27.jpg)
G A A T C
0 -4 -8 -12 -16 -20
C -4 -5 -9 -13 -12 -6
A -8 -4 5 1 -3 -7
T -12 -8 1 0 11 7
A -16 -12 2 11 7 6
C -20 -16 -2 7 11 17
A C G T
A 10 -5 0 -5
C -5 10 -5 0
G 0 -5 10 -5
T -5 0 -5 10
Best alignment starts at lower right and follows traceback arrows to top left
![Page 28: Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader037.fdocuments.net/reader037/viewer/2022110304/551c1d02550346b24f8b59d4/html5/thumbnails/28.jpg)
G A A T C
0 -4 -8 -12 -16 -20
C -4 -5 -9 -13 -12 -6
A -8 -4 5 1 -3 -7
T -12 -8 1 0 11 7
A -16 -12 2 11 7 6
C -20 -16 -2 7 11 17
A C G T
A 10 -5 0 -5
C -5 10 -5 0
G 0 -5 10 -5
T -5 0 -5 10
GA-ATCCATA-C
One best traceback
![Page 29: Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader037.fdocuments.net/reader037/viewer/2022110304/551c1d02550346b24f8b59d4/html5/thumbnails/29.jpg)
G A A T C
0 -4 -8 -12 -16 -20
C -4 -5 -9 -13 -12 -6
A -8 -4 5 1 -3 -7
T -12 -8 1 0 11 7
A -16 -12 2 11 7 6
C -20 -16 -2 7 11 17
A C G T
A 10 -5 0 -5
C -5 10 -5 0
G 0 -5 10 -5
T -5 0 -5 10
GAAT-C-CATAC
Another best traceback
![Page 30: Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader037.fdocuments.net/reader037/viewer/2022110304/551c1d02550346b24f8b59d4/html5/thumbnails/30.jpg)
G A A T C
0 -4 -8 -12 -16 -20
C -4 -5 -9 -13 -12 -6
A -8 -4 5 1 -3 -7
T -12 -8 1 0 11 7
A -16 -12 2 11 7 6
C -20 -16 -2 7 11 17
A C G T
A 10 -5 0 -5
C -5 10 -5 0
G 0 -5 10 -5
T -5 0 -5 10
GA-ATCCATA-C
GAAT-C-CATAC
![Page 31: Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader037.fdocuments.net/reader037/viewer/2022110304/551c1d02550346b24f8b59d4/html5/thumbnails/31.jpg)
Multiple solutions
• When a program returns a single sequence alignment, it may not be the only best alignment but it is guaranteed to be one of them.
• In our example, all of the alignments at the left have equal scores.
GA-ATCCATA-C
GAAT-CCA-TAC
GAAT-CC-ATAC
GAAT-C-CATAC
![Page 32: Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader037.fdocuments.net/reader037/viewer/2022110304/551c1d02550346b24f8b59d4/html5/thumbnails/32.jpg)
DP in equation form• Align sequence x and y.• F is the DP matrix; s is the
substitution matrix; d is the linear gap penalty.
djiF
djiF
yxsjiF
jiF
F
ji
1,
,1
,1,1
max,
00,0
![Page 33: Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader037.fdocuments.net/reader037/viewer/2022110304/551c1d02550346b24f8b59d4/html5/thumbnails/33.jpg)
DP equation graphically
1,1 jiF
jiF , jiF ,1
1, jiF
d
d ji yxs ,
take the max of these three
![Page 34: Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader037.fdocuments.net/reader037/viewer/2022110304/551c1d02550346b24f8b59d4/html5/thumbnails/34.jpg)
Dynamic programming
• Yes, it’s a weird name.• DP is closely related to recursion and
to mathematical induction.• We can prove that the resulting score
is optimal.
![Page 35: Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader037.fdocuments.net/reader037/viewer/2022110304/551c1d02550346b24f8b59d4/html5/thumbnails/35.jpg)
Summary
• Scoring a pairwise alignment requires a substitution matrix and gap penalties.
• Dynamic programming is an efficient algorithm for finding an optimal alignment.
• Entry (i,j) in the DP matrix stores the score of the best-scoring alignment up to that position.
• DP iteratively fills in the matrix using a simple mathematical rule.
![Page 36: Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader037.fdocuments.net/reader037/viewer/2022110304/551c1d02550346b24f8b59d4/html5/thumbnails/36.jpg)
A C G T
A 10 -5 0 -5
C -5 10 -5 0
G 0 -5 10 -5
T -5 0 -5 10
G A A T C
0
A
A
T
T
C
Practice problem: find a best pairwise alignment of GAATC and AATTC
d = -4