Markov models and applications Sushmita Roy BMI/CS 576 [email protected] Oct 7 th, 2014.
Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy [email protected] Sep 10 th,...
-
Upload
hailee-harrill -
Category
Documents
-
view
222 -
download
3
Transcript of Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy [email protected] Sep 10 th,...
![Page 1: Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy sroy@biostat.wisc.edu Sep 10 th, 2013 BMI/CS 576.](https://reader036.fdocuments.net/reader036/viewer/2022062621/551c20f4550346a34f8b5b7a/html5/thumbnails/1.jpg)
Pairwise Sequence Alignment
Sushmita RoyBMI/CS 576
www.biostat.wisc.edu/bmi576/Sushmita Roy
[email protected] 10th, 2013
BMI/CS 576
![Page 2: Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy sroy@biostat.wisc.edu Sep 10 th, 2013 BMI/CS 576.](https://reader036.fdocuments.net/reader036/viewer/2022062621/551c20f4550346a34f8b5b7a/html5/thumbnails/2.jpg)
Key concepts in today’s class
• What is pairwise sequence alignment?• How to handle gaps?
– Gap penalties
• What are the issues in sequence alignment?• What are the different types of alignment problems
– Global alignment– Local alignment
• Dynamic programming solution to finding the best alignment• Given a scoring function, find the best alignment of two
sequences– Global– Local
![Page 3: Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy sroy@biostat.wisc.edu Sep 10 th, 2013 BMI/CS 576.](https://reader036.fdocuments.net/reader036/viewer/2022062621/551c20f4550346a34f8b5b7a/html5/thumbnails/3.jpg)
What is sequence alignment
The task of locating equivalent regions of two or more sequences to maximize their overall similarity
![Page 4: Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy sroy@biostat.wisc.edu Sep 10 th, 2013 BMI/CS 576.](https://reader036.fdocuments.net/reader036/viewer/2022062621/551c20f4550346a34f8b5b7a/html5/thumbnails/4.jpg)
Why sequence alignment?
• Identifying similarity between two sequences
• Sequence specifies the function of a protein
• Similarity in sequence can imply similarity in function.– Assign function to uncharacterized sequences based on characterized
sequences
• Sequence from different species can be compared to estimate the evolutionary relationships between species – We will come back to this in Phylogenetic trees.
![Page 5: Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy sroy@biostat.wisc.edu Sep 10 th, 2013 BMI/CS 576.](https://reader036.fdocuments.net/reader036/viewer/2022062621/551c20f4550346a34f8b5b7a/html5/thumbnails/5.jpg)
Alignment example
T H I S S E Q U E N C E
T H A T S E Q U E N C E
![Page 6: Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy sroy@biostat.wisc.edu Sep 10 th, 2013 BMI/CS 576.](https://reader036.fdocuments.net/reader036/viewer/2022062621/551c20f4550346a34f8b5b7a/html5/thumbnails/6.jpg)
How to compare these two sequences?
T H I S S E Q U E N C E
T H A T I S A S E Q U E N C E
![Page 7: Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy sroy@biostat.wisc.edu Sep 10 th, 2013 BMI/CS 576.](https://reader036.fdocuments.net/reader036/viewer/2022062621/551c20f4550346a34f8b5b7a/html5/thumbnails/7.jpg)
Need to incorporate gaps
_ _ _T H I S S E Q U E N C E
T H A T I S A S E Q U E N C E
T H I S _ _ _ S E Q U E N C E
T H A T I S A S E Q U E N C E
Alignment 1: 3 gaps, 8 matches Alignment 2: 3 gaps, 9 matches
![Page 8: Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy sroy@biostat.wisc.edu Sep 10 th, 2013 BMI/CS 576.](https://reader036.fdocuments.net/reader036/viewer/2022062621/551c20f4550346a34f8b5b7a/html5/thumbnails/8.jpg)
Issues in sequence alignment
• What type of alignment?– Align the entire sequence or part of it?
• How to score an alignment?– the sequences we’re comparing typically differ in length– some amino acid pairs are more substitutable than others
• How to find the alignment?– Algorithms for alignment
• How to tell if the alignment is biologically meaningful?– Computing significance of an alignment
![Page 9: Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy sroy@biostat.wisc.edu Sep 10 th, 2013 BMI/CS 576.](https://reader036.fdocuments.net/reader036/viewer/2022062621/551c20f4550346a34f8b5b7a/html5/thumbnails/9.jpg)
Pairwise alignment: task definition
Given– a pair of sequences (DNA or protein)– a scoring function
Do– determine the common substrings in the sequences
such that the similarity score is maximized
![Page 10: Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy sroy@biostat.wisc.edu Sep 10 th, 2013 BMI/CS 576.](https://reader036.fdocuments.net/reader036/viewer/2022062621/551c20f4550346a34f8b5b7a/html5/thumbnails/10.jpg)
Sequence can change through different mutations
• substitutions: ACGA AGGA
• insertions: ACGA ACGGA
• deletions: ACGA AGAGaps
![Page 11: Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy sroy@biostat.wisc.edu Sep 10 th, 2013 BMI/CS 576.](https://reader036.fdocuments.net/reader036/viewer/2022062621/551c20f4550346a34f8b5b7a/html5/thumbnails/11.jpg)
Scoring function
• Scoring function comprises– Substitution matrix s(a,b): indicates the score of aligning a with b– Gap penalty function γ(g) where g is the gap length
• DNA has a simple scoring function– Consider match and mismatch
• Proteins have a more complex scoring function– Some amino acid pairs might be more substitutable versus others– Scores might capture
• similar physical and chemical properties• evolutionary relationships• More next class
![Page 12: Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy sroy@biostat.wisc.edu Sep 10 th, 2013 BMI/CS 576.](https://reader036.fdocuments.net/reader036/viewer/2022062621/551c20f4550346a34f8b5b7a/html5/thumbnails/12.jpg)
Gap penalty function
• Different gap penalty functions require somewhat different scoring algorithms
• Linear score
– γ(g)=-dg, where d is a fixed cost, g is the gap length
– Longer the gap higher the penalty
• Affine score
– γ(g)=-(d +(g-1)e), d and e are fixed penalties and e<d.
– d : gap open penalty
– e : gap extension penalty
![Page 13: Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy sroy@biostat.wisc.edu Sep 10 th, 2013 BMI/CS 576.](https://reader036.fdocuments.net/reader036/viewer/2022062621/551c20f4550346a34f8b5b7a/html5/thumbnails/13.jpg)
Types of pairwise sequence alignment problems
• Global alignment (Needleman-Wunsch algorithm)– Align the entire sequence
• Local alignment (Smith-Waterman algorithm)– Align part of the sequence
![Page 14: Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy sroy@biostat.wisc.edu Sep 10 th, 2013 BMI/CS 576.](https://reader036.fdocuments.net/reader036/viewer/2022062621/551c20f4550346a34f8b5b7a/html5/thumbnails/14.jpg)
Global alignment
• Given two sequences u and v and a scoring function, find the alignment with the maximal score.
• The number of possible alignments is very big.
![Page 15: Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy sroy@biostat.wisc.edu Sep 10 th, 2013 BMI/CS 576.](https://reader036.fdocuments.net/reader036/viewer/2022062621/551c20f4550346a34f8b5b7a/html5/thumbnails/15.jpg)
Number of possible alignments
CAATGA ATTGATSequence 1 Sequence 2
CAATGA__ATTGAT
CAATGAATTGAT
CAATGA_AT_TGAT
Alignment 1 Alignment 2 Alignment 3
![Page 16: Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy sroy@biostat.wisc.edu Sep 10 th, 2013 BMI/CS 576.](https://reader036.fdocuments.net/reader036/viewer/2022062621/551c20f4550346a34f8b5b7a/html5/thumbnails/16.jpg)
Number of possible alignments
• There are
nn
n
n
n n
π
2
2
2)!()!2(2
≈=⎟⎟⎠
⎞⎜⎜⎝
⎛
possible global alignments for 2 sequences of length n
• E.g. two sequences of length 100 have possible alignments
• But we can use dynamic programming to find an optimal alignment efficiently
7710≈
![Page 17: Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy sroy@biostat.wisc.edu Sep 10 th, 2013 BMI/CS 576.](https://reader036.fdocuments.net/reader036/viewer/2022062621/551c20f4550346a34f8b5b7a/html5/thumbnails/17.jpg)
Dynamic programming (DP)
• Divide and conquer
• Solve a problem using pre-computed solutions for subparts of the problem
![Page 18: Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy sroy@biostat.wisc.edu Sep 10 th, 2013 BMI/CS 576.](https://reader036.fdocuments.net/reader036/viewer/2022062621/551c20f4550346a34f8b5b7a/html5/thumbnails/18.jpg)
Dynamic programming for sequence alignment
• Two sequences: AAAC, AAG• x: AAAC
– xi: Denotes the ith letter
• y: AAG– yi denotes the jth letter in y.
• Assume cost of gap is d.• Assume we have aligned x1..xi to y1..yj
• There are three possibilities
AAxi
AAyj
AAAAAG
AA_AAG
AAAAA_
![Page 19: Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy sroy@biostat.wisc.edu Sep 10 th, 2013 BMI/CS 576.](https://reader036.fdocuments.net/reader036/viewer/2022062621/551c20f4550346a34f8b5b7a/html5/thumbnails/19.jpg)
Dynamic programming idea
• Let x’s length be n• Let y’s length be m• construct an (n+1) (m+1) matrix F• F ( j, i ) = score of the best alignment of x1…xi with y1…yj
y
x
A
A
CA G
A
C
score of best alignment ofAAA to AG
![Page 20: Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy sroy@biostat.wisc.edu Sep 10 th, 2013 BMI/CS 576.](https://reader036.fdocuments.net/reader036/viewer/2022062621/551c20f4550346a34f8b5b7a/html5/thumbnails/20.jpg)
DP for global alignment with linear gap penalty
€
F( j,i) = max
F( j −1,i −1) +s(x i,y j )
F( j −1,i) − d
F( j,i −1) − d
⎧
⎨ ⎪
⎩ ⎪
F(j,i)
F(j-1,i-1) F(j-1,i)
F(j,i-1)-d
-ds(xi,yj)
DP recurrence relation
Score of the best partial alignment between x1..xi and y1.. yj
![Page 21: Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy sroy@biostat.wisc.edu Sep 10 th, 2013 BMI/CS 576.](https://reader036.fdocuments.net/reader036/viewer/2022062621/551c20f4550346a34f8b5b7a/html5/thumbnails/21.jpg)
DP algorithm sketch: global alignment
• Initialize first row and column of matrix
• Fill in rest of matrix from top to bottom, left to right
• For each F(i, j), save pointer(s) to cell(s) that resulted in best score
• F(m, n) holds the optimal alignment score;
• Trace back from F(m, n) to F(0, 0) to recover alignment
![Page 22: Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy sroy@biostat.wisc.edu Sep 10 th, 2013 BMI/CS 576.](https://reader036.fdocuments.net/reader036/viewer/2022062621/551c20f4550346a34f8b5b7a/html5/thumbnails/22.jpg)
Global alignment example
• Suppose we choose the following scoring scheme
d = 2, where d is the gap penalty
![Page 23: Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy sroy@biostat.wisc.edu Sep 10 th, 2013 BMI/CS 576.](https://reader036.fdocuments.net/reader036/viewer/2022062621/551c20f4550346a34f8b5b7a/html5/thumbnails/23.jpg)
Initializing matrix: global alignment with linear gap penalty
A -d
A -2d
CA G
A -3d
C -4d
0 -3d-d -2d
![Page 24: Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy sroy@biostat.wisc.edu Sep 10 th, 2013 BMI/CS 576.](https://reader036.fdocuments.net/reader036/viewer/2022062621/551c20f4550346a34f8b5b7a/html5/thumbnails/24.jpg)
Global alignment example
A -2
A -4
CA G
A -6
C -8
0 -6-2 -4
1 -3-1
0 -2
-2 -1
-5 -4 -1
-1
-3
![Page 25: Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy sroy@biostat.wisc.edu Sep 10 th, 2013 BMI/CS 576.](https://reader036.fdocuments.net/reader036/viewer/2022062621/551c20f4550346a34f8b5b7a/html5/thumbnails/25.jpg)
Trace back to retrieve the alignment
A -2
A -4
CA G
A -6
C -8
0 -6-2 -4
1 -3-1
0 -2
-2 -1
-5 -4 -1
-1
-3
x:
y:
one optimal alignment
C
C
A
_
A
G
A
A
![Page 26: Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy sroy@biostat.wisc.edu Sep 10 th, 2013 BMI/CS 576.](https://reader036.fdocuments.net/reader036/viewer/2022062621/551c20f4550346a34f8b5b7a/html5/thumbnails/26.jpg)
Computational complexity
• initialization: O(m), O(n) where sequence lengths are m, n• filling in rest of matrix: O(mn)• traceback: O(m + n)• Since sequences have nearly same length, the computational
complexity is
)( 2nO
![Page 27: Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy sroy@biostat.wisc.edu Sep 10 th, 2013 BMI/CS 576.](https://reader036.fdocuments.net/reader036/viewer/2022062621/551c20f4550346a34f8b5b7a/html5/thumbnails/27.jpg)
Summary of DP
• Maximize F(j,i) using F(j-1,i-1), F(j-1,i) or F(j,i-1)• works for either DNA or protein sequences, although the
substitution matrices used differ• finds an optimal alignment• the exact algorithm (and computational complexity) depends
on gap penalty function
![Page 28: Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy sroy@biostat.wisc.edu Sep 10 th, 2013 BMI/CS 576.](https://reader036.fdocuments.net/reader036/viewer/2022062621/551c20f4550346a34f8b5b7a/html5/thumbnails/28.jpg)
Local alignment
• Look for local regions sequence similarity.– Aligning substrings of x and y.
• Try to find the best alignment over all possible substrings.
• This seems difficult computationally.– But can be solved by DP.
![Page 29: Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy sroy@biostat.wisc.edu Sep 10 th, 2013 BMI/CS 576.](https://reader036.fdocuments.net/reader036/viewer/2022062621/551c20f4550346a34f8b5b7a/html5/thumbnails/29.jpg)
Local alignment motivation
• Comparing protein sequences that share a domain (independently folded unit) but differ elsewhere
• Comparing DNA sequences that share a similar sequence pattern but differ elsewhere
• More sensitive when comparing highly diverged sequences– Selection on the genome differs between regions– Unconstrained regions are not very alignable
![Page 30: Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy sroy@biostat.wisc.edu Sep 10 th, 2013 BMI/CS 576.](https://reader036.fdocuments.net/reader036/viewer/2022062621/551c20f4550346a34f8b5b7a/html5/thumbnails/30.jpg)
Local alignment DP
Similar to global alignment with a few exceptions• DP recurrence is different
• Top and left row are initialized to 0s.
€
F( j,i) = max
F( j −1,i −1) +s(x i, y j )
F( j −1,i) − d
F( j,i −1) − d
0
⎧
⎨ ⎪ ⎪
⎩ ⎪ ⎪
New in local: starts a new alignment
![Page 31: Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy sroy@biostat.wisc.edu Sep 10 th, 2013 BMI/CS 576.](https://reader036.fdocuments.net/reader036/viewer/2022062621/551c20f4550346a34f8b5b7a/html5/thumbnails/31.jpg)
Local alignment DP algorithm
• Initialization: first row and first column initialized with 0’s
• Traceback:– Find maximum value of F(j, i)
• can be anywhere in matrix
– Stop when we get to a cell with value 0
![Page 32: Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy sroy@biostat.wisc.edu Sep 10 th, 2013 BMI/CS 576.](https://reader036.fdocuments.net/reader036/viewer/2022062621/551c20f4550346a34f8b5b7a/html5/thumbnails/32.jpg)
Local alignment example
0
0
00 00
00 00
0
T
T
A
A
G
0
0
0
0
0
0 0
G
0
A
0
A
0
A
1
0
1
1 2
3
1
1
x:
y:
G
G
A
A
A
A
1