Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G...
-
Upload
derick-anthony -
Category
Documents
-
view
217 -
download
0
Transcript of Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G...
![Page 1: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/1.jpg)
Pair-wise Sequence Alignment
Introduction to bioinformatics 2007
Lecture 5
CENTR
FORINTEGRATIVE
BIOINFORMATICSVU
E
![Page 2: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/2.jpg)
“Nothing in Biology makes sense except in the light of evolution” (Theodosius Dobzhansky (1900-1975))
“Nothing in bioinformatics makes sense except in the light of Biology”
Bioinformatics
![Page 3: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/3.jpg)
The Genetic Code
![Page 4: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/4.jpg)
A protein sequence alignmentMSTGAVLIY--TSILIKECHAMPAGNE--------GGILLFHRTHELIKESHAMANDEGGSNNS * * * **** ***
A DNA sequence alignmentattcgttggcaaatcgcccctatccggccttaaatt---tggcggatcg-cctctacgggcc----*** **** **** ** ******
![Page 5: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/5.jpg)
Searching for similarities
What is the function of the new gene?
The “lazy” investigation (i.e., no biologial experiments, just bioinformatics techniques):
– Find a set of similar protein sequences to the unknown sequence
– Identify similarities and differences
– For long proteins: first identify domains
![Page 6: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/6.jpg)
Is similarity really interesting?
![Page 7: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/7.jpg)
Evolutionary and functional relationships
Reconstruct evolutionary relation:
• Based on sequence-Identity (simplest method)-Similarity• Homology (common ancestry: the ultimate goal)• Other (e.g., 3D structure)
Functional relation:Sequence Structure Function
![Page 8: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/8.jpg)
Common ancestry is more interesting:Makes it more likely that genes sharethe same function
Homology: sharing a common ancestor– a binary property (yes/no)– it’s a nice tool:When (an unknown) gene X is homologous to (a known) gene G it means that we gain a lot of information on X: what we know about G can be transferred to X as a good suggestion.
Searching for similarities
![Page 9: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/9.jpg)
How to go from DNA to protein sequence
A piece of double stranded DNA:
5’ attcgttggcaaatcgcccctatccggc 3’3’ taagcaaccgtttagcggggataggccg 5’
DNA direction is from 5’ to 3’
![Page 10: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/10.jpg)
How to go from DNA to protein sequence
6-frame translation using the codon table (last lecture):
5’ attcgttggcaaatcgcccctatccggc 3’
3’ taagcaaccgtttagcggggataggccg 5’
![Page 11: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/11.jpg)
Bioinformatics tool
Data
Algorithm
BiologicalInterpretation
(model)
tool
![Page 12: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/12.jpg)
Example today: Pairwise sequence alignment needs sense of evolution
Global dynamic programmingMDAGSTVILCFVG
MDAASTILCGS Amino Acid
Exchange Matrix
Gap penalties (open,extension)
Search matrix
MDAGSTVILCFVG-MDAAST-ILC--GS
Evolution
![Page 13: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/13.jpg)
How to determine similarity?
• Frequent evolutionary events:1. Substitution2. Insertion, deletion3. Duplication4. Inversion
• Evolution at work
We’ll only use these
Z
X Y
Common ancestor, usually extinct
available
![Page 14: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/14.jpg)
Alignment - the problem
Operations: – substitution, – Insertion– deletion
GTACT--CGGA-TGAC
algorithmsforgenomes||||||||||algorithms4genomes
Algorithmsforgenomes ||||||| algorithms4genomes
algorithmsforgenomes||||||||||? |||||||algorithms4--genomes
![Page 15: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/15.jpg)
A protein sequence alignmentMSTGAVLIY--TSILIKECHAMPAGNE--------GGILLFHRTHELIKESHAMANDEGGSNNS * * * **** ***
A DNA sequence alignmentattcgttggcaaatcgcccctatccggccttaaatt---tggcggatcg-cctctacgggcc----*** **** **** ** ******
![Page 16: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/16.jpg)
– Substitution (or match/mismatch)
• DNA
• proteins
– Gap penalty
• Linear: gp(k)=ak
• Affine: gp(k)=b+ak
• Concave, e.g.: gp(k)=log(k)
The score for an alignment is the sum of the scores of all alignment columns
Dynamic programmingScoring alignments
![Page 17: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/17.jpg)
– Substitution (or match/mismatch)
• DNA
• proteins
– Gap penalty
• Linear: gp(k)=ak
• Affine: gp(k)=b+ak
• Concave, e.g.: gp(k)=log(k)The score for an alignment is the sum
of the scores over all alignment columns
Dynamic programmingScoring alignments
• /Gap length
pen
alt
y
affine
concave
linear
,
General alignment score:
Sa,b = )(kgpNk
k li jbas ),(
![Page 18: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/18.jpg)
Substitution Matrices: DNA
define a score for match/mismatch of letters
Simple:
Used in genome alignments:
A C G T
A 1 -1 -1 -1
C -1 1 -1 -1
G -1 -1 1 -1
T -1 -1 -1 1
A C G T
A 91 -114 -31 -123
C -114 100 -125 -31
G -31 -125 100 -114
T -123 -31 -114 91
![Page 19: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/19.jpg)
Substitution matrices for a.a.
Amino acids are not equal:1. Some are similar and easily
substituted:• biochemical properties• structure
2. Some mutations occur more often due to similar codons
The two above give us substitution matrices
orange: nonpolar and hydrophobic.green: polar and hydrophilic magenta box are acidiclight blue box are basic
http://www.cimr.cam.ac.uk/links/codon.htm
http
://w
ww
.peo
ple.
virg
inia
.edu
/~rj
h9u/
amin
acid
.htm
l
![Page 20: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/20.jpg)
BLOSUM 62 substitution matrix
http://www.carverlab.org/testing/epp.html
![Page 21: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/21.jpg)
Constant vs. affine gap scoring
Seq1 G T A - - G - T - ASeq2 - - A T G - A T G -
Const -2 –2 1 –2 –2 (SUM = -7) -2 –2 1 -2 –2 (SUM = -7)
Affine –4 1 -4 (SUM = -7) -3 –3 1 -3 –3 (SUM = -11)
-1
-2
extensionintroGap
Scoring
-3Affine
-2Constant
…and +1 for match
![Page 22: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/22.jpg)
Dynamic programmingScoring alignments
10 1
Amino Acid Exchange Matrix Affine gap penalties (open, extension)
2020
Score: s(T,T)+s(D,D)+s(W,W)+s(V,L)-Po-2Px +
+s(L,I)+s(K,K)
T D W V T A L KT D W L - - I K
![Page 23: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/23.jpg)
Amino acid exchange matrices
How do we get one?
And how do we get associated gap penalties?
First systematic method to derive a.a. exchange matrices by Margaret Dayhoff et al. (1968) – Atlas of Protein Structure
2020
![Page 24: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/24.jpg)
A 2
R -2 6
N 0 0 2
D 0 -1 2 4
C -2 -4 -4 -5 12
Q 0 1 1 2 -5 4
E 0 -1 1 3 -5 2 4
G 1 -3 0 1 -3 -1 0 5
H -1 2 2 1 -3 3 1 -2 6
I -1 -2 -2 -2 -2 -2 -2 -3 -2 5
L -2 -3 -3 -4 -6 -2 -3 -4 -2 2 6
K -1 3 1 0 -5 1 0 -2 0 -2 -3 5
M -1 0 -2 -3 -5 -1 -2 -3 -2 2 4 0 6
F -4 -4 -4 -6 -4 -5 -5 -5 -2 1 2 -5 0 9
P 1 0 -1 -1 -3 0 -1 -1 0 -2 -3 -1 -2 -5 6
S 1 0 1 0 0 -1 0 1 -1 -1 -3 0 -2 -3 1 2
T 1 -1 0 0 -2 -1 0 0 -1 0 -2 0 -1 -3 0 1 3
W -6 2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4 0 -6 -2 -5 17
Y -3 -4 -2 -4 0 -4 -4 -5 0 -1 -1 -4 -2 7 -5 -3 -3 0 10
V 0 -2 -2 -2 -2 -2 -2 -1 -2 4 2 -2 2 -1 -1 -1 0 -6 -2 4
A R N D C Q E G H I L K M F P S T W Y V
PAM250 matrix
amino acid exchange matrix (log odds)
Positive exchange values denote mutations that are more likely than randomly expected, while negative numbers correspond to avoided mutations compared to the randomly expected situation
![Page 25: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/25.jpg)
Amino acids are not equal:
1. Some are easily substituted because they have similar:
• physico-chemical properties
• structure
2. Some mutations between amino acids occur more often due to similar codons
The two above observations give us ways to define substitution matrices
Amino acid exchange matrices
![Page 26: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/26.jpg)
Pair-wise alignment
Combinatorial explosion- 1 gap in 1 sequence: n+1 possibilities- 2 gaps in 1 sequence: (n+1)n - 3 gaps in 1 sequence: (n+1)n(n-1), etc.
2n (2n)! 22n
= ~ n (n!)2 n 2 sequences of 300 a.a.: ~1088 alignments 2 sequences of 1000 a.a.: ~10600 alignments!
T D W V T A L KT D W L - - I K
![Page 27: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/27.jpg)
Technique to overcome the combinatorial explosion:Dynamic Programming
• Alignment is simulated as Markov process, all sequence positions are seen as independent
• Chances of sequence events are independent– Therefore, probabilities per aligned position
need to be multiplied– Amino acid matrices contain so-called log-odds
values (log10 of the probabilities), so probabilities can be summed
![Page 28: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/28.jpg)
• To perform statistical analyses on messages or sequences, we need a reference model.
• The model: each letter in a sequence is selected from a defined alphabet in an independent and identically distributed (i.i.d.) manner.
• This choice of model system will allow us to compute the statistical significance of certain characteristics of a sequence, its subsequences, or an alignment.
• Given a probability distribution, Pi, for the letters in a i.i.d. message, the probability of seeing a particular sequence of letters i, j, k, ... n is simply Pi Pj Pk···Pn.
• As an alternative to multiplication of the probabilities, we could sum their logarithms and exponentiate the result. The probability of the same sequence of letters can be computed by exponentiating log Pi +
log Pj + log Pk+ ··· + log Pn.
• In practice, when aligning sequences we only add log-odds values (residue exchange matrix) but we do not exponentiate the final score.
To say the same more statistically…
![Page 29: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/29.jpg)
Sequence alignmentHistory of Dynamic
Programming algorithm
1970 Needleman-Wunsch global pair-wise alignment
Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins, J Mol Biol. 48(3):443-53.
1981 Smith-Waterman local pair-wise alignment
Smith, TF, Waterman, MS (1981) Identification of common molecular subsequences. J. Mol. Biol. 147, 195-197.
![Page 30: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/30.jpg)
Pairwise sequence alignmentGlobal dynamic programming
MDAGSTVILCFVGMDAASTILCGS
Amino Acid Exchange
Matrix
Gap penalties (open,extension)
Search matrix
MDAGSTVILCFVG-MDAAST-ILC--GS
Evolution
![Page 31: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/31.jpg)
Global dynamic programming
i-1i
j-1 j
H(i,j) = Max H(i-1,j-1) + S(i,j)H(i-1,j) - gH(i,j-1) - g
diagonalverticalhorizontal
Value from residue exchange matrix
This is a recursive formula
![Page 32: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/32.jpg)
Global alignmentThe Needleman-Wunsch algorithm
• Goal: find the maximal scoring alignment• Scores: m match, -s mismatch, -g for insertion/deletion• Dynamic programming
– Solve smaller subproblem(s)– Iteratively extend the solution
• The best alignment for X[1…i] and Y[1…j] is called M[i, j]
X1 … Xi-1 - - Xi
Y1 … - Yj-1 Yj -
![Page 33: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/33.jpg)
The algorithm
Y[1…j-1] Y[j]X[1…i-1] X[i]
Y[1…j-1]Y[j]
X[1…i] -Y[1…j] -
X[1…i-1] X[i]
M[i, j-1] - g
M[i-1, j] - g
M[i-1, j-1]
+ m- sM[i,j]
=
Goal: find the maximal scoring alignment Scores: m match, -s mismatch, -g for
insertion/deletion The best alignment for X[1…i] and Y[1…j]
is called M[i, j] 3 ways to extend the alignment
![Page 34: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/34.jpg)
The algorithm – final equation
M[i, j] =
M[i, j-1] – gM[i-1, j] – g
M[i-1, j-1] + score(X[i],Y[j])
jj-1
i-1
i
max
*
* -g
-g
X1…Xi-1 Xi
Y1…Yj-1 Yj
X1…Xi -Y1…Yj-1 Yj
X1…Xi-1 Xi
Y1…Yj -
Corresponds to:
![Page 35: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/35.jpg)
Example: global alignment of two
sequences• Align two DNA sequences:
– GAGTGA– GAGGCGA (note the length difference)
• Parameters of the algorithm:– Match: score(A,A) = 1
– Mismatch: score(A,T) = -1
– Gap: g = -2
M[i, j] =
M[i, j-1] – 2M[i-1, j] – 2
M[i-1, j-1] ± 1max
![Page 36: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/36.jpg)
The algorithm. Step 1: init
• Create the matrix• Initiation
– 0 at [0,0]– Apply the
equation…
M[i, j] =
M[i, j-1] – 2M[i-1, j] – 2
M[i-1, j-1] ± 1max
6543210j
7
6
5
4
3
2
1
0
i
A
G
C
G
G
A
G
0-
AGTGAG-
![Page 37: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/37.jpg)
The algorithm. Step 1: init
-14A
-12G
-10C
-8G
-6G
-4A
-2G
-12-10-8-6-4-20-
AGTGAG-• Initiation of the
matrix:– 0 at pos [0,0]– Fill in the first row
using the “” rule– Fill in the first
column using “”
M[i, j] =
M[i, j-1] – 2M[i-1, j] – 2
M[i-1, j-1] ± 1max
j
i
![Page 38: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/38.jpg)
The algorithm. Step 2: fill in
-14A
-12G
-10C
-8G
-6G
2-1-4A
-3-11-2G
-12-10-8-6-4-20-
AGTGAG-• Continue filling in of
the matrix, remembering from which cell the result comes (arrows)
M[i, j] =
M[i, j-1] – 2M[i-1, j] – 2
M[i-1, j-1] ± 1max
j
i
![Page 39: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/39.jpg)
The algorithm. Step 2: fill in
2-1-4-5-8-11-14A
01-2-3-6-9-12G
110-1-4-7-10C
0221-2-5-8G
-3-1130-3-6G
-6-4-202-1-4A
-9-7-5-3-11-2G
-12-10-8-6-4-20-
AGTGAG-• We are done…• Where’s the
result?
M[i, j] =
M[i, j-1] – 2M[i-1, j] – 2
M[i-1, j-1] ± 1max
j
iThe lowest-rightmost cell
![Page 40: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/40.jpg)
The algorithm. Step 3: traceback
2-1-4-5-8-11-14A
01-2-3-6-9-12G
110-1-4-7-10C
0221-2-5-8G
-3-1130-3-6G
-6-4-202-1-4A
-9-7-5-3-11-2G
-12-10-8-6-4-20-
AGTGAG-
• Start at the last cell of the matrix
• Go against the direction of arrows
• Sometimes the value may be obtained from more than one cell (which one?)
j
i
M[i, j] =
M[i, j-1] – 2M[i-1, j] – 2
M[i-1, j-1] ± 1max
![Page 41: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/41.jpg)
The algorithm. Step 3: traceback
• Extract the alignments
2-1-4-5-8-11-14A
01-2-3-6-9-12G
110-1-4-7-10C
0221-2-5-8G
-3-1130-3-6G
-6-4-202-1-4A
-9-7-5-3-11-2G
-12-10-8-6-4-20-
AGTGAG-
j
i
GAGT-GAGAGGCGA
GA-GTGAGAGGCGA
a
b
a)
b)
The ‘low’ and the ‘high’ road
You can also jump from the high to the low road in the middle (following the arrow): to what alignment does that lead?
![Page 42: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/42.jpg)
Affine gaps
M[i, j] =
Ix[i-1, j-1]
Iy[i-1, j-1]
M[i-1, j-1]
maxscore(X[i], Y[j]) +
Ix[i, j] =
M[i, j-1] + go
Ix[i, j-1] + g
e
max
Iy[i, j] =
M[i-1, j] + go
Iy[i-1, j] + g
x
max
• Penalties: g
o - gap opening (e.g. -8)
ge - gap extension (e.g. -1)
X1… -Y1… Yj
X1…Xi -Y1…Yj-1 Yj
X1… - -Y1…Yj-1 Yj
X1… Xi
Y1… Yj
• @ home: think of boundary values M[*,0], I[*,0] etc.
![Page 43: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/43.jpg)
Semi-global pairwise alignment
• Global alignment: all gaps are penalised• Semi-global alignment: N- and C-terminal (5’ and
3’) gaps (end-gaps) are not penalised
MSTGAVLIY--TS--------GGILLFHRTSGTSNS
End-gaps
End-gaps
![Page 44: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/44.jpg)
Variation on global alignment
• Global alignment: previous algorithms is called global alignment, because it uses all letters from both sequences.
CAGCACTTGGATTCTCGGCAGC-----G-T----GG
• Semi-global alignment: don’t penalize for end gaps CAGCA-CTTGGATTCTCGG
---CAGCGTGG--------
![Page 45: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/45.jpg)
Semi-global pairwise alignmentApplications of semi-global:
– Finding a gene in genome
– Placing marker onto a chromosome
– One sequence much longer than the other
Danger: really bad alignments for divergent sequences (particularly if gap penalties are high)
Protein sequences have N- and C-terminal amino acids that are often small and hydrophilic, and so these ends match
seq X:
seq Y:
![Page 46: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/46.jpg)
Semi-global alignment
• Ignore 5’ or N-terminal end gaps– First row/column set
to 0
• Ignore C-terminal or 3’ end gaps– Read the result from
last row/column
T
G
A
-
GTGAG-
1300-10
-202-210
-1-1-11-10
000000
![Page 47: Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.](https://reader036.fdocuments.net/reader036/viewer/2022062518/56649de55503460f94add565/html5/thumbnails/47.jpg)
Take-home messages
• Homology• Why are we interested in similarity?• Pairwise alignment: global alignment
and semi-global alignment• Three types of gap penalty regimes:
linear, affine and concave• Make sure you can calculate
alignment using the DP algorithm