Pairwise alignments
Introduction to Bioinformatics
B!GRe Bioinformatique des
Génomes et Réseaux
Jacques van [email protected]
Aix-Marseille Université (AMU), Institut Français de Bioinformatique (IFB)
https://orcid.org/0000-0002-8799-8584
Dot plots : intuitive visualization of “alignable” sequence segments
Bioinformatics
Dot matrix (pixel matrix; dot plot)
- The dot plot is a simple graphical representation of the identical residues between two sequences.
  - The two sequences are represented on the two axes.
  - A dot is drawn for each match between two residues of the sequences.
  - Diagonal lines reveal the alignable regions between the two sequences.
- Slide: Emese Meglecz

ADSTARYEMQSDQIYTQN
|   | | |||||  |
AETSAQYDMQSDQEFTRD
Dot matrix (pixel matrix; dot plot)
- Besides identities, similarities between amino acids can also be marked (conservative substitutions, according to a given substitution matrix).
  - Example: marking the residue pairs that have a BLOSUM62 score ≥ 1.
- Slide: Emese Meglecz

ADSTARYEMQSDQIYTQN
|:::|:|:||||| :|::
AETSAQYDMQSDQEFTRD
Dot matrix (pixel matrix; dot plot)
- A sliding-window filter can be applied to the dot matrix.
  - Only the diagonals containing at least 2 successive marked dots are displayed.
- At each position of the matrix, the pair of words of size w that starts at the corresponding positions of the two sequences is extracted.
- The score of the word pair is computed by summing the scores of the residue pairs (according to the substitution matrix).
- If the score exceeds a user-defined threshold, a black diagonal is displayed at the corresponding position.
- Adapted from Emese Meglecz
Dot matrix - Detecting indels
- A shift (gap) between two diagonals suggests either a deletion (in the first sequence, in the example shown here) or an insertion (in the second sequence).
- The alignment of two sequences alone does not allow us to decide whether the evolutionary event was a deletion or an insertion.
- The presumed evolutionary event is called an "indel".
- Adapted from Emese Meglecz
Dot plot
- A dot plot is a simple graphical representation of identical residues between two sequences.
  - The X axis represents the first sequence (PHO5).
  - The Y axis represents the second sequence (PHO3).
  - A dot is plotted for each match between two residues of the sequences.
  - Diagonal lines reveal regions of identity between the two sequences.
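The construction described above can be sketched in a few lines of Python. This is only an illustration: the function names `dot_matrix` and `render` are hypothetical, and the two short peptides are the ones used in the identity example earlier in these slides, not the PHO5/PHO3 sequences.

```python
def dot_matrix(seq_x, seq_y):
    """Return one row per residue of seq_y, with 1 where residues are identical."""
    return [[1 if x == y else 0 for x in seq_x] for y in seq_y]


def render(matrix, seq_x, seq_y):
    """Draw the matrix as text: '*' for a dot, '.' for an empty cell."""
    lines = ["  " + seq_x]
    for residue, row in zip(seq_y, matrix):
        lines.append(residue + " " + "".join("*" if cell else "." for cell in row))
    return "\n".join(lines)


seq_x = "ADSTARYEMQSDQIYTQN"  # horizontal axis
seq_y = "AETSAQYDMQSDQEFTRD"  # vertical axis
m = dot_matrix(seq_x, seq_y)
print(render(m, seq_x, seq_y))
```

In the printed matrix, the run of `*` along the main diagonal corresponds to the conserved segment MQSDQ.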
Example: protein sequences of the Pho5p and Pho3p phosphatases in the yeast Saccharomyces cerevisiae
http://emboss.bioinformatics.nl/cgi-bin/emboss/dottup
Dot plot with word matches
- With nucleic acid sequences, each residue is expected to occur once every 4 positions on average. A letter-based dot plot is thus crowded with chance matches and very confusing.
- The dot plot can be adapted to display only word matches, which correspond to a diagonal of dots in the letter-based dot plot.
- Example: alignment of PHO5 and PHO3 coding sequences, with different word sizes.

[Figure panels: word size = 10; word size = 3; word size = 2]
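A minimal sketch of the word-match idea, assuming exact matches only. The function name `word_matches` is hypothetical, and the two toy nucleotide sequences are borrowed from the alignment matrix example of these slides rather than from PHO5 and PHO3.

```python
def word_matches(seq_x, seq_y, word_size):
    """Return the (i, j) start positions where both sequences share the same word."""
    # Index every word of seq_x by its start positions.
    index = {}
    for i in range(len(seq_x) - word_size + 1):
        index.setdefault(seq_x[i:i + word_size], []).append(i)
    # Look up every word of seq_y in that index.
    hits = []
    for j in range(len(seq_y) - word_size + 1):
        for i in index.get(seq_y[j:j + word_size], []):
            hits.append((i, j))
    return hits


seq_x = "AATCTTCAGCGTATTGCT"
seq_y = "ATCTTAGCCGGAGGTATT"
hits_by_size = {w: word_matches(seq_x, seq_y, w) for w in (1, 3, 5)}
for w, hits in hits_by_size.items():
    print(w, len(hits))
```

Increasing the word size removes most of the chance matches: with these toy sequences the number of plotted points drops sharply between word size 1 and word size 5, while the two genuinely shared words (ATCTT and GTATT) remain.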
Dot matrix - Intra-sequence repeats
- The dot matrix makes it possible to spot repeated segments within a sequence.
- For this, the sequence is aligned with itself.
- The repeated segments appear as oblique lines parallel to the main diagonal.
- Adapted from Emese Meglecz

[Figure: dot plot of Seq1 against itself; regions A and B are marked on both axes]
Detecting repeats with a dot plot
- Sequence repeats are easily detected in a dot plot when a sequence is compared to itself.
  - The main diagonal is completely marked (by definition, since the sequence is identical to itself).
  - Repeats appear as segments of lines parallel to the diagonal.
Gray scale dottups
- Word matches require a perfect match over the whole word length.
- A more refined way to show partial similarities is to use windows.
- For each point, the window score is the sum of matches of its neighbours on the diagonal.
- The gray level reflects the window similarity score.
Substitution matrices in dot plots
- Substitution matrices can be used in dot plots.
- The user has to specify the following parameters:
  - Window size (= word size)
  - Score threshold
  - Substitution matrix
- At each position of the plot
  - A pair of words of size w is extracted from the first and second sequences.
  - The score of the word pair is calculated by summing the scores of the corresponding pairs of residues.
  - If the pair of words passes the threshold, a black diagonal is displayed at the corresponding position of the dot plot.
- The regions of similarity between the two sequences appear as black diagonals on the dot plot.
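The procedure above can be sketched as follows. Several things here are assumptions made for illustration: the function names, the choice of "score >= threshold" as the meaning of "passes the threshold", and the toy scoring function, which only knows four off-diagonal BLOSUM62 values taken from the table given later in these slides and simplifies everything else (4 for any identity, -1 for any other pair).

```python
# A handful of BLOSUM62 off-diagonal scores, keyed by alphabetically sorted residue pair.
TOY_SCORES = {("D", "E"): 2, ("S", "T"): 1, ("Q", "R"): 1, ("F", "Y"): 3}


def toy_score(a, b):
    """Simplified stand-in for a substitution matrix (not the full BLOSUM62)."""
    if a == b:
        return 4  # simplification: real BLOSUM62 diagonal values range from 4 to 11
    return TOY_SCORES.get(tuple(sorted((a, b))), -1)


def window_dot_plot(seq_x, seq_y, window, threshold, score=toy_score):
    """Return the (i, j) word starts whose summed pair scores reach the threshold."""
    hits = set()
    for j in range(len(seq_y) - window + 1):
        for i in range(len(seq_x) - window + 1):
            total = sum(score(seq_x[i + k], seq_y[j + k]) for k in range(window))
            if total >= threshold:
                hits.add((i, j))
    return hits


seq_x = "ADSTARYEMQSDQIYTQN"
seq_y = "AETSAQYDMQSDQEFTRD"
strict = window_dot_plot(seq_x, seq_y, window=5, threshold=15)
relaxed = window_dot_plot(seq_x, seq_y, window=5, threshold=10)
print(sorted(strict))
print(sorted(relaxed))
```

With the strict threshold only the perfectly conserved word MQSDQ is retained, whereas the relaxed threshold also keeps the N-terminal word ADSTA/AETSA, whose similarity comes partly from conservative substitutions. This illustrates why the result depends strongly on the chosen parameters.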
Aspartokinases: dot plot with simple word matches
- Let us compare the peptide sequences of two enzymes from the bacterium Escherichia coli K12.
  - LysC: aspartokinase involved in lysine biosynthesis
  - MetL: bifunctional enzyme which combines two domains
    - aspartokinase, which catalyses the first step of methionine biosynthesis
    - homoserine dehydrogenase, which catalyses the third step of methionine biosynthesis
- On the dot plot, the region of similarity between the two aspartokinase domains is barely visible.

[Figure: dottup result, window size = 3]
Aspartokinases: dot plot with substitution matrix (BLOSUM62)
- With dotmatcher, a substitution matrix is used to score the similarity between each pair of residues.
- This reveals the similarity between the aspartokinase domains of MetL and LysC.
- Note that this similarity only covers the N-terminal half of MetL, because it is a bifunctional enzyme: the C-terminal domain is a homoserine dehydrogenase.
- It is quite tricky to find the appropriate parameters, because these can vary from case to case.
Dot matrix – Extension to the genomic scale
- The dot matrix concept can be transposed to the genomic scale, by marking with a dot the presence of homologous genes between the two genomes.
- This approach makes it possible to detect
  - genomic regions where the genes follow each other in a similar order (synteny blocks);
  - chromosomal inversions.
Alignment matrix
- An alignment matrix is conceptually related to a dot plot.
- One sequence is pasted horizontally, the other vertically.
- A score is assigned to each match.
- In the case of dot plots, the scoring scheme was very simple:
  - match = 1
  - mismatch = 0
- In an alignment matrix, "good" alignments appear as high-scoring diagonals.
- Separations between high-scoring diagonals indicate gaps.
  - Vertical separation: a segment of the vertical sequence is aligned with a gap of the horizontal sequence.
  - Horizontal separation: gap in the vertical sequence.

   A A T C T T C A G C G T A T T G C T
A  1 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0
T  0 0 1 0 1 1 0 0 0 0 0 1 0 1 1 0 0 1
C  0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 1 0
T  0 0 1 0 1 1 0 0 0 0 0 1 0 1 1 0 0 1
T  0 0 1 0 1 1 0 0 0 0 0 1 0 1 1 0 0 1
A  1 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0
G  0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0
C  0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 1 0
C  0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 1 0
G  0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0
G  0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0
A  1 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0
G  0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0
G  0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0
T  0 0 1 0 1 1 0 0 0 0 0 1 0 1 1 0 0 1
A  1 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0
T  0 0 1 0 1 1 0 0 0 0 0 1 0 1 1 0 0 1
T  0 0 1 0 1 1 0 0 0 0 0 1 0 1 1 0 0 1

AATCTTCAGC-----GTATTGCT
-ATCTT-AGCCGGAGGTATT---
Software tools for displaying dot plot
- Dotlet
  - Junier T, Pagni M (2000) Dotlet: diagonal plots in a web browser. Bioinformatics. 16:178-9.
  - http://myhits.isb-sib.ch/cgi-bin/dotlet
- Dnadot
  - http://www.vivo.colostate.edu/molkit/dnadot/
  - Draws nucleic acid dot plots, convenient for DNA/RNA alignment.
- Dotter
  - Sonnhammer EL, Durbin R (1995) A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis. Gene. 167:GC1-10.
  - http://www.cgb.ki.se/cgb/groups/sonnhammer/Dotter.html
Finding the best alignment between two sequences
Example of gapless alignment
- The simplest way to align two sequences without gaps is to slide one sequence over the other one, and score the alignment for each possible offset.

[Figure: TTGCGGT is slid along TTAGCCGT; at each of the 14 possible offsets the aligned positions are marked | (match) or - (substitution) and the number of matches is reported. All offsets but one give at most 2 matches; the best offset gives 5 matches:]

 TTGCGGT
 |-||-||   5
TTAGCCGT
Number of possible gapless alignments
- If we only consider substitutions (no deletion, no insertion), sequence alignment is very simple.
- A simple (but not very efficient) algorithm:
  - align the last position of seq1 with the first position of seq2
  - max score ← 0
  - while there are some aligned residues
    - current score ← number of matching residues
    - if current score > max score, then
      - max score ← current score
      - best alignment ← current alignment
    - slide seq1 by one residue, towards the other end of seq2
- Required time:
  - Let us assume that
    - L1 is the length of seq1
    - L2 is the length of seq2
    - L2 <= L1 (seq2 is the shorter sequence if their sizes differ)
  - There are L1+L2-1 possible offsets between sequence 1 and sequence 2.
  - For each offset, one has to sum the score over the length of the alignment (max = L2, when seq2 completely overlaps seq1).
  - T ~ (L1 + L2 - 1)*L2 - L2*(L2-1) = L1*L2
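The sliding procedure can be written compactly in Python. This is a sketch rather than a reference implementation: the function name `best_gapless_alignment` and the convention used for the offset (position of the first residue of seq1 relative to the first residue of seq2) are choices made here for illustration.

```python
def best_gapless_alignment(seq1, seq2):
    """Try every offset of seq1 along seq2 and return (best offset, number of matches)."""
    best_offset, best_score = None, -1
    # offset = index in seq2 that faces seq1[0]; negative when seq1 starts before seq2.
    # The range covers len(seq1) + len(seq2) - 1 offsets.
    for offset in range(-(len(seq1) - 1), len(seq2)):
        score = 0
        for i, residue in enumerate(seq1):
            j = i + offset
            if 0 <= j < len(seq2) and residue == seq2[j]:
                score += 1
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset, best_score


print(best_gapless_alignment("TTGCGGT", "TTAGCCGT"))  # (1, 5)
```

For simplicity the inner loop visits every residue of seq1 at each offset, including those that hang outside seq2, so it performs slightly more than the L1*L2 comparisons counted above; the order of magnitude is the same.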
Exercise
- We have the two following sequences
  - Seq1 TTTGCGTTAAATCGTGTAGCAATTTAA
  - Seq2 AAGAATGGCGTTTTTAATAGCAATAT
- Questions
  1. By progressively shifting the sequences along each other, find the offset(s) that reveal similar regions.
  2. At each offset position, identify conserved segments (i.e. uninterrupted successions of identical residues).
  3. Can you improve the number of matching positions by inserting some gaps?
Number of possible alignments with gaps
- Gapless alignments are rarely informative, because they fail to detect insertions and deletions.
- There might be several local matching regions, separated by a variable-length non-conserved region:

----TTTGCGTT--AAATCGTGTAGCAATTTAA
1111s|s|||||11s||22222|||||||s|22
AAGAATGGCGTTTTTAA-----TAGCAATAT--

s = substitution; | = match; 1 = gap in the first sequence; 2 = gap in the second sequence

- Allowing gaps increases the complexity of the problem: at each position, there can be either
  - a gap in the first sequence
  - a gap in the second sequence
  - a superposition of residue 1 with residue 2 (match or substitution)
- In total, the computational size of the problem is N ~ 3^L, where L is the size of the shortest sequence. The number of possibilities thus increases exponentially with sequence length. This rapidly becomes intractable.
  - For two sequences of size 1000, there are ~3^1000 (~10^477) possible alignments.
- We want to find the optimal alignment, i.e. the one associated with the highest score, among all possible alignments. It is however impossible in practice to test each alignment and measure its score, because it would take an "infinite" time.
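The order of magnitude quoted above is easy to check. The short Python sketch below (the helper name `log10_alignment_count` is hypothetical) uses the slide's own approximation N ~ 3^L, which is a rough estimate of the search space rather than an exact count of distinct alignments.

```python
import math


def log10_alignment_count(length):
    """log10 of the approximate number of alignments, N ~ 3^L."""
    return length * math.log10(3)


print(round(log10_alignment_count(1000)))  # 477
# Python integers are unbounded, so the number can also be written out in full:
print(len(str(3 ** 1000)))  # 478 digits, i.e. about 10^477
```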
Example of pairwise alignment
- Example of alignment

TTTGCGTT--AAATCGTGTAGCAATTT
s|ss||||ggs||ggggg|||||||s|
ATGCCGTTTTTAA-----TAGCAATAT

s = substitution; g = gap; | = identical residues

- Gaps, insertions and deletions
  - Gaps can reflect either an insertion in one of the sequences, or a deletion in the other one.
  - The simple observation of two aligned sequences is insufficient to decide whether a gap results from an insertion or a deletion.
  - The term indel is sometimes used in this case to designate the evolutionary event.
Global (Needleman-Wunsch) versus local (Smith-Waterman) alignment
- Global alignment
  - Algorithm: Needleman-Wunsch (1970).
  - EMBOSS Web tool: needle (nucleic acids or proteins).
  - Appropriate, for example, for proteins having a common ancestor and being conserved over their whole sequences.
  - The final alignment necessarily includes the aligned sequences over their full lengths.

LQGPSKGTGKGS-SRSWDN
|----|--|||---|--|-
LN-ITKSAGKGAIMRLGDA

- Local alignment
  - Algorithm: Smith-Waterman (1981).
  - EMBOSS Web tool: water (nucleic acids or proteins).
  - Appropriate, for example, for proteins sharing a common domain restricted to a segment of the sequence.

LQGPSSKTGKGS-SSRIWDN
      |-|||
LN-ITKKAGKGAIMRLGDA

  - The final alignment is restricted to the conserved segment(s).

KTGKG
|-|||
KAGKG

- Needleman, S. B. & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48, 443-53.
- Smith, T. F. & Waterman, M. S. (1981). Identification of common molecular subsequences. J Mol Biol 147, 195-7.
Algorithmic aspects
Finding the optimal pairwise alignment by dynamic programming
Dynamic programming - global alignment
- Needleman and Wunsch proposed an alignment algorithm based on dynamic programming.
  - It performs a global alignment (the sequences are aligned over their whole length).
  - The processing time is proportional to the product of the sequence lengths. It thus increases quadratically with the sequence length, instead of exponentially.
  - It is guaranteed to return the highest scoring alignment between two sequences.

Reference: Needleman, S. and Wunsch, C. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48:443–453.
Dynamic programming - global alignment
- For each position in the alignment, one can have
  - a gap in sequence 1
  - a gap in sequence 2
  - aligned residues (match or substitution)

[Figure: empty alignment matrix; horizontal sequence TACGT, vertical sequence CTAGGAT]
Dynamic programming - paths
- Any possible alignment between the two sequences can be represented as a path from the top left corner to the bottom right corner.
- For example, the path highlighted in green corresponds to the alignment below.

- - T A - C G T
g g s s g s s g
C T A G G A T -

- Obviously, this alignment is not optimal: it does not contain a single match!

[Figure: alignment matrix (horizontal sequence TACGT, vertical sequence CTAGGAT) with one path highlighted in green]
Dynamic programming - global alignment
- Our goal is to find the best possible alignment, which corresponds to the path giving the optimal score (we still need to define how this score will be computed).
- As discussed before, the number of possible paths increases exponentially with the sequence sizes.
- Dynamic programming however allows us to find this optimal score in quadratic time (L1*L2), by building the solution progressively.
- This progressive path finding allows us to avoid evaluating a large number of paths which would anyway return a sub-optimal score.

[Figure: empty alignment matrix; horizontal sequence TACGT, vertical sequence CTAGGAT]
Dynamic programming - initialization
- The top row and left column are initialized.
- We will now progressively fill the other cells of the matrix by calculating, for each cell, its optimal score as a function of the path used to reach this cell.
- Two possible ways to initialize
  - If you consider that there is no cost for terminal gaps (start, end), initialize the first row and column with 0.
  - If you want to penalize terminal gaps, initialize the first row and column with gap penalties.

         T   A   C   G   T
     0   0   0   0   0   0
C    0
T    0
A    0
G    0
G    0
A    0
T    0
Dynamic programming – the 3 ways to leave a cell
- From each starting cell, we can take 3 possible moves:
  - Diagonal
    - Align the two residues (match or substitution).
  - Rightward
    - Align one residue of the horizontal sequence with a gap in the vertical sequence.
  - Downward
    - Align one residue of the vertical sequence with a gap in the horizontal sequence.

[Figure: the three moves out of a cell, with the resulting alignment column]
  Diagonal move, substitution: A aligned with C
  Diagonal move, match: C aligned with C
  Rightward move (gap inserted in the vertical sequence): A aligned with -
  Downward move (gap inserted in the horizontal sequence): - aligned with C
Dynamic programming – the 3 ways to leave a cell (scoring)
- The 3 possible moves from a starting cell (diagonal, rightward, downward) can each be given a score.
- We can define a scoring scheme, and define the score of each possible direction.
  - match +1
  - substitution -1
  - gap -2

[Figure: the three moves out of a cell whose score is 2, with the resulting alignment column]
  Diagonal move, substitution (A aligned with C): 2 - 1 = 1
  Diagonal move, match (C aligned with C): 2 + 1 = 3
  Rightward move (gap in the vertical sequence, A aligned with -): 2 - 2 = 0
  Downward move (gap in the horizontal sequence, - aligned with C): 2 - 2 = 0
Exercise – what is the best way to reach a cell
- Let us assume the following scoring scheme
  - match +1
  - substitution -1
  - gap -2
- At a given destination cell, 3 scores are calculated depending on the 3 possible starting positions:
  - upper neighbour + gap cost
  - left neighbour + gap cost
  - upper-left neighbour +
    - match score if the residues match
    - substitution cost if the residues do not match
- For each of the cases shown here
  - Compute the 3 possible scores.
  - Indicate the best score in the destination cell.
  - Mark the arrow giving this best score.
  - Write the corresponding alignment.

[Figure: four examples; each shows the two residues to align and the scores of the three neighbouring cells]
  Example 1: residues A and C; neighbouring scores 2 (upper-left), 0 and 1
  Example 2: residues C and C; neighbouring scores 2 (upper-left), 0 and 1
  Example 3: residues C and C; neighbouring scores 1 (upper-left), 0 and 5
  Example 4: residues C and C; neighbouring scores 2 (upper-left), 0 and 5
Dynamic programming – the best way to reach a cell
- Let us assume the following scoring scheme
  - match +1
  - substitution -1
  - gap -2
- At a given destination cell, 3 scores are calculated depending on the 3 possible starting positions:
  - upper neighbour + gap cost
  - left neighbour + gap cost
  - upper-left neighbour +
    - match score if the residues match
    - substitution cost if the residues do not match
- The highest score is retained and the corresponding arrow is marked.
- In some cases (example 4), there are several equivalent highest scores.

[Figure: solutions of the four examples]
  Example 1 (A and C): diagonal 2 - 1 = 1; gaps 1 - 2 = -1 and 0 - 2 = -2. Best move is a substitution (score 1). Alignment: A over C.
  Example 2 (C and C): diagonal 2 + 1 = 3; gaps 1 - 2 = -1 and 0 - 2 = -2. Best move is a match (score 3). Alignment: C over C.
  Example 3 (C and C): diagonal 1 + 1 = 2; gaps 5 - 2 = 3 and 0 - 2 = -2. Best move is a gap (score 3). Alignment: - over C.
  Example 4 (C and C): diagonal 2 + 1 = 3; gaps 5 - 2 = 3 and 0 - 2 = -2. Best move is either gap or match (score 3). Alignment: C over C, or - over C.
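The rule for choosing the best way to reach a cell can be sketched as a small Python function. The name `best_moves` and the move labels are hypothetical. In examples 3 and 4, placing the score 5 in the upper neighbour is an assumption about the figure layout; it is consistent with the "- over C" alignment (a downward move) given in the solution.

```python
def best_moves(diag, up, left, residue_h, residue_v,
               match=1, substitution=-1, gap=-2):
    """Return (best score, list of moves reaching it) for one destination cell.

    diag, up and left are the scores of the upper-left, upper and left neighbours.
    """
    pair = match if residue_h == residue_v else substitution
    candidates = {
        "diagonal": diag + pair,  # align the two residues
        "down": up + gap,         # gap in the horizontal sequence
        "right": left + gap,      # gap in the vertical sequence
    }
    best = max(candidates.values())
    return best, sorted(move for move, s in candidates.items() if s == best)


print(best_moves(2, 1, 0, "A", "C"))  # example 1: (1, ['diagonal'])
print(best_moves(2, 1, 0, "C", "C"))  # example 2: (3, ['diagonal'])
print(best_moves(1, 5, 0, "C", "C"))  # example 3: (3, ['down'])
print(best_moves(2, 5, 0, "C", "C"))  # example 4: (3, ['diagonal', 'down'])
```

Returning all tied moves, rather than a single one, is what later allows every optimal alignment to be recovered during traceback.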
Dynamic programming - recursive computation
- The first row is processed from left to right.

         T   A   C   G   T
     0   0   0   0   0   0
C    0  -1  -1   1  -1  -1
T    0
A    0
G    0
G    0
A    0
T    0
Dynamic programming - recursive computation
- The second row is then processed.

         T   A   C   G   T
     0   0   0   0   0   0
C    0  -1  -1   1  -1  -1
T    0   1  -1  -1   0   0
A    0
G    0
G    0
A    0
T    0
Dynamic programming - recursive computation
- The process is iterated until the bottom right corner is reached.
- Exercise: calculate the remaining scores.

         T   A   C   G   T
     0   0   0   0   0   0
C    0  -1  -1   1  -1  -1
T    0   1  -1  -1   0   0
A    0  -1   2   0  -2  -1
G    0  -1   0   1   1  -1
G    0
A    0
A    0
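The row-by-row filling can be sketched in Python as follows. This is a minimal illustration under the assumptions used on the preceding slides: match +1, substitution -1, a single gap cost of -2, and a first row and column initialized to 0. The function name `fill_global_matrix` is hypothetical, and only the scores are computed (no arrows, no traceback).

```python
def fill_global_matrix(seq_h, seq_v, match=1, substitution=-1, gap=-2):
    """Fill the score matrix row by row; seq_h is horizontal, seq_v vertical."""
    n_rows, n_cols = len(seq_v) + 1, len(seq_h) + 1
    # First row and first column stay at 0 (initial gaps are not penalized).
    scores = [[0] * n_cols for _ in range(n_rows)]
    for r in range(1, n_rows):
        for c in range(1, n_cols):
            pair = match if seq_h[c - 1] == seq_v[r - 1] else substitution
            scores[r][c] = max(
                scores[r - 1][c - 1] + pair,  # diagonal move
                scores[r - 1][c] + gap,       # downward move
                scores[r][c - 1] + gap,       # rightward move
            )
    return scores


scores = fill_global_matrix("TACGT", "CTAGGAT")
for residue, row in zip(" CTAGGAT", scores):
    print(residue, row)
```

The two nested loops visit each of the L1*L2 cells once, which is where the quadratic processing time comes from. The first rows printed by the sketch can be compared with the partially filled matrices shown on the slides.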
Needleman-Wunsch : exercise
- Using the Needleman-Wunsch algorithm, fill the scores of the alignment matrix with the following parameters
  - Substitution matrix: BLOSUM62
  - Gap opening penalty: -5
  - Gap extension penalty: -1
  - Initial and terminal gaps: 0

[Matrix to fill: horizontal sequence SVETD, vertical sequence TSINQET]

BLOSUM62 substitution matrix (lower triangle)

       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
Ala A  4
Arg R -1  5
Asn N -2  0  6
Asp D -2 -2  1  6
Cys C  0 -3 -3 -3  9
Gln Q -1  1  0  0 -3  5
Glu E -1  0  0  2 -4  2  5
Gly G  0 -2  0 -1 -3 -2 -2  6
His H -2  0  1 -1 -3  0  0 -2  8
Ile I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
Leu L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
Lys K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
Met M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
Phe F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
Pro P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
Ser S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
Thr T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
Trp W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Tyr Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
Val V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
Needleman-Wunsch : solution of the exercise
- A substitution matrix can be used to assign specific scores to each pair of residues.
- Exercise: fill the scores of the alignment matrix using the BLOSUM62 substitution matrix.
  - Gap opening penalty: -5
  - Gap extension penalty: -1
  - Initial and terminal gaps: 0

         S   V   E   T   D
     0   0   0   0   0   0
T    0   1   0  -1   5   0
S    0   4  -1   0   0   5
I    0  -1   7   2   1   5
N    0   1   2   7   2   5
Q    0   0   1   4   6   5
E    0   0   0   6   3   8
T    0   1   1   1  11  11

- S V - - E T D
- | : - - | | -
T S I N Q E T -
Needleman-Wunsch example
- Alignment of the E. coli metL and thrA proteins with the Needleman-Wunsch algorithm.
- The vertical bars indicate identity between two amino acids.
- The colons indicate similarities, i.e. pairs of residues having a positive score in the chosen substitution matrix (BLOSUM62).

# Matrix: EBLOSUM62
# Gap_penalty: 10.0
# Extend_penalty: 0.5
# Length: 867
# Identity: 254/867 (29.3%)
# Similarity: 423/867 (48.8%)
# Gaps: 104/867 (12.0%)
# Score: 929.0

metL   1 MSVIAQAGAKGRQLHKFGGSSLADVKCYLRVAGIMAEYSQPDDM-MVVSA 49
                    .::.||||:|:|:.:.:||||.|:...::...: .|:||
thrA   1            MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSA 39

metL  50 AGSTTNQLINWLK-----------LSQTDRLSAHQVQQTLRRYQCDLISG 88
         ....||.|:..::           :|..:|:.|            :|::|
thrA  40 PAKITNHLVAMIEKTISGQDALPNISDAERIFA------------ELLTG 77

metL  89 LLPAEEADSL--ISAFV-SDLERLAALLDSGIN------DAVYAEVVGHG 129
         |..|:....|  :..|| .:..::..:| .||:      |::.|.::..|
thrA  78 LAAAQPGFPLAQLKTFVDQEFAQIKHVL-HGISLLGQCPDSINAALICRG 126

metL 130 EVWSARLMSAVLNQQGLPAAWLD-AREFLRAERAAQPQVD--EGLSYPLL 176
         |..|..:|:.||..:|.....:| ..:.|......:..||  |.......
thrA 127 EKMSIAIMAGVLEARGHNVTVIDPVEKLLAVGHYLESTVDIAESTRRIAA 176

metL 177 QQLLVQHPGKRLVVTGFISRNNAGETVLLGRNGSDYSATQIGALAGVSRV 226
         .::...|   .:::.||.:.|..||.|:||||||||||..:.|.......
thrA 177 SRIPADH---MVLMAGFTAGNEKGELVVLGRNGSDYSAAVLAACLRADCC 223

metL 227 TIWSDVAGVYSADPRKVKDACLLPLLRLDEASELARLAAPVLHARTLQPV 276
         .||:||.|||:.|||:|.||.||..:...||.||:...|.|||.||:.|:
thrA 224 EIWTDVDGVYTCDPRQVPDARLLKSMSYQEAMELSYFGAKVLHPRTITPI 273

metL 277 SGSEIDLQLRCSYTPDQ-----GSTRIERVLASGTGARIVTSHDDVCLIE 321
         :..:|...::.:..|..     |::|.|..|.    .:.:::.:::.:..
thrA 274 AQFQIPCLIKNTGNPQAPGTLIGASRDEDELP----VKGISNLNNMAMFS 319

metL 322 FQVPASQDFKLAHKEIDQILKRAQVRPLAVGVHNDRQLLQFCYTSEVADS 371
         ...|..:........:...:.||::..:.:...:....:.||........
Dynamic programming - local alignment
- The Needleman-Wunsch algorithm performs a global alignment (it finds the best path between the two corners of the alignment matrix).
- This is appropriate when the sequences are similar over their whole length, but local similarities could be missed.
- In 1981, Smith and Waterman published an adaptation of the Needleman-Wunsch algorithm, which makes it possible to detect local similarities.
- Reference: Smith, T. F. and Waterman, M. S. (1981). Identification of common molecular subsequences. J. Mol. Biol., 147:195–197.
Dynamic programming - local alignment
- Smith-Waterman algorithm
- The algorithm is similar to Needleman-Wunsch, but negative scores are replaced by 0.
- Alignments stop after local maxima.
- In this case we have two overlapping alignments, each having a score of 2:

TA    TACG
TA    TAGG

- Basically, in the second alignment the cost of the C/G substitution is compensated by the benefit of the G/G match.

[Figure: Smith-Waterman score matrix; horizontal sequence TACGT, vertical sequence CTAGGAT; the highest score in the matrix is 2]
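The local variant differs from the global one by a single extra term in the maximum. The sketch below assumes the same simple scoring scheme as before (match +1, substitution -1, gap -2); the function name `fill_local_matrix` is hypothetical, and the traceback from the maximal cells is not included.

```python
def fill_local_matrix(seq_h, seq_v, match=1, substitution=-1, gap=-2):
    """Fill a Smith-Waterman style matrix and return (scores, highest score)."""
    n_rows, n_cols = len(seq_v) + 1, len(seq_h) + 1
    scores = [[0] * n_cols for _ in range(n_rows)]
    best = 0
    for r in range(1, n_rows):
        for c in range(1, n_cols):
            pair = match if seq_h[c - 1] == seq_v[r - 1] else substitution
            scores[r][c] = max(
                0,                            # negative scores are replaced by 0
                scores[r - 1][c - 1] + pair,  # diagonal move
                scores[r - 1][c] + gap,       # downward move
                scores[r][c - 1] + gap,       # rightward move
            )
            best = max(best, scores[r][c])
    return scores, best


scores, best = fill_local_matrix("TACGT", "CTAGGAT")
print(best)  # 2
```

With these parameters the highest score is 2, reached in two cells: at the end of TA aligned with TA, and at the end of TACG aligned with TAGG.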
Smith-Waterman: exercise
- Using the Smith-Waterman algorithm, fill the scores of the alignment matrix with the following parameters
  - Substitution matrix: BLOSUM62
  - Gap opening penalty: -5
  - Gap extension penalty: -1
  - Initial and terminal gaps: 0

[Matrix to fill: horizontal sequence SVETD, vertical sequence TSINQET]
Smith-Waterman : solution of the exercise
- A substitution matrix can be used to assign specific scores to each pair of residues.
- Exercise: fill the scores of the alignment matrix using the BLOSUM62 substitution matrix.
  - Gap opening penalty: -5
  - Gap extension penalty: -1
  - Initial and terminal gaps: 0

         S   V   E   T   D
     0   0   0   0   0   0
T    0   1   0   0   5   0
S    0   4   0   0   1   5
I    0   0   7   2   1   5
N    0   1   2   7   2   5
Q    0   0   1   4   6   5
E    0   0   0   6   3   8
T    0   1   1   1  11  11

S V - - E T
| : - - | |
S I N Q E T
Case study: bacterial aspartokinases
Needleman-Wunsch with partial similarities
- Alignment of the E. coli lysC and metL proteins with the Needleman-Wunsch algorithm.
- metL contains two domains: aspartokinase and homoserine dehydrogenase.
- LysC only contains the aspartokinase domain.
- With Needleman-Wunsch, the % similarity is calculated over the whole length of the alignment (854 aa), which gives 24.5%.
- Actually, most of the alignment length is in the terminal gap (the homoserine dehydrogenase domain of metL).
- This percentage is lower than the usual threshold for considering two proteins as homologous.

# Matrix: EBLOSUM62
# Gap_penalty: 10.0
# Extend_penalty: 0.5
# Length: 854
# Identity: 136/854 (15.9%)
# Similarity: 209/854 (24.5%)
# Gaps: 449/854 (52.6%)
# Score: 351.0

metL   1 MSVIAQAGAKGRQLHKFGGSSLADVKCYLRVAGIMAEYSQPDDMMVVSAA 50
         ||.|.        :.||||:|:||.....|.|.|:...:.. .::|:||:
lysC   1 MSEIV--------VSKFGGTSVADFDAMNRSADIVLSDANV-RLVVLSAS 41

metL  51 GSTTNQLINWLK-LSQTDRLSAHQVQQTLRRYQCDLISGL----LPAEEA 95
         ...||.|:...: |...:|.   :....:|..|..::..|    :..||.
lysC  42 AGITNLLVALAEGLEPGERF---EKLDAIRNIQFAILERLRYPNVIREEI 88

metL  96 DSLISAFVSDLERLAALLDSGINDAVYAEVVGHGEVWSARLMSAVLNQQG 145
         :.|:.. ::.|...|||..|   .|:..|:|.|||:.|..|...:|.::.
lysC  89 ERLLEN-ITVLAEAAALATS---PALTDELVSHGELMSTLLFVEILRERD 134

metL 146 LPAAWLDAREFLRA-ERAAQPQVDEGLSYPLLQQLLVQHPGKRLVVT-GF 193
         :.|.|.|.|:.:|. :|..:.:.|......|....|:....:.||:| ||
lysC 135 VQAQWFDVRKVMRTNDRFGRAEPDIAALAELAALQLLPRLNEGLVITQGF 184

metL 194 ISRNNAGETVLLGRNGSDYSATQIGALAGVSRVTIWSDVAGVYSADPRKV 243
         |...|.|.|..|||.||||:|..:......|||.||:||.|:|:.|||.|
lysC 185 IGSENKGRTTTLGRGGSDYTAALLAEALHASRVDIWTDVPGIYTTDPRVV 234

metL 244 KDACLLPLLRLDEASELARLAAPVLHARTLQPVSGSEIDLQLRCSYTPDQ 293
         ..|..:..:...||:|:|...|.|||..||.|...|:|.:.:..|..|..
lysC 235 SAAKRIDEIAFAEAAEMATFGAKVLHPATLLPAVRSDIPVFVGSSKDPRA 284

metL 294 GSTRI---------ERVLASGTGARIVTSHDDVCLIEFQVPASQDFKLAH 334
         |.|.:         .|.||......::|.|      ...:..|:.| ||
lysC 285 GGTLVCNKTENPPLFRALALRRNQTLLTLH------SLNMLHSRGF-LA- 326

metL 335 KEIDQILKRAQVRPLAVGVHNDRQLLQFCYTSEVA--------------D 370
          |:..||.|          ||..  :....||||:              |
lysC 327 -EVFGILAR----------HNIS--VDLITTSEVSVALTLDTTGSTSTGD 363
Smith-Waterman with partial similarities
- Alignment of the E. coli lysC and metL proteins with the Smith-Waterman algorithm.
- The alignment is almost identical to the one reported by Needleman-Wunsch, but the statistics are now computed on the aligned segment only (482 aa).
- On this region, there is 42.5% of similarity.

# Matrix: EBLOSUM62
# Gap_penalty: 10.0
# Extend_penalty: 0.5
# Length: 482
# Identity: 133/482 (27.6%)
# Similarity: 205/482 (42.5%)
# Gaps: 85/482 (17.6%)
# Score: 353.5

metL  16 KFGGSSLADVKCYLRVAGIMAEYSQPDDMMVVSAAGSTTNQLINWLK-LS 64
         ||||:|:||.....|.|.|:...:.. .::|:||:...||.|:...: |.
lysC   8 KFGGTSVADFDAMNRSADIVLSDANV-RLVVLSASAGITNLLVALAEGLE 56

metL  65 QTDRLSAHQVQQTLRRYQCDLISGL----LPAEEADSLISAFVSDLERLA 110
         ..:|.   :....:|..|..::..|    :..||.:.|:.. ::.|...|
lysC  57 PGERF---EKLDAIRNIQFAILERLRYPNVIREEIERLLEN-ITVLAEAA 102

metL 111 ALLDSGINDAVYAEVVGHGEVWSARLMSAVLNQQGLPAAWLDAREFLRA- 159
         ||..|   .|:..|:|.|||:.|..|...:|.::.:.|.|.|.|:.:|.
lysC 103 ALATS---PALTDELVSHGELMSTLLFVEILRERDVQAQWFDVRKVMRTN 149

metL 160 ERAAQPQVDEGLSYPLLQQLLVQHPGKRLVVT-GFISRNNAGETVLLGRN 208
         :|..:.:.|......|....|:....:.||:| |||...|.|.|..|||.
lysC 150 DRFGRAEPDIAALAELAALQLLPRLNEGLVITQGFIGSENKGRTTTLGRG 199

metL 209 GSDYSATQIGALAGVSRVTIWSDVAGVYSADPRKVKDACLLPLLRLDEAS 258
         ||||:|..:......|||.||:||.|:|:.|||.|..|..:..:...||:
lysC 200 GSDYTAALLAEALHASRVDIWTDVPGIYTTDPRVVSAAKRIDEIAFAEAA 249

metL 259 ELARLAAPVLHARTLQPVSGSEIDLQLRCSYTPDQGSTRI---------E 299
         |:|...|.|||..||.|...|:|.:.:..|..|..|.|.:         .
lysC 250 EMATFGAKVLHPATLLPAVRSDIPVFVGSSKDPRAGGTLVCNKTENPPLF 299

metL 300 RVLASGTGARIVTSHDDVCLIEFQVPASQDFKLAHKEIDQILKRAQVRPL 349
         |.||......::|.|      ...:..|:.| ||  |:..||.|
lysC 300 RALALRRNQTLLTLH------SLNMLHSRGF-LA--EVFGILAR------ 334

metL 350 AVGVHNDRQLLQFCYTSEVA--------------DSAL--KILDEAGLPG 383
             ||..  :....||||:              |:.|  .:|.|.....
lysC 335 ----HNIS--VDLITTSEVSVALTLDTTGSTSTGDTLLTQSLLMELSALC 378
Case study: pairwise similarities between opsins
Needleman-Wunsch – pairwise comparisons between opsins
Dynamic programming - summary
- Guarantees to find the optimal score.
- Is able to return all the alignments that reach the optimal score.
- Processing time is proportional to the product of the sequence lengths.
- Can be adapted for global (Needleman-Wunsch) or local (Smith-Waterman) alignment.
- Global alignments
  - are appropriate for aligning sequences conserved over their whole length (recent divergence).
- Local alignments
  - are able to detect domains which cover only a fraction of the sequences;
  - are also able to return a complete alignment, when the conservation covers the whole sequence.
References
- Dynamic programming applied to pairwise sequence alignment
  - Needleman-Wunsch (pairwise, global)
    - Needleman, S. B. & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48, 443-53.
  - Smith-Waterman (pairwise, local)
    - Smith, T. F. & Waterman, M. S. (1981). Identification of common molecular subsequences. J Mol Biol 147, 195-7.
- Software tools
  - The EMBOSS suite provides efficient implementations of the two "classical" dynamic programming algorithms
    - needle: Needleman-Wunsch
    - water: Smith-Waterman
  - Pairwise Sequence Alignment tools at the EBI
    - A web interface to EMBOSS tools (needle, water)
    - http://www.ebi.ac.uk/Tools/psa/
  - They can be installed locally or accessed via the SRS Web server
    - http://srs.ebi.ac.uk/
Exercise

[Matrix to fill: horizontal sequence TACGT, vertical sequence QPMLSCQ]