1
Protein Multiple Alignment
by Konstantin Davydov
2
Papers
MUSCLE: a multiple sequence alignment method with reduced time and space complexity by Robert C Edgar
ProbCons: Probabilistic Consistency-based Multiple Sequence Alignment by Chuong B. Do, Mahathi S. P. Mahabhashyam, Michael Brudno, and Serafim Batzoglou
3
Outline
Introduction Background MUSCLE ProbCons Conclusion
4
Introduction
What is multiple protein alignment?
Given N sequences of amino acids x1, x2 … xN:
Insert gaps in each of the xi so that
– All sequences have the same length
– Score of the global map is maximum
ACCTGCA
ACTTCAA
ACCTGCA--
AC--TTCAA
5
Introduction
Motivation
– Phylogenetic tree estimation
– Secondary structure prediction
– Identification of critical regions
6
Outline
Introduction Background MUSCLE ProbCons Conclusion
7
Background
Aligning two sequences
ACCTGCA
ACTTCAA
ACCTGCA--
AC--TTCAA
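Aligning two sequences is usually done with dynamic programming. The following is a minimal sketch of Needleman-Wunsch global alignment, assuming a simple scoring scheme (match +1, mismatch -1, linear gap -1) that is not taken from the slides; real protein aligners use substitution matrices and affine gap penalties.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_x, aligned_y) for one optimal global alignment.

    The scoring parameters are illustrative assumptions.
    """
    n, m = len(x), len(y)
    # score[i][j] = best score for aligning x[:i] with y[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # pair x[i-1] with y[j-1]
                              score[i - 1][j] + gap,      # gap in y
                              score[i][j - 1] + gap)      # gap in x

    # Trace back from the bottom-right corner to recover one optimal alignment.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if x[i - 1] == y[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                ax.append(x[i - 1]); ay.append(y[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(ax)), "".join(reversed(ay))


score, a, b = needleman_wunsch("ACCTGCA", "ACTTCAA")
print(score)
print(a)
print(b)
```

The table has (n+1)(m+1) cells and each cell looks at three neighbors, which is why the pairwise case is cheap compared with the multiple-sequence case.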
8
Background
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
x
y
z
9
Background
Unfortunately, this can get very expensive
Aligning N sequences of length L requires a matrix of size L^N, where each square in the matrix has 2^N - 1 neighbors
This gives a total time complexity of O(2^N L^N)
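To get a feel for how quickly this grows, the small sketch below just multiplies the number of cells by the number of neighbors per cell. The function name is made up for illustration.

```python
def naive_msa_cost(n_seqs, length):
    """Rough work estimate for exact N-dimensional dynamic programming."""
    cells = length ** n_seqs        # size of the N-dimensional matrix
    neighbors = 2 ** n_seqs - 1     # predecessors examined per cell
    return cells * neighbors


for n in (2, 3, 5, 10):
    print(n, naive_msa_cost(n, 100))
```

Even for sequences of length 100, the estimate grows by roughly a factor of 200 with every added sequence, which is why heuristic methods such as MUSCLE and ProbCons are used instead.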
10
Outline
Introduction Background MUSCLE ProbCons Conclusion
11
MUSCLE
12
MUSCLE
Basic Strategy: A progressive alignment is built, to which horizontal refinement is applied
Three stages
At the end of each stage, a multiple alignment is available and the algorithm can be terminated
13
Three Stages
Draft Progressive
Improved Progressive
Refinement
14
Stage 1: Draft Progressive
Similarity Measure – Calculated using k-mer counting.
ACCATGCGAATGGTCCACAATG
k-mer:  ATG  CCA
score:  3    2
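The counts in this example can be reproduced with a few lines of code. The sketch below counts overlapping k-mers and then turns two count tables into a similarity. The similarity formula used here (shared k-mers divided by the number of k-mers in the shorter sequence) is an assumption for illustration, since the slide does not give one.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_similarity(a, b, k):
    """Fraction of k-mers shared by a and b (assumed definition)."""
    denom = min(len(a), len(b)) - k + 1
    if denom <= 0:
        return 0.0
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    shared = sum(min(ca[kmer], cb[kmer]) for kmer in ca)
    return shared / denom


counts = kmer_counts("ACCATGCGAATGGTCCACAATG", 3)
print(counts["ATG"], counts["CCA"])  # 3 2
```

No alignment is needed to compute this measure, which is what makes the draft stage fast.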
15
Stage 1: Draft Progressive
Distance estimate – Based on the similarities, construct a triangular distance matrix.
      X1    X2    X3    X4
X1    X     X     X     X
X2    0.6   X     X     X
X3    0.8   0.2   X     X
X4    0.3   0.7   0.5   X
16
Stage 1: Draft Progressive
Tree construction – From the distance matrix we construct a tree
      X1    X2    X3    X4
X1    X     X     X     X
X2    0.6   X     X     X
X3    0.8   0.2   X     X
X4    0.3   0.7   0.5   X
[Figure: tree with leaves X1, X2, X3, X4; internal nodes X3X2 and X4X1; root X3X2X4X1]
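One way to build such a tree is average-linkage clustering (UPGMA). The slide does not name the clustering method, so treat the choice as an assumption. The sketch below is a minimal pure-Python version; the distances are the values read from the slide's matrix, assuming rows and columns are both ordered X1 to X4.

```python
def upgma(labels, dist):
    """Average-linkage clustering.

    dist maps frozenset({a, b}) -> distance between leaf labels a and b.
    Returns (tree, merges): tree is a nested tuple, merges lists the joined
    subtrees in the order they were merged.
    """
    # Each cluster is (subtree, list_of_leaves).
    clusters = [(name, [name]) for name in labels]
    merges = []

    def average(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1[1] for b in c2[1])
        return total / (len(c1[1]) * len(c2[1]))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        _, i, j = min((average(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        a, b = clusters[i], clusters[j]
        merges.append((a[0], b[0]))
        merged = ((a[0], b[0]), a[1] + b[1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0], merges


d = {
    frozenset(("X1", "X2")): 0.6,
    frozenset(("X1", "X3")): 0.8,
    frozenset(("X2", "X3")): 0.2,
    frozenset(("X1", "X4")): 0.3,
    frozenset(("X2", "X4")): 0.7,
    frozenset(("X3", "X4")): 0.5,
}
tree, merges = upgma(["X1", "X2", "X3", "X4"], d)
print(merges)
```

With these values X2 and X3 are joined first (distance 0.2), then X1 and X4 (distance 0.3), and the two pairs are joined at the root, which matches the node labels in the figure.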
18
Stage 1: Draft Progressive
Progressive alignment – A progressive alignment is built by following the
branching order of the tree. This yields a multiple alignment of all input sequences at the root.
[Figure: guide tree with leaves X1, X4, X2, X3; sequences are merged upward along the branches, and the root holds the alignment of X1, X2, X3, X4]
19
Stage 2: Improved Progressive
Attempts to improve the tree and uses it to build a new progressive alignment. This stage may be iterated.
[Figure: current guide tree over X1, X4, X3, X2 and an empty triangular distance matrix over X1, X2, X3, X4]
20
Stage 2: Improved Progressive
Similarity Measure – Similarity is calculated for each pair of sequences using fractional identity computed from their mutual alignment in the current multiple alignment
Current multiple alignment:
TCC--AA
TCA--GA
TCA--AA
G--ATAC
T--CTGC
Pair being compared:
TCC--AA
TCA--AA
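As a rough illustration, fractional identity can be computed directly from two rows of the current multiple alignment. The sketch assumes identity is measured over columns where both sequences have a residue; other denominators are also in use, and the slide does not say which one is meant.

```python
def fractional_identity(row_a, row_b, gap="-"):
    """Fraction of identical residues over columns where neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("rows must come from the same alignment")
    paired = [(a, b) for a, b in zip(row_a, row_b) if a != gap and b != gap]
    if not paired:
        return 0.0
    return sum(a == b for a, b in paired) / len(paired)


# The two rows read from the slide's example pair.
print(fractional_identity("TCC--AA", "TCA--AA"))  # 0.8
```

Columns where either row has a gap are skipped, so the two rows above are compared over five columns, four of which match.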
21
Stage 2: Improved Progressive
Tree construction – A tree is constructed by computing a Kimura
distance matrix and applying a clustering method to it
[Figure: empty triangular Kimura distance matrix over X1, X2, X3, X4]
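The slide does not give the distance formula. A commonly cited form of the Kimura correction for protein sequences is d = -ln(1 - D - D^2/5), where D is the fraction of differing residues (1 minus fractional identity). The sketch below assumes that form.

```python
import math


def kimura_distance(identity):
    """Kimura-corrected distance from fractional identity (assumed formula)."""
    d = 1.0 - identity                # observed fraction of differences
    arg = 1.0 - d - d * d / 5.0
    if arg <= 0.0:
        return float("inf")           # correction is undefined for very divergent pairs
    return -math.log(arg)


print(round(kimura_distance(0.8), 4))
```

The correction matters most for divergent pairs, where multiple substitutions at the same site make the raw difference fraction underestimate the true evolutionary distance.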
22
Stage 2: Improved Progressive
Tree comparison – The new tree is compared to the previous tree
by identifying the set of internal nodes for which the branching order has changed
23
Stage 2: Improved Progressive
Progressive alignment – A new progressive alignment is built
[Figure: new guide tree with leaves X2, X4, X1, X3; the root holds the new alignment]
24
Stage 3: Refinement
Performs iterative refinement
25
Stage 3: Refinement
Choice of bipartition – An edge is removed from the tree, dividing the
sequences into two disjoint subsets
[Figure: tree over X1 to X5, shown before and after an edge is removed]
26
Stage 3: Refinement
Profile Extraction – The multiple alignment of each subset is extracted from the current multiple alignment. Columns made up of indels only are removed
Current multiple alignment (sequences X1 to X5):
TCC--AA
TCA--GA
TCA--AA
G--ATAC
T--CTGC
First subset, before and after removing indel-only columns:
TCC--AA    TCCAA
TCA--AA    TCAAA
Second subset:
TCA--GA
T--CTGC
G--ATAC
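Profile extraction amounts to selecting a subset of rows and dropping the columns that contain only gaps in that subset. A minimal sketch, with hypothetical sequence names `a`, `b`, `c` standing in for the slide's labels:

```python
def extract_profile(alignment, names, gap="-"):
    """Take the rows in `names` and drop columns that are all gaps in that subset.

    alignment: dict mapping sequence name -> aligned row (equal lengths).
    """
    rows = [alignment[name] for name in names]
    width = len(rows[0])
    keep = [c for c in range(width) if any(row[c] != gap for row in rows)]
    return {name: "".join(alignment[name][c] for c in keep) for name in names}


alignment = {"a": "TCC--AA", "b": "TCA--AA", "c": "TCA--GA"}
print(extract_profile(alignment, ["a", "b"]))
# {'a': 'TCCAA', 'b': 'TCAAA'}
```

Removing the indel-only columns keeps the two profiles as compact as possible before they are realigned.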
27
Stage 3: Refinement
Re-alignment – The two profiles are then realigned with each other using profile-profile alignment.
Profiles before re-alignment:
TCCAA
TCAAA

TCA--GA
T--CTGC
G--ATAC
After re-alignment:
T--CCAA
T--CAAA
TCA--GA
T--CTGC
G--ATAC
28
Stage 3: Refinement
Accept/Reject – The score of the new alignment is computed. If it is higher than the score of the old alignment, the new alignment is retained; otherwise it is discarded.
New:
T--CCAA
T--CAAA
TCA--GA
T--CTGC
G--ATAC
Old:
TCC--AA
TCA--GA
TCA--AA
G--ATAC
T--CTGC
29
MUSCLE Review
Performance
– For alignment of N sequences of length L
– Space complexity: O(N^2 + L^2)
– Time complexity: O(N^4 + NL^2)
– Time complexity without refinement: O(N^3 + NL^2)
30
Outline
Introduction Background MUSCLE ProbCons Conclusion
31
Hidden Markov Models (HMMs)
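[Figure: pair HMM with a match state M, which emits a pair (X, Y), and two gap states I and J, which emit a residue of one sequence against a gap in the other]
Example alignment and state path:
x:      AGCC-AGC
y:      -GCCCAGT
states: IMMMJMMM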
32
Pairwise Alignment
Viterbi Algorithm
– Picks the alignment that is most likely to be the optimal alignment
– However: the most likely alignment is not necessarily the most accurate
– Alternative: find the alignment of maximum expected accuracy
33
Lazy Teacher Analogy
10 students take a 10-question true-false quiz
How do you make the answer key?
Viterbi Approach: Use the answer sheet of the best student
MEA Approach: Weighted majority vote
[Figure: each student's answer to question 4 (T or F) shown next to that student's grade]
34
Viterbi vs MEA
Viterbi
– Picks the alignment with the highest chance of being completely correct
Maximum Expected Accuracy
– Picks the alignment with the highest expected number of correct predictions
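The lazy teacher analogy can be turned into a toy program. In the sketch below each student has a weight standing for how much the teacher trusts them (an assumed input, for example based on past grades). The "Viterbi" key copies the most trusted student's sheet, while the "MEA" key takes a weighted vote on each question separately.

```python
def best_student_key(sheets, weights):
    """Viterbi-style key: copy the whole sheet of the most trusted student."""
    best = max(range(len(sheets)), key=lambda s: weights[s])
    return list(sheets[best])


def weighted_vote_key(sheets, weights):
    """MEA-style key: decide each question by weighted majority vote."""
    key = []
    for q in range(len(sheets[0])):
        true_weight = sum(w for sheet, w in zip(sheets, weights) if sheet[q])
        false_weight = sum(w for sheet, w in zip(sheets, weights) if not sheet[q])
        key.append(true_weight > false_weight)
    return key


# Three students, two questions; True/False are the students' answers.
sheets = [[True, False], [False, False], [False, False]]
weights = [0.9, 0.8, 0.8]
print(best_student_key(sheets, weights))   # [True, False]
print(weighted_vote_key(sheets, weights))  # [False, False]
```

The best student can still be wrong on an individual question; the vote lets the other sheets overrule that answer, which is the sense in which MEA optimizes per-position accuracy rather than whole-alignment probability.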
35
ProbCons
Basic Strategy: Uses Hidden Markov Models (HMM) to predict the probability of an alignment.
Uses Maximum Expected Accuracy instead of the Viterbi alignment.
5 steps
36
Notation
Given N sequences S = {s1, s2, …, sN}
a* is the optimal alignment
37
ProbCons
Step 1: Computation of posterior-probability matrices
Step 2: Computation of expected accuracies
Step 3: Probabilistic consistency transformation
Step 4: Computation of guide tree
Step 5: Progressive alignment
Post-processing step: Iterative refinement
38
Step 1: Computation of posterior-probability matrices
For every pair of sequences x, y ∈ S, compute the matrix Pxy
Pxy(i, j) = P(xi ~ yj ∈ a* | x, y), which is the probability that xi and yj are paired in a*
39
Step 2: Computation of expected accuracies
For a pairwise alignment a between x and y, define the accuracy as:
accuracy(a, a*) = (# of correct predicted matches) / (length of shorter sequence)
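As a small illustration of this definition, the sketch below represents an alignment as a set of matched index pairs (i, j), meaning xi is aligned to yj. This representation is an assumption made for the example.

```python
def accuracy(predicted, reference, len_x, len_y):
    """Correctly predicted matches divided by the length of the shorter sequence.

    predicted, reference: sets of (i, j) index pairs.
    """
    correct = len(set(predicted) & set(reference))
    return correct / min(len_x, len_y)


predicted = {(0, 0), (1, 1), (2, 3)}
reference = {(0, 0), (1, 2), (2, 3)}
print(accuracy(predicted, reference, len_x=3, len_y=4))
```

Since a* is unknown in practice, ProbCons works with the expected value of this quantity under the posterior probabilities from Step 1.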
40
Step 2: Computation of expected accuracies (continued)
MEA alignment is found by finding the highest summing path through the matrix
Mxy[i, j] = P(xi is aligned to yj | x, y)
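Finding the highest summing path is a dynamic program of the same shape as pairwise alignment, with posterior probabilities as match scores and no gap penalty. A minimal sketch, assuming the posterior matrix is given as a list of rows:

```python
def mea_alignment(P):
    """Return (total, pairs) for the highest-summing monotone path through P.

    P[i][j] is the posterior probability that x_i is aligned to y_j.
    pairs lists the matched (i, j) index pairs in order.
    """
    n = len(P)
    m = len(P[0]) if n else 0
    # best[i][j] = best total using the first i residues of x and first j of y
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j - 1] + P[i - 1][j - 1],  # match
                             best[i - 1][j],                        # skip x_i
                             best[i][j - 1])                        # skip y_j
    # Trace back to recover the matched pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j - 1] + P[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return best[n][m], pairs[::-1]


total, pairs = mea_alignment([[0.9, 0.1], [0.2, 0.8]])
print(pairs)  # [(0, 0), (1, 1)]
```

Dividing the total by the length of the shorter sequence gives the expected accuracy of the returned alignment under the definition from the previous slide.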
41
Consistency
[Figure: three sequences x, y and z, with residue xi in x, residues yj and yj' in y, and residue zk in z]
42
Step 3: Probabilistic consistency transformation
Re-estimate the match quality scores P(xi ~ yj ∈ a* | x, y) by applying the probabilistic consistency transformation, which incorporates similarity of x and y to other sequences from S into the x-y comparison:
P(xi ~ yj ∈ a* | x, y) → P(xi ~ yj ∈ a* | x, y, z)
43
Step 3: Probabilistic consistency transformation (continued)
44
Step 3: Probabilistic consistency transformation (continued)
Since most of the values of Pxz and Pzy will be very small, we ignore all the entries in which the value is smaller than some threshold w.
Use sparse matrix multiplication
May be repeated
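The sketch below illustrates one way to carry out the transformation with sparse matrices. It assumes the update takes the form P'xy = (1/|S|) * sum over z in S of Pxz * Pzy, with Pxx taken to be the identity matrix; the formula is stated here as an assumption because it does not appear in the text of these slides. Matrices are stored as dictionaries of nonzero entries, and all function and parameter names are illustrative.

```python
def sparse_matmul(A, B, cutoff):
    """Multiply sparse matrices stored as {(row, col): value}, skipping small entries."""
    rows_of_b = {}
    for (k, j), b in B.items():
        if b >= cutoff:
            rows_of_b.setdefault(k, []).append((j, b))
    out = {}
    for (i, k), a in A.items():
        if a < cutoff:
            continue
        for j, b in rows_of_b.get(k, ()):
            out[(i, j)] = out.get((i, j), 0.0) + a * b
    return out


def consistency_transform(P, seqs, lengths, cutoff=0.01):
    """One round of the (assumed) transformation over all stored pairs.

    P[(x, y)] is the sparse posterior matrix for sequences named x and y.
    lengths[x] is the length of sequence x.
    """
    def matrix(x, z):
        if x == z:
            return {(i, i): 1.0 for i in range(lengths[x])}   # identity
        if (x, z) in P:
            return P[(x, z)]
        return {(j, i): v for (i, j), v in P[(z, x)].items()}  # transpose

    new_P = {}
    for (x, y) in P:
        acc = {}
        for z in seqs:
            product = sparse_matmul(matrix(x, z), matrix(z, y), cutoff)
            for key, v in product.items():
                acc[key] = acc.get(key, 0.0) + v
        new_P[(x, y)] = {key: v / len(seqs) for key, v in acc.items()}
    return new_P


P = {
    ("x", "y"): {(0, 0): 0.6},
    ("x", "z"): {(0, 0): 1.0},
    ("z", "y"): {(0, 1): 1.0},
}
lengths = {"x": 2, "y": 2, "z": 2}
print(consistency_transform(P, ["x", "y", "z"], lengths)[("x", "y")])
```

In the toy input, x0 aligns to z0 and z0 aligns to y1, so after the transformation the pair (x0, y1) receives support even though the direct x-y comparison gave it none. Calling the function again on its own output corresponds to repeating the transformation.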
45
Step 4: Computation of guide tree
Use E(x,y) as a measure of similarity
Define similarity of two clusters by the sum-of-pairs
[Figure: empty triangular similarity matrix over X1, X2, X3, X4]
46
Step 5: Progressive alignment
Align sequence groups hierarchically according to the order specified in the guide tree.
Alignments are scored using sum-of-pairs scoring function.
Aligned residues are scored according to the match quality scores P(xi ~ yj ∈ a* | x, y)
Gap penalties are set to 0.
47
Post-processing step: iterative refinement
Much like in MUSCLE
Randomly partition alignment into two groups of sequences and realign
May be repeated
[Figure: sequences X1 to X5 split into two groups]
48
ProbCons overview
ProbCons demonstrated dramatic improvements in alignment accuracy
Longer running time
Doesn’t use protein-specific alignment information, so can be used to align DNA sequences with improved accuracy over the Needleman-Wunsch algorithm.
49
Outline
Introduction Background MUSCLE ProbCons Conclusion
50
Conclusion
MUSCLE demonstrated lower accuracy than ProbCons, but very short running time.
ProbCons demonstrated dramatic improvements in alignment accuracy, but is much slower than MUSCLE.
51
Results
52
Reliability Scores
53
Questions?
54
References
Robert C Edgar
– MUSCLE: a multiple sequence alignment method with reduced time and space complexity
Chuong B. Do, Mahathi S. P. Mahabhashyam, Michael Brudno, and Serafim Batzoglou
– ProbCons: Probabilistic Consistency-based Multiple Sequence Alignment
55
References (continued)
Slides on Multiple Sequence Alignment, CS262
Slides on Sequence similarity, CS273
Slides on Protein Multiple Alignment, Marina Sirota