Progressive multiple sequence alignments from triplets
by Matthias Kruspe and Peter F Stadler
Presented by Syed Nabeel
Outline
Background
Motivation
Algorithm
Complexity Analysis
Experiments and Results
Discussions and Future work
Background
Sequence alignment: a way of arranging the primary sequences of DNA, RNA, or protein to identify regions of similarity.
Pairwise sequence alignment: alignment of two sequences to maximize the common elements of the pair (usually a scoring scheme is used).
Multiple sequence alignment (MSA)
Scoring scheme: used to assess the quality of an alignment
Scores are calculated from substitution matrices, e.g. BLOSUM and PAM
Multiple sequence alignment: an extension of pairwise alignment that incorporates more than two sequences at a time
Multiple alignments are often used to identify conserved sequence regions across a group of sequences hypothesized to be evolutionarily related
Finding an optimal multiple alignment is an NP-hard problem
Heuristic methods for MSA
Progressive methods: ClustalW, T-Coffee, POA, etc.
Iterative methods: Muscle, DIALIGN, etc.
Probabilistic methods: Probcons, Hmmt, Muscle, etc.
Progressive method
Makes explicit use of the evolutionary relatedness of the sequences to build the alignment.
The complete MSA of the given sequences is calculated from pairwise alignments of previously aligned sequences, following the branching order of a pre-computed "guide" tree
Reconstruction of the guide tree usually involves a clustering method such as Neighbor-Joining or UPGMA
Problems with Existing Progressive Methods
They are not guaranteed to find the optimal alignment
They utilize only a small part of the information that is potentially available in the complete data set
The relative placement of adjacent insertions and deletions leads to score-equivalent alignments, among which the algorithm chooses one by means of a pragmatic rule (e.g. "Always make insertions before deletions")
There is no mechanism to identify errors that have been made in previous steps and to correct them during later stages
Motivation for aln3nn
Utilizes an exact algorithm to compute alignment of sequence and profile triples
Instead of using a single guide tree, phylogenetic networks as constructed by the Neighbor-Net algorithm are used
It involves an aggregation step that constructs pairs from triples, subdividing each 3-way alignment into a pair of alignments
It provides a chance for the removal of erroneously inserted gaps at later aggregation steps.
Dynamic Programming Approaches
Needleman-Wunsch algorithm
Basic dynamic programming scheme for pairwise sequence comparison
Requires quadratic space and time
Easily translates to a cubic space and time algorithm for three sequences
Uses trivial gap cost functions
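The pairwise scheme can be sketched as follows. This is a minimal, score-only illustration rather than the authors' code; the match, mismatch, and gap values are assumptions chosen for readability, and the gap cost is the simple per-position ("trivial") kind.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of strings a and b.

    The scoring values are illustrative assumptions, not values from the talk.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Border cells: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Each inner cell takes the best of (mis)match, gap in b, gap in a.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    return score[-1][-1]
```

The table has (len(a) + 1) x (len(b) + 1) cells and each cell is filled in constant time, which is where the quadratic space and time come from.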
Linear vs Affine Gap Costs
Linear gap cost
Has only one parameter d, which is a cost per unit length of gap
d is almost always negative, so the alignment with fewer gap positions is favoured over the alignment with more
The overall cost for one large gap is the same as for many small gaps of the same total length
Affine gap cost
A higher penalty is assigned for opening a new gap than for extending an existing one
This removes the problem of linear gap costs: one large gap is penalized less than many small gaps of the same total length
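A small numeric sketch makes the difference concrete. The parameter values and the convention that the opening score covers the first gap position are assumptions for illustration only.

```python
def linear_gap_cost(length, d=-2):
    """Linear model: every gap position costs d."""
    return d * length


def affine_gap_cost(length, gap_open=-10, gap_extend=-1):
    """Affine model (assumed convention): gap_open pays for the first
    position, gap_extend for each additional one."""
    if length == 0:
        return 0
    return gap_open + (length - 1) * gap_extend


# Six gap positions, either as one long gap or as three gaps of length two.
one_long_linear = linear_gap_cost(6)
three_short_linear = 3 * linear_gap_cost(2)
one_long_affine = affine_gap_cost(6)
three_short_affine = 3 * affine_gap_cost(2)
```

Under the linear model both layouts score -12, so the algorithm has no reason to prefer one contiguous gap. Under the affine model the single gap scores -15 and the three short gaps score -33.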
Gotoh's Algorithm
Makes use of affine gap costs
Quadratic CPU and memory requirements for two sequences
Alignment of three sequences with affine gap costs requires O(n^3) time and space
aln3nn is based on Gotoh's algorithm with minor modifications
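For two sequences, the idea behind Gotoh's algorithm can be sketched with three score tables, one per type of last alignment column. This is a score-only illustration under assumed parameters (gap_open pays for the first gap position), not the aln3nn implementation.

```python
NEG_INF = float("-inf")


def gotoh_score(a, b, match=1, mismatch=-1, gap_open=-4, gap_extend=-1):
    """Optimal global alignment score with affine gap costs.

    M: last column aligns a[i] with b[j]
    X: last column aligns a[i] with a gap
    Y: last column aligns a gap with b[j]
    """
    n, m = len(a), len(b)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + sub
            # Extending a gap of the same type is cheaper than opening a new one.
            X[i][j] = max(M[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    return max(M[n][m], X[n][m], Y[n][m])
```

Keeping a separate table per column type lets each cell know whether a gap is being opened or extended, which is what keeps the pairwise cost quadratic despite the affine penalty.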
Basic Concepts
Let A, B, and C denote the three sequences.
A_i, B_j, and C_k refer to the ith, jth, and kth position in A, B, and C.
'-' denotes the gap character.
Scores for the alignment of two or three non-gap characters are denoted by S(α, β) and S(α, β, γ).
Gap penalties are determined from gap open (go) and gap extension (ge) scores.
M(i, j, k) denotes the best score of the alignments of the prefixes of A, B, and C ending at positions i, j, and k, given that the residues (A_i, B_j, C_k) are aligned.
I_xy(i, j, k) is the best score given that (A_i, B_j, -) is the last column of the partial alignment.
I_x(i, j, k) is the best score given that the last column is of the form (A_i, -, -).
The sum-of-pairs model is used for substitution scores: S(a, b, c) = S(a, b) + S(a, c) + S(b, c).
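The sum-of-pairs idea and the cubic three-way recursion can be sketched together. To stay short, this illustration uses a linear gap cost and a single score table instead of the affine M, I_xy, I_x tables; the scoring values and function names are assumptions.

```python
from itertools import product

NEG_INF = float("-inf")


def sp_column_score(col, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of one column of three symbols ('-' is a gap)."""
    total = 0
    for x, y in ((col[0], col[1]), (col[0], col[2]), (col[1], col[2])):
        if x == "-" and y == "-":
            continue  # two gaps facing each other cost nothing
        if x == "-" or y == "-":
            total += gap
        else:
            total += match if x == y else mismatch
    return total


def align3_score(a, b, c):
    """Optimal sum-of-pairs score for three sequences (linear gap cost)."""
    na, nb, nc = len(a), len(b), len(c)
    D = [[[NEG_INF] * (nc + 1) for _ in range(nb + 1)] for _ in range(na + 1)]
    D[0][0][0] = 0
    # Seven column types: each sequence contributes a residue (1) or a gap (0).
    moves = [mv for mv in product((0, 1), repeat=3) if any(mv)]
    for i in range(na + 1):
        for j in range(nb + 1):
            for k in range(nc + 1):
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue
                    col = (a[pi] if di else "-",
                           b[pj] if dj else "-",
                           c[pk] if dk else "-")
                    cand = D[pi][pj][pk] + sp_column_score(col)
                    if cand > D[i][j][k]:
                        D[i][j][k] = cand
    return D[na][nb][nc]
```

The table holds (na + 1)(nb + 1)(nc + 1) cells, so memory and time grow cubically with sequence length.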
aln3nn Optimization
The approach described above has cubic memory consumption, which is acceptable only for small sequence lengths n
aln3nn Optimization: Divide and Conquer
Input sequences that exceed a given threshold length l are repeatedly subdivided into smaller sequences until the length criterion is fulfilled
The partial sequences are aligned separately and the resulting alignments are concatenated afterward
The result is an approximate solution of the global MSA problem
The threshold length depends on sequence properties and the available amount of memory and CPU resources
Determining Alignment Order
The order in which sequences and profiles are aligned has an important influence on the performance of progressive alignment algorithms
Pairwise alignments use binary guide trees to determine the alignment order
The guide tree encapsulates an approximation to the phylogenetic relationships of the input sequences
The input sequences form the leaves of this tree
Each interior node corresponds to an alignment
The root of the guide tree represents the desired multiple alignment of all input sequences.
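How a binary guide tree fixes the alignment order can be sketched with a post-order traversal. The nested-tuple tree encoding and the function name are assumptions made for this illustration; no actual alignment is computed.

```python
def alignment_order(tree):
    """List the merge steps implied by a binary guide tree.

    A leaf is a sequence name (str); an interior node is a (left, right) tuple.
    Each step is a pair (left_members, right_members) of the groups aligned.
    """
    steps = []

    def visit(node):
        if isinstance(node, str):
            return (node,)
        left, right = node
        left_members = visit(left)
        right_members = visit(right)
        # Children are finished first, so this interior node is the next alignment.
        steps.append((left_members, right_members))
        return left_members + right_members

    visit(tree)
    return steps


guide_tree = (("A", "B"), ("C", ("D", "E")))
order = alignment_order(guide_tree)
```

For this hypothetical tree the steps are A with B, D with E, C with the D-E profile, and finally the A-B profile with the C-D-E profile, the last step being the root.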
Phylogenetic Networks in aln3nn
The Neighbor-Net (Nnet) approach is used to construct a phylogenetic network from which the alignment order is derived
The input sequences are represented as nodes that are all disconnected in the beginning.
In each aggregation step, Nnet selects two nodes using a specific selection criterion
In contrast to Neighbor-Joining, the two nodes are not paired immediately
Nnet waits until a node has been paired up a second time.
Then the corresponding three linked nodes are replaced by two new linked nodes.
The distances of the newly introduced nodes to the remaining "active" nodes are computed as a linear combination of the distances of the nodes prior to aggregation.
The entire procedure is repeated until only three active nodes are left.
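The distance update can be sketched as below. The slides do not give the coefficients, so both the form of the linear combination (the one usually quoted for Neighbor-Net reductions) and the equal weights of 1/3 are assumptions; the node names u and v are hypothetical.

```python
from fractions import Fraction

THIRD = Fraction(1, 3)


def reduce_triple(dist, x, y, z, others, weights=(THIRD, THIRD, THIRD)):
    """Replace the linked nodes x - y - z by two new linked nodes 'u' and 'v'.

    dist maps frozenset({p, q}) to the distance between p and q.
    Returns only the distances that involve the new nodes.
    """
    alpha, beta, gamma = weights

    def d(p, q):
        return dist[frozenset((p, q))]

    new = {
        frozenset(("u", "v")): alpha * d(x, y) + beta * d(x, z) + gamma * d(y, z)
    }
    for a in others:
        # u inherits mostly from x, v mostly from z; y is shared between both.
        new[frozenset(("u", a))] = (alpha + beta) * d(x, a) + gamma * d(y, a)
        new[frozenset(("v", a))] = alpha * d(y, a) + (beta + gamma) * d(z, a)
    return new
```

Exact fractions are used only so that the arithmetic in a small example stays free of rounding; floating-point weights would work the same way.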
Agglomeration and Splitting
Node agglomeration occurs when one of the three involved nodes (B) has two neighbors, while the other two (A and C) have only a single one
The alignment ABC is split such that the sequences contained in B are distributed between two subsets B' and B'' so as to maximize the scores of the partial alignments AB' and B''C
Space and Time Complexity
Simple dynamic programming
For 3-way alignment it takes O(n^3) space and time (n being the length of the sequences)
Thus the alignment of all N sequences takes O(N n^3) time
Divide-and-Conquer with the cutoff length l
Space complexity
O(n^2 + l^3) space is required
This is the space needed to store the additional cost matrices plus the space required for aligning the remaining (sub)sequences of length at most l.
Space and Time Complexity (contd.)
Time complexity
O(n^2 + n l^2) time is required for the alignment of one triplet
The term n^2 results from the time needed to calculate the additional cost matrices plus the time to search for the optimal slicing positions.
The term n l^2 comes from the alignment of the triplet itself
The total time complexity of the alignment is therefore O(N n^2 + N n l^2)
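A back-of-the-envelope comparison shows what the cutoff buys. The functions below simply evaluate the asymptotic terms with all constant factors dropped, so the numbers are orders of magnitude, not measurements; n = 1000 and l = 100 are hypothetical values. The n l^2 term is what one gets from roughly n/l pieces that each cost l^3.

```python
def full_dp_cost(n):
    """Cell counts for the plain cubic three-way dynamic programme."""
    return {"time": n ** 3, "space": n ** 3}


def divide_and_conquer_cost(n, l):
    """Cell counts for the divide-and-conquer variant with cutoff length l."""
    return {"time": n ** 2 + n * l ** 2, "space": n ** 2 + l ** 3}


plain = full_dp_cost(1000)
split = divide_and_conquer_cost(1000, 100)
```

With these values the plain scheme touches 10^9 cells, while the divide-and-conquer variant needs about 1.1 x 10^7 steps and 2 x 10^6 cells of storage.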
Alignments of Structured RNAs
The aln3nn software can use RNA secondary structure annotation as additional input for nucleic acid alignments
A matrix of equilibrium base pairing probabilities P_ij is computed for each input sequence
For each sequence position i, probabilities are calculated for the following cases:
position i is paired with a position j < i
position i is paired with a position j > i
position i remains unpaired
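Given a pairing probability matrix, the three per-position probabilities can be derived as sketched below. The matrix is assumed to be symmetric with P[i][j] the probability that positions i and j pair; computing it requires an RNA folding program, which is outside this sketch. The function name and the toy matrix are hypothetical.

```python
def pairing_classes(P):
    """For each position i return (paired with j < i, paired with j > i, unpaired)."""
    n = len(P)
    classes = []
    for i in range(n):
        upstream = sum(P[i][j] for j in range(i))
        downstream = sum(P[i][j] for j in range(i + 1, n))
        # Whatever probability is left means the position is unpaired.
        classes.append((upstream, downstream, 1.0 - upstream - downstream))
    return classes


# Toy matrix: positions 0 and 2 pair with probability 0.5, position 1 never pairs.
toy_P = [
    [0.0, 0.0, 0.5],
    [0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0],
]
toy_classes = pairing_classes(toy_P)
```

The three values for each position sum to 1, so they can be used directly as a per-position structure annotation.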
Structural Score Contributions
These probabilities are used as structure annotation.
For a pair of annotated input sequences A and B, structural score contributions are defined for each pair of positions i and j.
The total (mis)match score is the weighted sum of the sequence score and the structure score.
ψ is the balance term that measures the relative contribution of sequence and structure similarity
For very similar sequences one should use ψ ≈ 1, whereas for very dissimilar sequences one should use a score dominated by the structural component.
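The weighting can be sketched as a convex combination. The exact equation is not reproduced in this transcript, so this form is an assumption; it is chosen because it matches the two limits described in the talk (ψ = 1 uses sequence information only, ψ = 0 structure only). The function and argument names are hypothetical.

```python
def combined_score(seq_score, struct_score, psi):
    """Weighted (mis)match score: psi balances sequence against structure."""
    if not 0.0 <= psi <= 1.0:
        raise ValueError("psi must lie in [0, 1]")
    return psi * seq_score + (1.0 - psi) * struct_score
```

For example, with a sequence score of 4.0 and a structure score of 2.0, ψ = 0.5 gives 3.0, halfway between the two contributions.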
Pairwise versus Three-Way Alignments
Sets of artificial sequences were generated using the ROSE package
The quality of aln3nn alignments was compared to standard progressive alignments of three sequences using t_coffee
The same scoring model was used in aln3nn and t_coffee
The analysis indicated that aln3nn produced better scores as the number of gaps increased
Protein Alignments
Three types of substitution matrices were used: BLOSUM, PAM and GONNET
aln3nn chooses the best-suited matrix of the given type according to sequence identity
The median BAliBASE score for each sequence set is used as a measure of alignment quality
Although aln3nn does not employ any heuristic rules to alter scoring parameters, it compares well with other common alignment programs
RNA Alignments
RNA sequences often evolve much faster than their secondary structure
Alignment quality can be increased dramatically by including structural information
Six diverse families of RNA data sets from BRaliBase were used for comparisons
The structure conservation index (SCI) was used to assess the quality of the calculated alignments
SCI is defined as the ratio of the consensus folding energy of a set of aligned sequences to the average unconstrained folding energy of the individual sequences
SCI is close to 0 for structurally divergent sequences and close to 1 for correctly aligned sequences with a common fold
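The ratio itself is simple to compute once the folding energies are available. The energies must come from an RNA folding program, which is not part of this sketch; the function name and the sample values are hypothetical.

```python
def structure_conservation_index(consensus_energy, individual_energies):
    """SCI = consensus folding energy / mean unconstrained folding energy."""
    if not individual_energies:
        raise ValueError("need at least one individual folding energy")
    mean_energy = sum(individual_energies) / len(individual_energies)
    if mean_energy == 0:
        raise ValueError("mean folding energy is zero, SCI is undefined")
    return consensus_energy / mean_energy


# Hypothetical energies in kcal/mol (folding energies are negative).
well_aligned = structure_conservation_index(-20.0, [-25.0, -15.0, -20.0])
poorly_aligned = structure_conservation_index(-5.0, [-20.0, -20.0])
```

In the first case the consensus structure is as stable as the individual folds on average (SCI = 1.0); in the second it retains only a quarter of that stability (SCI = 0.25).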
Influence of parameter ψ on SCI
The SCI decreases if structural information is completely ignored (ψ = 1)
On the other hand, ignoring the sequence information (ψ = 0) yields even worse results.
The reason is that RNA secondary structure prediction has limited accuracy, so that alignments based on predicted structures for individual sequences are based on very noisy data
Also, the impact of the ψ parameter varies between different RNA families.
Gap Removals
In some data sets, one fifth of the gaps introduced in the early stages of the progressive alignment are later removed again
The following table shows the frequency f of gaps that are removed at intermediate division steps and that are not re-introduced at later stages
Discussion and Future Work
A direct comparison of aln3nn with progressive alignments of the same three sequences shows that the progressive approach leads to significantly suboptimal scores
aln3nn incurs additional computational costs compared to pairwise, guide-tree-based approaches, but it achieves competitive alignment accuracies on both protein and nucleic acid data
The performance of t_coffee shows that the shortcomings of initial pairwise alignments cannot be fully overcome later on, whereas aln3nn overcomes this problem
Future work
Modifications in the division step for 3-way alignments
Improvements in the branch and bound approach