Progressive multiple sequence alignments from triplets by Matthias Kruspe and Peter F Stadler...

Progressive multiple sequence alignments from triplets

by Matthias Kruspe and Peter F Stadler

Presented by Syed Nabeel

Outline

Background
Motivation
Algorithm
Complexity Analysis
Experiments and Results
Discussion and Future Work

Background

Sequence alignment: A way of arranging the primary sequences of DNA, RNA, or protein to identify regions of similarity.

Pairwise sequence alignment: Alignment of two sequences to maximize the common elements of the pair (usually a scoring scheme is used).

Multiple sequence alignment (MSA)

Scoring scheme: Used to assess the quality of an alignment. Scores are calculated from substitution matrices such as BLOSUM and PAM.

Multiple sequence alignment (MSA): An extension of pairwise alignment that incorporates more than two sequences at a time. Multiple alignments are often used to identify conserved sequence regions across a group of sequences hypothesized to be evolutionarily related.

Computing an optimal MSA is an NP-hard problem.

MSA Example

Heuristic methods for MSA

Progressive methods: ClustalW, T-Coffee, POA, etc.

Iterative methods: Muscle, DIALIGN, etc.

Probabilistic methods: Probcons, Hmmt, Muscle, etc.

Progressive method

Makes explicit use of the evolutionary relatedness of the sequences to build the alignment.

The complete MSA of the given sequences is calculated from pairwise alignments of previously aligned sequences, following the branching order of a pre-computed "guide" tree.

Reconstruction of the guide tree usually involves a clustering method such as Neighbor-Joining or UPGMA.
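
To illustrate how a guide tree fixes the order of alignment steps, here is a minimal Python sketch of UPGMA clustering. The function name upgma_merge_order, the dictionary-based distance input, and the example distances are assumptions made for illustration; they are not taken from any of the programs named in these slides.

```python
from itertools import combinations


def upgma_merge_order(names, dist):
    """Return the UPGMA merges as a list of (cluster_a, cluster_b) pairs.

    names: list of sequence labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    """
    clusters = {name: 1 for name in names}  # cluster label -> number of leaves
    d = dict(dist)
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda pair: d[frozenset(pair)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = f"({a},{b})"
        # The distance to the merged cluster is the size-weighted average.
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
        merges.append((a, b))
    return merges


# Hypothetical distances between four sequences.
example = {
    frozenset("AB"): 2, frozenset("AC"): 6, frozenset("AD"): 10,
    frozenset("BC"): 6, frozenset("BD"): 10, frozenset("CD"): 8,
}
order = upgma_merge_order(["A", "B", "C", "D"], example)
```

Each merge corresponds to one interior node of the guide tree, that is, to one pairwise (profile) alignment in a progressive method.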

Problems with Existing Progressive Methods

They are not guaranteed to find the optimal alignment.

They utilize only a small part of the information that is potentially available in the complete data set.

The relative placement of adjacent insertions and deletions leads to score-equivalent alignments, among which the algorithm chooses one by means of a pragmatic rule (e.g. "Always make insertions before deletions").

There is no mechanism to identify errors made in previous steps and to correct them during later stages.

Motivation for aln3nn

Utilizes an exact algorithm to compute alignment of sequence and profile triples

Instead of using a single guide tree, phylogenetic networks as constructed by the Neighbor-Net algorithm are used

It involves an aggregation step that constructs pairs from triples, subdividing 3-way alignments into pairs of alignments

It provides a chance for the removal of erroneously inserted gaps at later aggregation steps.

Dynamic Programming Approaches

Needleman-Wunsch algorithm

Basic dynamic programming scheme for pairwise sequence comparison.

Requires quadratic space and time.

Easily translates to a cubic space and time algorithm for three sequences.

Uses trivial gap cost functions.
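
The quadratic table fill can be sketched in a few lines of Python. This is a minimal illustration that returns only the optimal score; the function name and the match, mismatch, and gap values are assumptions chosen for the example, not parameters from the slides.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of a and b with a linear gap cost."""
    n, m = len(a), len(b)
    # F[i][j] is the best score for aligning a[:i] with b[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                F[i - 1][j] + gap,    # a[i-1] against a gap
                F[i][j - 1] + gap,    # b[j-1] against a gap
            )
    return F[n][m]
```

The table has (n + 1) x (m + 1) cells and each cell is filled in constant time, which gives the quadratic space and time requirement stated above.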

Linear vs Affine Gap Costs

Linear gap cost

Has only one parameter d, the cost per unit length of a gap.

d is almost always negative, so the alignment with fewer gaps is favoured over the alignment with more gaps.

The overall cost for one large gap is the same as for many small gaps of the same total length.

Affine gap cost

A higher penalty is assigned for opening a new gap than for extending an existing one.

This removes the problem with linear gap costs: the overall penalty for one large gap is smaller than that for many small gaps of the same total length.
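
A small Python sketch makes the difference concrete. The parameter values are illustrative, and the convention that the gap open score covers the first gap position is an assumption; other conventions charge open plus extension for the first position.

```python
def linear_gap(length, d=-2):
    """Linear gap score: d per gap position."""
    return d * length


def affine_gap(length, go=-10, ge=-1):
    """Affine gap score: go for the first position, ge for each further one."""
    if length == 0:
        return 0
    return go + ge * (length - 1)


# One gap of length 6 versus three gaps of length 2 each.
linear_one, linear_many = linear_gap(6), 3 * linear_gap(2)
affine_one, affine_many = affine_gap(6), 3 * affine_gap(2)
```

With these values the linear model scores both cases identically, while the affine model penalizes the three short gaps more than twice as heavily as the single long one.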

Gotoh’s Algorithm

Makes use of affine gap costs.

Quadratic CPU and memory requirements for two sequences.

Alignment of three sequences with affine gap costs requires O(n^3) time and space.

aln3nn is based on Gotoh’s algorithm with minor modifications.
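
For two sequences, the idea of keeping one table per "last column" state can be sketched as follows. This is a pairwise, score-only illustration under assumed parameter values; it is not the aln3nn implementation, which works on triples.

```python
NEG_INF = float("-inf")


def gotoh_score(a, b, match=1, mismatch=-1, go=-3, ge=-1):
    """Optimal global alignment score with affine gap costs (two sequences)."""
    n, m = len(a), len(b)
    # M: last column aligns a[i-1] with b[j-1]
    # X: last column is (a[i-1], -);  Y: last column is (-, b[j-1])
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = go + (i - 1) * ge
    for j in range(1, m + 1):
        Y[0][j] = go + (j - 1) * ge
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Extending a gap costs ge; opening one from another state costs go.
            X[i][j] = max(M[i - 1][j] + go, X[i - 1][j] + ge, Y[i - 1][j] + go)
            Y[i][j] = max(M[i][j - 1] + go, Y[i][j - 1] + ge, X[i][j - 1] + go)
    return max(M[n][m], X[n][m], Y[n][m])
```

Three quadratic tables replace the single Needleman-Wunsch table, so the asymptotic cost for two sequences stays quadratic.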

Basic Concepts

Let A, B, and C denote the three sequences.

Ai, Bj, and Ck refer to the ith, jth, and kth position in A, B, and C.

'-' denotes the gap character.

Scores for the alignment of two or three non-gap characters are denoted by S(α, β) and S(α, β, γ).

Gap penalties are determined from gap open (go) and gap extension (ge) scores.

M(i, j, k) denotes the best score of the alignments of the prefixes of A, B, and C ending at positions i, j, and k, given that the residues (Ai, Bj, Ck) are aligned in the last column.

Ixy(i, j, k) denotes the best score given that (Ai, Bj, -) is the last column of the partial alignment.

Ix(i, j, k) denotes the best score given that the last column is of the form (Ai, -, -).

The sum-of-pairs model is used for substitution scores: S(a, b, c) = S(a, b) + S(a, c) + S(b, c).
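
The sum-of-pairs rule can be written directly in Python. The toy pair_score function below (match 1, mismatch -1) is an assumption standing in for a real substitution matrix such as BLOSUM or PAM.

```python
def pair_score(a, b, match=1, mismatch=-1):
    """Toy pairwise substitution score; a stand-in for a matrix lookup."""
    return match if a == b else mismatch


def sum_of_pairs(a, b, c, pair=pair_score):
    """S(a, b, c) = S(a, b) + S(a, c) + S(b, c) for three non-gap characters."""
    return pair(a, b) + pair(a, c) + pair(b, c)
```

For example, a column with two identical characters and one different one contains one matching pair and two mismatching pairs.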

Recurrences

Case 1: (Ai, Bj, Ck). All three sequences are aligned.

Recurrences (contd.)

Case 2: (Ai, Bj, -). Gap in the C sequence.

Recurrences (contd.)

Case 3: (Ai, -, -). Gap in the B and C sequences.
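
The column cases above can be explored with a simplified three-way dynamic program in Python. Note the simplifications, which are assumptions of this sketch: it uses a single score table and a linear gap cost per gapped pair, rather than the separate M, Ixy, Ix, ... tables with affine gap costs described in the slides. The function name and scoring values are illustrative.

```python
from itertools import product

NEG_INF = float("-inf")


def three_way_score(a, b, c, match=1, mismatch=-1, gap=-2):
    """Optimal sum-of-pairs score for three sequences, linear gap cost."""

    def pair(x, y):
        if x == "-" and y == "-":
            return 0          # two gaps facing each other cost nothing
        if x == "-" or y == "-":
            return gap
        return match if x == y else mismatch

    # Each move says which sequences contribute a residue to the new column:
    # (1,1,1) is Case 1, (1,1,0) is Case 2, (1,0,0) is Case 3, and so on.
    moves = [mv for mv in product((0, 1), repeat=3) if any(mv)]
    na, nb, nc = len(a), len(b), len(c)
    F = [[[NEG_INF] * (nc + 1) for _ in range(nb + 1)] for _ in range(na + 1)]
    F[0][0][0] = 0
    for i, j, k in product(range(na + 1), range(nb + 1), range(nc + 1)):
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if min(pi, pj, pk) < 0:
                continue
            x = a[pi] if di else "-"
            y = b[pj] if dj else "-"
            z = c[pk] if dk else "-"
            cand = F[pi][pj][pk] + pair(x, y) + pair(x, z) + pair(y, z)
            if cand > F[i][j][k]:
                F[i][j][k] = cand
    return F[na][nb][nc]
```

The table has (n + 1)^3 cells for sequences of length n, and each cell inspects seven predecessor cells.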

aln3nn Optimization

The approach described above has cubic memory consumption, which is acceptable only for small sequence lengths n.

aln3nn Optimization: Divide and Conquer

Input sequences that exceed a given threshold length l are repeatedly subdivided into smaller sequences until the length criterion is fulfilled.

The partial sequences are aligned separately and the resulting alignments are concatenated afterward.

The result is an approximate solution of the global MSA problem.

The threshold length depends on sequence properties and on the available memory and CPU resources.

Determining Alignment Order

The order in which sequences and profiles are aligned has an important influence on the performance of progressive alignment algorithms.

Pairwise alignment methods use binary guide trees to determine the alignment order.

The guide tree encapsulates an approximation to the phylogenetic relationships of the input sequences.

The input sequences form the leaves of this tree.

Each interior node corresponds to an alignment.

The root of the guide tree represents the desired multiple alignment of all input sequences.

Phylogenetic Networks in aln3nn

The Neighbor-Net (Nnet) approach is used to construct a phylogenetic network from which the alignment order is calculated.

The input sequences are represented as nodes that are all disconnected in the beginning.

In each aggregation step, Nnet selects two nodes using a specific selection criterion.

In contrast to Neighbor-Joining, the two nodes are not paired immediately; Nnet waits until a node has been paired up a second time.

Then the corresponding three linked nodes are replaced by two new linked nodes.

The distances of the newly introduced nodes to the remaining "active" nodes are computed as a linear combination of the distances of the nodes prior to aggregation.

The entire procedure is repeated until only three active nodes are left.

Agglomeration and Splitting

Node agglomeration occurs when one of the three involved nodes (B) has two neighbors, while the other two (A and C) have only a single one.

The alignment ABC is split such that the sequences contained in B are distributed between two subsets B' and B'' so as to maximize the scores of the partial alignments AB' and B''C.

Agglomeration and Splitting (contd.)

Space and Time Complexity

Simple dynamic programming

A 3-way alignment takes O(n^3) space and time (n being the length of the sequences).

Thus the alignment of all N sequences takes O(N n^3) time.

Divide-and-conquer with the cutoff length l

Space complexity: O(n^2 + l^3) space is required. This is the space needed to store the additional cost matrices plus the space required for aligning the remaining (sub)sequences of length at most l.

Space and Time Complexity (contd.)

Time complexity: O(n^2 + n l^2) time is required for the alignment of one triplet.

The term n^2 results from the time needed to calculate the additional cost matrices plus the time to search for the optimal slicing positions.

The term n l^2 comes from the alignment of the triplet itself.

The total time complexity of the alignment is therefore O(N n^2 + N n l^2).
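
To get a feel for these bounds, the following Python sketch evaluates the leading terms for one triplet. The values n = 1000 and l = 100 are hypothetical, and constant factors are ignored, so the numbers indicate orders of magnitude only.

```python
def full_dp_cost(n):
    """Leading term for space and time of the plain 3-way DP: n^3."""
    return n ** 3


def dc_space(n, l):
    """Leading terms for divide-and-conquer space: n^2 + l^3."""
    return n ** 2 + l ** 3


def dc_time(n, l):
    """Leading terms for divide-and-conquer time per triplet: n^2 + n*l^2."""
    return n ** 2 + n * l ** 2


n, l = 1000, 100  # hypothetical sequence length and cutoff length
space_ratio = full_dp_cost(n) / dc_space(n, l)
time_ratio = full_dp_cost(n) / dc_time(n, l)
```

Under these assumptions the divide-and-conquer variant needs roughly 500 times fewer table cells and roughly 90 times fewer elementary steps than the plain cubic algorithm.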

Running Time Comparisons

Alignments of Structured RNAs

The aln3nn software can use RNA secondary structure annotation as additional input for nucleic acid alignments.

A matrix of equilibrium base pairing probabilities Pij is computed for each input sequence.

For each sequence position i, probabilities are calculated for the following cases:

position i is paired with a position j < i

position i is paired with a position j > i

position i remains unpaired
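
The three per-position probabilities can be derived from a base pairing probability matrix as sketched below in Python. The function name, the zero-based indexing, and the convention that only the upper triangle P[i][j] with i < j is read are assumptions of this example; the matrix itself would come from an RNA folding program.

```python
def position_annotation(P):
    """For each position i return (p_upstream, p_downstream, p_unpaired).

    P is a square matrix (list of lists) of base pairing probabilities;
    only entries P[i][j] with i < j are used.
    """
    n = len(P)
    annotation = []
    for i in range(n):
        p_up = sum(P[j][i] for j in range(i))             # partner j < i
        p_down = sum(P[i][j] for j in range(i + 1, n))    # partner j > i
        annotation.append((p_up, p_down, 1.0 - p_up - p_down))
    return annotation


# Hypothetical 3-nucleotide example: positions 0 and 2 pair with probability 0.8.
P = [[0.0, 0.0, 0.8],
     [0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]]
ann = position_annotation(P)
```

The three values for each position sum to 1, so they can be treated as a probability vector describing the structural state of that position.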

Structural Score Contributions

These probabilities are used as structure annotation.

For a pair of annotated input sequences A and B, structural score contributions are defined for each pair of positions i and j.

The total (mis)match score is the weighted sum of the sequence score and the structure score.

ψ is the balance term that measures the relative contribution of sequence and structure similarity.

For very similar sequences one should use ψ ≈ 1, whereas for very dissimilar sequences one should use a score dominated by the structural component.
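
The weighting can be sketched in Python as follows. The convex combination ψ · sequence + (1 − ψ) · structure is an assumption of this example: it is chosen to agree with the slides' statements that ψ = 1 ignores structural information and ψ = 0 ignores sequence information, but the exact equation is not reproduced in this transcript. The function name and input values are hypothetical.

```python
def combined_score(sequence_score, structure_score, psi):
    """Weighted (mis)match score; psi in [0, 1] balances sequence vs structure."""
    if not 0.0 <= psi <= 1.0:
        raise ValueError("psi must lie between 0 and 1")
    return psi * sequence_score + (1.0 - psi) * structure_score


# Hypothetical scores for one pair of positions.
only_sequence = combined_score(2.0, 0.5, psi=1.0)
only_structure = combined_score(2.0, 0.5, psi=0.0)
balanced = combined_score(2.0, 0.5, psi=0.5)
```

Intermediate values of ψ let both components contribute to the score of a column.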

Experiments and Results

Pairwise versus Three-Way Alignments

Sets of artificial sequences were generated using the ROSE package.

The quality of aln3nn alignments was compared to standard progressive alignments of three sequences using t_coffee.

The same scoring model was used in aln3nn and t_coffee.

The analysis indicated that aln3nn produced better scores as the number of gaps increased.

Comparisons for 3 and 10 sequences

Protein Alignments

Three types of substitution matrices were used: BLOSUM, PAM and GONNET.

aln3nn chooses the best-suited matrix of the given type according to sequence identity.

The median BAliBASE score is used for each sequence set as a measure of alignment quality.

Although aln3nn does not employ any heuristic rules to alter scoring parameters, it compares well with other common alignment programs.

Comparison of different alignment programs

RNA Alignments

RNA sequences often evolve much faster than their secondary structure.

Alignment quality can be increased dramatically by including structural information.

Six diverse families of RNA data sets from BRaliBase were used for comparisons.

The structure conservation index (SCI) was used to assess the quality of the calculated alignments.

SCI is defined as the ratio of the consensus folding energy of a set of aligned sequences to the average unconstrained folding energy of the individual sequences.

SCI is close to 0 for structurally divergent sequences and close to 1 for correctly aligned sequences with a common fold.
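
As a worked illustration of this ratio, here is a small Python sketch. The function name and the energy values (in kcal/mol) are hypothetical; in practice the energies would be produced by RNA folding software.

```python
def structure_conservation_index(consensus_energy, individual_energies):
    """SCI = consensus folding energy / mean individual folding energy."""
    mean_energy = sum(individual_energies) / len(individual_energies)
    return consensus_energy / mean_energy


# Hypothetical folding energies for three aligned sequences.
individual = [-30.0, -35.0, -25.0]
sci_conserved = structure_conservation_index(-30.0, individual)
sci_partial = structure_conservation_index(-15.0, individual)
```

A consensus energy equal to the mean individual energy gives an SCI of 1, while a consensus energy of half that magnitude gives 0.5.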

Alignment accuracies on RNA samples

Influence of parameter ψ on SCI

The SCI decreases if structural information is completely ignored (ψ = 1).

On the other hand, ignoring the sequence information (ψ = 0) yields even worse results.

The reason is that RNA secondary structure prediction has limited accuracy, so alignments based on predicted structures for individual sequences rest on very noisy data.

The impact of the ψ parameter also varies between different RNA families.

Impact of the balancing parameter on SCI

Gap Removals

In some data sets, one fifth of the gaps introduced in the early stages of the progressive alignment are later removed again.

The following table shows the frequency f of gaps that are removed at intermediate division steps and that are not re-introduced at later stages.

Discussion and Future Work

A direct comparison of aln3nn with progressive alignments of the same three sequences shows that the progressive approach leads to significantly suboptimal scores.

aln3nn incurs additional computational costs compared to pairwise, guide-tree-based approaches, but it achieves competitive alignment accuracies on both protein and nucleic acid data.

The performance of t_coffee shows that the shortcomings of initial pairwise alignments cannot be fully overcome later on, whereas aln3nn overcomes this problem.

Future work

Modifications in the division step for 3-way alignments.

Improvements in the branch-and-bound approach.

Thanks
