
1

Protein Multiple Alignment

by Konstantin Davydov

2

Papers

MUSCLE: a multiple sequence alignment method with reduced time and space complexity by Robert C Edgar

ProbCons: Probabilistic Consistency-based Multiple Sequence Alignment by Chuong B. Do, Mahathi S. P. Mahabhashyam, Michael Brudno, and Serafim Batzoglou

3

Outline

Introduction Background MUSCLE ProbCons Conclusion

4

Introduction

What is multiple protein alignment?

Given N sequences of amino acids x1, x2, …, xN:

Insert gaps in each of the xi so that
– All sequences have the same length
– Score of the global map is maximum

ACCTGCA

ACTTCAA

ACCTGCA--

AC--TTCAA

5

Introduction

Motivation
– Phylogenetic tree estimation
– Secondary structure prediction
– Identification of critical regions

6

Outline

Introduction Background MUSCLE ProbCons Conclusion

7

Background

Aligning two sequences

ACCTGCA

ACTTCAA

ACCTGCA--

AC--TTCAA
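
A minimal sketch of how a pairwise global alignment like the one above could be computed with the Needleman-Wunsch dynamic program. The function name and the scoring values (match +1, mismatch -1, gap -1) are illustrative assumptions, not taken from the slides, so the gaps it places may differ from the example shown.

```python
def global_alignment(x, y, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch sketch: return (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    # score[i][j] = best score for aligning x[:i] with y[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return score[n][m], "".join(reversed(ax)), "".join(reversed(ay))

score, ax, ay = global_alignment("ACCTGCA", "ACTTCAA")
print(ax)
print(ay)
```

Both output rows always have the same length, and removing the gaps gives back the input sequences.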

8

Background

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

[Figure: dynamic programming lattice with axes x, y and z]

9

Background

Unfortunately, this can get very expensive
– Aligning N sequences of length L requires a matrix of size L^N, where each square in the matrix has 2^N - 1 neighbors
– This gives a total time complexity of O(2^N L^N)
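
To get a feel for the growth, the short sketch below evaluates the cell count L^N and the neighbor count 2^N - 1 for a few small inputs. The function name `dp_cost` and the sample sizes are illustrative assumptions.

```python
def dp_cost(n_seqs, length):
    """Return (cells, neighbors per cell, cells * neighbors) for exact alignment."""
    cells = length ** n_seqs          # L^N lattice cells
    neighbors = 2 ** n_seqs - 1       # predecessors examined per cell
    return cells, neighbors, cells * neighbors

for n in (2, 3, 5, 10):
    print(n, dp_cost(n, 100))
```

With sequences of length 100, two sequences need 10,000 cells, while ten sequences already need 100^10 = 10^20 cells, which is why heuristics such as progressive alignment are used.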

10

Outline

Introduction Background MUSCLE ProbCons Conclusion

11

MUSCLE

12

MUSCLE

Basic Strategy: A progressive alignment is built, to which horizontal refinement is applied

Three stages
– At the end of each stage, a multiple alignment is available and the algorithm can be terminated

13

Three Stages

– Draft Progressive
– Improved Progressive
– Refinement

14

Stage 1: Draft Progressive

Similarity Measure
– Calculated using k-mer counting.

ACCATGCGAATGGTCCACAATG

k-mer   score
ATG     3
CCA     2
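
A small sketch of k-mer counting on the example sequence ACCATGCGAATGGTCCACAATG. The similarity function is one possible normalisation (shared k-mers divided by the number of k-mers in the shorter sequence) and is an assumption, as are the function names; the slides only state that k-mer counting is used. Both sequences are assumed to be at least k residues long.

```python
from collections import Counter

def kmer_counts(seq, k=3):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_similarity(a, b, k=3):
    """Shared k-mers over the k-mer count of the shorter sequence (assumed form)."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    shared = sum(min(n, cb[kmer]) for kmer, n in ca.items())
    return shared / (min(len(a), len(b)) - k + 1)

counts = kmer_counts("ACCATGCGAATGGTCCACAATG")
print(counts["ATG"], counts["CCA"])  # 3 2
```

Because no alignment is needed, this measure is cheap to compute for every pair of input sequences.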

15


Distance estimate
– Based on the similarities, construct a triangular distance matrix.

      X1    X2    X3    X4
X1    X     X     X     X
X2    0.6   X     X     X
X3    0.8   0.2   X     X
X4    0.3   0.7   0.5   X

16


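Tree construction
– From the distance matrix we construct a tree

      X1    X2    X3    X4
X1    X     X     X     X
X2    0.6   X     X     X
X3    0.8   0.2   X     X
X4    0.3   0.7   0.5   X

[Figure: the tree is built up step by step; X3 and X2 are joined, X4 and X1 are joined, and the two pairs are then joined at the root]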
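
A sketch of building a tree from the example distances (0.6, 0.8, 0.2, 0.3, 0.7, 0.5), assuming the rows and columns of the matrix are both ordered X1 to X4. It uses average-linkage (UPGMA-style) clustering as one common choice; the slide does not name the clustering method, and the function name and nested-tuple tree representation are assumptions.

```python
from itertools import combinations

def upgma(dist):
    """dist maps (leaf_a, leaf_b) -> distance. Returns a nested-tuple tree."""
    d, size = {}, {}
    for (a, b), value in dist.items():
        d[frozenset((a, b))] = value
        size[a] = size[b] = 1
    active = set(size)  # leaves are strings, merged clusters are tuples
    while len(active) > 1:
        # Pick the closest pair of active clusters (sorted for determinism).
        a, b = min(combinations(sorted(active, key=str), 2),
                   key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        active -= {a, b}
        for c in active:
            # Size-weighted average of the distances to the two merged clusters.
            d[frozenset((merged, c))] = (
                size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
            ) / (size[a] + size[b])
        size[merged] = size[a] + size[b]
        active.add(merged)
    return active.pop()

example = {("X1", "X2"): 0.6, ("X1", "X3"): 0.8, ("X2", "X3"): 0.2,
           ("X1", "X4"): 0.3, ("X2", "X4"): 0.7, ("X3", "X4"): 0.5}
tree = upgma(example)
print(tree)  # (('X1', 'X4'), ('X2', 'X3'))
```

The closest pair (X2, X3 at 0.2) is joined first, then X1 and X4 at 0.3, which matches the pairs shown in the figure.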

17


18


Progressive alignment
– A progressive alignment is built by following the branching order of the tree. This yields a multiple alignment of all input sequences at the root.

[Figure: guide tree over X1, X4, X2, X3; sub-alignments are built at the internal nodes and merged at the root into the alignment of X1, X2, X3, X4]
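
A sketch of what "following the branching order" means in practice: a post-order walk of the guide tree lists the profile-profile merges in the order they would be performed. The nested-tuple tree representation and the function name are assumptions; the actual alignment of the two groups at each step is not shown.

```python
def merge_order(tree):
    """Return the list of (left_group, right_group) merges, children before parents."""
    steps = []

    def visit(node):
        if isinstance(node, str):      # a leaf is a single sequence
            return [node]
        left, right = node
        left_group = visit(left)
        right_group = visit(right)
        steps.append((left_group, right_group))
        return left_group + right_group

    visit(tree)
    return steps

for left, right in merge_order((("X1", "X4"), ("X2", "X3"))):
    print("align", left, "with", right)
```

The last merge is the one at the root, which produces the alignment of all input sequences.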

19

Stage 2: Improved Progressive

Attempts to improve the tree and uses it to build a new progressive alignment. This stage may be iterated.

[Figure: guide tree over X1, X4, X3, X2 next to an empty triangular distance matrix over X1 to X4]

20


Similarity Measure
– Similarity is calculated for each pair of sequences using fractional identity computed from their mutual alignment in the current multiple alignment

Current multiple alignment:
TCC--AA
TCA--GA
TCA--AA
G--ATAC
T--CTGC

Pair taken from it:
TCC--AA
TCA--AA
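
A sketch of fractional identity for two rows of a multiple alignment, here the rows TCC--AA and TCA--AA. It assumes identity is counted over the columns where both rows have a residue; the slides do not spell out the denominator, and the function name is hypothetical.

```python
def fractional_identity(row_a, row_b, gap="-"):
    """Identical residue pairs divided by columns where both rows have a residue."""
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)

print(fractional_identity("TCC--AA", "TCA--AA"))  # 0.8
```

Here four of the five residue pairs agree (the third column has C against A), giving 0.8.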

21


Tree construction
– A tree is constructed by computing a Kimura distance matrix and applying a clustering method to it

[Figure: empty triangular distance matrix over X1, X2, X3, X4]
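
A sketch of turning a fractional identity into a Kimura distance, using the standard Kimura correction for protein sequences, d = -ln(1 - D - D^2/5), where D = 1 - identity. Returning infinity when the logarithm is undefined is a simplification made for this example, and the function name is an assumption.

```python
import math

def kimura_distance(identity):
    """Kimura-corrected distance from a fractional identity in [0, 1]."""
    d = 1.0 - identity                 # fraction of differing positions
    arg = 1.0 - d - d * d / 5.0
    if arg <= 0.0:                     # correction undefined for very divergent pairs
        return math.inf
    return -math.log(arg)

print(round(kimura_distance(0.8), 4))
```

The correction grows faster than the raw difference, reflecting that multiple substitutions at one site are increasingly likely as sequences diverge.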

22


Tree comparison
– The new tree is compared to the previous tree by identifying the set of internal nodes for which the branching order has changed
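
One way to sketch this comparison: reduce every internal node to a canonical form that ignores left/right order, then report the nodes of the new tree whose subtree does not occur in the old tree. The nested-tuple trees, the function names and the example trees are all hypothetical.

```python
def internal_nodes(node, out):
    """Append the canonical form of every internal node to out (children first)."""
    if isinstance(node, str):
        return node
    canon = tuple(sorted((internal_nodes(child, out) for child in node), key=repr))
    out.append(canon)
    return canon

def changed_nodes(old_tree, new_tree):
    """Internal nodes of new_tree whose subtree is not present in old_tree."""
    old_nodes, new_nodes = [], []
    internal_nodes(old_tree, old_nodes)
    internal_nodes(new_tree, new_nodes)
    old_set = set(old_nodes)
    return [node for node in new_nodes if node not in old_set]

old = (("X1", "X4"), ("X2", "X3"))
new = ((("X2", "X3"), "X4"), "X1")
print(changed_nodes(old, new))
```

In this example the (X2, X3) node is unchanged, so only the two nodes above it need to be realigned.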

23


Progressive alignment
– A new progressive alignment is built

[Figure: new guide tree over X2, X4, X1, X3; sub-alignments are merged up to the root to give the new alignment]

24

Stage 3: Refinement

Performs iterative refinement

25

Stage 3: Refinement

Choice of bipartition
– An edge is removed from the tree, dividing the sequences into two disjoint subsets

[Figure: a tree over X1 to X5, shown before and after one edge is removed]
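
A sketch that enumerates the bipartitions obtained by removing each edge of a rooted tree: the sequences below the removed edge form one subset and all remaining sequences form the other. The nested-tuple tree and the function names are hypothetical.

```python
def leaves(node):
    """List the leaf names under a node."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]

def bipartitions(tree):
    """One (below, rest) split per edge, i.e. per non-root node."""
    everything = set(leaves(tree))
    splits = []

    def visit(node, is_root):
        if not is_root:
            below = set(leaves(node))
            splits.append((sorted(below), sorted(everything - below)))
        if not isinstance(node, str):
            for child in node:
                visit(child, False)

    visit(tree, True)
    return splits

for below, rest in bipartitions(((("X2", "X3"), "X4"), ("X1", "X5"))):
    print(below, "|", rest)
```

In a binary rooted tree the two edges at the root give the same split twice, once in each orientation.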

26

Stage 3: Refinement

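Profile Extraction
– The multiple alignment of each subset is extracted from the current multiple alignment. Columns made up of indels only are removed

[Figure: rows labelled X1, X3, X4, X5, X2]

First subset as extracted:
TCC--AA
TCA--AA

Second subset as extracted:
TCA--GA
T--CTGC
G--ATAC

First subset after removing indel-only columns:
TCCAA
TCAAA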
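
A sketch of profile extraction: pick the rows of one subset and drop every column that contains only gaps within that subset. The assignment of the names X1 to X5 to particular rows is an assumption made for the example, as is the function name.

```python
def extract_profile(alignment, names, gap="-"):
    """Rows for the given names, with columns that are all gaps removed."""
    rows = [alignment[name] for name in names]
    keep = [col for col in range(len(rows[0]))
            if any(row[col] != gap for row in rows)]
    return {name: "".join(alignment[name][col] for col in keep) for name in names}

# Hypothetical labelling of the rows shown on the slide.
alignment = {"X1": "TCC--AA", "X2": "TCA--AA", "X3": "TCA--GA",
             "X4": "T--CTGC", "X5": "G--ATAC"}
print(extract_profile(alignment, ["X1", "X2"]))
print(extract_profile(alignment, ["X3", "X4", "X5"]))
```

The first subset loses its two gap-only columns, while the second keeps all seven columns because each column has at least one residue.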

27

Stage 3: Refinement

Re-alignment
– The two profiles are then realigned with each other using profile-profile alignment.

Profiles before re-alignment:
TCCAA
TCAAA

TCA--GA
T--CTGC
G--ATAC

After re-alignment:
T--CCAA
T--CAAA
TCA--GA
T--CTGC
G--ATAC

28

Stage 3: Refinement

Accept/Reject
– The score of the new alignment is computed. If it is higher than the score of the old alignment, the new alignment is retained; otherwise it is discarded.

New alignment:
T--CCAA
T--CAAA
TCA--GA
T--CTGC
G--ATAC

[Figure: either the new or the old alignment is kept]
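
A sketch of the accept/reject step using a simple sum-of-pairs score. The scoring values (match +1, mismatch -1, residue against gap -1, gap against gap 0) and the function names are assumptions for illustration; a real aligner would use a substitution matrix and its own gap model.

```python
from itertools import combinations

def sp_score(rows, match=1, mismatch=-1, gap=-1):
    """Sum the column scores over every pair of rows in the alignment."""
    total = 0
    for a, b in combinations(rows, 2):
        for ca, cb in zip(a, b):
            if ca == "-" and cb == "-":
                continue               # gap against gap contributes nothing
            if ca == "-" or cb == "-":
                total += gap
            elif ca == cb:
                total += match
            else:
                total += mismatch
    return total

def keep_better(old_rows, new_rows):
    """Retain the new alignment only if it scores strictly higher."""
    return new_rows if sp_score(new_rows) > sp_score(old_rows) else old_rows

print(keep_better(["A-C", "AC-"], ["AC", "AC"]))
```

Because a new alignment is kept only when it scores strictly higher, the score never decreases from one refinement iteration to the next.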

29

MUSCLE Review

Performance
– For alignment of N sequences of length L
– Space complexity: O(N^2 + L^2)
– Time complexity: O(N^4 + NL^2)
– Time complexity without refinement: O(N^3 + NL^2)

30

Outline

Introduction Background MUSCLE ProbCons Conclusion

31

Hidden Markov Models (HMMs)

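[Figure: pair HMM with three states. M emits an aligned pair X:Y, I emits a residue of X against a gap, and J emits a residue of Y against a gap.]

x:      AGCC-AGC
y:      -GCCCAGT
states: IMMMJMMM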

32

Pairwise Alignment

Viterbi Algorithm
– Picks the alignment that is most likely to be the optimal alignment
– However: the most likely alignment is not necessarily the most accurate
– Alternative: find the alignment of maximum expected accuracy

33

Lazy Teacher Analogy

10 students take a 10-question true-false quiz
– How do you make the answer key?
– Viterbi Approach: Use the answer sheet of the best student
– MEA Approach: Weighted majority vote

[Figure: the students' answers to question 4 (mostly F, a few T) shown with their grades, which range from A to C]

34

Viterbi vs MEA

Viterbi
– Picks the alignment with the highest chance of being completely correct

Maximum Expected Accuracy
– Picks the alignment with the highest expected number of correct predictions
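
The lazy teacher analogy can be sketched in a few lines. Each answer sheet has a weight standing in for how much the student is trusted; the "Viterbi" key copies the single most trusted sheet, while the "MEA" key takes a weighted vote per question. The weights, answers and function names are made up for illustration.

```python
def best_student_key(sheets, weights):
    """Viterbi-style: copy the whole sheet of the highest-weighted student."""
    best = max(range(len(sheets)), key=lambda s: weights[s])
    return list(sheets[best])

def weighted_vote_key(sheets, weights):
    """MEA-style: decide each question by weighted majority."""
    key = []
    for q in range(len(sheets[0])):
        true_weight = sum(w for sheet, w in zip(sheets, weights) if sheet[q])
        false_weight = sum(w for sheet, w in zip(sheets, weights) if not sheet[q])
        key.append(true_weight > false_weight)
    return key

sheets = [[True, True], [False, True], [False, True]]
weights = [0.4, 0.3, 0.3]
print(best_student_key(sheets, weights))   # [True, True]
print(weighted_vote_key(sheets, weights))  # [False, True]
```

The two keys disagree on the first question: the best student answered True, but the other two students together carry more weight.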

35

ProbCons

Basic Strategy: Uses Hidden Markov Models (HMM) to predict the probability of an alignment.

Uses Maximum Expected Accuracy instead of the Viterbi alignment.

5 steps

36

Notation

Given N sequences S = {s1, s2, …, sN}

a* is the optimal alignment

37

ProbCons

Step 1: Computation of posterior-probability matrices
Step 2: Computation of expected accuracies
Step 3: Probabilistic consistency transformation
Step 4: Computation of guide tree
Step 5: Progressive alignment
Post-processing step: Iterative refinement

38

Step 1: Computation of posterior-probability matrices

For every pair of sequences x, y ∈ S, compute the matrix Pxy

Pxy(i, j) = P(xi ~ yj ∈ a* | x, y), which is the probability that xi and yj are paired in a*

39

Step 2: Computation of expected accuracies

For a pairwise alignment a between x and y, define the accuracy as:

accuracy(a, a*) = (# of correctly predicted matches) / (length of shorter sequence)
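
A sketch of the accuracy definition, with an alignment represented as a set of (i, j) index pairs meaning that xi is matched to yj. This representation and the function name are assumptions.

```python
def accuracy(predicted_pairs, true_pairs, len_x, len_y):
    """Correctly predicted matches divided by the length of the shorter sequence."""
    correct = len(set(predicted_pairs) & set(true_pairs))
    return correct / min(len_x, len_y)

predicted = [(0, 0), (1, 1), (2, 3)]
reference = [(0, 0), (1, 1), (2, 2)]
print(accuracy(predicted, reference, 3, 4))
```

Two of the three predicted matches appear in the reference alignment, so the accuracy is 2/3.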

40

Step 2: Computation of expected accuracies (continued)

MEA alignment is found by finding the highest summing path through the matrix

Mxy[i, j] = P(xi is aligned to yj | x, y)
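
A sketch of finding the highest summing path: a dynamic program over the posterior matrix in which a match adds Mxy[i, j] and gaps add nothing. The function name, the list-of-lists matrix representation and the example values are assumptions.

```python
def mea_alignment(P):
    """Return (total, pairs) for the path with the highest sum of posteriors."""
    n = len(P)
    m = len(P[0]) if P else 0
    # best[i][j] = highest sum achievable for the prefixes x[:i] and y[:j]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j - 1] + P[i - 1][j - 1],
                             best[i - 1][j],
                             best[i][j - 1])
    # Trace back to list the matched index pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        p = P[i - 1][j - 1]
        if p > 0 and best[i][j] == best[i - 1][j - 1] + p:
            pairs.append((i - 1, j - 1)); i -= 1; j -= 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best[n][m], pairs

total, pairs = mea_alignment([[0.9, 0.05], [0.05, 0.8]])
print(total, pairs)
```

Dividing the total by the length of the shorter sequence gives the expected accuracy of the resulting alignment under the definition above.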

41

Consistency

[Figure: three sequences x, y and z with the positions xi, yj, yj’ and zk marked]

42

Step 3: Probabilistic consistency transformation

Re-estimate the match quality scores P(xi ~ yj ∈ a* | x, y) by applying the probabilistic consistency transformation, which incorporates similarity of x and y to other sequences from S into the x-y comparison:

P(xi ~ yj ∈ a* | x, y) → P(xi ~ yj ∈ a* | x, y, z)

43

Step 3: Probabilistic consistency transformation (continued)

44

Step 3: Probabilistic consistency transformation (continued)

Since most of the values of Pxz and Pzy will be very small, we ignore all the entries in which the value is smaller than some threshold w.

– Use sparse matrix multiplication
– May be repeated
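
A sketch of the transformation in matrix form, following the form given in the ProbCons paper: the new Pxy is the average over all z in S of the product Pxz Pzy, where Pxx is taken to be the identity (so z = x and z = y each contribute Pxy itself). Entries below the cutoff are skipped to mimic the sparse multiplication. The dictionary layout, the function names and the cutoff value are assumptions.

```python
def sparse_product(A, B, cutoff):
    """Matrix product that ignores entries smaller than cutoff."""
    C = [[0.0] * len(B[0]) for _ in range(len(A))]
    for i, row in enumerate(A):
        for k, a in enumerate(row):
            if a < cutoff:
                continue
            for j, b in enumerate(B[k]):
                if b >= cutoff:
                    C[i][j] += a * b
    return C

def consistency_transform(P, seqs, cutoff=0.01):
    """P[(x, y)] is the posterior matrix for the ordered pair x, y (x != y)."""
    new = {}
    for x in seqs:
        for y in seqs:
            if x == y:
                continue
            # z = x and z = y each contribute P[(x, y)] unchanged.
            total = [[2.0 * v for v in row] for row in P[(x, y)]]
            for z in seqs:
                if z in (x, y):
                    continue
                prod = sparse_product(P[(x, z)], P[(z, y)], cutoff)
                total = [[t + p for t, p in zip(trow, prow)]
                         for trow, prow in zip(total, prod)]
            new[(x, y)] = [[v / len(seqs) for v in row] for row in total]
    return new

# Toy example: three sequences of length 1.
P = {("x", "y"): [[0.5]], ("y", "x"): [[0.5]],
     ("x", "z"): [[1.0]], ("z", "x"): [[1.0]],
     ("y", "z"): [[1.0]], ("z", "y"): [[1.0]]}
print(consistency_transform(P, ["x", "y", "z"])[("x", "y")])
```

In the toy example the x-y score rises from 0.5 to 2/3 because both residues align confidently to the same residue of z.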

45

Step 4: Computation of guide tree

– Use E(x,y) as a measure of similarity
– Define similarity of two clusters by the sum-of-pairs

[Figure: empty triangular similarity matrix over X1, X2, X3, X4]

46

Step 5: Progressive alignment

Align sequence groups hierarchically according to the order specified in the guide tree.

Alignments are scored using sum-of-pairs scoring function.

Aligned residues are scored according to the match quality scores P(xi ~ yj ∈ a* | x, y)

Gap penalties are set to 0.

47

Post-processing step: iterative refinement

– Much like in MUSCLE
– Randomly partition alignment into two groups of sequences and realign
– May be repeated

[Figure: the sequences X1 to X5 split into two groups]
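
A sketch of the random partition step: each sequence is assigned to one of two groups by a coin flip, and the draw is repeated until both groups are non-empty. The function name is hypothetical, at least two sequences are assumed, and the realignment of the two groups is not shown.

```python
import random

def random_bipartition(names, rng):
    """Split names into two non-empty groups chosen at random."""
    while True:
        left = [name for name in names if rng.random() < 0.5]
        if 0 < len(left) < len(names):
            right = [name for name in names if name not in left]
            return left, right

rng = random.Random(0)  # seeded only to make the example repeatable
left, right = random_bipartition(["X1", "X2", "X3", "X4", "X5"], rng)
print(left, right)
```

Unlike the tree-based bipartition in MUSCLE, this split does not depend on the guide tree.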

48

ProbCons overview

– ProbCons demonstrated dramatic improvements in alignment accuracy
– Longer running time
– Doesn’t use protein-specific alignment information, so can be used to align DNA sequences with improved accuracy over the Needleman-Wunsch algorithm.

49

Outline

Introduction Background MUSCLE ProbCons Conclusion

50

Conclusion

MUSCLE demonstrated lower accuracy than ProbCons, but a very short running time.

ProbCons demonstrated dramatic improvements in alignment accuracy, but it is much slower than MUSCLE.

51

Results

52

Reliability Scores

53

Questions?

54

References

Robert C Edgar

– MUSCLE: a multiple sequence alignment method with reduced time and space complexity

Chuong B. Do, Mahathi S. P. Mahabhashyam, Michael Brudno, and Serafim Batzoglou

– ProbCons: Probabilistic Consistency-based Multiple Sequence Alignment

55

References (continued)

Slides on Multiple Sequence Alignment, CS262

Slides on Sequence similarity, CS273

Slides on Protein Multiple Alignment, Marina Sirota