Page 1: 1 Protein Multiple Alignment by Konstantin Davydov.

Protein Multiple Alignment

by Konstantin Davydov

Papers

MUSCLE: a multiple sequence alignment method with reduced time and space complexity by Robert C Edgar

ProbCons: Probabilistic Consistency-based Multiple Sequence Alignment by Chuong B. Do, Mahathi S. P. Mahabhashyam, Michael Brudno, and Serafim Batzoglou

Outline

– Introduction
– Background
– MUSCLE
– ProbCons
– Conclusion

Introduction

What is multiple protein alignment?

Given N sequences of amino acids x_1, x_2, …, x_N:

Insert gaps in each of the x_i so that
– All sequences have the same length
– Score of the global map is maximum

ACCTGCA

ACTTCAA

ACCTGCA--

AC--TTCAA
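
The two conditions a gapped result has to satisfy can be checked mechanically. The sketch below is only an illustration: the function name `is_valid_alignment` and the use of `-` as the gap symbol are assumptions, and it checks the structural requirements only (equal row lengths, residues unchanged), not the score.

```python
def is_valid_alignment(originals, aligned, gap="-"):
    """Return True if `aligned` is a gapped version of `originals`.

    Checks that there is one row per input sequence, that all rows have
    the same length, and that removing gaps gives back each input.
    """
    if len(originals) != len(aligned):
        return False
    same_length = len({len(row) for row in aligned}) == 1
    same_residues = all(
        row.replace(gap, "") == seq for seq, row in zip(originals, aligned)
    )
    return same_length and same_residues


# The example from this slide.
print(is_valid_alignment(["ACCTGCA", "ACTTCAA"], ["ACCTGCA--", "AC--TTCAA"]))
```
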

Introduction

Motivation
– Phylogenetic tree estimation
– Secondary structure prediction
– Identification of critical regions

Background

Aligning two sequences

ACCTGCA

ACTTCAA

ACCTGCA--

AC--TTCAA
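
Pairwise global alignment is usually done with dynamic programming. The following is a minimal Needleman-Wunsch sketch, not the scoring used in the talk: the match, mismatch and gap values are assumptions chosen for illustration, so the alignment it returns need not coincide with the one drawn on the slide.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-1):
    """Global alignment of x and y with a linear gap penalty (assumed scores)."""
    n, m = len(x), len(y)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if x[i - 1] == y[j - 1] else mismatch

    # Fill the table: each cell looks at its three predecessors.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return score[n][m], "".join(reversed(ax)), "".join(reversed(ay))


total, row_x, row_y = needleman_wunsch("ACCTGCA", "ACTTCAA")
print(total)
print(row_x)
print(row_y)
```
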

Background

[Figure: alignment grid with axes x, y and z, shown with the sequences AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA and AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC]

Background

Unfortunately, this can get very expensive
– Aligning N sequences of length L requires a matrix of size $L^N$, where each square in the matrix has $2^N - 1$ neighbors
– This gives a total time complexity of $O(2^N L^N)$
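
To get a feel for how quickly this grows, the counts can be evaluated for small N. This is plain arithmetic on the expressions from this slide; the function name `dp_work` is made up for the illustration.

```python
def dp_work(num_sequences, length):
    """Return (cells, neighbors per cell, cells * neighbors) for the full DP."""
    cells = length ** num_sequences        # matrix of size L^N
    neighbors = 2 ** num_sequences - 1     # 2^N - 1 predecessors per cell
    return cells, neighbors, cells * neighbors


for n in (2, 3, 5, 10):
    print(n, dp_work(n, 100))
```

With L = 100, two sequences need 30,000 cell updates, three need 7,000,000, and ten are already far out of reach.
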

MUSCLE

Basic Strategy: A progressive alignment is built, to which horizontal refinement is applied
– Three stages
– At the end of each stage, a multiple alignment is available and the algorithm can be terminated

Three Stages

– Stage 1: Draft Progressive
– Stage 2: Improved Progressive
– Stage 3: Refinement

Stage 1: Draft Progressive

Similarity Measure
– Calculated using k-mer counting.

ACCATGCGAATGGTCCACAATG

k-mer   score
ATG     3
CCA     2
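
Counting k-mers takes only a few lines. The sketch below reproduces the two counts on this slide and then turns counts into a pairwise similarity; the normalisation used in `kmer_similarity` (shared k-mers divided by the number of k-mers in the shorter sequence) is one plausible form and should be read as an assumption rather than as the exact measure MUSCLE uses.

```python
from collections import Counter


def kmer_counts(seq, k=3):
    """Count every overlapping substring of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_similarity(seq_a, seq_b, k=3):
    """Shared k-mers (with multiplicity) over the k-mers in the shorter sequence."""
    a, b = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    shared = sum(min(a[kmer], b[kmer]) for kmer in a)
    denominator = min(len(seq_a), len(seq_b)) - k + 1
    return shared / denominator if denominator > 0 else 0.0


counts = kmer_counts("ACCATGCGAATGGTCCACAATG")
print(counts["ATG"], counts["CCA"])  # 3 2
```

No alignment is needed for this measure, which is why it is cheap enough for the draft stage.
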

Stage 1: Draft Progressive

Distance estimate
– Based on the similarities, construct a triangular distance matrix.

      X1    X2    X3    X4
X1    -     -     -     -
X2    0.6   -     -     -
X3    0.8   0.2   -     -
X4    0.3   0.7   0.5   -

Stage 1: Draft Progressive

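Tree construction
– From the distance matrix we construct a tree

      X1    X2    X3    X4
X1    -     -     -     -
X2    0.6   -     -     -
X3    0.8   0.2   -     -
X4    0.3   0.7   0.5   -

[Figure: the sequences are clustered step by step; X3 and X2 form one group, X4 and X1 form the other, and the two groups are joined at the root]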
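
One way to turn such a matrix into a tree is average-linkage (UPGMA-style) clustering: repeatedly join the two closest clusters. The sketch below assumes that method and reads the matrix as d(X1,X2)=0.6, d(X1,X3)=0.8, d(X2,X3)=0.2, d(X1,X4)=0.3, d(X2,X4)=0.7, d(X3,X4)=0.5; the function name and the nested-tuple tree representation are illustrative choices.

```python
from itertools import combinations


def average_linkage_tree(names, dist):
    """Build a tree (nested tuples) by joining the closest clusters first.

    dist maps frozenset({a, b}) to the distance between leaves a and b.
    """
    clusters = [(name, [name]) for name in names]  # (subtree, leaves)

    def cluster_distance(a, b):
        pairs = [(p, q) for p in a[1] for q in b[1]]
        return sum(dist[frozenset(pq)] for pq in pairs) / len(pairs)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda ab: cluster_distance(*ab))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(((a[0], b[0]), a[1] + b[1]))
    return clusters[0][0]


distances = {
    frozenset({"X1", "X2"}): 0.6,
    frozenset({"X1", "X3"}): 0.8,
    frozenset({"X2", "X3"}): 0.2,
    frozenset({"X1", "X4"}): 0.3,
    frozenset({"X2", "X4"}): 0.7,
    frozenset({"X3", "X4"}): 0.5,
}
print(average_linkage_tree(["X1", "X2", "X3", "X4"], distances))
# (('X2', 'X3'), ('X1', 'X4'))
```

X2 and X3 are joined first (distance 0.2), then X1 and X4 (0.3), which matches the grouping on the slide.
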

Stage 1: Draft Progressive

Progressive alignment
– A progressive alignment is built by following the branching order of the tree. This yields a multiple alignment of all input sequences at the root.

[Figure: guide tree with leaves X1, X4, X2, X3; profiles are merged up the tree, giving the alignment of X1, X2, X3, X4 at the root]

Stage 2: Improved Progressive

Attempts to improve the tree and uses it to build a new progressive alignment. This stage may be iterated.

[Figure: guide tree over X1, X4, X3, X2 next to a recomputed triangular distance matrix over X1 to X4]

Stage 2: Improved Progressive

Similarity Measure
– Similarity is calculated for each pair of sequences using fractional identity computed from their mutual alignment in the current multiple alignment

Current multiple alignment:

TCC--AA
TCA--GA
TCA--AA
G--ATAC
T--CTGC

Mutual alignment of one pair:

TCC--AA
TCA--AA
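
Fractional identity can be read directly off two rows of the multiple alignment. In the sketch below the denominator is the number of columns in which both rows carry a residue; that convention is an assumption, since the slide does not spell it out. The example uses the pair of rows TCC--AA and TCA--AA from this slide.

```python
def fractional_identity(row_a, row_b, gap="-"):
    """Identical residue pairs divided by columns where neither row has a gap."""
    aligned = [(a, b) for a, b in zip(row_a, row_b) if a != gap and b != gap]
    if not aligned:
        return 0.0
    return sum(a == b for a, b in aligned) / len(aligned)


print(fractional_identity("TCC--AA", "TCA--AA"))  # 0.8
```

Four of the five residue-residue columns agree, so the pair scores 0.8.
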

Stage 2: Improved Progressive

Tree construction
– A tree is constructed by computing a Kimura distance matrix and applying a clustering method to it

[Figure: triangular distance matrix over X1 to X4 and the tree built from it]
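
The Kimura correction converts an observed fraction of differing positions D into an estimated evolutionary distance, d = -ln(1 - D - D²/5). The sketch below applies that formula to a fractional identity; how values of D near saturation are handled (here: an exception) is an assumption for this illustration.

```python
import math


def kimura_distance(identity):
    """Kimura protein distance from fractional identity (between 0 and 1)."""
    d = 1.0 - identity                 # observed fraction of differences
    arg = 1.0 - d - d * d / 5.0
    if arg <= 0.0:
        raise ValueError("sequences too divergent for the Kimura formula")
    return -math.log(arg)


print(round(kimura_distance(0.8), 4))
```

The corrected distance grows faster than the raw difference, reflecting multiple substitutions at the same site.
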

Stage 2: Improved Progressive

Tree comparison
– The new tree is compared to the previous tree by identifying the set of internal nodes for which the branching order has changed

Stage 2: Improved Progressive

Progressive alignment
– A new progressive alignment is built

[Figure: new guide tree with leaves X2, X4, X1, X3; the new alignment is produced at the root]

Stage 3: Refinement

Performs iterative refinement

Stage 3: Refinement

Choice of bipartition
– An edge is removed from the tree, dividing the sequences into two disjoint subsets

[Figure: tree over X1 to X5 before and after an edge is removed]

Stage 3: Refinement

Profile Extraction
– The multiple alignment of each subset is extracted from the current multiple alignment. Columns made up of indels only are removed

Current multiple alignment (sequences X1 to X5):

TCC--AA
TCA--GA
TCA--AA
G--ATAC
T--CTGC

First subset, before and after removing all-indel columns:

TCC--AA      TCCAA
TCA--AA      TCAAA

Second subset:

TCA--GA
T--CTGC
G--ATAC
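
Extracting a profile amounts to selecting rows and dropping columns that contain only gaps. The sketch below assumes the slide's alignment is five rows of seven columns and that the two subsets are rows {0, 2} and {1, 3, 4}; both are readings of the flattened figure, not something the text states.

```python
def extract_profile(alignment, keep_rows, gap="-"):
    """Keep the chosen rows and delete columns that are gaps in all of them."""
    rows = [alignment[i] for i in keep_rows]
    columns = [col for col in zip(*rows) if any(c != gap for c in col)]
    return ["".join(chars) for chars in zip(*columns)]


msa = ["TCC--AA", "TCA--GA", "TCA--AA", "G--ATAC", "T--CTGC"]
print(extract_profile(msa, [0, 2]))     # ['TCCAA', 'TCAAA']
print(extract_profile(msa, [1, 3, 4]))  # ['TCA--GA', 'G--ATAC', 'T--CTGC']
```

The second subset keeps all seven columns because each column has a residue in at least one of its rows.
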

Stage 3: Refinement

Re-alignment
– The two profiles are then realigned with each other using profile-profile alignment.

Profiles before re-alignment:

TCCAA
TCAAA

TCA--GA
T--CTGC
G--ATAC

After re-alignment:

T--CCAA
T--CAAA
TCA--GA
T--CTGC
G--ATAC

Stage 3: Refinement

Accept/Reject
– The score of the new alignment is computed. If it is higher than the score of the old alignment, the new alignment is retained; otherwise it is discarded.

New:

T--CCAA
T--CAAA
TCA--GA
T--CTGC
G--ATAC

OR

Old:

TCC--AA
TCA--GA
TCA--AA
G--ATAC
T--CTGC

MUSCLE Review

Performance
– For alignment of N sequences of length L
– Space complexity: $O(N^2 + L^2)$
– Time complexity: $O(N^4 + NL^2)$
– Time complexity without refinement: $O(N^3 + NL^2)$

Hidden Markov Models (HMMs)

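States: M, I, J

x:       AGCC-AGC
y:       -GCCCAGT
states:  IMMMJMMM

[Figure: state diagram in which M emits an aligned pair (x_i, y_j), I emits a residue of x against a gap, and J emits a residue of y against a gap]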

Pairwise Alignment

Viterbi Algorithm
– Picks the alignment that is most likely to be the optimal alignment
– However: the most likely alignment is not necessarily the most accurate
– Alternative: find the alignment of maximum expected accuracy

Lazy Teacher Analogy

– 10 students take a 10-question true-false quiz
– How do you make the answer key?
– Viterbi Approach: Use the answer sheet of the best student
– MEA Approach: Weighted majority vote

[Figure: the ten students' answers to question 4 (eight F, two T) shown with each student's grade]

Viterbi vs MEA

Viterbi
– Picks the alignment with the highest chance of being completely correct

Maximum Expected Accuracy
– Picks the alignment with the highest expected number of correct predictions
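
The lazy-teacher analogy can be made concrete with a toy quiz. Everything in the sketch below is hypothetical: three students, four questions, and weights standing in for how much each student is trusted. It shows that the weighted vote can produce a key that differs from the best student's sheet.

```python
def best_student_key(sheets, weights):
    """'Viterbi' analogue: copy the sheet of the most trusted student."""
    best = max(range(len(sheets)), key=lambda s: weights[s])
    return sheets[best]


def weighted_vote_key(sheets, weights):
    """'MEA' analogue: decide each question by weighted majority."""
    key = []
    for q in range(len(sheets[0])):
        true_weight = sum(w for sheet, w in zip(sheets, weights)
                          if sheet[q] == "T")
        false_weight = sum(weights) - true_weight
        key.append("T" if true_weight > false_weight else "F")
    return "".join(key)


sheets = ["TFTT", "TTFT", "FTTT"]   # hypothetical answer sheets
weights = [0.9, 0.8, 0.8]           # hypothetical trust in each student
print(best_student_key(sheets, weights))   # TFTT
print(weighted_vote_key(sheets, weights))  # TTTT
```

The best student is outvoted on question 2, just as a Viterbi alignment can contain individual matches that most of the probability mass disagrees with.
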

ProbCons

Basic Strategy: Uses Hidden Markov Models (HMM) to predict the probability of an alignment.

Uses Maximum Expected Accuracy instead of the Viterbi alignment.

5 steps

Notation

– Given N sequences S = {s_1, s_2, …, s_N}
– a* is the optimal alignment

ProbCons

– Step 1: Computation of posterior-probability matrices
– Step 2: Computation of expected accuracies
– Step 3: Probabilistic consistency transformation
– Step 4: Computation of guide tree
– Step 5: Progressive alignment
– Post-processing step: Iterative refinement

Step 1: Computation of posterior-probability matrices

For every pair of sequences $x, y \in S$, compute the matrix $P_{xy}$

$P_{xy}(i, j) = P(x_i \sim y_j \in a^* \mid x, y)$, which is the probability that $x_i$ and $y_j$ are paired in $a^*$

Step 2: Computation of expected accuracies

For a pairwise alignment a between x and y, define the accuracy as:

$\mathrm{accuracy}(a, a^*) = \dfrac{\text{number of correct predicted matches}}{\text{length of shorter sequence}}$

Step 2: Computation of expected accuracies (continued)

The MEA alignment is found by finding the highest summing path through the matrix

$M_{xy}[i, j] = P(x_i \text{ is aligned to } y_j \mid x, y)$
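
Finding the highest summing path is again a dynamic program, this time over posterior probabilities and without gap penalties. The sketch below is a minimal version under those assumptions; the 2 × 3 posterior matrix is invented for the example, and dividing the returned sum by the length of the shorter sequence gives the expected accuracy in the sense defined in Step 2.

```python
def mea_alignment(posterior):
    """Return (best sum, matched index pairs) for a posterior matrix.

    posterior[i][j] is the probability that x_i is aligned to y_j.
    """
    n, m = len(posterior), len(posterior[0])
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + posterior[i - 1][j - 1],  # match x_i, y_j
                best[i - 1][j],                                # x_i unmatched
                best[i][j - 1],                                # y_j unmatched
            )

    # Trace back, preferring to skip a residue when that loses nothing.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j]:
            i -= 1
        elif best[i][j] == best[i][j - 1]:
            j -= 1
        else:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
    return best[n][m], pairs[::-1]


posterior = [[0.9, 0.1, 0.0],
             [0.1, 0.2, 0.7]]   # hypothetical values
total, pairs = mea_alignment(posterior)
print(pairs)                    # [(0, 0), (1, 2)]
print(total / min(len(posterior), len(posterior[0])))
```
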

Consistency

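[Figure: three sequences x, y and z, with residue x_i in x, candidate partners y_j and y_j' in y, and residue z_k in z]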

Step 3: Probabilistic consistency transformation

Re-estimate the match quality scores $P(x_i \sim y_j \in a^* \mid x, y)$ by applying the probabilistic consistency transformation, which incorporates similarity of x and y to other sequences from S into the x-y comparison:

$P(x_i \sim y_j \in a^* \mid x, y) \rightarrow P(x_i \sim y_j \in a^* \mid x, y, z)$

Step 3: Probabilistic consistency transformation (continued)

– Since most of the values of $P_{xz}$ and $P_{zy}$ will be very small, we ignore all the entries in which the value is smaller than some threshold w.
– Use sparse matrix multiplication
– May be repeated
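
A sparse version of the transformation can be sketched with dictionaries keyed by (row, column). The update used here, P'_xy = (1/|S|) Σ_z P_xz P_zy with P_xx and P_yy taken as identity matrices, is how the transformation is commonly stated for ProbCons; it is not shown on these slides, so treat it as an assumption. The example is restricted to a single third sequence z, and all function names are illustrative.

```python
def sparsify(matrix, cutoff):
    """Keep only entries of a dense matrix that are at least `cutoff`."""
    return {(i, j): p
            for i, row in enumerate(matrix)
            for j, p in enumerate(row) if p >= cutoff}


def sparse_product(a, b):
    """Multiply two sparse matrices stored as {(row, col): value}."""
    by_row = {}
    for (k, j), q in b.items():
        by_row.setdefault(k, []).append((j, q))
    out = {}
    for (i, k), p in a.items():
        for j, q in by_row.get(k, []):
            out[(i, j)] = out.get((i, j), 0.0) + p * q
    return out


def consistency_transform(p_xy, p_xz, p_zy, num_sequences=3):
    """One round of the transformation for S = {x, y, z} (assumed form)."""
    # The terms z = x and z = y each contribute p_xy unchanged.
    out = {ij: 2.0 * p for ij, p in p_xy.items()}
    for ij, p in sparse_product(p_xz, p_zy).items():
        out[ij] = out.get(ij, 0.0) + p
    return {ij: p / num_sequences for ij, p in out.items()}


p_xy = sparsify([[0.5, 0.001]], cutoff=0.01)
p_xz = sparsify([[1.0]], cutoff=0.01)
p_zy = sparsify([[1.0, 0.0]], cutoff=0.01)
print(consistency_transform(p_xy, p_xz, p_zy))
```

Here the support from z raises the x-y match probability from 0.5 to about 0.67, while the entry below the threshold is never touched.
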

Step 4: Computation of guide tree

– Use E(x,y) as a measure of similarity
– Define similarity of two clusters by the sum-of-pairs

[Figure: triangular similarity matrix over X1 to X4 and the guide tree built from it]

Step 5: Progressive alignment

Align sequence groups hierarchically according to the order specified in the guide tree.

Alignments are scored using sum-of-pairs scoring function.

Aligned residues are scored according to the match quality scores $P(x_i \sim y_j \in a^* \mid x, y)$

Gap penalties are set to 0.

Post-processing step: iterative refinement

– Much like in MUSCLE
– Randomly partition alignment into two groups of sequences and realign
– May be repeated

[Figure: sequences X1 to X5 split into two groups]

ProbCons overview

– ProbCons demonstrated dramatic improvements in alignment accuracy
– Longer running time
– Doesn’t use protein-specific alignment information, so can be used to align DNA sequences with improved accuracy over the Needleman-Wunsch algorithm.

Conclusion

MUSCLE demonstrated poor accuracy, but very short running time.

ProbCons demonstrated dramatic improvements in alignment accuracy but is much slower than MUSCLE.

Results

Reliability Scores

Questions?

References

– Robert C Edgar. MUSCLE: a multiple sequence alignment method with reduced time and space complexity
– Chuong B. Do, Mahathi S. P. Mahabhashyam, Michael Brudno, and Serafim Batzoglou. ProbCons: Probabilistic Consistency-based Multiple Sequence Alignment

References (continued)

– Slides on Multiple Sequence Alignment, CS262
– Slides on Sequence similarity, CS273
– Slides on Protein Multiple Alignment, Marina Sirota