# Protein Multiple Alignment by Konstantin Davydov

Date posted: 20-Dec-2015

### Transcript of Protein Multiple Alignment by Konstantin Davydov

- Slide 1
- Protein Multiple Alignment by Konstantin Davydov
- Slide 2
- Papers: "MUSCLE: a multiple sequence alignment method with reduced time and space complexity" by Robert C. Edgar; "ProbCons: Probabilistic Consistency-based Multiple Sequence Alignment" by Chuong B. Do, Mahathi S. P. Mahabhashyam, Michael Brudno, and Serafim Batzoglou.
- Slide 3
- Outline: Introduction, Background, MUSCLE, ProbCons, Conclusion.
- Slide 4
- Introduction: What is multiple protein alignment? Given N sequences of amino acids $x_1, x_2, \ldots, x_N$, insert gaps in each $x_i$ so that (1) all sequences have the same length and (2) the score of the global map is maximum. Example: `ACCTGCA` and `ACTTCAA` align as `ACCTGCA--` over `AC--TTCAA`.
- Slide 5
- Introduction: Motivation. Phylogenetic tree estimation; secondary structure prediction; identification of critical regions.
- Slide 6
- Outline: Introduction, Background, MUSCLE, ProbCons, Conclusion.
- Slide 7
- Background: Aligning two sequences. `ACCTGCA` and `ACTTCAA` align as `ACCTGCA--` over `AC--TTCAA`.
- Slide 8
- Background: [Figure: alignment matrix with axes x, y, z; the sequences shown are `AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA` and `AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC`.]
- Slide 9
- Background: Unfortunately, this can get very expensive. Aligning N sequences of length L requires a matrix of size $L^N$, where each square in the matrix has $2^N - 1$ neighbors. This gives a total time complexity of $O(2^N L^N)$.
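
To make the cost argument concrete, here is a minimal sketch of the N = 2 case: a dynamic-programming matrix of size roughly $L^2$ in which every cell looks at $2^2 - 1 = 3$ neighbors. The function name and the scoring values (match +1, mismatch -1, gap -1) are illustrative assumptions, not taken from the slides.

```python
def global_alignment_score(x, y, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of x and y (scoring values are assumptions)."""
    rows, cols = len(x) + 1, len(y) + 1
    # score[i][j] = best score for aligning the prefixes x[:i] and y[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if x[i - 1] == y[j - 1] else mismatch
            # Three neighbors per cell: diagonal, up, left.
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align x[i-1] with y[j-1]
                score[i - 1][j] + gap,       # x[i-1] against a gap
                score[i][j - 1] + gap,       # y[j-1] against a gap
            )
    return score[-1][-1]


print(global_alignment_score("ACCTGCA", "ACTTCAA"))
```

With N sequences the same recurrence needs an N-dimensional matrix and $2^N - 1$ neighbors per cell, which is where the $O(2^N L^N)$ bound comes from.
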
- Slide 10
- Outline: Introduction, Background, MUSCLE, ProbCons, Conclusion.
- Slide 11
- MUSCLE
- Slide 12
- MUSCLE: Basic strategy. A progressive alignment is built, to which horizontal refinement is applied. There are three stages. At the end of each stage, a multiple alignment is available and the algorithm can be terminated.
- Slide 13
- Three stages: (1) Draft Progressive, (2) Improved Progressive, (3) Refinement.
- Slide 14
- Stage 1: Draft Progressive. Similarity measure: calculated using k-mer counting. Example: in `ACCATGCGAATGGTCCACAATG`, the k-mer `ATG` has score 3 and the k-mer `CCA` has score 2.
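
The k-mer counts on this slide can be reproduced with a short sketch. The helper names are hypothetical, and `shared_kmers` is only one plausible way to turn counts into a pairwise similarity (summing the smaller count of each shared k-mer); the slide itself does not spell out the formula.

```python
from collections import Counter


def kmer_counts(seq, k=3):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_kmers(a, b, k=3):
    """Assumed similarity: sum over k-mers of the smaller of the two counts."""
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    return sum(min(counts_a[m], counts_b[m]) for m in counts_a)


counts = kmer_counts("ACCATGCGAATGGTCCACAATG")
print(counts["ATG"], counts["CCA"])  # 3 2
```

Because no alignment is needed, this measure is cheap enough to compute for every pair of input sequences.
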
- Slide 15
- Stage 1: Draft Progressive. Distance estimate: based on the similarities, construct a triangular distance matrix. [Figure: lower-triangular distance matrix over X1, X2, X3, X4 with rows 0.6; 0.8, 0.2; 0.3, 0.7, 0.5.]
- Slide 16
- Stage 1: Draft Progressive. Tree construction: from the distance matrix we construct a tree. [Figure: the triangular distance matrix (0.6; 0.8, 0.2; 0.3, 0.7, 0.5) and the tree being built step by step over X1, X2, X3, X4.]
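
The tree-construction step can be sketched with average-linkage (UPGMA-style) clustering. Both the choice of clustering method and the assignment of the matrix entries to sequence pairs are assumptions made for illustration; the slide only shows the numbers and the resulting tree.

```python
def upgma(names, dist):
    """Average-linkage clustering; dist maps frozenset({a, b}) to a distance."""
    # Each cluster is (tree, leaves); a tree is a leaf name or a nested pair.
    clusters = [(name, [name]) for name in names]

    def average(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the two closest clusters.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: average(clusters[ab[0]], clusters[ab[1]]),
        )
        left, right = clusters[i], clusters[j]
        merged = ((left[0], right[0]), left[1] + right[1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Assumed reading of the triangular matrix on the slide.
distances = {
    frozenset(("X1", "X2")): 0.6,
    frozenset(("X1", "X3")): 0.8, frozenset(("X2", "X3")): 0.2,
    frozenset(("X1", "X4")): 0.3, frozenset(("X2", "X4")): 0.7,
    frozenset(("X3", "X4")): 0.5,
}
tree = upgma(["X1", "X2", "X3", "X4"], distances)
print(tree)
```

Under that reading, X2 and X3 are joined first, then X1 and X4, which matches the leaf grouping drawn on the later progressive-alignment slide.
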
- Slide 17
- Stage 1: Draft Progressive
- Slide 18
- Stage 1: Draft Progressive. Progressive alignment: a progressive alignment is built by following the branching order of the tree. This yields a multiple alignment of all input sequences at the root. [Figure: tree with leaves X1, X4, X2, X3; the root holds the alignment of X1, X2, X3, X4.]
- Slide 19
- Stage 2: Improved Progressive. Attempts to improve the tree and uses it to build a new progressive alignment. This stage may be iterated. [Figure: trees over X1, X2, X3, X4 and a triangular matrix over the same sequences.]
- Slide 20
- Stage 2: Improved Progressive. Similarity measure: similarity is calculated for each pair of sequences using fractional identity computed from their mutual alignment in the current multiple alignment. Example: from the multiple alignment `TCC--AA`, `TCA--GA`, `TCA--AA`, `G--ATAC`, `T--CTGC`, the pair `TCC--AA` and `TCA--AA` is extracted.
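
A minimal sketch of fractional identity for one pair of rows taken from the current multiple alignment. The convention used here (count only columns where both rows hold a residue) is an assumption; other denominators are possible and the slide does not specify one.

```python
def fractional_identity(row_a, row_b, gap="-"):
    """Fraction of identical residues among columns where neither row has a gap."""
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


print(fractional_identity("TCC--AA", "TCA--AA"))  # 0.8
```

For the pair on the slide, four of the five residue columns agree, giving 0.8.
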
- Slide 21
- Stage 2: Improved Progressive. Tree construction: a tree is constructed by computing a Kimura distance matrix and applying a clustering method to it. [Figure: triangular distance matrix over X1, X2, X3, X4 and the resulting tree.]
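
The Kimura distance is commonly computed from fractional identity with the correction $d = -\ln(1 - D - D^2/5)$, where $D$ is one minus the fractional identity. The sketch below assumes that form; the function name is hypothetical, and raising an error for very divergent pairs is a simplification rather than what any particular tool does.

```python
import math


def kimura_distance(identity):
    """Kimura-corrected distance from fractional identity (assumed formula)."""
    d = 1.0 - identity
    arg = 1.0 - d - d * d / 5.0
    if arg <= 0.0:
        # The logarithm is undefined once sequences are too divergent.
        raise ValueError("identity too low for the Kimura correction")
    return -math.log(arg)


print(round(kimura_distance(0.8), 4))  # 0.2332
```

The correction grows faster than the raw mismatch fraction, which compensates for multiple substitutions at the same site.
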
- Slide 22
- Stage 2: Improved Progressive. Tree comparison: the new tree is compared to the previous tree by identifying the set of internal nodes for which the branching order has changed.
- Slide 23
- Stage 2: Improved Progressive. Progressive alignment: a new progressive alignment is built. [Figure: new tree with leaves X2, X4, X1, X3; the root holds the new alignment.]
- Slide 24
- Stage 3: Refinement. Performs iterative refinement.
- Slide 25
- Stage 3: Refinement. Choice of bipartition: an edge is removed from the tree, dividing the sequences into two disjoint subsets. [Figure: tree over X1 to X5, shown before and after an edge is removed.]
- Slide 26
- Stage 3: Refinement. Profile extraction: the multiple alignment of each subset is extracted from the current multiple alignment. Columns made up of indels only are removed. Example: the alignment `TCC--AA`, `TCA--GA`, `TCA--AA`, `G--ATAC`, `T--CTGC` splits into the subset `TCC--AA`, `TCA--AA`, which becomes `TCCAA`, `TCAAA`, and the subset `TCA--GA`, `T--CTGC`, `G--ATAC`, which is unchanged.
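
Profile extraction amounts to keeping the rows of one subset and deleting columns that contain only gaps. A minimal sketch, with a hypothetical helper name, reproduces the example on this slide:

```python
def drop_gap_only_columns(rows, gap="-"):
    """Remove every column in which all rows have a gap."""
    columns = [col for col in zip(*rows) if any(ch != gap for ch in col)]
    return ["".join(chars) for chars in zip(*columns)]


print(drop_gap_only_columns(["TCC--AA", "TCA--AA"]))  # ['TCCAA', 'TCAAA']
print(drop_gap_only_columns(["TCA--GA", "T--CTGC", "G--ATAC"]))
```

The second subset has no all-gap column, so its rows come back unchanged.
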
- Slide 27
- Stage 3: Refinement. Re-alignment: the two profiles are then realigned with each other using profile-profile alignment. Example: the profiles `TCA--GA`, `T--CTGC`, `G--ATAC` and `TCCAA`, `TCAAA` are realigned to `TCA--GA`, `T--CTGC`, `G--ATAC`, `T--CCAA`, `T--CAAA`.
- Slide 28
- Stage 3: Refinement. Accept/Reject: the score of the new alignment is computed. If it is higher than the score of the old alignment, the new alignment is retained; otherwise it is discarded. Example: new alignment `T--CCAA`, `T--CAAA`, `TCA--GA`, `T--CTGC`, `G--ATAC` versus old alignment `TCC--AA`, `TCA--GA`, `TCA--AA`, `G--ATAC`, `T--CTGC`.
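
The accept/reject rule can be sketched as a comparison of two alignment scores. The sum-of-pairs scoring below (match +1, mismatch 0, residue against gap -1) is a stand-in chosen for illustration; the slides do not state which objective function is used, and both function names are hypothetical.

```python
from itertools import combinations


def sum_of_pairs(rows, match=1, mismatch=0, gap_cost=-1, gap="-"):
    """Toy sum-of-pairs score over all row pairs (scoring values are assumptions)."""
    total = 0
    for a, b in combinations(rows, 2):
        for p, q in zip(a, b):
            if p == gap and q == gap:
                continue  # gap against gap is not scored
            if p == gap or q == gap:
                total += gap_cost
            elif p == q:
                total += match
            else:
                total += mismatch
    return total


def accept_if_better(old_rows, new_rows, score=sum_of_pairs):
    """Keep the new alignment only if it scores strictly higher."""
    return new_rows if score(new_rows) > score(old_rows) else old_rows


print(accept_if_better(["A-", "AA"], ["AA", "AA"]))
```

Because a rejected candidate is simply dropped, the alignment score never decreases across refinement iterations.
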
- Slide 29
- MUSCLE Review: Performance. For alignment of N sequences of length L: space complexity $O(N^2 + L^2)$; time complexity $O(N^4 + NL^2)$; time complexity without refinement $O(N^3 + NL^2)$.
- Slide 30
- Outline: Introduction, Background, MUSCLE, ProbCons, Conclusion.
- Slide 31
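- Hidden Markov Models (HMMs). [Figure: pair HMM with states M, I, and J emitting from sequences X and Y. Example: `AGCC-AGC` aligned to `-GCCCAGT` corresponds to the state path `IMMMJMMM`.]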
- Slide 32
- Pairwise Alignment: Viterbi algorithm. Picks the alignment that is most likely to be the optimal alignment. However, the most likely alignment is not necessarily the most accurate. Alternative: find the alignment of maximum expected accuracy.
- Slide 33
- Lazy Teacher Analogy: 10 students take a 10-question true-false quiz. How do you make the answer key? Viterbi approach: use the answer sheet of the best student. MEA approach: weighted majority vote. [Figure: students' answers to question 4 (F, T, F, T) next to their letter grades, which range from A to C.]
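
The analogy can be turned into a small sketch that builds both answer keys. The sheets and weights are made-up illustration data (weights standing in for how much each student is trusted), and the function names are hypothetical.

```python
def best_student_key(sheets, weights):
    """Viterbi-style key: copy the sheet of the single most trusted student."""
    best = max(range(len(sheets)), key=lambda i: weights[i])
    return list(sheets[best])


def weighted_vote_key(sheets, weights):
    """MEA-style key: per question, take the answer with more total weight."""
    key = []
    for q in range(len(sheets[0])):
        true_weight = sum(w for s, w in zip(sheets, weights) if s[q])
        false_weight = sum(w for s, w in zip(sheets, weights) if not s[q])
        key.append(true_weight > false_weight)
    return key


sheets = [[True, False], [True, True], [False, True]]  # made-up answers
weights = [0.9, 0.6, 0.5]                              # made-up trust levels
print(best_student_key(sheets, weights))   # [True, False]
print(weighted_vote_key(sheets, weights))  # [True, True]
```

On the second question the best student is outvoted by the other two, so the two keys differ. That is the same way a Viterbi alignment and an MEA alignment can disagree on individual columns.
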
- Slide 34
- Viterbi vs MEA. Viterbi: picks the alignment with the highest chance of being completely correct. Maximum Expected Accuracy: picks the alignment with the highest expected number of correct predictions.
- Slide 35
- ProbCons: Basic strategy. Uses Hidden Markov Models (HMMs) to predict the probability of an alignment. Uses Maximum Expected Accuracy instead of the Viterbi alignment. There are 5 steps.
- Slide 36
- Notation: Given N sequences $S = \{s_1, s_2, \ldots, s_N\}$, $a^*$ is the optimal alignment.
- Slide 37
- ProbCons. Step 1: computation of posterior-probability matrices. Step 2: computation of expected accuracies. Step 3: probabilistic consistency transformation. Step 4: computation of guide tree. Step 5: progressive alignment. Post-processing step: iterative refinement.
- Slide 38
- Step 1: Computation of posterior-probability matrices. For every pair of sequences $x, y \in S$, compute the matrix $P_{xy}$, where $P_{xy}(i, j) = P(x_i \sim y_j \in a^* \mid x, y)$ is the probability that $x_i$ and $y_j$ are paired in $a^*$.
- Slide 39
- Step 2: Computation of expected accuracies. For a pairwise alignment $a$ between $x$ and $y$, define the accuracy as $\mathrm{accuracy}(a, a^*) = \dfrac{\text{number of correctly predicted matches}}{\text{length of the shorter sequence}}$.
- Slide 40
- Step 2: Computation of expected accuracies (continued). The MEA alignment is found by finding the highest-summing path through the matrix $M_{xy}[i, j] = P(x_i \text{ is aligned to } y_j \mid x, y)$.
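
Finding the highest-summing path is a dynamic program of the same shape as ordinary pairwise alignment, with posterior probabilities as match scores and no gap cost. The sketch below assumes the matrix is given as a plain list of lists; the function name and the example values are made up for illustration.

```python
def mea_alignment(posterior):
    """Return (best path sum, matched index pairs) for a posterior matrix."""
    n = len(posterior)
    m = len(posterior[0]) if posterior else 0
    # best[i][j] = highest sum using the first i residues of x and first j of y
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + posterior[i - 1][j - 1],  # pair x_i with y_j
                best[i - 1][j],                                # leave x_i unpaired
                best[i][j - 1],                                # leave y_j unpaired
            )
    # Trace back, taking the diagonal only when skipping cannot explain the value.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j]:
            i -= 1
        elif best[i][j] == best[i][j - 1]:
            j -= 1
        else:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
    pairs.reverse()
    return best[n][m], pairs


total, pairs = mea_alignment([[0.9, 0.1], [0.1, 0.8]])  # made-up posteriors
print(total, pairs)
```

Dividing the returned sum by the length of the shorter sequence gives the expected accuracy under the definition used in this step.
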
- Slide 41
- Consistency. [Figure: sequences x, y, and z with residues $x_i$, $y_j$, and $z_k$ linked across the three sequences.]
- Slide 42
- Step 3: Probabilistic consistency transformation. Re-estimate the match quality scores $P(x_i \sim y_j \in a^* \mid x, y)$ by applying the probabilistic consistency transformation, which incorporates the similarity of x and y to other sequences from S into the x-y comparison: $P(x_i \sim y_j \in a^* \mid x, y) \to P(x_i \sim y_j \in a^* \mid x, y, z)$.
- Slide 43
- Step 3: Probabilistic consistency transformation (continued)
- Slide 44
- Step 3: Probabilistic consistency transformation (continued). Since most of the values of $P_{xz}$ and $P_{zy}$ will be very small, we ignore all entries whose value is smaller than some threshold $w$. Use sparse matrix multiplication. May be repeated.
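
A minimal sketch of the transformation for one pair (x, y), assuming the form given in the ProbCons paper: the new matrix is the average over all z in S of the product $P_{xz} P_{zy}$, with $P_{xx}$ and $P_{yy}$ treated as identity matrices. The function name, the `cutoff` parameter (playing the role of the threshold $w$), and the example numbers are illustrative assumptions. Dense lists are used for brevity, and small entries are skipped rather than stored sparsely.

```python
def consistency_transform(p_xy, others, cutoff=0.01):
    """Re-estimate p_xy using third sequences.

    p_xy:   posterior matrix for (x, y) as a list of lists
    others: list of (p_xz, p_zy) matrix pairs, one per third sequence z
    """
    rows = len(p_xy)
    n_seqs = len(others) + 2  # x, y, and every z
    # z = x and z = y each contribute p_xy itself (identity assumption).
    out = [[2.0 * v for v in row] for row in p_xy]
    for p_xz, p_zy in others:
        for i in range(rows):
            for k, a in enumerate(p_xz[i]):
                if a < cutoff:
                    continue  # ignore entries below the threshold
                for j, b in enumerate(p_zy[k]):
                    if b >= cutoff:
                        out[i][j] += a * b
    return [[v / n_seqs for v in row] for row in out]


# One residue per sequence, made-up probabilities.
print(consistency_transform([[0.5]], [([[0.8]], [[0.5]])]))
```

Skipping entries below the cutoff is what keeps each matrix product cheap. Applying the function again to its own outputs corresponds to repeating the transformation.
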
- Slide 45
- Step 4: Computation of guide tree. Use E(x, y) as a measure of similarity. Define the similarity of two clusters by the sum-of-pairs. [Figure: triangular similarity matrix over X1, X2, X3, X4 and the resulting tree.]
- Slide 46
- Step 5: Progressive alignment. Align sequence groups hierarchically according to the order specified in the guide tree. Alignments are scored using the sum-of-pairs scoring function. Aligned residues are scored according to the match quality scores $P(x_i \sim y_j \in a^* \mid x, y)$. Gap penalties are set to 0.
- Slide 47
- Post-processing step: iterative refinement. Much like in MUSCLE: randomly partition the alignment into two groups of sequences and realign. May be repeated. [Figure: tree over X1 to X5 and its split into two groups.]
- Sli