1 Protein Multiple Alignment by Konstantin Davydov

  • date post

    20-Dec-2015
  • Slide 1
  • 1 Protein Multiple Alignment by Konstantin Davydov
  • Slide 2
  • 2 Papers. (1) "MUSCLE: a multiple sequence alignment method with reduced time and space complexity" by Robert C. Edgar. (2) "ProbCons: Probabilistic Consistency-based Multiple Sequence Alignment" by Chuong B. Do, Mahathi S. P. Mahabhashyam, Michael Brudno, and Serafim Batzoglou.
  • Slide 3
  • 3 Outline: Introduction, Background, MUSCLE, ProbCons, Conclusion
  • Slide 4
  • 4 Introduction. What is multiple protein alignment? Given N sequences of amino acids $x_1, x_2, \ldots, x_N$, insert gaps in each of the $x_i$ so that all sequences have the same length and the score of the global map is maximum. Example: ACCTGCA and ACTTCAA become ACCTGCA-- and AC--TTCAA.
  • Slide 5
  • 5 Introduction. Motivation: phylogenetic tree estimation; secondary structure prediction; identification of critical regions.
  • Slide 6
  • 6 Outline: Introduction, Background, MUSCLE, ProbCons, Conclusion
  • Slide 7
  • 7 Background. Aligning two sequences: ACCTGCA and ACTTCAA become ACCTGCA-- and AC--TTCAA.
  • Slide 8
  • 8 Background. [Figure: the sequences AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA and AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC, with axes labeled x, y, z]
  • Slide 9
  • 9 Background. Unfortunately, this can get very expensive. Aligning N sequences of length L requires a matrix of size $L^N$, where each cell in the matrix has $2^N - 1$ neighbors. This gives a total time complexity of $O(2^N L^N)$.
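To give a feel for how quickly this grows, here is a minimal sketch that simply evaluates the count from the slide. The function name `exact_dp_cost` is hypothetical, and the count ignores constant factors.

```python
def exact_dp_cost(num_sequences: int, length: int) -> int:
    """Cell updates for exact dynamic programming:
    L^N cells, each looking at 2^N - 1 neighbouring cells."""
    return (2 ** num_sequences - 1) * length ** num_sequences


# Cost for sequences of length 100 as the number of sequences grows.
for n in (2, 3, 4, 5):
    print(n, exact_dp_cost(n, 100))
```

Even at five sequences of length 100 the count is already in the hundreds of billions, which is why the heuristics on the following slides are used instead.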
  • Slide 10
  • 10 Outline: Introduction, Background, MUSCLE, ProbCons, Conclusion
  • Slide 11
  • 11 MUSCLE
  • Slide 12
  • 12 MUSCLE. Basic strategy: a progressive alignment is built, to which horizontal refinement is applied. There are three stages. At the end of each stage, a multiple alignment is available and the algorithm can be terminated.
  • Slide 13
  • 13 Three Stages: (1) Draft Progressive, (2) Improved Progressive, (3) Refinement
  • Slide 14
  • 14 Stage 1: Draft Progressive. Similarity measure: calculated using k-mer counting. Example: in ACCATGCGAATGGTCCACAATG, the k-mer ATG scores 3 and the k-mer CCA scores 2.
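The k-mer counts on this slide can be reproduced with a few lines of Python. This is a minimal sketch: `kmer_counts` and `shared_kmers` are hypothetical helper names, and `shared_kmers` is only one simple way to turn counts into a similarity, not necessarily the measure MUSCLE itself uses.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every (overlapping) substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_kmers(a: str, b: str, k: int) -> int:
    """Assumed similarity: number of k-mer occurrences two sequences share."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sum(min(ca[m], cb[m]) for m in ca.keys() & cb.keys())


counts = kmer_counts("ACCATGCGAATGGTCCACAATG", 3)
print(counts["ATG"], counts["CCA"])  # prints: 3 2
```

Counting k-mers needs no alignment at all, which is what makes it cheap enough for a first draft.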
  • Slide 15
  • 15 Stage 1: Draft Progressive. Distance estimate: based on the similarities, construct a triangular distance matrix. Example for sequences X1 to X4: d(X2, X1) = 0.6; d(X3, X1) = 0.8, d(X3, X2) = 0.2; d(X4, X1) = 0.3, d(X4, X2) = 0.7, d(X4, X3) = 0.5.
  • Slide 16
  • 16 Stage 1: Draft Progressive. Tree construction: from the distance matrix we construct a tree. [Figure: the distance matrix over X1 to X4 and the tree built from it, shown step by step]
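The slide does not name the clustering method. As a hedged illustration, the sketch below assumes average-linkage clustering (UPGMA) and assumes the rows and columns of the example matrix are ordered X1 to X4. The function name `upgma` and the nested-tuple tree representation are choices made for this example.

```python
from itertools import combinations


def upgma(labels, dist):
    """Average-linkage clustering (an assumption; the slide names no method).
    dist maps frozenset({a, b}) -> distance. Returns the tree as nested tuples."""
    clusters = {label: (label, 1) for label in labels}  # name -> (subtree, size)
    d = dict(dist)
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        merged = f"({a},{b})"
        # Distance to the merged cluster is the size-weighted average.
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = ((tree_a, tree_b), size_a + size_b)
    tree, _ = next(iter(clusters.values()))
    return tree


# Distances read from the example matrix, assuming row/column order X1..X4.
example = {
    frozenset(("X1", "X2")): 0.6, frozenset(("X1", "X3")): 0.8,
    frozenset(("X2", "X3")): 0.2, frozenset(("X1", "X4")): 0.3,
    frozenset(("X2", "X4")): 0.7, frozenset(("X3", "X4")): 0.5,
}
print(upgma(["X1", "X2", "X3", "X4"], example))
```

Under these assumptions X2 joins X3 first (distance 0.2), then X1 joins X4 (distance 0.3), and the two pairs meet at the root.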
  • Slide 17
  • 17 Stage 1: Draft Progressive
  • Slide 18
  • 18 Stage 1: Draft Progressive. Progressive alignment: a progressive alignment is built by following the branching order of the tree. This yields a multiple alignment of all input sequences at the root. [Figure: tree over X1, X4, X2, X3 whose root holds the alignment of X1, X2, X3, X4]
  • Slide 19
  • 19 Stage 2: Improved Progressive. Attempts to improve the tree and uses it to build a new progressive alignment. This stage may be iterated. [Figure: tree and distance matrix over X1 to X4]
  • Slide 20
  • 20 Stage 2: Improved Progressive. Similarity measure: similarity is calculated for each pair of sequences using fractional identity computed from their mutual alignment in the current multiple alignment. Example: from the multiple alignment TCC--AA, TCA--GA, TCA--AA, G--ATAC, T--CTGC, the pair TCC--AA and TCA--AA is compared.
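Fractional identity for the example pair can be computed as below. This is a minimal sketch that assumes the denominator is the number of columns in which both sequences have a residue; other conventions exist, and the name `fractional_identity` is hypothetical.

```python
def fractional_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Fraction of identical residues among columns where neither row has a gap."""
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


# The pair from the slide: five shared columns, four of them identical.
print(fractional_identity("TCC--AA", "TCA--AA"))  # prints: 0.8
```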
  • Slide 21
  • 21 Stage 2: Improved Progressive. Tree construction: a tree is constructed by computing a Kimura distance matrix and applying a clustering method to it. [Figure: distance matrix and tree over X1 to X4]
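As a sketch of the distance step, the code below assumes the standard Kimura correction for proteins, $d = -\ln(1 - D - D^2/5)$, where $D$ is one minus the fractional identity. The function name and the handling of very divergent pairs (returning infinity) are assumptions of this example.

```python
import math


def kimura_distance(identity: float) -> float:
    """Kimura-corrected distance from a fractional identity in [0, 1]."""
    d = 1.0 - identity                 # observed fraction of differing residues
    arg = 1.0 - d - d * d / 5.0
    if arg <= 0.0:
        return math.inf                # correction undefined for very divergent pairs
    return -math.log(arg)


print(round(kimura_distance(0.8), 4))
```

The correction stretches large observed differences, since multiple substitutions at one site are otherwise undercounted.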
  • Slide 22
  • 22 Stage 2: Improved Progressive. Tree comparison: the new tree is compared to the previous tree by identifying the set of internal nodes for which the branching order has changed.
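One simple way to approximate this comparison is to describe each internal node by the set of leaves beneath it and flag the nodes of the new tree whose leaf set did not occur in the old tree. This is a simplified sketch, not necessarily MUSCLE's exact procedure; trees are nested tuples of sequence names, and `clades` and `changed_nodes` are hypothetical helpers.

```python
def clades(tree):
    """Return (leaf set of tree, set of leaf sets of all its internal nodes)."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves, found = frozenset(), set()
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found |= child_found
    found.add(leaves)
    return leaves, found


def changed_nodes(old_tree, new_tree):
    """Internal nodes of new_tree (as leaf sets) that are absent from old_tree."""
    return clades(new_tree)[1] - clades(old_tree)[1]


old = (("X1", "X4"), ("X2", "X3"))
new = (("X2", "X4"), ("X1", "X3"))
print(changed_nodes(old, new))
```

Only the subtrees rooted at changed nodes need to be realigned; unchanged subtrees can reuse their existing alignments.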
  • Slide 23
  • 23 Stage 2: Improved Progressive. Progressive alignment: a new progressive alignment is built. [Figure: tree over X2, X4, X1, X3 whose root holds the new alignment]
  • Slide 24
  • 24 Stage 3: Refinement. Performs iterative refinement.
  • Slide 25
  • 25 Stage 3: Refinement. Choice of bipartition: an edge is removed from the tree, dividing the sequences into two disjoint subsets. [Figure: tree over X1 to X5 before and after an edge is removed]
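The bipartitions available to the refinement stage can be enumerated by walking the tree: cutting the edge above any node separates the leaves below it from all the others. The sketch below uses nested tuples for the tree; the example tree shape and the helper names are assumptions made for illustration.

```python
def leaves(tree):
    """Set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Yield (subset, complement) for the edge above every non-root node."""
    everything = leaves(tree)
    stack = list(tree) if not isinstance(tree, str) else []
    while stack:
        node = stack.pop()
        below = leaves(node)
        yield below, everything - below
        if not isinstance(node, str):
            stack.extend(node)


# Hypothetical five-sequence tree.
tree = (("X1", "X5"), ("X4", ("X2", "X3")))
for subset, rest in bipartitions(tree):
    print(sorted(subset), "|", sorted(rest))
```

In a rooted binary tree the two edges at the root describe the same split, so that bipartition appears twice in the output.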
  • Slide 26
  • 26 Stage 3: Refinement. Profile extraction: the multiple alignment of each subset is extracted from the current multiple alignment. Columns made up of indels only are removed. Example: the alignment TCC--AA, TCA--GA, TCA--AA, G--ATAC, T--CTGC is split into the subsets {TCC--AA, TCA--AA} and {TCA--GA, T--CTGC, G--ATAC}; removing the indel-only columns from the first subset gives TCCAA and TCAAA.
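Profile extraction amounts to selecting rows and dropping the columns that are gaps in every selected row. A minimal sketch using the sequences from the slide follows; the row labels `a` to `e` and the function name are hypothetical.

```python
def extract_profile(alignment: dict, subset) -> dict:
    """Keep only the rows in subset and remove columns that are all gaps."""
    rows = [alignment[name] for name in subset]
    keep = [i for i, col in enumerate(zip(*rows)) if any(c != "-" for c in col)]
    return {name: "".join(alignment[name][i] for i in keep) for name in subset}


# Rows from the slide; the labels a..e are placeholders.
alignment = {
    "a": "TCC--AA", "b": "TCA--GA", "c": "TCA--AA",
    "d": "G--ATAC", "e": "T--CTGC",
}
print(extract_profile(alignment, ["a", "c"]))       # gap-only columns removed
print(extract_profile(alignment, ["b", "d", "e"]))  # no gap-only columns here
```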
  • Slide 27
  • 27 Stage 3: Refinement. Re-alignment: the two profiles are then realigned with each other using profile-profile alignment. Example: the profiles {TCA--GA, T--CTGC, G--ATAC} and {TCCAA, TCAAA} are realigned to TCA--GA, T--CTGC, G--ATAC, T--CCAA, T--CAAA.
  • Slide 28
  • 28 Stage 3: Refinement. Accept/reject: the score of the new alignment is computed. If it is higher than the score of the old alignment, the new alignment is retained; otherwise it is discarded. Example: the new alignment T--CCAA, T--CAAA, TCA--GA, T--CTGC, G--ATAC is compared with the old alignment TCC--AA, TCA--GA, TCA--AA, G--ATAC, T--CTGC.
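The accept/reject rule can be sketched with any alignment score. The example below assumes a toy sum-of-pairs score (match +1, mismatch 0, residue against gap -1); these values and the function names are illustrative assumptions, not MUSCLE's actual scoring function.

```python
from itertools import combinations


def sp_score(rows, match=1, mismatch=0, gap=-1):
    """Toy sum-of-pairs score over all pairs of rows (scoring values assumed)."""
    total = 0
    for a, b in combinations(rows, 2):
        for ca, cb in zip(a, b):
            if ca == "-" and cb == "-":
                continue                      # gap-gap columns are ignored
            if ca == "-" or cb == "-":
                total += gap
            else:
                total += match if ca == cb else mismatch
    return total


def accept_if_better(old, new, score=sp_score):
    """Keep the new alignment only if it scores strictly higher than the old one."""
    return new if score(new) > score(old) else old


old = ["AC-T", "A-CT"]
new = ["ACT", "ACT"]
print(accept_if_better(old, new))
```

Because a candidate is kept only when it improves the score, the score can never decrease from one refinement iteration to the next.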
  • Slide 29
  • 29 MUSCLE Review. Performance for alignment of N sequences of length L: space complexity $O(N^2 + L^2)$; time complexity $O(N^4 + NL^2)$; time complexity without refinement $O(N^3 + NL^2)$.
  • Slide 30
  • 30 Outline: Introduction, Background, MUSCLE, ProbCons, Conclusion
  • Slide 31
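  • 31 Hidden Markov Models (HMMs). Pair HMM with three states: M (emits an aligned pair of residues from x and y), I (emits a residue of x against a gap), and J (emits a residue of y against a gap). Example: x = AGCC-AGC, y = -GCCCAGT, state path IMMMJMMM.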
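The state path in the example follows directly from the alignment columns. Here is a minimal sketch, assuming M marks a residue pair, I a residue in x only, and J a residue in y only, which is the reading that matches the slide's example; the function name is hypothetical.

```python
def state_path(x_row: str, y_row: str) -> str:
    """Translate a pairwise alignment into its pair-HMM state sequence."""
    path = []
    for a, b in zip(x_row, y_row):
        if a != "-" and b != "-":
            path.append("M")   # both sequences emit a residue
        elif a != "-":
            path.append("I")   # residue in x, gap in y
        elif b != "-":
            path.append("J")   # gap in x, residue in y
        else:
            raise ValueError("a column cannot consist of two gaps")
    return "".join(path)


print(state_path("AGCC-AGC", "-GCCCAGT"))  # prints: IMMMJMMM
```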
  • Slide 32
  • 32 Pairwise Alignment. Viterbi algorithm: picks the alignment that is most likely to be the optimal alignment. However, the most likely alignment is not necessarily the most accurate. Alternative: find the alignment of maximum expected accuracy.
  • Slide 33
  • 33 Lazy Teacher Analogy. 10 students take a 10-question true-false quiz. How do you make the answer key? Viterbi approach: use the answer sheet of the best student. MEA approach: weighted majority vote. [Figure: students with grades from A to C giving answers such as "4. F" and "4. T"]
  • Slide 34
  • 34 Viterbi vs MEA. Viterbi: picks the alignment with the highest chance of being completely correct. Maximum Expected Accuracy: picks the alignment with the highest expected number of correct predictions.
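The lazy-teacher analogy can be made concrete with a toy quiz. In this sketch the answer sheets and the integer weights (standing in for how much each student is trusted) are invented for illustration, as are the function names.

```python
def best_student_key(sheets, weights):
    """Viterbi-style: copy the whole answer sheet of the most trusted student."""
    best = max(range(len(sheets)), key=lambda s: weights[s])
    return list(sheets[best])


def weighted_vote_key(sheets, weights):
    """MEA-style: decide each question separately by weighted majority."""
    key = []
    for q in range(len(sheets[0])):
        true_weight = sum(w for sheet, w in zip(sheets, weights) if sheet[q] == "T")
        false_weight = sum(weights) - true_weight
        key.append("T" if true_weight >= false_weight else "F")
    return key


# Three students, three questions; weights are made-up trust levels.
sheets = ["TFT", "FFT", "FTT"]
weights = [4, 3, 3]
print(best_student_key(sheets, weights))
print(weighted_vote_key(sheets, weights))
```

On question 1 the best student is outvoted by the other two, so the two keys differ. That is the same reason the Viterbi alignment and the MEA alignment can differ.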
  • Slide 35
  • 35 ProbCons. Basic strategy: uses Hidden Markov Models (HMMs) to predict the probability of an alignment, and uses the Maximum Expected Accuracy alignment instead of the Viterbi alignment. 5 steps.
  • Slide 36
  • 36 Notation. Given N sequences $S = \{s_1, s_2, \ldots, s_N\}$, $a^*$ is the optimal alignment.
  • Slide 37
  • 37 ProbCons. Step 1: Computation of posterior-probability matrices. Step 2: Computation of expected accuracies. Step 3: Probabilistic consistency transformation. Step 4: Computation of guide tree. Step 5: Progressive alignment. Post-processing step: Iterative refinement.
  • Slide 38
  • 38 Step 1: Computation of posterior-probability matrices. For every pair of sequences $x, y \in S$, compute the matrix $P_{xy}$, where $P_{xy}(i, j) = P(x_i \sim y_j \in a^* \mid x, y)$ is the probability that $x_i$ and $y_j$ are paired in $a^*$.
  • Slide 39
  • 39 Step 2: Computation of expected accuracies. For a pairwise alignment $a$ between $x$ and $y$, define the accuracy as $\mathrm{accuracy}(a, a^*) = \frac{\text{number of correctly predicted matches}}{\text{length of the shorter sequence}}$.
  • Slide 40
  • 40 Step 2: Computation of expected accuracies (continued). The MEA alignment is found by finding the highest-summing path through the matrix $M_{xy}[i, j] = P(x_i \text{ is aligned to } y_j \mid x, y)$.
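The highest-summing path can be found with a simple dynamic program that, in line with gap penalties of 0, gains the posterior probability for a match and nothing for a gap. This is a minimal sketch with a made-up 2x2 posterior matrix; the function name and the traceback tie-breaking are choices made for this example.

```python
def mea_alignment(P):
    """P[i][j]: posterior probability that x_i pairs with y_j.
    Returns (sum of posteriors on the best path, list of matched index pairs)."""
    n, m = len(P), len(P[0])
    A = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            A[i][j] = max(A[i - 1][j - 1] + P[i - 1][j - 1],  # match x_i with y_j
                          A[i - 1][j],                         # x_i against a gap
                          A[i][j - 1])                         # y_j against a gap
    # Trace back from the corner to recover the matched pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if A[i][j] == A[i - 1][j - 1] + P[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif A[i][j] == A[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return A[n][m], pairs[::-1]


# Made-up posteriors for two sequences of length 2.
score, pairs = mea_alignment([[0.75, 0.25], [0.125, 0.5]])
print(score, pairs)
```

Dividing the returned sum by the length of the shorter sequence gives the expected accuracy under the definition on the previous slide.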
  • Slide 41
  • 41 Consistency. [Figure: sequences x, y, z with residues $x_i$, $y_j$, and $z_k$]
  • Slide 42
  • 42 Step 3: Probabilistic consistency transformation. Re-estimate the match quality scores $P(x_i \sim y_j \in a^* \mid x, y)$ by applying the probabilistic consistency transformation, which incorporates the similarity of x and y to the other sequences from S into the x-y comparison: $P(x_i \sim y_j \in a^* \mid x, y) \leftarrow P(x_i \sim y_j \in a^* \mid x, y, z)$.
  • Slide 43
  • 43 Step 3: Probabilistic consistency transformation (continued)
  • Slide 44
  • 44 Step 3: Probabilistic consistency transformation (continued). Since most of the values of $P_{xz}$ and $P_{zy}$ will be very small, we ignore all entries whose value is smaller than some threshold w. Use sparse matrix multiplication. The transformation may be repeated.
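A minimal sketch of the transformation follows, assuming the matrix form given in the ProbCons paper, $P'_{xy} = \frac{1}{|S|} \sum_{z \in S} P_{xz} P_{zy}$, with $P_{xx}$ taken to be the identity matrix. The function names, the dictionary layout, and the default threshold value are assumptions of this example, and the matrices are plain nested lists rather than a real sparse format.

```python
def sparse_product(A, B, threshold):
    """Matrix product that skips entries smaller than threshold."""
    rows, inner, cols = len(A), len(B), len(B[0])
    C = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for k in range(inner):
            if A[i][k] < threshold:
                continue
            for j in range(cols):
                if B[k][j] >= threshold:
                    C[i][j] += A[i][k] * B[k][j]
    return C


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def consistency_transform(P, lengths, threshold=0.01):
    """P[(x, y)]: posterior matrix for every ordered pair x != y.
    lengths[x]: length of sequence x. Returns the re-estimated matrices."""
    names = list(lengths)

    def matrix(a, b):
        return identity(lengths[a]) if a == b else P[(a, b)]

    updated = {}
    for x in names:
        for y in names:
            if x == y:
                continue
            total = [[0.0] * lengths[y] for _ in range(lengths[x])]
            for z in names:  # includes z = x and z = y via the identity matrix
                product = sparse_product(matrix(x, z), matrix(z, y), threshold)
                for i, row in enumerate(product):
                    for j, value in enumerate(row):
                        total[i][j] += value
            updated[(x, y)] = [[v / len(names) for v in row] for row in total]
    return updated
```

With three one-residue sequences where $P_{xy} = 0.6$, $P_{xz} = 0.9$, and $P_{zy} = 0.8$, the re-estimated value for x and y is $(0.6 + 0.6 + 0.72)/3 = 0.64$: strong agreement through z raises the x-y score.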
  • Slide 45
  • 45 Step 4: Computation of guide tree. Use E(x, y) as a measure of similarity. Define the similarity of two clusters by the sum-of-pairs. [Figure: similarity matrix and tree over X1 to X4]
  • Slide 46
  • 46 Step 5: Progressive alignment. Align sequence groups hierarchically according to the order specified in the guide tree. Alignments are scored using the sum-of-pairs scoring function. Aligned residues are scored according to the match quality scores $P(x_i \sim y_j \in a^* \mid x, y)$. Gap penalties are set to 0.
  • Slide 47
  • 47 Post-processing step: iterative refinement. Much like in MUSCLE: randomly partition the alignment into two groups of sequences and realign. May be repeated. [Figure: sequences X1 to X5 split into two groups]
  • Sli