Multiple Sequence Alignment
With thanks to Eric Stone and Steffen Heber, North Carolina State University
Definition: Multiple sequence alignment
Alignment:
ATTTG-
ATTTGC
AT-TGC
No alignment (rows differ in length):
ATTTG
ATTTGC
ATT-GC
No alignment (one column contains only gaps):
ATTT-G-
ATTT-GC
AT-T-GC
• Given a set of sequences, a multiple sequence alignment is an assignment of gap characters such that
– the resulting sequences have the same length
– no column contains only gaps
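The two conditions of the definition can be checked mechanically. The following is a minimal sketch; the function name `is_valid_msa` is an illustrative choice, and the three inputs are the example blocks from this slide.

```python
def is_valid_msa(rows, gap="-"):
    """Return True if the gapped rows satisfy both conditions of the definition."""
    if not rows:
        return False
    # Condition 1: all gapped sequences have the same length.
    if len({len(r) for r in rows}) != 1:
        return False
    # Condition 2: no column consists only of gap characters.
    return all(any(c != gap for c in col) for col in zip(*rows))


print(is_valid_msa(["ATTTG-", "ATTTGC", "AT-TGC"]))     # True
print(is_valid_msa(["ATTTG", "ATTTGC", "ATT-GC"]))      # False: unequal lengths
print(is_valid_msa(["ATTT-G-", "ATTT-GC", "AT-T-GC"]))  # False: all-gap column
```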
Application: Characterize protein families
Application: Discover conserved pattern
• A faint similarity between two sequences may become detectable if it is present in many sequences
Application: Recover phylogenetic tree
Pig   ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
Dog   --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
Human KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
Mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
***. ::: .: .. . : . . * . *: *
[Figure: phylogenetic tree with leaves pig, dog, human, mouse]
Multiple sequence alignment (MSA)
gi|89271|pir||A39486           PAKFKMKYWGVASFLQKGNDDHWIIDTDYDTYAAQYSCRLQNLDGTCADS
gi|132403|sp|P18902|RETB_BOVIN PAKFKMKYWGVASFLQKGNDDHWIIDTDYETFAVQYSCRLLNLDGTCADS
gi|5803139|ref|NP_006735.1|    PAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADS
gi|6174963|sp|Q00724|RETB_MOUS PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS
gi|132407|sp|P04916|RETB_RAT   PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS
• Generalizes pairwise sequence alignment (PSA)
– Multiple simply means three or more sequences
• In PSA, paired residues assumed to be homologous
– In MSA, columns of residues assumed to be homologous
• Are there fundamental differences between MSA and PSA?
– What distinguishes MSA from PSA?
Biological distinction
• Homology (common ancestry) in PSA is symmetric
• Phylogeny renders homology in MSA asymmetric
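[Figure: for the pair …TTACG… / …TTGCG…, A aligned with G is the same as G aligned with A (=); for the four sequences …TTACG…, …TTACG…, …TTGCG…, …TTGCG… placed on a tree, an A→G change and a G→A change are different events (≠)]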
MSA makes a statement about homology
• Like pairwise sequence alignment
– Multiple sequence alignment asserts the homology of its columns
• Unlike pairwise sequence alignment, interpretation requires a phylogeny
[Figure: tree with ancestral sequence NYLS and descendant sequences NYLS and NFLS]
Statistical distinction
[Figure: residues a and b under the match model M and the random model R]
$$\frac{\Pr(a, b \mid M)}{\Pr(a, b \mid R)} = \frac{p_{ab}}{q_a q_b}$$
• PSA compares a "match model" M to a "random model" R
• In MSA, how do we define M without phylogeny?
• How were PSA match probabilities $p_{ab}$ obtained?
– Can something similar be done for MSA?
Computational distinction
• Even for PSA, exhaustive search is exponential
– But optimal PSA can be found by DP in O(mn)
• Clearly MSA has higher complexity than PSA
– But by how much?
• Is finding the optimal MSA still feasible?
[Figure: DP cell update] F(i, j) = max{ F(i-1, j-1) + s(xi, yj), F(i-1, j) - d, F(i, j-1) - d }
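The pairwise DP cell update on this slide (diagonal step scored by s(xi, yj), horizontal and vertical steps each costing d) can be written as a short O(mn) routine. This is a sketch only: the scoring function `simple_score` and the gap cost used in the example are illustrative assumptions, not values from the slides.

```python
def needleman_wunsch_score(x, y, s, d):
    """Optimal global alignment score of x and y with linear gap cost d."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    # Boundary: a prefix aligned entirely against gaps.
    for i in range(1, m + 1):
        F[i][0] = -i * d
    for j in range(1, n + 1):
        F[0][j] = -j * d
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # align x_i with y_j
                F[i - 1][j] - d,                          # x_i against a gap
                F[i][j - 1] - d,                          # y_j against a gap
            )
    return F[m][n]


def simple_score(a, b):
    """Hypothetical substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


print(needleman_wunsch_score("ACGT", "AGT", simple_score, 1))  # 2
```

The table has (m+1)(n+1) cells and each cell inspects three neighbors, which is where the O(mn) bound comes from.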
The problem of global MSA
• Given k sequences: x1, x2,…, xk
• Insert gaps (-) in each sequence xi such that
– All sequences have the same length L
– Score of the global alignment is maximum
• MSA is more sensitive than PSA
– A faint similarity between two sequences may become detectable if present in many
• More sequences may increase alignment quality
– But the cost is added complexity
How can alignment columns be scored?
• In pairwise sequence alignment
– Scores quantify the exchangeability of the residues/gap in the pair
• In multiple sequence alignment
– A similar treatment is more complicated and requires use of phylogeny
• One solution:
– Evaluate MSA through its constituent PSAs
• These PSAs are called induced pairwise alignments
Induced pairwise alignments
• Example MSA:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
• Induces three PSAs:
x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG
• The MSA can be scored by summing over the induced PSAs
– This is called the “Sum-of-pairs” approach
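An induced pairwise alignment is obtained by keeping two rows of the MSA and deleting every column in which both rows carry a gap. A minimal sketch of that projection, applied to the example MSA of this slide (the function name `induced_pairwise` is an illustrative choice):

```python
from itertools import combinations


def induced_pairwise(msa, gap="-"):
    """Project an MSA (dict: name -> gapped row) onto every pair of rows,
    dropping the columns in which both rows carry a gap."""
    result = {}
    for a, b in combinations(msa, 2):
        cols = [(p, q) for p, q in zip(msa[a], msa[b]) if (p, q) != (gap, gap)]
        result[(a, b)] = (
            "".join(p for p, _ in cols),
            "".join(q for _, q in cols),
        )
    return result


msa = {"x": "AC-GCGG-C", "y": "AC-GC-GAG", "z": "GCCGC-GAG"}
pairs = induced_pairwise(msa)
for (a, b), (row_a, row_b) in pairs.items():
    print(f"{a}: {row_a}\n{b}: {row_b}\n")
```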
Example: Sum-of-pairs score
Scores (BLOSUM 60):
   F   Y   G   D
F  5  -2  -2  -1
Y      7   1  -5
G          4  -3
D              5
Gap penalty: -8
MSA:
x: F-G
y: F-G
z: FYD
Induced PSAs and their scores:
x: FG  / y: FG:   5 + 4 = 9
x: F-G / z: FYD:  5 - 8 - 3 = -6
y: F-G / z: FYD:  5 - 8 - 3 = -6
• Sum-of-pairs score: 9 - 6 - 6 = -3
– What is the computational complexity?
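The worked example can be reproduced column by column. The sketch below assumes the score table on this slide is read as an upper-triangular symmetric matrix and that a gap paired with a gap contributes nothing (such columns vanish from the induced PSA); the names `pair_score` and `sum_of_pairs` are illustrative.

```python
from itertools import combinations

# Substitution scores as read from the slide (assumed symmetric).
SCORES = {
    ("F", "F"): 5, ("F", "Y"): -2, ("F", "G"): -2, ("F", "D"): -1,
    ("Y", "Y"): 7, ("Y", "G"): 1, ("Y", "D"): -5,
    ("G", "G"): 4, ("G", "D"): -3,
    ("D", "D"): 5,
}
GAP = -8


def pair_score(a, b):
    """Score one pair of symbols taken from the same column."""
    if a == "-" and b == "-":
        return 0  # column is absent from the induced pairwise alignment
    if a == "-" or b == "-":
        return GAP
    return SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]


def sum_of_pairs(rows):
    """Sum pair scores over all columns and all pairs of rows."""
    return sum(
        pair_score(a, b)
        for col in zip(*rows)
        for a, b in combinations(col, 2)
    )


print(sum_of_pairs(["F-G", "F-G", "FYD"]))  # -3
```

For k rows of length L this evaluates k(k-1)/2 pairs per column, so scoring a given MSA costs O(k^2 L).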
Problem: Finding the optimal MSA
• Given k sequences: x1, x2,…, xk
• Sum-of-pairs score for any MSA of k sequences is
– Sum of scores of all k(k-1)/2 induced PSAs
• Seek the MSA A which maximizes the sum-of-pairs score $S(A) = \sum_{i<j} S(A_{ij})$, where $S(A_{ij})$ is the score of $A_{ij}$, the PSA of sequences $x_i$ and $x_j$ induced by the MSA A
• Clearly exhaustive search is not an option
– Can we rely on dynamic programming?
Dynamic Programming
• Similar to pairwise alignments, multiple sequence alignments can be computed by dynamic programming
[Figure: 2D DP matrix F for two sequences next to a 3D DP lattice for three sequences]
Generalized Needleman-Wunsch
• Given 3 sequences x, y, and z
• Main iteration loop:
F(i,j,k) = max{
F(i-1, j-1, k-1) + S(xi, yj, zk),
F(i-1, j-1, k  ) + S(xi, yj, - ),
F(i-1, j  , k-1) + S(xi, - , zk),
F(i-1, j  , k  ) + S(xi, - , - ),
F(i  , j-1, k-1) + S(- , yj, zk),
F(i  , j-1, k  ) + S(- , yj, - ),
F(i  , j  , k-1) + S(- , - , zk) }
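The seven cases of the recurrence correspond to the seven non-zero 0/1 step vectors in three dimensions, which makes a compact implementation possible. The following is a sketch, not a reference implementation: the column score `sp_column` uses hypothetical values (match +1, mismatch -1, residue against gap -2, gap against gap 0), and only the optimal score is returned, without traceback.

```python
from itertools import combinations, product


def sp_column(column, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of one column (hypothetical scoring values)."""
    total = 0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue
        total += gap if "-" in (a, b) else (match if a == b else mismatch)
    return total


def align3_score(x, y, z, column_score=sp_column):
    """Optimal sum-of-pairs score for three sequences by 3D DP."""
    seqs = (x, y, z)
    # Seven moves: every 0/1 step vector except (0, 0, 0).
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    F = {(0, 0, 0): 0}
    # Lexicographic order guarantees predecessors are filled in first.
    for cell in product(*(range(len(s) + 1) for s in seqs)):
        if cell == (0, 0, 0):
            continue
        best = None
        for m in moves:
            prev = tuple(c - step for c, step in zip(cell, m))
            if min(prev) < 0:
                continue
            # A step of 1 consumes the next residue; a step of 0 emits a gap.
            column = tuple(
                seq[p] if step else "-" for seq, p, step in zip(seqs, prev, m)
            )
            cand = F[prev] + column_score(column)
            if best is None or cand > best:
                best = cand
        F[cell] = best
    return F[tuple(len(s) for s in seqs)]


print(align3_score("AC", "A", "AC"))  # 0
```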
Analysis of algorithm
• Given k sequences of length n:
– Space for matrix: $O(n^k)$
– Neighbors/cell: $2^k - 1$
– Time to compute SP score: $O(k^2)$
– Overall runtime: $O(k^2 2^k n^k)$
• Implications
– Can align about 7 relatively short (length 200 - 300) sequences in a reasonable amount of time
• $2^7 \cdot 200^7 > 1{,}600{,}000{,}000{,}000{,}000{,}000$
– Exact optimality is generally not attainable
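The magnitude quoted on this slide is easy to verify with exact integer arithmetic. A small sketch (the function names are illustrative; `lattice_steps` uses the slide's rounding of 2^k - 1 neighbors to 2^k):

```python
def lattice_steps(n, k):
    """Cells times neighbors for k sequences of length n: 2^k * n^k."""
    return 2 ** k * n ** k


def exact_msa_operations(n, k):
    """Overall operation count k^2 * 2^k * n^k, including SP scoring."""
    return k ** 2 * lattice_steps(n, k)


print(f"{lattice_steps(200, 7):,}")         # 1,638,400,000,000,000,000
print(f"{exact_msa_operations(200, 7):,}")  # 80,281,600,000,000,000,000
```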
Heuristics for multiple sequence alignment
• Exact optimality is too slow, even by dynamic programming
• Alternative:
– Seek good suboptimal solutions that are attainable in reasonable time
• Key questions:
– What is a “good” suboptimal solution?
– What is “reasonable” time?
• Heuristics focus on an intelligent reduction of search space
– Divide-and-conquer alignment
– Greedy alignment (progressive)
Divide-and-conquer alignment (DCA)
• Idea: Reduce search space for dynamic programming by cutting the sequences.
• Algorithm:
1. Cut sequences into fragments until fragments can be aligned by DP
2. Build multiple alignments by DP
3. Concatenate the resulting alignments
Reduction of search space
[Figure: three-dimensional DP lattice with axes Sequence 1, Sequence 2, Sequence 3]
Cut points optimize: $C = S_{\text{prefix}} + S_{\text{suffix}} - S_{\text{complete}}$
Greater reduction of search space
Progressive alignment
• Idea:
– Build multiple sequence alignment from a series of pairwise alignments
• Strategy:
– Choose two sequences to align (optimally)
– Hold pairwise alignment fixed, treat as a new sequence, and iterate
• For n sequences:
– Requires n - 1 pairwise sequence alignments
• Does the order matter?
– What criteria are used to choose the sequences?
Guide tree
• Binary tree
– Leaves correspond to sequences
– Internal nodes represent alignments
– Root corresponds to final MSA
• The guide tree specifies the order of alignment
• Usually constructed from matrix of pairwise distances between sequences
[Figure: guide tree with leaves ATC, ATG, TCG, TCC; internal nodes hold the alignments ATC/ATG and TCG/TCC; the root holds the final MSA ATC-, ATG-, -TCG, -TCC]
Simple approach to distance matrix D
• Example sequences:
A ACGCGTTGGGCGATGGCAAC
B ACGCGTTGGGCGACGGTAAT
C ACGCATTGAATGATGATAAT
D ACACATTGAGTGATAATAAT
• Simple approach
– Count mismatches
• Pairwise distance matrix:
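Counting mismatches between equal-length sequences is the Hamming distance. A minimal sketch that computes the pairwise distances for the four example sequences A to D (the names `SEQS`, `mismatch_distance`, and `D` are illustrative):

```python
from itertools import combinations

SEQS = {
    "A": "ACGCGTTGGGCGATGGCAAC",
    "B": "ACGCGTTGGGCGACGGTAAT",
    "C": "ACGCATTGAATGATGATAAT",
    "D": "ACACATTGAGTGATAATAAT",
}


def mismatch_distance(s, t):
    """Count positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(s, t))


D = {
    (p, q): mismatch_distance(SEQS[p], SEQS[q])
    for p, q in combinations(SEQS, 2)
}
for (p, q), dist in D.items():
    print(p, q, dist)
```

For these sequences the counts are A-B 3, A-C 7, A-D 8, B-C 6, B-D 7, and C-D 3.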
From pairwise distances to a tree
• Using this information, a tree can be drawn:
A ACGCGTTGGGCGATGGCAAC
B ACGCGTTGGGCGACGGTAAT
C ACGCATTGAATGATGATAAT
D ACACATTGAGTGATAATAAT
• Is it guaranteed that the distances exactly fit a tree?
[Figure: tree joining A with B and C with D; branch lengths 4, 1, 2, 2, 1]
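Whether a distance matrix fits a tree exactly can be tested with the four-point condition: for every quartet of taxa, the two largest of the three pairwise distance sums must be equal. The sketch below applies it to the mismatch counts of sequences A to D; the helper names are illustrative, and the perturbed matrix at the end is a made-up counterexample.

```python
from itertools import combinations

SEQS = {
    "A": "ACGCGTTGGGCGATGGCAAC",
    "B": "ACGCGTTGGGCGACGGTAAT",
    "C": "ACGCATTGAATGATGATAAT",
    "D": "ACACATTGAGTGATAATAAT",
}


def mismatches(s, t):
    return sum(a != b for a, b in zip(s, t))


def fits_tree(names, dist, tol=1e-9):
    """Four-point condition over every quartet of taxa."""
    def d(p, q):
        return dist[frozenset((p, q))]

    for a, b, c, e in combinations(names, 4):
        sums = sorted([d(a, b) + d(c, e), d(a, c) + d(b, e), d(a, e) + d(b, c)])
        if abs(sums[2] - sums[1]) > tol:  # the two largest sums must agree
            return False
    return True


dist = {
    frozenset((p, q)): mismatches(SEQS[p], SEQS[q])
    for p, q in combinations(SEQS, 2)
}
print(fits_tree(list(SEQS), dist))  # True: sums are 6, 14, 14

perturbed = dict(dist)
perturbed[frozenset(("A", "D"))] = 9  # hypothetical change, breaks additivity
print(fits_tree(list(SEQS), perturbed))  # False: sums are 6, 14, 15
```

In general there is no such guarantee; real distance data usually violate the condition, which is why tree-building methods fit a tree approximately.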
Guide tree
[Figure, four build steps: a guide tree over leaves 1 to 5 grows by adding internal nodes 6, 7, and 8 one at a time]
Progressive alignment
• Follow branching order of guide tree to build MSA
• Problem: We may have to align
– Two sequences
– A sequence and an alignment
– Two alignments
[Figure: guide tree with leaves 1 to 5]
How to align two alignments?
Idea: Dynamic programming; treat columns like single positions.
Example:
a = GTC
    GTA
b = GTT
    GTT
Align a[3] and b[3]:
GTC
GTA
GTT
GTT
Align a[3] and gap:
GT-C
GT-A
GTT-
GTT-
Align b[3] and gap:
GTC-
GTA-
GT-T
GT-T
Scoring an alignment of alignments
1 PEEKSAVTAL
2 GEEKAAVLAL
3 PADKTNVKAA
4 AADKTNVKAA
5 EGEWGLVLHV
6 AAEKTKIRSA
Score: $\frac{1}{8}\left[\, s(T,V) + s(T,I) + s(L,V) + s(L,I) + 2s(K,V) + 2s(K,I) \,\right]$
• Average over all possibilities, possibly weighted
– Generalizes PSA of two sequences to two profiles
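Scoring one column of the first alignment against one column of the second amounts to averaging s over every cross pair of residues. A minimal unweighted sketch, using column 8 of sequences 1 to 4 (T, L, K, K) and column 7 of sequences 5 and 6 (V, I); the scoring function `toy_s` is a placeholder assumption, since the slide does not fix a substitution matrix.

```python
def column_pair_score(col_a, col_b, s):
    """Average of s(a, b) over all residue pairs drawn from the two columns."""
    total = sum(s(a, b) for a in col_a for b in col_b)
    return total / (len(col_a) * len(col_b))


def toy_s(a, b):
    """Hypothetical score depending only on the first residue."""
    return {"T": 1, "L": 2, "K": 3}[a]


# 4 x 2 = 8 pairs, hence the factor 1/8 in the slide's formula.
print(column_pair_score("TLKK", "VI", toy_s))  # 2.25
```

Because K occurs twice in the first column, its terms are counted twice, which matches the factors of 2 in the formula.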
Complexity of progressive alignment
The time required to align k sequences of length n:
• For progressive alignment: $O(k^2 n^2)$
• Compare with dynamic programming: $O(k^2 2^k n^k)$
• BUT
– Is there any guarantee on the quality of the progressive MSA?
Progressive alignments with ClustalW
• Clustal is the most popular method of MSA
ClustalW: Overview
1. Compute pairwise alignments (DP)
2. Convert similarities into distances
3. Build guide tree from distances by Neighbor Joining
4. Align with respect to guide tree
[Figure: pipeline from pairwise alignments (1 + 2, 1 + 3, 1 + 4, 2 + 3, 2 + 4, 3 + 4) to distance matrix, to guide tree, to progressive alignment]
ClustalW server at EBI
http://www2.ebi.ac.uk/clustalw/
Input: Sequences in FASTA format
• ClustalW example:
– RBP protein sequences from five species:
– Human, mouse, rat, cow, pig
Generate all 10 PSAs
Input: Five RBP sequences
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 84
Sequences (1:3) Aligned. Score: 84
Sequences (1:4) Aligned. Score: 91
Sequences (1:5) Aligned. Score: 92
Sequences (2:3) Aligned. Score: 99
Sequences (2:4) Aligned. Score: 86
Sequences (2:5) Aligned. Score: 85
Sequences (3:4) Aligned. Score: 85
Sequences (3:5) Aligned. Score: 84
Sequences (4:5) Aligned. Score: 96
Best score is the alignment between rat and mouse
• Highest scoring alignment ⇒ closest sequences on guide tree
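Turning the ten pairwise scores into distances is a one-line transformation once a convention is chosen. The sketch below assumes the scores are percent identities and uses distance = 1 - score/100; that convention is an assumption here, not something the slide states.

```python
# Pairwise scores copied from the ClustalW log on this slide.
scores = {
    (1, 2): 84, (1, 3): 84, (1, 4): 91, (1, 5): 92, (2, 3): 99,
    (2, 4): 86, (2, 5): 85, (3, 4): 85, (3, 5): 84, (4, 5): 96,
}

# Assumed conversion: percent identity -> fractional distance.
distances = {pair: 1 - pct / 100 for pair, pct in scores.items()}

# The highest-scoring pair is the closest pair, joined first in the guide tree.
closest = min(distances, key=distances.get)
print(closest, round(distances[closest], 2))  # (2, 3) 0.01
```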
Use PSA scores to create guide tree
((gi|5803139|ref|NP_006735.1|:0.04284,(gi|6174963|sp|Q00724|RETB_MOUS:0.00075,gi|132407|sp|P04916|RETB_RAT:0.00423):0.10542):0.01900,gi|89271|pir||A39486:0.01924,gi|132403|sp|P18902|RETB_BOVIN:0.01902);
[Figure: guide tree with leaves Rat RBP, Mouse RBP, Pig RBP, Cow RBP, Human RBP]
Progressively align sequences
• Make an MSA based on the order in the guide tree
– Start with the two most closely related sequences
– Then add the next closest sequence
– Continue until all sequences are added to the MSA
• Rule: “once a gap, always a gap.”
[Figure: progressive MSA following a guide tree with leaves 1 to 5]
ClustalW: “once a gap, always a gap”
Closely related:
x: ACGCGGC
y: ACGC-GC
Distantly related:
x: ACGCGGC
y: ACTT-TC
• There are many possible ways to make an MSA
– Where gaps are added is a critical question
• In which case are gap locations most reliable?
• Gaps often added to the first two sequences
– To maintain the initial gap choices is to trust that those gaps are most believable
Output: MSA of 5 RBP sequences
CLUSTAL W (1.82) multiple sequence alignment
gi|89271|pir||A39486           MEWVWALVLLAALGSAQAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 50
gi|132403|sp|P18902|RETB_BOVIN ------------------ERDCRVSSFRVKENFDKARFAGTWYAMAKKDP 32
gi|5803139|ref|NP_006735.1|    MKWVWALLLLAAW--AAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 48
gi|6174963|sp|Q00724|RETB_MOUS MEWVWALVLLAALGGGSAERDCRVSSFRVKENFDKARFSGLWYAIAKKDP 50
gi|132407|sp|P04916|RETB_RAT   MEWVWALVLLAALGGGSAERDCRVSSFRVKENFDKARFSGLWYAIAKKDP 50
                                                 ********************:* ***:*****
gi|89271|pir||A39486           EGLFLQDNIVAEFSVDENGHMSATAKGRVRLLNNWDVCADMVGTFTDTED 100
gi|132403|sp|P18902|RETB_BOVIN EGLFLQDNIVAEFSVDENGHMSATAKGRVRLLNNWDVCADMVGTFTDTED 82
gi|5803139|ref|NP_006735.1|    EGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTED 98
gi|6174963|sp|Q00724|RETB_MOUS EGLFLQDNIIAEFSVDEKGHMSATAKGRVRLLSNWEVCADMVGTFTDTED 100
gi|132407|sp|P04916|RETB_RAT   EGLFLQDNIIAEFSVDEKGHMSATAKGRVRLLSNWEVCADMVGTFTDTED 100
                               *********:*******.*:************.**:**************
gi|89271|pir||A39486           PAKFKMKYWGVASFLQKGNDDHWIIDTDYDTYAAQYSCRLQNLDGTCADS 150
gi|132403|sp|P18902|RETB_BOVIN PAKFKMKYWGVASFLQKGNDDHWIIDTDYETFAVQYSCRLLNLDGTCADS 132
gi|5803139|ref|NP_006735.1|    PAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADS 148
gi|6174963|sp|Q00724|RETB_MOUS PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS 150
gi|132407|sp|P04916|RETB_RAT   PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS 150
                               ****************:*******:****:*:* ****** *********
* asterisks indicate identity in a column
ClustalW has sophisticated gap treatment
• Gap opening and extension penalties depend on:
– Weight matrix
– Sequence similarity, length, difference in sequence length
– Position of gaps and residues at gaps
• Motivation:
– If the positions of all secondary structure elements (α-helices, β-strands) were known in all or some of the sequences
– We could increase the gap penalties inside secondary structure elements and decrease them outside
– Forcing gaps to occur most often in loop regions
Sequence weighting in ClustalW
• Choose root such that the mean branch lengths on either side are equal
• For each sequence compute distance from root
• Adjust branches that are used several times
• Use distances as weight factors in SP score
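[Figure: guide tree. The root joins C (branch length 0.6) and an internal node (branch length 0.3); the internal node joins A (branch length 0.2) and B (branch length 0.1)]
wA = 0.2 + 0.3/2 = 0.35
wB = 0.1 + 0.3/2 = 0.25
wC = 0.6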
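The weights on this slide follow one rule: walk from the root to each leaf and add every branch length divided by the number of leaves below that branch. A minimal sketch of that rule; the nested-list tree encoding and the function names are assumptions made for this example.

```python
def count_leaves(node):
    """A node is a leaf name (str) or a list of (branch_length, child) pairs."""
    if isinstance(node, str):
        return 1
    return sum(count_leaves(child) for _, child in node)


def sequence_weights(node, inherited=0.0):
    """Share each branch length equally among the leaves below it."""
    if isinstance(node, str):
        return {node: inherited}
    weights = {}
    for length, child in node:
        share = inherited + length / count_leaves(child)
        weights.update(sequence_weights(child, share))
    return weights


# Tree from the slide: root -> C (0.6); root -> internal (0.3) -> A (0.2), B (0.1)
tree = [(0.6, "C"), (0.3, [(0.2, "A"), (0.1, "B")])]
weights = sequence_weights(tree)
print({name: round(w, 2) for name, w in sorted(weights.items())})
# {'A': 0.35, 'B': 0.25, 'C': 0.6}
```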
Weighted sum-of-pairs score
• Sequence pairs may be assigned weights to reduce the influence of very similar sequences on the alignment score
• This leads to a weighted sum-of-pairs score (WSP):
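$$WSP(m_i) = \sum_{k<l} w_{kl}\, s(m_i^k, m_i^l)$$
where $w_{kl}$ is the weight factor for the pair of sequences k and l, and $m_i^k$ is the entry of sequence k in column $m_i$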
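A weighted sum-of-pairs score for one column multiplies each pair score by the weight factor of that pair. The sketch below assumes the pair weight is the product of the two sequence weights and uses a hypothetical match/mismatch score; neither choice is specified on the slide.

```python
from itertools import combinations


def wsp_column(column, weights, s):
    """Weighted SP score of one column, with pair weight w_k * w_l (assumed)."""
    return sum(
        weights[k] * weights[l] * s(column[k], column[l])
        for k, l in combinations(range(len(column)), 2)
    )


def match_score(a, b):
    """Hypothetical score: +1 for identical residues, -1 otherwise."""
    return 1 if a == b else -1


# Column with residues A, A, C and sequence weights 0.35, 0.25, 0.6.
value = wsp_column(("A", "A", "C"), [0.35, 0.25, 0.6], match_score)
print(round(value, 4))  # -0.2725
```

Down-weighting the two similar sequences keeps their agreement from dominating the column score.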
Additional features of ClustalW
• Individual weights are assigned to sequences
– Closely related ⇒ less weight
• Scoring matrices are varied depending on the presence of conserved or divergent sequences, e.g.
PAM20 80-100% id
PAM60 60-80% id
PAM120 40-60% id
PAM250 0-40% id
Shortcomings of progressive approaches
• Progressive MSA strongly dependent upon initial alignments
• If sequences aligned at each step are similar
– Progressive approach works well
• If MSA is built on dubious PSAs
– Errors in alignment propagated and amplified
• Post-processing solution:
– Iterative refinement
Dangers of progressive alignment
• Initial alignments are “frozen” even when new evidence is introduced
• Example:
x: GAAGTT
y: GAC-TT   (frozen by initial PSA)
z: GAACTG
w: GTACTG
Additional sequences make clear that y: GA-CTT
[Figure: guide tree with leaves x, y, z, w]
Iterative refinement of progressive MSA
[Figure: sequences x, y, z; x and z are held fixed (projection) while y is allowed to vary]
• For each j = 1 to N
– Remove sequence $x_j$ and realign it to the remaining alignment of $x_1, \dots, x_{j-1}, x_{j+1}, \dots, x_N$
• Repeat until alignment converges
Ex: Iterative refinement
• Progressive alignment (x,y), (z,w), (xy, zw):
x: GAAGTTA
y: GAC-TTA
z: GAACTGA
w: GTACTGA
• After realigning y to the remainder:
x: GAAGTTA
y: G-ACTTA   (+ 3 matches)
z: GAACTGA
w: GTACTGA
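The gain of three matches can be checked by counting, over all columns and all pairs of rows, how often two identical residues share a column. A small sketch (the function name `column_matches` is illustrative):

```python
from itertools import combinations


def column_matches(rows, gap="-"):
    """Count identical residue pairs sharing a column, over all row pairs."""
    return sum(
        a == b and a != gap
        for col in zip(*rows)
        for a, b in combinations(col, 2)
    )


before = ["GAAGTTA", "GAC-TTA", "GAACTGA", "GTACTGA"]
after = ["GAAGTTA", "G-ACTTA", "GAACTGA", "GTACTGA"]
print(column_matches(after) - column_matches(before))  # 3
```

Only the pairs involving y change: its matches against x, z, and w go from 5 + 4 + 3 to 5 + 5 + 5.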
Evaluating methods of MSA
• Since DP for MSA is prohibitively slow
– Heuristic methods are required
• Heuristics vary in both speed and accuracy
• Speed is well quantified by computational complexity
– How can accuracy be quantified?
• Idea: Construct benchmark sets of “correct” MSAs
– Evaluate ability of methods to reconstruct “correct” alignment
• Metric: Q = proportion of correctly aligned residues
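One common way to make Q concrete is to count residue pairs: Q is the fraction of residue pairs aligned in the reference alignment that are also aligned in the test alignment. The sketch below implements that pair-based reading, which is an assumption about the exact definition meant here; it also assumes the reference aligns at least one pair.

```python
from itertools import combinations


def aligned_pairs(rows, gap="-"):
    """Set of ((row, residue index), (row, residue index)) pairs sharing a column."""
    counters = [0] * len(rows)
    pairs = set()
    for col in zip(*rows):
        present = []
        for idx, ch in enumerate(col):
            if ch != gap:
                present.append((idx, counters[idx]))
                counters[idx] += 1
        pairs.update(combinations(present, 2))
    return pairs


def q_score(test_rows, reference_rows):
    """Fraction of reference residue pairs recovered by the test alignment."""
    ref = aligned_pairs(reference_rows)
    return len(ref & aligned_pairs(test_rows)) / len(ref)


reference = ["GAAGTT", "GA-CTT"]
test = ["GAAGTT", "GAC-TT"]
print(q_score(test, reference))  # 0.8
```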
Performance of different alignment tools
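| Algorithm | BAliBASE (237) Q | BAliBASE (237) t | PREFAB (1932) Q | PREFAB (1932) t | SABmark (698) Q | SABmark (698) t |
|---|---|---|---|---|---|---|
| Align-m | 0.804 | 19:25 | - | - | 0.352 | 56:44 |
| DIALIGN | 0.832 | 2:53 | 0.572 | 12:25:00 | 0.410 | 8:28 |
| CLUSTALW | 0.861 | 1:07 | 0.589 | 2:57:00 | 0.439 | 2:16 |
| MAFFT | 0.882 | 1:18 | 0.648 | 2:36:00 | 0.442 | 7:33 |
| T-Coffee | 0.883 | 21:31 | 0.636 | 144:51:00 | 0.456 | 59:10 |
| MUSCLE | 0.896 | 1:05 | 0.648 | 3:11:00 | 0.464 | 20:42 |
| PROBCONS | 0.910 | 5:32 | 0.668 | 19:41:00 | 0.505 | 17:20 |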
Popular heuristic methods for MSA
• ClustalW: Most widely used
– http://www.ebi.ac.uk/clustalw/
• T-Coffee: Better but slower
– http://www.ch.embnet.org/software/TCoffee.html
• ProbCons: Most accurate
– http://probcons.stanford.edu/
• MUSCLE: Most scalable
– http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py
Summary
• Multiple alignments make a statement about homology
• Optimal solution can be found by DP, but prohibitively slow
• Faster heuristics necessary for most applications
• Most heuristic methods based on progressive method
• Variants include pre-processing and post-processing
• Choice of method dictated by tradeoffs of time and accuracy
Postscript: The chicken and the egg
DGMNAGLAQ-VIA
DGM-ASLAQGVI-
-----SIPGVDK-
[Figure: cycle linking alignment and phylogenetic tree: initial alignment → estimate phylogenetic tree → compute sequence weights → refine alignment using weighted SP score → estimate phylogenetic tree → ...]
Alignment ↔ phylogeny ↔ alignment ↔ ...
Pig   ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
Dog   --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
Human KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
Mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
***. ::: .: .. . : . . * . *: *
[Figure: phylogenetic tree with leaves pig, dog, human, mouse]
• Didn’t we use the tree to build the alignment?
– How can we use the alignment to build the tree?