Multiple sequence comparison (MSC)
Reading: Setubal/Meidanis, 3.4
Gusfield, Algorithms on Strings, Trees and Sequences, chapter 14
Why care about similarity?
• Similar sequences have similar structure
Similar structure -> similar sequence?
• No, the converse is not true!
• Convergent evolution. Outwardly similar solutions to similar problems may be internally different.
• Tiger and ‘Tasmanian tiger’. Fish and dolphin. Bat and bird.
• Same is true of molecular ‘species’ and ‘anatomies’!
Sequence --> function
• Similar sequences have similar function
• ‘[T]he same genes that work in flies are the ones that work in humans.’ -- Eric Wieschaus, 1995 Nobel Prize for Drosophila work
Common origins
• Similar sequences have common origins
• ‘Descent with modification’ is Nature’s design mechanism
• Strong similarity may imply recent common origin (what do we mean by ‘strong’ and ‘recent’?)
• Strong similarity may imply strong conservation of sequence or motif
Is multiple sequence comparison a generalization?
• From a CS point of view, going from two strings to many strings is a generalization
• Yes, in that it helps detect faint similarities
• No, in that we go from known biological similarity to suspected sequence similarity
‘Big’ uses for MSC
• Represent protein families
• Identify conserved sequence features
• Deduce evolutionary history
Profile representation
• Definition: given a multiple alignment of a set of strings, a profile specifies, for each column, the frequency of each character (including the gap character)
Profile example
Alignment
a b c - a
a b a b a
a c c b -
c b - b c
Profile
     C1    C2    C3    C4    C5
a   .75    0    .25    0    .50
b    0    .75    0    .75    0
c   .25   .25   .50    0    .25
-    0     0    .25   .25   .25
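The profile above can be recomputed from the alignment with a few lines of Python. This is a minimal sketch: the function name `build_profile` and the choice to store each column as a dictionary are illustrative assumptions, and the gap character "-" is counted like any other character.

```python
from collections import Counter

def build_profile(rows):
    """Return one {character: frequency} dict per alignment column.

    `rows` are equal-length strings; '-' marks a gap and is counted
    as a character in its own right.
    """
    k = len(rows)
    return [
        {ch: count / k for ch, count in Counter(column).items()}
        for column in zip(*rows)  # zip(*rows) walks the alignment column by column
    ]

# The alignment from the profile example.
alignment = ["abc-a", "abaca"[:2] + "aba", "accb-", "cb-bc"]
alignment[1] = "ababa"  # second row, written out explicitly for clarity
profile = build_profile(alignment)

for j, column in enumerate(profile, start=1):
    print(f"C{j}", dict(sorted(column.items())))
```

Characters that never occur in a column are simply absent from that column's dictionary, which corresponds to a frequency of 0.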
Fit string S to profile P
• Given a profile P and a string S, what is the best alignment (fit) of S to P?
• Example:
S: a a b - b c
P: 1 - 2 3 4 5
Two key issues
• How to score an alignment of a string to a profile
• How to compute an optimal alignment, given a scoring system
Scoring and alignment of profile
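• Scoring: assuming letter-to-letter scores are given, score a character against a column as the sum of its letter-to-letter scores weighted by the column's character frequencies
• Optimal alignment: by DP, similar to string-to-string optimal alignment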
• Q: How would you do profile-to-profile scoring and alignment?
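The string-to-profile case can be sketched as a small dynamic program. The letter-to-letter scores used here (match +2, mismatch -1, gap -1, gap against gap 0) are assumptions chosen for illustration and are not taken from the slides; the function names are hypothetical as well. The recurrence is the usual global-alignment one, with the column score replaced by a frequency-weighted sum.

```python
from collections import Counter

def build_profile(rows):
    """One {character: frequency} dict per column of the alignment."""
    k = len(rows)
    return [{ch: n / k for ch, n in Counter(col).items()} for col in zip(*rows)]

def score(x, y, match=2, mismatch=-1, gap=-1):
    """Assumed letter-to-letter similarity score; '-' is the gap character."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch

def column_score(x, column):
    """Score character x against a profile column: weighted sum over the column."""
    return sum(freq * score(x, y) for y, freq in column.items())

def fit(s, profile):
    """Best global alignment score of string s against the profile."""
    n, m = len(s), len(profile)
    V = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # characters of s against inserted gaps
        V[i][0] = V[i - 1][0] + score(s[i - 1], "-")
    for j in range(1, m + 1):          # profile columns against gaps in s
        V[0][j] = V[0][j - 1] + column_score("-", profile[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + column_score(s[i - 1], profile[j - 1]),
                V[i - 1][j] + score(s[i - 1], "-"),
                V[i][j - 1] + column_score("-", profile[j - 1]),
            )
    return V[n][m]

profile = build_profile(["abc-a", "ababa", "accb-", "cb-bc"])
print(fit("aabbc", profile))
```

With a one-row "alignment" the profile frequencies are all 1, and the computation reduces to ordinary pairwise global alignment, which is one way to sanity-check the recurrence.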
Signature (motif) representation
• A motif is a regular expression (re)
• Example: a helicase motif
[&H][&A]D[DE]xn[TSN][x4][QK]Gx7[&A], where
– [abc] = any of a, b, c
– & = [ILVMFYW]
– x = any amino acid
– a3 = up to 3 a’s
– an = any number of a’s
• Find a motif by grep-ing
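As a rough illustration of "grep-ing" for a motif, the helicase pattern can be hand-translated into a Python regular expression. The translation rests on several assumptions: the second and third positions are read as [&A] followed by D, "xn" is taken as any number of residues, "x4" and "x7" as up to 4 and up to 7 residues (following the "a3 = up to 3 a's" convention), and "." stands in for "any amino acid" without checking the alphabet. The name `find_helicase_motif` is hypothetical.

```python
import re

HYDROPHOBIC = "ILVMFYW"  # the '&' class from the motif legend

HELICASE_MOTIF = re.compile(
    f"[{HYDROPHOBIC}H]"   # [&H]
    f"[{HYDROPHOBIC}A]"   # [&A]
    "D"                   # D
    "[DE]"                # [DE]
    ".*?"                 # xn: any number of residues (non-greedy)
    "[TSN]"               # [TSN]
    ".{0,4}"              # x4: up to 4 residues
    "[QK]"                # [QK]
    "G"                   # G
    ".{0,7}"              # x7: up to 7 residues
    f"[{HYDROPHOBIC}A]"   # [&A]
)

def find_helicase_motif(sequence):
    """Return the (start, end) span of the first motif match, or None."""
    match = HELICASE_MOTIF.search(sequence)
    return match.span() if match else None

# A made-up sequence constructed to contain the motif, not a real protein.
print(find_helicase_motif("MKLADEGGTPPKGRRRAW"))
```

The non-greedy `.*?` keeps the reported match as short as possible from its starting position; whether greedy or non-greedy matching is preferable depends on what the match is used for.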
Finding optimal MS alignment
• Need a scoring system
• Given a scoring system, an (efficient) method of calculation
• If no efficient method of getting the right answer, an efficient way of getting a plausible answer
Need MSC measure
• Desirable characteristics:
– variable number of sequences
– column-wise calculation
– order independence
MQPILLL
MLR-LL-
MK-ILLL
MPPVLIL
Sum-of-pairs (SP) measure
• Column score = sum of the pairwise scores of the characters in the column
• With k sequences, each column has k choose 2 = k(k-1)/2 pairs
• Reduces to pairwise alignment when k = 2
• Need to assign (-,-) value
• May compute in either row or column order
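A short sketch of the SP measure, applied to the four-sequence alignment shown earlier. The pairwise scores are assumptions for illustration (a distance-style scheme: identical characters 0, mismatch 1, character against gap 1, and (-,-) set to 0); the function names are hypothetical. The two functions compute the same total in column order and in row order.

```python
from itertools import combinations

def pair_score(x, y, mismatch=1, gap=1):
    """Assumed pairwise cost; identical characters, including (-,-), cost 0."""
    if x == y:
        return 0
    if "-" in (x, y):
        return gap
    return mismatch

def sp_by_columns(rows):
    """Sum, over columns, of the scores of all k-choose-2 pairs in the column."""
    return sum(
        pair_score(x, y)
        for column in zip(*rows)
        for x, y in combinations(column, 2)
    )

def sp_by_rows(rows):
    """Sum, over all pairs of rows, of the induced pairwise alignment score."""
    return sum(
        pair_score(x, y)
        for r1, r2 in combinations(rows, 2)
        for x, y in zip(r1, r2)
    )

alignment = ["MQPILLL", "MLR-LL-", "MK-ILLL", "MPPVLIL"]
print(sp_by_columns(alignment), sp_by_rows(alignment))
```

Both orders visit exactly the same (row pair, column) combinations, which is why the totals agree; with k = 2 either function is just the score of a pairwise alignment.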
DP approach
• Generalization of two-sequence comparison
• k-dimensional array
• space complexity is O(n^k) for k sequences of length n
• MSC with SP measure is NP-complete
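To make the k-dimensional array concrete, here is a deliberately naive sketch of the exact DP under the SP measure. It assumes the same illustrative distance-style costs as before (mismatch 1, gap 1, (-,-) 0) and minimizes total cost; the name `msa_sp_cost` is hypothetical. It fills all (n+1)^k cells and tries up to 2^k - 1 predecessor moves per cell, so it is only usable for a handful of short strings.

```python
from itertools import combinations, product

def sp_column(column, mismatch=1, gap=1):
    """SP cost of one alignment column under the assumed costs."""
    cost = 0
    for x, y in combinations(column, 2):
        if x == y:                      # identical letters and (-,-) cost 0
            continue
        cost += gap if "-" in (x, y) else mismatch
    return cost

def msa_sp_cost(seqs):
    """Minimum SP cost of a global multiple alignment of seqs (exact, exponential)."""
    k = len(seqs)
    origin = (0,) * k
    # Each move says, per sequence, whether the column consumes a character (1)
    # or holds a gap (0); the all-gap move is excluded.
    moves = [m for m in product((0, 1), repeat=k) if any(m)]
    best = {origin: 0}
    # product() enumerates cells in lexicographic order, so every predecessor
    # of a cell has already been filled in when the cell is reached.
    for cell in product(*(range(len(s) + 1) for s in seqs)):
        if cell == origin:
            continue
        candidates = []
        for move in moves:
            prev = tuple(c - d for c, d in zip(cell, move))
            if min(prev) < 0:
                continue
            column = [seqs[i][cell[i] - 1] if move[i] else "-" for i in range(k)]
            candidates.append(best[prev] + sp_column(column))
        best[cell] = min(candidates)
    return best[tuple(len(s) for s in seqs)]

print(msa_sp_cost(["MQPILLL", "MLRLL", "MKILLL"]))
```

For k = 2 this is ordinary edit distance, which gives a quick check that the generalization is wired up correctly.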
MSA speedup heuristic
• This ‘heuristic’ guarantees the right answer!
• But .. it doesn’t guarantee the speedup
• General idea:
– find a bound L on the cost of an optimal alignment
– if the value for a cell exceeds L, it cannot enter into the optimal solution
Common method -- iterative
• Simplest implementation
• Begin with Si and Sj which are pairwise closest
• Iteratively merge in additional string with smallest edit distance from any in multiple alignment
• Equivalent to finding a minimum spanning tree (MST) of the graph whose edge weights are pairwise edit distances
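The merge order used by the iterative method can be sketched as follows. This only computes the order in which strings would be merged (start with the closest pair, then repeatedly add the string closest to anything already chosen); it does not build the multiple alignment itself. Unit-cost edit distance and the names `edit_distance` and `merge_order` are assumptions for illustration, and ties are broken by input position.

```python
from itertools import combinations

def edit_distance(s, t):
    """Unit-cost edit distance, computed row by row."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        cur = [i]
        for j, b in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # match or substitute
        prev = cur
    return prev[-1]

def merge_order(seqs):
    """Indices of seqs in the order they would be merged (needs >= 2 strings)."""
    n = len(seqs)
    d = {(i, j): edit_distance(seqs[i], seqs[j])
         for i, j in combinations(range(n), 2)}

    def dist(i, j):
        return d[(min(i, j), max(i, j))]

    first, second = min(d, key=d.get)   # pairwise closest strings
    order = [first, second]
    remaining = set(range(n)) - {first, second}
    while remaining:
        # Pick the string with the smallest distance to any already merged one.
        nxt = min(remaining, key=lambda r: (min(dist(r, m) for m in order), r))
        order.append(nxt)
        remaining.remove(nxt)
    return order

print(merge_order(["MQPILLL", "MLRLL", "MKILLL", "MPPVLIL"]))
```

Growing the set one closest string at a time is the same greedy step as Prim's algorithm, which is where the connection to a minimum spanning tree comes from.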
Clustering method
• Almost any clustering algorithm can be adapted to MSC
• Usually start with small clusters and build big ones
• Also possible to start with one big cluster and divide-and-conquer
• Not clear which method is best