Multiple sequence comparison (MSC) Reading: Setubal/Meidanis, 3.4 Gusfield, Algorithms on Strings,...

Post on 04-Jan-2016



Multiple sequence comparison (MSC)

Reading: Setubal/Meidanis, 3.4

Gusfield, Algorithms on Strings, Trees and Sequences, chapter 14

Why care about similarity?

• Similar sequences have similar structure

Similar structure -> similar sequence?

• No, the converse is not true!

• Convergent evolution. Outwardly similar solutions to similar problems may be internally different.

• Tiger and ‘Tasmanian tiger’. Fish and dolphin. Bat and bird.

• Same is true of molecular ‘species’ and ‘anatomies’!

Sequence --> function

• Similar sequences have similar function

• ‘[T]he same genes that work in flies are the ones that work in humans.’ -- Eric Wieschaus, 1995 Nobel Prize for Drosophila work

Common origins

• Similar sequences have common origins

• ‘Descent with modification’ is Nature’s design mechanism

• Strong similarity may imply recent common origin (what do we mean by ‘strong’ and ‘recent’?)

• Strong similarity may imply strong conservation of sequence or motif

Is multiple sequence comparison a generalization?

• From a CS point of view, we’re going from two strings to many strings, a generalization

• Yes, in that it helps detect faint similarities

• No, in that we go from known biological similarity to suspected sequence similarity

‘Big’ uses for MSC

• Represent protein families

• Identify conserved sequence features

• Deduce evolutionary history

Profile representation

• Definition: Given a multiple alignment of a set of strings, a profile specifies for each column the frequency of each character

Profile example

Alignment

a b c - a

a b a b a

a c c b -

c b - b c

Profile

    C1    C2    C3    C4    C5

a   .75         .25         .50

b         .75         .75

c   .25   .25   .50         .25

-               .25   .25   .25
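The profile above can be recomputed from the alignment with a few lines of Python. This is a minimal sketch: the function name `build_profile` is hypothetical, and it assumes the space character `-` is counted like any other character, which is what the example profile suggests.

```python
from collections import Counter

def build_profile(alignment):
    """Return one dict per column mapping character -> relative frequency.

    Assumption: '-' (space) is counted like any other character.
    """
    n_rows = len(alignment)
    profile = []
    for column in zip(*alignment):  # zip(*rows) walks the alignment column by column
        counts = Counter(column)
        profile.append({ch: counts[ch] / n_rows for ch in sorted(counts)})
    return profile

# The four aligned rows from the profile example.
alignment = ["abc-a", "ababa", "accb-", "cb-bc"]
profile = build_profile(alignment)
for i, column in enumerate(profile, start=1):
    print(f"C{i}", column)
```

Each printed dictionary corresponds to one column C1..C5 of the profile table; characters that do not occur in a column are simply absent from its dictionary.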

Fit string S to profile P

• Given a profile P and a string S, what is the best alignment (fit) of S to P?

• Example:

S: A a b - b c

P: 1 - 2 3 4 5

Two key issues

• How to score an alignment of a string to a profile

• How to compute an optimal alignment, given a scoring system

Scoring and alignment of profile

• Scoring: assuming letter-to-letter scores are given, score a character against a column as the frequency-weighted sum of its scores against the characters in that column

• Optimal alignment: by DP, similar to optimal string-to-string alignment

• Q: How would you do profile-to-profile scoring and alignment?
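The weighted-sum scoring and the DP for fitting a string to a profile can be sketched as follows. The scoring values in `pair_score` (match +2, mismatch -1, character against space -1, space against space 0) are assumptions chosen for illustration, not values from the slides, and the function names are hypothetical. The sketch returns only the optimal score; recovering the alignment itself would need the usual traceback.

```python
def pair_score(x, y):
    """Hypothetical letter-to-letter scores (an assumption, not from the slides)."""
    if x == "-" and y == "-":
        return 0.0
    if x == "-" or y == "-":
        return -1.0
    return 2.0 if x == y else -1.0

def column_score(x, column):
    """Frequency-weighted sum of the scores of x against one profile column."""
    return sum(freq * pair_score(x, y) for y, freq in column.items())

def fit_string_to_profile(s, profile):
    """Best global alignment score of string s against the profile columns."""
    n, m = len(s), len(profile)
    V = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # characters of s aligned to spaces
        V[i][0] = V[i - 1][0] + pair_score(s[i - 1], "-")
    for j in range(1, m + 1):          # profile columns aligned to spaces
        V[0][j] = V[0][j - 1] + column_score("-", profile[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + column_score(s[i - 1], profile[j - 1]),  # char vs column
                V[i - 1][j] + pair_score(s[i - 1], "-"),                   # char vs space
                V[i][j - 1] + column_score("-", profile[j - 1]),           # space vs column
            )
    return V[n][m]

# Profile from the earlier example, one dict per column.
PROFILE = [
    {"a": 0.75, "c": 0.25},
    {"b": 0.75, "c": 0.25},
    {"a": 0.25, "c": 0.50, "-": 0.25},
    {"b": 0.75, "-": 0.25},
    {"a": 0.50, "c": 0.25, "-": 0.25},
]
print(fit_string_to_profile("Aabbc", PROFILE))
```

The recurrence has the same three cases as two-string alignment; the only change is that the cost of a column depends on all the characters in it. One plausible route to profile-to-profile alignment is to replace `column_score` with a weighted sum over both columns.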

Signature (motif) representation

• A motif is a regular expression (re)

• Example: a helicase motif

[&H][&A]D[DE]xn[TSN][x4][QK]Gx7[&A], where

– [abc] = any of a, b, c

– & = [ILVMFYW]

– x = any amino acid

– a3 = up to 3 a’s

– an = any number of a’s

• Find a motif by grep-ing
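Searching for such a motif maps directly onto a regular-expression engine. The sketch below translates the helicase motif into Python's `re` syntax under several assumptions: the motif is read as [&H][&A]D[DE] xn [TSN] x4 [QK]G x7 [&A], "any amino acid" is approximated by `[A-Z]`, "up to 4" becomes `{0,4}`, and "any number" becomes `*`. The name `find_motif` and the test sequence are made up for illustration.

```python
import re

HYDROPHOBIC = "ILVMFYW"  # the '&' class from the motif definition
ANY = "[A-Z]"            # 'x': any amino acid, approximated as any capital letter

HELICASE = re.compile(
    f"[{HYDROPHOBIC}H]"   # [&H]
    f"[{HYDROPHOBIC}A]"   # [&A]
    "D[DE]"               # D[DE]
    f"{ANY}*"             # xn: any number of residues
    "[TSN]"
    f"{ANY}{{0,4}}"       # x4: up to 4 residues
    "[QK]G"
    f"{ANY}{{0,7}}"       # x7: up to 7 residues
    f"[{HYDROPHOBIC}A]"   # [&A]
)

def find_motif(sequence):
    """Return the (start, end) span of the first motif match, or None."""
    match = HELICASE.search(sequence)
    return match.span() if match else None

# Made-up sequence for illustration only, not a real helicase.
print(find_motif("LADEAAATQGV"))
```

Because `xn` is unbounded, a match may span a long stretch of sequence; in practice one might cap it with an explicit upper limit.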

Finding optimal MS alignment

• Need a scoring system

• Given a scoring system, an (efficient) method of calculation

• If no efficient method of getting the right answer, an efficient way of getting a plausible answer

Need MSC measure

• Desirable characteristics:

– variable number of sequences

– column-wise calculation

– order independence

MQPILLL

MLR-LL-

MK-ILLL

MPPVLIL

Sum-of-pairs (SP) measure

• Column score = sum of the pairwise scores in that column

• k choose 2 pairs per column, for k sequences

• Reduces to pairwise alignment when k = 2

• Need to assign a value to (-,-)

• May compute in either row or column order
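A short sketch of the SP measure, applied to the four-row alignment shown earlier. The scoring values (match +1, mismatch -1, character against space -2, space against space 0) are assumptions for illustration, and the function names are hypothetical. The two functions compute the same total, once column by column and once row pair by row pair.

```python
from itertools import combinations

def pair_score(x, y):
    """Hypothetical pairwise scores (assumed, not from the slides)."""
    if x == "-" and y == "-":
        return 0          # the (-,-) value has to be chosen; 0 is assumed here
    if x == "-" or y == "-":
        return -2
    return 1 if x == y else -1

def sp_column(column):
    """Sum of pairwise scores over all k-choose-2 pairs in one column."""
    return sum(pair_score(x, y) for x, y in combinations(column, 2))

def sp_score(alignment):
    """SP score computed in column order."""
    return sum(sp_column(column) for column in zip(*alignment))

def sp_score_by_rows(alignment):
    """Same SP score computed as a sum of induced pairwise alignment scores."""
    return sum(
        sum(pair_score(x, y) for x, y in zip(row1, row2))
        for row1, row2 in combinations(alignment, 2)
    )

alignment = ["MQPILLL", "MLR-LL-", "MK-ILLL", "MPPVLIL"]
print(sp_score(alignment), sp_score_by_rows(alignment))
```

Both orders give the same number because the total is just a double sum over columns and row pairs, which can be taken in either order.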

DP approach

• Generalization of two-sequence comparison

• k-dimensional array

• space complexity is O(n^k) for k sequences of length n

• MSC with SP measure is NP-complete
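To get a feel for why the k-dimensional DP does not scale, the sketch below counts the cells in the table and the neighbouring cells each cell depends on. The function names are hypothetical; the counts follow from the structure of the recurrence, where each sequence either contributes a character or a space to the last column, and at least one must contribute a character.

```python
from itertools import product

def dp_table_cells(lengths):
    """Number of cells in the DP table: the product of (n_i + 1) over all sequences."""
    cells = 1
    for n in lengths:
        cells *= n + 1
    return cells

def predecessor_offsets(k):
    """Offsets of the cells a given cell depends on.

    Each of the k sequences either advances (1) or takes a space (0);
    the all-zero offset is excluded, leaving 2**k - 1 predecessors.
    """
    return [offset for offset in product((0, 1), repeat=k) if any(offset)]

print(dp_table_cells([100, 100, 100]))   # three sequences of length 100
print(len(predecessor_offsets(3)))       # predecessors per cell for k = 3
```

For two sequences this gives the familiar (n+1) x (m+1) table with 3 predecessors per cell; both quantities grow exponentially in k.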

MSA speedup heuristic

• This ‘heuristic’ guarantees the right answer!

• But .. it doesn’t guarantee the speedup

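• General idea:

– find a bound L that an optimal alignment is guaranteed to meet (for example, the value of any known alignment)

– if the best value attainable through a cell is worse than L, that cell cannot enter into an optimal solution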

Common method -- iterative

• Simplest implementation

• Begin with Si and Sj which are pairwise closest

• Iteratively merge in additional string with smallest edit distance from any in multiple alignment

• Equivalent to finding MSP on edit tree
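The order in which strings are merged can be sketched as below. This computes only the merge order, not the multiple alignment itself; the function names are hypothetical, unit-cost edit distance is assumed, and ties are broken by input order.

```python
from itertools import combinations

def edit_distance(a, b):
    """Unit-cost edit distance (insert, delete, substitute), using two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]

def merge_order(strings):
    """Indices of the strings in the order they would join the alignment."""
    n = len(strings)
    d = [[edit_distance(a, b) for b in strings] for a in strings]
    # Start with the pairwise closest pair Si, Sj.
    i, j = min(combinations(range(n), 2), key=lambda p: d[p[0]][p[1]])
    order = [i, j]
    remaining = set(range(n)) - {i, j}
    while remaining:
        # Next: the string with the smallest distance to any string already merged.
        nxt = min(sorted(remaining), key=lambda r: min(d[r][m] for m in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order

print(merge_order(["AAAA", "AAAT", "GGGG", "AATT"]))
```

With these made-up strings, the two closest strings start the alignment, the string nearest to either of them joins next, and the outlier is merged last.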

Clustering method

• Almost any clustering algorithm can be adapted to MSC

• Usually start with small clusters and build big ones

• Also possible to start with one big cluster and divide and conquer

• Not clear which method is best