Multiple sequence comparison (MSC)

Reading: Setubal/Meidanis, 3.4

Gusfield, Algorithms on Strings, Trees and Sequences, chapter 14

Why care about similarity?

• Similar sequences have similar structure

Similar structure -> similar sequence?

• No, the converse is not true!

• Convergent evolution. Outwardly similar solutions to similar problems may be internally different.

• Tiger and ‘Tasmanian tiger’. Fish and dolphin. Bat and bird.

• Same is true of molecular ‘species’ and ‘anatomies’!

Sequence --> function

• Similar sequences have similar function

• ‘[T]he same genes that work in flies are the ones that work in humans.’ -- Eric Wieschaus, 1995 Nobel Prize for Drosophila work

Common origins

• Similar sequences have common origins

• ‘Descent with modification’ is Nature’s design mechanism

• Strong similarity may imply recent common origin (what do we mean by ‘strong’ and ‘recent’?)

• Strong similarity may imply strong conservation of sequence or motif

Is multiple sequence comparison a generalization?

• From a CS point of view, we're going from two strings to many strings, which is a generalization

• Yes, in that it helps detect faint similarities

• No, in that we go from known biological similarity to suspected sequence similarity

‘Big’ uses for MSC

• Represent protein families

• Identify conserved sequence features

• Deduce evolutionary history

Profile representation

• Definition: Given a multiple alignment of a set of strings, a profile specifies for each column the frequency of each character (including the space character '-')

Profile example

Alignment

a b c - a

a b a b a

a c c b -

c b - b c

Profile

     C1    C2    C3    C4    C5
a   .75    0    .25    0    .50
b    0    .75    0    .75    0
c   .25   .25   .50    0    .25
-    0     0    .25   .25   .25
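The profile above can be recomputed from the alignment with a few lines of Python. This is a minimal sketch: the function name `build_profile` and the representation of a profile as one dictionary per column are assumptions made for illustration, not something the slides prescribe.

```python
from collections import Counter

def build_profile(alignment):
    """Return one dict per column mapping each character to its frequency.

    `alignment` is a list of rows; characters are separated by spaces,
    as on the slide, and '-' is counted like any other character.
    """
    rows = [row.split() for row in alignment]
    n_rows = len(rows)
    profile = []
    for column in zip(*rows):          # walk the alignment column by column
        counts = Counter(column)
        profile.append({ch: n / n_rows for ch, n in counts.items()})
    return profile

alignment = ["a b c - a", "a b a b a", "a c c b -", "c b - b c"]
profile = build_profile(alignment)
for i, col in enumerate(profile, start=1):
    print(f"C{i}", dict(sorted(col.items())))
```

Characters that never occur in a column are simply absent from that column's dictionary, which corresponds to a frequency of 0 in the table.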

Fit string S to profile P

• Given a profile P and a string S, what is the best alignment (fit) of S to P?

• Example:

S: A a b - b c

P: 1 - 2 3 4 5

Two key issues

• How to score an alignment of a string to a profile

• How to compute an optimal alignment, given a scoring system

Scoring and alignment of profile

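• Scoring: Assuming letter-to-letter scores are given, score a character against a column by the sum of its scores against each character in the column, weighted by that character's frequency

• Optimal alignment: By DP, similar to string-to-string optimal alignment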

• Q: How would you do profile-to-profile scoring and alignment?
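As a rough illustration of the two bullets on scoring and alignment, the sketch below scores a character against a profile column by the frequency-weighted sum and then fills a standard two-dimensional DP table. The letter-to-letter scores in `pair_score` (match 2, mismatch -1, character against space -1, space against space 0) and all function names are assumptions chosen for the example; the slides do not fix a scoring scheme.

```python
def pair_score(x, y):
    """Assumed letter-to-letter scores; '-' denotes a space."""
    if x == "-" and y == "-":
        return 0.0
    if x == "-" or y == "-":
        return -1.0
    return 2.0 if x == y else -1.0

def column_score(ch, column):
    """Weighted sum: score of `ch` against every character in the column."""
    return sum(freq * pair_score(ch, y) for y, freq in column.items())

def fit_to_profile(s, profile):
    """Best score of aligning string `s` to `profile` (list of column dicts)."""
    n, m = len(s), len(profile)
    V = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):              # characters of s opposite no column
        V[i][0] = V[i - 1][0] + pair_score(s[i - 1], "-")
    for j in range(1, m + 1):              # columns opposite a space in s
        V[0][j] = V[0][j - 1] + column_score("-", profile[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + column_score(s[i - 1], profile[j - 1]),
                V[i - 1][j] + pair_score(s[i - 1], "-"),
                V[i][j - 1] + column_score("-", profile[j - 1]),
            )
    return V[n][m]

# The profile from the earlier example, written out by hand.
profile = [
    {"a": 0.75, "c": 0.25},
    {"b": 0.75, "c": 0.25},
    {"a": 0.25, "c": 0.50, "-": 0.25},
    {"b": 0.75, "-": 0.25},
    {"a": 0.50, "c": 0.25, "-": 0.25},
]
print(fit_to_profile("aabbc", profile))
```

The recurrence has the same three cases as pairwise alignment; only the cost of each case changes, because one side is now a column of frequencies rather than a single character.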

Signature (motif) representation

• A motif is a regular expression (re)

• Example: a helicase motif

[&H][&A]D[DE]xn[TSN][x4][QK]Gx7[&A], where

– [abc] = any of a, b, c

– & = [ILVMFYW]

– x = any amino acid

– a3 = up to 3 a’s

– an = any number of a’s

• Find a motif by grep-ing
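Searching for such a motif can be sketched with Python's `re` module. The translation below is one reading of the slide's notation and is an assumption: `xn` is taken as "any number of residues", and `x4` and `x7` as "up to 4" and "up to 7" residues, following the legend's "a3 = up to 3 a's". The test sequence is made up for illustration and is not a real helicase.

```python
import re

HYDROPHOBIC = "ILVMFYW"   # the class abbreviated as '&' on the slide

HELICASE_MOTIF = re.compile(
    f"[{HYDROPHOBIC}H]"   # [&H]
    f"[{HYDROPHOBIC}A]"   # [&A]
    "D"                   # D
    "[DE]"                # [DE]
    ".*?"                 # xn: any number of residues (assumed reading)
    "[TSN]"               # [TSN]
    ".{0,4}"              # x4: up to 4 residues (assumed reading)
    "[QK]"                # [QK]
    "G"                   # G
    ".{0,7}"              # x7: up to 7 residues (assumed reading)
    f"[{HYDROPHOBIC}A]"   # [&A]
)

def find_motif(sequence):
    """Return the (start, end) span of the first motif match, or None."""
    match = HELICASE_MOTIF.search(sequence)
    return match.span() if match else None

# Hypothetical sequence containing the residues L A D E T Q G A.
print(find_motif("GGLADETQGAGG"))
print(find_motif("GGGG"))
```

If the counts are meant to be exact rather than upper bounds, `.{0,4}` and `.{0,7}` would become `.{4}` and `.{7}`.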

Finding optimal MS alignment

• Need a scoring system

• Given a scoring system, an (efficient) method of calculation

• If no efficient method of getting the right answer, an efficient way of getting a plausible answer

Need MSC measure

• Desirable characteristics:

– variable number of sequences

– column-wise calculation

– order independence

MQPILLL

MLR-LL-

MK-ILLL

MPPVLIL

Sum-of-pairs (SP) measure

• Column score = sum of pairwise scores

• k choose 2 = k(k-1)/2 pairs per column for k sequences

• Reduces to pairwise alignment when k = 2

• Need to assign (-,-) value

• May compute in either row or column order
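The SP measure is short enough to write out directly. In this sketch the pairwise scores (match 1, mismatch -1, character against space -2, and 0 for the (-,-) pair) are assumptions picked for illustration; the slides leave the scoring scheme open. The alignment is the four-row example shown earlier.

```python
from itertools import combinations

def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Assumed pairwise scores; the (-,-) pair is assigned 0."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch

def sp_score(alignment):
    """Sum over columns of the scores of all k-choose-2 pairs in the column."""
    return sum(
        pair_score(x, y)
        for column in zip(*alignment)
        for x, y in combinations(column, 2)
    )

alignment = ["MQPILLL", "MLR-LL-", "MK-ILLL", "MPPVLIL"]
print(sp_score(alignment))
```

Summing column by column, as here, gives the same total as summing the induced pairwise alignment scores row pair by row pair, which is what "either row or column order" refers to.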

DP approach

• Generalization of two-sequence comparison

• k-dimensional array

• Space complexity is O(n^k) for k sequences of length n

• MSC with SP measure is NP-complete
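To make the k-dimensional generalization concrete, here is a small sketch that computes the optimal SP score by memoized recursion over index tuples. It is only practical for a few short strings, which is the point of the complexity bullet. The scoring values and the names `sp_column` and `msa_sp_score` are assumptions for illustration.

```python
from functools import lru_cache
from itertools import combinations, product

def sp_column(column, match=1, mismatch=-1, gap=-2):
    """SP score of one column under assumed scores; (-,-) scores 0."""
    total = 0
    for x, y in combinations(column, 2):
        if x == "-" and y == "-":
            continue
        if x == "-" or y == "-":
            total += gap
        else:
            total += match if x == y else mismatch
    return total

def msa_sp_score(seqs):
    """Optimal SP score of a multiple alignment of `seqs` (maximization)."""
    k = len(seqs)
    # Each move says, per sequence, whether the last column consumes a
    # character (1) or holds a space (0); the all-zero move is excluded,
    # leaving 2^k - 1 predecessor cells per cell.
    moves = [m for m in product((0, 1), repeat=k) if any(m)]

    @lru_cache(maxsize=None)
    def best(idx):
        if not any(idx):
            return 0
        candidates = []
        for move in moves:
            if all(i >= d for i, d in zip(idx, move)):
                prev = tuple(i - d for i, d in zip(idx, move))
                column = tuple(
                    s[i - 1] if d else "-" for s, i, d in zip(seqs, idx, move)
                )
                candidates.append(best(prev) + sp_column(column))
        return max(candidates)

    return best(tuple(len(s) for s in seqs))

print(msa_sp_score(("ab", "ab")))
print(msa_sp_score(("a", "a", "a")))
```

With k = 2 the three moves are the familiar diagonal, vertical, and horizontal steps of pairwise alignment.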

MSA speedup heuristic

• This ‘heuristic’ guarantees the right answer!

• But .. it doesn’t guarantee the speedup

• General idea:

– find a lower bound on L

– if the value for a cell exceeds L, it cannot enter into an optimal solution

Common method -- iterative

• Simplest implementation

• Begin with Si and Sj which are pairwise closest

• Iteratively merge in additional string with smallest edit distance from any in multiple alignment

• Equivalent to finding a minimum spanning tree (MST) over the pairwise edit distances
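The order in which strings get merged can be sketched as follows. The code only determines the merge order (closest pair first, then repeatedly the string nearest to any string already chosen); it does not perform the alignment merges themselves. Unit-cost edit distance and the function names are assumptions for the example.

```python
def edit_distance(a, b):
    """Unit-cost edit distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]

def merge_order(seqs):
    """Indices of `seqs` in the order the iterative method would merge them.

    Requires at least two sequences. Ties are broken by lowest index.
    """
    n = len(seqs)
    d = [[edit_distance(a, b) for b in seqs] for a in seqs]
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    first, second = min(pairs, key=lambda p: d[p[0]][p[1]])
    order = [first, second]
    remaining = set(range(n)) - {first, second}
    while remaining:
        # Next string: smallest distance to any string already merged.
        nxt = min(sorted(remaining),
                  key=lambda r: min(d[r][m] for m in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order

seqs = ["ACGT", "ACGA", "TTTT", "ACAA"]
print(merge_order(seqs))
```

Growing the set by the cheapest connecting edge at each step is how Prim's algorithm builds a minimum spanning tree, which is where the equivalence in the last bullet comes from.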

Clustering method

• Almost any clustering algorithm can be adapted to MSC

• Usually start with small clusters and build big ones

• Also possible to start with one big cluster and divide and conquer

• Not clear which method is best