CS 5263 Bioinformatics

CS 5263 Bioinformatics Lecture 8: Multiple Sequence Alignment


Page 1: CS 5263 Bioinformatics

CS 5263 Bioinformatics

Lecture 8: Multiple Sequence Alignment

Page 2: CS 5263 Bioinformatics
Page 3: CS 5263 Bioinformatics

Multiple Sequence Alignment

• Definition
  – Given N sequences x1, x2, …, xN: insert gaps (-) in each sequence xi, such that
    • All sequences have the same length L
    • The score of the alignment is maximum

• Two issues
  – How to score an alignment?
  – How to find a (nearly) optimal alignment?

Page 4: CS 5263 Bioinformatics

Scoring function - first assumption

• Columns are independent
  – As in pair-wise alignment

• Therefore, the score of an alignment is the sum of the scores of all its columns

• Need to decide how to score a single column

Page 5: CS 5263 Bioinformatics

Scoring function (cont’d)

• Ideally:
  – An n-dimensional matrix, where n is the number of sequences
  – E.g. (A, C, C, G, -) for aligning 5 sequences
  – Total number of parameters: (k+1)^n, where k is the alphabet size

• Direct estimation of such scores is difficult
  – Too many parameters to estimate
  – Even more difficult if we need to consider phylogenetic relationships

[Figure: phylogenetic tree (evolutionary tree) over sequences x, y, z, w, v; more detail in future lectures]

Page 6: CS 5263 Bioinformatics

Scoring Function (cont’d)

• Compromises:
  – Compute from pair-wise scores
    • Option 1: Based on the sum of all pair-wise scores
    • Option 2: Based on scores against a consensus sequence
  – Other options
    • Consider tree topology explicitly
    • Information-theory based score
    • Difficult to optimize

Page 7: CS 5263 Bioinformatics

Scoring Function: Sum Of Pairs

Definition: Induced pairwise alignment
A pairwise alignment induced by the multiple alignment: keep the two rows and drop every column in which both rows have a gap.

Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Induces:
x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG
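The projection in the definition above can be sketched in a few lines of Python. The function name `induced_pairwise` and the list-of-strings representation of the alignment are assumptions made for illustration; they are not part of the lecture.

```python
def induced_pairwise(msa, k, l):
    """Project rows k and l out of a multiple alignment.

    Columns where both rows hold a gap say nothing about the pair,
    so they are dropped from the induced alignment.
    """
    kept = [(a, b) for a, b in zip(msa[k], msa[l])
            if not (a == "-" and b == "-")]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


msa = ["AC-GCGG-C",   # x
       "AC-GC-GAG",   # y
       "GCCGC-GAG"]   # z

for k, l in [(0, 1), (0, 2), (1, 2)]:
    print(induced_pairwise(msa, k, l))
```

Running it on the slide's example reproduces the three induced alignments listed above.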

Page 8: CS 5263 Bioinformatics

Sum Of Pairs (cont’d)

• The sum-of-pairs (SP) score of an alignment is the sum of the scores of all induced pairwise alignments

$S(m) = \sum_{k<l} s(m^k, m^l)$

$s(m^k, m^l)$: score of the induced alignment of rows k and l

Page 9: CS 5263 Bioinformatics

Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Scoring matrix:
     A   C   G   T   -
A    1  -1  -1  -1  -1
C   -1   1  -1  -1  -1
G   -1  -1   1  -1  -1
T   -1  -1  -1   1  -1
-   -1  -1  -1  -1   0

Column 1: (A,A) + (A,G) x 2 = -1
Columns 4 and 7: (G,G) x 3 = 3 each
Column 8: (-,A) x 2 + (A,A) = -1

Total score = (-1) + 3 + (-2) + 3 + 3 + (-2) + 3 + (-1) + (-1) = 5
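The column-by-column arithmetic of this example can be checked with a short Python sketch. The helper names `s`, `column_score` and `sp_score` are illustrative assumptions; the pair scores are taken from the matrix on this slide (match +1, gap against gap 0, everything else -1).

```python
from itertools import combinations


def s(a, b):
    """Pair score from the slide's matrix: match +1, gap-gap 0, otherwise -1."""
    if a == "-" and b == "-":
        return 0
    return 1 if a == b else -1


def column_score(column):
    """Sum the pair scores over all pairs of symbols in one column."""
    return sum(s(a, b) for a, b in combinations(column, 2))


def sp_score(msa):
    """Sum-of-pairs score: columns are independent, so add them up."""
    return sum(column_score(col) for col in zip(*msa))


msa = ["AC-GCGG-C",   # x
       "AC-GC-GAG",   # y
       "GCCGC-GAG"]   # z

print([column_score(col) for col in zip(*msa)])  # [-1, 3, -2, 3, 3, -2, 3, -1, -1]
print(sp_score(msa))                             # 5
```

Summing over columns gives the same total as summing the three induced pairwise alignments, because a column where both rows of a pair have a gap scores 0.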

Page 10: CS 5263 Bioinformatics

Sum Of Pairs (cont’d)

• Drawback: no evolutionary characterization
  – Every sequence is treated as derived from all others
• Heuristic way to incorporate the evolutionary tree
  – Weighted sum of pairs:

$S(m) = \sum_{k<l} w_{kl}\, s(m^k, m^l)$

$w_{kl}$: weight decreasing with distance

[Figure: evolutionary tree with leaves Human, Mouse, Chicken, Duck]
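A weighted sum of pairs differs from the plain version only in the factor applied to each pair. The sketch below assumes the weights arrive as a dictionary keyed by row-index pairs; that layout, the function names, and the toy weight values are all made up for illustration and are not derived from any real tree.

```python
from itertools import combinations


def s(a, b):
    """Pair score: match +1, gap-gap 0, otherwise -1 (the earlier slide's matrix)."""
    if a == "-" and b == "-":
        return 0
    return 1 if a == b else -1


def pair_score(row_k, row_l):
    """Score of the pairwise alignment induced by two rows."""
    return sum(s(a, b) for a, b in zip(row_k, row_l))


def weighted_sp_score(msa, weights):
    """weights[(k, l)] with k < l is the weight w_kl of the pair (k, l)."""
    return sum(weights[(k, l)] * pair_score(msa[k], msa[l])
               for k, l in combinations(range(len(msa)), 2))


msa = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]

uniform = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0}
toy = {(0, 1): 1.0, (0, 2): 0.5, (1, 2): 0.5}   # hypothetical: row 2 is the distant one

print(weighted_sp_score(msa, uniform))  # 5.0, the unweighted SP score
print(weighted_sp_score(msa, toy))      # 3.5
```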

Page 11: CS 5263 Bioinformatics

Consensus score

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC

Consensus sequence:
CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

• Find the optimal consensus string m* to maximize

$S(m) = \sum_i s(m^*, m^i)$

$s(m^*, m^i)$: score of the pairwise alignment between the consensus m* and row i
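For a fixed alignment and independent columns, a consensus that maximizes this score can be picked one column at a time. The sketch below assumes the simple match/mismatch scores from the sum-of-pairs example and a DNA alphabet plus the gap symbol; the names `consensus` and `consensus_score` are illustrative. Ties are broken by alphabet order, which is an arbitrary choice.

```python
ALPHABET = "ACGT-"


def s(a, b):
    """Pair score: match +1, gap-gap 0, otherwise -1."""
    if a == "-" and b == "-":
        return 0
    return 1 if a == b else -1


def consensus(msa, alphabet=ALPHABET):
    """Per column, keep the symbol with the best total score against that column."""
    return "".join(
        max(alphabet, key=lambda c: sum(s(c, x) for x in col))
        for col in zip(*msa)
    )


def consensus_score(msa, m_star):
    """S(m) = sum over rows of s(m*, row), scored position by position."""
    return sum(s(c, x) for row in msa for c, x in zip(m_star, row))


msa = ["-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---",
       "TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC",
       "CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC"]

m_star = consensus(msa)
print(m_star)
print(consensus_score(msa, m_star))
```

Because several columns of this alignment are ties under the assumed scores, the printed string may differ from the slide's consensus in those columns while achieving the same score under this scheme.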

Page 12: CS 5263 Bioinformatics

Multiple Sequence Alignments Algorithms

• Can also be global or local
  – We only talk about global for now

• A simple method
  – Do pairwise alignment between all pairs
  – Combine the pairwise alignments into a single multiple alignment
  – Is this going to work?

Page 13: CS 5263 Bioinformatics

Compatible pairwise alignments

Sequences: AAAATTTT, TTTTGGGG, AAAAGGGG

Pairwise alignments:
AAAATTTT----    AAAATTTT----    ----TTTTGGGG
----TTTTGGGG    AAAA----GGGG    AAAA----GGGG

Combined multiple alignment:
AAAATTTT----
----TTTTGGGG
AAAA----GGGG

Page 14: CS 5263 Bioinformatics

Incompatible pairwise alignments

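Sequences: AAAATTTT, TTTTGGGG, GGGGAAAA

Pairwise alignments:
AAAATTTT----    ----AAAATTTT    TTTTGGGG----
----TTTTGGGG    GGGGAAAA----    ----GGGGAAAA

Combined multiple alignment: ? (the three pairwise alignments impose a cyclic block order, so no single multiple alignment induces all of them)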

Page 15: CS 5263 Bioinformatics

Multidimensional Dynamic Programming (MDP)

Generalization of Needleman-Wunsch:
• Find the longest path in a high-dimensional cube
  – As opposed to a two-dimensional grid
• Uses an N-dimensional matrix
  – As opposed to a two-dimensional array
• Entry F(i1, …, iN) represents the score of the optimal alignment of x1[1..i1], …, xN[1..iN]

$F(i_1, i_2, \ldots, i_N) = \max_{\text{nbr}} \big( F(\text{nbr}) + S(\text{current}) \big)$, where the maximum runs over all neighbors of the cell

Page 16: CS 5263 Bioinformatics

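Multidimensional Dynamic Programming (MDP)

• Example: in 3D (three sequences):

• 2^3 – 1 = 7 neighbors/cell

$$
F(i,j,k) = \max \begin{cases}
F(i-1,j-1,k-1) + S(x_i, x_j, x_k) \\
F(i-1,j-1,k) + S(x_i, x_j, -) \\
F(i-1,j,k-1) + S(x_i, -, x_k) \\
F(i,j-1,k-1) + S(-, x_j, x_k) \\
F(i-1,j,k) + S(x_i, -, -) \\
F(i,j-1,k) + S(-, x_j, -) \\
F(i,j,k-1) + S(-, -, x_k)
\end{cases}
$$

Here x_i, x_j, x_k denote the characters at positions i, j, k of the first, second and third sequence.

[Figure: cube showing cell (i,j,k) and its seven neighbors (i-1,j-1,k-1), (i-1,j-1,k), (i-1,j,k-1), (i,j-1,k-1), (i-1,j,k), (i,j-1,k), (i,j,k-1)]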
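The three-sequence recurrence can be written out directly. The sketch below assumes that the column score S is the sum-of-pairs score with the simple match/mismatch values used earlier; the function name `align3` and the dictionary-based table are illustrative choices, not the lecture's implementation.

```python
from itertools import combinations, product


def s(a, b):
    """Pair score: match +1, gap-gap 0, otherwise -1."""
    if a == "-" and b == "-":
        return 0
    return 1 if a == b else -1


def column_score(column):
    """Assumed column score S: sum of pair scores within the column."""
    return sum(s(a, b) for a, b in combinations(column, 2))


def align3(x, y, z):
    """Return (best score, aligned rows) for three sequences."""
    seqs = (x, y, z)
    dims = [len(q) + 1 for q in seqs]
    # The 2^3 - 1 = 7 moves: each sequence either advances (1) or takes a gap (0).
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    F, back = {(0, 0, 0): 0}, {}
    # product() yields cells in lexicographic order, so every
    # predecessor of a cell is filled in before the cell itself.
    for cell in product(*(range(d) for d in dims)):
        if cell == (0, 0, 0):
            continue
        best, best_move = float("-inf"), None
        for move in moves:
            prev = tuple(c - m for c, m in zip(cell, move))
            if min(prev) < 0:
                continue
            col = tuple(q[c - 1] if m else "-"
                        for q, c, m in zip(seqs, cell, move))
            cand = F[prev] + column_score(col)
            if cand > best:
                best, best_move = cand, move
        F[cell], back[cell] = best, best_move
    # Trace back from the far corner of the cube.
    end = tuple(d - 1 for d in dims)
    cell, rows = end, [[], [], []]
    while cell != (0, 0, 0):
        move = back[cell]
        for r, (q, c, m) in enumerate(zip(seqs, cell, move)):
            rows[r].append(q[c - 1] if m else "-")
        cell = tuple(c - m for c, m in zip(cell, move))
    return F[end], ["".join(reversed(r)) for r in rows]


score, rows = align3("ACGCGGC", "ACGCGAG", "GCCGCGAG")
print(score)
print("\n".join(rows))
```

The inputs are the ungapped sequences from the sum-of-pairs example, so the optimal score found here is at least the 5 obtained by the alignment shown on that slide.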

Page 17: CS 5263 Bioinformatics

Multidimensional Dynamic Programming (MDP)

Running Time:

1. Size of matrix: L^N, where L = length of each sequence and N = number of sequences

2. Neighbors/cell: 2^N – 1

Therefore: O(2^N L^N)
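To get a feel for how quickly this bound grows, the two factors can simply be multiplied out. The helper name `mdp_cost` and the chosen values of N and L are illustrative assumptions.

```python
def mdp_cost(n_seqs, length):
    """Return (cells, neighbors per cell, cell updates) for full MDP."""
    cells = length ** n_seqs        # size of the N-dimensional matrix
    neighbors = 2 ** n_seqs - 1     # predecessors examined per cell
    return cells, neighbors, cells * neighbors


for n_seqs in (2, 3, 6):
    cells, neighbors, updates = mdp_cost(n_seqs, 300)
    print(f"N={n_seqs}: {cells:.2e} cells x {neighbors} neighbors = {updates:.2e} updates")
```

Going from two to six sequences of length 300 multiplies the table size by 300^4, which is why exact MDP is limited to a handful of sequences.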

Page 18: CS 5263 Bioinformatics

Faster MDP

• Carrillo & Lipman, 1988
  – Branch and bound
  – Other heuristics

• Implemented in a tool called MSA
• Practical for about 6 sequences of length about 200-300

Page 19: CS 5263 Bioinformatics

Faster MDP

• Basic idea: bounds on the optimal score of a multiple alignment can be pre-computed
  – Upper bound: sum of optimal pair-wise alignment scores, i.e.

    $S(m) = \sum_{k<l} s(m^k, m^l) \le \sum_{k<l} s(k, l)$

    where m is the optimal MSA, $s(m^k, m^l)$ is the score of the alignment between k and l induced by m, and $s(k, l)$ is the score of the optimal alignment between k and l
  – Lower bound: score computed by any approximate algorithm (such as the ones we’ll talk about next)
  – For any partial path, if S_current + S_perspective < lower bound, we can give up that path (S_perspective is an upper bound on what the rest of the path can still score)
  – Guarantees optimality
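The upper bound can be computed with ordinary pairwise dynamic programming. The sketch below assumes the simple scores from the sum-of-pairs example (match +1, mismatch or gap -1, gap against gap 0) and a linear gap cost; `nw_score` and `sp_upper_bound` are illustrative names. It only evaluates the bound and does not implement the branch-and-bound search itself.

```python
from itertools import combinations


def s(a, b):
    """Pair score: match +1, gap-gap 0, otherwise -1."""
    if a == "-" and b == "-":
        return 0
    return 1 if a == b else -1


def nw_score(x, y):
    """Optimal global pairwise score (Needleman-Wunsch), two rows of memory."""
    prev = [0] * (len(y) + 1)
    for j in range(1, len(y) + 1):
        prev[j] = prev[j - 1] + s("-", y[j - 1])
    for i in range(1, len(x) + 1):
        cur = [prev[0] + s(x[i - 1], "-")] + [0] * len(y)
        for j in range(1, len(y) + 1):
            cur[j] = max(prev[j - 1] + s(x[i - 1], y[j - 1]),   # align x_i with y_j
                         prev[j] + s(x[i - 1], "-"),            # gap in y
                         cur[j - 1] + s("-", y[j - 1]))         # gap in x
        prev = cur
    return prev[len(y)]


def sp_upper_bound(seqs):
    """Sum of optimal pairwise scores: no MSA can have a higher SP score."""
    return sum(nw_score(a, b) for a, b in combinations(seqs, 2))


def sp_score(msa):
    return sum(s(a, b) for col in zip(*msa) for a, b in combinations(col, 2))


msa = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
seqs = [row.replace("-", "") for row in msa]
print(sp_score(msa), "<=", sp_upper_bound(seqs))
```

The inequality holds because each induced pairwise alignment is one candidate alignment of that pair, so it cannot beat the pair's optimum.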

Page 20: CS 5263 Bioinformatics

Progressive Alignment

• Multiple alignment is NP-hard
• Most used heuristic: progressive alignment

Algorithm:
1. Align two of the sequences xi, xj
2. Fix that alignment
3. Align a third sequence xk to the alignment xi,xj
4. Repeat until all sequences are aligned

Running time: O(N L^2)
• Each alignment takes O(L^2)
• Repeat N times
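The four steps above can be sketched as a loop that repeatedly aligns one more sequence against the frozen alignment built so far. This is a minimal illustration under stated assumptions: sequences are added in input order (no guide tree), a column is scored by summing simple match/mismatch pair scores against every row, and the names `add_sequence` and `progressive_align` are hypothetical.

```python
def s(a, b):
    """Pair score: match +1, gap-gap 0, otherwise -1."""
    if a == "-" and b == "-":
        return 0
    return 1 if a == b else -1


def add_sequence(alignment, seq):
    """Align seq against a frozen alignment (list of equal-length rows)."""
    n_rows, n_cols = len(alignment), len(alignment[0])
    cols = ["".join(row[c] for row in alignment) for c in range(n_cols)]
    gap_col = "-" * n_rows  # all-gap column opened when seq inserts a character

    def col_score(col, ch):
        return sum(s(ch, other) for other in col)

    # F[i][j]: best score for seq[:i] against the first j columns.
    F = [[0] * (n_cols + 1) for _ in range(len(seq) + 1)]
    for i in range(1, len(seq) + 1):
        F[i][0] = F[i - 1][0] + col_score(gap_col, seq[i - 1])
    for j in range(1, n_cols + 1):
        F[0][j] = F[0][j - 1] + col_score(cols[j - 1], "-")
    for i in range(1, len(seq) + 1):
        for j in range(1, n_cols + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + col_score(cols[j - 1], seq[i - 1]),
                F[i - 1][j] + col_score(gap_col, seq[i - 1]),
                F[i][j - 1] + col_score(cols[j - 1], "-"),
            )

    # Trace back, rebuilding the old columns and the new row together.
    out_cols, out_seq = [], []
    i, j = len(seq), n_cols
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + col_score(cols[j - 1], seq[i - 1]):
            out_cols.append(cols[j - 1])
            out_seq.append(seq[i - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + col_score(gap_col, seq[i - 1]):
            out_cols.append(gap_col)
            out_seq.append(seq[i - 1])
            i -= 1
        else:
            out_cols.append(cols[j - 1])
            out_seq.append("-")
            j -= 1
    out_cols.reverse()
    out_seq.reverse()
    rows = ["".join(col[r] for col in out_cols) for r in range(n_rows)]
    return rows + ["".join(out_seq)]


def progressive_align(seqs):
    """Steps 1-4: start from one sequence and add the others one at a time."""
    alignment = [seqs[0]]
    for seq in seqs[1:]:
        alignment = add_sequence(alignment, seq)
    return alignment


print("\n".join(progressive_align(["GAAGTT", "GACTT", "GAACTG", "GTACTG"])))
```

Earlier rows only ever receive extra all-gap columns, which is the "fix that alignment" step: decisions made for the first sequences are never revisited.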

Page 21: CS 5263 Bioinformatics

Progressive Alignment

• When the evolutionary tree is known:
  – Align closest first, in the order of the tree

Example: order of alignments:
1. (x,y)
2. (z,w)
3. (xy, zw)

[Figure: tree with leaves x, y, z, w, in which x pairs with y and z pairs with w]

Page 22: CS 5263 Bioinformatics

Progressive Alignment: CLUSTALW

CLUSTALW: most popular multiple protein alignment

Algorithm:
1. Find all dij: alignment distance between xi and xj
   • High alignment score => short distance
2. Construct a tree (neighbor-joining hierarchical clustering; will discuss in future)
3. Align nodes in order of decreasing similarity

+ a large number of heuristics

Page 23: CS 5263 Bioinformatics

CLUSTALW example

• S1 ALSK

• S2 TNSD

• S3 NASK

• S4 NTSD

Page 24: CS 5263 Bioinformatics

CLUSTALW example

• S1 ALSK
• S2 TNSD
• S3 NASK
• S4 NTSD

Distance matrix:
      s1  s2  s3  s4
s1     0   9   4   7
s2         0   8   3
s3             0   7
s4                 0
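A distance matrix like this one determines the order in which sequences are grouped. The sketch below uses greedy average-linkage clustering as a simple stand-in; it is not neighbor joining, which is what the lecture names for CLUSTALW, and the two can disagree in general. The function name `merge_order` and the dictionary layout are assumptions for illustration.

```python
def merge_order(names, dist):
    """Greedy average-linkage clustering; returns the pairs of clusters merged, in order.

    dist maps frozenset({a, b}) to the distance between sequences a and b.
    """
    clusters = [(name,) for name in names]

    def d(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        # Closest pair of current clusters (ties broken by position).
        _, i, j = min((d(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return merges


names = ["s1", "s2", "s3", "s4"]
dist = {frozenset(pair): value for pair, value in {
    ("s1", "s2"): 9, ("s1", "s3"): 4, ("s1", "s4"): 7,
    ("s2", "s3"): 8, ("s2", "s4"): 3, ("s3", "s4"): 7,
}.items()}

for left, right in merge_order(names, dist):
    print(left, "+", right)
```

On this matrix the sketch groups s2 with s4 (distance 3), then s1 with s3 (distance 4), and finally joins the two pairs.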

Page 25: CS 5263 Bioinformatics

CLUSTALW example

(Sequences and distance matrix as on the previous slide.)

[Figure: guide tree built from the distance matrix; s1 joins s3, and s2 joins s4]

Page 26: CS 5263 Bioinformatics

CLUSTALW example

(Sequences, distance matrix and guide tree as on the previous slide.)

Align s1 and s3:
-ALSK
NA-SK

Page 27: CS 5263 Bioinformatics

CLUSTALW example

(Sequences, distance matrix and guide tree as on the previous slide.)

Align s1 and s3:
-ALSK
NA-SK

Align s2 and s4:
-TNSD
NT-SD

Page 28: CS 5263 Bioinformatics

CLUSTALW example

(Sequences, distance matrix and guide tree as on the previous slide.)

Align s1 and s3:
-ALSK
NA-SK

Align s2 and s4:
-TNSD
NT-SD

Align the two alignments:
-ALSK
-TNSD
NA-SK
NT-SD

Page 29: CS 5263 Bioinformatics

Iterative Refinement

Problems with progressive alignment:
• Depends on pair-wise alignments
• If sequences are very distantly related, much higher likelihood of errors
• Initial alignments are “frozen” even when new evidence comes

Example:

x: GAAGTT
y: GAC-TT    (frozen!)

z: GAACTG
w: GTACTG

Now clear: correct y should be GA-CTT

Page 30: CS 5263 Bioinformatics

Iterative Refinement

Algorithm (Barton-Sternberg):
1. Align most similar xi, xj
2. Align xk most similar to (xixj)
3. Repeat 2 until (x1…xN) are aligned
4. For j = 1 to N, remove xj and realign it to x1…xj-1xj+1…xN
5. Repeat 4 until convergence

(Steps 1-3 are progressive alignment.)

Page 31: CS 5263 Bioinformatics

Iterative Refinement (cont’d)

For each sequence y
1. Remove y
2. Realign y (while the rest is fixed)

[Figure: alignment path in the x, y, z cube; the x,z projection is fixed while y is allowed to vary]

Note: Guaranteed to converge (why?)
Running time: O(kNL^2), k: number of iterations
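The remove-and-realign loop can be sketched as follows. This is a minimal illustration, not the Barton-Sternberg implementation: it assumes simple match/mismatch pair scores, realigns a sequence by dynamic programming against the frozen remaining rows, and keeps a realignment only when the sum-of-pairs score strictly increases. All function names are hypothetical. Accepting only strict improvements of a score that is bounded above is one way to see why such a loop must stop.

```python
from itertools import combinations


def s(a, b):
    """Pair score: match +1, gap-gap 0, otherwise -1."""
    if a == "-" and b == "-":
        return 0
    return 1 if a == b else -1


def sp_score(msa):
    return sum(s(a, b) for col in zip(*msa) for a, b in combinations(col, 2))


def add_sequence(rows, seq):
    """Optimally align seq against frozen rows; the new row is returned last."""
    cols = ["".join(col) for col in zip(*rows)]
    gap_col = "-" * len(rows)
    cs = lambda col, ch: sum(s(ch, o) for o in col)
    n, m = len(seq), len(cols)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] + cs(gap_col, seq[i - 1])
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + cs(cols[j - 1], "-")
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + cs(cols[j - 1], seq[i - 1]),
                          F[i - 1][j] + cs(gap_col, seq[i - 1]),
                          F[i][j - 1] + cs(cols[j - 1], "-"))
    out, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + cs(cols[j - 1], seq[i - 1]):
            out.append(cols[j - 1] + seq[i - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + cs(gap_col, seq[i - 1]):
            out.append(gap_col + seq[i - 1])
            i -= 1
        else:
            out.append(cols[j - 1] + "-")
            j -= 1
    out.reverse()
    return ["".join(col[r] for col in out) for r in range(len(rows) + 1)]


def strip_empty_columns(rows):
    """Drop columns left all-gap after a row has been removed."""
    return ["".join(r) for r in zip(*[c for c in zip(*rows) if set(c) != {"-"}])]


def refine(msa, max_rounds=10):
    """Repeat: remove each row, realign it, keep the result if SP improves."""
    msa, best = list(msa), sp_score(msa)
    for _ in range(max_rounds):
        improved = False
        for j in range(len(msa)):
            rest = strip_empty_columns(msa[:j] + msa[j + 1:])
            new = add_sequence(rest, msa[j].replace("-", ""))
            candidate = new[:j] + [new[-1]] + new[j:-1]   # restore row order
            if sp_score(candidate) > best:
                msa, best, improved = candidate, sp_score(candidate), True
        if not improved:
            break
    return msa, best


start = ["GAAGTTA", "GAC-TTA", "GAACTGA", "GTACTGA"]
final, final_score = refine(start)
print(sp_score(start), "->", final_score)
print("\n".join(final))
```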

Page 32: CS 5263 Bioinformatics

Iterative Refinement

Example: align (x,y), (z,w), (xy, zw):
x: GAAGTTA
y: GAC-TTA
z: GAACTGA
w: GTACTGA

After realigning y:
x: GAAGTTA
y: G-ACTTA    (+ 3 matches)
z: GAACTGA
w: GTACTGA

Page 33: CS 5263 Bioinformatics

Iterative Refinement

• Example not handled well:

x:  GAAGTTA
y1: GAC-TTA
y2: GAC-TTA
y3: GAC-TTA
z:  GAACTGA
w:  GTACTGA

Realigning any single yi changes nothing

Page 34: CS 5263 Bioinformatics

Restricted MDP

• Similar to bounded DP in pair-wise alignment
1. Construct progressive multiple alignment m
2. Run MDP, restricted to radius R from m

Running time: O(2^N R^(N-1) L)

[Figure: band of radius R around the alignment path m in the x, y, z cube]

Page 35: CS 5263 Bioinformatics

Restricted MDP

x:  GAAGTTA
y1: GAC-TTA
y2: GAC-TTA
y3: GAC-TTA
z:  GAACTGA
w:  GTACTGA

• Within radius 1 of the optimal
• Restricted MDP will fix it.

Page 36: CS 5263 Bioinformatics

Other approaches

• Statistical learning methods
  – Profile Hidden Markov Models
  – Will discuss in future lectures
• Consistency-based methods
  – Still rely on pairwise alignment
    • But consider a third seq when aligning two seqs
    • If block A in seq x aligns to block B in seq y, and both align to block C in seq z, we have higher confidence that the alignment between A and B is reliable
    • Essentially: change the scoring system according to consistency
    • Then apply DP as in other approaches
  – Pioneered by a tool called T-Coffee

Page 37: CS 5263 Bioinformatics

Multiple alignment tools

• Clustal W (Thompson, 1994)
  – Most popular
• T-Coffee (Notredame, 2000)
  – Another popular tool
  – Consistency-based
  – Slower than Clustal W, but generally more accurate for more distantly related sequences
• MUSCLE (Edgar, 2004)
  – Iterative refinement
  – More efficient than most others
• DIALIGN (Morgenstern, 1998, 1999, 2005)
  – “local”
• Align-m (Walle, 2004)
  – “local”
• PROBCONS (Do, 2004)
  – Probabilistic consistency-based
  – Best accuracy on benchmarks
• ProDA (Phuong, 2006)
  – Allows repeated and shuffled regions

Page 38: CS 5263 Bioinformatics

In summary

• Multiple alignment scoring functions
  – Sum of pairs
  – Other functions exist, but are less used
• Multiple alignment algorithms:
  – MDP
    • Optimal
    • Too slow
    • Branch & bound doesn’t solve the problem entirely
  – Heuristic:
    • Progressive alignment: clustalW
    • Iterative refinement
    • Restricted MDP
    • Consistency-based