Rapid Global Alignments


Page 1: Rapid Global Alignments

Rapid Global Alignments

How to align genomic sequences in (more or less) linear time

Page 2: Rapid Global Alignments

Methods to CHAIN Local Alignments

Sparse Dynamic Programming: O(N log N)

Page 3: Rapid Global Alignments

The Problem: Find a Chain of Local Alignments

(x,y) → (x’,y’) requires x < x’ and y < y’

Each local alignment has a weight

FIND the chain with highest total weight

Page 4: Rapid Global Alignments

Sparse DP for rectangle chaining

• 1,…, N: rectangles

• (h_j, l_j): y-coordinates of rectangle j

• w(j): weight of rectangle j

• V(j): optimal score of chain ending in j

• L: list of triplets (l_j, V(j), j)

L is sorted by l_j. L is implemented as a balanced binary search tree.

[Figure: a rectangle with its y-coordinates h and l marked on the y-axis]

Page 5: Rapid Global Alignments

Sparse DP for rectangle chaining

Main idea:

• Sweep through x-coordinates

• To the right of b, anything chainable to a is chainable to b

• Therefore, if V(b) > V(a), rectangle a is “useless” – remove it

• In L, keep rectangles j sorted by increasing l_j; their V(j) values are then increasing as well

[Figure: two rectangles a and b with scores V(a) and V(b)]

Page 6: Rapid Global Alignments

Sparse DP for rectangle chaining

Go through rectangle x-coordinates, from left to right:

1. When on the leftmost end of rectangle i, compute V(i)

a. j: rectangle in L, with largest l_j < h_i

b. V(i) = w(i) + V(j), or V(i) = w(i) if no such j exists

2. When on the rightmost end of i, possibly store V(i) in L:

a. j: rectangle in L, with largest l_j ≤ l_i

b. If V(i) > V(j):

i. INSERT (l_i, V(i), i) in L

ii. REMOVE all (l_k, V(k), k) with V(k) ≤ V(i) & l_k ≥ l_i

[Figure: rectangles i and j]
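The sweep above can be sketched in Python as follows. This is a minimal illustration, not the slides' implementation: the tuple layout `(x_start, x_end, y_start, y_end, weight)`, the function name `chain_rectangles`, and the use of sorted Python lists with `bisect` in place of a balanced binary search tree are all assumptions made for brevity. List deletion is O(N), so a real tree would be needed for the O(N log N) bound. Here h is taken as the y-coordinate where a rectangle starts and l as the one where it ends.

```python
from bisect import bisect_left, bisect_right

def chain_rectangles(rects):
    """rects: list of (x_start, x_end, y_start, y_end, weight) tuples.
    Returns (best_score, indices_of_best_chain)."""
    if not rects:
        return 0, []
    # Two events per rectangle; at equal x, starts (0) sort before ends (1),
    # so a rectangle ending at x cannot chain to one starting at the same x.
    events = []
    for i, (x0, x1, _h, _l, _w) in enumerate(rects):
        events.append((x0, 0, i))
        events.append((x1, 1, i))
    events.sort()

    ls, vs, ids = [], [], []          # the list L, kept sorted by l (and by V)
    V = [0] * len(rects)
    prev = [None] * len(rects)        # back-pointers for recovering the chain
    for _x, kind, i in events:
        _, _, h, l, w = rects[i]
        if kind == 0:                 # leftmost end: compute V(i)
            p = bisect_left(ls, h)    # entries before p have l_j < h_i
            if p > 0:
                V[i] = w + vs[p - 1]
                prev[i] = ids[p - 1]
            else:
                V[i] = w
        else:                         # rightmost end: maybe store V(i) in L
            q = bisect_right(ls, l)   # entry q-1 has the largest l_j <= l_i
            if q == 0 or V[i] > vs[q - 1]:
                p = bisect_left(ls, l)
                end = p
                # Dominated entries (l_k >= l_i, V(k) <= V(i)) are consecutive.
                while end < len(ls) and vs[end] <= V[i]:
                    end += 1
                ls[p:end] = [l]
                vs[p:end] = [V[i]]
                ids[p:end] = [i]

    best = max(range(len(rects)), key=lambda i: V[i])
    score, chain = V[best], []
    while best is not None:
        chain.append(best)
        best = prev[best]
    return score, chain[::-1]
```

Calling `chain_rectangles` on a handful of made-up rectangles returns the total weight of the best chain together with the rectangle indices in chain order.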

Page 7: Rapid Global Alignments

Example

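[Figure: five rectangles in the x-y plane, labeled with their weights (1: 5, 2: 6, 3: 3, 4: 4, 5: 2); axis ticks at 2, 5, 6, 9, 10, 11, 12, 14, 15, 16]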

Page 8: Rapid Global Alignments

Time Analysis

1. Sorting the x-coords takes O(N log N)

2. Going through x-coords: N steps

3. Each of N steps requires O(log N) time:

• Searching L takes log N

• Inserting to L takes log N

• All deletions are consecutive, so log N per deletion

• Each element is deleted at most once: N log N for all deletions

• Recall that INSERT, DELETE, SUCCESSOR, take O(log N) time in a balanced binary search tree

Page 9: Rapid Global Alignments

Putting it All Together: Fast Global Alignment Algorithms

1. FIND local alignments

2. CHAIN local alignments

          FIND          CHAIN
GLASS:    k-mers        hierarchical DP
MumMer:   Suffix Tree   sparse DP
Avid:     Suffix Tree   hierarchical DP
LAGAN:    CHAOS         sparse DP

Page 10: Rapid Global Alignments

LAGAN: Pairwise Alignment

1. FIND local alignments

2. CHAIN local alignments

3. DP restricted around chain
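To illustrate what "DP restricted around chain" buys, here is a hedged sketch of a banded global alignment score. It is a simplification, not LAGAN's procedure: the band is taken around the main diagonal (cells with |i - j| ≤ radius) rather than around a chain of anchors, and the function name and scoring parameters (match +1, mismatch -1, gap -2) are assumptions.

```python
def banded_global_score(a, b, radius, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score computed only for cells with |i - j| <= radius."""
    n, m = len(a), len(b)
    if abs(n - m) > radius:
        raise ValueError("band too narrow to reach the final cell")
    neg_inf = float("-inf")
    # Row 0 of the DP matrix, restricted to the band; rows are dicts keyed by j.
    prev = {j: j * gap for j in range(0, min(m, radius) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - radius), min(m, i + radius) + 1):
            best = neg_inf
            if j > 0 and (j - 1) in prev:      # diagonal: align a[i-1] with b[j-1]
                s = match if a[i - 1] == b[j - 1] else mismatch
                best = max(best, prev[j - 1] + s)
            if j in prev:                      # vertical: a[i-1] against a gap
                best = max(best, prev[j] + gap)
            if j > 0 and (j - 1) in cur:       # horizontal: b[j-1] against a gap
                best = max(best, cur[j - 1] + gap)
            cur[j] = best
        prev = cur
    return prev[m]
```

Only O(radius) cells are filled per row, so the work is roughly O(radius × length) instead of quadratic. The result equals the unrestricted optimum whenever an optimal path stays inside the band.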

Page 11: Rapid Global Alignments

LAGAN

1. Find local alignments

2. Chain: O(N log N) L.I.S. (longest increasing subsequence)

3. Restricted DP
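The O(N log N) longest-increasing-subsequence step can be sketched as below. This is a minimal unweighted version under stated assumptions: anchors are reduced to single points `(x, y)`, every anchor counts equally, and the function name is hypothetical. The weighted rectangle chaining described earlier is the more general form.

```python
from bisect import bisect_left

def longest_increasing_chain(anchors):
    """anchors: list of (x, y) points.
    Returns a longest chain with x and y both strictly increasing."""
    # Sort by x; for equal x, larger y first so equal-x anchors never chain.
    pts = sorted(anchors, key=lambda p: (p[0], -p[1]))
    tails_y = []      # tails_y[k]: smallest final y over chains of length k+1
    tails_idx = []    # index in pts of the anchor achieving tails_y[k]
    prev = [None] * len(pts)
    for i, (_x, y) in enumerate(pts):
        k = bisect_left(tails_y, y)            # O(log N) search
        prev[i] = tails_idx[k - 1] if k > 0 else None
        if k == len(tails_y):
            tails_y.append(y)
            tails_idx.append(i)
        else:
            tails_y[k] = y
            tails_idx[k] = i
    chain = []
    i = tails_idx[-1] if tails_idx else None
    while i is not None:                       # follow back-pointers
        chain.append(pts[i])
        i = prev[i]
    return chain[::-1]
```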

Page 12: Rapid Global Alignments

LAGAN: recursive call

• What if a box is too large? Recursive application of LAGAN, with a more sensitive word search

Page 13: Rapid Global Alignments

A trick to save on memory

“necks” have tiny tracebacks

…only store tracebacks

Page 14: Rapid Global Alignments

Multiple Sequence Alignments

Page 15: Rapid Global Alignments
Page 16: Rapid Global Alignments
Page 17: Rapid Global Alignments

Overview

• Definition

• Scoring Schemes

• Algorithms

Page 18: Rapid Global Alignments

Definition

• Given N sequences x1, x2,…, xN: Insert gaps (-) in each sequence xi, such that

• All sequences have the same length L

• Score of the global map is maximum

• A faint similarity between two sequences becomes significant if present in many

• Multiple alignments can help improve the pairwise alignments

Page 19: Rapid Global Alignments

Scoring Function

• Ideally: Find alignment that maximizes probability that sequences evolved from common ancestor, according to some phylogenetic model

• More on phylogenetic models later

[Figure: tree over sequences x, y, z, w, v, with a node marked "?"]

Page 20: Rapid Global Alignments

Scoring Function

• A comprehensive model would have too many parameters, too inefficient to optimize

• Possible simplifications

Ignore phylogenetic tree

Statistically independent columns:

$S(m) = G(m) + \sum_i S(m_i)$

m: alignment matrix

G: function penalizing gaps

Page 21: Rapid Global Alignments

Scoring Function: Sum Of Pairs

Definition: Induced pairwise alignment

A pairwise alignment induced by the multiple alignment

Example:

x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Induces:

x: ACGCGG-C     x: AC-GCGG-C     y: AC-GCGAG
y: ACGC-GAG     z: GCCGC-GAG     z: GCCGCGAG

Page 22: Rapid Global Alignments

Sum Of Pairs (cont’d)

• The sum-of-pairs score of an alignment is the sum of the scores of all induced pairwise alignments

$S(m) = \sum_{k<l} s(m^k, m^l)$

$s(m^k, m^l)$: score of induced alignment (k,l)
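As a concrete illustration of induced pairwise alignments and the sum-of-pairs score, here is a small sketch. The scoring values (match +1, mismatch -1, gap -2) and the function names are assumptions chosen for the example; the slides do not fix a particular pairwise scoring scheme.

```python
from itertools import combinations

def induced_pair(a, b):
    """Project a multiple alignment onto two rows, dropping gap-gap columns."""
    cols = [(p, q) for p, q in zip(a, b) if not (p == "-" and q == "-")]
    return "".join(p for p, _ in cols), "".join(q for _, q in cols)

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the pairwise alignment induced by rows a and b (assumed scheme)."""
    score = 0
    for p, q in zip(*induced_pair(a, b)):
        if p == "-" or q == "-":
            score += gap
        elif p == q:
            score += match
        else:
            score += mismatch
    return score

def sum_of_pairs(rows):
    """Sum of induced pairwise scores over all pairs of rows k < l."""
    return sum(pair_score(a, b) for a, b in combinations(rows, 2))

rows = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]   # x, y, z from the example
print(induced_pair(rows[0], rows[1]))            # the (x, y) induced alignment
print(sum_of_pairs(rows))
```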

Page 23: Rapid Global Alignments

Sum Of Pairs (cont’d)

• Heuristic way to incorporate evolution tree:

[Figure: evolutionary tree with leaves Human, Mouse, Chicken, Duck]

• Weighted SOP:

$S(m) = \sum_{k<l} w_{kl}\, s(m^k, m^l)$

$w_{kl}$: weight decreasing with distance

Page 24: Rapid Global Alignments

Consensus

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC

Consensus:
CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

• Find optimal consensus string m* to maximize

$S(m) = \sum_i s(m^*, m^i)$

$s(m^k, m^l)$: score of pairwise alignment (k,l)

Page 25: Rapid Global Alignments

Multiple Sequence Alignments

Algorithms

Page 26: Rapid Global Alignments

1. Multidimensional Dynamic Programming

Generalization of Needleman-Wunsch:

$S(m) = \sum_i S(m_i)$

(sum of column scores)

$F(i_1, i_2, \ldots, i_N) = \max_{\text{all neighbors of cube}} \big( F(\text{nbr}) + S(\text{nbr}) \big)$

Page 27: Rapid Global Alignments

• Example: in 3D (three sequences):

• 7 neighbors/cell

F(i,j,k) = max{
  F(i-1,j-1,k-1) + S(x_i, x_j, x_k),
  F(i-1,j-1,k  ) + S(x_i, x_j, -  ),
  F(i-1,j  ,k-1) + S(x_i, -,   x_k),
  F(i-1,j  ,k  ) + S(x_i, -,   -  ),
  F(i  ,j-1,k-1) + S(-,   x_j, x_k),
  F(i  ,j-1,k  ) + S(-,   x_j, -  ),
  F(i  ,j  ,k-1) + S(-,   -,   x_k) }
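The three-sequence recurrence can be written out directly, as in the sketch below. The column score used here is an assumed sum-of-pairs scheme (match +1, mismatch -1, residue against gap -2, gap against gap 0), and the names `column_score` and `align3_score` are hypothetical. Only the optimal score is computed, not the alignment itself.

```python
from itertools import combinations, product

def column_score(col, match=1, mismatch=-1, gap=-2):
    """Assumed sum-of-pairs score of one alignment column (a 3-tuple)."""
    total = 0
    for p, q in combinations(col, 2):
        if p == "-" and q == "-":
            continue
        total += gap if "-" in (p, q) else (match if p == q else mismatch)
    return total

def align3_score(x, y, z):
    """Optimal 3-way global alignment score by filling the full 3D matrix."""
    # The 2^3 - 1 = 7 neighbor moves: each sequence advances (1) or gaps (0).
    moves = [d for d in product((0, 1), repeat=3) if any(d)]
    F = {(0, 0, 0): 0}
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            for k in range(len(z) + 1):
                if (i, j, k) == (0, 0, 0):
                    continue
                best = float("-inf")
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue
                    col = (x[i - 1] if di else "-",
                           y[j - 1] if dj else "-",
                           z[k - 1] if dk else "-")
                    best = max(best, F[pi, pj, pk] + column_score(col))
                F[i, j, k] = best
    return F[len(x), len(y), len(z)]
```

The triple loop visits every cell of the cube and tries all seven neighbors, which is the source of the exponential running time.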

1. Multidimensional Dynamic Programming

Page 28: Rapid Global Alignments

Running Time:

1. Size of matrix: $L^N$;

Where L = length of each sequence

N = number of sequences

2. Neighbors/cell: $2^N - 1$

Therefore: $O(2^N L^N)$
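To get a feel for this bound, a rough worked count helps. The values of N and L below are arbitrary illustrative choices, and one "cell update" is assumed to be one neighbor evaluation:

```latex
\begin{aligned}
N = 3,\ L = 1000 &: \quad (2^{3} - 1) \cdot 1000^{3} = 7 \times 10^{9} \text{ cell updates} \\
N = 10,\ L = 100 &: \quad (2^{10} - 1) \cdot 100^{10} = 1023 \times 10^{20} \approx 10^{23} \text{ cell updates}
\end{aligned}
```

Three sequences of moderate length are already expensive, and ten short sequences are out of reach, which motivates the heuristics that follow.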

1. Multidimensional Dynamic Programming

Page 29: Rapid Global Alignments

2. Progressive Alignment

• Multiple Alignment is NP-complete

• Most used heuristic: Progressive Alignment

Algorithm:

1. Align two of the sequences xi, xj

2. Fix that alignment

3. Align a third sequence xk to the alignment xi,xj

4. Repeat until all sequences are aligned

Running Time: $O(N L^2)$

Page 30: Rapid Global Alignments

2. Progressive Alignment

• When evolutionary tree is known:

Align closest first, in the order of the tree

Example:

Order of alignments:

1. (x,y)

2. (z,w)

3. (xy, zw)

[Figure: tree with leaves x, y, z, w]
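The order of alignments can be read off a known tree with a post-order traversal, as in this small sketch. The representation of the tree as nested 2-tuples of sequence names and the function name `merge_order` are assumptions for illustration.

```python
def merge_order(tree):
    """tree: nested 2-tuples with sequence names (str) at the leaves.
    Returns the list of pairwise alignment steps, children before parents."""
    order = []

    def visit(node):
        if isinstance(node, str):      # a leaf: a single sequence
            return node
        left, right = node
        a, b = visit(left), visit(right)
        order.append((a, b))           # align the two sub-alignments
        return a + b                   # name of the merged alignment

    visit(tree)
    return order

print(merge_order((("x", "y"), ("z", "w"))))
```

For the tree in the example this yields (x,y), then (z,w), then (xy, zw).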

Page 31: Rapid Global Alignments

CLUSTALW: progressive alignment

CLUSTALW: most popular multiple protein alignment

Algorithm:

1. Find all dij: alignment dist (xi, xj)

2. Construct a tree

(Neighbor-joining hierarchical clustering)

3. Align nodes in order of decreasing similarity

+ a large number of heuristics

Page 32: Rapid Global Alignments

CLUSTALW & the CINEMA viewer

Page 33: Rapid Global Alignments

MLAGAN: progressive alignment of DNA

Given N sequences, phylogenetic tree

Align pairwise, in order of the tree (LAGAN)

[Figure: phylogenetic tree with leaves Human, Baboon, Mouse, Rat]

Page 34: Rapid Global Alignments

MLAGAN: main steps

Given a collection of sequences, and a phylogenetic tree

1. Find local alignments for every pair of sequences x, y

2. Find anchors between every pair of sequences, similar to LAGAN anchoring

3. Progressive alignment

• Multi-Anchoring based on reconciling the pairwise anchors

• LAGAN-style limited-area DP

4. Optional refinement steps

Page 35: Rapid Global Alignments

MLAGAN: multi-anchoring

To anchor the (X/Y), and (Z) alignments:

[Figure: pairwise anchors X-Z and Y-Z shown against the X/Y alignment and Z]

Page 36: Rapid Global Alignments

Heuristics to improve multiple alignments

• Iterative refinement schemes

• A*-based search

• Consistency

• Simulated Annealing

• …

Page 37: Rapid Global Alignments

Iterative Refinement

One problem of progressive alignment:

• Initial alignments are “frozen” even when new evidence comes

Example:

x: GAAGTT
y: GAC-TT

z: GAACTG
w: GTACTG

Frozen!

Now it is clear that the correct alignment is y = GA-CTT

Page 38: Rapid Global Alignments

Iterative Refinement

Algorithm (Barton-Sternberg):

1. Align most similar xi, xj

2. Align xk most similar to (xixj)

3. Repeat 2 until (x1…xN) are aligned

4. For j = 1 to N, remove xj, and realign to x1…xj-1xj+1…xN

5. Repeat 4 until convergence

5. Repeat 4 until convergence

Note: Guaranteed to converge

Page 39: Rapid Global Alignments

Iterative Refinement

For each sequence y:

1. Remove y

2. Realign y (while rest fixed)

[Figure: alignment lattice with axes x, y, z; the x,z projection is fixed while y is allowed to vary]

Page 40: Rapid Global Alignments

Iterative Refinement

Example: align (x,y), (z,w), (xy, zw):

x: GAAGTTA
y: GAC-TTA
z: GAACTGA
w: GTACTGA

After realigning y:

x: GAAGTTA
y: G-ACTTA    + 3 matches
z: GAACTGA
w: GTACTGA
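The remove-and-realign loop can be sketched on this example as follows. This is a deliberately simplified stand-in, not the Barton-Sternberg procedure itself: realignment only tries placing the removed row's gaps as one contiguous block at each position, where a real implementation would align the sequence against a profile of the remaining rows. The scoring values and function names are assumptions as well.

```python
def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Assumed pairwise score of two equal-length rows; gap-gap columns ignored."""
    score = 0
    for p, q in zip(a, b):
        if p == "-" and q == "-":
            continue
        score += gap if "-" in (p, q) else (match if p == q else mismatch)
    return score

def realign_one(rows, j):
    """Best placement of row j's gaps (as one block) against the other rows."""
    others = rows[:j] + rows[j + 1:]
    seq = rows[j].replace("-", "")
    n_gaps = len(rows[j]) - len(seq)
    candidates = [seq[:p] + "-" * n_gaps + seq[p:] for p in range(len(seq) + 1)]
    candidates.insert(0, rows[j])      # listed first, so ties keep the current row
    return max(candidates, key=lambda c: sum(pair_score(c, o) for o in others))

def refine(rows, max_rounds=100):
    """Repeat remove-and-realign passes until no row changes."""
    rows = list(rows)
    for _ in range(max_rounds):
        changed = False
        for j in range(len(rows)):
            best = realign_one(rows, j)
            if best != rows[j]:        # only strictly better rows are accepted
                rows[j] = best
                changed = True
        if not changed:
            break
    return rows

print(refine(["GAAGTTA", "GAC-TTA", "GAACTGA", "GTACTGA"]))
```

Because a row is replaced only when its score against the others strictly improves, the total sum-of-pairs score rises with every accepted change, so the loop must stop.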

Page 41: Rapid Global Alignments

Iterative Refinement

Example not handled well:

x:  GAAGTTA
y1: GAC-TTA
y2: GAC-TTA
y3: GAC-TTA

z:  GAACTGA
w:  GTACTGA

Realigning any single yi changes nothing

Page 42: Rapid Global Alignments

Restricted MDP

Here is another way to improve a multiple alignment:

1. Construct progressive multiple alignment m

2. Run MDP, restricted to radius R from m

Running Time: $O(2^N R^{N-1} L)$

Page 43: Rapid Global Alignments

Restricted MDP

Run MDP, restricted to radius R from m

[Figure: alignment lattice with axes x, y, z and the restricted region around m]

Running Time: $O(2^N R^{N-1} L)$

Page 44: Rapid Global Alignments

Restricted MDP

x:  GAAGTTA
y1: GAC-TTA
y2: GAC-TTA
y3: GAC-TTA

z:  GAACTGA
w:  GTACTGA

• Within radius 1 of the optimal, so restricted MDP will fix it.

Page 45: Rapid Global Alignments

Optional refinement steps in MLAGAN

• Limited-area iterative refinement

• Radius-r 3-sequence refinement on each node of the tree