Aligning Alignments Exactly By John Kececioglu, Dean Starrett CS Dept. Univ. of Arizona Appeared in...

Aligning Alignments Exactly. By John Kececioglu, Dean Starrett, CS Dept., Univ. of Arizona. Appeared in 8th ACM RECOMB 2004. Presented by Jie Meng.
  • Date posted: 21-Dec-2015


Page 1: Aligning Alignments Exactly By John Kececioglu, Dean Starrett CS Dept. Univ. of Arizona Appeared in 8th ACM RECOMB 2004, Presented by Jie Meng.

Aligning Alignments Exactly

By John Kececioglu, Dean Starrett
CS Dept. Univ. of Arizona

Appeared in 8th ACM RECOMB 2004

Presented by Jie Meng

Page 2:

• Background
• Definition
• Hardness
• An exponential-time algorithm

Page 3:

Alignments

Given two (DNA or protein) sequences, an alignment places them against each other so that similar parts line up as closely as possible, for example:

A T - C - T C G C T
- T G - A T G - A T

There are four kinds of aligned columns:

– Match

– Insertion

– Deletion

– Mismatch

Page 4:

Scoring Alignments

There are four types of aligned columns:

– Match: Score_match = 0.

– Mismatch: Score_mismatch ≥ 0.

– Insertion: Score_insertion ≥ 0.

– Deletion: Score_deletion ≥ 0.

The score of an alignment is defined as the sum of the scores of its aligned columns.

The goal is to minimize the score.
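The column-by-column scoring can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function names, the cost values (0 for a match, 1 for everything else), and the convention that a space in the top row counts as an insertion are all assumptions made for the example.

```python
def column_kind(a: str, b: str) -> str:
    """Classify one aligned column; '-' denotes a space."""
    if a == "-" and b == "-":
        raise ValueError("a pairwise alignment has no all-space column")
    if a == "-":
        return "insertion"   # assumed convention: space in the top row
    if b == "-":
        return "deletion"    # assumed convention: space in the bottom row
    return "match" if a == b else "mismatch"


def alignment_score(top: str, bottom: str, costs: dict) -> int:
    """Sum the per-column costs of a pairwise alignment."""
    if len(top) != len(bottom):
        raise ValueError("rows must have the same length")
    return sum(costs[column_kind(a, b)] for a, b in zip(top, bottom))


# Illustrative costs (assumed): match 0, every other column type 1.
COSTS = {"match": 0, "mismatch": 1, "insertion": 1, "deletion": 1}

print(alignment_score("AT-C-TCGCT", "-TG-ATG-AT", COSTS))  # 7
```

With these assumed costs the example alignment has three matches and seven other columns, so its score is 7; a better alignment of the same two sequences would have a lower score.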

Page 5:

Gap-cost

We can refine the indel score into an open cost and an extension cost: a gap of size x then costs open + x * extension instead of x * indel.

AT----CGCTTCAT
-TGCAT--AT-----

open + 4 * extension
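As a small sketch of this gap-cost rule, the function below charges open + x * extension for every maximal run of x spaces in one row. The function name and the numeric values (open = 2, extension = 1) are assumptions chosen only for illustration.

```python
import re


def row_gap_cost(row: str, open_cost: int, extension_cost: int) -> int:
    """Charge open + x * extension for each maximal run of x spaces ('-') in a row."""
    return sum(open_cost + len(run) * extension_cost
               for run in re.findall(r"-+", row))


# One gap of size 4: 2 + 4 * 1 = 6 under the assumed costs.
print(row_gap_cost("AT----CGCTTCAT", open_cost=2, extension_cost=1))  # 6
```

Under a plain indel score of 1 per space the same gap would cost 4; the open cost is what makes one long gap cheaper than several short ones of the same total length.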

Page 6:

Multiple Alignments

In general we also need to compare multiple sequences and find their similarities.

Multiple alignment generalizes the alignment idea to handle many sequences.

AT-C-TCGAT
-TGCAT--AT
ATCCA-CGCT

Page 7:

Sum-of-Pairs (SP) Score

Given a multiple alignment, the sum-of-pairs (SP) score is the sum of the induced pairwise alignment scores over every pair of rows in the alignment.

AT-C-TCGAT
-TGCAT--AT
ATCCA-CGCT

AT-C-TCGAT     -TGCAT--AT     AT-C-TCGAT
-TGCAT--AT  +  ATCCA-CGCT  +  ATCCA-CGCT
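The SP score of the three-row example can be computed with a short Python sketch. The helper names and the unit costs (mismatch = 1, indel = 1, no gap-open cost) are assumptions for illustration; columns in which both rows of a pair hold a space are dropped before scoring, which is what "induced" means here.

```python
from itertools import combinations


def induced_pair(row_a: str, row_b: str):
    """Project two rows out of a multiple alignment, dropping columns where both are spaces."""
    cols = [(a, b) for a, b in zip(row_a, row_b) if not (a == "-" and b == "-")]
    return "".join(a for a, _ in cols), "".join(b for _, b in cols)


def pair_score(row_a: str, row_b: str, mismatch: int = 1, indel: int = 1) -> int:
    """Score a pairwise alignment column by column (match costs 0)."""
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += indel
        elif a != b:
            score += mismatch
    return score


def sp_score(rows, mismatch: int = 1, indel: int = 1) -> int:
    """Sum the induced pairwise scores over all pairs of rows."""
    return sum(pair_score(*induced_pair(r1, r2), mismatch, indel)
               for r1, r2 in combinations(rows, 2))


rows = ["AT-C-TCGAT", "-TGCAT--AT", "ATCCA-CGCT"]
print(sp_score(rows))  # 5 + 4 + 6 = 15
```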

Page 8:

BAD NEWS

Multiple alignment is NP-hard.

One approach is to approximate the optimal value:

Progressive alignment

A problem arises naturally: Aligning Alignments

Page 9:

Aligning Alignments

Let S be a collection of strings s1, s2, s3, …, sk over alphabet Σ;

An alignment of S is a matrix A with k rows such that:
i) Each entry is either a letter or a space;
ii) No column is all spaces;
iii) Reading across row i and removing the spaces, we get string si;

As before, we have three types of column score: match, mismatch, and indel (insertion or deletion);
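The three conditions in the definition translate directly into a validity check. The sketch below is an illustration only; the function name, the use of '-' for a space, and the optional `alphabet` argument are assumptions.

```python
def is_alignment(matrix, strings, alphabet=None, space="-"):
    """Check conditions i) to iii) for `matrix` being an alignment of `strings`."""
    if len(matrix) != len(strings):
        return False
    width = len(matrix[0]) if matrix else 0
    if any(len(row) != width for row in matrix):
        return False  # a matrix needs rows of equal length
    # i) every entry is a letter of the alphabet or a space
    if alphabet is not None:
        allowed = set(alphabet) | {space}
        if any(ch not in allowed for row in matrix for ch in row):
            return False
    # ii) no column consists only of spaces
    if any(all(ch == space for ch in col) for col in zip(*matrix)):
        return False
    # iii) removing the spaces from row i yields string s_i
    return all(row.replace(space, "") == s for row, s in zip(matrix, strings))


A = ["AT-C-TCGAT", "-TGCAT--AT", "ATCCA-CGCT"]
print(is_alignment(A, ["ATCTCGAT", "TGCATAT", "ATCCACGCT"], alphabet="ACGT"))  # True
```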

Page 10:

Aligning Alignments

Given two alignments, A with k sequences of length N and B with l sequences of length M, we want to align the columns of A and B;

AT-C-TCGAT
-TGCAT--AT
ATCCA-CGAT

CT-ATTGGAT
-TTAT-G--T
CTTA-GGGAT

Page 11:

Aligning Alignments

In other words, we treat the columns of A and B as single letters, just as when aligning two sequences.

CTGT-T

AT-TGT

C-TG-T--T

-AT--T-GT

Page 12:

Aligning Alignments

The score function is still sum-of-pairs, namely

$$\sum_{i=1}^{k} \sum_{j=1}^{l} D(A_i', B_j')$$

Note that the induced alignment of Ai’ and Bj’ may contain columns with a space in both sequences, so we simply remove those columns:

Ai’: a----aa-a

Bj’: aaa-a-a-a

Page 13:

Aligning Alignments

Without gap cost (each space scored independently), aligning alignments is solvable in polynomial time. We can apply dynamic programming as we did for aligning two sequences; the only difference is that we align columns.
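A minimal sketch of this column-wise dynamic program follows, assuming unit costs (mismatch = 1, indel = 1) and no gap-open cost. The function names are hypothetical. Only the cross pairs between a row of A and a row of B are scored, because the pairs inside A and inside B are fixed by the input.

```python
def column_pair_cost(col_a, col_b, mismatch=1, indel=1):
    """Sum-of-pairs cost of placing one column of A against one column of B."""
    cost = 0
    for a in col_a:
        for b in col_b:
            if a == "-" and b == "-":
                continue          # both spaces: dropped from the induced pair
            if a == "-" or b == "-":
                cost += indel
            elif a != b:
                cost += mismatch
    return cost


def align_alignments(A, B, mismatch=1, indel=1):
    """Minimum cross sum-of-pairs cost of aligning the columns of A and B."""
    cols_a, cols_b = list(zip(*A)), list(zip(*B))
    gap_a = ("-",) * len(A)       # an all-space column standing in for A
    gap_b = ("-",) * len(B)       # an all-space column standing in for B
    n, m = len(cols_a), len(cols_b)
    C = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        C[i][0] = C[i - 1][0] + column_pair_cost(cols_a[i - 1], gap_b, mismatch, indel)
    for j in range(1, m + 1):
        C[0][j] = C[0][j - 1] + column_pair_cost(gap_a, cols_b[j - 1], mismatch, indel)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            C[i][j] = min(
                C[i - 1][j - 1] + column_pair_cost(cols_a[i - 1], cols_b[j - 1], mismatch, indel),
                C[i - 1][j] + column_pair_cost(cols_a[i - 1], gap_b, mismatch, indel),
                C[i][j - 1] + column_pair_cost(gap_a, cols_b[j - 1], mismatch, indel),
            )
    return C[n][m]


print(align_alignments(["AC", "A-"], ["AC"]))  # 1
```

When A and B each contain a single row, this reduces to ordinary two-sequence alignment. Each table cell costs a number of operations proportional to the product of the row counts of A and B, so the whole table takes time polynomial in the input size.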

Page 14:

Aligning Alignments

With gap cost, this problem is NP-complete. We can use a reduction from the MAX-CUT problem.

MAX-CUT: Given a graph G = (V, E) and an integer c, ask whether there is a partition of V, V = L ∪ R with L ∩ R = ∅, such that the size of the cut is no less than c;

The cut is the set of edges that have one end vertex in L and the other in R;
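To make the MAX-CUT decision problem concrete, here is a brute-force sketch that tries every partition. It takes exponential time and is meant only to illustrate the definition; the function names are hypothetical.

```python
from itertools import product


def cut_size(edges, left):
    """Number of edges with exactly one end vertex in `left`."""
    return sum(1 for u, v in edges if (u in left) != (v in left))


def has_cut_of_size(vertices, edges, c):
    """Decide MAX-CUT by trying all 2^|V| partitions V = L ∪ R."""
    vertices = list(vertices)
    for bits in product([False, True], repeat=len(vertices)):
        left = {v for v, in_left in zip(vertices, bits) if in_left}
        if cut_size(edges, left) >= c:
            return True
    return False


triangle = [(1, 2), (2, 3), (1, 3)]
print(has_cut_of_size([1, 2, 3], triangle, 2))  # True
print(has_cut_of_size([1, 2, 3], triangle, 3))  # False
```

In a triangle any partition separates at most two of the three edges, which is why the answer flips between c = 2 and c = 3.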

Page 15:

NP-hardness

• Given an instance of MAX-CUT: G = (V, E) with V = {v1, v2, …, vn} and E = {e1, e2, …, em}, and an integer c;

• we construct two multiple alignments A and B over the alphabet {0, 1}: both A and B have m edge rows and k dummy rows, each edge row corresponding to an edge; A has 2n columns, every two consecutive columns corresponding to a vertex; B has 3n columns, every three consecutive columns corresponding to a vertex;

Page 16:

NP-hardness

• The dummy rows in A are (0-)^n; the dummy rows in B are (0--)^n;

• For the edge rows in A: in the row for edge e = (vi, vj), the column groups for vi and vj each contain the substring “-1”, with spaces elsewhere;

• For the edge rows in B: in the row for edge e = (vi, vj) with i < j, the column group for vi contains the substring “010” and the column group for vj contains the substring “-10”

Page 17:

NP-hardness

• For simplicity, let the score for a match be 0,

the score for a mismatch be 1,

the gap open cost be 2, and the gap extension cost be 1,

and ask whether there is an alignment whose score is less than d-c;

This gives an instance of Aligning Alignments.

Page 18:

HOMEWORK4

• Given a set of multiple alignments {A1, A2, …, An}, where each Ai is a multiple alignment with ki sequences: without gap cost, is the problem of multiple alignment on {A1, A2, …, An} hard or easy when we align them by the method in this paper, i.e. by aligning columns? If hard, prove it; otherwise, give an efficient algorithm and prove its complexity and correctness.

Page 19:

Exact Algorithm

The basic idea is still dynamic programming. We have to remember extra information in a set, the so-called shape, S: for each row of a multiple alignment, we record the column of its right-most letter.
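Read literally, the shape of an alignment prefix can be sketched as the list of right-most letter positions, one per row. The function below follows only that literal reading of the slide; the name, the 0-based indexing, and the use of -1 for a row with no letters are assumptions, and the paper's own representation may differ.

```python
def rightmost_letter_columns(rows, space="-"):
    """For each row, the 0-based column of its right-most letter (-1 if it has none)."""
    return [len(row.rstrip(space)) - 1 for row in rows]


prefix = ["AT-C-", "-TG--", "ATCCA"]
print(rightmost_letter_columns(prefix))  # [3, 2, 4]
```

A row whose right-most letter lies before the last column currently ends in a space, so appending another space to it extends an existing gap rather than opening a new one. This is the kind of information the dynamic program has to carry along once gaps have an open cost.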

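Page 20:

Exact Algorithm

$$
S(i,j) =
\begin{cases}
\bigl(S(i-1,j-1) \oplus (A[i],B[j])\bigr) \,\cup\, \bigl(S(i,j-1) \oplus (-,B[j])\bigr) \,\cup\, \bigl(S(i-1,j) \oplus (A[i],-)\bigr), & \text{otherwise} \\
\{\varnothing\}, & i = 0 \text{ and } j = 0 \\
\{\}, & i < 0 \text{ or } j < 0
\end{cases}
$$

Here S ⊕ (x, y) stands for the set of shapes obtained by appending the column pair (x, y) to each shape in S. The operator and comparison symbols did not survive in the transcript, so ⊕, ∪, = and < are reconstructions.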

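Page 21:

Exact Algorithm

$$
C(i,j,t) = \min
\begin{cases}
\min\limits_{s \in S(i-1,j-1)\ \&\ s \oplus (A[i],B[j]) = t} \Bigl\{ C(i-1,j-1,s) + \mathrm{open} \cdot g(A[i],B[j],s) + \sum\limits_{p \in A,\ q \in B} D(A_p[i], B_q[j]) \Bigr\} \\[1ex]
\min\limits_{s \in S(i,j-1)\ \&\ s \oplus (-,B[j]) = t} \Bigl\{ C(i,j-1,s) + \mathrm{open} \cdot g(-,B[j],s) + \mathrm{extension} \cdot k \cdot |B[j]| \Bigr\} \\[1ex]
\min\limits_{s \in S(i-1,j)\ \&\ s \oplus (A[i],-) = t} \Bigl\{ C(i-1,j,s) + \mathrm{open} \cdot g(A[i],-,s) + \mathrm{extension} \cdot l \cdot |A[i]| \Bigr\}
\end{cases}
$$

Where g(A[i], B[j], s) means the total number of gaps initiated by appending columns A[i] and B[j] onto an alignment that ends in shape s; k and l are the numbers of rows of A and B; and s ⊕ (x, y) stands for shape s extended by the column pair (x, y) (the operator symbol did not survive in the transcript, so ⊕ is a reconstruction).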

Page 22:

Exact Algorithm

The optimum value is

$$\min_{s \in S[m,n]} \{\, C(m,n,s) \,\}$$

The problem here is that the number of shapes may be too large, so in the worst case the time and space complexity is

[time and space bounds: formulas not recoverable from the transcript]

Page 23:

Any Questions?

423B

[email protected]