Chaining algorithms for multiple genome comparison · 2017. 2. 11. · 322 M.I. Abouelhoda, E....

21
Journal of Discrete Algorithms 3 (2005) 321–341 www.elsevier.com/locate/jda Chaining algorithms for multiple genome comparison Mohamed Ibrahim Abouelhoda, Enno Ohlebusch Faculty of Computer Science, University of Ulm, 89069 Ulm, Germany Available online 21 September 2004 Abstract Given n fragments from k> 2 genomes, Myers and Miller showed how to find an optimal global chain of colinear non-overlapping fragments in O(n log k n) time and O(n log k1 n) space. For gap costs in the L 1 -metric, we reduce the time complexity of their algorithm by a factor log 2 n log log n and the space complexity by a factor log n. For the sum-of-pairs gap cost, our algorithm improves the time complexity of their algorithm by a factor log n log log n . A variant of our algorithm finds all significant local chains of colinear non-overlapping fragments. These chaining algorithms can be used in a variety of problems in comparative genomics: the computation of global alignments of complete genomes, the identification of regions of similarity (candidate regions of conserved synteny), the detection of genome rearrangements, and exon prediction. 2004 Elsevier B.V. All rights reserved. Keywords: Fragment-chaining algorithms; Multiple alignment; Comparative genomics; Range maximum query 1. Introduction Given the continuing improvements in high-throughput genomic sequencing and the ever-expanding biological sequence databases, new advances in software tools for post- sequencing functional analysis are being demanded by the biological scientific community. * Corresponding author. E-mail addresses: [email protected] (M.I. Abouelhoda), [email protected] (E. Ohlebusch). 1570-8667/$ – see front matter 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.jda.2004.08.011

Transcript of Chaining algorithms for multiple genome comparison · 2017. 2. 11. · 322 M.I. Abouelhoda, E....

Page 1: Chaining algorithms for multiple genome comparison · 2017. 2. 11. · 322 M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341 Fig. 1. Given a set of

a

bal

ime

lriety ofomes,

tion of

uery

d theor post-unity.

Journal of Discrete Algorithms 3 (2005) 321–341

www.elsevier.com/locate/jd

Chaining algorithms for multiple genomecomparison

Mohamed Ibrahim Abouelhoda, Enno Ohlebusch∗

Faculty of Computer Science, University of Ulm, 89069 Ulm, Germany

Available online 21 September 2004

Abstract

Givenn fragments fromk > 2 genomes, Myers and Miller showed how to find an optimal glochain of colinear non-overlapping fragments in O(n logk n) time and O(n logk−1 n) space. For gap

costs in theL1-metric, we reduce the time complexity of their algorithm by a factorlog2 nlog logn

and thespace complexity by a factor logn. For the sum-of-pairs gap cost, our algorithm improves the t

complexity of their algorithm by a factorlognlog logn

. A variant of our algorithm finds all significant locachains of colinear non-overlapping fragments. These chaining algorithms can be used in a vaproblems in comparative genomics: the computation of global alignments of complete genthe identification of regions of similarity (candidate regions of conserved synteny), the detecgenome rearrangements, and exon prediction. 2004 Elsevier B.V. All rights reserved.

Keywords:Fragment-chaining algorithms; Multiple alignment; Comparative genomics; Range maximum q

1. Introduction

Given the continuing improvements in high-throughput genomic sequencing anever-expanding biological sequence databases, new advances in software tools fsequencing functional analysis are being demanded by the biological scientific comm

* Corresponding author.E-mail addresses:[email protected](M.I. Abouelhoda),[email protected]

(E. Ohlebusch).

1570-8667/$ – see front matter 2004 Elsevier B.V. All rights reserved.doi:10.1016/j.jda.2004.08.011

Page 2: Chaining algorithms for multiple genome comparison · 2017. 2. 11. · 322 M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341 Fig. 1. Given a set of

322 M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341

mentsedges

ving ge-ls, andch de-

quences.nomegions ine usedns arenser-latingly havenmentsom dif-based on

-based

these

unique

nts

ple,

Fig. 1. Given a set of fragments (upper left figure), an optimal global chain of colinear non-overlapping frag(lower left figure) can be computed, e.g., by computing an optimal path in the graph in (b) (in which not allare shown). How the graph can be constructed from the fragments is explained in Section2.

Whole genome comparisons have been heralded as the next logical step toward solnomic puzzles, such as determining coding regions, discovering regulatory signadeducing the mechanisms and history of genome evolution. However, before any sutailed analysis can be addressed, methods are required for comparing such large seIf the organisms under consideration are closely related (that is, if no or only a few gerearrangements have occurred) or one compares regions of conserved synteny (rewhich orthologous genes occur in the same order), then global alignments can bfor the prediction of genes and regulatory elements. This is because coding regiorelatively well preserved, while non-coding regions tend to show varying degree of covation. Non-coding regions that do show conservation are thought important for regugene expression, maintaining the structural organization of the genome and possibother, yet unknown functions. Several comparative sequence approaches using alighave recently been used to analyze corresponding coding and non-coding regions frferent species, although mainly between human and mouse. These approaches aresoftware-tools for aligning DNA-sequences[5,7,8,11,12,18,19,21,27]; see[9] for a review.

To cope with the shear volume of data, most of the software-tools use an anchormethod that is composed of three phases:

1. Computation of fragments (regions in the genomes that are similar).2. Computation of an optimal global chain of colinear non-overlapping fragments:

are the anchors that form the basis of the alignment.3. Alignment of the regions between the anchors.

The fragments computed in the first phase are often exact matches (maximalmatches as in MUMmer[11,12], maximal multiple exact matches as in MGA[18], orexactk-mers as in GLASS[5]), but one may also allow substitutions (yielding fragmeas in DIALIGN [21]) or even insertions and deletions (as the BLASTZ-hits[26] that areused in MultiPipMaker[27]). Each of the fragments has a weight that can, for exam

Page 3: Chaining algorithms for multiple genome comparison · 2017. 2. 11. · 322 M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341 Fig. 1. Given a set of

M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341 323

ance.blemlinear

to

ightg al-

anu-hm-

imeone

ousrrep-and “Towill

ndrove

lds

g the

e in-over arithmr exam-

nomesome re-

likelyar ap-st firstlter-huge

priorithat isal dif-e willprob-

be the length of the fragment (in case of exact fragments) or its statistical significThis article is concerned with algorithms for solving the combinatorial chaining proof the second phase: finding an optimal (i.e., maximum weight) global chain of conon-overlapping fragments; seeFig. 1(a). Note that every genome alignment tool hassolve the chaining problem somehow, but the algorithms differ from tool to tool.

A well-known solution to the chaining problem consists of finding a maximum wepath in a weighted directed acyclic graph. However, the running time of this chainingorithm is quadratic in the numbern of fragments. This can be a serious drawback ifn islarge. To overcome this obstacle, Zhang et al.[30] presented an algorithm that constructsoptimal chain using space division based onkd-trees, a data structure known from comptational geometry[6]. However, a rigorous analysis of the running time of their algoritis difficult because the construction of the chain is embedded in thekd-tree structure. Another chaining algorithm, devised by Myers and Miller[23], is based on theline-sweepparadigm, and usesorthogonal range-searchingsupported byrange treesinstead ofkd-trees. It is the only chaining algorithm fork > 2 sequences that runs in sub-quadratic tO(n logk n), but the result is a time bound higher by a logarithmic factor than whatwould expect. In particular, fork = 2 sequences it is one log-factor slower than previchaining algorithms[13,22], which require only O(n logn) time. In the epilogue of theipaper[23], Myers and Miller wrote: “We thought hard about trying to reduce this discancy but have been unable to do so, and the reasons appear to be fundamental”improve upon our result appears to be a difficult open problem”. In this article, weimprove their result. For gap costs in theL1-metric, we cannot only reduce the time aspace complexities of Myers and Miller’s algorithm by a log-factor but actually imp

the time complexity by a factorlog2 nlog logn

. For the sum-of-pairs gap cost, our method yie

a reduction by a factor lognlog logn

. In essence, this improvement is achieved by enhancinordinary range tree with (1) the combination of fractional cascading[29] and the efficientpriority queues of[17,28], which yields a more efficient search, and (2) an appropriatcorporation of gap costs, so that it is enough to determine a maximum function valuesemi-dynamic set (instead of a dynamic set). We would like to point out that our algocan also employ other data structures that support orthogonal range-searching. Fo

ple, if thekd-tree is used instead of the range tree, the algorithm takes O((k − 1)n2− 1k−1 )

time and O(n) space in the worst case, fork > 2 and gap costs in theL1-metric.As already mentioned, global alignments are a valuable tool for comparing the ge

of closely related species. In case of diverged genomic sequences, however, genarrangements have occurred during evolution, so that a global alignment strategy ispredestined to failure for having to align unrelated regions in an end-to-end colineproach. In this case, either local alignments are the strategy of choice or one muidentify syntenic regions, which then can be individually aligned. However, both anatives are faced with obstacles. Current local alignment programs suffer from arunning time, while the problem of automatically finding syntenic regions requires aknowledge of all genes and their locations in the genomes—a piece of informationoften not available. (It is beyond the scope of this paper to discuss the computationficulties of gene prediction and the accurate determination of orthologous genes.) Wshow that a local variant of our global chaining algorithm can be used to solve the

Page 4: Chaining algorithms for multiple genome comparison · 2017. 2. 11. · 322 M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341 Fig. 1. Given a set of

324 M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341

. Assecondpping

t if itse frag-nt local

for solv-

frag-idually,ntifi-omeetectexcisedrsions

.

heuris-

aticsme

Inontiongions

n

lem of automatically finding regions of similarity in large genomic DNA sequencesin the anchor-based global alignment method, one first computes fragments. In thephase, however, instead of computing an optimal global chain of colinear non-overlafragments, one computes significant local chains. We call a local chain significanscore (the sum of the weights of the fragments in the chain, where gaps between thments are penalized) exceeds a user-defined threshold. The computation of significachains can be done in the same time and space complexities as above-mentioned (ing the global fragment-chaining problem).

Under stringent thresholds, significant local chains of colinear non-overlappingments represent candidate regions of conserved synteny. If one aligns these indivone gets good local alignments. We would like to point out that the automatic idecation of regions of similarity is a first step toward an automatic detection of genrearrangements. In[2], we have demonstrated that our method can be used to dgenome rearrangements such as transpositions (where a section of the genome isand inserted at a new position in the genome, without changing orientation) and inve(where a section of the genome is excised, reversed in orientation, and re-inserted)

We see several advantages of our chaining algorithms:

1. They can handle multiple genomes.2. They can handle any kind of fragments.3. In contrast to most other algorithms used in comparative genomics, they are not

tic: They correctly solve clearly defined problems.

It is worth mentioning that chaining algorithms are also useful in other bioinformapplications such as comparing restriction maps[22] or solving the exon assembly problewhich is part of eucaryotic gene prediction[15]. A variety of applications in comparativgenomics is described in[24].

This article is organized as follows. Section2 defines the basic concepts dealt with.Section3, we will explain a global chaining algorithm that neglects gap costs. In Secti4,the algorithm is modified in two steps, so that it can deal with certain gap costs. Sec5presents a local variant of the global chaining algorithm that can be used for finding reof similarity in large genomic DNA sequences. Finally, Section6 summarizes the maiachievements of our work.

Parts of this article appeared in[1] and[2].

2. Basic concepts and definitions

For 1� i � k, Si = Si[1 . . . ni] denotes a string of length|Si | = ni . In our application,Si is a (long) DNA sequence.Si[l . . . h] is the substring ofSi starting at positionl andending at positionh. A fragmentf consists of twok-tuplesbeg(f ) = (l1, l2, . . . , lk) andend(f ) = (h1, h2, . . . , hk) such that the stringsS1[l1 . . . h1], S2[l2 . . . h2], . . . , Sk[lk . . . hk]are “similar”. Furthermore, we speak of anexact fragment(or multiple exact match) ifthe substrings are identical, i.e.,S1[l1 . . . h1] = S2[l2 . . . h2] = · · · = Sk[lk . . . hk]. An exact

Page 5: Chaining algorithms for multiple genome comparison · 2017. 2. 11. · 322 M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341 Fig. 1. Given a set of

M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341 325

he left

a

mainangle

istical

hase ofedut the

, we

is a

hain

raph

fragment is maximal, if the substrings can neither be simultaneously extended to tnor to the right in allSi .

A fragmentf of k genomes can be represented by a hyper-rectangle inRk with the

two extreme corner pointsbeg(f ) andend(f ), where each coordinate of the points isnon-negative integer (consequently, the wordsnumber of genomesanddimensionwill beused synonymously in what follows). A hyper-rectangle (called hyperrectangular doin [25]) is the Cartesian product of intervals on distinct coordinate axes. A hyper-rect[l1 . . . h1]× [l2 · · ·h2]× · · ·× [lk . . . hk] (with li < hi for all 1� i � k) will here be denotedby R(p,q), wherep = (l1, . . . , lk) andq = (h1, . . . , hk).

With every fragmentf , we associate a positive weightf.weight∈ R. This weight can,for example, be the length of the fragment (in case of exact fragments) or its statsignificance.

In what follows, we will often identify the pointbeg(f ) or end(f ) with the fragmentf . This is possible because we assume that all fragments are known from the first pthe anchored-based approach described in Section1 (so that every point can be annotatwith a tag that identifies the fragment it stems from). For example, if we speak aboscore of a pointbeg(f ) or end(f ), we mean the score of the fragmentf .

For ease of presentation, we consider the points0 = (0, . . . ,0) (the origin) andt =(|S1|+1, . . . , |Sk|+1) (the terminus) as fragments with weight 0. For these fragmentsdefinebeg(0) = ⊥, end(0) = 0, beg(t) = t, andend(t) = ⊥.

The coordinates of a pointp ∈ Rk will be denoted byp.x1,p.x2, . . . , p.xk . If k = 2, the

coordinates ofp will also be written asp.x andp.y.

Definition 2.1. We define a binary relation� on the set of fragments byf � f ′ if andonly if end(f ).xi < beg(f ′).xi for all 1 � i � k. If f � f ′, then we say thatf precedesf ′.

Note that0 � f � t for every fragmentf with f �= 0 andf �= t.

Definition 2.2. A chain of colinear non-overlapping fragments (or chain for short)sequence of fragmentsf1, f2, . . . , f� such thatfi � fi+1 for all 1� i < �. Thescoreof C

is

score(C) =�∑

i=1

fi.weight−�−1∑i=1

g(fi+1, fi)

whereg(fi+1, fi) is the cost of connecting fragmentfi to fi+1 in the chain. We will callthis costgap cost.

Definition 2.3. Givenn weighted fragments and a gap cost function, theglobal fragment-chaining problemis to determine a chain of maximum score (called optimal global cin the following) starting at the origin0 and ending at terminust.

A direct solution to this problem is to construct a weighted directed acyclic gG = (V ,E), where the setV of vertices consists of all fragments (including0 andt) and

Page 6: Chaining algorithms for multiple genome comparison · 2017. 2. 11. · 322 M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341 Fig. 1. Given a set of

326 M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341

olutiontion

prob-that

l with

y

e

. If

re

, thef

rocess,

the set of edgesE is characterized as follows: There is an edgef → f ′ with weightf ′.weight− g(f ′, f ) if f � f ′. SeeFig. 1(b) for an example (where edgesf → f ′ areomitted whenever there is a fragmentf̃ such thatf � f̃ � f ′). An optimal global chainof fragments corresponds to a path with maximum score from vertex0 to vertext in thegraph. Because the graph is acyclic, such a path can be computed as follows. Letf ′.scorebe defined as the maximum score of all chains starting at0 and ending atf ′. f ′.scorecanbe expressed by the recurrence:0.score= 0 and

(1)f ′.score= f ′.weight+ max{f.score− g(f ′, f ): f � f ′}.

A dynamic programming algorithm based on this recurrence takes O(|V | + |E|) time pro-vided that computinggap coststakes constant time. Because|V |+|E| ∈ O(n2), computingan optimal global chain takes quadratic time and linear space. This graph-based sworks for any number of genomes and for any kind of gap cost. As explained in Sec1,however, the time bound can be improved by considering the geometric nature of thelem. In order to present our result systematically, we first give a chaining algorithmneglects gap costs. Then we will modify this algorithm in two steps, so that it can deacertain gap costs.

3. The global chaining algorithm without gap costs

3.1. The basic chaining algorithm

Given a setS of points inRk with associated score, arange queryRQ(p, q) asks for

all points of S that lie in the hyper-rectangleR(p,q), while a range maximum querRMQ(p, q) asks for a point of maximum score inR(p,q).

Lemma 3.1. Suppose that the gap cost function is the constant function0. If RMQ(0,

beg(f ′) − �1) returns the end point of fragmentf (where�1 denotes the vector(1, . . . ,1)),thenf ′.score= f ′.weight+ f.score.

Proof. This follows immediately from recurrence(1). �Suppose that the start and end points of the fragments are sorted w.r.t. theirx1 coordi-

nate. Then, processing the points in ascending order of theirx1 coordinate simulates a lin(plane or hyper-plane in higher dimensions) that sweeps the points w.r.t. theirx1 coordi-nate. Here, we will use this so-calledline-sweeptechnique to construct an optimal chaina point has already been scanned by the sweeping line, it is said to beactive; otherwise it issaid to beinactive. During the sweeping process, thex1 coordinates of the active points asmaller than thex1 coordinate of the currently scanned points. According toLemma 3.1,if s is the start point of fragmentf ′, then an optimal chain ending atf ′ can be found byanRMQ over the set of active end points of fragments. Sincep.x1 < s.x1 for every activeend pointp, theRMQ need not take the first coordinate into account. In other wordsRMQ is confined to the rangeR(0, (s.x2, . . . , s.xk) − �1) in R

k−1, so that the dimension othe problem is reduced by one. To manipulate the point set during the sweeping p

Page 7: Chaining algorithms for multiple genome comparison · 2017. 2. 11. · 322 M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341 Fig. 1. Given a set of

M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341 327

ef-

.

the

e

e

e

oted

.

we need a semi-dynamic data structureD that stores the end points of fragments andficiently supports the following two operations: (1) activation and (2)RMQ over the set ofactive points.Algorithm 1 is based on such a data structureD, which will be defined later

Algorithm 1 (k-dimensional chaining ofn fragments).

Sort all start and end points of then fragments in ascending order w.r.t. theirx1 coordinateand store them in the arraypoints; because we include the end point of the origin andstart point of the terminus, there are 2n + 2 points.Store all end points of the fragments (ignoring theirx1 coordinate) as inactive in the(k−1)-dimensional data structureD.for i := 1 to 2n + 2

if points[i] is the start point of fragmentf ′ thenq := RMQ(0, (points[i].x2, . . . ,points[i].xk) − �1)

determine the fragmentf with end(f ) = q

f ′.prec:= f

f ′.score:= f ′.weight+ f.scoreelse /� points[i] is end point of a fragment�/

activate(points[i].x2, . . . ,points[i].xk) in D

In the algorithm,f ′.prec denotes a field that stores the preceding fragment off ′ in achain. It is an immediate consequence ofLemma 3.1that Algorithm 1 finds an optimalchain. The complexity of the algorithm depends of course on how the data structurD isimplemented. In the following subsection, we will outline an implementation ofD thatsupportsRMQs with activation in time O(n logd−1 n log logn) and space O(n logd−1 n) forn points, whered is the dimension. Because in our chaining problemd = k − 1, finding anoptimal chain byAlgorithm 1 takes O(n logk−2 n log logn) time and O(n logk−2 n) space.

3.2. AnsweringRMQs with activation efficiently

Our orthogonal range-searching data structure is based onrange trees, which are wellknown in computational geometry. Given a setS of n d-dimensional points, its range trecan be built as follows (see, e.g.,[4,25]). Ford = 1, the range tree ofS is a minimum-heightbinary search tree or an array storingS in sorted order. Ford > 1, the range tree ofS isa minimum-height binary search treeT with n leaves, whoseith leftmost leaf stores thpoint in S with the ith smallestx1-coordinate. To each interior nodev of T , we associatea canonical subsetCv ⊆ S containing the points stored at the leaves of the subtree roatv. For eachv, let lv (respectivelyhv) be the smallest (respectively largest)x1 coordinateof any point inCv and let

C∗v = {

(p.x2, . . . , p.xd) ∈ Rd−1 | (p.x1,p.x2, . . . , p.xd) ∈ Cv

}.

The interior nodev storeslv , hv , and a(d − 1)-dimensional range tree constructed onC∗v .

For any fixed dimensiond , the data structure can be built in O(n logd−1 n) time and spaceA range queryRQ(p, q) for the hyper-rectangleR(p,q) = [l1 . . . h1] × [l2 . . . h2] × · · · ×[ld . . . hd ] can be answered as follows. Ifd = 1, the query can be answered in O(logn)

Page 8: Chaining algorithms for multiple genome comparison · 2017. 2. 11. · 322 M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341 Fig. 1. Given a set of

328 M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341

poset ift

gel range

bi-

tingaryt

emall-

Fig. 2. A set of points and a query rectangle.

time by a binary search. Ford > 1, we traverse the range tree starting at the root. Supnodev is visited in the traversal. Ifv is a leaf, then we report its corresponding poinit lies insideR(p,q). If v is an interior node, and the interval[lv, hv] does not intersec[l1, h1], there is nothing to do. If[lv, hv] ⊆ [l1, h1], we recursively search in the(d − 1)-dimensional range tree stored atv with the hyper-rectangle[l2 . . . h2] × · · · × [ld . . . hd ].Otherwise, we recursively visit both children ofv. This procedure takes O(logd n + z)

time, wherez is the number of points in the hyper-rectangleR(p,q).The technique offractional cascading[29] saves one log-factor in answering ran

queries (in the same construction time and using the same space as the originatree). Here, we will recall this technique for range queries of the formRQ(0, q) becausewe want to modify it to answer range maximum queries of the formRMQ(0, q) efficiently.For ease of presentation, we consider the cased = 2. In this case, the range tree is anary search tree (calledx-tree) of arrays (calledy-arrays). Let v be a node in thex-treeand letv.L andv.R be its left and right children. They-array Av of v contains all thepoints in Cv sorted in ascending order w.r.t. theiry coordinate. Every elementp ∈ Av

has two downstream pointers: The left pointerLptr and the right pointerRptr. The leftpointer Lptr points to the largest (i.e., rightmost) elementq1 in Av.L such thatq1 � p

(Lptr is a NULL pointer if such an element does not exist). In an implementation,Lptr isthe index withAv.L[Lptr] = q1. Analogously, the right pointerRptr points to the largeselementq2 of Av.R such thatq2 � p. Fig. 3shows an example of this structure. Locatall the points in a rectangleR(0, (h1, h2)) is done in two stages. In the first stage, a binsearch is performed over they-array of the root node of thex-tree to locate the rightmospoint ph2 such thatph2.y ∈ [0 . . . h2]. Then, in the second stage, thex-tree is traversed(while keeping track of the downstream pointers) to locate the rightmost leafph1 such thatph1.x ∈ [0 . . . h1]. During the traversal of thex-tree, we identify a set of nodes which wcall canonical nodes(w.r.t. the given range query). The set of canonical nodes is the sest set of nodesv1, . . . , v� ∈ x-tree such that

⊎�j=1 Cvj

= RQ(0, (h1,∞)). (⊎

denotes

disjoint union.) In other words,P := ⊎�j=1 Avj

= ⊎�j=1 Cvj

contains every pointp ∈ S

such thatp.x ∈ [0 . . . h1]. However, not every pointp ∈ P satisfiesp.y ∈ [0 . . . h2]. Here,

Page 9: Chaining algorithms for multiple genome comparison · 2017. 2. 11. · 322 M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341 Fig. 1. Given a set of

M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341 329

ngrs. Ints

he

ointerse.

r ofof

ing

in

ee

n

Fig. 3. Range tree with fractional cascading for the points inFig. 2. The colored nodes are visited for answerithe range query ofFig. 2. Hatched nodes are the canonical nodes. The small circles refer to NULL pointethis example,ph1 = p6 andph2 = p5. The colored elements of they-arraysof the canonical nodes are the poinin the query rectangle ofFig. 2. The valuehv.L in every internal nodev is thex coordinate that separates tpoints in its left subtree from those occurring in its right subtree.

the downstream pointers come into play. As already mentioned, the downstream pare followed while traversing thex-tree, and to follow one pointer takes constant timIf we encounter a canonical nodevj , then the elementej , to which the last downstreampointer points, partitions the listAvj

as follows: Everye that is strictly to the right ofej

is not in R(0, (h1, h2)), whereas all other elements ofAvjlie in R(0, (h1, h2)). For this

reason, we will call the elementej the split element. It is easy to see that the numbecanonical nodes is O(logn). Moreover, we can find all of them and the split elementstheiry-arraysin O(logn) time; cf.[29]. Therefore, the range tree with fractional cascadsupports 2-dimensional range queries in O(logn+ z) time, wherez is the number of pointsin the rectangleR(0, q). For dimensiond > 2, it takes time O(logd−1 n + z).

In order to answerRMQs with activation, we will further enhance everyy-array thatoccurs in the fractional cascading data structure with a priority queue as described[17,28]. Each of these queues is (implicitly) constructed over therank space1 of the points inthe y-array. The rank space of the points in they-array consists of points in the rang[0 . . .m], wherem is the size of they-array, and the rank of a point is its index in thy-array because they-array is sorted w.r.t. they dimension.

The priority queue supports the operationsinsert(r), delete(r), predecessor(r) (givesthe largest element� r), and successor(r) (gives the smallest element> r) in timeO(log logm), wherer is an integer in the range[0 . . .m]. Algorithm 2 shows how to acti-vate a pointq in the 2-dimensional range tree andAlgorithm 3 shows how to answers aRMQ(0, q).

1 See, e.g.,[10,14] for more details on the rank space.

Page 10: Chaining algorithms for multiple genome comparison · 2017. 2. 11. · 322 M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341 Fig. 1. Given a set of

330 M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341

:

with

t

Algorithm 2 (Activation of a pointq in the data structureD).

v := root node of thex-treefind the rank (index) r of q in Av by a binary searchwhile (v �= ⊥)

if (Av[r].score> Av[predecessor(r)].score) theninsert(r) into the priority queue attached toAv

while(Av[r].score> Av[successor(r)].score)delete(successor(r)) from the priority queue attached toAv

if (Av[r] = Av.L[Av[r].Lptr]) thenr := Av[r].Lptrv := v.L

elser := Av[r].Rptrv := v.R

Note that in the outer while-loop ofAlgorithm 2, the following invariant is maintainedIf 0 � i1 < i2 < · · · < i� � m are the entries in the priority queue attached toAv , thenAv[i1].score� Av[i2].score� · · · � Av[i�].score.

Algorithm 3 gives pseudo-code for answeringRMQ(0, q), but we would like to firstdescribe the idea on a higher level. In essence, we locate all canonical nodesv1, . . . , v�

in D for the hyper-rectangleR(0, q). For anyvj , 1 � j � �, let the rj th element bethe split element inAvj

. We have seen that⊎�

j=1 Avjcontains every pointp ∈ S such

that p.x ∈ [0 . . . q.x]. Now if rj is the index of the split element ofAvj, then all points

Avj[i] with i � rj are inR(0, q), whereas all other elementsAvj

[i] with i > rj are notin R(0, q). SinceAlgorithm 2 maintains the above-mentioned invariant, the elementhighest score in the priority queue ofAvj

that lies inR(0, q) is qj = predecessor(rj ) (ifrj is in the priority queue ofAvj

, thenqj = rj becausepredecessor(rj ) gives the largeselement� rj ). We then computemax_score:= max{Avj

[qj ].score| 1� j � �} and returnmax_point= Avi

[qi], whereAvi[qi].score= max_score.

Algorithm 3 (RMQ(0, q) in the data structureD).

v := root node of thex-treemax_score:= −∞max_point := ⊥find the rank(index) r of the rightmost pointp with p.y ∈ [0 . . . q.y] in Av

while (v �= ⊥)

if (hv.x � q.x) then /� v is a canonical node�/tmp:= predecessor(r) in the priority queue ofAv

max_score:= max{max_score,Av[tmp].score}if (max_score= tmp.score) then max_point := Av[tmp]

else if (hv.L.x � q.x) then /� v.L is a canonical node�/tmp:= predecessor(Av[r].Lptr) in the priority queue ofAv.L

max_score:= max{max_score,Av.L[tmp].score}

Page 11: Chaining algorithms for multiple genome comparison · 2017. 2. 11. · 322 M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341 Fig. 1. Given a set of

M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341 331

-kes

e

-which

in in

all

x-

tweenunt.

is the

, then

if (max_score= tmp.score) then max_point := Av.L[tmp]r := Av[r].Rptrv := v.R

elser := Av[r].Lptrv := v.L

Because the number of canonical nodes is O(logn) and any of the priority queue operations takes O(log logn) time, answering a 2-dimensional range maximum query taO(logn log logn) time. Since every point occurs in at most logn priority queues, there arat mostn logn delete operations. Hence the total time complexity of activatingn points isO(n logn log logn).

Theorem 3.2. Givenk > 2 genomes andn fragments, an optimal global chain(withoutgap costs) can be found inO(n logk−2 n log logn) time andO(n logk−2 n) space.

Proof. In Algorithm 1, the points are first sorted w.r.t. their first dimension and theRMQwith activation is required only ford = k − 1 dimensions. Ford � 2 dimensions, the preceding data structure is implemented for the last two dimensions of the range tree,yields a data structureD that requires O(n logd−1 n) space and O(n logd−1 n log logn) timefor n RMQs andn activation operations. Consequently, one can find an optimal chaO(n logk−2 n log logn) time and O(n logk−2 n) space. �

In casek = 2, the data structureD is simply a priority queue (over the rank space ofpoints). Therefore, if the points are already sorted, then the algorithm takes O(n log logn)

time. Otherwise, the sorting procedure inAlgorithm 1dominates the overall time compleity of Algorithm 1because it requires O(n logn) time.

4. Incorporating gap costs

In the previous section, fragments were chained without penalizing the gaps in bethem. In this section, we modify the algorithm so that it can take gap costs into acco

4.1. Gap costs in theL1 metric

We first handle the case in which the cost for the gap between two fragmentsdistance between the end and start point of the two fragments in theL1 metric. For twopointsp,q ∈ R

k , this distance is defined by

d1(p, q) =k∑

i=1

|p.xi − q.xi |

and for two fragmentsf � f ′ we defineg1(f′, f ) = d1(beg(f ′), end(f )). If an alignment

of two sequencesS1 andS2 shall be based on fragments and one uses this gap costthe characters between the two fragments aredeleted/inserted; see left side ofFig. 4.

Page 12: Chaining algorithms for multiple genome comparison · 2017. 2. 11. · 322 M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341 Fig. 1. Given a set of

332 M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341

st

t

.ation

y

t, we

ntm.

ACCXXXX AGGACC YYYAGG

ACCXXXXAGGACCYYY AGG

Fig. 4. Alignments based on the fragmentsACC andAGG w.r.t. gap costg1 (left) and the sum-of-pairs gap cowith λ > 1

2ε (right), whereX andY are anonymous characters.

The problem with gap costs in our approach is that anRMQ does not take the cosg(f ′, f ) from recurrence(1) into account, and if we would explicitly computeg(f ′, f )

for every pair of fragments withf � f ′, then this would yield a quadratic time algorithmThus, it is necessary to express the gap costs implicitly in terms of weight informattached to the points. We achieve this by using thegeometric costof a fragmentf , whichwe define in terms of the terminus pointt asgc(f ) = d1(t,end(f )).

Lemma 4.1. Let f , f̃ , and f ′ be fragments such thatf � f ′ and f̃ � f ′. Then theinequalityf̃ .score−g1(f

′, f̃ ) > f.score−g1(f′, f ) holds true if and only if the inequalit

f̃ .score− gc(f̃ ) > f.score− gc(f ) holds.

Proof.

f̃ .score− g1(f′, f̃ ) > f.score− g1(f

′, f )

⇔f̃ .score−k∑

i=1

(beg(f ′).xi − end(f̃ ).xi

)>

f.score−k∑

i=1

(beg(f ′).xi − end(f ).xi

)

⇔f̃ .score−k∑

i=1

(t.xi − end(f̃ ).xi

)>

f.score−k∑

i=1

(t.xi − end(f ).xi

)

⇔f̃ .score− gc(f̃ ) > f.score− gc(f ).

The second equivalence follows from adding∑k

i=1 beg(f ′).xi to and subtracting∑k

i=1 t.xi

from both sides of the inequality.Fig. 5 illustrates the lemma fork = 2. �Becauset is fixed, the valuegc(f ) is known in advance for every fragmentf . Therefore,

Algorithm 1 needs only two slight modifications to take gap costs into account. Firsreplace the statementf ′.score:= f ′.weight+ f.scorewith

f ′.score:= f ′.weight+ f.score− g1(f′, f ).

Second, ifpoints[i] is the end point off ′, then it will be activated withf ′.priority :=f ′.score−gc(f ′). Thus, anRMQ will return a point of maximum priority instead of a poiof maximum score. The next lemma implies the correctness of the modified algorith

Page 13: Chaining algorithms for multiple genome comparison · 2017. 2. 11. · 322 M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341 Fig. 1. Given a set of

M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341 333

f

whichse

xt.

n inFor

Fig. 5. Pointsp andq are active end points of the fragmentsf andf̃ . The start points of fragmentf ′ is currentlyscanned by the sweeping line andt is the terminus point.

Lemma 4.2. If the range maximum queryRMQ(0,beg(f ′) − �1) returns the end point ofragmentf̃ , then we havef̃ .score− g1(f

′, f̃ ) = max{f.score− g1(f′, f ): f � f ′}.

Proof. RMQ(0,beg(f ′) − �1) returns the end point of fragment̃f such thatf̃ .priority =max{f.priority: f � f ′}. Sincef.priority = f.score− gc(f ) for every fragmentf , itis an immediate consequence ofLemma 4.1that f̃ .score− g1(f

′, f̃ ) = max{f.score−g1(f

′, f ): f � f ′}. �4.2. The sum-of-pairs gap cost

In this section we consider the gap cost associated with the “sum-of-pairs” model,was introduced by Myers and Miller[23]. For clarity of presentation, we first treat the cak = 2 because the general casek > 2 is rather involved.

4.2.1. The casek = 2For two pointsp,q ∈ R

2, we write∆xi(p, q) = |p.xi − q.xi |, wherei ∈ {1,2}. We will

sometimes simply write∆x1 and∆x2 if their arguments can be inferred from the conteThe sum-of-pairs distance of two pointsp,q ∈ R

2 depends on the parametersε andλ andwas defined in[23] as follows:

d(p,q) ={

ε∆x2 + λ(∆x1 − ∆x2) if ∆x1 � ∆x2,

ε∆x1 + λ(∆x2 − ∆x1) if ∆x2 � ∆x1.

We rearrange these terms and derive the following equivalent definition:

d(p,q) ={

λ∆x1 + (ε − λ)∆x2 if ∆x1 � ∆x2,

(ε − λ)∆x1 + λ∆x2 if ∆x2 � ∆x1.

For two fragmentsf andf ′ with f � f ′, we defineg(f ′, f ) = d(beg(f ′),end(f )). In-tuitively, λ > 0 is the cost of aligning an anonymous character with a gap positiothe other sequence, whileε > 0 is the cost of aligning two anonymous characters.

Page 14: Chaining algorithms for multiple genome comparison · 2017. 2. 11. · 322 M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341 Fig. 1. Given a set of

334 M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341

g-

tgaps

-he

al

t

ame

me this

λ = 1 andε = 2, this gap cost coincides with theg1 gap cost, whereas forλ = 1 andε = 1, this gap cost corresponds to theL∞ metric. (The gap cost of connecting two framentsf � f ′ in theL∞ metric is defined byg∞(f ′, f ) = d∞(beg(f ′),end(f )), whered∞(p, q) = maxi∈[1..k] |p.xi − q.xi | for p,q ∈ R

k .) Following [23,30], we demand thaλ > 1

2ε because otherwise it would always be best to connect fragments entirely byas in theL1 metric. So if an alignment of two sequencesS1 andS2 shall be based on fragments and one uses the sum-of-pairs gap cost withλ > 1

2ε, then the characters between ttwo fragments arereplacedas long as possible and the remaining characters aredeletedorinserted; see right side ofFig. 4.

In order to compute the score of a fragmentf ′ with beg(f ′) = s, the following defin-itions are useful. Thefirst quadrantof a points ∈ R

2 consists of all pointsp ∈ R2 with

p.x1 � s.x1 andp.x2 � s.x2. We divide the first quadrant ofs into regionsO1 andO2 bythe straight linex2 = x1 + (s.x2 − s.x1). O1, called thefirst octantof s, consists of allpointsp in the first quadrant ofs satisfying∆x1 � ∆x2 (i.e., s.x1 − p.x1 � s.x2 − p.x2),these are the points lying above or on the straight linex2 = x1 + (s.x2 − s.x1); seeFig. 6.The second octantO2 consists of all pointsq satisfying∆x2 � ∆x1 (i.e., s.x2 − q.x2 �s.x1−q.x1), these are the points lying below or on the straight linex2 = x1+ (s.x2− s.x1).Then f ′.score= f ′.weight+ max{v1, v2}, where vi = max{f.score− g(f ′, f ): f �f ′ andend(f ) lies in octantOi}, for i ∈ {1,2}.

However, our chaining algorithms rely onRMQs, and these work only for orthogonregions, not for octants. For this reason, we will make use of theoctant-to-quadranttrans-formations of Guibas and Stolfi[16]. The transformationT1 : (x1, x2) �→ (x1 − x2, x2)

maps the first octant to a quadrant. More precisely, pointT1(p) is in the first quadranof T1(s) if and only if p is in the first octant of points.2 Similarly, for the transformationT2 : (x1, x2) �→ (x1, x2 − x1), pointq is in the second octant of points if and only if T2(q)

is in the first quadrant ofT2(s). By means of these transformations, we can apply the s

Fig. 6. The first quadrant of points is divided into two octants.

2 Observe that the transformation may yield points with negative coordinates, but it is easy to overcoobstacle by an additional transformation (a translation). Hence we will skip this minor problem.

Page 15: Chaining algorithms for multiple genome comparison · 2017. 2. 11. · 322 M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341 Fig. 1. Given a set of

M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341 335

roperly.

oed

t

r-

s in

e

techniques as in the previous sections. We just have to define the geometric cost pThe following lemma shows how to choose the geometric costgc1 for points in the firstoctantO1. An analogous lemma holds for points in the second octant.

Lemma 4.3. Let f , f̃ , andf ′ be fragments such thatf � f ′ and f̃ � f ′. If end(f ) andend(f̃ ) lie in the first octant of beg(f ′), thenf̃ .score− g(f ′, f̃ ) > f.score− g(f ′, f ) ifand only iff̃ .score−gc1(f̃ ) > f.score−gc1(f ), where gc1(f̂ ) = λ∆x1(t, end(f̂ ))+ (ε −λ)∆x2(t, end(f̂ )) for any fragmentf̂ .

Proof. Similar to the proof ofLemma 4.1. �In Section4.1 there was only one geometric costgc, but here we have to take tw

different geometric costsgc1 andgc2 into account. To cope with this problem, we netwo data structuresD1 andD2, whereDi stores the set of points

{Ti

(end(f )

) | f is a fragment}.

If we encounter the end point of fragmentf ′ in Algorithm 1, then we activate poinT1(end(f ′)) in D1 with priority f ′.score− gc1(f

′) and pointT2(end(f ′)) in D2 with pri-ority f ′.score−gc2(f

′). If we encounter the start point of fragmentf ′, then we launch tworange maximum queries, namelyRMQ(0, T1(beg(f ′)−�1)) in D1 andRMQ(0, T2(beg(f ′)−�1)) in D2. If the firstRMQ returnsT1(end(f1)) and the second returnsT2(end(f2)), thenfi

is a fragment of highest priority inDi , 1 � i � 2, such thatTi(end(fi)) � Ti(beg(f ′)).Because a pointp is in the octantOi of point beg(f ′) if and only if Ti(p) is inthe first quadrant ofTi(beg(f ′)), it follows that fi is a fragment such that its prioity fi.score− gci (fi) is maximal in octantOi . Therefore, according toLemma 4.3, thevalue vi = fi.score− g(f ′, fi) is maximal in octantOi . Hence, if v1 > v2, then weset f ′.prec= f1 and f ′.score:= f ′.weight+ v1. Otherwise, we setf ′.prec= f2 andf ′.score:= f ′.weight+ v2.

For the sum-of-pairs gap cost, the two-dimensional chaining algorithm runO(n logn log logn) time and O(n logn) space because of the two-dimensionalRMQs re-quired for the transformed points. This is in sharp contrast to gap costs in theL1-metric,where we merely need one-dimensionalRMQs.

4.2.2. The casek > 2In this case, the sum-of-pairs gap cost is defined for fragmentsf � f ′ by

gsop(f′, f ) =

∑0�i<j�k

g(f ′i,j , fi,j ),

wheref ′i,j andfi,j are the two-dimensional fragments consisting of theith andj th com-

ponent off ′ and f , respectively. For example, in case ofk = 3, let s = beg(f ′) andp = end(f ) and assume that∆x1(s,p) � ∆x2(s,p) � ∆x3(s,p). In this case, we havgsop(f

′, f ) = 2λ∆x1 + ε∆x2 + (ε − λ)2∆x3 becauseg(f ′1,2, f1,2) = λ∆x1 + (ε − λ)∆x2,

g(f ′1,3, f1,3) = λ∆x1 + (ε −λ)∆x3, andg(f ′

2,3, f2,3) = λ∆x2 + (ε −λ)∆x3. By contrast, if∆x1 � ∆x3 � ∆x2, then the equalitygsop(f

′, f ) = 2λ∆x1 + (ε − λ)2∆x2 + ε∆x3 holds.

Page 16: Chaining algorithms for multiple genome comparison · 2017. 2. 11. · 322 M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341 Fig. 1. Given a set of

336 M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341

egions

t in

,

.

d are

ntr

In general, each of thek! permutationsπ of 1, . . . , k yields a hyper-regionRπ definedby ∆xπ(1)

� ∆xπ(2)� · · · � ∆xπ(k)

in which a specific formula forgsop(f′, f ) holds. That

is, in order to obtain the score of a fragmentf ′, we must computef ′.score= f ′.weight+max{vπ | π is a permutation of 1, . . . , k}, where

vπ = max{f.score− gsop(f

′, f ): f � f ′ andend(f ) lies inRπ

}.

Because ourRMQ-based approach requires orthogonal regions, each of these hyper-rRπ of s must be transformed into thefirst hyper-cornerof some point̃s. The first hyper-corner of a point̃s ∈ R

k is thek-dimensional analogue to the first quadrant of a poinR

2. It consists of all pointsp ∈ Rk with p.xi � s̃.xi for all 1� i � k (note that there are 2k

hyper-corners). We describe the generalization of theoctant-to-quadranttransformationsfor the casek = 3. The extension to the casek > 3 is obvious. There are 3! hyper-regionshence 6 transformations:

∆x1 � ∆x2 � ∆x3: T1(x1, x2, x3) = (x1 − x2, x2 − x3, x3),

∆x1 � ∆x3 � ∆x2: T2(x1, x2, x3) = (x1 − x3, x2, x3 − x2),

∆x2 � ∆x1 � ∆x3: T3(x1, x2, x3) = (x1 − x3, x2 − x1, x3),

∆x2 � ∆x3 � ∆x1: T4(x1, x2, x3) = (x1, x2 − x3, x3 − x1),

∆x3 � ∆x1 � ∆x2: T5(x1, x2, x3) = (x1 − x2, x2, x3 − x1),

∆x3 � ∆x2 � ∆x1: T6(x1, x2, x3) = (x1, x2 − x1, x3 − x2).

In what follows, we will focus on the particular case whereπ is the identity permutationThe hyper-region corresponding to the identity permutation will be denoted byR1 and itstransformation byT1. The other permutations are numbered in an arbitrary order anhandled similarly.

Lemma 4.4. Pointp ∈ Rk is in hyper-regionR1 of points if and only ifT1(p) is in the first

hyper-corner ofT1(s), whereT1(x1, x2, . . . , xk) = (x1 − x2, x2 − x3, . . . , xk−1 − xk, xk).

Proof. T1(p) is in the first hyper-corner ofT1(s)

⇔ T1(s).xi � T1(p).xi for all 1� i � k

⇔ s.xi − s.xi+1 � p.xi − p.xi+1 ands.xk � p.xk for all 1� i < k

⇔ (s.x1 − p.x1) � (s.x2 − p.x2) � · · · � (s.xk − p.xk)

⇔ ∆x1(s,p) � ∆x2(s,p) � · · · � ∆xk(s,p).

The last statement holds if and only ifp is in hyper-regionR1 of s. �For each hyper-regionRj , we compute the corresponding geometric costgcj (f ) of

every fragmentf . Note that for every indexj a k-dimensional analogue toLemma 4.3holds. Furthermore, for each transformationTj , we keep a data structureDj that storesthe transformed end pointsTj (end(f )) of all fragmentsf . Algorithm 4 generalizes the2-dimensional chaining algorithm described above tok dimensions. For every start poibeg(f ′) of a fragmentf ′, Algorithm 4 searches for a fragmentf in the first hyper-corne

Page 17: Chaining algorithms for multiple genome comparison · 2017. 2. 11. · 322 M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341 Fig. 1. Given a set of

M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341 337

inte

nthm

the

oints

ks for-sted inthat ablem;

n op-ant of

of beg(f ′) such thatf.score− gsop(f′, f ) is maximal. This entailsk! RMQs because

the first hyper-corner is divided intok! hyper-regions. Analogously, for every end poend(f ′) of a fragmentf ′, Algorithm 4 performsk! activation operations. Therefore, thtotal time complexity ofAlgorithm 4is O(k! n logk−1 n log logn) and its space requiremeis O(k! n logk−1 n). This result improves the running time of Myers and Miller’s algorit[23] by a factor logn

log logn.

Algorithm 4 (k-dim. chaining ofn fragments w.r.t. the sum-of-pairs gap cost).

Sort all start and end points of then fragments in ascending order w.r.t. theirx1 coordinateand store them in the arraypoints; because we include the end point of the origin andstart point of the terminus, there are 2n + 2 points.for j := 1 to k!

apply transformationTj to the end points of the fragments and store the resulting pas inactive in thek-dimensional data structureDj

for i := 1 to 2n + 2if points[i] is the start point of fragmentf ′ then

maxRMQ:= −∞for j := 1 to k!

q := RMQ(0, Tj (points[i] − �1)) in Dj

determine the fragmentfq with Tj (end(fq)) = q

maxRMQ:= max{maxRMQ, fq .score− gsop(f′, fq)}

if fq.score− gsop(f′, fq) = maxRMQthen f := fq

f ′.prec:= f

f ′.score:= f ′.weight+ maxRMQelse /� points[i] is end point of a fragmentf ′ �/

for j := 1 to k!activateTj (points[i]) in Dj with priority f ′.score− gcj (f

′)

5. The local chaining algorithm

In the previous sections, we have tackled the global chaining problem, which asan optimal chain starting at the origin0 and ending at terminust. However, in many applications (such as searching for local similarities in genomic sequences) one is interechains that can start and end with arbitrary fragments. If we remove the restrictionchain must start at the origin and end at the terminus, we get the local chaining proseeFig. 7.

Definition 5.1. Givenn weighted fragments and a gap cost functiong, thelocal fragment-chaining problemis to determine a chain of maximum score� 0. Such a chain will becalledoptimal local chain.

Note that ifg is the constant function 0, then an optimal local chain must also be atimal global chain, and vice versa. Our solution to the local chaining problem is a vari

Page 18: Chaining algorithms for multiple genome comparison · 2017. 2. 11. · 322 M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341 Fig. 1. Given a set of

338 M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341

posed

spond-irs

ch thatties as

Fig. 7. Computation of local chains of colinear non-overlapping fragments. The optimal local chain is comof the fragments 1, 4, and 6. Another significant local chain consists of the fragments 7 and 8.

the global chaining algorithm. For ease of presentation, we will use gap costs correing to theL1 metric (see Section4.1), but the approach also works with the sum-of-pagap cost (see Section4.2).

Definition 5.2. Let

f ′.score= max{score(C): C is a chain ending withf ′}.

A chainC ending withf ′ and satisfyingf ′.score= score(C) will be calledoptimal chainending withf ′.

Lemma 5.3. The following equality holds:

(2)f ′.score= f ′.weight+ max{0, f.score− g1(f

′, f ): � f ′}.Proof. Let C′ = f1, f2, . . . , f�, f

′ be an optimal chain ending withf ′, that is,score(C′) =f ′.score. Because the chain that solely consists of fragmentf ′ has scoref ′.weight� 0, wemust havescore(C′) � f ′.weight. If score(C′) = f ′.weight, thenf.score− g1(f

′, f ) � 0for every fragmentf that precedesf ′, because otherwise it would followscore(C′) >

f ′.weight. Hence equality(2) holds in this case. So supposescore(C′) > f ′.weight.Clearly,score(C′) = f ′.weight+score(C)−g1(f

′, f�), whereC = f1, f2, . . . , f�. It is notdifficult to see thatC must be an optimal chain that is ending withf� because otherwiseC′would not be optimal. Therefore,score(C′) = f ′.weight+ f�.score− g1(f

′, f�). If therewere a fragmentf that precedesf ′ such thatf.score− g1(f

′, f ) > f�.score− g1(f′, f�),

then it would follow thatC′ is not optimal. We conclude that equality(2) holds. �With the help ofLemma 5.3, we obtainAlgorithm 5. It is not difficult to verify that we

can implement this algorithm by means of the techniques of the previous sections suit solves the local fragment-chaining problem in the same time and space complexiour solution to the global fragment-chaining problem.

Page 19: Chaining algorithms for multiple genome comparison · 2017. 2. 11. · 322 M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341 Fig. 1. Given a set of

M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341 339

ose

resentof re-

ments.caltions and

al and-

st

e

orthog-the

-reader

Algorithm 5 (Finding an optimal local chain).

for every fragmentf ′ do begindeterminef̂ with f̂ .score− g1(f

′, f̂ ) = max{f.score− g1(f′, f ) | f � f ′}

max:= max{0, f̂ .score− g1(f′, f̂ )}

if max> 0 thenf ′.prec:= f̂ else f ′.prec:= NULLf ′.score:= f ′.weight+ max

endDetermine a fragment̃f such thatf̃ .score= max{f.score| f is a fragment}Report an optimal local chain by following the pointersf̃ .prec until a fragmentf withf.prec= NULL is reached.

We stress thatAlgorithm 5can easily be modified, so that it can report all chains whscore exceeds some thresholdT (in Algorithm 5, instead of determining a fragment̃f

of maximum score, one determines all fragments whose score exceedsT ). Such chainswill be calledsignificant local chains; seeFig. 7. As outlined in Section1, under strin-gent thresholds, significant local chains of colinear non-overlapping fragments repcandidate regions of conserved synteny. Furthermore, the automatic identificationgions of similarity is a first step toward an automatic detection of genome rearrangeThe interested reader is referred to[2], where we have provided evidence that our lochaining method can be used to detect genome rearrangements such as transposiinversions.

6. Conclusions

In this article, we have presented line-sweep algorithms that solve both the globthe local fragment-chaining problem of multiple genomes. Fork > 2 genomes, our algorithms take

• O(n logk−2 n log logn) time and O(n logk−2 n) space without gap costs,• O(n logk−2 n log logn) time and O(n logk−2 n) space for gap costs in theL1 metric,• O(k!n logk−1 n log logn) time and O(k!n logk−1 n) space for the sum-of-pairs gap co

and for gap costs in theL∞ metric.

For k = 2, they take O(n logn) time and O(n) space for gap costs in theL1 metric.If the fragments are already sorted, then the time complexity reduces to O(n log logn).For the sum-of-pairs gap cost and for gap costs in theL∞ metric, our algorithms takO(n logn log logn) time and O(n logn) space.

We stress that our algorithms can employ any other data structure that supportsonal range-searching. For example, if thekd-tree is used instead of the range tree,algorithms take O((k − 1)n2−1/(k−1)) time and O(n) space for gap costs in theL1 metric.(It has been shown in[20] that answeringoned-dimensional range query with thekd-treetakes O(dn1−1/d) time in the worst case.) Moreover, for smallk, a collection of programming tricks can speed up the running time in practice; for more details we refer theto [6].

Page 20: Chaining algorithms for multiple genome comparison · 2017. 2. 11. · 322 M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341 Fig. 1. Given a set of

340 M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341

ein the

this

ymous

. 14th2676,

mics,. 1–

l Con-iology,

mputa-

ompar-

tional

AN3 (4)

s tools:

IAM J.

whole

nt and

ns; II:

c. 16th

c. Nat.

.

Among other things, the software tool CHAINER[3] contains an implementation of thchaining algorithms presented here. Currently, this implementation uses gap costsL1 metric and is based onkd-trees. It remains future work to experimentally comparekd-tree version with an implementation based on range trees.

Acknowledgements

The authors were supported by DFG-grant Oh 54/4-1. Thanks go to an anonreviewer whose suggestions helped to improve the presentation of this article.

References

[1] M.I. Abouelhoda, E. Ohlebusch, Multiple genome alignment: Chaining algorithms revisited, in: ProcAnnual Symposium on Combinatorial Pattern Matching, in: Lecture Notes in Comput. Sci., vol.Springer-Verlag, 2003, pp. 1–16.

[2] M.I. Abouelhoda, E. Ohlebusch, A local chaining algorithm and its applications in comparative genoin: Proc. 3rd Workshop on Algorithms in Bioinformatics, in: LNBI, vol. 2812, Springer-Verlag, 2003, pp16.

[3] M.I. Abouelhoda, E. Ohlebusch, CHAINER: Software for comparing genomes, in: 12th Internationaference on Intelligent Systems for Molecular Biology/3rd European Conference on Computational Bavailable athttp://www.iseb.org/ismbeccb2004/short%20papers/19.pdf.

[4] P. Agarwal, Range searching, in: J.E. Goodman, J. O’Rourke (Eds.), Handbook of Discrete and Cotional Geometry, CRC Press, 1997, pp. 575–603.

[5] S. Batzoglou, L. Pachter, J.P. Mesirov, B. Berger, E.S. Lander, Human and mouse gene structure: Cative analysis and application to exon prediction, Genome Res. 10 (2001) 950–958.

[6] J.L. Bently, K-d trees for semidynamic point sets, in: 6th Annual ACM Symposium on ComputaGeometry, ACM, 1990, pp. 187–197.

[7] N. Bray, L. Pachter, MAVID multiple alignment server, Nucleic Acids Res. 31 (2003) 3525–3526.[8] M. Brudno, C.B. Do, G.M. Cooper, M.F. Kim, E. Davydov, E.D. Green, A. Sidow, S. Batzoglou, LAG

and Multi-LAGAN: Efficient tools for large-scale multiple alignment of genomic DNA, Genome Res. 1(2003) 721–731.

[9] P. Chain, S. Kurtz, E. Ohlebusch, T. Slezak, An applications-focused review of comparative genomicCapabilities, limitations and future challenges, Briefings in Bioinformatics 4 (2) (2003) 105–123.

[10] B. Chazelle, A functional approach to data structures and its use in multidimensional searching, SComput. 17 (3) (1988) 427–462.

[11] A.L. Delcher, S. Kasif, R.D. Fleischmann, J. Peterson, O. White, S.L. Salzberg, Alignment ofgenomes, Nucleic Acids Res. 27 (1999) 2369–2376.

[12] A.L. Delcher, A. Phillippy, J. Carlton, S.L. Salzberg, Fast algorithms for large-scale genome alignmecomparison, Nucleic Acids Res. 30 (11) (2002) 2478–2483.

[13] D. Eppstein, Z. Galil, R. Giancarlo, G.F. Italiano, Sparse dynamic programming. I: Linear cost functioConvex and concave cost functions, J. ACM 39 (1992) 519–567.

[14] H.N. Gabow, J.L. Bentley, R.E. Tarjan, Scaling and related techniques for geometry problems, in: ProAnnual ACM Symposium on Theory of Computing, ACM, 1984, pp. 135–143.

[15] M.S. Gelfand, A.A. Mironov, P.A. Pevzner, Gene recognition via spliced sequence alignment, ProAcad. Sci. 93 (1996) 9061–9066.

[16] L.J. Guibas, J. Stolfi, On computing all north-east nearest neighbors in theL1 metric, Inform. ProcessLett. 17 (4) (1983) 219–223.

[17] D.B. Johnson, A priority queue in which initialization and queue operations take O(log logD) time, Math.Syst. Theory 15 (1982) 295–309.

Page 21: Chaining algorithms for multiple genome comparison · 2017. 2. 11. · 322 M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341 Fig. 1. Given a set of

M.I. Abouelhoda, E. Ohlebusch / Journal of Discrete Algorithms 3 (2005) 321–341 341

312–

-C. ele-

l binary

(2000)

th.

CM-

. Alurution.

, 1985.ouse

gram,sis of

Process.

3.iol. 1

[18] M. Höhl, S. Kurtz, E. Ohlebusch, Efficient multiple genome alignment, Bioinformatics 18 (2002), SS320.

[19] W.J. Kent, A.M. Zahler, Conservation, regulation, synteny, and introns in a large-scale C. briggsaegans genomic alignment, Genome Res. 10 (2000) 1115–1125.

[20] D.T. Lee, C.K. Wong, Worst-case analysis for region and partial region searches in multidimensionasearch trees and balanced quad trees, Acta Informatica 9 (1977) 23–29.

[21] B. Morgenstern, A space-efficient algorithm for aligning large genomic sequences, Bioinformatics 16948–949.

[22] E.W. Myers, X. Huang, An O(n2 logn) restriction map comparison and search algorithm, Bull. MaBiol. 54 (4) (1992) 599–618.

[23] E.W. Myers, W. Miller, Chaining multiple-alignment fragments in sub-quadratic time, in: Proc. 6th ASIAM Symposium on Discrete Algorithms, ACM, 1995, pp. 38–47.

[24] E. Ohlebusch, M.I. Abouelhoda, Chaining algorithms and applications in comparative genomics, in: S(Ed.), Handbook of Computational Molecular Biology, Chapter 20, CRC Press, submitted for publica

[25] F.P. Preparata, M.I. Shamos, Computational Geometry: An Introduction, Springer-Verlag, New York[26] S. Schwartz, J.K. Kent, A. Smit, Z. Zhang, R. Baertsch, R. Hardison, D. Haussler, W. Miller, Human-m

alignments with BLASTZ, Genome Res. 13 (2003) 103–107.[27] S. Schwartz, L. Elnitski, M. Li, M. Weirauch, C. Riemer, A. Smit, NISC Comparative Sequencing Pro

E.D. Green, R.C. Hardison, W. Miller, MultiPipMaker and supporting tools: Alignments and analymultiple genomic DNA sequences, Nucleic Acids Res. 31 (13) (2003) 3518–3524.

[28] P. van Emde Boas, Preserving order in a forest in less than logarithmic time and linear space, Inform.Lett. 6 (3) (1977) 80–82.

[29] D.E. Willard, New data structures for orthogonal range queries, SIAM J. Comput. 14 (1985) 232–25[30] Z. Zhang, B. Raghavachari, R. Hardison, W. Miller, Chaining multiple-alignment blocks, J. Comput. B

(1994) 217–226.