USING THE TRAVELING SALESMAN PROBLEM IN BIOINFORMATIC ALGORITHMS

Master's Thesis in Computer Science
Finn Rosenbech Jensen
2006 0636

Supervisor: Christian Nørgaard Storm Pedersen

DEPARTMENT OF COMPUTER SCIENCE
AARHUS UNIVERSITY

DECEMBER 2010


Layout and typography in this thesis are made using LaTeX with the memoir document class. Any good idea or nice layout feature can be directly attributed to the book: Introduktion til LaTeX, by Lars Madsen, http://www.imf.au.dk/system/latex/bog/

The TSP tour images are made using the techniques described on http://www.cgl.uwaterloo.ca/~csk/projects/tsp/

an output-modified version of the ccvt software http://code.google.com/p/ccvt/

and the TSP heuristic solver LKH http://www.akira.ruc.dk/~keld/research/LKH/

All self-constructed figures are drawn using LaTeXDraw, http://latexdraw.sourceforge.net/

and all graphs and diagrams are constructed using matplotlib, http://matplotlib.sourceforge.net/


For brevity is very good,
where we are, or are not understood

Samuel Butler

Abstract

Theoretical computer science has given birth to a very intriguing and fascinating class of problems known as Nondeterministic Polynomial-time (NP) complete problems. Among their captivating properties is their computational complexity despite often deceptively simple formulations. An example of this is the ubiquitous Traveling Salesman Problem (TSP). This problem has had an uncontested popularity in computational mathematics, leading to a plethora of techniques and programs for solving it, both exactly and heuristically.

Another remarkable property of NP complete problems is their frequent occurrence in real-life problems, making them more than just theoretical constructions. A practical example of this is the field of bioinformatics, where many of the challenges faced by its researchers have turned out to be NP complete problems.

Yet another tantalising characteristic of the NP complete problems is their reduction property, making every problem equally difficult (or easy) to solve. In other words: to solve them all, we only have to solve one of them.

This thesis aims at utilising the above properties with the purpose of examining the effect of trying to solve bioinformatic problems using the reduction property and a state-of-the-art implementation for solving the TSP. The practical bioinformatic problem is the Shortest Superstring Problem (SSP). To assess the quality of the obtained solutions, they are compared to solutions from four approximation algorithms. To convey a full understanding of the algorithms and their approximation factors, the thesis additionally includes a self-contained survey of approximation algorithms for the SSP.

The thesis further examines the bioinformatic problems concerning Multiple Sequence Alignment (MSA) and hereby presents the definition of a TSP based scoring function. A near-optimal MSA construction algorithm that uses this scoring and additionally a divide-and-conquer algorithm for refining MSAs are implemented and experimentally tested.

Based on truly convincing results, the main conclusion of the thesis is that it is definitely a promising idea to apply efficient TSP solver implementations to solve NP complete problems within bioinformatic applications. The results obtained for the implemented MSA algorithms are far more modest, although the MSA construction algorithm and the scoring function should not be dismissed without further study.


Acknowledgements

First I would like to thank my supervisor Christian Nørgaard Storm Pedersen for being utmost inspiring, positive and patient all along the long and winding road, and for giving me the idea for this thesis in the first place. I can honestly recommend choosing a supervisor who has personal experience with small children.

Second I would like to thank Gaston Gonnet and especially Keld Helsgaun for their kind and helpful responses to my email queries.

A warm thanks goes to the staff at the Department of Computer Science at Aarhus University, in particular to Torsten Nielsen, Michael Glad and Kai Birger Nielsen, for being very prompt whenever I needed help and for letting me abuse quite some 'camels' for my test runs.

Furthermore I would like to thank all the people on the office floor: Anders Hauge Nissen, Martin Højgaard Have, Sean Geggie, Mikkel Vester, Marie Mehlsen, Mette Helm, Lene Mejlby and Jesper Jakobsen: you all made a long indoor summer and fall quite enjoyable.

Also many thanks to my office mates Morten Slot Kristensen and Anders Andersen for being very patient with a grumpy ol' man, and to Anders for helping me out whenever my lack of technical talent took its toll.

A hearty thanks to my "old" pair-programming partner Dennis Decker Jensen for his never-failing willingness to help me out every time I have gotten myself into another fine (coding) mess.

A great many thanks to all my thesis reviewers: Anders Andersen, Jesper Jakobsen, Rasmus Lauritsen and Sarah Zakarias, who bravely accepted the harsh challenge of making my attempts in written English both readable and understandable.

Last but certainly not least, a deep and wholehearted thanks to my wife Birgit and our three children for enduring me and my absentmindedness for so long.

Finn Rosenbech Jensen


Contents

Abstract

Acknowledgements

Contents

1 Introduction
  1.1 Road Map
  1.2 Thesis Terminology and Notation

Part I: The Road Goes Ever On and On

2 The Traveling Salesman Problem
  2.1 History
  2.2 Seminal Results in TSP Computation
  2.3 Notes

3 Traveling Salesman Problem Solvers
  3.1 Characterisation of TSP Solvers
  3.2 Selection of TSP Solver
  3.3 Notes

Part II: The Shortest Superstring Problem

4 The Shortest Superstring Problem
  4.1 The Problem
  4.2 Applications of SSP

5 Preliminaries
  5.1 Strings
  5.2 Graphs and Cycle Covers
  5.3 Approximation Strategies
  5.4 Notes

6 Algorithms
  6.1 The Cycle Algorithm
  6.2 The RecurseCycle Algorithm
  6.3 The Overlap Rotation Lemma
  6.4 The Greedy Algorithm
  6.5 Reductions of Hamiltonian Path Problem Instances
  6.6 The Best Factor Algorithm
  6.7 The TSP Based Algorithm
  6.8 Summary
  6.9 Notes

7 Experiments
  7.1 Test Data
  7.2 SSP Solver Tests
  7.3 Timing Tests
  7.4 Sequencing Tests

Part III: The Circular Sum Score and Multiple Sequence Alignment

8 Sequence Alignments
  8.1 Motivation
  8.2 Pairwise Sequence Alignment
  8.3 Multiple Sequence Alignment
  8.4 Notes

9 Circular Sum Score
  9.1 Sum-of-Pairs Score
  9.2 MSAs and Evolutionary Trees
  9.3 Circular Orders, Tours and CS Score
  9.4 Notes

10 Algorithms
  10.1 Path Sum Scores
  10.2 The MSA Construction Algorithm
  10.3 The MSA Enhance Algorithm
  10.4 Notes

11 Experiments
  11.1 Reference MSA Applications
  11.2 Benchmark Database
  11.3 Experiments Set-up
  11.4 MSA Construction Tests
  11.5 MSA Enhancing Tests
  11.6 Notes

Part IV: The End of The Tour

12 Conclusion
  12.1 Future Work

List of Algorithms

List of Figures

List of Tables

Part V: Appendices

A Acronyms

B LKH output

C GenBank Nucleotide Sequences

D SSP Test Data

Bibliography


Nothing is more dangerous than an idea,
if it's the only one you have

Pragmatic Programmers

CHAPTER 1

Introduction

One of the most intriguing notions in theoretical computer science is the existence of the class of problems known as NP complete problems. The notion characterises a class of problems for which it is "very unlikely" that an "efficient" general solution exists.

The argumentation for the "unlikeliness" is partly that forty years of intense study has not given birth to such a solution,1 partly that the implications, should such a solution exist, are very far-reaching. As an example, it would be possible to construct an efficient algorithm that could find the shortest mathematical proof of any theorem, thereby making the creative part of theorem proving superfluous.

The notion of an "efficient" general solution is defined in terms of polynomial2 running-time dependency on the size of the problem instance. Thus NP complete problems can be seen as problems for which we in most cases have neither the time nor the calculating power to find the optimal solution.

This essentially means that searching for exact solutions is infeasible and instead approximation algorithms and heuristics must be used. From a mathematical point of view this might seem unsatisfying, given that the problems have trivial mathematical solutions. However, from a practical point of view this makes the problems very challenging and interesting, giving the opportunity and leeway to try out any conceivable idea for solving them.

1 On August 7th 2010, Vinay Deolalikar, a recognised researcher at Hewlett Packard, published a claimed proof that P ≠ NP, implying that such solutions do not exist. This was the 7th out of 8 published proof attempts in 2010 http://www.win.tue.nl/~gwoegi/P-versus-NP.htm , and it spurred an intense effort to understand and pursue its new ideas; but in less than a week some deep and probably irremediable flaws were uncovered http://rjlipton.wordpress.com/2010/08/12/fatal-flaws-in-deolalikars-proof .

2 as opposed to exponential or factorial

Another fascinating hallmark of NP complete problems is their "reduction" property. This means that each of the problems3 can be reformulated as an instance of every other NP complete problem. The effect of this is that if an efficient solution for one of these problems were found, it could be used to find efficient solutions to all of the NP complete problems.

One of the most famous examples of an NP complete problem is the TSP. It is at the same time one of the most intensely investigated subjects in computational mathematics. The problem seems to cast a certain spell of fascination upon the people getting involved with it, and it has inspired studies by mathematicians, chemists, biologists, physicists, neuroscientists and computer scientists. The popularity of the TSP has had the practical effect that a lot of applications for solving it, both exactly and heuristically, have been devised.

A practical field where a lot of NP complete problems turn up is bioinformatics, or computational biology. NP complete problems occur in connection with bioinformatic challenges like Deoxyribonucleic Acid (DNA) sequencing, sequence alignments, gene identification and evolutionary tree building. Since it is in most cases infeasible to search for optimal solutions for these problems, the field of bioinformatics is very vivid in terms of devising approximation algorithms and heuristics.

This thesis can be seen in this light, as its main purpose is to examine the effect of trying to solve bioinformatics problems using the reduction property of NP hard problems and a state-of-the-art implementation for solving the TSP. This idea does run the risk of being too simplistic and general, since in effect the idea is to solve different problems in the same way. This is opposed to constructing or using specialised methods and algorithms designed to solve each specific problem, thereby taking advantage of the individual character of the problem. On the other hand, the idea could prove applicable, given that more than fifty years of vast exploration and work have been put into implementing efficient TSP solvers.

The second thesis part is dedicated to the above purpose. It contains the results of heuristically solving the NP complete, bioinformatics-related problem known as the SSP by reducing it to an instance of the TSP. To assess the quality of the solutions, they are compared to solutions from four approximation algorithms, one of which achieves the best known approximation factor for the SSP.

3 actually every problem belonging to the larger class of problems known as NP


To convey a full understanding of the approximation algorithms and their approximation factors, this part also contains a chapter that can be characterised as a newly devised, self-contained survey of approximation algorithms for the SSP.

As a further application of the idea, the third part of the thesis explores the use of the TSP in connection with bioinformatics problems concerning sequence alignment construction and evaluation. The part focuses on MSA and the definition of a scoring function which takes the evolutionary history of proteins into account. A near-optimal MSA construction algorithm based on this scoring function and additionally a divide-and-conquer algorithm for refining MSAs are implemented and experimentally tested. These algorithms originate in work done at the Swiss Federal Institute of Technology around 2000, and the thesis part covers improvements on both algorithms.

The goals and contributions of this thesis can thus be summarised as:

1) an examination of the feasibility of applying efficient TSP solver implementations to solve NP complete problems within bioinformatic applications,

2) a self-contained survey of approximation algorithms for the SSP, including descriptions of implementations and testing of solution quality, and

3) a description and an experimental testing of the improvements on the TSP based algorithms for MSA construction and MSA refinement.

1.1 Road Map

The thesis is divided into four parts:

Traveling Salesman Problem
The first part is a short (minimal?) tour through some history, definitions and theoretical results for the TSP. It also characterises TSP solvers and argues for the choice of solver to be used in the algorithm implementations.

This part should provide the reader with an understanding of the TSP and the current state-of-the-art solver implementations.


Shortest Superstring Problem
The second part has emphasis on the SSP and is by far the most theoretical and difficult part of the thesis. It starts off with a chapter introducing and defining the problem along with its applications. Next follows a chapter containing preliminaries for the theoretical treatment of the implemented algorithms.

These algorithms are then the theme of the succeeding chapter, which is a full treatment of the five implemented algorithms, their approximation factors and implementations. This mix of theory and practical implementation description is intentional, in the hope that it will make the reading more varied.

To conclude, a chapter follows which describes the experiments conducted on the algorithms and the results thereof.

The part should provide the reader with an understanding of both the theoretical foundation for SSP approximation algorithms as well as technical aspects of their implementation and the achieved practical results.

Multiple Sequence Alignment
The third part focuses on MSA and the evolutionary based scoring function. Both an MSA construction algorithm and an MSA improving algorithm, together with their implementations, are presented. The part finishes with a chapter describing the tests conducted on the algorithms and the results thereof.

This part of the thesis should provide the reader with an understanding of sequence alignment and some of the aspects concerning constructing and estimating MSAs.

Conclusion
The fourth part contains the thesis conclusion and ideas for future work.

1.2 Thesis Terminology and Notation

This section contains a short description of the mathematical terminology and notation used in the thesis.

1.2.1 Graphs.

We denote directed graphs as G(N,A) and use the names "nodes" and "arcs" for their components. Undirected graphs will be denoted G(V,E) and their components described using the names "vertices" and "edges". Arcs and edges are denoted ai and ei, or by the nodes or vertices they have as endpoints: (ni, nj) and (vi, vj). In an undirected graph the edges (vi, vj) and (vj, vi) are identical.


Remark 1
For simplicity the definitions in the rest of this section will all be phrased in terms of undirected graphs. Each definition containing the words 'edges' and 'vertices' can be substituted by an analogous definition for directed graphs phrased using 'nodes' and 'arcs'. □

A graph G′ = (V′, E′) is a subgraph of G(V,E) iff V′ ⊆ V and E′ ⊆ E. If V′ ⊆ V and E′ = { e = (vi, vj) | e ∈ E, vi, vj ∈ V′ }, then (V′, E′) is called the subgraph induced by V′.

By a walk in a graph G(V,E) we understand a sequence of vertices v1, v2, . . . , vk, where the walk starts in v1 and continuously follows the edges (vi, vi+1) between the vertices vi, vi+1 in the sequence. A walk is called a path if each of the vertices in the sequence is distinct.4 A walk where v1 = vk is called a circuit, and a path with v1 = vk is called a tour or a cycle.5 A path or a tour visiting all vertices in the graph exactly once is called Hamiltonian. For a tour visiting k vertices the short-cut notation v1, v2, . . . , vk will also be used instead of the more correct v1, v2, . . . , vk, v1. A tour T′ is called a subtour of T if all vertices in T′ are contained in T.
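As a concrete illustration of this vocabulary, the following sketch (my own, not part of the thesis) checks whether a vertex sequence is a walk, a path, or a Hamiltonian path/tour in an undirected graph given as an edge list. The five-vertex edge set at the end is illustrative, chosen to be consistent with the Hamiltonian tour of Example 1 below, not an exact reproduction of Figure 1.1.

```python
# Checks for the walk/path/tour definitions, on an undirected graph
# represented as a list of vertex pairs.

def _edge_set(edges):
    # Undirected: store each edge as an unordered pair.
    return {frozenset(e) for e in edges}

def is_walk(seq, edges):
    """A walk follows an edge between each pair of consecutive vertices."""
    und = _edge_set(edges)
    return all(frozenset((a, b)) in und for a, b in zip(seq, seq[1:]))

def is_path(seq, edges):
    """A (simple) path is a walk whose vertices are pairwise distinct."""
    return is_walk(seq, edges) and len(set(seq)) == len(seq)

def is_hamiltonian_path(seq, edges, vertices):
    """A path visiting every vertex of the graph exactly once."""
    return is_path(seq, edges) and set(seq) == set(vertices)

def is_hamiltonian_tour(seq, edges, vertices):
    """Short-cut notation: the closing edge (v_k, v_1) is implicit."""
    return (is_hamiltonian_path(seq, edges, vertices)
            and frozenset((seq[-1], seq[0])) in _edge_set(edges))

# An illustrative five-vertex cycle (NOT the exact graphs of Figure 1.1):
E = [(1, 5), (5, 4), (4, 2), (2, 3), (3, 1)]
V = [1, 2, 3, 4, 5]
print(is_hamiltonian_path((1, 5, 4, 2, 3), E, V))  # True
print(is_hamiltonian_tour((1, 5, 4, 2, 3), E, V))  # True
```

Representing edges as `frozenset` pairs makes (vi, vj) and (vj, vi) identical, exactly as required for undirected graphs above.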

For an undirected graph G(V,E) the "degree" of a vertex is the number of edges incident to the vertex. For a directed graph G(N,A) each node has an "in-degree" and an "out-degree" signifying the number of arcs going to and from the node respectively. The notation d^in_n and d^out_n denotes the in-degree and out-degree of node n.

Example 1
In Figure 1.1 (a) the sequence (n1, n5, n4, n2) is a path and in (b) the sequence (n1, n5, n4, n2, n1) is a tour. In subfigure (c) the sequence (n1, n5, n4, n2, n3) is a Hamiltonian path and in subfigure (d) the sequence (n1, n5, n4, n2, n3, n1) is a Hamiltonian tour. The degrees d^in_1, d^out_2, d^out_3 and d^in_4 equal two; all other degrees equal one. □

A complete graph G(V,E) is a graph where E includes all possible edges in the graph. In other words, E consists of all pairs of vertices (vi, vj), i ≠ j, in V. Finally we define the cut-set of a set S ⊆ V as δ(S) = { e | e = (vi, vj) ∈ E, vi ∈ S, vj ∉ S }. The cut-set of a vertex set consists of the edges having one endpoint in the set and the other endpoint outside the set. In other words, the edges that have to be "cut" in order to remove S from the graph.

Example 2
Figure 1.2 (a) shows a complete (undirected) graph with five vertices. In subfigure (b) the cut-set δ(v1) of the vertex v1 is illustrated. □

4 we use path in the implicit meaning "simple" path
5 again we use tour (cycle) in the implicit meaning simple tour (cycle)
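The cut-set definition translates directly to code. The sketch below is my own; it builds the complete graph of Figure 1.2 (a) on the vertices {1, ..., 5} and computes δ(S) by the one-endpoint-inside test.

```python
from itertools import combinations

def cut_set(S, edges):
    """delta(S): the edges with exactly one endpoint inside S."""
    S = set(S)
    return [e for e in edges if (e[0] in S) != (e[1] in S)]

# Complete graph on the vertices {1, ..., 5}, as in Figure 1.2 (a):
K5 = list(combinations(range(1, 6), 2))

print(cut_set({1}, K5))          # the four edges incident to vertex 1
print(len(cut_set({1, 2}, K5)))  # 6: each of 1 and 2 connects to 3, 4, 5
```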


Figure 1.1: paths and tours in a graph [panels (a)–(d): five-vertex graphs illustrating Example 1]

Figure 1.2: complete graph and cut set [(a) complete graph, (b) cut set of v1]


A bi-partite graph G(V,E) = G(V′, V′′, E) is a graph where the vertex set V can be divided into two disjoint vertex sets V′, V′′ with the property that E = {(v′, v′′) | v′ ∈ V′, v′′ ∈ V′′}. In other words, all edges are going from V′ to V′′. For simplicity the notation G(V, V′, E) is also used for bi-partite graphs.

A weighted graph G(V,E,W) is a graph that associates a weight (or cost) with each of its edges. The weights are typically real numbers, that is W : E → R. We denote the weight of an edge w(e) and the weight of a graph G by W(G) = ∑_{e ∈ E} w(e).

Finally, a multigraph is a graph which can have more than one copy of its edges.

1.2.2 Trees.

A connected graph without cycles, T(V,E), is called a "tree". In a tree each vertex can have zero or more "child vertices" called its children and at most one "parent vertex" called its parent. A vertex having no children is called a "leaf". A tree can be "rooted" or "unrooted", implying whether or not there is a single vertex having no parent, denoted the "root" of the tree. Non-leaf vertices are also called "interior" vertices.

A subtree Tv(V′, E′) of a tree T(V,E) is the induced subgraph where V′ = {v} ∪ {vi | vi is a descendant of v}. In other words, a tree having v as root and including everything "below" v in T.

A special kind of tree is the "binary tree", where the number of children for each vertex is at most two. In a binary tree with n leaves there will be n − 1 interior vertices.

Example 3
Figure 1.3 (a) shows an unrooted tree with two interior vertices and five leaves. Subfigure (b) shows a rooted tree consisting of a root, two interior vertices and five leaves, and finally subfigure (c) shows a binary tree consisting of a root, three interior vertices and five leaves. □
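The leaf/interior count for binary trees can be checked on a small example. The nested-pair encoding below is my own (a leaf is any non-tuple value, an interior vertex a pair of subtrees), and the internal count includes the root, which is how the n versus n − 1 claim above tallies.

```python
def counts(tree):
    """Return (#leaves, #internal vertices) of a binary tree encoded as
    nested pairs; any non-tuple value is a leaf.  The internal count
    includes the root."""
    if not isinstance(tree, tuple):
        return 1, 0
    (l1, i1), (l2, i2) = counts(tree[0]), counts(tree[1])
    return l1 + l2, i1 + i2 + 1

# A binary tree with five leaves (an arbitrary shape, not Figure 1.3 (c)):
t = ((("a", "b"), "c"), ("d", "e"))
leaves, internal = counts(t)
print(leaves, internal)  # 5 4  (and indeed 4 = 5 - 1)
```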

Figure 1.3: Different tree types [(a) unrooted tree, (b) rooted tree, (c) binary tree]


1.2.3 Strings.

When naming strings and chars, letters from the "middle" of the alphabet (s, t, u, . . .) will be used for strings, and letters from the beginning of the alphabet (a, b, . . .) will be used for chars. The empty string will be denoted ε.

For any string s let |s| denote the length of s. For a collection of strings S = (s1, s2, . . . , sn) we denote the accumulated length of S as ||S|| = ∑_{i=1}^{n} |si|. String indexing '[ ]' will be one-based and string slicing '[ : ]' will be start-inclusive and end-non-inclusive. The notation s[: i] is shorthand for s[1 : i] and s[i :] is shorthand for s[i : |s| + 1]. String concatenation will be implicit, i.e. the concatenation of strings s, t will be denoted st.

Example 4
Let s = abcde, t = cba. Then
|ε| = 0, |s| = 5, s[1] = a, s[2] = b, s[5] = e,
s[2 : 5] = bcd, s[1 : 3] t[2 : 4] = s[: 3] t[2 :] = abba,
t^3 = ttt = cbacbacba, t^∞ = ttt... □
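The one-based, end-exclusive convention can be mimicked on top of Python's zero-based strings; the helper names `idx` and `slc` below are mine, and the assertions simply replay Example 4.

```python
def idx(s, i):
    """One-based indexing: thesis notation s[i]."""
    return s[i - 1]

def slc(s, i=1, j=None):
    """One-based, start-inclusive, end-non-inclusive slicing s[i : j].
    s[: i] means s[1 : i] and s[i :] means s[i : |s| + 1]."""
    if j is None:
        j = len(s) + 1
    return s[i - 1 : j - 1]

# Replaying Example 4:
s, t = "abcde", "cba"
assert idx(s, 1) == "a" and idx(s, 2) == "b" and idx(s, 5) == "e"
assert slc(s, 2, 5) == "bcd"
assert slc(s, 1, 3) + slc(t, 2, 4) == "abba"
assert slc(s, j=3) + slc(t, 2) == "abba"   # s[: 3] t[2 :]
assert t * 3 == "cbacbacba"                # t^3
```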

1.2.4 Arrays.

Since strings can be seen as char arrays, we will treat array operations analogously to strings, i.e. array indexing '[ ]' will be one-based and array slicing '[ : ]' will be start-inclusive and end-non-inclusive. Furthermore, for an array Arr, an element elem, and an index idx we use the following array functions:

• LIST() constructs an array,

• SIZE(Arr) or |Arr| denotes the number of elements in Arr,

• APPEND(Arr, elem) appends elem at the end of Arr,

• REMOVE(Arr, idx) removes the element at index idx from Arr,

• SUM(Arr) calculates the sum of the elements in Arr. (Requires numeric elements in Arr).
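In Python terms, these pseudocode array functions map onto built-in list operations; the mapping below is my own reading of the conventions, with REMOVE taking a one-based index to match the indexing rule above.

```python
def LIST():
    """Constructs an (empty) array."""
    return []

def SIZE(arr):
    """Number of elements in arr; also written |arr|."""
    return len(arr)

def APPEND(arr, elem):
    """Appends elem at the end of arr."""
    arr.append(elem)

def REMOVE(arr, idx):
    """Removes the element at (one-based) index idx from arr."""
    del arr[idx - 1]

def SUM(arr):
    """Sum of the (numeric) elements in arr."""
    return sum(arr)

a = LIST()
for x in (3, 1, 4):
    APPEND(a, x)
REMOVE(a, 2)            # drops the 1 (one-based index 2)
print(SIZE(a), SUM(a))  # 2 7
```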


Part I

The road goes ever on and on

DNA climb tour with 8190 cities


The Road goes ever on and on
Down from the door where it began.
Now far ahead the Road has gone,
And I must follow, if I can,
Pursuing it with eager feet,
Until it joins some larger way
Where many paths and errands meet.
And whither then? I cannot say.

Bilbo Baggins (J.R.R. Tolkien)

CHAPTER 2

The Traveling Salesman Problem

This chapter introduces the Traveling Salesman Problem and gives a short overview of its history along with some major computational results. It furthermore contains the Linear Program (LP) formulation of the TSP. As this chapter is meant as an introduction, treatment and definition of some technical terms will be postponed to later chapters.

2.1 History

The TSP is one of the most (if not the most) widely studied problems in computational mathematics. One of the reasons for this might very well be the ease of formulating and understanding the problem:

Definition 1 (The Traveling Salesman Problem)
Given a collection of cities along with the cost of travel between each pair of them, determine the cheapest route visiting each city exactly once, starting and ending in the same city. □
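Definition 1 can be stated directly as an exhaustive search. The sketch below (my own, factorial-time and therefore usable only on tiny instances) fixes one city as the start, tries every order of the remaining cities, and keeps the cheapest round trip; the 4-city cost matrix is made up for illustration.

```python
from itertools import permutations

def tsp_brute_force(cost):
    """Exhaustive search per Definition 1: cost[i][j] is the travel cost
    between cities i and j.  Fixing city 0 as the start, every order of
    the remaining cities is tried.  Returns (tour_cost, tour)."""
    n = len(cost)
    best = (float("inf"), None)
    for perm in permutations(range(1, n)):
        tour = (0,) + perm
        # Sum the edges of the closed tour, including the one back to city 0.
        length = sum(cost[tour[k]][tour[(k + 1) % n]] for k in range(n))
        best = min(best, (length, tour))
    return best

# A made-up 4-city instance:
C = [[0, 2, 9, 10],
     [2, 0, 6, 4],
     [9, 6, 0, 3],
     [10, 4, 3, 0]]
print(tsp_brute_force(C))  # the cheapest round trip costs 18
```

The (n − 1)! permutations make the infeasibility of exact search, discussed in the introduction, immediately tangible.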

Given this simple formulation it might be expected that the problem could have an equally simple solution. This turns out not to be the case. Despite being easily understood, no general efficient solution to the TSP has been found. However, the problem has inspired multiple studies by mathematicians, computer scientists, chemists, physicists, psychologists and a host of non-professional researchers.

An interesting historical reference is the German handbook from 1832 by B.F. Voigt with the title: "Der Handlungsreisende - wie er sein soll und was er zu thun hat um Aufträge zu erhalten und eines glücklichen Erfolgs in seinen Geschäften gewiss zu sein - Von einem alten Commis-Voyageur".1

Figure 2.1: The Commis-Voyageur tour for 47 German cities (from [Applegate et al., 2007]).

The handbook contains routes through Germany and Switzerland and goes as far as to claim:

We believe we can ensure as much that it will not be possible to plan the tours through Germany in consideration of the distances and the traveling back and forth, which deserves the traveller's special attention, with more economy. The main thing to remember is always to visit as many localities as possible without having to touch them twice.2

One of the tours, shown in Figure 2.1, goes through 47 German cities and is actually of very good quality; it might even be optimal given the travel conditions of that time.

The first explicit mention of the TSP as a mathematical optimisation problem is probably due to the Austrian mathematician Karl Menger, who in a mathematical colloquium in Vienna in 1930 used the formulation:

Wir bezeichnen als Botenproblem (weil diese Frage in der Praxis von jedem Postboten, übrigens auch von vielen Reisenden zu lösen ist) die Aufgabe, für endlich viele Punkte, deren paarweise Abstände bekannt sind, den kürzesten die Punkte verbindenden Weg zu finden.3

1 The traveling salesman - how he has to be and what he has to do to acquire orders and be assured of a fortunate success in his business - by an old Commis-Voyageur

2 translated from the German original by Linda Cook

Shortly after this the TSP became popular among mathematicians at Princeton University. There does not exist any authoritative source for the origin of the problem's name, but according to Merrill Flood and A. W. Tucker it was introduced under its present-day name in 1934 as part of a seminar given by Hassler Whitney at Princeton University.4

An additional characteristic of the TSP is that it is relatively easy to construct good solutions, but far from easy to find a provably optimal solution. This has made the problem a popular "guinea pig" for all kinds of optimisation techniques during the last half of the 20th century (see Table 2.1 on the following page).

An excellent example of this is the seminal paper [Dantzig et al., 1954], where the inventor of the Simplex Algorithm together with two colleagues introduces the integer LP formulation of the TSP (see Definition 2) as well as a cutting plane method5 in order to solve a 49-city problem instance to provable optimality.

Definition 2 (Integer LP-formulation for TSP)
Let G = (V,E,W) be a weighted graph with V = {v1, v2, . . . , vn}. Then the TSP in G can be formulated as:

Minimise ∑_{e ∈ E} w(e) · x_e subject to

∑_{e ∈ δ({vi})} x_e = 2, ∀ vi ∈ V (degree constraints) (2.1)

∑_{e ∈ δ(S)} x_e ≥ 2, ∀ S ⊂ V, S ≠ ∅ (subtour elimination constraints) (2.2)

x_e ∈ {0, 1}, ∀ e ∈ E (integer constraints) (2.3)

□
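The role of the degree and subtour elimination constraints can be made concrete in code. In this sketch of my own, a candidate solution is given as the set of edges with x_e = 1; the first function checks the degree constraints (2.1), and the second detects the disconnected sub-cycles that the constraints (2.2) are there to rule out.

```python
def satisfies_degree(n, chosen):
    """Degree constraints (2.1): every vertex meets exactly two chosen edges."""
    deg = [0] * n
    for u, v in chosen:
        deg[u] += 1
        deg[v] += 1
    return all(d == 2 for d in deg)

def is_single_tour(n, chosen):
    """What (2.2) enforces: the chosen edges connect all n vertices,
    i.e. they form one cycle rather than several sub-cycles."""
    adj = {u: [] for u in range(n)}
    for u, v in chosen:
        adj[u].append(v)
        adj[v].append(u)
    seen, stack = {0}, [0]          # flood fill from vertex 0
    while stack:
        for v in adj[stack.pop()]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == n

tour     = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
subtours = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)]  # two triangles
print(satisfies_degree(6, tour), is_single_tour(6, tour))          # True True
print(satisfies_degree(6, subtours), is_single_tour(6, subtours))  # True False
```

The two-triangle example shows why the degree constraints alone are not enough: both vertex sets satisfy (2.1), yet only the first forms a single tour.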

Remark 2In Definition 2 the degree constraints (2.1) ensure that all vertices havedegree exactly two i.e. a tour enters and leaves each city exactly once. Thesubtour elimination constraints (2.2) ensure that the solution will consist

3We denote as messenger-problem (because this question has to be solved by any post-man and by the way many salesmen) the challenge of finding the shortest route betweena finite number of points of which the pairwise distance is known.

4although Whitney queried some twenty years later did not recall the problem[Dantzig, Fulkerson, and Johnson, 1954]

5a way of coming from the optimal fractional variable solution to the optimal integervariable solution

13

Page 24: TRAVELING SALESMAN PROBLEM BIOINFORMATIC ALGORITHMS

2. THE TRAVELING SALESMAN PROBLEM

Table 2.1: Some TSP history milestones (mainly from [Okano, 2009])

Year | Milestone | Contributor(s)
1954 | 49 cities instance solved to optimality using an LP and manually added cutting planes | Dantzig, Fulkerson and Johnson
1962 | 33 cities TSP contest | Procter & Gamble
1970 | Lagrangian relaxation with a lower bound within about 1 % of optimal | M. Held and R. M. Karp
1972 | Proof of NP-completeness for TSP | R. M. Karp
1973 | k-opt heuristic, 1 to 2 % above optimal | Lin and Kernighan
1976 | Factor 3/2-approximation algorithm for MTSP | Nicos Christofides
1983 | Simulated Annealing based heuristic | Kirkpatrick, Gelatt and Vecchi
1985 | Recurrent neural network-based heuristic | Hopfield and Tank
1990 | TSP heuristics based on k-d trees | Bentley
1991 | TSPLIB published | Reinelt
1992 | 3038 cities instance solved to optimality using cutting plane generation (Concorde) | David Applegate, Robert Bixby, Vašek Chvátal and William Cook
1996 | PTAS for Euclidean TSP with running time n^O(1/ε) | Sanjeev Arora
1998 | Improved k-opt heuristic LKH, within 1 % above optimal | Keld Helsgaun
2004 | 24 978 cities instance solved by LKH and proven optimal by Concorde | Applegate, Bixby, Chvátal, Cook and Helsgaun
2006 | 85 900 cities instance solved and proven optimal using LKH and Concorde | Applegate, Bixby, Chvátal, Cook, Espinoza, Goycoolea and Helsgaun


of one large cycle and not two or more sub-cycles. Finally, the integer constraints (2.3) ensure that we use each edge in a “binary” fashion, i.e. an edge is either part of the tour or it is not. □

In 1962, the TSP became widely known to the public in the USA due to a contest by Procter & Gamble involving a problem instance of 33 cities. The $10 000 prize for the shortest solution was at that time enough to purchase a new house in many parts of the country.

Figure 2.2: The 33 city contest from 1962.

In 1970, Held and Karp developed a one-tree6 relaxation which provides a lower bound within 1 % of the optimum. It achieves this by relaxing the degree constraints in Definition 2 using a Minimum Spanning Tree (MST) and Lagrangian multipliers.
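In outline, the relaxation can be sketched as follows. This is a minimal illustration only, not Held and Karp's exact procedure; the iteration count and step-size schedule are arbitrary choices, and a symmetric distance matrix is assumed:

```python
def one_tree_bound(dist, pi):
    """Lower bound from a minimum 1-tree under node penalties pi.
    A 1-tree is an MST on nodes 1..n-1 plus the two cheapest edges
    incident to node 0; every tour is a 1-tree, so the penalised
    1-tree weight minus 2*sum(pi) bounds the optimal tour length."""
    n = len(dist)
    w = lambda i, j: dist[i][j] + pi[i] + pi[j]
    in_tree, degree, total = {1}, [0] * n, 0.0
    while len(in_tree) < n - 1:                    # Prim's MST on nodes 1..n-1
        i, j = min(((i, j) for i in in_tree for j in range(1, n)
                    if j not in in_tree), key=lambda e: w(*e))
        in_tree.add(j)
        degree[i] += 1
        degree[j] += 1
        total += w(i, j)
    for j in sorted(range(1, n), key=lambda j: w(0, j))[:2]:
        degree[0] += 1                             # two cheapest edges at node 0
        degree[j] += 1
        total += w(0, j)
    return total - 2 * sum(pi), degree

def held_karp_bound(dist, iterations=100, step=1.0):
    """Subgradient ascent on the Lagrangian penalties pi."""
    pi = [0.0] * len(dist)
    best = float("-inf")
    for _ in range(iterations):
        bound, degree = one_tree_bound(dist, pi)
        best = max(best, bound)
        if all(d == 2 for d in degree):            # the 1-tree is itself a tour
            break
        pi = [p + step * (d - 2) for p, d in zip(pi, degree)]
        step *= 0.95
    return best
```

On the four corners of a unit square (optimal tour length 4) the bound reaches the optimum immediately, since the minimum 1-tree there is itself a tour.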

In 1972, Karp proved the NP-completeness of the Hamiltonian Cycle Problem (HCP), from which the NP-completeness of the TSP follows almost directly [Karp, 1972].

In 1973, Lin and Kernighan proposed a variable-depth edge exchanging heuristic for refining an initial tour. The method, now known as the “Lin-Kernighan” algorithm, performs variable k-opt moves that allow intermediate tours to be longer than the original tour. A k-opt move can be

6 A tree containing exactly one cycle


seen as the removal of k edges from a TSP tour followed by the patching of the resulting paths into a tour using k other edges (see Figure 2.3).

The algorithm produces tours about 1 to 2 % above the optimal length and has earned a reputation of being difficult to implement correctly. In [Helsgaun, 1998, Section 2.3.2, page 7] a survey paper published in 1989 is mentioned, in which the authors claim that no other implementation of the algorithm at that time had shown as good efficiency as was obtained by Lin and Kernighan in [Lin and Kernighan, 1973].
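For k = 2 the move is easy to make concrete. The sketch below is illustrative only (the Lin-Kernighan algorithm itself chooses k adaptively and selects moves far more cleverly); it assumes a symmetric distance matrix and repeats improving 2-opt moves until a local optimum is reached:

```python
def tour_length(tour, dist):
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))

def two_opt(tour, dist):
    """Repeat improving 2-opt moves until none exists: remove edges
    (tour[i-1], tour[i]) and (tour[j], tour[j+1]) and reconnect the two
    paths by reversing tour[i..j]. The result is a 2-opt local optimum."""
    tour = list(tour)
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(1, n - 1):
            for j in range(i + 1, n):
                a, b = tour[i - 1], tour[i]
                c, d = tour[j], tour[(j + 1) % n]
                # gain of replacing edges (a, b), (c, d) with (a, c), (b, d)
                if dist[a][c] + dist[b][d] < dist[a][b] + dist[c][d] - 1e-12:
                    tour[i:j + 1] = reversed(tour[i:j + 1])
                    improved = True
    return tour
```

Starting from the crossing tour [0, 2, 1, 3] on the four corners of a unit square, a single move uncrosses the tour and yields the optimal length 4.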

[Figure 2.3: Different k-opt moves: (a) 2-opt move, (b) 3-opt move]

In 1976, Christofides published a tour construction method that achieves a 3/2-approximation7 by using an MST and “Perfect Matching”. Apart from the Euclidean TSP this is still the tightest approximation ratio known [Christofides, 1976].

Other examples of using the TSP as a “guinea pig” are found in the article [S. Kirkpatrick and Vecchi, 1983], which introduces the random, local search technique known as “Simulated Annealing”, and in the article [Hopfield and Tank, 1985], which is one of the first publications discussing “Neural Network” algorithms. Both articles use the TSP as a working example.

In 1990, Bentley developed a new, highly efficient variant of the k-d tree8 data structure, which is used for proximity checking, while he was working on heuristics for the TSP [Bentley, 1992].

7 i.e. guaranteeing a solution no worse than 3/2 times the optimal solution
8 a binary search tree structure extended to k dimensions


In 1991, Reinelt composed and published TSPLIB9, a library containing many of the test problems studied over the last 50 years [Reinelt, 1991].

In 1992, David Applegate, Robert Bixby, Vašek Chvátal and William Cook solved a 3038-city TSP instance to optimality using the exact TSP solver Concorde, whose development they had started in 1990. Concorde has ever since been involved in all proven-optimality tour records.

In 1996, the first Polynomial Time Approximation Scheme (PTAS) for the Euclidean TSP was devised by Arora. The PTAS finds tours of length at most (1 + ε) times the optimal and has a running time of n^O(1/ε). Since it had previously been proven that neither the general nor the Metric Traveling Salesman Problem (MTSP) admits a PTAS, this result was received with surprise.

In 1998, Keld Helsgaun released a highly efficient and improved extension of the Lin-Kernighan heuristic, called Lin-Kernighan-Helsgaun (LKH). Among other characteristics it uses one-tree approximations for determining candidate edge lists10 and 5-opt moves. LKH has later been extended, and together with Concorde it has participated in solving the largest instances of the TSP to this day. Furthermore, LKH has held the record for the 1 904 711-city World TSP Tour11 since 2003. It has subsequently improved the tour three times (most recently in May 2010).

2.2 Seminal Results in TSP Computation

The ability to solve TSP instances to optimality has profited both from the discovery of new computational techniques and from advances in computer technology. Table 2.2 shows some of the milestone results for solving TSP instances.

As mentioned in Table 2.1, the TSP tour for 24 978 cities in Sweden depicted in Figure 2.4 was found in a cooperation between LKH and Concorde. LKH found the optimal tour, and a Concorde computation using a cluster of 96 machines and a total running time of 84.8 CPU years proved it optimal. The initial cutting-plane procedure on the first LP began in March 2003 and the final branch-and-cut run finished in January 2004. A final check of the solution was completed in May 2004.

The largest TSP instance optimally solved to this date consists of a tour through 85 900 cities in a Very-large-scale Integration (VLSI) application. The problem arose at Bell Laboratories in the late 1980s. This achievement was also a cooperation between Concorde and LKH. Again LKH found a good tour, which was then used as an

9 http://www.iwr.uni-heidelberg.de/groups/comopt/software/TSPLIB95/
10 a list containing the preferred routes between two cities
11 http://www.tsp.gatech.edu/world/


Figure 2.4: The optimal TSP tour in Sweden. (from http://www.tsp.gatech.edu/sweden/tours/swtour_small.htm)


Table 2.2: Milestones for solving TSP (from [Applegate et al., 2007])

Year | Contributor(s) | Size | TSPLIB name
1954 | G. Dantzig, R. Fulkerson, S. Johnson | 49 cities | dantzig42
1971 | M. Held, R. M. Karp | 57 cities | -
1971 | M. Held, R. M. Karp | 64 cities | random points
1975 | P. M. Camerini, L. Fratta, F. Maffioli | 67 cities | random points
1975 | P. Miliotis | 80 cities | random points
1977 | M. Grötschel | 120 cities | gr120
1980 | H. Crowder, M. W. Padberg | 318 cities | lin318
1987 | M. W. Padberg, G. Rinaldi | 532 cities | att532
1987 | M. Grötschel, O. Holland | 666 cities | gr666
1987 | M. W. Padberg, G. Rinaldi | 1002 cities | pr1002
1987 | M. W. Padberg, G. Rinaldi | 2392 cities | pr2392
1992 | Concorde | 3038 cities | pcb3038
1993 | Concorde | 4461 cities | fnl4461
1994 | Concorde | 7397 cities | pla7397
1998 | Concorde | 13509 cities | usa13509
2001 | Concorde | 15112 cities | d15112
2004 | Concorde | 24978 cities | sw24978
2004 | Concorde with Domino-Parity | 33810 cities | pla33810
2006 | Concorde with Domino-Parity | 85900 cities | pla85900

upper bound input for Concorde. The determination of the optimal tour, which turned out to be the one found by LKH, required a total running time of approximately 136 CPU years. To certify the optimality a proof-checking procedure, which ran for over 500 hours, was devised; the project led to an article of its own [Applegate, Bixby, Chvátal, Cook, Espinoza, Goycoolea, and Helsgaun, 2009].

2.3 Notes

This chapter is based on [Applegate et al., 2007] and [Okano, 2009]. All other references are explicitly mentioned.


The Traveling Salesman always visits Knuth first

Knuth homage web page

CHAPTER 3

Traveling Salesman Problem Solvers

This chapter describes the different types of TSP solvers. It further introduces the TSP solver that was used in the later algorithm implementations, and the reasoning behind choosing it. To avoid getting too involved in technical details all descriptions are kept short.

3.1 Characterisation of TSP Solvers

TSP solvers come in two main classes:

Exact Solvers There are two groups of exact solvers. One solves relaxations of the TSP LP formulation (see Definition 2) and uses methods like Cutting Plane, Interior Point, Branch-and-Bound and Branch-and-Cut. Another, smaller group uses Dynamic Programming. For both groups the main characteristic is a guarantee of finding optimal solutions, at the expense of running time and space requirements.

Non-exact Solvers These solvers offer potentially non-optimal but typically faster solutions; in a way, the opposite trade-off of the exact solvers. Non-exact solvers can be subdivided into:

Approximation Algorithms These algorithms come with a worst case approximation factor for the found solution. The two traditional methods for solving the TSP are a pure MST based algorithm, which achieves a factor 2 approximation, and a combined MST and Minimum Matching Problem (MMP) based algorithm due to Christofides, which achieves a factor 3/2 approximation. Both methods are restricted to the MTSP as they depend on the


triangle inequality. The PTAS for the Euclidean TSP is mainly a theoretical result due to its prohibitive running time.

Heuristic Algorithms These algorithms only promise a feasible solution. They range from simple tour-construction methods like Nearest Neighbour, Clarke-Wright and Multiple Fragment1 to more complicated tour improving algorithms like Tabu Search and Lin-Kernighan. Finally there is a group of fascinating algorithms which unfortunately tend to combine approximate solutions and large running times. Here we find methods like Simulated Annealing, Genetic Algorithms, Ant Colony Algorithms and machine learning algorithms like Neural Networks.

3.2 Selection of TSP Solver

The main problem in choosing a TSP solver for the later algorithm implementations was deciding whether an exact solver or a heuristic solver would be appropriate, since the “state of the art” program for each class is easily identified: the Concorde exact solver and the LKH heuristic solver, respectively. This evaluation is in accordance with the findings reported in [Laporte, 2010].

Concorde This is an ANSI C implementation consisting of approximately 130 000 lines of code. Concorde is an LP based solver that uses several modern cutting plane techniques, thus approaching the optimal value from below. Like all exact solvers, its running time and space requirements suffer from the “cost of being exact”. The source code is available for academic use.2

LKH This is a considerably optimised modification of the Lin-Kernighan algorithm, implemented in ANSI C. To quote [Applegate et al., 2007, Chapter 17.2]:

Since its introduction, the LKH variant of the local-search algorithm of Lin and Kernighan has set the standard for high-end TSP heuristics.

LKH starts by using a Held and Karp one-tree ascent method to construct a candidate edge list. A problem instance is then solved using a number of independent runs. Each run consists of a series of trials, where in each trial a tour is determined by the modified Lin-Kernighan algorithm. The trials of a run are not independent, since edges belonging to the best tour of the current run are used to prune

1 often called Greedy
2 http://www.tsp.gatech.edu/concorde.html


the search. In the standard setting, LKH refines using 5-opt moves in ten independent runs, each containing a number of trials equalling the number of cities. A typical output from LKH can be seen in Appendix B. The code is distributed for research use.3

Figure 3.1: An LKH tour of 11 455 Danish cities.

In order to gain experience with the programs the following experiment was conducted: on a single dedicated machine4 both programs were challenged with the dk11455.tsp problem. As the name indicates, the problem is a tour consisting of 11 455 cities in Denmark (see Figure 3.1). The optimal solution is a tour of length almost 21 000 km. Like most of the other large solved tours, it was found by LKH and proven optimal using Concorde. With both programs in standard configuration, the results shown in Table 3.1 and Figure 3.2 were obtained.

3 http://www.akira.ruc.dk/~keld/research/LKH/
4 Pentium 4, 3066 MHz, 512 MB RAM

Table 3.1: LKH and Concorde results for dk11455.tsp

TSP instance dk11455.tsp, OPT: 20 996 131

LKH:
day 1 | started
day 6 | produced 10 tours, best: 20 996 166

Concorde:
day 1 | started with best LKH tour as upper bound: 20 996 166; initial lower bound: 20 991 500

       | Active nodes | Temporary files | Lower bound
day 8  |          557 |          1.2 GB | 20 992 921
day 15 |          987 |          2.1 GB | 20 993 016
day 22 |         1183 |          2.5 GB | 20 993 069
day 29 |         1478 |          3.0 GB | 20 993 116
day 36 |         1803 |          3.7 GB | 20 993 152
day 43 |         2027 |          4.1 GB | 20 993 171
day 46 |         2192 |          4.4 GB | 20 993 182

The experiment with Concorde was stopped after 46 days. As can be seen in Figure 3.2, it looks as if Concorde would steadily approach the optimal value. A crude extrapolation5 predicts that the optimal value would have been found in about 16 months. Although the TSP instance in this experiment was larger than the expected size of the instances in the thesis experiments, the characteristics of the two programs were clearly illustrated.
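For the curious, the crude extrapolation can be reproduced from the Concorde lower bounds in Table 3.1 with an ordinary least-squares line. This is equally unscientific: the bound visibly grows sublinearly, so a linear fit, if anything, is optimistic:

```python
def linear_fit(xs, ys):
    """Ordinary least squares fit y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

# Concorde lower-bound progress from Table 3.1
days = [8, 15, 22, 29, 36, 43, 46]
bounds = [20992921, 20993016, 20993069, 20993116,
          20993152, 20993171, 20993182]
optimum = 20996131

a, b = linear_fit(days, bounds)
days_to_opt = (optimum - a) / b   # day at which the fitted line reaches OPT
print(f"roughly {days_to_opt / 30:.0f} months")
```

The fitted line reaches the optimum after roughly 500 days, in the same ballpark as the 16-month figure quoted above.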

In the end LKH was chosen as the standard TSP solver for the experiments in this thesis, on account of the following:

• As mentioned in [Laporte, 2010], it consistently produces optimal solutions on small and medium size instances: in less than a second for n = 100 and in less than a minute for n = 1000.6 This covers the expected size of instances in the thesis experiments very well.

• The need for provable optimality is of minor importance when the TSP solution is used to implement heuristic algorithms.

3.3 Notes

Concorde is described in detail by its authors in [Applegate et al., 2007], and LKH is described in detail in the technical reports [Helsgaun, 1998, 2006] and in the papers [Helsgaun, 2000, 2009].

5 not scientifically sound but very tempting
6 on a 300 MHz G3 Power Macintosh


[Plot: tour length versus days for the LKH tour values, the optimal tour length, and the Concorde lower bound on dk11455.tsp]

Figure 3.2: Results on dk11455.tsp. Note the addition of 20 990 000 on the y-axis.


Part II

The Shortest Superstring Problem

Person and bases tour with 16384 cities


The Shortest Common Superstring Problem, an elegant but flawed abstraction

Richard M. Karp

CHAPTER 4

The Shortest Superstring Problem

This chapter introduces and defines the Shortest Superstring Problem. It further contains the motivation for including the problem in this thesis.

4.1 The Problem

The SSP, or Shortest Common Superstring Problem, is defined as follows:

Definition 3
Given a collection of strings S = (s1, s2, . . . , sn) and a finite alphabet Σ, with sj ∈ Σ∗, j ∈ 1 . . . n, determine the shortest string containing each string in S as a (consecutive) substring. In other words, find the shortest superstring for the collection S. □

Remark 3
Without loss of generality (WLOG) it can be assumed that S is substring free, i.e. no string in S is a substring of any other string in S. This assumption, which is made for the rest of this thesis part, does not influence the shortest superstring. □

The SSP is, like the TSP, NP-hard [Gusfield, 1997, Chapter 16.17] and also MAX-SNP-hard [Blum, Jiang, Li, Tromp, and Yannakakis, 1994, Section 7], which means that no PTAS exists unless P = NP. It is therefore highly improbable that a general, efficient algorithm for solving the SSP exists; thus the best we can hope for is approximation algorithms and heuristics.
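For tiny instances the SSP can nevertheless be solved exactly by exhaustive search over all orderings. The sketch below is illustrative only; `overlap` computes the longest proper overlap of two strings, and `merge` the superstring obtained for a fixed order:

```python
from itertools import permutations

def overlap(s, t):
    """Length of the longest proper suffix of s that is a prefix of t."""
    for k in range(min(len(s), len(t)) - 1, 0, -1):
        if s[-k:] == t[:k]:
            return k
    return 0

def merge(order):
    """Superstring for the strings taken in the given (fixed) order."""
    result = order[0]
    for s, t in zip(order, order[1:]):
        result += t[overlap(s, t):]   # append only the non-overlapping suffix
    return result

def shortest_superstring(strings):
    """Exact SSP by trying all n! orderings -- feasible only for tiny n."""
    return min((merge(p) for p in permutations(strings)), key=len)
```

For example, for the collection ("half", "alfalfa") the search returns "halfalfa", saving the three overlapping characters.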


4.2 Applications of SSP

Articles concerning the SSP typically mention applications in bioinformatics and data compression [Blum et al., 1994; Breslauer, Jiang, and Jiang, 1996; Ilie and Popescu, 2006; Kaplan, Lewenstein, Shafrir, and Sviridenko, 2005; Kaplan and Shafrir, 2005; Romero, Brizuela, and Tchernykh, 2004; Tarhio and Ukkonen, 1988].

The first application refers to the DNA-sequencing method known as Shotgun Sequencing; the second refers to the idea of compressing a set of strings into their shortest superstring, combined with an index pair for each string marking its beginning and ending index. As we will argue, the citation of these applications seems to be based more upon “transitive” citing than upon real applications.

4.2.1 Shotgun Sequencing

When having to sequence a genome it is not technically possible to sequence strands longer than 100 to 1000 base pairs. For this reason longer strands must be divided into smaller pieces and subsequently re-assembled. In Shotgun Sequencing one first makes many copies (clones) of the DNA of interest. These copies are then randomly cut into smaller pieces by physical, enzymatic or chemical means.

The difficulty in assembling the original sequence stems from the fact that, though the smaller pieces constitute overlapping sequences of the original DNA, the succession of the overlapping sequences is lost in the cutting process.

This is where the SSP comes into play, as a theoretical model for re-assembling the overlapping pieces into a superstring, which is assumed to be a good approximation of the original DNA sequence. This neat, simple idea has set off many studies of efficient approximation algorithms for the SSP, but as a model for sequence assembly it does have some major problems, especially concerning read errors in the fragments and identical subsequences (repeats) in the original DNA sequence.

Example 5
At the top of Figure 4.1 a sequence with two repeats (marked R) is cloned four times and the clones are cut into smaller pieces. After assembly (at the bottom of the figure) only one occurrence of the repeat region will be present, together with a smaller and originally non-existent region (marked R’). This is due to all the “inner” fragments of the repeats being assembled as belonging to a single occurrence of the repeat. □

A rather devastating critique of using the SSP directly to solve sequencing problems is given by Richard M. Karp in [Karp, 2003]:


Figure 4.1: The sequence assembly problem with repeats

The shortest superstring problem is only superficially related to the sequencing assembly problem. Its difficulty stems from pathological examples that are unlikely to occur in practice. It does not take noisy reads into account, and admits solutions with an unreasonably large number of mutually overlapping reads.

Although Shotgun Sequencing was the most advanced technique for sequencing genomes from about 1995–2005,1 and was used by the privately funded group solving the mapping of the human genome [Venter, Adams, Myers, Li, Mural, Sutton, Smith, Yandell, Evans, Holt, et al., 2001], newer technologies known as next-generation sequencing have surfaced. These techniques are far superior to Shotgun Sequencing, both in the amount of data produced and with respect to the time needed for producing the data, though they do tend to have lower accuracy [Metzker, 2009].

4.2.2 Data compression

The idea of using the SSP as a way of compressing a set of strings originates in the article [Gallant, Maier, and Storer, 1980], which itself refers back to two technical reports by J. Storer (one of the authors) from 1977.

The practical application mentioned is the need for string compression by a compiler. It has not been possible to track down references describing the application of this technique in compilers. Furthermore, string compression is hardly of any significant concern for modern-day compilers, as the time versus space trade-off clearly seems to favour speed. An even more serious objection, mentioned in [Gusfield, 1997, page 445, exercise

1 http://en.wikipedia.org/wiki/Shotgun_sequencing


25, 26], is that it seems a rather exotic idea to try to solve a compression problem by using an NP-complete problem, especially as there are simpler, better and more efficient algorithms available, e.g. based on Cycle Covers (CCs).

4.2.3 Why is the SSP then interesting anyway?

Having defied the traditional practical uses of the SSP, and thereby almost removed the foundation for working with the problem, we now face the challenge of arguing for its inclusion in this thesis.

In prioritised order, the reasons are:

1) the SSP is in itself an interesting problem,

2) historically the SSP has been one of the first sequencing models “bridging” between computer science and biology,

3) the SSP has been subject to a considerable amount of research, leading to a number of approximation algorithms. To quote from [Gusfield, 1997, Chapter 16.1]: “This pure, exact string problem is motivated by the practical problem of shotgun sequence assembly and deserves attention if only for the elegance of the results that have been obtained.”

4) the SSP is closely related to the TSP, so it would seem unnatural not to include it at all, and finally

5) the SSP as a crude model is usable in bioinformatics. Besides the Shotgun Sequencing method, it is related to the class of DNA-sequencing methods known as “Sequencing by Hybridisation”, and it has also been proposed as a model for genome compression of viruses [Ilie and Popescu, 2006].

With these reasons in mind, the rest of this thesis Part will hopefully seem well founded to the reader.


CHAPTER 5

Preliminaries

This chapter will introduce notation, definitions, some basic facts about strings, and approximation strategies, all of which will be needed in the later treatment of approximation factors and algorithm descriptions. For further notation usage see Section 1.2.

5.1 Strings

Given two strings s, t where s = xy and t = yz with x, z ≠ ε, if y is the longest such string it is called the (maximal) overlap of s and t and is denoted ov(s, t). The string x is called the (minimal) prefix of s and t and is denoted pr(s, t). Similarly, z is sometimes called the suffix of s and t and is denoted suf(s, t). The length of the prefix, |pr(s, t)|, is called the distance from s to t and is denoted dist(s, t).

Example 6
Let s = half, t = alfalfa. Then

ov(s, t) = alf,  ov(t, s) = ε,  ov(t, t) = alfa,
|ov(s, t)| = 3,  |ov(t, s)| = 0,  |ov(t, t)| = 4,
pr(s, t) = h,  pr(t, s) = alfalfa,  pr(t, t) = alf,
dist(s, t) = 1,  dist(t, s) = 7,  dist(t, t) = 3.

Note that |s| = dist(s, t) + |ov(s, t)|. □
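These definitions translate directly into code. The quadratic scan below is a sketch for illustration; linear-time methods based on string matching exist:

```python
def ov(s, t):
    """Maximal overlap: the longest y with s = xy, t = yz and x, z nonempty."""
    for k in range(min(len(s), len(t)) - 1, 0, -1):
        if s[-k:] == t[:k]:
            return s[-k:]
    return ""

def pr(s, t):
    """Minimal prefix: the part of s preceding ov(s, t)."""
    return s[:len(s) - len(ov(s, t))]

def dist(s, t):
    """Distance from s to t: |pr(s, t)|."""
    return len(pr(s, t))
```

On Example 6 this yields ov("half", "alfalfa") = "alf" and dist("alfalfa", "half") = 7, matching the values above.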


Definition 4 (Merge of strings)
Let S = (s1, s2, . . . , sn) be an ordered collection of strings. Then

〈s1, s2, . . . , sn−1, sn〉 = pr(s1, s2) pr(s2, s3) · · · pr(sn−1, sn) sn

is called the merge of the strings (s1, s2, . . . , sn). If S is substring free, the order of the strings in the merge is well-defined, as no two strings will begin or end at the same index. □

Example 7
For two strings s, t we have

〈s, t〉 = pr(s, t) t = pr(s, t) ov(s, t) suf(s, t) = s suf(s, t),
|〈s, t〉| = dist(s, t) + |t| = |s| − |ov(s, t)| + |t| = |s| + |t| − |ov(s, t)|. □

Lemma 1
For an ordered collection of strings S = (s1, s2, . . . , sn), the merge of the strings is the shortest superstring for the strings in that order. □

To get warmed up, and for the sake of completeness, we include the proof:

PROOF ([GUSFIELD, 1997]) The fact that the merge is a superstring is easiest seen by induction on the string indices, beginning from the end of the merge.

Base case: sn is a substring of the merge as it is explicitly included.

Induction case: Assume si+1 is a substring of the merge. Now si = pr(si, si+1) ov(si, si+1). As pr(si, si+1) is explicitly included in the merge, it remains to show that ov(si, si+1) follows directly after. Since S is substring free, ov(si, si+1) is a prefix of si+1, but si+1 is per the induction hypothesis a substring of the merge starting at pr(si, si+1).

The minimality follows from the definition of pr(·, ·). ∎

Definition 5 (Cyclic String)
For any string s we denote by φ(s) the cyclic string resulting from “wrapping around” the string, so that s[|s|] precedes s[1]. The length of φ(s) equals |s|. □

Any string t which is a substring of s∞ is said to map to the cyclic string φ(s), and correspondingly φ(s) is said to be mapping all such strings.


Example 8
Let s = abac. Then all of the following strings map to φ(s):

t = ba
u = bacaba
v = acabacabaca □

Definition 6 (Factor and Period)
A string x is a factor of another string s if s = x^i y with y a (possibly empty) prefix of x and i ∈ N. The length of the factor, |x|, is called a period of s. □

The shortest factor of a string s and its period are denoted factor(s) and period(s) respectively. In case s is semi-infinite,1 we call s periodic if s = xs for some x ≠ ε. The shortest such x is called the factor of s and its length the period of s. Two strings s, t (finite or not) are called equivalent if their factors are cyclic shifts of each other, i.e. ∃ x, y : factor(s) = xy and factor(t) = yx.

The following is a classical string algorithm result:

Lemma 2 (The Periodicity Lemma)
Let s be a string with periods p, q. If |s| ≥ p + q then the greatest common divisor gcd(p, q) will also be a period of s. □

PROOF ([GUSFIELD, 1997]) Assume WLOG that q < p. Consider the number d = p − q. We will show that s[i] = s[i + d] ∀ i ∈ 1 . . . |s| − d.

Case q < i: This means s[i − q] exists. We now have:

s[i] = s[i − q]   (as q is a period of s)
     = s[i − q + p] = s[i + d]   (as p is a period of s)

Case i ≤ q: This means s[i + p] exists, as |s| ≥ p + q. We now have:

s[i] = s[i + p]   (as p is a period of s)
     = s[i + p − q] = s[i + d]   (as q is a period of s)

As |s| ≥ (p − q) + q, we have shown that, applying Euclid's Algorithm to p, q, at each recursive step we have numbers that are periods of s. As Euclid's Algorithm ends with gcd(p, q), the claim follows. ∎
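Both Definition 6 and the lemma are easy to verify computationally. The sketch below is naive; a linear-time period computation via the Knuth-Morris-Pratt failure function exists:

```python
from math import gcd

def is_period(s, p):
    """p is a period of s iff s[i] == s[i+p] for all valid i,
    i.e. s = x^i y with |x| = p and y a (possibly empty) prefix of x."""
    return 0 < p <= len(s) and all(s[i] == s[i + p] for i in range(len(s) - p))

def period(s):
    """The smallest period of s, i.e. |factor(s)|."""
    return next(p for p in range(1, len(s) + 1) if is_period(s, p))

# Periodicity Lemma: s = "abababab" has periods 2 and 4 and |s| >= 2 + 4,
# so gcd(2, 4) = 2 must also be a period
s = "abababab"
assert is_period(s, 2) and is_period(s, 4) and is_period(s, gcd(2, 4))
assert period("alfalfa") == 3   # factor(alfalfa) = alf
```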

1 an infinite string with one fixed end: s = ababab . . .


5.2 Graphs and Cycle Covers

To model the overlaps of a collection of strings S = (s1, s2, . . . , sn), the following two complete graphs are very useful:

Definition 7 (Prefix (Overlap) Graph)
Let G(N, A, W) be a weighted, directed graph with

N = S,
A = {aij} ∀ i, j ∈ 1 . . . n,
W : A → N with w(aij) = dist(si, sj) (prefix graph) or w(aij) = |ov(si, sj)| (overlap graph). □

We associate each arc in the prefix (overlap) graph with the prefix (overlap) of the two strings corresponding to the start and end nodes of the arc.

Example 9
The prefix (distance) and overlap graphs for the string collection (ate, half, lethal, alpha, alfalfa) are illustrated in Figure 5.1. Only arcs with values different from the length of the string (in the prefix graph) and from zero (in the overlap graph) are shown. □

[Figure 5.1 panels: (a) prefix graph, (b) overlap graph]

Figure 5.1: Graphs for the strings ’ate’, ’half’, ’lethal’, ’alpha’ and ’alfalfa’ (from [Blum et al., 1994])
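The arc weights of the two graphs, and the relation w_pr(aij) + w_ov(aij) = |si| between them, can be computed directly for the Example 9 strings (a sketch using dictionaries keyed by string pairs):

```python
def ov_len(s, t):
    """|ov(s, t)|: length of the longest proper overlap of s and t."""
    for k in range(min(len(s), len(t)) - 1, 0, -1):
        if s[-k:] == t[:k]:
            return k
    return 0

S = ["ate", "half", "lethal", "alpha", "alfalfa"]

# arc weights of the two complete directed graphs on S
overlap_w = {(s, t): ov_len(s, t) for s in S for t in S}
prefix_w = {(s, t): len(s) - overlap_w[s, t] for s in S for t in S}

# a few sample arcs
assert overlap_w["lethal", "half"] == 3       # overlap "hal"
assert overlap_w["alfalfa", "alfalfa"] == 4   # overlap "alfa"
assert prefix_w["half", "alfalfa"] == 1       # pr(half, alfalfa) = "h"

# on every arc, prefix weight and overlap weight sum to |s_i|
assert all(prefix_w[s, t] + overlap_w[s, t] == len(s) for s in S for t in S)
```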

Note that a cycle in the prefix graph for S consisting of the nodes n1, n2, . . . , nk will be of the form

C = φ(t) = φ(pr(s1, s2) pr(s2, s3) · · · pr(sk−1, sk) pr(sk, s1))

The notation C = s1, s2, . . . , sk will be used to describe such a cycle in terms of the strings from S that C is mapping.


With tsi we will denote the string equivalent to t achieved by “breaking” the cycle at the point where string si starts. Defining sn+1 = s1 we then have:

tsi = pr(si, si+1) pr(si+1, si+2) · · · pr(si−2, si−1) pr(si−1, si)

The next two lemmas prove a rather intuitive property and a more interesting property concerning cycles in the prefix graph:

Lemma 3
For any cycle C = φ(t) = s1, s2, . . . , sn in the prefix graph for S, we have that every si, i ∈ 1 . . . n, and 〈s1, s2, . . . , sn〉 map to φ(t). □

PROOF As the string collection is substring free, we have that si is a prefix of pr(si, si+1) si+1, which again is a prefix of pr(si, si+1) pr(si+1, si+2) si+2, and so on (defining sn+1 = s1). By letting k = ⌈|si|/|t|⌉ we get that si is a prefix of (tsi)^k, which itself maps to φ(t)^(k+1). The claim for 〈s1, s2, . . . , sn〉 can be shown by an analogous argument. ∎

In a way the “reverse” property also holds:

Lemma 4
Let S be a collection of strings and let S̄ ⊆ S have the property that all strings in S̄ are substrings of t∞. Then there exists a cycle C = φ(t) in the prefix graph for S such that all strings in S̄ map to C. □

PROOF All strings in S̄ will appear in t∞ with an interval of |t| characters. This induces a circular ordering of the strings in S̄. A cycle in the prefix graph for S defined by this ordering will fulfil the criteria. ∎

The optimal SSP solution for S = (s1, s2, . . . , sn) will be of the form 〈sπ1 , sπ2 , . . . , sπn〉 for some permutation Π : [1, n] → [1, n]. This means:

OPT(S) = 〈sπ1 , sπ2 , . . . , sπn〉
       = pr(sπ1 , sπ2) pr(sπ2 , sπ3) · · · pr(sπn−1 , sπn) sπn
       = pr(sπ1 , sπ2) pr(sπ2 , sπ3) · · · pr(sπn−1 , sπn) pr(sπn , sπ1) ov(sπn , sπ1)

Note that the above concatenation of prefixes describes a cycle in the prefix graph for S. For the length of the optimal solution we then have:

|OPT(S)| = |pr(sπ1 , sπ2)| + |pr(sπ2 , sπ3)| + · · · + |pr(sπn , sπ1)| + |ov(sπn , sπ1)|   (5.1)

Remark 4
Expression 5.1 conveys that the length of the optimal solution is the sum of a TSP tour in the prefix graph for S plus the overlap of the “end” and “start” strings of the tour. □


If we let |TSP∗(Gpr)| denote the length of a shortest TSP tour in the prefix graph for S, we get the following lower bound for |OPT(S)|:

|TSP∗(Gpr)| < |TSP∗(Gpr)| + |ov(sπn , sπ1)| ≤ |OPT(S)|   (5.2)

If we let wpr(aij), wov(aij) denote the weights of the arc aij in the prefix and overlap graph respectively, we have:

wpr(aij) + wov(aij) = |si|  ⇔  wpr(aij) = |si| − wov(aij),  ∀ i, j ∈ 1 . . . n   (5.3)

We now turn to the important concept of a Cycle Cover:

Definition 8 (Cycle Cover of String Collection S)

Let CC(S) = {C1, C2, . . . , Ci} = {φ(t1), φ(t2), . . . , φ(ti)}

where ∀ sj ∈ S, ∃ φ(tk) ∈ CC(S) : sj maps to φ(tk),

then CC(S) is called a Cycle Cover of S. □

The size of a Cycle Cover CC(S) is denoted ||CC(S)|| and equals ∑_{k=1}^{i} |φ(tk)|. The minimal length cycle cover of a collection of strings S is denoted CC∗(S). We get the following properties for a minimal CC.

Lemma 5
For every si ∈ S = (s1, s2, . . . , sn), si maps to a unique Ci ∈ CC∗(S). For every distinct pair of cycles Ci = φ(ti), Cj = φ(tj) ∈ CC∗(S) we have that ti and tj are inequivalent. □

PROOF As the collection is substring free, every pair of strings si, sj will have pr(si, sj) ≠ ε. If si were to map onto more than one cycle in CC∗(S), it would contribute to the length of CC∗(S) with two non-empty prefixes, which contradicts the minimal length of CC∗(S). The second claim now follows immediately, as the strings mapping to two equivalent cycles would be identical. ∎

We can now show a useful property of the prefix and overlap graphs:

Theorem 1 (Prefix-Overlap Graph Duality)
Let S = (s1, s2, . . . , sn) be a string collection. Then the following relations hold between TSP tours and CCs in the prefix and overlap graphs for S:

• A shortest TSP tour in the prefix graph will be a maximum TSP tour in the overlap graph (and vice versa).

• A minimal CC in the prefix graph will be a maximum CC in the overlap graph (and vice versa). □


PROOF Let TSP∗(Gpr) = nπ1, nπ2, . . . , nπn be a shortest TSP tour in the prefix graph for S. Using Expression 5.3 on the preceding page we then have:

|TSP∗(Gpr)| = |pr(sπ1, sπ2) pr(sπ2, sπ3) · · · pr(sπn−1, sπn) pr(sπn, sπ1)|
= |pr(sπ1, sπ2)| + |pr(sπ2, sπ3)| + · · · + |pr(sπn, sπ1)|
= |sπ1| − |ov(sπ1, sπ2)| + |sπ2| − |ov(sπ2, sπ3)| + · · · + |sπn| − |ov(sπn, sπ1)|
= ||S|| − (|ov(sπ1, sπ2)| + |ov(sπ2, sπ3)| + · · · + |ov(sπn, sπ1)|)
= ||S|| − |ov(sπ1, sπ2) ov(sπ2, sπ3) · · · ov(sπn−1, sπn) ov(sπn, sπ1)|   (5.4)

Now, as |TSP∗(Gpr)| is minimal and ||S|| is a constant, the string length on the right hand side of Expression 5.4 will have to be maximal. But this is exactly the length of a TSP tour in the overlap graph for S. The other relation can be proved analogously. ∎

We can now formulate the following lower bound relation:

||CC∗(S)|| ≤ |TSP∗(Gpr)| < |OPT(S)|   (5.5)

by noting that a TSP tour in the prefix graph for S is a special case of a CC for S. An equivalent consideration is to view any CC of S as a solution to the LP formulation of the TSP (see Definition 2 on page 13) with the “subtour elimination constraint” removed.

5.3 Approximation Strategies

From the results in the previous section it can be seen that a possible approximation strategy for SSP would be to approximate the length of a TSP tour in the prefix graph for the string collection. Unfortunately the best known approximation factors for the minimum Asymmetric Traveling Salesman Problem (ATSP) are of the order of log(n) [Kaplan et al., 2005], which makes this strategy less interesting.

Another approach is to consider the length of the optimal solution expressed in the overlap graph for the string collection S:

|OPT(S)| = |〈sπ1, sπ2, · · · , sπn〉|
= |pr(sπ1, sπ2)| + |pr(sπ2, sπ3)| + · · · + |pr(sπn−1, sπn)| + |sπn|
= (|sπ1| − |ov(sπ1, sπ2)|) + (|sπ2| − |ov(sπ2, sπ3)|) + · · · + (|sπn−1| − |ov(sπn−1, sπn)|) + |sπn|
= ∑_{i=1}^{n} |sπi| − ∑_{i=1}^{n−1} |ov(sπi, sπi+1)|
= ||S|| − ∑_{i=1}^{n−1} |ov(sπi, sπi+1)|   (5.6)

The right side sum in Expression 5.6 is called the total overlap or the compression for OPT(S) and will be denoted OV∗(S).

Remark 5
Expression 5.6 conveys that the length of an optimal SSP solution equals the length of S minus the length of a maximum Hamiltonian Path in the overlap graph for S. □

The maximum Hamiltonian Path Problem (HPP) can actually be approximated by a constant factor, but this does not directly lead to a usable approximation strategy for SSP. The reason is that the optimal solution may have a large total overlap (of size O(||S||)), which means that sacrificing even a small fraction of the total overlap can lead to bad solutions.

Example 10
Let S = (abc, bcd, cde, def, efg, fgh, ghi, hij),

then ||S|| = 8 · 3 = 24, OV∗(S) = ∑_{i=1}^{7} |ov(si, si+1)| = 7 · 2 = 14

and |OPT(S)| = |abcdefghij| = 24 − 14 = 10.

An approximation factor for the total overlap of even 2/3 could lead to an SSP solution that is roughly 50 % larger than optimal. □
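The numbers in Example 10 are easy to reproduce programmatically. The sketch below (the helper ov_len is my own, not thesis code) computes ||S||, the compression for the natural left-to-right ordering and the resulting superstring:

```python
def ov_len(a, b):
    # length of the longest proper suffix of a that is a prefix of b
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if b.startswith(a[-k:]):
            return k
    return 0

S = ["abc", "bcd", "cde", "def", "efg", "fgh", "ghi", "hij"]

total_len = sum(len(s) for s in S)                                    # ||S|| = 24
compression = sum(ov_len(S[i], S[i + 1]) for i in range(len(S) - 1))  # 7 * 2 = 14

# build the superstring for this ordering, removing each pairwise overlap once
superstring = S[0]
for prev, nxt in zip(S, S[1:]):
    superstring += nxt[ov_len(prev, nxt):]

print(total_len, compression, superstring)  # 24 14 abcdefghij
```

Keeping only 2/3 of the compression would give a length of 24 − 14 · 2/3 ≈ 14.7 against the optimal 10, which is the loss the example warns about.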

The main problem with this approximation strategy is that we are approximating the total overlap and not the length of the SSP solution. For this strategy to be applicable it is therefore necessary to ensure that the strings in the collection do not have large mutual overlaps. This leads to the following approximation algorithm template:


Definition 9 (CC Approximation Algorithm Template)
Given a collection of strings, S, transform S into another set of strings, R, with the following properties:

1) Strings in R do not overlap too much.

2) All strings in S are substrings of a string in R, i.e. a superstring for R induces a superstring for S.

3) |OPT (R)| is not much larger than |OPT (S)|.

Remark 6
The above template achieves the following:

• from 1) it follows that the reduction to HPP is applicable, and

• from 2) and 3) it follows that an SSP solution for R is a good SSP solution for S.

These claims will be formalised in Section 6.1.1. □

The standard approach to fill in the template is:

• Find a minimum Cycle Cover CC∗(S) = {C1, C2, . . . , Cj} in the prefix graph for S.

• For all Ci = si1, si2, . . . , sik ∈ CC∗(S) choose a representative for the cycle, r(Ci), containing all strings in S that map to the cycle as substrings. For example let r(Ci) = 〈si1, si2, . . . , sik〉.

• Let R = {r(Ci)}, i ∈ 1 . . . j.

The rationale behind this approach is that strings that overlap a lot will end up in the same cycle in the CC, so substituting these strings by a single representative should not increase the length of the optimal solution too much, but it will reduce the total overlap for the instance. This approach is the basis for the implemented algorithms called Cycle (Section 6.1 on page 43) and RecurseCycle (Section 6.2 on page 53) respectively.
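The representative construction can be sketched as follows. Breaking a cycle means choosing a rotation of its strings and merging them in order; a real implementation would take the cycles from the minimum cycle cover, whereas here the cycle, and all function names, are my own illustrative choices:

```python
def ov_len(a, b):
    # length of the longest proper suffix of a that is a prefix of b
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if b.startswith(a[-k:]):
            return k
    return 0

def merge_path(order):
    # <s_1, ..., s_k>: concatenate, removing each pairwise overlap once
    s = order[0]
    for prev, nxt in zip(order, order[1:]):
        s += nxt[ov_len(prev, nxt):]
    return s

def representative(cycle):
    # r(C): try every break index of the cycle, keep the shortest merge
    best = None
    for i in range(len(cycle)):
        cand = merge_path(cycle[i:] + cycle[:i])
        if best is None or len(cand) < len(best):
            best = cand
    return best

cycle = ["aab", "aba", "baa"]   # all three map to the cyclic string phi("aab")
r = representative(cycle)
print(r)  # "aabaa": contains every string of the cycle as a substring
```

Every string of the cycle is a substring of the representative, which is exactly the property the template requires of R.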

5.4 Notes

This chapter is based on the excellent but unfinished note by Marcin Mucha [Mucha, 2007]. All other references are explicitly mentioned.


Copy from one, it’s plagiarism; copy from two, it’s research.

John Milton

CHAPTER 6

Algorithms

In order to run practical experiments for solving the SSP, five algorithms were implemented. The first two are examples of CC approximation algorithms, the third is a well known greedy algorithm, the fourth is an approximation algorithm achieving the best known approximation factor of 2.5, and the last algorithm is TSP based.

This chapter contains descriptions of these algorithms and their actual implementations. It additionally comprises theoretical sections containing proofs for the approximation factors. It is the intention that the chapter should be self-contained with respect to these proofs. The strategy has been to include proofs for lemmas and theorems originating in directly used references and omit proofs for results that are only referenced in these.

The survey character of this chapter has had the unfortunate effect of leading to sections with an abundance of lemmas, theorems and proofs. In order to compensate for this and ease the reading, most of the proofs have been elaborated in terms of smaller steps and/or supplementary lemmas, remarks and examples.

6.1 The Cycle Algorithm

This algorithm is a variant of the algorithm TGREEDY in [Blum et al., 1994], called 3All in [Romero et al., 2004]. It consists of the following six steps, where steps 2) to 3) correspond to the template from Definition 9 on page 41.

Given a list of strings S

1) Construct prefix graph for S.


2) Find a minimum Cycle Cover CC∗(S) = {C1, C2, . . . , Cj} in the prefix graph for S.

3) For all Ci = si1, si2, . . . , sik in the CC choose a representative r(Ci) = 〈si1, si2, . . . , sik〉 by breaking the cycle at an index so that |r(Ci)| is smallest possible. Define the set R = (r(Ci)), i ∈ 1 . . . j.

4) Repeat step 1) for R.

5) Repeat step 2) for R, creating a new set of representatives T, with the restriction that the CC for R should be non-trivial, i.e. no cycle in the cover includes only a single string from R.

6) Return a random merge of T.

Though the algorithm is doing the same thing twice, the restriction in step 5) assures that the sets R and T are different. Pseudo-code for the algorithm is shown in Algorithm II.1 on the facing page.

6.1.1 Approximation factor

In order to show the approximation factor for the Cycle algorithm, we will refrain from the direct proofs in [Blum et al., 1994; Gusfield, 1997], preferring the more general proof method by Mucha. The reason for this choice is that it will lead to a refinable theorem and a “reusable” proof (Theorem 2 on page 47).

The approximation factor for the Cycle algorithm follows from a formalisation of the claims in Remark 6 on page 41 concerning the constructed representatives. Beginning with the last claim and denoting the original collection of strings S and the constructed set of representatives R, we have:

Lemma 6
|OPT(R)| ≤ |OPT(S)| + ||CC∗(S)|| ≤ 2 · |OPT(S)| □

PROOF ([MUCHA, 2007]) We have R = (r(C1), r(C2), . . . , r(Cj)) where each r(Ci) is of the form 〈si1, si2, . . . , sik〉. For each r(Ci) let

r̄(Ci) = 〈si1, si2, . . . , sik, si1〉 and let

R̄ = (r̄(C1), r̄(C2), . . . , r̄(Cj)).

Since each r(Ci) is a substring of r̄(Ci) we have |OPT(R)| ≤ |OPT(R̄)|. Each r̄(Ci) begins and ends with the same string. Let s(r̄(Ci)) denote this string and let SR = (s(r̄(C1)), s(r̄(C2)), . . . , s(r̄(Cj))). Since SR ⊆ S we have |OPT(SR)| ≤ |OPT(S)|. Noting that OPT(SR) can be


Algorithm II.1 Cycle
Require: Nonempty substring free list of strings, sl

procedure CYCLE(sl)
    avoidSimpleCycles ← false
    reps1 ← MAKEREPRESENTATIVES(sl, avoidSimpleCycles)
    if size(reps1) = 1 then
        return first(reps1)
    end if
    avoidSimpleCycles ← true
    reps2 ← MAKEREPRESENTATIVES(reps1, avoidSimpleCycles)
    return MERGESTRINGS(reps2)
end procedure

procedure MAKEREPRESENTATIVES(sl, avoidSimpleCycles)
    CC ← MAKEMINIMUMCYCLECOVER(sl, avoidSimpleCycles)
    reps ← empty list
    for all C = sc0, sc1, . . . , sck ∈ CC do
        i ← BESTBREAKINDEX(C)
        rep ← 〈s(i mod k), s((i+1) mod k), . . . , s((i−1) mod k)〉
        reps ← reps ∪ rep
    end for
    return reps
end procedure

Termination: The algorithm only iterates over finite non-increasing data structures.

Correctness: Assuming all the called sub-procedures are correct (see Section 6.1.2 onwards), all strings in the parameter list sl will be substrings of a representative, so the algorithm will yield a feasible solution of the SSP.


transformed into a solution for R̄ by replacing each s(r̄(Ci)) by its corresponding r̄(Ci), and that this transformation will increase the length by ||CC∗(S)||, we get

|OPT(R)| ≤ |OPT(R̄)| ≤ |OPT(SR)| + ||CC∗(S)|| ≤ |OPT(S)| + ||CC∗(S)||. ∎

In order to show that the strings in R do not overlap too much, we will need the following lemma:

Lemma 7
Let C = φ(t) = s1, s2, . . . , sn ∈ CC∗(S) and let r(C) be the representative for C. Then

period(r(C)) = |t|. □

PROOF ([MUCHA, 2007]) As r(C) maps to C = φ(t), |t| will be a period for r(C). Assume there exists a string f with |f| < |t| such that r(C) maps to φ(f). Being substrings of r(C), so do all the strings s1, s2, . . . , sn. Lemma 4 on page 37 now gives the existence of a cycle shorter than C “including” all the strings that map to C, which contradicts that C belongs to a minimum cycle cover for S. ∎

We can now show

Lemma 8
Let C1 = φ(t1), C2 = φ(t2) ∈ CC∗(S) and let r(C1), r(C2) be their representatives. We then have

|ov(r(C1), r(C2))| < |t1| + |t2|. □

PROOF ([MUCHA, 2007]) C1 and C2 ∈ CC∗(S) are both minimal, and from Lemma 7 it follows that period(r(C1)) = |t1| and period(r(C2)) = |t2|. Let u = ov(r(C1), r(C2)). Now if |u| ≥ |t1| + |t2| we would have that |t1| and |t2| both are periods of u. Then the Periodicity Lemma (Lemma 2 on page 35) in connection with Lemma 4 on page 37 would lead to a contradiction of the minimality of C1 and C2. ∎

Using Expression 5.5 on page 39 and Lemma 8 we get:

Corollary 1
Any SSP-solution for the set R will have

OV(R) ≤ 2 · OPT(S). □


We can now turn to the final theorem needed to prove the approximation factor for the Cycle algorithm:

Theorem 2
Given an α-approximation for the MAX-HPP, it is possible to get a (4 − 2α)-approximation for the SSP. □

PROOF ([MUCHA, 2007]) First construct the set of representatives R and then use the HPP approximation to get an SSP solution R̄ for R, achieving a total overlap of α · OV∗(R). We then have:

|R̄| = ||R|| − OV(R̄)
= ||R|| − α · OV∗(R)
= ||R|| − OV∗(R) + (1 − α) OV∗(R)
= OPT(R) + (1 − α) OV∗(R)
≤ 2 · OPT(S) + (1 − α) OV∗(R)   (from Lemma 6)
≤ 2 · OPT(S) + (1 − α) 2 · OPT(S)   (from Corollary 1)
= (4 − 2α) OPT(S). ∎

Theorem 3
The Cycle algorithm is a factor 3 approximation algorithm. □

PROOF Inspecting the steps of the Cycle algorithm (Section 6.1 on page 43): the algorithm breaks the cycles optimally in Step 3), continues to find a minimal non-trivial Cycle Cover CC∗(R) for R in Steps 4) and 5), and finally returns a random merge of the set of representatives T in Step 6).

According to Theorem 1 on page 38, CC∗(R) is also a maximal CC in the overlap graph for R. Translated to the overlap graph for R, what the algorithm does in Steps 4) to 6) is equivalent to constructing a solution to the HPP in the overlap graph for R in the following way:

• Split each cycle in CC∗(R) by throwing away the arc with smallest weight (minimal overlap).

• Arbitrarily mend the ensuing paths together using random arcs between the end nodes.

This gives a 1/2-approximation for the HPP, as the cycles each consist of at least two arcs, and throwing away the ones with minimal weight will not lose more than half of |CC∗(R)|.

By Theorem 2 this then gives a factor 3 approximation for the Cycle algorithm. ∎


Remark 7
It also follows that if the Cycle algorithm in Step 3) were to break the cycles randomly and return a random concatenation of the resulting set of representatives, this would correspond to α = 0 (a 0-approximation of the max HPP), giving a factor 4 approximation by Theorem 2. □

6.1.2 Implementation

The implementation of the Cycle algorithm is basically a Python implementation of Algorithm II.1 on page 45. The Cycle algorithm depends on the implementation of two subroutines: one to calculate an all-pairs suffix-prefix matrix for the strings in S and one to find a minimal CC for S.

All pairs suffix-prefix calculation

The first attempted implementation of this simple problem was a Python implementation of the suffix-tree based algorithm described in [Gusfield, 1997, Section 7.10]. Unfortunately this task proved too complicated and thereby more error-prone than expected. In order to sanity check the suffix-tree algorithm, a simple naive double-loop implementation was made. The results were in a sense both disillusioning and enlightening:

First, the suffix-tree implementation kept growing more and more complex while still containing bugs. Second, the naive test implementation had a running-time roughly five times faster on the data where the algorithms produced identical results.

As a second try, a Knuth-Morris-Pratt (KMP) based implementation of the algorithm described in [Tarhio and Ukkonen, 1988] was made. This algorithm consistently produced results identical to the naive test implementation, and was thus probably free from bugs, but it was a factor of three slower than the naive test algorithm.

The moral of the story is that even though the theoretical running-times for the different implementations are as shown in Table 6.1, the measured running-times clearly favoured using the naive implementation.

Part of the explanation for this result could be that the Python method string.startswith(), used in the interior of a double for-loop, is probably implemented as a highly optimised C or assembler method.
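For reference, the naive double-loop calculation reads roughly as below; the heavy lifting sits in str.startswith(). The function name and matrix layout are my own sketch, not the thesis code:

```python
def all_pairs_suffix_prefix(strings):
    # M[i][j] = |ov(s_i, s_j)|: the longest suffix of s_i that is a
    # prefix of s_j, computed naively with str.startswith()
    n = len(strings)
    M = [[0] * n for _ in range(n)]
    for i, a in enumerate(strings):
        for j, b in enumerate(strings):
            if i == j:
                continue
            for k in range(min(len(a), len(b)), 0, -1):
                if b.startswith(a[-k:]):
                    M[i][j] = k
                    break
    return M

S = ["abc", "bcd", "cde"]
M = all_pairs_suffix_prefix(S)
print(M[0][1], M[1][0])  # 2 0
```

The triple loop is the O(||S||²) behaviour from Table 6.1; the measured speed comes from the innermost comparison being a single optimised C call.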

In order to confirm this hypothesis a number of tests were run with four different implementations:

• a fully naive algorithm, which uses char-by-char comparison instead of string.startswith(),

• the KMP-based algorithm,


Table 6.1: All-pairs suffix-prefix running-times

Algorithm          Running-Time
Suffix-tree based  O(n|t| · (1 + n/|t|)) = O(||S|| + n²)
KMP based          O(n|t| · (1 + n)) = O(||S|| + n · ||S||)
Naive              O(n|t| · n|t|) = O(||S||²)

where n is the number of strings and |t| the (identical) length of each string, i.e. n|t| = ||S||

• the naive algorithm using string.startswith() implemented in Python, and finally

• a Java implementation of the naive algorithm using the Java method string.startswith().

The results of these test runs are shown in Figure 6.1.

[Figure 6.1: Running-time measures of all-pairs overlap calculations. Running-time in seconds is plotted against the cumulated length of the strings (1000 to 7000 characters) for the fully naive, KMP, Python naive and Java naive implementations.]

In the end, a Python wrapper calling the Java implementation of the naive algorithm was used by all SSP-algorithms, as it turned out to be the fastest implementation.

Finding minimal CCs

The “traditional” way of finding a minimal CC is by reducing the problem to an instance of the Assignment Problem (AP), or Perfect Matching Problem. The reduction consists of solving the AP in a complete bipartite graph: the node set consists of two copies of S, and the arc weights equal the weights of either the prefix or the overlap graph, depending on whether a minimal or maximal assignment is wanted.

In the general case the AP can be solved by the Hungarian Method¹, which has a running-time of O(n³), where n is the number of strings. It turns out that, due to the weights being overlap values between the strings, a maximal assignment can be found by a greedy algorithm having a running-time of O(log(n) · n²).

Fast maximal assignment algorithm

The fact that a greedy approach works depends on the following property:

Definition 10 (Inverse Monge Condition)
Let B = (N1, N2, A, W) be a complete, weighted bipartite graph. Let u, u′ ∈ N1 and v, v′ ∈ N2, and assume WLOG

w(auv) ≥ max(w(auv′), w(au′v), w(au′v′)).

If the inequality

w(auv) + w(au′v′) ≥ w(auv′) + w(au′v)   (6.1)

holds for all such pairs u, u′ and v, v′ in B, then B is said to satisfy the Inverse Monge Condition. □

Given a collection of strings S = (s1, s2, . . . , sn), let BS be the complete, weighted bipartite graph given by BS = (S, S, A, W) and w(aij) = |ov(si, sj)|, ∀ i, j ∈ 1 . . . n. We then have:

Lemma 9
BS satisfies the Inverse Monge Condition. □

PROOF ([BLUM ET AL., 1994]) Let su, su′, sv, sv′ ∈ S. Denote ov(si, sj) and |ov(si, sj)| by ovij and wij respectively, and assume wuv ≥ max(wuv′, wu′v, wu′v′). If wuv ≥ wuv′ + wu′v, then Expression 6.1 is obviously fulfilled. Assume instead wuv < wuv′ + wu′v, i.e. wuv′ + wu′v − wuv > 0, and let ovuv = a1 a2 . . . a(wuv). Then ovuv′ = a(wuv−wuv′+1) . . . a(wuv) and ovu′v = a1 . . . a(wu′v). However, the string a(wuv−wuv′+1) . . . a(wu′v) is then a suffix of su′ and a prefix of sv′, whereby

wu′v′ ≥ wu′v − (wuv − wuv′ + 1) + 1
= wu′v + wuv′ − wuv. ∎

¹http://en.wikipedia.org/wiki/Hungarian_algorithm

For a graphical interpretation see Figure 6.2.

[Figure 6.2: Illustration of Lemma 9: wuv + wu′v′ ≥ wu′v + wuv′, showing the strings su, su′, sv, sv′ and their pairwise overlaps.]
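The Inverse Monge Condition can also be spot-checked empirically on overlap weights. The harness below is my own sketch (not thesis code) and assumes proper overlaps, i.e. |ov(si, sj)| < min(|si|, |sj|); it verifies Expression 6.1 over all quadruples of a small random collection:

```python
import random

def ov_len(a, b):
    # proper overlap: the longest suffix of a (shorter than both strings)
    # that is a prefix of b
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if b.startswith(a[-k:]):
            return k
    return 0

random.seed(1)
S = ["".join(random.choice("ab") for _ in range(random.randint(4, 7)))
     for _ in range(8)]
n = len(S)
w = [[ov_len(S[i], S[j]) for j in range(n)] for i in range(n)]

# check Expression 6.1 for every u != u' and v != v' where w_uv is largest
for u in range(n):
    for u2 in range(n):
        if u2 == u:
            continue
        for v in range(n):
            for v2 in range(n):
                if v2 == v:
                    continue
                if w[u][v] >= max(w[u][v2], w[u2][v], w[u2][v2]):
                    assert w[u][v] + w[u2][v2] >= w[u][v2] + w[u2][v]
```

The assertion mirrors the proof: whenever wuv is the largest of the four weights, the overlapped region forces a matching overlap between su′ and sv′.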

Now consider the greedy algorithm MAKEMAXIMALASSIGNMENT for finding a maximum weight assignment (Algorithm II.2 on the next page).

Assuming the marking of rows and columns and the checking for availability can be done in constant time, the running-time for Algorithm II.2 will be O(n² + log(n²) · n² + n) = O(log(n) · n²).

The returned maximum assignment can easily be translated to a maximum CC for S in the overlap graph by following arcs “from left to right” in the assignment. According to Theorem 1 on page 38 this is also a minimal CC in the prefix graph for S.

The SSP-algorithms needing a minimal CC all use a Python implementation of Algorithm II.2. The Cycle algorithm is dominated by this procedure and thus has a running-time of O(log(n) · n²).


Algorithm II.2 MakeMaximalAssignment
Require: An n × n overlap matrix M for string collection S
Ensure: A maximum weight assignment for the bipartite graph BS

procedure MAKEMAXIMALASSIGNMENT(M)
    overlaps ← empty list
    for row = 1 to n do
        for col = 1 to n do
            overlaps ← overlaps ∪ (M[row][col], (row, col))
        end for
    end for
    SORT(overlaps)                       ▷ descending by overlap value
    assignment ← empty set of size n
    mark all rows and columns available
    for all (w, (r, c)) ∈ overlaps do    ▷ iterating from start to end
        if r and c both available then
            assignment ← assignment ∪ (r, c)
            mark row r and column c unavailable
        end if
    end for
    return assignment
end procedure

Termination: The algorithm only iterates over finite non-increasing data structures.

Correctness: ([Blum et al., 1994]) Let A be the assignment returned by the algorithm and let Amax be a maximum assignment for BS with the property that it has the maximum number of arcs in common with A. Assume A ≠ Amax and let aij be the arc of maximal weight in the symmetric difference of A and Amax (with ties “sorted” the same way as in the algorithm). There are now two possibilities:

aij ∈ Amax \ A : This means that A must contain either an arc aij′ or ai′j of “larger” weight than aij, as the algorithm did not choose aij. Neither of these arcs can be in Amax, as it is an assignment, leading to a contradiction of the choice of aij.

aij ∈ A \ Amax : In this case there are two arcs aij′, ai′j ∈ Amax of “smaller” weight than aij. But according to Lemma 9 we could now replace the arcs aij′ and ai′j in Amax by the arcs aij and ai′j′, getting an assignment A′max with no less accumulated weight but having more arcs in common with A than Amax. This is a contradiction of the choice of Amax.
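A direct Python rendering of Algorithm II.2, together with a brute-force check of the correctness argument, could look as follows. The names and the toy collection are my own; the brute force compares against the true maximum assignment over all permutations, which is feasible only for tiny n, and the optimality of the greedy pass is guaranteed only for overlap (Inverse Monge) matrices:

```python
from itertools import permutations

def ov_len(a, b):
    # proper overlap: longest suffix of a that is a prefix of b
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if b.startswith(a[-k:]):
            return k
    return 0

def make_maximal_assignment(M):
    # greedy: scan all cells sorted by decreasing weight, keep a cell
    # whenever its row and column are still unassigned
    n = len(M)
    cells = sorted(((M[r][c], r, c) for r in range(n) for c in range(n)),
                   key=lambda t: -t[0])
    rows, cols, assignment = set(range(n)), set(range(n)), []
    for w, r, c in cells:
        if r in rows and c in cols:
            assignment.append((r, c))
            rows.remove(r)
            cols.remove(c)
    return assignment

S = ["abc", "bcd", "cde", "dea", "eab"]
M = [[ov_len(a, b) for b in S] for a in S]

greedy_weight = sum(M[r][c] for r, c in make_maximal_assignment(M))
best_weight = max(sum(M[r][p[r]] for r in range(len(S)))
                  for p in permutations(range(len(S))))
assert greedy_weight == best_weight   # greedy is optimal on overlap matrices
print(greedy_weight)  # 10
```

On this instance the greedy pass picks the five weight-2 arcs of the cyclic ordering, matching the exhaustive optimum.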


6.2 The RecurseCycle Algorithm

This is a slight variation of the Cycle algorithm. The variation consists of repeating steps 4) and 5) of the Cycle algorithm until only one representative string remains:

1) Execute Cycle algorithm steps 1) to 3).

2) Repeat Cycle algorithm steps 4) and 5) until a representative set of size one is achieved.

3) Return the single element of the final representative set.

The pseudo-code for the algorithm is shown in Algorithm II.3 on the next page.

6.2.1 Approximation factor

Theorem 4
The RecurseCycle algorithm is a factor 3 approximation algorithm. □

PROOF As the algorithm in effect improves on the Cycle algorithm by returning a “good” merge as opposed to a completely random merge of representatives, its approximation factor cannot be worse than the factor 3 for the Cycle algorithm. ∎

6.2.2 Implementation

The algorithm is basically a small extension of the Cycle algorithm and is likewise implemented in Python. As a practical matter, the recursive implementation had to be substituted by an iterative version due to the missing tail-recursion primitive in the Python interpreter: the original recursive implementation ran into problems with the maximum stack height.² The running-time for the RecurseCycle algorithm is dominated by the MAKEMAXIMALASSIGNMENT() procedure (see Algorithm II.2) and is thus of size O(log(n) · n²).
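Since the recursion is a pure tail call, the iterative rewrite is mechanical: the recursive call becomes a loop update. A minimal sketch, with a hypothetical stub standing in for MAKEREPRESENTATIVES (the real procedure performs the cycle-cover construction):

```python
def recurse_cycle(sl, make_representatives):
    # iterative version of RECURSECYCLE: each round must shrink the
    # list (non-trivial cycle covers), so the loop terminates
    avoid_simple_cycles = False
    while len(sl) > 1:
        sl = make_representatives(sl, avoid_simple_cycles)
        avoid_simple_cycles = True
    return sl[0]

def stub_make_representatives(sl, avoid_simple_cycles):
    # hypothetical stand-in: pair up neighbours and concatenate them,
    # which shrinks the list just like a non-trivial cycle cover does
    merged = [a + b for a, b in zip(sl[::2], sl[1::2])]
    if len(sl) % 2:
        merged.append(sl[-1])
    return merged

print(recurse_cycle(list("abcde"), stub_make_representatives))  # abcde
```

The loop carries exactly the state of the tail call (the shrinking list and the avoidSimpleCycles flag), so no stack growth can occur regardless of input size.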

6.3 The Overlap Rotation Lemma

We will now divert briefly from the algorithm descriptions in order to introduce a couple of very useful results proved by Breslauer et al. These results are at the basis of the proofs of the best known approximation factors for both the Greedy algorithm (Section 6.4 on page 57) and the Best Factor algorithm (Section 6.6 on page 64).

²This problem could alternatively have been solved by either increasing the platform-dependent maximum stack height or by using Stackless Python.


Algorithm II.3 RecurseCycle
Require: Nonempty substring free list of strings, sl

1: procedure RECURSECYCLE(sl, avoidSimpleCycles)
2:     if size(sl) = 1 then
3:         return first(sl)
4:     end if
5:     reps ← MAKEREPRESENTATIVES(sl, avoidSimpleCycles)
6:     return RECURSECYCLE(reps, true)
7: end procedure
8:
9: begin main
10:     return RECURSECYCLE(sl, false)
11: end main

Termination: In the base case (line 3) the algorithm returns when the string list has size 1. In the inductive case (lines 5 to 6) the restriction to non-trivial cycle covers in MAKEREPRESENTATIVES(sl, avoidSimpleCycles) (line 5) ensures that size(reps) ≤ size(sl) − 1. As the size of the original string list is finite and nonzero, this guarantees termination.

Correctness: For each invocation of RECURSECYCLE(sl, avoidSimpleCycles) the following invariant holds: a (random) concatenation of the strings in the string list is a superstring for the strings in the list.

Before the first invocation the invariant holds trivially.

If RECURSECYCLE(sl, avoidSimpleCycles) returns from the base case, the string list contains a single string, which is a superstring for itself. If it returns from the inductive case, at least two strings have been replaced by their merge. As a merge is a superstring for the strings involved, the invariant will hold.


Lemma 10 (Overlap Rotation Lemma)
Let α be a periodic semi-infinite string. There exists an integer k such that for any (finite) string s that is inequivalent to α,

|ov(s, α[k,∞])| < period(s) + 1/2 · period(α).

Furthermore such a k can be found in time O(period(α)). □

For the somewhat “technical” proof see [Breslauer et al., 1996, Section 3].

Remark 8
The Overlap Rotation Lemma shows that given any cyclic string φ(t) and any string s inequivalent to t, it is possible to restrict the overlap between s and t by rotating t sufficiently.

The Overlap Rotation Lemma actually consists of two results, but we only need the first of these. As will be shown later (Lemma 12), the implemented algorithms will not actually have to determine the integer k. □

The following is a rephrasing, due to Mucha, of another result in [Breslauer et al., 1996, Section 5, Lemma 5.1].

Lemma 11
Let S = (s1, s2, . . . , sn) be a collection of strings with minimal Cycle Cover CC∗(S). It is possible to choose representatives R for the cycles in CC∗(S) such that

OPT(R) ≤ 2 · OPT(S)

and for any two distinct cycles Ci, Cj ∈ CC∗(S)

|ov(r(Ci), r(Cj))| < |Ci| + 1/2 · |Cj|.   (6.2) □

PROOF ([MUCHA, 2007]) Let Ci = si1, si2, . . . , sik ∈ CC∗(S) and let

α = pr(si1, si2) pr(si2, si3) · · · pr(sik−1, sik) pr(sik, si1).

Consider the semi-infinite string α∞ and let k be the integer given by the Overlap Rotation Lemma 10. Assume WLOG that k is in the pr(si1, si2) part of α. (If not we can renumber the indices and split α accordingly.) In the approach until now we could choose the representative

r(Ci) = 〈si2, si3, . . . , sik, si1〉.


As shown in the proof of Lemma 6 on page 44 we could actually choose the longer representative

r̄(Ci) = 〈si1, si2, si3, . . . , sik, si1〉,

and still have OPT(R̄) ≤ 2 · OPT(S). Now it follows that if we instead choose the representative

r̂(Ci) = (pr(si1, si2)[k : ]) 〈si2, si3, . . . , sik, si1〉   (6.3)

we have that r(Ci) is a suffix of r̂(Ci), which itself is a suffix of r̄(Ci). This implies that OPT(R̂) ≤ 2 · OPT(S) for the collection R̂ of these representatives.

Equation 6.2 on the previous page now follows directly from Lemma 5 on page 38 and the Overlap Rotation Lemma on the previous page. ∎

These results immediately lead to Corollary 2 and an improvement of Theorem 2 on page 47.

Corollary 2
An optimal SSP-solution for the set R will have

OV(R) ≤ 3/2 · OPT(S). □

Theorem 5
Given an α-approximation for the MAX-HPP problem, it is possible to get a (3.5 − 1.5α)-approximation for the SSP. □

PROOF Analogous to the proof of Theorem 2 on page 47. ∎

Corollary 3
The Cycle and RecurseCycle algorithms are factor (3.5 − 1.5 · 0.5) = 2.75 approximation algorithms. □

As noted in [Mucha, 2007] there seems to be something non-intuitive or even contradictory about the approach used above. A better approximation for the total overlap of representatives was the road to the improvement in Theorem 5, and therefore it seemed sensible to minimise the overlaps of representatives; but our overall objective is to find a shortest superstring, and therefore it surely is preferable to make representatives overlap as much as possible. Fortunately Mucha provides the following lemma.

Lemma 12
Using any set of representatives R with ||R|| ≤ ||R̂|| yields a (3.5 − 1.5α)-approximation for the SSP, where α is the approximation factor for the directed MAX-HPP algorithm used and R̂ is the set of representatives from Lemma 11. □


PROOF ([MUCHA, 2007]) Let OV(R) and OV(R̂) be the total overlaps of the optimal solutions OPT(R) and OPT(R̂) respectively, and let |SSPR| be the length of the approximated solution for R. We then have

|SSPR| ≤ ||R|| − α · OV(R)
= ||R|| − α · (||R|| − OPT(R))   (Expression 5.6)
= (1 − α) · ||R|| + α · OPT(R)
≤ (1 − α) · ||R|| + 2α · OPT(S)   (Lemma 11)
≤ (1 − α) · 3.5 · OPT(S) + 2α · OPT(S)   (Corollary 2 and Lemma 11)
= (3.5 − 1.5α) · OPT(S) ∎

Remark 9
Lemma 12 ensures that the original representatives R defined in step 3) on page 44 can be used to obtain the approximation factor in Theorem 5, if the cycles are broken at the string yielding the shortest |r(C)| (to ensure ||R|| ≤ ||R̂||). The rotations used by Breslauer et al. are thus necessary to bound the approximation factor, but not in the algorithm itself. □

6.4 The Greedy Algorithm

The greedy algorithm is probably the first idea that comes to mind for solving the SSP. If S = (s1, s2, . . . , sn) is a collection of strings, keep on merging the two strings that have the largest overlap and put the resulting merge back into the collection, until only one string is left. The pseudo-code for this simple and, in practice, surprisingly good greedy algorithm is shown in Algorithm II.4 on the following page.

6.4.1 Approximation factor

In [Tarhio and Ukkonen, 1988] it is shown that this algorithm achieves at least a factor 1/2 of the maximal total overlap possible. Furthermore, it is conjectured that the algorithm is a 2-approximation algorithm. The following tight example from [Blum et al., 1994] shows that the approximation factor cannot be better than 2.

Example 11
Let S = {c(ab)^k, (ba)^k, (ab)^k c},

then |Greedy(S)| = |c(ab)^k c(ba)^k| = 4k + 2

and |OPT(S)| = |c(ab)^(k+1) c| = 2k + 4. □


Algorithm II.4 Greedy
Require: Nonempty substring free list of strings, sl

1: procedure GREEDY(sl)
2:     if size(sl) = 1 then
3:         return first(sl)
4:     end if
5:     Find si, sj ∈ sl : i ≠ j and ov(si, sj) is maximal.
6:     sl ← sl \ {si, sj}
7:     sl ← sl ∪ 〈si, sj〉
8:     return GREEDY(sl)
9: end procedure

Termination: In the base case (line 3) the algorithm returns when the string list has size 1. In the inductive case (lines 6 to 8) two strings are removed and one merged string is added to the string list, reducing its size by one. As the size of the original string list is finite and nonzero, this guarantees termination.

Correctness: For each invocation of GREEDY(sl) the following invariant holds: an arbitrary concatenation of the strings in the string list is a superstring for the strings in the list.

Before the first invocation the invariant holds trivially.

If GREEDY(sl) returns from the base case, the string list contains a single string, which is a superstring for itself. If GREEDY(sl) returns from the inductive case, two strings have been replaced by their merge. As a merge is a superstring for the strings involved, the invariant still holds.

Even though this conjecture has been the subject of much study, it has not (yet) been possible to prove a better factor than 3.5. This result, due to Kaplan and Shafrir, is in essence an application of Corollary 2 on page 56 instead of Corollary 1 on page 46 in the original factor 4 proof by Blum et al. Before we can show this, we will need quite some terminology and results from [Blum et al., 1994].

Consider an implementation of the Greedy algorithm that satisfies the conditions in line 5 of Algorithm II.4 as follows. It runs down a collection of the arcs from the overlap graph sorted by decreasing weight (analogous to Algorithm II.2), and for each arc aij it decides whether or not to merge its node strings si, sj.


Let the string returned by the Greedy algorithm be denoted tgreedy, and assume a renumbering of the strings in S so that

  tgreedy = 〈s1, s2, . . . , sn〉.

Let e, f be arcs in the overlap graph for S. We say that e dominates f if it comes before f in the sorted collection of the arcs. This means that w(e) ≥ w(f) and, if e = aij, then either f = aij′ or f = ai′j. Greedy will reject merging the end strings of an arc f if either

1) f is dominated by an already chosen arc e

2) f is not dominated but the merge of its end strings would form acycle.

If the merge is rejected because of 2), then f is called a bad back arc.³ Let f = aji be a bad back arc; then f corresponds to the merge 〈si, si+1, . . . , sj〉, which also corresponds to a path in the overlap graph. As this merge was already formed when Greedy considered f, it follows that

  ov(sk, sk+1) = w(ak,k+1) ≥ w(f) = w(aj,i) = ov(sj, si), ∀ k ∈ [i, j − 1].

Additionally, when f is considered, the arcs ai−1,i and aj,j+1 cannot have been considered yet. The arc f is said to span the closed interval If = [i, j]. In [Blum et al., 1994, Section 5] the lemma below is proved.

Lemma 13
Let e and f be two bad back arcs. Then the closed intervals Ie, If spanned by the arcs are either disjoint or one contains the other. □

Minimal (with respect to containment) intervals of bad back arcs are called culprits. It follows from the definition that culprit intervals must be disjoint. Each culprit [i, j], like its bad back arc, corresponds to the merge 〈si, si+1, . . . , sj〉. Let

  Sm = {sk ∈ S | k ∈ [i, j], where [i, j] is a culprit}.

Each culprit [i, j] also defines a cycle Cij = si, si+1, . . . , sj in the overlap graph for S. Let CCc(Sm) be the CC over Sm defined by the culprits. Now consider the result of applying Algorithm II.2 on page 52 to the overlap graph induced by Sm. It turns out that this produces exactly the Cycle Cover CCc(Sm), which implies that CCc(Sm) is a minimal CC, and we get

  |CCc(Sm)| = |CC∗(Sm)| ≤ |OPT(Sm)| ≤ |OPT(S)|.   (6.4)

³ In [Blum et al., 1994] these are called bad back edges.


Finally, denote by Oc the sum of the weights of all culprit bad back arcs in the overlap graph:

  Oc = Σ_{[i,j] = culprit} |ov(sj, si)|.

Blum et al. utilise the culprits to partition the arcs of the path corresponding to tgreedy into four sets. Through clever analysis they end up showing the following result [Blum et al., 1994, Section 5].

Lemma 14

  |tgreedy| ≤ 2 · OPT(S) + Oc − |CCc(Sm)|.  □

Having by now put quite some cannons into position, we can start aiming at the final proof. The main hurdle is to derive a useful approximation for Oc. Let A = {r(Cir) | [i, r] = culprit} be the representatives for Sm guaranteed to exist by Lemma 11 on page 55. We then have

Lemma 15
For each culprit [i, r], |ov(sr, si)| ≤ |r(Cir)| − |Cir|. □

PROOF ([BLUM ET AL., 1994]) Let j ∈ [i, r] be the index such that the merge 〈sj+1, . . . , sr, si, . . . , sj〉 is a suffix of r(Cir) (e.g. j = si2 in Equation 6.3 on page 56). We now get

  |r(Cir)| ≥ |〈sj+1, . . . , sr, si, . . . , sj〉|                           (the merge is a suffix)
           = |pr(sj+1, sj+2) · · · pr(sj−1, sj) sj|
           = |pr(sj+1, sj+2) · · · pr(sj−1, sj) pr(sj, sj+1)| + |ov(sj, sj+1)|
           ≥ |Cir| + |ov(sr, si)|  □

Corollary 4

  Oc ≤ ||A|| − |CCc(Sm)|.  □

PROOF ([BLUM ET AL., 1994])

  Oc = Σ_{[i,r] = culprit} |ov(sr, si)| ≤ Σ_{[i,r] = culprit} (|r(Cir)| − |Cir|) ≤ ||A|| − |CCc(Sm)|.  □

We are now able to approximate Oc:

Lemma 16

  Oc ≤ OPT(S) + (3/2)|CCc(Sm)|.  □


PROOF ([KAPLAN AND SHAFRIR, 2005])

  ||A|| = OPT(A) + OV(A)
        ≤ OPT(A) + (3/2)|CCc(Sm)|                 (Corollary 2)
        ≤ OPT(Sm) + |CCc(Sm)| + (3/2)|CCc(Sm)|    (Lemma 6)
        = OPT(Sm) + (5/2)|CCc(Sm)|.

Now, using Corollary 4 on the preceding page, we have

  Oc ≤ ||A|| − |CCc(Sm)|
     ≤ OPT(Sm) + (3/2)|CCc(Sm)|
     ≤ OPT(S) + (3/2)|CCc(Sm)|.  □

So putting it all together we get:

Theorem 6
Greedy is a factor 3.5 approximation algorithm. □

PROOF ([KAPLAN AND SHAFRIR, 2005]) Using Lemma 16 on the facing page and Lemma 14 on the preceding page we get

  |tgreedy| ≤ 2 · OPT(S) + OPT(S) + (3/2)|CCc(Sm)| − |CCc(Sm)|
            = 3 · OPT(S) + (1/2)|CCc(Sm)|
            ≤ 3.5 · OPT(S)   (Equation 6.4)  □

6.4.2 Implementation

The actual implementation of the Greedy algorithm is a Python implementation of the pseudo-code in Algorithm II.4 on page 58, with the checks in line 5 implemented as a generator method. Since this method has to sort the n² pairwise overlap values, the running-time of the Greedy algorithm is dominated by the running-time of this sorting, which in this implementation⁴ is O(log(n²) · n²) = O(log(n) · n²).
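The structure of Algorithm II.4 can be sketched in plain Python as follows. This is not the thesis's implementation (which sorts all n² overlaps once and walks them with a generator); it is a minimal sketch that simply recomputes the maximal overlap in each round, with illustrative function names.

```python
def overlap(s, t):
    """Length of the longest proper suffix of s equal to a prefix of t."""
    for k in range(min(len(s), len(t)) - 1, 0, -1):
        if s[-k:] == t[:k]:
            return k
    return 0

def greedy_superstring(strings):
    """Algorithm II.4: repeatedly merge the pair with maximal overlap."""
    sl = list(strings)
    while len(sl) > 1:
        # find si, sj with i != j and ov(si, sj) maximal (line 5)
        k, i, j = max((overlap(a, b), i, j)
                      for i, a in enumerate(sl)
                      for j, b in enumerate(sl) if i != j)
        merged = sl[i] + sl[j][k:]          # the merge <si, sj>
        sl = [s for idx, s in enumerate(sl) if idx not in (i, j)]
        sl.append(merged)
    return sl[0]
```

Running it on Example 11 with k = 3 reproduces the tight bound: the greedy result has length 4k + 2 = 14, while the optimal superstring c(ab)^4 c has length 2k + 4 = 10.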

6.5 Reductions of HPP Instances

Since the last two algorithms to be presented in this chapter are both based on the overlap formulation (Expression 5.6 on page 40) for the optimal solution of the SSP, we will now treat the necessary reductions from the original HPP to equivalent instances of the ATSP and the TSP respectively.

⁴ A more efficient implementation could use Radix Sort, since the size of the overlap values is restricted.


6.5.1 Reduction from a HPP to an ATSP

Given a collection of strings S = (s1, s2, . . . , sn), let GS(N, A, W) be the overlap graph for S, and let H(GS) be the instance of the maximum HPP in this graph. Define an instance of the maximum ATSP, A(G′S), by G′S = G(N′, A′, W′) with

  N′ = N ∪ {n0},
  A′ = A ∪ {ai0, a0i | i ∈ 1 . . . n},
  W′ : A′ → ℕ with w′(aij) = w(aij) for i, j ≠ 0, and w′(aij) = 0 otherwise.

Lemma 17
The reduction from HPP to ATSP is correct. □

PROOF This graph transformation can clearly be done in polynomial time. If POPT = (nP1, nP2, . . . , nPn) is the Hamiltonian path corresponding to the optimal solution for H(GS), then the TSP tour TP = (n0, nP1, nP2, . . . , nPn) will be an optimal solution for A(G′S) with the same length as POPT. Conversely, if TOPT = (n0, nT1, nT2, . . . , nTn) is the TSP tour for the optimal solution for A(G′S), then the path PT = (nT1, nT2, . . . , nTn) will be an optimal solution for H(GS), because removing the visit of node n0 does not reduce the length of the solution.

Both cases can be proved using that the existence of a better TSP tour (Hamiltonian path) leads to a contradiction of the assumed optimality of POPT and TOPT respectively. □
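On an overlap matrix the reduction amounts to bordering the matrix with a zero-weight dummy node. The following sketch (function names are illustrative, not the thesis's code) adds the dummy node n0 at index 0 and recovers the Hamiltonian path from an ATSP tour by rotating the tour so n0 comes first and dropping it.

```python
def hpp_to_atsp(w):
    """Add a dummy node n0 (index 0) with zero-weight arcs to and from
    every node, turning a max-HPP overlap matrix into a max-ATSP matrix."""
    n = len(w)
    out = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(n):
            out[i + 1][j + 1] = w[i][j]
    return out

def tour_to_path(tour):
    """Recover the Hamiltonian path: rotate the ATSP tour so the dummy
    node 0 comes first, then drop it (shifting indices back by one)."""
    k = tour.index(0)
    rotated = tour[k:] + tour[:k]
    return [v - 1 for v in rotated[1:]]
```

Because every arc touching n0 has weight 0, the tour and the recovered path have the same total overlap, which is exactly what the proof of Lemma 17 uses.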

Remark 10
Adding the extra node n0 is necessary in order to make the reduction work. Intuitively it could seem as if an optimal solution to the HPP were equivalent to determining an optimal solution for the TSP and then removing the heaviest edge. The graph depicted in Figure 6.3 on the facing page illustrates that this intuition is quite wrong. An optimal solution to the TSP (drawn with full lines) has length 1002. Removing the heaviest edge from this tour results in an HPP solution of length 502. However, the optimal HPP solution has length 3. □

6.5.2 Reduction from ATSP to TSP

Let A(Gatsp) be an instance of the maximum ATSP in a graph Gatsp = (Natsp, Aatsp, Watsp). Transform Gatsp into an undirected graph Gtsp = (Vtsp, Etsp, Wtsp) by the following steps:


[Figure: a four-node graph with edge weights 1, 1, 1, 500, 500 and 1000.]
Figure 6.3: The shortest Hamiltonian path is not necessarily included in the optimal TSP tour.

1) Split each node n ∈ Natsp into a pair of nodes nin, nout, where all arcs incoming to n become incoming to nin and all arcs outgoing from n become outgoing from nout. Let Vtsp = Natsp,in ∪ Natsp,out.

2) In the graph resulting from 1), add an opposite arc with the same weight for every arc, making the resulting graph symmetric. Then convert each pair of opposite arcs into an edge of the same weight, thereby converting the graph into an undirected graph. Define

  wtsp(eij) = watsp(aij) if aij ∈ Aatsp, and wtsp(eij) = watsp(aji) if aji ∈ Aatsp.

3) Define a value K ≫ max(Watsp), where max(Watsp) is the maximum weight in Gatsp.

4) For all pairs vi, vj ∈ Vtsp such that vi = nk,in and vj = nk,out for some node nk, add an edge eij between the pair with wtsp(eij) = K.

5) Make the graph complete by adding the missing edges eij with wtsp(eij) = −K.

6) Multiply all edge weights by −1.

A graphical illustration of the transformation is shown in Figure 6.4 on the next page.

Steps 1) and 2) ensure that the original arcs with their original weights are present in the undirected graph. Steps 3) to 5) guarantee that any maximal TSP tour in the node-split graph will always travel directly between any corresponding in- and out-vertex pair. Finally, step 6) ensures that a maximal TSP tour is turned into an equivalent minimal TSP tour.
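Steps 1) to 6) can be sketched as a single matrix transformation. The vertex layout (vertex i for ni,out, vertex n + i for ni,in) and the concrete choice K = n · max(Watsp) + 1 are assumptions of this sketch, not taken from the thesis.

```python
def atsp_to_tsp(w):
    """Steps 1)-6): node splitting, symmetrisation, K-weight edges between
    each in/out pair, -K completion edges, and final negation.
    Vertex i represents n_i,out and vertex n + i represents n_i,in."""
    n = len(w)
    wmax = max(max(row) for row in w)
    K = n * wmax + 1                      # some K >> max(W_atsp)
    size = 2 * n
    # missing edges get weight -K; after step 6's negation they become K
    t = [[K] * size for _ in range(size)]
    for i in range(n):
        t[i][i] = t[n + i][n + i] = 0
        t[i][n + i] = t[n + i][i] = -K    # in/out pair edge (K, negated)
        for j in range(n):
            if i != j:                    # arc i -> j becomes edge (i_out, j_in)
                t[i][n + j] = t[n + j][i] = -w[i][j]
    return t
```

A minimal TSP tour in the resulting symmetric matrix is forced through every −K pair edge, so reading off every second vertex of the tour recovers an ATSP tour of the original instance.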

Lemma 18
The reduction from ATSP to TSP is correct. □


[Figure: panels (a)–(f) showing the stepwise transformation of a two-node ATSP instance: node splitting, edge symmetrisation, K-weight pair edges, −K completion edges and final negation.]
Figure 6.4: Transforming an ATSP instance to a TSP instance

PROOF The whole graph transformation can clearly be done in polynomial time, i.e. an instance A(Gatsp) in the graph Gatsp can be turned into an instance T(Gtsp) in the graph Gtsp in polynomial time.

If TAOPT = (n1, n2, . . . , nn) is an optimal maximal ATSP tour in Gatsp with length Latsp, then the TSP tour

  Ttsp = (vn1,in, vn1,out, vn2,in, vn2,out, . . . , vnn,in, vnn,out)

will be an optimal solution for the TSP in Gtsp with length −(Latsp + n · K). Conversely, if

  TTOPT = (vn1,in, vn1,out, vn2,in, vn2,out, . . . , vnn,in, vnn,out)

is an optimal TSP tour in Gtsp with length Ltsp, then Tatsp = (n1, n2, . . . , nn) is an optimal maximal ATSP tour in Gatsp with length −(Ltsp + n · K).

As for the first reduction, both cases can be proved using that the existence of a better TSP tour (ATSP tour) leads to a contradiction of the assumed optimality of TAOPT and TTOPT respectively. □

6.6 The Best Factor Algorithm

The basis for this algorithm is the overlap formulation (Expression 5.6 on page 40) for the optimal solution of the SSP. After making the reductions


described in Section 6.5.1 on page 62, the algorithm finds an approximation for the resulting ATSP instance. To this end the techniques described in [Kaplan et al., 2005] are used.

6.6.1 Overview

The algorithm starts off by finding a solution to the following relaxation of an LP problem:

Definition 11
Let S = (s1, s2, . . . , sn) be a string collection and let G(N, A, W) be its overlap graph. The following is a relaxation of the LP formulation for Maximal CC with Two-Cycle Constraint:

  Maximise  Σ_{aij ∈ A} w(aij) · xij  subject to

  Σ_i xij = 1,       ∀ j       (in-degree constraints)
  Σ_j xij = 1,       ∀ i       (out-degree constraints)
  xij + xji ≤ 1,     ∀ i ≠ j   (2-cycle constraints)
  xij ≥ 0,           ∀ i ≠ j   (non-negativity constraints)  □

The above LP formulation can be seen as a variation of the LP formulation of the TSP (see Definition 2 on page 13), with the subtour elimination constraints modified to only consider possible subsets of size 2.
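The constraint system of Definition 11 can be assembled mechanically. The sketch below (an illustration, not the thesis's implementation) flattens the variables x_ij with i ≠ j in row-major order and returns dense constraint matrices in the equality/inequality shape that a solver such as scipy.optimize.linprog consumes (after negating the objective, since such solvers minimise); the non-negativity constraints are left to the solver's default bounds.

```python
def build_cc_lp(n):
    """Constraint matrices of Definition 11 for n strings.
    Returns (a_eq, b_eq, a_ub, b_ub): degree constraints as equalities,
    2-cycle constraints as inequalities."""
    var = {}
    for i in range(n):
        for j in range(n):
            if i != j:
                var[(i, j)] = len(var)
    m = len(var)                                   # n(n-1) variables
    a_eq, b_eq = [], []
    for j in range(n):                             # in-degree of j is 1
        row = [0] * m
        for i in range(n):
            if i != j:
                row[var[(i, j)]] = 1
        a_eq.append(row); b_eq.append(1)
    for i in range(n):                             # out-degree of i is 1
        row = [0] * m
        for j in range(n):
            if i != j:
                row[var[(i, j)]] = 1
        a_eq.append(row); b_eq.append(1)
    a_ub, b_ub = [], []
    for i in range(n):                             # x_ij + x_ji <= 1
        for j in range(i + 1, n):
            row = [0] * m
            row[var[(i, j)]] = row[var[(j, i)]] = 1
            a_ub.append(row); b_ub.append(1)
    return a_eq, b_eq, a_ub, b_ub
```

For n strings this yields 2n equality rows and n(n−1)/2 inequality rows over n(n−1) variables, which makes concrete how small the relaxation is compared with the exponentially many subtour elimination constraints of the full TSP formulation.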

Using the found solution to the LP formulation in Definition 11, the algorithm first "integrates" the possibly non-integral solution. This is done by scaling up (multiplying) with the least common denominator D of all variables. This integral solution defines a multigraph having a total weight of at least D times the value of the original solution. Furthermore, the multigraph does not contain more than ⌊D/2⌋ copies of any 2-cycle. By using a representation of this multigraph as a pair of CCs, and by careful rounding and recursing, Kaplan et al. end up with two CCs CC1, CC2 with the property

  ||CC1|| + ||CC2|| ≥ 2 · OPTatsp.   (6.5)

This pair of CCs is then further decomposed by a 3-coloring into three collections of paths. It now follows that the heaviest collection will have weight at least (2/3) · OPTatsp, and the returned approximation solution is achieved by randomly patching this path collection into a tour.

Remark 11
As mentioned in [Mucha, 2007], the result achieved by Kaplan et al. is quite impressive. The theoretical bound for an integral solution to the LP seems to be 2/3 (see Section 6.6.7 on page 72), and Kaplan et al. manage to achieve this bound using some nice techniques. For this reason the main part of the algorithm will be covered in detail in the following. □

6.6.2 Preliminaries and notation

The algorithm consists of the following steps:

1) Transform the HPP into an instance of the ATSP and construct the overlap matrix for this instance.

2) Find a solution to the corresponding LP in Definition 11.

3) Scale up the possibly non-integral solution by multiplying with the least common denominator D of all variables.

4) Recursively decompose the resulting multigraph until a pair of CCs fulfilling Equation 6.5 is achieved.

5) Decompose this CC pair into three path collections.

6) Patch the heaviest path collection into a tour.

Step 1) follows the description in Section 6.5 on page 61, and Step 2) is done by use of an LP solver program capable of solving mixed integer linear programs.

Steps 3) and 4) will be the topic of the rest of this section. A detailed description of Step 5) is given in [Kaplan et al., 2005, Section 5]. Finally, Step 6) is done by arbitrarily merging the disjoint paths together until a tour is constructed.

Definitions and notation

Let G(N, A, W) be the overlap graph from the LP. Let {x∗ij} for i, j ∈ 1 . . . |N| and i ≠ j be an optimal (fractional) solution from Step 2), with OPTf the maximal value found for the objective function of the LP. Denote the optimal value for the corresponding ATSP by OPTatsp. Let D be the least common denominator from Step 3). Define k · (i, j) to be the multiset containing k copies of the arc (i, j), and define the weighted multigraph D · G = (N, Ā, W) where Ā = {(Dx∗ij) · (i, j) | (i, j) ∈ A} and all copies of the same arc have identical weight. Let Wmax denote the maximum weight of an arc in G. Denote the number of copies (the multiplicity) of an arc a by mG(a), and denote a 2-cycle between the nodes i, j by Cij.

Remark 12
The multiplier D might be exponential in the size of N, as it can be the product of |A| = (n − 1)² factors. However, the algorithm avoids using more than polynomial space, since the number of bits necessary to represent D is polynomial. Furthermore, multigraphs are represented implicitly and never explicitly, by maintaining the multiplicities of all arcs. □

Definition 12 (Half-Bound d-regular Multigraph)
A (directed) multigraph G(N, A) is said to be d-regular if the in- and out-degrees of all nodes equal d. A d-regular multigraph is said to be half-bound if it contains at most ⌊d/2⌋ copies of any 2-cycle. □

6.6.3 Finding the least common denominator

Determining the multiplier D in Step 3) is done using a Greatest Common Divisor (GCD) implementation as a subroutine. The pseudo-code for the algorithm is shown in Algorithm II.5.

Algorithm II.5 leastCommonDenominator
Require: Nonempty list of fractional numbers, fl = [p1/q1, p2/q2, . . . , p|fl|/q|fl|] with GCD(pi, qi) = 1 and pi ≤ qi ∀ i ∈ 1 . . . |fl|
Ensure: The smallest number lcd ∈ ℕ such that lcd · p/q ∈ ℕ ∀ p/q ∈ fl.

   procedure LEASTCOMMONDENOMINATOR(fl)
3:    lcd ← 1
      for all p/q in fl do
         lcd ← lcd · q / GCD(lcd, q)
6:    end for
      return lcd
   end procedure

Termination: The algorithm only iterates over finite, non-increasing data structures.

Correctness: Using induction over the size of fl, we get:

Base Case: If |fl| = 1 we have lcd1 = q1/GCD(lcd0, q1) = q1/1 = q1, where lcd0 = 1 follows from line 3.

Induction Case: Assume lcdi−1 is correct and |fl| = i. We have

  lcdi = lcdi−1 · qi/GCD(lcdi−1, qi).

No matter whether qi divides lcdi−1 or not, it follows from the definition of GCD that qi/GCD(lcdi−1, qi) is the smallest possible number to multiply with lcdi−1 such that qi divides the product.


The running-time of LEASTCOMMONDENOMINATOR() depends on the running-time of the sub-procedure GCD() used.⁵ Using an implementation of Euclid's Algorithm, we have for the running-time of GCD() rtGCD = O(log(max(lcd, q))) (see [Cormen, Leiserson, Rivest, and Stein, 2001, chapter 31.2]) and for LEASTCOMMONDENOMINATOR(): rtLCD = O(|fl| · rtGCD).
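Algorithm II.5 translates almost line for line into Python using the standard library's math.gcd and exact rationals from fractions.Fraction (a sketch; the thesis's own code may differ in how it extracts denominators):

```python
from math import gcd
from fractions import Fraction

def least_common_denominator(fl):
    """Algorithm II.5: the smallest lcd such that lcd * f is an integer
    for every fraction f in fl."""
    lcd = 1
    for f in fl:
        q = f.denominator          # Fraction stores p/q in lowest terms
        lcd = lcd * q // gcd(lcd, q)
    return lcd
```

Note that Fraction normalises its input, so the precondition GCD(p, q) = 1 of Algorithm II.5 holds automatically.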

6.6.4 Finding Cycle Covers when D is a power of two

To determine the CCs in Step 4) on page 66 we follow the presentation in [Kaplan et al., 2005, Section 3], and first show the procedure for the case where the multiplier D is a power of two.

Let G0 = D · G, where G = G(N, A, W) is the graph from the LP in Step 2). G0 is then D-regular, half-bound and has W(G0) ≥ D · OPTatsp. Define a D-regular bipartite undirected multigraph B(V, V′, E) as follows: for each node ni ∈ N define two vertices vi ∈ V, v′i ∈ V′, and for each arc aij = (ni, nj) ∈ A define the edge e = (vi, v′j) ∈ E.

Next, divide B into two D/2-regular bipartite multigraphs B1, B2 as follows: for each edge e ∈ B with mB(e) ≥ 2, add ⌊mB(e)/2⌋ copies of e to both B1 and B2 and remove 2 · ⌊mB(e)/2⌋ copies of e from B. Distribute the "left-over" edges among B1 and B2 as follows: for each connected component in the remaining subgraph of B, determine an Eulerian cycle,⁶ and alternately assign the edges of these cycles to B1 and B2.

Now, let G′ and G″ be the subgraphs of G0 corresponding to B1 and B2 respectively. Note that both G′ and G″ are D/2-regular and half-bound. The latter holds since G0 was half-bound and D-regular, meaning that for every 2-cycle Cij in G0 at least one of the arcs (ni, nj) and (nj, ni) has multiplicity ≤ D/2. By letting G1 be the heavier of G′ and G″, it is achieved that W(G1) ≥ (D/2) · OPTatsp, because W(G′) + W(G″) = W(G0) ≥ D · OPTatsp.

Applying the above procedure log(D) − 1 times, the obtained graph Glog(D)−1 will be a 2-regular half-bound multigraph having W(Glog(D)−1) ≥ 2 · OPTatsp. From this graph it is fairly simple⁷ to obtain a pair of CCs satisfying Equation 6.5 on page 65.

The running-time of the procedure is O(log(D) · n²), following from log(D) iterations each costing O(n²).
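One halving step can be sketched as follows. The edge representation ((left, right) pairs, possibly repeated) and the 'L'/'R' vertex tagging are assumptions of this sketch: paired copies are divided evenly, and the leftover edges, which form an even-degree bipartite graph, are split by walking an Eulerian cycle (Hierholzer's algorithm) and assigning its edges alternately. Since every cycle in a bipartite graph has even length, the alternation gives each vertex exactly half of its leftover edges in each part.

```python
from collections import Counter, defaultdict

def eulerian_cycle(adj, start):
    """Hierholzer's algorithm on an undirected multigraph stored as an
    adjacency dict of lists; consumes the edges it walks."""
    stack, cycle = [start], []
    while stack:
        v = stack[-1]
        if adj[v]:
            u = adj[v].pop()
            adj[u].remove(v)          # drop the opposite copy of the edge
            stack.append(u)
        else:
            cycle.append(stack.pop())
    return cycle                      # consecutive vertices form the edges

def halve(edges):
    """One halving step of Section 6.6.4: split the edge multiset of a
    d-regular bipartite multigraph (d even) into two d/2-regular halves."""
    b1, b2 = [], []
    adj = defaultdict(list)
    for (u, v), m in Counter(edges).items():
        b1 += [(u, v)] * (m // 2)     # paired copies go to both halves
        b2 += [(u, v)] * (m // 2)
        if m % 2:                     # leftover edges: even-degree subgraph
            adj[('L', u)].append(('R', v))
            adj[('R', v)].append(('L', u))
    for start in list(adj):           # each component has an Eulerian cycle
        if adj[start]:
            cyc = eulerian_cycle(adj, start)
            for k in range(len(cyc) - 1):
                a, b = cyc[k], cyc[k + 1]
                left = a[1] if a[0] == 'L' else b[1]
                right = b[1] if a[0] == 'L' else a[1]
                (b1 if k % 2 == 0 else b2).append((left, right))
    return b1, b2
```

Iterating halve() and keeping the heavier half, as the text describes, produces the 2-regular half-bound multigraph after log(D) − 1 rounds.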

⁵ Actually it also depends on a procedure being able to extract the denominator of a fractional number from its decimal representation. However, the dominant procedure will be the GCD.
⁶ A cycle containing all edges in a graph.
⁷ E.g. by using arc coloring in a bipartite graph.


6.6.5 Finding Cycle Covers when D is not a power of two

When D is not a power of two, the challenge is to use a rounding procedure on the solution of the LP in order to derive a 2^y-regular multigraph. When this is achieved, the procedure in Section 6.6.4 can be applied.

For reasons that will become apparent later, assume |N| ≥ 5. Determine y ∈ ℕ such that 2^(y−1) < 12|N|²Wmax ≤ 2^y. Define D = 2^y − 2|N|. For each value x∗ij let xij denote the largest integer multiple of 1/D less than x∗ij. Define the multigraph D · G = (N, Ā, W) where Ā = {(Dxij) · (i, j) | (i, j) ∈ A}. Since Dxij ∈ ℕ, this graph is well-defined.
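The rounding of x∗ij onto the grid of multiples of 1/D can be made exact with rationals. A small sketch (illustrative names; the "strictly less" case, where x∗ij already lies on the grid, steps down one grid point):

```python
from fractions import Fraction

def round_down_to_grid(x_star, d):
    """x_ij of Section 6.6.5: the largest integer multiple of 1/d that is
    strictly less than x_star (given as an exact Fraction)."""
    scaled = x_star * d
    m = scaled.numerator // scaled.denominator     # floor(x_star * d)
    if m == scaled:                                # x_star was on the grid
        m -= 1
    return Fraction(m, d)
```

By construction the result satisfies x_ij ≥ x∗ij − 1/D, which is exactly the bound used in the proof of Lemma 19.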

Lemma 19
For all nodes n in D · G the following inequalities hold:

  D − (|N| − 1) ≤ d^in_n, d^out_n ≤ D.  □

PROOF ([KAPLAN ET AL., 2005]) For node nj it follows from the rounding of the x∗ij values that xij ≥ x∗ij − 1/D, and from the in-degree constraints in Definition 11 on page 65 we get that Σ_i x∗ij = 1. From the definition of D · G it then follows that

  d^in_nj = D Σ_i xij                 (6.6)
         ≥ D Σ_i (x∗ij − 1/D)
         ≥ D (1 − (|N| − 1)/D)
         = D − (|N| − 1).

Using xij ≤ x∗ij, the upper bound for the in-degree follows from Equation (6.6). The bounds on the out-degree can be proved analogously. □

In the following we will say that a multigraph G has the sum-of-arcs property if

  mG(aij) + mG(aji) ≤ D   ∀ ni, nj, i ≠ j, i, j ∈ 1 . . . |N|.

From the 2-cycle constraints it follows that D · G has the sum-of-arcs property, that is, it has at most ⌊D/2⌋ copies of any 2-cycle Cij.

Next, the multigraph D · G has to be completed into a 2^y-regular graph by the addition of arcs. Thereby it has to be ensured that the resulting graph will be half-bound. The process of arc-adding is divided into an arc-addition stage and an arc-augmenting stage, and ensures the half-bound property by satisfying the condition that mG(aij) + mG(aji) ≤ D for every pair of nodes ni, nj.


Arc-addition stage: If ∃ ni, nj, i ≠ j : d^out_ni < D and d^in_nj < D and mG(aij) + mG(aji) < D, add an arc aij to the graph D · G.

Denoting the graph at the end of the arc-addition stage by G′, the following lemma holds:

Lemma 20
With the exception of at most two nodes ni, nj, every node n in G′ has d^in_n = d^out_n = D. If such a pair of nodes does exist, then mG(aij) + mG(aji) = D. □

PROOF ([KAPLAN ET AL., 2005]) Assume the existence of three distinct nodes ni, nj, nk such that for each of these either the in-degree or the out-degree is less than D. Since Σ_n d^in_n = Σ_n d^out_n, there will be at least one node, say ni, with d^out_ni < D and at least one node, say nj, with d^in_nj < D. Assume further d^in_nk < D (the case d^in_nk = D ⇒ d^out_nk < D is symmetric). Since the arc-addition stage finished without adding further copies of aij and aik, it follows that mG′(aij) + mG′(aji) = mG′(aik) + mG′(aki) = D. This again implies that d^in_ni + d^out_ni = 2D. However, since the arc-addition stage ensures that neither the in- nor the out-degree of any node becomes larger than D, it follows that d^out_ni = D, leading to a contradiction. □

By Lemma 20 at most two nodes ni, nj exist for which the in-degree and out-degree do not equal D. For these nodes it follows from Lemma 19 that D − (|N| − 1) ≤ d^in_ni, d^out_ni, d^in_nj, d^out_nj. The first phase of the arc-augmenting stage starts by choosing an arbitrary node nk ≠ ni, nj and adds copies of the arcs aki, aik, akj, ajk until d^in_ni = d^out_ni = d^in_nj = d^out_nj = D. At this point any node nm, m ≠ k, has d^in_nm = d^out_nm = D. Denote the corresponding graph G″.

Since at most 2(|N| − 1) incoming and at most 2(|N| − 1) outgoing arcs have been added to and from nk, and because D = 2^y − 2|N|, it still holds that d^in_nk, d^out_nk < 2^y (Lemma 19). Furthermore, after the arc-addition stage we had mG′(aki), mG′(aik) ≤ D. The first phase of the arc-augmenting stage added at most (|N| − 1) copies of both aki and aik, so it now holds that mG″(aki), mG″(aik) ≤ 2^y. Since a similar argument holds for akj and ajk, it is still possible to augment G″ into a 2^y-regular graph G0 having the sum-of-arcs property. Since nk is the only node in G″ whose in-degree or out-degree is larger than D, it follows that d^in_nk = d^out_nk = L with D ≤ L ≤ 2^y.

The second phase of the arc-augmenting stage adds L − D arbitrary cycles passing through all nodes except nk, ensuring that all nodes have d^in_n = d^out_n = L. Then 2^y − L arbitrary Hamiltonian cycles are added, with the restriction that none of these Hamiltonian cycles uses any of the arcs aki, aik, akj and ajk. Note that there have to be at least five vertices in order


for these Hamiltonian cycles to exist. Calling the resulting graph G0, it follows by construction that:

Lemma 21
G0 is 2^y-regular and half-bound. □

Next, the procedure of Section 6.6.4 on page 68 can be applied to G0. After y − 1 iterations the procedure will end with a 2-regular half-bound graph Gy−1. It remains to prove the following lemma:

Lemma 22
The graph Gy−1 will have W(Gy−1) ≥ 2 · OPTatsp.  □

PROOF ([KAPLAN ET AL., 2005]) Using the following facts:

i) xij ≥ x∗ij − 1/D,

ii) D = 2^y − 2|N|,

iii) Σ_{aij∈A} 1 = (|N| − 1)² ≤ |N|², and

iv) OPTf ≤ |N|Wmax,

we have for the weight of G0 that

  W(G0) ≥ Σ_{aij∈A} D · w(aij) · xij
        ≥ D · (Σ_{aij∈A} w(aij) x∗ij − Σ_{aij∈A} w(aij)/D)   (from i))
        ≥ D · (OPTf − Σ_{aij∈A} w(aij)/D)
        ≥ D · (OPTf − Σ_{aij∈A} Wmax/D)
        ≥ 2^y OPTf − 2|N| OPTf − |N|² Wmax                    (from ii) and iii))
        ≥ 2^y OPTf − 3|N|² Wmax                               (from iv))

At each of the y − 1 iterations the heaviest graph is retained. Thus, using 2^y ≥ 12|N|²Wmax ⇔ 2^(y−1) ≥ 6|N|²Wmax and OPTf ≥ OPTatsp, the following holds for the weight of Gy−1:

  W(Gy−1) ≥ W(G0)/2^(y−1)
          ≥ (2^y OPTf − 3|N|² Wmax)/2^(y−1)
          ≥ 2 OPTf − 3|N|² Wmax/(6|N|² Wmax)
          ≥ 2 OPTatsp − 1/2
          ≥ 2 OPTatsp   since W(Gy−1) and OPTatsp are integers.  □

6.6.6 Approximation factor

Theorem 7
The Best Factor algorithm is a 2.5 approximation algorithm. □

PROOF The algorithm achieves a factor 2/3 for the MAX-ATSP. Through the reductions in Section 6.5.1 on page 62 this gives a factor 2/3 for the MAX-HPP. From Theorem 5 on page 56 we now have that this gives a 3.5 − 1.5 · (2/3) = 2.5 approximation for the SSP. □

6.6.7 Implementation

The implementation consists of a Python implementation that utilises two external LP solver programs⁸ to find the solution in Step 2) on page 66.

A quite remarkable observation is that the solutions to the LP problems turned out to be integral in all but one case out of 355 280. This simplified the implementation considerably, since steps 3) to 5) on page 66 could be substituted with the steps:

iii) No rounding is needed because of the integral solution.

iv) No pair of CCs is needed, as the returned CC is optimal.

v) No decomposition is needed; just remove the lightest arc from each cycle in the cover to achieve a heaviest path collection.

Removing the lightest arc in step v) reduces the accumulated weight by at most one third (as all cycles consist of at least three arcs), thus the approximation factor is ensured.
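Step v) can be sketched directly: break each cycle of the cover at its lightest arc and patch the resulting paths together in arbitrary order (a sketch with illustrative names, not the thesis's code).

```python
def cycle_cover_to_tour(cycles, w):
    """Break each cycle of a cycle cover at its lightest arc (losing at
    most one third of the weight, since every cycle has >= 3 arcs) and
    concatenate the resulting paths arbitrarily into one tour."""
    tour = []
    for cyc in cycles:                 # cyc is a list of node indices
        arcs = [(cyc[k], cyc[(k + 1) % len(cyc)]) for k in range(len(cyc))]
        i, j = min(arcs, key=lambda a: w[a[0]][a[1]])
        pos = arcs.index((i, j))       # cut the lightest arc i -> j
        tour += cyc[pos + 1:] + cyc[:pos + 1]
    return tour
```

Because an integral solution to the LP of Definition 11 is an optimal cycle cover whose weight bounds OPTatsp from above, keeping at least two thirds of its weight already meets the 2/3 factor without the rounding and decomposition machinery.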

⁸ QSopt: http://www2.isye.gatech.edu/~wcook/qsopt/ and glpsol: http://www.gnu.org/software/glpk/


Remark 13
It has not been possible to explain this overwhelming majority of integral solutions to the LP. However, using constructed examples of distance matrices, it has been proven that the feature depends neither on the distance matrices being asymmetric nor on their satisfying the Inverse Monge condition. □

6.7 The TSP Based Algorithm

Like the Best Factor algorithm in Section 6.6 on page 64, the basis for this algorithm is the overlap formulation (Expression 5.6 on page 40). After making the reductions described in Section 6.5.1 and Section 6.5.2 on page 62, we can use LKH to find a solution and transform it back into a solution for the original HPP.

6.7.1 Implementation

The implementation consists of a Python implementation that transforms an original asymmetric overlap matrix into a corresponding symmetric distance matrix and then calls the LKH program to find a solution for the transformed TSP instance. This solution is then converted into a solution for the original SSP using the methods in Section 6.5 and Expression 5.6 on page 40.

The running-time of the implementation is dominated by the calculation of the matrices and the running-time of LKH on the transformed instance. Letting S = (s1, s2, . . . , sn) be the collection of strings, the running-time for the calculation of the original ATSP overlap matrix can (theoretically) be reduced to O(||S|| + n²) [Gusfield, 1997]. Transforming this matrix into a symmetric matrix for the transformed instance has a running-time of O((2n)²). The running-time of LKH is empirically determined to be approximately O(n^2.2) [Helsgaun, 1998].

In practice, the calculation of the all-pairs prefix-suffix overlap matrix was implemented as mentioned in Section 6.1.2 on page 48.

6.8 Summary

To recapitulate, Table 6.2 shows the characteristics of the implemented algorithms.


Table 6.2: SSP algorithm characteristics

  Algorithm      Approximation factor   Running-time
  Cycle          2.75                   O(log(n) · n²)
  RecurseCycle   2.75                   O(log(n) · n²)
  Greedy         3.5                    O(log(n) · n²)
  Best Factor    2.5                    O(log(D) · n²)
  TSP            –                      O(n^2.2)

In the running-times, D is the multiplier from Step 3) on page 66 and n is the number of strings in the string collection S = (s1, s2, . . . , sn).

The running-times of the Cycle algorithm and the RecurseCycle algorithm are dominated by Algorithm II.2 on page 52. The running-time of Greedy is dominated by the sorting of the n² pairwise overlaps. The running-time of the Best Factor algorithm may be dominated by the LP solver program. Finally, the running-time of the TSP algorithm is dominated by running LKH.

6.9 Notes

The approximation factor theory in Section 6.1 and Section 6.2 is based on [Blum et al., 1994; Mucha, 2007]. The theory in Section 6.3 is based on [Breslauer et al., 1996], and the theory in Section 6.4 is based on [Blum et al., 1994; Kaplan and Shafrir, 2005]. Finally, Section 6.6 is based entirely on [Kaplan et al., 2005]. All other references are explicitly mentioned.

A little experience often upsets a lot of theory.

CHAPTER 7

Experiments

This chapter contains descriptions of the experiments performed on theimplemented SSP algorithms together with their results.

The main and almost sole purpose of the SSP experiments was for eachof the different algorithms to evaluate its properties concerning SSP solu-tion quality. This especially meant that no particular emphasis concern-ing running-time efficiency was put into the implementations of the algo-rithms. In other words: A fast and simple implementation was preferredto a complex but more running-time efficient implementation.

Since it turned out to be fairly easy to collect test results concerningproperties like running-time and solution quality as sequencing method,tests for these were also conducted. It should be kept in mind though, thatthe latter tests were made more out of curiosity than of planned intentand as such care should be taken not to draw too definite conclusionsupon them.

An initial section describing the test data generation is followed by sections covering the tests conducted for SSP solution quality, running-time and sequencing quality respectively. Each of these experiment sections finishes with the conclusions drawn from the test results.

7.1 Test Data

The test data was constructed according to one of the procedures described in [Romero et al., 2004]. The procedure resembles the Shotgun Sequencing method (see 4.2.1 on page 30), thus we will need the following definition:


Definition 13 (Coverage)
As a measure of the relation between the number of clones and the number of overlapping fragments, the coverage is defined as

    cov = n · ℓ / |G|    (7.1)

where n is the number of fragments, ℓ is the average length of the fragments and |G| is the length of the original DNA sequence (genome). 2

As mentioned in [Gusfield, 1997, chapter 16.14], a five- to tenfold coverage generates the best results for Shotgun Sequencing.

The steps for producing test data are:

1) From each of a pool of 64 fasta-format files¹ a nucleotide sequence of length N is obtained.

2) A collection of strings S of equal length ℓ is constructed by shifting character by character along the obtained nucleotide sequence.

3) The collection S is modified by randomly removing a number of strings r until a wanted coverage value is obtained.

4) Ensure that S is substring free (see Remark 3 on page 29).
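The four steps above can be sketched in a few lines of Python. This is a simplified illustration, not the thesis code: the function name `make_fragments`, the seed handling and the removal loop are assumptions of this sketch, and the target fragment count is derived from Equation 7.1.

```python
import random

def make_fragments(genome, frag_len, coverage, seed=0):
    # Step 2) slide a window of length frag_len along the sequence
    frags = [genome[i:i + frag_len] for i in range(len(genome) - frag_len + 1)]
    # Step 3) remove random fragments until the wanted coverage is reached;
    # cov = n * l / |G| (Equation 7.1) gives the target count n
    rng = random.Random(seed)
    target = int(coverage * len(genome) / frag_len)
    while len(frags) > target:
        frags.pop(rng.randrange(len(frags)))
    # Step 4) keep the collection substring free
    return [f for f in frags if not any(f != g and f in g for g in frags)]

frags = make_fragments("ACGTACGTAC", frag_len=4, coverage=2)
print(len(frags))  # 5, since cov * |G| / l = 2 * 10 / 4
```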

In the experiments coverage values of 6, 8, 10, 20, 30 and 40 were used. String lengths varied from 6 to 100, and collection sizes from 100, 200, . . . , 1000, with a few collections of size 2000 and 3000. For exact details and the calculation of the necessary nucleotide sequence length N see Appendix D on page 161.

In [Romero et al., 2004] two further methods to generate test data were used. These methods consist of a random generation of the nucleotide sequence in step 1) and a random generation of all strings in the collection S, respectively.

The choice of the above method was due to the results obtained by Romero et al. These results showed that the largest deviations from a lower bound for the SSP solutions were obtained using test data generated from real nucleotide sequences. This might not be all that surprising, as uniformly random generated nucleotide sequences tend to have small and sparse overlaps and thus are easy to compress optimally. The 64 nucleotide sequences were obtained from GenBank² and comprise the sequences which it was possible to retrieve using the accession numbers given in [Romero et al., 2004].

¹ see Appendix C on page 157
² http://www.ncbi.nlm.nih.gov/


Table 7.1: SSP experiments overview

                coverage     string length       string count
SSP solution    Figure 7.1   Figures 7.2, 7.3    Figures 7.4, 7.5
Running-time    –            –                   Figures 7.7–7.9
Sequencing      –            Figure 7.10a        Figure 7.10b

Figures showing the quality dependency of the row values upon the column variables.

Table 7.1 contains an overview of the following figures and which relationship combination they are illustrating. All tests were run on two dedicated, identical machines.³

Remark 14
The following abbreviations are used in the figure legends for all figures:

• ’RCycle’ for RecurseCycle algorithm,

• ’Best’ for Best Factor algorithm, and

• ’LB’ for lower bound.

Please note the fine scaling on the y-axes in the figures 7.1–7.5. 2

7.2 SSP Solver Tests

Even though the SSP solutions originating from the TSP algorithm were expected to be optimal, at the very least for string counts up to 1000, it was decided to calculate a lower bound for the optimal SSP solution for each test.

The lower bound was calculated by Concorde as a guaranteed (with respect to numerical precision) feasible solution to the dual of an LP relaxation of the TSP. The results for all SSP algorithms are thus depicted as their percentage deviation above the lower bound, i.e. the lower bound value is used as the 100 % mark.

7.2.1 Variable coverage

From Equation 7.1 it follows that the coverage and the length of the original sequence are inversely proportional. For the test data this means,

³ Intel Xeon, dual core, 3.0 GHz, 1.0 GB RAM


that keeping string (fragment) count and length constant, increasing coverage is equivalent to a shorter original sequence length. In other words, the problem of solving the SSP is expected to become easier for increasing coverage, as the superstring to determine becomes smaller while the fragment count and overlaps are constant.

[Figure 7.1: SSP solution dependency on variable coverage. (a) SSP results for count 200, length 50; (b) SSP results for count 200, length 100. X-axis: coverage; y-axis: percentage in terms of lower bound. Legend: Cycle, RCycle, Best, Greedy, Tsp, LB.]


The tests were conducted with coverage values of 6, 8, 10, 20, 30 and 40. Only the results for string count 200 and string lengths 50 and 100 are shown, as they are representative of the results using other string counts and lengths. As can be seen in Figure 7.1, the expected increase in SSP solution quality with larger coverage values is not obvious. Actually, all algorithm SSP values (except for the TSP algorithm) seem to fluctuate somewhat. The fluctuations can partly be explained as a visual effect due to the very fine y-axis scale.

7.2.2 Variable length

From Equation 7.1 it can be seen that the string (fragment) length and the length of the original sequence are directly proportional. For the test data this means, that keeping string count and coverage constant, increasing length is equivalent to a larger original sequence length.

The problem of solving the SSP is nevertheless expected to become easier, as the possible overlaps increase with the string lengths.

[Figure 7.2: SSP solution dependency on length. SSP results for coverage 8, count 100. X-axis: string length; y-axis: percentage in terms of lower bound. Legend: Cycle, RCycle, Best, Greedy, Tsp, LB.]

For these tests string lengths of 10, 20, 25, 40, 50 and 100 were used. Some of the results for a coverage of 8 are depicted in Figure 7.2 and Figure 7.3; the results for other coverage and string count values are similar.


[Figure 7.3: Solution dependency on length, compare with Figure 7.10a. SSP results for coverage 8, count 500. X-axis: string length; y-axis: percentage in terms of lower bound. Legend: Cycle, RCycle, Best, Greedy, Tsp, LB.]

In contrast to the variable coverage results, the variable length results are very consistent. All algorithms produce values of increasing quality for longer lengths. Only a string length of 10 results in values up to 1 % above the lower bound. As always this is not the case for the TSP algorithm, which consistently lies more or less on the lower bound line.

7.2.3 Variable count

From Equation 7.1 it follows that the string count and the length of the original sequence are directly proportional. For the test data this means, that keeping string length and coverage constant, increasing count is equivalent to a larger original sequence length.

The problem of solving the SSP is expected to stay unaffected, as the possible overlaps are non-increasing and the extra amount of strings is compensated for by the increasing superstring length.

For these tests string counts of 100, 200, 300, 400, 500, 600 and 700 were used. In Figure 7.4 and Figure 7.5 the results for a coverage of 8 are depicted for string lengths of 25 and 40, the results for other coverage and string length values being similar.

The results show a slight tendency of reduced solution quality


[Figure 7.4: Solution dependency on count, compare with Figure 7.10b. SSP results for coverage 8, length 25. X-axis: number of strings; y-axis: percentage in terms of lower bound. Legend: Cycle, RCycle, Best, Greedy, Tsp, LB.]

upon growing string count. The fluctuations in the figures are mainly due to the changes in y-axis scale between Figure 7.4 and Figure 7.5, and probably also a too conservative lower bound value. If the results from Figure 7.5 are correlated using the TSP algorithm values as the lower bound, the lines look smoother (see Figure 7.6).

7.2.4 SSP solution conclusions

Some recurrent patterns that are present in all of the SSP solution figures are:

• the TSP algorithm values are almost constantly lying on the lower bound line,

• the Greedy algorithm is producing the second best values, followed by the two Cycle algorithms and the Best Factor algorithm, which are producing solutions of similar quality,

• the Cycle algorithm and the RecurseCycle algorithm are, for all practical purposes, producing identical values, and finally


[Figure 7.5: Solution dependency on count. SSP results for coverage 8, length 40. X-axis: number of strings; y-axis: percentage in terms of lower bound. Legend: Cycle, RCycle, Best, Greedy, Tsp, LB.]

Table 7.2: minimum scores for SSP algorithms in 355 280 tests

Algorithm       minimum scores   percentage
Cycle           301 637          84.90 %
RecurseCycle    301 664          84.91 %
Best Factor     301 251          84.79 %
Greedy          306 519          86.28 %
TSP             355 280          100.00 %

• all algorithms are producing results typically less than 1 % above the lower bound, and always less than 2 %.

To further assess the results, Table 7.2 lists the number of times each algorithm achieved the smallest (best) value over all test runs.

The conclusions that can be drawn from these observations are:

1) The TSP algorithm is far superior to the other algorithms in terms of solution quality.


[Figure 7.6: Figure 7.5 correlated using TSP algorithm values as 100 %. SSP results for coverage 8, length 40. X-axis: number of strings; y-axis: percentage in terms of lower bound. Legend: Cycle, RCycle, Best, Greedy, Tsp, LB.]

2) The Greedy algorithm produces better results than the CC-based algorithms even though its approximation factor is higher.

3) The Best Factor algorithm, the Cycle algorithm and the RecurseCycle algorithm produce solutions of similar quality.

4) The improvement of the RecurseCycle algorithm with respect to the Cycle algorithm is negligible.

5) All approximation algorithms are producing far better solutions than their worst-case approximation factor might indicate.

7.3 Timing Tests

For test data with coverage 6, the running-time for all algorithms plotted against string count is shown using normal scale in Figure 7.7.

To examine the running-time complexities, the Best Factor algorithm and the TSP algorithm running-times were plotted using double-log scale (see Figure 7.8a). A linear regression was made for both algorithms⁴ and

⁴ using the stats package from scipy


[Figure 7.7: SSP algorithms running-time using normal scale. SSP algorithms running-time in seconds for coverage 6. X-axis: string count; y-axis: running-time. Legend: Cycle, RCycle, Best, Greedy, Tsp.]

the result is plotted in Figure 7.8b. The values obtained by the linear regressions are shown in Table 7.3.

The Greedy algorithm, the Cycle algorithm and the RecurseCycle algorithm running-times are shown in Figure 7.9a using double-log scale, with the running-time values divided by log(string count). A linear regression was made for all algorithms and the result is plotted in Figure 7.9b. The values for the linear regressions are shown in Table 7.3.
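These regressions fit a straight line to log2(running-time) against log2(string count), so the fitted slope estimates the polynomial exponent of the running-time. The thesis used scipy's stats package; the sketch below is a dependency-free equivalent on synthetic data (the counts, the constant 0.004 and the exponent 2.2 are invented for illustration).

```python
import math

def loglog_slope(xs, ys):
    # least-squares slope of log2(y) against log2(x): the polynomial exponent
    lx = [math.log2(x) for x in xs]
    ly = [math.log2(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

counts = [100, 200, 300, 400, 500, 600, 700]
times = [0.004 * n ** 2.2 for n in counts]     # synthetic, exactly O(n^2.2)
print(round(loglog_slope(counts, times), 2))   # 2.2
```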

The following observations can be made:

• the TSP algorithm and the Best Factor algorithm are clearly the slowest,

• the Greedy algorithm, the Cycle algorithm and the RecurseCycle algorithm are all comparable with respect to running-time, and

• the TSP algorithm and the Best Factor algorithm have a polynomial running-time dependency, while the Greedy algorithm, the Cycle algorithm and the RecurseCycle algorithm have a log-times-polynomial running-time dependency. Comparing these observations with Table 6.2 on page 74, it can be concluded that all algorithms, apart from the Best Factor algorithm, exhibit the expected running-times. The


[Figure 7.8: Best Factor and TSP algorithm running-times. SSP algorithms running-time in milliseconds for coverage 6. (a) algorithm running-times; (b) running-time linear regressions. X-axis: log2(String count); y-axis: log2(Running-time). Legend: Best, Tsp (plus regression lines in (b)).]


[Figure 7.9: Cycle, RecurseCycle and Greedy algorithm running-times. SSP algorithms running-time in milliseconds for coverage 6. (a) algorithm running-times; (b) running-time linear regressions. X-axis: log2(String count); y-axis: log2(Running-time / log2(String count)). Legend: Cycle, RCycle, Greedy (plus regression lines in (b)).]


Table 7.3: Running-time linear regression values

Algorithm       slope   r-value   r-squared   p-value   stderr
Cycle           1.89    1.00      1.00        0.00      0.022
RecurseCycle    1.89    1.00      1.00        0.00      0.021
Best Factor     2.16    1.00      1.00        0.00      0.026
Greedy          1.67    1.00      1.00        0.00      0.027
TSP             2.87    0.99      0.98        0.00      0.060

The r-value is the correlation coefficient, the r-squared is the coefficient of determination, and the two-sided p-value is for a hypothesis test whose null hypothesis is that the slope is zero, i.e. no correlation at all.

running-time measured for the Best Factor algorithm is probably due to its external LP solver.

7.4 Sequencing Tests

For the test data with coverage 8, the similarity to the original string was tested using the needle application from the EMBOSS suite [Rice, Longden, Bleasby, et al., 2000].⁵ This is an implementation of the Needleman-Wunsch Optimal Pairwise Alignment (OPA) algorithm, but in this case it was only used in order to get a percentage value for the similarity between the two strings.

Remark 15
The alignment tests were conducted before the Best Factor algorithm was implemented, therefore no values are shown for this algorithm in the figures. 2

The results for variable string length and variable string count were collected. Results for string count 500 and for string length 25 are shown, as the results for the other string counts and lengths were similar. The results are depicted in Figure 7.10.

Figure 7.10a should be compared with Figure 7.3 on page 80 and Figure 7.10b with Figure 7.4 on page 81.

Another measure of the similarity between the calculated superstring and the original string is illustrated in Table 7.4.

From the figures it can be seen that there is a correlation between the SSP solution quality and the string similarity, indicating that the SSP is a

⁵ http://emboss.sourceforge.net/


[Figure 7.10: SSP algorithm alignment similarity results. (a) alignment similarity for variable string length (align results for string count 500), compare with Figure 7.3; (b) alignment similarity for variable string count (align results for string length 25), compare with Figure 7.4. Y-axis: percentage identity to original string. Legend: RCycle, Tsp, Greedy, Cycle.]


Table 7.4: superstring lengths compared to original string length

Algorithm       shorter   equal   longer
Cycle           78 %      11 %    11 %
RecurseCycle    78 %      11 %    11 %
Best Factor     79 %      11 %    10 %
Greedy          79 %      12 %    9 %
TSP             86 %      14 %    0 %

plausible model for Shotgun Sequencing. From Table 7.4 it is evident, though, that all algorithms compress the fragments too much. Note that due to the way the test data is constructed (removal of randomly chosen fragments in Step 3) on page 76) it might not be possible to reconstruct a superstring with the same length as the original sequence.


Part III

The Circular Sum Score and Multiple Sequence Alignment

DNA strands tour with 16384 cities


One or two homologous sequences whisper . . .
A full multiple alignment shouts out loud

Arthur Lesk

CHAPTER 8

Sequence Alignments

This chapter introduces and defines biological sequence alignments. It further gives a description of the problems concerning MSAs in particular, along with a characterisation of the commonly used construction algorithms.

8.1 Motivation

Many algorithms in bioinformatics are concerned with comparing sequences. This is mainly due to the following:

Definition 14 (The First Fact of Biological Sequence Analysis)
In biomolecular sequences (DNA, RNA or amino acid sequences) high sequence similarity usually implies significant functional or structural similarity. 2

This fact can be popularly rephrased as:

If it looks like a duck, it will probably walk and talk like a duck.

Evolution reuses and modifies successful structures, and “duplication with modification” is the central paradigm for protein (amino acid sequence) evolution. A large majority of the human genes are very similar, if not directly identical, to the genes of for example the mouse.¹ What makes a mouse differ from a human may be determined more by protein regulation than by differences in the protein sequences.

¹ This explains why some molecular biologists affectionately refer to mice as “those furry little people”.


Though the first fact is extremely powerful, it does not signify a symmetric property, which is captured in:

Definition 15 (The Second Fact of Biological Sequence Analysis)
Evolutionary and functionally related molecular sequences can differ significantly throughout much of the sequence and yet preserve the same three-dimensional structure(s), or the same two-dimensional substructure(s), or the same active sites, or the same or related dispersed residues (DNA or amino acid). 2

The second fact could then be rephrased as

Even if it walks and talks like a duck, it does not have to look like a duck.

8.2 Pairwise Sequence Alignment

The similarity of two sequences is often measured using an alignment:

Definition 16 (Global pairwise alignment)
A (global) alignment of two strings s1 and s2 is obtained by first inserting chosen spaces (or dashes) either into or at the ends of s1 and s2, and then placing the resulting strings one above the other so that every character or space in either string is opposite a unique character or space in the other string. 2

The term “global” emphasises that the full length of both strings (as opposed to only a substring) participates in the alignment. A pairwise alignment (using dashes) of two strings s1, s2 ∈ Σ* results in two strings a1, a2 ∈ (Σ′)*, where Σ′ = Σ ∪ {'−'} and '−' ∉ Σ, with the property that if we remove the inserted dashes we obtain the original strings. A single occurrence or a contiguous sequence of the space character ('−' above) is denoted a “gap”. Biologically, gaps are interpreted as either insertion or deletion events (indels) in the sequences.
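This defining property, that the rows have equal length and that deleting the gap characters from them gives back the original strings, can be checked mechanically. A small sketch (the function name and the example strings are illustrative):

```python
def is_alignment_of(a1, a2, s1, s2):
    # rows must be equally long, and stripping the gaps must recover the originals
    return (len(a1) == len(a2)
            and a1.replace("-", "") == s1
            and a2.replace("-", "") == s2)

print(is_alignment_of("AGCTCA", "AT-TC-", "AGCTCA", "ATTC"))  # True
```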

Remark 16
In the rest of this thesis '−' will be used as the gap character and the notation A = [[a1, a2]] will be used to denote an alignment. We will use the notation A[r][c] to describe the c-th character of the r-th row in the alignment. Furthermore, we denote the sequences a1, a2 the induced sequences for s1, s2 in the alignment A. 2

To find out whether two sequences share a common ancestor, the sequences are usually aligned. The problem is to determine the alignment that maximises the probability that the sequences are related. The evaluation of an alignment is done using a scoring function.


Definition 17 (Scoring Function for an Alignment)
Let A be the set of possible alignments for two strings s1, s2 ∈ Σ* over the finite alphabet Σ. A function f : A → R is called a scoring function for the alignment. 2

The value of the scoring function is typically defined as the sum of a column-wise scoring of the alignment characters, using a scoring matrix M of size |Σ′| × |Σ′|.

Example 12
Given two DNA-strings AGCTCA and ATTC and an alignment of these:

    A = A G C T C A
        A T − T C −

The score S of the alignment using the score matrix

        A    C    G    T    −
    A   5    1    3    1   −1
    C   1    5    1    3   −1
    G   3    1    5    1   −1
    T   1    3    1    5   −1
    −  −1   −1   −1   −1    0

is given by:

    A G C T C A
    A T − T C −

    S = 5 + 1 − 1 + 5 + 5 − 1 = 14    2
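The column-wise scoring of Example 12 can be replayed directly in code. A minimal sketch; the dictionary layout of the matrix and the function name `score` are choices of this illustration:

```python
# score matrix from Example 12, indexed by ordered pairs of characters
alpha = "ACGT-"
rows = [[ 5,  1,  3,  1, -1],
        [ 1,  5,  1,  3, -1],
        [ 3,  1,  5,  1, -1],
        [ 1,  3,  1,  5, -1],
        [-1, -1, -1, -1,  0]]
M = {(x, y): v for x, row in zip(alpha, rows) for y, v in zip(alpha, row)}

def score(a1, a2):
    # sum the matrix entries column by column over the two alignment rows
    return sum(M[x, y] for x, y in zip(a1, a2))

print(score("AGCTCA", "AT-TC-"))  # 5 + 1 - 1 + 5 + 5 - 1 = 14
```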

In Example 12 the scores for gaps are independent of the length of the current gap. An often used scoring strategy for gaps is the so-called “affine gap cost”, which scores a gap g of length |g| according to the formula cost(g) = o + |g| · e, where o is the cost of starting a gap (opening cost) and e the cost of extending it.² Further refinements can be setting the cost of terminal gaps to zero or letting the gap cost depend on the distance to neighbouring gaps.
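The affine formula and the terminal-gap refinement fit in a few lines. The opening and extension costs below (o = 4, e = 1) are invented values for illustration, not taken from the thesis:

```python
def affine_gap_cost(length, o=4, e=1, terminal=False):
    # cost(g) = o + |g| * e; terminal gaps optionally cost zero
    return 0 if terminal else o + length * e

print(affine_gap_cost(3))                  # 4 + 3*1 = 7
print(affine_gap_cost(3, terminal=True))   # 0
```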

Determining the OPA for two strings is done using dynamic programming³ and has running-time and space complexity O(|s1| · |s2|). As mentioned in [Gusfield, 1997, chapter 12.1], the use of a recursive technique due to Hirschberg can reduce the space complexity to O(min(|s1|, |s2|)) while only increasing the running-time by a factor of two.

² We deliberately refrain from participating in the controversy of whether the first gap should have the cost 'o' or 'o + e'.

³ In bioinformatics the algorithms are named “Needleman-Wunsch” and “Smith-Waterman” for global and local alignment respectively [Gusfield, 1997, chapter 11].
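The quadratic dynamic programme for the global alignment score can be sketched as follows. This is a minimal illustration with a flat match/mismatch/gap scoring (the parameter values are invented, not the thesis's matrix), returning only the score, not the alignment itself:

```python
def needleman_wunsch(s1, s2, match=5, mismatch=1, gap=-1):
    # D[i][j] = best score aligning s1[:i] with s2[:j]; O(|s1|*|s2|) time and space
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap                      # s1 prefix against gaps only
    for j in range(1, m + 1):
        D[0][j] = j * gap                      # s2 prefix against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            D[i][j] = max(D[i - 1][j - 1] + sub,   # substitution or match
                          D[i - 1][j] + gap,       # gap in s2
                          D[i][j - 1] + gap)       # gap in s1
    return D[n][m]

print(needleman_wunsch("ACGT", "ACGT"))  # four matches at +5 each: 20
```

Hirschberg's refinement recomputes this table in halves, keeping only two rows at a time, which is how the space bound drops to O(min(|s1|, |s2|)).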


8.3 Multiple Sequence Alignment

If comparison of two Ribonucleic Acid (RNA), DNA, or amino acid sequences (proteins) can lead to biological insights, then this is even more so in the case of comparing multiple sequences. Being able to determine “families” of proteins has very important applications in at least three ways:

History of Life on Earth Sequence families are often used as the basis for building evolutionary or phylogenetic trees. The sequences are put as leaves in the tree (denoted the “taxa”) and the internal vertices of the tree then reflect unknown common ancestors.

Identifying conserved regions When comparing multiple viral or bacterial protein strains, the most conserved regions are typically also the critical functional regions for the proteins. Since viruses mutate at a very rapid rate, identifying these regions is of paramount importance when designing drugs. A drug attacking a conserved region will remain effective far longer than a drug attacking a more unstable region.

Structural and functional prediction A protein folds into so-called “secondary structures”. These structures are divided into three categories: α-helices, β-sheets and random coils or loops (see Figure 8.1). Indel events seldom occur in the first two categories, meaning that a possible way of identifying coils and loops is as regions with gaps in the alignment.

Figure 8.1: Secondary structure of Avian Sarcoma Virus, a retrovirus with many similarities to HIV (from http://mcl1.ncifcrf.gov/integrase/asv_intro.html)


A straightforward extension of the pairwise alignment definition leads to:

Definition 18 (Global Multiple Alignment)
A global multiple alignment of k > 2 strings, S = (s1, s2, . . . , sk), is a natural generalisation of an alignment for two strings. Chosen spaces are inserted into (or at either end of) each of the k strings so that the resulting strings have the same length, defined to be ℓ. Then the strings are arrayed in k rows of ℓ columns each, so that each character and space of each string is in a unique column. 2

A more mathematical formulation of the above definition would be

Definition 19 (Multiple Sequence Alignment)
Let S = (s1, s2, . . . , sk) be a collection of sequences with si ∈ Σ* for all i in 1 . . . k and Σ a finite alphabet. An MSA consists of a collection of sequences A = [[a1, a2, . . . , ak]] with ai ∈ (Σ′)^ℓ for all i in 1 . . . k, where Σ′ = Σ ∪ {'−'} and '−' ∉ Σ. 2

Remark 17
We call k the size of the MSA and ℓ the length of the MSA. In connection with MSAs the literature often denotes the number of strings k and their (original) length n. We will use this notation, even though n in connection with the SSP was used to denote the number of strings. Hopefully this will not lead to confusion. 2

Constructing an MSA immediately leads to three distinct problems:

1) The sequences to be aligned have to be determined.

2) A scoring function for the MSA has to be defined.

3) The optimal MSA with respect to the scoring function has to be de-termined.

Re 1) Algorithms for constructing an MSA usually assume that the sequences are “homologous”.⁴ This does not mean that the algorithms will refrain from constructing an MSA should this not be the case, only that the resulting MSA will at best be useless and in the worst case directly misleading.

Re 2) Defining a meaningful scoring function for an MSA is very difficult. This has to do with the notion of correctness of an MSA. What does it mean that an MSA is biologically correct, and is it even possible to recognise a correct MSA if we see one? The challenge is to define a mathematical function capturing all biological knowledge about the structure, function and evolutionary history of the sequences. This challenge is still

4The sequences share a common ancestor


to be met; as of today there is no consensus as to how to score an MSA [Elias, 2006; Gusfield, 1997; Notredame, 2002, 2007].

Re 3) As if the problem of defining a scoring function was not enough: apart from instances with very short or very few sequences, it is inconceivable to calculate an optimal MSA. The general MSA problem has been proven NP-hard for the most used scoring functions [Elias, 2006; Wang and Jiang, 1994].

Algorithms for constructing MSAs can be divided into three groups:

Exact Algorithms As a rule of thumb, the optimal MSA cannot be practically calculated for more than three sequences of realistic lengths. There are algorithms that extend this limit by extending a dynamic programming approach into higher dimensions. They depend on identifying the portion of the hyperspace that will contribute to the solution, thereby minimising the space constraints. However, even the best implementations seem to reach a limit around ten sequences [Notredame, 2002]. This group contains all implementations of the Needleman-Wunsch algorithm. One of the best known extensions to higher dimensions is the application msa [Lipman, Altschul, and Kececioglu, 1989].⁵

Progressive Alignment Algorithms These algorithms follow the simple idea of constructing the final MSA by adding one sequence (or a subalignment) at a time to a continuously growing alignment. They depend on a proximity measure in order to choose the next sequence (or subalignment) to add. Additionally, the order of succession has a very high influence on the final result, in particular on the insertion of gaps. This feature is captured in the common formulation: “once a gap, always a gap”. An illustration of this problem is seen in Example 13 on the facing page. The best known program using a progressive alignment algorithm is Clustalw [Thompson, Higgins, and Gibson, 1994]. It is also one of the first published MSA construction applications. Despite having been surpassed concerning both running-time and quality [Blackshields, Wallace, Larkin, and Higgins, 2006], Clustalw is still one of the most popular MSA construction applications.

Iterative Alignment Algorithms The essence of these algorithms is the idea that it is possible to arrive at an optimal MSA by refining a given sub-optimal MSA. The ways to refine the original MSA range from simply choosing random sequences (or subalignments) to re-align, to a variety of techniques including Hidden Markov Model

⁵ although contrary to popular belief msa does not guarantee the mathematically optimal solution [Notredame, 2002]


(HMM), “Simulated Annealing”, “Genetic Algorithms”, “Divide-and-Conquer” and “Consistency Based” approaches. The latter algorithms try to determine the MSA most consistent with libraries containing structural, functional, local and global alignment similarity information. Known applications of this type are Probcons [Do, Mahabhashyam, Brudno, and Batzoglou, 2005] and TCoffee [Notredame, Higgins, and Heringa, 2000].

Many MSA applications are a mix of the last two groups.

Example 13
The four sequences THE LAST FAT CAT, THE FAST CAT, THE VERY FAST CAT and THE FAT CAT are progressively aligned in this order to construct an MSA:

    THE LAST FAT CAT
    THE FAST CAT
            →
    THE LAST FAT CAT
    THE FAST CAT ---
            ↓
    THE VERY FAST CAT →
    THE LAST FA-T CAT
    THE FAST CA-T ---
    THE VERY FAST CAT
            ↓
    THE FAT CAT →
    THE LAST FA-T CAT
    THE FAST CA-T ---
    THE VERY FAST CAT
    THE ---- FA-T CAT

If the algorithm in the first alignment chooses to mismatch L, F and F, C instead of aligning CAT with CAT, the resulting end gap will stay in the final MSA. This can happen if the cost of an end gap is below the cost of an interior gap, which is the case for several MSA applications. In the final MSA a far better score would be achieved by realigning the second sequence. 2

8.4 Notes

This chapter is mainly based on [Notredame, 2002] and [Gusfield, 1997]; in particular, most definitions are taken from the latter.


Oh he love, he love, he love
He does love his numbers
And they run, they run, they run him
In a great big circle
In a circle of infinity
But he must, he must, he must
Put a number to it

Kate Bush

CHAPTER 9

Circular Sum Score

This chapter introduces the Circular Sum (CS) score along with the considerations leading to its definition and a description of its main characteristics. The CS score is at the basis of the MSA algorithms described in a later chapter.

9.1 Sum-of-Pairs Score

As mentioned in Chapter 8 on page 93, one of the problems when dealing with MSAs is how to evaluate them. One of the first scoring schemes introduced [Bacon and Anderson, 1986; Murata, Richardson, and Sussman, 1985] can be seen as a natural extension of the scoring function for pairwise alignments.

Definition 20 (Sum-of-Pairs Score)
Let A = [[a1, a2, . . . , ak]] be an MSA of the sequences s1, s2, . . . , sk. Let S(ai, aj) be the score for the induced alignment of the strings si, sj in A. We then define the Sum-of-Pairs (SOP) score of A as

    SOP(A) = \sum_{i=1}^{k} \sum_{j>i} S(a_i, a_j)   □

Example 14
Using the scoring scheme of +1 for character matches and 0 for everything else,¹ the SOP score of the final MSA from Example 13 on page 99 is calculated as:

¹This scoring scheme is also known as the “unit metric” or “edit distance”.

    a1 : THE LAST FA-T CAT
    a2 : THE FAST CA-T ---
    a3 : THE VERY FAST CAT
    a4 : THE ---- FA-T CAT

    S(a1, a2) = 8,  S(a1, a3) = 9,  S(a1, a4) = 9
    S(a2, a3) = 5,  S(a2, a4) = 5
    S(a3, a4) = 9

giving the result 26 + 10 + 9 = 45. □
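As a concrete check of the arithmetic above, the SOP score can be computed directly. The sketch below is our own illustration, assuming the unit-metric convention of the example: only identical letters score, while gap (`-`) and space columns score 0.

```python
from itertools import combinations

def induced_score(a, b):
    """Unit-metric score of an induced pairwise alignment:
    +1 for every column where both rows hold the same letter;
    gap ('-') and space columns score 0, as in Example 14."""
    return sum(1 for x, y in zip(a, b) if x == y and x not in "- ")

def sop_score(msa):
    """Sum-of-Pairs score: induced scores summed over all pairs i < j."""
    return sum(induced_score(a, b) for a, b in combinations(msa, 2))

msa = ["THE LAST FA-T CAT",
       "THE FAST CA-T ---",
       "THE VERY FAST CAT",
       "THE ---- FA-T CAT"]
print(sop_score(msa))  # 45, matching 26 + 10 + 9
```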

The SOP score is easy to calculate, but like most scoring functions it lacks a clear biological justification. In [Notredame, 2002] it is even stated that the SOP score is wrong from an evolutionary point of view. The following section will elaborate on this claim.

9.2 MSAs and Evolutionary Trees

Substitution matrices used in alignments are often constructed using a predefined evolutionary model in which the score of a substitution, a gap or a conservation is determined according to the biological likelihood of this event happening. This means that unlikely mutations will have negative (or very small) scores, while mutations occurring more often than would be expected by chance will have positive scores.² The most commonly used scoring matrices are the Point Accepted Mutation (PAM) and Block Substitution Matrix (BLOSUM) series.³ Typically the scoring of gaps is done using the “affine” gap cost (see Section 8.2 on page 94).

We can connect an MSA to an evolutionary tree in the following way:

Definition 21 (Evolutionary Tree)
Let S = (s1, s2, . . . , sk) be a collection of homologous biomolecular sequences with k ≥ 2. An Evolutionary or Phylogenetic Tree for S is a rooted, binary tree whose leaves are labelled with the k sequences

²If a minimisation problem is preferred for the MSA, the same values can be used after either multiplying each value by −1 or subtracting each value from the largest value.

³http://en.wikipedia.org/wiki/Substitution_matrix


(extant species) and where each interior vertex represents a common (usually unknown) ancestor of its children (extinct species). □

Consider an evolutionary tree of five sequences S = (s1, s2, s3, s4, s5) (see Figure 9.1). Assume this is the correct evolutionary tree for the sequences. If we look upon the construction of an MSA for the sequences as a minimisation problem, we can interpret the score of the pairwise alignment [[s1, s2]] as a measure of the probability of evolutionary events (mutations and indels) on the edges (s1, v1) and (v1, s2). Likewise the score of the alignment [[s3, s4]] evaluates the likelihood of evolutionary events on the edges (s3, v2), (v2, v3), (v3, s4).

In other words: the shorter the distance between two sequences in the tree, the less likely it is that evolutionary events took place, i.e. the more identical and closely related the sequences will be.

[Figure 9.1: Evolutionary tree with five sequences (species)]

With this interpretation in mind, consider an MSA of the sequences in the same evolutionary tree using the SOP score. If a “tick” is added to an edge each time it is included in the calculation of a pairwise alignment, we get a result as shown in Figure 9.2. The effect is that some edges are considered more often than others. However, there is no biological justification to suggest that some evolutionary events are more important than others. Another way to formulate the evolutionary problem with the SOP score is that, in a sense, the SOP score considers each sequence to be an ancestor of every other sequence.

9.3 Circular Orders, Tours and CS Score

The considerations in the last section thus led to the idea of defining a scoring function that considers each edge in an evolutionary tree equally. Or, quoting from [Gonnet et al., 2000]:


[Figure 9.2: SOP score “ticks” added for each edge (from [Gonnet et al., 2000])]

How can a tree be traversed, going from leaf to leaf, such that all edges (each representing its own episode of evolutionary divergence) are counted the same number of times?

Definition 22 (Circular Order and Circular Tour)
Let T be an evolutionary tree for the sequences S = (s1, s2, . . . , sk). A permutation of the sequences which, when interpreted as a traversal of T, leads to a tour visiting each leaf once and each edge exactly twice, is called a Circular Order (CO). A tour in a tree following a CO is denoted a Circular Tour (CT).⁴ □

Consider our example evolutionary tree for five sequences (Figure 9.3 on the next page). If the tree is traversed in the order s1, s2, s3, s4, s5 (see subfigure 9.3a) the result will be a CT. Another CO would be s1, s2, s5, s4, s3, leading to the CT depicted in subfigure 9.3b. The last tour can also be imagined as being “equal to” the first tour in an isomorphic tree resulting from rotating the subtree Tv2 around v2. If we consider circular orders that are cyclic shifts of each other as giving the same CT, then the number of distinct CTs is given by the number of rotations around all inner vertices apart from the root. For a tree with n leaves the number of distinct CTs is thus 2^(n−2).

As an evolutionary tree has at least two leaves, we now have:

Lemma 23
The CT is the shortest possible tour through an evolutionary tree that visits each leaf once. □

⁴Strictly speaking this is only a tour as far as the distinct leaf nodes are concerned, as each inner node is visited twice.


[Figure 9.3: Circular tours. (a) circular order 1-2-3-4-5; (b) circular order 1-2-5-4-3]

PROOF A tour visiting all leaves and returning to the “start” leaf will have to traverse each edge at least twice, as each end of every edge is incident to a subtree containing at least one leaf.

A CT per definition visits each edge exactly twice. Using induction over the number of leaves, we can prove the existence of a CT in any tree.

Base Case: In a tree with two leaves the only tour is a CT (see Figure 9.4).

Induction Case: Assume that every evolutionary tree with n − 1 leaves has a CT. Consider a tree Tn with n leaves. If we remove a subtree of Tn consisting of a leaf sn and its non-root parent pn together with the edge between them (such a subtree will exist for n ≥ 3), the resulting tree Tn−1 will have n − 1 leaves. Per the induction hypothesis there exists a CT ctn−1 in Tn−1. Now consider a tour tn in Tn which is identical to ctn−1. The tour will pass pn twice, as pn “divides” an edge of Tn−1. If we extend the tour with a visit to sn at one of these passings, the resulting tour is a CT for Tn (see Figure 9.5 on the next page). ∎

[Figure 9.4: Base case]

We can now finally close in on the new scoring function for MSAs by considering the following question: Given the homologous sequences S =


[Figure 9.5: Induction case (from [Roth-Korostensky, 2000])]

(s1, s2, . . . , sk), how can we find a CT through an evolutionary tree for S without actually knowing or having to construct the tree?

If we interpret the score of the OPA [[si, sj]] as a distance measure for the strings (leaves) in the evolutionary tree and use Lemma 23, the problem of finding a circular order and a CT reduces to solving a TSP instance with the \binom{k}{2} pairwise alignment scores as distance matrix and the k sequences as cities.

The determination of a CO and CT can be done very efficiently compared to the calculation of the \binom{k}{2} pairwise alignment scores,⁵ which often is a prerequisite for constructing an MSA.
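The reduction can be sketched as follows. This is a toy illustration (not the thesis code): brute-force enumeration stands in for a real TSP solver such as LKH, and the 4×4 score matrix is made up in the shape of the OPA matrices used later.

```python
from itertools import permutations

def find_circular_order(score):
    """Brute-force TSP, adequate for small k: the circular order is the
    cyclic permutation of the k sequences maximising the summed pairwise
    OPA scores (a maximisation TSP over similarity scores)."""
    k = len(score)

    def tour_score(p):
        return sum(score[p[i]][p[(i + 1) % k]] for i in range(k))

    # fixing city 0 avoids re-counting cyclic shifts of the same tour
    return max((tuple([0] + list(p)) for p in permutations(range(1, k))),
               key=tour_score)

score = [[0, 9, 9, 9],   # illustrative OPA scores for four sequences
         [9, 0, 10, 9],
         [9, 10, 0, 9],
         [9, 9, 9, 0]]
co = find_circular_order(score)
print(co)  # a maximising circular order starting at sequence 0
```

For realistic numbers of sequences the factorial enumeration is replaced by a heuristic solver; only the distance-matrix construction stays the same.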

Definition 23 (Circular Sum Score)
Let A = [[a1, a2, . . . , ak]] be an MSA of the sequences s1, s2, . . . , sk. Let S(ai, aj) be the score for the induced alignment of the strings si, sj in A. We then define the Circular Sum score of A as

    CS(A) = \frac{1}{2} \sum_{i=1}^{k} S(a_{c_i}, a_{c_{i+1}})

where c_{k+1} = c_1 and C = (c_1, c_2, . . . , c_k) is a CO. □

⁵By now we also know which tools to use.


The CS score is similar to the SOP score, except that not all pairwise alignments are added: only the pairwise alignments in CO are added, and the sum is divided by two. The CS score can be regarded as a SOP score in circular order.

Example 15
Using the edit distance scoring for the four sequences from Example 14 on page 101, we get the following OPA distance matrix.

    s1 : THE LAST FAT CAT
    s2 : THE FAST CAT
    s3 : THE VERY FAST CAT
    s4 : THE FAT CAT

         s1  s2  s3  s4
    s1   13   9   9   9
    s2    9  10  10   9
    s3    9  10  14   9
    s4    9   9   9   9

A CO will be C = (s1, s2, s3, s4). The CS score of the alignment from Example 14 is computed from the induced scores along the circular order:

    a1 : THE LAST FA-T CAT
    a2 : THE FAST CA-T ---
    a3 : THE VERY FAST CAT
    a4 : THE ---- FA-T CAT

    S(a1, a2) = 8,  S(a2, a3) = 5,  S(a3, a4) = 9,  S(a4, a1) = 9,

giving the result ½ · (8 + 5 + 9 + 9) = 15.5. □
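The CS computation follows directly from the definition. As a hedged illustration we reuse the unit-metric induced scores of Example 14 (only identical letters score; gaps and spaces score 0):

```python
def induced_score(a, b):
    """Unit-metric score of an induced pairwise alignment:
    +1 per column where both rows carry the same letter
    (gap '-' and space columns score 0)."""
    return sum(1 for x, y in zip(a, b) if x == y and x not in "- ")

def cs_score(msa, order):
    """Circular Sum score: half the sum of induced scores taken along
    the circular order, including the closing pair (c_k, c_1)."""
    k = len(order)
    return 0.5 * sum(induced_score(msa[order[i]], msa[order[(i + 1) % k]])
                     for i in range(k))

msa = ["THE LAST FA-T CAT",
       "THE FAST CA-T ---",
       "THE VERY FAST CAT",
       "THE ---- FA-T CAT"]
print(cs_score(msa, [0, 1, 2, 3]))  # 15.5 as in Example 15
```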

9.3.1 Some CS score properties

The CS score of an alignment of a set of sequences has a clear upper bound:

Definition 24 (Maximal CS Score)
Let S = (s1, s2, . . . , sk) be a collection of sequences. Let OPA(si, sj) denote the score of the OPA of the sequences si, sj. The upper bound for the CS score of S is defined as

    CSmax(S) = \frac{1}{2} \sum_{i=1}^{k} OPA(s_{c_i}, s_{c_{i+1}})

where c_{k+1} = c_1 and C = (c_1, c_2, . . . , c_k) is a CO. □

Example 16
Using the matrix and the CO from Example 15 we get, for the sequences S,

    CSmax(S) = ½ · (OPA(s1, s2) + OPA(s2, s3) + OPA(s3, s4) + OPA(s4, s1))
             = ½ · (9 + 10 + 9 + 9) = 18.5 □


Likewise, the optimal score of an alignment of a set of sequences can be defined:

Definition 25 (Optimal CS Score)
Let 𝒜 be the set of all possible alignments of the sequences S = (s1, s2, . . . , sk). The optimal CS score of S is defined as

    CSopt(S) = \max_{A \in 𝒜} CS(A). □

Example 17
An optimal alignment of the string collection S from Example 15 and the resulting CS score is

    a1 : THE LAST FA-T CAT
    a2 : THE ---- FAST CAT
    a3 : THE VERY FAST CAT
    a4 : THE ---- FA-T CAT

    S(a1, a2) = 9,  S(a2, a3) = 10,  S(a3, a4) = 9,  S(a4, a1) = 9,

giving the result CSopt(S) = ½ · (9 + 10 + 9 + 9) = 18.5. □

The relation between the scores is given by the following lemma:

Lemma 24
Let 𝒜 be the set of all possible alignments of the sequences S = (s1, s2, . . . , sk). We then have:

    CSmax(S) ≥ CSopt(S) □

PROOF This follows immediately, as each score in the sum for CSmax(S) will be greater than or equal to the corresponding induced pairwise alignment score in the sum for CSopt(S). ∎

Although the scores in the previous two examples are equal, it is not in general the case that CSopt(S) equals CSmax(S). If, on the other hand, we do have an alignment A of the sequences S = (s1, s2, . . . , sk) with CS(A) = CSmax(S), then we know that the MSA is optimal.

9.4 Notes

This chapter is mainly based on [Gonnet et al., 2000] and [Roth-Korostensky, 2000].

The origin of the SOP score is somewhat unclear; Gusfield gives [Carrillo and Lipman, 1988] as the first introduction, but [Altschul and Lipman, 1989] gives the two references mentioned above.


The definition and treatment of evolutionary trees is intentionally kept simple in order to avoid getting too involved in the plethora of related topics such as bifurcating versus multifurcating trees, parsimony, the molecular clock hypothesis, ultrametric distances and evolutionary tree construction.


CHAPTER 10

Algorithms

This chapter presents the two implemented MSA algorithms. The first is an MSA construction algorithm called tspMsa and the second an MSA improving algorithm called enhanceMsa. Both algorithms are based directly on the CS score, and the actual implementations both consist of improvements of the original algorithms due to Roth-Korostensky.

10.1 Path Sum Scores

For notational convenience we will start this chapter with a couple of definitions.

Definition 26 (Path Sum Score)
Let A = [[a1, a2, . . . , ak]] be an MSA of the sequences s1, s2, . . . , sk. Let S(ai, aj) be the score for the induced alignment of the strings si, sj in A. We then define the Path Sum score of A as

    PS(A) = \frac{1}{2} \sum_{i=1}^{k-1} S(a_{c_i}, a_{c_{i+1}})

with (c_1, c_2, . . . , c_k) a CO. □

Remark 18
PS(A) is CS(A) with the score value for the induced strings ak, a1 left out of the summation. □

Analogously we have


Definition 27 (Maximal Path Sum (MPS) Score)
Let S = (s1, s2, . . . , sk) be a collection of sequences. Let OPA(si, sj) denote the score of the OPA of the sequences si, sj. The MPS score is defined as

    MPS(S) = \frac{1}{2} \sum_{i=1}^{k-1} OPA(s_{c_i}, s_{c_{i+1}})

with (c_1, c_2, . . . , c_k) a CO. □

Remark 19
MPS(S) is CSmax(S) with the OPA score value for the sequences sk, s1 left out of the summation. If an MSA A has the property that PS(A) = MPS(S), we say the alignment has MPS score. □

10.2 The MSA Construction Algorithm

The tspMsa algorithm is a progressive alignment algorithm that uses a CT to determine the alignment order. The pseudo-code for the algorithm is shown in Algorithm III.1 on the next page.

Given a collection of sequences S = (s1, s2, . . . , sk), tspMsa() executes the following four steps.

1) For all possible sequence pairs construct an optimal pairwise alignment, and use the scores of these optimal alignments to construct a distance matrix for the TSP solver.

2) Use the distance matrix to determine a CO and a CT.

3) Find the sequence pair si, si+1 having the worst OPA score in the CT and rearrange the cyclic order using cyclic shifts until the CT starts with sequence si+1. Rename this order to s1, s2, . . . , sk.

4) Construct an MSA A progressively, beginning with the sequences s1, s2, ensuring that A will have MPS score. This step is the algorithm called CIRCUS in [Roth-Korostensky, 2000, chapter 6]. It progressively constructs an MSA of a set of strings S = (s1, s2, . . . , sk) using a CO. (See Algorithm III.2 on page 115 and Example 18 on page 114.)
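Step 3) can be sketched as follows. This is our own variant, with two stated assumptions: it scans all k tour edges (including the closing edge) rather than only the k − 1 scanned in the pseudo-code, and it treats lower scores as worse, i.e. maximisation.

```python
def break_cycle_at_worst_edge(cycle_order, scores):
    """Rotate the circular order so the tour starts right after the
    adjacent pair with the lowest OPA score; the edge dropped when the
    cycle is opened is then the worst one, which is what yields the
    (k-1)/k * CSmax guarantee."""
    k = len(cycle_order)
    worst = min(range(k),
                key=lambda i: scores[cycle_order[i]][cycle_order[(i + 1) % k]])
    start = (worst + 1) % k
    return cycle_order[start:] + cycle_order[:start]

scores = [[0, 9, 9, 9],   # illustrative OPA scores, as before
          [9, 0, 10, 9],
          [9, 10, 0, 9],
          [9, 9, 9, 0]]
print(break_cycle_at_worst_edge([0, 1, 2, 3], scores))  # [1, 2, 3, 0]
```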


Algorithm III.1 tspMsa
Require: List of sequences sl containing S = (s1, s2, . . . , sk)
Ensure: An MSA A = [[a1, a2, . . . , ak]]. In case of maximisation and non-negative induced alignment scores, CS(A) ≥ ((k − 1)/k) · CSmax(S)

    procedure TSPMSA(sl)
        distances ← matrix of size |sl| × |sl|                 ▷ Step 1)
 3:     for i = 1 to |sl| − 1 do
            for j = i + 1 to |sl| do
                distances[i][j] ← OPA(si, sj)
 6:             distances[j][i] ← OPA(si, sj)
            end for
        end for
 9:     cycleOrder ← FINDCYCLEORDER(distances)                 ▷ Step 2)
        worstValueIndex ← 1                                    ▷ Step 3)
        worstValue ← distances[cycleOrder[1]][cycleOrder[2]]
12:     for i = 2 to |cycleOrder| − 1 do
            currentValue ← distances[cycleOrder[i]][cycleOrder[i + 1]]
            if currentValue worse than worstValue then
15:             worstValueIndex ← i
                worstValue ← currentValue
            end if
18:     end for
        for i = 1 to worstValueIndex do
            APPEND(cycleOrder, cycleOrder[1])
21:         REMOVE(cycleOrder, 1)
        end for
        msa ← LIST(sl[cycleOrder[1]])                          ▷ Step 4)
24:     for i = 2 to |cycleOrder| do
            msa ← EXTENDALIGNMENT(msa, sl[cycleOrder[i]])
        end for
27:     return msa
    end procedure

Termination: The algorithm only iterates over finite non-increasing data structures.

Correctness: Assume maximisation, S(ai, aj) ≥ 0 ∀ i, j ∈ 1 . . . k, and that the procedure EXTENDALIGNMENT() returns an MSA having MPS score. The rotation of the cycle order in Step 3) ensures that the score OPA(sk, s1) is minimal among all the OPA scores on the tour. This means that


the maximal value of ½ · OPA(sk, s1) is (1/k) · CSmax(S), giving

    CS(A) ≥ MPS(S)
          = CSmax(S) − ½ · OPA(sk, s1)
          ≥ CSmax(S) − (1/k) · CSmax(S)
          = ((k − 1)/k) · CSmax(S)

The running-time for the algorithm results from the sequential execution of the algorithm steps. For a collection of sequences S = (s1, s2, . . . , sk) with |si| = n ∀ i ∈ 1 . . . k, this gives a running-time complexity of O(k²n² + k^2.2 + k + kn²) = O(k²n²). Here the complexity of Step 2) (k^2.2) assumes use of the LKH TSP solver, and the complexity of Step 4) (kn²) assumes that the OPAs from Step 1) have to be recalculated.

Example 18
Let S = (s1, s2, . . . , sk) be a collection of sequences ordered with respect to the CO from Step 3) on page 112. Let S(aj, aj+1) denote the score of the induced alignment of the sequences sj, sj+1. Let OPA(sj, sj+1) be the score of the OPA of sj, sj+1 and let tj, tj+1 denote the strings in this OPA.

Assume that the first i sequences have been aligned in an MSA A′ = [[a1, a2, . . . , ai]] so that S(aj, aj+1) = OPA(sj, sj+1) ∀ j ∈ 1 . . . i − 1 (see Figure 10.1).

[Figure 10.1: Before executing extendMsa() (from [Roth-Korostensky, 2000])]

Now execute the following steps.

i) Construct the OPA for the sequences si, si+1, giving the sequences ti, ti+1 (see Figure 10.2 on page 116 (a)).


Algorithm III.2 extendMsa
Require: An MSA A = [[a1, a2, . . . , ai]] having MPS score, a sequence si+1 to incorporate into the MSA
Ensure: An MSA A = [[a1, a2, . . . , ai+1]] having MPS score.

    procedure EXTENDALIGNMENT(msa, sequence)
        ti, ti+1 ← CALCULATEOPA(si, si+1)
 3:     for all gaps g in ai do
            if g not present in ti then
                INSERTGAP(ti, g)
 6:             INSERTGAP(ti+1, g)
            end if
        end for
 9:     for all gaps g in ti do
            if g not present in ai then
                for all aj in msa do
12:                 INSERTGAP(aj, g)
                end for
            end if
15:     end for
        APPEND(msa, ti+1)
        return msa
18: end procedure

Termination: The algorithm only iterates over finite non-increasing data structures.

Correctness: That the resulting MSA has MPS score can be seen by inspecting the gap insertion steps of the algorithm. In lines 3 through 8 only gaps (or parts of gaps) that are not already in ti are inserted, into both ti and ti+1. This does not affect the score S(ti, ti+1), since opposing gaps do not contribute to the score; in other words, ti, ti+1 will still be an OPA of si, si+1, only with superfluous gaps (see Figure 10.2 (b)). Analogously, in lines 9 through 15 only gaps (or parts of gaps) that are not already in ai are inserted, into every aj. This does not affect the score of the MSA, which will still have MPS score (see Figure 10.2 (c)). Since all gaps (or parts of gaps) not present in ai and ti are inserted into the respective other sequence, ai and ti are now identical. Extending the MSA with ti+1 at this point will thus result in an MSA with MPS score (see Figure 10.2 (d)).


ii) Insert all gaps which are present in ai and not already present in ti into both ti and ti+1 (see Figure 10.2 (b)).

iii) Insert all gaps present in ti which are not present in ai into all sequences in the MSA (see Figure 10.2 (c)).

iv) At this point the sequences ai and ti will be identical. Now append ti+1 to the MSA as ai+1, creating the extended alignment A′′ (see Figure 10.2 (d)). □
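Steps i)–iv) amount to merging two alignments that share the sequence si. The sketch below is our own compact simplification of EXTENDALIGNMENT(), under the assumption that the last MSA row ai and the OPA row ti both spell si once gaps are removed; it walks the two rows in lockstep and inserts the missing gap columns on either side.

```python
def extend_alignment(msa, pair):
    """Merge an MSA whose last row aligns s_i with an OPA (t_i, t_j)
    of s_i and s_{i+1}; both rows must spell s_i without their gaps."""
    ai = msa[-1]
    ti, tj = pair
    new_rows = [""] * len(msa)
    new_tj = ""
    p = q = 0
    while p < len(ai) or q < len(ti):
        if p < len(ai) and ai[p] == "-" and (q >= len(ti) or ti[q] != "-"):
            # gap only in a_i: keep the MSA column, pad t_{i+1} with a gap
            for r in range(len(msa)):
                new_rows[r] += msa[r][p]
            new_tj += "-"
            p += 1
        elif q < len(ti) and ti[q] == "-" and (p >= len(ai) or ai[p] != "-"):
            # gap only in t_i: insert a gap column into the whole MSA
            for r in range(len(msa)):
                new_rows[r] += "-"
            new_tj += tj[q]
            q += 1
        else:
            # same symbol (or a shared gap): advance both alignments
            for r in range(len(msa)):
                new_rows[r] += msa[r][p]
            new_tj += tj[q]
            p += 1
            q += 1
    return new_rows + [new_tj]

print(extend_alignment(["AC-GT", "ACTGT"], ("ACT-GT", "ACTTGT")))
# ['AC--GT', 'ACT-GT', 'ACTTGT']
```

Because only gap columns are inserted, none of the existing induced pairwise scores change, which is exactly the MPS-preserving property argued above.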

[Figure 10.2: Example 18 steps i) – iv) (from [Roth-Korostensky, 2000])]


10.2.1 Result improvements

In order to improve the MSA constructed by the tspMsa algorithm, Korostensky and Gonnet mention a couple of possible post-optimisations. These post-optimisations were not implemented for this thesis, but since they are based on some interesting considerations, we will give a short sketch of the ideas.

The MSA returned by the tspMsa algorithm has the property that the CS score is optimal for all sequence pairs except s1, sk. The optimisations aim at improving the score of the induced non-optimal [[a1, ak]] alignment. Changing this alignment will, however, affect the induced alignments of other pairs. Since these alignments are all optimal, the challenge is to improve the score of [[a1, ak]] by more than the change will decrease the score of all other induced alignments.

It can be argued that inserting a gap into either of the strings a1, ak is very unlikely to improve the total CS score. The argument is that the gap-opening cost is typically large compared to the values from the scoring matrix. Thus inserting a gap g into e.g. a1 will decrease the total CS score by at least two times the gap-opening cost (see Figure 10.3). Furthermore, a gap g′ of the same length will have to be inserted into all other induced strings ai, i ≠ 1 (at another position) to ensure that all induced strings have equal length. Again this causes an additional score decrease of at least two times the gap-opening cost.

[Figure 10.3: Inserting gaps g and g′ affects the CS score at the four locations marked with dark clouds (from [Roth-Korostensky, 2000])]

Altogether this makes it very improbable that the CS score of the alignment can be improved by inserting gaps into a1 or ak. If the alignment score of [[a1, ak]] is less than four times the gap-opening cost from the OPA score, it is impossible to improve the CS score this way.


Refraining from gap insertions, a more promising score-improving strategy is to shift gap blocks around in a1, ak, either with a heuristic or with an exhaustive search algorithm. In [Roth-Korostensky, 2000] a heuristic that aligns equal-sized opposing gaps in a1, ak is implemented.

For this thesis a direct optimisation of the tspMsa algorithm was implemented. The idea is that Step 3) on page 112 of the algorithm only makes sense in order to achieve the guarantee for the final CS score. It is not certain that the index found is the optimal index at which to break the cycle. To determine this optimal index, say i, it is necessary to determine the sequences where the difference between OPA(si−1, si) and the alignment score of [[ai−1, ai]], when breaking the cycle at index i, is minimal.

10.2.2 Implementation

The implementation of tspMsa was done as a Python implementation of the algorithm steps on page 112. To determine the pairwise OPAs it uses external calls to the stretcher application from the EMBOSS suite. This is a C++ implementation of the Hirschberg algorithm. The determination of cycle orders and tours was done using external calls to LKH.

The actual implementation replaces Step 3) of the original algorithm by trying out all k index possibilities for breaking the cycle. This increased the running-time, but the effect was partly compensated for by caching the calculated OPA values from Step 1) and reusing these when progressively building the MSA.

10.3 The MSA Enhance Algorithm

The enhanceMsa algorithm can be characterised as a kind of iterative alignment algorithm, since it tries to improve a given MSA. The algorithm, introduced in [Roth-Korostensky, 2000, chapter 8], is an application of the Divide-and-Conquer schema.

The following definition will ease describing enhanceMsa:

Definition 28 (MSA Block)
Let A = [[a1, a2, . . . , ak]] be an MSA with length ℓ. A block of A is defined by two indices s, e in 1 . . . ℓ + 1 with s ≤ e and is denoted A[[s : e]]. It consists of the sequence slices ai[s : e] ∀ i ∈ 1 . . . k. The notation A[[s :]] is shorthand for A[[s : ℓ + 1]] and A[[: e]] is shorthand for A[[1 : e]]. □
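With Python strings, the 1-based, end-exclusive block indices of the definition translate to a simple slice of every row. This is a small sketch of our own; `msa_block` is a hypothetical helper name.

```python
def msa_block(msa, s, e):
    """A[[s:e]] with the definition's 1-based, end-exclusive indices:
    the block is the corresponding slice of every row a_i."""
    return [row[s - 1:e - 1] for row in msa]

msa = ["THE LAST FA-T CAT",
       "THE ---- FA-T CAT"]
print(msa_block(msa, 1, 4))  # ['THE', 'THE']
```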

Given a collection of sequences, enhanceMsa can be described as follows.

1) Calculate an MSA A, using a fast MSA construction algorithm.


2) If the length of A is less than a given minimum length return A.

3) Determine the block of A having the best score, B, and the (possibly empty) blocks on either side, L and R.

4) Realign the best scoring block to get B′ and keep the better scoring of B and B′, denoting this block Bfinal.

5) Return the MSA LfinalBfinalRfinal, where Lfinal and Rfinal are the results of repeating from Step 1) on L and R respectively.

[Figure 10.4: The MSA enhance algorithm (from [Roth-Korostensky, 2000])]

In the case of enhancing MSAs of proteins, the biological reasoning for the algorithm is as follows: the algorithm looks for the best place to divide an MSA into a good block and two further blocks by determining the block having the best CS score. This block will typically be a conserved block of the proteins. Proteins exhibit conserved as well as highly divergent regions. The conserved regions may be active sites of the protein,¹ or stable secondary structures like α-helices and β-sheets (see Figure 8.1).

¹Locations responsible for the chemical reactions of the proteins.


Since conserved regions have fewer gaps than divergent regions, most MSA algorithms produce more reliable alignments within these.

Algorithm III.3 enhanceMsa
Require: Sequence (or alignment) list sl of size k, a minimum block length mbl
Ensure: An MSA A = [[a1, a2, . . . , ak]].

    procedure ENHANCEMSA(sl, mbl)
        A ← CONSTRUCTMSA(sl)
 3:     s, e ← FINDBESTBLOCK(A)
        Aleft ← A[[1 : s]], Abest ← A[[s : e]], Aright ← A[[e :]]
        if CS(Abest) < CS(CONSTRUCTMSA(Abest)) then
 6:         Abest ← CONSTRUCTMSA(Abest)
        end if
        if LENGTH(Aleft) > mbl then
 9:         if CS(Aleft) < CS(CONSTRUCTMSA(Aleft)) then
                Aleft ← ENHANCEMSA(Aleft, mbl)
            end if
12:     end if
        if LENGTH(Aright) > mbl then
            if CS(Aright) < CS(CONSTRUCTMSA(Aright)) then
15:             Aright ← ENHANCEMSA(Aright, mbl)
            end if
        end if
18:     return JOIN(Aleft, Abest, Aright)
    end procedure

Termination: Assume FINDBESTBLOCK() returns a best block of length ℓbest ≥ 1 and that CONSTRUCTMSA() does not return alignments with columns consisting of gaps only. The length of the longest possible alignment returned from CONSTRUCTMSA() in line 2 will then be ||S||. Since ℓbest ≥ 1, the recursive calls in line 10 and line 15 will be on alignments of strictly smaller length. This ensures termination of the recursion as soon as the alignments have length mbl or below.

Correctness: Assuming CONSTRUCTMSA() returns a proper alignment, the final MSA will be a proper MSA of S, as it consists of a join of alignments of disjoint blocks of A.

The algorithm is depicted in Figure 10.4 and pseudo-code for the algorithm is shown in Algorithm III.3. The running-time for enhanceMsa is determined by the number of recursive calls and the time for each invocation of ENHANCEMSA().

The first depends on the lengths of the MSAs produced by CONSTRUCTMSA()


and the best blocks found by FINDBESTBLOCK(). If the length of the latter is constant, say ℓbest, the number of recursive calls will in the worst case be of order O(||S|| / ℓbest).

The running-time for each invocation is determined by the running-times for CONSTRUCTMSA(), FINDBESTBLOCK() and the time for calculating the CS score. Denoting the first two rt(CONSTRUCTMSA) and rt(FINDBESTBLOCK) respectively, the running-time for enhanceMsa will have a worst-case order of

    O( (||S|| / ℓbest) · (rt(CONSTRUCTMSA) + rt(FINDBESTBLOCK) + kn) ).

10.3.1 Finding the best block within an MSA

It remains to specify how to construct MSAs and how to find the best block in an MSA, that is, how to implement the CONSTRUCTMSA() and FINDBESTBLOCK() functions used in Algorithm III.3. The first is postponed (see Section Implementation on page 124); this section will elaborate on finding the best scoring block within an MSA.

In [Roth-Korostensky, 2000] this is done by deciding upon a fixed MSA block length and then using a sliding-window method to determine the best scoring block, i.e. calculating the score of the fixed block size for all possible positions along the MSA. For this thesis it was decided to use an implementation of the algorithm described in [Bentley, 2000, chapter 8]. The algorithm, called MAXSUM(), finds the maximum positive sum of a contiguous sub-array of an array containing real numbers (see Example 19). The reasons for choosing this algorithm were that

i) it will find the optimal scoring block of an MSA with no constraints on the length of this block, and

ii) it will do so with a running-time linear in the length of the MSA.

Pseudo-code for FINDBESTBLOCK() and the sub-procedure MAXSUM() is shown in Algorithm III.4 on page 123.

Example 19
Let

    Arr = [31, −41, 59, 26, −53, 58, 97, −93, −23, 84]   (indices 1 . . . 10)

be an array of integers. The MAXSUM() algorithm will return the indices 3, 8, marking the slice Arr[3 : 8] with sum 187. □
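A Python version of MAXSUM() (Kadane's linear scan, as in Bentley) can be sketched as follows; as an implementation choice, it returns 0-based half-open indices rather than the 1-based pair of the pseudo-code.

```python
def max_sum(numbers):
    """Maximum-sum contiguous slice. Returns (best_sum, start, end)
    with 0-based half-open indices, so the slice is numbers[start:end]."""
    best, best_start, best_end = 0, 0, 0
    cur, cur_start = 0, 0
    for i, x in enumerate(numbers):
        if cur + x < 0:              # a negative prefix can never help
            cur, cur_start = 0, i + 1
        else:
            cur += x
            if cur > best:
                best, best_start, best_end = cur, cur_start, i + 1
    return best, best_start, best_end

arr = [31, -41, 59, 26, -53, 58, 97, -93, -23, 84]
print(max_sum(arr))  # (187, 2, 7): arr[2:7] is the slice Arr[3:8] above
```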


In order to use MAXSUM() on an MSA, the CS score of each column is calculated and inserted into an array Arr. The average value of Arr is then subtracted from each value in Arr. This ensures that there will be both positive and negative values in Arr, so that the result of MAXSUM() will be different from Arr itself (see Example 20).

Example 20
Using the MSA and edit distance scoring scheme from Example 13, Algorithm III.4 gives the following result.

    A =
    T H E L A S T F A − T C A T
    T H E F A S T C A − T − − −
    T H E V E R Y F A S T C A T
    T H E − − − − F A − T C A T

    Arr = [4, 4, 4, 0, 2, 2, 2, 3, 4, 0, 4, 3, 3, 3]

    average value = (12 + 6 + 11 + 9) / 14 = 38/14 ≈ 2.7 ⇒

    Arr = [1.3, 1.3, 1.3, −2.7, −0.7, −0.7, −0.7, 0.3, 1.3, −2.7, 1.3, 0.3, 0.3, 0.3]

    FINDBESTBLOCK(A) = 1, 4 ⇒ A[[1 : 4]] is the maximal block □
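FINDBESTBLOCK() then reduces to centring the column scores and running the same scan. A sketch using 0-based half-open indices, applied to the Arr of Example 20:

```python
def find_best_block(column_scores):
    """Subtract the average column score, then run a MAXSUM (Kadane)
    scan; returns the best block as 0-based half-open indices (s, e)."""
    avg = sum(column_scores) / len(column_scores)
    shifted = [c - avg for c in column_scores]
    best, cur = 0.0, 0.0
    best_span, start = (0, 0), 0
    for i, x in enumerate(shifted):
        if cur + x < 0:
            cur, start = 0.0, i + 1
        else:
            cur += x
            if cur > best:
                best, best_span = cur, (start, i + 1)
    return best_span

arr = [4, 4, 4, 0, 2, 2, 2, 3, 4, 0, 4, 3, 3, 3]
print(find_best_block(arr))  # (0, 3): the conserved "THE" columns
```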


Algorithm III.4 findBestBlock
Require: List of sequences sl containing a non-empty alignment A = [[a1, a2, . . . , ak]]
Ensure: Indices s, e such that CS(A[[s : e]]) is maximal over all blocks of A.

    procedure FINDBESTBLOCK(sl)
        columnScores ← GETCOLUMNSCORES(sl)
 3:     avg ← SUM(columnScores) / |columnScores|
        for i = 1 to |columnScores| do
            columnScores[i] ← columnScores[i] − avg
 6:     end for
        return MAXSUM(columnScores)
    end procedure
 9:
    procedure MAXSUM(numbers)
        best ← LIST(0, 0, 0)              ▷ sum, start index, end index
12:     current ← LIST(0, 0, 0)
        for i = 1 to |numbers| do
            if current[1] + numbers[i] < 0 then
15:             current ← LIST(0, 0, 0)   ▷ reset if sum negative
            else
                current[1] ← current[1] + numbers[i]
18:             if current was reset then ▷ e.g. is current[3] = 0?
                    current[2] ← i
                    current[3] ← i
21:             end if
                current[3] ← current[3] + 1  ▷ increase end index
            end if
24:         if best[1] < current[1] then
                best ← current[:]         ▷ a copy, not a reference
            end if
27:     end for
        return (best[2], best[3])
    end procedure


Termination: The algorithm only iterates over finite non-increasing data structures.

Correctness: The algorithm uses the observation that, summed from its start index, the best slice cannot have negative part-sums. If it had, it would contain a sub-slice having a better score. This observation is captured in the use of the current variable, since it contains and sums from the possible start indices of a best slice in numbers. The best variable keeps track of the best slice seen so far.

For every round of the for-loop in lines 13 to 27 the following invariant holds: The values of best and current are correct.

Before the loop the invariant obviously holds.

Assume the invariant holds for round i − 1. For round i we have two possibilities for current:

• its sum plus the value of numbers[i] is negative, in which case current should be reset. This happens in line 15.

• its sum plus the value of numbers[i] is positive or zero, in which case the sum should be increased by the value of numbers[i] and the end index should be increased by one. This happens in line 17 and line 22 respectively. Should numbers[i] be the first positive value that current “meets” after being reset, the start index and end index have to be set. This special case is treated in lines 19 and 20.

For round i we get for best:

• if current has obtained a larger sum, best should be updated. This happens in line 25.

All of the above gives that the invariant will hold after round i.
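The two procedures above can be transcribed into Python (the language used for the thesis implementations). The sketch below is illustrative, not the thesis code; it keeps the 1-based indices and the exclusive end index from the pseudocode:

```python
def max_sum(numbers):
    """MAXSUM from Algorithm III.4: Kadane-style scan returning
    (start, end) of a maximal-sum block, 1-based with exclusive end."""
    best = [0, 0, 0]      # sum, start index, end index
    current = [0, 0, 0]
    for i, x in enumerate(numbers, start=1):
        if current[0] + x < 0:
            current = [0, 0, 0]          # reset if the sum would turn negative
        else:
            current[0] += x
            if current[2] == 0:          # current was just reset
                current[1] = current[2] = i
            current[2] += 1              # increase the (exclusive) end index
        if best[0] < current[0]:
            best = current[:]            # a copy, not a reference
    return best[1], best[2]


def find_best_block(column_scores):
    """FINDBESTBLOCK: centre the column scores on their average,
    then locate the maximal block."""
    avg = sum(column_scores) / len(column_scores)
    return max_sum([s - avg for s in column_scores])


# The column scores Arr from Example 20:
print(find_best_block([4, 4, 4, 0, 2, 2, 2, 3, 4, 0, 4, 3, 3, 3]))  # → (1, 4)
```

On the Example 20 data this returns (1, 4), i.e. A[[1 : 4]] as the maximal block.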

10.3.2 Implementation

The enhanceMsa algorithm was implemented using Python and external calls to the MSA construction applications Muscle and Mafft in order to implement the CONSTRUCTMSA() procedure.
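A minimal sketch of how such an external call can be wired up. The helper names are hypothetical (this is not the thesis code), and the command lines assume a Muscle 3 style -in/-out interface and a Mafft that writes the alignment to stdout:

```python
import os
import subprocess
import tempfile

def write_fasta(sequences, path):
    """Write sequences to a FASTA file with generated names seq0, seq1, ..."""
    with open(path, "w") as f:
        for i, seq in enumerate(sequences):
            f.write(f">seq{i}\n{seq}\n")

def read_fasta(path):
    """Read the (aligned) sequences back from a FASTA file."""
    seqs, cur = [], []
    with open(path) as f:
        for line in f:
            if line.startswith(">"):
                if cur:
                    seqs.append("".join(cur))
                cur = []
            else:
                cur.append(line.strip())
    if cur:
        seqs.append("".join(cur))
    return seqs

def construct_msa(sequences, tool="muscle"):
    """CONSTRUCTMSA() via an external aligner: write a FASTA file,
    run the tool, and parse the aligned FASTA it produces."""
    with tempfile.TemporaryDirectory() as tmp:
        infile = os.path.join(tmp, "in.fasta")
        outfile = os.path.join(tmp, "out.fasta")
        write_fasta(sequences, infile)
        if tool == "muscle":
            subprocess.run(["muscle", "-in", infile, "-out", outfile], check=True)
        else:  # assume a mafft-like tool printing the alignment on stdout
            with open(outfile, "w") as out:
                subprocess.run(["mafft", infile], check=True, stdout=out)
        return read_fasta(outfile)
```

Running the aligner requires the corresponding binary on the PATH; the FASTA helpers stand on their own.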

10.4 Notes

This chapter is mainly based on [Korostensky and Gonnet, 1999], [Roth-Korostensky, 2000, chapters 6, 8] and [Bentley, 2000, chapter 8].

124


There is something fascinating about science. One gets such wholesale returns of conjecture out of such a trifling investment of fact.

Mark Twain

CHAPTER 11

Experiments

This chapter contains descriptions of the experiments performed on the implemented MSA algorithms, tspMsa and enhanceMsa, together with their results.

In order to evaluate the alignments constructed by tspMsa, the alignments were compared with MSAs constructed by two other applications. All alignments were scored using both the CS score and the traditional SOP score. Further assessment was done by using the supplied scoring methods from a protein reference database.

In the experiments with enhanceMsa the scores of the MSA before and after refinement were compared. The alignments were scored using the CS score and the supplied scoring methods from the protein reference database.

11.1 Reference MSA Applications

As mentioned in [Blackshields et al., 2006] the amount of published MSA construction methods and/or packages is quite overwhelming. According to Blackshields et al. more than 50 applications were described from 1996–2006, and in 2005 alone there were at least 20 new publications.

The criteria for choosing reference MSA applications can be phrased as trying to select “the fastest of the best”. A small running-time would be important for the enhanceMsa algorithm, and high quality alignment methods would be preferable as benchmarks for tspMsa. The applications chosen were Muscle and Mafft, as they both fulfil the above criteria [Blackshields et al., 2006; Notredame, 2007; Wallace, Blackshields, and Higgins, 2005].

125


Muscle is a progressive alignment program that constructs a very crude guide tree used for constructing the first MSA. The guide tree is then refined and a second MSA is constructed using the improved guide tree. For increased speed only sequences affected by the subtree improvements are realigned.

Mafft is both a progressive and an iterative MSA program. It uses a fast Fourier transform¹ to generate a guide tree for its progressive MSA. As with Muscle the guide tree is then refined, both by optimising a weighted SOP score and by using different pairwise alignment similarity information.

11.2 Benchmark Database

In order to further estimate the biological quality of the alignments, a protein benchmark database was used.

As the first MSA construction programs were developed, it was difficult to benchmark them. First, lack of processing power made the performance of the programs on large sequences inferior. Second, there was not much actual information concerning correct sequence families. The first generation of programs was thus mainly tested using a small number of “golden standards”. As these standards were both limited in number and publicly available, this had the unfortunate effect that new programs could easily be tuned against the standards, achieving artificially high scores.

In order to rectify this effect, a number of benchmark databases consisting of large and continually growing data-sets have been published. For this thesis BAliBASE (Benchmark Alignments dataBASE)² has been chosen as the reference database for the MSA experiments.

BAliBASE, which was one of the first benchmark databases, consists of nine protein reference data-sets chosen to represent real MSA problems faced by biologists. The alignments are based upon protein 3D structure super-positioning and are “manually” refined in order to ensure higher alignment quality than a purely automated method. This property can be seen as one of the main quality characteristics of BAliBASE, but it can also be considered a source of subjectivity due to the “expert” refinement [Blackshields et al., 2006]. If nothing else, it does make BAliBASE slower to maintain and expand compared to more automated benchmark databases.³

In this thesis only the reference sets 1–5 and 9 were used (see Table 11.1), the mundane reason being that the reference sets 6–8 were lacking

¹ http://en.wikipedia.org/wiki/Fast_Fourier_transform
² http://www-bio3d-igbmc.u-strasbg.fr/balibase/ version 3.0
³ like PREFAB, HOMSTRAD, SABmark and OXBench

126


Table 11.1: The used BAliBASE reference sets

BAliBASE Reference Categories
RV1    Equidistant sequences, divided by length and variability
  RV11   – very divergent sequences (< 20 % residue identity)
  RV12   – medium to divergent sequences (20–40 % residue identity)
RV20   Families with one or more highly divergent sequences
RV30   Divergent subfamilies
RV40   Sequences with large N/C terminal extensions
RV50   Sequences with large internal insertions
RV9    Short/Linear Motifs, experimentally verified
  RV911  – very divergent sequences (< 20 % residue identity)
  RV912  – medium to divergent sequences (20–40 % residue identity)
  RV913  – similar to medium divergent sequences (40–80 % residue identity)
  RV921  – true positive motifs only
  RV922  – true positive motifs and errors
  RV931  – true positive motifs only
  RV932  – true positive motifs and false positive motifs
  RV941  – true positive motifs only
  RV942  – true positive motifs and true negative motifs

files containing the sequences in fasta format. The used reference sets contain 610 alignments with altogether 17 427 sequences.

11.2.1 BAliBASE scoring

In order to assess the performance of MSA programs, BAliBASE includes scripts to calculate two scores. Let A = [[a_1, a_2, \ldots, a_k]] with |a_i| = \ell \; \forall i \in 1 \ldots k be a test alignment and A^r = [[a_1, a_2, \ldots, a_{k_r}]] with length \ell_r be the corresponding reference alignment.

Sum-of-pairs This score, which is a variant of the SOP, increases with the number of sequences correctly aligned with respect to the reference alignment. It is supposed to indicate the extent to which some, if not all, sequences in an alignment are correctly aligned. The score S_c for the c'th column is defined as

    S_c = \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \neq i}}^{k} p_{cij},   where p_{cij} = 1 if A[i][c] and A[j][c] are aligned in A^r, and 0 otherwise.

127


Denoting the sum-of-pairs score for A by SP we have

    SP = \frac{\sum_{i=1}^{\ell} S_i}{\sum_{j=1}^{\ell_r} S_{rj}}   with   S_{rj} = \frac{k_r (k_r - 1)}{2}

Column score This is a column-wise binary score indicating whether all sequences are aligned correctly with respect to the reference alignment. Letting TC denote the column score for A, we have

    TC = \frac{\sum_{i=1}^{\ell} C_i}{\ell},   where C_i = 1 if the i'th column in A and A^r are identical, and 0 otherwise.

11.3 Experiments Set-up

For both the MSA construction and MSA enhancing experiments the following set-up was used:

• Muscle and Mafft running with default parameters,

• the BLOSUM62 scoring matrix,

• a gap-opening cost of −12 and a gap-extension cost of −2, and

• all gaps scored equally (end/internal gaps)

11.4 MSA Construction Tests

For each of the BAliBASE reference sets the tspMsa alignment was scored by each of the scoring schemes: CS Score, SOP Score, BAliBASE Sum-of-Pairs Score and BAliBASE Column Score. These scores were then compared to the same scores for the Muscle and Mafft alignments.

In order to illustrate the results with block diagrams, the number of times an algorithm achieved the best score was counted for each alignment file in each BAliBASE reference set directory. If more than one algorithm achieved the best score, every one of them was counted as “being best”. The results are shown in Figures 11.1 to 11.8.

11.4.1 CS score results

The results when scoring the alignments by CS score (Figures 11.1 and 11.2) unsurprisingly favour the tspMsa algorithm. It achieves the highest number of best scores in every reference set. The mutual directory score between Muscle and Mafft is 2–8 with 5 draws.

128


[Figure 11.1: MSA construction on RV11–RV50. Bar chart of the number of best scores per BAliBASE directory for tspMsa, Muscle and Mafft, with alignments scored using the Cycle Sum Score.]

[Figure 11.2: MSA construction on RV911–RV942. Bar chart of the number of best scores per BAliBASE directory for tspMsa, Muscle and Mafft, with alignments scored using the Cycle Sum Score.]

129


[Figure 11.3: MSA construction on RV11–RV50. Bar chart of the number of best scores per BAliBASE directory for tspMsa, Muscle and Mafft, with alignments scored using the Sum-of-Pairs Score.]

[Figure 11.4: MSA construction on RV911–RV942. Bar chart of the number of best scores per BAliBASE directory for tspMsa, Muscle and Mafft, with alignments scored using the Sum-of-Pairs Score.]

130


11.4.2 SOP score results

The results when scoring the alignments by SOP score (Figures 11.3 and 11.4) show a clear superiority of Mafft, which achieves the highest number of best scores in every reference set. The mutual directory score between tspMsa and Muscle is 1–13 with 1 draw. This last result is also less surprising, as Muscle optimises using the SOP score.

[Figure 11.5: MSA construction on RV11–RV50. Bar chart of the number of best scores per BAliBASE directory for tspMsa, Muscle and Mafft, with alignments scored using the BAliBASE Sum-of-Pairs Score.]

11.4.3 BAliBASE SOP score results

Scoring the alignments by the BAliBASE SOP score (Figures 11.5 and 11.6) shows some interesting results. First, Mafft is clearly superior, being the best algorithm in 14 out of 15 BAliBASE reference set directories. Second, the mutual directory score between tspMsa and Muscle is 5–7 with 3 draws, and third, the tspMsa algorithm achieves the best result for the BAliBASE reference set RV11, which consists of very divergent sequences with less than 20 % residue identity.⁴

⁴ also known as the “twilight zone”

131


[Figure 11.6: MSA construction on RV911–RV942. Bar chart of the number of best scores per BAliBASE directory for tspMsa, Muscle and Mafft, with alignments scored using the BAliBASE Sum-of-Pairs Score.]

[Figure 11.7: MSA construction on RV11–RV50. Bar chart of the number of best scores per BAliBASE directory for tspMsa, Muscle and Mafft, with alignments scored using the BAliBASE Column Score.]

132


[Figure 11.8: MSA construction on RV911–RV942. Bar chart of the number of best scores per BAliBASE directory for tspMsa, Muscle and Mafft, with alignments scored using the BAliBASE Column Score.]

11.4.4 BAliBASE column score results

Finally, scoring by the BAliBASE column score (Figures 11.7 and 11.8) again shows the superiority of Mafft, “winning” in every reference set directory. Here the mutual directory score between tspMsa and Muscle is 6–4 with 5 draws. Interestingly enough, using the BAliBASE column score Mafft scores better than tspMsa in reference set RV11.

11.4.5 MSA construction conclusions

The following conclusions can be drawn from the above results:

1) The Mafft algorithm is clearly superior to both the tspMsa and Muscle algorithms.

2) The tspMsa and Muscle algorithms each clearly score better than the other when the alignments are scored with that algorithm's own optimising score.

3) Using the BAliBASE SOP score Muscle achieves better results than tspMsa. A Wilcoxon rank-sum test⁵ gives a p-value of 0.085 for the null hypothesis that the results are comparable, and a z-statistic of −1.721 in favour of Muscle.

⁵ calculated using the stats package from scipy

133


4) Using the BAliBASE column score tspMsa and Muscle achieve comparable results with a slight advantage for Muscle. A Wilcoxon rank-sum test gives a p-value of 0.885 for the null hypothesis that the results are comparable, and a z-statistic of −0.145 in favour of Muscle.

5) Using the BAliBASE scores the tspMsa algorithm scores relatively well on the reference sets from the twilight zone (reference sets RV11 and RV911).
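The kind of z-statistic and p-value quoted above can be reproduced in outline. The thesis used scipy's stats package; the sketch below is a plain rank-sum z-statistic with a normal approximation and no tie correction, so it only matches scipy.stats.ranksums on untied data:

```python
import math

def rank_sum_z(x, y):
    """Wilcoxon rank-sum z-statistic (normal approximation, no ties):
    sum the pooled ranks of x and standardise against the null mean/sd."""
    n1, n2 = len(x), len(y)
    pooled = sorted(list(x) + list(y))
    w = sum(pooled.index(v) + 1 for v in x)        # 1-based ranks of x
    expected = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (w - expected) / sd

def two_sided_p(z):
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

z = rank_sum_z([1, 2, 3], [4, 5, 6])
print(round(z, 3), round(two_sided_p(z), 3))
```

A negative z-statistic means the first sample ranks lower than the second, matching the sign convention used in the conclusions above.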

11.5 MSA Enhancing Tests

[Figure 11.9: CS score improvement on RV911–RV942. Average percentage improvement in Cycle Sum Score per BAliBASE directory for enhanceMsa using Muscle, comparing the variable (“best”) block size with constant block sizes 8, 14, 20 and 26.]

The enhanceMsa algorithm was tested using both Muscle and Mafft as the fast external alignment construction application.

To compare the effect of different block sizes, the algorithm was run using the FINDBESTBLOCK() function (see Section 10.3.1 on page 121) and also with constant block sizes of 8, 14, 20 and 26.

For each of the BAliBASE reference sets the enhanceMsa algorithm was scored by each of the scoring schemes: CS Score, BAliBASE Sum-of-Pairs Score and BAliBASE Column Score. The original first alignment

134


[Figure 11.10: BAliBASE SOP score improvement on RV911–RV942. Average percentage improvement in BAliBASE Sum-of-Pairs Score per BAliBASE directory for enhanceMsa using Muscle, comparing the variable (“best”) block size with constant block sizes 8, 14, 20 and 26.]

was scored at the start of the algorithm, and the final alignment was scored when the algorithm finished.

For each file in a reference set directory the improvement in scores between the first and the final alignment was calculated as a percentage of the original score. These differences were then summed (by score type) and averaged by dividing by the number of files in the reference set directory.
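As a small numeric illustration (with made-up scores, not thesis data), the per-directory averaging amounts to:

```python
def average_improvement(score_pairs):
    """score_pairs: (first_score, final_score) per file in one directory.
    Returns the mean percentage improvement over the first alignment."""
    pct = [(final - first) / first * 100.0 for first, final in score_pairs]
    return sum(pct) / len(pct)

# one file improves by 10 %, another worsens by 10 %: the directory averages to 0 %
print(average_improvement([(100.0, 110.0), (50.0, 45.0)]))  # → 0.0
```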

The results of the test runs, when using Muscle as the external MSA construction program, for the reference sets RV911–RV942 and all three score types are shown in Figure 11.9 – Figure 11.11. The results when using Mafft, and for the data sets RV11–RV50, were similar in the sad sense of being just as unclear.

Using a Wilcoxon rank-sum test to compare the variable block algorithm with the constant block algorithm for each block size gives the values in Table 11.2.

Neither the figures nor the table conveys any clear message.

135


[Figure 11.11: BAliBASE column score improvement on RV911–RV942. Average percentage improvement in BAliBASE Column Score per BAliBASE directory for enhanceMsa using Muscle, comparing the variable (“best”) block size with constant block sizes 8, 14, 20 and 26.]

11.5.1 MSA enhancing conclusions

Although the test results are rather unclear, a couple of conclusions can be drawn:

• Comparing the variable block length results to the constant block length results, the variable block results are not better than any of the constant block results. On the positive side, they are not clearly worse either.

• The enhanceMsa algorithm uses the CS score to determine whether it should keep on refining the MSA. Using this scoring does in most cases also result in improvements in the BAliBASE scores.

11.6 Notes

Section 11.2 is based on [Blackshields et al., 2006] and [Perrodou, Chica, Poch, Gibson, and Thompson, 2008; Thompson, Koehl, Ripp, and Poch, 2005; Thompson, Plewniak, and Poch, 1999a,b].

Muscle is described in detail in [Edgar, 2004] and Mafft in [Katoh, Kuma, Toh, and Miyata, 2005; Katoh, Misawa, Kuma, and Miyata, 2002].

136


Table 11.2: comparing variable block size to constant block sizes

                          CS score           SOP score          column score

constant block size 8
RV11–50                -0.1601, 0.8728    -0.1601, 0.8728    -0.1601, 0.8728
RV911–942              -0.2208, 0.8253     0.6181, 0.5365     0.3091, 0.7573

constant block size 14
RV11–50                 0.1601, 0.8728    -0.1601, 0.8728    -0.3203, 0.7488
RV911–942              -0.3974, 0.6911    -0.4857, 0.6272    -0.6623, 0.5078

constant block size 20
RV11–50                -0.1601, 0.8728    -0.4804, 0.6310    -0.6405, 0.5218
RV911–942              -0.3091, 0.7573    -0.0442, 0.9648    -0.2208, 0.8253

constant block size 26
RV11–50                -0.1601, 0.8728     0.0000, 1.0000    -0.3203, 0.7488
RV911–942              -0.1325, 0.8946     0.4415, 0.6588     0.2208, 0.8253

Each table entry consists of the z-statistic and the two-sided p-value for the null hypothesis that the two sets of measurements are drawn from the same distribution. A negative z-statistic indicates worse variable block values, a positive z-statistic indicates better variable block values.

137


Part IV

The end of the tour

DNA tour with 16384 cities


The Road goes ever on and on
Out from the door where it began.
Now far ahead the Road has gone,
Let others follow it who can!
Let them a journey new begin,
But I at last with weary feet
Will turn towards the lighted inn,
My evening-rest and sleep to meet.

Bilbo Baggins (J.R.R. Tolkien)

CHAPTER 12

Conclusion

To recapitulate, the goals set for this thesis were:

1) an examination of the feasibility of applying efficient TSP solver implementations to solve NP complete problems within bioinformatic applications,

2) a self-contained survey of approximation algorithms for the SSP, including descriptions of implementations and testing of solution quality, and

3) a description and an experimental testing of the improvements on the TSP based algorithms for MSA construction and MSA refinement.

Based on the work done and the results obtained in this thesis, the following conclusions can be drawn:

Re 1) It can definitely be concluded that it is a promising idea to apply efficient TSP solver implementations to solve NP complete problems within bioinformatic applications. The results obtained by the TSP based SSP algorithm were of a quality that could hardly be more convincing. Compared to all other algorithms it achieved the best solutions in every test (see Table 7.2 on page 82). This affirms the findings in [Laporte, 2010] and [Applegate et al., 2007, Chapter 15.5, 15.6] that LKH is capable of solving TSP instances of up to at least 1000 cities to optimality in a very short time.

In the conducted experiments the TSP solver based algorithm used less than two minutes for solving a 1000 string SSP problem (see Figure 7.7 on page 84). This was achieved with an implementation that in no way has

141


been optimised with regard to running-time. In other words: the time cost of getting results of this quality seems very well spent.

Re 2) This thesis has in Chapter 6 presented a survey of approximation algorithms for the SSP. Furthermore, Chapter 7 has presented test results of their solution quality. One important insight resulting from the latter chapter is that the Greedy algorithm is the best choice if optimal solution quality is of less importance. This very simple algorithm obtains the best solutions among the fast algorithms (see Table 7.2 and Figure 7.9 on page 86).

Another relevant point is that approximation factors only reflect the performance on the most pathological instances. In none of the 355 280 test cases used for this thesis did any of the algorithms produce a solution that was more than 2 % above the lower bound.

Re 3) The results of the implemented MSA construction algorithm could not compete with the results of Muscle and especially Mafft when the BAliBASE reference sets and scores were used. This might not come as a big surprise; these programs have been elaborated on for quite some time, and surely more than was possible to put into the algorithm implementation within the scope of this thesis. Nevertheless, the algorithm did not produce evidently bad results, and it seems to produce quite good results on sequences from the “twilight zone” (see Figures 11.5 – 11.8 on page 133).

On the other hand, the results obtained by the implemented MSA refinement algorithm can hardly be regarded as positive. The test results were very unclear, and the most interesting conclusion was that an optimisation based on the CS score also resulted in improvements in the BAliBASE scores. This does, however, give reason for optimism concerning the applicability of the CS score in bioinformatics.

12.1 Future Work

The results from this thesis point to the following interesting possibilities for future work:

Attacking other NP hard problems: The very positive results obtained by using an efficient TSP solver to solve another NP hard problem open up a large number of future applications. Actually, every problem involving any of the other known NP hard problems is a possible candidate.

Examination of integral solution majority: It came as an unexpected result that the solutions to the LP used by the Best Factor approximation algorithm almost solely consisted of integral solutions.

142


There might be a simple reason for this, but there could also be some interesting insights behind this observation.

Further assessment of the CS score: Optimising the CS score led to improvements in the BAliBASE scores. To evaluate its biological applicability, a study examining the effect of optimising the CS score using other MSA benchmark databases would be interesting. But first of all, it would be relevant to include evaluations from biologists concerning the MSAs constructed by optimising the CS score.

A competitive version of the tspMsa algorithm: The tspMsa algorithm did not excel in comparison to Mafft and Muscle, but its results are not unpromising either. For example, there is still room for post-optimisations using gap block shifts.

Both the CS score and the tspMsa algorithm originate in work done at the Swiss Federal Institute of Technology around 2000, and when queried about the reasons for the lack of publications related to these ideas in the past ten years, Professor Gaston Gonnet kindly commented:¹

I think the circular tour idea will always have a fundamental role in evaluating MSAs. Regarding MSA algorithms, there are several new remarkable algorithms which are probably very difficult to beat, in particular MAFFT and Prank. This makes the task of beating them more difficult and more challenging.

Several years ago we improved our MSA algorithm based on CT, but we did not put enough effort into making it competitive with current algorithms.

I personally think it is an excellent topic to pursue.

My pen is at the bottom of a page,
Which, being finished, here the story ends;
’Tis to be wished it had been sooner done,
But stories somehow lengthen when begun.

Byron

¹ personal communication

143


List of Algorithms

II.1 Cycle . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
II.2 MakeMaximalAssignment . . . . . . . . . . . . . . . . . . . 52
II.3 RecurseCycle . . . . . . . . . . . . . . . . . . . . . . . . 54
II.4 Greedy . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
II.5 leastCommonDenominator . . . . . . . . . . . . . . . . . . . 67
III.1 tspMsa . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
III.2 extendMsa . . . . . . . . . . . . . . . . . . . . . . . . . 115
III.3 enhanceMsa . . . . . . . . . . . . . . . . . . . . . . . . . 120
III.4 findBestBlock . . . . . . . . . . . . . . . . . . . . . . . 123

145


List of Figures

1.1 paths and tours in a graph . . . . . . . . . . . . . . . . . 6
1.2 complete graph and cut set . . . . . . . . . . . . . . . . . 6
1.3 Different tree types . . . . . . . . . . . . . . . . . . . . 7
2.1 The Commis-Voyageur tour for 47 German cities (from [Applegate et al., 2007]) . . . 12
2.2 The 33 city contest from 1962 . . . . . . . . . . . . . . . . 15
2.3 Different k-opt moves . . . . . . . . . . . . . . . . . . . . 16
2.4 The optimal TSP tour in Sweden (from http://www.tsp.gatech.edu/sweden/tours/swtour_small.htm) . . . 18
3.1 A LKH tour of 11 455 danish cities . . . . . . . . . . . . . 23
3.2 Results on dk11455.tsp. Note the addition of 20 990 000 on the y-axis . . . 25
4.1 sequence assembly problem with repeats . . . . . . . . . . . 31
5.1 Graphs for the strings: ’ate’, ’half’, ’lethal’, ’alpha’ and ’alfalfa’ (from [Blum et al., 1994]) . . . 36
6.1 Running-time measures of all-pairs overlap calculations . . . 49
6.2 Illustration of Lemma 9: wuv + wu′v′ ≥ wu′v + wuv′ . . . . . 51
6.3 Shortest Hamiltonian path is not necessarily included in the optimal TSP tour . . . 63
6.4 Transforming an ATSP instance to a TSP instance . . . . . . . 64
7.1 SSP solution dependency on variable coverage . . . . . . . . 78
7.2 SSP solution dependency on length . . . . . . . . . . . . . . 79
7.3 solution dependency on length, compare with Figure 7.10a . . 80
7.4 solution dependency on count, compare with Figure 7.10b . . . 81
7.5 solution dependency on count . . . . . . . . . . . . . . . . 82
7.6 Figure 7.5 correlated using TSP algorithm values as 100 % . . 83
7.7 SSP algorithms running-time using normal scale . . . . . . . 84
7.8 Best Factor and TSP algorithm running-times . . . . . . . . . 85
7.9 Cycle, RecurseCycle and Greedy algorithm running-times . . . 86
7.10 SSP algorithm alignment similarity results . . . . . . . . . 88
8.1 Secondary structure of Avian Sarcoma Virus. A retrovirus with many similarities to HIV (from http://mcl1.ncifcrf.gov/integrase/asv_intro.html) . . . 96
9.1 Evolutionary tree with five sequences (species) . . . . . . . 103
9.2 SOP score “ticks” added for each edge (from [Gonnet et al., 2000]) . . . 104
9.3 Circular tours . . . . . . . . . . . . . . . . . . . . . . . 105
9.4 Base case . . . . . . . . . . . . . . . . . . . . . . . . . . 105
9.5 Induction case . . . . . . . . . . . . . . . . . . . . . . . 106
10.1 Before executing extendMsa() (from [Roth-Korostensky, 2000]) . . . 114
10.2 Example 18 steps i) – iv) (from [Roth-Korostensky, 2000]) . . . 116
10.3 inserting gaps g and g′ affects the CS score at the four locations marked with dark clouds (from [Roth-Korostensky, 2000]) . . . 117
10.4 The MSA enhance algorithm (from [Roth-Korostensky, 2000]) . . . 119
11.1 MSA construction on RV11–RV50 . . . . . . . . . . . . . . . 129
11.2 MSA construction on RV911–RV942 . . . . . . . . . . . . . . 129
11.3 MSA construction on RV11–RV50 . . . . . . . . . . . . . . . 130
11.4 MSA construction on RV911–RV942 . . . . . . . . . . . . . . 130
11.5 MSA construction on RV11–RV50 . . . . . . . . . . . . . . . 131
11.6 MSA construction on RV911–RV942 . . . . . . . . . . . . . . 132
11.7 MSA construction on RV11–RV50 . . . . . . . . . . . . . . . 132
11.8 MSA construction on RV911–RV942 . . . . . . . . . . . . . . 133
11.9 CS score improvement on RV911–RV942 . . . . . . . . . . . . 134
11.10 BAliBASE SOP score improvement on RV911–RV942 . . . . . . . 135
11.11 BAliBASE column score improvement on RV911–RV942 . . . . . 136

147


List of Tables

2.1 Some TSP history milestones (mainly from [Okano, 2009]) . . . 14
2.2 Milestones for solving TSP (from [Applegate et al., 2007]) . . 19
3.1 LKH and Concorde results for dk11455.tsp . . . . . . . . . . . 24
6.1 All-pairs suffix-prefix running-times . . . . . . . . . . . . 49
6.2 SSP algorithm characteristics . . . . . . . . . . . . . . . . 74
7.1 SSP experiments overview . . . . . . . . . . . . . . . . . . . 77
7.2 minimum scores for SSP algorithms in 355 280 tests . . . . . . 82
7.3 Running time linear regression values . . . . . . . . . . . . 87
7.4 superstring lengths compared to original string length . . . . 89
11.1 The used BAliBASE reference sets . . . . . . . . . . . . . . 127
11.2 comparing variable block size to constant block sizes . . . . 137
C.1 SSP-experiments nucleotid sequences . . . . . . . . . . . . . 157
C.1 SSP-experiments nucleotid sequences . . . . . . . . . . . . . 158
C.1 SSP-experiments nucleotid sequences . . . . . . . . . . . . . 159
D.1 coverage 6 test data . . . . . . . . . . . . . . . . . . . . . 163
D.2 coverage 8 test data . . . . . . . . . . . . . . . . . . . . . 163
D.2 coverage 8 test data (continued) . . . . . . . . . . . . . . . 164
D.3 coverage 10 test data . . . . . . . . . . . . . . . . . . . . 164
D.3 coverage 10 test data (continued) . . . . . . . . . . . . . . 165
D.4 coverage 20 test data . . . . . . . . . . . . . . . . . . . . 165
D.5 coverage 30 test data . . . . . . . . . . . . . . . . . . . . 165
D.6 coverage 40 test data . . . . . . . . . . . . . . . . . . . . 165


Part V

Appendices

Evolutionary tree tour with 8178 cities


APPENDIX A

Acronyms

AP Assignment Problem

ATSP Asymmetric Traveling Salesman Problem

BLOSUM Block Substitution Matrix

CPU Central Processing Unit

CC Cycle Cover

CO Circular Order

CS Circular Sum

CT Circular Tour

DNA Deoxyribonucleic Acid

GB Gigabyte

GCD Greatest Common Divisor

HCP Hamiltonian Cycle Problem

HMM Hidden Markov Model

HPP Hamiltonian Path Problem

KMP Knuth-Morris-Pratt

LKH Lin-Kernighan-Helsgaun

LP Linear Program


MB Megabyte

MHz Megahertz

MMP Minimum Matching Problem

MPS Maximal Path Sum

MSA Multiple Sequence Alignment

MST Minimum Spanning Tree

MTSP Metric Traveling Salesman Problem

NP Nondeterministic Polynomial-time

OPA Optimal Pairwise Alignment

PAM Point Accepted Mutation

PTAS Polynomial Time Approximation Scheme

RAM Random-access Memory

RNA Ribonucleic Acid

SCSP Shortest Common Superstring Problem

SOP Sum-of-Pairs

SSP Shortest Superstring Problem

TSP Traveling Salesman Problem

VLSI Very-large-scale Integration

WLOG Without loss of generality


APPENDIX B

LKH Output

The following is a sample output from a run of LKH on the TSPLIB instance pr2392, using the shell command:

ingolf% ./LKH pr2392.par

File pr2392.par:

PROBLEM_FILE = pr2392.tsp
OPTIMUM = 378032
MOVE_TYPE = 5
PATCHING_C = 3
PATCHING_A = 2
RUNS = 10

Output:

PARAMETER_FILE = pr2392.par
Reading PROBLEM_FILE: "pr2392.tsp" ... done
ASCENT_CANDIDATES = 50
BACKBONE_TRIALS = 0
BACKTRACKING = NO
# CANDIDATE_FILE =
CANDIDATE_SET_TYPE = ALPHA
EXCESS = 0.00041806
EXTRA_CANDIDATES = 0
EXTRA_CANDIDATE_SET_TYPE = QUADRANT
GAIN23 = YES
GAIN_CRITERION = YES
INITIAL_PERIOD = 1196
INITIAL_STEP_SIZE = 1
INITIAL_TOUR_ALGORITHM = WALK
# INITIAL_TOUR_FILE =
INITIAL_TOUR_FRACTION = 1.000
# INPUT_TOUR_FILE =


KICK_TYPE = 0
KICKS = 1
# MAX_BREADTH =
MAX_CANDIDATES = 5
MAX_SWAPS = 2392
MAX_TRIALS = 2392
# MERGE_TOUR_FILE =
MOVE_TYPE = 5
NONSEQUENTIAL_MOVE_TYPE = 9
OPTIMUM = 378032
# OUTPUT_TOUR_FILE =
PATCHING_A = 2
PATCHING_C = 3
# PI_FILE =
# POPULATION_SIZE = 0
PRECISION = 100
PROBLEM_FILE = pr2392.tsp
RESTRICTED_SEARCH = YES
RUNS = 10
SEED = 1
STOP_AT_OPTIMUM = YES
SUBGRADIENT = YES
# SUBPROBLEM_SIZE =
# SUBPROBLEM_TOUR_FILE =
SUBSEQUENT_MOVE_TYPE = 5
SUBSEQUENT_PATCHING = YES
# TIME_LIMIT =
# TOUR_FILE =
TRACE_LEVEL = 1

Lower bound = 373488.5, Gap = 1.20%, Ascent time = 30.5 sec.
Cand.min = 2, Cand.avg = 5.0, Cand.max = 5
Preprocessing time = 33.0 sec.
* 1: Cost = 378224, Gap = 0.051%, Time = 0.8 sec.
* 2: Cost = 378032, Gap = 0.000%, Time = 1.0 sec. =
Run 1: Cost = 378032, Gap = 0.000%, Time = 1.0 sec. =

* 1: Cost = 378032, Gap = 0.000%, Time = 0.7 sec. =
Run 2: Cost = 378032, Gap = 0.000%, Time = 0.7 sec. =

* 1: Cost = 378032, Gap = 0.000%, Time = 0.7 sec. =
Run 3: Cost = 378032, Gap = 0.000%, Time = 0.7 sec. =

* 1: Cost = 378811, Gap = 0.206%, Time = 1.1 sec.
* 3: Cost = 378798, Gap = 0.203%, Time = 1.5 sec.
* 6: Cost = 378653, Gap = 0.164%, Time = 1.9 sec.
* 10: Cost = 378033, Gap = 0.000%, Time = 2.4 sec.
* 11: Cost = 378032, Gap = 0.000%, Time = 2.4 sec. =
Run 4: Cost = 378032, Gap = 0.000%, Time = 2.4 sec. =


* 1: Cost = 378032, Gap = 0.000%, Time = 1.0 sec. =
Run 5: Cost = 378032, Gap = 0.000%, Time = 1.0 sec. =

* 1: Cost = 378136, Gap = 0.028%, Time = 0.7 sec.
* 2: Cost = 378032, Gap = 0.000%, Time = 0.9 sec. =
Run 6: Cost = 378032, Gap = 0.000%, Time = 0.9 sec. =

* 1: Cost = 378370, Gap = 0.089%, Time = 0.7 sec.
* 4: Cost = 378075, Gap = 0.011%, Time = 2.0 sec.
* 6: Cost = 378053, Gap = 0.006%, Time = 2.3 sec.
* 29: Cost = 378032, Gap = 0.000%, Time = 3.6 sec. =
Run 7: Cost = 378032, Gap = 0.000%, Time = 3.6 sec. =

* 1: Cost = 378683, Gap = 0.172%, Time = 1.2 sec.
* 2: Cost = 378032, Gap = 0.000%, Time = 1.7 sec. =
Run 8: Cost = 378032, Gap = 0.000%, Time = 1.7 sec. =

* 1: Cost = 378787, Gap = 0.200%, Time = 1.2 sec.
* 3: Cost = 378634, Gap = 0.159%, Time = 1.5 sec.
* 4: Cost = 378615, Gap = 0.154%, Time = 1.7 sec.
* 8: Cost = 378032, Gap = 0.000%, Time = 2.7 sec. =
Run 9: Cost = 378032, Gap = 0.000%, Time = 2.7 sec. =

* 1: Cost = 378032, Gap = 0.000%, Time = 0.8 sec. =
Run 10: Cost = 378032, Gap = 0.000%, Time = 0.8 sec. =

Successes/Runs = 10/10
Cost.min = 378032, Cost.avg = 378032.00, Cost.max = 378032
Gap.min = 0.000%, Gap.avg = 0.000%, Gap.max = 0.000%
Trials.min = 1, Trials.avg = 5.8, Trials.max = 29
Time.min = 0.7 sec., Time.avg = 1.6 sec., Time.max = 3.6 sec.


APPENDIX C

GenBank Nucleotide Sequences

Table C.1 lists the accession numbers and FASTA file descriptions for the DNA sequences used as the basis for the SSP experiments.

Table C.1: SSP-experiments nucleotide sequences

Acc. Number Description

D00723.1 HUMHYDCP Homo sapiens mRNA for hydrogen carrier protein, a component of an enzyme complex, glycine synthase (EC 2.1.2.10)

D11428.1 HUMPMP22A Homo sapiens mRNA for PMP-22 (PAS-II/SR13/Gas-3), complete cds

NT_004321.15 Hs1_4478 Homo sapiens chromosome 1 genomic contig
NT_004350.16 Hs1_4507 Homo sapiens chromosome 1 genomic contig
NT_004433.16 Hs1_4590 Homo sapiens chromosome 1 genomic contig
NT_004434.16 Hs1_4591 Homo sapiens chromosome 1 genomic contig
NT_004487.16 Hs1_4644 Homo sapiens chromosome 1 genomic contig
NT_004511.16 Hs1_4668 Homo sapiens chromosome 1 genomic contig
NT_004538.15 Hs1_4695 Homo sapiens chromosome 1 genomic contig
NT_004547.16 Hs1_4704 Homo sapiens chromosome 1 genomic contig
NT_004559.11 Hs1_4716 Homo sapiens chromosome 1 genomic contig
NT_004610.16 Hs1_4767 Homo sapiens chromosome 1 genomic contig
NT_004668.16 Hs1_4825 Homo sapiens chromosome 1 genomic contig
NT_004671.15 Hs1_4828 Homo sapiens chromosome 1 genomic contig
NT_004686.16 Hs1_4843 Homo sapiens chromosome 1 genomic contig
NT_004754.15 Hs1_4911 Homo sapiens chromosome 1 genomic contig

the table is continued


Table C.1: SSP-experiments nucleotide sequences (continued)

Acc. Number Description

NT_004836.15 Hs1_4993 Homo sapiens chromosome 1 genomic contig
NT_004873.15 Hs1_5030 Homo sapiens chromosome 1 genomic contig
NT_019273.16 Hs1_19429 Homo sapiens chromosome 1 genomic contig
NT_021877.16 Hs1_22033 Homo sapiens chromosome 1 genomic contig
NT_021937.16 Hs1_22093 Homo sapiens chromosome 1 genomic contig
NT_021973.16 Hs1_22129 Homo sapiens chromosome 1 genomic contig
NT_022071.12 Hs1_22227 Homo sapiens chromosome 1 genomic contig
NT_026943.13 Hs1_27103 Homo sapiens chromosome 1 genomic contig
NT_028050.13 Hs1_28209 Homo sapiens chromosome 1 genomic contig
NT_029860.11 Hs1_30115 Homo sapiens chromosome 1 genomic contig
NT_030584.10 Hs1_30840 Homo sapiens chromosome 1 genomic contig
X00351.1 Human mRNA for beta-actin
X01098.1 Human mRNA for aldolase B (EC 4.1.2.13)
X02160.1 Human mRNA for insulin receptor precursor
X02874.1 Human mRNA for (2'-5') oligo A synthetase E (1,6 kb RNA)
X02994.1 Human mRNA for adenosine deaminase (adenosine aminohydrolase, EC 3.5.4.4)
X03350.1 Human mRNA for alcohol dehydrogenase beta-1-subunit (ADH1-2 allele)
X03444.1 Human mRNA for nuclear envelope protein lamin A precursor
X03445.1 Human mRNA for nuclear envelope protein lamin C precursor
X03663.1 Human mRNA for c-fms proto-oncogene
X03795.1 Human mRNA for platelet derived growth factor A-chain (PDGF-A)
X04217.1 Human mRNA for porphobilinogen deaminase (PBG-D, EC 4.3.1.8)
X04350.1 Human mRNA for liver alcohol dehydrogenase (EC 1.1.1.1) gamma 1 subunit (partial) from ADH3 locus
X04412.1 Human mRNA for plasma gelsolin
X04772.1 Human mRNA for lymphocyte IgE receptor (low affinity receptor Fc epsilon R)
X04808.1 Human mRNA for non-erythropoietic porphobilinogen deaminase (hydroxymethylbilane synthase; EC 4.3.1.8)
X06374.1 Human mRNA for platelet-derived growth factor PDGF-A
X06985.1 Human mRNA for heme oxygenase
X07173.1 Human mRNA for second protein of inter-alpha-trypsin inhibitor complex

the table is continued


Table C.1: SSP-experiments nucleotide sequences (continued)

Acc. Number Description

X07577.1 Human mRNA for C1 inhibitor
X07696.1 Human mRNA for cytokeratin 15
X07743.1 Human mRNA for pleckstrin (P47)
X13097.1 Human mRNA for tissue type plasminogen activator
X13403.1 Human mRNA for octamer-binding protein Oct-1
X13440.1 Human mRNA for 17-beta-hydroxysteroid dehydrogenase (17-HSD) (EC 1.1.1.62)
X14034.1 Human mRNA for phospholipase C
X14758.1 Human mRNA for adenocarcinoma-associated antigen (KSA)
X15610.1 Human mRNA for proliferating cell nucleolar antigen P40 C-terminal fragment
X51841.1 Human mRNA for integrin beta(4) subunit
X52104.1 Human mRNA for p68 protein
X53279.1 Human mRNA for placental-like alkaline phosphatase (EC 3.1.3.1)
X56088.1 Human mRNA for cholesterol 7-alpha-hydroxylase
X57398.1 Human mRNA for pM5 protein
X58377.1 Human mRNA for adipogenesis inhibitory factor
Y00093.1 HSP15095 H.sapiens mRNA for leukocyte adhesion glycoprotein p150,95
Y00264.1 Human mRNA for amyloid A4 precursor of Alzheimer's disease
Y00503.1 Human mRNA for keratin 19
Y00649.1 Human mRNA for B-lymphocyte CR2-receptor (for complement factor C3d and Epstein-Barr virus)


APPENDIX D

SSP Test Data

To generate the test data according to the method description in Section 7.1 on page 76, the following calculations were made to determine the needed length of nucleotide sequence N. Let the wanted values for coverage, collection size and string length be denoted by cov, Ssize and l, respectively; then

N = (Ssize · l) / cov    (D.1)

The number of strings gen(N) of length l generated by the method using a nucleotide sequence of length N can be calculated as

gen(N) = N − l + 1

and the number of strings to be randomly selected for removal, r, is given by

r = gen(N) − Ssize

giving a reduction ratio rratio as a percentage of the size:

rratio = (r · 100) / gen(N)


Example 21

Let cov = 8, Ssize = 600, l = 16; then

N = (600 · 16) / 8 = 1200
gen(N) = 1200 − 16 + 1 = 1185
r = 1185 − 600 = 585
rratio = (585 · 100) / 1185 ≈ 49.367%

In practice an integer value for the reduction ratio was preferred, leading to the following modification of Equation D.1 on the previous page:

N′ = (Ssize · l / cov − 1) + l = N + l − 1

Example 22

Let cov = 8, Ssize = 600, l = 16; then

N′ = (600 · 16 / 8 − 1) + 16 = 1215
gen(N′) = 1215 − 16 + 1 = 1200
r = 1200 − 600 = 600
rratio = (600 · 100) / 1200 = 50%

Note that the effect of this is that the value for cov is then actually only approximated, giving

cov′ = (Ssize · l) / N′

In Example 22 we get

cov′ = (600 · 16) / 1215 ≈ 7.901

The values for the generated test data are listed in the following tables.


Table D.1: coverage 6 test data

Ssize l rratio Ssize l rratio Ssize l rratio

 100  10  40     100  20  70     100  25  76
 200  10  40     200  20  70     200  25  76
 300  10  40     300  20  70     300  25  76
 400  10  40     400  20  70     400  25  76
 500  10  40     500  20  70     500  25  76
 600  10  40     600  20  70     600  25  76
 700  10  40     700  20  70     700  25  76
 800  10  40     800  20  70     800  25  76
 900  10  40     900  20  70     900  25  76
1000  10  40    1000  20  70    1000  25  76

 100  40  85     100  50  88     100 100  94
 200  40  85     200  50  88     200 100  94
 300  40  85     300  50  88     300 100  94
 400  40  85     400  50  88     400 100  94
 500  40  85     500  50  88     500 100  94
 600  40  85     600  50  88     600 100  94
 700  40  85     700  50  88     700 100  94
 800  40  85     800  50  88     800 100  94
 900  40  85     900  50  88     900 100  94
1000  40  85    1000  50  88    1000 100  94

3000   8   0

Table D.2: coverage 8 test data

Ssize l rratio Ssize l rratio Ssize l rratio

 100   8   0     100  10  20     100  16  50
 200   8   0     200  10  20     200  16  50
 300   8   0     300  10  20     300  16  50
 400   8   0     400  10  20     400  16  50
 500   8   0     500  10  20     500  16  50
 600   8   0     600  10  20     600  16  50
 700   8   0     700  10  20     700  16  50
 800   8   0     800  10  20     800  16  50
 900   8   0     900  10  20     900  16  50
1000   8   0    1000  10  20    1000  16  50
2000   8   0    2000  10  20
                3000  10  20

the table is continued


Table D.2: coverage 8 test data (continued)

Ssize l rratio Ssize l rratio Ssize l rratio

 100  20  60     100  25  68     100  32  75
 200  20  60     200  25  68     200  32  75
 300  20  60     300  25  68     300  32  75
 400  20  60     400  25  68     400  32  75
 500  20  60     500  25  68     500  32  75
 600  20  60     600  25  68     600  32  75
 700  20  60     700  25  68     700  32  75
 800  20  60     800  25  68     800  32  75
 900  20  60     900  25  68     900  32  75
1000  20  60    1000  25  68    1000  32  75

 100  40  80     100  50  84     100 100  92
 200  40  80     200  50  84     200 100  92
 300  40  80     300  50  84     300 100  92
 400  40  80     400  50  84     400 100  92
 500  40  80     500  50  84     500 100  92
 600  40  80     600  50  84     600 100  92
 700  40  80     700  50  84     700 100  92
 800  40  80     800  50  84     800 100  92
 900  40  80     900  50  84     900 100  92
1000  40  80    1000  50  84    1000 100  92

Table D.3: coverage 10 test data

Ssize l rratio Ssize l rratio Ssize l rratio

 100  10   0     100  20  50     100  25  60
 200  10   0     200  20  50     200  25  60
 300  10   0     300  20  50     300  25  60
 400  10   0     400  20  50     400  25  60
 500  10   0     500  20  50     500  25  60
 600  10   0     600  20  50     600  25  60
 700  10   0     700  20  50     700  25  60
 800  10   0     800  20  50     800  25  60
 900  10   0     900  20  50     900  25  60
1000  10   0    1000  20  50    1000  25  60

 100  40  75     100  50  80     100 100  90
 200  40  75     200  50  80     200 100  90
 300  40  75     300  50  80     300 100  90

the table is continued


Table D.3: coverage 10 test data (continued)

Ssize l rratio Ssize l rratio Ssize l rratio

 400  40  75     400  50  80     400 100  90
 500  40  75     500  50  80     500 100  90
 600  40  75     600  50  80     600 100  90
 700  40  75     700  50  80     900 100  90
 800  40  75     800  50  80
 900  40  75     900  50  80
1000  40  75    1000  50  80

Table D.4: coverage 20 test data

Ssize l rratio Ssize l rratio

100  50  60    100 100  80
200  50  60    200 100  80

Table D.5: coverage 30 test data

Ssize l rratio Ssize l rratio

100  50  40    100 100  70
200  50  40    200 100  70
300  50  40
400  50  40
500  50  40
600  50  40

Table D.6: coverage 40 test data

Ssize l rratio Ssize l rratio

100  50  20    100 100  60
200  50  20    200 100  60


Bibliography

Altschul, S. and D. Lipman (1989). Trees, stars, and multiple biological sequence alignment. SIAM Journal on Applied Mathematics 49(1), 197–209.

Applegate, D., R. Bixby, V. Chvátal, W. Cook, D. Espinoza, M. Goycoolea, and K. Helsgaun (2009). Certification of an optimal TSP tour through 85,900 cities. Operations Research Letters 37(1), 11–15.

Applegate, D. L., R. E. Bixby, V. Chvátal, and W. J. Cook (2007). The Traveling Salesman Problem: A Computational Study (Princeton Series in Applied Mathematics), Chapters 1–5, 12–17. Princeton, NJ, USA: Princeton University Press.

Bacon, D. and W. Anderson (1986). Multiple sequence alignment. Journal of Molecular Biology 191(2), 153–161.

Bentley, J. L. (1992). Fast algorithms for geometric traveling salesman problems. ORSA Journal on Computing 4(4), 387–411.

Bentley, J. L. (2000). Programming Pearls (second ed.), Chapter 8. Addison-Wesley, Inc.

Blackshields, G., I. Wallace, M. Larkin, and D. Higgins (2006). Analysis and comparison of benchmarks for multiple sequence alignment. In Silico Biology 6(4), 321–339.

Blum, A., T. Jiang, M. Li, J. Tromp, and M. Yannakakis (1994, May). Linear approximation of shortest superstrings. In Proceedings of the 23rd Annual ACM Symposium on Theory of Computing, pp. 328–336.


Breslauer, D., T. Jiang, and Z. Jiang (1996). Rotations of periodic strings and short superstrings. Journal of Algorithms 24, 340–353.

Carrillo, H. and D. Lipman (1988). The multiple sequence alignment problem in biology. SIAM Journal on Applied Mathematics 48(5), 1073–1082.

Christofides, N. (1976). Worst-case analysis of a new heuristic for the travelling salesman problem. Technical Report 388, Graduate School of Industrial Administration, Carnegie-Mellon University, Pittsburgh.

Cormen, T., C. Leiserson, R. Rivest, and C. Stein (2001). Introduction to Algorithms, Chapters 22, 31.2. The MIT Press.

Dantzig, G., R. Fulkerson, and S. Johnson (1954). Solution of a large scale traveling salesman problem. Technical Report P-510, RAND Corporation, Santa Monica, California, USA.

Do, C., M. Mahabhashyam, M. Brudno, and S. Batzoglou (2005). ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Research 15(2), 330.

Edgar, R. (2004). MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research 32(5), 1792.

Elias, I. (2006). Settling the intractability of multiple alignment. Journal of Computational Biology 13(7), 1323–1339.

Gallant, J., D. Maier, and J. A. Storer (1980). On finding minimal length superstrings. Journal of Computer and System Sciences 20(1), 50–58.

Gonnet, G., C. Korostensky, and S. Benner (2000). Evaluation measures of multiple sequence alignments. Journal of Computational Biology 7(1–2), 261–276.

Gusfield, D. (1997). Algorithms on Strings, Trees, and Sequences, Chapters 5–6, 7.10, 10–17. Cambridge University Press, New York, NY.

Held, M. and R. Karp (1970). The traveling-salesman problem and minimum spanning trees. Operations Research 18(6), 1138–1162.

Helsgaun, K. (1998). An effective implementation of the Lin-Kernighan traveling salesman heuristic. Datalogiske Skrifter (Writings on Computer Science), Technical Report 81, Roskilde University.

Helsgaun, K. (2000). An effective implementation of the Lin-Kernighan traveling salesman heuristic. European Journal of Operational Research 126(1), 106–130.


Helsgaun, K. (2006). An effective implementation of k-opt moves for the Lin-Kernighan TSP heuristic. Datalogiske Skrifter (Writings on Computer Science), Technical Report 109, Roskilde University.

Helsgaun, K. (2009). General k-opt submoves for the Lin-Kernighan TSP heuristic. Mathematical Programming Computation 1(2), 119–163.

Hopfield, J. J. and D. W. Tank (1985). “Neural” computation of decisions in optimization problems. Biological Cybernetics 52, 141–152.

Ilie, L. and C. Popescu (2006). The shortest common superstring problem and viral genome compression. Fundamenta Informaticae 73(1–2), 153–164.

Kaplan, H., M. Lewenstein, N. Shafrir, and M. Sviridenko (2005). Approximation algorithms for asymmetric TSP by decomposing directed regular multigraphs. Sections 1–3, 5. J. ACM 52(4), 602–626.

Kaplan, H. and N. Shafrir (2005). The greedy algorithm for shortest superstrings. Inf. Process. Lett. 93(1), 13–17.

Karp, R. M. (1972). Reducibility among combinatorial problems. In R. Miller and J. Thatcher (Eds.), Complexity of Computer Computations, New York, USA, pp. 85–103. Plenum Press.

Karp, R. M. (2003). The role of algorithmic research in computational genomics. Computational Systems Bioinformatics Conference, International IEEE Computer Society 0, 10.

Katoh, K., K. Kuma, H. Toh, and T. Miyata (2005). MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Research 33(2), 511.

Katoh, K., K. Misawa, K. Kuma, and T. Miyata (2002). MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research 30(14), 3059.

Korostensky, C. and G. Gonnet (1999). Near optimal multiple sequence alignments using a traveling salesman problem approach. In Proceedings of the String Processing and Information Retrieval Symposium & International Workshop on Groupware, pp. 105–114. IEEE Computer Society.

Laporte, G. (2010). A concise guide to the traveling salesman problem. Journal of the Operational Research Society 61(1), 35–40.


Lin, S. and B. Kernighan (1973). An effective heuristic algorithm for the traveling-salesman problem. Operations Research 21(2), 498–516.

Lipman, D., S. Altschul, and J. Kececioglu (1989). A tool for multiple sequence alignment. Proceedings of the National Academy of Sciences of the United States of America 86(12), 4412.

Metzker, M. (2009). Sequencing technologies: the next generation. Nature Reviews Genetics 11(1), 31–46.

Mucha, M. (2007). A tutorial on shortest superstring approximation. http://www.mimuw.edu.pl/~mucha/teaching/aa2008/ss.pdf.

Murata, M., J. Richardson, and J. Sussman (1985). Simultaneous comparison of three protein sequences. Proceedings of the National Academy of Sciences of the United States of America 82(10), 3073.

Notredame, C. (2002). Recent progress in multiple sequence alignment: a survey. Pharmacogenomics 3(1), 131–144.

Notredame, C. (2007). Recent evolutions of multiple sequence alignment algorithms. PLoS Comput. Biol. 3(8), e123.

Notredame, C., D. Higgins, and J. Heringa (2000). T-Coffee: a novel method for fast and accurate multiple sequence alignment. Journal of Molecular Biology 302(1), 205–217.

Okano, H. (2009). Study of Practical Solutions for Combinatorial Optimization Problems. Ph.D. thesis, School of Information Sciences, Tohoku University.

Perrodou, E., C. Chica, O. Poch, T. Gibson, and J. Thompson (2008). A new protein linear motif benchmark for multiple sequence alignment software. BMC Bioinformatics 9(1), 213–228.

Reinelt, G. (1991, Fall). TSPLIB: a traveling salesman problem library. ORSA Journal on Computing 3(4), 376–384.

Rice, P., I. Longden, A. Bleasby, et al. (2000). EMBOSS: the European Molecular Biology Open Software Suite. Trends in Genetics 16(6), 276–277.

Romero, H. J., C. A. Brizuela, and A. Tchernykh (2004). An experimental comparison of approximation algorithms for the shortest common superstring problem. In Fifth Mexican International Conference on Computer Science (ENC), pp. 27–34.


Roth-Korostensky, C. (2000). Algorithms for Building Multiple Sequence Alignments and Evolutionary Trees. Doctor of Technical Sciences thesis, Swiss Federal Institute of Technology, Zürich.

Kirkpatrick, S., C. D. Gelatt, and M. P. Vecchi (1983, May). Optimization by simulated annealing. Science 220(4598), 671–680.

Tarhio, J. and E. Ukkonen (1988). A greedy approximation algorithm for constructing shortest common superstrings. Theor. Comput. Sci. 57(1), 131–145.

Thompson, J., D. Higgins, and T. Gibson (1994). Clustal W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22(22), 4673.

Thompson, J., P. Koehl, R. Ripp, and O. Poch (2005). BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins: Structure, Function, and Bioinformatics 61(1), 127–136.

Thompson, J. D., F. Plewniak, and O. Poch (1999a). BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics 15, 87–88.

Thompson, J. D., F. Plewniak, and O. Poch (1999b). A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Research 27(13), 2682–2690.

Venter, J., M. Adams, E. Myers, P. Li, R. Mural, G. Sutton, H. Smith, M. Yandell, C. Evans, R. Holt, et al. (2001). The sequence of the human genome. Science 291(5507), 1304.

Wallace, I., G. Blackshields, and D. Higgins (2005). Multiple sequence alignments. Current Opinion in Structural Biology 15(3), 261–266.

Wang, L. and T. Jiang (1994). On the complexity of multiple sequence alignment. Journal of Computational Biology 1(4), 337–348.
