Global Optimization for Scaffolding and Completing Genome Assemblies
Rumen Andonov (1)
joint with Sebastien Francois (1), Dominique Lavenier (1), Hristo Djidjev (2)
(1) IRISA/INRIA, Rennes, France
(2) Los Alamos National Laboratory, Los Alamos, NM 87545, USA
(adapted from a talk at Workshop on Constraint-Based Methods for Bioinformatics WCB’16)
Outline
Scaffolding in the context of the genome assembly problem
Graph formulation
Search for a convenient optimization problem
Mixed Integer Linear Programming formulation
Numerical results
Illustration of the Genome Sequencing-Assembly process
Figure 1 : Illustration of the Genome Sequencing-Assembly process.
Observations concerning the Genome Assembly process
The process is decomposed into several subtasks (contig assembly, scaffolding, gap filling, scaffold extension) that are solved independently and heuristically.
"The Contig Scaffolding Problem is to order and orientate
the given contigs in a manner that is consistent with as many
mate-pairs as possible". Hudson et al. 2002
Scaffolding has been proven to be NP-hard. A lot of work has been done towards its solution, but mostly heuristics have been published.
Here we model and tackle not only this problem, but also the two tasks that follow scaffolding (i.e. gap filling and extension).
State of the art of post-"contig assembly" stages
Popular scaffolders : Bambus (2004), SSPACE (2011), SSAKE (2007), Bambus2 (2011), Opera (2011)
Mixed Integer Programming approaches : SCARPA (2013), GRASS (2012), MIP Scaffolder (2011), ILP (2014) [1]
Post-scaffolders : GapFiller (2011), SOAPdenovo2 (2012), Sealer (2015)
[1] Nicolas Briot, Annie Chateau, Remi Coletta, Simon De Givry, Philippe Leleux, and Thomas Schiex. An Integer Linear Programming Approach for Genome Scaffolding. In Workshop Constraints in Bioinformatics, 2014.
Our approach for modelling the scaffolding graph
Input data for Agrostis genome
Contigs and coverage
====================
5__len__56145 1
1__len__24521 1
4__len__19352 2
0__len__2277 2
3__len__12880 1
2__len__160 3
Overlap
=======
0__len__2277_F 1__len__24521_R 70
1__len__24521_R 2__len__160_R 70
0__len__2277_R 2__len__160_R 70
1__len__24521_F 0__len__2277_R 70
4__len__19352_F 3__len__12880_F 70
2__len__160_R 4__len__19352_F 70
4__len__19352_F 3__len__12880_R 70
3__len__12880_R 4__len__19352_R 70
0__len__2277_F 5__len__56145_R 70
2__len__160_F 1__len__24521_F 70
2__len__160_R 5__len__56145_F 70
5__len__56145_R 2__len__160_F 70
4__len__19352_R 2__len__160_F 70
5__len__56145_F 0__len__2277_R 70
2__len__160_F 0__len__2277_F 70
3__len__12880_F 4__len__19352_R 70
Links
=====
1__len__24521_F 4__len__19352_F 2279 2405
2__len__160_F 1__len__24521_R 2127 2297
0__len__2277_R 4__len__19352_F 190 683
4__len__19352_R 0__len__2277_F 299 695
5__len__56145_R 1__len__24521_F 222 651
1__len__24521_F 2__len__160_R 2118 2331
1__len__24521_R 5__len__56145_F 286 683
4__len__19352_R 1__len__24521_R 2287 2419
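The three blocks of the input can be read with a few lines of code. The sketch below is a hypothetical reader, not the authors' tool. It assumes one record per line in the underlying file, that the number after a contig name is its coverage, that the third field of an overlap record is the overlap length, and that the two numbers of a link record bound the mate-pair distance. None of these interpretations is stated on the slide, and the names `Oriented`, `parse_name`, `parse_coverage`, `parse_overlap` and `parse_link` are made up for illustration.

```python
import re
from typing import NamedTuple


class Oriented(NamedTuple):
    contig: int    # contig identifier, e.g. 1
    length: int    # contig length in bp, e.g. 24521
    forward: bool  # True for the _F suffix (or no suffix), False for _R


# Matches names such as "1__len__24521", "1__len__24521_F", "1__len__24521_R".
NAME = re.compile(r"^(\d+)__len__(\d+)(?:_([FR]))?$")


def parse_name(token):
    match = NAME.match(token)
    if match is None:
        raise ValueError(f"unrecognised contig name: {token!r}")
    ident, length, strand = match.groups()
    return Oriented(int(ident), int(length), strand != "R")


def parse_coverage(line):
    """'2__len__160 3' -> (contig id, length, coverage)."""
    name, coverage = line.split()
    contig = parse_name(name)
    return contig.contig, contig.length, int(coverage)


def parse_overlap(line):
    """'A B 70' -> (A, B, 70); the meaning of the number is an assumption."""
    a, b, value = line.split()
    return parse_name(a), parse_name(b), int(value)


def parse_link(line):
    """'A B lo hi' -> (A, B, lo, hi); lo/hi read as a distance interval."""
    a, b, lo, hi = line.split()
    return parse_name(a), parse_name(b), int(lo), int(hi)
```

For instance, `parse_link("1__len__24521_F 4__len__19352_F 2279 2405")` yields the two oriented contigs together with the bounds 2279 and 2405.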
From input to the scaffolding graph
Rules :
use unitigs instead of contigs.
the unitig $i$ is represented by at least two nodes $v_i$ and $v'_i$ (forward/inverse).
nodes/unitigs are multiplied according to their coverage in order to account for repetitions.
Nodes : big (red) and small (blue). A vertex $v$ is big if $w_v > l_e \ \forall e \in L$.
Every edge is given in both its forward and inverse orientation (i.e. if $e_{ij}$ is in the input, we add its inverse $e_{j'i'}$ to the graph).
Edges : mate-pairs ($L$, orange dashed) and overlaps ($O$, black solid).
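The rules above can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' implementation: a node is modelled as a hypothetical tuple `(contig, copy, strand)`, the way edges are distributed over the copies of a repeated unitig is left out because the slide does not specify it, and `link_lengths` stands for whatever length $l_e$ is attached to each mate-pair edge.

```python
def inverse(node):
    """Flip the orientation of a node (contig, copy, strand): v -> v'."""
    contig, copy, strand = node
    return (contig, copy, "R" if strand == "F" else "F")


def build_nodes(coverage):
    """coverage: {contig: multiplicity}. One forward/inverse pair per copy."""
    return [(contig, copy, strand)
            for contig, count in coverage.items()
            for copy in range(count)
            for strand in "FR"]


def add_with_inverse(edges, u, v, length):
    """Insert the edge u -> v and its inverse v' -> u' with the same length."""
    edges[(u, v)] = length
    edges[(inverse(v), inverse(u))] = length


def big_vertices(weights, link_lengths):
    """A vertex is big if its weight exceeds the length of every mate-pair."""
    longest = max(link_lengths, default=0)
    return {v for v, w in weights.items() if w > longest}
```

With coverages `{0: 2, 2: 3}` this produces ten nodes (five copies, two orientations each), and adding the edge `(0, 0, "F") -> (1, 0, "R")` also adds `(1, 0, "F") -> (0, 0, "R")`.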
Initial Scaffolding Graph for Saccharomyces cerevisiae Chromosome III
What shall we look for in this graph in order to assemble the genome?
Scaffolding, Flows and Integer Linear Programming
Our approach for Scaffolding
Scaffolding is seen as an elementary (simple) longest path in a directed graph $G = (V, E, w, l)$, where both the vertices $V$ and the edges $E$ are weighted.
Mate-pair distances have to be satisfied (as much as possible).
Classical definition of flow : a feasible flow is a vector $\phi = [\phi_1, \phi_2, \ldots, \phi_m]$ such that flow conservation holds, i.e.
$$\forall v \in V \setminus \{s,t\} : \sum_{e \in A^+(v)} \phi_e = \sum_{e \in A^-(v)} \phi_e \tag{1}$$
where $A^+(v)$ (resp. $A^-(v)$) is the set of edges leaving (resp. entering) node $v$.
Here (1) is modified in order to avoid subtours (with a polynomial number of constraints).
Our model reduces to Mixed Integer Linear Programming (MILP) with binary variables (maximizing a linear objective function under a set of linear constraints).
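As a reminder of what the classical condition (1) says before it is modified, the following small check tests flow conservation on an explicit edge list. The function name `is_conserved` and the dictionary representation of the flow are assumptions made for this illustration only.

```python
def is_conserved(flow, source, target):
    """flow: {(u, v): value}. True if every vertex other than source and
    target sends out exactly as much flow as it receives (equation (1))."""
    balance = {}
    for (u, v), value in flow.items():
        balance[u] = balance.get(u, 0) + value  # flow leaving u
        balance[v] = balance.get(v, 0) - value  # flow entering v
    return all(b == 0 for vertex, b in balance.items()
               if vertex not in (source, target))
```

A flow of 2 on `s -> a` followed by 2 on `a -> t` is conserved at `a`, whereas 2 in and 1 out is not. The model of the talk deliberately gives up this equality: the flow is forced to decrease along the path.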
Modelling the simple longest path
A binary variable for every edge :
$$\forall e \in E : x_e \in \{0,1\}. \tag{2}$$
Any vertex can be an intermediate ($i_v = 1$), source ($s_v = 1$) or target ($t_v = 1$) vertex
$$\forall v \in V : 0 \le i_v \le 1,\ 0 \le s_v \le 1,\ 0 \le t_v \le 1. \tag{3}$$
Any vertex $v$ (or its inverse $v'$) can be visited at most once, i.e.
$$\forall (v, v') : i_v + i_{v'} + s_v + s_{v'} + t_v + t_{v'} \le 1 \tag{4}$$
Four possible states for any vertex $v$ : to be a source, a target, or an intermediate vertex in some path, or otherwise to belong to none of the paths
$$s_v + i_v = \sum_{e \in A^+(v)} x_e \tag{5}$$
$$t_v + i_v = \sum_{e \in A^-(v)} x_e \tag{6}$$
We search for a single-path solution
$$\sum_{v \in V} s_v = 1 \quad \text{and} \quad \sum_{v \in V} t_v = 1. \tag{7}$$
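To see how the vertex variables follow from the edge variables, the sketch below takes a set of selected edges (those with $x_e = 1$), derives $(s_v, i_v, t_v)$ from (5) and (6), and then checks (4) and (7). It is an illustrative checker written for this transcript, not a solver; the function name and the `inverse` callback, which must map each vertex to its inverse inside the vertex set, are assumptions.

```python
def vertex_states(vertices, chosen_edges, inverse):
    """Return {v: (s_v, i_v, t_v)} for the selected edges, or None if the
    selection violates constraints (4) to (7)."""
    out_deg = {v: 0 for v in vertices}
    in_deg = {v: 0 for v in vertices}
    for u, v in chosen_edges:
        out_deg[u] += 1
        in_deg[v] += 1

    state = {}
    for v in vertices:
        if out_deg[v] > 1 or in_deg[v] > 1:
            return None                      # (5)/(6) cannot hold with s, i, t <= 1
        i = 1 if out_deg[v] == in_deg[v] == 1 else 0
        s = out_deg[v] - i                   # from (5): s_v = out-degree - i_v
        t = in_deg[v] - i                    # from (6): t_v = in-degree - i_v
        state[v] = (s, i, t)

    for v in vertices:                       # (4): v and v' used at most once in total
        if sum(state[v]) + sum(state[inverse(v)]) > 1:
            return None

    sources = sum(s for s, _, _ in state.values())
    targets = sum(t for _, _, t in state.values())
    if sources != 1 or targets != 1:         # (7): exactly one source, one target
        return None
    return state
```

Note that a selection consisting of one path plus a disjoint cycle passes this check, since every vertex of the cycle looks like an intermediate vertex. Constraints (2) to (7) alone therefore do not exclude subtours.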
Modelling the simple longest path (cont. II)
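$$\forall e \in E : x_e \in \{0,1\}. \tag{8}$$
$$\forall v \in V : 0 \le i_v \le 1,\ 0 \le s_v \le 1,\ 0 \le t_v \le 1. \tag{9}$$
$$\forall (v, v') : i_v + i_{v'} + s_v + s_{v'} + t_v + t_{v'} \le 1 \tag{10}$$
$$s_v + i_v = \sum_{e \in A^+(v)} x_e \tag{11}$$
$$t_v + i_v = \sum_{e \in A^-(v)} x_e \tag{12}$$
$$\sum_{v \in V} s_v = 1 \quad \text{and} \quad \sum_{v \in V} t_v = 1. \tag{13}$$
Theorem. The real variables $i_v, s_v, t_v$, $\forall v \in V$, take binary values.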
Modelling the simple longest path : subtour elimination
Let $f_e$ be the flow circulating on the edge $e$.
$$f_e \le W x_e \quad \forall e \in O. \tag{14}$$
$$\forall v \in V : \sum_{e \in A^-(v)} f_e - \sum_{e \in A^+(v)} f_e \ge i_v \Big( w_v + \sum_{e \in A^-(v)} l_e x_e \Big) - W s_v \tag{15}$$
$$W s_v \le \sum_{e \in A^+(v)} f_e. \tag{16}$$
When $s_v = 1$, (15) and (16) send an initial flow of value $W$ out of $v$.
When $i_v = 1$, (15) forces the flow to decrease by at least $l_{(u,v)} + w_v$ units when it moves from vertex $u$ to its adjacent vertex $v$. This forbids cycles.
When $s_v + i_v = 0$ (i.e. $t_v = 1$), (15) is simply a valid inequality.
We search for the longest path
$$\max \Big( \sum_{e \in E} x_e l_e + \sum_{v \in V} w_v (i_v + s_v + t_v) \Big) \tag{17}$$
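The effect of (14) to (16) along a selected path can be traced by hand. The sketch below computes the largest flow that these constraints allow on each edge of a given path. It assumes that flows are non-negative and treats $W$ as a given constant; neither is spelled out on the slide, and the function name `propagate_flow` is hypothetical.

```python
def propagate_flow(path, w, l, W):
    """Largest flows allowed by (14)-(16) on the edges of `path`.

    path: list of vertices [v0, v1, ..., vk], v0 being the source.
    w[v]: vertex weight; l[(u, v)]: edge length; W: flow emitted by the source.
    Returns {(u, v): flow}, or None if the flow would become negative
    (assumption: flows must stay non-negative).
    """
    flows = {}
    current = W  # (15) and (16) together force the source to emit exactly W
    for idx in range(len(path) - 1):
        u, v = path[idx], path[idx + 1]
        if idx > 0:
            # u is an intermediate vertex: by (15) the flow drops by at
            # least w_u plus the length of the edge used to enter u.
            prev = path[idx - 1]
            current -= w[u] + l[(prev, u)]
        if current < 0:
            return None
        flows[(u, v)] = current
    return flows
```

On the path `a, b, c` with `w = {"a": 5, "b": 3, "c": 4}`, `l = {("a", "b"): 2, ("b", "c"): 1}` and `W = 20`, the edge `a -> b` carries 20 and the edge `b -> c` carries 15. On a cycle every vertex is intermediate, so the flow would have to drop at each step and still return to its starting value, which is impossible whenever the amounts $w_v + l_e$ are positive.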
Modelling the mate-pair distances
A binary variable $g(s,t)$ is associated with every mate-pair $(s,t)$. It is set to 1 only if both vertices $s$ and $t$ belong to the selected path and the length of the path between them lies in the given interval $[\underline{L}(s,t), \overline{L}(s,t)]$.
$$g(s,t) \le s_s + i_s + t_s \quad \text{and} \quad g(s,t) \le s_t + i_t + t_t \tag{18}$$
as well as
$$\forall (s,t) \in L : \sum_{e \in A^+(s)} f_e - \sum_{e \in A^-(t)} f_e \ge \underline{L}(s,t)\, g(s,t) - M (1 - g(s,t)) \tag{19}$$
$$\forall (s,t) \in L : \sum_{e \in A^+(s)} f_e - \sum_{e \in A^-(t)} f_e \le \overline{L}(s,t)\, g(s,t) + M (1 - g(s,t)), \tag{20}$$
where $M$ is some big constant.
We search for a long path in the graph such that as many mate-pair distances as possible are satisfied.
$$\max \Big( \sum_{e \in O} x_e l_e + \sum_{v \in V} w_v (i_v + s_v + t_v) + p \sum_{e \in L} g_e \Big) \tag{21}$$
where $p$ is a parameter to be chosen as appropriate (currently $p = 1$).
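Constraints (19) and (20) measure the distance between $s$ and $t$ as the drop in flow between the edges leaving $s$ and the edges entering $t$. The check below evaluates that difference for given flows and reports whether $g(s,t)$ may be set to 1. It is a sketch with hypothetical names; it assumes that both vertices already lie on the selected path, which is what (18) enforces in the model.

```python
def mate_pair_satisfied(flows, s, t, lower, upper):
    """flows: {(u, v): value} on the selected edges.
    True if the flow leaving s minus the flow entering t lies in
    [lower, upper], i.e. if (19) and (20) hold with g(s,t) = 1."""
    out_s = sum(f for (u, _), f in flows.items() if u == s)
    in_t = sum(f for (_, v), f in flows.items() if v == t)
    return lower <= out_s - in_t <= upper
```

With flows `{("a", "b"): 20, ("b", "c"): 15}` the drop between `a` and `c` is 5, so an interval of `[3, 6]` is satisfied and an interval of `[6, 10]` is not. When $g(s,t) = 0$, the big constant $M$ makes both inequalities vacuous.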
Computational results
Datasets                Size (bp)   #unitigs  #nodes  #edges  #mate-pairs
Acinetobacter           3 598 621   165       676     8344    4430
Wolbachia               1 080 084   100       452     7552    2972
Aethionema Cordifolium  154 167     83        166     898     600
Atropa belladonna       156 687     18        36      114     46
Angiopteris Evecta      153 901     16        32      144     74
Acorus Calamus          153 821     15        30      134     26
Table 1 : Our dataset of chloroplast and bacterial genomes.
The protocol that we applied to the above data :
Synthetic sequencing reads were generated with the ART simulator [1].
We used minia [2] to produce unitigs.
The scaffolding graph was generated as explained above.
[1] Huang, W., Li, L., Myers, J.R., Marth, G.T. : ART : a next-generation sequencing read simulator. Bioinformatics 28(4), 593–594 (2012)
[2] Chikhi, R., Rizk, G. : Space-efficient and exact de Bruijn graph representation based on a Bloom filter. In : WABI. Lecture Notes in Computer Science, vol. 7534, pp. 236–248. Springer (2012)
Computational results (cont. I)
Figure 2 : The contig graph generated for the Atropa belladonna genome. Red/blue vertices correspond respectively to big/small contigs.
Computational results (cont. II)
Figure 3 : The solution obtained for Atropa belladonna’s genome
Datasets           Models  Obj      1st term (length)  2nd term (# satisfied mate-pairs)  Time
Atropa belladonna  BR      156 501  156 488            13                                 0m0.780s
Computational results : comparison with recent scaffolders
Datasets                Scaffolder  Genome fraction  #scaffolds  #misassemblies  N's per 100 kbp
Acinetobacter           GST         98.536%          1           0               0
                        SSPACE      98.563%          20          0               155.01
                        BESST       98.539%          37          0               266.65
                        Scaffmatch  98.675%          9           5               1579.12
Wolbachia               GST         98.943%          1           0               0
                        SSPACE      97.700%          9           0               2036.75
                        BESST       97.699%          49          0               642.90
                        Scaffmatch  97.994%          2           2               3162.81
Aethionema Cordifolium  GST         100%             1           0               0
                        SSPACE      95.550%          20          0               13603.00
                        BESST       81.318%          30          0               1553.22
                        Scaffmatch  82.608%          7           7               36892
Atropa belladonna       GST         99.987%          1           0               0
                        SSPACE      83.389%          2           0               155.01
                        BESST       83.353%          1           0               14.52
                        Scaffmatch  83.516%          1           0               318.93
Angiopteris Evecta      GST         99.968%          1           0               0
                        SSPACE      85.100%          4           0               0
                        BESST       85.164%          2           0               1438.54
                        Scaffmatch  85.684%          1           0               454.23
Acorus Calamus          GST         100%             1           0               0
                        SSPACE      83.091%          4           0               126.39
                        BESST       83.091%          4           0               127.95
                        Scaffmatch  83.271%          1           1               3757.13
Table 2 : Performance of different solvers on the scaffolding datasets from Table 1. Our tool GST is the only one that consistently assembles the complete genome with zero misassemblies. The quality was evaluated with the QUAST tool.
Difficulties concerning the comparisons with other scaffolders/solvers
Fair comparison is hard to achieve. Some reasons :
No common conventions for comparison.
Existing tools solve the above-mentioned Genome Assembly stages as separate/independent tasks (local optimization).
For example, the Briot et al. ILP approach solves instances that are an order of magnitude larger than ours, but it considers only the "classical scaffolding" problem.
On the other hand, the gap-closer Sealer reports long running times for gap filling (30 h, and more for larger instances).
How to evaluate the multiplicity of the solutions ?
Computational results : comparing various formulations of our approach
BB (Basic Binary) : the model with binary variables for vertices.
$$\forall v \in V : i_v \in \{0,1\} \text{ and } s_v \in \{0,1\} \text{ and } t_v \in \{0,1\}. \tag{22}$$
BR (Basic Real) : the integrality constraints (22) are relaxed, so $i_v$, $s_v$ and $t_v$ are real variables in $[0,1]$. This is the main model.
BRLP : the linear programming relaxation of BR where the binary variables for edges are relaxed (gives an upper bound for the objective function).
$$\forall e \in O : 0 \le x_e \le 1 \quad \text{and} \quad \forall e \in L : 0 \le g_e \le 1. \tag{23}$$
LP (Longest Path) : all constraints related to mate-pair distances are omitted. Its optimal value yields an upper bound for the first term in the objective of the main model BR.
Computational results confirm that :
BB and BR give the same value on the instances that BB solves within the time limit, but BR is much faster.
The relaxations BRLP and LP give very tight bounds and are faster than model BR.
Computational results : comparing various formulations of our approach
Datasets                Models  Obj        1st term (length)  2nd term (# satisfied links)  Time
Acinetobacter           BB      N/A        N/A                N/A                           15m00.000s*
                        BR      3 598 689  3 598 499          190                           3m13.878s
                        BRLP    3 598 977  3 597 826          1151                          0m44.508s
                        LP      3 598 518  3 598 518          N/A                           1m16.318s
Wolbachia               BB      N/A        N/A                N/A                           15m00.000s*
                        BR      1 075 949  1 075 856          93                            3m13.144s
                        BRLP    1 076 109  1 075 857          252                           0m25.428s
                        LP      1 075 857  1 075 857          N/A                           0m18.694s
Atropa belladonna       BB      156 501    156 488            13                            0m1.151s
                        BR      156 501    156 488            13                            0m0.780s
                        BRLP    156 507    156 468            39                            0m0.720s
                        LP      156 488    156 488            N/A                           0m0.296s
Angiopteris Evecta      BB      145 542    145 534            8                             0m28.728s
                        BR      145 542    145 534            8                             0m1.084s
                        BRLP    145 556    145 501            55                            0m0.752s
                        LP      145 535    145 535            N/A                           0m0.308s
Acorus Calamus          BB      153 976    153 970            6                             0m1.343s
                        BR      153 976    153 970            6                             0m1.060s
                        BRLP    153 981    153 959            22                            0m0.768s
                        LP      153 970    153 970            N/A                           0m0.261s
Aethionema Cordifolium  BB      151 563    151 445            118                           15m00.000s*
                        BR      151 570    151 445            125                           3m15.534s
                        BRLP    151 610    151 200            410                           0m1.992s
                        LP      151 445    151 445            N/A                           0m1.548s
Table 3 : Performance of the basic MILP model and some of its relaxations and related formulations on the scaffolding datasets from Table 1. The symbol * indicates that the corresponding execution has been stopped by the time limit.
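How tight the BRLP bound is can be read off Table 3 with a one-line computation. The snippet below takes the Obj values of BR and BRLP from the table and computes the relative gap (BRLP - BR) / BR. The function name `relative_gap` and the dictionary layout are just conveniences for this illustration.

```python
def relative_gap(upper_bound, value):
    """Relative distance between an upper bound and an objective value."""
    return (upper_bound - value) / value


# (BR objective, BRLP objective) as listed in Table 3.
table3 = {
    "Acinetobacter": (3_598_689, 3_598_977),
    "Wolbachia": (1_075_949, 1_076_109),
    "Atropa belladonna": (156_501, 156_507),
    "Angiopteris Evecta": (145_542, 145_556),
    "Acorus Calamus": (153_976, 153_981),
    "Aethionema Cordifolium": (151_570, 151_610),
}

gaps = {name: relative_gap(brlp, br) for name, (br, brlp) in table3.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap:.4%}")
```

On these six instances the gap stays below 0.05% of the BR objective, the largest being Aethionema Cordifolium with 40 units out of 151 570.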
Conclusion
We model the three stages of Genome Assembly (scaffolding, gap filling and contig extension) as one (global) optimization problem.
The problem is expressed as path finding in a directed graph. The approach assumes that the longest path should correspond to the correct genome sequence, and the experimental results support this hypothesis. One single pass over this path performs several tasks: contig orientation, contig ordering, contig extension and gap filling.
For the above problem we give a Mixed Integer Linear Programming formulation (MILP). Several such formulations have been proposed and compared.
The obtained results are better than those obtained with alternative heuristic approaches.
More work should be done to study the validity of the approach and to increase its scalability.