Flipping letters to minimize the support of a string Giuseppe Lancia, Franca Rinaldi, Romeo Rizzi...
-
Upload
mavis-bruce -
Category
Documents
-
view
216 -
download
0
Transcript of Flipping letters to minimize the support of a string Giuseppe Lancia, Franca Rinaldi, Romeo Rizzi...
Flipping letters to Flipping letters to minimize minimize
the support of a stringthe support of a stringGiuseppe Lancia, Franca Rinaldi, Romeo Rizzi
University of Udine
Outline of talk:
1. Problem definition2. Parametrized complexity3. Polynomial cases4. NP-hardness5. ILP formulations
1. Problem definition1. Problem definition
We are given a string s and a parameter k (e.g., k = 3)
The string has a set of k-mers, its support, K(s)
010010011
K(s) = { 010 }
We are given a string s and a parameter k (e.g., k = 3)
The string has a set of k-mers, its support, K(s)
010010011
K(s) = { 010, 100 }
We are given a string s and a parameter k (e.g., k = 3)
The string has a set of k-mers, its support, K(s)
010010011
K(s) = { 010, 100, 001}
We are given a string s and a parameter k (e.g., k = 3)
The string has a set of k-mers, its support, K(s)
010010011
K(s) = { 010, 100, 001}
We are given a string s and a parameter k (e.g., k = 3)
The string has a set of k-mers, its support, K(s)
010010011
K(s) = { 010, 100, 001}
We are given a string s and a parameter k (e.g., k = 3)
The string has a set of k-mers, its support, K(s)
010010011
K(s) = { 010, 100, 001}
We are given a string s and a parameter k (e.g., k = 3)
The string has a set of k-mers, its support, K(s)
010010011
K(s) = { 010, 100, 001, 011}
We are given a string s and a parameter k (e.g., k = 3)
The string has a set of k-mers, its support, K(s)
010010011
K(s) = { 010, 100, 001, 011} | K(s) | = 4
We are given a string s and a parameter k (e.g., k = 3)
The string has a set of k-mers, its support, K(s)
010010011
K(s) = { 010, 100, 001, 011} | K(s) | = 4
By flipping some bits, we could reduce the number of k-mers
010010011
We are given a string s and a parameter k (e.g., k = 3)
K(s) = { 010, 100, 001, 011} | K(s) | = 4
By flipping some bits, we could reduce the number of k-mers
010010011010010010
S’=
We are given a string s and a parameter k (e.g., k = 3)
K(s) = { 010, 100, 001, 011} | K(s) | = 4
By flipping some bits, we could reduce the number of k-mers
010010011010010010
S’=
K(s’) = { 010, 100, 001} | K(s’) | = 3
We are given a string s and a parameter k (e.g., k = 3)
The Problem :The Problem :
- A string s over an alphabet
IngredienIngredients:ts:
The Problem :The Problem :
- A string s over an alphabet
- A parameter k (k-mer size)
IngredienIngredients:ts:
The Problem :The Problem :
- A string s over an alphabet
- A parameter k (k-mer size)
- A budget B
IngredienIngredients:ts:
The Problem :The Problem :
- A string s over an alphabet
- A parameter k (k-mer size)
- A budget B
IngredienIngredients:ts:
Change at most B letters in s so as resulting s’ has as few distinct k-mers as possible
ObjectiveObjective::
The Problem :The Problem :
- A string s over an alphabet
- A parameter k (k-mer size)
- A budget B
IngredienIngredients:ts:
Find a string s’ with d(s,s’) <= B with the smallest number of kmers
ObjectiveObjective::
s
s’
Motivation :Motivation :
Curiosity-driven (it’s a cute combinatorial problem)Real:Real:
Motivation :Motivation :
Curiosity-driven (it’s a cute combinatorial problem)Real:Real:
Analysis of DNA sequencesFictious:Fictious:
atcgattgatccttta
atc, tcg, cga, gat, ….
3-mers are aminoacid codons. Protein complexity relates to # of codons. Mutations may reduce complexity….
Our results:Our results:
The problem has many parameters (|s|, ||, k, B), we study all versions (when possibly some of the parameters are bounded)
- Polynomial special cases (e.g. for B fixed or both k,|| fixed)
- NP-hard special cases (even k=2 or ||=2)
2. Parametrized 2. Parametrized complexitycomplexity
s NO B NO
s YES B NO
s NO B YES
s YES B YES
NO k NO
YES k NO
NO k YES
YES k YES
s NO B NO
s YES B NO
s NO B YES
s YES B YES
We can assume : k <= |s|
NO k NO
YES k NO
NO k YES
YES k YES
s NO B NO
s YES B NO
s NO B YES
s YES B YES
We can assume : k <= |s|
NO k NO
YES k NO
NO k YES
YES k YES
s NO B NO
s YES B NO
s NO B YES
s YES B YES
We can assume : B <= |s|
NO k NO
YES k NO
NO k YES
YES k YES
NO k NO
YES k NO
NO k YES
YES k YES
s NO B NO
s YES B NO
s NO B YES
s YES B YES
We can assume : || <= |s| (we don’t need any symbol not already in s)
NO k NO
YES k NO
NO k YES
YES k YES
s NO B NO
s YES B NO
s NO B YES
s YES B YES
NO k NO
YES k NO
NO k YES
YES k YES
s NO B NO
s YES B NO
s NO B YES
s YES B YES
Polynomial cases
)1(O
)|(| )1(2 BsO
|)(| sO
)||( )1(2 BskO
NO k NO
YES k NO
NO k YES
YES k YES
s NO B NO
s YES B NO
s NO B YES
s YES B YES
NP-hard cases
)1(O
)|(| )1(2 BsO
|)(| sO
)||( )1(2 BskO
NP-hard NP-hardfor ||=2
NP-hardfor k=2
3. Polynomial cases3. Polynomial cases
The case |The case || and k fixed:| and k fixed:
The case |The case || and k fixed:| and k fixed:
NO k NO
YES k NO
NO k YES
YES k YES
s NO B NO
s YES B NO
s NO B YES
s YES B YES )1(O
)|(| )1(2 BsO
|)(| sO
)||( )1(2 BskO
NP-hard NP-hardfor ||=2
NP-hardfor k=2
The case |The case || and k fixed:| and k fixed:
We start with this subproblem: SUB(A): Given a set of kmers A, can we correct s within budget so as it has all of its kmers in A?
The case |The case || and k fixed:| and k fixed:
s = 01000100101110A = 0100, 1001, 0010 , 0001 B = 3
We start with this subproblem: SUB(A): Given a set of kmers A, can we correct s within budget so as it has all of its kmers in A?
The case |The case || and k fixed:| and k fixed:
s = 01000100101110A = 0100, 1001, 0010 , 0001 B = 3
0100
1001
0010
0001
0100
1001
0010
0001
0100
1001
0010
0001
……
……
……
0100
1001
0010
0001
0100
1001
0010
0001
0100
1001
0010
0001
We start with this subproblem: SUB(A): Given a set of kmers A, can we correct s within budget so as it has all of its kmers in A?
The case |The case || and k fixed:| and k fixed:
s = 01000100101110A = 0100, 1001, 0010 , 0001 B = 3
0100
1001
0010
0001
0100
1001
0010
0001
0100
1001
0010
0001
……
……
……
0100
1001
0010
0001
0100
1001
0010
0001
0100
1001
0010
0001
We start with this subproblem: SUB(A): Given a set of kmers A, can we correct s within budget so as it has all of its kmers in A?
The case |The case || and k fixed:| and k fixed:
s = 01000100101110A = 0100, 1001, 0010 , 0001 B = 3
0100
1001
0010
0001
0100
1001
0010
0001
0100
1001
0010
0001
……
……
……
0100
1001
0010
0001
0100
1001
0010
0001
0100
1001
0010
0001
We start with this subproblem: SUB(A): Given a set of kmers A, can we correct s within budget so as it has all of its kmers in A?
The case |The case || and k fixed:| and k fixed:
s = 01000100101110A = 0100, 1001, 0010 , 0001 B = 3
0100
1001
0010
0001
0100
1001
0010
0001
0100
1001
0010
0001
……
……
……
0100
1001
0010
0001
0100
1001
0010
0001
0100
1001
0010
0001
0100 0 1 ….. 1 1 0
We start with this subproblem: SUB(A): Given a set of kmers A, can we correct s within budget so as it has all of its kmers in A?
The case |The case || and k fixed:| and k fixed:
s = 01000100101110A = 0100, 1001, 0010 , 0001 B = 3
0100
1001
0010
0001
0100
1001
0010
0001
0100
1001
0010
0001
……
……
……
0100
1001
0010
0001
0100
1001
0010
0001
0100
1001
0010
0001
0100 0 1 ….. 1 1 0
We start with this subproblem: SUB(A): Given a set of kmers A, can we correct s within budget so as it has all of its kmers in A?
The case |The case || and k fixed:| and k fixed:
s = 01000100101110A = 0100, 1001, 0010 , 0001 B = 3
0100
1001
0010
0001
0100
1001
0010
0001
0100
1001
0010
0001
……
……
……
0100
1001
0010
0001
0100
1001
0010
0001
0100
1001
0010
0001
0100 0 1 ….. 1 1 0
0
0
0
0
0
3
2
2
0
1
1
0
0
1
1
0
0
1
1
0
0
0
1
1
We start with this subproblem: SUB(A): Given a set of kmers A, can we correct s within budget so as it has all of its kmers in A?
The case |The case || and k fixed:| and k fixed:
s = 01000100101110A = 0100, 1001, 0010 , 0001 B = 3
0100
1001
0010
0001
0100
1001
0010
0001
0100
1001
0010
0001
……
……
……
0100
1001
0010
0001
0100
1001
0010
0001
0100
1001
0010
0001
0100 0 1 ….. 1 1 0
0
0
0
0
0
3
2
2
0
1
1
0
0
1
1
0
0
1
1
0
0
0
1
1
Each path corresponds to a string s’ with all its kmers in A
We start with this subproblem: SUB(A): Given a set of kmers A, can we correct s within budget so as it has all of its kmers in A?
The case |The case || and k fixed:| and k fixed:
s = 01000100101110A = 0100, 1001, 0010 , 0001 B = 3
0100
1001
0010
0001
0100
1001
0010
0001
0100
1001
0010
0001
……
……
……
0100
1001
0010
0001
0100
1001
0010
0001
0100
1001
0010
0001
0100 0 1 ….. 1 1 0
0
0
0
0
0
3
2
2
0
1
1
0
0
1
1
0
0
1
1
0
0
0
1
1
The length of the path is the Hamming distance d(s’, s)
We start with this subproblem: SUB(A): Given a set of kmers A, can we correct s within budget so as it has all of its kmers in A?
The case |The case || and k fixed:| and k fixed:
We start with this subproblem: SUB(A): Given a set of kmers A, can we correct s within budget so as it has all of its kmers in A?
s = 01000100101110A = 0100, 1001, 0010 , 0001 B = 3
0100
1001
0010
0001
0100
1001
0010
0001
0100
1001
0010
0001
……
……
……
0100
1001
0010
0001
0100
1001
0010
0001
0100
1001
0010
0001
0100 0 1 ….. 1 1 0
0
0
0
0
0
3
2
2
0
1
1
0
0
1
1
0
0
1
1
0
0
0
1
1
SUB(A) has a solution iff the shortest path is <= B
The case |The case || and k fixed:| and k fixed:
We start with this subproblem: SUB(A): Given a set of kmers A, can we correct s within budget so as it has all of its kmers in A?
- we can solve SUB(A) in polytime (O|A||||s|) = O(|s|) sincekA 2||
The case |The case || and k fixed:| and k fixed:
We start with this subproblem: SUB(A): Given a set of kmers A, can we correct s within budget so as it has all of its kmers in A?
- we can solve SUB(A) in polytime (O|A||||s|) = O(|s|) since
- There are “only” possible subsets A to try…
)1(2 || Ok
problem is solved in polytime O(|s|)
kA 2||
The case of B fixed:The case of B fixed:
The case of B fixed:The case of B fixed:
NO k NO
YES k NO
NO k YES
YES k YES
s NO B NO
s YES B NO
s NO B YES
s YES B YES )1(O
)|(| )1(2 BsO
|)(| sO
)||( )1(2 BskO
NP-hard NP-hardfor ||=2
NP-hardfor k=2
The case of B fixed:The case of B fixed:
For B fixed, we can try all possible solutions. There are
)|(||| BsO
B
s
possible choices of bits to flip. We can try them all, and count the # of k-mers.
Since ||<=|s|, the way to flip them is bounded by )|(| ksO
4. NP-hardness4. NP-hardness
- Theorem: the problem is NP-hard even for k=2.
- Theorem: the problem is NP-hard even for k=2.
NO k NO
YES k NO
NO k YES
YES k YES
s NO B NO
s YES B NO
s NO B YES
s YES B YES )1(O
)|(| )1(2 BsO
|)(| sO
)||( )1(2 BskO
NP-hard NP-hardfor ||=2
NP-hardfor k=2
- Theorem: the problem is NP-hard even for k=2.
- Proof: reduction from COMPACT BIPARTITE SUBGRAPH (CBS)
INSTANCE: a bipartite graph G=(U,V;E). Integers n, m
PROBLEM: does there exist a set VUX such that
(i)
(ii)
nX ||
mXGE |])[(|
- Theorem: the problem is NP-hard even for k=2.
- Proof: reduction from COMPACT BIPARTITE SUBGRAPH (CBS)
INSTANCE: a bipartite graph G=(U,V;E). Integers n, m
PROBLEM: does there exist a set such that
(i)
(ii)
nX ||
mXGE |])[(|
n = 6, m = 5
VUX
- Theorem: the problem is NP-hard even for k=2.
- Proof: reduction from COMPACT BIPARTITE SUBGRAPH (CBS)
INSTANCE: a bipartite graph G=(U,V;E). Integers n, m
PROBLEM: does there exist a set such that
(i)
(ii)
nX ||
mXGE |])[(|
n = 6, m = 5
VUX
- Theorem: the problem is NP-hard even for k=2.
- Proof: reduction from COMPACT BIPARTITE SUBGRAPH (CBS)
INSTANCE: a bipartite graph G=(U,V;E). Integers n, m
PROBLEM: does there exist a set such that
(i)
(ii)
nX ||
mXGE |])[(|
n = 6, m = 5
( note that CBS is NP-hard because itincludes MAX BALANCED BIP GRAPH,for n = 2t, m = t^2 )
VUX
CBS: does there exist VUX such that nX || mXGE |])[(|
The reduction:The reduction:and
?
a
b
c
d
e
f
g
h
i
l
CBS: does there exist VUX such that nX || mXGE |])[(|
The reduction:The reduction:and
?
a
b
c
d
e
f
g
h
i
l
Let = {a,b,c,d,e,f,g,h,i,l} U { , } B = |E| - m
CBS: does there exist VUX such that nX || mXGE |])[(|
The reduction:The reduction:and
?
a
b
c
d
e
f
g
h
i
l
Let = {a,b,c,d,e,f,g,h,i,l} U { , } B = |E| - m
IDEA: Encode an edge (i,j) as …ij…
and make all k-mers x, x and unavoidable(i.e., insert a LOT of each of them in s)
CBS: does there exist VUX such that nX || mXGE |])[(|
The reduction:The reduction:and
?
a
b
c
d
e
f
g
h
i
l
Let = {a,b,c,d,e,f,g,h,i,l} U { , } B = |E| - m
The only kmers that can be destroyed are of the form x or x, and this is achieved by “flippling” the into a . This corresponds to removing theedge.
ij.. ...ij
IDEA: Encode an edge (i,j) as …ij…
and make all k-mers x, x and unavoidable(i.e., insert a LOT of each of them in s)
CBS: does there exist VUX such that nX || mXGE |])[(|
The reduction:The reduction:and
?
a
b
c
d
e
f
g
h
i
l
Let = {a,b,c,d,e,f,g,h,i,l} U { , } B = |E| - m
The set of kmers of type x or xwhich remain define the set X whichcovers at least m edges
IDEA: Encode an edge (i,j) as …ij…
and make all k-mers x, x and unavoidable(i.e., insert a LOT of each of them in s)
The only kmers that can be destroyed are of the form x or x, and this is achieved by “flippling” the into a . This corresponds to removing theedge.
ij.. ...ij
CBS: does there exist VUX such that nX || mXGE |])[(|
The reduction:The reduction:and
?
a
b
c
d
e
f
g
h
i
l
Let = {a,b,c,d,e,f,g,h,i,l} U { , } B = |E| - m
IDEA: Encode an edge (i,j) as …ij…
and make all k-mers x, x and unavoidable(i.e., insert a LOT of each of them in s)
S =aa. . .aabb. . .bb. . .lll. . .lagbgei
-With similar reductions, we can also prove the
Theorem: the problem is NP-hard even for || = 2
5. Integer Linear 5. Integer Linear ProgrammingProgrammingformulationsformulations
Let K be the set of all possible kmers
hx
iz
Define a 0/1 variable in K
and a 0/1 variable for each position
x
Exponential-sizeExponential-size formulation (formulation (={0,1})={0,1})
K={ 000, 001, 010, 011, 100, 101, 110, 111 }
||,...,1 si
x min
Exponential-sizeExponential-size formulation (formulation (={0,1})={0,1})
x min
||
1
s
ii Bz
Exponential-sizeExponential-size formulation (formulation (={0,1})={0,1})
)1()1(),(),(
kzzxrEQLi
irDIFFi
i
x min
||
1
s
ii Bz
1||,...,1 ksr
Exponential-sizeExponential-size formulation (formulation (={0,1})={0,1})
Exponential n. of variables/constraints
pricing & separation problems
Our P&S strategy is exponential in the general case (polynomial for k fixed)
}0:{ * xA
Exponential n. of variables/constraints
pricing & separation problems
Our P&S strategy is exponential in the general case (polynomial for k fixed)
Fractional solutions may lead to effective heuristics (e.g., solving SUB(A) with
Polynomial-sizePolynomial-size formulation (formulation (={0,1})={0,1})
ijw
Polynomial-sizePolynomial-size formulation (formulation (={0,1})={0,1})
In addition to variables ,
there are variables for positions i and j saying
“is the kmer starting at i identical to the kmer starting at j ? “
iz
The variables w depend on s and z via linear constraints
hx
ijw
1iu
To count only kmers that have no identical kmer following them:
if all kmers starting after i are different from kmer at i
Polynomial-sizePolynomial-size formulation (formulation (={0,1})={0,1})
In addition to variables ,
there are variables for positions i and j saying
“is the kmer starting at i identical to the kmer starting at j ? “
iz
The variables w depend on s and z via linear constraints
Polynomial-sizePolynomial-size formulation (formulation (={0,1})={0,1})
-The formulation (not show) is basically a big boolean formula
-It yields a poor bound compared to the exponential formulation
-It can be improved in many ways (expecially via valid cuts)
-At this stage it’s not clear which method is best for not too small instances
-We’ll run experiments with variants of both formulations
<EOT><EOT>