A totally unimodular view of structured sparsity
Volkan Cevher ([email protected])
Laboratory for Information and Inference Systems (LIONS)
École Polytechnique Fédérale de Lausanne (EPFL), Switzerland

DISCML (NIPS), December 13, 2014

Joint work with Marwa El Halabi, Luca Baldassarre and Baran Gözcü @ LIONS;
Anastasios Kyrillidis and Bubacarr Bah @ UT Austin; Nirav Bhan @ MIT
Outline
Total unimodularity in discrete optimization
From sparsity to structured sparsity
Convex relaxations for structured sparse recovery
Enter nonconvexity
Conclusions
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 2/ 45
Integer linear programming

Discrete optimization: Search for an optimum object within a finite collection of objects.

Integer linear program: Many important discrete optimization problems can be formulated as an integer linear program

β♮ ∈ arg max_{β ∈ Z^m} { θᵀβ : Mβ ≤ c, β ≥ 0 }  (ILP)

NP-Hard (in general):
- vertex cover, set packing, maximum flow, traveling salesman, boolean satisfiability.

A general approach: Attempt the following convex relaxation

β⋆ ∈ arg max_{β ∈ R^m} { θᵀβ : Mβ ≤ c, β ≥ 0 }  (LP)

The relaxation obtains an upper bound: θᵀβ⋆ ≥ θᵀβ♮.

Polyhedra & polytopes: P = {β | Mβ ≤ c, β ≥ 0} (β ∈ R^m, c ∈ R^l). A polytope is a bounded polyhedron.

[Figure: the polytope P in the (β1, β2) plane with objective direction θ, showing the LP optimum β⋆ and the integer optimum β♮.]

Observation: When every vertex of P is integer, LP is a "correct" relaxation (β⋆ = β♮).
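The observation above can be tried numerically. The sketch below (illustrative matrix, objective, and budgets; not from the slides) solves the LP relaxation of a small ILP whose constraint matrix is an interval matrix, hence TU, with SciPy; since c is integer, the LP optimum lands on an integer vertex.

```python
# Sketch: LP relaxation of an ILP with a TU constraint matrix (SciPy).
# With M TU and c integer, the LP vertices are integer, so the LP optimum
# is already a valid ILP solution.
import numpy as np
from scipy.optimize import linprog

# Interval matrix: consecutive ones in each row -> totally unimodular.
M = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1]])
c = np.array([1, 2, 1])               # integer right-hand side
theta = np.array([1.0, 2.0, 1.5, 1.0])

# linprog minimizes, so negate theta to maximize theta^T beta.
res = linprog(-theta, A_ub=M, b_ub=c, bounds=[(0, None)] * 4, method="highs")
beta = res.x

assert np.allclose(beta, np.round(beta))  # integer vertex, as TU theory predicts
```

Here the unique LP optimum is β = (0, 1, 1, 0), so no branch-and-bound step is ever needed.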
A sufficient condition

The polyhedron P = {β | Mβ ≤ c, β ≥ 0} has integer vertices when M is TU and c is integer.

Definition (Total unimodularity): A matrix M ∈ R^{l×m} is totally unimodular (TU) iff the determinant of every square submatrix of M is 0 or ±1.

Correctness of LP [23]: When M is TU and c is integer, the LP

max_{β ∈ R^m} { θᵀβ : Mβ ≤ c, β ≥ 0 }

has integer optimal solutions (i.e., ILP ⊆ LP).

Verifying whether a matrix is TU is in P [31].

TU matrices are not rare!
- Regular matroids have TU representations [29]
- Network flow problems & interval constraints involve TU matrices [23]
- Incidence matrices of undirected bipartite graphs are TU [23]

Computational complexity of LP
- Polynomial time in l (the number of constraints) and m (the ambient dimension)
- An IPM performs O(√l log(l/ε)) iterations (l > m) with up to O(m²l) operations each, where ε is the absolute solution accuracy
- What if l is exponentially large?
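The definition can be checked directly for tiny matrices. A brute-force sketch (illustrative; practical TU tests run in polynomial time via Seymour's decomposition, cf. [31]):

```python
# Sketch: brute-force total-unimodularity check.  Every square submatrix
# must have determinant in {-1, 0, +1}.  Exponential in the matrix size.
from itertools import combinations

def det(a):
    """Integer determinant by cofactor expansion (fine for tiny matrices)."""
    n = len(a)
    if n == 1:
        return a[0][0]
    return sum((-1) ** j * a[0][j] * det([row[:j] + row[j + 1:] for row in a[1:]])
               for j in range(n))

def is_totally_unimodular(M):
    l, m = len(M), len(M[0])
    for k in range(1, min(l, m) + 1):
        for rows in combinations(range(l), k):
            for cols in combinations(range(m), k):
                sub = [[M[i][j] for j in cols] for i in rows]
                if det(sub) not in (-1, 0, 1):
                    return False
    return True

# An interval matrix (consecutive ones in each row) is TU:
assert is_totally_unimodular([[1, 1, 0], [0, 1, 1], [1, 1, 1]])
# A classic non-TU example: this 0/1 matrix has determinant 2.
assert not is_totally_unimodular([[1, 1, 0], [0, 1, 1], [1, 0, 1]])
```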
A weaker sufficient condition

Submodularity & the submodular polyhedron [15]: F : 2^V → R is submodular iff it has the following diminishing-returns property:

F(S ∪ {e}) − F(S) ≥ F(T ∪ {e}) − F(T), ∀S ⊆ T ⊆ V, ∀e ∈ V \ T.

The submodular polyhedron is defined as

P(F) := {β ∈ R^m | ∀S ⊆ V, βᵀ1_S ≤ F(S)},

where 1_S is the support indicator vector, i.e., (1_S)_i = 1 if i ∈ S and 0 otherwise.
- We cannot verify submodularity in polynomial time [28].
- The submodular polyhedron is TDI: LP is a "correct" relaxation of ILP.

Total dual integrality (TDI) [17]: A rational system Mβ ≤ c is called TDI when, for each integer θ for which the primal objective is finite, the dual problem

min_{α ∈ R^l} { αᵀc : α ≥ 0, αᵀM = θᵀ }

has an integer optimal solution.
- A polynomial-time (in l and m) algorithm can verify whether Mβ ≤ c is TDI [12].

Structure matters! LP is efficiently solvable on the submodular polyhedron P(F).
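As a concrete illustration (not from the slides), a coverage function F(S) = |∪_{e∈S} A_e| is submodular, and the diminishing-returns inequality can be verified exhaustively on a tiny ground set; for a general oracle this exhaustive check is exponential, in line with the hardness remark above [28].

```python
# Sketch: exhaustively verifying diminishing returns for a small coverage
# function F(S) = |union of the sets indexed by S|, a standard submodular
# function.  The sets A_e below are illustrative.
from itertools import chain, combinations

V = [0, 1, 2]
cover = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c", "d"}}

def F(S):
    return len(set().union(*(cover[e] for e in S))) if S else 0

def subsets(it):
    s = list(it)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

ok = all(
    F(set(S) | {e}) - F(set(S)) >= F(set(T) | {e}) - F(set(T))
    for T in subsets(V)
    for S in subsets(T)       # S subset of T
    for e in V if e not in T  # e outside T
)
assert ok
```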
In the rest of the talk...

We can use these concepts in obtaining
- tight convex relaxations
- efficient nonconvex projections
for supervised learning and inverse problems.

Running example: b = Ax♮ + w, with A of size n × p.

[Figure: the linear observation model b = Ax in (x1, x2, x3) coordinates.]

Applications: Machine learning, signal processing, theoretical computer science...

A difficult estimation challenge when n < p. Nullspace (null) of A: x♮ + δ → b, ∀δ ∈ null(A).
- Needle in a haystack: We need additional information on x♮!
Outline
Total unimodularity in discrete optimization
From sparsity to structured sparsity
Convex relaxations for structured sparse recovery
Enter nonconvexity
Conclusions
Three key insights

1. A sparse or compressible x♮: not sufficient alone
2. Recovery: tractable & stable
3. The projection A: information preserving

[Figure: b = Ax♮ + w, with A of size n × p.]

Typical goals:
1. Find x⋆ to minimize ‖x⋆ − x♮‖
2. Find x⋆ to minimize L(x⋆(a), x♮(a) + w)
Swiss army knife of signal models

Definition (s-sparse vector): A vector x ∈ R^p is s-sparse, i.e., x ∈ Σ_s, if it has at most s non-zero entries:

‖x♮‖_0 := |{i : x♮_i ≠ 0}| = s.

Sparse representations: y♮ has sparse transform coefficients x♮.
- Basis representations Ψ ∈ R^{p×p}: wavelets, DCT, ...
- Frame representations Ψ ∈ R^{m×p}, m > p: Gabor, curvelets, shearlets, ...
- Other dictionary representations...

[Figure: y♮ expressed as a dictionary times the sparse coefficient vector x♮.]
Sparse representations strike back!

[Figure: b = Ax♮, with b of size n × 1; restricted to the support, A has size n × s and x♮ size s × 1.]

- b ∈ R^n, A ∈ R^{n×p}, and n < p
- Ψ ∈ R^{p×p}, x♮ ∈ Σ_s, and s < n < p

Impact: Restricting A to the support columns leads to an overcomplete system.
Enter sparsity

A combinatorial approach for estimating x♮ from b = Ax♮ + w: We may consider the estimator with the least number of non-zero entries. That is,

x⋆ ∈ arg min_{x ∈ R^p} { ‖x‖_0 : ‖b − Ax‖_2 ≤ κ }  (P0)

with some κ ≥ 0. If κ = ‖w‖_2, then x♮ is a feasible solution.

P0 has the following characteristics:
- sample complexity: O(s)
- computational effort: NP-Hard
- stability: no

Tightest convex relaxation: ‖x‖_0** is the biconjugate (the Fenchel conjugate of the Fenchel conjugate), where the Fenchel conjugate is f*(y) := sup_{x ∈ dom(f)} xᵀy − f(x).

A technicality: Restrict x♮ ∈ [−1, 1]^p. Over the unit ℓ∞-ball, ‖x‖_1 is the convex envelope of ‖x‖_0.
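The convex-envelope claim can be sanity-checked numerically in one dimension: conjugating ‖x‖_0 over [−1, 1] by grid search and conjugating again recovers |x|. A sketch with illustrative grids (not from the slides):

```python
# Sketch: numerically checking that the biconjugate of ||x||_0 over the
# unit l_inf ball is ||x||_1, in one dimension, by grid search.
import numpy as np

xs = np.linspace(-1.0, 1.0, 401)   # the unit l_inf ball in 1-D (contains 0)
ys = np.linspace(-5.0, 5.0, 2001)  # dual grid (contains +-1)

f = np.where(xs == 0.0, 0.0, 1.0)  # ||x||_0 restricted to [-1, 1]

# Fenchel conjugate f*(y) = sup_{|x|<=1} x*y - f(x)  (equals max(0, |y|-1))
fstar = np.max(np.outer(ys, xs) - f[None, :], axis=1)
# Biconjugate f**(x) = sup_y x*y - f*(y)
fss = np.max(np.outer(xs, ys) - fstar[None, :], axis=1)

assert np.allclose(fss, np.abs(xs))  # the envelope is |x| = ||x||_1
```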
The role of convexity: tractable & stable recovery

A convex candidate solution for b = Ax♮ + w:

x⋆ ∈ arg min_{x ∈ R^p} { ‖x‖_1 : ‖b − Ax‖_2 ≤ ‖w‖_2, ‖x‖_∞ ≤ 1 }.  (SOCP)

Theorem (A model recovery guarantee [27]): Let A ∈ R^{n×p} be a matrix of i.i.d. Gaussian random variables with zero mean and variance 1/n. For any t > 0, with probability at least 1 − 6 exp(−t²/26), we have

‖x⋆ − x♮‖_2 ≤ [ 2√(2s log(p/s) + (5/4)s) / ( √n − √(2s log(p/s) + (5/4)s) − t ) ] ‖w‖_2 =: ε,

when ‖x♮‖_0 ≤ s.

Observations:
- perfect recovery (i.e., ε = 0) with n ≥ 2s log(p/s) + (5/4)s whp when w = 0;
- an ε-accurate solution in k = O(√(2p + 1) log(1/ε)) iterations via an IPM,¹ with each iteration requiring the solution of a structured n × 2p linear system;²
- robust to noise.

¹ For a rigorous primal-dual algorithm for this class of problems, see my NIPS 2014 paper [30].
² When w = 0, the IPM complexity (# of iterations × cost per iteration) amounts to O(n²p^{1.5} log(1/ε)).
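For concreteness, a sketch (illustrative sizes and seed, not from the slides) of the noiseless ℓ1 estimator recast as the LP min 1ᵀ(u + v) s.t. A(u − v) = b, u, v ≥ 0, solved with SciPy. The asserts check feasibility and that the estimate's ℓ1 norm does not exceed that of the true signal, which optimality guarantees.

```python
# Sketch: basis pursuit (min ||x||_1 s.t. Ax = b) as an LP via the
# standard x = u - v splitting, u, v >= 0.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, p = 15, 30
A = rng.standard_normal((n, p)) / np.sqrt(n)   # i.i.d. N(0, 1/n) entries
x_true = np.zeros(p)
x_true[[3, 17]] = [1.0, -1.0]                  # a 2-sparse ground truth
b = A @ x_true

c = np.ones(2 * p)                             # objective 1^T (u, v)
A_eq = np.hstack([A, -A])                      # A u - A v = b
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * p), method="highs")
x_hat = res.x[:p] - res.x[p:]

assert np.linalg.norm(A @ x_hat - b) < 1e-6               # feasible
assert np.abs(x_hat).sum() <= np.abs(x_true).sum() + 1e-6  # optimal objective
```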
The role of the matrix A: preserving information

Proposition (Condition for exact recovery in the noiseless case): We have successful recovery with x⋆ ∈ arg min_{x ∈ R^p} { f(x) : b = Ax, ‖x‖_∞ ≤ 1 }, i.e., δ := x⋆ − x♮ = 0, if and only if null(A) ∩ D_f(x♮) = {0}.

Assume that the constraint ‖x‖_∞ ≤ 1 is inactive.

Descent cone: D_f(x♮) := cone({ x : f(x♮ + x) ≤ f(x♮) }).

[Figure: the affine feasible set {x : b = Ax} = x♮ + null(A) and the sublevel set {x : f(x) ≤ f(x♮)}; recovery succeeds when null(A) meets the descent cone D_f(x♮) only at the origin.]
The role of the matrix A: preserving information

P{ x⋆ = x♮ } = P{ null(A) ∩ D_f(x♮) = {0} }

Definition (Statistical dimension [2]³): Let C ⊆ R^p be a closed convex cone. The statistical dimension of C is defined as

d(C) := E[ ‖proj_C(g)‖_2² ],

where g is a standard Gaussian vector.

Theorem (Approximate kinematic formula [2]): Let A ∈ R^{n×p}, n < p, be a matrix of i.i.d. standard Gaussian random variables, and let C ⊆ R^p be a closed convex cone. For η ∈ (0, 1), we have

n ≥ d(C) + c_η √p ⇒ P{ null(A) ∩ C = {0} } ≥ 1 − η;
n ≤ d(C) − c_η √p ⇒ P{ null(A) ∩ C = {0} } ≤ η,

where c_η := √(8 log(4/η)).

We can compute d(C) ≲ 2s log(p/s) + (5/4)s for C = D_{‖·‖_1}(x♮) when x♮ ∈ Σ_s.

³ The statistical dimension is closely related to the Gaussian complexity [7], Gaussian width [10], mean width [32], and Gaussian squared complexity [9].
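The definition is easy to estimate by Monte Carlo for cones with a closed-form projection. A sketch (illustrative, not from the slides) for the nonnegative orthant C = R^p_+, where proj_C clips negative coordinates and the exact statistical dimension is p/2:

```python
# Sketch: Monte Carlo estimate of d(C) = E ||proj_C(g)||_2^2 for the
# nonnegative orthant.  Each Gaussian coordinate survives projection with
# probability 1/2, giving d(C) = p/2 exactly.
import numpy as np

rng = np.random.default_rng(1)
p, trials = 20, 20000
g = rng.standard_normal((trials, p))
d_hat = np.mean(np.sum(np.maximum(g, 0.0) ** 2, axis=1))

assert abs(d_hat - p / 2) < 0.5  # close to the exact value p/2 = 10
```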
Beyond sparsity towards model-based or structured sparsity

- The following signals can look the same from a sparsity perspective!
- In reality, these signals have additional structure beyond simple sparsity.

[Figure: a sparse image, the wavelet coefficients of a natural image, a spike train, and a background-subtracted image; their sorted coefficient magnitudes |x♮|_(i) all vanish after the first s of p indices.]
Beyond sparsity towards model-based or structured sparsity

Sparsity model: The union of all (p choose s) s-dimensional canonical subspaces of R^p.

Structured sparsity model: A particular union of m_s s-dimensional canonical subspaces (m_s ≤ (p choose s)).

Model-based or structured sparsity: Structured sparsity models are discrete structures describing the interdependency between the non-zero coefficients of a vector.
Three upshots of structured sparsity

Key properties of the statistical dimension [2]:
- The statistical dimension is invariant under unitary transformations (rotations).
- Let C1 and C2 be closed convex cones. If C1 ⊆ C2, then d(C1) ≤ d(C2).

1. The smaller the statistical dimension is, the less we need to sample.
2. The smaller the statistical dimension is, the better we can denoise.
3. The smaller the statistical dimension is, the better we can enforce structure.

Example (If D_{f1}(x♮) ⊆ D_{f2}(x♮) ⊆ R^n, then d(D_{f1}(x♮)) ≤ d(D_{f2}(x♮))):
f1(x) := max{|x1|, |x2|} + |x3|, f2(x) := ‖x‖_1, x♮ = [1, −1, 0]ᵀ.
Observations:
1. n1 < n2 for x♮
2. n1 > n2 for z♮ = [0, 0, 1]ᵀ

- Reduced sample complexity: phase transition at the statistical dimension.
- Better noise robustness: denoising capability depends on the statistical dimension:

max_{σ>0} E[ ‖prox_f(x♮ + σw, σλ) − x♮‖² ] / σ² ≤ d(λ D_f(x♮));

minimize a bound on the minimax risk via the regularization parameter λ [27].
- Better interpretability: geometry can enhance interpretability; influence the recovered support via a customized convex geometry (e.g., the ℓ1-ball versus the TV-ball).
Outline
Total unimodularity in discrete optimization
From sparsity to structured sparsity
Convex relaxations for structured sparse recovery
Enter nonconvexity
Conclusions
A simple template for linear inverse problems

Find the "sparsest" x subject to structure and data.

- Sparsity: We can generalize this desideratum to other notions of simplicity.
- Structure: We only allow certain sparsity patterns.
- Data fidelity: We have many choices of convex constraints & losses to represent the data; e.g., ‖b − Ax‖_2 ≤ κ.
A convex proto-problem for structured sparsity

A combinatorial approach for estimating x♮ from b = Ax♮ + w: We may consider the sparsest estimator, or its surrogate with a valid sparsity pattern:

x⋆ ∈ arg min_{x ∈ R^p} { ‖x‖_s : ‖b − Ax‖_2 ≤ κ, ‖x‖_∞ ≤ 1 }  (Ps)

with some κ ≥ 0. If κ = ‖w‖_2, then the structured sparse x♮ is a feasible solution.

Sparsity and structure together [14]: Given some weights d ∈ R^d, e ∈ R^p and an integer input c ∈ Z^l, we define

‖x‖_s := min_ω { dᵀω + eᵀs : M [ω; s] ≤ c, 1_supp(x) = s, ω ∈ {0, 1}^d }

for all feasible x, and ∞ otherwise. The parameter ω is useful for latent modeling.

A convex candidate solution for b = Ax♮ + w: We use the convex estimator based on the tightest convex relaxation of ‖x‖_s:

x⋆ ∈ arg min_{x ∈ dom(‖·‖_s)} { ‖x‖_s** : ‖b − Ax‖_2 ≤ κ }

with some κ ≥ 0, where dom(‖·‖_s) := {x : ‖x‖_s < ∞}.
Tractability & tightness of biconjugation

Proposition (Hardness of conjugation): Let F : 2^P → R ∪ {+∞} be a set function defined on the support s = supp(x). The conjugate of F over the unit infinity ball ‖x‖_∞ ≤ 1 is given by

g*(y) = sup_{s ∈ {0,1}^p} |y|ᵀs − F(s).

Observations:
- F a general set function — computation: NP-Hard.
- F(s) = ‖x‖_s — computation: an ILP in general. However, if M is TU or (M, c) is TDI, then we obtain tight convex relaxations with an LP ("usually" tractable). Otherwise, relax to an LP anyway!
- F submodular — computation: polynomial time.
Tree sparsity [21, 13, 6, 33]

[Figure: wavelet coefficients and their wavelet tree; a valid selection of nodes (a rooted connected subtree) versus an invalid one. For the three-node example, G_H = {{1, 2, 3}, {2}, {3}}.]

Structure: We seek the sparsest signal with a rooted connected subtree support.

Linear description: A valid support satisfies s_parent ≥ s_child over the tree T:

T 1_supp(x) := Ts ≥ 0,

where T is the directed edge-node incidence matrix, which is TU.

Biconjugate: ‖x‖_s** = min_{s ∈ [0,1]^p} { 1ᵀs : Ts ≥ 0, |x| ≤ s } = Σ_{G ∈ G_H} ‖x_G‖_∞

for x ∈ [−1, 1]^p, and ∞ otherwise. The sets G ∈ G_H are defined as each node together with all its descendants.
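The linear description can be illustrated on a toy tree (root 0 with children 1 and 2, an illustrative example, not from the slides): each row of T carries +1 at the parent and −1 at the child, so Ts ≥ 0 encodes s_parent ≥ s_child.

```python
# Sketch: directed edge-node incidence matrix of a tiny rooted tree, and
# the rooted-connectivity test T @ s >= 0.
import numpy as np

# Edges (parent, child): (0, 1) and (0, 2); one row per edge.
T = np.array([[1, -1, 0],
              [1, 0, -1]])

valid = np.array([1, 1, 0])    # {root, left child}: a rooted subtree
invalid = np.array([0, 1, 0])  # a child selected without its parent

assert np.all(T @ valid >= 0)
assert not np.all(T @ invalid >= 0)
```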
Tree sparsity example: 1:100 compressive sensing [30, 1]

[Figure: "World" test image (1 Gpix), Lac Léman, "World" (10 Mpix); sparse recovery, PSNR = 31.83 dB, versus tree-sparse recovery, PSNR = 32.48 dB.]
Tree sparsity example: TV & TU-relax 1:15 compression [30, 1]

[Figure: original images and reconstructions by BP, TU-relax, TV, TV with BP, and TV with TU-relax; ℓ2 error versus the regularization parameter α ∈ [0, 1] for tv-BP and tv-TU-relax, with the minimum errors marked.]
Group knapsack sparsity [35, 18, 16]

[Figure: a sparse x ∈ R^8 with support indicator 1_supp(x) and groups G1 = {1}, G2 = {1, 2, 3, 4, 5}, G3 = {5, 6, 7, 8}, G4 = {2, 5, 7}, G5 = {6, 8}, together with the knapsack constraints vector c_u.]

Structure: We seek the sparsest signal with group allocation constraints.

Linear description: A valid support obeys budget constraints over G:

Bᵀs ≤ c_u,

where B is the biadjacency matrix of G, i.e., B_ij = 1 iff the i-th coefficient is in G_j. When B is an interval matrix or G has a loopless group intersection graph, it is TU.
Remark: We can also budget a lower bound, c_ℓ ≤ Bᵀs ≤ c_u.

For interval groups (windows of length Δ), Bᵀ is the (p − Δ + 1) × p matrix with consecutive ones in each row:

Bᵀ = [ 1 1 ⋯ 1 1 0 0 ⋯ 0
       0 1 1 ⋯ 1 1 0 ⋯ 0
       ⋮
       0 ⋯ 0 0 1 1 ⋯ 1 1 ].

Biconjugate: ‖x‖_s** = ‖x‖_1 if x ∈ [−1, 1]^p and Bᵀ|x| ≤ c_u, and ∞ otherwise.
For the neuronal spike example, we have c_u = 1.

[Figure: the unit balls ‖x‖_s** ≤ 1, ‖x‖_s** ≤ 1.5, and ‖x‖_s** ≤ 2 for G = {{1, 2}, {2, 3}}.]
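A sketch of the feasibility test behind this relaxation for the spike-train setting (the window length Δ = 2 and the signal size are illustrative, not from the slides): x must lie in [−1, 1]^p and satisfy Bᵀ|x| ≤ c_u, where Bᵀ is the interval matrix of length-Δ windows and c_u = 1 limits each window to one unit of spike mass.

```python
# Sketch: group-knapsack feasibility with sliding-window interval groups.
import numpy as np

p, delta = 5, 2
# Row i of B^T has ones at columns i, ..., i + delta - 1 (consecutive ones).
Bt = np.array([[1 if 0 <= j - i < delta else 0 for j in range(p)]
               for i in range(p - delta + 1)])
cu = np.ones(p - delta + 1)

def feasible(x):
    x = np.asarray(x, dtype=float)
    return bool(np.all(np.abs(x) <= 1) and np.all(Bt @ np.abs(x) <= cu + 1e-12))

assert feasible([1, 0, 1, 0, 1])       # spikes separated by at least delta
assert not feasible([1, 1, 0, 0, 0])   # two adjacent spikes violate a window
```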
Group knapsack sparsity example: a stylized spike train

- Basis pursuit (BP): ‖x‖_1
- TU-relax (TU): ‖x‖_s** = ‖x‖_1 if x ∈ [−1, 1]^p and Bᵀ|x| ≤ c_u, and ∞ otherwise

[Figure: relative error ‖x − x⋆‖_2 / ‖x⋆‖_2 versus the undersampling ratio n/p for BP and the TU relaxation; recovery at n = 0.18p showing x♮, the BP solution x_BP, and the TU solution x_TU.]

Relative errors: ‖x♮ − x_BP‖_2 / ‖x♮‖_2 = .200, ‖x♮ − x_TU‖_2 / ‖x♮‖_2 = .067.
Group knapsack sparsity: a simple variation

[Figure: the same sparse x ∈ R^8, support indicator, groups G1, ..., G5, and knapsack constraints vector as before.]

Structure: We seek the signal with the minimal overall group allocation.

Objective: 1ᵀs → ‖x‖_ω = min_{ω ∈ Z_{++}} ω if x ∈ [−1, 1]^p and Bᵀ|x| ≤ ω1, and ∞ otherwise.

Linear description: A valid support obeys budget constraints over G:

Bᵀs ≤ ω1,

where B is the biadjacency matrix of G, i.e., B_ij = 1 iff the i-th coefficient is in G_j. When B is an interval matrix or G has a loopless group intersection graph, it is TU.

Biconjugate: ‖x‖_s** = max_{G ∈ G} ‖x_G‖_1 if x ∈ [−1, 1]^p, and ∞ otherwise.

Remark: The regularizer is known as the exclusive Lasso [35, 26].
Group cover sparsity: Minimal group cover [5, 25, 19]
G2 = {1, 2, 3, 4, 5}
x1
x2
x3
x4
x5
x6
x7
x8
0
1
0
0
1
0
1
0
1supp(x)
supportindicator vector
sparse
0
0
0
1
0
group “support”indicator vector
Ê
group sparse
G3 = {5, 6, 7, 8}
G4 = {2, 5, 7}
G5 = {6, 8}
G1 = {1}
Structure: We seek the signal covered by a minimal number of groups.
Objective: 1T s → dTω
Linear description: At least one group containing a sparse coefficient is selected
Bω ≥ s
where B is the biadjacency matrix of G, i.e., Bij = 1 iff i-th coefficient is in Gj .When B is an interval matrix, or G has a loopless group intersection graph it is TU.
Remark: Weights d can depend on the sparsity within each groups (not TU) [14].
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 28/ 45
Group cover sparsity: Minimal group cover [5, 25, 19]
G = {{1, 2}, {2, 3}}, unit group weights d = 1.
Structure: We seek the signal covered by a minimal number of groups.
Objective: 1T s → dTω
Linear description: At least one group containing a sparse coefficient is selected
Bω ≥ s
where B is the biadjacency matrix of G, i.e., Bij = 1 iff i-th coefficient is in Gj .When B is an interval matrix, or G has a loopless group intersection graph it is TU.
Biconjugate: ‖x‖∗∗ω = minω∈[0,1]M {dTω : Bω ≥ |x|} for x ∈ [−1, 1]p,∞ otherwise
?= minvi∈Rp{∑M
i=1 di‖vi‖∞ : x =∑M
i=1 vi , ∀supp(vi) ⊆ Gi},Remark: Weights d can depend on the sparsity within each groups (not TU) [14].
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 28/ 45
Group cover sparsity: Minimal group cover [5, 25, 19]
G = {{1, 2}, {2, 3}}, unit group weights d = 1.
Structure: We seek the signal covered by a minimal number of groups.
Objective: 1T s → dTω
Linear description: At least one group containing a sparse coefficient is selected
Bω ≥ s
where B is the biadjacency matrix of G, i.e., Bij = 1 iff i-th coefficient is in Gj .When B is an interval matrix, or G has a loopless group intersection graph it is TU.
Biconjugate: ‖x‖∗∗ω = minω∈[0,1]M {dTω : Bω ≥ |x|} for x ∈ [−1, 1]p,∞ otherwise?= minvi∈Rp{
∑Mi=1 di‖vi‖∞ : x =
∑Mi=1 vi , ∀supp(vi) ⊆ Gi},
Remark: Weights d can depend on the sparsity within each groups (not TU) [14].
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 28/ 45
Group cover sparsity: Minimal group cover [5, 25, 19]
G = {{1, 2}, {2, 3}}, unit group weights d = 1.
Structure: We seek the signal covered by a minimal number of groups.
Objective: 1T s → dTω
Linear description: At least one group containing a sparse coefficient is selected
Bω ≥ s
where B is the biadjacency matrix of G, i.e., Bij = 1 iff the i-th coefficient is in Gj. When B is an interval matrix, or G has a loopless group intersection graph, it is TU.
Biconjugate: ‖x‖∗∗ω = minω∈[0,1]M {dTω : Bω ≥ |x|} for x ∈ [−1, 1]p, ∞ otherwise
= min_{vi∈Rp} {∑_{i=1}^{M} di‖vi‖∞ : x = ∑_{i=1}^{M} vi, supp(vi) ⊆ Gi ∀i} (latent group lasso form)
Remark: Weights d can depend on the sparsity within each group (not TU) [14].
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 28/ 45
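Because an interval biadjacency matrix is TU, the LP relaxation of the minimal group cover ILP already has an integral vertex solution. A minimal numerical sketch with SciPy on the toy instance G = {{1, 2}, {2, 3}} (variable names are mine, not from the talk):

```python
import numpy as np
from scipy.optimize import linprog

# Toy instance from the slide: p = 3 coefficients, groups G = {{1, 2}, {2, 3}}.
# Biadjacency matrix B (p x M): B[i, j] = 1 iff coefficient i+1 lies in group j+1.
B = np.array([[1, 0],
              [1, 1],
              [0, 1]], dtype=float)
s = np.array([1, 0, 1], dtype=float)  # support indicator: coefficients 1 and 3 active
d = np.ones(B.shape[1])               # unit group weights

# LP relaxation of the minimal group cover ILP:
#   min d^T w   s.t.   B w >= s,   0 <= w <= 1
res = linprog(c=d, A_ub=-B, b_ub=-s, bounds=[(0, 1)] * B.shape[1])
w = res.x

# B is an interval matrix, hence TU, so the LP optimum is already integral.
print(w, res.fun)
```

Here both groups are forced in (coefficient 1 only lies in the first group, coefficient 3 only in the second), so the relaxation returns the integral cover ω = (1, 1) of cost 2.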
Budgeted group cover sparsity
Figure: groups G1 = {1}, G2 = {1, 2, 3, 4, 5}, G3 = {5, 6, 7, 8}, G4 = {2, 5, 7}, G5 = {6, 8} over coefficients x1, . . . , x8. The support indicator vector 1supp(x) = (0, 1, 0, 0, 1, 0, 1, 0) (sparse) is covered by the group “support” indicator vector ω = (0, 0, 0, 1, 0) (group sparse): the single group G4 covers the support.
Structure: We seek the sparsest signal covered by G groups.
Objective: dTω → 1T s
Linear description: Each sparse coefficient is covered by at least one of the G selected groups.
Bω ≥ s,1Tω ≤ G
where B is the biadjacency matrix of G, i.e., Bij = 1 iff i-th coefficient is in Gj .
When [B; 1T] (B stacked with the all-ones budget row) is an interval matrix, it is TU.
Biconjugate: ‖x‖∗∗ω = minω∈[0,1]M {‖x‖1 : Bω ≥ |x|, 1Tω ≤ G} for x ∈ [−1, 1]p, ∞ otherwise.
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 29/ 45
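Total unimodularity of a small matrix like [B; 1T] can be checked directly from the definition: every square submatrix must have determinant in {−1, 0, 1}. A brute-force sketch, exponential in the matrix size and only meant for toy instances (helper name is mine):

```python
import numpy as np
from itertools import combinations

def is_totally_unimodular(M, tol=1e-9):
    """Brute-force TU test: every square submatrix has det in {-1, 0, 1}.
    Exponential time -- illustrative only, for tiny matrices."""
    M = np.asarray(M, dtype=float)
    m, n = M.shape
    for k in range(1, min(m, n) + 1):
        for rows in combinations(range(m), k):
            for cols in combinations(range(n), k):
                det = np.linalg.det(M[np.ix_(rows, cols)])
                if abs(det - round(det)) > tol or abs(det) > 1 + tol:
                    return False
    return True

# Interval biadjacency matrix B for G = {{1, 2}, {2, 3}}, stacked with the budget row 1^T:
B = np.array([[1, 0], [1, 1], [0, 1]])
stacked = np.vstack([B, np.ones(2)])
print(is_totally_unimodular(stacked))  # True
```

A counterexample such as the 3-cycle incidence pattern [[1,1,0],[0,1,1],[1,0,1]] (determinant 2) fails the test.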
Budgeted group cover example: Interval overlapping groups
▸ Basis pursuit (BP): ‖x‖1
▸ Sparse group Lasso (SGLq): (1− α)∑G∈G √|G| ‖xG‖q + α‖x‖1
▸ TU-relax (TU): ‖x‖∗∗ω = minω∈[0,1]M {‖x‖1 : Bω ≥ |x|, 1Tω ≤ G} for x ∈ [−1, 1]p, ∞ otherwise.
Figure: recovery error ‖x− x?‖2/‖x?‖2 vs. n/p (curves: BP, SGL, SGL∞, BLGC). Recovery for n = 0.25p, s = 15, p = 200, G = 5 out of M = 29 groups.
Figure: the original signal x\ and the BP, SGL, SGL∞, and TU solutions xBP, xSGL, xSGL∞, xTU.
Relative errors: ‖x\ − xBP‖2/‖x\‖2 = .128, ‖x\ − xSGL‖2/‖x\‖2 = .181, ‖x\ − xSGL∞‖2/‖x\‖2 = .085, ‖x\ − xTU‖2/‖x\‖2 = .058.
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 30/ 45
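The penalties compared above are straightforward to evaluate directly; a minimal sketch of the sparse group Lasso penalty (helper name and toy instance are mine, not from the talk):

```python
import numpy as np

def sparse_group_lasso(x, groups, alpha=0.5, q=2):
    """Sparse group Lasso penalty:
    (1 - alpha) * sum_G sqrt(|G|) * ||x_G||_q  +  alpha * ||x||_1."""
    group_term = sum(np.sqrt(len(g)) * np.linalg.norm(x[list(g)], ord=q)
                     for g in groups)
    return (1 - alpha) * group_term + alpha * np.abs(x).sum()

# Two overlapping interval groups over 3 coefficients (0-indexed):
groups = [[0, 1], [1, 2]]
x = np.array([1.0, 0.0, -1.0])
print(sparse_group_lasso(x, groups, alpha=0.5, q=2))  # sqrt(2) + 1
```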
Group intersection sparsity [20, 34, 3]
G = {{1, 2, 3}, {2}, {3}}, unit group weights d = 1.
Structure: We seek the signal intersecting with minimal number of groups.
Objective: 1T s → dTω (submodular)
Linear description: All groups containing a sparse coefficient are selected
H ks ≤ ω,∀k ∈ P
where Hk(i, j) = 1 if j = k and j ∈ Gi, and 0 otherwise, which is TU.
Biconjugate: ‖x‖∗∗ω = minω∈[0,1]M {dTω : Hk|x| ≤ ω, ∀k ∈ P} = ∑G∈G ‖xG‖∞ for x ∈ [−1, 1]p, ∞ otherwise.
Remark: For hierarchical GH , group intersection and tree sparsity models coincide.
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 31/ 45
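For this TU model the LP collapses to a closed form: each ω_i must dominate ‖x_{G_i}‖∞, so the biconjugate with unit weights is just the sum of groupwise ℓ∞ norms. A minimal sketch (helper name is mine):

```python
import numpy as np

def group_intersection_norm(x, groups):
    """Convex envelope of group-intersection sparsity on [-1, 1]^p:
    sum over groups of the infinity norm of x restricted to that group."""
    return sum(np.max(np.abs(x[list(g)])) for g in groups)

# Example from the slide, 0-indexed: G = {{1, 2, 3}, {2}, {3}}
groups = [{0, 1, 2}, {1}, {2}]
x = np.array([0.5, 0.0, -0.3])
print(group_intersection_norm(x, groups))  # 0.5 + 0.0 + 0.3 = 0.8
```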
Beyond linear costs: Graph dispersiveness
(left) ‖x‖∗∗s = 0 (right) ‖x‖∗∗s ≤ 1 for E = {{1, 2}, {2, 3}} (chain graph)
Structure: We seek a signal dispersive over a given graph G(P, E)
Objective: 1T s → ∑(i,j)∈E si sj (non-linear, supermodular function)
Linearization: ‖x‖s = minz∈{0,1}|E| {∑(i,j)∈E zij : zij ≥ si + sj − 1}
When the edge-node incidence matrix of G(P, E) is TU (e.g., bipartite graphs), it is TU.
Biconjugate: ‖x‖∗∗s = ∑(i,j)∈E (|xi|+ |xj| − 1)+ for x ∈ [−1, 1]p, ∞ otherwise.
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 32/ 45
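The biconjugate above is a sum of truncated hinge terms and is trivial to evaluate; a small sketch on the chain graph from the slide (helper name is mine):

```python
import numpy as np

def dispersive_penalty(x, edges):
    """Biconjugate of the graph-dispersiveness cost on [-1, 1]^p:
    sum over edges (i, j) of max(|x_i| + |x_j| - 1, 0)."""
    return sum(max(abs(x[i]) + abs(x[j]) - 1.0, 0.0) for i, j in edges)

edges = [(0, 1), (1, 2)]  # chain graph on 3 nodes, 0-indexed
print(dispersive_penalty(np.array([1.0, 0.0, 1.0]), edges))  # 0.0: neighbors never both active
print(dispersive_penalty(np.array([1.0, 1.0, 0.0]), edges))  # 1.0: edge (0, 1) violated
```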
Outline
Total unimodularity in discrete optimization
From sparsity to structured sparsity
Convex relaxations for structured sparse recovery
Enter nonconvexity
Conclusions
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 33/ 45
An important alternative
Problem (Projection)
Define Ms,G := {x : 1T s ≤ s, dTω ≤ G, M[ω; s] ≤ c, s = 1supp(x)}.
The projection of x onto Ms,G in ℓq-norm is defined as Pq,Ms,G : Rp → Rp,
Pq,Ms,G (x) ∈ arg minu∈Rp {‖x− u‖qq : u ∈ Ms,G}
▸ x̂ = Pq,Ms,G (x) is the best model-based approximation of x.
▸ The interesting cases are q = 1, 2.
Observation: Model-based approximation corresponds to an ILP
▸ NP-Hard in general (weighted max cover formulation [5])
▸ TU structures play a major role
▸ Pseudo-polynomial time solutions via dynamic programming
Model-based CS [6, 22]: n = O(log |Ms,G|) with iid Gaussian (dense) measurements
▸ n = O(s) for tree structure
▸ iterative projected gradient descent: xk+1 ∈ P2,Ms,G (xk + AT(b−Axk))
Model-based sketching [4]: n = o(s log(p/s)) with expanders (sparse) measurements
▸ n = O(s log(p/s)/ log log p) for tree structure (empirical: n = O(s))
▸ iterative projected median descent: xk+1 ∈ P1,Ms,G (xk + M(b−Axk))
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 34/ 45
Tree sparsity example: Pareto frontier [8, 5]
Figure: approximation error vs. sparsity s (0 to 25) for the Tree model, TU-relax, HGL ℓ1, and HGL ℓ2; the Tree model, TU-relax, and HGL ℓ1 curves coincide (“Just kidding, they are the same.”).
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 35/ 45
CLASH [22]
CLASH = Combinatorial Selection + Least Absolute Shrinkage
▸ Retains the same theoretical guarantees
▸ Handles ILP and matroid structured models
Figure: the CLASH set and the Model-CLASH set (“combinatorial origami”).
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 36/ 45
Outline
Total unimodularity in discrete optimization
From sparsity to structured sparsity
Convex relaxations for structured sparse recovery
Enter nonconvexity
Conclusions
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 37/ 45
Conclusions
Our work: TU modeling framework & convex template & non-convex algorithms
▸ Many more convex programs (not necessarily norms)
▸ TU models: tight convexifications, non-submodular examples
▸ Easy to design and “usually” efficient via an LP
▸ London calling...
Alternatives:
1. Atomic norms [11, 10]
▸ Given a set A, use the biconjugation of g(x) = inf0≤t≤c t + ιtA(x), for c > 0
▸ Reverse engineer the set to obtain structured sparsity
▸ “Usually” tractable since the norm is reverse engineered
2. Monotone submodular penalties and extensions [3]
▸ Tight convexification via the Lovász extension
▸ Reverse engineer the submodular set function (not always possible)
3. ℓq-regularized combinatorial functions [24]
▸ Tight convexification (also explains latent group lasso like norms)
▸ Not always efficiently computable
▸ Reverse engineered and may lose structure, e.g., group knapsack model
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 38/ 45
References I
[1] Ben Adcock, Anders C. Hansen, Clarice Poon, and Bogdan Roman.Breaking the coherence barrier: A new theory for compressed sensing.http://arxiv.org/abs/1302.0561, Feb. 2013.
[2] Dennis Amelunxen, Martin Lotz, Michael B. McCoy, and Joel A. Tropp.Living on the edge: Phase transitions in convex programs with random data.Information and Inference, 3:224–294, 2014.arXiv:1303.6672v2 [cs.IT].
[3] Francis Bach.Structured sparsity-inducing norms through submodular functions.Adv. Neur. Inf. Proc. Sys. (NIPS), pages 118–126, 2010.
[4] B. Bah, L. Baldassarre, and V. Cevher.Model-based sketching and recovery with expanders.In Proc. ACM-SIAM Symp. Disc. Alg., number EPFL-CONF-187484, 2014.
[5] L. Baldassarre, N. Bhan, V. Cevher, and A. Kyrillidis.Group-sparse model selection: Hardness and relaxations.arXiv preprint arXiv:1303.3207, 2013.
[6] R.G. Baraniuk, V. Cevher, M.F. Duarte, and C. Hegde.Model-based compressive sensing.IEEE Trans. Inf. Theory, 56(4):1982–2001, April 2010.
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 39/ 45
References II
[7] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 3, 2002.
[8] N. Bhan, L. Baldassarre, and V. Cevher.Tractability of interpretability via selection of group-sparse models.In IEEE Int. Symp. Inf. Theory, 2013.
[9] Venkat Chandrasekaran and Michael I. Jordan.Computational and statistical tradeoffs via convex relaxation.Proc. Nat. Acad. Sci., 110(13):E1181–E1190, 2013.
[10] Venkat Chandrasekaran, Benjamin Recht, Pablo A. Parrilo, and Alan S. Willsky.The convex geometry of linear inverse problems.Found. Comp. Math., 12:805–849, 2012.
[11] S. Chen, D. Donoho, and M. Saunders.Atomic decomposition by basis pursuit.SIAM J. Sci. Comp., 20(1):33–61, 1998.
[12] William Cook, László Lovász, and Alexander Schrijver.A polynomial-time test for total dual integrality in fixed dimension.In Mathematical programming at Oberwolfach II, pages 64–69. Springer, 1984.
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 40/ 45
References III
[13] Marco F. Duarte, Mark A. Davenport, Dharmpal Takhar, Jason N. Laska, Ting Sun, Kevin F. Kelly, and Richard G. Baraniuk. Single-pixel imaging via compressive sampling. IEEE Sig. Proc. Mag., 25(2):83–91, March 2008.
[14] Marwa El Halabi and Volkan Cevher.A totally unimodular view of structured sparsity.preprint, 2014.arXiv:1411.1990v1 [cs.LG].
[15] S. Fujishige.Submodular functions and optimization, volume 58.Elsevier Science, 2005.
[16] W. Gerstner and W. Kistler. Spiking neuron models: Single neurons, populations, plasticity. Cambridge University Press, 2002.
[17] FR Giles and William R Pulleyblank.Total dual integrality and integer polyhedra.Linear algebra and its applications, 25:191–196, 1979.
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 41/ 45
References IV
[18] C. Hegde, M. Duarte, and V. Cevher.Compressive sensing recovery of spike trains using a structured sparsity model.In Sig. Proc. with Adapative Sparse Struct. Rep. (SPARS), 2009.
[19] J. Huang, T. Zhang, and D. Metaxas.Learning with structured sparsity.J. Mach. Learn. Res., 12:3371–3412, 2011.
[20] R. Jenatton, A. Gramfort, V. Michel, G. Obozinski, F. Bach, and B. Thirion. Multi-scale mining of fMRI data with hierarchical structured sparsity. In Pattern Recognition in NeuroImaging (PRNI), 2011.
[21] R. Jenatton, J. Mairal, G. Obozinski, and F. Bach.Proximal methods for hierarchical sparse coding.J. Mach. Learn. Res., 12:2297–2334, 2011.
[22] A. Kyrillidis and V. Cevher. Combinatorial selection and least absolute shrinkage via the CLASH algorithm. In IEEE Int. Symp. Inf. Theory, pages 2216–2220. IEEE, 2012.
[23] George L Nemhauser and Laurence A Wolsey.Integer and combinatorial optimization, volume 18.Wiley New York, 1999.
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 42/ 45
References V
[24] G. Obozinski and F. Bach.Convex relaxation for combinatorial penalties.arXiv preprint arXiv:1205.1240, 2012.
[25] G. Obozinski, L. Jacob, and J.P. Vert.Group lasso with overlaps: The latent group lasso approach.arXiv preprint arXiv:1110.0413, 2011.
[26] G. Obozinski, B. Taskar, and M.I. Jordan.Joint covariate selection and joint subspace selection for multiple classificationproblems.Statistics and Computing, 20(2):231–252, 2010.
[27] Samet Oymak, Christos Thrampoulidis, and Babak Hassibi.Simple bounds for noisy linear inverse problems with exact side information.2013.arXiv:1312.0641v2 [cs.IT].
[28] C Seshadhri and Jan Vondrák.Is submodularity testable?Algorithmica, 69(1):1–25, 2014.
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 43/ 45
References VI
[29] Paul D Seymour.Decomposition of regular matroids.Journal of combinatorial theory, Series B, 28(3):305–359, 1980.
[30] Quoc Tran-Dinh and Volkan Cevher.Constrained convex minimization via model-based excessive gap.In Adv. Neur. Inf. Proc. Sys. (NIPS), 2014.
[31] Klaus Truemper.Alpha-balanced graphs and matrices and GF(3)-representability of matroids.J. Comb. Theory Ser. B, 32(2):112–139, 1982.
[32] Roman Vershynin.Estimation in high dimensions: a geometric perspective.http://arxiv.org/abs/1405.5103, May 2014.
[33] Peng Zhao, Guilherme Rocha, and Bin Yu.Grouped and hierarchical model selection through composite absolute penalties.Department of Statistics, UC Berkeley, Tech. Rep, 703, 2006.
[34] Peng Zhao and Bin Yu.On model selection consistency of Lasso.J. Mach. Learn. Res., 7:2541–2563, 2006.
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 44/ 45
References VII
[35] H. Zhou, M.E. Sehl, J.S. Sinsheimer, and K. Lange. Association screening of common and rare genetic variants by penalized regression. Bioinformatics, 26(19):2375, 2010.
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 45/ 45