A totally unimodular view of structured sparsity
Volkan Cevher ([email protected])
Laboratory for Information and Inference Systems (LIONS)
École Polytechnique Fédérale de Lausanne (EPFL), Switzerland

DISCML (NIPS), December 13, 2014

Joint work with Marwa El Halabi, Luca Baldassarre and Baran Gözcü @ LIONS;
Anastasios Kyrillidis and Bubacarr Bah @ UT Austin; Nirav Bhan @ MIT
Outline
Total unimodularity in discrete optimization
From sparsity to structured sparsity
Convex relaxations for structured sparse recovery
Enter nonconvexity
Conclusions
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 2/ 45
Integer linear programming

Discrete optimization: Search for an optimum object within a finite collection of objects.

Integer linear program: Many important discrete optimization problems can be formulated as an integer linear program

β♮ ∈ arg max_{β ∈ Z^m} { θᵀβ : Mβ ≤ c, β ≥ 0 }  (ILP)

NP-Hard (in general):
- vertex cover, set packing, maximum flow, traveling salesman, boolean satisfiability.

A general approach: Attempt the following convex relaxation

β⋆ ∈ arg max_{β ∈ R^m} { θᵀβ : Mβ ≤ c, β ≥ 0 }  (LP)

The relaxation obtains an upper bound: θᵀβ⋆ ≥ θᵀβ♮.

Polyhedra & polytopes: P = {β | Mβ ≤ c, β ≥ 0} (β ∈ R^m, c ∈ R^l). A polytope is a bounded polyhedron.

[Figure: the polytope P in the (β1, β2) plane with objective direction θ, showing the LP optimum β⋆ and the integer optimum β♮.]

Observation: When every vertex of P is integer, LP is a "correct" relaxation (β⋆ = β♮).
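The observation above can be tried numerically. The sketch below (illustrative matrix, objective, and budgets; not from the slides) solves the LP relaxation of a small ILP whose constraint matrix is an interval matrix, hence TU, with SciPy; since c is integer, the LP optimum lands on an integer vertex.

```python
# Sketch: LP relaxation of an ILP with a TU constraint matrix (SciPy).
# With M TU and c integer, the LP vertices are integer, so the LP optimum
# is already a valid ILP solution.
import numpy as np
from scipy.optimize import linprog

# Interval matrix: consecutive ones in each row -> totally unimodular.
M = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1]])
c = np.array([1, 2, 1])               # integer right-hand side
theta = np.array([1.0, 2.0, 1.5, 1.0])

# linprog minimizes, so negate theta to maximize theta^T beta.
res = linprog(-theta, A_ub=M, b_ub=c, bounds=[(0, None)] * 4, method="highs")
beta = res.x

assert np.allclose(beta, np.round(beta))  # integer vertex, as TU theory predicts
```

Here the unique LP optimum is β = (0, 1, 1, 0), so no branch-and-bound step is ever needed.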
A sufficient condition

The polyhedron P = {β | Mβ ≤ c, β ≥ 0} has integer vertices when M is TU and c is integer.

Definition (Total unimodularity): A matrix M ∈ R^{l×m} is totally unimodular (TU) iff the determinant of every square submatrix of M is 0 or ±1.

Correctness of LP [23]: When M is TU and c is integer, the LP

max_{β ∈ R^m} { θᵀβ : Mβ ≤ c, β ≥ 0 }

has integer optimal solutions (i.e., ILP ⊆ LP).

Verifying whether a matrix is TU is in P [31].

TU matrices are not rare!
- Regular matroids have TU representations [29]
- Network flow problems & interval constraints involve TU matrices [23]
- Incidence matrices of undirected bipartite graphs are TU [23]

Computational complexity of LP
- Polynomial time in l (the number of constraints) and m (the ambient dimension)
- An IPM performs O(√l log(l/ε)) iterations (l > m) with up to O(m²l) operations each, where ε is the absolute solution accuracy
- What if l is exponentially large?
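The definition can be checked directly for tiny matrices. A brute-force sketch (illustrative; practical TU tests run in polynomial time via Seymour's decomposition, cf. [31]):

```python
# Sketch: brute-force total-unimodularity check.  Every square submatrix
# must have determinant in {-1, 0, +1}.  Exponential in the matrix size.
from itertools import combinations

def det(a):
    """Integer determinant by cofactor expansion (fine for tiny matrices)."""
    n = len(a)
    if n == 1:
        return a[0][0]
    return sum((-1) ** j * a[0][j] * det([row[:j] + row[j + 1:] for row in a[1:]])
               for j in range(n))

def is_totally_unimodular(M):
    l, m = len(M), len(M[0])
    for k in range(1, min(l, m) + 1):
        for rows in combinations(range(l), k):
            for cols in combinations(range(m), k):
                sub = [[M[i][j] for j in cols] for i in rows]
                if det(sub) not in (-1, 0, 1):
                    return False
    return True

# An interval matrix (consecutive ones in each row) is TU:
assert is_totally_unimodular([[1, 1, 0], [0, 1, 1], [1, 1, 1]])
# A classic non-TU example: this 0/1 matrix has determinant 2.
assert not is_totally_unimodular([[1, 1, 0], [0, 1, 1], [1, 0, 1]])
```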
A weaker sufficient condition

Submodularity & the submodular polyhedron [15]: F : 2^V → R is submodular iff it has the following diminishing-returns property:

F(S ∪ {e}) − F(S) ≥ F(T ∪ {e}) − F(T), ∀S ⊆ T ⊆ V, ∀e ∈ V \ T.

The submodular polyhedron is defined as

P(F) := {β ∈ R^m | ∀S ⊆ V, βᵀ1_S ≤ F(S)},

where 1_S is the support indicator vector, i.e., (1_S)_i = 1 if i ∈ S and 0 otherwise.
- We cannot verify submodularity in polynomial time [28].
- The submodular polyhedron is TDI: LP is a "correct" relaxation of ILP.

Total dual integrality (TDI) [17]: A rational system Mβ ≤ c is called TDI when, for each integer θ for which the primal objective is finite, the dual problem

min_{α ∈ R^l} { αᵀc : α ≥ 0, αᵀM = θᵀ }

has an integer optimal solution.
- A polynomial-time (in l and m) algorithm can verify whether Mβ ≤ c is TDI [12].

Structure matters! LP is efficiently solvable on the submodular polyhedron P(F).
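As a concrete illustration (not from the slides), a coverage function F(S) = |∪_{e∈S} A_e| is submodular, and the diminishing-returns inequality can be verified exhaustively on a tiny ground set; for a general oracle this exhaustive check is exponential, in line with the hardness remark above [28].

```python
# Sketch: exhaustively verifying diminishing returns for a small coverage
# function F(S) = |union of the sets indexed by S|, a standard submodular
# function.  The sets A_e below are illustrative.
from itertools import chain, combinations

V = [0, 1, 2]
cover = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c", "d"}}

def F(S):
    return len(set().union(*(cover[e] for e in S))) if S else 0

def subsets(it):
    s = list(it)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

ok = all(
    F(set(S) | {e}) - F(set(S)) >= F(set(T) | {e}) - F(set(T))
    for T in subsets(V)
    for S in subsets(T)       # S subset of T
    for e in V if e not in T  # e outside T
)
assert ok
```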
In the rest of the talk...

We can use these concepts in obtaining
- tight convex relaxations
- efficient nonconvex projections
for supervised learning and inverse problems.

Running example: b = Ax♮ + w, with A of size n × p.

[Figure: the linear observation model b = Ax in (x1, x2, x3) coordinates.]

Applications: Machine learning, signal processing, theoretical computer science...

A difficult estimation challenge when n < p. Nullspace (null) of A: x♮ + δ → b, ∀δ ∈ null(A).
- Needle in a haystack: We need additional information on x♮!
Outline
Total unimodularity in discrete optimization
From sparsity to structured sparsity
Convex relaxations for structured sparse recovery
Enter nonconvexity
Conclusions
Three key insights

1. A sparse or compressible x♮: not sufficient alone
2. Recovery: tractable & stable
3. The projection A: information preserving

[Figure: b = Ax♮ + w, with A of size n × p.]

Typical goals:
1. Find x⋆ to minimize ‖x⋆ − x♮‖
2. Find x⋆ to minimize L(x⋆(a), x♮(a) + w)
Swiss army knife of signal models

Definition (s-sparse vector): A vector x ∈ R^p is s-sparse, i.e., x ∈ Σ_s, if it has at most s non-zero entries:

‖x♮‖_0 := |{i : x♮_i ≠ 0}| = s.

Sparse representations: y♮ has sparse transform coefficients x♮.
- Basis representations Ψ ∈ R^{p×p}: wavelets, DCT, ...
- Frame representations Ψ ∈ R^{m×p}, m > p: Gabor, curvelets, shearlets, ...
- Other dictionary representations...

[Figure: y♮ expressed as a dictionary times the sparse coefficient vector x♮.]
Sparse representations strike back!

[Figure: b = Ax♮, with b of size n × 1; restricted to the support, A has size n × s and x♮ size s × 1.]

- b ∈ R^n, A ∈ R^{n×p}, and n < p
- Ψ ∈ R^{p×p}, x♮ ∈ Σ_s, and s < n < p

Impact: Restricting A to the support columns leads to an overcomplete system.
Enter sparsity

A combinatorial approach for estimating x♮ from b = Ax♮ + w: We may consider the estimator with the least number of non-zero entries. That is,

x⋆ ∈ arg min_{x ∈ R^p} { ‖x‖_0 : ‖b − Ax‖_2 ≤ κ }  (P0)

with some κ ≥ 0. If κ = ‖w‖_2, then x♮ is a feasible solution.

P0 has the following characteristics:
- sample complexity: O(s)
- computational effort: NP-Hard
- stability: no

Tightest convex relaxation: ‖x‖_0** is the biconjugate (the Fenchel conjugate of the Fenchel conjugate), where the Fenchel conjugate is f*(y) := sup_{x ∈ dom(f)} xᵀy − f(x).

A technicality: Restrict x♮ ∈ [−1, 1]^p. Over the unit ℓ∞-ball, ‖x‖_1 is the convex envelope of ‖x‖_0.
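The convex-envelope claim can be sanity-checked numerically in one dimension: conjugating ‖x‖_0 over [−1, 1] by grid search and conjugating again recovers |x|. A sketch with illustrative grids (not from the slides):

```python
# Sketch: numerically checking that the biconjugate of ||x||_0 over the
# unit l_inf ball is ||x||_1, in one dimension, by grid search.
import numpy as np

xs = np.linspace(-1.0, 1.0, 401)   # the unit l_inf ball in 1-D (contains 0)
ys = np.linspace(-5.0, 5.0, 2001)  # dual grid (contains +-1)

f = np.where(xs == 0.0, 0.0, 1.0)  # ||x||_0 restricted to [-1, 1]

# Fenchel conjugate f*(y) = sup_{|x|<=1} x*y - f(x)  (equals max(0, |y|-1))
fstar = np.max(np.outer(ys, xs) - f[None, :], axis=1)
# Biconjugate f**(x) = sup_y x*y - f*(y)
fss = np.max(np.outer(xs, ys) - fstar[None, :], axis=1)

assert np.allclose(fss, np.abs(xs))  # the envelope is |x| = ||x||_1
```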
The role of convexity: tractable & stable recovery

A convex candidate solution for b = Ax♮ + w:

x⋆ ∈ arg min_{x ∈ R^p} { ‖x‖_1 : ‖b − Ax‖_2 ≤ ‖w‖_2, ‖x‖_∞ ≤ 1 }.  (SOCP)

Theorem (A model recovery guarantee [27]): Let A ∈ R^{n×p} be a matrix of i.i.d. Gaussian random variables with zero mean and variance 1/n. For any t > 0, with probability at least 1 − 6 exp(−t²/26), we have

‖x⋆ − x♮‖_2 ≤ [ 2√(2s log(p/s) + (5/4)s) / ( √n − √(2s log(p/s) + (5/4)s) − t ) ] ‖w‖_2 =: ε,

when ‖x♮‖_0 ≤ s.

Observations:
- perfect recovery (i.e., ε = 0) with n ≥ 2s log(p/s) + (5/4)s whp when w = 0;
- an ε-accurate solution in k = O(√(2p + 1) log(1/ε)) iterations via an IPM,¹ with each iteration requiring the solution of a structured n × 2p linear system;²
- robust to noise.

¹ For a rigorous primal-dual algorithm for this class of problems, see my NIPS 2014 paper [30].
² When w = 0, the IPM complexity (# of iterations × cost per iteration) amounts to O(n²p^{1.5} log(1/ε)).
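For concreteness, a sketch (illustrative sizes and seed, not from the slides) of the noiseless ℓ1 estimator recast as the LP min 1ᵀ(u + v) s.t. A(u − v) = b, u, v ≥ 0, solved with SciPy. The asserts check feasibility and that the estimate's ℓ1 norm does not exceed that of the true signal, which optimality guarantees.

```python
# Sketch: basis pursuit (min ||x||_1 s.t. Ax = b) as an LP via the
# standard x = u - v splitting, u, v >= 0.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, p = 15, 30
A = rng.standard_normal((n, p)) / np.sqrt(n)   # i.i.d. N(0, 1/n) entries
x_true = np.zeros(p)
x_true[[3, 17]] = [1.0, -1.0]                  # a 2-sparse ground truth
b = A @ x_true

c = np.ones(2 * p)                             # objective 1^T (u, v)
A_eq = np.hstack([A, -A])                      # A u - A v = b
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * p), method="highs")
x_hat = res.x[:p] - res.x[p:]

assert np.linalg.norm(A @ x_hat - b) < 1e-6               # feasible
assert np.abs(x_hat).sum() <= np.abs(x_true).sum() + 1e-6  # optimal objective
```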
The role of the matrix A: preserving information

Proposition (Condition for exact recovery in the noiseless case): We have successful recovery with x⋆ ∈ arg min_{x ∈ R^p} { f(x) : b = Ax, ‖x‖_∞ ≤ 1 }, i.e., δ := x⋆ − x♮ = 0, if and only if null(A) ∩ D_f(x♮) = {0}.

Assume that the constraint ‖x‖_∞ ≤ 1 is inactive.

Descent cone: D_f(x♮) := cone({ x : f(x♮ + x) ≤ f(x♮) }).

[Figure: the affine feasible set {x : b = Ax} = x♮ + null(A) and the sublevel set {x : f(x) ≤ f(x♮)}; recovery succeeds when null(A) meets the descent cone D_f(x♮) only at the origin.]
The role of the matrix A: preserving information

P{ x⋆ = x♮ } = P{ null(A) ∩ D_f(x♮) = {0} }

Definition (Statistical dimension [2]³): Let C ⊆ R^p be a closed convex cone. The statistical dimension of C is defined as

d(C) := E[ ‖proj_C(g)‖_2² ],

where g is a standard Gaussian vector.

Theorem (Approximate kinematic formula [2]): Let A ∈ R^{n×p}, n < p, be a matrix of i.i.d. standard Gaussian random variables, and let C ⊆ R^p be a closed convex cone. For η ∈ (0, 1), we have

n ≥ d(C) + c_η √p ⇒ P{ null(A) ∩ C = {0} } ≥ 1 − η;
n ≤ d(C) − c_η √p ⇒ P{ null(A) ∩ C = {0} } ≤ η,

where c_η := √(8 log(4/η)).

We can compute d(C) ≲ 2s log(p/s) + (5/4)s for C = D_{‖·‖_1}(x♮) when x♮ ∈ Σ_s.

³ The statistical dimension is closely related to the Gaussian complexity [7], Gaussian width [10], mean width [32], and Gaussian squared complexity [9].
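The definition is easy to estimate by Monte Carlo for cones with a closed-form projection. A sketch (illustrative, not from the slides) for the nonnegative orthant C = R^p_+, where proj_C clips negative coordinates and the exact statistical dimension is p/2:

```python
# Sketch: Monte Carlo estimate of d(C) = E ||proj_C(g)||_2^2 for the
# nonnegative orthant.  Each Gaussian coordinate survives projection with
# probability 1/2, giving d(C) = p/2 exactly.
import numpy as np

rng = np.random.default_rng(1)
p, trials = 20, 20000
g = rng.standard_normal((trials, p))
d_hat = np.mean(np.sum(np.maximum(g, 0.0) ** 2, axis=1))

assert abs(d_hat - p / 2) < 0.5  # close to the exact value p/2 = 10
```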
Beyond sparsity towards model-based or structured sparsity

- The following signals can look the same from a sparsity perspective!
- In reality, these signals have additional structure beyond simple sparsity.

[Figure: a sparse image, the wavelet coefficients of a natural image, a spike train, and a background-subtracted image; their sorted coefficient magnitudes |x♮|_(i) all vanish after the first s of p indices.]
Beyond sparsity towards model-based or structured sparsity

Sparsity model: The union of all (p choose s) s-dimensional canonical subspaces of R^p.

Structured sparsity model: A particular union of m_s s-dimensional canonical subspaces (m_s ≤ (p choose s)).

Model-based or structured sparsity: Structured sparsity models are discrete structures describing the interdependency between the non-zero coefficients of a vector.
Three upshots of structured sparsity

Key properties of the statistical dimension [2]:
- The statistical dimension is invariant under unitary transformations (rotations).
- Let C1 and C2 be closed convex cones. If C1 ⊆ C2, then d(C1) ≤ d(C2).

1. The smaller the statistical dimension is, the less we need to sample.
2. The smaller the statistical dimension is, the better we can denoise.
3. The smaller the statistical dimension is, the better we can enforce structure.

Example (If D_{f1}(x♮) ⊆ D_{f2}(x♮) ⊆ R^n, then d(D_{f1}(x♮)) ≤ d(D_{f2}(x♮))):
f1(x) := max{|x1|, |x2|} + |x3|, f2(x) := ‖x‖_1, x♮ = [1, −1, 0]ᵀ.
Observations:
1. n1 < n2 for x♮
2. n1 > n2 for z♮ = [0, 0, 1]ᵀ

- Reduced sample complexity: phase transition at the statistical dimension.
- Better noise robustness: denoising capability depends on the statistical dimension:

max_{σ>0} E[ ‖prox_f(x♮ + σw, σλ) − x♮‖² ] / σ² ≤ d(λ D_f(x♮));

minimize a bound on the minimax risk via the regularization parameter λ [27].
- Better interpretability: geometry can enhance interpretability; influence the recovered support via a customized convex geometry (e.g., the ℓ1-ball versus the TV-ball).
Outline
Total unimodularity in discrete optimization
From sparsity to structured sparsity
Convex relaxations for structured sparse recovery
Enter nonconvexity
Conclusions
A simple template for linear inverse problems

Find the "sparsest" x subject to structure and data.

- Sparsity: We can generalize this desideratum to other notions of simplicity.
- Structure: We only allow certain sparsity patterns.
- Data fidelity: We have many choices of convex constraints & losses to represent the data; e.g., ‖b − Ax‖_2 ≤ κ.
A convex proto-problem for structured sparsity

A combinatorial approach for estimating x♮ from b = Ax♮ + w: We may consider the sparsest estimator, or its surrogate with a valid sparsity pattern:

x⋆ ∈ arg min_{x ∈ R^p} { ‖x‖_s : ‖b − Ax‖_2 ≤ κ, ‖x‖_∞ ≤ 1 }  (Ps)

with some κ ≥ 0. If κ = ‖w‖_2, then the structured sparse x♮ is a feasible solution.

Sparsity and structure together [14]: Given some weights d ∈ R^d, e ∈ R^p and an integer input c ∈ Z^l, we define

‖x‖_s := min_ω { dᵀω + eᵀs : M [ω; s] ≤ c, 1_supp(x) = s, ω ∈ {0, 1}^d }

for all feasible x, and ∞ otherwise. The parameter ω is useful for latent modeling.

A convex candidate solution for b = Ax♮ + w: We use the convex estimator based on the tightest convex relaxation of ‖x‖_s:

x⋆ ∈ arg min_{x ∈ dom(‖·‖_s)} { ‖x‖_s** : ‖b − Ax‖_2 ≤ κ }

with some κ ≥ 0, where dom(‖·‖_s) := {x : ‖x‖_s < ∞}.
Tractability & tightness of biconjugation

Proposition (Hardness of conjugation): Let F : 2^P → R ∪ {+∞} be a set function defined on the support s = supp(x). The conjugate of F over the unit infinity ball ‖x‖_∞ ≤ 1 is given by

g*(y) = sup_{s ∈ {0,1}^p} |y|ᵀs − F(s).

Observations:
- F a general set function — computation: NP-Hard.
- F(s) = ‖x‖_s — computation: an ILP in general. However, if M is TU or (M, c) is TDI, then we obtain tight convex relaxations with an LP ("usually" tractable). Otherwise, relax to an LP anyway!
- F submodular — computation: polynomial time.
Tree sparsity [21, 13, 6, 33]

[Figure: wavelet coefficients and their wavelet tree; a valid selection of nodes (a rooted connected subtree) versus an invalid one. For the three-node example, G_H = {{1, 2, 3}, {2}, {3}}.]

Structure: We seek the sparsest signal with a rooted connected subtree support.

Linear description: A valid support satisfies s_parent ≥ s_child over the tree T:

T 1_supp(x) := Ts ≥ 0,

where T is the directed edge-node incidence matrix, which is TU.

Biconjugate: ‖x‖_s** = min_{s ∈ [0,1]^p} { 1ᵀs : Ts ≥ 0, |x| ≤ s } = Σ_{G ∈ G_H} ‖x_G‖_∞

for x ∈ [−1, 1]^p, and ∞ otherwise. The sets G ∈ G_H are defined as each node together with all its descendants.
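The linear description can be illustrated on a toy tree (root 0 with children 1 and 2, an illustrative example, not from the slides): each row of T carries +1 at the parent and −1 at the child, so Ts ≥ 0 encodes s_parent ≥ s_child.

```python
# Sketch: directed edge-node incidence matrix of a tiny rooted tree, and
# the rooted-connectivity test T @ s >= 0.
import numpy as np

# Edges (parent, child): (0, 1) and (0, 2); one row per edge.
T = np.array([[1, -1, 0],
              [1, 0, -1]])

valid = np.array([1, 1, 0])    # {root, left child}: a rooted subtree
invalid = np.array([0, 1, 0])  # a child selected without its parent

assert np.all(T @ valid >= 0)
assert not np.all(T @ invalid >= 0)
```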
Tree sparsity example: 1:100 compressive sensing [30, 1]

[Figure: "World" test image (1 Gpix), Lac Léman, "World" (10 Mpix); sparse recovery, PSNR = 31.83 dB, versus tree-sparse recovery, PSNR = 32.48 dB.]
Tree sparsity example: TV & TU-relax 1:15 compression [30, 1]

[Figure: original images and reconstructions by BP, TU-relax, TV, TV with BP, and TV with TU-relax; ℓ2 error versus the regularization parameter α ∈ [0, 1] for tv-BP and tv-TU-relax, with the minimum errors marked.]
Group knapsack sparsity [35, 18, 16]

[Figure: a sparse x ∈ R^8 with support indicator 1_supp(x) and groups G1 = {1}, G2 = {1, 2, 3, 4, 5}, G3 = {5, 6, 7, 8}, G4 = {2, 5, 7}, G5 = {6, 8}, together with the knapsack constraints vector c_u.]

Structure: We seek the sparsest signal with group allocation constraints.

Linear description: A valid support obeys budget constraints over G:

Bᵀs ≤ c_u,

where B is the biadjacency matrix of G, i.e., B_ij = 1 iff the i-th coefficient is in G_j. When B is an interval matrix or G has a loopless group intersection graph, it is TU.
Remark: We can also budget a lower bound, c_ℓ ≤ Bᵀs ≤ c_u.

For interval groups (windows of length Δ), Bᵀ is the (p − Δ + 1) × p matrix with consecutive ones in each row:

Bᵀ = [ 1 1 ⋯ 1 1 0 0 ⋯ 0
       0 1 1 ⋯ 1 1 0 ⋯ 0
       ⋮
       0 ⋯ 0 0 1 1 ⋯ 1 1 ].

Biconjugate: ‖x‖_s** = ‖x‖_1 if x ∈ [−1, 1]^p and Bᵀ|x| ≤ c_u, and ∞ otherwise.
For the neuronal spike example, we have c_u = 1.

[Figure: the unit balls ‖x‖_s** ≤ 1, ‖x‖_s** ≤ 1.5, and ‖x‖_s** ≤ 2 for G = {{1, 2}, {2, 3}}.]
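A sketch of the feasibility test behind this relaxation for the spike-train setting (the window length Δ = 2 and the signal size are illustrative, not from the slides): x must lie in [−1, 1]^p and satisfy Bᵀ|x| ≤ c_u, where Bᵀ is the interval matrix of length-Δ windows and c_u = 1 limits each window to one unit of spike mass.

```python
# Sketch: group-knapsack feasibility with sliding-window interval groups.
import numpy as np

p, delta = 5, 2
# Row i of B^T has ones at columns i, ..., i + delta - 1 (consecutive ones).
Bt = np.array([[1 if 0 <= j - i < delta else 0 for j in range(p)]
               for i in range(p - delta + 1)])
cu = np.ones(p - delta + 1)

def feasible(x):
    x = np.asarray(x, dtype=float)
    return bool(np.all(np.abs(x) <= 1) and np.all(Bt @ np.abs(x) <= cu + 1e-12))

assert feasible([1, 0, 1, 0, 1])       # spikes separated by at least delta
assert not feasible([1, 1, 0, 0, 0])   # two adjacent spikes violate a window
```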
Group knapsack sparsity example: a stylized spike train

- Basis pursuit (BP): ‖x‖_1
- TU-relax (TU): ‖x‖_s** = ‖x‖_1 if x ∈ [−1, 1]^p and Bᵀ|x| ≤ c_u, and ∞ otherwise

[Figure: relative error ‖x − x⋆‖_2 / ‖x⋆‖_2 versus the undersampling ratio n/p for BP and the TU relaxation; recovery at n = 0.18p showing x♮, the BP solution x_BP, and the TU solution x_TU.]

Relative errors: ‖x♮ − x_BP‖_2 / ‖x♮‖_2 = .200, ‖x♮ − x_TU‖_2 / ‖x♮‖_2 = .067.
Group knapsack sparsity: a simple variation

[Figure: the same sparse x ∈ R^8, support indicator, groups G1, ..., G5, and knapsack constraints vector as before.]

Structure: We seek the signal with the minimal overall group allocation.

Objective: 1ᵀs → ‖x‖_ω = min_{ω ∈ Z_{++}} ω if x ∈ [−1, 1]^p and Bᵀ|x| ≤ ω1, and ∞ otherwise.

Linear description: A valid support obeys budget constraints over G:

Bᵀs ≤ ω1,

where B is the biadjacency matrix of G, i.e., B_ij = 1 iff the i-th coefficient is in G_j. When B is an interval matrix or G has a loopless group intersection graph, it is TU.

Biconjugate: ‖x‖_s** = max_{G ∈ G} ‖x_G‖_1 if x ∈ [−1, 1]^p, and ∞ otherwise.

Remark: The regularizer is known as the exclusive Lasso [35, 26].
Group cover sparsity: Minimal group cover [5, 25, 19]
G2 = {1, 2, 3, 4, 5}
x1
x2
x3
x4
x5
x6
x7
x8
0
1
0
0
1
0
1
0
1supp(x)
supportindicator vector
sparse
0
0
0
1
0
group “support”indicator vector
Ê
group sparse
G3 = {5, 6, 7, 8}
G4 = {2, 5, 7}
G5 = {6, 8}
G1 = {1}
Structure: We seek the signal covered by a minimal number of groups.
Objective: 1T s → dTω
Linear description: At least one group containing a sparse coefficient is selected
Bω ≥ s
where B is the biadjacency matrix of G, i.e., Bij = 1 iff i-th coefficient is in Gj .When B is an interval matrix, or G has a loopless group intersection graph it is TU.
Remark: Weights d can depend on the sparsity within each groups (not TU) [14].
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 28/ 45
Group cover sparsity: Minimal group cover [5, 25, 19]
G = {{1, 2}, {2, 3}}, unit group weights d = 1.
Structure: We seek the signal covered by a minimal number of groups.
Objective: 1T s → dTω
Linear description: At least one group containing a sparse coefficient is selected
Bω ≥ s
where B is the biadjacency matrix of G, i.e., Bij = 1 iff i-th coefficient is in Gj .When B is an interval matrix, or G has a loopless group intersection graph it is TU.
Biconjugate: ‖x‖∗∗ω = minω∈[0,1]M {dTω : Bω ≥ |x|} for x ∈ [−1, 1]p,∞ otherwise
?= minvi∈Rp{∑M
i=1 di‖vi‖∞ : x =∑M
i=1 vi , ∀supp(vi) ⊆ Gi},Remark: Weights d can depend on the sparsity within each groups (not TU) [14].
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 28/ 45
Group cover sparsity: Minimal group cover [5, 25, 19]
G = {{1, 2}, {2, 3}}, unit group weights d = 1.
Structure: We seek the signal covered by a minimal number of groups.
Objective: 1T s → dTω
Linear description: At least one group containing a sparse coefficient is selected
Bω ≥ s
where B is the biadjacency matrix of G, i.e., Bij = 1 iff i-th coefficient is in Gj .When B is an interval matrix, or G has a loopless group intersection graph it is TU.
Biconjugate: ‖x‖∗∗ω = minω∈[0,1]M {dTω : Bω ≥ |x|} for x ∈ [−1, 1]p,∞ otherwise?= minvi∈Rp{
∑Mi=1 di‖vi‖∞ : x =
∑Mi=1 vi , ∀supp(vi) ⊆ Gi},
Remark: Weights d can depend on the sparsity within each groups (not TU) [14].
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 28/ 45
Group cover sparsity: Minimal group cover [5, 25, 19]
G = {{1, 2}, {2, 3}}, unit group weights d = 1.
Structure: We seek the signal covered by a minimal number of groups.
Objective: 1T s → dTω
Linear description: At least one group containing a sparse coefficient is selected
Bω ≥ s
where B is the biadjacency matrix of G, i.e., Bij = 1 iff the i-th coefficient is in Gj. When B is an interval matrix, or G has a loopless group intersection graph, it is TU.
Biconjugate: ‖x‖∗∗ω = minω∈[0,1]M {dTω : Bω ≥ |x|} for x ∈ [−1, 1]p, ∞ otherwise
= min_{vi∈Rp} {∑_{i=1}^{M} di‖vi‖∞ : x = ∑_{i=1}^{M} vi, supp(vi) ⊆ Gi ∀i} (latent group lasso form)
Remark: Weights d can depend on the sparsity within each group (not TU) [14].
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 28/ 45
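Because an interval biadjacency matrix is TU, the LP relaxation of the minimal group cover ILP already has an integral vertex solution. A minimal numerical sketch with SciPy on the toy instance G = {{1, 2}, {2, 3}} (variable names are mine, not from the talk):

```python
import numpy as np
from scipy.optimize import linprog

# Toy instance from the slide: p = 3 coefficients, groups G = {{1, 2}, {2, 3}}.
# Biadjacency matrix B (p x M): B[i, j] = 1 iff coefficient i+1 lies in group j+1.
B = np.array([[1, 0],
              [1, 1],
              [0, 1]], dtype=float)
s = np.array([1, 0, 1], dtype=float)  # support indicator: coefficients 1 and 3 active
d = np.ones(B.shape[1])               # unit group weights

# LP relaxation of the minimal group cover ILP:
#   min d^T w   s.t.   B w >= s,   0 <= w <= 1
res = linprog(c=d, A_ub=-B, b_ub=-s, bounds=[(0, 1)] * B.shape[1])
w = res.x

# B is an interval matrix, hence TU, so the LP optimum is already integral.
print(w, res.fun)
```

Here both groups are forced in (coefficient 1 only lies in the first group, coefficient 3 only in the second), so the relaxation returns the integral cover ω = (1, 1) of cost 2.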
Budgeted group cover sparsity
Figure: groups G1 = {1}, G2 = {1, 2, 3, 4, 5}, G3 = {5, 6, 7, 8}, G4 = {2, 5, 7}, G5 = {6, 8} over coefficients x1, . . . , x8. The support indicator vector 1supp(x) = (0, 1, 0, 0, 1, 0, 1, 0) (sparse) is covered by the group “support” indicator vector ω = (0, 0, 0, 1, 0) (group sparse): the single group G4 covers the support.
Structure: We seek the sparsest signal covered by G groups.
Objective: dTω → 1T s
Linear description: Each sparse coefficient is covered by at least one of the G selected groups.
Bω ≥ s,1Tω ≤ G
where B is the biadjacency matrix of G, i.e., Bij = 1 iff i-th coefficient is in Gj .
When [B; 1T] (B stacked with the all-ones budget row) is an interval matrix, it is TU.
Biconjugate: ‖x‖∗∗ω = minω∈[0,1]M {‖x‖1 : Bω ≥ |x|, 1Tω ≤ G} for x ∈ [−1, 1]p, ∞ otherwise.
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 29/ 45
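Total unimodularity of a small matrix like [B; 1T] can be checked directly from the definition: every square submatrix must have determinant in {−1, 0, 1}. A brute-force sketch, exponential in the matrix size and only meant for toy instances (helper name is mine):

```python
import numpy as np
from itertools import combinations

def is_totally_unimodular(M, tol=1e-9):
    """Brute-force TU test: every square submatrix has det in {-1, 0, 1}.
    Exponential time -- illustrative only, for tiny matrices."""
    M = np.asarray(M, dtype=float)
    m, n = M.shape
    for k in range(1, min(m, n) + 1):
        for rows in combinations(range(m), k):
            for cols in combinations(range(n), k):
                det = np.linalg.det(M[np.ix_(rows, cols)])
                if abs(det - round(det)) > tol or abs(det) > 1 + tol:
                    return False
    return True

# Interval biadjacency matrix B for G = {{1, 2}, {2, 3}}, stacked with the budget row 1^T:
B = np.array([[1, 0], [1, 1], [0, 1]])
stacked = np.vstack([B, np.ones(2)])
print(is_totally_unimodular(stacked))  # True
```

A counterexample such as the 3-cycle incidence pattern [[1,1,0],[0,1,1],[1,0,1]] (determinant 2) fails the test.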
Budgeted group cover example: Interval overlapping groups
▸ Basis pursuit (BP): ‖x‖1
▸ Sparse group Lasso (SGLq): (1− α)∑G∈G √|G| ‖xG‖q + α‖x‖1
▸ TU-relax (TU): ‖x‖∗∗ω = minω∈[0,1]M {‖x‖1 : Bω ≥ |x|, 1Tω ≤ G} for x ∈ [−1, 1]p, ∞ otherwise.
Figure: recovery error ‖x− x?‖2/‖x?‖2 vs. n/p (curves: BP, SGL, SGL∞, BLGC). Recovery for n = 0.25p, s = 15, p = 200, G = 5 out of M = 29 groups.
Figure: the original signal x\ and the BP, SGL, SGL∞, and TU solutions xBP, xSGL, xSGL∞, xTU.
Relative errors: ‖x\ − xBP‖2/‖x\‖2 = .128, ‖x\ − xSGL‖2/‖x\‖2 = .181, ‖x\ − xSGL∞‖2/‖x\‖2 = .085, ‖x\ − xTU‖2/‖x\‖2 = .058.
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 30/ 45
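The penalties compared above are straightforward to evaluate directly; a minimal sketch of the sparse group Lasso penalty (helper name and toy instance are mine, not from the talk):

```python
import numpy as np

def sparse_group_lasso(x, groups, alpha=0.5, q=2):
    """Sparse group Lasso penalty:
    (1 - alpha) * sum_G sqrt(|G|) * ||x_G||_q  +  alpha * ||x||_1."""
    group_term = sum(np.sqrt(len(g)) * np.linalg.norm(x[list(g)], ord=q)
                     for g in groups)
    return (1 - alpha) * group_term + alpha * np.abs(x).sum()

# Two overlapping interval groups over 3 coefficients (0-indexed):
groups = [[0, 1], [1, 2]]
x = np.array([1.0, 0.0, -1.0])
print(sparse_group_lasso(x, groups, alpha=0.5, q=2))  # sqrt(2) + 1
```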
Group intersection sparsity [20, 34, 3]
G = {{1, 2, 3}, {2}, {3}}, unit group weights d = 1.
Structure: We seek the signal intersecting with minimal number of groups.
Objective: 1T s → dTω (submodular)
Linear description: All groups containing a sparse coefficient are selected
H ks ≤ ω,∀k ∈ P
where Hk(i, j) = 1 if j = k and j ∈ Gi, and 0 otherwise, which is TU.
Biconjugate: ‖x‖∗∗ω = minω∈[0,1]M {dTω : Hk|x| ≤ ω, ∀k ∈ P} = ∑G∈G ‖xG‖∞ for x ∈ [−1, 1]p, ∞ otherwise.
Remark: For hierarchical GH , group intersection and tree sparsity models coincide.
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 31/ 45
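For this TU model the LP collapses to a closed form: each ω_i must dominate ‖x_{G_i}‖∞, so the biconjugate with unit weights is just the sum of groupwise ℓ∞ norms. A minimal sketch (helper name is mine):

```python
import numpy as np

def group_intersection_norm(x, groups):
    """Convex envelope of group-intersection sparsity on [-1, 1]^p:
    sum over groups of the infinity norm of x restricted to that group."""
    return sum(np.max(np.abs(x[list(g)])) for g in groups)

# Example from the slide, 0-indexed: G = {{1, 2, 3}, {2}, {3}}
groups = [{0, 1, 2}, {1}, {2}]
x = np.array([0.5, 0.0, -0.3])
print(group_intersection_norm(x, groups))  # 0.5 + 0.0 + 0.3 = 0.8
```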
Beyond linear costs: Graph dispersiveness
(left) ‖x‖∗∗s = 0 (right) ‖x‖∗∗s ≤ 1 for E = {{1, 2}, {2, 3}} (chain graph)
Structure: We seek a signal dispersive over a given graph G(P, E)
Objective: 1T s → ∑(i,j)∈E si sj (non-linear, supermodular function)
Linearization: ‖x‖s = minz∈{0,1}|E| {∑(i,j)∈E zij : zij ≥ si + sj − 1}
When the edge-node incidence matrix of G(P, E) is TU (e.g., bipartite graphs), it is TU.
Biconjugate: ‖x‖∗∗s = ∑(i,j)∈E (|xi|+ |xj| − 1)+ for x ∈ [−1, 1]p, ∞ otherwise.
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 32/ 45
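The biconjugate above is a sum of truncated hinge terms and is trivial to evaluate; a small sketch on the chain graph from the slide (helper name is mine):

```python
import numpy as np

def dispersive_penalty(x, edges):
    """Biconjugate of the graph-dispersiveness cost on [-1, 1]^p:
    sum over edges (i, j) of max(|x_i| + |x_j| - 1, 0)."""
    return sum(max(abs(x[i]) + abs(x[j]) - 1.0, 0.0) for i, j in edges)

edges = [(0, 1), (1, 2)]  # chain graph on 3 nodes, 0-indexed
print(dispersive_penalty(np.array([1.0, 0.0, 1.0]), edges))  # 0.0: neighbors never both active
print(dispersive_penalty(np.array([1.0, 1.0, 0.0]), edges))  # 1.0: edge (0, 1) violated
```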
Outline
Total unimodularity in discrete optimization
From sparsity to structured sparsity
Convex relaxations for structured sparse recovery
Enter nonconvexity
Conclusions
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 33/ 45
An important alternative
Problem (Projection)
Define Ms,G := {x : 1T s ≤ s, dTω ≤ G, M[ω; s] ≤ c, s = 1supp(x)}.
The projection of x onto Ms,G in ℓq-norm is defined as Pq,Ms,G : Rp → Rp,
Pq,Ms,G (x) ∈ arg minu∈Rp {‖x− u‖qq : u ∈ Ms,G}
▸ x̂ = Pq,Ms,G (x) is the best model-based approximation of x.
▸ The interesting cases are q = 1, 2.
Observation: Model-based approximation corresponds to an ILP
▸ NP-Hard in general (weighted max cover formulation [5])
▸ TU structures play a major role
▸ Pseudo-polynomial time solutions via dynamic programming
Model-based CS [6, 22]: n = O(log |Ms,G|) with iid Gaussian (dense) measurements
▸ n = O(s) for tree structure
▸ iterative projected gradient descent: xk+1 ∈ P2,Ms,G (xk + AT(b−Axk))
Model-based sketching [4]: n = o(s log(p/s)) with expanders (sparse) measurements
▸ n = O(s log(p/s)/ log log p) for tree structure (empirical: n = O(s))
▸ iterative projected median descent: xk+1 ∈ P1,Ms,G (xk + M(b−Axk))
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 34/ 45
Tree sparsity example: Pareto frontier [8, 5]
Figure: approximation error vs. sparsity s (0 to 25) for the Tree model, TU-relax, HGL ℓ1, and HGL ℓ2; the Tree model, TU-relax, and HGL ℓ1 curves coincide (“Just kidding, they are the same.”).
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 35/ 45
CLASH [22]
CLASH = Combinatorial Selection + Least Absolute Shrinkage
▸ Retains the same theoretical guarantees
▸ Handles ILP and matroid structured models
Figure: the CLASH set and the Model-CLASH set (“combinatorial origami”).
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 36/ 45
Outline
Total unimodularity in discrete optimization
From sparsity to structured sparsity
Convex relaxations for structured sparse recovery
Enter nonconvexity
Conclusions
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 37/ 45
Conclusions
Our work: TU modeling framework & convex template & non-convex algorithms
▸ Many more convex programs (not necessarily norms)
▸ TU models: tight convexifications, non-submodular examples
▸ Easy to design and “usually” efficient via an LP
▸ London calling...
Alternatives:
1. Atomic norms [11, 10]
▸ Given a set A, use the biconjugation of g(x) = inf0≤t≤c t + ιtA(x), for c > 0
▸ Reverse engineer the set to obtain structured sparsity
▸ “Usually” tractable since the norm is reverse engineered
2. Monotone submodular penalties and extensions [3]
▸ Tight convexification via the Lovász extension
▸ Reverse engineer the submodular set function (not always possible)
3. ℓq-regularized combinatorial functions [24]
▸ Tight convexification (also explains latent group lasso like norms)
▸ Not always efficiently computable
▸ Reverse engineered and may lose structure, e.g., group knapsack model
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 38/ 45
References I
[1] Ben Adcock, Anders C. Hansen, Clarice Poon, and Bogdan Roman.Breaking the coherence barrier: A new theory for compressed sensing.http://arxiv.org/abs/1302.0561, Feb. 2013.
[2] Dennis Amelunxen, Martin Lotz, Michael B. McCoy, and Joel A. Tropp.Living on the edge: Phase transitions in convex programs with random data.Information and Inference, 3:224–294, 2014.arXiv:1303.6672v2 [cs.IT].
[3] Francis Bach.Structured sparsity-inducing norms through submodular functions.Adv. Neur. Inf. Proc. Sys. (NIPS), pages 118–126, 2010.
[4] B. Bah, L. Baldassarre, and V. Cevher.Model-based sketching and recovery with expanders.In Proc. ACM-SIAM Symp. Disc. Alg., number EPFL-CONF-187484, 2014.
[5] L. Baldassarre, N. Bhan, V. Cevher, and A. Kyrillidis.Group-sparse model selection: Hardness and relaxations.arXiv preprint arXiv:1303.3207, 2013.
[6] R.G. Baraniuk, V. Cevher, M.F. Duarte, and C. Hegde.Model-based compressive sensing.IEEE Trans. Inf. Theory, 56(4):1982–2001, April 2010.
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 39/ 45
References II
[7] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 3, 2002.
[8] N. Bhan, L. Baldassarre, and V. Cevher.Tractability of interpretability via selection of group-sparse models.In IEEE Int. Symp. Inf. Theory, 2013.
[9] Venkat Chandrasekaran and Michael I. Jordan.Computational and statistical tradeoffs via convex relaxation.Proc. Nat. Acad. Sci., 110(13):E1181–E1190, 2013.
[10] Venkat Chandrasekaran, Benjamin Recht, Pablo A. Parrilo, and Alan S. Willsky.The convex geometry of linear inverse problems.Found. Comp. Math., 12:805–849, 2012.
[11] S. Chen, D. Donoho, and M. Saunders.Atomic decomposition by basis pursuit.SIAM J. Sci. Comp., 20(1):33–61, 1998.
[12] William Cook, László Lovász, and Alexander Schrijver.A polynomial-time test for total dual integrality in fixed dimension.In Mathematical programming at Oberwolfach II, pages 64–69. Springer, 1984.
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 40/ 45
References III
[13] Marco F. Duarte, Mark A. Davenport, Dharmpal Takhar, Jason N. Laska, Ting Sun, Kevin F. Kelly, and Richard G. Baraniuk. Single-pixel imaging via compressive sampling. IEEE Sig. Proc. Mag., 25(2):83–91, March 2008.
[14] Marwa El Halabi and Volkan Cevher.A totally unimodular view of structured sparsity.preprint, 2014.arXiv:1411.1990v1 [cs.LG].
[15] S. Fujishige.Submodular functions and optimization, volume 58.Elsevier Science, 2005.
[16] W. Gerstner and W. Kistler. Spiking neuron models: Single neurons, populations, plasticity. Cambridge University Press, 2002.
[17] FR Giles and William R Pulleyblank.Total dual integrality and integer polyhedra.Linear algebra and its applications, 25:191–196, 1979.
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 41/ 45
References IV
[18] C. Hegde, M. Duarte, and V. Cevher.Compressive sensing recovery of spike trains using a structured sparsity model.In Sig. Proc. with Adapative Sparse Struct. Rep. (SPARS), 2009.
[19] J. Huang, T. Zhang, and D. Metaxas.Learning with structured sparsity.J. Mach. Learn. Res., 12:3371–3412, 2011.
[20] R. Jenatton, A. Gramfort, V. Michel, G. Obozinski, F. Bach, and B. Thirion. Multi-scale mining of fMRI data with hierarchical structured sparsity. In Pattern Recognition in NeuroImaging (PRNI), 2011.
[21] R. Jenatton, J. Mairal, G. Obozinski, and F. Bach.Proximal methods for hierarchical sparse coding.J. Mach. Learn. Res., 12:2297–2334, 2011.
[22] A. Kyrillidis and V. Cevher. Combinatorial selection and least absolute shrinkage via the CLASH algorithm. In IEEE Int. Symp. Inf. Theory, pages 2216–2220. IEEE, 2012.
[23] George L Nemhauser and Laurence A Wolsey.Integer and combinatorial optimization, volume 18.Wiley New York, 1999.
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 42/ 45
References V
[24] G. Obozinski and F. Bach.Convex relaxation for combinatorial penalties.arXiv preprint arXiv:1205.1240, 2012.
[25] G. Obozinski, L. Jacob, and J.P. Vert.Group lasso with overlaps: The latent group lasso approach.arXiv preprint arXiv:1110.0413, 2011.
[26] G. Obozinski, B. Taskar, and M.I. Jordan.Joint covariate selection and joint subspace selection for multiple classificationproblems.Statistics and Computing, 20(2):231–252, 2010.
[27] Samet Oymak, Christos Thrampoulidis, and Babak Hassibi.Simple bounds for noisy linear inverse problems with exact side information.2013.arXiv:1312.0641v2 [cs.IT].
[28] C Seshadhri and Jan Vondrák.Is submodularity testable?Algorithmica, 69(1):1–25, 2014.
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 43/ 45
References VI
[29] Paul D Seymour.Decomposition of regular matroids.Journal of combinatorial theory, Series B, 28(3):305–359, 1980.
[30] Quoc Tran-Dinh and Volkan Cevher.Constrained convex minimization via model-based excessive gap.In Adv. Neur. Inf. Proc. Sys. (NIPS), 2014.
[31] Klaus Truemper.Alpha-balanced graphs and matrices and GF(3)-representability of matroids.J. Comb. Theory Ser. B, 32(2):112–139, 1982.
[32] Roman Vershynin.Estimation in high dimensions: a geometric perspective.http://arxiv.org/abs/1405.5103, May 2014.
[33] Peng Zhao, Guilherme Rocha, and Bin Yu.Grouped and hierarchical model selection through composite absolute penalties.Department of Statistics, UC Berkeley, Tech. Rep, 703, 2006.
[34] Peng Zhao and Bin Yu.On model selection consistency of Lasso.J. Mach. Learn. Res., 7:2541–2563, 2006.
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 44/ 45
References VII
[35] H. Zhou, M.E. Sehl, J.S. Sinsheimer, and K. Lange. Association screening of common and rare genetic variants by penalized regression. Bioinformatics, 26(19):2375, 2010.
A TU view of structured sparsity | Volkan Cevher, [email protected] Slide 45/ 45