Sparsity by worst-case quadratic penalties


Transcript of Sparsity by worst-case quadratic penalties

Page 1: Sparsity by worst-case quadratic penalties

Sparsity by Worst-case Quadratic Penalties

joint work with Yves Grandvalet and Christophe Ambroise
Statistique et Génome, CNRS & Université d'Évry Val d'Essonne

SSB group, Évry – October 16th, 2012

arXiv preprint with Y. Grandvalet and C. Ambroise: http://arxiv.org/abs/1210.2077

R package quadrupen, on CRAN

Page 2: Sparsity by worst-case quadratic penalties

Variable selection in high-dimensional problems

Questionable solutions

1. Treat univariate problems and select effects via multiple testing. Genomic data are often highly correlated...
2. Combine multivariate analysis and model selection techniques:

$\arg\min_{\beta \in \mathbb{R}^p} \; -\ell(\beta; y, X) + \lambda \|\beta\|_0$

Infeasible for p > 30 in general (NP-hard).

Popular idea (questionable too!)

Use $\ell_1$ as a convex relaxation of this problem, keeping its sparsity-inducing effect:

$\arg\min_{\beta \in \mathbb{R}^p} \; -\ell(\beta; y, X) + \lambda \cdot \mathrm{pen}_{\ell_1}(\beta)$
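To appreciate why the $\ell_0$ problem is combinatorial, here is a minimal R sketch on hypothetical toy data: it solves the $\ell_0$-penalized problem by brute force, fitting all $2^p$ least-squares models, which is exactly what stops being feasible beyond a few dozen variables.

```r
## Brute-force l0-penalized regression (illustration only).
set.seed(1)
n <- 50; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:3] %*% c(2, -2, 1) + rnorm(n))

lambda <- 5
best <- list(score = Inf, subset = integer(0))
for (k in 0:(2^p - 1)) {                      # all 2^p candidate supports
  S <- which(bitwAnd(k, 2^(0:(p - 1))) > 0)
  rss <- if (length(S) == 0) sum(y^2) else
           sum(resid(lm(y ~ X[, S, drop = FALSE] - 1))^2)
  score <- rss + lambda * length(S)           # RSS + lambda * ||beta||_0
  if (score < best$score) best <- list(score = score, subset = S)
}
best$subset   # typically recovers the true support {1, 2, 3}
```

At p = 30 this loop would already require about 10^9 fits, hence the infeasibility mentioned above.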


Page 3: Sparsity by worst-case quadratic penalties

Another algorithm for the Lasso? (- -)zzz Well, not really...

1. We suggest a unifying approach that might be useful since
   - it helps having a global view on the Lasso zoology,
   - reading every arXiv paper on $\ell_1$ that appears daily is out of reach,
   - insights are still needed to understand high-dimensional problems.
2. The associated algorithm is efficient and accurate up to medium-scale (thousands of variables) problems:
   - promising tools for pre-discarding irrelevant variables are emerging, so solving this class of problems may be enough;
   - bootstrapping, cross-validation and subsampling are highly needed... and our method is well adapted to them.

Page 4: Sparsity by worst-case quadratic penalties

Outline

Geometrical insights on Sparsity

Robustness Viewpoint

Numerical experiments

R package features



Page 6: Sparsity by worst-case quadratic penalties

A geometric view of sparsity

[Figure: level sets of $\ell(\beta_1, \beta_2)$ and of the constraint $\Omega$ in the $(\beta_1, \beta_2)$ plane.]

$\min_{\beta_1, \beta_2} \; -\ell(\beta_1, \beta_2) + \lambda\, \Omega(\beta_1, \beta_2) \quad \Longleftrightarrow \quad \max_{\beta_1, \beta_2} \; \ell(\beta_1, \beta_2) \ \text{ s.t. } \ \Omega(\beta_1, \beta_2) \le c$


Page 8: Sparsity by worst-case quadratic penalties

Singularities induce sparsity

[Figure: the constraint balls B of the Lasso ($\ell_1$) and of the Ridge ($\ell_2$), each with the normal cone $N_B(\beta)$ at a point β: the $\ell_1$ ball has singular points on the axes.]

$\beta \in B$ is optimal if and only if $\nabla \ell(\beta; X, y)$ defines a supporting hyperplane of B at β. Equivalently,

$-\nabla \ell(\beta; X, y) \in N_B(\beta)$.
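This optimality condition can be checked numerically. A minimal sketch on hypothetical data, assuming an orthonormal design so that the Lasso solution of $\frac{1}{2}\|y - X\beta\|^2 + \lambda\|\beta\|_1$ is available in closed form (soft thresholding): the negative gradient sits on the boundary of the normal cone, at value λ, exactly on the support, and strictly inside elsewhere.

```r
## Numeric check that -grad l(beta) lies in N_B(beta) for the Lasso.
set.seed(2)
n <- 100; p <- 5
X <- qr.Q(qr(matrix(rnorm(n * p), n, p)))    # orthonormal columns
y <- drop(X %*% c(3, -2, 0, 0, 1) + 0.1 * rnorm(n))
lambda <- 0.5
z    <- drop(crossprod(X, y))                # OLS coefficients here
beta <- sign(z) * pmax(abs(z) - lambda, 0)   # soft thresholding = Lasso fit
g    <- drop(crossprod(X, y - X %*% beta))   # minus the loss gradient
round(cbind(beta = beta, subgrad = g), 3)    # |g_j| = lambda iff beta_j != 0
```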


Page 9: Sparsity by worst-case quadratic penalties

Outline

Geometrical insights on Sparsity

Robustness Viewpoint

Numerical experiments

R package features


Page 10: Sparsity by worst-case quadratic penalties

Robust optimization framework

Worst-case analysis

We wish to solve a regression problem where

$\hat\beta = \arg\min_{\beta \in \mathbb{R}^p} \max_{\gamma \in \mathcal{D}_\gamma} \; \|X\beta - y\|^2 + \lambda \|\beta - \gamma\|^2$,

- $\mathcal{D}_\gamma$ describes an uncertainty set for the parameters,
- γ acts as a spurious adversary over the true $\beta^\star$;

maximizing over $\mathcal{D}_\gamma$ leads to the worst-case formulation.

Page 11: Sparsity by worst-case quadratic penalties

Sparsity by worst-case formulation

First mathematical breakthrough

Great, $(a - b)^2 = a^2 + b^2 - 2ab$, so

$\min_{\beta \in \mathbb{R}^p} \max_{\gamma \in \mathcal{D}_\gamma} \|X\beta - y\|^2 + \lambda\|\beta - \gamma\|^2 \;=\; \min_{\beta \in \mathbb{R}^p} \|X\beta - y\|^2 + \lambda\|\beta\|^2 + \lambda \max_{\gamma \in \mathcal{D}_\gamma} \left\{\|\gamma\|^2 - 2\gamma^\top\beta\right\}$.

Look for a sparsity-inducing norm

- Choose $\mathcal{D}_\gamma$ so as to recover your favourite $\ell_1$-type penalizer via the $\gamma^\top\beta$ term,
- and forget about $\|\gamma\|^2$, which does not change the minimization;
- this can be done 'systematically' by imposing regularity on β and considering the dual adversarial assumption on γ.
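As a sanity check on this expansion, the sketch below (hypothetical values) brute-forces the inner maximization over the corners of the $\ell_\infty$ ball, where the maximum of this convex function is attained, and recovers the closed form $p\eta^2 + 2\eta\|\beta\|_1$: a constant plus an $\ell_1$ penalty.

```r
## The adversary on the l-infinity ball produces an l1 penalty.
set.seed(3)
p <- 4; eta <- 0.7
beta <- rnorm(p)
corners <- as.matrix(expand.grid(rep(list(c(-eta, eta)), p)))  # 2^p corners
vals <- apply(corners, 1, function(g) sum(g^2) - 2 * sum(g * beta))
max(vals)                             # brute-force maximum over D_gamma
p * eta^2 + 2 * eta * sum(abs(beta))  # closed form: constant + l1 term
```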



Page 15: Sparsity by worst-case quadratic penalties

Example: robust formulation of the elastic net

'Lasso' regularity set for β

The $\ell_1$ norm must be controlled:

$\mathcal{H}^{\mathrm{Lasso}}_\beta = \{\beta \in \mathbb{R}^p : \|\beta\|_1 \le \eta_\beta\}$.

Dual assumption on the adversary

The $\ell_\infty$ norm of γ should be controlled, say:

$\mathcal{D}^{\mathrm{Lasso}}_\gamma = \Big\{\gamma \in \mathbb{R}^p : \max_{\beta \in \mathcal{H}^{\mathrm{Lasso}}_\beta} \gamma^\top\beta \le 1\Big\} = \{\gamma \in \mathbb{R}^p : \|\gamma\|_\infty \le \eta_\gamma\} = \mathrm{conv}\{-\eta_\gamma, \eta_\gamma\}^p$,

where $\eta_\gamma = 1/\eta_\beta$.

Page 16: Sparsity by worst-case quadratic penalties

Example: robust formulation of the elastic net (continued)

Now, the other way round:

$\min_{\beta \in \mathbb{R}^p} \|X\beta - y\|^2 + \lambda\|\beta\|^2 + \lambda \max_{\gamma \in \{-\eta_\gamma, \eta_\gamma\}^p} \left\{\|\gamma\|^2 - 2\gamma^\top\beta\right\}$

$= \min_{\beta \in \mathbb{R}^p} \|X\beta - y\|^2 + \lambda\|\beta\|^2 + 2\lambda\eta_\gamma\|\beta\|_1 + c$

$\Longleftrightarrow \min_{\beta \in \mathbb{R}^p} \frac{1}{2}\|X\beta - y\|^2 + \lambda_1\|\beta\|_1 + \frac{\lambda_2}{2}\|\beta\|^2$,

for known $\lambda_1, \lambda_2$.

We recognize the 'official' Elastic-net regularizer.

Page 17: Sparsity by worst-case quadratic penalties

Example: robust formulation of the elastic net (continued)

Geometrical argument with the constrained formulation:

$\min_{\beta \in \mathbb{R}^p} \|X\beta - y\|^2 \quad \text{s.t.} \quad \|\beta\|_2^2 + \eta\|\beta\|_1 \le s$

$\Longleftrightarrow \quad \min_{\beta \in \mathbb{R}^p} \|X\beta - y\|^2 \quad \text{s.t.} \quad \max_{\gamma \in \{-\eta, \eta\}^p} \|\beta - \gamma\|^2 \le s + \eta^2$


Page 20: Sparsity by worst-case quadratic penalties

Example: robust formulation of the $\ell_1/\ell_\infty$ group-Lasso

Argument with one group, which generalizes by decomposability of the group norm.

The regularity set for β, with the $\ell_\infty$ norm controlled:

$\mathcal{H}^{\max}_\beta = \{\beta \in \mathbb{R}^p : \|\beta\|_\infty \le \eta_\beta\}$.

Dual assumption on the adversary:

$\mathcal{D}^{\max}_\gamma = \Big\{\gamma \in \mathbb{R}^p : \sup_{\beta \in \mathcal{H}^{\max}_\beta} \gamma^\top\beta \le 1\Big\} = \{\gamma \in \mathbb{R}^p : \|\gamma\|_1 \le \eta_\gamma\} = \mathrm{conv}\{\eta_\gamma e^p_1, \ldots, \eta_\gamma e^p_p, -\eta_\gamma e^p_1, \ldots, -\eta_\gamma e^p_p\}$,

where $\eta_\gamma = 1/\eta_\beta$ and $e^p_j$ is the j-th element of the canonical basis of $\mathbb{R}^p$, that is, $e^p_{jj'} = 1$ if $j = j'$ and $0$ otherwise.

Page 21: Sparsity by worst-case quadratic penalties

Example: robust formulation of the $\ell_1/\ell_\infty$ group-Lasso

Geometrical argument with the constrained formulation:

$\min_{\beta \in \mathbb{R}^p} \|X\beta - y\|^2 \quad \text{s.t.} \quad \|\beta\|_2^2 + \eta\|\beta\|_\infty \le s$

$\Longleftrightarrow \quad \min_{\beta \in \mathbb{R}^p} \|X\beta - y\|^2 \quad \text{s.t.} \quad \max_{\gamma \in \mathcal{D}^{\max}_\gamma} \|\beta - \gamma\|^2 \le s + \eta^2$


Page 24: Sparsity by worst-case quadratic penalties

Generalize this principle to your favourite problem

[Figure: unit balls of penalties obtained by this construction — elastic-net ($\ell_1 + \ell_2$), $\ell_\infty + \ell_2$, structured elastic-net, fused-Lasso + $\ell_2$, OSCAR + $\ell_2$.]

Page 25: Sparsity by worst-case quadratic penalties

Worst-Case Quadratic Penalty Active-Set Algorithm

S0 Initialization
   β ← β0                                        // start with a feasible β
   A ← {j : βj ≠ 0}                              // determine the active set
   γ ← arg max_{g ∈ Dγ} ‖β − g‖²                 // pick a worst admissible γ

S1 Update the active variables βA
   βA ← (XAᵀXA + λ I|A|)⁻¹ (XAᵀ y + λ γA)        // subproblem resolution

S2 Verify coherence of γA with the updated βA
   if ‖βA − γA‖² < max_{g ∈ Dγ} ‖βA − gA‖² then  // γA is not worst-case anymore
      βA ← βA^old + ρ (βA − βA^old)              // back to the last γA-coherent solution

S3 Update the active set A
   gj ← min_{γ ∈ Dγ} | xjᵀ(XA βA − y) + λ (βj − γj) |,  j = 1, …, p   // worst-case gradient
   if ∃ j ∈ A : βj = 0 and gj = 0 then
      A ← A \ {j}                                // downgrade j
   else if max_{j ∈ Aᶜ} gj ≠ 0 then
      j★ ← arg max_{j ∈ Aᶜ} gj ;  A ← A ∪ {j★}   // upgrade j★
   else
      stop and return β, which is optimal
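The following R sketch illustrates this loop in the elastic-net case, where $\mathcal{D}_\gamma$ is the $\ell_\infty$ ball of radius η and the worst admissible γ is $\gamma_j = -\eta\,\mathrm{sign}(\beta_j)$. It is a didactic transcription of steps S0–S3, not the quadrupen implementation: in particular, the backtracking of step S2 is simplified to zeroing sign-incoherent variables.

```r
## Worst-case quadratic penalty active-set loop, elastic-net case (sketch).
wcqp.enet <- function(X, y, lambda, eta, maxit = 500, tol = 1e-8) {
  p <- ncol(X)
  beta <- numeric(p)
  A <- integer(0)                     # S0: start from the (feasible) null model
  s <- numeric(p)                     # signs of beta_A, i.e. -gamma_A / eta
  for (it in 1:maxit) {
    if (length(A) > 0) {
      ## S1: solve the quadratic subproblem on A with gamma_A = -eta * s_A
      XA <- X[, A, drop = FALSE]
      bA <- drop(solve(crossprod(XA) + lambda * diag(length(A)),
                       crossprod(XA, y) - lambda * eta * s[A]))
      ## S2 (simplified): discard variables whose sign contradicts gamma_A
      bA[sign(bA) != s[A]] <- 0
      beta[] <- 0; beta[A] <- bA
      A <- A[beta[A] != 0]
    }
    ## S3: worst-case gradient; beta is optimal when it vanishes everywhere
    r <- drop(crossprod(X, X %*% beta - y)) + lambda * beta
    g <- pmax(abs(r) - lambda * eta, 0)
    g[A] <- 0                         # active coordinates are solved exactly
    if (max(g) < tol) break           # stop: beta is optimal
    j <- which.max(g)                 # upgrade the worst violator
    A <- c(A, j)
    s[j] <- -sign(r[j])               # adversarial corner for the new gamma_j
  }
  beta
}
```

For instance, wcqp.enet(X, y, lambda = 1, eta = 0.5) returns an elastic-net-type fit whose coordinates satisfy the worst-case stationarity condition of step S3.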



Page 29: Sparsity by worst-case quadratic penalties

Algorithm complexity (see, e.g., Bach et al., 2012)

Suppose that
- the algorithm stops at $\lambda_{\min}$ with k activated variables,
- no downgrade has been observed (we thus have k iterations/steps).

Complexity in favorable cases
1. computing $X^\top X_A + \lambda I_{|A|}$: O(npk),
2. maintaining $x_j^\top(X_A \beta_A - y)$ along the path: O(pn + pk²),
3. Cholesky update of $(X_A^\top X_A + \lambda I_{|A|})^{-1}$: O(k³),

for a total of O(npk + pk² + k³). Hence k, which is determined by $\lambda_{\min}$, matters...

Page 30: Sparsity by worst-case quadratic penalties

A bound to assess the distance to the optimum during optimization

Proposition

For any $\eta_\gamma > 0$ and any vector norm $\|\cdot\|_*$, when $\mathcal{D}_\gamma$ is defined as $\mathcal{D}_\gamma = \{\gamma \in \mathbb{R}^p : \|\gamma\|_* \le \eta_\gamma\}$, then, for all $\gamma \in \mathbb{R}^p$ such that $\|\gamma\|_* \ge \eta_\gamma$, we have:

$\min_{\beta \in \mathbb{R}^p} \max_{\gamma' \in \mathcal{D}_\gamma} J_\lambda(\beta, \gamma') \;\ge\; \frac{\eta_\gamma}{\|\gamma\|_*}\, J_\lambda\big(\beta^\star(\gamma), \gamma\big) \;-\; \lambda\,\eta_\gamma\,(\|\gamma\|_* - \eta_\gamma)\, \frac{\|\gamma\|^2}{\|\gamma\|_*^2}$,

where

$J_\lambda(\beta, \gamma) = \|X\beta - y\|^2 + \lambda\|\beta - \gamma\|^2 \quad\text{and}\quad \beta^\star(\gamma) = \arg\min_{\beta \in \mathbb{R}^p} J_\lambda(\beta, \gamma)$.

This proposition can be used to compute an optimality gap by picking a γ value such that the current worst-case gradient is null (the current β value then being the optimal $\beta^\star(\gamma)$).
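In R, the two quantities involved are one-liners; a minimal sketch, directly transcribing the two formulas above, with $\beta^\star(\gamma)$ computed as the ridge-like solution of the inner problem:

```r
## J_lambda(beta, gamma) and its minimizer beta*(gamma) in beta.
J.lambda <- function(beta, gamma, X, y, lambda)
  sum((X %*% beta - y)^2) + lambda * sum((beta - gamma)^2)

beta.star <- function(gamma, X, y, lambda)
  drop(solve(crossprod(X) + lambda * diag(ncol(X)),   # (X'X + lambda I)^-1 ...
             crossprod(X, y) + lambda * gamma))       # ... (X'y + lambda gamma)
```

Evaluating J.lambda at (beta.star(gamma), gamma) then yields the right-hand side of the bound for any candidate γ outside the ball.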


Page 31: Sparsity by worst-case quadratic penalties

Bound: illustration on an Elastic-Net problem

[Figure: optimality gap ($\log_{10}$ scale) against the number of iterations, n = 50, p = 200.]

Figure: Monitoring convergence: true optimality gap (solid black) versus our pessimistic bound (dashed blue) and Fenchel's duality gap (dotted red), computed at each iteration of the algorithm.

Page 32: Sparsity by worst-case quadratic penalties

Full interpretation in a robust optimization framework

Proposition

The robust regression problem

$\min_{\beta \in \mathbb{R}^p} \max_{(\Delta X, \varepsilon, \gamma) \in \mathcal{D}_X \times \mathcal{D}_\varepsilon \times \mathcal{D}_\gamma} \|(X - \Delta X)\beta + X\gamma + \varepsilon - y\|^2$,

for a given form of the global uncertainty set $\mathcal{D}_X \times \mathcal{D}_\varepsilon \times \mathcal{D}_\gamma$ on $(\Delta X, \varepsilon, \gamma)$, is equivalent to the robust regularized regression problem:

$\min_{\beta \in \mathbb{R}^p} \max_{\gamma \in \mathcal{D}_\gamma} \|X\beta - y\|^2 + \eta_X \|\beta - \gamma\|^2$.

These assumptions entail the following relationship between X and y:

$y = (X - \Delta X)\beta^\star + X\gamma + \varepsilon$.

The observed responses are formed by summing the contributions of the unobserved clean inputs, the adversarial noise that maps the observed inputs to the responses, and the neutral noise.

Page 33: Sparsity by worst-case quadratic penalties

Outline

Geometrical insights on Sparsity

Robustness Viewpoint

Numerical experiments

R package features


Page 34: Sparsity by worst-case quadratic penalties

General objectives

Assessing the efficiency of an algorithm
- accuracy is the difference between the optimum of the objective function and its value at the solution returned by the algorithm;
- speed is the computing time required for returning this solution.

Timings have to be compared at similar precision requirements.

Mimicking post-genomic data attributes

Optimization difficulties result from ill-conditioning, due to
- either high correlation between predictors,
- or underdetermination (high-dimensional or "large p, small n" setup).

Remarks
1. With active-set strategies, bad conditioning is somewhat alleviated.
2. Sparsity of the true parameter heavily impacts the running times.


Page 36: Sparsity by worst-case quadratic penalties

Data generation

Exploring those characteristics with linear regression

We generate samples of size n from the model

$y = X\beta^\star + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I)$,

- σ chosen so as to reach $R^2 \approx 0.8$,
- $X \sim \mathcal{N}(0, \Sigma)$, with $\Sigma_{ij} = \mathbf{1}_{\{i = j\}} + \rho\, \mathbf{1}_{\{i \ne j\}}$,
- $\mathrm{sgn}(\beta^\star) = (\underbrace{1, \ldots, 1}_{s/2}, \underbrace{-1, \ldots, -1}_{s/2}, \underbrace{0, \ldots, 0}_{p-s})$.

Controlling the difficulty
- $\rho \in \{0.1, 0.4, 0.8\}$ rules the conditioning,
- $s \in \{10\%, 30\%, 60\%\}$ controls the sparsity,
- the ratio $n/p \in \{2, 1, 0.5\}$ quantifies the well/ill-posedness.

This design is transcribed in R right below.
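A sketch of the simulation design (the slides only fix the sign pattern of $\beta^\star$, so unit magnitudes are assumed here, and σ is calibrated from the usual ratio $R^2 = \mathrm{var}(X\beta^\star)/(\mathrm{var}(X\beta^\star) + \sigma^2)$):

```r
## Simulation design of the experiments (sketch; see assumptions above).
gen.data <- function(n, p, rho, s, R2 = 0.8) {
  Sigma <- matrix(rho, p, p); diag(Sigma) <- 1   # Sigma_ij = 1{i=j} + rho 1{i!=j}
  X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)
  beta <- rep(c(1, -1, 0), c(s %/% 2, s %/% 2, p - 2 * (s %/% 2)))
  signal <- drop(X %*% beta)
  sigma <- sqrt(var(signal) * (1 - R2) / R2)     # so that R^2 is about 0.8
  list(X = X, y = signal + rnorm(n, 0, sigma), beta = beta)
}
dat <- gen.data(n = 100, p = 200, rho = 0.4, s = 0.3 * 200)  # n/p = 0.5, s = 30%
```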


Page 37: Sparsity by worst-case quadratic penalties

Comparing optimization strategies

We used our own code to avoid implementation biases for
1. accelerated proximal methods – proximal,
2. coordinate descent – coordinate,
3. our quadratic solver – quadratic,

all wrapped in the same active-set + warm-start routine. Timings are averaged over 100 runs when minimizing

$J^{\mathrm{enet}}_{\lambda_1, \lambda_2}(\beta) = \frac{1}{2}\|X\beta - y\|^2 + \lambda_1\|\beta\|_1 + \frac{\lambda_2}{2}\|\beta\|^2$,

with halting condition

$\max_{j \in \{1, \ldots, p\}} \left| x_j^\top(y - X\beta) + \lambda_2 \beta_j \right| < \lambda_1 + \tau, \qquad (1)$

where the threshold is $\tau = 10^{-2}$, on a $50 \times 50$ grid of $(\lambda_1, \lambda_2)$ values.
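Condition (1) is straightforward to transcribe; a minimal sketch:

```r
## Halting condition (1) as an R predicate (tau = 1e-2 as in the experiments).
converged <- function(beta, X, y, lambda1, lambda2, tau = 1e-2)
  max(abs(crossprod(X, y - X %*% beta) + lambda2 * beta)) < lambda1 + tau
```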


Page 38: Sparsity by worst-case quadratic penalties

[Figure: heat maps of the log-ratio between the timing of each competitor (coordinate descent, proximal/FISTA) and quadratic, over the $(\log_{10}\lambda_1, \log_{10}\lambda_2)$ grid, for small (0.1), medium (0.4) and large (0.8) correlation; p = 2n = 100, s = 30. Color scale: number of times quadratic is faster, from 1 to 300.]

Page 39: Sparsity by worst-case quadratic penalties

Comparing stand-alone implementations

We compare our method on a Lasso problem to popular R packages:
1. accelerated proximal methods – SPAMs-FISTA (Mairal, Bach et al.),
2. coordinate descent – glmnet (Friedman, Hastie, Tibshirani),
3. homotopy/LARS algorithm – lars (Efron, Hastie) and SPAMs-LARS,
4. quadratic solver – quadrupen.

The distance D to the optimum is evaluated on $J^{\mathrm{lasso}}_\lambda(\beta) = J^{\mathrm{enet}}_{\lambda, 0}(\beta)$ by

$D(\mathrm{method}) = \left( \frac{1}{|\Lambda|} \sum_{\lambda \in \Lambda} \left( J^{\mathrm{lasso}}_\lambda\big(\hat\beta^{\mathrm{lars}}_\lambda\big) - J^{\mathrm{lasso}}_\lambda\big(\hat\beta^{\mathrm{method}}_\lambda\big) \right)^2 \right)^{1/2}$,

where Λ is given by the first $\min(n, p)$ steps of lars.

We vary ρ and (p, n), fix $s = 0.25 \min(n, p)$, and average over 50 runs.
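A direct transcription of this accuracy measure (sketch; B.lars and B.method are assumed to hold one coefficient vector per column, one column per λ ∈ Λ):

```r
## J^lasso and the accuracy measure D(method) along the lars grid Lambda.
J.lasso <- function(beta, X, y, lambda)
  0.5 * sum((X %*% beta - y)^2) + lambda * sum(abs(beta))

D.method <- function(B.lars, B.method, X, y, Lambda)
  sqrt(mean(sapply(seq_along(Lambda), function(k)
    (J.lasso(B.lars[, k], X, y, Lambda[k]) -
     J.lasso(B.method[, k], X, y, Lambda[k]))^2)))
```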


Page 40: Sparsity by worst-case quadratic penalties

[Figure: D(method) ($\log_{10}$) against CPU time (seconds, $\log_{10}$) for n = 100, p = 40, at low (0.1), medium (0.4) and high (0.8) correlation; methods: glmnet (CD, active set), SPAMs (FISTA, no active set), SPAMs (homotopy/LARS), quadrupen (this paper), lars (homotopy/LARS).]

Page 41: Sparsity by worst-case quadratic penalties

[Figure: the same comparison for n = 200, p = 1000.]

Page 42: Sparsity by worst-case quadratic penalties

[Figure: the same comparison for n = 400, p = 10000.]

Page 43: Sparsity by worst-case quadratic penalties

Link between accuracy and prediction performances

The Lasso requires the rather restrictive 'irrepresentable condition' (or one of its avatars) on the design for sign consistency...

Any unforeseen consequences of a lack of accuracy?

An early stop of the algorithm is likely to prevent
- either the removal of all irrelevant coefficients,
- or the insertion of relevant ones.

Illustration on mean square error and support recovery
- Generate 100 training data sets for linear regression with ρ = 0.8, R² = 0.8, p = 100, s = 30%, varying n.
- Generate for each a large test set (say, of size 10n) for evaluation.

method                      quadrupen   glmnet (low)   glmnet (med)   glmnet (high)
timing (ms)                 8           7              8              64
accuracy (dist. to opt.)    5.9e-14     7.2e+00        6.04e+00       1.47e-02


Page 46: Sparsity by worst-case quadratic penalties

[Figure: test MSE (top) and sign error (bottom) against $\log_{10}(\lambda_1)$, for $n/p \in \{0.5, 1, 2\}$, comparing quadrupen with glmnet at low, medium and high precision.]

Page 47: Sparsity by worst-case quadratic penalties

Outline

Geometrical insights on Sparsity

Robustness Viewpoint

Numerical experiments

R package features


Page 48: Sparsity by worst-case quadratic penalties

Learning features

Problem solved

$\hat\beta = \arg\min_{\beta} \frac{1}{2}(y - X\beta)^\top W (y - X\beta) + \lambda_1 \|\omega \circ \beta\|_1 + \lambda_2\, \beta^\top S \beta$,

where
- $W = \mathrm{diag}(w)$, with $(w_i)_{i=1}^n \ge 0$ some observation weights,
- $\omega$, with $(\omega_j)_{j=1}^p > 0$, the $\ell_1$ penalty weights,
- $S$ a $p \times p$ positive definite matrix 'structuring' the $\ell_2$ penalty.

Some corresponding estimators
- (Adaptive) Lasso,
- (Structured) Elastic-net,
- Fused-Lasso signal approximator (inefficient),
- ...

The objective itself is written out as a plain R function below.
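A transcription of the criterion, not of the solver (w, omega and S are as described above):

```r
## The general quadrupen objective: weighted loss + weighted l1 + structured l2.
J.quadrupen <- function(beta, X, y, w, omega, S, lambda1, lambda2) {
  r <- drop(y - X %*% beta)
  0.5 * sum(w * r^2) +                       # 1/2 (y - X beta)' W (y - X beta)
    lambda1 * sum(omega * abs(beta)) +       # lambda1 ||omega o beta||_1
    lambda2 * drop(t(beta) %*% S %*% beta)   # lambda2 beta' S beta
}
```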


Page 50: Sparsity by worst-case quadratic penalties

Technical features

Written in R (with S4 classes) and C++.

Dependencies
- armadillo + RcppArmadillo, its accompanying interface to R. "Armadillo is a C++ linear algebra library (matrix maths) aiming towards a good balance between speed and ease of use. The syntax is deliberately similar to Matlab."
- Matrix, to handle sparse matrices.
- ggplot2, a plotting system based on the grammar of graphics.
- parallel (not available for Windows :-)), a built-in R package since R 2.14 which allows parallel computation over the available CPU cores or clusters (very useful for cross-validation, for instance).

Suited to solving small to medium-scale problems.

Page 51: Sparsity by worst-case quadratic penalties

Load the package and show dependencies

R> library("quadrupen")
 [1] "quadrupen" "Matrix"    "lattice"   "ggplot2"   "stats"     "graphics"
 [7] "grDevices" "utils"     "datasets"  "methods"   "base"

Generate a size-100 vector of parameters with labels

R> ## VECTOR OF TRUE PARAMETERS
R> ## sparse, blockwise shaped
R> beta <- rep(c(0, -2, 2), c(80, 10, 10))
R> ## labels for true nonzeros
R> labels <- rep("irrelevant", length(beta))
R> labels[beta != 0] <- "relevant"
R> labels <- factor(labels, ordered = TRUE,
+                   levels = c("relevant", "irrelevant"))

Page 52: Sparsity by worst-case quadratic penalties

R> ## COVARIANCE STRUCTURE OF THE PREDICTORS
R> ## Toeplitz correlation between irrelevant variables
R> cor <- 0.8
R> S11 <- toeplitz(cor^(0:(80 - 1)))
R> ## block correlation between relevant variables
R> S22 <- matrix(cor, 10, 10)
R> diag(S22) <- 1
R> ## correlation between relevant and irrelevant variables
R> eps <- 0.25
R> Sigma <- bdiag(S11, S22, S22) + eps

[Image: level plot of the resulting 100 × 100 covariance matrix Sigma.]

Page 53: Sparsity by worst-case quadratic penalties

Generate n = 100 observations with a high level of noise (σ = 10)

R> mu    <- 3
R> sigma <- 10
R> n     <- 100
R> x <- as.matrix(matrix(rnorm(100 * n), n, 100) %*% chol(Sigma))
R> y <- mu + x %*% beta + rnorm(n, 0, sigma)

Give a try to raw Lasso and Elastic-net fits...

R> start <- proc.time()
R> lasso <- elastic.net(x, y, lambda2 = 0)
R> e.net <- elastic.net(x, y, lambda2 = 1)
R> print(proc.time() - start)
   user  system elapsed
  0.080   0.000   0.078

Page 54: Sparsity by worst-case quadratic penalties

A print/show method is defined:

R> print(e.net)

Linear regression with elastic net penalizer, coefficients rescaled by (1+lambda2).

- number of coefficients    : 100 + intercept
- penalty parameter lambda1 : 100 points from 171 to 0.856
- penalty parameter lambda2 : 1

Also consider the residuals, deviance, predict, and fitted methods...

R> head(deviance(e.net))
171.217 162.295 153.837 145.821 138.222 131.019
  82366   76185   68799   60415   52456   45139

Page 55: Sparsity by worst-case quadratic penalties

R> plot(lasso, main = "Lasso", xvar = "fraction", labels = labels)

[Figure: Lasso path — standardized coefficients against the fraction $|\beta_{\lambda_1}|_1 / \max_{\lambda_1}|\beta_{\lambda_1}|_1$, with relevant and irrelevant variables distinguished by color.]

Page 56: Sparsity by worst-case quadratic penalties

R> plot(lasso, main = "Lasso", xvar = "lambda",
+      log.scale = FALSE, reverse = TRUE, labels = labels)

[Figure: Lasso path — standardized coefficients against $\lambda_1$ (linear scale, reversed).]

Page 57: Sparsity by worst-case quadratic penalties

R> plot(lasso, main = "Lasso", labels = labels)

[Figure: Lasso path — standardized coefficients against $\log_{10}(\lambda_1)$.]

Page 58: Sparsity by worst-case quadratic penalties

R> plot(e.net, main = "Elastic-net", labels = labels)

[Figure: Elastic-net path — standardized coefficients against $\log_{10}(\lambda_1)$.]

Page 59: Sparsity by worst-case quadratic penalties

R> system.time(
+   cv.double <- crossval(x, y, lambda2 = 10^seq(1, -1.5, len = 50))
+ )

DOUBLE CROSS-VALIDATION
10-fold CV on the lambda1 grid for each lambda2
10    8.892 7.906 7.03  6.251 5.558 4.942 4.394 3.907 3.474
3.089 2.746 2.442 2.171 1.931 1.717 1.526 1.357 1.207 1.073
0.954 0.848 0.754 0.671 0.596 0.53  0.471 0.419 0.373 0.331
0.295 0.262 0.233 0.207 0.184 0.164 0.146 0.129 0.115 0.102
0.091 0.081 0.072 0.064 0.057 0.051 0.045 0.04  0.036 0.032
   user  system elapsed
 10.888   1.492   6.636

R> plot(cv.double)

Page 60: Sparsity by worst-case quadratic penalties

[Figure: mean cross-validation error as a heat map over the $\log_{10}(\lambda_1) \times \log_{10}(\lambda_2)$ grid.]

Page 61: Sparsity by worst-case quadratic penalties

R> lambda2 <- slot(cv.double, "lambda2.min")
[1] 0.8483

R> system.time(
+   cv.simple <- crossval(x, y, lambda2 = lambda2)
+ )

SIMPLE CROSS-VALIDATION
10-fold CV on the lambda1 grid, lambda2 is fixed.
   user  system elapsed
  0.312   0.052   0.266

R> sum(sign(slot(cv.simple, "beta.min")) != sign(beta))
[1] 0

R> plot(cv.simple)

Page 62: Sparsity by worst-case quadratic penalties

[Figure: cross-validation error (mean square error) against $\log_{10}(\lambda_1)$, with the minimum-MSE and 1-SE-rule choices of lambda marked.]

Page 63: Sparsity by worst-case quadratic penalties

R> marks <- log10(c(slot(cv.simple, "lambda1.min"),
+                   slot(cv.simple, "lambda1.1se")))
R> graph <- plot(elastic.net(x, y, lambda2 = lambda2),
+               labels = labels, plot = FALSE)
R> graph + geom_vline(xintercept = marks)

[Figure: elastic net path with vertical lines marking the two cross-validated choices of $\lambda_1$.]

Page 64: Sparsity by worst-case quadratic penalties

Stability Selection

LetI I be a random subsample of size bn/2cI Sλ(I ) = j : βj (I )λ 6= 0 the estimated support at λ,I Πλ

j = P(j ⊆ Sλ(I )) the estimated selection probabilities,

I qΛ = E(|SΛ(I )|), the average number of selected variables whereΛ = [λmax, λmin].

DefinitionThe set of stable variables on Λ with respect to a cutoff πthr is

S stable = j : maxλ∈Λ

Πλj ≥ πthr

Proposition

If the distribution of 1k∈Sλ is exchangeable for any λ ∈ Λ then,

FWER ≤ PFER = E (V ) ≤ 1

2πthr − 1· q

p47
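A generic subsampling sketch of these quantities (the selector below is a placeholder marginal-correlation screener, purely for illustration; the next slides do this properly with quadrupen's stability()):

```r
## Estimate the selection probabilities Pi_j by subsampling (sketch).
stab.probs <- function(x, y, B = 100,
                       select = function(x, y)                 # placeholder screener:
                         which(rank(-abs(cor(x, y))) <= 10)) { # top-10 correlations
  counts <- numeric(ncol(x))
  for (b in 1:B) {
    I <- sample(nrow(x), floor(nrow(x) / 2))                   # subsample of size n/2
    S <- select(x[I, , drop = FALSE], y[I])
    counts[S] <- counts[S] + 1
  }
  counts / B
}
## Stable set at cutoff 0.75, e.g. which(stab.probs(x, y) >= 0.75); the
## proposition then bounds E(V) by q^2 / ((2 * 0.75 - 1) * p).
```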


Page 66: Sparsity by worst-case quadratic penalties

R> system.time(
+   stab <- stability(x, y, subsamples = 400,
+                     randomize = TRUE, weakness = 0.5)
+ )

STABILITY SELECTION with randomization (weakness = 0.5)
Fitting procedure: elastic.net with lambda2 = 0.01 and a 100-dimensional grid of lambda1.
Running 2 jobs in parallel (1 per core)
Approx. 200 subsamplings for each job, for a total of 400
   user  system elapsed
  5.273   0.144   5.440

R> print(stab)

Stability path for elastic.net penalizer, coefficients rescaled by (1+lambda2).

- penalty parameter lambda1 : 100 points from 171 to 0.856
- penalty parameter lambda2 : 0.01

Page 67: Sparsity by worst-case quadratic penalties

R> plot(stab, labels = labels, cutoff = 0.75, PFER = 1)

[Figure: stability path of the elastic.net regularizer — selection probabilities against the average number of selected variables, cutoff $\pi_{\mathrm{thr}} = 0.75$; the constraint PFER ≤ 1 gives q = 7.29. Selected/unselected and relevant/irrelevant variables are distinguished.]

Page 68: Sparsity by worst-case quadratic penalties

R> plot(stab, labels = labels, cutoff = 0.75, PFER = 2)

[Figure: the same stability path with PFER ≤ 2, giving q = 10.03.]

Page 69: Sparsity by worst-case quadratic penalties

Concluding remarks

What has been done
- A unifying view of sparsity through the quadratic formulation,
- A robust regression interpretation,
- A competitive algorithm for small to medium-scale problems,
- An accompanying R package,
- Insights into the links between accuracy and prediction performances.

What will be done (almost surely, and soon)
1. Multivariate problems (Stephane, Guillem/Pierre):
   $\mathrm{tr}\{(Y - XB)^\top \Omega (Y - XB)\} + \mathrm{pen}_{\lambda_1, \lambda_2}(B)$.
2. A truly sparse handling of the design matrix (Stephane).
3. Group $\ell_1/\ell_\infty$ penalty (prototyped in R) (Stephane, Camille, Eric).


Page 71: Sparsity by worst-case quadratic penalties

More perspectives

Hopefully (help needed)
1. Screening/early discarding of irrelevant features.
2. Efficient implementation of iteratively reweighted least squares for logistic regression, the Cox model, etc. (Sarah? Marius?)
3. Consider an implementation of the OSCAR/group-OSCAR (for segmentation purposes). (Alia? Morgane? Pierre?)
4. Control the precision when solving the subproblems (tortured M2 student):
   - a dirty (yet controlled) resolution could even speed up the procedure;
   - this can be done via (preconditioned) conjugate gradient / iterative methods;
   - some promising results during Aurore's M2.
5. More tools for robust statistics...
6. Integration with SIMoNe (if I don't know what to do / am at loose ends).

Page 72: Sparsity by worst-case quadratic penalties

References

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.

H. Xu, C. Caramanis, and S. Mannor. Robust regression and Lasso. IEEE Transactions on Information Theory, 56(7):3561–3574, 2010.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

H. D. Bondell and B. J. Reich. Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics, 64(1):115–123, 2008.

W. J. Fu. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397–416, 1998.

L. El Ghaoui and H. Lebret. Robust solutions to least-squares problems with uncertain data. SIAM Journal on Matrix Analysis and Applications, 18(4):1035–1064, 1997.