Sparsity by Worst-case Quadratic Penalties

Joint work with Yves Grandvalet and Christophe Ambroise

Statistique et Genome, CNRS & Universite d’Evry Val d’Essonne

SSB group, Evry – October 16th, 2012

arXiv preprint with Y. Grandvalet and C. Ambroise.

http://arxiv.org/abs/1210.2077

R-package quadrupen, on CRAN

1

Variable selection in high-dimensional problems

Questionable solutions

1. Treat univariate problems and select effects via multiple testing.
   Genomic data are often highly correlated. . .

2. Combine multivariate analysis and model selection techniques:

   argmin_{β∈R^p} −ℓ(β; y, X) + λ‖β‖₀

   Infeasible for p > 30 in general (NP-hard).

Popular idea (questionable too!)

Use ℓ1 as a convex relaxation of this problem, keeping its sparsity-inducing effect:

   argmin_{β∈R^p} −ℓ(β; y, X) + λ · pen_ℓ1(β)

2

Another algorithm for the Lasso? (- -)zzz Well, not really. . .

1. We suggest a unifying approach that might be useful since
   - it helps having a global view of the Lasso zoology,
   - reading every daily arXiv paper dealing with ℓ1 is out of reach,
   - insights are still needed to understand high-dimensional problems.

2. The associated algorithm is efficient and accurate up to medium-scale (1000s) problems.
   - Promising tools for pre-discarding irrelevant variables are emerging, thus solving this class of problems may be enough.
   - Bootstrapping, cross-validation and subsampling are highly needed. . . for which our method is well adapted.

3

Outline

Geometrical insights on Sparsity

Robustness Viewpoint

Numerical experiments

R package features

4

Geometrical insights on Sparsity

5

A geometric view of sparsity

[Figure: level sets of ℓ(β1, β2) in the (β1, β2) plane, together with the admissible region Ω(β1, β2) ≤ c.]

minimize_{β1,β2} −ℓ(β1, β2) + λ Ω(β1, β2)

⇕

maximize_{β1,β2} ℓ(β1, β2)   s.t.   Ω(β1, β2) ≤ c

6

Singularities induce sparsity

[Figure: the ℓ1 ball (Lasso) and the ℓ2 ball (Ridge) B, with the normal cone N_B(β) at a boundary point β.]

β ∈ B is optimal if and only if ∇ℓ(β; X, y) defines a supporting hyperplane of B at β. Equivalently,

−∇ℓ(β; X, y) ∈ N_B(β).

7
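As a hedged worked case (our notation, not from the slides), take the ℓ1 ball in R² and the vertex β = (1, 0):

B = { β ∈ R² : ‖β‖₁ ≤ 1 },    N_B((1, 0)) = cone{ (1, 1), (1, −1) } = { z ∈ R² : z₁ ≥ |z₂| }.

Because this normal cone is full-dimensional, the condition −∇ℓ(β; X, y) ∈ N_B(β) is met by a whole set of gradients, so the solution sits exactly at the singular point with β₂ = 0; on the smooth ℓ2 ball the normal cone at any boundary point is a single ray, which is why ridge regression does not produce exact zeros.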

Robustness Viewpoint

8

Robust optimization framework

Worst-case analysis

We wish to solve the regression problem

β̂ = argmin_{β∈R^p} max_{γ∈D_γ} ‖Xβ − y‖² + λ‖β − γ‖²,

where

- D_γ describes an uncertainty set for the parameters,
- γ acts as a spurious adversary over the true β*.

Maximizing over D_γ leads to the worst-case formulation.

9

Sparsity by worst-case formulation

First mathematical breakthrough

Great, (a − b)² = a² + b² − 2ab, so

minimize_{β∈R^p} max_{γ∈D_γ} ‖Xβ − y‖² + λ‖β − γ‖²

   = minimize_{β∈R^p} ‖Xβ − y‖² + λ‖β‖² + λ max_{γ∈D_γ} { γᵀβ + ‖γ‖² }.

Look for a sparsity-inducing norm

- Choose D_γ so as to recover your favourite ℓ1 penalizer via the γᵀβ term,
- and forget about ‖γ‖², which does not change the minimization;
- this can be done 'systematically' by imposing regularity on β and considering the dual adversarial assumption on γ.

10

Example: robust formulation of the elastic net

'Lasso' regularity set for β

The ℓ1-norm must be controlled:

H^Lasso_β = { β ∈ R^p : ‖β‖₁ ≤ η_β }.

Dual assumption on the adversary

The ℓ∞-norm of γ should be controlled, say:

D^Lasso_γ = { γ ∈ R^p : max_{β∈H^Lasso_β} γᵀβ ≤ 1 } = { γ ∈ R^p : ‖γ‖∞ ≤ η_γ } = conv({−η_γ, η_γ}^p),

where η_γ = 1/η_β.

11

Example: robust formulation of the elastic net (continued)

Now, the other way round:

minimize_{β∈R^p} ‖Xβ − y‖² + λ‖β‖² + λ max_{γ∈{−η_γ,η_γ}^p} { γᵀβ + ‖γ‖² }

   = minimize_{β∈R^p} ‖Xβ − y‖² + λ‖β‖² + λη_γ‖β‖₁ + c

   ⇔ minimize_{β∈R^p} (1/2)‖Xβ − y‖² + λ₁‖β‖₁ + (λ₂/2)‖β‖²,

for known λ₁, λ₂.

We recognize the 'official' Elastic-net regularizer.

12
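In practice this is the criterion handled by the quadrupen calls shown later in this deck; a minimal usage sketch, assuming x is the design matrix and y the response vector:

library(quadrupen)
## elastic-net path over a grid of lambda1 values, for a fixed lambda2
fit <- elastic.net(x, y, lambda2 = 1)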

Example: robust formulation of the elastic net (continued)

Geometrical argument with constrained formulation

minimize_{β∈R^p} ‖Xβ − y‖²   s.t.   ‖β‖₂² + η‖β‖₁ ≤ s

⇕

minimize_{β∈R^p} ‖Xβ − y‖²   s.t.   max_{γ∈{−η,η}^p} ‖β − γ‖² ≤ s + η²

13

Example: robust formulation of the ℓ1/ℓ∞ group-Lasso

Argument with one group, which generalizes by decomposability of the group-norm.

The regularity set for β, with the ℓ∞-norm controlled:

H^max_β = { β ∈ R^p : ‖β‖∞ ≤ η_β }.

Dual assumption on the adversary:

D^max_γ = { γ ∈ R^p : sup_{β∈H^max_β} γᵀβ ≤ 1 } = { γ ∈ R^p : ‖γ‖₁ ≤ η_γ } = conv{ η_γ e^p_1, . . . , η_γ e^p_p, −η_γ e^p_1, . . . , −η_γ e^p_p },

where η_γ = 1/η_β and e^p_j is the j-th element of the canonical basis of R^p, that is, (e^p_j)_{j′} = 1 if j′ = j and 0 otherwise.

14

Example: robust formulation of the ℓ1/ℓ∞ group-Lasso

Geometrical argument with constrained formulation

minimize_{β∈R^p} ‖Xβ − y‖²   s.t.   ‖β‖₂² + η‖β‖∞ ≤ s

⇕

minimize_{β∈R^p} ‖Xβ − y‖²   s.t.   max_{γ∈D^max_γ} ‖β − γ‖² ≤ s + η²

15

Generalize this principle to your favourite problem

[Figure panels: elastic-net (ℓ1 + ℓ2), ℓ∞ + ℓ2, structured elastic-net, fused-Lasso + ℓ2, OSCAR + ℓ2.]

16

Worst-Case Quadratic Penalty Active Set Algorithm

S0 Initialization
   β ← β0                                              // start with a feasible β
   A ← { j : β_j ≠ 0 }                                  // determine the active set
   γ ← argmax_{g∈D_γ} ‖β − g‖²                          // pick a worst admissible γ

S1 Update the active variables β_A
   β_A ← (X_AᵀX_A + λ I_|A|)⁻¹ (X_Aᵀy + λ γ_A)          // subproblem resolution

S2 Verify coherence of γ_A with the updated β_A
   if ‖β_A − γ_A‖² < max_{g∈D_γ} ‖β_A − g_A‖² then      // γ_A is no longer worst-case
      β_A ← β_A^old + ρ (β_A − β_A^old)                 // step back to the last γ_A-coherent solution

S3 Update the active set A
   g_j ← min_{γ∈D_γ} | x_jᵀ(X_Aβ_A − y) + λ(β_j − γ_j) |,  j = 1, . . . , p    // worst-case gradient
   if ∃ j ∈ A : β_j = 0 and g_j = 0 then
      A ← A \ {j}                                       // downgrade j
   else if max_{j∈A^c} g_j ≠ 0 then
      j* ← argmax_{j∈A^c} g_j ,  A ← A ∪ {j*}           // upgrade j*
   else
      stop and return β, which is optimal

17
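A minimal R sketch of this loop for the elastic-net case, where D_γ is the ℓ∞ ball of radius η (so a worst admissible adversary is γ_j = −η·sign(β_j)). Function and variable names are ours, not the quadrupen implementation, and the partial step of S2 is replaced by a small fixed-point loop on γ_A for simplicity.

worst_case_enet <- function(X, y, lambda, eta, max_steps = 50, tol = 1e-8) {
  p <- ncol(X)
  beta <- numeric(p)
  A <- integer(0)
  for (step in seq_len(max_steps)) {
    ## S3: worst-case gradient g_j = max(|x_j'(X beta - y) + lambda beta_j| - lambda * eta, 0)
    c_all <- drop(crossprod(X, X %*% beta - y)) + lambda * beta
    g <- pmax(abs(c_all) - lambda * eta, 0)
    if (length(A) > 0) g[A] <- 0          # active coordinates solve their subproblem exactly
    if (max(g) < tol) break               # no violated coordinate left: beta is optimal
    jstar <- which.max(g)                 # upgrade the most violating variable
    A <- sort(c(A, jstar))
    ## S1 + S2: alternate subproblem resolution and worst-case gamma_A until coherent
    gamma_A <- ifelse(beta[A] != 0, -eta * sign(beta[A]), eta * sign(c_all[A]))
    XA <- X[, A, drop = FALSE]
    for (inner in 1:20) {
      K <- crossprod(XA) + diag(lambda, length(A))
      beta_A <- drop(solve(K, crossprod(XA, y) + lambda * gamma_A))
      gamma_new <- -eta * sign(beta_A)
      if (all(gamma_new == gamma_A)) break
      gamma_A <- gamma_new
    }
    beta[] <- 0
    beta[A] <- beta_A
    A <- A[beta[A] != 0]                  # downgrade variables that vanished, if any
  }
  beta
}

Under the correspondence of the elastic-net slides, the ℓ1 level of the fitted problem is driven by λη and the ridge level by λ.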

Algorithm complexity (see, e.g., Bach et al., 2012)

Suppose that

- the algorithm stops at λmin with k activated variables,
- no downgrade has been observed (we thus have k iterations/steps).

Complexity in favorable cases

1. computing XᵀX_A + λ I_|A|: O(npk),
2. maintaining x_jᵀ(X_Aβ_A − y) along the path: O(pn + pk²),
3. Cholesky update of (X_AᵀX_A + λ I_|A|)⁻¹: O(k³),

for a total of O(npk + pk² + k³). Hence k, defined by λmin, matters. . .

18

A bound to assess the distance to the optimum during optimization

Proposition

For any η_γ > 0 and any vector norm ‖·‖_*, when D_γ is defined as D_γ = { γ ∈ R^p : ‖γ‖_* ≤ η_γ }, then, for all γ ∈ R^p such that ‖γ‖_* ≥ η_γ, we have:

min_{β∈R^p} max_{γ′∈D_γ} J_λ(β, γ′) ≥ (η_γ / ‖γ‖_*) J_λ(β*(γ), γ) − λη_γ(‖γ‖_* − η_γ) ‖γ‖² / ‖γ‖_*²,

where

J_λ(β, γ) = ‖Xβ − y‖² + λ‖β − γ‖²   and   β*(γ) = argmin_{β∈R^p} J_λ(β, γ).

This proposition can be used to compute an optimality gap by picking a γ-value such that the current worst-case gradient is null (the current β-value then being the optimal β*(γ)).

19
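As a hedged illustration (our function names, not the package's), the bound can be evaluated in R for the ℓ∞ adversary set used by the elastic-net, taking ‖·‖_* = ‖·‖∞ and the fraction as reconstructed above:

Jlambda <- function(beta, gamma, X, y, lambda) {
  ## J_lambda(beta, gamma) = ||X beta - y||^2 + lambda ||beta - gamma||^2
  sum((X %*% beta - y)^2) + lambda * sum((beta - gamma)^2)
}

lower_bound <- function(gamma, X, y, lambda, eta) {
  ns <- max(abs(gamma))                         # ||gamma||_* (here the sup-norm), assumed >= eta
  ## beta*(gamma): closed-form ridge-type minimizer of J_lambda(., gamma)
  beta_star <- drop(solve(crossprod(X) + lambda * diag(ncol(X)),
                          crossprod(X, y) + lambda * gamma))
  (eta / ns) * Jlambda(beta_star, gamma, X, y, lambda) -
    lambda * eta * (ns - eta) * sum(gamma^2) / ns^2
}
## an optimality gap follows by comparing the current worst-case objective to this bound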

Bound: illustration on an Elastic-Net problem

[Figure (n = 50, p = 200): optimality gap versus number of iterations.] Monitoring convergence: true optimality gap (solid black) versus our pessimistic bound (dashed blue) and Fenchel's duality gap (dotted red), computed at each iteration of the algorithm.

20

Full interpretation in a robust optimization framework

Proposition

The robust regression problem

min_{β∈R^p} max_{(∆X,ε,γ)∈D_X×D_ε×D_γ} ‖(X − ∆X)β + Xγ + ε − y‖²,

for a given form of the global uncertainty set D_X × D_ε × D_γ on (∆X, ε, γ), is equivalent to the robust regularized regression problem:

min_{β∈R^p} max_{γ∈D_γ} ‖Xβ − y‖² + η_X ‖β − γ‖².

These assumptions entail the following relationship between X and y:

y = (X − ∆X)β* + Xγ + ε.

The observed responses are formed by summing the contributions of the unobserved clean inputs, the adversarial noise that maps the observed inputs to the responses, and the neutral noise.

21

Numerical experiments

22

General objectives

Assessing the efficiency of an algorithm

- accuracy is the difference between the optimum of the objective function and its value at the solution returned by the algorithm;
- speed is the computing time required for returning this solution.

Timings have to be compared at similar precision requirements.

Mimicking post-genomic data attributes

Optimization difficulties result from ill-conditioning, due to

- either high correlation between predictors,
- or underdetermination (high-dimensional or "large p, small n" setup).

Remarks

1. With active-set strategies, bad conditioning is somewhat alleviated.
2. Sparsity of the true parameter heavily impacts the running times.

23

Data generation

Exploring those characteristics with linear regression

We generate samples of size n from the model

y = Xβ* + ε,   ε ∼ N(0, σ²I),

- σ chosen so as to reach R² ≈ 0.8,
- X ∼ N(0, Σ), with Σ_ij = 1{i=j} + ρ·1{i≠j},
- sgn(β*) = (1, . . . , 1, −1, . . . , −1, 0, . . . , 0), with s/2 ones, s/2 minus ones and p − s zeros.

Controlling the difficulty

- ρ ∈ {0.1, 0.4, 0.8} rules the conditioning,
- s ∈ {10%, 30%, 60%} controls the sparsity,
- the ratio n/p ∈ {2, 1, 0.5} quantifies the well/ill-posedness.

24
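A minimal R sketch of this simulation design (function and argument names are ours, not taken from the paper's code); here s is the number of nonzero coefficients rather than a percentage:

simulate_design <- function(n, p, rho, s, R2 = 0.8) {
  ## compound-symmetry covariance: 1 on the diagonal, rho elsewhere
  Sigma <- matrix(rho, p, p); diag(Sigma) <- 1
  X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)
  ## sparse sign pattern: s/2 ones, s/2 minus ones, p - s zeros
  beta <- c(rep(1, ceiling(s / 2)), rep(-1, floor(s / 2)), rep(0, p - s))
  ## choose sigma so that the population R2 is approximately the target:
  ## R2 = var(X beta) / (var(X beta) + sigma^2)
  signal_var <- drop(t(beta) %*% Sigma %*% beta)
  sigma <- sqrt(signal_var * (1 - R2) / R2)
  y <- X %*% beta + rnorm(n, sd = sigma)
  list(X = X, y = drop(y), beta = beta, sigma = sigma)
}

## Example: an ill-posed, highly correlated setting (n/p = 0.5, rho = 0.8)
## dat <- simulate_design(n = 100, p = 200, rho = 0.8, s = 20)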

Comparing optimization strategies

We used our own code to avoid implementation biases for

1. accelerated proximal methods – proximal,
2. coordinate descent – coordinate,
3. our quadratic solver – quadratic,

wrapped in the same active-set + warm-start routine. Timings are averaged over 100 runs to minimize

J^enet_{λ1,λ2}(β) = (1/2)‖Xβ − y‖² + λ₁‖β‖₁ + (λ₂/2)‖β‖²,

with halting condition

max_{j∈{1,...,p}} | x_jᵀ(y − Xβ) + λ₂β_j | < λ₁ + τ,   (1)

where the threshold is τ = 10⁻², on a 50 × 50 grid of λ₁ × λ₂ values.

25

Log-ratio between the timing of each competitor and quadratic, p = 2n = 100, s = 30

[Figure: heat maps over the (log10(λ1), log10(λ2)) grid, for large (0.8), medium (0.4) and small (0.1) correlation; one row per method, coordinate descent and proximal (FISTA); color scale: number of times faster, from 1 to 300.]

26

Comparing stand-alone implementations

We compare our method on a Lasso problem to popular R packages:

1. accelerated proximal methods – SPAMs-FISTA (Mairal, Bach et al.),
2. coordinate descent – glmnet (Friedman, Hastie, Tibshirani),
3. homotopy/LARS algorithm – lars (Efron, Hastie) and SPAMs-LARS,
4. quadratic solver – quadrupen.

The distance D to the optimum is evaluated on J^lasso_λ(β) = J^enet_{λ,0}(β) by

D(method) = ( (1/|Λ|) Σ_{λ∈Λ} ( J^lasso_λ(β^lars_λ) − J^lasso_λ(β^method_λ) )² )^{1/2},

where Λ is given by the first min(n, p) steps of lars.

We vary ρ and (p, n), fix s = 0.25 min(n, p), and average over 50 runs.

27
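A hedged R sketch of this criterion (names are ours); B_method and B_lars are assumed to hold the two coefficient paths column-wise over the grid Λ:

J_lasso <- function(beta, X, y, lambda) {
  ## J^lasso_lambda(beta) = (1/2) ||X beta - y||^2 + lambda ||beta||_1
  0.5 * sum((X %*% beta - y)^2) + lambda * sum(abs(beta))
}
D_method <- function(B_method, B_lars, X, y, Lambda) {
  gaps <- sapply(seq_along(Lambda), function(k)
    J_lasso(B_method[, k], X, y, Lambda[k]) - J_lasso(B_lars[, k], X, y, Lambda[k]))
  sqrt(mean(gaps^2))   # root mean squared objective gap along the path
}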

[Figure (three panels): distance to the optimum D(method) (log10) versus CPU time (seconds, log10), at low (0.1), medium (0.4) and high (0.8) correlation, for glmnet (CD, active set), SPAMs (FISTA, no active set), SPAMs (homotopy/LARS), lars (homotopy/LARS) and quadrupen (this paper); settings: n = 100, p = 40; n = 200, p = 1000; n = 400, p = 10000.]

28

Link between accuracy and prediction performances

The Lasso requires the rather restrictive 'irrepresentable condition' (or its avatars) on the design for sign consistency. . .

Any unforeseen consequences of a lack of accuracy?

Early stops of the algorithm are likely to prevent

- either the removal of all irrelevant coefficients,
- or the insertion of relevant ones.

Illustration on mean square error and support recovery

- Generate 100 training data sets for linear regression with ρ = 0.8, R² = 0.8, p = 100, s = 30%, varying n.
- Generate for each a large test set (say, of size 10n) for evaluation.

method                      quadrupen      glmnet (low)   glmnet (med)   glmnet (high)
timing (ms)                 8              7              8              64
accuracy (dist. to opt.)    5.9 × 10^-14   7.2 × 10^0     6.04 × 10^0    1.47 × 10^-2

29

[Figure: mean square error (top) and sign error (bottom) versus log10(λ1), for n/p ∈ {0.5, 1, 2}, comparing quadrupen with glmnet at low, medium and high precision settings.]

30

R package features

31

Learning features

Problem solved

β̂ = argmin_β (1/2)(y − Xβ)ᵀW(y − Xβ) + λ₁‖ω ⊙ β‖₁ + λ₂ βᵀSβ,

where

- W = diag(w), with (w_i)_{i=1,...,n} ≥ 0 some observation weights,
- ω, with (ω_j)_{j=1,...,p} > 0 the ℓ1-penalty weights,
- S a p × p positive definite matrix 'structuring' the ℓ2 penalty.

Some corresponding estimators

- (Adaptive) Lasso,
- (Structured) Elastic-net,
- Fused-Lasso signal approximator (inefficient),
- . . .

32
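As a hedged sketch (names are ours), the criterion above can be written down directly in R; this only evaluates the objective, it is not the quadrupen solver:

quad_objective <- function(beta, X, y, w, omega, S, lambda1, lambda2) {
  r <- drop(y - X %*% beta)                     # residuals
  0.5 * sum(w * r^2) +                          # weighted least-squares term
    lambda1 * sum(omega * abs(beta)) +          # weighted l1 penalty
    lambda2 * drop(t(beta) %*% S %*% beta)      # structured l2 penalty
}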

Technical features

Written in R (with S4 classes) and C++.

Dependencies

- armadillo + RcppArmadillo, its accompanying interface to R:
  "Armadillo is a C++ linear algebra library (matrix maths) aiming towards a good balance between speed and ease of use. The syntax is deliberately similar to Matlab."
- Matrix, to handle sparse matrices,
- ggplot2, a plotting system based on the grammar of graphics,
- parallel – not available for Windows :-)
  a package built into R since 2.14, which allows parallel computation over the available CPU cores or clusters (very useful for cross-validation, for instance).

Suited to solve small to medium scale problems.

33

Load the package and show dependencies

R> library("quadrupen")
 [1] "quadrupen" "Matrix"    "lattice"   "ggplot2"   "stats"     "graphics"
 [7] "grDevices" "utils"     "datasets"  "methods"   "base"

Generate a size-100 vector of parameters with labels

R> ## VECTOR OF TRUE PARAMETERS
R> ## sparse, blockwise shaped
R> beta <- rep(c(0, -2, 2), c(80, 10, 10))
R> ## labels for true nonzeros
R> labels <- rep("irrelevant", length(beta))
R> labels[beta != 0] <- c("relevant")
R> labels <- factor(labels, ordered = TRUE,
+                   levels = c("relevant", "irrelevant"))

34

R> ## COVARIANCE STRUCTURE OF THE PREDICTORS
R> ## Toeplitz correlation between irrelevant variables
R> cor <- 0.8
R> S11 <- toeplitz(cor^(0:(80 - 1)))
R> ## block correlation between relevant variables
R> S22 <- matrix(cor, 10, 10)
R> diag(S22) <- 1
R> ## correlation between relevant and irrelevant variables
R> eps <- 0.25
R> Sigma <- bdiag(S11, S22, S22) + eps

[Image: the resulting 100 x 100 covariance matrix Sigma.]

35

Generate n = 100 observations with a high level of noise, σ = 10

R> mu    <- 3
R> sigma <- 10
R> n     <- 100
R> x <- as.matrix(matrix(rnorm(100 * n), n, 100) %*% chol(Sigma))
R> y <- mu + x %*% beta + rnorm(n, 0, sigma)

Give a try to raw Lasso and Elastic-net fits. . .

R> start <- proc.time()
R> lasso <- elastic.net(x, y, lambda2 = 0)
R> e.net <- elastic.net(x, y, lambda2 = 1)
R> print(proc.time() - start)
   user  system elapsed
  0.080   0.000   0.078

36

A print/show method is defined:

R> print(e.net)
Linear regression with elastic net penalizer, coefficients rescaled by (1+lambda2).
 - number of coefficients: 100 + intercept
 - penalty parameter lambda1: 100 points from 171 to 0.856
 - penalty parameter lambda2: 1

Also consider the residuals, deviance, predict, and fitted methods. . .

R> head(deviance(e.net))
 171.217 162.295 153.837 145.821 138.222 131.019
   82366   76185   68799   60415   52456   45139

37

R> plot(lasso, main = "Lasso", xvar = "fraction", labels = labels)

[Figure "Lasso": standardized coefficients along the path, versus |β_λ1|_1 / max_λ1 |β_λ1|_1, relevant versus irrelevant variables.]

38

R> plot(lasso, main = "Lasso", xvar = "lambda",
+       log.scale = FALSE, reverse = TRUE, labels = labels)

[Figure "Lasso": standardized coefficients versus λ1, on a reversed linear scale.]

39

R> plot(lasso, main = "Lasso", labels = labels)

[Figure "Lasso": standardized coefficients versus log10(λ1).]

40

R> plot(e.net, main = "Elastic-net", labels = labels)

[Figure "Elastic-net": standardized coefficients versus log10(λ1).]

41

R> system.time(
+   cv.double <- crossval(x, y, lambda2 = 10^seq(1, -1.5, len = 50))
+ )

DOUBLE CROSS-VALIDATION
10-fold CV on the lambda1 grid for each lambda2
 10     8.892  7.906  7.03   6.251
 5.558  4.942  4.394  3.907  3.474
 3.089  2.746  2.442  2.171  1.931
 1.717  1.526  1.357  1.207  1.073
 0.954  0.848  0.754  0.671  0.596
 0.53   0.471  0.419  0.373  0.331
 0.295  0.262  0.233  0.207  0.184
 0.164  0.146  0.129  0.115  0.102
 0.091  0.081  0.072  0.064  0.057
 0.051  0.045  0.04   0.036  0.032

   user  system elapsed
 10.888   1.492   6.636

R> plot(cv.double)

42

[Figure: cross-validation error (mean) over the grid of log10(λ1) and log10(λ2) values.]

43

R> lambda2 <- slot(cv.double, "lambda2.min")
[1] 0.8483

R> system.time(
+   cv.simple <- crossval(x, y, lambda2 = lambda2)
+ )

SIMPLE CROSS-VALIDATION
10-fold CV on the lambda1 grid, lambda2 is fixed.

   user  system elapsed
  0.312   0.052   0.266

R> sum(sign(slot(cv.simple, "beta.min")) != sign(beta))
[1] 0

R> plot(cv.simple)

44

[Figure: cross-validation error (mean square error) versus log10(λ1), with the λ1 choices given by the minimal MSE and by the 1-SE rule.]

45

R> marks <- log10(c(slot(cv.simple, "lambda1.min"),
+                   slot(cv.simple, "lambda1.1se")))
R> graph <- plot(elastic.net(x, y, lambda2 = lambda2),
+               labels = labels, plot = FALSE)
R> graph + geom_vline(xintercept = marks)

[Figure "elastic net path": standardized coefficients versus log10(λ1), with vertical lines at the two cross-validated choices of λ1.]

46

Stability Selection

Let

- I be a random subsample of size ⌊n/2⌋,
- S^λ(I) = { j : β̂^λ_j(I) ≠ 0 } the estimated support at λ,
- Π^λ_j = P(j ∈ S^λ(I)) the estimated selection probabilities,
- q_Λ = E(|S^Λ(I)|) the average number of selected variables, where Λ = [λmax, λmin].

Definition

The set of stable variables on Λ with respect to a cutoff πthr is

S^stable = { j : max_{λ∈Λ} Π^λ_j ≥ πthr }.

Proposition

If the distribution of 1_{k∈S^λ} is exchangeable for any λ ∈ Λ, then

FWER ≤ PFER = E(V) ≤ q_Λ² / ((2πthr − 1) p).

47
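A tiny R helper for this bound (our naming), which reproduces the numbers displayed on the stability plots of the next slides (πthr = 0.75, p = 100):

## PFER bound of the proposition: E(V) <= q^2 / ((2 * pi_thr - 1) * p)
pfer_bound <- function(q, p, pi_thr) q^2 / ((2 * pi_thr - 1) * p)
pfer_bound(q = 7.29,  p = 100, pi_thr = 0.75)   # about 1.06, matching PFER <= 1
pfer_bound(q = 10.03, p = 100, pi_thr = 0.75)   # about 2.01, matching PFER <= 2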

R> system.time(
+   stab <- stability(x, y, subsamples = 400,
+                     randomize = TRUE, weakness = 0.5)
+ )

STABILITY SELECTION with randomization (weakness = 0.5)
Fitting procedure: elastic.net with lambda2 = 0.01 and an 100-dimensional grid of lambda1.
Running 2 jobs parallely (1 per core)
Approx. 200 subsamplings for each job for a total of 400

   user  system elapsed
  5.273   0.144   5.440

R> print(stab)

Stability path for elastic.net penalizer, coefficients rescaled by (1+lambda2).
 - penalty parameter lambda1: 100 points from 171 to 0.856
 - penalty parameter lambda2: 0.01

48

R> plot(stab, labels = labels, cutoff = 0.75, PFER = 1)

[Figure "Stability path of an elastic.net regularizer": selection probabilities versus the average number of selected variables, with cutoff πthr = 0.75 and PFER ≤ 1, giving q = 7.29; selected versus unselected, relevant versus irrelevant variables.]

49

R> plot(stab, labels = labels, cutoff = 0.75, PFER = 2)

[Figure: the same stability path with PFER ≤ 2, giving q = 10.03.]

50

Concluding remarks

What has been done

- Unifying view of sparsity through a quadratic formulation,
- Robust regression interpretation,
- Competitive algorithm for small to medium scale problems,
- Accompanying R package,
- Insights into the link between accuracy and prediction performance.

What will be done (almost surely, and soon)

1. Multivariate problems (Stephane, Guillem/Pierre):

   tr{ (Y − XB)ᵀ Ω (Y − XB) } + pen_{λ1,λ2}(B).

2. A real sparse handling of the design matrix (Stephane).
3. Group ℓ1/ℓ∞ penalty (prototyped in R) (Stephane, Camille, Eric).

51

More perspectives

Hopefully (help needed)

1. Screening/early discarding of irrelevant features.
2. Efficient implementation of Iteratively Reweighted Least Squares for logistic regression, the Cox model, etc. (Sarah? Marius?)
3. Consider an implementation of OSCAR/group-OSCAR (for segmentation purposes). (Alia? Morgane? Pierre?)
4. Control the precision when solving the subproblems (tortured M2 student):
   - a dirty (yet controlled) resolution could even speed up the procedure,
   - this can be done via (preconditioned) conjugate gradient / iterative methods,
   - some promising results during Aurore's M2.
5. More tools for robust statistics. . .
6. Integration with SIMoNe (if I don't know what to do / am at loose ends).

52

References

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.

H. Xu, C. Caramanis, and S. Mannor. Robust regression and Lasso. IEEE Transactions on Information Theory, 56(7):3561–3574, 2010.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2:183–202, 2009.

H. D. Bondell and B. J. Reich. Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics, 64(1):115–123, 2008.

W. J. Fu. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397–416, 1998.

L. El Ghaoui and H. Lebret. Robust solutions to least-squares problems with uncertain data. SIAM Journal on Matrix Analysis and Applications, 18(4):1035–1064, 1997.

53