Sparsity by Worst-case Quadratic Penalties

Joint work with Yves Grandvalet and Christophe Ambroise

Statistique et Genome, CNRS & Universite d’Evry Val d’Essonne

SSB group, Evry – October 16th, 2012

arXiv preprint with Y. Grandvalet and C. Ambroise.

http://arxiv.org/abs/1210.2077

R-package quadrupen, on CRAN

1

Variable selection in high-dimensional problems

Questionable solutions

1. Treat univariate problems and select effects via multiple testing.
   Genomic data are often highly correlated. . .

2. Combine multivariate analysis and model selection techniques:

   argmin_{β∈R^p} −ℓ(β; y, X) + λ‖β‖₀

   Infeasible for p > 30 in general (NP-hard).

Popular idea (questionable too!)

Use ℓ1 as a convex relaxation of this problem, keeping its sparsity-inducing effect:

   argmin_{β∈R^p} −ℓ(β; y, X) + λ · pen_ℓ1(β)

2

Another algorithm for the Lasso? (- -)zzz Well, not really. . .

1. We suggest a unifying approach that might be useful since
   - it helps having a global view of the Lasso zoology,
   - reading every daily arXiv paper dealing with ℓ1 is out of reach,
   - insights are still needed to understand high-dimensional problems.

2. The associated algorithm is efficient and accurate up to medium-scale (1000s) problems.
   - Promising tools for pre-discarding irrelevant variables are emerging, thus solving this class of problems may be enough.
   - Bootstrapping, cross-validation and subsampling are highly needed. . . for which our method is well adapted.

3

Outline

Geometrical insights on Sparsity

Robustness Viewpoint

Numerical experiments

R package features

4

Geometrical insights on Sparsity

5

A geometric view of sparsity

[Figure: level sets of ℓ(β1, β2) in the (β1, β2) plane, together with the admissible region Ω(β1, β2) ≤ c.]

minimize_{β1,β2} −ℓ(β1, β2) + λ Ω(β1, β2)

⇕

maximize_{β1,β2} ℓ(β1, β2)   s.t.   Ω(β1, β2) ≤ c

6

Singularities induce sparsity

[Figure: the ℓ1 ball (Lasso) and the ℓ2 ball (Ridge) B, with the normal cone N_B(β) at a boundary point β.]

β ∈ B is optimal if and only if ∇ℓ(β; X, y) defines a supporting hyperplane of B at β. Equivalently,

−∇ℓ(β; X, y) ∈ N_B(β).

7
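As a hedged worked case (our notation, not from the slides), take the ℓ1 ball in R² and the vertex β = (1, 0):

B = { β ∈ R² : ‖β‖₁ ≤ 1 },    N_B((1, 0)) = cone{ (1, 1), (1, −1) } = { z ∈ R² : z₁ ≥ |z₂| }.

Because this normal cone is full-dimensional, the condition −∇ℓ(β; X, y) ∈ N_B(β) is met by a whole set of gradients, so the solution sits exactly at the singular point with β₂ = 0; on the smooth ℓ2 ball the normal cone at any boundary point is a single ray, which is why ridge regression does not produce exact zeros.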

Robustness Viewpoint

8

Robust optimization framework

Worst-case analysis

We wish to solve the regression problem

β̂ = argmin_{β∈R^p} max_{γ∈D_γ} ‖Xβ − y‖² + λ‖β − γ‖²,

where

- D_γ describes an uncertainty set for the parameters,
- γ acts as a spurious adversary over the true β*.

Maximizing over D_γ leads to the worst-case formulation.

9

Sparsity by worst-case formulation

First mathematical breakthrough

Great, (a − b)² = a² + b² − 2ab, so

minimize_{β∈R^p} max_{γ∈D_γ} ‖Xβ − y‖² + λ‖β − γ‖²

   = minimize_{β∈R^p} ‖Xβ − y‖² + λ‖β‖² + λ max_{γ∈D_γ} { γᵀβ + ‖γ‖² }.

Look for a sparsity-inducing norm

- Choose D_γ so as to recover your favourite ℓ1 penalizer via the γᵀβ term,
- and forget about ‖γ‖², which does not change the minimization;
- this can be done 'systematically' by imposing regularity on β and considering the dual adversarial assumption on γ.

10

Example: robust formulation of the elastic net

'Lasso' regularity set for β

The ℓ1-norm must be controlled:

H^Lasso_β = { β ∈ R^p : ‖β‖₁ ≤ η_β }.

Dual assumption on the adversary

The ℓ∞-norm of γ should be controlled, say:

D^Lasso_γ = { γ ∈ R^p : max_{β∈H^Lasso_β} γᵀβ ≤ 1 } = { γ ∈ R^p : ‖γ‖∞ ≤ η_γ } = conv({−η_γ, η_γ}^p),

where η_γ = 1/η_β.

11

Example: robust formulation of the elastic net (continued)

Now, the other way round:

minimize_{β∈R^p} ‖Xβ − y‖² + λ‖β‖² + λ max_{γ∈{−η_γ,η_γ}^p} { γᵀβ + ‖γ‖² }

   = minimize_{β∈R^p} ‖Xβ − y‖² + λ‖β‖² + λη_γ‖β‖₁ + c

   ⇔ minimize_{β∈R^p} (1/2)‖Xβ − y‖² + λ₁‖β‖₁ + (λ₂/2)‖β‖²,

for known λ₁, λ₂.

We recognize the 'official' Elastic-net regularizer.

12
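In practice this is the criterion handled by the quadrupen calls shown later in this deck; a minimal usage sketch, assuming x is the design matrix and y the response vector:

library(quadrupen)
## elastic-net path over a grid of lambda1 values, for a fixed lambda2
fit <- elastic.net(x, y, lambda2 = 1)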

Example: robust formulation of the elastic net (continued)

Geometrical argument with constrained formulation

minimize_{β∈R^p} ‖Xβ − y‖²   s.t.   ‖β‖₂² + η‖β‖₁ ≤ s

⇕

minimize_{β∈R^p} ‖Xβ − y‖²   s.t.   max_{γ∈{−η,η}^p} ‖β − γ‖² ≤ s + η²

13

Example: robust formulation of the ℓ1/ℓ∞ group-Lasso

Argument with one group, which generalizes by decomposability of the group-norm.

The regularity set for β, with the ℓ∞-norm controlled:

H^max_β = { β ∈ R^p : ‖β‖∞ ≤ η_β }.

Dual assumption on the adversary:

D^max_γ = { γ ∈ R^p : sup_{β∈H^max_β} γᵀβ ≤ 1 } = { γ ∈ R^p : ‖γ‖₁ ≤ η_γ } = conv{ η_γ e^p_1, . . . , η_γ e^p_p, −η_γ e^p_1, . . . , −η_γ e^p_p },

where η_γ = 1/η_β and e^p_j is the j-th element of the canonical basis of R^p, that is, (e^p_j)_{j′} = 1 if j′ = j and 0 otherwise.

14

Example: robust formulation of the ℓ1/ℓ∞ group-Lasso

Geometrical argument with constrained formulation

minimize_{β∈R^p} ‖Xβ − y‖²   s.t.   ‖β‖₂² + η‖β‖∞ ≤ s

⇕

minimize_{β∈R^p} ‖Xβ − y‖²   s.t.   max_{γ∈D^max_γ} ‖β − γ‖² ≤ s + η²

15

Generalize this principle to your favourite problem

[Figure panels: elastic-net (ℓ1 + ℓ2), ℓ∞ + ℓ2, structured elastic-net, fused-Lasso + ℓ2, OSCAR + ℓ2.]

16

Worst-Case Quadratic Penalty Active Set Algorithm

S0 Initialization
   β ← β0                                              // start with a feasible β
   A ← { j : β_j ≠ 0 }                                  // determine the active set
   γ ← argmax_{g∈D_γ} ‖β − g‖²                          // pick a worst admissible γ

S1 Update the active variables β_A
   β_A ← (X_AᵀX_A + λ I_|A|)⁻¹ (X_Aᵀy + λ γ_A)          // subproblem resolution

S2 Verify coherence of γ_A with the updated β_A
   if ‖β_A − γ_A‖² < max_{g∈D_γ} ‖β_A − g_A‖² then      // γ_A is no longer worst-case
      β_A ← β_A^old + ρ (β_A − β_A^old)                 // step back to the last γ_A-coherent solution

S3 Update the active set A
   g_j ← min_{γ∈D_γ} | x_jᵀ(X_Aβ_A − y) + λ(β_j − γ_j) |,  j = 1, . . . , p    // worst-case gradient
   if ∃ j ∈ A : β_j = 0 and g_j = 0 then
      A ← A \ {j}                                       // downgrade j
   else if max_{j∈A^c} g_j ≠ 0 then
      j* ← argmax_{j∈A^c} g_j ,  A ← A ∪ {j*}           // upgrade j*
   else
      stop and return β, which is optimal

17
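A minimal R sketch of this loop for the elastic-net case, where D_γ is the ℓ∞ ball of radius η (so a worst admissible adversary is γ_j = −η·sign(β_j)). Function and variable names are ours, not the quadrupen implementation, and the partial step of S2 is replaced by a small fixed-point loop on γ_A for simplicity.

worst_case_enet <- function(X, y, lambda, eta, max_steps = 50, tol = 1e-8) {
  p <- ncol(X)
  beta <- numeric(p)
  A <- integer(0)
  for (step in seq_len(max_steps)) {
    ## S3: worst-case gradient g_j = max(|x_j'(X beta - y) + lambda beta_j| - lambda * eta, 0)
    c_all <- drop(crossprod(X, X %*% beta - y)) + lambda * beta
    g <- pmax(abs(c_all) - lambda * eta, 0)
    if (length(A) > 0) g[A] <- 0          # active coordinates solve their subproblem exactly
    if (max(g) < tol) break               # no violated coordinate left: beta is optimal
    jstar <- which.max(g)                 # upgrade the most violating variable
    A <- sort(c(A, jstar))
    ## S1 + S2: alternate subproblem resolution and worst-case gamma_A until coherent
    gamma_A <- ifelse(beta[A] != 0, -eta * sign(beta[A]), eta * sign(c_all[A]))
    XA <- X[, A, drop = FALSE]
    for (inner in 1:20) {
      K <- crossprod(XA) + diag(lambda, length(A))
      beta_A <- drop(solve(K, crossprod(XA, y) + lambda * gamma_A))
      gamma_new <- -eta * sign(beta_A)
      if (all(gamma_new == gamma_A)) break
      gamma_A <- gamma_new
    }
    beta[] <- 0
    beta[A] <- beta_A
    A <- A[beta[A] != 0]                  # downgrade variables that vanished, if any
  }
  beta
}

Under the correspondence of the elastic-net slides, the ℓ1 level of the fitted problem is driven by λη and the ridge level by λ.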

Algorithm complexity (see, e.g., Bach et al., 2012)

Suppose that

- the algorithm stops at λmin with k activated variables,
- no downgrade has been observed (we thus have k iterations/steps).

Complexity in favorable cases

1. computing XᵀX_A + λ I_|A|: O(npk),
2. maintaining x_jᵀ(X_Aβ_A − y) along the path: O(pn + pk²),
3. Cholesky update of (X_AᵀX_A + λ I_|A|)⁻¹: O(k³),

for a total of O(npk + pk² + k³). Hence k, defined by λmin, matters. . .

18

A bound to assess the distance to the optimum during optimization

Proposition

For any η_γ > 0 and any vector norm ‖·‖_*, when D_γ is defined as D_γ = { γ ∈ R^p : ‖γ‖_* ≤ η_γ }, then, for all γ ∈ R^p such that ‖γ‖_* ≥ η_γ, we have:

min_{β∈R^p} max_{γ′∈D_γ} J_λ(β, γ′) ≥ (η_γ / ‖γ‖_*) J_λ(β*(γ), γ) − λη_γ(‖γ‖_* − η_γ) ‖γ‖² / ‖γ‖_*²,

where

J_λ(β, γ) = ‖Xβ − y‖² + λ‖β − γ‖²   and   β*(γ) = argmin_{β∈R^p} J_λ(β, γ).

This proposition can be used to compute an optimality gap by picking a γ-value such that the current worst-case gradient is null (the current β-value then being the optimal β*(γ)).

19
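As a hedged illustration (our function names, not the package's), the bound can be evaluated in R for the ℓ∞ adversary set used by the elastic-net, taking ‖·‖_* = ‖·‖∞ and the fraction as reconstructed above:

Jlambda <- function(beta, gamma, X, y, lambda) {
  ## J_lambda(beta, gamma) = ||X beta - y||^2 + lambda ||beta - gamma||^2
  sum((X %*% beta - y)^2) + lambda * sum((beta - gamma)^2)
}

lower_bound <- function(gamma, X, y, lambda, eta) {
  ns <- max(abs(gamma))                         # ||gamma||_* (here the sup-norm), assumed >= eta
  ## beta*(gamma): closed-form ridge-type minimizer of J_lambda(., gamma)
  beta_star <- drop(solve(crossprod(X) + lambda * diag(ncol(X)),
                          crossprod(X, y) + lambda * gamma))
  (eta / ns) * Jlambda(beta_star, gamma, X, y, lambda) -
    lambda * eta * (ns - eta) * sum(gamma^2) / ns^2
}
## an optimality gap follows by comparing the current worst-case objective to this bound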

Bound: illustration on an Elastic-Net problem

[Figure (n = 50, p = 200): optimality gap versus number of iterations.] Monitoring convergence: true optimality gap (solid black) versus our pessimistic bound (dashed blue) and Fenchel's duality gap (dotted red), computed at each iteration of the algorithm.

20

Full interpretation in a robust optimization framework

Proposition

The robust regression problem

min_{β∈R^p} max_{(∆X,ε,γ)∈D_X×D_ε×D_γ} ‖(X − ∆X)β + Xγ + ε − y‖²,

for a given form of the global uncertainty set D_X × D_ε × D_γ on (∆X, ε, γ), is equivalent to the robust regularized regression problem:

min_{β∈R^p} max_{γ∈D_γ} ‖Xβ − y‖² + η_X ‖β − γ‖².

These assumptions entail the following relationship between X and y:

y = (X − ∆X)β* + Xγ + ε.

The observed responses are formed by summing the contributions of the unobserved clean inputs, the adversarial noise that maps the observed inputs to the responses, and the neutral noise.

21

Numerical experiments

22

General objectives

Assessing the efficiency of an algorithm

- accuracy is the difference between the optimum of the objective function and its value at the solution returned by the algorithm;
- speed is the computing time required for returning this solution.

Timings have to be compared at similar precision requirements.

Mimicking post-genomic data attributes

Optimization difficulties result from ill-conditioning, due to

- either high correlation between predictors,
- or underdetermination (high-dimensional or "large p, small n" setup).

Remarks

1. With active-set strategies, bad conditioning is somewhat alleviated.
2. Sparsity of the true parameter heavily impacts the running times.

23

Data generation

Exploring those characteristics with linear regression

We generate samples of size n from the model

y = Xβ* + ε,   ε ∼ N(0, σ²I),

- σ chosen so as to reach R² ≈ 0.8,
- X ∼ N(0, Σ), with Σ_ij = 1{i=j} + ρ·1{i≠j},
- sgn(β*) = (1, . . . , 1, −1, . . . , −1, 0, . . . , 0), with s/2 ones, s/2 minus ones and p − s zeros.

Controlling the difficulty

- ρ ∈ {0.1, 0.4, 0.8} rules the conditioning,
- s ∈ {10%, 30%, 60%} controls the sparsity,
- the ratio n/p ∈ {2, 1, 0.5} quantifies the well/ill-posedness.

24
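A minimal R sketch of this simulation design (function and argument names are ours, not taken from the paper's code); here s is the number of nonzero coefficients rather than a percentage:

simulate_design <- function(n, p, rho, s, R2 = 0.8) {
  ## compound-symmetry covariance: 1 on the diagonal, rho elsewhere
  Sigma <- matrix(rho, p, p); diag(Sigma) <- 1
  X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)
  ## sparse sign pattern: s/2 ones, s/2 minus ones, p - s zeros
  beta <- c(rep(1, ceiling(s / 2)), rep(-1, floor(s / 2)), rep(0, p - s))
  ## choose sigma so that the population R2 is approximately the target:
  ## R2 = var(X beta) / (var(X beta) + sigma^2)
  signal_var <- drop(t(beta) %*% Sigma %*% beta)
  sigma <- sqrt(signal_var * (1 - R2) / R2)
  y <- X %*% beta + rnorm(n, sd = sigma)
  list(X = X, y = drop(y), beta = beta, sigma = sigma)
}

## Example: an ill-posed, highly correlated setting (n/p = 0.5, rho = 0.8)
## dat <- simulate_design(n = 100, p = 200, rho = 0.8, s = 20)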

Comparing optimization strategies

We used our own code to avoid implementation biases for

1. accelerated proximal methods – proximal,
2. coordinate descent – coordinate,
3. our quadratic solver – quadratic,

wrapped in the same active-set + warm-start routine. Timings are averaged over 100 runs to minimize

J^enet_{λ1,λ2}(β) = (1/2)‖Xβ − y‖² + λ₁‖β‖₁ + (λ₂/2)‖β‖²,

with halting condition

max_{j∈{1,...,p}} | x_jᵀ(y − Xβ) + λ₂β_j | < λ₁ + τ,   (1)

where the threshold is τ = 10⁻², on a 50 × 50 grid of λ₁ × λ₂ values.

25

Log-ratio between the timing of each competitor and quadratic, p = 2n = 100, s = 30

[Figure: heat maps over the (log10(λ1), log10(λ2)) grid, for large (0.8), medium (0.4) and small (0.1) correlation; one row per method, coordinate descent and proximal (FISTA); color scale: number of times faster, from 1 to 300.]

26

Comparing stand-alone implementations

We compare our method on a Lasso problem to popular R packages:

1. accelerated proximal methods – SPAMs-FISTA (Mairal, Bach et al.),
2. coordinate descent – glmnet (Friedman, Hastie, Tibshirani),
3. homotopy/LARS algorithm – lars (Efron, Hastie) and SPAMs-LARS,
4. quadratic solver – quadrupen.

The distance D to the optimum is evaluated on J^lasso_λ(β) = J^enet_{λ,0}(β) by

D(method) = ( (1/|Λ|) Σ_{λ∈Λ} ( J^lasso_λ(β^lars_λ) − J^lasso_λ(β^method_λ) )² )^{1/2},

where Λ is given by the first min(n, p) steps of lars.

We vary ρ and (p, n), fix s = 0.25 min(n, p), and average over 50 runs.

27
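A hedged R sketch of this criterion (names are ours); B_method and B_lars are assumed to hold the two coefficient paths column-wise over the grid Λ:

J_lasso <- function(beta, X, y, lambda) {
  ## J^lasso_lambda(beta) = (1/2) ||X beta - y||^2 + lambda ||beta||_1
  0.5 * sum((X %*% beta - y)^2) + lambda * sum(abs(beta))
}
D_method <- function(B_method, B_lars, X, y, Lambda) {
  gaps <- sapply(seq_along(Lambda), function(k)
    J_lasso(B_method[, k], X, y, Lambda[k]) - J_lasso(B_lars[, k], X, y, Lambda[k]))
  sqrt(mean(gaps^2))   # root mean squared objective gap along the path
}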

[Figure (three panels): distance to the optimum D(method) (log10) versus CPU time (seconds, log10), at low (0.1), medium (0.4) and high (0.8) correlation, for glmnet (CD, active set), SPAMs (FISTA, no active set), SPAMs (homotopy/LARS), lars (homotopy/LARS) and quadrupen (this paper); settings: n = 100, p = 40; n = 200, p = 1000; n = 400, p = 10000.]

28

Link between accuracy and prediction performances

The Lasso requires the rather restrictive 'irrepresentable condition' (or its avatars) on the design for sign consistency. . .

Any unforeseen consequences of a lack of accuracy?

Early stops of the algorithm are likely to prevent

- either the removal of all irrelevant coefficients,
- or the insertion of relevant ones.

Illustration on mean square error and support recovery

- Generate 100 training data sets for linear regression with ρ = 0.8, R² = 0.8, p = 100, s = 30%, varying n.
- Generate for each a large test set (say, of size 10n) for evaluation.

method                      quadrupen      glmnet (low)   glmnet (med)   glmnet (high)
timing (ms)                 8              7              8              64
accuracy (dist. to opt.)    5.9 × 10^-14   7.2 × 10^0     6.04 × 10^0    1.47 × 10^-2

29

[Figure: mean square error (top) and sign error (bottom) versus log10(λ1), for n/p ∈ {0.5, 1, 2}, comparing quadrupen with glmnet at low, medium and high precision settings.]

30

R package features

31

Learning features

Problem solved

β̂ = argmin_β (1/2)(y − Xβ)ᵀW(y − Xβ) + λ₁‖ω ⊙ β‖₁ + λ₂ βᵀSβ,

where

- W = diag(w), with (w_i)_{i=1,...,n} ≥ 0 some observation weights,
- ω, with (ω_j)_{j=1,...,p} > 0 the ℓ1-penalty weights,
- S a p × p positive definite matrix 'structuring' the ℓ2 penalty.

Some corresponding estimators

- (Adaptive) Lasso,
- (Structured) Elastic-net,
- Fused-Lasso signal approximator (inefficient),
- . . .

32
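As a hedged sketch (names are ours), the criterion above can be written down directly in R; this only evaluates the objective, it is not the quadrupen solver:

quad_objective <- function(beta, X, y, w, omega, S, lambda1, lambda2) {
  r <- drop(y - X %*% beta)                     # residuals
  0.5 * sum(w * r^2) +                          # weighted least-squares term
    lambda1 * sum(omega * abs(beta)) +          # weighted l1 penalty
    lambda2 * drop(t(beta) %*% S %*% beta)      # structured l2 penalty
}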

Technical features

Written in R (with S4 classes) and C++.

Dependencies

- armadillo + RcppArmadillo, its accompanying interface to R:
  "Armadillo is a C++ linear algebra library (matrix maths) aiming towards a good balance between speed and ease of use. The syntax is deliberately similar to Matlab."
- Matrix, to handle sparse matrices,
- ggplot2, a plotting system based on the grammar of graphics,
- parallel – not available for Windows :-)
  a package built into R since 2.14, which allows parallel computation over the available CPU cores or clusters (very useful for cross-validation, for instance).

Suited to solve small to medium scale problems.

33

Load the package and show dependencies

R> library("quadrupen")
 [1] "quadrupen" "Matrix"    "lattice"   "ggplot2"   "stats"     "graphics"
 [7] "grDevices" "utils"     "datasets"  "methods"   "base"

Generate a size-100 vector of parameters with labels

R> ## VECTOR OF TRUE PARAMETERS
R> ## sparse, blockwise shaped
R> beta <- rep(c(0, -2, 2), c(80, 10, 10))
R> ## labels for true nonzeros
R> labels <- rep("irrelevant", length(beta))
R> labels[beta != 0] <- c("relevant")
R> labels <- factor(labels, ordered = TRUE,
+                   levels = c("relevant", "irrelevant"))

34

R> ## COVARIANCE STRUCTURE OF THE PREDICTORS
R> ## Toeplitz correlation between irrelevant variables
R> cor <- 0.8
R> S11 <- toeplitz(cor^(0:(80 - 1)))
R> ## block correlation between relevant variables
R> S22 <- matrix(cor, 10, 10)
R> diag(S22) <- 1
R> ## correlation between relevant and irrelevant variables
R> eps <- 0.25
R> Sigma <- bdiag(S11, S22, S22) + eps

[Image: the resulting 100 x 100 covariance matrix Sigma.]

35

Generate n = 100 observations with a high level of noise, σ = 10

R> mu    <- 3
R> sigma <- 10
R> n     <- 100
R> x <- as.matrix(matrix(rnorm(100 * n), n, 100) %*% chol(Sigma))
R> y <- mu + x %*% beta + rnorm(n, 0, sigma)

Give a try to raw Lasso and Elastic-net fits. . .

R> start <- proc.time()
R> lasso <- elastic.net(x, y, lambda2 = 0)
R> e.net <- elastic.net(x, y, lambda2 = 1)
R> print(proc.time() - start)
   user  system elapsed
  0.080   0.000   0.078

36

A print/show method is defined:

R> print(e.net)
Linear regression with elastic net penalizer, coefficients rescaled by (1+lambda2).
 - number of coefficients: 100 + intercept
 - penalty parameter lambda1: 100 points from 171 to 0.856
 - penalty parameter lambda2: 1

Also consider the residuals, deviance, predict, and fitted methods. . .

R> head(deviance(e.net))
 171.217 162.295 153.837 145.821 138.222 131.019
   82366   76185   68799   60415   52456   45139

37

R> plot(lasso, main = "Lasso", xvar = "fraction", labels = labels)

[Figure "Lasso": standardized coefficients along the path, versus |β_λ1|_1 / max_λ1 |β_λ1|_1, relevant versus irrelevant variables.]

38

R> plot(lasso, main = "Lasso", xvar = "lambda",
+       log.scale = FALSE, reverse = TRUE, labels = labels)

[Figure "Lasso": standardized coefficients versus λ1, on a reversed linear scale.]

39

R> plot(lasso, main = "Lasso", labels = labels)

[Figure "Lasso": standardized coefficients versus log10(λ1).]

40

R> plot(e.net, main = "Elastic-net", labels = labels)

[Figure "Elastic-net": standardized coefficients versus log10(λ1).]

41

R> system.time(
+   cv.double <- crossval(x, y, lambda2 = 10^seq(1, -1.5, len = 50))
+ )

DOUBLE CROSS-VALIDATION
10-fold CV on the lambda1 grid for each lambda2
 10     8.892  7.906  7.03   6.251
 5.558  4.942  4.394  3.907  3.474
 3.089  2.746  2.442  2.171  1.931
 1.717  1.526  1.357  1.207  1.073
 0.954  0.848  0.754  0.671  0.596
 0.53   0.471  0.419  0.373  0.331
 0.295  0.262  0.233  0.207  0.184
 0.164  0.146  0.129  0.115  0.102
 0.091  0.081  0.072  0.064  0.057
 0.051  0.045  0.04   0.036  0.032

   user  system elapsed
 10.888   1.492   6.636

R> plot(cv.double)

42

[Figure: cross-validation error (mean) over the grid of log10(λ1) and log10(λ2) values.]

43

R> lambda2 <- slot(cv.double, "lambda2.min")
[1] 0.8483

R> system.time(
+   cv.simple <- crossval(x, y, lambda2 = lambda2)
+ )

SIMPLE CROSS-VALIDATION
10-fold CV on the lambda1 grid, lambda2 is fixed.

   user  system elapsed
  0.312   0.052   0.266

R> sum(sign(slot(cv.simple, "beta.min")) != sign(beta))
[1] 0

R> plot(cv.simple)

44

[Figure: cross-validation error (mean square error) versus log10(λ1), with the λ1 choices given by the minimal MSE and by the 1-SE rule.]

45

R> marks <- log10(c(slot(cv.simple, "lambda1.min"),
+                   slot(cv.simple, "lambda1.1se")))
R> graph <- plot(elastic.net(x, y, lambda2 = lambda2),
+               labels = labels, plot = FALSE)
R> graph + geom_vline(xintercept = marks)

[Figure "elastic net path": standardized coefficients versus log10(λ1), with vertical lines at the two cross-validated choices of λ1.]

46

Stability Selection

Let

- I be a random subsample of size ⌊n/2⌋,
- S^λ(I) = { j : β̂^λ_j(I) ≠ 0 } the estimated support at λ,
- Π^λ_j = P(j ∈ S^λ(I)) the estimated selection probabilities,
- q_Λ = E(|S^Λ(I)|) the average number of selected variables, where Λ = [λmax, λmin].

Definition

The set of stable variables on Λ with respect to a cutoff πthr is

S^stable = { j : max_{λ∈Λ} Π^λ_j ≥ πthr }.

Proposition

If the distribution of 1_{k∈S^λ} is exchangeable for any λ ∈ Λ, then

FWER ≤ PFER = E(V) ≤ q_Λ² / ((2πthr − 1) p).

47
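A tiny R helper for this bound (our naming), which reproduces the numbers displayed on the stability plots of the next slides (πthr = 0.75, p = 100):

## PFER bound of the proposition: E(V) <= q^2 / ((2 * pi_thr - 1) * p)
pfer_bound <- function(q, p, pi_thr) q^2 / ((2 * pi_thr - 1) * p)
pfer_bound(q = 7.29,  p = 100, pi_thr = 0.75)   # about 1.06, matching PFER <= 1
pfer_bound(q = 10.03, p = 100, pi_thr = 0.75)   # about 2.01, matching PFER <= 2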

R> system.time(
+   stab <- stability(x, y, subsamples = 400,
+                     randomize = TRUE, weakness = 0.5)
+ )

STABILITY SELECTION with randomization (weakness = 0.5)
Fitting procedure: elastic.net with lambda2 = 0.01 and an 100-dimensional grid of lambda1.
Running 2 jobs parallely (1 per core)
Approx. 200 subsamplings for each job for a total of 400

   user  system elapsed
  5.273   0.144   5.440

R> print(stab)

Stability path for elastic.net penalizer, coefficients rescaled by (1+lambda2).
 - penalty parameter lambda1: 100 points from 171 to 0.856
 - penalty parameter lambda2: 0.01

48

R> plot(stab, labels = labels, cutoff = 0.75, PFER = 1)

[Figure "Stability path of an elastic.net regularizer": selection probabilities versus the average number of selected variables, with cutoff πthr = 0.75 and PFER ≤ 1, giving q = 7.29; selected versus unselected, relevant versus irrelevant variables.]

49

R> plot(stab, labels = labels, cutoff = 0.75, PFER = 2)

[Figure: the same stability path with PFER ≤ 2, giving q = 10.03.]

50

Concluding remarks

What has been done

- Unifying view of sparsity through a quadratic formulation,
- Robust regression interpretation,
- Competitive algorithm for small to medium scale problems,
- Accompanying R package,
- Insights into the link between accuracy and prediction performance.

What will be done (almost surely, and soon)

1. Multivariate problems (Stephane, Guillem/Pierre):

   tr{ (Y − XB)ᵀ Ω (Y − XB) } + pen_{λ1,λ2}(B).

2. A real sparse handling of the design matrix (Stephane).
3. Group ℓ1/ℓ∞ penalty (prototyped in R) (Stephane, Camille, Eric).

51

More perspectives

Hopefully (help needed)

1. Screening/early discarding of irrelevant features.
2. Efficient implementation of Iteratively Reweighted Least Squares for logistic regression, the Cox model, etc. (Sarah? Marius?)
3. Consider an implementation of OSCAR/group-OSCAR (for segmentation purposes). (Alia? Morgane? Pierre?)
4. Control the precision when solving the subproblems (tortured M2 student):
   - a dirty (yet controlled) resolution could even speed up the procedure,
   - this can be done via (preconditioned) conjugate gradient / iterative methods,
   - some promising results during Aurore's M2.
5. More tools for robust statistics. . .
6. Integration with SIMoNe (if I don't know what to do / am at loose ends).

52

References

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.

H. Xu, C. Caramanis, and S. Mannor. Robust regression and Lasso. IEEE Transactions on Information Theory, 56(7):3561–3574, 2010.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2:183–202, 2009.

H. D. Bondell and B. J. Reich. Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics, 64(1):115–123, 2008.

W. J. Fu. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397–416, 1998.

L. El Ghaoui and H. Lebret. Robust solutions to least-squares problems with uncertain data. SIAM Journal on Matrix Analysis and Applications, 18(4):1035–1064, 1997.

53