Sparsity by Worst-case Quadratic Penalties

Joint work with Yves Grandvalet and Christophe Ambroise
Statistique et Génome, CNRS & Université d'Évry Val d'Essonne
SSB group, Évry, October 16th, 2012

arXiv preprint with Y. Grandvalet and C. Ambroise: http://arxiv.org/abs/1210.2077
R package quadrupen, on CRAN
Variable selection in high-dimensional problems

Questionable solutions
1. Treat univariate problems and select effects via multiple testing.
   Genomic data are often highly correlated...
2. Combine multivariate analysis and model selection techniques:

       arg min_{β∈R^p} −ℓ(β; y, X) + λ‖β‖₀

   Infeasible for p > 30 in general (NP-hard).

Popular idea (questionable too!)
Use ℓ1 as a convex relaxation of this problem, keeping its sparsity-inducing effect:

       arg min_{β∈R^p} −ℓ(β; y, X) + λ · pen_{ℓ1}(β)
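To see the sparsity-inducing effect of the ℓ1 relaxation concretely, here is a toy sketch (in Python, with invented numbers, not part of the talk): under squared-error loss with an orthonormal design, the ℓ1-penalized solution is the coordinate-wise soft-thresholding of the least-squares estimate, so small coefficients are set exactly to zero, whereas an ℓ2 penalty would only shrink them.

```python
def soft_threshold(z, lam):
    """Minimizer of (1/2)(z - b)^2 + lam*|b| over b."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# With an orthonormal design the lasso acts coordinate-wise on the
# OLS estimate: small effects are zeroed out, large ones are shrunk.
ols = [3.0, -0.5, 0.2, -4.0]
lasso = [soft_threshold(b, 1.0) for b in ols]
print(lasso)  # [2.0, 0.0, 0.0, -3.0]
```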
Another algorithm for the Lasso? (zzz...) Well, not really...

1. We suggest a unifying approach that might be useful since
   - it helps having a global view of the Lasso zoology,
   - reading every daily arXiv paper dealing with ℓ1 is out of reach,
   - insights are still needed to understand high-dimensional problems.
2. The associated algorithm is efficient and accurate up to medium-scale (1000s of variables) problems.
   - Promising tools for pre-discarding irrelevant variables are emerging; thus solving this class of problems may be enough.
   - Bootstrapping, cross-validation and subsampling are highly needed... for which our method is well adapted.
Outline
Geometrical insights on Sparsity
Robustness Viewpoint
Numerical experiments
R package features
A geometric view of sparsity

[Figure: level sets of ℓ(β1, β2) and the constraint region in the (β1, β2) plane]

    minimize_{β1,β2} −ℓ(β1, β2) + λΩ(β1, β2)
        ⇔
    maximize_{β1,β2} ℓ(β1, β2)   s.t.   Ω(β1, β2) ≤ c
Singularities induce sparsity

[Figure: unit balls B of the Lasso (ℓ1) and Ridge (ℓ2) penalties, with the normal cone N_B(β) at a point β]

β ∈ B is optimal if and only if ∇ℓ(β; X, y) defines a supporting hyperplane at β. Equivalently,

    −∇ℓ(β; X, y) ∈ N_B(β).
Robust optimization framework

Worst-case analysis
We wish to solve a regression problem where

    β̂ = arg min_{β∈R^p} max_{γ∈D_γ} ‖Xβ − y‖² + λ‖β − γ‖²,

- D_γ describes an uncertainty set for the parameters,
- γ acts as a spurious adversary over the true β*;
  maximizing over D_γ leads to the worst-case formulation.
Sparsity by worst-case formulation

First mathematical breakthrough
Great, (a − b)² = a² + b² − 2ab, so

    minimize_{β∈R^p} max_{γ∈D_γ} ‖Xβ − y‖² + λ‖β − γ‖²
        = minimize_{β∈R^p} ‖Xβ − y‖² + λ‖β‖² + λ max_{γ∈D_γ} { −2γᵀβ + ‖γ‖² }.

Look for a sparsity-inducing norm
- Choose D_γ so as to recover your favourite ℓ1 penalizer via the γᵀβ term,
- and forget about ‖γ‖², which does not change the minimization;
- this can be done 'systematically' by imposing regularity on β and considering the dual adversarial assumption on γ.
Example: robust formulation of the elastic net

'Lasso' regularity set for β
The ℓ1-norm must be controlled:

    H^Lasso_β = {β ∈ R^p : ‖β‖₁ ≤ η_β}.

Dual assumption on the adversary
The ℓ∞-norm of γ should be controlled, say:

    D^Lasso_γ = {γ ∈ R^p : max_{β∈H^Lasso_β} γᵀβ ≤ 1}
              = {γ ∈ R^p : ‖γ‖∞ ≤ η_γ} = conv({−η_γ, η_γ}^p),

where η_γ = 1/η_β.
Example: robust formulation of the elastic net (continued)

Now, the other way round:

    minimize_{β∈R^p} ‖Xβ − y‖² + λ‖β‖² + λ max_{γ∈{−η_γ,η_γ}^p} { −2γᵀβ + ‖γ‖² }
        = minimize_{β∈R^p} ‖Xβ − y‖² + λ‖β‖² + 2λη_γ‖β‖₁ + c
        ⇔ minimize_{β∈R^p} (1/2)‖Xβ − y‖² + λ1‖β‖₁ + (λ2/2)‖β‖²,

for known λ1, λ2.
We recognize the 'official' Elastic-net regularizer.
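The key identity behind this derivation, that the worst-case linear term over the hypercube {−η, η}^p yields the ℓ1 norm, can be checked by brute force (a small Python sketch with invented numbers; `worst_case_linear` is an illustrative helper, not part of the package):

```python
import itertools

def worst_case_linear(beta, eta):
    """Maximize gamma^T beta over all corners gamma in {-eta, +eta}^p."""
    return max(sum(g * b for g, b in zip(gamma, beta))
               for gamma in itertools.product((-eta, eta), repeat=len(beta)))

beta = [1.5, -2.0, 0.0, 0.5]
eta = 0.8
# The maximum matches eta * ||beta||_1, attained at gamma_j = eta*sign(beta_j).
assert abs(worst_case_linear(beta, eta)
           - eta * sum(abs(b) for b in beta)) < 1e-9
```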
Example: robust formulation of the elastic net (continued)
Geometrical argument with the constrained formulation

    minimize_{β∈R^p} ‖Xβ − y‖²   s.t.   ‖β‖₂² + η‖β‖₁ ≤ s
        ⇔
    minimize_{β∈R^p} ‖Xβ − y‖²   s.t.   max_{γ∈{−η,η}^p} ‖β − γ‖² ≤ s + η²
Example: robust formulation of the ℓ1/ℓ∞ group-Lasso

Argument with one group, which generalizes by decomposability of the group-norm.

The regularity set for β, with ℓ∞ controlled:

    H^max_β = {β ∈ R^p : ‖β‖∞ ≤ η_β}.

Dual assumption on the adversary:

    D^max_γ = {γ ∈ R^p : sup_{β∈H^max_β} γᵀβ ≤ 1}
            = {γ ∈ R^p : ‖γ‖₁ ≤ η_γ}
            = conv({η_γ e^p_1, …, η_γ e^p_p, −η_γ e^p_1, …, −η_γ e^p_p}),

where η_γ = 1/η_β and e^p_j is the j-th element of the canonical basis of R^p, that is, e_jj' = 1 if j = j' and e_jj' = 0 otherwise.
Example: robust formulation of the ℓ1/ℓ∞ group-Lasso
Geometrical argument with the constrained formulation

    minimize_{β∈R^p} ‖Xβ − y‖²   s.t.   ‖β‖₂² + η‖β‖∞ ≤ s
        ⇔
    minimize_{β∈R^p} ‖Xβ − y‖²   s.t.   max_{γ∈D^max_γ} ‖β − γ‖² ≤ s + η²
Generalize this principle to your favourite problem

[Figure: worst-case uncertainty sets recovering the elastic-net (ℓ1 + ℓ2), ℓ∞ + ℓ2, structured elastic-net, fused-Lasso + ℓ2, and OSCAR + ℓ2 penalties]
Worst-Case Quadratic Penalty Active-Set Algorithm

S0 Initialization
   β ← β⁰                                        // start with a feasible β
   A ← {j : β_j ≠ 0}                             // determine the active set
   γ ← arg max_{g∈D_γ} ‖β − g‖²                  // pick a worst admissible γ

S1 Update active variables β_A
   β_A ← (X_Aᵀ X_A + λI_|A|)⁻¹ (X_Aᵀ y + λγ_A)   // subproblem resolution

S2 Verify coherence of γ_A with the updated β_A
   if ‖β_A − γ_A‖² < max_{g∈D_γ} ‖β_A − g_A‖² then   // γ_A is not worst-case
       β_A ← β_A^old + ρ(β_A − β_A^old)              // last γ_A-coherent solution

S3 Update the active set A
   g_j ← min_{γ∈D_γ} |x_jᵀ(X_A β_A − y) + λ(β_j − γ_j)|,  j = 1, …, p   // worst-case gradient
   if ∃ j ∈ A : β_j = 0 and g_j = 0 then
       A ← A \ {j}                               // downgrade j
   else if max_{j∈A^c} g_j ≠ 0 then
       j* ← arg max_{j∈A^c} g_j ;  A ← A ∪ {j*}  // upgrade j*
   else
       stop and return β, which is optimal
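The ridge-like subproblem of step S1 can be sketched numerically (a toy Python example with invented data; `XA`, `solve2` and the 2-variable active set are illustrative, not package code). With γ fixed at its worst-case value for an ℓ∞ ball, γ_j = −η·sign(β_j), the update solves a plain linear system, and stationarity of the penalized subproblem can be checked directly:

```python
# Toy data: |A| = 2 active variables, n = 3 observations.
XA = [[1.0, 0.5], [0.0, 1.0], [1.0, -1.0]]
y = [2.0, 1.0, 0.0]
lam, eta = 0.5, 1.0

def solve2(A, b):
    """Cramer's rule for a 2x2 linear system A x = b."""
    det = A[0][0]*A[1][1] - A[0][1]*A[1][0]
    return [(b[0]*A[1][1] - A[0][1]*b[1]) / det,
            (A[0][0]*b[1] - b[0]*A[1][0]) / det]

# Worst-case gamma for an l_inf ball maximizes ||beta - gamma||^2, i.e.
# gamma_j = -eta * sign(beta_j); here we assume both coefficients positive.
gamma = [-eta, -eta]

# System (X_A' X_A + lam I) beta_A = X_A' y + lam * gamma_A.
G = [[sum(r[i]*r[j] for r in XA) + (lam if i == j else 0.0)
      for j in range(2)] for i in range(2)]
rhs = [sum(r[i]*yi for r, yi in zip(XA, y)) + lam*gamma[i] for i in range(2)]
beta = solve2(G, rhs)

# Stationarity of the subproblem: X_A'(X_A beta - y) + lam*(beta - gamma) = 0.
resid = [sum(r[i]*b for i, b in enumerate(beta)) - yi for r, yi in zip(XA, y)]
grad = [sum(r[i]*e for r, e in zip(XA, resid)) + lam*(beta[i] - gamma[i])
        for i in range(2)]
assert all(abs(g) < 1e-10 for g in grad)
```

Both fitted coefficients come out positive here, so the assumed sign pattern for γ is coherent with the update (the check of step S2 in the actual algorithm).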
Algorithm complexity (see, e.g., Bach et al. 2011)

Suppose that
- the algorithm stops at λ_min with k activated variables,
- no downgrade has been observed (we thus have k iterations/steps).

Complexity in favorable cases
1. computing Xᵀ X_A + λI_|A|: O(npk),
2. maintaining x_jᵀ(X_A β_A − y) along the path: O(pn + pk²),
3. Cholesky update of (X_Aᵀ X_A + λI_|A|)⁻¹: O(k³),

for a total of O(npk + pk² + k³). Hence k, defined by λ_min, matters...
A bound to assess the distance to the optimum during optimization

Proposition
For any η_γ > 0 and any vector norm ‖·‖*, when D_γ is defined as D_γ = {γ ∈ R^p : ‖γ‖* ≤ η_γ}, then, for all γ ∈ R^p with ‖γ‖* ≥ η_γ, we have:

    min_{β∈R^p} max_{γ'∈D_γ} J_λ(β, γ') ≥ (η_γ/‖γ‖*) J_λ(β*(γ), γ) − λη_γ(‖γ‖* − η_γ) ‖γ‖²/‖γ‖*²,

where

    J_λ(β, γ) = ‖Xβ − y‖² + λ‖β − γ‖²   and   β*(γ) = arg min_{β∈R^p} J_λ(β, γ).

This proposition can be used to compute an optimality gap by picking a γ-value such that the current worst-case gradient is null (the current β-value then being the optimal β*(γ)).
Bound: illustration on an Elastic-Net problem

[Figure: optimality gap (log scale) versus number of iterations, n = 50, p = 200]
Figure: Monitoring convergence: true optimality gap (solid black) versus our pessimistic bound (dashed blue) and Fenchel's duality gap (dotted red), computed at each iteration of the algorithm.
Full interpretation in a robust optimization framework

Proposition
The robust regression problem

    min_{β∈R^p} max_{(∆X,ε,γ)∈D_X×D_ε×D_γ} ‖(X − ∆X)β + Xγ + ε − y‖²,

for a given form of the global uncertainty set D_X × D_ε × D_γ on (∆X, ε, γ), is equivalent to the robust regularized regression problem:

    min_{β∈R^p} max_{γ∈D_γ} ‖Xβ − y‖² + η_X ‖β − γ‖².

These assumptions entail the following relationship between X and y:

    y = (X − ∆X)β* + Xγ + ε.

The observed responses are formed by summing the contributions of the unobserved clean inputs, the adversarial noise that maps the observed inputs to the responses, and the neutral noise.
General objectives

Assessing the efficiency of an algorithm
- accuracy: the difference between the optimum of the objective function and its value at the solution returned by the algorithm;
- speed: the computing time required for returning this solution.
Timings have to be compared at similar precision requirements.

Mimicking post-genomic data attributes
Optimization difficulties result from ill-conditioning, due to
- either high correlation between predictors,
- or underdetermination (the high-dimensional, or "large p small n", setup).

Remarks
1. With active-set strategies, bad conditioning is somewhat alleviated.
2. Sparsity of the true parameter heavily impacts the running times.
Data generation

Exploring those characteristics with linear regression
We generate samples of size n from the model

    y = Xβ* + ε,   ε ∼ N(0, σ²I),

- σ chosen so as to reach R² ≈ 0.8,
- X ∼ N(0, Σ), with Σ_ij = 1_{i=j} + ρ·1_{i≠j},
- sgn(β*) = (1, …, 1, −1, …, −1, 0, …, 0), with s/2 entries of each sign and p − s zeros.

Controlling the difficulty
- ρ ∈ {0.1, 0.4, 0.8} rules the conditioning,
- s ∈ {10%, 30%, 60%} controls the sparsity,
- the ratio n/p ∈ {2, 1, 0.5} quantifies the well/ill-posedness.
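This design is easy to simulate because the equicorrelated covariance Σ = (1−ρ)I + ρ11ᵀ can be sampled with one shared Gaussian factor per observation. A minimal Python sketch (illustrative only; it fixes σ directly instead of calibrating it to R² ≈ 0.8 as the slides do):

```python
import random

random.seed(42)

def simulate(n, p, s, rho, sigma):
    """Draw (X, y, beta_star): equicorrelated predictors with
    Cov(x_i, x_j) = rho for i != j, and a +/- sparse beta_star."""
    beta = [1.0]*(s//2) + [-1.0]*(s//2) + [0.0]*(p - s)
    X, y = [], []
    for _ in range(n):
        g = random.gauss(0.0, 1.0)  # shared factor gives pairwise correlation rho
        row = [rho**0.5 * g + (1.0 - rho)**0.5 * random.gauss(0.0, 1.0)
               for _ in range(p)]
        X.append(row)
        y.append(sum(b*x for b, x in zip(beta, row)) + random.gauss(0.0, sigma))
    return X, y, beta

X, y, beta = simulate(n=50, p=100, s=10, rho=0.4, sigma=1.0)
assert len(X) == 50 and len(X[0]) == 100
assert sum(1 for b in beta if b != 0.0) == 10
```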
Comparing optimization strategies

We used our own code to avoid implementation biases for
1. accelerated proximal methods (proximal),
2. coordinate descent (coordinate),
3. our quadratic solver (quadratic),

all wrapped in the same active-set + warm-start routine. Timings are averaged over 100 runs to minimize

    J^enet_{λ1,λ2}(β) = (1/2)‖Xβ − y‖² + λ1‖β‖₁ + (λ2/2)‖β‖²,

with halting condition

    max_{j∈1…p} |x_jᵀ(y − Xβ) − λ2 β_j| < λ1 + τ,    (1)

where the threshold τ = 10⁻², on a 50 × 50 grid of (λ1, λ2) values.
[Figure: log-ratio between the timing of each competitor and quadratic, p = 2n = 100, s = 30; panels by correlation level (large 0.8, medium 0.4, small 0.1) and competitor (coordinate descent, proximal/FISTA), over the log10(λ1) × log10(λ2) grid; color scale "# times faster" from 1 to 300]
Comparing stand-alone implementations

We compare our method on a Lasso problem to popular R packages:
1. accelerated proximal methods: SPAMs-FISTA (Mairal, Bach et al.),
2. coordinate descent: glmnet (Friedman, Hastie, Tibshirani),
3. homotopy/LARS algorithm: lars (Efron, Hastie) and SPAMs-LARS,
4. quadratic solver: quadrupen.

The distance D to the optimum is evaluated on J^lasso_λ(β) = J^enet_{λ,0}(β) by

    D(method) = ( (1/|Λ|) Σ_{λ∈Λ} ( J^lasso_λ(β^lars_λ) − J^lasso_λ(β^method_λ) )² )^{1/2},

where Λ is given by the first min(n, p) steps of lars.
We vary ρ and (p, n), fix s = 0.25 min(n, p), and average over 50 runs.
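The metric D is just a root-mean-square gap between objective values along the grid; a small sketch (hypothetical numbers, `distance_to_optimum` is an illustrative helper):

```python
def distance_to_optimum(J_ref, J_method):
    """RMS gap between reference (e.g. lars) and method objective
    values over the grid Lambda, as in D(method)."""
    assert len(J_ref) == len(J_method)
    return (sum((a - b)**2 for a, b in zip(J_ref, J_method)) / len(J_ref))**0.5

# Hypothetical objective values along a 4-point lambda grid: one value
# is off by 0.1, so D = sqrt(0.1^2 / 4) = 0.05.
J_lars   = [10.0, 6.0, 4.0, 3.0]
J_method = [10.0, 6.1, 4.0, 3.0]
print(distance_to_optimum(J_lars, J_method))  # ~0.05
```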
[Figure: three panels of distance to the optimum D(method) (log10) versus CPU time (seconds, log10), for n = 100, p = 40; n = 200, p = 1000; and n = 400, p = 10000. Curves for low (0.1), medium (0.4) and high (0.8) correlation; methods: glmnet (CD, active set), SPAMs (FISTA, no active set), SPAMs (homotopy/LARS), quadrupen (this paper), lars (homotopy/LARS)]
Link between accuracy and prediction performance

The Lasso requires the rather restrictive 'irrepresentable condition' (or its avatars) on the design for sign consistency...

Any unforeseen consequences of a lack of accuracy?
Early stops of the algorithm are likely to prevent
- either the removal of all irrelevant coefficients,
- or the insertion of relevant ones.

Illustration on mean square error and support recovery
- Generate 100 training data sets for linear regression with ρ = 0.8, R² = 0.8, p = 100, s = 30%, varying n.
- Generate for each a large test set (say, of size 10n) for evaluation.

method                    quadrupen     glmnet (low)  glmnet (med)  glmnet (high)
timing (ms)               8             7             8             64
accuracy (dist. to opt.)  5.9 × 10⁻¹⁴   7.2 × 10⁰     6.04 × 10⁰    1.47 × 10⁻²
[Figure: MSE (top row) and sign error (bottom row) as a function of log10(λ1), for n/p ∈ {0.5, 1, 2}; methods: quadrupen, glmnet low, glmnet med, glmnet high]
Learning features

Problem solved

    β̂ = arg min_β (1/2)(y − Xβ)ᵀ W (y − Xβ) + λ1 ‖ω ∘ β‖₁ + λ2 βᵀ S β,

where
- W = diag(w), with (w_i)_{i=1}^n ≥ 0 some observation weights,
- ω, with (ω_j)_{j=1}^p > 0, the ℓ1-penalty weights (∘ is the elementwise product),
- S a p × p positive-definite matrix 'structuring' the ℓ2 penalty.

Some corresponding estimators
- (Adaptive) Lasso,
- (Structured) Elastic-net,
- Fused-Lasso signal approximator (inefficient),
- ...
Technical features

Written in R (with S4 classes) and C++.

Dependencies
- armadillo + RcppArmadillo, its accompanying interface to R.
  Armadillo is a C++ linear algebra library (matrix maths) aiming towards a good balance between speed and ease of use. The syntax is deliberately similar to Matlab.
- Matrix, to handle sparse matrices.
- ggplot2, a plotting system based on the grammar of graphics.
- parallel (not available for Windows :-)), a built-in R package since R 2.14 which allows parallel computation over available CPU cores or clusters (very useful for cross-validation, for instance).

Suited to solve small to medium-scale problems.
Load the package and show dependencies

R> library("quadrupen")
 [1] "quadrupen" "Matrix"    "lattice"   "ggplot2"   "stats"     "graphics"
 [7] "grDevices" "utils"     "datasets"  "methods"   "base"

Generate a size-100 vector of parameters with labels

R> ## VECTOR OF TRUE PARAMETERS
R> ## sparse, blockwise shaped
R> beta <- rep(c(0, -2, 2), c(80, 10, 10))
R> ## labels for true nonzeros
R> labels <- rep("irrelevant", length(beta))
R> labels[beta != 0] <- c("relevant")
R> labels <- factor(labels, ordered = TRUE,
+                   levels = c("relevant", "irrelevant"))
R> ## COVARIANCE STRUCTURE OF THE PREDICTORS
R> ## Toeplitz correlation between irrelevant variables
R> cor <- 0.8
R> S11 <- toeplitz(cor^(0:(80-1)))
R> ## block correlation between relevant variables
R> S22 <- matrix(cor, 10, 10)
R> diag(S22) <- 1
R> ## correlation between relevant and irrelevant variables
R> eps <- 0.25
R> Sigma <- bdiag(S11, S22, S22) + eps

[Figure: image of the 100 × 100 covariance matrix Sigma]
Generate n = 100 observations with a high level of noise, σ = 10

R> mu <- 3
R> sigma <- 10
R> n <- 100
R> x <- as.matrix(matrix(rnorm(100 * n), n, 100) %*% chol(Sigma))
R> y <- mu + x %*% beta + rnorm(n, 0, sigma)

Give a try to raw Lasso and Elastic-net fits...

R> start <- proc.time()
R> lasso <- elastic.net(x, y, lambda2 = 0)
R> e.net <- elastic.net(x, y, lambda2 = 1)
R> print(proc.time() - start)
   user  system elapsed
  0.080   0.000   0.078
A print/show method is defined:

R> print(e.net)
Linear regression with elastic net penalizer, coefficients rescaled by (1+lambda2).
- number of coefficients: 100 + intercept
- penalty parameter lambda1: 100 points from 171 to 0.856
- penalty parameter lambda2: 1

Also consider residuals, deviance, predict, and fitted methods...

R> head(deviance(e.net))
171.217 162.295 153.837 145.821 138.222 131.019
  82366   76185   68799   60415   52456   45139
R> plot(lasso, main = "Lasso", xvar = "fraction", labels = labels)

[Figure: Lasso path, standardized coefficients versus |β_λ1|₁ / max_λ1 |β_λ1|₁, relevant versus irrelevant variables]
R> plot(lasso, main = "Lasso", xvar = "lambda",
+       log.scale = FALSE, reverse = TRUE, labels = labels)

[Figure: Lasso path, standardized coefficients versus λ1 on a reversed linear scale]
R> plot(lasso, main = "Lasso", labels = labels)

[Figure: Lasso path, standardized coefficients versus log10(λ1)]
R> plot(e.net, main = "Elastic-net", labels = labels)

[Figure: Elastic-net path, standardized coefficients versus log10(λ1)]
R> system.time(
+   cv.double <- crossval(x, y, lambda2 = 10^seq(1, -1.5, len = 50))
+ )
DOUBLE CROSS-VALIDATION
10-fold CV on the lambda1 grid for each lambda2
10     8.892  7.906  7.03   6.251
5.558  4.942  4.394  3.907  3.474
3.089  2.746  2.442  2.171  1.931
1.717  1.526  1.357  1.207  1.073
0.954  0.848  0.754  0.671  0.596
0.53   0.471  0.419  0.373  0.331
0.295  0.262  0.233  0.207  0.184
0.164  0.146  0.129  0.115  0.102
0.091  0.081  0.072  0.064  0.057
0.051  0.045  0.04   0.036  0.032
   user  system elapsed
 10.888   1.492   6.636

R> plot(cv.double)

[Figure: heat map of the mean cross-validation error over the (log10(λ1), log10(λ2)) grid]
R> lambda2 <- slot(cv.double, "lambda2.min")
[1] 0.8483
R> system.time(
+   cv.simple <- crossval(x, y, lambda2 = lambda2)
+ )
SIMPLE CROSS-VALIDATION
10-fold CV on the lambda1 grid, lambda2 is fixed.
   user  system elapsed
  0.312   0.052   0.266
R> sum(sign(slot(cv.simple, "beta.min")) != sign(beta))
[1] 0
R> plot(cv.simple)
[Figure: cross-validation error (mean square error) versus log10(λ1), with the λ1 choices from the 1-se rule and the minimum-MSE rule marked]
R> marks <- log10(c(slot(cv.simple, "lambda1.min"),
+                   slot(cv.simple, "lambda1.1se")))
R> graph <- plot(elastic.net(x, y, lambda2 = lambda2),
+               labels = labels, plot = FALSE)
R> graph + geom_vline(xintercept = marks)

[Figure: elastic net path, standardized coefficients versus log10(λ1), with vertical lines marking the two selected λ1 values]
Stability Selection

Let
- I be a random subsample of size ⌊n/2⌋,
- S^λ(I) = {j : β̂^λ_j(I) ≠ 0} the estimated support at λ,
- Π^λ_j = P(j ∈ S^λ(I)) the estimated selection probabilities,
- q_Λ = E(|S^Λ(I)|) the average number of selected variables, where Λ = [λ_max, λ_min].

Definition
The set of stable variables on Λ with respect to a cutoff π_thr is

    S^stable = {j : max_{λ∈Λ} Π^λ_j ≥ π_thr}.

Proposition
If the distribution of 1_{k∈S^λ} is exchangeable for any λ ∈ Λ, then

    FWER ≤ PFER = E(V) ≤ (1/(2π_thr − 1)) · q_Λ²/p.
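The bound in the proposition is simple enough to compute directly; a small Python sketch (the function name is illustrative):

```python
def pfer_bound(q, pi_thr, p):
    """Upper bound on the expected number of false positives E(V)
    under the exchangeability assumption: q^2 / ((2*pi_thr - 1) * p)."""
    assert 0.5 < pi_thr <= 1.0  # cutoff must exceed 1/2 for the bound to apply
    return q**2 / ((2.0*pi_thr - 1.0) * p)

# With p = 100 variables and a cutoff pi_thr = 0.75, keeping the average
# number of selected variables at q = 10 controls the PFER at 2.
print(pfer_bound(q=10.0, pi_thr=0.75, p=100))  # 2.0
```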
R> system.time(
+   stab <- stability(x, y, subsamples = 400,
+                     randomize = TRUE, weakness = 0.5)
+ )
STABILITY SELECTION with randomization (weakness = 0.5)
Fitting procedure: elastic.net with lambda2 = 0.01 and a 100-dimensional grid of lambda1.
Running 2 jobs in parallel (1 per core)
Approx. 200 subsamplings for each job, for a total of 400
   user  system elapsed
  5.273   0.144   5.440

R> print(stab)
Stability path for elastic.net penalizer, coefficients rescaled by (1+lambda2).
- penalty parameter lambda1: 100 points from 171 to 0.856
- penalty parameter lambda2: 0.01
R> plot(stab, labels = labels, cutoff = 0.75, PFER = 1)

[Figure: stability path of an elastic.net regularizer; selection probabilities versus the average number of selected variables, cutoff π_thr = 0.75, PFER ≤ 1, q = 7.29; variables marked selected/unselected and relevant/irrelevant]
R> plot(stab, labels = labels, cutoff = 0.75, PFER = 2)

[Figure: same stability path with PFER ≤ 2, q = 10.03]
Concluding remarks

What has been done
- Unifying view of sparsity through the quadratic formulation,
- Robust regression interpretation,
- Competitive algorithm for small to medium-scale problems,
- Accompanying R package,
- Insights into the links between accuracy and prediction performance.

What will be done (almost surely, and soon)
1. Multivariate problems (Stephane, Guillem/Pierre):

       tr{(Y − XB)ᵀ Ω (Y − XB)} + pen_{λ1,λ2}(B).

2. A real sparse handling of the design matrix (Stephane).
3. Group ℓ1/ℓ∞ penalty (prototyped in R) (Stephane, Camille, Eric).
More perspectives

Hopefully (help needed)
1. Screening/early discarding of irrelevant features.
2. Efficient implementation of iteratively reweighted least squares for logistic regression, the Cox model, etc. (Sarah? Marius?)
3. Consider implementation of the OSCAR/group-OSCAR (for segmentation purposes). (Alia? Morgane? Pierre?)
4. Control the precision when solving the subproblems (tortured M2 student):
   - a dirty (yet controlled) resolution could even speed up the procedure,
   - can be done via (preconditioned) conjugate gradient/iterative methods,
   - some promising results during Aurore's M2.
5. More tools for robust statistics...
6. Integration with SIMoNe (if I don't know what to do/at loose ends).
References

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.

H. Xu, C. Caramanis, and S. Mannor. Robust regression and lasso. IEEE Transactions on Information Theory, 56(7):3561–3574, 2010.

A. Beck and M. Teboulle. Fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2:183–202, 2009.

H. D. Bondell and B. J. Reich. Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics, 64(1):115–123, 2008.

W. J. Fu. Penalized regressions: The bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397–416, 1998.

L. El Ghaoui and H. Lebret. Robust solutions to least-squares problems with uncertain data. SIAM Journal on Matrix Analysis and Applications, 18(4):1035–1064, 1997.