Cross validation in sparse linear regression with ...
Cross validation in sparse linear regression with piecewise continuous nonconvex penalties and its acceleration
Tomoyuki Obuchi¹ and Ayaka Sakata²
¹ Dept. of Math. and Comp. Sci., Tokyo Tech.
² The Institute of Statistical Mathematics
TO, AS: arXiv:1902.10375
• Penalized linear regression: a linear regression (least-squares) term plus a penalty
$$\hat{x}(\eta) = \arg\min_{x} \left\{ \frac{1}{2}\|y - Ax\|_2^2 + J(x;\eta) \right\}$$
• Representative penalty: the $\ell_p$ norm, $J(x;\eta=\lambda) = \lambda\|x\|_p^p$
  • $p \ge 1$: convex
  • $p \le 1$: sparsity-inducing
→ $p = 1$ is nice for variable selection (LASSO)
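The penalized cost above can be evaluated directly. The following is a minimal sketch, assuming NumPy; the function name `penalized_objective` and the toy numbers are illustrative and not from the talk.

```python
import numpy as np

def penalized_objective(x, A, y, lam, p):
    """Return 0.5 * ||y - A x||_2^2 + lam * ||x||_p^p for p > 0."""
    residual = y - A @ x
    return 0.5 * residual @ residual + lam * np.sum(np.abs(x) ** p)

# Toy (made-up) example: compare the l1 and l2 penalties on the same x.
A = np.array([[1.0, 0.0], [0.0, 2.0]])
y = np.array([1.0, 2.0])
x = np.array([1.0, 0.5])
for p in (1, 2):
    print(p, penalized_objective(x, A, y, lam=0.1, p=p))
```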
Statistical bias in LASSO
• LASSO for 1-dimensional estimation
$$\hat{\theta} = \arg\min_{\theta} \left\{ \frac{1}{2\sigma^2}(\theta - w)^2 + \lambda|\theta| \right\}$$
[Figure: LASSO estimator $\hat{\theta}$ versus $w$ (both axes from -4 to 4), with an arrow labeled "Bias".]
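The 1-dimensional LASSO problem has the well-known soft-thresholding solution, which makes the bias explicit: every surviving estimate is shifted toward zero by $\lambda\sigma^2$, no matter how large $|w|$ is. A minimal sketch (the function name `lasso_1d` and the default `sigma2=1.0` are assumptions for illustration):

```python
import math

def lasso_1d(w, lam, sigma2=1.0):
    """Minimizer of (theta - w)^2 / (2 * sigma2) + lam * |theta|."""
    shrink = lam * sigma2  # constant shift applied to every nonzero estimate
    return math.copysign(max(abs(w) - shrink, 0.0), w)

# The gap w - theta_hat stays at lam * sigma2 once |w| exceeds the threshold.
for w in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(w, lasso_1d(w, lam=1.0))
```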
Piecewise continuous nonconvex penalty (PCNP)
• $p < 1$ can reduce bias, but...
  • Nonconvex → possible local minima
  • Noncontinuity → algorithmic instability
• PCNP: nonconvex, but the estimator is continuous
• Two representatives of PCNP
  • Smoothly Clipped Absolute Deviation (SCAD) penalty
  • Minimax Concave Penalty (MCP)
We hereafter focus only on SCAD.
SCAD estimator
• SCAD penalty, $\eta = \{a, \lambda\}$
$$J(\theta;\eta) = \begin{cases} \lambda|\theta| & (|\theta| \le \lambda) \\ -\dfrac{\theta^2 - 2a\lambda|\theta| + \lambda^2}{2(a-1)} & (\lambda < |\theta| \le a\lambda) \\ \dfrac{(a+1)\lambda^2}{2} & (|\theta| > a\lambda) \end{cases}$$
[Figure: SCAD penalty $J(x)$ versus $x$ (from -6 to 6), with curves for $a=5$, $a=3$, and $a=2$ at $\lambda=1$.]
• SCAD estimator, e.g. the 1D estimator
$$\hat{\theta} = \arg\min_{\theta} \left\{ \frac{1}{2\sigma^2}(\theta - w)^2 + J(\theta;\eta) \right\}$$
[Figure: SCAD estimator $\hat{\theta}$ versus $w$ (both axes from -4 to 4), annotated "Continuous" and "No bias".]
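The SCAD penalty and the resulting 1D estimator can be written out piece by piece. The sketch below is an illustration, not the authors' code: the closed form in `scad_1d` follows from setting the derivative to zero on each piece and assumes $a > 1 + \sigma^2$ (so the middle piece is convex and the estimator is continuous); for $\sigma^2 = 1$ this is the usual $a > 2$ thresholding rule.

```python
import math

def scad_penalty(theta, lam, a):
    """SCAD penalty J(theta; {a, lam}) as defined piecewise above."""
    t = abs(theta)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return -(t * t - 2.0 * a * lam * t + lam * lam) / (2.0 * (a - 1.0))
    return 0.5 * (a + 1.0) * lam * lam

def scad_1d(w, lam, a, sigma2=1.0):
    """Minimizer of (theta - w)^2 / (2 * sigma2) + J(theta), for a > 1 + sigma2."""
    if a <= 1.0 + sigma2:
        raise ValueError("this closed form assumes a > 1 + sigma2")
    u = abs(w)
    if u <= lam * (1.0 + sigma2):        # LASSO-like soft-thresholding piece
        t = max(u - lam * sigma2, 0.0)
    elif u <= a * lam:                   # interpolating piece
        t = ((a - 1.0) * u - a * lam * sigma2) / ((a - 1.0) - sigma2)
    else:                                # large |w|: estimate equals w (no bias)
        t = u
    return math.copysign(t, w)

for w in (0.5, 1.5, 2.5, 5.0):
    print(w, scad_1d(w, lam=1.0, a=3.0))
```

For large $|w|$ the estimate equals $w$, in contrast with the constant shrinkage of LASSO.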
Our Contributions
• Clarifying the emergence region of local minima
  • Phase transition (with replica symmetry breaking)
• Quantitative analysis of reconstruction performance
  • SCAD outperforms LASSO in weak noise region
• Developing an approximate CV formula
  • Fast CV becomes possible
  • A method to avoid unstable parameter region
Contents
1. Analytical performance analysis in simulated dataset
2. Approximate CV formula
3. Numerical experiments
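Problem Setting
• Generative process (all i.i.d.; crucial assumptions for analysis)
$$x_{0i} \sim (1-\rho_0)\,\delta(x_{0i}) + \rho_0\,\mathcal{N}(0,\sigma_x^2), \qquad A_{\mu i} \sim \mathcal{N}(0, N^{-1}), \qquad \Delta_i \sim \mathcal{N}(0,\sigma_\Delta^2)$$
$$y = A x_0 + \Delta$$
• Quantities of interest
  • Output MSE: $\epsilon_y = \frac{1}{2M}\|y - \hat{y}\|_2^2$, with $\hat{y} = A\hat{x}$
  • Input MSE: $\epsilon_x = \frac{1}{2N}\|x_0 - \hat{x}\|_2^2$
  • TP, FP: true and false positive rates of the support $S = \{i \mid x_{0i} \neq 0\}$
Investigate typical values of these in the high-dimensional limit $N \to \infty$ with $\alpha = M/N = O(1)$.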
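The generative process and the four quantities of interest can be simulated in a few lines. This is a minimal sketch assuming NumPy; the function names are hypothetical, and the TP/FP definitions (fraction of true-support, respectively off-support, entries that are estimated nonzero) are the standard reading of "true and false positive rates of the support".

```python
import numpy as np

def generate_data(N, alpha, rho0, sigma_x2, sigma_delta2, rng):
    """Draw (A, y, x0) from the Bernoulli-Gaussian model with M = alpha * N."""
    M = int(round(alpha * N))
    on_support = rng.random(N) < rho0
    x0 = np.where(on_support, rng.normal(0.0, np.sqrt(sigma_x2), N), 0.0)
    A = rng.normal(0.0, np.sqrt(1.0 / N), (M, N))       # A_{mu i} ~ N(0, 1/N)
    y = A @ x0 + rng.normal(0.0, np.sqrt(sigma_delta2), M)
    return A, y, x0

def evaluate(x_hat, x0, A, y):
    """Return (eps_y, eps_x, TP, FP) for an estimate x_hat."""
    M, N = A.shape
    eps_y = np.sum((y - A @ x_hat) ** 2) / (2.0 * M)
    eps_x = np.sum((x0 - x_hat) ** 2) / (2.0 * N)
    truth = x0 != 0
    found = x_hat != 0
    tp = found[truth].mean() if truth.any() else float("nan")
    fp = found[~truth].mean() if (~truth).any() else float("nan")
    return eps_y, eps_x, tp, fp

rng = np.random.default_rng(0)
A, y, x0 = generate_data(N=200, alpha=0.5, rho0=0.2, sigma_x2=1.0,
                         sigma_delta2=0.1, rng=rng)
print(evaluate(np.zeros_like(x0), x0, A, y))
```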
Stat. Mech. Formulation
• Hamiltonian, Boltzmann distribution, partition function
$$H(x) = \frac{1}{2}\|y - Ax\|_2^2 + J(x;\eta)$$
$$P(x) = \frac{1}{Z} e^{-\beta H(x)} \to \delta\big(x - \hat{x}(\eta|y,A)\big) \quad (\beta \to \infty)$$
$$Z = \int dx\, e^{-\beta H(x)}$$
  • In the limit $\beta \to \infty$, $P(x)$ concentrates on the solution of the original problem.
• Computing the "free energy" or moment-generating function $f(\beta)$, averaged w.r.t. $y$ and $A$
$$-\beta f(\beta) = \frac{1}{N}\left[\log Z\right]_{y,A}$$
• Any quantity of interest can be computed from $f(\beta)$
However, the average w.r.t. $y$ and $A$ cannot be carried out directly...
← Replica Method (with replica symmetric assumption)
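Equations to be solved
• Replica symmetric (RS) free energy
$$f(\beta \to \infty) = \mathop{\mathrm{Extr}}_{\Omega,\hat{\Omega}} \left\{ \frac{Q - 2m + \rho_0\sigma_x^2 + \alpha\sigma_\Delta^2}{2(1 + \chi/\alpha)} + m\hat{m} - \frac{Q\hat{Q} - \chi\hat{\chi}}{2} + \frac{\overline{\xi(\sigma;\hat{Q})}}{2} \right\}$$
$$\Omega = \{Q, \chi, m\}, \qquad \hat{\Omega} = \{\hat{Q}, \hat{\chi}, \hat{m}\}$$
$$\xi(\sigma;\hat{Q}) \equiv 2\int Dz\, L(\sigma z;\hat{Q}), \qquad L(h;\hat{Q}) \equiv \min_{x}\left\{ \frac{\hat{Q}}{2}x^2 - hx + J(x;\eta) \right\}$$
$$\int Dz\,(\cdots) \equiv \int_{-\infty}^{\infty} \frac{dz}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}z^2\right)(\cdots)$$
$$\overline{(\cdots)} = \sum_{\sigma} (\cdots)\,P(\sigma), \qquad P(\sigma) = (1-\rho)\,\delta(\sigma - \sigma_-) + \rho\,\delta(\sigma - \sigma_+)$$
$$\sigma_- = \sqrt{\hat{\chi}}, \qquad \sigma_+ = \sqrt{\hat{\chi} + \hat{m}^2\sigma_x^2}$$
• Quantities of interest
$$x^*(h;\hat{Q}^{-1}) = \arg\min_{x}\left\{ \frac{\hat{Q}}{2}x^2 - hx + J(x;\eta) \right\}$$
$$\epsilon_x = \frac{1}{2}\left(\rho_0\sigma_x^2 - 2m + Q\right), \qquad \epsilon_y = \frac{1}{2}\hat{\chi}$$
$$\mathrm{TP} = \int Dz\, \left|x^*(\sigma_+ z;\hat{Q}^{-1})\right|_0, \qquad \mathrm{FP} = \int Dz\, \left|x^*(\sigma_- z;\hat{Q}^{-1})\right|_0$$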
Stability and Multiple solutions
• RS solution is sometimes unstable
  • The instability can be signaled by a formula (not shown here)
  • Spin-glass (SG) transition or de Almeida-Thouless (AT) instability
[Figure: schematic landscapes of $\epsilon_y$ versus configuration for the RS phase and the RSB (replica symmetry breaking) phase, separated by the SG transition.]
In the RSB phase, exponentially many (w.r.t. N) local minima exist.
Phase diagrams
[Figure: two phase diagrams in the $(\lambda, a)$ plane ($\lambda$ on a logarithmic axis); panel titles $\alpha=0.5$, $\rho_0=0.2$, $\sigma_\Delta^2=0.01$ and $\alpha=0.5$, $\rho_0=0.2$, $\sigma_\Delta^2=1$.]
Green line: minimum of input MSE for each $\lambda$
Green dot: minimum of input MSE along the green line
Blue line: AT line (above the line, our analysis is stable)
• The $(\lambda, a)$ that gives the minimum input MSE is in the stable region.
• For large noise, LASSO is sufficient.
ROC curve
• Receiver operating characteristic (ROC) curve
  • Plot of TP against FP
• A criterion: the "optimal point" on the ROC curve is the minimum of $R(\eta)$
$$R(\eta) = (\mathrm{TP}(\eta) - 1)^2 + (\mathrm{FP}(\eta) - 0)^2, \qquad \eta = \{\lambda, a\}$$
At the optimal point, the support recovery error is expected to be minimized.
• Here, we identify the optimal value of $\lambda$ at a fixed value of $a$, and compare it with the value that gives the minimum of input MSE.
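The criterion $R(\eta)$ is the squared distance from the ideal ROC corner (FP, TP) = (0, 1). A minimal sketch of selecting $\lambda$ at fixed $a$ follows; the helper names and the (TP, FP) values are made up for illustration and are not results from the talk.

```python
def roc_distance(tp, fp):
    """R = (TP - 1)^2 + (FP - 0)^2, squared distance to the ideal corner."""
    return (tp - 1.0) ** 2 + (fp - 0.0) ** 2

def best_lambda(lambdas, tps, fps):
    """Return (lambda, R) at the minimum of R along a lambda path."""
    scores = [roc_distance(tp, fp) for tp, fp in zip(tps, fps)]
    k = min(range(len(scores)), key=scores.__getitem__)
    return lambdas[k], scores[k]

# Made-up ROC points along a lambda path at a fixed value of a.
lambdas = [0.1, 0.3, 1.0, 3.0]
tps = [0.95, 0.90, 0.60, 0.10]
fps = [0.60, 0.20, 0.05, 0.00]
print(best_lambda(lambdas, tps, fps))
```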
ROC curve
[Figure: ROC curves (TP versus FP) for $\alpha=0.5$, $\rho_0=0.2$, with $\sigma_\Delta^2=0.1$ and $\sigma_\Delta^2=0.0001$; markers show the minimum of $R$ ("R min") and the minimum of input MSE ("$\epsilon_x$ min").]
• Minimum locations of input MSE and $R$ are close. This property is absent in LASSO [Obuchi and Kabashima, JSTAT (2016)].
• Input MSE is unknown in general settings, but it is related to the cross-validation (CV) error, so we may minimize the CV error to determine the optimal support.
Verification of Theoretical Result
[Figure: input MSE and ROC curve, theory versus numerical simulation (N=1000, 10 samples), with the RS phase indicated.]
Analytically derived lines match the numerical simulation in the RS phase.
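LOOCV and Linear Approx.
• Leave-one-out CV (LOOCV) ← Large cost!
Define:
$$\mathcal{H}(x|D) \equiv \frac{1}{2}\sum_{\mu}\Big(y_\mu - \sum_{i} A_{\mu i}x_i\Big)^2 + J(x;\eta)$$
$$\hat{x}^{\backslash\mu} = \arg\min_{x} \mathcal{H}(x|D^{\backslash\mu})$$
$$\epsilon_{\mathrm{LOO}}(\eta) = \frac{1}{2M}\sum_{\mu}\Big(y_\mu - \sum_{i} A_{\mu i}\,\hat{x}^{\backslash\mu}_i(\eta)\Big)^2$$
• Approximation: expand $\mathcal{H}$ w.r.t. $d = \hat{x} - \hat{x}^{\backslash\mu}$
$$\mathcal{H}(\hat{x}|D) - \mathcal{H}(\hat{x}^{\backslash\mu}|D^{\backslash\mu}) \sim \sum_{\mu} d^{\top}h_\mu(\hat{x})$$
$$\hat{x}^{\backslash\mu} \sim \hat{x} - \chi^{\backslash\mu}h_\mu(\hat{x}), \qquad \chi^{\backslash\mu} = \frac{\partial \hat{x}^{\backslash\mu}}{\partial h}$$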
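The "large cost" of literal LOOCV is visible in code: the solver is rerun M times, once per held-out row. The sketch below assumes NumPy and takes the solver as a callable; `ridge_solver` is only a closed-form stand-in (a quadratic penalty, not SCAD) used to keep the example self-contained.

```python
import numpy as np

def literal_loo_error(A, y, solve):
    """(1 / 2M) * sum over mu of (y_mu - a_mu . x_loo)^2, refitting M times."""
    M = A.shape[0]
    total = 0.0
    for mu in range(M):
        keep = np.arange(M) != mu             # drop the mu-th observation
        x_loo = solve(A[keep], y[keep])       # one full refit per held-out row
        total += (y[mu] - A[mu] @ x_loo) ** 2
    return total / (2.0 * M)

def ridge_solver(lam):
    """Stand-in solver: minimizer of 0.5*||y - A x||^2 + 0.5*lam*||x||^2."""
    def solve(A, y):
        N = A.shape[1]
        return np.linalg.solve(A.T @ A + lam * np.eye(N), A.T @ y)
    return solve

rng = np.random.default_rng(0)
A_demo = rng.normal(size=(12, 5))
y_demo = rng.normal(size=12)
print(literal_loo_error(A_demo, y_demo, ridge_solver(0.5)))
```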
Approximate CV formula
• Approximate CV formula (computable only from $\hat{x}$):
$$\epsilon_{\mathrm{LOO}} \approx \frac{1}{2M}\sum_{\mu=1}^{M} \Theta_\mu \left(y_\mu - a_\mu^{\top}\hat{x}\right)^2$$
$$\Theta_\mu = \left(1 - (a_\mu)_{S_A}^{\top}\Big( (A_{*S_A})^{\top}A_{*S_A} + \big(\partial^2 J(\hat{x}_{S_A};\eta)\big)_{S_A S_A} \Big)^{-1}(a_\mu)_{S_A}\right)^{-2}$$
  • $S_A$: support
  • The inverted matrix is the cost function's Hessian on the support
• Delicate points
  • Invariance of the support between the full and LOO solutions is assumed (approximately correct; exact in $N \to \infty$)
  • Regularity of the cost function Hessian is assumed
    • Actually violated in the RSB phase
  • Computational cost is $O(|S_A|^3)$
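The formula above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the authors' MATLAB package: `approx_loo_error` takes the per-coordinate second derivative of the penalty as an array `d2J`, and `scad_second_derivative` uses the SCAD curvature away from the kinks ($-1/(a-1)$ on $\lambda < |x| \le a\lambda$, zero elsewhere).

```python
import numpy as np

def scad_second_derivative(x, lam, a):
    """J''(x) for SCAD away from its kinks."""
    t = np.abs(x)
    return np.where((t > lam) & (t <= a * lam), -1.0 / (a - 1.0), 0.0)

def approx_loo_error(A, y, x_hat, d2J):
    """Approximate LOO error from the full-data solution x_hat only.

    d2J[i] is the second derivative of the penalty at x_hat[i].
    """
    M = A.shape[0]
    S = np.flatnonzero(x_hat)                  # support S_A of the estimate
    residual = y - A @ x_hat
    if S.size == 0:                            # empty support: no correction
        return np.sum(residual ** 2) / (2.0 * M)
    A_S = A[:, S]
    hessian = A_S.T @ A_S + np.diag(d2J[S])    # cost Hessian on the support
    # leverage[mu] = (a_mu)_S^T hessian^{-1} (a_mu)_S, one O(|S|^3) solve
    leverage = np.einsum("ij,ji->i", A_S, np.linalg.solve(hessian, A_S.T))
    theta = (1.0 - leverage) ** -2
    return np.sum(theta * residual ** 2) / (2.0 * M)
```

For a quadratic penalty with full support the correction factor reduces to the classical ridge leave-one-out identity, which gives a convenient sanity check.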
Experimental Setting
• Generative process: identical to the theoretical setting (all i.i.d.)
$$x_{0i} \sim (1-\rho_0)\,\delta(x_{0i}) + \rho_0\,\mathcal{N}(0,\sigma_x^2), \qquad A_{\mu i} \sim \mathcal{N}(0, N^{-1})$$
$$y = A x_0 + \Delta, \qquad \Delta_i \sim \mathcal{N}(0,\sigma_\Delta^2)$$
• Optimization algorithm: Cyclic Coordinate Descent (CCD)
  • Coordinate-wise update optimizing the cost function
• A technique: $\lambda$ annealing
  • Pathwise optimization with gradually changing $\lambda$
  • Faster convergence
  • Robust solution even in RSB region
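CCD with $\lambda$ annealing can be sketched as below. This is an illustrative implementation, not the authors' package: `scad_prox` solves each 1D subproblem exactly by comparing the best point of every piece of the penalty, and `lambda_annealing` assumes the path runs from large to small $\lambda$ with warm starts (the slides only say that $\lambda$ changes gradually).

```python
import numpy as np

def scad_penalty(t, lam, a):
    t = abs(t)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return -(t * t - 2 * a * lam * t + lam * lam) / (2 * (a - 1))
    return 0.5 * (a + 1) * lam * lam

def scad_prox(h, q, lam, a):
    """Global minimizer over x of 0.5*q*x^2 - h*x + J(x), with q > 0."""
    u = abs(h)
    # Best point on [0, lam] and on [a*lam, inf): both pieces are convex.
    cands = [min(max((u - lam) / q, 0.0), lam), max(u / q, a * lam)]
    curv = q * (a - 1) - 1
    if curv > 0:  # middle piece convex: add its clipped stationary point
        cands.append(min(max(((a - 1) * u - a * lam) / curv, lam), a * lam))
    best = min(cands, key=lambda t: 0.5 * q * t * t - u * t + scad_penalty(t, lam, a))
    return float(np.sign(h)) * best

def ccd_scad(A, y, lam, a, x_init=None, n_sweeps=100, tol=1e-8):
    """Cyclic coordinate descent for 0.5*||y - A x||^2 + sum_i J(x_i)."""
    N = A.shape[1]
    x = np.zeros(N) if x_init is None else np.array(x_init, dtype=float)
    col_sq = np.sum(A * A, axis=0)           # q_i = ||a_i||^2
    r = y - A @ x                            # running residual
    for _ in range(n_sweeps):
        max_change = 0.0
        for i in range(N):
            h = A[:, i] @ r + col_sq[i] * x[i]
            new = scad_prox(h, col_sq[i], lam, a)
            if new != x[i]:
                r -= A[:, i] * (new - x[i])
                max_change = max(max_change, abs(new - x[i]))
                x[i] = new
        if max_change < tol:
            break
    return x

def lambda_annealing(A, y, lambdas, a):
    """Warm-started path over decreasing lambda (direction is an assumption)."""
    x = np.zeros(A.shape[1])
    path = []
    for lam in sorted(lambdas, reverse=True):
        x = ccd_scad(A, y, lam, a, x_init=x)
        path.append((lam, x.copy()))
    return path
```

Because each coordinate update is an exact 1D minimization, the cost never increases during a sweep, although the limit may still be a local minimum in the nonconvex regime.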
Approx. CV: Sample dependence
SCAD parameter a = 3
α = 0.5, ρ0 = 0.2, σ²_Δ = 0.1, N = 100
[Figure: panels for Sample No.1 and Sample No.4. Error bars are for components of data.]
• CV error fluctuates depending on the sample.
• The approximate CV error is valid in the RS phase for both samples.
Approx. CV: Sample dependence
SCAD parameter a = 4
α = 0.5, ρ0 = 0.2, σ²_Δ = 0.1, N = 100
[Figure: panels for Sample No.1 and Sample No.4. Error bars are for components of data.]
Sample dependence becomes moderate as the SCAD parameter a increases.
"Phase diagram" for given data
• A "phase" is defined for the infinite set of samples distributed according to a probability distribution.
• In practical problems,
  • An appropriate parameter region for the given data is required.
  • In particular, for finite-size systems, the sample dependence is large.
• We propose a method to get a "phase diagram" for given data.
  • In other words, we identify the parameter region that we should rule out as candidates.
Approx. CV: Instability detection for "phase diagram"
We use our approximate CV formula to detect the "RSB" region.
• Detect "irregular" datapoints along the $\lambda$ path.
• Find the maximum $\lambda$ value of irregular datapoints.
• $\lambda$ smaller than the maximum value is inappropriate in the sense that instability appears.
[Figure: SCAD parameter a = 3, α = 0.5, ρ0 = 0.2, σ²_Δ = 0.1, N = 100, Sample No.4.]
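The three steps above amount to a threshold on $\lambda$. The sketch below covers only that bookkeeping: how a datapoint is judged "irregular" is not specified on the slide, so the boolean flags are taken as a given input, and both function names and the example values are hypothetical.

```python
def unstable_lambda_threshold(lambdas, irregular):
    """Largest lambda flagged irregular, or None if nothing is flagged."""
    flagged = [lam for lam, bad in zip(lambdas, irregular) if bad]
    return max(flagged) if flagged else None

def stable_candidates(lambdas, irregular):
    """Keep only lambda values strictly above the largest irregular one."""
    threshold = unstable_lambda_threshold(lambdas, irregular)
    if threshold is None:
        return list(lambdas)
    return [lam for lam in lambdas if lam > threshold]

# Made-up flags along a lambda path (True marks an irregular datapoint).
lambdas = [0.1, 0.2, 0.4, 0.8, 1.6]
irregular = [True, False, True, False, False]
print(unstable_lambda_threshold(lambdas, irregular), stable_candidates(lambdas, irregular))
```

Note that $\lambda = 0.2$ is excluded even though it is not flagged itself, since it lies below the largest irregular value.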
Corresponding "phase diagram" for sample No.4
[Figure: phase diagram in the $(\lambda, a)$ plane ($\lambda$ on a logarithmic axis) for α=0.5, ρ0=0.2, σ²_Δ=0.1, N=100, ID=4, with curves $a_{\mathrm{AT}}$ and $a_{\mathrm{CVE}}$.]
Blue line: AT line (RS-RSB transition)
Black region: "RSB region" for sample No.4
Green line: minimum of CV error
What happens in the "RSB" region (black)?
Starting from 10 different initial conditions without annealing, the value of the literal CV fluctuates in the "RSB" region.
Application to SuperNovae data analysis
CV error for a=4
[Figure: CV errors on a logarithmic horizontal axis, panel title "TIa supernovae dataset, a=4"; legend: Approximate, Literal, Instability, CVE min. Annotations: minimum of literal CV, minimum of CV in stable region, "phase" boundary.]
Data: http://heracles.astro.berkeley.edu/sndb/
Application to SuperNovae data analysis
[Figure: two panels versus $a$ (from 2.1 to 80), each showing the "CVE min" curve: number of parameters in the model $K$, and CV error.]
Sparsest within one-sigma rule: K=3
← Identical solution to a Monte-Carlo method solving the L0 problem (TO et al., 2016, 2018)
Our method is consistent with the L0 result, even though SCAD is more computationally reasonable.
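The one-sigma rule mentioned above is commonly read as: among all candidates whose CV error is within one standard error of the minimum, pick the sparsest. A minimal sketch under that reading follows; the function name and all numbers are made up for illustration and are not the supernova results.

```python
def one_sigma_rule(cv_errors, std_errors, n_params):
    """Index of the sparsest candidate within one standard error of the best."""
    k_min = min(range(len(cv_errors)), key=cv_errors.__getitem__)
    bound = cv_errors[k_min] + std_errors[k_min]
    eligible = [k for k in range(len(cv_errors)) if cv_errors[k] <= bound]
    # Prefer fewer parameters; break ties by lower CV error.
    return min(eligible, key=lambda k: (n_params[k], cv_errors[k]))

# Made-up candidates: CV error, its standard error, and model size K.
cv_errors = [0.030, 0.020, 0.018, 0.019]
std_errors = [0.004, 0.003, 0.003, 0.003]
n_params = [1, 2, 5, 9]
k = one_sigma_rule(cv_errors, std_errors, n_params)
print(k, n_params[k])
```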
Summary
• Theoretical analysis of SCAD estimator in linear regression
  • Emergence of local minima = phase transition with RSB
  • Analytical evidence that SCAD outperforms LASSO
• Invention of an approximate CV formula
  • The scaling is $O(N^3)$ but still practical in a wide range of N
  • Approximate CV instability <-> local minima or RSB
    • Instability detection in CV formula also signals RSB
• Numerical results fully support the theoretical result
• A MATLAB package of approx. CV formula + CCD algorithm:
https://github.com/T-Obuchi/SLRpackage_AcceleratedCV_matlab
• Future work
  • Characterization of the λ annealed solution path
  • Applications, different models (non-L2 cost function)
TO, AS, arXiv:1902.10375