Statistical Foundations of Data Science
Jianqing Fan
Princeton University
http://www.princeton.edu/∼jqfan
Jianqing Fan (Princeton University) ORF 525, S20: Statistical Foundations of Data Science 1 / 63
2. Penalized Least-Squares
2.1 Classical model selection
2.2 Penalized least-squares
2.3 Lasso and L1-regularization
2.4 Folded concave regularization
2.5 Concentration inequalities
2.6 Estimation of residual variance
2.7 Variable expansion & g-penalty
2.8 Prediction of HPI
2.1 Classical Model Selection
Best subset selection
Data: (X_1, Y_1), …, (X_n, Y_n) i.i.d. from Y = X^T β + ε, E(ε|X) = 0.

Matrix notation: Y = Xβ + ε.

Arts of modeling: • categorical (SES, age group) • polynomials • splines • additive models • interaction models • bivariate tensors • multivariate tensors • kernel tricks • time series.

• Model size p = dim(X) can easily get very large.

Objective: select the active components S and estimate β_S.

Best subset S_m: minimizes RSS among the p-choose-m possible subsets of size m.
Relation with L0-penalty
Traditional: L0-penalty
n^{-1}‖y − Xβ‖² + λ ∑_{j=1}^p ‖β_j‖₀.

Computation: best subset
• Given ‖β‖₀ = m, the solution is the best subset S_m.
• Find m to minimize n^{-1}RSS_m + λm. NP-hard: all subsets.

Greedy algorithms: stepwise addition/deletion, matching pursuit.

Theory: the method works well even for high-dimensional problems (Birgé, Massart, Barron, 99); see also Chapter 4.
Estimating prediction error
Model: Y_i = µ(X_i) + ε_i. Let µ̂ be an estimator of µ.

From ‖µ̂ − µ‖² = ‖Y − µ̂‖² − ‖Y − µ‖² + 2(µ̂ − µ)^T(Y − µ), we have

Stein identity: E‖µ̂ − µ‖² = E{‖Y − µ̂‖² − nσ²} + 2 ∑_{i=1}^n cov(µ̂_i, Y_i).

PE: E‖Y_new − µ̂‖² = nσ² + E‖µ̂ − µ‖² = E{‖Y − µ̂‖²} + 2 df_µ̂ σ².

Unbiased estimate of PE: with df_µ̂ = σ^{-2} ∑_{i=1}^n cov(µ̂_i, Y_i) (model fixed),

C_p(µ̂) = ‖Y − µ̂‖² + 2σ² df_µ̂ ≡ RSS + 2σ² df_µ̂.

• For a linear predictor µ̂ = SY, df_µ̂ = tr(S), since cov(µ̂_i, Y_i) = s_ii σ².
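The identity df_µ̂ = tr(S) for a linear predictor can be checked numerically. A minimal Python/NumPy sketch (the course code is in R; the Gaussian design and dimensions below are invented for illustration), using OLS, whose hat matrix has trace equal to the number of parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))

# OLS is a linear smoother: mu_hat = S @ Y with S the hat matrix.
S = X @ np.linalg.solve(X.T @ X, X.T)
df = np.trace(S)        # degrees of freedom = tr(S)

# For a full-rank OLS fit, tr(S) equals the number of parameters p.
print(round(df, 6))     # -> 3.0
```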
Efforts in model selection: choosing λ
Best subset: find m to minimize n^{-1}RSS_m + λm.

• C_p (Mallows, 73) and AIC (Akaike, 74): λ = 2σ², model M.
• Approximately the same: AIC(µ̂) = log(‖Y − µ̂‖²/n) + 2 df_µ̂(λ)/n (without σ²).
• BIC (Schwarz, 1978): λ = log(n)σ².
• RIC (Foster & George, 1994): λ = 2 log(p)σ².
• Adjusted R²: R²_adj,m = 1 − [(n − 1)/(n − m)] · RSS_m/SD_y.
• Generalized cross-validation (Craven and Wahba, 1979): GCV(m) = RSS_m / [n(1 − m/n)²], equivalent to PLS with λ = 2σ².
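As a concrete illustration of these criteria, one can score every subset by RSS_m + λm with the C_p/AIC and BIC choices of λ. A small Python sketch on simulated data (the design, coefficients, and seed are all invented for illustration; with only two strong signals, both criteria should retain them):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, p, sigma = 100, 6, 1.0
X = rng.normal(size=(n, p))
beta = np.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0])   # only 2 active variables
Y = X @ beta + sigma * rng.normal(size=n)

def rss(cols):
    Xs = X[:, list(cols)]
    r = Y - Xs @ np.linalg.lstsq(Xs, Y, rcond=None)[0]
    return r @ r

def best_subset(lam):
    """Return the subset minimizing RSS_m + lam * m over all subsets."""
    scored = [(rss(c) + lam * len(c), c)
              for m in range(1, p + 1) for c in combinations(range(p), m)]
    return min(scored)[1]

cp_model = best_subset(2 * sigma**2)            # Mallows' C_p / AIC
bic_model = best_subset(np.log(n) * sigma**2)   # BIC
print(sorted(cp_model), sorted(bic_model))      # both should contain 0 and 1
```

The brute-force loop over all 2^p − 1 subsets is exactly the NP-hard enumeration the earlier slide warns about; it is feasible here only because p = 6.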
Cross-validation
• Model-free, nonparametric approach to PE (Allen, 74; Stone, 74).

Multiple-fold CV: randomly partition the data into equal-size subsamples {S_j}_{j=1}^k.

Testing and training sets: data in S_k and {S_j}_{j≠k}.

Estimate of PE: CV_k(m) = n^{-1} ∑_j { ∑_{i∈S_j} (Y_i − β̂_{m,−S_j}^T X_{i,M_m})² }

• β̂_{m,−S_j} = fitted coefficients of model M_m without using the data in S_j.

Choice of k: k = n (best, but expensive; leave-one-out), 10 or 5 (5-fold).

Why CV works: E(Y_i − X_i^T β̂^λ_{−i})² = Eε_i² + E‖X_i^T(β̂^λ_{−i} − β*)‖², since β̂^λ_{−i} is independent of (X_i, Y_i).
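The k-fold scheme above can be sketched in a few lines of Python (invented simulation setup; the point is that CV correctly ranks a well-specified model above an underfitted one):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 5
X = rng.normal(size=(n, p))
Y = X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(size=n)

def cv_error(X, Y, cols, k=5):
    """k-fold CV estimate of prediction error for the model using `cols`."""
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)
    sse = 0.0
    for test in folds:
        train = np.setdiff1d(idx, test)
        Xs, Ys = X[train][:, cols], Y[train]
        bhat = np.linalg.lstsq(Xs, Ys, rcond=None)[0]   # fit w/o fold
        resid = Y[test] - X[test][:, cols] @ bhat
        sse += resid @ resid
    return sse / n

full = cv_error(X, Y, [0, 1, 2, 3, 4])
under = cv_error(X, Y, [0])    # underfit: misses two strong signals
print(full < under)            # -> True
```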
2.2 Penalized Least-Squares
Convex and folded concave relaxations
Penalized least-squares: Q(β) = (2n)^{-1}‖y − Xβ‖² + ∑_{j=1}^p pλ(|β_j|)

[Figure: (a) the L₀, L₁ and SCAD penalties; (b) the corresponding solutions to PLS: hard-threshold, soft-threshold, SCAD.]
Convex penalties
• L₂ penalty pλ(|θ|) = λ|θ|² ⇒ ridge regression.
• L_q (0 < q < 2) penalty pλ(|θ|) = λ|θ|^q ⇒ bridge regression (Frank and Friedman, 93).
• L₁ penalty pλ(|θ|) = λ|θ| ⇒ LASSO (Tibshirani, 1996).
• Elastic net pλ(θ) = λ₁|θ| + λ₂θ² (Zou & Hastie, 05).
• Bayesian variable selection: posterior mode (Mitchell, 88; George & McCulloch, 93; Berger and Pericchi, 96, JASA).
Folded Concave Penalty
Smoothly Clipped Absolute Deviation (SCAD):

p′λ(θ) = λ{ I(θ ≤ λ) + [(aλ − θ)₊ / ((a − 1)λ)] I(θ > λ) },  a > 2

(Fan, 1997). Set a = 3.7 from a Bayesian viewpoint (Fan & Li, 01).

Minimax Concave Penalty (MCP): p′λ(θ) = (aλ − θ)₊/a (Zhang, 10).

a = 1 ⇒ hard-thresholding penalty pλ(θ) = λ² − (λ − |θ|)²₊.
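The SCAD derivative above is easy to code directly; a minimal Python sketch (function name and test values are made up for illustration) showing the flat lasso-like region, the tapering region, and the unpenalized region:

```python
import numpy as np

def scad_deriv(theta, lam, a=3.7):
    """Derivative p'_lambda(theta) of the SCAD penalty, for theta >= 0."""
    theta = np.asarray(theta, dtype=float)
    return lam * (
        (theta <= lam)                                             # lasso-like zone
        + np.maximum(a * lam - theta, 0.0) / ((a - 1) * lam)
          * (theta > lam)                                          # tapering zone
    )

lam = 1.0
print(scad_deriv(0.5, lam))   # flat region: derivative = lam = 1.0
print(scad_deriv(2.0, lam))   # tapering: (3.7 - 2)/2.7 ~ 0.6296
print(scad_deriv(5.0, lam))   # beyond a*lam: 0.0 (no penalization)
```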
Folded Concave Penalties and Their Derivatives

[Figure: pλ(|θ|) (left) and p′λ(|θ|) (right) for the SCAD, MCP, hard and soft penalties.]

• Associated risks are computed in the next figure.
Orthonormal design
LSE: β̂ = (X^T X)^{-1} X^T y = X^T y, when X^T X = I_p.

Q(β) = ½ ∑_{i=1}^n (y_i − ŷ_i)² + ½ ∑_{j=1}^p (β̂_j − β_j)² + ‖pλ(|β|)‖₁

Componentwise minimization (Antoniadis and Fan, 01; Fan and Li, 01): minimize

½(z − θ)² + pλ(|θ|)

1. pλ(|θ|) = (λ/2)θ² ⇒ shrinkage: θ̂_λ = (1 + λ)^{-1} z.
2. pλ(|θ|) = λ|θ| ⇒ soft-threshold: θ̂_S = sgn(z)(|z| − λ)₊.
3. pλ(θ) = λ² − (λ − |θ|)²₊ ⇒ hard-threshold: θ̂_H = z I(|z| > λ).
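The three componentwise rules can be written down directly; a short Python sketch (function names are made up) contrasting how the estimators treat a large signal z = 3 and a small one z = 0.5 at λ = 1:

```python
import numpy as np

def ridge_shrink(z, lam):
    """Minimizer of (z-theta)^2/2 + lam*theta^2/2 (L2 penalty)."""
    return z / (1 + lam)

def soft_threshold(z, lam):
    """Minimizer of (z-theta)^2/2 + lam*|theta| (L1 penalty)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def hard_threshold(z, lam):
    """Minimizer of (z-theta)^2/2 + lam^2 - (lam-|theta|)_+^2 (hard penalty)."""
    return z * (np.abs(z) > lam)

print(soft_threshold(3.0, 1.0))   # -> 2.0 (large signal shifted: bias)
print(hard_threshold(3.0, 1.0))   # -> 3.0 (large signal kept: unbiased)
print(soft_threshold(0.5, 1.0))   # -> 0.0 (small signal killed: sparsity)
```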
Solutions of SCAD and MCP
θ̂_SCAD(z) =
  • sgn(z)(|z| − λ)₊, when |z| ≤ 2λ;
  • sgn(z)[(a − 1)|z| − aλ]/(a − 2), when 2λ < |z| ≤ aλ;
  • z, when |z| > aλ.

• a = ∞ ⇒ soft-thresholding.

θ̂_MCP(z) =
  • sgn(z)(|z| − λ)₊/(1 − 1/a), when |z| < aλ;
  • z, when |z| ≥ aλ.
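Both closed-form solutions transcribe directly into code; a Python sketch (names invented) checking sparsity for small z, unbiasedness for large z, and continuity at the knot |z| = aλ:

```python
import numpy as np

def scad_threshold(z, lam, a=3.7):
    """Componentwise PLS solution for the SCAD penalty (orthonormal design)."""
    az = abs(z)
    if az <= 2 * lam:
        return np.sign(z) * max(az - lam, 0.0)                   # soft zone
    if az <= a * lam:
        return np.sign(z) * ((a - 1) * az - a * lam) / (a - 2)   # linear zone
    return z                                                     # unbiased zone

def mcp_threshold(z, lam, a=3.0):
    """Componentwise PLS solution for the MCP penalty."""
    az = abs(z)
    if az < a * lam:
        return np.sign(z) * max(az - lam, 0.0) / (1 - 1 / a)
    return z

print(scad_threshold(0.8, 1.0))    # -> 0.0  (sparsity)
print(scad_threshold(10.0, 1.0))   # -> 10.0 (unbiased for large z)
```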
Desired Properties
for penalty functions (Antoniadis and Fan, 2001; Fan and Li, 2001)
Continuity: to avoid instability in model prediction.
Sparsity: to reduce model complexity (set small coeff. to 0).
Unbiasedness: to avoid unnecessary modeling bias
(unbiased when true coefficients are large).
Method      | continuity | sparsity | unbiasedness
Best subset |            | √        | √
Ridge       | √          |          |
LASSO       | √          | √        |
SCAD        | √          | √        | √
Performance Comparisons
Z ∼ N(θ, 1). • R(θ̂, θ) = E_θ(θ̂ − θ)². • Penalties calibrated to have the same risk at θ = 3.

[Figure: risk functions of SCAD, MCP, hard- and soft-thresholding, for λ_hard = 1 (left) and λ_hard = 2 (right).]
Characterization of folded-concave PLS
Concavity: κ(pλ; v) = lim_{ε→0+} max_j sup_{t₁<t₂∈(|v_j|−ε, |v_j|+ε)} −[p′λ(t₂) − p′λ(t₁)]/(t₂ − t₁).

• For the L₁ penalty, κ(pλ; v) = 0 for any v.
• For SCAD, κ(pλ; v) = 0 if no |v_j| ∈ [λ, aλ]; else κ(pλ; v) = (a − 1)^{-1}.

Theorem 2.1. Necessary and sufficient conditions

Let S = supp(β̂), β̂₁ = β̂_S, X₁ = X_S, X₂ = X_{S^c}. If β̂ is a local minimizer, then

n^{-1}X₁^T(Y − Xβ̂) − p′λ(|β̂₁|) sgn(β̂₁) = 0,
‖n^{-1}X₂^T(Y − Xβ̂)‖_∞ ≤ p′λ(0+),
λ_min(n^{-1}X₁^T X₁) ≥ κ(pλ; β̂₁).

These conditions are sufficient if the inequalities are strict.
Solution Paths
Target: Q(β) = (2n)^{-1}‖y − Xβ‖² + ∑_{j=1}^p pλ(|β_j|)

Solution path: β̂(λ) as a function of λ.

[Figure: coefficient solution paths plotted against the penalty parameter index.]
2.3 LASSO and L1 Regularization
• ≤1970s: Ridge regression; subset selection, AIC, BIC
• 1993: Bridge regression
• 1995: Nonnegative garrote
• 1996: LASSO
• 1998: Basis pursuit
• 2001: SCAD
• 2004: LARS
• 2005: Elastic net
• 2006: Adaptive lasso
• 2007: Dantzig selector
• 2008: SIS
• 2014: Debiased lasso; distributed methods
• 2015: SLOPE
• …
LASSO
Necessary condition for the lasso:

n^{-1}X₁^T(Y − X₁β̂₁) − λ sgn(β̂₁) = 0,  ‖(nλ)^{-1}X₂^T(Y − X₁β̂₁)‖_∞ ≤ 1,

so ‖n^{-1}X^T(Y − Xβ̂)‖_∞ ≤ λ, and β̂ = 0 when λ > ‖n^{-1}X^T Y‖_∞.

Solving for β̂₁ and substituting in, we have

‖(nλ)^{-1}X₂^T(I_n − P_{X₁})Y − X₂^T X₁(X₁^T X₁)^{-1} sgn(β̂₁)‖_∞ ≤ 1.
Model Selection Consistency for LASSO
True model: Y = X_{S₀}β₀ + ε.

Selection consistency: supp(β̂) = S₀, so that X_{S₀} = X₁.

Necessary condition: using the true model and (I_n − P_{X₁})X_{S₀} = 0,

‖(nλ)^{-1}X₂^T(I_n − P_{X₁})ε − X₂^T X₁(X₁^T X₁)^{-1} sgn(β̂₁)‖_∞ ≤ 1.

First term negligible: ‖X₂^T X₁(X₁^T X₁)^{-1} sgn(β̂₁)‖_∞ ≤ 1.

Irrepresentable condition: if sign consistency sgn(β̂) = sgn(β₀) holds, then

‖X₂^T X₁(X₁^T X₁)^{-1} sgn(β_{S₀})‖_∞ ≤ 1 (Zhao and Yu, 06),

which is necessary for selection consistency; it is sufficient if 1 is replaced by 1 − η.
Remarks on the Irrepresentable Condition

• (X₁^T X₁)^{-1}X₁^T X₂ contains the regression coefficients of each 'unimportant' X_j (j ∉ S₀) on X_{S₀}.
• The irrepresentable condition requires the sum of the signed regression coefficients of each X_j on X_{S₀} to be at most 1. The bigger S₀, the harder the condition.
• Example: X_j = ρ s^{-1/2} ∑_{k∈S₀} sgn(β_k) X_k + √(1 − ρ²) ε_j, where s = |S₀|. The sum of the signed regression coefficients is √s |ρ|, which can exceed 1!
• The irrepresentable condition is restrictive.
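The example above can be simulated; a Python sketch (the sample size, ρ and s below are invented) estimating the signed sum of regression coefficients, which should be close to √s|ρ| = 1.5 and thus violate the condition:

```python
import numpy as np

rng = np.random.default_rng(3)
n, s, rho = 5000, 9, 0.5
signs = rng.choice([-1.0, 1.0], size=s)

# X_S0: s "important" variables; X_j built from their signed average
XS = rng.normal(size=(n, s))
Xj = rho * s**-0.5 * XS @ signs + np.sqrt(1 - rho**2) * rng.normal(size=n)

# regression coefficients of X_j on X_S0 (population value: rho*s^{-1/2}*signs)
coef = np.linalg.solve(XS.T @ XS, XS.T @ Xj)
signed_sum = signs @ coef          # estimates sqrt(s)*|rho| when signs align

print(round(np.sqrt(s) * rho, 3))  # theoretical value -> 1.5 (exceeds 1!)
```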
Risks of Lasso (I)
Risk: R(β) = E(Y_i − X_i^T β)² = γ^T Σ* γ, where

γ = (−1, β^T)^T,  Σ* = Var((Y, X^T)^T),  S*_n = sample covariance of {(Y_i, X_i^T)^T}_{i=1}^n.

Empirical risk: R_n(β) = n^{-1} ∑_{i=1}^n (Y_i − X_i^T β)² = γ^T S*_n γ — "covariance learning".

Dual problem: min_{‖β‖₁ ≤ c} ‖Y − Xβ‖².

|R(β) − R_n(β)| = |γ^T(Σ* − S*_n)γ|
  ≤ ‖Σ* − S*_n‖_max ‖γ‖₁² = (1 + ‖β‖₁)² ‖Σ* − S*_n‖_max
  ≤ (1 + c)² ‖Σ* − S*_n‖_max.
Risks of Lasso (II)
If ‖β₀‖₁ ≤ c, then R_n(β̂) − R_n(β₀) ≤ 0 and

0 ≤ R(β̂) − R(β₀) ≤ {R(β̂) − R_n(β̂)} + {R_n(β₀) − R(β₀)} ≤ 2(1 + c)² ‖Σ* − S*_n‖_max.

Persistency: R(β̂) − R(β₀) → 0 (Greenshtein and Ritov, 04).

• The sample covariance has rate O(√((log p)/n)) for sub-Gaussian data.
• Persistency requires ‖β₀‖₁ ≤ c = o((n/log p)^{1/4}).
• Need uniform convergence.
Accuracy of Sample Covariance matrix
Result: If max_{i,j} P{√n |σ̂_ij − σ_ij| > x} < exp(−C x^{1/a}) for large x, then

‖Σ̂ − Σ‖_max = O_P((log p)^a / √n).

• The impact of dimensionality is limited. • Requires an exponential tail.

Proof:

P{√n ‖Σ̂ − Σ‖_max > b_n} ≤ ∑_{i,j} P{√n |σ̂_ij − σ_ij| > b_n} ≤ p² exp(−C b_n^{1/a}) = 1/p

by taking b_n = (3C^{-1} log p)^a. The conclusion follows.
Remarks
• It requires only marginal behavior and uses the union bound.
• The inequality is referred to as a concentration inequality.
• The sample covariance matrix is basically a sample mean (of the products X_i X_j); we need concentration inequalities for sample means.
• Other robust estimators can also be used in the regression.
Sparsity of Lasso Solution
• Let ∆ = β̂ − β₀, β* = β₀ and S = S₀. Then F(∆) ≤ 0, where

F(∆) = R_n(∆ + β₀)/2 − R_n(β₀)/2 + λ(‖∆ + β₀‖₁ − ‖β₀‖₁) ≡ (1) + λ · (2).

By convexity, if λ ≥ n^{-1}‖X^T ε‖_∞, then

(1) ≥ −|½ R′_n(β₀)^T ∆| ≥ −(2n)^{-1}‖X^T ε‖_∞ ‖∆‖₁ ≥ −λ‖∆‖₁/2,
(2) = ‖∆_S + β*_S‖₁ + ‖∆_{S^c}‖₁ − ‖β*_S‖₁ ≥ −‖∆_S‖₁ + ‖∆_{S^c}‖₁.

Combining these with F(∆) ≤ 0, we have ‖∆_{S^c}‖₁ ≤ 3‖∆_S‖₁.
Weighted and adaptive lasso and SLOPE
Lasso: creates bias for large coefficients.

Weighted lasso: (2n)^{-1}‖Y − Xβ‖² + λ ∑_{j=1}^p w_j|β_j|.
• Solved by the lasso via rescaling.

Adaptive lasso: w_j = |β̂_j|^{-γ} (e.g. γ = 0.5, 1, 2) (Zou, 06).
• One-step implementation of a folded-concave penalty pλ(·): w_j = p′λ(|β̂_j^{lasso}|)/λ.

SLOPE: argmin_β { n^{-1}‖y − Xβ‖² + ∑_{j=1}^p λ_j |β|_{(j)} },
• λ_j = Φ^{-1}(1 − jq/(2p)) σ/√n, related to the FDR level q.
Elastic Net
argmin_β { n^{-1}‖y − Xβ‖² + λ[(1 − α)‖β‖² + α‖β‖₁] }

• Ameliorates collinearity. • α = 1 ⇒ lasso; α = 0 ⇒ ridge.

[Figure: ridge, lasso and elastic-net constraint regions in the (β₁, β₂)-plane.]
Danzig Selector
Danzig selector: minβ∈Rp ‖β‖1, s.t. ‖n−1XT (Y−Xβ)‖∞ ≤ λ.
high confident set
�βDZ = solution; �require λ≥ ‖n−1XT (Y−Xβ0︸ ︷︷ ︸ε
)‖∞
Linear program: minu ∑pi=1 ui , u≥ 0, −u≤ β≤ u, −λ1≤ n−1XT (Y−Xβ)≤ λ1
(Candes & Tao, 07)
F Let ∆ = βDZ −β0. Then, ‖β0‖1 ≥ ‖βDZ‖1 = ‖β0 + ∆‖1.
F As in (2),‖β0 + ∆‖1 ≥ ‖β0‖1−‖∆S0‖1 +‖∆S c0‖1.
F They imply ‖∆S0‖1 ≥ ‖∆S c0‖1 —sparsity
Rate of Convergence
• ‖∆‖₂ ≥ ‖∆_{S₀}‖₂ ≥ ‖∆_{S₀}‖₁/√s ≥ ‖∆‖₁/(2√s).
• ‖X∆‖₂² ≤ ‖X^T X∆‖_∞ ‖∆‖₁ ≤ 2nλ‖∆‖₁.

Combining the last two, we have ‖X∆‖₂² ≤ 4nλ√s ‖∆‖₂.

Restricted eigenvalue condition (Bickel et al., 09):

min_{‖∆_{S₀}‖₁ ≥ ‖∆_{S₀^c}‖₁} n^{-1}‖X∆‖₂² / ‖∆‖₂² ≥ a.

L₂ consistency: ‖∆‖₂² ≤ 16 a^{-2} λ² s.
Housing Prediction: Zillow data: house price + house attributes (bedrooms, bathrooms, sqft.living, sqft.lot, ...) in 70 zip codes, sold in 2014 & 15 (n_train = 15,129, n_test = 6,484). Out-of-sample R²: 0.80.

library('glmnet')
X <- model.matrix(~., data = train_data[, 5:22])
Y <- train_data$price
fit.glm1 <- glmnet(X, Y, alpha = 1)   # Lasso fit, solution path
plot(fit.glm1, main = "Lasso")
fit.cvglm1 <- cv.glmnet(X, Y, nfolds = 5, alpha = 1)   # what is alpha = 0.5?
fit.cvglm1$lambda.min
beta.cvglm1 <- coef(fit.cvglm1, s = fit.cvglm1$lambda.1se)   # coef at 1se
pdf("Zillow1.pdf", width = 4.6, height = 2.6, pointsize = 8)
par(mfrow = c(1, 2), mar = c(5, 5, 3, 1) + 0.1, mex = 0.5)
plot(fit.cvglm1$glmnet.fit); title('LASSO')
abline(v = sum(abs(beta.cvglm1[-1])))   # Lasso solution path
plot(fit.cvglm1)                        # estimated MSE
dev.off()                               # close the pdf device
X_test <- model.matrix(~., data = test_data[, 5:22])
pred.glm1 <- predict(fit.cvglm1, newx = X_test, s = "lambda.min")
mse.pred.glm1 <- sum((test_data$price - pred.glm1)^2)      # MSE
1 - mse.pred.glm1 / sum((test_data$price - mean(Y))^2)     # out-of-sample R^2
Housing Prediction: Zillow data
[Figure: LASSO solution path (coefficients against the L₁ norm, with the number of selected variables along the top axis) and cross-validated mean-squared error against log(λ).]
2.4 Folded-concave Regularization

• 1960: Stepwise
• 1974: LEAPS
• 1993: Matching pursuit
• 1998: Shooting
• 2001: LQA
• 2004: LARS; ISTA
• 2007: CCD
• 2008: LLA; ISIS
• 2009: FISTA
• 2010: PLUS
• 2011: ADMM; distributed methods
• …
Local quadratic approximations
Target: Qapprox(β) = 12n‖y−Xβ‖2 + 1
2 ∑pj=1
p′λ(|β∗j |)|β∗j |
β2j .
−4 −2 0 2 4
01
23
45
LQA and LLA
(a)
LQALLA
−4 −2 0 2 4
01
23
45
LQA and LLA
(a)
LQALLA
Iterative formulas for LQA: (ridge reg, Fan & Li, 01)
β(k+1)
= {XT X + nΣλ(β(k)
)}−1X′y, Σλ(β(k)) = diag{p′
λ1(|β(k)|)/|β(k)|}
�Delete Xj in the iteration, if |β(k+1)j | ≤ η.
Local linear approximations and one-step estimator
Target: Q_approx(β) = ‖y − Xβ‖² + ∑_{j=1}^p w_j|β_j|, with w_j = p′λ(|β*_j|).

• Use the LASSO via the transform α = β · w, X* = X/w.

One-step estimator: β^{(0)} = 0 ⇒ lasso ⇒ adaptive lasso.

• Fan et al. (14) show that if the initial estimator is good, the one-step procedure obtains the oracle estimator, which is a fixed point.
• Later: an algorithmic construction of the oracle estimator.
• Optimality is demonstrated in Chapter 4 of the book.
Remarks
• LQA computes the updates explicitly, whereas LLA requires solving a PLS problem.
• LLA gives the better approximation, particularly around the origin.
• With LLA, PLS becomes an iteratively reweighted LASSO, with weights given by p′λ(·).
• Adaptive lasso: p′λ(|β*_j|) = λ|β*_j|^{-γ} (γ > 0). Drawback: zero is an absorbing state.
• Both LQA and LLA are specific members of the MM algorithm:

Q(β) ≤ Q_approx(β),  Q(β*) = Q_approx(β*).

• Convergence of MM (Majorization-Minimization):

Q(β^{(k)}) = Q_approx(β^{(k)})  [equality at current point]
          ≥ Q_approx(β^{(k+1)})  [minimization]
          ≥ Q(β^{(k+1)})  [majorization].
Coordinate descent algorithms
F Optimize one variable at a time: Given current value β0, update
βj = argminβjQ(β1,0, · · · ,βj−1,0,βj ,βj+1,0, · · · ,βp,0),
F For PLS, Rj = Y−X−j β−j,0 without j th variable. Then
Qj(βj) ≡ Q(β10, · · · ,βj−1,0,βj ,βj+1,0, · · · ,βp,0)
=1
2n‖Rj −Xjβj‖2 + pλ(|βj |) + c,
F For Lasso, SCAD and MCP, βj admits analytic solution.
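For the lasso penalty, the coordinate update is exactly a soft-threshold of the univariate least-squares coefficient against the partial residual R_j. A compact Python sketch (the simulation setup is invented; columns are standardized so that n^{-1}‖X_j‖² = 1, making the closed-form update valid):

```python
import numpy as np

def soft(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, Y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||Y - X b||^2 + lam * ||b||_1,
    assuming columns standardized so n^{-1}||X_j||^2 = 1."""
    n, p = X.shape
    b = np.zeros(p)
    r = Y.copy()                    # current residual Y - X b
    for _ in range(n_iter):
        for j in range(p):
            r = r + X[:, j] * b[j]  # partial residual R_j
            zj = X[:, j] @ r / n    # univariate LS coefficient
            b[j] = soft(zj, lam)    # analytic coordinate update
            r = r - X[:, j] * b[j]
    return b

rng = np.random.default_rng(4)
n, p = 300, 10
X = rng.normal(size=(n, p))
X = X / np.sqrt((X**2).mean(axis=0))   # enforce n^{-1}||X_j||^2 = 1
Y = X[:, 0] * 3.0 + rng.normal(size=n)
b = lasso_cd(X, Y, lam=0.2)
print(b[0] > 1.0, np.abs(b[1:]).max() < 0.5)   # -> True True
```

Replacing `soft` with the SCAD or MCP thresholding rule from the earlier slide gives the corresponding folded-concave coordinate descent.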
Housing Prediction by SCAD: Zillow data: R² = 0.80

library('ncvreg')
X <- model.matrix(~., data = train_data[, 5:22])
Y <- train_data$price
cvfit.SCAD <- cv.ncvreg(X, Y, penalty = "SCAD")
### prediction of the test data and out-of-sample R^2
### (X_test as constructed on the previous slide)
predict.SCAD <- predict(cvfit.SCAD, X = X_test)
mse.pred.scad <- sum((test_data$price - predict.SCAD)^2)   # MSE
1 - mse.pred.scad / sum((test_data$price - mean(Y))^2)     # out-of-sample R^2
### change the smoothing parameter
fit.SCAD <- ncvreg(X, Y, penalty = "SCAD")
predict.SCAD <- predict(fit.SCAD, X = X_test); R.sq <- NULL
for (j in 1:100) {
  mse.pred.scad <- sum((test_data$price - predict.SCAD[, j])^2)   # MSE at jth lambda
  R.sq <- c(R.sq, 1 - mse.pred.scad / sum((test_data$price - mean(Y))^2))
}
pdf("Zillow2.pdf", width = 4.6, height = 2.6, pointsize = 8)
par(mfrow = c(1, 3), mar = c(5, 5, 3, 1) + 0.1, mex = 0.5)
plot(cvfit.SCAD)
abline(v = log(cvfit.SCAD$lambda.min), lwd = 2, col = 4)
plot(fit.SCAD, main = "SCAD")
abline(v = cvfit.SCAD$lambda.min, lwd = 2, col = 4)
plot(1:100, R.sq, main = "Out-of-sample R^2", type = "l", col = 4)
dev.off()
Housing Prediction by SCAD: Zillow data
[Figure: cross-validation error against log(λ), with the number of selected variables along the top axis; the SCAD solution path; out-of-sample R² along the λ path.]
2.5 Concentration Inequalities
[Figure: the loss function ρ_τ(u) for τ = 0.5, 1, 2, 3, 4, 5, compared with the least-squares loss.]
A motivating example
A fundamental tool for controlling maximum estimation errors is the union bound, e.g.
$$P\bigl\{\|n^{-1}\mathbf{X}^T\boldsymbol{\varepsilon}\|_\infty > t\bigr\} \leq \sum_{j=1}^{p} P\Bigl\{\Bigl|n^{-1}\sum_{i=1}^{n} X_{ij}\varepsilon_i\Bigr| > t\Bigr\}.$$
If $n^{-1}\|\mathbf{X}_j\|^2 = 1$, then $Z_j \equiv n^{-1}\sum_{i=1}^{n} X_{ij}\varepsilon_i \sim N(0, \sigma^2/n)$ and
$$P\Bigl\{|Z_j| \geq \frac{t\sigma}{\sqrt{n}}\Bigr\} \leq \frac{2}{\sqrt{2\pi}}\int_t^\infty \frac{x}{t}\exp(-x^2/2)\,dx = \frac{2}{\sqrt{2\pi}}\exp(-t^2/2)/t.$$
By taking $t = \sqrt{2(1+\delta)\log p}$, with probability $\geq 1 - o(p^{-\delta})$,
$$\|n^{-1}\mathbf{X}^T\boldsymbol{\varepsilon}\|_\infty \leq \sqrt{2(1+\delta)}\,\sigma\,\sqrt{\frac{\log p}{n}}.$$
What we need: tail probabilities of sums of independent random variables.
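The union-bound calculation above can be checked by simulation. A minimal Python sketch (all sample sizes and constants are illustrative choices, not from the slides): draw Gaussian errors, compute the empirical $\|n^{-1}\mathbf{X}^T\boldsymbol{\varepsilon}\|_\infty$ across repetitions, and compare against the $\sqrt{2(1+\delta)}\,\sigma\sqrt{\log p / n}$ bound.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma, delta = 500, 1000, 1.0, 0.5

# Design with columns normalized so that n^{-1} ||X_j||^2 = 1.
X = rng.standard_normal((n, p))
X /= np.sqrt((X**2).mean(axis=0))

reps = 200
max_corr = np.empty(reps)
for r in range(reps):
    eps = sigma * rng.standard_normal(n)
    max_corr[r] = np.abs(X.T @ eps / n).max()   # ||n^{-1} X^T eps||_inf

bound = np.sqrt(2 * (1 + delta)) * sigma * np.sqrt(np.log(p) / n)
print(f"bound = {bound:.3f}, exceedance freq = {(max_corr > bound).mean():.3f}")
```

The exceedance frequency should be small, consistent with the $1 - o(p^{-\delta})$ probability statement.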
Concentration inequalities
Theorem 2.2. Let $Y_1, \ldots, Y_n$ be independent random variables with mean 0, and let $S_n = \sum_{i=1}^{n} Y_i$ be their sum.

a) Hoeffding's inequality: if $Y_i \in [a_i, b_i]$, then
$$P(|S_n| \geq t) \leq 2\exp\Bigl(-\frac{2t^2}{\sum_{i=1}^{n}(b_i - a_i)^2}\Bigr).$$

b) Bernstein's inequality: if $E|Y_i|^m \leq m!\,M^{m-2}v_i/2$, then
$$P(|S_n| \geq t) \leq 2\exp\Bigl(-\frac{t^2}{2(v_1 + \cdots + v_n + Mt)}\Bigr).$$

c) Sub-Gaussian: if $E\exp(aY_i) \leq \exp(v_i a^2/2)$, then
$$P(|S_n| \geq t) \leq 2\exp\Bigl(-\frac{t^2}{2(v_1 + \cdots + v_n)}\Bigr).$$
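As a quick sanity check on part (a), one can simulate bounded mean-zero variables and compare the empirical tail with the Hoeffding bound. A minimal sketch; the uniform distribution and the constants are chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, t = 100, 20000, 15.0

# Y_i ~ Uniform[-1, 1]: mean zero, bounded with [a_i, b_i] = [-1, 1].
Y = rng.uniform(-1.0, 1.0, size=(reps, n))
Sn = Y.sum(axis=1)

emp_tail = (np.abs(Sn) >= t).mean()
# Hoeffding: sum_i (b_i - a_i)^2 = 4n here.
hoeffding = 2 * np.exp(-2 * t**2 / (4.0 * n))
print(f"empirical tail = {emp_tail:.4f}  <=  Hoeffding bound = {hoeffding:.4f}")
```

The empirical tail is much smaller than the bound, as expected: Hoeffding holds for every bounded distribution, so it cannot be tight for any particular one.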
Concentration inequalities for robust mean
Robust means. $Y_i$ are i.i.d. with mean $\mu$ and variance $\sigma^2$.

d) Truncated loss (adaptive Huber estimator): let
$$\hat\mu_\tau = \arg\min_\mu \sum_{i=1}^{n} \rho_\tau(Y_i - \mu), \qquad \rho_\tau(x) = \begin{cases} x^2, & |x| \leq \tau, \\ \tau(2|x| - \tau), & |x| > \tau. \end{cases}$$
Then, for $\tau = \sqrt{n}\,c/t$ with $c \geq \mathrm{SD}(Y)$, we have (Fan, Li, Wang, 17)
$$P\Bigl(|\hat\mu_\tau - \mu| \geq \frac{tc}{\sqrt{n}}\Bigr) \leq 2\exp(-t^2/16), \qquad \forall\, t \leq \sqrt{n}/8.$$

e) Truncated mean: set $\widetilde Y_i = \mathrm{sgn}(Y_i)\min(|Y_i|, \tau)$. Then
$$P\Bigl(\Bigl|\frac{1}{n}\sum_{i=1}^{n}\widetilde Y_i - \mu\Bigr| \geq \frac{t\sigma}{\sqrt{n}}\Bigr) \leq 2\exp(-ct^2)$$
for some universal constant $c$ (Fan, Wang, Zhu, 16).

By contrast, Chebyshev gives the sample mean only a Cauchy-type tail: $P(|\bar Y - \mu| > t\sigma/\sqrt{n}) \leq 1/t^2$.
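A minimal sketch of the adaptive Huber estimator in part (d), solved by iteratively reweighted averaging. The Student-$t$ data, the choice $c = \mathrm{SD}(Y)$, and the deviation level $t$ are all illustrative assumptions:

```python
import numpy as np

def huber_mean(y, tau):
    """Minimize sum_i rho_tau(y_i - mu) by iteratively reweighted averaging.

    The gradient condition gives weights w_i = 1 for |r_i| <= tau and
    w_i = tau/|r_i| otherwise, so each step is a weighted mean.
    """
    mu = np.median(y)
    for _ in range(100):
        r = y - mu
        w = np.where(np.abs(r) <= tau, 1.0, tau / np.abs(r))
        mu_new = np.sum(w * y) / np.sum(w)
        if abs(mu_new - mu) < 1e-10:
            break
        mu = mu_new
    return mu

rng = np.random.default_rng(2)
mu_true, n, reps = 1.0, 200, 500
t_dev = np.sqrt(2 * np.log(reps))           # deviation level t in the bound
c = np.sqrt(5.0)                            # SD of t_{2.5} is sqrt(2.5/0.5)
tau = np.sqrt(n) * c / t_dev                # tau = sqrt(n) c / t

err_mean, err_huber = [], []
for _ in range(reps):
    y = mu_true + rng.standard_t(2.5, size=n)   # heavy-tailed data
    err_mean.append(abs(y.mean() - mu_true))
    err_huber.append(abs(huber_mean(y, tau) - mu_true))

print("worst error, sample mean:", max(err_mean))
print("worst error, Huber mean :", max(err_huber))
```

Over many repetitions the Huber estimator's worst-case error stays controlled, while the sample mean occasionally suffers from the heavy tails.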
Remarks
There are concentration inequalities for
F sums of sub-exponential random variables
F empirical processes
F random matrices
F the MLE
F self-normalized averages
F Markov chains and mixing processes
2.6 Estimation of Residual Variance
Residual Variance
F Important for statistical inference and model selection; it is the benchmark for optimal prediction.
F $\sigma^2 = R(\boldsymbol\beta_0)$, consistently estimated by the LASSO residual variance $R_n(\hat{\boldsymbol\beta}) = n^{-1}\|\mathbf{Y} - \mathbf{X}\hat{\boldsymbol\beta}\|^2$ under persistency.
F The rate is slow and requires $\|\boldsymbol\beta_0\|_1 \leq c = o((n/\log p)^{1/4})$.
F See Sec 1.3.3 for spurious correlation and the resulting underestimation of $\sigma^2$.
Refitted Cross-validation
F Randomly split the data;
F Select variables using the first half of the data;
F Refit the selected model using the second half to get $\hat\sigma_1^2$;
F And vice versa; then average the two estimators.
Key difference: Refitting eliminates spurious correlation!
RCV
Random division of data: $(\mathbf{y}^{(1)}, \mathbf{X}^{(1)})$ and $(\mathbf{y}^{(2)}, \mathbf{X}^{(2)})$.

Refitted residual variances: with the model $\widehat M_1$ selected on the first half, compute
$$\hat\sigma_1^2 = \frac{(\mathbf{y}^{(2)})^T\bigl(\mathbf{I}_{n/2} - \mathbf{P}^{(2)}_{\widehat M_1}\bigr)\mathbf{y}^{(2)}}{n/2 - |\widehat M_1|}, \qquad \mathbf{P}^{(2)}_{\widehat M_1} = \mathbf{X}^{(2)}_{\widehat M_1}\bigl(\mathbf{X}^{(2)T}_{\widehat M_1}\mathbf{X}^{(2)}_{\widehat M_1}\bigr)^{-1}\mathbf{X}^{(2)T}_{\widehat M_1},$$
and $\hat\sigma_2^2$ is defined analogously with the roles of the two halves swapped.

Final estimate: $\hat\sigma^2_{\mathrm{RCV}} = (\hat\sigma_1^2 + \hat\sigma_2^2)/2$ or a weighted average.

Advantages: requires only sure screening; reduces the influence of spurious correlation.
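The RCV recipe can be sketched as a toy simulation. Correlation screening below stands in for whatever sure-screening selector is actually used, and all dimensions and constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, s, sigma = 400, 1000, 5, 2.0
beta = np.zeros(p)
beta[:s] = 3.0
X = rng.standard_normal((n, p))
y = X @ beta + sigma * rng.standard_normal(n)

def rcv_half(X_sel, y_sel, X_fit, y_fit, k=10):
    # Sure screening on one half: keep the k columns most correlated
    # with the response (a stand-in for a full variable selector).
    scores = np.abs(X_sel.T @ y_sel)
    M = np.argsort(scores)[-k:]
    # Refit by OLS on the OTHER half; df-corrected residual variance.
    Z = X_fit[:, M]
    resid = y_fit - Z @ np.linalg.lstsq(Z, y_fit, rcond=None)[0]
    return resid @ resid / (len(y_fit) - k)

half = n // 2
s1 = rcv_half(X[:half], y[:half], X[half:], y[half:])
s2 = rcv_half(X[half:], y[half:], X[:half], y[:half])
sigma2_rcv = (s1 + s2) / 2
print("RCV estimate of sigma^2:", sigma2_rcv)   # true sigma^2 = 4
```

Because the variables selected on one half are refit on fresh data, spuriously correlated columns carry no special advantage in the refit, which is exactly the point of RCV.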
2.7 Variable Expansions and Group Penalty
Structured Nonparametric Models
F Additive model (Stone, 85, Hastie and Tibshirani, 90)
Y = f1(X1) + · · ·+ fp(Xp) + ε
F Two-d nonparametric interactions:
$$Y = \sum_{i=1}^{p} f_i(X_i) + \sum_{i<j} f_{i,j}(X_i, X_j) + \varepsilon$$
F Varying coefficient model:
Y = β0(U) + β1(U)X1 + · · ·+ βp(U)Xp + ε.
Expanded Linear Model
Via basis expansion, these models can be written as linear models.

Example: approximate $f_j(x) = \sum_{k=1}^{K_j} \beta_{jk} B_{jk}(x)$ using the basis functions $\{B_{jk}(x)\}_{k=1}^{K_j}$ (e.g. B-splines). The additive model becomes
$$Y = \{\beta_{1,1}B_{1,1}(X_1) + \cdots + \beta_{1,K_1}B_{1,K_1}(X_1)\} + \cdots + \{\beta_{p,1}B_{p,1}(X_p) + \cdots + \beta_{p,K_p}B_{p,K_p}(X_p)\} + \varepsilon.$$

Interaction model: $Y = \sum_{i=1}^{p} \alpha_i X_i + \sum_{i<j} \beta_{i,j} X_i X_j + \varepsilon$.
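The expansion can be illustrated with a crude truncated-line basis standing in for B-splines. A sketch; the sample size, knot grid, and component functions are all illustrative assumptions:

```python
import numpy as np

def piecewise_linear_basis(x, knots):
    """Truncated-line basis (a crude stand-in for B-splines):
    columns are x and (x - knot)_+ for each knot."""
    cols = [x] + [np.maximum(x - k, 0.0) for k in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(4)
n, p = 300, 3
X = rng.uniform(0, 1, size=(n, p))
y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1]**2 + 0.1 * rng.standard_normal(n)

knots = np.linspace(0.1, 0.9, 9)
# Expand each additive component f_j into its own block of basis columns,
# then stack the blocks: the additive model becomes one linear model.
B = np.column_stack([piecewise_linear_basis(X[:, j], knots) for j in range(p)])
B = np.column_stack([np.ones(n), B])   # intercept

coef, *_ = np.linalg.lstsq(B, y, rcond=None)
r2 = 1 - np.var(y - B @ coef) / np.var(y)
print("in-sample R^2:", r2)
```

With enough basis functions per coordinate, the linear fit in the expanded design recovers the nonlinear additive components.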
Full Nonparametrics and Kernel Tricks
Saturated nonparametric model: $Y = f(\mathbf{X}) + \varepsilon$ — subject to the curse of dimensionality.

Kernel trick: $Y_i = \sum_{j=1}^{n} \beta_j \underbrace{K(\mathbf{X}_i/\lambda, \mathbf{X}_j/\lambda)}_{X^*_{ij},\ \text{prototype}} + \varepsilon_i$

F Polynomial kernel $K(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^T\mathbf{y} + 1)^d$ for a positive integer $d$
F Gaussian kernel $K(\mathbf{x}, \mathbf{y}) = \exp(-\|\mathbf{x} - \mathbf{y}\|^2/2)$
F The coefficients $\beta_j$ can be regularized ($p = n$)

Also applicable to structured nonparametrics.
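A minimal kernel-ridge sketch of the trick: build the $n \times n$ prototype matrix $X^*_{ij} = K(\mathbf{X}_i/\lambda, \mathbf{X}_j/\lambda)$ and regularize the $n$ coefficients $\beta_j$. The ridge penalty, bandwidth, and target function are illustrative assumptions:

```python
import numpy as np

def poly_kernel(X, Y, d=2):
    # K(x, y) = (x^T y + 1)^d
    return (X @ Y.T + 1.0) ** d

def gauss_kernel(X, Y, lam=1.0):
    # K(x/lam, y/lam) with unit Gaussian kernel = exp(-||x-y||^2 / (2 lam^2))
    D2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-D2 / (2 * lam**2))

rng = np.random.default_rng(5)
n, p, ridge = 200, 3, 1e-2
X = rng.standard_normal((n, p))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

# Kernel ridge regression: here "p = n", so the n coefficients beta
# must be regularized to avoid interpolating the noise.
K = gauss_kernel(X, X)
beta = np.linalg.solve(K + ridge * np.eye(n), y)
r2 = 1 - np.var(y - K @ beta) / np.var(y)
print("in-sample R^2:", r2)
```

The same prototype-feature construction works with the polynomial kernel, or with structured (e.g. additive) kernels.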
Group penalty
F Covariates are divided into groups, e.g. the basis-function blocks of an additive model:
F Data: $\mathbf{Y} = \sum_{j=1}^{p} \underbrace{\mathbf{X}_j}_{n \times K_j}\boldsymbol\beta_j + \boldsymbol\varepsilon$
F Group PLS: $\frac{1}{2n}\bigl\|\mathbf{Y} - \sum_{j=1}^{p}\mathbf{X}_j\boldsymbol\beta_j\bigr\|^2 + \sum_{j=1}^{p} p_\lambda(\|\boldsymbol\beta_j\|_{W_j})$
F Selects or kills a whole group of variables. (Yuan & Lin, 06)
F Discussed by Antoniadis & Fan (01) for selecting blocks of wavelet coefficients.
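For the $L_1$-type group penalty $p_\lambda(\|\boldsymbol\beta_j\|) = \lambda\|\boldsymbol\beta_j\|_2$ (group lasso, with $W_j$ the identity), the proximal step has a closed form: group soft-thresholding. A small sketch with made-up numbers:

```python
import numpy as np

def group_soft_threshold(beta, groups, lam):
    """Proximal operator of lam * sum_j ||beta_j||_2: shrink each group's
    coefficient block toward zero, killing whole groups at once."""
    out = beta.copy()
    for idx in groups:
        norm = np.linalg.norm(beta[idx])
        out[idx] = 0.0 if norm <= lam else beta[idx] * (1 - lam / norm)
    return out

beta = np.array([3.0, 4.0, 0.1, -0.2, 1.0, 0.5])
groups = [np.arange(0, 2), np.arange(2, 4), np.arange(4, 6)]
shrunk = group_soft_threshold(beta, groups, lam=1.0)
print(shrunk)
```

The second group has norm $\sqrt{0.05} \approx 0.22 < \lambda$ and is killed entirely, while the other groups are shrunk proportionally — exactly the select-or-kill behavior described above.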
2.8 Prediction of HPI
Prediction of Home Price Appreciation
Data: HPA collected at ≈1,000 Core Based Statistical Areas (CBSAs).
Objective: to project HPA over 30–40 years for the ≈1,000 CBSAs based on national assumptions.
HPI in several states
Growth of dimensionality
Local correlation makes the dimensionality grow quickly: 1,000 neighborhoods require about 1 million parameters.
Conditional sparsity
Model: $Y_{t+1}$ is the HPA in one CBSA:
$$Y_{t+1} = \beta_0 + \beta_1 X_{N,t} + \sum_{j=2}^{382} \beta_j X_{t,j} + \varepsilon_t$$
The sparsity of $\{\beta_j\}_{j=2}^{382}$ is explored by SCAD.
Results: 30% more accurate than the simple modeling.
What is the benchmark?
Local neighborhood selection
Effectiveness of sparse modeling
Estimated benchmark with different model size
[Figure: estimated standard deviation of the residuals against model size (2 to 30), in four panels: (a) SF, naive method; (b) SF, RCV method; (c) LA, naive method; (d) LA, RCV method.]
Estimation of prediction errors
San Francisco
Model size     2      3      5      10     15     20     30
N-LASSO        0.558  0.524  0.507  0.456  0.394  0.386  0.364
RCV-LASSO      0.556  0.554  0.518  0.506  0.473  0.475  0.474
R^2 (in %)     76.92  79.83  81.40  85.67  89.79  90.66  92.58

Los Angeles
Model size     2      3      5      10     15     20     30
N-LASSO        0.524  0.489  0.458  0.440  0.375  0.314  0.250
RCV-LASSO      0.526  0.521  0.521  0.499  0.479  0.460  0.462
R^2 (in %)     88.68  90.23  91.56  92.56  94.86  96.57  98.05
Notes
Data period: training 01/01–12/05; testing 01/06–12/09.
SD of monthly variation: SF = 1.08%, LA = 1.69%.
Benchmark: 0.53%.
One-month prediction error in 2006: SF = 0.67%, LA = 0.86%.