Transcript of "Variable selection based on multiple, high-dimensional..." (stat.ethz.ch/Manuscripts/buhlmann/paris2007.pdf)
Variable selection based on multiple, high-dimensional genomic data: from the Lasso to the smoothed adaptive Lasso

Peter Bühlmann
Seminar für Statistik, ETH Zürich
January 2007
High-dimensional data
(X_1, Y_1), …, (X_n, Y_n) i.i.d. or stationary
X_i: p-dimensional predictor variable
Y_i: response variable, e.g. Y_i ∈ ℝ or Y_i ∈ {0, 1}

high-dimensional: p ≫ n

areas of application: biology, astronomy, imaging, marketing research, text classification, ...
Some examples from biology
1. Classification of cancer sub-types based on microarray gene expression data
X = gene expression profile
Y ∈ {0, 1, …, J − 1}: the class label of the cancer sub-type
n ≈ 10–100, p ≈ 3,000–25,000

2. Motif regression: search for transcription factor binding sites on the DNA sequence, using gene expression and DNA sequence data

motif = (overrepresented) pattern on the DNA sequence (transcription factor binding site)

data:
X = motif scores for motifs up-stream of single genes, based on sequence data only (e.g. MDscan from Liu et al.)
Y = gene expression for single genes, over multiple time points
p ≈ 4,000, n ≈ 20 × 4,000 = 80,000
High-dimensional linear models
Y_i = (β_0 +) Σ_{j=1}^p β_j X_i^(j) + ε_i,  i = 1, …, n

p ≫ n; in short: Y = Xβ + ε

goals:
- prediction, e.g. squared prediction error
- variable selection, i.e. estimating the effective variables (those with coefficient ≠ 0)
Approaches include:
Ridge regression (Tikhonov regularization) for prediction
variable selection via AIC, BIC, (g)MDL (in a forward manner)
Bayesian methods for regularization, ...
computational feasibility for high-dimensional problems (2^p sub-models)
⇒ (quasi-)convex optimization ⇔ (adaptive) Lasso (Tibshirani, 1996)
Lasso for linear models
β̂(λ) = argmin_β ( n^{−1} ‖Y − Xβ‖² + λ ‖β‖₁ ),  λ ≥ 0,  ‖β‖₁ = Σ_{j=1}^p |β_j|

convex optimization problem

- Lasso does variable selection: some of the β̂_j(λ) = 0 (because of the "ℓ₁-geometry")
- β̂(λ) is (typically) a shrunken LS-estimate
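The ℓ₁ penalty's selection effect is easy to see numerically. A minimal sketch, not from the talk (data, seed, and the penalty level `alpha` are illustrative assumptions), using scikit-learn's `Lasso`:

```python
# Illustrative sketch: the l1 penalty zeroes out coefficients, so the
# Lasso selects variables "for free". All numbers here are assumptions.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 50, 200                              # high-dimensional: p >> n
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]      # only 5 effective variables
y = X @ beta + 0.5 * rng.standard_normal(n)

fit = Lasso(alpha=0.3).fit(X, y)            # alpha plays the role of lambda
selected = np.flatnonzero(fit.coef_)
print(len(selected))                        # far fewer than p = 200 survive
```

Most coefficients come out exactly zero; the survivors are shrunken relative to least squares.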
The prediction problem
Theorem (Greenshtein & Ritov, 2004)
- linear model with p = p_n = O(n^α) for some α < ∞ (high-dimensional)
- ‖β‖₁ = ‖β_n‖₁ = Σ_{j=1}^{p_n} |β_{j,n}| = o((n/log(n))^{1/4}) (sparse)
- other minor conditions

Then, for suitable λ = λ_n,

E_X[( f̂(X) − f(X) )²] → 0 in probability (n → ∞),

where f̂(X) = β̂(λ)ᵀX and f(X) = βᵀX.
⇒ and the Lasso performs "quite well" for prediction

binary lymph node classification using gene expressions: a high-noise problem

n = 49 samples, p = 7130 gene expressions

cross-validated misclassification error (2/3 training; 1/3 test):

| Lasso | L2Boosting | FPLR | Pelora | 1-NN | DLDA | SVM |
|-------|------------|------|--------|------|------|-----|
| 21.1% | 17.7% | 35.25% | 27.8% | 43.25% | 36.12% | 36.88% |

(gene selection per method: multivariate gene selection, best 200 genes (Wilcoxon test), or no additional gene selection)
Lasso selected on CV-average 13.12 out of p = 7130 genes
The variable selection problem
Y_i = (β_0 +) Σ_{j=1}^p β_j X_i^(j) + ε_i,  i = 1, …, n

goal: find the effective predictor variables, i.e. the set E_true = {j : β_j ≠ 0}

ℓ₀-penalty methods, e.g. BIC, AIC, ...

β̂(λ) = argmin_β ( n^{−1} ‖Y − Xβ‖² + λ ‖β‖₀ ),  ‖β‖₀ = Σ_{j=1}^p I(β_j ≠ 0)

- computationally infeasible: 2^p sub-models ⇒ ad-hoc heuristic optimization such as forward–backward search
- often "unstable" (Breiman (1996, 1998))
convexization of computationally hard problem
use the Lasso for variable selection: Ê(λ) = {j : β̂_j(λ) ≠ 0}

can be computed efficiently for all λ's using the LARS algorithm (Efron, Hastie, Johnstone, Tibshirani, 2004):
O(np · min(n, p)) operation counts, i.e. linear in p if p ≫ n
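The same full-path computation is available outside the R `lars` package; a hedged sketch with scikit-learn's `lars_path` (simulated data; dimensions chosen to mimic the p ≫ n setting, all values assumptions):

```python
# Sketch: a single LARS run returns the Lasso solutions for ALL lambda
# values (the path is piecewise linear in lambda). Data are simulated.
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(1)
n, p = 49, 500                         # p >> n, as in the lymph node example
X = rng.standard_normal((n, p))
y = X[:, 0] - 2.0 * X[:, 1] + 0.3 * rng.standard_normal(n)

alphas, active, coefs = lars_path(X, y, method="lasso")
print(coefs.shape)                     # (p, number of kinks in the path)
print(alphas[0] > alphas[-1])          # penalty decreases along the path
```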
CPU time, lymph node classification example: p = 7130, n = 49

computing Lasso solutions for all λ's:
2.603 seconds using lars in R (with use.Gram = FALSE)
Properties of Lasso for variable selection
Theorem (Meinshausen & PB, 2004 (publ. 2006))
- Y and the X^(j)'s Gaussian (not crucial)
- sufficient and almost necessary LfV condition (LfV = Lasso for Variable selection); see also Zhao & Yu (2006)
- p = p(n) growing with n: p(n) = O(n^α) for some 0 < α < ∞ (high-dimensionality)
- |E_{true,n}| = O(n^κ) for some 0 < κ < 1 (sparsity)
- the non-zero β_j's are outside the n^{−1/2}-range

Then: if λ = λ_n ∼ const · n^{−1/2−δ/2} (0 < δ < 1/2),

P[Ê(λ) = E_true] = 1 − O(exp(−C n^{1−δ}))

⇒ statistical (asymptotic) justification of the convexization of a computationally hard variable selection problem
The LfV condition is restrictive: it is sufficient and necessary for consistent model selection with the Lasso.

It fails to hold if the design matrix is "too correlated" ⇒ the Lasso is no longer consistent for selecting the true model.
The “reason”
too much bias: shrinkage even for large values

for orthogonal design:

[Figure: threshold functions vs. z: hard-thresholding, nn-garrote, soft-thresholding]

The bias of soft-thresholding is disturbing (at least sometimes).

better: the Nonnegative Garrote (Breiman, 1995) and similar proposals
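The three threshold functions in the figure are easy to write down for the orthogonal-design case, where z denotes the least-squares estimate of a single coefficient. A sketch (the exact scaling of the garrote's threshold parameter is an assumption following the form in Breiman (1995)):

```python
# Threshold functions for orthogonal design; z = LS estimate of a coefficient.
# The parameterization of the nn-garrote threshold is an assumption.
import numpy as np

def soft(z, lam):        # Lasso: shrinks *every* value by lam -> bias for large z
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def hard(z, lam):        # keep-or-kill: survivors are not shrunken at all
    return np.where(np.abs(z) > lam, z, 0.0)

def nn_garrote(z, lam):  # shrinkage factor (1 - lam/z^2)_+ fades as |z| grows
    z = np.asarray(z, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        c = np.maximum(1.0 - lam / z**2, 0.0)
    return np.where(z == 0, 0.0, c * z)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(soft(z, 1.0))       # large values still moved by lam (the disturbing bias)
print(hard(z, 1.0))       # large values untouched
print(nn_garrote(z, 1.0)) # in between: bias vanishes for large |z|
```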
The LfV condition: a condition on the covariance of X
LfV condition (Meinshausen & PB, 2004) ⇔ Irrepresentable condition (Zhao & Yu, 2006) "⇔" Lasso is consistent for variable selection

Irrepresentable condition ⇔ |Σ_{noise;eff} Σ_{eff;eff}^{−1} sign(β_eff)| ≤ 1 − η

it holds for:
- Σ_ij ≤ ρ^{|i−j|} (0 ≤ ρ < 1): power-decay correlations
- dictionaries with coherence (max. correlation) < (2 p_eff − 1)^{−1} (notion of coherence: Donoho, Elad & Temlyakov (2004))
- but it is easy to construct examples where the condition fails to hold
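The irrepresentable condition is a finite check one can run on any given covariance. A sketch assuming the population covariance is known; the power-decay covariance is one of the cases listed above:

```python
# Numerical check of the irrepresentable condition:
# || Sigma_{noise,eff} Sigma_{eff,eff}^{-1} sign(beta_eff) ||_inf <= 1 - eta.
# The covariance below (power decay rho^{|i-j|}) is an illustrative assumption.
import numpy as np

def irrepresentable_ok(Sigma, eff, sign_beta, eta=0.0):
    """eff: indices of effective variables; sign_beta: their signs (+-1)."""
    noise = np.setdiff1d(np.arange(Sigma.shape[0]), eff)
    v = Sigma[np.ix_(noise, eff)] @ np.linalg.solve(Sigma[np.ix_(eff, eff)],
                                                    sign_beta)
    return np.max(np.abs(v)) <= 1.0 - eta

p, rho = 10, 0.5
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
print(irrepresentable_ok(Sigma, eff=np.array([0, 1]),
                         sign_beta=np.array([1.0, 1.0])))
```

For this example the left-hand side equals ρ = 0.5, so the condition holds with room to spare.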
Choice of λ
first (not so good) idea: choose λ to optimize prediction, e.g. via some cross-validation scheme

but: for the prediction oracle solution

λ* = argmin_λ E[(Y − Σ_{j=1}^p β̂_j(λ) X^(j))²],

P[Ê(λ*) = E_true] < 1 (n → ∞) (or = 0 if p_n → ∞ (n → ∞))

asymptotically: prediction optimality yields too large models (Meinshausen & PB, 2004; related example by Leng et al., 2006)
If LfV condition fails to hold:
Meinshausen & Yu (2006): for suitable λ = λ_n,

‖β̂ − β‖₂² = Σ_{j=1}^p (β̂_j − β_j)² = o_P(1)

under much weaker conditions than LfV:
- maximal and minimal sparse eigenvalues of the empirical covariance matrix
- number of effective variables in relation to the sparse eigenvalues of the empirical covariance matrix

implication: the Lasso yields too large models (for fixed coefficients β_j, j = 1, 2, …)
in summary: asymptotically,
- the prediction-optimal solution yields too large models
- if the LfV condition fails to hold, the Lasso yields too large models

⇒ Lasso as a "filter for variable selection", i.e. the true model is contained in the models selected by the Lasso
Binary lymph node classification in breast cancer: n = 49, p = 7130

5-fold CV tuned Lasso selects 23 genes (on the whole data set)

note (in practice): identifiability problem among highly correlated predictor variables

an ad-hoc approach: keep the 23 genes plus all their highly correlated genes for further modeling, interpretation, etc.
Adaptive Lasso
recap: under "weak" assumptions,

E_true ⊆ Ê(λ̂),  λ̂ prediction-optimal

⇒ quite many non-zero, "small" β̂_j's from the Lasso

various possibilities to improve:
- hard-thresholding of coefficients (using prediction optimality)
- thresholding of coefficients and re-estimation of the non-zero coefficients with least squares (using prediction optimality)
Adaptive Lasso (Zou, 2006): re-weighting the penalty function
β̂ = argmin_β ( Σ_{i=1}^n (Y_i − (Xβ)_i)² + λ Σ_{j=1}^p |β_j| / |β̂_init,j| ),

with β̂_init,j from the Lasso in a first stage (or from OLS if p < n) (Zou, 2006)
for orthogonal design, if β̂_init = OLS: Adaptive Lasso = NN-garrote

[Figure: threshold functions vs. z: hard-thresholding, adaptive Lasso, soft-thresholding]
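The re-weighted criterion above can be computed with ordinary Lasso software: dividing the penalty on β_j by |β̂_init,j| is the same as running a plain Lasso on a design whose j-th column is rescaled by |β̂_init,j|, then mapping back. A hedged two-stage sketch (data, seed, and both `alpha` values are illustrative assumptions, not the talk's choices):

```python
# Two-stage adaptive Lasso sketch: penalty lam * sum_j |beta_j|/|beta_init_j|
# is an ordinary Lasso after rescaling column j by |beta_init_j|.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 100, 300
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:3] = [2.0, -1.5, 1.0]
y = X @ beta + 0.3 * rng.standard_normal(n)

# stage 1: plain Lasso supplies the weights
b_init = Lasso(alpha=0.1).fit(X, y).coef_
w = np.abs(b_init)
keep = w > 0                            # variables killed in stage 1 stay out

# stage 2: Lasso on the rescaled design; beta_j = w_j * gamma_j maps back
gamma = Lasso(alpha=0.05).fit(X[:, keep] * w[keep], y).coef_
b_adapt = np.zeros(p); b_adapt[keep] = w[keep] * gamma

print(np.flatnonzero(b_adapt))          # never more variables than stage 1
```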
furthermore:
- Zou (2006): the adaptive Lasso is consistent for variable selection "in general" (proof for low-dimensional problems only)
- Huang, Ma & Zhang (2006): as above, but for sparse, high-dimensional problems
n = 300, p = 20, …, 650, p_eff = 20

[Figure: ℓ₂-loss (left) and number of selected variables (right) as functions of p, for the Lasso (curve 1) and the adaptive Lasso (curve 2)]

additional pure noise variables are much less damaging for the adaptive Lasso than for the Lasso
Binary lymph node classification in breast cancer: n = 49, p = 7130

5-fold CV tuning for each method

cross-validated quantities (2/3 training; 1/3 test):

|                | misclassif. error | number of selected genes |
|----------------|-------------------|--------------------------|
| Lasso          | 21.1%             | 13.12                    |
| Adaptive Lasso | 20.1%             | 7.3                      |
Bacillus subtilis for vitamin production (project with DSM)

data: response Y, p = 4088 gene expressions, n = 115

goal: find important genes for Y

statistically: a regression problem of Y versus p = 4088 gene expressions; find the variables (genes) which are important for the regression

identifiability problem due to high correlation (collinearity) among genes ⇒ elastic net (Zou & Hastie, 2005), which encourages selecting none or all among highly correlated predictor variables
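A sketch of the elastic net's grouping effect, with scikit-learn's `ElasticNet` as a stand-in (the correlation structure and all parameter values are invented for illustration):

```python
# Elastic net sketch (Zou & Hastie, 2005): mixing l1 and l2 penalties makes
# the fit tend to keep or drop groups of highly correlated predictors
# together. All numbers below are assumptions.
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(3)
n = 115
z = rng.standard_normal(n)
# three nearly collinear "genes" carrying the same signal, plus noise genes
X = np.column_stack([z + 0.01 * rng.standard_normal(n) for _ in range(3)]
                    + [rng.standard_normal(n) for _ in range(50)])
y = z + 0.3 * rng.standard_normal(n)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.3).fit(X, y)  # l1_ratio < 1: adds l2 part
print(np.flatnonzero(lasso.coef_[:3]), np.flatnonzero(enet.coef_[:3]))
```

The Lasso typically picks only some of the three near-copies, while the elastic net tends to spread weight over all of them.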
Adaptive elastic net (builds on the idea of the adaptive Lasso)
out
Gene.48
Gene.289
Gene.385
Gene.412
Gene.447
Gene.535
Gene.816
Gene.837
Gene.942
Gene.943Gene.945
Gene.946
Gene.948
Gene.960
Gene.1025
Gene.1027
Gene.1058
Gene.1123
Gene.1223
Gene.1251
Gene.1273
Gene.1358
Gene.1546
Gene.1564
Gene.1640
Gene.1706
Gene.1712
Gene.1885
Gene.1932
Gene.2360
Gene.2438
Gene.2439
Gene.2928
Gene.2929
Gene.2937
Gene.3031
Gene.3032
Gene.3033
Gene.3034
Gene.3132
Gene.3312
Gene.3693
Gene.3694
Gene.3943
regression coefficients ⇔ partial correlations Gaussian graphical modeluseful visualization: neighbours of Y (selected genes) andtheir conditional dependencies (w.r.t. all variables (genes))
we did not insist on sparsity, but aimed for all relevant variables/genes and their highly correlated ones

⇒ more "false positives", but sometimes desirable in an exploratory stage
Can we improve?

the adaptive Lasso yields pretty good solutions for variable selection and prediction

but note the limitation: p ≫ n ...

"strategy": make the "sample size" larger by integrating other suitable data-sets

a simple model:
data-sets D(t), t = 1, 2, …, N, each measuring Y(t), X(t) with sample size n(t):
Y(t) = X(t)β(t) + ε(t),
with β(t) changing smoothly over t (a topology for indexing the data-sets)
Time course experiments
t = 1, 2, …, N represents time; Y(t) and X(t) are measurements for the same variables ⇒ use the usual metric on ℝ₊

use the smoothed Lasso (Meier & PB, in progress):

β̂(τ) = argmin_β Σ_{t=1}^N w(t, τ) ( n^{−1} ‖Y(t) − X(t)β‖₂² + λ ‖β‖₁ ),  w(t, τ) = K((t − τ)/h)

(if n(t) ≡ n)
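One way to compute β̂(τ) with off-the-shelf tools: the weighted sum of squared residuals equals a single squared norm after scaling each data set by √w(t, τ), so the criterion reduces (up to the scaling of λ) to an ordinary Lasso on stacked data. A sketch with an assumed kernel, bandwidth, and λ, none of them from the talk:

```python
# Hedged sketch of the smoothed Lasso criterion: kernel weights
# w(t, tau) = K((t - tau)/h) turn the pooled fit into a weighted Lasso,
# solved here by stacking sqrt(w)-scaled copies of each data set.
import numpy as np
from sklearn.linear_model import Lasso

def smoothed_lasso(Xs, Ys, tau, h, lam):
    """Xs, Ys: lists of per-time-point design matrices / responses."""
    K = lambda u: np.maximum(1.0 - u**2, 0.0)          # Epanechnikov-type kernel
    w = np.array([K((t - tau) / h) for t in range(len(Xs))])
    rows_X = [np.sqrt(wt) * X for wt, X in zip(w, Xs) if wt > 0]
    rows_Y = [np.sqrt(wt) * Y for wt, Y in zip(w, Ys) if wt > 0]
    Xw, Yw = np.vstack(rows_X), np.concatenate(rows_Y)
    return Lasso(alpha=lam).fit(Xw, Yw).coef_          # lambda scaling: assumption

rng = np.random.default_rng(4)
N, n, p = 9, 30, 40
betas = [np.r_[2.0 + 0.1 * t, np.zeros(p - 1)] for t in range(N)]  # slowly varying
Xs = [rng.standard_normal((n, p)) for _ in range(N)]
Ys = [X @ b + 0.3 * rng.standard_normal(n) for X, b in zip(Xs, betas)]

b_hat = smoothed_lasso(Xs, Ys, tau=4, h=3.0, lam=0.1)
print(np.flatnonzero(b_hat))
```

Borrowing strength from the neighbouring time points is what buys the improved rate discussed next.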
results for the smoothed (adaptive) Lasso (Meier & PB): if h = h_N → 0 suitably slowly and β(·) is smooth, then for suitable λ = λ(n, N, h):

- improved convergence rate for ‖β̂ − β‖₂² by a factor (Nh)^{−a} (for the smoothed Lasso and the smoothed adaptive Lasso);
  a = 1 and (N h_opt)^{−1} = N^{−4/5} n^{1/5} in the low-dimensional case,
  i.e. improvement if N is not too small w.r.t. n (N/n^{1/4} → ∞);
  e.g. n^{1/4} ≈ 2.7 if n = 50 and n^{1/4} ≈ 8 if n = 4,000
- for the smoothed adaptive Lasso: asymptotic consistency for variable selection (as in the non-smoothed case), but better empirical performance
E[N^{−1} Σ_{t=1}^N ‖β̂(t) − β(t)‖₂²]

[Figure: average MSE as a percentage of the Lasso's average MSE, for the Lasso, the adaptive Lasso, and the smoothed adaptive Lasso, in the models n = 20, p = 8; n = 200, p = 20; n = 100, p = 5,000]

N = 9 (n = 20) or N = 18 (n = 100, 200)
E[|p̂_eff − p_eff|]

[Figure: average variable-selection error as a percentage of the Lasso's, for the Lasso, the adaptive Lasso, and the smoothed adaptive Lasso, in the models n = 20, p = 8; n = 200, p = 20; n = 100, p = 5,000]

and the smoothed adaptive Lasso is the sparsest

N = 9 (n = 20) or N = 18 (n = 100, 200)
Motif regression for time-course experiments
goal: find transcription factor binding sites (for a set of co-regulated genes)

fact: a transcription factor tends to recognize a conserved pattern (a "motif") in the DNA sequence

⇒ search for "overrepresented patterns", such as TCTATTGTTT, occurring in the up-stream regions of genes
MotifRegressor which integrates sequence and geneexpression data (Conlon, Liu, Lieb & Liu, 2003):
- for highly expressed genes: up-stream of each gene, search for p candidate motifs with MDscan, based on DNA sequence data only
- compute the motif score for all n genes and all p candidate motifs (score ≈ occurrences of the candidate motif in the gene's up-stream region), based on DNA sequence data only
- n ≈ 4,000–25,000 genes and their expression Y; p ≈ 4,000 candidate motif scores for each gene (gene expression and DNA sequence data)
- do regression and determine the significant variables (i.e. the candidate motifs which are significant):

Y_i = gene expression of gene i = Σ_{j=1}^p β_j X_{i,j} + error_i,  X_{i,j} = motif score

this approach is very competitive in comparison to other algorithms (e.g. AlignACE, MEME)
"always": very noisy data; after variable selection, R² ≈ 0.05–0.15

nevertheless: MotifRegressor often seems better than competing algorithms

further improvement with the adaptive Lasso over the forward variable selection in MotifRegressor; and it yields meaningful or even true findings (work in progress with Liu and collaborators)
Time-course experiments (e.g. from cell-cycle)
for every time point t:
- gene-expression vector/profile Y(t)
- motif scores for every gene and every motif candidate: always the same, i.e. X(t) ≡ X

multivariate regression:

Y(t) = Xβ(t) + error(t)

and the reasonable assumption that β(·) changes smoothly w.r.t. time

⇒ use the smoothed adaptive Lasso
Spellman et al.’s cell-cycle experiment for yeast
N = 9 time points; n = 4443, p = 2155

cross-validated mean squared prediction error:

| time point | adaptive Lasso | smoothed adaptive Lasso |
|------------|----------------|-------------------------|
| 1 | 40.5 | 41.7 |
| 2 | 95.2 | 95.5 |
| 3 | 60.0 | 61.4 |
| 4 | 59.3 | 59.0 |
| 5 | 32.3 | 32.4 |
| 6 | 36.5 | 36.5 |
| 7 | 37.3 | 37.3 |
| 8 | 27.1 | 27.1 |
| 9 | 29.9 | 29.9 |

essentially the same predictive performance; but note: high noise ⇒ similar prediction performance
number of selected variables (motifs):
| time point | adaptive Lasso | smoothed adaptive Lasso |
|------------|----------------|-------------------------|
| 1 | 500 | 438 |
| 2 | 73 | 73 |
| 3 | 43 | 20 |
| 4 | 77 | 46 |
| 5 | 53 | 41 |
| 6 | 0 | 0 |
| 7 | 0 | 0 |
| 8 | 45 | 16 |
| 9 | 0 | 0 |

the smoothed adaptive Lasso is often substantially sparser ⇒ fewer false positives expected
interpretation of significant motifs (via TRANSFAC), e.g.:

GACGCG → (via TRANSFAC) MCB, a transcription factor

some well-known cell-cycle regulators found: STE12, SCB, MCB, PHO4, SWI5, MCM1
and in addition: ROX1, M3B, XBP1

- substantial overlap of the findings with Conlon, Liu, Lieb & Liu (2003)
- our method is much more stable than the (non-smoothed) forward variable selection used in Conlon, Liu, Lieb & Liu (2003) ⇒ fewer false positives expected with the smoothed adaptive Lasso
currently working on motif finding in Arabidopsis thaliana (a much less explored organism than yeast)

with the Gruissem lab at ETH Zürich
Conclusions
1. The Lasso is computationally attractive for variable selection in high-dimensional generalized linear models (including e.g. Cox's partial likelihood for survival data), but it yields too large models.

2. The adaptive Lasso is an elegant, effective way to correct the Lasso's overestimation behavior.

3. The smoothed adaptive Lasso is potentially powerful for time-course experiments (or multivariate structures, i.e. "multiple data-sets").

Software packages are available in R: lars, glmpath, grplasso.