Lecture 6: Variable Selection - LASSO
Wenbin Lu
Department of Statistics
North Carolina State University
Fall 2019


  • Outline

    Modern Penalization Methods

    Lq penalty, ridge


    computation, solution path, LARS algorithms

    parameter tuning, cross validation

    theoretical properties

    ridge regression

  • Oracle Properties


    data: $(X_i, Y_i)$, $i = 1, \dots, n$

    $n$: the sample size

    $d$: the number of predictors, $X_i \in \mathbb{R}^d$

    the full index set: $S = \{1, 2, \dots, d\}$

    the index set selected by a procedure: $\hat{A}$, with size $|\hat{A}|$

    the linear coefficient vector: $\beta = (\beta_1, \dots, \beta_d)^T$

    the true linear coefficients: $\beta_0 = (\beta_{10}, \dots, \beta_{d0})^T$

    the true model: $A_0 = \{j : \beta_{j0} \neq 0,\ j = 1, \dots, d\}$

  • Oracle Properties


    Shrinkage methods for variable selection:

    a continuous process; more stable, not suffering from high variability

    one unified selection and estimation framework; easy to make inferences

    advances in theories: oracle properties, valid asymptotic inferences

    In general, we standardize the inputs first before applying the methods.

    The X's are standardized to have mean 0 and variance 1

    y is centered to mean 0

    We often fit a model without an intercept.

    Ideal Variable Selection

    The best results we can expect from a selection procedure:

    able to obtain the correct model structure

    keep all the important variables in the model

    filter all the noise variables out of the model

    has some optimal inferential properties

    consistent, optimal convergence rate

    asymptotic normality, most efficient

    True model: $y_i = \beta_{00} + \sum_{j=1}^{d} x_{ij}\beta_{j0} + \epsilon_i$,

    important index set: $A_0 = \{j : \beta_{j0} \neq 0,\ j = 1, \dots, d\}$

    unimportant index set: $A_0^c = \{j : \beta_{j0} = 0,\ j = 1, \dots, d\}$

    Oracle Properties

    An oracle performs as if the true model were known:

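    1 selection consistency:

    $\hat\beta_j \neq 0$ for $j \in A_0$; $\hat\beta_j = 0$ for $j \in A_0^c$;

    2 estimation consistency:

    $\sqrt{n}(\hat\beta_{A_0} - \beta_{A_0}) \to_d N(0, \Sigma_I)$,

    where $\beta_{A_0} = \{\beta_j, j \in A_0\}$ and $\Sigma_I$ is the covariance matrix obtained when the true model is known.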

  • Shrinkage Methods

    Penalized Regularization Framework

    A major advance was the development of the regularization framework, in which β is estimated by minimizing

    $L(\beta; y, X) + \lambda J(\beta)$

    $L(\beta; y, X)$ is the loss function

    For OLS, $L = \sum_{i=1}^{n} [y_i - \beta_0 - \sum_{j=1}^{d} \beta_j x_{ij}]^2$

    For MLE methods, $L$ is the negative log-likelihood

    For Cox's PH models, $L$ is the negative log partial likelihood

    In supervised learning (classification), $L$ is the hinge loss function (SVM), or the exponential loss (AdaBoost)

  • Shrinkage Methods

    Lq Shrinkage Penalty

    J(β) is the penalty function.

    $J_q(|\beta|) = \|\beta\|_q^q = \sum_{j=1}^{d} |\beta_j|^q$, $q \ge 0$.

    $J_0(|\beta|) = \sum_{j=1}^{d} I(\beta_j \neq 0)$ (L0; Donoho and Johnstone 1988)

    $J_1(|\beta|) = \sum_{j=1}^{d} |\beta_j|$ (LASSO; Tibshirani 1996)

    $J_2(|\beta|) = \sum_{j=1}^{d} \beta_j^2$ (Ridge; Hoerl and Kennard, 1970)

    $J_\infty(|\beta|) = \max_j |\beta_j|$ (Supnorm penalty; Zhang et al. 2008)
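    The penalties listed above can be evaluated with a few lines of R. This is a minimal sketch; the function name lq_penalty and the coefficient vector are illustrative, not part of the lecture.

```r
# Lq penalty J_q(beta) = sum_j |beta_j|^q, with the q = 0 and q = Inf
# cases handled as in the list above.
lq_penalty <- function(beta, q) {
  if (q == 0) return(sum(beta != 0))          # L0: number of nonzero coefficients
  if (is.infinite(q)) return(max(abs(beta)))  # sup-norm penalty
  sum(abs(beta)^q)
}

beta <- c(3, -0.5, 0, 1)                      # illustrative coefficients
penalties <- sapply(c(0, 0.5, 1, 2, Inf), function(q) lq_penalty(beta, q))
names(penalties) <- c("q=0", "q=0.5", "q=1", "q=2", "q=Inf")
print(penalties)
```

    For this vector the L0 penalty counts 3 nonzero entries, the L1 penalty is 4.5 and the L2 penalty is 10.25.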

  • Shrinkage Methods

    Elements of Statistical Learning

    Hastie, Tibshirani & Friedman 2001 Chapter 3

    Figure 3.13: Contours of constant value of $\sum_j |\beta_j|^q$ for given values of q (panels: q = 4, q = 2, q = 1, q = 0.5, q = 0.1).

  • Shrinkage Methods LASSO

    The LASSO

    Applying the absolute penalty on the coefficients



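    $\min_{\beta} \sum_{i=1}^{n} (y_i - \sum_{j=1}^{d} x_{ij}\beta_j)^2 + \lambda \sum_{j=1}^{d} |\beta_j|$

    $\lambda \ge 0$ is a tuning parameter

    $\lambda$ controls the amount of shrinkage; the larger $\lambda$, the greater the amount of shrinkage

    What happens if $\lambda \to 0$? (no penalty)

    What happens if $\lambda \to \infty$?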

  • Shrinkage Methods LASSO

    Equivalent formulation for LASSO



    $\min_{\beta} \sum_{i=1}^{n} (y_i - \sum_{j=1}^{d} x_{ij}\beta_j)^2$, subject to $\sum_{j=1}^{d} |\beta_j| \le t$.

    Explicitly applies a magnitude constraint on the parameters

    Making $t$ sufficiently small will cause some coefficients to be exactly zero.

    The lasso is a kind of continuous subset selection

  • Shrinkage Methods LASSO

    Elements of Statistical Learning

    Hastie, Tibshirani & Friedman 2001 Chapter 3

    Figure 3.12: Estimation picture for the lasso (left) and ridge regression (right). Shown are contours of the error and constraint functions. The solid blue areas are the constraint regions $|\beta_1| + |\beta_2| \le t$ and $\beta_1^2 + \beta_2^2 \le t^2$, respectively, while the red ellipses are the contours of the least squares error function.

  • Shrinkage Methods LASSO

    LASSO Solution: soft thresholding

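    For the orthogonal design case with $X'X = I$, the LASSO is equivalent to solving for the $\beta_j$'s componentwise:

    $\min_{\beta_j} (\beta_j - \hat\beta_j^{ls})^2 + \lambda|\beta_j|$

    The solution to the above problem is

    $\hat\beta_j^{lasso} = \mathrm{sign}(\hat\beta_j^{ls})(|\hat\beta_j^{ls}| - \lambda/2)_+ = \begin{cases} \hat\beta_j^{ls} - \lambda/2 & \text{if } \hat\beta_j^{ls} > \lambda/2 \\ 0 & \text{if } |\hat\beta_j^{ls}| \le \lambda/2 \\ \hat\beta_j^{ls} + \lambda/2 & \text{if } \hat\beta_j^{ls} < -\lambda/2 \end{cases}$

    shrinks big coefficients by a constant $\lambda/2$ towards zero.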

    truncates small coefficients to zero exactly.
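    The soft-thresholding solution follows from a short case analysis of the one-dimensional problem. The derivation below is a sketch added for completeness; the numerical values at the end are illustrative only.

```latex
\begin{aligned}
f(\beta_j) &= (\beta_j - \hat\beta_j^{ls})^2 + \lambda|\beta_j| \\[4pt]
\beta_j > 0:&\quad f'(\beta_j) = 2(\beta_j - \hat\beta_j^{ls}) + \lambda = 0
  \;\Rightarrow\; \beta_j = \hat\beta_j^{ls} - \lambda/2,
  \quad\text{valid only if } \hat\beta_j^{ls} > \lambda/2 \\
\beta_j < 0:&\quad f'(\beta_j) = 2(\beta_j - \hat\beta_j^{ls}) - \lambda = 0
  \;\Rightarrow\; \beta_j = \hat\beta_j^{ls} + \lambda/2,
  \quad\text{valid only if } \hat\beta_j^{ls} < -\lambda/2 \\
\beta_j = 0:&\quad 0 \in -2\hat\beta_j^{ls} + \lambda[-1, 1]
  \;\Leftrightarrow\; |\hat\beta_j^{ls}| \le \lambda/2 \\[4pt]
\text{Example } (\lambda = 1):&\quad
  \hat\beta^{ls} = (2,\ 0.3,\ -1.2)
  \;\Rightarrow\;
  \hat\beta^{lasso} = (1.5,\ 0,\ -0.7)
\end{aligned}
```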

  • Shrinkage Methods LASSO

    Comparison: Different Methods Under Orthogonal Design

    When $X$ is orthonormal, i.e. $X^T X = I$:

    Best subset (of size $M$): $\hat\beta_j^{ls} \cdot I\{\mathrm{rank}(|\hat\beta_j^{ls}|) \le M\}$

    keeps the $M$ largest coefficients

    Ridge regression: $\hat\beta_j^{ls}/(1 + \lambda)$

    does a proportional shrinkage

    Lasso: $\mathrm{sign}(\hat\beta_j^{ls})(|\hat\beta_j^{ls}| - \lambda/2)_+$

    translates each coefficient toward zero by a constant amount first, then truncates it at zero

    “soft thresholding”, used often in wavelet-based smoothing
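    The three rules can be compared side by side on one vector of least squares estimates. This is a sketch under the orthonormal-design assumption; the function names, the vector b, and the values M = 2 and λ = 1 are illustrative.

```r
# Three estimators under an orthonormal design, written as functions of the
# least squares coefficients b. M and lambda are illustrative values.
best_subset <- function(b, M) ifelse(rank(-abs(b)) <= M, b, 0)   # keep the M largest |b_j|
ridge_orth  <- function(b, lambda) b / (1 + lambda)               # proportional shrinkage
lasso_orth  <- function(b, lambda) sign(b) * pmax(abs(b) - lambda / 2, 0)  # soft thresholding

b <- c(4, -2.5, 1, -0.4, 0.1)   # hypothetical least squares estimates
comparison <- cbind(ls     = b,
                    subset = best_subset(b, M = 2),
                    ridge  = ridge_orth(b, lambda = 1),
                    lasso  = lasso_orth(b, lambda = 1))
print(comparison)
```

    Best subset keeps 4 and -2.5 untouched and zeroes the rest; ridge halves every coefficient; the lasso moves each coefficient 0.5 toward zero and sets the two smallest to exactly zero.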

  • Shrinkage Methods Parameter Tuning of LASSO

    How to Choose t or λ

    The tuning parameter $t$ or $\lambda$ should be adaptively chosen to minimize the MSE or PE. Common tuning methods include

    the tuning (validation-set) error, cross validation, leave-one-out cross validation (LOOCV), AIC, or BIC.

    If $t$ is chosen larger than $t_0 = \sum_{j=1}^{d} |\hat\beta_j^{ls}|$, then $\hat\beta_j^{lasso} = \hat\beta_j^{ls}$.

    If $t = t_0/2$, the least squares coefficients are shrunk by about 50% on average.

    In practice, we sometimes standardize $t$ by $s = t/t_0$ in $[0, 1]$, and choose the best value of $s$ from $[0, 1]$.

  • Shrinkage Methods Parameter Tuning of LASSO

    K-fold Cross Validation

    The simplest and most popular way to estimate the tuning error.

    1 Randomly split the data into K roughly equal parts

    2 For each k = 1, ..., K

    leave the kth portion out, and fit the model using the other K − 1 parts. Denote the solution by $\hat\beta^{-k}$

    calculate the prediction errors of $\hat\beta^{-k}$ on the left-out kth portion

    3 Average the prediction errors

    Define the index function (allocating memberships)

    $\kappa : \{1, \dots, n\} \to \{1, \dots, K\}$.

    The K-fold cross-validation estimate of prediction error (PE) is

    $CV = \frac{1}{n} \sum_{i=1}^{n} (y_i - x_i^T \hat\beta^{-\kappa(i)})^2$

    Typical choice: K = 5, 10
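    The K-fold procedure can be written directly from the index function κ. The sketch below uses an ordinary least squares fit as the model; kfold_cv, fit_fun and the simulated data are illustrative names and values, not part of the lecture.

```r
# K-fold cross-validation estimate of prediction error for a least squares fit.
kfold_cv <- function(x, y, K = 5, fit_fun = function(x, y) qr.solve(x, y)) {
  n <- nrow(x)
  kappa <- sample(rep(seq_len(K), length.out = n))   # index function kappa(i)
  sq_err <- numeric(n)
  for (k in seq_len(K)) {
    test <- kappa == k
    beta_minus_k <- fit_fun(x[!test, , drop = FALSE], y[!test])  # fit without fold k
    sq_err[test] <- (y[test] - x[test, , drop = FALSE] %*% beta_minus_k)^2
  }
  mean(sq_err)   # CV = (1/n) * sum_i (y_i - x_i' beta^{-kappa(i)})^2
}

set.seed(1)
n <- 100; d <- 3
x <- scale(matrix(rnorm(n * d), n, d))          # standardized predictors
y <- as.vector(x %*% c(2, 0, -1) + rnorm(n))    # simulated response, noise variance 1
y <- y - mean(y)                                # centered response, no intercept
cv_error <- kfold_cv(x, y, K = 5)
print(cv_error)
```

    Since the simulated noise variance is 1, the CV estimate should land near 1; any fitting routine that returns a coefficient vector can be passed as fit_fun.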

  • Shrinkage Methods Parameter Tuning of LASSO

    Choose λ

    Choose a sequence of λ values.

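    For each $\lambda$, fit the LASSO and denote the solution by $\hat\beta_\lambda^{lasso}$.

    Compute the $CV(\lambda)$ curve as

    $CV(\lambda) = \frac{1}{n} \sum_{i=1}^{n} (y_i - x_i^T \hat\beta_\lambda^{-\kappa(i)})^2$,

    which provides an estimate of the test error curve.

    Find the best parameter $\lambda^*$ which minimizes $CV(\lambda)$.

    Fit the final LASSO model with $\lambda^*$. The final solution is denoted as $\hat\beta_{\lambda^*}^{lasso}$.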
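    The steps above can be sketched in base R. The lasso fit here uses a plain coordinate descent loop with a fixed number of sweeps, chosen only because it is short; it is an assumption of this sketch, not the solver used in the lecture. The λ grid and the simulated data are illustrative.

```r
# Soft-thresholding operator S(z, g) = sign(z) * (|z| - g)_+
soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# Coordinate descent for  sum_i (y_i - x_i' beta)^2 + lambda * sum_j |beta_j|.
# A plain sketch with a fixed number of sweeps, not a tuned solver.
lasso_cd <- function(x, y, lambda, n_sweeps = 200) {
  beta <- numeric(ncol(x))
  xx <- colSums(x^2)
  for (sweep in seq_len(n_sweeps)) {
    for (j in seq_along(beta)) {
      r_j <- y - x[, -j, drop = FALSE] %*% beta[-j]    # partial residual
      beta[j] <- soft(sum(x[, j] * r_j), lambda / 2) / xx[j]
    }
  }
  beta
}

# CV(lambda): K-fold estimate of prediction error for each lambda in the grid.
cv_lasso <- function(x, y, lambdas, K = 5) {
  n <- nrow(x)
  kappa <- sample(rep(seq_len(K), length.out = n))     # same folds for every lambda
  sapply(lambdas, function(lambda) {
    sq_err <- numeric(n)
    for (k in seq_len(K)) {
      test <- kappa == k
      b <- lasso_cd(x[!test, , drop = FALSE], y[!test], lambda)
      sq_err[test] <- (y[test] - x[test, , drop = FALSE] %*% b)^2
    }
    mean(sq_err)
  })
}

set.seed(2)
n <- 80; d <- 6
x <- scale(matrix(rnorm(n * d), n, d))
y <- as.vector(x %*% c(3, -2, 0, 0, 0, 0) + rnorm(n)); y <- y - mean(y)
lambdas <- c(0, 5, 20, 50, 100, 400, 2000)             # illustrative grid
cv_curve <- cv_lasso(x, y, lambdas)
lambda_star <- lambdas[which.min(cv_curve)]
beta_final <- lasso_cd(x, y, lambda_star)              # refit on all data at lambda*
print(rbind(lambda = lambdas, CV = round(cv_curve, 3)))
print(round(beta_final, 3))
```

    Using the same fold assignment for every λ keeps the points on the CV curve comparable. At the largest λ in the grid all coefficients are zero, so the curve there equals the mean of the squared centered responses.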



  • Shrinkage Methods Parameter Tuning of LASSO

    One-standard Error for CV

    One-standard error rule:

    choose the most parsimonious model with error no more than one standard error above the best error.

    used often in model selection (for a smaller model size).
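    The one-standard-error rule is a simple filter on the CV curve. In this sketch, one_se_rule is an illustrative name and the CV means and standard errors are made-up numbers; for the lasso, a larger λ corresponds to a more parsimonious model.

```r
# One-standard-error rule: among candidates whose CV error is within one
# standard error of the minimum, pick the most parsimonious one
# (for the lasso, the largest lambda).
one_se_rule <- function(lambda, cv_mean, cv_se) {
  best <- which.min(cv_mean)
  threshold <- cv_mean[best] + cv_se[best]
  max(lambda[cv_mean <= threshold])
}

lambda  <- c(0.1, 0.5, 1, 2, 5, 10)
cv_mean <- c(1.30, 1.12, 1.05, 1.08, 1.15, 1.60)   # made-up CV errors
cv_se   <- c(0.10, 0.09, 0.08, 0.08, 0.09, 0.12)   # made-up standard errors
lambda_min <- lambda[which.min(cv_mean)]
lambda_1se <- one_se_rule(lambda, cv_mean, cv_se)
print(c(lambda_min = lambda_min, lambda_1se = lambda_1se))
```

    Here the minimum CV error is at λ = 1, the threshold is 1.05 + 0.08 = 1.13, and the rule selects λ = 2.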



  • Shrinkage Methods Computation of LASSO

    Computation of LASSO

    Quadratic Programming

    Shooting Algorithms (Fu 1998; Zhang and Lu, 2007)

    LARS solution path (Efron et al. 2004)

    one of the most efficient algorithms for the LASSO path

    designed for standard linear models

    R package "lars"

    Coordinate Descent Algorithm (CDA; Friedman et al. 2010)

    designed for GLMs

    suitable for ultra-high dimensional problems

    R package "glmnet"

  • Shrinkage Methods Computation of LASSO

    R package: lars

    Provides the entire sequence of coefficients and fits, starting from zero, to the least squares fit.

    lars(x, y, type = c("lasso", "lar", "forward.stagewise",
         "stepwise"), trace = FALSE, normalize = TRUE,
         intercept = TRUE, Gram, eps = .Machine$double.eps,
         max.steps, use.Gram = TRUE)

    A "lars" object is returned, for which print, plot, predict, coef and summary methods exist.

    Details in Efron et al. (2004).

    With the "lasso" option, it computes the complete lasso solution simultaneously for ALL values of the shrinkage parameter (λ, t, or s) at the same order of computational cost as a least squares fit.

  • Shrinkage Methods Computation of LASSO

    Details About LARS function

    x : matrix of predictors

    y : response

    type: one of "lasso", "lar", "forward.stagewise" or "stepwise", all variants of Lasso. Default is "lasso".

    trace: if TRUE, lars prints out its progress

    normalize: if TRUE, each variable is standardized to have unit L2 norm, otherwise it is left alone. Default is TRUE.

    intercept: if TRUE, an intercept is included (not penalized), otherwise no intercept is included. Default is TRUE.

    Gram: the $X^T X$ matrix; useful for repeated runs (bootstrap) where a large $X^T X$ stays the same.

    eps: an effective zero

  • Shrinkage Methods Computation of LASSO

    Elements of Statistical Learning

    Hastie, Tibshirani & Friedman 2001 Chapter 3

    [Figure: coefficient profiles for lcavol, lweight, age, lbph, svi, lcp and gleason, plotted against the shrinkage factor s on a horizontal axis from 0.0 to 1.0.]


  • Shrinkage Methods Computation of LASSO

    Prediction for LARS

    predict(object, newx, s, type = c("fit", "coefficients"),
            mode = c("step", "fraction", "norm", "lambda"), ...)

    coef(object, ...)

    object: a fitted lars object

    newx: if type="fit", then newx should be the x values at which the fit is required. If type="coefficients", then newx can be omitted.

    s: a value, or vector of values, indexing the path. Its values depend on the mode. By default (mode="step"), s takes on values between 0 and p.

    type: if type="fit", predict returns the fitted values. If type="coefficients", predict returns the coefficients.

  • Shrinkage Methods Computation of LASSO

    “Mode” for Prediction in LARS

    If mode="step", then s indexes the lars step number, and the coefficients corresponding to step s are returned.

    If mode="fraction", then s should be a number between 0 and 1, and it refers to the ratio of the L1 norm of the coefficient vector relative to the norm at the full LS solution.

    If mode="norm", then s refers to the L1 norm of the coefficient vector.

    If mode="lambda", then s is the lasso regularization parameter λ.

  • Shrinkage Methods Computation of LASSO

    Example: Diabetes Data




    ## fit LASSO
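    A minimal sketch of such a fit, assuming the CRAN package lars is installed and using the diabetes data that ships with it. The choice s = 0.5 with mode = "fraction" is illustrative, and the block does nothing if the package is missing.

```r
# Assumes the CRAN package 'lars' is installed; skipped otherwise.
if (requireNamespace("lars", quietly = TRUE)) {
  library(lars)
  data(diabetes)                  # data frame holding a predictor matrix x and a response y

  ## fit LASSO: the whole solution path in one call
  fit <- lars(diabetes$x, diabetes$y, type = "lasso")
  plot(fit)                       # coefficient profiles along the path

  # coefficients at half of the L1 norm of the full least squares solution
  b_half <- coef(fit, s = 0.5, mode = "fraction")
  # fitted values at the same point on the path
  yhat <- predict(fit, newx = diabetes$x, s = 0.5,
                  type = "fit", mode = "fraction")$fit
  print(round(b_half, 2))
}
```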


  • Shrinkage Methods Computation of LASSO

    R Code to Compute K-CV Error Curve for Lars

    cv.lars(x, y, K = 10, index, type = c("lasso",
            "lar", "forward.stagewise", "stepwise"),
            mode = c("fraction", "step"), ...)

    x : Input to lars

    y : Input to lars

    K : Number of folds

    index: Abscissa values at which the CV curve should be computed. If mode="fraction", this is the fraction of the saturated |β|, and the default value is index = seq(from = 0, to = 1, length = 100). If mode="step", this is the number of steps in the lars procedure.

  • Shrinkage Methods Computation of LASSO

    R Code: CV.LARS
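    A hedged sketch of how cv.lars might be used to pick the fraction s, again assuming the lars package and its diabetes data are available. It assumes a recent lars version in which the returned list has components index and cv (older versions name the first component differently) and in which the plot.it argument is accepted.

```r
# Assumes the CRAN package 'lars' is installed; skipped otherwise.
if (requireNamespace("lars", quietly = TRUE)) {
  library(lars)
  data(diabetes)
  set.seed(2019)                                   # the fold assignment is random
  cv <- cv.lars(diabetes$x, diabetes$y, K = 10, type = "lasso",
                mode = "fraction", plot.it = FALSE)
  s_star <- cv$index[which.min(cv$cv)]             # fraction s minimizing the CV error
  fit <- lars(diabetes$x, diabetes$y, type = "lasso")
  b_star <- coef(fit, s = s_star, mode = "fraction")   # final coefficients at s*
  print(s_star)
  print(round(b_star, 2))
}
```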



  • Shrinkage Methods Consistency Properties of LASSO

    Consistency - Definition

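    Let the true parameters be $\beta_0$ and their estimates be $\hat\beta_n$. The subscript 'n' denotes the sample size.

    Estimation consistency

    $\hat\beta_n - \beta_0 \to_p 0$, as $n \to \infty$.

    Model selection consistency

    $P(\{j : \hat\beta_j \neq 0\} = \{j : \beta_{j0} \neq 0\}) \to 1$, as $n \to \infty$.

    Sign consistency

    $P(\hat\beta_n =_s \beta_0) \to 1$, as $n \to \infty$,

    where $\hat\beta_n =_s \beta_0 \Leftrightarrow \mathrm{sign}(\hat\beta_n) = \mathrm{sign}(\beta_0)$.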

  • Shrinkage Methods Consistency Properties of LASSO

    Consistency of LASSO Estimator

    Knight and Fu (2000) have shown that

    Estimation Consistency: The LASSO solution is estimation consistent for fixed p. In other words,

    $\hat\beta^{lasso}(\lambda_n) \to_p \beta$, as long as $\lambda_n = o(n)$.

    It is root-n consistent when $\lambda_n = O(\sqrt{n})$, and asymptotically normal when $\lambda_n = o(\sqrt{n})$.

    Model Selection Property: For $\lambda_n \propto n^{1/2}$, as $n \to \infty$, there is a non-vanishing positive probability for the lasso to select the true model.

    The $l_1$ penalization approach is called basis pursuit in signal processing (Chen, Donoho, and Saunders, 2001).

  • Shrinkage Methods Consistency Properties of LASSO

    Model Selection Consistency of LASSO Estimator

    Zhao and Yu (2006). On Model Selection Consistency of Lasso. JMLR, 7, 2541–2563.

    If an irrelevant predictor is highly correlated with the predictors in the true model, the Lasso may not be able to distinguish it from the true predictors with any amount of data and any amount of regularization.

    Under the Irrepresentable Condition (IC), the LASSO is model selection consistent in both fixed and large p settings.

    The IC is represented as

    $|\{[X_n(1)^T X_n(1)]^{-1} X_n(1)^T X_n(2)\}^T \mathrm{sign}(\beta_{(1)})| \le 1 - \eta$ (elementwise, for some $\eta > 0$),

    i.e., the regression coefficients of the irrelevant covariates $X_n(2)$ on the relevant covariates $X_n(1)$ are constrained. It is almost necessary and sufficient for the Lasso to select the true model.
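    The condition can be checked numerically for a given design. The sketch below assumes the form in Zhao and Yu (2006), in which the matrix of regression coefficients is multiplied by the sign vector of the true nonzero coefficients and the bound is applied elementwise. The function name, the simulated design and the sign pattern are all illustrative.

```r
# Check the irrepresentable condition for a given design and true sign pattern.
# x1: relevant columns, x2: irrelevant columns, beta1: true nonzero coefficients.
irrepresentable <- function(x1, x2, beta1) {
  # coefficients from regressing each irrelevant column on the relevant ones
  coef_21 <- solve(crossprod(x1), crossprod(x1, x2))    # (X1'X1)^{-1} X1'X2
  max(abs(crossprod(coef_21, sign(beta1))))             # the IC needs this to be below 1
}

set.seed(3)
n <- 500
z1 <- rnorm(n); z2 <- rnorm(n)
x1 <- cbind(z1, z2)                                     # the two relevant predictors
noise_indep <- rnorm(n)                                 # unrelated to the true predictors
noise_corr  <- 0.7 * z1 + 0.7 * z2 + 0.1 * rnorm(n)     # strongly tied to both
ic_indep <- irrepresentable(x1, cbind(noise_indep), beta1 = c(1, 1))
ic_corr  <- irrepresentable(x1, cbind(noise_corr),  beta1 = c(1, 1))
print(c(independent = ic_indep, correlated = ic_corr))
```

    The independent noise variable gives a value near 0, so the condition holds. The correlated one gives a value near 1.4, so the condition fails for this sign pattern.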

  • Shrinkage Methods Consistency Properties of LASSO

    Special Cases for LASSO to be Selection Consistent

    Assume the true model size $|A_0| = q$.

    The underlying model must satisfy a nontrivial condition if the lasso variable selection is consistent (Zou, 2006).

    The lasso is always consistent in model selection under the following special cases:

    when $p = 2$;

    when the design is orthogonal;

    when the covariates have bounded constant correlation, $0 < r_n < \frac{1}{1 + cq}$ for some $c > 0$;

    when the design has power decay correlation with $\mathrm{corr}(X_i, X_j) = \rho_n^{|i-j|}$, where $|\rho_n| \le c < 1$.

  • Ridge Penalty

    Ridge Regression (q = 2)

    Applying the squared penalty on the coefficients



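    $\min_{\beta} \sum_{i=1}^{n} (y_i - \sum_{j=1}^{d} x_{ij}\beta_j)^2 + \lambda \sum_{j=1}^{d} \beta_j^2$

    $\lambda \ge 0$ is a tuning parameter

    The solution is

    $\hat\beta^{ridge} = (X^T X + \lambda I_d)^{-1} X^T y$

    Note the similarity to the ordinary least squares solution, but with the addition of a “ridge” down the diagonal.

    As $\lambda \to 0$, $\hat\beta^{ridge} \to \hat\beta^{ols}$.

    As $\lambda \to \infty$, $\hat\beta^{ridge} \to 0$.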

  • Ridge Penalty

    Ridge Regression Solution

    In the special case of an orthonormal design matrix,

    $\hat\beta_j^{ridge} = \frac{\hat\beta_j^{ols}}{1 + \lambda}$

    This illustrates the essential “shrinkage” feature of ridge regression

    Applying the ridge regression penalty has the effect of shrinking the estimates toward zero

    The penalty introduces bias but reduces the variance of the estimate

    The ridge estimator does not threshold, since the shrinkage is smooth (proportional to the original coefficient).

  • Ridge Penalty


    If the design matrix $X$ is not of full rank, then

    $X^T X$ is not invertible,

    and for the OLS estimator there is no unique solution $\hat\beta^{ols}$.

    This problem does not occur with ridge regression, however.

    Theorem: For any design matrix $X$ and any $\lambda > 0$, the matrix $X^T X + \lambda I_d$ is always invertible; thus, there is always a unique solution $\hat\beta^{ridge}$.

    For the ridge estimator, an unbiased estimate of its df is

    $\widehat{df}(\hat\beta^{ridge}) = \mathrm{Tr}\{X (X^T X + \lambda I_d)^{-1} X^T\}$.
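    Both facts can be seen on a small rank-deficient design. In this sketch the function names ridge_fit and ridge_df and the simulated data are illustrative; the third column of x is built as the sum of the first two, so the matrix X'X is singular.

```r
# Ridge estimator (X'X + lambda I)^{-1} X'y and its degrees of freedom
# Tr{ X (X'X + lambda I)^{-1} X' }.
ridge_fit <- function(x, y, lambda) {
  solve(crossprod(x) + lambda * diag(ncol(x)), crossprod(x, y))
}
ridge_df <- function(x, lambda) {
  h <- x %*% solve(crossprod(x) + lambda * diag(ncol(x)), t(x))   # hat matrix
  sum(diag(h))
}

set.seed(4)
n <- 30
x1 <- rnorm(n); x2 <- rnorm(n)
x <- cbind(x1, x2, x1 + x2)          # third column is a linear combination: rank 2
y <- x1 - x2 + rnorm(n)

rank_x <- qr(x)$rank                 # 2, so X'X is singular and OLS is not unique
beta_ridge <- ridge_fit(x, y, lambda = 1)          # still well defined
df_values <- sapply(c(0.01, 1, 10, 1000), function(l) ridge_df(x, l))
print(rank_x)
print(drop(beta_ridge))
print(round(df_values, 3))
```

    The degrees of freedom decrease as λ grows and stay below the rank of X, here 2, for every λ > 0.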

  • Ridge Penalty

    Bias and Variance

    The variance of the ridge regression estimate is

    $\mathrm{Var}(\hat\beta^{ridge}) = \sigma^2 W X^T X W$,

    where $W = (X^T X + \lambda I_d)^{-1}$.

    The bias of the ridge regression estimate is

    $\mathrm{Bias}(\hat\beta^{ridge}) = -\lambda W \beta$.

    It can be shown that

    the total variance $\sum_{j=1}^{d} \mathrm{Var}(\hat\beta_j^{ridge})$ is a monotone decreasing function of $\lambda$,

    while the total squared bias $\sum_{j=1}^{d} \mathrm{Bias}^2(\hat\beta_j^{ridge})$ is a monotone increasing function of $\lambda$.

  • Ridge Penalty

    Mean Squared Error (MSE) of Ridge

    Existence Theorem:

    There always exists a $\lambda > 0$ such that the MSE of $\hat\beta^{ridge}$ is less than the MSE of $\hat\beta^{ols}$.

    a rather surprising result with somewhat radical implications: even if the model we fit is exactly correct and follows the exact distribution we specify, we can always obtain a better (in terms of MSE) estimator by shrinking towards zero.
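    The bias-variance trade-off and the existence theorem can be illustrated numerically from the closed forms Var = σ²WXᵀXW and Bias = −λWβ. This is a sketch; the design, the true β, σ² and the λ grid are made up for illustration.

```r
# Total variance, total squared bias and MSE of the ridge estimator as functions
# of lambda, computed from the closed forms with W = (X'X + lambda I)^{-1}.
ridge_mse_parts <- function(x, beta, sigma2, lambda) {
  w <- solve(crossprod(x) + lambda * diag(ncol(x)))
  total_var   <- sigma2 * sum(diag(w %*% crossprod(x) %*% w))
  total_bias2 <- sum((lambda * w %*% beta)^2)
  c(var = total_var, bias2 = total_bias2, mse = total_var + total_bias2)
}

set.seed(5)
n <- 40; d <- 4
x <- scale(matrix(rnorm(n * d), n, d))      # illustrative standardized design
beta <- c(1, 0.5, 0, -0.5); sigma2 <- 4     # illustrative truth and noise variance
lambdas <- c(0, 1, 5, 10, 50, 200)          # lambda = 0 is the OLS estimator
parts <- sapply(lambdas, function(l) ridge_mse_parts(x, beta, sigma2, l))
colnames(parts) <- paste0("lambda=", lambdas)
print(round(parts, 4))
```

    The variance row decreases and the squared-bias row increases along the grid, and the MSE at λ = 1 is below the OLS value at λ = 0. For large λ the bias dominates and the MSE rises again.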

  • Ridge Penalty

    Bayesian Interpretation of Ridge Estimation

    Penalized regression can be interpreted in a Bayesian context:

    Suppose $y = X\beta + \epsilon$, where $\epsilon \sim N(0, \sigma^2 I_n)$.

    Given the prior $\beta \sim N(0, \tau^2 I_d)$, the posterior mean of $\beta$ given the data is

    $(X^T X + \frac{\sigma^2}{\tau^2} I_d)^{-1} X^T y$.
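    The link to ridge regression can be made explicit by writing out the log posterior. The derivation below is a sketch that treats σ² and τ² as known.

```latex
\begin{aligned}
p(\beta \mid y) &\propto
  \exp\!\Big(-\tfrac{1}{2\sigma^2}\|y - X\beta\|^2\Big)\,
  \exp\!\Big(-\tfrac{1}{2\tau^2}\|\beta\|^2\Big) \\[4pt]
-2\sigma^2 \log p(\beta \mid y) &=
  \|y - X\beta\|^2 + \frac{\sigma^2}{\tau^2}\,\|\beta\|^2 + \text{const} \\[4pt]
\text{The posterior is Gaussian, so mean} &= \text{mode} =
  \arg\min_{\beta}\ \|y - X\beta\|^2 + \lambda\|\beta\|^2,
  \qquad \lambda = \frac{\sigma^2}{\tau^2} \\[4pt]
E(\beta \mid y) &= \Big(X^T X + \frac{\sigma^2}{\tau^2} I_d\Big)^{-1} X^T y
\end{aligned}
```

    A tighter prior (small τ²) corresponds to a larger ridge penalty λ.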



  • Ridge Penalty

    R function: lm.ridge (package MASS)


    lm.ridge(formula, data, subset, na.action, lambda)

    formula: a formula expression as for regression models, of the form response ~ predictors

    data: an optional data frame in which to interpret the variables occurring in formula.

    subset: expression saying which subset of the rows of the data should be used in the fit. All observations are included by default.

    na.action: a function to filter missing data.

    lambda: A scalar or vector of ridge constants.

    If an intercept is present, its coefficient is not penalized.

  • Ridge Penalty

    Output for lm.ridge

    Returns a list with components

    coef: matrix of coefficients, one row for each value of lambda. Note that these are not on the original scale and are for use by the coef method.

    scales: scalings used on the X matrix.

    Inter: was intercept included?

    lambda: vector of lambda values

    ym: mean of y

    xm: column means of x matrix

    GCV: vector of GCV values

  • Ridge Penalty

    Example for Ridge Regression
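    A minimal sketch of a ridge fit with lm.ridge, assuming the MASS package is available (it is normally shipped with R). The built-in longley data and the λ grid are illustrative choices, not necessarily the example used in the lecture.

```r
# Assumes the recommended package 'MASS' is available; skipped otherwise.
# longley is a built-in data set with highly collinear predictors.
if (requireNamespace("MASS", quietly = TRUE)) {
  lambdas <- seq(0, 0.1, by = 0.001)                    # illustrative grid of ridge constants
  fit <- MASS::lm.ridge(Employed ~ ., data = longley, lambda = lambdas)
  lambda_gcv <- fit$lambda[which.min(fit$GCV)]          # ridge constant minimizing GCV
  coef_path <- coef(fit)                                # one row per lambda, original scale
  print(lambda_gcv)
  print(round(coef_path[which.min(fit$GCV), ], 4))
}
```

    The coef method returns the coefficients on the original scale, including the unpenalized intercept, unlike the raw coef component of the returned list.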




  • Ridge Penalty

    Elements of Statistical Learning

    Hastie, Tibshirani & Friedman 2001 Chapter 3

    Figure 3.7: Profiles of ridge coefficients for the prostate cancer example, as tuning parameter λ is varied. Coefficients are plotted versus df(λ), the effective degrees of freedom. A vertical line is drawn at df = 4.16, the value chosen by cross-validation.

  • Ridge Penalty Ridge vs LASSO

    Lasso vs Ridge

    Ridge regression:

    is a continuous shrinkage method; achieves better prediction performance than OLS through a bias-variance trade-off.

    cannot produce a parsimonious model, for it always keeps all the predictors in the model.

    The lasso

    does both continuous shrinkage and automatic variable selection simultaneously.

  • Ridge Penalty Ridge vs LASSO

    Empirical Comparison

    Tibshirani (1996) and Fu (1998) compared the prediction performance of the lasso, ridge and bridge regression (Frank and Friedman, 1993) and found:

    None of them uniformly dominates the other two.

    However, as variable selection becomes increasingly important in modern data analysis, the lasso is much more appealing owing to its sparse representation.

