
August 2007 Trevor Hastie, Stanford Statistics 1

Least Angle Regression

Brad Efron et al.∗ (2004)

Stanford University

∗ Trevor Hastie, Iain Johnstone, Rob Tibshirani


August 2007 Trevor Hastie, Stanford Statistics 2


August 2007 Trevor Hastie, Stanford Statistics 3

The Beginning

It all started with this picture on page 330 of our Elements of Statistical Learning (2001). This picture links boosting to the lasso, and is intended to explain how boosting fits models in high-dimensional space.

[Figure: Coefficient profiles for the prostate cancer data (predictors lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45). Left panel: Lasso coefficients plotted against t = Σj |βj|. Right panel: Forward Stagewise coefficients plotted against iteration number.]


August 2007 Trevor Hastie, Stanford Statistics 4

[Figure: Adaboost Stumps for Classification. Test misclassification error versus number of iterations (0 to 1000), for Adaboost Stump and Adaboost Stump with shrinkage 0.1.]


August 2007 Trevor Hastie, Stanford Statistics 5

Least Squares Boosting with Trees

Elements of Statistical Learning (chapter 10)

Response y, predictors x = (x1, x2, . . . , xp).

1. Start with function F(x) = 0 and residual r = y.

2. Fit a CART regression tree to r, giving f(x).

3. Set F(x) ← F(x) + εf(x), r ← r − εf(x), and repeat steps 2 and 3 many times.
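A minimal R sketch of these three steps, using rpart for the CART fits (the shrinkage ε, tree depth, and number of steps here are illustrative choices, not values from the talk):

```r
# Least squares boosting with trees: steps 1-3 above.
library(rpart)

ls_boost_trees <- function(x, y, n_steps = 500, eps = 0.01, depth = 2) {
  r <- y                                    # step 1: F(x) = 0, residual r = y
  trees <- vector("list", n_steps)
  for (m in seq_len(n_steps)) {
    dat <- cbind(r = r, x)                  # x is a data frame of predictors
    trees[[m]] <- rpart(r ~ ., data = dat,  # step 2: fit a small tree to r
                        control = rpart.control(maxdepth = depth))
    r <- r - eps * predict(trees[[m]], x)   # step 3: shrink by eps, update r
  }
  trees                                     # F(x) = eps * sum of tree fits
}
```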


August 2007 Trevor Hastie, Stanford Statistics 6

Least Squares Boosting for Linear Regression

Here is a version of least squares boosting for linear regression (assume the predictors are standardized):

(Incremental) Forward Stagewise

1. Start with r = y and β1 = β2 = · · · = βp = 0.

2. Find the predictor xj most correlated with r.

3. Update βj ← βj + δj , where δj = ε · sign〈r, xj〉.

4. Set r ← r − δj · xj and repeat steps 2–4 many times.

δj = 〈r, xj〉 gives the usual forward stagewise; this is different from forward stepwise.

Analogous to least squares boosting, with trees = predictors.
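In base R the loop is only a few lines. A sketch, assuming the columns of x are standardized and y is centered; ε and the step count are illustrative:

```r
# (Incremental) forward stagewise for linear regression, steps 1-4 above.
forward_stagewise <- function(x, y, eps = 0.01, n_steps = 5000) {
  beta <- rep(0, ncol(x))
  r <- y                                  # step 1
  for (step in seq_len(n_steps)) {
    cors <- drop(crossprod(x, r))         # <r, x_j> for each predictor
    j <- which.max(abs(cors))             # step 2: most correlated predictor
    delta <- eps * sign(cors[j])          # step 3: tiny signed step
    beta[j] <- beta[j] + delta
    r <- r - delta * x[, j]               # step 4: update the residual
  }
  beta
}
```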


August 2007 Trevor Hastie, Stanford Statistics 7

[Figure (as on slide 3): Lasso versus Forward Stagewise coefficient profiles for the prostate cancer data, plotted against t = Σj |βj| and iteration number respectively.]


August 2007 Trevor Hastie, Stanford Statistics 8

Least Angle Regression — LAR

Like a “more democratic” version of forward stepwise regression.

1. Start with r = y and β1 = β2 = · · · = βp = 0. Assume the xj are standardized.

2. Find the predictor xj most correlated with r.

3. Increase βj in the direction of sign〈r, xj〉 until some other competitor xk has as much correlation with the current residual as does xj.

4. Move (βj , βk) in the joint least squares direction for (xj , xk) until some other competitor xℓ has as much correlation with the current residual.

5. Continue in this way until all predictors have been entered. Stop when 〈r, xj〉 = 0 for all j, i.e. the OLS solution.
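A didactic base-R implementation of these steps (a sketch, not the lars package code): it assumes standardized predictors and centered y, exploits the fact that moving toward the least squares fit on the active set keeps the active correlations tied, and omits the lasso drop step discussed later.

```r
# Didactic LAR, steps 1-5 above (no lasso modification, no drops).
lar_path <- function(x, y) {
  p <- ncol(x)
  beta <- rep(0, p)
  r <- y                                               # step 1
  active <- which.max(abs(drop(crossprod(x, r))))      # step 2
  path <- matrix(0, p + 1, p)
  for (k in seq_len(p)) {
    xa <- x[, active, drop = FALSE]
    c_all <- drop(crossprod(x, r))                     # current correlations
    C <- max(abs(c_all))
    d <- drop(solve(crossprod(xa), crossprod(xa, r)))  # joint LS direction
    u <- drop(xa %*% d)                                # fit direction; <xj,u> = cj on A
    a <- drop(crossprod(x, u))
    inactive <- setdiff(seq_len(p), active)
    if (length(inactive) > 0) {
      # smallest step g at which an inactive xj ties: |cj - g*aj| = C(1 - g)
      g <- c((C - c_all[inactive]) / (C - a[inactive]),
             (C + c_all[inactive]) / (C + a[inactive]))
      gamma <- min(g[g > 1e-12], 1)
    } else {
      gamma <- 1                                       # final step: go to OLS
    }
    beta[active] <- beta[active] + gamma * d           # steps 3-4
    r <- r - gamma * u
    path[k + 1, ] <- beta
    if (length(inactive) > 0) {                        # admit the tied competitor
      c_new <- abs(drop(crossprod(x, r)))
      active <- c(active, inactive[which.max(c_new[inactive])])
    }
  }
  path                                                 # row k+1 = coefficients after step k
}
```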


August 2007 Trevor Hastie, Stanford Statistics 9

[Figure: LAR geometry with two predictors x1 and x2, successive fits µ0 and µ1, and equiangular direction u2.]

The LAR direction u2 at step 2 makes an equal angle with x1 and x2.


August 2007 Trevor Hastie, Stanford Statistics 10

[Figure: Absolute correlations |〈xj , r〉| for each of the 10 predictors, plotted against LARS steps 1–10; the maximal correlation Ck decreases as the steps proceed.]

• Maximal correlations Ck decrease with the steps.

• Lars/lasso decreases RSS optimally with the increase in ||β||1.

• The Dantzig Selector decreases the maximum |〈xj , r〉| optimally with the increase in ||β||1 (Candes and Tao, 2006).


August 2007 Trevor Hastie, Stanford Statistics 11

[Figure: Coefficient paths βj plotted against t = Σ |βj| for the Lasso (left panel) and LAR (right panel), with the 10 predictors entering in sequence; the two sets of paths are nearly identical.]


August 2007 Trevor Hastie, Stanford Statistics 12

[Figure: Coefficient paths βj plotted against t = Σ |βj| for the Lasso (left panel) and Forward Stagewise (right panel); again the paths are very similar.]


August 2007 Trevor Hastie, Stanford Statistics 13

LAR, Lasso and Forward Stagewise

• Not always identical.

• In the orthogonal-predictor case: yes.

• In the hard-to-verify case of monotone coefficient paths: yes.

• In general, almost!

• The LAR algorithm can be simply modified to give both the Lasso and Forward Stagewise paths.


August 2007 Trevor Hastie, Stanford Statistics 14

Lasso via LAR

Start with LAR. If a coefficient crosses zero, stop. Drop that predictor, recompute the best direction, and continue. This gives the Lasso path.

“Proof”: use the KKT conditions for the appropriate Lagrangian. Informally:

∂/∂βj [ ½ ||y − Xβ||² + λ Σj |βj| ] = 0

⇔ 〈xj , r〉 = λ · sign(βj) if βj ≠ 0 (active)
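(For inactive variables the companion KKT condition is |〈xj , r〉| ≤ λ when βj = 0; a coefficient that crossed zero without being dropped would violate the sign condition above, which is why the drop step is needed.)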


August 2007 Trevor Hastie, Stanford Statistics 15

Forward Stagewise via LAR

Compute the LAR direction, but constrain the signs of the coefficients to match the correlations 〈xj , r〉.

The incremental forward stagewise procedure approximates these steps, one predictor at a time. As the step size ε → 0, one can show that it coincides with this modified version of LAR.


August 2007 Trevor Hastie, Stanford Statistics 16

[Figure: The positive cone generated by the signed predictors x1, x2, x3, with candidate directions vA and vB.]

The Forward Stagewise direction lies in the positive cone spanned by the (signed) predictors with equal correlation with the current residual.


August 2007 Trevor Hastie, Stanford Statistics 17

lars package

• The LARS algorithm computes the entire Lasso/FS/LAR path in the same order of computation as one full least squares fit.

• When p ≫ N, the solution has at most N non-zero coefficients. Works efficiently for micro-array data (p in the thousands).

• Cross-validation is quick and easy.

• Available for R (CRAN) and Splus.
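A typical session might look like this (a sketch; the diabetes data from the LARS paper ships with the package):

```r
# Fit and plot an entire path with the lars package.
library(lars)                      # install.packages("lars")
data(diabetes)                     # LARS-paper data, included in the package
fit <- lars(diabetes$x, diabetes$y, type = "lasso")  # also "lar", "forward.stagewise"
plot(fit)                          # the whole coefficient path, as in the figures
cv  <- cv.lars(diabetes$x, diabetes$y)  # K-fold cross-validated error curve
```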


August 2007 Trevor Hastie, Stanford Statistics 18

df for LAR

[Figure: LAR standardized coefficient paths plotted against |β(λ)|/|β(0)|, with the df values 0, 2, 3, 4, 5, 7, 8, 10 labeled along the top.]

• The df are labeled at the top of the figure.

• At the point a competitor enters the active set, the df are incremented by 1.

• This is not true, for example, for stepwise regression.

Stein’s Lemma: df(µ) ≝ ∑ᵢ₌₁ⁿ cov(µᵢ, yᵢ)/σ²
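(Efron et al. 2004 make this precise: under a positive cone condition, the df of the k-step LAR fit is exactly k.)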


August 2007 Trevor Hastie, Stanford Statistics 19

Lasso or Forward Stagewise?

• Micro-array example (Golub data): N = 38, p = 7129, binary response (ALL vs. AML).

• Lasso behaves chaotically near the end of the path, while Forward Stagewise is smooth, stable, and mostly monotone.

[Figure: Standardized coefficient paths against |beta|/max|beta| for the LASSO (left panel) and Forward Stagewise (right panel) on the Golub data, with a few gene indices (e.g. 2968, 6801, 2945, 2267) labeled.]


August 2007 Trevor Hastie, Stanford Statistics 20

LARS started a path frenzy!

• Elasticnet (Zou and Hastie, 2005): a compromise between lasso and ridge. Minimize Σi (yi − Σj xij βj)² subject to α||β||1 + (1 − α)||β||2² ≤ t. Useful for situations where variables operate in correlated groups (genes in pathways).

• Glmpath (Park and Hastie, 2005): approximates the L1 regularization path for generalized linear models, e.g. logistic regression and Poisson regression.

• Grouped Lasso (Yuan and Lin, 2006, following Lin and Zhang, 2002): extends the lasso to deal with groups of variables (e.g. dummy variables for factors):

maxβ ℓ(y; β) − λ Σm γm ||βm||2, the sum taken over the M groups.


August 2007 Trevor Hastie, Stanford Statistics 21

The ℓ2 norm ensures the coefficients βm for each group are all zero or non-zero together. Using predictor-corrector methods we can construct the path for this criterion: Park & Hastie (2006), Regularization path algorithms for detecting gene interactions.

• Bin Yu's group: Boosted Lasso and Composite Absolute Penalties.

• Rosset and Zhu (2004) discuss the conditions needed to obtain piecewise-linear paths. A combination of a piecewise quadratic/linear loss function and an L1 penalty is sufficient.

• Svmpath: the entire regularization path for the SVM. Exploits the previous point: the loss is piecewise linear and the penalty is quadratic (Hastie, Rosset, Tibshirani and Zhu, JMLR 2004).

• Bach and Jordan (2004) have path algorithms for kernel estimation, and for efficient ROC curve estimation. The latter is a useful generalization of the svmpath algorithm discussed above.

August 2007 Trevor Hastie, Stanford Statistics 22

• Friedman and Popescu (2004) extend the incremental forward stagewise idea by updating all coefficients whose gradient exceeds a predefined threshold.

• fused lasso, graphical lasso, . . . and many more.


August 2007 Trevor Hastie, Stanford Statistics 23

Pathwise Coordinate Descent

Many problems have the form

min over β1, …, βp of [ R(y, β) + λ Σj Pj(βj) ].

• If R and the Pj are convex, and R is differentiable, then coordinate descent converges to the solution.

• Often each coordinate step is trivial. E.g. for the lasso, it amounts to soft-thresholding, with many steps leaving βj = 0.

• Decreasing λ slowly means not much cycling is needed.

• Coordinate moves can exploit sparsity. Currently we (JHF, TH, RT) fit a lasso logistic regression path (100 λ values) with N = 11K, p = 4.8M and sparsity 0.0001 in under 2 minutes. A sketch of the lasso case follows this list.
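A minimal sketch in base R: squared error for R, columns of x scaled so that 〈xj , xj〉/n = 1, soft-thresholding as the exact coordinate update, and warm starts down a user-supplied grid of λ values:

```r
# Soft-thresholding: the exact coordinate-wise minimizer for the lasso.
soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# Pathwise coordinate descent for
#   (1/2n)||y - x beta||^2 + lambda * sum_j |beta_j|,
# assuming crossprod(x[, j]) / n == 1 for every column j.
lasso_cd_path <- function(x, y, lambdas, tol = 1e-7) {
  n <- nrow(x); p <- ncol(x)
  beta <- rep(0, p)
  r <- y                                   # full residual, kept current
  path <- matrix(0, length(lambdas), p)
  for (k in seq_along(lambdas)) {          # warm starts down the lambda grid
    repeat {
      max_change <- 0
      for (j in seq_len(p)) {
        old <- beta[j]
        z <- drop(crossprod(x[, j], r)) / n + old   # partial-residual coordinate
        beta[j] <- soft(z, lambdas[k])              # exact coordinate update
        if (beta[j] != old) {              # sparsity: most updates are no-ops
          r <- r - (beta[j] - old) * x[, j]
          max_change <- max(max_change, abs(beta[j] - old))
        }
      }
      if (max_change < tol) break          # cycle until coordinates stabilize
    }
    path[k, ] <- beta
  }
  path
}
```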


August 2007 Trevor Hastie, Stanford Statistics 24

Conclusions

Brad: on behalf of all the path crazies, thanks for LARS and happy birthday!



And by the way, a great movie ...
