
August 2007 Trevor Hastie, Stanford Statistics 1

Least Angle Regression

Brad Efron et al.∗ (2004)

Stanford University

∗ Trevor Hastie, Iain Johnstone, Rob Tibshirani


August 2007 Trevor Hastie, Stanford Statistics 2


August 2007 Trevor Hastie, Stanford Statistics 3

The Beginning

It all started with this picture on page 330 of our Elements of Statistical Learning (2001). This picture links boosting to the lasso, and is intended to explain how boosting fits models in high-dimensional space.

[Figure: Coefficient profiles for the prostate cancer data (predictors lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45). Left panel: Lasso coefficients plotted against t = Σj |βj|. Right panel: Forward Stagewise coefficients plotted against iteration number.]


August 2007 Trevor Hastie, Stanford Statistics 4

[Figure: Adaboost Stumps for Classification. Test misclassification error versus number of iterations (0 to 1000), for Adaboost Stump and Adaboost Stump with shrinkage 0.1.]


August 2007 Trevor Hastie, Stanford Statistics 5

Least Squares Boosting with Trees

Elements of Statistical Learning (chapter 10)

Response y, predictors x = (x1, x2, . . . , xp).

1. Start with function F(x) = 0 and residual r = y.

2. Fit a CART regression tree to r, giving f(x).

3. Set F(x) ← F(x) + εf(x), r ← r − εf(x), and repeat steps 2 and 3 many times.
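A minimal R sketch of these three steps, using rpart for the CART fits (the shrinkage ε, tree depth, and number of steps here are illustrative choices, not values from the talk):

```r
# Least squares boosting with trees: steps 1-3 above.
library(rpart)

ls_boost_trees <- function(x, y, n_steps = 500, eps = 0.01, depth = 2) {
  r <- y                                    # step 1: F(x) = 0, residual r = y
  trees <- vector("list", n_steps)
  for (m in seq_len(n_steps)) {
    dat <- cbind(r = r, x)                  # x is a data frame of predictors
    trees[[m]] <- rpart(r ~ ., data = dat,  # step 2: fit a small tree to r
                        control = rpart.control(maxdepth = depth))
    r <- r - eps * predict(trees[[m]], x)   # step 3: shrink by eps, update r
  }
  trees                                     # F(x) = eps * sum of tree fits
}
```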


August 2007 Trevor Hastie, Stanford Statistics 6

Least Squares Boosting for Linear Regression

Here is a version of least squares boosting for linear regression (assume the predictors are standardized):

(Incremental) Forward Stagewise

1. Start with r = y and β1 = β2 = · · · = βp = 0.

2. Find the predictor xj most correlated with r.

3. Update βj ← βj + δj , where δj = ε · sign〈r, xj〉.

4. Set r ← r − δj · xj and repeat steps 2–4 many times.

δj = 〈r, xj〉 gives the usual forward stagewise; this is different from forward stepwise.

Analogous to least squares boosting, with trees = predictors.
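In base R the loop is only a few lines. A sketch, assuming the columns of x are standardized and y is centered; ε and the step count are illustrative:

```r
# (Incremental) forward stagewise for linear regression, steps 1-4 above.
forward_stagewise <- function(x, y, eps = 0.01, n_steps = 5000) {
  beta <- rep(0, ncol(x))
  r <- y                                  # step 1
  for (step in seq_len(n_steps)) {
    cors <- drop(crossprod(x, r))         # <r, x_j> for each predictor
    j <- which.max(abs(cors))             # step 2: most correlated predictor
    delta <- eps * sign(cors[j])          # step 3: tiny signed step
    beta[j] <- beta[j] + delta
    r <- r - delta * x[, j]               # step 4: update the residual
  }
  beta
}
```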


August 2007 Trevor Hastie, Stanford Statistics 7

[Figure (as on slide 3): Lasso versus Forward Stagewise coefficient profiles for the prostate cancer data, plotted against t = Σj |βj| and iteration number respectively.]


August 2007 Trevor Hastie, Stanford Statistics 8

Least Angle Regression — LAR

Like a “more democratic” version of forward stepwise regression.

1. Start with r = y and β1 = β2 = · · · = βp = 0. Assume the xj are standardized.

2. Find the predictor xj most correlated with r.

3. Increase βj in the direction of sign〈r, xj〉 until some other competitor xk has as much correlation with the current residual as does xj.

4. Move (βj , βk) in the joint least squares direction for (xj , xk) until some other competitor xℓ has as much correlation with the current residual.

5. Continue in this way until all predictors have been entered. Stop when 〈r, xj〉 = 0 for all j, i.e. the OLS solution.
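A didactic base-R implementation of these steps (a sketch, not the lars package code): it assumes standardized predictors and centered y, exploits the fact that moving toward the least squares fit on the active set keeps the active correlations tied, and omits the lasso drop step discussed later.

```r
# Didactic LAR, steps 1-5 above (no lasso modification, no drops).
lar_path <- function(x, y) {
  p <- ncol(x)
  beta <- rep(0, p)
  r <- y                                               # step 1
  active <- which.max(abs(drop(crossprod(x, r))))      # step 2
  path <- matrix(0, p + 1, p)
  for (k in seq_len(p)) {
    xa <- x[, active, drop = FALSE]
    c_all <- drop(crossprod(x, r))                     # current correlations
    C <- max(abs(c_all))
    d <- drop(solve(crossprod(xa), crossprod(xa, r)))  # joint LS direction
    u <- drop(xa %*% d)                                # fit direction; <xj,u> = cj on A
    a <- drop(crossprod(x, u))
    inactive <- setdiff(seq_len(p), active)
    if (length(inactive) > 0) {
      # smallest step g at which an inactive xj ties: |cj - g*aj| = C(1 - g)
      g <- c((C - c_all[inactive]) / (C - a[inactive]),
             (C + c_all[inactive]) / (C + a[inactive]))
      gamma <- min(g[g > 1e-12], 1)
    } else {
      gamma <- 1                                       # final step: go to OLS
    }
    beta[active] <- beta[active] + gamma * d           # steps 3-4
    r <- r - gamma * u
    path[k + 1, ] <- beta
    if (length(inactive) > 0) {                        # admit the tied competitor
      c_new <- abs(drop(crossprod(x, r)))
      active <- c(active, inactive[which.max(c_new[inactive])])
    }
  }
  path                                                 # row k+1 = coefficients after step k
}
```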


August 2007 Trevor Hastie, Stanford Statistics 9

[Figure: LAR geometry with two predictors x1 and x2, successive fits µ0 and µ1, and equiangular direction u2.]

The LAR direction u2 at step 2 makes an equal angle with x1 and x2.


August 2007 Trevor Hastie, Stanford Statistics 10

[Figure: Absolute correlations |〈xj , r〉| for each of the 10 predictors, plotted against LARS steps 1–10; the maximal correlation Ck decreases as the steps proceed.]

• Maximal correlations Ck decrease with the steps.

• Lars/lasso decreases RSS optimally with the increase in ||β||1.

• The Dantzig Selector decreases the maximum |〈xj , r〉| optimally with the increase in ||β||1 (Candes and Tao, 2006).


August 2007 Trevor Hastie, Stanford Statistics 11

[Figure: Coefficient paths βj plotted against t = Σ |βj| for the Lasso (left panel) and LAR (right panel), with the 10 predictors entering in sequence; the two sets of paths are nearly identical.]


August 2007 Trevor Hastie, Stanford Statistics 12

[Figure: Coefficient paths βj plotted against t = Σ |βj| for the Lasso (left panel) and Forward Stagewise (right panel); again the paths are very similar.]


August 2007 Trevor Hastie, Stanford Statistics 13

LAR, Lasso and Forward Stagewise

• Not always identical.

• In the orthogonal-predictor case: yes.

• In the hard-to-verify case of monotone coefficient paths: yes.

• In general, almost!

• The LAR algorithm can be simply modified to give both the Lasso and Forward Stagewise paths.


August 2007 Trevor Hastie, Stanford Statistics 14

Lasso via LAR

Start with LAR. If a coefficient crosses zero, stop. Drop that predictor, recompute the best direction, and continue. This gives the Lasso path.

“Proof”: use the KKT conditions for the appropriate Lagrangian. Informally:

∂/∂βj [ ½ ||y − Xβ||² + λ Σj |βj| ] = 0

⇔ 〈xj , r〉 = λ · sign(βj) if βj ≠ 0 (active)
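(For inactive variables the companion KKT condition is |〈xj , r〉| ≤ λ when βj = 0; a coefficient that crossed zero without being dropped would violate the sign condition above, which is why the drop step is needed.)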


August 2007 Trevor Hastie, Stanford Statistics 15

Forward Stagewise via LAR

Compute the LAR direction, but constrain the signs of the coefficients to match the correlations 〈xj , r〉.

The incremental forward stagewise procedure approximates these steps, one predictor at a time. As the step size ε → 0, one can show that it coincides with this modified version of LAR.


August 2007 Trevor Hastie, Stanford Statistics 16

[Figure: The positive cone generated by the signed predictors x1, x2, x3, with candidate directions vA and vB.]

The Forward Stagewise direction lies in the positive cone spanned by the (signed) predictors with equal correlation with the current residual.


August 2007 Trevor Hastie, Stanford Statistics 17

lars package

• The LARS algorithm computes the entire Lasso/FS/LAR path in the same order of computation as one full least squares fit.

• When p ≫ N, the solution has at most N non-zero coefficients. Works efficiently for micro-array data (p in the thousands).

• Cross-validation is quick and easy.

• Available for R (CRAN) and Splus.
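A typical session might look like this (a sketch; the diabetes data from the LARS paper ships with the package):

```r
# Fit and plot an entire path with the lars package.
library(lars)                      # install.packages("lars")
data(diabetes)                     # LARS-paper data, included in the package
fit <- lars(diabetes$x, diabetes$y, type = "lasso")  # also "lar", "forward.stagewise"
plot(fit)                          # the whole coefficient path, as in the figures
cv  <- cv.lars(diabetes$x, diabetes$y)  # K-fold cross-validated error curve
```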


August 2007 Trevor Hastie, Stanford Statistics 18

df for LAR

[Figure: LAR standardized coefficient paths plotted against |β(λ)|/|β(0)|, with the df values 0, 2, 3, 4, 5, 7, 8, 10 labeled along the top.]

• The df are labeled at the top of the figure.

• At the point a competitor enters the active set, the df are incremented by 1.

• This is not true, for example, for stepwise regression.

Stein’s Lemma: df(µ) ≝ ∑ᵢ₌₁ⁿ cov(µᵢ, yᵢ)/σ²
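(Efron et al. 2004 make this precise: under a positive cone condition, the df of the k-step LAR fit is exactly k.)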


August 2007 Trevor Hastie, Stanford Statistics 19

Lasso or Forward Stagewise?

• Micro-array example (Golub data): N = 38, p = 7129, binary response (ALL vs. AML).

• Lasso behaves chaotically near the end of the path, while Forward Stagewise is smooth, stable, and mostly monotone.

[Figure: Standardized coefficient paths against |beta|/max|beta| for the LASSO (left panel) and Forward Stagewise (right panel) on the Golub data, with a few gene indices (e.g. 2968, 6801, 2945, 2267) labeled.]


August 2007 Trevor Hastie, Stanford Statistics 20

LARS started a path frenzy!

• Elasticnet (Zou and Hastie, 2005): a compromise between lasso and ridge. Minimize Σi (yi − Σj xij βj)² subject to α||β||1 + (1 − α)||β||2² ≤ t. Useful for situations where variables operate in correlated groups (genes in pathways).

• Glmpath (Park and Hastie, 2005): approximates the L1 regularization path for generalized linear models, e.g. logistic regression and Poisson regression.

• Grouped Lasso (Yuan and Lin, 2006, following Lin and Zhang, 2002): extends the lasso to deal with groups of variables (e.g. dummy variables for factors):

maxβ ℓ(y; β) − λ Σm γm ||βm||2, the sum taken over the M groups.


August 2007 Trevor Hastie, Stanford Statistics 21

The ℓ2 norm ensures the coefficients βm for each group are all zero or non-zero together. Using predictor-corrector methods we can construct the path for this criterion: Park & Hastie (2006), Regularization path algorithms for detecting gene interactions.

• Bin Yu's group: Boosted Lasso and Composite Absolute Penalties.

• Rosset and Zhu (2004) discuss the conditions needed to obtain piecewise-linear paths. A combination of a piecewise quadratic/linear loss function and an L1 penalty is sufficient.

• Svmpath: the entire regularization path for the SVM. Exploits the previous point: the loss is piecewise linear and the penalty is quadratic (Hastie, Rosset, Tibshirani and Zhu, JMLR 2004).

• Bach and Jordan (2004) have path algorithms for kernel estimation, and for efficient ROC curve estimation. The latter is a useful generalization of the svmpath algorithm discussed above.

August 2007 Trevor Hastie, Stanford Statistics 22

• Friedman and Popescu (2004) extend the incremental forward stagewise idea by updating all coefficients whose gradient exceeds a predefined threshold.

• fused lasso, graphical lasso, . . . and many more.


August 2007 Trevor Hastie, Stanford Statistics 23

Pathwise Coordinate Descent

Many problems have the form

min over β1, …, βp of [ R(y, β) + λ Σj Pj(βj) ].

• If R and the Pj are convex, and R is differentiable, then coordinate descent converges to the solution.

• Often each coordinate step is trivial. E.g. for the lasso, it amounts to soft-thresholding, with many steps leaving βj = 0.

• Decreasing λ slowly means not much cycling is needed.

• Coordinate moves can exploit sparsity. Currently we (JHF, TH, RT) fit a lasso logistic regression path (100 λ values) with N = 11K, p = 4.8M and sparsity 0.0001 in under 2 minutes. A sketch of the lasso case follows this list.
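A minimal sketch in base R: squared error for R, columns of x scaled so that 〈xj , xj〉/n = 1, soft-thresholding as the exact coordinate update, and warm starts down a user-supplied grid of λ values:

```r
# Soft-thresholding: the exact coordinate-wise minimizer for the lasso.
soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# Pathwise coordinate descent for
#   (1/2n)||y - x beta||^2 + lambda * sum_j |beta_j|,
# assuming crossprod(x[, j]) / n == 1 for every column j.
lasso_cd_path <- function(x, y, lambdas, tol = 1e-7) {
  n <- nrow(x); p <- ncol(x)
  beta <- rep(0, p)
  r <- y                                   # full residual, kept current
  path <- matrix(0, length(lambdas), p)
  for (k in seq_along(lambdas)) {          # warm starts down the lambda grid
    repeat {
      max_change <- 0
      for (j in seq_len(p)) {
        old <- beta[j]
        z <- drop(crossprod(x[, j], r)) / n + old   # partial-residual coordinate
        beta[j] <- soft(z, lambdas[k])              # exact coordinate update
        if (beta[j] != old) {              # sparsity: most updates are no-ops
          r <- r - (beta[j] - old) * x[, j]
          max_change <- max(max_change, abs(beta[j] - old))
        }
      }
      if (max_change < tol) break          # cycle until coordinates stabilize
    }
    path[k, ] <- beta
  }
  path
}
```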


August 2007 Trevor Hastie, Stanford Statistics 24

Conclusions

Brad: on behalf of all the path crazies, thanks for LARS and happy birthday!



And by the way, a great movie ...
