Smoothing Spline ANOVA Models II. Variable Selection and...

26
Smoothing Spline ANOVA Models II. Variable Selection and Model Building via Likelihood Basis Pursuit Hao Helen Zhang (first author), with Grace Wahba, Yi Lin, Meta Voelker, Michael Ferris, Ronald Klein, and Barbara Klein http://www4.stat.ncsu.edu/˜hzhang2 http://www.stat.wisc.edu/˜wahba http://www.stat.wisc.edu/˜yilin http://www.cs.wisc.edu/˜ferris http://www.ssc.wisc.edu/aging/kleinr.htm http://www.ssc.wisc.edu/aging/kleinb.htm Joint Statistical Meetings San Francisco, CA August 6, 2003 1

Transcript of Smoothing Spline ANOVA Models II. Variable Selection and...

Page 1: Smoothing Spline ANOVA Models II. Variable Selection and …pages.stat.wisc.edu/~wahba/ftp1/tr1080/wald2.pdf · 2004-12-23 · suit works and from smoothing spline ANOVA models. An

Smoothing Spline ANOVA Models II.

Variable Selection and Model Building via

Likelihood Basis Pursuit

Hao Helen Zhang (first author), withGrace Wahba, Yi Lin, Meta Voelker, Michael Ferris,

Ronald Klein, and Barbara Klein

http://www4.stat.ncsu.edu/˜hzhang2

http://www.stat.wisc.edu/˜wahba

http://www.stat.wisc.edu/˜yilin

http://www.cs.wisc.edu/˜ferris

http://www.ssc.wisc.edu/aging/kleinr.htm

http://www.ssc.wisc.edu/aging/kleinb.htm

Joint Statistical MeetingsSan Francisco, CA

August 6, 2003

1

Page 2: Smoothing Spline ANOVA Models II. Variable Selection and …pages.stat.wisc.edu/~wahba/ftp1/tr1080/wald2.pdf · 2004-12-23 · suit works and from smoothing spline ANOVA models. An

Abstract

We describe Likelihood Basis Pursuit, a nonparamet-ric method for variable selection and model building,based on merging ideas from Lasso and Basis Pur-suit works and from smoothing spline ANOVA models.An application to nonparametric variable selection forrisk factor modeling in the Wisconsin EpidemiologicalStudy of Diabetic Retinopathy is described.

Although there are many approaches to variable andmodel selection, we believe that this one has somenovel and useful aspects.

2

Page 3: Smoothing Spline ANOVA Models II. Variable Selection and …pages.stat.wisc.edu/~wahba/ftp1/tr1080/wald2.pdf · 2004-12-23 · suit works and from smoothing spline ANOVA models. An

References

[ZWL02]H. Zhang, G. Wahba, Y. Lin, M. Voelker, M. Fer-ris, R. Klein, and B. Klein. Variable selection andmodel building via likelihood basis pursuit. Techni-cal Report 1059, UW-Madison Statistics Department,2002, under review.[on web]

[CDS98]S. Chen, D. Donoho, and M. Saunders. Atomicdecomposition by basis pursuit. SIAM J. Sci. Com-put., 20:33–61, 1998.

[T96]R. J. Tibshirani. Regression shrinkage and se-lection via the lasso. Journal of the Royal StatisticalSociety, B, 58:267–288, 1996.

[F98]W. J. Fu. Penalized regression: the bridge ver-sus the LASSO. Journal of Computational and Graph-ical Statistics, 7:397–416, 1998.

3

Page 4: Smoothing Spline ANOVA Models II. Variable Selection and …pages.stat.wisc.edu/~wahba/ftp1/tr1080/wald2.pdf · 2004-12-23 · suit works and from smoothing spline ANOVA models. An

References (cont.)

[WWGKK95]G. Wahba, Y. Wang, C. Gu, R. Klein, andB. Klein. Smoothing spline ANOVA for exponentialfamilies, with application to the Wisconsin Epidemi-ological Study of Diabetic Retinopathy. Ann. Statist.,23:1865–1895, 1995.

[W90] G. Wahba. Spline models for observationaldata, SIAM, 1990.

4

Page 5: Smoothing Spline ANOVA Models II. Variable Selection and …pages.stat.wisc.edu/~wahba/ftp1/tr1080/wald2.pdf · 2004-12-23 · suit works and from smoothing spline ANOVA models. An

OUTLINE

1. Motivation: The WESDR data.

2. Likelihood Basis Pursuit for the WESDR data.

3. What is Likelihood Basis Pursuit?

4. Why do l1 penalties give sparser solutions?

5. Building an overcomplete set of basis functions forthe Likelihood Basis Pursuit, from a smoothing splineANOVA model.

6. Computations.

7. The importance measure for the model terms.

8. Choosing the importance threshold.

9. Back to the WESDR results.

10. Closing remarks.

5

Page 6: Smoothing Spline ANOVA Models II. Variable Selection and …pages.stat.wisc.edu/~wahba/ftp1/tr1080/wald2.pdf · 2004-12-23 · suit works and from smoothing spline ANOVA models. An

♣♣ 1. Motivation: WESDR (WisconsinEpidemiological Study of Diabetic Retinopathy).

WESDR is an ongoing epidemiological study of a co-hort of patients receiving their medical care in an 11-county area in southern Wisconsin. Baseline exam,1980, with four, ten, fourteen and twenty year followups.(refs in [ZWL02]). [WWGKK95] built a smoothing splineANOVA model for four year risk of progression of dia-betic retinopathy from baseline, as a function of threerisk factors. We started out with twenty possible riskfactors, narrowed it down to four variables by very la-borious means, mostly repeated applications of para-metric and nonparametric regression analyses of smallgroups of variables at a time, and existing analyses byothers. Finally, three variables were selected as ap-parently the most important, by ad hoc means. It wasevident that it would be highly desireable to have amodel selection procedure that could simultaneouslyselect important variables/components of a spline ANOVAmodel. Such a procedure has been obtained in[Z02][ZWL02], and is today’s topic.

6

Page 7: Smoothing Spline ANOVA Models II. Variable Selection and …pages.stat.wisc.edu/~wahba/ftp1/tr1080/wald2.pdf · 2004-12-23 · suit works and from smoothing spline ANOVA models. An

♣♣ 1. Motivation: WESDR (WisconsinEpidemiological Study of Diabetic Retinopathy)(cont).

[WWGKK95] looked for the four year risk of progres-sion of diabetic retinopathy from baseline at a cohortof (selected) n = 669 younger onset subjects. Letp(t) be the probability of progression for a subject withrisk factor vector t and f = log[p/(1−p)]. The modelfitted was f(t) = f(dur, gly, bmi) = µ+ f1(dur)+

a2 ·gly+f3(bmi)+f13(dur, bmi) where dur = dura-tion of diabetes, gly = glycosylated hemoglobin, andbmi = body mass index.

duration (yr)

body

mas

s in

dex

(kg/

m )

2

0 10 20 30 40 50

2030

4050

0.3

0.5

0.5 1

2

2

3

3 4

•• •• ••• • ••• •• ••• • ••••• •• •• •••• •••• ••• •• ••• •••• •• •••• •• • •• •• ••• ••• •• •••• • •• • ••• •• •• ••• •• •• •• •• •• •• •••• •• •• •••• • •••• • •• •• •• •• •••• • • •• •• • •• ••••• ••• • ••• • •••• ••• •• •• •• •••• •••• • • • •• • •••••• •• ••• • ••• • • •• • • •• ••• • • •••• •• ••• • •• • •• ••••• •••• •• • •••• • ••• •• •• •••• •• •• • • ••• • ••••• •• •• ••• •• • •

••

010

2030

40

duration (yr)15

20

25

30

35

40body mass index (kg/m )2

00.

10.2

0.30

.40.

50.6

prob

abili

ty

(Right: probability plotted against bmi and dur;gly at median.) (Note: model is not monotonic indur) From [WWGKK95]

7

Page 8: Smoothing Spline ANOVA Models II. Variable Selection and …pages.stat.wisc.edu/~wahba/ftp1/tr1080/wald2.pdf · 2004-12-23 · suit works and from smoothing spline ANOVA models. An

Wisconsin Epidemiologic Study of Diabetic

Retinopathy (WESDR)

• Continuous covariates:

X1: (dur) duration of diabetes at the time of baseline examination, years

X2: (gly) glycosylated hemoglobin, a measure of hyperglycemia, %

X3: (bmi) body mass index, kg/m2

X4: (sys) systolic blood pressure, mmHg

X5: (ret) retinopathy level

X6: (pulse) pulse rate, count for 30 seconds

X7: (ins) insulin dose, kg/day

X8: (sch) years of school completed

X9: (iop) intraocular pressure, mmHg

• Categorical covariates:

Z1: (smk) smoking status (0 = no, 1 = any)

Z2: (sex) gender (0 = female, 1 = male)

Z3: (asp) use of at least one aspirin for (0 = no, 1 = yes)

at least three months while diabetic

Z4: (famdb) family history of diabetes (0 = none, 1 = yes)

Z5: (mar) marital status (0 = no, 1 = yes/ever)

1

8

Page 9: Smoothing Spline ANOVA Models II. Variable Selection and …pages.stat.wisc.edu/~wahba/ftp1/tr1080/wald2.pdf · 2004-12-23 · suit works and from smoothing spline ANOVA models. An

♣♣ 2. WESDR: The Likelihood Basis Pursuit resultfor the WESDR data.

gly dur sch bmi puls ret sys iop ins sex smk asp famdb mar

0

0.2

0.4

0.6

0.8

1

L1 n

orm

L1 norm scores for the WESDR main effects model,from a Likelihood Basis Pursuit analysis. The methodselected gly, dur, sch and bmi, in that order, as im-portant variables in a smoothing spline ANOVA maineffects model (smoothing spline additive model).

Note: sch is years of schooling completed. It cameup in some of the analyses in [WWGKK95] but wasnot considered a direct cause of disease and was notfurther considered.

Next: How it was done.9

Page 10: Smoothing Spline ANOVA Models II. Variable Selection and …pages.stat.wisc.edu/~wahba/ftp1/tr1080/wald2.pdf · 2004-12-23 · suit works and from smoothing spline ANOVA models. An

♣♣ 3. What is Likelihood Basis Pursuit?

Likelihood Basis Pursuit combines ideas from theLASSO [T96][F98], basis pursuit [CDS98], and smooth-ing spline ANOVA models to generate an overcom-plete set of basis functions, which are then used in apenalized likelihood variational problem with l1 penal-ties. Basis pursuit uses l1 penalties, instead of quadraticpenalties, to obtain solutions which are relatively sparsein the number of basis functions with non-0 coeffi-cients.

Why do l1 penalties result in sparser solutions thanquatratic penalties?

10

Page 11: Smoothing Spline ANOVA Models II. Variable Selection and …pages.stat.wisc.edu/~wahba/ftp1/tr1080/wald2.pdf · 2004-12-23 · suit works and from smoothing spline ANOVA models. An

♣♣ 4. Why do l1 penalties give sparser solutions?

c1

c2

l1 l2

Circle:∑N

j=1 c2j ≤ M , diamond:∑N

j=1 |cj| ≤ M .

Ellipses: contours of constant∑n

i=1(yi−∑N

j=1 xijcj)2.

Find c to minimize:

n∑

i=1

(yi −N∑

j=1

xijcj)2

subject to∑N

j=1 |cj|p ≤ M . Green dots: minimizers

for (l. to r.) p = 1 and p = 2. Note that c1 = 0 forp = 1.

11

Page 12: Smoothing Spline ANOVA Models II. Variable Selection and …pages.stat.wisc.edu/~wahba/ftp1/tr1080/wald2.pdf · 2004-12-23 · suit works and from smoothing spline ANOVA models. An

♣♣ 4. Why do l1 penalties give sparser solutions ?(cont.)

The problem: Find c to minimize:

n∑

i=1

(yi −N∑

j=1

xijcj)2

subject to∑N

j=1 |cj|p ≤ M is generally equivalent to

the problem: Find c to min

n∑

i=1

(yi −N∑

j=1

xijcj)2 + λ

N∑

j=1

|cj|p

for some λ = λ(M).

12

Page 13: Smoothing Spline ANOVA Models II. Variable Selection and …pages.stat.wisc.edu/~wahba/ftp1/tr1080/wald2.pdf · 2004-12-23 · suit works and from smoothing spline ANOVA models. An

♣♣ 5. Building an overcomplete set of basisfunctions for the Likelihood Basis Pursuit, from a

smoothing spline ANOVA model.

1. Main effects model, continuous variables.

First: the usual penalized likelihood: Let l!kl(u) bethe lth Bernoulli polynomial, and let K(u, v) = k2(u)k2(v)−

k4(|u − v|), u, v ∈ [0,1] (spline kernel [W90]). Letx = (x1, · · · , xd), and the observations be {yi, xi}

where xi = (x1i , · · · , xd

i ), i = 1, · · · , n.

The solution to the problem: Find f (in an appropriatespace) of the form f(x) = µ +

∑dα=1 fα(xα) to min

1

n

n∑

i=1

C(yi, f(xi)) +d

α=1

θ−1α

(fα”)2

has a representation

f(x) = µ +d

α=1

bαk1(xα) +n

i=1

ci

d∑

α=1

θαK(xα, xαi )

13

Page 14: Smoothing Spline ANOVA Models II. Variable Selection and …pages.stat.wisc.edu/~wahba/ftp1/tr1080/wald2.pdf · 2004-12-23 · suit works and from smoothing spline ANOVA models. An

♣♣ 5. Building an overcomplete set of basisfunctions for the Likelihood Basis Pursuit, from

smoothing spline ANOVA model (cont.).

1. Main effects model, continuous variables (cont.).

Since generally an excellent approximation to the so-lution to the variational problem can be obtained withfewer basis functions, a selected subset, xi1, · · · , xiN ,of the xi can be used to generate the solution. Thus

n∑

i=1

ci

d∑

α=1

θαK(xα, xαi )

is replaced by

N∑

j∗=1

cj∗

d∑

α=1

θαK(xα, xαj∗)

where {xj∗ = (x1j∗, · · · , xd

j∗), j∗ = 1, · · · , N} is theselected subset of the xi.

14

Page 15: Smoothing Spline ANOVA Models II. Variable Selection and …pages.stat.wisc.edu/~wahba/ftp1/tr1080/wald2.pdf · 2004-12-23 · suit works and from smoothing spline ANOVA models. An

♣♣ 5. Building an overcomplete set of basisfunctions...(cont.).

1. Main effects model, continuous variables (cont.).This suggests the overcomplete set of 1 + d + dN

basis functions:

{1, bα(x) ≡ k1(xα), Bα

j∗(x) ≡ K(xα, xαj∗)}.

for α = 1, · · · , d, j∗ = 1, · · · , N

The basis pursuit variational problem is to find f(x),

f(x) = µ +d

α=1

bαbα(x) +d

α=1

N∑

j∗=1

cαj∗Bαj∗(x)

to minimize

1

n

n∑

i=1

C(yi, f(xi)) + λπ(d

α=1

|bα|) + λs(d

α=1

N∑

j∗=1

|cαj∗|)

15

Page 16: Smoothing Spline ANOVA Models II. Variable Selection and …pages.stat.wisc.edu/~wahba/ftp1/tr1080/wald2.pdf · 2004-12-23 · suit works and from smoothing spline ANOVA models. An

♣♣ 5. Building an overcomplete set of basisfunctions...(cont.).

2. Two factor interaction model, continuous variables.

Add tensor products of the previous basis functions:(Similar rationale as before-from the two-factor inter-action smoothing spline ANOVA model.)

{span bαbβ, α < β; bαBβj∗, α 6= β, j∗ = 1, · · · , N ;

Bαj∗B

βj∗, α < β, j∗ = 1, · · · , N}

to the previous set of basis functions.

16

Page 17: Smoothing Spline ANOVA Models II. Variable Selection and …pages.stat.wisc.edu/~wahba/ftp1/tr1080/wald2.pdf · 2004-12-23 · suit works and from smoothing spline ANOVA models. An

♣♣ 5. Building an overcomplete set of basisfunctions...(cont.).

2. Two factor interaction model (more) ..

Recalling the definitions

{bα(x) ≡ k1(xα), Bα

j∗(x) ≡ K(xα, xαj∗)}.}

f is now of the form:

f(x) = µ +d

α=1

bαbα(x) +d

α=1

N∑

j∗=1

cαj∗Bαj∗(x)

+λππ∑

α<β

bαβbα(x)bβ(x)

+λπs∑

α6=β

N∑

j∗=1

cπsαβj∗b

α(x)Bβ(x)

+λss∑

α<β

N∑

j∗=1

cssαβj∗B

α(x)Bβ(x).

17

Page 18: Smoothing Spline ANOVA Models II. Variable Selection and …pages.stat.wisc.edu/~wahba/ftp1/tr1080/wald2.pdf · 2004-12-23 · suit works and from smoothing spline ANOVA models. An

♣♣ 5. Building an overcomplete set of basisfunctions...(cont.).

2. Two factor interaction model (more)...

...and now the variational problem becomes: Find f

to minimize;

1

n

n∑

i=1

C(yi, f(xi)) + λπ(d

α=1

|bα|) + λs(d

α=1

N∑

j∗=1

|cαj∗|)

+λππ(∑

α<β

|bαβ|) + λπs(∑

α6=β

N∑

j∗=1

|cπsαβj∗|)

+λss(∑

α<β

N∑

j∗=1

|cssαβj∗|).

18

Page 19: Smoothing Spline ANOVA Models II. Variable Selection and …pages.stat.wisc.edu/~wahba/ftp1/tr1080/wald2.pdf · 2004-12-23 · suit works and from smoothing spline ANOVA models. An

♣♣ 5. Building an overcomplete set of basisfunctions...(cont.).

2. Two factor interaction model, continuous variables.There are now five smoothing parameters in this setup:

λπ parametric main effectsλs smooth main effectsλππ parametric − parametric interactionsλπs parametric − smooth interactionsλss smooth − smooth interactions

Other variants of smoothing parameters are possible,as well as other reproducing kernels.

Categorical variables are added to the model in a straight-forward way, but now there will be more combinationsof possible interaction terms.

19

Page 20: Smoothing Spline ANOVA Models II. Variable Selection and …pages.stat.wisc.edu/~wahba/ftp1/tr1080/wald2.pdf · 2004-12-23 · suit works and from smoothing spline ANOVA models. An

♣♣ 6. Computations

The model is fitted via a tailored mathematical pro-gramming algorithm called slice modeling, see [ZWL02],originally due to Michael Ferris. MINOS was the un-derlying optimization code.

Given Bernoulli data, the smoothing parameters arefitted via GACV (Generalized Approximate Cross Val-idation) for basis pursuit models. The GACV looksessentially the same as the GACV for Bernoulli datawith the usual RKHS penalty functional [XW96], andthe randomized trace technique is used to compute it.

20

Page 21: Smoothing Spline ANOVA Models II. Variable Selection and …pages.stat.wisc.edu/~wahba/ftp1/tr1080/wald2.pdf · 2004-12-23 · suit works and from smoothing spline ANOVA models. An

♣♣ 7. The importance measure for the model terms.

We have adopted the empirical L1 norm to assess therelative importance of the various terms. (Since thesmoothing spline ANOVA basis functions all averageto 0 this is makes sense). The empirical L1 normsof the main effects fα and the two-factor interactionsfαβ are defined as

L1(fα) = 1n

∑ni=1 |fα(xα

i )|

= 1n

∑ni=1 |bαk1(x

αi ) +

∑Nj=1 cαj∗K1(x

αi , xα

j∗)|

and

L1(fαβ) = 1n

∑ni=1 |fαβ(x

αi , x

βi )|

= 1n

∑ni=1 |bαβk1(x

αi )k1(x

βi )

+∑N

j=1 cπsαβj∗K1(x

αi , xα

j∗)k1(xβi )k1(x

βj∗)

+∑N

j=1 cπsβαj∗K1(x

βi , x

βj∗)k1(x

αi )k1(x

αj∗)

+∑N

j=1 cssαβj∗K1(x

α, xαj∗)K1(x

β, xβj∗)|.

The empirical L2 norm gives essentially the same re-sults.

21

Page 22: Smoothing Spline ANOVA Models II. Variable Selection and …pages.stat.wisc.edu/~wahba/ftp1/tr1080/wald2.pdf · 2004-12-23 · suit works and from smoothing spline ANOVA models. An

♣♣ 8. Choosing the importance threshold.

A Monte Carlo bootstrap test is used to sequentiallyadd terms to the model, beginning in the order of theL1 norms, until the next term which is a candidate forinclusion fails the importance test.

For η = 0 until end, do

1. Posit a null model which includes the first η termsin rank order of their L1 norms. Fit this model, callit fη. Here we want to test whether the (η +1)stterm is important.

2. Generate T independent Monte Carlo bootstrapdata sets {y

ηti , xi}, t = 1, · · · , T from fη, using

the original design x1, · · · , xn.

22

Page 23: Smoothing Spline ANOVA Models II. Variable Selection and …pages.stat.wisc.edu/~wahba/ftp1/tr1080/wald2.pdf · 2004-12-23 · suit works and from smoothing spline ANOVA models. An

♣♣ 8. Choosing the importance threshold (cont.).

3 Using the Monte Carlo bootstrap data generatedabove, fit the full model to each bootstrap sam-ple, and compute the L1 scores Lt

1(η + 1), t =

1,2, · · · , T for the next term. This results in aMonte Carlo bootstrap sample of T observationsof L1(η + 1) under the null model hypothesis.

4 Compare L1(η + 1) to the bootstrap samplesLt

1(η + 1), t = 1,2, · · · , T . If L1(η + 1) is be-tween the rth and the r +1st largest Lt

1(η +1),then the Monte Carlo p-value is pη = r+1

T+1.

end. For an α level test, increment η until pη ≥ α

(or the full model is reached), and declare that the im-portance threshold is L1(η) if η > 1 and any numberlarger than L1(1) if η = 0.

23

Page 24: Smoothing Spline ANOVA Models II. Variable Selection and …pages.stat.wisc.edu/~wahba/ftp1/tr1080/wald2.pdf · 2004-12-23 · suit works and from smoothing spline ANOVA models. An

♣♣ 9. Back to the WESDR results.

gly dur sch bmi puls ret sys iop ins sex smk asp famdb mar

0

0.2

0.4

0.6

0.8

1

L1 n

orm

L1 norm scores for the WESDR main effects model.

The importance threshhold is .39 (blue line). The p-values for all four of the selected variables gly, dur, sch, bmi,were .02. The solution was very sparse, with about90% of the coefficients 0.

We were happy with the results - the method returnedimportant variables, that had previously been selectedby much more tedious methods. Simulation results in[ZWL02] have also demonstrated the efficacy of theapproach in data sets where the ’truth’ is known.

24

Page 25: Smoothing Spline ANOVA Models II. Variable Selection and …pages.stat.wisc.edu/~wahba/ftp1/tr1080/wald2.pdf · 2004-12-23 · suit works and from smoothing spline ANOVA models. An

♣♣ 9. Back to the WESDR results(cont.).

0 10 20 30 40

−1.

50.

01.

5

dur

Logi

t

Estimated logit component for dur

For comparison, the linear logistic regression modelusing the function glm in R was fit, and the linear co-efficient for dur was not significant at level α = 0.05.From the plot: A linear fit would not be good for dur.

Refitting the linear logistic model by including dur2,the hypothesis test for dur2 was significant with p-value 0.02. But with 14 variables, this is not easilydiscovered using parametric logistic regression.

25

Page 26: Smoothing Spline ANOVA Models II. Variable Selection and …pages.stat.wisc.edu/~wahba/ftp1/tr1080/wald2.pdf · 2004-12-23 · suit works and from smoothing spline ANOVA models. An

♣♣ 10. Closing remarks.

We claim that that Likelihood Basis Pursuit can be auseful screening tool in data sets with many potentialpredictor variables, and can provide helpful insuranceagainst loss of information due to ill-fitting parametricmodels.

26