SADC Course in Statistics Choosing the best model (Session 08)

Page 1: SADC Course in Statistics Choosing the best model (Session 08)

SADC Course in Statistics

Choosing the “best” model

(Session 08)

Page 2: SADC Course in Statistics Choosing the best model (Session 08)

Learning Objectives

At the end of this session, you will be able to

• use a simple descriptive approach to select the most appropriate subset of explanatory variables

• apply methods of variable selection (based on statistical tests) in a meaningful way to get the “best” model

• appreciate the effect on t-probabilities when x’s are added or dropped from a model

• understand dangers of using automatic selection procedures

Page 3: SADC Course in Statistics Choosing the best model (Session 08)

Example of choosing “best” set of x’s

Consider data (fictitious) from a retrospective study of patients surviving less than 4 months after being diagnosed as having acute leukaemia.

Objective: To identify factors affecting survival time.

Variables were:

y = survival time (days) after diagnosis

x1 = no. of chemotherapy sessions

x2 = total volume of blood transfused

x3 = no. of days of hospital care

x4 = age of patient (years).

Page 4: SADC Course in Statistics Choosing the best model (Session 08)

Start with a matrix plot

Page 5: SADC Course in Statistics Choosing the best model (Session 08)

Summary statistics for all regressions

How many possible regression models exist?
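As a minimal sketch of the counting argument: with four candidate x's, each one is either in the model or out of it, giving 2^4 = 16 possible models if the constant-only model is counted. The names `X_NAMES` and `all_subsets` are illustrative and not part of the course material.

```python
from itertools import combinations

# The four candidate explanatory variables from the leukaemia example.
X_NAMES = ["x1", "x2", "x3", "x4"]


def all_subsets(names):
    """Return every subset of `names`, from the empty set to the full set."""
    return [subset
            for size in range(len(names) + 1)
            for subset in combinations(names, size)]


models = all_subsets(X_NAMES)
print(len(models))  # 2**4 = 16, counting the constant-only model
for terms in models:
    print(", ".join(terms) if terms else "constant only")
```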

Example with x1 and x3 to show summaries:

---------+---------------------------------------
  Source |       SS    df        MS     F  Prob>F
---------+---------------------------------------
   Model | 1488.691     2   744.346  6.07  0.0188
Residual | 1227.072    10   122.707
---------+---------------------------------------
   Total | 2715.763    12   226.314
---------+---------------------------------------

No. of parameters fitted (p) = 3

R² = Model SS / Total SS = 1488.69 / 2715.76 = 0.5482

Adjusted R² = 1 – Residual MS / Total MS = 1 – 122.71 / 226.31 = 0.4578
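The summaries above can be reproduced directly from the sums of squares and degrees of freedom in the ANOVA table. This is only an arithmetic check; the variable names are illustrative.

```python
# Values copied from the ANOVA table for the model with x1 and x3.
model_ss, model_df = 1488.691, 2
resid_ss, resid_df = 1227.072, 10

total_ss = model_ss + resid_ss   # 2715.763
total_df = model_df + resid_df   # 12

model_ms = model_ss / model_df
resid_ms = resid_ss / resid_df
total_ms = total_ss / total_df

r2 = model_ss / total_ss             # proportion of variation explained
adj_r2 = 1 - resid_ms / total_ms     # penalises for the parameters fitted
f_stat = model_ms / resid_ms         # overall F statistic on (2, 10) df

print(round(r2, 4), round(adj_r2, 4), round(f_stat, 2))
```

Adjusted R² depends only on the residual mean square (the total mean square is fixed), so ranking models by adjusted R² and by residual mean square gives the same ordering.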

Page 6: SADC Course in Statistics Choosing the best model (Session 08)

Descriptive approach (all regressions)

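p = no. of parameters fitted (including the constant)

-----------+----+----------------+-------+---------+----------
No. of x's |  p | Terms in model |    R2 | Adj. R2 | Res. M.S.
-----------+----+----------------+-------+---------+----------
         0 |  1 | None           | 0     | 0       |    226.3
-----------+----+----------------+-------+---------+----------
         1 |  2 | x1             | 0.534 | 0.492   |    115.1
         1 |  2 | x2             | 0.666 | 0.636   |     82.4
         1 |  2 | x3             | 0.286 | 0.221   |    176.3
         1 |  2 | x4             | 0.675 | 0.645   |     80.4
-----------+----+----------------+-------+---------+----------
         2 |  3 | x1, x2         | 0.979 | 0.974   |      5.8
         2 |  3 | x1, x3         | 0.548 | 0.458   |    122.7
         2 |  3 | x1, x4         | 0.972 | 0.967   |      7.5
         2 |  3 | x2, x3         | 0.847 | 0.816   |     41.5
         2 |  3 | x2, x4         | 0.680 | 0.616   |     86.9
         2 |  3 | x3, x4         | 0.935 | 0.922   |     17.6
-----------+----+----------------+-------+---------+----------
         3 |  4 | x1, x2, x3     | 0.982 | 0.976   |      5.4
         3 |  4 | x1, x2, x4     | 0.982 | 0.976   |      5.3
         3 |  4 | x1, x3, x4     | 0.981 | 0.975   |      5.7
         3 |  4 | x2, x3, x4     | 0.973 | 0.964   |      8.2
-----------+----+----------------+-------+---------+----------
         4 |  5 | x1, x2, x3, x4 | 0.982 | 0.974   |      6.0
-----------+----+----------------+-------+---------+----------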

Page 7: SADC Course in Statistics Choosing the best model (Session 08)

A descriptive approach… continued

Plot R2 versus no. of parameters (p) in model

Which model would you select on the basis of these results?

Page 8: SADC Course in Statistics Choosing the best model (Session 08)

A descriptive approach… continued

Alternatively, plot the residual mean square. A small residual mean square is good!

Which model would you select on the basis of the residual mean square?
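One way to read the all-regressions table without a plot is to pick, for each model size, the subset with the smallest residual mean square. The sketch below uses the Res. M.S. column from slide 6; the names `RES_MS` and `best_by_size` are illustrative.

```python
# Residual mean squares from the all-regressions table (slide 6).
RES_MS = {
    (): 226.3,
    ("x1",): 115.1, ("x2",): 82.4, ("x3",): 176.3, ("x4",): 80.4,
    ("x1", "x2"): 5.8, ("x1", "x3"): 122.7, ("x1", "x4"): 7.5,
    ("x2", "x3"): 41.5, ("x2", "x4"): 86.9, ("x3", "x4"): 17.6,
    ("x1", "x2", "x3"): 5.4, ("x1", "x2", "x4"): 5.3,
    ("x1", "x3", "x4"): 5.7, ("x2", "x3", "x4"): 8.2,
    ("x1", "x2", "x3", "x4"): 6.0,
}


def best_by_size(res_ms):
    """For each number of x's, return the subset with the smallest Res. M.S."""
    best = {}
    for terms, ms in res_ms.items():
        size = len(terms)
        if size not in best or ms < res_ms[best[size]]:
            best[size] = terms
    return best


best = best_by_size(RES_MS)
for size in sorted(best):
    print(size, best[size], RES_MS[best[size]])

# Subset with the smallest residual mean square overall.
overall = min(RES_MS, key=RES_MS.get)
print("smallest Res. M.S.:", overall, RES_MS[overall])
```

The residual mean square drops sharply from one x to two and then barely changes (5.8, 5.3, 6.0), which is the pattern the plot is meant to reveal.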

Page 9: SADC Course in Statistics Choosing the best model (Session 08)

An inferential approach…

Use a sequential procedure to select variables that contribute most, and significantly, to the regression model.

Three popular methods exist:

• Forward selection

• Backward elimination

• Stepwise regression

Page 10: SADC Course in Statistics Choosing the best model (Session 08)

Forward selection …

Select the “best” single variable - see slide 6

Ask, “Is it contributing significantly?” Answer: Yes (see below)

-----------------------------------------
     y |    Coef.  Std. Err.       t  P>|t|
-------+---------------------------------
    x4 |  -.73816     .1546   -4.77  0.001
const. |   117.57    5.2622   22.34  0.000
-----------------------------------------

Now consider 2-variable models with x4.
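As a small worked check, the t-value reported for x4 is the coefficient divided by its standard error, and it is referred to a t distribution on n – p residual degrees of freedom. Here n = 13 is inferred from the total df of 12 in the ANOVA table on slide 5; the p-value itself is not recomputed in this sketch.

```python
# Coefficient and standard error for x4 in the single-variable model.
coef, std_err = -0.73816, 0.1546

n_obs = 13      # inferred: total df (12) + 1
n_params = 2    # constant + x4
resid_df = n_obs - n_params

t_value = coef / std_err
print(round(t_value, 2), "on", resid_df, "df")
```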

Page 11: SADC Course in Statistics Choosing the best model (Session 08)

Two-variable models with x4

-----------------------------------------
     y |    Coef.   Std.Err.       t  P>|t|
-------+---------------------------------
    x4 |  -.61395    .04864  -12.62  0.000
    x1 |   1.4400    .13842   10.40  0.000
const. |   103.10    2.1240   48.54  0.000
-----------------------------------------
    x4 |  -.45694    .69595   -0.66  0.526
    x2 |   .31090    .74861    0.42  0.687
const. |   94.160    56.627    1.66  0.127
-----------------------------------------
    x4 |  -.72460    .07233  -10.02  0.000
    x3 |  -1.1999    .18902   -6.35  0.000
const. |   131.28    3.2748   40.09  0.000
-----------------------------------------

Page 12: SADC Course in Statistics Choosing the best model (Session 08)

12To put your footer here go to View > Header and Footer

Three-variable models with x4, x1

-----------------------------------------
     y |    Coef.   Std.Err.       t  P>|t|
-------+---------------------------------
    x4 |  -.23654    .17329   -1.37  0.205
    x1 |   1.4519    .11700   12.41  0.000
    x2 |   .41611    .18561    2.24  0.052
const. |   71.648    14.142    5.07  0.001
-----------------------------------------
    x4 |  -.64280    .04454  -14.43  0.000
    x1 |   1.0519    .22368    4.70  0.001
    x3 |  -.41004    .19923   -2.06  0.070
const. |   111.68    4.5625   24.48  0.000
-----------------------------------------

Model with x1, x2 and x4 would be selected, despite x4 now being non-significant!
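The forward path can be traced from the R² values on slide 6 alone. The sketch below adds, at each step, the x with the largest partial F (which equals the square of that x's t-value, up to rounding) and stops when no candidate reaches an entry threshold. The threshold of 4.0, n = 13 (from the total df of 12), and the names `R2`, `partial_f` and `forward_selection` are assumptions made for illustration, not settings taken from the slides.

```python
N_OBS = 13        # inferred from total df = 12
F_TO_ENTER = 4.0  # assumed entry threshold, not from the slides

# R-squared from slide 6, keyed by the x's in the model ("14" = x1 and x4).
R2 = {"": 0.0, "1": 0.534, "2": 0.666, "3": 0.286, "4": 0.675,
      "12": 0.979, "13": 0.548, "14": 0.972, "23": 0.847, "24": 0.680,
      "34": 0.935, "123": 0.982, "124": 0.982, "134": 0.981, "234": 0.973,
      "1234": 0.982}


def label(terms):
    """Canonical key for a set of x's, e.g. '41' -> '14'."""
    return "".join(sorted(terms))


def partial_f(small, big):
    """F for the single extra x in `big` compared with nested model `small`."""
    resid_df = N_OBS - len(big) - 1   # n minus (x's + constant)
    return (R2[big] - R2[small]) / ((1 - R2[big]) / resid_df)


def forward_selection():
    """Add the x with the largest partial F until none reaches F_TO_ENTER."""
    model, path = "", []
    while len(model) < 4:
        candidates = [label(model + v) for v in "1234" if v not in model]
        best = max(candidates, key=lambda m: partial_f(model, m))
        if partial_f(model, best) < F_TO_ENTER:
            break
        model = best
        path.append(model)
    return path


print(forward_selection())  # ['4', '14', '124']
```

Once x4 has entered it is never re-examined, which is why it stays in the final model even though its t-value has fallen to -1.37.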

Page 13: SADC Course in Statistics Choosing the best model (Session 08)

Backward elimination gives x1, x2

---------------------------------------
   y |    Coef.   Std.Err.      t  P>|t|
-----+---------------------------------
  x1 |   1.5511    .74477    2.08  0.071
  x2 |   .51017     .7238    0.70  0.501
  x3 |   .10191     .7547    0.14  0.896
  x4 |  -.14406     .7091   -0.20  0.844
---------------------------------------
  x1 |   1.4519    .11700   12.41  0.000
  x2 |   .41611    .18561    2.24  0.052
  x4 |  -.23654    .17329   -1.37  0.205
---------------------------------------
  x1 |   1.4683    .12130   12.10  0.000
  x2 |   .66225    .04585   14.44  0.000
---------------------------------------
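The elimination sequence can likewise be traced from the R² values on slide 6. The sketch starts from the full model and repeatedly drops the x whose removal costs least, as measured by its partial F, until every remaining x exceeds a removal threshold. The threshold of 4.0, n = 13 (from the total df of 12), and the function names are assumptions for illustration.

```python
N_OBS = 13         # inferred from total df = 12
F_TO_REMOVE = 4.0  # assumed removal threshold, not from the slides

# R-squared from slide 6, keyed by the x's in the model ("12" = x1 and x2).
R2 = {"": 0.0, "1": 0.534, "2": 0.666, "3": 0.286, "4": 0.675,
      "12": 0.979, "13": 0.548, "14": 0.972, "23": 0.847, "24": 0.680,
      "34": 0.935, "123": 0.982, "124": 0.982, "134": 0.981, "234": 0.973,
      "1234": 0.982}


def partial_f(small, big):
    """F for the single x that is in `big` but not in nested model `small`."""
    resid_df = N_OBS - len(big) - 1   # n minus (x's + constant)
    return (R2[big] - R2[small]) / ((1 - R2[big]) / resid_df)


def backward_elimination():
    """Drop the weakest x until all remaining x's reach F_TO_REMOVE."""
    model = "1234"
    path = [model]
    while model:
        # One reduced model per x currently in the model, in x1..x4 order.
        reduced = [model.replace(v, "") for v in model]
        weakest = min(reduced, key=lambda m: partial_f(m, model))
        if partial_f(weakest, model) >= F_TO_REMOVE:
            break
        model = weakest
        path.append(model)
    return path


print(backward_elimination())  # ['1234', '124', '12']
```

With R² rounded to three decimals, dropping x3 and dropping x4 from the full model tie at F = 0; `min` keeps the first, x3, which agrees with x3 having the smallest |t| (0.14) in the full-model output.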

Page 14: SADC Course in Statistics Choosing the best model (Session 08)

Stepwise selection procedure…

This is similar to forward selection, but at each stage of the process, all x’s in the model are re-assessed to check if those that entered the model at an earlier stage still remain “important”.

Note: Software packages allow automatic use of one of these with pre-specified p-values for selection and deletion of variables. Usually available only with quantitative x’s.
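A minimal sketch of the stepwise idea, again driven only by the R² values on slide 6: after each forward step, every x already in the model is re-tested and dropped if its partial F falls below a removal threshold. Both thresholds (4.0), n = 13 (from the total df of 12), and all function names are assumptions for illustration; real packages use their own defaults, usually stated as p-values.

```python
N_OBS = 13         # inferred from total df = 12
F_TO_ENTER = 4.0   # assumed thresholds, not from the slides
F_TO_REMOVE = 4.0

# R-squared from slide 6, keyed by the x's in the model ("14" = x1 and x4).
R2 = {"": 0.0, "1": 0.534, "2": 0.666, "3": 0.286, "4": 0.675,
      "12": 0.979, "13": 0.548, "14": 0.972, "23": 0.847, "24": 0.680,
      "34": 0.935, "123": 0.982, "124": 0.982, "134": 0.981, "234": 0.973,
      "1234": 0.982}


def label(terms):
    """Canonical key for a set of x's, e.g. '41' -> '14'."""
    return "".join(sorted(terms))


def partial_f(small, big):
    """F for the single x that is in `big` but not in nested model `small`."""
    resid_df = N_OBS - len(big) - 1
    return (R2[big] - R2[small]) / ((1 - R2[big]) / resid_df)


def stepwise():
    """Forward steps, each followed by a re-assessment of the x's already in."""
    model, path = "", []
    while len(model) < 4:
        candidates = [label(model + v) for v in "1234" if v not in model]
        best = max(candidates, key=lambda m: partial_f(model, m))
        if partial_f(model, best) < F_TO_ENTER:
            break
        model = best
        path.append(("add", model))
        # Re-assess: drop any x that is no longer "important".
        while len(model) > 1:
            reduced = [model.replace(v, "") for v in model]
            weakest = min(reduced, key=lambda m: partial_f(m, model))
            if partial_f(weakest, model) >= F_TO_REMOVE:
                break
            model = weakest
            path.append(("drop", model))
    return model, path


final_model, steps = stepwise()
print(final_model, steps)
```

Under these assumed thresholds the re-assessment step removes x4 once x1 and x2 are both in the model, so the procedure ends at a different model from plain forward selection.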

Page 15: SADC Course in Statistics Choosing the best model (Session 08)

Discussion… in small groups

• Look back at the results. What do you observe with the forward and backward procedures? Do they give the same results?

• Did the selection using the forward procedure seem sensible, given that the p-value for x4 is 0.205?

• Can you work out what model would result from a stepwise selection procedure?

• Is it a good idea to use such automatic selection procedures available in software packages? If not, why not?

Page 16: SADC Course in Statistics Choosing the best model (Session 08)

Discussion continued…

Suppose a medical researcher told you that a model without x2 was not meaningful, how would you proceed with your model selection?

What other latent (lurking) variables, measurable or non-measurable, might affect y?

What further steps would you undertake before accepting the final model?

Page 17: SADC Course in Statistics Choosing the best model (Session 08)

Practical work follows to ensure learning objectives are achieved…