Lecture 4: ANOVA Table - Purdue University
Lecture 4: ANOVA Table
STAT 512
Spring 2011
Background Reading
KNNL: 2.6-2.7
Topic Overview
Working-Hotelling Confidence Band
Inference Example using SAS
ANOVA Table
Working-Hotelling Confidence Band (1)
This gives a confidence limit for the whole
line at once, in contrast to the confidence
interval for just one Ŷh at a time.
The regression line b0 + b1 Xh describes E(Yh) for
a given Xh.
We have a 95% CI for E(Yh), estimated by Ŷh,
pertaining to a specific Xh.
Working-Hotelling Confidence Band (2)
We want a 95% confidence band for all Xh –
this is a confidence limit for the whole line at
once, in contrast to the confidence interval for
just one Ŷh at a time.
The confidence limits are given by Ŷh ± W s{Ŷh},
where W² = 2 F(1 − α; 2, n − 2). Since we are doing
all values of Xh at once, the band is wider at each
Xh than the CI for an individual Xh.
Working-Hotelling Confidence Band (3)
We are used to constructing CI's with t's, not
W's. Can we fake it?
We can find a new, smaller alpha for tc that
would give the same results – a kind of
"effective alpha" that takes into account that
you are estimating the entire line.
We find W for our desired true α, and then
find the effective αt to use with tc so that
tc(αt) = W(α).
SAS Example
(musclemass.sas)
(Problem 1.27 in KNNL)
Muscle mass is expected to decrease with
age. Study explores this relationship in
women (n = 60)
15 women randomly selected from each of
four age groups 40-49, 50-59, 60-69, 70-79
We will analyze this data set assuming that
the simple linear regression model applies.
Read in the Data
For textbook files – easiest way is to simply
open data as text file or through website
and paste it into SAS using “datalines”.
DATA muscle;
input mmass age;
datalines;
106 43
106 41
.....
;
Produce a Scatter Plot
goptions ftitle=centb ftext=swissb htitle=3
htext=1.5 ctitle=blue ctext=black;
symbol1 v=dot c=blue ;
axis1 label=('Age (Years)');
axis2 label=(angle=90 'Muscle Mass');
PROC GPLOT data=muscle;
plot mmass*age /haxis=axis1 vaxis=axis2;
title 'Muscle Mass vs Age in women';
RUN; QUIT;
[Figure: scatter plot of muscle mass vs. age]
Examining Scatter Plots
Form – linear looks mostly reasonable
Direction – muscle mass seems to decrease
as age increases
Strength – there is quite a bit of scatter so
the relationship is likely weak to moderate
Regression Model Goals
Estimate the difference in mean muscle
mass for women differing in age by 1 year.
Produce CI’s and PI’s for women age 50,
60, and 70
Plot 95% Confidence Band for the
regression line.
Preliminaries
DATA slime;
age = 50; mmass = .; output;
age = 60; mmass = .; output;
age = 70; mmass = .; output;
DATA muscle; set muscle slime;
PROC PRINT; RUN;
This adds to the data set so that we can easily
predict for ages of 50, 60, and 70.
PROC REG
PROC REG data=muscle outest=params outseb;
model mmass=age /clb clm cli;
output out=mean_resp p=predicted
stdp=SE_mean lclm = LCL_mean
uclm=UCL_mean;
output out=predict p=predicted
stdi=SE_pred lcl=LCL_pred
ucl=UCL_pred;
id age;
PROC PRINT data=params;
PROC PRINT data=mean_resp; where mmass=.;
PROC PRINT data=predict; where mmass=.;
RUN;
Output (1)
Analysis of Variance
Source    DF    Sum of Squares    Mean Square    F Value    Pr > F
Model      1             11627          11627     174.06    <.0001
Error     58              3875           66.8
Total     59             15502
Root MSE 8.17318    R-Square 0.7501
Output (2)
Parameter Estimates
Variable    DF    Estimate    Std Error    t Value    Pr > |t|    95% Confidence Limits
Intercept    1      156.35      5.51226      28.36      <.0001    (145.31257, 167.38056)
age          1       -1.19      0.09020     -13.19      <.0001    ( -1.37054,  -1.00945)
Interpretation
In women, muscle mass decreases by an
average of 1.19 units per year.
A 95% CI for the amount of this decrease is
(1.01, 1.37). In other words, the 95% CI
for β1 is (−1.37, −1.01).
Note: 95% represents the probability that,
for any given repetition of the experiment,
the confidence interval will actually cover
the true value.
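As a quick sanity check (not from the slides), the confidence limits SAS reports can be reproduced from the printed estimate and standard error; the critical value t(0.975; 58) ≈ 2.0017 is hard-coded here from a t table, not computed.

```python
# Sketch: reproduce the 95% CI for the slope from the printed SAS output.
# t_crit = t(0.975; 58) is hard-coded from a t table (assumption).
b1 = -1.19       # slope estimate as printed by SAS
se_b1 = 0.09020  # standard error of the slope
t_crit = 2.0017  # t(0.975; 58)

lower = b1 - t_crit * se_b1  # close to SAS's -1.37054
upper = b1 + t_crit * se_b1  # close to SAS's -1.00945
```

Tiny differences from the SAS limits arise because SAS carries the unrounded slope estimate.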
Output (3)
Obs age predict SE_mean LCL_mean UCL_mean
61 50 96.8468 1.38715 94.0701 99.6235
62 60 84.9468 1.05515 82.8347 87.0590
63 70 73.0469 1.38911 70.2663 75.8275
Obs age predict LCL_pred UCL_pred SE_pred
61 50 96.8468 80.2524 113.441 8.29005
62 60 84.9468 68.4507 101.443 8.24101
63 70 73.0469 56.4519 89.642 8.29038
Interpretation
Prediction intervals are pretty wide –
indicating that there is a large amount of
variation. The estimated standard
deviation (RMSE) was 8.2.
We wouldn't be able to predict the muscle
mass of a single subject very well, but we can
estimate the average muscle mass at a given
age quite precisely (the SE's associated with
the mean response are fairly small).
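To see where the wide limits come from (a sketch, not from the slides): each prediction limit is predicted ± t(0.975; 58) · SE_pred, with the critical value ≈ 2.0017 hard-coded from a t table.

```python
# Sketch: reconstruct the 95% prediction limits at age 50 from the
# printed predicted value and SE_pred; t_crit is hard-coded (assumption).
pred = 96.8468
se_pred = 8.29005
t_crit = 2.0017  # t(0.975; 58)

lower = pred - t_crit * se_pred  # close to SAS's 80.2524
upper = pred + t_crit * se_pred  # close to SAS's 113.441
width = upper - lower            # over 33 units of muscle mass wide
```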
Regression Plots
symbol1 v=dot c=blue;
symbol2 v=none i=rlclm95 c=green;
symbol3 v=none i=rlcli95 c=red;
PROC GPLOT data=muscle;
plot mmass*(age age age) / haxis=axis1
vaxis=axis2 overlay;
title2 'Confidence and Prediction Bands';
RUN; QUIT;
[Figure: scatter plot with 95% confidence and prediction bands overlaid]
Working-Hotelling Adjustment
Previous Confidence Bands were
unadjusted.
To produce the W-H confidence bands for
the regression line, first use the F-
distribution to compute W ≈ 2.51.
For the t-distribution with 58 degrees of
freedom, this corresponds to an effective
alpha of about 0.015; rounding down to 0.01
(more conservative), use 0.99 instead of
0.95 to get the adjusted confidence band.
Compute Effective Alpha
data a1;
n=60; alpha=0.05; dfn = 2; dfd = n-2;
w2 = 2 * finv(1-alpha,dfn,dfd);
w=sqrt(w2); alphat=2*(1-probt(w,dfd));
tstar=tinv(1-alphat/2,dfd); output;
PROC PRINT data=a1; RUN;
n alpha dfn dfd w2 w alphat tstar
60 0.05 2 58 6.31 2.51 0.0148 2.51234
Use 0.01 (more conservative) as effective alpha.
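The same computation can be sketched outside SAS; since no distribution functions are assumed available here, the quantile F(0.95; 2, 58) ≈ 3.156 is hard-coded from an F table rather than computed.

```python
import math

# Sketch of the SAS data step above in Python. The F critical value
# is hard-coded from an F table (assumption), not computed.
n = 60
f_crit = 3.156     # F(0.95; 2, n-2), from an F table
w2 = 2 * f_crit    # matches SAS's w2 = 6.31
w = math.sqrt(w2)  # matches SAS's w = 2.51
```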
[Figure: regression line with Working-Hotelling adjusted confidence band]
ANOVA Table
Organize the variation arithmetically.
Total (or corrected total) sum of squares is
SSTOT = SSY = Σ(Yi − Ȳ)²
Think of this as the total possible variation
that might be explained by the model. The
percentage of SSTOT that we actually
explain is the coefficient of determination R².
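For the muscle-mass example this ratio can be checked directly from the ANOVA table (a sketch using the printed, rounded sums of squares):

```python
# Sketch: R^2 = SSR / SSTOT using the rounded values from the SAS output.
ssr = 11627.0    # model sum of squares
sstot = 15502.0  # total sum of squares
r_squared = ssr / sstot  # close to SAS's R-Square = 0.7501
```

The small discrepancy from SAS's 0.7501 comes from SAS using unrounded sums of squares.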
Partitioning SSTOT
Two sources: MODEL (variation explained
by regression) and ERROR (unexplained
or residual variation)
Σ(Yi − Ȳ)² = Σ(Ŷi − Ȳ)² + Σ(Yi − Ŷi)²
  SSTOT    =    SSR      +    SSE
(cross terms cancel: see page 65)
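The partition can be verified numerically on a tiny made-up data set (the four points below are illustrative, not from KNNL):

```python
# Sketch: check SSTOT = SSR + SSE with plain least-squares formulas.
xs = [40.0, 50.0, 60.0, 70.0]   # illustrative ages
ys = [110.0, 95.0, 90.0, 75.0]  # illustrative muscle masses
n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
      / sum((x - xbar) ** 2 for x in xs))
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * x for x in xs]
sstot = sum((y - ybar) ** 2 for y in ys)
ssr = sum((yh - ybar) ** 2 for yh in yhat)
sse = sum((y - yh) ** 2 for y, yh in zip(ys, yhat))
assert abs(sstot - (ssr + sse)) < 1e-9  # cross terms cancel
```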
[Figure: decomposition of the deviation Yi − Ȳ into (Ŷi − Ȳ) + (Yi − Ŷi)]
Total Sum of Squares
Ignore X while predicting Y: the best predictor
is Ȳ. SSTOT is the sum of squared
deviations from this predictor:
SSTOT = SSY = Σ(Yi − Ȳ)²
Degrees of freedom is n − 1, since Ȳ is estimated.
Mean Square: MSTOT = SSTOT / (n − 1) is
the usual estimate of the variance when there
is no predictor term involved.
Model Sum of Squares
Variation explained by the regression model:
SSR = Σ(Ŷi − Ȳ)²
Degrees of freedom is 1, since we estimate the
slope parameter (the intercept parameter is taken
care of already in the estimation of Ȳ).
Mean Square: MSR = SSR / dfR
Error Sum of Squares
Unexplained variation:
SSE = Σ(Yi − Ŷi)²
Degrees of freedom is n − 2 (the difference
between the total and model degrees of
freedom).
Mean Square Error is MSE = SSE / dfE.
This is the best estimate of the variance of
Y once we condition on the explanatory
variable(s).
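For the muscle-mass data this gives (a sketch using the rounded SSE from the SAS output):

```python
import math

# Sketch: MSE and Root MSE from the printed ANOVA values.
sse = 3875.0  # error sum of squares (rounded, as printed)
df_e = 58     # n - 2 with n = 60
mse = sse / df_e       # close to SAS's 66.8
rmse = math.sqrt(mse)  # close to SAS's Root MSE = 8.17318
```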
ANOVA Table
Source        df       SS              MS
Regression
(Model)       1        Σ(Ŷi − Ȳ)²     SSR / dfR
Error         n − 2    Σ(Yi − Ŷi)²    SSE / dfE
Total         n − 1    Σ(Yi − Ȳ)²     SSTO / dfT
Expected Mean Squares
Mean squares are random variables, since the
Y's are random variables. We can compute:
E(MSR) = σ² + β1² SSX
E(MSE) = σ²
When H0: β1 = 0 is true, E(MSR)
and E(MSE) are identical and in
particular their ratio is 1.
F-test
Under the null, F = MSR/MSE has an F
distribution with 1 and n – 2 degrees of
freedom.
When H0: β1 = 0 is false, MSR tends to be
larger, so we would want to reject the null
when F is large.
Generally, reject if F is bigger than critical
value (or in practice, when p-value is less
than the significance level).
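With the muscle-mass numbers (a sketch from the rounded ANOVA entries; SAS's printed 174.06 uses unrounded sums of squares):

```python
# Sketch: F = MSR / MSE from the printed ANOVA values.
msr = 11627.0        # model mean square (df = 1)
mse = 3875.0 / 58.0  # error mean square
f_stat = msr / mse   # close to SAS's F Value = 174.06
```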
ANOVA Table with Test
Source    df       SS     MS     F          P
Model     1        SSM    MSM    MSM/MSE    .xxx
Error     n − 2    SSE    MSE
Total     n − 1
("Model" used here b/c this is what you see in SAS)
Example (Muscle Mass)
Analysis of Variance
Source    DF    Sum of Squares    Mean Square    F Value    Pr > F
Model      1             11627          11627     174.06    <.0001
Error     58              3875           66.8
Total     59             15502
Root MSE 8.17318    R-Square 0.7501
Upcoming in Lecture 5...
General Linear Test (Section 2.8)
Coefficient of Determination/Correlation
(Section 2.9)
Assessing Validity of Model Assumptions
(Chapter 3)