Lecture 4: ANOVA Table - Purdue University
Lecture 4: ANOVA Table
STAT 512
Spring 2011
Background Reading
KNNL: 2.6-2.7
Topic Overview
Working-Hotelling Confidence Band
Inference Example using SAS
ANOVA Table
Working-Hotelling Confidence Band (1)
This gives a confidence limit for the whole
line at once, in contrast to the confidence
interval for just one Ŷh at a time.
The regression line b0 + b1 Xh describes E(Yh) for
a given Xh.
We have a 95% CI for E(Yh), estimated by Ŷh,
pertaining to a specific Xh.
Working-Hotelling Confidence Band (2)
We want a 95% confidence band for all Xh –
this is a confidence limit for the whole line at
once, in contrast to the confidence interval for
just one Ŷh at a time.
The confidence limits are given by Ŷh ± W s{Ŷh},
where W² = 2 F(1 − α; 2, n − 2). Since we are doing
all values of Xh at once, the band is wider at each
Xh than the CI for an individual Xh.
Working-Hotelling Confidence Band (3)
We are used to constructing CI's with t's, not
W's. Can we fake it?
We can find a new, smaller alpha for tc that
would give the same results – a kind of
"effective alpha" that takes into account that
you are estimating the entire line.
We find W for our desired true α, and then
find the effective αt to use with tc so that
tc(αt) = W(α).
SAS Example
(musclemass.sas)
(Problem 1.27 in KNNL)
Muscle mass is expected to decrease with
age. Study explores this relationship in
women (n = 60)
15 women randomly selected from each of
four age groups 40-49, 50-59, 60-69, 70-79
We will analyze this data set assuming that
the simple linear regression model applies.
Read in the Data
For textbook files – easiest way is to simply
open data as text file or through website
and paste it into SAS using “datalines”.
DATA muscle;
input mmass age;
datalines;
106 43
106 41
.....
;
Produce a Scatter Plot
goptions ftitle=centb ftext=swissb htitle=3
htext=1.5 ctitle=blue ctext=black;
symbol1 v=dot c=blue ;
axis1 label=('Age (Years)');
axis2 label=(angle=90 'Muscle Mass');
PROC GPLOT data=muscle;
plot mmass*age /haxis=axis1 vaxis=axis2;
title 'Muscle Mass vs Age in women';
RUN; QUIT;
[Figure: scatter plot of muscle mass vs. age]
Examining Scatter Plots
Form – linear looks mostly reasonable
Direction – muscle mass seems to decrease
as age increases
Strength – there is quite a bit of scatter so
the relationship is likely weak to moderate
Regression Model Goals
Estimate the difference in mean muscle
mass for women differing in age by 1 year.
Produce CI’s and PI’s for women age 50,
60, and 70
Plot 95% Confidence Band for the
regression line.
Preliminaries
DATA slime;
age = 50; mmass = .; output;
age = 60; mmass = .; output;
age = 70; mmass = .; output;
DATA muscle; set muscle slime;
PROC PRINT; RUN;
This adds to the data set so that we can easily
predict for ages of 50, 60, and 70.
PROC REG
PROC REG data=muscle outest=params outseb;
model mmass=age /clb clm cli;
output out=mean_resp p=predicted
stdp=SE_mean lclm = LCL_mean
uclm=UCL_mean;
output out=predict p=predicted
stdi=SE_pred lcl=LCL_pred
ucl=UCL_pred;
id age;
PROC PRINT data=params;
PROC PRINT data=mean_resp; where mmass=.;
PROC PRINT data=predict; where mmass=.;
RUN;
Output (1)
Analysis of Variance
Source    DF    Sum of Squares    Mean Square    F Value    Pr > F
Model      1             11627          11627     174.06    <.0001
Error     58              3875           66.8
Total     59             15502
Root MSE 8.17318    R-Square 0.7501
Output (2)
Parameter Estimates
Variable    DF    Estimate    Std Error    t Value    Pr > |t|    95% Confidence Limits
Intercept    1      156.35      5.51226      28.36      <.0001    (145.31257, 167.38056)
age          1       -1.19      0.09020     -13.19      <.0001    ( -1.37054,  -1.00945)
Interpretation
In women, muscle mass decreases by an
average of 1.19 units per year.
A 95% CI for the amount of this decrease is
(1.01, 1.37). In other words, the 95% CI
for β1 is (−1.37, −1.01).
Note: 95% represents the probability that,
for any given repetition of the experiment,
the confidence interval will actually cover
the true value.
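As a quick sanity check (not from the slides), the confidence limits SAS reports can be reproduced from the printed estimate and standard error; the critical value t(0.975; 58) ≈ 2.0017 is hard-coded here from a t table, not computed.

```python
# Sketch: reproduce the 95% CI for the slope from the printed SAS output.
# t_crit = t(0.975; 58) is hard-coded from a t table (assumption).
b1 = -1.19       # slope estimate as printed by SAS
se_b1 = 0.09020  # standard error of the slope
t_crit = 2.0017  # t(0.975; 58)

lower = b1 - t_crit * se_b1  # close to SAS's -1.37054
upper = b1 + t_crit * se_b1  # close to SAS's -1.00945
```

Tiny differences from the SAS limits arise because SAS carries the unrounded slope estimate.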
Output (3)
Obs age predict SE_mean LCL_mean UCL_mean
61 50 96.8468 1.38715 94.0701 99.6235
62 60 84.9468 1.05515 82.8347 87.0590
63 70 73.0469 1.38911 70.2663 75.8275
Obs age predict LCL_pred UCL_pred SE_pred
61 50 96.8468 80.2524 113.441 8.29005
62 60 84.9468 68.4507 101.443 8.24101
63 70 73.0469 56.4519 89.642 8.29038
Interpretation
Prediction intervals are pretty wide –
indicating that there is a large amount of
variation. The estimated standard
deviation (RMSE) was 8.2.
We wouldn't be able to predict the muscle
mass of a single subject very well, but we can
estimate the average muscle mass at a given
age quite precisely (the SE's associated with
the mean response are fairly small).
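To see where the wide limits come from (a sketch, not from the slides): each prediction limit is predicted ± t(0.975; 58) · SE_pred, with the critical value ≈ 2.0017 hard-coded from a t table.

```python
# Sketch: reconstruct the 95% prediction limits at age 50 from the
# printed predicted value and SE_pred; t_crit is hard-coded (assumption).
pred = 96.8468
se_pred = 8.29005
t_crit = 2.0017  # t(0.975; 58)

lower = pred - t_crit * se_pred  # close to SAS's 80.2524
upper = pred + t_crit * se_pred  # close to SAS's 113.441
width = upper - lower            # over 33 units of muscle mass wide
```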
Regression Plots
symbol1 v=dot c=blue;
symbol2 v=none i=rlclm95 c=green;
symbol3 v=none i=rlcli95 c=red;
PROC GPLOT data=muscle;
plot mmass*(age age age) / haxis=axis1
vaxis=axis2 overlay;
title2 'Confidence and Prediction Bands';
RUN; QUIT;
[Figure: scatter plot with 95% confidence and prediction bands overlaid]
Working-Hotelling Adjustment
Previous Confidence Bands were
unadjusted.
To produce the W-H confidence bands for
the regression line, first use the F-
distribution to compute W ≈ 2.51.
For the t-distribution with 58 degrees of
freedom, this corresponds to an effective
alpha of about 0.015; rounding down to 0.01
(more conservative), use 0.99 instead of
0.95 to get the adjusted confidence band.
Compute Effective Alpha
data a1;
n=60; alpha=0.05; dfn = 2; dfd = n-2;
w2 = 2 * finv(1-alpha,dfn,dfd);
w=sqrt(w2); alphat=2*(1-probt(w,dfd));
tstar=tinv(1-alphat/2,dfd); output;
PROC PRINT data=a1; RUN;
n alpha dfn dfd w2 w alphat tstar
60 0.05 2 58 6.31 2.51 0.0148 2.51234
Use 0.01 (more conservative) as effective alpha.
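The same computation can be sketched outside SAS; since no distribution functions are assumed available here, the quantile F(0.95; 2, 58) ≈ 3.156 is hard-coded from an F table rather than computed.

```python
import math

# Sketch of the SAS data step above in Python. The F critical value
# is hard-coded from an F table (assumption), not computed.
n = 60
f_crit = 3.156     # F(0.95; 2, n-2), from an F table
w2 = 2 * f_crit    # matches SAS's w2 = 6.31
w = math.sqrt(w2)  # matches SAS's w = 2.51
```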
[Figure: regression line with Working-Hotelling adjusted confidence band]
ANOVA Table
Organize the variation arithmetically.
Total (or corrected total) sum of squares is
SSTOT = SSY = Σ(Yi − Ȳ)²
Think of this as the total possible variation
that might be explained by the model. The
percentage of SSTOT that we actually
explain is the coefficient of determination R².
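For the muscle-mass example this ratio can be checked directly from the ANOVA table (a sketch using the printed, rounded sums of squares):

```python
# Sketch: R^2 = SSR / SSTOT using the rounded values from the SAS output.
ssr = 11627.0    # model sum of squares
sstot = 15502.0  # total sum of squares
r_squared = ssr / sstot  # close to SAS's R-Square = 0.7501
```

The small discrepancy from SAS's 0.7501 comes from SAS using unrounded sums of squares.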
Partitioning SSTOT
Two sources: MODEL (variation explained
by regression) and ERROR (unexplained
or residual variation)
Σ(Yi − Ȳ)² = Σ(Ŷi − Ȳ)² + Σ(Yi − Ŷi)²
  SSTOT    =    SSR      +    SSE
(cross terms cancel: see page 65)
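The partition can be verified numerically on a tiny made-up data set (the four points below are illustrative, not from KNNL):

```python
# Sketch: check SSTOT = SSR + SSE with plain least-squares formulas.
xs = [40.0, 50.0, 60.0, 70.0]   # illustrative ages
ys = [110.0, 95.0, 90.0, 75.0]  # illustrative muscle masses
n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
      / sum((x - xbar) ** 2 for x in xs))
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * x for x in xs]
sstot = sum((y - ybar) ** 2 for y in ys)
ssr = sum((yh - ybar) ** 2 for yh in yhat)
sse = sum((y - yh) ** 2 for y, yh in zip(ys, yhat))
assert abs(sstot - (ssr + sse)) < 1e-9  # cross terms cancel
```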
[Figure: decomposition of the deviation Yi − Ȳ into (Ŷi − Ȳ) + (Yi − Ŷi)]
Total Sum of Squares
Ignore X while predicting Y: the best predictor
is Ȳ. SSTOT is the sum of squared
deviations from this predictor:
SSTOT = SSY = Σ(Yi − Ȳ)²
Degrees of freedom is n − 1, since Ȳ is estimated.
Mean Square: MSTOT = SSTOT / (n − 1) is
the usual estimate of the variance when there
is no predictor term involved.
Model Sum of Squares
Variation explained by the regression model:
SSR = Σ(Ŷi − Ȳ)²
Degrees of freedom is 1, since we estimate the
slope parameter (the intercept parameter is taken
care of already in the estimation of Ȳ).
Mean Square: MSR = SSR / dfR
Error Sum of Squares
Unexplained variation:
SSE = Σ(Yi − Ŷi)²
Degrees of freedom is n − 2 (the difference
between the total and model degrees of
freedom).
Mean Square Error is MSE = SSE / dfE.
This is the best estimate of the variance of
Y once we condition on the explanatory
variable(s).
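For the muscle-mass data this gives (a sketch using the rounded SSE from the SAS output):

```python
import math

# Sketch: MSE and Root MSE from the printed ANOVA values.
sse = 3875.0  # error sum of squares (rounded, as printed)
df_e = 58     # n - 2 with n = 60
mse = sse / df_e       # close to SAS's 66.8
rmse = math.sqrt(mse)  # close to SAS's Root MSE = 8.17318
```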
ANOVA Table
Source        df       SS              MS
Regression
(Model)       1        Σ(Ŷi − Ȳ)²     SSR / dfR
Error         n − 2    Σ(Yi − Ŷi)²    SSE / dfE
Total         n − 1    Σ(Yi − Ȳ)²     SSTO / dfT
Expected Mean Squares
Mean squares are random variables, since the
Y's are random variables. We can compute:
E(MSR) = σ² + β1² SSX
E(MSE) = σ²
When H0: β1 = 0 is true, E(MSR)
and E(MSE) are identical and in
particular their ratio is 1.
F-test
Under the null, F = MSR/MSE has an F
distribution with 1 and n – 2 degrees of
freedom.
When H0: β1 = 0 is false, MSR tends to be
larger, so we would want to reject the null
when F is large.
Generally, reject if F is bigger than critical
value (or in practice, when p-value is less
than the significance level).
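With the muscle-mass numbers (a sketch from the rounded ANOVA entries; SAS's printed 174.06 uses unrounded sums of squares):

```python
# Sketch: F = MSR / MSE from the printed ANOVA values.
msr = 11627.0        # model mean square (df = 1)
mse = 3875.0 / 58.0  # error mean square
f_stat = msr / mse   # close to SAS's F Value = 174.06
```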
ANOVA Table with Test
Source    df       SS     MS     F          P
Model     1        SSM    MSM    MSM/MSE    .xxx
Error     n − 2    SSE    MSE
Total     n − 1
("Model" used here b/c this is what you see in SAS)
Example (Muscle Mass)
Analysis of Variance
Source    DF    Sum of Squares    Mean Square    F Value    Pr > F
Model      1             11627          11627     174.06    <.0001
Error     58              3875           66.8
Total     59             15502
Root MSE 8.17318    R-Square 0.7501
Upcoming in Lecture 5...
General Linear Test (Section 2.8)
Coefficient of Determination/Correlation
(Section 2.9)
Assessing Validity of Model Assumptions
(Chapter 3)