Lecture 13: Multiple linear regression
Transcript of Lecture 13: Multiple linear regression
2001
Bio 4118 Applied Biostatistics
Université d’Ottawa / University of Ottawa
Lecture 13: Multiple linear regression

- When and why we use it
- The general multiple regression model
- Hypothesis testing in multiple regression
- The problem of multicollinearity
- Multiple regression procedures
- Polynomial regression
- Power analysis in multiple regression
Some GLM procedures

Procedure                       Dependent variable   Independent variable(s)
Simple regression               1 continuous         1 continuous
Single-classification ANOVA     1 continuous         1 categorical*
Multiple-classification ANOVA   1 continuous         2 or more categorical*
ANCOVA                          1 continuous         at least 1 categorical*, at least 1 continuous
Multiple regression             1 continuous         2 or more continuous

*either categorical or treated as a categorical variable
When do we use multiple regression?

- to model the relationship between a continuous dependent (Y) variable and several continuous independent (X1, X2, …) variables
- e.g. the relationship between lake primary production, phosphorus concentration and zooplankton abundance

[Figure: log primary production as a function of log [P] and log [Zoo]]
The multiple regression model: general form

The general model is:

Y_i = α + Σ(j = 1 to k) β_j X_ij + ε_i

which defines a k-dimensional plane, where α = intercept, β_j = partial regression coefficient of Y on X_j, X_ij is the value of the ith observation of independent variable X_j, and ε_i is the residual of the ith observation.

[Figure: regression plane of Y on X1 and X2, with observed values Y and predicted values Ŷ]
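As a minimal sketch, the general model can be fit by ordinary least squares. The data here are simulated (not the lecture's lake data), with k = 2 predictors and true values α = 1.0, β1 = 2.0, β2 = −1.5:

```python
import numpy as np

# Simulated data for the model Y = alpha + b1*X1 + b2*X2 + error.
rng = np.random.default_rng(0)
n = 50
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
Y = 1.0 + 2.0 * X1 - 1.5 * X2 + rng.normal(scale=0.5, size=n)

# Design matrix: a column of ones (intercept) plus the predictors.
D = np.column_stack([np.ones(n), X1, X2])

# Least-squares estimates of (alpha, beta1, beta2).
alpha, b1, b2 = np.linalg.lstsq(D, Y, rcond=None)[0]
```

The estimates should land close to the true coefficients used to generate the data.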
What is the partial regression coefficient anyway?

β_j is the rate of change in Y per unit change in X_j with all other variables held constant; this is not the slope of the regression of Y on X_j pooled over all other variables!

[Figure: partial regression lines of Y on X1 at fixed values of X2 (X2 = −3, −1, 1, 3), contrasted with the simple (pooled) regression]
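The contrast between the partial and the pooled slope is easy to simulate. A sketch with hypothetical data (true model Y = 1·X1 + 2·X2, with X2 deliberately correlated with X1):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X1 = rng.normal(size=n)
X2 = X1 + rng.normal(scale=0.5, size=n)        # X2 correlated with X1
Y = 1.0 * X1 + 2.0 * X2 + rng.normal(scale=0.1, size=n)

# Pooled (simple) slope of Y on X1 alone.
pooled = np.polyfit(X1, Y, 1)[0]

# Partial coefficient of X1 from the full model Y ~ X1 + X2.
D = np.column_stack([np.ones(n), X1, X2])
partial = np.linalg.lstsq(D, Y, rcond=None)[0][1]
```

The partial coefficient recovers the true value 1.0, while the pooled slope is inflated (close to 3 here) because X1 carries X2's effect along with it.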
The effect of scale

- Two independent variables on different scales will have different slopes, even if the proportional change in Y is the same.
- So, if we want to measure the relative strength of the influence of each variable on Y, we must eliminate the effect of different scales.

[Figure: Y versus X_j on two scales; β_j = 2 over the range 1 to 2 versus β_j = .02 over the range 100 to 200]
The multiple regression model: standardized form

Since β_j depends on the size of X_j, to examine the relative effect of each independent variable we must standardize the regression coefficients by first transforming all variables:

Y_i* = (Y_i − Ȳ) / s_Y,   X_ij* = (X_ij − X̄_j) / s_Xj

and fitting the regression model based on the transformed variables:

Y_i* = Σ(j = 1 to k) β_j* X_ij* + ε_i*,   β_j* = β_j (s_Xj / s_Y)

The standardized coefficients β_j* estimate the relative strength of the influence of variable X_j on Y.
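A sketch with hypothetical data, where X2 is on a scale 100 times larger than X1 but the two variables have equally strong effects; standardizing before fitting puts the coefficients on common footing:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
X1 = rng.normal(size=n)                        # sd = 1
X2 = rng.normal(scale=100.0, size=n)           # sd = 100
# Raw slopes differ 100-fold (2.0 vs 0.02), but the effects are equal.
Y = 2.0 * X1 + 0.02 * X2 + rng.normal(scale=0.5, size=n)

def zscore(v):
    return (v - v.mean()) / v.std(ddof=1)

# Fit on standardized variables; after centring, no intercept is needed.
Xs = np.column_stack([zscore(X1), zscore(X2)])
b_star = np.linalg.lstsq(Xs, zscore(Y), rcond=None)[0]
```

Despite the 100-fold difference in raw slopes, the two standardized coefficients come out nearly equal.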
Regression coefficients: summary
Partial regression coefficient: equals the slope of the regression of Y on Xj when all other independent variables are held constant.
Standardized partial regression coefficient: the rate of change of Y in standard deviation units per one standard deviation of Xj with all other independent variables held constant.
Assumptions

- independence of residuals
- homoscedasticity of residuals
- linearity (Y on all X)
- no error on independent variables
- normality of residuals
Hypothesis testing in simple linear regression: partitioning the total sums of squares

Total SS = Model (Explained) SS + Unexplained (Error) SS:

Σ(i = 1 to N) (Y_i − Ȳ)² = Σ(i = 1 to N) (Ŷ_i − Ȳ)² + Σ(i = 1 to N) (Y_i − Ŷ_i)²
Hypothesis testing in multiple regression I: partitioning the total sums of squares

Partition the total sums of squares into model and residual SS:

SS_total = Σ(i = 1 to N) (Y_i − Ȳ)²
SS_model = Σ(i = 1 to N) (Ŷ_i − Ȳ)²
SS_error = Σ(i = 1 to N) (Y_i − Ŷ_i)²

[Figure: regression plane of Y on X1 and X2, showing total SS, model SS and residual SS]
Hypothesis testing I: partitioning the total sums of squares

MS_model = Σ(i = 1 to N) (Ŷ_i − Ȳ)² / 1
MS_error = Σ(i = 1 to N) (Y_i − Ŷ_i)² / (N − 2)

- So MS_model = s²_Y and MS_error = 0 if observed = expected for all i.
- Calculate F = MS_model / MS_error and compare with the F distribution with 1 and N − 2 df.
- H0: F = 1
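The partition and the F ratio can be computed directly. A sketch with simulated data and k = 2 predictors (so the overall test has k and N − k − 1 df, the multiple-regression generalization of the 1 and N − 2 df above):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 40, 2
X = rng.normal(size=(n, k))
Y = 1.0 + X @ np.array([1.5, -1.0]) + rng.normal(scale=0.5, size=n)

D = np.column_stack([np.ones(n), X])            # design matrix
Yhat = D @ np.linalg.lstsq(D, Y, rcond=None)[0]

ss_total = np.sum((Y - Y.mean()) ** 2)
ss_model = np.sum((Yhat - Y.mean()) ** 2)
ss_error = np.sum((Y - Yhat) ** 2)

# Overall F for H0: all partial regression coefficients are zero.
F = (ss_model / k) / (ss_error / (n - k - 1))
```

The identity SS_total = SS_model + SS_error holds exactly (up to floating-point error) whenever the model includes an intercept.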
Hypothesis testing II: testing individual partial regression coefficients

Test each hypothesis H0: β_j = 0 by a t-test:

t = b_j / s_bj

Note: these are 2-tailed hypotheses!

[Figure: Y versus X2 with X1 fixed (X1 = 2, 3): H02: β2 = 0 accepted; Y versus X1 with X2 fixed (X2 = 1, 2): H01: β1 = 0 rejected]
Multicollinearity

Independent variables are correlated, and therefore not independent: evaluate by looking at the covariance or correlation matrix.

[Table: covariance matrix of X1, X2, X3 — variances σ² on the diagonal, covariances σ_jk off the diagonal]
[Figure: X1 collinear with X2; X3 independent of X2]
Multicollinearity: problems

- If two independent variables X1 and X2 are uncorrelated, then the model sums of squares for a linear model with both included equals the sum of the SS_model for each considered separately:

SS_model(X1, X2) = SS_model(X1) + SS_model(X2), if r²(X1, X2) = 0

- But if they are correlated, the former will be less than the latter.
- So the real question is: given a model with X1 included, how much does SS_model increase when X2 is also included (or vice versa)?
Multicollinearity: consequences

- inflated standard errors for regression coefficients
- sensitivity of parameter estimates to small changes in data
- But estimates of partial regression coefficients remain unbiased.
- One or more independent variables may not appear in the final regression model not because they do not covary with Y, but because they covary with another X.
Detecting multicollinearity

- high R² but few or no significant t-tests for individual independent variables
- high pairwise correlations between X's
- high partial correlations among regressors (independent variables are a linear combination of others)
- eigenvalues, condition index, tolerance and variance inflation factors
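The last two diagnostics are easy to compute by hand: the tolerance of X_j is 1 minus the R² of X_j regressed on all the other X's, and the variance inflation factor (VIF) is its reciprocal. A sketch with hypothetical data in which X2 is built to be collinear with X1:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
X1 = rng.normal(size=n)
X2 = X1 + rng.normal(scale=0.3, size=n)   # strongly collinear with X1
X3 = rng.normal(size=n)                   # independent of the others
X = np.column_stack([X1, X2, X3])

def tolerance(X, j):
    # Regress X_j on the remaining columns; tolerance = 1 - R^2.
    others = np.delete(X, j, axis=1)
    D = np.column_stack([np.ones(len(X)), others])
    resid = X[:, j] - D @ np.linalg.lstsq(D, X[:, j], rcond=None)[0]
    r2 = 1 - resid.var() / X[:, j].var()
    return 1 - r2

tol = [tolerance(X, j) for j in range(3)]
vif = [1 / t for t in tol]
```

Here the collinear pair X1, X2 shows low tolerance (high VIF), while the independent X3 has tolerance near 1 and VIF near 1.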
Quantifying the effect of multicollinearity

- Eigenvectors: a set of "lines" ℓ1, ℓ2, …, ℓk in a k-dimensional space which are orthogonal to each other.
- Eigenvalue λ_j: the magnitude (length) of the corresponding eigenvector.

[Figure: eigenvectors ℓ1 and ℓ2 of the scatter of X1 and X2, with eigenvalues λ1 and λ2]
Quantifying the effect of multicollinearity

- Eigenvalues: if all k eigenvalues are approximately equal, multicollinearity is low.
- Condition index: √(λ_largest / λ_smallest); near 1 indicates low multicollinearity.
- Tolerance: 1 − the proportion of variance in each independent variable accounted for by all other independent variables; near 1 indicates low multicollinearity.

[Figure: low correlation of X1 and X2, λ1 = λ2; high correlation, λ1 >> λ2]
Remedial measures

- Get more data to reduce correlations.
- Drop some variables.
- Use principal component or ridge regression, which yield biased estimates but with smaller standard errors.
Multiple regression: the general idea

- Evaluate the significance of a variable by fitting two models: one with the term in, the other with it removed.
- Test for the change in model fit (ΔMF, e.g. ΔR²) associated with removal of the term in question.
- Unfortunately, ΔMF may depend on what other variables are in the model if there is multicollinearity!

[Diagram: Model A (X1 in) versus Model B (X1 out); delete X1 if ΔMF is small, retain X1 if ΔMF is large]
Fitting multiple regression models

- Goal: find the "best" model, given the available data.
- Problem 1: what is "best"?
  - highest R²?
  - lowest RMS?
  - highest R² but containing only individually significant independent variables?
  - maximizes R² with the minimum number of independent variables?
Selection of independent variables (cont'd)

- Problem 2: even if "best" is defined, by what method do we find it?
- Possibilities:
  - compute all possible models (2^k − 1 of them) and choose the best one.
  - use some procedure for winnowing down the set of possible models.
Strategy I: computing all possible models

- Compute all possible models and choose the "best" one.
- cons: time-consuming; leaves the definition of "best" to the researcher
- pros: if the "best" model is defined, you will find it!

[Diagram: all subsets of {X1, X2, X3}: {X1}, {X2}, {X3}, {X1, X2}, {X1, X3}, {X2, X3}, {X1, X2, X3}]
Strategy II: forward selection

- Start with the variable that has the highest (significant) R², i.e. the highest partial correlation coefficient r.
- Add the others one at a time until there is no further significant increase in R², with the β_js recomputed at each step.
- Problem: once X_j is included, it stays in even if it contributes little to SS_model once other variables are included.

[Diagram: r2 > r1 > r3, so start with {X2}; add X1 if the increase R²(X1, X2) − R²(X2) is significant; continue toward {X1, X2, X3} until no significant increase, giving the final model]
Forward selection: order of entry

- Begin with the variable with the highest partial correlation coefficient.
- The next entry is the variable that gives the largest increase in overall R², by an F-test of the significance of the increase, above some specified F-to-enter (below a specified p-to-enter) value.

[Diagram: four-variable example with r2 > r1 > r3 > r4 and p to enter = .05; p[F(X2)] = .001, so X2 enters first; candidate second entries p[F(X2, X1)] = .002, p[F(X2, X3)] = .04, p[F(X2, X4)] = .55; X4 eliminated]
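The procedure can be sketched in code. Note the entry criterion below is a plain ΔR² threshold (`min_gain`, an illustrative stand-in for the F-to-enter test described above), and the data are hypothetical:

```python
import numpy as np

def r_squared(cols, y):
    # R^2 of y regressed on the given columns (plus an intercept).
    D = np.column_stack([np.ones(len(y))] + cols)
    yhat = D @ np.linalg.lstsq(D, y, rcond=None)[0]
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

def forward_select(X, y, min_gain=0.02):
    chosen, remaining, best_r2 = [], list(range(X.shape[1])), 0.0
    while remaining:
        # Gain in R^2 from adding each remaining candidate.
        gains = {j: r_squared([X[:, i] for i in chosen + [j]], y) - best_r2
                 for j in remaining}
        j = max(gains, key=gains.get)
        if gains[j] < min_gain:          # no further meaningful increase
            break
        chosen.append(j)
        remaining.remove(j)
        best_r2 += gains[j]
    return chosen, best_r2

# Hypothetical data: y depends on columns 0 and 2 only.
rng = np.random.default_rng(5)
X = rng.normal(size=(80, 4))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(scale=0.5, size=80)
sel, r2 = forward_select(X, y)
```

With these data the procedure picks up the two informative columns and stops before admitting the noise variables.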
Strategy III: backward selection

- Start with all variables.
- Drop variables whose removal does not significantly reduce R², one at a time, starting with the one with the lowest partial correlation coefficient.
- But once X_j is dropped, it stays out even if it explains a significant amount of the remaining variability once other variables are excluded.

[Diagram: r2 < r1 < r3, so start with {X1, X2, X3} and drop X2 first if R²(X1, X2, X3) − R²(X1, X3) is non-significant; then drop X1 if R²(X1, X3) − R²(X3) is non-significant, giving the final model]
Backward selection: order of removal

- Begin with the variable with the smallest partial correlation coefficient.
- The next removal is the variable whose removal gives the smallest decrease in overall R², by an F-test of the significance of the decrease, below some specified F-to-remove (above a specified p-to-remove) value.

[Diagram: four-variable example with r2 > r1 > r3 > r4 and p to remove = .10; p[F(X2, X1, X3)] = .44, so X4 is removed (X2, X3, X1 still in); p[F(X2, X1)] = .25, so X3 is removed (X1, X2 still in); p[F(X2, X3)] = .001 and p[F(X1, X3)] = .009]
Strategy IV: stepwise selection

- Once a variable is included (removed), the set of remaining variables is scanned for other variables that should now be deleted (included), including those added (removed) at earlier stages.
- To avoid infinite loops, we usually set p to enter > p to remove.

[Diagram: four-variable example with r2 > r1 > r4 > r3, p to enter = .10 and p to remove = .05; entry and removal p-values include p[F(X2)] = .001, p[F(X2, X1)] = .002, p[F(X2, X3)] = .09, p[F(X2, X4)] = .03, p[F(X1, X2, X4)] = .02, p[F(X1, X2, X3)] = .19]
Example

log of herptile species richness (logherp) as a function of log wetland area (logarea), percentage of land within 1 km covered in forest (cpfor2), and density of hard-surface roads within 1 km (thtden)
Example (all variables)

DEP VAR: LOGHERP   N: 28   MULTIPLE R: 0.740   SQUARED MULTIPLE R: 0.547
ADJUSTED SQUARED MULTIPLE R: 0.490   STANDARD ERROR OF ESTIMATE: 0.162

VARIABLE   COEFF.   SE     STD COEF.   TOL.    T       P
CONSTANT    0.285   0.191   0.000      .        1.488  0.150
LOGAREA     0.228   0.058   0.551      0.978    3.964  0.001
CPFOR2      0.001   0.001   0.123      0.744    0.774  0.447
THTDEN     -0.036   0.016  -0.365      0.732   -2.276  0.032
Example (cont'd)

ANALYSIS OF VARIANCE

SOURCE       SS      DF   MS      F-RATIO   P
REGRESSION   0.760   3    0.253   9.662     0.000
RESIDUAL     0.629   24   0.026
Example: forward stepwise

DEPENDENT VARIABLE LOGHERP
MINIMUM TOLERANCE FOR ENTRY INTO MODEL = .010000
FORWARD STEPWISE WITH ALPHA-TO-ENTER = .10 AND ALPHA-TO-REMOVE = .05

STEP # 0   R = .000   RSQUARE = .000

VARIABLE       COEFF.   SE   STD COEF.   TOL.     F        'P'
IN
 1 CONSTANT
OUT            PART. CORR
 2 LOGAREA      0.596   .    .           .1E+01   14.321   0.001
 3 CPFOR2       0.305   .    .           .1E+01    2.662   0.115
 4 THTDEN      -0.496   .    .           .1E+01    8.502   0.007
Forward stepwise (cont'd)

STEP # 1   R = .596   RSQUARE = .355   TERM ENTERED: LOGAREA

VARIABLE       COEFF.   SE      STD COEF.   TOL.     F        'P'
IN
 1 CONSTANT
 2 LOGAREA      0.247   0.065   0.596       .1E+01   14.321   0.001
OUT            PART. CORR
 3 CPFOR2       0.382   .       .           0.99      4.273   0.049
 4 THTDEN      -0.529   .       .           0.98      9.725   0.005
Forward stepwise (cont'd)

STEP # 2   R = .732   RSQUARE = .536   TERM ENTERED: THTDEN

VARIABLE       COEFF.   SE      STD COEF.   TOL.      F        'P'
IN
 1 CONSTANT
 2 LOGAREA      0.225   0.057   0.542       0.98      15.581   0.001
 4 THTDEN      -0.042   0.013  -0.428       0.98       9.725   0.005
OUT            PART. CORR
 3 CPFOR2       0.156   .       .           0.74380    0.599   0.447
Forward stepwise: final model

FORWARD STEPWISE: P TO INCLUDE = .15
DEP VAR: LOGHERP   N: 28   MULTIPLE R: 0.732   SQUARED MULTIPLE R: 0.536
ADJUSTED SQUARED MULTIPLE R: 0.490   STANDARD ERROR OF ESTIMATE: 0.161

VARIABLE   COEFF.   SE     STD COEF.   TOL.    T       P
CONSTANT    0.376   0.149   0.000      .        2.521  0.018
LOGAREA     0.225   0.057   0.542      0.984    3.947  0.001
THTDEN     -0.042   0.013  -0.428      0.984   -3.118  0.005
Example: backward stepwise (final model)

BACKWARD STEPWISE: P TO REMOVE = .15
DEP VAR: LOGHERP   N: 28   MULTIPLE R: 0.732   SQUARED MULTIPLE R: 0.536
ADJUSTED SQUARED MULTIPLE R: 0.499   STANDARD ERROR OF ESTIMATE: 0.161

VARIABLE   COEFF.   SE     STD COEF.   TOL.    T       P
CONSTANT    0.376   0.149   0.000      .        2.521  0.018
LOGAREA     0.225   0.057   0.542      0.984    3.947  0.001
THTDEN     -0.042   0.013  -0.428      0.984   -3.118  0.005
Example: subset model

DEP VAR: LOGHERP   N: 28   MULTIPLE R: 0.670   SQUARED MULTIPLE R: 0.449
ADJUSTED SQUARED MULTIPLE R: 0.405   STANDARD ERROR OF ESTIMATE: 0.175

VARIABLE   COEFF.   SE     STD COEF.   TOL.    T       P
CONSTANT    0.027   0.167   0.000      .        0.162  0.872
LOGAREA     0.248   0.062   0.597      1.000    4.022  0.000
CPFOR2      0.003   0.001   0.307      1.000    2.067  0.049
What if the relationship between Y and one or more X's is nonlinear?

- Option 1: transform the data.
- Option 2: use nonlinear regression.
- Option 3: use polynomial regression.
The polynomial regression model

In polynomial regression, the regression model includes terms of increasingly higher powers of the independent variable:

Y_i = α + Σ(j = 1 to k) β_j X_i^j + ε_i

[Figure: black fly biomass (mg DM/m²) versus current velocity (cm/s), fitted with a linear model and a 2nd-order polynomial model]
The polynomial regression model: procedure

- Fit a simple linear model.
- Fit a model with a quadratic term; test for the increase in SS_model.
- Continue with higher orders (cubic, quartic, etc.) until there is no further significant increase in SS_model.
- Include terms of order up to the power of (number of points of inflexion plus 1).

[Figure: black fly biomass (mg DM/m²) versus current velocity (cm/s), linear model versus 2nd-order polynomial model]
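The procedure can be sketched by fitting polynomials of increasing order and comparing SS_model at each step. The data here are simulated with a genuinely quadratic relationship (not the black fly data):

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.linspace(0, 10, 60)
y = 2.0 + 1.5 * x - 0.12 * x**2 + rng.normal(scale=0.3, size=x.size)

def ss_model(order):
    # Model SS for a polynomial fit of the given order.
    coef = np.polyfit(x, y, order)
    yhat = np.polyval(coef, x)
    return np.sum((yhat - y.mean()) ** 2)

gain_quadratic = ss_model(2) - ss_model(1)   # large: quadratic needed
gain_cubic = ss_model(3) - ss_model(2)       # small: stop at order 2
```

The jump from linear to quadratic is large, while the jump from quadratic to cubic is negligible, so the procedure stops at order 2.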
Polynomial regression: caveats

- The biological significance of the higher-order terms in a polynomial regression (if any) is generally not known.
- By definition, polynomial terms are strongly correlated; hence standard errors will be large (precision is low), and they increase with the order of the term.
- Extrapolation of polynomial models is always nonsense.

[Figure: Y = X1 − X1², with the fitted curve turning downward outside the range of the data]
Power analysis in GLM (including MR)

- In any GLM, hypotheses are tested by means of an F-test.
- Remember: the appropriate SS_error and df_error depend on the type of analysis and the hypothesis under investigation.
- Knowing F, we can compute R², the proportion of the total variance in Y explained by the factor (source) under consideration:

F = MS_factor / MS_error = (SS_factor / df_factor) / (SS_error / df_error)

R² = (df_factor · F) / (df_factor · F + df_error)
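The R² relation can be checked against the lecture's own output: the all-variables model reported F = 9.662 with 3 and 24 df and squared multiple R = 0.547.

```python
# Recover R^2 from a reported F statistic and its degrees of freedom,
# using R^2 = df_factor*F / (df_factor*F + df_error).
def r2_from_f(f, df_factor, df_error):
    return (df_factor * f) / (df_factor * f + df_error)

# Overall test from the all-variables model above.
r2 = r2_from_f(9.662, 3, 24)   # close to the reported 0.547
```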
Partial and total R²

- The total R² (R²Y•B) is the proportion of variance in Y accounted for (explained by) a set of independent variables B.
- The partial R² (R²Y•A,B − R²Y•A) is the proportion of variance in Y accounted for by B when the variance accounted for by another set A is removed.

[Venn diagram: proportion of variance accounted for by both A and B (R²Y•A,B); by A only (R²Y•A, a total R²); by B independent of A (R²Y•A,B − R²Y•A, the partial R²)]
Partial and total R²

The total R² (R²Y•B) for set B equals the partial R² (R²Y•A,B − R²Y•A) with respect to set B if either (1) the total R² for A (R²Y•A) is zero, or (2) A and B are independent (in which case R²Y•A,B = R²Y•A + R²Y•B).

[Venn diagram: the proportion of variance accounted for by B (R²Y•B, total R²) equals the proportion independent of A (R²Y•A,B − R²Y•A, partial R²) only when A and B do not overlap]
Partial and total R² in multiple regression

Suppose we have three independent variables X1, X2 and X3, with A = {X1} and B = {X2, X3}. Then:

total R² for A:  R²Y•A = R²Y•X1
total R² for B:  R²Y•B = R²Y•X2,X3
combined:        R²Y•A,B = R²Y•X1,X2,X3
partial R²:      R²Y•A,B − R²Y•A = R²Y•X1,X2,X3 − R²Y•X1

[Figure: log primary production as a function of log [P] and log [Zoo]]
Defining effect size in multiple regression

- The effect size, denoted f², is given by the ratio of the factor (source) R²factor and the appropriate error R²error:

f² = R²factor / R²error

- Note: both R²factor and R²error depend on the null hypothesis under investigation.
Defining effect size in multiple regression: case 1

- Case 1: a set B of variables {X1, X2, …} is related to Y, and the total R² (R²Y•B) is determined. The error variance proportion is then 1 − R²Y•B.
- H0: R²Y•B = 0
- Example: effect of wetland area, surrounding forest cover, and surrounding road densities on herptile species richness in southeastern Ontario wetlands; B = {LOGAREA, CPFOR2, THTDEN}

f² = R²factor / R²error
DEP VAR: LOGHERP   N: 28   MULTIPLE R: 0.740   SQUARED MULTIPLE R: 0.547
ADJUSTED SQUARED MULTIPLE R: 0.490   STANDARD ERROR OF ESTIMATE: 0.162

VARIABLE   COEFF.   SE     STD COEF.   TOL.    T       P
CONSTANT    0.285   0.191   0.000      .        1.488  0.150
LOGAREA     0.228   0.058   0.551      0.978    3.964  0.001
CPFOR2      0.001   0.001   0.123      0.744    0.774  0.447
THTDEN     -0.036   0.016  -0.365      0.732   -2.276  0.032

f² = R²factor / R²error = .547 / (1 − .547) = 1.21
Defining effect size in multiple regression: case 2

Case 2: the proportion of variance of Y due to B over and above that due to A is determined (R²Y•A,B - R²Y•A). The error variance proportion is then 1 - R²Y•A,B.

H0: R²Y•A,B - R²Y•A = 0

Example: herptile richness in southeastern Ontario wetlands: B = {THTDEN}, A = {LOGAREA, CPFOR2}, AB = {LOGAREA, CPFOR2, THTDEN}

f² = R²factor / R²error
Reduced model (A = {LOGAREA, CPFOR2}):

DEP VAR: LOGHERP   N: 28   MULTIPLE R: 0.670   SQUARED MULTIPLE R: 0.449
ADJUSTED SQUARED MULTIPLE R: 0.405   STANDARD ERROR OF ESTIMATE: 0.175

VARIABLE   COEFF.   SE      STD COEF.   TOL.    T       P
CONSTANT    0.027   0.167    0.000      .       0.162   0.872
LOGAREA     0.248   0.062    0.597      1.000   4.022   0.000
CPFOR2      0.003   0.001    0.307      1.000   2.067   0.049

Full model (AB = {LOGAREA, CPFOR2, THTDEN}):

DEP VAR: LOGHERP   N: 28   MULTIPLE R: 0.740   SQUARED MULTIPLE R: 0.547
ADJUSTED SQUARED MULTIPLE R: 0.490   STANDARD ERROR OF ESTIMATE: 0.162

VARIABLE   COEFF.   SE      STD COEF.   TOL.    T       P
CONSTANT    0.285   0.191    0.000      .       1.488   0.150
LOGAREA     0.228   0.058    0.551      0.978   3.964   0.001
CPFOR2      0.001   0.001    0.123      0.744   0.774   0.447
THTDEN     -0.036   0.016   -0.365      0.732  -2.276   0.032
Defining effect size in multiple regression: case 2

The proportion of variance of LOGHERP due to THTDEN (B) over and above that due to LOGAREA and CPFOR2 (A) is R²Y•A,B - R²Y•A = .098. The error variance proportion is then 1 - R²Y•A,B = 1 - .547. So the effect size for variable THTDEN is 0.216:

f² = (R²Y•{LOGAREA,CPFOR2,THTDEN} - R²Y•{LOGAREA,CPFOR2}) / (1 - R²Y•{LOGAREA,CPFOR2,THTDEN})
   = (.547 - .449) / (1 - .547) = .216
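Again the arithmetic is easy to verify (a sketch using the two R² values reported above):

```python
# Case 2: variance due to B = {THTDEN} over and above A = {LOGAREA, CPFOR2}.
r2_ab = 0.547              # full model R2 (A and B together)
r2_a = 0.449               # reduced model R2 (A only)
f2 = (r2_ab - r2_a) / (1 - r2_ab)
print(round(f2, 3))        # 0.216, as on the slide
```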
Determining power

Once f² has been determined, either a priori (as an alternate hypothesis) or a posteriori (the observed effect size), calculate the non-centrality parameter λ.

Knowing λ and the factor (source) (ν1) and error (ν2) degrees of freedom, we can determine power from the appropriate tables for a given α.

[Figure: power (1 - β) curves as a function of the non-centrality parameter, plotted for α = .05 and α = .01 and for decreasing error df ν2.]

λ = f²(ν1 + ν2 + 1)
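In code, the non-centrality calculation is one line (hypothetical numbers for illustration; power itself would still be read from non-central F tables or computed with statistical software):

```python
# Non-centrality parameter lambda = f2 * (nu1 + nu2 + 1), where nu1 and
# nu2 are the source and error degrees of freedom (Cohen's formulation).
def noncentrality(f2, nu1, nu2):
    return f2 * (nu1 + nu2 + 1)

# hypothetical illustration: f2 = 0.15 with 2 source and 20 error df
lam = noncentrality(0.15, 2, 20)
print(round(lam, 2))       # 3.45
```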
Example: herptile richness in southeastern Ontario wetlands

Sample of 28 wetlands; 3 variables (LOGAREA, CPFOR2, THTDEN). Dependent variable is log10 of the number of herptile species.

What is the probability of detecting a true effect size for CPFOR2 equal to the estimated effect size, once the effects of LOGAREA and THTDEN have been controlled for, given α = 0.05?

Variable       t      p
LOGAREA (1)    3.96   0.001
THTDEN (2)    -2.28   0.032
CPFOR2 (3)     0.774  0.447
R²{1,2,3}      0.547
R²{1,2}        0.536
Example: herptile richness in southeastern Ontario wetlands

Sample effect size f² for CPFOR2, once the effects of LOGAREA and THTDEN have been controlled for, = .024.
Source (CPFOR2) df = ν1 = 1
Error df = ν2 = 28 - 1 - 1 - 1 = 25

f² = (R²{1,2,3} - R²{1,2}) / (1 - R²{1,2,3}) = (.547 - .536) / (1 - .547) = .024

λ = f²(ν1 + ν2 + 1) = .024(1 + 25 + 1) = .648

Power (1 - β) is then read from the appropriate tables, given ν1, ν2 and α.
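Putting the whole example together (a sketch; note the slide rounds f² to .024 before computing λ):

```python
# Effect size and non-centrality for the CPFOR2 test, using the slide's values.
r2_full, r2_reduced = 0.547, 0.536   # R2{1,2,3} and R2{1,2}
f2 = round((r2_full - r2_reduced) / (1 - r2_full), 3)   # 0.024
nu1, nu2 = 1, 25                     # source and error df from the slide
lam = f2 * (nu1 + nu2 + 1)           # about 0.648; power then comes from tables
print(f2, round(lam, 3))
```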