Lecture 5: SLR Diagnostics (Continued), Correlation, Introduction to Multiple Linear Regression
BMTRY 701: Biostatistical Methods II
From last lecture
What were the problems we diagnosed?
We shouldn’t just give up!
Some possible approaches for improvement:
• remove the outliers: does the model change?
• transform LOS: do we better adhere to model assumptions?
Outlier Quandary
To remove or not to remove outliers
Are they real data?
If they are truly reflective of the data, then what does removing them imply?
Use caution!
• better to be true to the data
• having a perfect model should not be at the expense of using ‘real’ data!
Removing the outliers: How to?
I am always reluctant. My approach in this example:
• remove each outlier separately
• remove both together
• compare each model with the model that includes the outliers

How to decide: compare the slope estimates.
SENIC Data

> par(mfrow=c(1,2))
> hist(data$LOS)
> plot(data$BEDS, data$LOS)
[Figure: histogram of data$LOS (left) and scatterplot of data$LOS vs. data$BEDS (right)]
How to fit regression removing outlier(s)?
> keep.remove.both <- ifelse(data$LOS < 16, 1, 0)
> keep.remove.20 <- ifelse(data$LOS < 19, 1, 0)
> keep.remove.18 <- ifelse(data$LOS < 16 | data$BEDS < 600, 1, 0)
> table(keep.remove.both)
keep.remove.both
  0   1
  2 111
> table(keep.remove.20)
keep.remove.20
  0   1
  1 112
> table(keep.remove.18)
keep.remove.18
  0   1
  1 112
Regression Fitting
reg <- lm(LOS ~ BEDS, data=data)
reg.remove.both <- lm(LOS ~ BEDS, data=data[keep.remove.both==1,])
reg.remove.20 <- lm(LOS ~ BEDS, data=data[keep.remove.20==1,])
reg.remove.18 <- lm(LOS ~ BEDS, data=data[keep.remove.18==1,])
How much do our inferences change?
              reg       remove both   remove 20   remove 18
β1 estimate   0.00406   0.00299       0.00393     0.00314
se(β1)        0.00086   0.00070       0.00073     0.00085
% change      0 (ref)   26%           3%          23%
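A minimal sketch of how the “% change” row can be computed, assuming the fits from the previous slide (the helper pct.change is hypothetical, not from the lecture):

# percent change in the BEDS slope relative to the full-data fit
b.full <- coef(reg)["BEDS"]
pct.change <- function(fit) 100 * abs(coef(fit)["BEDS"] - b.full) / b.full
round(c(both = pct.change(reg.remove.both),
        no20 = pct.change(reg.remove.20),
        no18 = pct.change(reg.remove.18)))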
Why is “18” a bigger outlier than “20”?
Leverage and Influence
Leverage is a function of the explanatory variable(s) alone and measures the potential for a data point to affect the model parameter estimates.
Influence is a measure of how much a data point actually does affect the estimated model.
Leverage and influence both may be defined in terms of matrices
More later in MLR (MPV ch. 6)
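In R, both quantities can be computed directly from the fitted SLR; a small sketch, assuming the reg object fit earlier:

# leverage: a function of the explanatory variable (BEDS) alone
h <- hatvalues(reg)
# influence: Cook's distance, how much each point actually moves the fit
d <- cooks.distance(reg)
head(sort(h, decreasing=TRUE))
head(sort(d, decreasing=TRUE))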
Graphically
[Figure: data$LOS vs. data$BEDS with four fitted lines: reg, w/out 18, w/out 20, w/out both]
R code
par(mfrow=c(1,1))
plot(data$BEDS, data$LOS, pch=16)
# plain old regression model
abline(reg, lwd=2)
# plot "20" to show which point we are removing, then add regression line
points(data$BEDS[keep.remove.20==0], data$LOS[keep.remove.20==0],
       col=2, cex=1.5, pch=16)
abline(reg.remove.20, col=2, lwd=2)
# plot "18" and then add regression line
points(data$BEDS[keep.remove.18==0], data$LOS[keep.remove.18==0],
       col=4, cex=1.5, pch=16)
abline(reg.remove.18, col=4, lwd=2)
# add regression line where we removed both outliers
abline(reg.remove.both, col=5, lwd=2)
# add a legend to the plot
legend(1, 19, c("reg","w/out 18","w/out 20","w/out both"),
       lwd=rep(2,4), lty=rep(1,4), col=c(1,2,4,5))
What to do?
Let’s try something else. What was our other problem?
• heteroskedasticity (great word… try that at Scrabble)
• non-normality of residuals
Common way to solve: transform the outcome
Determining the Transformation
Box-Cox transformation approach
Finds the “best” power transformation, i.e., the one giving a distribution as close to normal as possible.

Can apply to:
• a variable
• a linear regression model

When applied to a regression model, the result tells you the ‘best’ power transform of Y to achieve normal residuals.
Review of power transformation

Assume we want to transform Y. Box-Cox considers Y^a for all values of a; the solution is the a that yields the “most normal” looking Y^a.

Practical powers:
• a = 1: identity
• a = 1/2: square root
• a = 0: log
• a = -1: 1/Y (usually we take -1/Y so that the ordering of Y is maintained; see example)

The estimated power is often not practical to interpret (e.g., Y^-0.136).
Box-Cox for linear regression
library(MASS)
bc <- boxcox(reg)
[Figure: Box-Cox profile log-likelihood vs. λ over (-2, 2), with the 95% confidence region marked]
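To read the estimated power off numerically rather than from the plot, a small sketch (bc is the list returned by boxcox() above, with components x = the λ grid and y = the profile log-likelihood):

lambda.hat <- bc$x[which.max(bc$y)]
lambda.hat  # the 'best' power; rarely a round number in practice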
Transform

ty <- -1/data$LOS
plot(data$LOS, ty)

[Figure: ty = -1/LOS plotted against data$LOS (left) and against data$BEDS (right)]
New regression: transform is -1/LOS
plot(data$BEDS, ty, pch=16)
reg.ty <- lm(ty ~ data$BEDS)
abline(reg.ty, lwd=2)
More interpretable?
LOS is often analyzed in the literature, and the common transform is log:
• it is well known that LOS is skewed in most applications
• most people take the log
• people are used to seeing and interpreting it on the log scale

How good is our model if we just take the log?
Regression with log(LOS)
[Figure: logy = log(LOS) plotted against data$LOS (left) and against data$BEDS (right)]
Let’s compare: residual plots
[Figure: residuals vs. data$BEDS for reg.ty (left) and reg.logy (right)]
Let’s compare: distribution of residuals

[Figure: boxplots of residuals where Y = -1/LOS (left) and where Y = log(LOS) (right)]
Let’s Compare: |Residuals|
[Figure: |residuals| vs. data$BEDS for the two models; p = 0.59 for Y = -1/LOS, p = 0.12 for Y = log(LOS)]
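One plausible way these p-values arise (an assumption on my part; the slides do not say) is from regressing |residuals| on BEDS as a crude check for heteroskedasticity:

summary(lm(abs(reg.ty$residuals) ~ data$BEDS))$coefficients
summary(lm(abs(reg.logy$residuals) ~ data$BEDS))$coefficients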
Let’s Compare: QQ-plot
[Figure: normal QQ-plots of residuals, TY (left) and LogY (right), with reference lines]
R code

logy <- log(data$LOS)
par(mfrow=c(1,2))
plot(data$LOS, logy)
plot(data$BEDS, logy, pch=16)
reg.logy <- lm(logy ~ data$BEDS)
abline(reg.logy, lwd=2)

par(mfrow=c(1,2))
plot(data$BEDS, reg.ty$residuals, pch=16)
abline(h=0, lwd=2)
plot(data$BEDS, reg.logy$residuals, pch=16)
abline(h=0, lwd=2)

boxplot(reg.ty$residuals)
title("Residuals where Y = -1/LOS")
boxplot(reg.logy$residuals)
title("Residuals where Y = log(LOS)")

qqnorm(reg.ty$residuals, main="TY")
qqline(reg.ty$residuals)
qqnorm(reg.logy$residuals, main="LogY")
qqline(reg.logy$residuals)
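A formal normality test can supplement the QQ-plots; a small sketch, assuming the Shapiro-Wilk test is a reasonable choice here:

shapiro.test(reg.ty$residuals)
shapiro.test(reg.logy$residuals)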
Regression results

> summary(reg.ty)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.169e-01  2.522e-03 -46.371  < 2e-16 ***
data$BEDS    3.953e-05  7.957e-06   4.968 2.47e-06 ***
---

> summary(reg.logy)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.1512591  0.0251328  85.596  < 2e-16 ***
data$BEDS   0.0003921  0.0000793   4.944 2.74e-06 ***
---
Let’s compare: results ‘untransformed’
[Figure: data$LOS vs. data$BEDS with the fits from reg, reg.ty, and reg.logy back-transformed to the LOS scale; right panel zoomed to LOS between 7 and 12]
R code
par(mfrow=c(1,2))
plot(data$BEDS, data$LOS, pch=16)
abline(reg, lwd=2)
lines(sort(data$BEDS), -1/sort(reg.ty$fitted.values), lwd=2, lty=2)
lines(sort(data$BEDS), exp(sort(reg.logy$fitted.values)), lwd=2, lty=3)

plot(data$BEDS, data$LOS, pch=16, ylim=c(7,12))
abline(reg, lwd=2)
lines(sort(data$BEDS), -1/sort(reg.ty$fitted.values), lwd=2, lty=2)
lines(sort(data$BEDS), exp(sort(reg.logy$fitted.values)), lwd=2, lty=3)
So, what to do?
What are the pros and cons of each transform?
Should we transform at all?!
Switching Gears: Correlation
“Pearson” Correlation
• measures linear association between two variables
• a natural by-product of linear regression
• notation: r or ρ (rho)
Correlation versus slope?

Slope and correlation measure different aspects of the association between X and Y:
• slope: measures whether there is a linear trend
• correlation: measures how close the data points fall to the line

Statistical significance is IDENTICAL:
• the p-value for testing that the correlation is 0 is the SAME as the p-value for testing that the slope is 0.
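A quick sketch to see this numerically, assuming the SENIC data frame data from earlier:

cor.test(data$BEDS, data$LOS)$p.value
summary(lm(LOS ~ BEDS, data=data))$coefficients["BEDS", "Pr(>|t|)"]
# both give 6.77e-06 for LOS ~ BEDS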
Example: Same slope, different correlation

[Figure: two scatterplots of y vs. x with identical fitted slopes; left: r = 0.46, b1 = 2; right: r = 0.95, b1 = 2]
Example: Same correlation, different slope

[Figure: two scatterplots of y vs. x with identical correlations; left: r = 0.46, b1 = 4; right: r = 0.46, b1 = 2]
Correlation

Scaled version of the covariance between X and Y. Recall the covariance:

$\mathrm{Cov}(X, Y) = \sigma_{xy} = E[(X - \mu_x)(Y - \mu_y)]$

Estimating the covariance:

$\hat{\sigma}_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
Correlation

$\hat{\rho}_{xy} = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$
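A small sketch checking this estimator against R’s built-in, assuming the SENIC data frame data:

x <- data$BEDS; y <- data$LOS
sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
cor(x, y)  # 0.409, matching sqrt(R^2) from the SLR below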
Interpretation
Correlation tells how closely two variables “track” one another, and provides information about the ability to predict Y from X.

Regression output:
• look for R²
• for SLR, sqrt(R²) = |correlation|

You can have low correlation yet a significant association. With correlation, a 95% confidence interval is helpful.
LOS ~ BEDS
> summary(lm(data$LOS ~ data$BEDS))

Call:
lm(formula = data$LOS ~ data$BEDS)

Residuals:
    Min      1Q  Median      3Q     Max
-2.8291 -1.0028 -0.1302  0.6782  9.6933

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.6253643  0.2720589  31.704  < 2e-16 ***
data$BEDS   0.0040566  0.0008584   4.726 6.77e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.752 on 111 degrees of freedom
Multiple R-squared: 0.1675, Adjusted R-squared: 0.16
F-statistic: 22.33 on 1 and 111 DF, p-value: 6.765e-06
$\sqrt{0.1675} = 0.409$
95% Confidence Interval for Correlation
The computation of a confidence interval on the population value of Pearson's correlation (ρ) is complicated by the fact that the sampling distribution of r is not normally distributed. The solution lies with Fisher's z' transformation, described in the section on the sampling distribution of Pearson's r. The steps in computing a confidence interval for ρ are:
• convert r to z'
• compute a confidence interval in terms of z'
• convert the confidence interval back to r
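A minimal sketch of these steps in R, using r = 0.409 and n = 113 from the LOS ~ BEDS fit:

r <- 0.409; n <- 113
z <- atanh(r)                          # Fisher's z' = 0.5 * log((1+r)/(1-r))
se <- 1 / sqrt(n - 3)                  # standard error of z'
ci.z <- z + c(-1, 1) * qnorm(0.975) * se
tanh(ci.z)                             # convert the interval back to the r scale
# cor.test(data$LOS, data$BEDS) reports the same interval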
freeware!
• http://www.danielsoper.com/statcalc/calc28.aspx
• http://glass.ed.asu.edu/stats/analysis/rci.html
• http://faculty.vassar.edu/lowry/rho.html
log(LOS) ~ BEDS
> summary(lm(log(data$LOS) ~ data$BEDS))

Call:
lm(formula = log(data$LOS) ~ data$BEDS)

Residuals:
      Min        1Q    Median        3Q       Max
-0.296328 -0.106103 -0.005296  0.084177  0.702262

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.1512591  0.0251328  85.596  < 2e-16 ***
data$BEDS   0.0003921  0.0000793   4.944 2.74e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1618 on 111 degrees of freedom
Multiple R-squared: 0.1805, Adjusted R-squared: 0.1731
F-statistic: 24.44 on 1 and 111 DF, p-value: 2.737e-06
$\sqrt{0.1805} = 0.425$
Multiple Linear Regression
Most regression applications include more than one covariate. Multiple linear regression allows us to make inferences about the relationship between two variables (X and Y) adjusting for other variables.

It is used to account for confounding, which is especially important in observational studies. Example: smoking and lung cancer:
• we know people who smoke tend to expose themselves to other risks and harms
• if we didn’t adjust, we would overestimate the effect of smoking on the risk of lung cancer
Importance of including ‘important’ covariates
If you leave out relevant covariates, your estimate of β1 will be biased
How biased?
Assume:
• true model: $Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + e_i$
• fitted model: $Y_i = \beta_0^{*} + \beta_1^{*} X_{i1} + e_i^{*}$
Fun derivation

$$E(\hat{\beta}_1^*) = E\left[\frac{\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)(Y_i-\bar{Y})}{\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)^2}\right] = \frac{\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)E(Y_i) - \sum_{i=1}^{n}(X_{i1}-\bar{X}_1)E(\bar{Y})}{\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)^2}$$

Substituting $E(Y_i) = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2}$ and $E(\bar{Y}) = \beta_0 + \beta_1\bar{X}_1 + \beta_2\bar{X}_2$:

$$= \frac{\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2}) - (\beta_0 + \beta_1\bar{X}_1 + \beta_2\bar{X}_2)\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)}{\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)^2}$$

Since $\sum_{i=1}^{n}(X_{i1}-\bar{X}_1) = 0$, the second term in the numerator (and the $\beta_0$ piece of the first) vanishes:

$$= \frac{\beta_1\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)X_{i1} + \beta_2\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)X_{i2}}{\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)^2}$$
Fun derivation

Using $\sum_{i}(X_{i1}-\bar{X}_1)X_{i1} = \sum_{i}(X_{i1}-\bar{X}_1)^2$ and $\sum_{i}(X_{i1}-\bar{X}_1)X_{i2} = \sum_{i}(X_{i1}-\bar{X}_1)(X_{i2}-\bar{X}_2)$:

$$E(\hat{\beta}_1^*) = \frac{\beta_1\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)^2}{\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)^2} + \frac{\beta_2\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)(X_{i2}-\bar{X}_2)}{\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)^2} = \beta_1 + \beta_2\,\frac{\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)(X_{i2}-\bar{X}_2)}{\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)^2}$$
Fun derivation

$$E(\hat{\beta}_1^*) = \beta_1 + \beta_2\,\hat{\delta}_{21}, \qquad \hat{\delta}_{21} = \frac{\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)(X_{i2}-\bar{X}_2)}{\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)^2}$$

where $\hat{\delta}_{21}$ is the least-squares slope from regressing $X_2$ on $X_1$. Hence $E(\hat{\beta}_1^*) = \beta_1$ only if $\beta_2 = 0$ or $\hat{\delta}_{21} = 0$.
Implications

The bias is a function of the correlation between the two covariates, X1 and X2:
• if the correlation is high, the bias will be high
• if the correlation is low, the bias may be quite small
• if there is no correlation between X1 and X2, then omitting X2 does not bias inferences

However, the model is still not a good one for prediction if X2 is related to Y.
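A small simulation sketch of this result (an invented setup, not from the lecture), showing that the bias equals β2 times the X2-on-X1 slope:

set.seed(701)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)   # X2 correlated with X1
y  <- 1 + 2*x1 + 3*x2 + rnorm(n)      # true beta1 = 2, beta2 = 3
coef(lm(y ~ x1 + x2))["x1"]           # close to 2
coef(lm(y ~ x1))["x1"]                # biased: close to 2 + 3*delta21
coef(lm(x2 ~ x1))["x1"]               # delta21, about 0.8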
Example: LOS ~ BEDS analysis.
[Figure: log(data$LOS) vs. data$BEDS, log(data$LOS) vs. data$NURSE, and data$NURSE vs. data$BEDS]

> cor(cbind(data$BEDS, data$NURSE, data$LOS))
          [,1]      [,2]      [,3]
[1,] 1.0000000 0.9155042 0.4092652
[2,] 0.9155042 1.0000000 0.3403671
[3,] 0.4092652 0.3403671 1.0000000
R code
reg.beds <- lm(log(data$LOS) ~ data$BEDS)
reg.nurse <- lm(log(data$LOS) ~ data$NURSE)
reg.beds.nurse <- lm(log(data$LOS) ~ data$BEDS + data$NURSE)
summary(reg.beds)
summary(reg.nurse)
summary(reg.beds.nurse)
SLRs
Call:
lm(formula = log(data$LOS) ~ data$BEDS)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.1512591  0.0251328  85.596  < 2e-16 ***
data$BEDS   0.0003921  0.0000793   4.944 2.74e-06 ***
---

Call:
lm(formula = log(data$LOS) ~ data$NURSE)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.1682138  0.0250054  86.710  < 2e-16 ***
data$NURSE  0.0004728  0.0001127   4.195 5.51e-05 ***
---
BEDS + NURSE

> summary(reg.beds.nurse)

Call:
lm(formula = log(data$LOS) ~ data$BEDS + data$NURSE)

Residuals:
      Min        1Q    Median        3Q       Max
-0.291537 -0.108447 -0.006711  0.087594  0.696747

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.1522361  0.0252758  85.150   <2e-16 ***
data$BEDS    0.0004910  0.0001977   2.483   0.0145 *
data$NURSE  -0.0001497  0.0002738  -0.547   0.5857
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1624 on 110 degrees of freedom
Multiple R-squared: 0.1827, Adjusted R-squared: 0.1678
F-statistic: 12.29 on 2 and 110 DF, p-value: 1.519e-05