Detecting and reducing multicollinearity
Detecting multicollinearity
Common methods of detection
• Realized effects of multicollinearity (changes in coefficients, changes in the standard errors of coefficients, changes in the sequential sums of squares).
• Non-significant t-tests for all of the slopes but a significant overall F-test.
• Significant correlations among pairs of predictor variables (check the correlation matrix and matrix scatter plots).
• Variance inflation factors (VIF).
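A minimal sketch of the first detection steps in Python (an assumption here; the slides themselves use Minitab): build a small illustrative data set with two nearly collinear predictors, then inspect the correlation matrix and the matrix scatter plot.

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

# Illustrative data: x2 is deliberately almost a copy of x1, x3 is unrelated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + rng.normal(scale=0.1, size=50)      # nearly collinear with x1
x3 = rng.normal(size=50)
y = 2 + x1 + 0.5 * x3 + rng.normal(size=50)
df = pd.DataFrame({"y": y, "x1": x1, "x2": x2, "x3": x3})

# Pairwise correlations: large off-diagonal values among predictors are a warning sign.
print(df.corr().round(3))

# Matrix scatter plot of every pair of variables.
scatter_matrix(df, figsize=(8, 8))
plt.show()
```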
The first variance at issue

For the model:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{p-1} x_{i,p-1} + \varepsilon_i$$

the variance of the estimated coefficient $b_k$ is:

$$\operatorname{Var}(b_k) = \frac{\sigma^2}{(1 - R_k^2)\sum_{i=1}^{n}(x_{ik} - \bar{x}_k)^2}$$

where $R_k^2$ is the $R^2$ value obtained by regressing the kth predictor on the remaining predictors.
The second variance at issue

For the model:

$$y_i = \beta_0 + \beta_k x_{ik} + \varepsilon_i$$

the variance of the estimated coefficient $b_k$ is:

$$\operatorname{Var}(b_k)_{\min} = \frac{\sigma^2}{\sum_{i=1}^{n}(x_{ik} - \bar{x}_k)^2}$$
The ratio of the two variances

$$\frac{\operatorname{Var}(b_k)}{\operatorname{Var}(b_k)_{\min}}
= \frac{\sigma^2 \big/ \left[(1 - R_k^2)\sum_i (x_{ik} - \bar{x}_k)^2\right]}
       {\sigma^2 \big/ \sum_i (x_{ik} - \bar{x}_k)^2}
= \frac{1}{1 - R_k^2}$$
Variance inflation factors

The variance inflation factor for the kth predictor is:

$$VIF_k = \frac{1}{1 - R_k^2}$$

where $R_k^2$ is the $R^2$ value obtained by regressing the kth predictor on the remaining predictors.
Variance inflation factors (VIFk)
• A measure of how much the variance of the estimated regression coefficient bk is “inflated” by the existence of correlation among the predictor variables in the model.
• VIFs exceeding 4 warrant investigation.
• VIFs exceeding 10 are signs of serious multicollinearity.
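The definition above translates directly into code. A hedged sketch in Python with statsmodels (not the Minitab output shown on these slides): regress each predictor on all of the others and compute VIF_k = 1/(1 − R_k²). The function name `vif_by_definition` and the data are illustrative, not from the source.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def vif_by_definition(X: pd.DataFrame) -> pd.Series:
    """VIF_k = 1 / (1 - R_k^2), where R_k^2 comes from regressing
    predictor k on all of the remaining predictors."""
    vifs = {}
    for k in X.columns:
        others = sm.add_constant(X.drop(columns=k))
        r2 = sm.OLS(X[k], others).fit().rsquared
        vifs[k] = 1.0 / (1.0 - r2)
    return pd.Series(vifs)

# Illustrative predictors: x2 is nearly a copy of x1, so its VIF should be large.
rng = np.random.default_rng(1)
x1 = rng.normal(size=40)
X = pd.DataFrame({"x1": x1,
                  "x2": x1 + rng.normal(scale=0.1, size=40),
                  "x3": rng.normal(size=40)})
print(vif_by_definition(X).round(2))
```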
Blood pressure example

[Matrix scatter plot of BP, Age, Weight, BSA, Duration, Pulse, and Stress]
n = 20 hypertensive individuals
p-1 = 6 predictor variables
Blood pressure example

|          | BP    | Age   | Weight | BSA   | Duration | Pulse |
|----------|-------|-------|--------|-------|----------|-------|
| Age      | 0.659 |       |        |       |          |       |
| Weight   | 0.950 | 0.407 |        |       |          |       |
| BSA      | 0.866 | 0.378 | 0.875  |       |          |       |
| Duration | 0.293 | 0.344 | 0.201  | 0.131 |          |       |
| Pulse    | 0.721 | 0.619 | 0.659  | 0.465 | 0.402    |       |
| Stress   | 0.164 | 0.368 | 0.034  | 0.018 | 0.312    | 0.506 |
Blood pressure (BP) is the response.
Regress y = BP on all 6 predictors:

| Predictor | Coef     | SE Coef  | T     | P     | VIF |
|-----------|----------|----------|-------|-------|-----|
| Constant  | -12.870  | 2.557    | -5.03 | 0.000 |     |
| Age       | 0.70326  | 0.04961  | 14.18 | 0.000 | 1.8 |
| Weight    | 0.96992  | 0.06311  | 15.37 | 0.000 | 8.4 |
| BSA       | 3.776    | 1.580    | 2.39  | 0.033 | 5.3 |
| Dur       | 0.06838  | 0.04844  | 1.41  | 0.182 | 1.2 |
| Pulse     | -0.08448 | 0.05161  | -1.64 | 0.126 | 4.4 |
| Stress    | 0.005572 | 0.003412 | 1.63  | 0.126 | 1.8 |

S = 0.4072, R-Sq = 99.6%, R-Sq(adj) = 99.4%

Analysis of Variance

| Source         | DF | SS      | MS     | F      | P     |
|----------------|----|---------|--------|--------|-------|
| Regression     | 6  | 557.844 | 92.974 | 560.64 | 0.000 |
| Residual Error | 13 | 2.156   | 0.166  |        |       |
| Total          | 19 | 560.000 |        |        |       |
Regress x2 = Weight on the 5 remaining predictors:

| Predictor | Coef     | SE Coef | T     | P     | VIF |
|-----------|----------|---------|-------|-------|-----|
| Constant  | 19.674   | 9.465   | 2.08  | 0.057 |     |
| Age       | -0.1446  | 0.2065  | -0.70 | 0.495 | 1.7 |
| BSA       | 21.422   | 3.465   | 6.18  | 0.000 | 1.4 |
| Dur       | 0.0087   | 0.2051  | 0.04  | 0.967 | 1.2 |
| Pulse     | 0.5577   | 0.1599  | 3.49  | 0.004 | 2.4 |
| Stress    | -0.02300 | 0.01308 | -1.76 | 0.101 | 1.5 |

S = 1.725, R-Sq = 88.1%, R-Sq(adj) = 83.9%

Analysis of Variance

| Source         | DF | SS      | MS     | F     | P     |
|----------------|----|---------|--------|-------|-------|
| Regression     | 5  | 308.839 | 61.768 | 20.77 | 0.000 |
| Residual Error | 14 | 41.639  | 2.974  |       |       |
| Total          | 19 | 350.478 |        |       |       |
The variance inflation factor calculated by its definition

$$VIF_k = \frac{\operatorname{Var}(b_k)}{\operatorname{Var}(b_k)_{\min}} = \frac{1}{1 - R_k^2} = \frac{1}{1 - 0.881} = 8.40$$
The variance of the weight coefficient is inflated by a factor of 8.40 due to the existence of correlation among the predictor variables in the model.
The pairwise correlations

|          | BP    | Age   | Weight | BSA   | Duration | Pulse |
|----------|-------|-------|--------|-------|----------|-------|
| Age      | 0.659 |       |        |       |          |       |
| Weight   | 0.950 | 0.407 |        |       |          |       |
| BSA      | 0.866 | 0.378 | 0.875  |       |          |       |
| Duration | 0.293 | 0.344 | 0.201  | 0.131 |          |       |
| Pulse    | 0.721 | 0.619 | 0.659  | 0.465 | 0.402    |       |
| Stress   | 0.164 | 0.368 | 0.034  | 0.018 | 0.312    | 0.506 |
Blood pressure (BP) is the response.
Regress y = BP on age, weight, duration and stress:

| Predictor | Coef     | SE Coef  | T     | P     | VIF |
|-----------|----------|----------|-------|-------|-----|
| Constant  | -15.870  | 3.195    | -4.97 | 0.000 |     |
| Age       | 0.68374  | 0.06120  | 11.17 | 0.000 | 1.5 |
| Weight    | 1.03413  | 0.03267  | 31.65 | 0.000 | 1.2 |
| Dur       | 0.03989  | 0.06449  | 0.62  | 0.545 | 1.2 |
| Stress    | 0.002184 | 0.003794 | 0.58  | 0.573 | 1.2 |

S = 0.5505, R-Sq = 99.2%, R-Sq(adj) = 99.0%

Analysis of Variance

| Source         | DF | SS     | MS     | F      | P     |
|----------------|----|--------|--------|--------|-------|
| Regression     | 4  | 555.45 | 138.86 | 458.28 | 0.000 |
| Residual Error | 15 | 4.55   | 0.30   |        |       |
| Total          | 19 | 560.00 |        |        |       |
Reducing data-based multicollinearity
Data-based multicollinearity
• Multicollinearity that results from a poorly designed experiment, reliance on purely observational data, or the inability to manipulate the system on which you collect the data.
Some methods
• Modify the regression model by eliminating one or more predictor variables.
• Collect additional data under different experimental or observational conditions.
(Modified!) Allen Cognitive Level (ACL) Study

• Relationship of ACL test to level of pathology in a set of 23 patients in a hospital psychiatry unit:
– Response y = ACL score
– x1 = vocabulary (Vocab) score on Shipley Institute of Living Scale
– x2 = abstraction (Abstract) score on Shipley Institute of Living Scale
– x3 = score on Symbol-Digit Modalities Test (SDMT)
Allen Cognitive Level (ACL) Study on 23 patients

[Matrix scatter plot of ACL, SDMT, Vocab, and Abstract]
Strong correlation between Vocab and Abstract

[Scatter plot of Vocab versus Abstract]

Pearson correlation of Vocab and Abstract = 0.990
Regress y = ACL on SDMT, Vocab, and Abstract:

| Predictor | Coef    | SE Coef | T     | P     | VIF  |
|-----------|---------|---------|-------|-------|------|
| Constant  | 3.747   | 1.342   | 2.79  | 0.012 |      |
| SDMT      | 0.02326 | 0.01273 | 1.83  | 0.083 | 1.7  |
| Vocab     | 0.0283  | 0.1524  | 0.19  | 0.855 | 49.3 |
| Abstract  | -0.0138 | 0.1006  | -0.14 | 0.892 | 50.6 |

S = 0.7344, R-Sq = 26.5%, R-Sq(adj) = 14.8%

Analysis of Variance

| Source         | DF | SS      | MS     | F    | P     |
|----------------|----|---------|--------|------|-------|
| Regression     | 3  | 3.6854  | 1.2285 | 2.28 | 0.112 |
| Residual Error | 19 | 10.2476 | 0.5393 |      |       |
| Total          | 22 | 13.9330 |        |      |       |
Allen Cognitive Level (ACL) Study on 69 patients

[Matrix scatter plot of ACL, SDMT, Vocab, and Abstract for the enlarged data set]
Plot after having collected more data

[Scatter plot of Vocab versus Abstract]

Pearson correlation of Vocab and Abstract = 0.698
Regress y = ACL on SDMT, Vocab, and Abstract:

| Predictor | Coef     | SE Coef  | T     | P     | VIF |
|-----------|----------|----------|-------|-------|-----|
| Constant  | 3.9463   | 0.3381   | 11.67 | 0.000 |     |
| SDMT      | 0.027404 | 0.007168 | 3.82  | 0.000 | 1.6 |
| Vocab     | -0.01740 | 0.01808  | -0.96 | 0.339 | 2.1 |
| Abstract  | 0.01218  | 0.01159  | 1.05  | 0.297 | 2.2 |

S = 0.6878, R-Sq = 28.6%, R-Sq(adj) = 25.3%

Analysis of Variance

| Source         | DF | SS      | MS     | F    | P     |
|----------------|----|---------|--------|------|-------|
| Regression     | 3  | 12.3009 | 4.1003 | 8.67 | 0.000 |
| Residual Error | 65 | 30.7487 | 0.4731 |      |       |
| Total          | 68 | 43.0496 |        |      |       |
Reducing structural multicollinearity
In context of polynomial regression models
Structural multicollinearity
• Multicollinearity that is a mathematical artifact caused by creating new predictors from other predictors, such as creating the predictor x² from the predictor x.
Example
• (General research question) What is the impact of exercise on the human immune system?
• (Specific research question) How is the amount of immunoglobin in the blood (y) related to maximal oxygen uptake (x)?
Scatter plot

[Scatter plot of immunoglobin (mg) versus maximal oxygen uptake (ml/kg)]
A quadratic polynomial regression function

$$y_i = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2 + \varepsilon_i$$
where:
• yi = amount of immunoglobin in blood (mg)
• xi = maximal oxygen uptake (ml/kg)
• typical assumptions about error terms (“INE”)
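A brief sketch of fitting this quadratic with Python's statsmodels formula interface (an assumption; the slides use Minitab). The data below are synthetic stand-ins generated around the slide's estimated curve, so only the shape of the call matters:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data shaped like the example (30 observations, oxygen in 30-70 ml/kg).
rng = np.random.default_rng(2)
oxygen = rng.uniform(30, 70, size=30)
igg = -1464.4 + 88.3 * oxygen - 0.536 * oxygen**2 + rng.normal(scale=100, size=30)
df = pd.DataFrame({"oxygen": oxygen, "igg": igg})

# Quadratic polynomial regression: igg on oxygen and oxygen squared.
fit = smf.ols("igg ~ oxygen + I(oxygen**2)", data=df).fit()
print(fit.params)
```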
Estimated quadratic function

[Regression plot of igg versus oxygen with the fitted quadratic curve]

igg = -1464.40 + 88.3071 oxygen - 0.536247 oxygen**2

S = 106.427, R-Sq = 93.8%, R-Sq(adj) = 93.3%
Interpretation of the regression coefficients
• If 0 is a possible x value, then b0 is the predicted response. Otherwise, interpretation of b0 is meaningless.
• b1 is the slope of the tangent line at x = 0.
• b2 indicates the up/down direction of the curve:
– b2 < 0 means the curve is concave down
– b2 > 0 means the curve is concave up
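Both readings come straight from differentiating the fitted quadratic:

$$\hat{y} = b_0 + b_1 x + b_2 x^2, \qquad \frac{d\hat{y}}{dx} = b_1 + 2 b_2 x, \qquad \frac{d^2\hat{y}}{dx^2} = 2 b_2$$

At x = 0 the first derivative equals b1 (the tangent slope there), and the constant second derivative 2b2 fixes the concavity.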
Regress y = igg on oxygen and oxygen²

The regression equation is igg = -1464 + 88.3 oxygen - 0.536 oxygensq

| Predictor | Coef    | SE Coef | T     | P     | VIF  |
|-----------|---------|---------|-------|-------|------|
| Constant  | -1464.4 | 411.4   | -3.56 | 0.001 |      |
| oxygen    | 88.31   | 16.47   | 5.36  | 0.000 | 99.9 |
| oxygensq  | -0.5362 | 0.1582  | -3.39 | 0.002 | 99.9 |

S = 106.4, R-Sq = 93.8%, R-Sq(adj) = 93.3%

Analysis of Variance

| Source         | DF | SS      | MS      | F      | P     |
|----------------|----|---------|---------|--------|-------|
| Regression     | 2  | 4602211 | 2301105 | 203.16 | 0.000 |
| Residual Error | 27 | 305818  | 11327   |        |       |
| Total          | 29 | 4908029 |         |        |       |
Structural multicollinearity

[Scatter plot of oxygensq versus oxygen]

Pearson correlation of oxygen and oxygensq = 0.995
"Center" the predictors

$$\text{OxCent} = \text{Oxygen} - 50.637 \qquad \text{OxCentSq} = (\text{Oxygen} - 50.637)^2$$

Mean of oxygen = 50.637

| oxygen | oxcent  | oxcentsq |
|--------|---------|----------|
| 34.6   | -16.037 | 257.185  |
| 45.0   | -5.637  | 31.776   |
| 62.3   | 11.663  | 136.026  |
| 58.9   | 8.263   | 68.277   |
| 42.5   | -8.137  | 66.211   |
| 44.3   | -6.337  | 40.158   |
| 67.9   | 17.263  | 298.011  |
| 58.5   | 7.863   | 61.827   |
| 35.6   | -15.037 | 226.111  |
| 49.6   | -1.037  | 1.075    |
| 33.0   | -17.637 | 311.064  |
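A small numerical sketch of the same centering step (Python assumed; the names mirror the slide's oxcent and oxcentsq, and the uptake values are illustrative):

```python
import numpy as np

# Illustrative uptake values spanning roughly the example's range (30-70 ml/kg).
rng = np.random.default_rng(3)
oxygen = rng.uniform(30, 70, size=30)

oxcent = oxygen - oxygen.mean()     # centered predictor
oxcentsq = oxcent**2                # square of the centered predictor

# Raw predictor vs. its square: typically near 0.99 for data over a range like this.
print(np.corrcoef(oxygen, oxygen**2)[0, 1])
# Centered predictor vs. its square: much closer to zero, as on the next slide.
print(np.corrcoef(oxcent, oxcentsq)[0, 1])
```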
Wow! It really works!

[Scatter plot of oxcentsq versus oxcent]

Pearson correlation of oxcent and oxcentsq = 0.219
A better quadratic polynomial regression function

$$y_i = \beta_0^* + \beta_1^*(x_i - \bar{x}) + \beta_{11}^*(x_i - \bar{x})^2 + \varepsilon_i$$

where $x_i^* = x_i - \bar{x}$ denotes the centered predictor, so the model can also be written as:

$$y_i = \beta_0^* + \beta_1^* x_i^* + \beta_{11}^* x_i^{*2} + \varepsilon_i$$

and:

• yi = amount of immunoglobin in blood (mg)
• typical assumptions about error terms (“INE”)
Regress y = igg on oxcent and oxcent²

The regression equation is igg = 1632 + 34.0 oxcent - 0.536 oxcentsq

| Predictor | Coef    | SE Coef | T     | P     | VIF |
|-----------|---------|---------|-------|-------|-----|
| Constant  | 1632.20 | 29.35   | 55.61 | 0.000 |     |
| oxcent    | 34.000  | 1.689   | 20.13 | 0.000 | 1.1 |
| oxcentsq  | -0.5362 | 0.1582  | -3.39 | 0.002 | 1.1 |

S = 106.4, R-Sq = 93.8%, R-Sq(adj) = 93.3%

Analysis of Variance

| Source         | DF | SS      | MS      | F      | P     |
|----------------|----|---------|---------|--------|-------|
| Regression     | 2  | 4602211 | 2301105 | 203.16 | 0.000 |
| Residual Error | 27 | 305818  | 11327   |        |       |
| Total          | 29 | 4908029 |         |        |       |
Interpretation of the regression coefficients
• b0 is predicted response at the predictor mean.
• b1 is the estimated slope of the tangent line at the predictor mean and is often similar to the estimated slope from the simple linear model.
• b2 indicates the up/down direction of the curve:
– b2 < 0 means the curve is concave down
– b2 > 0 means the curve is concave up
Estimated regression function

[Regression plot of igg versus oxcent with the fitted quadratic curve]

igg = 1632.20 + 33.9995 oxcent - 0.536247 oxcent**2

S = 106.427, R-Sq = 93.8%, R-Sq(adj) = 93.3%
Similar estimates of coefficients from first-order linear model

[Regression plot of igg versus oxcent with the fitted line]

igg = 1557.63 + 32.7427 oxcent

S = 124.783, R-Sq = 91.1%, R-Sq(adj) = 90.8%
The relationship between the two forms of the model

Centered model: $\hat{y}_i = b_0^* + b_1^* x_i^* + b_{11}^* x_i^{*2}$

Original model: $\hat{y}_i = b_0 + b_1 x_i + b_{11} x_i^2$

where:

$$b_0 = b_0^* - b_1^* \bar{x} + b_{11}^* \bar{x}^2 \qquad b_1 = b_1^* - 2 b_{11}^* \bar{x} \qquad b_{11} = b_{11}^*$$
$$\hat{y}_i = 1632.2 + 34.0\, x_i^* - 0.5362\, x_i^{*2}$$

Mean of oxygen = 50.637

$$b_0 = 1632.2 - 34(50.637) - 0.5362(50.637)^2 = -1464.4$$
$$b_1 = 34 + 2(0.5362)(50.637) = 88.3$$
$$b_{11} = -0.5362$$

$$\hat{y}_i = -1464.4 + 88.3\, x_i - 0.536\, x_i^2$$
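The same back-transformation written as a small Python check against the numbers on the slides (the inputs are the centered-model estimates from the earlier output):

```python
# Centered-model estimates (from the centered fit) and the oxygen mean.
b0_star, b1_star, b11_star = 1632.20, 33.9995, -0.536247
xbar = 50.637

# Relationship between the two forms of the model.
b11 = b11_star
b1 = b1_star - 2 * b11_star * xbar
b0 = b0_star - b1_star * xbar + b11_star * xbar**2

# Expect roughly -1464.4, 88.31, -0.5362, matching the original-scale fit.
print(round(b0, 1), round(b1, 2), round(b11, 4))
```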
Model evaluation

[Plot of residuals versus the fitted values (response is igg)]
Model evaluation

[Normal probability plot of the residuals (response is igg)]
Model use: What is predicted IgG if maximal oxygen uptake is 90?
Extrapolation is even more dangerous when modeling data with a polynomial function, because the fitted curve can change direction.
Predicted Values for New Observations

| New Obs | Fit    | SE Fit | 95.0% CI         | 95.0% PI         |    |
|---------|--------|--------|------------------|------------------|----|
| 1       | 2139.6 | 219.2  | (1689.8, 2589.5) | (1639.6, 2639.7) | XX |

X denotes a row with X values away from the center
XX denotes a row with very extreme X values

Values of Predictors for New Observations

| New Obs | oxcent | oxcentsq |
|---------|--------|----------|
| 1       | 39.4   | 1549     |
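A small check of that fitted value (Python assumed), plugging oxygen = 90 into the centered fit from the slides:

```python
# Centered-model estimates and the oxygen mean, taken from the slides.
b0, b1, b11 = 1632.20, 33.9995, -0.536247
xbar = 50.637

oxygen_new = 90.0
oxcent_new = oxygen_new - xbar                # about 39.4, far outside the observed data
fit = b0 + b1 * oxcent_new + b11 * oxcent_new**2

print(round(oxcent_new, 1), round(fit, 1))    # roughly 39.4 and 2139.6
```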
The hierarchical approach to model fitting

A widely accepted approach is to fit a higher-order model and then explore whether a lower-order (simpler) model is adequate.

$$Y_i = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2 + \beta_{111} x_i^3 + \varepsilon_i$$

Is a first-order linear model ("line") adequate?

$$H_0: \beta_{11} = \beta_{111} = 0$$
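A hedged sketch of that hierarchical test as a partial F-test in Python with statsmodels (the slides' software is Minitab; the data below are synthetic stand-ins with real curvature, just to make the call concrete):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Synthetic stand-in data with genuine curvature, so the quadratic term matters.
rng = np.random.default_rng(4)
x = rng.uniform(30, 70, size=30)
y = -1464.4 + 88.3 * x - 0.536 * x**2 + rng.normal(scale=100, size=30)
df = pd.DataFrame({"x": x, "y": y})

# Reduced first-order model versus the full third-order model.
reduced = smf.ols("y ~ x", data=df).fit()
full = smf.ols("y ~ x + I(x**2) + I(x**3)", data=df).fit()

# Partial F-test of H0: beta_11 = beta_111 = 0 (the quadratic and cubic terms).
print(anova_lm(reduced, full))
```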
The hierarchical approach to model fitting

But then … if a polynomial term of a given order is retained, then all related lower-order terms are also retained.

That is, if a quadratic term was significant, you would use this regression function:

$$E(Y_i) = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2$$

and not this one:

$$E(Y_i) = \beta_0 + \beta_{11} x_i^2$$