(Correlation and) (Multiple) Regression, Friday 5th March (and Logistic Regression too!)
The Shape of Things to Come… Rest of Module

| | Week 8 | Week 9 | Week 10 |
| --- | --- | --- | --- |
| Morning | Regression | Published Multivariate Analyses | Log-linear Models |
| Afternoon | Regression & Logistic Regression (Computing Session) | Logistic Regression | Log-linear Models (Computing Session) |
| Assessment | | ASSESSMENT D | ASSESSMENT E |
The Correlation Coefficient (r)

[Scatterplot: Age at First Childbirth against Age at First Cohabitation; r = 0.5 (or perhaps less…)]

This shows the strength/closeness of a relationship
[Scatterplots illustrating r = +1, r = -1 and r = 0]
Correlation… and Regression
• r measures the strength of a linear relationship
• … and is connected to linear regression
• More precisely, it is r² (r-squared) that is of relevance
• It is the ‘variation explained’ by the regression line
• … and is sometimes referred to as the ‘coefficient of determination’
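As a quick illustration, the equality between r² and the ‘variation explained’ can be checked directly; this is a minimal sketch with made-up numbers, not the lecture's data:

```python
# Minimal sketch with hypothetical values for the two ages
import numpy as np

x = np.array([18, 20, 22, 24, 27, 30, 33])  # e.g. age at first cohabitation
y = np.array([21, 22, 26, 25, 30, 31, 36])  # e.g. age at first childbirth

r = np.corrcoef(x, y)[0, 1]  # the correlation coefficient

# Fit the OLS line and compute the 'variation explained'
B, C = np.polyfit(x, y, 1)                   # slope and constant
ss_total = np.sum((y - y.mean()) ** 2)       # overall variation
ss_resid = np.sum((y - (B * x + C)) ** 2)    # unexplained variation
print(r ** 2, 1 - ss_resid / ss_total)       # the two numbers agree
```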
[Scatterplot of y against x with the mean of y marked: the arrows show the overall variation, i.e. variation from the mean of y]
[Same scatterplot with the regression line added: some of the overall variation is explained by the regression line, i.e. the arrows tend to be shorter than the dashed lines, because the regression line is closer to the points than the mean line is]
[Scatterplot of Length of Residence (y) against Age (x), showing the regression line, its slope B, the constant C (where the line meets x = 0), an outlier, and a residual ε]

y = Bx + C + ε

where B is the slope, C is the constant, and ε is the error term (residual)
Choosing the line that best explains the data

• Some variation is explained by the regression line
• The residuals constitute the unexplained variation
• The regression line is chosen so as to minimise the sum of the squared residuals
• i.e. to minimise Σε² (Σ means ‘sum of’)
• The full/specific name for this technique is Ordinary Least Squares (OLS) linear regression
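For concreteness, here is a minimal sketch (hypothetical numbers) of the closed-form OLS estimates, which are exactly the values that minimise Σε²:

```python
# Minimal sketch: closed-form OLS slope and constant (hypothetical data)
import numpy as np

x = np.array([25.0, 30.0, 35.0, 40.0, 50.0, 60.0])  # e.g. age
y = np.array([2.0, 4.0, 3.0, 7.0, 9.0, 12.0])       # e.g. length of residence

B = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
C = y.mean() - B * x.mean()

residuals = y - (B * x + C)
print(B, C, np.sum(residuals ** 2))  # no other line gives a smaller sum of squares
```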
Regression assumptions #1 and #2
[Histogram of the residuals ε, centred on 0, with frequency on the vertical axis]

#1: Residuals have the usual symmetric, ‘bell-shaped’ normal distribution
#2: Residuals are independent of each other
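These two assumptions can be checked informally. The sketch below uses standard scipy/statsmodels diagnostics (not mentioned on the slides) on simulated residuals, since the slide's data are not shown:

```python
# Minimal sketch: checking assumptions #1 and #2 on a set of residuals
import numpy as np
from scipy.stats import shapiro
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
residuals = rng.normal(0, 1, size=100)  # stand-in for regression residuals

print(shapiro(residuals).pvalue)  # large p: consistent with normality (#1)
print(durbin_watson(residuals))   # near 2: no first-order autocorrelation (#2)
```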
Regression assumption #3

[Two scatterplots of y against x, one homoscedastic and one heteroscedastic]

Homoscedasticity: the spread of the residuals (ε) stays consistent in size (range) as x increases

Heteroscedasticity: the spread of the residuals (ε) increases as x increases (or varies in some other way). Remedy: use Weighted Least Squares
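A minimal sketch of the Weighted Least Squares remedy, assuming (hypothetically) that the residual spread grows in proportion to x, so that weights of 1/x² are appropriate:

```python
# Minimal sketch: OLS vs WLS under heteroscedasticity (hypothetical data)
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = np.linspace(1, 10, 50)
y = 2 * x + 1 + rng.normal(0, x)  # residual spread increases with x

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()
wls = sm.WLS(y, X, weights=1 / x**2).fit()
print(ols.params)  # similar B and C...
print(wls.params)  # ...but WLS uses the data more efficiently
```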
Regression assumption #4
• Linearity! (We’ve already assumed this…)
• In the case of a non-linear relationship, one may be able to use a non-linear regression equation, such as:
y = B1x + B2x² + c
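Such an equation is still linear in the B's, so it can be fitted by ordinary least squares with x² included as an extra column; a minimal sketch on simulated data:

```python
# Minimal sketch: fitting y = B1*x + B2*x**2 + c by least squares
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 30)
y = 0.5 * x + 0.3 * x**2 + 2 + rng.normal(0, 1, x.size)

X = np.column_stack([x, x**2, np.ones_like(x)])  # columns for B1, B2, c
(B1, B2, c), *_ = np.linalg.lstsq(X, y, rcond=None)
print(B1, B2, c)  # close to the true 0.5, 0.3 and 2
```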
Another problem: Multicollinearity
• If two ‘independent variables’, x and z, are perfectly correlated (i.e. identical), it is impossible to tell what the B values corresponding to each should be
• e.g. if y = 2x + c, and we add z, should we get:
  • y = 1.0x + 1.0z + c, or
  • y = 0.5x + 1.5z + c, or
  • y = -5001.0x + 5003.0z + c?
• The problem also applies if two variables are highly (but not perfectly) correlated…
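The instability can be demonstrated directly; a minimal sketch where z is made almost identical to x (all numbers hypothetical):

```python
# Minimal sketch: near-perfect collinearity makes individual B's unstable
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0, 1, 200)
z = x + rng.normal(0, 1e-6, 200)        # z is almost exactly x
y = 2 * x + 1 + rng.normal(0, 0.1, 200)

X = np.column_stack([x, z, np.ones_like(x)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)               # Bx and Bz individually wild; their sum is near 2
print(np.linalg.cond(X))  # a huge condition number flags the problem
```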
Example of Regression(from Pole and Lampard, 2002, Ch. 9)
• GHQ = (-0.69 x INCOME) + 4.94
• Is -0.69 significantly different from 0 (zero)?
• A test statistic that takes account of the ‘accuracy’ of the B of -0.69 (by dividing it by its standard error) is t = -2.142
• For this value of t in this example, the significance value is p = 0.038 < 0.05
• r-squared here is (-0.321)² = 0.103 = 10.3%
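The arithmetic on this slide can be checked directly; note that the standard error below is inferred from B and t, rather than reported on the slide:

```python
# Minimal sketch: checking the slide's arithmetic
B, t = -0.69, -2.142
se = B / t         # implied standard error, about 0.322
r = -0.321
print(se, r ** 2)  # r-squared = 0.103, i.e. 10.3%
```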
… and of Multiple Regression
• GHQ = (-0.47 x INCOME) + (-1.95 x HOUSING) + 5.74
• For B = -0.47, t = -1.51 (& p = 0.139 > 0.05)
• For B = -1.95, t = -2.60 (& p = 0.013 < 0.05)
• The r-squared value for this regression is 0.236 (23.6%)
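A minimal sketch of how such a model is fitted and read off in statsmodels; the data frame below is simulated to mimic the reported coefficients, since the original survey data are not available:

```python
# Minimal sketch: fitting and reading a multiple regression (simulated data)
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
df = pd.DataFrame({"INCOME": rng.normal(3, 1, 60),
                   "HOUSING": rng.integers(0, 2, 60)})
df["GHQ"] = (5.74 - 0.47 * df["INCOME"] - 1.95 * df["HOUSING"]
             + rng.normal(0, 2, 60))

model = smf.ols("GHQ ~ INCOME + HOUSING", data=df).fit()
print(model.params)    # the B's and the constant
print(model.tvalues)   # t = B / standard error
print(model.pvalues)   # significance values
print(model.rsquared)  # the r-squared
```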
Interaction effects…

[Plot of square root of length of residence against age, with separate lines for women, all, and men]

In this situation there is an interaction between the effects of age and of gender, so B (the slope) varies according to gender and is greater for women
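In a regression model an interaction is captured by a product term; a minimal sketch on simulated data (variable names hypothetical), where the AGE:SEX coefficient is the extra slope for women:

```python
# Minimal sketch: an AGE-by-SEX interaction in OLS (simulated data)
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
df = pd.DataFrame({"AGE": rng.uniform(20, 70, 200),
                   "SEX": rng.integers(0, 2, 200)})  # 1 = women, 0 = men
df["SQRT_RES"] = (0.05 * df["AGE"] + 0.03 * df["AGE"] * df["SEX"]
                  + rng.normal(0, 0.5, 200))

model = smf.ols("SQRT_RES ~ AGE + SEX + AGE:SEX", data=df).fit()
print(model.params)  # AGE:SEX > 0: the slope on age is greater for women
```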
Logistic regression and odds ratios
• Men: 1967/294 = 6.69 (to 1)
• Women: 1980/511 = 3.87 (to 1)
• Odds ratio 6.69/3.87 = 1.73
• Men: p/(1-p) = 3.87 x 1.73 = 6.69
• Women: p/(1-p) = 3.87 x 1 = 3.87
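These figures follow mechanically from the counts; a minimal check:

```python
# Reproducing the odds and the odds ratio from the counts above
men_yes, men_no = 1967, 294
women_yes, women_no = 1980, 511

odds_men = men_yes / men_no        # 6.69 (to 1)
odds_women = women_yes / women_no  # 3.87 (to 1)
print(odds_men, odds_women)
print(odds_men / odds_women)       # odds ratio: about 1.73
```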
Odds and log odds
• Odds = Constant x Odds ratio
• Log odds = log(constant) + log(odds ratio)
• Men
log (p/(1-p)) = log(3.87) + log(1.73)
• Women
log (p/(1-p)) = log(3.87) + log(1) = log(3.87)
• log (p/(1-p)) = constant + log(odds ratio)
• Note that:
log(3.87) = 1.354
log(6.69) = 1.900
log(1.73) = 0.546
log(1) = 0
• And that the ‘reverse’ of the logarithmic transformation is exponentiation
• log (p/(1-p)) = constant + (B x SEX)
where B = log(1.73), SEX = 1 for men, and SEX = 0 for women
• Log odds for men = 1.354 + 0.546 = 1.900
• Log odds for women = 1.354 + 0 = 1.354
• Exp(1.900) = 6.69 & Exp(1.354) = 3.87
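The same arithmetic, checked numerically:

```python
# Log-odds arithmetic: add on the log scale, multiply on the odds scale
import math

constant = math.log(3.87)  # 1.354, the log odds for women
effect = math.log(1.73)    # 0.546, the B for SEX

print(constant + effect)            # 1.900, the log odds for men
print(math.exp(constant + effect))  # back to the odds: 6.69
print(math.exp(constant))           # 3.87
```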
Interpreting effects in Logistic Regression
• In the above example: Exp(B) = Exp(log(1.73)) = 1.73 (the odds ratio!)
• In general, effects in logistic regression analysis take the form of exponentiated B’s (Exp(B)), which are odds ratios. Odds ratios have a multiplicative effect on the odds of the outcome
• Is a B of 0.546 (= log(1.73)) significant?
• In this case p = 0.000 < 0.05 for this B.
Back from odds to probabilities
• Probability = Odds / (1 + Odds)
• Men: 6.69 / (1 + 6.69) = 0.870
• Women: 3.87 / (1 + 3.87) = 0.795
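In code:

```python
# Converting odds back to probabilities
def odds_to_prob(odds):
    return odds / (1 + odds)

print(odds_to_prob(6.69))  # men: 0.870
print(odds_to_prob(3.87))  # women: 0.795
```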
‘Multiple’ Logistic Regression
• log odds = c + (B1 x SEX) + (B2 x AGE)
= c + (0.461 x SEX) + (-0.099 x AGE)
• For B1 = 0.461, p = 0.000 < 0.05
• For B2 = -0.099, p = 0.000 < 0.05
• Exp(B1) = Exp(0.461) = 1.59
• Exp(B2) = Exp(-0.099) = 0.905
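A minimal sketch of reading off these effects; the constant c below is hypothetical, since the slide does not report it:

```python
# Interpreting the fitted logistic coefficients as odds ratios
import math

c = 1.9                  # hypothetical constant (not reported on the slide)
B1, B2 = 0.461, -0.099   # coefficients for SEX and AGE

print(math.exp(B1))  # 1.59: being male multiplies the odds by 1.59
print(math.exp(B2))  # 0.905: each extra year multiplies the odds by 0.905

log_odds = c + B1 * 1 + B2 * 30  # e.g. a man (SEX = 1) aged 30
print(math.exp(log_odds))        # the fitted odds for that case
```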