Transcript of Lecture 8, Relationships between Scale Variables: Regression Analysis (posted 18-Dec-2015)
1
Lecture 8: Relationships between Scale Variables: Regression Analysis
Graduate School Quantitative Research Methods
Gwilym [email protected]
2
Notices: Register
3
Plan:
1. Linear & Non-linear Relationships
2. Fitting a line using OLS
3. Inference in Regression
4. Omitted Variables & R²
5. Types of Regression Analysis
6. Properties of OLS Estimates
7. Assumptions of OLS
8. Doing Regression in SPSS
4
1. Linear & Non-linear relationships between variables
Often of greatest interest in social science is the investigation of relationships between variables:
– is social class related to political perspective?
– is income related to education?
– is worker alienation related to job monotony?
We are also interested in the direction of causation, but this is more difficult to prove empirically:
– our empirical models are usually structured assuming a particular theory of causation
5
Relationships between scale variables
The most straightforward way to investigate evidence for a relationship is to look at scatter plots:
– traditional to:
• put the dependent variable (i.e. the “effect”) on the vertical axis, or “y axis”
• put the explanatory variable (i.e. the “cause”) on the horizontal axis, or “x axis”
6
Scatter plot of IQ and Income:
[Scatter plot: IQ on the x axis (60–160), INCOME on the y axis (10000–40000)]
7
We would like to find the line of best fit:
[Scatter plot with fitted line: IQ on the x axis (60–160), INCOME on the y axis (10000–40000)]

ŷ = a + bx

where,
a = intercept
b = slope of the line

INCOME = a + b·IQ
8
What does the output mean?
9
Sometimes the relationship appears non-linear:
[Scatter plot: IQ2 on the x axis (0–300), INCOME on the y axis (10000–40000)]
10
… and so a straight line of best fit is not always very satisfactory:
[Scatter plot with linear fit: IQ2 on the x axis (0–300), INCOME on the y axis (10000–40000)]
11
Could try a quadratic line of best fit:
[Scatter plot with quadratic fit: IQ2 on the x axis (0–300), INCOME on the y axis (10000–40000)]
12
But we can simulate a non-linear relationship by first transforming one of the variables:
[Left: scatter plot of IQ2 (0–300) against INCOME (10000–40000); right: scatter plot of IQ2 against INC_SQ, income squared (0–1,000,000,000)]
13
[Left: scatter plot of IQ2 (0–300) against INCOME (10000–40000); right: scatter plot of IQ2_LN, the natural log of IQ2 (4.2–5.8), against INCOME]
14
… or a cubic line of best fit: (overfitted?)
[Scatter plot with cubic fit: IQ2 on the x axis (0–300), INCOME on the y axis (10000–40000)]
15
Or could try two linear lines: a “structural break”
[Scatter plot with two linear segments: IQ2 on the x axis (0–300), INCOME on the y axis (10000–40000)]
16
2. Fitting a line using OLS
The most popular algorithm for drawing the line of best fit is one that minimises the sum of squared deviations from the line to each observation:
min Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

Where:
yᵢ = observed value of y
ŷᵢ = predicted value of yᵢ
= the value on the line of best fit corresponding to xᵢ
17
Regression estimates of a, b:or Ordinary Least Squares (OLS):
This criterion yields estimates of the slope b and y-intercept a of the straight line:
b = (n·Σxy − Σx·Σy) / (n·Σx² − (Σx)²)

a = ȳ − b·x̄
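The slope and intercept formulas above can be sketched directly in pure Python. The data here are made up for illustration, not the lecture's IQ/income sample:

```python
# OLS by the textbook formulas:
# b = (n*Σxy − Σx*Σy) / (n*Σx² − (Σx)²),   a = ȳ − b*x̄
xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical explanatory variable
ys = [2.1, 3.9, 6.2, 8.1, 9.9]   # hypothetical dependent variable

n = len(xs)
sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n    # intercept from a = ȳ − b·x̄
```

The same numbers come out of any statistics package's univariate OLS routine; the point of the formulas is that nothing beyond sums and cross-products is involved.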
18
3. Inference in Regression: Hypothesis tests on the slope coefficient:
Regressions are usually run on samples, so what can we say about the population relationship between x and y?
Repeated samples would yield a range of values for estimates of b ~ N(β, s_b)
• i.e. b is normally distributed with mean β = the population value = the value of b if the regression were run on the population
If there is no relationship in the population between x and y, then β = 0, and this is our H0
19
What does the standard error mean?
20
Hypothesis test on b:
(1) H0: β = 0
(i.e. the slope coefficient, if the regression were run on the population, would = 0)
H1: β ≠ 0
(2) α = 0.05 or 0.01 etc.
(3) Reject H0 iff P < α
(N.B. Rule of thumb: P < 0.05 if |t| ≥ 2, and P < 0.01 if |t| ≥ 2.6)
(4) Calculate P and conclude.

t = (b − β) / s_b = b / s_b under H0, with df = n − 2
21
Example using SPSS output: (1) H0: no relationship between house price and floor area.
H1: there is a relationship (2), (3), (4):
P = 1- CDF.T(24.469,554) = 0.000000
Reject H0
Coefficients(a)
Model 1                    Unstandardized Coefficients    Standardized Coefficients
                           B           Std. Error        Beta       t        Sig.
(Constant)                 -10029.0    3242.545                     -3.093   .002
Floor Area (sq meters)     638.825     26.108            .721       24.469   .000
a. Dependent Variable: Purchase Price
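The t statistic in the coefficients table can be reproduced by hand from B and its standard error; a minimal check, with the two numbers copied from the SPSS output above:

```python
# Recompute t for the floor-area slope: under H0 (beta = 0), t = b / SE(b)
b, se_b = 638.825, 26.108       # from the SPSS Coefficients table
t_stat = b / se_b               # SPSS reports 24.469
reject_at_5pct = abs(t_stat) > 2.0   # the slide's rule of thumb for P < 0.05
```

With |t| ≈ 24.5, far beyond the rule-of-thumb cutoff of 2, the exact P-value is effectively zero, matching the Sig. = .000 column.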
22
4. Omitted Variables & R²
Q/ Is floor area the only factor? How much of the variation in Price does it explain?
[Scatter plot: Floor Area (sq meters, 0–300) on the x axis, Purchase Price (0–300000) on the y axis]
23
R-square
R-square tells you how much of the variation in y is explained by the explanatory variable x:
– 0 ≤ R² ≤ 1 (NB: you want R² to be near 1).
– If more than one explanatory variable, use Adjusted R²
Model Summary
Model 1:  R = .721(a)   R Square = .519   Adjusted R Square = .519   Std. Error of the Estimate = 26925.18
a. Predictors: (Constant), Floor Area (sq meters)
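The R² definition can be made concrete in a few lines: it is 1 − SS_residual / SS_total, the share of the variation in y accounted for by the fitted values. The data below are invented for illustration:

```python
# R² = 1 − SS_res / SS_tot on made-up observed vs predicted values
ys     = [2.0, 4.0, 6.0, 8.0]   # observed y (hypothetical)
y_hats = [2.5, 3.5, 6.5, 7.5]   # predicted by the fitted line (hypothetical)

n, k = len(ys), 1               # n observations, k explanatory variables
y_bar = sum(ys) / n
ss_tot = sum((y - y_bar) ** 2 for y in ys)                    # total variation
ss_res = sum((y - yh) ** 2 for y, yh in zip(ys, y_hats))      # unexplained part
r_squared = 1 - ss_res / ss_tot

# Adjusted R² penalises extra explanatory variables
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
```

Adjusted R² is always at or below plain R², which is why it is the preferred summary once a second explanatory variable enters the model.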
24
Example: 2 explanatory variables

Coefficients(a)
Model 1                    B            Std. Error    Beta    t        Sig.
(Constant)                 -23817.3     3623.761              -6.573   .000
Floor Area (sq meters)     507.271      30.722        .572    16.511   .000
Number of Bathrooms        24951.769    3401.282      .254    7.336    .000
a. Dependent Variable: Purchase Price

Model Summary
Model 1:  R = .750(a)   R Square = .562   Adjusted R Square = .560   Std. Error of the Estimate = 25726.74
a. Predictors: (Constant), Number of Bathrooms, Floor Area (sq meters)
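A hedged sketch of the two-regressor case (price on floor area and bathrooms) using the standard closed-form normal equations on mean-centered data. The numbers below are invented for illustration, not the lecture's 556-house sample:

```python
# Two-variable OLS: y = a + b1*x1 + b2*x2, solved from the normal equations
x1 = [50.0, 70.0, 90.0, 110.0, 130.0]   # floor area (hypothetical)
x2 = [1.0, 1.0, 2.0, 2.0, 3.0]          # number of bathrooms (hypothetical)
y  = [60.0, 75.0, 105.0, 120.0, 150.0]  # price in thousands (hypothetical)

n = len(y)
m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n

# Sums of squares and cross-products of the mean-centered variables
s11 = sum((v - m1) ** 2 for v in x1)
s22 = sum((v - m2) ** 2 for v in x2)
s12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
s1y = sum((u - m1) * (w - my) for u, w in zip(x1, y))
s2y = sum((v - m2) * (w - my) for v, w in zip(x2, y))

# Closed-form solution of the two-variable normal equations
det = s11 * s22 - s12 ** 2
b1 = (s22 * s1y - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
a = my - b1 * m1 - b2 * m2      # intercept recovered from the means
```

Note the `det` term: if x1 and x2 were perfectly linearly related it would be zero and the slopes would be undefined, which previews the multicollinearity assumption later in the lecture.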
25
Scatter plot (with floor spikes)
[3D scatter plot: Floor Area (sq meters, 100–300) and Number of Bathrooms (1.0–3.5) on the floor axes, Purchase Price (100000–300000) on the vertical axis]
26
3D Surface Plots: Construction, Price & Unemployment
Q = -246 + 27P - 0.2P² - 73U + 3U²
[3D surface plot of Q against P (0–80) and U(t-1) (0–15), Q ranging from -500 to 500]
27
Construction Equation in a Slump
Q = 315 + 4P - 73U + 5U²
[3D surface plot of Q against P (0–80) and U (0–15), Q ranging from 0 to 800]
28
5. Types of regression analysis:
Univariate regression: one explanatory variable
– what we’ve looked at so far in the above equations
Multivariate regression: >1 explanatory variable
– more than one variable on the RHS
Log-linear regression & log-log regression:
– taking logs of variables can deal with certain types of non-linearities & has useful properties (e.g. elasticities)
Categorical dependent variable regression:
– dependent variable is dichotomous: the observation has an attribute or not
• e.g. MPPI take-up, unemployed or not, etc.
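The log-log case mentioned above can be illustrated in a few lines: when ln(y) is regressed on ln(x), the slope is the elasticity of y with respect to x. The data here are synthetic, generated exactly as y = 2·x^0.5, so the recovered slope should be 0.5:

```python
import math

# Synthetic power-law data: y = 2 * x**0.5, so the elasticity is 0.5
xs = [1.0, 2.0, 4.0, 8.0, 16.0]
ys = [2.0 * x ** 0.5 for x in xs]

# Transform both variables, then run ordinary univariate OLS on the logs
lx = [math.log(x) for x in xs]
ly = [math.log(y) for y in ys]

n = len(lx)
sx, sy = sum(lx), sum(ly)
sxy = sum(u * v for u, v in zip(lx, ly))
sx2 = sum(u * u for u in lx)

b = (n * sxy - sx * sy) / (n * sx2 - sx ** 2)   # slope = elasticity
a = sy / n - b * sx / n                          # intercept = ln(2) here
```

This is the same OLS machinery as before; only the variables change, which is why the lecture treats log transformations as a pre-processing step rather than a new estimator.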
29
6. Properties of OLS estimators
OLS estimates of the slope and intercept parameters have been shown to be BLUE (provided certain assumptions are met):
• Best• Linear• Unbiased• Estimator
30
“Best” in that they have the minimum variance compared with other estimators (i.e. given repeated samples, the OLS estimates for α and β vary less between samples than any other sample estimates for α and β).
“Linear” in that a straight line relationship is assumed.
“Unbiased” because, in repeated samples, the mean of all the estimates achieved will tend towards the population values for α and β.
“Estimators” in that the true values of α and β cannot be known, and so we are using statistical techniques to arrive at the best possible assessment of their values, given the information available.
31
7. Assumptions of OLS:
For estimation of a and b to be BLUE and for regression inference to be correct:
1. Equation is correctly specified:
– Linear in parameters (can still transform variables)
– Contains all relevant variables
– Contains no irrelevant variables
– Contains no variables with measurement errors
2. Error term has zero mean
3. Error term has constant variance
32
4. Error term is not autocorrelated
– i.e. not correlated with the error term from previous time periods
5. Explanatory variables are fixed
– observe a normal distribution of y for repeated fixed values of x
6. No linear relationship between RHS variables
– i.e. no “multicollinearity”
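One detail worth seeing concretely: the zero-mean assumption concerns the unobservable population error term, because the sample residuals from an OLS fit with an intercept average to zero by construction. A minimal pure-Python sketch on made-up data:

```python
# Fit a line by OLS, then check the residuals: they sum to (numerically) zero
# whenever an intercept is included, regardless of the data.
xs = [1.0, 2.0, 3.0, 4.0]    # hypothetical x values
ys = [1.2, 1.9, 3.2, 3.7]    # hypothetical y values

n = len(xs)
b = (n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)) \
    / (n * sum(x * x for x in xs) - sum(xs) ** 2)
a = sum(ys) / n - b * sum(xs) / n

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
mean_resid = sum(residuals) / n   # zero up to floating-point error
```

So plotting or testing residuals is informative about assumptions 3 and 4 (constant variance, no autocorrelation), but their sample mean can never flag a violation of assumption 2.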
33
8. Doing Regression analysis in SPSS
To run a regression analysis in SPSS, click on Analyse, Regression, Linear:
34
Select your dependent (i.e. ‘explained’) variable and independent (i.e. ‘explanatory’) variables:
35
e.g. Floor area and bathrooms:
Floor area = α + β · Number of bathrooms + ε
Model Summary
Model 1:  R = .584(a)   R Square = .341   Adjusted R Square = .340   Std. Error of the Estimate = 35.58
a. Predictors: (Constant), Number of Bathrooms

ANOVA(b)
Model 1       Sum of Squares   df    Mean Square   F         Sig.
Regression    362375.7         1     362375.7      286.292   .000(a)
Residual      701228.5         554   1265.755
Total         1063604          555
a. Predictors: (Constant), Number of Bathrooms
b. Dependent Variable: Floor Area (sq meters)
36
Coefficients(a)
Model 1                 B        Std. Error   Beta    t        Sig.
(Constant)              40.928   4.700                8.708    .000
Number of Bathrooms     64.622   3.819        .584    16.920   .000
a. Dependent Variable: Floor Area (sq meters)
37
Confidence Intervals for regression coefficients
Population slope coefficient CI:

b ± t · SE_b

Rule of thumb:

b ± 2 · SE_b
38
e.g. regression of floor area on number of bathrooms, CI on slope:
= 64.6 ± 2 × 3.8
= 64.6 ± 7.6
95% CI = (57, 72)
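The rule-of-thumb arithmetic above can be checked in a couple of lines, with the slope and standard error copied from the earlier SPSS output:

```python
# Rule-of-thumb 95% CI for the bathrooms slope: b ± 2 * SE(b)
b, se_b = 64.622, 3.819
lower = b - 2 * se_b
upper = b + 2 * se_b
# Compare with SPSS's exact interval (57.120, 72.123), which uses the
# t critical value for 554 df (≈1.96) instead of the rounded 2.
```

The rule-of-thumb interval is slightly wider than the exact one, which is the safe direction to err in.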
39
Confidence Intervals in SPSS:Analyse, Regression, Linear, click on Statistics and
select Confidence intervals:
40
Our rule of thumb said 95% CI for slope = ( 57, 72). How does this compare?
Coefficients(a)
Model 1                          B        Std. Error   Beta    t        Sig.   Lower Bound   Upper Bound
(Constant)                       40.928   4.700                8.708    .000   31.697        50.160
BATHROOM Number of Bathrooms     64.622   3.819        .584    16.920   .000   57.120        72.123
(Last two columns: 95% Confidence Interval for B)
a. Dependent Variable: FLOORARE Floor Area (sq meters)
41
Past Paper: (C2) Relationships (30%)
Suppose you have a theory that suggests that time watching TV is determined by gregariousness: the less gregarious, the more time spent watching TV.
Use a random sample of 60 observations from the TV watching data to run a statistical test for this relationship that also controls for the effects of age and gender.
Carefully interpret the output from this model and discuss the statistical robustness of the results.
42
Reading:
Regression Analysis:
– *Field, A., chapters on regression.
– *Moore and McCabe, chapters on regression.
– Kennedy, P., A Guide to Econometrics.
– Bryman, Alan, and Cramer, Duncan (1999) Quantitative Data Analysis with SPSS for Windows: A Guide for Social Scientists, Chapters 9 and 10.
– Achen, Christopher H., Interpreting and Using Regression (London: Sage, 1982).