Sociology 601 Class 19: November 3, 2008 Review of correlation and standardized coefficients...


Sociology 601 Class 19: November 3, 2008

• Review of correlation and standardized coefficients

• Statistical inference for the slope (9.5)

• Violations of Model Assumptions, and their effects (9.6)

1

9.5 Inference for a slope.

• Problem: we have measures for the strength of the linear association between two variables, but no measures for the statistical significance of that association.

o We know the slope & intercept for our sample; what can we say about the slope & intercept for the population?

• Solution: hypothesis tests for a slope and confidence intervals for a slope.

o Need a standard error for the coefficients

• Difficulties: additional assumptions, complications with estimating a standard error for a slope.

2

Assumptions Needed to make Population Inferences for slopes.

• The sample is selected randomly.

• X and Y are interval scale variables.

• The mean of Y is related to X by the linear equation E{Y} = α + βX.

• The conditional standard deviation of Y is identical at each X value. (no heteroscedasticity)

• The conditional distribution of Y at each value of X is normal.

• There is no error in the measurement of X.

3

Common Ways to Violate These Assumptions

• The sample is selected randomly.

o Cluster sampling (e.g., census tracts / neighborhoods) causes observations within a cluster to be more similar to each other than to observations outside the cluster.

o Two or more siblings in the same family.

o Sample = populations (e.g., states in the U.S.)

• X and Y are interval scale variables.

o Ordinal scale attitude measures

o Nominal scale categories (e.g., race/ethnicity, religion)

4

Common Ways to Violate These Assumptions (2)

• The mean of Y is related to X by the linear equation E{Y} = α + βX.

o U-shape: e.g., Kuznets inverted-U curve (inequality <- GDP/capita)

o Thresholds:

o Logarithmic (e.g., earnings <- education)

• The conditional standard deviation of Y is identical at each X value. (no heteroscedasticity)

o earnings <- education

o hours worked <- time

o adult child occupational status <- parental occupational status

5

Common Ways to Violate These Assumptions (3)

• The conditional distribution of Y at each value of X is normal.

o earnings (skewed) <- education

o Y is binary, or a %

• There is no error in the measurement of X.

o almost everything

o what is the effect of measurement error in x on b?

6

The Null hypothesis for slopes

Null hypothesis: the variables are statistically independent.

• Ho: β = 0. The null hypothesis is that there is no linear relationship between X and Y.

• Implication for α: E{Y} = α + 0*X = α; α = μ_Y (the population mean of Y).

(Draw figure of distribution of Y, X when Ho is true)

7

Test Statistic for slopes

• What is the range of b’s we would get if we take repeated samples from a population and calculate b for each of those samples?

• That is, what is the standard error of the sample slope b’s?

• Test statistic: t = b / σ̂_b

o where σ̂_b is the standard error of the sample slope b.

o df for the t statistic (with one x-variable) is n-2

o when n is large, the t statistic is asymptotically equivalent to a z-statistic

• What would make σ̂_b smaller?

8

Calculating the s.e. of b

σ̂_b = σ̂ / (sX*sqrt(n-1))

where σ̂ = sqrt(SSE/(n-2)) (= root MSE)

• the standard error of b is smaller when…

o the sample size is large

o the standard deviation of X is large (there is a wide range of X values)

o the conditional standard deviation of Y is small.
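As a rough check on the formula above, the following Python sketch computes the standard error of b from summary statistics. The function name slope_standard_error is invented for illustration; the inputs are the SSE, N, and Sx given for the poverty and murder example in these notes.

```python
import math

def slope_standard_error(sse, n, s_x):
    """Standard error of the slope: sqrt(SSE/(n-2)) / (s_x * sqrt(n-1))."""
    root_mse = math.sqrt(sse / (n - 2))   # estimated conditional s.d. of Y
    return root_mse / (s_x * math.sqrt(n - 1))

# Summary statistics from the poverty/murder example (51 observations).
se_b = slope_standard_error(sse=3904.3, n=51, s_x=4.584)
print(round(se_b, 3))

# Same SSE and n, but twice the spread in X: the standard error is halved.
print(round(slope_standard_error(sse=3904.3, n=51, s_x=2 * 4.584), 3))
```

The second call only illustrates the "wide range of X values" point; doubling Sx while holding SSE fixed is a hypothetical, not something taken from the data.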

9

Conclusions about Population

• P-value:

calculated as in any t-test, but remember df = n-2

a z-test is appropriate when n > 30 or so

• Conclusions:

evaluate the p-value against a previously selected alpha level

Rule of thumb: for significance at roughly the .05 level, |b| should be at least 2x its standard error.
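A minimal sketch of the large-sample version of this test, assuming the z approximation is acceptable (n > 30 or so). It uses the normal distribution from Python's standard library rather than an exact t distribution, so the p-value is approximate. The function name two_sided_p_from_z is invented for illustration, and the slope and standard error are the rounded values from the poverty and murder example in these notes.

```python
from statistics import NormalDist

def two_sided_p_from_z(test_stat):
    """Two-sided p-value, treating the test statistic as standard normal."""
    return 2 * (1 - NormalDist().cdf(abs(test_stat)))

b, se_b = 1.323, 0.275            # slope and its standard error (rounded)
t_stat = b / se_b
p_approx = two_sided_p_from_z(t_stat)

print(round(t_stat, 2))           # test statistic
print(p_approx < 0.001)           # compare with the chosen alpha level
print(abs(b) >= 2 * se_b)         # rule-of-thumb check
```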

10

Example of Inference about a Slope

• In an analysis of poverty and crime in the 50 states plus DC, a computer output provides the following:

• E{Murder rate} = -10.14 + 1.323*{Poverty rate}

(Poverty rate in %, murder rate per 100,000)

• SSE = 3904.3 SST = 5743.3

• N = 51 Sx = 4.584

• Do a hypothesis test to determine whether there is a linear relationship between crime rates and poverty rates.

11

Stata Example of Inference about a Slope

• In an analysis of poverty and crime in the 50 states plus DC, stata computer output provides the following:

regress murder poverty

      Source |       SS       df       MS              Number of obs =      51
-------------+------------------------------           F(  1,    49) =   23.08
       Model |  1839.06931     1  1839.06931           Prob > F      =  0.0000
    Residual |  3904.25223    49  79.6786169           R-squared     =  0.3202
-------------+------------------------------           Adj R-squared =  0.3063
       Total |  5743.32154    50  114.866431           Root MSE      =  8.9263

------------------------------------------------------------------------------
      murder |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     poverty |    1.32296   .2753711     4.80   0.000     .7695805    1.876339
       _cons |   -10.1364   4.120616    -2.46   0.017    -18.41708   -1.855707
------------------------------------------------------------------------------

• Interpret whether there is a linear relationship between crime rates and poverty rates.

12

Example of Inference about a Slope

• SSE = 3904.3 SST = 5743.3

• N = 51 Sx = 4.58

• b = 1.323

b = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)², a = Ȳ − bX̄

13
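The least-squares formulas for b and a can be sketched in Python as follows. The state-level data behind the example are not reproduced in these notes, so the five data points below are invented purely for illustration, and least_squares is a hypothetical helper name.

```python
def least_squares(xs, ys):
    """Return (a, b) for the prediction equation Y-hat = a + b*X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope formula.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares(xs, ys)
print(round(a, 3), round(b, 3))
```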

Example of Inference about a Slope

• SSE = 3904.3 SST = 5743.3

• N = 51 Sx = 4.58

• b= 1.323

• seb= sqrt (SSE / (n-2) ) / (sx * sqrt(n-1))

= sqrt (3904.3/49) / ( 4.584*sqrt(50) )

= sqrt (79.68) / (4.584 * 7.071)

= 8.926 / 32.413

= 0.275

• t = b / seb = 1.323 / 0.275 = 4.81

• p < .001

• 95% confidence interval for b = 0.771 to 1.875

14

Confidence interval for a slope.

• Confidence interval for a slope:

c.i. = b ± t*σ̂_b

the standard t-score for a 95% confidence interval is t.025, with df = n-2

• An alternative to a confidence interval is to report both b and σ̂_b.

15

Example of Confidence Interval of a Slope

• SSE = 3904.3 SST = 5743.3

• N = 51 Sx = 4.58

• b = 1.323

• seb = 0.275

95% confidence interval for

b = 1.323 +- 2.009*0.275

= 1.323 +- 0.552

= 0.771 to 1.875

(Stata, working from unrounded values, reports 0.770 to 1.876.)
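The interval arithmetic can be checked with a short Python sketch. The helper name slope_confidence_interval is invented for illustration, and the critical value 2.009 (t.025 with df = 49) is taken as given from a t table rather than computed.

```python
def slope_confidence_interval(b, se_b, t_crit):
    """Return (lower, upper) for b +- t_crit * se_b."""
    margin = t_crit * se_b
    return b - margin, b + margin

# Rounded slope and standard error from the poverty/murder example.
lower, upper = slope_confidence_interval(b=1.323, se_b=0.275, t_crit=2.009)
print(round(lower, 3), round(upper, 3))

# For comparison: z = 1.96 in place of t gives a slightly narrower interval.
z_lower, z_upper = slope_confidence_interval(b=1.323, se_b=0.275, t_crit=1.96)
print(round(z_lower, 3), round(z_upper, 3))
```

With the t critical value the result agrees, up to rounding, with the interval in the Stata output for this regression (.7695805 to 1.876339).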

16

Inference for a slope using STATA

. regress attend regul

      Source |       SS       df       MS              Number of obs =      18
-------------+------------------------------           F(  1,    16) =    9.65
       Model |  2240.05128     1  2240.05128           Prob > F      =  0.0068
    Residual |  3715.94872    16  232.246795           R-squared     =  0.3761
-------------+------------------------------           Adj R-squared =  0.3371
       Total |        5956    17  350.352941           Root MSE      =   15.24

------------------------------------------------------------------------------
      attend |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       regul |  -5.358974    1.72555    -3.11   0.007    -9.016977   -1.700972
       _cons |   36.83761   5.395698     6.83   0.000     25.39924    48.27598
------------------------------------------------------------------------------

• The significance test and confidence interval for b appear on the line with the name of the x-variable.

• Can you find SSE and SST? df for the model? r?
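One way to work through these questions is to recompute the quantities from the Stata output above. This is only a sketch: the numbers are copied by hand from the output, and r is recovered as the square root of R-squared with the sign of the slope.

```python
import math

# Values copied from the Stata output for "regress attend regul".
model_ss, residual_ss, total_ss = 2240.05128, 3715.94872, 5956.0
coef, std_err = -5.358974, 1.72555

# SSE is the Residual SS and SST is the Total SS; Model SS is the difference.
r_squared = model_ss / total_ss                  # same as 1 - SSE/SST
r = math.copysign(math.sqrt(r_squared), coef)    # r takes the sign of the slope
t_stat = coef / std_err                          # test statistic for the slope

print(round(r_squared, 4), round(r, 3), round(t_stat, 2))
```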

17

Things to watch out for: extrapolation.

Extrapolation beyond observed values of X is dangerous.

• The pattern may be nonlinear.

• Even if the pattern is linear, the standard errors of predictions grow as X moves away from the observed values.

• Be especially careful interpreting the Y-intercept: it may lie outside the observed data.

o e.g., year zero

o e.g., zero education in the U.S.

o e.g., zero parity

19

Things to watch out for: outliers

• Influential observations and outliers may unduly influence the fit of the model.

• The slope and standard error of the slope may be affected by influential observations.

• This is an inherent weakness of least squares regression.

• You may wish to evaluate two models; one with and one without the influential observations.
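A small sketch of the "two models" comparison, using made-up data: five ordinary points plus one extreme point. The data and the helper name slope are hypothetical and only illustrate how far a single influential observation can move the least-squares slope.

```python
def slope(xs, ys):
    """Least-squares slope b for the given data."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical data: the last point is far from the rest on both axes.
xs = [1, 2, 3, 4, 5, 20]
ys = [3, 2, 4, 3, 3, 15]

b_all = slope(xs, ys)                  # model with the influential point
b_without = slope(xs[:-1], ys[:-1])    # model without it

print(round(b_all, 3), round(b_without, 3))
```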

20

Things to watch out for: truncated samples

Truncated samples cause the opposite problem from influential observations and outliers.

• Truncation on the X axis reduces the correlation coefficient for the remaining data.

• Truncation on the Y axis is a worse problem, because it violates the assumption of normally distributed errors.

• Examples: top-coded income data; health as measured by number of days spent in a hospital in a year.
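The effect of truncation on the X axis can be illustrated with simulated data. Everything here is an assumption made for the sketch: the data are random draws with a true slope of 1, and the "truncated" sample keeps only cases with X between -0.5 and 0.5.

```python
import math
import random

def correlation(xs, ys):
    """Pearson correlation r."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(19)  # fixed seed so the run is repeatable
x = [random.gauss(0, 1) for _ in range(5000)]
y = [xi + random.gauss(0, 1) for xi in x]

# Truncate on X: keep only the cases in a narrow band of X values.
kept = [(xi, yi) for xi, yi in zip(x, y) if abs(xi) < 0.5]
r_full = correlation(x, y)
r_trunc = correlation([k[0] for k in kept], [k[1] for k in kept])

print(round(r_full, 2), round(r_trunc, 2))
```

The relationship between X and Y is the same in both samples; only the range of X differs, and the correlation in the truncated sample comes out noticeably smaller.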

22

Things to watch out for: measurement error

Random error in the measurement of the X variable biases the slope b toward zero and makes the correlation appear weaker.

This problem can be a measurement issue or an interpretation issue.
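A simulation sketch of this bias, under assumptions chosen only for illustration: the true slope is 2, X has variance 1, and the measurement error added to X is random with variance 1. Under those assumptions the slope estimated from the error-laden X should come out near half the true slope.

```python
import random

def slope(xs, ys):
    """Least-squares slope b for the given data."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

random.seed(601)  # fixed seed so the run is repeatable
n = 5000
true_x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 * x + random.gauss(0, 1) for x in true_x]
noisy_x = [x + random.gauss(0, 1) for x in true_x]   # X observed with error

b_true = slope(true_x, y)     # slope using the error-free X
b_noisy = slope(noisy_x, y)   # slope using the mismeasured X

print(round(b_true, 2), round(b_noisy, 2))
```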

23