Transcript of Chapter13
1. Chapter 13 Simple Linear Regression & Correlation: Inferential Methods
2. Deterministic Models
- Consider two variables x and y. A deterministic relationship is one in which the value of y (the dependent variable) is completely determined by the value of x (the independent variable) through some formula or mathematical notation such as y = f(x), y = 3 + 2x, or y = 5e^(-2x).
3. Probabilistic Models
- A description of the relation between two variables x and y that are not deterministically related can be given by specifying a probabilistic model.
- The general form of an additive probabilistic model allows y to be larger or smaller than f(x) by a random amount, e.
- The model equation is of the form
y = deterministic function of x + random deviation = f(x) + e
4. Probabilistic Models
Deviations from the deterministic part of a probabilistic model
[Figure: a point lying below the curve f(x), with deviation e = -1.5]
5. Simple Linear Regression Model
The simple linear regression model assumes that there is a line with vertical or y intercept α and slope β, called the true or population regression line. When a value of the independent variable x is fixed and an observation on the dependent variable y is made,
y = α + βx + e
Without the random deviation e, all observed (x, y) points would fall exactly on the population regression line. The inclusion of e in the model equation allows points to deviate from the line by random amounts.
6. Simple Linear Regression Model
[Figure: population regression line (slope β, vertical intercept α), showing the observation when x = x_1 with deviation e_1 (positive deviation) and the observation when x = x_2 with deviation e_2 (positive deviation)]
7. Basic Assumptions of the Simple Linear Regression Model
- The distribution of e at any particular x value has mean value 0 (μ_e = 0).
- The standard deviation of e (which describes the spread of its distribution) is the same for any particular value of x. This standard deviation is denoted by σ.
- The distribution of e at any particular x value is normal.
- The random deviations e_1, e_2, ..., e_n associated with different observations are independent of one another.
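The model equation and the four assumptions can be illustrated with a short simulation. The sketch below is illustrative only: the values of ALPHA, BETA, and SIGMA are hypothetical and are not taken from any data set in this chapter. It draws many observations at one fixed x and checks that y centers on the height of the population line with spread close to the standard deviation of e.

```python
import random
import statistics

# Hypothetical population parameters, chosen only for illustration.
ALPHA = 3.0   # vertical intercept of the population regression line
BETA = 0.5    # slope of the population regression line
SIGMA = 2.0   # standard deviation of e, the same at every x


def simulate_y(x, rng):
    """Draw one observation y = alpha + beta*x + e with e ~ Normal(0, sigma)."""
    e = rng.gauss(0.0, SIGMA)  # each call is an independent normal deviation
    return ALPHA + BETA * x + e


rng = random.Random(13)  # fixed seed so the run is repeatable
x_fixed = 40.0
ys = [simulate_y(x_fixed, rng) for _ in range(20000)]

mean_y = statistics.mean(ys)
sd_y = statistics.stdev(ys)
print(f"height of the line at x = {x_fixed}: {ALPHA + BETA * x_fixed:.2f}")
print(f"simulated mean of y: {mean_y:.2f}")
print(f"simulated sd of y:   {sd_y:.2f} (sigma = {SIGMA})")
```

Changing x_fixed moves the center of the simulated y values along the line but leaves their spread unchanged, which is what the equal-standard-deviation assumption says.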
8. More About the Simple Linear Regression Model
(mean y value for fixed x) = α + βx and (standard deviation of y for fixed x) = σ. For any fixed x value, y itself has a normal distribution.
9. Interpretation of Terms
- The slope β of the population regression line is the mean (average) change in y associated with a 1-unit increase in x.
- The vertical intercept α is the height of the population line when x = 0.
- The size of σ determines the extent to which the (x, y) observations deviate from the population line.
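[Figure: scatter of points about the population line for small σ and for large σ]
10. Illustration of Assumptions
11. Estimates for the Regression Line
The point estimates of β, the slope, and α, the y intercept of the population regression line, are the slope and y intercept, respectively, of the least squares line. That is,
b = point estimate of β = S_xy / S_xx
a = point estimate of α = ȳ - b x̄
where S_xy = Σ(x - x̄)(y - ȳ) and S_xx = Σ(x - x̄)².
12. Interpretation of ŷ = a + bx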
- Let x* denote a specific value of the predictor variable x. Then a + bx* has two interpretations:
- a + bx* is a point estimate of the mean y value when x = x*.
- a + bx* is a point prediction of an individual y value to be observed when x = x*.
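As a rough sketch of how these estimates are computed, the code below applies the standard least squares formulas b = S_xy / S_xx and a = ȳ - b x̄ and then evaluates a + bx*. The function names and the small data set are hypothetical (the study data are not reproduced in this transcript); the last lines reuse the fitted line 3.22 + 0.548x reported in the Minitab output for the age and %Fat example.

```python
def least_squares_line(xs, ys):
    """Return (a, b): intercept and slope of the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx          # estimated slope
    a = y_bar - b * x_bar    # estimated y intercept
    return a, b


def predict(a, b, x_star):
    """Point estimate of the mean y, and point prediction of one y, at x = x_star."""
    return a + b * x_star


# Hypothetical illustrative data, NOT the data from the study.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = least_squares_line(xs, ys)
print(f"a = {a:.3f}, b = {b:.3f}")

# Fitted line from the chapter's Minitab output, evaluated at age 45.
fat_at_45 = predict(3.22, 0.548, 45)
print(f"estimated %Fat at age 45: {fat_at_45:.2f}")
```

The same number serves as both the estimated mean %Fat for all 45-year-olds and the predicted %Fat for one 45-year-old; the two uses differ in how much uncertainty is attached to them, not in the point value.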
13. Example
The following data were collected in a study of age and fatness in humans.*
* Mazess, R.B., Peppler, W.W., and Gibbons, M. (1984). Total body composition by dual-photon (153Gd) absorptiometry. American Journal of Clinical Nutrition, 40, 834-839.
One of the questions was, "What is the relationship between age and fatness?"
14. Example
15. Example
16. Example
17. Example
A point estimate for the %Fat of a human who is 45 years old is
ŷ = 3.22 + 0.548(45) = 27.88
If 45 is put into the equation for x, we have both an estimated %Fat for a 45-year-old human and an estimated average %Fat for 45-year-old humans. The two interpretations are quite different.
18. Example
A plot of the data points along with the least squares regression line, created with Minitab, is given to the right.
19. Terminology
20. Definition Formulae
The total sum of squares, denoted by SSTo, is defined as
SSTo = Σ(y - ȳ)²
The residual sum of squares, denoted by SSResid, is defined as
SSResid = Σ(y - ŷ)²
21. Calculation Formulae Recalled
SSTo and SSResid are generally found as part of the standard output from most statistical packages or can be obtained using the following computational formulas:
SSTo = Σy² - (Σy)²/n
SSResid = Σy² - aΣy - bΣxy
22. Coefficient of Determination
- The coefficient of determination, denoted by r², gives the proportion of variation in y that can be attributed to an approximate linear relationship between x and y.
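A minimal sketch of the calculation, assuming the usual definition r² = 1 - SSResid/SSTo. The helper names and the small data set are hypothetical; the final lines plug in the SSResid and SSTo values that appear in the Minitab output for the age and %Fat example.

```python
def sums_of_squares(xs, ys):
    """Return (SSTo, SSResid) for the least squares line fitted to the data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    ss_to = sum((y - y_bar) ** 2 for y in ys)                        # total variation in y
    ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))   # variation left after the fit
    return ss_to, ss_resid


def r_squared(ss_resid, ss_to):
    """Proportion of variation in y attributed to the linear relationship."""
    return 1.0 - ss_resid / ss_to


# Hypothetical illustrative data, NOT the data from the study.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
ss_to_demo, ss_resid_demo = sums_of_squares(xs, ys)
r2_demo = r_squared(ss_resid_demo, ss_to_demo)
print(f"demo data: r^2 = {r2_demo:.4f}")

# Sums of squares reported in the chapter's Minitab output.
r2_fat = r_squared(ss_resid=529.66, ss_to=1421.54)
print(f"age and %Fat example: r^2 = {r2_fat:.3f}")
```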
23. Estimated Standard Deviation, s_e
The statistic for estimating the variance σ² is
s_e² = SSResid / (n - 2)
where SSResid = Σ(y - ŷ)².
24. Estimated Standard Deviation, s_e
The estimate of σ is the estimated standard deviation
s_e = √(s_e²)
The number of degrees of freedom associated with estimating σ² or σ in simple linear regression is n - 2.
25. Example continued
26. Example continued
27. Example continued
With r² = 0.627, or 62.7%, we can say that 62.7% of the observed variation in %Fat can be attributed to the probabilistic linear relationship with human age. The magnitude of a typical sample deviation from the least squares line is about 5.75(%), which is reasonably large compared to the y values themselves. This suggests that the model is useful only in the sense of providing gross ballpark estimates of %Fat for humans based on age.
28. Properties of the Sampling Distribution of b
When the four basic assumptions of the simple linear regression model are satisfied, the following conditions are met:
- The mean value of b is β. Specifically, μ_b = β, and hence b is an unbiased statistic for estimating β.
- The statistic b has a normal distribution (a consequence of the error e being normally distributed).
- The standard deviation of the statistic b is σ_b = σ / √S_xx, where S_xx = Σ(x - x̄)².
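These properties can be checked informally by simulation. The sketch below is one possible illustration, not part of the original slides: the parameter values and x values are hypothetical, and the standard result σ_b = σ / √S_xx is assumed. It repeatedly generates samples from the model at the same x values, computes the least squares slope each time, and compares the mean and standard deviation of the simulated slopes with β and σ_b.

```python
import math
import random
import statistics

# Hypothetical population parameters and fixed x values, for illustration only.
ALPHA, BETA, SIGMA = 3.0, 0.5, 2.0
XS = [float(x) for x in range(1, 11)]   # x = 1, 2, ..., 10

x_bar = sum(XS) / len(XS)
s_xx = sum((x - x_bar) ** 2 for x in XS)


def slope(xs, ys):
    """Least squares slope b = S_xy / S_xx."""
    y_bar = sum(ys) / len(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / s_xx


rng = random.Random(2024)  # fixed seed so the run is repeatable
slopes = []
for _ in range(5000):
    # One new sample: same x values, fresh independent normal deviations.
    ys = [ALPHA + BETA * x + rng.gauss(0.0, SIGMA) for x in XS]
    slopes.append(slope(XS, ys))

mean_b = statistics.mean(slopes)
sd_b = statistics.stdev(slopes)
sigma_b = SIGMA / math.sqrt(s_xx)
print(f"mean of simulated b: {mean_b:.3f} (beta = {BETA})")
print(f"sd of simulated b:   {sd_b:.3f} (sigma / sqrt(S_xx) = {sigma_b:.3f})")
```

Spreading the x values farther apart increases S_xx and so shrinks σ_b, which is why a wider range of x values gives a more precise slope estimate.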
29. Estimated Standard Deviation of b
The estimated standard deviation of the statistic b is
s_b = s_e / √S_xx
When the four basic assumptions of the simple linear regression model are satisfied, the probability distribution of the standardized variable
t = (b - β) / s_b
is the t distribution with df = n - 2.
30. Confidence Interval for β
When the four basic assumptions of the simple linear regression model are satisfied, a confidence interval for β, the slope of the population regression line, has the form
b ± (t critical value) · s_b
where the t critical value is based on df = n - 2.
31. Example continued
Recall that b = 0.548 and s_b = 0.1056, with n = 18 and so df = 16. A 95% confidence interval estimate for β is
0.548 ± (2.12)(0.1056) = 0.548 ± 0.224 = (0.324, 0.772)
32. Example continued
Based on sample data, we are 95% confident that the true mean increase in %Fat associated with a year of age is between 0.324% and 0.772%.
33. Example continued
Minitab output looks like

Regression Analysis: % Fat y versus Age (x)

The regression equation is
% Fat y = 3.22 + 0.548 Age (x)

Predictor    Coef      SE Coef   T      P
Constant     3.221     5.076     0.63   0.535
Age (x)      0.5480    0.1056    5.19   0.000

S = 5.754    R-Sq = 62.7%    R-Sq(adj) = 60.4%

Analysis of Variance
Source            DF    SS         MS        F       P
Regression         1    891.87     891.87    26.94   0.000
Residual Error    16    529.66     33.10
Total             17    1421.54

In this output, the regression line is given first; the estimated y intercept is a = 3.221, the estimated slope is b = 0.5480, the residual df is n - 2 = 16, SSResid = 529.66, and SSTo = 1421.54.
34. Hypothesis Tests Concerning β
Null hypothesis: H_0: β = hypothesized value
Test statistic: t = (b - hypothesized value) / s_b
35. Hypothesis Tests Concerning β
- Alternative hypothesis and finding the P-value:
- H_a: β > hypothesized value
- P-value = Area under the t curve with n - 2 degrees of freedom to the right of the calculated t
- H_a: β < hypothesized value
- P-value = Area under the t curve with n - 2 degrees of freedom to the left of the calculated t
36. Hypothesis Tests Concerning β
- H_a: β ≠ hypothesized value
- If t is positive, P-value = 2 × (Area under the t curve with n - 2 degrees of freedom to the right of the calculated t)
- If t is negative, P-value = 2 × (Area under the t curve with n - 2 degrees of freedom to the left of the calculated t)
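The P-value rules for the three alternatives, together with the confidence interval b ± (t critical value) · s_b, can be sketched in code. This is an illustrative sketch rather than the method used in the slides: the function names are hypothetical, and the tail area is obtained by numerically integrating the t density with Simpson's rule so that only the standard library is needed (a statistics package would normally supply it). The inputs b = 0.5480, s_b = 0.1056, and df = 16 come from the Minitab output for the age and %Fat example, and 2.120 is the tabled t critical value for 95% confidence with 16 df.

```python
import math


def t_pdf(t, df):
    """Density of the t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def upper_tail_area(t, df, steps=2000):
    """Area under the t curve to the right of t (Simpson's rule; steps must be even)."""
    if t < 0:
        return 1.0 - upper_tail_area(-t, df, steps)
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3   # half the area lies right of 0; remove the part in [0, t]


def p_value(t, df, alternative):
    """P-value for H_a: beta > value, beta < value, or beta != value."""
    if alternative == "greater":
        return upper_tail_area(t, df)
    if alternative == "less":
        return 1.0 - upper_tail_area(t, df)
    if alternative == "two-sided":
        return 2 * upper_tail_area(abs(t), df)
    raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")


# Values from the chapter's Minitab output (age and %Fat example).
b, s_b, df = 0.5480, 0.1056, 16
hypothesized = 0.0
t_ratio = (b - hypothesized) / s_b
p_two_sided = p_value(t_ratio, df, "two-sided")

t_crit = 2.120  # tabled 95% critical value for 16 df
ci_low, ci_high = b - t_crit * s_b, b + t_crit * s_b

print(f"t = {t_ratio:.2f}, two-sided P-value = {p_two_sided:.5f}")
print(f"95% CI for beta: ({ci_low:.3f}, {ci_high:.3f})")
```

The two-sided P-value is far below 0.001, which is consistent with the 0.000 that Minitab prints for the Age coefficient.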
37. Hypothesis Tests Concerning β
Assumptions for the test:
- The distribution of e at any particular x value has mean value 0 (μ_e = 0)
- The standard deviation of e is σ, which does not depend on x
- The distribution of e at any particular x value is normal
- The random deviations e_1, e_2, ..., e_n associated with different observations are independent of one another
38. Hypothesis Tests Concerning β
Quite often the test is performed with the hypotheses
H_0: β = 0 vs. H_a: β ≠ 0
This particular form of the test is called the model utility test for simple linear regression. The null hypothesis specifies that there is no useful linear relationship between x and y, whereas the alternative hypothesis specifies that there is a useful linear relationship between x and y. The test statistic simplifies to
t = b / s_b
and is called the t ratio.
39. Example
Consider the following data on percentage unemployment and suicide rates.*
* Smith, D. (1977). Patterns in Human Geography. Canada: Douglas David and Charles Ltd., 158.
40. Example
The plot of the data points produced by Minitab follows.
41. Example
42. Example
Some basic summary statistics
43. Example
Continuing with the calculations
44. Example
Continuing with the calculations
45. Example
46. Example - Model Utility Test
- β = the true average change in suicide rate associated with an increase in the unemployment rate of 1 percentage point
- H_0: β = 0
- H a