
Transcript
Page 1: Chapter13

1 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Chapter 13

Simple Linear Regression & Correlation: Inferential Methods

Page 2: Chapter13

2 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Consider the two variables x and y. A deterministic relationship is one in which the value of y (the dependent variable) is completely determined by some formula or mathematical expression in x, such as y = f(x), y = 3 + 2x, or y = 5e^(-2x), where x is the independent variable.

Deterministic Models

Page 3: Chapter13

3 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Probabilistic Models

A description of the relation between two variables x and y that are not deterministically related can be given by specifying a probabilistic model.

The general form of an additive probabilistic model allows y to be larger or smaller than f(x) by a random amount, e.

The model equation is of the form

y = deterministic function of x + random deviation = f(x) + e

Page 4: Chapter13

4 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Probabilistic Models

Deviations from the deterministic part of a probabilistic model

[Figure: observations scattered about the curve f(x); a point with deviation e = -1.5, for example, lies 1.5 units below the curve.]

Page 5: Chapter13

5 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Simple Linear Regression Model

The simple linear regression model assumes that there is a line with vertical or y intercept α and slope β, called the true or population regression line. When a value of the independent variable x is fixed and an observation on the dependent variable y is made,

y = α + βx + e

Without the random deviation e, all observed (x, y) points would fall exactly on the population regression line. The inclusion of e in the model equation allows points to deviate from the line by random amounts.

Page 6: Chapter13

6 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Simple Linear Regression Model

[Figure: the population regression line, with vertical intercept α and slope β. Observations made at x = x1 and x = x2 deviate from the line by the random amounts e1 and e2.]

Page 7: Chapter13

7 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Basic Assumptions of the Simple Linear Regression Model

1. The distribution of e at any particular x value has mean value 0 (µe = 0).

2. The standard deviation of e (which describes the spread of its distribution) is the same for any particular value of x. This standard deviation is denoted by σ.

3. The distribution of e at any particular x value is normal.

4. The random deviations e1, e2, …, en associated with different observations are independent of one another.

Page 8: Chapter13

8 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

More About the Simple Linear Regression Model

For any fixed x value, y itself has a normal distribution, with

(mean y value for fixed x) = (height of the population regression line above x) = α + βx

and

(standard deviation of y for fixed x) = σ.

Page 9: Chapter13

9 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Interpretation of Terms

1. The slope β of the population regression line is the mean (average) change in y associated with a 1-unit increase in x.

2. The vertical intercept α is the height of the population line when x = 0.

3. The size of σ determines the extent to which the (x, y) observations deviate from the population line.

[Figure: two scatterplots about the same population line, one with small σ (points tightly clustered about the line) and one with large σ (points widely scattered).]

Page 10: Chapter13

10 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Illustration of Assumptions

Page 11: Chapter13

11 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Estimates for the Regression Line

The point estimates of β, the slope, and α, the y intercept, of the population regression line are the slope and y intercept, respectively, of the least squares line. That is,

b = S_xy / S_xx = point estimate of β

a = ȳ - b·x̄ = point estimate of α

where

S_xy = Σxy - (Σx)(Σy)/n   and   S_xx = Σx² - (Σx)²/n
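As a concrete illustration (ours, not part of the original slides), these computational formulas translate directly into a few lines of Python; the function name least_squares is just a placeholder:

```python
def least_squares(x, y):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n   # S_xy
    s_xx = sum(xi ** 2 for xi in x) - sum_x ** 2 / n                  # S_xx
    b = s_xy / s_xx                   # point estimate of the slope beta
    a = sum_y / n - b * (sum_x / n)   # point estimate of the intercept alpha
    return a, b
```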

Page 12: Chapter13

12 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Interpretation of y = a + bx

Let x* denote a specific value of the predictor variable x. Then a + bx* has two interpretations:

1. a + bx* is a point estimate of the mean y value when x = x*.

2. a + bx* is a point prediction of an individual y value to be observed when x = x*.

Page 13: Chapter13

13 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example

The following data was collected in a study of age and fatness in humans.

* Mazess, R.B., Peppler, W.W., and Gibbons, M. (1984) Total body composition by dual-photon (153Gd) absorptiometry. American Journal of Clinical Nutrition, 40, 834-839

One of the questions was, “What is the relationship between age and fatness?”

Age     23    23    27    27    39    41    45    49    50
% Fat   9.5   27.9  7.8   17.8  31.4  25.9  27.4  25.2  31.1

Age     53    53    54    56    57    58    58    60    61
% Fat   34.7  42.0  29.1  32.5  30.3  33.0  33.8  41.1  34.5

Page 14: Chapter13

14 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example

Age (x)   % Fat (y)    x²       xy
   23        9.5       529     218.5
   23       27.9       529     641.7
   27        7.8       729     210.6
   27       17.8       729     480.6
   39       31.4      1521    1224.6
   41       25.9      1681    1061.9
   45       27.4      2025    1233.0
   49       25.2      2401    1234.8
   50       31.1      2500    1555.0
   53       34.7      2809    1839.1
   53       42.0      2809    2226.0
   54       29.1      2916    1571.4
   56       32.5      3136    1820.0
   57       30.3      3249    1727.1
   58       33.0      3364    1914.0
   58       33.8      3364    1960.4
   60       41.1      3600    2466.0
   61       34.5      3721    2104.5

Totals: n = 18,  Σx = 834,  Σy = 515,  Σx² = 41612,  Σxy = 25489.2

Page 15: Chapter13

15 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example

n = 18,  Σx = 834,  Σy = 515,  Σx² = 41612,  Σxy = 25489.2

S_xx = Σx² - (Σx)²/n = 41612 - (834)²/18 = 2970

S_xy = Σxy - (Σx)(Σy)/n = 25489.2 - (834)(515)/18 = 1627.53

Page 16: Chapter13

16 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example

b = S_xy / S_xx = 1627.53 / 2970 = 0.54799

a = ȳ - b·x̄ = 515/18 - 0.54799(834/18) = 3.2209

ŷ = 3.22 + 0.548x
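A quick numerical check (ours, not from the slides) reproduces these estimates from the raw data using only built-in Python functions:

```python
# Age and %Fat data from the Mazess et al. (1984) study, as listed on the slides.
age = [23, 23, 27, 27, 39, 41, 45, 49, 50,
       53, 53, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 27.9, 7.8, 17.8, 31.4, 25.9, 27.4, 25.2, 31.1,
       34.7, 42.0, 29.1, 32.5, 30.3, 33.0, 33.8, 41.1, 34.5]

n = len(age)                                                           # 18
s_xx = sum(x * x for x in age) - sum(age) ** 2 / n                     # 2970.0
s_xy = sum(x * y for x, y in zip(age, fat)) - sum(age) * sum(fat) / n  # about 1627.53
b = s_xy / s_xx                                                        # about 0.548
a = sum(fat) / n - b * sum(age) / n                                    # about 3.22
print(f"y-hat = {a:.2f} + {b:.3f} x")
```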

Page 17: Chapter13

17 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example

A point estimate for the %Fat of a 45-year-old human comes from the least squares line ŷ = 3.22 + 0.548x:

a + bx = 3.22 + 0.548(45) = 27.9%

When 45 is put into the equation for x, the result can be read either as an estimated %Fat for an individual 45-year-old human or as an estimated average %Fat for 45-year-old humans.

The two interpretations are quite different.

Page 18: Chapter13

18 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example

A plot of the data points along with the least squares regression line created with Minitab is shown below.

[Regression Plot: % Fat y = 3.22086 + 0.547991 Age (x); S = 5.75361, R-Sq = 62.7%, R-Sq(adj) = 60.4%]

Page 19: Chapter13

19 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Terminology

The predicted or fitted values result from substituting each sample x value into the equation for the least squares line. This gives

ŷ₁ = a + bx₁ = 1st predicted value
ŷ₂ = a + bx₂ = 2nd predicted value
...
ŷₙ = a + bxₙ = nth predicted value

The residuals for the least squares line are the values

y₁ - ŷ₁,  y₂ - ŷ₂,  ...,  yₙ - ŷₙ
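As a small sketch (ours), the fitted values and residuals can be computed in one pass once a and b are known; the function name is illustrative:

```python
def fitted_and_residuals(x, y, a, b):
    """Fitted values y-hat_i = a + b*x_i and residuals y_i - y-hat_i."""
    fitted = [a + b * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    return fitted, residuals
```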

Page 20: Chapter13

20 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Definition formulae

The total sum of squares, denoted by SSTo, is defined as

SSTo = (y₁ - ȳ)² + (y₂ - ȳ)² + ... + (yₙ - ȳ)² = Σ(y - ȳ)²

The residual sum of squares, denoted by SSResid, is defined as

SSResid = (y₁ - ŷ₁)² + (y₂ - ŷ₂)² + ... + (yₙ - ŷₙ)² = Σ(y - ŷ)²

Page 21: Chapter13

21 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Calculation Formulae Recalled

SSTo and SSResid are generally found as part of the standard output from most statistical packages or can be obtained using the following computational formulas:

SSTo = Σ(y - ȳ)² = Σy² - (Σy)²/n

SSResid = Σ(y - ŷ)² = Σy² - aΣy - bΣxy
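A short sketch (ours) of these computational formulas in Python; the names ss_to and ss_resid are ours:

```python
def sums_of_squares(x, y, a, b):
    """SSTo and SSResid via the computational formulas."""
    n = len(y)
    sum_y_sq = sum(yi ** 2 for yi in y)
    ss_to = sum_y_sq - sum(y) ** 2 / n
    ss_resid = (sum_y_sq
                - a * sum(y)
                - b * sum(xi * yi for xi, yi in zip(x, y)))
    return ss_to, ss_resid
```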

Page 22: Chapter13

22 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Coefficient of Determination

The coefficient of determination, denoted by r2, gives the proportion of variation in y that can be attributed to an approximate linear relationship between x and y.

The coefficient of determination, r², can be computed as

r² = 1 - SSResid/SSTo

Page 23: Chapter13

23 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Estimated Standard Deviation, se

The statistic for estimating the variance σ² is

s_e² = SSResid/(n - 2)

where

SSResid = Σ(y - ŷ)² = Σy² - aΣy - bΣxy

The subscript e in s_e² is a reminder that we are estimating the variance of the "errors" or residuals.

Page 24: Chapter13

24 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Estimated Standard Deviation, se

The estimate of σ is the estimated standard deviation

s_e = √(s_e²)

The number of degrees of freedom associated with estimating σ² or σ in simple linear regression is n - 2.
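Tying the last few slides together, a minimal sketch (ours) that computes r² and s_e from the sums of squares:

```python
import math

def r_squared_and_se(ss_to, ss_resid, n):
    """Coefficient of determination and estimated standard deviation s_e."""
    r_sq = 1 - ss_resid / ss_to            # proportion of y variation explained
    s_e = math.sqrt(ss_resid / (n - 2))    # df = n - 2 in simple linear regression
    return r_sq, s_e
```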

Page 25: Chapter13

25 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example continued

Age (x)   % Fat (y)     y²     Predicted ŷ   Residual y - ŷ   (y - ŷ)²
   23        9.5        90.3      15.82          -6.32          40.00
   23       27.9       778.4      15.82          12.08         145.81
   27        7.8        60.8      18.02         -10.22         104.38
   27       17.8       316.8      18.02          -0.22           0.05
   39       31.4       986.0      24.59           6.81          46.34
   41       25.9       670.8      25.69           0.21           0.04
   45       27.4       750.8      27.88          -0.48           0.23
   49       25.2       635.0      30.07          -4.87          23.74
   50       31.1       967.2      30.62           0.48           0.23
   53       34.7      1204.1      32.26           2.44           5.93
   53       42.0      1764.0      32.26           9.74          94.78
   54       29.1       846.8      32.81          -3.71          13.78
   56       32.5      1056.3      33.91          -1.41           1.98
   57       30.3       918.1      34.46          -4.16          17.27
   58       33.0      1089.0      35.00          -2.00           4.02
   58       33.8      1142.4      35.00          -1.20           1.45
   60       41.1      1689.2      36.10           5.00          25.00
   61       34.5      1190.3      36.65          -2.15           4.62
Totals:    834       515.0     16156.3                         529.66

SSResid = Σ(y - ŷ)² = 529.66

s_e² = SSResid/(n - 2) = 529.66/(18 - 2) = 33.104

s_e = √(s_e²) = √33.104 = 5.754

Page 26: Chapter13

26 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example continued

n = 18,  Σy = 515.0,  Σy² = 16156.3,  Σxy = 25489.2,  a = 3.2209,  b = 0.54799

SSTo = Σ(y - ȳ)² = Σy² - (Σy)²/n = 16156.3 - (515.0)²/18 = 1421.5

r² = 1 - SSResid/SSTo = 1 - 529.66/1421.5 = 1 - 0.373 = 0.627

Page 27: Chapter13

27 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example continued

With r2 = 0.627 or 62.7%, we can say that 62.7% of the observed variation in %Fat can be attributed to the probabilistic linear relationship with human age.

The magnitude of a typical sample deviation from the least squares line is about 5.75 (%), which is reasonably large compared to the y values themselves.

This would suggest that the model is only useful in the sense of providing gross "ballpark" estimates of %Fat for humans based on age.

Page 28: Chapter13

28 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Properties of the Sampling Distribution of b

When the four basic assumptions of the simple linear regression model are satisfied, the following conditions are met:

1. The mean value of b is β. Specifically, μ_b = β, and hence b is an unbiased statistic for estimating β.

2. The standard deviation of the statistic b is σ_b = σ/√S_xx.

3. The statistic b has a normal distribution (a consequence of the error e being normally distributed).

Page 29: Chapter13

29 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Estimated Standard Deviation of b

The estimated standard deviation of the statistic b is

s_b = s_e/√S_xx

When the four basic assumptions of the simple linear regression model are satisfied, the probability distribution of the standardized variable

t = (b - β)/s_b

is the t distribution with df = n - 2.

Page 30: Chapter13

30 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Confidence Interval for β

When the four basic assumptions of the simple linear regression model are satisfied, a confidence interval for β, the slope of the population regression line, has the form

b ± (t critical value)·s_b

where the t critical value is based on df = n - 2.
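A sketch (ours) of this interval in Python; it assumes scipy is available for the t critical value, and the function name is illustrative:

```python
import math
from scipy import stats

def slope_confidence_interval(b, s_e, s_xx, n, conf_level=0.95):
    """Confidence interval for the slope beta, df = n - 2."""
    s_b = s_e / math.sqrt(s_xx)                           # estimated SD of b
    t_crit = stats.t.ppf((1 + conf_level) / 2, df=n - 2)  # two-sided critical value
    return b - t_crit * s_b, b + t_crit * s_b

# Age/%Fat example: b = 0.5480, s_e = 5.754, S_xx = 2970, n = 18
print(slope_confidence_interval(0.5480, 5.754, 2970, 18))  # roughly (0.324, 0.772)
```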

Page 31: Chapter13

31 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example continued

Recall: n = 18,  Σx = 834,  Σy = 515,  Σx² = 41612,  Σxy = 25489.2,  Σy² = 16156.3,  b = 0.54799,  a = 3.2209,  s_e = 5.754

s_b = s_e/√S_xx = 5.754/√2970 = 0.1056

A 95% confidence interval estimate for β is

b ± t·s_b = 0.5480 ± (2.12)(0.1056) = 0.5480 ± 0.2238

Page 32: Chapter13

32 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example continued

Based on sample data, we are 95% confident that the true mean increase in %Fat associated with a year of age is between 0.324% and 0.772%.

A 95% confidence interval estimate for β is

b ± t·s_b = 0.5480 ± 2.12(0.1056) = 0.5480 ± 0.2238 = (0.324, 0.772)

Page 33: Chapter13

33 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example continued

The Minitab output looks like this:

Regression Analysis: % Fat y versus Age (x)

The regression equation is
% Fat y = 3.22 + 0.548 Age (x)

Predictor      Coef   SE Coef      T      P
Constant      3.221     5.076   0.63  0.535
Age (x)      0.5480    0.1056   5.19  0.000

S = 5.754   R-Sq = 62.7%   R-Sq(adj) = 60.4%

Analysis of Variance

Source           DF       SS      MS      F      P
Regression        1   891.87  891.87  26.94  0.000
Residual Error   16   529.66   33.10
Total            17  1421.54

In this output: the regression equation at the top is the least squares line; the Coef column gives the estimated y intercept a and the estimated slope b; S is s_e and the Residual Error MS is s_e²; the Residual Error DF is n - 2; and the SS column gives SSResid (Residual Error) and SSTo (Total).

Page 34: Chapter13

34 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Hypothesis Tests Concerning β

Null hypothesis: H₀: β = hypothesized value

Test statistic:

t = (b - hypothesized value)/s_b

The test is based on df = n - 2.

Page 35: Chapter13

35 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Hypothesis Tests Concerning β

Alternate hypothesis and finding the P-value:

1. Hₐ: β > hypothesized value
P-value = area under the t curve with n - 2 degrees of freedom to the right of the calculated t

2. Hₐ: β < hypothesized value
P-value = area under the t curve with n - 2 degrees of freedom to the left of the calculated t

Page 36: Chapter13

36 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Hypothesis Tests Concerning β

3. Hₐ: β ≠ hypothesized value

a) If t is positive, P-value = 2·(area under the t curve with n - 2 degrees of freedom to the right of the calculated t)

b) If t is negative, P-value = 2·(area under the t curve with n - 2 degrees of freedom to the left of the calculated t)

Page 37: Chapter13

37 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Hypothesis Tests Concerning β

Assumptions:

1. The distribution of e at any particular x value has mean value 0 (μ_e = 0)

2. The standard deviation of e is σ, which does not depend on x

3. The distribution of e at any particular x value is normal

4. The random deviations e₁, e₂, …, eₙ associated with different observations are independent of one another

Page 38: Chapter13

38 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Hypothesis Tests Concerning β

Quite often the test is performed with the hypotheses

H₀: β = 0  vs.  Hₐ: β ≠ 0

This particular form of the test is called the model utility test for simple linear regression.

The test statistic simplifies to t = b/s_b and is called the t ratio.

The null hypothesis specifies that there is no useful linear relationship between x and y, whereas the alternative hypothesis specifies that there is a useful linear relationship between x and y.
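A sketch (ours) of the model utility test computation; scipy is assumed to be available for the tail area, and the function name is illustrative:

```python
import math
from scipy import stats

def model_utility_test(b, s_e, s_xx, n):
    """t ratio and two-sided P-value for H0: beta = 0."""
    s_b = s_e / math.sqrt(s_xx)                  # estimated SD of b
    t = b / s_b                                  # t ratio
    p_value = 2 * stats.t.sf(abs(t), df=n - 2)   # two-tailed P-value
    return t, p_value
```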

Page 39: Chapter13

39 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example

Consider the following data on percentage unemployment and suicide rates.

* Smith, D. (1977) Patterns in Human Geography, Canada: Douglas David and Charles Ltd., 158.

City            Percentage Unemployed   Suicide Rate
New York                3.0                  72
Los Angeles             4.7                 224
Chicago                 3.0                  82
Philadelphia            3.2                  92
Detroit                 3.8                 104
Boston                  2.5                  71
San Francisco           4.8                 235
Washington              2.7                  81
Pittsburgh              4.4                  86
St. Louis               3.1                 102
Cleveland               3.5                 104

Page 40: Chapter13

40 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example

The plot of the data points produced by Minitab follows

Page 41: Chapter13

41 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example

City            Percentage Unemployed (x)   Suicide Rate (y)     x²       xy       y²
New York                 3.0                       72           9.00    216.0    5184
Los Angeles              4.7                      224          22.09   1052.8   50176
Chicago                  3.0                       82           9.00    246.0    6724
Philadelphia             3.2                       92          10.24    294.4    8464
Detroit                  3.8                      104          14.44    395.2   10816
Boston                   2.5                       71           6.25    177.5    5041
San Francisco            4.8                      235          23.04   1128.0   55225
Washington               2.7                       81           7.29    218.7    6561
Pittsburgh               4.4                       86          19.36    378.4    7396
St. Louis                3.1                      102           9.61    316.2   10404
Cleveland                3.5                      104          12.25    364.0   10816
Totals                  38.7                     1253         142.57   4787.2  176807

Page 42: Chapter13

42 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example

Some basic summary statistics:

n = 11,  Σx = 38.7,  Σx² = 142.57,  Σy = 1253,  Σy² = 176807,  Σxy = 4787.2

S_xy = Σxy - (Σx)(Σy)/n = 4787.2 - (38.7)(1253)/11 = 378.92

S_xx = Σx² - (Σx)²/n = 142.57 - (38.7)²/11 = 6.4164

Page 43: Chapter13

43 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example

Continuing with the calculations:

b = S_xy / S_xx = 378.92 / 6.4164 = 59.06

a = ȳ - b·x̄ = 1253/11 - 59.06(38.7/11) = -93.86

ŷ = -93.86 + 59.06x

Page 44: Chapter13

44 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example

Continuing with the calculations:

SSResid = Σ(y - ŷ)² = Σy² - aΣy - bΣxy = 176807 - (-93.857)(1253) - 59.055(4787.2) = 11701.9

SSTo = S_yy = Σ(y - ȳ)² = Σy² - (Σy)²/n = 176807 - (1253)²/11 = 34078.9

Page 45: Chapter13

45 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example

r² = 1 - SSResid/SSTo = 1 - 11701.9/34078.9 = 1 - 0.343 = 0.657

s_e = √(SSResid/(n - 2)) = √(11701.9/9) = 36.06

Page 46: Chapter13

46 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example - Model Utility Test

1. β = the true average change in suicide rate associated with an increase in the unemployment rate of 1 percentage point

2. H₀: β = 0

3. Hₐ: β ≠ 0

4. α has not been preselected. We shall interpret the observed level of significance (P-value).

5. Test statistic:

t = (b - hypothesized value)/s_b = (b - 0)/s_b = b/s_b

Page 47: Chapter13

47 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example - Model Utility Test

6. Assumptions: The following plot (Minitab) of the data shows a linear pattern, and the variability of the points does not appear to be changing with x. Assuming that the distribution of errors (residuals) at any given x value is approximately normal, the assumptions of the simple linear regression model are appropriate.

Page 48: Chapter13

48 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example - Model Utility Test

7. Calculation:

s_b = s_e/√S_xx = 36.06/√6.4164 = 14.24

t = b/s_b = 59.06/14.24 = 4.15

8. P-value: The table of tail areas for t distributions only goes up to t = 4.0, so we can see that the corresponding tail area is < 0.002. Since this is a two-tailed test, the P-value < 0.004. (Actual calculation gives a P-value = 0.002.)

Page 49: Chapter13

49 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example - Model Utility Test

9. Conclusion:

Even though no specific significance level was chosen for the test, with the P-value being so small (< 0.004) one would generally reject the null hypothesis that β = 0 and conclude that there is a useful linear relationship between the percentage unemployed and the suicide rate.

Page 50: Chapter13

50 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example - Minitab Output

Regression Analysis: Suicide Rate (y) versus Percentage Unemployed (x)

The regression equation is
Suicide Rate (y) = - 93.9 + 59.1 Percentage Unemployed (x)

Predictor      Coef   SE Coef      T      P
Constant     -93.86     51.25  -1.83  0.100
Percenta      59.05     14.24   4.15  0.002

S = 36.06   R-Sq = 65.7%   R-Sq(adj) = 61.8%

The T value (4.15) and P-value (0.002) in the Percenta row are for the model utility test of H₀: β = 0 versus Hₐ: β ≠ 0.

Page 51: Chapter13

51 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Residual Analysis

The simple linear regression model equation is y = α + βx + e, where e represents the random deviation of an observed y value from the population regression line α + βx.

Key assumptions about e

1. At any particular x value, the distribution of e is a normal distribution

2. At any particular x value, the standard deviation of e is σ, which is constant over all values of x.

Page 52: Chapter13

52 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Residual Analysis

To check on these assumptions, one would examine the deviations e1, e2, …, en.

Generally, the deviations are not known, so we check on the assumptions by looking at the residuals which are the deviations from the estimated line, a + bx.

The residuals are given by

y₁ - ŷ₁ = y₁ - (a + bx₁)
y₂ - ŷ₂ = y₂ - (a + bx₂)
...
yₙ - ŷₙ = yₙ - (a + bxₙ)

Page 53: Chapter13

53 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Standardized Residuals

Recall: A quantity is standardized by subtracting its mean value and then dividing by its true (or estimated) standard deviation.

For the residuals, the true mean is zero (0) if the assumptions are true.

The estimated standard deviation of a residual depends on the x value. The estimated standard deviation of the ith residual, yᵢ - ŷᵢ, is given by

s_e·√(1 - 1/n - (xᵢ - x̄)²/S_xx)

The ith standardized residual is the residual yᵢ - ŷᵢ divided by this estimated standard deviation.
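A sketch (ours) of this calculation; the function name standardized_residuals is illustrative:

```python
import math

def standardized_residuals(x, y, a, b, s_e):
    """Each residual divided by its estimated standard deviation."""
    n = len(x)
    x_bar = sum(x) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    out = []
    for xi, yi in zip(x, y):
        resid = yi - (a + b * xi)
        sd = s_e * math.sqrt(1 - 1 / n - (xi - x_bar) ** 2 / s_xx)
        out.append(resid / sd)
    return out
```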

Page 54: Chapter13

54 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Standardized Residuals

As you can see from the formula for the estimated standard deviation, computing the standardized residuals by hand is a bit of a calculational nightmare.

Fortunately, most statistical software packages are set up to perform these calculations and do so quite proficiently.

Page 55: Chapter13

55 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Standardized Residuals - Example

Consider the data on percentage unemployment and suicide rates

Notice that the standardized residual for Pittsburgh is -2.50, somewhat large for this size data set.

City            Percentage Unemployed   Suicide Rate (y)      ŷ       y - ŷ    Standardized Residual
New York                3.0                    72            83.31   -11.31         -0.34
Los Angeles             4.7                   224           183.70    40.30          1.34
Chicago                 3.0                    82            83.31    -1.31         -0.04
Philadelphia            3.2                    92            95.12    -3.12         -0.09
Detroit                 3.8                   104           130.55   -26.55         -0.78
Boston                  2.5                    71            53.78    17.22          0.55
San Francisco           4.8                   235           189.61    45.39          1.56
Washington              2.7                    81            65.59    15.41          0.48
Pittsburgh              4.4                    86           165.99   -79.98         -2.50
St. Louis               3.1                   102            89.21    12.79          0.38
Cleveland               3.5                   104           112.84    -8.84         -0.26

Page 56: Chapter13

56 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example

[Scatterplot with the Pittsburgh point flagged: this point has an unusually large residual.]

Page 57: Chapter13

57 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Normal Plots

[Two Minitab normal probability plots of the residuals (response is Suicide): one for the raw residuals and one for the standardized residuals, each with normal score on the vertical axis.]

Notice that both of the normal plots look similar. If a software package is available to do the calculation and plots, it is preferable to look at the normal plot of the standardized residuals.

In both cases, the points look reasonably linear, with the possible exception of Pittsburgh, so the assumption that the errors are normally distributed seems to be supported by the sample data.

Page 58: Chapter13

58 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

More Comments

The fact that Pittsburgh has a large standardized residual makes it worthwhile to look at that city carefully to make sure the figures were reported correctly. One might also look to see if there are some reasons that Pittsburgh should be looked at separately because some other characteristic distinguishes it from all of the other cities.

Pittsburgh does have a large effect on the model.

Page 59: Chapter13

59 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Visual Interpretation of Standardized Residuals

[Plot: standardized residuals versus x (response is y), with the points scattered randomly in a horizontal band.]

This plot is an example of a satisfactory plot that indicates that the model assumptions are reasonable.

Page 60: Chapter13

60 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Visual Interpretation of Standardized Residuals

[Plot: standardized residuals versus x (response is y), showing a curved pattern.]

This plot suggests that a curvilinear regression model is needed.

Page 61: Chapter13

61 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Visual Interpretation of Standardized Residuals

[Plot: standardized residuals versus x (response is y), with the spread of the residuals changing as x changes.]

This plot suggests a non-constant variance. The assumptions of the model are not correct.

Page 62: Chapter13

62 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Visual Interpretation of Standardized Residuals

[Plot: standardized residuals versus x (response is y), with one point lying well outside the band formed by the others.]

This plot shows a data point with a large standardized residual.

Page 63: Chapter13

63 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Visual Interpretation of Standardized Residuals

[Plot: standardized residuals versus x (response is y), with one point whose x value is far from the rest.]

This plot shows a potentially influential observation.

Page 64: Chapter13

64 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example - % Unemployment vs. Suicide Rate

This plot of the residuals (errors) indicates some possible problems with this linear model. You can see a pattern to the points.

[Residual plot callouts: a generally decreasing pattern to the points; one unusually large residual; and two points that are quite influential since they are far away from the others in terms of the percentage unemployed.]

Page 65: Chapter13

65 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Properties of the Sampling Distribution of a + bx for a Fixed x Value

Let x* denote a particular value of the independent variable x. When the four basic assumptions of the simple linear regression model are satisfied, the sampling distribution of the statistic a + bx* has the following properties:

1. The mean value of a + bx* is α + βx*, so a + bx* is an unbiased statistic for estimating the average y value when x = x*.

Page 66: Chapter13

66 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Properties of the Sampling Distribution of a + bx for a Fixed x Value

2. The standard deviation of the statistic a + bx*, denoted by σ_{a+bx*}, is given by

σ_{a+bx*} = σ·√(1/n + (x* - x̄)²/S_xx)

3. The distribution of the statistic a + bx* is normal.

Page 67: Chapter13

67 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Additional Information about the Sampling Distribution of a + bx for a Fixed x Value

The estimated standard deviation of the statistic a + bx*, denoted by s_{a+bx*}, is given by

s_{a+bx*} = s_e·√(1/n + (x* - x̄)²/S_xx)

When the four basic assumptions of the simple linear regression model are satisfied, the probability distribution of the standardized variable

t = (a + bx* - (α + βx*)) / s_{a+bx*}

is the t distribution with df = n - 2.

Page 68: Chapter13

68 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Confidence Interval for a Mean y Value

When the four basic assumptions of the simple linear regression model are met, a confidence interval for α + βx*, the average y value when x has the value x*, is

a + bx* ± (t critical value)·s_{a+bx*}

where the t critical value is based on df = n - 2.

Many authors give the following equivalent form for the confidence interval:

a + bx* ± (t critical value)·s_e·√(1/n + (x* - x̄)²/S_xx)

Page 69: Chapter13

69 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Confidence Interval for a Single y Value

When the four basic assumptions of the simple linear regression model are met, a prediction interval for y*, a single y observation made when x has the value x*, has the form

a + bx* ± (t critical value)·√(s_e² + s_{a+bx*}²)

where the t critical value is based on df = n - 2.

Many authors give the following equivalent form for the prediction interval:

a + bx* ± (t critical value)·s_e·√(1 + 1/n + (x* - x̄)²/S_xx)
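A sketch (ours) of both intervals at a given x*; scipy is assumed to be available and the function name is illustrative:

```python
import math
from scipy import stats

def intervals_at(x_star, a, b, s_e, n, x_bar, s_xx, conf_level=0.95):
    """Confidence interval for the mean y value and prediction interval
    for a single y value when x = x_star (df = n - 2)."""
    fit = a + b * x_star
    t_crit = stats.t.ppf((1 + conf_level) / 2, df=n - 2)
    s_fit = s_e * math.sqrt(1 / n + (x_star - x_bar) ** 2 / s_xx)  # SD of a + b*x_star
    ci = (fit - t_crit * s_fit, fit + t_crit * s_fit)              # mean y value
    s_pred = math.sqrt(s_e ** 2 + s_fit ** 2)                      # extra s_e^2 for a new y
    pi = (fit - t_crit * s_pred, fit + t_crit * s_pred)            # single y value
    return ci, pi
```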

Page 70: Chapter13

70 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example - Mean Annual Temperature vs. Mortality

Data was collected in certain regions of Great Britain, Norway and Sweden to study the relationship between the mean annual temperature and the mortality rate for a specific type of breast cancer in women.

* Lea, A.J. (1965) New Observations on distribution of neoplasms of female breast in certain European countries. British Medical Journal, 1, 488-490

Mean Annual Temperature (°F)   51.3   49.9   50.0   49.2   48.5   47.8   47.3   45.1
Mortality Index               102.5  104.5  100.4   95.9   87.0   95.0   88.6   89.2

Mean Annual Temperature (°F)   46.3   42.1   44.2   43.5   42.3   40.2   31.8   34.0
Mortality Index                78.9   84.6   81.7   72.2   65.1   68.1   67.3   52.5

Page 71: Chapter13

71 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example - Mean Annual Temperature vs. Mortality

Regression Analysis: Mortality index versus Mean annual temperature

The regression equation is
Mortality index = - 21.8 + 2.36 Mean annual temperature

Predictor      Coef   SE Coef      T      P
Constant     -21.79     15.67  -1.39  0.186
Mean ann     2.3577    0.3489   6.76  0.000

S = 7.545   R-Sq = 76.5%   R-Sq(adj) = 74.9%

Analysis of Variance

Source           DF      SS      MS      F      P
Regression        1  2599.5  2599.5  45.67  0.000
Residual Error   14   796.9    56.9
Total            15  3396.4

Unusual Observations
Obs  Mean ann  Mortalit    Fit  SE Fit  Residual  St Resid
 15      31.8     67.30  53.18    4.85     14.12      2.44RX

R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.

Page 72: Chapter13

72 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example - Mean Annual Temperature vs. Mortality

[Regression Plot: Mortality in = -21.7947 + 2.35769 Mean annual; S = 7.54466, R-Sq = 76.5%, R-Sq(adj) = 74.9%]

The flagged point has a large standardized residual and is influential because of its low Mean Annual Temperature.

Page 73: Chapter13

73 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example - Mean Annual Temperature vs. Mortality

Predicted Values for New Observations

New Obs    Fit  SE Fit      95.0% CI            95.0% PI
      1  53.18    4.85  ( 42.79,  63.57)  ( 33.95,  72.41) X
      2  60.72    3.84  ( 52.48,  68.96)  ( 42.57,  78.88)
      3  72.51    2.48  ( 67.20,  77.82)  ( 55.48,  89.54)
      4  83.34    1.89  ( 79.30,  87.39)  ( 66.66, 100.02)
      5  96.09    2.67  ( 90.37, 101.81)  ( 78.93, 113.25)
      6  99.16    3.01  ( 92.71, 105.60)  ( 81.74, 116.57)

X denotes a row with X values away from the center

Values of Predictors for New Observations

New Obs  Mean ann
      1      31.8
      2      35.0
      3      40.0
      4      44.6
      5      50.0
      6      51.3

These are the x* values for which the fits, standard errors of the fits, 95% confidence intervals for the mean y value, and 95% prediction intervals for single y values are given above.

Page 74: Chapter13

74 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example - Mean Annual Temperature vs. Mortality

[Regression Plot with 95% CI and 95% PI bands: Mortality in = -21.7947 + 2.35769 Mean annual; S = 7.54466, R-Sq = 76.5%, R-Sq(adj) = 74.9%]

95% prediction interval for a single y value at x = 45: (67.62, 100.98)

95% confidence interval for the mean y value at x = 40: (67.20, 77.82)

Page 75: Chapter13

75 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

A Test for Independence in a Bivariate Normal Population

Null hypothesis: H₀: ρ = 0

Assumption: r is the correlation coefficient for a random sample from a bivariate normal population.

Test statistic:

t = r / √((1 - r²)/(n - 2))

The t critical value is based on df = n - 2.
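A sketch (ours) of the test statistic and a two-sided P-value; scipy is assumed to be available:

```python
import math
from scipy import stats

def correlation_t_test(r, n):
    """t statistic and two-sided P-value for H0: rho = 0 (df = n - 2)."""
    t = r / math.sqrt((1 - r ** 2) / (n - 2))
    p_two_sided = 2 * stats.t.sf(abs(t), df=n - 2)
    return t, p_two_sided

# %Fat vs. age example: r = 0.79209, n = 18 gives t of about 5.2
print(correlation_t_test(0.79209, 18))
```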

Page 76: Chapter13

76 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

A Test for Independence in a Bivariate Normal Population

Alternate hypothesis: Hₐ: ρ > 0 (positive dependence): P-value is the area under the appropriate t curve to the right of the computed t.

Alternate hypothesis: Hₐ: ρ < 0 (negative dependence): P-value is the area under the appropriate t curve to the left of the computed t.

Alternate hypothesis: Hₐ: ρ ≠ 0 (dependence): P-value is

i. twice the area under the appropriate t curve to the left of the computed t value if t < 0, and

ii. twice the area under the appropriate t curve to the right of the computed t value if t > 0.

Page 77: Chapter13

77 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example

Recall the data from the study of %Fat vs. Age for humans.

There are 18 data points, and a quick calculation of the Pearson correlation coefficient gives r = 0.79209.

We will test to see if there is a dependence at the 0.05 significance level.

Age (x)   % Fat (y)    x²       xy
   23        9.5       529     218.5
   23       27.9       529     641.7
   27        7.8       729     210.6
   27       17.8       729     480.6
   39       31.4      1521    1224.6
   41       25.9      1681    1061.9
   45       27.4      2025    1233.0
   49       25.2      2401    1234.8
   50       31.1      2500    1555.0
   53       34.7      2809    1839.1
   53       42.0      2809    2226.0
   54       29.1      2916    1571.4
   56       32.5      3136    1820.0
   57       30.3      3249    1727.1
   58       33.0      3364    1914.0
   58       33.8      3364    1960.4
   60       41.1      3600    2466.0
   61       34.5      3721    2104.5

Page 78: Chapter13

78 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example

1. ρ = the correlation between % fat and age in the population from which the sample was selected

2. H₀: ρ = 0

3. Hₐ: ρ ≠ 0

4. α = 0.05

5. Test statistic:

t = r / √((1 - r²)/(n - 2)),  df = n - 2

Page 79: Chapter13

79 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example

6. Looking at the two normal plots, we can see that it is not reasonable to assume that either the distribution of age or the distribution of % fat is normal. (Notice that the data points deviate from a linear pattern quite substantially.) Since neither is normal, we shall not continue with the test.

[Normal probability plot of Age (x): Anderson-Darling normality test, A-Squared = 0.980, P-Value = 0.011; N = 18, Average = 46.3333, StDev = 13.2176]

[Normal probability plot of % Fat y: Anderson-Darling normality test, A-Squared = 0.796, P-Value = 0.032; N = 18, Average = 28.6111, StDev = 9.14439]

Page 80: Chapter13

80 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Another Example - Height vs. Joint Length

The professor in an elementary statistics class wanted to explain correlation, so he needed some bivariate data. He asked his class (presumably a random or representative sample of late-adolescent humans) to measure the length of the metacarpal bone on the index finger of the right hand (in cm) and height (in inches). The data are provided on the next slide.

Page 81: Chapter13

81 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example - Height vs. Joint Length

There are 17 data points, and a quick calculation of the Pearson correlation coefficient gives r = 0.74908.

We will test to see if the true population correlation coefficient is positive at the 0.05 level of significance.

Joint length   3.5   3.4   3.4   2.7   3.5   3.5   4.2   4.0   3.0
Height          64  68.5    69    64    68    73    72    75    70

Joint length   3.4   2.9   3.5   3.5   2.8   4.0   3.8   3.3
Height        68.5    65    67    70    65    75    70    66

Page 82: Chapter13

82 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example - Height vs. Joint Length

1. ρ = the true correlation between height and right index finger metacarpal bone length in the population from which the sample was selected

2. H₀: ρ = 0

3. Hₐ: ρ > 0

4. α = 0.05

5. Test statistic:

t = r / √((1 - r²)/(n - 2)),  df = n - 2

Page 83: Chapter13

83 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example - Height vs. Joint Length

6. Looking at the two normal plots, we can see that it is reasonable to assume that the distribution of height and the distribution of joint length are both normal. (Notice that the data points follow a reasonably linear pattern.) This appears to confirm the assumption that the sample is from a bivariate normal distribution. We will assume that the class was a random sample of young adults.

[Normal probability plot of Height: Anderson-Darling normality test, A-Squared = 0.294, P-Value = 0.557; N = 17, Average = 68.8235, StDev = 3.49974]

[Normal probability plot of Joint length: Anderson-Darling normality test, A-Squared = 0.524, P-Value = 0.156; N = 17, Average = 3.43529, StDev = 0.419734]

Page 84: Chapter13

84 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example - Height vs. Joint Length

7. Calculation:

t = r / √((1 - r²)/(n - 2)) = 0.74908 / √((1 - (0.74908)²)/(17 - 2)) = 4.379

8. P-value: Looking at the table of tail areas for t curves with 15 degrees of freedom, 4.379 is off the bottom of the table, so the P-value < 0.001. Minitab reports the P-value to be 0.001.

9. Conclusion: The P-value is smaller than α = 0.05, so we reject H₀ and conclude that the true population correlation coefficient is greater than 0; that is, taller people tend to have longer metacarpal bones.