Lecture 19 Simple linear regression (Review, 18.5, 18.8) Homework 5 is posted and due next Tuesday...

Lecture 19

• Simple linear regression (Review, 18.5, 18.8)

• Homework 5 is posted and due next Tuesday by 3 p.m.

• Extra office hour on Thursday after class.

Review of Regression Analysis

• Goal: Estimate E(Y|X) – the regression function

• Uses: – E(Y|X) is a good prediction of Y based on X– E(Y|X) describes the relationship between Y

and X

• Simple linear regression model: E(Y|X) is a straight line (the regression line)

XXYE 10)|(

• Example 18.2 (Xm18-02)– A car dealer wants to find

the relationship between the odometer reading and the selling price of used cars.

– A random sample of 100 cars is selected, and the data recorded.

– Find the regression line.

Car Odometer Price1 37388 146362 44758 141223 45833 140164 30862 155905 31705 155686 34010 14718

. . .

. . .

. . .

Independent variable x

Dependent variable y

The Simple Linear Regression Line

Simple Linear Regression Model• The data are assumed to be a

realization of

• are the unknown parameters of the model. Objective of regression is to estimate them.

• , the slope, is the amount that Y changes on average for each one unit increase in X.

• , the standard error of estimate, is the standard deviation of the amount by which Y differs from E(Y|X), i.e., standard deviation of the errors

),(,),,( 11 nn yxyx

),0(~ ,,

,1,2

1

10

Niid

nixy

n

iii

210 ,,

1

Estimation of Regression Line

• We estimate the regression line by the least squares line , the line that minimizes the sum of squared prediction errors for the data.

xbb 10 x10

xbybs

)Y,Xcov(b

10

2x

1

xbybs

)Y,Xcov(b

10

2x

1

Fitted Values and Residuals

• The least squares line decomposes the data into two parts where

• are called the fitted or predicted values. • are called the residuals.• The residuals are estimates of the errors

iii eyy ˆ

iii

ii

yye

xbby

ˆ

ˆ 10

nyy ˆ,,ˆ1

nee ,,1 ),,( 1 n

nee ,,1

Estimating

• The standard error of estimate (root mean squared error) is an estimate of

• The standard error of estimate is basically the standard deviation of the residuals.

• measures how useful the simple linear regression model is for prediction

• If the simple regression model holds, then approximately– 68% of the data will lie within one of the LS line.– 95% of the data will lie within two of the LS line.

n

i ii yyn

s1

2)ˆ(2

1

s

s

s

s

s

18.4 Error Variable: Required Conditions• The error is a critical part of the regression model.• Four requirements involving the distribution of must

be satisfied.– The probability distribution of is normal.

– The mean of is zero for each x: E(|x) = 0 for each x.

– The standard deviation of is for all values of x.

– The set of errors associated with different values of y are all independent.

The Normality of

From the first three assumptions we have:y is normally distributed with meanE(y) = 0 + 1x, and a constant standard deviation given x.

From the first three assumptions we have:y is normally distributed with meanE(y) = 0 + 1x, and a constant standard deviation given x.

0 + 1x1

0 + 1x2

0 + 1x3

E(y|x2)

E(y|x3)

x1 x2 x3

E(y|x1)

The standard deviation remains constant,

but the mean value changes with x

– To measure the strength of the linear relationship we use the coefficient of determination R2 .

2

i

22y

2x

22

)yy(

SSE1Ror

ss

)Y,Xcov(R

2i

22y

2x

22

)yy(

SSE1Ror

ss

)Y,Xcov(R

Coefficient of determination

n

iii yySSE

1

2)ˆ(


• To understand the significance of this coefficient note:

Overall variability in yThe regression model

Remains, in part, unexplained The error

Explained in part by

SSESSRSSTot

yySSTot

yySSR

yySSE

n

ii

n

ii

n

iii

1

2

2

1

1

2

)(

)ˆ(

)ˆ(


x1 x2

y1

y2

y

Two data points (x1,y1) and (x2,y2) of a certain sample are shown.

22

21 )yy()yy( 2

22

1 )yy()yy( 222

211 )yy()yy(

Total variation in y = Variation explained by the regression line

+ Unexplained variation (error)

Variation in y = SSR + SSE


• R2 measures the proportion of the variation in y that is explained by the variation in x.

2i

2i

2i

2i

2

)yy(

SSR

)yy(

SSE)yy(

)yy(

SSE1R

• R2 takes on any value between zero and one.R2 = 1: Perfect match between the line and the data points.

R2 = 0: There is no linear relationship between x & y

• Example 18.5– Find the coefficient of determination for Example

18.2; what does this statistic tell you about the model?

• Solution– Solving by hand;

6501.)],[cov(

)996,259)(688,528,43(]511,712,2[

22

22 2

yx ss

yxR

Coefficient of determination,Example

Example 18.2 in JMPB i v a r i a t e F i t o f P r i c e B y O d o m e t e r

1 3 5 0 0

1 4 0 0 0

1 4 5 0 0

1 5 0 0 0

1 5 5 0 0

1 6 0 0 0

Price

1 5 0 0 0 2 5 0 0 0 3 5 0 0 0 4 5 0 0 0

O d o m e te r

L in e a r F it

L i n e a r F i t P r i c e = 1 7 0 6 6 . 7 6 6 - 0 . 0 6 2 3 1 5 5 O d o m e t e r S u m m a r y o f F i t R S q u a r e 0 . 6 5 0 1 3 2 R S q u a r e A d j 0 . 6 4 6 5 6 2 R o o t M e a n S q u a r e E r r o r 3 0 3 . 1 3 7 5 P a r a m e t e r E s t i m a t e s T e r m E s t i m a t e S t d E r r o r t R a t i o P r o b > | t | I n t e r c e p t 1 7 0 6 6 . 7 6 6 1 6 9 . 0 2 4 6 1 0 0 . 9 7 < . 0 0 0 1 O d o m e t e r - 0 . 0 6 2 3 1 5 0 . 0 0 4 6 1 8 - 1 3 . 4 9 < . 0 0 0 1

SEs of Parameter Estimates

• From the JMP output, • Imagine yourself taking repeated samples of the

prices of cars with the odometer readings from the “population.”

• For each sample, you could estimate the regression line by least squares. Each time, the least squares line would be a little different.

• The standard errors estimate how much the least squares estimates of the slope and intercept would vary over these repeated samples.

0046.)ˆ(,02.169)ˆ( 10 sese

nxx ,,1

Confidence Intervals

• If simple linear regression model holds, estimated slope follows a t-distribution.

• A 95% confidence interval for the slope is given by

• A 95% confidence interval for the intercept is given by

1

)ˆ(ˆ12,025.1 set n

0 )ˆ(ˆ02,025.0 set n

Testing the slope– When no linear relationship exists between two

variables, the regression line should be horizontal.

Different inputs (x) yielddifferent outputs (y).

No linear relationship.Different inputs (x) yieldthe same output (y).

The slope is not equal to zero The slope is equal to zero

Linear relationship.Linear relationship.Linear relationship.Linear relationship.

• We can draw inference about 1 from b1 by testing

H0: 1 = 0

H1: 1 = 0 (or < 0,or > 0)

– The test statistic is

– If the error variable is normally distributed, the statistic is Student t distribution with d.f. = n-2.

1b

11

sb

t

1b

11

sb

t

The standard error of b1.

2x

bs)1n(

ss

1

2x

bs)1n(

ss

1

where

Testing the Slope

• Example 18.4– Test to determine whether there is enough

evidence to infer that there is a linear relationship between the car auction price and the odometer reading for all three-year-old Tauruses, in Example 18.2. Use = 5%.

Testing the Slope,Example

• Solving by hand– To compute “t” we need the values of b1 and sb1.

– The rejection region is t > t.025 or t < -t.025 with = n-2 = 98.Approximately, t.025 = 1.984

49.1300462

00623

00462.)690,528,43)(99(

1.303

)1(

0623.

1

1

11

2

1

.

.s

bt

sn

ss

b

b

x

b

Testing the Slope,Example

• Using the computer

There is overwhelming evidence to inferthat the odometer reading affects the auction selling price.

Testing the Slope,Example Xm18-02

B i v a r i a t e F i t o f P r i c e B y O d o m e t e r

1 3 5 0 0

1 4 0 0 0

1 4 5 0 0

1 5 0 0 0

1 5 5 0 0

1 6 0 0 0

Price

1 5 0 0 0 2 5 0 0 0 3 5 0 0 0 4 5 0 0 0

O d o m e te r

L in e a r F it

L i n e a r F i t P r i c e = 1 7 0 6 6 . 7 6 6 - 0 . 0 6 2 3 1 5 5 O d o m e t e r S u m m a r y o f F i t R S q u a r e 0 . 6 5 0 1 3 2 R S q u a r e A d j 0 . 6 4 6 5 6 2 R o o t M e a n S q u a r e E r r o r 3 0 3 . 1 3 7 5 P a r a m e t e r E s t i m a t e s T e r m E s t i m a t e S t d E r r o r t R a t i o P r o b > | t | I n t e r c e p t 1 7 0 6 6 . 7 6 6 1 6 9 . 0 2 4 6 1 0 0 . 9 7 < . 0 0 0 1 O d o m e t e r - 0 . 0 6 2 3 1 5 0 . 0 0 4 6 1 8 - 1 3 . 4 9 < . 0 0 0 1

Cause-and-effect Relationship

• A test of whether the slope is zero is a test of whether there is a linear relationship between x and y in the observed data, i.e., is a change in x associated with a change in y.

• This does not test whether a change in x causes a change in y. Such a relationship can only be established based on a carefully controlled experiment or extensive subject matter knowledge about the relationship.

Example of Pitfall

• A researcher measures the number of television sets per person X and the average life expectancy Y for the world’s nations. The regression line has a positive slope – nations with many TV sets have higher life expectancies. Could we lengthen the lives of people in Rwanda by shipping them TV sets?

• If we are satisfied with how well the model fits the data, we can use it to predict the values of y.

• To make a prediction we use– Point prediction, and– Interval prediction

18.7 Using the Regression Equation• Before using the regression model, we need

to assess how well it fits the data.

Point Prediction

• Example 18.7– Predict the selling price of a three-year-old Taurus

with 40,000 miles on the odometer (Example 18.2).

– It is predicted that a 40,000 miles car would sell for $14,575.

– How close is this prediction to the real price?

575,14)000,40(0623.17067x0623.17067y A point prediction

Interval Estimates• Two intervals can be used to discover how closely the

predicted value will match the true value of y.– Prediction interval – predicts y for a given value of x,– Confidence interval – estimates the average y for a given x.

– The confidence interval

– The confidence interval

2x

2g

2 s)1n()xx(

n1

sty

2x

2g

2 s)1n()xx(

n1

sty

– The prediction interval– The prediction interval

2x

2g

2 s)1n()xx(

n1

1sty

2x

2g

2 s)1n()xx(

n1

1sty

Interval Estimates,Example

• Example 18.7 - continued – Provide an interval estimate for the bidding

price on a Ford Taurus with 40,000 miles on the odometer.

– Two types of predictions are required:• A prediction for a specific car

• An estimate for the average price per car


• Solution– A prediction interval provides the price estimate for a

single car:

2x

2g

2 s)1n()xx(

n1

1sty

605575,14690,528,43)1100(

)009,36000,40(

100

11)1.303(984.1)]40000(0623.067,17[

2

t.025,98

Approximately

• Solution – continued– A confidence interval provides the estimate of

the mean price per car for a Ford Taurus with 40,000 miles reading on the odometer.

• The confidence interval (95%) =

2

i

2g

2)xx(

)xx(

n1

sty

70575,14690,528,43)1100(

)009,36000,40(

100

1)1.303(984.1)]40000(0623.067,17[

2


– As xg moves away from x the interval becomes longer. That is, the shortest interval is found at

2x

2g

2 s)1n(

)xx(

n1

sty

x

g10 xbby

The effect of the given xg on the length of the interval

x

x1x)1x( 1x)1x(

g10 xbby

)1xx(y g )1xx(y g

1x 1x

– As xg moves away from the interval becomes longer. That is, the shortest interval is found at


2x

2g

2 s)1n(

)xx(

n1

sty

2x

2

2 s)1n(1

n1

sty

xx

x

– As xg moves away from the interval becomes longer. That is, the shortest interval is found at .

g10 xbby

2x)2x( 2x)2x(

2x 2x

2x

2g

2 s)1n()xx(

n1

sty

2x

2

2 s)1n(1

n1

sty

2x

2

2 s)1n(2

n1

sty


xx

Practice Problems

• 18.84,18.86,18.88,18.90,18.94

Lecture 19 Simple linear regression (Review, 18.5, 18.8) Homework 5 is posted and due next Tuesday...

Documents

Transcript of Lecture 19 Simple linear regression (Review, 18.5, 18.8) Homework 5 is posted and due next Tuesday...