Lecture 19 Simple linear regression (Review, 18.5, 18.8) Homework 5 is posted and due next Tuesday...
-
date post
22-Dec-2015 -
Category
Documents
-
view
230 -
download
2
Transcript of Lecture 19 Simple linear regression (Review, 18.5, 18.8) Homework 5 is posted and due next Tuesday...
Lecture 19
• Simple linear regression (Review, 18.5, 18.8)
• Homework 5 is posted and due next Tuesday by 3 p.m.
• Extra office hour on Thursday after class.
Review of Regression Analysis
• Goal: Estimate E(Y|X) – the regression function
• Uses: – E(Y|X) is a good prediction of Y based on X– E(Y|X) describes the relationship between Y
and X
• Simple linear regression model: E(Y|X) is a straight line (the regression line)
XXYE 10)|(
• Example 18.2 (Xm18-02)– A car dealer wants to find
the relationship between the odometer reading and the selling price of used cars.
– A random sample of 100 cars is selected, and the data recorded.
– Find the regression line.
Car Odometer Price1 37388 146362 44758 141223 45833 140164 30862 155905 31705 155686 34010 14718
. . .
. . .
. . .
Independent variable x
Dependent variable y
The Simple Linear Regression Line
Simple Linear Regression Model• The data are assumed to be a
realization of
• are the unknown parameters of the model. Objective of regression is to estimate them.
• , the slope, is the amount that Y changes on average for each one unit increase in X.
• , the standard error of estimate, is the standard deviation of the amount by which Y differs from E(Y|X), i.e., standard deviation of the errors
),(,),,( 11 nn yxyx
),0(~ ,,
,1,2
1
10
Niid
nixy
n
iii
210 ,,
1
Estimation of Regression Line
• We estimate the regression line by the least squares line , the line that minimizes the sum of squared prediction errors for the data.
xbb 10 x10
xbybs
)Y,Xcov(b
10
2x
1
xbybs
)Y,Xcov(b
10
2x
1
Fitted Values and Residuals
• The least squares line decomposes the data into two parts where
• are called the fitted or predicted values. • are called the residuals.• The residuals are estimates of the errors
iii eyy ˆ
iii
ii
yye
xbby
ˆ
ˆ 10
nyy ˆ,,ˆ1
nee ,,1 ),,( 1 n
nee ,,1
Estimating
• The standard error of estimate (root mean squared error) is an estimate of
• The standard error of estimate is basically the standard deviation of the residuals.
• measures how useful the simple linear regression model is for prediction
• If the simple regression model holds, then approximately– 68% of the data will lie within one of the LS line.– 95% of the data will lie within two of the LS line.
n
i ii yyn
s1
2)ˆ(2
1
s
s
s
s
s
18.4 Error Variable: Required Conditions• The error is a critical part of the regression model.• Four requirements involving the distribution of must
be satisfied.– The probability distribution of is normal.
– The mean of is zero for each x: E(|x) = 0 for each x.
– The standard deviation of is for all values of x.
– The set of errors associated with different values of y are all independent.
The Normality of
From the first three assumptions we have:y is normally distributed with meanE(y) = 0 + 1x, and a constant standard deviation given x.
From the first three assumptions we have:y is normally distributed with meanE(y) = 0 + 1x, and a constant standard deviation given x.
0 + 1x1
0 + 1x2
0 + 1x3
E(y|x2)
E(y|x3)
x1 x2 x3
E(y|x1)
The standard deviation remains constant,
but the mean value changes with x
– To measure the strength of the linear relationship we use the coefficient of determination R2 .
2
i
22y
2x
22
)yy(
SSE1Ror
ss
)Y,Xcov(R
2i
22y
2x
22
)yy(
SSE1Ror
ss
)Y,Xcov(R
Coefficient of determination
n
iii yySSE
1
2)ˆ(
Coefficient of determination
• To understand the significance of this coefficient note:
Overall variability in yThe regression model
Remains, in part, unexplained The error
Explained in part by
SSESSRSSTot
yySSTot
yySSR
yySSE
n
ii
n
ii
n
iii
1
2
2
1
1
2
)(
)ˆ(
)ˆ(
Coefficient of determination
x1 x2
y1
y2
y
Two data points (x1,y1) and (x2,y2) of a certain sample are shown.
22
21 )yy()yy( 2
22
1 )yy()yy( 222
211 )yy()yy(
Total variation in y = Variation explained by the regression line
+ Unexplained variation (error)
Variation in y = SSR + SSE
Coefficient of determination
• R2 measures the proportion of the variation in y that is explained by the variation in x.
2i
2i
2i
2i
2
)yy(
SSR
)yy(
SSE)yy(
)yy(
SSE1R
• R2 takes on any value between zero and one.R2 = 1: Perfect match between the line and the data points.
R2 = 0: There is no linear relationship between x & y
• Example 18.5– Find the coefficient of determination for Example
18.2; what does this statistic tell you about the model?
• Solution– Solving by hand;
6501.)],[cov(
)996,259)(688,528,43(]511,712,2[
22
22 2
yx ss
yxR
Coefficient of determination,Example
Example 18.2 in JMPB i v a r i a t e F i t o f P r i c e B y O d o m e t e r
1 3 5 0 0
1 4 0 0 0
1 4 5 0 0
1 5 0 0 0
1 5 5 0 0
1 6 0 0 0
Price
1 5 0 0 0 2 5 0 0 0 3 5 0 0 0 4 5 0 0 0
O d o m e te r
L in e a r F it
L i n e a r F i t P r i c e = 1 7 0 6 6 . 7 6 6 - 0 . 0 6 2 3 1 5 5 O d o m e t e r S u m m a r y o f F i t R S q u a r e 0 . 6 5 0 1 3 2 R S q u a r e A d j 0 . 6 4 6 5 6 2 R o o t M e a n S q u a r e E r r o r 3 0 3 . 1 3 7 5 P a r a m e t e r E s t i m a t e s T e r m E s t i m a t e S t d E r r o r t R a t i o P r o b > | t | I n t e r c e p t 1 7 0 6 6 . 7 6 6 1 6 9 . 0 2 4 6 1 0 0 . 9 7 < . 0 0 0 1 O d o m e t e r - 0 . 0 6 2 3 1 5 0 . 0 0 4 6 1 8 - 1 3 . 4 9 < . 0 0 0 1
SEs of Parameter Estimates
• From the JMP output, • Imagine yourself taking repeated samples of the
prices of cars with the odometer readings from the “population.”
• For each sample, you could estimate the regression line by least squares. Each time, the least squares line would be a little different.
• The standard errors estimate how much the least squares estimates of the slope and intercept would vary over these repeated samples.
0046.)ˆ(,02.169)ˆ( 10 sese
nxx ,,1
Confidence Intervals
• If simple linear regression model holds, estimated slope follows a t-distribution.
• A 95% confidence interval for the slope is given by
• A 95% confidence interval for the intercept is given by
1
)ˆ(ˆ12,025.1 set n
0 )ˆ(ˆ02,025.0 set n
Testing the slope– When no linear relationship exists between two
variables, the regression line should be horizontal.
Different inputs (x) yielddifferent outputs (y).
No linear relationship.Different inputs (x) yieldthe same output (y).
The slope is not equal to zero The slope is equal to zero
Linear relationship.Linear relationship.Linear relationship.Linear relationship.
• We can draw inference about 1 from b1 by testing
H0: 1 = 0
H1: 1 = 0 (or < 0,or > 0)
– The test statistic is
– If the error variable is normally distributed, the statistic is Student t distribution with d.f. = n-2.
1b
11
sb
t
1b
11
sb
t
The standard error of b1.
2x
bs)1n(
ss
1
2x
bs)1n(
ss
1
where
Testing the Slope
• Example 18.4– Test to determine whether there is enough
evidence to infer that there is a linear relationship between the car auction price and the odometer reading for all three-year-old Tauruses, in Example 18.2. Use = 5%.
Testing the Slope,Example
• Solving by hand– To compute “t” we need the values of b1 and sb1.
– The rejection region is t > t.025 or t < -t.025 with = n-2 = 98.Approximately, t.025 = 1.984
49.1300462
00623
00462.)690,528,43)(99(
1.303
)1(
0623.
1
1
11
2
1
.
.s
bt
sn
ss
b
b
x
b
Testing the Slope,Example
• Using the computer
There is overwhelming evidence to inferthat the odometer reading affects the auction selling price.
Testing the Slope,Example Xm18-02
B i v a r i a t e F i t o f P r i c e B y O d o m e t e r
1 3 5 0 0
1 4 0 0 0
1 4 5 0 0
1 5 0 0 0
1 5 5 0 0
1 6 0 0 0
Price
1 5 0 0 0 2 5 0 0 0 3 5 0 0 0 4 5 0 0 0
O d o m e te r
L in e a r F it
L i n e a r F i t P r i c e = 1 7 0 6 6 . 7 6 6 - 0 . 0 6 2 3 1 5 5 O d o m e t e r S u m m a r y o f F i t R S q u a r e 0 . 6 5 0 1 3 2 R S q u a r e A d j 0 . 6 4 6 5 6 2 R o o t M e a n S q u a r e E r r o r 3 0 3 . 1 3 7 5 P a r a m e t e r E s t i m a t e s T e r m E s t i m a t e S t d E r r o r t R a t i o P r o b > | t | I n t e r c e p t 1 7 0 6 6 . 7 6 6 1 6 9 . 0 2 4 6 1 0 0 . 9 7 < . 0 0 0 1 O d o m e t e r - 0 . 0 6 2 3 1 5 0 . 0 0 4 6 1 8 - 1 3 . 4 9 < . 0 0 0 1
Cause-and-effect Relationship
• A test of whether the slope is zero is a test of whether there is a linear relationship between x and y in the observed data, i.e., is a change in x associated with a change in y.
• This does not test whether a change in x causes a change in y. Such a relationship can only be established based on a carefully controlled experiment or extensive subject matter knowledge about the relationship.
Example of Pitfall
• A researcher measures the number of television sets per person X and the average life expectancy Y for the world’s nations. The regression line has a positive slope – nations with many TV sets have higher life expectancies. Could we lengthen the lives of people in Rwanda by shipping them TV sets?
• If we are satisfied with how well the model fits the data, we can use it to predict the values of y.
• To make a prediction we use– Point prediction, and– Interval prediction
18.7 Using the Regression Equation• Before using the regression model, we need
to assess how well it fits the data.
Point Prediction
• Example 18.7– Predict the selling price of a three-year-old Taurus
with 40,000 miles on the odometer (Example 18.2).
– It is predicted that a 40,000 miles car would sell for $14,575.
– How close is this prediction to the real price?
575,14)000,40(0623.17067x0623.17067y A point prediction
Interval Estimates• Two intervals can be used to discover how closely the
predicted value will match the true value of y.– Prediction interval – predicts y for a given value of x,– Confidence interval – estimates the average y for a given x.
– The confidence interval
– The confidence interval
2x
2g
2 s)1n()xx(
n1
sty
2x
2g
2 s)1n()xx(
n1
sty
– The prediction interval– The prediction interval
2x
2g
2 s)1n()xx(
n1
1sty
2x
2g
2 s)1n()xx(
n1
1sty
Interval Estimates,Example
• Example 18.7 - continued – Provide an interval estimate for the bidding
price on a Ford Taurus with 40,000 miles on the odometer.
– Two types of predictions are required:• A prediction for a specific car
• An estimate for the average price per car
Interval Estimates,Example
• Solution– A prediction interval provides the price estimate for a
single car:
2x
2g
2 s)1n()xx(
n1
1sty
605575,14690,528,43)1100(
)009,36000,40(
100
11)1.303(984.1)]40000(0623.067,17[
2
t.025,98
Approximately
• Solution – continued– A confidence interval provides the estimate of
the mean price per car for a Ford Taurus with 40,000 miles reading on the odometer.
• The confidence interval (95%) =
2
i
2g
2)xx(
)xx(
n1
sty
70575,14690,528,43)1100(
)009,36000,40(
100
1)1.303(984.1)]40000(0623.067,17[
2
Interval Estimates,Example
– As xg moves away from x the interval becomes longer. That is, the shortest interval is found at
2x
2g
2 s)1n(
)xx(
n1
sty
x
g10 xbby
The effect of the given xg on the length of the interval
x
x1x)1x( 1x)1x(
g10 xbby
)1xx(y g )1xx(y g
1x 1x
– As xg moves away from the interval becomes longer. That is, the shortest interval is found at
The effect of the given xg on the length of the interval
2x
2g
2 s)1n(
)xx(
n1
sty
2x
2
2 s)1n(1
n1
sty
xx
x
– As xg moves away from the interval becomes longer. That is, the shortest interval is found at .
g10 xbby
2x)2x( 2x)2x(
2x 2x
2x
2g
2 s)1n()xx(
n1
sty
2x
2
2 s)1n(1
n1
sty
2x
2
2 s)1n(2
n1
sty
The effect of the given xg on the length of the interval
xx
Practice Problems
• 18.84,18.86,18.88,18.90,18.94