Least Squares Regression - University of Houston


Page 1: Least Squares Regression - University of Houston

Least Squares Regression
Sections 5.3 & 5.4

Cathy Poliak, Ph.D.
[email protected]
Office in Fleming 11c
Department of Mathematics, University of Houston

Lecture 13 - 2311

Page 2

Outline

1 Least-Squares Regression

2 Prediction

3 Coefficient of Determination

4 Residuals

5 Residual Plots

Page 3

Popper Set Up

Fill in all of the proper bubbles.

Make sure your ID number is correct.

Make sure the filled in circles are very dark.

This is popper number 09.

Page 4

Scatterplots and Correlation

We want to know the relationship between two random variables.

To quickly look at the relationship, we use the scatterplot and correlation to determine:

- Direction (positive or negative)
- Strength (weak, moderately weak, strong)
- Form (linear or non-linear)

Page 5

Popper 09 Questions

We are looking at the relationship between how many items are on a shelf (shelf space) and the number of items sold in a week. The correlation coefficient is r = 0.827.

1. What is the “direction” of the relationship between shelf space and weekly sales?
   a) Positive  b) Negative  c) No direction

2. What is the “strength” of the relationship between shelf space and weekly sales?
   a) Strong  b) Moderate  c) Weak

3. What “form” appears to be in the relationship between shelf space and weekly sales?
   a) Linear  b) Non-linear

Page 6

Popper 09 Questions

Use the following scatterplot to answer the questions below.

[Scatterplot of y versus x, with the x-axis from −2 to 6 and the y-axis from −2 to 10]

4. What is the correlation coefficient for this set of data?
   a. 1  b. -0.94  c. 0.94  d. 0

5. If both X and Y are multiplied by 2, what would happen to the correlation coefficient?
   a. It will stay the same.  b. It will be divided by 2.  c. It will be multiplied by 2.  d. It will be zero.

Page 7

Example

The marketing manager of a supermarket chain would like to use shelf space to predict the sales of coffee. A random sample of 12 stores is selected, with the following results.

Store   Shelf Space (ft)   Weekly Sales (# sold)
  1            5                 160
  2            5                 220
  3            5                 140
  4           10                 190
  5           10                 240
  6           10                 260
  7           15                 230
  8           15                 270
  9           15                 280
 10           20                 260
 11           20                 290
 12           20                 310

Page 8

Scatterplot

[Scatterplot of the coffee data: Shelf Space (feet) on the horizontal axis, Number Sold (150 to 300) on the vertical axis]

Page 9

Examining relationships

Correlation measures the direction and strength of the straight-line relationship between two quantitative variables.

If a scatterplot shows a linear relationship, we would like to summarize this overall pattern by drawing a line on the scatterplot.

A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes.

This equation is used when one of the variables helps explain or predict the other.

Page 10

Least-Squares regression

The least-squares regression line (LSRL) of Y on X is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
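To make "sum of squares of the vertical distances" concrete, here is a small sketch (in Python rather than the R used later in these slides) that evaluates that sum for the coffee data and a candidate line ŷ = a + bx. The two comparison lines are illustrative choices, not from the slides:

```python
# Coffee-sales data from the shelf-space example in these slides.
shelf = [5, 5, 5, 10, 10, 10, 15, 15, 15, 20, 20, 20]
sales = [160, 220, 140, 190, 240, 260, 230, 270, 280, 260, 290, 310]

def sse(a, b):
    """Sum of squared vertical distances (residuals) for the line yhat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(shelf, sales))

# The least-squares line (a = 145, b = 7.4, derived later in the slides)
# has a smaller sum of squares than nearby candidate lines:
print(sse(145, 7.4))   # SSE of the least-squares line
print(sse(145, 8.0))   # a steeper line: larger SSE
print(sse(150, 7.4))   # a shifted line: larger SSE
```

No other line achieves a smaller sum of squares than the least-squares line; that is exactly what "least squares" means.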

Page 11

Least-Squares

[The same scatterplot of Shelf Space (feet) versus Number Sold, illustrating the least-squares fit]

Page 12

Equation of the least-squares regression line

Let x be the explanatory variable and y be the response variable for n individuals. From the data calculate the means x̄ and ȳ and the standard deviations sx and sy of the two variables, and their correlation r. The least-squares line is the equation

ŷ = a + bx

Page 13

Equation of the least-squares regression line

The least-squares line is the equation ŷ = a + bx.

In the example of the supermarket sales, let Y = weekly sales and X = shelf space.

The least-squares regression equation to predict weekly sales based on shelf space is

ŷ = 145 + 7.4x

Page 14

Calculating the least-squares regression equation by hand

Let X be the explanatory variable and Y be the response variable for n individuals.

1. From the data calculate the means x̄ and ȳ and the standard deviations sx and sy of the two variables, and their correlation r.

2. Calculate the slope:

   b = r(sy / sx)

3. Calculate the y-intercept:

   a = ȳ − b x̄

4. Then the equation is ŷ = a + bx, where you input the slope for b and the y-intercept for a and leave ŷ and x as symbols (do not put any numbers in for ŷ and x).
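The four steps above can be sketched as a small function (a hypothetical Python helper for checking hand calculations; the course itself uses R and the TI-83/84):

```python
import math

def lsrl(xs, ys):
    """Least-squares slope and intercept via the hand-calculation steps:
    b = r * sy / sx, then a = ybar - b * xbar."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    # sample standard deviations (divide by n - 1)
    sx = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))
    # sample correlation
    r = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)
    b = r * sy / sx       # step 2: slope
    a = ybar - b * xbar   # step 3: y-intercept
    return a, b

# Coffee-sales data from the slides:
shelf = [5, 5, 5, 10, 10, 10, 15, 15, 15, 20, 20, 20]
sales = [160, 220, 140, 190, 240, 260, 230, 270, 280, 260, 290, 310]
a, b = lsrl(shelf, sales)
print(a, b)   # recovers the slides' equation yhat = 145 + 7.4x
```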

Page 15

Finding Equation for Coffee Sales

Given these values, determine the least-squares regression line (LSRL) equation for predicting number sold based on shelf space.

Explanatory variable: Shelf space (X): x̄ = 12.5 feet, sx = 5.83874 feet.
Response variable: Sales (Y): ȳ = 237.5 units sold, sy = 52.2451 units sold.
Correlation: r = 0.827
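Plugging these summaries into the formulas b = r(sy/sx) and a = ȳ − b x̄ (a quick check in Python rather than the course's R):

```python
# Summary statistics given on the slide
r = 0.827
xbar, sx = 12.5, 5.83874
ybar, sy = 237.5, 52.2451

b = r * sy / sx       # slope: about 7.4 extra units sold per additional foot of shelf space
a = ybar - b * xbar   # intercept: about 145
print(b, a)           # matches the LSRL yhat = 145 + 7.4x up to rounding
```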

Page 16

Example of LSRL for Cost of Car

The following are descriptive statistics for estimating the cost of the car (Y) by the age of the car (X), with r = −0.8224.

Estimated Cost (Y): ȳ = 10360.93, sy = 5482.3372
Age (X): x̄ = 5.214, sx = 2.940

Determine the least-squares regression line (LSRL) equation for the data.
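One way to check your answer (a Python sketch; the intercept computed from these rounded summaries differs from the value quoted on a later slide by a unit or so):

```python
# b = r * sy / sx and a = ybar - b * xbar for the car data
r = -0.8224
xbar, sx = 5.214, 2.940
ybar, sy = 10360.93, 5482.3372

b = r * sy / sx       # about -1534 dollars of cost per year of age
a = ybar - b * xbar   # about 18357, close to the 18358 quoted on a later slide
print(round(b), round(a))
```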

Page 17

Least-Square Regression Line Using R

Use the command lm(Yvariable ~ Xvariable). For example, for the coffee sales:

> shelf=c(5,5,5,10,10,10,15,15,15,20,20,20)
> sales=c(160,220,140,190,240,260,230,270,280,260,290,310)
> lm(sales~shelf)

Call:
lm(formula = sales ~ shelf)

Coefficients:
(Intercept)        shelf
      145.0          7.4

The Intercept value is a and the other value is b, the slope. Thus the equation in this example is ŷ = 145 + 7.4x.

Page 18

Least-Squares Regression Line Using TI-83(84)

1. Make sure Diagnostics is turned on by pressing 2ND → CATALOG and scrolling down to DiagnosticOn.

2. Choose STAT→ CALC then 8:LinReg(a+bx).

3. Make sure your Xlist is L1 and Ylist is L2 and select Calculate.

4. a is your y -intercept and b is your slope.

Page 19

Prediction

The equation of the regression line makes prediction easy. Substitute an x-value into the equation.

Predict the weekly sales of coffee with shelf space of 12 feet:

ŷ = 145 + 7.4 × 12 = 233.8

Thus 233.8 is the predicted weekly number of units of coffee sold for a store that has 12 feet of shelf space.
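The same substitution as a tiny Python helper (an illustrative function name, not from the slides):

```python
def predict_sales(feet):
    """Weekly coffee sales predicted by the slides' LSRL, yhat = 145 + 7.4x."""
    return 145 + 7.4 * feet

print(predict_sales(12))   # about 233.8 units, as computed above
```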

Page 20

Example

The least-squares regression line equation to predict the cost of the car (Y) by the age of the car (X) is: ŷ = 18358 − 1534x.

1. Predict the cost of a 5-year-old car.

2. Predict the cost of a 20-year-old car.
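Working both predictions through in Python (a hypothetical helper name); note what happens for the 20-year-old car, which lies well beyond the ages in the data (x̄ = 5.214, sx = 2.940):

```python
def predict_cost(age):
    """Estimated car cost from the LSRL on the slide, yhat = 18358 - 1534x."""
    return 18358 - 1534 * age

print(predict_cost(5))    # 10688
print(predict_cost(20))   # -12322: a negative predicted cost
```

A negative predicted cost is a reminder that extrapolating far outside the observed x-values can give nonsense.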

Page 21

Facts about least-squares regression

Fact 1: A change of one standard deviation in x corresponds to a change of r standard deviations in y. (This is where the slope b comes from.)

Fact 2: The least-squares regression line always passes through the point (x̄, ȳ). That is why we can get the y-intercept a.

Fact 3: The distinction between explanatory and response variables is essential in regression.

Page 22

R²

The square of the correlation, R², describes the strength of a straight-line relationship.

Its formal name is the coefficient of determination.

This is the percent of variation in Y that is explained by the equation.

R² is a measure of how successful the regression was in explaining the response.

Page 23

R² for coffee sales

The correlation is r = 0.827.

The coefficient of determination is R² = 0.827² = 0.684.

This means that 68.4% of the variation in coffee sales can be explained by the equation.

Page 24

Example

The following is the output of a least-squares regression equation in the TI-84.

What percent of the variation in the y -variable can be explained by thisregression equation?

a) 67% b) 6.7845% c) 77% d) 88%

Page 25

Is the LSRL good at predicting the response?

The coefficient of determination, R², is the percent (fraction) of variability in the response variable (Y) that is explained by the least-squares regression on the explanatory variable.

This is a measure of how successful the regression equation was in predicting the response variable.

The closer R² is to one (100%), the better our equation is at predicting the response variable.

From the previous question we can say that 84.33% of the variation in the monthly gas bills can be explained by the least-squares regression line equation.

Page 26

Is the LSRL good at predicting the response?

A residual is the difference between an observed value of the response variable and the value predicted by the regression line.

residual = observed y − predicted y

We can determine residuals for each observation.

The closer the residuals are to zero, the better we are at predicting the response variable.

We can plot the residuals for each observation; these plots are called residual plots.

Page 27

Example of residuals

The regression equation to determine number of units sold (y) for coffee by shelf space (x) is: ŷ = 145 + 7.4x.

A store that has 10 feet of space sold 260 units of coffee. Determine the residual for this store.

1. Determine the predicted units sold for x = 10.

2. The observed units sold is the given value, 260.

3. The residual is the difference between the observed y and the predicted y.
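The three steps carried out in Python (a sketch; the numbers follow directly from the slide's equation):

```python
# Residual for the store with 10 feet of shelf space that sold 260 units
predicted = 145 + 7.4 * 10       # step 1: predicted units sold = 219
observed = 260                   # step 2: observed units sold
residual = observed - predicted  # step 3: residual = 260 - 219 = 41
print(predicted, residual)
```

A positive residual means this store sold more than the line predicts.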

Page 28

Residuals of coffee sales

The regression equation is: Weekly Sales = 145 + 7.40 Shelf Space

Shelf Space   Observed Weekly Sales   Predicted Weekly Sales   Residual
     5                 160                     182               -22
     5                 220                     182                38
     5                 140                     182               -42
    10                 190                     219               -29
    10                 240                     219                21
    10                 260                     219                41
    15                 230                     256               -26
    15                 270                     256                14
    15                 280                     256                24
    20                 260                     293               -33
    20                 290                     293                -3
    20                 310                     293                17
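The whole residual column can be reproduced in a few lines of Python (equivalent to the table above; the slides do this step in R):

```python
# Coffee data and the slides' fitted line yhat = 145 + 7.4x
shelf = [5, 5, 5, 10, 10, 10, 15, 15, 15, 20, 20, 20]
sales = [160, 220, 140, 190, 240, 260, 230, 270, 280, 260, 290, 310]

predicted = [145 + 7.4 * x for x in shelf]
residuals = [round(y - p) for y, p in zip(sales, predicted)]
print(residuals)   # [-22, 38, -42, -29, 21, 41, -26, 14, 24, -33, -3, 17]
```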

Page 29

Residual Plots

The plot of the residual values against the x values can tell us about our LSRL model.

This is a scatterplot where the x-values are on the horizontal axis and the residuals are on the vertical axis.

In R: plot(x,resid(lm(y~x)))

If the equation is a good way of predicting the y-variable (our model is a good fit), the residual plot will be a bunch of scattered points with no pattern, centered about zero on the vertical axis.

Page 30

Residual Plot

[Residual plot for the coffee data: shelf on the horizontal axis, residuals (−40 to 40) on the vertical axis]

Page 31

R code

> shelf=c(5,5,5,10,10,10,15,15,15,20,20,20)
> sales=c(160,220,140,190,240,260,230,270,280,260,290,310)
> plot(shelf,resid(lm(sales~shelf)),ylab="residuals")
> abline(0,0)

Page 32

Examining a residual plot

A curved pattern shows that the relationship is not linear.

Increasing spread about the zero line as x increases indicates that the prediction of y will be less accurate for larger x. Decreasing spread about the zero line as x increases indicates that the prediction of y will be more accurate for larger x.

Individual points with large residuals are considered outliers in the vertical (y) direction.

Individual points that are extreme in the x direction are considered outliers for the x-variable.
