Linear Regression and Correlation

Introduction

So far we have confined our discussion to distributions involving only one variable. In practical applications, however, we often come across sets of data in which each item comprises the values of two or more variables.

In this chapter we develop numerical measures to express the relationship between two variables. Is the relationship strong or weak? Is it direct or inverse? In addition, we develop an equation to express the relationship between the variables. This will allow us to estimate one variable on the basis of the other.

Examples

Is there a relationship between the amount Healthtex spends per month on advertising and its sales in the month?
Can we base an estimate of the cost to heat a home in January on the number of square feet in the home?
Is there a relationship between the number of hours that students studied for an exam and the score earned?

What Is Correlation Analysis?

Correlation analysis is the study of the relationship between variables. For example, suppose the sales manager of Copier Sales of America, which has a large sales force throughout the United States and Canada, wants to determine whether there is a relationship between the number of sales calls made in a month and the number of copiers sold that month. The manager selects a random sample of 10 representatives and determines the number of sales calls each representative made last month and the number of copiers sold.

Sales calls and copiers sold for 10 salespeople

Sales Representative    Number of Sales Calls    Number of Copiers Sold
Tom Keller              20                       30
Jeff Hall               40                       60
Brian Virost            20                       40
Greg Fish               30                       60
Susan Welch             10                       30
Carlos Ramirez          10                       40
Rich Niles              20                       40
Mike Kiel               20                       50
Mark Reynolds           20                       30
Soni Jones              30                       70

Correlation Analysis

By reviewing the data, it seems that there is some relationship between the number of sales calls and the number of units sold. Here we develop some techniques to portray more precisely the relationship between the two variables, sales calls and copiers sold. This group of statistical techniques is called correlation analysis.

Example

Copier Sales of America sells copiers to businesses of all sizes throughout the United States and Canada. Ms. Marcy Bancer was recently promoted to the position of national sales manager. At the upcoming sales meeting, the sales representatives from all over the country will be in attendance. She would like to impress upon them the importance of making that extra sales call each day. She decides to gather some information on the relationship between the number of sales calls and the number of copiers sold.

She selected a random sample of 10 sales representatives and determined the number of sales calls they made last month and the number of copiers they sold. What observations can you make about the relationship between the number of sales calls and the number of copiers sold? Develop a scatter diagram to display the information.

Solution

We refer to the number of sales calls as the independent variable and the number of copiers sold as the dependent variable.

DEPENDENT VARIABLE: The variable that is being predicted or estimated.

INDEPENDENT VARIABLE: A variable that provides the basis for estimation. It is the predictor variable.

Scatter Diagram

[Figure: scatter diagram of the sample; axes: Sales Calls and Units Sold]

The scatter diagram shows graphically that the sales representatives who make more calls tend to sell more copiers. It is reasonable for Ms. Bancer to tell her salespeople that the more sales calls they make, the more copiers they can expect to sell.

The Coefficient of Correlation

Originated by Karl Pearson, the coefficient of correlation, r, describes the strength of the relationship between two sets of interval-scaled or ratio-scaled variables. It takes values from -1 to +1. If two sets of data have r = +1, they are said to be perfectly positively correlated; if r = -1, they are said to be perfectly negatively correlated; and if r = 0, they are uncorrelated.

[Figure: Perfect Positive Correlation, r = +1]

[Figure: Perfect Negative Correlation, r = -1]

[Figure: No Relationship, r = 0]

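The following drawing summarizes the strength and direction of the coefficient of correlation.

[Figure: scale of the coefficient of correlation running from -1.00 to 1.00, with tick marks at -1.00, -0.50, 0, 0.50 and 1.00. Labels from left to right: perfect negative correlation, strong negative correlation, moderate negative correlation, weak negative correlation, no correlation, weak positive correlation, moderate positive correlation, strong positive correlation, perfect positive correlation. Values below 0 indicate negative correlation; values above 0 indicate positive correlation.]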

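Correlation Coefficient

The coefficient of correlation is computed as

$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{(n - 1)\, s_x s_y}$$

where $\bar{X}$ and $\bar{Y}$ are the sample means and $s_x$ and $s_y$ are the sample standard deviations of X (calls) and Y (sales). For the sample, $\bar{X} = 22$ and $\bar{Y} = 45$.

Sales Representative    Calls (X)    Sales (Y)    $X - \bar{X}$    $Y - \bar{Y}$    $(X - \bar{X})(Y - \bar{Y})$
Tom Keller              20           30           -2               -15              30
Jeff Hall               40           60           18               15               270
Brian Virost            20           40           -2               -5               10
Greg Fish               30           60           8                15               120
Susan Welch             10           30           -12              -15              180
Carlos Ramirez          10           40           -12              -5               60
Rich Niles              20           40           -2               -5               10
Mike Kiel               20           50           -2               5                -10
Mark Reynolds           20           30           -2               -15              30
Soni Jones              30           70           8                25               200
Total                                                                               900

With $s_x = 9.189$ and $s_y = 14.337$,

$$r = \frac{900}{(10 - 1)(9.189)(14.337)} = 0.759$$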

Interpretation of Result

The value of r is positive, so there is a direct relationship between the number of sales calls and the number of copiers sold. The value of 0.759 is fairly close to 1.00, so we conclude that the relationship is strong.
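The computation of r for the copier data can be sketched in a few lines of Python. This is only an illustration using the standard library; the function and variable names (`sample_std`, `pearson_r`, `calls`, `sold`) are assumptions made for this sketch, and the data are the ten pairs from the table.

```python
import math

# Data from the Copier Sales of America sample (10 sales representatives).
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sold = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]


def sample_std(values):
    """Sample standard deviation (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def pearson_r(x, y):
    """Sum of cross-deviations divided by (n - 1) * s_x * s_y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return cross / ((n - 1) * sample_std(x) * sample_std(y))


r = pearson_r(calls, sold)
print(round(r, 3))  # 0.759
```

The positive sign reflects the direct relationship, and the magnitude matches the value interpreted above.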

The Coefficient of Determination

In the previous example the coefficient of correlation, 0.759, was interpreted as being strong. Terms such as weak, moderate, and strong, however, do not have a precise meaning. A measure that has a more easily interpreted meaning is the coefficient of determination. It is computed by squaring the coefficient of correlation. For the above example, $r^2 = 0.576$, found by $(0.759)^2$.

This is a proportion or percent; we can say that 57.6 percent of the variation in the number of copiers sold is explained, or accounted for, by the variation in the number of sales calls.

COEFFICIENT OF DETERMINATION: The proportion of the total variation in the dependent variable Y that is explained, or accounted for, by the variation in the independent variable X.
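One way to see why $r^2$ can be read as a share of explained variation is to compare the total variation in Y with the variation left over around the least squares line (the line developed in the regression section of this chapter). The Python sketch below assumes the ten sample pairs from the table; the variable names are illustrative only.

```python
# Data from the Copier Sales of America sample.
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sold = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

n = len(calls)
mean_x = sum(calls) / n
mean_y = sum(sold) / n

# Sums of squared deviations and cross-deviations about the means.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(calls, sold))
sxx = sum((x - mean_x) ** 2 for x in calls)
syy = sum((y - mean_y) ** 2 for y in sold)  # total variation in Y

# Coefficient of determination as the square of r.
r_squared = sxy ** 2 / (sxx * syy)

# Variation in Y left unexplained by the least squares line.
b = sxy / sxx
a = mean_y - b * mean_x
unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(calls, sold))
explained_share = 1 - unexplained / syy

print(round(r_squared, 3), round(explained_share, 3))  # 0.576 0.576
```

Both routes give the same proportion, which is the 57.6 percent quoted in the text.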

Testing the Significance of the Correlation Coefficient

The sales manager of Copier Sales of America found a strong correlation between the number of sales calls and the number of copiers sold. However, only 10 salespeople were sampled. Could it be that the correlation in the population is actually 0? This would mean the correlation of 0.759 was due to chance. The population in this example is all the salespeople employed by the firm.

Resolving this dilemma requires a test to answer the obvious question: Could there be zero correlation in the population from which the sample was selected? To put it another way, did the computed r come from a population of paired observations with zero correlation? We let $\rho$ represent the correlation in the population.

State the null and alternate hypotheses:

$H_0: \rho = 0$ (The correlation in the population is zero.)
$H_1: \rho \neq 0$ (The correlation in the population is different from zero.)

t Test for the Coefficient of Correlation

$$t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$$

with n - 2 degrees of freedom.

Using the .05 level of significance and 10 - 2 = 8 degrees of freedom, the decision rule states that if the computed t falls in the area between plus 2.306 and minus 2.306, the null hypothesis is not rejected. Applying the formula, we get

$$t = \frac{0.759\sqrt{10 - 2}}{\sqrt{1 - (0.759)^2}} = 3.297$$

The computed t of 3.297 is in the rejection region. Thus the null hypothesis is rejected. This means the correlation in the population is not zero.
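The test can be sketched in Python as follows. The critical value 2.306 is taken directly from the text rather than computed, and the function name `t_for_correlation` is an assumption made for this illustration.

```python
import math


def t_for_correlation(r, n):
    """t statistic for testing whether the population correlation is zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


r = 0.759           # sample coefficient of correlation
n = 10              # number of sales representatives sampled
critical_t = 2.306  # two-tailed value quoted in the text (.05 level, 8 df)

t = t_for_correlation(r, n)
reject_null = abs(t) > critical_t

print(round(t, 3), reject_null)  # 3.297 True
```

Because the absolute value of t exceeds the critical value, the sketch reaches the same decision as the text: reject the hypothesis of zero correlation.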

Regression Analysis

Introduction

In the previous section we developed measures to express the strength and the direction of the relationship between two variables. In this section we wish to develop an equation to express the linear (straight line) relationship between two variables. In addition, we want to be able to estimate the value of the dependent variable Y based on a selected value of the independent variable X. The technique used to develop the equation and provide the estimates is called regression analysis.

Regression Equation

We want to develop a linear equation that expresses the relationship between the number of sales calls and the number of units sold.

REGRESSION EQUATION: An equation that expresses the linear relationship between two variables.

The scatter diagram is reproduced below with a line drawn through the dots to illustrate that a straight line would probably fit the data.

Least Squares Principle

Determining a regression equation by minimizing the sum of the squares of the vertical distances between the actual Y values and the predicted values of Y.

General Form of Linear Regression Equation

$$Y' = a + bX$$

where:

$Y'$ is the predicted value of the Y variable for a selected X value.
a is the Y-intercept. It is the estimated value of Y when X = 0.
b is the slope of the line, or the average change in $Y'$ for each change of one unit (either increase or decrease) in the independent variable X.
X is any value of the independent variable that is selected.

The formulas for b and a are:

SLOPE OF THE REGRESSION LINE

$$b = r\,\frac{s_y}{s_x}$$

where:

r is the correlation coefficient.
$s_y$ is the standard deviation of Y (the dependent variable).
$s_x$ is the standard deviation of X (the independent variable).

Y-INTERCEPT

$$a = \bar{Y} - b\bar{X}$$

where:

$\bar{Y}$ is the mean of Y (the dependent variable).
$\bar{X}$ is the mean of X (the independent variable).

Example

Recall the example involving Copier Sales of America. The sales manager gathered information on the number of sales calls and the number of copiers sold for a random sample of 10 sales representatives. As a part of her presentation at the upcoming sales meeting, Ms. Bancer, the sales manager, would like to offer specific information about the relationship between the number of sales calls and the number of copiers sold. Use the least squares method to determine a linear equation to express the relationship between the two variables. What is the expected number of copiers sold by a representative who made 20 calls?

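The calculations necessary to determine the regression equation are:

$$b = r\,\frac{s_y}{s_x} = 0.759\left(\frac{14.337}{9.189}\right) = 1.1842$$

$$a = \bar{Y} - b\bar{X} = 45 - 1.1842(22) = 18.9476$$

Thus the regression equation is:

$$Y' = 18.9476 + 1.1842X$$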

So if a salesperson makes 20 calls, he or she can expect to sell 42.6316 copiers, found by $Y' = 18.9476 + 1.1842(20) = 42.6316$.

The b value of 1.1842 means that for each additional sales call made, the sales representative can expect to increase the number of copiers sold by about 1.2. To put it another way, five additional sales calls in a month will result in about six more copiers being sold, found by 1.1842(5) = 5.921.

The a value of 18.9476 is the point where the equation crosses the Y-axis. A literal translation is that if no sales calls are made, that is, X = 0, 18.9476 copiers will be sold. However, X = 0 is outside the range of values included in the sample and therefore should not be used to estimate the number of copiers sold.
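The least squares fit can be reproduced with a short Python sketch. It computes the slope as the sum of cross-deviations divided by the sum of squared X deviations, which is algebraically the same as $r(s_y / s_x)$. The function name `fit_line` is an assumption for this illustration. Because the sketch carries full precision, its intercept (18.9474) differs in the last digit from the text's 18.9476, which was obtained from the rounded slope.

```python
# Data from the Copier Sales of America sample.
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sold = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]


def fit_line(x, y):
    """Return (a, b) for the least squares line Y' = a + bX."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # Y-intercept
    return a, b


a, b = fit_line(calls, sold)
predicted_20 = a + b * 20  # expected copiers sold for 20 calls

print(round(b, 4), round(a, 4), round(predicted_20, 4))  # 1.1842 18.9474 42.6316
```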

Drawing the Line of Regression

Sales Representative    Sales Calls (X)    Estimated Sales (Y')
Tom Keller              20                 42.6316
Jeff Hall               40                 66.3156
Brian Virost            20                 42.6316
Greg Fish               30                 54.4736
Susan Welch             10                 30.7896
Carlos Ramirez          10                 30.7896
Rich Niles              20                 42.6316
Mike Kiel               20                 42.6316
Mark Reynolds           20                 42.6316
Soni Jones              30                 54.4736

Features of the Line of Best Fit

There is no other line through the data for which the sum of the squared deviations is smaller. In addition, this line passes through the point represented by the mean of the X values and the mean of the Y values, that is, $(\bar{X}, \bar{Y})$. In this example, $\bar{X} = 22$ and $\bar{Y} = 45$.
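Both features can be checked numerically. The Python sketch below is only an illustration, not a proof: it fits the line to the sample, confirms that it passes through the point of means, and compares its sum of squared deviations with those of a few nearby lines. The helper name `sum_squared_dev` and the chosen perturbations are assumptions.

```python
# Data from the Copier Sales of America sample.
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sold = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]


def sum_squared_dev(a, b, x, y):
    """Sum of squared vertical distances between Y and the line a + bX."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


n = len(calls)
mean_x = sum(calls) / n
mean_y = sum(sold) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(calls, sold))
sxx = sum((x - mean_x) ** 2 for x in calls)
b = sxy / sxx
a = mean_y - b * mean_x

# Feature 1: the line passes through the point of means.
passes_through_means = abs((a + b * mean_x) - mean_y) < 1e-9

# Feature 2: nearby lines (shifted intercept and/or slope) fit worse.
best = sum_squared_dev(a, b, calls, sold)
others = [
    sum_squared_dev(a + da, b + db, calls, sold)
    for da in (-1, 0, 1)
    for db in (-0.1, 0, 0.1)
    if (da, db) != (0, 0)
]
no_better_line = all(other > best for other in others)

print(passes_through_means, no_better_line)  # True True
```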

The Standard Error of Estimate

In the preceding scatter diagram, not all of the points lie exactly on the regression line. If they all did lie on the line, there would be no error in estimating the number of units sold. To put it another way, if all the points were on the regression line, units sold could be predicted with 100 percent accuracy. Thus there would be no error in predicting the Y variable based on an X variable. Perfect prediction in business and economics is practically impossible.

Therefore, there is a need for a measure that describes how precise the prediction of Y is based on X, or conversely, how inaccurate the estimates might be. This measure is called the standard error of estimate. The standard error of estimate, denoted by $s_{y \cdot x}$, is the same concept as the standard deviation. The standard deviation measures the dispersion around the mean. The standard error of estimate measures the dispersion about the regression line.

STANDARD ERROR OF ESTIMATE

$$s_{y \cdot x} = \sqrt{\frac{\sum (Y - Y')^2}{n - 2}}$$

The standard deviation is based on the squared deviations from the mean, whereas the standard error of estimate is based on squared deviations between each Y and its predicted value, $Y'$.

If $s_{y \cdot x}$ is small, the data are relatively close to the regression line and the regression equation can be used to predict Y with little error.

If $s_{y \cdot x}$ is large, the data are widely scattered around the regression line and the regression equation will not provide a precise estimate of Y.

For the previous example, determine the standard error of estimate as a measure of how well the values fit the regression line.

Sales Representative    Actual Sales (Y)    Estimated Sales (Y')    Deviation (Y - Y')    Deviation Squared (Y - Y')^2
Tom Keller              30                  42.6316                 -12.6316              159.557
Jeff Hall               60                  66.3156                 -6.3156               39.887
Brian Virost            40                  42.6316                 -2.6316               6.925
Greg Fish               60                  54.4736                 5.5264                30.541
Susan Welch             30                  30.7896                 -0.7896               0.623
Carlos Ramirez          40                  30.7896                 9.2104                84.831
Rich Niles              40                  42.6316                 -2.6316               6.925
Mike Kiel               50                  42.6316                 7.3684                54.293
Mark Reynolds           30                  42.6316                 -12.6316              159.557
Soni Jones              70                  54.4736                 15.5264               241.069
Total                                                               0.0000                784.211

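The standard error of estimate is 9.901, found by using the formula

$$s_{y \cdot x} = \sqrt{\frac{\sum (Y - Y')^2}{n - 2}} = \sqrt{\frac{784.211}{10 - 2}} = 9.901$$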
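The table and the final figure can be reproduced with the following Python sketch. It uses the rounded coefficients published in the text (a = 18.9476, b = 1.1842); the variable names are assumptions for this illustration.

```python
import math

# Data from the Copier Sales of America sample.
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sold = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

# Regression coefficients as rounded in the text.
a, b = 18.9476, 1.1842

predicted = [a + b * x for x in calls]                # Y' for each representative
deviations = [y - yp for y, yp in zip(sold, predicted)]
sum_sq = sum(d ** 2 for d in deviations)              # sum of (Y - Y')^2

n = len(sold)
s_yx = math.sqrt(sum_sq / (n - 2))                    # standard error of estimate

print(round(sum_sq, 3), round(s_yx, 3))  # 784.211 9.901
```

Dividing by n - 2 rather than n - 1 reflects that two quantities, a and b, were estimated from the sample.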

MULTIPLE REGRESSION

Introduction

In the preceding two sections, our discussion of regression and correlation analysis was confined to only two variables. However, in real life we come across many situations where the relationship is not that simple. One variable may be affected by two or more independent variables. For example, the sales of a product, Y, may be related to a number of variables such as price, income, advertising expenditure, season, the number, size and location of the retail outlets, the quality of the product, and so forth.

If in such cases we consider the effect of only one independent variable, the magnitude of the error in the result is likely to be high. In view of this, it is desirable to use two or more independent variables in the estimating equation. The statistical technique of extending linear regression to consider two or more independent variables is known as multiple linear regression.

The multiple linear regression equation takes the following form:

$$Y' = a + b_1X_1 + b_2X_2 + \cdots + b_kX_k$$

where Y is the dependent variable, which is to be predicted; $X_1, X_2, \ldots, X_k$ are the k known variables on which the predictions are to be based; and $a, b_1, b_2, \ldots, b_k$ are parameters, the values of which are to be determined by the method of least squares.

Example

The following data relate to radio advertising expenditures, newspaper advertising expenditures, and sales. Fit a regression equation $Y' = a + b_1X_1 + b_2X_2$.

Radio ad exp. (000 Rs.) ($X_1$)     4     7     9     12
Newspaper ad exp. (000) ($X_2$)     1     2     5     8
Sales (Rs. lakh) (Y)                7     12    17    20

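Solution

Here there are three variables, Y, $X_1$ and $X_2$, so there are three normal equations:

$$\sum Y = na + b_1\sum X_1 + b_2\sum X_2$$
$$\sum X_1Y = a\sum X_1 + b_1\sum X_1^2 + b_2\sum X_1X_2$$
$$\sum X_2Y = a\sum X_2 + b_1\sum X_1X_2 + b_2\sum X_2^2$$

The required sums are computed in the following table:

         X1     X2     Y      X1^2    X1X2    X2^2    X1Y    X2Y
         4      1      7      16      4       1       28     7
         7      2      12     49      14      4       84     24
         9      5      17     81      45      25      153    85
         12     8      20     144     96      64      240    160
Total    32     16     56     290     159     94      505    276

Substituting the totals, with n = 4:

$$56 = 4a + 32b_1 + 16b_2 \qquad (1)$$
$$505 = 32a + 290b_1 + 159b_2 \qquad (2)$$
$$276 = 16a + 159b_1 + 94b_2 \qquad (3)$$

Multiplying (1) by 8 gives (4), and subtracting (2) from (4) gives (5):

$$448 = 32a + 256b_1 + 128b_2 \qquad (4)$$
$$57 = 34b_1 + 31b_2 \qquad (5)$$

Multiplying (3) by 2 gives (6), and subtracting (2) from (6) gives (7):

$$552 = 32a + 318b_1 + 188b_2 \qquad (6)$$
$$47 = 28b_1 + 29b_2 \qquad (7)$$

Multiplying (5) by 14 and (7) by 17 gives (8) and (9), and subtracting (9) from (8) gives $b_2$:

$$798 = 476b_1 + 434b_2 \qquad (8)$$
$$799 = 476b_1 + 493b_2 \qquad (9)$$
$$-1 = -59b_2, \quad b_2 = \frac{1}{59} = 0.0169$$

Substituting the value of $b_2 = 1/59$ in (5) above,

$$57 = 34b_1 + \frac{31}{59}, \quad b_1 = 1.661$$

Substituting the values of $b_1 = 1.661$ and $b_2 = 0.0169$ in (1) above,

$$56 = 4a + 32(1.661) + 16(0.0169), \quad a = 0.644$$

Therefore the multiple regression of Y on $X_1$ and $X_2$ is

$$Y' = 0.644 + 1.661X_1 + 0.0169X_2$$

Because $b_1$ is much larger than $b_2$, radio advertising expenditure is more important than newspaper advertising expenditure.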
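The elimination above can be cross-checked by building the three normal equations from the data and solving them exactly. The Python sketch below uses Gauss-Jordan elimination with exact fractions in place of the hand elimination in the text; the function names `normal_equations` and `solve` are assumptions for this illustration.

```python
from fractions import Fraction

# Data from the advertising example.
x1 = [4, 7, 9, 12]   # radio advertising expenditure
x2 = [1, 2, 5, 8]    # newspaper advertising expenditure
y = [7, 12, 17, 20]  # sales


def dot(u, v):
    """Sum of element-wise products."""
    return sum(ui * vi for ui, vi in zip(u, v))


def normal_equations(x1, x2, y):
    """Coefficient matrix and right-hand side for the unknowns a, b1, b2."""
    n = len(y)
    matrix = [
        [n, sum(x1), sum(x2)],
        [sum(x1), dot(x1, x1), dot(x1, x2)],
        [sum(x2), dot(x1, x2), dot(x2, x2)],
    ]
    rhs = [sum(y), dot(x1, y), dot(x2, y)]
    return matrix, rhs


def solve(matrix, rhs):
    """Gauss-Jordan elimination using exact fractions."""
    size = len(rhs)
    aug = [[Fraction(v) for v in row] + [Fraction(r)]
           for row, r in zip(matrix, rhs)]
    for col in range(size):
        # Find a row with a non-zero entry in this column and move it up.
        pivot_row = next(r for r in range(col, size) if aug[r][col] != 0)
        aug[col], aug[pivot_row] = aug[pivot_row], aug[col]
        pivot = aug[col][col]
        aug[col] = [v / pivot for v in aug[col]]
        # Clear this column in every other row.
        for r in range(size):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


a, b1, b2 = solve(*normal_equations(x1, x2, y))
print(b2)  # 1/59
print(round(float(a), 3), round(float(b1), 3), round(float(b2), 4))  # 0.644 1.661 0.0169
```

Exact arithmetic makes it easy to see that the solution agrees with the value $b_2 = 1/59$ obtained in the text.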
