We thankfully acknowledge Krishna Kanta Handiqui State Open
University (KKHSOU) for adoption of their e-content in our course
materials. However, the screenshots of Microsoft Excel have been
captured from www.excel-easy.com.
GENERIC ELECTIVE IN
COMMERCE (GECO)
GECO-3
Business Statistics
BLOCK – 3
SIMPLE CORRELATION AND
REGRESSION ANALYSIS
UNIT-5 CORRELATION ANALYSIS
UNIT-6 REGRESSION ANALYSIS
UNIT 5 : CORRELATION ANALYSIS
Structure
5.0 Learning Objectives
5.1 Introduction
5.2 Correlation Analysis
5.3 Correlation and Causation
5.4 Types of Correlation
5.5 Methods of Measuring Correlation
5.5.1 Scatter Diagram
5.5.2 Karl Pearson's Coefficient of Correlation
5.5.3 Spearman's Rank Correlation
5.6 Probable error of correlation coefficient
5.7 Coefficient of determination
5.8 Correlation with the Use of MS Excel
5.9 Let us Sum up
5.10 Further Readings
5.11 Answers to Check Your Progress
5.12 Model Questions
5.0 LEARNING OBJECTIVES
After going through this unit, you will be able to:
Learn how correlation analysis expresses quantitatively the degree and
direction of the association between two variables.
Compute and interpret different measures of correlation, namely Karl Pearson’s
Correlation Coefficient and Spearman’s Rank Correlation Coefficient.
5.1 INTRODUCTION
In many business environments, we come across problems or situations where two
variables seem to move in the same direction such as both are increasing or
decreasing. At times an increase in one variable is accompanied by a decline in
another. Thus, if two variables are such that when one changes the other also changes
then the two variables are said to be correlated. The knowledge of such a relationship
is important to make inferences from the relationship between variables in a given
situation. For example, a marketing manager may be interested to investigate the
degree of relationship between advertising expenditure and the sales volume. The
manager would like to know whether the money that he is going to spend on advertising
is justified or not in terms of sales generated.
Correlation analysis is used as a statistical technique to ascertain the association
between two quantitative variables. Usually, in correlation, the relationship between
two variables is expressed by a pure number i.e., a number without having any unit
of measurement. This pure number is referred to as coefficient of correlation or
correlation coefficient which indicates the strength and direction of statistical
relationship between variables.
It may be noted that correlation analysis is one of the most widely employed
statistical devices adopted by applied statisticians and has been used extensively not
only in biological problems but also in agriculture, economics, business and several
other fields. In this unit we shall introduce correlation analysis for two variables.
The importance of examining the relationship between two or more variables can be
stated in the form of the following questions, which accordingly require statistical
devices to arrive at conclusions:
i. Does there exist an association between two or more variables? If it exists, what
is the form and the degree of that relationship?
ii. Is the relationship strong or significant enough to be useful to arrive at a
desirable conclusion?
iii. Can the relationship be used for predictions of future events, that is, to
forecast the most likely value of a dependent variable corresponding to the
given value of independent variable or variables?
The first two questions can be answered with the help of correlation analysis while
the final question can be answered by using the regression analysis.
In case of correlation analysis, the data on values of two variables must come from
sampling in pairs, one for each of the two variables.
5.2 CORRELATION ANALYSIS
By the term ‘correlation’ we mean the relationship between two variables. Two
variables are said to be correlated if the change in one variable results in a
corresponding change in the other variable. In the practical field we need to
investigate the type of relationship that might exist between the ages of husbands and
wives, the heights of fathers and sons, the amount of rainfall and the volume of
production of a particular crop, the price of a commodity and the demand for it, and
so on. For example, if the price of a commodity increases, there is a decline in its
demand; thus there exists correlation between the two variables price and demand. In
correlation, we study the nature and the degree of relationship between the two
variables.
Uses of Correlation Analysis:
In spite of certain limitations, correlation analysis is a widely used statistical device.
With the help of correlation analysis one can ascertain the existence as well as degree
and direction of relations between two variables. It is an indispensable tool of
analysis for people in Economics and Business. Variables in Economics and
Business are usually interrelated. In order to study the nature (positive or negative)
and degree (low, moderate or high) of relationship between any two of such related
variables correlation analysis is used. In reality, besides Business and Economics, it
is extensively used in various other branches.
5.3 CORRELATION AND CAUSATION
Correlation analysis helps us to have an idea about the degree and direction of the
relationship between the two variables under study. However, it fails to reflect upon
the cause and effect relationship between the variables. If there exists a cause and
effect relationship between the two variables, they are bound to vary in sympathy
with each other and, therefore, there is bound to be a high degree of correlation
between them. In other words, causation always implies correlation. However, the
converse is not true i.e., even a fairly high degree of correlation between the two
variables need not imply a cause and effect relationship between them. The high
degree of correlation between the variables may be due to the following reasons:
(i) Mutual dependence: The phenomena under study may mutually influence
each other. Such situations are usually observed in data relating to economic and
business situations. For example, variables like price, supply, and demand of a
commodity are mutually correlated. According to the principle of economics, as the
price of a commodity increases, its demand decreases, so price influences the
demand level. But if demand of a commodity increases due to growth in population,
then its price also increases. In this case increased demand makes an effect on the
price. However, the amount of export of a commodity is influenced by an increase or
decrease in custom duties but the reverse is normally not true.
(ii) Pure chance: It may happen that a small randomly selected sample from
a bivariate distribution (i.e., a distribution in which each unit of the series assumes
two values) may show a fairly high degree of correlation though, actually, such a
relationship may not exist in the universe. Such correlation may be due to chance
fluctuations. Moreover, the conscious or unconscious bias on the part of the
investigator, in the selection of the sample may also result in high degree of
correlation in the sample. It may be noted that in both the phenomena a fairly high
degree of correlation may be observed, though it is not possible to conceive them as
being causally related.
(iii) Influence of external factors: A high degree of correlation may be
observed between two variables due to the effect or interaction of a third variable or
a number of variables on each of these variables. For example, a fairly high degree of
correlation may be observed between the yield per hectare of two crops, say, rice
and potato, due to the effect of a number of factors like favorable weather conditions,
fertilizers used, irrigation facilities, etc., on each of them. But none of the two is the
cause of the other.
5.4 TYPES OF CORRELATION
Correlation may be broadly classified into the following three types:
(a) Positive, Negative and Zero Correlation,
(b) Linear and Non-linear Correlation,
(c) Simple, Partial and Multiple Correlation.
(a) Positive, Negative and Zero Correlation:
Positive Correlation: Two variables are said to be positively or directly correlated if
the values of the two variables deviate or move in the same direction i.e., if the
increase in the values of one variable results, on an average, in a corresponding
increase in the values of the other variable or if a decrease in the values of one
variable results, on an average, in a corresponding decrease in the values of the other
variable. For example, there exists positive correlation between the following pairs of
variables.
(i) The income and expenditure of a family on luxury items,
(ii) Advertising expenditure and the sales volume of a company,
(iii) Amount of rainfall and yield of a crop,
(iv) Height and weight of a student
(v) Price and supply of a commodity,
(vi) Temperature and sale of cold drinks on different days of a month in summer.
When the changes in two related variables are exactly proportional and are in the
same direction then we say that there is perfect positive correlation between them.
For example, there exists perfect positive correlation between the following pairs of
sets of data where each set of data may be assumed to be the values of a variable.
X: 10 20 30 40 50
Y: 2 4 6 8 10
U: 40 35 30 25 20 15 10
V: 14 12 10 8 6 4 2
Negative Correlation: The correlation between two variables is said to be negative
or inverse if the variables deviate in the opposite direction i.e., if the increase
(decrease) in the values of one variable results, on an average, in a corresponding
decrease (increase) in the values of the other variable. The following pairs of
variables are negatively correlated:
(i) Price and demand for a commodity,
(ii) Volume and pressure of a perfect gas,
(iii) Sale of woolen garments and the day temperature,
(iv) Number of workers and time required to complete a work
When the changes in two related variables are exactly proportional but are in the
opposite directions then we say that there is perfect negative correlation between
them. The correlation between each of the following pairs of variables is perfectly
negative:
X: 60 50 40 30 20
Y: 2 4 6 8 10
U: 0 1 2 3
V: 2 -3 -8 -13
Zero Correlation: Two variables are said to have zero correlation or no correlation
if they tend to change with no connection to each other. In such situation the
variables are said to be uncorrelated. For example, one should expect zero correlation
between the yield of crop and the heights of students, or between price of rice and
demand for sugar.
(b) Linear and Non-linear Correlation: The correlation between two variables is
said to be linear if corresponding to a unit change in one variable, there is a constant
change in the other variable over the entire range of the values. The following
example illustrates a linear correlation between the two variables X and Y.
X: 10 20 30 40 50
Y: 40 60 80 100 120
When these pairs of values X and Y are plotted on a graph paper, the line obtained
by joining the points would be a straight line.
In general, two variables X and Y are said to be linearly related if the relationship
that exists between the two variables is of the form given by,
Y= a +b X
Where ‘b’ is the slope and ‘a’ the intercept.
On the other hand, a non-linear correlation indicates that the change in one variable
is not proportional to the change in the other variable. In other words,
correlation is said to be non-linear when the amount of change in the values of one
variable does not bear a constant ratio to the amount of change in the corresponding
values of another variable. The following example illustrates a non-linear correlation
between the given variables.
X: 8 9 9 10 10 28 29 30
Y: 80 130 170 150 230 560 460 600
When these pair of values are plotted on a graph paper, the line obtained by joining
these points would not be a straight line, rather it would be curvi-linear.
(c) Simple, Partial and Multiple Correlation: The distinction amongst Simple,
Partial and Multiple Correlation depends upon the number of variables involved
under study.
In Simple correlation only two variables are introduced to study the relationship
between them. A study on income with respect to saving only, or sales revenue with
respect to amount of money spent on advertisement etc. are a few examples studied
under Simple Correlation.
When the study involves more than two variables then it is a problem of either Partial
or Multiple Correlation. In Partial Correlation, we study relationship between two
variables while the effect of the other variables is held constant. In other words, in Partial
Correlation we study the linear relationship between a dependent variable and one
particular independent variable out of a set of independent variables when all other
variables are held constant. For example, suppose our study involves three variables
X1, X2 and Y where X1 is the number of hours studied, X2 is I.Q. and Y is the marks
secured in the examination. Now if we study the relationship between number of
hours (X1) and marks obtained (Y) by the student keeping the effect of I.Q. (X2)
constant then it is a problem of Partial Correlation.
In Multiple Correlation three or more than three variables are studied simultaneously.
For example, the study of the relationship between the production of a particular crop
on one side and rainfall and use of fertilizer on the other side falls under Multiple
Correlation.
CHECK YOUR PROGRESS
Q 1: State whether the following statements are true or false:
(i) Correlation helps to formulate the relationship between the variables.
(ii) If the relationship between variables x and y is positive, then as the variable y
decreases, the variable x increases.
(iii) In a negative relationship as x increases, y decreases.
(iv) Multiple correlation deals with studying three or more than three variables
simultaneously.
5.5 METHODS OF MEASURING CORRELATION
Here we shall confine our discussion to methods of measuring linear
relationships only. The commonly used methods for studying the correlation between
two variables are:
(i) Scatter Diagram method
(ii) Karl Pearson’s correlation coefficient method
(iii) Spearman’s Rank correlation method
(iv) Concurrent Deviation method
5.5.1 SCATTER DIAGRAM METHOD
Scatter Diagram is one of the simplest methods of diagrammatic representation of a
bivariate distribution and used to study the nature (i.e., positive, negative and zero)
and degree (i.e., weak or strong) of correlation between two variables. A scatter
diagram can be obtained on a graph paper by plotting observed pairs of values of
variables x and y, considering the independent variable values on the x-axis and the
dependent variable values on the y-axis. Suppose we are given n pairs of values
(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ) of two variables X and Y. These n points may be
plotted as dots in the xy-plane. The diagram of dots so obtained is known as a
scatter diagram. From a scatter diagram we can form a fairly good, though rough, idea
about the existence of relationship between the two variables. After plotting the
points in the xy plane we may have one of the types of scatter diagrams as shown
below:
Fig 5.1
Now with the help of Scatter Diagram, we can interpret the correlation between the
two variables as:
(i) If the points are very close to each other, then we can expect a fairly good amount
of correlation between the two variables. On the other hand, if there appears to be no
obvious pattern of the points of the scatter diagram then it indicates that there is
either no correlation or very low amount of correlation between the variables.
(ii) If the points on the scatter diagram reveal any trend (either upward or downward),
the variables are said to be correlated and the variables are uncorrelated if no trend is
revealed.
(iii) If there is an upward trend rising from lower left hand corner to the upper right
hand corner then we expect a positive correlation because in such a situation both the
variables move in the same direction. On the other hand, if the points on the scatter
diagram depict a downward trend starting from upper left hand corner to the lower
right hand corner, the correlation is negative since in this case the values of the two
variables run in the opposite direction.
(iv) In particular, if all the points lie on a straight line starting from the left bottom
and going up towards the right top, the correlation is perfect and positive, and if all
the points lie on a straight line starting from left top and coming down to right
bottom, the correlation is perfect and negative.
Remark: 1. The Scatter Diagram method enables us to form a rough idea of the
nature of the relationship between the two variables simply by inspection of the
graph. However, this method is not suitable for situations involving a large number of
observations.
2. The method of scatter diagram provides information only about the nature of the
relationship that is whether it is positive or negative and whether it is high or low but
fails to provide an exact measure of the extent of the relationship between the two
variables.
Example 5.1: The percentage examination scores of 10 students in Data Analysis
and Economics were as follows. Draw a scatter diagram for the data and comment on
the nature of correlation.
Student : A B C D E F G H I J
Data Analysis: 65 90 52 44 95 36 48 63 80 15
Economics : 62 71 58 58 64 40 42 66 67 55
Solution: A scatter diagram will give a preliminary indication of whether linear
correlation exists. We plot the ordered pairs (65, 62), (90, 71), ..., (15, 55) as
shown in the following figure.
Fig 5.2
Since the points are very close to each other, we may expect a high degree of
correlation. Further, since the points reveal an upward trend starting from left bottom
to top right hand corner, the correlation is positive. Hence, we conclude that there
exists a high degree of positive correlation between the scores of the students in Data
Analysis and Economics.
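As a rough numerical counterpart to reading the scatter diagram, one can count how many pairs of students move in the same direction in both subjects. The sketch below is our own illustration (Python is assumed; it is not part of the original text) applied to the Example 5.1 scores:

```python
from itertools import combinations

# Example 5.1 data: percentage scores of 10 students
data_analysis = [65, 90, 52, 44, 95, 36, 48, 63, 80, 15]
economics     = [62, 71, 58, 58, 64, 40, 42, 66, 67, 55]

concordant = discordant = 0
for i, j in combinations(range(len(data_analysis)), 2):
    dx = data_analysis[i] - data_analysis[j]
    dy = economics[i] - economics[j]
    if dx * dy > 0:
        concordant += 1   # both scores move in the same direction
    elif dx * dy < 0:
        discordant += 1   # scores move in opposite directions

print(concordant, discordant)  # → 37 7
```

With 37 of the 45 pairs concordant, the points trend upward from lower left to upper right, which is what the scatter diagram shows visually.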
5.5.2 KARL PEARSON’S CORRELATION COEFFICIENT METHOD
The scatter diagram gives a rough indication of the nature and extent/strength of the
relationship between the two variables. The quantitative measurement of the degree
of linear relationship between two variables, say ‘x’ and ‘y’, is given by a parameter
called the correlation coefficient, developed by Karl Pearson. Karl Pearson’s method
of measuring correlation between two variables is therefore called the coefficient of
correlation or correlation coefficient. It is also known as the product moment coefficient.
For a set of n pairs of values of x and y, Karl Pearson’s correlation coefficient,
usually denoted by r_XY, r_xy or simply r, is defined by

r = Cov(X, Y) / √(Var(X) · Var(Y))   ..........(11.1)

or, r = Cov(X, Y) / (σ_X σ_Y),

where Cov(X, Y) = (1/n) Σ(X − X̄)(Y − Ȳ),

Var(X) = σ_X² = (1/n) Σ(X − X̄)²   and   Var(Y) = σ_Y² = (1/n) Σ(Y − Ȳ)².

Substituting these values in the definition of r, we have

r = Σ(X − X̄)(Y − Ȳ) / √(Σ(X − X̄)² · Σ(Y − Ȳ)²)   ..........(11.2)

  = (n ΣXY − ΣX ΣY) / (√(n ΣX² − (ΣX)²) · √(n ΣY² − (ΣY)²))   ..........(11.3)
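The computational form (11.3) can be sketched directly in code. The following is a minimal illustration (Python assumed; the function name `pearson_r` is ours, not from the text):

```python
import math

def pearson_r(x, y):
    """Karl Pearson's correlation coefficient via formula (11.3):
    r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))"""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    syy = sum(v * v for v in y)
    sxy = sum(a * b for a, b in zip(x, y))
    num = n * sxy - sx * sy
    den = math.sqrt(n * sxx - sx * sx) * math.sqrt(n * syy - sy * sy)
    return num / den

# The perfectly positively correlated pair from Section 5.4 (Y = X/5)
x = [10, 20, 30, 40, 50]
y = [2, 4, 6, 8, 10]
print(round(pearson_r(x, y), 4))  # → 1.0
```

Running it on the perfect-negative pair of Section 5.4 (X: 60, 50, 40, 30, 20 against Y: 2, 4, 6, 8, 10) gives −1.0, as expected.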
Step deviation method for ungrouped data: When the actual mean values X̄ and Ȳ
are in fractions, the calculation of Pearson’s correlation coefficient can be simplified
by taking deviations of the x and y values from their assumed means A and B,
respectively. Thus when d_x = X − A and d_y = Y − B, where A and B are assumed
means of the x and y values, the formula (11.2) becomes

r = (n Σd_x d_y − Σd_x Σd_y) / (√(n Σd_x² − (Σd_x)²) · √(n Σd_y² − (Σd_y)²))   ..........(11.4)
Step deviation method for grouped data: When the data on x and y values are
classified or grouped into a frequency distribution, (11.4) is modified as:

r = (N Σf d_x d_y − Σf d_x · Σf d_y) / (√(N Σf d_x² − (Σf d_x)²) · √(N Σf d_y² − (Σf d_y)²))   ..........(11.5)
Assumptions of Using Pearson’s Correlation Coefficient:
Karl Pearson’s correlation coefficient r_XY is based on the following four
assumptions:
(i) It is appropriate to calculate when both variables x and y are measured
on an interval or a ratio scale.
(ii) The random variables X and Y are normally distributed.
(iii) There is a linear relationship between the variables.
(iv) There is a cause and effect relationship between two variables that
influences the distributions of both the variables.
Merits and Limitations of Correlation Coefficient: The correlation coefficient is a
numerical number lying between -1 and +1 that summarizes the magnitude as well as
direction of association between two variables. The chief merits of this method are
given below:
(i) Karl Pearson’s coefficient of correlation is a widely used statistical
device.
(ii) It summarizes the degree (high, moderate or low) in one figure.
(iii) It is based on all the observations.
However, analysis based on Pearsonian coefficient is subject to certain severe
limitations which are presented as:
(i) A value of r_XY which is near to zero does not necessarily indicate that the two
variables X and Y are unrelated; it merely indicates that there is no linear
relationship between them. There may be a curvilinear or some complex relationship
between the two variables which Pearson’s formula cannot detect, as it is an
instrument for measuring linear correlation only.
(ii) Correlation is only a measure of the nature and degree of relationship between
two variables and it gives no indication of the kind of cause and effect relationship
that may exist between the two variables. It fails to identify the variables as
dependent or independent variables. Correlation theory simply seeks to discover if a
covariation between two variables exists or not. Statistical correlation technique may
reveal a very close relationship between two variables, but it cannot tell us about the
cause and effect relationship between them or which variable causes the other to react.
(iii) Two causally unrelated variables may exhibit a high degree of correlation between
them. For example, the data relating to the yield of rice and wheat may show a fairly
high degree of positive correlation although there is no connection between the two
variables, viz., yield of rice and yield of wheat. This may be due to the favourable
impact of extraneous factors like weather conditions, fertilizers used, irrigation
facilities, improved variety of seeds etc. on both of them.
(iv) Sometimes high correlation between two variables may be entirely spurious.
However, such a high correlation may exist due to chance and consequently such
correlations are termed as chance correlations.
Properties of Correlation Coefficient:
Some of the important properties of Correlation coefficient are given below:
(a) Correlation coefficient is a pure number.
(b) Correlation coefficient is independent of change of origin and scale of
measurement.
We shall now establish property (b).
Proof: Suppose we want to study the relationship between two variables X and Y.
Let these variables be transformed to the new variables U and V by the change of
origin and scale viz.,

U = (X − a)/h   and   V = (Y − b)/k   ....…………….(11.6)

where a and b are known as assumed means or origins, and h and k (both positive) are
known as scales of measurement.

Therefore from (11.6) we have,

X = a + hU   and   Y = b + kV   ………………...(11.7)

Summing both sides and dividing by n we get,

X̄ = a + hŪ   and   Ȳ = b + kV̄   ………………….(11.8)

Subtracting (11.8) from (11.7) we have,

X − X̄ = h(U − Ū)   and   Y − Ȳ = k(V − V̄)

Putting these values in equation (11.2) we get,

r_XY = Σ h(U − Ū) · k(V − V̄) / √(Σ h²(U − Ū)² · Σ k²(V − V̄)²)
     = hk Σ(U − Ū)(V − V̄) / (hk √(Σ(U − Ū)² · Σ(V − V̄)²))
     = r_UV

Since r_XY = r_UV, the correlation coefficient between the two original
variables X and Y is equal to the correlation coefficient between the new variables U
and V (where the new variables U and V are obtained from X and Y respectively
after changing the origin and scale of X and Y). Hence, we conclude that the correlation
coefficient is independent of change of origin and scale of measurement.
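Property (b) is easy to verify numerically. The sketch below is our own illustration (Python assumed; the data and the constants a = 104, h = 2, b = 15, k = 5 are arbitrary choices): it recomputes r after a change of origin and scale and checks that the value is unchanged.

```python
import math

def r(x, y):
    """Pearson's r via formula (11.2), using deviations from the means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

X = [100, 102, 104, 107, 105, 112, 103, 99]
Y = [15, 12, 13, 11, 12, 12, 19, 26]
U = [(v - 104) / 2 for v in X]   # a = 104, h = 2
V = [(v - 15) / 5 for v in Y]    # b = 15,  k = 5
print(round(r(X, Y), 4), round(r(U, V), 4))  # the two values coincide
```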
(c) Correlation coefficient lies between -1 and +1, i.e., −1 ≤ r ≤ 1.

Proof: Let us consider two variables X and Y with arithmetic means X̄ and Ȳ
and standard deviations σ_X and σ_Y respectively.

Let us now consider the sum of squares

Σ [ (X − X̄)/σ_X ± (Y − Ȳ)/σ_Y ]²

which is always non-negative, i.e.,

Σ [ (X − X̄)/σ_X ± (Y − Ȳ)/σ_Y ]² ≥ 0

Expanding,

Σ(X − X̄)²/σ_X² + Σ(Y − Ȳ)²/σ_Y² ± 2 Σ(X − X̄)(Y − Ȳ)/(σ_X σ_Y) ≥ 0

Now dividing by n, we get

(1/n) Σ(X − X̄)²/σ_X² + (1/n) Σ(Y − Ȳ)²/σ_Y² ± (2/n) Σ(X − X̄)(Y − Ȳ)/(σ_X σ_Y) ≥ 0

1 + 1 ± 2r ≥ 0
2 ± 2r ≥ 0
1 + r ≥ 0   and   1 − r ≥ 0
r ≥ −1   and   r ≤ 1
−1 ≤ r ≤ 1
Remarks: 1. This property provides us a check on our calculations. If in any problem
the obtained value of r lies outside the limits ±1, this implies that there is some
mistake in our calculations.
2. r = +1 indicates perfect positive correlation between the variables and r = −1
indicates perfect negative correlation between the variables.
(d) If X and Y are two independent variables, then r_XY = 0, but the converse is not
true.

Proof: We have

r = Cov(X, Y)/(σ_X σ_Y) = E[(X − X̄)(Y − Ȳ)] / √(E(X − X̄)² · E(Y − Ȳ)²)   ……………(11.9)

Now, E[(X − X̄)(Y − Ȳ)] = E[XY − X Ȳ − X̄ Y + X̄ Ȳ]
= E(XY) − Ȳ E(X) − X̄ E(Y) + X̄ Ȳ
= E(XY) − X̄ Ȳ
= E(X) E(Y) − X̄ Ȳ   (since X and Y are given to be independent, E(XY) = E(X) E(Y))
= X̄ Ȳ − X̄ Ȳ
= 0

Therefore from (11.9),

r = 0 / √(E(X − X̄)² · E(Y − Ȳ)²) = 0

Thus when two variables are independent, they are uncorrelated, i.e., r_XY = 0.

But the converse is not true: if two variables are uncorrelated then they may not be
independent. Since Karl Pearson’s correlation coefficient is a measure of only linear
correlation between two variables, r_XY = 0 indicates only that there exists no linear
relationship between the variables. There may, however, exist a strong non-linear or
curvilinear relationship between X and Y even though r_XY = 0.
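The failure of the converse can be seen with a tiny example of our own (not from the text): take Y = X² over values symmetric about zero, so Y is completely determined by X, yet the covariance, and hence r, vanishes because the relationship is non-linear.

```python
# Variables that are perfectly dependent (Y = X^2) yet uncorrelated.
X = [-2, -1, 0, 1, 2]
Y = [v * v for v in X]          # Y is completely determined by X

n = len(X)
mx, my = sum(X) / n, sum(Y) / n
cov = sum((a - mx) * (b - my) for a, b in zip(X, Y)) / n
print(cov)  # → 0.0  (so r_XY = 0 despite total dependence)
```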
(e) Correlation coefficient is symmetric, i.e., r_XY = r_YX.

Proof: We have,

r_XY = Σ(X − X̄)(Y − Ȳ) / √(Σ(X − X̄)² · Σ(Y − Ȳ)²)   ....(*)

Now interchanging X and Y we get,

r_YX = Σ(Y − Ȳ)(X − X̄) / √(Σ(Y − Ȳ)² · Σ(X − X̄)²)   ....(**)

From (*) and (**) we find that r_XY = r_YX.
Interpretation of Various Values of Correlation Coefficient:
The interpretations of various values of r_XY are as follows:
(i) 0 < r_XY < 1 implies that there is positive correlation between X and Y. The closer
the value of r_XY to 1, the stronger is the positive correlation.
(ii) r_XY = 1 implies that there exists perfect and positive correlation between the
variables.
(iii) −1 < r_XY < 0 implies that there is negative correlation between X and Y. The
closer the value of r_XY to −1, the stronger is the negative correlation.
(iv) r_XY = −1 indicates that the correlation between the variables is perfect and
negative.
(v) r_XY = 0 means that there is no correlation between the variables and hence the
variables are said to be uncorrelated.
CHECK YOUR PROGRESS
Q 2: Given Σ(X − X̄)(Y − Ȳ) = 43, Σ(X − X̄)² = 32 and Σ(Y − Ȳ)² = 72,
what is the correlation coefficient r?
Q 3: Given r = 0.25, Cov(X, Y) = 3.6 and Var(X) = 36, what is S.D.(Y)?
Example 5.2: The following data give indices of industrial production and the number
of registered unemployed people (in lakh). Determine Karl Pearson’s correlation
coefficient.
Year Index of Production Number
Unemployed
1991 100 15
1992 102 12
1993 104 13
1994 107 11
1995 105 12
1996 112 12
1997 103 19
1998 99 26
Solution: To calculate Karl Pearson’s correlation coefficient we prepare the
following table:

Year   X     X − X̄   (X − X̄)²   Y     Y − Ȳ   (Y − Ȳ)²   (X − X̄)(Y − Ȳ)
1991   100   -4       16          15    0        0           0
1992   102   -2       4           12    -3       9           6
1993   104   0        0           13    -2       4           0
1994   107   3        9           11    -4       16          -12
1995   105   1        1           12    -3       9           -3
1996   112   8        64          12    -3       9           -24
1997   103   -1       1           19    4        16          -4
1998   99    -5       25          26    11       121         -55
Total  832   0        120         120   0        184         -92

X̄ = 832/8 = 104;   Ȳ = 120/8 = 15.

Thus Karl Pearson’s Correlation Coefficient,

r_XY = Σ(X − X̄)(Y − Ȳ) / √(Σ(X − X̄)² · Σ(Y − Ȳ)²)
     = −92 / √(120 × 184)
     = −92 / 148.59
     = −0.619

Since the coefficient of correlation r = −0.619 is moderately negative, it indicates that
there is a moderately large inverse correlation between the two variables. Hence, we
conclude that as the production index increases, the number of unemployed
decreases, and vice-versa.
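As a quick cross-check of the result above, the same computation can be done in a few lines (Python assumed; the data are taken from the Example 5.2 table):

```python
import math

X = [100, 102, 104, 107, 105, 112, 103, 99]   # index of production
Y = [15, 12, 13, 11, 12, 12, 19, 26]          # number unemployed (lakh)

n = len(X)
mx, my = sum(X) / n, sum(Y) / n                # 104 and 15
num = sum((a - mx) * (b - my) for a, b in zip(X, Y))            # -92
den = math.sqrt(sum((a - mx) ** 2 for a in X)                   # 120
                * sum((b - my) ** 2 for b in Y))                # 184
print(round(num / den, 3))  # → -0.619
```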
Example 5.3: The following data relate to age of employees and the number of days
they reported sick in a month.
Employees Age Sick
days
1 30 1
2 32 0
3 35 2
4 40 5
5 48 2
6 50 4
7 52 6
8 55 5
9 57 7
10 61 8
Calculate Karl Pearson’s coefficient of correlation and interpret it.
Solution: Let age and sick days be represented by variables X and Y respectively.
Then Karl Pearson’s correlation coefficient is

r_XY = Σ(X − X̄)(Y − Ȳ) / √(Σ(X − X̄)² · Σ(Y − Ȳ)²)

Now we prepare the following table for calculation:

Age X   X − X̄   (X − X̄)²   Y    Y − Ȳ   (Y − Ȳ)²   (X − X̄)(Y − Ȳ)
30      -16      256         1     -3       9           48
32      -14      196         0     -4       16          56
35      -11      121         2     -2       4           22
40      -6       36          5     1        1           -6
48      2        4           2     -2       4           -4
50      4        16          4     0        0           0
52      6        36          6     2        4           12
55      9        81          5     1        1           9
57      11       121         7     3        9           33
61      15       225         8     4        16          60
Totals: ΣX = 460, Σ(X − X̄)² = 1092, ΣY = 40, Σ(Y − Ȳ)² = 64, Σ(X − X̄)(Y − Ȳ) = 230

X̄ = 460/10 = 46;   Ȳ = 40/10 = 4.

Therefore we have,

r_XY = 230 / √(1092 × 64)
     = 230 / 264.36
     = 0.870
Since coefficient of correlation is closer to 1 and positive, therefore age of employees
and number of sick days are positively correlated to a high degree. Hence we
conclude that as the age of an employee increases, he is likely to go on sick leave
more often than others.
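Again the arithmetic can be cross-checked in code (Python assumed; the age and sick-day figures come from the Example 5.3 table):

```python
import math

age  = [30, 32, 35, 40, 48, 50, 52, 55, 57, 61]
sick = [1, 0, 2, 5, 2, 4, 6, 5, 7, 8]

n = len(age)
ma, ms = sum(age) / n, sum(sick) / n                         # 46 and 4
num = sum((a - ma) * (b - ms) for a, b in zip(age, sick))    # 230
den = math.sqrt(sum((a - ma) ** 2 for a in age)              # 1092
                * sum((b - ms) ** 2 for b in sick))          # 64
print(round(num / den, 3))  # → 0.87
```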
Example 5.4: The following data give sales and net profit for some of the top
auto-makers during the quarter July-September 2006. Find the correlation
coefficient.
Company          Average sales estimates (Rs. ’00 crores)   Average net profit estimates (Rs. ’0 crores)
Tata Motors 65.00 47
Hero Honda 22.00 22
Bajaj Auto 24.00 34.5
TVS Motor 10.00 3.5
Bharat Forge 5 6
Ashok Leyland 16.00 9
M & M 24.00 20
Maruti Udyog 34.00 32
Solution: Let the average sales and average net profit of the given automobiles be
denoted by X and Y respectively. Then the correlation coefficient is given by

r = (n ΣXY − ΣX ΣY) / (√(n ΣX² − (ΣX)²) · √(n ΣY² − (ΣY)²))   [Using equation (11.3)]

Now we make the following table for calculation:

Company         X      Y      X²     Y²        XY
Tata Motors     65.00  47     4225   2209      3055
Hero Honda      22.00  22     484    484       484
Bajaj Auto      24.00  34.5   576    1190.25   828
TVS Motor       10.00  3.5    100    12.25     35
Bharat Forge    5.00   6      25     36        30
Ashok Leyland   16.00  9      256    81        144
M & M           24.00  20     576    400       480
Maruti Udyog    34.00  32     1156   1024      1088
Total           200    174    7398   5436.50   6144

Therefore the correlation coefficient is given by,

r = (8 × 6144 − 200 × 174) / (√(8 × 7398 − 200²) · √(8 × 5436.50 − 174²))
  = (49152 − 34800) / (√(59184 − 40000) · √(43492 − 30276))
  = 14352 / (√19184 × √13216)
  = 14352 / (138.506 × 114.961)
  = 14352 / 15922.81
  = 0.90135
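The raw-sums computation can be cross-checked as follows (Python assumed; sales and net-profit figures from the Example 5.4 table):

```python
import math

X = [65.0, 22.0, 24.0, 10.0, 5.0, 16.0, 24.0, 34.0]   # average sales
Y = [47, 22, 34.5, 3.5, 6, 9, 20, 32]                  # average net profit

n = len(X)
num = n * sum(a * b for a, b in zip(X, Y)) - sum(X) * sum(Y)
den = math.sqrt(n * sum(a * a for a in X) - sum(X) ** 2) \
    * math.sqrt(n * sum(b * b for b in Y) - sum(Y) ** 2)
print(round(num / den, 5))  # → 0.90135
```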
Correlation Coefficient in Case of Grouped Data:
In the case of bivariate data, if we are to deal with a large volume of data then the
values are classified in the form of a two-way frequency table known as a bivariate
table or correlation table. Here for each of the variables, the values are classified into
different classes following the same considerations as in the case of univariate
distribution. If there are m classes for the values of the X variable and n classes for
the values of the Y variable, then there will be m × n cells in the two-way table.
Now we shall discuss the calculation of Karl Pearson’s correlation coefficient with
the help of the following example.
Example 5.5: Family income and its percentage spent on food in the case of 100
families gave the following bivariate frequency distribution. Calculate the coefficient
of correlation.
Food Expenditure          Family income (Rs.)
(in %)           200-300   300-400   400-500   500-600   600-700   Total
10-15            __        __        __        3         7         10
15-20            __        4         9         4         3         20
20-25            7         6         12        5         __        30
25-30            3         10        19        8         __        40
Total            10        20        40        20        10        100
Solution: Let us denote the income (in Rs.) and the food expenditure (%) by the
variables X and Y respectively. Now to calculate Karl Pearson’s coefficient of
correlation, we follow the steps given below:
Step 1: Find the mid points of various classes for X and Y series.
Step 2: Change the origin and scale in the X series and Y series to the new variables u
and v by using the transformations:

u = (x − A)/h = (x − 450)/100,   and   v = (y − B)/k = (y − 17.5)/5

where x and y denote the mid points of the X series and Y series respectively, and
h and k denote the class widths of the X and Y series respectively.
Step 3: For each class of X, find the total of cell frequencies of all the classes of Y
and similarly for each class of Y find the total of cell frequencies of all the classes of
X.
Step 4: Multiply the frequencies of X by the corresponding values of the variable u and find the sum Σfu.
Step 5: Multiply the frequencies of Y by the corresponding values of the variable v and find the sum Σfv.
Step 6: Multiply the frequency of each cell by the corresponding values of u and v and write the product fuv within a square in the right-hand top corner of each cell.
Step 7: Add together all the figures in the top corner squares as obtained in Step 6 to get the last column fuv for each of the X and Y series. Finally, find the total of the last column to get Σfuv.
Step 8: Multiply the values of fu and fv by the corresponding values of u and v to get the columns for fu² and fv². Add these values to obtain Σfu² and Σfv².
The above calculations are presented in the following table. In each cell, the first number is the frequency f and the number in brackets is the product fuv:

X:            200-300    300-400    400-500    500-600    600-700
Mid pt.(x):     250        350        450        550        650
u:              -2         -1          0          1          2

Y       Mid pt.(y)   v |   u=-2      u=-1      u=0      u=1       u=2    |    f      fv     fv²     fuv
10-15     12.5      -1 |    __        __        __      3 (-3)    7 (-14)|   10     -10      10     -17
15-20     17.5       0 |    __       4 (0)     9 (0)    4 (0)     3 (0)  |   20       0       0       0
20-25     22.5       1 |  7 (-14)    6 (-6)   12 (0)    5 (5)      __    |   30      30      30     -15
25-30     27.5       2 |  3 (-12)   10 (-20)  19 (0)    8 (16)     __    |   40      80     160     -16
f                      |   10        20        40       20        10     | N=100  Σfv=100  Σfv²=200  Σfuv=-48
fu                     |  -20       -20         0       20        20     | Σfu=0
fu²                    |   40        20         0       20        40     | Σfu²=120
fuv                    |  -26       -26         0       18       -14     | Σfuv=-48
r_uv = [NΣfuv − Σfu·Σfv] / √{[NΣfu² − (Σfu)²][NΣfv² − (Σfv)²]}
= [100 × (−48) − 0 × 100] / √{[100 × 120 − 0²][100 × 200 − (100)²]}
= −4800 / √(12000 × 10000)
= −4800 / 10954.45
= −0.4381

Since the correlation coefficient is independent of change of origin and scale of measurement, therefore, r_XY = r_uv = −0.4381.
5.5.3 SPEARMAN’S RANK CORRELATION COEFFICIENT
So far, we have confined our discussion to correlation between two variables which can be measured and quantified in appropriate units of money, time, etc. However, sometimes the data on two variables are given in the form of ranks of the two variables based on some criterion. Here we introduce a method to study correlation between the ranks of the variables rather than their absolute values. This method was developed by the British psychologist Charles Edward Spearman in 1904. In other words, this method is used in situations in which a quantitative measure of certain qualitative factors such as judgement, brand personalities, TV programmes, leadership, colour or taste cannot be fixed, but individual observations can be arranged in a definite order. The ranking is assigned by using a set of ordinal rank numbers, with 1 for the individual observation ranked first, either in terms of quantity or quality, and n for the individual observation ranked last in a group of n pairs of observations. Mathematically, Spearman's rank correlation coefficient is
defined as:
R = 1 − [6Σd²] / [n(n² − 1)]   …………(11.11)

where R = the rank correlation coefficient,
d = the difference between the pairs of ranks of the same individual in the two characteristics,
n = the number of pairs.
Advantages and Disadvantages of Spearman’s Correlation coefficient method:
Advantages:
(i) It is easy to understand and its application is simpler than Pearson’s
method.
(ii) It can be used to study correlation when variables are expressed in
qualitative terms like beauty, intelligence, honesty, efficiency and so on.
(iii) It is appropriate to measure the association between two variables if the
data type is at least ordinal scaled (ranked).
(iv) The sample data of values of two variables is converted into ranks either
in ascending order or descending order for calculating degree of
correlation between two variables.
Disadvantages:
(i) Values of both variables are assumed to be normally distributed and to describe a linear rather than a non-linear relationship.
(ii) It is not applicable in the case of a bivariate frequency distribution.
(iii) It needs a large amount of computational time when the number of pairs of values of the two variables exceeds 30.
Case I: When ranks are given
When observations in a data set are already arranged in a particular order (rank), consider the differences in pairs of observations to determine d. Square these differences and obtain the total Σd². Finally, apply the formula (11.11) to calculate the correlation coefficient.
Example 5.6: An office has 12 clerks. The long service clerks feel that they should
have a seniority increment based on length of service built into their salary structure.
An assessment of their efficiency by their departmental manager and the personnel
department produces a ranking of efficiency. This is shown below together with a
ranking of their length of service.
Ranking according to
length of service : 1 2 3 4 5 6 7 8 9 10 11 12
Ranking according
to efficiency : 2 3 5 1 9 10 11 12 8 7 6 4
Do the data support the clerks' claim for a seniority increment?
Solution: To determine whether the data support the clerks’ claim, we use
Spearman’s correlation coefficient which is given by
R = 1 − [6Σd²] / [n(n² − 1)]
Since in the given data the ranks have already been assigned, we prepare the following table for calculation.
Ranking according to length of service (R1)   Ranking according to efficiency (R2)   Difference d = R1 − R2   d²
1 2 -1 1
2 3 -1 1
3 5 -2 4
4 1 3 9
5 9 -4 16
6 10 -4 16
7 11 -4 16
8 12 -4 16
9 8 1 1
10 7 3 9
11 6 5 25
12 4 8 64
Total                                                                                                        Σd² = 178
Therefore, Spearman's correlation coefficient is given by,

R = 1 − [6Σd²] / [n(n² − 1)] = 1 − (6 × 178)/[12(12² − 1)] = 1 − 1068/1716 = 0.378
Thus from the result we observe that there exists a low degree of positive correlation between length of service and efficiency. Therefore the claim of the clerks for a seniority increment based on length of service is not justified.
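The calculation in Example 5.6 can be sketched in Python (illustrative variable names of my own, not part of the original text):

```python
# Ranks as given in Example 5.6
service = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
efficiency = [2, 3, 5, 1, 9, 10, 11, 12, 8, 7, 6, 4]

n = len(service)
# Sum of squared rank differences
d2 = sum((a - b) ** 2 for a, b in zip(service, efficiency))
# Spearman's rank correlation coefficient, formula (11.11)
R = 1 - 6 * d2 / (n * (n ** 2 - 1))
```

This gives Σd² = 178 and R ≈ 0.378, matching the table above.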
Example 5.7: Ten competitors in a beauty contest are ranked by three judges in the
following order:
1st Judge : 1 6 5 10 3 2 4 9 7 8
2nd Judge : 3 5 8 4 7 10 2 1 6 9
3rd Judge : 6 4 9 8 1 2 3 10 5 7
Use the rank correlation coefficient to determine which pair of Judges has the nearest
approach to common tastes in beauty.
Solution: The pair of judges who have the nearest approach to common tastes in
beauty can be obtained in 3C2 =3 ways as follows:
(i) Judge 1and Judge 2, (ii) Judge 2 and Judge 3 and (iii) Judge 3 and Judge 1.
Now let R1, R2 and R3 denote the ranks assigned by the first, second and third Judges respectively, and let Rij be the rank correlation coefficient between the ranks assigned by the ith and jth Judges, i, j = 1, 2, 3. Let dij = Ri − Rj be the difference of ranks of an individual given by the ith and jth Judges.
R1   R2   R3   d12 = R1−R2   d13 = R1−R3   d23 = R2−R3   d12²   d13²   d23²
1 3 6 -2 -5 -3 4 25 9
6 5 4 1 2 1 1 4 1
5 8 9 -3 -4 -1 9 16 1
10 4 8 6 2 -4 36 4 16
3 7 1 -4 2 6 16 4 36
2 10 2 -8 0 8 64 0 64
4 2 3 2 1 -1 4 1 1
9 1 10 8 -1 9 64 1 81
7 6 5 1 2 1 1 4 1
8 9 7 -1 1 2 1 1 4
Total: Σd12² = 200, Σd13² = 60, Σd23² = 214

We have n = 10.
Applying the formula, Spearman's rank correlation coefficients are given by,

R12 = 1 − [6Σd12²] / [n(n² − 1)] = 1 − (6 × 200)/[10(10² − 1)] = 1 − 1200/990 = −0.2121

R13 = 1 − [6Σd13²] / [n(n² − 1)] = 1 − (6 × 60)/990 = 1 − 360/990 = 0.6363

R23 = 1 − [6Σd23²] / [n(n² − 1)] = 1 − (6 × 214)/990 = 1 − 1284/990 = −0.2970
Since the correlation coefficient R13 = 0.6363 is the maximum, the pair of the first and third judges has the nearest approach to common tastes in beauty.
Since R12 and R23 are negative, the pairs of judges (1, 2) and (2, 3) have opposite tastes in beauty.
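A possible Python sketch for comparing all three pairs of judges at once (the rank lists below are taken from the d-columns of the table above; names are my own):

```python
from itertools import combinations

ranks = {
    1: [1, 6, 5, 10, 3, 2, 4, 9, 7, 8],   # 1st Judge
    2: [3, 5, 8, 4, 7, 10, 2, 1, 6, 9],   # 2nd Judge
    3: [6, 4, 9, 8, 1, 2, 3, 10, 5, 7],   # 3rd Judge
}
n = 10
R = {}
for i, j in combinations(ranks, 2):
    d2 = sum((a - b) ** 2 for a, b in zip(ranks[i], ranks[j]))
    R[(i, j)] = 1 - 6 * d2 / (n * (n ** 2 - 1))

# Pair with the nearest approach to common tastes: the largest R
best = max(R, key=R.get)
```

`best` comes out as the pair (1, 3), with R13 ≈ 0.636 as computed above.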
Case II: When ranks are not given
Spearman's rank correlation coefficient can also be used even if we are dealing with variables which are measured quantitatively, i.e., when the pairs of observations in the data set are not ranked as in Case I. In such a situation, we shall have to assign ranks to the given set of data. The highest (or lowest) observation is given the rank 1. The next highest (or next lowest) observation is given the rank 2, and so on. It is to be noted that the same approach (i.e., either ascending or descending) should be followed for all the variables under consideration.
Example 5.8: Calculate Spearman’s rank correlation coefficient between
advertising cost and sales from the following data:
Advertisement cost ('000 Rs.): 39 65 62 90 82 75 25 98 36 78
Sales (Rs. lakhs)            : 47 53 58 86 62 68 60 91 51 84
Solution: Let the variable X denote the advertisement cost (‘000 Rs.) and the
variable Y denote the sales (lakhs).
Let us now start ranking from the highest value for both the variables as given below:
X     Y     Rank of X (x)   Rank of Y (y)   d = x − y   d²
39 47 8 10 -2 4
65 53 6 8 -2 4
62 58 7 7 0 0
90 86 2 2 0 0
82 62 3 5 -2 4
75 68 5 4 1 1
25 60 10 6 4 16
98 91 1 1 0 0
36 51 9 9 0 0
78 84 4 3 1 1
Total                                                   Σd² = 30
Here n = 10.
Therefore, Spearman's rank correlation is

R = 1 − [6Σd²] / [n(n² − 1)] = 1 − (6 × 30)/[10(10² − 1)] = 1 − 180/990 = 0.82
The result shows a high degree of positive correlation between Advertising cost and
sales.
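The ranking-then-correlating procedure of Case II can be sketched in Python (the helper name is my own; this data set happens to have no ties):

```python
cost = [39, 65, 62, 90, 82, 75, 25, 98, 36, 78]
sales = [47, 53, 58, 86, 62, 68, 60, 91, 51, 84]

def rank_desc(values):
    # Rank 1 for the highest value (valid only when there are no ties)
    return [1 + sum(v > x for v in values) for x in values]

rx, ry = rank_desc(cost), rank_desc(sales)
n = len(cost)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
R = 1 - 6 * d2 / (n * (n ** 2 - 1))
```

This reproduces Σd² = 30 and R ≈ 0.82 as in Example 5.8.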
Case III: When ranks are equal
While ranking observations in the data set by considering either the highest value or
lowest value as rank 1, we may encounter a situation of more than one observation being of equal size. In such a case, the rank to be assigned to individual observations
is an average of the ranks which these individual observations would have got had
they differed from each other. For example, if two observations are ranked equal at
third place, then the average rank of (3+4)/2=3.5 is assigned to these two
observations. Similarly, if three observations are ranked equal at third place, then the
average rank of (3+4+5)/3=4 is assigned to these three observations.
While equal ranks are assigned to a few observations in the data set, an
adjustment is needed in the Spearman’s rank correlation coefficient formula as given
below:
R = 1 − 6[Σd² + (m1³ − m1)/12 + (m2³ − m2)/12 + ……] / [n(n² − 1)]

where mi (i = 1, 2, …) stands for the number of times an observation is repeated in the data set for both variables.
Example 5.9: A financial analyst wanted to find out whether inventory turnover
influences any company’s earnings per share (in per cent). A random sample of 7
companies listed in a stock exchange were selected and the following data was
recorded for each:
Company   Inventory turnover (number of times)   Earnings per share (%)
A 4 11
B 5 9
C 7 13
D 8 7
E 6 13
F 3 8
G 5 8
Find the strength of association between inventory turnover and earnings per share.
Interpret the findings.
Solution: Let us start ranking from lowest value of both the variables. Since there are
tied ranks, the sum of the tied ranks is averaged and assigned to each of the tied
observations as shown below:
Inventory turnover (x)   Rank R1   Earnings per share (y)   Rank R2   d = R1 − R2   d²
4 2 11 5 -3 9.00
5 3.5 9 4 -0.5 0.25
7 6 13 6.5 -0.5 0.25
8 7 7 1 6.0 36.00
6 5 13 6.5 -1.5 2.25
3 1 8 2.5 -1.5 2.25
5 3.5 8 2.5 1.0 1.00
Total                                                                      Σd² = 51
It may be noted that the value 5 of variable x is repeated twice (m1 = 2) and the values 8 and 13 of variable y are each also repeated twice, so m2 = 2 and m3 = 2. Applying the formula
R = 1 − 6[Σd² + (m1³ − m1)/12 + (m2³ − m2)/12 + (m3³ − m3)/12] / [n(n² − 1)]
= 1 − 6[51 + (2³ − 2)/12 + (2³ − 2)/12 + (2³ − 2)/12] / [7(7² − 1)]
= 1 − 6(51 + 0.5 + 0.5 + 0.5)/336
= 1 − 315/336
= 1 − 0.9375
= 0.0625
The result shows a very weak positive association between inventory turnover and earnings per share.
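A sketch of the tie-corrected calculation in Python, using average ranks for tied values as described above (helper names are my own):

```python
from collections import Counter

def avg_ranks(values):
    # Average rank for ties: mean of the positions the tied values occupy
    s = sorted(values)
    return [(s.index(x) + 1 + s.index(x) + s.count(x)) / 2 for x in values]

x = [4, 5, 7, 8, 6, 3, 5]      # inventory turnover
y = [11, 9, 13, 7, 13, 8, 8]   # earnings per share

r1, r2 = avg_ranks(x), avg_ranks(y)
n = len(x)
d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))

# Tie correction: sum m(m^2 - 1)/12 over every group of m tied values
# in either variable
cf = sum(m * (m ** 2 - 1) / 12
         for vals in (x, y)
         for m in Counter(vals).values() if m > 1)

R = 1 - 6 * (d2 + cf) / (n * (n ** 2 - 1))
```

This reproduces Σd² = 51, a correction of 1.5, and R = 0.0625 as in Example 5.9.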
5.6 PROBABLE ERROR OF CORRELATION COEFFICIENT
Having determined the value of the correlation coefficient, the next step is to find the
extent to which it is dependable. The probable error of the correlation coefficient, usually denoted by P.E.(r), is an old measure of testing the reliability of an observed value of the correlation coefficient in so far as it depends upon the conditions of random sampling.
If r is the observed correlation coefficient in a sample of n pairs of observations, then its standard error, usually denoted by S.E.(r), is given by,

S.E.(r) = (1 − r²)/√n

The probable error of the correlation coefficient is given by,

P.E.(r) = 0.6745 × S.E.(r)

We have taken the factor 0.6745 because in a normal distribution 50% of the observations lie in the range μ ± 0.6745σ, where μ is the mean and σ is the s.d.
Uses of Probable Error:
The important uses of the probable error of the correlation coefficient, i.e., P.E.(r), are given by
(a) P.E.(r) may be used to determine the two limits r ± P.E.(r) within which there is a 50% chance that the correlation coefficients of randomly selected samples from the same population will lie.
(b) P.E.(r) may be used to test if an observed value of a sample correlation coefficient is significant of any correlation in the population. The rules for testing the significance of the population correlation coefficient are as below:
(i) If |r| < P.E.(r) then the population correlation coefficient is not significant.
(ii) If |r| > 6 P.E.(r) then the population correlation coefficient is significant.
(iii) In other situations nothing can be concluded with certainty.
It is to be mentioned that one should use the probable error to test the significance of the population correlation coefficient when n, the number of pairs of observations, is fairly large. Moreover, the probable error can be applied only under the following situations:
(a) The data must have been drawn from a normal population.
(b) The observations included in the sample must be drawn randomly.
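The probable-error rules can be sketched in Python. The sample below uses hypothetical figures (r = 0.6 observed from n = 64 pairs), not data from the text:

```python
import math

def probable_error(r, n):
    # P.E.(r) = 0.6745 * (1 - r^2) / sqrt(n)
    return 0.6745 * (1 - r ** 2) / math.sqrt(n)

# Hypothetical sample: r = 0.6 from n = 64 pairs (illustrative numbers only)
r, n = 0.6, 64
pe = probable_error(r, n)
significant = abs(r) > 6 * pe     # rule (ii): r significant if |r| > 6 P.E.(r)
lower, upper = r - pe, r + pe     # rule (a): 50% limits for sample r values
```

Here P.E.(r) ≈ 0.054, and since 6 × 0.054 = 0.324 < 0.6, this hypothetical r would be judged significant.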
Example 5.10: The following are the marks obtained by 10 students in Mathematics
and Statistics in an examination. Determine the Karl Pearson’s coefficient of
correlation for these two series of marks. Calculate the probable error of this
correlation coefficient and examine the reliability (significance) of the correlation
coefficient. Also compute the limits within which the population correlation
coefficient may be expected to lie.
Marks in Maths   Marks in Statistics
45 35
70 90
65 70
30 40
90 95
40 40
50 60
75 80
85 80
60 50
Solution: Let the marks in Maths and the marks in Statistics be denoted by the variables X and Y respectively. Let us shift both the origin and scale of the original variables X and Y to obtain the new variables U and V as given by

U = (X − 60)/5,   V = (Y − 65)/5   (the scale 5 being a common factor of each of X and Y)
We have, r_XY = r_UV.
Now we prepare the following table to compute r_UV:

X     Y     U     V     U²     V²     UV
45 35 -3 -6 9 36 18
70 90 2 5 4 25 10
65 70 1 1 1 1 1
30 40 -6 -5 36 25 30
90 95 6 6 36 36 36
40 40 -4 -5 16 25 20
50 60 -2 -1 4 1 2
75 80 3 3 9 9 9
85 80 5 3 25 9 15
60 50 0 -3 0 9 0
Total 2 -2 140 176 141
Now, r_XY = r_UV = [nΣUV − ΣU·ΣV] / √{[nΣU² − (ΣU)²][nΣV² − (ΣV)²]}
= [10 × 141 − 2 × (−2)] / √{[10 × 140 − (2)²][10 × 176 − (−2)²]}
= 1414 / √(1396 × 1756)
= 0.9031
Again, P.E.(r) = 0.6745 × (1 − r²)/√n
= 0.6745 × [1 − (0.9031)²]/√10
≈ (0.6745 × 0.19)/3.1623
= 0.128155/3.1623
= 0.0405
Reliability of the value of r :
We have r = 0.9031 and 6 × P.E.(r) = 6 × 0.0405 = 0.243. Since the value of r is much higher than the value of 6 P.E.(r), the value of r is highly significant.
Limits for Population correlation coefficient:
r ± P.E.(r) = 0.9031 ± 0.0405, i.e., 0.8626 and 0.9436
This implies that if we take another sample of size 10 from the same population, then
its correlation coefficient can be expected to lie between 0.8626 and 0.9436.
Note: When we say r is reliable or significant then it usually means that, on
average, students getting good marks in Mathematics also get good marks in
Statistics and students getting poor marks in Mathematics also get poor marks in
Statistics. We must not interpret that all the students getting good (poor) marks in
Mathematics also get good (poor) marks in Statistics. It happens since correlation
indicates an average relationship between two series only and not between the
individual items of the series.
5.7 COEFFICIENT OF DETERMINATION
Coefficient of correlation between two variables is a measure of degree of linear
relationship that may exist in between them and indicates the amount of variation of
one variable which is associated with or is accounted for by another variable. A more
useful and readily comprehensible measure for this purpose is the coefficient of
determination which indicates the percentage of the total variability of the dependent
variable that is accounted for or explained by the independent variable. In other
words, the coefficient of determination gives the ratio of the explained variance to
the total variance. The coefficient of determination is expressed by the square of the
correlation coefficient, i.e., r². The value of r² lies between 0 and 1. For example, let the two variables, say x and y, be inter-dependent, and variation in x causes variation in y. Further, let the correlation coefficient between them be, say, 0.9. The coefficient of determination, in this situation, is (0.9)² = 0.81, which implies that 81% of the variation in the dependent variable y is due to variation in the independent variable x, or is explained by the variation in x. The remaining 100% − 81% = 19% is due to, or is explained by, some other factors.
The various values of the coefficient of determination r² can be interpreted in the following way:
(i) r² = 0 indicates that no variation in y can be explained by the variable x, which in turn indicates that there exists no association between x and y.
(ii) r² = 1 indicates that the values of y are completely explained by x, which in turn indicates that there exists perfect association between x and y.
(iii) 0 < r² < 1 reveals the degree of explained variation in y as a result of variation in the values of x. A value of r² closer to 0 shows a low proportion of variation in y explained by x. Again, a value of r² closer to 1 shows that the value of x can predict the actual value of y.
Example 5.11: Six students of a Management Programme at a certain Institute were selected at random. Their Intelligence Quotient (I.Q.) and the marks obtained by them in the paper in Decision Science (including Statistics) were as follows:
I.Q. Marks in Decision
Sciences (out of 100)
120 85
110 80
130 90
115 88
125 92
120 87
Calculate the coefficient of determination and interpret the result.
Solution: Here, we may consider I.Q. as the independent variable X, and Marks in Decision Science as the dependent variable Y. This happens because the marks obtained would generally depend on the I.Q. of a student.
Now, we prepare the following table for calculation
Now, we prepare the following table for calculation.
Now, we have Coefficient of determination = r², where

r = [Σdxdy − n·d̄x·d̄y] / √{[Σdx² − n·(d̄x)²][Σdy² − n·(d̄y)²]}
= [960 − 6 × 20 × 7] / √{[2650 − 6 × (20)²][382 − 6 × (7)²]}
= 120/√(250 × 88)
= 120/148.32
= 0.809

Therefore r² = 0.6545, which implies that 65.45% of the variation in the marks is explained by I.Q. The remaining 34.55% of the variation in the marks could be due to some other factors like preparation for the examination by the students, their mental frame during the examination, etc.
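Example 5.11's r² can be cross-checked with the raw-sums form of Pearson's formula (an equivalent alternative to the deviation form used above; variable names are my own):

```python
import math

iq = [120, 110, 130, 115, 125, 120]      # X
marks = [85, 80, 90, 88, 92, 87]         # Y

n = len(iq)
sx, sy = sum(iq), sum(marks)
sxx = sum(v * v for v in iq)
syy = sum(v * v for v in marks)
sxy = sum(a * b for a, b in zip(iq, marks))

# Pearson's correlation coefficient from raw sums
r = (n * sxy - sx * sy) / math.sqrt(
    (n * sxx - sx ** 2) * (n * syy - sy ** 2))
r2 = r ** 2  # coefficient of determination
```

This reproduces r ≈ 0.809 and r² ≈ 0.6545 without computing any deviations from assumed means.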
5.8 CORRELATION WITH THE USE OF MS EXCEL
The correlation coefficient (a value between -1 and +1) tells you how strongly two
variables are related to each other. We can use the CORREL function or the Analysis
Tool add-in in Excel to find the correlation coefficient between two variables. A
correlation coefficient of +1 indicates a perfect positive correlation. As variable X
increases, variable Y increases. As variable X decreases, variable Y decreases.
Student   I.Q. (X)   Marks in Decision Science (Y)   dx = X − 100   dy = Y − 80   dx²    dy²    dx·dy
I           120               85                          20             5         400     25     100
II          110               80                          10             0         100      0       0
III         130               90                          30            10         900    100     300
IV          115               88                          15             8         225     64     120
V           125               92                          25            12         625    144     300
VI          120               87                          20             7         400     49     140
Total                                                    120            42        2650    382     960
Average                                                   20             7
CHECK YOUR PROGRESS
Q 4: In what situation is the rank correlation coefficient used?
Q 5: What is the coefficient of determination? Interpret the meaning of r² = 0.49.
- A correlation coefficient of -1 indicates a perfect negative correlation. As variable
X increases, variable Z decreases. As variable X decreases, variable Z increases.
- A correlation coefficient near 0 indicates no correlation.
To use the Analysis Tool add-in in Excel to quickly generate correlation coefficients
between multiple variables, execute the following steps.
1. On the Data tab, in the Analysis group, click Data Analysis.
Note: Can't find the Data Analysis button? You need to load the Analysis ToolPak add-in first.
2. Select Correlation and click OK.
3. For example, select the range A1:C6 as the Input Range.
4. Check Labels in first row.
5. Select cell A8 as the Output Range.
6. Click OK.
Result :
Conclusion: variables A and C are positively correlated (0.91). Variables A and B are not correlated (0.19). Variables B and C are also not correlated (0.11). You can verify these conclusions by looking at the graph.
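If Excel is not available, the same correlation matrix can be built in plain Python. The columns below are made-up stand-ins for the screenshots' A, B and C data (which is not reproduced in this text):

```python
import math

def pearson(x, y):
    # Pearson's r from raw sums
    n = len(x)
    sx, sy = sum(x), sum(y)
    num = n * sum(a * b for a, b in zip(x, y)) - sx * sy
    den = math.sqrt((n * sum(a * a for a in x) - sx ** 2)
                    * (n * sum(b * b for b in y) - sy ** 2))
    return num / den

# Hypothetical columns, standing in for the Input Range A1:C6
data = {
    "A": [2, 4, 6, 8, 10],
    "B": [7, 1, 9, 4, 6],
    "C": [5, 9, 12, 18, 21],
}
# Full correlation matrix, like the Data Analysis > Correlation output
matrix = {(p, q): pearson(data[p], data[q]) for p in data for q in data}
```

As in Excel's output, the diagonal entries are 1 and the matrix is symmetric; here A and C come out strongly positively correlated.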
5.9 LET US SUM UP
Correlation means existence of relationship between variables. When two variables
deviate in the same direction then we have positive correlation, and when they move in the opposite direction we say that there exists negative correlation between the variables. Again, we have learnt that correlation between two variables is said to be
perfectly positive when there is proportional change in the two variables in the same
direction. On the other hand, correlation between two variables is said to be
perfectly negative when there is proportional change in two variables in opposite
directions. Again, if the change in one variable has no relation with the change in the
other variable then it is said that there is zero correlation.
We have learnt various methods with the help of which we can ascertain the
existence of relationship between variables. These methods include: (i) Scatter
diagram method, (ii) Karl Pearson’s correlation coefficient method, (iii) Spearman’s
rank correlation coefficient method.
Scatter diagram is a graphic tool to portray the relationship between variables. Karl
Pearson’s correlation coefficient measures the strength of the linear association
between variables with values near zero indicating a lack of linearity while values
near -1 or +1 suggest linearity. Karl Pearson coefficient of correlation is designated
by r .
We have also discussed a very important phenomenon in Correlation analysis which
is termed as Coefficient of determination. It is defined as the fraction of the variation
in one variable that is explained by the variation in the other variable and in other
words, it measures the proportion of variation in the dependent variable y that can be
attributed to independent variable x . It ranges from 0 to 1 and is the square of the
coefficient of correlation. Thus a coefficient of determination of 0.82 suggests that 82% of the variation in Y is accounted for by X.
5.10 FURTHER READINGS
1) Srivastava, T.N., Rego, S. (2008). Statistics for Management. New Delhi. Tata
McGraw Hill Education Private Limited.
2) Sharma, J.K. (2007). Business Statistics. New Delhi. Pearson Education Ltd.
3) Hazarika, P.L. (2016). Essential Statistics For Economics And Business Studies.
New Delhi. Akansha Publishing House.
4) Lind, D.A., Marshal, W.G., Wathen, S.A. (2009) Statistical Techniques in
Business and Economics. New Delhi. Tata McGraw Hill Education Private Limited.
5) Bajpai, N. (2014). Business Statistics. New Delhi. Pearson Education Ltd.
5.11 ANSWERS TO CHECK YOUR PROGRESS
Ans. to Q No 1: (i) False, (ii) False, (iii) True, (iv) True.
Ans. to Q No 2:

r = Σ(X − X̄)(Y − Ȳ) / √{Σ(X − X̄)² × Σ(Y − Ȳ)²} = 0.9
Ans. to Q No 3: We have, r = Cov(X, Y)/(σX σY)

0.25 = 3.6/(6 × σY)   (since Var(X) = 36, therefore σX = 6)

⟹ σY = 2.4

Thus S.D.(Y) = 2.4.
Ans. to Q No 4: Rank correlation coefficient is used in a situation in which
quantitative measure of certain qualitative factors such as judgment, leadership,
colour, tastes etc. cannot be fixed, but individual observations can be arranged in a
definite order.
Ans. to Q No 5: Coefficient of determination is a statistical measure of the proportion
of the variation in the dependent variable that is explained by independent variable.
Coefficient of determination r² = 0.49 or 49% indicates that only 49% of the variation in the dependent variable y can be accounted for in terms of the variable x. The remaining 51% of the variability may be due to other factors.
Ans. to Q No 6: (i) True, (ii) True, (iii) False, (iv) False, (v) True.
Ans. to Q No 7: (i) True, (ii) False, (iii) True.
5.12 MODEL QUESTIONS
1. What is correlation? Define positive, negative and zero correlation.
2. What is a scatter diagram? Discuss by means of suitable scatter diagrams different
types of correlation that may exist between the variables in bivariate data.
3. What is Karl Pearson’s correlation coefficient? How would you interpret the value
of a coefficient correlation?
4. Distinguish between the coefficient of determination and the coefficient of
correlation. How would you interpret the value of a coefficient of determination?
5. What is rank correlation coefficient method? Bring out its usefulness.
6. Write a short note on the probable error of correlation coefficient.
7. Find the coefficient of correlation from the following data:
Cost : 39 65 62 90 82 75 25 98 36 78
Sales : 47 53 58 86 62 68 60 91 51 84
Also interpret the result.
8. From the following data, calculate Karl Pearson's correlation coefficient:
(i) n = 9, Σ(Xᵢ − X̄)² = 120, Σ(Yᵢ − Ȳ)² = 346, Σ(Xᵢ − X̄)(Yᵢ − Ȳ) = 193.
(ii) ΣX = 125, ΣY = 80, ΣX² = 1586, ΣY² = 650, ΣXY = 1007, n = 10.
9. Calculate the coefficient of correlation and its probable error from the following
data:
X: 1 2 3 4 5 6 7 8 9 10
Y: 20 16 14 10 10 9 8 7 6 5
10. Two departmental managers ranked a few trainees according to their perceived abilities. The rankings are given below:
Trainee A B C D E F G H I J
Manager A 1 9 6 2 5 8 7 3 10 4
Manager B 3 10 8 1 7 5 6 2 9 4
Calculate an appropriate correlation coefficient to measure the consistency in the
ranking.
UNIT 6 : REGRESSION ANALYSIS
Structure
6.0 Learning Objectives
6.1 Introduction
6.2 Regression Analysis
6.3 Correlation vs Regression
6.4 Regression Lines
6.4.1 Determination of Regression Lines of Y on X
6.4.2 Determination of Regression Lines of X on Y
6.4.3 Regression Coefficients
6.5 Standard Error of Estimate
6.6 Regression Analysis with the use of Ms EXCEL
6.7 Let us Sum up
6.8 Further Readings
6.9 Answer to Check your Progress
6.10 Model Questions
6.0 LEARNING OBJECTIVES
After going through this unit you will be able to
• Use regression analysis for estimating the relationship between variables
• Explain how correlation is different from regression
• Use the least squares method for estimating an equation to predict future values of the dependent variable
6.1 INTRODUCTION
In many business environments, we come across problems or situations where two
variables seem to move in the same direction such as both are increasing or
decreasing. At times an increase in one variable is accompanied by a decline in
another. Thus, if two variables are such that when one changes the other also
changes then the two variables are said to be correlated. The knowledge of such a
relationship is important to make inferences from the relationship between variables
in a given situation. For example, a marketing manager may be interested to
investigate the degree of relationship between advertising expenditure and the sales
volume. The manager would like to know whether the money that he is going to spend on advertising is justified or not in terms of the sales generated.
Correlation analysis is used as a statistical technique to ascertain the association
between two quantitative variables. Usually, in correlation, the relationship between
two variables is expressed by a pure number i.e., a number without having any unit
of measurement. This pure number is referred to as coefficient of correlation or
correlation coefficient which indicates the strength and direction of statistical
relationship between variables.
It may be noted that correlation analysis is one of the most widely employed
statistical devices adopted by applied statisticians and has been used extensively not
only in biological problems but also in agriculture, economics, business and several
other fields. In this unit we shall introduce correlation analysis for two variables.
The importance of examining the relationship between two or more variables can be
stated in the form of following questions and accordingly requires the statistical
devices to arrive at conclusions:
(i) Does there exist an association between two or more variables? If exists, what is
the form and the degree of that relationship?
(ii) Is the relationship strong or significant enough to be useful to arrive at a
desirable conclusion?
(iii) Can the relationship be used for predictions of future events, that is, to forecast
the most likely value of a dependent variable corresponding to the given value
of independent variable or variables?
The first two questions can be answered with the help of correlation analysis while
the final question can be answered by using the regression analysis.
In case of correlation analysis, the data on values of two variables must come from
sampling in pairs, one for each of the two variables.
6.2 REGRESSION ANALYSIS
Correlation analysis deals with exploring the correlation that might exist between
two or more variables and indicates the degree and direction of their association, but
fails to answer the question:
Is there any functional relationship between two variables? If yes, can it be used to
estimate the most likely value of one variable, given the value of other variable?
Thus the statistical technique that expresses the relationship between two or more
variables in the form of an equation to estimate the value of a variable, based on the
given value of another variable, is called regression analysis. The variable whose
value is estimated using the algebraic equation is called dependent variable and the
variable whose value is used to estimate this value is called independent variable.
The linear algebraic equation used for expressing a dependent variable in terms of
independent variable is called linear regression equation.
In many business situations, it has been observed that decision making is based upon
the understanding of the relationship between two or more variables. For example, a
sales manager might be interested in knowing the impact of advertising on sales.
Here, advertising can be considered as an independent variable and sales can be
considered as the dependent variable. This is an example of simple linear regression
where a single independent variable is used to predict a single numerical dependent
variable.
The meaning of the term regression is “stepping back towards the average.” The
term regression was first introduced by Sir Francis Galton in 1877. His study on the
height of one thousand fathers and sons exhibited an interesting result where he
found that tall fathers tend to have tall sons and short fathers tend to have short sons.
However, the average height of the sons of a group of tall fathers was less than that
of the fathers. Galton concluded that abnormally tall or short parents tend to
“regress” or “step-back” to the average population height.
Advantages of Regression Analysis:
Some of the important advantages of regression analysis are given below:
1. Regression analysis helps in developing a regression equation with the help of
which the value of a dependent variable can be estimated for any given value of
the independent variable.
2. It helps to determine the standard error of estimate, which measures the variability of the values of the dependent variable about the regression line, i.e., how well the line fits the data. When all the points fall on the line, the standard error of estimate becomes zero.
3. When the sample size is large (n ≥ 30), the interval estimation for predicting the value of a dependent variable based on the standard error of estimate is considered to be acceptable. The magnitude of r² remains constant regardless of a change of origin or scale in the values of either x or y.
6.3 CORRELATION VS REGRESSION
(a) With the help of correlation one measures the covariation between two variables.
In correlation neither variable may be termed as dependent or independent variable.
Since correlation does not establish a relationship between the two variables as such
one cannot estimate the value of one variable corresponding to a given value of the
other variable. Regression establishes a functional relationship between two
variables and hence one can estimate the value of one variable corresponding to a
given value of the other variable.
(b) With the help of Correlation analysis, one cannot study which variable is the
cause and which variable is the effect. For example, a high degree of positive
correlation between price and supply does not indicate whether supply is the effect
of price or price is the effect of supply. Regression analysis, in contrast to
correlation, determines the cause-and-effect relationship between x and y , that is, a
change in the value of independent variable x causes a corresponding change in the
value of dependent variable y if all other factors that affect y remain unchanged.
(c) The correlation coefficient between two variables $x$ and $y$ is always symmetric,
i.e., $r_{XY} = r_{YX}$. But the regression coefficient is not symmetric in general, i.e., $b_{XY} \neq b_{YX}$.
(d) The correlation coefficient is independent of change of both origin and scale;
regression coefficients are independent of change of origin only, but not of scale.
6.4 REGRESSION LINES
A regression line is the line from which one can get the best estimated value of the
dependent variable corresponding to a given value of the independent variable. Thus
a regression line is the line of best fit. The term best fit is interpreted in
accordance with the Principle of Least Squares which consists in minimizing the
sum of the squares of the residuals or the errors of estimates.
In case of two variables x and y we usually have two regression lines because each
variable may usually be treated as the dependent as well as the independent variable.
For example, let us consider two variables, namely price (P) and supply (S). We
know that, other conditions remaining the same, if the price of a commodity increases
(decreases), the supply of the commodity also increases (decreases). In this case, P
is the independent variable and S is the dependent variable. Also, other conditions
remaining the same, when the supply of a commodity increases (decreases), its price
decreases (increases). In this case supply S is the independent variable and price P
is the dependent variable. Thus for two variables x and y we have two regression
lines. It is to be noted that
(i) when $r_{XY} = \pm 1$, i.e., when there exists either perfect positive or
perfect negative correlation between $x$ and $y$, then both the lines of
regression coincide.
(ii) when $r_{XY} = 0$, i.e., when $x$ and $y$ are uncorrelated, then the two lines
of regression become perpendicular to each other.
Thus when we consider the variable x as the independent and the variable y as
dependent then we get the regression equation of y on x and similarly in case of
regression equation of x on y we will have y as the independent variable and x as
the dependent variable. Sometimes, of course, from two correlated variables it is not
possible to obtain both the regression lines. For example, if the variable x denotes
the amount of rainfall in some years and the variable y denotes the production of
paddy in these years then obviously, x can be considered only as the independent
variable whereas y can be considered only as the dependent variable.
The regression line of y on x is that line which gives the best estimated value of y
corresponding to a given value of .x
The regression line of x on y is that line which gives the best estimated value of x
corresponding to a given value of .y
6.4.1 DETERMINATION OF THE REGRESSION LINE OF y ON x
Let $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ be $n$ pairs of observations on the two variables
$x$ and $y$ under study. Let

$y = a + bx$ …………….(11.12)

be the line of regression (best fit) of $y$ on $x$.
For any given point $P_i(x_i, y_i)$ in the scatter diagram, the error of estimate or residual
as given by the line of best fit (11.12) is $P_iH_i$, where $H_i$ is the point on the line
vertically below (or above) $P_i$. The $x$-coordinates of $H_i$ and $P_i$ are the same,
viz., $x_i$, and since $H_i$ lies on the line (11.12), the $y$-coordinate of $H_i$,
i.e., $H_iM$, is $a + bx_i$. Hence the error of estimate for $P_i$ is given by

$P_iH_i = P_iM - H_iM = y_i - (a + bx_i)$
which is the error for the $i$th point. We will have such errors for all the points on the
scatter diagram. For the points which lie above the line, the error would be positive
and for the points which lie below the line, the error would be negative.
By applying the method of least squares, the unknown constants $a$ and $b$ in (11.12)
need to be determined in such a manner that the sum of the squares of the errors of
estimates is minimum. In other words, we have to minimize

$E = \sum_{i=1}^{n} (P_iH_i)^2 = \sum_{i=1}^{n} [y_i - (a + bx_i)]^2$

subject to variations in $a$ and $b$.
$E$ may also be expressed as

$E = \sum (y - y_e)^2 = \sum [y - (a + bx)]^2$, ……(11.13)

where $y_e$ is the estimated value of $y$ as given by (11.12) for a given value of $x$, and
the summation ($\sum$) is taken over the $n$ pairs of observations.
Using the principle of maxima and minima in differential calculus, $E$ will have an
optimum (maximum or minimum) for variations in $a$ and $b$ if its partial derivatives
w.r.t. $a$ and $b$ vanish separately. Hence, from (11.13) we get

$\frac{\partial E}{\partial a} = 0 \Rightarrow -2\sum [y - (a + bx)] = 0$

$\frac{\partial E}{\partial b} = 0 \Rightarrow -2\sum x[y - (a + bx)] = 0$

On simplifying, we have,

$\sum y = na + b\sum x$ …………(11.14)

and $\sum xy = a\sum x + b\sum x^2$ ……….....(11.15)

Equations (11.14) and (11.15) are called normal equations. From the given values of
$x$ and $y$ we calculate $\sum x$, $\sum y$, $\sum x^2$ and $\sum xy$. Putting these values in (11.14)
and (11.15) and solving these two equations simultaneously for $a$ and $b$, we get the
values of $a$ and $b$ which are given by

$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}$ ………………….(11.16)

and $a = \bar{y} - b\bar{x}$
Now putting these values in equation (11.12) we get the regression equation of y
on x which is given by
$y - \bar{y} = b(x - \bar{x})$

where $b$ is called the regression coefficient of $y$ on $x$ and is generally denoted by
the symbol $b_{yx}$. Thus writing $b_{yx}$ for $b$ in the above equation we have

$y - \bar{y} = b_{yx}(x - \bar{x})$ ……………..(11.17)
The regression line of y on x given by (11.17) is to be used to estimate the most
probable or average value of y for any given value of .x
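The normal-equation route above is easy to check numerically. The sketch below (not part of the original unit; the data are illustrative) computes the slope from the closed form (11.16), sets $a = \bar{y} - b\bar{x}$, and then confirms that the pair $(a, b)$ indeed satisfies both normal equations (11.14) and (11.15):

```python
# Sketch: fitting y = a + b*x by least squares (illustrative data, not from the unit).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

Sx, Sy = sum(x), sum(y)
Sxx = sum(xi * xi for xi in x)
Sxy = sum(xi * yi for xi, yi in zip(x, y))

# Closed-form slope (11.16) and intercept a = ybar - b*xbar
b = (n * Sxy - Sx * Sy) / (n * Sxx - Sx ** 2)
a = Sy / n - b * Sx / n

# Verify that (a, b) satisfy both normal equations
assert abs(Sy - (n * a + b * Sx)) < 1e-7       # (11.14)
assert abs(Sxy - (a * Sx + b * Sxx)) < 1e-7    # (11.15)

print(round(a, 4), round(b, 4))
```

Solving the two normal equations simultaneously and using the closed form (11.16) are equivalent; the assertions make that equivalence explicit.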
6.4.2 DETERMINATION OF THE REGRESSION LINE OF x ON y
Let the line of regression of $x$ on $y$ to be estimated for the given data be:

$x = a + by$ ……………(11.18)

Applying the least squares method in a similar manner as discussed in the case of the
regression line of $y$ on $x$, we have the following two normal equations for
determining the values of $a$ and $b$:

$\sum x = na + b\sum y$ …………….(11.19)

$\sum xy = a\sum y + b\sum y^2$ ……………..(11.20)

By solving these equations simultaneously for $a$ and $b$ and putting these values in
equation (11.18) we obtain the regression line of $x$ on $y$ given by

$x - \bar{x} = b_{xy}(y - \bar{y})$ ………………(11.21)

where

$b_{xy} = \frac{n\sum xy - \sum x \sum y}{n\sum y^2 - (\sum y)^2}$ ……………….(11.22)

Equation (11.21) is the regression line of $x$ on $y$, where $b_{xy}$ given by equation
(11.22) is called the regression coefficient of $x$ on $y$.
The regression line of y on x given by (11.17) is to be used to estimate the most
probable or average value of y for any given value of .x
The regression line of x on y given by (11.21) is to be used to estimate the most
probable or average value of x for any given value of .y
Remark: (i) When there exists either perfect positive correlation or perfect negative
correlation between $x$ and $y$ then $r = \pm 1$, and consequently the regression equation
of $y$ on $x$ becomes:

$y - \bar{y} = \pm\frac{\sigma_y}{\sigma_x}(x - \bar{x})$ ………………..(*)

On the other hand, in such a situation the regression equation of $x$ on $y$ becomes:

$x - \bar{x} = \pm\frac{\sigma_x}{\sigma_y}(y - \bar{y})$, i.e., $y - \bar{y} = \pm\frac{\sigma_y}{\sigma_x}(x - \bar{x})$ ……………..(**)

From equations (*) and (**) we conclude that we have the same line. Thus if there
exists either perfect positive correlation or perfect negative correlation between the
two variables, i.e., when $r = \pm 1$, the two regression lines coincide.
(ii) The two lines of regression pass through the common point $(\bar{x}, \bar{y})$, since this
point satisfies both the regression equations.
6.4.3 REGRESSION COEFFICIENT
The regression coefficient of $y$ on $x$, i.e., $b_{yx}$, gives the amount of increase (decrease)
in $y$ corresponding to one unit increase (decrease) in $x$ when $b_{yx}$ is positive. On the
other hand, a negative value of $b_{yx}$ gives the amount of decrease (increase) in $y$
corresponding to a unit increase (decrease) in $x$.
Similarly, if $b_{xy}$ is positive, it gives the amount of increase (decrease) in $x$
corresponding to a unit increase (decrease) in $y$. Again, a negative value of $b_{xy}$
gives the amount of decrease (increase) in $x$ corresponding to a unit increase
(decrease) in $y$.
Other expressions of regression coefficients:
The regression coefficient of $y$ on $x$, i.e., $b_{yx}$, can also be expressed as

$b_{yx} = \frac{\mathrm{Cov}(x, y)}{\sigma_x^2}$ ………………(A)

and $b_{yx} = r\frac{\sigma_y}{\sigma_x}$ ………….(B)

where $r$ is the correlation coefficient between $x$ and $y$.
Again, we have,

$\mathrm{Cov}(x, y) = \frac{1}{n}\sum(x - \bar{x})(y - \bar{y})$ and $\sigma_x^2 = \frac{1}{n}\sum(x - \bar{x})^2$

Therefore, from (A),

$b_{yx} = \frac{\sum(x - \bar{x})(y - \bar{y})}{\sum(x - \bar{x})^2} = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}$ ……………..(C)

We are left with the same expression as obtained in equation (11.16).
Similarly, the regression coefficient of $x$ on $y$ can also be expressed as below:
We have,

$b_{xy} = \frac{\mathrm{Cov}(x, y)}{\sigma_y^2}$ ……..………(A′)

Also, $b_{xy} = r\frac{\sigma_x}{\sigma_y}$ …………..(B′)

Again we have, $\sigma_y^2 = \frac{1}{n}\sum(y - \bar{y})^2$, so that from (A′),

$b_{xy} = \frac{\sum(x - \bar{x})(y - \bar{y})}{\sum(y - \bar{y})^2} = \frac{n\sum xy - \sum x \sum y}{n\sum y^2 - (\sum y)^2}$ ……………(C′)

which is the same expression as obtained in equation (11.22).
Step deviation method for ungrouped data: When the actual mean values $\bar{x}$ and $\bar{y}$
are in fractions, the calculation of regression coefficients can be simplified by taking
deviations of $x$ and $y$ values from their assumed means A and B, respectively. Thus,
when $d_x = X - A$ and $d_y = Y - B$, where A and B are the assumed means of the $x$ and $y$
values, then

$b_{yx} = \frac{n\sum d_x d_y - \sum d_x \sum d_y}{n\sum d_x^2 - (\sum d_x)^2}$ ……….(D)

$b_{xy} = \frac{n\sum d_x d_y - \sum d_x \sum d_y}{n\sum d_y^2 - (\sum d_y)^2}$ ……….(D′)
Properties of Regression Coefficients:
Property 1: The correlation coefficient is the geometric mean of the regression
coefficients.
Proof: We have the regression coefficient of $y$ on $x$:

$b_{yx} = r\frac{\sigma_y}{\sigma_x}$ ……….(11.13)

Similarly, the regression coefficient of $x$ on $y$:

$b_{xy} = r\frac{\sigma_x}{\sigma_y}$ ……….(11.14)

Multiplying (11.13) and (11.14) we get,

$b_{yx} \cdot b_{xy} = r\frac{\sigma_y}{\sigma_x} \cdot r\frac{\sigma_x}{\sigma_y} = r^2$

$\therefore r = \pm\sqrt{b_{yx} b_{xy}}$

Thus the correlation coefficient $r$ is the geometric mean of the two regression
coefficients $b_{yx}$ and $b_{xy}$.
Property 2: The correlation coefficient and the two regression coefficients are
simultaneously positive or simultaneously negative.
Proof: We have,

$b_{yx} = r\frac{\sigma_y}{\sigma_x}$ and $b_{xy} = r\frac{\sigma_x}{\sigma_y}$ ………………(11.14)

Standard deviation, being the positive square root of the variance, can never be
negative. Here we assume that $\sigma_x > 0$ and $\sigma_y > 0$.
Therefore from (11.14), we observe that when $r$ is positive, both $b_{yx}$ and $b_{xy}$ are
positive, and when $r$ is negative, both $b_{yx}$ and $b_{xy}$ are negative.
Thus we can conclude that when both $b_{yx}$ and $b_{xy}$ are positive then

$r = +\sqrt{b_{yx} b_{xy}}$

and when both $b_{yx}$ and $b_{xy}$ are negative then

$r = -\sqrt{b_{yx} b_{xy}}$
Property 3: The product of the two regression coefficients cannot exceed unity.
Proof: We have

$r = \pm\sqrt{b_{yx} b_{xy}}$ and $-1 \le r \le 1$

If the product of the two regression coefficients exceeds unity, then $\sqrt{b_{yx} b_{xy}}$
will also exceed unity, since the square root of a number greater than one is itself
greater than one. In that case $r$ would be greater than 1 if $b_{yx}$ and $b_{xy}$ are
positive, and less than $-1$ if they are negative, which is impossible since $-1 \le r \le 1$.
This shows that the product of the two regression coefficients cannot exceed unity.
Property 4: The two regression coefficients are independent of the change of origin
but are dependent on the change of scale.
Proof: The property states that if we change the origin of the regression coefficients
then the values of the regression coefficients remain unchanged but if we change
their scale then their values get changed.
Let $u$ and $v$ be the new variables obtained by changing the origin and the scale of the
original variables $x$ and $y$ as follows:

$u = \frac{x - a}{h}$, $v = \frac{y - b}{k}$ ……………….(i)

where $a$, $b$, $h$ and $k$ ($h, k > 0$) are constants.
Since the correlation coefficient is independent of the change of origin and scale, we
have,

$r_{xy} = r_{uv}$ ……………….(ii)

Due to the transformation (i),

$\sigma_x = h\sigma_u$ and $\sigma_y = k\sigma_v$

Now,

$b_{yx} = r_{xy}\frac{\sigma_y}{\sigma_x} = r_{uv}\frac{k\sigma_v}{h\sigma_u} = \frac{k}{h} b_{vu}$

i.e., $b_{yx} = \frac{k}{h} b_{vu}$ ………………(iii)

Similarly,

$b_{xy} = r_{xy}\frac{\sigma_x}{\sigma_y} = r_{uv}\frac{h\sigma_u}{k\sigma_v} = \frac{h}{k} b_{uv}$

i.e., $b_{xy} = \frac{h}{k} b_{uv}$ ……………………..(iv)
Hence, from equations (iii) and (iv) we conclude that the two regression coefficients
are independent of the change of origin but not of the change of scale.
Property 5: Regression coefficient is not symmetric, i.e., in general, $b_{xy} \neq b_{yx}$.
Proof: We have,

$b_{xy} = r\frac{\sigma_x}{\sigma_y}$ and $b_{yx} = r\frac{\sigma_y}{\sigma_x}$

We observe that, in general, $b_{xy} \neq b_{yx}$.
The regression coefficients $b_{yx}$ and $b_{xy}$ and the correlation coefficient $r$
become equal only when $\sigma_x = \sigma_y$, which usually does not occur.
CHECK YOUR PROGRESS
Q 6: State whether the following statements are true or false:
(i) Regression analysis is a statistical technique that expresses the functional
relationship in the form of an equation.
(ii) Correlation coefficient is the geometric mean of regression coefficients.
(iii) If one of the regression coefficients is greater than one the other must also be
greater than one.
(iv) The product of regression coefficients is always more than one.
(v) If $b_{xy}$ is negative, then $b_{yx}$ is negative.
6.5 STANDARD ERROR OF AN ESTIMATE
The regression equations enable us to estimate the value of the dependent variable
for any given value of the independent variable. The estimates so obtained are,
however, not perfect. A measure of the precision of the estimates so obtained from
the regression equations is provided by the Standard Error (S.E.) of the estimate.
While standard deviation of the values of a variable measures the variation or
scatteredness of the values about their arithmetic mean, the standard error of estimate
measures the variation of scatteredness of the points or dots of the scatter diagram
about the regression line. The more closely the dots cluster around the regression
line, the more representative the line is so far as the relationship between the two
variables is concerned and the better is the estimate based on the equation of this
line. If all the dots lie on the regression line then there exists no variation about the
line and, as a result, the correlation between the variables will be perfect.
Thus, the standard error (S.E.) of estimate of $y$ for given $x$, denoted by $S_{yx}$, is defined by

$S_{yx} = \sqrt{\frac{\sum(y - \hat{y})^2}{n - 2}}$

To simplify the calculation of $S_{yx}$, the following equivalent formula is used:

$S_{yx} = \sqrt{\frac{\sum y^2 - a\sum y - b\sum xy}{n - 2}}$

where $a$ and $b$ are respectively the intercept and the slope of the regression line of
$y$ on $x$, which are to be determined by using the method of least squares.
Similarly, the standard error (S.E.) of estimate of $x$ for given $y$, denoted by $S_{xy}$, is
defined by

$S_{xy} = \sqrt{\frac{\sum(x - \hat{x})^2}{n - 2}}$

To simplify the calculation of $S_{xy}$, the following equivalent formula is used:

$S_{xy} = \sqrt{\frac{\sum x^2 - a\sum x - b\sum xy}{n - 2}}$

where $a$ and $b$ are respectively the intercept and the slope of the regression line of $x$
on $y$, which are to be determined by using the method of least squares.
Again, much more convenient formulae for numerical computations are given by

$S_{yx} = \sigma_y\sqrt{1 - r^2}$ and $S_{xy} = \sigma_x\sqrt{1 - r^2}$
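The definitional and computational formulas for $S_{yx}$ agree exactly, as a quick check shows. (Note that the convenient shortcut $\sigma_y\sqrt{1-r^2}$ corresponds to a divisor of $n$ rather than $n-2$, so for small samples it only approximates the definition above.) A sketch using the advertising-sales data of Example 6.4:

```python
import math

# Illustrative data (the advertising-sales pairs of Example 6.4)
x = [10, 12, 15, 23, 20]
y = [14, 17, 23, 25, 21]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((a - xbar) ** 2 for a in x)
Sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))

b = Sxy / Sxx          # slope of the regression line of y on x
a = ybar - b * xbar    # intercept

# Definition: S_yx = sqrt( sum of squared residuals / (n - 2) )
resid_ss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
S_def = math.sqrt(resid_ss / (n - 2))

# Equivalent computational formula: sqrt( (sum y^2 - a*sum y - b*sum xy) / (n - 2) )
Sy2 = sum(yi * yi for yi in y)
Sxy_raw = sum(xi * yi for xi, yi in zip(x, y))
S_short = math.sqrt((Sy2 - a * sum(y) - b * Sxy_raw) / (n - 2))

assert abs(S_def - S_short) < 1e-9
print(round(S_def, 3))  # ≈ 2.595
```

The two expressions are algebraically identical whenever $a$ and $b$ are the least squares estimates, which is what the assertion confirms.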
Example 6.1 : A company is introducing a job evaluation scheme in which all jobs
are graded by points for skill, responsibility, and so on. Monthly pay scales (Rs. in
1000’s) are then drawn up according to the number of points allocated and other
factors such as experience and local conditions. To date the company has applied
this scheme to 9 jobs:
Job Points Pay(Rs.)
A 5 3.0
B 25 5.0
C 7 3.25
D 19 6.5
E 10 5.5
F 12 5.6
G 15 6.0
H 28 7.2
I 16 6.1
(a) Find the least squares regression line for linking pay scales to points.
(b) Estimate the monthly pay for a job graded by 20 points.
Solution: We consider monthly pay (Y ) as the dependent variable and job grade
points ( X ) as the independent variable. Now, the least square regression line for
linking pay scales to points i.e., the line of regression of Y on X is given by,
XXbYY yx
Now, we prepare the following table for calculation
Points (X)   d_X = X − 15   d_X²   Pay (Y)   d_Y = Y − 5   d_Y²    d_X·d_Y
5            −10            100    3.0       −2.00         4.00    20.0
25           10             100    5.0       0.00          0.00    0.0
7            −8             64     3.25      −1.75         3.06    14.0
19           4              16     6.5       1.50          2.25    6.0
10           −5             25     5.5       0.50          0.25    −2.5
12           −3             9      5.6       0.60          0.36    −1.8
15           0              0      6.0       1.00          1.00    0.0
28           13             169    7.2       2.20          4.84    28.6
16           1              1      6.1       1.10          1.21    1.1
Total: 137   2              484    48.15     3.15          16.97   65.40
(a) Here, $\bar{X} = \frac{\sum X}{n} = \frac{137}{9} = 15.22$; $\bar{Y} = \frac{\sum Y}{n} = \frac{48.15}{9} = 5.35$

Since the mean values $\bar{X}$ and $\bar{Y}$ are non-integer values, deviations are taken
from assumed means as done in the above table.

$b_{YX} = \frac{n\sum d_X d_Y - \sum d_X \sum d_Y}{n\sum d_X^2 - (\sum d_X)^2} = \frac{9 \times 65.40 - 2 \times 3.15}{9 \times 484 - (2)^2} = \frac{582.3}{4352} = 0.133$

Substituting these values of $\bar{X}$, $\bar{Y}$ and $b_{YX}$ in the regression line, we have

$Y - 5.35 = 0.133(X - 15.22)$
$Y = 0.133X - 2.024 + 5.35$
$Y = 0.133X + 3.326$

(b) For job grade point $X = 20$, the estimated average pay scale is given by

$Y = 3.326 + 0.133X = 3.326 + 0.133 \times 20 = 5.986$

Hence, the likely monthly pay for a job with grade points 20 is Rs. 5.986 thousand,
i.e., about Rs. 5,986.
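Example 6.1 can be reproduced in a few lines; small differences in the last digit come from rounding $b_{YX}$ to 0.133 in the hand computation:

```python
# Reproducing Example 6.1: regression of monthly pay (Y) on job points (X)
X = [5, 25, 7, 19, 10, 12, 15, 28, 16]
Y = [3.0, 5.0, 3.25, 6.5, 5.5, 5.6, 6.0, 7.2, 6.1]
n = len(X)

Xbar, Ybar = sum(X) / n, sum(Y) / n
b = (sum((x - Xbar) * (y - Ybar) for x, y in zip(X, Y))
     / sum((x - Xbar) ** 2 for x in X))

# (a) least squares line: Y - Ybar = b (X - Xbar)
# (b) estimated monthly pay for a job graded 20 points
pay_at_20 = Ybar + b * (20 - Xbar)

print(round(b, 3), round(pay_at_20, 3))  # ≈ 0.134 and ≈ 5.989 (5.986 in the hand computation)
```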
Example 6.2 : In the estimation of regression equations of two variables X and Y
the following results were obtained
$\bar{X} = 90$, $\bar{Y} = 70$, $n = 10$, $\sum x^2 = 6360$, $\sum y^2 = 2860$, $\sum xy = 3900$
where $x = X - \bar{X}$ and $y = Y - \bar{Y}$.
Obtain the two regression equations.
Solution: We have the line of regression of Y on X given by,
$Y - \bar{Y} = b_{YX}(X - \bar{X})$ ....................(i)

where

$b_{YX} = \frac{\sum(X - \bar{X})(Y - \bar{Y})}{\sum(X - \bar{X})^2} = \frac{\sum xy}{\sum x^2} = \frac{3900}{6360} = 0.6132$

From (i) the required regression equation is

$Y - 70 = 0.6132(X - 90)$
$Y = 0.6132X - 55.188 + 70$
$Y = 0.6132X + 14.812$

Similarly, the line of regression of X on Y is given by,

$X - \bar{X} = b_{XY}(Y - \bar{Y})$ ....................(ii)

where

$b_{XY} = \frac{\sum(X - \bar{X})(Y - \bar{Y})}{\sum(Y - \bar{Y})^2} = \frac{\sum xy}{\sum y^2} = \frac{3900}{2860} = 1.3636$

From (ii) the required regression equation is

$X - 90 = 1.3636(Y - 70)$
$X = 1.3636Y - 95.452 + 90$
$X = 1.3636Y - 5.452$
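Since Example 6.2 supplies only summary statistics, both regression equations follow directly from $b_{YX} = \sum xy / \sum x^2$ and $b_{XY} = \sum xy / \sum y^2$; a minimal check (the text's intercepts differ in the third decimal because it rounds the coefficients first):

```python
# Given summary statistics of Example 6.2
Xbar, Ybar, n = 90, 70, 10
Sxx, Syy, Sxy = 6360, 2860, 3900   # sums of squares/products of deviations

b_yx = Sxy / Sxx   # 0.6132...
b_xy = Sxy / Syy   # 1.3636...

# Regression of Y on X: Y = Ybar + b_yx (X - Xbar)  ->  Y ≈ 0.6132 X + 14.81
a_yx = Ybar - b_yx * Xbar
# Regression of X on Y: X = Xbar + b_xy (Y - Ybar)  ->  X ≈ 1.3636 Y - 5.45
a_xy = Xbar - b_xy * Ybar

print(round(b_yx, 4), round(a_yx, 3), round(b_xy, 4), round(a_xy, 3))
```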
Example 6.3: If the two lines of regression are:

$4x - 5y + 30 = 0$ and $20x - 9y - 107 = 0$,

which of these is the line of regression of $x$ on $y$? Determine $r_{xy}$ and $\sigma_y$ when
$\sigma_x = 3$.
Solution: We are given the regression lines as

$4x - 5y + 30 = 0$ ....................(i)
$20x - 9y - 107 = 0$ ....................(ii)

In order to determine the line of regression of $x$ on $y$, we need to apply the property
of regression coefficients, i.e., $b_{yx} \cdot b_{xy} = r^2 \le 1$. In the given problem, let (i)
be the line of regression of $x$ on $y$ and (ii) be the line of regression of $y$ on $x$.

From (i), $x = \frac{5}{4}y - \frac{30}{4} \Rightarrow b_{xy} = \frac{5}{4}$

From (ii), $y = \frac{20}{9}x - \frac{107}{9} \Rightarrow b_{yx} = \frac{20}{9}$

Now, $r^2 = b_{yx} \cdot b_{xy} = \frac{20}{9} \times \frac{5}{4} = 2.7778$

But $0 \le r^2 \le 1$, therefore our assumption is wrong.
Hence (i) is the line of regression of $y$ on $x$ and (ii) is the line of regression of $x$ on $y$.
Taking (i) as the line of regression of $y$ on $x$ we have,

$4x - 5y + 30 = 0 \Rightarrow y = \frac{4}{5}x + 6$

Regression coefficient of $y$ on $x$: $b_{yx} = \frac{4}{5}$

Similarly, taking (ii) as the line of regression of $x$ on $y$ we have,

$20x - 9y - 107 = 0 \Rightarrow x = \frac{9}{20}y + \frac{107}{20}$

Regression coefficient of $x$ on $y$: $b_{xy} = \frac{9}{20}$

$r^2 = b_{yx} \cdot b_{xy} = \frac{4}{5} \times \frac{9}{20} = 0.36$
$r = \pm\sqrt{0.36} = \pm 0.6$
$r = 0.6$ (since both the regression coefficients are positive, $r$ must be positive.)

Again, we have, $b_{xy} = r\frac{\sigma_x}{\sigma_y}$

$\frac{9}{20} = 0.6 \times \frac{3}{\sigma_y} \Rightarrow \sigma_y = \frac{0.6 \times 3}{0.45} = 4$ (since $\sigma_x = 3$)
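The logic of Example 6.3, trying each assignment of the two lines and rejecting the one with $b_{yx} \cdot b_{xy} > 1$, can be automated (a sketch; the slopes below are read off the two given lines by hand):

```python
import math

# Slopes read off the two given lines:
#   (i)  4x - 5y + 30 = 0   ->  as y on x: b = 4/5 ; as x on y: b = 5/4
#   (ii) 20x - 9y - 107 = 0 ->  as y on x: b = 20/9; as x on y: b = 9/20
candidates = [
    (20 / 9, 5 / 4),   # assume (ii) is y on x and (i) is x on y
    (4 / 5, 9 / 20),   # assume (i) is y on x and (ii) is x on y
]
# Keep the assignment whose product b_yx * b_xy = r^2 lies in [0, 1]
b_yx, b_xy = next((byx, bxy) for byx, bxy in candidates if 0 <= byx * bxy <= 1)

r = math.sqrt(b_yx * b_xy)       # both coefficients positive, so r > 0
sigma_x = 3
sigma_y = b_yx * sigma_x / r     # from b_yx = r * sigma_y / sigma_x

print(round(r, 6), round(sigma_y, 6))  # 0.6 and 4.0
```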
Example 6.4 : The following data relate to advertising expenditure (Rs. in lakh) and
their corresponding sales (Rs. in crore):
Advertising expenditure: 10 12 15 23 20
Sales : 14 17 23 25 21
(a) Find the equation of the least squares line fitting the data.
(b) Estimate the value of sales corresponding to advertising expenditure of Rs. 30
lakh.
(c) Calculate the standard error of estimate of sales on advertising expenditure.
Solution: (a) Let the advertising expenditure be denoted by x and sales by y. Then
we obtain the least squares line of y on x which is of the form given by,
$y - \bar{y} = b_{yx}(x - \bar{x})$ ....................(i)

where

$b_{yx} = \frac{n\sum d_x d_y - \sum d_x \sum d_y}{n\sum d_x^2 - (\sum d_x)^2}$ ……………….(ii)
Now we construct the following table for calculation.
Advt. Exp. (x)   d_x = x − 16   d_x²   Sales (y)   d_y = y − 20   d_y²   d_x·d_y
10               −6             36     14          −6             36     36
12               −4             16     17          −3             9      12
15               −1             1      23          3              9      −3
23               7              49     25          5              25     35
20               4              16     21          1              1      4
Total: 80        0              118    100         0              84     84
$\bar{x} = \frac{\sum x}{n} = \frac{80}{5} = 16$; $\bar{y} = \frac{\sum y}{n} = \frac{100}{5} = 20$

From (ii), $b_{yx} = \frac{5 \times 84 - 0}{5 \times 118 - 0} = \frac{420}{590} = 0.712$

(a) Therefore from (i) the regression equation of $y$ on $x$ is

$y - 20 = 0.712(x - 16)$
$y = 8.608 + 0.712x$

which is the required least squares line of sales on advertising expenditure.
(b) The least squares line obtained in part (a) may be applied to estimate the sales
turnover corresponding to the advertising expenditure of Rs. 30 lakh as:

$\hat{y} = 8.608 + 0.712x = 8.608 + 0.712 \times 30 = 29.968$, i.e., Rs. 29.968 crore.
(c) The standard error of estimate of sales ($y$) on advertising expenditure ($x$),
denoted by $S_{yx}$, is defined by

$S_{yx} = \sqrt{\frac{\sum y^2 - a\sum y - b\sum xy}{n - 2}}$ …………………….(iii)

Now we make the following table to calculate $S_{yx}$:

x     y     y²     xy
10    14    196    140
12    17    289    204
15    23    529    345
23    25    625    575
20    21    441    420
80    100   2080   1684
From (iii),

$S_{yx} = \sqrt{\frac{2080 - 8.608 \times 100 - 0.712 \times 1684}{5 - 2}} = \sqrt{\frac{2080 - 860.8 - 1199.008}{3}} = \sqrt{6.731} = 2.594$
CHECK YOUR PROGRESS
Q 7: State whether the following statements are true or false:
(i) Standard error of estimate is a measure of scatter of the observations about the
regression line.
(ii) The standard error of estimate of y on x, $S_{yx}$, is equal to $\sigma_y\sqrt{1 - r^2}$.
(iii) Smaller the value of yxS , better the line fits the data.
6.6 REGRESSION ANALYSIS WITH THE USE OF MS EXCEL
This example teaches you how to run a linear regression analysis in Excel and how
to interpret the Summary Output. Below you can find our data. The big question is:
is there a relation between Quantity Sold (output) and Price and Advertising (inputs)?
In other words: can we predict Quantity Sold if we know Price and Advertising?
1. On the Data tab, in the Analysis group, click Data Analysis.
Note: can't find the Data Analysis button? Load the Analysis ToolPak add-in first.
2. Select Regression and click OK.
3. Select the Y Range (A1:A8). This is the response variable (also called the
dependent variable).
4. Select the X Range (B1:C8). These are the explanatory variables (also called
independent variables). These columns must be adjacent to each other.
5. Check Labels.
6. Click in the Output Range box and select cell A11.
7. Check Residuals.
8. Click OK.
Excel produces the following Summary Output (rounded to 3 decimal places).
R Square
R Square equals 0.962, which is a very good fit. 96% of the variation in Quantity
Sold is explained by the independent variables Price and Advertising. The closer to
1, the better the regression line (read on) fits the data.
Significance F and P-values
To check if your results are reliable (statistically significant), look at Significance F
(0.001). If this value is less than 0.05, you're OK. If Significance F is greater than
0.05, it's probably better to stop using this set of independent variables. Delete a
variable with a high P-value (greater than 0.05) and rerun the regression until
Significance F drops below 0.05. Most or all P-values should be below 0.05. In our
example this is the case (0.000, 0.001 and 0.005).
Coefficients
The regression line is: Quantity Sold = 8536.214 − 835.722 × Price + 0.592 ×
Advertising. In other words, for each unit increase in Price, Quantity Sold decreases
by 835.722 units. For each unit increase in Advertising, Quantity Sold increases
by 0.592 units. This is valuable information.
You can also use these coefficients to do a forecast. For example, if price equals $4
and Advertising equals $3000, you might be able to achieve a Quantity Sold of
8536.214 -835.722 * 4 + 0.592 * 3000 = 6970.
Residuals
The residuals show you how far away the actual data points are from the predicted
data points (using the equation). For example, the first data point equals 8500. Using
the equation, the predicted data point equals 8536.214 -835.722 * 2 + 0.592 * 2800 =
8523.009, giving a residual of 8500 - 8523.009 = -23.009.
You can also create a scatter plot of these residuals.
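The forecast and residual arithmetic above can be replicated directly from the reported coefficients. A sketch; since the Summary Output rounds the coefficients to three decimals, the results differ slightly from Excel's full-precision values (6970 and −23.009):

```python
# Coefficients as reported (rounded) in the Excel Summary Output
intercept, b_price, b_adv = 8536.214, -835.722, 0.592

def predict(price, advertising):
    """Predicted Quantity Sold for a given Price and Advertising spend."""
    return intercept + b_price * price + b_adv * advertising

# Forecast: price $4, advertising $3000 -> roughly 6970 units
forecast = predict(4, 3000)

# Residual for the first data point (actual 8500 at price 2, advertising 2800)
residual = 8500 - predict(2, 2800)

print(round(forecast, 1), round(residual, 1))
```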
6.7 LET US SUM UP
This unit focuses on the process of developing a model, known as a regression
model, which is used to predict the value of a dependent variable from one (or more)
independent variable(s). Here we have discussed only simple linear regression
analysis, which involves just two variables. The variable whose value is influenced
or is to be predicted is called the dependent variable, and the variable which
influences the value or is used for prediction is called the independent
variable. Once the line of regression is developed, by substituting the required
variable values and values of regression coefficient, regressed values, or predicted
values can be obtained.
Having developed a regression model to predict the dependent variable with the help
of the independent variable, we need to focus on a few measures of variation. As one
such measure we have introduced the standard error of estimate (S.E.). It measures the
dispersion of the actual values $y_i$ around the regression line. Low values of S.E.
indicate that the points cluster closely about the regression line.
6.8 FURTHER READINGS
Srivastava, T.N., Rego, S. (2008). Statistics for Management. New Delhi.
Tata McGraw Hill Education Private Limited.
Sharma, J.K. (2007). Business Statistics. New Delhi. Pearson Education Ltd.
Hazarika, P.L. (2016). Essential Statistics For Economics And Business
Studies. New Delhi. Akansha Publishing House.
Lind, D.A., Marshal, W.G., Wathen, S.A. (2009) Statistical Techniques in
Business and Economics. New Delhi. Tata McGraw Hill Education Private
Limited.
Bajpai, N. (2014). Business Statistics. New Delhi. Pearson Education Ltd.
6.9 ANSWERS TO CHECK YOUR PROGRESS
Ans. to Q No 1: (i) False, (ii) False, (iii) True, (iv) True.
Ans. to Q No 2:

$r = \frac{\sum(X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum(X - \bar{X})^2 \sum(Y - \bar{Y})^2}} = \frac{43}{\sqrt{32 \times 72}} = 0.9$
Ans. to Q No 3: We have, $r = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$

$0.25 = \frac{3.6}{6\sigma_Y}$ (since $\mathrm{Var}(X) = 36$, therefore $\sigma_X = 6$)

$\sigma_Y = 2.4$

Thus S.D.(Y) = 2.4.
Ans. to Q No 4: Rank correlation coefficient is used in a situation in which
quantitative measure of certain qualitative factors such as judgment, leadership,
colour, tastes etc. cannot be fixed, but individual observations can be arranged in a
definite order.
Ans. to Q No 5: Coefficient of determination is a statistical measure of the
proportion of the variation in the dependent variable that is explained by independent
variable.
Coefficient of determination $r^2 = 0.49$, or 49%, indicates that only 49% of the
variation in the dependent variable y can be accounted for in terms of variable x. The
remaining 51% of the variability may be due to other factors.
Ans. to Q No 6: (i) True, (ii) True, (iii) False, (iv) False, (v) True.
Ans. to Q No 7: (i) True, (ii) False, (iii) True.
6.10 MODEL QUESTIONS
1. Explain the concept of regression and point out its usefulness in dealing with
business problems.
2. What is linear regression? Why are there two regression lines? When do these
become identical?
3. Show that $r^2 = b_{xy} \cdot b_{yx}$.
4. You are given below the following information about advertisement expenditure
and sales:
Adv. Exp.(x) Sales(y)
(Rs. in crore) (Rs. in crore)
Mean 20 120
Standard deviation 5 25
Correlation coefficient is 0.8.
(a) Obtain both the regression equations.
(b) Find the likely sales when advertisement expenditure is Rs. 25 crore.
(c) What should be the advertisement budget if the company wants to attain sales
target of Rs. 150 crore?
5. A company believes that the number of salespersons employed is a good predictor
of sales. The following table exhibits sales (in thousands Rs.) and the number of
salespersons employed for different years:
Sales (in thousands Rs.):          120  125  118  115  100  130  140  135  130  123
Number of salespersons employed:   10   15   12   18   20   21   22   20   15   19
Obtain a simple regression model to predict sales based on the number of
salespersons employed.
6. The HR manager of a multinational company wants to determine the relationship
between experience and income of employees. The following data are collected
from 14 randomly selected employees.
Employee:                   1   2   3   4   5   6   7   8   9   10  11  12  13  14
Experience (in years):      2   4   5   6   7   8   9   10  12  13  14  15  16  18
Income (in thousands Rs.):  30  40  45  35  50  60  70  65  60  55  75  80  85  75
(a) Develop a regression model to predict income based on the years of experience.
(b) Calculate the coefficient of determination and interpret the result.
(c) Calculate the standard error of estimate.
(d) Predict the income of an employee who has 22 years of experience.