7/30/2019 (12) Bivariate Data
1/31
Applied Statistics and Computing Lab
BIVARIATE DATA
Applied Statistics and Computing Lab
Indian School of Business
Learning goals
Understanding bivariate data
Understanding the idea of correlation
Understanding linear regression
Bivariate Data
Why study variables together
Variation in one variable may or may not affect the variation in another variable.
Understanding the relationship: when the value of one variable changes, compare the other variable for
  direction of movement, and
  magnitude of movement.
Prediction: if a new value of one variable is observed, can we predict the corresponding value of the other variable?
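Statistics for bivariate data
Data type I: a list of n paired observations
  X      Y
  x_1    y_1
  ..     ..
  x_n    y_n
Data type II: a table of joint relative frequencies p_ij (used when there are multiple observations with the same values of X and Y)
  X|Y      y_1    ..    y_m    Totals
  x_1      p_11   ..    p_1m   p_1.
  ..       ..     ..    ..     ..
  x_k      p_k1   ..    p_km   p_k.
  Totals   p_.1   ..    p_.m   1
E(XY)
  Data type I:  $E(XY) = \frac{x_1 y_1 + x_2 y_2 + \dots + x_n y_n}{n} = \frac{1}{n}\sum_{i} x_i y_i$
  Data type II: $E(XY) = x_1 y_1 p_{11} + x_1 y_2 p_{12} + \dots + x_k y_m p_{km} = \sum_{i}\sum_{j} x_i y_j p_{ij}$
E(X), E(Y), V(X) and V(Y) are calculated as per the univariate mean and variance formulae.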
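A minimal R sketch of the two ways of computing E(XY). The numbers are hypothetical and chosen only for illustration; the same four pairs are first used as raw data (Data type I) and then as a table of joint relative frequencies (Data type II), so both routes should give the same value.

```r
# Data type I: raw (x, y) pairs (hypothetical values, for illustration only)
x <- c(1, 2, 2, 3)
y <- c(4, 5, 5, 7)
exy_raw <- mean(x * y)             # E(XY) = sum(x_i * y_i) / n

# Data type II: the same data as a table of joint relative frequencies
p  <- prop.table(table(x, y))      # p[i, j] = share of pairs with X = x_i and Y = y_j
xv <- as.numeric(rownames(p))      # distinct X values
yv <- as.numeric(colnames(p))      # distinct Y values
exy_tab <- sum(outer(xv, yv) * p)  # E(XY) = sum_i sum_j x_i * y_j * p_ij

print(c(raw = exy_raw, table = exy_tab))
```

Both entries of the printed vector agree, because tabulating relative frequencies only regroups identical pairs.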
Covariance (denoted by Cov)
We understand the variation in a single variable by looking at the movement of its values from a central tendency.
For bivariate data, we want to look at the combined deviation.
The sign of such a measure may tell us about the two variables and how they covary.
Hence we can take the product of the two sets of deviations.
The covariance calculates just this! It is defined as the expected value of the product of the deviation of X from its mean and the deviation of Y from its mean*.
A reasonable measure of joint variation:
$$\operatorname{cov}(X,Y) = E[(X - E(X))(Y - E(Y))]$$
$$\operatorname{cov}(X,Y) = E(XY) - E(X)\,E(Y)$$
$$\operatorname{cov}(X,Y) = \frac{\sum xy}{n} - \frac{\sum x}{n}\cdot\frac{\sum y}{n}$$
*Aczel A., Sounderpandian J. Complete business statistics
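The three equivalent forms of the covariance can be checked numerically. The sketch below uses hypothetical data; note that R's built-in `cov()` divides by n - 1 rather than n, so it is rescaled here before comparison.

```r
# Hypothetical paired data, for illustration only
x <- c(2, 4, 6, 8)
y <- c(1, 3, 2, 6)
n <- length(x)

cov_def   <- mean((x - mean(x)) * (y - mean(y)))           # E[(X - E(X))(Y - E(Y))]
cov_short <- mean(x * y) - mean(x) * mean(y)               # E(XY) - E(X)E(Y)
cov_sums  <- sum(x * y) / n - (sum(x) / n) * (sum(y) / n)  # sum form

# cov() uses the divisor n - 1; multiply by (n - 1) / n to get the divisor-n version
cov_r <- cov(x, y) * (n - 1) / n

print(c(cov_def, cov_short, cov_sums, cov_r))
```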
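Covariance (contd.)
Covariance is independent of change of origin but affected by change of scale:
  for $U = \frac{X - a}{c}$ and $V = \frac{Y - b}{d}$, $\operatorname{cov}(X,Y) = cd\,\operatorname{cov}(U,V)$, i.e. $\operatorname{cov}(U,V) = \frac{\operatorname{cov}(X,Y)}{cd}$
The absolute value of the covariance of two variables is always less than or equal to the product of their standard deviations:
  $|\operatorname{cov}(X,Y)| \le \sqrt{\operatorname{Var}(X)\cdot\operatorname{Var}(Y)}$
The unit of covariance is obtained by taking the product of the units of X and Y.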
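These properties can be checked with a few lines of R. The data and the constants `a`, `b`, `cc`, `dd` below are hypothetical (`cc` and `dd` stand in for the scale constants c and d, renamed so as not to mask R's `c()` function).

```r
# Hypothetical data and constants, for illustration only
x <- c(2, 4, 6, 8)
y <- c(1, 3, 2, 6)
a <- 10; b <- -5     # shifts of origin
cc <- 2; dd <- 4     # changes of scale

u <- (x - a) / cc
v <- (y - b) / dd

shift_only  <- cov(x - a, y - b)            # change of origin alone
rescaled    <- cc * dd * cov(u, v)          # undo the change of scale
within_bound <- abs(cov(x, y)) <= sd(x) * sd(y)

print(c(original = cov(x, y), shifted = shift_only, rescaled = rescaled))
print(within_bound)
```

Shifting the origin leaves the covariance unchanged, while dividing X by `cc` and Y by `dd` divides it by `cc * dd`.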
Covariance (contd.)
cov(Waist circumference, Adipose tissue area) = 643.39
Can we compare this with another covariance?
For the Body measurement data, consider both the Weight and the Height of all the individuals.
What is the covariance between Height and Weight for each of the two genders?
  27.13 kg·cm for one gender and 40.38 kg·cm for the other
What information do we obtain by comparing these two covariance values?
Standardization
If we standardize both the variables, the covariance is independent of the unit of measurement.
This makes the covariances of both categories comparable.
The standardized covariance lies in the interval [-1, 1].
  A number closer to 0 => the variables do not covary much
  A number closer to 1 or -1 => the variables covary highly
Standardized covariance of Height and Weight: 0.43 and 0.53 for the two genders
Height and weight are moderately related to each other, for both the genders.
We will see that this covariance is the same as the measure we study next!
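A short R sketch of why standardization removes the unit of measurement. The height and weight values are hypothetical (they are not the Body measurement data), and `standardize()` is a helper defined here for illustration.

```r
# Hypothetical heights and weights, for illustration only
height_cm <- c(150, 160, 165, 172, 180)
weight_kg <- c(52, 58, 63, 70, 77)
height_m  <- height_cm / 100                  # same heights, different unit

# Helper (defined here): subtract the mean and divide by the standard deviation
standardize <- function(z) (z - mean(z)) / sd(z)

cov_cm <- cov(height_cm, weight_kg)           # depends on the unit
cov_m  <- cov(height_m, weight_kg)            # 100 times smaller

std_cm <- cov(standardize(height_cm), standardize(weight_kg))
std_m  <- cov(standardize(height_m),  standardize(weight_kg))

print(c(cov_cm = cov_cm, cov_m = cov_m))
print(c(std_cm = std_cm, std_m = std_m, cor = cor(height_cm, weight_kg)))
```

The raw covariances differ by the factor 100, while the standardized covariances coincide with each other and with `cor()`.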
Correlation coefficient
Denoted by ρ (called rho)
Defined as the measure of the degree of linear association between the two variables X and Y*
Indicates the strength of, and the direction in which, the two variables move in relation to each other
Calculated as the ratio of the covariance between X and Y to the product of the standard deviations of X and Y:
$$\rho = \frac{\operatorname{cov}(X,Y)}{\sigma_X\,\sigma_Y}$$
Also termed the Pearson Product-moment Correlation Coefficient
Examples: ρ( , ) = 0.77; for Height and Weight, ρ = 0.43 and ρ = 0.53 for the two genders
*Aczel A., Sounderpandian J. Complete business statistics
Properties of Correlation coefficient
The correlation coefficient of two variables is equal to the covariance of their standardised forms.
It lies between -1 and 1 (extremes included): $-1 \le \rho \le 1$
It is a dimension-free measure, i.e. a measure free of units.
It is independent of both change of origin and change of scale:
  for $U = \frac{X - a}{c}$ and $V = \frac{Y - b}{d}$ with $c, d > 0$, $\rho_{U,V} = \rho_{X,Y}$
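The first and last properties follow in a few lines from the definition of covariance. The notation $\mu_X, \mu_Y$ for the means and $\sigma_X, \sigma_Y$ for the standard deviations is assumed here for the derivation.

```latex
% Covariance of the standardised variables equals rho.
% The standardised variables have mean 0, so their covariance is E of their product.
\begin{align*}
\operatorname{cov}\!\left(\frac{X-\mu_X}{\sigma_X},\ \frac{Y-\mu_Y}{\sigma_Y}\right)
  &= E\!\left[\frac{(X-\mu_X)(Y-\mu_Y)}{\sigma_X\,\sigma_Y}\right]
   = \frac{\operatorname{cov}(X,Y)}{\sigma_X\,\sigma_Y}
   = \rho_{X,Y}
\end{align*}

% Change of origin and scale: U = (X - a)/c, V = (Y - b)/d.
\begin{align*}
\operatorname{cov}(U,V) &= \frac{\operatorname{cov}(X,Y)}{cd},
\qquad \sigma_U = \frac{\sigma_X}{|c|},
\qquad \sigma_V = \frac{\sigma_Y}{|d|} \\
\rho_{U,V} &= \frac{\operatorname{cov}(X,Y)/(cd)}{\sigma_X\,\sigma_Y/(|c|\,|d|)}
            = \frac{|c|\,|d|}{cd}\,\rho_{X,Y}
\end{align*}
```

For scale constants of the same sign the factor $|c||d|/(cd)$ equals 1, so the correlation is unchanged; if exactly one of them is negative, only the sign of the correlation flips.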
Scatter plot panels (captions only):
Perfect positive correlation. If one of X or Y increases, the other must increase as per an exact linear relation. Similarly, if one decreases, the other decreases by the same rule.
Perfect negative correlation. If one of X or Y increases, the other must decrease as per an exact linear relation. Similarly, if one decreases, the other increases by the same rule.
No linear relationship.
Moderate negative correlation. If one of X or Y increases, the other tends to decrease as per a moderately strong linear relation. Similarly, if one decreases, the other tends to increase.
Strong negative correlation. If one of X or Y increases, the other tends to decrease as per a very strong linear relation. Similarly, if one decreases, the other tends to increase.
Moderate positive correlation. If one of X or Y increases, the other tends to increase as per a moderately strong linear relation. Similarly, if one decreases, the other tends to decrease.
Weak positive correlation. If one of X or Y increases, the other tends to increase as per a weak linear relation. Similarly, if one decreases, the other tends to decrease.
No linear relationship.
Visuals from Aczel A., Sounderpandian J. Complete business statistics
Limitations of correlation coefficient
Example 1: correlation coefficient = 0.911, although X takes the same value (2) for every observation but one
  Y      X
  0.6    2.01
  0.2    2
  0.2    2
  0.2    2
  0.1    2
  0.1    2
  0.1    2
  0.05   2
  0.05   2
  0      2
Example 2: correlation coefficient = 0, yet there exists a perfect quadratic relation between X and Y
  X    Y
  -3   9
  -2   4
  -1   1
  0    0
  1    1
  2    4
  3    9
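Both examples can be reproduced in R from the values tabulated on this slide; only the variable names below are chosen here.

```r
# Example 1: nine observations share X = 2, one observation has X = 2.01
x1 <- c(2.01, rep(2, 9))
y1 <- c(0.6, 0.2, 0.2, 0.2, 0.1, 0.1, 0.1, 0.05, 0.05, 0)
r1 <- cor(x1, y1)   # high correlation driven by the single differing point

# Example 2: Y is exactly X squared, a perfect but non-linear relation
x2 <- -3:3
y2 <- x2^2
r2 <- cor(x2, y2)   # no linear association

print(round(c(example1 = r1, example2 = r2), 3))
```

A scatter plot (`plot(x1, y1)`, `plot(x2, y2)`) makes both situations visible at a glance, which is why the coefficient should be read alongside a plot.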
Correlation and causality
A huge Roger Federer fan!
Watches several Federer - Nadal matches live on television
Has recorded that Federer loses approximately 80% of the matches that this fan watches live
Does he cause Federer to lose by watching the match?
Other measures
Rank correlation
To measure the degree of correlation between two ordinal variables or rankings, e.g.
  Company rankings given by two different publications
  Ranks of universities published on two websites
Consider two groups of women. They are grouped based on whether they use a particular brand of shampoo (say Shampoo A) or not. For each of the groups, responses are collated to indicate which of the five characteristics of their shampoo are most important to them.
  Characteristics     Group 1 rankings   Group 2 rankings   D = (rank 2 - rank 1)
  Characteristic 1    1                  5                  4
  Characteristic 2    3                  3                  0
  Characteristic 3    2                  4                  2
  Characteristic 4    5                  1                  -4
  Characteristic 5    4                  2                  -2
Other measures (contd.)
Spearman's Rank correlation coefficient ($\rho_s$):
$$\rho_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$
where d = difference between the 2 ranks of each object
      n = number of objects
When there are no tied ranks, this rank correlation is equal to the Pearson product-moment correlation applied to the ranks.
Lies in the interval [-1, 1]
The higher the positive correlation coefficient (closer to 1), the greater the degree of agreement between the two rankings
The higher the negative correlation coefficient (closer to -1), the greater the degree of disagreement between the two rankings
A correlation coefficient of 0 indicates no monotonic association between the two rankings given to the same objects
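As a worked example, the formula can be applied to the shampoo rankings tabulated earlier (Group 1: 1, 3, 2, 5, 4; Group 2: 5, 3, 4, 1, 2) and compared with R's built-in `cor(..., method = "spearman")`. The variable names are chosen here.

```r
# Rankings of the five characteristics by the two groups (from the table)
rank1 <- c(1, 3, 2, 5, 4)
rank2 <- c(5, 3, 4, 1, 2)

d <- rank2 - rank1                     # 4, 0, 2, -4, -2
n <- length(d)

rho_formula <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))
rho_builtin <- cor(rank1, rank2, method = "spearman")
rho_pearson <- cor(rank1, rank2)       # Pearson applied to the ranks (no ties here)

print(c(formula = rho_formula, builtin = rho_builtin, pearson = rho_pearson))
```

Here the sum of squared differences is 40, so the coefficient is 1 - 240/120 = -1: the two groups rank the characteristics in exactly opposite order.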
Other measures (contd.)
Kendall's Tau ($\tau$):
$$\tau = \frac{(\text{number of concordant pairs}) - (\text{number of discordant pairs})}{\frac{1}{2}\,n(n - 1)}$$
For n objects with ranks $(x_i, y_i)$, $i = 1, 2, \dots, n$, a pair of observations $(x_i, y_i)$ and $(x_j, y_j)$ is said to be
  concordant if the ranks of both elements agree, i.e. both $x_i > x_j$ and $y_i > y_j$, OR both $x_i < x_j$ and $y_i < y_j$
  discordant if $x_i > x_j$ and $y_i < y_j$, OR $x_i < x_j$ and $y_i > y_j$
  neither concordant nor discordant if $x_i = x_j$ or $y_i = y_j$
Lies in the interval [-1, 1]
If the agreement between the two rankings is perfect, coefficient = 1
If the disagreement between the two rankings is perfect, coefficient = -1
If the rankings are independent, the coefficient would be close to 0
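The pair counting can be sketched in R using the shampoo rankings from the rank correlation table (Group 1: 1, 3, 2, 5, 4; Group 2: 5, 3, 4, 1, 2). The names `idx`, `s`, `concordant` and `discordant` are chosen here for illustration.

```r
# Rankings of the five characteristics by the two groups (from the table)
rank1 <- c(1, 3, 2, 5, 4)
rank2 <- c(5, 3, 4, 1, 2)
n <- length(rank1)

idx <- combn(n, 2)   # each column is one pair (i, j) with i < j
# Sign of the product: +1 if the pair is ordered the same way in both rankings, -1 if not
s <- sign(rank1[idx[1, ]] - rank1[idx[2, ]]) *
     sign(rank2[idx[1, ]] - rank2[idx[2, ]])

concordant <- sum(s > 0)
discordant <- sum(s < 0)

tau_formula <- (concordant - discordant) / (n * (n - 1) / 2)
tau_builtin <- cor(rank1, rank2, method = "kendall")

print(c(concordant = concordant, discordant = discordant,
        formula = tau_formula, builtin = tau_builtin))
```

All 10 pairs are discordant for these rankings, so tau is -1, in line with the perfect disagreement between the two groups.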
Linear Regression
Suppose now that the variation in one variable (X) influences the variation in the other variable (Y)
  Is the adipose tissue area influenced by waist circumference?
  Are ice-cream sales affected by the temperature in the city?
The variable X, i.e. the variable that influences, is also referred to as the predictor variable, the independent variable or the explanatory variable
The variable Y, i.e. the variable that is being influenced, is also referred to as the outcome variable, the dependent variable or the response variable
Can we draw one line such that the equation of that line explains the relation between X and Y?
Which line describes the relationship in a reasonable way?
Linear regression (contd.)
This line minimizes the sum of squared vertical distances
Visuals from Aczel A., Sounderpandian J. Complete business statistics
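Linear regression (contd.)
Simple linear regression model:
$$Y = \beta_0 + \beta_1 X + \varepsilon$$
where Y = Outcome variable
      X = Predictor variable
      $\varepsilon$ = Random component in the model
The coefficients are estimated as
$$\hat{\beta}_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\,\bar{x}$$
If we can safely assume a linear relationship between X and Y, the slope of this model gives the average amount by which Y changes for a one-unit change in X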
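A minimal R sketch of the slope and intercept formulas, checked against `lm()`. The data are hypothetical, and the names `b0` and `b1` for the estimated intercept and slope are chosen here.

```r
# Hypothetical data, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Slope: sum of cross-deviations divided by sum of squared deviations of x
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
# Intercept: the fitted line passes through (mean(x), mean(y))
b0 <- mean(y) - b1 * mean(x)

fit <- lm(y ~ x)          # least squares fit by R
print(c(b0 = b0, b1 = b1))
print(coef(fit))
```

The hand-computed values match the coefficients reported by `lm()`, and `b1` is read as the average change in `y` per one-unit change in `x`.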
Linear regression (contd.)
The model is estimated using Method of least squares
This method tries to minimize the sum of squared errors
There are other methods of estimation
Visuals from Aczel A., Sounderpandian J. Complete business statistics
Linear regression (contd.)
Goodness of the model depends on the strength of the linear relationship between X and Y
The error could comprise factors other than X that may affect Y
The coefficient of determination, or $R^2$, is a measure of the strength of linearity in the relationship
It indicates the proportion of variation in Y that is explained by X
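The "proportion of variation explained" reading can be made concrete in R. The sketch below uses hypothetical data and computes the coefficient of determination from the residual and total sums of squares; the names `sse`, `sst` and `r2` are chosen here.

```r
# Hypothetical data, for illustration only
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.3, 3.1, 5.2, 5.9, 8.4, 8.8)

fit <- lm(y ~ x)

sse <- sum(resid(fit)^2)         # variation in y left unexplained by the line
sst <- sum((y - mean(y))^2)      # total variation in y around its mean
r2  <- 1 - sse / sst             # proportion of variation explained by x

print(c(by_hand = r2,
        from_lm = summary(fit)$r.squared,
        cor_squared = cor(x, y)^2))
```

For simple linear regression with one predictor, this value equals the square of the correlation coefficient between X and Y.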
Linear regression (contd.)
Fitting a linear regression for the Waist circumference-Adipose tissue data gives the following output in R:
Coefficients:
              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)   71.26327   1.88565      37.79
We get the following regression equation: = 71.26 + 0.2( )
R-codes
Function                      R-code
Dotplot                       install.packages("TeachingDemos")
                              library(TeachingDemos)
                              dots(variable_name)
Scatter plot                  plot(variable1_name, variable2_name)
Covariance                    cov(variable1_name, variable2_name)
Correlation                   cor(variable1_name, variable2_name)
Spearman's rank correlation   cor(variable1_name, variable2_name, method = "spearman")
Kendall's tau                 cor(variable1_name, variable2_name, method = "kendall")
Linear regression             lm(response_variable ~ explanatory_variable)
Regression line               abline(lm(response_variable ~ explanatory_variable))
Thank you