Data analysis test for association BY Prof Sachin Udepurkar


Test of Association - Bivariate Analysis: interpreting the relationship between variables

Transcript of Data analysis test for association BY Prof Sachin Udepurkar

Page 1: Data analysis   test for association BY Prof Sachin Udepurkar
Page 2: Data analysis   test for association BY Prof Sachin Udepurkar

DATA ANALYSIS - TESTING FOR ASSOCIATION

Relationship:

A consistent and systematic link between two or more variables

When interpreting the relationship between variables, the following aspects are taken into account:

1. Whether two or more variables are related at all, i.e. whether a relationship is present, judged via the concept of statistical significance

2. If a relationship is present, its direction, which can be either positive or negative

3. The strength of the association

4. The type of relationship

Page 3: Data analysis   test for association BY Prof Sachin Udepurkar

Difference between Univariate and Bivariate

Univariate data:
• involving a single variable
• does not deal with causes or relationships
• the major purpose of univariate analysis is to describe
• central tendency: mean, mode, median
• dispersion: range, variance, max, min, quartiles, standard deviation
• frequency distributions
• bar graph, histogram, pie chart, line graph, box-and-whisker plot
• Sample question: How many of the students in the freshman class are female?

Bivariate data:
• involving two variables
• deals with causes or relationships
• the major purpose of bivariate analysis is to explain
• analysis of two variables simultaneously
• correlations
• comparisons, relationships, causes, explanations
• tables where one variable is contingent on the values of the other variable
• independent and dependent variables
• Sample question: Is there a relationship between the number of females in Computer Programming and their scores in Mathematics?

Page 4: Data analysis   test for association BY Prof Sachin Udepurkar

1) Measuring whether a relationship is present, via the concept of statistical significance

The question is whether a relationship exists between two or more variables.

If we test for statistical significance and find it, we say that a relationship is present.

Stated another way, knowledge about the behavior of one variable allows us to make a useful prediction about the behavior of another.

For example:

If we found a statistically significant relationship between perceptions of the quality of Santa Fe Grill food and satisfaction, we would say that a relationship is present and that perceptions of the quality of the food tell us what the perceptions of satisfaction are likely to be.

Page 5: Data analysis   test for association BY Prof Sachin Udepurkar

2) If a relationship is present, it is important to know its direction

Establishing the presence of a relationship precedes establishing its direction.

The direction of a relationship can be either positive or negative.

For example:

Using the Santa Fe Grill example, a positive relationship exists if respondents who rate the quality of food high are also highly satisfied. Similarly, a negative relationship exists if respondents rate the speed of service as slow (low rating) but are still satisfied (high rating).

Page 6: Data analysis   test for association BY Prof Sachin Udepurkar

3) Understanding strength of association

In general, the strength of association is categorized as:

a. Nonexistent
b. Weak
c. Moderate
d. Strong

If a consistent and systematic relationship is not present, then the strength of association is nonexistent.

A weak association means there is a low probability that the variables have a relationship.

A strong association means there is a high probability that a consistent and systematic relationship exists.

Page 7: Data analysis   test for association BY Prof Sachin Udepurkar

4) Type of relationship

If two variables can be described as related, we then ask: "What is the nature of the relationship?" How can the link between the variables Y and X best be described?

There are a number of different ways in which two variables (X and Y) can share a relationship.

Page 8: Data analysis   test for association BY Prof Sachin Udepurkar

To answer the above questions, the following statistical methods are applied:

a. Covariation
b. Chi-Square Test
c. Correlation Coefficient
   1. Pearson correlation coefficient
   2. Coefficient of determination
   3. Spearman rank order correlation coefficient
d. Regression Analysis

Page 9: Data analysis   test for association BY Prof Sachin Udepurkar

COVARIATION:

It is defined as the amount of change in one variable that is consistently related to the change in another variable of interest, or the degree of association between two items/variables.

For example:

If we know DVD purchases are related to age, then we want to know the extent to which younger persons purchase more DVDs and, ultimately, which types of DVDs.

If two variables are found to change together on a reliable or consistent basis, then we can use that information to make predictions as well as decisions on advertising and marketing strategies.

For example:

The change in attitude towards a Starbucks coffee advertising campaign as it varies between light, medium and heavy consumers of Starbucks coffee.

Page 10: Data analysis   test for association BY Prof Sachin Udepurkar

SCATTER PLOTS AND CORRELATION

A scatter plot (or scatter diagram) is used to show the relationship between two variables

Page 11: Data analysis   test for association BY Prof Sachin Udepurkar

SCATTER PLOT EXAMPLES

[Figure: scatter plots of y against x illustrating linear relationships and curvilinear relationships]

Page 12: Data analysis   test for association BY Prof Sachin Udepurkar

SCATTER PLOT EXAMPLES (continued)

[Figure: scatter plots of y against x illustrating strong relationships and weak relationships]

Page 13: Data analysis   test for association BY Prof Sachin Udepurkar

SCATTER PLOT EXAMPLES (continued)

[Figure: scatter plots of y against x illustrating no relationship]

Page 14: Data analysis   test for association BY Prof Sachin Udepurkar

One easy way to visually describe the covariation between two variables is a SCATTER DIAGRAM, a graphic plot of the relative position of two variables that uses a horizontal and a vertical axis to represent the values of the respective variables.

[Figure: scatter plot of Lung Capacity (vertical axis) against Smoking (horizontal axis)]

• We can see easily from the graph that as smoking goes up, lung capacity tends to go down.

• The two variables covary in opposite directions.

• We now examine two statistics, covariance and correlation, for quantifying how variables covary.

Smoking and Lung Capacity

Cigarettes (X)   Lung Capacity (Y)
0                45
5                42
10               33
15               31
20               29
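The opposite-direction covariation in the smoking data can be checked numerically. The sketch below is an illustration added to the transcript, assuming the sample form of covariance (sum of the products of deviations from the means, divided by n - 1); the function name `sample_covariance` is hypothetical.

```python
# Smoking and lung capacity data from the slide's table.
cigarettes = [0, 5, 10, 15, 20]        # X
lung_capacity = [45, 42, 33, 31, 29]   # Y


def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / (n - 1)


cov_xy = sample_covariance(cigarettes, lung_capacity)

# A negative value means X and Y tend to move in opposite directions.
print(f"covariance = {cov_xy:.2f}")
```

The sign carries the direction of the relationship; the magnitude depends on the units of both variables, which is why correlation is introduced afterwards as a standardized measure.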

Page 15: Data analysis   test for association BY Prof Sachin Udepurkar

The formula for calculating the covariance of sample data is as follows:

$$\mathrm{COV}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

where
x = the independent variable
y = the dependent variable
n = number of data points in the sample
$\bar{x}$ = the mean of the independent variable x
$\bar{y}$ = the mean of the dependent variable y

Example: To understand how covariance is used, consider the rate of economic growth ($x_i$) and the rate of return on the S&P 500 ($y_i$).

Using the covariance formula, you can determine whether economic growth and S&P 500 returns have a positive or inverse relationship.

Page 16: Data analysis   test for association BY Prof Sachin Udepurkar

Before you compute the covariance, calculate the means of x and y.

A) Identify the variables for the covariance formula as follows:

x = 2.1, 2.5, 4.0, and 3.6 (economic growth)
y = 8, 12, 14, and 10 (S&P 500 returns)
$\bar{x}$ = 3.1 (3.05 before rounding)
$\bar{y}$ = 11

B) Substitute these values into the covariance formula to determine the relationship between economic growth and S&P 500 returns.
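The substitution itself did not survive in the transcript. A reconstruction, assuming the sample (n - 1) form of the covariance and the unrounded mean of x (3.05), gives:

```latex
\begin{aligned}
\sum_{i=1}^{4} (x_i - \bar{x})(y_i - \bar{y})
  &= (2.1 - 3.05)(8 - 11) + (2.5 - 3.05)(12 - 11) \\
  &\quad + (4.0 - 3.05)(14 - 11) + (3.6 - 3.05)(10 - 11) \\
  &= 2.85 - 0.55 + 2.85 - 0.55 = 4.60 \\[4pt]
\mathrm{COV}(x, y) &= \frac{4.60}{4 - 1} \approx 1.53
\end{aligned}
```

Using the rounded mean 3.1 instead gives the same sum, 4.60, so the result is unchanged at two decimals.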

Page 17: Data analysis   test for association BY Prof Sachin Udepurkar

Interpretation:

The covariance between the returns of the S&P 500 and economic growth is 1.53.

Since the covariance is positive, the variables are positively related: they move together in the same direction.

Page 19: Data analysis   test for association BY Prof Sachin Udepurkar

Correlation:

Correlation is another way to determine how two variables are related.

In addition to telling you whether variables are positively or inversely related, correlation also tells you the degree to which the variables tend to move together.

Correlation standardizes the measure of interdependence between two variables and, consequently, tells you how closely the two variables move together.

The correlation measurement, called the Pearson correlation coefficient, always takes a value between -1 and 1.

A) If the correlation coefficient is 1

The variables have a perfect positive correlation.

This means that if one variable moves by a given amount, the second moves proportionally in the same direction.

A positive correlation coefficient less than 1 indicates a less than perfect positive correlation, with the strength of the correlation growing as the number approaches 1.

Page 20: Data analysis   test for association BY Prof Sachin Udepurkar

B) If the correlation coefficient is zero

No linear relationship exists between the variables.

If one variable moves, you can make no linear prediction about the movement of the other variable; they are uncorrelated.

C) If the correlation coefficient is -1

The variables are perfectly negatively correlated (or inversely correlated) and move in opposition to each other.

If one variable increases, the other variable decreases proportionally.

A negative correlation coefficient greater than -1 indicates a less than perfect negative correlation, with the strength of the correlation growing as the number approaches -1.
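The three boundary cases can be illustrated with small made-up data sets. This is a sketch under stated assumptions: the x values and the three y series are hypothetical, and `pearson_r` is a helper written for this illustration, not a library function.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation; the (n - 1) factors cancel, so raw sums are used."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


x = [1, 2, 3, 4, 5]                      # hypothetical values
y_up = [2 * v + 1 for v in x]            # rises proportionally with x
y_down = [10 - 3 * v for v in x]         # falls proportionally with x
y_curve = [(v - 3) ** 2 for v in x]      # symmetric curve around x = 3

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
r_curve = pearson_r(x, y_curve)
print(r_up, r_down, r_curve)
```

The third series gives a coefficient of zero even though y is completely determined by x, which is why a zero coefficient is best read as "no linear relationship" rather than "no relationship of any kind".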

Page 21: Data analysis   test for association BY Prof Sachin Udepurkar

To calculate the correlation coefficient for two variables, use the correlation formula:

$$r(x, y) = \frac{\mathrm{COV}(x, y)}{s_x \, s_y}$$

where
r(x, y) = correlation of the variables x and y
COV(x, y) = covariance of the variables x and y
$s_x$ = sample standard deviation of the random variable x
$s_y$ = sample standard deviation of the random variable y

To calculate the correlation, you must know the covariance of the two variables and the standard deviation of each variable.

From the earlier example, you know that the covariance of S&P 500 returns and economic growth was calculated to be 1.53.

Page 22: Data analysis   test for association BY Prof Sachin Udepurkar

Now you need to determine the standard deviation of each variable, that is, of economic growth (x) and of the S&P 500 returns (y).

Using the information from above, you know that

COV(x, y) = 1.53
$s_x$ = 0.90
$s_y$ = 2.58

Page 23: Data analysis   test for association BY Prof Sachin Udepurkar

Now calculate the correlation coefficient by substituting the numbers above into the correlation formula:

$$r(x, y) = \frac{1.53}{0.90 \times 2.58} \approx 0.66$$

A correlation coefficient of 0.66 tells you two important things:

• Because the correlation coefficient is a positive number, returns on the S&P 500 and economic growth are positively related.

• Because 0.66 is relatively far from zero (no correlation), the correlation between returns on the S&P 500 and economic growth is fairly strong.
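The whole calculation for this example can be reproduced in a few lines. This is a minimal sketch, assuming sample (n - 1) statistics throughout; the variable names are chosen for illustration.

```python
from math import sqrt

growth = [2.1, 2.5, 4.0, 3.6]   # x: economic growth
returns = [8, 12, 14, 10]       # y: S&P 500 returns

n = len(growth)
mean_x = sum(growth) / n
mean_y = sum(returns) / n

# Sample covariance and sample standard deviations (divide by n - 1).
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(growth, returns)) / (n - 1)
s_x = sqrt(sum((x - mean_x) ** 2 for x in growth) / (n - 1))
s_y = sqrt(sum((y - mean_y) ** 2 for y in returns) / (n - 1))

# Correlation = covariance divided by the product of the standard deviations.
r = cov_xy / (s_x * s_y)

print(f"COV(x, y) = {cov_xy:.2f}, s_x = {s_x:.2f}, s_y = {s_y:.2f}, r = {r:.2f}")
```

Rounded to two decimals, the values agree with the 1.53, 0.90, 2.58 and 0.66 quoted in the slides. With only four observations, the coefficient describes this sample and should not be over-interpreted.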

Page 24: Data analysis   test for association BY Prof Sachin Udepurkar

The coefficient of determination is the amount of variability in one measure that is explained by the other measure.

It is the square of the Pearson correlation coefficient, written $r^2$.

For example, if the correlation coefficient between two variables is r = 0.90, the coefficient of determination is $(0.90)^2 = 0.81$.

This number ranges from 0.00 to 1.00 and shows the proportion of variation in one variable that is explained or accounted for by the other.
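Applied to the earlier S&P 500 example (an illustration added here, using the rounded coefficient r = 0.66 from the correlation slides):

```latex
r^2 = (0.66)^2 \approx 0.44
```

That is, roughly 44% of the variation in S&P 500 returns in that small sample is accounted for by economic growth.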

Page 25: Data analysis   test for association BY Prof Sachin Udepurkar

Spearman rank order correlation coefficient:

A statistical measure of association between two variables where both have been measured using ordinal (rank order) scales. Because it works on ranks, it captures monotonic association rather than strictly linear association.

Example :
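The slide's own example is not preserved in the transcript. As a stand-in, here is a minimal sketch with hypothetical ranks, assuming no tied ranks so that the shortcut formula 1 - 6Σd² / (n(n² - 1)) applies; the function name and the data are illustrative only.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman coefficient for untied ranks: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical data: five brands ranked 1 (best) to 5 by two respondents.
ranks_respondent_1 = [1, 2, 3, 4, 5]
ranks_respondent_2 = [2, 1, 4, 3, 5]

rho = spearman_rho(ranks_respondent_1, ranks_respondent_2)
print(f"Spearman rho = {rho:.2f}")
```

Here the squared rank differences sum to 4, so the coefficient is 1 - 24/120 = 0.8: the two rankings agree closely but not perfectly.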

Page 26: Data analysis   test for association BY Prof Sachin Udepurkar

INTRODUCTION TO REGRESSION ANALYSIS

Regression analysis is used to:

• Predict the value of a dependent variable based on the value of at least one independent variable

• Explain the impact of changes in an independent variable on the dependent variable

Dependent variable: the variable we wish to explain

Independent variable: the variable used to explain the dependent variable

Page 27: Data analysis   test for association BY Prof Sachin Udepurkar

SIMPLE LINEAR REGRESSION MODEL

Only one independent variable, x

Relationship between x and y is described by a linear function

Changes in y are assumed to be caused by changes in x

Page 28: Data analysis   test for association BY Prof Sachin Udepurkar

TYPES OF REGRESSION MODELS

Positive Linear Relationship

Negative Linear Relationship

Relationship NOT Linear

No Relationship

Page 29: Data analysis   test for association BY Prof Sachin Udepurkar

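POPULATION LINEAR REGRESSION

The population regression model:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

where
y = dependent variable
x = independent variable
$\beta_0$ = population y intercept
$\beta_1$ = population slope coefficient
$\varepsilon$ = random error term, or residual

$\beta_0 + \beta_1 x$ is the linear component and $\varepsilon$ is the random error component.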

Page 30: Data analysis   test for association BY Prof Sachin Udepurkar

LINEAR REGRESSION ASSUMPTIONS

• Error values (ε) are statistically independent

• Error values are normally distributed for any given value of x

• The probability distribution of the errors is normal

• The probability distribution of the errors has constant variance

• The underlying relationship between the x variable and the y variable is linear

Page 31: Data analysis   test for association BY Prof Sachin Udepurkar

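POPULATION LINEAR REGRESSION (continued)

[Figure: the population regression line $y = \beta_0 + \beta_1 x + \varepsilon$ plotted as y against x, with intercept $\beta_0$ and slope $\beta_1$. For a given $x_i$ it marks the observed value of y, the predicted value of y on the line, and the random error $\varepsilon_i$ between them.]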

Page 32: Data analysis   test for association BY Prof Sachin Udepurkar

ESTIMATED REGRESSION MODEL

The sample regression line provides an estimate of the population regression line:

$$\hat{y}_i = b_0 + b_1 x$$

where
$\hat{y}_i$ = estimated (or predicted) y value
$b_0$ = estimate of the regression intercept
$b_1$ = estimate of the regression slope
x = independent variable

The individual random error terms $e_i$ have a mean of zero.

Page 33: Data analysis   test for association BY Prof Sachin Udepurkar

LEAST SQUARES CRITERION

b0 and b1 are obtained by finding the values of b0 and b1 that minimize the sum of the squared residuals:

$$\sum e^2 = \sum (y - \hat{y})^2 = \sum \bigl(y - (b_0 + b_1 x)\bigr)^2$$

Page 34: Data analysis   test for association BY Prof Sachin Udepurkar

THE LEAST SQUARES EQUATION

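The formulas for b1 and b0 are:

$$b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$

algebraic equivalent:

$$b_1 = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}}$$

and

$$b_0 = \bar{y} - b_1 \bar{x}$$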

Page 35: Data analysis   test for association BY Prof Sachin Udepurkar

INTERPRETATION OF THE SLOPE AND THE INTERCEPT

b0 is the estimated average value of y when the value of x is zero

b1 is the estimated change in the average value of y as a result of a one-unit change in x

Page 36: Data analysis   test for association BY Prof Sachin Udepurkar

FINDING THE LEAST SQUARES EQUATION

The coefficients b0 and b1 will usually be found using computer software, such as Excel or Minitab

Other regression measures will also be computed as part of computer-based regression analysis

Page 37: Data analysis   test for association BY Prof Sachin Udepurkar

SIMPLE LINEAR REGRESSION EXAMPLE

A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet)

A random sample of 10 houses is selected

Dependent variable (y) = house price in $1000s

Independent variable (x) = square feet

Page 38: Data analysis   test for association BY Prof Sachin Udepurkar

SAMPLE DATA FOR HOUSE PRICE MODEL

House Price in $1000s (y)   Square Feet (x)
245                         1400
312                         1600
279                         1700
308                         1875
199                         1100
219                         1550
405                         2350
324                         2450
319                         1425
255                         1700
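The slides obtain the coefficients from Excel, but the same numbers follow directly from the least squares formulas for b1 and b0. A minimal sketch in plain Python, with variable names chosen for illustration:

```python
# Sample data from the house price table.
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]  # x
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]                  # y, in $1000s

n = len(square_feet)
mean_x = sum(square_feet) / n
mean_y = sum(price) / n

# b1 = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(square_feet, price))
s_xx = sum((x - mean_x) ** 2 for x in square_feet)
b1 = s_xy / s_xx

# b0 = mean_y - b1 * mean_x
b0 = mean_y - b1 * mean_x

print(f"b0 = {b0:.5f}, b1 = {b1:.5f}")
```

Rounded to five decimals, this gives an intercept of 98.24833 and a slope of 0.10977, matching the coefficients used in the rest of the example.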

Page 39: Data analysis   test for association BY Prof Sachin Udepurkar

REGRESSION USING EXCEL

Tools / Data Analysis / Regression

Page 40: Data analysis   test for association BY Prof Sachin Udepurkar

EXCEL OUTPUT

Regression Statistics
Multiple R          0.76211
R Square            0.58082
Adjusted R Square   0.52842
Standard Error      41.33032
Observations        10

ANOVA
             df   SS           MS           F         Significance F
Regression   1    18934.9348   18934.9348   11.0848   0.01039
Residual     8    13665.5652   1708.1957
Total        9    32600.5000

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     98.24833       58.03348         1.69296   0.12892   -35.57720   232.07386
Square Feet   0.10977        0.03297          3.32938   0.01039   0.03374     0.18580

The regression equation is:

house price = 98.24833 + 0.10977 (square feet)

Page 41: Data analysis   test for association BY Prof Sachin Udepurkar

GRAPHICAL PRESENTATION

House price model: scatter plot and regression line

[Figure: House Price ($1000s) on the vertical axis against Square Feet on the horizontal axis, with the fitted regression line]

house price = 98.24833 + 0.10977 (square feet)

Slope = 0.10977

Intercept = 98.248

Page 42: Data analysis   test for association BY Prof Sachin Udepurkar

INTERPRETATION OF THE INTERCEPT, b0

house price = 98.24833 + 0.10977 (square feet)

b0 is the estimated average value of y when the value of x is zero (if x = 0 is in the range of observed x values).

Here, no houses had 0 square feet, so b0 = 98.24833 just indicates that, for houses within the range of sizes observed, $98,248.33 is the portion of the house price not explained by square feet.

Page 43: Data analysis   test for association BY Prof Sachin Udepurkar

INTERPRETATION OF THE SLOPE COEFFICIENT, b1

house price = 98.24833 + 0.10977 (square feet)

b1 measures the estimated change in the average value of y as a result of a one-unit change in x.

Here, b1 = 0.10977 tells us that the value of a house increases by 0.10977 × $1000 = $109.77, on average, for each additional square foot of size.
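The fitted line can also be used to predict a price for a given size. The sketch below is an illustration, not part of the original slides: the 2000 square foot house is a hypothetical input, chosen because it lies inside the observed range of 1100 to 2450 square feet, and `predict_price` is a hypothetical helper name.

```python
B0 = 98.24833   # intercept from the regression output, in $1000s
B1 = 0.10977    # slope from the regression output, in $1000s per square foot


def predict_price(square_feet):
    """Predicted house price in $1000s from the fitted regression line."""
    return B0 + B1 * square_feet


predicted = predict_price(2000)                           # hypothetical house
one_more_foot = predict_price(2001) - predict_price(2000)

print(f"predicted price = ${predicted * 1000:,.2f}")
print(f"change per extra square foot = ${one_more_foot * 1000:,.2f}")
```

The difference between two predictions one square foot apart is the slope itself, which is the "$109.77 per additional square foot" reading of b1. Predictions far outside the observed sizes are not supported by this sample.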

Page 44: Data analysis   test for association BY Prof Sachin Udepurkar

LEAST SQUARES REGRESSION PROPERTIES

The sum of the residuals from the least squares regression line is 0: $\sum (y - \hat{y}) = 0$

The sum of the squared residuals, $\sum (y - \hat{y})^2$, is a minimum

The simple regression line always passes through the mean of the y variable and the mean of the x variable

The least squares coefficients are unbiased estimates of β0 and β1
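The first and third properties can be checked numerically on the house price data. A minimal sketch, refitting the line from the sample data so that the check is not affected by rounded coefficients:

```python
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]  # x
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]                  # y, in $1000s

n = len(square_feet)
mean_x = sum(square_feet) / n
mean_y = sum(price) / n

# Least squares fit.
b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(square_feet, price))
      / sum((x - mean_x) ** 2 for x in square_feet))
b0 = mean_y - b1 * mean_x

# Property 1: the residuals y - y_hat sum to zero (up to floating point error).
residuals = [y - (b0 + b1 * x) for x, y in zip(square_feet, price)]
residual_sum = sum(residuals)

# Property 3: the line passes through the point (mean_x, mean_y).
y_hat_at_mean_x = b0 + b1 * mean_x

print(f"sum of residuals = {residual_sum:.10f}")
print(f"line at mean x = {y_hat_at_mean_x:.4f}, mean y = {mean_y:.4f}")
```

Both properties hold for any least squares line that includes an intercept, not only for this data set.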

Page 45: Data analysis   test for association BY Prof Sachin Udepurkar

EXPLAINED AND UNEXPLAINED VARIATION

Total variation is made up of two parts:

$$SST = SSE + SSR$$

Total sum of squares = sum of squares error + sum of squares regression

$$SST = \sum (y - \bar{y})^2 \qquad SSE = \sum (y - \hat{y})^2 \qquad SSR = \sum (\hat{y} - \bar{y})^2$$

where:
$\bar{y}$ = average value of the dependent variable
y = observed values of the dependent variable
$\hat{y}$ = estimated value of y for the given x value

Page 46: Data analysis   test for association BY Prof Sachin Udepurkar

EXPLAINED AND UNEXPLAINED VARIATION

SST = total sum of squares: measures the variation of the $y_i$ values around their mean $\bar{y}$

SSE = error sum of squares: variation attributable to factors other than the relationship between x and y

SSR = regression sum of squares: explained variation attributable to the relationship between x and y
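For the house price data, the three sums of squares can be computed directly and compared with the ANOVA table in the Excel output. A minimal sketch, with variable names chosen for illustration:

```python
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]  # x
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]                  # y, in $1000s

n = len(price)
mean_x = sum(square_feet) / n
mean_y = sum(price) / n

# Least squares fit and fitted values y_hat.
b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(square_feet, price))
      / sum((x - mean_x) ** 2 for x in square_feet))
b0 = mean_y - b1 * mean_x
fitted = [b0 + b1 * x for x in square_feet]

sst = sum((y - mean_y) ** 2 for y in price)                  # total variation
sse = sum((y - f) ** 2 for y, f in zip(price, fitted))       # unexplained variation
ssr = sum((f - mean_y) ** 2 for f in fitted)                 # explained variation
r_squared = ssr / sst

print(f"SST = {sst:.4f}, SSR = {ssr:.4f}, SSE = {sse:.4f}")
print(f"R squared = {r_squared:.5f}")
```

The results agree with the Total, Regression and Residual rows of the ANOVA table (32600.5000, 18934.9348 and 13665.5652), and SSR / SST reproduces the reported R Square of 0.58082.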

Page 47: Data analysis   test for association BY Prof Sachin Udepurkar

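EXPLAINED AND UNEXPLAINED VARIATION (continued)

[Figure: for a single observation at $x_i$, the vertical distance between the observed $y_i$ and the mean $\bar{y}$ is split at the regression line. Summed over all observations: $SST = \sum (y_i - \bar{y})^2$, $SSE = \sum (y_i - \hat{y}_i)^2$, $SSR = \sum (\hat{y}_i - \bar{y})^2$.]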

Page 48: Data analysis   test for association BY Prof Sachin Udepurkar

THANKS……