STAT3010: Lecture 11 (Transcript)
CORRELATION AND REGRESSION
Correlation Analysis (Section 10.1, Page 466)
The goal of correlation analysis is to understand the nature and strength of the relationship between x and y (bivariate data). We first examine the relationship between the two variables by viewing a scatter plot of the (x, y) pairs.
The following scatter plots display different types of relationships between the x and y values:
To make precise statements about a data set, we must go beyond the scatter plot. For example, we know that plot (b) above shows a direct (positive) linear relationship between x and y, but the question is: how strong is it? This is where the population correlation coefficient comes in. Not only will it tell us the nature (direction) of the linear correlation between x and y, it will also tell us the strength.

The population correlation coefficient, denoted by ρ (rho), can only take on values in the range −1 ≤ ρ ≤ 1.

The sign of the correlation coefficient indicates the nature of the relationship between x and y, and the magnitude of the correlation coefficient indicates the strength of the linear association between the two variables. Recall from STAT 2010/2020:
THE SAMPLE CORRELATION COEFFICIENT r
Definition: The sample correlation coefficient r is given by

    r = Cov(x, y) / √(Var(x) · Var(y)) = S_xy / √(S_xx · S_yy)

where Var(x) and Var(y) are the sample variances of x and y, respectively. Recall:

    Var(x) = S_xx / (n − 1)   and   Var(y) = S_yy / (n − 1)

and Cov(x, y) is the covariance of x and y, defined by:

    Cov(x, y) = S_xy / (n − 1)

Computing formulas for the three summation quantities are

    S_xx = Σxᵢ² − (Σxᵢ)²/n
    S_yy = Σyᵢ² − (Σyᵢ)²/n
    S_xy = Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n
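As a sanity check on these computing formulas, here is a short sketch in Python (the course software is SAS; this example and its toy data are purely illustrative):

```python
import math

def pearson_r(x, y):
    """Sample correlation r = Sxy / sqrt(Sxx * Syy), using the
    summation computing formulas above."""
    n = len(x)
    sxx = sum(v * v for v in x) - sum(x) ** 2 / n
    syy = sum(v * v for v in y) - sum(y) ** 2 / n
    sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    return sxy / math.sqrt(sxx * syy)

# Toy data lying exactly on a line with positive slope: r must be 1
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```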
Standard deviation and variance only operate on 1 dimension, so that you could only calculate the standard deviation for each dimension of the data set independently of the other dimensions. However, it is useful to have a similar measure to find out how much the dimensions vary from the mean with respect to each other.
Covariance is such a measure. Covariance is always measured between 2 dimensions. If you calculate the covariance between one dimension and itself, you get the variance. So, if you had a 3-dimensional data set (x, y, z), then you could measure the covariance between the x and y dimensions, the y and z dimensions, and the x and z dimensions. Measuring the covariance between x and x, or y and y, or z and z would give you the variance of the x, y and z dimensions respectively.
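This can be illustrated numerically. The sketch below uses Python's numpy (an assumption; the course uses SAS) with a small made-up 3-dimensional data set:

```python
import numpy as np

# Hypothetical 3-dimensional data set (x, y, z), five observations each
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
z = np.array([9.0, 7.0, 8.0, 5.0, 6.0])

# 3x3 covariance matrix: entry (i, j) is the covariance between
# dimensions i and j; the diagonal holds the variances
C = np.cov(np.vstack([x, y, z]))

# Covariance of a dimension with itself is just that dimension's variance
assert np.isclose(C[0, 0], np.var(x, ddof=1))
assert np.isclose(C[2, 2], np.var(z, ddof=1))
print(C)
```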
Example 10.1: Correlation Between Body Mass Index and Systolic Blood Pressure
Suppose we are interested in the relationship between body mass index and systolic blood pressure in males 50 years old. A random sample of 10 males 50 years of age is selected, and their body mass index scores and systolic blood pressures are recorded in the following table:
X = Body Mass Index    Y = Systolic Blood Pressure
18.4                   120
20.1                   110
22.4                   120
25.9                   135
26.5                   140
28.9                   115
30.1                   150
32.9                   165
33.0                   160
34.7                   180
We first view the scatter diagram:

[scatter plot of systolic blood pressure (y-axis) versus body mass index (x-axis)]
Calculate the sample correlation coefficient and explain:
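As a numerical check (a sketch in Python rather than the course's SAS, with numpy assumed available), the sample correlation for these data comes out to about 0.86, a strong positive linear association:

```python
import numpy as np

bmi = np.array([18.4, 20.1, 22.4, 25.9, 26.5, 28.9, 30.1, 32.9, 33.0, 34.7])
sbp = np.array([120.0, 110, 120, 135, 140, 115, 150, 165, 160, 180])

# Sample correlation coefficient r
r = np.corrcoef(bmi, sbp)[0, 1]
print(round(r, 5))  # about 0.86
```

This agrees with the Pearson coefficient (0.85992) reported by PROC CORR in the SAS output below.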
Now that was the sample correlation; what about the population correlation? The sample correlation coefficient, r, is a point estimate of the population correlation coefficient, ρ. Tests of hypothesis concerning ρ address whether there is a linear association in the population.
To test the null hypothesis of NO linear relationship (H₀: ρ = 0), we use:

    t = r · √(n − 2) / √(1 − r²)   with df = n − 2

(using Table B.3 to get a critical value).
Example 10.1.2: Statistical Inference Concerning ρ
Hypothesis:
Test Statistic:
Decision:
Conclusion:
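The blanks above can be checked numerically; a minimal sketch in Python (numpy assumed), using the test statistic t = r√(n − 2)/√(1 − r²) with df = n − 2:

```python
import math
import numpy as np

bmi = np.array([18.4, 20.1, 22.4, 25.9, 26.5, 28.9, 30.1, 32.9, 33.0, 34.7])
sbp = np.array([120.0, 110, 120, 135, 140, 115, 150, 165, 160, 180])

n = len(bmi)
r = np.corrcoef(bmi, sbp)[0, 1]

# Test H0: rho = 0 versus H1: rho != 0
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(round(t, 2))  # about 4.76, with df = 8
```

Since |t| ≈ 4.76 exceeds the two-sided critical value 2.306 (α = 0.05, df = 8), H₀: ρ = 0 would be rejected, consistent with the p-value of 0.0014 in the SAS output below.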
SAS CODE:

options ps=62 ls=80;
data correlation;
input bmi sbp;
cards;
18.4 120
20.1 110
22.4 120
25.9 135
26.5 140
28.9 115
30.1 150
32.9 165
33.0 160
34.7 180
run;

proc plot;
plot sbp*bmi;
run;

proc corr cov;
var bmi sbp;
run;
SAS OUTPUT:

The SAS System

Plot of sbp*bmi. Legend: A = 1 obs, B = 2 obs, etc.
[character scatter plot of sbp versus bmi omitted; bmi axis runs 17.5 to 35.0]
The SAS System

The CORR Procedure

2 Variables: bmi sbp

          Covariance Matrix, DF = 9

               bmi            sbp
bmi     31.8521111    115.2166667
sbp    115.2166667    563.6111111

                    Simple Statistics

Variable   N       Mean   Std Dev        Sum    Minimum    Maximum
bmi       10   27.29000   5.64377  272.90000   18.40000   34.70000
sbp       10  139.50000  23.74050       1395  110.00000  180.00000

   Pearson Correlation Coefficients, N = 10
        Prob > |r| under H0: Rho=0

            bmi       sbp
bmi     1.00000   0.85992
                   0.0014
sbp     0.85992   1.00000
         0.0014
Simple Linear Regression (Section 10.2, Page 477)
Regression analysis is used to develop the mathematical equation that best describes the relationship between two variables, x and y. In correlation analysis, it is not necessary to specify which of the two variables is independent and which is dependent. In regression analysis, it is necessary: they must be specified.
Remember our correlation plot from example 10.1 (above):
We now want to find the equation of the line that best fits these data. The equation of the line relating y to x is called the simple linear regression equation and is given by:

    y = β₀ + β₁x + ε

where
    Y is the dependent variable,
    X is the independent variable,
    β₀ is the Y-intercept (the value of Y when X = 0),
    β₁ is the slope (the expected change in Y relative to a one-unit change in X), and
    ε is the error.

The parameters β₀ and β₁ in the least squares regression line are estimated in such a way that the sum of squared errors, Σ(yᵢ − β₀ − β₁xᵢ)², is minimized. Let the estimates of β₀ and β₁ be respectively denoted by b₀ and b₁. These estimators are the solutions of the normal equations:

    Σyᵢ = n·b₀ + b₁·Σxᵢ
    Σxᵢyᵢ = b₀·Σxᵢ + b₁·Σxᵢ²
We have now obtained:

    b₁ = S_xy / S_xx   and   b₀ = ȳ − b₁·x̄

These estimates are called the least squares estimates of the slope and intercept. The estimate of the simple linear regression equation is given by substituting the least squares estimates into the simple linear regression equation:

    ŷ = b₀ + b₁x

where ŷ is the expected value of Y for a given value of X.
Back to Example 10.1: compute the least squares regression equation for the data given in Example 10.1.
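As a check on the hand calculation, here is a sketch in Python (numpy assumed; the lecture itself uses SAS):

```python
import numpy as np

bmi = np.array([18.4, 20.1, 22.4, 25.9, 26.5, 28.9, 30.1, 32.9, 33.0, 34.7])
sbp = np.array([120.0, 110, 120, 135, 140, 115, 150, 165, 160, 180])

n = len(bmi)
sxx = np.sum(bmi ** 2) - np.sum(bmi) ** 2 / n
sxy = np.sum(bmi * sbp) - np.sum(bmi) * np.sum(sbp) / n

b1 = sxy / sxx                      # slope estimate
b0 = sbp.mean() - b1 * bmi.mean()   # intercept estimate
print(round(b0, 5), round(b1, 5))
```

These should match the PROC REG Parameter Estimates in the SAS output below (Intercept ≈ 40.78558, bmi slope ≈ 3.61724).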
To compute the regression estimates (b₀ and b₁) within SAS, place the following code after the code introduced above on Page 6/7:

proc reg;
model sbp=bmi;
run;
The SAS System

The REG Procedure
Model: MODEL1
Dependent Variable: sbp

Number of Observations Read    10
Number of Observations Used    10

                     Analysis of Variance

                            Sum of        Mean
Source           DF        Squares      Square   F Value   Pr > F
Model             1     3750.89494  3750.89494     22.71   0.0014
Error             8     1321.60506   165.20063
Corrected Total   9     5072.50000

Root MSE         12.85304    R-Square    0.7395
Dependent Mean  139.50000    Adj R-Sq    0.7069
Coeff Var         9.21365

                     Parameter Estimates

                   Parameter    Standard
Variable    DF      Estimate       Error   t Value   Pr > |t|
Intercept    1      40.78558    21.11158      1.93     0.0895
bmi          1       3.61724     0.75913      4.76     0.0014
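One consistency check worth noting: in simple linear regression, the R-Square reported by PROC REG equals the square of the Pearson correlation from PROC CORR. A quick check of the two printed values:

```python
r = 0.85992        # Pearson correlation, PROC CORR output above
r_square = 0.7395  # R-Square, PROC REG output above

# For simple linear regression, R-Square = r^2
assert abs(r ** 2 - r_square) < 5e-5
print(round(r ** 2, 4))  # 0.7395
```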