Correlation and Regression
Basic Concepts
An Example
• We can hypothesize that the value of a house increases as its size increases.
• Said differently, size and house value “covary” or “co-relate.”
• Further, we can hypothesize that the relationship is a simple linear one, e.g., that as size increases, house value increases in a similar linear fashion.
• Hence we can use the simple linear equation, y = a + bx, to describe the relationship.
We Ask Two Questions…
• Is there a relationship, and how strong is it?
• And what is the relationship?
• We answer the first with a new statistic, a “correlation” coefficient.
• We answer the second with a linear regression model.
Two Questions
• We started with Correlation.
• We continue with Regression.
Terms
• Independent and Dependent variables
• Scatterplots
• Correlation, correlation coefficient, r
• Regression, regression coefficient, b
• Regression, regression constant, a
• Ordinary Least Squares (OLS) equation: y = a + bx + e
Issues
• Defining relationships
– Nature of the relationship: for the moment, linear
– Strength of the relationship (using r)
– Direction of the relationship (using r and b)
– Calculation of the relationship: y = a + bx + e
Some useful websites
• http://noppa5.pc.helsinki.fi/koe/corr/cor7.html
• http://davidmlane.com/hyperstat/A60659.html
• http://www.ruf.rice.edu/~lane/stat_sim/reg_by_eye/index.html
Illustration
• Case A. x = 2.5, y = 2
• Case B. x = 8, y = 7
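One way to read this illustration: two cases are enough to pin down a straight line exactly. The sketch below assumes that reading and computes the slope and intercept of the line through Case A and Case B. The variable names are illustrative only.

```python
# Coordinates of the two cases given in the illustration.
ax, ay = 2.5, 2.0   # Case A
bx, by = 8.0, 7.0   # Case B

# Slope: change in y per unit change in x between the two cases.
slope = (by - ay) / (bx - ax)

# Intercept: the value of y where the line crosses x = 0.
intercept = ay - slope * ax

print(f"y = {intercept:.4f} + {slope:.4f}x")
```

With only two cases the line passes through both points and there is no error term. A third case that is off this line is what makes a "best fit" necessary.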
Linear Trend
What if there are lots of data points?
[Figure: scatterplot of PROPVALU (0 to 12000) against SIZE (0 to 5000)]
What if there are more data points? How do we summarize the relationships in the data?
Solution: Least Squares Regression, The Best Linear Fit
[Figure: Dependent Variable (1 to 8) against Independent Variable (2 to 9), showing points A, B, and C]
Some Theory
• Knowing nothing else, the best estimate of a variable is its mean.
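A small numerical check of this claim, taking "best" to mean the single guess with the smallest sum of squared deviations (the least squares criterion used in this deck). The y values are those of the three cases tabulated at the end of these slides. The helper name `sum_sq_dev` is illustrative only.

```python
def sum_sq_dev(values, guess):
    """Sum of squared deviations of the values from a single guess."""
    return sum((v - guess) ** 2 for v in values)

# y values of the three cases tabulated at the end of the slides.
y = [2.0, 7.0, 7.0]
mean_y = sum(y) / len(y)

# Compare the mean against a few nearby guesses.
candidates = [mean_y - 1.0, mean_y - 0.1, mean_y, mean_y + 0.1, mean_y + 1.0]
errors = {c: sum_sq_dev(y, c) for c in candidates}
best = min(errors, key=errors.get)

for guess, err in errors.items():
    print(f"guess = {guess:.3f}  sum of squared deviations = {err:.3f}")
```

Among these candidates the mean gives the smallest total, which is the sense in which it is the best estimate when nothing else is known.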
[Figure: Dependent Variable (1 to 8) against Independent Variable (2 to 9), showing points A, B, and C with the Mean Y line and the Linear Trend line]
The Regression Model does better…
• Deviation from mean y = yi – ymean
[Figure: Dependent Variable (1 to 8) against Independent Variable (2 to 9), showing points A, B, and C with the Mean Y line and the Linear Trend line]
A Regression equation…
• Measures the nature of the relationship between x and y using a linear model
• Measures the direction of the relationship
• An accompanying statistic, for the time being r, measures the strength of the relationship.
Understanding the Improvement, measuring the deviations from the mean
[Figure: Dependent Variable (1 to 8) against Independent Variable (2 to 9), with the Mean y line and the Linear Trend line]
More Terms
• Yi – the value of a particular case
• Y mean – mean value of y
• Y hat – y with a ^ above it, so ŷ
• (Yi – Ymean) = total deviation from mean Y
• (Yhat – Ymean) = explained deviation of Yi from Y mean
• (Yi – Yhat) = unexplained deviation of Yi from Y mean
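The three deviations can be computed case by case. The sketch below uses the three cases tabulated at the end of these slides and fits the line with the deviation form of the least squares slope, which is algebraically equivalent to the computational formula. The variable names are illustrative only.

```python
# The three cases tabulated at the end of the slides.
x = [2.5, 4.0, 8.0]
y = [2.0, 7.0, 7.0]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Least squares slope and constant (deviation form).
b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))
a = mean_y - b * mean_x

rows = []
for xi, yi in zip(x, y):
    y_hat = a + b * xi             # predicted value for this case
    total = yi - mean_y            # total deviation from mean Y
    explained = y_hat - mean_y     # part accounted for by the line
    unexplained = yi - y_hat       # residual left over
    rows.append((total, explained, unexplained))
    print(f"x={xi}: total={total:.3f} explained={explained:.3f} "
          f"unexplained={unexplained:.3f}")
```

For every case the total deviation equals the explained plus the unexplained deviation, which is why the regression line is described as accounting for part of the variation around the mean.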
Bivariate Regression
• Relationships are modeled using the equation, y = a + bx + e
• Translation: The values of an interval-level dependent variable, y, can be “predicted” or “modeled” by adding a constant, a, to the product of a slope coefficient, b, and the value of the independent variable, x, plus an error term, e.
Estimating the Equation, y = a + bx + e
• The regression equation is calculated by finding the equation that minimizes the sum of the squared deviations between the data points, the y’s, and the predicted y’s, also called y hat.
• $\bar{y}$ = mean of y
• $\hat{y}$ = predicted y, or "y hat"
• $y = a + bx + e$
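To make the minimization concrete, the sketch below computes the sum of squared deviations between the observed y's and the predicted y's for the least squares line, then checks that nudging the constant or the slope in any direction makes that sum larger. The data are the three cases tabulated at the end of these slides. The helper name `sse` and the perturbation sizes are arbitrary choices for illustration.

```python
def sse(a, b, x, y):
    """Sum of squared deviations between observed y and predicted y (a + b*x)."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# The three cases tabulated at the end of the slides.
x = [2.5, 4.0, 8.0]
y = [2.0, 7.0, 7.0]
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least squares estimates of the slope and the constant.
b_ols = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
a_ols = mean_y - b_ols * mean_x

best = sse(a_ols, b_ols, x, y)

# Perturb the constant and slope; every alternative should fit worse.
perturbed = [
    sse(a_ols + da, b_ols + db, x, y)
    for da in (-0.5, 0.0, 0.5)
    for db in (-0.1, 0.0, 0.1)
    if (da, db) != (0.0, 0.0)
]

print(f"SSE at least squares line: {best:.4f}")
print(f"smallest SSE among perturbed lines: {min(perturbed):.4f}")
```

This is a spot check on a handful of alternatives, not a proof; the closed-form formulas for a and b come from solving the minimization exactly.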
Correlation Coefficient: r
• A measure of the strength of a linear relationship between two interval variables, x and y
• Ranges from -1 to +1
• The higher the absolute value of r (i.e., the closer to -1 or +1), the stronger the relationship between x and y
Correlation Coefficient calculation
• r = Covariance of x and y divided by the product of the standard deviation of x and the standard deviation of y
• Covariance is the sum, over all cases, of the products of the deviations of x and y from their means, divided by N.
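The verbal definition translates directly into a few lines of code. The sketch below follows it step by step for the three cases tabulated at the end of these slides, dividing by N in both the covariance and the standard deviations, as the definition states. The variable names are illustrative only.

```python
import math

# The three cases tabulated at the end of the slides.
x = [2.5, 4.0, 8.0]
y = [2.0, 7.0, 7.0]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Covariance: average product of the paired deviations from the means.
cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n

# Standard deviations, also dividing by N.
sd_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / n)
sd_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / n)

r = cov_xy / (sd_x * sd_y)
print(f"r = {r:.4f}")
```

For these three cases r comes out at roughly 0.71. Dividing by N - 1 instead of N throughout would give the same r, because the divisor cancels between the numerator and the denominator.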
Equations...
$$ r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \, \sum (Y - \bar{Y})^2}} $$

where r = correlation coefficient
Calculating a and b
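$$ b = \frac{\sum XY - N\bar{X}\bar{Y}}{\sum X^2 - N\bar{X}^2} $$

$$ a = \bar{Y} - b\bar{X} = \frac{\sum Y - b\sum X}{N} $$

$$ R^2 = r^2 = \frac{\sum (\hat{Y} - \bar{Y})^2}{\sum (Y - \bar{Y})^2} $$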
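As a worked example, the sketch below applies the computational formulas for b and a, and the ratio of explained to total squared deviation for R squared, to the three cases tabulated at the end of these slides. The variable names are illustrative only.

```python
# The three cases tabulated at the end of the slides.
x = [2.5, 4.0, 8.0]
y = [2.0, 7.0, 7.0]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n
sum_xy = sum(xi * yi for xi, yi in zip(x, y))   # sum of X*Y
sum_x2 = sum(xi ** 2 for xi in x)               # sum of X squared

# Slope: (sum XY - N * mean X * mean Y) / (sum X^2 - N * mean X^2)
b = (sum_xy - n * mean_x * mean_y) / (sum_x2 - n * mean_x ** 2)

# Constant: mean Y - b * mean X
a = mean_y - b * mean_x

# R squared: explained squared deviation over total squared deviation.
y_hat = [a + b * xi for xi in x]
r_squared = (sum((yh - mean_y) ** 2 for yh in y_hat)
             / sum((yi - mean_y) ** 2 for yi in y))

print(f"b = {b:.4f}, a = {a:.4f}, R^2 = {r_squared:.4f}")
```

For these cases the fitted line is approximately y = 1.85 + 0.72x, and about half of the squared deviation of y around its mean is explained by x.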
X Y
2.5 2
4 7
8 7