CHAPTER 7 Linear Correlation & Regression Methods


Transcript of CHAPTER 7 Linear Correlation & Regression Methods

Page 1: CHAPTER 7 Linear Correlation & Regression Methods

• 7.1 - Motivation

• 7.2 - Correlation / Simple Linear Regression

• 7.3 - Extensions of Simple Linear Regression


Page 2: CHAPTER 7 Linear Correlation & Regression Methods

Testing for association between two POPULATION variables X and Y…

• Categorical variables: Categories of X vs. Categories of Y → Chi-squared Test
• Numerical variables: ???????

Examples: X = Disease status (D+, D–), Y = Exposure status (E+, E–)
X = # children in household (0, 1-2, 3-4, 5+), Y = Income level (Low, Middle, High)

Parameter Estimation via SAMPLE DATA …

PARAMETERS
Means: μ_X = E[X], μ_Y = E[Y]
Variances: σ_X² = E[(X − μ_X)²], σ_Y² = E[(Y − μ_Y)²]
Covariance: σ_XY = E[(X − μ_X)(Y − μ_Y)]

Page 3: CHAPTER 7 Linear Correlation & Regression Methods

Parameter Estimation via SAMPLE DATA …

Sample of n data points: x₁, x₂, x₃, x₄, …, xₙ and y₁, y₂, y₃, y₄, …, yₙ

STATISTICS (estimating the corresponding population PARAMETERS)
Means: x̄, ȳ
Variances: s_x² = Σ(x − x̄)² / (n − 1), s_y² = Σ(y − ȳ)² / (n − 1)
Covariance: s_xy = Σ(x − x̄)(y − ȳ) / (n − 1)  (can be +, –, or 0)

Page 4: CHAPTER 7 Linear Correlation & Regression Methods

Parameter Estimation via SAMPLE DATA …

Scatterplot of the n data points (xᵢ, yᵢ), Y vs. X. [JAMA. 2003;290:1486-1493]

Page 5: CHAPTER 7 Linear Correlation & Regression Methods

Parameter Estimation via SAMPLE DATA …

Scatterplot (n data points). Does this suggest a linear trend between X and Y? If so, how do we measure it?

Page 6: CHAPTER 7 Linear Correlation & Regression Methods

Testing for association between two population variables X and Y…

• Numerical variables

PARAMETERS
Means: μ_X = E[X], μ_Y = E[Y]
Variances: σ_X² = E[(X − μ_X)²], σ_Y² = E[(Y − μ_Y)²]
Covariance: σ_XY = E[(X − μ_X)(Y − μ_Y)]

Linear Correlation Coefficient:  ρ = σ_XY / √(σ_X² σ_Y²)

Always between –1 and +1. Measures the strength of LINEAR association.

Page 7: CHAPTER 7 Linear Correlation & Regression Methods

Parameter Estimation via SAMPLE DATA …

The sample linear correlation coefficient estimates ρ:

r = s_xy / √(s_x² s_y²)

Always between –1 and +1.

Page 8: CHAPTER 7 Linear Correlation & Regression Methods

Parameter Estimation via SAMPLE DATA … [JAMA. 2003;290:1486-1493]

Example in R (reformatted for brevity), n = 10:

> pop = seq(0, 20, 0.1)
> x = sort(sample(pop, 10))
1.1 1.8 2.1 3.7 4.0 7.3 9.1 11.9 12.4 17.1
> y = sample(pop, 10)
13.1 18.3 17.6 19.1 19.3 3.2 5.6 13.6 8.0 3.0
> plot(x, y, pch = 19)
> c(mean(x), mean(y))
7.05 12.08
> var(x)
29.48944
> var(y)
43.76178
> cov(x, y)
-25.86667
> cor(x, y)
-0.7200451
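The session above is in R; as a cross-check, the same statistics can be reproduced with a short stdlib-only Python sketch (the data are the 10 points sampled above):

```python
import math

# The 10 sampled data points from the R session above.
x = [1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1]
y = [13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0]
n = len(x)

xbar = sum(x) / n                                      # mean(x)  = 7.05
ybar = sum(y) / n                                      # mean(y)  = 12.08
sx2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)      # var(x)   = 29.48944
sy2 = sum((yi - ybar) ** 2 for yi in y) / (n - 1)      # var(y)   = 43.76178
sxy = sum((xi - xbar) * (yi - ybar)
          for xi, yi in zip(x, y)) / (n - 1)           # cov(x,y) = -25.86667
r = sxy / math.sqrt(sx2 * sy2)                         # cor(x,y) = -0.7200451
```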

Page 9: CHAPTER 7 Linear Correlation & Regression Methods

Linear Correlation Coefficient:  r = s_xy / √(s_x² s_y²)

Always between –1 and +1: r measures the strength of linear association.

Page 10: CHAPTER 7 Linear Correlation & Regression Methods

Linear Correlation Coefficient:  r = s_xy / √(s_x² s_y²)

    –1 ——— 0 ——— +1
r near –1: negative linear correlation; r near +1: positive linear correlation.

r measures the strength of linear association.

Page 11: CHAPTER 7 Linear Correlation & Regression Methods


Page 12: CHAPTER 7 Linear Correlation & Regression Methods


Page 13: CHAPTER 7 Linear Correlation & Regression Methods


Page 14: CHAPTER 7 Linear Correlation & Regression Methods

Testing for linear association between two numerical population variables X and Y…

Linear Correlation Coefficient:  ρ = σ_XY / √(σ_X² σ_Y²), estimated by r = s_xy / √(s_x² s_y²).

H₀: ρ = 0  ("No linear association between X and Y.")
H_A: ρ ≠ 0  ("Linear association between X and Y.")

Now that we have r, we can conduct HYPOTHESIS TESTING on ρ.

Test Statistic for p-value:  T = r √((n − 2) / (1 − r²))  ~  t_{n−2}

T = −0.72 √((10 − 2) / (1 − (−0.72)²)) = −2.935 on t₈

p-value = 2 * pt(-2.935, 8) = .0189 < .05
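The test statistic can be checked directly from r and n with a stdlib-Python sketch (the p-value itself needs the t distribution CDF, which R's pt supplies):

```python
import math

r, n = -0.7200451, 10    # sample correlation and sample size from the example
# T = r * sqrt((n - 2) / (1 - r^2)) ~ t_{n-2}
T = r * math.sqrt((n - 2) / (1 - r ** 2))
# T comes out near -2.935; in R the p-value is then 2 * pt(T, n - 2).
```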

Page 15: CHAPTER 7 Linear Correlation & Regression Methods

Parameter Estimation via SAMPLE DATA …

If such an association between X and Y exists, then for some intercept β₀ and slope β₁ we can write

Y = β₀ + β₁X + ε    "Response = Model + Error"

Find estimates β̂₀ and β̂₁ for the "best" line  Ŷ = β̂₀ + β̂₁X  … in what sense???

Residuals: for each observed point (xᵢ, yᵢ) with fitted point (xᵢ, ŷᵢ), the residual is eᵢ = yᵢ − ŷᵢ.

SS_Err = Σ eᵢ²

Page 16: CHAPTER 7 Linear Correlation & Regression Methods

SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES

Find estimates β̂₀ and β̂₁ for the "best" line Ŷ = β̂₀ + β̂₁X, i.e., the line that minimizes SS_Err = Σ eᵢ².

β̂₁ = s_xy / s_x² = −25.86667 / 29.48944 = −0.87715

β̂₀ = ȳ − β̂₁ x̄ = 12.08 − (−0.87715)(7.05) = 18.26391

"Least Squares Regression Line" — note that (x̄, ȳ) is on the line.
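Since β̂₁ = s_xy / s_x² and β̂₀ = ȳ − β̂₁ x̄, both estimates follow from the sample statistics already computed; a stdlib-Python sketch:

```python
x = [1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1]
y = [13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
sx2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)

b1 = sxy / sx2           # slope:     -25.86667 / 29.48944 = -0.87715
b0 = ybar - b1 * xbar    # intercept: 12.08 - (-0.87715)(7.05) = 18.26391

# (xbar, ybar) lies exactly on the least squares line:
assert abs((b0 + b1 * xbar) - ybar) < 1e-12
```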

Page 17: CHAPTER 7 Linear Correlation & Regression Methods

Ŷ = 18.26391 − 0.87715 X

Check: (x̄, ȳ) = (7.05, 12.08) is on the line: 18.26391 − 0.87715(7.05) = 12.08.

Page 18: CHAPTER 7 Linear Correlation & Regression Methods

SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES:  Ŷ = 18.26391 − 0.87715 X

predictor          X   1.1  1.8  2.1  3.7  4.0  7.3  9.1  11.9  12.4  17.1
observed response  Y   13.1 18.3 17.6 19.1 19.3 3.2  5.6  13.6  8.0   3.0

Page 19: CHAPTER 7 Linear Correlation & Regression Methods

As above, plus a row for the fitted response Ŷᵢ = 18.26391 − 0.87715 xᵢ at each data point.

Page 20: CHAPTER 7 Linear Correlation & Regression Methods

Fitted responses Ŷ:  ~ E X E R C I S E ~

Page 21: CHAPTER 7 Linear Correlation & Regression Methods

Residuals Y − Ŷ:  ~ E X E R C I S E ~

Page 22: CHAPTER 7 Linear Correlation & Regression Methods

Ŷ = 18.26391 − 0.87715 X

Residuals eᵢ = yᵢ − ŷᵢ:  SS_Err = Σ eᵢ² = 189.6555
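The residuals and SS_Err in the exercise can be verified numerically (stdlib-Python sketch):

```python
x = [1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1]
y = [13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0]

b0, b1 = 18.26391, -0.87715               # least squares estimates from above
yhat = [b0 + b1 * xi for xi in x]         # fitted responses
e = [yi - yh for yi, yh in zip(y, yhat)]  # residuals e_i = y_i - yhat_i

sse = sum(ei ** 2 for ei in e)            # SS_Err = 189.6555
# least squares residuals sum to (essentially) zero
assert abs(sum(e)) < 1e-3
```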

Page 23: CHAPTER 7 Linear Correlation & Regression Methods

Testing for linear association between two numerical population variables X and Y…

Linear Regression Coefficients:  Y = β₀ + β₁X + ε  ("Response = Model + Error"), fitted by  Ŷ = β̂₀ + β̂₁X  with  β̂₁ = s_xy / s_x²,  β̂₀ = ȳ − β̂₁x̄.

H₀: β₁ = 0  ("No linear association between X and Y.")
H_A: β₁ ≠ 0  ("Linear association between X and Y.")

Now that we have these, we can conduct HYPOTHESIS TESTING on β₀ and β₁.

Test Statistic for p-value?

Page 24: CHAPTER 7 Linear Correlation & Regression Methods


Page 25: CHAPTER 7 Linear Correlation & Regression Methods

Testing for linear association between two numerical population variables X and Y…

H₀: β₁ = 0  ("No linear association between X and Y.")
H_A: β₁ ≠ 0  ("Linear association between X and Y.")

SS_Err = Σ(y − ŷ)²,  MS_Err = SS_Err / (n − 2)

Test Statistic for p-value:  T = (β̂₁ − 0) √((n − 1) s_x² / MS_Err)  ~  t_{n−2}

T = (−0.87715 − 0) √((9)(29.48944) / (189.6555 / 8)) = −2.935 on t₈

Same t-score as H₀: ρ = 0!  p-value = .0189

Page 26: CHAPTER 7 Linear Correlation & Regression Methods

> plot(x, y, pch = 19)
> lsreg = lm(y ~ x)    # or lsfit(x, y)
> abline(lsreg)
> summary(lsreg)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-8.6607 -3.2154  0.8954  3.4649  5.7742

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  18.2639     2.6097   6.999 0.000113 ***
x            -0.8772     0.2989  -2.935 0.018857 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.869 on 8 degrees of freedom
Multiple R-squared: 0.5185,  Adjusted R-squared: 0.4583
F-statistic: 8.614 on 1 and 8 DF,  p-value: 0.01886

BUT WHY HAVE TWO METHODS FOR THE SAME PROBLEM???

Because this second method generalizes…

Page 27: CHAPTER 7 Linear Correlation & Regression Methods

H₀: β₁ = 0 vs. H_A: β₁ ≠ 0, for the model Y = β₀ + β₁X + ε.

ANOVA Table

Source      df    SS    MS    F-ratio    p-value
Treatment
Error
Total                   –

Page 28: CHAPTER 7 Linear Correlation & Regression Methods

For regression, the "Treatment" source becomes "Regression":

Source      df    SS    MS    F-ratio    p-value
Regression   ?
Error
Total                   –

Page 29: CHAPTER 7 Linear Correlation & Regression Methods

With one predictor, the Regression degrees of freedom is 1:

Source      df    SS    MS    F-ratio    p-value
Regression   1
Error        ?
Total                   –

Page 30: CHAPTER 7 Linear Correlation & Regression Methods

The Error degrees of freedom come from the slope t-test above: df_Err = n − 2 = 8.

Page 31: CHAPTER 7 Linear Correlation & Regression Methods

Source      df    SS    MS    F-ratio    p-value
Regression   1     ?     ?       ?          ?
Error        8     ?     ?
Total        ?     ?     –

Page 32: CHAPTER 7 Linear Correlation & Regression Methods


Page 33: CHAPTER 7 Linear Correlation & Regression Methods

SS_Tot = Σ(y − ȳ)² = (n − 1) s_y²

SS_Tot is a measure of the total amount of variability in the observed responses (i.e., before any model-fitting).

Page 34: CHAPTER 7 Linear Correlation & Regression Methods

SS_Reg = Σ(ŷ − ȳ)²

SS_Reg is a measure of the total amount of variability in the fitted responses (i.e., after model-fitting).

Page 35: CHAPTER 7 Linear Correlation & Regression Methods

SS_Err = Σ(y − ŷ)²

SS_Err is a measure of the total amount of variability in the resulting residuals (i.e., after model-fitting).

Page 36: CHAPTER 7 Linear Correlation & Regression Methods

Ŷ = 18.26391 − 0.87715 X

SS_Err = Σ(y − ŷ)² = 189.656
SS_Tot = Σ(y − ȳ)² = (n − 1) s_y² = 9 (43.76178) = 393.856
SS_Reg = Σ(ŷ − ȳ)² = 204.2

Page 37: CHAPTER 7 Linear Correlation & Regression Methods

SS_Tot = SS_Reg + SS_Err:  393.856 = 204.2 + 189.656

The least squares line makes SS_Err the minimum possible.
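The decomposition SS_Tot = SS_Reg + SS_Err can be confirmed numerically (stdlib-Python sketch):

```python
x = [1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1]
y = [13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0]
n = len(x)
ybar = sum(y) / n

b0, b1 = 18.26391, -0.87715          # least squares estimates from above
yhat = [b0 + b1 * xi for xi in x]

ss_tot = sum((yi - ybar) ** 2 for yi in y)                # 393.856
ss_reg = sum((yh - ybar) ** 2 for yh in yhat)             # 204.200
ss_err = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))   # 189.656
# the two model-based pieces add back up to the total variability
assert abs(ss_tot - (ss_reg + ss_err)) < 0.01
```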

Page 38: CHAPTER 7 Linear Correlation & Regression Methods

ANOVA Table  (H₀: β₁ = 0 vs. H_A: β₁ ≠ 0, model Y = β₀ + β₁X + ε)

Source      df    SS        MS        F-ratio           p-value
Regression   1    204.200   MS_Reg    F ~ F_{k−1,n−k}   0 < p < 1
Error        8    189.656   MS_Err
Total        9    393.856   –

Page 39: CHAPTER 7 Linear Correlation & Regression Methods

ANOVA Table  (H₀: β₁ = 0 vs. H_A: β₁ ≠ 0, model Y = β₀ + β₁X + ε)

Source      df    SS        MS        F-ratio    p-value
Regression   1    204.200   204.200   8.61349    0.018857
Error        8    189.656    23.707
Total        9    393.856   –

Same as before!

Page 40: CHAPTER 7 Linear Correlation & Regression Methods

> summary(aov(lsreg))
            Df Sum Sq Mean Sq F value  Pr(>F)
x            1 204.20 204.201  8.6135 0.01886 *
Residuals    8 189.66  23.707
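The F-ratio is MS_Reg / MS_Err, and with a single predictor it equals the square of the slope's t statistic, which is why both approaches give the same p-value (stdlib-Python sketch):

```python
ss_reg, ss_err = 204.200, 189.656    # sums of squares from the ANOVA table
df_reg, df_err = 1, 8

ms_reg = ss_reg / df_reg             # 204.200
ms_err = ss_err / df_err             # 23.707
F = ms_reg / ms_err                  # 8.6135

t = -2.935                           # slope t statistic from summary(lsreg)
assert abs(F - t ** 2) < 0.01        # F = t^2 when df_reg = 1
```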

Page 41: CHAPTER 7 Linear Correlation & Regression Methods

Moreover,  SS_Reg / SS_Tot = 204.2 / 393.856 = 0.5185.

Coefficient of Determination: the least squares regression line accounts for 51.85% of the total variability in the observed response, with 48.15% remaining.

Page 42: CHAPTER 7 Linear Correlation & Regression Methods

> cor(x, y)
-0.7200451

Coefficient of Determination:  r² = SS_Reg / SS_Tot = 204.2 / 393.856 = 0.5185 = (−0.72)².

The least squares regression line accounts for 51.85% of the total variability in the observed response, with 48.15% remaining.
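That r² equals SS_Reg / SS_Tot can be verified from the two independent computations (stdlib-Python sketch):

```python
r = -0.7200451                   # cor(x, y) from the R session
ss_reg, ss_tot = 204.200, 393.856

r2 = r ** 2                      # 0.5185
# the squared correlation matches the ANOVA proportion of variability
assert abs(r2 - ss_reg / ss_tot) < 1e-4
```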

Page 43: CHAPTER 7 Linear Correlation & Regression Methods


Page 44: CHAPTER 7 Linear Correlation & Regression Methods

Summary of Linear Correlation and Simple Linear Regression

Given: sample data (x₁, y₁), …, (xₙ, yₙ); means x̄, ȳ; variances s_x², s_y²; covariance s_xy. [JAMA. 2003;290:1486-1493]

Linear Correlation Coefficient:  r = s_xy / √(s_x² s_y²);  −1 ≤ r ≤ +1;  measures the strength of linear association.

Least Squares Regression Line:  β̂₁ = s_xy / s_x²,  β̂₀ = ȳ − β̂₁x̄;  Ŷ = β̂₀ + β̂₁X  minimizes  SS_Err = Σ(y − ŷ)² = SS_Tot − SS_Reg  (ANOVA).

All point estimates can be upgraded to CIs for hypothesis testing, etc.

Page 45: CHAPTER 7 Linear Correlation & Regression Methods

Summary, continued: 95% Confidence Intervals.

Upper and lower 95% confidence bands can be drawn around the fitted line Ŷ = β̂₀ + β̂₁X. All point estimates can be upgraded to CIs for hypothesis testing, etc. (See notes for "95% prediction intervals.")

Page 46: CHAPTER 7 Linear Correlation & Regression Methods

Summary, continued: Coefficient of Determination.

r² = SS_Reg / SS_Tot = proportion of total variability modeled by the regression line's variability.

Page 47: CHAPTER 7 Linear Correlation & Regression Methods

Multilinear Regression: testing for linear association between a population response variable Y and multiple predictor variables X₁, X₂, X₃, … etc.

Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + … + β_{k−1}X_{k−1} + ε    "Response = Model + Error"

Ŷ = β̂₀ + β̂₁X₁ + β̂₂X₂ + … + β̂_{k−1}X_{k−1}

H₀: β₁ = β₂ = β₃ = … = β_{k−1} = 0  ("No linear association between Y and any of its predictors X₁, X₂, X₃, …, X_{k−1}.")
H_A: βᵢ ≠ 0 for some i = 1, 2, …, k − 1  ("Linear association between Y and at least one of its predictors.")

The βᵢXᵢ terms are "main effects." For now, assume the "additive model," i.e., main effects only.

Page 48: CHAPTER 7 Linear Correlation & Regression Methods

Multilinear Regression

With two predictors, Ŷ = β̂₀ + β̂₁X₁ + β̂₂X₂ is a fitted plane over the (X₁, X₂) plane: for predictors (x₁ᵢ, x₂ᵢ), the true response is yᵢ, the fitted response is ŷᵢ, and the residual is eᵢ = yᵢ − ŷᵢ.

Least Squares calculation of the regression coefficients is computer-intensive; the formulas require Linear Algebra (matrices)!

Once calculated, how do we then test the null hypothesis? ANOVA.

Page 49: CHAPTER 7 Linear Correlation & Regression Methods

Multilinear Regression (main effects):  Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + … + β_{k−1}X_{k−1} + ε    "Response = Model + Error"

R code example: lsreg = lm(y ~ x1 + x2 + x3)

Page 50: CHAPTER 7 Linear Correlation & Regression Methods

Adding quadratic terms, cubes, etc. gives "polynomial regression":

Y = β₀ + β₁X₁ + β₂X₂ + … + β_{k−1}X_{k−1} + β_{1,1}X₁² + β_{2,2}X₂² + … + β_{k−1,k−1}X_{k−1}² + cubes + … + ε

R code example: lsreg = lm(y ~ x + I(x^2) + I(x^3))
(In an R formula, ^ means crossing, so powers must be wrapped in I().)

Page 51: CHAPTER 7 Linear Correlation & Regression Methods

Adding products of predictors gives "interactions":

Y = β₀ + β₁X₁ + β₂X₂ + … + β_{k−1}X_{k−1}    ("main effects")
  + β_{1,1}X₁² + β_{2,2}X₂² + … + cubes + …    (polynomial terms)
  + β_{1,2}X₁X₂ + β_{1,3}X₁X₃ + … + β_{1,k−1}X₁X_{k−1}
  + β_{2,3}X₂X₃ + β_{2,4}X₂X₄ + … + β_{2,k−1}X₂X_{k−1} + …    ("interactions")
  + ε

R code examples:
lsreg = lm(y ~ x1 + x2 + x1:x2)
lsreg = lm(y ~ x1*x2)    # equivalent to the previous line

Page 56: CHAPTER 7 Linear Correlation & Regression Methods

Recall the earlier example. Suppose these are actually two subgroups, requiring two distinct linear regressions!

Multiple Linear Regression with interaction, with an indicator ("dummy") variable:

Ŷ = β̂₀ + β̂₁X + β̂₂I + β̂₃XI

Example in R (reformatted for brevity):

> I = c(1,1,1,1,1,0,0,0,0,0)
> lsreg = lm(y ~ x*I)
> summary(lsreg)

Coefficients:
             Estimate
(Intercept)  6.56463
x            0.00998
I            6.80422
x:I          1.60858

Ŷ = 6.56 + 0.01 X + 6.80 I + 1.61 XI

I = 0:  Ŷ = 6.56 + 0.01 X
I = 1:  Ŷ = 13.36 + 1.62 X
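The two subgroup lines follow directly from the four fitted coefficients; a stdlib-Python sketch using the estimates printed above:

```python
# Coefficients from summary(lsreg): (Intercept), x, I, x:I
b0, b1, b2, b3 = 6.56463, 0.00998, 6.80422, 1.60858

def yhat(x, I):
    """Fitted model: Y-hat = b0 + b1*x + b2*I + b3*x*I."""
    return b0 + b1 * x + b2 * I + b3 * x * I

# I = 0 subgroup: intercept b0, slope b1
# I = 1 subgroup: intercept b0 + b2 = 13.36885, slope b1 + b3 = 1.61856
int1, slope1 = b0 + b2, b1 + b3
# setting I = 1 in the full model reproduces the second line exactly
assert abs(yhat(5.0, 1) - (int1 + slope1 * 5.0)) < 1e-12
```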

Page 57: CHAPTER 7 Linear Correlation & Regression Methods

ANOVA Table (revisited)

From a sample of n data points, fit  Ŷ = β̂₀ + β̂₁X₁ + β̂₂X₂ + … + β̂_{k−1}X_{k−1}  for the model  Y = β₀ + β₁X₁ + β₂X₂ + … + β_{k−1}X_{k−1} + ε.

H₀: β₁ = β₂ = β₃ = … = β_{k−1} = 0  ("No linear association between Y and any of its predictors X₁, X₂, X₃, …, X_{k−1}.")  Note that if true, it would follow that μ_Y = β₀ and β̂₀ = ȳ.
H_A: βᵢ ≠ 0 for some i = 1, 2, …, k − 1  ("Linear association between Y and at least one of its predictors.")

But how are these regression coefficients calculated in general? "Normal equations," solved via computer (intensive).

Page 58: CHAPTER 7 Linear Correlation & Regression Methods

ANOVA Table (revisited), based on n data points:

Source      df      SS              MS                        F                             p-value
Regression  k − 1   Σᵢ (ŷᵢ − ȳ)²    MS_Reg = SS_Reg/(k − 1)   MS_Reg/MS_Err ~ F_{k−1,n−k}   0 < p < 1
Error       n − k   Σᵢ (yᵢ − ŷᵢ)²   MS_Err = SS_Err/(n − k)
Total       n − 1   Σᵢ (yᵢ − ȳ)²

H₀: β₁ = β₂ = β₃ = … = β_{k−1} = 0  ("No linear association between Y and any of its predictors X₁, X₂, X₃, …, X_{k−1}.")

*** How are only the statistically significant variables determined? ***

Page 59: CHAPTER 7 Linear Correlation & Regression Methods

"MODEL SELECTION" (Backward Elimination)

Step 0. Conduct an overall F-test of significance (via ANOVA) of the full model Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + β₄X₄ + … If significant, then…

Step 1. t-tests:  H₀: β₁ = 0,  H₀: β₂ = 0,  H₀: β₃ = 0,  H₀: β₄ = 0, ……
p-values:  p₁ < .05 (Reject H₀),  p₂ < .05 (Reject H₀),  p₃ ≥ .05 (Accept H₀),  p₄ < .05 (Reject H₀), ……

Step 2. Are all coefficients significant at level α? If not….

Page 60: CHAPTER 7 Linear Correlation & Regression Methods

Step 2 (continued). If not all coefficients are significant at level α (here p₃ ≥ .05 for X₃), delete that term.

Page 61: CHAPTER 7 Linear Correlation & Regression Methods

Step 2 (continued). …delete that term, and recompute new coefficients for the remaining model β₀ + β₁X₁ + β₂X₂ + β₄X₄ + …!

Step 3. Repeat Steps 1-2 as necessary until all coefficients are significant → reduced model.

Page 62: CHAPTER 7 Linear Correlation & Regression Methods

Recall ~ Analysis of Variance (ANOVA): k ≥ 2 independent, equivariant, normally-distributed "treatment groups" with means μ₁, μ₂, …, μ_k and responses Y₁, Y₂, …, Y_k.

H₀: μ₁ = μ₂ = … = μ_k

MODEL ASSUMPTIONS?

Page 63: CHAPTER 7 Linear Correlation & Regression Methods

“Regression Diagnostics”

Page 72: CHAPTER 7 Linear Correlation & Regression Methods

Re-plot data on a “log-log” scale.

Page 75: CHAPTER 7 Linear Correlation & Regression Methods

Re-plot data on a “log” scale (of Y only).

Page 76: CHAPTER 7 Linear Correlation & Regression Methods

Binary outcome, e.g., “Have you ever had surgery?” (Yes / No)

Page 77: CHAPTER 7 Linear Correlation & Regression Methods


Page 78: CHAPTER 7 Linear Correlation & Regression Methods

Simple logistic regression for a binary outcome:

ln(π̂ / (1 − π̂)) = β̂₀ + β̂₁X    "log-odds" ("logit"), an example of a general "link function" g(π)

π̂ = 1 / (1 + e^(−(β̂₀ + β̂₁X)))

Coefficients are found via "MAXIMUM LIKELIHOOD ESTIMATION."
(Note: not based on Least Squares, which implies "pseudo-R²," etc.)
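The logit link and its inverse can be sketched in stdlib Python; b0 and b1 below are hypothetical coefficients for illustration, not fitted from any data:

```python
import math

def logit(p):
    """log-odds link: g(p) = ln(p / (1 - p))"""
    return math.log(p / (1 - p))

def inv_logit(z):
    """inverse link: p = 1 / (1 + e^{-z})"""
    return 1 / (1 + math.exp(-z))

b0, b1 = -2.0, 0.05      # hypothetical coefficients, illustration only
X = 30.0
p = inv_logit(b0 + b1 * X)     # modeled probability of the outcome at X
# round trip: the logit of p recovers the linear predictor b0 + b1*X
assert abs(logit(p) - (b0 + b1 * X)) < 1e-12
```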

Page 79: CHAPTER 7 Linear Correlation & Regression Methods

Multiple logistic regression for a binary outcome:

ln(π̂ / (1 − π̂)) = β̂₀ + β̂₁X₁ + β̂₂X₂ + … + β̂_kX_k    "log-odds" ("logit")

π̂ = 1 / (1 + e^(−(β̂₀ + β̂₁X₁ + β̂₂X₂ + … + β̂_kX_k)))

Suppose one of the predictor variables is binary:  X₁ = 1 if Age ≥ 50, X₁ = 0 if Age < 50.

X₁ = 1:  ln(π̂₁ / (1 − π̂₁)) = β̂₀ + β̂₁ + β̂₂X₂ + … + β̂_kX_k
X₁ = 0:  ln(π̂₀ / (1 − π̂₀)) = β̂₀ + β̂₂X₂ + … + β̂_kX_k

SUBTRACT!

Page 80: CHAPTER 7 Linear Correlation & Regression Methods


Page 81: CHAPTER 7 Linear Correlation & Regression Methods

Subtracting, every term except β̂₁ cancels:

β̂₁ = ln(π̂₁ / (1 − π̂₁)) − ln(π̂₀ / (1 − π̂₀))

Page 82: CHAPTER 7 Linear Correlation & Regression Methods

Combining the logs:

β̂₁ = ln[ (π̂₁ / (1 − π̂₁)) / (π̂₀ / (1 − π̂₀)) ]

Page 83: CHAPTER 7 Linear Correlation & Regression Methods

β̂₁ = ln[ (odds of surgery given Age ≥ 50) / (odds of surgery given Age < 50) ]

That is, β̂₁ = ln(OR), which implies OR = e^(β̂₁).    ODDS RATIO
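The SUBTRACT step can be demonstrated numerically: the difference of the two log-odds is exactly β̂₁, so the odds ratio is e^(β̂₁). Stdlib-Python sketch with hypothetical coefficient values (illustration only):

```python
import math

# hypothetical fitted coefficients, illustration only
b0, b1, b2 = -1.0, 0.7, 0.03
x2 = 40.0                      # some fixed value of the other predictor

z1 = b0 + b1 * 1 + b2 * x2     # linear predictor when X1 = 1 (Age >= 50)
z0 = b0 + b1 * 0 + b2 * x2     # linear predictor when X1 = 0 (Age < 50)
p1, p0 = (1 / (1 + math.exp(-z)) for z in (z1, z0))

odds1, odds0 = p1 / (1 - p1), p0 / (1 - p0)
OR = odds1 / odds0
assert abs(math.log(OR) - b1) < 1e-12    # ln(OR) = b1: all other terms cancel
assert abs(OR - math.exp(b1)) < 1e-9     # OR = e^{b1}
```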

Page 84: CHAPTER 7 Linear Correlation & Regression Methods

Recall from population dynamics:

Unrestricted population growth (e.g., bacteria). Population size y obeys dy/dt = ay, with constant a > 0.

Separate variables: (1/y) dy = a dt  ⟹  ln|y| = at + b  ⟹  y = e^(at) e^b = C e^(at).
With initial condition y(0) = y₀:  y = y₀ e^(at).  Exponential growth.

Restricted population growth (disease, predation, starvation, etc.). Population size y obeys dy/dt = ay(1 − y/M), with constant a > 0 and "carrying capacity" M.

Let π = y/M (survival probability). Then dπ/dt = aπ(1 − π). Separate variables:

(1/π + 1/(1 − π)) dπ = a dt  ⟹  ln|π| − ln|1 − π| = at + b  ⟹  ln(π / (1 − π)) = at + b.

With initial condition π(0) = π₀:  π = π₀ / (π₀ + (1 − π₀) e^(−at)).  Logistic growth.
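The closed-form logistic solution can be checked against the differential equation numerically (stdlib-Python sketch; a and π₀ are arbitrary test values):

```python
import math

a, p0 = 0.8, 0.1    # arbitrary growth rate and initial proportion

def p(t):
    """logistic solution: p(t) = p0 / (p0 + (1 - p0) * e^{-a t})"""
    return p0 / (p0 + (1 - p0) * math.exp(-a * t))

# p satisfies dp/dt = a * p * (1 - p): compare a central-difference
# approximation of the derivative with the right-hand side.
t, h = 2.0, 1e-6
deriv = (p(t + h) - p(t - h)) / (2 * h)
rhs = a * p(t) * (1 - p(t))
assert abs(deriv - rhs) < 1e-6
assert abs(p(0) - p0) < 1e-12    # initial condition p(0) = p0
```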