CHAPTER 7 Linear Correlation & Regression Methods


Transcript of CHAPTER 7 Linear Correlation & Regression Methods

Page 1: CHAPTER 7 Linear Correlation & Regression Methods

• 7.1 - Motivation

• 7.2 - Correlation / Simple Linear Regression

• 7.3 - Extensions of Simple Linear Regression


Page 2: CHAPTER 7 Linear Correlation & Regression Methods

Testing for association between two POPULATION variables X and Y…

• Categorical variables: Categories of X vs. Categories of Y → Chi-squared Test
• Numerical variables: ???????

Examples: X = Disease status (D+, D–), Y = Exposure status (E+, E–)
X = # children in household (0, 1-2, 3-4, 5+), Y = Income level (Low, Middle, High)

Parameter Estimation via SAMPLE DATA …

PARAMETERS
Means: μ_X = E[X], μ_Y = E[Y]
Variances: σ_X² = E[(X − μ_X)²], σ_Y² = E[(Y − μ_Y)²]
Covariance: σ_XY = E[(X − μ_X)(Y − μ_Y)]

Page 3: CHAPTER 7 Linear Correlation & Regression Methods

Parameter Estimation via SAMPLE DATA …

Sample of n data points: x₁, x₂, x₃, x₄, …, xₙ and y₁, y₂, y₃, y₄, …, yₙ

STATISTICS (estimating the corresponding population PARAMETERS)
Means: x̄, ȳ
Variances: s_x² = Σ(x − x̄)² / (n − 1), s_y² = Σ(y − ȳ)² / (n − 1)
Covariance: s_xy = Σ(x − x̄)(y − ȳ) / (n − 1)  (can be +, –, or 0)

Page 4: CHAPTER 7 Linear Correlation & Regression Methods

Parameter Estimation via SAMPLE DATA …

Scatterplot of the n data points (xᵢ, yᵢ), Y vs. X. [JAMA. 2003;290:1486-1493]

Page 5: CHAPTER 7 Linear Correlation & Regression Methods

Parameter Estimation via SAMPLE DATA …

Scatterplot (n data points). Does this suggest a linear trend between X and Y? If so, how do we measure it?

Page 6: CHAPTER 7 Linear Correlation & Regression Methods

Testing for association between two population variables X and Y…

• Numerical variables

PARAMETERS
Means: μ_X = E[X], μ_Y = E[Y]
Variances: σ_X² = E[(X − μ_X)²], σ_Y² = E[(Y − μ_Y)²]
Covariance: σ_XY = E[(X − μ_X)(Y − μ_Y)]

Linear Correlation Coefficient:  ρ = σ_XY / √(σ_X² σ_Y²)

Always between –1 and +1. Measures the strength of LINEAR association.

Page 7: CHAPTER 7 Linear Correlation & Regression Methods

Parameter Estimation via SAMPLE DATA …

The sample linear correlation coefficient estimates ρ:

r = s_xy / √(s_x² s_y²)

Always between –1 and +1.

Page 8: CHAPTER 7 Linear Correlation & Regression Methods

Parameter Estimation via SAMPLE DATA … [JAMA. 2003;290:1486-1493]

Example in R (reformatted for brevity), n = 10:

> pop = seq(0, 20, 0.1)
> x = sort(sample(pop, 10))
1.1 1.8 2.1 3.7 4.0 7.3 9.1 11.9 12.4 17.1
> y = sample(pop, 10)
13.1 18.3 17.6 19.1 19.3 3.2 5.6 13.6 8.0 3.0
> plot(x, y, pch = 19)
> c(mean(x), mean(y))
7.05 12.08
> var(x)
29.48944
> var(y)
43.76178
> cov(x, y)
-25.86667
> cor(x, y)
-0.7200451
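The session above is in R; as a cross-check, the same statistics can be reproduced with a short stdlib-only Python sketch (the data are the 10 points sampled above):

```python
import math

# The 10 sampled data points from the R session above.
x = [1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1]
y = [13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0]
n = len(x)

xbar = sum(x) / n                                      # mean(x)  = 7.05
ybar = sum(y) / n                                      # mean(y)  = 12.08
sx2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)      # var(x)   = 29.48944
sy2 = sum((yi - ybar) ** 2 for yi in y) / (n - 1)      # var(y)   = 43.76178
sxy = sum((xi - xbar) * (yi - ybar)
          for xi, yi in zip(x, y)) / (n - 1)           # cov(x,y) = -25.86667
r = sxy / math.sqrt(sx2 * sy2)                         # cor(x,y) = -0.7200451
```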

Page 9: CHAPTER 7 Linear Correlation & Regression Methods

Linear Correlation Coefficient:  r = s_xy / √(s_x² s_y²)

Always between –1 and +1: r measures the strength of linear association.

Page 10: CHAPTER 7 Linear Correlation & Regression Methods

Linear Correlation Coefficient:  r = s_xy / √(s_x² s_y²)

    –1 ——— 0 ——— +1
r near –1: negative linear correlation; r near +1: positive linear correlation.

r measures the strength of linear association.

Page 11: CHAPTER 7 Linear Correlation & Regression Methods


Page 12: CHAPTER 7 Linear Correlation & Regression Methods


Page 13: CHAPTER 7 Linear Correlation & Regression Methods


Page 14: CHAPTER 7 Linear Correlation & Regression Methods

Testing for linear association between two numerical population variables X and Y…

Linear Correlation Coefficient:  ρ = σ_XY / √(σ_X² σ_Y²), estimated by r = s_xy / √(s_x² s_y²).

H₀: ρ = 0  ("No linear association between X and Y.")
H_A: ρ ≠ 0  ("Linear association between X and Y.")

Now that we have r, we can conduct HYPOTHESIS TESTING on ρ.

Test Statistic for p-value:  T = r √((n − 2) / (1 − r²))  ~  t_{n−2}

T = −0.72 √((10 − 2) / (1 − (−0.72)²)) = −2.935 on t₈

p-value = 2 * pt(-2.935, 8) = .0189 < .05
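The test statistic can be checked directly from r and n with a stdlib-Python sketch (the p-value itself needs the t distribution CDF, which R's pt supplies):

```python
import math

r, n = -0.7200451, 10    # sample correlation and sample size from the example
# T = r * sqrt((n - 2) / (1 - r^2)) ~ t_{n-2}
T = r * math.sqrt((n - 2) / (1 - r ** 2))
# T comes out near -2.935; in R the p-value is then 2 * pt(T, n - 2).
```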

Page 15: CHAPTER 7 Linear Correlation & Regression Methods

Parameter Estimation via SAMPLE DATA …

If such an association between X and Y exists, then for some intercept β₀ and slope β₁ we can write

Y = β₀ + β₁X + ε    "Response = Model + Error"

Find estimates β̂₀ and β̂₁ for the "best" line  Ŷ = β̂₀ + β̂₁X  … in what sense???

Residuals: for each observed point (xᵢ, yᵢ) with fitted point (xᵢ, ŷᵢ), the residual is eᵢ = yᵢ − ŷᵢ.

SS_Err = Σ eᵢ²

Page 16: CHAPTER 7 Linear Correlation & Regression Methods

SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES

Find estimates β̂₀ and β̂₁ for the "best" line Ŷ = β̂₀ + β̂₁X, i.e., the line that minimizes SS_Err = Σ eᵢ².

β̂₁ = s_xy / s_x² = −25.86667 / 29.48944 = −0.87715

β̂₀ = ȳ − β̂₁ x̄ = 12.08 − (−0.87715)(7.05) = 18.26391

"Least Squares Regression Line" — note that (x̄, ȳ) is on the line.
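Since β̂₁ = s_xy / s_x² and β̂₀ = ȳ − β̂₁ x̄, both estimates follow from the sample statistics already computed; a stdlib-Python sketch:

```python
x = [1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1]
y = [13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
sx2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)

b1 = sxy / sx2           # slope:     -25.86667 / 29.48944 = -0.87715
b0 = ybar - b1 * xbar    # intercept: 12.08 - (-0.87715)(7.05) = 18.26391

# (xbar, ybar) lies exactly on the least squares line:
assert abs((b0 + b1 * xbar) - ybar) < 1e-12
```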

Page 17: CHAPTER 7 Linear Correlation & Regression Methods

Ŷ = 18.26391 − 0.87715 X

Check: (x̄, ȳ) = (7.05, 12.08) is on the line: 18.26391 − 0.87715(7.05) = 12.08.

Page 18: CHAPTER 7 Linear Correlation & Regression Methods

SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES:  Ŷ = 18.26391 − 0.87715 X

predictor          X   1.1  1.8  2.1  3.7  4.0  7.3  9.1  11.9  12.4  17.1
observed response  Y   13.1 18.3 17.6 19.1 19.3 3.2  5.6  13.6  8.0   3.0

Page 19: CHAPTER 7 Linear Correlation & Regression Methods

As above, plus a row for the fitted response Ŷᵢ = 18.26391 − 0.87715 xᵢ at each data point.

Page 20: CHAPTER 7 Linear Correlation & Regression Methods

Fitted responses Ŷ:  ~ E X E R C I S E ~

Page 21: CHAPTER 7 Linear Correlation & Regression Methods

Residuals Y − Ŷ:  ~ E X E R C I S E ~

Page 22: CHAPTER 7 Linear Correlation & Regression Methods

Ŷ = 18.26391 − 0.87715 X

Residuals eᵢ = yᵢ − ŷᵢ:  SS_Err = Σ eᵢ² = 189.6555
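The residuals and SS_Err in the exercise can be verified numerically (stdlib-Python sketch):

```python
x = [1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1]
y = [13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0]

b0, b1 = 18.26391, -0.87715               # least squares estimates from above
yhat = [b0 + b1 * xi for xi in x]         # fitted responses
e = [yi - yh for yi, yh in zip(y, yhat)]  # residuals e_i = y_i - yhat_i

sse = sum(ei ** 2 for ei in e)            # SS_Err = 189.6555
# least squares residuals sum to (essentially) zero
assert abs(sum(e)) < 1e-3
```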

Page 23: CHAPTER 7 Linear Correlation & Regression Methods

Testing for linear association between two numerical population variables X and Y…

Linear Regression Coefficients:  Y = β₀ + β₁X + ε  ("Response = Model + Error"), fitted by  Ŷ = β̂₀ + β̂₁X  with  β̂₁ = s_xy / s_x²,  β̂₀ = ȳ − β̂₁x̄.

H₀: β₁ = 0  ("No linear association between X and Y.")
H_A: β₁ ≠ 0  ("Linear association between X and Y.")

Now that we have these, we can conduct HYPOTHESIS TESTING on β₀ and β₁.

Test Statistic for p-value?

Page 24: CHAPTER 7 Linear Correlation & Regression Methods


Page 25: CHAPTER 7 Linear Correlation & Regression Methods

Testing for linear association between two numerical population variables X and Y…

H₀: β₁ = 0  ("No linear association between X and Y.")
H_A: β₁ ≠ 0  ("Linear association between X and Y.")

SS_Err = Σ(y − ŷ)²,  MS_Err = SS_Err / (n − 2)

Test Statistic for p-value:  T = (β̂₁ − 0) √((n − 1) s_x² / MS_Err)  ~  t_{n−2}

T = (−0.87715 − 0) √((9)(29.48944) / (189.6555 / 8)) = −2.935 on t₈

Same t-score as H₀: ρ = 0!  p-value = .0189

Page 26: CHAPTER 7 Linear Correlation & Regression Methods

> plot(x, y, pch = 19)
> lsreg = lm(y ~ x)    # or lsfit(x, y)
> abline(lsreg)
> summary(lsreg)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-8.6607 -3.2154  0.8954  3.4649  5.7742

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  18.2639     2.6097   6.999 0.000113 ***
x            -0.8772     0.2989  -2.935 0.018857 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.869 on 8 degrees of freedom
Multiple R-squared: 0.5185,  Adjusted R-squared: 0.4583
F-statistic: 8.614 on 1 and 8 DF,  p-value: 0.01886

BUT WHY HAVE TWO METHODS FOR THE SAME PROBLEM???

Because this second method generalizes…

Page 27: CHAPTER 7 Linear Correlation & Regression Methods

H₀: β₁ = 0 vs. H_A: β₁ ≠ 0, for the model Y = β₀ + β₁X + ε.

ANOVA Table

Source      df    SS    MS    F-ratio    p-value
Treatment
Error
Total                   –

Page 28: CHAPTER 7 Linear Correlation & Regression Methods

For regression, the "Treatment" source becomes "Regression":

Source      df    SS    MS    F-ratio    p-value
Regression   ?
Error
Total                   –

Page 29: CHAPTER 7 Linear Correlation & Regression Methods

With one predictor, the Regression degrees of freedom is 1:

Source      df    SS    MS    F-ratio    p-value
Regression   1
Error        ?
Total                   –

Page 30: CHAPTER 7 Linear Correlation & Regression Methods

The Error degrees of freedom come from the slope t-test above: df_Err = n − 2 = 8.

Page 31: CHAPTER 7 Linear Correlation & Regression Methods

Source      df    SS    MS    F-ratio    p-value
Regression   1     ?     ?       ?          ?
Error        8     ?     ?
Total        ?     ?     –

Page 32: CHAPTER 7 Linear Correlation & Regression Methods


Page 33: CHAPTER 7 Linear Correlation & Regression Methods

SS_Tot = Σ(y − ȳ)² = (n − 1) s_y²

SS_Tot is a measure of the total amount of variability in the observed responses (i.e., before any model-fitting).

Page 34: CHAPTER 7 Linear Correlation & Regression Methods

SS_Reg = Σ(ŷ − ȳ)²

SS_Reg is a measure of the total amount of variability in the fitted responses (i.e., after model-fitting).

Page 35: CHAPTER 7 Linear Correlation & Regression Methods

SS_Err = Σ(y − ŷ)²

SS_Err is a measure of the total amount of variability in the resulting residuals (i.e., after model-fitting).

Page 36: CHAPTER 7 Linear Correlation & Regression Methods

Ŷ = 18.26391 − 0.87715 X

SS_Err = Σ(y − ŷ)² = 189.656
SS_Tot = Σ(y − ȳ)² = (n − 1) s_y² = 9 (43.76178) = 393.856
SS_Reg = Σ(ŷ − ȳ)² = 204.2

Page 37: CHAPTER 7 Linear Correlation & Regression Methods

SS_Tot = SS_Reg + SS_Err:  393.856 = 204.2 + 189.656

The least squares line makes SS_Err the minimum possible.
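The decomposition SS_Tot = SS_Reg + SS_Err can be confirmed numerically (stdlib-Python sketch):

```python
x = [1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1]
y = [13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0]
n = len(x)
ybar = sum(y) / n

b0, b1 = 18.26391, -0.87715          # least squares estimates from above
yhat = [b0 + b1 * xi for xi in x]

ss_tot = sum((yi - ybar) ** 2 for yi in y)                # 393.856
ss_reg = sum((yh - ybar) ** 2 for yh in yhat)             # 204.200
ss_err = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))   # 189.656
# the two model-based pieces add back up to the total variability
assert abs(ss_tot - (ss_reg + ss_err)) < 0.01
```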

Page 38: CHAPTER 7 Linear Correlation & Regression Methods

ANOVA Table  (H₀: β₁ = 0 vs. H_A: β₁ ≠ 0, model Y = β₀ + β₁X + ε)

Source      df    SS        MS        F-ratio           p-value
Regression   1    204.200   MS_Reg    F ~ F_{k−1,n−k}   0 < p < 1
Error        8    189.656   MS_Err
Total        9    393.856   –

Page 39: CHAPTER 7 Linear Correlation & Regression Methods

ANOVA Table  (H₀: β₁ = 0 vs. H_A: β₁ ≠ 0, model Y = β₀ + β₁X + ε)

Source      df    SS        MS        F-ratio    p-value
Regression   1    204.200   204.200   8.61349    0.018857
Error        8    189.656    23.707
Total        9    393.856   –

Same as before!

Page 40: CHAPTER 7 Linear Correlation & Regression Methods

> summary(aov(lsreg))
            Df Sum Sq Mean Sq F value  Pr(>F)
x            1 204.20 204.201  8.6135 0.01886 *
Residuals    8 189.66  23.707
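The F-ratio is MS_Reg / MS_Err, and with a single predictor it equals the square of the slope's t statistic, which is why both approaches give the same p-value (stdlib-Python sketch):

```python
ss_reg, ss_err = 204.200, 189.656    # sums of squares from the ANOVA table
df_reg, df_err = 1, 8

ms_reg = ss_reg / df_reg             # 204.200
ms_err = ss_err / df_err             # 23.707
F = ms_reg / ms_err                  # 8.6135

t = -2.935                           # slope t statistic from summary(lsreg)
assert abs(F - t ** 2) < 0.01        # F = t^2 when df_reg = 1
```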

Page 41: CHAPTER 7 Linear Correlation & Regression Methods

Moreover,  SS_Reg / SS_Tot = 204.2 / 393.856 = 0.5185.

Coefficient of Determination: the least squares regression line accounts for 51.85% of the total variability in the observed response, with 48.15% remaining.

Page 42: CHAPTER 7 Linear Correlation & Regression Methods

> cor(x, y)
-0.7200451

Coefficient of Determination:  r² = SS_Reg / SS_Tot = 204.2 / 393.856 = 0.5185 = (−0.72)².

The least squares regression line accounts for 51.85% of the total variability in the observed response, with 48.15% remaining.
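That r² equals SS_Reg / SS_Tot can be verified from the two independent computations (stdlib-Python sketch):

```python
r = -0.7200451                   # cor(x, y) from the R session
ss_reg, ss_tot = 204.200, 393.856

r2 = r ** 2                      # 0.5185
# the squared correlation matches the ANOVA proportion of variability
assert abs(r2 - ss_reg / ss_tot) < 1e-4
```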

Page 43: CHAPTER 7 Linear Correlation & Regression Methods


Page 44: CHAPTER 7 Linear Correlation & Regression Methods

Summary of Linear Correlation and Simple Linear Regression

Given: sample data (x₁, y₁), …, (xₙ, yₙ); means x̄, ȳ; variances s_x², s_y²; covariance s_xy. [JAMA. 2003;290:1486-1493]

Linear Correlation Coefficient:  r = s_xy / √(s_x² s_y²);  −1 ≤ r ≤ +1;  measures the strength of linear association.

Least Squares Regression Line:  β̂₁ = s_xy / s_x²,  β̂₀ = ȳ − β̂₁x̄;  Ŷ = β̂₀ + β̂₁X  minimizes  SS_Err = Σ(y − ŷ)² = SS_Tot − SS_Reg  (ANOVA).

All point estimates can be upgraded to CIs for hypothesis testing, etc.

Page 45: CHAPTER 7 Linear Correlation & Regression Methods

Summary, continued: 95% Confidence Intervals.

Upper and lower 95% confidence bands can be drawn around the fitted line Ŷ = β̂₀ + β̂₁X. All point estimates can be upgraded to CIs for hypothesis testing, etc. (See notes for "95% prediction intervals.")

Page 46: CHAPTER 7 Linear Correlation & Regression Methods

Summary, continued: Coefficient of Determination.

r² = SS_Reg / SS_Tot = proportion of total variability modeled by the regression line's variability.

Page 47: CHAPTER 7 Linear Correlation & Regression Methods

Multilinear Regression: testing for linear association between a population response variable Y and multiple predictor variables X₁, X₂, X₃, … etc.

Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + … + β_{k−1}X_{k−1} + ε    "Response = Model + Error"

Ŷ = β̂₀ + β̂₁X₁ + β̂₂X₂ + … + β̂_{k−1}X_{k−1}

H₀: β₁ = β₂ = β₃ = … = β_{k−1} = 0  ("No linear association between Y and any of its predictors X₁, X₂, X₃, …, X_{k−1}.")
H_A: βᵢ ≠ 0 for some i = 1, 2, …, k − 1  ("Linear association between Y and at least one of its predictors.")

The βᵢXᵢ terms are "main effects." For now, assume the "additive model," i.e., main effects only.

Page 48: CHAPTER 7 Linear Correlation & Regression Methods

Multilinear Regression

With two predictors, Ŷ = β̂₀ + β̂₁X₁ + β̂₂X₂ is a fitted plane over the (X₁, X₂) plane: for predictors (x₁ᵢ, x₂ᵢ), the true response is yᵢ, the fitted response is ŷᵢ, and the residual is eᵢ = yᵢ − ŷᵢ.

Least Squares calculation of the regression coefficients is computer-intensive; the formulas require Linear Algebra (matrices)!

Once calculated, how do we then test the null hypothesis? ANOVA.

Page 49: CHAPTER 7 Linear Correlation & Regression Methods

Multilinear Regression (main effects):  Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + … + β_{k−1}X_{k−1} + ε    "Response = Model + Error"

R code example: lsreg = lm(y ~ x1 + x2 + x3)

Page 50: CHAPTER 7 Linear Correlation & Regression Methods

Adding quadratic terms, cubes, etc. gives "polynomial regression":

Y = β₀ + β₁X₁ + β₂X₂ + … + β_{k−1}X_{k−1} + β_{1,1}X₁² + β_{2,2}X₂² + … + β_{k−1,k−1}X_{k−1}² + cubes + … + ε

R code example: lsreg = lm(y ~ x + I(x^2) + I(x^3))
(In an R formula, ^ means crossing, so powers must be wrapped in I().)

Page 51: CHAPTER 7 Linear Correlation & Regression Methods

Adding products of predictors gives "interactions":

Y = β₀ + β₁X₁ + β₂X₂ + … + β_{k−1}X_{k−1}    ("main effects")
  + β_{1,1}X₁² + β_{2,2}X₂² + … + cubes + …    (polynomial terms)
  + β_{1,2}X₁X₂ + β_{1,3}X₁X₃ + … + β_{1,k−1}X₁X_{k−1}
  + β_{2,3}X₂X₃ + β_{2,4}X₂X₄ + … + β_{2,k−1}X₂X_{k−1} + …    ("interactions")
  + ε

R code examples:
lsreg = lm(y ~ x1 + x2 + x1:x2)
lsreg = lm(y ~ x1*x2)    # equivalent to the previous line

Page 56: CHAPTER 7 Linear Correlation & Regression Methods

Recall the earlier example. Suppose these are actually two subgroups, requiring two distinct linear regressions!

Multiple Linear Regression with interaction, with an indicator ("dummy") variable:

Ŷ = β̂₀ + β̂₁X + β̂₂I + β̂₃XI

Example in R (reformatted for brevity):

> I = c(1,1,1,1,1,0,0,0,0,0)
> lsreg = lm(y ~ x*I)
> summary(lsreg)

Coefficients:
             Estimate
(Intercept)  6.56463
x            0.00998
I            6.80422
x:I          1.60858

Ŷ = 6.56 + 0.01 X + 6.80 I + 1.61 XI

I = 0:  Ŷ = 6.56 + 0.01 X
I = 1:  Ŷ = 13.36 + 1.62 X
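The two subgroup lines follow directly from the four fitted coefficients; a stdlib-Python sketch using the estimates printed above:

```python
# Coefficients from summary(lsreg): (Intercept), x, I, x:I
b0, b1, b2, b3 = 6.56463, 0.00998, 6.80422, 1.60858

def yhat(x, I):
    """Fitted model: Y-hat = b0 + b1*x + b2*I + b3*x*I."""
    return b0 + b1 * x + b2 * I + b3 * x * I

# I = 0 subgroup: intercept b0, slope b1
# I = 1 subgroup: intercept b0 + b2 = 13.36885, slope b1 + b3 = 1.61856
int1, slope1 = b0 + b2, b1 + b3
# setting I = 1 in the full model reproduces the second line exactly
assert abs(yhat(5.0, 1) - (int1 + slope1 * 5.0)) < 1e-12
```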

Page 57: CHAPTER 7 Linear Correlation & Regression Methods

ANOVA Table (revisited)

From a sample of n data points, fit  Ŷ = β̂₀ + β̂₁X₁ + β̂₂X₂ + … + β̂_{k−1}X_{k−1}  for the model  Y = β₀ + β₁X₁ + β₂X₂ + … + β_{k−1}X_{k−1} + ε.

H₀: β₁ = β₂ = β₃ = … = β_{k−1} = 0  ("No linear association between Y and any of its predictors X₁, X₂, X₃, …, X_{k−1}.")  Note that if true, it would follow that μ_Y = β₀ and β̂₀ = ȳ.
H_A: βᵢ ≠ 0 for some i = 1, 2, …, k − 1  ("Linear association between Y and at least one of its predictors.")

But how are these regression coefficients calculated in general? "Normal equations," solved via computer (intensive).

Page 58: CHAPTER 7 Linear Correlation & Regression Methods

ANOVA Table (revisited), based on n data points:

Source      df      SS              MS                        F                             p-value
Regression  k − 1   Σᵢ (ŷᵢ − ȳ)²    MS_Reg = SS_Reg/(k − 1)   MS_Reg/MS_Err ~ F_{k−1,n−k}   0 < p < 1
Error       n − k   Σᵢ (yᵢ − ŷᵢ)²   MS_Err = SS_Err/(n − k)
Total       n − 1   Σᵢ (yᵢ − ȳ)²

H₀: β₁ = β₂ = β₃ = … = β_{k−1} = 0  ("No linear association between Y and any of its predictors X₁, X₂, X₃, …, X_{k−1}.")

*** How are only the statistically significant variables determined? ***

Page 59: CHAPTER 7 Linear Correlation & Regression Methods

"MODEL SELECTION" (Backward Elimination)

Step 0. Conduct an overall F-test of significance (via ANOVA) of the full model Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + β₄X₄ + … If significant, then…

Step 1. t-tests:  H₀: β₁ = 0,  H₀: β₂ = 0,  H₀: β₃ = 0,  H₀: β₄ = 0, ……
p-values:  p₁ < .05 (Reject H₀),  p₂ < .05 (Reject H₀),  p₃ ≥ .05 (Accept H₀),  p₄ < .05 (Reject H₀), ……

Step 2. Are all coefficients significant at level α? If not….

Page 60: CHAPTER 7 Linear Correlation & Regression Methods

Step 2 (continued). If not all coefficients are significant at level α (here p₃ ≥ .05 for X₃), delete that term.

Page 61: CHAPTER 7 Linear Correlation & Regression Methods

Step 2 (continued). …delete that term, and recompute new coefficients for the remaining model β₀ + β₁X₁ + β₂X₂ + β₄X₄ + …!

Step 3. Repeat Steps 1-2 as necessary until all coefficients are significant → reduced model.

Page 62: CHAPTER 7 Linear Correlation & Regression Methods

Recall ~ Analysis of Variance (ANOVA): k ≥ 2 independent, equivariant, normally-distributed "treatment groups" with means μ₁, μ₂, …, μ_k and responses Y₁, Y₂, …, Y_k.

H₀: μ₁ = μ₂ = … = μ_k

MODEL ASSUMPTIONS?

Page 63: CHAPTER 7 Linear Correlation & Regression Methods

“Regression Diagnostics”

Page 72: CHAPTER 7 Linear Correlation & Regression Methods

Re-plot data on a “log-log” scale.

Page 75: CHAPTER 7 Linear Correlation & Regression Methods

Re-plot data on a “log” scale (of Y only).

Page 76: CHAPTER 7 Linear Correlation & Regression Methods

Binary outcome, e.g., “Have you ever had surgery?” (Yes / No)

Page 77: CHAPTER 7 Linear Correlation & Regression Methods


Page 78: CHAPTER 7 Linear Correlation & Regression Methods

Simple logistic regression for a binary outcome:

ln(π̂ / (1 − π̂)) = β̂₀ + β̂₁X    "log-odds" ("logit"), an example of a general "link function" g(π)

π̂ = 1 / (1 + e^(−(β̂₀ + β̂₁X)))

Coefficients are found via "MAXIMUM LIKELIHOOD ESTIMATION."
(Note: not based on Least Squares, which implies "pseudo-R²," etc.)
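The logit link and its inverse can be sketched in stdlib Python; b0 and b1 below are hypothetical coefficients for illustration, not fitted from any data:

```python
import math

def logit(p):
    """log-odds link: g(p) = ln(p / (1 - p))"""
    return math.log(p / (1 - p))

def inv_logit(z):
    """inverse link: p = 1 / (1 + e^{-z})"""
    return 1 / (1 + math.exp(-z))

b0, b1 = -2.0, 0.05      # hypothetical coefficients, illustration only
X = 30.0
p = inv_logit(b0 + b1 * X)     # modeled probability of the outcome at X
# round trip: the logit of p recovers the linear predictor b0 + b1*X
assert abs(logit(p) - (b0 + b1 * X)) < 1e-12
```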

Page 79: CHAPTER 7 Linear Correlation & Regression Methods

Multiple logistic regression for a binary outcome:

ln(π̂ / (1 − π̂)) = β̂₀ + β̂₁X₁ + β̂₂X₂ + … + β̂_kX_k    "log-odds" ("logit")

π̂ = 1 / (1 + e^(−(β̂₀ + β̂₁X₁ + β̂₂X₂ + … + β̂_kX_k)))

Suppose one of the predictor variables is binary:  X₁ = 1 if Age ≥ 50, X₁ = 0 if Age < 50.

X₁ = 1:  ln(π̂₁ / (1 − π̂₁)) = β̂₀ + β̂₁ + β̂₂X₂ + … + β̂_kX_k
X₁ = 0:  ln(π̂₀ / (1 − π̂₀)) = β̂₀ + β̂₂X₂ + … + β̂_kX_k

SUBTRACT!

Page 80: CHAPTER 7 Linear Correlation & Regression Methods


Page 81: CHAPTER 7 Linear Correlation & Regression Methods

Subtracting, every term except β̂₁ cancels:

β̂₁ = ln(π̂₁ / (1 − π̂₁)) − ln(π̂₀ / (1 − π̂₀))

Page 82: CHAPTER 7 Linear Correlation & Regression Methods

Combining the logs:

β̂₁ = ln[ (π̂₁ / (1 − π̂₁)) / (π̂₀ / (1 − π̂₀)) ]

Page 83: CHAPTER 7 Linear Correlation & Regression Methods

β̂₁ = ln[ (odds of surgery given Age ≥ 50) / (odds of surgery given Age < 50) ]

That is, β̂₁ = ln(OR), which implies OR = e^(β̂₁).    ODDS RATIO
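The SUBTRACT step can be demonstrated numerically: the difference of the two log-odds is exactly β̂₁, so the odds ratio is e^(β̂₁). Stdlib-Python sketch with hypothetical coefficient values (illustration only):

```python
import math

# hypothetical fitted coefficients, illustration only
b0, b1, b2 = -1.0, 0.7, 0.03
x2 = 40.0                      # some fixed value of the other predictor

z1 = b0 + b1 * 1 + b2 * x2     # linear predictor when X1 = 1 (Age >= 50)
z0 = b0 + b1 * 0 + b2 * x2     # linear predictor when X1 = 0 (Age < 50)
p1, p0 = (1 / (1 + math.exp(-z)) for z in (z1, z0))

odds1, odds0 = p1 / (1 - p1), p0 / (1 - p0)
OR = odds1 / odds0
assert abs(math.log(OR) - b1) < 1e-12    # ln(OR) = b1: all other terms cancel
assert abs(OR - math.exp(b1)) < 1e-9     # OR = e^{b1}
```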

Page 84: CHAPTER 7 Linear Correlation & Regression Methods

Recall from population dynamics:

Unrestricted population growth (e.g., bacteria). Population size y obeys dy/dt = ay, with constant a > 0.

Separate variables: (1/y) dy = a dt  ⟹  ln|y| = at + b  ⟹  y = e^(at) e^b = C e^(at).
With initial condition y(0) = y₀:  y = y₀ e^(at).  Exponential growth.

Restricted population growth (disease, predation, starvation, etc.). Population size y obeys dy/dt = ay(1 − y/M), with constant a > 0 and "carrying capacity" M.

Let π = y/M (survival probability). Then dπ/dt = aπ(1 − π). Separate variables:

(1/π + 1/(1 − π)) dπ = a dt  ⟹  ln|π| − ln|1 − π| = at + b  ⟹  ln(π / (1 − π)) = at + b.

With initial condition π(0) = π₀:  π = π₀ / (π₀ + (1 − π₀) e^(−at)).  Logistic growth.
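The closed-form logistic solution can be checked against the differential equation numerically (stdlib-Python sketch; a and π₀ are arbitrary test values):

```python
import math

a, p0 = 0.8, 0.1    # arbitrary growth rate and initial proportion

def p(t):
    """logistic solution: p(t) = p0 / (p0 + (1 - p0) * e^{-a t})"""
    return p0 / (p0 + (1 - p0) * math.exp(-a * t))

# p satisfies dp/dt = a * p * (1 - p): compare a central-difference
# approximation of the derivative with the right-hand side.
t, h = 2.0, 1e-6
deriv = (p(t + h) - p(t - h)) / (2 * h)
rhs = a * p(t) * (1 - p(t))
assert abs(deriv - rhs) < 1e-6
assert abs(p(0) - p0) < 1e-12    # initial condition p(0) = p0
```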