Regression and correlation · people.tamu.edu/~alawing/materials/ESSM689/RCtutorial... · 2015. 2....

Tutorial: Regression and correlation
Presented by Jessica Raterman, Shannon Hodges


Setting and checking your data

• Install the package if it is not already installed
> install.packages("MASS")
• Load the data
> library(MASS)
or, without attaching the package,
> data(birthwt, package = "MASS")
• Look over the raw data
> print(birthwt)
or
> birthwt
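A minimal sketch of the loading step, assuming you want the script to run whether or not MASS is already installed. `requireNamespace()` is base R; the install line only runs when the package is missing, and `head()` is used here instead of printing every row.

```r
# Install MASS only if it is not already available (assumption: a CRAN
# mirror is reachable when installation is needed).
if (!requireNamespace("MASS", quietly = TRUE)) {
  install.packages("MASS")
}

# Load the birthwt data set into the workspace without attaching MASS.
data(birthwt, package = "MASS")

# Show only the first rows instead of printing all observations.
head(birthwt, 3)
```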

• Raw data (first rows)

   low age lwt race smoke ptl ht ui ftv  bwt
85   0  19 182    2     0   0  0  1   0 2523
86   0  33 155    3     0   0  0  0   3 2551
87   0  20 105    1     1   0  0  0   1 2557

• Check data form and structure

•  Find variable names

> names(birthwt)

•  Look at data structure

> str(birthwt)

• Variable names

[1] "low"   "age"   "lwt"   "race"  "smoke" "ptl"   "ht"    "ui"
[9] "ftv"   "bwt"

• Data structure

'data.frame': 189 obs. of 10 variables:
 $ low : int 0 0 0 0 0 0 0 0 0 0 ...
 $ age : int 19 33 20 21 18 21 22 17 29 26 ...
 $ lwt : int 182 155 105 108 107 124 118 103 123 113 ...

• Explore the variables

> ?birthwt

or

> help(birthwt)

•  Check data summary

> summary(birthwt)

•  Rename the data if desired, e.g.

> bw <- birthwt
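The checks above can be run in one pass. This is only a sketch: it assumes the copy is named `bw` as on this slide, and the missing-value count is an extra check that is not part of the original slides.

```r
data(birthwt, package = "MASS")
bw <- birthwt                      # work on a copy so the original stays untouched

dim(bw)                            # number of rows and columns
names(bw)                          # variable names
summary(bw[, c("lwt", "bwt")])     # summary of the two variables used later

# Extra check (not in the slides): missing values per column.
n_missing <- colSums(is.na(bw))
n_missing
```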

• Explore the variables

lwt   mother's weight in pounds at last menstrual period
bwt   birth weight in grams

• Check data summary

      low              age
 Min.   :0.0000   Min.   :14.00
 1st Qu.:0.0000   1st Qu.:19.00
 Median :0.0000   Median :23.00
 Mean   :0.3122   Mean   :23.24
 3rd Qu.:1.0000   3rd Qu.:26.00
 Max.   :1.0000   Max.   :45.00

• Examine all scatterplots
> pairs(birthwt)
• Choose two variables to scatterplot
> plot(birthwt$bwt, birthwt$lwt)
• Examine correlation results
> cor(birthwt$bwt, birthwt$lwt)
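`cor()` returns only the coefficient. If a significance test is also wanted, `cor.test()` (base R, stats package) is one option; the sketch below is an addition to the slides, not part of them.

```r
data(birthwt, package = "MASS")

# Pearson correlation between birth weight and mother's weight.
r <- cor(birthwt$bwt, birthwt$lwt)

# cor.test() adds a significance test and a confidence interval for r.
ct <- cor.test(birthwt$bwt, birthwt$lwt)
r
ct$p.value
ct$conf.int
```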

• Correlation, lwt x bwt

[1] 0.1857333

[Figure: scatterplot of birthwt$lwt (y-axis, about 100 to 250) against birthwt$bwt (x-axis, about 1000 to 5000)]

• Check normality and distribution

> hist(birthwt$lwt)

and/or

> stem(birthwt$lwt)

> hist(birthwt$bwt)
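Histograms give a visual impression only. As a hedged supplement that is not part of the original slides, a normal Q-Q plot and the Shapiro-Wilk test (both in base R) are common ways to check the same thing more formally.

```r
data(birthwt, package = "MASS")

# Normal Q-Q plot: points close to the reference line suggest normality.
qqnorm(birthwt$lwt)
qqline(birthwt$lwt)

# Shapiro-Wilk test: a small p-value suggests a departure from normality.
sw_lwt <- shapiro.test(birthwt$lwt)
sw_bwt <- shapiro.test(birthwt$bwt)
sw_lwt$p.value
sw_bwt$p.value
```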

[Figure: Histogram of birthwt$lwt; x-axis birthwt$lwt (100 to 250), y-axis Frequency (0 to 70)]

[Figure: Histogram of birthwt$bwt; x-axis birthwt$bwt (1000 to 5000), y-axis Frequency (0 to 40)]

• Transform your data if needed
  • Create a new vector (column) for this
> sqrtlwt <- sqrt(birthwt$lwt)
> loglwt <- log(birthwt$lwt)
  • Recheck your data
> hist(sqrtlwt)
> hist(loglwt)
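The transformed vectors can also be stored as columns of a data frame so they stay aligned with the other variables. A small sketch, assuming a working copy named `bw` (the column names are illustrative):

```r
data(birthwt, package = "MASS")
bw <- birthwt

# Keep the transformed values next to the originals.
bw$sqrtlwt <- sqrt(bw$lwt)
bw$loglwt  <- log(bw$lwt)   # log() is the natural logarithm

# Compare the three histograms side by side, then restore the layout.
old_par <- par(mfrow = c(1, 3))
hist(bw$lwt,     main = "lwt")
hist(bw$sqrtlwt, main = "sqrt(lwt)")
hist(bw$loglwt,  main = "log(lwt)")
par(old_par)
```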

[Figure: Histogram of loglwt; x-axis loglwt (4.4 to 5.6), y-axis Frequency (0 to 50)]

Parametric: correlation

• Recheck your results
> cor(loglwt, birthwt$bwt)
The default setting uses Pearson's r.
> plot(loglwt, birthwt$bwt)
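To make the choice of coefficient explicit and get a p-value at the same time, the call can name the method. A sketch using base R only; `cor.test()` is an addition to what the slide shows.

```r
data(birthwt, package = "MASS")
loglwt <- log(birthwt$lwt)

# Pearson is the default; naming it makes the choice explicit.
r_log  <- cor(loglwt, birthwt$bwt, method = "pearson")
ct_log <- cor.test(loglwt, birthwt$bwt, method = "pearson")
r_log
ct_log$p.value
```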

[1] 0.2036035

[Figure: scatterplot of birthwt$bwt (y-axis, 1000 to 5000) against loglwt (x-axis, 4.4 to 5.4)]

Parametric: Linear Regression

• Specify the model for simple regression

> m1 <- lm(birthwt$bwt ~ loglwt)

• Check your results with summary

> summary(m1)

Check the p-value, R-squared, slope, and F-statistic.
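The quantities named above can be pulled out of the summary object directly. The sketch below refits the same model with a `data` argument (an assumption made here for tidiness; the names `bw`, `m1_df`, and `s` are illustrative) and extracts each value.

```r
data(birthwt, package = "MASS")
bw <- birthwt
bw$loglwt <- log(bw$lwt)

# Same model as m1, written with a data argument.
m1_df <- lm(bwt ~ loglwt, data = bw)
s <- summary(m1_df)

coef(m1_df)                            # intercept and slope
s$r.squared                            # multiple R-squared
s$fstatistic                           # F value with its degrees of freedom
s$coefficients["loglwt", "Pr(>|t|)"]   # p-value for the slope
```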

• Summary

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -390.8     1174.0  -0.333  0.73958
loglwt         688.9      242.2   2.844  0.00495 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 715.8 on 187 degrees of freedom
Multiple R-squared: 0.04145, Adjusted R-squared: 0.03633
F-statistic: 8.087 on 1 and 187 DF, p-value: 0.004954

• Plot your model, check normality
> plot(m1)
Plot shows:
  • Residuals vs fitted: the numbered points are the observations with the most extreme residuals, which are potential problem points that may be skewing the model.
  • Q-Q plot
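`plot(m1)` steps through several diagnostic plots one at a time. A sketch, assuming you would rather see them on one page and list the observations with the largest residuals; the name `largest` is illustrative.

```r
data(birthwt, package = "MASS")
loglwt <- log(birthwt$lwt)
m1 <- lm(birthwt$bwt ~ loglwt)

# Show the default diagnostic plots on one page, then restore the layout.
old_par <- par(mfrow = c(2, 2))
plot(m1)
par(old_par)

# Positions of the three observations with the largest absolute residuals.
res <- residuals(m1)
largest <- head(order(abs(res), decreasing = TRUE), 3)
birthwt[largest, c("lwt", "bwt")]
```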

[Figure: Residuals vs Fitted for lm(birthwt$bwt ~ loglwt); x-axis Fitted values (2600 to 3400), y-axis Residuals (-2000 to 2000); points 130, 131, and 133 are labeled]

[Figure: Normal Q-Q for lm(birthwt$bwt ~ loglwt); x-axis Theoretical Quantiles (-3 to 3), y-axis Standardized residuals (-3 to 3); points 130, 131, and 133 are labeled]

• Confidence and Prediction
  • Confidence intervals for all parameters
> confint(m1)
> confint(m1, level = 0.95)
  • Confidence interval for the mean response
> predict(m1, interval = "confidence")
  • Prediction interval for a single new observation
> predict(m1, interval = "prediction")
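The two interval types are easiest to compare at a few chosen x values. The sketch below refits the model with a `data` argument and uses three hypothetical mother's weights (110, 130, and 150 pounds, chosen only for illustration); the prediction interval should come out wider than the confidence interval at every point.

```r
data(birthwt, package = "MASS")
bw <- birthwt
bw$loglwt <- log(bw$lwt)
m1_df <- lm(bwt ~ loglwt, data = bw)

# Hypothetical mothers weighing 110, 130 and 150 pounds.
new_mothers <- data.frame(loglwt = log(c(110, 130, 150)))

conf_int <- predict(m1_df, newdata = new_mothers, interval = "confidence")
pred_int <- predict(m1_df, newdata = new_mothers, interval = "prediction")
conf_int   # interval for the mean birth weight at each weight
pred_int   # interval for one new birth at each weight
```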

• Add the line of best fit

Rerun the plot first if needed:

> plot(loglwt, birthwt$bwt)

> abline(m1)

• Find the regression equation

  • Infer from summary data: y = B0 + B1x, where B0 is the intercept and B1 is the slope (a negative slope appears as a negative B1)
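As a worked example of the equation, the coefficients can be read from the model and applied by hand. The 130-pound mother is a hypothetical value chosen for illustration.

```r
data(birthwt, package = "MASS")
loglwt <- log(birthwt$lwt)
m1 <- lm(birthwt$bwt ~ loglwt)

b <- coef(m1)   # b[1] is the intercept B0, b[2] is the slope B1
b

# Predicted birth weight (grams) for a hypothetical 130-pound mother,
# computed by hand from y = B0 + B1 * x with x = log(130).
y_hat <- unname(b[1] + b[2] * log(130))
y_hat
```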

(Intercept)  -390.8
loglwt        688.9

so the fitted equation is bwt = -390.8 + 688.9 * loglwt.

[Figure: scatterplot of birthwt$bwt (y-axis, 1000 to 5000) against loglwt (x-axis, 4.4 to 5.4)]

Nonparametric

Use when the residuals are not normally distributed or when a linear relationship between x and y cannot be assumed.

• Correlation
  • Change the correlation coefficient to a nonparametric option

> ?cor

> cor(birthwt$bwt, birthwt$lwt, method = "spearman")
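`cor()` also accepts `method = "kendall"`, and `cor.test()` gives a p-value for either rank-based coefficient. A sketch, not part of the original slides; because Spearman's rho depends only on ranks, it is unchanged by the log transform of lwt.

```r
data(birthwt, package = "MASS")

rho <- cor(birthwt$bwt, birthwt$lwt, method = "spearman")
tau <- cor(birthwt$bwt, birthwt$lwt, method = "kendall")

# exact = FALSE requests the approximate p-value, which avoids the
# warning about exact p-values when the data contain ties.
st <- cor.test(birthwt$bwt, birthwt$lwt, method = "spearman", exact = FALSE)
rho
tau
st$p.value
```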

• Smooth with loess, then use linear regression
> m1.lo <- loess(birthwt$bwt ~ loglwt, span = 100, degree = 1)
> j <- order(loglwt)
> plot(m1.lo)
> lines(loglwt[j], m1.lo$fitted[j], col = "red", lwd = 3)
  • Check residuals again
> summary(m1.lo)
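With `span = 100` every point is used with nearly equal weight, so the smooth should behave close to a straight-line fit. For comparison, the sketch below uses `span = 0.75`, which is the `loess()` default; that value, the `data` argument, and the prediction grid are assumptions made here and are not part of the slides.

```r
data(birthwt, package = "MASS")
bw <- birthwt
bw$loglwt <- log(bw$lwt)

# span = 0.75 is the loess() default; smaller values follow the data more closely.
lo <- loess(bwt ~ loglwt, data = bw, span = 0.75, degree = 1)

# Evaluate the smooth on an evenly spaced grid for a clean curve.
grid <- data.frame(loglwt = seq(min(bw$loglwt), max(bw$loglwt), length.out = 100))
grid$fit <- predict(lo, newdata = grid)

plot(bw$loglwt, bw$bwt, xlab = "loglwt", ylab = "bwt")
lines(grid$loglwt, grid$fit, col = "red", lwd = 3)
```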

Further practice

• Try one run-through of the tutorial with a new data set that meets the parametric requirements, and another with a data set that calls for nonparametric methods.
  • For new data:
> data()
  • Or browse online: https://stat.ethz.ch/R-manual/R-patched/library/datasets/html/00Index.html
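One possible starting point, offered only as an illustration: the built-in `cars` data set (speed and stopping distance) from the datasets package. Whether it meets the parametric requirements is left for you to check with the steps from this tutorial.

```r
# cars ships with R's datasets package: 50 observations of speed and dist.
data(cars)
str(cars)

plot(cars$speed, cars$dist)
r_cars <- cor(cars$speed, cars$dist)

m_cars <- lm(dist ~ speed, data = cars)
abline(m_cars)      # add the fitted line to the scatterplot
summary(m_cars)
```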
