BIO503: Lecture 4 Statistical models in R
Transcript of BIO503: Lecture 4 Statistical models in R
![Page 2: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/2.jpg)
Statistical tests in R
Just some examples:
> t.test()
> pairwise.t.test()
> chisq.test()
> fisher.test()
> ks.test()
> …
![Page 3: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/3.jpg)
One sample t-test
> data(ChickWeight)
> t.test(ChickWeight[, 1], mu = 100)

	One Sample t-test

data:  ChickWeight[, 1]
t = 7.3805, df = 577, p-value = 5.529e-13
alternative hypothesis: true mean is not equal to 100
95 percent confidence interval:
 116.0121 127.6246
sample estimates:
mean of x
 121.8183
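The statistic reported by t.test can be recomputed by hand as a sanity check. The following is a minimal sketch, not part of the original slides; the names w and t_manual are illustrative.

```r
data(ChickWeight)                      # built-in dataset (datasets package)
w <- ChickWeight$weight                # same values as ChickWeight[, 1]
n <- length(w)

# One-sample t statistic: (sample mean - mu0) / (sd / sqrt(n)), with n - 1 df
t_manual <- (mean(w) - 100) / (sd(w) / sqrt(n))

res <- t.test(w, mu = 100)
c(manual = t_manual, t.test = unname(res$statistic), df = n - 1)
```

The two statistics should agree, and the degrees of freedom are n - 1.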
![Page 4: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/4.jpg)
Two sample t-test
> t.test(ChickWeight$weight[ChickWeight$Diet == "1"],
+        ChickWeight$weight[ChickWeight$Diet == "2"])

	Welch Two Sample t-test

data:  ChickWeight$weight[ChickWeight$Diet == "1"] and ChickWeight$weight[ChickWeight$Diet == "2"]
t = -2.6378, df = 201.384, p-value = 0.008995
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -34.899942  -5.042482
sample estimates:
mean of x mean of y
 102.6455  122.6167
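The same comparison can also be written with the formula interface of t.test. This is a sketch under the assumption that only Diets 1 and 2 are of interest; the names two_diets, welch and pooled are illustrative. Setting var.equal = TRUE gives the classical pooled-variance test instead of the Welch test.

```r
data(ChickWeight)

# Keep only the two diets being compared and drop the unused factor levels
two_diets <- droplevels(subset(ChickWeight, Diet %in% c("1", "2")))

# Welch test (default): does not assume equal variances
welch <- t.test(weight ~ Diet, data = two_diets)

# Classical Student test: assumes equal variances in both groups
pooled <- t.test(weight ~ Diet, data = two_diets, var.equal = TRUE)

c(welch_p = welch$p.value, pooled_p = pooled$p.value)
```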
![Page 5: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/5.jpg)
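Linear Regression Models

$$y_i = \alpha + \beta x_i + \varepsilon_i$$

where $y_i$ is the dependent variable, $x_i$ the independent variable, $\alpha$ the intercept, $\beta$ the regression coefficient and $\varepsilon_i$ the residual error.

Using the method of least squares, we can derive the following estimators:

$$\hat\beta = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat\alpha = \bar{y} - \hat\beta \bar{x}$$

Our goal is to test the hypothesis: $H_0: \beta = 0$

We can do this with a t test:

$$t = \frac{\hat\beta - 0}{SE(\hat\beta)}$$

Under the null hypothesis, this follows a t distribution with (n-2) df.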
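The least-squares estimators and the t test for the slope can be checked numerically. The sketch below uses the built-in cars dataset as a stand-in (an assumption made so the code runs without extra packages); all variable names are illustrative.

```r
x <- cars$speed
y <- cars$dist
n <- length(x)

# Least-squares estimates of slope and intercept
beta_hat  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
alpha_hat <- mean(y) - beta_hat * mean(x)

# Residual standard error and standard error of the slope
res       <- y - alpha_hat - beta_hat * x
sigma_hat <- sqrt(sum(res^2) / (n - 2))
se_beta   <- sigma_hat / sqrt(sum((x - mean(x))^2))

# t statistic for H0: beta = 0, with a two-sided p-value on n - 2 df
t_stat <- beta_hat / se_beta
p_val  <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

# Compare with lm
fit      <- lm(dist ~ speed, data = cars)
coef_tab <- summary(fit)$coefficients
c(beta_hat = beta_hat, t = t_stat, p = p_val)
```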
![Page 6: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/6.jpg)
Installing ISwR Package
Please install the ISwR package on your computer.
This package contains all the data sets used in Peter Dalgaard's book Introductory Statistics with R.
To load the package into your current R session: > library(ISwR)
To find out more information, including what objects are contained in a package:
> library(help=ISwR)
![Page 7: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/7.jpg)
Example Dataset tlc
We'll be using the dataset tlc that exists in the ISwR package. To load this dataset:
> data(tlc) # tlc = total lung capacity
What kind of object is tlc?
> class(tlc)
To learn about this dataset:
> help(tlc)
By using the attach command, we release the columns of the data.frame into the workspace.
> attach(tlc)
> age
![Page 8: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/8.jpg)
Linear Regression with lm
Is there a linear relationship between height and Total Lung Capacity (TLC)?
> lmObject <- lm(tlc ~ height, data=tlc)
What kind of object is lmObject?
> class(lmObject)
The model object encapsulates the results of a model fit.
The desired quantities can be obtained using extractor functions. A basic extractor function is summary:
> summary(lmObject)
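Besides summary, several other extractor functions work on lm objects. A short sketch, using the built-in cars dataset as a stand-in for tlc (an assumption so the example runs without ISwR):

```r
fit <- lm(dist ~ speed, data = cars)   # stand-in for lm(tlc ~ height, data = tlc)

class(fit)                             # "lm"
coef(fit)                              # estimated intercept and slope
confint(fit, level = 0.95)             # confidence intervals for the coefficients
summary(fit)$coefficients              # estimates, standard errors, t values, p-values
head(residuals(fit))                   # first few residuals
```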
![Page 9: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/9.jpg)
Interpreting the Output from lm
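Call:...
Stores the call to the lm function. Handy for keeping track of what model you fit.
Residuals:...
Summary stats of the residuals from the model. Recall: residuals should be approximately Normally distributed with mean zero if the model is a good fit.
Coefficients:...
Estimates from the model. Standard error. T statistics. P-value.
Residual standard error:...
With intercept $\alpha$ and regression coefficient $\beta$, the residual sum of squares is
$$SS_{RES} = \sum_i (y_i - \alpha - \beta x_i)^2$$
Residual variance:
$$\sigma^2 = \frac{SS_{RES}}{n-2}$$
Plug in estimates, to get RSE:
$$RSE = \hat\sigma = \sqrt{\frac{\sum_i (y_i - \hat\alpha - \hat\beta x_i)^2}{n-2}}$$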
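The residual standard error can be recomputed from the residuals. A minimal sketch on the built-in cars dataset (a stand-in chosen so the code runs without ISwR):

```r
fit <- lm(dist ~ speed, data = cars)
n   <- nrow(cars)

# Residual sum of squares evaluated at the fitted coefficients
ss_res <- sum(residuals(fit)^2)

# Divide by n - 2 because two parameters (intercept, slope) were estimated
rse_manual <- sqrt(ss_res / (n - 2))

c(manual = rse_manual, from_summary = summary(fit)$sigma, df = df.residual(fit))
```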
![Page 10: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/10.jpg)
Interpreting the Output from lm
Multiple R-squared:...
Squared Pearson correlation coefficient of y, x (e.g. tlc, height).
Do cor(tlc[,4], height)^2 to check.
Adjusted R-squared:...
F statistic: This is the F test that the regression coefficient is zero.
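For a simple linear regression these quantities are tied together: R-squared equals the squared correlation, and the F statistic equals the squared t statistic of the slope. A sketch on the built-in cars dataset (a stand-in, not the tlc data):

```r
fit <- lm(dist ~ speed, data = cars)
s   <- summary(fit)
n   <- nrow(cars)

r2_from_cor <- cor(cars$dist, cars$speed)^2

# Adjusted R-squared with one predictor: 1 - (1 - R^2) * (n - 1) / (n - 2)
adj_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - 2)

t_slope <- s$coefficients["speed", "t value"]

c(r2 = s$r.squared, r2_from_cor = r2_from_cor,
  adj = s$adj.r.squared, adj_manual = adj_manual,
  F = unname(s$fstatistic["value"]), t_squared = t_slope^2)
```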
![Page 11: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/11.jpg)
Visualizing Our Fitted Model
So what does it really mean, to have fit a linear model of TLC and height?
Plot the data:
> TLC <- tlc[,4]
> plot(height, TLC)
Add the regression line we fit with our model:
> abline(lmObject)
![Page 12: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/12.jpg)
Values Fitted by the Linear Model
Retrieve the fitted values using the extractor function fitted:
> fitted(lmObject)
Convince yourself that these are the points that fall on the line we just made in the previous plot.
> plot(height, TLC)
> abline(lmObject)
> points(height, fitted(lmObject), pch=20, col="red")
To grab the residual values:
> resid(lmObject)
![Page 13: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/13.jpg)
Eyeballing Good Fit
We can use the information from the residuals and fitted values to create plots to see how good our model fit is.
> plot(height, TLC)
> abline(lmObject)
Use the segments function to draw vertical lines from the fitted values to the real data values.
> segments(height, fitted(lmObject), height, TLC)
We can also take a look at the residuals.
> plot(height, resid(lmObject))
And use a QQ plot to see if the residuals are normally distributed.
> qqnorm(resid(lmObject))
> qqline(resid(lmObject))
![Page 14: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/14.jpg)
Confidence Interval Bands
Confidence bands are added to regression lines to reflect the uncertainty about a true parameter that cannot be observed, i.e. the true regression coefficient β.
A narrower confidence band suggests a well-determined line.
Using the predict function without any other input arguments yields the fitted values predicted by the model.
> predict(lmObject)
To compute the confidence interval bands for the fitted values, you need to specify the interval argument:
> predict(lmObject, interval="confidence")
![Page 15: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/15.jpg)
Visualizing Confidence Interval Bands
Go ahead and plot these values:
> pp <- predict(lmObject, interval="confidence")
> plot(height, pp[,1], type="l", ylim=range(pp))
> lines(height, pp[,2], col="red", lwd=3)
> lines(height, pp[,3], col="blue", lwd=3)
What's the problem?
![Page 16: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/16.jpg)
Predicting on New Data
Our problem would be solved if height were ordered sequentially.
One solution: predict fitted values on new (ordered) height values.
Create a sequence of numbers that go from min(height) to max(height) approximately.
> range(height)
> new <- data.frame(height = seq(from=120, to=200, by=2))
Compute new predictions:
> pp.new <- predict(lmObject, new, interval="confidence")
![Page 17: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/17.jpg)
Plotting Confidence Interval Bands
Now we can make the plot. First the fitted values:
> plot(new$height, pp.new[,1], type="l", ylim=range(pp.new))
Then the lower interval band (column 2, lwr):
> lines(new$height, pp.new[,2], col="red", lwd=2)
Finally the upper interval band (column 3, upr):
> lines(new$height, pp.new[,3], col="blue", lwd=2)
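An alternative to predicting on a new grid is to sort the existing predictions by the explanatory variable before drawing lines. The sketch below uses a shuffled copy of the built-in cars dataset as a stand-in for unordered data; the names shuffled, o, x_sorted and pp_sorted are illustrative.

```r
set.seed(1)
shuffled <- cars[sample(nrow(cars)), ]     # rows in random order, like unsorted heights

fit <- lm(dist ~ speed, data = shuffled)
pp  <- predict(fit, interval = "confidence")
colnames(pp)                               # "fit" "lwr" "upr"

# Reorder x and the matching prediction rows so lines() draws left to right
o         <- order(shuffled$speed)
x_sorted  <- shuffled$speed[o]
pp_sorted <- pp[o, ]
head(cbind(speed = x_sorted, pp_sorted))
```

With the sorted objects, calls such as lines(x_sorted, pp_sorted[, "lwr"]) and lines(x_sorted, pp_sorted[, "upr"]) draw clean bands. Referring to columns by name avoids mixing up the lower and upper limits.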
![Page 18: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/18.jpg)
Prediction Interval Bands
Prediction interval bands reflect the uncertainty about future observations.
Prediction bands are generally wider than the confidence interval bands.
> predict(lmObject, interval="prediction")
Plot the fitted data with the prediction interval bands and confidence interval bands superimposed.
> pp.pred <- predict(lmObject, new, interval="prediction")
> pp.conf <- predict(lmObject, new, interval="confidence")
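To see that prediction bands are wider than confidence bands, the interval widths can be compared directly. A sketch on the built-in cars dataset (a stand-in for tlc); the grid of speeds is arbitrary.

```r
fit <- lm(dist ~ speed, data = cars)
new <- data.frame(speed = seq(5, 25, by = 5))

pp_conf <- predict(fit, new, interval = "confidence")
pp_pred <- predict(fit, new, interval = "prediction")

# Width of each interval at the same x values
width_conf <- pp_conf[, "upr"] - pp_conf[, "lwr"]
width_pred <- pp_pred[, "upr"] - pp_pred[, "lwr"]

cbind(new, width_conf, width_pred)
```

Both calls return the same fitted values; only the interval limits differ, because the prediction interval also accounts for the residual variation of a single new observation.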
![Page 19: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/19.jpg)
Visualizing Fitted Values, Prediction and Confidence Bands
Note: instead of using lines individually, we can also use the function matlines which plots columns of a matrix.
> help(matlines)
> plot(new$height, pp.pred[,1], ylim=range(pp.pred, pp.conf))
Add the prediction bands:
> matlines(new$height, pp.pred[,-1], type=c("l", "l"), lwd=c(3,3), col=c("red", "blue"), lty=c(1,1))
Add the confidence bands:
> matlines(new$height, pp.conf[,-1], type=c("l", "l"), lwd=c(3,3), col=c("red", "blue"), lty=c(2,2))
![Page 20: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/20.jpg)
Tutorial 1
![Page 21: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/21.jpg)
ANOVA Models
The (one-way) ANOVA model extends linear regression to a categorical explanatory variable.
– One factor that has G levels.
– For each level we have J replicates.
ANOVA Model:
$$Y_{ij} = \mu_i + \varepsilon_{ij}, \qquad i = 1, \dots, G \text{ (levels)}, \quad j = 1, \dots, J \text{ (replicates)}$$
![Page 22: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/22.jpg)
Analysis of Variance Models
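[Diagram: data layout with the G levels (i = 1, …, G) as columns and the J replicates (j = 1, …, J) as rows; $x_{ij}$ is replicate j in level i, $\bar{x}_i$ is the mean of level i, and $\bar{x}$ is the grand mean.]
Total variation:
$$SSD_{TOTAL} = \sum_i \sum_j (x_{ij} - \bar{x})^2 = SSD_B + SSD_W$$
Variation between groups:
$$SSD_B = \sum_i \sum_j (\bar{x}_i - \bar{x})^2$$
Variation within groups:
$$SSD_W = \sum_i \sum_j (x_{ij} - \bar{x}_i)^2$$
ANOVA is all about splitting variation up.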
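The decomposition of the total variation can be verified numerically. The sketch below uses the built-in ChickWeight data with Diet as the factor; note that the groups there are not of equal size, so the sums simply run over all observations. Variable names are illustrative.

```r
data(ChickWeight)
x <- ChickWeight$weight
g <- ChickWeight$Diet

grand_mean <- mean(x)
group_mean <- ave(x, g)                    # each observation's own group mean

ssd_total   <- sum((x - grand_mean)^2)
ssd_between <- sum((group_mean - grand_mean)^2)
ssd_within  <- sum((x - group_mean)^2)

c(total = ssd_total, between = ssd_between, within = ssd_within,
  between_plus_within = ssd_between + ssd_within)
```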
![Page 23: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/23.jpg)
ANOVA Models
Question: is there a significant relationship between our factor A and the response variable Y?
If yes, then ideally the variation within groups should be small. The variation between groups should be large.
Variation between groups:
$$MS_B = \frac{SSD_B}{B - 1}$$
Variation within groups:
$$MS_W = \frac{SSD_W}{W}$$
where B = # levels and W = n - B.
Our test statistic:
$$F = \frac{MS_B}{MS_W}$$
General idea:
Under the null hypothesis that the factor A has no effect, F should be close to 1 (it follows an F distribution with B - 1 and n - B degrees of freedom).
Large values of F indicate the factor is significant.
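The F statistic and its p-value can be rebuilt from the sums of squares. A minimal sketch on the built-in ChickWeight data; the names ms_b, ms_w, f_manual and p_manual are illustrative.

```r
data(ChickWeight)
tab <- anova(lm(weight ~ Diet, data = ChickWeight))

n <- nrow(ChickWeight)
B <- nlevels(ChickWeight$Diet)

# Mean squares: sums of squares divided by their degrees of freedom
ms_b <- tab["Diet", "Sum Sq"] / (B - 1)
ms_w <- tab["Residuals", "Sum Sq"] / (n - B)

f_manual <- ms_b / ms_w
p_manual <- pf(f_manual, df1 = B - 1, df2 = n - B, lower.tail = FALSE)

c(F = f_manual, p = p_manual)
```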
![Page 24: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/24.jpg)
ANOVA Example
Let's use a different dataset:
> library(MASS)
> data(ChickWeight)
> attach(ChickWeight)
The factor Diet has 4 levels.
> levels(Diet)
> anova(lm(weight ~ Diet, data=ChickWeight))
Analysis of Variance Table

Response: weight
           Df  Sum Sq Mean Sq F value    Pr(>F)
Diet        3  155863   51954   10.81 6.433e-07
Residuals 574 2758693    4806
![Page 25: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/25.jpg)
Two-way ANOVA
We can fit a two-way ANOVA:
> anova(lm(weight ~ Diet + Chick, data=ChickWeight))
Analysis of Variance Table

Response: weight
           Df  Sum Sq Mean Sq F value    Pr(>F)
Diet        3  155863   51954 11.5045 2.571e-07
Chick      46  374243    8136  1.8015  0.001359
Residuals 528 2384450    4516
The interpretation of the model output is sequential, from the bottom to the top.
The Chick line tests the model: weight ~ Diet vs weight ~ Diet + Chick.
The Diet line tests the intercept-only model vs weight ~ Diet.
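The sequential reading of the table can be checked by comparing the two nested models explicitly: the F statistic from anova(m_diet, m_both) should match the Chick line of the sequential table. A sketch with illustrative object names:

```r
data(ChickWeight)
m_diet <- lm(weight ~ Diet, data = ChickWeight)
m_both <- lm(weight ~ Diet + Chick, data = ChickWeight)

seq_tab <- anova(m_both)            # sequential table, as on the slide
cmp     <- anova(m_diet, m_both)    # explicit comparison of the nested models

c(sequential_F = seq_tab["Chick", "F value"], comparison_F = cmp$F[2])
```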
![Page 26: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/26.jpg)
Tutorial 2
![Page 27: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/27.jpg)
Multiple Linear Regression
Multiple explanatory variables to explain a response variable.
Can I explain the values of the response variable by the levels of the explanatory variables?
Do I need all explanatory variables to explain the response variable?
![Page 28: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/28.jpg)
Specifying Models
In R we use a model formula to specify the model we want to fit to our data.
y ~ x               Simple Linear Regression
y ~ x - 1           Simple Linear Regression without the intercept (line goes through origin)
y ~ x1 + x2 + x3    Multiple Regression
y ~ x + I(x^2)      Quadratic Regression
log(y) ~ x1 + x2    Multiple Regression of Transformed Variable
For factors A, B:
y ~ A               1-way ANOVA
y ~ A + B           2-way ANOVA
y ~ A*B             2-way ANOVA + interaction term
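To see how R expands a formula, the terms and model.matrix functions can be inspected without fitting anything. A small sketch; the variable names y, x, A and B are placeholders, and the data frame d is made up for illustration.

```r
# A*B expands to both main effects plus their interaction
f_inter <- y ~ A * B
attr(terms(f_inter), "term.labels")          # "A" "B" "A:B"

# "- 1" removes the intercept
f_noint <- y ~ x - 1
attr(terms(f_noint), "intercept")            # 0

# I() protects arithmetic so x^2 is treated as a squared column
d <- data.frame(x = c(1, 2, 3))
colnames(model.matrix(~ x + I(x^2), data = d))
```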
![Page 29: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/29.jpg)
Fit multiple regression model to a data.frame
> ##Get data.frame
> cfseal <- read.table("cfseal.txt", header=TRUE,
+                      sep="\t")
> heart.log <- log(cfseal$heart)
> cfseal.log <- cfseal
> cfseal.log[,1] <- heart.log
> colnames(cfseal.log)[1] <- "heart.log"
> ##Fit model
> seal.lm <- lm(heart.log ~ ., data=cfseal.log)
![Page 30: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/30.jpg)
Update models and model selection
Some handy functions to know about:
new.model <- update(old.model, new.formula)
Model selection functions: drop1, add1 and step are in base R (stats package); dropterm, addterm and stepAIC are available in the MASS package.
Similarly, anova(modObj, test="Chisq")
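A brief sketch of these functions in use. Because cfseal.txt is not available here, the built-in mtcars dataset serves as a stand-in, and the choice of predictors is arbitrary.

```r
full <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# update(): refit with qsec removed; "." means "keep what was there"
reduced <- update(full, . ~ . - qsec)
formula(reduced)

# drop1(): effect of dropping each term in turn, with F tests
drop1(full, test = "F")

# step(): automatic selection by AIC (trace = 0 suppresses the log)
selected <- step(full, trace = 0)

# anova(): F test comparing the nested models
anova(reduced, full)
```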
![Page 31: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/31.jpg)
Generalized Linear Models
Linear regression models hinge on the assumption that, given the explanatory variables, the response variable follows a Normal distribution.
Generalized linear models are able to handle non-Normal response variables and transformations to linearity.
![Page 32: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/32.jpg)
Logistic Regression
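When faced with a binary response Y ∈ {0, 1}, we use logistic regression.
$$\pi_i = P(Y_i = 1 \mid x_i, \beta)$$
where $x_i = (x_{i1}, \dots, x_{ip})^T$ and $\beta = (\beta_1, \dots, \beta_p)^T$.
Logit:
$$\log\frac{\pi_i}{1 - \pi_i} = \log\frac{P(Y_i = 1 \mid x_i, \beta)}{P(Y_i = 0 \mid x_i, \beta)} = x_i^T \beta = \sum_j \beta_j x_{ij}$$
$$\pi_i = \frac{\exp\left(\sum_j \beta_j x_{ij}\right)}{1 + \exp\left(\sum_j \beta_j x_{ij}\right)}$$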
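The logit and its inverse are available in base R as qlogis and plogis. The sketch below evaluates them for made-up coefficients and covariates (beta and x are hypothetical values, chosen only for illustration).

```r
beta <- c(-1, 2)                 # hypothetical coefficients (intercept, slope)
x    <- c(1, 0.25)               # covariate vector; the leading 1 is for the intercept

eta <- sum(beta * x)             # linear predictor: -1 + 2 * 0.25 = -0.5

# Inverse logit, written out and via plogis()
p_manual <- exp(eta) / (1 + exp(eta))
p_plogis <- plogis(eta)

# Logit, written out and via qlogis(); both recover eta
logit_back <- log(p_manual / (1 - p_manual))

c(eta = eta, p_manual = p_manual, p_plogis = p_plogis,
  logit_back = logit_back, qlogis = qlogis(p_plogis))
```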
![Page 33: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/33.jpg)
Logistic regression
![Page 34: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/34.jpg)
Problem 2 – Logistic Regression
Read in the anaesthetic data set, data file: anaesthetic.txt.
Covariates:
move   binary numeric vector for patient movement (1 = movement, 0 = no movement)
conc   anaesthetic concentration
Goal: estimate how the probability of movement varies with increasing concentration of the anesthetic agent.
![Page 35: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/35.jpg)
Fit the Logistic Regression Model
> anes.logit <- glm(move ~ conc, family=binomial(link=logit), data=anesthetic)
The output summary looks like this:
> summary(anes.logit)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    6.469      2.418   2.675  0.00748 **
conc          -5.567      2.044  -2.724  0.00645 **
Estimates of P(Y=1) are given by:
> fitted.values(anes.logit)
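The link between fitted probabilities and the linear predictor can be illustrated on simulated data. This is a sketch only: the data frame sim is simulated, not the anaesthetic data, and the true coefficients 6.5 and -5.5 are assumptions picked to resemble the estimates above.

```r
set.seed(503)

# Simulated stand-in for the anaesthetic data (hypothetical values)
sim <- data.frame(conc = runif(200, min = 0.8, max = 2.5))
sim$move <- rbinom(200, size = 1, prob = plogis(6.5 - 5.5 * sim$conc))

fit <- glm(move ~ conc, family = binomial(link = "logit"), data = sim)

# Fitted probabilities are the inverse logit of the linear predictor
p_hat      <- fitted(fit)
p_from_eta <- plogis(fit$linear.predictors)

# Predicted probability of movement at new concentrations
p_new <- predict(fit, newdata = data.frame(conc = c(1.0, 1.5)), type = "response")
p_new
```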
![Page 36: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/36.jpg)
Estimating Log Odds Ratio
To get back the log odds (the linear predictor):
> anes.logit$linear.predictors
> plot(anesthetic$conc, anes.logit$linear.predictors)
> abline(coefficients(anes.logit))
Looks like the odds of not moving increase significantly when you increase the concentration of the anesthetic agent beyond 0.8.
![Page 37: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/37.jpg)
Problem 3 – Multiple Logistic Regression
Read in data set birthwt.txt.
low     indicator of birth weight less than 2.5kg
age     mother's age in years
lwt     mother's weight in pounds at last menstrual period
race    mother's race (1 = white, 2 = black, 3 = other)
smoke   smoking status during pregnancy
ptl     number of previous premature labours
ht      history of hypertension
ui      presence of uterine irritability
ftv     number of physician visits during the first trimester
bwt     birth weight in grams
We fit a logistic regression using the glm function and using the binomial family.
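One possible fit is sketched below. It assumes that the birthwt data frame shipped with the MASS package has the same variables as birthwt.txt (the variable descriptions match, but this is not confirmed by the slides). bwt is left out of the model because low is derived from it.

```r
# Assumption: MASS::birthwt stands in for birthwt.txt
data(birthwt, package = "MASS")

# Treat race as a factor rather than a numeric code
bw <- within(birthwt, race <- factor(race, labels = c("white", "black", "other")))

fit <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui + ftv,
           family = binomial, data = bw)

# Exponentiated coefficients are odds ratios
round(exp(coef(fit)), 2)
```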
![Page 38: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/38.jpg)
Problem 4 - Poisson Regression
Poisson regression is often used for the analysis of count data or the calculation of rates associated with a rare event or disease.
Example: schooldata.csv.
We can fit the Poisson regression model using the glm function and the poisson family.
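A minimal sketch of a Poisson regression for rates. Since the contents of schooldata.csv are not described, the data are simulated; the variable names exposure, treated and events and the true coefficients are all hypothetical. The offset term turns the count model into a model for the rate per unit of exposure.

```r
set.seed(503)

# Simulated data (hypothetical): event counts with varying exposure
sim <- data.frame(exposure = runif(100, 50, 500),
                  treated  = rbinom(100, 1, 0.5))
sim$events <- rpois(100, lambda = sim$exposure * exp(-3 + 0.5 * sim$treated))

# log(exposure) as an offset models the event rate rather than the raw count
fit <- glm(events ~ treated + offset(log(exposure)),
           family = poisson, data = sim)

# Exponentiated coefficients: baseline rate and rate ratio for treated
exp(coef(fit))
```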
![Page 39: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/39.jpg)
Survival Analysis
library(survival)
Example: aml leukemia data
Kaplan-Meier curve
fit1 <- survfit(Surv(aml$time[1:11],aml$status[1:11])~1)
summary(fit1)
plot(fit1)
Log-rank test
survdiff(Surv(time, status)~x, data=aml)
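The Kaplan-Meier fit can also be stratified by group with the formula interface, instead of subsetting rows by index. A sketch using the aml data from the survival package; the names km, lr and p_logrank are illustrative.

```r
library(survival)

# One Kaplan-Meier curve per level of x (maintenance group)
km <- survfit(Surv(time, status) ~ x, data = aml)
summary(km)$table                  # per-group counts, events and median survival

# Log-rank test, with the p-value computed from the chi-square statistic
lr <- survdiff(Surv(time, status) ~ x, data = aml)
p_logrank <- pchisq(lr$chisq, df = length(lr$n) - 1, lower.tail = FALSE)
p_logrank
```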
![Page 40: BIO503: Lecture 4 Statistical models in R](https://reader036.fdocuments.net/reader036/viewer/2022070419/56815ccf550346895dcadf15/html5/thumbnails/40.jpg)
Survival analysis
> cp <- coxph(Surv(aml$time,
+ aml$status)~x,data=aml)
> summary(cp)
> plot(survfit(Surv(aml$time,aml$status)~x,
+ data=aml),col=c("red","green"),lwd=2)