
INSTALL THESE PACKAGES

setwd("your_wk")
install.packages("psych", lib = getwd())

psych
car
calibrate
grid
hexbin
lattice
solaR

Workshop Tomsk

Linear Modelling
Introduction to R

Karim Malki and Maria Grazia Tosto

27th June 2015

AGENDA
• R for statistical analysis

• Data pre-processing

• Understanding Linear Models

• Building Linear Models in R

• Graphing

• What if the model does not fit?

• ANOVA

R For Statistics

R is a powerful statistical program, but it is first and foremost a programming language.

Many routines have been written for R by people all over the world and made freely available on the R project website as "packages".

The base installation contains a powerful set of tools for many statistical purposes, including linear modelling.

More advanced routines require loading a library or installing a package.

Variance and Covariance

Variance

• The sum of the squared deviations of each data point from the mean for that variable, divided by n − 1

• When a participant deviates from the mean on one variable, do they deviate on another variable in a similar, or opposite, way? = “Covariance”.

s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}
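As a quick check of this formula, here is a minimal R sketch (the data vector is illustrative) comparing the hand computation with the built-in var():

x <- c(2, 4, 6, 8, 10)             # illustrative data
n <- length(x)
sum((x - mean(x))^2) / (n - 1)     # variance from the formula above: 10
var(x)                             # built-in equivalent: 10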

Correlation
sample1 <- runif(10, 5.0, 15)
sample2 <- sample(5:15, 10, replace = TRUE)
var(sample1)
cov(sample1, sample2)

Standardising covariance measures
Standardising a covariance value gives a measure of the strength of the relationship -> the correlation coefficient.

E.g. the covariance divided by (SD of X * SD of Y) is the 'Pearson product-moment correlation coefficient'. This gives coefficients between -1 (perfect negative relationship) and 1 (perfect positive relationship).
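A short sketch, reusing sample1 and sample2 from the previous slide, showing that the covariance divided by the product of the standard deviations reproduces cor():

cov(sample1, sample2) / (sd(sample1) * sd(sample2))   # Pearson r by hand
cor(sample1, sample2)                                 # identical value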

Correlation

Correlation graphs
Use the basic defaults to create a scatter plot of your two variables:
plot(y.var ~ x.var)
Change the axis titles:
plot(x.var, y.var, xlab = "X-axis", ylab = "Y-axis")
Change the plotting symbol to a solid circle:
plot(x.var, y.var, pch = 16)
Add a line of best fit to your scatter plot:
abline(lm(y.var ~ x.var))

The default returns the Pearson correlation coefficient:
cor(var1, var2)
If you specify "spearman" you will get the Spearman correlation coefficient:
cor(var1, var2, method = "spearman")
If you use a dataset instead of separate variables, you get a matrix of all the pairwise correlation coefficients:
cor(dataset, method = "pearson")

?faithful
data(faithful)
summary(faithful)
dim(faithful)
str(faithful)
names(faithful)

library(psych)
describe(faithful)

Correlation
ls()
hist(faithful$eruptions, col = "grey")
hist(faithful$waiting, col = "grey")
attach(faithful)
plot(eruptions ~ waiting)
abline(lm(faithful$eruptions ~ faithful$waiting))

cor(eruptions, waiting)
cor(faithful, method = "pearson")

library(car)
scatterplot(eruptions ~ waiting, reg.line = lm, smooth = TRUE, spread = TRUE,
            id.method = 'mahal', id.n = 2, boxplots = 'xy', span = 0.5,
            data = faithful)

library(psych)
cor.test(waiting, eruptions)   # cor.test() itself comes from base R's stats package

Correlation

library(Deducer)   # cor.matrix() and d() come from the Deducer package
corr.mat <- cor.matrix(variables = d(eruptions, waiting), data = faithful,
                       test = cor.test, method = 'pearson',
                       alternative = "two.sided")

> print(corr.mat)

Pearson's product-moment correlation

                   eruptions  waiting
eruptions  cor        1.0000   0.9008
           N             272      272
           CI*               (0.8757, 0.9211)
           p-value                    0.0000

Regression

• Linear regression is conceptually similar to correlation

• However, correlation does not treat the two variables differently

• In contrast, linear regression asks about the effect of one variable on the other. It distinguishes between IVs (the thing that may influence) and DVs (the thing being influenced)

• So, sometimes problematically, you choose which variable you expect to have the causal effect

• It fits a straight line that minimises squared error in the DV (the vertical distances of points from the line = the "Method of Least Squares")

• It then asks about the variance explained by this straight-line model relative to the unexplained variance

Regression
x <- c(173, 169, 176, 166, 161, 164, 160, 158, 180, 187)
y <- c(80, 68, 72, 75, 70, 65, 62, 60, 85, 92)
# plot scatterplot and the regression line
mod1 <- lm(y ~ x)
plot(x, y, xlim = c(min(x) - 5, max(x) + 5), ylim = c(min(y) - 10, max(y) + 10))
abline(mod1, lwd = 2)
# calculate residuals and predicted values
res <- signif(residuals(mod1), 5)
pre <- predict(mod1)
# plot distances between points and the regression line
segments(x, y, x, pre, col = "red")
# add labels (res values) to points
library(calibrate)
textxy(x, y, res, cx = 0.7)

Regression

Method of Least Squares

Parameters

The regression model knows the best-fitting line, but it can tell you only two things: the slope (gradient, or coefficient) and the intercept (or constant).

Parameters

Linear Modelling
# data(faithful); ls()
mod1 <- lm(eruptions ~ waiting, data = faithful)
mod1
summary(mod1)

Call:
lm(formula = eruptions ~ waiting, data = faithful)

Residuals:
     Min       1Q   Median       3Q      Max
-1.29917 -0.37689  0.03508  0.34909  1.19329

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.874016   0.160143  -11.70   <2e-16 ***
waiting      0.075628   0.002219   34.09   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4965 on 270 degrees of freedom
Multiple R-squared: 0.8115,  Adjusted R-squared: 0.8108
F-statistic: 1162 on 1 and 270 DF,  p-value: < 2.2e-16

co <- coef(mod1)
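The two fitted parameters can be used directly. A minimal sketch (the waiting time of 80 minutes is an arbitrary illustration) predicting an eruption duration by hand and checking it against predict():

co                                                 # intercept and slope
co[1] + co[2] * 80                                 # predicted eruption length for waiting = 80
predict(mod1, newdata = data.frame(waiting = 80))  # same value via predict()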

# calculate residuals and predicted values
res <- signif(residuals(mod1), 5)
pre <- predict(mod1)

# Residuals should be normally distributed, and this is easy to check
hist(res)

library(MASS)
truehist(res)
qqnorm(res)
abline(0, 1)   # reference line through the origin with slope 1

Plot your regression
attach(faithful)
mod1 <- lm(eruptions ~ waiting)
plot(waiting, eruptions,
     xlim = c(min(faithful$waiting) - 10, max(faithful$waiting) + 5),
     ylim = c(min(faithful$eruptions) - 3, max(faithful$eruptions) + 1))
abline(mod1, lwd = 2)
# plot distances between points and the regression line
segments(faithful$waiting, faithful$eruptions, faithful$waiting, pre, col = 'red')

Linear Modelling

Return p-value
lmp <- function(modelobject) {
  if (!inherits(modelobject, "lm")) stop("Not an object of class 'lm'")
  f <- summary(modelobject)$fstatistic
  p <- pf(f[1], f[2], f[3], lower.tail = FALSE)
  attributes(p) <- NULL
  return(p)
}
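A usage sketch: applied to the faithful model from the previous slides, lmp() extracts the overall F-test p-value that summary() prints.

lmp(mod1)   # p-value of the model F-statistic (< 2.2e-16 here)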

Model Fit
If the model does not fit, it may be because of:

Outliers

Unmodelled covariates

Heteroscedasticity (residuals have unequal variance)

Clustering (residuals have lower variance within subgroups)

Autocorrelation (correlation between residuals at successive time points)
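A quick first check for several of these problems is R's built-in diagnostic plots for lm objects; a minimal sketch using the faithful model from earlier:

par(mfrow = c(2, 2))   # 2 x 2 grid of diagnostic plots
plot(mod1)             # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(1, 1))   # restore the plotting layout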

Model Fit
# library() loads one package per call
library(MASS)
library(psych)
library(lattice)
library(grid)
library(hexbin)
library(solaR)
data(hills)
?hills
splom(~hills)

data <- subset(hills, select = c('dist', 'time', 'climb'))
splom(hills, panel = panel.hexbinplot, colramp = BTC,
      diag.panel = function(x, ...) {
        yrng <- current.panel.limits()$ylim
        d <- density(x, na.rm = TRUE)
        d$y <- with(d, yrng[1] + 0.95 * diff(yrng) * y / max(y))
        panel.lines(d)
        diag.panel.splom(x, ...)
      },
      lower.panel = function(x, y, ...) {
        panel.hexbinplot(x, y, ...)
        panel.loess(x, y, ..., col = 'red')
      },
      pscale = 0, varname.cex = 0.7)

Model Fit

Building Robust Models
mod2 <- lm(time ~ dist, data = hills)
summary(mod2)
attach(hills)
co2 <- coef(mod2)
plot(dist, time)
abline(co2)

fl <- fitted(mod2)
for (i in 1:35)
  lines(c(dist[i], dist[i]), c(time[i], fl[i]), col = 'red')

# Can you spot outliers?

sr <- stdres(mod2)
names(sr)
truehist(sr, xlim = c(-3, 5), h = 0.4)
names(sr)[sr > 3]
names(sr)[sr < -3]   # note the spaces: sr<-3 would assign 3 to sr

Outliers

What to do with outliers
Data-driven methods for the removal of outliers have some limitations.

Fit a better model

Robust regression is an alternative to least-squares regression when data are contaminated with outliers or influential observations.

Leverage: An observation with an extreme value on a predictor variable is a point with high leverage. Leverage is a measure of how far an independent variable deviates from its mean.

Influence: An observation is influential if removing the observation substantially changes the estimate of the regression coefficients.

Cook's distance (or Cook's D): A measure that combines the information of leverage and residual of the observation.
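All three quantities are available in base R for any lm fit. A sketch using mod2 from the hills slides (the 4/n cutoff is just one common rule of thumb):

hatvalues(mod2)                                  # leverage of each observation
cooks.distance(mod2)                             # Cook's D
which(cooks.distance(mod2) > 4 / nrow(hills))    # flag observations above the 4/n cutoff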

What to do with outliers
attach(hills)
summary(ols <- lm(time ~ dist))

opar <- par(mfrow = c(2, 2), oma = c(0, 0, 1.1, 0))
plot(ols, las = 1)
influence.measures(ols)   # was Influence.measures(lm1): R is case-sensitive, and the model is named ols

# Using an MM estimator
rlm1 <- rlm(time ~ dist, data = hills, method = 'MM')
summary(rlm1)

attach(hills)
plot(dist, time, ylim = c(0, 250))
abline(coef(ols))                 # least-squares fit (was coef(lm1), which is undefined)
abline(coef(rlm1), col = "red")   # robust fit

From Regression to ANOVA
One-Way ANOVA (Between Factor)

fit <- aov(y ~ A, data=mydataframe)

Two-Way Between-Subjects design
fit <- aov(y ~ A + B + A:B, data = mydataframe)
fit <- aov(y ~ A * B, data = mydataframe)

Plot a two-way interaction
?interaction.plot
interaction.plot(x.factor, trace.factor, response)
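A runnable sketch using the built-in ToothGrowth data, chosen here purely for illustration:

data(ToothGrowth)
with(ToothGrowth,
     interaction.plot(x.factor = dose, trace.factor = supp, response = len))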

Links for ANOVA
http://www.personality-project.org/r/r.anova.html