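INSTALL THESE PACKAGES

setwd('your_wk')
install.packages("psych", lib=getwd())
# packages installed into the working directory are loaded with library(psych, lib.loc=getwd())

Packages: psych, car, calibrate, grid, hexbin, lattice, solaR
(grid is part of base R and needs no installation)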

Workshop Tomsk

Linear Modelling

Introduction to R

Karim Malki and Maria Grazia Tosto

27th June 2015

AGENDA

• R for statistical analysis

• Data pre-processing

• Understanding Linear Models

• Building Linear Models in R

• Graphing

• What if the model does not fit?

• ANOVA

R For Statistics

R is a powerful statistical program but it is first and foremost a programming language.

Many routines have been written for R by people all over the world and made freely available on the R project website as "packages".

The base installation contains a powerful set of tools for many statistical purposes, including linear modelling.

More advanced routines require loading a package with library().

Variance and CoVariance

Variance

• Sum of the squared deviations of each data point from the mean of that variable, divided by n − 1

$$s^2 = \frac{\sum (x - \bar{X})^2}{n - 1}$$

• When a participant deviates from the mean on one variable, do they deviate on another variable in a similar, or opposite, way? = “Covariance”.
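A minimal sketch of both quantities computed by hand and checked against R's built-in var() and cov(). The vectors x and y are made-up illustration data, not workshop data.

```r
# Illustrative data (made up for this sketch)
x <- c(2, 4, 4, 5, 7, 8)
y <- c(1, 3, 2, 5, 6, 9)
n <- length(x)

# Variance: squared deviations from the mean, summed and divided by n - 1
var_by_hand <- sum((x - mean(x))^2) / (n - 1)

# Covariance: products of the paired deviations, summed and divided by n - 1
cov_by_hand <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Both should agree with the built-in functions
all.equal(var_by_hand, var(x))
all.equal(cov_by_hand, cov(x, y))
```

A positive covariance here means that values above the mean of x tend to pair with values above the mean of y.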

Correlation

sample1 <- runif(10, 5.0, 15)
sample2 <- sample(5:15, 10, replace=TRUE)
var(sample1)
cov(sample1, sample2)

Standardising covariance measures

Standardising a covariance value gives a measure of the strength of the relationship -> Correlation coefficient

E.g. covariance divided by (sd of X * sd of Y) is the ‘Pearson product moment correlation coefficient’. This will give coefficients between -1 (perfect negative relationship) and 1 (perfect positive relationship).
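The standardisation can be checked directly. This sketch reuses the sample1 and sample2 definitions from the slide; the set.seed() call is an addition so the random draws are reproducible.

```r
set.seed(1)  # arbitrary seed, added only for reproducibility
sample1 <- runif(10, 5.0, 15)
sample2 <- sample(5:15, 10, replace = TRUE)

# Covariance standardised by the product of the two standard deviations
r_by_hand <- cov(sample1, sample2) / (sd(sample1) * sd(sample2))

# Should match the built-in Pearson coefficient and lie between -1 and 1
all.equal(r_by_hand, cor(sample1, sample2))
r_by_hand >= -1 && r_by_hand <= 1
```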

Correlation

Correlation graphs

# Use the basic defaults to create a scatter plot of your two variables
plot(y.var ~ x.var)
# Change the axis titles
plot(x.var, y.var, xlab="X-axis", ylab="Y-axis")
# Change the plotting symbol to a solid circle
plot(x.var, y.var, pch=16)
# Add a line of best fit to your scatter plot
abline(lm(y.var ~ x.var))

# The default correlation returns the Pearson correlation coefficient
cor(var1, var2)
# If you specify "spearman" you will get the Spearman correlation coefficient
cor(var1, var2, method = "spearman")
# If you use a dataset instead of separate variables you will get a matrix of all the pairwise correlation coefficients
cor(dataset, method = "pearson")

?faithful
data(faithful)
summary(faithful)
dim(faithful)
str(faithful)
names(faithful)

library(psych)
describe(faithful)

Correlation

ls()
hist(faithful$eruptions, col="grey")
hist(faithful$waiting, col="grey")
attach(faithful)
plot(eruptions~waiting)
abline(lm(faithful$eruptions~faithful$waiting))

cor(eruptions, waiting)
cor(faithful, method = "pearson")

library(car)
# argument names as in the car version used at the workshop; car 3.x renamed several of them (e.g. reg.line became regLine)
scatterplot(eruptions~waiting, reg.line=lm, smooth=TRUE, spread=TRUE, id.method='mahal', id.n = 2, boxplots='xy', span=0.5, data=faithful)

library(psych)
cor.test(waiting, eruptions)  # cor.test() itself comes with base R (stats)

Correlation

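# cor.matrix() and d() are not in the packages listed at the start; they appear to come from the Deducer package
corr.mat <- cor.matrix(variables=d(eruptions,waiting),, data=faithful, test=cor.test, method='pearson', alternative="two.sided")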

> print(corr.mat)

Pearson's product-moment correlation

                     eruptions   waiting
eruptions   cor      1           0.9008
            N        272         272
            CI*                  (0.8757,0.9211)
            p-value              0.0000

Regression

• Linear Regression is conceptually similar to correlation

• However, correlation does not treat the two variables differently

• In contrast, Linear Regression asks about the effect of one on the other. It distinguishes between IVs (the things that may influence) and DVs (the things being influenced)

• So, sometimes problematically, you choose which you expect to have the causal effect

• Fits a straight line that minimises squared error in the DV (vertical distances of points from the line) = “Method of Least Squares”

• And then asks how much variance is explained by this straight-line model relative to the unexplained variance
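For one predictor, the least-squares line has a closed form: slope = cov(x, y) / var(x) and intercept = mean(y) − slope × mean(x). The sketch below uses the x and y vectors from the Regression example in this deck and checks the hand-computed estimates, and the proportion of variance explained, against lm(). Variable names other than x and y are illustrative.

```r
x <- c(173, 169, 176, 166, 161, 164, 160, 158, 180, 187)
y <- c(80, 68, 72, 75, 70, 65, 62, 60, 85, 92)

# Least-squares estimates for a single predictor
slope     <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

# The same estimates from lm()
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(intercept, slope))

# Variance explained by the line relative to the total variance (R-squared)
ss_res <- sum(residuals(fit)^2)     # unexplained: squared vertical distances
ss_tot <- sum((y - mean(y))^2)      # total variation in the DV
r_squared <- 1 - ss_res / ss_tot
all.equal(r_squared, summary(fit)$r.squared)
```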

Regression

x <- c(173, 169, 176, 166, 161, 164, 160, 158, 180, 187)
y <- c(80, 68, 72, 75, 70, 65, 62, 60, 85, 92)
# plot scatterplot and the regression line
mod1 <- lm(y ~ x)
plot(x, y, xlim=c(min(x)-5, max(x)+5), ylim=c(min(y)-10, max(y)+10))
abline(mod1, lwd=2)
# calculate residuals and predicted values
res <- signif(residuals(mod1), 5)
pre <- predict(mod1)
# plot distances between points and the regression line
segments(x, y, x, pre, col="red")
# add labels (res values) to points
library(calibrate)
textxy(x, y, res, cex=0.7)  # older calibrate versions name this argument cx

Method of Least Squares

Parameters

The regression model finds the best-fitting line, but it tells you only two things about it: the slope (gradient or coefficient) and the intercept (or constant).
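As a small illustration of what the two parameters do, this sketch fits the faithful model used elsewhere in the deck, pulls out the intercept and slope, and uses them to predict by hand. The 80-minute waiting time is an arbitrary value chosen for the example.

```r
fit <- lm(eruptions ~ waiting, data = faithful)

b <- coef(fit)
intercept <- b[["(Intercept)"]]  # predicted eruption length at waiting = 0
slope     <- b[["waiting"]]      # change in eruption length per extra minute of waiting

# Prediction for an 80-minute wait: intercept + slope * 80
by_hand    <- intercept + slope * 80
by_predict <- predict(fit, newdata = data.frame(waiting = 80))

all.equal(by_hand, unname(by_predict))
```

Everything the fitted line predicts follows from these two numbers.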

Linear Modelling

#data(faithful);ls()
mod1 <- lm(eruptions~waiting, data=faithful)
mod1
summary(mod1)

Call:
lm(formula = eruptions ~ waiting, data = faithful)

Residuals:
     Min       1Q   Median       3Q      Max
-1.29917 -0.37689  0.03508  0.34909  1.19329

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.874016   0.160143  -11.70   <2e-16 ***
waiting      0.075628   0.002219   34.09   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4965 on 270 degrees of freedom
Multiple R-squared:  0.8115,  Adjusted R-squared:  0.8108
F-statistic:  1162 on 1 and 270 DF,  p-value: < 2.2e-16

co <- coef(mod1)

# calculate residuals and predicted values
res <- signif(residuals(mod1), 5)
pre <- predict(mod1)

# Residuals should be normally distributed and this is easy to check
hist(res)

library(MASS)
truehist(res)
qqnorm(res)
qqline(res)  # reference line through the quartiles; abline(0,1) only suits standardised residuals

# Plot your regression
attach(faithful)
mod1 <- lm(eruptions~waiting)
plot(waiting, eruptions, xlim=c(min(faithful$waiting)-10, max(faithful$waiting)+5), ylim=c(min(faithful$eruptions)-3, max(faithful$eruptions)+1))
abline(mod1, lwd=2)
# plot distances between points and the regression line
segments(faithful$waiting, faithful$eruptions, faithful$waiting, pre, col='red')

Return p-value

# overall p-value of a fitted lm, taken from its F statistic
lmp <- function (modelobject) {
  if (!inherits(modelobject, "lm")) stop("Not an object of class 'lm' ")
  f <- summary(modelobject)$fstatistic
  p <- pf(f[1], f[2], f[3], lower.tail=FALSE)
  attributes(p) <- NULL
  return(p)
}
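The same calculation can be written inline. With a single predictor, the overall F-test p-value should equal the p-value of the slope's t-test, which this sketch checks on the x and y vectors from the Regression example. The names p_model and p_slope are illustrative.

```r
x <- c(173, 169, 176, 166, 161, 164, 160, 158, 180, 187)
y <- c(80, 68, 72, 75, 70, 65, 62, 60, 85, 92)

fit <- lm(y ~ x)
s <- summary(fit)

# Overall model p-value from the F statistic (value, numerator df, denominator df)
f <- s$fstatistic
p_model <- unname(pf(f[1], f[2], f[3], lower.tail = FALSE))

# p-value of the slope from the coefficient table
p_slope <- coef(s)["x", "Pr(>|t|)"]

# With one predictor the two tests are equivalent (F = t^2)
all.equal(p_model, p_slope)
```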

Model Fit

If the model does not fit, it may be because of:

Outliers

Unmodelled covariates

Heteroscedasticity (residuals have unequal variance)

Clustering (residuals have lower variance within subgroups)

Autocorrelation (correlation between residuals at successive time points)
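Some of these problems can be screened for numerically with base R alone. The sketch below is a rough first look, not a formal test: the ±3 cut-off is a common rule of thumb, and the names outliers, spread_trend and lag1 are illustrative. It uses the faithful model from earlier in the deck.

```r
fit <- lm(eruptions ~ waiting, data = faithful)
res <- residuals(fit)

# Outliers: standardised residuals beyond +/- 3 (rule of thumb)
outliers <- which(abs(rstandard(fit)) > 3)

# Heteroscedasticity: do the absolute residuals grow or shrink with the fitted values?
spread_trend <- cor(abs(res), fitted(fit))

# Autocorrelation: correlation between each residual and the next one in row order
lag1 <- cor(res[-length(res)], res[-1])

outliers
spread_trend
lag1
```

Values of spread_trend or lag1 far from zero suggest the residuals deserve a closer look, for example with the diagnostic plots from plot(fit).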

Model Fit

# library() loads one package per call
library(MASS)
library(psych)
library(lattice)
library(grid)
library(hexbin)
library(solaR)
data(hills); ?hills
splom(~hills)

data <- subset(hills, select=c('dist', 'time', 'climb'))
splom(hills,
      panel=panel.hexbinplot, colramp=BTC,
      diag.panel = function(x, ...){
        yrng <- current.panel.limits()$ylim
        d <- density(x, na.rm=TRUE)
        d$y <- with(d, yrng[1] + 0.95 * diff(yrng) * y / max(y))
        panel.lines(d)
        diag.panel.splom(x, ...)
      },
      lower.panel = function(x, y, ...){
        panel.hexbinplot(x, y, ...)
        panel.loess(x, y, ..., col = 'red')
      },
      pscale=0, varname.cex=0.7
)

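Building Robust Models

mod2 <- lm(time~dist, data=hills)

summary(mod2)
attach(hills)
co2 <- coef(mod2)
plot(dist, time)
abline(co2)

# draw each residual as a red vertical segment
fl <- fitted(mod2)
for (i in seq_along(fl))
  lines(c(dist[i], dist[i]), c(time[i], fl[i]), col='red')

#Can you spot outliers?

sr <- stdres(mod2)  # stdres() is in MASS
names(sr)
truehist(sr, xlim=c(-3,5), h=.4)
names(sr)[sr > 3]
names(sr)[sr < -3]  # the space matters: sr<-3 would assign 3 to sr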

What to do with outliers

Data-driven methods for the removal of outliers: some limitations

Fit a better model

Robust regression is an alternative to least squares regression when data are contaminated with outliers or influential observations.

Leverage: An observation with an extreme value on a predictor variable is a point with high leverage. Leverage is a measure of how far an independent variable deviates from its mean.

Influence: An observation is influential if removing the observation substantially changes the estimate of the regression coefficients.

Cook's distance (or Cook's D): A measure that combines the information of leverage and residual of the observation.
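Base R can report these measures for any fitted lm. The sketch below uses the hills data from MASS with the time ~ dist model from this deck. The cut-offs 2p/n for leverage and 4/n for Cook's distance are common rules of thumb, not part of the workshop material, and the names high_leverage and influential are illustrative.

```r
library(MASS)  # provides the hills data set

fit <- lm(time ~ dist, data = hills)
n <- nrow(hills)
p <- length(coef(fit))  # number of estimated coefficients

lev   <- hatvalues(fit)       # leverage of each race
cooks <- cooks.distance(fit)  # Cook's distance of each race

# Rules of thumb (conventions, not hard limits)
high_leverage <- names(lev)[lev > 2 * p / n]
influential   <- names(cooks)[cooks > 4 / n]

high_leverage
influential
```

A race can have high leverage without being influential if it still lies close to the fitted line, which is why the two lists need not match.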

What to do with outliers

attach(hills); summary(ols <- lm(time ~ dist))

opar <- par(mfrow = c(2, 2), oma = c(0, 0, 1.1, 0))
plot(ols, las = 1)
par(opar)  # restore the previous plot layout
influence.measures(ols)

# Using an MM-estimator (rlm() is in MASS)

rlm1 <- rlm(time~dist, data=hills, method='MM')
summary(rlm1)

plot(dist, time, ylim=c(0,250))
abline(coef(ols))
abline(coef(rlm1), col="red")

From Regression to ANOVA

One Way Anova (Between Factor)

fit <- aov(y ~ A, data=mydataframe)

Two Way Between subject design
fit <- aov(y ~ A + B + A:B, data=mydataframe)
fit <- aov(y ~ A*B, data=mydataframe)  # same model as the line above

Plot a two way interaction?
interaction.plot(x.factor, trace.factor, response)
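A runnable version of these templates, sketched on the ToothGrowth data set that ships with R (the choice of data set is an assumption made for illustration; the workshop's own data frame is not shown). It also checks the link from regression to ANOVA: the one-way aov() and the equivalent lm() give the same F statistic.

```r
# ToothGrowth ships with R; dose is numeric, so turn it into a factor first
tg <- transform(ToothGrowth, dose = factor(dose))

# One-way ANOVA: tooth length by dose
fit1 <- aov(len ~ dose, data = tg)
summary(fit1)

# The same model fitted as a regression gives the same F statistic
f_aov <- summary(fit1)[[1]][["F value"]][1]
f_lm  <- unname(summary(lm(len ~ dose, data = tg))$fstatistic["value"])
all.equal(f_aov, f_lm)

# Two-way between-subjects design with interaction
fit2 <- aov(len ~ supp * dose, data = tg)
summary(fit2)

# Interaction plot: dose on the x-axis, one trace per supplement type
with(tg, interaction.plot(dose, supp, len))
```

Non-parallel traces in the interaction plot suggest that the effect of one factor depends on the level of the other.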

Links for ANOVA
http://www.personality-project.org/r/r.anova.html