Linear Modelling III

Richard Mott, Wellcome Trust Centre for Human Genetics


Page 1: Linear Modelling III

Linear Modelling III

Richard Mott, Wellcome Trust Centre for Human Genetics

Page 2: Linear Modelling III

Synopsis

• Logistic Regression
• Mixed Models

Page 3: Linear Modelling III

Logistic Regression and Statistical Genetics

• Applicable to Association Studies
• Data:
  – Binary outcomes (eg disease status)
  – Dependent on genotypes [+ sex, environment]
• Aim is to identify which factors influence the outcome
• Rigorous tests of statistical significance
• Flexible modelling language
• Generalisation of the Chi-Squared Test

Page 4: Linear Modelling III

Data

• Data for individuals labelled i = 1…n:
  – Binary response yi
  – Unphased genotypes gij for markers j = 1…m

Page 5: Linear Modelling III

Coding Unphased Genotypes

• Several possibilities:
  – AA, AG, GG   (original genotypes)
  – 11, 12, 22
  – 1, 2, 3
  – 0, 1, 2   (number of G alleles)
• Missing data
  – NA (the default in R)
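A minimal sketch of the last coding in R (hypothetical genotype vector, not the course data):

# Recode character genotypes as the number of G alleles, keeping NA for missing data
g <- c("AA", "AG", "GG", NA, "AG")
counts <- c(AA = 0, AG = 1, GG = 2)
x <- unname(counts[g])    # indexing by NA yields NA
x                         # 0 1 2 NA 1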

Page 6: Linear Modelling III

Using R

• Load the genetic logistic regression tools
  > source('logistic.R')
• Read the data table from a file
  > t <- read.table('geno.dat', header=TRUE)
• Column names
  > names(t)
  – t$y              response (0, 1)
  – t$m1, t$m2, …    genotypes for each marker

Page 7: Linear Modelling III

Contingency Tables in R

• ftable(t$y,t$m31) prints the contingency table

> ftable(t$y,t$m31)
     11  12  22
0   515 387  75
1    28  11   2
>

Page 8: Linear Modelling III

Chi-Squared Test in R

> chisq.test(t$y,t$m31)

Pearson's Chi-squared test

data:  t$y and t$m31
X-squared = 3.8424, df = 2, p-value = 0.1464

Warning message:
Chi-squared approximation may be incorrect in: chisq.test(t$y, t$m31)
>

Page 9: Linear Modelling III

The Logistic Model

• Effect of SNP k on phenotype: Prob(Yi=1) depends on the genotype at SNP k through a linear predictor, hi = b0 + xik bk

• Null model: hi = b0 (the genotype has no effect)

Page 10: Linear Modelling III

The Logistic Model

• Prob(Yi=1) = exp(hi)/(1+exp(hi))

• hi = Σj xij bj – Linear Predictor

• xij – Design Matrix (genotypes etc)

• bj – Model Parameters (to be estimated)

• The model is investigated by
  – estimating the bj's by maximum likelihood
  – testing whether the estimates are different from 0

Page 11: Linear Modelling III

The Logistic Function: Prob(Yi=1) = exp(hi)/(1+exp(hi))

[Plot: the sigmoid curve of Prob(Y=1) against the linear predictor h]
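This curve can be reproduced directly in R:

# Plot the logistic function Prob(Y=1) = exp(h)/(1+exp(h))
h <- seq(-6, 6, by = 0.1)
p <- exp(h) / (1 + exp(h))    # equivalently plogis(h)
plot(h, p, type = "l", xlab = "h", ylab = "Prob(Y=1)")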

Page 12: Linear Modelling III

Types of genetic effect at a single locus

           AA  AG  GG
Recessive   0   0   1
Dominant    1   1   0
Additive    0   1   2
Genotype    0   a   b

Page 13: Linear Modelling III

Additive Genotype Model

• Code genotypes as
  – AA  x=0
  – AG  x=1
  – GG  x=2
• Linear predictor
  – h = b0 + xb1
• P(Y=1|x) = exp(b0 + xb1)/(1+exp(b0 + xb1))
• PAA = P(Y=1|x=0) = exp(b0)/(1+exp(b0))
• PAG = P(Y=1|x=1) = exp(b0 + b1)/(1+exp(b0 + b1))
• PGG = P(Y=1|x=2) = exp(b0 + 2b1)/(1+exp(b0 + 2b1))
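A quick check of these formulas in R, using the parameter values from the next slide (b0 = -2, b1 = 2):

# Additive-model probabilities for genotype codes x = 0, 1, 2
b0 <- -2; b1 <- 2
x  <- c(AA = 0, AG = 1, GG = 2)
p  <- exp(b0 + x * b1) / (1 + exp(b0 + x * b1))
round(p, 2)
#   AA   AG   GG
# 0.12 0.50 0.88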

Page 14: Linear Modelling III

Additive Model: b0 = -2, b1 = 2
PAA = 0.12, PAG = 0.50, PGG = 0.88

[Plot: Prob(Y=1) against h, with the three genotype probabilities marked on the logistic curve]

Page 15: Linear Modelling III

Additive Model: b0 = 0 b1 = 2

PAA = 0.50 PAG = 0.88 PGG = 0.98

[Plot: Prob(Y=1) against h, with the three genotype probabilities marked on the logistic curve]

Page 16: Linear Modelling III

Recessive Model

• Code genotypes as
  – AA  x=0
  – AG  x=0
  – GG  x=1
• Linear predictor
  – h = b0 + xb1
• P(Y=1|x) = exp(b0 + xb1)/(1+exp(b0 + xb1))
• PAA = PAG = P(Y=1|x=0) = exp(b0)/(1+exp(b0))
• PGG = P(Y=1|x=1) = exp(b0 + b1)/(1+exp(b0 + b1))

Page 17: Linear Modelling III

Recessive Model: b0 = 0 b1 = 2

PAA = PAG = 0.50 PGG = 0.88

[Plot: Prob(Y=1) against h, with the genotype probabilities marked on the logistic curve]

Page 18: Linear Modelling III

Genotype Model

• Each genotype has an independent effect
• Code genotypes as (for example)
  – AA  x=0, z=0
  – AG  x=1, z=0
  – GG  x=0, z=1
• Linear predictor
  – h = b0 + xb1 + zb2   (two parameters)
• P(Y=1|x,z) = exp(b0 + xb1 + zb2)/(1+exp(b0 + xb1 + zb2))
• PAA = P(Y=1|x=0,z=0) = exp(b0)/(1+exp(b0))
• PAG = P(Y=1|x=1,z=0) = exp(b0 + b1)/(1+exp(b0 + b1))
• PGG = P(Y=1|x=0,z=1) = exp(b0 + b2)/(1+exp(b0 + b2))

Page 19: Linear Modelling III

Genotype Model: b0 = 0 b1 = 2 b2 = -1

PAA = 0.5 PAG = 0.88 PGG = 0.27

[Plot: Prob(Y=1) against h, with the three genotype probabilities marked on the logistic curve]

Page 20: Linear Modelling III

Models in R (response y, genotype g)

            AA  AG  GG   model              DF
Recessive    0   0   1   y ~ recessive(g)    1
Dominant     0   1   1   y ~ dominant(g)     1
Additive     0   1   2   y ~ additive(g)     1
Genotype     0   a   b   y ~ genotype(g)     2

Page 21: Linear Modelling III

Data Transformation

• g <- t$m1
• Use these functions to treat a genotype vector in a certain way:
  – a <- additive(g)
  – r <- recessive(g)
  – d <- dominant(g)
  – g <- genotype(g)
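These helper functions come from logistic.R, which is not reproduced in these slides. Purely as a sketch of the idea, and assuming genotypes are coded "11", "12", "22" as in the contingency table above, additive() and recessive() might look something like:

# Hypothetical sketch only: convert a genotype vector coded "11"/"12"/"22"
# into numeric scores for the additive and recessive models
additive <- function(g) {
  counts <- c("11" = 0, "12" = 1, "22" = 2)   # number of '2' alleles
  unname(counts[as.character(g)])             # missing genotypes stay NA
}
recessive <- function(g) {
  ifelse(as.character(g) == "22", 1, 0)       # 1 only for the 22 homozygote
}
# dominant() would score "12" and "22" as 1; genotype() would return factor(g),
# so that glm() fits one parameter per genotype class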

Page 22: Linear Modelling III

Fitting the Model

• afit <- glm( t$y ~ additive(g), family='binomial')
• rfit <- glm( t$y ~ recessive(g), family='binomial')
• dfit <- glm( t$y ~ dominant(g), family='binomial')
• gfit <- glm( t$y ~ genotype(g), family='binomial')

• Equivalent models:
  – genotype = dominant + recessive
  – genotype = additive + recessive
  – genotype = additive + dominant
  – genotype ~ standard chi-squared test of genotype association

Page 23: Linear Modelling III

Parameter Estimates

> summary(glm( t$y ~ genotype(t$m31), family='binomial'))

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)
b0 (Intercept)         -2.9120     0.1941 -15.006   <2e-16 ***
b1 genotype(t$m31)12   -0.6486     0.3621  -1.791   0.0733 .
b2 genotype(t$m31)22   -0.7124     0.7423  -0.960   0.3372
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>

Page 24: Linear Modelling III

Analysis of Deviance: Chi-Squared Test

> anova(glm( t$y ~ genotype(t$m31), family='binomial'), test="Chisq")
Analysis of Deviance Table

Model: binomial, link: logit

Response: t$y

Terms added sequentially (first to last)

                 Df Deviance Resid. Df Resid. Dev
NULL                              1017     343.71
genotype(t$m31)   2     3.96      1015     339.76
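The deviance change of 3.96 on 2 degrees of freedom can be converted to a p-value directly; this is the same calculation anova() performs when test="Chisq" is supplied:

pchisq(3.96, df = 2, lower.tail = FALSE)   # upper tail of chi-squared with 2 df: approximately 0.14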

Page 25: Linear Modelling III

Model Comparison

• Compare the general genotype model with the additive, dominant or recessive models:

> afit <- glm(t$y ~ additive(t$m20))
> gfit <- glm(t$y ~ genotype(t$m20))
> anova(afit,gfit)
Analysis of Deviance Table

Model 1: t$y ~ additive(t$m20)
Model 2: t$y ~ genotype(t$m20)
  Resid. Df Resid. Dev Df Deviance
1      1016     38.301
2      1015     38.124  1    0.177
>

Page 26: Linear Modelling III

Scanning all Markers

> logscan(t, model='additive')
        Deviance DF         Pval      LogPval
m1  8.604197e+00  1 3.353893e-03  2.474450800
m2  7.037336e+00  1 7.982767e-03  2.097846522
m3  6.603882e-01  1 4.164229e-01  0.380465360
m4  3.812860e+00  1 5.086054e-02  1.293619014
m5  7.194936e+00  1 7.310960e-03  2.136025588
m6  2.449127e+00  1 1.175903e-01  0.929628598
m7  2.185613e+00  1 1.393056e-01  0.856031566
m8  1.227191e+00  1 2.679539e-01  0.571939852
m9  2.532562e+01  1 4.842353e-07  6.314943565
m10 5.729634e+01  1 3.748518e-14 13.426140380
m11 3.107441e+01  1 2.483233e-08  7.604982503
………
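logscan() is also supplied by logistic.R and is not shown here. A minimal sketch of such a scan, assuming marker columns named m1, m2, … and the additive() helper above, could loop over the markers like this:

# Hypothetical sketch of a marker scan: fit each marker in turn and record
# the deviance chi-squared test against the null model
scan_markers <- function(t) {
  markers <- grep("^m[0-9]+$", names(t), value = TRUE)
  out <- t(sapply(markers, function(m) {
    fit <- glm(t$y ~ additive(t[[m]]), family = "binomial")
    a   <- anova(fit, test = "Chisq")
    dev <- a$Deviance[2]                 # deviance explained by the marker
    df  <- a$Df[2]
    p   <- pchisq(dev, df, lower.tail = FALSE)
    c(Deviance = dev, DF = df, Pval = p, LogPval = -log10(p))
  }))
  as.data.frame(out)
}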

Page 27: Linear Modelling III

Multilocus Models

• Can test the effects of fitting two or more markers simultaneously
• Several multilocus models are possible
• The Interaction Model assumes that each combination of genotypes has a different effect
• eg t$y ~ t$m10 * t$m15

Page 28: Linear Modelling III

Multi-Locus Models

> f <- glm( t$y ~ genotype(t$m13) * genotype(t$m26), family='binomial')
> anova(f, test="Chisq")
Analysis of Deviance Table

Model: binomial, link: logit

Response: t$y

Terms added sequentially (first to last)

                                Df Deviance Resid. Df Resid. Dev
NULL                                             1017     343.71
genotype(t$m13)                  2   108.68      1015     235.03
genotype(t$m26)                  2     1.14      1013     233.89
genotype(t$m13):genotype(t$m26)  3     6.03      1010     227.86

> pchisq(6.03, 3, lower.tail=F)    # calculate the p-value for the interaction term
[1] 0.1101597

Page 29: Linear Modelling III

Adding the effects of Sex and other Covariates

• Read in sex and other covariate data (e.g. age) from a file into variables, say a$sex and a$age
• Fit models of the form
  – fit1 <- glm(t$y ~ additive(t$m10) + a$sex + a$age, family='binomial')
  – fit2 <- glm(t$y ~ a$sex + a$age, family='binomial')

Page 30: Linear Modelling III

Adding the effects of Sex and other Covariates

• Compare the models using anova – this tests whether the effect of marker m10 is significant after taking sex and age into account

• anova(fit1, fit2, test='Chisq')

Page 31: Linear Modelling III

Generalised Linear Models

• Applies linear modelling framework to data that are not normally distributed

• Can be used for probability distributions from the exponential family:
  – f(y|θ) = h(y) exp( η(θ)·T(y) - A(θ) )
• The expected value of y is a transformed version of a linear predictor
  – E(y) = g⁻¹(Xb)
  – g() is called the link function
• Examples include logistic regression and survival analysis
• In R, GLMs are fitted using glm() and models are compared using the anova() function
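A small simulated illustration (not the course data) of the link relationship E(y) = g⁻¹(Xb) for a binomial GLM, where g is the logit and g⁻¹ the inverse logit:

# Simulated illustration of E(y) = g^-1(Xb) with the logit link
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-1 + 2 * x))    # true model on the logit scale
fit <- glm(y ~ x, family = "binomial")
eta <- predict(fit, type = "link")         # linear predictor Xb
mu  <- predict(fit, type = "response")     # fitted probabilities g^-1(Xb)
all.equal(mu, plogis(eta))                 # TRUE: response scale = inverse logit of link scale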

Page 32: Linear Modelling III

Random Effects and Mixed Models

• So far our models have had fixed effects
  – each parameter can take any value independently of the others
  – each parameter estimate uses up a degree of freedom
  – models with large numbers of parameters have problems
    • saturation and overfitting
    • poor predictive properties
    • numerically unstable and difficult to fit
• In some cases we can treat parameters as being sampled from a distribution – random effects
  – only estimate the parameters needed to specify that distribution
  – can result in a more robust model

Page 33: Linear Modelling III

Example of a Mixed Model

• Testing genetic association across a large number of families
  – yi = trait value of the i'th individual
  – bi = genotype of the individual at the SNP of interest
  – fi = family identifier (factor)
  – ei = error, with variance σ²
  – y = μ + b + f + e
  – If we treat the family effects f as fixed then we must estimate a large number of parameters
  – Better to think of these effects as having a distribution N(0, τ²), for some variance τ² which must be estimated
    • individuals from the same family have the same f and are correlated
      – cov = τ²
    • individuals from different families are uncorrelated
  – the genotype parameters b are still treated as fixed effects
    • this is a mixed model

Page 34: Linear Modelling III

Mixed Models

• y = μ + b + f + e

• Fixed effects model
  – E(y) = μ + b + f
  – Var(y) = Iσ²
    • I is the identity matrix

• Mixed model
  – E(y) = μ + b
  – Var(y) = Iσ² + Fτ²
    • F is a matrix with Fij = 1 if i and j are in the same family
    • need to estimate both σ² and τ²
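A minimal sketch (hypothetical family labels and made-up variance components) of what Var(y) = Iσ² + Fτ² looks like as a matrix:

# Build the family-sharing matrix F and the implied covariance matrix
fam <- c("f1", "f1", "f2", "f2", "f2", "f3")
F <- outer(fam, fam, "==") * 1          # Fij = 1 if individuals i and j share a family
sigma2 <- 0.16; tau2 <- 0.07            # example variance components
V <- diag(length(fam)) * sigma2 + F * tau2
V                                       # covariance tau2 within families, 0 between families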

Page 35: Linear Modelling III

Generalised Least Squares

• y = Xb + e
• Var(y) = V (a symmetric covariance matrix)
  – V = Iσ² (uncorrelated errors)
  – V = Iσ² + Fτ² (single grouping random effect)
  – V = Iσ² + Gτ² (G = genotype identity by state)
• GLS solution, if V is known:
  – b̂ = (X'V⁻¹X)⁻¹ X'V⁻¹y
• The GLS estimate is
  – unbiased
  – efficient (minimum variance)
  – consistent (tends to the true value in large samples)
  – asymptotically normally distributed
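A small simulated sketch of the GLS solution b̂ = (X'V⁻¹X)⁻¹X'V⁻¹y (illustrative only; here V is known by construction):

# Generalised least squares estimate with a known covariance matrix V
set.seed(2)
n <- 100
X <- cbind(1, rnorm(n))                        # intercept plus one covariate
V <- diag(n) * 0.5                             # uncorrelated errors, sigma^2 = 0.5
y <- X %*% c(1, 2) + rnorm(n, sd = sqrt(0.5))
Vinv  <- solve(V)
b.hat <- solve(t(X) %*% Vinv %*% X, t(X) %*% Vinv %*% y)
b.hat                                          # close to the true values (1, 2)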

Page 36: Linear Modelling III

Multivariate Normal Distribution

• Joint distribution of a vector of correlated observations
• Another way of thinking about the data
• Estimation of parameters in a mixed model is a special case of likelihood analysis of an MVN
• y ~ MVN(m, V)
  – m is the vector of expected values
  – V is the variance-covariance matrix
  – If V = Iσ² then the elements of y are uncorrelated and equivalent to n independent Normal variables
  – The probability density / likelihood is
    L(y) = (2π)^(-n/2) |V|^(-1/2) exp( -(y - m)' V⁻¹ (y - m) / 2 )
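The corresponding log-likelihood can be written out directly in R (a sketch, without using any package):

# Multivariate normal log-likelihood:
# log L = -0.5 * ( n*log(2*pi) + log|V| + (y - m)' V^-1 (y - m) )
mvn.loglik <- function(y, m, V) {
  n      <- length(y)
  logdet <- as.numeric(determinant(V, logarithm = TRUE)$modulus)
  quad   <- as.numeric(t(y - m) %*% solve(V) %*% (y - m))
  -0.5 * (n * log(2 * pi) + logdet + quad)
}
# e.g. with the family structure above: mvn.loglik(y, rep(mu, n), diag(n)*sigma2 + F*tau2)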

Page 37: Linear Modelling III

Henderson’s Mixed Model Equations

• General linear mixed model: y = Xb + Zu + e
  – b are fixed effects
  – u are random effects, with u ~ N(0, G)
  – e are residual errors, with e ~ N(0, R)
  – X, Z are design matrices
• Henderson's equations give the estimates b̂ and predictions û as the solution of
  – X'R⁻¹X b̂ + X'R⁻¹Z û = X'R⁻¹y
  – Z'R⁻¹X b̂ + (Z'R⁻¹Z + G⁻¹) û = Z'R⁻¹y

Page 38: Linear Modelling III

Mixed Models

• Fixed effects models are special cases of mixed models

• Mixed models are sometimes more powerful (as they have fewer parameters)
  – But check that the assumption that the effects are sampled from a Normal distribution is reasonable
  – Estimated parameters and statistical significance can differ between fixed and mixed models
• Shrinkage: random effect estimates from a mixed model often have smaller magnitudes than the corresponding fixed effects estimates

Page 39: Linear Modelling III

Mixed Models in R

• Several R packages are available, eg
  – lme4: a general purpose mixed models package
  – emma: for genetic association in structured populations
• Formulas in lme4, eg a test for association between genotype and trait:
  – H0: E(y) = μ   vs   H1: E(y) = μ + b
  – in both cases Var(y) = Iσ² + Fτ²
  – h0 = lmer(y ~ 1 + (1|family), data=data)
  – h1 = lmer(y ~ genotype + (1|family), data=data)
  – anova(h0, h1)

Page 40: Linear Modelling III

Formulas in lmer()

• y ~ fixed.effects + (1|Group1) + (1|Group2) etc
  – random intercept models
• y ~ fixed.effects + (1|Group1) + (x|Group1)
  – random slope model
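For example (hypothetical data frame d with columns y, x and Group1), the two kinds of model could be fitted and compared as follows; note that (x|Group1) in lme4 fits correlated random intercepts and slopes:

library(lme4)
m.int   <- lmer(y ~ x + (1 | Group1), data = d)    # random intercept per group
m.slope <- lmer(y ~ x + (x | Group1), data = d)    # random intercept and slope per group
anova(m.int, m.slope)                              # compare the two random-effect structures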

Page 41: Linear Modelling III

Example: HDL Data

> library(lme4)
> b = read.delim("Biochemistry.txt")
> cc = complete.cases(b$Biochem.HDL)
> f0 = lm(Biochem.HDL ~ Family, data=b, subset=cc)
> f1 = lmer(Biochem.HDL ~ (1|Family), data=b, subset=cc)

> anova(f0)
Analysis of Variance Table

Response: Biochem.HDL
            Df Sum Sq Mean Sq F value    Pr(>F)
Family     175 147.11 0.84064  5.2098 < 2.2e-16 ***
Residuals 1681 271.24 0.16136
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> f1
Linear mixed model fit by REML
Formula: Biochem.HDL ~ (1 | Family)
   Data: b
 Subset: cc
  AIC  BIC logLik deviance REMLdev
 2161 2178  -1078     2149    2155
Random effects:
 Groups   Name        Variance Std.Dev.
 Family   (Intercept) 0.066633 0.25813
 Residual             0.161487 0.40185
Number of obs: 1857, groups: Family, 176

Fixed effects:
            Estimate Std. Error t value
(Intercept)  1.60615    0.02263   70.97

> plot(resid(f0), resid(f1))