Logistic regression analysis - staff.pubhealth.ku.dk

Post on 02-Oct-2021

6 views 0 download

Transcript of Logistic regression analysis - staff.pubhealth.ku.dk

Logistic regression analysis

Thomas Alexander Gerds

Department of Biostatistics, University of Copenhagen

1 / 51

Carpenter et al.

2 / 51

Carpenter et al.

3 / 51

Regression

The type of the outcome variable determines which kind of modelis relevant:

Quantitative (continuous) outcome

I Linear regresssionI Association parameters: differences between mean values

0-1 (binary) outcome

I Logistic regressionI Association parameters: odds ratio, differences between

log(odds)

4 / 51

Categorical explanatory variable

Group 1, . . . , K (especially binary: K=2)

Linear regression, continuous outcome Y

mean(Y|group k) - mean(Y|reference group)

E.g., the average systolic blood pressure was higher in malescompared to females

Logistic regression, binary outcome

Odds ratio =odds(group k)

odds(reference group)

E.g., the risk (odds) of coronary heart disease was higher in malescompared to females

5 / 51

Quantitative (continuous) explanatory variables

Linear regression, continuous outcome YDifferences in mean values per unit of X:

mean(Y|x+1)-mean(Y|x)

E.g., the average systolic blood pressure increased with age

Logistic regression, binary outcomeDifferences in log(odds) per unit of X

Odds ratio =odds(x+1)odds(x)

E.g., the risk (odds) of coronary heart disease increased with age

6 / 51

Linearity in regression models

For a continuous explanatory variable X, linearity means that theeffect of a unit change of X on the outcome does not depend onthe value of X .

Linear regression, continuous outcome Y

mean(Y |45+ 1)−mean(Y |45) = mean(Y |46+ 1)−mean(Y |46)= · · · = mean(Y |61+ 1)−mean(Y |61)

Logistic regression, binary outcome

odds(45+1)odds(45)

=odds(46+1)odds(46)

= · · · = odds(61+1)odds(61)

Linearity is a model assumption which should be investigated.

7 / 51

Binary outcome regressionIf the outcome variable is binary:

Yi =

{1 if i is diseased0 if i is not diseased

then linear regressionYi = β0 + β1Xi

is not good.

The regression line will gobelow 0 and above 1.

● ●● ●● ●●●

● ●●●

●●● ●

●● ●● ●●● ● ●● ●● ●● ●● ● ●● ●● ●● ● ●

●●●●●● ●● ●

●● ●● ● ●● ●●● ●● ●

●● ●●●

●● ●●●● ● ●● ●● ● ●●

●●

●● ●● ●● ●●● ●●●

● ●● ● ●

●●●

●●

●●●● ●●

● ●

●● ● ●●●●

●●● ●●● ●● ●● ● ●● ●●● ●●●

● ● ●●● ●●

● ●●

●●● ●● ●●● ●● ●●

●●

● ●

● ●● ● ●●●

● ●

●●

● ●●● ●

●●●

●●●● ●● ● ●● ●●

● ●●

●●

●●

●● ● ● ●● ● ●

● ●

● ●● ●● ●●

● ●●

●●●● ●●

● ● ●●●● ●●

●●● ●●● ●

● ●●

● ●● ●

●●●●

● ●

●●●

● ●● ●●●

●●● ●● ●

●●

●●

● ●●●●●

●● ●

● ●●●● ●

●●●

●●●

● ●

● ●● ●●

●●● ●● ●

●●●● ●●

●●● ●●

● ●●●●

●●●

●●

● ● ● ●● ● ●●● ●●●●

● ● ●● ●● ●● ● ●● ●● ●● ● ●●● ●●●●●● ●

● ●●● ●●

●● ●● ● ●● ●●●●● ●● ●●●●● ● ● ●●

●●

●● ●

●● ●●

●●●● ● ●

● ●●

● ●

● ●●

● ●●

●● ●

●● ●●● ●

●● ●● ●●

● ●●

●●

● ●

● ● ● ●●

●● ●●● ●●●● ● ●●● ●● ●

● ●●●●

●●

● ●

●● ●

●● ●●● ●● ●● ●● ● ●●● ● ●●●

●●

● ●● ●

●● ● ●●●

●●

●●● ● ●● ●●

● ●● ● ●●● ●● ●● ●●●

●● ● ●● ●●● ●

● ●●

● ●● ●●

●● ●

●● ●●

●●

● ●●

● ●● ● ●●● ●

●●● ●●●● ● ●

●●

● ●

●●

●● ●

●● ● ●● ●● ●● ●● ●● ● ●●● ●● ●●● ●

● ●● ●● ●●●●

●●●● ●

● ●

● ●

●●●

●● ●●

● ● ●●● ●

● ●● ●●

●●●

●●

●● ●●

●●

●● ● ●●●● ● ●●●● ●● ●●● ●

● ●

● ● ●

● ●

●● ●

●●● ●●●

● ●

●●

● ●● ● ● ●● ●● ●●● ●●● ● ●●●● ● ●

● ●

● ●● ● ●

● ●●

●●●

●●●● ● ●●●●

● ●●●

●● ● ●● ●● ● ●●

● ●

●●●

●● ● ●● ●●●●

●●●

● ●● ●●

●●●

●● ●●● ●

●●

● ● ●●● ●● ●

●● ●● ● ●● ●● ●●● ●

● ●●●● ●●● ●

● ●

● ●

●● ● ●●●● ● ●

●● ●●

●● ●●

●● ●● ●●● ●●● ●●

●● ●●

● ●

●●● ● ●

●● ●● ●● ●●●●● ● ●●●

●● ●● ● ●●● ●●●●

● ●●

●●●

●●●●

●●

●● ●● ●● ●●

●●

●● ●●● ●● ●

● ● ●●● ●●

● ●●

●●

● ●● ●● ●

● ●● ●●● ●

●●● ●●●● ●

● ●●●

●●●

● ●●

●● ● ●

●● ●● ● ● ●●●

● ●● ● ●● ● ● ●● ●●● ●● ●●● ●● ●●●

●●●

● ●

●● ●● ●●● ●

● ●● ●●●● ●● ●●

●●

● ●

● ● ●●

● ●●● ●●● ●● ● ●● ●●● ●

● ● ● ●●● ●●

●●●

●●●●

● ●●● ●● ● ●●●●

●●

●●

●●

●●

● ●● ●

● ●●

●●

●●● ●● ●● ● ●●

● ●

● ●● ●

●●

●●

●●● ●● ● ●● ●● ●●●●● ●●●

●● ● ●●

● ●

●● ●

−25 %

0 %

25 %

50 %

75 %

100 %

125 %

20 40 60 80

8 / 51

Binary outcome regressionIf the outcome variable is binary:

Yi =

{1 if i is diseased0 if i is not diseased

then linear regressionYi = β0 + β1Xi

is not good.

The regression line will gobelow 0 and above 1.

● ●● ●● ●●●

● ●●●

●●● ●

●● ●● ●●● ● ●● ●● ●● ●● ● ●● ●● ●● ● ●

●●●●●● ●● ●

●● ●● ● ●● ●●● ●● ●

●● ●●●

●● ●●●● ● ●● ●● ● ●●

●●

●● ●● ●● ●●● ●●●

● ●● ● ●

●●●

●●

●●●● ●●

● ●

●● ● ●●●●

●●● ●●● ●● ●● ● ●● ●●● ●●●

● ● ●●● ●●

● ●●

●●● ●● ●●● ●● ●●

●●

● ●

● ●● ● ●●●

● ●

●●

● ●●● ●

●●●

●●●● ●● ● ●● ●●

● ●●

●●

●●

●● ● ● ●● ● ●

● ●

● ●● ●● ●●

● ●●

●●●● ●●

● ● ●●●● ●●

●●● ●●● ●

● ●●

● ●● ●

●●●●

● ●

●●●

● ●● ●●●

●●● ●● ●

●●

●●

● ●●●●●

●● ●

● ●●●● ●

●●●

●●●

● ●

● ●● ●●

●●● ●● ●

●●●● ●●

●●● ●●

● ●●●●

●●●

●●

● ● ● ●● ● ●●● ●●●●

● ● ●● ●● ●● ● ●● ●● ●● ● ●●● ●●●●●● ●

● ●●● ●●

●● ●● ● ●● ●●●●● ●● ●●●●● ● ● ●●

●●

●● ●

●● ●●

●●●● ● ●

● ●●

● ●

● ●●

● ●●

●● ●

●● ●●● ●

●● ●● ●●

● ●●

●●

● ●

● ● ● ●●

●● ●●● ●●●● ● ●●● ●● ●

● ●●●●

●●

● ●

●● ●

●● ●●● ●● ●● ●● ● ●●● ● ●●●

●●

● ●● ●

●● ● ●●●

●●

●●● ● ●● ●●

● ●● ● ●●● ●● ●● ●●●

●● ● ●● ●●● ●

● ●●

● ●● ●●

●● ●

●● ●●

●●

● ●●

● ●● ● ●●● ●

●●● ●●●● ● ●

●●

● ●

●●

●● ●

●● ● ●● ●● ●● ●● ●● ● ●●● ●● ●●● ●

● ●● ●● ●●●●

●●●● ●

● ●

● ●

●●●

●● ●●

● ● ●●● ●

● ●● ●●

●●●

●●

●● ●●

●●

●● ● ●●●● ● ●●●● ●● ●●● ●

● ●

● ● ●

● ●

●● ●

●●● ●●●

● ●

●●

● ●● ● ● ●● ●● ●●● ●●● ● ●●●● ● ●

● ●

● ●● ● ●

● ●●

●●●

●●●● ● ●●●●

● ●●●

●● ● ●● ●● ● ●●

● ●

●●●

●● ● ●● ●●●●

●●●

● ●● ●●

●●●

●● ●●● ●

●●

● ● ●●● ●● ●

●● ●● ● ●● ●● ●●● ●

● ●●●● ●●● ●

● ●

● ●

●● ● ●●●● ● ●

●● ●●

●● ●●

●● ●● ●●● ●●● ●●

●● ●●

● ●

●●● ● ●

●● ●● ●● ●●●●● ● ●●●

●● ●● ● ●●● ●●●●

● ●●

●●●

●●●●

●●

●● ●● ●● ●●

●●

●● ●●● ●● ●

● ● ●●● ●●

● ●●

●●

● ●● ●● ●

● ●● ●●● ●

●●● ●●●● ●

● ●●●

●●●

● ●●

●● ● ●

●● ●● ● ● ●●●

● ●● ● ●● ● ● ●● ●●● ●● ●●● ●● ●●●

●●●

● ●

●● ●● ●●● ●

● ●● ●●●● ●● ●●

●●

● ●

● ● ●●

● ●●● ●●● ●● ● ●● ●●● ●

● ● ● ●●● ●●

●●●

●●●●

● ●●● ●● ● ●●●●

●●

●●

●●

●●

● ●● ●

● ●●

●●

●●● ●● ●● ● ●●

● ●

● ●● ●

●●

●●

●●● ●● ● ●● ●● ●●●●● ●●●

●● ● ●●

● ●

●● ●

−25 %

0 %

25 %

50 %

75 %

100 %

125 %

20 40 60 80

8 / 51

(Multiple) logistic regression

We denote the probability of the event Yi = 1 for a subject withexplanatory variables Xi , Zi , . . . as

P(Yi = 1|Xi ,Zi , . . . ) = pi .

The idea is to use the logit function. Instead of pi which isbounded between 0 and 1 we apply linear regression to log(odds):

logit(pi ) = log

(pi

1− pi

)= a+ b1Zi + b2Xi + . . .

log(

pi1−pi

)can take both negative and positive values.

9 / 51

Warm-up exercisesComplete the following table

pi oddsi logit(pi )

0.001%-7.0-4.5

2.1%8%

50%3.8

99%11.5

Hint: the following functions take a vector as argument

logit <- function(p){log(p/(1 - p))}expit <- function(x){exp(x)/(1 + exp(x))}

10 / 51

Example: Framingham study

I SEX: 1 for males, 2 for femalesI AGE: age (years) at baseline (45-62)I FRW: "Framingham relative weight" (pct.) at baseline

(52-222; 11 persons have missing values)I SBP: systolic blood pressure at baseline (mmHg) (90-300)I DBP: diastolic blood pressure at baseline (mmHg) 50-160)I CHOL: cholesterol at baseline (mg/100ml) (96-430)I CIG: cigarettes per day at baseline (0-60; 1 person has missing

value)I CHD: 0 if no "coronary heart disease" during follow-up, 1 if

"coronary heart disease" at baseline (prevalent cases), x=2-10if "coronary heart disease" was diagnosed at follow-up no.x

11 / 51

Framingham study: data preparationlibrary(data.table)framingham <- fread("data/Framingham.csv")

## remove prevalent casesframingham <- framingham[CHD!=1,]

## define factor levels/labelsframingham[,Smoke:=factor(CIG>0,levels=c(FALSE,TRUE),labels=c("No","Yes"))]

framingham[,Sex:=factor(SEX,levels=c(1,2),labels=c("Male","Female"))]

## define binary outcome variableframingham[,Y:=factor(CHD>1,levels=c(FALSE,TRUE),labels=c("no

CHD","CHD"))]framingham

ID SEX AGE FRW SBP DBP CHOL CIG CHD Smoke Sex Y1: 1070 2 45 93 100 62 220 0 0 No Female no CHD2: 1081 1 48 93 108 70 340 0 0 No Male no CHD3: 1123 2 45 91 160 100 171 0 0 No Female no CHD4: 1215 1 50 110 110 70 224 0 0 No Male no CHD5: 1267 1 48 85 110 70 229 25 0 Yes Male no CHD

---1359: 6432 1 47 113 155 105 175 5 5 Yes Male CHD1360: 6434 1 59 98 124 84 227 20 2 Yes Male CHD1361: 6437 2 55 111 108 74 231 0 0 No Female no CHD1362: 6440 1 49 114 110 80 218 20 0 Yes Male no CHD1363: 6442 2 51 95 152 90 199 1 0 Yes Female no CHD

12 / 51

Framingham outcome

i = subject number: 1, . . . , 1406

Xi = age of subject i

Zi = gender of subject i

Vi = smoking status of subject i

Yi =

{1 subject i develops coronary heart diseased (CHD)0 subject i does not develop CHD

pi = P(Yi = 1|Xi ,Vi ,Zi , ...) = probability of CHD of subject i

pi(1− pi )

= odds of CHD of subject i

13 / 51

A binary explanatory variable

Zi =

{1 if i is a man0 if i is a woman

Simple logistic regression:

log

(pi

1− pi

)= a+ bZi =

{a females

a+ b males

That means,

b = (a+ b)− a = log(odds for males)− log(odds for females)

= log

(Odds for malesOdds for females

)and

−b = a− (a+ b) = log

(Odds for femalesOdds for males

)

14 / 51

A binary explanatory variable

Zi =

{1 if i is a man0 if i is a woman

Simple logistic regression:

log

(pi

1− pi

)= a+ bZi =

{a females

a+ b males

That means,

b = (a+ b)− a = log(odds for males)− log(odds for females)

= log

(Odds for malesOdds for females

)and

−b = a− (a+ b) = log

(Odds for femalesOdds for males

)14 / 51

Exercise: 2 by 2 contingency table

framingham[,table(Sex,Y)]

no CHD CHDMale 479 164Female 616 104

I use the tools for 2x2 tablesI compute the odds ratio with 95% confidence limits and

corresponding p-valueI report and interprete the result in a sentence

15 / 51

Logistic regression in R

fit1 <- glm(Y∼Sex, data=framingham, family=binomial)

I Y ∼ Sex tells R that Y is the outcome and Sex the explanatoryvariable

I data=framingham tells R where to find Y and SexI glm means generalized linear modelI family=binomial tells R that the outcome is binary and the

logit link should be used

16 / 51

Logistic regression in Rfit1 <- glm(Y∼Sex,data=framingham,family=binomial)summary(fit1)

Call:glm(formula = Y ~ Sex, family = binomial, data = framingham)

Deviance Residuals:Min 1Q Median 3Q Max

-0.7674 -0.7674 -0.5586 -0.5586 1.9672

Coefficients:Estimate Std. Error z value Pr(>|z|)

(Intercept) -1.07183 0.09047 -11.847 < 2e-16 ***SexFemale -0.70702 0.13937 -5.073 0.000000392 ***---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1351.2 on 1362 degrees of freedomResidual deviance: 1324.9 on 1361 degrees of freedomAIC: 1328.9

Number of Fisher Scoring iterations: 4

17 / 51

Confidence intervals for the odds ratio

library(Publish)fit1 <- glm(Y∼Sex,data=framingham,family=binomial)publish(fit1)

Variable Units OddsRatio CI.95 p-valueSex Male 1.00 [1.00;1.00] 1

Female 0.49 [0.38;0.65] <0.0001

Note : 0.49 = exp(−0.71)

Women have a significantly lower risk to develop coronary heart diseasethan men (odds ratio: 0.49, 95%-CI: [0.38; 0.65], p-value <0.0001).

18 / 51

Confidence intervals for the odds ratio

library(Publish)fit1 <- glm(Y∼Sex,data=framingham,family=binomial)publish(fit1)

Variable Units OddsRatio CI.95 p-valueSex Male 1.00 [1.00;1.00] 1

Female 0.49 [0.38;0.65] <0.0001

Note : 0.49 = exp(−0.71)

Women have a significantly lower risk to develop coronary heart diseasethan men (odds ratio: 0.49, 95%-CI: [0.38; 0.65], p-value <0.0001).

18 / 51

Changing the reference level

framingham[,sex:=relevel(Sex,"Female")]fit1a <- glm(Y∼sex,data=framingham,family=binomial)publish(fit1a)

Variable Units OddsRatio CI.95 p-valuesex Female 1.00 [1.00;1.00] 1

Male 2.03 [1.54;2.66] <0.0001

Note : 2.03 = exp(0.71)

Men have a significantly higher risk to develop coronary heart disease thanwomen (odds ratio: 2.03, 95%-CI: [1.5; 2.7], p-value <0.0001).

19 / 51

Changing the reference level

framingham[,sex:=relevel(Sex,"Female")]fit1a <- glm(Y∼sex,data=framingham,family=binomial)publish(fit1a)

Variable Units OddsRatio CI.95 p-valuesex Female 1.00 [1.00;1.00] 1

Male 2.03 [1.54;2.66] <0.0001

Note : 2.03 = exp(0.71)

Men have a significantly higher risk to develop coronary heart disease thanwomen (odds ratio: 2.03, 95%-CI: [1.5; 2.7], p-value <0.0001).

19 / 51

Two explanatory variables:

Zi =

{1 if i male0 female

and Vi =

{1 if i smokes0 otherwise

Data can be summarized as two 2 by 2 tables in two ways

Males (Z=1) Females (Z=0)V = 0 V = 1 V = 0 V = 1

Y = 0 191 288 Y = 0 423 192Y = 1 57 107 Y = 1 77 27

Smokers (V = 1) Non-smokers (V = 0)Males Females Males Females

Y = 0 288 192 Y = 0 191 423Y = 1 107 27 Y = 1 57 77

20 / 51

Cochran-Mantel-Haenszel test

In this way, we can study the effect of smoking adjusted for sex:

ORMantel-Haenszel = 0.97; p > 0.05

and also study the effect of Sex adjusted for smoking:

ORMantel-Haenszel = 2.03; p < 0.05

ConclusionsI there is no significant effect of smoking adjusted for sexI there is a significant effect of sex adjusted for smoking

21 / 51

Logistic regression model: two binary variables

log(

pi1−pi

)= a+ b1Zi + b2Vi

=

a Female non-smokera+ b1 Male non-smokera+ b2 Female smokera+ b1 + b2 Male smoker

Note: b1 = (a+ b1)− a= (a+ b1 + b2)− (a+ b2)

= logOR (males vs. females for given smoking status)

and b2 = (a+ b2)− a= (a+ b1 + b2)− (a+ b1)

= logOR (smokers vs. non-smokers for given gender)

22 / 51

Logistic regression resultsfit2=glm(Y∼Sex+Smoke,data=framingham,family=binomial)summary(fit2)

Call:glm(formula = Y ~ Sex + Smoke, family = binomial, data = framingham)

Deviance Residuals:Min 1Q Median 3Q Max

-0.7716 -0.7607 -0.5564 -0.5564 1.9708

Coefficients:Estimate Std. Error z value Pr(>|z|)

(Intercept) -1.09215 0.12717 -8.588 < 2e-16 ***SexFemale -0.69521 0.14635 -4.750 0.00000203 ***SmokeYes 0.03296 0.14457 0.228 0.82---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1350.8 on 1361 degrees of freedomResidual deviance: 1324.5 on 1359 degrees of freedom

(1 observation deleted due to missingness)AIC: 1330.5

Number of Fisher Scoring iterations: 4

23 / 51

Extracting odds ratios with confidence intervals

publish(fit2,intercept=TRUE)

Variable Units Missing OddsRatio CI.95 p-value(Intercept) 0.34 [0.26;0.43] <0.0001Sex Male 0 Ref

Female 0.50 [0.37;0.66] <0.0001Smoke No 1 Ref

Yes 1.03 [0.78;1.37] 0.8196

Logistic regression adjusted for smoking status showed a decreasein odds of CHD of 50% (CI-95%: [37%;66%]) in women comparedto men (p<0.0001).Exercise: Based on this model, compute the risk of CHD for anon-smoking woman, a non-smoking man, a smoking woman and asmoking man.

24 / 51

Extracting odds ratios with confidence intervals

publish(fit2,intercept=TRUE)

Variable Units Missing OddsRatio CI.95 p-value(Intercept) 0.34 [0.26;0.43] <0.0001Sex Male 0 Ref

Female 0.50 [0.37;0.66] <0.0001Smoke No 1 Ref

Yes 1.03 [0.78;1.37] 0.8196

Logistic regression adjusted for smoking status showed a decreasein odds of CHD of 50% (CI-95%: [37%;66%]) in women comparedto men (p<0.0001).

Exercise: Based on this model, compute the risk of CHD for anon-smoking woman, a non-smoking man, a smoking woman and asmoking man.

24 / 51

Extracting odds ratios with confidence intervals

publish(fit2,intercept=TRUE)

Variable Units Missing OddsRatio CI.95 p-value(Intercept) 0.34 [0.26;0.43] <0.0001Sex Male 0 Ref

Female 0.50 [0.37;0.66] <0.0001Smoke No 1 Ref

Yes 1.03 [0.78;1.37] 0.8196

Logistic regression adjusted for smoking status showed a decreasein odds of CHD of 50% (CI-95%: [37%;66%]) in women comparedto men (p<0.0001).Exercise: Based on this model, compute the risk of CHD for anon-smoking woman, a non-smoking man, a smoking woman and asmoking man.

24 / 51

Simple logistic regression: categorical explanatory variable:

Categorize age into 4 intervals:

45-48, 49-52, 53-56, 57-62

Summarize in 2 by 4 table

X = 0 X = 1 X = 2 X = 345-48 49-52 53-56 57-62

Y = 0 308 298 254 235 1095Y = 1 51 61 64 92 268

359 359 318 327 1363

(Note: both males and females)

25 / 51

ANOVA: χ2 test

We may test whether the risk of CHD differs between the 4 agegroups using a chi-square test statistic - in this case with 3 degreesof freedom:

Null hypothesis:

Odds(age45− 48) = Odds(age49− 52)= Odds(age53− 56) = Odds(age57−62)

∑ (OBS − EXP)2

EXP= 23.29 ∼ χ2

3, P < 0.001

Conclusion: CHD-risk differs significantly between the age groups.

26 / 51

Logistic regression: categorical variable with 4 levels:

log

(pi

1− pi

)=

a age 45− 48

a+ b1 age 49− 52a+ b2 age 53− 56a+ b3 age 57− 62

Reference category 45-48

a = log (Odds(45− 48))

b1 = log

(Odds(49− 52)Odds(45− 48)

)b2 = log

(Odds(53− 56)Odds(45− 48)

)b3 = log

(Odds(57− 62)Odds(45− 48)

)

27 / 51

Resultsframingham[,AgeCut:=cut(AGE,

c(40,48,52,56,99),labels=c("45-48","49-52","53-56","57-62"))]

fit3=glm(Y∼AgeCut,data=framingham,family=binomial)publish(fit3,intercept=1L)

Variable Units OddsRatio CI.95 p-value(Intercept) 0.17 [0.12;0.22] < 0.0001

AgeCut 45-48 Ref49-52 1.24 [0.82;1.85] 0.3042553-56 1.52 [1.02;2.28] 0.0415157-62 2.36 [1.61;3.46] < 0.0001

Notes:

1. The interpretation depends on the cut-off values2. Not all comparisons are in the table, for example the odds ratio for

group 49-52 vs 53-56 is not. But, it can be computed and you knowhow.

28 / 51

Resultsframingham[,AgeCut:=cut(AGE,

c(40,48,52,56,99),labels=c("45-48","49-52","53-56","57-62"))]

fit3=glm(Y∼AgeCut,data=framingham,family=binomial)publish(fit3,intercept=1L)

Variable Units OddsRatio CI.95 p-value(Intercept) 0.17 [0.12;0.22] < 0.0001

AgeCut 45-48 Ref49-52 1.24 [0.82;1.85] 0.3042553-56 1.52 [1.02;2.28] 0.0415157-62 2.36 [1.61;3.46] < 0.0001

Notes:

1. The interpretation depends on the cut-off values2. Not all comparisons are in the table, for example the odds ratio for

group 49-52 vs 53-56 is not. But, it can be computed and you knowhow.

28 / 51

Quantitative explanatory factor

It is often more natural to include the variable AGE (in years) as aquantitative explanatory factor in the model (i.e., NO grouping)

log

(pi

1− pi

)= a+ b · agei

a = log(odds(age=0))b = log(odds(age=a))− log(odds(age=a+1))

Interpretation: For each year

exp(b) = odds ratio

is the factor by which odds for CHD increases with each one unitincrease of age (here 1 year).

29 / 51

Resultsfit5=glm(Y∼AGE,data=framingham,family=binomial)summary(fit5)

Call:glm(formula = Y ~ AGE, family = binomial, data = framingham)

Deviance Residuals:Min 1Q Median 3Q Max

-0.8600 -0.7052 -0.6082 -0.5224 2.0294

Coefficients:Estimate Std. Error z value Pr(>|z|)

(Intercept) -4.88431 0.77372 -6.313 0.000000000274 ***AGE 0.06581 0.01446 4.550 0.000005374208 ***---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1351.2 on 1362 degrees of freedomResidual deviance: 1330.2 on 1361 degrees of freedomAIC: 1334.2

Number of Fisher Scoring iterations: 4

30 / 51

ResultsOne year change in age

fit5=glm(Y∼AGE,data=framingham,family=binomial)publish(fit5,intercept=1L)

Variable Units OddsRatio CI.95 p-value(Intercept) 0.01 [0.00;0.03] < 0.0001

AGE 1.07 [1.04;1.10] < 0.0001

10-year change in age

framingham[,age10:=AGE/10]fit5=glm(Y∼age10,data=framingham,family=binomial)publish(fit5,intercept=1)

Variable Units OddsRatio CI.95 p-value(Intercept) 0.01 [0.00;0.03] < 0.0001

age10 1.93 [1.45;2.56] < 0.0001

31 / 51

Exercises

If we substract from each person’s age the value 50:

framingham[,Age50:=AGE-50]fit5a=glm(Y∼Age50,data=framingham,family=binomial)publish(fit5a,intercept=1)

Variable Units OddsRatio CI.95 p-value(Intercept) 0.20 [0.17;0.24] < 0.0001

Age50 1.07 [1.04;1.10] < 0.0001

1. Report the coronary heart disease risk of a person aged 50.2. Report the association between age and risk of coronary heart

disease in a sentence with confidence interval and p-value.

32 / 51

Multiple logistic regression

Additive effects of several explanatory variables:

log

(pi

1− pi

)= a+ b1Zi + b2Xi + . . .

Multiple logistic regression is a way to control confounding:

The effect on the outcome (odds ratio) of each explanatory variableis mutually adjusted for the other explanatory variables.

I The model assumes that the effect (odds ratio) of Z on Y isthe same for all values of X.

I In other words: the effect of Z on Y is not modified by thevalues of X (no statistical interaction).

33 / 51

Illustration of what "mutually adjusted" means

Additive model (no statistical interactions)

log

(pi

1− pi

)= a+ b1Zi + b2Xi

Effect of sex Zi (0 = female, 1 = male) adjusted for age (Xi)

odds(age=50, male)odds(age=50, female)

=exp(a+ b1 + b250)

exp(a+ b250)= exp(a+ b1 + b250− a− b250)= exp(b1).

The result is the same for age 46 and age 61 and all other ages.

34 / 51

Illustration of what "mutually adjusted" means (continued)Effect of age (Xi) for males:

odds(age=51, male)odds(age=50, male)

=exp(a+ b1 + b251)exp(a− b1 + b250)

= exp(a+ b1 + b251− a− b1 − b250)= exp(b2).

The result is the same for females:

odds(age=51, female)odds(age=50, female)

=exp(a+ b251)exp(a− b250)

= exp(a+ b251− a− b250)= exp(b2).

Linearity means that the result is the same for a comparison of age63 and age 62 and all other one year differences.

35 / 51

Resultsfit.add=glm(Y ∼ AGE + Sex, family = binomial,

data = framingham)summary(fit.add)

Call:glm(formula = Y ~ AGE + Sex, family = binomial, data = framingham)

Deviance Residuals:Min 1Q Median 3Q Max

-0.9910 -0.6927 -0.5958 -0.4500 2.1913

Coefficients:Estimate Std. Error z value Pr(>|z|)

(Intercept) -4.59208 0.78019 -5.886 0.00000000396 ***AGE 0.06672 0.01458 4.575 0.00000475151 ***SexFemale -0.71613 0.14052 -5.096 0.00000034612 ***---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1351.2 on 1362 degrees of freedomResidual deviance: 1303.5 on 1360 degrees of freedomAIC: 1309.5

Number of Fisher Scoring iterations: 4

36 / 51

Results

fit.add=glm(Y ∼ AGE + Sex, family = binomial, data =framingham)

publish(fit.add)

Variable Units OddsRatio CI.95 p-valueAGE 1.07 [1.04;1.10] <0.0001Sex Male 1.00 [1.00;1.00] 1

Female 0.49 [0.37;0.64] <0.0001

Logistic regression was used to investigate gender differences inodds (risks) of CHD adjusted for age.

The age adjusted odds ratio was 0.49 (95%-CI: [0.37;0.64])showing that the risks of CHD were significantly lower for womencompared to men (p<0.0001).

37 / 51

Predicted risks based on logistic regression model

A logistic regression model can be used to predict personalized risks:

log

(pi

1− pi

)= a+ b1Zi + b2Xi + . . .

is equivalent to

pi =exp(a+ b1Zi + b2Xi + . . . )

1+ exp(a+ b1Zi + b2Xi + . . . )

The risks (and risk ratios) depend on all explanatory variablessimultaneously.

38 / 51

Predicted risks based on logistic regression modelPrediction makes most sense for new data

mydata=expand.grid(AGE=c(50,55),Sex=factor(c("Female","Male")))setDT(mydata)mydata

AGE Sex1: 50 Female2: 55 Female3: 50 Male4: 55 Male

mydata[,risk:=predict(fit.add,newdata=mydata,type="response")]mydata

AGE Sex risk1: 50 Female 0.12213812: 55 Female 0.16263533: 50 Male 0.22162844: 55 Male 0.2844255

39 / 51

Visualization of predicted risks

mydata2 <- setDT(expand.grid(AGE=seq(45,62,1),Sex=factor(c("Female","Male"))))

mydata2[,risk:=predict(fit.add,newdata=mydata2,type="response")]

library(ggplot2)ggplot(mydata2,aes(x=AGE,y=risk,group

=Sex,colour=Sex))+geom_line()+ylim(c(0,1))+xlab("Age (years)")+ylab("Risk of CHD")

0.00

0.25

0.50

0.75

1.00

45 50 55 60

Age (years)

Ris

k of

CH

D

Sex

Female

Male

40 / 51

Example with more variables

framingham[,Chol10:=CHOL/10]fit.multi <- glm(Y ∼ AGE + Sex + Chol10 + SBP + Smoke,

family = binomial,data = framingham)publish(fit.multi)

Variable Units Missing OddsRatio CI.95 p-valueAGE 0 1.06 [1.02;1.09] 0.0004181Sex Male 0 Ref

Female 0.38 [0.28;0.52] < 0.0001Chol10 0 1.05 [1.02;1.08] 0.0026086

SBP 0 1.02 [1.01;1.02] < 0.0001Smoke No 1 Ref

Yes 1.19 [0.88;1.60] 0.2510447

41 / 51

Exercise

1. Report the effect of cholesterol on coronary heart disease fromthe multiple logistic regression model

2. Predict the coronary heart disease risks of four smokingfemales all aged 50 and with 150 SBP but with differentcholesterol values:I person 1: 235, person 2: 245, person 3: 351, person 4: 361

mydata=data.frame(AGE=50,Sex=factor("Female",levels(framingham$Sex)),Smoke=factor("Yes",levels(framingham$Smoke)),SBP=150,Chol10=c(23.5,24.5,35.1,36.1))

1. Compute the risk ratios for 10 unit cholesterol changes from245 to 235 and from 361 to 351

2. Repeat 2. and 3. for a male person

42 / 51

Statistical interaction = Effectmodification

The effect of X on Y depends on Z

43 / 51

Effect modification

A statistical interaction (effect modification) requires 3variablesI two explanatory variables X,ZI one outcome Y

In logistic regression the odds ratio which describes the effect of Xon the odds of Y=1 depends on the value of Z

SymmetryIf the effect of variable X on Y is modified by Z then also the effectof Z on Y is modified X.

ExampleThe age (Z) effect on the CHD-risk (Y) may depend on sex (X).

44 / 51

Statistical interaction in R

summary(glm(Y ∼ AGE * SEX, family = binomial, data =framingham))

Alternative notation:

summary(glm(Y ∼ AGE + SEX + AGE:SEX, family = binomial, data = framingham))

45 / 51

Resultssummary(glm(Y ∼ AGE * SEX, family = binomial, data = framingham

))

Call:glm(formula = Y ~ AGE * SEX, family = binomial, data = framingham)

Deviance Residuals:Min 1Q Median 3Q Max

-0.9171 -0.7284 -0.6074 -0.4010 2.3029

Coefficients:Estimate Std. Error z value Pr(>|z|)

(Intercept) 0.091694 2.361000 0.039 0.9690AGE -0.007736 0.044223 -0.175 0.8611SEX -3.544593 1.604311 -2.209 0.0271 *AGE:SEX 0.052967 0.029871 1.773 0.0762 .---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1351.2 on 1362 degrees of freedomResidual deviance: 1300.4 on 1359 degrees of freedomAIC: 1308.4

Number of Fisher Scoring iterations: 4

46 / 51

Statistical interaction

publish(glm(Y ∼ AGE * Sex, family = binomial, data =framingham))

Variable Units OddsRatio CI.95 p-valueAGE: Sex(Male) 1.05 [1.01;1.09] 0.01629

AGE: Sex(Female) 1.10 [1.05;1.15] < 0.0001

Notes:

I The main effects for AGE and Sex have no interpretation (andare therefore not shown).

I One year more in age increases the odds by 5% in males andby 10% in females.

47 / 51

Predicted risk of the model with an additive effect of ageand sex (no effect modification)

Age (years)

Pre

dict

ed C

HD

ris

k

MaleFemale

0 %

25 %

50 %

75 %

100

%

45 50 55 60

48 / 51

Predicted risk of the model with an interaction between ageand sex (with effect modification)

Age (years)

Pre

dict

ed C

HD

ris

k

MaleFemale

0 %

25 %

50 %

75 %

100

%

45 50 55 60

49 / 51

Testing for statistical interactionfit.add=glm(Y ∼ AGE + SEX, family = binomial, data = framingham)fit.int=glm(Y ∼ AGE * SEX, family = binomial, data = framingham)anova(fit.add,fit.int,test="Chisq")

Analysis of Deviance Table

Model 1: Y ~ AGE + SEXModel 2: Y ~ AGE * SEX

Resid. Df Resid. Dev Df Deviance Pr(>Chi)1 1360 1303.52 1359 1300.4 1 3.1676 0.07511 .---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

There is no statistically significant modification of the age effect by gender(p>0.05).

50 / 51

Take home messages

I (Multiple) logistic regression describes associations betweenone or several explanatory variables and the risk of an event(binary outcome).

I Analysis of an exposure of interest can be adjusted forpotential confounders

I In an additive model (no interactions), odds ratios do notdepend on the other explanatory variables

I Risks and risk ratios predicted by the model depend on theother explanatory variables

I Linearity and absence of interaction are assumptions whichshould be investigated

51 / 51