Logistic regression analysis - staff.pubhealth.ku.dk

58
Logistic regression analysis Thomas Alexander Gerds Department of Biostatistics, University of Copenhagen 1 / 51

Transcript of Logistic regression analysis - staff.pubhealth.ku.dk

Page 1: Logistic regression analysis - staff.pubhealth.ku.dk

Logistic regression analysis

Thomas Alexander Gerds

Department of Biostatistics, University of Copenhagen

1 / 51

Page 2: Logistic regression analysis - staff.pubhealth.ku.dk

Carpenter et al.

2 / 51

Page 3: Logistic regression analysis - staff.pubhealth.ku.dk

Carpenter et al.

3 / 51

Page 4: Logistic regression analysis - staff.pubhealth.ku.dk

Regression

The type of the outcome variable determines which kind of modelis relevant:

Quantitative (continuous) outcome

I Linear regresssionI Association parameters: differences between mean values

0-1 (binary) outcome

I Logistic regressionI Association parameters: odds ratio, differences between

log(odds)

4 / 51

Page 5: Logistic regression analysis - staff.pubhealth.ku.dk

Categorical explanatory variable

Group 1, . . . , K (especially binary: K=2)

Linear regression, continuous outcome Y

mean(Y|group k) - mean(Y|reference group)

E.g., the average systolic blood pressure was higher in malescompared to females

Logistic regression, binary outcome

Odds ratio =odds(group k)

odds(reference group)

E.g., the risk (odds) of coronary heart disease was higher in malescompared to females

5 / 51

Page 6: Logistic regression analysis - staff.pubhealth.ku.dk

Quantitative (continuous) explanatory variables

Linear regression, continuous outcome YDifferences in mean values per unit of X:

mean(Y|x+1)-mean(Y|x)

E.g., the average systolic blood pressure increased with age

Logistic regression, binary outcomeDifferences in log(odds) per unit of X

Odds ratio =odds(x+1)odds(x)

E.g., the risk (odds) of coronary heart disease increased with age

6 / 51

Page 7: Logistic regression analysis - staff.pubhealth.ku.dk

Linearity in regression models

For a continuous explanatory variable X, linearity means that theeffect of a unit change of X on the outcome does not depend onthe value of X .

Linear regression, continuous outcome Y

mean(Y |45+ 1)−mean(Y |45) = mean(Y |46+ 1)−mean(Y |46)= · · · = mean(Y |61+ 1)−mean(Y |61)

Logistic regression, binary outcome

odds(45+1)odds(45)

=odds(46+1)odds(46)

= · · · = odds(61+1)odds(61)

Linearity is a model assumption which should be investigated.

7 / 51

Page 8: Logistic regression analysis - staff.pubhealth.ku.dk

Binary outcome regressionIf the outcome variable is binary:

Yi =

{1 if i is diseased0 if i is not diseased

then linear regressionYi = β0 + β1Xi

is not good.

The regression line will gobelow 0 and above 1.

● ●● ●● ●●●

● ●●●

●●● ●

●● ●● ●●● ● ●● ●● ●● ●● ● ●● ●● ●● ● ●

●●●●●● ●● ●

●● ●● ● ●● ●●● ●● ●

●● ●●●

●● ●●●● ● ●● ●● ● ●●

●●

●● ●● ●● ●●● ●●●

● ●● ● ●

●●●

●●

●●●● ●●

● ●

●● ● ●●●●

●●● ●●● ●● ●● ● ●● ●●● ●●●

● ● ●●● ●●

● ●●

●●● ●● ●●● ●● ●●

●●

● ●

● ●● ● ●●●

● ●

●●

● ●●● ●

●●●

●●●● ●● ● ●● ●●

● ●●

●●

●●

●● ● ● ●● ● ●

● ●

● ●● ●● ●●

● ●●

●●●● ●●

● ● ●●●● ●●

●●● ●●● ●

● ●●

● ●● ●

●●●●

● ●

●●●

● ●● ●●●

●●● ●● ●

●●

●●

● ●●●●●

●● ●

● ●●●● ●

●●●

●●●

● ●

● ●● ●●

●●● ●● ●

●●●● ●●

●●● ●●

● ●●●●

●●●

●●

● ● ● ●● ● ●●● ●●●●

● ● ●● ●● ●● ● ●● ●● ●● ● ●●● ●●●●●● ●

● ●●● ●●

●● ●● ● ●● ●●●●● ●● ●●●●● ● ● ●●

●●

●● ●

●● ●●

●●●● ● ●

● ●●

● ●

● ●●

● ●●

●● ●

●● ●●● ●

●● ●● ●●

● ●●

●●

● ●

● ● ● ●●

●● ●●● ●●●● ● ●●● ●● ●

● ●●●●

●●

● ●

●● ●

●● ●●● ●● ●● ●● ● ●●● ● ●●●

●●

● ●● ●

●● ● ●●●

●●

●●● ● ●● ●●

● ●● ● ●●● ●● ●● ●●●

●● ● ●● ●●● ●

● ●●

● ●● ●●

●● ●

●● ●●

●●

● ●●

● ●● ● ●●● ●

●●● ●●●● ● ●

●●

● ●

●●

●● ●

●● ● ●● ●● ●● ●● ●● ● ●●● ●● ●●● ●

● ●● ●● ●●●●

●●●● ●

● ●

● ●

●●●

●● ●●

● ● ●●● ●

● ●● ●●

●●●

●●

●● ●●

●●

●● ● ●●●● ● ●●●● ●● ●●● ●

● ●

● ● ●

● ●

●● ●

●●● ●●●

● ●

●●

● ●● ● ● ●● ●● ●●● ●●● ● ●●●● ● ●

● ●

● ●● ● ●

● ●●

●●●

●●●● ● ●●●●

● ●●●

●● ● ●● ●● ● ●●

● ●

●●●

●● ● ●● ●●●●

●●●

● ●● ●●

●●●

●● ●●● ●

●●

● ● ●●● ●● ●

●● ●● ● ●● ●● ●●● ●

● ●●●● ●●● ●

● ●

● ●

●● ● ●●●● ● ●

●● ●●

●● ●●

●● ●● ●●● ●●● ●●

●● ●●

● ●

●●● ● ●

●● ●● ●● ●●●●● ● ●●●

●● ●● ● ●●● ●●●●

● ●●

●●●

●●●●

●●

●● ●● ●● ●●

●●

●● ●●● ●● ●

● ● ●●● ●●

● ●●

●●

● ●● ●● ●

● ●● ●●● ●

●●● ●●●● ●

● ●●●

●●●

● ●●

●● ● ●

●● ●● ● ● ●●●

● ●● ● ●● ● ● ●● ●●● ●● ●●● ●● ●●●

●●●

● ●

●● ●● ●●● ●

● ●● ●●●● ●● ●●

●●

● ●

● ● ●●

● ●●● ●●● ●● ● ●● ●●● ●

● ● ● ●●● ●●

●●●

●●●●

● ●●● ●● ● ●●●●

●●

●●

●●

●●

● ●● ●

● ●●

●●

●●● ●● ●● ● ●●

● ●

● ●● ●

●●

●●

●●● ●● ● ●● ●● ●●●●● ●●●

●● ● ●●

● ●

●● ●

−25 %

0 %

25 %

50 %

75 %

100 %

125 %

20 40 60 80

8 / 51

Page 9: Logistic regression analysis - staff.pubhealth.ku.dk

Binary outcome regressionIf the outcome variable is binary:

Yi =

{1 if i is diseased0 if i is not diseased

then linear regressionYi = β0 + β1Xi

is not good.

The regression line will gobelow 0 and above 1.

● ●● ●● ●●●

● ●●●

●●● ●

●● ●● ●●● ● ●● ●● ●● ●● ● ●● ●● ●● ● ●

●●●●●● ●● ●

●● ●● ● ●● ●●● ●● ●

●● ●●●

●● ●●●● ● ●● ●● ● ●●

●●

●● ●● ●● ●●● ●●●

● ●● ● ●

●●●

●●

●●●● ●●

● ●

●● ● ●●●●

●●● ●●● ●● ●● ● ●● ●●● ●●●

● ● ●●● ●●

● ●●

●●● ●● ●●● ●● ●●

●●

● ●

● ●● ● ●●●

● ●

●●

● ●●● ●

●●●

●●●● ●● ● ●● ●●

● ●●

●●

●●

●● ● ● ●● ● ●

● ●

● ●● ●● ●●

● ●●

●●●● ●●

● ● ●●●● ●●

●●● ●●● ●

● ●●

● ●● ●

●●●●

● ●

●●●

● ●● ●●●

●●● ●● ●

●●

●●

● ●●●●●

●● ●

● ●●●● ●

●●●

●●●

● ●

● ●● ●●

●●● ●● ●

●●●● ●●

●●● ●●

● ●●●●

●●●

●●

● ● ● ●● ● ●●● ●●●●

● ● ●● ●● ●● ● ●● ●● ●● ● ●●● ●●●●●● ●

● ●●● ●●

●● ●● ● ●● ●●●●● ●● ●●●●● ● ● ●●

●●

●● ●

●● ●●

●●●● ● ●

● ●●

● ●

● ●●

● ●●

●● ●

●● ●●● ●

●● ●● ●●

● ●●

●●

● ●

● ● ● ●●

●● ●●● ●●●● ● ●●● ●● ●

● ●●●●

●●

● ●

●● ●

●● ●●● ●● ●● ●● ● ●●● ● ●●●

●●

● ●● ●

●● ● ●●●

●●

●●● ● ●● ●●

● ●● ● ●●● ●● ●● ●●●

●● ● ●● ●●● ●

● ●●

● ●● ●●

●● ●

●● ●●

●●

● ●●

● ●● ● ●●● ●

●●● ●●●● ● ●

●●

● ●

●●

●● ●

●● ● ●● ●● ●● ●● ●● ● ●●● ●● ●●● ●

● ●● ●● ●●●●

●●●● ●

● ●

● ●

●●●

●● ●●

● ● ●●● ●

● ●● ●●

●●●

●●

●● ●●

●●

●● ● ●●●● ● ●●●● ●● ●●● ●

● ●

● ● ●

● ●

●● ●

●●● ●●●

● ●

●●

● ●● ● ● ●● ●● ●●● ●●● ● ●●●● ● ●

● ●

● ●● ● ●

● ●●

●●●

●●●● ● ●●●●

● ●●●

●● ● ●● ●● ● ●●

● ●

●●●

●● ● ●● ●●●●

●●●

● ●● ●●

●●●

●● ●●● ●

●●

● ● ●●● ●● ●

●● ●● ● ●● ●● ●●● ●

● ●●●● ●●● ●

● ●

● ●

●● ● ●●●● ● ●

●● ●●

●● ●●

●● ●● ●●● ●●● ●●

●● ●●

● ●

●●● ● ●

●● ●● ●● ●●●●● ● ●●●

●● ●● ● ●●● ●●●●

● ●●

●●●

●●●●

●●

●● ●● ●● ●●

●●

●● ●●● ●● ●

● ● ●●● ●●

● ●●

●●

● ●● ●● ●

● ●● ●●● ●

●●● ●●●● ●

● ●●●

●●●

● ●●

●● ● ●

●● ●● ● ● ●●●

● ●● ● ●● ● ● ●● ●●● ●● ●●● ●● ●●●

●●●

● ●

●● ●● ●●● ●

● ●● ●●●● ●● ●●

●●

● ●

● ● ●●

● ●●● ●●● ●● ● ●● ●●● ●

● ● ● ●●● ●●

●●●

●●●●

● ●●● ●● ● ●●●●

●●

●●

●●

●●

● ●● ●

● ●●

●●

●●● ●● ●● ● ●●

● ●

● ●● ●

●●

●●

●●● ●● ● ●● ●● ●●●●● ●●●

●● ● ●●

● ●

●● ●

−25 %

0 %

25 %

50 %

75 %

100 %

125 %

20 40 60 80

8 / 51

Page 10: Logistic regression analysis - staff.pubhealth.ku.dk

(Multiple) logistic regression

We denote the probability of the event Yi = 1 for a subject withexplanatory variables Xi , Zi , . . . as

P(Yi = 1|Xi ,Zi , . . . ) = pi .

The idea is to use the logit function. Instead of pi which isbounded between 0 and 1 we apply linear regression to log(odds):

logit(pi ) = log

(pi

1− pi

)= a+ b1Zi + b2Xi + . . .

log(

pi1−pi

)can take both negative and positive values.

9 / 51

Page 11: Logistic regression analysis - staff.pubhealth.ku.dk

Warm-up exercisesComplete the following table

pi oddsi logit(pi )

0.001%-7.0-4.5

2.1%8%

50%3.8

99%11.5

Hint: the following functions take a vector as argument

logit <- function(p){log(p/(1 - p))}expit <- function(x){exp(x)/(1 + exp(x))}

10 / 51

Page 12: Logistic regression analysis - staff.pubhealth.ku.dk

Example: Framingham study

I SEX: 1 for males, 2 for femalesI AGE: age (years) at baseline (45-62)I FRW: "Framingham relative weight" (pct.) at baseline

(52-222; 11 persons have missing values)I SBP: systolic blood pressure at baseline (mmHg) (90-300)I DBP: diastolic blood pressure at baseline (mmHg) 50-160)I CHOL: cholesterol at baseline (mg/100ml) (96-430)I CIG: cigarettes per day at baseline (0-60; 1 person has missing

value)I CHD: 0 if no "coronary heart disease" during follow-up, 1 if

"coronary heart disease" at baseline (prevalent cases), x=2-10if "coronary heart disease" was diagnosed at follow-up no.x

11 / 51

Page 13: Logistic regression analysis - staff.pubhealth.ku.dk

Framingham study: data preparationlibrary(data.table)framingham <- fread("data/Framingham.csv")

## remove prevalent casesframingham <- framingham[CHD!=1,]

## define factor levels/labelsframingham[,Smoke:=factor(CIG>0,levels=c(FALSE,TRUE),labels=c("No","Yes"))]

framingham[,Sex:=factor(SEX,levels=c(1,2),labels=c("Male","Female"))]

## define binary outcome variableframingham[,Y:=factor(CHD>1,levels=c(FALSE,TRUE),labels=c("no

CHD","CHD"))]framingham

ID SEX AGE FRW SBP DBP CHOL CIG CHD Smoke Sex Y1: 1070 2 45 93 100 62 220 0 0 No Female no CHD2: 1081 1 48 93 108 70 340 0 0 No Male no CHD3: 1123 2 45 91 160 100 171 0 0 No Female no CHD4: 1215 1 50 110 110 70 224 0 0 No Male no CHD5: 1267 1 48 85 110 70 229 25 0 Yes Male no CHD

---1359: 6432 1 47 113 155 105 175 5 5 Yes Male CHD1360: 6434 1 59 98 124 84 227 20 2 Yes Male CHD1361: 6437 2 55 111 108 74 231 0 0 No Female no CHD1362: 6440 1 49 114 110 80 218 20 0 Yes Male no CHD1363: 6442 2 51 95 152 90 199 1 0 Yes Female no CHD

12 / 51

Page 14: Logistic regression analysis - staff.pubhealth.ku.dk

Framingham outcome

i = subject number: 1, . . . , 1406

Xi = age of subject i

Zi = gender of subject i

Vi = smoking status of subject i

Yi =

{1 subject i develops coronary heart diseased (CHD)0 subject i does not develop CHD

pi = P(Yi = 1|Xi ,Vi ,Zi , ...) = probability of CHD of subject i

pi(1− pi )

= odds of CHD of subject i

13 / 51

Page 15: Logistic regression analysis - staff.pubhealth.ku.dk

A binary explanatory variable

Zi =

{1 if i is a man0 if i is a woman

Simple logistic regression:

log

(pi

1− pi

)= a+ bZi =

{a females

a+ b males

That means,

b = (a+ b)− a = log(odds for males)− log(odds for females)

= log

(Odds for malesOdds for females

)and

−b = a− (a+ b) = log

(Odds for femalesOdds for males

)

14 / 51

Page 16: Logistic regression analysis - staff.pubhealth.ku.dk

A binary explanatory variable

Zi =

{1 if i is a man0 if i is a woman

Simple logistic regression:

log

(pi

1− pi

)= a+ bZi =

{a females

a+ b males

That means,

b = (a+ b)− a = log(odds for males)− log(odds for females)

= log

(Odds for malesOdds for females

)and

−b = a− (a+ b) = log

(Odds for femalesOdds for males

)14 / 51

Page 17: Logistic regression analysis - staff.pubhealth.ku.dk

Exercise: 2 by 2 contingency table

framingham[,table(Sex,Y)]

no CHD CHDMale 479 164Female 616 104

I use the tools for 2x2 tablesI compute the odds ratio with 95% confidence limits and

corresponding p-valueI report and interprete the result in a sentence

15 / 51

Page 18: Logistic regression analysis - staff.pubhealth.ku.dk

Logistic regression in R

fit1 <- glm(Y∼Sex, data=framingham, family=binomial)

I Y ∼ Sex tells R that Y is the outcome and Sex the explanatoryvariable

I data=framingham tells R where to find Y and SexI glm means generalized linear modelI family=binomial tells R that the outcome is binary and the

logit link should be used

16 / 51

Page 19: Logistic regression analysis - staff.pubhealth.ku.dk

Logistic regression in Rfit1 <- glm(Y∼Sex,data=framingham,family=binomial)summary(fit1)

Call:glm(formula = Y ~ Sex, family = binomial, data = framingham)

Deviance Residuals:Min 1Q Median 3Q Max

-0.7674 -0.7674 -0.5586 -0.5586 1.9672

Coefficients:Estimate Std. Error z value Pr(>|z|)

(Intercept) -1.07183 0.09047 -11.847 < 2e-16 ***SexFemale -0.70702 0.13937 -5.073 0.000000392 ***---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1351.2 on 1362 degrees of freedomResidual deviance: 1324.9 on 1361 degrees of freedomAIC: 1328.9

Number of Fisher Scoring iterations: 4

17 / 51

Page 20: Logistic regression analysis - staff.pubhealth.ku.dk

Confidence intervals for the odds ratio

library(Publish)fit1 <- glm(Y∼Sex,data=framingham,family=binomial)publish(fit1)

Variable Units OddsRatio CI.95 p-valueSex Male 1.00 [1.00;1.00] 1

Female 0.49 [0.38;0.65] <0.0001

Note : 0.49 = exp(−0.71)

Women have a significantly lower risk to develop coronary heart diseasethan men (odds ratio: 0.49, 95%-CI: [0.38; 0.65], p-value <0.0001).

18 / 51

Page 21: Logistic regression analysis - staff.pubhealth.ku.dk

Confidence intervals for the odds ratio

library(Publish)fit1 <- glm(Y∼Sex,data=framingham,family=binomial)publish(fit1)

Variable Units OddsRatio CI.95 p-valueSex Male 1.00 [1.00;1.00] 1

Female 0.49 [0.38;0.65] <0.0001

Note : 0.49 = exp(−0.71)

Women have a significantly lower risk to develop coronary heart diseasethan men (odds ratio: 0.49, 95%-CI: [0.38; 0.65], p-value <0.0001).

18 / 51

Page 22: Logistic regression analysis - staff.pubhealth.ku.dk

Changing the reference level

framingham[,sex:=relevel(Sex,"Female")]fit1a <- glm(Y∼sex,data=framingham,family=binomial)publish(fit1a)

Variable Units OddsRatio CI.95 p-valuesex Female 1.00 [1.00;1.00] 1

Male 2.03 [1.54;2.66] <0.0001

Note : 2.03 = exp(0.71)

Men have a significantly higher risk to develop coronary heart disease thanwomen (odds ratio: 2.03, 95%-CI: [1.5; 2.7], p-value <0.0001).

19 / 51

Page 23: Logistic regression analysis - staff.pubhealth.ku.dk

Changing the reference level

framingham[,sex:=relevel(Sex,"Female")]fit1a <- glm(Y∼sex,data=framingham,family=binomial)publish(fit1a)

Variable Units OddsRatio CI.95 p-valuesex Female 1.00 [1.00;1.00] 1

Male 2.03 [1.54;2.66] <0.0001

Note : 2.03 = exp(0.71)

Men have a significantly higher risk to develop coronary heart disease thanwomen (odds ratio: 2.03, 95%-CI: [1.5; 2.7], p-value <0.0001).

19 / 51

Page 24: Logistic regression analysis - staff.pubhealth.ku.dk

Two explanatory variables:

Zi =

{1 if i male0 female

and Vi =

{1 if i smokes0 otherwise

Data can be summarized as two 2 by 2 tables in two ways

Males (Z=1) Females (Z=0)V = 0 V = 1 V = 0 V = 1

Y = 0 191 288 Y = 0 423 192Y = 1 57 107 Y = 1 77 27

Smokers (V = 1) Non-smokers (V = 0)Males Females Males Females

Y = 0 288 192 Y = 0 191 423Y = 1 107 27 Y = 1 57 77

20 / 51

Page 25: Logistic regression analysis - staff.pubhealth.ku.dk

Cochran-Mantel-Haenszel test

In this way, we can study the effect of smoking adjusted for sex:

ORMantel-Haenszel = 0.97; p > 0.05

and also study the effect of Sex adjusted for smoking:

ORMantel-Haenszel = 2.03; p < 0.05

ConclusionsI there is no significant effect of smoking adjusted for sexI there is a significant effect of sex adjusted for smoking

21 / 51

Page 26: Logistic regression analysis - staff.pubhealth.ku.dk

Logistic regression model: two binary variables

log(

pi1−pi

)= a+ b1Zi + b2Vi

=

a Female non-smokera+ b1 Male non-smokera+ b2 Female smokera+ b1 + b2 Male smoker

Note: b1 = (a+ b1)− a= (a+ b1 + b2)− (a+ b2)

= logOR (males vs. females for given smoking status)

and b2 = (a+ b2)− a= (a+ b1 + b2)− (a+ b1)

= logOR (smokers vs. non-smokers for given gender)

22 / 51

Page 27: Logistic regression analysis - staff.pubhealth.ku.dk

Logistic regression resultsfit2=glm(Y∼Sex+Smoke,data=framingham,family=binomial)summary(fit2)

Call:glm(formula = Y ~ Sex + Smoke, family = binomial, data = framingham)

Deviance Residuals:Min 1Q Median 3Q Max

-0.7716 -0.7607 -0.5564 -0.5564 1.9708

Coefficients:Estimate Std. Error z value Pr(>|z|)

(Intercept) -1.09215 0.12717 -8.588 < 2e-16 ***SexFemale -0.69521 0.14635 -4.750 0.00000203 ***SmokeYes 0.03296 0.14457 0.228 0.82---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1350.8 on 1361 degrees of freedomResidual deviance: 1324.5 on 1359 degrees of freedom

(1 observation deleted due to missingness)AIC: 1330.5

Number of Fisher Scoring iterations: 4

23 / 51

Page 28: Logistic regression analysis - staff.pubhealth.ku.dk

Extracting odds ratios with confidence intervals

publish(fit2,intercept=TRUE)

Variable Units Missing OddsRatio CI.95 p-value(Intercept) 0.34 [0.26;0.43] <0.0001Sex Male 0 Ref

Female 0.50 [0.37;0.66] <0.0001Smoke No 1 Ref

Yes 1.03 [0.78;1.37] 0.8196

Logistic regression adjusted for smoking status showed a decreasein odds of CHD of 50% (CI-95%: [37%;66%]) in women comparedto men (p<0.0001).Exercise: Based on this model, compute the risk of CHD for anon-smoking woman, a non-smoking man, a smoking woman and asmoking man.

24 / 51

Page 29: Logistic regression analysis - staff.pubhealth.ku.dk

Extracting odds ratios with confidence intervals

publish(fit2,intercept=TRUE)

Variable Units Missing OddsRatio CI.95 p-value(Intercept) 0.34 [0.26;0.43] <0.0001Sex Male 0 Ref

Female 0.50 [0.37;0.66] <0.0001Smoke No 1 Ref

Yes 1.03 [0.78;1.37] 0.8196

Logistic regression adjusted for smoking status showed a decreasein odds of CHD of 50% (CI-95%: [37%;66%]) in women comparedto men (p<0.0001).

Exercise: Based on this model, compute the risk of CHD for anon-smoking woman, a non-smoking man, a smoking woman and asmoking man.

24 / 51

Page 30: Logistic regression analysis - staff.pubhealth.ku.dk

Extracting odds ratios with confidence intervals

publish(fit2,intercept=TRUE)

Variable Units Missing OddsRatio CI.95 p-value(Intercept) 0.34 [0.26;0.43] <0.0001Sex Male 0 Ref

Female 0.50 [0.37;0.66] <0.0001Smoke No 1 Ref

Yes 1.03 [0.78;1.37] 0.8196

Logistic regression adjusted for smoking status showed a decreasein odds of CHD of 50% (CI-95%: [37%;66%]) in women comparedto men (p<0.0001).Exercise: Based on this model, compute the risk of CHD for anon-smoking woman, a non-smoking man, a smoking woman and asmoking man.

24 / 51

Page 31: Logistic regression analysis - staff.pubhealth.ku.dk

Simple logistic regression: categorical explanatory variable:

Categorize age into 4 intervals:

45-48, 49-52, 53-56, 57-62

Summarize in 2 by 4 table

X = 0 X = 1 X = 2 X = 345-48 49-52 53-56 57-62

Y = 0 308 298 254 235 1095Y = 1 51 61 64 92 268

359 359 318 327 1363

(Note: both males and females)

25 / 51

Page 32: Logistic regression analysis - staff.pubhealth.ku.dk

ANOVA: χ2 test

We may test whether the risk of CHD differs between the 4 agegroups using a chi-square test statistic - in this case with 3 degreesof freedom:

Null hypothesis:

Odds(age45− 48) = Odds(age49− 52)= Odds(age53− 56) = Odds(age57−62)

∑ (OBS − EXP)2

EXP= 23.29 ∼ χ2

3, P < 0.001

Conclusion: CHD-risk differs significantly between the age groups.

26 / 51

Page 33: Logistic regression analysis - staff.pubhealth.ku.dk

Logistic regression: categorical variable with 4 levels:

log

(pi

1− pi

)=

a age 45− 48

a+ b1 age 49− 52a+ b2 age 53− 56a+ b3 age 57− 62

Reference category 45-48

a = log (Odds(45− 48))

b1 = log

(Odds(49− 52)Odds(45− 48)

)b2 = log

(Odds(53− 56)Odds(45− 48)

)b3 = log

(Odds(57− 62)Odds(45− 48)

)

27 / 51

Page 34: Logistic regression analysis - staff.pubhealth.ku.dk

Resultsframingham[,AgeCut:=cut(AGE,

c(40,48,52,56,99),labels=c("45-48","49-52","53-56","57-62"))]

fit3=glm(Y∼AgeCut,data=framingham,family=binomial)publish(fit3,intercept=1L)

Variable Units OddsRatio CI.95 p-value(Intercept) 0.17 [0.12;0.22] < 0.0001

AgeCut 45-48 Ref49-52 1.24 [0.82;1.85] 0.3042553-56 1.52 [1.02;2.28] 0.0415157-62 2.36 [1.61;3.46] < 0.0001

Notes:

1. The interpretation depends on the cut-off values2. Not all comparisons are in the table, for example the odds ratio for

group 49-52 vs 53-56 is not. But, it can be computed and you knowhow.

28 / 51

Page 35: Logistic regression analysis - staff.pubhealth.ku.dk

Resultsframingham[,AgeCut:=cut(AGE,

c(40,48,52,56,99),labels=c("45-48","49-52","53-56","57-62"))]

fit3=glm(Y∼AgeCut,data=framingham,family=binomial)publish(fit3,intercept=1L)

Variable Units OddsRatio CI.95 p-value(Intercept) 0.17 [0.12;0.22] < 0.0001

AgeCut 45-48 Ref49-52 1.24 [0.82;1.85] 0.3042553-56 1.52 [1.02;2.28] 0.0415157-62 2.36 [1.61;3.46] < 0.0001

Notes:

1. The interpretation depends on the cut-off values2. Not all comparisons are in the table, for example the odds ratio for

group 49-52 vs 53-56 is not. But, it can be computed and you knowhow.

28 / 51

Page 36: Logistic regression analysis - staff.pubhealth.ku.dk

Quantitative explanatory factor

It is often more natural to include the variable AGE (in years) as aquantitative explanatory factor in the model (i.e., NO grouping)

log

(pi

1− pi

)= a+ b · agei

a = log(odds(age=0))b = log(odds(age=a))− log(odds(age=a+1))

Interpretation: For each year

exp(b) = odds ratio

is the factor by which odds for CHD increases with each one unitincrease of age (here 1 year).

29 / 51

Page 37: Logistic regression analysis - staff.pubhealth.ku.dk

Resultsfit5=glm(Y∼AGE,data=framingham,family=binomial)summary(fit5)

Call:glm(formula = Y ~ AGE, family = binomial, data = framingham)

Deviance Residuals:Min 1Q Median 3Q Max

-0.8600 -0.7052 -0.6082 -0.5224 2.0294

Coefficients:Estimate Std. Error z value Pr(>|z|)

(Intercept) -4.88431 0.77372 -6.313 0.000000000274 ***AGE 0.06581 0.01446 4.550 0.000005374208 ***---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1351.2 on 1362 degrees of freedomResidual deviance: 1330.2 on 1361 degrees of freedomAIC: 1334.2

Number of Fisher Scoring iterations: 4

30 / 51

Page 38: Logistic regression analysis - staff.pubhealth.ku.dk

ResultsOne year change in age

fit5=glm(Y∼AGE,data=framingham,family=binomial)publish(fit5,intercept=1L)

Variable Units OddsRatio CI.95 p-value(Intercept) 0.01 [0.00;0.03] < 0.0001

AGE 1.07 [1.04;1.10] < 0.0001

10-year change in age

framingham[,age10:=AGE/10]fit5=glm(Y∼age10,data=framingham,family=binomial)publish(fit5,intercept=1)

Variable Units OddsRatio CI.95 p-value(Intercept) 0.01 [0.00;0.03] < 0.0001

age10 1.93 [1.45;2.56] < 0.0001

31 / 51

Page 39: Logistic regression analysis - staff.pubhealth.ku.dk

Exercises

If we substract from each person’s age the value 50:

framingham[,Age50:=AGE-50]fit5a=glm(Y∼Age50,data=framingham,family=binomial)publish(fit5a,intercept=1)

Variable Units OddsRatio CI.95 p-value(Intercept) 0.20 [0.17;0.24] < 0.0001

Age50 1.07 [1.04;1.10] < 0.0001

1. Report the coronary heart disease risk of a person aged 50.2. Report the association between age and risk of coronary heart

disease in a sentence with confidence interval and p-value.

32 / 51

Page 40: Logistic regression analysis - staff.pubhealth.ku.dk

Multiple logistic regression

Additive effects of several explanatory variables:

log

(pi

1− pi

)= a+ b1Zi + b2Xi + . . .

Multiple logistic regression is a way to control confounding:

The effect on the outcome (odds ratio) of each explanatory variableis mutually adjusted for the other explanatory variables.

I The model assumes that the effect (odds ratio) of Z on Y isthe same for all values of X.

I In other words: the effect of Z on Y is not modified by thevalues of X (no statistical interaction).

33 / 51

Page 41: Logistic regression analysis - staff.pubhealth.ku.dk

Illustration of what "mutually adjusted" means

Additive model (no statistical interactions)

log

(pi

1− pi

)= a+ b1Zi + b2Xi

Effect of sex Zi (0 = female, 1 = male) adjusted for age (Xi)

odds(age=50, male)odds(age=50, female)

=exp(a+ b1 + b250)

exp(a+ b250)= exp(a+ b1 + b250− a− b250)= exp(b1).

The result is the same for age 46 and age 61 and all other ages.

34 / 51

Page 42: Logistic regression analysis - staff.pubhealth.ku.dk

Illustration of what "mutually adjusted" means (continued)Effect of age (Xi) for males:

odds(age=51, male)odds(age=50, male)

=exp(a+ b1 + b251)exp(a− b1 + b250)

= exp(a+ b1 + b251− a− b1 − b250)= exp(b2).

The result is the same for females:

odds(age=51, female)odds(age=50, female)

=exp(a+ b251)exp(a− b250)

= exp(a+ b251− a− b250)= exp(b2).

Linearity means that the result is the same for a comparison of age63 and age 62 and all other one year differences.

35 / 51

Page 43: Logistic regression analysis - staff.pubhealth.ku.dk

Resultsfit.add=glm(Y ∼ AGE + Sex, family = binomial,

data = framingham)summary(fit.add)

Call:glm(formula = Y ~ AGE + Sex, family = binomial, data = framingham)

Deviance Residuals:Min 1Q Median 3Q Max

-0.9910 -0.6927 -0.5958 -0.4500 2.1913

Coefficients:Estimate Std. Error z value Pr(>|z|)

(Intercept) -4.59208 0.78019 -5.886 0.00000000396 ***AGE 0.06672 0.01458 4.575 0.00000475151 ***SexFemale -0.71613 0.14052 -5.096 0.00000034612 ***---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1351.2 on 1362 degrees of freedomResidual deviance: 1303.5 on 1360 degrees of freedomAIC: 1309.5

Number of Fisher Scoring iterations: 4

36 / 51

Page 44: Logistic regression analysis - staff.pubhealth.ku.dk

Results

fit.add=glm(Y ∼ AGE + Sex, family = binomial, data =framingham)

publish(fit.add)

Variable Units OddsRatio CI.95 p-valueAGE 1.07 [1.04;1.10] <0.0001Sex Male 1.00 [1.00;1.00] 1

Female 0.49 [0.37;0.64] <0.0001

Logistic regression was used to investigate gender differences inodds (risks) of CHD adjusted for age.

The age adjusted odds ratio was 0.49 (95%-CI: [0.37;0.64])showing that the risks of CHD were significantly lower for womencompared to men (p<0.0001).

37 / 51

Page 45: Logistic regression analysis - staff.pubhealth.ku.dk

Predicted risks based on logistic regression model

A logistic regression model can be used to predict personalized risks:

log

(pi

1− pi

)= a+ b1Zi + b2Xi + . . .

is equivalent to

pi =exp(a+ b1Zi + b2Xi + . . . )

1+ exp(a+ b1Zi + b2Xi + . . . )

The risks (and risk ratios) depend on all explanatory variablessimultaneously.

38 / 51

Page 46: Logistic regression analysis - staff.pubhealth.ku.dk

Predicted risks based on logistic regression modelPrediction makes most sense for new data

mydata=expand.grid(AGE=c(50,55),Sex=factor(c("Female","Male")))setDT(mydata)mydata

AGE Sex1: 50 Female2: 55 Female3: 50 Male4: 55 Male

mydata[,risk:=predict(fit.add,newdata=mydata,type="response")]mydata

AGE Sex risk1: 50 Female 0.12213812: 55 Female 0.16263533: 50 Male 0.22162844: 55 Male 0.2844255

39 / 51

Page 47: Logistic regression analysis - staff.pubhealth.ku.dk

Visualization of predicted risks

mydata2 <- setDT(expand.grid(AGE=seq(45,62,1),Sex=factor(c("Female","Male"))))

mydata2[,risk:=predict(fit.add,newdata=mydata2,type="response")]

library(ggplot2)ggplot(mydata2,aes(x=AGE,y=risk,group

=Sex,colour=Sex))+geom_line()+ylim(c(0,1))+xlab("Age (years)")+ylab("Risk of CHD")

0.00

0.25

0.50

0.75

1.00

45 50 55 60

Age (years)

Ris

k of

CH

D

Sex

Female

Male

40 / 51

Page 48: Logistic regression analysis - staff.pubhealth.ku.dk

Example with more variables

framingham[,Chol10:=CHOL/10]fit.multi <- glm(Y ∼ AGE + Sex + Chol10 + SBP + Smoke,

family = binomial,data = framingham)publish(fit.multi)

Variable Units Missing OddsRatio CI.95 p-valueAGE 0 1.06 [1.02;1.09] 0.0004181Sex Male 0 Ref

Female 0.38 [0.28;0.52] < 0.0001Chol10 0 1.05 [1.02;1.08] 0.0026086

SBP 0 1.02 [1.01;1.02] < 0.0001Smoke No 1 Ref

Yes 1.19 [0.88;1.60] 0.2510447

41 / 51

Page 49: Logistic regression analysis - staff.pubhealth.ku.dk

Exercise

1. Report the effect of cholesterol on coronary heart disease fromthe multiple logistic regression model

2. Predict the coronary heart disease risks of four smokingfemales all aged 50 and with 150 SBP but with differentcholesterol values:I person 1: 235, person 2: 245, person 3: 351, person 4: 361

mydata=data.frame(AGE=50,Sex=factor("Female",levels(framingham$Sex)),Smoke=factor("Yes",levels(framingham$Smoke)),SBP=150,Chol10=c(23.5,24.5,35.1,36.1))

1. Compute the risk ratios for 10 unit cholesterol changes from245 to 235 and from 361 to 351

2. Repeat 2. and 3. for a male person

42 / 51

Page 50: Logistic regression analysis - staff.pubhealth.ku.dk

Statistical interaction = Effectmodification

The effect of X on Y depends on Z

43 / 51

Page 51: Logistic regression analysis - staff.pubhealth.ku.dk

Effect modification

A statistical interaction (effect modification) requires 3variablesI two explanatory variables X,ZI one outcome Y

In logistic regression the odds ratio which describes the effect of Xon the odds of Y=1 depends on the value of Z

SymmetryIf the effect of variable X on Y is modified by Z then also the effectof Z on Y is modified X.

ExampleThe age (Z) effect on the CHD-risk (Y) may depend on sex (X).

44 / 51

Page 52: Logistic regression analysis - staff.pubhealth.ku.dk

Statistical interaction in R

summary(glm(Y ∼ AGE * SEX, family = binomial, data =framingham))

Alternative notation:

summary(glm(Y ∼ AGE + SEX + AGE:SEX, family = binomial, data = framingham))

45 / 51

Page 53: Logistic regression analysis - staff.pubhealth.ku.dk

Resultssummary(glm(Y ∼ AGE * SEX, family = binomial, data = framingham

))

Call:glm(formula = Y ~ AGE * SEX, family = binomial, data = framingham)

Deviance Residuals:Min 1Q Median 3Q Max

-0.9171 -0.7284 -0.6074 -0.4010 2.3029

Coefficients:Estimate Std. Error z value Pr(>|z|)

(Intercept) 0.091694 2.361000 0.039 0.9690AGE -0.007736 0.044223 -0.175 0.8611SEX -3.544593 1.604311 -2.209 0.0271 *AGE:SEX 0.052967 0.029871 1.773 0.0762 .---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1351.2 on 1362 degrees of freedomResidual deviance: 1300.4 on 1359 degrees of freedomAIC: 1308.4

Number of Fisher Scoring iterations: 4

46 / 51

Page 54: Logistic regression analysis - staff.pubhealth.ku.dk

Statistical interaction

publish(glm(Y ∼ AGE * Sex, family = binomial, data =framingham))

Variable Units OddsRatio CI.95 p-valueAGE: Sex(Male) 1.05 [1.01;1.09] 0.01629

AGE: Sex(Female) 1.10 [1.05;1.15] < 0.0001

Notes:

I The main effects for AGE and Sex have no interpretation (andare therefore not shown).

I One year more in age increases the odds by 5% in males andby 10% in females.

47 / 51

Page 55: Logistic regression analysis - staff.pubhealth.ku.dk

Predicted risk of the model with an additive effect of ageand sex (no effect modification)

Age (years)

Pre

dict

ed C

HD

ris

k

MaleFemale

0 %

25 %

50 %

75 %

100

%

45 50 55 60

48 / 51

Page 56: Logistic regression analysis - staff.pubhealth.ku.dk

Predicted risk of the model with an interaction between ageand sex (with effect modification)

Age (years)

Pre

dict

ed C

HD

ris

k

MaleFemale

0 %

25 %

50 %

75 %

100

%

45 50 55 60

49 / 51

Page 57: Logistic regression analysis - staff.pubhealth.ku.dk

Testing for statistical interactionfit.add=glm(Y ∼ AGE + SEX, family = binomial, data = framingham)fit.int=glm(Y ∼ AGE * SEX, family = binomial, data = framingham)anova(fit.add,fit.int,test="Chisq")

Analysis of Deviance Table

Model 1: Y ~ AGE + SEXModel 2: Y ~ AGE * SEX

Resid. Df Resid. Dev Df Deviance Pr(>Chi)1 1360 1303.52 1359 1300.4 1 3.1676 0.07511 .---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

There is no statistically significant modification of the age effect by gender(p>0.05).

50 / 51

Page 58: Logistic regression analysis - staff.pubhealth.ku.dk

Take home messages

I (Multiple) logistic regression describes associations betweenone or several explanatory variables and the risk of an event(binary outcome).

I Analysis of an exposure of interest can be adjusted forpotential confounders

I In an additive model (no interactions), odds ratios do notdepend on the other explanatory variables

I Risks and risk ratios predicted by the model depend on theother explanatory variables

I Linearity and absence of interaction are assumptions whichshould be investigated

51 / 51