Log linear Regressions - ictr.johnshopkins.edu · Some Useful Stata Information: John McGready 1...

Some Useful Stata Information: John McGready


Log linear Regressions

John McGready

Johns Hopkins University

Quick Review

Linear Regression: a method for estimating the mean level of a continuous outcome variable as a linear function of potentially multiple predictors:

Eg:

$y = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \dots$

Slopes have “mean difference” interpretation

Examples:

Average Hospital LOS -> age, sex, SBP on admission

Average SBP -> age, sex, BMI …

Quick Review

Regression: more generally a method for estimating a function of the mean level of an outcome variable as a linear function of potentially multiple predictors:

Again, Linear:

$y = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \dots$

Could write as:

$f(y) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \dots$

Where $f$ is the “identity” function, ie: $f(a) = a$

Section A

– The Case for Logistic Regression –



Patients with Sepsis

Sample of 106 patients admitted to the ICU at a large U.S. hospital¹

Predictors of death in the ICU for patients with severe sepsis (blood poisoning)

Can we predict the risk of death (y) from 5 potential predictors (x’s) in one model?

This is an observational study

¹ Pine R et al. Determinants of Organ Malfunction or Death in Patients With Intra-abdominal Sepsis: A Discriminant Analysis. Archives of Surgery, Feb 1983; 118: 242-249.

Patients with Sepsis: The Data

Outcome of interest: Death

$y = \begin{cases} 0 & \text{Survive} \\ 1 & \text{Death} \end{cases}$


Potential predictors:
• Shock (x1)
• Malnutrition (x2)
• Alcohol Use (x3)
• Age (x4)
• Bowel infarction (x5)

Shock, malnutrition, alcohol, and bowel infarction are all recorded as binary (each is coded “1” for “yes” or “0” for “no”)

Age is continuous and measured in years


Twenty-one of the 106 (19.8 %) patients died in the ICU

Nine percent were in shock at time of ICU admission, 21% had a history of alcohol abuse, 30% were malnourished, and 13% had a bowel infarction at the time of ICU admission

The average age of the patients is 51 years and the age range in the sample is from 17 to 94 years


We want to answer some questions:
• “How is death for patients with severe sepsis associated with these five potential predictors?”
• Do certain predictors confound each other’s relationship with death?
• Can we estimate risk (proportion of patients who will die) given patient characteristics at time of ICU admission?

Can employ logistic regression

Examples

Let’s start with age as the first predictor we look at

Could we use linear regression? Here is a scatterplot of y versus x

[Figure: “Death vs. Age: ICU Patients with Severe Sepsis Upon Admission”. Scatterplot of death (0 or 1) against age (20 to 100 years).]

Could we use linear regression?

[Figure: “Death vs. Age with Fit from Linear Regression”. Death (0 or 1) and fitted values against age (20 to 100 years).]
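To see why a straight-line fit is awkward for a 0/1 outcome, here is a minimal Python sketch. The eight (age, death) pairs are made up for illustration and are not the sepsis data; the slope and intercept use the standard least-squares formulas for a single predictor.

```python
# Illustrative only: made-up (age, death) pairs, NOT the sepsis data.
ages = [20, 30, 40, 50, 60, 70, 80, 90]
death = [0, 0, 0, 0, 0, 1, 1, 1]

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(death) / n

# Ordinary least-squares slope and intercept for one predictor.
sxx = sum((x - mean_x) ** 2 for x in ages)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, death))
slope = sxy / sxx
intercept = mean_y - slope * mean_x


def linear_fit(age):
    """Fitted value from the straight-line model at a given age."""
    return intercept + slope * age


for age in (20, 55, 100):
    print(age, round(linear_fit(age), 3))
```

With these made-up values the fitted line drops below 0 at age 20 and rises above 1 at age 100, so the fitted values cannot all be read as probabilities.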

Logistic Regression

A regression method to deal with the case when the dependent (outcome) variable y is binary (dichotomous).

There can be many predictor variables (x’s).



Objectives of Logistic Regression

Estimating magnitude of outcome/exposure relationship
• To evaluate the association of a binary outcome with a set of predictors

Prediction
• Develop an equation to determine the probability or likelihood that an individual has the condition (y = 1) that depends on the independent variables (the x’s)

Linear vs. Logistic Regression

Linear regression
• Outcome variable y is continuous

Logistic regression
• Outcome variable y is binary (dichotomous)

The only (data type) question a researcher need ask when choosing a regression method is: “What does my outcome look like?”

Either regression method allows for many x’s (independent variables). These x’s can be either continuous or discrete.

The Logistic Regression Model

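Equation for Pr(y = 1), the proportion of subjects with y = 1

$p = \dfrac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots}}$

e is the “natural constant” (approximately 2.718)

p = probability (proportion) of y = 1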


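Why is this equation appropriate?

$e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots} > 0$

And so it follows:

$0 < \dfrac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots}} < 1$

$0 < p < 1$

This formulation for p ensures that our estimates of the probability of having the condition “y” are between 0 and 1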
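The bound on p can be checked numerically. The sketch below evaluates the logistic formula p = e^(b0 + b1·x1 + ...) / (1 + e^(b0 + b1·x1 + ...)); the function name and the coefficient values are hypothetical and chosen only to push the linear predictor far in both directions.

```python
import math


def logistic_probability(intercept, slopes, xs):
    """Return e^lp / (1 + e^lp), where lp = intercept + sum of slope * x."""
    linear_predictor = intercept + sum(b * x for b, x in zip(slopes, xs))
    return math.exp(linear_predictor) / (1 + math.exp(linear_predictor))


# Hypothetical intercepts; the single slope is multiplied by x = 0,
# so the linear predictor equals the intercept in each case.
probabilities = [
    logistic_probability(b0, [0.5], [0.0]) for b0 in (-20, -2, 0, 2, 20)
]
for p in probabilities:
    print(p)
```

However negative or positive the linear predictor is within this range, the result stays strictly between 0 and 1, and a linear predictor of 0 corresponds to p = 0.5.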


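Can be transformed as follows

$\ln\left(\dfrac{p}{1-p}\right) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \dots$

sometimes written as:

$\log\left(\dfrac{p}{1-p}\right) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \dots$

where ln (or log) is the natural logarithm (base e)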


Recall, the odds of an event is defined as:

$\text{odds} = \dfrac{p}{1-p}$

Where p = probability of having the event “y”, i.e. the proportion of persons with y = 1
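As a small worked illustration of the odds definition, the sketch below converts the observed proportion of deaths in the sample (21 of 106) to odds and log odds, and back again. The function names are hypothetical helpers, not part of any statistics package.

```python
import math


def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1 - p)


def log_odds(p):
    """Natural log of the odds."""
    return math.log(odds(p))


def probability_from_log_odds(value):
    """Invert the log odds: p = e^value / (1 + e^value)."""
    return math.exp(value) / (1 + math.exp(value))


p_death = 21 / 106  # 21 of the 106 patients died in the ICU
print(round(odds(p_death), 3))
print(round(log_odds(p_death), 3))
print(round(probability_from_log_odds(log_odds(p_death)), 3))
```

The odds here equal 21/85 (deaths over survivors), and converting the log odds back recovers the original proportion.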

Logistic Regression Model

For the ICU data set, we could try to estimate the following:

$\ln\left(\dfrac{p}{1-p}\right) = \hat{\beta}_0 + \hat{\beta}_1 x_1$

p = probability of death in the ICU (proportion of persons who die in the ICU), x1 = age

$\hat{\beta}_0$ and $\hat{\beta}_1$ are called regression coefficients

Another way to write the above equation:

$\log(\text{ODDS of Death}) = \hat{\beta}_0 + \hat{\beta}_1 x_1$

Logistic Regression Model

The higher the odds of an event, the larger the probability of an event

A predictor x1 that is positively associated with the odds will also be positively associated with the probability of the event (i.e. estimated slope $\hat{\beta}_1$ will be positive)

A predictor x1 that is negatively associated with the odds will also be negatively associated with the probability of the event (i.e. estimated slope $\hat{\beta}_1$ will be negative)

Example: Death and Age

Results from logistic regression of log odds of Death on age:

Variable                        Estimated Coefficient    Standard Error
Age ($\hat{\beta}_1$)           0.05                     0.015
Constant ($\hat{\beta}_0$)      -4.37                    0.98

Example: Death and Age

The resulting equation

$\ln\left(\dfrac{p}{1-p}\right) = -4.37 + 0.05 \cdot \text{Age}$

Where p is the estimated probability of death (i.e. the estimated proportion of persons who die) amongst persons of a given age
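The fitted equation ln(p / (1 - p)) = -4.37 + 0.05·Age can be turned into estimated probabilities directly. This is a minimal sketch using the two estimates from the results table; the function name and the three example ages are arbitrary choices within the observed age range (17 to 94 years).

```python
import math

B0 = -4.37  # estimated intercept from the results table
B1 = 0.05   # estimated slope for age from the results table


def estimated_probability_of_death(age):
    """Convert the fitted log odds at a given age into a probability."""
    fitted_log_odds = B0 + B1 * age
    return math.exp(fitted_log_odds) / (1 + math.exp(fitted_log_odds))


for age in (20, 50, 80):
    print(age, round(estimated_probability_of_death(age), 3))
```

Because the slope is positive, the estimated probability rises with age, which matches the positive association described for the log odds.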


The estimated coefficient ($\hat{\beta}_1$) of age (x1) is positive; hence, we have estimated a positive association between age and log odds of death

Therefore, we have estimated a positive association between age and probability of death

How can we actually interpret the value 0.05, though?

Let’s write out the equation comparing two groups of individuals who differ in age by one year:
• Group 1, age = k years
• Group 2, age = k + 1 years


The resulting equations estimating the ln odds of death in each age group

$\ln(\text{Odds of Death};\ x_1 = k + 1) = \hat{\beta}_0 + \hat{\beta}_1 (k + 1)$

$\ln(\text{Odds of Death};\ x_1 = k) = \hat{\beta}_0 + \hat{\beta}_1 k$


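Multiplying out, and taking the difference (subtracting)

$\ln(\text{Odds of Death};\ x_1 = k + 1) = \hat{\beta}_0 + \hat{\beta}_1 k + \hat{\beta}_1$

$\ln(\text{Odds of Death};\ x_1 = k) = \hat{\beta}_0 + \hat{\beta}_1 k$

So, when the dust settles:

$\hat{\beta}_1 = \ln(\text{Odds of Death};\ x_1 = k + 1) - \ln(\text{Odds of Death};\ x_1 = k)$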


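Now

$\hat{\beta}_1 = \ln(\text{Odds of Death};\ x_1 = k + 1) - \ln(\text{Odds of Death};\ x_1 = k)$

“Reversing” one of the famous properties of logarithms:

$\hat{\beta}_1 = \ln\left(\dfrac{\text{Odds of Death};\ x_1 = k + 1}{\text{Odds of Death};\ x_1 = k}\right) = \ln(\widehat{OR})$

So, the estimated slope $\hat{\beta}_1$ for x1 is the natural log of an estimated odds ratio

To get the estimated odds ratio, exponentiate $\hat{\beta}_1$, i.e.:

$\widehat{OR} = e^{\hat{\beta}_1}$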

In our example, recall $\hat{\beta}_1 = 0.05$

Here, $\widehat{OR} = e^{\hat{\beta}_1} = e^{0.05} \approx 1.05$

The estimated odds ratio of death for a one-year age difference is 1.05, older to younger

If we were to compare two groups of people who differ by one year of age, the estimated odds ratio for death of the older group to the younger group is 1.05 (this is valid for age comparisons within our original range of data, 17-94 years):
• 60 year olds to 59 year olds
• 45 year olds to 44 year olds
• 27 year olds to 26 year olds
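The claim that the one-year odds ratio is the same at every age can be checked from the fitted equation ln(odds of death) = -4.37 + 0.05·Age. The sketch below computes the odds at each pair of ages and takes the ratio; the function name is a hypothetical helper.

```python
import math

B0 = -4.37  # estimated intercept
B1 = 0.05   # estimated slope for age


def odds_of_death(age):
    """Estimated odds of death at a given age: e^(B0 + B1 * age)."""
    return math.exp(B0 + B1 * age)


ratios = []
for older, younger in ((60, 59), (45, 44), (27, 26)):
    ratio = odds_of_death(older) / odds_of_death(younger)
    ratios.append(ratio)
    print(older, "vs", younger, round(ratio, 2))
```

Each ratio equals e^0.05, about 1.05, because the intercept and the common B1·k term cancel when the two odds are divided.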

General Interpretation: Slope in Logistic Regression

$\hat{\beta}_1$ is the estimated change in the log odds of the outcome for a one unit increase in x1

“Change in the log odds of death for a one year increase in age”

It estimates the log odds ratio for comparing two groups of observations:
• One group with x1 one unit higher than the other

This estimated slope can be exponentiated to get the corresponding estimated odds ratio


$\hat{\beta}_1$ is just an estimate for the true “population level” slope; similarly, $e^{\hat{\beta}_1}$ is just an estimate of the population level odds ratio

Can get 95% confidence interval for slope by taking $\hat{\beta}_1 \pm 2\,\widehat{SE}(\hat{\beta}_1)$

0.05 ± 2×0.015 -> (0.02, 0.08)

Can get 95% confidence interval for odds ratio by exponentiating (anti-logging) endpoints of 95% CI for slope

$(e^{0.02}, e^{0.08}) = (1.02, 1.08)$
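The interval arithmetic for the age slope (estimate 0.05, standard error 0.015) can be reproduced in a few lines. This sketch uses the same "estimate ± 2 standard errors" rule as the slides; variable names are illustrative.

```python
import math

slope = 0.05           # estimated coefficient for age
standard_error = 0.015 # its estimated standard error

# Approximate 95% CI for the slope: estimate +/- 2 standard errors.
slope_lower = slope - 2 * standard_error
slope_upper = slope + 2 * standard_error

# Exponentiate the estimate and both endpoints to move to the odds ratio scale.
odds_ratio = math.exp(slope)
or_lower = math.exp(slope_lower)
or_upper = math.exp(slope_upper)

print(round(slope_lower, 2), round(slope_upper, 2))
print(round(odds_ratio, 2), round(or_lower, 2), round(or_upper, 2))
```

Note that the interval is built on the log odds scale first and only then exponentiated, which is why it is not symmetric around the odds ratio.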


• Question: What is estimated odds ratio (and 95% CI) of death for 50 year old subjects compared to 40 year old subjects?
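One way to approach this question, sketched under the assumption that the same "estimate ± 2 standard errors" rule is used: a 10-year difference multiplies the log odds ratio by 10, and the standard error of 10 times the slope is 10 times the slope's standard error. The estimate 0.05 and standard error 0.015 come from the age-only model; variable names are illustrative.

```python
import math

slope = 0.05            # estimated coefficient for age (per year)
standard_error = 0.015  # its estimated standard error
years = 10              # 50 year olds compared to 40 year olds

# Work on the log odds scale: scale the slope and its CI endpoints by 10.
log_or = years * slope
log_or_lower = years * (slope - 2 * standard_error)
log_or_upper = years * (slope + 2 * standard_error)

# Exponentiate to get the odds ratio and its approximate 95% CI.
or_10 = math.exp(log_or)
or_10_lower = math.exp(log_or_lower)
or_10_upper = math.exp(log_or_upper)

print(round(or_10, 2), round(or_10_lower, 2), round(or_10_upper, 2))
```

The 10-year odds ratio is the one-year odds ratio raised to the 10th power, not the one-year value multiplied by 10.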


General Interpretation: Intercept in Logistic Regression

• Question: What is the interpretation of the intercept?

$\ln\left(\dfrac{p}{1-p}\right) = -4.37 + 0.05 \cdot \text{Age}$

Patients with Sepsis: A Multivariable Model

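Possible next step: fit a logistic regression with all 5 predictors

$\ln\left(\dfrac{p}{1-p}\right) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3 + \hat{\beta}_4 x_4 + \hat{\beta}_5 x_5$

x1 – x5 defined as before

p = Pr(y = 1), the probability of death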

Patients with Sepsis: The Model

Note: other analyses are possible. We could review other potential models, leaving out non-statistically significant predictors from the previous model, looking at “intermediate” models with some subset of the 5 predictors, etc.

Model building is part art, part science

Because there are only 106 observations, I am going to refit the model without malnutrition and infarction, as they were not statistically significant, if only to see how their omission impacts the other 3 associations

Presenting the Results

Frequently, the results of the unadjusted and adjusted analyses are presented in one table

Not only is this a concise summary, it allows for side-by-side comparisons of the unadjusted and adjusted estimates for each predictor which helps give a sense of confounding amongst the predictors



Table of results




All three results point to larger odds (and hence risk) of death for patients in shock at time of admission. The adjusted estimates are larger than the unadjusted odds ratio, but all three are statistically significant, and the 95% CIs share many common values: there is a lot of uncertainty in the estimates

Similar results for the association between odds (risk) of death and history of alcohol use

Both malnutrition and infarction are positively associated with the odds of death in both the unadjusted and adjusted estimates, and for both predictors the magnitudes of the unadjusted and adjusted odds ratios are similar; however, neither is statistically significant in the multiple logistic regression model (possibly because of low sample size/power)

The odds ratio of death for a one year difference in age was relatively consistent in value across the three models compared


Odds ratios give an estimate of the relative odds of the outcome, which can help us assess risk factors

However, odds ratios are neither direct comparisons of risk, nor do they tell us anything about the actual risk of death for different subsets of patients with different characteristics at the time of study

As this is not a case-control study, we are allowed to estimate risk and relative risk via the sample – how can we do this with logistic regression results?



Our estimated equation (multiple logistic regression)

$\ln\left(\dfrac{p}{1-p}\right) = -8.67 + 3.43 \cdot \text{shock} + 0.94 \cdot \text{malnut} + 2.91 \cdot \text{alcohol} + 0.082 \cdot \text{age} + 1.85 \cdot \text{infarction}$

We can use this to estimate the ln odds of death for any group of patients with any combination of values for the 5 predictors


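By the formulation of logistic regression:

$\hat{p} = \dfrac{\widehat{ODDS}}{1 + \widehat{ODDS}} = \dfrac{e^{\ln(\widehat{ODDS})}}{1 + e^{\ln(\widehat{ODDS})}}$

Translate equation back into (estimated) probability function

$\hat{p} = \dfrac{e^{\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3 + \hat{\beta}_4 x_4 + \hat{\beta}_5 x_5}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3 + \hat{\beta}_4 x_4 + \hat{\beta}_5 x_5}}$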

How Can You Present These Results?

So for example, the estimated ln odds of death for a 50 year old patient with sepsis who has a history of alcohol use, but is not in shock, not malnourished, and does not have infarction at the time of surgery is given by:

$\ln\left(\dfrac{p}{1-p}\right) = -8.67 + 3.43 \cdot 0 + 0.94 \cdot 0 + 2.91 \cdot 1 + 0.082 \cdot 50 + 1.85 \cdot 0$

$= -8.67 + 2.91 + 4.1$

$= -1.66$


So the estimated proportion (probability, risk) of death during surgery for this group of patients is given by:

$\hat{p} = \dfrac{\widehat{ODDS}}{1 + \widehat{ODDS}} = \dfrac{e^{-1.66}}{1 + e^{-1.66}} = \dfrac{0.19}{1.19} = 0.16$
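The same calculation can be scripted so that any patient profile can be plugged in. This is a minimal sketch: the coefficients are the ones in the estimated multiple logistic regression equation (-8.67, 3.43, 0.94, 2.91, 0.082, 1.85), while the dictionary layout and function names are hypothetical choices made for illustration.

```python
import math

INTERCEPT = -8.67
COEFFICIENTS = {
    "shock": 3.43,
    "malnut": 0.94,
    "alcohol": 2.91,
    "age": 0.082,
    "infarction": 1.85,
}


def log_odds_of_death(patient):
    """Estimated ln odds of death: intercept plus coefficient * value for each predictor."""
    return INTERCEPT + sum(COEFFICIENTS[name] * patient[name] for name in COEFFICIENTS)


def probability_of_death(patient):
    """Translate the estimated ln odds back into an estimated probability."""
    value = log_odds_of_death(patient)
    return math.exp(value) / (1 + math.exp(value))


# 50 year old with a history of alcohol use; no shock, malnutrition, or infarction.
patient = {"shock": 0, "malnut": 0, "alcohol": 1, "age": 50, "infarction": 0}

print(round(log_odds_of_death(patient), 2))
print(round(probability_of_death(patient), 2))
```

Changing a single entry in `patient` (for example setting `shock` to 1) gives the estimated risk for a different subgroup, which is one way to build a table or graph of estimated risks.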


Possible graphical display

[Figure not reproduced. Surviving labels: “Alcohol and malnutrition”, “Alcohol”.]

How logistic regression results should be presented

• The units of the predictor variables should be clearly indicated. They should be “real” units (like inches), not statistical ones (like standard deviations). The units in the report do not have to be the same units used in the analysis. E.g.: Birthweight (in grams), Smoking (Yes/No), Age (in years)

• The ranges of the predictor variables should be indicated (so we know when we are extrapolating beyond the data), and/or the number of subjects within each range of predictor variable.

• The methods by which the model was constructed and the assumptions checked should be clear.

How logistic regression results should be presented

• The odds ratios ($e^{\hat{\beta}}$) and their 95% confidence intervals should always be reported, NOT the $\hat{\beta}$’s!

• The baseline odds ($e^{\hat{\beta}_0}$) should be reported unless the study is a case-control study.

• If the model is intended to be used for prediction that could determine a medical action (like hospital admission), it must be validated.

Statistical methods

… we used multivariable logistic regression analysis to generate the odds ratio of receiving chemotherapy in women with breast cancer and to determine the effect of age (Table 1) on chemotherapy use. In this model, we adjusted for race (white, black, or others), tumor stage (stage I, stage II, or stage IIIA), node status, hormone receptor status (Table 2), whether the patient had received surgery and radiation therapy (categorized as breast-conserving surgery without radiation, breast-conserving surgery with radiation, or mastectomy), and adjuvant hormone therapy use (yes or no). In addition to odds ratios, we generated the probabilities of receiving chemotherapy from the parameters of the logistic regression for women with different ages by holding other factors constant.

Finally, we performed sensitivity analyses to assess the potential effects of unmeasured confounders on the associations observed between age and chemotherapy use (23).



Logistic Regression Results

How Not to Present Logistic Regression Results

Section C

When Time is Of Interest: Regressions for Incidence Rate Data

When Time Is Of Interest

Logistic Regression handles the occurrence/non-occurrence of events without regard to exposure time differences between subjects/groups

Frequently, not only is the count of outcomes important, but also the time at risk

Examples:
• (time to) relapse among cancer patients in remission on different treatments
• (time to) smoking cessation amongst subjects on a nicotine patch versus those who also receive intensive counseling


Ignoring the time component may throw away important information

Example: Cancer Patients in Remission

Treatment A: 40% of patients relapse in 5 year follow-up

Treatment B: 40% of patients relapse in 5 year follow-up

Ignoring the time component may throw away important information

Example: Cancer Patients in Remission

Treatment A: 40% of patients relapse in 5 year follow-up, but the majority of relapses occurred within 1 year of starting treatment

Treatment B: 40% of patients relapse in 5 year follow-up, but the majority of relapses occurred within 4-5 years of starting treatment

Two Possible regression choices: Poisson Regression, and Cox Proportional Hazards Regression

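Both model a function of the incidence rate (IR) as a linear combination of predictors

$f(IR) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \dots$

Where $f(a) = \ln(a)$, i.e.

$\ln(IR) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \dots$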

By similar logic as with logistic regression, slopes are interpretable as ln(incidence rate ratios)

Incidence Rate Ratio synonyms include Hazard Ratio and Relative Risk
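To make the "slope is a ln(incidence rate ratio)" statement concrete, here is a sketch with one binary predictor (x1 = 1 for treated, 0 for control). The event counts and person-weeks are hypothetical numbers invented for this illustration; with a single binary predictor the intercept is the ln rate in the control group and the slope is the difference in ln rates.

```python
import math

# Hypothetical data, invented for illustration only.
control_events, control_weeks = 10, 200
treated_events, treated_weeks = 5, 250

ir_control = control_events / control_weeks  # incidence rate when x1 = 0
ir_treated = treated_events / treated_weeks  # incidence rate when x1 = 1

# ln(IR) = b0 + b1 * x1 with one binary predictor.
b0 = math.log(ir_control)
b1 = math.log(ir_treated) - math.log(ir_control)

incidence_rate_ratio = math.exp(b1)
print(round(b1, 3), round(incidence_rate_ratio, 3))
```

Exponentiating the slope recovers the ratio of the two incidence rates, by the same logarithm property used for odds ratios in logistic regression.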



Comparison of Regression Choices, Part 1

Poisson Regression requires data to be “grouped” into subcategories; Cox PH regression can work with individual level data

Treatment Group (n=3):
• Person 1 has event: 3 weeks of follow-up
• Person 2 has no event: 5 weeks of follow-up
• Person 3 has event: 7 weeks of follow-up

Cox can work with the 3 individual pieces; Poisson would require information to be aggregated into group rate:

$\dfrac{(1 + 0 + 1)\ \text{events}}{(3 + 5 + 7)\ \text{weeks}} = \dfrac{2\ \text{events}}{15\ \text{weeks}}$
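The aggregation step for the three-person treatment group (events at 3 and 7 weeks, no event at 5 weeks) can be written out as a short sketch. The tuple layout is a hypothetical representation of the individual-level records.

```python
# Individual-level records: (had_event, weeks_of_follow_up).
follow_up = [
    (True, 3),   # Person 1 has event at 3 weeks
    (False, 5),  # Person 2 has no event in 5 weeks
    (True, 7),   # Person 3 has event at 7 weeks
]

# Aggregate to the group level, as needed for the grouped rate.
total_events = sum(1 for had_event, _ in follow_up if had_event)
total_weeks = sum(weeks for _, weeks in follow_up)
group_rate = total_events / total_weeks  # events per person-week

print(total_events, total_weeks, round(group_rate, 3))
```

The individual records are what a Cox model would use directly; the three aggregated numbers are all that remains once the data are grouped.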

Poisson Regression and Cox PH regression handle time as a predictor differently

What we are doing with both approaches is modeling the ln(incidence rate) as a function of potentially multiple predictors. One of these predictors can be time:

[Figure: “ln(hazard) vs. time: Several Scenarios”. ln(hazard) plotted against follow-up time.]

Poisson Regression allows the user to specify the nature of the relationship between ln(IR) and time; ie, to add “x’s” for time

PROs:
• This allows researcher to investigate different possibilities for the relationship between ln(hazard) and time (linear? constant? non-linear?)
• This allows researcher to investigate changing associations between other predictors and risk over time (non-proportional hazards)

CONs:
• The relationship between ln(hazard) and time can be incorrectly specified

Cox Regression “takes care” of time on its own; user cannot include “x’s” for time

PROs:
• Takes data at “face” value and figures out best estimate of relationship between ln(hazard) and time

CONs:
• This does not allow researcher to investigate changing associations between other predictors and risk over time (non-proportional hazards)

Example: Poisson Regression

Body Checking in Ice Hockey²

² Emery C, et al. Risk of Injury Associated With Body Checking Among Youth Ice Hockey Players. Journal of the American Medical Association, Vol 303, No 22 (2010).

Example: Cox Regression

RCT for Melanoma treatments³

Taken directly from methods:

“Hazard ratios for treatment with vemurafenib, as compared with dacarbazine, were estimated with the use of unstratified Cox regression. We estimated event–time distributions using the Kaplan–Meier method. All reported P values are two-sided, and confidence intervals are at the 95% level.”

³ Chapman P, et al. Improved Survival with Vemurafenib in Melanoma with BRAF V600E Mutation. New England Journal of Medicine, Vol 364, No 26 (2011).

