Log linear Regressions - ictr.johnshopkins.edu · Some Useful Stata Information: John McGready 1...
Some Useful Stata Information: John McGready
Log linear Regressions
John McGready
Johns Hopkins University
Quick Review
Linear regression: a method for estimating the mean level of a continuous outcome variable as a linear function of potentially multiple predictors
E.g.: $y = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \ldots$
Slopes have a “mean difference” interpretation
Examples:
Average hospital LOS -> age, sex, SBP on admission
Average SBP -> age, sex, BMI …
Quick Review
Regression: more generally, a method for estimating a function of the mean level of an outcome variable as a linear function of potentially multiple predictors
Again, linear: $y = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \ldots$
Could write as: $f(y) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \ldots$
Where $f$ is the “identity” function, i.e. $f(a) = a$
Section A
– The Case for Logistic Regression –
Patients with Sepsis
Sample of 106 patients admitted to the ICU at a large U.S. hospital¹
Predictors of death in the ICU for patients with severe sepsis (blood poisoning)
Can we predict the risk of death (y) from 5 potential predictors (x’s) in one model?
This is an observational study
1 Pine R et al. Determinants of Organ Malfunction or Death in Patients With Intra-abdominal Sepsis: A Discriminant Analysis. Archives of Surgery, Feb 1983; 118: 242-249.
Patients with Sepsis: The Data
Outcome of interest: Death
$y = 0$: survive; $y = 1$: death
Patients with Sepsis: The Data
Potential predictors:
Shock (x1)
Malnutrition (x2)
Alcohol use (x3)
Age (x4)
Bowel infarction (x5)
Shock, malnutrition, alcohol, and bowel infarction are all recorded as binary (Each is coded “1” for “yes” or “0” for “no”)
Age is continuous and measured in years
Patients with Sepsis: The Data
Twenty-one of the 106 patients (19.8%) died in the ICU
Nine percent were in shock at the time of ICU admission, 21% had a history of alcohol abuse, 30% were malnourished, and 13% had a bowel infarction at the time of ICU admission
The average age of the patients is 51 years, and ages in the sample range from 17 to 94 years
Patients with Sepsis: The Data
We want to answer some questions:
“How is death for patients with severe sepsis associated with these five potential predictors?”
Do certain predictors confound each other’s relationship with death?
Can we estimate risk (proportion of patients who will die) given patient characteristics at time of ICU admission?
Can employ logistic regression
Examples
Let’s start with age as the first predictor we look at
Could we use linear regression? Here is a scatterplot of y versus x
[Figure: “ICU Patients with Severe Sepsis Upon Admission: Death vs. Age”. Scatterplot of death (0 to 1) against age (20 to 100 years).]
Examples
Could we use linear regression?
[Figure: “Death vs. Age with Fit from Linear Regression”. Death (0/1) and fitted values plotted against age (years), 20 to 100.]
Logistic Regression
A regression method to deal with the case when the dependent (outcome) variable y is binary (dichotomous).
There can be many predictor variables (x’s).
Objectives of Logistic Regression
Estimating magnitude of outcome/exposure relationship
To evaluate the association of a binary outcome with a set of predictors
Prediction
Develop an equation to determine the probability or likelihood that an individual has the condition (y = 1), depending on the independent variables (the x’s)
Linear vs. Logistic Regression
Linear regression: outcome variable y is continuous
Logistic regression: outcome variable y is binary (dichotomous)
The only (data type) question a researcher need ask when choosing a regression method is: “What does my outcome look like?”
Either regression method allows for many x’s (independent variables). These x’s can be either continuous or discrete.
The Logistic Regression Model
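Equation for Pr(y = 1), the proportion of subjects with y = 1:
$p = \dfrac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots}}$
e is the “natural constant”, approximately 2.718
p = probability (proportion) of y = 1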
The Logistic Regression Model
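Why is this equation appropriate?
$0 < e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots} < \infty$
And so it follows:
$0 < \dfrac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots}} < 1$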
The Logistic Regression Model
$0 < p < 1$
This formulation for p ensures that our estimates of the probability of having the condition “y” are between 0 and 1
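The bound on p can be checked numerically. The following is a minimal sketch, not part of the original slides: the function name `logistic` and the linear-predictor values are hypothetical, chosen only to show that the result stays strictly between 0 and 1.

```python
import math


def logistic(linear_predictor: float) -> float:
    """Return p = e^z / (1 + e^z), where z = b0 + b1*x1 + b2*x2 + ..."""
    # Two algebraically equivalent forms, chosen by sign so exp() never overflows.
    if linear_predictor >= 0:
        return 1.0 / (1.0 + math.exp(-linear_predictor))
    ez = math.exp(linear_predictor)
    return ez / (1.0 + ez)


# Hypothetical values of the linear predictor (not taken from the sepsis data).
for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"z = {z:6.1f}  ->  p = {logistic(z):.5f}")
```

Very negative values of the linear predictor give p near 0, very positive values give p near 1, and a value of 0 gives p = 0.5.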
The Logistic Regression Model
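Can be transformed as follows:
$\log\left(\dfrac{p}{1-p}\right) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \ldots$
sometimes written as:
$\ln\left(\dfrac{p}{1-p}\right) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \ldots$
where ln (or log) is the natural logarithm (base e)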
The Logistic Regression Model
Recall, the odds of an event is defined as:
$\text{odds} = \dfrac{p}{1-p}$
Where p = probability of having the event “y”, i.e. the proportion of persons with y = 1
Logistic Regression Model
For the ICU data set, we could try to estimate the following:
$\ln\left(\dfrac{p}{1-p}\right) = \hat{\beta}_0 + \hat{\beta}_1 x_1$
p = probability of death in the ICU (proportion of persons who die in the ICU), x1 = age
$\hat{\beta}_0$ and $\hat{\beta}_1$ are called regression coefficients
Another way to write the above equation:
$\log(\text{odds of death}) = \hat{\beta}_0 + \hat{\beta}_1 x_1$
Logistic Regression Model
The higher the odds of an event, the larger the probability of an event
A predictor x1 that is positively associated with the odds will also be positively associated with the probability of the event (i.e. the estimated slope $\hat{\beta}_1$ will be positive)
A predictor x1 that is negatively associated with the odds will also be negatively associated with the probability of the event (i.e. the estimated slope $\hat{\beta}_1$ will be negative)
Example: Death and Age
Results from logistic regression of log odds of death on age:
Variable                      Estimated Coefficient    Standard Error
Age ($\hat{\beta}_1$)         0.05                     0.015
Constant ($\hat{\beta}_0$)    -4.37                    0.98
Example: Death and Age
The resulting equation:
$\ln\left(\dfrac{p}{1-p}\right) = -4.37 + 0.05 \times \text{Age}$
Where p is the estimated probability of death (i.e. the estimated proportion of persons who die) amongst persons of a given age
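To make the fitted equation concrete, the sketch below converts estimated ln odds into estimated probabilities of death at a few ages. It is illustrative only: the function names and the chosen ages (all inside the observed 17-94 range) are assumptions, and the coefficients are the rounded values shown on the slide, so the results are approximate.

```python
import math

# Rounded estimates from the slide: intercept and slope for age.
B0, B1 = -4.37, 0.05


def log_odds_death(age: float) -> float:
    """Estimated ln odds of death at a given age."""
    return B0 + B1 * age


def prob_death(age: float) -> float:
    """Estimated probability of death: odds / (1 + odds)."""
    odds = math.exp(log_odds_death(age))
    return odds / (1.0 + odds)


# Hypothetical ages picked for illustration.
for age in (20, 50, 80):
    print(f"age {age}: ln odds = {log_odds_death(age):.2f}, p = {prob_death(age):.3f}")
```

Because the slope is positive, both the estimated ln odds and the estimated probability increase with age.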
Example: Death and Age
The estimated coefficient ($\hat{\beta}_1$) of age (x1) is positive; hence, we have estimated a positive association between age and log odds of death
Therefore, we have estimated a positive association between age and probability of death
How can we actually interpret the value 0.05, though?
Let’s write out the equation comparing two groups of individuals who differ in age by one year:
Group 1, age = k years
Group 2, age = k + 1 years
Example: Death and Age
The resulting equations estimating the ln odds of death in each age group:
$\ln(\text{odds of death};\ x_1 = k+1) = \hat{\beta}_0 + \hat{\beta}_1 (k+1)$
$\ln(\text{odds of death};\ x_1 = k) = \hat{\beta}_0 + \hat{\beta}_1 k$
Example: Death and Age
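Multiplying out, and taking the difference (subtracting):
$\ln(\text{odds of death};\ x_1 = k+1) = \hat{\beta}_0 + \hat{\beta}_1 k + \hat{\beta}_1$
$\ln(\text{odds of death};\ x_1 = k) = \hat{\beta}_0 + \hat{\beta}_1 k$
So, when the dust settles:
$\hat{\beta}_1 = \ln(\text{odds of death};\ x_1 = k+1) - \ln(\text{odds of death};\ x_1 = k)$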
Example: Death and Age
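Now
$\hat{\beta}_1 = \ln(\text{odds of death};\ x_1 = k+1) - \ln(\text{odds of death};\ x_1 = k)$
“Reversing” one of the famous properties of logarithms, $\ln(a) - \ln(b) = \ln(a/b)$:
So, the estimated slope for x1 is the natural log of an estimated odds ratio:
$\hat{\beta}_1 = \ln\left(\dfrac{\text{odds of death};\ x_1 = k+1}{\text{odds of death};\ x_1 = k}\right) = \ln(\widehat{OR})$
To get the estimated odds ratio, exponentiate $\hat{\beta}_1$, i.e.:
$\widehat{OR} = e^{\hat{\beta}_1}$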
Example: Death and Age
In our example, recall $\hat{\beta}_1 = 0.05$
Here, $\widehat{OR} = e^{\hat{\beta}_1} = e^{0.05} \approx 1.05$
The estimated odds ratio of death for a one-year age difference is 1.05, older to younger
If we were to compare two groups of people who differ by one year of age, the estimated odds ratio for death of the older group to the younger group is 1.05 (this is valid for age comparisons within our original range of data, 17-94 years):
60-year-olds to 59-year-olds
45-year-olds to 44-year-olds
27-year-olds to 26-year-olds
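The claim that the same odds ratio applies to every one-year comparison can be verified directly from the fitted equation. This is a minimal sketch using the rounded slide coefficients; the helper name `odds_death` is hypothetical.

```python
import math

# Rounded estimates from the slide: intercept and slope for age.
B0, B1 = -4.37, 0.05


def odds_death(age: float) -> float:
    """Estimated odds of death at a given age: exp(b0 + b1*age)."""
    return math.exp(B0 + B1 * age)


# The three one-year comparisons listed on the slide (older vs. younger).
for older in (60, 45, 27):
    ratio = odds_death(older) / odds_death(older - 1)
    print(f"{older} vs. {older - 1}: estimated OR = {ratio:.4f}")

print(f"exp(slope) = {math.exp(B1):.4f}")
```

The intercept and the starting age k cancel in the ratio, which is why each comparison reproduces exp(0.05).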
General Interpretation: Slope in Logistic Regression
$\hat{\beta}_1$ is the estimated change in the log odds of the outcome for a one-unit increase in x1
“Change in the log odds of death for a one-year increase in age”
It estimates the log odds ratio for comparing two groups of observations, one group with x1 one unit higher than the other
This estimated slope can be exponentiated to get the corresponding estimated odds ratio
General Interpretation: Slope in Logistic Regression
$\hat{\beta}_1$ is just an estimate of the true “population level” slope; similarly, $e^{\hat{\beta}_1}$ is just an estimate of the population level odds ratio
Can get 95% confidence interval for slope by taking $\hat{\beta}_1 \pm 2\,\widehat{SE}(\hat{\beta}_1)$
0.05 ± 2×0.015 -> (0.02, 0.08)
Can get 95% confidence interval for odds ratio by exponentiating (anti-logging) endpoints of 95% CI for slope
$(e^{0.02}, e^{0.08}) = (1.02, 1.08)$
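The interval arithmetic above can be reproduced in a few lines. This is a sketch under the slide's own approximation (estimate ± 2 standard errors); the function name `slope_and_or_ci` and its `multiplier` parameter are hypothetical.

```python
import math


def slope_and_or_ci(beta: float, se: float, multiplier: float = 2.0):
    """Return (CI for the slope, CI for the odds ratio).

    Uses estimate +/- multiplier * SE for the slope, then exponentiates
    the endpoints to get the interval for the odds ratio.
    """
    lower, upper = beta - multiplier * se, beta + multiplier * se
    return (lower, upper), (math.exp(lower), math.exp(upper))


# Age slope and standard error from the slide.
slope_ci, or_ci = slope_and_or_ci(0.05, 0.015)
print(f"95% CI for slope: ({slope_ci[0]:.2f}, {slope_ci[1]:.2f})")
print(f"95% CI for OR:    ({or_ci[0]:.2f}, {or_ci[1]:.2f})")
```

The interval is built on the ln odds scale and only then exponentiated, so the odds ratio interval is not symmetric about 1.05.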
General Interpretation: Slope in Logistic Regression
• Question: What is the estimated odds ratio (and 95% CI) of death for 50-year-old subjects compared to 40-year-old subjects?
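One way to work this question, sketched here under the assumption that the rounded slope (0.05) and its approximate 95% CI (0.02, 0.08) from the earlier slides are used: a 10-year difference multiplies the slope, and the endpoints of its interval, by 10 on the ln odds scale before exponentiating.

```latex
\begin{align*}
\ln(\text{odds of death};\ x_1 = 50) - \ln(\text{odds of death};\ x_1 = 40)
  &= \hat{\beta}_1 (50 - 40) = 10 \times 0.05 = 0.5 \\
\widehat{OR}_{50\text{ vs. }40} &= e^{0.5} \approx 1.65 \\
\text{95\% CI for } 10\hat{\beta}_1 &= (10 \times 0.02,\ 10 \times 0.08) = (0.2,\ 0.8) \\
\text{95\% CI for } \widehat{OR}_{50\text{ vs. }40} &= (e^{0.2},\ e^{0.8}) \approx (1.22,\ 2.23)
\end{align*}
```

Both ages fall inside the observed range of 17 to 94 years, so this comparison does not extrapolate beyond the data.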
General Interpretation: Intercept in Logistic Regression
• Question: What is the interpretation of the intercept?
$\ln\left(\dfrac{p}{1-p}\right) = -4.37 + 0.05 \times \text{Age}$
Patients with Sepsis: A Multivariable Model
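Possible next step: fit a logistic regression with all 5 predictors
$\ln\left(\dfrac{p}{1-p}\right) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3 + \hat{\beta}_4 x_4 + \hat{\beta}_5 x_5$
x1 – x5 defined as before
p = Pr(y = 1), the probability of death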
Patients with Sepsis: The Model
Note: other analyses are possible. We could review other potential models, leaving out predictors that were not statistically significant in the previous model, or looking at “intermediate” models with some subset of the 5 predictors, etc.
Model building is part art, part science
Because there are only 106 observations, I am going to refit the model without malnutrition and infarction, as they were not statistically significant, if only to see how their omission impacts the other 3 associations
Presenting the Results
Frequently, the results of the unadjusted and adjusted analyses are presented in one table
Not only is this a concise summary, it allows for side-by-side comparisons of the unadjusted and adjusted estimates for each predictor which helps give a sense of confounding amongst the predictors
Presenting the Results
Table of results
Presenting the Results
All three results point to larger odds (and hence risk) of death for patients in shock at time of admission. The adjusted estimates are larger than the unadjusted odds ratio, but all three are statistically significant, and the 95% CIs share a lot of common values: there is a lot of uncertainty in the estimates
Similar results for the association between odds (risk) of death and history of alcohol use
Both malnutrition and infarction are positively associated with odds of death in both the unadjusted and adjusted estimates, and for both predictors the magnitudes of the unadjusted and adjusted odds ratios are similar; however, they are not statistically significant in the multiple logistic regression model (possibly because of low sample size/power)
The odds ratio of death for a one year difference in age was relatively consistent in value across the three models compared
Presenting the Results
Odds ratios give an estimate of the relative odds of the outcome; they can help us assess risk factors
However, odds ratios are not direct comparisons of risk, nor do they tell us anything about the actual risk of death for different subsets of patients with different characteristics at the time of study
As this is not a case-control study, we are allowed to estimate risk and relative risk via the sample – how can we do this with logistic regression results?
Presenting the Results
Our estimated equation (multiple logistic regression):
$\ln\left(\dfrac{p}{1-p}\right) = -8.67 + 3.43 \times \text{shock} + 0.94 \times \text{malnut} + 2.91 \times \text{alcohol} + 0.082 \times \text{age} + 1.85 \times \text{infarction}$
We can use this to estimate the ln odds of death for any group of patients with any combination of values for the 5 predictors
Presenting the Results
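By the formulation of logistic regression:
$\hat{p} = \dfrac{\widehat{ODDS}}{1 + \widehat{ODDS}} = \dfrac{e^{\ln(\widehat{ODDS})}}{1 + e^{\ln(\widehat{ODDS})}}$
Translate equation back into (estimated) probability function:
$\hat{p} = \dfrac{e^{\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3 + \hat{\beta}_4 x_4 + \hat{\beta}_5 x_5}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3 + \hat{\beta}_4 x_4 + \hat{\beta}_5 x_5}}$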
How Can You Present These Results?
So, for example, the estimated ln odds of death for a 50-year-old patient with sepsis who has a history of alcohol use, but is not in shock, is not malnourished, and does not have infarction at the time of surgery is given by:
$\ln\left(\dfrac{p}{1-p}\right) = -8.67 + 3.43 \times 0 + 0.94 \times 0 + 2.91 \times 1 + 0.082 \times 50 + 1.85 \times 0$
$= -8.67 + 2.91 + 4.1$
$= -1.66$
How Can You Present These Results?
So the estimated proportion (probability, risk) of death during surgery for this group of patients is given by:
$\hat{p} = \dfrac{\widehat{ODDS}}{1 + \widehat{ODDS}} = \dfrac{e^{-1.66}}{1 + e^{-1.66}} = \dfrac{0.19}{1.19} \approx 0.16$
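The same calculation can be repeated for any combination of predictor values. The sketch below is illustrative only: the dictionary `COEF` and the function `estimated_risk` are hypothetical names, and the coefficients are the rounded values from the estimated equation on the slides.

```python
import math

# Rounded coefficients from the estimated multiple logistic regression equation.
COEF = {
    "intercept": -8.67,
    "shock": 3.43,
    "malnutrition": 0.94,
    "alcohol": 2.91,
    "age": 0.082,
    "infarction": 1.85,
}


def estimated_risk(shock, malnutrition, alcohol, age, infarction):
    """Return (estimated ln odds, estimated probability) of death.

    Binary predictors are coded 1 for yes and 0 for no; age is in years.
    """
    ln_odds = (
        COEF["intercept"]
        + COEF["shock"] * shock
        + COEF["malnutrition"] * malnutrition
        + COEF["alcohol"] * alcohol
        + COEF["age"] * age
        + COEF["infarction"] * infarction
    )
    odds = math.exp(ln_odds)
    return ln_odds, odds / (1.0 + odds)


# The patient group from the slide: 50 years old, history of alcohol use only.
ln_odds, risk = estimated_risk(shock=0, malnutrition=0, alcohol=1, age=50, infarction=0)
print(f"ln odds = {ln_odds:.2f}, estimated risk = {risk:.2f}")
```

Changing one argument at a time (for example, setting `shock=1`) shows how the estimated risk, not just the odds ratio, differs between patient groups.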
How Can You Present These Results?
Possible graphical display
[Figure: graphical display with groups labeled “Alcohol and malnutrition” and “Alcohol”]
How logistic regression results should be presented
• The units of the predictor variables should be clearly indicated. They should be “real” units (like inches), not statistical ones (like standard deviations). The units in the report do not have to be the same units used in the analysis. E.g.: birthweight (in grams), smoking (yes/no), age (in years)
• The ranges of the predictor variables should be indicated (so we know when we are extrapolating beyond the data), and/or the number of subjects within each range of predictor variable.
• The methods by which the model was constructed and the assumptions checked should be clear.
How logistic regression results should be presented
• The odds ratios ($e^{\hat{\beta}}$) and their 95% confidence intervals should always be reported, NOT the $\hat{\beta}$’s!
• The baseline odds ($e^{\hat{\beta}_0}$) should be reported unless the study is a case-control study.
• If the model is intended to be used for prediction that could determine a medical action (like hospital admission), it must be validated.
Statistical methods
… we used multivariable logistic regression analysis to generate the odds ratio of receiving chemotherapy in women with breast cancer and to determine the effect of age (Table 1) on chemotherapy use. In this model, we adjusted for race (white, black, or others), tumor stage (stage I, stage II, or stage IIIA), node status, hormone receptor status (Table 2), whether the patient had received surgery and radiation therapy (categorized as breast-conserving surgery without radiation, breast-conserving surgery with radiation, or mastectomy), and adjuvant hormone therapy use (yes or no). In addition to odds ratios, we generated the probabilities of receiving chemotherapy from the parameters of the logistic regression for women with different ages by holding other factors constant.
Finally, we performed sensitivity analyses to assess the potential effects of unmeasured confounders on the associations observed between age and chemotherapy use (23).
Logistic Regression Results
How not to present regression results
How Not to Present Logistic Regression Results
Section C
When Time is Of Interest: Regressions for Incidence Rate Data
When Time Is Of Interest
Logistic Regression handles the occurrence/non-occurrence of events without regard to exposure time differences between subjects/groups
Frequently, not only is the count of outcomes important, but also the time at risk
Examples:
(time to) relapse among remissive cancer patients on different treatments
(time to) smoking cessation amongst subjects on a nicotine patch versus those who also receive intensive counseling
When Time Is Of Interest
Ignoring the time component may throw away important information
Example: Cancer Patients in Remission
Treatment A: 40% of patients relapse in 5 year follow-up
Treatment B: 40% of patients relapse in 5 year follow-up
When Time Is Of Interest
Ignoring the time component may throw away important information
Example: Cancer Patients in Remission
Treatment A: 40% of patients relapse in 5-year follow-up, but the majority of relapses occurred within 1 year of starting treatment
Treatment B: 40% of patients relapse in 5-year follow-up, but the majority of relapses occurred 4-5 years after starting treatment
When Time Is Of Interest
Two possible regression choices: Poisson regression and Cox proportional hazards regression
Both model a function of the incidence rate (IR) as a linear combination of predictors:
$f(IR) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \ldots$
Where $f(a) = \ln(a)$, i.e.
$\ln(IR) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \ldots$
When Time Is Of Interest
By similar logic as with logistic regression, slopes are interpretable as ln(incidence rate ratios)
Incidence rate ratio synonyms include hazard ratio and relative risk
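The “similar logic” can be spelled out by repeating the one-unit comparison used for logistic regression. This is a sketch for a single predictor x1 (any other predictors held fixed); the symbol $\widehat{IRR}$ for the estimated incidence rate ratio is notation introduced here, not taken from the slides.

```latex
\begin{align*}
\ln(IR;\ x_1 = k+1) &= \hat{\beta}_0 + \hat{\beta}_1 (k+1) \\
\ln(IR;\ x_1 = k)   &= \hat{\beta}_0 + \hat{\beta}_1 k \\
\hat{\beta}_1 &= \ln(IR;\ x_1 = k+1) - \ln(IR;\ x_1 = k)
              = \ln\left(\frac{IR;\ x_1 = k+1}{IR;\ x_1 = k}\right)
              = \ln(\widehat{IRR}) \\
\widehat{IRR} &= e^{\hat{\beta}_1}
\end{align*}
```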
Comparison of Regression Choices, Part 1
Poisson regression requires data to be “grouped” into subcategories; Cox PH regression can work with individual-level data
Treatment group (n=3):
Person 1 has event: 3 weeks of follow-up
Person 2 has no event: 5 weeks of follow-up
Person 3 has event: 7 weeks of follow-up
Cox can work with the 3 individual pieces; Poisson would require the information to be aggregated into a group rate:
$\dfrac{(1 + 0 + 1)\ \text{events}}{(3 + 5 + 7)\ \text{weeks}} = \dfrac{2\ \text{events}}{15\ \text{weeks}}$
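The aggregation step for the three-person treatment group can be sketched as follows. The variable names are hypothetical; the event indicators and follow-up times are the ones listed on the slide.

```python
# (event indicator, weeks of follow-up) for the three people in the treatment group.
follow_up = [(1, 3), (0, 5), (1, 7)]

# Poisson regression works with the grouped totals rather than the individual records.
events = sum(event for event, _ in follow_up)
person_weeks = sum(weeks for _, weeks in follow_up)
group_rate = events / person_weeks

print(f"{events} events / {person_weeks} person-weeks = {group_rate:.3f} events per person-week")
```

A Cox regression would instead keep the three individual (event, time) records as they are.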
Comparison of Regression Choices, Part 2
Poisson regression and Cox PH regression handle time as a predictor differently
What we are doing with both approaches is modeling the ln(incidence rate) as a function of potentially multiple predictors. One of these predictors can be time:
[Figure: “ln(hazard) vs. time: Several Scenarios”. y-axis: ln(hazard); x-axis: follow-up time.]
Comparison of Regression Choices, Part 2
Poisson regression allows the user to specify the nature of the relationship between ln(IR) and time, i.e. to add “x’s” for time
PROs:
This allows the researcher to investigate different possibilities for the relationship between ln(hazard) and time (linear? constant? non-linear?)
This allows the researcher to investigate changing associations between other predictors and risk over time (non-proportional hazards)
CONs:
The relationship between ln(hazard) and time can be incorrectly specified
Comparison of Regression Choices, Part 2
Cox regression “takes care” of time on its own; the user cannot include “x’s” for time
PROs:
Takes data at “face” value and figures out the best estimate of the relationship between ln(hazard) and time
CONs:
This does not allow the researcher to investigate changing associations between other predictors and risk over time (non-proportional hazards)
Example: Poisson Regression
Body Checking in Ice Hockey²
2 Emery C, et al. Risk of Injury Associated With Body Checking Among Youth Ice Hockey Players. Journal of the American Medical Association, Vol 303, No 22 (2010).
Example: Cox Regression
RCT for Melanoma treatments³
Taken directly from methods:
“Hazard ratios for treatment with vemurafenib, as compared with dacarbazine, were estimated with the use of unstratified Cox regression. We estimated event–time distributions using the Kaplan–Meier method. All reported P values are two-sided, and confidence intervals are at the 95% level.”
3 Chapman P, et al. Improved Survival with Vemurafenib in Melanoma with BRAF V600E Mutation. New England Journal of Medicine, Vol 364, No 26 (2011).