April 6
• Logistic Regression
– Estimating probability based on logistic model
– Testing differences among multiple groups
– Assumptions for model
Logistic regression equation
Model log odds of outcome as a linear function of one or more variables
The model is: $\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots$
$x_i$ = predictors, independent variables
$\beta_i$ is the increase in log odds for a 1-unit increase in $x_i$
$e^{\beta_i}$ is the relative odds for a 1-unit increase in $x_i$
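Logistic Regression Prediction: Estimating Probability of Y=1
Goal: Estimate $\pi$ for a set of X values
The model is: $\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots$
Solve for $\pi$:
$\pi = \frac{\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots)}{1 + \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots)} = \frac{\text{ODDS}}{1 + \text{ODDS}}$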
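The algebra behind "solve for" can be written out step by step. In this sketch $\pi$ is the probability that Y=1, and $\eta$ is shorthand introduced here (it is not on the slide) for the linear predictor.

```latex
\begin{align*}
\log\left(\frac{\pi}{1-\pi}\right) &= \eta, \qquad \eta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots \\
\frac{\pi}{1-\pi} &= e^{\eta} = \text{ODDS} \\
\pi &= (1-\pi)\, e^{\eta} \\
\pi \, (1 + e^{\eta}) &= e^{\eta} \\
\pi &= \frac{e^{\eta}}{1 + e^{\eta}} = \frac{\text{ODDS}}{1 + \text{ODDS}}
\end{align*}
```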
Steps in Estimating $\pi$
• Pick values for x1, x2, …, xp
• Compute log odds for your values of Xs using results
– LO = b0 + b1x1 + b2x2 + … + bpxp
• Exponentiate LO to get odds
– Odds = EXP(LO)
• Compute estimate of $\pi$
– $\pi$ = ODDS/(ODDS + 1)
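The steps above can be sketched as a small Python function. The function name and argument layout are illustrative choices, not part of the SAS workflow in these slides.

```python
import math

def estimate_probability(coefs, xs):
    """Estimate P(Y=1) from fitted logistic coefficients.

    coefs = [b0, b1, ..., bp] (intercept first); xs = [x1, ..., xp].
    """
    if len(coefs) != len(xs) + 1:
        raise ValueError("need one coefficient per x plus an intercept")
    # Step 2: log odds, LO = b0 + b1*x1 + ... + bp*xp
    log_odds = coefs[0] + sum(b * x for b, x in zip(coefs[1:], xs))
    # Step 3: exponentiate to get the odds
    odds = math.exp(log_odds)
    # Step 4: convert odds to a probability
    return odds / (odds + 1)

# A log odds of 0 means odds of 1, i.e. a probability of one half.
print(estimate_probability([0.0, 1.0], [0.0]))
```

Keeping the three steps as separate lines mirrors the hand calculation, which makes intermediate values easy to check against SAS output.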
The LOGISTIC Procedure
Analysis of Maximum Likelihood Estimates
                          Standard        Wald
Parameter  DF  Estimate      Error  Chi-Square  Pr > ChiSq
Intercept   1   -6.0621     1.2884     22.1395      <.0001
AGE         1    0.0605     0.0223      7.3310      0.0068
women       1   -0.3967     0.3166      1.5701      0.2102
log(odds) = - 6.0621 + 0.0605*age –0.3967*women
What is estimated probability of CVD for a man 60 years old?
Log(odds) = -6.0621 + 0.0605(60) –0.3967(0) = -2.4321
Odds = exp(-2.4321) = 0.0878
Prob = 0.0878 / (1 + 0.0878) = 0.0808
How old does a woman have to be to have the same risk?
1 year of age increases log(odds) by 0.0605
Being female decreases log(odds) by 0.3967
Compute 0.3967/0.0605 = 6.6, so a woman would have to be 66.6 years old to have P = 0.0808
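The worked example can be checked numerically. This is a minimal sketch using the coefficients printed in the SAS output above; the function and variable names are made up for illustration.

```python
import math

# Coefficients from the Analysis of Maximum Likelihood Estimates table
B0, B_AGE, B_WOMEN = -6.0621, 0.0605, -0.3967

def cvd_probability(age, women):
    """Estimated probability of CVD; women is 1 for a woman, 0 for a man."""
    log_odds = B0 + B_AGE * age + B_WOMEN * women
    odds = math.exp(log_odds)
    return odds / (1 + odds)

p_man_60 = cvd_probability(60, 0)

# Years of age needed to offset the female coefficient
extra_years = -B_WOMEN / B_AGE
p_woman = cvd_probability(60 + extra_years, 1)

print(round(p_man_60, 4), round(extra_years, 1), round(p_woman, 4))
```

The two probabilities agree because the extra years add exactly 0.3967 to the log odds, cancelling the coefficient for women.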
Getting Odds Ratio for Differences Other Than 1
PROC LOGISTIC DATA=temp DESCENDING;
  MODEL clinical = age women / CLODDS=WALD;
  UNITS age = 5 women = 1;
RUN;
SAS OUTPUT
Wald Confidence Interval for Adjusted Odds Ratios
Effect     Unit  Estimate  95% Confidence Limits
AGE      5.0000     1.353       1.087      1.685
women    1.0000     0.673       0.362      1.251
The AGE estimate is EXP(5*0.0605)
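The same numbers can be reproduced by hand from the coefficient and its standard error. This sketch assumes the usual Wald interval (estimate ± 1.96 standard errors on the log odds scale, then exponentiated); the function name is hypothetical.

```python
import math

def odds_ratio_for_units(b, se, units, z=1.96):
    """Odds ratio and Wald 95% limits for a change of `units` in X."""
    point = math.exp(units * b)
    lower = math.exp(units * (b - z * se))
    upper = math.exp(units * (b + z * se))
    return point, lower, upper

# Estimates and standard errors from the earlier SAS output
age_or = odds_ratio_for_units(0.0605, 0.0223, units=5)
women_or = odds_ratio_for_units(-0.3967, 0.3166, units=1)

print([round(v, 3) for v in age_or])
print([round(v, 3) for v in women_or])
```

Results can differ from the SAS listing in the third decimal place because the printed coefficients are already rounded.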
Testing Differences Among Multiple Groups Using Logistic Regression
• Ho: $\pi_1 = \pi_2 = \dots = \pi_k$
• Ha: $\pi_i$ not all equal
• Can test using logistic regression, since if the $\pi$'s are equal then the log odds are equal
• Can code in SAS two ways
– Create dummy (design) variables to represent the groups
– Use a CLASS statement under PROC LOGISTIC
TOMHS Example: Is CVD Rate Equal in Four Clinical Centers?
• Ho: $\pi_A = \pi_B = \pi_C = \pi_D$
• SAS CODE in datastep (create own design variables):
DATA temp;
  SET tomhs.bpstudy;
  clinicA = 0; clinicB = 0; clinicC = 0; clinicD = 0;
  IF clinic = 'A' THEN clinicA = 1;
  ELSE IF clinic = 'B' THEN clinicB = 1;
  ELSE IF clinic = 'C' THEN clinicC = 1;
  ELSE IF clinic = 'D' THEN clinicD = 1;
RUN;
Do Simple Analyses First
PROC MEANS N MEAN SUM MIN MAX DATA=temp;
  CLASS clinic;
  VAR clinical;
RUN;
Analysis Variable : CLINICAL Indicator - Clinical Endpoint
CLINIC  N Obs    N       Mean         Sum  Minimum    Maximum
--------------------------------------------------------------
A         195  195  0.0974359  19.0000000        0  1.0000000
B         251  251  0.0517928  13.0000000        0  1.0000000
C         296  296  0.0472973  14.0000000        0  1.0000000
D         160  160  0.0312500   5.0000000        0  1.0000000
The relative odds (A/D) should be about 3. All betas should be > 0
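The "about 3" can be worked out directly from the PROC MEANS counts. This is a small sketch; the dictionary layout and names are illustrative.

```python
import math

# (events, N) per clinic, taken from the Sum and N columns of PROC MEANS
counts = {"A": (19, 195), "B": (13, 251), "C": (14, 296), "D": (5, 160)}

def odds(events, n):
    """Odds of the endpoint: events divided by non-events."""
    return events / (n - events)

reference = odds(*counts["D"])  # clinic D is the comparison group
relative_odds = {c: odds(*counts[c]) / reference for c in ("A", "B", "C")}
log_relative_odds = {c: math.log(ro) for c, ro in relative_odds.items()}

for c in ("A", "B", "C"):
    print(c, round(relative_odds[c], 3), round(log_relative_odds[c], 4))
```

Each relative odds is above 1, so each log relative odds (the beta for that clinic, with D as reference) is positive.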
PROC LOGISTIC CODE
* Using class statement;
PROC LOGISTIC DATA=TEMP DESCENDING SIMPLE;
  CLASS clinic / PARAM=REF;
  MODEL clinical = clinic;
RUN;
* Using user defined design variables;
PROC LOGISTIC DATA=TEMP DESCENDING SIMPLE;
  MODEL clinical = clinicA clinicB clinicC;
RUN;
PARAM=REF: uses 0/1 coding, with the last group as reference
SIMPLE: gives summary statistics
SAS OUTPUT USING CLASS STATEMENT
Response Profile
Ordered                 Total
  Value  CLINICAL   Frequency
      1         1          51
      2         0         851
Probability modeled is CLINICAL=1.
Class Level Information
                Design Variables
Class   Value     1    2    3
CLINIC  A         1    0    0
        B         0    1    0
        C         0    0    1
        D         0    0    0
Same coding as in datastep
Clinic D reference
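The design-variable table can be generated mechanically. The sketch below assumes levels are sorted alphabetically and the last one is the reference, which matches the table shown; the function name is made up.

```python
def reference_coding(levels):
    """Map each level to 0/1 design variables, last sorted level as reference."""
    ordered = sorted(levels)
    non_reference = ordered[:-1]  # one design variable per non-reference level
    return {
        level: tuple(1 if level == other else 0 for other in non_reference)
        for level in ordered
    }

design = reference_coding(["A", "B", "C", "D"])
for level, row in design.items():
    print(level, row)
```

The reference level gets all zeros, so its log odds is the intercept and every other coefficient is a contrast against it.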
SAS OUTPUT USING CLASS STATEMENT
Testing Global Null Hypothesis: BETA=0
Test              Chi-Square  DF  Pr > ChiSq
Likelihood Ratio      7.9632   3      0.0468
Score                 8.6122   3      0.0349
Wald                  8.1300   3      0.0434
Type III Analysis of Effects
                  Wald
Effect  DF  Chi-Square  Pr > ChiSq
CLINIC   3      8.1300      0.0434
The global Wald test and the Type III test for CLINIC are equal because no other variables are in the model
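With clinic as the only predictor, the fitted probabilities are just the observed clinic rates, so the likelihood ratio statistic can be reproduced from the counts alone. This sketch uses the PROC MEANS counts and a closed-form upper tail for a chi-square with 3 degrees of freedom; the helper names are illustrative.

```python
import math

# (events, N) per clinic from the PROC MEANS output
counts = {"A": (19, 195), "B": (13, 251), "C": (14, 296), "D": (5, 160)}

def log_likelihood(groups):
    """Binomial log likelihood with each group fitted at its observed rate."""
    total = 0.0
    for events, n in groups:
        p = events / n
        total += events * math.log(p) + (n - events) * math.log(1 - p)
    return total

def chi2_upper_tail_3df(x):
    """P(chi-square with 3 df > x), closed form valid for 3 df only."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

all_events = sum(e for e, _ in counts.values())
all_n = sum(n for _, n in counts.values())

ll_null = log_likelihood([(all_events, all_n)])  # one common rate (Ho)
ll_full = log_likelihood(counts.values())        # one rate per clinic
lr_chi_square = 2 * (ll_full - ll_null)
p_value = chi2_upper_tail_3df(lr_chi_square)

print(round(lr_chi_square, 4), round(p_value, 4))
```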
SAS OUTPUT USING CLASS STATEMENT
Analysis of Maximum Likelihood Estimates
                           Standard        Wald
Parameter   DF  Estimate      Error  Chi-Square  Pr > ChiSq
Intercept    1   -3.4339     0.4544     57.1196      <.0001
CLINIC A     1    1.2080     0.5145      5.5114      0.0189
CLINIC B     1    0.5266     0.5363      0.9644      0.3261
CLINIC C     1    0.4311     0.5305      0.6604      0.4164
Odds Ratio Estimates
                   Point        95% Wald
Effect          Estimate  Confidence Limits
CLINIC A vs D      3.347    1.221     9.175
CLINIC B vs D      1.693    0.592     4.844
CLINIC C vs D      1.539    0.544     4.353
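Each Wald chi-square in the table is the squared ratio of the estimate to its standard error, referred to a chi-square with 1 degree of freedom. A minimal sketch, using the printed (rounded) values and a hypothetical helper name:

```python
import math

# (estimate, standard error) from the table above
estimates = {
    "CLINIC A": (1.2080, 0.5145),
    "CLINIC B": (0.5266, 0.5363),
    "CLINIC C": (0.4311, 0.5305),
}

def wald_test(estimate, std_error):
    """Wald chi-square (1 df) and its p-value."""
    chi_square = (estimate / std_error) ** 2
    # Upper tail of a 1-df chi-square equals erfc(sqrt(x/2))
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value

wald = {name: wald_test(b, se) for name, (b, se) in estimates.items()}
for name, (chi_square, p_value) in wald.items():
    print(name, round(chi_square, 3), round(p_value, 4))
```

Small differences from the listing in the third decimal come from using coefficients already rounded to four places.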
SAS OUTPUT USING MODEL clinical = clinicA clinicB clinicC
Analysis of Maximum Likelihood Estimates
                           Standard        Wald
Parameter   DF  Estimate      Error  Chi-Square  Pr > ChiSq
Intercept    1   -3.4339     0.4544     57.1196      <.0001
clinicA      1    1.2080     0.5145      5.5114      0.0189
clinicB      1    0.5266     0.5363      0.9644      0.3261
clinicC      1    0.4311     0.5305      0.6604      0.4164
Odds Ratio Estimates
            Point        95% Wald
Effect   Estimate  Confidence Limits
clinicA     3.347    1.221     9.175
clinicB     1.693    0.592     4.844
clinicC     1.539    0.544     4.353
Maybe clinic rates of CVD differ because age varies among centers
SAS OUTPUT USING MODEL clinical = clinic age
Testing Global Null Hypothesis: BETA=0
Test              Chi-Square  DF  Pr > ChiSq
Likelihood Ratio     16.5582   4      0.0024
Score                17.2001   4      0.0018
Wald                 16.2760   4      0.0027
Type III Analysis of Effects
                  Wald
Effect  DF  Chi-Square  Pr > ChiSq
CLINIC   3      8.9604      0.0298
AGE      1      8.4904      0.0036
The global test checks whether age and clinic are related to CVD
SAS OUTPUT USING MODEL clinical = clinic age
Analysis of Maximum Likelihood Estimates
                           Standard        Wald
Parameter   DF  Estimate      Error  Chi-Square  Pr > ChiSq
Intercept    1   -7.2250     1.4096     26.2725      <.0001
CLINIC A     1    1.3211     0.5189      6.4816      0.0109
CLINIC B     1    0.6448     0.5400      1.4256      0.2325
CLINIC C     1    0.5163     0.5335      0.9366      0.3332
AGE          1    0.0662     0.0227      8.4904      0.0036
Odds Ratio Estimates
                   Point        95% Wald
Effect          Estimate  Confidence Limits
CLINIC A vs D      3.747    1.355    10.361
CLINIC B vs D      1.906    0.661     5.492
CLINIC C vs D      1.676    0.589     4.768
AGE                1.068    1.022     1.117
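To see what "adjusted for age" means, the fitted model can be evaluated for two clinics at the same age. This sketch uses the printed coefficients; the names and the choice of ages are illustrative assumptions.

```python
import math

# Coefficients from the age-adjusted model; clinic D is the reference (0)
INTERCEPT = -7.2250
B_AGE = 0.0662
B_CLINIC = {"A": 1.3211, "B": 0.6448, "C": 0.5163, "D": 0.0}

def adjusted_odds(clinic, age):
    """Fitted odds of CVD for a given clinic and age."""
    return math.exp(INTERCEPT + B_CLINIC[clinic] + B_AGE * age)

def adjusted_probability(clinic, age):
    """Fitted probability of CVD for a given clinic and age."""
    o = adjusted_odds(clinic, age)
    return o / (1 + o)

# The A-versus-D odds ratio is the same at every age
ratio_at_50 = adjusted_odds("A", 50) / adjusted_odds("D", 50)
ratio_at_70 = adjusted_odds("A", 70) / adjusted_odds("D", 70)
print(round(ratio_at_50, 3), round(ratio_at_70, 3))
```

Because the model has no clinic-by-age interaction, the clinic odds ratios do not depend on the age chosen, although the fitted probabilities do.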
Assumptions: Linear Versus Logistic Regression
Linear regression:
• Y normally distributed
• Y linearly related to X
• $\sigma^2$ constant over X
• Each observation independent of other observations
• Large N not needed for tests if Y is normally distributed
Logistic regression:
• Y binary
• Log odds linearly related to X
• N/A
• Each observation independent of other observations
• Large enough N to justify using $\chi^2$
Illustration of Linearity in Log Odds Assumption
Log odds = -6.2428 + 0.0613* Age
AGE ODDS
50 0.039
60 0.072
70 0.134
RO = 1.85 = .072/.039
RO = 1.85 = .134/.072
The relative odds for going from 50 to 60 years is the same as for going from 60 to 70 years
Note: Absolute risk is not linear with age
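Both points can be checked from the equation on this slide. The sketch below uses the printed coefficients; odds computed this way come out slightly different from the tabulated values (about 0.042 rather than 0.039 at age 50), but the ratio per 10 years is the same 1.85.

```python
import math

B0, B_AGE = -6.2428, 0.0613  # Log odds = -6.2428 + 0.0613 * Age

def odds_at(age):
    """Fitted odds at a given age."""
    return math.exp(B0 + B_AGE * age)

def prob_at(age):
    """Fitted probability (absolute risk) at a given age."""
    o = odds_at(age)
    return o / (1 + o)

# Relative odds for each 10-year step: constant, equal to exp(10 * 0.0613)
ro_50_to_60 = odds_at(60) / odds_at(50)
ro_60_to_70 = odds_at(70) / odds_at(60)

# Absolute risk differences for the same steps: not constant
diff_50_to_60 = prob_at(60) - prob_at(50)
diff_60_to_70 = prob_at(70) - prob_at(60)

print(round(ro_50_to_60, 2), round(ro_60_to_70, 2))
print(round(diff_50_to_60, 4), round(diff_60_to_70, 4))
```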
Fitted regression line
Curve based on: $\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x$
$\beta_0$ affects location
$\beta_1$ affects curvature