April 6

Page 1: April 6

April 6

• Logistic Regression

– Estimating probability based on logistic model

– Testing differences among multiple groups

– Assumptions for model

Page 2: April 6

Logistic regression equation

Model log odds of outcome as a linear function of one or more variables

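$x_i$ = predictors (independent variables)

The model is:

$$\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots$$

where $\pi$ is the probability of the outcome

$\beta_i$ is the increase in log odds for a 1-unit increase in $x_i$

$e^{\beta_i}$ is the relative odds for a 1-unit increase in $x_i$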

Page 3: April 6

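Logistic Regression Prediction: Estimating Probability of Y=1

Goal: Estimate $\pi$ = P(Y=1) for a set of X values

The model is:

$$\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots$$

Solve for $\pi$ (shown for two predictors):

$$\pi = \frac{\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}{1 + \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2)} = \frac{\text{ODDS}}{1 + \text{ODDS}}$$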

Page 4: April 6

Steps in Estimating $\pi$

• Pick values for x1, x2, …, xp

• Compute log odds for your values of Xs using results

– LO = b0 + b1x1 + b2x2 + … + bpxp

• EXP LO to get odds

– Odds = EXP (LO)

• Compute estimate of $\pi$

– $\hat{\pi}$ = ODDS/(ODDS + 1)
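The steps above can be sketched as a small function. This is only an illustration in Python rather than SAS; the function name `estimate_probability` and the coefficients in the usage lines are hypothetical and do not come from the slides.

```python
import math

def estimate_probability(coefs, xs):
    """Return (log odds, odds, probability) for one set of predictor values.

    coefs: [b0, b1, ..., bp] fitted estimates, intercept first.
    xs:    [x1, ..., xp] chosen predictor values.
    """
    if len(coefs) != len(xs) + 1:
        raise ValueError("need one intercept plus one coefficient per x")
    # Step 2: log odds LO = b0 + b1*x1 + ... + bp*xp
    log_odds = coefs[0] + sum(b * x for b, x in zip(coefs[1:], xs))
    # Step 3: exponentiate the log odds to get the odds
    odds = math.exp(log_odds)
    # Step 4: convert odds to a probability
    prob = odds / (odds + 1)
    return log_odds, odds, prob

# Hypothetical coefficients, chosen only to exercise the function.
lo, odds, prob = estimate_probability([-1.0, 0.5, 0.25], [2.0, 4.0])
print(f"LO = {lo:.4f}, odds = {odds:.4f}, probability = {prob:.4f}")
```

With no predictors and an intercept of 0, the log odds are 0, the odds are 1, and the estimated probability is 0.5, which is a quick sanity check on any implementation.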

Page 5: April 6

The LOGISTIC Procedure

Analysis of Maximum Likelihood Estimates

                              Standard         Wald
Parameter    DF    Estimate      Error   Chi-Square    Pr > ChiSq
Intercept     1     -6.0621     1.2884      22.1395        <.0001
AGE           1      0.0605     0.0223       7.3310        0.0068
women         1     -0.3967     0.3166       1.5701        0.2102

log(odds) = -6.0621 + 0.0605*age - 0.3967*women

What is estimated probability of CVD for a man 60 years old?

Log(odds) = -6.0621 + 0.0605(60) - 0.3967(0) = -2.4321

Odds = exp(-2.4321) = 0.0879

Prob = 0.0879 / (1 + 0.0879) = 0.0808

How old does a woman have to be to have the same risk?

Each 1-year increase in age increases log(odds) by 0.0605

Being female decreases log(odds) by 0.3967

Compute 0.3967/0.0605 = 6.6, so a woman would have to be 66.6 years old to have P = 0.0808
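The worked example can be checked numerically. The sketch below is in Python rather than SAS and uses only the three estimates printed in the output above; the function name `cvd_probability` is made up for the illustration.

```python
import math

# Estimates copied from the PROC LOGISTIC output above.
B0, B_AGE, B_WOMEN = -6.0621, 0.0605, -0.3967

def cvd_probability(age, women):
    """Estimated probability of CVD; women is 1 for a woman, 0 for a man."""
    log_odds = B0 + B_AGE * age + B_WOMEN * women
    odds = math.exp(log_odds)
    return odds / (1 + odds)

p_man_60 = cvd_probability(60, 0)

# Extra years of age needed to cancel the (negative) female coefficient.
extra_years = -B_WOMEN / B_AGE
p_woman_equiv = cvd_probability(60 + extra_years, 1)

print(f"man aged 60:     P = {p_man_60:.4f}")
print(f"extra years:     {extra_years:.1f}")
print(f"woman aged {60 + extra_years:.1f}: P = {p_woman_equiv:.4f}")
```

The two probabilities agree because adding 0.3967/0.0605 years of age raises the log odds by exactly the amount that being female lowers them.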

Page 6: April 6

Getting Odds Ratio for Differences Other Than 1

PROC LOGISTIC DATA=temp DESCENDING;
  MODEL clinical = age women / CLODDS=WALD;
  UNITS age = 5 women = 1;
RUN;

SAS OUTPUT

Wald Confidence Interval for Adjusted Odds Ratios

Effect       Unit    Estimate    95% Confidence Limits
AGE        5.0000       1.353       1.087       1.685
women      1.0000       0.673       0.362       1.251

AGE estimate: EXP (5*0.0605) = 1.353
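The odds ratio for a change of several units can be reproduced from an estimate and its standard error. The following Python sketch assumes the usual Wald interval with z = 1.96 and uses the rounded AGE and women estimates from the earlier maximum likelihood output; the helper name `odds_ratio_for_change` is hypothetical. Because the inputs are rounded to four decimals, the results may differ from the SAS output in the third decimal place.

```python
import math

def odds_ratio_for_change(beta, se, units, z=1.96):
    """Odds ratio and Wald 95% limits for a change of `units` in X."""
    point = math.exp(units * beta)
    lower = math.exp(units * (beta - z * se))
    upper = math.exp(units * (beta + z * se))
    return point, lower, upper

# (estimate, standard error) from the Analysis of Maximum Likelihood Estimates.
age_or = odds_ratio_for_change(0.0605, 0.0223, units=5)
women_or = odds_ratio_for_change(-0.3967, 0.3166, units=1)

for label, (point, lower, upper) in [("AGE (5 yr)", age_or), ("women", women_or)]:
    print(f"{label:10s} {point:.3f} ({lower:.3f}, {upper:.3f})")
```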

Page 7: April 6

Testing Differences Among Multiple Groups Using Logistic Regression

• Ho: $\pi_1 = \pi_2 = \dots = \pi_k$

• Ha: $\pi_i$ not all equal

• Can test using logistic regression since if the $\pi$'s are equal then the log odds are equal

• Can code in SAS two ways

– Create dummy (design) variables to represent the groups

– Use a CLASS statement under PROC LOGISTIC

Page 8: April 6

TOMHS Example: Is CVD Rate Equal in Four Clinical Centers?

• Ho: $\pi_A = \pi_B = \pi_C = \pi_D$

• SAS CODE in datastep (create own design variables):

DATA temp;
  SET tomhs.bpstudy;
  * Start every design variable at 0;
  clinicA = 0; clinicB = 0; clinicC = 0; clinicD = 0;
  * Set the indicator that matches the participant's clinic;
  if clinic = 'A' then clinicA = 1;
  else if clinic = 'B' then clinicB = 1;
  else if clinic = 'C' then clinicC = 1;
  else if clinic = 'D' then clinicD = 1;
RUN;

Page 9: April 6

Do Simple Analyses First

PROC MEANS N MEAN SUM MIN MAX DATA=temp;
  CLASS clinic;
  VAR clinical;
RUN;

Analysis Variable : CLINICAL Indicator - Clinical Endpoint

            N
CLINIC    Obs      N         Mean          Sum    Minimum      Maximum
----------------------------------------------------------------------
A         195    195    0.0974359   19.0000000          0    1.0000000
B         251    251    0.0517928   13.0000000          0    1.0000000
C         296    296    0.0472973   14.0000000          0    1.0000000
D         160    160    0.0312500    5.0000000          0    1.0000000

The relative odds (A vs. D) should be about 3. Clinic D has the lowest rate, so with D as the reference group all betas should be > 0
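The expected relative odds can be worked out directly from the event counts before fitting any model. This Python sketch uses only the Sum and N columns of the PROC MEANS output; the helper name `odds` is made up for the illustration.

```python
# Events (Sum) and participants (N) per clinic, from the PROC MEANS output.
counts = {"A": (19, 195), "B": (13, 251), "C": (14, 296), "D": (5, 160)}

def odds(events, n):
    """Odds of the endpoint: events divided by non-events."""
    return events / (n - events)

ref_odds = odds(*counts["D"])  # clinic D is the comparison group
relative_odds = {c: odds(*counts[c]) / ref_odds for c in "ABC"}

for clinic, ro in relative_odds.items():
    print(f"{clinic} vs D: relative odds = {ro:.3f}")
```

All three relative odds come out above 1, which is why every clinic coefficient is expected to be positive when D is the reference group.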

Page 10: April 6

PROC LOGISTIC CODE

* Using class statement;
PROC LOGISTIC DATA=TEMP DESCENDING SIMPLE;
  CLASS clinic / PARAM=REF;
  MODEL clinical = clinic;
RUN;

* Using user defined design variables;
PROC LOGISTIC DATA=TEMP DESCENDING SIMPLE;
  MODEL clinical = clinica clinicb clinicc;
RUN;

PARAM=REF: uses 0/1 coding, with the last group as reference

SIMPLE: gives summary statistics

Page 11: April 6

SAS OUTPUT USING CLASS STATEMENT

Response Profile

Ordered                    Total
  Value    CLINICAL    Frequency
      1           1           51
      2           0          851

Probability modeled is CLINICAL=1.

Class Level Information

                   Design Variables
Class     Value      1     2     3
CLINIC    A          1     0     0
          B          0     1     0
          C          0     0     1
          D          0     0     0

Same coding as in datastep

Clinic D reference

Page 12: April 6

SAS OUTPUT USING CLASS STATEMENT

Testing Global Null Hypothesis: BETA=0

Test                Chi-Square    DF    Pr > ChiSq
Likelihood Ratio        7.9632     3        0.0468
Score                   8.6122     3        0.0349
Wald                    8.1300     3        0.0434

Type III Analysis of Effects

                    Wald
Effect    DF    Chi-Square    Pr > ChiSq
CLINIC     3        8.1300        0.0434

The global Wald test and the Type III test for CLINIC are equal (8.1300, 3 DF) because no other variables are in the model
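Because clinic is the only predictor, the likelihood ratio and score statistics can be reproduced from the four event counts alone. The Python sketch below is a hand calculation, not what SAS does internally: it compares a separate rate per clinic with one common rate, uses the Pearson chi-square for the 4 x 2 table as the score statistic, and gets p-values from a closed-form upper tail that is valid only for 3 DF. All function names are hypothetical.

```python
import math

# (events, participants) per clinic, from the PROC MEANS output.
counts = {"A": (19, 195), "B": (13, 251), "C": (14, 296), "D": (5, 160)}

def binomial_loglik(events, n, p):
    """Log likelihood of `events` out of `n` at rate p (constant term dropped)."""
    return events * math.log(p) + (n - events) * math.log(1 - p)

def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square variable with exactly 3 DF."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

total_events = sum(e for e, _ in counts.values())
total_n = sum(n for _, n in counts.values())
p_null = total_events / total_n  # one common rate under Ho

# Likelihood ratio: separate rate per clinic versus one common rate.
ll_full = sum(binomial_loglik(e, n, e / n) for e, n in counts.values())
ll_null = binomial_loglik(total_events, total_n, p_null)
lr_stat = 2 * (ll_full - ll_null)

# Score statistic: Pearson chi-square for the clinic-by-outcome table.
score_stat = sum(
    (e - n * p_null) ** 2 / (n * p_null * (1 - p_null)) for e, n in counts.values()
)

print(f"Likelihood ratio: {lr_stat:.4f}, p = {chi2_sf_3df(lr_stat):.4f}")
print(f"Score:            {score_stat:.4f}, p = {chi2_sf_3df(score_stat):.4f}")
```

The Wald statistic is not reproduced here because it needs the covariance matrix of the fitted coefficients, which the slides do not show.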

Page 13: April 6

SAS OUTPUT USING CLASS STATEMENT

Analysis of Maximum Likelihood Estimates

                               Standard         Wald
Parameter     DF    Estimate      Error   Chi-Square    Pr > ChiSq
Intercept      1     -3.4339     0.4544      57.1196        <.0001
CLINIC A       1      1.2080     0.5145       5.5114        0.0189
CLINIC B       1      0.5266     0.5363       0.9644        0.3261
CLINIC C       1      0.4311     0.5305       0.6604        0.4164

Odds Ratio Estimates

                     Point          95% Wald
Effect            Estimate    Confidence Limits
CLINIC A vs D        3.347      1.221     9.175
CLINIC B vs D        1.693      0.592     4.844
CLINIC C vs D        1.539      0.544     4.353
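Each row of the two tables is linked by simple arithmetic: the Wald chi-square is (estimate / standard error) squared, and the odds ratio and its limits are exponentials of the estimate and of the estimate plus or minus 1.96 standard errors. A Python sketch, assuming z = 1.96 and using the rounded values printed above (so the last digit may differ from SAS); `wald_summary` is a hypothetical helper name:

```python
import math

# (estimate, standard error) from the Analysis of Maximum Likelihood Estimates.
estimates = {
    "CLINIC A": (1.2080, 0.5145),
    "CLINIC B": (0.5266, 0.5363),
    "CLINIC C": (0.4311, 0.5305),
}

def wald_summary(beta, se, z=1.96):
    """Wald chi-square (1 DF), odds ratio, and 95% Wald limits for one estimate."""
    chi_square = (beta / se) ** 2
    odds_ratio = math.exp(beta)
    limits = (math.exp(beta - z * se), math.exp(beta + z * se))
    return chi_square, odds_ratio, limits

results = {name: wald_summary(b, se) for name, (b, se) in estimates.items()}
for name, (chi_square, odds_ratio, (lower, upper)) in results.items():
    print(f"{name} vs D: chi-square {chi_square:.2f}, "
          f"OR {odds_ratio:.3f} ({lower:.3f}, {upper:.3f})")
```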

Page 14: April 6

SAS OUTPUT USING MODEL clinical = clinicA clinicB clinicC

Analysis of Maximum Likelihood Estimates

                               Standard         Wald
Parameter     DF    Estimate      Error   Chi-Square    Pr > ChiSq
Intercept      1     -3.4339     0.4544      57.1196        <.0001
clinicA        1      1.2080     0.5145       5.5114        0.0189
clinicB        1      0.5266     0.5363       0.9644        0.3261
clinicC        1      0.4311     0.5305       0.6604        0.4164

Odds Ratio Estimates

              Point          95% Wald
Effect     Estimate    Confidence Limits
clinicA       3.347      1.221     9.175
clinicB       1.693      0.592     4.844
clinicC       1.539      0.544     4.353

Page 15: April 6

Maybe clinic rates of CVD differ because age varies among centers

SAS OUTPUT USING MODEL clinical = clinic age

Testing Global Null Hypothesis: BETA=0

Test                Chi-Square    DF    Pr > ChiSq
Likelihood Ratio       16.5582     4        0.0024
Score                  17.2001     4        0.0018
Wald                   16.2760     4        0.0027

Type III Analysis of Effects

                    Wald
Effect    DF    Chi-Square    Pr > ChiSq
CLINIC     3        8.9604        0.0298
AGE        1        8.4904        0.0036

Test if age and clinic are related to CVD

Page 16: April 6

SAS OUTPUT USING MODEL clinical = clinic age

Analysis of Maximum Likelihood Estimates

                               Standard         Wald
Parameter     DF    Estimate      Error   Chi-Square    Pr > ChiSq
Intercept      1     -7.2250     1.4096      26.2725        <.0001
CLINIC A       1      1.3211     0.5189       6.4816        0.0109
CLINIC B       1      0.6448     0.5400       1.4256        0.2325
CLINIC C       1      0.5163     0.5335       0.9366        0.3332
AGE            1      0.0662     0.0227       8.4904        0.0036

Odds Ratio Estimates

                     Point          95% Wald
Effect            Estimate    Confidence Limits
CLINIC A vs D        3.747      1.355    10.361
CLINIC B vs D        1.906      0.661     5.492
CLINIC C vs D        1.676      0.589     4.768
AGE                  1.068      1.022     1.117
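The age-adjusted estimates can also be turned into estimated probabilities for each clinic at a fixed age. The sketch below is in Python; the age of 60 is an illustrative choice that does not appear on this slide, and the function name is hypothetical.

```python
import math

# Estimates from the age-adjusted model (clinic D is the reference group).
INTERCEPT, B_AGE = -7.2250, 0.0662
B_CLINIC = {"A": 1.3211, "B": 0.6448, "C": 0.5163, "D": 0.0}

def adjusted_probability(clinic, age):
    """Estimated probability of CVD for a given clinic and age."""
    log_odds = INTERCEPT + B_CLINIC[clinic] + B_AGE * age
    odds = math.exp(log_odds)
    return odds / (1 + odds)

AGE = 60  # illustrative age, not taken from the slides
probs = {c: adjusted_probability(c, AGE) for c in B_CLINIC}

for clinic, p in probs.items():
    print(f"clinic {clinic}, age {AGE}: P = {p:.4f}")

# The odds ratio A vs D recovered from the two probabilities.
odds_a = probs["A"] / (1 - probs["A"])
odds_d = probs["D"] / (1 - probs["D"])
print(f"A vs D odds ratio at age {AGE}: {odds_a / odds_d:.3f}")
```

Whatever age is chosen, the ratio of odds between two clinics stays the same, because the model has no clinic-by-age interaction term.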

Page 17: April 6

Assumptions: Linear Versus Logistic Regression

Linear regression

• Y normally distributed

• Y linearly related to X

• $\sigma$ (standard deviation of Y) constant over X

• Each observation independent of other observations

• Large N not needed for tests if Y is normally distributed

Logistic regression

• Y binary

• Log odds linearly related to X

• Constant $\sigma$: N/A

• Each observation independent of other observations

• Large enough N to justify using $\chi^2$ tests

Page 18: April 6

Illustration of Linearity in Log Odds Assumption

Log odds = -6.2428 + 0.0613* Age

AGE    ODDS
50     0.039
60     0.072
70     0.134

RO (60 vs. 50) = .072/.039 = 1.85

RO (70 vs. 60) = .134/.072 = 1.85 (up to rounding of the odds)

The relative odds for going from 50 to 60 years are the same as for going from 60 to 70 years

Note: Absolute risk is not linear with age
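The note about absolute risk can be illustrated with the numbers on this slide. The Python sketch below takes the slope from the fitted equation and the odds exactly as tabled; the variable names are made up for the illustration.

```python
import math

SLOPE = 0.0613  # log odds per year of age, from the fitted equation
odds_by_age = {50: 0.039, 60: 0.072, 70: 0.134}  # ODDS column as tabled

# Relative odds for any 10-year difference depend only on the slope.
ro_10_years = math.exp(10 * SLOPE)

# Convert each tabled odds to a probability: p = odds / (1 + odds).
prob_by_age = {age: o / (1 + o) for age, o in odds_by_age.items()}

rise_50_to_60 = prob_by_age[60] - prob_by_age[50]
rise_60_to_70 = prob_by_age[70] - prob_by_age[60]

print(f"relative odds per 10 years: {ro_10_years:.2f}")
print(f"risk increase 50 to 60:     {rise_50_to_60:.3f}")
print(f"risk increase 60 to 70:     {rise_60_to_70:.3f}")
```

The relative odds per decade are constant, but the increase in probability from 60 to 70 is larger than from 50 to 60, so risk is not a straight line in age.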

Page 19: April 6

Fitted regression line

Curve based on:

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x$$

$\beta_0$ affects location

$\beta_1$ affects curvature
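The roles of the two coefficients can be seen by solving the equation for p and trying a few values. In this Python sketch every coefficient value is hypothetical and chosen only for illustration; `midpoint` is a made-up helper that returns the x at which p = 0.5.

```python
import math

def logistic_curve(x, b0, b1):
    """p solved from log(p / (1 - p)) = b0 + b1 * x."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

def midpoint(b0, b1):
    """x at which p = 0.5, i.e. where the log odds equal 0."""
    return -b0 / b1

# Hypothetical coefficient pairs (b0, b1), not taken from any fitted model.
base = (-6.0, 0.1)      # reference curve
shifted = (-7.0, 0.1)   # different b0: same shape, moved along x
steeper = (-12.0, 0.2)  # larger b1: same midpoint as base, sharper rise

for label, (b0, b1) in [("base", base), ("shifted", shifted), ("steeper", steeper)]:
    mid = midpoint(b0, b1)
    print(f"{label:8s} midpoint x = {mid:.0f}, "
          f"p at midpoint + 10 = {logistic_curve(mid + 10, b0, b1):.3f}")
```

Changing only b0 slides the whole curve left or right, while a larger b1 makes p climb faster around the midpoint.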