Logistic Regression: A Solution for Imperfect Binary Data
Xue Yao, Lisa Lix
Department of Community Health Sciences
Winnipeg SAS Users Group
May 17, 2013
Outline
• Brief Review of Logistic Regression
• Logistic Regression in SAS
• Case Study: Utilizing logistic regression to deal with
imperfect binary data (i.e., missing and misclassified values)
– SAS Global Forum Paper 283-2013: A Flexible Method to Apply
Multiple Imputation Using SAS/IML® Studio
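Logistic Regression
• Models the probability of the event of interest based on
the values of the independent variables
• $\operatorname{logit}(Y = 1 \mid \mathbf{X}) = \log\frac{P}{1 - P} = \boldsymbol{\beta}\mathbf{X}$
where $P = \Pr(Y = 1 \mid \mathbf{X}) = E(Y \mid \mathbf{X})$
• $P = \frac{\exp(\boldsymbol{\beta}\mathbf{X})}{1 + \exp(\boldsymbol{\beta}\mathbf{X})}$
• $\text{odds} = \frac{P}{1 - P} = \exp(\boldsymbol{\beta}\mathbf{X})$
Logistic Regression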
• M1 is a continuous independent variable
• M2 is a binary independent variable
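As a short worked illustration of the formulas above, the derivation below uses the two-predictor form that appears later in the case study. The value 0.5 for the linear predictor is an arbitrary number chosen only for the arithmetic; it is not taken from the presentation.

```latex
\begin{align*}
\operatorname{logit}(Y = 1 \mid \mathbf{M}) &= \beta_0 + \beta_1 M_1 + \beta_2 M_2 \\[4pt]
% One-unit increase in the continuous variable M_1, with M_2 held fixed
\frac{\text{odds}(M_1 + 1,\, M_2)}{\text{odds}(M_1,\, M_2)} &= \exp(\beta_1) \\[4pt]
% Binary variable M_2 switched from 0 to 1, with M_1 held fixed
\frac{\text{odds}(M_1,\, M_2 = 1)}{\text{odds}(M_1,\, M_2 = 0)} &= \exp(\beta_2) \\[4pt]
% Arbitrary numeric example: linear predictor equal to 0.5
\boldsymbol{\beta}\mathbf{X} = 0.5 \;\Rightarrow\;
\text{odds} = e^{0.5} \approx 1.65, \quad
P &= \frac{e^{0.5}}{1 + e^{0.5}} \approx 0.62
\end{align*}
```

So exp(β1) is the odds ratio per unit of M1, and exp(β2) is the odds ratio comparing M2 = 1 with M2 = 0.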
Logistic Regression in SAS
• CATMOD, GENMOD, PROBIT and LOGISTIC procedures perform
logistic regression in SAS
• LOGISTIC Procedure Syntax:
PROC LOGISTIC <options>;
    CLASS discrete-variables </ options>;
    MODEL response-variable = <effects> </ options>;
    OUTPUT <OUT=SAS-data-set> <options>;
RUN;
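A minimal sketch of this syntax in use is shown below. The data set name VALID and the variable names Y, M1, M2, and PHAT are hypothetical placeholders, not names from the presentation.

```sas
/* Hypothetical data set VALID: binary outcome Y (0/1),       */
/* continuous predictor M1, binary predictor M2 (0/1).        */
proc logistic data=valid;
  class m2 (ref='0') / param=ref;   /* M2 = 0 is the reference level   */
  model y(event='1') = m1 m2;       /* model the probability of Y = 1  */
  output out=pred p=phat;           /* PHAT = predicted Pr(Y = 1)      */
run;
```

Specifying EVENT='1' makes the modeled outcome explicit; without it, PROC LOGISTIC models the lower ordered response level by default.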
Case Study: Utilizing logistic regression to
deal with imperfect binary data
• Problem: Misclassification of disease status (0 or 1) in
administrative health data biases descriptive and inferential
analyses that depend on disease status
• Solution: Use logistic regression as the predictive model in a
multiple imputation method
• Data Sources: A validation dataset (e.g., medical charts) contains
accurate measures that can be linked to the administrative
health data
Illustration of Data
Using Logistic Predictive Model For Multiple
Imputation
Figure 1. A Schematic Diagram of the Multiple Imputation Method
Logistic predictive model
Step 1: Prepare the Data
Step 2: Build the Logistic Predictive Model
Step 3: Generate Multiple Parameters
Step 4: Create Multiple Datasets
Step 5: Analyze the Multiple Complete Datasets
Step 2: Build the Logistic Predictive Model
$\operatorname{logit}(Y = 1 \mid \mathbf{M}) = \beta_0 + \beta_1 M_1 + \beta_2 M_2$
• Estimate the parameters of the model using the LOGISTIC
procedure in SAS/IML Studio
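One way this step might look is sketched below, assuming a hypothetical validation data set VALID in which Y is the accurate disease status and M1 and M2 are numeric predictors (M2 coded 0/1). The data set and variable names are illustrative only. In SAS/IML Studio, procedure code like this would typically be wrapped in a SUBMIT / ENDSUBMIT block.

```sas
/* Fit the predictive model on the (hypothetical) validation data.  */
/* OUTEST= saves the coefficient estimates; COVOUT adds their       */
/* estimated covariance matrix to the same data set.                */
proc logistic data=valid outest=est covout;
  model y(event='1') = m1 m2;
run;

/* EST now holds one _TYPE_='PARMS' row (the estimates) and         */
/* _TYPE_='COV' rows (the covariance matrix), with columns          */
/* Intercept, m1, and m2.                                           */
proc print data=est;
run;
```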
Step 3: Generate Multiple Parameters
• Generate multiple sets of coefficients for the logistic predictive
model from the estimated coefficients and covariance matrix
produced by the LOGISTIC procedure
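The sketch below shows one common way to generate such coefficient sets: drawing them from a multivariate normal distribution centered at the estimates. That distributional choice is an assumption here, since the slides do not state how the draws are made. The data set EST (from OUTEST= with COVOUT), the number of imputations, the seed, and the output names are all hypothetical.

```sas
proc iml;
  /* Read the estimated coefficients (1 x 3 row vector) */
  use est where(_TYPE_ = "PARMS");
  read all var {Intercept m1 m2} into beta;
  close est;

  /* Read the estimated covariance matrix (3 x 3) */
  use est where(_TYPE_ = "COV");
  read all var {Intercept m1 m2} into cov;
  close est;

  call randseed(2013);               /* arbitrary seed for reproducibility */
  m = 5;                             /* number of imputations (assumed)    */
  draws = RandNormal(m, beta, cov);  /* m x 3: one coefficient set per row */

  /* Save the draws with an imputation index */
  out = T(1:m) || draws;
  create betas from out[colname={"_Imputation_" "b0" "b1" "b2"}];
  append from out;
  close betas;
quit;
```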
Step 4: Create Multiple Datasets
• Predict/impute the disease status (1 or 0) multiple times
using the generated coefficients and the logistic predictive model
• Save the dataset that contains the imputed disease status
and the imputation number for further
analysis
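A minimal DATA step sketch of this step follows. It assumes a hypothetical data set BETAS with one row per imputation (variables _Imputation_, b0, b1, b2) and a hypothetical administrative data set ADMIN containing M1 and M2. It imputes every record; how records with a validated status should be handled is not specified in the slides and is left out here.

```sas
data imputed;
  if _n_ = 1 then call streaminit(2013);    /* arbitrary seed */
  set betas;                                /* one coefficient set per imputation */
  do i = 1 to n_admin;
    set admin point=i nobs=n_admin;         /* re-read every record for this draw */
    /* Predicted probability from the logistic predictive model */
    p_hat = 1 / (1 + exp(-(b0 + b1*m1 + b2*m2)));
    /* Imputed disease status: 1 with probability p_hat, else 0 */
    y_imp = rand('bernoulli', p_hat);
    output;
  end;
  drop b0 b1 b2;
run;
```

The result is a stacked data set, ordered by _Imputation_, with one copy of the administrative records per imputation.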
Step 5: Analyze the Multiple Complete Datasets
• Estimate the disease prevalence
– PROC UNIVARIATE to estimate the prevalence in each imputed dataset
– PROC MIANALYZE to combine the estimates from each dataset
• Outputs of PROC UNIVARIATE and MIANALYZE
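The two procedures can be chained roughly as sketched below. The stacked data set IMPUTED, the variable Y_IMP, and the output names are hypothetical; the data set is assumed to be sorted by _Imputation_.

```sas
/* Prevalence (mean of the 0/1 status) and its standard error, */
/* computed separately within each imputed dataset             */
proc univariate data=imputed noprint;
  by _Imputation_;
  var y_imp;
  output out=prev_by_imp mean=prev stdmean=se_prev;
run;

/* Combine the per-imputation estimates into one estimate */
proc mianalyze data=prev_by_imp;
  modeleffects prev;
  stderr se_prev;
run;
```

PROC MIANALYZE pools the results with Rubin's rules: the combined estimate is the average of the per-imputation estimates, and its variance adds the between-imputation variance to the average within-imputation variance.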
Results
• Multiple imputation based on the logistic model improves the
accuracy of the disease prevalence estimate
Figure: Chart comparing true prevalence (True Prev), observed prevalence (Obs Prev), and imputed prevalence (Imp Prev); prevalence axis from 0 to 0.45
Conclusions
• Imputation of missing data is better than discarding incomplete
observations. – Frank E. Harrell
• Misclassification of binary data can be treated as a missing data
problem
• For a monotone missing data pattern, the binary data can be
imputed using PROC MI with a logistic model
• For an arbitrary missing data pattern, the proposed approach can
be used to impute more than one binary variable
simultaneously
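For the monotone case mentioned above, a PROC MI call might look like the following sketch. It assumes a hypothetical linked data set LINKED in which M1 and M2 are fully observed and the disease status Y is missing for records without validation data. The names, seed, and number of imputations are illustrative only.

```sas
proc mi data=linked nimpute=5 seed=2013 out=mi_out;
  class y;                        /* Y is the binary variable to impute     */
  var m1 m2 y;                    /* order defines the monotone pattern     */
  monotone logistic(y = m1 m2);   /* logistic regression imputation model   */
run;
```

The OUT= data set stacks the completed datasets and identifies each one with the _Imputation_ variable.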
Thank you!
Your comments and questions are valued and encouraged. Please
contact
mailto:[email protected]