poster_INFORMS_healthcare_2015 - condensed



Patient-Level Data Integration of De-Identified Healthcare Databases to Support Improved Predictive Analytics

Yang Yang, Reza Sharifi Sedeh, Min Xue, Nandini Raghavan, Daniel Elgort

Contact: [email protected]; [email protected]; [email protected]

1, Introduction

Various types of de-identified healthcare databases, from clinical and administrative to utilization, have emerged recently, enabling researchers to perform analyses within each individual domain. However, in the absence of patient-identifying data features, current methods do not allow patient-record-level integration across these de-identified databases.

In this paper, we propose a novel approach to overcome this limitation and integrate multiple de-identified databases at the patient record level, so that inter-domain research problems become addressable. In addition, we have developed a scalable healthcare data analytics pipeline that incorporates multiple machine learning methods, including penalized and splined linear models, logistic regression, random forest, and survival models. Depending on the nature of the integrated database and the purpose of the analysis, users can run any combination of the available methods in a timely manner. With this strategy, users can obtain more meaningful findings from the integrated dataset than from a single database or a single analytical method.

2, Data Integration Approach

Many aggregated healthcare databases are so strictly de-identified that all hospital and patient identifiers are removed before any secondary use by researchers [Meystre et al., 2010]. This makes integration across databases highly challenging. We recently developed a hierarchical approach that integrates de-identified databases at the patient level using patient features that are not uniquely identifying, such as age, sex, weight, primary diagnosis and length of hospital stay. The general approach is as follows:

• generate UID from features for each patient.

• calculate patient rarity score for each patient.

• use rarest patients to identify the same hospitals across databases.

• match patients belonging to the same hospital across databases and repeat this for all matched hospitals.

• categorize matching results into confident, impossible and possible matches.

Below is an example for a patient with UID 1.5.1.122.18 and his calculated rarity score.

Table 1.: Calculating the rarity score for a Native American, 18-year-old male patient who had a length of stay (LOS) of 122 days and died in hospital.

The rarity score of 4.5 × 10^-11 can be interpreted as follows: in the hospital population, only about one in every 22 billion patients is expected to share this patient's UID.
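The body of Table 1 did not survive in this transcript, so the exact calculation is not shown. A minimal sketch, assuming the rarity score is the product of the relative frequency of each feature value (which treats the features as independent) and that the UID joins the coded features with dots. The field order, the code values and the frequencies below are hypothetical and chosen only for illustration.

```python
from math import prod

# Hypothetical feature order for the UID; the poster does not state the order.
UID_FIELDS = ("mortality", "race", "sex", "los_days", "age")


def make_uid(patient: dict) -> str:
    """Join the coded feature values with dots to form a UID string."""
    return ".".join(str(patient[field]) for field in UID_FIELDS)


def rarity_score(patient: dict, freq: dict) -> float:
    """Multiply the relative frequency of each of the patient's feature values.

    Assumes the features are independent, so the product approximates the
    share of the population expected to have this exact UID.
    """
    return prod(freq[field][patient[field]] for field in UID_FIELDS)


# Illustrative codes and frequencies only (not the values from Table 1).
patient = {"mortality": 1, "race": 5, "sex": 1, "los_days": 122, "age": 18}
freq = {
    "mortality": {1: 0.1},
    "race": {5: 0.01},
    "sex": {1: 0.5},
    "los_days": {122: 0.0001},
    "age": {18: 0.01},
}

uid = make_uid(patient)
score = rarity_score(patient, freq)
patients_per_uid = 1 / score  # population size at which one such patient is expected
```

The reciprocal gives the interpretation used in the text: for the poster's score, 1 / (4.5 × 10^-11) is about 2.2 × 10^10, that is, roughly 22 billion patients per occurrence of the UID.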

3, Data Integration Approach (cont.)

After generating UIDs for patients, we further added the diagnosis codes to reduce duplicated patient matches. The ICD-9 codes [Centers for Medicare and Medicaid Services et al., 2011] were collapsed to Clinical Classifications categories [Healthcare Cost and Utilization Project et al., 2010] for better accuracy and robustness of the matching. The general rules for matching two patients can be summarized as follows:

• the patients have exactly the same patient UID.

• the patients share at least 50% of the diagnosis codes of the patient with fewer diagnosis codes. For example, if six diagnosis codes have been assigned to Patient A in the "clinical" database and ten to Patient B in the "claims" database, then Patient A and Patient B must share at least three diagnosis codes for us to accept the match.
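The two rules above can be sketched as a single predicate. This is an illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, duplicate codes within one record are counted once, and a record with no diagnosis codes is treated as unable to satisfy the rule.

```python
from math import ceil


def is_patient_match(uid_a, codes_a, uid_b, codes_b, min_share=0.5):
    """Return True when the UIDs are identical and enough diagnosis codes overlap.

    The required overlap is min_share of the code count of the patient
    with fewer (distinct) diagnosis codes.
    """
    if uid_a != uid_b:
        return False
    set_a, set_b = set(codes_a), set(codes_b)  # assumption: duplicates count once
    smaller = min(len(set_a), len(set_b))
    if smaller == 0:
        return False  # assumption: no codes means the rule cannot be met
    required = ceil(min_share * smaller)
    return len(set_a & set_b) >= required


# Worked example from the text: six codes versus ten codes, so three must be shared.
uid = "1.5.1.122.18"
codes_a = [1, 2, 3, 4, 5, 6]                         # hypothetical category numbers
codes_b = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13]         # shares 4, 5, 6 with codes_a
codes_c = [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]        # shares only 5, 6 with codes_a

three_shared = is_patient_match(uid, codes_a, uid, codes_b)
two_shared = is_patient_match(uid, codes_a, uid, codes_c)
```

With six and ten codes the threshold is ceil(0.5 × 6) = 3, so the first pair is accepted and the second is rejected.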

Finally, we summarized the hierarchical matching algorithm in the following flowchart:

[Figure 1 image not reproduced; panels A, B and C. Text recovered from the panels:

Hospital matching flowchart: Set the rarity coefficient threshold, r, to 10^-10. Match the patients of the "clinical" hospital X with coefficients less than r to the "claims" patients. Are there five patient matches and a 30% matching rate between the "clinical" hospital X and any "claims" hospital Y? YES: link the de-identified hospital IDs in the two databases. NO: increase r by a factor of 10.

Matching criteria for two patients' records: same patient ID; share at least 50% of diagnoses. Example: patient A (diagnoses 1, 2, 3, 3, 5; patient ID 12345678) and patient a (diagnoses 2, 3; patient ID 12345678).

Patient matching flowchart: patient matching using basic features (age, gender, race, primary diagnosis, LOS, mortality), within a single year and a single hospital, identifies one-to-one matched records, no matched record, or multiple matched records. A secondary feature is used to narrow the possible pairs. Outcomes: confident matching, impossible matching, possible matching.]

Figure 1.: A. Integration of eICU and HCUP using provided common features (eICU and HCUP are two different healthcare databases; see Section 5, Data Application); B. Hospital matching algorithm flowchart; C. Individual patient matching algorithm flowchart.
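The hospital matching loop in Figure 1B can be sketched as follows. Several details are assumptions, because the flowchart does not specify them: patients are compared by UID only, the 30% matching rate is taken as matched rare patients divided by all rare patients of hospital X, and the search stops once r reaches 1. The function and parameter names are hypothetical.

```python
def link_hospital(clinical_patients, claims_by_hospital,
                  start_exp=-10, max_exp=0, min_matches=5, min_rate=0.30):
    """Try to link one "clinical" hospital to a "claims" hospital.

    clinical_patients: list of (uid, rarity_score) for hospital X.
    claims_by_hospital: dict of claims hospital ID -> set of patient UIDs.
    Returns (claims_hospital_id, r) for the first link found, else (None, r).
    """
    r = 10.0 ** start_exp
    for exp in range(start_exp, max_exp + 1):
        r = 10.0 ** exp  # threshold grows by a factor of 10 on each pass
        rare_uids = {uid for uid, rarity in clinical_patients if rarity < r}
        for hospital_id, claims_uids in claims_by_hospital.items():
            matches = len(rare_uids & claims_uids)
            # min_matches >= 1 guards the division when rare_uids is empty
            if matches >= min_matches and matches / len(rare_uids) >= min_rate:
                return hospital_id, r
    return None, r


# Hypothetical data: six rare patients in hospital X, five of them also in Y1.
clinical_x = [(f"uid{i}", 5e-10) for i in range(6)]
claims = {
    "Y1": {"uid0", "uid1", "uid2", "uid3", "uid4", "other"},
    "Y2": {"uid5"},
}
linked_hospital, final_r = link_hospital(clinical_x, claims)
```

In this toy run no patient has a score below 10^-10, so the first pass finds nothing; at r = 10^-9 all six patients qualify, five of them appear in Y1, and the hospitals are linked.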

4, Analytics Pipeline

PhilipsHealthcareBDS is an automated pipeline that lets the user run a range of statistical and machine learning models on a given dataset quickly and cleanly. The whole pipeline is written in R and runs as a Linux command-line program. It consists of five modules, arranged as shown in the flowchart (Figure 2). The pipeline features a flexible parallel/serial execution scheme, flexible tuning of model parameters, robustness to datasets with mixed types of explanatory and response variables, a complete logging and error-collection system, and easy addition of further models. The pipeline currently contains ten models/algorithms, including Generalized Linear Model with stepwise variable selection; Lasso, Ridge and Elastic Net; Group Elastic Net; SCAD/MCP; Random Forest; Random Survival Forest; Quantile Regression; and Normal-Probit Bivariate Model.
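The design features listed above (a registry to which models can be added, a parallel or serial run, logging and error collection) can be illustrated with a small sketch. PhilipsHealthcareBDS itself is written in R and its internals are not described here, so everything below is a hypothetical Python analogue: the names are invented and two toy baselines stand in for the real models.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline_sketch")

MODELS = {}  # registry: model name -> fitting function


def register(name):
    """Decorator that adds a fitting function to the registry."""
    def wrap(fn):
        MODELS[name] = fn
        return fn
    return wrap


@register("mean_baseline")
def fit_mean(y):
    return sum(y) / len(y)


@register("median_baseline")
def fit_median(y):
    ordered = sorted(y)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def _run_one(name, y):
    """Run one model; return (name, result, error) without raising."""
    try:
        result = MODELS[name](y)
        log.info("model %s finished", name)
        return name, result, None
    except Exception as exc:  # collect the error instead of stopping the run
        log.error("model %s failed: %r", name, exc)
        return name, None, repr(exc)


def run_pipeline(y, names, parallel=False):
    """Run the selected models serially or in parallel; return (results, errors)."""
    if parallel:
        with ThreadPoolExecutor() as pool:
            outcomes = list(pool.map(lambda n: _run_one(n, y), names))
    else:
        outcomes = [_run_one(n, y) for n in names]
    results = {n: r for n, r, e in outcomes if e is None}
    errors = {n: e for n, r, e in outcomes if e is not None}
    return results, errors


results, errors = run_pipeline(
    [1, 2, 3, 10], ["mean_baseline", "median_baseline", "not_registered"],
    parallel=True,
)
```

Adding a model only requires a new decorated function, and a failing or unknown model ends up in `errors` while the other models still complete.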

Figure 2.: Flowchart of PhilipsHealthcareBDS. Modules in square brackets are optional.

6, Results

[Figure 3 image not reproduced. Panel A axes: Predicted Value (x), Residual (y). Panel B: Error Rate versus # of Trees, and Variable Importance.]

Figure 3.: (A) Linear regression residual plot of in-hospital expenditure. (B) Random forest tree error rate (left panel) and variable importance rank (right panel). The two variables with the largest effects on in-hospital expenditure (blue box) are plotted against in-hospital expenditure (the two rightmost plots).

Summary of conclusions

• We found a significant correlation between the observed values of mortality or length of stay (from eICU) and in-hospital expenditure (from HCUP).

• We learned that the in-hospital expenditures (HCUP) of patients who died in hospital (eICU) are higher than those of patients who survived.

• We found that patients in either extremely bad or excellent condition, as inferred from "Predicted Hospital/ICU mortality" or "Acute Physiology Score" (eICU), have higher in-hospital expenditures than patients in moderate condition. These two variables were ranked as the top two predictors of expenditure by a random forest method (Figure 3B).

In addition, there are several other findings: Asian or Pacific Islander patients paid more; patients with more interventional procedures paid more; patients with longer actual hospital/ICU lengths of stay paid more; patients admitted from other health facilities paid less.

5, Data Application

We integrated patients from the Philips eICU database and the Healthcare Cost and Utilization Project (HCUP [a]) State Inpatient Database (SID) for Massachusetts between 2008 and 2011. From this full dataset, we used the "DX1" values (primary diagnosis ICD-9 code) to further extract the patients with Heart Disease (i.e., Heart Failure and Cardiovascular Myocardial Infarction). The available variables are clinical variables, utilization variables, billing variables, demographic variables and hospital characteristics.

We selected five analytical methods and applied them to the real data: 1, Linear regression with stepwise variable selection by the AIC criterion; 2, Penalized linear models such as elastic net, SCAD and MCP; 3, Group-based penalized linear model; 4, Random Forest; 5, Quantile Regression.
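The poster does not write out the objective functions of these methods. For orientation, the standard textbook forms of two of them are given below; the notation is an assumption introduced here ($y_i$ the in-hospital expenditure of patient $i$, $x_i$ the predictor vector, $\beta$ the coefficients, $\lambda \ge 0$ and $\alpha \in [0, 1]$ the penalty parameters, $\tau \in (0, 1)$ the quantile level), and the exact parameterization used in the pipeline may differ.

```latex
% Elastic net: penalized least squares
\hat{\beta}^{\mathrm{EN}} = \arg\min_{\beta}\;
  \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - x_i^{\top}\beta\bigr)^2
  + \lambda\Bigl(\alpha\,\lVert\beta\rVert_1
  + \tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\Bigr)

% Quantile regression at level \tau: check (pinball) loss
\hat{\beta}(\tau) = \arg\min_{\beta}\;
  \sum_{i=1}^{n}\rho_\tau\bigl(y_i - x_i^{\top}\beta\bigr),
\qquad
\rho_\tau(u) = u\,\bigl(\tau - \mathbf{1}\{u < 0\}\bigr)
```

Setting $\alpha = 1$ gives the lasso and $\alpha = 0$ gives ridge regression; SCAD and MCP keep the same least-squares term but replace the penalty with a nonconvex one.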

[a] Disclaimer: Study design, data sources, analysis and findings described in this paper were executed in compliance with the Data Use Agreement of HCUP.

References

Healthcare Cost and Utilization Project, et al. Clinical Classifications Software (CCS) for ICD-9-CM. Rockville, MD: Agency for Healthcare Research and Quality, 2010.

Centers for Medicare and Medicaid Services, et al. ICD-9-CM official guidelines for coding and reporting. US GPO, Washington, DC, 2011.

Stephane M Meystre, F Jeffrey Friedlin, Brett R South, Shuying Shen, and Matthew H Samore. Automatic de-identification of textual documents in the electronic health record: a review of recent research. BMC Medical Research Methodology, 10(1):70, 2010.