Fully Bayesian Unsupervised Disease Progression Modeling · 2019-11-12


Fully Bayesian Unsupervised Disease Progression Modeling

Arya Pourzanjani∗
Department of Computer Science
University of California, Santa Barbara
Santa Barbara, CA
[email protected]

David Stück
Evidation Health
Santa Barbara, CA
[email protected]

David Sontag
Department of Computer Science
New York University
New York, NY
[email protected]

Luca Foschini
Evidation Health
Santa Barbara, CA
[email protected]

Abstract

We present a practical implementation of a fully unsupervised disease progression model [10]. The implementation utilizes all new components we developed for generic use in Bayesian disease progression modeling. It improves upon [10] by providing a more informative fully Bayesian approach and a faster inference algorithm. The implementation is completely built on the pyMC3 open-source library, making it easy to extend the model and apply it to new settings.

1 Disease Progression Models

Traditionally, disease severity and progression have been assessed manually by physicians using guidelines such as the GOLD criteria for COPD [6]. These guidelines are typically based on rules applied to the patient’s biomarkers, demographics, and other data easily extracted from health records. The sub-area of machine learning called disease progression modeling (DPM) focuses on automating this process [5]. Automation leads to more accurate diagnoses and optimal treatment paths, which can literally be the difference between life and death, as in the case of coagulopathy patients [9]. More broadly, we expect that algorithms that learn disease progression models from electronic health records will lead to new insights on the progression of rare and difficult-to-stage chronic diseases, guiding both clinical practice and medical research.

2 Bayesian Models and pyMC3

Bayesian networks provide a natural framework for modeling disease progression. They allow for the flexible modeling of “hidden states”, which often arise in medical scenarios where measurements are simply proxies for variables of interest. Furthermore, Bayesian posteriors provide a full description of parameters of interest, as opposed to point estimates and simple confidence intervals. Several examples of Bayesian network models for disease progression exist in the literature [1, 2, 4, 7, 10].

pyMC3 is a Python module that provides a unified and comprehensive framework for fitting Bayesian models using MCMC [8]. pyMC3’s key strength is its modularity and extensibility: random variables in a Bayesian network can be easily added or replaced to construct a model, and multiple general-purpose samplers are available.

∗Research performed while interning at Evidation Health

3 New pyMC3 Tools for DPM

We implemented several tools in Python that can be used with pyMC3 to aid with generic disease progression modeling:

• A Markov Jump Process for continuous modeling of state using discrete observations that come at irregular times

• A multi-dimensional binary Markov process for modeling the onset of comorbidities

• A noisy-or network for modeling health measurements as symptoms of comorbidities

We are releasing these tools as free and open-source software, which we hope will help accelerate progress on machine learning research of disease progression modeling.
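As a rough illustration of the first tool, the sketch below shows how a Markov Jump Process turns irregular gaps between visits into stage-transition probabilities, P(Δt) = exp(QΔt). It is not the released pyMC3 component: the generator matrix `Q`, the helper `transition_matrix`, and the visit times are hypothetical, and the matrix exponential is approximated with a plain scaling-and-squaring Taylor series in NumPy to keep the example self-contained.

```python
import numpy as np

def transition_matrix(Q, dt, squarings=10, terms=12):
    """Approximate P(dt) = exp(Q * dt) for a generator matrix Q.

    Scaling and squaring: exponentiate a scaled-down matrix with a short
    Taylor series, then square the result `squarings` times.
    """
    A = np.asarray(Q, dtype=float) * dt / (2 ** squarings)
    P = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for n in range(1, terms + 1):
        term = term @ A / n          # A^n / n!
        P = P + term
    for _ in range(squarings):
        P = P @ P
    return P

# Hypothetical 3-stage progressive disease: rows of Q sum to zero and the
# last stage is absorbing (rates per year, chosen only for illustration).
Q = np.array([
    [-0.5, 0.5, 0.0],
    [0.0, -0.2, 0.2],
    [0.0, 0.0, 0.0],
])

# Irregular visit times in years; each gap gets its own transition matrix.
visit_times = np.array([0.0, 0.3, 1.5, 1.6, 4.0])
gaps = np.diff(visit_times)
matrices = [transition_matrix(Q, dt) for dt in gaps]

for dt, P in zip(gaps, matrices):
    print(f"gap={dt:.1f}y  P(stay in stage 1)={P[0, 0]:.3f}")
```

Because the gap length enters only through Δt, the same generator matrix serves every pair of consecutive visits, however unevenly spaced they are.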

4 Using Our Tools to Implement an Unsupervised DPM

We demonstrate our tools’ capabilities by implementing the unsupervised DPM by Wang et al. described in [10]. The model utilizes all three newly implemented components to infer comorbidities and disease stage from electronic medical records. The top layer is a Markov Jump Process that reveals disease stage at discrete time points. The middle layer is a vector of comorbidities that are either on or off at each time step with some probability according to disease stage. The last layer consists of a noisy-or network where observations or measurements are triggered by comorbidities or a leak term. Figure 1 provides an overview.

[Figure 1: Overview of DPM from Wang et al. [10]. The figure reproduces an excerpt of that paper, including its model outline (S are progression state variables, X are comorbidity variables, and O are observed clinical findings, for N patients) and its noisy-or network diagram (K hidden comorbidities and an always-on hidden leak term activating D observable clinical findings).]
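The bottom layer can be made concrete with a small sketch of the standard noisy-or rule, in which a finding stays off only if the leak and every active comorbidity all fail to trigger it: P(O_d = 1 | x) = 1 − (1 − l_d) ∏_k (1 − z_kd)^(x_k). The function and parameter names (`noisy_or_prob`, `z`, `leak`) and all numeric values are assumptions made for illustration, not the interface of the released tools.

```python
import numpy as np

def noisy_or_prob(x, z, leak):
    """P(O_d = 1 | x) for every finding d under a noisy-or network.

    x    : (K,) binary vector of active comorbidities
    z    : (K, D) probability that comorbidity k triggers finding d
    leak : (D,) probability that the leak term alone triggers finding d
    """
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    leak = np.asarray(leak, dtype=float)
    # Inactive parents contribute a factor of 1 because (1 - z) ** 0 == 1.
    p_off = (1.0 - leak) * np.prod((1.0 - z) ** x[:, None], axis=0)
    return 1.0 - p_off

def log_likelihood(o, x, z, leak):
    """Bernoulli log-likelihood of observed findings o given comorbidities x."""
    p = noisy_or_prob(x, z, leak)
    o = np.asarray(o, dtype=float)
    return float(np.sum(o * np.log(p) + (1.0 - o) * np.log1p(-p)))

# Hypothetical network with K = 2 comorbidities and D = 3 findings.
z = np.array([[0.9, 0.0, 0.5],
              [0.0, 0.8, 0.5]])
leak = np.array([0.1, 0.1, 0.1])

print(noisy_or_prob([0, 0], z, leak))   # only the leak can fire
print(noisy_or_prob([1, 0], z, leak))   # first comorbidity active
print(log_likelihood([1, 0, 1], [1, 0], z, leak))
```

With no comorbidity active, the probabilities reduce to the leak terms, which is why the leak is needed to explain findings that occur without any modeled cause.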

Our implementation of the model from [10] not only provides full posterior estimates (as opposed to point estimates from the EM algorithm), but also takes advantage of pyMC3’s automatic differentiation to enable the NUTS sampler [3]. NUTS is a self-tuning Hamiltonian Monte Carlo sampler which takes advantage of the gradient of the likelihood to propose larger, more intelligent steps.
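To give a sense of what gradient-informed proposals look like, here is a minimal sketch of a plain Hamiltonian Monte Carlo transition built on the leapfrog integrator. It is deliberately not NUTS: the step size and path length are fixed by hand, whereas NUTS sets the path length adaptively. It is also not pyMC3's implementation. The toy standard-normal target and all function names are assumptions for illustration.

```python
import numpy as np

def log_prob(q):
    """Toy target: standard normal log-density, up to a constant."""
    return -0.5 * np.dot(q, q)

def grad_log_prob(q):
    return -q

def leapfrog(q, p, step_size, n_steps):
    """Simulate Hamiltonian dynamics with the leapfrog integrator."""
    q, p = q.copy(), p.copy()
    p += 0.5 * step_size * grad_log_prob(q)   # half step for momentum
    for _ in range(n_steps - 1):
        q += step_size * p                    # full step for position
        p += step_size * grad_log_prob(q)     # full step for momentum
    q += step_size * p
    p += 0.5 * step_size * grad_log_prob(q)   # final half step
    return q, p

def hmc_step(q, rng, step_size=0.2, n_steps=10):
    """One HMC transition: sample momentum, integrate, accept or reject."""
    p0 = rng.standard_normal(q.shape)
    q_new, p_new = leapfrog(q, p0, step_size, n_steps)
    # Metropolis correction for the integrator's discretization error.
    h_old = -log_prob(q) + 0.5 * np.dot(p0, p0)
    h_new = -log_prob(q_new) + 0.5 * np.dot(p_new, p_new)
    if np.log(rng.uniform()) < h_old - h_new:
        return q_new
    return q

rng = np.random.default_rng(0)
q = np.array([3.0, -3.0])      # deliberately poor starting point
samples = []
for _ in range(2000):
    q = hmc_step(q, rng)
    samples.append(q)
samples = np.array(samples)
print(samples[500:].mean(axis=0), samples[500:].std(axis=0))
```

Each proposal follows the gradient for several steps before the accept/reject decision, so the chain can move far from its current position while keeping a high acceptance rate.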

After implementing the model in pyMC3, we evaluated it using a synthetic dataset consisting of N = 100 patients with 2 to 34 time points each, M = 4 disease states, K = 4 comorbidities, and D = 16 types of observations, resulting in a total of 1609 clinical observations. We evaluated the robustness of the model to parameter initialization by comparing the model behavior when parameters are initialized at random vs. initialized to their ground truth values. We found that a random initialization rapidly finds the highest-likelihood region, reaching the same log-likelihood value as a near-equilibrium initialization. Additionally, the inferred distributions of leak terms for the two initialization scenarios look almost identical. The results are reported in Figure 2.
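For readers who want to experiment with data of a similar shape, the following sketch draws a synthetic cohort from a simplified three-layer generative process using the sizes quoted above. It does not reproduce the dataset used in the evaluation: every parameter value, the exponential visit gaps, the rule that comorbidities stay on once they appear, and the name `simulate_patient` are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K, D = 100, 4, 4, 16  # sizes quoted in the text

# All parameter values below are hypothetical, chosen only for illustration.
advance_rate = 0.3                       # rate of moving to the next stage
onset_prob = np.linspace(0.1, 0.6, M)    # P(a comorbidity turns on | stage)
z = rng.uniform(0.0, 0.9, size=(K, D))   # noisy-or activation probabilities
leak = np.full(D, 0.02)                  # noisy-or leak probabilities

def simulate_patient(rng):
    """Return a list of (stage, comorbidities, findings), one per visit."""
    n_visits = rng.integers(2, 35)                # 2 to 34 time points
    gaps = rng.exponential(1.0, size=n_visits)    # irregular visit gaps
    stage, x, visits = 0, np.zeros(K), []
    for dt in gaps:
        # Top layer: advance the stage through exponential holding times.
        remaining = dt
        while stage < M - 1:
            hold = rng.exponential(1.0 / advance_rate)
            if hold > remaining:
                break
            remaining -= hold
            stage += 1
        # Middle layer: comorbidities switch on and (by assumption) stay on.
        x = np.maximum(x, rng.random(K) < onset_prob[stage])
        # Bottom layer: noisy-or over active comorbidities plus the leak.
        p_on = 1.0 - (1.0 - leak) * np.prod((1.0 - z) ** x[:, None], axis=0)
        visits.append((stage, x.copy(), rng.random(D) < p_on))
    return visits

cohort = [simulate_patient(rng) for _ in range(N)]
print(sum(len(v) for v in cohort), "visits simulated")
```

Keeping the ground-truth stages and comorbidities alongside the findings makes it possible to compare a fitted model against known values, as in the initialization experiment described above.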

Figure 2: Top row: Total model log-likelihood when using NUTS. When parameters are initialized at near-equilibrium, the log-likelihood hovers around a stable value (left). The same log-likelihood value is rapidly achieved when parameters are initialized at random (right). Bottom row: distributions of the leak terms in the noisy-or network for near-equilibrium (left) and random (right) initializations. The distributions look almost identical.

References

[1] E. Choi, N. Du, R. Chen, L. Song, and J. Sun. Constructing disease network and temporal progression model via context-sensitive Hawkes process.

[2] K. P. Exarchos, T. P. Exarchos, C. V. Bourantas, M. Papafaklis, K. K. Naka, L. K. Michalis, O. Parodi, D. Fotiadis, et al. Prediction of coronary atherosclerosis progression using dynamic Bayesian networks. In Engineering in Medicine and Biology Society (EMBC), 2013 35th Annual International Conference of the IEEE, pages 3889–3892. IEEE, 2013.

[3] M. D. Hoffman and A. Gelman. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. The Journal of Machine Learning Research, 15(1):1593–1623, 2014.

[4] C. H. Jackson, L. D. Sharples, S. G. Thompson, S. W. Duffy, and E. Couto. Multistate Markov models for disease progression with classification error. Journal of the Royal Statistical Society: Series D (The Statistician), 52(2):193–209, 2003.

[5] D. Mould. Models for disease progression: new approaches and uses. Clinical Pharmacology & Therapeutics, 92(1):125–131, 2012.

[6] R. A. Pauwels, A. S. Buist, P. M. Calverley, C. R. Jenkins, and S. S. Hurd. Global strategy for the diagnosis, management, and prevention of chronic obstructive pulmonary disease. American Journal of Respiratory and Critical Care Medicine, 2012.

[7] R. Pivovarov, A. J. Perotte, E. Grave, J. Angiolillo, C. H. Wiggins, and N. Elhadad. Learning probabilistic phenotypes from heterogeneous EHR data. Journal of Biomedical Informatics, 2015.

[8] J. Salvatier, T. V. Wiecki, and C. Fonnesbeck. Probabilistic programming in Python using PyMC.

[9] D. R. Spahn, B. Bouillon, V. Cerny, T. J. Coats, J. Duranteau, E. Fernandez-Mondejar, D. Filipescu, B. J. Hunt, R. Komadina, G. Nardi, et al. Management of bleeding and coagulopathy following major trauma: an updated European guideline. Crit Care, 17(2):R76, 2013.

[10] X. Wang, D. Sontag, and F. Wang. Unsupervised learning of disease progression models. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 85–94. ACM, 2014.
