Discriminative Regularization for Latent Variable Models with Applications to Electrocardiography

Andrew C. Miller 1 Ziad Obermeyer 2 John P. Cunningham 3 Sendhil Mullainathan 4

Abstract

Generative models often use latent variables to represent structured variation in high-dimensional data, such as images and medical waveforms. However, these latent variables may ignore subtle, yet meaningful features in the data. Some features may predict an outcome of interest (e.g. heart attack) but account for only a small fraction of variation in the data. We propose a generative model training objective that uses a black-box discriminative model as a regularizer to learn representations that preserve this predictive variation. With these discriminatively regularized latent variable models, we visualize and measure variation in the data that influences a black-box predictive model, enabling an expert to better understand each prediction. With this technique, we study models that use electrocardiograms to predict outcomes of clinical interest. We evaluate our approach on synthetic and real data with statistical summaries and an experiment carried out by a physician.

1 Data Science Institute, Columbia University, New York, NY, USA. 2 School of Public Health, UC Berkeley, Berkeley, CA, USA. 3 Department of Statistics, Columbia University, New York, NY, USA. 4 Booth School of Business, University of Chicago, Chicago, IL, USA. Correspondence to: ACM <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

1. Introduction

Machine learning research has led to extraordinary prediction results across many data types but less success in deriving human-level insight from (or incorporating that insight into) our models. Two such problems include model diagnosis: using expert human knowledge to understand and improve upon model shortcomings; and model-based discovery: understanding what human-interpretable features drive predictive performance. To be concrete, consider these biomedical applications:

(a) A newly trained model detects a common heart condition (e.g. atrial fibrillation or afib, a condition observable in an EKG) with reasonable accuracy, but below the level of a skilled cardiologist. How can we use human expertise to diagnose model shortcomings?

(b) By contrast, using EKG waveforms as inputs to a black-box predictor can outperform a cardiologist in predicting future cardiac events, such as heart attacks. It is critical to understand what EKG features that predictor is using, insight which can be used to motivate future experiments, treatments, and care protocols.

Though these scenarios are specific, they raise a general problem. Given a black-box predictive model m(x) (e.g. a neural network) that converts a high-dimensional signal x (e.g. an EKG) into a prediction score of output y (e.g. afib), we would like to show domain experts what the algorithm “sees”: what variation in x influences predictions of y.

In recent years, model interpretability has received considerable attention (Ribeiro et al., 2016; Doshi-Velez & Kim, 2017; Kindermans et al., 2017; Narayanan et al., 2018). One way to interrogate a predictive model is to examine perturbations of an input x that influence the prediction the most.¹

For example, we can compute the gradient of m(x) with respect to the model input (e.g. pixels) at some data point x. This will produce a saliency map that identifies which parts of the input most influence the outcome of interest (Baehrens et al., 2010; Simonyan et al., 2013). One shortcoming of this approach is that the gradient contains only local information; for high-dimensional structured inputs, following the gradient flow for any moderate distance will produce meaningless inputs that bear little resemblance to real observations.

We can improve upon saliency maps by restricting our perturbations to remain on the “data manifold.” For instance, the activation maximization approach uses a deep generative model to represent the space of natural images with a latent variable z (Nguyen et al., 2016). To interrogate an input, a new observation is synthesized by finding the z that maximizes the prediction score for some classifier of interest. This data manifold constraint enables the visualization of realistic image features that influence the prediction.

¹ For clarity, we will always use the term prediction to refer to the output of m(x). Further, we assume the discriminative model m(x) has been given to us: its parameters have been fit and fixed.

One issue with this approach is that a generative model may ignore subtle, yet important features of the data. Generative models fit with maximum likelihood optimize for reconstruction error, which focuses on representing directions of high variation. This can be problematic if the variation we are interested in visualizing and exploring consists of predictive features of x: potentially subtle, structured variation in the data that influences m(·) the most. We show that the standard (approximate) maximum likelihood objective can lead to under-representation of predictive features in x.

To address these issues, we propose the discriminatively regularized variational autoencoder (DR-VAE), a simple way to constrain latent variables to represent the variation that a black-box model m(x) uses to form predictions. To fit a DR-VAE, we augment the typical maximum likelihood-based objective with a regularization term that depends on the discriminative model m(x). This constraint gives the user more control over the semantic meaning of the latent representation even when we train on a separate set of unlabeled data.

The contributions of this work are: (i) a constrained generative model objective that retains targeted, semantically meaningful properties;² (ii) an information theoretic interpretation of the regularizer; (iii) an empirical study of trade-offs in a synthetic setting, a toy data setting, and a real setting with EKGs that demonstrates the advantages of incorporating such constraints; (iv) the use of model-morphed data to interpret black-box predictors in a real biomedical informatics application; and (v) validation of the technique using statistical summaries and physician assessment.

2. Background and Problem Setup

We are given an already-trained and fixed discriminative model m : X → R that maps a data point x ∈ X to a real-valued prediction score (e.g. the conditional log-odds for a binary classifier) for an outcome of interest y. In our motivating example, x is a D-dimensional EKG, where D = 300 (3 leads × 100 samples per lead) and y = 1 indicates the patient exhibits “ST elevation,” a clinically important indicator of potential heart attack (myocardial ischemia). The discriminative model computes the conditional probability that a patient has ST elevation given the EKG observation:

$$ m(x) = \operatorname{logit}\big(\Pr(y = 1 \mid x)\big) \triangleq s . \tag{1} $$

An example of such a discriminative model m(·) is a neural network that transforms x into intermediate representations (e.g. hidden layers) and eventually a scalar prediction. We often ask how a neural network arrives at its predictions. Given a neural network model that was trained and validated on data from another hospital system but fails in our hospital setting, how do we diagnose these model shortcomings?

² To be concrete, throughout we use “semantically meaningful information” to mean information used by the predictor to form prediction m(x). We measure this with discriminative reconstruction error defined in Equation 19.

Popular approaches examine variation in x that changes the prediction m(x) by following the gradient of m(x). One limitation, however, is that the gradient is local and ignores global structure in X-space. Following the gradient flow defined by ∇ₓm(x) to increase the prediction score, we will generate physiologically implausible inputs (e.g. meaningless images or EKGs). We can constrain this exploration by restricting our search over X with a generative latent variable model (Nguyen et al., 2016).

Generative latent variable models. A latent variable model is a probabilistic description of a high-dimensional data vector, x, using a low-dimensional representation, z. A common model for continuous-valued x ∈ R^D uses a parametric mean function gθ(z), resulting in the model

$$ z \sim p(z) , \qquad x \sim g_\theta(z) + \sigma \cdot \varepsilon \tag{2} $$

where ε ∼ N(0, 1) and σ is the observation noise scale. When gθ(·) is a deep neural network, we call this a deep generative model (DGM). One way to fit a DGM to data is the autoencoding variational Bayes (AEVB) framework (Kingma & Welling, 2013; Rezende et al., 2014). The AEVB framework fits DGMs via optimization of the evidence lower bound (ELBO), a tractable lower bound to the marginal likelihood of the data, p(x | θ) (Jordan et al., 1999; Blei et al., 2017). For efficiency, AEVB introduces an inference network parameterized by a global variational parameter λ that specifies the posterior approximation as a conditional distribution qλ(z | x). Together, the generative network and inference network are referred to as a variational autoencoder (VAE). For a single observation x, the VAE objective is

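$$ \mathcal{L}_{\text{VAE}}(\theta, \lambda) = \mathbb{E}_{q_\lambda(z \mid x)}\left[ \ln p_\theta(x, z) - \ln q_\lambda(z \mid x) \right] \le \ln p_\theta(x) . $$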

We maximize this objective with respect to generative parameters θ and inference network parameters λ. A generative model captures the distribution of the data x via the latent representation z. By varying z, we can explore the structural variation in x (represented by the generative model) that influences m(x).
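To make the lower-bound property concrete, the following is a minimal sketch that assumes a linear decoder gθ(z) = Wz with a standard normal prior, so that both ln pθ(x) and the ELBO have closed forms. The function names, the diagonal Gaussian form of q, and the toy dimensions are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def log_marginal(x, W, sigma):
    """Exact ln p(x) when x = W z + sigma * eps, with z ~ N(0, I) and eps ~ N(0, I)."""
    D = x.shape[0]
    cov = W @ W.T + sigma**2 * np.eye(D)
    _, logdet = np.linalg.slogdet(cov)
    quad = x @ np.linalg.solve(cov, x)
    return -0.5 * (D * np.log(2 * np.pi) + logdet + quad)

def elbo(x, W, sigma, mu, s2):
    """Closed-form ELBO for a diagonal Gaussian q(z | x) = N(mu, diag(s2))."""
    D = x.shape[0]
    resid = x - W @ mu
    # E_q ||x - W z||^2 = ||x - W mu||^2 + trace(W diag(s2) W^T)
    expected_sq_err = resid @ resid + np.sum((W**2) @ s2)
    exp_loglik = (-0.5 * D * np.log(2 * np.pi * sigma**2)
                  - 0.5 * expected_sq_err / sigma**2)
    # KL(q(z | x) || p(z)) for a diagonal Gaussian against N(0, I)
    kl = 0.5 * np.sum(s2 + mu**2 - 1.0 - np.log(s2))
    return exp_loglik - kl

# Toy check with a one-dimensional latent (hypothetical sizes).
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 1))
sigma = 0.5
x = rng.normal(size=3)

# With K = 1 the exact posterior is Gaussian and q can match it.
post_var = 1.0 / (1.0 + (W[:, 0] @ W[:, 0]) / sigma**2)
post_mean = post_var * (W.T @ x) / sigma**2

exact = log_marginal(x, W, sigma)
tight = elbo(x, W, sigma, post_mean, np.array([post_var]))
loose = elbo(x, W, sigma, post_mean + 0.3, np.array([post_var]))
```

In this sketch the bound is tight when q equals the exact posterior and strictly loose once the variational mean is shifted away from it, which is the gap the inference network is trained to close.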

3. Discriminative Regularization

Fitting a generative model requires choosing what variation in x is structure and what is noise. A deep generative model implicitly makes this choice by explaining variation in x with either gθ(z) (i.e. structure) or σ · ε (i.e. noise). What is structure and what is noise, however, is not always obvious and not always easy to specify. A discriminative model m(x) may be influenced by subtle features in x. If the maximum likelihood objective targets reconstruction error (e.g. as a Gaussian likelihood does), then subtle variation in x can be ignored: the optimization procedure pushes this variation into the noise term and gθ(z) ignores these features.

[Figure 1: two panels, (a) DGM and (b) DR-VAE.]

Figure 1. Comparison of the VAE and DR-VAE. Left: Graphical representation of a DGM, in which a low-dimensional latent variable z generates observed data x for all N observations. Right: Computational graph of a VAE (solid black) and a DR-VAE (additional dotted red). A VAE feeds data x through the recognition network (i.e. the “encoder”) which yields an approximate posterior distribution over z. A sample from this posterior is then mapped through the generative model gθ(z) (i.e. the “decoder”) producing the synthetic observation x̄; a loss penalizes the divergence between x and x̄. The DR-VAE adds an additional penalty: the predictive function we care about, m(·), must also be close for x and x̄. The strength of this penalty is determined by a tuning parameter β > 0.

Ignoring predictive features in x will prevent model-based exploration of m(x) with latents z: gθ(·) and z will not be able to reconstruct certain features in x that are important to m(x) because the generative model is focused on reconstructing features that are unimportant to m(x).

Additionally, deep generative models can be overly flexible. This flexibility results in an equivalence class of generative models with similar likelihood performance, but with different reconstruction properties. Our goal is to learn a DGM that uses this flexibility to represent potentially subtle, yet predictive features with the latent variable z. We propose maximizing the following objective

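$$ \mathcal{L}^{(m)}_{\text{DR-VAE}}(\theta, \lambda) = \underbrace{\mathcal{L}_{\text{VAE}}(\theta, \lambda)}_{\text{model likelihood}} - \beta \cdot \underbrace{\mathcal{L}^{(m)}_{\text{disc.}}(\theta, \lambda)}_{\text{discrim. regularizer}} . \tag{3} $$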

We refer to a DGM fit with this objective as a discriminatively regularized VAE (DR-VAE). This objective augments the typical evidence lower bound with a penalty on parameters that do not result in good discriminative reconstruction for predictor m(x). This regularizer penalizes differences in the discriminative model score between real data and reconstructed data

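\begin{align}
\mathcal{L}^{(m)}_{\text{disc.}}(\theta, \lambda) &= \mathbb{E}_{z \sim q_\lambda(z \mid x)}\left[ D\big(m(x) \,\|\, m(\bar{x})\big) \right] \tag{4} \\
\bar{x} &\triangleq g_\theta(z) . \tag{5}
\end{align}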

Here D(· || ·) is a divergence function whose specific form depends on the output of m(·); we consider binary and continuous outcomes below. The discriminative penalty can be viewed as a type of posterior predictive check (Gelman et al., 2013). During model training, this term ensures that simulated data from our approximate posterior encode features we care about: the discriminative model predictions.

Note that to optimize Equation 3 with gradient-based updates, we will need to compute the gradient of m(x) with respect to the input, a requirement shared by saliency maps (Baehrens et al., 2010; Simonyan et al., 2013) and activation maximization (Nguyen et al., 2016).

Binary outcomes. For binary outcomes, the discriminative model m(x) outputs the conditional probability m(x) = Pr(y = 1 | x), a scalar value. To constrain the difference between m(x) and m(x̄) we penalize the Kullback-Leibler divergence between the two conditional distributions

$$ D_{KL}(q \,\|\, p) = q \cdot \ln\frac{q}{p} + (1 - q) \cdot \ln\frac{1 - q}{1 - p} , \tag{6} $$

where q = m(x) and p = m(x̄). We also consider the squared loss in the logit-probability values

$$ D_{\text{logit}}(q \,\|\, p) = \left( \ln\frac{q}{1 - q} - \ln\frac{p}{1 - p} \right)^2 . \tag{7} $$
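The two binary-outcome divergences in Equations 6 and 7 can be sketched as follows. The function names and the clipping constant `eps` are assumptions added for numerical safety; they are not part of the original definitions.

```python
import numpy as np

def kl_bernoulli(q, p, eps=1e-12):
    """Equation 6: KL divergence between Bernoulli(q) and Bernoulli(p)."""
    q = np.clip(q, eps, 1 - eps)  # clipping is an added safeguard against log(0)
    p = np.clip(p, eps, 1 - eps)
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

def squared_logit(q, p, eps=1e-12):
    """Equation 7: squared difference of the two log-odds."""
    q = np.clip(q, eps, 1 - eps)
    p = np.clip(p, eps, 1 - eps)
    logit = lambda u: np.log(u) - np.log1p(-u)
    return (logit(q) - logit(p)) ** 2

# q = m(x) is the score on real data, p = m(x_bar) the score on the reconstruction.
same = kl_bernoulli(0.3, 0.3)
far_kl = kl_bernoulli(0.9, 0.1)
far_logit = squared_logit(0.9, 0.1)
```

Both functions are zero when the two scores agree and grow as the reconstruction's score drifts from the original; the logit version penalizes disagreement near 0 or 1 more heavily than differences in probability alone would suggest.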

Continuous outcomes. For models that predict a continuous outcome, the predictive model encodes the conditional expectation, m(x) = E[y | x]. Assuming the model is homoscedastic, the predictive variance V(y | x) = σ² is the same for all x. Further assuming the outcome is conditionally Gaussian distributed, the KL-divergence between the predictive distribution m(x) and m(x̄) is simply a function of the squared difference in expectations

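\begin{align}
D_{KL}\big(m(x) \,\|\, m(\bar{x})\big) &= \frac{1}{2\sigma^2} \big(m(x) - m(\bar{x})\big)^2 \tag{8} \\
&= \frac{1}{2\sigma^2} \big(\mathbb{E}[y \mid x] - \mathbb{E}[y \mid \bar{x}]\big)^2 . \tag{9}
\end{align}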
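As a quick numerical check of Equations 8 and 9, the sketch below compares the general univariate Gaussian KL divergence with the homoscedastic special case. The function names and the example values are hypothetical.

```python
import numpy as np

def kl_gaussian(mu_q, sd_q, mu_p, sd_p):
    """General KL( N(mu_q, sd_q^2) || N(mu_p, sd_p^2) ) for univariate Gaussians."""
    return (np.log(sd_p / sd_q)
            + (sd_q**2 + (mu_q - mu_p) ** 2) / (2 * sd_p**2)
            - 0.5)

def kl_homoscedastic(m_x, m_xbar, sd):
    """Equation 8: with a shared variance only the squared mean difference remains."""
    return (m_x - m_xbar) ** 2 / (2 * sd**2)

# Hypothetical predicted ages for a beat and its reconstruction, noise scale 4.
general = kl_gaussian(61.0, 4.0, 58.5, 4.0)
special = kl_homoscedastic(61.0, 58.5, 4.0)
```

When the two standard deviations are equal, the log-ratio and variance terms of the general formula cancel, leaving exactly the scaled squared difference used as the penalty.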

Valid evidence lower bound. The divergence functions described above are all bounded below by zero (and only equal to zero when m(x) and m(x̄) are equal). Fixing the regularization coefficient β > 0 ensures the loss remains a valid lower bound to the marginal likelihood

$$ \mathcal{L}^{(m)}_{\text{DR-VAE}}(\theta, \lambda) \le \ln p_\theta(x) . \tag{10} $$

This eliminates potential optimization pathologies (e.g. the optimizer only maximizing the regularization term).
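The objective in Equation 3 can be estimated by Monte Carlo for a single observation. The sketch below is illustrative only: it assumes a linear decoder gθ(z) = Wz, a diagonal Gaussian encoder, and a logistic stand-in for the black-box predictor m(·); all names and sizes are hypothetical rather than taken from the paper's code.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def kl_bernoulli(q, p, eps=1e-12):
    q = np.clip(q, eps, 1 - eps)
    p = np.clip(p, eps, 1 - eps)
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

def dr_vae_objective(x, W, sigma, mu, s2, w_m, beta, n_samples=256, seed=0):
    """Monte Carlo estimate of Equation 3 for one observation x.

    Assumed forms: decoder g(z) = W z, encoder q(z | x) = N(mu, diag(s2)),
    predictor m(x) = sigmoid(w_m . x).
    """
    rng = np.random.default_rng(seed)
    D, K = W.shape
    z = mu + np.sqrt(s2) * rng.normal(size=(n_samples, K))  # samples from q
    x_bar = z @ W.T                                          # reconstructions
    log_lik = (-0.5 * D * np.log(2 * np.pi * sigma**2)
               - 0.5 * np.sum((x - x_bar) ** 2, axis=1) / sigma**2)
    log_prior = -0.5 * K * np.log(2 * np.pi) - 0.5 * np.sum(z**2, axis=1)
    log_q = (-0.5 * np.sum(np.log(2 * np.pi * s2))
             - 0.5 * np.sum((z - mu) ** 2 / s2, axis=1))
    elbo = np.mean(log_lik + log_prior - log_q)
    # Discriminative penalty (Equation 4) with the Bernoulli KL of Equation 6.
    penalty = np.mean(kl_bernoulli(sigmoid(w_m @ x), sigmoid(x_bar @ w_m)))
    return elbo - beta * penalty, elbo, penalty

rng = np.random.default_rng(1)
W = rng.normal(size=(5, 2))
x = rng.normal(size=5)
w_m = rng.normal(size=5)
mu, s2 = np.zeros(2), 0.5 * np.ones(2)

obj_vae, elbo_est, pen = dr_vae_objective(x, W, 0.5, mu, s2, w_m, beta=0.0)
obj_dr, _, _ = dr_vae_objective(x, W, 0.5, mu, s2, w_m, beta=5.0)
```

Because the penalty is non-negative, the regularized estimate can never exceed the ELBO estimate computed from the same samples, which mirrors the argument behind Equation 10.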

Relationship to mutual information. For additional intuition, we can relate the discriminative penalty to information theoretic quantities. The conditional mutual information is

$$ I(x; y \mid z) = \mathbb{E}_{p(x, z)}\left[ D_{KL}\big(p(y \mid z, x) \,\|\, p(y \mid z)\big) \right] . \tag{11} $$


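We can interpret the discriminative penalty as

\begin{align}
\mathcal{L}^{(m)}_{\text{disc.}}(\theta, \lambda) &= \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}_{z \sim q_\lambda(z \mid x_n)}\left[ D_{KL}\big(m(x_n) \,\|\, m(\bar{x}_n)\big) \right] \tag{12} \\
&\approx \mathbb{E}_{p(x)\, q_\lambda(z \mid x)}\left[ D_{KL}\big(p(y \mid x) \,\|\, p(y \mid z)\big) \right] \tag{13} \\
&= I(x; y \mid z) , \tag{14}
\end{align}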

where we have made two assumptions: (i) y is conditionally independent of z given x (i.e. z adds no information about y beyond x), and (ii) the term m(x̄) is approximately

$$ p(y \mid z) = \mathbb{E}_{p(x \mid z)}\left[ p(y \mid x) \right] \approx p\big(y \mid \mathbb{E}_{p(x \mid z)}[x]\big) = m(\bar{x}) . $$

The discriminative regularizer can be interpreted as a penalty on the approximate conditional mutual information between x and y given z. This interpretation offers a key insight: minimizing the value of Equation 14 encourages the generative model to minimize the informativeness of x about y when we know the value of z. This forces y to be explained by z alone and not by variation described by σ · ε.

3.1. Model Morphings

A generative model for data x ∈ X with latent variable z equips us with new tools. We can use this representation to find similar patients (e.g. patients close in z-space). We can compute the probability of new observations, enabling us to cluster patients, detect outliers, or compare models.

We can also use the model to synthesize new samples, a task we focus on in this section. Given a generative model gθ(z), we can manipulate z in various ways to generate new instances of data. Synthesizing new data can help us understand the patient-specific features that are predictive of a particular outcome. This may enable the discovery of new features or reveal the physiological roots of a phenomenon that can be targeted with more precise measurements, therapies, or further analysis. Synthesized data can also be used for pedagogical purposes, highlighting subtle features identifiable to an expert but difficult to discern for a novice. And as stated before, synthesized data can help us understand what features are important to a black-box predictor. However, for any of these efforts to be successful, that latent space must encode semantically meaningful variation with respect to the outcome of interest.

To explore features of x important to predictor m(·), we propose following the gradient of m(·) with respect to the input along the subspace X̄ = {x : x = gθ(z), z ∈ R^K}. Starting at some x^(0) with some z^(0), we trace out a model-morphed trajectory by recursively computing

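\begin{align}
z^{(t+1)} &\leftarrow z^{(t)} + \delta \cdot \frac{\partial m}{\partial x} \frac{\partial x}{\partial z}\big(z^{(t)}\big) \tag{15} \\
\tilde{x}^{(t+1)} &\leftarrow g_\theta\big(z^{(t+1)}\big) \tag{16}
\end{align}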

for some small step size δ over T steps. This yields a model-morphed sequence, x̃^(0), ..., x̃^(T), that gradually increases the predictive score m(·) (or decreases if δ < 0). By construction, each element in the sequence is constrained to be on the data manifold defined by the generative model. This allows us to move along long paths away from the initial datum x^(0). Further, because the generative model has been constrained to reproduce predictive features, we will see that the DR-VAE model-morphings can capture much more predictive variation than the standard VAE.
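The recursion in Equations 15 and 16 can be sketched with analytic gradients. This minimal example assumes a linear decoder gθ(z) = Wz + b and a logistic predictor m(x) = sigmoid(w · x); both are hypothetical stand-ins chosen so the chain rule can be written by hand. With a neural decoder or predictor, the same update would use automatic differentiation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def morph(z0, W, b, w, delta=0.2, n_steps=20):
    """Trace a model-morphed trajectory x~(0), ..., x~(T) following Eqs. 15-16."""
    z = np.array(z0, dtype=float)
    xs, scores = [], []
    for _ in range(n_steps + 1):
        x = W @ z + b                # Eq. 16: decode the current latent
        m = sigmoid(w @ x)           # predictor score on the synthesized input
        xs.append(x)
        scores.append(m)
        dm_dx = m * (1.0 - m) * w    # gradient of m with respect to its input
        dx_dz = W                    # Jacobian of the linear decoder, shape (D, K)
        z = z + delta * (dm_dx @ dx_dz)   # Eq. 15: chain rule, then a small step
    return np.array(xs), np.array(scores)

# Hypothetical three-dimensional "signal" with a two-dimensional latent.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.zeros(3)
w = np.array([0.5, -0.25, 1.0])
xs, scores = morph(np.zeros(2), W, b, w)
```

Every element of `xs` lies in the range of the decoder by construction, so the trajectory stays on the model's data manifold while the score rises; passing a negative `delta` morphs in the opposite direction.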

3.2. Related Work

Trade-offs between generative and discriminative models for classification performance are well-studied (Ng & Jordan, 2002; Bouchard & Triggs, 2004). Developing generative models with good discriminative properties is also an active area of research. Recent research has explored the use of discriminative models to influence generative model training and performance. As an example, Hughes et al. (2018) present an approach that constrains generative model learning to maintain good discriminative properties, and apply it to a commonly used latent factor model.

Incorporating the activations from discriminative models into generative model training has been explored as a method to produce sharper natural image samples (Lamb et al., 2016), and improve classification performance (Kuleshov & Ermon, 2017).

Interpreting complex predictive models is another area of related research. Work built on saliency maps (Simonyan et al., 2013) and plug-and-play activation maximization (Nguyen et al., 2016) uses similar ideas to explore latent spaces of generative models. For instance, Killoran et al. (2017) extend the activation maximization framework to synthesize discrete DNA sequences with desired properties by using a pre-trained generative model. Similarly, Gómez-Bombarelli et al. (2018) apply this model-based latent space optimization to generate novel molecular compounds. We extend these methods by incorporating a semantically meaningful constraint into the generative model training itself and show that this captures relevant information in x that may otherwise be ignored.

We also note that our approach is similar in spirit to CycleGANs (Zhu et al., 2017), which incorporate a cycle consistency loss into the objective. This cycle consistency loss ensures that the original image can be recovered after being translated from one domain to another. Similarly, our discriminative penalty ensures the original prediction can be recovered after the image has been encoded into a low-dimensional latent representation. Additional discussion of related work is in the supplement.


[Figure 2: two panels. (a) Synthetic Data Examples: x_i value against dimension i. (b) Error summary: RMSE (x-avg rmse and p(y|x) rmse) for the VAE and for DR-VAEs with β = 1.56, 3.12, 6.25, 12.50, 25.00, 50.00, 100.00.]

Figure 2. The DR-VAE can trade little generative error for improved discriminative error. (a) Example data: 50-dimensional correlated Gaussian draws with a linear increase in marginal variance, Σ_{1,1} = 0.1 to Σ_{50,50} = 1. Predictive information is contained entirely in the first dimension (marked on the left); blue indicates lower probability and red indicates higher probability. (b) Generative (red) and discriminative (yellow) test reconstruction error for a sequence of increasingly regularized models. As β grows, we trade some generative accuracy (x-avg rmse) for substantial discriminative accuracy (p(y | x) rmse).

4. Experiments

4.1. Synthetic Data

We apply DR-VAEs to a synthetic data set that, by construction, has predictive power in the dimension of least variation. We generate data according to the following model

$$ x \sim \mathcal{N}(0, \Sigma) , \qquad y \mid x \sim \text{Bern}\big(f(x; \phi)\big) , \tag{17} $$

where f(x; φ) is a multi-layer perceptron and Σ is a full rank, structured covariance matrix.

We construct the covariance matrix with two properties: (i) smooth sample paths, and (ii) increasing marginal variance. The Σ matrix is constructed such that the marginal variances (along the diagonal) increase; Σ_{1,1} = 0.1 increases linearly to Σ_{D,D} = 1. Examples of the synthetic data are visualized in Figure 2a and experiment details are in the supplement.

The black-box predictor, f(x; φ), is a multi-layer perceptron that depends only on x₁, the first dimension of x and the dimension of least variation. Reconstruction of directions of maximal variation will only carry predictive information about y through correlations with x₁. In this synthetic setup, we examine the generative and discriminative reconstruction trade-off as we increase β.
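A data-generating process of this shape can be sketched as below. The exact construction is in the supplement, so several choices here are assumptions made for illustration: a squared-exponential correlation (for smooth sample paths) with an arbitrary length scale, a small jitter to keep Σ full rank, and a one-hidden-unit function standing in for the multi-layer perceptron f(x; φ).

```python
import numpy as np

def make_covariance(D=50, length_scale=3.0, var_min=0.1, var_max=1.0, jitter=1e-4):
    """Smooth correlation structure scaled to linearly increasing marginal variance."""
    idx = np.arange(D)
    corr = np.exp(-0.5 * ((idx[:, None] - idx[None, :]) / length_scale) ** 2)
    # Jitter keeps the matrix full rank; renormalizing restores a unit diagonal.
    corr = (corr + jitter * np.eye(D)) / (1.0 + jitter)
    sd = np.sqrt(np.linspace(var_min, var_max, D))
    return sd[:, None] * corr * sd[None, :]

def predictor(X):
    """Hypothetical stand-in for f(x; phi): depends only on the first dimension."""
    hidden = np.tanh(3.0 * X[:, 0])
    return 1.0 / (1.0 + np.exp(-2.0 * hidden))

def sample_data(n, Sigma, seed=0):
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(Sigma)
    X = rng.normal(size=(n, Sigma.shape[0])) @ L.T   # x ~ N(0, Sigma)
    y = rng.binomial(1, predictor(X))                # y | x ~ Bern(f(x))
    return X, y

Sigma = make_covariance()
X, y = sample_data(1000, Sigma)
```

The first coordinate has the smallest marginal variance yet carries all of the label signal, which is the situation in which a reconstruction-driven model is most likely to discard predictive information.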

We fit a linear variational autoencoder (i.e. factor analysis) to learn the distribution of x with discriminative regularization. We limit the latent z dimension to be K < D; we set K = 10 and D = 50 in this example. To accurately reconstruct the discriminative score, p(y = 1 | x), the model of x needs to contain information about the first dimension of x.

We optimize the objective in Equation 3 for increasing values of β and compute the generative reconstruction error (gen-err) and the discriminative reconstruction error (disc-err) on held out test data

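\begin{align}
\text{gen-err} &= \mathbb{E}_{x \sim p(x)}\left[ \frac{1}{D} \sum_{d} (x_d - \bar{x}_d)^2 \right]^{1/2} \tag{18} \\
\text{disc-err} &= \mathbb{E}_{x \sim p(x)}\left[ \big(m(x) - m(\bar{x})\big)^2 \right]^{1/2} . \tag{19}
\end{align}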
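The two error summaries can be sketched as follows, reading Equations 18 and 19 as root-mean-squared errors over the test set (the square root applied after the expectation). That reading, the function names, and the toy predictor are assumptions made for illustration.

```python
import numpy as np

def gen_err(X, X_bar):
    """Equation 18: RMSE between inputs and reconstructions, averaged over dimensions."""
    return np.sqrt(np.mean((X - X_bar) ** 2))

def disc_err(m, X, X_bar):
    """Equation 19: RMSE between predictor scores on inputs and on reconstructions."""
    return np.sqrt(np.mean((m(X) - m(X_bar)) ** 2))

# A reconstruction that is wrong only in the first of 50 dimensions.
X = np.zeros((4, 50))
X_bar = X.copy()
X_bar[:, 0] = 1.0

# Hypothetical predictor that reads only the first dimension.
m = lambda A: 1.0 / (1.0 + np.exp(-4.0 * A[:, 0]))

g = gen_err(X, X_bar)
d = disc_err(m, X, X_bar)
```

In this toy case an error confined to one of fifty dimensions barely moves the generative error, while the discriminative error is large because that dimension is the only one the predictor uses.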

In Figure 2b we visualize these errors as a function of the regularization strength. Note that we report discriminative error instead of predictive accuracy because the object of our study is not the outcome itself but the predictive model m(·). We want our generative model reconstructions to result in the same predictions (including the same mistakes) from the discriminative model. With these highly structured Gaussian observations, we trade little generative reconstruction error to obtain nearly perfect discriminative reconstruction. The discriminative regularizer competes with the ELBO term for model capacity. When β = 0 (i.e. a VAE), the model ignores the directions of least variation. As we increase β, the discriminative regularizer competes with the ELBO, encouraging the generative model to represent the directions of variation that influence the discriminative model.

Digit Experiment. For additional empirical analysis in a semi-synthetic setting, we include an additional experiment on a modified MNIST data set in the supplement.³

4.2. EKG Data

We now apply DR-VAEs to a data set of clinical EKGs, using tracings from three leads V1, II, and V5 (analogous to channels, each recording the heart’s electrical activity from a different point on the chest), which contain a 10-second sequence of individual beats. We train discriminative models using segmented EKG beats (Christov, 2004), depicted in Figure 4, to predict the following outcomes

• ST elevation (binary): A subtle feature interpreted by eye that indicates heart attack. We use this outcome to illustrate model-morphings.

• Bundle Branch Block (binary): A delay or blockage of the electrical system in the heart that can be labeled from an EKG observation by eye.

• Major adverse cardiac events within six months (MACE) (binary): These events include myocardial infarction, cardiac arrest, stent, or a coronary artery bypass grafting. MACE risk is not predicted with EKGs.

• Age (continuous): Patient age has a noisy relationship to cardiac function. Visualizing the cardiac functional correlates of aging is potentially of clinical and scientific interest (Jones et al., 1990; Khane et al., 2011).

³ Code available at https://github.com/andymiller/DR-VAE.

[Figure 3: four panels, (a) Age, (b) Bundle Branch Block, (c) MACE, (d) ST Elevation. Horizontal axis: discrim. rmse; vertical axis: gen. rmse. Traces: VAE and DR-VAE with β = 1, 5, 10, 100. Markers: K = 100, 50, 30, 20, 10.]

Figure 3. DR-VAEs contain more predictive information. Depicted are discriminative (horizontal) vs. generative (vertical) reconstruction error on held out data for the four outcomes (age, bundle branch block, MACE, and ST elevation). A single trace represents a single value of β; marker shapes depict different latent dimensionality. We can see in some examples that there is a constant frontier: we can achieve better discriminative reconstruction accuracy and sacrifice no generative reconstruction accuracy. The discriminative regularizer is pushing the generative model to prefer this more meaningful representation over other, equivalent generative models. We also note that predictive accuracy of the actual outcome y remains very similar using x and x̄.

[Figure 4: three panels, each showing leads V1, II, and V5 in mV against time (seconds).]

Figure 4. Example data: three EKG beats from three different patients. The EKG signal considered records three leads (V1, II, V5). Beats are segmented and placed on the unit interval.

EKG data are a good fit for DR-VAEs as a subtle deviation from normal cardiac function can be predictive of a significant clinical outcome. Understanding ST elevation and bundle branch block predictors falls under the model diagnosis category, while age and MACE predictors fall under model-based discovery (defined in Section 1).

Data and model setup. We construct a data set for each outcome with close to even base-rates of y = 1. We split our cohort by patients into a training/validation development set (75% of patients) and a testing set (25% of patients); no patients overlap in these two groups. We report predictive performance and model reconstruction statistics only on the held-out test patients. Further details and statistics of data used in each experiment are in the supplementary material.
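A patient-level split of this kind can be sketched as below. The helper name, the seed, and the rounding of the test fraction are assumptions; the point is only that beats are assigned by patient identifier, so that no patient contributes beats to both groups.

```python
import numpy as np

def split_by_patient(patient_ids, test_frac=0.25, seed=0):
    """Return (development, test) row indices such that no patient spans both sets."""
    rng = np.random.default_rng(seed)
    unique = np.unique(patient_ids)
    rng.shuffle(unique)                       # random order over patients, not beats
    n_test = int(round(test_frac * len(unique)))
    test_patients = unique[:n_test]
    is_test = np.isin(patient_ids, test_patients)
    return np.flatnonzero(~is_test), np.flatnonzero(is_test)

# Hypothetical cohort: 8 patients with 3 beats each.
ids = np.repeat(np.arange(8), 3)
dev_idx, test_idx = split_by_patient(ids)
```

Splitting on beats directly would let beats from one patient land on both sides of the split, which tends to inflate held-out performance.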

For each outcome, discriminative model m(·) is a multi-layer perceptron classifier (or regressor) trained to minimize the cross entropy (or squared error) loss. Each discriminative model has two hidden layers of size 100 and a ReLU nonlinearity. Discriminative models are trained with dropout (p = 0.5) and stochastic gradient optimization with Adam (Kingma & Ba, 2014), starting with the learning rate set to 0.001 and halved every 25 epochs. We save the model with the best performance on the validation set.

Throughout our experiments we compare a standard VAE to the DR-VAE with values of β ∈ {1, 5, 10, 100} (note that a DR-VAE with β = 0 corresponds to a standard VAE). All deep generative models have one hidden layer with 500 units and a ReLU nonlinearity. We also train generative models with gradient-based stochastic optimization using Adam (Kingma & Ba, 2014), with an initial learning rate of 0.001 that is halved every 25 epochs. Accompanying code contains all model architecture and training details.
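The step schedule described for both the discriminative and generative models can be written as a small function. This is a sketch under the assumption that epochs are counted from zero and that the first halving takes effect at epoch 25; the accompanying code may implement the schedule differently.

```python
def learning_rate(epoch, base_lr=1e-3, halve_every=25):
    """Step schedule: start at base_lr and halve it every `halve_every` epochs."""
    return base_lr * 0.5 ** (epoch // halve_every)

# Learning rate at a few representative epochs.
schedule = [learning_rate(e) for e in (0, 24, 25, 50, 75)]
```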

Discriminative vs. generative trade-offs. For each outcome, we train all five generative models and compare both generative and discriminative reconstruction error. In Figure 3 we visualize the trade-off between generative and discriminative reconstruction error on held-out test data. In these experiments, there exists a trade-off between generative reconstruction and discriminative reconstruction. However, the standard VAE lies on a nearly constant part of the frontier; we can see this by comparing a given marker shape across the different traces (colored lines). Four of the models are essentially constant in terms of generative rmse from the VAE to DR-VAE with β = 1. For these outcomes, we can trade negligible generative performance for meaningful discriminative performance. With higher values of β, we see the trade-off emerge.

[Figure 5: two panels, (a) Model free and (b) DR-VAE, β = 5. x-axis: time (seconds); y-axis: mV. Traces are labeled by Pr(y = 1 | x̃), ranging from about 0.08 to 0.81.]

Figure 5. Model-based morphings can identify more meaningful features. Depicted are morphed examples (only lead II) comparing the (a) direct gradient and (b) DR-VAE with β = 5. The thick blue line indicates the starting EKG (low ST elevation), and the thin red line indicates the morphed EKG to Pr(y | x) = .8 (high ST elevation). Without a model, the EKG quickly starts to look unrealistic. The model preserves structure: smoothness and characteristic EKG features.

[Figure 6: two panels, (a) VAE (β = 0), Lead II (total var = 1.31), and (b) DR-VAE, β = 5, Lead II (total var = 2.10).]

Figure 6. DR-VAEs describe more predictive variation. Depicted is the mean morphing delta, E[x̄ − x̃] (and marginal variance Σ̂morph) for a single lead (II) while varying β. The expectation was estimated on 1024 test beats using the ST Elevation predictor. On the left, the standard VAE exhibits smaller morphing variation. As we increase β, the model morphs exhibit more variation, indicating the DR-VAE represents a richer set of predictive features than the standard VAE. Note that generative reconstruction is similar for these models.

As a concrete example, for ST Elevation and latent size K = 50, the discriminative RMSE for the DR-VAE with β = 1 is 20% lower than the RMSE for a standard VAE, while the generative reconstruction error remains identical (seen in Figure 3d). This improvement indicates that we have learned a representation z that contains more information about m(·) than the standard VAE.
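To make the two error notions concrete, the sketch below computes both on toy data. It assumes that generative error is the RMSE between x and its reconstruction x̄, and that discriminative error is the RMSE between m(x) and m(x̄); the predictor `m` here is a made-up logistic stand-in, and all numbers are illustrative rather than taken from the experiments.

```python
import math

def rmse(a, b):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)) / len(a))

def flatten(rows):
    """Concatenate a list of vectors into one flat list."""
    return [v for row in rows for v in row]

def m(x):
    """Hypothetical stand-in predictor: logistic function of the mean of x."""
    return 1.0 / (1.0 + math.exp(-sum(x) / len(x)))

# Toy data: two "beats" and their reconstructions (values are made up).
xs = [[0.0, 1.0, 2.0], [1.0, -1.0, 0.0]]
x_bars = [[0.1, 0.9, 2.0], [1.0, -0.8, 0.1]]

gen_rmse = rmse(flatten(xs), flatten(x_bars))                     # error in x-space
disc_rmse = rmse([m(x) for x in xs], [m(xb) for xb in x_bars])    # error in m-space
```

Two reconstructions with the same `gen_rmse` can differ in `disc_rmse`, which is the gap the frontier plot is meant to expose.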

[Figure 7: two panels, (a) low-to-high and (b) high-to-low. x-axis: Model (VAE; DR-VAE with β = 1, 5, 10, 100); y-axis: Total Variation.]

Figure 7. DR-VAEs capture more predictive variation. Depicted is the trace of the Σ̂morph matrix (defined in Equation 20) for each model. As we increase β, the latent space finds more ways (over the N = 1024 test examples) to increase (left) and decrease (right) the probability of ST elevation according to m(·). Note that the VAE and β = 1, 5, and 10 all have roughly the same generative reconstruction error (see Figure 3).

Exploring z-space with morphings. We qualitatively and quantitatively examine the difference between the suite of generative models trained on EKG observations. To compare the z space of each generative model, we examine morphed pairs of test data. For a generative model gθ(·), to compute a morphed pair for test example xn we
• compute x̄n = µλ(xn) via the encoder
• update x̄n into x̃n via morphing z with the operation defined in Equation 16 using model gθ(·)
• stop updating when m(x̃) has reached a certain value (e.g. m(x̃) reaches the second highest decile)

This creates a pair, x̄n and x̃n, the latter of which has morphed from the former according to the gradient of model m(·) and gθ(·). Figure 5b depicts a model-morphing trajectory. The initial EKG starts at the thick blue line (low ST elevation) and is morphed to the thin red line (high ST elevation). By comparison, Figure 5a depicts a model-free morph, the result of following the gradient flow of m(x), which produces unrealistic synthetic examples, particularly non-smooth sample paths and the elimination of physiologically characteristic waves. We repeat for N examples in our test set, yielding a collection of morphed pairs.
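The morphing loop can be illustrated on a toy problem. Equation 16 is not reproduced in this section, so the sketch assumes the update is plain gradient ascent on m(gθ(z)) in z-space. The linear decoder `g`, the logistic predictor `m`, the step size, and the stopping target are all hypothetical stand-ins, not the networks used in the experiments.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical toy models: decoder g(z) = W z maps K = 2 latents to
# D = 3 observations; predictor m(x) = sigmoid(w . x) returns a probability.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [0.5, -0.25, 1.0]

def g(z):
    return [sum(W_dk * z_k for W_dk, z_k in zip(row, z)) for row in W]

def m(x):
    return sigmoid(sum(w_d * x_d for w_d, x_d in zip(w, x)))

def grad_z(z):
    """Gradient of m(g(z)) w.r.t. z via the chain rule: m (1 - m) W^T w."""
    p = m(g(z))
    return [p * (1.0 - p) * sum(W[d][k] * w[d] for d in range(len(w)))
            for k in range(len(z))]

def morph(z0, target, step=0.1, max_iters=10_000):
    """Ascend m(g(z)) from z0 until the prediction reaches `target`."""
    z = list(z0)
    for _ in range(max_iters):
        if m(g(z)) >= target:
            break
        z = [z_k + step * g_k for z_k, g_k in zip(z, grad_z(z))]
    return z

z_bar = [-1.0, -1.0]               # stands in for the encoder mean of x_n
x_bar = g(z_bar)                   # starting reconstruction (low prediction)
x_tilde = g(morph(z_bar, 0.75))    # morphed example (high prediction)
```

Because every iterate is decoded through `g`, the morphed example stays in the range of the generative model, which is the property that distinguishes a model-based morph from following the gradient of m(x) directly in x-space.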

A key question is, does a DR-VAE have more capacity to explore predictive variation than a standard VAE? Intuitively, if two models achieve equal generative performance, but model a has more morphing variation than model b, then model a has more capacity to explore predictive variation than model b. One could imagine using an entropy estimator to measure this capacity. We use a simpler summary to quantify this: the empirical covariance of the difference between x̄ and x̃ according to generative model gθ(z),

$$\hat{\Sigma}_{\mathrm{morph}} = \mathrm{cov}\left( \{ \bar{x}_n - \tilde{x}_n \}_{n=1}^{N} \right). \qquad (20)$$

The trace of Σ̂morph summarizes how much predictive variation model gθ(·) has captured with respect to discriminative model m(x). We can also examine the spectrum of Σ̂morph for a clearer picture of our model-morphs, e.g. how many effective dimensions of X = ℝ^D are used?
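The trace summary can be computed without forming the full D × D covariance matrix, because the trace equals the sum of the per-dimension variances of the deltas. The sketch below assumes the unbiased (N − 1) normalization, which the text does not specify; `morph_total_variation` is a hypothetical helper name and the data are made up.

```python
def morph_total_variation(x_bars, x_tildes):
    """Trace of the empirical covariance of the deltas x_bar_n - x_tilde_n.

    Returns the trace together with the per-dimension (marginal) variances.
    """
    deltas = [[b - t for b, t in zip(xb, xt)]
              for xb, xt in zip(x_bars, x_tildes)]
    n, d = len(deltas), len(deltas[0])
    means = [sum(row[j] for row in deltas) / n for j in range(d)]
    marginal_var = [sum((row[j] - means[j]) ** 2 for row in deltas) / (n - 1)
                    for j in range(d)]
    return sum(marginal_var), marginal_var

# Toy morphed pairs with D = 2 (values are made up).
x_bars = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
x_tildes = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
total_var, marginal_var = morph_total_variation(x_bars, x_tildes)
```

In this toy case every pair shifts the second coordinate by the same amount, so that dimension contributes no variance; only the first coordinate, where the deltas differ across examples, adds to the total. Examining the spectrum would additionally require an eigenvalue routine applied to the full covariance matrix.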

We conduct two experiments that center on statistics of morphed pairs: (i) low-to-high: we take test examples from the lowest decile of m(x) (e.g. test patients with the lowest probability of ST elevation) and morph them up such that m(x̃) = .75 (e.g. high probability of ST elevation); (ii) high-to-low: we take test examples from the highest decile of m(x) (e.g. test patients with the highest probability of ST elevation) and morph them down such that m(x̃) = .1.
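Selecting the starting examples for each direction amounts to ranking test examples by m(x) and slicing out a decile. A minimal sketch, with `decile_indices` as a hypothetical helper and evenly spaced stand-in scores:

```python
def decile_indices(scores, decile):
    """Indices of the examples whose score falls in the given decile (1-10)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(scores)
    lo, hi = (decile - 1) * n // 10, decile * n // 10
    return order[lo:hi]

scores = [i / 100 for i in range(100)]   # stand-in values of m(x)
lowest = decile_indices(scores, 1)       # candidates for low-to-high morphs
highest = decile_indices(scores, 10)     # candidates for high-to-low morphs
```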

For each direction (low-to-high and high-to-low) we compute morphed pairs for all five generative models. In Figures 7a and 7b we plot the total variation of Σ̂morph for the ST elevation outcome. We can see that as β increases, so does the total variation in the morphed pairs. In fact the total number of dimensions with non-zero variation also increases with β (visualized in the supplemental material). In Figure 6, we show the marginal variance of the model-morphings as we increase β (for lead II), summarizing the total lead variation in the title. As β grows, the marginal variance of the morph also grows, indicating higher variation in predictive features induced by the morph.

These summaries are evidence that the DR-VAE is able to capture a wider range of EKG features that indicate high ST elevation (according to the predictor m(x)). This tool can be used to reveal potentially unknown ways in which m(x) behaves, allowing us to diagnose the model or trace a prediction to its physiological origins. Using just a VAE (as in Nguyen et al. (2016)) will fail to visualize much of the predictive variation in x-space.

Comparing variation in observational data may capture correlated modes of variation: if two EKG features frequently co-occur, an information-constrained latent space will learn to express them together. If the predictor were trained with a non-confounded data set, and the generative model were trained with a confounded data set, our exploratory approach may not draw a distinction between them.

Physician validation. We validate our generative model of EKGs with an expert labeling experiment. Our goal is to measure how real the model-based EKGs appear and how convincing model-morphed features are. We construct a two-alternative forced choice labeling task. We present a physician with N trials, where each trial compares a pair of EKG beats, one real and one fake (in a randomized order). The fake beat has been constructed with an ST elevation-regularized DR-VAE (β = 5) in one of four ways: (i) an x̄ with low ST elevation (second decile), (ii) an x̄ with high ST elevation (ninth decile), (iii) an x̃ morphed from low-to-high ST elevation, and (iv) an x̃ morphed from high-to-low ST elevation. The morphed examples are morphed from the second to the ninth decile (or vice versa). Here, the decile is with respect to values of the predictor m(·).

We present the expert with fifty of each category, 200 in total, with the task of identifying which EKG beat is real and which has ST elevation.

For the real-vs-fake question, the expert was able to distinguish real from fake beats at a rate slightly above random guessing, with an overall accuracy rate of 60% [54-66% CI]. For a more detailed picture, we compute accuracy rates (and 95% confidence intervals) for each type of synthetic data. For non-morphed types, (i) and (ii), we observe rates of 60% [46-74%] and 64% [50-78%], respectively. For morphed types, (iii) and (iv), we observe rates of 76% [64-88%] and 42% [28-56%], respectively.
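The text does not state how the confidence intervals were computed. As a rough check, a normal-approximation (Wald) interval with 50 trials per category gives values close to the per-category intervals reported above; the choice of interval is an assumption here, and `wald_interval` is a hypothetical helper name.

```python
import math

def wald_interval(p_hat, n, z=1.96):
    """Normal-approximation 95% interval for a binomial proportion."""
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half, p_hat + half

# Accuracy rates reported above, with 50 trials per category.
for label, acc in [("(i)", 0.60), ("(ii)", 0.64), ("(iii)", 0.76), ("(iv)", 0.42)]:
    lo, hi = wald_interval(acc, 50)
    print(f"{label}: accuracy {acc:.2f}, interval [{lo:.3f}, {hi:.3f}]")
```

With only 50 trials per category the intervals are wide (roughly ±12 to ±14 percentage points), so the interval for category (iv) comfortably includes chance performance.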

Model-morphed examples are interpreted asymmetrically: high ST elevation examples morphed to look like low ST elevation examples are nearly indistinguishable from real data. However, examples morphed to have high ST elevation are more likely to be flagged as fake. This may be due to model smoothness becoming more conspicuous or poor EKG feature reconstruction. Overall, our model was able to fool an expert between 24% and 58% of the time, depending on the construction of the synthetic data.

For the feature-inducing accuracy, the expert label of ST elevation almost always matched the model-morphing induced feature. The low-to-high ST elevation morphed examples were correctly labeled 88% [78-96%] of the time, and the high-to-low morphed examples were correctly labeled 94% [88-100%]. This result is evidence that model-morphs can induce clinically relevant features that are recognizable and visually interpretable to an expert. We view these results as a promising first step toward model-based feature discovery and understanding for black-box EKG predictors. Further experiment details are in the supplement.

5. Discussion and Conclusion

We described discriminatively regularized VAEs (DR-VAEs), a method to constrain the representation of a deep generative model to contain targeted information about a black-box predictor of interest. We motivated the regularizer from an information theoretic perspective. We applied DR-VAEs to synthetic and EKG data, and empirically showed that DR-VAE representations can contain more predictive information than standard VAEs. We measured how realistic synthesized EKGs from the deep generative model are with an expert labeling experiment. While the results were promising, the task of further developing and validating richer generative models remains open.

In future work, we will use this technique to highlight features of an EKG that are predictive of various cardiovascular diseases. We hope to use model-morphs to relate predictions back to the underlying cardiac physiology driving the prediction. Along this same thread, we would like to develop flexible generative models with an inductive bias rooted in cardiology, uniting physiological plausibility with DGM flexibility.


Acknowledgments

ACM and JPC are funded by NSF NeuroNex Award DBI-1707398 and The Gatsby Charitable Foundation GAT3419. The authors thank Scott Linderman for taking a close look at an early draft and CJ Chang for the tireless interpretation of synthetic and real EKG tracings.

References

Baehrens, D., Schroeter, T., Harmeling, S., Kawanabe, M., Hansen, K., and Müller, K.-R. How to explain individual classification decisions. Journal of Machine Learning Research, 2010.

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational inference: A review for statisticians. Journal of the American Statistical Association, 2017.

Bouchard, G. and Triggs, B. The tradeoff between generative and discriminative classifiers. In 16th IASC International Symposium on Computational Statistics (COMPSTAT'04), pp. 721–728, 2004.

Christov, I. I. Real time electrocardiogram QRS detection using combined adaptive threshold. Biomedical engineering online, 2004.

Doshi-Velez, F. and Kim, B. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.

Gelman, A., Stern, H. S., Carlin, J. B., Dunson, D. B., Vehtari, A., and Rubin, D. B. Bayesian data analysis. Chapman and Hall/CRC, 2013.

Gómez-Bombarelli, R., Wei, J. N., Duvenaud, D., Hernández-Lobato, J. M., Sánchez-Lengeling, B., Sheberla, D., Aguilera-Iparraguirre, J., Hirzel, T. D., Adams, R. P., and Aspuru-Guzik, A. Automatic chemical design using a data-driven continuous representation of molecules. ACS central science, 2018.

Hughes, M. C., Hope, G., Weiner, L., McCoy Jr, T. H., Perlis, R. H., Sudderth, E., and Doshi-Velez, F. Semi-supervised prediction-constrained topic models. In International Conference on Artificial Intelligence and Statistics, 2018.

Jones, J., Srodulski, Z., and Romisher, S. The aging electrocardiogram. The American journal of emergency medicine, 1990.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. An introduction to variational methods for graphical models. Machine learning, 1999.

Khane, R. S., Surdi, A. D., and Bhatkar, R. S. Changes in ECG pattern with advancing age. Journal of basic and clinical physiology and pharmacology, 2011.

Killoran, N., Lee, L. J., Delong, A., Duvenaud, D., and Frey, B. J. Generating and designing DNA with deep generative models. arXiv preprint arXiv:1712.06148, 2017.

Kindermans, P.-J., Hooker, S., Adebayo, J., Alber, M., Schütt, K. T., Dähne, S., Erhan, D., and Kim, B. The (un)reliability of saliency methods. arXiv preprint arXiv:1711.00867, 2017.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

Kuleshov, V. and Ermon, S. Deep hybrid models: Bridging discriminative and generative approaches. In Proceedings of the Conference on Uncertainty in AI, 2017.

Lamb, A., Dumoulin, V., and Courville, A. Discriminative regularization for generative models. arXiv preprint arXiv:1602.03220, 2016.

Narayanan, M., Chen, E., He, J., Kim, B., Gershman, S., and Doshi-Velez, F. How do humans understand explanations from machine learning systems? An evaluation of the human-interpretability of explanation. arXiv preprint arXiv:1802.00682, 2018.

Ng, A. Y. and Jordan, M. I. On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. In Advances in neural information processing systems, 2002.

Nguyen, A., Dosovitskiy, A., Yosinski, J., Brox, T., and Clune, J. Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In Advances in Neural Information Processing Systems, 2016.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.

Ribeiro, M. T., Singh, S., and Guestrin, C. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.

Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision, 2017.
