
Factorized Inference in Deep Markov Models for Incomplete Multimodal Time Series

Tan Zhi-Xuan,1,2 Harold Soh,3 Desmond C. Ong1,4

1 A*STAR Artificial Intelligence Initiative, A*STAR, Singapore
2 Department of Electrical Engineering and Computer Science, MIT
3 Department of Computer Science, National University of Singapore
4 Department of Information Systems and Analytics, National University of Singapore

[email protected], [email protected], [email protected]

Abstract

Integrating deep learning with latent state space models has the potential to yield temporal models that are powerful, yet tractable and interpretable. Unfortunately, current models are not designed to handle missing data or multiple data modalities, which are both prevalent in real-world data. In this work, we introduce a factorized inference method for Multimodal Deep Markov Models (MDMMs), allowing us to filter and smooth in the presence of missing data, while also performing uncertainty-aware multimodal fusion. We derive this method by factorizing the posterior $p(z|x)$ for non-linear state space models, and develop a variational backward-forward algorithm for inference. Because our method handles incompleteness over both time and modalities, it is capable of interpolation, extrapolation, conditional generation, label prediction, and weakly supervised learning of multimodal time series. We demonstrate these capabilities on both synthetic and real-world multimodal data under high levels of data deletion. Our method performs well even with more than 50% missing data, and outperforms existing deep approaches to inference in latent time series.

Introduction

Virtually all sensory data that humans and autonomous systems receive can be thought of as multimodal time series: multiple sensors each provide streams of information about the surrounding environment, and intelligent systems have to integrate these information streams to form a coherent yet dynamic model of the world. These time series are often asynchronously or irregularly sampled, with many time-points having missing or incomplete data. Classical time series algorithms, such as the Kalman filter, are robust to such incompleteness: they are able to infer the state of the world, but only in linear regimes (Kalman 1960). Human intelligence, on the other hand, is both robust and complex: we infer quantities with nonlinear dynamics, even from incomplete temporal observations of multiple modalities, for example, intended motion from both eye gaze and arm movements (Dragan, Lee, and Srinivasa 2013), or desires and emotions from actions, facial expressions, speech, and language (Baker et al. 2017; Ong, Zaki, and Goodman 2015).

Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

There has been a proliferation of attempts to learn these nonlinear dynamics by integrating deep learning with traditional probabilistic approaches, such as Hidden Markov Models (HMMs) and latent dynamical systems. Most approaches do this by adding random latent variables to each time step in an RNN (Chung et al. 2015; Bayer and Osendorfer 2014; Fabius and van Amersfoort 2014; Fraccaro et al. 2016). Other authors begin with latent sequence models as a basis, then develop deep parameterizations and inference techniques for these models (Krishnan, Shalit, and Sontag 2017; Archer et al. 2015; Johnson et al. 2016; Karl et al. 2016; Doerr et al. 2018; Lin, Khan, and Hubacher 2018; Chen et al. 2018). Most relevant to our work are the Deep Markov Models (DMMs) proposed by Krishnan, Shalit, and Sontag (2017), a generalization of HMMs and Gaussian State Space Models in which the transition and emission distributions are parameterized by deep networks.

Unfortunately, none of the approaches thus far are designed to handle inference with both missing data and multiple modalities. Instead, most approaches rely upon RNN inference networks (Krishnan, Shalit, and Sontag 2017; Karl et al. 2016; Che et al. 2018), which can only handle missing data using ad-hoc approaches such as zero-masking (Lipton, Kale, and Wetzel 2016), update-skipping (Krishnan, Shalit, and Sontag 2017), temporal gating mechanisms (Neil, Pfeiffer, and Liu 2016; Che et al. 2018), or providing time stamps as inputs (Chen et al. 2018), none of which have intuitive probabilistic interpretations. Of the approaches that do not rely on RNNs, Fraccaro et al. (2017) handle missing data by assuming linear latent dynamics, while Johnson et al. (2016) and Lin, Khan, and Hubacher (2018) use hybrid message passing inference that is theoretically capable of marginalizing out missing data. However, these methods are unable to learn nonlinear transition dynamics, nor do they handle multiple modalities.

We address these limitations (the inability to handle missing data, across multiple modalities, and with nonlinear dynamics) by introducing a multimodal generalization of Deep Markov Models, as well as a factorized inference method for such models that handles missing time points and modalities by design. Our method allows for both


[Figure 1: graphical models for (a) the MDMM $p(z_{1:T}, x_{1:T})$, (b) filtering $p(z_t|x_{1:t})$, and (c) smoothing $p(z_t|x_{1:T})$.]

Figure 1: (a) A Multimodal Deep Markov Model (MDMM) with M = 2 modalities. Observations (filled) are generated from unobserved latent states (unfilled). (b) Filtering infers the current latent state $z_t$ (bold dashed outline) given all observations up to $t$ (solid outlines), and marginalizes (dotted outline) over past latent states. (c) Smoothing infers $z_t$ given past, present, and future observations, and marginalizes over both past and future latent states.

filtering and smoothing given incomplete time series, while also performing uncertainty-aware multimodal fusion a la the Multimodal Variational Autoencoder (MVAE) (Wu and Goodman 2018). Because our method handles incompleteness over both time and modalities, it is capable of (i) interpolation, (ii) forward / backward extrapolation, and (iii) conditional generation of one modality from another, including label prediction. It also enables (iv) weakly supervised learning from incomplete multimodal data. We demonstrate these capabilities on both a synthetic dataset of noisy bidirectional spirals and a real-world dataset of labelled human actions. Our experiments show that our method learns and performs excellently on each of these tasks, while outperforming state-of-the-art inference methods that rely upon RNNs.

Methods

We introduce Multimodal Deep Markov Models (MDMMs) as a generalization of Krishnan, Shalit, and Sontag's (2017) Deep Markov Models (DMMs). In an MDMM (Figure 1a), we model multiple sequences of observations, each of which is conditionally independent of the other sequences given the latent state variables. Each observation sequence corresponds to a particular data or sensor modality (e.g., video, audio, labels), and may be missing when other modalities are present. An MDMM can thus be seen as a sequential version of the MVAE (Wu and Goodman 2018).

Formally, let $z_t$ and $x_t^m$ respectively denote the latent state and the observation for modality $m$ at time $t$. An MDMM with $M$ modalities is then defined by the transition and emission distributions:

$$z_t \sim \mathcal{N}(\mu_\theta(z_{t-1}), \Sigma_\theta(z_{t-1})) \quad \text{(Transition)} \quad (1)$$
$$x_t^m \sim \Pi(\kappa_\theta^m(z_t)) \quad \text{(Emission)} \quad (2)$$

Here, $\mathcal{N}$ is the Gaussian distribution, and $\Pi$ is some arbitrary emission distribution. The distribution parameters $\mu_\theta$, $\Sigma_\theta$, and $\kappa_\theta^m$ are functions of either $z_{t-1}$ or $z_t$, which we learn as neural networks with weights $\theta$. We also use $z_{t_1:t_2}$ to denote the time series of $z$ from $t_1$ to $t_2$, and $x_{t_1:t_2}^{m_1:m_2}$ to denote the corresponding observations from modalities $m_1$ to $m_2$. We omit the modality superscripts when all modalities are present (i.e., $x_t \equiv x_t^{1:M}$).
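To make the generative process concrete, the following is a minimal sketch of ancestral sampling from an MDMM in PyTorch. It assumes Gaussian emissions purely for illustration (the emission distribution $\Pi$ is arbitrary in general), and the callables `trans` and `emits[m]`, which stand in for $\mu_\theta, \Sigma_\theta$ and $\kappa_\theta^m$, are hypothetical placeholders rather than the authors' code.

```python
import torch

def sample_mdmm(trans, emits, z0, T):
    """Ancestral sampling from an MDMM: roll the latent state forward
    through the Gaussian transition (Eq. 1), then emit each modality
    from it (Eq. 2, here assumed Gaussian). `trans(z)` returns (mu, sd)
    of p(z_t|z_{t-1}); `emits[m](z)` returns (mu, sd) of p(x_t^m|z_t)."""
    z, xs = z0, []
    for _ in range(T):
        mu_z, sd_z = trans(z)
        z = mu_z + sd_z * torch.randn_like(mu_z)       # z_t ~ N(mu, Sigma)
        x_t = {}
        for m, emit in emits.items():
            mu_x, sd_x = emit(z)
            x_t[m] = mu_x + sd_x * torch.randn_like(mu_x)  # x_t^m
        xs.append(x_t)
    return xs
```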

We want to jointly learn the parameters $\theta$ of the generative model $p_\theta(z_{1:T}, x_{1:T}) = p_\theta(x_{1:T}|z_{1:T})\,p_\theta(z_{1:T})$ and the parameters $\phi$ of a variational posterior $q_\phi(z_{1:T}|x_{1:T})$ which approximates the true (intractable) posterior $p_\theta(z_{1:T}|x_{1:T})$. To do so, we maximize a lower bound on the log marginal likelihood, $\mathcal{L}(x; \theta, \phi) \le \log p_\theta(x_{1:T})$, also known as the evidence lower bound (ELBO):

$$\mathcal{L}(x; \theta, \phi) = \mathbb{E}_{q_\phi(z_{1:T}|x_{1:T})}[\log p_\theta(x_{1:T}|z_{1:T})] - \mathrm{KL}(q_\phi(z_{1:T}|x_{1:T})\,\|\,p_\theta(z_{1:T})) \quad (3)$$

In practice, we can maximize the ELBO with respect to $\theta$ and $\phi$ via gradient ascent with stochastic backpropagation (Kingma and Welling 2014; Rezende, Mohamed, and Wierstra 2014). Doing so requires sampling from the variational posterior $q_\phi(z_{1:T}|x_{1:T})$. In the following sections, we derive a variational posterior that factorizes over time-steps and modalities, allowing us to tractably infer the latent states $z_{1:T}$ even when data is missing.

Factorized Posterior Distributions

In latent sequence models such as MDMMs, we often want to perform several kinds of inferences over the latent states. The most common of such latent state inferences are:

Filtering: inferring $z_t$ given past observations $x_{1:t}$.
Smoothing: inferring some $z_t$ given all observations $x_{1:T}$.
Sequencing: inferring the sequence $z_{1:T}$ from $x_{1:T}$.

Most architectures that combine deep learning with state space models focus upon filtering (Fabius and van Amersfoort 2014; Chung et al. 2015; Hafner et al. 2018; Buesing et al. 2018), while Krishnan, Shalit, and Sontag (2017) optimize their DMM for sequencing. One of our contributions is to demonstrate that we can learn the filtering, smoothing, and sequencing distributions within the same framework, because they all share similar factorizations (see Figures 1b and 1c for the shared inference structure of filtering and smoothing). A further consequence of these factorizations is that we can naturally handle inference given missing modalities or time-steps.

To demonstrate this similarity, we first factorize the sequencing distribution $p(z_{1:T}|x_{1:T})$ over time:

$$p(z_{1:T}|x_{1:T}) = p(z_1|x_{1:T}) \prod_{t=2}^{T} p(z_t|z_{t-1}, x_{t:T}) \quad (4)$$


This factorization means that each latent state $z_t$ depends only on the previous latent state $z_{t-1}$, as well as all current and future observations $x_{t:T}$, and is implied by the graphical structure of the MDMM (Figure 1a). We term $p(z_t|z_{t-1}, x_{t:T})$ the conditional smoothing posterior, because it is the posterior that corresponds to the conditional prior $p(z_t|z_{t-1})$ on the latent space, and because it combines information from both past and future (hence 'smoothing').

Given one or more modalities, we can show that the conditional smoothing posterior $p(z_t|z_{t-1}, x_{t:T})$, the backward filtering distribution $p(z_t|x_{t:T})$, and the smoothing distribution $p(z_t|x_{1:T})$ all factorize almost identically:

Backward Filtering:
$$p(z_t|x_{t:T}) \propto p(z_t|x_{t+1:T}) \prod_m \frac{p(z_t|x_t^m)}{p(z_t)} \quad (5)$$

Forward Smoothing:
$$p(z_t|x_{1:T}) \propto p(z_t|x_{t+1:T}) \left[\prod_m \frac{p(z_t|x_t^m)}{p(z_t)}\right] \frac{p(z_t|x_{1:t-1})}{p(z_t)} \quad (6)$$

Conditional Smoothing Posterior:
$$p(z_t|z_{t-1}, x_{t:T}) \propto p(z_t|x_{t+1:T}) \left[\prod_m \frac{p(z_t|x_t^m)}{p(z_t)}\right] \frac{p(z_t|z_{t-1})}{p(z_t)} \quad (7)$$

Equations 5–7 show that each distribution can be decomposed into (i) its dependence on future observations, $p(z_t|x_{t+1:T})$; (ii) its dependence on each modality $m$ in the present, $p(z_t|x_t^m)$; and, excluding filtering, (iii) its dependence on the past, $p(z_t|z_{t-1})$ or $p(z_t|x_{1:t-1})$. Their shared structure is due to the conditional independence of $x_{t:T}$, given $z_t$, from all prior observations and latent states. Here we show only the derivation of Equation 7, because the others follow by either dropping $z_{t-1}$ (Equation 5) or replacing $z_{t-1}$ with $x_{1:t-1}$ (Equation 6):

$$\begin{aligned}
p(z_t|z_{t-1}, x_{t:T}^{1:M}) &= \frac{p(x_{t+1:T}^{1:M}|z_t)\, p(x_t^{1:M}|z_t)\, p(z_t|z_{t-1})}{p(x_{t:T}^{1:M}|z_{t-1})} \\
&\propto p(x_{t+1:T}^{1:M}|z_t) \left[\prod_{m=1}^{M} p(x_t^m|z_t)\right] p(z_t|z_{t-1}) \\
&= \frac{p(z_t|x_{t+1:T}^{1:M})\, p(x_{t+1:T}^{1:M})}{p(z_t)} \left[\prod_{m=1}^{M} \frac{p(z_t|x_t^m)\, p(x_t^m)}{p(z_t)}\right] p(z_t|z_{t-1}) \\
&\propto p(z_t|x_{t+1:T}^{1:M}) \left[\prod_{m=1}^{M} \frac{p(z_t|x_t^m)}{p(z_t)}\right] \frac{p(z_t|z_{t-1})}{p(z_t)}
\end{aligned}$$

The factorizations in Equations 5–7 lead to several useful insights. First, they show that any missing modalities $m \in [1, M]$ at time $t$ can simply be left out of the product over modalities, leaving us with distributions that correctly condition on only the modalities $[1, M] \setminus \{m\}$ that are present. Second, they suggest that we can compute all three distributions if we can approximate the dependence on the future, $q(z_t|x_{t+1:T}) \approx p(z_t|x_{t+1:T})$, learn approximate posteriors $q(z_t|x_t^m) \approx p(z_t|x_t^m)$ for each modality $m$, and know the model dynamics $p(z_t)$ and $p(z_t|z_{t-1})$.

Multimodal Fusion via Product of Gaussians

However, there are a few obstacles to tractable computation of Equations 5–7. One obstacle is that the product of generic probability distributions cannot be computed in closed form. To address this, we adopt the approach used for the MVAE (Wu and Goodman 2018), assuming that each term in Equations 5–7 is Gaussian. If each distribution is Gaussian, then their products and quotients are also Gaussian and can be computed in closed form. Since this result is well-known, we state it in the supplement (see Wu and Goodman (2018) for a proof).

This Product-of-Gaussians approach has the added benefit that the output distribution is dominated by the input Gaussian terms with lower variance (higher precision), thereby fusing information in a way that gives more weight to higher-certainty inputs (Cao and Fleet 2014; Ong, Zaki, and Goodman 2015). This automatically balances the information provided by each modality $m$, depending on whether $p(z_t|x_t^m)$ has high or low certainty, as well as the information provided from the past and future through $p(z_t|z_{t-1})$ and $p(z_t|x_{t+1:T})$, thereby performing multimodal temporal fusion in an uncertainty-aware manner.
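As an illustration, here is a minimal sketch of this precision-weighted fusion for diagonal Gaussians, written in PyTorch-style Python; the function name is ours, not from the paper's codebase.

```python
def fuse_gaussians(mus, vars_):
    """Product-of-Gaussians fusion for diagonal Gaussians: each input is
    weighted by its precision (inverse variance), so lower-variance
    (higher-certainty) terms dominate the fused estimate."""
    precisions = [1.0 / v for v in vars_]
    total_precision = sum(precisions)
    mu = sum(m * p for m, p in zip(mus, precisions)) / total_precision
    return mu, 1.0 / total_precision

# e.g., fusing the past, future, and per-modality terms at one time-step:
# mu_t, var_t = fuse_gaussians([mu_past, mu_vision, mu_audio],
#                              [var_past, var_vision, var_audio])
```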

Approximate Filtering with Missing Data

Another obstacle to computing Equations 5–7 is the dependence on future observations, $p(z_t|x_{t+1:T})$, which does not admit further factorization, and hence does not readily handle missing data among those future observations. Other approaches to approximating this dependence on the future rely on RNNs as recognition models (Krishnan, Shalit, and Sontag 2017; Che et al. 2018), but these are not designed to work with missing data.

To address this obstacle in a more principled manner, our insight was that $p(z_t|x_{t+1:T})$ is the expectation of $p(z_t|z_{t+1})$ under the backward filtering distribution $p(z_{t+1}|x_{t+1:T})$:

$$p(z_t|x_{t+1:T}) = \mathbb{E}_{p(z_{t+1}|x_{t+1:T})}\left[p(z_t|z_{t+1})\right] \quad (8)$$

For tractable approximation of this expectation, we use an approach similar to assumed density filtering (Huber, Beutler, and Hanebeck 2011). We assume both $p(z_t|x_{t+1:T})$ and $p(z_t|z_{t+1})$ to be multivariate Gaussians with diagonal covariance, and sample the parameters $\mu$, $\Sigma$ of $p(z_t|z_{t+1})$ under $p(z_{t+1}|x_{t+1:T})$. After drawing $K$ samples, we approximate the parameters of $p(z_t|x_{t+1:T})$ via empirical moment-matching:

$$\mu_{z_t|x_{t+1:T}} = \frac{1}{K}\sum_{k=1}^{K} \mu_k \quad (9)$$

$$\Sigma_{z_t|x_{t+1:T}} = \frac{1}{K}\sum_{k=1}^{K}\left(\Sigma_k + \mu_k^2\right) - \mu_{z_t|x_{t+1:T}}^2 \quad (10)$$
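The moment-matching step in Equations 9 and 10 is simple to implement; below is a sketch under the diagonal-covariance assumption, where the inputs are tensors of shape `(K, |z|)` and the function name is ours.

```python
def moment_match(mu_k, var_k):
    """Approximate the mixture (1/K) * sum_k N(mu_k, Sigma_k) by a single
    diagonal Gaussian via empirical moment matching (Eqs. 9-10)."""
    mu = mu_k.mean(dim=0)                            # Eq. 9
    var = (var_k + mu_k ** 2).mean(dim=0) - mu ** 2  # Eq. 10
    return mu, var
```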

Approximating $p(z_t|x_{t+1:T})$ through $p(z_t|z_{t+1})$ led us to three important insights. First, by substituting the expectation from Equation 8 into Equation 5, the backward filtering distribution becomes:

$$p(z_t|x_{t:T}) \propto \mathbb{E}_{p(z_{t+1}|x_{t+1:T})}\left[p(z_t|z_{t+1})\right] \prod_{m=1}^{M} \frac{p(z_t|x_t^m)}{p(z_t)} \quad (11)$$

In other words, by sampling under the filtering distribution for time $t+1$, $p(z_{t+1}|x_{t+1:T})$, we can compute the filtering distribution for time $t$, $p(z_t|x_{t:T})$. We can thus recursively compute $p(z_t|x_{t:T})$ backwards in time, starting from $t = T$.

Second, once we can perform filtering backwards in time, we can use it to approximate $p(z_t|x_{t+1:T})$ in the smoothing distribution (Equation 6) and the conditional smoothing posterior (Equation 7). Backward filtering hence allows us to approximate both smoothing and sequencing.

Third, this approach removes the explicit dependence on all future observations $x_{t+1:T}$, allowing us to handle missing data. Suppose the data points $X_\varnothing = \{x_{t_i}^{m_i}\}$ are missing, where $t_i$ and $m_i$ are respectively the time-step and modality of the $i$-th missing point. Rather than directly computing the dependence on an incomplete set of future observations, $p(z_t|x_{t+1:T} \setminus X_\varnothing)$, we can instead sample $z_{t+1}$ under the filtering distribution conditioned on the incomplete observations, $p(z_{t+1}|x_{t+1:T} \setminus X_\varnothing)$, and then compute $p(z_t|z_{t+1})$ given the sampled $z_{t+1}$, thereby approximating $p(z_t|x_{t+1:T} \setminus X_\varnothing)$.

Backward-Forward Variational Inference

We now introduce factorized variational approximations of Equations 5–7. We replace the true posteriors $p(z_t|x_t^m)$ with variational approximations $q(z_t|x_t^m) \propto \tilde{q}(z_t|x_t^m)\,p(z_t)$, where $\tilde{q}(z_t|x_t^m)$ is parameterized by a (time-invariant) neural network for each modality $m$. As in the MVAE, we learn the Gaussian quotients $\tilde{q}(z_t|x_t^m) := q(z_t|x_t^m)/p(z_t)$ directly, so as to avoid the constraint required to ensure that a quotient of Gaussians is well-defined. We also parameterize the transition dynamics $p(z_t|z_{t-1})$ and $p(z_t|z_{t+1})$ using neural networks for the quotient distributions. This gives the following approximations:

Backward Filtering:
$$q(z_t|x_{t:T}) \propto \mathbb{E}_{\leftarrow}\left[p(z_t|z_{t+1})\right] \prod_m \tilde{q}(z_t|x_t^m) \quad (12)$$

Forward Smoothing:
$$q(z_t|x_{1:T}) \propto \mathbb{E}_{\leftarrow}\left[p(z_t|z_{t+1})\right] \prod_m \tilde{q}(z_t|x_t^m)\, \frac{\mathbb{E}_{\rightarrow}\left[p(z_t|z_{t-1})\right]}{p(z_t)} \quad (13)$$

Conditional Smoothing Posterior:
$$q(z_t|z_{t-1}, x_{t:T}) \propto \mathbb{E}_{\leftarrow}\left[p(z_t|z_{t+1})\right] \prod_m \tilde{q}(z_t|x_t^m)\, \frac{p(z_t|z_{t-1})}{p(z_t)} \quad (14)$$

Here, $\mathbb{E}_{\leftarrow}$ is shorthand for the expectation under the approximate backward filtering distribution $q(z_{t+1}|x_{t+1:T})$, while $\mathbb{E}_{\rightarrow}$ is the expectation under the forward smoothing distribution $q(z_{t-1}|x_{1:T})$.

To calculate the backward filtering distribution $q(z_t|x_{t:T})$, we introduce a variational backward algorithm (Algorithm 1) that recursively computes Equation 12 for all time-steps $t$ in a single pass. Note that simply by reversing time in Algorithm 1, we obtain a variational forward algorithm that computes the forward filtering distribution $q(z_t|x_{1:t})$.

Unlike filtering, smoothing and sequencing require information from both the past ($p(z_t|z_{t-1})$) and the future ($p(z_t|z_{t+1})$). This motivates a variational backward-forward algorithm (Algorithm 2) for smoothing and sequencing. Algorithm 2 first uses Algorithm 1 as a backward pass, then performs a forward pass to propagate information from past to future. Algorithm 2 also requires knowing $p(z_t)$ for each $t$. While this can be computed by sampling in the forward pass, we avoid the instability (of sampling $T$ successive latents with no observations) by instead assuming that $p(z_t)$ is constant with time, i.e., that the MDMM is stationary when nothing is observed. During training, we add $\mathrm{KL}(p(z_t)\,\|\,\mathbb{E}_{z_{t-1}}p(z_t|z_{t-1}))$ and $\mathrm{KL}(p(z_t)\,\|\,\mathbb{E}_{z_{t+1}}p(z_t|z_{t+1}))$ to the loss to ensure that the transition dynamics obey this assumption.

Algorithm 1 A variational backward algorithm for approximate backward filtering.

function BackwardFilter(x_{1:T}, K)
    Initialize q(z_T|x_{T+1:T}) ← p(z_T)
    for t = T to 1 do
        Let M_t ⊆ [1, M] be the observed modalities at t
        q(z_t|x_{t:T}) ← q(z_t|x_{t+1:T}) ∏_{m∈M_t} q̃(z_t|x_t^m)
        Sample K particles z_t^k ~ q(z_t|x_{t:T}) for k ∈ [1, K]
        Compute p(z_{t−1}|z_t^k) for each particle z_t^k
        q(z_{t−1}|x_{t:T}) ← (1/K) Σ_{k=1}^{K} p(z_{t−1}|z_t^k)
    end for
    return {q(z_t|x_{t:T}), q(z_t|x_{t+1:T}) for t ∈ [1, T]}
end function
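In Python, the backward pass can be sketched as below: each step fuses the future term with the observed modalities' quotients via the Product of Gaussians, then samples particles and moment-matches to obtain the next future term. The callables `encoders[m]` and `back_trans` (standing in for the learned quotient networks $\tilde{q}(z_t|x_t^m)$ and the backward dynamics $p(z_{t-1}|z_t)$) are hypothetical stand-ins, and all distributions are diagonal Gaussians given as `(mu, var)` pairs.

```python
import torch

def gaussian_product(mus, vars_):
    # Precision-weighted product of diagonal Gaussians (fusion section).
    prec = [1.0 / v for v in vars_]
    total = sum(prec)
    return sum(m * p for m, p in zip(mus, prec)) / total, 1.0 / total

def backward_filter(obs, encoders, back_trans, prior, K=25):
    """Sketch of Algorithm 1. `obs[t][m]` is the observation for modality
    m at time t, or None if missing; `prior` is (mu, var) of p(z_T)."""
    T = len(obs)
    future = prior                       # q(z_T | x_{T+1:T}) <- p(z_T)
    filtered = [None] * T
    for t in reversed(range(T)):
        mus, vars_ = [future[0]], [future[1]]
        for m, x in obs[t].items():      # missing modalities drop out
            if x is not None:
                mu_m, var_m = encoders[m](x)
                mus.append(mu_m)
                vars_.append(var_m)
        mu_t, var_t = gaussian_product(mus, vars_)
        filtered[t] = (mu_t, var_t)
        # Sample K particles z_t^k, then moment-match the implied mixture
        # over z_{t-1} (Eqs. 9-10) to get the next future term.
        z = mu_t + var_t.sqrt() * torch.randn(K, *mu_t.shape)
        mu_k, var_k = back_trans(z)      # batched over particles
        mu_prev = mu_k.mean(0)
        var_prev = (var_k + mu_k ** 2).mean(0) - mu_prev ** 2
        future = (mu_prev, var_prev)
    return filtered
```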

Algorithm 2 A variational backward-forward algorithm for approximate forward smoothing.

function ForwardSmooth(x_{1:T}, K_b, K_f)
    Initialize q(z_1|x_{1:0}) ← 1
    Collect q(z_t|x_{t+1:T}) from BackwardFilter(x_{1:T}, K_b)
    for t = 1 to T do
        Let M_t ⊆ [1, M] be the observed modalities at t
        q(z_t|x_{1:T}) ← q(z_t|x_{t+1:T}) [∏_{m∈M_t} q̃(z_t|x_t^m)] q(z_t|x_{1:t−1}) / p(z_t)
        Sample K_f particles z_t^k ~ q(z_t|x_{1:T}) for k ∈ [1, K_f]
        Compute p(z_{t+1}|z_t^k) for each particle z_t^k
        q(z_{t+1}|x_{1:t}) ← (1/K_f) Σ_{k=1}^{K_f} p(z_{t+1}|z_t^k)
    end for
    return {q(z_t|x_{1:T}), q(z_t|x_{1:t−1}) for t ∈ [1, T]}
end function

While Algorithm 1 approximates the filtering distribution $q(z_t|x_{t:T})$, setting the number of particles to $K = 1$ makes it effectively compute the (backward) conditional filtering posterior $q(z_t|z_{t+1}, x_t)$ and the (backward) conditional prior $p(z_t|z_{t+1})$ for a randomly sampled latent sequence $z_{1:T}$. Similarly, while Algorithm 2 approximates smoothing by default, when $K_f = 1$ it effectively computes the (forward) conditional smoothing posterior $q(z_t|z_{t-1}, x_{t:T})$ and the (forward) conditional prior $p(z_t|z_{t-1})$ for a random latent sequence $z_{1:T}$. These quantities are useful not only because they allow us to perform sequencing, but also because we can use them to compute the ELBO for both backward filtering and forward smoothing:

$$\mathcal{L}_\text{filter} = \sum_{t=1}^{T}\left[\mathbb{E}_{q(z_t|x_{t:T})} \log p(x_t|z_t) - \mathbb{E}_{q(z_{t+1}|x_{t+1:T})} \mathrm{KL}\!\left(q(z_t|z_{t+1}, x_t)\,\middle\|\,p(z_t|z_{t+1})\right)\right] \quad (15)$$

$$\mathcal{L}_\text{smooth} = \sum_{t=1}^{T}\left[\mathbb{E}_{q(z_t|x_{1:T})} \log p(x_t|z_t) - \mathbb{E}_{q(z_{t-1}|x_{1:T})} \mathrm{KL}\!\left(q(z_t|z_{t-1}, x_{t:T})\,\middle\|\,p(z_t|z_{t-1})\right)\right] \quad (16)$$

$\mathcal{L}_\text{filter}$ is the filtering ELBO because it corresponds to a 'backward filtering' variational posterior $q(z_{1:T}|x_{1:T}) = \prod_t q(z_t|z_{t+1}, x_t)$, where each $z_t$ is inferred using only the current observation $x_t$ and the future latent state $z_{t+1}$. $\mathcal{L}_\text{smooth}$ is the smoothing ELBO because it corresponds to the correct factorization of the posterior in Equation 4, where each term combines information from both past and future.


Method        Recon.       Drop Half    Fwd. Extra.  Bwd. Extra.  Cond. Gen.   Label Pred.

Spirals Dataset: MSE (SD)
BFVI (ours)   0.02 (0.01)  0.04 (0.01)  0.12 (0.10)  0.07 (0.03)  0.26 (0.26)  –
F-Mask        0.02 (0.01)  0.06 (0.02)  0.10 (0.08)  0.18 (0.07)  1.37 (1.39)  –
F-Skip        0.04 (0.01)  0.10 (0.05)  0.13 (0.11)  0.19 (0.06)  1.51 (1.54)  –
B-Mask        0.02 (0.01)  0.04 (0.01)  0.18 (0.14)  0.04 (0.01)  1.25 (1.23)  –
B-Skip        0.05 (0.01)  0.19 (0.05)  0.32 (0.22)  0.37 (0.15)  1.64 (1.51)  –

Weizmann Video Dataset: SSIM or Accuracy* (SD)
BFVI (ours)   .85 (.03)    .84 (.04)    .84 (.04)    .83 (.05)    .85 (.03)    .69 (.33)*
F-Mask        .68 (.18)    .66 (.18)    .68 (.18)    .66 (.17)    .60 (.15)    .33 (.33)*
F-Skip        .70 (.12)    .68 (.14)    .70 (.12)    .67 (.16)    .63 (.12)    .21 (.26)*
B-Mask        .79 (.04)    .79 (.04)    .79 (.04)    .79 (.04)    .76 (.06)    .46 (.34)*
B-Skip        .80 (.04)    .79 (.04)    .80 (.04)    .79 (.04)    .74 (.08)    .29 (.37)*

Table 1: Evaluation metrics on both datasets across inference methods and tasks. Best performance per task (column) in bold. (Top) Spirals dataset: MSE (lower is better) per time-step between reconstructions and ground-truth spirals; for scale, the average squared spiral radius is about 5 sq. units. (Bottom) Weizmann video dataset: SSIM or label accuracy (higher is better) per time-step with respect to the original videos. Means and standard deviations (SD) are across the test set.

Since $\mathcal{L}_\text{smooth}$ corresponds to the correct factorization, it should theoretically suffice to maximize $\mathcal{L}_\text{smooth}$ alone to learn good MDMM parameters $\theta, \phi$. However, in order to compute $\mathcal{L}_\text{smooth}$, we must perform a backward pass, which requires sampling under the backward filtering distribution. Hence, to accurately approximate $\mathcal{L}_\text{smooth}$, the backward filtering distribution has to be reasonably accurate as well. This motivates learning the parameters $\theta, \phi$ by jointly maximizing the filtering and smoothing ELBOs as a weighted sum. We call this paradigm backward-forward variational inference (BFVI), due to its use of variational posteriors for both backward filtering and forward smoothing.

Experiments

We compare BFVI against state-of-the-art RNN-based inference methods on two multimodal time series datasets over a range of inference tasks. F-Mask and F-Skip use forward RNNs (one per modality), using zero-masking and update-skipping respectively to handle missing data. They are thus multimodal variants of the ST-L network in (Krishnan, Shalit, and Sontag 2017), and similar to the variational RNN (Chung et al. 2015) and recurrent SSM (Hafner et al. 2018). B-Mask and B-Skip use backward RNNs, with masking and skipping respectively, and correspond to the Deep Kalman Smoother in (Krishnan, Shalit, and Sontag 2017). The underlying MDMM architecture is constant across inference methods. Architectural and training details can be found in the supplement. Code is available at https://git.io/Jeoze.

Datasets

Noisy Spirals. We synthesized a dataset of 1000 noisy 2D spirals (600 train / 400 test) similar to Chen et al. (2018), treating the x and y coordinates as two separate modalities. Spiral trajectories vary in direction (clockwise or counter-clockwise), size, and aspect ratio, and Gaussian noise is added to the observations. We used 5 latent dimensions, and two-layer perceptrons for encoding $q(z_t|x_t^m)$ and decoding $p(x_t^m|z_t)$. For evaluation, we compute the mean squared error (MSE) per time step between the predicted trajectories and the ground-truth spirals.

Weizmann Human Actions. This is a video dataset of 9 people each performing 10 actions (Gorelick et al. 2007). We converted it to a trimodal time series dataset by treating silhouette masks as an additional modality and treating actions as per-frame labels, similar to He et al. (2018). Each RGB frame was cropped to the central 128×128 window and resized to 64×64. We selected one person's videos as the test set and the other 80 videos as the training set, allowing us to test action label prediction on an unseen person. We used 256 latent dimensions, and convolutional / deconvolutional neural networks for encoding and decoding. For evaluation, we compute the Structural Similarity (SSIM) between the input video frames and the reconstructed outputs.

Inference Tasks

We evaluated all methods on the following suite of temporal inference tasks for both datasets: reconstruction (reconstruction given complete observations); drop half (reconstruction after half of the inputs are randomly deleted); forward extrapolation (predicting the last 25% of a sequence when the rest is given); and backward extrapolation (inferring the first 25% of a sequence when the rest is given).

When evaluating these tasks on the Weizmann dataset, we provided only video frames as input (i.e., with neither the silhouette masks nor the action labels), to test whether the methods were capable of unimodal inference after multimodal training.

We also tested cross-modal inference using the following conditional generation / label prediction tasks: conditional generation (Spirals): given the x-coordinates and the initial 25% of the y-coordinates, generate the rest of the spiral; conditional generation (Weizmann): given the video frames, generate the silhouette masks; and label prediction (Weizmann): infer action labels given only video frames.


[Figure 2 panels: Recon., Drop Half, Fwd. Extra., Bwd. Extra., Cond. Gen., with per-panel MSEs for BFVI and B-Mask; the full per-method breakdown appears in Figure S2 of the supplement.]

Figure 2: Reconstructions for all 5 spiral inference tasks for BFVI and the next best method, B-Mask. BFVI significantly outperforms B-Mask on both forward extrapolation and conditional generation.

Table 1 shows the results for the inference tasks, while Figures 2 and 3 show sample reconstructions from the Spirals and Weizmann datasets respectively. On the Spirals dataset, BFVI achieves high performance on all tasks, whereas the RNN-based methods only perform well on a few. In particular, all methods besides BFVI do poorly on the conditional generation task, which can be understood from the right-most column of Figure 2: BFVI generates a spiral that matches the provided x-coordinates, while the next-best method, B-Mask, completes the trajectory with a plausible spiral but ignores the x observations entirely in the process.

On the more complex Weizmann video dataset, BFVI outperforms all other methods on every task, demonstrating both the power and flexibility of our approach. The RNN-based methods performed especially poorly on label prediction, and this was the case even on the training set (not shown in Table 1). We suspect that this is because the RNN-based methods lack a principled approach to multimodal fusion, and hence fail to learn a latent space which captures the mutual information between action labels and images. In contrast, BFVI learns both to predict one modality from another and to propagate information across time, as can be seen from the reconstructions and predictions in Figure 3.

Weakly Supervised Learning

In addition to performing inference with missing data at test time, we compared the various methods' ability to learn with missing data at training time, amounting to a form of weakly supervised learning. We tested two forms of weakly supervised learning on the Spirals dataset, corresponding to different conditions of data incompleteness. The first was learning with data missing uniformly at random; this condition can arise when sensors are noisy or asynchronous. The second was learning with missing modalities, or semi-supervised learning, where a fraction of the sequences in the dataset has only a single modality present; this condition can arise when a sensor breaks down, or when the dataset is partially unlabelled by annotators. We also tested learning with missing modalities on the Weizmann dataset.

Results for these experiments are shown in Figure 4, which compares BFVI's performance at increasing levels of missing data against the next-best method, averaged across 10 trials.

Figure 3: Snapshots of a 'running' video and silhouette mask from the Weizmann dataset (rows 1–2), with half of the frames deleted at random (14 out of 28 frames), and neither action labels nor silhouettes provided as observations (row 3). BFVI reconstructs the video and a running silhouette, and also correctly predicts the action (rows 4–5). By contrast, B-Skip (the next best method) creates blurred and wispy reconstructions, wrongly predicts the action label, and vacillates between different possible action silhouettes over time (rows 6–7).

Our method (BFVI) performs well, maintaining good performance on the Spirals dataset even with 70% uniform random deletion (Figure 4a) and 60% unimodal examples (Figure 4b), while degrading gracefully as missingness increases. This is in contrast to B-Mask, which is barely able to learn when even 10% of the spiral examples are unimodal, and which performs worse than BFVI on the Weizmann dataset at all levels of missing data (Figure 4c).

Conclusion

In this paper, we introduced backward-forward variational inference (BFVI) as a novel inference method for Multimodal Deep Markov Models. This method handles incomplete data via a factorized variational posterior, allowing us to easily marginalize over missing observations. Our method is thus capable of a large range of multimodal temporal inference tasks, which we demonstrate on both a synthetic dataset and a video dataset of human motions. The ability to handle missing data also enables applications in weakly supervised learning of labelled time series. Given the abundance of multimodal time series data in which missing data is the norm rather than the exception, our work holds great promise for many future applications.


Figure 4: Learning curves under various forms of weak supervision. (a) Learning with randomly missing data on the Spirals dataset. (b) Semi-supervised learning on the Spirals dataset, where some sequences are entirely missing the y-coordinate. (c) Semi-supervised learning on the Weizmann dataset, where some sequences have no action labels.

Acknowledgements

This work was supported by the A*STAR Human-Centric Artificial Intelligence Programme (SERC SSF Project No. A1718g0048).

References

[Archer et al. 2015] Archer, E.; Park, I. M.; Buesing, L.; Cunningham, J.; and Paninski, L. 2015. Black box variational inference for state space models. arXiv preprint arXiv:1511.07367.

[Baker et al. 2017] Baker, C. L.; Jara-Ettinger, J.; Saxe, R.; and Tenenbaum, J. B. 2017. Rational quantitative attribution of beliefs, desires and percepts in human mentalizing. Nature Human Behaviour 1(4):0064.

[Bayer and Osendorfer 2014] Bayer, J., and Osendorfer, C. 2014. Learning stochastic recurrent networks. In NIPS 2014 Workshop on Advances in Variational Inference.

[Buesing et al. 2018] Buesing, L.; Weber, T.; Racaniere, S.; Eslami, S.; Rezende, D.; Reichert, D. P.; Viola, F.; Besse, F.; Gregor, K.; Hassabis, D.; et al. 2018. Learning and querying fast generative models for reinforcement learning. arXiv preprint arXiv:1802.03006.

[Cao and Fleet 2014] Cao, Y., and Fleet, D. J. 2014. Generalized product of experts for automatic and principled fusion of Gaussian process predictions. In Modern Nonparametrics 3: Automating the Learning Pipeline Workshop, NeurIPS 2014.

[Che et al. 2018] Che, Z.; Purushotham, S.; Li, G.; Jiang, B.; and Liu, Y. 2018. Hierarchical deep generative models for multi-rate multivariate time series. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, 784–793. PMLR.

[Chen et al. 2018] Chen, T. Q.; Rubanova, Y.; Bettencourt, J.; and Duvenaud, D. K. 2018. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, 6571–6583.

[Chung et al. 2015] Chung, J.; Kastner, K.; Dinh, L.; Goel, K.; Courville, A. C.; and Bengio, Y. 2015. A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems, 2980–2988.

[Doerr et al. 2018] Doerr, A.; Daniel, C.; Schiegg, M.; Duy, N.-T.; Schaal, S.; Toussaint, M.; and Sebastian, T. 2018. Probabilistic recurrent state-space models. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, 1280–1289. PMLR.

[Dragan, Lee, and Srinivasa 2013] Dragan, A. D.; Lee, K. C.; and Srinivasa, S. S. 2013. Legibility and predictability of robot motion. In Proceedings of the 8th ACM/IEEE International Conference on Human-Robot Interaction, 301–308. IEEE Press.

[Fabius and van Amersfoort 2014] Fabius, O., and van Amersfoort, J. R. 2014. Variational recurrent auto-encoders. arXiv preprint arXiv:1412.6581.

[Fraccaro et al. 2016] Fraccaro, M.; Sønderby, S. K.; Paquet, U.; and Winther, O. 2016. Sequential neural models with stochastic layers. In Advances in Neural Information Processing Systems, 2199–2207.

[Fraccaro et al. 2017] Fraccaro, M.; Kamronn, S.; Paquet, U.; and Winther, O. 2017. A disentangled recognition and nonlinear dynamics model for unsupervised learning. In Advances in Neural Information Processing Systems, 3601–3610.

[Gorelick et al. 2007] Gorelick, L.; Blank, M.; Shechtman, E.; Irani, M.; and Basri, R. 2007. Actions as space-time shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(12):2247–2253.

[Hafner et al. 2018] Hafner, D.; Lillicrap, T.; Fischer, I.; Villegas, R.; Ha, D.; Lee, H.; and Davidson, J. 2018. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551.

[He et al. 2018] He, J.; Lehrmann, A.; Marino, J.; Mori, G.; and Sigal, L. 2018. Probabilistic video generation using holistic attribute control. In Proceedings of the European Conference on Computer Vision (ECCV), 452–467.

[Huber, Beutler, and Hanebeck 2011] Huber, M. F.; Beutler, F.; and Hanebeck, U. D. 2011. Semi-analytic Gaussian assumed density filter. In Proceedings of the 2011 American Control Conference, 3006–3011. IEEE.

[Johnson et al. 2016] Johnson, M.; Duvenaud, D. K.; Wiltschko, A.; Adams, R. P.; and Datta, S. R. 2016. Composing graphical models with neural networks for structured representations and fast inference. In Advances in Neural Information Processing Systems, 2946–2954.

[Kalman 1960] Kalman, R. E. 1960. A new approach to linear filtering and prediction problems. Journal of Basic Engineering 82(1):35–45.

[Karl et al. 2016] Karl, M.; Soelch, M.; Bayer, J.; and van der Smagt, P. 2016. Deep variational Bayes filters: Unsupervised learning of state space models from raw data. arXiv preprint arXiv:1605.06432.

[Kingma and Welling 2014] Kingma, D. P., and Welling, M. 2014. Auto-encoding variational Bayes. In International Conference on Learning Representations.

[Krishnan, Shalit, and Sontag 2017] Krishnan, R. G.; Shalit, U.; and Sontag, D. 2017. Structured inference networks for nonlinear state space models. In Thirty-First AAAI Conference on Artificial Intelligence.

[Lin, Khan, and Hubacher 2018] Lin, W.; Khan, M. E.; and Hubacher, N. 2018. Variational message passing with structured inference networks. In International Conference on Learning Representations.

[Lipton, Kale, and Wetzel 2016] Lipton, Z. C.; Kale, D.; and Wetzel, R. 2016. Directly modeling missing data in sequences with RNNs: Improved classification of clinical time series. In Machine Learning for Healthcare Conference, 253–270.

[Neil, Pfeiffer, and Liu 2016] Neil, D.; Pfeiffer, M.; and Liu, S.-C. 2016. Phased LSTM: Accelerating recurrent network training for long or event-based sequences. In Advances in Neural Information Processing Systems, 3882–3890.

[Ong, Zaki, and Goodman 2015] Ong, D. C.; Zaki, J.; and Goodman, N. D. 2015. Affective cognition: Exploring lay theories of emotion. Cognition 143:141–162.

[Rezende, Mohamed, and Wierstra 2014] Rezende, D. J.; Mohamed, S.; and Wierstra, D. 2014. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 1278–1286.

[Wu and Goodman 2018] Wu, M., and Goodman, N. 2018. Multimodal generative models for scalable weakly-supervised learning. In Advances in Neural Information Processing Systems, 5575–5585.

Supplemental Material
Factorized Inference in Deep Markov Models for Incomplete Multimodal Time Series

A Products and Quotients of Gaussian Distributions

Here we state the closed-form solution for the quotient of two products of finitely many Gaussian distributions. We define:

$$g(x) = \prod_{i=1}^{k} f(x|\mu_i, \Sigma_i) \Big/ \prod_{j=k+1}^{l} f(x|\mu_j, \Sigma_j),$$

where each $f(x|\mu_i, \Sigma_i)$ is a multivariate Gaussian with mean $\mu_i$, covariance $\Sigma_i$, and precision $T_i = \Sigma_i^{-1}$. Under the constraint that $\sum_{i=1}^{k} T_i > \sum_{j=k+1}^{l} T_j$ element-wise, $g(x)$ is also multivariate Gaussian with mean and covariance:

$$\mu_g = \frac{\sum_{i=1}^{k} \mu_i T_i - \sum_{j=k+1}^{l} \mu_j T_j}{\sum_{i=1}^{k} T_i - \sum_{j=k+1}^{l} T_j} \qquad \Sigma_g = \left(\sum_{i=1}^{k} T_i - \sum_{j=k+1}^{l} T_j\right)^{-1}$$

If the constraint is not satisfied, then $g(x)$ cannot be normalized into a well-defined probability distribution. The reader may refer to [1] for a modern proof.
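A minimal sketch of this quotient for two diagonal Gaussians (the case $k = 1$, $l = 2$), including the element-wise precision constraint, written by us for illustration:

```python
def gaussian_quotient(mu_num, var_num, mu_den, var_den):
    """Quotient N(mu_num, var_num) / N(mu_den, var_den) of diagonal
    Gaussians. Well-defined only when the numerator's precision exceeds
    the denominator's element-wise, which is checked explicitly."""
    t_num, t_den = 1.0 / var_num, 1.0 / var_den
    t_g = t_num - t_den
    assert (t_g > 0).all(), "quotient is not a normalizable Gaussian"
    mu_g = (mu_num * t_num - mu_den * t_den) / t_g
    return mu_g, 1.0 / t_g
```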

B Multimodal Bidirectional Training Loss

Backward-forward variational inference (BFVI) requires training the Multimodal Deep Markov Model (MDMM) to perform both backward filtering and forward smoothing. In addition, the model has to learn how to perform inference given multiple modalities, both jointly and in isolation. As such, we extend the multimodal training paradigm proposed for the MVAE [1]. For each batch of training data, we compute and minimize the following multimodal bidirectional training loss:

$$\mathcal{L} = \lambda_\text{filter}\left[\mathcal{L}^{1:M}_\text{filter} + \sum_{m=1}^{M}\mathcal{L}^m_\text{filter}\right] + \lambda_\text{smooth}\left[\mathcal{L}^{1:M}_\text{smooth} + \sum_{m=1}^{M}\mathcal{L}^m_\text{smooth}\right] + \lambda_\text{match}\mathcal{L}_\text{match}$$

Here, $\lambda_\text{filter}$, $\lambda_\text{smooth}$, and $\lambda_\text{match}$ are loss multipliers. $\mathcal{L}^{1:M}_\text{filter}$ and $\mathcal{L}^{1:M}_\text{smooth}$ are the multimodal ELBO losses for filtering and smoothing respectively:

$$\mathcal{L}^{1:M}_\text{filter} = \sum_{t=1}^{T}\left[\mathbb{E}_{q(z_t|x^{1:M}_{t:T})}\sum_{m=1}^{M}\lambda_m \log p(x_t^m|z_t) - \mathbb{E}_{q(z_{t+1}|x^{1:M}_{t+1:T})}\,\beta\,\mathrm{KL}\!\left(q(z_t|z_{t+1}, x_t^{1:M})\,\middle\|\,p(z_t|z_{t+1})\right)\right]$$

$$\mathcal{L}^{1:M}_\text{smooth} = \sum_{t=1}^{T}\left[\mathbb{E}_{q(z_t|x^{1:M}_{1:T})}\sum_{m=1}^{M}\lambda_m \log p(x_t^m|z_t) - \mathbb{E}_{q(z_{t-1}|x^{1:M}_{1:T})}\,\beta\,\mathrm{KL}\!\left(q(z_t|z_{t-1}, x_{t:T}^{1:M})\,\middle\|\,p(z_t|z_{t-1})\right)\right]$$


$\lambda_m$ is the reconstruction loss multiplier for modality $m$, and $\beta$ is the loss multiplier for the KL divergence. $\mathcal{L}^m_\text{filter}$ and $\mathcal{L}^m_\text{smooth}$ are the corresponding unimodal ELBO losses:

$$\mathcal{L}^{m}_\text{filter} = \sum_{t=1}^{T}\left[\mathbb{E}_{q(z_t|x^{m}_{t:T})}\,\lambda_m \log p(x_t^m|z_t) - \mathbb{E}_{q(z_{t+1}|x^{m}_{t+1:T})}\,\beta\,\mathrm{KL}\!\left(q(z_t|z_{t+1}, x_t^{m})\,\middle\|\,p(z_t|z_{t+1})\right)\right]$$

$$\mathcal{L}^{m}_\text{smooth} = \sum_{t=1}^{T}\left[\mathbb{E}_{q(z_t|x^{m}_{1:T})}\,\lambda_m \log p(x_t^m|z_t) - \mathbb{E}_{q(z_{t-1}|x^{m}_{1:T})}\,\beta\,\mathrm{KL}\!\left(q(z_t|z_{t-1}, x_{t:T}^{m})\,\middle\|\,p(z_t|z_{t-1})\right)\right]$$

$\mathcal{L}_\text{match}$ is the prior matching loss, which ensures that the forward and backward dynamics conform to the assumption that $p(z_t)$ is invariant with $t$:

$$\mathcal{L}_\text{match} = \mathrm{KL}\!\left(p(z_t)\,\middle\|\,\mathbb{E}_{z_{t-1}}p(z_t|z_{t-1})\right) + \mathrm{KL}\!\left(p(z_t)\,\middle\|\,\mathbb{E}_{z_{t+1}}p(z_t|z_{t+1})\right)$$

To compute $\mathcal{L}_\text{match}$, we need to sample particles from $p(z_t)$, introducing the number of prior matching particles $K_p$ as a hyper-parameter. Similarly, we need to perform backward filtering with sampling to compute the smoothing ELBOs, for which we use $K_b$ backward filtering particles.

$\mathcal{L}_\text{filter} := \mathcal{L}^{1:M}_\text{filter} + \sum_{m=1}^{M}\mathcal{L}^m_\text{filter}$ and $\mathcal{L}_\text{smooth} := \mathcal{L}^{1:M}_\text{smooth} + \sum_{m=1}^{M}\mathcal{L}^m_\text{smooth}$ are minimized together so as to ensure that the model learns accurate backward dynamics. This is done because computation of $\mathcal{L}_\text{smooth}$ involves a backward pass which requires sampling under the backward filtering distribution. As such, to accurately approximate $\mathcal{L}_\text{smooth}$, the backward filtering distribution has to be reasonably accurate as well, which motivates optimizing $\mathcal{L}_\text{filter}$ jointly.

Empirically, we found that optimizing $\mathcal{L}_\text{smooth}$ alone led to poor performance on backward extrapolation and to small gradients for the backward transition parameters, suggesting that the backward pass was too far upstream in the computation graph. Optimizing $\mathcal{L}_\text{filter}$ in addition to $\mathcal{L}_\text{smooth}$ rectified this issue with vanishing gradients. While we chose to weight $\mathcal{L}_\text{filter}$ and $\mathcal{L}_\text{smooth}$ equally, many other training schemes are possible, e.g., placing a high weight on $\mathcal{L}_\text{filter}$ initially and then annealing it to zero, such that only $\mathcal{L}_\text{smooth}$ (i.e., the correct ELBO if one's goal is to perform only smoothing) is optimized eventually. Such training schemes should be investigated in future work.

C Model and Inference Architectures

C.1 Transition Functions

We use a variant of the Gated Transition Function (GTF) from [2] to parameterize both the forward transition $p(z_t|z_{t-1})$ and the backward transition $p(z_t|z_{t+1})$:

$$g_t = \text{Sigmoid} \circ \text{Linear}_{|z|\leftarrow|h|} \circ \text{ReLU} \circ \text{Linear}_{|h|\leftarrow|z|}(z_t)$$
$$\tilde{\mu}_t = \text{Linear}_{|z|\leftarrow|h|} \circ \text{ReLU} \circ \text{Linear}_{|h|\leftarrow|z|}(z_t)$$
$$\bar{\mu}_t = \text{Linear}_{|z|\leftarrow|z|}(z_t)$$
$$\mu_t = g_t \odot \tilde{\mu}_t + (1 - g_t) \odot \bar{\mu}_t$$
$$\sigma_t = \text{Softplus} \circ \text{Linear}_{|z|\leftarrow|z|}(\mu_t)$$

where $\text{Linear}_{b\leftarrow a}$ is a linear mapping from $a$ to $b$ dimensions, $\circ$ is function composition, and $\odot$ is element-wise multiplication; $|z|$ is the dimension of the latent space (5 for spirals, 256 for videos), and $|h|$ is the dimension of the hidden layer (20 for spirals, 256 for videos). To stabilize training when using BFVI, we found it necessary to add a small constant (0.001) to the standard deviation $\sigma_t$. To be clear, we learn separate networks for the forward and backward transitions.
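A sketch of this gated transition as a PyTorch module (the module and attribute names are ours; the small constant added to $\sigma_t$ appears as `min_sd`):

```python
import torch.nn as nn
import torch.nn.functional as F

class GatedTransition(nn.Module):
    """Gated transition function: a gate g_t interpolates between a
    nonlinear proposed mean and a linear mean, as described above."""
    def __init__(self, z_dim, h_dim, min_sd=1e-3):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                  nn.Linear(h_dim, z_dim), nn.Sigmoid())
        self.prop_mean = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                       nn.Linear(h_dim, z_dim))
        self.lin_mean = nn.Linear(z_dim, z_dim)
        self.sd = nn.Linear(z_dim, z_dim)
        self.min_sd = min_sd  # stabilizes training, as noted above

    def forward(self, z):
        g = self.gate(z)
        mu = g * self.prop_mean(z) + (1 - g) * self.lin_mean(z)
        return mu, F.softplus(self.sd(mu)) + self.min_sd
```

Separate instances would be used for the forward and backward transitions.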

C.2 Encoders and Decoders

For the Spirals dataset, we used multi-layer perceptrons as encoders and decoders for the x and y data, with $|h| = 20$ hidden units. Let $v$ stand for either $x$ or $y$, and recall that for the encoder, we learn a network that directly parameterizes the quotient $\tilde{q}(z_t|v_t) = q(z_t|v_t)/p(z_t)$. The architectures are:

Encoder: $\tilde{q}(z_t|v_t) \sim \mathcal{N}(\mu_z, \sigma_z)$
$$h = \text{ReLU} \circ \text{Linear}_{|h|\leftarrow 1}(v_t)$$
$$\mu_z = \text{Linear}_{|z|\leftarrow|h|}(h) \qquad \sigma_z = \text{Softplus} \circ \text{Linear}_{|z|\leftarrow|h|}(h)$$

Decoder: $p(v_t|z_t) \sim \mathcal{N}(\mu_v, \sigma_v)$
$$h = \text{ReLU} \circ \text{Linear}_{|h|\leftarrow|z|}(z_t)$$
$$\mu_v = \text{Linear}_{1\leftarrow|h|}(h) \qquad \sigma_v = \text{Softplus} \circ \text{Linear}_{1\leftarrow|h|}(h)$$
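For concreteness, the encoder side could look as follows in PyTorch (a sketch with our own names; the decoder mirrors it with the input and output dimensions swapped):

```python
import torch.nn as nn
import torch.nn.functional as F

class SpiralEncoder(nn.Module):
    """Per-modality encoder that directly parameterizes the Gaussian
    quotient q~(z_t|v_t) for a 1-D coordinate modality."""
    def __init__(self, z_dim=5, h_dim=20):
        super().__init__()
        self.hidden = nn.Linear(1, h_dim)
        self.mu = nn.Linear(h_dim, z_dim)
        self.sd = nn.Linear(h_dim, z_dim)

    def forward(self, v):
        h = F.relu(self.hidden(v))
        return self.mu(h), F.softplus(self.sd(h))
```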


For the Weizmann video dataset, we used a variant of the above to work with categorical distributions over the action labels. Again we use $v$ as a placeholder. The architectures are:

Encoder: $\tilde{q}(z_t|v_t) \sim \mathcal{N}(\mu_z, \sigma_z)$
$$e = \text{ReLU} \circ \text{Embedding}_{|h|}(v_t)$$
$$h = \text{ReLU} \circ \text{Linear}_{|h|\leftarrow|h|}(e)$$
$$\mu_z = \text{Linear}_{|z|\leftarrow|h|}(h) \qquad \sigma_z = \text{Softplus} \circ \text{Linear}_{|z|\leftarrow|h|}(h)$$

Decoder: $p(v_t|z_t) \sim \text{Categorical}(p_1, \dots, p_k)$
$$h = \text{ReLU} \circ \text{Linear}_{|h|\leftarrow|z|}(z_t)$$
$$(p_1, \dots, p_k) = \text{Softmax} \circ \text{Linear}_{k\leftarrow|h|}(h)$$

The number of categories is denoted by $k$ (10 for actions), and we used $|h| = 256$. For the encoders and decoders from and to images (video frames or silhouette masks), see Figure S1 below.

[Figure S1: (a) Image Encoder: 3@64×64 → 16@32×32 → 32@16×16 → 64@8×8 via Conv + BatchNorm + ReLU blocks, followed by dense layers producing a 1×256 mean and a 1×256 standard deviation (via Softplus). (b) Image Decoder: a Dense + ReLU layer from z, followed by Deconv + BatchNorm + ReLU blocks mirroring the encoder back to 3@64×64.]

Figure S1: Image encoder and decoder architectures. Convolutional layers use 3×3 kernels, deconvolutional layers use 4×4 kernels; both use a stride of 2 and padding of 1. Silhouette masks had 1 input channel instead of 3.

C.3 Inference Networks

The encoder architectures described in the previous section are designed to work with BFVI, and so they directly output distribution parameters $\mu_z$, $\sigma_z$ for the latent state $z_t$. To adapt them to work with the RNN-based structured inference networks (F-Mask, F-Skip, B-Mask, B-Skip), we simply remove the Gaussian output layers from each encoder, using the features $f_t^m$ of the penultimate layer as inputs to the RNN at each time step.

We follow the original DMM implementation [2] in using Gated Recurrent Units (GRUs) for the structured inference networks. To generalize this to the multimodal case, we have one GRU per modality, which we label $\text{GRU}^m$ for modality $m$. The difference between the Mask and Skip methods is that the former uses zero-masking of missing inputs:

$$h^m_{t\pm1} = \begin{cases} \text{GRU}^m(f^m_t, h^m_t) & \text{if } x^m_t \text{ is present} \\ \text{GRU}^m(0, h^m_t) & \text{otherwise} \end{cases}$$

whereas the latter skips the update of the hidden state $h^m_t$ of $\text{GRU}^m$ whenever $x^m_t$ is missing:

$$h^m_{t\pm1} = \begin{cases} \text{GRU}^m(f^m_t, h^m_t) & \text{if } x^m_t \text{ is present} \\ h^m_t & \text{otherwise} \end{cases}$$

Update skipping is the method used in [2]'s implementation to handle missing data. For the forward RNNs (F-Mask and F-Skip), we take the positive sign in the plus-minus signs above, while for the backward RNNs (B-Mask and B-Skip), we take the negative sign.
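The two strategies differ only in how a missing input is treated; a sketch of one update step (function name is ours):

```python
import torch
import torch.nn as nn

def gru_step(cell: nn.GRUCell, f_t, h, method="mask"):
    """One inference-RNN step for a single modality. `f_t` is the encoded
    feature for this time-step, or None if the observation is missing."""
    if f_t is not None:
        return cell(f_t, h)
    if method == "mask":               # zero-masking: feed a zero input
        return cell(torch.zeros(h.size(0), cell.input_size), h)
    return h                           # skipping: hidden state unchanged
```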

In order to combine the information from the RNNs with the previous latent state $z_{t-1}$ to infer the current latent state $z_t$, we use a variant of the combiner function from [2] that takes in $z_{t-1}$ and the hidden states $h^m_t$ of all modalities as inputs:

$$q(z_t|z_{t-1}, x_{1:T}) \sim \mathcal{N}(\mu_z, \sigma_z)$$
$$h = \text{ReLU} \circ \text{Linear}_{|h|\leftarrow}(z_{t-1}, h^1_t, \dots, h^M_t)$$
$$\mu_z = \text{Linear}_{|z|\leftarrow|h|}(h) \qquad \sigma_z = \text{Softplus} \circ \text{Linear}_{|z|\leftarrow|h|}(h)$$

We use $|h| = 5$ for the spirals dataset and $|h| = 256$ for the videos (the same dimensions for both the hidden layer and the GRU hidden states). Because the above formulation has no direct connection from the current input $x_t$ to the current latent state $z_t$, it seemed likely to hurt performance whenever we needed $z_t$ to be a good low-dimensional representation of $x_t$. As such, for the video dataset, we used a variant that also takes in the encoded features $f^m_t$ as inputs (if $x^m_t$ is missing, the corresponding feature $f^m_t$ is zero-masked):

$$q(z_t|z_{t-1}, x_{1:T}) \sim \mathcal{N}(\mu_z, \sigma_z)$$
$$h = \text{ReLU} \circ \text{Linear}_{|h|\leftarrow}(z_{t-1}, h^1_t, \dots, h^M_t, f^1_t, \dots, f^M_t)$$
$$\mu_z = \text{Linear}_{|z|\leftarrow|h|}(h) \qquad \sigma_z = \text{Softplus} \circ \text{Linear}_{|z|\leftarrow|h|}(h)$$

Table S1 compares the number of neural network parameters (model networks + inference networks) required for BFVI vs. the RNN-based inference methods. Despite matching hidden layer dimensions, BFVI still uses 2 to 3 times fewer parameters, because it does not require an inference RNN for each modality, and uses Product-of-Gaussians fusion instead of a combiner network.

Method       Spirals   Weizmann
BFVI            1854    4542503
RNN-based       7124    7494183

Table S1: Number of network parameters for each method.

D Training Parameters

Parameter                          Spirals                    Weizmann
Filtering ELBO mult. λ_filter      0.5                        ”
Smoothing ELBO mult. λ_smooth      0.5                        ”
Prior matching mult. λ_match       0.01β                      ”
Reconstruction mult. λ_m           x, y: 1                    Vid.: 1, Sil.: 1, Act.: 10
KL divergence mult. β              Anneal(0,1) over E_β       ”
Bwd. filtering particles K_b       25                         ”
Prior matching particles K_p       50                         ”
Training epochs E                  500                        400
β-annealing epochs E_β             100                        250
Early stopping                     Yes                        No
Batch size                         100                        25
Sequence splitting                 No                         Into 25 time-step segments
Optimizer                          ADAM                       ”
Learning rate                      0.02 (BFVI), 0.01 (rest)   5×10⁻⁴
Weight decay                       1×10⁻⁴                     ”
Burst deletion rate                0.1                        0.2

Table S2: Training parameters for the spirals and Weizmann datasets (” denotes the same value as for spirals).

Table S2 provides the training parameters we used. Unless otherwise specified, parameters were kept constant across inference methods. As in most VAE-like models, we anneal $\beta$, the multiplier for the KL divergence loss, from 0 to 1 over time. This incentivizes the model to first find encoders and decoders that reconstruct well, before regularizing the latent space [3]. Since the prior matching multiplier $\lambda_\text{match}$ also serves to regularize the latent space, we tie its value to $\beta$, so that it increases as $\beta$ increases.

To speed up training on the video dataset, we split each input sequence into segments 25 time steps long, and trained on those segments. To improve robustness to missing data during training, we also introduced burst deletion errors: random contiguous deletions of inputs. For the spirals dataset, we deleted input segments at training time that were 0.1 times as long as the original sequences; for the video dataset, the deleted segments were 0.2 times as long as the original videos. Deletion start points were selected uniformly at random.
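A sketch of this burst deletion augmentation (our own helper, assuming a sequence stored as a list with None marking missing steps):

```python
import random

def burst_delete(seq, rate):
    """Delete one contiguous segment covering `rate` of the sequence,
    with a uniformly random start point."""
    T = len(seq)
    width = int(rate * T)
    start = random.randint(0, T - width)
    return [None if start <= t < start + width else x
            for t, x in enumerate(seq)]
```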


We briefly discuss training differences between BFVI and the other methods here. For the RNN-based methods, no bidirectional training is involved, so we only minimized the (forward) filtering ELBO (for F-Mask, F-Skip) or the smoothing ELBO (for B-Mask, B-Skip). While we also tried to use the multimodal training paradigm (i.e., minimizing the sum of $\mathcal{L}^{1:M}$ and $\mathcal{L}^m$) for the RNN-based methods, the effects were mixed: on the spirals dataset, performance dropped sharply with the multimodal paradigm, whereas performance increased on the Weizmann video dataset. As such, the results we report in the main manuscript use the multimodal training paradigm only for the video dataset. For the spirals dataset, we only minimize $\mathcal{L}^{1:M}$, which is computed with all inputs provided, but not $\mathcal{L}^m$, which is computed with only modality $m$ provided as input. Finally, we used a higher learning rate (0.02) for BFVI on the spirals dataset because we noticed slow convergence at the lower rate of 0.01; increasing the learning rate to 0.02 for the other methods hurt their performance.

E Evaluation Details

When evaluating the models, we estimate the MAP latent sequence $z^*_{1:T} = \arg\max_{z_{1:T}} p(z_{1:T}|x_{1:T})$ by estimating $\hat{z}_t = \arg\max_{z_t} q(z_t|\hat{z}_{t-1}, x_{t:T})$ for each $t$ (i.e., we recursively take the mean of the approximate conditional smoothing posterior $q(z_t|z_{t-1}, x_{t:T})$). For reconstruction, we then decode this inferred latent sequence and take the MLE of the observations, $\hat{x}_{1:T} = \arg\max_{x_{1:T}} p(x_{1:T}|\hat{z}_{1:T})$. While this does not infer the exact MAP sequence, it has been standard practice for recurrent latent variable models [2], and we find it empirically adequate in lieu of approximate Viterbi sequencing.

In the case of BFVI, we also have to specify the number of filtering particles $K_b$ for the backward pass at evaluation time. We used $K_b = 200$ for additional stability, though we found that lower values produced similar results.

To evaluate the performance of BFVI and the next-best method under weakly supervised learning, we performed 10 training runs for each method and each level of data deletion, then reported the average of the best 3 runs. This was done in order to exclude outlying runs where training instability emerged, leading to divergence of the loss function. At test time, the inference task for the Spirals dataset was to reconstruct each spiral given that the first and last 25% of data points were removed and an additional 50% of data points were deleted at random. The inference task for the Weizmann dataset was to predict the action labels when only the video modality was provided and 50% of the video frames were missing.

F Additional Results

F.1 Spiral Reconstructions

To supplement the reconstruction figures in the main paper, which only compared BFVI to the next-best method, here we show a more complete comparison of all 5 methods on the Spirals dataset (Figure S2).

References

[1] Mike Wu and Noah Goodman. Multimodal generative models for scalable weakly-supervised learning. In Advances in Neural Information Processing Systems, pages 5575–5585, 2018.

[2] Rahul G. Krishnan, Uri Shalit, and David Sontag. Structured inference networks for nonlinear state space models. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[3] Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.


[Figure S2 panels, per-task MSE for each method:
Method   Recon.  Drop Half  Fwd. Extra.  Bwd. Extra.  Cond. Gen.
BFVI     0.014   0.023      0.085        0.040        0.058
F-Mask   0.018   0.030      0.061        0.036        1.415
F-Skip   0.041   0.101      0.159        0.088        1.625
B-Mask   0.018   0.029      0.178        0.023        2.157
B-Skip   0.032   0.165      0.162        0.206        0.517]

Figure S2: Reconstructions for all 5 spiral inference tasks (columns) for all methods (rows). Our BFVI method does well on all tasks, and does especially better than the other methods on the Conditional Generation task (right-most column), where only the x-coordinates and the first 25% of the y-coordinates are given.