Dynamic Copula Networks for Modeling Real-valued Time Series

Elad Eban, Gideon Rothschild, Adi Mizrahi, Israel Nelken, Gal Elidan

The Hebrew University School of Computer Science; UCSF Center for Integrative Neuroscience; The Hebrew University Institute of Life Sciences; The Hebrew University Department of Statistics

Abstract

Probabilistic modeling of temporal phenomena is of central importance in a variety of fields ranging from neuroscience to economics to speech recognition. While the task has received extensive attention in recent decades, learning temporal models for multivariate real-valued data that is non-Gaussian is still a formidable challenge. Recently, the power of copulas, a framework for representing complex multi-modal and heavy-tailed distributions, was fused with the formalism of Bayesian networks to allow for flexible modeling of high-dimensional distributions. In this work we introduce Dynamic Copula Bayesian Networks (DCBNs), a generalization aimed at capturing the distribution of rich temporal sequences. We apply our model to three markedly different real-life domains and demonstrate substantial quantitative and qualitative advantages.

1 Introduction

Probabilistic modeling of temporal phenomena is of great interest in diverse fields ranging from computational biology to economics to speech recognition. A central challenge in such modeling is the daunting dimension of the data. For example, even a simple EEG signal may include many thousands of time points. The sequential nature of such phenomena, however, often allows us to make realistic simplifying assumptions. First, a periodic behavior is often observed. Second, reasonable Markovian assumptions allow us to construct models over entire sequences via local

Appearing in Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS) 2013, Scottsdale, AZ, USA. Volume 31 of JMLR: W&CP 31. Copyright 2013 by the authors.

building blocks that are limited to a finite horizon. Using these assumptions, general purpose tools such as Markov chains, Hidden Markov models, and Kalman filters are widely used with great success in a profusion of temporal applications.

Dynamic Bayesian networks (DBNs) [Dean and Kanazawa, 1989] are a temporal extension of the widely used framework of Bayesian networks [Pearl, 1988]. Like all probabilistic graphical models, DBNs rely on local building blocks that capture the close range behavior of random variables (both within each time “slice” and across time). Importantly, the framework generalizes many of the commonly used temporal constructs. Yet, despite wide empirical success, DBNs are susceptible to critical limitations. Most notably, while DBNs are in principle applicable to real-valued domains, practical considerations when tackling such scenarios almost always force us to use simple parametric building blocks. In fact, the overwhelming majority of continuous DBNs rely on a simple Gaussian representation.

Obviously, the Gaussian assumption, while mathematically convenient, is often unrealistic. Neural activity levels, for example, are characterized by noisy and lengthy “rest” periods and rare extreme events (spikes), resulting in a heavy-tailed distribution; daily financial change measurements are often characterized by normally distributed stable periods intermingled with economic upheavals whose erratic behavior is far from Gaussian; and sensor measurements of human activity (e.g., walking, running) often have highly peaked behavior due to the nature of the activity performed and/or our physical limitations. Our goal is to model such complex non-Gaussian phenomena.

To cope with continuous non-Gaussian challenges in a non-temporal context, Elidan [2010] recently proposed the Copula BN model (CBN) that fuses the frameworks of the statistical copula and BNs. Briefly, copulas allow us to model complex real-valued distributions by separating the choice of the (possibly nonparametric) univariate marginals and the dependence function that “couples” them into a coherent


joint distribution. Using the language of probabilistic graphical models, the CBN model extends the idea to high dimensions and in practice leads to substantial quantitative and qualitative gains (see [Elidan, 2010] for more details).

In this work we extend CBNs to work in the temporal domain in the same way that DBNs extend non-temporal BNs, and formulate the Dynamic Copula Bayesian Network (DCBN) model. Although the extension is straightforward theoretically, its practical merits are substantial. In particular, with little computational effort, the model allows us to capture complex multivariate temporal phenomena whose behavior is quite far from that of a multivariate Gaussian distribution.

We apply our model to the three markedly different temporal settings discussed above: neural activity levels, EMG physical action measurements, and financial daily return data. We demonstrate a significant quantitative advantage in all cases in terms of the quality of the multivariate density learned, relative to temporal and non-temporal Gaussian baselines. In the action dataset, where a discriminative task may also be of interest, we also show a marked improvement in terms of predictive ability. Finally, for the neural activity data, we provide a qualitative evaluation that demonstrates the ability of our model to faithfully capture complex real-life phenomena.

The rest of the paper is organized as follows: following a description of the necessary background material in Section 2, in Section 3 we describe the DCBN temporal model and briefly explain how the model is automatically learned from data. We demonstrate the practical merit of the model in Section 4, and end with concluding remarks in Section 5.

2 Background

In this section we briefly review the framework of copulas and the recently introduced Copula BN model [Elidan, 2010]. Let X = {X1, . . . , XN} be a finite set of scalar real-valued random variables and let FX(x) ≡ P(X1 ≤ x1, . . . , XN ≤ xN) be a (cumulative) distribution over X, with lower case letters denoting assignment to variables. For compactness, we use Fi(xi) ≡ FXi(xi) = P(Xi ≤ xi, X∖Xi = ∞), and for density functions we similarly use fi(xi) ≡ fXi(xi). When there is no ambiguity we sometimes abuse notation and use F(xi) ≡ FXi(xi), and similarly for densities and for sets of variables.

2.1 Copulas

A copula function [Sklar, 1959] links marginal distributions to form a multivariate one. Formally,

Definition 2.1: Let U1, . . . , UN be real random variables marginally uniformly distributed on [0, 1]. A copula function C : [0, 1]^N → [0, 1] is a joint distribution

Cθ(u1, . . . , uN) = P(U1 ≤ u1, . . . , UN ≤ uN),

where θ are the parameters of the copula function.

Sklar’s seminal theorem states that any joint distribution FX(x) can be represented as a copula function C of its univariate marginals

FX(x) = C(F1(x1), . . . , FN(xN)).

When the univariate marginals are continuous, C is uniquely defined. The constructive converse, which is of central interest from a modeling perspective, is also true: any copula function taking any marginal distributions {Fi(xi)} as its arguments defines a valid joint distribution with marginals {Fi(xi)}. Thus, copulas are “distribution generating” functions that allow us to separate the choice of the univariate marginals and that of the dependence structure, encoded in the copula function C. Importantly, this flexibility often results in a construction that is beneficial in practice.

Assuming C has Nth order partial derivatives (true almost everywhere when continuous), the joint density can be derived from the copula function using the derivative chain rule

f(x) = [∂^N C(F1(x1), . . . , FN(xN)) / (∂F1(x1) · · · ∂FN(xN))] ∏i fi(xi) ≡ cθ(F1(x1), . . . , FN(xN)) ∏i fi(xi),   (1)

where c(·) is called the copula density.

Example 2.1: The Gaussian copula is undoubtedly the most commonly used copula family, with applications ranging from mainstream financial risk assessment to climatology. Its distribution is defined as

CΣ({Ui}) = ΦΣ(Φ⁻¹(U1), . . . , Φ⁻¹(UN)),   (2)

where Σ is a correlation matrix, Φ is the standard normal distribution, and ΦΣ is a zero mean normal distribution with correlation matrix Σ.

Figure 1 exemplifies the flexibility that comes with this seemingly limited elliptical copula family. Shown are samples from this copula using two different marginals. As can be seen, a variety of markedly different and multi-modal distributions can be constructed. Generally, and without any added computational difficulty, we can mix and match any marginals with any copula function to form a valid joint distribution.
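This mix-and-match construction is easy to reproduce. The following sketch (our illustration, using only the Python standard library) samples from a bivariate Gaussian copula with correlation θ = 0.25 and then pushes the uniform coordinates through two different inverse marginal CDFs, here a standard normal and a unit-rate exponential:

```python
import math
import random
from statistics import NormalDist

ND = NormalDist()  # standard normal: cdf = Phi, inv_cdf = Phi^{-1}

def sample_gaussian_copula(theta, n, rng):
    """Draw n pairs (u1, u2) from the bivariate Gaussian copula with correlation theta."""
    out = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # z2 | z1 is normal with mean theta*z1 and variance 1 - theta^2
        z2 = theta * z1 + math.sqrt(1.0 - theta * theta) * rng.gauss(0.0, 1.0)
        # Phi maps each coordinate to a Uniform[0, 1] marginal
        out.append((ND.cdf(z1), ND.cdf(z2)))
    return out

rng = random.Random(0)
uv = sample_gaussian_copula(0.25, 20000, rng)

# Mix and match: push the uniforms through any inverse marginal CDFs.
# Here X1 gets a standard normal marginal and X2 a unit-rate exponential one.
samples = [(ND.inv_cdf(u), -math.log(1.0 - v)) for u, v in uv]
```

Swapping in any other inverse CDFs changes the marginals without touching the dependence structure, which is exactly the separation the copula provides.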


Figure 1: Samples from the bivariate Gaussian copula with correlation θ = 0.25. (left) with unit variance Gaussian marginals; (right) with a mixture of Gaussians and Gamma marginals.

2.2 Bayesian Networks and Copula Bayesian Networks

Let G be a directed acyclic graph (DAG) whose nodes correspond to the set of random variables X = {X1, . . . , XN}, and let Pai = {Pai1, . . . , Paiki} be the parents of Xi in G. A Bayesian network (BN) [Pearl, 1988] is used to represent a joint density over X using a qualitative graph structure and quantitative parameters that define local conditional densities. The graph G encodes the independence statements I(G) = {(Xi ⊥ NonDesci | Pai)}, where ⊥ denotes the independence relationship, and NonDesci are the nodes that are not descendants of Xi in G. It is easy to show that if I(G) hold, then the joint density decomposes into a product of local conditional densities fX(x) = ∏i fi(Xi | Pai). Conversely, any such product of local conditional densities defines a valid joint density where I(G) hold.

We now describe the recently introduced CBN model that fuses the BN and copula formalisms:

Definition 2.2: A Copula Bayesian Network (CBN) is a triplet C = (G, ΘC, Θf) that defines fX(x). G encodes the independencies (Xi ⊥ NonDesci | Pai), assumed to hold in fX(x). ΘC is a set of local copula functions Ci(F(xi), F(pai1), . . . , F(paiki)) that are associated with the nodes of G that have at least one parent. In addition, Θf is the set of parameters representing the marginal densities fi(xi) (and distributions Fi(xi)). The joint density fX(x) is defined as

fX(x) = ∏_{i=1}^{N} Rci(F(xi), F(pai1), . . . , F(paiki)) fi(xi),

where, if Xi has at least one parent in the graph G, the term Rci(F(xi), F(pai1), . . . , F(paiki)) is defined as

Rci(·) ≡ ci(F(xi), F(pai1), . . . , F(paiki)) / [∂^{ki} Ci(1, F(pai1), . . . , F(paiki)) / (∂F(pai1) · · · ∂F(paiki))].

When Xi has no parents in G, Rci(·) ≡ 1.

The term Rci(F(xi), F(pai1), . . . , F(paiki)) fi(xi) is always a valid conditional density, namely f(xi | pai), and can be easily computed. In particular, when the copula density c(·) has an explicit form, so does Rci(·), since it involves derivatives of a lesser order.

Elidan [2010] showed that a CBN defines a valid joint density so that, like other graphical models, a CBN takes advantage of the independence assumptions to represent fX(x) compactly. Differently, as in copulas, when the graph is tree structured, the univariate marginals of a CBN are exactly fi(xi). For more general structures, the marginals can be skewed, though only slightly so in practice. Empirically, CBNs offer significant advantages in terms of generalization ability (see Elidan [2010] for more details).

3 Dynamic Copula Bayesian Network

A dynamic Bayesian network (DBN) [Dean and Kanazawa, 1989] is a powerful probabilistic model that generalizes BNs, allowing us to represent and reason about the dynamics of complex structured distributions. Briefly, using a standard temporal Markovian assumption, a joint distribution is defined via a template Bayesian network that encodes the structured probability of the variables X^t at time t given other variables at time t, as well as those of the preceding time point t − 1. Like standard BNs, the representation is quite general in that it relies on a black-box representation of each variable given its parents in the model. See Figure 2 for an illustration.

However, when the domain is real-valued, computational considerations make the use of complex local representations infeasible. As noted, the overwhelming majority of continuous dynamic Bayesian networks rely on the linear Gaussian parameterization. Our goal is to overcome this limitation by building on the power of copulas. To do so, we generalize the recently introduced CBN model that fuses the copula and BN formalisms, in the same way that DBNs generalize BNs to capture temporal dynamics.


3.1 The DCBN Model

Let X be a set of template random variables and let X^(0:T) = [X^(0), X^(1), . . . , X^(T)] be a replication of this set for the different times 0, . . . , T. Like a DBN, a DCBN is used to represent a joint distribution over X^(0:T). Differently than a DBN, the representation relies on a copula-based building block. Formally,

Definition 3.1: A Dynamic Copula Bayesian Network (DCBN) is a 5-tuple DC = (G0, ΘC0, G→, ΘC→, Θf):

• G0 is a DAG that encodes independencies over X^(0) (as described in Section 2.2).

• ΘC0 are local copula densities ci(F(xi), {F(paik)}) defined over the random variables Xi ∈ X^(0) and their parents in G0.

• G→ is a DAG defined over a two-replication set of random variables X′ and X″. G→ is constrained so that no (backward) edges from variables in X″ to variables in X′ are allowed.

• ΘC→ are a set of local copula densities defined over variables Xi ∈ X″ only and their parents in G→, which may be in X″ as well as in X′.

• Θf are the parameters describing the univariate marginal distributions (and densities) of all the template random variables X.

Given a temporal sequence x^(0:T), the joint density of the entire temporal sequence is then defined as

fX^(0:T)(x^(0:T)) = fX^(0)(x^(0)) ∏_{t=0}^{T−1} fX^(t:t+1)(x^(t+1) | x^(t)),

where the initial density is defined as

fX^(0)(x^(0)) ≡ ∏_{i∈X} R^(0)ci(F(xi), {F(Pai^G0)}) fi(xi),

and the copula ratio R^(0)ci is as defined in Section 2.2, with the parameters θC0 and the assignment x^(0). The transition density is defined as

fX^(t:t+1)(x^(t+1) | x^(t)) ≡ ∏_{i∈X} Rci(F(xi^(t+1)), {F(Pai^G→)}) fi(xi^(t+1)),

where Rci is defined as above, but with the parameterization defined by θC→, which does not depend on t.

Thus, the DCBN model generalizes the CBN model to the temporal domain in that it allows us to separate the choice of the univariate marginals and that of the dependency structure:

Figure 2: An illustration of a dynamic structured probabilistic graphical model (DBN or DCBN). (a) Representation of G0; (b) representation of G→.

Lemma 3.1: Let X be a set of real-valued random variables. Given any positive univariate marginal densities θf, and any set of local copula functions θC0 and θC→, the DCBN model defined above represents a valid joint density over X^(0:T).

The proof is similar to the proof of Elidan [2010] for the CBN model. The modeling flexibility that the construction offers allows us to learn powerful structured temporal models. In particular, as we demonstrate in Section 4, the explicit control over the univariate marginals that the framework offers leads to quantitative and qualitative advantages.
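To make the factorization concrete, here is a minimal sketch (our illustration) of the DCBN log-density for the simplest case: a single template variable forming a chain, where X^(t+1) has the lone parent X^(t), the local copula is Gaussian, and the marginal is standard normal (so the copula ratio again reduces to the copula density):

```python
import math
from statistics import NormalDist

ND = NormalDist()

def log_gauss_copula_density(u, v, rho):
    """Log of the bivariate Gaussian copula density."""
    a, b = ND.inv_cdf(u), ND.inv_cdf(v)
    r2 = 1.0 - rho * rho
    return (2.0 * rho * a * b - rho * rho * (a * a + b * b)) / (2.0 * r2) - 0.5 * math.log(r2)

def dcbn_chain_loglik(seq, rho):
    """log f(x^(0:T)) for the chain: the initial marginal term, plus one
    copula-ratio term and one marginal term per transition."""
    ll = math.log(ND.pdf(seq[0]))  # initial density f_{X^(0)}
    for x_prev, x_next in zip(seq, seq[1:]):
        ll += log_gauss_copula_density(ND.cdf(x_next), ND.cdf(x_prev), rho)
        ll += math.log(ND.pdf(x_next))
    return ll
```

With rho = 0 the copula term vanishes and the expression collapses to the log-density of independent draws, which gives a quick sanity check of the factorization.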

3.2 Learning the Copula Parameters

As is standard for probabilistic graphical models, the decomposable form of the joint density defined by the DCBN model facilitates relatively efficient structure learning and estimation using standard machinery.

Given a complete training data set D of M sequences of length Tm each, the log-likelihood of the DCBN model DC can be written as

ℓ(D : DC) = ∑_{m=1}^{M} log( fX^(0:Tm)(x^(0:Tm)[m]) ),   (3)

where x^(0:Tm)[m] is used to denote the assignments to the variables in the mth sequence. As in the case of DBNs, Eq. (3) can (after some rearrangement of terms) be written as a decomposable sum over the families (variables and their parents) in G0 and G→.

The only complication, from an estimation perspective, is that the univariate marginals {Fi(xi)} are shared across all terms, hindering our ability to perform decomposable estimation. To overcome this difficulty, we adopt the common solution in the copula community


Figure 3: The (typical) cumulative distribution of one of the EMG sensor measurements (x-axis: signal level; y-axis: cumulative distribution). The solid red line shows the best Gaussian fit.

(with appealing theoretical and practical properties) of estimating the marginals first [Joe and Xu, 1996]. Given the marginals, the parameters of each local copula can be estimated independently of the others, using a closed form where possible (e.g., for the Gaussian copula) or, for example, a conjugate gradient procedure. In both cases, estimation is carried out by concatenating statistics across all time slices. See Koller and Friedman [2009] for details on how this is carried out in the context of standard DBNs.
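The two-stage (“marginals first”) recipe can be sketched as follows for a single Gaussian-copula edge; the rank-based marginal estimate and the moment-style correlation estimate are our simplifications of the procedure:

```python
import bisect
import math
import random
from statistics import NormalDist

ND = NormalDist()

def empirical_cdf(samples):
    """Stage one: a rank-based marginal CDF estimate, rescaled into (0, 1) by n + 1."""
    xs = sorted(samples)
    n = len(xs)
    return lambda x: bisect.bisect_right(xs, x) / (n + 1)

def fit_gauss_copula_rho(x, y):
    """Stage two: with the marginals fixed, estimate the Gaussian copula correlation
    as the correlation of the normal scores z = Phi^{-1}(F_hat(.))."""
    Fx, Fy = empirical_cdf(x), empirical_cdf(y)
    zx = [ND.inv_cdf(Fx(v)) for v in x]
    zy = [ND.inv_cdf(Fy(v)) for v in y]
    mx, my = sum(zx) / len(zx), sum(zy) / len(zy)
    sx = math.sqrt(sum((a - mx) ** 2 for a in zx))
    sy = math.sqrt(sum((b - my) ** 2 for b in zy))
    return sum((a - mx) * (b - my) for a, b in zip(zx, zy)) / (sx * sy)

# Synthetic check: correlated normals pushed through arbitrary monotone maps
# (which leave the underlying copula untouched).
rng = random.Random(1)
xs, ys = [], []
for _ in range(4000):
    z1 = rng.gauss(0.0, 1.0)
    z2 = 0.6 * z1 + 0.8 * rng.gauss(0.0, 1.0)
    xs.append(math.exp(z1))   # lognormal marginal
    ys.append(z2 ** 3)        # heavy-tailed marginal
rho_hat = fit_gauss_copula_rho(xs, ys)
```

Because the normal scores depend on the data only through ranks, the estimate is unaffected by the (possibly very non-Gaussian) shapes of the marginals.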

3.3 Structure Learning

Very briefly, to learn the structure of both G0 and G→, we use a standard greedy search that applies local structure modifications (e.g., add/delete/reverse edge) based on a model selection score. In this work we use the Bayesian Information Criterion (BIC) of Schwarz [1978] that, like other scores, balances the likelihood and the complexity of the model:

score(G : D) = ℓ(D : θ, G) − 0.5 log(K) |ΘG|,   (4)

where θ are the maximum-likelihood parameters, K is the number of samples, and |ΘG| is the number of free parameters associated with the graph structure G.

When learning a DCBN (and similarly for standard DBNs), the decomposition of the likelihood allows us to use the BIC score to learn G0 and G→ separately. When learning G0, K is the number of sequences, and when learning G→, K is the overall number of transitions in all sequences. To cope with local maxima during the search we also use a TABU list and random restarts [Glover and Laguna, 1993]. See Koller and Friedman [2009] for details on this standard learning procedure for temporal models.
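The flavor of this greedy, score-guided search can be illustrated with a deliberately simplified sketch that selects parents for a single variable using stagewise linear-Gaussian fits in place of copula terms (our simplification; the actual procedure rescores local copula families):

```python
import math
import random

def residual(y, x):
    """Residual of least-squares simple regression of y on x (with intercept)."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    beta = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return [b - my - beta * (a - mx) for a, b in zip(x, y)]

def gauss_loglik(r):
    """Maximized log-likelihood of a zero-mean Gaussian fit to residuals r."""
    n = len(r)
    var = max(sum(v * v for v in r) / n, 1e-12)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1.0)

def greedy_parents(child, candidates, max_parents):
    """Greedily add the parent whose (stagewise) fit most improves
    score = loglik - 0.5 * log(n) * |params|; stop when no addition helps."""
    n = len(child)
    chosen, r = [], [v - sum(child) / n for v in child]
    score = gauss_loglik(r) - 0.5 * math.log(n) * 2  # mean + variance
    while len(chosen) < max_parents:
        best = None
        for j, col in enumerate(candidates):
            if j in chosen:
                continue
            r2 = residual(r, col)
            s2 = gauss_loglik(r2) - 0.5 * math.log(n) * (2 + len(chosen) + 1)
            if s2 > score and (best is None or s2 > best[0]):
                best = (s2, j, r2)
        if best is None:
            break
        score, j, r = best
        chosen.append(j)
    return chosen

rng = random.Random(2)
x0 = [rng.gauss(0.0, 1.0) for _ in range(500)]
x1 = [rng.gauss(0.0, 1.0) for _ in range(500)]           # irrelevant candidate
child = [2.0 * a + 0.1 * rng.gauss(0.0, 1.0) for a in x0]
selected = greedy_parents(child, [x0, x1], 2)
```

The BIC penalty is what stops the search: adding the irrelevant candidate typically improves the likelihood by far less than the 0.5·log(K) cost of its parameter.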

3.4 Univariate Marginal Estimation

The central strength of the copula-based representation is that it allows us to separate the choice of the univariate marginal distributions and that of the dependency copula functions. Thus, we can use nonparametric marginal estimation which, in the univariate case, is typically extremely accurate and robust. Perhaps the most common choice is to use kernel density estimation with a Gaussian kernel (see, for example, [Bowman and Azzalini, 1997]).

Unfortunately, in some cases the Gaussian kernel may not be sufficiently powerful. For example, the distribution of the EMG activity measurements in the EMG dataset that we use in our experimental evaluation is quite heavy-tailed, as can be seen in Figure 3. A standard Gaussian (solid red line) is clearly a bad fit for such a distribution. Obviously, nonparametric kernel estimation with a Gaussian kernel is significantly more accurate than a naive Gaussian fit. However, it is still exponentially concentrated around the training data and, as confirmed in preliminary experiments, has poor generalization performance.

Instead, since in choosing the marginal distribution we are not constrained in any way, we use a simpler histogram representation that is robust to outliers. Concretely, let {xi}ⁿ₁ be a given set of training samples of the random variable Xi, and use L = min{xi} and R = max{xi} to denote the minimum and maximum values, respectively. We partition the interval [L, R] into K equal bins of width w = (R − L)/K. To account for outliers in the test data, we add a bin at [L − w, L] and one at [R, R + w], and assign a single pseudo sample to each. Using Nj to denote the number of samples that fall in the interval that corresponds to the jth bin, the density in that interval is then set according to Nj/(n + 2), and zero elsewhere.

4 Experimental Evaluation

In this section we demonstrate the merit of the DCBN model for capturing real-valued temporal phenomena in three markedly different real-life scenarios.

In principle, we can use any family of local copulas in the model and even mix different copula families without significant computational difficulty. However, for simplicity, and to emphasize the generic power of the DCBN model, in all the experiments below we instantiate the model with the simple and commonly used Gaussian copula defined in Eq. (2). To model the univariate marginal distribution of each variable, we use the nonparametric histogram described in Section 3.4, using 50 bins. We compare our DCBN model to two baselines: a temporal model where a full multivariate


Gaussian is used to model the conditional probability of all neurons at time t, given all neurons at time t − ∆ (tMVG); and a standard dynamic Bayesian network (DBN) with a linear Gaussian representation, so that each variable is modeled as a normal distribution centered around a linear combination of its parents, with a variable-dependent variance: Xi | XPari ∼ N(β0 + β·XPari, σi). The structure of both the DBN model and our DCBN model was learned using the same greedy structure search procedure guided by the Bayesian Information Criterion [Schwarz, 1978] score, as described in Section 3.3.

We consider the following datasets:

• Neural. Calcium concentration (a proxy for neural activity) of neurons in mouse auditory cortex loaded with a fluorescent calcium indicator. The dataset includes measurements of 5-17 neurons in 13 recording sessions, each with 24,000-57,000 time points. In each experiment, we corrected a DC drift in the signal by subtracting from each sequence a linear term. Specifically, for a sequence s(t) : t ∈ {1 . . . T}, denote by s̄ the least squares linear approximation of s. The output of our preprocessing is simply sout(t) = s(t) − s̄(t). Finally, due to the density of the signal, modeling neural activity at time t given time t − 1 is a trivial problem. Thus, for all models, during both train and test time, we use models where X(t) is conditioned on X(t − ∆). We use ∆ = 1/2 second, which is sensible biologically (results were similar for other values; see supplementary results).

• EMG. The EMG physical action dataset from the UCI repository. This data consists of EMG recordings of four subjects while performing ten normal and ten aggressive activities. An EMG apparatus with skin-surface electrodes placed on the subjects’ limbs was used to record ∼10,000 measurements in eight locations. Since different recordings can be of completely different scales (e.g., of different subjects), the measurements of each variable in each recording were centered and their variance was normalized to 1.

• DOW. Daily changes of the stocks comprising the Dow Jones 30 index. The data includes 1508 daily adjusted changes over a period of five years (2001-2005). To avoid arbitrary imputation, two stocks not included as part of the index in all of these days were excluded (KFT, TRV).

For all datasets, when splitting the data into train and test instances, we maintain temporal coherence: For the Neural and DOW datasets, the entire sequence of measurements was split into five consecutive segments of equal length. In each fold, one segment is held out for testing while the others are used for training. In the EMG domain, the natural division is by the subject carrying out the activity, so that in each experiment we train on the activity of three subjects and test on that of the fourth held-out subject.

4.1 Quantitative Likelihood Evaluation

We start with a quantitative evaluation of the different models on all three datasets in terms of log-probability (log-loss) per instance performance on held-out test data. Figure 4 (left) shows the performance of the models on the DOW dataset as a function of the maximal number of parents allowed for each variable during structure search. As a further point of reference, we also show the performance of a non-temporal multivariate Gaussian model (MVG). The benefit from temporal modeling, and the consistent advantage of our copula-based DCBN model over the DBN model across the entire range, is evident. Note that an advantage of 3 bits, for example, translates into each test instance being 2^3 = 8 times more likely, so that the advantages shown are substantial.

The superiority of our DCBN over the tMVG model is particularly striking when we consider the complexities of the two models. For N stocks, the tMVG model requires 2N(2N − 1)/2 + 2N parameters. In contrast, the DCBN model requires at most 2 × N × (K + 1) parameters, where K is the maximal number of parents allowed. Interestingly, our DCBN model is superior to tMVG even when K = 0, and the advantage grows substantially as more parents are allowed. While this may sound surprising at first, the explanation is quite simple. Intuitively, the nonparametric marginals provide an extremely accurate estimate of the univariate distribution, thus boosting the ability of the model to capture the overall joint distribution. At the same time, the parameters of the univariate marginals do not take part in the joint learning process. This is particularly appealing since learning can be carried out both efficiently and robustly.
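As a concrete check of these counts (taking N = 28 for the DOW data after the two exclusions, which is our reading of the setup):

```python
def tmvg_params(n_vars):
    """Full Gaussian over two slices of N variables:
    2N(2N - 1)/2 correlation entries plus 2N means."""
    two_n = 2 * n_vars
    return two_n * (two_n - 1) // 2 + two_n

def dcbn_params(n_vars, max_parents):
    """Upper bound for the DCBN: 2 * N * (K + 1) parameters."""
    return 2 * n_vars * (max_parents + 1)

print(tmvg_params(28), dcbn_params(28, 6))  # 1596 vs. 392
```

Even at the largest parent bound used in the experiments, the copula model remains roughly four times more parsimonious than the full temporal Gaussian.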

Figure 4 (left) is typical and is qualitatively similar for all 13 experiments of the Neural dataset and 18/20 of the EMG dataset experiments (see supplementary material). To get a broad view of the performance for these datasets across experiments, Figure 4 (center and right) compare the performance of our DCBN model to that of a standard DBN across all experiments and all model complexities (number of maximal parents allowed). To put all experiments on approximately the same scale, the units of both axes are in bits/instance improvement over the independence (no parent) Gaussian model. Impressively, in the overwhelming majority of cases our model leads to substantial gains in test set performance. Specifically, our model is superior in all of the 65 Neural experiments and in 55/60 of the EMG experiments.

Figure 4: (left) DOW dataset average test log-loss per instance of the different models over 5 folds (y-axis) as a function of the maximal number of parents allowed in the model. Compared are our copula-based DCBN model, a standard linear Gaussian DBN model, and a full multivariate Gaussian model (tMVG). Also shown is a non-temporal MVG model. The structure of both the DBN and DCBN models was learned using the same greedy procedure and the BIC score. (center) Neural and (right) EMG average log-loss per instance improvement relative to the full independent Gaussian model. Each point corresponds to a specific experiment and a specific bound on the number of parents in the network. The colors correspond to different experiments.

4.2 Qualitative Neuronal Assessment

As exemplified in Figure 1, the strength of the copula representation is that it can faithfully capture a complex distribution. Thus, aside from the quantitative advantage reported above, we would expect a copula-based model to better capture the underlying qualitative phenomenon. We now demonstrate that this is the case for the Neural data. Concretely, in contrast to a standard DBN, we show that our model allows us to "identify" the biologically meaningful spiking events.

We start by evaluating the likelihood of the models in the physiologically relevant periods around spiking events. To do so, we use the procedure described in Rothschild et al. [2010] to identify spike-evoked transients. Since our model is defined over all neurons at each time point, we consider firing of any of the neurons to be an event of interest. We can then measure performance (relative to the independent Gaussian model) as a function of the temporal distance from the event. A typical result is shown in Figure 6, where the advantage of our DCBN model over the DBN model is evident (similar results for the other 12 neuronal experiments are provided in the supplementary material).

Importantly, our DCBN model "identifies" the salient regions, although this information was not available at training time, and offers the greatest advantage in the vicinity of the event. The explanation is simple: near an event, the activity of neurons is more correlated and the advantage over an independent model manifests more clearly. In contrast, the DBN model is not able to capture the more complex distribution and, in many cases, degrades in performance in that region. This should not come as a surprise given the fact that the DBN model has Gaussian marginals that cannot capture heavy-tailed behavior.
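The event-aligned evaluation described above can be sketched as follows. The helper name `event_aligned_gain` and its inputs are hypothetical, assuming a precomputed per-timepoint log-loss improvement trace and a list of detected spike-event indices:

```python
import numpy as np

def event_aligned_gain(gain, events, window=500):
    """Average a per-timepoint log-loss gain trace in a window around events.

    gain:   1-D array of per-instance log-loss improvement over a baseline.
    events: indices of detected spike events.
    window: half-width of the window, in samples.
    Returns a profile indexed from -window to +window around the event.
    """
    segments = []
    for t in events:
        # keep only events whose full window lies inside the recording
        if t - window >= 0 and t + window < len(gain):
            segments.append(gain[t - window:t + window + 1])
    return np.mean(segments, axis=0)
```

Plotting such a profile for the DCBN and DBN gains would reproduce the qualitative shape of Figure 6.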

Our results also shed light on an ongoing biological debate [Schneidman et al., 2006, Ganmor et al., 2011]. It was suggested [Schneidman et al., 2006] that pairwise interactions dominate neuronal network dynamics and higher-order interactions account for only 10% of the overall information. We can use our DCBN model to evaluate the contribution of higher-degree interactions in terms of test likelihood performance. The result is summarized in Figure 5. As expected, the contribution of the first parent in the network (corresponding to pairwise interactions) is the largest and accounts for approximately 65% of the information gain. At the same time, the third- and fourth-order interactions alone contribute close to 30% of the gain. This suggests that higher-order interactions play an important role in neuronal activity. We leave further biological investigation of these findings to future work.
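The decomposition reported in Figure 5 can be computed from per-complexity log-loss improvements roughly as follows. The function name and the example numbers are illustrative, not the paper's actual measurements:

```python
import numpy as np

def incremental_gain_fractions(gains):
    """Fraction of the total information gain contributed by each added parent.

    gains[k] is the test log-loss improvement over the independent baseline
    when allowing up to k parents per variable (so gains[0] == 0). The k-th
    returned fraction is the extra gain of the k-th parent, normalized by
    the gain of the most complex model.
    """
    gains = np.asarray(gains, dtype=float)
    increments = np.diff(gains)      # extra gain from each additional parent
    return increments / gains[-1]    # normalize by the full model's gain
```

For instance, `incremental_gain_fractions([0, 6.5, 7.0, 8.5, 9.5, 10.0])` attributes 65% of the gain to the first (pairwise) parent, in the spirit of the measurement above.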

4.3 EMG - Discriminative Assessment

Figure 5: Log-loss/instance improvement of our DCBN model as a fraction of the information gain of a 5-parent model relative to the independence baseline. The lower bars (a) show the improvement when allowing one parent for each variable, corresponding to pairwise dependencies. The second layer (b) corresponds to the additional improvement when allowing up to two parents, and so on. The dotted line shows the first-level average across experiments.

The EMG dataset involves a variety of different activities, and naturally lends itself to the discriminative task of activity recognition. Given a test sequence, we simply predict the activity to be the one that corresponds to the model whose posterior probability is maximized:

    α* = argmax_α Pr(M_α | seq) ∝ Pr(seq | M_α),

where M_α is the model learned for the activity α. We similarly use the temporal linear Gaussian DBN as a generative baseline. We also compare to a discriminative baseline. Since there are only 4 recordings for every activity, we can only consider a simple kNN classifier (K=3), applied to first and second moments of the variables (we also tried kNN with a Fourier basis representation of the sequences, with worse results).
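Under a uniform prior over activities, the posterior-maximization rule above reduces to picking the model with the highest sequence log-likelihood. A minimal sketch, where `sequence_loglik` is a hypothetical wrapper around the learned model's log-likelihood computation:

```python
def classify_activity(sequence_loglik, models):
    """Return a classifier that picks the activity whose model assigns the
    test sequence the highest log-likelihood (uniform prior, so the
    posterior is proportional to the likelihood).

    sequence_loglik(model, seq) -> float   # assumed log-likelihood wrapper
    models: dict mapping activity label -> learned model
    """
    def predict(seq):
        scores = {a: sequence_loglik(m, seq) for a, m in models.items()}
        return max(scores, key=scores.get)
    return predict
```

The same rule applies unchanged whether the per-activity models are DCBNs or the DBN baseline; only the log-likelihood wrapper differs.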

When attempting to identify one of twenty activities, the performance of all methods is comparable and unimpressive, averaging around 30% (with a small advantage to our DCBN model). In this case, a model with no parents achieves similar performance. Intuitively, the similarity between some activities is simply too high for greater differences between the models to manifest. Indeed, the experimental setting in which the data was collected was described as including 10 aggressive and 10 normal activities. When we consider this coarser-grained binary classification problem, the differences between the models are more evident: while the nearest neighbor and the DBN baseline are able to achieve a solid 85% classification accuracy, our DCBN model offers an impressive 5% improvement in predictive accuracy.


Figure 6: Average test set improvement in log-loss/instance relative to the independent Gaussian model (y-axis) as a function of the distance from a spike event (x-axis) for the Neural dataset. Shown is the improvement of the DCBN and DBN models with up to 4 parents per variable.

5 Conclusions and Future Work

In this work we introduced Dynamic Copula Bayesian Networks (DCBN), a model aimed at coping with the challenge of modeling the joint temporal behavior of multiple real-valued variables. Using the statistical framework of copulas, our model allows us to separately choose the univariate marginal representation and the joint dependence function, thus generalizing the recently introduced copula BN model. Importantly, our model offers great flexibility in accurately capturing real-valued non-Gaussian temporal phenomena.

We applied our generic construction to three markedly different real-life domains. In all cases, our model offers consistent and significant quantitative advantages, even when using only straightforward univariate marginal histograms and the simple Gaussian copula. We also demonstrated the merit of the model in terms of its ability to qualitatively capture physically meaningful aspects of the domain.

More generally, the DCBN model allows for the mix-and-match of any univariate marginal representation and an expressive range of copula families without any computational difficulty. In contrast to temporal modeling using standard DBNs, this amounts to substantial modeling flexibility. Thus, our model opens the door for numerous novel applications in a variety of complex domains where, for computational reasons, only relatively simple (e.g., Gaussian) representations were explored.


Acknowledgements

The research was supported by the Gatsby Charitable Foundation, Intel Corporation, and the Israel Science Foundation. The authors thank Yair Weiss, Uri Heinemann, Daniel Zoran, and Ofer Meshi for useful comments in various stages of the preparation of this paper.

References

A. Bowman and A. Azzalini. Applied Smoothing Techniques for Data Analysis. Oxford University Press, 1997.

T. Dean and K. Kanazawa. A model for reasoning about persistence and causation. Computational Intelligence, 5:142–150, 1989.

G. Elidan. Copula Bayesian networks. Advances in Neural Information Processing Systems, 24, 2010.

E. Ganmor, R. Segev, and E. Schneidman. The architecture of functional interaction networks in the retina. The Journal of Neuroscience, 31(8):3044–3054, 2011.

F. Glover and M. Laguna. Tabu search. In C. Reeves, editor, Modern Heuristic Techniques for Combinatorial Problems, Oxford, England, 1993. Blackwell Scientific Publishing.

H. Joe and J. Xu. The estimation method of inference functions for margins for multivariate models. Technical Report 166, Department of Statistics, University of British Columbia, 1996.

D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. The MIT Press, 2009.

J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.

G. Rothschild, I. Nelken, and A. Mizrahi. Functional organization and population dynamics in the mouse primary auditory cortex. Nature Neuroscience, 13(3):353–360, 2010.

E. Schneidman, M. Berry, R. Segev, and W. Bialek. Weak pairwise correlations imply strongly correlated network states in a neural population. Nature, 440(7087):1007–1012, 2006.

G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461–464, 1978.

A. Sklar. Fonctions de répartition à n dimensions et leurs marges. Publications de l'Institut de Statistique de l'Université de Paris, 8:229–231, 1959.