
A Distributed Variational Inference Framework for Unifying Parallel Sparse Gaussian Process Regression Models

Trong Nghia Hoang§ idmhtn@nus.edu.sg
Quang Minh Hoang† hqminh@comp.nus.edu.sg
Bryan Kian Hsiang Low† lowkh@comp.nus.edu.sg
†Department of Computer Science, National University of Singapore, Republic of Singapore
§Interactive and Digital Media Institute, National University of Singapore, Republic of Singapore

Abstract

This paper presents a novel distributed variational inference framework that unifies many parallel sparse Gaussian process regression (SGPR) models for scalable hyperparameter learning with big data. To achieve this, our framework exploits a structure of correlated noise process model that represents the observation noises as a finite realization of a high-order Gaussian Markov random process. By varying the Markov order and covariance function for the noise process model, different variational SGPR models result. This consequently allows the correlation structure of the noise process model to be characterized for which a particular variational SGPR model is optimal. We empirically evaluate the predictive performance and scalability of the distributed variational SGPR models unified by our framework on two real-world datasets.

1. Introduction

The rich class of Bayesian non-parametric Gaussian process (GP) models has recently established itself as a leading approach to probabilistic non-linear regression due to its capability of representing highly complex correlation structure underlying the data. However, the full-rank GP regression (FGPR) model incurs a cost of cubic time in the data size for computing the predictive distribution and in each iteration of gradient ascent to refine the estimate of its hyperparameters to improve the log-marginal likelihood (Rasmussen & Williams, 2006), hence limiting its usage to only small datasets in practice.

To boost its scalability, a wealth of sparse GPR (SGPR) models (Lazaro-Gredilla et al., 2010; Low et al., 2015; Quinonero-Candela & Rasmussen, 2005; Snelson & Ghahramani, 2007) utilizing varying low-rank approximate representations of FGPR have been proposed, many of which are spanned by the unifying view of Quinonero-Candela & Rasmussen (2005) based on the notion of inducing variables (Section 3) such as the subset of regressors (SoR) (Smola & Bartlett, 2001), deterministic training conditional (DTC) (Seeger et al., 2003), fully independent training conditional (FITC) (Snelson & Ghahramani, 2005), fully independent conditional (FIC), partially independent training conditional (PITC) (Schwaighofer & Tresp, 2003), and partially independent conditional (PIC) (Snelson & Ghahramani, 2007) approximations. Consequently, they incur linear time in the data size, which is still prohibitively expensive for training with million-sized datasets. To scale up these SGPR models further for performing real-time predictions necessary in many time-critical applications and decision support systems (e.g., environmental sensing (Cao et al., 2013; Dolan et al., 2009; Ling et al., 2016; Low et al., 2008; 2009; 2011; 2012; Podnar et al., 2010; Zhang et al., 2016), traffic monitoring (Chen et al., 2012; 2013b; 2015; Hoang et al., 2014a;b; Low et al., 2014a;b; Ouyang et al., 2014; Xu et al., 2014; Yu et al., 2012)), a number of these models have been parallelized (e.g., FITC, FIC, PITC, and PIC (Chen et al., 2013a), and low-rank-cum-Markov approximation (LMA) (Low et al., 2015) unifying a spectrum of SGPR models with PIC and FGPR at the two extremes), but the resulting parallel SGPR models do not readily extend to include hyperparameter learning. The work of Deisenroth & Ng (2015) has recently introduced a practical product-of-expert (PoE) paradigm for GP which imposes a factorized structure on the marginal likelihood that allows it to be optimized effectively in a parallel/distributed fashion.

However, the main criticism of the above approximation paradigms is their lack of a rigorous approximation since they do not require optimizing some loss criterion incurred by an approximation model (Titsias, 2009b). To resolve this, the work of Titsias (2009a) has introduced an alternative formulation of variational inference for DTC approximation that involves minimizing the Kullback-Leibler (KL) distance between the variational DTC approximation and the posterior distribution of some latent variables (including the inducing variables) induced by the FGPR model given the data/observations, or equivalently maximizing a lower bound of the log-marginal likelihood. Hyperparameter learning can then be achieved by maximizing this variational lower bound as a function of the hyperparameters. Its incurred time per iteration of gradient ascent is still linear in the data size but can be significantly reduced by parallelization on multiple distributed machines/cores, as demonstrated by Gal et al. (2014). Despite their theoretical rigor and scalability, it has been shown by Hoang et al. (2015) that DTC has utilized the most crude approximation among all SGPR models (except SoR) spanned by the unifying view of Quinonero-Candela & Rasmussen (2005), thus severely compromising its predictive performance. As such, it remains an open question whether efficient and scalable hyperparameter learning of more refined SGPR models (e.g., PIC, LMA) for big data can be achieved through distributed variational inference1.

To address this question, we first observe that variational DTC (Titsias, 2009a) and its distributed variant (Gal et al., 2014) have implicitly assumed the observation noises to be independently distributed with constant variance, which is often violated in practice (Huizenga & Molenaar, 1995; Koochakzadeh et al., 2015; Rasmussen & Williams, 2006). This strong assumption has been relaxed slightly by Titsias (2009b) to that of input-dependent noise variance which allows a variational inference formulation for FITC approximation to be derived. This seems to suggest a possibility of deriving variational inference formulations for the more refined sparse GP approximations and, perhaps surprisingly, their distributed variants by exploiting more sophisticated noise process models such as those being used by existing GP works. Such GP works, however, suffer from poor scalability to big data: Notably, the work of Goldberg et al. (1997) has proposed a heteroscedastic GPR (HGPR) model that extends the FGPR model by representing the noise variance with a log-GP (in addition to the original GP modeling the noise-free latent measurements), hence allowing it to vary across the input space; the observation noises remain independently distributed though. But, the exact HGPR model cannot be computed tractably while approximate HGPR models (Kersting et al., 2007; Lazaro-Gredilla & Titsias, 2011) still incur cubic time in the data size, thus scaling poorly to big data (i.e., million-sized datasets). This is similarly true for FGPR models (Murray-Smith & Girard, 2001; Rasmussen & Williams, 2006) that represent correlation of observation noises with an additional covariance function. Unfortunately, variational DTC (Titsias, 2009a), its distributed variant (Gal et al., 2014), and variational FITC (Titsias, 2009b) cannot readily accommodate such heteroscedastic or correlated noise process models without sacrificing their time efficiency. So, the key challenge remains in being able to specify some structure of the noise process model that can be exploited for efficient and scalable hyperparameter learning of more refined SGPR models (e.g., PITC, PIC, LMA) through distributed variational inference, which is the focus of our work here.

1 The work of Campbell et al. (2015) has separately developed a distributed variational inference framework for Bayesian non-parametric models that are limited to only clustering processes (e.g., Dirichlet, Pitman-Yor, and their variants) not including GPs.

To tackle this challenge, this paper presents a novel variational inference framework (Section 3) for deriving sparse GP approximations to a new FGPR model with observation noises that vary as a finite realization of a high-order Gaussian Markov random process (Section 2), thus enriching the expressiveness of HGPR models by correlating the noises across the input space. Interestingly, our proposed framework can unify many SGPR models via specific choices of the Markov order and covariance function for the noise process model (Section 4), which include variational DTC and FITC (Titsias, 2009a;b) and the more refined PITC, PIC, and LMA. This then enables the characterization of the correlation structure of the noise process model for which a particular sparse GP approximation is (variationally) optimal2 and explains why PIC and LMA tend to outperform DTC and FITC in practice despite not being characterized as optimal when independently distributed observation noises are assumed. More importantly, our framework is amenable to parallelization by distributing its computational load of hyperparameter learning on multiple machines/cores (Section 5), hence reducing its incurred linear time per iteration of gradient ascent by a factor close to the number of machines/cores. We empirically evaluate the predictive performance and scalability of the distributed variational SGPR models (e.g., state-of-the-art distributed variational DTC (Gal et al., 2014)) unified by our framework on two real-world datasets (Section 6).

2 Such a characterization, which is important to many real-world applications of GP involving different noise structures, cannot be realized from the unifying framework of Hoang et al. (2015) relying on reverse variational inference to obtain the variational lower bound for a SGPR model. Furthermore, it is unclear or at least non-trivial to determine whether it is amenable to parallelization for learning the hyperparameters of LMA which does not meet its assumed decomposability conditions.

2. Gaussian Processes with Correlated Noises

Let $X$ be a set representing the input domain such that each $d$-dimensional input feature vector $x \in X$ is associated with a latent output variable $f_x$ and its corresponding noisy output $y_x \triangleq f_x + \epsilon_x$ differing by an additive noise $\epsilon_x$. Let $\{f_x\}_{x \in X}$ denote a Gaussian process (GP), that is, every finite subset of $\{f_x\}_{x \in X}$ follows a multivariate Gaussian distribution. Then, the GP is fully specified by its prior mean $\mu_x \triangleq \mathbb{E}[f_x]$ and covariance $k_{xx'} \triangleq \mathrm{cov}[f_x, f_{x'}]$ for all $x, x' \in X$, the latter of which can be defined, for example, by the widely-used squared exponential covariance function $k_{xx'} \triangleq \sigma_s^2 \exp(-0.5\,(x - x')^\top \Lambda^{-2} (x - x'))$ where $\Lambda \triangleq \mathrm{diag}[\ell_1, \ldots, \ell_d]$ and $\sigma_s^2$ are its length-scale and signal variance hyperparameters, respectively. Similarly, let $\{\epsilon_x\}_{x \in X}$ denote another GP with prior mean $\mathbb{E}[\epsilon_x] = 0$ and covariance $k^\epsilon_{xx'} \triangleq \mathrm{cov}[\epsilon_x, \epsilon_{x'}]$ for all $x, x' \in X$, the latter of which is defined by a covariance function like that used for $k_{xx'}$ (albeit with different hyperparameter values).

Supposing a column vector $y_D \triangleq (y_x)^\top_{x \in D}$ of noisy outputs is observed for some set $D \subset X$ of training inputs, a FGPR model with correlated observation noises (Murray-Smith & Girard, 2001; Rasmussen & Williams, 2006) can perform probabilistic regression by providing a GP posterior/predictive distribution

$$p(f_U \mid y_D) = \mathcal{N}\big(f_U \mid \mu_U + K_{UD}(K_{DD} + S_{DD})^{-1}(y_D - \mu_D),\; K_{UU} - K_{UD}(K_{DD} + S_{DD})^{-1}K_{DU}\big)$$

of the unobserved outputs $f_U \triangleq (f_x)^\top_{x \in U}$ for any set $U \subset X \setminus D$ of test inputs where $\mu_U$ ($\mu_D$) is a column vector with mean components $\mu_x$ for all $x \in U$ ($x \in D$), $K_{UD}$ ($K_{DD}$) is a matrix with covariance components $k_{xx'}$ for all $x \in U, x' \in D$ ($x, x' \in D$), $K_{DU} = K_{UD}^\top$, and $S_{DD}$ is a matrix with covariance components $k^\epsilon_{xx'}$ for all $x, x' \in D$ representing the correlation of observation noises $\epsilon_D \triangleq (\epsilon_x)^\top_{x \in D} \sim \mathcal{N}(0, S_{DD})$, which implies $p(y_D \mid f_D) = \mathcal{N}(y_D \mid f_D, S_{DD})$. However, the FGPR model scales poorly in the size $|D|$ of data because computing the GP predictive distribution incurs $\mathcal{O}(|D|^3)$ time due to the inversion of $K_{DD} + S_{DD}$.
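As a concrete illustration of this predictive distribution, the following Python/NumPy sketch (ours, not part of the paper) evaluates it for a toy one-dimensional problem with a zero prior mean; the squared exponential hyperparameters and the noise covariance $S_{DD}$ below are arbitrary placeholder choices.

```python
import numpy as np

def se_kernel(A, B, signal_var=1.0, lengthscale=0.5):
    """Squared exponential covariance k_xx' = s^2 exp(-0.5 (x - x')^T L^{-2} (x - x'))."""
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return signal_var * np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 5, size=(50, 1))                      # training inputs D
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.standard_normal(50)
X_test = np.linspace(0, 5, 20)[:, None]                        # test inputs U

K_DD = se_kernel(X_train, X_train)
K_UD = se_kernel(X_test, X_train)
K_UU = se_kernel(X_test, X_test)

# Correlated noise covariance S_DD: a separate SE kernel (standing in for k^eps
# with its own hyperparameters) plus a small jitter on the diagonal.
S_DD = se_kernel(X_train, X_train, signal_var=0.05, lengthscale=0.2) + 1e-3 * np.eye(50)

# FGPR predictive distribution with zero prior mean; the O(|D|^3) cost comes
# from solving against K_DD + S_DD.
A = np.linalg.solve(K_DD + S_DD, np.eye(50))
mean_U = K_UD @ A @ y_train                                    # predictive mean
cov_U = K_UU - K_UD @ A @ K_UD.T                               # predictive covariance
```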

To improve its scalability, our key idea stems from imposing a $B$-th order Markov property on the observation noise process: Specifically, let the set $D$ ($U$) of training (test) inputs be partitioned evenly into $M$ disjoint subsets $D_1, D_2, \ldots, D_M$ ($U_1, U_2, \ldots, U_M$). In the same spirit as a Gaussian Markov random process, imposing a $B$-th order Markov property on the observation noise process $\{\epsilon_x\}_{x \in D}$ with respect to such a partition implies that the observation noises $\epsilon_{D_i}$ and $\epsilon_{D_j}$ are conditionally independent given $\epsilon_{D \setminus (D_i \cup D_j)}$ if $|i - j| > B$. As shown in (Low et al., 2015), this can be realized by partitioning the matrix $S_{VV}$ for the entire set $V \triangleq D \cup U$ of training and test inputs into $M \times M$ square blocks, that is, $S_{VV} \triangleq [S_{V_i V_j}]_{i,j=1,\ldots,M}$ where $V_i \triangleq D_i \cup U_i$ (hence, $V = \bigcup_{i=1}^M V_i$) and

$$S_{V_i V_j} \triangleq \begin{cases} K^\epsilon_{V_i V_j} & \text{if } |i - j| \le B, \\ K^\epsilon_{V_i D_i^B} \big(K^\epsilon_{D_i^B D_i^B}\big)^{-1} S_{D_i^B V_j} & \text{if } j - i > B > 0, \\ S_{V_i D_j^B} \big(K^\epsilon_{D_j^B D_j^B}\big)^{-1} K^\epsilon_{D_j^B V_j} & \text{if } i - j > B > 0, \\ 0 & \text{if } |i - j| > B = 0; \end{cases} \quad (1)$$

such that $K^\epsilon_{V_i V_j} \triangleq [k^\epsilon_{xx'}]_{x \in V_i, x' \in V_j}$, $D_i^B \triangleq D_{i+1} \cup \ldots \cup D_{\overline{i+B}}$ where $\overline{i+B} \triangleq \min(M, i+B)$, and $0$ denotes a square block comprising components of value $0$. Though the matrix $S_{VV}$ (and hence its submatrix $S_{DD}$) is dense, its structure (1) interestingly guarantees that $P_{DD} \triangleq S_{DD}^{-1}$ is $B$-block-banded (Low et al., 2015). That is, if $|i - j| > B$, then $P_{D_i D_j} = 0$. Such sparsity of $P_{DD}$ is the main ingredient to improving the scalability of the FGPR model with correlated observation noises, as revealed in Section 3.
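To make the banded-precision property concrete, the sketch below (ours, under assumed placeholder hyperparameters) builds $S_{DD}$ via the training-only case of (1) with $B = 1$ and $M = 4$ blocks, using a squared exponential $k^\epsilon$, and then checks numerically that the off-band blocks of $P_{DD} = S_{DD}^{-1}$ vanish.

```python
import numpy as np

def se_kernel(A, B, var=1.0, ls=0.4):
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return var * np.exp(-0.5 * d2 / ls ** 2)

rng = np.random.default_rng(1)
M, n, B = 4, 5, 1                               # M blocks of n training inputs, Markov order B
parts = [rng.uniform(0, 10, size=(n, 1)) for _ in range(M)]
Keps = [[se_kernel(parts[i], parts[j]) + (1e-2 * np.eye(n) if i == j else 0)
         for j in range(M)] for i in range(M)]  # blocks of K^eps (with diagonal jitter)

# Build S_DD block by block following (1) restricted to training blocks: for B = 1,
# D_i^B = D_{i+1}, and blocks with j - i > B reuse already-built blocks of S.
S = [[None] * M for _ in range(M)]
for i in range(M - 1, -1, -1):
    for j in range(i, M):
        if j - i <= B:
            S[i][j] = Keps[i][j]
        else:                                    # the j - i > B > 0 case of (1)
            S[i][j] = Keps[i][i + 1] @ np.linalg.solve(Keps[i + 1][i + 1], S[i + 1][j])
        S[j][i] = S[i][j].T                      # the i - j > B case is its transpose

S_DD = np.block(S)
P_DD = np.linalg.inv(S_DD)

# Every block of P_DD outside the band (|i - j| > B) should be zero up to round-off.
off_band = max(np.max(np.abs(P_DD[i * n:(i + 1) * n, j * n:(j + 1) * n]))
               for i in range(M) for j in range(M) if abs(i - j) > B)
print("largest off-band entry of P_DD:", off_band)   # expected to be at round-off level
```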

3. Variational Sparse GP Regression Models

This section introduces a novel variational inference framework for deriving sparse GP approximations to the FGPR model with correlated observation noises that vary as a finite realization of a $B$-th order Gaussian Markov random process (Section 2). Similar to the variational inference formulations for DTC and FITC approximations (Titsias, 2009a;b), we exploit a vector $f_S$ of inducing output variables for some small set $S \subset X$ of inducing inputs to construct sufficient statistics for $y_D$ such that $y_D \perp f_D \mid f_S$ (i.e., $p(f_D, f_S \mid y_D) = p(f_D \mid f_S)\, p(f_S \mid y_D)$). However, choosing $S$ for which this conditional independence property holds may not be possible in practice. Instead, it is approximated by some (heuristic) choice of $S$ as follows:

$$p(f_D, f_S \mid y_D) \simeq p(f_D \mid f_S)\, q(f_S) \quad (2)$$

where $q(f_S)$ is a free-form distribution to be optimized by minimizing the KL distance $D_{\mathrm{KL}}(q) \triangleq \mathrm{KL}\big(p(f_D \mid f_S)\, q(f_S) \,\|\, p(f_D, f_S \mid y_D)\big)$ between $p(f_D \mid f_S)\, q(f_S)$ and $p(f_D, f_S \mid y_D)$ on either side of (2) with respect to $q(f_S)$ and the hyperparameters defining the prior covariances $k_{xx'}$ and $k^\epsilon_{xx'}$ (Section 2). To do this, note that

$$\log p(y_D) = \mathcal{L}(q, Z) + D_{\mathrm{KL}}(q) \quad (3)$$

where $Z$ denotes a set of hyperparameters3 defining $k_{xx'}$ and $k^\epsilon_{xx'}$ and, as derived in Appendix A,

$$\mathcal{L}(q, Z) \triangleq \mathbb{E}_{q(f_S)}\big[\log \mathcal{N}(y_D \mid \nu, S_{DD}) + \log(p(f_S)/q(f_S))\big] - 0.5\,\mathrm{Tr}\big[S_{DD}^{-1}(K_{DD} - Q_{DD})\big] \quad (4)$$

where $\nu \triangleq \mu_D + K_{DS} K_{SS}^{-1}(f_S - \mu_S)$ and $Q_{YY'} \triangleq K_{YS} K_{SS}^{-1} K_{SY'}$ for all $Y, Y' \subset X$. Supposing $Z$ is fixed/known, since $D_{\mathrm{KL}}(q) \ge 0$ and $\log p(y_D)$ is independent of $q(f_S)$, minimizing $D_{\mathrm{KL}}(q)$ is equivalent to maximizing $\mathcal{L}(q, Z)$, which entails solving for the optimal choice of $q(f_S)$ that satisfies $\partial \mathcal{L}(q, Z)/\partial q = 0$ as a PDF parameterized by $Z$. As derived in Appendix B, this yields

$$\log q(f_S) = \log\big[\mathcal{N}(y_D \mid \nu, S_{DD})\, p(f_S)\big] + \mathrm{const} \quad (5)$$

where const is used to absorb all terms independent of $f_S$. By completing the quadratic form for $f_S$ (Appendix C), it can be derived from (5) that

$$q(f_S) = \mathcal{N}\big(f_S \mid \mu_S + K_{SS} \Phi_{SS}^{-1} V_{SD},\; K_{SS} \Phi_{SS}^{-1} K_{SS}\big) \quad (6)$$

3 All the terms in (3) and the equations to follow depend on $Z$. To ease notational clutter, their dependence on $Z$ is omitted from their expressions, unless otherwise needed for clarity reasons.

where $\Phi_{SS} \triangleq K_{SS} + K_{SD} P_{DD} K_{DS}$, $P_{DD} = S_{DD}^{-1}$, and $V_{SD} \triangleq K_{SD} P_{DD}(y_D - \mu_D)$. By plugging (6) into (4), $R(Z) \triangleq \max_q \mathcal{L}(q, Z)$ can be derived analytically. Then, maximizing $R(Z)$ with respect to $Z$ gives the optimal hyperparameters4. The details of this maximization are deferred to Section 5 for the sake of clarity. In the rest of this section, we will instead focus on deriving an equivalent reformulation of $q(f_S)$ (6) that can be computed efficiently in linear time in the data size $|D|$ since computing $q(f_S)$ directly using (6) generally incurs $\mathcal{O}(|D|^3)$ time. Moreover, its computation can be distributed among parallel machines/cores to achieve scalability, as described next.

4 The formulation described thus far is reminiscent of variational expectation-maximization (EM) (Wainwright & Jordan, 2008) whose approximate E step involves maximizing the variational lower bound $\mathcal{L}(q, Z)$ with respect to $q$ given the optimal $Z$ from the M step while its M step follows exactly that of EM by maximizing $\mathcal{L}(q, Z)$ with respect to $Z$ given the optimal $q$ from the E step. The implication is that, like variational EM, variational DTC and FITC (Titsias, 2009a;b), and distributed variational DTC (Gal et al., 2014), though the log-marginal likelihood $\log p(y_D)$ does not necessarily increase in each iteration, our formulation has an attractive interpretation of maximizing its lower bound.
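For intuition, the following Python/NumPy sketch (ours; placeholder data, zero prior mean, and, for simplicity, the homoscedastic noise model $S_{DD} = \sigma_n^2 I$) evaluates $q(f_S)$ in (6) directly via $\Phi_{SS}$ and $V_{SD}$, i.e., the centralized route whose cost the rest of this section and Section 5 avoid.

```python
import numpy as np

def se_kernel(A, B, var=1.0, ls=0.5):
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return var * np.exp(-0.5 * d2 / ls ** 2)

rng = np.random.default_rng(2)
X_D = rng.uniform(0, 5, size=(200, 1))                  # training inputs D
y_D = np.sin(X_D[:, 0]) + 0.1 * rng.standard_normal(200)
X_S = np.linspace(0, 5, 10)[:, None]                    # inducing inputs S (heuristic grid)

K_SS = se_kernel(X_S, X_S) + 1e-6 * np.eye(10)
K_SD = se_kernel(X_S, X_D)
S_DD = 0.01 * np.eye(200)                               # simplest choice of noise covariance

P_DD = np.linalg.inv(S_DD)                              # P_DD = S_DD^{-1}
Phi_SS = K_SS + K_SD @ P_DD @ K_SD.T                    # Phi_SS = K_SS + K_SD P_DD K_DS
V_SD = K_SD @ P_DD @ y_D                                # V_SD = K_SD P_DD (y_D - mu_D), mu_D = 0
q_mean = K_SS @ np.linalg.solve(Phi_SS, V_SD)           # mu_S + K_SS Phi_SS^{-1} V_SD, mu_S = 0
q_cov = K_SS @ np.linalg.solve(Phi_SS, K_SS)            # K_SS Phi_SS^{-1} K_SS
```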

Our first result (Lemma 1) below derives the structure of each non-zero constituent block $P_{D_i D_j}$ of the $B$-block-banded matrix $P_{DD}$ (i.e., $|i - j| \le B$), which is the key ingredient to be exploited for computing $q(f_S)$ (6) efficiently and scalably (Theorem 1). Note that it suffices to simply compute $P_{D_i D_j}$ for $i \le j \le \overline{i+B}$ because $P_{DD}$ is symmetric (i.e., $P_{D_i D_j} = P_{D_j D_i}^\top$ for $j \le i \le \overline{j+B}$) and $B$-block-banded (i.e., $P_{D_i D_j} = 0$ for $|i - j| > B$) (Section 2):

Lemma 1  Let5

$$G^k \triangleq \begin{bmatrix} G^k_{D_k D_k} & G^k_{D_k D_k^B} \\ G^k_{D_k^B D_k} & G^k_{D_k^B D_k^B} \end{bmatrix} \triangleq \begin{bmatrix} S_{D_k D_k} & S_{D_k D_k^B} \\ S_{D_k^B D_k} & S_{D_k^B D_k^B} \end{bmatrix}^{-1}$$

for $k = 1, \ldots, M$. Then, for $i, j = 1, \ldots, M$ such that $i \le j \le \overline{i+B} = \min(M, i+B)$ and $\underline{j-B} \triangleq \max(1, j - B)$,

$$P_{D_i D_j} = \sum_{k=\underline{j-B}}^{i} G^k_{D_i D_k} \big(G^k_{D_k D_k}\big)^{-1} G^k_{D_k D_j}. \quad (7)$$

See Appendix D for its proof. From Lemma 1, constructing $G^k$ of size $\mathcal{O}(B|D|/M) = \mathcal{O}(B|S|)$6 by $\mathcal{O}(B|S|)$ incurs $\mathcal{O}(B^3|S|^3)$ time. $G^1, \ldots, G^M$ can be computed independently and hence in parallel on $M$ distributed machines/cores; otherwise, they can be constructed sequentially in $\mathcal{O}(|D||S|^2 B^3) = \mathcal{O}(MB^3|S|^3)$ time.

5 When $B = 0$, $G^k = S_{D_k D_k}^{-1}$ for $k = 1, \ldots, M$.

6 We choose the size of $S$ such that $|D_i| = |D|/M = \mathcal{O}(|S|)$.

Our next result exploits Lemma 1 and the $B$-block-banded structure of $P_{DD}$ for decomposing $\Phi_{SS}$ and $V_{SD}$ in (6) into linear sums of independent terms whose computations can therefore be distributed among parallel machines/cores:

Theorem 1  Let $B^+(k) \triangleq \{k, k+1, \ldots, \overline{k+B}\}$. Then, $\Phi_{SS} = K_{SS} + \sum_{k=1}^M \beta_k$ and $V_{SD} = \sum_{k=1}^M \alpha_k$ where

$$\alpha_k \triangleq \sum_{i,j \in B^+(k)} K_{SD_i}\, G^k_{D_i D_k} \big(G^k_{D_k D_k}\big)^{-1} G^k_{D_k D_j}\, (y_{D_j} - \mu_{D_j}), \qquad \beta_k \triangleq \sum_{i,j \in B^+(k)} K_{SD_i}\, G^k_{D_i D_k} \big(G^k_{D_k D_k}\big)^{-1} G^k_{D_k D_j}\, K_{D_j S}. \quad (8)$$

Its proof is in Appendix E. From (8), $\alpha_1, \ldots, \alpha_M$ and $\beta_1, \ldots, \beta_M$ can again be computed independently and hence in parallel in $\mathcal{O}(|B^+(k)|^2|S|^3) = \mathcal{O}(B^2|S|^3)$ time on $M$ distributed machines/cores; alternatively, they can be computed sequentially in $\mathcal{O}(|D||S|^2 B^2) = \mathcal{O}(MB^2|S|^3)$ time. $\Phi_{SS}$ and $V_{SD}$ can then be constructed in $\mathcal{O}(M|S|^2)$ time and computing $q(f_S)$ (6) in turn incurs $\mathcal{O}(|S|^3)$ time. So, the overall time complexity of deriving $q(f_S)$ (6) is linear in $|D|$, which can be further reduced via parallelization by a factor close to the number of machines/cores.
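The decomposition in Theorem 1 is easiest to see for $B = 0$, where footnote 5 gives $G^k = S_{D_k D_k}^{-1}$ and $B^+(k) = \{k\}$, so each machine only touches its own block. The sketch below (ours; placeholder kernels and data, zero prior mean, a sequential loop standing in for $M$ machines) computes the local terms $(\alpha_k, \beta_k)$ of (8) for this case and assembles $q(f_S)$ from their sums; for $B > 0$, each machine would additionally need the blocks of its Markov cluster.

```python
import numpy as np

def se_kernel(A, B, var=1.0, ls=0.5):
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return var * np.exp(-0.5 * d2 / ls ** 2)

rng = np.random.default_rng(3)
M, n = 8, 25                                    # M data blocks of n points each (B = 0)
X_blocks = [rng.uniform(0, 5, size=(n, 1)) for _ in range(M)]
y_blocks = [np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n) for X in X_blocks]
X_S = np.linspace(0, 5, 10)[:, None]
K_SS = se_kernel(X_S, X_S) + 1e-6 * np.eye(10)

def local_summary(X_k, y_k):
    """Work done by machine k for B = 0: (8) collapses to a single (i, j) = (k, k) term."""
    K_SDk = se_kernel(X_S, X_k)
    S_kk = se_kernel(X_k, X_k, var=0.05, ls=0.2) + 1e-2 * np.eye(n)   # placeholder k^eps block
    tmp = np.linalg.solve(S_kk, np.column_stack([y_k, K_SDk.T]))
    alpha_k = K_SDk @ tmp[:, 0]                 # K_SDk S_kk^{-1} (y_Dk - mu_Dk)
    beta_k = K_SDk @ tmp[:, 1:]                 # K_SDk S_kk^{-1} K_DkS
    return alpha_k, beta_k

summaries = [local_summary(X_k, y_k) for X_k, y_k in zip(X_blocks, y_blocks)]
V_SD = sum(a for a, _ in summaries)             # V_SD = sum_k alpha_k
Phi_SS = K_SS + sum(b for _, b in summaries)    # Phi_SS = K_SS + sum_k beta_k
q_mean = K_SS @ np.linalg.solve(Phi_SS, V_SD)   # mean of q(f_S) from (6), zero prior mean
q_cov = K_SS @ np.linalg.solve(Phi_SS, K_SS)
```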

Remark. The efficiency in deriving $q(f_S)$ (6) can be exploited for constructing a linear-time approximation $q(f_{U_i} \mid y_D)$ to the GP predictive distribution $p(f_{U_i} \mid y_D)$ for any subset $U_i = V_i \setminus D_i$ of test inputs and its distributed variant, the details of which are not needed to understand our distributed variational inference framework for hyperparameter learning in Section 5 (but required for predictions in our experiments in Section 6) and hence deferred to Appendices F and G, respectively. By varying the choices of the Markov order $B$, the covariance function $k^\epsilon_{xx'}$ for the noise process model, the number $M$ of data partitions, and the approximation $q(f_{U_i} \mid f_S, y_D)$ to the test conditional $p(f_{U_i} \mid f_S, y_D)$, they recover the predictive distribution of various SGPR models spanned by the unifying view of Quinonero-Candela & Rasmussen (2005) (i.e., SoR, DTC, FITC, FIC, PITC, PIC) and LMA, as detailed in Section 4.

4. Unifying Sparse GP Regression Models

4.1. SGPR models spanned by unifying view of Quinonero-Candela & Rasmussen (2005)

We will first demonstrate how our variational inference framework (Section 3) can unify SoR, DTC, FITC, FIC, PITC, and PIC. As discussed in Appendix F, it suffices to show how the covariance function $k^\epsilon_{xx'}$ for the noise process model and the Markov order $B$ can be set such that the resulting variationally optimal distribution $q(f_S)$ (6) coincides with that of these SGPR models. Let us consider the following covariance function for the noise process model:

$$k^\epsilon_{xx'} \triangleq k_{xx'} - K_{xS} K_{SS}^{-1} K_{Sx'} + \sigma_n^2\, \mathbb{I}(x = x') \quad (9)$$

where $\sigma_n^2$ is a noise variance hyperparameter. It follows that $K^\epsilon_{DD} = K_{DD} - K_{DS} K_{SS}^{-1} K_{SD} + \sigma_n^2 I = K_{DD} - Q_{DD} + \sigma_n^2 I$. Then, imposing the $B$-th order Markov property on the observation noise process $\{\epsilon_x\}_{x \in D}$ through (1) with $B = 0$ gives $S_{DD} = \mathrm{blkdiag}[K_{DD} - Q_{DD}] + \sigma_n^2 I$, which implies $S_{D_i D_j} = \mathbb{I}(i = j)\,(K_{D_i D_i} - Q_{D_i D_i} + \sigma_n^2 I)$. Plugging this expression of $S_{DD}$ into (6) yields the same $q(f_S)$ induced by PIC, as detailed in (Hoang et al., 2015)7. Furthermore, if $|D_i| = 1$ for $i = 1, \ldots, M$ (i.e., $M = |D|$), then $S_{DD} = \mathrm{diag}[K_{DD} - Q_{DD}] + \sigma_n^2 I$, which can be plugged into (6) to recover the same $q(f_S)$ induced by FIC. Supposing $k^\epsilon_{xx'} \triangleq \sigma_n^2\, \mathbb{I}(x = x')$ instead of using (9), $S_{DD} = \sigma_n^2 I$, which can be plugged into (6) to recover the same $q(f_S)$ induced by DTC. Finally, the predictive distributions $q(f_{U_i} \mid y_D)$ of the SGPR models spanned by the unifying view of Quinonero-Candela & Rasmussen (2005) such as PIC, FIC, and DTC can be recovered by integrating the resulting $q(f_S)$ with their approximations $q(f_{U_i} \mid f_S, y_D)$ to the test conditional $p(f_{U_i} \mid f_S, y_D)$, as discussed in Appendix F. We omit details of unifying PITC, FITC, and SoR with our framework since they induce the same $q(f_S)$ as that of PIC, FIC, and DTC, respectively.

7 The work of Hoang et al. (2015) assumes a GP with zero prior mean, which is equivalent to setting $\mu_S = \mu_D = 0$ in (6).
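The sketch below (ours; arbitrary placeholder hyperparameters, contiguous equal-sized blocks) makes these three noise covariances explicit: plugging each resulting $S_{DD}$ into (6), as in the earlier sketch for (6), would yield the $q(f_S)$ of DTC, FIC, and PIC, respectively.

```python
import numpy as np

def se_kernel(A, B, var=1.0, ls=0.5):
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return var * np.exp(-0.5 * d2 / ls ** 2)

rng = np.random.default_rng(4)
M, n = 4, 20                                     # M contiguous blocks of n training inputs
X_D = rng.uniform(0, 5, size=(M * n, 1))
X_S = np.linspace(0, 5, 8)[:, None]
sigma_n2 = 0.01

K_DD = se_kernel(X_D, X_D)
K_SD = se_kernel(X_S, X_D)
K_SS = se_kernel(X_S, X_S) + 1e-6 * np.eye(8)
Q_DD = K_SD.T @ np.linalg.solve(K_SS, K_SD)      # Q_DD = K_DS K_SS^{-1} K_SD
R = K_DD - Q_DD                                  # residual part of (9) before the noise term

S_dtc = sigma_n2 * np.eye(M * n)                 # k^eps = sigma_n^2 I(x = x')        -> DTC
S_fic = np.diag(np.diag(R)) + S_dtc              # diag[K_DD - Q_DD] + sigma_n^2 I    -> FIC
S_pic = np.zeros_like(R) + S_dtc                 # blkdiag[K_DD - Q_DD] + sigma_n^2 I -> PIC
for k in range(M):
    s = slice(k * n, (k + 1) * n)
    S_pic[s, s] += R[s, s]
```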

4.2. Low-rank-cum-Markov approximation (LMA)

To unify LMA with our variational inference framework, we have to derive its induced $q(f_S)$ since it is not explicitly given in (Low et al., 2015). To do this, let us first derive an equivalent reformulation of the predictive distribution of the FGPR model with independently distributed observation noises of constant variance, as shown in Appendix H: $p(f_S \mid y_D) \triangleq \mathcal{N}(f_S \mid \mu_S + K_{SS} \Phi_{SS}^{-1} V_{SD},\; K_{SS} \Phi_{SS}^{-1} K_{SS})$ where $\Phi_{SS} \triangleq K_{SS} + K_{SD} R_{DD}^{-1} K_{DS}$ and $V_{SD} = K_{SD} R_{DD}^{-1}(y_D - \mu_D)$ such that $R_{DD} \triangleq K_{DD} - Q_{DD} + \sigma_n^2 I$. But, computing $p(f_S \mid y_D)$ generally incurs $\mathcal{O}(|D|^3)$ time. To reduce this to linear time and achieve further scalability via parallelization, the work of Low et al. (2015) has proposed LMA by approximating $R_{DD}$ with $S_{DD}$ (1) based on the covariance function in (9), which interestingly induces the same $q(f_S)$ and predictive distribution (Appendix I) as that produced by our variational inference framework (see, respectively, (6) in Section 3 and (48) in Appendix F, the latter of which relies on a specific choice of approximation $q(f_{U_i} \mid f_S, y_D)$ (47) to the test conditional $p(f_{U_i} \mid f_S, y_D)$). So, LMA is unified with our framework.

4.3. Key insight: Different variational SGPR models evolve from varying noise correlation structures

It is well-known that, among existing SGPR models, DTC and FITC are deemed variationally optimal as they minimize the KL distance to a FGPR model with independently distributed noises (i.e., $S_{DD} = \sigma_n^2 I$ (Titsias, 2009a) or $S_{DD} = \mathrm{diag}[K_{DD} - Q_{DD}] + \sigma_n^2 I$ (Titsias, 2009b)). However, empirical results in the existing literature show that FITC and DTC often predict more poorly than PIC (Hoang et al., 2015) and LMA (Low et al., 2015)8, which is usually explained by empirically inspecting how well these SGPR models approximate the prior covariance matrix of the FGPR model (Low et al., 2015; Quinonero-Candela & Rasmussen, 2005; Snelson, 2007). But, to the best of our knowledge, it has not been explicitly explained why SGPR models like PIC and LMA, despite being able to provide more refined approximations of the prior covariance matrix of the FGPR model, do not minimize their KL distance to the FGPR model.

8 In (Low et al., 2015), LMA outperforms PIC on the same dataset used by Hoang et al. (2015) to compare PIC and DTC.

Our unification results (Sections 4.1 and 4.2) have explained this by giving a theoretical justification based on the correlation structure of the noise process model: Specifically, FITC and DTC can only minimize their KL distances to a FGPR model under the assumption of independently distributed noises, which is often violated in many real-world datasets where observation noises tend to be correlated (Huizenga & Molenaar, 1995; Koochakzadeh et al., 2015; Rasmussen & Williams, 2006); they do not necessarily minimize their KL distances to a FGPR model in such cases. In light of this fact, our unification results reveal that different SGPR models minimize their KL distances to a FGPR model under varying assumptions of the correlation structure of the noise process model: For example, Section 4.1 (4.2) shows that when our variational inference framework minimizes the KL distance to a FGPR model (Section 3) under the assumption of $B = 0$ ($B > 0$) and the covariance function $k^\epsilon_{xx'}$ in (9) for the noise process model, it recovers the same $q(f_S)$ and predictive distribution $q(f_{U_i} \mid y_D)$ induced by PIC (LMA).

5. Distributed Variational Inference for Hyperparameter Learning

Recall from Section 3 that plugging the derived $q(f_S)$ (6) into the variational lower bound $\mathcal{L}(q, Z)$ (4) gives

$$R(Z) = \max_q \mathcal{L}(q, Z) = \log \mathcal{N}(y_D \mid \mu_D, S_{DD} + Q_{DD}) - 0.5\,\mathrm{Tr}\big[S_{DD}^{-1}(K_{DD} - Q_{DD})\big] \quad (10)$$

where the last equality is derived in Appendix J. Hyperparameter learning then involves iteratively refining the estimate of the hyperparameters $Z$ via gradient ascent to improve the value of $R(Z)$. Its time complexity depends on the efficiency of computing the gradient $\partial R/\partial Z$ per iteration of gradient ascent. As shown in Appendix K, differentiating both sides of (10) with respect to $z \in Z$ yields

$$\frac{\partial R}{\partial z} = -\frac{1}{2}\mathrm{Tr}\!\left[\frac{\partial P_{DD}}{\partial z} W_{DD} + P_{DD} \frac{\partial W_{DD}}{\partial z} - S_{DD} \frac{\partial P_{DD}}{\partial z}\right] - \frac{1}{2}\mathrm{Tr}\!\left[\Phi_{SS}^{-1} \frac{\partial \Phi_{SS}}{\partial z} - K_{SS}^{-1} \frac{\partial K_{SS}}{\partial z}\right] + \frac{1}{2}\mathrm{Tr}\!\left[V_{SD}^\top \frac{\partial \Phi_{SS}^{-1}}{\partial z} V_{SD} + 2\, \frac{\partial V_{SD}^\top}{\partial z} \Phi_{SS}^{-1} V_{SD}\right] \quad (11)$$

where $W_{DD} \triangleq K_{DD} - Q_{DD} + (y_D - \mu_D)(y_D - \mu_D)^\top$. However, computing $\partial R/\partial z$ directly using (11) is prohibitively expensive as it involves computing the matrix derivative $\partial P_{DD}/\partial z = -P_{DD}(\partial S_{DD}/\partial z)P_{DD}$, which generally incurs $\mathcal{O}(|D|^3)$ time. In the rest of this section, we will exploit the $B$-block-banded structure of $P_{DD}$ (Section 2) for deriving an efficient reformulation of $\partial R/\partial z$ (11) that can be computed in linear time in the data size $|D|$. More importantly, it is amenable to parallelization on distributed machines/cores by constructing and communicating summaries of data clusters to be described next:

Definition 1 (B-th order Markov Cluster)  A $B$-th order Markov cluster $k$ is defined as $M_k^B \triangleq \bigcup_{i \in B^+(k)} D_i \supseteq D_k^B$ for $k = 1, \ldots, M$ where $B^+(k)$ and $D_k^B$ are previously defined in Theorem 1 and just below (1), respectively.

From Definition 1, $D = \bigcup_{k=1}^M M_k^B$ and $M_k^B \cap M_{k'}^B \neq \emptyset$ if $|k - k'| \le B$ and $B > 0$. Each $B$-th order Markov cluster $k$ of size $|M_k^B| = \mathcal{O}(B|D|/M) = \mathcal{O}(B|S|)$6 is assigned to a separate machine/core $k$ which constructs its local summary, as defined below:

Definition 2 (Local Summary)  The local summary of a $B$-th order Markov cluster $k$ is defined as a tuple $\mathcal{P}_k \triangleq (\alpha_k, \beta_k, \{\alpha_k^z, \beta_k^z, \gamma_k^z, \delta_k^z\}_{z \in Z})$ where $\alpha_k$ and $\beta_k$ are previously defined in Theorem 1, $\alpha_k^z \triangleq \partial \alpha_k/\partial z$, $\beta_k^z \triangleq \partial \beta_k/\partial z$,

$$\gamma_k^z \triangleq \sum_{i,j \in B^+(k)} \mathrm{Tr}\!\left[\frac{\partial}{\partial z}\Big(G^k_{D_i D_k} \big(G^k_{D_k D_k}\big)^{-1} G^k_{D_k D_j}\, W_{D_j D_i}\Big)\right], \qquad \delta_k^z \triangleq \sum_{i,j \in B^+(k)} \mathrm{Tr}\!\left[\frac{\partial}{\partial z}\Big(G^k_{D_i D_k} \big(G^k_{D_k D_k}\big)^{-1} G^k_{D_k D_j}\Big)\, S_{D_j D_i}\right].$$

The local summaries $\mathcal{P}_1, \ldots, \mathcal{P}_M$ can be computed independently and hence in parallel on $M$ distributed machines/cores: To construct the local summary $\mathcal{P}_k$, each machine/core $k$ only needs access to its Markov cluster $M_k^B$, the corresponding noisy outputs $y_{M_k^B}$, and a common set $S$ of inducing inputs known to all machines/cores. Evaluating the derivative terms in the local summary $\mathcal{P}_k$ essentially requires computing terms such as $\partial G^k_{D_i D_k}/\partial z$ for $i \in B^+(k)$. To compute the latter terms, we exploit the result of Lemma 1 for computing $\partial G^k/\partial z$ in terms of $\partial S_{D_k^B D_k^B}/\partial z$, $\partial S_{D_k^B D_k}/\partial z$, and $\partial S_{D_k D_k}/\partial z$. For each Markov cluster $M_k^B$, computing $G^k$ (and hence $\partial G^k/\partial z$) incurs only $\mathcal{O}(B^3|S|^3)$ time, as discussed in the paragraph after Lemma 1. Its corresponding local summary $\mathcal{P}_k$ can then be computed in $\mathcal{O}(B^2|S|^3)$ time. So, the overall time complexity of computing the local summary $\mathcal{P}_k$ of Markov cluster $M_k^B$ in each parallel machine/core $k$ is $\mathcal{O}(B^3|S|^3)$, which is independent of the data size $|D|$. Alternatively, $\mathcal{P}_1, \ldots, \mathcal{P}_M$ can be constructed sequentially in $\mathcal{O}(|D||S|^2 B^3) = \mathcal{O}(MB^3|S|^3)$ time that is linear in $|D|$. Every machine/core $k$ then communicates the local summary $\mathcal{P}_k$ of its assigned Markov cluster $M_k^B$ to a central machine that will assimilate these local summaries into a global summary defined as follows:

Definition 3 (Global Summary)  The global summary is defined as a tuple $\mathcal{P} \triangleq (\alpha, \Phi, \{\alpha^z, \Phi^z, \gamma^z\}_{z \in Z})$ where

$$\alpha \triangleq \sum_{k=1}^M \alpha_k,\quad \Phi \triangleq K_{SS} + \sum_{k=1}^M \beta_k,\quad \alpha^z \triangleq \sum_{k=1}^M \alpha_k^z,\quad \Phi^z \triangleq \frac{\partial K_{SS}}{\partial z} + \sum_{k=1}^M \beta_k^z,\quad \text{and}\quad \gamma^z \triangleq \sum_{k=1}^M \big(\gamma_k^z - \delta_k^z\big).$$

Our main result to follow presents an efficient reformulation of $\partial R/\partial z$ (11) by exploiting the global summary $\mathcal{P}$:

Theorem 2  $\partial R/\partial z$ (11) can be re-expressed in terms of the global summary $\mathcal{P} = (\alpha, \Phi, \{\alpha^z, \Phi^z, \gamma^z\}_{z \in Z})$ as follows:

$$\frac{\partial R}{\partial z} = \frac{1}{2}\mathrm{Tr}\!\left[K_{SS}^{-1}\frac{\partial K_{SS}}{\partial z}\right] + \alpha^{z\top}\Phi^{-1}\alpha - \frac{1}{2}\mathrm{Tr}\!\left[\Phi^{-1}\Phi^z\right] - \frac{1}{2}\alpha^\top\Phi^{-1}\Phi^z\Phi^{-1}\alpha - \frac{1}{2}\gamma^z. \quad (12)$$

Its proof is in Appendix L. Since $\alpha$ and $\alpha^z$ are column vectors of size $|S|$, $\Phi$ and $\Phi^z$ are matrices of size $|S|$ by $|S|$, and $\gamma^z$ is a scalar, computing $\partial R/\partial z$ using (12) only incurs $\mathcal{O}(|S|^3)$ time given the global summary $\mathcal{P}$. So, the overall time complexity of computing $\partial R/\partial z$ (12) in a distributed manner on $M$ parallel machines/cores (sequentially) is still $\mathcal{O}(B^3|S|^3)$ ($\mathcal{O}(|D||S|^2 B^3) = \mathcal{O}(MB^3|S|^3)$).
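To tie the pieces together, the following Python/NumPy sketch (ours) runs the outer hyperparameter learning loop of this section on a toy problem: it evaluates the bound $R(Z)$ of (10) directly under the DTC-style noise model $S_{DD} = \sigma_n^2 I$ (so no summaries are needed) and ascends it with a finite-difference gradient standing in for the analytic, summary-based gradient (12); the data, step size, and iteration count are placeholders.

```python
import numpy as np

def se_kernel(A, B, var, ls):
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return var * np.exp(-0.5 * d2 / ls ** 2)

rng = np.random.default_rng(5)
X_D = rng.uniform(0, 5, size=(150, 1))
y_D = np.sin(2 * X_D[:, 0]) + 0.1 * rng.standard_normal(150)
X_S = np.linspace(0, 5, 12)[:, None]

def R(Z):
    """Variational lower bound (10) with S_DD = sigma_n^2 I and zero prior mean.

    Z = (log lengthscale, log signal variance, log noise variance)."""
    ls, var, s2 = np.exp(Z)
    K_DD = se_kernel(X_D, X_D, var, ls)
    K_SD = se_kernel(X_S, X_D, var, ls)
    K_SS = se_kernel(X_S, X_S, var, ls) + 1e-6 * np.eye(len(X_S))
    Q_DD = K_SD.T @ np.linalg.solve(K_SS, K_SD)
    C = Q_DD + s2 * np.eye(len(X_D))                        # S_DD + Q_DD
    _, logdet = np.linalg.slogdet(C)
    quad = y_D @ np.linalg.solve(C, y_D)
    log_marg = -0.5 * (logdet + quad + len(X_D) * np.log(2 * np.pi))
    return log_marg - 0.5 * np.trace(K_DD - Q_DD) / s2      # minus 0.5 Tr[S_DD^{-1}(K_DD - Q_DD)]

Z, lr, eps = np.log(np.array([1.0, 1.0, 0.1])), 1e-3, 1e-5
for t in range(50):                                          # plain gradient ascent on R(Z)
    g = np.array([(R(Z + eps * e) - R(Z - eps * e)) / (2 * eps) for e in np.eye(3)])
    Z = Z + lr * g
print("learned (lengthscale, signal variance, noise variance):", np.exp(Z))
```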

6. Experiments and Discussion

This section empirically evaluates the predictive performance and scalability of the distributed variational SGPR models unified by our framework (Section 5) such as the distributed variants of PIC and LMA and the current state-of-the-art distributed variant of DTC (Gal et al., 2014), which we respectively call dPIC, dLMA, and dDTC9, on two real-world datasets:

(a) The AIMPEAK dataset (Chen et al., 2013a) contains 41850 observations of traffic speeds (km/h) along 775 road segments of an urban road network during morning peak hours on April 20, 2011. Each observation comprises a 5-dimensional input vector featuring the length, number of lanes, speed limit, and direction of the road segment as well as its recording time, which is discretized into 54 five-minute time slots. The outputs correspond to the traffic speeds.

(b) The AIRLINE dataset (Hensman et al., 2013; Hoang et al., 2015) contains 2055733 information records of commercial flights in the USA from January to April 2008. The input denotes an 8-dimensional feature vector of the age of the aircraft (year), travel distance (km), airtime, departure and arrival time (min), day of the week, day of the month, and month. The output is the delay time (min) of the flight.

Both datasets are modeled using GPs with correlated noises (Section 2) whose prior covariance matrix is defined using the squared exponential covariance function described in Section 2.

9 The induced variational lower bound of dDTC (Eq. (10)) is equivalent to that of the Dist-VGP framework of Gal et al. (2014).

Figure 1. Graphs of RMSEs achieved by (a) dDTC, (c) dPIC, and (e) dLMA vs. number t of iterations, and graphs of the total training times incurred by (b) dDTC, (d) dPIC, and (f) dLMA vs. number M′ of parallel cores for the AIMPEAK dataset.

For the dPIC and dLMA models, the prior covariance of the noise process is constructed using (1) (respectively, with Markov order B = 0 and 1) and the covariance function in (9). Likewise, for dDTC, the prior covariance of the noise process is constructed using (1) with B = 0 and the covariance function $k^\epsilon_{xx'} \triangleq \sigma_n^2\, \mathbb{I}(x = x')$. Both the training and test data are then partitioned evenly into M blocks using k-means (i.e., k = M). All experiments are run on a Linux system with an Intel Xeon E5-2670 at 2.6GHz with 96 GB memory and 32 processing cores. Our distributed variational SGPRs are implemented using the Armadillo linear algebra library for C++ (Sanderson, 2010). For each tested model, we report the (a) root mean square error (RMSE), $\sqrt{|U|^{-1}\sum_{x \in U}(y_x - \hat{y}_x)^2}$, of its predictions $\{\hat{y}_x\}_{x \in U}$, (b) training time vs. number of iterations, and (c) training time vs. number of parallel cores. The results for each model are evaluated on 5% of the dataset with respect to its best configuration (e.g., learning rate, number of inducing inputs, etc.).

2 4 6 8 10

4

6

8

10

12

14

16

No. t of iterations

RMSE

(km/h

)

dDTCdPICdLMA

2 4 6 8 100

50

100

150

200

250

No. t of iterations

Tim

e(sec)

dDTCdPICdLMA

(a) (b)Figure 2. Graphs of (a) RMSEs and (b) total incurred times ofdDTC, dPIC, and dLMA vs. number t of iterations with M 0

= 16

computing cores and M = 100 blocks for AIMPEAK dataset.

(a) Figs. 1 and 2 show results of the RMSEs, incurred times, and parallel efficiencies of dPIC, dDTC, and dLMA averaged over 5 random instances with varying number t of iterations. In particular, it can be observed from both Figs. 1 and 2 that the RMSEs of all tested methods decrease rapidly (see Figs. 1a, 1c, and 1e) while their total incurred training times only increase linearly in the number t of iterations (see Fig. 2b), which shows the effectiveness of our distributed hyperparameter learning framework (Section 5) for these variational SGPR models;

(b) Fig. 1 also shows results of the total training times incurred by dPIC, dLMA, and dDTC over 10 iterations with varying number M′ = 2, 4, 8, 16 of parallel cores. As expected, it can be observed from Figs. 1b, 1d, and 1f that the training times of the tested models decrease gradually when the number of computing cores increases, which corroborates our complexity analysis of the distributed SGPR models in Section 5. More specifically, our experiment reveals that using M′ = 16 cores, dPIC (116 seconds), dLMA (193 seconds), and dDTC (97 seconds) can achieve speedups (i.e., parallel efficiencies) of 5.5 to 7.2 over their centralized counterparts (i.e., M′ = 1): PIC (836 seconds), LMA (1266 seconds), and DTC (537 seconds).

(c) Fig. 2b further shows that dDTC consistently incurs less training time than both dPIC and dLMA with varying number t of learning iterations. This is expected because the primary cost of computing a local summary for dDTC, which involves computing $G^k$ in Lemma 1, is constant10 while that of dPIC and dLMA grows cubically in the block size |D|/M. On the other hand, Fig. 2a reveals that the RMSEs achieved by dPIC (3.75) and dLMA (3.54) are both significantly lower than that achieved by dDTC (10.70). This inferior performance of dDTC is also expected because of its more restrictive assumption of a deterministic relation between the training and inducing outputs (Hoang et al., 2015).

10 For dDTC, since $S_{DD} = \sigma_n^2 I$, it follows straightforwardly that $G^k = \sigma_n^{-2} I$, which can be constructed in constant time.

AIRLINE Dataset. We extract 2M data points from the original dataset for the experiment to guarantee an even partition into M = 2000 blocks.

Figure 3. Graphs of RMSEs achieved by (a) dDTC, (c) dPIC, and (e) dLMA vs. number t of iterations, and graphs of the total training times incurred by (b) dDTC, (d) dPIC, and (f) dLMA vs. number M′ of parallel cores for the AIRLINE dataset.

Figs. 3 and 4 show results of the RMSEs, incurred times, and parallel efficiencies of dPIC, dDTC, and dLMA averaged over 5 random instances with varying number t of iterations. The observations are mostly similar to those for the AIMPEAK dataset: From Figs. 3a, 3c, and 3e, the RMSEs of dPIC, dDTC, and dLMA decrease quickly while their total incurred training times only increase linearly in the number t of iterations (Fig. 4b). Figs. 3b, 3d, and 3f then show a rapid decrease of their total training times with an increasing number of processing cores. Our experiments reveal that using M′ = 32 cores, dLMA (24151 seconds), dPIC (9981 seconds), and dDTC (1419 seconds) incur less than 6.7 hours to optimize their hyperparameters and can achieve significant speedups of 12.69 to 13.58 over their centralized counterparts: LMA (306476 seconds), PIC (135542 seconds), and DTC (19426 seconds). It can also be observed from Fig. 4b that dDTC incurs less training time than dPIC and dLMA but, as observed from Fig. 4a, both dLMA (13.20) and dPIC (17.88) outperform dDTC (36.84) by a huge margin.

Figure 4. Graphs of (a) RMSEs and (b) total incurred times of dDTC, dPIC, and dLMA vs. number t of iterations with M′ = 32 computing cores and M = 2000 blocks for the AIRLINE dataset.

Finally, the predictive performance of our proposed distributed hyperparameter learning frameworks, dLMA and dPIC, is further evaluated on two benchmark settings of the AIRLINE dataset to compare with the previously best reported RMSE results of existing state-of-the-art distributed methods such as Dist-VGP (Gal et al., 2014) and rBCM (Deisenroth & Ng, 2015). All results are reported in Table 1 below, which essentially shows that dPIC and dLMA significantly outperform Dist-VGP and rBCM in both settings. Notably, dLMA (16.50 and 13.20) manages to reduce the previously best reported RMSEs (27.10 and 34.40) by 39.11% and 61.62%, respectively.

              Dist-VGP   rBCM    dPIC    dLMA
700K/100K        33.00   27.10   21.09   16.50
2M/100K          35.30   34.40   17.88   13.20

Table 1. RMSEs achieved by existing distributed/parallel methods on two standard benchmark settings (training/test data sizes) of the AIRLINE dataset: (a) 700K/100K and (b) 2M/100K. The results (RMSEs) of Dist-VGP (Gal et al., 2014) and rBCM (Deisenroth & Ng, 2015) are reported from their respective papers.

7. Conclusion

This paper describes a novel distributed variational inference framework that unifies many parallel SGPR models (e.g., DTC, FITC, FIC, PITC, PIC, LMA) for scalable hyperparameter learning, consequently reducing their incurred linear time per iteration of gradient ascent significantly. To achieve this, our framework exploits a structure of correlated noise process model that represents the observation noises as a finite realization of a B-th order Gaussian Markov random process. By varying the Markov order and covariance function for the noise process model, different variational SGPR models result. This consequently allows the correlation structure of the noise process model to be characterized for which a particular variational SGPR model is optimal; in other words, different SGPR models minimize their KL distance to a FGPR model under varying characterizations of the noise correlation structure. Empirical evaluation on two real-world datasets shows that our proposed framework can achieve significantly better predictive performance than the state-of-the-art distributed variational DTC (Gal et al., 2014) and distributed GPs (Deisenroth & Ng, 2015) while preserving scalability to big data.


Acknowledgments. This research is supported by the National Research Foundation, Prime Minister's Office, Singapore under its International Research Centre in Singapore Funding Initiative.

References

Asif, A. and Moura, J. M. F. Block matrices with L-block-banded inverse: Inversion algorithms. IEEE Trans. Signal Processing, 53(2):630-642, 2005.

Campbell, T., Straub, J., Fisher III, J. W., and How, J. P. Streaming, distributed variational inference for Bayesian nonparametrics. In Proc. NIPS, pp. 280-288, 2015.

Cao, N., Low, K. H., and Dolan, J. M. Multi-robot informative path planning for active sensing of environmental phenomena: A tale of two algorithms. In Proc. AAMAS, pp. 7-14, 2013.

Chen, J., Low, K. H., Tan, C. K.-Y., Oran, A., Jaillet, P., Dolan, J. M., and Sukhatme, G. S. Decentralized data fusion and active sensing with mobile sensors for modeling and predicting spatiotemporal traffic phenomena. In Proc. UAI, pp. 163-173, 2012.

Chen, J., Cao, N., Low, K. H., Ouyang, R., Tan, C. K.-Y., and Jaillet, P. Parallel Gaussian process regression with low-rank covariance matrix approximations. In Proc. UAI, pp. 152-161, 2013a.

Chen, J., Low, K. H., and Tan, C. K.-Y. Gaussian process-based decentralized data fusion and active sensing for mobility-on-demand system. In Proc. RSS, 2013b.

Chen, J., Low, K. H., Jaillet, P., and Yao, Y. Gaussian process decentralized data fusion and active sensing for spatiotemporal traffic modeling and prediction in mobility-on-demand systems. IEEE Transactions on Automation Science and Engineering, 12(3):901-921, 2015.

Deisenroth, M. P. and Ng, J. W. Distributed Gaussian processes. In Proc. ICML, 2015.

Dolan, J. M., Podnar, G., Stancliff, S., Low, K. H., Elfes, A., Higinbotham, J., Hosler, J. C., Moisan, T. A., and Moisan, J. Cooperative aquatic sensing using the telesupervised adaptive ocean sensor fleet. In Proc. SPIE Conference on Remote Sensing of the Ocean, Sea Ice, and Large Water Regions, volume 7473, 2009.

Gal, Y., van der Wilk, M., and Rasmussen, C. Distributed variational inference in sparse Gaussian process regression and latent variable models. In Proc. NIPS, pp. 3257-3265, 2014.

Goldberg, P. W., Williams, C. K. I., and Bishop, C. M. Regression with input-dependent noise: A Gaussian process treatment. In Proc. NIPS, pp. 493-499, 1997.

Hensman, J., Fusi, N., and Lawrence, N. D. Gaussian processes for big data. In Proc. UAI, pp. 282-290, 2013.

Hoang, T. N., Low, K. H., Jaillet, P., and Kankanhalli, M. Active learning is planning: Nonmyopic ε-Bayes-optimal active learning of Gaussian processes. In Proc. ECML/PKDD Nectar Track, pp. 494-498, 2014a.

Hoang, T. N., Low, K. H., Jaillet, P., and Kankanhalli, M. Nonmyopic ε-Bayes-optimal active learning of Gaussian processes. In Proc. ICML, pp. 739-747, 2014b.

Hoang, T. N., Hoang, Q. M., and Low, K. H. A unifying framework of anytime sparse Gaussian process regression models with stochastic variational inference for big data. In Proc. ICML, pp. 569-578, 2015.

Huizenga, H. M. and Molenaar, P. C. M. Equivalent source estimation of scalp potential fields contaminated by heteroscedastic and correlated noise. Brain Topography, 8(1):13-33, 1995.

Kersting, K., Plagemann, C., Pfaff, P., and Burgard, W. Most likely heteroscedastic Gaussian process regression. In Proc. ICML, pp. 393-400, 2007.

Koochakzadeh, A., Malek-Mohammadi, M., Babaie-Zadeh, M., and Skoglund, M. Multi-antenna assisted spectrum sensing in spatially correlated noise environments. Signal Processing, 108:69-76, 2015.

Lazaro-Gredilla, M. and Titsias, M. K. Variational heteroscedastic Gaussian process regression. In Proc. ICML, pp. 841-848, 2011.

Lazaro-Gredilla, M., Quinonero-Candela, J., Rasmussen, C. E., and Figueiras-Vidal, A. R. Sparse spectrum Gaussian process regression. Journal of Machine Learning Research, pp. 1865-1881, 2010.

Ling, C. K., Low, K. H., and Jaillet, P. Gaussian process planning with Lipschitz continuous reward functions: Towards unifying Bayesian optimization, active learning, and beyond. In Proc. AAAI, pp. 1860-1866, 2016.

Low, K. H., Dolan, J. M., and Khosla, P. Adaptive multi-robot wide-area exploration and mapping. In Proc. AAMAS, pp. 23-30, 2008.

Low, K. H., Dolan, J. M., and Khosla, P. Information-theoretic approach to efficient adaptive path planning for mobile robotic environmental sensing. In Proc. ICAPS, pp. 233-240, 2009.

Low, K. H., Dolan, J. M., and Khosla, P. Active Markov information-theoretic path planning for robotic environmental sensing. In Proc. AAMAS, pp. 753-760, 2011.

Low, K. H., Chen, J., Dolan, J. M., Chien, S., and Thompson, D. R. Decentralized active robotic exploration and mapping for probabilistic field classification in environmental sensing. In Proc. AAMAS, pp. 105-112, 2012.

Low, K. H., Chen, J., Hoang, T. N., Xu, N., and Jaillet, P. Recent advances in scaling up Gaussian process predictive models for large spatiotemporal data. In Proc. DyDESS, 2014a.

Low, K. H., Xu, N., Chen, J., Lim, K. K., and Ozgul, E. B. Generalized online sparse Gaussian processes with application to persistent mobile robot localization. In Proc. ECML/PKDD Nectar Track, pp. 499-503, 2014b.

Low, K. H., Yu, J., Chen, J., and Jaillet, P. Parallel Gaussian process regression for big data: Low-rank representation meets Markov approximation. In Proc. AAAI, pp. 2821-2827, 2015.

Murray-Smith, R. and Girard, A. Gaussian process priors with ARMA noise models. In Proc. Irish Signals and Systems Conference, pp. 147-153, 2001.

Ouyang, R., Low, K. H., Chen, J., and Jaillet, P. Multi-robot active sensing of non-stationary Gaussian process-based environmental phenomena. In Proc. AAMAS, pp. 573-580, 2014.

Podnar, G., Dolan, J. M., Low, K. H., and Elfes, A. Telesupervised remote surface water quality sensing. In Proc. IEEE Aerospace Conference, 2010.

Quinonero-Candela, J. and Rasmussen, C. E. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939-1959, 2005.

Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. MIT Press, 2006.

Sanderson, C. Armadillo: An open source C++ linear algebra library for fast prototyping and computationally intensive experiments. Technical report, NICTA, 2010.

Schwaighofer, A. and Tresp, V. Transductive and inductive methods for approximate Gaussian process regression. In Proc. NIPS, pp. 953-960, 2003.

Seeger, M., Williams, C. K. I., and Lawrence, N. D. Fast forward selection to speed up sparse Gaussian process regression. In Proc. AISTATS, 2003.

Smola, A. J. and Bartlett, P. L. Sparse greedy Gaussian process regression. In Proc. NIPS, pp. 619-625, 2001.

Snelson, E. and Ghahramani, Z. Sparse Gaussian processes using pseudo-inputs. In Proc. NIPS, pp. 1259-1266, 2005.

Snelson, E. L. Flexible and efficient Gaussian process models for machine learning. Ph.D. Thesis, University College London, London, UK, 2007.

Snelson, E. L. and Ghahramani, Z. Local and global sparse Gaussian process approximation. In Proc. AISTATS, 2007.

Titsias, M. K. Variational learning of inducing variables in sparse Gaussian processes. In Proc. AISTATS, pp. 567-574, 2009a.

Titsias, M. K. Variational model selection for sparse Gaussian process regression. Technical report, School of Computer Science, University of Manchester, 2009b.

Wainwright, M. J. and Jordan, M. I. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, 2008.

Xu, N., Low, K. H., Chen, J., Lim, K. K., and Ozgul, E. B. GP-Localize: Persistent mobile robot localization using online sparse Gaussian process observation model. In Proc. AAAI, pp. 2585-2592, 2014.

Yu, J., Low, K. H., Oran, A., and Jaillet, P. Hierarchical Bayesian nonparametric approach to modeling and learning the wisdom of crowds of urban traffic route planning agents. In Proc. IAT, pp. 478-485, 2012.

Zhang, Y., Hoang, T. N., Low, K. H., and Kankanhalli, M. Near-optimal active learning of multi-output Gaussian processes. In Proc. AAAI, pp. 2351-2357, 2016.