
MSc Mathematics

Master Thesis

Optimal Bayesian distributed methods for nonparametric regression

Author: Tammo Hesselink
Supervisors: dr. B.T. Szabo, prof. dr. J. H. van Zanten

Examination date: October 30, 2018

Korteweg-de Vries Institute for Mathematics

Page 2: Optimal Bayesian distributed methods for nonparametric regression · 2020-07-17 · Bayesian methods in nonparametric regression using Gaussian process priors have been widely used

Abstract

Bayesian methods in nonparametric regression using Gaussian process priors have been widely used in practice. In this thesis we consider Gaussian processes making use of both the Matern kernel and the squared exponential kernel. However, neither of these Gaussian process priors leads to scalable Bayesian methods, and they are therefore highly impractical for large data sets. To solve this, we turn to the distributed setting, where we divide the data of size n over m machines, which collectively compute a global posterior. For the distributed setting, we consider multiple methods, consisting of modifications of the local posteriors and different aggregation methods for the local posteriors. These methods are studied via simulation, and we present theoretical results for some of them. Finally, we perform simulation studies using the squared exponential kernel, which show that it performs similarly to the Matern kernel.

Title: Optimal Bayesian distributed methods for nonparametric regression
Author: Tammo Hesselink, [email protected], 10385401
Supervisors: dr. B.T. Szabo, prof. dr. J. H. van Zanten
Second Examiner: dr. B.J.K. Kleijn
Examination date: October 30, 2018

Korteweg-de Vries Institute for Mathematics
University of Amsterdam
Science Park 105-107, 1098 XG Amsterdam
http://kdvi.uva.nl


Contents

Introduction

1. Nonparametric Bayesian regression methods using Gaussian process priors
   1.1. Nonparametric Gaussian process regression

2. Distributed Bayesian methods using Gaussian process priors
   2.1. Naive averaging of posteriors
   2.2. Aggregation by product of experts
   2.3. Adjusted local likelihoods
   2.4. Adjusted priors
   2.5. Adjusted local likelihoods and Wasserstein barycenters
   2.6. Generalised Product of Experts
   2.7. The Bayesian Committee Machine
   2.8. Summary of distributed methods

3. Distributed Bayesian methods using the squared exponential kernel
   3.1. Simulation study
   3.2. Distributed adaptive methods
   3.3. Summary

4. Proofs
   4.1. Summary of notations
   4.2. The RKHS framework
   4.3. Lemmas used in the proofs
   4.4. Proof of Lemma 1
   4.5. Proof of Theorem 1
   4.6. Proof of Theorem 2
   4.7. Proof of Theorem 3
   4.8. Proof of Theorem 4
   4.9. Proof of Theorem 5
   4.10. Proof of Theorem 6

Conclusion

Appendices

A. Appendix
   A.1. Proof of Lemma 5
   A.2. Technical lemmas

Popular summary


Introduction

Recent times have seen the importance of data analysis grow immensely. Digital techniques make it easier to keep track of data, which allows for various forms of analysis. Terms such as machine learning and big data are used frequently, and most applications make use of historically collected data to fine-tune the application to its users.

When data sets get sufficiently large, the computation can take up an impractical amount of time, or even become unattainable. In the case of matrix multiplications, which are also used extensively in the nonparametric regression model, there are methods for sparse matrices available which speed up the process, see [11, 22, 20, 26].

Besides improving the computation itself, it is also possible to distribute the data over machines or cores to decrease computation time. After the data is distributed over a cluster of either machines or cores, it is processed locally, after which a central machine aggregates these computations into a global estimate. Next to computational reasons, distributed methods are also used for convenience. For instance, in settings where data is first obtained in different locations, the local machines can already perform computations on that data before communicating it to a central machine. Moreover, it can be desirable not to manage data in a central place for privacy or safety reasons, for which distributing the data offers a solution.

The nonparametric regression model is widely used for problems in which there is little a priori knowledge of the shape of the curve to be fitted. As opposed to the parametric regression setting, where the "true" function f0 is assumed to be in a certain family of functions with a certain number of parameters to be inferred from the data, the nonparametric setting considers all methods that compute an estimate of the true function without any a priori assumption on its shape.

Bayesian inference on the model offers more flexibility to deal with the uncertainty [15]. Bayesian statistics, as opposed to frequentist statistics, does not look to infer a single true parameter, but rather a distribution of the outcome. The method thus has built-in uncertainty quantification. The model requires an assumed prior distribution of the truth, which can be chosen in many ways and therefore offers options for many different settings. The choice of the prior does, however, add human bias to the model, which can be partially dealt with, for more flexibility, by endowing the parameters of the prior with hyperpriors.

Gaussian processes are frequently used as prior distributions to estimate functions due to their flexibility for unknown functional parameters [10, 21]. They are determined by a mean function, which states the expectation at time t, and a covariance function, which determines the covariance between multiple points in time. These functions can be chosen freely, and thus offer a great amount of freedom in the final shape of the Gaussian process prior.


The Bayesian nonparametric approach using Gaussian process priors has already been well studied [31, 2, 30]. In theory, this method performs very well for estimating curves with little information known a priori. The main problem arises in the scalability of the method: applying it to large data sets can take impractical computation time or even become unattainable. This calls for a way to make the method scalable, which is possible by distributing the data.

The theoretical properties of distributed Bayesian methods have been thoroughly studied in the signal-in-white-noise model [25], but not for the more convenient nonparametric regression model. From a methodological view it has been extensively studied, see for example [18, 17, 13]. The aggregation of the local posteriors can be performed using different methods, with varying results in contraction rate and credible set coverage.

Methods to adaptively estimate the hyperparameters which have shown optimal results in the nondistributed setting fail in the distributed case, despite various attempted modifications. Our aim is to find a method for which the distance between the estimator and the true function goes to zero at a rate similar to that of the nondistributed setting, with pointwise posterior credible set coverage for this estimator. For the nondistributed setting, optimal results have been shown in [33], which form the benchmark to strive for in the distributed setting. These results include L∞ distances between the posterior mean and the truth, and pointwise posterior coverage of credible bounds.

We do so by applying the methods analysed in [25] and observe that the nonparametric regression model behaves similarly to the signal-in-white-noise model under the different modifications of the local posteriors and the different aggregation methods. They show that the naive method of simply taking the average of local posteriors gives suboptimal results, both in terms of recovery and of uncertainty quantification. Furthermore, they investigate a variety of other distributed methods which show optimal recovery, uncertainty quantification, or both.

We first apply the nondistributed method to the Bayesian nonparametric regression model using the Matern kernel, to see how the method performs in the nondistributed setting as a benchmark. We recall the L∞ posterior contraction rate results and pointwise posterior credible set statements derived in [33] for the Matern kernel. These results are used in later chapters to compare the performance of the different distributed methods against.

In the second chapter we perform simulation studies for several distributed methods to see their performance in terms of bias of the posterior mean and coverage of credible bounds. This is done by applying different modifications of the local posteriors, where we consider raising the power of either the likelihood or the prior, and by considering different aggregation methods for the modified posteriors, which include naive averaging of posteriors, product of experts, the Wasserstein barycenter and the Bayesian committee machine [27]. We prove sup-norm upper bounds for the bias of the posterior mean as well as frequentist coverage of the pointwise posterior credible intervals, confirming the simulation studies.

In Chapter 3 we look at the squared exponential kernel as used in [18]. The distributed methods used for the Matern kernel are applied to the same setting using the squared exponential kernel, in simulation studies on the posterior mean and credible set coverage. We also discuss the complications of proving sup-norm upper bounds for the posterior mean for the distributed methods using the squared exponential kernel.

Finally, we turn to a more realistic setting in which the optimal parameters of the squared exponential kernel are unknown beforehand. We consider adaptive methods where the parameters of the squared exponential kernel are tuned solely from the data, and we compare the performance of the different methods through simulation studies.

Mathematical proofs of the stated results are collected in the last chapter.


1. Nonparametric Bayesian regression methods using Gaussian process priors

In the regression model we consider $y = f(x) + \varepsilon$, where $\varepsilon \sim N(0, \sigma^2)$ is the noise term, and the set of design points $X_n := \{x_1, \ldots, x_n\}$ is distributed uniformly on the interval $[0,1]$ [33]. The aim is to infer the function $f$ from training data of size $n$: the predictors $X_n$ and dependent variables $Y_n := \{f(x_1) + \varepsilon_1, \ldots, f(x_n) + \varepsilon_n\}$. This inference can be done in multiple manners, which can be roughly divided into parametric and nonparametric regression.

In the parametric regression model, we assume the function $f$ lies in some family of functions $\mathcal{F}$, defined by a finite number of parameters. The goal is to estimate these parameters from the data set $X_n, Y_n$.

Nonparametric regression simply encompasses all methods that estimate $f$ without assuming it is of a certain form defined by a finite set of parameters. This generally makes for a method that deals better with situations where there is very little knowledge about the function a priori. This is frequently done using kernel functions, which serve as weighting functions in the process of computing a weighted average. These kernels can take on many different forms, determined by a certain bandwidth parameter [21].

1.1. Nonparametric Gaussian process regression

In the regression setting, we apply the Bayesian approach by placing a prior $\Pi$ of choice on the function of interest $f$, which is then combined with the likelihood of the data $p_f(X_n, Y_n)$ to form the posterior distribution through Bayes' rule [10],

$$d\Pi(f \mid X_n, Y_n) = \frac{p_f(X_n, Y_n)\, d\Pi(f)}{\int p_f(X_n, Y_n)\, d\Pi(f)}.$$

In this setting, it can be favourable to make use of Gaussian processes as prior distributions to estimate functions, due to their flexibility for unknown functional parameters [10, 21].

A Gaussian process $W$ is defined as a stochastic process of which all finite dimensional distributions $(W_{t_1}, W_{t_2}, \ldots, W_{t_n})$ are multivariate normal random variables. These Gaussian processes are widely used as priors, with many different applications [18, 33, 25, 21, 16]. The multivariate normal distribution is defined by its mean function $\mu(t) = \mathrm{E} W_t$ and its covariance function $r(s,t) = \mathrm{Cov}(W_s, W_t)$. In our analysis we first consider the


centered Gaussian process prior (having mean function $\mu(t) = 0$) with the Matern kernel function [21],

$$k(x, y) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\sqrt{2\nu}\,|x - y|\right)^{\nu} B_\nu\left(\sqrt{2\nu}\,|x - y|\right), \qquad (1.1)$$

where $\nu > 0$ is the smoothness parameter to be chosen and $B_\nu$ the modified Bessel function of the second kind. In the specific case where $\nu = \tfrac12$, this gives the absolute exponential kernel [21]:

$$k(x, y) = e^{-|x - y|}.$$

We want the variance of the prior to be decreasing in $n$, so we scale the prior by a term $\sigma^2(n\lambda)^{-1}$ as proposed in [33], where $\sigma^2$ is the error variance of the measured data, which we assume to be known. Another useful kernel, which we will study in Chapter 3, is the squared exponential kernel, defined as

$$k(x_i, x_j) = e^{-\frac{1}{2}(x_i - x_j)^2 l^{-1}}.$$

This gives us as a prior distribution

$$f(\cdot) \sim GP(0, \sigma^2 (n\lambda)^{-1} K), \qquad (1.2)$$

where $K := k(X_n, X_n)$, or $\Pi(\cdot) \sim N_n(0, \Sigma)$ where $\Sigma_{ij} = k(x_i, x_j)$. To compute the posterior distribution, first note that

$$\mathrm{Cov}(f(x_i), f(x_j)) = k(x_i, x_j),$$

thus

$$\mathrm{Cov}(f(x_i) + \varepsilon_i, f(x_j) + \varepsilon_j) = \mathrm{Cov}(f(x_i), f(x_j)) + \mathrm{Cov}(\varepsilon_i, \varepsilon_j),$$

as $f(x_i)$ and $\varepsilon_j$ are independent for all $i$ and $j$. We thus obtain

$$\mathrm{Cov}(y_i, y_j) = k(x_i, x_j) + \sigma^2 \delta_0(x_i - x_j),$$

as $\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$. Finally, we assume the likelihood $p_{f_0,\sigma}$ for this model is given as

$$p_{f_0,\sigma}(Y_n) = \left( \frac{1}{\sqrt{2\pi}\,\sigma} \right)^{n} e^{-\frac{1}{2} \sum_{i=1}^{n} \frac{(y_i - f_0(x_i))^2}{\sigma^2}}. \qquad (1.3)$$

The posterior distribution then takes the following form, based on [8], which we prove for completeness.

Lemma 1. Assume a prior distribution on $f$ of the form (1.2) and a likelihood of the form (1.3). Then the posterior distribution for a single test input $x^*$ is $y^* := f(x^*) \mid Y_n \sim N(\mu(x^*), s^2(x^*))$, where

$$\mu(x^*) = k_*^T (K + n\lambda I)^{-1} Y_n, \qquad (1.4)$$

$$s^2(x^*) = \sigma^2 (n\lambda)^{-1} \left( k_{**} - k_*^T (K + n\lambda I)^{-1} k_* \right), \qquad (1.5)$$

where $K := k(X_n, X_n)$ is the covariance matrix of the data, $k_* := k(x^*, X_n)$ the covariance of the test input $x^*$ with the data, and $k_{**} = k(x^*, x^*)$.


Proof. The proof of this lemma is given in Section 4.4.

In view of [33], the posterior covariance kernel takes the form

$$C_n(x, x') = \sigma^2 (n\lambda)^{-1} \left( k(x, x') - k(x, X_n)(K + n\lambda I)^{-1} k(X_n, x') \right), \qquad (1.6)$$

which will be used in later computations regarding the frequentist coverage of the pointwise posterior credible sets. We use the posterior distribution given the data to draw samples for the true function $f_0$. When computing the posterior distribution, the matrix operations increase the required computation time significantly; for example, the inversion of a matrix scales cubically in the size of the data [14]. To prevent unnecessary computations, we store the matrices that are equal for $f(x^*) \mid X_n, Y_n$ at every $x^*$ in a memory cache which can be accessed by the computer. In this case, the matrix $(K + n\lambda I)^{-1}$ is calculated beforehand and applied to the computation of $f(x^*) \mid X_n, Y_n$ for all $x^*$. Using the following algorithm we compute posterior distributions based on the data, with corresponding credible sets.

Algorithm 1 Calculating f(x∗) in the nondistributed setting on the interval [0,1]

1: Given $X_n$, $Y_n$, $\sigma^2$, $\nu$.
2: Compute $K$.
3: Compute the inverse of $K + n\lambda I$ and store it in cache.
4: for $x^* \in [0,1]$ evenly spaced do
5:   Compute $k_*$, $k_{**}$.
6:   Use these to compute $\mu(x^*)$ and $s^2(x^*)$.
7:   Draw samples from the posterior distribution and compute the $(100 - \frac{\beta}{2})\%$ and $\frac{\beta}{2}\%$ percentiles, which form the credible interval for $f(x^*)$.
8: end for
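As an illustration of Lemma 1 and Algorithm 1, the following minimal Python sketch computes the posterior mean (1.4) and variance (1.5) on a grid of test points. It is a sketch, not the implementation used for the simulation studies; the function names and the numerical safeguard at zero distance are our own assumptions.

import numpy as np
from scipy.spatial.distance import cdist
from scipy.special import gamma, kv

def matern(X, Y, nu):
    # Matern kernel (1.1) between the rows of the column vectors X and Y.
    d = np.sqrt(2 * nu) * cdist(X, Y)
    d[d == 0.0] = 1e-12                      # k(x, x) = 1 in the limit d -> 0
    return 2 ** (1 - nu) / gamma(nu) * d ** nu * kv(nu, d)

def gp_posterior(x_train, y_train, x_test, sigma2, nu, lam):
    # Posterior mean (1.4) and variance (1.5) at every test point.
    n = len(x_train)
    X, Xs = x_train[:, None], x_test[:, None]
    K_inv = np.linalg.inv(matern(X, X, nu) + n * lam * np.eye(n))  # cached once, step 3
    k_star = matern(Xs, X, nu)                                     # N x n
    mu = k_star @ K_inv @ y_train
    # k(x*, x*) = 1 for the Matern kernel (1.1)
    s2 = sigma2 / (n * lam) * (1.0 - np.einsum('ij,jk,ik->i', k_star, K_inv, k_star))
    return mu, s2

Credible intervals at each $x^*$ then follow by drawing from $N(\mu(x^*), s^2(x^*))$ and taking the empirical quantiles, as in step 7 of Algorithm 1.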

The covariance kernel depends only on the parameter $\lambda$. Our aim is to find the value of $\lambda$ for which the $L_\infty$ distance between the posterior mean and the truth is as small as possible, with high probability. The expected distance is bounded by a bias term and a variance term; the parameter $\lambda$ for which the bias and the variance are of equal order is considered the value which returns the minimax optimal rate.

Let $\hat f_n = \mathrm{E}[f \mid X_n, Y_n]$ denote the posterior mean based on data of size $n$. From [33] we find that the upper bound for $\|\hat f_n - f_0\|_\infty$ is given, up to a constant, by

$$\left\| \mathrm{E} f - f_0 \right\|_\infty + \sigma \sqrt{\frac{\log n}{n h}}, \qquad (1.7)$$

with probability at least $1 - n^{-10}$, where we define $h := \lambda^{1/(2\alpha)}$ and $\|\mathrm{E} f - f_0\|_\infty$ is the bias, which we look deeper into in Section 4.2. Here the first term corresponds to the bias and the second term to the variance. To find an upper bound for the bias, we need to define regularity classes in which the true function $f_0$ should lie. There are two frequently used classes, namely $\Theta_S^{\alpha}(B)$, the set of $\alpha$-smooth Sobolev functions, and $\Theta_H^{\alpha}(B)$,


the set of bounded $\alpha$-smooth functions in the Holder class. If we have an orthonormal basis $\{\phi_i\}_{i \in \mathbb{N}}$ for $L^2([0,1])$, the sets are, for $\alpha > \tfrac12$ and $B > 0$, defined as follows:

$$\Theta_S^{\alpha + \frac12}(B) = \left\{ f = \sum_{i=1}^{\infty} f_i \phi_i \in L^2([0,1]) : \sum_{i=1}^{\infty} f_i^2\, i^{2(\alpha + \frac12)} \le B^2 \right\},$$

and

$$\Theta_H^{\alpha}(B) = \left\{ f = \sum_{i=1}^{\infty} f_i \phi_i \in L^2([0,1]) : \sum_{i=1}^{\infty} |f_i|\, i^{\alpha} \le B \right\}. \qquad (1.8)$$

When the smoothness $\alpha$ of the function $f_0$ to be inferred is known, we choose the tuning parameter of the Matern kernel as $\nu = \alpha + \tfrac12$.

In the optimal situation, the bias and variance terms are of equal order, since if either one is larger we have a mean square error of greater order, which means the posterior mean converges to the truth more slowly. Thus, finding the parameter $h$ which balances the bias and variance terms means that the choice of the hyperparameter is optimal. Setting the bias equal to the variance, for a sample size of $n$ we obtain a hyperparameter which attains the best possible rate,

$$h = \left( \frac{B^2 n}{\sigma^2 \log n} \right)^{-\frac{1}{2\alpha + 1}}, \qquad (1.9)$$

for $f \in \Theta_S^{\alpha + \frac12}(B)$ or $f \in \Theta_H^{\alpha}(B)$ [33]. We use this to set our parameter $\lambda = h^{2\alpha}$.
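In code, the rate-optimal tuning from (1.9) is a one-liner; the sketch below (the function name and defaults are our own) returns both $h$ and the induced scaling $\lambda = h^{2\alpha}$.

def optimal_bandwidth(n, alpha, B=1.0, sigma2=1.0):
    # Rate-optimal bandwidth (1.9) and the induced scaling lambda = h**(2*alpha).
    h = (B ** 2 * n / (sigma2 * np.log(n))) ** (-1.0 / (2 * alpha + 1))
    return h, h ** (2 * alpha)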

We will use Algorithm 1 for a data set $X_n, Y_n$ of size $n = 500$, where $y_i = f_0(x_i) + \varepsilon$, $x_i \sim U(0,1)$ and $\varepsilon \sim N(0, \tfrac{1}{10})$, for the following function:

$$f_0(x) = \sum_{i=1}^{\infty} \sin(i)\, \sqrt{2} \cos\left( \left( i - \tfrac12 \right) \pi x \right) i^{-\frac32}, \qquad (1.10)$$

where we truncate the infinite sum at $i = 1000$ for computational reasons. Let $\|\cdot\|_N$ denote the empirical norm satisfying $\|f\|_N^2 = \frac{1}{N} \sum_{i=1}^{N} f(x_i^*)^2$, where $\{x_i^*\}_{i=1}^{N}$ are the $N$ evenly spaced test points on $[0,1]$. Note that these test points are not equal to the training data points $x_i$. We will use a test input of size $N = 200$. Lastly, we will compute the pointwise posterior 95% credible intervals $CI(x^*)$ around the posterior mean,

$$\Pi(f(x^*) \in CI(x^*) \mid X_n, Y_n) \ge 0.95.$$

This is done by drawing 2000 samples of $f(x^*)$ at every test coordinate $x^*$ and calculating the 2.5% and 97.5% quantiles at these coordinates. Together, these credible intervals form a set $CI = \{CI(x_1^*), CI(x_2^*), \ldots, CI(x_N^*)\}$, which is plotted around the posterior mean. Furthermore, we demonstrate the behaviour of the prior and the corresponding posterior for comparison.
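A sketch of this data-generating step is below; it assumes the standard cosine basis $\sqrt{2}\cos((i - \frac12)\pi x)$ in (1.10) and vectorises the truncated sum.

def f0(x, trunc=1000):
    # Truth (1.10), truncated at i = trunc for computational reasons.
    i = np.arange(1, trunc + 1)
    basis = np.sqrt(2) * np.cos(np.outer(x, i - 0.5) * np.pi)   # len(x) x trunc
    return basis @ (np.sin(i) * i ** -1.5)

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, 500)
y_train = f0(x_train) + rng.normal(0, np.sqrt(1 / 10), 500)     # noise variance 1/10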


Figure 1.1.: The data $(X_n, Y_n)$ of size $n = 500$, where $x_i \sim U(0,1)$ and $y_i = f_0(x_i) + \varepsilon$ with $\varepsilon \sim N(0, \tfrac12)$.

Figure 1.2.: $\hat f_n$ on [0,1] with the 95% credible set for tuning hyperparameter $\nu = \tfrac32$.

Figure 1.3.: $\hat f_n$ on [0,1] with the 95% credible set for $\nu = \tfrac{11}{10}$.

Figure 1.4.: $\hat f_n$ on [0,1] with the 95% credible sets for $\nu = \tfrac52$.

Here one can clearly observe the influence of $\nu$ on the posterior mean. The bandwidth and smoothness of the kernel grow for larger values of $\nu$, while they shrink for smaller values of $\nu$. From this, it is apparent that both too small and too large values of $\nu$ are not desired, as they respectively undersmooth and oversmooth the function, resulting in suboptimal estimations. One can also notice the bias-variance trade-off, where a smoother estimation comes paired with smaller credible sets, and vice versa. This is also visible in the following figure, which shows the values of the empirical $L_2$ distance between the posterior mean and the truth for different values of $\nu$.


Figure 1.5.: $\|\hat f_n - f_0\|_N^2$ on [0,1] for different values of $\nu$ on an equal data set.

Figure 1.6.: Two draws from the prior with tuning hyperparameter $\nu = \tfrac32$.

Figure 1.7.: Two draws from the prior with $\nu = \tfrac{11}{10}$.

Figure 1.8.: Two draws from the prior with $\nu = \tfrac52$.

From this, we can conclude that the tuning parameter $\nu$ changes the bandwidth and the smoothness of the regression kernel, which is seen in the prior distribution. As seen in the posterior estimations, for the considered $f_0$ we have an undersmoothed prior for values of $\nu$ smaller than the smoothness of the truth, and an oversmoothed prior for values of $\nu$ greater than the smoothness of the truth.

Regarding the inference of the posterior mean, we are interested in its consistency and contraction rates. If a posterior distribution is consistent, its posterior mean will converge to the true function $f_0$. Let $\mathcal{F}$ be the function space of interest; then the posterior distribution $\Pi_n(\cdot \mid X_n)$ is said to be weakly consistent at $f_0 \in \mathcal{F}$ if $\Pi_n(f : d(f, f_0) > \varepsilon \mid X_n) \to 0$ in $P_{f_0}$-probability as $n \to \infty$ for all $\varepsilon > 0$, with strong consistency holding if the convergence is in the almost sure sense [10].

Next to the convergence towards the truth, we are interested in the rate at which this convergence happens. The contraction rate gives an upper bound for this rate. The posterior distribution $\Pi_n(\cdot \mid X_n)$ is said to contract at rate $\varepsilon_n \to 0$ at $f_0 \in \mathcal{F}$ if $\Pi_n(f : d(f, f_0) > M_n \varepsilon_n \mid X_n) \to 0$ in $P_{f_0}$-probability as $n \to \infty$ for every $M_n \to \infty$. This can be interpreted as the posterior distribution concentrating in balls of radius $\varepsilon_n$ around the true function $f_0$ [10].

A method to derive contraction rates using rescaled Gaussian processes was shown in [29], based on a general method to prove contraction rates widely used from [30]. Unfortunately, this method does not offer a way to analogously prove the contraction rate when using the distributed method. However, [33] uses direct computations to find upper bounds for the bias and the variance of the posterior mean, which we will apply to the distributed method.

From Corollary 2.1 of [33] it follows that, when using the Matern covariance kernel, with probability at least $1 - n^{-10}$ and for $h$ equal to (1.9), we have

$$\left\| \hat f_n - f_0 \right\|_\infty \lesssim h^{\alpha} + \sigma \sqrt{\frac{\log n}{n h}} = B^{\frac{1}{2\alpha+1}} \left( \frac{\sigma^2 \log n}{n} \right)^{\frac{\alpha}{2\alpha+1}}, \qquad (1.11)$$

for $f \in \Theta_S^{\alpha + \frac12}(B)$ or $f \in \Theta_H^{\alpha}(B)$. The minimax rate is $n^{-\frac{\alpha}{2\alpha+1}}$ [25], so we achieve the minimax rate up to a logarithmic term. Here $a_n \lesssim b_n$ denotes $a_n \le C b_n$ for a constant $C$


that is universal or fixed in the context. We will use this and $\gtrsim$ throughout the paper, where if we have both $a_n \lesssim b_n$ and $a_n \gtrsim b_n$, we will write $a_n \asymp b_n$ to say $a_n$ is bounded both from above and below, up to constants, by $b_n$.

Let us define the half length $l_n(x;\beta)$ by

$$l_n(x;\beta) := z_{(1+\beta)/2} \sqrt{C_n(x,x)},$$

dependent on the posterior covariance kernel (1.6). Then $CI_n(x;\beta)$ denotes the pointwise posterior credible set of probability $\beta$ at the point $x$, defined as

$$CI_n(x;\beta) := \left( \hat f_n(x) - l_n(x;\beta),\; \hat f_n(x) + l_n(x;\beta) \right).$$

By Corollary 3.1 of [33], for the pointwise posterior credible sets in the nondistributed setting it holds that, for a correct smoothness of the kernel $\alpha$ and for every $\beta \in (0,1)$, there exists a sequence of functions $\{f_{0,n}\}_{n \in \mathbb{N}} \subset \Theta_H^{\alpha}(B)$ such that

$$P(f_{0,n}(x) \in CI_n(x;\beta)) \to \beta, \quad \text{as } n \to \infty,$$

for $z_{(1+\beta)/2}$ chosen such that

$$\Pi(f(x) \in CI_n(x;\beta) \mid X_n, Y_n) = \beta.$$


2. Distributed Bayesian methods using Gaussian process priors

In the previous chapter we recalled that the computational complexity of both the posterior mean and the posterior covariance is of order $O(n^3)$, which explodes for larger values of $n$. For the matrix operations which require large amounts of computation, there are in many cases approximations available [11, 22]. In our analysis we turn to distributed methods to decrease the computation time. This is performed by dividing the data of size $n$ over $m$ machines (where we will, for simplicity, assume $n \bmod m = 0$) and calculating $m$ local posteriors, which we aggregate into a global posterior in different ways.

We start with the same problem, where we have a data set $\{X_n, Y_n\}$ which we divide over $m$ subsets $\{X_{n/m}^{(1)}, Y_{n/m}^{(1)}\}, \{X_{n/m}^{(2)}, Y_{n/m}^{(2)}\}, \ldots, \{X_{n/m}^{(m)}, Y_{n/m}^{(m)}\}$ and send these to $m$ machines. These $m$ machines use the Bayesian method with Gaussian process priors in the nonparametric regression model on their local data to compute a local posterior $\Pi_L^{(j)}(\cdot \mid X_{n/m}^{(j)}, Y_{n/m}^{(j)})$. Means and variances of these local posteriors are computed using Lemma 1. Afterwards, a parent machine computes the global posterior by combining all these local posteriors. This process can be applied recursively up to arbitrarily many levels, but we focus on the case where we have one parent machine and $m$ sub-machines. All local posteriors are trained in parallel and share the same parameters $\nu, \lambda$, where $\nu$ is taken equal to the smoothness of the truth.

2.1. Naive averaging of posteriors

As a starting point, which forms our baseline case to compare the other methods to, we analyse a naive averaging distributed approach in which in every local problem we simply use the prior $\Pi(\cdot)$ defined in (1.2). Every local machine then computes its local posterior $\Pi_L^{(j)}(\cdot \mid X_{n/m}^{(j)}, Y_{n/m}^{(j)})$. This posterior follows by Bayes' formula:

$$d\Pi_L^{(j)}(f(x^*) \mid X_{n/m}^{(j)}, Y_{n/m}^{(j)}) \propto p_{f_0,\sigma}(Y_{n/m}^{(j)} \mid f)\, d\Pi(f(x^*) \mid \alpha),$$

where the likelihood $p_{f_0,\sigma}$ for this model is given as a multiple of $e^{-\frac{1}{2} \sum_{i=1}^{n/m} (Y_i^{(j)} - f_0(x_i^{(j)}))^2 / \sigma^2}$. Locally considering the same prior and likelihood as in Section 1.1, we have local posteriors $f_{n/m}^{(j)}(x^*) \mid X_{n/m}^{(j)}, Y_{n/m}^{(j)} \sim N(\mu_j(x^*), s_j^2(x^*))$, where

$$\mu_j(x^*) = (k_*^{(j)})^T (K^{(j)} + n m^{-1} \lambda I)^{-1} Y_{n/m}^{(j)}, \qquad (2.1)$$

$$s_j^2(x^*) = \sigma^2 (n m^{-1} \lambda)^{-1} \left( k_{**} - (k_*^{(j)})^T (K^{(j)} + n m^{-1} \lambda I)^{-1} k_*^{(j)} \right), \qquad (2.2)$$

where $K^{(j)}$ and $k_*^{(j)}$ are similar to the corresponding values in Lemma 1, but based on the data subset $\{X_{n/m}^{(j)}, Y_{n/m}^{(j)}\}$. To have our local posteriors satisfying the upper bound in a minimax sense, we take

$$h = \left( \frac{B^2 n/m}{\sigma^2 \log(n/m)} \right)^{-\frac{1}{2\alpha+1}}, \qquad (2.3)$$

for $f \in \Theta_S^{\alpha + \frac12}(B)$ or $f \in \Theta_H^{\alpha}(B)$.

In this method, we take a draw from $f_{n/m}^{(j)}(x^*)$ for every local posterior $\Pi_L^{(j)}(\cdot \mid X_{n/m}^{(j)}, Y_{n/m}^{(j)})$. Afterwards, one takes the average of these $m$ draws to obtain the global posterior predictive distribution

$$f_{n,m}(x^*) = \frac{1}{m} \sum_{j=1}^{m} f_{n/m}^{(j)}(x^*).$$

Note that for $f_{n/m}^{(j)}(x^*) \sim N(\mu_j(x^*), s_j^2(x^*))$ we have

$$\frac{1}{m} \sum_{j=1}^{m} f_{n/m}^{(j)}(x^*) \sim N\left( \frac{1}{m} \sum_{j=1}^{m} \mu_j(x^*),\; \frac{1}{m^2} \sum_{j=1}^{m} s_j^2(x^*) \right). \qquad (2.4)$$

Applying this to our Bayesian nonparametric regression method, we obtain

$$f_{n,m}(x^*) \mid X_n, Y_n \sim N\left( \frac{1}{m} \sum_{j=1}^{m} \mu_j(x^*),\; \frac{1}{m^2} \sum_{j=1}^{m} s_j^2(x^*) \right).$$

As we consider multiple different distributed methods, we denote the current method by method I, for which we obtain the posterior covariance function

$$C_{n,m}^{I}(x, x') = \sigma^2 (n\lambda)^{-1} \frac{1}{m} \sum_{j=1}^{m} \left( k(x, x') - k(x, X_{n/m}^{(j)}) (K^{(j)} + (n/m)\lambda I)^{-1} k(X_{n/m}^{(j)}, x') \right), \qquad (2.5)$$

and posterior mean $\hat f_{n,m}^{I} = \mathrm{E}[f_{n,m}^{I} \mid X_n, Y_n]$.

Now, using this for a data set of size $n = 2000$ with parameters $m = 100$, $\sigma = \tfrac12$ and $\nu = \tfrac32$, we compare the distributed method to the nondistributed method on the same data set.
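A minimal sketch of method Ia, reusing the gp_posterior helper from the earlier sketch (the shard-splitting and the function names are our own choices):

def local_posteriors(x, y, x_test, m, sigma2, nu, lam):
    # Local posterior means (2.1) and variances (2.2) of the m machines.
    mus, s2s = [], []
    for x_j, y_j in zip(np.array_split(x, m), np.array_split(y, m)):
        mu_j, s2_j = gp_posterior(x_j, y_j, x_test, sigma2, nu, lam)
        mus.append(mu_j)
        s2s.append(s2_j)
    return np.array(mus), np.array(s2s)        # both of shape (m, N)

def aggregate_naive(mus, s2s):
    # Method Ia, eq. (2.4): the average of the m local draws is Gaussian.
    m = len(mus)
    return mus.mean(axis=0), s2s.sum(axis=0) / m ** 2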


Figure 2.1.: Distributed $\hat f_{n,m}^{I}$ with pointwise 95% credible sets for $n = 2000$, $m = 100$, $\sigma = \tfrac12$ and $\nu = \tfrac32$.

Figure 2.2.: Nondistributed $\hat f_n$ using equal data.

From the simulation study, we conclude that this method performs well neither for posterior estimation nor for pointwise posterior uncertainty quantification. The posterior mean is smoother than in the nondistributed case, which tells us that tuning parameters which are optimal for the local problems might return good local results, but provide an overly smooth estimator globally. We also see that the credible sets provide worse uncertainty quantification, which is apparent from (2.4) and confirms the suboptimal bias-variance trade-off.

Theorem 1 (Naive averaging bounds). Consider a covariance matrix $K$ corresponding to the Matern kernel for a certain smoothness parameter $\alpha$. Let $h$ be equal to the value (2.3). Then there exists a function $f_0 \in \Theta_H^{\alpha}(B)$ such that, as $(n/m)^{\frac{1-2\alpha}{\alpha}} m^{\frac12} \to 0$,

$$\left\| \hat f_{n,m}^{I} - f_0 \right\|_\infty \gtrsim \left( \frac{\log(n/m)}{n/m} \right)^{\frac{\alpha}{2\alpha+1}} \Big/ \log^2\left( \frac{n/m}{\log(n/m)} \right), \qquad (2.6)$$

with probability at least $1 - (2m+2)(n/m)^{-10}$ with respect to the randomness of the data $X_n, Y_n$, and a function $f_0 \in \Theta_S^{\alpha + \frac12}(B)$ such that, as $(n/m)^{\frac{1-2\alpha}{\alpha}} m^{\frac12} \to 0$,

$$\left\| \hat f_{n,m}^{I} - f_0 \right\|_\infty \gtrsim \left( \frac{\log(n/m)}{n/m} \right)^{\frac{\alpha}{2\alpha+1}} \Big/ \log\left( \frac{n/m}{\log(n/m)} \right), \qquad (2.7)$$

with probability at least $1 - (2m+2)(n/m)^{-10}$ with respect to the randomness of the data $X_n, Y_n$.

Proof. The proof of this theorem is given in Section 4.5.

From this theorem, we can conclude that the $L_\infty$ loss of the posterior mean for certain truths is suboptimally large with high probability. This confirms the suboptimal estimation seen in the simulation experiments.


Theorem 2 (Naive averaging pointwise credible set coverage). Let $h$ be equal to the value (2.3). Then, for any $x \in [0,1]$, there always exists a function $f_{bad} \in \Theta_S^{\alpha + \frac12}(B)$ such that, as $n/m \to \infty$ and $m/\log^2(n) \to \infty$, we have

$$P(f_{bad}(x) \in CI_{n,m}^{I}(x)) \to 0.$$

Proof. The proof of this theorem is given in Section 4.6.

These two theorems now prove the conclusions from the simulation study. The distributed method using naive averaging provides neither an optimal rate of convergence towards the truth nor reliable uncertainty quantification.

2.2. Aggregation by product of experts

We saw in the last section that simply taking draws from all separate posteriors and averaging them returns suboptimal results. There are now two options to change the estimation procedure: either changing the local posteriors, or changing the method of aggregation of these local posteriors. This section focuses on the latter approach.

We can aggregate the posteriors differently by looking at the product of all local posteriors, $\prod_{j=1}^{m} d\Pi_L^{(j)}(f(x^*) \mid X_{n/m}^{(j)}, Y_{n/m}^{(j)})$, and normalising this product, which then once more returns a probability measure, as proposed in [18]. To do so, we first look at the product of two Gaussian densities with parameters $(\mu_1, \sigma_1^2)$ and $(\mu_2, \sigma_2^2)$, respectively. This is given, up to a constant, by

This is given, up to a constant, by

exp

[−1

2(x− µ1)2/σ2

1

]exp

[−1

2(x− µ2)2/σ2

2

]∝ exp

[−1

2((x2 − 2xµ1)/σ2

1 + (x2 − 2xµ2)/σ22))

]∝ exp

[−1

2

((x2 − 2xµ1)σ2

2 + (x2 − 2xµ2)σ21

σ21σ

22

)]∝ exp

[−1

2

(x2(σ2

1 + σ22)− 2x(µ1σ

22 + µ2σ

21)

σ21σ

22

)]

∝ exp

−1

2

x2 − 2xµ1σ2

2+µ2σ21

σ21+σ2

2

σ21σ

22

σ21+σ2

2

.(2.8)

Also, we have

$$\exp\left[-\tfrac12\, \frac{(x - \mu)^2}{\sigma^2}\right] \propto \exp\left[-\tfrac12\, \frac{x^2 - 2x\mu}{\sigma^2}\right],$$

therefore we can conclude that equation (2.8) is proportional to a Gaussian density with


parameters

$$\mu = \frac{\mu_1\sigma_2^2 + \mu_2\sigma_1^2}{\sigma_1^2 + \sigma_2^2} = \sigma^2 \left( \frac{\mu_1}{\sigma_1^2} + \frac{\mu_2}{\sigma_2^2} \right), \qquad \sigma^2 = \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2} = \frac{1}{\frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}}.$$

Thus, normalising the product of two Gaussians gives us a Gaussian with these parameters. Generalising this procedure to a product of $m$ Gaussians, we find that the normalised product is again Gaussian [3], with parameters

$$\mu = \sigma^2 \sum_{j=1}^{m} \frac{\mu_j}{\sigma_j^2}, \qquad (2.9)$$

$$\sigma^2 = \left( \sum_{j=1}^{m} \sigma_j^{-2} \right)^{-1}. \qquad (2.10)$$
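In the pointwise Gaussian case, the product-of-experts aggregation (2.9)-(2.10) amounts to precision weighting; a minimal sketch (our naming) on top of the local_posteriors helper:

def aggregate_poe(mus, s2s):
    # Method Ip, eqs. (2.9)-(2.10): normalised product of the m local Gaussians.
    precision = (1.0 / s2s).sum(axis=0)
    return (mus / s2s).sum(axis=0) / precision, 1.0 / precision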

Using the same parameters and data set as for the naive averaging method, we now obtain the following simulation results.

Figure 2.3.: Posterior mean $\hat f_{n,m}^{Ip}$ for $n = 2000$, $m = 100$, $\sigma = \tfrac12$ and $\nu = \tfrac32$.

Figure 2.4.: Nondistributed posterior $\hat f_n$ using equal data.

Here we see the result still being much smoother than in the nondistributed case; the present procedure thus does not give an optimal result. This suggests that, using these aggregation methods, the local data does not contain sufficient information for the local machines to compute a good estimation. From now on, we will write Na and Np for method N using aggregation by naive averaging and by product of experts, respectively.


2.3. Adjusted local likelihoods

To compensate for the underrepresentation of data in the local problems, we can raise the power of the local likelihoods. Recall that our local likelihoods are of the form

$$p_{f_0,\sigma}(Y_{n/m}^{(j)}) = \left( \frac{1}{\sqrt{2\pi}\,\sigma} \right)^{n/m} e^{-\sum_{i=1}^{n/m} \frac{(y_i^{(j)} - f_0(x_i^{(j)}))^2}{2\sigma^2}}.$$

Thus, raising this to the power $m$, as proposed in [25], we obtain the modified local likelihood

$$\left( p_{f_0,\sigma}(Y_{n/m}^{(j)}) \right)^m \propto e^{-m \sum_{i=1}^{n/m} \frac{(y_i^{(j)} - f_0(x_i^{(j)}))^2}{2\sigma^2}}.$$

Computing the posterior in this case follows easily from Lemma 1, but with $\sigma^2$ replaced by $m^{-1}\sigma^2$. Thus we now have local posteriors of the form

$$(f_{n/m}^{II})^{(j)}(x^*) \mid X_{n/m}^{(j)}, Y_{n/m}^{(j)} \sim N(\mu_j(x^*), \sigma_j^2(x^*)),$$

where

$$\mu_j(x^*) = (k_*^{(j)})^T (K^{(j)} + n m^{-1} \lambda I)^{-1} Y_{n/m}^{(j)}, \qquad (2.11)$$

$$\sigma_j^2(x^*) = \sigma^2 (n\lambda)^{-1} \left( k_{**} - (k_*^{(j)})^T (K^{(j)} + n m^{-1} \lambda I)^{-1} k_*^{(j)} \right), \qquad (2.12)$$

from which it follows that here

$$h = \left( \frac{B^2 n}{\sigma^2 \log(n/m)} \right)^{-\frac{1}{2\alpha+1}}, \qquad (2.13)$$

for $f \in \Theta_S^{\alpha + \frac12}(B)$ and $f \in \Theta_H^{\alpha}(B)$, as both the numerator and the denominator contain a factor $m^{-1}$ in this setting, which cancel each other out.
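Since the only change relative to method I is $\sigma^2 \mapsto \sigma^2/m$ in Lemma 1 (the regulariser $n m^{-1} \lambda I$ is untouched), the earlier sketch can be reused directly; a hypothetical wrapper:

def local_posteriors_adj_likelihood(x, y, x_test, m, sigma2, nu, lam):
    # Method II: likelihood raised to the power m, i.e. Lemma 1 with
    # sigma^2 replaced by sigma^2 / m; lam should now be set via (2.13).
    return local_posteriors(x, y, x_test, m, sigma2 / m, nu, lam)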

Using this method, for the same parameters as in earlier sections, $n = 2000$, $m = 100$, $\nu = \tfrac32$ and $\sigma^2 = \tfrac12$, we obtain the following results.

Figure 2.5.: $\hat f_{n,m}^{IIa}$ using the adjusted local likelihoods method, aggregating by averaging the posteriors.

Figure 2.6.: $\hat f_{n,m}^{IIp}$ using the adjusted local likelihoods method, aggregating by product of experts.


One can observe that this method indeed improves on the earlier naive methods, as the estimation is more accurate. However, the credible sets in this case are very small and barely visible, and thus do not cover $f_0$ well, which means there is still room for improvement. Note that with this different setup we do see a difference between the naive averaging and product of experts methods, with the product of experts method providing a better estimation than the naive averaging method.

Theorem 3 (Adjusted local likelihoods and averaging bounds). Consider a covariance matrix $K$ corresponding to the Matern kernel for a certain smoothness parameter $\alpha$. Let $h$ be equal to the value (2.13). Then, for every function $f_0 \in \Theta_S^{\alpha + \frac12}(B)$ or $f_0 \in \Theta_H^{\alpha}(B)$, we obtain, as $n^{\frac{1-2\alpha}{\alpha}} m^{\frac{1}{2\alpha} - \frac{1-2\alpha}{\alpha}} \to 0$,

$$\left\| \hat f_{n,m}^{IIa} - f_0 \right\|_\infty \lesssim \left( \frac{\log(n/m)}{n} \right)^{\frac{\alpha}{2\alpha+1}}, \qquad (2.14)$$

with probability at least $1 - (2m+2)(n/m)^{-10}$ with respect to the randomness of the data $X_n, Y_n$.

Proof. The proof of this theorem is given in Section 4.7.

This shows us that the posterior mean converges to the truth at an order similar, up to a logarithmic term, to the nondistributed case, which gives us an optimal estimation in the minimax sense. However, the coverage of the pointwise posterior credible set is suboptimal, as shown in the following theorem.

Theorem 4 (Adjusted local likelihoods and averaging pointwise credible set coverage). Suppose $h$ is chosen minimax optimal as in (2.3). Then, for every $x \in [0,1]$, there always exists a function $f_{bad} \in \Theta_S^{\alpha + \frac12}(B)$ such that, as $n/m \to \infty$ and $m/\log^2(n) \to \infty$, we have

$$P(f_{bad}(x) \in CI_{n,m}^{IIa}(x)) \to 0.$$

Proof. The proof of this theorem is given in Section 4.8.

This shows us that the pointwise posterior credible intervals are suboptimal and there is still room for improvement in this method. The simulation studies suggest that aggregation by product of experts does not improve the procedure.

2.4. Adjusted priors

Raising the likelihood to the m’th power poses a solution to the rate of convergence ofthe posterior as seen in last section. However, using this setup, we saw there was stillunreliable uncertainty quantification using the aggregation methods considered. Thisleaves us to find a procedure giving us reliable uncertainty quantification as well asoptimal recovery. Similar to the method used before where the likelihood was raisedto the power of m, we can consider raising the prior density to the 1/m’th power asproposed in [23].


We have seen that raising a Gaussian density to the power $1/p$ is equal to multiplying the variance by a factor $p$. This also holds for our prior density, where raising the prior density to the power $1/m$ corresponds to the covariance matrix being multiplied by the factor $m$:

$$\left( |2\pi\Sigma|^{-\frac12} \right)^{\frac1m} e^{-\frac{1}{2m}(x - \mu)^T \Sigma^{-1} (x - \mu)} \propto e^{-\frac12 (x - \mu)^T (m\Sigma)^{-1} (x - \mu)}.$$

Thus in this setting, raising the prior density to the power $1/m$ is equal to changing the Gaussian process prior to $GP(0, \sigma^2 (n m^{-2} \lambda)^{-1} K)$. This is equivalent to rescaling the function $k$ as $k^{III} = mk$.

Computing the posterior predictive and its 95% credible sets using the adjusted prior, we obtain the following results.

Figure 2.7.: $\hat f_{n,m}^{IIIa}$ using the adjusted priors method, aggregating by averaging the posteriors, for $n = 2000$, $m = 100$, $\sigma = \tfrac12$ and $\nu = \tfrac32$.

Figure 2.8.: $\hat f_{n,m}^{IIIp}$ using the adjusted priors method, aggregating by product of experts.

Figure 2.9.: $\|\hat f - f_0\|_N^2$ on [0,1] for different values of $\sigma^2$.


From the simulations we see that rescaling the prior improves the estimation as well as the coverage of the pointwise posterior credible set. Once more, the aggregation by product of experts seems to give a slightly better result than the naive averaging method, the same as observed in the last section. The posterior spread specifically seems to differ between the two aggregation methods. This can be explained because we have

$$\left( \sum_{j=1}^{m} \frac{\sigma_j^{-2}}{m} \right)^{-1} \le \frac{1}{m} \sum_{j=1}^{m} \sigma_j^2,$$

by the harmonic-arithmetic mean inequality. Multiplying both sides by $1/m$ results in a left-hand side equal to the variance in the product of experts setting, and a right-hand side equal to the variance in the averaging setting. This confirms the observations.

Theorem 5 (Adjusted priors and averaging bounds). Consider a covariance matrix $K$ corresponding to the Matern kernel for a certain smoothness parameter $\alpha$. Let $h$ be equal to the value (2.3). Then, for every function $f_0 \in \Theta_S^{\alpha + \frac12}(B)$, we obtain, as $n^{\frac{1-2\alpha}{\alpha}} m^{\frac{2-3\alpha}{2\alpha} + \frac{\alpha+1}{4\alpha^2}} \to 0$,

$$\left\| \hat f_{n,m}^{IIIa} - f_0 \right\|_\infty \lesssim \left( \frac{\log(n/m)}{n} \right)^{\frac{\alpha}{2\alpha+1}}, \qquad (2.15)$$

with probability at least $1 - (2m+2)(n/m)^{-10}$ with respect to the randomness of the data $X_n, Y_n$.

Proof. The proof of this theorem is given in Section 4.9.

As we have seen in the simulation results, this method does not only offer an optimal sup-norm upper bound, but also pointwise credible set coverage of the truth. This is proven in the next theorem.

Theorem 6 (Adjusted priors and averaging pointwise credible set coverage). For any $\beta \in (0,1)$ and $x \in [0,1]$, there is some sufficiently large constant $B > 0$ and a sequence of functions $\{f_n\}_{n \in \mathbb{N}}$, where every $f_n \in \Theta_S^{\alpha + \frac12}(B)$, such that

$$P\left( f_n(x) \in CI_{n,m}^{IIIa}(x; \beta) \right) \to \beta,$$

as $m, n/m^4 \to \infty$.

Proof. The proof of this theorem is given in Section 4.10.

2.5. Adjusted local likelihoods and Wasserstein barycenters

As we have seen, adjusting the local likelihoods improves recovery, but results in questionable uncertainty quantification. A way to improve this could be to try another method of aggregating posteriors, for instance computing their Wasserstein barycenter [25].

To do so, we look at the 2-Wasserstein distance between two probability measures $\mu$ and $\nu$, which is defined as

$$W_2^2(\mu, \nu) = \inf_{\gamma} \int\!\!\int \|x - y\|_2^2\, \gamma(dx, dy),$$

where the infimum is taken over all measures $\gamma$ on $[0,1] \times [0,1]$ with marginals $\mu$ and $\nu$. Furthermore, the 2-Wasserstein barycenter of $m$ probability measures $\mu_1, \mu_2, \ldots, \mu_m$ is defined as

$$\bar\mu = \operatorname*{argmin}_{\mu} \sum_{i=1}^{m} W_2^2(\mu, \mu_i).$$

From [25] we know that the Wasserstein barycenter of $m$ Gaussians is again Gaussian, with a mean equal to the average mean of the $m$ Gaussians, and a variance parameter equal to the average variance of the Gaussians. This gives a result similar to averaging the posteriors, but with a variance term that is $m$ times as large:

$$f_{n,m}^{IV}(x^*) \mid X_n, Y_n \sim N\left( \frac{1}{m} \sum_{j=1}^{m} \mu_j(x^*),\; \frac{1}{m} \sum_{j=1}^{m} s_j^2(x^*) \right). \qquad (2.16)$$

As this variance term is $m$ times as large as the naive averaging variance, we see that applying this aggregation method after raising the power of the likelihood compensates for the small posterior spread of that method when aggregating by taking the average. Therefore we expect to see optimal results for both the bias and the coverage of the pointwise posterior credible sets.
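Pointwise, the barycenter (2.16) only averages the local means and variances; a minimal sketch in our naming:

def aggregate_wasserstein(mus, s2s):
    # Method IV, eq. (2.16): the barycenter mean is the average local mean,
    # its variance the average local variance (m times the naive variance).
    return mus.mean(axis=0), s2s.mean(axis=0)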

Figure 2.10.: $\hat f_{n,m}^{IV}$ on [0,1] using the adjusted local likelihoods with aggregation by Wasserstein barycenter, for $n = 2000$, $m = 100$, $\sigma = \tfrac12$, $\nu = \tfrac32$.


2.6. Generalised Product of Experts

Another procedure we can apply to the problem is the generalised product of experts method [5]. This method is based on the same principle of aggregation by product of experts, with a slight adjustment. By raising the local posteriors to the power $a_j$, we again get a Gaussian posterior, but with new parameters:

$$\mu(x^*) = \sigma^2(x^*) \sum_{j=1}^{m} \frac{a_j \mu_j(x^*)}{\sigma_j^2(x^*)}, \qquad (2.17)$$

$$\sigma^2(x^*) = \left( \sum_{j=1}^{m} a_j \sigma_j^{-2}(x^*) \right)^{-1}. \qquad (2.18)$$

Because this allows for infinitely many combinations, we will look at the case where $a_j = a$ for all $j$. We will look at the estimation $\hat f_{n,m}^{V}$ for different values of $a$ on the same data set $X_n, Y_n$, with a suggested value of $a = 1/m$ [7].
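For equal weights $a_j = a$, (2.17)-(2.18) reduce to a tempered product of experts; a minimal sketch:

def aggregate_gpoe(mus, s2s, a):
    # Method V, eqs. (2.17)-(2.18) with a_j = a: tempering every local
    # expert rescales the combined precision, widening the credible sets.
    precision = (a / s2s).sum(axis=0)
    return (a * mus / s2s).sum(axis=0) / precision, 1.0 / precision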

Figure 2.11.: $\|\hat f - f_0\|_N^2$ per value of $a$.

Figure 2.12.: $\hat f_{n,m}^{V}$ on [0,1] for $n = 2000$, $m = 100$, $a = 1/m$, $\sigma = \tfrac12$ and $\nu = \tfrac32$.

Here we notice that $a$ does not seem to have an obvious relation to the contraction of the estimation, but does seem to have an influence on the credible set coverage.


Figure 2.13.: $\hat f_{n,m}^{V}$ for $a = \frac{1}{3m}$.

Figure 2.14.: $\hat f_{n,m}^{V}$ for $a = \frac{1}{\sqrt{m}}$.

From here, we can also conclude that the value of $a$ does not significantly change the estimation of $f_0$, but only the size of the credible sets. Similar to the product of experts aggregation method from Section 2.2, this method also returns suboptimal recovery.

2.7. The Bayesian Committee Machine

The Bayesian Committee Machine proposes another option to aggregate posteriors [28]. For this aggregation method we once more divide $X_n, Y_n$ into $m$ subsets, which we denote in this setting by $\{D_j\}_{j=1}^{m} := \{\{X_{n/m}^{(j)}, Y_{n/m}^{(j)}\}\}_{j=1}^{m}$. Moreover, we define $\mathcal{D}^i = \{D_1, \ldots, D_i\}$. Now we are interested in $d\Pi(f^q \mid \mathcal{D}^{i-1}, D_i)$, where $f^q$ is the vector of unknown target measurements corresponding to a given set of test points $x_{q_1}, \ldots, x_{q_N}$. From Bayes' rule we know that this is proportional to

$$d\Pi(f^q)\, p_{f_0,\sigma}(\mathcal{D}^{i-1} \mid f^q)\, p_{f_0,\sigma}(D_i \mid \mathcal{D}^{i-1}, f^q). \qquad (2.19)$$

In this equation, we will now make the approximation

$$p_{f_0,\sigma}(D_i \mid \mathcal{D}^{i-1}, f^q) \approx p_{f_0,\sigma}(D_i \mid f^q).$$

Only when conditioned on the whole function, $f^q \equiv f$, is $D_i$ exactly independent of $\mathcal{D}^{i-1}$. The approximation might still be reasonable for large $N$, or when the correlation between $D_i$ and $\mathcal{D}^{i-1}$ is very small. By [28] we obtain

$$\begin{aligned}
d\Pi(f^q \mid \mathcal{D}^{i-1}, D_i) &\propto d\Pi(f^q)\, p_{f_0,\sigma}(\mathcal{D}^{i-1} \mid f^q)\, p(D_i \mid \mathcal{D}^{i-1}, f^q) \\
&\approx d\Pi(f^q)\, p_{f_0,\sigma}(\mathcal{D}^{i-1} \mid f^q)\, p_{f_0,\sigma}(D_i \mid f^q) \\
&\propto \frac{\left[ p_{f_0,\sigma}(\mathcal{D}^{i-1} \mid f^q)\, d\Pi(f^q) \right] \left[ p_{f_0,\sigma}(D_i \mid f^q)\, d\Pi(f^q) \right]}{d\Pi(f^q)} \\
&\propto \frac{d\Pi(f^q \mid \mathcal{D}^{i-1})\, d\Pi(f^q \mid D_i)}{d\Pi(f^q)}.
\end{aligned}$$


Using the fact that $\mathcal{D}^{m-1} \cup D_m = D$, we obtain

$$d\Pi(f^q \mid D) \prod_{i=1}^{m-1} d\Pi(f^q \mid \mathcal{D}^{i-1}, D_i) = \prod_{i=1}^{m} d\Pi(f^q \mid \mathcal{D}^{i-1}, D_i) \approx C \prod_{i=1}^{m} \frac{d\Pi(f^q \mid \mathcal{D}^{i-1})\, d\Pi(f^q \mid D_i)}{d\Pi(f^q)} = C\, \frac{\prod_{i=1}^{m-1} d\Pi(f^q \mid \mathcal{D}^{i})\; \prod_{i=1}^{m} d\Pi(f^q \mid D_i)}{d\Pi(f^q)^{m-1}},$$

from which follows

$$d\Pi(f^q \mid D) \propto \frac{\prod_{i=1}^{m} d\Pi(f^q \mid D_i)}{d\Pi(f^q)^{m-1}}. \qquad (2.20)$$

We know that $d\Pi(f^q \mid D_i)$ and $d\Pi(f^q)$ are both Gaussian in our regression setting; thus we need the product of multivariate Gaussian densities and the fraction of multivariate Gaussian densities to explicitly compute the approximate posterior. The product of two multivariate Gaussian densities with parameters $\mu_1, \Sigma_1$ and $\mu_2, \Sigma_2$ is proportional to a multivariate Gaussian with parameters

$$\mu = (\Sigma_1^{-1} + \Sigma_2^{-1})^{-1}(\Sigma_1^{-1}\mu_1 + \Sigma_2^{-1}\mu_2), \qquad \Sigma = (\Sigma_1^{-1} + \Sigma_2^{-1})^{-1}.$$

For the fraction of two multivariate Gaussian densities we obtain a multivariate Gaussian density with parameters [27]

$$\mu = (\Sigma_1^{-1} - \Sigma_2^{-1})^{-1}(\Sigma_1^{-1}\mu_1 - \Sigma_2^{-1}\mu_2), \qquad \Sigma = (\Sigma_1^{-1} - \Sigma_2^{-1})^{-1}.$$

As observed earlier, raising a Gaussian density to the $(m-1)$'th power results in a Gaussian density with its covariance divided by $m - 1$. For a prior with parameters $0, K$, this results in the divisor being equal to a density with parameters $0, (m-1)^{-1}K$. Furthermore, $\prod_{i=1}^{m} \Pi(f^q \mid D_i)$ is proportional to a multivariate Gaussian distribution with parameters

$$\mu = \Sigma \sum_{i=1}^{m} \Sigma_i^{-1} \mu_i, \qquad \Sigma = \left( \sum_{i=1}^{m} \Sigma_i^{-1} \right)^{-1}.$$

Combining these two, we see that $\frac{\prod_{i=1}^{m} d\Pi(f^q \mid D_i)}{(d\Pi(f^q))^{m-1}}$ is proportional to a multivariate Gaussian distribution with parameters

$$\mu = \Sigma \sum_{i=1}^{m} \Sigma_i^{-1} \mu_i, \qquad (2.21)$$

$$\Sigma = \left( \sum_{i=1}^{m} \Sigma_i^{-1} - (m-1)K^{-1} \right)^{-1}. \qquad (2.22)$$
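Applied pointwise (diagonal covariances only), (2.21)-(2.22) again reduce to precision arithmetic; a minimal sketch, where prior_var is assumed to be the pointwise prior variance $\sigma^2 (n m^{-1} \lambda)^{-1} k(x^*, x^*)$ of the local problems:

def aggregate_bcm(mus, s2s, prior_var):
    # Method VI, eqs. (2.21)-(2.22) pointwise: sum the m local precisions,
    # then remove the prior precision m - 1 times, since the product of the
    # m local posteriors counts the prior m times.
    m = len(mus)
    precision = (1.0 / s2s).sum(axis=0) - (m - 1) / prior_var
    return (mus / s2s).sum(axis=0) / precision, 1.0 / precision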

Applying this to the earlier data, we obtain the following result.

Figure 2.15.: $\hat f_{n,m}^{VI}$ on [0,1] using the Bayesian committee machine for $n = 2000$, $m = 100$, $\sigma = \tfrac12$ and $\nu = \tfrac32$.

From this we can conclude that this aggregation method has both optimal contraction and coverage of the pointwise posterior credible sets. One point to note is that there is a large probability of encountering numerical errors in the calculation, resulting in $\Sigma$ possibly not being positive semi-definite, meaning there are some eigenvalues close to 0 with imaginary parts. This problem occurs regularly when creating $K$ using the Matern kernel, as well as when computing $\Sigma_i$ for a local machine $i$ (this problem had not shown up before, because we evaluated the function at different values of $x^*$ separately, instead of drawing the whole function from a Gaussian distribution). For a degenerate matrix $A$ which in theory should be positive semi-definite, one can use $A + cI$ for a small $c > 0$, as this causes all eigenvalues to be bounded away from 0.
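A common way to implement this jitter fix, sketched below with an assumed starting value of c, is to grow c until a Cholesky factorisation succeeds:

def add_jitter(A, c=1e-10, max_tries=10):
    # Stabilise a theoretically positive semi-definite matrix by adding c*I,
    # increasing c until the Cholesky factorisation succeeds.
    for _ in range(max_tries):
        try:
            np.linalg.cholesky(A + c * np.eye(A.shape[0]))
            return A + c * np.eye(A.shape[0])
        except np.linalg.LinAlgError:
            c *= 10
    raise np.linalg.LinAlgError("matrix could not be stabilised")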

2.8. Summary of distributed methods

We have seen that there are many methods available to improve on the naive averaging of posteriors. Various methods were introduced to improve the performance of the distributed method, by changing the computation of the local posteriors, by changing the aggregation method of the local posteriors, or by a combination of both, with mixed results. The naive averaging and product of experts methods do not show a significant difference in terms of recovery or uncertainty quantification. The only obvious difference between the two is the slightly rougher paths from the product of experts method, which is also apparent in the theory by the arithmetic-harmonic mean inequality. For methods I, IIa and IIIa we have proven sup-norm bounds for the $L_\infty$ distance between the estimator and the truth and frequentist coverage of the pointwise posterior credible sets. For all other methods, we have seen simulation studies which suggest likely results for the distance between the truth and the estimator, and for the frequentist coverage of the pointwise posterior credible sets. We summarise the results of the simulation studies in Table 2.1.

Method                                                      Optimal recovery   Uncertainty quantification
Method Ia: Naive averaging                                  No                 No
Method Ip: Product of experts                               No                 No
Method IIa: Adjusted likelihoods + NA                       Yes                No
Method IIp: Adjusted likelihoods + PoE                      Yes                No
Method IIIa: Adjusted priors + NA                           Yes                Yes
Method IIIp: Adjusted priors + PoE                          Yes                Yes
Method IV: Adjusted likelihoods + Wasserstein barycenter    Yes                Yes
Method V: Generalised product of experts                    No                 Yes
Method VI: Bayesian committee machine                       Yes                Yes

Table 2.1.: Performance of the different estimation methods using the Matern kernel.

From the simulation studies, one can see that methods IIIa, IIIp, IV and VI show results comparable to the nondistributed method. Methods Ia, Ip and V show suboptimal estimation of the truth, but V does show good frequentist coverage of the pointwise posterior credible sets. Methods IIa and IIp show optimal estimation, but too-small pointwise posterior credible sets, giving suboptimal frequentist coverage of the pointwise posterior credible sets.

One does have to keep in mind that all these estimations are made with complete knowledge of the smoothness $\alpha$. For an unknown value of $\alpha$, adaptive methods are needed to estimate the smoothness from the raw data. Methods to adaptively estimate this smoothness in a nondistributed setting have been investigated in [18, 25]. We will see in Section 3.2 that this results in suboptimal estimations for all methods in the distributed case.


3. Distributed Bayesian methods using the squared exponential kernel

An initial goal was to improve the distributed methods from [18], which work similarly to the distributed methods we have considered in Chapter 2, but use the squared exponential kernel instead. As [31] states that the optimal value of the bandwidth parameter is of order $n^{-1/(2\alpha+1)}$, this leads us to define the squared exponential kernel as

$$k(x_i, x_j) = e^{-\frac{1}{2} (x_i - x_j)^2 \left( \tau^{-1} n^{-1/(2\alpha+1)} \right)^{-1}},$$

where $\tau$ is a hyperparameter influencing the bandwidth of the kernel.

However, when trying to follow the methods used in [33] to compute upper bounds for the posterior mean analogously for this kernel, a couple of complications occur. The main requirements for the theorems are polynomially decaying eigenvalues and uniformly bounded, Lipschitz continuous eigenfunctions. The eigenvalues of this covariance kernel are not polynomially but exponentially decaying, of the form $\mu_i \asymp \alpha e^{-i\alpha/2}$ [21], which does not satisfy the eigenvalue properties used in the proof for the nondistributed setting in [33]. This implies that for the squared exponential kernel, theoretical results in the nondistributed setting are first required before proving them in the distributed setting.

Furthermore, the eigenvalues and eigenfunctions of the kernel are of the form

$$\mu_k = C \sqrt{\frac{2a}{A}}\, B^k, \qquad \phi_k(x) = e^{-(c-a)x^2} H_k(\sqrt{2c}\, x),$$

where $H_k$ denotes the $k$'th order Hermite polynomial [12]. We therefore only investigated the squared exponential kernel in a simulation study.

3.1. Simulation study

It is possible to apply the distributed methods investigated in the previous chapter using the squared exponential kernel, and we expect to see similar behaviour for these methods in the new setting. We start by computing the posterior mean using the nondistributed method, to have a baseline case to compare the distributed methods to, and to see the influence of the parameter $\tau$.
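A sketch of the kernel as defined above (the reading of the bandwidth as $\tau^{-1} n^{-1/(2\alpha+1)}$ is our assumption); it can be dropped into gp_posterior in place of the Matern kernel:

def squared_exponential(X, Y, tau, n, alpha):
    # Squared exponential kernel with bandwidth n**(-1/(2*alpha+1)) / tau,
    # so larger tau means a smaller bandwidth (rougher prior draws).
    bandwidth = n ** (-1.0 / (2 * alpha + 1)) / tau
    return np.exp(-0.5 * cdist(X, Y, 'sqeuclidean') / bandwidth)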


Figure 3.1.: The data $(X_n, Y_n)$ of size $n = 500$, where $x_i \sim U(0,1)$ and $y_i = f_0(x_i) + \varepsilon$ with $\varepsilon \sim N(0, \tfrac12)$.

Figure 3.2.: $\hat f_n$ on [0,1] with the 95% pointwise credible set for the optimal value of $\tau$ in this setting, which we denote by $\tau^*$.

Figure 3.3.: $\hat f_n$ on [0,1] with the 95% credible set for the value $10\tau^*$.

Figure 3.4.: $\hat f_n$ on [0,1] with the 95% credible set for the value $\frac{1}{10}\tau^*$.

One can observe the influence of τ on the posterior clearly. The bandwidth of thekernel shrinks for larger values of τ , whilst it grows for smaller values of τ . Fromthis, it is apparent that both too large and too small values of τ are not desired, as theyrespectively undersmooth and oversmooth the prior, resulting in sub-optimal estimators.Moreover the bias-variance tradeoff is apparent, where a smaller bias comes paired withhigher variance.

32

Page 33: Optimal Bayesian distributed methods for nonparametric regression · 2020-07-17 · Bayesian methods in nonparametric regression using Gaussian process priors have been widely used

Figure 3.5.: ||f−f0||2N on [0,1] for dif-ferent values of τ on thesame data.

Figure 3.6.: Two draws from the prior with τ = τ∗.

Figure 3.7.: Two draws from the prior with τ = 10τ∗.

33

Page 34: Optimal Bayesian distributed methods for nonparametric regression · 2020-07-17 · Bayesian methods in nonparametric regression using Gaussian process priors have been widely used

Figure 3.8.: Two draws from the prior with τ = 110τ∗.

We now consider the same distributed methods as used for the Matern kernel in Chapter2 for the squared exponential kernel. We do so by taking data of size n = 2000 withσε = 0.5 and distribute the data over m = 50 machines.

Figure 3.9.: f In,m on [0,1]. Figure 3.10.: f IIan,m on [0,1].

Figure 3.11.: f IIpn,m on [0, 1] Figure 3.12.: f IIIan,m on [0,1] for p = m.

34

Page 35: Optimal Bayesian distributed methods for nonparametric regression · 2020-07-17 · Bayesian methods in nonparametric regression using Gaussian process priors have been widely used

Figure 3.13.: f IIIpn,m on [0,1] for p = m. Figure 3.14.: f IVn,m on [0,1].

Figure 3.15.: fVn,m on [0, 1] for a = 1/m. Figure 3.16.: fV In,m on [0, 1].

From these simulations we can conclude that the methods perform similar for bothcovariance kernels. This is unsurprising as the same results were also perceived for thesignal-in-white noise model from [25]. We conclude that similar to Matern covariancekernel, we gain the following results for the squared exponential kernel

35

Page 36: Optimal Bayesian distributed methods for nonparametric regression · 2020-07-17 · Bayesian methods in nonparametric regression using Gaussian process priors have been widely used

Method Optimal recoveryUncertaintyquantification

Method Ia: Naive averaging No NoMethod Ip: Product of experts No NoMethod IIa: Adjusted likelihoods + NA Yes NoMethod IIp: Adjusted likelihoods + PoE Yes NoMethod IIIa: Adjusted priors + NA Yes YesMethod IIIp: Adjusted priors + PoE Yes YesMethod IV: Adjusted likelihoods +

Wasserstein barycenterYes Yes

Method V: Generalised product of experts No YesMethod VI: Bayesian committee machine Yes Yes

Table 3.1.: Performance of the different estimation methods using the squared exponen-tial kernel.

3.2. Distributed adaptive methods

Up until this point, we have assumed that the smoothness parameter α is known. Inpractice however, this might not be the case. There is thus need for estimation methodswhich perform without prior knowledge about α. Several adaptive methods exist toestimate the parameter from data, of which we will consider by maximising the marginallikelihood [25, 18]. These methods have shown that in the nondistributed case thereexists methods that tune the parameter α optimally using only knowledge from thedata.

The marginal likelihood in this setting is of the form

|2π(σ2K + σ2I)|−12 e−

12yT (σ2K+σ2I)−1y.

The logarithm of the marginal likelihood is proportional to

−1

2

(yT (σ2K + σ2I)−1y + log |σ2K + σ2I|

). (3.1)

The nondistributed setting uses the value of τ which maximises this quantity. Numer-ically, we compute the marginal likelihood for many values of τ ∈ (0, C] for a suffi-ciently large constant C and use the value which returns the largest marginal likelihood.Similarly, for the Matern kernel, we compute the marginal likelihood for many valuesα ∈ (0.5, C] for some large enough constant C. In the nondistributed setting, thisadaptive method performs well for large enough sample size n.

However, in the distributed setting there is no central machine containing all theinformation. In the distributed setting it was proposed that one should use the averageof the local marginal likelihoods to derive an estimator for the regularity hyper parameterα, see [18]. Then in [25] it was shown that this method performs sub-optimally, since

36

Page 37: Optimal Bayesian distributed methods for nonparametric regression · 2020-07-17 · Bayesian methods in nonparametric regression using Gaussian process priors have been widely used

the local machines contain insufficient information. For the nonparametric regressionmodel, an adaptive method is proposed in [33] by maximising the sum of the localmarginal likelihoods

m∑j=1

(−1

2

((Y

(j)n/m)T (σ2K(j) + σ2I)−1Y

(j)n/m + log |σ2K(j) + σ2I|

)). (3.2)

Figure 3.17.: Adaptive estimation forthe squared exponentialkernel using Method I:Average of local estimatedparameters

Figure 3.18.: Adaptive estimation forthe squared exponentialkernel using Method II:Median of local estimatedparameters.

Figure 3.19.: Adaptive estimation forthe squared exponentialkernel using Method III:Maximising the sum of lo-cal likelihoods.

Figure 3.20.: Adaptive estimation forthe Matern kernel usingMethod I: Average of lo-cal estimated parameters

37

Page 38: Optimal Bayesian distributed methods for nonparametric regression · 2020-07-17 · Bayesian methods in nonparametric regression using Gaussian process priors have been widely used

Figure 3.21.: Adaptive estimation forthe Matern kernel MethodIII: Maximising the sum oflocal likelihoods.

One can conclude that distributed adaptive methods do not solve the local tuning ofthe parameters, which is also an open problem for the signal in white noise model [25].Furthermore, the adaptive methods have shown numerical errors for too large samplesizes, in this setting from as small as n = 600, which forms an issue. Subsequent workcould consider finding distributed adaptive methods and solving the numerical errors.

3.3. Summary

We have seen that the different distributed methods used in Chapter 2 show equalbehaviour when applied with the squared exponential kernel. The theoretical propertiesof the estimations using this kernel seem to be harder to compute, as the eigenvaluesdecay at a different rate than the eigenvalues of the Matern kernel. Moreover, theeigenfunctions of the squared exponential kernel do not satisfy the assumptions used inthe proofs for the Matern kernel.

We have performed a simulation study on methods that estimate the hyper parametersof the kernel from a set of data, instead of assuming these values on beforehand.

38

Page 39: Optimal Bayesian distributed methods for nonparametric regression · 2020-07-17 · Bayesian methods in nonparametric regression using Gaussian process priors have been widely used

4. Proofs

In this chapter we provide the proofs of the theorems and lemmas stated in the previouschapters. Let us first begin with an outline of the notations, in which we will mostlyfollow the notation from Section 5.1 of [33].

4.1. Summary of notations

Symbol Definition

Xn/Yn x/y coordinate data of size n, {xi/yi}ni=1

f0 true functionwi Gaussian error, wi = yi − f0(xi) ∼ N(0, σ2)

fn posterior mean, E(f |xn, yn)

Cn posterior covariance function, Cn(x, x′) = Cov(f(x), f(x′) |Xn, Yn){νj}j∈N eigenvalues of the equivalent kernel, νj = µj/(µj + λ)h tuning parameter determined defined as h = λ2α

KN equivalent kernel, KN (s, t) =∑∞

j=1 νNj φj(s)φj(t)

FNλ convolution with equivalent kernel, FNλ g(t) =∫g(s)KN (s, t)ds

PNλ PNλ f = f − FNλ ffn,λ KRR solution, argminf∈H(n−1

∑ni=1(yi − f(xi))Kxi − Pλf)

Cn approximate covariance function, Cn = σ2hKANm aggregation function over m functions as used in method N

fNn,m,λ KRR solution of distributed method N , ANm({f (j)n/m,λ}

mj=1)

f(j)n/m,λ KRR solution of the j’th machine based on n/m samples

Sn,λ sample score function, Sn,λf = n−1∑n

i=1(yi − f(xi))Kxi − PλfSNn,m,λ aggregated sample score function ANm({S(j)

n/m,λ}mj=1)

Sλ population score function Sλf = Fλf0 − fCNn,m global posterior covariance function using method N

γn max{1, n−1+1/(2α)h−1/(2α)√

log n}√

(nh)−1 log n

Table 4.1.: Notation reference

We are using roman numbers to denote our different distributed methods. For anyoperator M , we use MN to denote operator M in the setting of distributed methodN , where if the roman number is left out, we refer to the nondistributed setting. Forexample, this means that KN is only different to K in a setting where the eigenvalues

39

Page 40: Optimal Bayesian distributed methods for nonparametric regression · 2020-07-17 · Bayesian methods in nonparametric regression using Gaussian process priors have been widely used

or eigenfunctions are changed. Furthermore M (j) denotes M for the j’th local machine,for every symbol.

4.2. The RKHS framework

In many of the proofs we will use results on reproducing kernel Hilbert spaces (RKHS),of which further details and proofs can be found in Chapter 1 of [33].

A (real) RKHS H is a Hilbert space of real-valued functions on a general index setX , in our setting X = [0, 1], such that for any t ∈ X , the evaluation function Lt :H → R; f 7→ f(t), is a bounded linear functional. The function Lt is a bounded linearfunctional if there exists a constant Ct > 0 such that

|Ltf | = |f(t)| ≤ Ct ‖f‖H , f ∈ H,

where ||f ||H =√〈f, f〉H is the norm associated to the Hilbert space H. We call a

symmetric function K : [0, 1]2 → R positive definite if for any n ∈ N, a1, . . . , an ∈ R,and t1, . . . , tn ∈ [0, 1],

n∑i=1

n∑j=1

aiajK(ti, tj) > 0.

As some evaluation maps are bounded linear operators for s, t ∈ [0, 1], Riesz’ represen-tation theorem states that for every t ∈ [0, 1], there exists an element Kt ∈ H such thatf(t) = 〈f,Kt〉H. This element Kt is called the representer of evaluation at t, and thekernel K(s, t) = Kt(s) is positive definite. Furthermore, given any positive definite func-tion K, a unique RKHS on [0, 1] with K as its reproducing kernel, the kernel functionK for which f(t) = 〈f,Kt〉H holds, can be constructed. We will hence use the notationKt(·) = K(t, ·) throughout for any given kernel K. Let L2([0, 1]) denote the space ofsquare integrable functions f : [0, 1] → R which satisfy

∫ 10 f

2(x)dx < ∞. Note that for

the uniform random variable X on [0, 1] we have E[f(X)] =∫ 1

0 f(x)dx. The L2 inner

product on [0, 1] of two functions is denoted as 〈f, g〉L2 =∫ 1

0 f(x)g(x)dx. This lets us useMercer’s theorem [6], which tells us that for a continuous positive definite kernel K(s, t)satisfying

∫ 10

∫ 10 K(s, t)dsdt < ∞, there exists an orthonormal sequence of continuous

eigenfunctions denoted by {φj}j∈N in L2([0, 1]) with eigenvalues µ1 ≥ µ2 . . . ≥ 0, and∫K(s, t)φj(s)dt = µjφj(s), (4.1)

K(s, t) =

∞∑j=1

µjφj(s)φj(t). (4.2)

Therefore the RKHS H determined by K is the set

{f ∈ L2([0, 1]) :

∞∑j=1

f2j /µj <∞},

40

Page 41: Optimal Bayesian distributed methods for nonparametric regression · 2020-07-17 · Bayesian methods in nonparametric regression using Gaussian process priors have been widely used

f ∈ L2([0, 1]), where fj = 〈f, φj〉L2 . Moreover, for f, g ∈ H,

‖f‖2H =∞∑j=1

f2j

µj, 〈f, g〉H =

∞∑j=1

fjgjµj

.

Let us define the kernel ridge regression (KRR) estimator

fn,λ := argminf∈H

`n,λ(f), `n,λ(f) :=

[1

n

n∑i=1

(yi − f(xi))2 + λ ‖f‖2H

], (4.3)

which coincides with the posterior mean, fn = E[f |Xn, Yn], in the nonparametric re-gression setting used [33] (see Subsection 6.2.2 from [21]). In our analysis we considerthe KRR estimator instead of the posterior mean for convenience.

Let us define a new inner product on H for a fixed λ > 0 as

〈f, g〉λ := 〈f, g〉L2 + λ〈f, g〉H,

and define the norm as ||f ||λ =√〈f, f〉λ. Note that this is again a RKHS as we have

|f(x)| ≤ Cx||f ||H ≤ C ′x||f ||λ for some constants Cx, C′x where the first inequality holds

because H is an RKHS which means |f(x)| ≤ Cx||f ||H for a certain constant Cx > 0.For two elements f =

∑∞j=1 fjφj and g =

∑∞j=1 gjφj in L2([0, 1]), we obtain

〈f, g〉λ = 〈f, g〉L2 + λ〈f, g〉H =∞∑j=1

fjgj + λ∞∑j=1

fjgjµj

=∞∑j=1

fjgjνj

, (4.4)

for

νj =1

1 + λµj

=µj

λ+ µj, j ∈ N.

This means the RKHS (H, 〈·, ·〉λ) consists of

{f =∞∑j=1

fjφj ∈ L2([0, 1]) :

∞∑j=1

f2j /νj <∞}

with

〈f, g〉λ =∞∑j=1

fjgjνj

.

We will denote the reproducing kernel of this new RKHS by

K(s, t) :=

∞∑j=1

νjφj(s)φj(t), (4.5)

which we will call the equivalent kernel of k defined in (1.1). Let us denote the representerof the evaluation map by Ks(·) = K(s, ·) which implies g(s) = 〈g, Ks〉λ for any g ∈ H.

41

Page 42: Optimal Bayesian distributed methods for nonparametric regression · 2020-07-17 · Bayesian methods in nonparametric regression using Gaussian process priors have been widely used

The eigenvalues of the Matern kernel used in this paper are polynomially decaying, ofthe form

µj � j−2α (4.6)

for α the smoothness of the kernel [33], and the eigenfunctions are of the form

φ2j−1(x) = sin(πjx), φ2j = cos(πjx). (4.7)

One can observe there exist global constants Cφ, Lφ > 0 such that the eigenfunctions{φj}j∈N of K satisfy |φj(t)| ≤ Cφ for all j ≥ 1, t ∈ [0, 1], and |φj(t)− φj(s)| ≤ Lφj|t− s|for all t, s ∈ [0, 1] and j ≥ 1 [33].

We will frequently make use of the linear operator Fλ : H → H defined as

Fλg(t) =

∫g(s)K(s, t)ds,

which forms the convolution of g(s) with the equivalent kernel K. In the current settingwhere X ∼ U(0, 1) this can also be seen as Fλg(t) = EX [g(t)KX ]. For g =

∑∞j=1 gjφj ,

in view of (4.1)

Fλg(t) =

∫ ∞∑j=1

gjφj(t)K(s, t)ds =

∞∑j=1

νjgjφj(t). (4.8)

Next to this, we define another linear operator Pλ : H → H given by

Pλf = f − Fλf, (4.9)

where for g =∑∞

j=1 gjφj we obtain

Pλg(t) =∞∑j=1

(1− νj)gjφj(t). (4.10)

We will use both FNλ and PNλ for the linear operators in the setting of distributed methodN , which differ from the nondistributed operations only if the eigenvalues of the equiva-lent kernel are changed. For example, using method II only alters the likelihood, leavingthe eigenvalues intact, but method III considers rescaling the prior, which changes thecovariance kernel, and hence the eigenvalues of the equivalent covariance kernel. For thedistributed methods, we will define the KRR estimator using distributed method N as

fNn,m,λ = ANm

({f (j)n/m,λ}

mj=1

)(4.11)

where ANm is the posterior aggregation operator of m random variables as used in dis-

tributed method N and {f (j)n/m,λ}

mj=1 is the set of local KRR estimators for the m separate

machines. Recall that for an RKHS H corresponding to the prior covariance kernel K,

42

Page 43: Optimal Bayesian distributed methods for nonparametric regression · 2020-07-17 · Bayesian methods in nonparametric regression using Gaussian process priors have been widely used

it is known that the KRR estimator coincides with the corresponding posterior mean.We will denote the sample score function

Sn,λ(f) =1

n

n∑i=1

(yi − f(xi))Kxi − Pλf, (4.12)

which follows from (14) of [33], and the population score function

Sλ(f) = Fλf0 − f. (4.13)

Note that we haveSn,λ(fn,λ) = 0,

as fn,λ is the solution to the KRR problem [33]. Furthermore, we obtain Sλ(Fλf0) =Fλf0 − Fλf0 = 0 as well. Note that by definition

Sλ(fn,λ) = Fλf0 − fn,λ, (4.14)

an identity which will be used frequently later on. For distributed methods, we willagain slightly alter the notation and define

SNn,m,λf = ANm({S(j)n/m,λ}

mj=1), (4.15)

in the setting of distributed method N , where {S(j)n/m,λ}

mj=1 is the set of local score

functions for the separate m machines. On these score functions, we state a couple ofidentities. If we define ∆fn := fn,λ − Fλf0, we obtain

∆fn = fn,λ − Fλf0 = −(Fλf0 − fn,λ) = −Sλ(fn,λ).

First note that because Pλ is a linear operator and Sn,λ(fn,λ) = 0, we obtain

−Sn,λ(Fλf0) = Sn,λ(fn,λ)− Sn,λ(Fλf0)

=1

n

n∑i=1

(yi − fn,λ(xi))Kxi − Pλfn,λ −

(1

n

n∑i=1

(yi − Fλf0(xi))Kxi − Pλ(Fλf0)

)

=1

n

n∑i=1

∆fn(xi)Kxi − Pλ(∆fn)

=1

n

n∑i=1

∆fn(xi)Kxi −∆fn + Fλ∆fn.

Note that as we are dealing with X ∼ U(0, 1), we have

Fλf(·) =

∫f(·)K(s, ·)ds = E[fKX ]. (4.16)

43

Page 44: Optimal Bayesian distributed methods for nonparametric regression · 2020-07-17 · Bayesian methods in nonparametric regression using Gaussian process priors have been widely used

To summarise, we end up with the following identities

−Sλ(fn,λ)− Sn,λ(Fλf0) = ∆fn − Sn,λ(Fλf0), (4.17)

[Sn,λ(fn,λ)− Sλ(fn,λ)]− [Sn,λ(Fλf0)− Sλ(Fλf0)] =1

n

n∑i=1

∆fn(xi)Kxi − E[∆fn(X)KX ],

(4.18)

where the last equation holds as we have Sn,λ(fn,λ) = 0 = Sλ(Fλf0).

4.3. Lemmas used in the proofs

In the proofs of this section, we will encounter similar problems, hence it is useful tostate some technical results in forms of lemmas.

The first lemma gives a general sup-norm bound for the linear operator Pλ for boththe Sobolev and Holder classes. Furthermore, we show that in both sets there existfunctions for which we have matching upper and lower bounds, up to some constants.

Lemma 2 (Sup-norm bounds for Pλ). Let Pλ be as defined in (4.9) for the Materncovariance kernel. Then, for λ = h2α the following upper bound holds

‖Pλf‖∞ . hα,

for f ∈ Θα+ 1

2S (B) or f ∈ Θα

H(B). Moreover, there exist f in both Θα+ 1

2S (B) and Θα

H(B)for which we also have that

‖Pλf‖∞ & hα/ log(h−1), ‖Pλf‖∞ & hα/ log2(h−1),

respectively.

Proof. Let us first assume that f ∈ ΘαS(B). By (4.10)

‖Pλf‖∞ ≤ ‖Pλf‖λ supx∈[0,1]

||Kx||λ, (4.19)

where || · ||λ denotes the RKHS norm defined in Section 4.2. By the fact that theeigenfunctions are uniformly bounded, we see ||Kx||2λ = |

∑∞j=1 ν

2j φ

2j (x)ν−1

j | .∑∞

i=1 νj .

We have µj � j−2α by (4.6) which implies

∞∑j=1

µjµj + λ

�∞∑j=1

1

1 + λj2α= λ−

12α

∞∑j=1

λ12α

1 +(λ

12α j)2α .

For the h equal to (2.3), limn→∞ h → 0, which implies λ → 0 as well. Notice thatthis infinite sum can be seen as a Riemann sum with step size λ1/(2α), meaning that weasymptotically have

∞∑j=1

µjµj + λ

� λ−12α

∫ ∞0

1

1 + x2αdx � λ−

12α , (4.20)

44

Page 45: Optimal Bayesian distributed methods for nonparametric regression · 2020-07-17 · Bayesian methods in nonparametric regression using Gaussian process priors have been widely used

as the integral converges for α > 12 which is satisfied by assumption. Next,

||Pλf ||2λ =

∞∑j=1

(1− νj)2f2j

νj

=

∞∑j=1

λ+ µj

)2 λ+ µjµj

f2j

= λ∞∑j=1

λ

λ+ µj

f2j

µj

. λ∞∑j=1

j2αf2j

≤ λB2, (4.21)

by definition of ΘαS(B). This implies ||Pλf ||λ . λ

12 . Combining this with (4.20), we

obtain ||Pλf ||∞ . hα−12 . Let us consider the function

f0(x) = a

∞∑j=1

(2j)−α−12

log(2j)φ2j(x),

for some constant a, which is in ΘαS(B) as

a2∞∑j=1

(2j)2α

((2j)−α−

12

log(2j)

)2

= a2∞∑j=1

1

2j log2(2j)≤ B2,

for a appropriately chosen, as the infinite sum is convergent. Let us define J := dλ−12α e,

45

Page 46: Optimal Bayesian distributed methods for nonparametric regression · 2020-07-17 · Bayesian methods in nonparametric regression using Gaussian process priors have been widely used

such that for j < J we obtain µ2j . λ. Then we obtain

‖Pλf0‖∞ =

∥∥∥∥∥∥∞∑j=1

λ

µ2j + λ(f0)2jφ2j

∥∥∥∥∥∥∞

∗=∞∑j=1

λ

µ2j + λ(f0)2j

=∞∑j=1

λµ2j

1 + λµ2j

(f0)2j

�∞∑j=1

λ(2j)2α

1 + λ(2j)2α(f0)2j

= a∞∑j=1

λ(2j)α−12

log(2j)(1 + λ(2j)2α)

≥ aJ∑j=1

λ(2j)α−12

log(2j)(1 + λ(2j)2α)

&1

log(2J)

J∑j=1

λ(2j)α−12

1 + λ(2j)2α

� λ12− 1

log(λ−12α )

J∑j=1

λ12

+ 14α (2j)α−

12

1 + λ(2j)2α,

where (∗) holds because we have φ2j(x) = cos(πjx) and the supremum of the functionis taken at x = 0.

By repeating the Riemann sum argument we obtain

J∑j=1

λ12

+ 14α (2j)α−

12

1 + λ(2j)2α�∫ 1

0

xα−12

1 + x2α= C,

as we have λ(2J)2α = 1, and as λ→ 0, J →∞. We conclude that for α > 12

‖Pλf0‖∞ &λ

12− 1

log(λ−12α )

= hα−12 / log(h−1).

By looking at α′ = α+1/2, the result follows by noting Θα′S = Θ

α+ 12

S , for which ||Pλf0|| &hα′

= hα+ 12 .

Let us now assume that f ∈ ΘαH(B). From Corollary 2.1 in [33] follows ||Pλf ||∞ . hα

by showing

‖Pλf‖∞ ≤∞∑j=1

λ

µj + λf0,j =

√λ∞∑j=1

√λµj

µj + λ

fjõj

∗.√λ∞∑j=1

jαf0,j .√λ = hα, (4.22)

46

Page 47: Optimal Bayesian distributed methods for nonparametric regression · 2020-07-17 · Bayesian methods in nonparametric regression using Gaussian process priors have been widely used

where (∗) follows because µj � j−2α and√λµj ≤ (µj+λ)/2 by the arithmetic-geometric

mean inequality and the convergence of the infinite sum follows from the definition ofΘαH(B). Now, let us consider the function

f0(x) =∞∑j=1

a(2j)−α−1

(log(2j))2φ2j(x),

for a constant a such that∞∑j=1

a

(2j)(log(2j))2≤ B,

which confirms f0 ∈ ΘαH(B) as the infinite sum is convergent. By similar reasoning as

before, we obtain

‖Pλf0‖∞ =

∥∥∥∥∥∥∞∑j=1

λ

µ2j + λ(f0)2jφ2j

∥∥∥∥∥∥∞

=∞∑j=1

λ

µ2j + λ(f0)2j

=∞∑j=1

λµ2j

1 + λµ2j

(f0)2j

�∞∑j=1

λ(2j)2α

1 + λ(2j)2α(f0)2j

= a

∞∑j=1

λ(2j)α−1

log2(2j)(1 + λ(2j)2α)

≥ aJ∑j=1

λ(2j)α−1

log2(2j)(1 + λ(2j)2α)

&1

log2(2J)

J∑j=1

λ(2j)α−1

1 + λ(2j)2α

� λ12

log2(λ−12α )

J∑j=1

λ12 (2j)α−1

1 + λ(2j)2α,

where the final statement lets us once more use Lemma A.1, now for s = 2α, t = 1,r = 1− α and ν = λ−1/(2α). Analogously to the earlier case, Lemma A.1(i) now impliesthat for α > 1

2

‖Pλf0‖∞ &λ

12

log2(λ−12α )

∞∑j=1

λ12 (2j)α−1

1 + λ(2j)2α� λ

12

log2(λ−12α )

=hα

log2(h−1).

47

Page 48: Optimal Bayesian distributed methods for nonparametric regression · 2020-07-17 · Bayesian methods in nonparametric regression using Gaussian process priors have been widely used

We conclude that there exists a function f0 ∈ ΘαH(B) such that ||Pλf0||∞ & hα/ log2(h−1).

In this chapter, we will often make use of Lemma 5.2 from [33], stated as follows.

Lemma 3. LetGn = {g ∈ H : ‖g‖H < An, ‖g‖∞ < Bn}

for An, Bn > 0. For any α > 1/2, it holds for some constant C independent of (n, h,An)that for x1, x2, . . . , xn i.i.d.

P

[sup

t∈T,g∈Gn

∣∣∣ 1n

n∑i=1

g(xi)K(xi, t)− E[g(x1)K(x1, t)]∣∣∣

> (nh)−12

(√log n+

√x+

√log

Bn‖g‖∞

+A12αn

)‖g‖∞

+ (nh)−1(

log n+ x+ logBn‖g‖∞

+A1αn max(1, (n/h)

12α− 1

2 ))‖g‖∞

]≤ e−x, x > 0.

Now let us define

An := 2λ−12

(‖Pλf0‖λ +Dσ(nh)−

12 log n

), (4.23)

for D some constant larger than 0. As || · ||H ≤ λ−12 || · ||λ by (4.4), ||∆fn||H ≤ An holds

for D and n sufficiently large by (47) from [33] which states

‖∆fn‖λ ≤ ‖∆fn − Sn,λ(Fλf0)‖λ + ‖Sn,λ(Fλf0)‖λ . n−β2 ‖Pλf0‖λ + σ(nh)−

12 log n,

(4.24)

for β = (2α− 1)2/(4α(2α+ 1)) which is strictly positive. Note that

‖f‖2∞ = supx∈[0,1]

|∞∑j=1

fjφj(x)|2∗. sup

x∈[0,1]

∞∑j=1

νjφ2j (x)

∞∑j=1

f2j

νj

∗∗. h−1 ‖f‖2λ ,

where (∗) holds by Cauchy-Schwarz and (∗∗) holds by the fact that the eigenfunctions are

uniformly bounded. By once more using (4.24), we obtain ||∆fn||∞ . h−12 ||∆fn||λ . n.

Note that for An = An, Bn = n and x = c log n for some positive constant c, we have

(nh)−12

(√log n+

√x+

√log

Bn‖g‖∞

+A12αn

)‖g‖∞

≤ (nh)−12

((2 + c)

√log

n

min(1, ‖g‖∞)+ A

12αn

)‖g‖∞ ,

48

Page 49: Optimal Bayesian distributed methods for nonparametric regression · 2020-07-17 · Bayesian methods in nonparametric regression using Gaussian process priors have been widely used

and

(nh)−1(

log n+ x+ logBn‖g‖∞

+ A1αn max(1, (n/h)

12α− 1

2 ))‖g‖∞

≤ (nh)−1

((2 + c) log

n

min(1, ‖g‖∞)+ A

1αn max(1, (n/h)

12α− 1

2 )

)‖g‖∞ .

Recall that ||Pλf0||λ .√λ by (4.21). For h equal to (1.9), in view of λ = h2α

h−α(nh)−12 log n = h−α−

12n−

12 log n . (n/(log n))

α+12

2α+1n−12 log n = (log n)

12 ,

which implies An .√

log n, from which follows A12αn . A

1αn .

√log n, as α > 1/2. This

implies

(nh)−12

((2 + c)

√log

n

min(1, ‖g‖∞)+ A

12αn

)‖g‖∞

. (nh)−12

(√log

n

min(1, ‖g‖∞)

)‖g‖∞ , (4.25)

and

(nh)−1

((2 + c) log

n

min(1, ‖g‖∞)+ A

1αn max(1, (n/h)

12α− 1

2 )

)‖g‖∞

. (nh)−1

(max(1, (n/h)

12α− 1

2 )

(log

n

min(1, ‖g‖∞)

))‖g‖∞ . (4.26)

By taking the sum of (4.25) and (4.26) we obtain that for g satisfying ||g||∞ ≤ n and||g||H ≤ An and c,D and n sufficiently large, we obtain with probability at least 1−n−10∥∥∥∥∥ 1

n

n∑i=1

g(xi)Kxi − E[g(X)KX ]

∥∥∥∥∥∞

.

(√log

n

min(1, ‖g‖∞)+ (nh)−

12 max

(1, (n/h)

12α− 1

2

)(log

n

min(1, ‖g‖∞)

))‖g‖∞√nh

.

(4.27)

By (4.27) we now obtain∥∥∥∥∥∥ 1

n/m

n/m∑i=1

g(xi)Kxi − E[g(X)KX ]

∥∥∥∥∥∥∞

. max

(1, (n/m)−1+ 1

2αh−12α

√log

n/m

min(1, ‖g‖∞)

)√log

n/m

min(1, ‖g‖∞)

‖g‖∞√nm−1h

,

(4.28)

49

Page 50: Optimal Bayesian distributed methods for nonparametric regression · 2020-07-17 · Bayesian methods in nonparametric regression using Gaussian process priors have been widely used

for any α > 1/2, g satisfying ||g||λ . An/m and ||g||∞ ≤ n/m, and λ = h2α, withprobability at least 1 − (n/m)−10 for c sufficiently large. Finally, note that (4.24) alsoimplies ||Sn/m,λ(Fλf0)||∞ . n/m and ||Sn/m,λ(Fλf0)||H ≤ An/m, which will be usedlater on.

The following lemma follows from the sup-norm bounds for fn,λ−f0 as proved in [33].During the procedure of computing bounds for the L∞ distance between the estimatorand the true function, the triangle inequality is frequently used, where we encounter aterm ||fNn,m,λ − FNλ f0||∞. The following lemma gives us an upper bound for this term.

Lemma 4. Let f0 ∈ ΘαH(B) or f0 ∈ Θ

α+ 12

S (B) and h equal to (2.3). We define ∆fNn,m =

fNn,m,λ − FNλ f0 and ∆f(j)n/m = f

(j)n/m,λ − Fλf0. Let us define

γn/m := max(

1, (n/m)−1+ 12αh−

12α

√log(n/m)

)√log(n/m)

1√nm−1h

. (4.29)

Now if the aggregation function ANm takes the average over m values, we obtain∥∥∆fNn,m∥∥∞ . max

((n/m)−10h−1, γn/m

∥∥∥S(1)n/m,λ(FNλ f0)

∥∥∥∞

)+∥∥SNn,m,λ(F Iλf0)

∥∥∞ ,

with probability at least 1−m(n/m)−10.

Proof. The proof is analogous to equation (49) of [33]. Let us begin by using the triangleinequality to obtain∥∥∥fNn,m,λ − FNλ f0

∥∥∥∞≤∥∥∥fNn,m,λ − FNλ f0 − SNn,m,λ(FNλ f0)

∥∥∥∞

+∥∥SNn,m,λ(FNλ f0)

∥∥∞ .

By the identities (4.17) and (4.18) we obtain

∥∥∥fNn,m,λ − FNλ f0 − SNn,m,λ(FNλ f0)∥∥∥∞

=

∥∥∥∥∥∥ 1

m

m∑j=1

(f(j)n/m,λ − F

(j)λ f0 − S(j)

n/m,λ(FNλ f0))

∥∥∥∥∥∥∞

≤ 1

m

m∑j=1

∥∥∥∆f(j)n/m − S

(j)n/m(FNλ f0)

∥∥∥∞,

which follows by repeatedly using the triangle inequality. By the identities (4.17) and(4.18) combined with (4.28), we obtain for every j

∥∥∥∆f(j)n/m − S

(j)n/m(FNλ f0)

∥∥∥∞

=

∥∥∥∥∥∥ 1

n/m

n/m∑i=1

∆f(j)n/m(x

(j)i )K

x(j)i

− E(∆f(j)n/m(X)KX)

∥∥∥∥∥∥∞

(4.30)

. max

1, (n/m)−1+ 12αh−

12α

√√√√logn/m

min(1,∥∥∥∆f

(j)n/m

∥∥∥∞

)

√√√√logn/m

min(1,∥∥∥∆f

(j)n/m

∥∥∥∞

)

∥∥∥∆f(j)n/m

∥∥∥∞√

nm−1h,

50

Page 51: Optimal Bayesian distributed methods for nonparametric regression · 2020-07-17 · Bayesian methods in nonparametric regression using Gaussian process priors have been widely used

with probability at least 1 − (n/m)−10 on an event we will denote by Aj . Takingthe intersection ∩mj=1Aj tells us that with probability 1 − m(1/m)−10 all events hold

simultaneously. Let us now distinguish between two cases. First assume ||∆f (j)n/m||∞ ≤

(n/m)−a, for a large enough. We will continue to use a = 10 throughout. Then∥∥∥∥∥∥ 1

n/m

n/m∑i=1

∆f(j)n/m(x

(j)i )K

x(j)i

− E(∆f(j)n/m(X)KX)

∥∥∥∥∥∥∞

≤ (n/m)−10

∥∥∥∥∥∥ 1

n/m

n/m∑i=1

Kx(j)i

− E(KX)

∥∥∥∥∥∥∞

. (n/m)−10h−1,

as supx,x′ |K(x, x′)|∞ = supx,x′ |∑∞

j=1 νjφj(x)φj(x′)| = supx,x′ |

∑∞j=1 νj | . h−1 by

(4.20). For h equal to (2.3), this is of order at most (n/m)−(9+ 12

) log(n/m)−1

2α+1 . Weconclude∥∥∥fNn,m,λ − FNλ f0

∥∥∥∞

. (n/m)−(9+ 12

) log(n/m)−1

2α+1 +∥∥SNn,m,λ(FNλ f0)

∥∥∞ .

Now, if ||∆f (j)n/m||∞ ≥ (n/m)−10, we obtain√√√√log

(n/m)

min(1,∥∥∥∆f

(j)n/m

∥∥∥∞

).√

log((n/m)11) =√

11 log(n/m) .√

log(n/m).

This implies that by (4.30), we obtain for every j that∥∥∥∆f(j)n/m − S

(j)n/m,λ(FNλ f0)

∥∥∥∞

. γn/m

∥∥∥∆f(j)n/m

∥∥∥∞,

with probability at least 1− (n/m)−10. This implies that

∥∥∆fNn,m − SNn,m,λ(FNλ f0)∥∥∞ ≤

1

m

m∑j=1

∥∥∥∆f(j)n/m − S

(j)n/m,λ(Fλf0)

∥∥∥∞

. γn/m maxj

∥∥∥∆f(j)n/m

∥∥∥∞,

with probability at least 1 − m(n/m)−10 on the intersection of the events on which

||∆f (j)n/m−S

(j)n/m,λ(FNλ f0)||∞ . γn/m||∆f

(j)n/m||∞ for all j. By using the triangle inequality

and the last inequality, we obtain∥∥∥∆f(j)n/m

∥∥∥∞≤∥∥∥∆f

(j)n/m − S

(j)n/m,λ(FNλ f0)

∥∥∥∞

+∥∥∥S(j)

n/m,λ(FNλ f0)∥∥∥∞

=⇒∥∥∥∆f

(j)n/m

∥∥∥∞≤ γn/m

∥∥∥∆f(j)n/m

∥∥∥∞

+∥∥∥S(j)

n/m,λ(FNλ f0)∥∥∥∞

=⇒∥∥∥∆f

(j)n/m

∥∥∥∞

. (1− γn/m)−1∥∥∥S(j)

n/m,λ(FNλ f0)∥∥∥ .

Let us turn to the term γn/m. If max(1, (n/m)−1+ 12αh−

12α

√log(n/m)) = 1, we obtain

γn/m . σ−1/(2α+1)(log(n/m)/(n/m))α/(2α+1) which goes to 0 for n/m→∞. If

51

Page 52: Optimal Bayesian distributed methods for nonparametric regression · 2020-07-17 · Bayesian methods in nonparametric regression using Gaussian process priors have been widely used

max(1, (n/m)−1+ 12αh−

12α

√log(n/m)) = (n/m)−1+ 1

2αh−12α

√log(n/m), we obtain for h

equal to (2.3) that

(n/m)−1+ 12αh−

12α log(n/m) = Cσ

− 12α2+α (n/m)−1+ 1

2α (n/m)1

4α2+2α log(n/m) log(n/m)− 1

4α2+2α

= Cσ− 1

2α2+α (n/m)−1+ 1

2α+ 1

4α2+2α log(n/m)1− 1

4α2+2α .

Moreover, for h equal to (2.3), we obtain√log(n/m)/(nm−1h) = Cσ−

12α+1 (log(n/m)/(n/m))−

1α2α+1 ,

which has a positive power of n for α > 1/2, meaning nm−1h → ∞ as n → ∞. Thisimplies that at most

γn/m = O

(σ− 2α+1

2α2+α (n/m)1−2αα (log(n/m))

6α2+2α−1

4α2+2α

), (4.31)

as

−1 +1

2α+

1

4α2 + 2α− 2α

2α+ 1=

1− 2α

α.

Combining these two parts, we can conclude that γn/m → 0 as n/m → ∞ as n/mdominates log(n/m). Combining this with (4.31) conclude that∥∥∥fNn,m,λ − FNλ f0

∥∥∥∞≤∥∥∥fNn,m,λ − FNλ f0 − SNn,m,λ(FNλ f0)

∥∥∥∞

+∥∥SNn,m,λ(FNλ f0)

∥∥∞

. γn/m maxj

∥∥∥∆f(j)n/m

∥∥∥∞

+∥∥SNn,m,λ(FNλ f0)

∥∥∞

. γn/m(1− γn/m)−1 maxj

∥∥∥S(j)n/m,λ(FNλ f0)

∥∥∥∞

+∥∥SNn,m,λ(FNλ f0)

∥∥∞

. γn/m

∥∥∥S(1)n/m,λ(FNλ f0)

∥∥∥∞

+∥∥SNn,m,λ(FNλ f0)

∥∥∞ ,

with probability at least 1 − m(n/m)−10. The statement now follows by taking themaximum over the two cases.

Following this lemma, we have all the tools to prove the sup-norm bounds for the pos-terior mean using the naive averaging method, shown in the following theorem.

4.4. Proof of Lemma 1

Proof. Note that if we have a (n+m)-dimensional multivariate normal distribution[xy

]∼ N

([µ1

µ2

],

[Σ11 Σ12

Σ21 Σ22

]),

52

Page 53: Optimal Bayesian distributed methods for nonparametric regression · 2020-07-17 · Bayesian methods in nonparametric regression using Gaussian process priors have been widely used

then we have x |Yn ∼ N(µ′,Σ′) where

µ′ = µ1 + Σ12Σ−122 (Yn − µ2), (4.32)

Σ′ = Σ11 − Σ12Σ−122 Σ21. (4.33)

In this case, we are looking for y∗ := f(x∗) |Xn, Yn, k where x∗ is the test input andXn, Yn the training data. First note that if we draw two functions f and g from aGaussian process with mean function 0 and covariance kernel k defined in (1.1) marginalto input data sets Xn and X ′n, we have[

f(X)g(X ′)

]∼ N

(0,K(Xn, Xn) K(Xn, X

′n)

K(X ′n, Xn) K(X ′n, X′n)

).

which in our setting, if we consider a single one-dimensional test input x∗, means:[y∗

Yn

]∼ N

(0,

[σ2(nλ)−1k(x∗, x∗) σ2(nλ)−1k(x∗, Xn)σ2(nλ)−1k(Xn, x

∗) σ2(nλ)−1k(Xn, Xn) + σ2I

]).

Combining this with (4.32) and (4.33), we obtain

y∗ |Xn, Yn, k ∼ N(kT∗ (K + nλI)−1Yn, σ2(nλ)−1(k∗∗ − kT∗ (K + nλI)−1k∗)).

4.5. Proof of Theorem 1

We start by giving a sketch of the proof. To prove the lower bound for the L∞ distancebetween the KRR estimator and the truth, we will use the triangle inequality to splitthe distance in the difference of two terms, ||P Iλf0||∞ and ||f In,m,λ − F Iλf0||∞. As theeigenvalues of the equivalent kernel are equal to the nondistributed setting, it will followthat there exists a function for which ||P Iλf0||∞ = ||Pλf0||∞ & hα/ log2(h−1) by Lemma

2. The upper bound for ||f In,m,λ − F Iλf0||∞ is different from the nondistributed setting.Analogous to [33], by the triangle inequality∥∥∥f In,m,λ − F Iλf0

∥∥∥∞≤∥∥∥f In,m,λ − F Iλf0 − SIn,m,λ(Fλf0)

∥∥∥∞

+∥∥SIn,m,λ(Fλf0)

∥∥∞ .

Computing the bounds, we will see that with high probability

||f In,m,λ − F Iλf0||∞

. max(

(n/m)−(9+ 12

) log(n/m)−1

2α+1 , γn/m

∥∥∥S(1)n/m,λ(F Iλf0)

∥∥∥∞

)+∥∥SIn,m,λ(F Iλf0)

∥∥∞ ,

by using Lemma 4. The term ||SIn,m,λ(Fλf0)||∞ will be bounded from above by again

using the triangle inequality. It follows that ||P Iλf0||∞ is of higher order than ||f In,λ −F Iλf0||∞. The dominating term is larger than the dominating term in the minimaxoptimal rate, letting us conclude that there exists a function for which the method issuboptimal. We will now begin the proof of Theorem 1.

53

Page 54: Optimal Bayesian distributed methods for nonparametric regression · 2020-07-17 · Bayesian methods in nonparametric regression using Gaussian process priors have been widely used

Proof. Let us assume that f0 ∈ Θα+ 1

2S (B) such that ||P Iλf0|| = ||Pλf0|| & hα/ log(h−1).

We begin by noting∥∥∥f In,m,λ − f0

∥∥∥∞

=∥∥∥F Iλf0 − f0 + f In,m,λ − F Iλf0

∥∥∥∞

≥∥∥P Iλf0

∥∥∞ −

∥∥∥f In,m,λ − F Iλf0

∥∥∥∞

≥∥∥P Iλf0

∥∥∞ −

(∥∥∥f In,m,λ − F Iλf0 + Sλ(F Iλf0)∥∥∥∞

+∥∥Sλ(F Iλf0)

∥∥∞

).

(4.34)

We derived in Lemma 2 bounds for ||Pλf0||∞, the nondistributed setting. Note that inthis distributed setting the equivalent kernel K retains eigenvalues of equal form, whichimplies both Pλ = P Iλ and Fλ = F Iλ . We continue to denote I for consistency. Thus letus turn to the second term. By Lemma 4 we obtain that there exists an event An,m withP(An,m) ≥ 1−m(n/m)−10 on which∥∥∆f In,m

∥∥∞ . max

((n/m)−(9+ 1

2) log(n/m)−

12α+1 , γn/m

∥∥∥S(1)n/m,λ(F Iλf0)

∥∥∥∞

)+∥∥SIn,m,λ(F Iλf0)

∥∥∞ .

Recall by (1.11) that for a sample size of n/m and h equal to (2.3), we obtain∥∥∥S(1)n/m,λ(F Iλf0)

∥∥∥∞

. (σ2 log(n/m)/(n/m))α

(2α+1) ,

with probability at least 1− (n/m)−10, on an event we will denote by Bn,m. Combiningthis with (4.31), we observe

γn/m

∥∥∥S(1)n/m,λ(F Iλf0)

∥∥∥∞

. σ−1α

+ 2α2α+1 (n/m)

1−2αα− α

2α+1 (log(n/m))8α2+2α−1

4α2+2α . (4.35)

Note that

(n/m)−(9+ 12

)(log(n/m))−1

2α+1 . (n/m)1−2αα− α

2α+1 (log(n/m))8α2+2α−1

4α2+2α ,

from which follows that on An,m ∩ Bn,m we obtain

∥∥∆f In,m∥∥∞ . (n/m)

1−2αα− α

2α+1 (log(n/m))8α2+2α−1

4α2+2α +∥∥SIn,m,λ(F Iλf0)

∥∥∞ ,

with P(An,m∩Bn,m) ≥ 1− (m+ 1)(n/m)−10 as in the current setting σ does not dependon n or m.

Now let us turn to ||SIn,m,λ(F Iλf0)||∞. As we have AIm({f (j)n/m,λ}

mj=1) = m−1

∑mj=1 fj ,

54

Page 55: Optimal Bayesian distributed methods for nonparametric regression · 2020-07-17 · Bayesian methods in nonparametric regression using Gaussian process priors have been widely used

we find

∥∥SIn,m,λ(F Iλf0)∥∥∞ =

∥∥∥∥∥∥ 1

m

m∑j=1

(S

(j)n/m,λ(F Iλf0))

)∥∥∥∥∥∥∞

=

∥∥∥∥∥∥ 1

m

m∑j=1

1

n/m

n/m∑i=1

(y(j)i − (F Iλf0)(x

(j)i ))K

x(j)i

− P Iλ (F Iλf0)

∥∥∥∥∥∥∞

=

∣∣∣∣∣∣∣∣∣∣ 1

m

m∑j=1

(1

n/m

n/m∑i=1

(y

(j)i − f0(x

(j)i ) + f0(x

(j)i )

− (F Iλf0)(x(j)i ))Kx(j)i

− P Iλ (F Iλf0)

)∣∣∣∣∣∣∣∣∣∣∞

∗≤

∥∥∥∥∥∥ 1

m

m∑j=1

1

n/m

n/m∑i=1

P Iλf0(x(j)i )K

x(j)i

− E(P Iλf0(X)KX)

∥∥∥∥∥∥∞

+

∥∥∥∥∥∥ 1

m

m∑j=1

1

n/m

n/m∑i=1

(y(j)i − f0(x

(j)i ))K

x(j)i

∥∥∥∥∥∥∞

≤ 1

m

m∑j=1

∥∥∥∥∥∥ 1

n/m

n/m∑i=1

P Iλf0(x(j)i )K

x(j)i

− E(P Iλf0(X)KX)

∥∥∥∥∥∥∞

+

∥∥∥∥∥∥ 1

m

m∑j=1

1

n/m

n/m∑i=1

w(j)i K

(j)

x(j)i

∥∥∥∥∥∥∞

:=1

m

m∑j=1

T j1 + T2,

where wji := y(j)i − f0(x

(j)i ) is the N(0, σ2) distributed error term for the local data

X(j)n/m, Y

(j)n/m, which form a partition of Xn, Yn. Inequality (∗) holds in view of the triangle

55

Page 56: Optimal Bayesian distributed methods for nonparametric regression · 2020-07-17 · Bayesian methods in nonparametric regression using Gaussian process priors have been widely used

inequality combined with

P IλFIλf0 = P Iλ

( ∞∑k=1

νkf0,kφk

)

=∞∑k=1

(1− νk)νkf0,kφk

= Fλ(∞∑k=1

(1− νk)f0,kφk)

= F IλPIλf0

= E[P Iλf0(X)KX ],

where last line follows by (4.16). For An defined as in (4.23) ||P Iλf0||λ = ||Pλf0||λ .√λ . An/m by (4.21). Furthermore ||P Iλf0||∞ . hα . n for the chosen value of h. This

tells us that P Iλf0 satisfies all assumptions for Lemma 3 for An = An and Bn = n. Byinequality (4.28) we know that there exists an event Cn,m with P(Cn,m) ≥ 1−m(n/m)−10

on which we have

T(j)1 . max

(1, (n/m)−1+ 1

2αh−12α

√log

(n/m)

min(1, ||P Iλf0||∞)

)√log

(n/m)

min(1, ||P Iλf0||∞)

∥∥P Iλf0

∥∥∞√

(n/m)h,

for all j simultaneously. Let us now distinguish between two cases. First, if ||P Iλf0||∞ ≤(n/m)−10, then∥∥∥∥∥∥ 1

n/m

n/m∑i=1

P Iλf0(x(j)i )K

x(j)i

− E(P Iλf0(X)KX)

∥∥∥∥∥∥∞

∥∥∥∥∥∥(n/m)−10

n/m

n/m∑i=1

Kx(j)i

− E(KX)

∥∥∥∥∥∥∞

. (n/m)−10h−1,

as supx,x′ |K(x, x′)|∞ . h−1. For h equal to (2.3), this is of order at most

(n/m)−(9+ 12

) log(n/m)−1

2α+1 . Now, if ||P Iλf0||∞ ≥ (n/m)−10, we obtain√log

(n/m)

min(1,∥∥P Iλf0

∥∥∞)

.√

log(n/m)11 =√

11 log(n/m) .√

log(n/m).

This implies T(j)1 . γn/m||P Iλf0||∞. Now, using (4.31) we obtain

T(j)1 . σ−

1α (n/m)

1−2αα (log(n/m))

6α2+2α−1

4α2+2α∥∥P Iλf0

∥∥∞ ,

which is of larger order than (n/m)−(9+ 12

) log(n/m)−1

2α+1 . All the pieces combined nowgive us that on the event Cn,m with P(Cn,m) ≥ 1−m(n/m)−10, we have

T1 :=1

m

m∑j=1

T(j)1 . σ

− α+1

2α2+α (n/m)1−2αα (log(n/m))

6α2+2α−1

4α2+2α∥∥P Iλf0

∥∥∞ , (4.36)

56

Page 57: Optimal Bayesian distributed methods for nonparametric regression · 2020-07-17 · Bayesian methods in nonparametric regression using Gaussian process priors have been widely used

which is of smaller order than ||P Iλf0||∞ as the power of (n/m) is less than 0 for α > 12 .

In this setting the σ term is irrelevant, but it will be needed in later proofs.To bound T2, note that we can write∥∥∥∥∥∥ 1

m

m∑j=1

w(j)i Kxi

∥∥∥∥∥∥∞

=

∥∥∥∥∥∥ 1

m

m∑j=1

w(j)i

( ∞∑k=1

φk(x(j)i )φk(·)νk

)∥∥∥∥∥∥∞

=

∥∥∥∥∥∞∑k=1

(w′i,kφk(·)νk

)∥∥∥∥∥∞

where w′i,k := m−1∑m

j=1w(j)i φk(x

(j)i ), which is distributedN(0, (σ2

∑mj=1(φk(x

(j)i ))2)/m2).

Furthermore, denote w′i,k = wiφk(xi) ∼ N(0, σ2(φk(xi))2) for the nondistributed setting.

Let us state an adaptation of Lemma 5.3 from [33].

Lemma 5. There exists some constant C such that for any x > 0,

P

∥∥∥∥∥∥ 1

n/m

n/m∑i=1

( ∞∑k=1

w′i,kφk(·)νk

)∥∥∥∥∥∥∞

≤ C σ√m

√1 + x

nm−1h

≥ 1− e−x. (4.37)

The proof to this Lemma can be found in Appendix A.2. For x = c log n we obtain

T2 =

∥∥∥∥∥∥ 1

n/m

n/m∑i=1

( ∞∑k=1

w′i,kφk(·)νk

)∥∥∥∥∥∥∞

≤ Cσ√

log(n/m)

nh, (4.38)

for some C > 0, with probability at least 1 − (n/m)−10 for sufficiently large c on anevent which we will denote by event Dn,m. Combining this with (4.36) we obtain thaton the event An,m ∩ Bn,m ∩ Cn,m ∩ Dn,m∥∥∥fn,λ − F Iλf0

∥∥∥ . (n/m)1−2αα− α

2α+1 (log(n/m))8α2+2α−1

4α2+2α

+ (n/m)1−2αα (log(n/m))

6α2+2α−1

4α2+2α∥∥P Iλf0

∥∥∞ + σ

√log(n/m)

nh

. (n/m)1−2αα− α

2α+1 (log(n/m))8α2+2α−1

4α2+2α + σ

√log(n/m)

nh. (4.39)

For h equal to (2.3), we obtain

σ

√log(n/m)

nh.

√log(n/m)

n

(n/m

log(n/m)

) 14α+2

= m−12

(log(n/m)

n/m

) α2α+1

. (4.40)

Note that

(n/m)1−2αα− α

2α+1 (log(n/m))8α2+2α−1

4α2+2α

= (n/m)1−2αα (log(n/m))

6α2+2α−1

4α2+2α

(log(n/m)

n/m

) α2α+1

57

Page 58: Optimal Bayesian distributed methods for nonparametric regression · 2020-07-17 · Bayesian methods in nonparametric regression using Gaussian process priors have been widely used

which implies that for α > 12 and (n/m)

1−2αα m

12 → 0

m−12

(log(n/m)

n/m

) α2α+1

& (n/m)1−2αα− α

2α+1 (log(n/m))8α2+2α−1

4α2+2α . (4.41)

From Lemma 2 we know that there exists a function f0 ∈ ΘαH(B) such that ||P Iλf0||∞ &

hα/ log2(h−1). This lets us conclude that for f0 on An,m ∩ Bn,m ∩ Cn,m ∩ Dn,m we have∥∥∥f In,m − f0

∥∥∥∞≥∥∥f0 − F Iλf0

∥∥− ∥∥∥fn,λ − F Iλf0

∥∥∥∞

&∥∥P Iλf0

∥∥∞ −m

− 12

(log(n/m)

n/m

) α2α+1

&

(log(n/m)

n/m

) α2α+1

/ log2

((n/m

log(n/m)

) 12α+1

)−m−

12

(log(n/m)

n/m

) α2α+1

,

with P(An,m∩Bn,m∩Cn,m∩Dn,m) ≥ 1− (2m+2)(n/m)−10. Here the negative terms areof smaller order than the positive term, as we have n/m→∞ and m→∞. This tells us

the dominating term in the lower bound is (n/m)−α

2α+1 . We have seen that the minimax

sup-norm upper bound in the nondistributed case has a dominating term n−α

2α+1 , whichis smaller than the distributed case. This lets us conclude∥∥∥f In,m − f0

∥∥∥∞

&

(log(n/m)

n/m

) α2α+1

/ log2

((n/m

log(n/m)

) 12α+1

),

with probability at least 1− (2m+ 2)(n/m)−10.

This result now trivially extends to the function f0 ∈ Θα+ 1

2S (B) for which ||P Iλf0||∞ &

hα/ log(h−1).

4.6. Proof of Theorem 2

We start by giving a sketch of the proof. We will prove that there exists a function

fbad ∈ Θα+ 1

2S (B) for which we have asymptotic frequentist coverage of the pointwise

credible sets of probability 0. This is done by showing the squared bias dominates thevariance.

We will show the possibility of approximating the covariance kernel by some fixedfunction, independent of the data, in the nondistributed setting. This will show tobe useful in proving the pointwise credible set coverage. This theory focusses on thenondistributed setting, which relates to the distributed setting by applying the aggrega-tion operator on the local covariance kernels. By the definition of the covariance kernel(1.6), we can write

σ−2nλCn(x, x′) = kx(x′)− Kx(x′),

58

Page 59: Optimal Bayesian distributed methods for nonparametric regression · 2020-07-17 · Bayesian methods in nonparametric regression using Gaussian process priors have been widely used

where kx(·) = k(x, ·) the covariance kernel as defined before, and

Kx(x∗) = k(x∗, Xn)T (K(Xn, Xn) + nλI)−1k(Xn, x).

If we compare this to (4.3), we obtain that Kx can also be written as the solution of thefollowing KRR problem with noiseless observations of Kx at the random design pointsXn,

Kx = argming∈H

[1

n

n∑i=1

(Zxi − g(xi))2 + λ ‖g‖2H

], (4.42)

where Zix = k(x, xi), which is noiseless (see Section 2.2 of [33]). Corollary 2.2 from [33]states that for a noiseless version, where the measured random variables do not containa noise term of a certain variance, of the KRR problem, we obtain both∥∥∥fn,λ − f0

∥∥∥∞≤ 2 ‖Pλf0‖∞ ,

and ∥∥∥f0 − fn,λ − Pλf0

∥∥∥∞≤ Cγn ‖Pλf0‖∞ , (4.43)

for some constant C > 0, with probability at least 1−n−10. Notice that for the covariancekernel, the first statement implies∥∥∥σ−2nλCn(x, ·)

∥∥∥∞

=∥∥∥kx − Kx

∥∥∥∞≤ 2 ‖Pλkx‖∞ .

Note that by (4.2) and the definition of Pλ, we observe that

Pλkx =∞∑j=1

(1− νj)µjφj(x)φj(·) =∞∑j=1

µjλ

µj + λφj(x)φj(·) = λ

∞∑j=1

νjφj(x)φj(·) = λKx,

for Kx(·) =∑∞

j=1 νjφj(x)φj(·). By the fact that {φj}j∈N is uniformly bounded, weobtain ∣∣∣∣∣∣

∞∑j=1

νjφj(x)φj(x′)

∣∣∣∣∣∣ . h−1,

from the Riemann sum (4.20). Combining this with the higher order expansion (4.43)∥∥∥σ−2nλCn(x, ·)− Pλkx∥∥∥∞≤ Cγn ‖Pλkx‖∞ =⇒

∥∥∥σ−2nCn(x, ·)− Kx

∥∥∥∞≤ Cγn

∥∥∥Kx

∥∥∥∞.

(4.44)

As we have γn → 0 for n → ∞, this opens an opportunity to numerically approximateσ−2nCn in a way such that there is no need to compute the eigendecomposition andwhich does not involve the random design points.

59

Page 60: Optimal Bayesian distributed methods for nonparametric regression · 2020-07-17 · Bayesian methods in nonparametric regression using Gaussian process priors have been widely used

Let us now define Cn(x, x′) := σ2hK(x, x′). Using (4.44), we now obtain

supx,x′|nhCn(x, x′)− Cn(x, x′)| ≤ Cσ2hγn sup

x,x′|K(x, x′)| . γn, (4.45)

since supx,x′ |K(x, x′)| . h−1 by (4.20). As we have seen earlier, for h equal to (1.9)this goes to 0 for n → ∞, giving us a covariance kernel to approximate our originalcovariance kernel Cn with, which does not depend on the random design points but is afixed function. This will prove to be useful in the computation of pointwise credible setcoverage.

With this approximation explained, let us turn to the proof of Theorem 2.

Proof. Consider a function fbad ∈ Θα+ 1

2S (B) such that∥∥P Iλfbad∥∥∞ & hα/ log2(h−1),

which we know exists by Lemma 2. By the definition of Cn we observe that

supx,x′∈[0,1]

|Cn(x, x′)| = h supx,x′∈[0,1]

|∞∑k=1

φk(x)φk(x′)νk| . hh−1 = 1

thus Cn is of constant order. Combining this with (4.45), we conclude that Cn is oforder (nh)−1. Recall that

CIn(x;β) =

[fn(x)− z(1+β)/2

√Cn(x, x), fn(x) + z(1+β)/2

√Cn(x, x)

],

we see that the size of credible set is determined by√Cn(x, x). Now let us turn to the

distributed setting, by (2.4) we have

√CIn,m(x, x) =

√√√√m−2

m∑j=1

(Cn/m(x, x))(j),

which implies that the size of

CIIn,m(x;β) :=

[f In,m(x)− z(1+β)/2

√CIn,m(x, x), f In,m(x) + z(1+β)/2

√CIn,m(x, x)

]is of order m−

12 (nm−1h)−

12 = (nh)−

12 again for the distributed setting. Let us define

60

Page 61: Optimal Bayesian distributed methods for nonparametric regression · 2020-07-17 · Bayesian methods in nonparametric regression using Gaussian process priors have been widely used

γ = (1 + β)/2. From this follows, analogously to (33) from [33] that:

P(fbad(x) ∈ CIIn,m(x;β)]

= P(∣∣∣√nh(f In,m(x)− fbad(x))

∣∣∣ ≤ zγ√nhCIn,m(x, x)

)∗= P

∣∣∣√nh(f In,m(x)− fbad(x))∣∣∣ ≤ zγ

√√√√nhm−2

m∑m=1

C(j)n/m(x, x)

≈ P

∣∣∣√nh(f In,m(x)− fbad(x))∣∣∣ ≤ zγ

√√√√m−1

m∑j=1

C(j)n/m(x, x)

= P

(√nh(f In,m(x)− F Iλfbad(x)) ∈

[√nhP Iλfbad(x)± zγ

√C

(1)n/m(x, x)

]),

where (∗) holds because we know the variance of distributed method I is distributedm−1 times the variance of the nondistributed method for a sample size of n/m, and

the approximation follows from (4.45). We have m−1∑m

j=1 C(j)n/m = C

(1)n/m = σ2hK as

all C(j)n/m are equal. This is of constant order as ||K||∞ .

∑∞k=1 νk . h−1. First note

that for h equal to the optimal value (2.3) by (4.39) we obtain with probability at least1− (n/m)−10

√nh(f In,m(x)− F Iλfbad(x)) . (nh)

12m−

12

(log(n/m)

n/m

) α2α+1

= (n/m)12h

12 (n/m)−

12

+12

2α+1 (log(n/m))α

2α+1

.

(n/m

log(n/m)

)− 14α+2

(n/m)1

4α+2 (log(n/m))α

2α+1

=√

log(n/m),

which goes to infinity as n/m,m → ∞ as α is positive. Furthermore, using Lemma 2for h equal to 2.3 we have

√nhP Iλfbad & (nh)

12

(log(n/m)

n/m

) α2α+1

/ log

((n/m

log(n/m)

) 12α+1

)

&(√

m log(n/m))/ log

((n/m

log(n/m)

) 12α+1

),

by analogous reasoning. This approaches infinity faster than√nh(fn−Fλfbad(x)). Com-

bined, this implies that

P[fbad ∈ CIIn,m(x;β)]

≈ P[√

nh(f In(x)− F Iλfbad(x)) ∈[√

nhP Iλfbad(x)± zγ√Cn/m(x, x)

]]→ 0,

61

Page 62: Optimal Bayesian distributed methods for nonparametric regression · 2020-07-17 · Bayesian methods in nonparametric regression using Gaussian process priors have been widely used

as n/m,m/ log2 n→∞.

4.7. Proof of Theorem 3

The proof to the upper bound for the L∞ distance between the KRR estimator and thetrue function f0 follows by combining the results on the adjusted likelihood method andTheorem 1.

Proof. Let us assume that f0 ∈ ΘαH(B). By the triangle inequality we obtain∥∥∥f IIan,m − f0

∥∥∥∞≤∥∥P IIaλ f0

∥∥∞ +

∥∥∥f IIan,m,λ − F IIaλ f0

∥∥∥∞,

where we will bound these two terms similarly to Section 4.5. From Section 2.3 we knowthe variance of the noise is σ′ = σ/

√m in this setting. By Lemma 2 then follows∥∥P IIaλ f0

∥∥∞ . hα

=

(B2n/m

(σ′)2 log(n/m)

)− α2α+1

=

(B2n/m

σ2/m log(n/m)

)− α4α+2

.

(log(n/m)

n

) α2α+1

.

First, note that

γn/m

∥∥∥S(1)n/m,λ(F IIaλ f0)

∥∥∥∞

. m12α− α

2α+1 (n/m)1−2αα− α

2α+1 (log(n/m))8α2+2α−1

4α2+2α ,

thus

max(

(n/m)−10h−1, γn/m

∥∥∥S(1)n/m,λ(F Iλf0)

∥∥∥∞

). m

12α− α

2α+1 (n/m)1−2αα− α

2α+1 (log(n/m))8α2+2α−1

4α2+2α .

Following this, we combine (4.35), (4.36) (where in this case the value of σ is relevant)and (4.39) to obtain∥∥∥f IIan,m,λ − F IIaλ f0

∥∥∥∞

. m12α− α

2α+1 (n/m)1−2αα− α

2α+1 (log(n/m))8α2+2α−1

4α2+2α

+m12α (n/m)

1−2αα (log(n/m))

6α2+2α−1

4α2+2α∥∥P IIaλ f0

∥∥∞ + σ′

√log(n/m)

nh,

62

Page 63: Optimal Bayesian distributed methods for nonparametric regression · 2020-07-17 · Bayesian methods in nonparametric regression using Gaussian process priors have been widely used

with probability at least 1− (2m+ 2)(n/m)−10. Note that

m12α (n/m)

1−2αα (log(n/m))

6α2+2α−1

4α2+2α∥∥P Iλf0

∥∥∞

. m12α− α

2α+1 (n/m)1−2αα− α

2α+1 (log(n/m))8α2+2α−1

4α2+2α ,

where the power of (n/m) is negative for α > 12 . Now as n

1−2αα m

12α− 1−2α

α → 0, we have

m12α− α

2α+1 (n/m)1−2αα− α

2α+1 (log(n/m))8α2+2α−1

4α2+2α .

(log(n/m)

n

) α2α+1

.

Furthermore, we obtain

σ′√

log(n/m)

nh−

12 =

σ√m

√log(n/m)

nh−

12 . m−

12

(log(n/m)

n

) α2α+1

.

This lets us conclude ∥∥∥f IIan,m,λ − f0

∥∥∥∞

.

(log(n/m)

n

) α2α+1

,

with probability at least 1 − (2m + 2)(n/m)−10. Here the dominant term is n−α

2α+1 ,which is of same order as the dominant term in the minimax rate for the nondistributedmethod from (23) from [33]. Adjusting the local likelihood thus gives an optimal upperbound for f0 ∈ Θα

H(B).

This result now trivially extends to the function f0 ∈ Θα+ 1

2S (B).

4.8. Proof of Theorem 4

This proof is done similarly to Section 4.6, except for σ′ = σ/√m instead of σ.

Proof. Let us consider fbad ∈ Θα+ 1

2S (B) such that ||P IIaλ fbad||∞ & hα/ log(h−1). We

have seen that in the general distributed setting using the average of the posteriors thatCIIn,m(x;β) is of order (nh)−

12 . As the adjusted likelihood setting deals with σ/

√m as

opposed to σ, the size of CIIIan,m(x;β) can be computed by the definition of Cn√CIIan,m(x, x) =

√m−1(CIn/m)(x, x) � (nmh)−

12 .

63

Page 64: Optimal Bayesian distributed methods for nonparametric regression · 2020-07-17 · Bayesian methods in nonparametric regression using Gaussian process priors have been widely used

Implemeting this, we get

P[fbad ∈ CIIIan,m(x;β)]

= P[∣∣∣√nmh(f IIan,m(x)− fbad(x))

∣∣∣ ≤ zγ√nmhCIIan,m(x, x)

]∗= P

∣∣∣√nmh(f IIan,m(x)− fbad(x))∣∣∣ ≤ zγ

√√√√nm−1hm∑j=1

(CIIan/m)(j)(x, x)

≈ P

∣∣∣√nmh(f IIan,m(x)− fbad(x))∣∣∣ ≤ zγ

√√√√ 1

m

m∑j=1

C(j)n/m(x, x)

= P

[√nmh(f IIan,m(x)− F IIaλ fbad(x)) ∈

[√nmhP IIaλ fbad(x)± zγ

√C

(1)n/m(x, x)

]],

where (∗) holds as CIIan,m(x, x) = 1m2

∑mj=1 C

(j)n/m(x, x) and the approximation follows from

(4.45) combined with the fact that CIIan/m = m−1Cn/m by the adjusted likelihoods.Using similar computations as in Theorem 2, we obtain

√nmh(f IIan,m(x)− F IIaλ fbad(x)) .

√nmh ·m−

12

(log(n/m)

n

) α2α+1

=√

log(n/m).

Furthermore, as we have

P IIaλ fbad(x) &

(log(n/m)

n

) α2α+1

/ log

((n

log(n/m)

) α2α+1

),

in this setting, we obtain

√nmhP IIaλ fbad(x) &

(√m log(n/m)

)/ log

((n

log(n/m)

) α2α+1

).

As we have n/m → ∞ and m/ log2(n) → ∞ , this implies that√nmhP IIaλ fbad → ∞

faster than√nmh(f IIan,m − F IIaλ fbad(x)). As Cn/m is fixed and of constant order, we

concludeP[fbad ∈ CIIIan,m(x;β)]→ 0,

as m,n/m→∞.

4.9. Proof of Theorem 5

As the prior is adjusted, the eigenvalues of the kernel get changed by a factor of m.This changes the eigenvalues of the equivalent kernel and changes the variance of theposterior. In the proof below we obtain an upper bound for ||P IIIaλ f0||∞ for the newprior, as well as local variance terms for the new prior. We will combine this analogouslyto the proof of Theorem 1.

64

Page 65: Optimal Bayesian distributed methods for nonparametric regression · 2020-07-17 · Bayesian methods in nonparametric regression using Gaussian process priors have been widely used

Proof. Let f0 ∈ Θα+ 1

2S (B). In this case, we have KIIIa := mKI . As the eigenvalues are

defined by Ax = µx, we obtain for (cA)x = (cµ)x, which tells us in the current settingthat the eigenvalues of KIIIa, µ′j , satisfy µ′j = Cmj−2α for some constant C. From this

follows that ||P IIIaλ f0||∞ is different from ||Pλf0||∞. Note that the rescaling can be seenas a new parameter λIIIa := λ/m. This implies that we have a different value of h as

well hIIIa := (λ/m)12α = hm−

12α . Having defined the new value of h for this setting,

we can use the results proved for the nondistributed setting in [33] with h replaced byhIIIa.

Once more, we use the triangle inequality to observe that∥∥∥f IIIan,m,λ − f0

∥∥∥∞≤∥∥P IIIaλ f

∥∥∞ +

∥∥∥f IIIan,m,λ − F IIIaλ f0

∥∥∥∞.

To bound the first term, let us assume f0 ∈ Θα′S (B) for α′ = α + 1

2 . Recall that by

(4.19) we have ||P IIIaλ f0||∞ ≤ ||P IIIaλ f0||λ supx∈[0,1] ||KIIIax ||λ. Where ||P IIIaλ f0||λ can

be bounded analogously to (4.21)

||P IIIaλ f0||2λ = λ∞∑k=1

λ/m

λ/m+ µj

f2j

mµj.

λ

m

∑k=1

j2α′f2j ≤

λ

mB2,

as (λ/m)/(λ/m + µj) ≤ 1 and µj � j−2α′ . As supx∈[0,1] ||KIIIax ||2λ .

∑∞j=1 νj .

m1/(2α′)h−1, this implies that ||P IIIaλ f0||∞ . m−12

+ 14α′ hα

′− 12 , which for α′ = α + 1

2

implies ||P IIIaλ f0||∞ . m−12

+ 14α+2hα. For h equal to (2.3) we obtain

∥∥P IIIaλ f0

∥∥∞ . m−

12

+ 14α+2hα .

(log(n/m)

n

) α2α+1

, (4.46)

which is of equal order to the minimax rate up to a logarithmic term (1.11). First, notethat by (1.7) and taking γn/m with hIIIa we have

γn/m

∥∥∥S(1)n/m,λ(F IIIaλ f0)

∥∥∥∞

. γn/m

√log(n/m)

nm−1hIIIa

. n1−2αα m

2−3α2α

+α+1

4α2 (log(n/m))6α2+2α−1

4α2+2α

(log(n/m)

n

) α2α+1

,

where we see

m12α (n/m)−(9+ 1

2) log(n/m)−

12α+1 . n

1−2αα m

2−3α2α

+α+1

4α2 (log(n/m))6α2+2α−1

4α2+2α

(log(n/m)

n

) α2α+1

.

Now, by repeating all the steps of the proof of Theorem 1 up to (4.39) for the modifiedequivalent kernel where we replace h by hIIIa, we obtain∥∥∥f IIIan,m,λ − F IIIaλ f0

∥∥∥∞

. γn/m

(∥∥∥S(1)n/m,λ(F IIIaλ f0)

∥∥∥∞

+∥∥P IIIaλ f0

∥∥∞

)+m−

12

√log(n/m)

nm−1hIIIa

65

Page 66: Optimal Bayesian distributed methods for nonparametric regression · 2020-07-17 · Bayesian methods in nonparametric regression using Gaussian process priors have been widely used

with probability at least 1− (2m+ 2)(n/m)−10. By (1.7) we have∥∥∥S(1)n/m,λ(F IIIaλ f0)

∥∥∥∞

.

√log(n/m)

nm−1hIIIa. m

12

(log(n/m)

n

) α2α+1

,

with probability at least 1− (2m+ 2)(n/m)−10. Combining this with (4.46) we obtain

γn/m

(∥∥∥S(1)n/m,λ(F IIIaλ f0)

∥∥∥∞

+∥∥P IIIaλ f0

∥∥∞

). γn/m

(m

12

(log(n/m)

n

) α2α+1

)

. n1−2αα m

2−3α2α

+α+1

4α2 (log(n/m))6α2+2α−1

4α2+2α

(log(n/m)

n

) α2α+1

.

(log(n/m)

n

) α2α+1

,

for n1−2αα m

2−3α2α

+α+1

4α2 → 0. Furthermore we see that

m−12

√log(n/m)

nm−1hIIIa. m−

12

+ 14α+2

(log(n/m)

n/m

) α2α+1

=

(log(n/m)

n

) α2α+1

,

implying with probability at least 1− (2m+ 2)(n/m)−10

∥∥∥f IIIan,m,λ − F IIIaλ f0

∥∥∥∞

.

(log(n/m)

n

) α2α+1

. (4.47)

Combining this with (4.46), we conclude that∥∥∥f IIIan,m,λ − f0

∥∥∥ .

(log(n/m)

n

) α2α+1

,

with probability at least 1 − (2m + 2)(n/m)−10. The order of this term is equal to theorder nondistributed minimax rate up to a logarithmic term.

4.10. Proof of Theorem 6

Proof. To prove the pointwise credible set coverage, note that the process from (4.6)can be repeated, but by replacing K by KIIIa, the equivalent kernel of the rescaledcovariance kernel. If we now define

CIIIan := σ2hIIIaKIIIa(x, x′),

we can approximate(n/m)hIIIaCIIIan/m (x, x) ≈ CIIIan/m ,

66

Page 67: Optimal Bayesian distributed methods for nonparametric regression · 2020-07-17 · Bayesian methods in nonparametric regression using Gaussian process priors have been widely used

by (4.45).Let us define the process Un(·) =

√hIIIa/n

∑ni=1wiK

IIIaxi (·) where we define the

covariance function C ′n which matches the covariance function of the process Un(·) on[0, 1] as follows

C ′n(x, x′) := E[Un(x)Un(x′)]

= σ2hIIIaE[KIIIa(Xn, x)KIIIa(Xn, x′)]

= σ2hIIIa∞∑j=1

φj(x)φj(x′)

(1 + h2α/µ′j)2.

Analogous to (4.20), we obtain as hIIIa →∞:

∞∑j=1

1

(1 + h2a/µ′j)2�∞∑j=1

1

(1 + (m−12αhj)2α)2

= (hIIIa)−1∞∑j=1

hIIIa

(1 + (hIIIaj)2α)2

� (hIIIa)−1

∫ ∞0

1

(1 + x2α)2

� (hIIIa)−1.

Combined with the fact that {φj}j∈N is uniformly bounded, we obtain that C ′n(x, x′) isof constant order. For this process we obtain

supu∈R

∣∣∣∣∣P(C ′n(x, x)−

12

√hIIIa/n

n∑i=1

wiKIIIaxi (x) ≤ u

)− Φ(u)

∣∣∣∣∣ . (nhIIIa)−12 , (4.48)

by using the Berry-Esseen theorem as in Section 5.5 of [33]. The second display of (27) in [33] tells us that in the nondistributed setting
\[
\bigg\|f_n - F_\lambda f_0 - \frac1n\sum_{i=1}^n w_i K^{IIIa}_{x_i}\bigg\|_\infty
\lesssim \gamma_n\bigg(\|P_\lambda f_0\|_\infty + \sigma\sqrt{\frac{\log n}{nh}}\bigg) =: \delta_n, \tag{4.49}
\]
with probability at least $1-n^{-10}$. For the current distributed setting, the triangle inequality gives
\begin{align}
\bigg\|f^{IIIa}_{n,m} - F^{IIIa}_\lambda f_0 - \frac1n\sum_{i=1}^n w_i K^{IIIa}_{x_i}\bigg\|_\infty
&= \bigg\|\frac1m\sum_{j=1}^m\Big(f^{(j)}_{n/m} - F_\lambda f_0 - \frac{1}{n/m}\sum_{i=1}^{n/m} w^{(j)}_i K^{IIIa}_{x^{(j)}_i}\Big)\bigg\|_\infty\notag\\
&\le \frac1m\sum_{j=1}^m\bigg\|f^{(j)}_{n/m} - F_\lambda f_0 - \frac{1}{n/m}\sum_{i=1}^{n/m} w^{(j)}_i K^{IIIa}_{x^{(j)}_i}\bigg\|_\infty
\lesssim \delta_{n/m}, \tag{4.50}
\end{align}
with probability at least $1-m(n/m)^{-10}$, on the intersection of the events on which
\[
\bigg\|f^{(j)}_{n/m} - F_\lambda f_0 - \frac{1}{n/m}\sum_{i=1}^{n/m} w^{(j)}_i K^{IIIa}_{x^{(j)}_i}\bigg\|_\infty \lesssim \delta_{n/m}
\]
holds for all $j$, with $h = h^{IIIa}$, obtained by applying (4.49) to each of the separate terms. Let us now denote
\begin{align}
a &:= a_{n,m}(x) = \frac1n\sum_{i=1}^n w_i K^{IIIa}_{x_i}(x), \tag{4.51}\\
b &:= b_{n,m}(x) = f^{IIIa}_{n,m}(x) - F^{IIIa}_\lambda f_0(x), \tag{4.52}\\
c &:= c_{n,m}(x) = \sqrt{nh^{IIIa}}\,(C'_n(x,x))^{-\frac12}, \tag{4.53}
\end{align}
where by (4.48) we have $\sup_{u\in\mathbb{R}}|\mathbb{P}(ca\le u)-\Phi(u)| \lesssim (nh^{IIIa})^{-\frac12}$, and by (4.50) $\|a-b\|_\infty \lesssim \delta_{n/m}$, which also implies $\|c(a-b)\|_\infty \lesssim \sqrt{nh^{IIIa}}\,\delta_{n/m}$. This now implies that
\begin{align*}
\sup_{u\in\mathbb{R}}|\mathbb{P}(cb\le u)-\Phi(u)|
&= \sup_{u\in\mathbb{R}}|\mathbb{P}(cb\le u-c(a-b))-\Phi(u-c(a-b))|\\
&\overset{(*)}{\le} \sup_{u\in\mathbb{R}}|\mathbb{P}(cb\le u-c(a-b))-\Phi(u)+Dc(a-b)|\\
&\le \sup_{u\in\mathbb{R}}|\mathbb{P}(cb\le u-c(a-b))-\Phi(u)| + |Dc(a-b)|\\
&\lesssim \sup_{u\in\mathbb{R}}|\mathbb{P}(ca\le u)-\Phi(u)| + \sqrt{nh^{IIIa}}\,\delta_{n/m}\\
&\lesssim (nh^{IIIa})^{-\frac12} + \sqrt{nh^{IIIa}}\,\delta_{n/m},
\end{align*}
with probability at least $1-(n/m)^{-10}$, where $(*)$ holds for some constant $D$ because $\Phi$ has a bounded derivative.\footnote{This appears to be a mistake in the proof of Theorem 3.2 in [33], where $\delta_n$ is stated without a $\sqrt{nh}$ factor.} We conclude
\[
\sup_{u\in\mathbb{R}}\Big|\mathbb{P}\Big(C'_n(x,x)^{-\frac12}\sqrt{nh^{IIIa}}\big(f^{IIIa}_{n,m}(x)-F_\lambda f_0(x)\big)\le u\Big)-\Phi(u)\Big| \lesssim (nh^{IIIa})^{-\frac12} + \sqrt{nh^{IIIa}}\,\delta_{n/m}. \tag{4.54}
\]
Note that
\[
\sqrt{nh^{IIIa}}\,\delta_{n/m}
= \gamma_{n/m}\Big(\sqrt{nh^{IIIa}}\,\|P_\lambda f_0\|_\infty + \sigma\sqrt{m\log(n/m)}\Big)
\lesssim \gamma_{n/m}\sqrt{m\log(n/m)}
\lesssim (nm^{-1}h^{IIIa})^{-\frac12}\sqrt{m\log(n/m)}
\lesssim m\Big(\frac{\log(n/m)}{n}\Big)^{\frac{\alpha}{2\alpha+1}},
\]

by (4.47). This goes to $0$ for $n/m^4\to\infty$, as the power of $n$ is at most $1/2$. By using the triangle inequality in combination with (4.45), we obtain
\begin{align}
\sup_{x,x'}\big|nh^{IIIa}C^{IIIa}_{n,m}(x,x')-C^{IIIa}_{n/m}(x,x')\big|
&= \sup_{x,x'}\bigg|nh^{IIIa}\frac{1}{m^2}\sum_{j=1}^m C^{(j)}_{n/m}(x,x')-C^{IIIa}_{n/m}(x,x')\bigg|\notag\\
&\le \frac1m\sum_{j=1}^m\sup_{x,x'}\big|nm^{-1}h^{IIIa}C^{(j)}_{n/m}(x,x')-C^{IIIa}_{n/m}(x,x')\big|
\lesssim \gamma_{n/m}. \tag{4.55}
\end{align}
Now denote
\begin{align}
a &:= a_{n,m}(x) = f(x)-f^{IIIa}_{n,m}(x), \tag{4.56}\\
c_1 &:= (c_1)_{n,m}(x) = \sqrt{C^{IIIa}_{n,m}(x,x)}, \tag{4.57}\\
c_2 &:= (c_2)_{n,m}(x) = \sqrt{(nh^{IIIa})^{-1}C^{IIIa}_{n/m}(x,x)}, \tag{4.58}
\end{align}
from which it follows that $a\,|\,X_n,Y_n \sim N(0,c_1^2)$. First note that
\[
nh^{IIIa}\big|(c_1-c_2)(c_1+c_2)\big| = \big|nh^{IIIa}(c_1^2-c_2^2)\big| \lesssim \gamma_{n/m},
\]
by (4.55). Since $C^{IIIa}_{n/m}$ is of order $1$, both $c_1$ and $c_2$ are of order $(nh^{IIIa})^{-\frac12}$, so that $c_1+c_2\gtrsim(nh^{IIIa})^{-\frac12}$, and dividing both sides by $(nh^{IIIa})^{\frac12}$ yields
\[
|c_1-c_2| = \Big|\sqrt{C^{IIIa}_{n,m}(x,x)}-\sqrt{(nh^{IIIa})^{-1}C^{IIIa}_{n/m}(x,x)}\Big| \lesssim (nh^{IIIa})^{-\frac12}\gamma_{n/m}. \tag{4.59}
\]
Let us denote the half length of the credible interval obtained with distributed method IIIa by
\[
l^{IIIa}_{n,m}(x;\beta) = z_{(1+\beta)/2}\sqrt{C^{IIIa}_{n,m}(x,x)}.
\]

Using this, we obtain
\begin{align}
&\Big|\Phi\Big(\sqrt{nh^{IIIa}}\,(C'_n(x,x))^{-\frac12}\,l^{IIIa}_{n,m}(x;\beta)\Big)\Big|\notag\\
&\quad=\Bigg|\Phi\Bigg(\sqrt{nh^{IIIa}}\,(C'_n(x,x))^{-\frac12}\Bigg(l^{IIIa}_{n,m}(x;\beta)-\sqrt{\frac{C^{IIIa}_{n/m}(x,x)}{nh^{IIIa}}}\,z_{(1+\beta)/2}+\sqrt{\frac{C^{IIIa}_{n/m}(x,x)}{nh^{IIIa}}}\,z_{(1+\beta)/2}\Bigg)\Bigg)\Bigg|\notag\\
&\quad=\Bigg|\Phi\Bigg(z_{(1+\beta)/2}\sqrt{\frac{C^{IIIa}_{n/m}(x,x)}{C'_n(x,x)}}+\sqrt{nh^{IIIa}}\,(C'_n(x,x))^{-\frac12}\Bigg(l^{IIIa}_{n,m}(x;\beta)-\sqrt{\frac{C^{IIIa}_{n/m}(x,x)}{nh^{IIIa}}}\,z_{(1+\beta)/2}\Bigg)\Bigg)\Bigg|\notag\\
&\quad\le\Bigg|\Phi\Bigg(z_{(1+\beta)/2}\sqrt{\frac{C^{IIIa}_{n/m}(x,x)}{C'_n(x,x)}}\Bigg)\Bigg|+D\,\Bigg|\sqrt{nh^{IIIa}}\,(C'_n(x,x))^{-\frac12}\Bigg(l^{IIIa}_{n,m}(x;\beta)-\sqrt{\frac{C^{IIIa}_{n/m}(x,x)}{nh^{IIIa}}}\,z_{(1+\beta)/2}\Bigg)\Bigg|\notag\\
&\quad\lesssim\Bigg|\Phi\Bigg(z_{(1+\beta)/2}\sqrt{\frac{C^{IIIa}_{n/m}(x,x)}{C'_n(x,x)}}\Bigg)\Bigg|+\gamma_{n/m}, \tag{4.60}
\end{align}
where the first inequality follows from the fact that $\Phi$ has a bounded derivative and the second inequality follows from (4.59). Furthermore, note that
\begin{align}
\mathbb{P}\big[f_0(x)\in \hat{C}^{IIIa}_{n,m}(x;\beta)\big]
&= \mathbb{P}\Big[\sqrt{nh^{IIIa}}\big(f^{IIIa}_{n,m}(x)-F^{IIIa}_\lambda f_0(x)\big)\in\Big[\sqrt{nh^{IIIa}}\,P^{IIIa}_\lambda f_0(x)\pm l^{IIIa}_{n,m}(x;\beta)\Big]\Big]\notag\\
&= \mathbb{P}\Big[C'_n(x,x)^{-\frac12}\sqrt{nh^{IIIa}}\big(f^{IIIa}_{n,m}(x)-F^{IIIa}_\lambda f_0(x)\big) \tag{4.61}\\
&\qquad\quad\in\Big[C'_n(x,x)^{-\frac12}\big(\sqrt{nh^{IIIa}}\,P^{IIIa}_\lambda f_0(x)\pm l^{IIIa}_{n,m}(x;\beta)\big)\Big]\Big]. \tag{4.62}
\end{align}

As we can write $\mathbb{P}(Z\in[a,b]) = \Phi(b)-\Phi(a)$ for $Z\sim N(0,1)$, using (4.54) combined with (4.60) we obtain
\begin{align}
&\Bigg|\mathbb{P}\big[f_0(x)\in \hat{C}^{IIIa}_{n,m}(x;\beta)\big]
-\Phi\Big(C'_n(x,x)^{-\frac12}\big(\sqrt{nh^{IIIa}}\,P^{IIIa}_\lambda f_0(x)+l^{IIIa}_{n,m}(x;\beta)\big)\Big)\notag\\
&\qquad+\Phi\Big(C'_n(x,x)^{-\frac12}\big(\sqrt{nh^{IIIa}}\,P^{IIIa}_\lambda f_0(x)-l^{IIIa}_{n,m}(x;\beta)\big)\Big)\Bigg|\notag\\
&\quad\lesssim \Bigg|\mathbb{P}\big[f_0(x)\in \hat{C}^{IIIa}_{n,m}(x;\beta)\big]
-\Phi\Bigg(\sqrt{\frac{C^{IIIa}_{n/m}(x,x)}{C'_n(x,x)}}\,z_{(1+\beta)/2}+C'_n(x,x)^{-\frac12}\sqrt{nh^{IIIa}}\,P^{IIIa}_\lambda f_0(x)\Bigg)\notag\\
&\qquad+\Phi\Bigg(-\sqrt{\frac{C^{IIIa}_{n/m}(x,x)}{C'_n(x,x)}}\,z_{(1+\beta)/2}+C'_n(x,x)^{-\frac12}\sqrt{nh^{IIIa}}\,P^{IIIa}_\lambda f_0(x)\Bigg)\Bigg|+\gamma_{n/m}\notag\\
&\quad\lesssim (nh^{IIIa})^{-\frac12}+\sqrt{nh^{IIIa}}\,\delta_{n/m}+\gamma_{n/m}, \tag{4.63}
\end{align}
where we denote
\[
u^{IIIa}_{n,m}(x;\beta) := \sqrt{\frac{C^{IIIa}_{n/m}(x,x)}{C'_n(x,x)}}\,z_{(1+\beta)/2},
\qquad
b^{IIIa}_{n,m} := C'_n(x,x)^{-\frac12}\sqrt{nh^{IIIa}}\,P^{IIIa}_\lambda f_0(x). \tag{4.64}
\]

Note that all separate terms in (4.63) go to $0$ for $n/m^4\to\infty$.

Having shown this, we can show the required result analogously to the smooth-match case of Section 5.6 in [33]. Let $(f_n)_j = C_\beta j^{-\alpha}$ for $j = j_n = \lfloor(\lambda/m)^{-1/(2\alpha)}\rfloor$ and $(f_n)_j = 0$ for $j\neq j_n$, where the constant $C_\beta$ is a parameter which will be determined later. For $B\ge C_\beta$ it trivially follows that $f_n\in\Theta^{\alpha+\frac12}_S(B)$. This means we obtain
\begin{align*}
\sqrt{nh^{IIIa}}\,P^{IIIa}_\lambda f_n(x)
&= \Big(\frac{\lambda}{m}\Big)^{\frac12}\sum_{j=1}^\infty \frac{j^{-\alpha}}{\lambda/m+j^{-2\alpha}}\,\frac{(f_n)_j}{j^{-\alpha}}\,\phi_j(x)\\
&= C_\beta\Big(\frac{\lambda}{m}\Big)^{\frac12}\frac{\lfloor(\lambda/m)^{-\frac{1}{2\alpha}}\rfloor^{-\alpha}}{\lambda/m+\lfloor(\lambda/m)^{-\frac{1}{2\alpha}}\rfloor^{-2\alpha}}\,\phi_{j_n}(x)\\
&=: c_{\lambda,m}\,C_\beta\,\phi_{j_n}(x),
\end{align*}
where $c_{\lambda,m}\in[\tfrac12,1]$: for $\lfloor(\lambda/m)^{-\frac{1}{2\alpha}}\rfloor = (\lambda/m)^{-\frac{1}{2\alpha}}$ we obtain $c_{\lambda,m}=1/2$, and furthermore
\[
\Big(\frac{\lambda}{m}\Big)^{\frac12}\frac{\lfloor(\lambda/m)^{-\frac{1}{2\alpha}}\rfloor^{-\alpha}}{\lambda/m+\lfloor(\lambda/m)^{-\frac{1}{2\alpha}}\rfloor^{-2\alpha}}
\le \Big(\frac{\lambda}{m}\Big)^{\frac12}\frac{\big((\lambda/m)^{-\frac{1}{2\alpha}}\big)^{-\alpha}}{\lambda/m+\lfloor(\lambda/m)^{-\frac{1}{2\alpha}}\rfloor^{-2\alpha}}
\le 1.
\]
Now let $b_n$ be the solution of the equation
\[
\Phi\big(u^{IIIa}_{n,m}(x;\beta)+b_n\big)-\Phi\big(u^{IIIa}_{n,m}(x;\beta)-b_n\big) = \beta.
\]
If $C_\beta$ is chosen such that $b_n = C^{IIIa}_{n/m}(x,x)^{-\frac12}\,c_{\lambda,m}\,C_\beta\,\phi_{j_n}(x)$, by (4.63) we obtain
\[
\mathbb{P}\big[f_n(x)\in \hat{C}^{IIIa}_{n,m}(x;\beta)\big]\to\beta,
\]
for $n/m\to\infty$.

Conclusion

In this thesis we have studied distributed Bayesian methods in the nonparametric regression model. We briefly reviewed the characteristics of the nonparametric regression model and motivated the use of Bayesian methods with Gaussian process priors: in contrast to frequentist point estimation, Bayesian methods come with built-in uncertainty quantification. We saw that Gaussian processes offer flexible priors through the wide range of available covariance kernels. We studied two kernels, the Matérn kernel and the squared exponential kernel, whose eigendecompositions we used in the RKHS setting; the eigendecomposition of the Matérn kernel turned out to allow easier computations. We have seen that a good posterior should not only have a mean close to the true function, but should also quantify the uncertainty well, in the sense that the true function is contained in the band in which the posterior function, given the data, lies with 95% probability.

We started by performing simulation studies for the Matérn kernel in a nondistributed setting. For these simulations we reviewed the theoretical results presented in [33], which form the optimal behaviour to aim for in the distributed setting, regarding both the posterior bias and the frequentist coverage of pointwise credible sets. The functions for which we computed theoretical results belong to either a Sobolev ball or a Hölder class, which are sets of functions satisfying certain smoothness conditions in terms of a smoothness parameter α.

Afterwards, we studied the performance of this model in a distributed setting in which the smoothness parameter is assumed to be known. In the distributed setting, we divided the data of size n over m machines, which separately compute local posteriors that are then aggregated into a global posterior by various aggregation methods. The first option we considered was naively taking draws from all m local posteriors and averaging these draws to form one global posterior. The simulation study shows that this method results in both suboptimal recovery and suboptimal uncertainty quantification. We then considered multiple other distributed methods, mostly following [25], which studied distributed Bayesian methods in the signal-in-white-noise model. These methods consist of changing the computation of the local posteriors, of different aggregation methods, or of combinations of both. The methods performed similarly to their counterparts in the signal-in-white-noise model.

For three of the methods from the simulation studies, we derived theoretical results for both the L∞ distance between the estimator and the truth and the frequentist coverage of pointwise posterior credible sets, with the aim of comparing them to the optimal results shown for the nondistributed setting. To prove these theoretical results, we used the RKHS framework proposed in [33], which made it possible to derive theoretical results for the model without the use of matrix operations. The theoretical results were obtained analogously to those for the nondistributed setting and confirmed the observations from the simulation studies: the studied methods behave similarly to the corresponding methods in the signal-in-white-noise model.

After the theoretical results for the distributed methods, we performed simulation studies using the squared exponential kernel instead of the Matérn kernel. For this kernel, the distributed methods performed similarly to the Matérn case. Finally, we looked into adaptive methods, which tune the parameters of the kernel from the data by maximising the likelihood, as opposed to assuming them known as in all earlier cases. The adaptive methods have been shown to work in a nondistributed setting, but fail to produce good results in the distributed setting: both the Matérn kernel and the squared exponential kernel show suboptimal recovery and uncertainty quantification when their hyperparameters are estimated in the distributed setting.

The research can be continued in various directions. The proven L∞ bounds can be extended to contraction rates, and there are still distributed methods for which no L∞ bounds have been proven. The results on the frequentist coverage of pointwise posterior credible sets can be extended to sup-norm credible bands around the estimator $f_{n,m}$. In particular, all of these results can be extended to the squared exponential kernel, which might require more work because of the different structure of its eigenvalues and eigenfunctions. The adaptive methods have been studied via simulation, but require theoretical results as well. Moreover, no distributed adaptive method has yet produced good results in the simulation studies; it remains to find a solution for this, similar to the open problem from [25].


Appendices


A. Appendix

A.1. Proof of Lemma 5

The proof of this lemma is analogous to that of Lemma 5.3 in [33]. We include it for self-containedness and completeness.

Proof. To begin our proof, we recall a version of the classical Bernstein inequality and a resulting expectation bound.

Theorem 7 (Bernstein's inequality). Let $X_1,\ldots,X_n$ be independent random variables. Let $\nu$ and $c$ be positive numbers such that $\sum_{i=1}^n \mathbb{E}X_i^2\le\nu$ and $\sum_{i=1}^n \mathbb{E}|X_i|^q\le 2^{-1}q!\,\nu c^{q-2}$ for $q\ge 3$. Let $S = \sum_{i=1}^n(X_i-\mathbb{E}X_i)$. Then, for any $\lambda\in(0,1/c)$,
\[
\mathbb{E}e^{\lambda S}\le \exp\bigg(\frac{\nu\lambda^2}{2(1-c\lambda)}\bigg),
\]
which in particular implies, for any $x>0$,
\[
\mathbb{P}\big(|S|\ge\sqrt{2\nu x}+cx\big)\le 2e^{-x}. \tag{A.1}
\]
Further, if $S$ is a random variable satisfying (A.1), then for any $q\ge 1$,
\[
\mathbb{E}[S^{2q}]\le q!(8\nu)^q+(2q)!(4c)^{2q}.
\]
In particular, $\mathbb{E}[S^2]\le C_1\nu+C_2c^2$, where $C_1,C_2$ are nonnegative constants.
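For instance, for independent variables that are bounded, say $|X_i|\le M$, the moment condition is easy to check: $\mathbb{E}|X_i|^q\le M^{q-2}\mathbb{E}X_i^2$, so taking $\nu=\sum_{i=1}^n\mathbb{E}X_i^2$ and $c=M$ gives $\sum_{i=1}^n\mathbb{E}|X_i|^q\le \nu M^{q-2}\le 2^{-1}q!\,\nu c^{q-2}$ for $q\ge3$, and (A.1) yields
\[
\mathbb{P}\big(|S|\ge\sqrt{2\nu x}+Mx\big)\le 2e^{-x}.
\]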

This proof makes use of the following tail bound for suprema of empirical processes with sub-exponential increments, taken from [1].

Theorem 8 (Bernstein-type inequality for suprema of random processes, Theorem 2.1 from [1]). Let $(X_t)_{t\in T}$ be a centered family with $T\subset\mathbb{R}^D$ for some finite $D$. Fix some $t_0\in T$ and let $Z = \sup_t|X_t-X_{t_0}|$. Consider norms $d(s,t)=d(s-t)$ and $\delta(s,t)=\delta(s-t)$ on $\mathbb{R}^D$, and assume there exist $v,b>0$ and $c\ge0$ such that
\[
T\subset\{t\in\mathbb{R}^D : d(t,t_0)\le v,\; c\delta(t,t_0)\le b\}.
\]
Moreover, assume that for all $s,t\in T$ with $s\neq t$,
\[
\mathbb{E}\big[e^{\lambda(X_t-X_s)}\big]\le\exp\bigg(\frac{\lambda^2 d^2(t,s)}{2(1-\lambda c\delta(t,s))}\bigg),\qquad \forall\lambda\in\bigg[0,\frac{1}{c\delta(t,s)}\bigg]. \tag{A.2}
\]
Then, with $C=18$,
\[
\mathbb{P}\Big(Z\ge C\big(\sqrt{v^2(1+x)}+b(1+x)\big)\Big)\le 2e^{-x},\qquad\forall x>0. \tag{A.3}
\]


Without loss of generality, we assume $\sigma=1$. Let
\[
U_t = \sum_{i=1}^{n/m} z_i\sum_{k=1}^\infty \nu_k\,\frac{1}{\sqrt m}\sqrt{\sum_{j=1}^m\phi_k^2(X_i^{(j)})}\;\phi_k(t),
\]
where $z_i\sim N(0,1/m)$, $X_i\sim U(0,1)$, and $X_i$ is independent of $z_j$ for all $0\le i,j\le n$. It suffices to find a tail bound for $\|U\|_\infty = \sup_{t\in[0,1]}|U_t|$. For $t,s\in[0,1]$, we write
\[
U_t = \sum_{i=1}^{n/m} z_i\sum_{k=1}^\infty \nu_k\phi_k(t)\,\frac{1}{\sqrt m}\sqrt{\sum_{j=1}^m\phi_k^2(X_i^{(j)})}.
\]
From this follows
\[
U_t-U_s = \sum_{i=1}^{n/m} z_i\Gamma_i^{t,s}
= \sum_{i=1}^{n/m} z_i\sum_{k=1}^\infty \nu_k\big(\phi_k(t)-\phi_k(s)\big)\,\frac{1}{\sqrt m}\sqrt{\sum_{j=1}^m\phi_k^2(X_i^{(j)})},
\]
where we denote
\[
\Gamma_i^{t,s} := \sum_{k=1}^\infty \nu_k\big(\phi_k(t)-\phi_k(s)\big)\,\frac{1}{\sqrt m}\sqrt{\sum_{j=1}^m\phi_k^2(X_i^{(j)})};
\]
we will suppress the dependence on $s$ and $t$ and write $\Gamma_i$ from now on. Notice that
\[
\frac{1}{\sqrt m}\sqrt{\sum_{j=1}^m\phi_k^2(X_i^{(j)})}\le\frac{1}{\sqrt m}\sqrt{mC_\phi^2} = C_\phi,
\]
for $C_\phi$ the uniform bound for $\{\phi_j\}_{j\in\mathbb{N}}$.

We now proceed with the proof by defining the two norms $d$ and $\delta$. Let $d(t,s) = \big(\mathbb{E}[(U_t-U_s)^2]\big)^{\frac12}$ and, for any $\kappa\in(0,\alpha\wedge1)$, let $\delta(t,s) = |t-s|^\kappa$. Using these norms, we estimate the quantities $v$, $b$ and $c$ in (A.3). First, note that
\[
d^2(t,s) = \mathbb{E}\bigg[\bigg(\sum_{i=1}^{n/m} z_i\Gamma_i\bigg)^2\bigg] = \sum_{i=1}^{n/m} m^{-1}\mathbb{E}\Gamma_i^2.
\]
As the eigenfunctions are uniformly bounded and orthonormal, we obtain
\[
\mathbb{E}\Gamma_i^2 = \sum_{k=1}^\infty \nu_k^2\,|\phi_k(t)-\phi_k(s)|^2 \le 4C_\phi^2\sum_{k=1}^\infty\nu_k^2 \le Ch^{-1}.
\]
The two previous displays combined now imply
\[
\sup_{t,s\in[0,1]} d(t,s)\le\sqrt{nm^{-2}h^{-1}}.
\]

In order to satisfy condition (A.2) we need bounds on $\mathbb{E}\sum_{i=1}^{n/m}|z_i\Gamma_i|^q$ for $q\ge3$. We start by bounding $|\Gamma_i|$. For any $i$ and fixed $\kappa\in(0,1)$, we use the uniform boundedness and Lipschitz continuity of the eigenfunctions to obtain
\begin{align*}
|\Gamma_i| &\le \sum_{k=1}^\infty \nu_k|\phi_k(t)-\phi_k(s)|\,\bigg|\frac{1}{\sqrt m}\sqrt{\sum_{j=1}^m\phi_k^2(X_i^{(j)})}\bigg|\\
&\le C_\phi\sum_{k=1}^\infty \nu_k|\phi_k(t)-\phi_k(s)|^\kappa\,(2C_\phi)^{1-\kappa}\\
&\le 2C_\phi^{2-\kappa}L_\phi^\kappa\,|t-s|^\kappa\sum_{k=1}^\infty k^\kappa\nu_k\\
&\le Ch^{-(1+\kappa)}\delta(t,s).
\end{align*}
This implies that for any $q\ge3$,
\[
\mathbb{E}\sum_{i=1}^{n/m}|z_i\Gamma_i|^q
\le \sum_{i=1}^{n/m} m^{-q/2}q^{q/2}\,\mathbb{E}\big[|\Gamma_i|^{q-2}\Gamma_i^2\big]
\le 2^{-1}q!\,d^2(t,s)\big(h^{-(1+\kappa)}\delta(t,s)\big)^{q-2},
\]
where we used both the fact that $\mathbb{E}|z_i|^q\lesssim m^{-q/2}q^{q/2}$ and $q!\ge(q/e)^q$. For the $|\Gamma_i|^{q-2}$ term, we used the global bound on $\Gamma_i$ and the fact that $\sum_{i=1}^{n/m} m^{-1}\mathbb{E}\Gamma_i^2 = d^2(t,s)$.

Now let $c = h^{-(1+\kappa)}$. Then by Theorem 7, we obtain
\[
\mathbb{E}e^{\lambda(U_t-U_s)}\le\exp\bigg(\frac{\lambda^2 d^2(t,s)}{2(1-\lambda c\delta(t,s))}\bigg),\qquad\forall\lambda\in\bigg[0,\frac{1}{c\delta(t,s)}\bigg].
\]
Thus condition (A.2) is satisfied, which implies that the quantities $v^2$ and $b$ in (A.3) are given by $nm^{-2}h^{-1}$ and $h^{-(1+\kappa)}$, respectively. Theorem 8 now states
\[
\mathbb{P}\Big(\|U\|_\infty\ge 18\big(\sqrt{nm^{-2}h^{-1}(1+x)}+h^{-(1+\kappa)}(1+x)\big)\Big)\le 2e^{-x},\qquad\forall x>0.
\]
Dividing both sides inside the probability by $n/m$ proves the stated result.

A.2. Technical lemmas

Lemma A.1. For $r,s,t\ge0$ with $st>1$, consider $f(x) = x^{-r}(x^s+1)^{-t}$ and set
\[
f_\nu = \nu^{-1}\sum_{k=1}^\infty f(k/\nu).
\]
Then as $\nu\to\infty$,

(i) if $r<1$, then $f_\nu\asymp\int_0^\infty f(x)\,dx$;

(ii) if $r=1$, then $f_\nu\asymp\log\nu$;

(iii) if $r>1$, then $f_\nu\asymp\nu^{r-1}\sum_{k=1}^\infty k^{-r}$.

Proof. A proof of this lemma can be found in [24].
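As an illustration, the eigenvalue sums used in Section 4.10 are of exactly this form: taking $f(x) = (x^{2\alpha}+1)^{-2}$, that is $r=0$, $s=2\alpha$, $t=2$ (so $st=4\alpha>1$), and $\nu=h^{-1}$, case (i) gives, as $h\to0$,
\[
\sum_{k=1}^\infty\frac{1}{(1+(hk)^{2\alpha})^2} = h^{-1}\cdot h\sum_{k=1}^\infty f(hk)\asymp h^{-1}\int_0^\infty\frac{dx}{(1+x^{2\alpha})^2}\asymp h^{-1}.
\]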


Popular summary

In recent times, digital techniques have made it possible to easily keep track of large sets of data. Due to this availability of data, the importance of data analysis has grown immensely; topics such as machine learning and big data are mentioned frequently. Many modern applications use the historical data of their users to fine-tune the application to the user. As the amount of available data grows, this can result in very long computations.

These computations can, for example, be sped up by approximating the calculations performed, which should return a similar result in less time. Another way to speed up the process is to have multiple machines work on the same computation at the same time. Such methods are called distributed methods. In distributed methods, we assume we have a single set of data, which we will denote by D. This set is divided over multiple machines, each of which performs, on its own part of the data, the computations that would otherwise be done on just one computer. After these local computations, we combine the results into a single global outcome by a method which we call an aggregation method.

We are interested in performing these methods on the nonparametric regression model. Regression models in general concern data consisting of multiple variables, one of which is to be predicted from all of the other variables. A simple example is to estimate someone's weight based on their height. To do so, we usually consider different candidate functions, and for each of them we calculate the Euclidean distance between its predictions and the true values. The function with the smallest average distance between predictions and true values is then chosen as the optimal choice based on this data set.

As the potential outcomes are limitless, the function is usually assumed to be of a certain shape. An example is linear regression, where we look for the straight line that fits the cloud of data points best. In many cases this can work, but for more complex and unknown functions it can give terrible results!

In nonparametric regression we do not assume that the function belongs to some specific parametric family, hence introducing more flexibility into the model. Where parametric regression assumes in advance that the function is of a certain shape, nonparametric regression takes this uncertainty into account and bases the shape more on the data.


Figure 4.1.: Linear regression.
Figure 4.2.: Nonparametric regression.

As in this case we have the prior knowledge that the actual function generating the data is linear, we know that the linear estimator will generally perform optimally. As we can see, the nonparametric estimator is more likely to shape itself to a coincidental clustering of points. For example, if the sample happens to contain three people of height 1.82 who are all relatively light for their height, the estimate at height 1.82 will also be lower than in the linear case, which is not optimal. We call this phenomenon overfitting to the data; in most real-life cases it is not a problem, as a larger sample size makes for less overfitting.

We want to perform the nonparametric regression in a Bayesian way. The world of statistics can roughly be divided into two schools of thought: the frequentist and the Bayesian. When estimating from a set of data, the frequentist assumes there is always a true value to be found, while the Bayesian says that reality is more uncertain and looks for a distribution of possible outcomes. The Bayesian does so by assuming a distribution for a parameter θ which is likely to be of the shape of the outcome, the so-called prior distribution. This is combined with the likelihood of the given data D to find a distribution of θ given D.
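In formulas, this is Bayes' rule: writing $\pi(\theta)$ for the prior and $L(D\mid\theta)$ for the likelihood of the data,
\[
\pi(\theta\mid D) = \frac{L(D\mid\theta)\,\pi(\theta)}{\int L(D\mid\theta')\,\pi(\theta')\,d\theta'} \propto L(D\mid\theta)\,\pi(\theta).
\]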

As we have seen, this is done by first assuming a prior distribution for the function f. We will consider so-called Gaussian processes as prior distributions. Gaussian processes are random functions defined by a mean function, which gives the expected value at each point in time, and a covariance function, which gives the covariance between any two points in time. Some examples of Gaussian processes can be found below.


Figure 4.3.: Sample draws from different Gaussian processes.
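For readers who want to generate such draws themselves, here is a minimal sketch; it is not the simulation code used in this thesis. It assumes only numpy and uses the squared exponential kernel because its formula is short; the grid size and the length scale 0.2 are arbitrary illustrative choices.

```python
import numpy as np

def sq_exp_kernel(x, y, length_scale=0.2):
    # Squared exponential covariance: k(x, y) = exp(-(x - y)^2 / (2 l^2)).
    return np.exp(-0.5 * ((x[:, None] - y[None, :]) / length_scale) ** 2)

t = np.linspace(0.0, 1.0, 200)   # grid of "time" points in [0, 1]
K = sq_exp_kernel(t, t)          # prior covariance matrix on the grid
rng = np.random.default_rng(0)

# Three draws from the zero-mean Gaussian process prior N(0, K); the tiny
# jitter keeps the covariance matrix numerically positive definite.
samples = rng.multivariate_normal(np.zeros(t.size), K + 1e-10 * np.eye(t.size), size=3)
```

Each row of `samples` is one random function evaluated on the grid, comparable to the draws shown in Figure 4.3.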

As we do not want to add too much bias to our outcome, we will first assume the mean function to be 0 at every point in time, so that the characteristics of the Gaussian process prior distribution are determined solely by the function computing the covariance between the time points.

As visible in Figure 4.3, some Gaussian processes are rougher than others. The covariance kernel we choose for our prior distribution, the Matérn kernel, depends on only one parameter α, which determines the smoothness of the function. There is no single optimal value of the smoothness; it depends on the smoothness of the function f that is to be estimated.

Combining this with the likelihood of the data, we now obtain a posterior distribution for f given the data.

Figure 4.4.: The data.
Figure 4.5.: The true function with the estimated function for α = 1.

Here the black line is the original function f, the thick blue line is the expected posterior function, and the dashed blue lines around it show the band in which the posterior function lies with 95% certainty. We conclude that this method gives a good estimate of the original function and quantifies the uncertainty well.


However, as data sets grow bigger, the estimation process takes more time. For a data set k times larger, the process takes around k³ times as long: that is 1000 times as long for only 10 times as much data! As this takes unacceptably long for large sets of data, we will use the distributed methods described earlier to improve the speed at which the estimate is computed.
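The cubic growth is easy to observe for yourself: the dominant cost of Gaussian process regression is factoring an n × n covariance matrix. A small timing sketch, assuming numpy (the exact timings depend on the machine, but each doubling of n should roughly multiply the time by 8):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
for n in (500, 1000, 2000):
    A = rng.standard_normal((n, n))
    A = A @ A.T + n * np.eye(n)   # an n x n positive definite "covariance" matrix
    t0 = time.perf_counter()
    np.linalg.cholesky(A)         # the O(n^3) step that dominates GP regression
    print(n, round(time.perf_counter() - t0, 4))
```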

This is done by dividing the data of size n into m parts, which we send to m separate computers. These computers each compute the posterior distribution based on their local data of size n/m, and afterwards we aggregate these into a single global posterior distribution. The most obvious way of aggregating these posterior distributions is to take their average: when we want to estimate f(x) given the data, we take the average of the f_j(x), where f_j(x) is the estimate of f(x) computed by computer j using only its own local data.

Figure 4.6.: Average of local posterior means with corresponding pointwise credible sets.
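A minimal sketch of this naive averaging, again assuming only numpy; the kernel, the noise level 0.1 and the toy truth sin(8x) are illustrative choices, not the settings of the simulation studies in this thesis.

```python
import numpy as np

def sq_exp_kernel(x, y, length_scale=0.2):
    return np.exp(-0.5 * ((x[:, None] - y[None, :]) / length_scale) ** 2)

def gp_posterior_mean(x_train, y_train, x_test, sigma2=0.1, length_scale=0.2):
    # Posterior mean of GP regression with observation noise variance sigma2.
    K = sq_exp_kernel(x_train, x_train, length_scale) + sigma2 * np.eye(x_train.size)
    K_star = sq_exp_kernel(x_test, x_train, length_scale)
    return K_star @ np.linalg.solve(K, y_train)

rng = np.random.default_rng(1)
n, m = 1000, 10
x = rng.uniform(0.0, 1.0, n)
y = np.sin(8 * x) + np.sqrt(0.1) * rng.standard_normal(n)  # noisy data from a toy truth
x_test = np.linspace(0.0, 1.0, 50)

# Each of the m machines computes its local posterior mean on n/m points;
# the naive global estimate is the plain average of the local means.
parts = np.array_split(np.arange(n), m)
local_means = [gp_posterior_mean(x[idx], y[idx], x_test) for idx in parts]
global_mean = np.mean(local_means, axis=0)
```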

From this we can conclude that the average of the estimates is too smooth. The local posteriors do not have sufficient data to compute a good estimate of the true function. The low influence of the data can be compensated for by raising the likelihood to the power m in the local computations, and taking the average of the local posteriors afterwards.


Figure 4.7.: Average of local posterior means with corresponding pointwise credible sets when using adjusted local likelihoods.
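In the Gaussian regression model, raising the likelihood to the power m amounts to dividing the observation noise variance by m, so in the sketch above (reusing gp_posterior_mean, x, y, parts and x_test defined there) the adjustment is a one-line change:

```python
# Adjusted local likelihoods: for Gaussian noise, likelihood^m is the same
# as pretending the noise variance is sigma2 / m.
local_means_adj = [
    gp_posterior_mean(x[idx], y[idx], x_test, sigma2=0.1 / m) for idx in parts
]
global_mean_adj = np.mean(local_means_adj, axis=0)
```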

One can see that this results in a better estimate of the true function f. However, in this case the sets in which 95% of the posterior distributions are contained are much smaller than before, and do not seem to cover the true function at many coordinates. This leaves room for improvement. Instead of raising the likelihood to the power m, we can also raise the prior distribution to the power 1/m when computing the local posteriors.

Figure 4.8.: Average of local posterior means with corresponding pointwise credible sets when using adjusted priors.
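Raising a zero-mean Gaussian prior density to the power 1/m yields again a Gaussian density, with covariance multiplied by m, so for the Gaussian process prior the adjustment amounts to scaling the covariance kernel by m. A sketch under the same assumptions as before, reusing sq_exp_kernel, x, y, parts and x_test:

```python
def gp_posterior_mean_adjusted_prior(x_train, y_train, x_test,
                                     sigma2=0.1, length_scale=0.2, m=10):
    # Prior raised to the power 1/m: the prior covariance kernel is
    # multiplied by m, while the likelihood is left untouched.
    K = m * sq_exp_kernel(x_train, x_train, length_scale) + sigma2 * np.eye(x_train.size)
    K_star = m * sq_exp_kernel(x_test, x_train, length_scale)
    return K_star @ np.linalg.solve(K, y_train)

local_means_prior = [gp_posterior_mean_adjusted_prior(x[idx], y[idx], x_test)
                     for idx in parts]
global_mean_prior = np.mean(local_means_prior, axis=0)
```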

Here we see that the resulting global posterior not only estimates the true function well, but the true function also lies inside the 95% credible set around every point. Hence there exist distributed methods which perform as well as the nondistributed setting.

We have seen that Bayesian nonparametric regression methods form good estimation methods for functions about which very little is known beforehand. The model allows for much flexibility in the choice of the prior distribution, and we saw that Gaussian processes with the Matérn covariance kernel perform well in the nondistributed setting. However, when distributing the data, there are many traps to fall into. The local posteriors are computed with an insufficient amount of data, which results in an oversmoothed global posterior when the local posteriors are naively averaged. There are ways to bypass this problem. One is to raise the likelihood of the data to the power m, which as we saw gives a better estimate, but does not quantify the uncertainty well. A better solution is to raise the prior to the power 1/m, which not only performs similarly to the nondistributed setting in terms of estimation, but also quantifies the uncertainty well.


Bibliography

[1] Y. Baraud et al. A Bernstein-type inequality for suprema of random processes with applications to model selection in non-Gaussian regression. Bernoulli, 16(4):1064–1085, 2010.

[2] J. Bernardo, J. Berger, A. Dawid, A. Smith, et al. Regression and classification using Gaussian process priors. Bayesian Statistics, 6:475, 1998.

[3] P. Bromiley. Products and convolutions of Gaussian probability density functions. Tina-Vision Memo, 3(4):1, 2003.

[4] F. Burk. The geometric, logarithmic, and arithmetic mean inequality. The American Mathematical Monthly, 94(6):527–528, 1987.

[5] Y. Cao and D. J. Fleet. Generalized product of experts for automatic and principled fusion of Gaussian process predictions. arXiv preprint arXiv:1410.7827, 2014.

[6] C. Carmeli, E. De Vito, and A. Toigo. Vector valued reproducing kernel Hilbert spaces of integrable functions and Mercer theorem. Analysis and Applications, 4(04):377–408, 2006.

[7] M. P. Deisenroth and J. W. Ng. Distributed Gaussian processes. arXiv preprint arXiv:1502.02843, 2015.

[8] C. B. Do. Gaussian processes. Stanford University, Stanford, CA, 2007. Accessed Dec. 5, 2017.

[9] S. Ghosal, J. K. Ghosh, A. W. van der Vaart, et al. Convergence rates of posterior distributions. Annals of Statistics, 28(2):500–531, 2000.

[10] S. Ghosal and A. van der Vaart. Fundamentals of Nonparametric Bayesian Inference, volume 44. Cambridge University Press, 2017.

[11] N. E. Gibbs, W. G. Poole, Jr., and P. K. Stockmeyer. An algorithm for reducing the bandwidth and profile of a sparse matrix. SIAM Journal on Numerical Analysis, 13(2):236–250, 1976.

[12] I. S. Gradshteyn and I. M. Ryzhik. Table of Integrals, Series, and Products. Academic Press, 2014.

[13] R. Guhaniyogi, C. Li, T. D. Savitsky, and S. Srivastava. A divide-and-conquer Bayesian approach to large-scale kriging. arXiv preprint arXiv:1712.09767, 2017.

[14] M. T. Heath. Scientific Computing. McGraw-Hill, New York, 2002.

[15] B. Knapik, B. Szabo, A. van der Vaart, and J. van Zanten. Bayes procedures for adaptive inference in inverse problems for the white noise model. Probability Theory and Related Fields, 164(3-4):771–813, 2016.

[16] N. D. Lawrence. Gaussian process latent variable models for visualisation of high dimensional data. In Advances in Neural Information Processing Systems, pages 329–336, 2004.

[17] M. Liu, Z. Shang, and G. Cheng. How many machines can we use in parallel computing for kernel ridge regression? arXiv preprint arXiv:1805.09948, 2018.

[18] J. W. Ng and M. P. Deisenroth. Hierarchical mixture-of-experts model for large-scale Gaussian process regression. arXiv preprint arXiv:1412.3078, 2014.

[19] D. Pati and A. Bhattacharya. Adaptive Bayesian inference in the Gaussian sequence model using exponential-variance priors. Statistics & Probability Letters, 103:100–104, 2015.

[20] J. Quinonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.

[21] C. E. Rasmussen. Gaussian processes in machine learning. In Advanced Lectures on Machine Learning, pages 63–71. Springer, 2004.

[22] Y. Saad. SPARSKIT: a basic tool kit for sparse matrix computations. 1990.

[23] S. L. Scott, A. W. Blocker, F. V. Bonassi, H. A. Chipman, E. I. George, and R. E. McCulloch. Bayes and big data: the consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management, 11(2):78–88, 2016.

[24] B. Szabo, A. van der Vaart, J. van Zanten, et al. Empirical Bayes scaling of Gaussian priors in the white noise model. Electronic Journal of Statistics, 7:991–1018, 2013.

[25] B. Szabo and H. van Zanten. An asymptotic analysis of distributed nonparametric methods. arXiv preprint arXiv:1711.03149, 2017.

[26] M. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Artificial Intelligence and Statistics, pages 567–574, 2009.

[27] V. Tresp. A Bayesian committee machine. Neural Computation, 12(11):2719–2741, 2000.

[28] V. Tresp. The generalized Bayesian committee machine. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, volume 20, pages 130–139. Citeseer, 2000.

[29] A. van der Vaart, H. van Zanten, et al. Bayesian inference with rescaled Gaussian process priors. Electronic Journal of Statistics, 1:433–448, 2007.

[30] A. W. van der Vaart and J. H. van Zanten. Rates of contraction of posterior distributions based on Gaussian process priors. The Annals of Statistics, pages 1435–1463, 2008.

[31] A. W. van der Vaart, J. H. van Zanten, et al. Adaptive Bayesian estimation using a Gaussian random field with inverse gamma bandwidth. The Annals of Statistics, 37(5B):2655–2675, 2009.

[32] G. Wahba. Spline Models for Observational Data, volume 59. SIAM, 1990.

[33] Y. Yang, A. Bhattacharya, and D. Pati. Frequentist coverage and sup-norm convergence rate in Gaussian process regression. arXiv preprint arXiv:1708.04753, 2017.