Adaptive inference for the mean of a Gaussian process
in functional data
Florentina Bunea¹, Marten H. Wegkamp¹ and Andrada E. Ivanescu²
Florida State University and East Carolina University
November 5, 2010
Abstract
This paper proposes and analyzes fully data driven methods for inference about the
mean function of a Gaussian process from a sample of independent trajectories of the
process, observed at random time points and corrupted by additive random error. The
proposed method uses thresholded least squares estimators relative to an approximat-
ing function basis. The variable threshold levels are estimated from the data and the
resulting estimates adapt to the unknown sparsity of the mean function relative to the
selected approximating basis. These results are based on novel oracle inequalities that
are used to derive the rates of convergence of our estimates. In addition, we construct
confidence balls that adapt to the unknown regularity of the mean function. They are
easy to compute since they do not require explicit estimation of the covariance operator
of the process. The simulation study shows that the new method performs very well in
practice, and is robust against large variations introduced by the random error terms.
Keywords: Stochastic processes; nonparametric mean estimation; thresholded estimators;
functional data; oracle inequalities; adaptive inference; confidence balls.
Acknowledgements: The authors thank the Associate Editor and a referee for their con-
structive remarks. The research of Florentina Bunea and Marten Wegkamp was supported in
part by NSF Grants DMS-0706829 and DMS-1007444. Part of the research was done while
the authors were visiting the Isaac Newton Institute for Mathematical Sciences (Statistical
Theory and Methods for Complex, High-Dimensional Data Programme) at Cambridge University during Spring 2008.
¹Department of Statistics, Florida State University, Tallahassee, FL 32306-4330.
²Department of Biostatistics, East Carolina University, Greenville, NC 27858-4353.
1 Introduction
In this paper we develop and analyze new methodology for inference about the mean of a
Gaussian process from data that consists of independent realizations of this process observed
at discrete times, where each observation is contaminated by an additive error term. For-
mally, let X(t), 0 ≤ t ≤ 1 be a Gaussian process with mean function f(t) = E[X(t)] and
stochastic part Z(t) = X(t) − f(t). We denote the covariance function of X (and Z) by
$\Gamma(s,t) = \mathrm{Cov}(X(s), X(t))$, for all $0 \leq s, t \leq 1$. We observe $Y_{ij}$ at times $t_{ij}$, for $1 \leq i \leq n$, $1 \leq j \leq m$, that are of the form
$$Y_{ij} = X_i(t_{ij}) + \varepsilon_{ij}, \qquad (1)$$
where the $X_i(t)$, with mean $f(t)$, are independent realizations of the process $X(t)$.
Although we could allow for different sample sizes $m_i$ per curve (the conditions on $m$ imposed below would then be replaced by conditions on $\min_i m_i$), we treat $m_i = m$ for ease of notation and clarity of exposition. We assume that the $\varepsilon_{ij}$ are independent across $i$ and $j$ with $E[\varepsilon_{ij}] = 0$ and $E[\varepsilon_{ij}^2] = \sigma_\varepsilon^2 < \infty$, and that $\varepsilon$ is independent of $X$.
Although the estimation of f received considerable attention over the last decade, the the-
oretical study of data-adaptive estimators in model (1) is still open to investigation. In
contrast with the abundance of methods for estimating f , methods for constructing confi-
dence sets for f are very limited. This motivates our two-fold contribution to the existing
literature:
(1) construction of computationally efficient and fully data-driven estimators and confi-
dence balls for f ;
(2) theoretical assessment of the quality of our data adaptive estimates and proof that the
estimators and the confidence balls adapt to the unknown regularity of f and Z.
We begin by reviewing the existing results in the literature, which provides further moti-
vation for the procedure set forth in this article. The problem of estimating f from data
generated from (1) has been considered by a large number of authors, starting with Ramsay
and Silverman (2002, 2005) and Ruppert, Wand and Carroll (2003). The existing methods are
either based on kernel smoothers as in Zhang and Chen (2007), Yao (2007), Benko, Hardle
and Kneip (2009), penalized splines, as in, for instance, Ramsay and Silverman (2005),
free-knot splines as in Gervini (2006), or ridge-type least squares estimates as in Rice and
Silverman (1991). All resulting estimates depend on tuning parameters that are method
specific. Theoretical properties of these estimates of f are still emerging, and have only been
established for non-adaptive choices of the respective tuning parameters which require prior
knowledge of the smoothness of f , see, for instance, Zhang and Chen (2007) and Gervini
(2006). Although guidelines for data-driven choices of these parameters are offered in all
these works, the theoretical properties of the resulting estimates are still open to investi-
gation. In contrast, we suggest in Section 2 below a computationally simple method based
on thresholded projection estimators, with variable threshold level. Our method does not
require any specification of the regularity of f(t) or Z(t) prior to estimation. We show via
oracle inequalities that our estimators adapt to this unknown regularity.
Whereas the estimation of the mean f(t) of the process X(t) is well understood, modulo
the technical and possibly computational issues raised above, the construction of uniform
confidence sets for f has not been investigated in this context and in general the construction
of confidence sets for f in model (1) seems to have received little attention. An exception
is Degras (2009). Although his procedure is attractive, his theoretical analysis ignores the
bias term when applying a classical result by Landau and Shepp (1970) on the supremum
of a Gaussian process. Therefore his confidence bands do not attain the nominal coverage.
We propose and analyze a number of alternative procedures for constructing confidence sets.
In particular, we offer a computationally simple procedure that leads to adaptive confidence
balls.
The paper is organized as follows. In Section 2.2 below we discuss thresholded projection
estimators in the functional data setting. In Section 2.3 we establish oracle inequalities for the
fit of the estimators which show that the estimates adapt to the unknown sparsity of the mean
f . Under appropriate conditions on the mean f(t) and the covariance function Γ(s, t) of the
process Z(t), we derive rates of convergence for our estimates in Section 2.4. In Section 2.5 we
construct confidence balls for f and prove that they have the desired asymptotic coverage
probability uniformly over large classes of functions. Moreover, we suggest a number of
methods for constructing confidence bands. Section 3 contains a comprehensive simulation
study that indicates that our methods compare favorably with existing methods. The net
merit of the proposed methods is especially visible when the variance of the random noise ε
is at the same level as that of the stochastic process Z(t), and we discuss this in detail in
Sections 3.2 and 3.3.
2 Methodology
2.1 Preliminaries
In this section we introduce notation and assumptions that will be used throughout the
paper. As explained in the introduction, the aim of this paper is
(a) to estimate the mean f(t) of the process X(t) and
(b) to construct confidence sets for the mean f(t), t ∈ [0, 1].
We assume throughout that f ∈ L2([0, 1], dt) and is bounded; dt denotes the Lebesgue mea-
sure on [0, 1] and in what follows we will write L2 for the space L2([0, 1], dt). We will also
make the following standard assumption on the process:
Assumption 1. The paths of the Gaussian process $X(t)$, $t \in [0,1]$, are $L_2$-functions almost surely, and the covariance kernel $\Gamma$ is continuous and satisfies $\int_0^1 \Gamma(t,t)\,dt < \infty$.
Remark. We note, for further reference, that Assumption 1 guarantees that $\Gamma \geq 0$ (positive semi-definite); see, for instance, Shorack and Wellner (1986, page 208). Also, by Mercer's theorem, every continuous $\Gamma$ has an eigen-decomposition
$$\Gamma(s,t) = \sum_{j=1}^{\infty}\lambda_j f_j(s) f_j(t)$$
in terms of eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots$ and orthonormal eigenfunctions $f_1, f_2, \ldots$. Moreover, $\lambda_j \geq 0$ and $\sum_{j=1}^{\infty}\lambda_j < \infty$; see, e.g., Shorack and Wellner (1986, page 210).
Our approach uses thresholded projection estimates which are obtained relative to bases
φ1, φ2, . . . that are orthonormal in L2 and are known to have good approximation properties
over a large scale of smoothness classes to which the target f may belong. Examples include
the Fourier, local trigonometric and wavelet bases.
Assumption 2. The mean $f(t) = E[X(t)]$ of the Gaussian process $X(t)$ is in $L_2$ and may be written as
$$f(t) = \sum_{k=1}^{\infty}\mu_k\phi_k(t), \qquad (2)$$
where the convergence is uniform and absolute on $[0,1]$. The coefficients $\mu_k$ are given by
$$\mu_k = \int_0^1 f(t)\phi_k(t)\,dt. \qquad (3)$$
Assumption 3. The observation “times” tij are independent and uniformly distributed on
the interval [0, 1].
Assumption 4. The errors εij are independent N(0, σ2) random variables.
Assumption 5. The basis functions φk are uniformly bounded.
Remark. The Gaussian assumption on the process X(t) and errors εij in Assumptions 1
and 4, and the bounded basis assumption (Assumption 5) may be relaxed at the cost of
rather technical proofs. These assumptions are used in the proof of Proposition 2 of Section
2.3.1 below.
2.2 Threshold-type estimators for functional data
Our procedure falls between two of the currently used strategies for the estimation of f :
averaging estimated individual trajectories and applying various smoothing methods to the
entire data set. Our initial estimator of f is a projection estimator onto a space generated
by a large set of basis functions and can be viewed as an average (over n) of weighted values
of the observations Yij. Our final estimator will be a truncated version of the projection
estimator, with data dependent truncation levels determined from the entire data set. We
describe the details in what follows.
Given a family of basis functions $\{\phi_k\}_k$ and a large integer $d$ (the cut-off point), which can grow polynomially in $n$, our initial estimator of $f$ is $\hat f(t) = \sum_{k=1}^{d}\hat\mu_k\phi_k(t)$, where
$$\hat\mu_k = \frac{1}{n}\sum_{i=1}^{n}\hat\mu_{i,k} \qquad (4)$$
is the average of the projection estimators
$$\hat\mu_{i,k} = \frac{1}{m}\sum_{j=1}^{m} Y_{ij}\,\phi_k(t_{ij}). \qquad (5)$$
The variance of the initial estimator $\hat f(t) = \sum_{k=1}^{d}\hat\mu_k\phi_k(t)$ may be unnecessarily inflated by the presence of possibly many very small estimates $\hat\mu_k$. This drawback can be remedied
by truncating the coefficients at a level rk that takes into account both the variability of
the measurement errors εij and the variability of the stochastic processes Zi(t). This is the
essential difference between truncated estimators based on data generated as in (1) and their
counterpart based only on independent data in a standard nonparametric regression setting.
We will focus on hard threshold estimators of the coefficients $\mu_k$ and of the function $f$. They are, respectively,
$$\hat\mu_k(r_k) = \hat\mu_k\,\mathbf{1}\{|\hat\mu_k| > r_k\}$$
and
$$\hat f_{(r)} = \sum_{k=1}^{d}\hat\mu_k(r_k)\,\phi_k,$$
where here and in what follows we use the notation $r = (r_1, \ldots, r_d)$. In the next section we discuss the goodness of fit of these estimates in terms of the $L_2$ norm, where for any $g \in L_2$ we write $\|g\|^2 = \int_0^1 g^2(t)\,dt$.
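To make the construction concrete, here is a minimal numerical sketch of (4), (5) and the hard-threshold step. The basis (trigonometric), the mean function, the stand-in for the stochastic part, and the constant threshold level are all illustrative choices of ours, not quantities prescribed by the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 200, 50, 15              # curves, observations per curve, basis cut-off

def phi(k, t):
    """Orthonormal trigonometric basis on [0, 1]: phi_1 = 1, then cos/sin pairs."""
    if k == 1:
        return np.ones_like(t)
    j = k // 2
    return np.sqrt(2) * (np.cos(2 * np.pi * j * t) if k % 2 == 0 else
                         np.sin(2 * np.pi * j * t))

f_true = lambda t: np.sin(2 * np.pi * t) + 0.5          # illustrative mean function
t_obs = rng.uniform(0, 1, size=(n, m))                  # random design times t_ij
Z = rng.normal(size=(n, 1)) * np.sin(np.pi * t_obs)     # crude stand-in for Z_i(t)
Y = f_true(t_obs) + Z + rng.normal(0, 0.3, size=(n, m)) # model (1)

# Per-curve projections (5), averaged over curves as in (4)
mu_ik = np.stack([(Y * phi(k, t_obs)).mean(axis=1) for k in range(1, d + 1)], axis=1)
mu_hat = mu_ik.mean(axis=0)

# Hard thresholding; a constant level r_k = 0.1 is used purely for illustration
r = np.full(d, 0.1)
mu_thr = np.where(np.abs(mu_hat) > r, mu_hat, 0.0)
f_hat = lambda t: sum(mu_thr[k - 1] * phi(k, t) for k in range(1, d + 1))
```

The variable levels $r_k$ of Section 2.3.1 would simply replace the constant array `r` above.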
2.3 Oracle inequalities
Define, for each $1 \leq k \leq d$,
$$\mu_k(r_k) = \mu_k\,\mathbf{1}\{|\mu_k| > r_k\}$$
and write
$$f_{(r)}(t) = \sum_{k=1}^{d}\mu_k(r_k)\,\phi_k(t) = \sum_{k=1}^{d}\mu_k\,\phi_k(t)\,\mathbf{1}\{|\mu_k| > r_k\}.$$
The function f(r) can be regarded as a sparse approximation of f relative to a given basis;
of course, since the function f is unknown so is its sparse approximation. In this section we
show that the truncated estimators introduced above, constructed without any prior knowl-
edge of such sparse representations, mimic the bias-variance decomposition of estimators
that would use such information in their construction. Therefore our estimates adapt to
the unknown sparsity of f and we call the corresponding results sparsity oracle inequalities.
They permit us to determine, as a consequence, the rates of convergence of our estimators.
We discuss them in detail in the next section.
We begin by establishing Theorem 1, which is an oracle inequality for the hard threshold
estimator. The result holds on the event
$$\Omega_n = \bigcap_{k=1}^{d}\big\{|\hat\mu_k - \mu_k| \leq r_k\big\} \qquad (6)$$
and is valid for any given threshold levels rk. This clearly shows what will drive the choice of
the threshold levels rk: they have to be chosen such that Ωn holds with probability arbitrarily
close to one. In Proposition 2 below we propose levels $r_k$ for which $\liminf_{n\to\infty} P(\Omega_n) \geq 1 - \alpha$, for any given $0 < \alpha < 1$. In particular, for $\alpha = 1/n$, this guarantees that $\lim_{n\to\infty} P(\Omega_n) = 1$.
Theorem 1. For all $d \geq 1$, on the event $\Omega_n$,
$$\|\hat f_{(2r)} - f\| \leq \|f - f_{(r)}\| + 3\sqrt{\sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\mu_k| > r_k\}}.$$
Moreover, for $d \leq n$, we have for some finite constant $C$,
$$E\big[\|\hat f_{(2r)} - f\|\big] \leq \|f - f_{(r)}\| + 3\sqrt{\sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\mu_k| > r_k\}} + C\sqrt{1 - P(\Omega_n)}.$$
Proof. We first observe that
$$\|\hat f_{(2r)} - f_{(r)}\|^2 = \sum_{k=1}^{d}\big(\hat\mu_k(2r_k) - \mu_k(r_k)\big)^2.$$
For the first claim, it suffices to show that on the event $\Omega_n$,
$$|\hat\mu_k(2r_k) - \mu_k(r_k)| \leq 3 r_k\,\mathbf{1}\{|\mu_k| > r_k\} \qquad (7)$$
holds for all $1 \leq k \leq d$ and any $d \geq 1$, since this coefficient-level bound implies
$$\|\hat f_{(2r)} - f_{(r)}\| \leq 3\sqrt{\sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\mu_k| > r_k\}},$$
and the claim of Theorem 1 follows from the triangle inequality. We now prove (7). Indeed, on the event $\Omega_n$,
$$\begin{aligned}
|\hat\mu_k(2r_k) - \mu_k(r_k)| &= \big|\hat\mu_k\,\mathbf{1}\{|\hat\mu_k| > 2r_k\} - \mu_k\,\mathbf{1}\{|\mu_k| > r_k\}\big| \\
&\leq |\hat\mu_k - \mu_k|\,\mathbf{1}\{|\mu_k| > r_k\} + |\hat\mu_k|\,\big|\mathbf{1}\{|\hat\mu_k| > 2r_k\} - \mathbf{1}\{|\mu_k| > r_k\}\big| \\
&\leq r_k\,\mathbf{1}\{|\mu_k| > r_k\} + |\hat\mu_k|\,\big|\mathbf{1}\{|\hat\mu_k| > 2r_k\} - \mathbf{1}\{|\mu_k| > r_k\}\big|.
\end{aligned}$$
We consider two cases, $|\mu_k| \leq r_k$ and $|\mu_k| > r_k$, for the second term on the right and obtain the bound
$$|\hat\mu_k|\,\big|\mathbf{1}\{|\hat\mu_k| > 2r_k\} - \mathbf{1}\{|\mu_k| > r_k\}\big| = |\hat\mu_k|\,\mathbf{1}\{|\hat\mu_k| > 2r_k\}\mathbf{1}\{|\mu_k| \leq r_k\} + |\hat\mu_k|\,\mathbf{1}\{|\hat\mu_k| \leq 2r_k\}\mathbf{1}\{|\mu_k| > r_k\} \leq 2r_k\,\mathbf{1}\{|\mu_k| > r_k\}.$$
The last inequality follows from the fact that, on $\Omega_n$, $|\mu_k| \leq r_k$ implies $|\hat\mu_k| \leq 2r_k$. Combining the two preceding bounds yields (7).
For the second claim, we note that
$$E\big[\|\hat f_{(2r)} - f\|\big] = E\big[\|\hat f_{(2r)} - f\|\mathbf{1}_{\Omega_n}\big] + E\big[\|\hat f_{(2r)} - f\|\mathbf{1}_{\Omega_n^c}\big] \leq \|f - f_{(r)}\| + 3\sqrt{\sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\mu_k| > r_k\}} + E\big[\|\hat f_{(2r)} - f\|\mathbf{1}_{\Omega_n^c}\big],$$
using the first claim. Then we observe that $\mathrm{Var}(\hat\mu_k) = \tau_k^2/n$ with $\tau_k^2$ defined in (8) below. It is easy to see that $\tau_k^2 \leq \|\Gamma\|_\infty + (1/m)\|\Gamma\|_\infty + \|f\|_\infty^2 + \sigma_\varepsilon^2 \leq C$ for some $C > 0$. By the Cauchy-Schwarz inequality and the fact that
$$\begin{aligned}
E\big[\|\hat f_{(2r)} - f\|^2\big] &\leq 2E\|\hat f_{(2r)}\|^2 + 2\|f\|^2 \\
&\leq 2E\Big[\sum_{k=1}^{d}\hat\mu_k^2\Big] + 2\|f\|^2 \\
&\leq 2\sum_{k=1}^{d}\frac{\tau_k^2}{n} + 2\sum_{k=1}^{d}\mu_k^2 + 2\|f\|^2 \\
&\leq \frac{2d}{n}C + 2\sum_{k=1}^{d}\mu_k^2 + 2\|f\|^2 \leq 2C + 4\|f\|^2,
\end{aligned}$$
we obtain
$$E\big[\|\hat f_{(2r)} - f\|\mathbf{1}_{\Omega_n^c}\big] \leq \sqrt{E\big[\|\hat f_{(2r)} - f\|^2\big]}\,\sqrt{P(\Omega_n^c)} \leq \sqrt{2C + 4\|f\|^2}\,\sqrt{P(\Omega_n^c)},$$
and the second claim follows.
Theorem 1 establishes a novel type of oracle inequality for thresholded estimators in a functional
data context, and extends similar results established in the more traditional sequence model,
non-parametric regression and density estimation settings, see, for instance, Donoho and
Johnstone (1995, 1998), Donoho et al. (1996), Wasserman (2006), Tsybakov (2009) and the
references therein.
2.3.1 Choice of the threshold levels.
In what follows we propose threshold levels rk that guarantee that the event Ωn, defined
in (6) above, holds with any pre-specified probability level 1 − α of interest. A calculation
shows that $\hat\mu_k$ has mean $E[\hat\mu_k] = \mu_k$ and variance $\mathrm{Var}(\hat\mu_k) = n^{-1}\tau_k^2$ with
$$\tau_k^2 = \frac{m-1}{m}\gamma_k^2 + \frac{1}{m}\left\{\int_0^1 \Gamma(t,t)\phi_k^2(t)\,dt + \int_0^1 f^2(t)\phi_k^2(t)\,dt - \mu_k^2 + \sigma_\varepsilon^2\right\}. \qquad (8)$$
Here
$$\gamma_k^2 = \int_0^1\!\!\int_0^1 \phi_k(s)\,\Gamma(s,t)\,\phi_k(t)\,ds\,dt \qquad (9)$$
is often the leading term. Let $z_{\alpha/(2d)}$ be the upper $\alpha/(2d)$ quantile of the $N(0,1)$ distribution. We denote $z_{\alpha/(2d)}^2$ by $\rho_n$ and note, for future reference, that $\rho_n = O(\ln(d/\alpha))$. Set
$$r_k = \frac{z_{\alpha/(2d)}}{\sqrt n}\left(\tau_k + \frac{c_0}{\sqrt m}\right), \qquad (10)$$
for some $c_0$ large enough.
Proposition 2. Let $r_k$ be as in (10) above with $c_0$ sufficiently large. Then, under Assumptions 1–5,
$$P(\Omega_n) \geq 1 - \alpha - d e^{-cn} - \frac{2\alpha^2}{d}$$
for some constant $c > 0$. Consequently, if $d \to \infty$ and $d e^{-cn} \to 0$ as $n \to \infty$,
$$\liminf_{n\to\infty} P(\Omega_n) \geq 1 - \alpha.$$
For clarity of exposition, in the proof below and for the rest of the paper we will use the symbol $\lesssim$ to denote an inequality that holds up to multiplicative positive constants.
Proof. First we decompose
$$\hat\mu_k - \mu_k = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{m}\sum_{j=1}^{m}\big\{Z_i(t_{ij}) + \varepsilon_{ij}\big\}\phi_k(t_{ij}) + \frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m} f(t_{ij})\phi_k(t_{ij}) - \mu_k =: \hat\mu_k^{(1)} + \hat\mu_k^{(2)} - \mu_k.$$
By the definition of $\Omega_n$ and $r_k$ we have
$$\begin{aligned}
P\{\Omega_n^c\} &\leq \sum_{k=1}^{d} P\big\{|\hat\mu_k - \mu_k| > r_k\big\} \qquad (11) \\
&\leq \sum_{k=1}^{d} P\Big\{\sqrt n\,|\hat\mu_k^{(1)}| > z_{\alpha/(2d)}\Big(\tau_k + \frac{c_0}{2\sqrt m}\Big)\Big\} + \sum_{k=1}^{d} P\Big\{\sqrt{nm}\,|\hat\mu_k^{(2)} - \mu_k| > \frac{c_0}{2}\, z_{\alpha/(2d)}\Big\} \\
&=: (I) + (II).
\end{aligned}$$
We bound each term separately, starting with (I). Let
$$\hat V_k^2 := \frac{1}{n}\sum_{i=1}^{n}\frac{1}{m^2}\sum_{j=1}^{m}\sum_{j'=1}^{m}\phi_k(t_{ij})\,\Gamma(t_{ij}, t_{ij'})\,\phi_k(t_{ij'}) + \sigma_\varepsilon^2\,\frac{1}{nm^2}\sum_{i=1}^{n}\sum_{j=1}^{m}\phi_k^2(t_{ij}),$$
set $V_k^2 := E[\hat V_k^2] \leq \tau_k^2$ and introduce the event $A_k := \{\hat V_k \leq V_k + c_0/(2m^{1/2})\}$. Then
$$\begin{aligned}
(I) &= \sum_{k=1}^{d} P\Big\{|\hat\mu_k^{(1)}| > \frac{z_{\alpha/(2d)}}{\sqrt n}\Big(\tau_k + \frac{c_0}{2\sqrt m}\Big),\ A_k\Big\} + \sum_{k=1}^{d} P\Big\{|\hat\mu_k^{(1)}| > \frac{z_{\alpha/(2d)}}{\sqrt n}\Big(\tau_k + \frac{c_0}{2\sqrt m}\Big),\ A_k^c\Big\} \\
&\leq \sum_{k=1}^{d} P(A_k^c) + \sum_{k=1}^{d} E\Big[P\big\{\sqrt n\,|\hat\mu_k^{(1)}| > z_{\alpha/(2d)}\hat V_k \,\big|\, (t_{ij})_{i,j}\big\}\Big].
\end{aligned}$$
To bound the first sum above, we observe that, by Assumptions 1 and 5, $|\phi_k^2\Gamma| \leq C$ and $|\phi_k^2| \leq C$ for some constant $C > 0$. Then, using Hoeffding's inequality for sums of bounded independent random variables, we obtain, for some bounded constant $C' > 0$,
$$\sum_{k=1}^{d} P(A_k^c) \leq \sum_{k=1}^{d} P\Big\{V_k^2 + \frac{c_0^2}{4m} < \hat V_k^2\Big\} \leq d\exp\Big(-\frac{n c_0^4}{C'}\Big). \qquad (12)$$
For the second term in the display above, we observe that $\hat\mu_k^{(1)}$, conditionally on the times $t_{ij}$, is Gaussian with mean $0$ and variance equal to $\hat V_k^2/n$. Therefore
$$\sum_{k=1}^{d} E\Big[P\big\{\sqrt n\,|\hat\mu_k^{(1)}| > z_{\alpha/(2d)}\hat V_k \,\big|\, (t_{ij})_{i,j}\big\}\Big] = \alpha. \qquad (13)$$
To control (II) in (11) we use Hoeffding's inequality again to obtain, for some bounded constant $c_1$,
$$\sum_{k=1}^{d} P\Big\{\sqrt{nm}\,|\hat\mu_k^{(2)} - \mu_k| > \frac{c_0 z_{\alpha/(2d)}}{2}\Big\} \leq 2d\exp\big\{-c_1 c_0^2 \rho_n\big\}.$$
Hence, for $c_0$ large enough,
$$\sum_{k=1}^{d} P\Big\{\sqrt{nm}\,|\hat\mu_k^{(2)} - \mu_k| > \frac{c_0 z_{\alpha/(2d)}}{2}\Big\} \leq \frac{2\alpha^t}{d^{t-1}} \qquad (14)$$
for any $t \geq 1$. Collecting (12), (13) and (14) with $t = 2$ we obtain the result.
Remarks. Assumptions 1, 4 and 5 may be relaxed, but at the price of additional technical-
ities and restrictions on m that would clutter the presentation.
The term $c_0 m^{-1/2}$ in (10) is technical and generally smaller than $\tau_k$. We propose the following practical choice for $r_k$. Since $\hat\mu_k = n^{-1}\sum_{i=1}^{n}\hat\mu_{i,k}$ is the average of i.i.d. random variables, we expect that it is approximately normal. Motivated by the central limit theorem, we use in our simulations and data analysis the choice
$$r_k = \frac{S_k}{\sqrt n}\, z_{\alpha/(2d)},$$
based on the sample variances $S_k^2 = (n-1)^{-1}\sum_{i=1}^{n}(\hat\mu_{i,k} - \hat\mu_k)^2$. Indeed, by Bonferroni's bound,
$$P(\Omega_n^c) \leq \sum_{k=1}^{d} P\big\{|\hat\mu_k - \mu_k| > r_k\big\} \approx \sum_{k=1}^{d} P\big\{|N(0,1)| > z_{\alpha/(2d)}\big\} = \alpha.$$
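In code, this practical data-driven choice amounts to a few lines. The function name and the toy inputs below are ours, purely for illustration:

```python
import numpy as np
from statistics import NormalDist

def thresholds(mu_ik, alpha):
    """Levels r_k = (S_k / sqrt(n)) * z_{alpha/(2d)}, computed from the
    per-curve coefficient estimates mu_ik, an (n, d) array as in (5)."""
    n, d = mu_ik.shape
    S = mu_ik.std(axis=0, ddof=1)                   # sample standard deviations S_k
    z = NormalDist().inv_cdf(1 - alpha / (2 * d))   # upper alpha/(2d) quantile
    return (S / np.sqrt(n)) * z

# Toy check: the levels shrink like 1/sqrt(n) as more curves are observed
rng = np.random.default_rng(1)
r_50 = thresholds(rng.normal(size=(50, 10)), alpha=0.05)
r_5000 = thresholds(rng.normal(size=(5000, 10)), alpha=0.05)
```

The Bonferroni correction enters only through the quantile level $1 - \alpha/(2d)$, so the cost of a larger basis $d$ is logarithmic in the threshold size.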
We summarize the results of this section in the following corollary.
Corollary 3. For any $0 < \alpha < 1$, set $r_k$ as in (10) above. Under Assumptions 1–5,
$$\|\hat f_{(2r)} - f\| \leq \|f - f_{(r)}\| + 3\sqrt{\sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\mu_k| > r_k\}}$$
holds with probability larger than $1 - \alpha$, as $n \to \infty$.
Moreover, for $\alpha = 1/n$ and $d \leq n$, we have, for some constant $C < \infty$,
$$E\big[\|\hat f_{(2r)} - f\|\big] \leq \|f - f_{(r)}\| + 3\sqrt{\sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\mu_k| > r_k\}} + \frac{C}{\sqrt n}.$$
We establish the asymptotic implications of these results and the rates of convergence of our
variable threshold estimator in the next section. The advantages of the variable threshold, which differs across the estimated coefficients $\hat\mu_k$, over (a) no threshold at all and (b) a constant threshold for all $k$, will be discussed in detail below.
2.4 Rates of convergence
We begin by discussing classes of functions over which we derive the rates of convergence
of our estimator. Recall that we have assumed that the mean function $f \in L_2$; therefore $\|f\|^2 = \sum_k \mu_k^2 < \infty$ and so $\mu_k \to 0$ as $k \to \infty$. We introduce a parameter $\beta > 0$ that governs this decay to zero of the coefficients $\mu_k$ for $k$ large enough, say $k \geq K$, for some $K \geq 1$. We consider bases $\{\phi_k\}_k$ and classes of functions $f$ that satisfy
$$\mu_k^2 \lesssim k^{-2\beta-1} \qquad (15)$$
for all $k \geq K$. The classes of functions that satisfy (15) are quite general. For instance, consider the trigonometric basis $\{\phi_k\}_k$ and the generalized Sobolev classes of the form
$$W^\beta(R) = \big\{f : [0,1] \to \mathbb{R} \,:\, f(0) = f(1),\ \|f^{(\beta)}\| \leq R\big\},$$
with smoothness index $\beta > 0$ and for some $R > 0$. Notice that for $\beta > 1/2$ this class consists of continuous functions; see, e.g., Tsybakov (2009) for details. Then $f \in W^\beta(R)$ if and only if
$$\sum_{k=1}^{\infty} k^{2\beta}\big(\mu_{2k}^2 + \mu_{2k+1}^2\big) \leq R^2.$$
Consequently, $\mu_k^2 = o(k^{-2\beta-1})$ as $k \to \infty$, so that (15) holds. More generally, it can be shown that balls in Besov spaces can be written as $\ell_p$ bodies when expanded in suitable wavelet bases or in spaces generated by piecewise polynomials based on regular or irregular partitions; see, for instance, DeVore and Lorentz (1996) for further details.
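A quick numerical illustration of (15), with a construction of our own: take Fourier coefficients $\mu_j = j^{-2}$, so that $\mu_j^2 = j^{-4}$, i.e. (15) holds with $\beta = 3/2$, and verify that projection onto the basis recovers exactly this decay:

```python
import numpy as np

# Build f with known coefficients mu_j = j^{-2} in the sine part of the
# trigonometric basis (so mu_j^2 = j^{-4}, i.e. (15) with beta = 3/2).
t = np.linspace(0, 1, 4096, endpoint=False)
J = np.arange(1, 40)
f = sum(j ** -2.0 * np.sqrt(2) * np.sin(2 * np.pi * j * t) for j in J)

# Recover the coefficients by numerical projection (a Riemann sum, which is
# exact for trigonometric polynomials on this uniform grid)
mu = np.array([np.mean(f * np.sqrt(2) * np.sin(2 * np.pi * j * t)) for j in J])
decay = mu ** 2 * J ** 4.0          # mu_k^2 * k^{2*beta+1}: should be ~1 for every j
```

The array `decay` sits at $1$ for every frequency, confirming the polynomial decay that drives the rates of the next section.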
We first state an intermediate result.
Proposition 4. On $\Omega_n$, we have
$$\|\hat f_{(2r)} - f\|^2 \leq 18\left(\sum_{k=N+1}^{\infty}\mu_k^2 + \sum_{k=1}^{N} r_k^2\right) \qquad (16)$$
for all $N \leq d$. Consequently, for all $f \in W^\beta(R)$, on the event $\Omega_n$,
$$\|\hat f_{(2r)} - f\|^2 \lesssim N^{-2\beta} + \sum_{k=1}^{N} r_k^2. \qquad (17)$$
Proof. Using Theorem 1 and the inequality $(a+b)^2 \leq 2a^2 + 2b^2$, we find
$$\|\hat f_{(2r)} - f\|^2 \leq 2\|f_{(r)} - f\|^2 + 18\sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\mu_k| > r_k\}.$$
Since
$$\|f_{(r)} - f\|^2 = \sum_{k=1}^{d}\mu_k^2\,\mathbf{1}\{|\mu_k| \leq r_k\} + \sum_{k=d+1}^{\infty}\mu_k^2,$$
we obtain
$$\begin{aligned}
\|\hat f_{(2r)} - f\|^2 &\leq 2\sum_{k=1}^{d}\mu_k^2\,\mathbf{1}\{|\mu_k| \leq r_k\} + 2\sum_{k=d+1}^{\infty}\mu_k^2 + 18\sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\mu_k| > r_k\} \\
&\leq 18\left\{\sum_{k=d+1}^{\infty}\mu_k^2 + \sum_{k=1}^{d}\min(r_k^2, \mu_k^2)\right\} \\
&\leq 18\left\{\sum_{k=N+1}^{\infty}\mu_k^2 + \sum_{k=1}^{N}\min(r_k^2, \mu_k^2)\right\},
\end{aligned}$$
which proves the first claim. The second claim follows immediately from the first claim and (15).
From this proposition, it follows that the other important ingredient in establishing the rate
of convergence of our estimates is the size of the threshold levels rk. We discuss this below.
For the stochastic process $Z(t)$ and any basis that is orthonormal in $L_2$, the quantity $\gamma_k^2$ is bounded by $\lambda_1$, the largest eigenvalue of the covariance kernel $\Gamma$. Under Assumption 1 on the kernel, and from the remark following it, we see that $\lambda_1 \leq \sum_k \lambda_k < \infty$. Since $\Gamma \geq 0$, we therefore always have $0 \leq \gamma_k^2 < \infty$ for all $k$, where we recall that $\gamma_k^2$ is the term of the variance component defined by (9) above. In what follows we further assume that
$$\gamma_k^2 \lesssim k^{-\delta} \qquad (18)$$
for all k ≥ K and for some finite, positive constants K ≥ 1 and δ ≥ 0. Just as β is the
smoothness parameter of the function f , we view δ as the smoothness parameter of Γ or the
regularity index of the process Z.
If the basis $\{\phi_k\}_k$ happens to be the collection of eigenfunctions of the covariance kernel $\Gamma$, then $\gamma_k^2 = \lambda_k$, and from the considerations above $\gamma_k^2 \to 0$ as $k \to \infty$. This is the case, for instance, for the Brownian bridge or Brownian motion processes and the Fourier basis. However, in practice $\Gamma$ is unknown, and one cannot guarantee the same type of decay for the quantities $\gamma_k^2$.
Notice that our condition (18) is not restrictive, as we allow δ to equal zero, which translates
into re-stating what we already observed above, namely that for the type of processes that
we consider the values γ2k are bounded.
In our analysis, two cases naturally arise: $0 \leq \delta \leq 1$ and $\delta > 1$. We notice that for $\delta > 1$ the series $\sum_{k=1}^{d}\gamma_k^2$ converges. In this case, the mean squared error of the least squares estimator $\hat f$ is
$$E\big[\|\hat f - f\|^2\big] = \sum_{k=1}^{d}\mathrm{Var}(\hat\mu_k) + \sum_{k=d+1}^{\infty}\mu_k^2 \lesssim \frac{1}{n}\sum_{k=1}^{d} k^{-\delta} + \frac{d}{nm} + d^{-2\beta} \lesssim \frac{1}{n} + \frac{d}{nm} + d^{-2\beta}. \qquad (19)$$
The right-hand side is minimized by the non-adaptive choice $d = O\big((nm)^{1/(2\beta+1)}\big)$ and yields the rate
$$E\big[\|\hat f - f\|^2\big] \lesssim \frac{1}{n} + \Big(\frac{1}{nm}\Big)^{2\beta/(2\beta+1)}. \qquad (20)$$
For perfectly observed n trajectories Xi(t), at all points t and without error, the rate of
estimating f would be the parametric n−1/2 rate. It is interesting to see, from the display
above, that the same is possible in our context, if enough points per curve are observed.
Specifically, the first term on the right, of order O(1/n), dominates the rate, as soon as
m ≥ n1/2, for all β > 1. For smaller values of m relative to n, one reverts back to the slower
non-parametric minimax rate, which corresponds to estimating a function f from mn data
points. We also note that a direct calculation based on (19) shows that the mean squared error of the least squares estimator $\hat f$ is of order $O(1/n)$ for all $\beta > 1$, as long as $m \geq n^{1/2}$ and we choose $d$ such that $n^{1/2} \leq d \leq m$.
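The choice of $d$ behind (20) is the routine bias-variance balance; written out:

```latex
\frac{d}{nm} \asymp d^{-2\beta}
\;\Longleftrightarrow\; d^{2\beta+1} \asymp nm
\;\Longleftrightarrow\; d \asymp (nm)^{1/(2\beta+1)},
\quad\text{so that}\quad
\frac{d}{nm} + d^{-2\beta} \asymp \Big(\frac{1}{nm}\Big)^{2\beta/(2\beta+1)} .
```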
The following result indicates that our threshold estimator f(2r) has a similar performance,
with the rate in (20) obtained adaptively over β ≥ 1.
Theorem 5. Let Assumptions 1–5 hold. For any $f \in W^\beta(R)$ with $\beta \geq 1$, $\delta > 1$ in (18), and $d \geq (mn/\rho_n)^{1/3}$, we have
$$\lim_{n\to\infty} P\Big\{\|\hat f_{(2r)} - f\|^2 \lesssim \frac{\rho_n}{n} + \Big(\frac{\rho_n}{mn}\Big)^{2\beta/(2\beta+1)}\Big\} \geq 1 - \alpha,$$
for any $0 < \alpha < 1$. Moreover, if $m \geq (n/\rho_n)^{1/2}$, then
$$\lim_{n\to\infty} P\Big\{\|\hat f_{(2r)} - f\|^2 \lesssim \frac{\rho_n}{n}\Big\} \geq 1 - \alpha,$$
where $\rho_n = z_{\alpha/(2d)}^2 = O(\ln(d/\alpha))$.
Proof. By Proposition 4 we have, on $\Omega_n$,
$$\begin{aligned}
\|\hat f_{(2r)} - f\|^2 &\lesssim \inf_{1\leq N\leq d}\Big\{\Big(\frac{1}{N}\Big)^{2\beta} + \frac{\rho_n}{n}\sum_k k^{-\delta} + \frac{N\rho_n}{nm}\Big\} \\
&\lesssim \frac{\rho_n}{n} + \inf_{1\leq N\leq d}\Big\{\Big(\frac{1}{N}\Big)^{2\beta} + \frac{N\rho_n}{nm}\Big\} \qquad \text{since } \sum_k k^{-\delta} < \infty \\
&= \frac{\rho_n}{n} + \Big(\frac{\rho_n}{mn}\Big)^{2\beta/(2\beta+1)}. \qquad (21)
\end{aligned}$$
This rate is obtained for $N = (mn/\rho_n)^{1/(2\beta+1)} \leq (mn/\rho_n)^{1/3} \leq d$, for all $\beta \geq 1$. The second claim follows immediately from (21) for these values of $m$. Finally, invoke Proposition 2 to complete the proof.
Remarks.
1. The rate of our adaptive estimator differs from that of the least squares estimator only
by a ρn factor, which is a very small price to pay for adaptation.
2. The condition $d \geq (mn/\rho_n)^{1/3}$ together with the condition $m \geq (n/\rho_n)^{1/2}$ imply that $d \geq (n/\rho_n)^{1/2}$; this is the lower bound on $d$ for which our threshold estimator achieves the parametric rate. Notice further that in the regime $m \geq (n/\rho_n)^{1/2}$ the value of $N$ giving the optimal rate is bounded by $m$, which thus provides a natural upper bound for $d$. Therefore, our estimator has rate $O(\rho_n/n)$ if we use $(n/\rho_n)^{1/2} \leq d \leq m$.
For $0 \leq \delta \leq 1$, we first notice that the first term on the right-hand side of (19) is $1/n$ times the partial sum of a divergent series; therefore the mean squared error of the least squares estimator $\hat f$ no longer achieves the $O(1/n)$ rate in this case. The following theorem shows that over classes of functions and processes that satisfy (15) and (18), respectively, the rate of convergence of $\|\hat f_{(2r)} - f\|^2$ is
$$\psi_n = \Big(\frac{\rho_n}{n}\Big)^{2\beta/(2\beta+1-\delta)}, \qquad (22)$$
for appropriately chosen d. This rate is markedly better than the rate that can be achieved
by the least squares estimator f : for δ = 1 it is of O(ρn/n) and in general it adapts to the
unknown δ.
Theorem 6. Let Assumptions 1–5 hold and assume that (15) and (18) hold with $2\beta > 1 + \delta$ and $0 \leq \delta \leq 1$. Then, for $\sqrt{n/\rho_n} \leq d \leq m$,
$$\lim_{n\to\infty} P\big\{\|\hat f_{(2r)} - f\|^2 \lesssim \psi_n\big\} \geq 1 - \alpha. \qquad (23)$$
Proof. By Proposition 4,
$$\begin{aligned}
\|\hat f_{(2r)} - f\|^2 &\lesssim \inf_{1\leq N\leq d}\Big\{N^{-2\beta} + \frac{\rho_n}{n}\sum_{k=1}^{N} k^{-\delta} + \frac{N\rho_n}{nm}\Big\} \\
&\lesssim \inf_{1\leq N\leq d}\Big\{N^{-2\beta} + \frac{\rho_n}{n}N^{1-\delta} + \frac{N\rho_n}{nm}\Big\} \\
&\lesssim \inf_{1\leq N\leq d}\Big\{N^{-2\beta} + \frac{\rho_n}{n}N^{1-\delta}\Big\},
\end{aligned}$$
since for $N \leq d \leq m$ we have $N^{1-\delta} \geq N/m$ for all $0 \leq \delta \leq 1$. The infimum is achieved for $N = (n/\rho_n)^{1/(2\beta+1-\delta)}$, which is less than $d$ since $2\beta + 1 - \delta > 2$; apply Proposition 2 to conclude the proof.
We now compare our estimator $\hat f_{(2r)}$, which is constructed with the variable thresholds (10), with a threshold estimator that uses the fixed thresholds $r_k = B n^{-1/2}\rho_n^{1/2}$, for all $k$ and some constant $B > 0$.
Theorem 7. If Assumptions 1–5 and (15) with $\beta > 1/2$ hold, then the hard threshold estimator with constant threshold $r_k = B n^{-1/2}\rho_n^{1/2}$, for all $k$, satisfies
$$\lim_{n\to\infty} P\Big\{\|\hat f_{(2r)} - f\|^2 \lesssim \Big(\frac{\rho_n}{n}\Big)^{2\beta/(2\beta+1)}\Big\} \geq 1 - \alpha. \qquad (24)$$
Proof. Again, by Proposition 4,
$$\|\hat f_{(2r)} - f\|^2 \lesssim \sum_{k=N+1}^{\infty}\mu_k^2 + \sum_{k=1}^{N} r_k^2 \lesssim N^{-2\beta} + N\rho_n/n$$
for all $N \leq d$. Optimizing the right-hand side over $N$ leads to $N \asymp (n/\rho_n)^{1/(2\beta+1)}$. Finally, notice that the threshold $B n^{-1/2}\rho_n^{1/2}$ is larger than (10) for $B$ large enough. Inspection of the proof of Proposition 2 yields that $\limsup_{n\to\infty} P(\Omega_n^c) \leq \alpha$, and the claim follows.
Summarizing, we see that our threshold estimator can attain fast rates of convergence without prior knowledge of $\delta$. The case $\delta = 0$ provides a clear illustration. Recall that taking $\delta = 0$ in condition (18) means that we impose no restrictions on the process $Z$ other than the standard regularity properties stated in Assumption 1 above. In this case Theorem
6 guarantees net improvements over the least squares estimators: the rate of convergence
of the hard threshold estimator with variable thresholds (10) is equal to n−β/(2β+1), up to
a ln(n) factor; a similar rate can also be obtained by the fixed threshold estimator. If
0 < δ ≤ 1, which corresponds to mild assumptions on the process, the variable threshold
estimator has rate n−β/(2β+1−δ), up to logarithmic factors. Thus, it continues to outperform
the least squares estimator. In this regime it also outperforms the fixed threshold estimator,
with rate n−β/(2β+1) given by Theorem 7. Therefore, the variable threshold estimators, con-
structed without knowing δ, adapt to the unknown regularity index of the process Z(t), and
are expected to have better accuracy, quantified by the faster rates of convergence derived
above.
2.5 Confidence sets
In this section we propose confidence sets (balls, intervals and bands) for the mean function
f(t). Our first result shows that the confidence ball $B(\hat f_{(2r)}, \hat\rho_{(2r)})$, centered at the hard threshold estimator $\hat f_{(2r)}$ and with radius slightly larger than
$$\hat\rho_{(2r)} = \sqrt{\sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\hat\mu_k| > 2r_k\}}, \qquad (25)$$
has asymptotic coverage $1 - \alpha$ for a large class of functions $f$. First we obtain an intermediate
result.
Proposition 8. For all $d \geq 1$ and $0 < \alpha < 1$,
$$\|\hat f_{(2r)} - f\| \leq \sqrt{\sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\hat\mu_k| > 2r_k\}} + \sqrt{\sum_{k=1}^{d}\mu_k^2\,\mathbf{1}\{|\mu_k| \leq 3r_k\}} + \sqrt{\sum_{k=d+1}^{\infty}\mu_k^2}$$
on $\Omega_n$.
Proof. We first notice that, for $1 \leq k \leq d$,
$$|\hat\mu_k(2r_k) - \mu_k| \leq |\hat\mu_k - \mu_k|\,\mathbf{1}\{|\hat\mu_k| > 2r_k\} + |\mu_k|\,\mathbf{1}\{|\hat\mu_k| \leq 2r_k\}$$
and
$$\|\hat f_{(2r)} - f\|^2 = \sum_{k=1}^{d}\big(\hat\mu_k(2r_k) - \mu_k\big)^2 + \sum_{k=d+1}^{\infty}\mu_k^2.$$
Consequently, on the event $\Omega_n$,
$$\begin{aligned}
\|\hat f_{(2r)} - f\| &\leq \sqrt{\sum_{k=1}^{d}(\hat\mu_k - \mu_k)^2\,\mathbf{1}\{|\hat\mu_k| > 2r_k\}} + \sqrt{\sum_{k=1}^{d}\mu_k^2\,\mathbf{1}\{|\hat\mu_k| \leq 2r_k\}} + \sqrt{\sum_{k=d+1}^{\infty}\mu_k^2} \\
&\leq \sqrt{\sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\hat\mu_k| > 2r_k\}} + \sqrt{\sum_{k=1}^{d}\mu_k^2\,\mathbf{1}\{|\mu_k| \leq 3r_k\}} + \sqrt{\sum_{k=d+1}^{\infty}\mu_k^2},
\end{aligned}$$
and our claim follows.
The radius $\hat\rho_{(2r)}$ dominates the bound in Proposition 8, on the event $\Omega_n$, if
$$\lim_{n\to\infty}\frac{\sum_{k=1}^{d}\mu_k^2\,\mathbf{1}\{|\mu_k| \leq 3r_k\} + \sum_{k=d+1}^{\infty}\mu_k^2}{\sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\mu_k| > 3r_k\}} = 0. \qquad (26)$$
Consequently, we have the following result:
Theorem 9. Assume that Assumptions 1–5 hold. Fix $d \geq 1$ and $0 < \alpha < 1$. Provided (26) holds, we have, for all $s > 1$,
$$\liminf_{n\to\infty} P\big\{f \in B\big(\hat f_{(2r)},\ s\cdot\hat\rho_{(2r)}\big)\big\} \geq 1 - \alpha.$$
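Computing the ball is immediate once the coefficient estimates and threshold levels are available; a sketch, with toy input values of our own choosing:

```python
import numpy as np

def ball_radius(mu_hat, r, s=1.1):
    """Radius s * rho_(2r) from (25): only coefficients surviving the 2*r_k
    threshold contribute their r_k^2."""
    keep = np.abs(mu_hat) > 2 * r
    return s * np.sqrt(np.sum(r[keep] ** 2))

mu_hat = np.array([1.20, 0.45, 0.05, 0.01])   # estimated coefficients (toy values)
r = np.full(4, 0.1)                           # threshold levels r_k
rho = ball_radius(mu_hat, r, s=1.0)           # only k = 1, 2 exceed 2*r_k = 0.2
# A candidate g = sum_k nu_k phi_k lies in the ball iff its coefficients satisfy
# sum_k (nu_k - muhat_k(2r_k))^2 <= (s * rho)^2, by Parseval's identity.
```

Note that nothing here requires an estimate of the covariance operator $\Gamma$: the radius depends only on the $r_k$ and on which coefficients survive thresholding.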
We illustrate this result for functions $f = \sum_k \mu_k\phi_k$ with
$$c_1 \leq \mu_k^2 k^{2\beta+1} \leq C_1, \quad k \to \infty, \qquad (27)$$
and covariance functions $\Gamma(s,t)$ with
$$c_2 \leq k^\delta \gamma_k^2 \leq C_2, \quad k \to \infty, \qquad (28)$$
similar to our discussion in Section 2.4. The corresponding set of functions satisfying (27) is
denoted by Fβ(c1, C1). The upper bound in (27) is already used in (15). The lower bound in
(27) is crucial for our analysis. It is similar to Condition Hs(M,x0) introduced in Picard and
Tribouley (2000, page 307) for their analysis of wavelet based estimators in non-parametric
regression models. Different types of lower bounds are possible, and we refer the reader to the recent work of Giné and Nickl (2010, Section 3.5) on nonparametric density estimation.
In what follows we use the notation $a \gg b$ whenever $a$ is of larger order than $b$.
Corollary 10. Let Assumptions 1–5 hold and fix $0 < \alpha < 1$. Let $m > d \gg n^{1/2}$ and $\delta \geq 1$ in (28). Then, for all $s > 1$,
$$\liminf_{n\to\infty}\ \inf_{f\in\cup_{\beta\geq 1} F_\beta(c_1,C_1)} P\big\{f \in B(\hat f_{(2r)},\ s\cdot\hat\rho_{(2r)})\big\} \geq 1 - \alpha.$$
Moreover,
$$\liminf_{n\to\infty}\ \inf_{f\in\cup_{\beta\geq 1} F_\beta(c_1,C_1)} P\big\{\hat\rho_{(2r)} \lesssim \ln(n)/n^{1/2}\big\} \geq 1 - \alpha.$$
Proof. In view of Theorem 9, it suffices to verify (26). We consider the two cases $\delta \geq 2\beta + 1$ and $1 \leq \delta < 2\beta + 1$ separately.
First, take $\delta \geq 2\beta + 1$. Then, for $n$ large enough, we have $9r_k^2 < \mu_k^2$ for all $k$. This implies that the numerator in (26) reduces to $\sum_{k=d+1}^{\infty}\mu_k^2 \lesssim d^{-2\beta} = o(1/n)$ for $d \gg n^{1/2}$ and $\beta \geq 1$. On the other hand, the denominator in (26) equals
$$\sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\mu_k| > 3r_k\} \asymp \frac{\rho_n}{n}\sum_{k=1}^{d} k^{-\delta} + \frac{d\ln(n)}{mn} \asymp \frac{\ln(n)}{n},$$
since $m > d$ and $\sum_k k^{-\delta} < \infty$ for $\delta \geq 2\beta + 1 > 1$. Consequently, the convergence in (26) holds.
Assume now $1 < \delta < 2\beta + 1$. Then $\mu_k^2 \leq 9r_k^2 \iff k \geq N = C(n/\ln(n))^{1/(2\beta+1-\delta)}$. Notice that $N \gg (n/\ln(n))^{1/2}$. Now the numerator in (26) is
$$\sum_{k=1}^{d}\mu_k^2\,\mathbf{1}\{|\mu_k| \leq 3r_k\} + \sum_{k=d+1}^{\infty}\mu_k^2 \lesssim d^{-2\beta} + N^{-2\beta} = o\Big(\frac{\ln n}{n}\Big),$$
since $d \gg n^{1/2}$ and $\beta \geq 1$. On the other hand, the denominator in (26) is larger than $c\ln(n)/n$ for some $c > 0$. Hence the ratio in (26) is asymptotically negligible as $n \to \infty$.
The second claim regarding the radius follows from the fact that, on the event $\Omega_n$,
$$\hat\rho_{(2r)}^2 = \sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\hat\mu_k| > 2r_k\} \leq \sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\mu_k| > r_k\}.$$
The term on the right is of order $\ln(n)/n$ by the calculations above.
For small values of $\delta$, $0 \leq \delta \leq 1$, which translate into larger values of $r_k$, the ratio between the bias and the radius in (26) is bounded, yet it does not necessarily converge to zero. Multiplying the radius by an additional $\ln^{1/2}(n)$ factor, we then obtain an asymptotic $1 - \alpha$ confidence ball for $f$.
Corollary 11. Let Assumptions 1–5 hold and fix $0 < \alpha < 1$. Let $m > d \gg n^{1/2}$ and $0 \leq \delta \leq 1$ in (28). Then
$$\liminf_{n\to\infty}\ \inf_{f\in\cup_{\beta\geq 1} F_\beta(c_1,C_1)} P\big\{f \in B\big(\hat f_{(2r)},\ \sqrt{\ln(n)}\,\hat\rho_{(2r)}\big)\big\} \geq 1 - \alpha.$$
Moreover,
$$\liminf_{n\to\infty}\ \inf_{f\in\cup_{\beta\geq 1} F_\beta(c_1,C_1)} P\big\{\hat\rho_{(2r)} \lesssim \{\ln(n)/n\}^{\beta/(2\beta+1-\delta)}\big\} \geq 1 - \alpha.$$
Proof. In view of Theorem 9, we need to show that
$$\lim_{n\to\infty}\frac{\sum_{k=1}^{d}\mu_k^2\,\mathbf{1}\{|\mu_k| \leq 3r_k\} + \sum_{k=d+1}^{\infty}\mu_k^2}{\ln(n)\sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\mu_k| > 3r_k\}} = 0 \qquad (29)$$
to prove the first claim. Observe that, for $0 \leq \delta \leq 1$, $\mu_k^2 \leq 9r_k^2 \iff k \geq N = C(n/\ln(n))^{1/(2\beta+1-\delta)}$. Consequently,
$$\sum_{k=1}^{d}\mu_k^2\,\mathbf{1}\{|\mu_k| \leq 3r_k\} + \sum_{k=d+1}^{\infty}\mu_k^2 \lesssim N^{-2\beta} = \Big(\frac{\ln(n)}{n}\Big)^{2\beta/(2\beta+1-\delta)}.$$
On the other hand,
$$\ln(n)\sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\mu_k| > 3r_k\} \geq c\ln(n)\Big(\frac{\ln(n)}{n}\Big)^{2\beta/(2\beta+1-\delta)}$$
for some $c > 0$, so that the ratio in (29) is asymptotically negligible as $n \to \infty$. The second claim follows from the inequality
$$\hat\rho_{(2r)} \leq \sqrt{\sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\mu_k| > r_k\}},$$
which holds on the event $\Omega_n$, and the bound $\sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\mu_k| > r_k\} \leq C\{\ln(n)/n\}^{2\beta/(2\beta+1-\delta)}$ for some $C < \infty$, which can be obtained by the arguments above.
Remark. The proposed confidence balls adapt to the smoothness parameters $\beta$ and $\delta$, and the results hold uniformly over $\mathcal{F}_\beta(c_1, C_1) \subset W^\beta$. Moreover, the simulation study presented in Section 3.3.1 shows that the confidence balls attain the prescribed coverage and have small radii in a variety of finite sample scenarios. Adaptive confidence balls in nonparametric regression are proposed and studied by, among others, Beran and Dümbgen (1998), Baraud (2004), Cai and Low (2006), Robins and van der Vaart (2006) and Davies, Kovac and Meise (2009). In all these works the confidence balls are uniform over large classes of functions, for instance $W^\beta$. The same ideas can be extended to our set-up, and we can obtain confidence balls over $W^\beta$ essentially by multiplying the radius by a factor $1 \vee n^{1-\delta}$. However, the drawback of the methods proposed for nonparametric regression, and of their possible extension to the functional data context, is clear: if one insists on confidence balls that are uniform over too general a class of functions, such as $W^\beta$, then the resulting balls will necessarily be too wide for practical use. For this reason we do not pursue such a construction here and omit further discussion.
Confidence balls are hard to visualize; confidence bands are more appropriate for this purpose. Analogous to Proposition 8, we have the following result.

Proposition 12. For all $d \ge 1$ and $0 < \alpha < 1$,
$$|\hat f(2r)(t) - f(t)| \le \sum_{k=1}^{d} r_k |\phi_k(t)|\, \mathbb{1}\{|\hat\mu_k| > 2 r_k\} + \sum_{k=1}^{d} |\mu_k \phi_k(t)|\, \mathbb{1}\{|\mu_k| \le 3 r_k\} + \sum_{k=d+1}^{\infty} |\mu_k \phi_k(t)|$$
holds, uniformly in $0 \le t \le 1$, on the event $\Omega_n$.
Proof. Since, for $1 \le k \le d$,
$$|\hat\mu_k(2r_k) - \mu_k| \le |\hat\mu_k - \mu_k|\, \mathbb{1}\{|\hat\mu_k| > 2 r_k\} + |\mu_k|\, \mathbb{1}\{|\hat\mu_k| \le 2 r_k\},$$
we have
$$|\hat f(2r)(t) - f(t)| \le \sum_{k=1}^{d} |(\hat\mu_k - \mu_k)\phi_k(t)|\, \mathbb{1}\{|\hat\mu_k| > 2 r_k\} + \sum_{k=1}^{d} |\mu_k \phi_k(t)|\, \mathbb{1}\{|\hat\mu_k| \le 2 r_k\} + \sum_{k=d+1}^{\infty} |\mu_k \phi_k(t)|$$
$$\le \sum_{k=1}^{d} r_k |\phi_k(t)|\, \mathbb{1}\{|\hat\mu_k| > 2 r_k\} + \sum_{k=1}^{d} |\mu_k \phi_k(t)|\, \mathbb{1}\{|\mu_k| \le 3 r_k\} + \sum_{k=d+1}^{\infty} |\mu_k \phi_k(t)|,$$
on the event $\Omega_n$, on which $|\hat\mu_k - \mu_k| \le r_k$ for all $k \le d$, as claimed.
Provided
$$\lim_{n\to\infty} \sup_{0<t<1} \frac{\sum_{k=1}^{d} |\mu_k \phi_k(t)|\, \mathbb{1}\{|\mu_k| \le 3 r_k\} + \sum_{k=d+1}^{\infty} |\mu_k \phi_k(t)|}{\sum_{k=1}^{d} r_k |\phi_k(t)|\, \mathbb{1}\{|\mu_k| > 3 r_k\}} = 0, \qquad (30)$$
Proposition 12 implies that
$$\liminf_{n\to\infty} P\left\{ f(t) \in \hat f(2r)(t) \pm \sum_{k=1}^{d} r_k |\phi_k(t)|\, \mathbb{1}\{|\hat\mu_k| > 2 r_k\},\; 0 < t < 1 \right\} \ge 1 - \alpha. \qquad (31)$$
We found in our simulations that these bands are somewhat wide, but not excessively so. We briefly indicate a few other possible bands based on our estimators, and we report on their empirical behavior in the next section. The theoretical analysis of these bands is the subject of future work.
First we consider a confidence interval based on the truncated series estimator $\tilde f(t) = \sum_{k=1}^{d} \hat\mu_k \phi_k(t)$. Recognizing that $\tilde f(t)$ can be written as an average $n^{-1}\sum_{i=1}^{n} W_i(t)$ of independent components $W_i(t) = \sum_{k=1}^{d} \hat\mu_{ik}\phi_k(t)$, $1 \le i \le n$, the distribution of $\tilde f(t)$ is asymptotically Gaussian. Thus, provided that (i) the bias $\sqrt{n}\{E[\tilde f(t)] - f(t)\}$ is asymptotically negligible, and (ii) the sample variance $S_n^2(t)$ based on $W_1(t),\dots,W_n(t)$ consistently estimates $v^2(t) = \mathrm{Var}(W_1(t))$, the random intervals $[\tilde f(t) \pm n^{-1/2} S_n(t) z_{\alpha/2}]$ contain $f(t)$ with probability $1-\alpha$, asymptotically, as $n \to \infty$.

First we address the bias issue (i). For a large class of functions $f \in W^\beta$ with $\beta \ge 1$, the bias vanishes for $d$ large enough, since
$$\sqrt{n}\left| E[\tilde f(t)] - f(t) \right| \lesssim \sqrt{n} \sum_{k=d+1}^{\infty} |\mu_k| \lesssim \sqrt{n} \sum_{k=d+1}^{\infty} k^{-\beta-1/2} \lesssim n^{1/2} d^{-\beta+1/2} \to 0$$
for $d \gg n^{1/(2\beta-1)}$, under the bounded basis assumption (Assumption 5).

We now address (ii). Each $W_i(t) = \sum_{k=1}^{d} \hat\mu_{ik}\phi_k(t)$ is a sum of $d$ independent components with means $\mu_k$ and variances of order $O(1/m)$. Hence, $W_i(t)$ is asymptotically Gaussian with mean $\sum_{k=1}^{d} \mu_k\phi_k(t)$ and variance $v^2(t) = O(d^2 m^{-1})$. Writing $W_i(t) = v(t)U_i(t)$, we find that
$$S_n^2(t) - v^2(t) = v^2(t)\left[ \frac{1}{n-1}\sum_{i=1}^{n} \{U_i(t) - \bar U(t)\}^2 - 1 \right]$$
is of stochastic order $O_p(v^2(t) n^{-1/2})$, and the sample variance $S_n^2(t)$ converges to $v^2(t)$ in probability if $d^2/(m\sqrt{n}) \to 0$. A band can now be obtained by taking the regular grid $t_j = j/m$ and computing
$$\tilde f(t_j) \pm \frac{S_n(t_j)}{\sqrt{n}}\, z_{\alpha/(2m)} \qquad (32)$$
based on Bonferroni's inequality. This band is easy to compute and has reasonably good coverage, as shown in our simulations.
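The Bonferroni band in display (32) is straightforward to compute. The following Python sketch is illustrative only: it assumes the per-curve coefficients $\hat\mu_{ik}$ (array `mu_ik`) and the basis values $\phi_k(t_j)$ (array `Phi`) have already been obtained, and the function name `bonferroni_band` is ours, not part of the paper's software.

```python
import numpy as np
from statistics import NormalDist

def bonferroni_band(mu_ik, Phi, alpha=0.05):
    """Band (32): f_tilde(t_j) +/- z_{alpha/(2m)} S_n(t_j) / sqrt(n).

    mu_ik : (n, d) per-curve basis coefficients.
    Phi   : (d, m) basis functions evaluated at the grid points t_j.
    """
    n, d = mu_ik.shape
    m = Phi.shape[1]
    W = mu_ik @ Phi                      # W_i(t_j), shape (n, m)
    f_tilde = W.mean(axis=0)             # truncated series estimator
    S_n = W.std(axis=0, ddof=1)          # sample std of W_1(t_j), ..., W_n(t_j)
    z = NormalDist().inv_cdf(1 - alpha / (2 * m))   # Bonferroni quantile
    half = z * S_n / np.sqrt(n)
    return f_tilde - half, f_tilde + half
```

The half-width at each grid point is exactly $z_{\alpha/(2m)} S_n(t_j)/\sqrt{n}$, so the Bonferroni correction is over the $m$ grid points, as in (32).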
We found in our simulations, reported in the next section, that replacing each $W_i(t)$ by $\widetilde W_i(t) = \sum_{k \in \hat s} \hat\mu_{ik}\phi_k(t)$, with $\hat s = \{k : |\hat\mu_k| > r_k\}$ the set of selected indices, yields much narrower bands with good asymptotic coverage. We note that this approach ignores the uncertainty due to selecting the set $\hat s$, which can result in coverage below $100(1-\alpha)\%$.
A superior way to obtain confidence bands is to analyze the process
$$E_n(t) = \sqrt{n}\left\{ \tilde f(t) - E[\tilde f(t)] \right\} = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \sum_{k=1}^{d} (\hat\mu_{ik} - \mu_k)\phi_k(t).$$
For the trigonometric basis $\phi_k$ used in our simulations, it is not hard to show that $E_n$ converges weakly to a Gaussian limit. As before, for $f \in W^\beta$, $\beta \ge 1$, provided $d \gg n^{1/(2\beta-1)}$ is large enough, the bias $\sqrt{n}\,\|E[\tilde f] - f\|_\infty \to 0$, and
$$\tilde f(t) \pm \frac{S_n(t)}{\sqrt{n}}\, q_\alpha \qquad (33)$$
based on the $\alpha$ upper point $q_\alpha$ of the distribution of the supremum of the normalized process $E_n(t)/S_n(t)$, constitutes an asymptotic $1-\alpha$ confidence band. For an easily implementable approximation $q^*_\alpha$ of $q_\alpha$, we rely on the following resampling procedure.
1. Draw with replacement $n$ vectors $(\hat\mu^*_{i1},\dots,\hat\mu^*_{id})$ from $(\hat\mu_{11},\dots,\hat\mu_{1d}),\dots,(\hat\mu_{n1},\dots,\hat\mu_{nd})$.
2. Compute $C^*_i(t) = \sum_{k=1}^{d} \phi_k(t)(\hat\mu^*_{ik} - \hat\mu_k)$, $S^*_n(t) = (n-1)^{-1/2}\left[\sum_{i=1}^{n} \{C^*_i(t) - \bar C^*(t)\}^2\right]^{1/2}$ and $E^*_n(t) = n^{-1/2}\sum_{i=1}^{n} C^*_i(t)$.
3. Approximate $\sup_t |E^*_n(t)/S^*_n(t)|$.
4. Repeat the previous steps $B$ times, and obtain the $\alpha$-upper point $q^*_\alpha$.
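The four resampling steps above can be sketched in a few lines of Python. This is a hedged illustration, not the authors' code: `mu_ik` (the per-curve coefficients $\hat\mu_{ik}$) and `Phi` (the basis $\phi_k$ evaluated on a fine grid) are assumed precomputed, and the supremum over $t$ in step 3 is approximated by the maximum over the grid points.

```python
import numpy as np

def bootstrap_quantile(mu_ik, Phi, alpha=0.05, B=100, rng=None):
    """Approximate q_alpha^* for the band (33) by resampling curves.

    mu_ik : (n, d) per-curve coefficients; Phi : (d, m) basis values on
    the grid.  Returns the alpha-upper point of sup_t |E_n^*(t)/S_n^*(t)|.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = mu_ik.shape[0]
    mu_k = mu_ik.mean(axis=0)
    sups = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)        # step 1: draw with replacement
        C = (mu_ik[idx] - mu_k) @ Phi           # step 2: C_i^*(t_j), (n, m)
        E_star = C.sum(axis=0) / np.sqrt(n)     #         E_n^*(t_j)
        S_star = C.std(axis=0, ddof=1)          #         S_n^*(t_j)
        sups[b] = np.max(np.abs(E_star / S_star))   # step 3: sup over the grid
    return np.quantile(sups, 1 - alpha)         # step 4: alpha-upper point
```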
3 Numerical results
3.1 Simulation design
We conducted our simulations for combinations of mean zero stochastic processes, stationary and non-stationary, with differentiable and non-differentiable mean functions. Specifically, we consider two stationary processes, AR(1) and ARMA(1,1), and two non-stationary processes, the Brownian Bridge (BB) and the Brownian Motion (BM) on [0,1]. We consider the two mean functions:
$$f(t) = c_1 \exp\{-64(t-0.25)^2\} + c_2 \exp\{-256(t-0.75)^2\},$$
referred to in the sequel as Signal 1, and
$$f(t) = c_3\, \mathbb{1}\{0.35 < t < 0.375\} + c_4\, \mathbb{1}\{0.75 < t < 0.875\},$$
referred to as Signal 2. The constants $c_1$–$c_4$ are varied to achieve the desired signal-to-noise ratios; we define the signal-to-noise ratio in Section 3.1.1.
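For concreteness, the two mean functions are straightforward to code. The Python sketch below is purely illustrative; the default values of the constants $c_1$–$c_4$ are placeholders, to be tuned to the desired SNR:

```python
import numpy as np

def signal1(t, c1=2.0, c2=2.0):
    """Signal 1: sum of two Gaussian bumps centered at t = 0.25 and t = 0.75."""
    return c1 * np.exp(-64 * (t - 0.25) ** 2) + c2 * np.exp(-256 * (t - 0.75) ** 2)

def signal2(t, c3=2.0, c4=2.0):
    """Signal 2: sum of two indicator (step) functions."""
    t = np.asarray(t, dtype=float)
    return (c3 * ((0.35 < t) & (t < 0.375))
            + c4 * ((0.75 < t) & (t < 0.875)))
```

Signal 1 is infinitely differentiable, while Signal 2 is discontinuous, which is exactly the contrast the design targets.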
For our simulations we considered two popular families of bases, Fourier and Haar, each
known to have good approximation properties for functions in L2 belonging to general
smoothness classes, such as Sobolev classes. Any other bases with this property can be
considered, and the qualitative and quantitative points we illustrate here will remain essen-
tially the same.
3.1.1 Simulation scenarios
We simulated n curves for each of the eight (signal, stochastic process) combinations above. Each curve $1 \le i \le n$ is observed at $m$ points $t_{ij}$, and the observations follow the model
$$Y_{ij} = f(t_{ij}) + Z_i(t_{ij}) + \varepsilon_{ij},$$
for $1 \le i \le n$ and $1 \le j \le m$. We facilitate comparison with published work on other estimators by considering time points $t_{ij} = t_j = j/m$, for all $i$ and $1 \le j \le m$. We also evaluated the performance of our estimators when the $t_{ij} \in [0,1]$ were simulated independently from the uniform distribution on [0,1]. The measurement errors $\varepsilon_{ij}$ were generated independently from $N(0, \sigma^2_\varepsilon)$. The parameters for simulating the AR(1) and ARMA(1,1) processes were chosen in order to achieve the following equivalences:
$$\mathrm{median}_t\, \mathrm{Var}[\text{Brownian Bridge}(t)] = \mathrm{Var}[\text{AR(1)}],$$
$$\mathrm{median}_t\, \mathrm{Var}[\text{Brownian Motion}(t)] = \mathrm{Var}[\text{ARMA(1,1)}].$$
This facilitates comparison between processes of different natures. Next, the variance of the measurement error $\sigma^2_\varepsilon$ is chosen so that we have two cases, $\sigma^* = 1$ and $\sigma^* = 10$, where
$$\sigma^* = \frac{\mathrm{Var}[Z(t)]}{\sigma^2_\varepsilon}. \qquad (34)$$
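A minimal data-generating sketch for this model, using the Brownian bridge as the process $Z$, is given below in Python. The calibration of $\sigma_\varepsilon$ from $\sigma^*$ uses the median over the grid of $\mathrm{Var}[\mathrm{BB}(t)] = t(1-t)$, in the spirit of display (34) and the median-variance convention above; the function and its defaults are our own illustration, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_curves(f, n=50, m=64, sigma_star=10.0):
    """Simulate Y_ij = f(t_j) + Z_i(t_j) + eps_ij with Z a Brownian bridge.

    sigma_star = Var[Z(t)] / sigma_eps^2, as in display (34), with Var[Z(t)]
    summarized by its median over the grid.
    """
    t = np.arange(1, m + 1) / m
    # Brownian bridge on the grid: B(t) = W(t) - t * W(1),
    # with W built from independent Gaussian increments of variance 1/m.
    steps = rng.normal(scale=np.sqrt(1.0 / m), size=(n, m))
    W = np.cumsum(steps, axis=1)
    Z = W - t * W[:, [-1]]
    # Calibrate the measurement-error variance from sigma_star.
    med_var = np.median(t * (1 - t))          # median_t Var[BB(t)]
    sigma_eps = np.sqrt(med_var / sigma_star)
    eps = rng.normal(scale=sigma_eps, size=(n, m))
    return t, f(t) + Z + eps
```

Setting `sigma_star=1.0` reproduces the equal-variability case, while the default `sigma_star=10.0` makes the measurement error comparatively small.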
[Figure 1 about here; panels: (a) AR(1), σ* = 10; (b) AR(1), σ* = 1; (c) BB, σ* = 10; (d) BB, σ* = 1; (e) ARMA(1,1), σ* = 10; (f) ARMA(1,1), σ* = 1; (g) BM, σ* = 10; (h) BM, σ* = 1.]

Figure 1: Plots of Signal 1 + AR(1)/BB + Noise (top row) and of Signal 1 + ARMA(1,1)/BM + Noise (bottom row), n = 50, SNR = 4.25.
When $\sigma^* = 1$ the variability of the measurement error is the same as that of the stochastic process, whereas for $\sigma^* = 10$ the measurement errors become essentially negligible. Figure 1 shows realizations from each of the stochastic processes, with mean corresponding to Signal 1 and added noise corresponding to $\sigma^* = 1$ and $\sigma^* = 10$, respectively.

We conducted simulations for different values of the signal-to-noise ratio (SNR). Since the process $Z(t)$ is assumed independent of the measurement error, we define the noise level as $(\mathrm{Var}[Z(t)] + \sigma^2_\varepsilon)^{1/2}$ and the signal-to-noise ratio as $\mathrm{SNR} = \mathrm{Range}[f] / (\mathrm{Var}[Z(t)] + \sigma^2_\varepsilon)^{1/2}$, where $\mathrm{Range}[f] = \max_{0 \le t \le 1} f(t) - \min_{0 \le t \le 1} f(t)$.
3.2 Simulation results: the fit of the estimates
We contrast the quality of the fit of our estimates with eight other methods previously
proposed and studied in the literature. The first is the simple ensemble average of the observations $Y_{ij}$. The next three are obtained by applying, respectively, the following smoothing methods to the entire data set, containing all n curves:
• Local polynomial kernel smoothing (Local Poly) with a global bandwidth, suggested by, among others, Yao et al. (2003), Müller (2005), Yao, Müller and Wang (2005), and Yao (2007). We used the R functions locpoly and dpill to obtain the estimate and its bandwidth, respectively.
• Nadaraya-Watson kernel smoothing (NWK) with a global bandwidth, discussed in a
functional data setting by, e.g., Yao (2007). We used the N(0, 1) kernel and obtained
the bandwidth via cross-validation.
• Smoothing splines, as suggested, for instance, by Rice and Silverman (1991). We used order-4 B-spline basis functions $B_k(t)$, with a knot placed at each design time point $t_j$ and a roughness penalty proportional to the squared second derivative of $f(t)$. We implemented the method using smooth.spline in R, with the tuning parameter in the penalty term selected by generalized cross-validation, leaving one curve out at a time.
For the last four comparison methods, we estimate $f(t)$ by averaging smoothed versions of the individual trajectories. The reconstructions of the individual curves were performed using:
• Local polynomial kernel smoother (Global kernel) with a global bandwidth
• Local polynomial kernel smoother (Local kernel) with a local bandwidth, where the bandwidths are found using the plug-in algorithm proposed in Seifert, Brockmann, Engel and Gasser (1994)
• B-splines regression with a roughness penalty
• Fourier expansion regression with a roughness penalty, as discussed in Chapter 5 of Ramsay and Silverman (2005).
We compare the estimates above with our proposed estimates using the Fourier and Haar
bases. We consider hard threshold estimates (HT) obtained by truncating the least squares
[Figure 2 about here; each panel shows boxplots of √MSE for the HT(r), kernel, and smoothing spline estimators. Panels: (a) AR(1), σ* = 10; (b) AR(1), σ* = 1; (c) BB, σ* = 10; (d) BB, σ* = 1; (e) ARMA(1,1), σ* = 10; (f) ARMA(1,1), σ* = 1; (g) BM, σ* = 10; (h) BM, σ* = 1.]

Figure 2: Boxplots of the $\sqrt{\mathrm{MSE}}$ of the HT(r), kernel and smoothing spline estimators, over 500 simulations.
estimates either at levels $r_k$, for each $k$ (we denote the resulting estimate by HT(r)), or at levels $2r_k$ (to obtain HT(2r)), where
$$r_k = \frac{S_k}{\sqrt{n}}\, z_{\alpha/(2d)} = z_{\alpha/(2d)} \sqrt{\frac{1}{n(n-1)} \sum_{i=1}^{n} (\hat\mu_{i,k} - \hat\mu_k)^2}.$$
In this simulation study we took $d = m$ basis functions. Any choice $d > m$ would lead to repeated basis functions, since $\phi_{m+k}(j/m) = \phi_k(j/m)$ for all $1 \le j \le m$ and $k \ge 1$ in the case of the equally spaced design $t_{ij} = j/m$ and the Fourier basis $\phi_k$.
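Putting the pieces together, the hard-thresholded estimator with the data-driven levels $r_k$ above can be sketched as follows. We assume, as an illustration for the equally spaced design, that the per-curve coefficients $\hat\mu_{ik}$ are empirical inner products of each curve with an orthonormal trigonometric basis; the function names `fourier_basis` and `ht_estimate` are ours.

```python
import numpy as np
from statistics import NormalDist

def fourier_basis(t, d):
    """Orthonormal trigonometric basis phi_1, ..., phi_d evaluated at t."""
    Phi = np.empty((d, t.size))
    Phi[0] = 1.0
    for k in range(1, d):
        j = (k + 1) // 2                       # frequency of the k-th function
        Phi[k] = (np.sqrt(2) * np.cos(2 * np.pi * j * t) if k % 2 == 1
                  else np.sqrt(2) * np.sin(2 * np.pi * j * t))
    return Phi

def ht_estimate(Y, t, d, alpha=0.05, factor=1.0):
    """Hard-thresholded estimate HT(factor * r) of the mean curve on the grid t."""
    n, m = Y.shape
    Phi = fourier_basis(t, d)                  # (d, m)
    mu_ik = Y @ Phi.T / m                      # per-curve coefficients mu_hat_{ik}
    mu_k = mu_ik.mean(axis=0)                  # least squares coefficients mu_hat_k
    z = NormalDist().inv_cdf(1 - alpha / (2 * d))
    r_k = z * np.sqrt(((mu_ik - mu_k) ** 2).sum(axis=0) / (n * (n - 1)))
    kept = np.abs(mu_k) > factor * r_k         # data-driven coefficient selection
    return (mu_k * kept) @ Phi                 # fitted mean curve on the grid
```

Calling `ht_estimate(Y, t, d, factor=2.0)` gives the HT(2r) variant used for the confidence sets.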
Table 1 contains the empirical mean squared errors (EMSEs) for all the competing estimates and for the proposed hard-thresholded estimates; we also include the least squares estimator as a baseline for comparison. For brevity we only include the results for Signal 1, since the results for Signal 2 were similar. The SNR for these simulations was set to 4.25, and Table 1 is obtained for n = 400, m = 256 and equally spaced design points.

Table 1: EMSE results for Signal 1 based on 500 simulations and equally spaced design. Scenario: n = 400, m = d = 2^8 = 256, α = .05, SNR = 4.25. Entries are √EMSE × 10^-6 [√MEDMSE × 10^-6].

                                Brownian Bridge                     AR(1)
                          σ* = 1           σ* = 10          σ* = 1           σ* = 10
Fourier Basis
  LS                  29582 [27580]    21104 [18217]    30767 [30398]    22767 [22415]
  HT(r)               18429 [15723]    17544 [14446]    16928 [16072]    15989 [15195]
  HT(2r)              20231 [17721]    18334 [15441]    23894 [23767]    21448 [19651]
Haar Basis
  LS                  29493 [27537]    21092 [18208]    30672 [30302]    22749 [22389]
  HT(r)               34820 [33551]    23808 [21495]    38204 [38064]    27852 [27449]
  HT(2r)              48011 [47158]    48879 [53945]    61629 [61684]    52427 [52232]
Pooled Curves
  Local Poly          22769 [20274]    20557 [17584]    22673 [22075]    20479 [20073]
  NWK                 23186 [20790]    20271 [17289]    22848 [22279]    20594 [20153]
  Smoothing Splines   21455 [18801]    20186 [17308]    20794 [20263]    19408 [19069]
  Ensemble Ave        29493 [27537]    21092 [18208]    30672 [30302]    22749 [22389]
Curve-by-Curve
  Global Kernel       47709 [46877]    23165 [20647]    31641 [31123]    20451 [20095]
  Local Kernel        36161 [34809]    21677 [19036]    27824 [27261]    20548 [20152]
  B-splines Regression 38532 [37384]   20181 [17291]    21413 [20899]    20897 [20531]
  Fourier Regression  39543 [38316]    31905 [30200]    38309 [38091]    30359 [30167]

                                Brownian Motion                     ARMA(1,1)
                          σ* = 1           σ* = 10          σ* = 1           σ* = 10
Fourier Basis
  LS                  50959 [44662]    37958 [29482]    53549 [52989]    41562 [40893]
  HT(r)               35388 [26045]    34133 [24780]    31356 [30346]    30491 [29488]
  HT(2r)              37794 [29300]    35322 [24835]    42625 [41283]    44088 [39450]
Haar Basis
  LS                  50815 [44564]    37938 [29419]    53399 [52749]    41543 [40867]
  HT(r)               58862 [54324]    45587 [34722]    66936 [65883]    51689 [51020]
  HT(2r)              78181 [75281]    88327 [85689]    106709 [106957]  103036 [96124]
Pooled Curves
  Local Poly          40648 [32707]    37345 [28707]    41186 [40240]    38084 [37464]
  NWK                 41260 [33561]    36791 [27848]    41277 [40470]    38092 [37468]
  Smoothing Splines   38737 [30617]    36865 [27928]    38281 [37269]    36341 [35685]
  Ensemble Ave        50815 [44564]    37938 [29419]    53399 [52749]    41543 [40867]
Curve-by-Curve
  Global Kernel       83134 [80354]    43568 [36505]    48418 [47520]    38552 [38017]
  Local Kernel        62759 [58781]    39777 [31632]    45305 [44259]    38780 [38184]
  B-splines Regression 71699 [68131]   38657 [30346]    39255 [38373]    39874 [39140]
  Fourier Regression  66629 [62591]    54688 [49219]    64238 [64015]    51861 [51601]

Table 2: EMSE results for Signal 1 based on 200 simulations and uniform design. Scenario: n = 128, m = d = 2^7 = 128, α = .05, SNR = 4.25. Entries are √EMSE × 10^-6 [√MEDMSE × 10^-6].

                                Brownian Bridge                     AR(1)
                          σ* = 1           σ* = 10          σ* = 1           σ* = 10
Fourier Basis
  LS                  94138 [91881]    75697 [72876]    95598 [94382]    76807 [76221]
  HT(r)               48161 [44972]    44907 [38006]    45785 [44642]    39685 [37951]

                                Brownian Motion                     ARMA(1,1)
Fourier Basis
  LS                  158890 [153804]  128867 [124620]  163308 [161969]  133878 [134306]
  HT(r)               84029 [75817]    80056 [66235]    80765 [79201]    77738 [68341]

Figure 2 shows that the variability of the MSEs of the estimates with closest performance (HT(r), smoothing splines, and kernel estimators via pooled curves) is comparable. Both the MEDMSE (median MSE) and the EMSE of HT(r) are smaller than those of its competitors. In Table 2 we assessed the quality of our estimator for uniformly sampled design points, and we lowered the values of m and n to 128. The smaller sample size and the additional randomness induced by the random design are responsible for the inflation of the EMSE values in Table 2 compared to those in Table 1.
Our results support the following conclusions on the performance of the estimators:

1. If σ* = 10, the variance of the process dominates the variance of the random errors, and the HT estimators based on the Fourier basis perform slightly better than most of the competing estimators.

2. If σ* = 1, the difference between our estimator and the competing estimators is more pronounced, especially for the BM and ARMA(1,1) processes, suggesting that this type of estimation is more robust against large variability in the data.

3. As an additional remark, our experiments indicate that some of the estimators proposed in the literature may be outperformed by the simple least squares estimator based on all the data points, or even by the naive sample average, if the choice of their tuning parameters is not refined. For all our simulations we chose these tuning parameters adaptively, as explained above, but we did not attempt to improve upon the published guidelines on their selection.
3.3 Simulation results: confidence sets
3.3.1 Confidence balls
We first investigate the empirical performance of the confidence ball proposed in Section 2.5, using the Fourier basis. The confidence ball $B(\hat f(2r), \hat\rho(2r))$ has the radius established in display (25). In Table 3 we report the average radius, the average empirical $L_2$ distance evaluated at 64 equally spaced time points, and the coverage over S = 500 simulated datasets. The results show that the confidence ball achieves the nominal coverage for the equally spaced design. When the time points are uniform, the ball has a wider radius and coverage close to the nominal level, even for σ* = 1.
Table 3: Numerical results for the confidence ball $B(\hat f(2r), \hat\rho(2r))$: average radius, average empirical $L_2$ distance, and coverage over S = 500 simulations. Scenario: Signal 1, AR(1) process, m = d = 2^6 = 64, Fourier basis functions φ_k, SNR = 2.2.

                                Empirical L2    Radius    Coverage
σ* = 10 (n = 350), α = 0.05
  Equally Sp. t_j                   0.023        0.059      0.97
  Uniform t_ij                      0.036        0.063      0.94
σ* = 1 (n = 125), α = 0.10
  Equally Sp. t_j                   0.051        0.100      0.96
  Uniform t_ij                      0.074        0.107      0.88
3.3.2 Confidence bands
Next, we investigate the finite sample coverage of the confidence bands (Methods 1–4) proposed in Section 2.5, again using the Fourier basis. Method 1 is based on display (31). Method 2 is the band obtained from display (32) with $\widetilde W_i(t) = \sum_{k \in \hat s} \hat\mu_{ik}\phi_k(t)$. Method 3 is the band obtained from display (32) with $W_i(t) = \sum_{k=1}^{d} \hat\mu_{ik}\phi_k(t)$. Method 4 is implemented as in display (33).
We consider the following scenario for evaluating the confidence bands: n = 300, m = 64 and d = 30, with the signal-to-noise ratio set to 2.2. Both the uniform and the equally spaced designs are considered for the discrete time points. We evaluate our bands on the fine grid of equally spaced time points $t_j = j/m$, $1 \le j \le m$, in [0,1]. We compare these methods with simultaneous confidence bands (SCB) of the form $\bar f_{ave}(t) \pm z_{\alpha/(2m)}\, n^{-1/2}\, \hat\Gamma^{1/2}(t,t)$, based on Zhang and Chen (2007). Here $\bar f_{ave}(t) = (1/n)\sum_{i=1}^{n} \hat f_{loc,i}(t)$ is the average of local linear kernel estimators $\hat f_{loc,i}$ for each curve, with local bandwidth (the built-in bandwidth choice of lokerns in R). The estimated covariance $\hat\Gamma(t,t)$ is computed using the sample variance of the kernel estimators at each t. Table 4 shows that, for the equally spaced design, these SCB fall short of the nominal coverage.
Table 4 summarizes the results for the AR(1) and BB processes and Signal 1. The entries in the table are the average widths of the confidence bands followed, in parentheses, by their empirical coverage over S = 300 simulations. Since we chose α = 0.05 for this simulation study, we expect the empirical coverage of the proposed bands to be around the 0.95 nominal level. The results presented in Table 4 support the following conclusions on the proposed confidence bands.
1. Method 1 yields an adaptive band that is conservative and necessarily wider than those of the other methods we analyze. Its coverage is close to, and often exceeds, the nominal level.

2. Method 2 has coverage close to the nominal level for the equally spaced design, and its width is the narrowest among all methods considered. Methods 2 and 3 have similar coverage in the case of equally spaced design, with Method 2 having the smaller width. In the uniform design case, however, Method 2 has slightly lower coverage than Method 3.

3. Methods 3 and 4 are centered at the same estimator; the difference in their widths is due to the quantile used. Method 4, employing a quantile chosen via bootstrap, produces bands that are narrower on average, while maintaining coverage close to the nominal level. It has slightly lower coverage than Method 3.

4. For the same (n, m, d) combination, the scenario σ* = 10 yields narrower bands than σ* = 1. Also, the bands for the uniform design are slightly wider than those for the equally spaced design. Across all scenarios we consider, the coverage of the proposed bands lies between 0.84 and 1.00 for the AR(1) process and between 0.91 and 1.00 for the BB process.
4 Application to daily temperature curves
We use the tempkent dataset, which is part of the fds package of functional datasets in R. It consists of temperature curves recorded over the course of a day in Kent Town, Australia; we consider the daily temperature curves corresponding to the period 2003–2007. The temperature curves are observed at m = 48 equally spaced (half-hour) time points each day. Our goal is to estimate the mean temperature curve and to provide confidence bands; we used d = 40 Fourier basis functions. The temperature curves contained in this dataset, together with the confidence bands we obtained, are displayed in Figure 4. The confidence band for the mean temperature curve has a smooth appearance and indicates a cyclic behavior, with an expected maximum of approximately 22.3 degrees Celsius and a minimum of approximately 12.4 degrees Celsius.
References
[1] Yannick Baraud. Confidence balls in Gaussian regression. Annals of Statistics, 32, 528-
551, 2004.
[2] Peter Bickel and Yaacov Ritov. Non and semi-parametric statistics compared and con-
trasted. Journal of Statistical Planning and Inference, 91, 209-228, 2000.
[Figure 3 about here; panels: (a) Method 2; (b) Method 3.]

Figure 3: Scenario: AR(1), σ* = 1, SNR = 2.2, n = 300, m = 64, d = 30, equally spaced $t_j$, α = 0.05. The plots show the simulated data curves in black, the true f as a white solid line, and the confidence bands as white dashed lines.
[Figure 4 about here; the vertical axis shows temperature (degrees C) against the 48 half-hour time points.]

Figure 4: Daily temperature curves for Kent Town, Australia, for the years 2003–2007, shown in black. Confidence bands for the mean temperature curve over this period are shown in white: the Method 1 band (pronounced dash) and the Method 2 band (fine dash).
[3] Lucien Birgé and Pascal Massart. An adaptive compression algorithm in Besov spaces. Constructive Approximation 16(1), 1–36, 2000.

[4] Michal Benko, Wolfgang Härdle and Alois Kneip. Common functional principal components. Annals of Statistics 37(1), 1–34, 2009.

[5] Rudolf Beran and Lutz Dümbgen. Modulation of estimators and confidence sets. Annals of Statistics 26, 1826–1856, 1998.
[6] P. Laurie Davies, Arne Kovac and Monika Meise. Nonparametric regression, confidence
regions and regularisation. Annals of Statistics 37, 2597–2625, 2009.
[7] David Degras. Nonparametric estimation of a trend based upon sampled continuous
processes. C. R. Acad. Sci. Paris Ser. I 347, 191-194, 2009.
[8] David Donoho and Iain Johnstone. Adapting to unknown smoothness via wavelet shrinkage. Journal of the American Statistical Association 90, 1200–1224, 1995.

[9] David Donoho and Iain Johnstone. Minimax estimation via wavelet shrinkage. Annals of Statistics 26, 879–921, 1998.
[10] David Donoho, Iain Johnstone, Gerard Kerkyacharian, and Dominique Picard. Density
estimation by wavelet thresholding. Annals of Statistics 24(2), 508–539, 1996.
[11] Christopher Genovese and Larry Wasserman. Adaptive confidence bands. Annals of
Statistics 36(2), 875–905, 2008.
[12] Daniel Gervini. Free-knot spline smoothing for functional data. Journal of the Royal
Statistical Society, Series B 68(4), 671–687, 2006.
[13] Evarist Giné and Richard Nickl. Confidence bands in density estimation. Annals of
Statistics 38(2), 1122–1170, 2010.
[14] Henry Landau and Lawrence Shepp. On the supremum of a Gaussian process. Sankhya,
32, 369–378, 1970.
[15] Mark Low. On nonparametric confidence intervals. The Annals of Statistics, 25 (6),
2547 - 2554, 1997.
[16] Pascal Massart. Concentration Inequalities and Model Selection, Ecole d’Ete de Proba-
bilites de Saint-Flour XXXIII - 2003, volume 1896, 2007.
[17] Hans-Georg Müller. Functional modeling and classification of longitudinal data. Scandinavian Journal of Statistics 32, 223–240, 2005.
[18] Dominique Picard and Karine Tribouley. Adaptive confidence interval for pointwise curve estimation. Annals of Statistics 28(1), 298–335, 2000.
[19] R Development Core Team (2009). R: A language and environment for statistical com-
puting. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0,
URL http://www.R-project.org.
[20] James Ramsay and Bernard Silverman. Functional data analysis, 2nd Edition. Springer,
New York, 2005.
[21] James Ramsay and Bernard Silverman. Applied functional data analysis. Springer, New
York, 2002.
[22] John Rice and Bernard Silverman. Estimating the mean and covariance structure non-
parametrically when the data are curves. Journal of the Royal Statistical Society, Series
B 53(1), 233–243, 1991.
[23] Jamie Robins and Aad van der Vaart. Adaptive nonparametric confidence sets. Annals
of Statistics, 34, 229-253, 2006.
[24] David Ruppert, Matthew Wand and Raymond Carroll. Semiparametric regression.
Cambridge University Press, Cambridge 2003.
[25] Burkhardt Seifert, Michael Brockmann, Joachim Engel and Theo Gasser. Fast algorithms
for nonparametric curve estimation. Journal of Computational and Graphical Statistics
3(2), 192–213, 1994.
[26] Galen Shorack. Probability for Statisticians. Springer, 2000.
[27] Galen Shorack and Jon Wellner. Empirical Processes with Applications to Statistics,
Wiley, 1986.
[28] Alexandre Tsybakov. Introduction to nonparametric estimation. Springer, New York,
2009.
[29] Larry Wasserman. All of nonparametric statistics. Springer, New York, 2006.
[30] Fang Yao. Asymptotic distributions of nonparametric regression estimators for longitu-
dinal and functional data. Journal of Multivariate Analysis 98, 40–56, 2007.
[31] Fang Yao, Hans-Georg Müller and Jane-Ling Wang. Functional data analysis for sparse longitudinal data. Journal of the American Statistical Association 100(470), 577–590, 2005.
[32] Jin-Ting Zhang and Jianwei Chen. Statistical inferences for functional data. Annals of
Statistics 35(3), 1052–1079, 2007.
Table 4: Table entries are average width (coverage) over S = 300 simulations. Scenario: Signal 1, α = 0.05, Fourier basis functions φ_k, SNR = 2.2, B = 100, n = 300, m = 64, d = 30. Method 1 is the adaptive band $\hat f(2r)(t_j) \pm \sum_{k=1}^{d} r_k |\phi_k(t_j)|\, \mathbb{1}\{|\hat\mu_k| > 2r_k\}$ of display (31).

                                          AR(1) process   BB process
σ* = 10, Equally Sp. t_j
  Bands based on asymptotic normality
    SCB                                    0.16 (0.78)     0.15 (0.84)
  Proposed Bands
    Method 1) adaptive band                0.27 (0.94)     0.26 (1.00)
    Method 2) HT & Bonferroni              0.13 (0.96)     0.14 (1.00)
    Method 3) LS & Bonferroni              0.17 (0.96)     0.16 (0.98)
    Method 4) LS & Bootstrap               0.16 (0.92)     0.14 (0.94)
σ* = 10, Uniform t_ij
  Proposed Bands
    Method 1) adaptive band                0.30 (0.92)     0.30 (0.97)
    Method 2) HT & Bonferroni              0.14 (0.91)     0.15 (0.95)
    Method 3) LS & Bonferroni              0.21 (0.95)     0.20 (0.96)
    Method 4) LS & Bootstrap               0.20 (0.94)     0.19 (0.95)
σ* = 1, Equally Sp. t_j
  Proposed Bands
    Method 1) adaptive band                0.31 (1.00)     0.30 (1.00)
    Method 2) HT & Bonferroni              0.14 (0.94)     0.15 (0.97)
    Method 3) LS & Bonferroni              0.20 (0.94)     0.19 (0.96)
    Method 4) LS & Bootstrap               0.19 (0.93)     0.18 (0.94)
σ* = 1, Uniform t_ij
  Proposed Bands
    Method 1) adaptive band                0.34 (0.99)     0.35 (1.00)
    Method 2) HT & Bonferroni              0.16 (0.84)     0.17 (0.91)
    Method 3) LS & Bonferroni              0.24 (0.96)     0.24 (0.95)
    Method 4) LS & Bootstrap               0.24 (0.95)     0.23 (0.94)