Adaptive inference for the mean of a Gaussian process
in functional data
Florentina Bunea¹, Marten H. Wegkamp¹ and Andrada E. Ivanescu²
Florida State University and East Carolina University
November 5, 2010
Abstract
This paper proposes and analyzes fully data driven methods for inference about the
mean function of a Gaussian process from a sample of independent trajectories of the
process, observed at random time points and corrupted by additive random error. The
proposed method uses thresholded least squares estimators relative to an approximat-
ing function basis. The variable threshold levels are estimated from the data and the
resulting estimates adapt to the unknown sparsity of the mean function relative to the
selected approximating basis. These results are based on novel oracle inequalities that
are used to derive the rates of convergence of our estimates. In addition, we construct
confidence balls that adapt to the unknown regularity of the mean function. They are
easy to compute since they do not require explicit estimation of the covariance operator
of the process. The simulation study shows that the new method performs very well in
practice, and is robust against large variations introduced by the random error terms.
Keywords: Stochastic processes; nonparametric mean estimation; thresholded estimators;
functional data; oracle inequalities; adaptive inference; confidence balls.
Acknowledgements: The authors thank the Associate Editor and a referee for their con-
structive remarks. The research of Florentina Bunea and Marten Wegkamp was supported in
part by NSF Grants DMS-0706829 and DMS-1007444. Part of the research was done while
the authors were visiting the Isaac Newton Institute for Mathematical Sciences (Statistical
Theory and Methods for Complex, High-Dimensional Data Programme) at Cambridge University during Spring 2008.
¹Department of Statistics, Florida State University, Tallahassee, FL 32306-4330.
²Department of Biostatistics, East Carolina University, Greenville, NC 27858-4353.
1 Introduction
In this paper we develop and analyze new methodology for inference about the mean of a
Gaussian process from data that consists of independent realizations of this process observed
at discrete times, where each observation is contaminated by an additive error term. For-
mally, let X(t), 0 ≤ t ≤ 1 be a Gaussian process with mean function f(t) = E[X(t)] and
stochastic part Z(t) = X(t) − f(t). We denote the covariance function of X (and Z) by
$\Gamma(s,t) = \mathrm{Cov}(X(s), X(t))$, for all $0 \leq s, t \leq 1$. We observe $Y_{ij}$ at times $t_{ij}$, for $1 \leq i \leq n$, $1 \leq j \leq m$, that are of the form
$$Y_{ij} = X_i(t_{ij}) + \varepsilon_{ij}, \qquad (1)$$
where the $X_i(t)$, with mean $f(t)$, are independent realizations of the process $X(t)$.
Although we could allow for different sample sizes $m_i$ per curve (the conditions on $m$ imposed below would then be replaced by conditions on $\min_i m_i$), we treat $m_i = m$ for ease of notation and clarity of exposition. We assume that the $\varepsilon_{ij}$ are independent across $i$ and $j$ with $E[\varepsilon_{ij}] = 0$ and $E[\varepsilon_{ij}^2] = \sigma_\varepsilon^2 < \infty$, and that $\varepsilon$ is independent of $X$.
Although the estimation of f received considerable attention over the last decade, the the-
oretical study of data-adaptive estimators in model (1) is still open to investigation. In
contrast with the abundance of methods for estimating f , methods for constructing confi-
dence sets for f are very limited. This motivates our two-fold contribution to the existing
literature:
(1) construction of computationally efficient and fully data-driven estimators and confi-
dence balls for f ;
(2) theoretical assessment of the quality of our data adaptive estimates and proof that the
estimators and the confidence balls adapt to the unknown regularity of f and Z.
We begin by reviewing the existing results in the literature, which provides further moti-
vation for the procedure set forth in this article. The problem of estimating f from data
generated from (1) has been considered by a large number of authors, starting with Ramsay
and Silverman (2002, 2005) and Ruppert, Wand and Carroll (2003). The existing methods are
either based on kernel smoothers as in Zhang and Chen (2007), Yao (2007), Benko, Hardle
and Kneip (2009), penalized splines, as in, for instance, Ramsay and Silverman (2005),
free-knot splines as in Gervini (2006), or ridge-type least squares estimates as in Rice and
Silverman (1991). All resulting estimates depend on tuning parameters that are method
specific. Theoretical properties of these estimates of f are still emerging, and have only been
established for non-adaptive choices of the respective tuning parameters which require prior
knowledge of the smoothness of f , see, for instance, Zhang and Chen (2007) and Gervini
(2006). Although guidelines for data-driven choices of these parameters are offered in all
these works, the theoretical properties of the resulting estimates are still open to investi-
gation. In contrast, we suggest in Section 2 below a computationally simple method based
on thresholded projection estimators, with variable threshold level. Our method does not
require any specification of the regularity of f(t) or Z(t) prior to estimation. We show via
oracle inequalities that our estimators adapt to this unknown regularity.
Whereas the estimation of the mean f(t) of the process X(t) is well understood, modulo
the technical and possibly computational issues raised above, the construction of uniform
confidence sets for f has not been investigated in this context and in general the construction
of confidence sets for f in model (1) seems to have received little attention. An exception
is Degras (2009). Although his procedure is attractive, his theoretical analysis ignores the
bias term when applying a classical result by Landau and Shepp (1970) on the supremum
of a Gaussian process. Therefore his confidence bands do not attain the nominal coverage.
We propose and analyze a number of alternative procedures for constructing confidence sets.
In particular, we offer a computationally simple procedure that leads to adaptive confidence
balls.
The paper is organized as follows. In Section 2.2 below we discuss thresholded projection
estimators in the functional data setting. In Section 2.3 we establish oracle inequalities for the
fit of the estimators which show that the estimates adapt to the unknown sparsity of the mean
f . Under appropriate conditions on the mean f(t) and the covariance function Γ(s, t) of the
process Z(t), we derive rates of convergence for our estimates in Section 2.4. In Section 2.5 we
construct confidence balls for f and prove that they have the desired asymptotic coverage
probability uniformly over large classes of functions. Moreover, we suggest a number of
methods for constructing confidence bands. Section 3 contains a comprehensive simulation
study that indicates that our methods compare favorably with existing methods. The net
merit of the proposed methods is especially visible when the variance of the random noise ε
is at the same level as that of the stochastic process Z(t), and we discuss this in detail in
Sections 3.2 and 3.3.
2 Methodology
2.1 Preliminaries
In this section we introduce notation and assumptions that will be used throughout the
paper. As explained in the introduction, the aim of this paper is
(a) to estimate the mean f(t) of the process X(t) and
(b) to construct confidence sets for the mean f(t), t ∈ [0, 1].
We assume throughout that f ∈ L2([0, 1], dt) and is bounded; dt denotes the Lebesgue mea-
sure on [0, 1] and in what follows we will write L2 for the space L2([0, 1], dt). We will also
make the following standard assumption on the process:
Assumption 1. The paths of the Gaussian process $X(t)$, $t \in [0,1]$, are $L_2$-functions almost surely, and the covariance kernel $\Gamma$ is continuous and satisfies $\int_0^1 \Gamma(t,t)\,dt < \infty$.
Remark. We note, for further reference, that Assumption 1 guarantees that $\Gamma \geq 0$ (positive semi-definite); see, for instance, Shorack and Wellner (1986, page 208). Also, by Mercer's theorem, every continuous $\Gamma$ has an eigen-decomposition
$$\Gamma(s,t) = \sum_{j=1}^{\infty}\lambda_j f_j(s) f_j(t)$$
in terms of eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots$ and orthonormal eigenfunctions $f_1, f_2, \ldots$. Moreover, $\lambda_j \geq 0$ and $\sum_{j=1}^{\infty}\lambda_j < \infty$; see, e.g., Shorack and Wellner (1986, page 210).
Our approach uses thresholded projection estimates which are obtained relative to bases
φ1, φ2, . . . that are orthonormal in L2 and are known to have good approximation properties
over a large scale of smoothness classes to which the target f may belong. Examples include
the Fourier, local trigonometric and wavelet bases.
Assumption 2. The mean $f(t) = E[X(t)]$ of the Gaussian process $X(t)$ is in $L_2$ and may be written as
$$f(t) = \sum_{k=1}^{\infty}\mu_k\phi_k(t), \qquad (2)$$
where the convergence is uniform and absolute on $[0,1]$. The coefficients $\mu_k$ are given by
$$\mu_k = \int_0^1 f(t)\phi_k(t)\,dt. \qquad (3)$$
Assumption 3. The observation “times” tij are independent and uniformly distributed on
the interval [0, 1].
Assumption 4. The errors εij are independent N(0, σ2) random variables.
Assumption 5. The basis functions φk are uniformly bounded.
Remark. The Gaussian assumption on the process X(t) and errors εij in Assumptions 1
and 4, and the bounded basis assumption (Assumption 5) may be relaxed at the cost of
rather technical proofs. These assumptions are used in the proof of Proposition 2 of Section
2.3.1 below.
2.2 Threshold-type estimators for functional data
Our procedure falls between two of the currently used strategies for the estimation of f :
averaging estimated individual trajectories and applying various smoothing methods to the
entire data set. Our initial estimator of f is a projection estimator onto a space generated
by a large set of basis functions and can be viewed as an average (over n) of weighted values
of the observations Yij. Our final estimator will be a truncated version of the projection
estimator, with data dependent truncation levels determined from the entire data set. We
describe the details in what follows.
Given a family of basis functions $\{\phi_k\}_k$ and a large integer $d$ (the cut-off point), which can grow polynomially in $n$, our initial estimator of $f$ is $\hat f(t) = \sum_{k=1}^{d}\hat\mu_k\phi_k(t)$, where
$$\hat\mu_k = \frac{1}{n}\sum_{i=1}^{n}\hat\mu_{i,k} \qquad (4)$$
is the average of the projection estimators
$$\hat\mu_{i,k} = \frac{1}{m}\sum_{j=1}^{m} Y_{ij}\,\phi_k(t_{ij}). \qquad (5)$$
The variance of the initial estimator $\hat f(t) = \sum_{k=1}^{d}\hat\mu_k\phi_k(t)$ may be unnecessarily inflated by the presence of possibly many very small estimates $\hat\mu_k$. This drawback can be remedied
by truncating the coefficients at a level rk that takes into account both the variability of
the measurement errors εij and the variability of the stochastic processes Zi(t). This is the
essential difference between truncated estimators based on data generated as in (1) and their
counterpart based only on independent data in a standard nonparametric regression setting.
We will focus on hard threshold estimators of the coefficients $\mu_k$ and of the function $f$. They are, respectively,
$$\hat\mu_k(r_k) = \hat\mu_k\,\mathbf{1}\{|\hat\mu_k| > r_k\}$$
and
$$\hat f_{(r)} = \sum_{k=1}^{d}\hat\mu_k(r_k)\,\phi_k,$$
where here and in what follows we use the notation $r = (r_1, \ldots, r_d)$. In the next section we discuss the goodness of fit of these estimates in terms of the $L_2$ norm, where for any $g \in L_2$ we write $\|g\|^2 = \int_0^1 g^2(t)\,dt$.
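To make the construction concrete, here is a minimal numerical sketch of (4), (5) and the hard-threshold step. The basis (trigonometric), the mean function, the stand-in for the stochastic part, and the constant threshold level are all illustrative choices of ours, not quantities prescribed by the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 200, 50, 15              # curves, observations per curve, basis cut-off

def phi(k, t):
    """Orthonormal trigonometric basis on [0, 1]: phi_1 = 1, then cos/sin pairs."""
    if k == 1:
        return np.ones_like(t)
    j = k // 2
    return np.sqrt(2) * (np.cos(2 * np.pi * j * t) if k % 2 == 0 else
                         np.sin(2 * np.pi * j * t))

f_true = lambda t: np.sin(2 * np.pi * t) + 0.5          # illustrative mean function
t_obs = rng.uniform(0, 1, size=(n, m))                  # random design times t_ij
Z = rng.normal(size=(n, 1)) * np.sin(np.pi * t_obs)     # crude stand-in for Z_i(t)
Y = f_true(t_obs) + Z + rng.normal(0, 0.3, size=(n, m)) # model (1)

# Per-curve projections (5), averaged over curves as in (4)
mu_ik = np.stack([(Y * phi(k, t_obs)).mean(axis=1) for k in range(1, d + 1)], axis=1)
mu_hat = mu_ik.mean(axis=0)

# Hard thresholding; a constant level r_k = 0.1 is used purely for illustration
r = np.full(d, 0.1)
mu_thr = np.where(np.abs(mu_hat) > r, mu_hat, 0.0)
f_hat = lambda t: sum(mu_thr[k - 1] * phi(k, t) for k in range(1, d + 1))
```

The variable levels $r_k$ of Section 2.3.1 would simply replace the constant array `r` above.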
2.3 Oracle inequalities
Define, for each $1 \leq k \leq d$,
$$\mu_k(r_k) = \mu_k\,\mathbf{1}\{|\mu_k| > r_k\}$$
and write
$$f_{(r)}(t) = \sum_{k=1}^{d}\mu_k(r_k)\,\phi_k(t) = \sum_{k=1}^{d}\mu_k\,\phi_k(t)\,\mathbf{1}\{|\mu_k| > r_k\}.$$
The function f(r) can be regarded as a sparse approximation of f relative to a given basis;
of course, since the function f is unknown so is its sparse approximation. In this section we
show that the truncated estimators introduced above, constructed without any prior knowl-
edge of such sparse representations, mimic the bias-variance decomposition of estimators
that would use such information in their construction. Therefore our estimates adapt to
the unknown sparsity of f and we call the corresponding results sparsity oracle inequalities.
They permit us to determine, as a consequence, the rates of convergence of our estimators.
We discuss them in detail in the next section.
We begin by establishing Theorem 1, which is an oracle inequality for the hard threshold
estimator. The result holds on the event
$$\Omega_n = \bigcap_{k=1}^{d}\big\{|\hat\mu_k - \mu_k| \leq r_k\big\} \qquad (6)$$
and is valid for any given threshold levels rk. This clearly shows what will drive the choice of
the threshold levels rk: they have to be chosen such that Ωn holds with probability arbitrarily
close to one. In Proposition 2 below we propose levels $r_k$ for which $\liminf_{n\to\infty} P(\Omega_n) \geq 1 - \alpha$, for any given $0 < \alpha < 1$. In particular, for $\alpha = 1/n$, this guarantees that $\lim_{n\to\infty} P(\Omega_n) = 1$.
Theorem 1. For all $d \geq 1$, on the event $\Omega_n$,
$$\|\hat f_{(2r)} - f\| \leq \|f - f_{(r)}\| + 3\sqrt{\sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\mu_k| > r_k\}}.$$
Moreover, for $d \leq n$, we have for some finite constant $C$,
$$E\big[\|\hat f_{(2r)} - f\|\big] \leq \|f - f_{(r)}\| + 3\sqrt{\sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\mu_k| > r_k\}} + C\sqrt{1 - P(\Omega_n)}.$$
Proof. We first observe that
$$\|\hat f_{(2r)} - f_{(r)}\|^2 = \sum_{k=1}^{d}\big(\hat\mu_k(2r_k) - \mu_k(r_k)\big)^2.$$
For the first claim, it suffices to show that on the event $\Omega_n$,
$$|\hat\mu_k(2r_k) - \mu_k(r_k)| \leq 3 r_k\,\mathbf{1}\{|\mu_k| > r_k\} \qquad (7)$$
holds for all $1 \leq k \leq d$ and any $d \geq 1$, since this coefficient-level bound implies
$$\|\hat f_{(2r)} - f_{(r)}\| \leq 3\sqrt{\sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\mu_k| > r_k\}},$$
and the claim of Theorem 1 follows from the triangle inequality. We now prove (7). Indeed, on the event $\Omega_n$,
$$\begin{aligned}
|\hat\mu_k(2r_k) - \mu_k(r_k)| &= \big|\hat\mu_k\,\mathbf{1}\{|\hat\mu_k| > 2r_k\} - \mu_k\,\mathbf{1}\{|\mu_k| > r_k\}\big| \\
&\leq |\hat\mu_k - \mu_k|\,\mathbf{1}\{|\mu_k| > r_k\} + |\hat\mu_k|\,\big|\mathbf{1}\{|\hat\mu_k| > 2r_k\} - \mathbf{1}\{|\mu_k| > r_k\}\big| \\
&\leq r_k\,\mathbf{1}\{|\mu_k| > r_k\} + |\hat\mu_k|\,\big|\mathbf{1}\{|\hat\mu_k| > 2r_k\} - \mathbf{1}\{|\mu_k| > r_k\}\big|.
\end{aligned}$$
We consider two cases, $|\mu_k| \leq r_k$ and $|\mu_k| > r_k$, for the second term on the right and obtain the bound
$$|\hat\mu_k|\,\big|\mathbf{1}\{|\hat\mu_k| > 2r_k\} - \mathbf{1}\{|\mu_k| > r_k\}\big| = |\hat\mu_k|\,\mathbf{1}\{|\hat\mu_k| > 2r_k\}\mathbf{1}\{|\mu_k| \leq r_k\} + |\hat\mu_k|\,\mathbf{1}\{|\hat\mu_k| \leq 2r_k\}\mathbf{1}\{|\mu_k| > r_k\} \leq 2r_k\,\mathbf{1}\{|\mu_k| > r_k\}.$$
The last inequality follows from the fact that, on $\Omega_n$, $|\mu_k| \leq r_k$ implies $|\hat\mu_k| \leq 2r_k$. Combining the two preceding bounds yields (7).
For the second claim, we note that
$$E\big[\|\hat f_{(2r)} - f\|\big] = E\big[\|\hat f_{(2r)} - f\|\mathbf{1}_{\Omega_n}\big] + E\big[\|\hat f_{(2r)} - f\|\mathbf{1}_{\Omega_n^c}\big] \leq \|f - f_{(r)}\| + 3\sqrt{\sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\mu_k| > r_k\}} + E\big[\|\hat f_{(2r)} - f\|\mathbf{1}_{\Omega_n^c}\big],$$
using the first claim. Then we observe that $\mathrm{Var}(\hat\mu_k) = \tau_k^2/n$ with $\tau_k^2$ defined in (8) below. It is easy to see that $\tau_k^2 \leq \|\Gamma\|_\infty + (1/m)\|\Gamma\|_\infty + \|f\|_\infty^2 + \sigma_\varepsilon^2 \leq C$ for some $C > 0$. By the Cauchy-Schwarz inequality and the fact that
$$\begin{aligned}
E\big[\|\hat f_{(2r)} - f\|^2\big] &\leq 2E\|\hat f_{(2r)}\|^2 + 2\|f\|^2 \\
&\leq 2E\Big[\sum_{k=1}^{d}\hat\mu_k^2\Big] + 2\|f\|^2 \\
&\leq 2\sum_{k=1}^{d}\frac{\tau_k^2}{n} + 2\sum_{k=1}^{d}\mu_k^2 + 2\|f\|^2 \\
&\leq \frac{2d}{n}C + 2\sum_{k=1}^{d}\mu_k^2 + 2\|f\|^2 \leq 2C + 4\|f\|^2,
\end{aligned}$$
we obtain
$$E\big[\|\hat f_{(2r)} - f\|\mathbf{1}_{\Omega_n^c}\big] \leq \sqrt{E\big[\|\hat f_{(2r)} - f\|^2\big]}\,\sqrt{P(\Omega_n^c)} \leq \sqrt{2C + 4\|f\|^2}\,\sqrt{P(\Omega_n^c)},$$
and the second claim follows.
Theorem 1 establishes a novel type of oracle inequality for thresholded estimators in a functional
data context, and extends similar results established in the more traditional sequence model,
non-parametric regression and density estimation settings, see, for instance, Donoho and
Johnstone (1995, 1998), Donoho et al. (1996), Wasserman (2006), Tsybakov (2009) and the
references therein.
2.3.1 Choice of the threshold levels.
In what follows we propose threshold levels rk that guarantee that the event Ωn, defined
in (6) above, holds with any pre-specified probability level 1 − α of interest. A calculation
shows that $\hat\mu_k$ has mean $E[\hat\mu_k] = \mu_k$ and variance $\mathrm{Var}(\hat\mu_k) = n^{-1}\tau_k^2$ with
$$\tau_k^2 = \frac{m-1}{m}\gamma_k^2 + \frac{1}{m}\left\{\int_0^1 \Gamma(t,t)\phi_k^2(t)\,dt + \int_0^1 f^2(t)\phi_k^2(t)\,dt - \mu_k^2 + \sigma_\varepsilon^2\right\}. \qquad (8)$$
Here
$$\gamma_k^2 = \int_0^1\!\!\int_0^1 \phi_k(s)\,\Gamma(s,t)\,\phi_k(t)\,ds\,dt \qquad (9)$$
is often the leading term. Let $z_{\alpha/(2d)}$ be the upper $\alpha/(2d)$ quantile of the $N(0,1)$ distribution. We denote $z_{\alpha/(2d)}^2$ by $\rho_n$ and note, for future reference, that $\rho_n = O(\ln(d/\alpha))$. Set
$$r_k = \frac{z_{\alpha/(2d)}}{\sqrt n}\left(\tau_k + \frac{c_0}{\sqrt m}\right), \qquad (10)$$
for some $c_0$ large enough.
Proposition 2. Let $r_k$ be as in (10) above with $c_0$ sufficiently large. Then, under Assumptions 1–5,
$$P(\Omega_n) \geq 1 - \alpha - d e^{-cn} - \frac{2\alpha^2}{d}$$
for some constant $c > 0$. Consequently, if $d \to \infty$ and $d e^{-cn} \to 0$ as $n \to \infty$,
$$\liminf_{n\to\infty} P(\Omega_n) \geq 1 - \alpha.$$
For clarity of exposition, in the proof below and for the rest of the paper we will use the symbol $\lesssim$ to denote an inequality that holds up to multiplicative positive constants.
Proof. First we decompose
$$\hat\mu_k - \mu_k = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{m}\sum_{j=1}^{m}\big\{Z_i(t_{ij}) + \varepsilon_{ij}\big\}\phi_k(t_{ij}) + \frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m} f(t_{ij})\phi_k(t_{ij}) - \mu_k =: \hat\mu_k^{(1)} + \hat\mu_k^{(2)} - \mu_k.$$
By the definition of $\Omega_n$ and $r_k$ we have
$$\begin{aligned}
P\{\Omega_n^c\} &\leq \sum_{k=1}^{d} P\big\{|\hat\mu_k - \mu_k| > r_k\big\} \qquad (11) \\
&\leq \sum_{k=1}^{d} P\Big\{\sqrt n\,|\hat\mu_k^{(1)}| > z_{\alpha/(2d)}\Big(\tau_k + \frac{c_0}{2\sqrt m}\Big)\Big\} + \sum_{k=1}^{d} P\Big\{\sqrt{nm}\,|\hat\mu_k^{(2)} - \mu_k| > \frac{c_0}{2}\, z_{\alpha/(2d)}\Big\} \\
&=: (I) + (II).
\end{aligned}$$
We bound each term separately, starting with (I). Let
$$\hat V_k^2 := \frac{1}{n}\sum_{i=1}^{n}\frac{1}{m^2}\sum_{j=1}^{m}\sum_{j'=1}^{m}\phi_k(t_{ij})\,\Gamma(t_{ij}, t_{ij'})\,\phi_k(t_{ij'}) + \sigma_\varepsilon^2\,\frac{1}{nm^2}\sum_{i=1}^{n}\sum_{j=1}^{m}\phi_k^2(t_{ij}),$$
set $V_k^2 := E[\hat V_k^2] \leq \tau_k^2$ and introduce the event $A_k := \{\hat V_k \leq V_k + c_0/(2m^{1/2})\}$. Then
$$\begin{aligned}
(I) &= \sum_{k=1}^{d} P\Big\{|\hat\mu_k^{(1)}| > \frac{z_{\alpha/(2d)}}{\sqrt n}\Big(\tau_k + \frac{c_0}{2\sqrt m}\Big),\ A_k\Big\} + \sum_{k=1}^{d} P\Big\{|\hat\mu_k^{(1)}| > \frac{z_{\alpha/(2d)}}{\sqrt n}\Big(\tau_k + \frac{c_0}{2\sqrt m}\Big),\ A_k^c\Big\} \\
&\leq \sum_{k=1}^{d} P(A_k^c) + \sum_{k=1}^{d} E\Big[P\big\{\sqrt n\,|\hat\mu_k^{(1)}| > z_{\alpha/(2d)}\hat V_k \,\big|\, (t_{ij})_{i,j}\big\}\Big].
\end{aligned}$$
To bound the first sum above, we observe that, by Assumptions 1 and 5, $|\phi_k^2\Gamma| \leq C$ and $|\phi_k^2| \leq C$ for some constant $C > 0$. Then, using Hoeffding's inequality for sums of bounded independent random variables, we obtain, for some bounded constant $C' > 0$,
$$\sum_{k=1}^{d} P(A_k^c) \leq \sum_{k=1}^{d} P\Big\{V_k^2 + \frac{c_0^2}{4m} < \hat V_k^2\Big\} \leq d\exp\Big(-\frac{n c_0^4}{C'}\Big). \qquad (12)$$
For the second term in the display above, we observe that $\hat\mu_k^{(1)}$, conditionally on the times $t_{ij}$, is Gaussian with mean $0$ and variance equal to $\hat V_k^2/n$. Therefore
$$\sum_{k=1}^{d} E\Big[P\big\{\sqrt n\,|\hat\mu_k^{(1)}| > z_{\alpha/(2d)}\hat V_k \,\big|\, (t_{ij})_{i,j}\big\}\Big] = \alpha. \qquad (13)$$
To control (II) in (11) we use Hoeffding's inequality again to obtain, for some bounded constant $c_1$,
$$\sum_{k=1}^{d} P\Big\{\sqrt{nm}\,|\hat\mu_k^{(2)} - \mu_k| > \frac{c_0 z_{\alpha/(2d)}}{2}\Big\} \leq 2d\exp\big\{-c_1 c_0^2 \rho_n\big\}.$$
Hence, for $c_0$ large enough,
$$\sum_{k=1}^{d} P\Big\{\sqrt{nm}\,|\hat\mu_k^{(2)} - \mu_k| > \frac{c_0 z_{\alpha/(2d)}}{2}\Big\} \leq \frac{2\alpha^t}{d^{t-1}} \qquad (14)$$
for any $t \geq 1$. Collecting (12), (13) and (14) with $t = 2$ we obtain the result.
Remarks. Assumptions 1, 4 and 5 may be relaxed, but at the price of additional technical-
ities and restrictions on m that would clutter the presentation.
The term $c_0 m^{-1/2}$ in (10) is technical and generally smaller than $\tau_k$. We propose the following practical choice for $r_k$. Since $\hat\mu_k = n^{-1}\sum_{i=1}^{n}\hat\mu_{i,k}$ is the average of i.i.d. random variables, we expect that it is approximately normal. Motivated by the central limit theorem, we use in our simulations and data analysis the choice
$$r_k = \frac{S_k}{\sqrt n}\, z_{\alpha/(2d)},$$
based on the sample variances $S_k^2 = (n-1)^{-1}\sum_{i=1}^{n}(\hat\mu_{i,k} - \hat\mu_k)^2$. Indeed, by Bonferroni's bound,
$$P(\Omega_n^c) \leq \sum_{k=1}^{d} P\big\{|\hat\mu_k - \mu_k| > r_k\big\} \approx \sum_{k=1}^{d} P\big\{|N(0,1)| > z_{\alpha/(2d)}\big\} = \alpha.$$
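In code, this practical data-driven choice amounts to a few lines. The function name and the toy inputs below are ours, purely for illustration:

```python
import numpy as np
from statistics import NormalDist

def thresholds(mu_ik, alpha):
    """Levels r_k = (S_k / sqrt(n)) * z_{alpha/(2d)}, computed from the
    per-curve coefficient estimates mu_ik, an (n, d) array as in (5)."""
    n, d = mu_ik.shape
    S = mu_ik.std(axis=0, ddof=1)                   # sample standard deviations S_k
    z = NormalDist().inv_cdf(1 - alpha / (2 * d))   # upper alpha/(2d) quantile
    return (S / np.sqrt(n)) * z

# Toy check: the levels shrink like 1/sqrt(n) as more curves are observed
rng = np.random.default_rng(1)
r_50 = thresholds(rng.normal(size=(50, 10)), alpha=0.05)
r_5000 = thresholds(rng.normal(size=(5000, 10)), alpha=0.05)
```

The Bonferroni correction enters only through the quantile level $1 - \alpha/(2d)$, so the cost of a larger basis $d$ is logarithmic in the threshold size.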
We summarize the results of this section in the following corollary.
Corollary 3. For any $0 < \alpha < 1$, set $r_k$ as in (10) above. Under Assumptions 1–5,
$$\|\hat f_{(2r)} - f\| \leq \|f - f_{(r)}\| + 3\sqrt{\sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\mu_k| > r_k\}}$$
holds with probability larger than $1 - \alpha$, as $n \to \infty$.
Moreover, for $\alpha = 1/n$ and $d \leq n$, we have, for some constant $C < \infty$,
$$E\big[\|\hat f_{(2r)} - f\|\big] \leq \|f - f_{(r)}\| + 3\sqrt{\sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\mu_k| > r_k\}} + \frac{C}{\sqrt n}.$$
We establish the asymptotic implications of these results and the rates of convergence of our
variable threshold estimator in the next section. The advantages of the variable threshold, which differs across the estimated coefficients $\hat\mu_k$, over (a) no threshold at all and (b) a constant threshold for all $k$, will be discussed in detail below.
2.4 Rates of convergence
We begin by discussing classes of functions over which we derive the rates of convergence
of our estimator. Recall that we have assumed that the mean function $f \in L_2$; therefore $\|f\|^2 = \sum_k \mu_k^2 < \infty$ and so $\mu_k \to 0$ as $k \to \infty$. We introduce a parameter $\beta > 0$ that governs this decay to zero of the coefficients $\mu_k$ for $k$ large enough, say $k \geq K$, for some $K \geq 1$. We consider bases $\{\phi_k\}_k$ and classes of functions $f$ that satisfy
$$\mu_k^2 \lesssim k^{-2\beta-1} \qquad (15)$$
for all $k \geq K$. The classes of functions that satisfy (15) are quite general. For instance, consider the trigonometric basis $\{\phi_k\}_k$ and the generalized Sobolev classes of the form
$$W^\beta(R) = \big\{f : [0,1] \to \mathbb{R} \,:\, f(0) = f(1),\ \|f^{(\beta)}\| \leq R\big\},$$
with smoothness index $\beta > 0$ and for some $R > 0$. Notice that for $\beta > 1/2$ this class consists of continuous functions; see, e.g., Tsybakov (2009) for details. Then $f \in W^\beta(R)$ if and only if
$$\sum_{k=1}^{\infty} k^{2\beta}\big(\mu_{2k}^2 + \mu_{2k+1}^2\big) \leq R^2.$$
Consequently, $\mu_k^2 = o(k^{-2\beta-1})$ as $k \to \infty$, so that (15) holds. More generally, it can be shown that balls in Besov spaces can be written as $\ell_p$ bodies when expanded in suitable wavelet bases or in spaces generated by piecewise polynomials based on regular or irregular partitions; see, for instance, DeVore and Lorentz (1996) for further details.
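A quick numerical illustration of (15), with a construction of our own: take Fourier coefficients $\mu_j = j^{-2}$, so that $\mu_j^2 = j^{-4}$, i.e. (15) holds with $\beta = 3/2$, and verify that projection onto the basis recovers exactly this decay:

```python
import numpy as np

# Build f with known coefficients mu_j = j^{-2} in the sine part of the
# trigonometric basis (so mu_j^2 = j^{-4}, i.e. (15) with beta = 3/2).
t = np.linspace(0, 1, 4096, endpoint=False)
J = np.arange(1, 40)
f = sum(j ** -2.0 * np.sqrt(2) * np.sin(2 * np.pi * j * t) for j in J)

# Recover the coefficients by numerical projection (a Riemann sum, which is
# exact for trigonometric polynomials on this uniform grid)
mu = np.array([np.mean(f * np.sqrt(2) * np.sin(2 * np.pi * j * t)) for j in J])
decay = mu ** 2 * J ** 4.0          # mu_k^2 * k^{2*beta+1}: should be ~1 for every j
```

The array `decay` sits at $1$ for every frequency, confirming the polynomial decay that drives the rates of the next section.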
We first state an intermediate result.
Proposition 4. On $\Omega_n$, we have
$$\|\hat f_{(2r)} - f\|^2 \leq 18\left(\sum_{k=N+1}^{\infty}\mu_k^2 + \sum_{k=1}^{N} r_k^2\right) \qquad (16)$$
for all $N \leq d$. Consequently, for all $f \in W^\beta(R)$, on the event $\Omega_n$,
$$\|\hat f_{(2r)} - f\|^2 \lesssim N^{-2\beta} + \sum_{k=1}^{N} r_k^2. \qquad (17)$$
Proof. Using Theorem 1 and the inequality $(a+b)^2 \leq 2a^2 + 2b^2$, we find
$$\|\hat f_{(2r)} - f\|^2 \leq 2\|f_{(r)} - f\|^2 + 18\sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\mu_k| > r_k\}.$$
Since
$$\|f_{(r)} - f\|^2 = \sum_{k=1}^{d}\mu_k^2\,\mathbf{1}\{|\mu_k| \leq r_k\} + \sum_{k=d+1}^{\infty}\mu_k^2,$$
we obtain
$$\begin{aligned}
\|\hat f_{(2r)} - f\|^2 &\leq 2\sum_{k=1}^{d}\mu_k^2\,\mathbf{1}\{|\mu_k| \leq r_k\} + 2\sum_{k=d+1}^{\infty}\mu_k^2 + 18\sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\mu_k| > r_k\} \\
&\leq 18\left\{\sum_{k=d+1}^{\infty}\mu_k^2 + \sum_{k=1}^{d}\min(r_k^2, \mu_k^2)\right\} \\
&\leq 18\left\{\sum_{k=N+1}^{\infty}\mu_k^2 + \sum_{k=1}^{N}\min(r_k^2, \mu_k^2)\right\},
\end{aligned}$$
which proves the first claim. The second claim follows immediately from the first claim and (15).
From this proposition, it follows that the other important ingredient in establishing the rate
of convergence of our estimates is the size of the threshold levels rk. We discuss this below.
For the stochastic process $Z(t)$ and any basis that is orthonormal in $L_2$, the quantity $\gamma_k^2$ is bounded by $\lambda_1$, the largest eigenvalue of the covariance kernel $\Gamma$. Under Assumption 1 on the kernel, and from the remark following it, we see that $\lambda_1 \leq \sum_k \lambda_k < \infty$. Since $\Gamma \geq 0$, we therefore always have $0 \leq \gamma_k^2 < \infty$ for all $k$, where we recall that $\gamma_k^2$ is the term of the variance component defined by (9) above. In what follows we further assume that
$$\gamma_k^2 \lesssim k^{-\delta} \qquad (18)$$
for all k ≥ K and for some finite, positive constants K ≥ 1 and δ ≥ 0. Just as β is the
smoothness parameter of the function f , we view δ as the smoothness parameter of Γ or the
regularity index of the process Z.
If the basis $\{\phi_k\}_k$ happens to be the collection of eigenfunctions of the covariance kernel $\Gamma$, then $\gamma_k^2 = \lambda_k$, and from the considerations above $\gamma_k^2 \to 0$ as $k \to \infty$. This is the case, for instance, for the Brownian bridge or Brownian motion processes and the Fourier basis. However, in practice $\Gamma$ is unknown, and one cannot guarantee the same type of decay for the quantities $\gamma_k^2$.
Notice that our condition (18) is not restrictive, as we allow δ to equal zero, which translates
into re-stating what we already observed above, namely that for the type of processes that
we consider the values γ2k are bounded.
In our analysis, two cases naturally arise: $0 \leq \delta \leq 1$ and $\delta > 1$. We notice that for $\delta > 1$ the series $\sum_{k=1}^{d}\gamma_k^2$ converges. In this case, the mean squared error of the least squares estimator $\hat f$ is
$$E\big[\|\hat f - f\|^2\big] = \sum_{k=1}^{d}\mathrm{Var}(\hat\mu_k) + \sum_{k=d+1}^{\infty}\mu_k^2 \lesssim \frac{1}{n}\sum_{k=1}^{d} k^{-\delta} + \frac{d}{nm} + d^{-2\beta} \lesssim \frac{1}{n} + \frac{d}{nm} + d^{-2\beta}. \qquad (19)$$
The right-hand side is minimized by the non-adaptive choice $d = O\big((nm)^{1/(2\beta+1)}\big)$ and yields the rate
$$E\big[\|\hat f - f\|^2\big] \lesssim \frac{1}{n} + \Big(\frac{1}{nm}\Big)^{2\beta/(2\beta+1)}. \qquad (20)$$
For perfectly observed n trajectories Xi(t), at all points t and without error, the rate of
estimating f would be the parametric n−1/2 rate. It is interesting to see, from the display
above, that the same is possible in our context, if enough points per curve are observed.
Specifically, the first term on the right, of order O(1/n), dominates the rate, as soon as
m ≥ n1/2, for all β > 1. For smaller values of m relative to n, one reverts back to the slower
non-parametric minimax rate, which corresponds to estimating a function f from mn data
points. We also note that a direct calculation based on (19) shows that the mean squared error of the least squares estimator $\hat f$ is of order $O(1/n)$ for all $\beta > 1$, as long as $m \geq n^{1/2}$ and we choose $d$ such that $n^{1/2} \leq d \leq m$.
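The choice of $d$ behind (20) is the routine bias-variance balance; written out:

```latex
\frac{d}{nm} \asymp d^{-2\beta}
\;\Longleftrightarrow\; d^{2\beta+1} \asymp nm
\;\Longleftrightarrow\; d \asymp (nm)^{1/(2\beta+1)},
\quad\text{so that}\quad
\frac{d}{nm} + d^{-2\beta} \asymp \Big(\frac{1}{nm}\Big)^{2\beta/(2\beta+1)} .
```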
The following result indicates that our threshold estimator f(2r) has a similar performance,
with the rate in (20) obtained adaptively over β ≥ 1.
Theorem 5. Let Assumptions 1–5 hold. For any $f \in W^\beta(R)$ with $\beta \geq 1$, $\delta > 1$ in (18), and $d \geq (mn/\rho_n)^{1/3}$, we have
$$\lim_{n\to\infty} P\Big\{\|\hat f_{(2r)} - f\|^2 \lesssim \frac{\rho_n}{n} + \Big(\frac{\rho_n}{mn}\Big)^{2\beta/(2\beta+1)}\Big\} \geq 1 - \alpha,$$
for any $0 < \alpha < 1$. Moreover, if $m \geq (n/\rho_n)^{1/2}$, then
$$\lim_{n\to\infty} P\Big\{\|\hat f_{(2r)} - f\|^2 \lesssim \frac{\rho_n}{n}\Big\} \geq 1 - \alpha,$$
where $\rho_n = z_{\alpha/(2d)}^2 = O(\ln(d/\alpha))$.
Proof. By Proposition 4 we have, on $\Omega_n$,
$$\begin{aligned}
\|\hat f_{(2r)} - f\|^2 &\lesssim \inf_{1\leq N\leq d}\Big\{\Big(\frac{1}{N}\Big)^{2\beta} + \frac{\rho_n}{n}\sum_k k^{-\delta} + \frac{N\rho_n}{nm}\Big\} \\
&\lesssim \frac{\rho_n}{n} + \inf_{1\leq N\leq d}\Big\{\Big(\frac{1}{N}\Big)^{2\beta} + \frac{N\rho_n}{nm}\Big\} \qquad \text{since } \sum_k k^{-\delta} < \infty \\
&= \frac{\rho_n}{n} + \Big(\frac{\rho_n}{mn}\Big)^{2\beta/(2\beta+1)}. \qquad (21)
\end{aligned}$$
This rate is obtained for $N = (mn/\rho_n)^{1/(2\beta+1)} \leq (mn/\rho_n)^{1/3} \leq d$, for all $\beta \geq 1$. The second claim follows immediately from (21) for these values of $m$. Finally, invoke Proposition 2 to complete the proof.
Remarks.
1. The rate of our adaptive estimator differs from that of the least squares estimator only
by a ρn factor, which is a very small price to pay for adaptation.
2. The condition $d \geq (mn/\rho_n)^{1/3}$ together with the condition $m \geq (n/\rho_n)^{1/2}$ imply that $d \geq (n/\rho_n)^{1/2}$; this is the lower bound on $d$ for which our threshold estimator achieves the parametric rate. Notice further that in the regime $m \geq (n/\rho_n)^{1/2}$ the value of $N$ giving the optimal rate is bounded by $m$, which thus provides a natural upper bound for $d$. Therefore, our estimator has rate $O(\rho_n/n)$ if we use $(n/\rho_n)^{1/2} \leq d \leq m$.
For $0 \leq \delta \leq 1$, we first notice that the first term on the right-hand side of (19) is $1/n$ times the partial sum of a divergent series; therefore the mean squared error of the least squares estimator $\hat f$ no longer achieves the $O(1/n)$ rate in this case. The following theorem shows that over classes of functions and processes that satisfy (15) and (18), respectively, the rate of convergence of $\|\hat f_{(2r)} - f\|^2$ is
$$\psi_n = \Big(\frac{\rho_n}{n}\Big)^{2\beta/(2\beta+1-\delta)}, \qquad (22)$$
for appropriately chosen d. This rate is markedly better than the rate that can be achieved
by the least squares estimator f : for δ = 1 it is of O(ρn/n) and in general it adapts to the
unknown δ.
Theorem 6. Let Assumptions 1–5 hold and assume that (15) and (18) hold with $2\beta > 1 + \delta$ and $0 \leq \delta \leq 1$. Then, for $\sqrt{n/\rho_n} \leq d \leq m$,
$$\lim_{n\to\infty} P\big\{\|\hat f_{(2r)} - f\|^2 \lesssim \psi_n\big\} \geq 1 - \alpha. \qquad (23)$$
Proof. By Proposition 4,
$$\begin{aligned}
\|\hat f_{(2r)} - f\|^2 &\lesssim \inf_{1\leq N\leq d}\Big\{N^{-2\beta} + \frac{\rho_n}{n}\sum_{k=1}^{N} k^{-\delta} + \frac{N\rho_n}{nm}\Big\} \\
&\lesssim \inf_{1\leq N\leq d}\Big\{N^{-2\beta} + \frac{\rho_n}{n}N^{1-\delta} + \frac{N\rho_n}{nm}\Big\} \\
&\lesssim \inf_{1\leq N\leq d}\Big\{N^{-2\beta} + \frac{\rho_n}{n}N^{1-\delta}\Big\},
\end{aligned}$$
since for $N \leq d \leq m$ we have $N^{1-\delta} \geq N/m$ for all $0 \leq \delta \leq 1$. The infimum is achieved for $N = (n/\rho_n)^{1/(2\beta+1-\delta)}$, which is less than $d$ since $2\beta + 1 - \delta > 2$; apply Proposition 2 to conclude the proof.
We now compare our estimator $\hat f_{(2r)}$, which is constructed with the variable thresholds (10), with a threshold estimator that uses the fixed thresholds $r_k = B n^{-1/2}\rho_n^{1/2}$, for all $k$ and some constant $B > 0$.
Theorem 7. If Assumptions 1–5 and (15) with $\beta > 1/2$ hold, then the hard threshold estimator with constant threshold $r_k = B n^{-1/2}\rho_n^{1/2}$, for all $k$, satisfies
$$\lim_{n\to\infty} P\Big\{\|\hat f_{(2r)} - f\|^2 \lesssim \Big(\frac{\rho_n}{n}\Big)^{2\beta/(2\beta+1)}\Big\} \geq 1 - \alpha. \qquad (24)$$
Proof. Again, by Proposition 4,
$$\|\hat f_{(2r)} - f\|^2 \lesssim \sum_{k=N+1}^{\infty}\mu_k^2 + \sum_{k=1}^{N} r_k^2 \lesssim N^{-2\beta} + N\rho_n/n$$
for all $N \leq d$. Optimizing the right-hand side over $N$ leads to $N \asymp (n/\rho_n)^{1/(2\beta+1)}$. Finally, notice that the threshold $B n^{-1/2}\rho_n^{1/2}$ is larger than (10) for $B$ large enough. Inspection of the proof of Proposition 2 yields that $\limsup_{n\to\infty} P(\Omega_n^c) \leq \alpha$, and the claim follows.
Summarizing, we see that our threshold estimator can attain fast rates of convergence without prior knowledge of $\delta$. The case $\delta = 0$ provides a clear illustration. Recall that taking $\delta = 0$ in condition (18) means that we impose no restrictions on the process $Z$ other than the standard regularity properties stated in Assumption 1 above. In this case Theorem
6 guarantees net improvements over the least squares estimators: the rate of convergence
of the hard threshold estimator with variable thresholds (10) is equal to n−β/(2β+1), up to
a ln(n) factor; a similar rate can also be obtained by the fixed threshold estimator. If
0 < δ ≤ 1, which corresponds to mild assumptions on the process, the variable threshold
estimator has rate n−β/(2β+1−δ), up to logarithmic factors. Thus, it continues to outperform
the least squares estimator. In this regime it also outperforms the fixed threshold estimator,
with rate n−β/(2β+1) given by Theorem 7. Therefore, the variable threshold estimators, con-
structed without knowing δ, adapt to the unknown regularity index of the process Z(t), and
are expected to have better accuracy, quantified by the faster rates of convergence derived
above.
2.5 Confidence sets
In this section we propose confidence sets (balls, intervals and bands) for the mean function
f(t). Our first result shows that the confidence ball $B(\hat f_{(2r)}, \hat\rho_{(2r)})$, centered at the hard threshold estimator $\hat f_{(2r)}$ and with radius slightly larger than
$$\hat\rho_{(2r)} = \sqrt{\sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\hat\mu_k| > 2r_k\}}, \qquad (25)$$
has asymptotic coverage $1 - \alpha$ for a large class of functions $f$. First we obtain an intermediate
result.
Proposition 8. For all $d \geq 1$ and $0 < \alpha < 1$,
$$\|\hat f_{(2r)} - f\| \leq \sqrt{\sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\hat\mu_k| > 2r_k\}} + \sqrt{\sum_{k=1}^{d}\mu_k^2\,\mathbf{1}\{|\mu_k| \leq 3r_k\}} + \sqrt{\sum_{k=d+1}^{\infty}\mu_k^2}$$
on $\Omega_n$.
Proof. We first notice that, for $1 \leq k \leq d$,
$$|\hat\mu_k(2r_k) - \mu_k| \leq |\hat\mu_k - \mu_k|\,\mathbf{1}\{|\hat\mu_k| > 2r_k\} + |\mu_k|\,\mathbf{1}\{|\hat\mu_k| \leq 2r_k\}$$
and
$$\|\hat f_{(2r)} - f\|^2 = \sum_{k=1}^{d}\big(\hat\mu_k(2r_k) - \mu_k\big)^2 + \sum_{k=d+1}^{\infty}\mu_k^2.$$
Consequently, on the event $\Omega_n$,
$$\begin{aligned}
\|\hat f_{(2r)} - f\| &\leq \sqrt{\sum_{k=1}^{d}(\hat\mu_k - \mu_k)^2\,\mathbf{1}\{|\hat\mu_k| > 2r_k\}} + \sqrt{\sum_{k=1}^{d}\mu_k^2\,\mathbf{1}\{|\hat\mu_k| \leq 2r_k\}} + \sqrt{\sum_{k=d+1}^{\infty}\mu_k^2} \\
&\leq \sqrt{\sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\hat\mu_k| > 2r_k\}} + \sqrt{\sum_{k=1}^{d}\mu_k^2\,\mathbf{1}\{|\mu_k| \leq 3r_k\}} + \sqrt{\sum_{k=d+1}^{\infty}\mu_k^2},
\end{aligned}$$
and our claim follows.
The radius $\hat\rho_{(2r)}$ dominates the bound in Proposition 8, on the event $\Omega_n$, if
$$\lim_{n\to\infty}\frac{\sum_{k=1}^{d}\mu_k^2\,\mathbf{1}\{|\mu_k| \leq 3r_k\} + \sum_{k=d+1}^{\infty}\mu_k^2}{\sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\mu_k| > 3r_k\}} = 0. \qquad (26)$$
Consequently, we have the following result:
Theorem 9. Assume that Assumptions 1–5 hold. Fix $d \geq 1$ and $0 < \alpha < 1$. Provided (26) holds, we have, for all $s > 1$,
$$\liminf_{n\to\infty} P\big\{f \in B\big(\hat f_{(2r)},\ s\cdot\hat\rho_{(2r)}\big)\big\} \geq 1 - \alpha.$$
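Computing the ball is immediate once the coefficient estimates and threshold levels are available; a sketch, with toy input values of our own choosing:

```python
import numpy as np

def ball_radius(mu_hat, r, s=1.1):
    """Radius s * rho_(2r) from (25): only coefficients surviving the 2*r_k
    threshold contribute their r_k^2."""
    keep = np.abs(mu_hat) > 2 * r
    return s * np.sqrt(np.sum(r[keep] ** 2))

mu_hat = np.array([1.20, 0.45, 0.05, 0.01])   # estimated coefficients (toy values)
r = np.full(4, 0.1)                           # threshold levels r_k
rho = ball_radius(mu_hat, r, s=1.0)           # only k = 1, 2 exceed 2*r_k = 0.2
# A candidate g = sum_k nu_k phi_k lies in the ball iff its coefficients satisfy
# sum_k (nu_k - muhat_k(2r_k))^2 <= (s * rho)^2, by Parseval's identity.
```

Note that nothing here requires an estimate of the covariance operator $\Gamma$: the radius depends only on the $r_k$ and on which coefficients survive thresholding.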
We illustrate this result for functions $f = \sum_k \mu_k\phi_k$ with
$$c_1 \leq \mu_k^2 k^{2\beta+1} \leq C_1, \quad k \to \infty, \qquad (27)$$
and covariance functions $\Gamma(s,t)$ with
$$c_2 \leq k^\delta \gamma_k^2 \leq C_2, \quad k \to \infty, \qquad (28)$$
similar to our discussion in Section 2.4. The corresponding set of functions satisfying (27) is
denoted by Fβ(c1, C1). The upper bound in (27) is already used in (15). The lower bound in
(27) is crucial for our analysis. It is similar to Condition Hs(M,x0) introduced in Picard and
Tribouley (2000, page 307) for their analysis of wavelet based estimators in non-parametric
regression models. Different types of lower bounds are possible, and we refer the reader to the recent work of Giné and Nickl (2010, Section 3.5) on nonparametric density estimation.
In what follows we use the notation $a \gg b$ whenever $a$ is of larger order than $b$.
Corollary 10. Let Assumptions 1–5 hold and fix $0 < \alpha < 1$. Let $m > d \gg n^{1/2}$ and $\delta \geq 1$ in (28). Then, for all $s > 1$,
$$\liminf_{n\to\infty}\ \inf_{f\in\cup_{\beta\geq 1} F_\beta(c_1,C_1)} P\big\{f \in B(\hat f_{(2r)},\ s\cdot\hat\rho_{(2r)})\big\} \geq 1 - \alpha.$$
Moreover,
$$\liminf_{n\to\infty}\ \inf_{f\in\cup_{\beta\geq 1} F_\beta(c_1,C_1)} P\big\{\hat\rho_{(2r)} \lesssim \ln(n)/n^{1/2}\big\} \geq 1 - \alpha.$$
Proof. In view of Theorem 9, it suffices to verify (26). We consider the two cases $\delta \geq 2\beta + 1$ and $1 \leq \delta < 2\beta + 1$ separately.
First, take $\delta \geq 2\beta + 1$. Then, for $n$ large enough, we have $9r_k^2 < \mu_k^2$ for all $k$. This implies that the numerator in (26) reduces to $\sum_{k=d+1}^{\infty}\mu_k^2 \lesssim d^{-2\beta} = o(1/n)$ for $d \gg n^{1/2}$ and $\beta \geq 1$. On the other hand, the denominator in (26) equals
$$\sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\mu_k| > 3r_k\} \asymp \frac{\rho_n}{n}\sum_{k=1}^{d} k^{-\delta} + \frac{d\ln(n)}{mn} \asymp \frac{\ln(n)}{n},$$
since $m > d$ and $\sum_k k^{-\delta} < \infty$ for $\delta \geq 2\beta + 1 > 1$. Consequently, the convergence in (26) holds.
Assume now $1 < \delta < 2\beta + 1$. Then $\mu_k^2 \leq 9r_k^2 \iff k \geq N = C(n/\ln(n))^{1/(2\beta+1-\delta)}$. Notice that $N \gg (n/\ln(n))^{1/2}$. Now the numerator in (26) is
$$\sum_{k=1}^{d}\mu_k^2\,\mathbf{1}\{|\mu_k| \leq 3r_k\} + \sum_{k=d+1}^{\infty}\mu_k^2 \lesssim d^{-2\beta} + N^{-2\beta} = o\Big(\frac{\ln n}{n}\Big),$$
since $d \gg n^{1/2}$ and $\beta \geq 1$. On the other hand, the denominator in (26) is larger than $c\ln(n)/n$ for some $c > 0$. Hence the ratio in (26) is asymptotically negligible as $n \to \infty$.
The second claim regarding the radius follows from the fact that, on the event $\Omega_n$,
$$\hat\rho_{(2r)}^2 = \sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\hat\mu_k| > 2r_k\} \leq \sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\mu_k| > r_k\}.$$
The term on the right is of order $\ln(n)/n$ by the calculations above.
For small values of $\delta$, $0 \leq \delta \leq 1$, which translate into larger values of $r_k$, the ratio between the bias and the radius in (26) is bounded, yet it does not necessarily converge to zero. Multiplying the radius by an additional $\ln^{1/2}(n)$ factor, we then obtain an asymptotic $1 - \alpha$ confidence ball for $f$.
Corollary 11. Let Assumptions 1–5 hold and fix $0 < \alpha < 1$. Let $m > d \gg n^{1/2}$ and $0 \leq \delta \leq 1$ in (28). Then
$$\liminf_{n\to\infty}\ \inf_{f\in\cup_{\beta\geq 1} F_\beta(c_1,C_1)} P\big\{f \in B\big(\hat f_{(2r)},\ \sqrt{\ln(n)}\,\hat\rho_{(2r)}\big)\big\} \geq 1 - \alpha.$$
Moreover,
$$\liminf_{n\to\infty}\ \inf_{f\in\cup_{\beta\geq 1} F_\beta(c_1,C_1)} P\big\{\hat\rho_{(2r)} \lesssim \{\ln(n)/n\}^{\beta/(2\beta+1-\delta)}\big\} \geq 1 - \alpha.$$
Proof. In view of Theorem 9, we need to show that
$$\lim_{n\to\infty}\frac{\sum_{k=1}^{d}\mu_k^2\,\mathbf{1}\{|\mu_k| \leq 3r_k\} + \sum_{k=d+1}^{\infty}\mu_k^2}{\ln(n)\sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\mu_k| > 3r_k\}} = 0 \qquad (29)$$
to prove the first claim. Observe that, for $0 \leq \delta \leq 1$, $\mu_k^2 \leq 9r_k^2 \iff k \geq N = C(n/\ln(n))^{1/(2\beta+1-\delta)}$. Consequently,
$$\sum_{k=1}^{d}\mu_k^2\,\mathbf{1}\{|\mu_k| \leq 3r_k\} + \sum_{k=d+1}^{\infty}\mu_k^2 \lesssim N^{-2\beta} = \Big(\frac{\ln(n)}{n}\Big)^{2\beta/(2\beta+1-\delta)}.$$
On the other hand,
$$\ln(n)\sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\mu_k| > 3r_k\} \geq c\ln(n)\Big(\frac{\ln(n)}{n}\Big)^{2\beta/(2\beta+1-\delta)}$$
for some $c > 0$, so that the ratio in (29) is asymptotically negligible as $n \to \infty$. The second claim follows from the inequality
$$\hat\rho_{(2r)} \leq \sqrt{\sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\mu_k| > r_k\}},$$
which holds on the event $\Omega_n$, and the bound $\sum_{k=1}^{d} r_k^2\,\mathbf{1}\{|\mu_k| > r_k\} \leq C\{\ln(n)/n\}^{2\beta/(2\beta+1-\delta)}$ for some $C < \infty$, which can be obtained by the arguments above.
Remark. The proposed confidence balls adapt to the smoothness parameters $\beta$ and $\delta$, and the results hold uniformly over $\mathcal{F}_\beta(c_1, C_1) \subset W^\beta$. Moreover, the simulation study presented in Section 3.3.1 shows that the confidence balls attain the prescribed coverage and have small radii in a variety of finite sample scenarios. Adaptive confidence balls in nonparametric regression are proposed and studied by, among others, Beran and Dümbgen (1998), Baraud (2004), Cai and Low (2006), Robins and van der Vaart (2006) and Davies, Kovac and Meise (2009). In all these works the confidence balls are uniform over large classes of functions, for instance $W^\beta$. The same ideas can be extended to our set-up, and we can obtain confidence balls over $W^\beta$ essentially by multiplying the radius by a factor $1 \vee n^{1-\delta}$. However, the drawback of the methods proposed for nonparametric regression, and of their possible extension to the functional data context, is clear: if one insists on confidence balls that are uniform over too general a class of functions, such as $W^\beta$, then the resulting balls will necessarily be too wide for practical use. For this reason we do not pursue such a construction here and omit further discussion.
Confidence balls are hard to visualize; confidence bands are more appropriate for this purpose. Analogous to Proposition 8, we have the following result.

Proposition 12. For all $d \ge 1$ and $0 < \alpha < 1$,
$$|\hat f(2r)(t) - f(t)| \le \sum_{k=1}^{d} r_k |\phi_k(t)|\, \mathbb{1}\{|\hat\mu_k| > 2 r_k\} + \sum_{k=1}^{d} |\mu_k \phi_k(t)|\, \mathbb{1}\{|\mu_k| \le 3 r_k\} + \sum_{k=d+1}^{\infty} |\mu_k \phi_k(t)|$$
holds, uniformly in $0 \le t \le 1$, on the event $\Omega_n$.
Proof. Since, for $1 \le k \le d$,
$$|\hat\mu_k(2r_k) - \mu_k| \le |\hat\mu_k - \mu_k|\, \mathbb{1}\{|\hat\mu_k| > 2 r_k\} + |\mu_k|\, \mathbb{1}\{|\hat\mu_k| \le 2 r_k\},$$
we have
$$|\hat f(2r)(t) - f(t)| \le \sum_{k=1}^{d} |(\hat\mu_k - \mu_k)\phi_k(t)|\, \mathbb{1}\{|\hat\mu_k| > 2 r_k\} + \sum_{k=1}^{d} |\mu_k \phi_k(t)|\, \mathbb{1}\{|\hat\mu_k| \le 2 r_k\} + \sum_{k=d+1}^{\infty} |\mu_k \phi_k(t)|$$
$$\le \sum_{k=1}^{d} r_k |\phi_k(t)|\, \mathbb{1}\{|\hat\mu_k| > 2 r_k\} + \sum_{k=1}^{d} |\mu_k \phi_k(t)|\, \mathbb{1}\{|\mu_k| \le 3 r_k\} + \sum_{k=d+1}^{\infty} |\mu_k \phi_k(t)|,$$
on the event $\Omega_n$, on which $|\hat\mu_k - \mu_k| \le r_k$ for all $k \le d$, as claimed.
Provided
$$\lim_{n\to\infty} \sup_{0<t<1} \frac{\sum_{k=1}^{d} |\mu_k \phi_k(t)|\, \mathbb{1}\{|\mu_k| \le 3 r_k\} + \sum_{k=d+1}^{\infty} |\mu_k \phi_k(t)|}{\sum_{k=1}^{d} r_k |\phi_k(t)|\, \mathbb{1}\{|\mu_k| > 3 r_k\}} = 0, \qquad (30)$$
Proposition 12 implies that
$$\liminf_{n\to\infty} P\left\{ f(t) \in \hat f(2r)(t) \pm \sum_{k=1}^{d} r_k |\phi_k(t)|\, \mathbb{1}\{|\hat\mu_k| > 2 r_k\},\; 0 < t < 1 \right\} \ge 1 - \alpha. \qquad (31)$$
We found in our simulations that these bands are somewhat wide, but not excessively so. We briefly indicate a few other possible bands based on our estimators, and we report on their empirical behavior in the next section. The theoretical analysis of these bands is the subject of future work.
First we consider a confidence interval based on the truncated series estimator $\tilde f(t) = \sum_{k=1}^{d} \hat\mu_k \phi_k(t)$. Recognizing that $\tilde f(t)$ can be written as an average $n^{-1}\sum_{i=1}^{n} W_i(t)$ of independent components $W_i(t) = \sum_{k=1}^{d} \hat\mu_{ik}\phi_k(t)$, $1 \le i \le n$, the distribution of $\tilde f(t)$ is asymptotically Gaussian. Thus, provided that (i) the bias $\sqrt{n}\{E[\tilde f(t)] - f(t)\}$ is asymptotically negligible, and (ii) the sample variance $S_n^2(t)$ based on $W_1(t),\dots,W_n(t)$ consistently estimates $v^2(t) = \mathrm{Var}(W_1(t))$, the random intervals $[\tilde f(t) \pm n^{-1/2} S_n(t) z_{\alpha/2}]$ contain $f(t)$ with probability $1-\alpha$, asymptotically, as $n \to \infty$.

First we address the bias issue (i). For a large class of functions $f \in W^\beta$ with $\beta \ge 1$, the bias vanishes for $d$ large enough, since
$$\sqrt{n}\left| E[\tilde f(t)] - f(t) \right| \lesssim \sqrt{n} \sum_{k=d+1}^{\infty} |\mu_k| \lesssim \sqrt{n} \sum_{k=d+1}^{\infty} k^{-\beta-1/2} \lesssim n^{1/2} d^{-\beta+1/2} \to 0$$
for $d \gg n^{1/(2\beta-1)}$, under the bounded basis assumption (Assumption 5).

We now address (ii). Each $W_i(t) = \sum_{k=1}^{d} \hat\mu_{ik}\phi_k(t)$ is a sum of $d$ independent components with means $\mu_k$ and variances of order $O(1/m)$. Hence, $W_i(t)$ is asymptotically Gaussian with mean $\sum_{k=1}^{d} \mu_k\phi_k(t)$ and variance $v^2(t) = O(d^2 m^{-1})$. Writing $W_i(t) = v(t)U_i(t)$, we find that
$$S_n^2(t) - v^2(t) = v^2(t)\left[ \frac{1}{n-1}\sum_{i=1}^{n} \{U_i(t) - \bar U(t)\}^2 - 1 \right]$$
is of stochastic order $O_p(v^2(t) n^{-1/2})$, and the sample variance $S_n^2(t)$ converges to $v^2(t)$ in probability if $d^2/(m\sqrt{n}) \to 0$. A band can now be obtained by taking the regular grid $t_j = j/m$ and computing
$$\tilde f(t_j) \pm \frac{S_n(t_j)}{\sqrt{n}}\, z_{\alpha/(2m)} \qquad (32)$$
based on Bonferroni's inequality. This band is easy to compute and has reasonably good coverage, as shown in our simulations.
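The Bonferroni band in display (32) is straightforward to compute. The following Python sketch is illustrative only: it assumes the per-curve coefficients $\hat\mu_{ik}$ (array `mu_ik`) and the basis values $\phi_k(t_j)$ (array `Phi`) have already been obtained, and the function name `bonferroni_band` is ours, not part of the paper's software.

```python
import numpy as np
from statistics import NormalDist

def bonferroni_band(mu_ik, Phi, alpha=0.05):
    """Band (32): f_tilde(t_j) +/- z_{alpha/(2m)} S_n(t_j) / sqrt(n).

    mu_ik : (n, d) per-curve basis coefficients.
    Phi   : (d, m) basis functions evaluated at the grid points t_j.
    """
    n, d = mu_ik.shape
    m = Phi.shape[1]
    W = mu_ik @ Phi                      # W_i(t_j), shape (n, m)
    f_tilde = W.mean(axis=0)             # truncated series estimator
    S_n = W.std(axis=0, ddof=1)          # sample std of W_1(t_j), ..., W_n(t_j)
    z = NormalDist().inv_cdf(1 - alpha / (2 * m))   # Bonferroni quantile
    half = z * S_n / np.sqrt(n)
    return f_tilde - half, f_tilde + half
```

The half-width at each grid point is exactly $z_{\alpha/(2m)} S_n(t_j)/\sqrt{n}$, so the Bonferroni correction is over the $m$ grid points, as in (32).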
We found in our simulations, reported in the next section, that replacing each $W_i(t)$ by $\widetilde W_i(t) = \sum_{k \in \hat s} \hat\mu_{ik}\phi_k(t)$, with $\hat s = \{k : |\hat\mu_k| > r_k\}$ the set of selected indices, yields much narrower bands with good asymptotic coverage. We note that this approach ignores the uncertainty due to selecting the set $\hat s$, which can result in coverage below $100(1-\alpha)\%$.
A superior way to obtain confidence bands is to analyze the process
$$E_n(t) = \sqrt{n}\left\{ \tilde f(t) - E[\tilde f(t)] \right\} = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \sum_{k=1}^{d} (\hat\mu_{ik} - \mu_k)\phi_k(t).$$
For the trigonometric basis $\phi_k$ used in our simulations, it is not hard to show that $E_n$ converges weakly to a Gaussian limit. As before, for $f \in W^\beta$, $\beta \ge 1$, provided $d \gg n^{1/(2\beta-1)}$ is large enough, the bias $\sqrt{n}\,\|E[\tilde f] - f\|_\infty \to 0$, and
$$\tilde f(t) \pm \frac{S_n(t)}{\sqrt{n}}\, q_\alpha \qquad (33)$$
based on the $\alpha$ upper point $q_\alpha$ of the distribution of the supremum of the normalized process $E_n(t)/S_n(t)$, constitutes an asymptotic $1-\alpha$ confidence band. For an easily implementable approximation $q^*_\alpha$ of $q_\alpha$, we rely on the following resampling procedure.
1. Draw with replacement $n$ vectors $(\hat\mu^*_{i1},\dots,\hat\mu^*_{id})$ from $(\hat\mu_{11},\dots,\hat\mu_{1d}),\dots,(\hat\mu_{n1},\dots,\hat\mu_{nd})$.
2. Compute $C^*_i(t) = \sum_{k=1}^{d} \phi_k(t)(\hat\mu^*_{ik} - \hat\mu_k)$, $S^*_n(t) = (n-1)^{-1/2}\left[\sum_{i=1}^{n} \{C^*_i(t) - \bar C^*(t)\}^2\right]^{1/2}$ and $E^*_n(t) = n^{-1/2}\sum_{i=1}^{n} C^*_i(t)$.
3. Approximate $\sup_t |E^*_n(t)/S^*_n(t)|$.
4. Repeat the previous steps $B$ times, and obtain the $\alpha$-upper point $q^*_\alpha$.
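The four resampling steps above can be sketched in a few lines of Python. This is a hedged illustration, not the authors' code: `mu_ik` (the per-curve coefficients $\hat\mu_{ik}$) and `Phi` (the basis $\phi_k$ evaluated on a fine grid) are assumed precomputed, and the supremum over $t$ in step 3 is approximated by the maximum over the grid points.

```python
import numpy as np

def bootstrap_quantile(mu_ik, Phi, alpha=0.05, B=100, rng=None):
    """Approximate q_alpha^* for the band (33) by resampling curves.

    mu_ik : (n, d) per-curve coefficients; Phi : (d, m) basis values on
    the grid.  Returns the alpha-upper point of sup_t |E_n^*(t)/S_n^*(t)|.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = mu_ik.shape[0]
    mu_k = mu_ik.mean(axis=0)
    sups = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)        # step 1: draw with replacement
        C = (mu_ik[idx] - mu_k) @ Phi           # step 2: C_i^*(t_j), (n, m)
        E_star = C.sum(axis=0) / np.sqrt(n)     #         E_n^*(t_j)
        S_star = C.std(axis=0, ddof=1)          #         S_n^*(t_j)
        sups[b] = np.max(np.abs(E_star / S_star))   # step 3: sup over the grid
    return np.quantile(sups, 1 - alpha)         # step 4: alpha-upper point
```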
3 Numerical results
3.1 Simulation design
We conducted our simulations for combinations of mean zero stochastic processes, stationary and non-stationary, with differentiable and non-differentiable mean functions. Specifically, we consider two stationary processes, AR(1) and ARMA(1,1), and two non-stationary processes, the Brownian Bridge (BB) and the Brownian Motion (BM) on [0,1]. We consider the two mean functions:
$$f(t) = c_1 \exp\{-64(t-0.25)^2\} + c_2 \exp\{-256(t-0.75)^2\},$$
referred to in the sequel as Signal 1, and
$$f(t) = c_3\, \mathbb{1}\{0.35 < t < 0.375\} + c_4\, \mathbb{1}\{0.75 < t < 0.875\},$$
referred to as Signal 2. The constants $c_1$–$c_4$ are varied to achieve the desired signal-to-noise ratios; we define the signal-to-noise ratio in Section 3.1.1.
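For concreteness, the two mean functions are straightforward to code. The Python sketch below is purely illustrative; the default values of the constants $c_1$–$c_4$ are placeholders, to be tuned to the desired SNR:

```python
import numpy as np

def signal1(t, c1=2.0, c2=2.0):
    """Signal 1: sum of two Gaussian bumps centered at t = 0.25 and t = 0.75."""
    return c1 * np.exp(-64 * (t - 0.25) ** 2) + c2 * np.exp(-256 * (t - 0.75) ** 2)

def signal2(t, c3=2.0, c4=2.0):
    """Signal 2: sum of two indicator (step) functions."""
    t = np.asarray(t, dtype=float)
    return (c3 * ((0.35 < t) & (t < 0.375))
            + c4 * ((0.75 < t) & (t < 0.875)))
```

Signal 1 is infinitely differentiable, while Signal 2 is discontinuous, which is exactly the contrast the design targets.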
For our simulations we considered two popular families of bases, Fourier and Haar, each
known to have good approximation properties for functions in L2 belonging to general
smoothness classes, such as Sobolev classes. Any other bases with this property can be
considered, and the qualitative and quantitative points we illustrate here will remain essen-
tially the same.
3.1.1 Simulation scenarios
We simulated n curves for each of the eight (signal, stochastic process) combinations above. Each curve $1 \le i \le n$ is observed at $m$ points $t_{ij}$, and the observations follow the model
$$Y_{ij} = f(t_{ij}) + Z_i(t_{ij}) + \varepsilon_{ij},$$
for $1 \le i \le n$ and $1 \le j \le m$. We facilitate comparison with published work on other estimators by considering time points $t_{ij} = t_j = j/m$, for all $i$ and $1 \le j \le m$. We also evaluated the performance of our estimators when the $t_{ij} \in [0,1]$ were simulated independently from the uniform distribution on [0,1]. The measurement errors $\varepsilon_{ij}$ were generated independently from $N(0, \sigma^2_\varepsilon)$. The parameters for simulating the AR(1) and ARMA(1,1) processes were chosen in order to achieve the following equivalences:
$$\mathrm{median}_t\, \mathrm{Var}[\text{Brownian Bridge}(t)] = \mathrm{Var}[\text{AR(1)}],$$
$$\mathrm{median}_t\, \mathrm{Var}[\text{Brownian Motion}(t)] = \mathrm{Var}[\text{ARMA(1,1)}].$$
This facilitates comparison between processes of different natures. Next, the variance of the measurement error $\sigma^2_\varepsilon$ is chosen so that we have two cases, $\sigma^* = 1$ and $\sigma^* = 10$, where
$$\sigma^* = \frac{\mathrm{Var}[Z(t)]}{\sigma^2_\varepsilon}. \qquad (34)$$
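A minimal data-generating sketch for this model, using the Brownian bridge as the process $Z$, is given below in Python. The calibration of $\sigma_\varepsilon$ from $\sigma^*$ uses the median over the grid of $\mathrm{Var}[\mathrm{BB}(t)] = t(1-t)$, in the spirit of display (34) and the median-variance convention above; the function and its defaults are our own illustration, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_curves(f, n=50, m=64, sigma_star=10.0):
    """Simulate Y_ij = f(t_j) + Z_i(t_j) + eps_ij with Z a Brownian bridge.

    sigma_star = Var[Z(t)] / sigma_eps^2, as in display (34), with Var[Z(t)]
    summarized by its median over the grid.
    """
    t = np.arange(1, m + 1) / m
    # Brownian bridge on the grid: B(t) = W(t) - t * W(1),
    # with W built from independent Gaussian increments of variance 1/m.
    steps = rng.normal(scale=np.sqrt(1.0 / m), size=(n, m))
    W = np.cumsum(steps, axis=1)
    Z = W - t * W[:, [-1]]
    # Calibrate the measurement-error variance from sigma_star.
    med_var = np.median(t * (1 - t))          # median_t Var[BB(t)]
    sigma_eps = np.sqrt(med_var / sigma_star)
    eps = rng.normal(scale=sigma_eps, size=(n, m))
    return t, f(t) + Z + eps
```

Setting `sigma_star=1.0` reproduces the equal-variability case, while the default `sigma_star=10.0` makes the measurement error comparatively small.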
[Figure 1 about here; panels: (a) AR(1), σ* = 10; (b) AR(1), σ* = 1; (c) BB, σ* = 10; (d) BB, σ* = 1; (e) ARMA(1,1), σ* = 10; (f) ARMA(1,1), σ* = 1; (g) BM, σ* = 10; (h) BM, σ* = 1.]

Figure 1: Plots of Signal 1 + AR(1)/BB + Noise (top row) and of Signal 1 + ARMA(1,1)/BM + Noise (bottom row), n = 50, SNR = 4.25.
When $\sigma^* = 1$ the variability of the measurement error is the same as that of the stochastic process, whereas for $\sigma^* = 10$ the measurement errors become essentially negligible. Figure 1 shows realizations from each of the stochastic processes, with mean corresponding to Signal 1 and added noise corresponding to $\sigma^* = 1$ and $\sigma^* = 10$, respectively.

We conducted simulations for different values of the signal-to-noise ratio (SNR). Since the process $Z(t)$ is assumed independent of the measurement error, we define the noise level as $(\mathrm{Var}[Z(t)] + \sigma^2_\varepsilon)^{1/2}$ and the signal-to-noise ratio as $\mathrm{SNR} = \mathrm{Range}[f] / (\mathrm{Var}[Z(t)] + \sigma^2_\varepsilon)^{1/2}$, where $\mathrm{Range}[f] = \max_{0 \le t \le 1} f(t) - \min_{0 \le t \le 1} f(t)$.
3.2 Simulation results: the fit of the estimates
We contrast the quality of the fit of our estimates with eight other methods previously
proposed and studied in the literature. The first is the simple ensemble average of the observations $Y_{ij}$. The next three are obtained by applying, respectively, the following smoothing methods to the entire data set, containing all n curves:
• Local polynomial kernel smoothing (Local Poly) with a global bandwidth, suggested by, among others, Yao et al. (2003), Müller (2005), Yao, Müller and Wang (2005), and Yao (2007). We used the R functions locpoly and dpill to obtain the estimate and its bandwidth, respectively.
• Nadaraya-Watson kernel smoothing (NWK) with a global bandwidth, discussed in a
functional data setting by, e.g., Yao (2007). We used the N(0, 1) kernel and obtained
the bandwidth via cross-validation.
• Smoothing splines, as suggested, for instance, by Rice and Silverman (1991). We used order-4 B-spline basis functions $B_k(t)$, with a knot placed at each design time point $t_j$ and a roughness penalty proportional to the squared second derivative of $f(t)$. We implemented the method using smooth.spline in R, with the tuning parameter in the penalty term selected by generalized cross-validation, leaving one curve out at a time.
For the last four comparison methods, we estimate $f(t)$ by averaging smoothed versions of the individual trajectories. The reconstructions of the individual curves were performed using:
• Local polynomial kernel smoother (Global kernel) with a global bandwidth
• Local polynomial kernel smoother (Local kernel) with a local bandwidth, where the bandwidths are found using the plug-in algorithm proposed in Seifert, Brockmann, Engel and Gasser (1994)
• B-splines regression with a roughness penalty
• Fourier expansion regression with a roughness penalty, as discussed in Chapter 5 of Ramsay and Silverman (2005).
We compare the estimates above with our proposed estimates using the Fourier and Haar
bases. We consider hard threshold estimates (HT) obtained by truncating the least squares
[Figure 2 about here; each panel shows boxplots of √MSE for the HT(r), kernel, and smoothing spline estimators. Panels: (a) AR(1), σ* = 10; (b) AR(1), σ* = 1; (c) BB, σ* = 10; (d) BB, σ* = 1; (e) ARMA(1,1), σ* = 10; (f) ARMA(1,1), σ* = 1; (g) BM, σ* = 10; (h) BM, σ* = 1.]

Figure 2: Boxplots of the $\sqrt{\mathrm{MSE}}$ of the HT(r), kernel and smoothing spline estimators, over 500 simulations.
estimates either at levels $r_k$, for each $k$ (we denote the resulting estimate by HT(r)), or at levels $2r_k$ (to obtain HT(2r)), where
$$r_k = \frac{S_k}{\sqrt{n}}\, z_{\alpha/(2d)} = z_{\alpha/(2d)} \sqrt{\frac{1}{n(n-1)} \sum_{i=1}^{n} (\hat\mu_{i,k} - \hat\mu_k)^2}.$$
In this simulation study we took $d = m$ basis functions. Any choice $d > m$ would lead to repeated basis functions, since $\phi_{m+k}(j/m) = \phi_k(j/m)$ for all $1 \le j \le m$ and $k \ge 1$ in the case of the equally spaced design $t_{ij} = j/m$ and the Fourier basis $\phi_k$.
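Putting the pieces together, the hard-thresholded estimator with the data-driven levels $r_k$ above can be sketched as follows. We assume, as an illustration for the equally spaced design, that the per-curve coefficients $\hat\mu_{ik}$ are empirical inner products of each curve with an orthonormal trigonometric basis; the function names `fourier_basis` and `ht_estimate` are ours.

```python
import numpy as np
from statistics import NormalDist

def fourier_basis(t, d):
    """Orthonormal trigonometric basis phi_1, ..., phi_d evaluated at t."""
    Phi = np.empty((d, t.size))
    Phi[0] = 1.0
    for k in range(1, d):
        j = (k + 1) // 2                       # frequency of the k-th function
        Phi[k] = (np.sqrt(2) * np.cos(2 * np.pi * j * t) if k % 2 == 1
                  else np.sqrt(2) * np.sin(2 * np.pi * j * t))
    return Phi

def ht_estimate(Y, t, d, alpha=0.05, factor=1.0):
    """Hard-thresholded estimate HT(factor * r) of the mean curve on the grid t."""
    n, m = Y.shape
    Phi = fourier_basis(t, d)                  # (d, m)
    mu_ik = Y @ Phi.T / m                      # per-curve coefficients mu_hat_{ik}
    mu_k = mu_ik.mean(axis=0)                  # least squares coefficients mu_hat_k
    z = NormalDist().inv_cdf(1 - alpha / (2 * d))
    r_k = z * np.sqrt(((mu_ik - mu_k) ** 2).sum(axis=0) / (n * (n - 1)))
    kept = np.abs(mu_k) > factor * r_k         # data-driven coefficient selection
    return (mu_k * kept) @ Phi                 # fitted mean curve on the grid
```

Calling `ht_estimate(Y, t, d, factor=2.0)` gives the HT(2r) variant used for the confidence sets.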
Table 1 contains the empirical mean squared errors (EMSEs) for all the competing estimates and for the proposed hard-thresholded estimates; we also include the least squares estimator as a baseline for comparison. For brevity we only include the results for Signal 1, since the results for Signal 2 were similar. The SNR for these simulations was set to 4.25, and Table 1 is obtained for n = 400, m = 256 and equally spaced design points.

Table 1: EMSE results for Signal 1 based on 500 simulations and equally spaced design. Scenario: n = 400, m = d = 2^8 = 256, α = .05, SNR = 4.25. Entries are √EMSE × 10^-6 [√MEDMSE × 10^-6].

                                Brownian Bridge                     AR(1)
                          σ* = 1           σ* = 10          σ* = 1           σ* = 10
Fourier Basis
  LS                  29582 [27580]    21104 [18217]    30767 [30398]    22767 [22415]
  HT(r)               18429 [15723]    17544 [14446]    16928 [16072]    15989 [15195]
  HT(2r)              20231 [17721]    18334 [15441]    23894 [23767]    21448 [19651]
Haar Basis
  LS                  29493 [27537]    21092 [18208]    30672 [30302]    22749 [22389]
  HT(r)               34820 [33551]    23808 [21495]    38204 [38064]    27852 [27449]
  HT(2r)              48011 [47158]    48879 [53945]    61629 [61684]    52427 [52232]
Pooled Curves
  Local Poly          22769 [20274]    20557 [17584]    22673 [22075]    20479 [20073]
  NWK                 23186 [20790]    20271 [17289]    22848 [22279]    20594 [20153]
  Smoothing Splines   21455 [18801]    20186 [17308]    20794 [20263]    19408 [19069]
  Ensemble Ave        29493 [27537]    21092 [18208]    30672 [30302]    22749 [22389]
Curve-by-Curve
  Global Kernel       47709 [46877]    23165 [20647]    31641 [31123]    20451 [20095]
  Local Kernel        36161 [34809]    21677 [19036]    27824 [27261]    20548 [20152]
  B-splines Regression 38532 [37384]   20181 [17291]    21413 [20899]    20897 [20531]
  Fourier Regression  39543 [38316]    31905 [30200]    38309 [38091]    30359 [30167]

                                Brownian Motion                     ARMA(1,1)
                          σ* = 1           σ* = 10          σ* = 1           σ* = 10
Fourier Basis
  LS                  50959 [44662]    37958 [29482]    53549 [52989]    41562 [40893]
  HT(r)               35388 [26045]    34133 [24780]    31356 [30346]    30491 [29488]
  HT(2r)              37794 [29300]    35322 [24835]    42625 [41283]    44088 [39450]
Haar Basis
  LS                  50815 [44564]    37938 [29419]    53399 [52749]    41543 [40867]
  HT(r)               58862 [54324]    45587 [34722]    66936 [65883]    51689 [51020]
  HT(2r)              78181 [75281]    88327 [85689]    106709 [106957]  103036 [96124]
Pooled Curves
  Local Poly          40648 [32707]    37345 [28707]    41186 [40240]    38084 [37464]
  NWK                 41260 [33561]    36791 [27848]    41277 [40470]    38092 [37468]
  Smoothing Splines   38737 [30617]    36865 [27928]    38281 [37269]    36341 [35685]
  Ensemble Ave        50815 [44564]    37938 [29419]    53399 [52749]    41543 [40867]
Curve-by-Curve
  Global Kernel       83134 [80354]    43568 [36505]    48418 [47520]    38552 [38017]
  Local Kernel        62759 [58781]    39777 [31632]    45305 [44259]    38780 [38184]
  B-splines Regression 71699 [68131]   38657 [30346]    39255 [38373]    39874 [39140]
  Fourier Regression  66629 [62591]    54688 [49219]    64238 [64015]    51861 [51601]

Table 2: EMSE results for Signal 1 based on 200 simulations and uniform design. Scenario: n = 128, m = d = 2^7 = 128, α = .05, SNR = 4.25. Entries are √EMSE × 10^-6 [√MEDMSE × 10^-6].

                                Brownian Bridge                     AR(1)
                          σ* = 1           σ* = 10          σ* = 1           σ* = 10
Fourier Basis
  LS                  94138 [91881]    75697 [72876]    95598 [94382]    76807 [76221]
  HT(r)               48161 [44972]    44907 [38006]    45785 [44642]    39685 [37951]

                                Brownian Motion                     ARMA(1,1)
Fourier Basis
  LS                  158890 [153804]  128867 [124620]  163308 [161969]  133878 [134306]
  HT(r)               84029 [75817]    80056 [66235]    80765 [79201]    77738 [68341]

Figure 2 shows that the variability of the MSEs of the estimates with closest performance (HT(r), smoothing splines, and kernel estimators via pooled curves) is comparable. Both the MEDMSE (median MSE) and the EMSE of HT(r) are smaller than those of its competitors. In Table 2 we assessed the quality of our estimator for uniformly sampled design points, and we lowered the values of m and n to 128. The smaller sample size and the additional randomness induced by the random design are responsible for the inflation of the EMSE values in Table 2 compared to those in Table 1.
Our results support the following conclusions on the performance of the estimators:

1. If σ* = 10, the variance of the process dominates the variance of the random errors, and the HT estimators based on the Fourier basis perform slightly better than most of the competing estimators.

2. If σ* = 1, the difference between our estimator and the competing estimators is more pronounced, especially for the BM and ARMA(1,1) processes, suggesting that this type of estimation is more robust against large variability in the data.

3. As an additional remark, our experiments indicate that some of the estimators proposed in the literature may be outperformed by the simple least squares estimator based on all the data points, or even by the naive sample average, if the choice of their tuning parameters is not refined. For all our simulations we chose these tuning parameters adaptively, as explained above, but we did not attempt to improve upon the published guidelines on their selection.
3.3 Simulation results: confidence sets
3.3.1 Confidence balls
We first investigate the empirical performance of the confidence ball proposed in Section 2.5, using the Fourier basis. The confidence ball $B(\hat f(2r), \hat\rho(2r))$ has the radius established in display (25). In Table 3 we report the average radius, the average empirical $L_2$ distance evaluated at 64 equally spaced time points, and the coverage over S = 500 simulated datasets. The results show that the confidence ball achieves the nominal coverage for the equally spaced design. When the time points are uniform, the ball has a wider radius and coverage close to the nominal level, even for σ* = 1.
Table 3: Numerical results for the confidence ball $B(\hat f(2r), \hat\rho(2r))$: average radius, average empirical $L_2$ distance, and coverage over S = 500 simulations. Scenario: Signal 1, AR(1) process, m = d = 2^6 = 64, Fourier basis functions φ_k, SNR = 2.2.

                                Empirical L2    Radius    Coverage
σ* = 10 (n = 350), α = 0.05
  Equally Sp. t_j                   0.023        0.059      0.97
  Uniform t_ij                      0.036        0.063      0.94
σ* = 1 (n = 125), α = 0.10
  Equally Sp. t_j                   0.051        0.100      0.96
  Uniform t_ij                      0.074        0.107      0.88
3.3.2 Confidence bands
Next, we investigate the finite sample coverage of the confidence bands (Methods 1–4) proposed in Section 2.5, again using the Fourier basis. Method 1 is based on display (31). Method 2 is the band obtained from display (32) with $\widetilde W_i(t) = \sum_{k \in \hat s} \hat\mu_{ik}\phi_k(t)$. Method 3 is the band obtained from display (32) with $W_i(t) = \sum_{k=1}^{d} \hat\mu_{ik}\phi_k(t)$. Method 4 is implemented as in display (33).
We consider the following scenario for evaluating the confidence bands: n = 300, m = 64 and d = 30, with the signal-to-noise ratio set to 2.2. Both the uniform and the equally spaced designs are considered for the discrete time points. We evaluate our bands on the fine grid of equally spaced time points $t_j = j/m$, $1 \le j \le m$, in [0,1]. We compare these methods with simultaneous confidence bands (SCB) of the form $\bar f_{ave}(t) \pm z_{\alpha/(2m)}\, n^{-1/2}\, \hat\Gamma^{1/2}(t,t)$, based on Zhang and Chen (2007). Here $\bar f_{ave}(t) = (1/n)\sum_{i=1}^{n} \hat f_{loc,i}(t)$ is the average of local linear kernel estimators $\hat f_{loc,i}$ for each curve, with local bandwidth (the built-in bandwidth choice of lokerns in R). The estimated covariance $\hat\Gamma(t,t)$ is computed using the sample variance of the kernel estimators at each t. Table 4 shows that, for the equally spaced design, these SCB fall short of the nominal coverage.
Table 4 summarizes the results for the AR(1) and BB processes and Signal 1. The entries in the table are the average widths of the confidence bands followed, in parentheses, by their empirical coverage over S = 300 simulations. Since we chose α = 0.05 for this simulation study, we expect the empirical coverage of the proposed bands to be around the 0.95 nominal level. The results presented in Table 4 support the following conclusions on the proposed confidence bands.
1. Method 1 yields an adaptive band that is conservative and necessarily wider than those of the other methods we analyze. Its coverage is close to, and often exceeds, the nominal level.

2. Method 2 has coverage close to the nominal level for the equally spaced design, and its width is the narrowest among all methods considered. Methods 2 and 3 have similar coverage in the case of equally spaced design, with Method 2 having the smaller width. In the uniform design case, however, Method 2 has slightly lower coverage than Method 3.

3. Methods 3 and 4 are centered at the same estimator; the difference in their widths is due to the quantile used. Method 4, employing a quantile chosen via bootstrap, produces bands that are narrower on average, while maintaining coverage close to the nominal level. It has slightly lower coverage than Method 3.

4. For the same (n, m, d) combination, the scenario σ* = 10 yields narrower bands than σ* = 1. Also, the bands for the uniform design are slightly wider than those for the equally spaced design. Across all scenarios we consider, the coverage of the proposed bands lies between 0.84 and 1.00 for the AR(1) process and between 0.91 and 1.00 for the BB process.
4 Application to daily temperature curves
We use the tempkent dataset, which is part of the fds package of functional datasets in R. It consists of temperature curves recorded over the course of a day in Kent Town, Australia; we consider the daily temperature curves corresponding to the period 2003–2007. The temperature curves are observed at m = 48 equally spaced (half-hour) time points each day. Our goal is to estimate the mean temperature curve and to provide confidence bands; we used d = 40 Fourier basis functions. The temperature curves contained in this dataset, together with the confidence bands we obtained, are displayed in Figure 4. The confidence band for the mean temperature curve has a smooth appearance and indicates a cyclic behavior, with an expected maximum of approximately 22.3 degrees Celsius and a minimum of approximately 12.4 degrees Celsius.
References
[1] Yannick Baraud. Confidence balls in Gaussian regression. Annals of Statistics, 32, 528-
551, 2004.
[2] Peter Bickel and Yaacov Ritov. Non and semi-parametric statistics compared and con-
trasted. Journal of Statistical Planning and Inference, 91, 209-228, 2000.
[Figure 3 about here; panels: (a) Method 2; (b) Method 3.]

Figure 3: Scenario: AR(1), σ* = 1, SNR = 2.2, n = 300, m = 64, d = 30, equally spaced $t_j$, α = 0.05. The plots show the simulated data curves in black, the true f as a white solid line, and the confidence bands as white dashed lines.
[Figure 4 about here; the vertical axis shows temperature (degrees C) against the 48 half-hour time points.]

Figure 4: Daily temperature curves for Kent Town, Australia, for the years 2003–2007, shown in black. Confidence bands for the mean temperature curve over this period are shown in white: the Method 1 band (pronounced dash) and the Method 2 band (fine dash).
[3] Lucien Birgé and Pascal Massart. An adaptive compression algorithm in Besov spaces. Constructive Approximation 16(1), 1–36, 2000.

[4] Michal Benko, Wolfgang Härdle and Alois Kneip. Common functional principal components. Annals of Statistics 37(1), 1–34, 2009.

[5] Rudolf Beran and Lutz Dümbgen. Modulation of estimators and confidence sets. Annals of Statistics 26, 1826–1856, 1998.
[6] P. Laurie Davies, Arne Kovac and Monika Meise. Nonparametric regression, confidence
regions and regularisation. Annals of Statistics 37, 2597–2625, 2009.
[7] David Degras. Nonparametric estimation of a trend based upon sampled continuous
processes. C. R. Acad. Sci. Paris Ser. I 347, 191-194, 2009.
[8] David Donoho and Iain Johnstone. Adapting to unknown smoothness via wavelet shrinkage. Journal of the American Statistical Association 90, 1200–1224, 1995.

[9] David Donoho and Iain Johnstone. Minimax estimation via wavelet shrinkage. Annals of Statistics 26, 879–921, 1998.
[10] David Donoho, Iain Johnstone, Gerard Kerkyacharian, and Dominique Picard. Density
estimation by wavelet thresholding. Annals of Statistics 24(2), 508–539, 1996.
[11] Christopher Genovese and Larry Wasserman. Adaptive confidence bands. Annals of
Statistics 36(2), 875–905, 2008.
[12] Daniel Gervini. Free-knot spline smoothing for functional data. Journal of the Royal
Statistical Society, Series B 68(4), 671–687, 2006.
[13] Evarist Giné and Richard Nickl. Confidence bands in density estimation. Annals of
Statistics 38(2), 1122–1170, 2010.
[14] Henry Landau and Lawrence Shepp. On the supremum of a Gaussian process. Sankhya,
32, 369–378, 1970.
[15] Mark Low. On nonparametric confidence intervals. The Annals of Statistics, 25 (6),
2547 - 2554, 1997.
[16] Pascal Massart. Concentration Inequalities and Model Selection, Ecole d’Ete de Proba-
bilites de Saint-Flour XXXIII - 2003, volume 1896, 2007.
[17] Hans-Georg Müller. Functional modeling and classification of longitudinal data. Scandinavian Journal of Statistics 32, 223–240, 2005.
[18] Dominique Picard and Karine Tribouley. Adaptive confidence interval for pointwise curve estimation. Annals of Statistics 28(1), 298–335, 2000.
[19] R Development Core Team (2009). R: A language and environment for statistical com-
puting. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0,
URL http://www.R-project.org.
[20] James Ramsay and Bernard Silverman. Functional data analysis, 2nd Edition. Springer,
New York, 2005.
[21] James Ramsay and Bernard Silverman. Applied functional data analysis. Springer, New
York, 2002.
[22] John Rice and Bernard Silverman. Estimating the mean and covariance structure non-
parametrically when the data are curves. Journal of the Royal Statistical Society, Series
B 53(1), 233–243, 1991.
[23] Jamie Robins and Aad van der Vaart. Adaptive nonparametric confidence sets. Annals
of Statistics, 34, 229-253, 2006.
[24] David Ruppert, Matthew Wand and Raymond Carroll. Semiparametric regression.
Cambridge University Press, Cambridge 2003.
[25] Burkhardt Seifert, Michael Brockmann, Joachim Engel and Theo Gasser. Fast algorithms
for nonparametric curve estimation. Journal of Computational and Graphical Statistics
3(2), 192–213, 1994.
[26] Galen Shorack. Probability for Statisticians. Springer, 2000.
[27] Galen Shorack and Jon Wellner. Empirical Processes with Applications to Statistics,
Wiley, 1986.
[28] Alexandre Tsybakov. Introduction to nonparametric estimation. Springer, New York,
2009.
[29] Larry Wasserman. All of nonparametric statistics. Springer, New York, 2006.
[30] Fang Yao. Asymptotic distributions of nonparametric regression estimators for longitu-
dinal and functional data. Journal of Multivariate Analysis 98, 40–56, 2007.
[31] Fang Yao, Hans-Georg Müller and Jane-Ling Wang. Functional data analysis for sparse longitudinal data. Journal of the American Statistical Association 100(470), 577–590, 2005.
[32] Jin-Ting Zhang and Jianwei Chen. Statistical inferences for functional data. Annals of
Statistics 35(3), 1052–1079, 2007.
Table 4: Table entries are average width (coverage) over S = 300 simulations. Scenario: Signal 1, α = 0.05, Fourier basis functions φ_k, SNR = 2.2, B = 100, n = 300, m = 64, d = 30. Method 1 is the adaptive band $\hat f(2r)(t_j) \pm \sum_{k=1}^{d} r_k |\phi_k(t_j)|\, \mathbb{1}\{|\hat\mu_k| > 2r_k\}$ of display (31).

                                          AR(1) process   BB process
σ* = 10, Equally Sp. t_j
  Bands based on asymptotic normality
    SCB                                    0.16 (0.78)     0.15 (0.84)
  Proposed Bands
    Method 1) adaptive band                0.27 (0.94)     0.26 (1.00)
    Method 2) HT & Bonferroni              0.13 (0.96)     0.14 (1.00)
    Method 3) LS & Bonferroni              0.17 (0.96)     0.16 (0.98)
    Method 4) LS & Bootstrap               0.16 (0.92)     0.14 (0.94)
σ* = 10, Uniform t_ij
  Proposed Bands
    Method 1) adaptive band                0.30 (0.92)     0.30 (0.97)
    Method 2) HT & Bonferroni              0.14 (0.91)     0.15 (0.95)
    Method 3) LS & Bonferroni              0.21 (0.95)     0.20 (0.96)
    Method 4) LS & Bootstrap               0.20 (0.94)     0.19 (0.95)
σ* = 1, Equally Sp. t_j
  Proposed Bands
    Method 1) adaptive band                0.31 (1.00)     0.30 (1.00)
    Method 2) HT & Bonferroni              0.14 (0.94)     0.15 (0.97)
    Method 3) LS & Bonferroni              0.20 (0.94)     0.19 (0.96)
    Method 4) LS & Bootstrap               0.19 (0.93)     0.18 (0.94)
σ* = 1, Uniform t_ij
  Proposed Bands
    Method 1) adaptive band                0.34 (0.99)     0.35 (1.00)
    Method 2) HT & Bonferroni              0.16 (0.84)     0.17 (0.91)
    Method 3) LS & Bonferroni              0.24 (0.96)     0.24 (0.95)
    Method 4) LS & Bootstrap               0.24 (0.95)     0.23 (0.94)