Nonparametric Inference for Partly Linear Additive Cox Models based on Polynomial Spline Estimation

Abstract

The global smoothing method based on polynomial splines is a popular technique for nonparametric regression estimation and has received great attention in the literature. However, it is tremendously challenging to obtain local asymptotic properties of polynomial spline estimators and to make inference for the regression functions. We develop a general theory of local asymptotics for the polynomial spline estimation of partly linear additive Cox models. We obtain a uniform Bahadur representation and design-adaptive asymptotic normality of the resulting nonparametric estimators. Furthermore, we propose a distance-based statistic for specification tests of the additive components and establish the limiting distribution of the test statistic. We propose a bootstrap procedure to calculate the p-value of this test statistic and prove its consistency. Based on the polynomial spline estimation, we also introduce a two-step estimation method, which possesses an oracle property in the sense that any additive component can be estimated as if the other additive components were known. All of the above local asymptotics and testing results are also established for this two-step procedure. Simulations demonstrate the nice finite sample performance of the proposed procedures. Analysis of the Framingham Heart Study data illustrates the use of our methodology.

Keywords: Conditional hazard rate, Hypothesis testing, Local asymptotics, Partial likelihood.

1 Introduction

The global smoothing method based on polynomial splines is a popular technique for nonparametric regression estimation. Its main advantages over the popular kernel smoothing are its fast implementation and nice finite sample performance. This is significant in high-dimensional smoothing.


Global convergence rates of the resulting polynomial spline estimators have been exhaustively studied; see Stone (1985, 1986, 1994), Kooperberg et al. (1995a, b), Huang (1998, 1999, 2001), Huang and Stone (1998), Huang et al. (2000), and Huang and Shen (2004), among others. However, it is very challenging to obtain the local asymptotic distributions of polynomial spline estimators, which hampers development of the methodology. Zhou et al. (1998) and Huang (2003) established the local asymptotic normality of such estimators in the framework of nonparametric regression models. To our knowledge, this kind of result is not available for other models. This motivates us to study local asymptotics of the polynomial spline estimators for partly linear additive Cox models. The local asymptotic results are then used for statistical inference, in addition to providing theoretical insights about properties of the estimators that cannot be explained by global asymptotic results.

As an extension of the Cox (1972) model, the partly linear additive Cox model (Huang, 1999) specifies the conditional hazard of the failure time T given the covariate (x, w) ∈ R^d × R^J as
$$
\lambda(t; \mathbf{x}, \mathbf{w}) = \lim_{\Delta \downarrow 0} \Delta^{-1} \Pr\{t \le T < t + \Delta \mid T \ge t, \mathbf{x}, \mathbf{w}\} = \lambda_0(t) \exp\{\beta'\mathbf{x} + \phi(\mathbf{w})\}, \tag{1.1}
$$
where λ0(·) ≥ 0 is an unspecified baseline hazard, β is a d-vector of parameters, and ϕ(w) is an unknown function of w with the additive structure ϕ(w) = ϕ1(w1) + · · · + ϕJ(wJ). The parameters of interest are β and the ϕj's. This model avoids the curse of dimensionality inherent in the saturated multivariate semiparametric hazard regression model (Sasieni, 1992); see the discussion in Hastie and Tibshirani (1990). It allows one to explore the nonlinearity of certain covariates and retains the nice interpretability of the linear structure in Cox's (1972) model.

The polynomial spline estimator of β is efficient in the sense that it achieves the semiparametric information bound (Huang, 1999). This indicates that the estimator of β is asymptotically most efficient among all regular estimators (van der Vaart, 1991; Bickel et al., 1993, Chapter 3). Since the information lower bound cannot be consistently estimated, Jiang and Jiang (2011) proposed a bootstrap-based inference method for β. However, only a global rate of convergence for the resulting estimators of the ϕj's is available in the literature (Huang, 1999). Establishing the local asymptotics of the polynomial spline estimators of the ϕj's has remained unsolved since the publication of Huang (1999). In this article we solve this long-standing problem, as well as others. Since many Cox-type models are specific examples of model (1.1), the local asymptotics for the estimation of the ϕj's can also be used to justify the appropriateness of these models, for example by testing whether the ϕj(·)'s admit some specific parametric forms. This makes our theoretical results significant in applications.

Unlike the least squares based polynomial spline estimation in Zhou et al. (1998) and


Huang (2003), there is no explicit formula for our estimators in the current setting. It is a tremendously novel challenge to obtain the asymptotic distributions of our estimators. Such a challenge also arises from the fact that the score functions, involving a diverging number of parameters, are asymptotically infinite dimensional, in contrast to the local polynomial estimation of the additive components, which involves locally only finitely many parameters (Fan and Yao, 2003). Although for additive regression models the unknown additive functions can be estimated by the backfitting algorithm (Buja et al., 1989) with kernel smoothing as the building block, the local asymptotics of such backfitting estimators of the ϕj's is hard to establish and is not available in the literature, not to mention for the polynomial spline estimation. In fact, there are no formal results in the literature on the local asymptotics for any estimation method of the partly linear additive Cox model. We braved this difficulty and made determined efforts to derive a uniform Bahadur representation and asymptotic normality of the polynomial spline estimators of the ϕj's, which are important for establishing confidence bands and for hypothesis testing about the model structure. We also provide consistent estimators of the baseline hazard and of the asymptotic variance of the nonparametric part of the estimation. This variance estimator allows one to construct pointwise interval estimates of the ϕj's. Our techniques are very different from those in Zhou et al. (1998) and Huang (2003) for nonparametric regression models, because of the counting-process nature of the problem combined with a diverging number of parameters. Our local asymptotic results reveal some remarkable properties of our estimators: they have no boundary problems and are design-adaptive (Fan, 1992; Fan and Gijbels, 1996).

However, as with kernel estimation for nonparametric additive regression models (Opsomer and Ruppert, 1997; Opsomer, 2000), the polynomial spline estimator of one additive component of interest depends on the other additive nuisance components. This is not a desirable property. Motivated by the two-stage estimation methods for varying-coefficient regression models (Fan and Zhang, 1999) and for nonparametric additive regression models (Horowitz and Mammen, 2004), we introduce a two-step estimation method to remove this dependence. It is shown that the two-step estimator is more efficient than the polynomial spline estimator. In particular, the two-step approach possesses an oracle property: any additive component can be estimated as well as if the remaining additive components were known.

With the uniform Bahadur representation, we further study a nonparametric specification test for the additive components. Tests of this kind are available for kernel smoothing, but for polynomial spline smoothing there is no formal nonparametric test for any model in the literature. We propose two test statistics and establish the asymptotic null and alternative distributions of the proposed test statistics. Other testing problems can be dealt with analogously. As with other nonparametric tests, the asymptotic null distribution may not give a


good approximation in finite sample situations due to the low convergence rate (Bickel and Rosenblatt, 1973; Fan, Zhang and Zhang, 2001; Fan and Jiang, 2005; Hong and Lee, 2013). Hence, we propose a bootstrap approach to calculate the p-values of the proposed tests and prove its consistency. We expect that our test methodology will promote the development of nonparametric inference using global polynomial spline estimation as a building block, given its popularity and its advantages of fast implementation and stable performance in high-dimensional nonparametric screening (Fan, Feng and Wu, 2011) and in the analysis of large financial time series data (Liu and Yang, 2016). Our methodology can also be extended to the partly linear additive hazards model, an important alternative to model (1.1) considered in Lu and Song (2015), and to other non- and semiparametric regression models.

The remainder of this paper is organized as follows. In Section 2 we introduce the partial likelihood of model (1.1) along with the polynomial spline estimation. In Section 3 we introduce some notation and technical assumptions for our theoretical results. In Section 4 we concentrate on the uniform linear representation and asymptotic normality of the estimators of the ϕj's. In Section 5 we suggest the two-step estimation. In Section 6 we study the specification test. In Section 7 we present details of the implementation of the proposed method and conduct simulations, and a real example is employed to illustrate the use of our methodology. Some concluding remarks are given in Section 8. Technical proofs are provided in the Appendix.

2 Partial likelihood and polynomial spline estimation

Based on Cox's (1975) partial likelihood, Stone (1986) proposed polynomial spline estimation for the fully nonparametric additive Cox model and studied the global convergence rates of the resulting estimator, and Huang (1999) extended this estimation method to the partly linear additive Cox model. The idea of this approach is to approximate the function ϕ(·) in the partial likelihood by a polynomial spline.

Suppose that there are n independent individuals in a study cohort. In practice, not all of the survival times T1, . . . , Tn are fully observable, due to termination of the study or early withdrawal from it. Instead one observes an event time Si = min(Ti, Ci) for the i-th subject, where Ci is the censoring time. Let δi = I(Ti ≤ Ci) be the censoring indicator and (Xi, Wi) an associated vector of covariates. The observed data are {(Si, δi, Xi, Wi) : i = 1, . . . , n}, which form an i.i.d. sample from the population (S, δ, X, W) with S = min(T, C) and δ = I(T ≤ C). The partial likelihood function (Cox, 1975) for model (1.1) is then
$$
L(\beta, \phi) = \prod_{i=1}^{n} \left\{ \frac{r_i(\beta, \phi)}{\sum_{j \in R_i} r_j(\beta, \phi)} \right\}^{\delta_i}, \tag{2.2}
$$
where r_i(β, ϕ) = exp{β′X_i + ϕ(W_i)} and R_i = {j : S_j ≥ S_i} is the risk set at time S_i.


Maximizing the above partial likelihood leads to over-fitting in the absence of any restriction on the form of ϕ; in particular, the parameters will be unidentifiable. Huang (1999) resorted to an approximation of ϕ(·) in a space of polynomial splines, using the same number of knots for all ϕj(·)'s. Here we allow different numbers of knots for different ϕj(·)'s. This relaxation brings two benefits: on the one hand, we can employ different smoothing parameters to accommodate different degrees of smoothness of each additive function ϕj; on the other hand, we can under-smooth the nuisance functions in our hypothesis testing problems and obtain the Wilks result. Specifically, without loss of generality, assume that W = (W1, . . . , WJ)′ takes values in W = [0, 1]^J, and let Wi = (Wi1, . . . , WiJ)′. For approximating the function ϕj(·) (j = 1, . . . , J), we need a knot sequence ξj = {ξ_{j,i}}_{i=0}^{K_j+1} such that 0 = ξ_{j,0} < ξ_{j,1} < · · · < ξ_{j,K_j} < ξ_{j,K_j+1} = 1. Let I_{j,i} = [ξ_{j,i}, ξ_{j,i+1}) for i = 0, . . . , K_j − 1, and I_{j,K_j} = [ξ_{j,K_j}, ξ_{j,K_j+1}]. Then {I_{j,i}}_{i=0}^{K_j} is a partition of [0, 1]. The number of knots can be chosen from the data such that K_j = K_{j,n} → ∞ as n → ∞. The space of polynomial splines of order ℓ_j (degree ℓ_j − 1) with knot sequence ξ_j, denoted by S(ℓ_j, ξ_j), consists of functions s(·) satisfying:

(i) for 0 ≤ i ≤ K_j, s(·) is a polynomial of order ℓ_j (degree ℓ_j − 1) on I_{j,i};

(ii) for ℓ_j ≥ 2, s(·) is ℓ_j − 2 times continuously differentiable on [0, 1].

Since S(ℓ_j, ξ_j) is a q_j-dimensional linear space with q_j = K_j + ℓ_j, for any ϕ_{nj}(·) ∈ S(ℓ_j, ξ_j) there exists a local basis {B_{j,i}(·)}_{i=1}^{q_j} of S(ℓ_j, ξ_j) such that ϕ_{nj}(w_j) = ∑_{i=1}^{q_j} b_{ji} B_{j,i}(w_j) for j = 1, . . . , J (Schumaker, 1981, page 124). For example, for fixed ξ_j and ℓ_j, let
$$
B_{j,i}(w) = (\xi_{j,i} - \xi_{j,i-\ell_j})\, [\xi_{j,i-\ell_j}, \ldots, \xi_{j,i}]\, (\xi - w)_{+}^{\ell_j - 1}, \qquad i = 1, \ldots, q_j,
$$
where [ξ_{j,i−ℓ_j}, . . . , ξ_{j,i}]g denotes the ℓ_j-th order divided difference of the function g (here applied in the variable ξ) and ξ_{j,i} = ξ_{j,min{max(i,0),K_j+1}} for any i = 1 − ℓ_j, . . . , q_j. Then {B_{j,i}(·)}_{i=1}^{q_j} forms a local basis, which is the normalized B-spline basis (de Boor, 1978) and satisfies
$$
B_{j,i}(w) = 0 \ \text{unless } \xi_{j,i} < w < \xi_{j,i+\ell_j}, \qquad B_{j,i}(w) \ge 0, \qquad \sum_{i=1}^{q_j} B_{j,i}(w) = 1. \tag{2.3}
$$

Let w = (w_1, . . . , w_J)′, b_j = (b_{j1}, . . . , b_{jq_j})′, b = (b′_1, . . . , b′_J)′, B_j(w_j) = (B_{j,1}(w_j), . . . , B_{j,q_j}(w_j))′, and B(w) = (B′_1(w_1), . . . , B′_J(w_J))′. Then
$$
\phi_{nj}(w_j) = \mathbf{b}_j' B_j(w_j) \quad \text{and} \quad \phi_n(\mathbf{w}) \equiv \sum_{j=1}^{J} \phi_{nj}(w_j) = \mathbf{b}' B(\mathbf{w}).
$$
Replacing ϕ in (2.2) by ϕ_n, we obtain the logarithm of an approximated partial likelihood:
$$
\ell(\beta, \mathbf{b}) = \sum_{i=1}^{n} \delta_i \Big[ \beta' X_i + \phi_n(W_i) - \log \sum_{k \in R_i} \exp\{\beta' X_k + \phi_n(W_k)\} \Big]. \tag{2.4}
$$
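To make the construction concrete, the following Python sketch builds a quantile-knot normalized B-spline design for one component and evaluates the approximated log partial likelihood (2.4). The function names, the use of scipy's BSpline.design_matrix, and the BFGS optimizer are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.interpolate import BSpline
from scipy.optimize import minimize

def quantile_knot_basis(w, n_knots, order=4):
    """Design matrix B_j for one component: K_j = n_knots interior knots at
    sample quantiles of w in [0, 1], spline order l_j = order (degree l_j - 1)."""
    degree = order - 1
    interior = np.quantile(w, np.linspace(0, 1, n_knots + 2)[1:-1])
    t = np.r_[np.zeros(order), interior, np.ones(order)]   # clamped knot vector
    x = np.clip(w, 0.0, 1.0 - 1e-12)                       # keep evaluation in [0, 1)
    return BSpline.design_matrix(x, t, degree).toarray()   # n x (K_j + l_j) = n x q_j

def neg_log_partial_lik(theta, X, Z, S, delta):
    """Negative log approximated partial likelihood (2.4); theta = (beta, b),
    Z = [B_1(W_1), ..., B_J(W_J)] is the stacked spline design matrix."""
    d = X.shape[1]
    eta = X @ theta[:d] + Z @ theta[d:]
    idx = np.argsort(-S)                                   # decreasing event times
    eta_s, delta_s = eta[idx], delta[idx]
    c = eta_s.max()                                        # stabilized log-cumsum-exp:
    log_risk = c + np.log(np.cumsum(np.exp(eta_s - c)))    # log sum over risk set R_i
    return -np.sum(delta_s * (eta_s - log_risk))

def fit(X, Z, S, delta):
    """Maximize (2.4); centering of the spline part follows as in Section 2."""
    theta0 = np.zeros(X.shape[1] + Z.shape[1])
    return minimize(neg_log_partial_lik, theta0,
                    args=(X, Z, S, delta), method="BFGS").x
```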


Let the maximizer of (2.4) be (β̂, b̂). Since the ϕj's can only be identified up to an additive constant, we assume E{δϕj(Wj)} = 0, or equivalently E{ϕj(Wj) | δ = 1} = 0, and center the estimators of the ϕj's as in Huang (1999). Specifically, let
$$
\hat\phi_j^*(w_j) = \hat{\mathbf{b}}_j' B_j(w_j), \qquad
\bar\phi_j^* = \sum_{i=1}^n \delta_i \hat\phi_j^*(W_{ij}) \Big/ \sum_{i=1}^n \delta_i,
$$
$$
\hat\phi^*(\mathbf{w}) = \hat{\mathbf{b}}' B(\mathbf{w}), \qquad
\bar\phi^* = \sum_{i=1}^n \delta_i \hat\phi^*(W_i) \Big/ \sum_{i=1}^n \delta_i.
$$
Then ϕj(wj) is estimated by ϕ̂j(wj) = ϕ̂*_j(wj) − ϕ̄*_j, and ϕ(w) by ϕ̂(w) = ∑_{j=1}^J ϕ̂j(wj) = ϕ̂*(w) − ϕ̄*.

Under mild conditions, the cumulative baseline hazard function Λ0(t) = ∫_0^t λ0(u) du is estimated by the Breslow estimator (Breslow, 1972, 1974)
$$
\hat\Lambda_0(t) = \int_0^t \Big[ \sum_{i=1}^n Y_i(u) \exp\{\hat\beta' X_i + \hat\phi(W_i)\} \Big]^{-1} \sum_{i=1}^n dN_i(u),
$$
where Yi(u) = I(Si ≥ u) and Ni(u) = I(Si ≤ u, δi = 1).

When there is no X-variable, the above approach reduces to the polynomial spline estimators in Stone (1986), for which the local asymptotics is still unavailable. Computationally, the maximization problem (2.4) can easily be implemented with existing Cox regression programs; see Section 7.2. In the next section we establish the local asymptotic normality of ϕ̂ and of ϕ̂j for j = 1, . . . , J. Several appealing properties of these estimators will be revealed.
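A minimal sketch of the Breslow estimator above, assuming the fitted linear predictors η̂_i = β̂′X_i + ϕ̂(W_i) are already available; the function name and grid interface are ours.

```python
import numpy as np

def breslow_cumhaz(t_grid, S, delta, eta_hat):
    """Breslow estimator of Lambda_0(t), evaluated on t_grid."""
    risk = np.exp(eta_hat)
    event_times = np.sort(S[delta == 1])
    # Denominator: total risk score of subjects still at risk at each event time.
    denom = np.array([risk[S >= u].sum() for u in event_times])
    jumps = 1.0 / denom                      # each event contributes 1/denominator
    return np.array([jumps[event_times <= t].sum() for t in t_grid])
```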

3 Notation and Conditions

The following regularity conditions are needed for our theoretical results.

Condition (A):

(A1) Let Y(s) = I(S ≥ s). Then P(Y(τ) = 1) > 0, where τ is the end point of the observation period for the event time S.

(A2) Denote by Q_n(w) the empirical distribution function of {W_i}_{i=1}^n. Let Q(w) be the distribution function of W_i, which has a positive continuous density q(w) on its support W.

(A3) Let r*(s) = Y(s) exp{β′X + ϕ(W)} and f(w, s) = E{r*(s) | W = w}. Assume that

(i) 0 < E{inf_{s∈[0,τ]} r*(s)} and E{sup_{s∈[0,τ]} r*(s)} < ∞;

(ii) ∫_0^τ E{r*(s) − f(W, s)}² dΛ_0(s) < ∞, and f(w, s) is bounded away from zero and infinity and is equicontinuous in (w, s).


Condition (A1) is commonly used in the literature. In Condition (A2), a positive design density q(w) on its support is required, which ensures that there are enough data points for smoothing. Condition (A3) is mild. Since P(Y(τ) = 1) > 0 and Y(τ) exp{β′X + ϕ(W)} ≤ r*(s) ≤ exp{β′X + ϕ(W)}, it is trivial to show that Condition (A3)(i) is weaker than 0 < E[exp{β′X + ϕ(W)}] < ∞. Note that E{r*(s) − f(W, s)}² ≤ Var{r*(s)}, so Condition (A3)(ii) holds if Var{r*(s)} < ∞.

The following Condition (B) was used in Huang (1999) and is listed here for convenience. Under Condition (B), the estimator of β is √n-consistent, so that each ϕj(·) can be estimated as well as if β were known (Opsomer and Ruppert, 1999; Jiang et al., 2007).

Condition (B):

(B1) (i) The regression parameter β belongs to an open subset (not necessarily bounded) of R^d, and each ϕ_j lies in A_j for j = 1, . . . , J, where A_j is the class of functions ϕ_j on [0, 1] whose ℓ_j-th derivative exists and satisfies the following Lipschitz condition of order α_j:
$$
|\phi_j^{(\ell_j)}(s) - \phi_j^{(\ell_j)}(t)| \le C|s - t|^{\alpha_j} \quad \text{for } s, t \in [0, 1],
$$
where α_j ∈ (0, 1] satisfies p_j = ℓ_j + α_j > 0.5. Let p = min_{j=1,...,J} p_j.

(ii) E(δX) = 0 and E{δϕj(Wj)} = 0, 1 ≤ j ≤ J.

(B2) The failure time T and the censoring time C are conditionally independent given the covariate (X, W).

(B3) (i) Only the observations for which the event time S_i (1 ≤ i ≤ n) lies in a finite interval [0, τ], say, are used in the partial likelihood, and the cumulative baseline hazard satisfies Λ_0(τ) = ∫_0^τ λ_0(s) ds < ∞. (ii) The covariate X takes values in a bounded subset of R^d, and the covariate W takes values in W.

(B4) There exists a small positive constant ε such that (i) P(δ = 1 | X, W) > ε and (ii) P(C > τ | X, W) > ε, almost surely with respect to the probability measure of (X, W).

(B5) Let 0 < c_1 < c_2 < ∞ be two constants. The joint density f(t, w, δ = 1) of (S, W, δ = 1) satisfies c_1 ≤ f(t, w, δ = 1) < c_2 for all (t, w) ∈ [0, τ] × W.

(B6) For a positive integer q ≥ 1, assume that the q-th partial derivative of the joint density f(t, x, w, δ = 1) of (S, X, W, δ = 1) with respect to t or w exists and is bounded. [For a discrete covariate X, f(t, x, w, δ = 1) is defined to be (∂²/∂t∂w) P(S ≤ t, X = x, W_1 ≤ w_1, . . . , W_J ≤ w_J, δ = 1).]


(B7) Let K_j ≡ K_{j,n} be a positive integer such that K_{j,n} = O(n^v) for 0.25/p < v < 0.5. Assume that h̄ = O(n^{-v}).

(B8) The information matrix I(β) for estimation of β is positive definite, where I(β) wasdefined in Theorem 3.1 of Huang (1999).

Note that the condition 0.25/p < v < 0.5 in (B7) is used to ensure that β̂ is √n-consistent (Theorem 3.2 of Huang, 1999).

4 Local asymptotics

Let h_{j,i} = ξ_{j,i+1} − ξ_{j,i}, h_j ≡ max_{0≤i≤K_j} |h_{j,i}|, h̲ = min_{1≤j≤J} h_j, and h̄ = max_{1≤j≤J} h_j. Assume that
$$
\max_{1 \le i \le K_j} |h_{j,i} - h_{j,i-1}| = o(K_j^{-1}) \quad \text{and} \quad h_j \Big/ \min_{0 \le k \le K_j} h_{j,k} \le M,
$$
where M > 0 is a predetermined constant. This condition was employed for univariate nonparametric regression models by Zhou et al. (1998). Under this condition, we have M^{-1} < K_j h_j < M, the condition required for numerical computation. Throughout this paper, denote by A^{⊗2} the matrix AA′ for any vector or matrix A. Put N_i(t) = I(S_i ≤ t, δ_i = 1) and N(t) = ∑_{i=1}^n N_i(t). Let

F_i(t) = σ{N_i(s), Y_i(s+), X_i, W_i, δ_i, s ≤ t}
represent the failure-time, censoring and covariate information for the i-th subject up to time t. Then
$$
M_i(t) = N_i(t) - \int_0^t r_i^*(s) \lambda_0(s)\, ds \tag{4.5}
$$
is an orthogonal local square-integrable martingale with respect to F_i(t), such that ⟨M_i(t), M_j(t)⟩ = 0 for i ≠ j (Kalbfleisch and Prentice, 1980; Fleming and Harrington, 1991), where r_i^*(s) = Y_i(s) exp{β′X_i + ϕ(W_i)}. Let F(t) = ∪_{i=1}^n F_i(t) be the smallest σ-algebra containing the F_i(t). Then M(t) = ∑_{i=1}^n M_i(t) is a martingale with respect to F(t). For k = 0, 1, 2, let R_k^*(s) = E[B(W)^{⊗k} Y(s) exp{β′X + ϕ(W)}],
$$
\Sigma_0 = \int_0^\tau \big[ R_2^*(s)/R_0^*(s) - \{R_1^*(s)/R_0^*(s)\}^{\otimes 2} \big] R_0^*(s)\, d\Lambda_0(s),
$$
$$
\xi_{n1} = n^{-1} \sum_{i=1}^n \int_0^\tau \big\{ B(W_i) - R_1^*(s)/R_0^*(s) \big\}\, dM_i(s).
$$

Since Σ_0 is positive definite (see Lemma 8), it has an inverse Σ_0^{-1}. Let e_j^* be the J × J diagonal matrix with j-th diagonal element 1 and 0's elsewhere, and let I_{q_j} be the q_j × q_j identity matrix. Set e_j = e_j^* ⊗ I_{q_j}, where ⊗ denotes the Kronecker product of matrices. Let


ℓ = min_{1≤j≤J} ℓ_j. The notation a_n ≍ b_n means that a_n and b_n are of the same order. Then we have the following uniform Bahadur representation.

Theorem 1. Under Conditions (A) and (B), if ϕ_j(w_j) is ℓ ≥ 2 times continuously differentiable on [0, 1], n h̲³ → ∞, and h̲ ≍ h̄, then
$$
\hat\phi(\mathbf{w}) - \phi(\mathbf{w}) - \alpha(\mathbf{w}) = B'(\mathbf{w}) \Sigma_0^{-1} \xi_{n1} + o_p(\bar h^{\ell} + 1/\sqrt{n\bar h}),
$$
$$
\hat\phi_j(w_j) - \phi_j(w_j) - \alpha_j(w_j) = B'(\mathbf{w})\, e_j\, \Sigma_0^{-1} \xi_{n1} + o_p(\bar h^{\ell} + 1/\sqrt{n\bar h}),
$$
uniformly in w ∈ W, where α(w) = ∑_{j=1}^J α_j(w_j),
$$
\alpha_j(w_j) = -\sum_{i=0}^{K_j} \frac{1}{\ell_j!}\, h_{j,i}^{\ell_j}\, \phi_j^{(\ell_j)}(w_j)\, B_{\ell_j}^*\Big(\frac{w_j - \xi_{j,i}}{h_{j,i}}\Big)\, I(w_j \in I_{j,i}),
$$
and B_ℓ^*(·) is the Bernoulli polynomial defined inductively by
$$
B_0^*(x) = 1, \qquad B_\ell^*(x) = \int_0^x \ell\, B_{\ell-1}^*(z)\, dz + b_\ell^*,
$$
with b_ℓ^* = −ℓ ∫_0^1 ∫_0^x B_{ℓ−1}^*(z) dz dx being the ℓ-th Bernoulli number.

The above uniform Bahadur representation is useful for establishing the local asymptotic distribution of ϕ̂_j(·) and for statistical inference about the additive components. In Section 6, we derive the asymptotic distributions of the specification test using this Bahadur representation. Although this representation is established for model (1.1) with the number of additive components J fixed, the result can be extended to models with J diverging as n goes to ∞, if the partial likelihood in (2.4) is penalized using the group lasso (Yuan and Lin, 2006; Ma et al., 2015) or the elastic net (Zou and Hastie, 2005), among others. This should facilitate statistical inference after model selection for Cox-type models in high-dimensional settings, but will be investigated in our next project.

Let q = ∑_{j=1}^J q_j. Then Σ_0 is a q × q matrix with q → ∞. Partition Σ_0 and Σ_0^{-1} into J × J block matrices with j-th diagonal blocks of size q_j × q_j. For i, j = 1, . . . , J, let Σ_{0,ij} and Σ_0^{ij} be the (i, j)-th blocks of Σ_0 and Σ_0^{-1}, respectively, and let σ_{n,j}(w_j) = {n^{-1} B_j'(w_j) Σ_0^{jj} B_j(w_j)}^{1/2} and σ_n(w) = {n^{-1} B'(w) Σ_0^{-1} B(w)}^{1/2}. The following theorem describes the asymptotic normality of the polynomial spline estimators.

Theorem 2. Under the conditions of Theorem 1, if ϕ_j(w_j) is ℓ ≥ 2 times continuously differentiable on [0, 1] and n h̲³ → ∞, then

(i) {ϕ̂_j(w_j) − ϕ_j(w_j) − α_j(w_j)}/σ_{n,j}(w_j) →_D N(0, 1), and

(ii) {ϕ̂(w) − ϕ(w) − α(w)}/σ_n(w) →_D N(0, 1),


where σ_{n,j}(w_j) ≍ (nh̄)^{-1/2} and σ_n(w) ≍ (nh̄)^{-1/2}, uniformly in w.

Corollary 1. Under the conditions of Theorem 2, if J = 1, then
$$
\{\hat\phi_1(w_1) - \phi_1(w_1) - \alpha_1(w_1)\}/\sigma_{n,1}(w_1) \xrightarrow{D} N(0, 1).
$$

Remark 1. Theorem 2 shows that the asymptotic bias α_j(w_j) of ϕ̂_j(w_j) has the same form as that of the regression spline estimator in Zhou et al. (1998) and does not depend on the design distribution Q(w). This reflects that the polynomial spline estimators are design-adaptive: the bias does not depend on the design density, a property shared by local polynomial regression (Fan, 1992). The asymptotic variance of ϕ̂_j(w_j) is σ²_{n,j}(w_j) ≍ (nh̄)^{-1}, which suggests that ϕ̂_j(w_j) is √(nh̄)-consistent.

By Theorem 2, the asymptotic mean squared error of the estimator ϕ̂_j can be defined as
$$
\mathrm{AMSE}\{\hat\phi_j(w_j)\} = \alpha_j^2(w_j) + \sigma_{n,j}^2(w_j).
$$
If ξ_{j,i} ≤ w_j < ξ_{j,i+1} (i = 0, . . . , K_j) and h_{j,i} = h_j = h, then
$$
\mathrm{AMSE}\{\hat\phi_j(w_j)\} = h^{2\ell}\Big\{\frac{1}{\ell!}\,\phi_j^{(\ell)}(w_j)\, B_\ell^*\Big(\frac{w_j - \xi_{j,i}}{h}\Big)\Big\}^2 + \frac{1}{n}\, B_j'(w_j)\, \Sigma_0^{jj}\, B_j(w_j).
$$
Minimizing the above AMSE over h, one obtains the theoretically optimal value of h at the order of n^{-1/(2ℓ+1)}, which is the same order as the optimal bandwidth for kernel smoothing (Fan and Gijbels, 1996).
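To see where this rate comes from, write AMSE(h) = C₁h^{2ℓ} + C₂/(nh), where C₁ and C₂ are shorthand constants absorbing the squared-bias and variance factors displayed above; one differentiation then gives the stated order.

```latex
% Treat AMSE(h) = C_1 h^{2\ell} + C_2/(nh); setting the derivative to zero:
\frac{d}{dh}\left\{ C_1 h^{2\ell} + \frac{C_2}{nh} \right\}
  = 2\ell C_1 h^{2\ell - 1} - \frac{C_2}{n h^{2}} = 0
\quad\Longrightarrow\quad
h_{\mathrm{opt}} = \left( \frac{C_2}{2\ell C_1\, n} \right)^{1/(2\ell+1)}
                 = O\!\left( n^{-1/(2\ell+1)} \right).
```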

Theorem 3. Under the conditions of Theorem 2,
$$
\lim_{C \to \infty}\, \limsup_{n \to \infty}\, \sup_{\mathbf{w} \in \mathcal{W}} P\big\{ |\hat\phi(\mathbf{w}) - \phi(\mathbf{w})| \ge C(\bar h^{\ell} + 1/\sqrt{n \bar h}) \big\} = 0.
$$

This extends the result of Corollary 3.1 in Huang (2003) to the current situation. In light of Theorem 3, if n h̄^{2ℓ+1} → 0, then √(nh̄)|ϕ̂(w) − ϕ(w)| = O_p(1) uniformly in w ∈ W, and Theorem 2 can be interpreted as follows: the asymptotic normality holds for all w ∈ W.

Remark 2. If h = cn^{-1/(2ℓ+1)} for some c > 0, then, by Theorem 3, ϕ̂_j(w_j) − ϕ_j(w_j) = O_p(n^{-ℓ/(2ℓ+1)}) uniformly for w_j ∈ [0, 1]. This indicates that the nonparametric part of the estimation enjoys a nice property, freedom from boundary effects (Gasser and Müller, 1984), shared with the local polynomial estimation for hazard regression with one covariate (Fan et al., 1997). A similar property was also revealed for the regression spline estimation in Zhou et al. (1998).

The following result indicates that the estimator of the cumulative baseline hazard is uniformly consistent.

Theorem 4. Under the conditions of Theorem 2, we have
$$
\sup_{t \in [0, \tau]} |\hat\Lambda_0(t) - \Lambda_0(t)| = o_p(1). \tag{4.6}
$$


The definition of σ²_n(w) suggests the following plug-in estimator:
$$
\hat\sigma_n^2(\mathbf{w}) = n^{-1} B'(\mathbf{w})\, \hat\Sigma_0^{-1} B(\mathbf{w}),
$$
where Σ̂_0 = ∫_0^τ [R_{n2}(s)/R_{n0}(s) − {R_{n1}(s)/R_{n0}(s)}^{⊗2}] R_{n0}(s) dΛ̂_0(s) and, for k = 0, 1, 2, R_{nk}(s) = n^{-1} ∑_{i=1}^n B(W_i)^{⊗k} Y_i(s) exp{β̂′X_i + ϕ̂(W_i)}. Similarly, σ²_{n,j}(w_j) is estimated by
$$
\hat\sigma_{n,j}^2(w_j) = n^{-1} B_j'(w_j)\, \hat\Sigma_0^{jj} B_j(w_j),
$$
where Σ̂_0^{jj} is the j-th diagonal block of Σ̂_0^{-1}. The following theorem shows that σ̂²_n(w) and σ̂²_{n,j}(w_j) are consistent.

Theorem 5. Under the conditions of Theorem 2, we have

(i) σ̂²_n(w) − σ²_n(w) = o_p{1/(nh̄)} and σ̂²_{n,j}(w_j) − σ²_{n,j}(w_j) = o_p{1/(nh̄)}, uniformly for w ∈ W;

(ii) {ϕ̂_j(w_j) − ϕ_j(w_j) − α_j(w_j)}/σ̂_{n,j}(w_j) →_D N(0, 1);

(iii) {ϕ̂(w) − ϕ(w) − α(w)}/σ̂_n(w) →_D N(0, 1).

Remark 3. Since the convergence rate of ϕ̂_j is √(nh̄) (see Theorem 2), the result of Theorem 5(i) indicates that the variance estimator of ϕ̂_j is consistent. This contrasts with estimating the variance of β̂, for which there is no direct consistent variance estimator in the literature (Huang, 1999).

By the proof of Theorem 2, σ²_n(w) = O_p{1/(nh̄)}. Since α_j = O(h̄^ℓ), the optimal h in the sense of minimizing AMSE(ϕ̂) is of order n^{-1/(2ℓ+1)}. With the asymptotic normality in Theorem 5, if h = o(n^{-1/(2ℓ+1)}) (undersmoothing), then {ϕ̂_j(w_j) − ϕ_j(w_j)}/σ̂_{n,j}(w_j) →_D N(0, 1), which can be used to construct pointwise confidence intervals for ϕ_j(w_j).
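As a sketch of such interval construction (the grid, the names, and the einsum evaluation are our assumptions; Sigma_jj_inv stands for the plug-in block Σ̂_0^{jj} of Theorem 5):

```python
import numpy as np
from scipy.stats import norm

def pointwise_ci(B_grid, Sigma_jj_inv, phi_hat_grid, n, level=0.95):
    """Pointwise CIs for phi_j using the plug-in variance
    sigma_hat^2_{n,j}(w) = n^{-1} B_j'(w) Sigma_hat_0^{jj} B_j(w).

    B_grid: (m, q_j) spline design at m grid points;
    phi_hat_grid: (m,) centered estimates on the same grid."""
    var = np.einsum("mi,ij,mj->m", B_grid, Sigma_jj_inv, B_grid) / n
    half = norm.ppf(0.5 + level / 2.0) * np.sqrt(var)
    return phi_hat_grid - half, phi_hat_grid + half
```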

5 Two-step estimation

The local asymptotic results in the previous theorems show that the estimator of each ϕ_j depends on the remaining ϕ_k's (k ≠ j). In the following we propose a two-step estimation method to remove this dependence. This estimation needs initial estimates of β and the ϕ_k's (k ≠ j), which are taken to be β̂ and ϕ̂_k from Section 2.

Consider estimating a particular ϕ_j(w_j) of interest. Regarding the remaining parameters as nuisance and replacing them by the corresponding initial estimators, similarly to (2.4) we obtain the logarithm of an approximated partial likelihood:
$$
\ell_j(\mathbf{b}_j) = \sum_{i=1}^n \delta_i \Big[ \hat\beta' X_i + \mathbf{b}_j' \tilde B_j(W_{ij}) + \hat\phi_{-j}(W_{i,-j}) - \log \sum_{k \in R_i} \exp\{\hat\beta' X_k + \mathbf{b}_j' \tilde B_j(W_{kj}) + \hat\phi_{-j}(W_{k,-j})\} \Big], \tag{5.7}
$$


where ϕ̂_{-j}(W_{i,-j}) = ∑_{k=1, k≠j}^J ϕ̂_k(W_{ik}), and B̃_j is defined similarly to B_j in Section 2 but with a new number of knots, q̃_j, so that the corresponding mesh size differs from before; we denote it by h̃_j to stress this difference. We also use α̃_j(·) in place of α_j(·) to reflect this change.

Denote by b̃_j the maximizer of (5.7). Let ϕ̃*_j(w_j) = b̃_j′ B̃_j(w_j) and ϕ̄*_j = ∑_{i=1}^n δ_i ϕ̃*_j(W_{ij}) / ∑_{i=1}^n δ_i. Similar to ϕ̂_j, we define the two-step estimator of ϕ_j as
$$
\tilde\phi_j(w_j) = \tilde\phi_j^*(w_j) - \bar\phi_j^*.
$$
Then the two-step estimator of ϕ(w) is simply ϕ̃(w) = ∑_{j=1}^J ϕ̃_j(w_j).

For k = 0, 1, 2, let R̃*_{kj}(s) = E[B̃_j(W_j)^{⊗k} Y(s) exp{β′X + ϕ(W)}],
$$
\tilde\Sigma_{0,jj} = \int_0^\tau \big[ \tilde R_{2j}^*(s)/\tilde R_{0j}^*(s) - \{\tilde R_{1j}^*(s)/\tilde R_{0j}^*(s)\}^{\otimes 2} \big]\, \tilde R_{0j}^*(s)\, d\Lambda_0(s),
$$
σ̃_{n,j}(w_j) = {n^{-1} B̃_j′(w_j) Σ̃_{0,jj}^{-1} B̃_j(w_j)}^{1/2}, and
$$
\tilde\xi_{n,j} = n^{-1} \sum_{i=1}^n \int_0^\tau \big\{ \tilde B_j(W_{ij}) - \tilde R_{1j}^*(s)/\tilde R_{0j}^*(s) \big\}\, dM_i(s).
$$
Then ξ̃_{n,j} is a martingale with mean zero and variance Σ̃_{0,jj}. The following theorem gives a uniform Bahadur representation of the two-step estimator and, as a natural by-product, its limiting distribution.

Theorem 6. Under the conditions of Theorem 1, if n h̃_j h̄^ℓ → 0, h̄ = o(h̃_j), and n h̃_j³ → ∞, then
$$
\tilde\phi_j(w_j) - \phi_j(w_j) - \tilde\alpha_j(w_j) = \tilde B_j'(w_j)\, \tilde\Sigma_{0,jj}^{-1}\, \tilde\xi_{n,j} + o_p(\tilde h_j^{\ell} + 1/\sqrt{n \tilde h_j}),
$$
uniformly in w_j ∈ [0, 1]. Furthermore,
$$
\{\tilde\phi_j(w_j) - \phi_j(w_j) - \tilde\alpha_j(w_j)\}/\tilde\sigma_{n,j}(w_j) \xrightarrow{D} N(0, 1).
$$

The results of Theorem 6 coincide with those of Theorems 1 and 2 in the J = 1 case where W is univariate. This indicates that the two-step estimation is essentially equivalent to an oracle method, in the sense that ϕ̃_j estimates ϕ_j as well as if β and the remaining ϕ_k's were all known. The result holds regardless of the (finite) dimension of W, so asymptotically there is no curse of dimensionality. This is a desirable property shared by other two-step estimation approaches (Fan and Zhang, 1999; Horowitz and Mammen, 2004; Jiang and Li, 2008; Liu, Yang and Härdle, 2013; Ma et al., 2015).

In Theorem 6, the undersmoothing condition h̄ = o(h̃_j) is imposed on the initial estimate. In practical implementations of the two-step estimation, the initial estimate ϕ̂_{-j} in the first stage should therefore differ from the best polynomial spline estimate, which is certainly not undersmoothed. When the same knots are used for the polynomial spline estimator ϕ̂_j in Theorem 2 and for the two-step estimator ϕ̃_j in the second stage, then α̃_j(w_j) = α_j(w_j) and Σ̃_{0,jj}^{-1} = Σ_{0,jj}^{-1}.


Since Σ_{0,jj}^{-1} ≤ Σ_0^{jj}, it is seen from Theorems 2 and 6 that ϕ̃_j is asymptotically more efficient than ϕ̂_j. The actual advantage of the two-step estimation depends on how far Σ_0 is from a block diagonal matrix. If the correlation, weighted by the risk function r*(s), between the variables B_i(W_i) approximating ϕ_i and B_j(W_j) approximating ϕ_j is zero for all i ≠ j, then the off-diagonal blocks of Σ_0 vanish and ϕ̂_j and ϕ̃_j have the same asymptotic distribution.

6 Nonparametric hypothesis testing

The uniform Bahadur representation of the global spline estimators in Theorem 1 facilitates statistical inference for the nonparametric additive components. For illustration we consider the following testing problem for ϕ_1:
$$
H_0: \phi_1(w_1) = \phi_{1,0}(w_1) \quad \text{versus} \quad H_1: \phi_1(w_1) \ne \phi_{1,0}(w_1).
$$

For this testing problem, there is no formal work using global spline estimation in the literature for any nonparametric model, including the nonparametric regression models considered in Zhou et al. (1998). When ϕ_{1,0}(·) = 0, the problem reduces to testing the significance of ϕ_1. The testing problem is a nonparametric null hypothesis versus a nonparametric alternative, because the nuisance parameters under H_0 are still nonparametric. Other testing problems, for example testing the significance of a group of variables, can be solved analogously.

We consider the intuitive discrepancy measures
$$
T_n = n h_1 \int_0^1 \{\hat\phi_1(w_1) - \phi_{1,0}(w_1)\}^2\, a(w_1)\, dw_1,
$$
$$
\tilde T_n = n \tilde h_1 \int_0^1 \{\tilde\phi_1(w_1) - \phi_{1,0}(w_1)\}^2\, a(w_1)\, dw_1,
$$
where a(·) is a bounded, nonnegative and integrable weighting function. Distance-based statistics of this kind were used for density estimation in Bickel and Rosenblatt (1973). They can be regarded as extensions of the Kolmogorov-Smirnov and Cramér-von Mises types of statistics. Other tests can be developed using our uniform Bahadur representations, such as the generalized likelihood ratio tests in Fan, Zhang and Zhang (2001) and Fan and Jiang (2005). While the generalized likelihood ratio tests have the Wilks property and asymptotic optimality in terms of rates of convergence for nonparametric hypothesis testing according to the formulations of Ingster (1993), the distance-based test has its own advantages, as advocated in Hong and Lee (2013). Let σ_n(u, v) = n^{-1} B_1'(u) Σ_0^{11} B_1(v) and σ̃_n(u, v) = n^{-1} B̃_1'(u) Σ̃_{0,11}^{-1} B̃_1(v). The following theorem gives the limiting null distributions of our test statistics, demonstrating that the Wilks phenomenon is still observed in the current situation.
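In practice T_n can be evaluated by numerical quadrature; the following is a sketch with the trapezoid rule (the grid and the callable interface are our assumptions):

```python
import numpy as np

def distance_statistic(phi1_hat, phi1_null, a, n, h1, grid):
    """T_n = n*h_1 * int_0^1 {phi1_hat - phi1_null}^2 a(w_1) dw_1,
    approximated on a grid over [0, 1]; phi1_hat, phi1_null, and a are
    vectorized callables."""
    integrand = (phi1_hat(grid) - phi1_null(grid)) ** 2 * a(grid)
    return n * h1 * np.trapz(integrand, grid)
```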


Theorem 7. Assume that the conditions of Theorem 1 hold. Under H_0, the following results hold:

(i) If n h_1^{2ℓ+1} → 0, then
$$
\sigma_n^{*-1}\{T_n - \mu_n^*(1 + o_p(1))\} \xrightarrow{D} N(0, 1),
$$
where μ*_n = n h_1 ∫_0^1 σ²_{n,1}(w_1) a(w_1) dw_1 = O(1) and
$$
\sigma_n^{*2} = 4(n h_1)^2 \int_0^1 \int_0^1 \sigma_n^2(u, v)\, a(u)\, a(v)\, du\, dv = O(h_1).
$$
Furthermore, letting r_n = 2μ*_n/σ*²_n, we have r_n T_n ∼_a χ²_{r_n μ*_n}.

(ii) If n h̃_1^{2ℓ+1} → 0 and h_1 = o(h̃_1), then the above result holds for T̃_n, but with h_1, σ²_{n,1}(w_1), and σ²_n(u, v) replaced by h̃_1, σ̃²_{n,1}(w_1), and σ̃²_n(u, v), respectively.

Theorem 7 provides an appropriate level-α test with rejection region
$$
R_n = \{T_n : T_n - \mu_n^* \ge z_\alpha \sigma_n^*\}.
$$
If we consider the contiguous alternative of the form
$$
H_{1n}: \phi_1(w_1) = \phi_{1,0}(w_1) + \gamma_n g(w_1) + o(\gamma_n),
$$
where γ_n → 0 as n → ∞, o(γ_n) is uniform in w_1 on [0, 1], and g is a continuous function such that ∫_0^1 g²(w_1) a(w_1) dw_1 > 0, then the local power of T_n (or T̃_n) can be approximated using the following theorem.

Theorem 8. Assume that the conditions of Theorem 7 hold. Under H_{1n}, the following results hold:

(i) If lim_{n→∞} n h_1 γ²_n = c with 0 < c < ∞, then
$$
s_n^{*-1}\{T_n - \mu_n^*(1 + o_p(1)) - c^*\} \xrightarrow{D} N(0, 1),
$$
where c* = c ∫_0^1 g²(w_1) a(w_1) dw_1 > 0 and s*²_n = σ*²_n + 4c h_1 b′Σ_0^{11} b = O(h_1), with b = ∫_0^1 g(w_1) a(w_1) B_1(w_1) dw_1.

(ii) If h_1 = o(h̃_1) and lim_{n→∞} n h̃_1 γ²_n = c with 0 < c < ∞, then the result in (i) still holds for T̃_n, with h_1, B_1, and Σ_0^{11} replaced by h̃_1, B̃_1, and Σ̃_{0,11}^{-1}, respectively.

Let z_{1−α} be the (1 − α)-th percentile of N(0, 1). By Theorems 7 and 8, the power of the test is approximated by
$$
P_{H_{1n}}(R_n) \approx 1 - \Phi\big(s_n^{*-1}\sigma_n^* z_{1-\alpha} - s_n^{*-1} c^*\big), \quad \text{as } n \to \infty,
$$


where Φ(·) is the standard normal distribution function. Hence, as n → ∞ the power goes to one, since s_n^{*-1} c* → +∞ and 0 ≤ s_n^{*-1}σ_n^* ≤ 1. This shows that the test is consistent. With the above alternative distribution, the asymptotic optimality of the test can be obtained using the argument for the generalized likelihood ratio test in Fan, Zhang and Zhang (2001) and Fan and Jiang (2005).

To implement the proposed test, we need the null distribution of T_n (or T̃_n). In principle the asymptotic null distribution of T_n (or T̃_n) can be used to determine the p-value, but it may not give a good approximation in a finite sample setting because of the low convergence rate in Theorem 7, a common phenomenon in nonparametric hypothesis testing (Bickel and Rosenblatt, 1973; Fan and Jiang, 2005; Hong and Lee, 2013). To deal with this difficulty, we propose the following bootstrap method to find the p-value. Let F_n be the empirical distribution of the observations {S_i, δ_i, W_i, X_i}_{i=1}^n. The bootstrap procedure is detailed as follows, with a code sketch after the list.

1. Draw a bootstrap sample {S*_i, δ*_i, W*_i, X*_i}_{i=1}^n from F_n.

2. Based on the bootstrap sample, fit model (1.1) to get the estimate of ϕ_1, denoted by ϕ̂*_1, using the same routine as for ϕ̂_1. Then compute the bootstrap version of the test statistic T_n:
$$
T_n^* = n h_1 \int_0^1 \{\hat\phi_1^*(w_1) - \hat\phi_1(w_1)\}^2\, a(w_1)\, dw_1.
$$

3. Repeat steps 1 and 2 to obtain a sample of T*_n's, say T_n^{*(k)}, k = 1, . . . , B.

4. Use the bootstrap sample {T_n^{*(k)}}_{k=1}^B to determine the quantiles of T_n.
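A minimal sketch of steps 1-4, assuming a fitting routine fit_phi1 and a statistic evaluator as placeholders for the procedures of Sections 2 and 6:

```python
import numpy as np

def bootstrap_pvalue(S, delta, W, X, fit_phi1, statistic, phi1_null, B=500, seed=0):
    """Bootstrap p-value for T_n following steps 1-4 above.

    fit_phi1(S, delta, W, X) returns the centered spline estimate of phi_1 on a
    fixed grid; statistic(f, g) evaluates n*h_1*int{f - g}^2 a(w_1) dw_1 on that
    grid; phi1_null is phi_{1,0} on the same grid. All three are placeholders."""
    rng = np.random.default_rng(seed)
    n = len(S)
    phi1_hat = fit_phi1(S, delta, W, X)              # estimate from the original data
    T_obs = statistic(phi1_hat, phi1_null)           # observed T_n
    T_boot = np.empty(B)
    for k in range(B):
        idx = rng.integers(0, n, size=n)             # step 1: resample from F_n
        phi1_star = fit_phi1(S[idx], delta[idx], W[idx], X[idx])
        T_boot[k] = statistic(phi1_star, phi1_hat)   # step 2: center at phi1_hat
    return (1 + np.sum(T_boot >= T_obs)) / (B + 1)   # steps 3-4: bootstrap p-value
```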

The above bootstrap method can obviously be modified for T̃_n, and we use T̃*_n to denote the bootstrap test statistic corresponding to T̃_n. The following theorem establishes the consistency of the proposed bootstrap method.

Theorem 9. Assume that the conditions of Theorem 7 hold. Then, under H_0, sup_t |P(T*_n < t | F_n) − P(T_n < t)| → 0 a.s. and sup_t |P(T̃*_n < t | F_n) − P(T̃_n < t)| → 0 a.s.

Although the asymptotic null distribution of T_n (or T̃_n) may not be well approximated in finite samples, the null distribution of T*_n (or T̃*_n) can be obtained by resampling. Theorem 9 ensures that the null distribution of T_n (or T̃_n) is well approximated by the conditional distribution of T*_n (or T̃*_n) given the original sample. See Figure 2 in the simulation section for the finite sample performance of the proposed test.


7 Numerical Studies

7.1 Selection of knots

To implement the estimation method in (2.4), we need to specify the locations of the knot sequences {ξ_{j,k}}_{k=1}^{K_{j,n}}. Theoretically, an asymptotically optimal knot placement can be derived from our asymptotic results in Section 4 by following an argument similar to that in Agarwal and Studden (1980). In practice, equally spaced knots and quantile knots are the methods usually used for placing the knots. Throughout this section we use the latter, which places the knots at sample quantiles of the variable and results in approximately the same number of observed values of the variable between any two adjacent knots. The numbers of knots {K_{j,n}, j = 1, . . . , J} are smoothing parameters chosen by minimizing the BIC:
$$
\mathrm{BIC} = -2 \log(\text{likelihood}) + \log(n)\Big\{d + \sum_{j=1}^J (K_{j,n} + \ell_j - 1)\Big\},
$$
where K_{j,n} and ℓ_j − 1 denote, respectively, the number of knots and the degree of the spline for estimating ϕ_j. For Cox-type models with a single index, λ(t|x) = λ_0(t) exp{ψ(β_0'x)}, Huang and Liu (2006) used the BIC and showed via simulations that polynomial spline estimation performs well in general. Other methods can be used to choose K_{j,n}, such as cross-validation and generalized cross-validation; for details, see O'Sullivan (1988), Hastie and Tibshirani (1990), and Nan et al. (2005).
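A sketch of this BIC search over knot numbers; fit_model is a placeholder standing for the maximization of (2.4) at given knot numbers, and the candidate grid and common spline order are our assumptions.

```python
import numpy as np
from itertools import product

def select_knots_bic(n, d, J, fit_model, K_grid=range(1, 9), order=4):
    """Choose (K_1, ..., K_J) minimizing the BIC above.

    fit_model(K) maximizes (2.4) with K[j] quantile knots and spline order
    l_j = order for each component and returns the maximized log partial
    likelihood; d is the dimension of the linear part."""
    best_bic, best_K = np.inf, None
    for K in product(K_grid, repeat=J):
        loglik = fit_model(K)
        dof = d + sum(Kj + order - 1 for Kj in K)   # d + sum_j (K_{j,n} + l_j - 1)
        bic = -2.0 * loglik + np.log(n) * dof
        if bic < best_bic:
            best_bic, best_K = bic, K
    return best_K
```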

7.2 Simulations

We conduct simulations to illustrate the nice performance of the polynomial spline smoother, to demonstrate that the proposed bootstrap method gives an accurate estimate of the distribution of our test statistic T_n, and to check the consistency and power of the proposed test. We also compare the polynomial spline smoother with other methods.

There are several estimation methods for model (1.1), for example polynomial spline estimation and kernel-based estimation, but there is no solid limiting theory for the kernel-based profile partial likelihood (PPL) estimation except for univariate ϕ(w) (J = 1) in Cai et al. (2007). It is well known that polynomial spline estimation is easy to compute and has very good finite sample performance. We compare the polynomial spline estimation with the two-step estimation and with the oracle one, which, when estimating a specific additive component, assumes the remaining components are known. For univariate ϕ(w), we also compare our method with the PPL method in Cai et al. (2007). Our estimators are easy to compute.


For each simulation, we use natural cubic splines without intercept and with degrees of freedom (df) between 3 and 20 for approximating each ϕ_j. With this option, the quantile knots method is used to place the df − 1 knots at sample quantiles of the variable. We employ the BIC criterion of Section 7.1 to select the number of knots for each ϕ_j. Our code is available upon request. Jiang and Jiang (2011) investigated the polynomial spline estimator of β in detail. To save space, we here focus on the estimation of the nonparametric components.

Example 1. (Bivariate ϕ) We sample data according to the following scheme. First, we generate the covariate X from the bivariate normal distribution with N(0, 1) marginals and correlation 0.5, and Z_1 and Z_2 independently from U(0, 1). Next, we set W_1 = Z_1 and W_2 = 0.5Z_1 + 0.5√3 Z_2, so that W_1 and W_2 have correlation 0.5. Then, we generate the failure time from an exponential distribution with hazard function
$$
\lambda(t; X, W) = \lambda_0 \exp\{\beta_1 X_1 + \beta_2 X_2 + \phi(W)\},
$$
where λ_0 = 1, β_1 = 0.6, β_2 = 0.4, and ϕ(w) = ϕ_1(w_1) + ϕ_2(w_2) with ϕ_1(w_1) = exp(2w_1) − 0.5{exp(2) − 1} and ϕ_2(w_2) = 0.5π sin(πw_2) − 1. Finally, given (X, W), we generate the censoring variable from U[0, 1], independent of the failure time. For this setting the censoring rate is about 30%.
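For reproducibility, a sketch of this data-generating scheme (the function name and the seed are ours):

```python
import numpy as np

def simulate_example1(n, seed=0):
    """One sample from the design of Example 1 (about 30% censoring)."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(np.array([[1.0, 0.5], [0.5, 1.0]]))
    X = rng.standard_normal((n, 2)) @ L.T            # N(0,1) marginals, corr 0.5
    Z1, Z2 = rng.uniform(size=n), rng.uniform(size=n)
    W = np.column_stack([Z1, 0.5 * Z1 + 0.5 * np.sqrt(3.0) * Z2])
    phi = (np.exp(2.0 * W[:, 0]) - 0.5 * (np.exp(2.0) - 1.0)
           + 0.5 * np.pi * np.sin(np.pi * W[:, 1]) - 1.0)
    rate = np.exp(0.6 * X[:, 0] + 0.4 * X[:, 1] + phi)   # constant hazard, lambda_0 = 1
    T = rng.exponential(1.0 / rate)                      # exponential failure times
    C = rng.uniform(0.0, 1.0, size=n)                    # independent U[0,1] censoring
    return np.minimum(T, C), (T <= C).astype(int), W, X
```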

Figure 1: Estimated functions with 95% confidence intervals. Top panel: polynomial spline estimation; bottom panel: oracle polynomial spline estimation. Solid: true; dashed: median; dot-dash: 2.5% and 97.5% percentiles.


To compare the polynomial spline estimator with the oracle one, we run 1,000 simulations; for each we draw a random sample of size n = 200 and compute the estimates. Figure 1 shows the median curves with envelopes formed by the pointwise 2.5% and 97.5% sample percentiles for the two estimation approaches. Both estimators perform well in this example, even though the confidence intervals of the oracle estimator are a little narrower. We also calculated the two-step polynomial spline estimator; as expected, its performance is quite close to that of the oracle estimator, so we omit the results to save space.

To investigate the difference between the distribution of T*_n and that of T_n, we run 1,000 simulations. For each simulation, we obtain three bootstrap samples (the results are almost the same if more bootstrap samples are drawn). In total we have 3,000 bootstrap samples, which give us 3,000 realized values of T*_n; a kernel density estimate then yields the distribution of T*_n. We also calculate the value of T_n in each simulation, obtaining 1,000 realized values of T_n and hence a kernel density estimate of the distribution of T_n. Figure 2 displays the estimated densities of T_n and T*_n at sample sizes n = 400 and n = 1200. The two densities nearly coincide. Therefore, it is reasonable to use the bootstrap method to approximate the null distribution of T_n for moderately large samples. The result agrees with Theorem 9.

Figure 2: Estimated densities. Left panel: n = 400; right panel: n = 1200. Solid: true; dotted: the bootstrap approximation.

Last, we check the power of our test. We evaluate the power for a sequence of alternative models indexed by γ,
$$
H_\gamma: \phi_1(w_1) = \exp(2w_1) - 0.5\{\exp(2) - 1\} + \gamma g(w_1),
$$
where g(w_1) = 2w_1 − 1 and γ = 0, 0.2, . . . , 0.8, 1. The alternative sequence ranges from the null model to models reasonably far from it. When γ = 0, the alternative becomes the null, and the


Table 1: Powers of the proposed test at significance level α

  n      α      γ=0     γ=0.2   γ=0.4   γ=0.6   γ=0.8   γ=1.0
  400    0.20   0.207   0.444   0.766   0.970   0.996   0.999
         0.10   0.092   0.278   0.618   0.909   0.988   0.999
         0.05   0.038   0.167   0.468   0.837   0.973   0.997
  800    0.20   0.215   0.614   0.955   0.998   1.000   1.000
         0.10   0.097   0.462   0.905   0.997   1.000   1.000
         0.05   0.043   0.338   0.840   0.989   1.000   1.000
  1200   0.20   0.203   0.684   0.982   1.000   1.000   1.000
         0.10   0.100   0.554   0.968   0.999   1.000   1.000
         0.05   0.045   0.426   0.944   0.997   1.000   1.000

corresponding power should be about the significance level, which indicates that the test holds its size. As γ increases, the alternative moves further from the null hypothesis, and the power grows. For a fixed value of γ, the rejection rate of the null hypothesis should increase to one as n goes to ∞, which indicates that the test is consistent. These phenomena are observed in Table 1, which exemplifies the nice performance of the proposed methodology. For this example, the two-step estimation and the corresponding test T̃_n have very similar performance, so the results are omitted to save space. ⋄

Example 2. In the previous example, the polynomial spline estimator is very close to the two-step estimator and the oracle estimator. One may wonder whether the two-step estimator is better only in its asymptotic theory. We here consider a setting similar to Example 1 but with W_1 = Z_1, W_2 = −(9/√19)Z_1 + Z_2, ϕ_1(w_1) = −8w_1(1 − w_1)(1 − 2w_1)(1 + w_1) − 2/15, and ϕ_2(w_2) = (π/2) sin(2πw_2). The correlation coefficient between W_1 and W_2 is then as high as −0.9. We run 1,000 simulations. For each simulation we draw a random sample of size n = 400 and calculate the polynomial spline estimates, the oracle estimates and the two-step estimates, using natural cubic B-splines with the number of knots selected by the BIC criterion for each of them. The initial estimator for the two-step estimator employs fifth-order polynomial spline estimation, so that it is undersmoothed. Figure 3 displays the estimated percentiles of the additive components. Clearly, the two-step estimator behaves like the oracle estimator and is significantly better than the polynomial spline estimator, since the envelopes formed by the 2.5th and 97.5th percentiles for the two-step estimator are much narrower. This reflects the two-step estimator's oracle property. ⋄


Figure 3: Estimated functions with 95% confidence intervals. Top panel: polynomial spline estimation; bottom panel: two-step estimation. Solid: true; dashed: median; dot-dash: 2.5% and 97.5% percentiles.


Figure 4: Estimated ϕ(·). Left panel: estimated curves with polynomial splines; right panel: estimated curves with the PPL. Solid: true; dotted: median; dot-dash: 2.5% and 97.5% percentiles.

Example 3. (Univariate ϕ) In this example, we compare our estimator with the PPL estimator in Cai et al. (2007). Since the PPL estimation can deal only with one-dimensional ϕ(·), we consider the following model:
$$
\lambda(t; X, W) = \lambda_0 \exp\{\beta_1 X_1 + \beta_2 X_2 + \phi(W)\},
$$
where λ_0 = 1, β_1 = 0.6, β_2 = 0.4, and ϕ(w) = −8w(1 − w²). The survival function is S(t) = exp[−tλ_0 exp{β_1X_1 + β_2X_2 + ϕ(W)}]. We generate W from U(0, 1) and X = (X_1, X_2) from a bivariate normal distribution with correlation coefficient 0.5 and N(0, 1) marginals. Given X and W, we generate the failure time T from the above survival function. The censoring time C is generated from U(0, 6), which produces about 63.2% censoring. The sample size is n = 200. We run 600 simulations and calculate both estimators of ϕ(·). Figure 4 displays the estimated ϕ(·). The two estimates are close except at the boundary. This is expected, since neither estimator dominates the other in univariate nonparametric regression. However, the computational advantage of the polynomial spline estimator is significant: for this example, its computing time is about 1% of that of the PPL estimator. In fact, the CPU time is 55.24 seconds for the former and 5216 seconds for the latter, on a personal laptop (with 8GB RAM and an Intel Core i5-5200U CPU@2.20GHz).

7.3 A real data example

We use the proposed methodology to analyze the Framingham Heart Study (FHS) data (Dawber, 1980). There are 1,571 observations and about 90.42% censoring in the dataset. We are interested in the failure time, measured from the time of the "age 45" exam to the


occurrence of coronary heart disease (CHD). The risk factors include age (at the "age 45" exam), gender, systolic blood pressure (SBP), body mass index (BMI), cholesterol level, waiting time, and cigarette smoking status. We first fit the data using model (1.1) with
$$
\mathbf{x} = (\text{Smoking status}, \text{Gender})' \quad \text{and} \quad \phi(\mathbf{w}) = \sum_{j=1}^{5} \phi_j(w_j),
$$
where w_1 = Cholesterol, w_2 = BMI, w_3 = SBP, w_4 = Age, and w_5 = Waiting time. This model includes the Cox proportional hazards model as a special case.

We use natural cubic splines without intercept and with degrees of freedom (df) between 1 and 10 for approximating each ϕ_j. With this option, the df − 1 knots are chosen according to the quantile knots method. Based on the BIC criterion of Section 7.1 for selecting the number of knots for each ϕ_j, we estimate the model parameters and functions and check the significance of each additive component using the proposed test statistic. Our test suggests that Age and Waiting time are not significant, because the p-values are 0.58 and 0.98, respectively. Since Age is a confounding variable, we remove the variable w_5, retain w_4 in the linear part, and fit the data using model (1.1) with
$$
\mathbf{x} = (\text{Age}, \text{Smoking status}, \text{Gender})' \quad \text{and} \quad \phi(\mathbf{w}) = \sum_{j=1}^{3} \phi_j(w_j),
$$
where w_1 = Cholesterol, w_2 = BMI, and w_3 = SBP. The BIC method automatically chooses df = 1 for each of these ϕ_j's, and the resulting estimates are shown in Figure 5. The two estimates are very close, which is expected because the sample size is large and both estimates are consistent. The effect of each variable is strictly increasing in the variable's level. We then consider testing H_0: ϕ_j = 0 versus H_1: ϕ_j ≠ 0 for j = 1, 2, 3.

The corresponding p-values are reported in Table 2, which indicates that all these functions are significant. This validates the use of Cox's model for the FHS data in Clegg et al. (1999).

Table 2: Significance tests of the ϕ_j's

  p-value   ϕ_1(·)   ϕ_2(·)   ϕ_3(·)
  T_n       0.003    0.054    0.004
  T̃_n       0.010    0.048    0.005

8 Discussion

We have studied the local asymptotics of the polynomial spline estimators of partly linear additive Cox models and hypothesis testing problems for the additive nonparametric components.


Figure 5: Estimated additive components for the FHS data, plotted against z = Cholesterol, z = BMI, and z = SBP. Dotted: polynomial spline estimates; solid: two-step estimates.

We have made determined efforts to establish the uniform Bahadur representation and the design-adaptive asymptotic normality of the polynomial spline estimators. We have also proposed a two-step estimation procedure for estimating the additive components and established its oracle property, in the sense that one component can be estimated as if all other components were known in advance. It has been demonstrated that the two-step estimators are more efficient. We have proposed a distance-based statistic for specification tests of the additive components. The asymptotic distributions of the proposed test have been obtained. We have also proposed a consistent bootstrap approach to calculate the p-value of the proposed test. Our simulations demonstrate the nice performance of the estimation methods and of the test. We have applied our approach to analyze the FHS data.

Since our approach is based on the partial likelihood of Cox models, it can essentially be extended to likelihood-based inference. The proposed methodology is also applicable to other models with polynomial spline estimation, for example generalized additive models and transformation models. An interesting project is to study quantile estimation of single-index models (Kong and Xia, 2007) and of transformation models (Chen, Jin and Ying, 2002; Ma and Kosorok, 2005; Chen and Tong, 2010; Lu and Zhang, 2010) using polynomial spline approximations to the nonparametric components. This is among our future projects.

Appendix. Proofs of Theorems

Throughout the proofs, for any column vector a, let ∥a∥ = (a′a)^{1/2} be the Euclidean norm, and for any matrix A, let ∥A∥ = sup{∥Ax∥ : ∥x∥ = 1} be the operator norm of A,


which reduces to the Euclidean norm when A is a column vector. Denote by λ_min(B) and λ_max(B) the minimum and maximum eigenvalues of a square matrix B, respectively. For any probability measure P, define L_2(P) = {f : ∫ f² dP < ∞}. Let ∥·∥_2 be the usual L_2-norm with respect to P, that is, ∥f∥_2 = (∫ f² dP)^{1/2}, and let ∥·∥_∞ denote the supremum norm. For k = 0, 1, 2, let
$$
V_{nk}(s, \mathbf{b}) = n^{-1} \sum_{i=1}^{n} B(W_i)^{\otimes k}\, Y_i(s) \exp\{\beta' X_i + \mathbf{b}' B(W_i)\}.
$$

For ease of notation, we introduce the following matrices:
$$
\Sigma_{n0} = \int_0^\tau \Big[ \frac{R_{n2}(s)}{R_{n0}(s)} - \Big\{\frac{R_{n1}(s)}{R_{n0}(s)}\Big\}^{\otimes 2} \Big] R_{n0}(s)\, d\Lambda_0(s),
$$
$$
\Sigma_{n1}(\mathbf{b}) = n^{-1} \sum_{i=1}^n \int_0^\tau \Big[ \frac{V_{n2}(s, \mathbf{b})}{V_{n0}(s, \mathbf{b})} - \Big\{\frac{V_{n1}(s, \mathbf{b})}{V_{n0}(s, \mathbf{b})}\Big\}^{\otimes 2} \Big]\, dM_i(s),
$$
$$
\Sigma_{n2}(\mathbf{b}) = \int_0^\tau \Big[ \frac{V_{n2}(s, \mathbf{b})}{V_{n0}(s, \mathbf{b})} - \Big\{\frac{V_{n1}(s, \mathbf{b})}{V_{n0}(s, \mathbf{b})}\Big\}^{\otimes 2} \Big] R_{n0}(s)\, d\Lambda_0(s).
$$

We denote the score function by U(β, b) = ∂ℓ(β, b)/∂b, and the Hessian matrix by Σ_n(b) = −n^{-1} ∂U(β, b)/∂b′. Then, by (4.5),
$$
\Sigma_n(\mathbf{b}) = n^{-1} \int_0^\tau \Big[ \frac{V_{n2}(s, \mathbf{b})}{V_{n0}(s, \mathbf{b})} - \Big\{\frac{V_{n1}(s, \mathbf{b})}{V_{n0}(s, \mathbf{b})}\Big\}^{\otimes 2} \Big]\, dN(s) = \Sigma_{n1}(\mathbf{b}) + \Sigma_{n2}(\mathbf{b}).
$$

To ease the arguments in the proofs, we introduce the centered versions of the variables, $\bar B(w) = B(w) - \sum_{j=1}^{n}\delta_j B(W_j)/\sum_{j=1}^{n}\delta_j$ and $\bar X_i = X_i - \sum_{j=1}^{n}\delta_j X_j/\sum_{j=1}^{n}\delta_j$. By (2.4), it is straightforward to verify that

$$\ell(\beta, b) = \sum_{i=1}^{n}\delta_i\Big[\beta'\bar X_i + b'\bar B(W_i) - \log\sum_{k\in R_i}\exp\{\beta'\bar X_k + b'\bar B(W_k)\}\Big]. \quad (A.1)$$
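As an illustration of (A.1), a minimal Python sketch is given below; `S`, `delta`, `X`, and `B` are hypothetical stand-ins for the observed times, event indicators, linear covariates, and spline-basis values, respectively.

import numpy as np

def log_partial_likelihood(beta, b, S, delta, X, B):
    # Centred log partial likelihood (A.1); the risk set is R_i = {k : S_k >= S_i}.
    dsum = delta.sum()
    Xc = X - (delta[:, None] * X).sum(axis=0) / dsum   # centred X-bar_i
    Bc = B - (delta[:, None] * B).sum(axis=0) / dsum   # centred B-bar(W_i)
    eta = Xc @ beta + Bc @ b
    ll = 0.0
    for i in np.flatnonzero(delta):                    # sum over events (delta_i = 1)
        risk = S >= S[i]
        ll += eta[i] - np.log(np.exp(eta[risk]).sum())
    return ll

Note that centering leaves the likelihood unchanged, since the centering constants cancel between the linear term and the logarithm of the risk-set sum.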

In the following, we present the proofs of our theorems. To streamline our arguments, we need some technical lemmas, which are relegated to the supplementary material.

Proof of Theorem 1. The proof consists of the following four steps.

(i) Taylor's expansion for the score function. Let $U(\beta, b) = \partial\ell(\beta, b)/\partial b$. Using (2.4) and (4.5), we obtain from the definition of $U(\beta, b)$ that

$$\begin{aligned} U(\beta, b) &= \sum_{i=1}^{n}\int_0^\tau B(W_i)\, dN_i(s) - \int_0^\tau \{V_{n1}(s,b)/V_{n0}(s,b)\}\, d\bar N(s)\\ &= \sum_{i=1}^{n}\int_0^\tau \{B(W_i) - V_{n1}(s,b)/V_{n0}(s,b)\}\, dM_i(s) + \sum_{i=1}^{n}\int_0^\tau \{B(W_i) - V_{n1}(s,b)/V_{n0}(s,b)\}\, r_i^*(s)\, d\Lambda_0(s)\\ &\equiv L_{n1}(b) + L_{n2}(b). \end{aligned} \quad (A.2)$$

Then, by the definition of $\hat\beta$ and $\hat b$ and the Taylor expansion, we have

$$0 = U(\hat\beta, \hat b) = U(\beta, \hat b) + \{\partial U(\beta, \hat b)/\partial\beta'\}(\hat\beta - \beta) + R^*_{n0}, \quad (A.3)$$

where $R^*_{n0}$ is defined as a $d \times 1$ vector whose $k$th component is $R^*_{n0,k} = 0.5(\hat\beta - \beta)'\{\partial^2 U_k(\beta^*, \hat b)/\partial\beta\,\partial\beta'\}(\hat\beta - \beta)$, with $\beta^*$ lying between $\hat\beta$ and $\beta$ and $U_k$ being the $k$th entry of $U$. Similarly, we have

$$U(\beta, \hat b) = U(\beta, b_0) + \{\partial U(\beta, b_0)/\partial b'\}(\hat b - b_0) + R^*_n, \quad (A.4)$$

where $b_0$ is defined in Lemma 1, and $R^*_n$ is defined as a $Jq_n \times 1$ vector whose $k$th element is $R^*_{n,k} = 0.5(\hat b - b_0)'\{\partial^2 U_k(\beta, b^*)/\partial b\,\partial b'\}(\hat b - b_0)$, with $b^*$ lying between $\hat b$ and $b_0$.

(ii) Asymptotic expression for $\hat\phi^* - \phi$. By Lemma 1, we have

$$\phi(w) - \phi^*_{n0}(w) + \alpha(w) = o_p(h^{\ell}) \quad (A.5)$$

uniformly for $w \in \mathcal{W}$, where $\phi^*_{n0}(w) = b_0' B(w)$ is given in Lemma 1. By Lemma 8(i), $\Sigma_n(b_0) = -n^{-1}\partial U(\beta, b_0)/\partial b'$ is positive definite for large $n$. Then, combining (A.3) and (A.4) leads to

$$\hat b - b_0 = n^{-1}\Sigma_n^{-1}(b_0) U(\beta, b_0) + r_{n0} + r_{n1} + r_n, \quad (A.6)$$

where $r_{n0} = n^{-1}\Sigma_n^{-1}(b_0)\{\partial U(\beta, \hat b)/\partial\beta'\}(\hat\beta - \beta)$, $r_{n1} = n^{-1}\Sigma_n^{-1}(b_0) R^*_{n0}$, and $r_n = n^{-1}\Sigma_n^{-1}(b_0) R^*_n$. Recalling that $\phi^*_{n0}(w) = b_0' B(w)$, we obtain from (A.2) and (A.6) that

$$\hat\phi^*(w) - \phi^*_{n0}(w) \equiv B'(w)(\hat b - b_0) = v_n(w) + \alpha_n(w) + B'(w) r_{n0} + B'(w) r_{n1} + B'(w) r_n, \quad (A.7)$$

where $v_n(w) = B'(w)\Sigma_n(b_0)^{-1} n^{-1} L_{n1}(b_0)$ and $\alpha_n(w) = B'(w)\Sigma_n(b_0)^{-1} n^{-1} L_{n2}(b_0)$.


By Lemma 9, $B'(w) r_n = o_p(h^{\ell} + 1/\sqrt{nh})$ and $B'(w) r_{n1} = o_p(h^{\ell} + 1/\sqrt{nh})$. By Lemma 11, $\alpha_n(w) = o_p(h^{\ell})$ uniformly for $w \in \mathcal{W}$. Since $\hat\beta - \beta = O_p(1/\sqrt{n})$, it is easy to show that $B'(w) r_{n0} = O_p(n^{-1/2})$. Therefore, by (A.7),

$$\hat\phi^*(w) - \phi^*_{n0}(w) = v_n(w) + o_p(h^{\ell} + 1/\sqrt{nh}), \quad (A.8)$$

uniformly in $w \in \mathcal{W}$. Then, by (A.5) and (A.8),

$$\hat\phi^*(w) - \phi(w) - \alpha(w) = v_n(w) + o_p(h^{\ell} + 1/\sqrt{nh}), \quad (A.9)$$

uniformly in $w \in \mathcal{W}$.

(iii) Asymptotic analysis for $v_n$. Let $v_n^* = n^{-1}\sum_{i=1}^{n}\int_0^\tau \{B(W_i) - V_{n1}(s, b_0)/V_{n0}(s, b_0)\}\, dM_i(s)$. Then

$$v_n(w) = B'(w)\Sigma_n^{-1}(b_0) v_n^*.$$

Note that $M_i(t)$ is a martingale and $B(W_i) - V_{n1}(s, b_0)/V_{n0}(s, b_0)$ is $\mathcal{F}_i(s)$-predictable. It follows that $E(v_n^*) = 0$ and

$$E\|v_n^*\|^2 = \mathrm{tr}\{E(v_n^{*\otimes 2})\} = n^{-1}\int_0^\tau E[\mathrm{tr}\{V_n(s)\}] R_{n0}(s)\, d\Lambda_0(s),$$

where $V_n(s) = V_{n2}(s, b_0)/V_{n0}(s, b_0) - \{V_{n1}(s, b_0)/V_{n0}(s, b_0)\}^{\otimes 2}$. Let

$$G_n(a, s) = \frac{E_{P_n}\{v^2(W) r_n^*(s)\}}{E_{P_n}\{r_n^*(s)\}} - \left[\frac{E_{P_n}\{v(W) r_n^*(s)\}}{E_{P_n}\{r_n^*(s)\}}\right]^2,$$

where $P_n$ is the empirical distribution function of $\{W_i, X_i, Y_i(s)\}_{i=1}^n$ and $r_n^*(s) = Y(s)\exp\{\beta' X + \phi^*_{n0}(W)\}$. Hence, for any vector $a$ such that $\|a\| = 1$, we have

$$a' V_n(s) a = G_n(a, s).$$

By Lemma 8(i), $a' V_n(s) a = G(a, s) + o(h)$, where $G(a, s)$ is the population version of $G_n(a, s)$. Then, by Lemma 4, $a' V_n(s) a = O(h)$, uniformly for $s \in [0, \tau]$, and hence the eigenvalues of $V_n(s)$ are all of order $O(h)$ and $\mathrm{tr}\{V_n(s)\} = O(1)$, uniformly for $s \in [0, \tau]$. Therefore, $E\|v_n^*\|^2 = O(1/n)$. Applying the Markov inequality, we obtain that $\|v_n^*\| = O_p(1/\sqrt{n})$. Let

$$\begin{aligned} \xi_{n1} &= n^{-1}\sum_{i=1}^{n}\int_0^\tau \{B(W_i) - R^*_1(s)/R^*_0(s)\}\, dM_i(s),\\ \xi_{n2} &= n^{-1}\sum_{i=1}^{n}\int_0^\tau \{R^*_1(s)/R^*_0(s) - R_{n1}(s)/R_{n0}(s)\}\, dM_i(s),\\ \xi_{n3} &= n^{-1}\sum_{i=1}^{n}\int_0^\tau \{R_{n1}(s)/R_{n0}(s) - V_{n1}(s, b_0)/V_{n0}(s, b_0)\}\, dM_i(s) \equiv n^{-1}\sum_{i=1}^{n}\int_0^\tau R_n(s)\, dM_i(s). \end{aligned}$$


Then

$$v_n(w) = B'(w)\Sigma_n^{-1}(b_0)(\xi_{n1} + \xi_{n2} + \xi_{n3}). \quad (A.10)$$

Since $M_i(t)$ is a martingale, we have

$$E\|\xi_{n2}\|^2 = \mathrm{tr}\{E(\xi_{n2}^{\otimes 2})\} = n^{-1} E\int_0^\tau \mathrm{tr}\{R(s)^{\otimes 2}\} R_{n0}(s)\, d\Lambda_0(s) = O(n^{-1})\int_0^\tau E[\mathrm{tr}\{R(s)^{\otimes 2}\}]\, d\Lambda_0(s),$$

where $R(s) = R^*_1(s)/R^*_0(s) - R_{n1}(s)/R_{n0}(s)$. By straightforward calculation, it is easy to show that

$$\sup_{s\in[0,\tau]} E[\mathrm{tr}\{R(s)^{\otimes 2}\}] = O\{1/(nh)\}.$$

Therefore, $E\|\xi_{n2}\|^2 = O\{1/(n^2 h)\}$ and

$$\|\xi_{n2}\| = O_p\{1/(n\sqrt{h})\}. \quad (A.11)$$

Since $R_n(s)$ is $\mathcal{F}_i(s)$-predictable,

$$E\|\xi_{n3}\|^2 = \mathrm{tr}\{E(\xi_{n3}^{\otimes 2})\} = \mathrm{tr}\Big[n^{-1}\int_0^\tau E\{R_n(s)^{\otimes 2} R_{n0}(s)\}\, d\Lambda_0(s)\Big] \le n^{-1}\int_0^\tau E[\mathrm{tr}\{R_n(s)^{\otimes 2}\}\,\mathrm{tr}\{R_{n0}(s)\}]\, d\Lambda_0(s).$$

Then, by Lemma 14 and Condition (A3),

$$E\|\xi_{n3}\|^2 = O(n^{-1} h^{2\ell})\int_0^\tau E\{R_{n0}(s)\}\, d\Lambda_0(s) = O(n^{-1} h^{2\ell}).$$

Hence,

$$\|\xi_{n3}\| = O_p(h^{\ell}/\sqrt{n}). \quad (A.12)$$

Applying Lemma 8, we get

$$\|B'(w)\Sigma_n^{-1}(b_0)\| = O_p(h^{-1}). \quad (A.13)$$

By (A.11), (A.12), (A.13), and the Hölder inequality (for $\ell \ge 2$),

$$|B'(w)\Sigma_n^{-1}(b_0)(\xi_{n2} + \xi_{n3})| \le \|B'(w)\Sigma_n^{-1}(b_0)\|\, \|\xi_{n2} + \xi_{n3}\| = o_p(1/\sqrt{nh}),$$

uniformly in $w \in \mathcal{W}$. Therefore, by (A.10),

$$v_n(w) = B'(w)\Sigma_n^{-1}(b_0)\xi_{n1} + o_p(1/\sqrt{nh}),$$


uniformly in $w \in \mathcal{W}$. Naturally, it can be written that

$$v_n(w) = B'(w)\Sigma_{n0}^{-1}\xi_{n1} + B'(w)\{\Sigma_n^{-1}(b_0) - \Sigma_{n0}^{-1}\}\xi_{n1} + o_p(1/\sqrt{nh}), \quad (A.14)$$

uniformly in $w \in \mathcal{W}$. By Lemma 8, $\|\Sigma_{n1}(b_0)\| = O_p(h/\sqrt{n})$. By Lemma 1, $|\phi(W_i) - \phi^*_{n0}(W_i)| = O(h^{\ell})$, uniformly for $i = 1, \ldots, n$. By Lemma 1, Condition (A3) and Taylor's expansion,

$$V_{nk}(s, b_0) - R_{nk}(s) = \frac{1}{n}\sum_{i=1}^{n} r_i^*(s) B(W_i)^{\otimes k}\{\phi^*_{n0}(W_i) - \phi(W_i)\} + O(h^{2\ell}), \ \text{a.s.} \quad (A.15)$$

uniformly for components and for $s \in [0, \tau]$. By (2.3), (A.15) and an argument similar to that for $L_{nk}(s)$ in Lemma 11, it is easy to show that, for $\ell \ge 2$, $R_{nk}(s) - V_{nk}(s, b_0) = o_p(h^{\ell+1})$ uniformly for components. Thus, $\Sigma_{n2}(b_0) - \Sigma_{n0} = o_p(h^{\ell+1})$, uniformly for components. This leads to $\|\Sigma_{n2}(b_0) - \Sigma_{n0}\| = o(h^{\ell})$. Hence, by Lemma 8, for $\ell \ge 2$,

$$\|\Sigma_n(b_0) - \Sigma_{n0}\| \le \|\Sigma_{n1}(b_0)\| + \|\Sigma_{n2}(b_0) - \Sigma_{n0}\| = O_p(h/\sqrt{n}) + o_p(h^{\ell}), \quad (A.16)$$

which, combined with Lemma 8, yields that $\|\Sigma_{n0}^{-1}\{\Sigma_n(b_0) - \Sigma_{n0}\}\| = o_p(1)$. Then, applying Lemma 12, we establish that

$$\Sigma_n^{-1}(b_0) = [I + \Sigma_{n0}^{-1}\{\Sigma_n(b_0) - \Sigma_{n0}\}]^{-1}\Sigma_{n0}^{-1} = \Sigma_{n0}^{-1} - \Sigma_{n0}^{-1}\{\Sigma_n(b_0) - \Sigma_{n0}\}\Sigma_{n0}^{-1} + \gamma_n\Sigma_{n0}^{-1}, \quad (A.17)$$

where $\|\gamma_n\| = O(\|\Sigma_{n0}^{-1}\{\Sigma_n(b_0) - \Sigma_{n0}\}\|^2)$. This, combined with (A.16) and Lemma 8, leads to $\|\gamma_n\| = O_p(1/n) + o_p(h^{2\ell-2})$. Then, for $\ell \ge 2$,

$$\|\gamma_n\Sigma_{n0}^{-1}\| = \{O_p(1/n) + o_p(h^{2\ell-2})\} h^{-1} = o_p(1/\sqrt{h}).$$

Note that $E(\xi_{n1}^{\otimes 2}) = n^{-1}\Sigma_0$. It follows that

$$E\{\mathrm{tr}(\xi_{n1}^{\otimes 2})\} = \mathrm{tr}\{E(\xi_{n1}^{\otimes 2})\} = n^{-1}\mathrm{tr}(\Sigma_0) \le q_n\lambda_{\max}(\Sigma_0)/n = O(1/n),$$

and hence

$$\|\xi_{n1}\| = O_p(1/\sqrt{n}). \quad (A.18)$$

Then, by (A.14) and (A.17), we have

$$v_n(w) = B'(w)\Sigma_{n0}^{-1}\xi_{n1} - B'(w)\Sigma_{n0}^{-1}\{\Sigma_n(b_0) - \Sigma_{n0}\}\Sigma_{n0}^{-1}\xi_{n1} + o_p(1/\sqrt{nh}), \quad (A.19)$$


uniformly in $w \in \mathcal{W}$. By Lemma 8, we have

$$\|B'(w)\Sigma_0^{-1}\| = O(h^{-1}), \quad \|\Sigma_{n0}^{-1}\| = O(h^{-1}), \quad \|B'(w)\Sigma_{n0}^{-1}\| = O(h^{-1}), \quad (A.20)$$

uniformly in $w \in \mathcal{W}$. Using Hölder's inequality, (A.16) and (A.20), we obtain that

$$\|B'(w)\Sigma_{n0}^{-1}\{\Sigma_n(b_0) - \Sigma_{n0}\}\Sigma_{n0}^{-1}\| \le \|B'(w)\Sigma_{n0}^{-1}\|\, \|\Sigma_n(b_0) - \Sigma_{n0}\|\, \|\Sigma_{n0}^{-1}\| = o_p(1/\sqrt{h}),$$

uniformly in $w \in \mathcal{W}$. This, combined with (A.18) and (A.19), leads to

$$v_n(w) = B'(w)\Sigma_{n0}^{-1}\xi_{n1} + o_p(1/\sqrt{nh}), \quad (A.21)$$

uniformly in $w \in \mathcal{W}$. By Lemmas 8 and 15, we have

$$\|(\Sigma_{n0} - \Sigma_0)\Sigma_0^{-1}\| \le \|\Sigma_0^{-1}\|\, \|\Sigma_{n0} - \Sigma_0\| = O_p(1/\sqrt{n}). \quad (A.22)$$

Combination of (A.20) and (A.22) produces

$$\|B'(w)(\Sigma_{n0}^{-1} - \Sigma_0^{-1})\| = \|B'(w)\Sigma_{n0}^{-1}(\Sigma_{n0} - \Sigma_0)\Sigma_0^{-1}\| \le \|B'(w)\Sigma_{n0}^{-1}\|\, \|(\Sigma_{n0} - \Sigma_0)\Sigma_0^{-1}\| = o_p(h^{-1/2}),$$

uniformly in $w \in \mathcal{W}$. Therefore, by (A.18), we have

$$\|B'(w)(\Sigma_{n0}^{-1} - \Sigma_0^{-1})\xi_{n1}\| = o_p(1/\sqrt{nh}), \quad (A.23)$$

uniformly in $w \in \mathcal{W}$. Combination of (A.21) and (A.23) yields

$$v_n(w) = B'(w)\Sigma_0^{-1}\xi_{n1} + o_p(1/\sqrt{nh}), \quad (A.24)$$

uniformly in $w \in \mathcal{W}$. Further, by (A.18) and (A.20), we have

$$v_n(w) = O_p(1/\sqrt{nh}), \quad (A.25)$$

uniformly in $w \in \mathcal{W}$.

(iv) Bahadur's representation. Using (A.9) and (A.24), we obtain that

$$\hat\phi^*(w) - \phi(w) - \alpha(w) = B'(w)\Sigma_0^{-1}\xi_{n1} + o_p(h^{\ell} + 1/\sqrt{nh}), \quad (A.26)$$

uniformly in $w \in \mathcal{W}$. Note that $\hat\phi_j^*(w_j) - \phi^*_{n0,j}(w_j) = B'(w) e_j(\hat b - b_0)$. Using the same argument as for (A.26), we obtain that

$$\hat\phi_j^*(w_j) - \phi_j(w_j) - \alpha_j(w_j) = B'(w) e_j\Sigma_0^{-1}\xi_{n1} + o_p(h^{\ell} + 1/\sqrt{nh}). \quad (A.27)$$


By (A.26), we have

$$I_n \equiv \sum_{i=1}^{n}\delta_i\{\hat\phi^*(W_i) - \phi(W_i)\}\Big/\sum_{i=1}^{n}\delta_i = I_{n1} + I_{n2} + o_p(h^{\ell} + 1/\sqrt{nh}),$$

where $I_{n1} = \sum_{i=1}^{n}\delta_i\alpha(W_i)/\sum_{i=1}^{n}\delta_i$ and

$$I_{n2} = \sum_{i=1}^{n}\delta_i B'(W_i)\Sigma_0^{-1}\xi_{n1}\Big/\sum_{i=1}^{n}\delta_i.$$

From the proof of Lemma 3, we see that $I_{n1} = o_p(h^{\ell} + 1/\sqrt{nh})$. By Lemma 8, $\|\Sigma_0^{-1}\| = O(h^{-1})$. By (2.3),

$$\sum_{i=1}^{n}\delta_i B_k(W_i)\Big/\sum_{i=1}^{n}\delta_i = O_p(h). \quad (A.28)$$

This, combined with (A.18) and (A.28), yields that $I_{n2} = O_p(1/\sqrt{n})$ and $I_n = o_p(h^{\ell} + 1/\sqrt{nh})$. Since $E\{\phi(W_i)\mid\delta_i\} = 0$, by the central limit theorem, we have

$$J_n \equiv \sum_{i=1}^{n}\delta_i[\phi(W_i) - E\{\phi(W_i)\mid\delta_i\}]\Big/\sum_{i=1}^{n}\delta_i = O_p(1/\sqrt{n}).$$

Naturally,

$$\hat\phi(w) - \phi(w) - \alpha(w) = \hat\phi^*(w) - \phi(w) - \alpha(w) - I_n - J_n = B'(w)\Sigma_0^{-1}\xi_{n1} + o_p(h^{\ell} + 1/\sqrt{nh}), \quad (A.29)$$

uniformly in $w \in \mathcal{W}$. Similarly, by (A.27) we have

$$\hat\phi_j(w_j) - \phi_j(w_j) - \alpha_j(w_j) = B'(w) e_j\Sigma_0^{-1}\xi_{n1} + o_p(h^{\ell} + 1/\sqrt{nh}). \quad (A.30)$$

Proof of Theorem 2. Define

$$\xi_{n1}(t) = n^{-1}\sum_{i=1}^{n}\int_0^t \{B(W_i) - R^*_1(s)/R^*_0(s)\}\, dM_i(s),$$

$$\Sigma_0(t) = \int_0^t \big[R^*_2(s)/R^*_0(s) - \{R^*_1(s)/R^*_0(s)\}^{\otimes 2}\big] R^*_0(s)\, d\Lambda_0(s),$$

$\phi_n(t) = B'(w)\Sigma_0^{-1}(t)\xi_{n1}(t)$, and $\phi_n = B'(w)\Sigma_0^{-1}\xi_{n1}$. Then $\xi_{n1}(\tau) = \xi_{n1}$, $\phi_n(\tau) = \phi_n$, and $\phi_n(t)$ is a martingale with respect to $\mathcal{F}(t)$ with mean zero and predictable variation process

$$\langle\phi_n(t)\rangle = n^{-1} B'(w)\Sigma_0^{-1}(t) B(w).$$


By the martingale central limit theorem (Fleming and Harrington, 1991), $\sigma_n^{-1}(w)\phi_n(\tau)$ is asymptotically standard normal, where

$$\sigma_n^2(w) = n^{-1} B'(w)\Sigma_0^{-1} B(w). \quad (A.31)$$

This, combined with (A.29) and Slutsky's theorem, yields that

$$[\hat\phi(w) - \phi(w) - \alpha(w)]/\sigma_n(w) \xrightarrow{D} N(0, 1). \quad (A.32)$$

Furthermore, by (A.31), $\sigma_n^2(w) \le n^{-1}\|B(w)\|^2\|\Sigma_0^{-1}\|$. By Lemma 8, $\sigma_n^2(w) = O\{1/(nh)\}$. Similarly, by (A.30),

$$[\hat\phi_j(w_j) - \phi_j(w_j) - \alpha_j(w_j)]/\sigma_{n,j}(w_j) \xrightarrow{D} N(0, 1)$$

with $\sigma_{n,j}^2(w_j) = n^{-1} B'(w) e_j\Sigma_0^{-1} e_j' B(w) = n^{-1} B_j'(w_j)\Sigma_0^{jj} B_j(w_j)$. By Lemma 13, there exist positive constants $C_1, C_2$ (independent of $w$) such that

$$C_1/(nh) \le \sigma_n^2(w) \le C_2/(nh). \quad (A.33)$$

Hence, $\sigma_n(w) \asymp (nh)^{-1/2}$, uniformly in $w$. Similarly, $\sigma_{n,j}(w_j) \asymp (nh)^{-1/2}$. This completes the proof of the theorem.

Proof of Theorem 3. Note that $E(\xi_{n1}) = 0$ and $\mathrm{var}(\xi_{n1}) = \Sigma_0/n$. Let $\phi_n(w) = B'(w)\Sigma_0^{-1}\xi_{n1}$. Then $E\{\phi_n(w)\} = 0$ and

$$\mathrm{var}\{\phi_n(w)\} = B'(w)\Sigma_0^{-1}\mathrm{var}(\xi_{n1})\Sigma_0^{-1} B(w) = n^{-1} B'(w)\Sigma_0^{-1} B(w) = \sigma_n^2(w).$$

Then, by (A.33) and the Chebyshev inequality, for any $M > 0$,

$$\sup_{w\in\mathcal{W}} P(|\phi_n(w)| > M/\sqrt{nh}) \le C_2/M^2 \to 0$$

as $M \to \infty$. This, together with (A.29) and $\sup_w \alpha(w) = O(h^{\ell})$, yields the result of the theorem.

Proof of Theorem 4. Let $r_0(u) = n^{-1}\sum_{i=1}^{n} r_i^*(u)$ and

$$\hat r_0(u) = n^{-1}\sum_{i=1}^{n} Y_i(u)\exp\{\hat\beta' X_i + \hat\phi(W_i)\}.$$

By (A.9), (A.25), and the definition of $\alpha(w)$, $\sup_{w\in\mathcal{W}}|\hat\phi(w) - \phi(w)| = o_p(1)$. Note that $\sup_{1\le i\le n}\|n^{-1/2} X_i\| = o_p(1)$ and hence $\sup_{1\le i\le n}|\hat\beta' X_i - \beta' X_i| = o_p(1)$. It is easy to show that $\hat r_0(u) = r_0(u)\{1 + o_p(1)\}$, uniformly for $u \in [0, \tau]$. Hence,

$$\hat\Lambda_0(t) = \int_0^t \hat r_0^{-1}(u)\, n^{-1}\sum_{i=1}^{n} dN_i(u) = \int_0^t r_0^{-1}(u)\, n^{-1}\sum_{i=1}^{n} dN_i(u) + o_p(1),$$


uniformly in $t \in [0, \tau]$. This, combined with the Doob-Meyer decomposition, leads to

$$\hat\Lambda_0(t) - \Lambda_0(t) = n^{-1}\int_0^t r_0^{-1}(u)\, d\bar M(u) + o_p(1) \equiv \gamma_n(t) + o_p(1),$$

uniformly in $t \in [0, \tau]$, where $\bar M(u) = \sum_{i=1}^{n} M_i(u)$. Since $r_0(u)$ is $\mathcal{F}(u)$-predictable and $\bar M(u)$ is a martingale with respect to $\mathcal{F}(u)$, by the martingale central limit theorem, $\gamma_n(t)$ is of order $O_p(1/\sqrt{n})$ for each fixed $t$. By the Borel-Lebesgue covering theorem, for any small $\varepsilon > 0$, there exist a finite number of open intervals, $(\tau_j - \varepsilon, \tau_j + \varepsilon)$ for $j = 1, \ldots, L$, such that $\tau_j \in (0, \tau)$ and $[0, \tau] \subset \cup_{j=1}^{L}(\tau_j - \varepsilon, \tau_j + \varepsilon)$. Since each $\gamma_n(\tau_j)$ is of order $o_p(1)$, $\max_{1\le j\le L}|\gamma_n(\tau_j)| = o_p(1)$. Any $t \in [0, \tau]$ must lie in one of the intervals, say the $k$th interval $(\tau_k - \varepsilon, \tau_k + \varepsilon)$, so that $|t - \tau_k| < \varepsilon$. Note that $r_0(u) = n^{-1}\sum_{i=1}^{n} r_i^*(u)$. It follows that

$$\begin{aligned} |\gamma_n(t) - \gamma_n(\tau_k)| &= \Big|\int_{\tau_k}^{t} r_0(u)^{-1} n^{-1}\sum_{i=1}^{n}\{dN_i(u) - r_i^*(u)\, d\Lambda_0(u)\}\Big|\\ &\le \Big|\int_{\tau_k}^{t} r_0(u)^{-1} n^{-1}\sum_{i=1}^{n} dN_i(u)\Big| + \Big|\int_{\tau_k}^{t} d\Lambda_0(u)\Big| = O(\varepsilon), \end{aligned}$$

uniformly in $t$. Hence,

$$\sup_{t\in[0,\tau]}|\gamma_n(t)| \le \max_{1\le j\le L}|\gamma_n(\tau_j)| + \sup_{t\in[0,\tau]}|\gamma_n(t) - \gamma_n(\tau_k)| = o_p(1).$$

Then the result of the theorem follows.

Proof of Theorem 5. By the definitions of $\hat\Sigma_0$, $\Sigma_{n0}$ and $\Sigma_0$, we have

$$\hat\Sigma_0 - \Sigma_{n0} = \int_0^\tau \left[\frac{R_{n2}(s)}{R_{n0}(s)} - \left\{\frac{R_{n1}(s)}{R_{n0}(s)}\right\}^{\otimes 2}\right] R_{n0}(s)\, d\{\hat\Lambda_0(s) - \Lambda_0(s)\}, \quad (A.34)$$

and

$$\hat\sigma_n^2(w) - \sigma_n^2(w) = n^{-1} B'(w)(\hat\Sigma_0^{-1} - \Sigma_0^{-1}) B(w) = n^{-1} B'(w)(\hat\Sigma_0^{-1} - \Sigma_{n0}^{-1}) B(w) + n^{-1} B'(w)(\Sigma_{n0}^{-1} - \Sigma_0^{-1}) B(w) \equiv U_{n1}(w) + U_{n2}(w).$$

By (A.20) and Lemma 15, we have

$$|U_{n2}(w)| \le n^{-1}\|B'(w)\Sigma_{n0}^{-1}\|\, \|\Sigma_{n0} - \Sigma_0\|\, \|\Sigma_0^{-1} B(w)\| = o_p\{1/(nh)\}, \quad (A.35)$$

uniformly in $w$. Then

$$\hat\sigma_n^2(w) - \sigma_n^2(w) = U_{n1}(w) + o_p\{1/(nh)\}, \quad (A.36)$$


uniformly in $w$. For any vector $a$ such that $\|a\| = 1$, we have $|a'\hat\Sigma_0 a| = |a'\Sigma_0 a|\{1 + o_p(1)\}$. Then, similar to (A.20),

$$\|B'(w)\hat\Sigma_0^{-1}\| = O_p(h^{-1}), \quad (A.37)$$

uniformly in $w$. Note that $R_{nk}(s) = O_p(1)$ for $k = 0, 1, 2$. By (A.34) and Theorem 4 and by using the argument in Lemma 15, it can be shown that

$$\|\hat\Sigma_0 - \Sigma_{n0}\| = o_p(1). \quad (A.38)$$

Then, by (A.20), (A.37) and (A.38),

$$|U_{n1}(w)| \le n^{-1}\|B'(w)\hat\Sigma_0^{-1}\|\, \|\hat\Sigma_0 - \Sigma_{n0}\|\, \|\Sigma_{n0}^{-1} B(w)\| = n^{-1} O_p(h^{-1})\, o_p(1)\, O_p(h^{-1}) = o_p\{1/(nh)\},$$

uniformly in $w$. This, combined with (A.36), leads to

$$\hat\sigma_n^2(w) - \sigma_n^2(w) = o_p\{1/(nh)\},$$

uniformly in $w$. Similarly, $\hat\sigma_{n,j}^2(w_j) - \sigma_{n,j}^2(w_j) = o_p\{1/(nh)\}$, uniformly in $w_j$. This, together with Theorem 2, completes the proof of the theorem.

Proof of Theorem 6. Similar to the proof of Theorem 1, we have the Bahadur representation; we give an outline here. Let $U_j(b_j) = \partial\ell_j(b_j)/\partial b_j$ and

$$V_{nk}(s, b_j) = n^{-1}\sum_{i=1}^{n} B_j(W_{ij})^{\otimes k}\, Y_i(s)\exp\{\hat\beta' X_i + b_j' B_j(W_{ij}) + \hat\phi_{-j}(W_{i,-j})\}.$$

Then $U_j(\hat b_j) = 0$. Similar to (A.9), we have

$$\hat\phi_j^*(w_j) - \phi_j(w_j) - \alpha_j(w_j) = v_{nj}(w_j) + o_p(h_j^{\ell_j} + 1/\sqrt{nh_j}), \quad (A.39)$$

uniformly in $w_j$, where

$$v_{nj}(w_j) = B_j'(w_j)\Sigma_{nj}(b_{0j})^{-1} n^{-1}\sum_{i=1}^{n}\int_0^\tau \{B_j(W_{ij}) - V_{n1}(s, b_j)/V_{n0}(s, b_j)\}\, dM_i(s).$$

Using a similar argument to that for (A.24), we obtain that

$$v_{nj}(w_j) = B_j'(w_j)\Sigma_{0,jj}^{-1}\xi_{nj} + o_p(1/\sqrt{nh_j}), \quad (A.40)$$

uniformly in $w_j$. Combining (A.39) and (A.40) and using the same argument as for (A.30), we establish the Bahadur representation:

$$\hat\phi_j(w_j) - \phi_j(w_j) - \alpha_j(w_j) = B_j'(w_j)\Sigma_{0,jj}^{-1}\xi_{nj} + o_p(h_j^{\ell_j} + 1/\sqrt{nh_j}).$$

Then, by the same argument as in the proof of Theorem 2, we obtain the asymptotic normality result.


Proof of Theorem 7. By Theorem 1, we have

$$\hat\phi_j(w_j) - \phi_j(w_j) = \alpha_j(w_j) + B'(w) e_j\Sigma_0^{-1}\xi_{n1} + o_p(h^{\ell} + 1/\sqrt{nh}),$$

uniformly in $w \in \mathcal{W}$. Let $\eta_n = \Sigma_0^{-1}\xi_{n1}$, and for $j = 1, \ldots, J$ let the $j$th block entry of $\eta_n$ be $\eta_{nj}$, which equals $(\Sigma_0^{j1}, \ldots, \Sigma_0^{jJ})\xi_{n1}$. Denote $\Gamma_1 = e_1\Sigma_0^{-1}$. Then $\eta_{n1} = \Gamma_1\xi_{n1}$. Note that $\sup_{w_j}|\alpha_j(w_j)| = O(h^{\ell})$. By the proof of Theorem 3, we have $B'(w) e_j\Sigma_0^{-1}\xi_{n1} = B_j'(w_j)\eta_{nj} = O_p(1/\sqrt{nh})$. If $nh^{2\ell+1} \to 0$, then

$$\{\hat\phi_j(w_j) - \phi_j(w_j)\}^2 = \eta_{nj}'\{B_j(w_j)\}^{\otimes 2}\eta_{nj} + o_p\{1/(nh)\},$$

uniformly in $w \in \mathcal{W}$. Hence, under $H_0$, $T_n = nh\,\eta_{n1}' A_n\eta_{n1} + o_p(1)$, where

$$A_n = \int_0^1 \{B_1(w_1)\}^{\otimes 2} a(w_1)\, dw_1$$

is a $q_n \times q_n$ matrix.
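In implementations, $A_n$ can be approximated by numerical quadrature over the spline basis. The sketch below is one such approximation (hypothetical knots, weight $a(w_1) \equiv 1$; SciPy's B-spline design matrix supplies the basis values).

import numpy as np
from scipy.interpolate import BSpline

deg, K = 3, 8
t = np.r_[[0.0] * (deg + 1), np.linspace(0, 1, K + 2)[1:-1], [1.0] * (deg + 1)]
q1 = len(t) - deg - 1                                  # number of basis functions
w = np.linspace(0, 1, 2001)
Bmat = BSpline.design_matrix(w, t, deg).toarray()      # (len(w), q1) basis values
a = np.ones_like(w)                                    # weight function a(w) = 1
# Riemann-sum approximation of A_n = int_0^1 B_1(w) B_1(w)' a(w) dw
An = (Bmat[:, :, None] * Bmat[:, None, :] * a[:, None, None]).sum(axis=0) * (w[1] - w[0])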

Let $\varepsilon_i = \int_0^\tau \{B(W_i) - R^*_1(s)/R^*_0(s)\}\, dM_i(s)$. Then the $\{\varepsilon_i\}$ are independent and identically distributed with mean zero and covariance matrix $\Sigma_0$, $\xi_{n1} = n^{-1}\sum_{i=1}^{n}\varepsilon_i$, and

$$T_n = n^{-1} h\Big(\sum_{i=1}^{n}\varepsilon_i\Big)'\Gamma_1' A_n\Gamma_1\Big(\sum_{i=1}^{n}\varepsilon_i\Big) + o_p(1) = T_{n1} + T_{n2} + o_p(1),$$

where $T_{n1} = n^{-1} h\sum_{i=1}^{n}\varepsilon_i'\Gamma_1' A_n\Gamma_1\varepsilon_i$ and $T_{n2} = 2 n^{-1} h\sum_{i<j}\varepsilon_i'\Gamma_1' A_n\Gamma_1\varepsilon_j$. Directly computing the mean of the quadratic form $T_{n1}$, we obtain that $E(T_{n1}) = h\,\mathrm{tr}(\Gamma_1' A_n\Gamma_1\Sigma_0)$. Then it is straightforward to show that

$$E(T_{n1}) = h\,\mathrm{tr}(A_n e_1'\Sigma_0^{-1} e_1) = h\,\mathrm{tr}(A_n\Sigma_0^{11}) = \mu_n^*,$$

and $\mathrm{var}(T_{n1}) = o(1)$. This leads to

$$T_n = \mu_n^* + o_p(1) + T_{n2}. \quad (A.41)$$

By Theorem 2, $\sigma_{n,1}^2(w_1) \asymp (nh)^{-1}$ uniformly in $w_1$, and thus by the definition we have $\mu_n^* = O(1)$. By Lemma 16, the U-statistic $\sigma_n^{*-1} T_{n2}$ is asymptotically standard normal and $\sigma_n^{*2} \asymp h$. This, combined with (A.41), completes the proof of part (i). For part (ii), the result follows from Theorem 6 and the same argument as above.

Proof of Theorem 8. The proof is similar to that of Theorem 7; we give a sketch of it here. Similar to (A.41), under $H_{1n}$ we have

$$T_n = \mu_n^* + T_{n2} + d_{1n} + d_{2n} + o_p(1), \quad (A.42)$$

where $d_{1n} = -2nh\int_0^1\{\hat\phi_1(w_1) - \phi_1(w_1)\}\{\phi_1(w_1) - \phi_{1,0}(w_1)\}\, a(w_1)\, dw_1$ and $d_{2n} = nh\int_0^1\{\phi_1(w_1) - \phi_{1,0}(w_1)\}^2 a(w_1)\, dw_1$. Since $\phi_1(w_1) - \phi_{1,0}(w_1) = \gamma_n g(w_1) + o(\gamma_n)$, we have

$$d_{2n} = c\int_0^1 g^2(w_1) a(w_1)\, dw_1 + o(1) \equiv c^* + o(1),$$


$$d_{1n} = 2nh\gamma_n\int_0^1 g(w_1) a(w_1) B_1'(w_1)\eta_{n1}\, dw_1\{1 + o_p(1)\} = 2nh\gamma_n b'\eta_{n1}\{1 + o_p(1)\} \equiv d_{1n}^*\{1 + o_p(1)\}.$$

It is easy to see that $d_{1n}^*$ is uncorrelated with $T_{n2}$ and is asymptotically normal with $E(d_{1n}^*) = 0$ and $\mathrm{var}(d_{1n}^*) = 4ch\, b'\Sigma_0^{11} b\{1 + o(1)\}$. Let the $i$th component of $b$ be $b_i$. Then

$$b_i = \int_{\xi_{1,i-\ell_1}}^{\xi_{1,i}} B_{1,i}(w_1) g(w_1) a(w_1)\, dw_1 = O(h_1),$$

uniformly for $i = 1, \ldots, q_1$. Hence, by Lemma 8,

$$|b'\Sigma_0^{11} b| \le \lambda_{\max}(\Sigma_0^{11})\,\mathrm{tr}(b^{\otimes 2}) = O(1).$$

This produces that $\mathrm{var}(d_{1n}^*) = O(h)$. Hence, $d_{1n}^* = o_p(1)$. Then, by (A.42),

$$s_n^{*-1}\{T_n - \mu_n^*(1 + o_p(1)) - c^*\} \xrightarrow{D} N(0, 1),$$

where $s_n^{*2} = \sigma_n^{*2} + 4ch\, b'\Sigma_0^{11} b = O(h)$. That is, the result in part (i) holds. Using the same argument, we establish the result in part (ii).

Proof of Theorem 9. The results are proven by drawing a parallel between the approximated partial likelihood and its bootstrap analogue. The argument employed here can be useful for proving consistency of bootstrap methods in other scenarios. Let $\omega_{ki} = 1(S_k^* = S_i, \delta_k^* = \delta_i, W_k^* = W_i, X_k^* = X_i)$, for $k, i = 1, \ldots, n$. Then

$$P(\omega_{ki} = 1\mid\mathcal{F}_n) = 1/n \quad\text{and}\quad P(\omega_{ki} = 0\mid\mathcal{F}_n) = 1 - 1/n,$$

where $\mathcal{F}_n$ is the empirical distribution of $\{S_i, \delta_i, W_i, X_i\}_{i=1}^n$. Given the bootstrap sample $\{S_i^*, \delta_i^*, W_i^*, X_i^*\}$, $i = 1, \ldots, n$, the logarithm of the approximated partial likelihood is

$$\ell^*(\beta, b) = \sum_{i=1}^{n}\delta_i^*\Big\{\beta' X_i^* + \phi_n(W_i^*) - \log\sum_{k\in R_i^*}\exp[\beta' X_k^* + \phi_n(W_k^*)]\Big\},$$

where $R_i^* = \{j : S_j^* \ge S_i^*\}$. Let $\omega_i = \sum_{k=1}^{n}\omega_{ki}$ for $i = 1, \ldots, n$; that is, $\omega_i$ is the number of times the $i$th original sample point is drawn in the bootstrap sample. Then

$$\ell^*(\beta, b) = \sum_{i=1}^{n}\omega_i\delta_i\Big\{\beta' X_i + \phi_n(W_i) - \log\sum_{k\in R_i}\omega_k\exp[\beta' X_k + \phi_n(W_k)]\Big\}. \quad (A.43)$$

This is just a random weighted version of the approximated partial likelihood in (2.4). Note that the bootstrap estimators $(\hat\beta^*, \hat b^*)$ of $(\beta, b)$ maximize the likelihood in (A.43).
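The random-weight representation can be checked directly: resampling $n$ observations with replacement is the same as drawing the counts $(\omega_1, \ldots, \omega_n)$ from a multinomial distribution. A minimal numerical sketch (illustrative only) follows.

import numpy as np

rng = np.random.default_rng(1)
n = 500
# omega_i = number of times observation i appears in the bootstrap sample
omega = rng.multinomial(n, np.full(n, 1.0 / n))

# Moments used in the proof: E(omega_i) = 1 and, since omega_i = sum_k omega_ki
# with E[(omega_ki - 1/n)^2 | F_n] = 1/n - 1/n^2, var(omega_i) = 1 - 1/n.
print(omega.sum())    # exactly n
print(omega.mean())   # approximately 1
print(omega.var())    # approximately 1 - 1/n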

Similar to (A.30) in the proof of Theorem 1, we have

$$\hat\phi_j^{(b)}(w_j) - \phi_j(w_j) - \alpha_j(w_j) = B'(w) e_j\Sigma_0^{-1}\xi_{n1}^* + o_p(h^{\ell} + 1/\sqrt{nh}), \quad (A.44)$$

where $\hat\phi_j^{(b)}(w_j)$ is the bootstrap estimate of $\phi_j(w_j)$, defined in the same way as $\hat\phi_j(w_j)$ but with $\hat b$ replaced by $\hat b^*$, and

$$\xi_{n1}^* = n^{-1}\sum_{i=1}^{n}\omega_i\int_0^\tau \{B(W_i) - R^*_1(s)/R^*_0(s)\}\, dM_i(s).$$

Hence, by (A.30) and (A.44),

$$\hat\phi_j^{(b)}(w_j) - \hat\phi_j(w_j) = B'(w) e_j\Sigma_0^{-1}\tilde\xi_{n1} + o_p(h^{\ell} + 1/\sqrt{nh}), \quad (A.45)$$

where $\tilde\xi_{n1} = n^{-1}\sum_{i=1}^{n}(\omega_i - 1)\int_0^\tau \{B(W_i) - R^*_1(s)/R^*_0(s)\}\, dM_i(s)$. Similar to (A.41), we have

$$T_n^* = \mu_n^* + o_p(1) + T_{n2}^*, \quad (A.46)$$

where $T_{n2}^* = 2n^{-1} h\sum_{i<j}(\omega_i - 1)(\omega_j - 1)\varepsilon_i'\Gamma_1' A_n\Gamma_1\varepsilon_j$. Since $\omega_i = \sum_{k=1}^{n}\omega_{ki}$, it can be written that

$$T_{n2}^* = 2n^{-1} h\sum_{k=1}^{n}\eta_{kk} + 2n^{-1} h\sum_{k<\ell}\eta_{k\ell} \equiv T_{n21}^* + T_{n22}^*,$$

where $\eta_{k\ell} = \sum_{i<j}(\omega_{ki} - 1/n)(\omega_{\ell j} - 1/n)\varepsilon_i'\Gamma_1' A_n\Gamma_1\varepsilon_j$. Note that $E[(\omega_{ki} - 1/n)^2\mid\mathcal{F}_n] = 1/n - 1/n^2$, and for $i \ne j$, $E[(\omega_{ki} - 1/n)(\omega_{kj} - 1/n)\mid\mathcal{F}_n] = -1/n^2$, almost surely. It is easy to show that $T_{n21}^* \to 0$ almost surely. Furthermore, conditional on $\mathcal{F}_n$, for $k \ne \ell$, $E\{\eta_{k\ell}\mid\mathcal{F}_n\} = 0$,

$$\mathrm{var}(\eta_{k\ell}\mid\mathcal{F}_n) = (1/n - 1/n^2)^2\sum_{i<j}(\varepsilon_i'\Gamma_1' A_n\Gamma_1\varepsilon_j)^2\{1 + o(1)\},$$

and if $(k, \ell) \ne (k', \ell')$, $E(\eta_{k\ell}\eta_{k'\ell'}\mid\mathcal{F}_n) = 0$. Therefore,

$$\mathrm{var}(\sigma_n^{*-1} T_{n22}^*\mid\mathcal{F}_n) = 4h^2 n^{-2}\sum_{i<j}(\varepsilon_i'\Gamma_1' A_n\Gamma_1\varepsilon_j/\sigma_n^*)^2\{1 + o(1)\} \to 1,$$

almost surely. Using a similar argument to that in the proof of Lemma 16, we obtain that, conditional on $\mathcal{F}_n$, $\sigma_n^{*-1} T_{n2}^*$ is asymptotically standard normal. Therefore, by (A.46) and Theorem 7, the asymptotic normal distribution of $T_n^*$ conditional on $\mathcal{F}_n$ is the same as the asymptotic normal distribution of $T_n$. This, combined with the Polya theorem, completes the proof of the first result of the theorem. The second result follows from the same argument.

References

Agarwal, G. G. and Studden, W. J. (1980). Asymptotic integrated mean square error using least squares and bias minimizing splines. The Annals of Statistics 8, 1307–1325.


Barrow, D. L. and Smith, P. W. (1978). Asymptotic properties of best $L_2[0,1]$ approximation by splines with variable knots. Quart. Appl. Math. 36, 293–304.

Bickel, P. J. (1975). One-step Huber estimates in linear models. Journal of the American Statistical Association 70, 428–433.

Bickel, P. J., Klaassen, C. A. J., Ritov, Y. and Wellner, J. A. (1993). Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press.

Bickel, P. J. and Rosenblatt, M. (1973). On some global measures of the deviations of density function estimates. The Annals of Statistics 1, 1071–1095.

Breslow, N. E. (1972). Contribution to the discussion on the paper by D. R. Cox, “Regression and life tables”. Journal of the Royal Statistical Society B 34, 216–217.

Breslow, N. E. (1974). Covariance analysis of censored survival data. Biometrics 30, 89–99.

Buja, A., Hastie, T. J., and Tibshirani, R. J. (1989). Linear smoothers and additive models. The Annals of Statistics 17, 453–555.

Cai, J., Fan, J., Jiang, J., and Zhou, H. (2007). Partially linear hazard regression for multivariate survival data. Journal of the American Statistical Association 102, 538–551.

Chen, K. N., Jin, Z., and Ying, Z. (2002). Semiparametric analysis of transformation models with censored data. Biometrika 89, 659–668.

Chen, K. N. and Tong, X. (2010). Varying coefficient transformation models with censored data. Biometrika 97, 969–976.

Clegg, L. X., Cai, J., and Sen, P. K. (1999). A marginal mixed baseline hazards model for multivariate failure time data. Biometrics 55, 805–812.

Cox, D. R. (1972). Regression models and life-tables (with discussion). Journal of the Royal Statistical Society B 34, 187–220.

Cox, D. R. (1975). Partial likelihood. Biometrika 62, 269–276.

Dawber, T. R. (1980). The Framingham Study: The Epidemiology of Atherosclerotic Disease. Cambridge, MA: Harvard University Press.

de Boor, C. (1978). A Practical Guide to B-splines. Springer-Verlag, New York.

Fan, J. (1992). Design-adaptive nonparametric regression. Journal of the American Statistical Association 87, 998–1004.


Fan, J., Feng, Y. and Song, R. (2011). Nonparametric independence screening in sparse ultra-high-dimensional additive models. Journal of the American Statistical Association 106, 544–557.

Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. New York: Chapman & Hall.

Fan, J., Gijbels, I. and King, M. (1997). Local likelihood and local partial likelihood in hazard regression. The Annals of Statistics 25, 1661–1690.

Fan, J. and Jiang, J. (2005). Generalized likelihood ratio tests for additive models. Journal of the American Statistical Association 100, 890–907.

Fan, J. and Yao, Q. (2003). Nonlinear Time Series: Nonparametric and Parametric Methods. New York: Springer-Verlag.

Fan, J. and Zhang, W. (1999). Statistical estimation in varying coefficient models. The Annals of Statistics 27, 1491–1518.

Fan, J., Zhang, C.-M. and Zhang, J. (2001). Generalized likelihood ratio statistics and Wilks phenomenon. The Annals of Statistics 29, 153–193.

Fleming, T. R. and Harrington, D. P. (1991). Counting Processes and Survival Analysis. New York: Wiley.

Gasser, T. and Müller, H.-G. (1984). Estimating regression functions and their derivatives by the kernel method. Scand. J. Statist. 11, 171–185.

Hall, P. and Heyde, C. C. (1980). Martingale Limit Theory and Its Application. Academic Press.

Hastie, T. J. and Tibshirani, R. J. (1990). Generalized Additive Models. New York: Chapman and Hall.

Hong, Y. and Lee, Y.-J. (2013). A loss function approach to model specification testing and its relative efficiency. The Annals of Statistics 41, 1166–1203.

Horowitz, J. L. and Mammen, E. (2004). Nonparametric estimation of an additive model with a link function. The Annals of Statistics 32, 2412–2443.

Huang, J. (1999). Efficient estimation of the partly linear additive Cox model. The Annals of Statistics 27, 1536–1563.

Huang, J. Z. (1998). Projection estimation for multiple regression with application to functional ANOVA models. The Annals of Statistics 26, 242–272.


Huang, J. Z. (2001). Concave extended linear modeling: A theoretical synthesis. Statistica Sinica 11, 173–197.

Huang, J. Z. (2003). Local asymptotics for polynomial spline regression. The Annals of Statistics 31, 1600–1635.

Huang, J. Z. and Liu, L. (2006). Polynomial spline estimation and inference of proportional hazards regression models with flexible relative risk form. Biometrics 62, 793–802.

Huang, J. Z., Kooperberg, C., Stone, C. J. and Truong, Y. K. (2000). Functional ANOVA modeling for proportional hazards regression. The Annals of Statistics 28, 961–999.

Huang, J. Z. and Shen, H. (2004). Functional coefficient regression models for nonlinear time series: A polynomial spline approach. Scandinavian Journal of Statistics 31, 515–534.

Huang, J. Z. and Stone, C. J. (1998). The $L_2$ rate of convergence for event history regression with time-dependent covariates. Scand. J. Statist. 25, 603–620.

Ingster, Yu. I. (1993). Asymptotic minimax hypothesis testing for nonparametric alternatives I–III. Math. Methods Statist. 2, 85–114; 3, 171–189; 4, 249–268.

Jiang, J. and Jiang, X. (2011). Inference for partly linear additive Cox models. Statistica Sinica 21, 901–921.

Jiang, J. and Li, J. (2008). Two-stage local M-estimation of additive models. Science in China Ser. A 51, 1315–1338.

Kalbfleisch, J. D. and Prentice, R. L. (1980). The Statistical Analysis of Failure Time Data. New York: Wiley.

Kong, E. and Xia, Y. (2007). Variable selection for the single-index model. Biometrika 94, 217–229.

Kooperberg, C., Stone, C. J. and Truong, Y. K. (1995a). The $L_2$ rate of convergence for hazard regression. Scand. J. Statist. 22, 143–157.

Kooperberg, C., Stone, C. J. and Truong, Y. K. (1995b). Rate of convergence for logspline spectral density estimation. J. Time Ser. Anal. 16, 389–401.

Liu, R. and Yang, L. (2016). Spline estimation of a semiparametric GARCH model. Econometric Theory 32, 1023–1054.


Liu, R., Yang, L. and Härdle, W. K. (2013). Oracally efficient two-step estimation of generalized additive model. Journal of the American Statistical Association 108, 619–631.

Lu, X. and Song, P. X.-K. (2015). Efficient estimation of the partly linear additive hazards model with current status data. Scandinavian Journal of Statistics 42, 306–328.

Lu, W. and Zhang, H. (2010). On estimation of partially linear transformation models. Journal of the American Statistical Association 105, 683–691.

Ma, S., Carroll, R. J., Liang, H. and Xu, S. (2015). Estimation and inference in generalized additive coefficient models for nonlinear interactions with high-dimensional covariates. The Annals of Statistics 43, 2102–2131.

Ma, S. and Kosorok, M. R. (2005). Penalized log-likelihood estimation for partly linear transformation models with current status data. The Annals of Statistics 33, 2256–2290.

Nan, B., Lin, X., Lisabeth, L. D. and Harlow, S. D. (2005). A varying-coefficient Cox model for the effect of age at a marker event on age at menopause. Biometrics 61, 576–583.

Opsomer, J.-D. (2000). Asymptotic properties of backfitting estimators. Journal of Multivariate Analysis 73, 166–179.

Opsomer, J.-D. and Ruppert, D. (1997). Fitting a bivariate additive model by local polynomial regression. The Annals of Statistics 25, 186–211.

Opsomer, J.-D. and Ruppert, D. (1999). A root-n consistent backfitting estimator for semiparametric additive modeling. Journal of Computational and Graphical Statistics 8, 715–732.

O'Sullivan, F. (1988). Nonparametric estimation of relative risk using splines and cross-validation. SIAM J. Sci. Stat. Comput. 9, 531–542.

Sasieni, P. (1992). Information bounds for the conditional hazard ratio in a nested family of regression models. Journal of the Royal Statistical Society B 54, 617–635.

Schumaker, L. (1981). Spline Functions: Basic Theory. Wiley, New York.

Stone, C. J. (1985). Additive regression and other nonparametric models. The Annals of Statistics 13, 689–705.

Stone, C. J. (1986). The dimensionality reduction principle for generalized additive models. The Annals of Statistics 14, 590–606.


Stone, C. J. (1994). The use of polynomial splines and their tensor products in multivariate function estimation (with discussion). The Annals of Statistics 22, 118–184.

van der Vaart, A. W. (1991). On differentiable functionals. The Annals of Statistics 19, 178–204.

Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society B 68, 49–67.

Zhou, S., Shen, X. and Wolfe, D. A. (1998). Local asymptotics for regression splines and confidence regions. The Annals of Statistics 26, 1760–1782.

Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society B 67, 301–320.


Supplementary material for “Polynomial Spline Estimation of Partly Linear Additive CoxModels”

Lemma 1. (Approximation error of polynomial splines) Under Conditions (B1) and (B7), there exists a $\phi^*_{n0}(w) = \sum_{j=1}^{J}\phi^*_{n0,j}(w_j) = b_0' B(w)$, with each component $\phi^*_{n0,j} \in S(\ell, \xi_j)$, such that

$$\sup_{w\in\mathcal{W}}|\phi(w) - \phi^*_{n0}(w) + \alpha(w)| = o(h^{\ell}),$$

where $\alpha(w)$ is defined in Theorem 1.

Proof. By (2.7) of Barrow and Smith (1978), for any $w_j \in [0, 1]$, the approximation error of spline functions in $S(\ell_j, \xi_j)$ is

$$\inf_{s(w_j)\in S(\ell_j,\xi_j)}\ \sup_{w_j\in[0,1]}|\phi_j(w_j) + \alpha_j^*(w_j) - s(w_j)| = o(h_j^{\ell_j}), \quad (A.47)$$

where

$$\alpha_j^*(w_j) = -\sum_{i=0}^{K_j}\phi_j^{(\ell_j)}(\xi_{j,i})\frac{h_{j,i}^{\ell_j}}{\ell_j!}\, B^*_{\ell_j}\Big(\frac{w_j - \xi_{j,i}}{h_{j,i}}\Big) I_{(w_j\in I_{j,i})}.$$

The Bernoulli polynomial satisfies $\int_0^1 B^*_{\ell_j}(x)\, dx = 0$ for $\ell_j \ge 1$. It follows from (A.47) that, for $j = 1, \ldots, J$, there exists a $\phi^*_{n0,j}(w_j) = b_{j0}' B_j(w_j) \in S(\ell_j, \xi_j)$ such that

$$\sup_{w_j\in[0,1]}|\phi_j(w_j) + \alpha_j^*(w_j) - \phi^*_{n0,j}(w_j)| = o(h_j^{\ell_j}). \quad (A.48)$$

Obviously, $\phi^*_{n0}(w) = b_0' B(w)$, where $b_0 = (b_{10}', \ldots, b_{J0}')'$. Since $\phi_j^{(\ell_j)}$ is continuous, we have $\phi_j^{(\ell_j)}(w_j) = \phi_j^{(\ell_j)}(\xi_{j,i}) + o(1)$ for $w_j \in I_{j,i}$. Then, by Conditions (B1) and (B7),

$$\begin{aligned} \phi(w) - \phi^*_{n0}(w) &= \sum_{j=1}^{J}\sum_{i=0}^{K_j}\frac{1}{\ell_j!} h_{j,i}^{\ell_j}\phi_j^{(\ell_j)}(\xi_{j,i}) B^*_{\ell_j}\Big(\frac{w_j - \xi_{j,i}}{h_{j,i}}\Big) I_{(w_j\in I_{j,i})} + o(h^{\ell})\\ &= \sum_{j=1}^{J}\sum_{i=0}^{K_j}\frac{1}{\ell_j!} h_{j,i}^{\ell_j}\phi_j^{(\ell_j)}(w_j) B^*_{\ell_j}\Big(\frac{w_j - \xi_{j,i}}{h_{j,i}}\Big) I_{(w_j\in I_{j,i})} + o(h^{\ell})\\ &\equiv -\alpha(w) + o(h^{\ell}), \end{aligned}$$

uniformly in $w \in \mathcal{W}$.
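The $o(h^{\ell})$ approximation rate can be illustrated numerically. The sketch below is not part of the proof; it uses SciPy's least-squares spline fit as a stand-in for the best approximation, fitting cubic splines ($\ell = 4$) to a smooth function on $[0, 1]$ and printing the sup-norm error, which shrinks roughly like $h^{\ell} \asymp K^{-4}$ as the number of interior knots $K$ grows.

import numpy as np
from scipy.interpolate import make_lsq_spline

phi = lambda w: np.sin(2 * np.pi * w)                  # a smooth component phi_j
x = np.linspace(0, 1, 4000)
deg = 3                                                # cubic splines: ell = deg + 1 = 4
for K in (4, 8, 16, 32):
    interior = np.linspace(0, 1, K + 2)[1:-1]          # K interior knots
    knots = np.r_[[0.0] * (deg + 1), interior, [1.0] * (deg + 1)]
    spl = make_lsq_spline(x, phi(x), knots, k=deg)
    print(K, np.max(np.abs(phi(x) - spl(x))))          # error drops roughly 16x per doubling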

Lemma 2. (Lemma 6.5 of Zhou et al. 1998) If $A$ and $B$ are $l \times l$ nonnegative definite matrices, then $\lambda_{\min}(A)\,\mathrm{tr}(B) \le \mathrm{tr}(AB) \le \lambda_{\max}(A)\,\mathrm{tr}(B)$.
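A quick numerical check of Lemma 2 (illustrative only, with randomly generated nonnegative definite matrices):

import numpy as np

rng = np.random.default_rng(3)
m = 5
M1, M2 = rng.normal(size=(m, m)), rng.normal(size=(m, m))
A, B = M1 @ M1.T, M2 @ M2.T                            # nonnegative definite
lam = np.linalg.eigvalsh(A)                            # eigenvalues in ascending order
print(lam[0] * np.trace(B) <= np.trace(A @ B) <= lam[-1] * np.trace(B))  # True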

Lemma 3. Under Conditions (A) and (B), $\|\hat\phi - \phi_{n0}\|_2 = O_p(h^{\ell} + 1/\sqrt{nh})$ and $\|\hat\phi - \phi_{n0}\|_\infty = O_p(h^{\ell-1} + 1/\sqrt{nh^3})$, where $\phi_{n0}(w) = \phi^*_{n0}(w) - \sum_{i=1}^{n}\delta_i\phi^*_{n0}(W_i)/\sum_{i=1}^{n}\delta_i$ and $\phi^*_{n0}(\cdot)$ is given in Lemma 1.


Proof. Let $\Phi_n$ be the collection of functions $\phi_n$ on $[0,1]^J$ of the additive form $\phi_n(w) = \phi_1(w_1) + \cdots + \phi_J(w_J)$ with each component $\phi_j \in S(\ell, \xi_j)$. By Lemma 1, there exists a $\phi^*_{n0}(w) = b_0' B(w)$ such that

$$\|\phi(w) - \phi^*_{n0}(w) + \alpha(w)\|_\infty = o(h^{\ell}), \quad (A.49)$$

where $\|\cdot\|_\infty$ denotes the supremum norm and $\alpha(w) = \sum_{j=1}^{J}\alpha_j^*(w_j)$. Since $E\{\phi(W_i)\mid\delta_i\} = 0$, we have

$$\sum_{i=1}^{n}\delta_i\phi^*_{n0}(W_i)\Big/\sum_{i=1}^{n}\delta_i = A_{n1} + A_{n2},$$

where $A_{n1} = \sum_{i=1}^{n}\delta_i\{\phi^*_{n0}(W_i) - \phi(W_i)\}/\sum_{i=1}^{n}\delta_i$ and $A_{n2} = \sum_{i=1}^{n}\delta_i[\phi(W_i) - E\{\phi(W_i)\mid\delta_i\}]/\sum_{i=1}^{n}\delta_i$. By (A.49),

$$A_{n1} = \sum_{i=1}^{n}\delta_i\alpha(W_i)\Big/\sum_{i=1}^{n}\delta_i + o(h^{\ell}).$$

Note that $\int B^*_{\ell_j}(t)\, dt = 0$. By a change of variable and the continuity of $f(\cdot\mid\delta = 1)$, the conditional density of $W_j$ given $\delta_j = 1$, we have

$$E\Big\{B^*_{\ell_j}\Big(\frac{W_j - \xi_{j,i}}{h_{j,i}}\Big) I_{(W_j\in I_{j,i})}\Big\} = \int_{\xi_{j,i}}^{\xi_{j,i+1}} B^*_{\ell_j}\Big(\frac{w - \xi_{j,i}}{h_{j,i}}\Big) f(w\mid\delta = 1)\, dw = h_{j,i}\int_0^1 B^*_{\ell_j}(t) f(\xi_{j,i} + t h_{j,i}\mid\delta = 1)\, dt = O(h_j^2).$$

Then, using the definition of $\alpha_j^*$, we obtain that $E\{\alpha_j^*(W_j)\mid\delta_j = 1\} = O(h_j^{\ell_j+1}) = o(h_j^{\ell_j})$, and hence $E\{\alpha(W)\mid\delta = 1\} = o(h^{\ell})$. Calculating the conditional mean and variance and using the Chebyshev inequality, we obtain that $A_{n1} = o_p(h^{\ell}) + O_p(1/\sqrt{n})$ and $A_{n2} = O_p(1/\sqrt{n})$. Therefore,

$$\sum_{i=1}^{n}\delta_i\phi^*_{n0}(W_i)\Big/\sum_{i=1}^{n}\delta_i = o_p(h^{\ell}) + O_p(1/\sqrt{n}),$$

which, together with (A.49), leads to

$$\|\phi(w) - \phi_{n0}(w) + \alpha(w)\|_\infty = o_p(h^{\ell}) + O_p(1/\sqrt{n}). \quad (A.50)$$

By Theorem 3.2 of Huang (1999), $\|\hat\phi - \phi\|_2^2 = O_p(n^{-2vp} + n^{-(1-v)})$. This and (A.50) produce that, for $\ell \ge 1$,

$$\|\hat\phi - \phi_{n0}\|_2^2 \le 2\{\|\hat\phi - \phi\|_2^2 + \|\phi - \phi_{n0}\|_2^2\} = O_p(n^{-2vp} + n^{-(1-v)} + h^{2\ell}) = O_p\{h^{2\ell} + 1/(nh)\}.$$

Hence, $\|\hat\phi - \phi_{n0}\|_2 = O_p(h^{\ell} + 1/\sqrt{nh})$. Then, by Lemma 7 of Stone (1986),

$$\|\hat\phi - \phi_{n0}\|_\infty = O_p(h^{\ell-1} + 1/\sqrt{nh^3}).$$


Lemma 4. Let $a = (a_{11}, \ldots, a_{1q_1}, \ldots, a_{J1}, \ldots, a_{Jq_J})'$. Under Condition (A), there exist constants $0 < c_1 \le c_2 < \infty$ and $0 < c_1^* \le c_2^* < \infty$ (independent of $n$ and $q_j$) such that, for $\ell \ge 2$ and for any $v(w) = \sum_{j=1}^{J}\sum_{i=1}^{q_{j,n}} a_{ji} B_{j,i}(w_j)$,

$$\{c_1 + o(1)\} h\|a\|^2 \le E_{P_n}\{v^2(W) r^*(s)\} \le \{c_2 + o(1)\} h\|a\|^2,$$

$$\{c_1^* + o(1)\}\sqrt{h}\|a\| \le E_{P_n}|v(W) r^*(s)| \le \{c_2^* + o(1)\}\sqrt{h}\|a\|,$$

uniformly for $s \in [0, \tau]$, where $r^*(s) = Y(s)\exp\{\beta' X + \phi(W)\}$ and $P_n$ is the empirical distribution function of $\{W_i, X_i, Y_i(s)\}_{i=1}^n$.

Proof. Note that $\{B_{j,i}(\cdot)\}_{i=1}^{q_j}$ is an uncentred spline basis. The result follows from arguments similar to those for Lemma 6.1 of Zhou et al. (1998).

Lemma 5. Let $a_{ijpk} = \int_{\xi_{j,i}}^{\xi_{j,i+1}} B_{j,p}^k(z)\, B^*_{\ell_j}((z - \xi_{j,i})/h_{j,i})\, dz$ and $\rho_{ij} = \phi_j^{(\ell_j)}(\xi_{j,i})/\{p_{\ell_j}(\xi_{j,i})\ell_j!\}$. Assume that the $g_j(x)$ are continuous functions. Then, for $k = 0, 1$ and $j = 1, \ldots, J$,

$$\sum_{i=p-\ell_j}^{p-1}\rho_{ij} g_j(\xi_{j,i}) a_{ijpk} = o(h_j),$$

uniformly for $p = 1, \ldots, q_j$.

Proof. (i) For $k = 0$, by a change of variable, we have

$$a_{ijp0} = h_{j,i}\int_0^1 B^*_{\ell_j}(u)\, du = 0.$$

Then the statement holds.

(ii) For $k = 1$, by the identity $dB^*_{\ell_j+1}(x) = (\ell_j + 1) B^*_{\ell_j}(x)\, dx$ and the chain rule, we have

$$B^*_{\ell_j}((z - \xi_{j,i})/h_{j,i})\, dz = h_{j,i}(\ell_j + 1)^{-1}\, dB^*_{\ell_j+1}((z - \xi_{j,i})/h_{j,i}).$$

Then, using integration by parts, we obtain that

$$\sum_{i=p-\ell_j}^{p-1}\rho_{ij} g_j(\xi_{j,i}) a_{ijp1} = \sum_{i=p-\ell_j}^{p-1}\rho_{ij} g_j(\xi_{j,i})\frac{h_{j,i}}{\ell_j + 1}\Big\{B_{j,p}(z) B^*_{\ell_j+1}\Big(\frac{z - \xi_{j,i}}{h_{j,i}}\Big)\Big|_{\xi_{j,i}}^{\xi_{j,i+1}} - \int_{\xi_{j,i}}^{\xi_{j,i+1}} B^*_{\ell_j+1}\Big(\frac{z - \xi_{j,i}}{h_{j,i}}\Big) B_{j,p}'(z)\, dz\Big\}.$$

We repeatedly integrate by parts until the integral becomes

$$\int_{\xi_{j,i}}^{\xi_{j,i+1}} B^*_{2\ell_j-1}\Big(\frac{z - \xi_{j,i}}{h_{j,i}}\Big) B_{j,p}^{(\ell_j-1)}(z)\, dz,$$

which is zero, since the polynomial spline $B_{j,p}^{(\ell_j-1)}(\cdot)$ is constant on $(\xi_{j,i}, \xi_{j,i+1})$ and $\int_0^1 B^*_{2\ell_j-1}(x)\, dx = 0$. Then, for $\ell_j \ge 2$,

$$\sum_{i=p-\ell_j}^{p-1}\rho_{ij} g_j(\xi_{j,i}) a_{ijp1} = \sum_{i=p-\ell_j}^{p-1}\rho_{ij} g_j(\xi_{j,i})\sum_{k=1}^{\ell_j-1}(-1)^{k-1}\Big(\prod_{k'=1}^{k}\frac{h_{j,i}}{\ell_j + k'}\Big) B_{j,p}^{(k-1)}(z) B^*_{\ell_j+k}\Big(\frac{z - \xi_{j,i}}{h_{j,i}}\Big)\Big|_{\xi_{j,i}}^{\xi_{j,i+1}} = \sum_{k=1}^{\ell_j-1}(-1)^{k-1}\frac{\ell_j!}{(\ell_j + k)!} b_{kp},$$

where $b_{kp} = \sum_{i=p-\ell_j}^{p-1}\rho_{ij} g_j(\xi_{j,i}) h_{j,i}^k\{B_{j,p}^{(k-1)}(\xi_{j,i+1}) - B_{j,p}^{(k-1)}(\xi_{j,i})\} B^*_{\ell_j+k}(0)$, and we use the fact that the Bernoulli polynomials have the property $B^*_k(0) = B^*_{k+1}(1)$ for $k \ge 1$. Then it suffices to show that

$$b_{kp} = o(h_j), \quad (A.51)$$

uniformly for $p = 1, \ldots, q_j$ and $k = 1, \ldots, \ell_j - 1$. By (2.8) of Barrow and Smith (1978), there exists a constant $M$, independent of $n$ and $p$, such that

$$\|B_{j,p}^{(k)}\|_\infty \le M K_j^k = O(h_j^{-k}), \quad k = 0, 1, \ldots, \ell_j - 1, \quad (A.52)$$

with $B_{j,p}^{(k-1)}(\xi_{j,p-\ell_j}) = B_{j,p}^{(k-1)}(\xi_{j,p}) = 0$, $p = 1, \ldots, q_j$. Hence, $b_{kp}$ can be written as a finite sum of terms

$$B^*_{\ell_j+k}(0)\, B_{j,p}^{(k-1)}(\xi_{j,i+1})\{h_{j,i}^k u_j(\xi_{j,i}) - h_{j,i+1}^k u_j(\xi_{j,i+1})\}, \quad i = p - \ell_j, \ldots, p - 1,$$

where $u_j(\xi_{j,i}) = \rho_{ij} g_j(\xi_{j,i})$ is a continuous function of $\xi_{j,i}$ for each $j$. Then (A.51) follows from (A.52) and the facts that $h_{j,i}^k - h_{j,i+1}^k = o(h_j^k)$ and $\xi_{j,i+1} - \xi_{j,i} = h_{j,i}$.

Lemma 6. Let $\tilde r_i^*(s, b) = Y_i(s)\exp\{\beta' X_i + b'\bar B(W_i)\}$. For $k = 1, \ldots, q$, there exists a unique pair $(j, k')$ with $k = \sum_{j'=1}^{j-1} q_{j'} + k'$, where $j = 1, \ldots, J$ and $k' = 1, \ldots, q_j$, so that $B_{j,k'}(w)$ and $\bar B_{j,k'}(w)$ are the $k$th components of $B(w)$ and $\bar B(w)$, respectively. For $v = 1, 2$, let

$$K^*_{nv}(s, b) = n^{-1}\sum_{i=1}^{n}\tilde r_i^*(s, b)\,\bar B_{j,k'}(W_i)\{\hat\phi(W_i) - \phi_{n0}(W_i)\}^v.$$

Assume that $b^*$ lies between $\hat b$ and $b_0$, where $b_0$ is given in Lemma 1. Under the conditions of Theorem 1, $\sup_{s\in[0,\tau]}|K^*_{n1}(s, b^*)| = O_p(h^{\ell+0.5} + n^{-0.5})$ and $\sup_{s\in[0,\tau]}|K^*_{n2}(s, b^*)| = O_p(h^{2\ell+1} + n^{-1})$.

Proof. Since there is an $\omega \in [0, 1]$ such that $b^* = b_0 + \omega(\hat b - b_0)$, by Lemma 3 we have $|(b^* - b_0)'\bar B(W_i)| = \omega|\hat\phi(W_i) - \phi_{n0}(W_i)| \le \|\hat\phi - \phi_{n0}\|_\infty \xrightarrow{p} 0$, uniformly for $i = 1, \ldots, n$, if $nh^3 \to \infty$. Note that $r_i^*(s) = Y_i(s)\exp\{\beta' X_i + \phi(W_i)\}$. Then, by (A.50) and the triangle inequality,

$$|(b^*)'\bar B(W_i) - \phi(W_i)| \le \|(b^* - b_0)'\bar B(W_i)\|_\infty + \|\phi_{n0} - \phi\|_\infty \xrightarrow{p} 0,$$

which, combined with Lemma 1 and Condition (B1), leads to $\tilde r_i^*(s, b^*) = r_i^*(s)\{1 + o_p(1)\}$, uniformly for $s$ and $i$. Hence,

$$|K^*_{nv}(s, b^*)| = \{1 + o_p(1)\}\, n^{-1}\sum_{i=1}^{n} r_i^*(s)\,\bar B_{j,k'}(W_i)\,|\hat\phi(W_i) - \phi_{n0}(W_i)|^v \le \{1 + o_p(1)\}\sup_{i,s}|r_i^*(s)|\, K_{nv,k}, \quad (A.53)$$

where $K_{nv,k} = n^{-1}\sum_{i=1}^{n}\bar B_{j,k'}(W_i)\,|\hat\phi(W_i) - \phi_{n0}(W_i)|^v$, and $\bar B_{j,k'}(W_i)$ is the $k$th entry of $\bar B(W_i)$. It can be written that $K_{nv,k} = K^{(1)}_{nv,k} + K^{(2)}_{nv,k}$, with $K^{(1)}_{nv,k} = n^{-1}\sum_{i=1}^{n} B_{j,k'}(W_i)\,|\hat\phi(W_i) - \phi_{n0}(W_i)|^v$ and

$$K^{(2)}_{nv,k} = \frac{\sum_{i=1}^{n}\delta_i B_{j,k'}(W_i)}{\sum_{i=1}^{n}\delta_i}\, n^{-1}\sum_{i=1}^{n}|\hat\phi(W_i) - \phi_{n0}(W_i)|^v.$$

For $v = 1, 2$, by Markov's inequality, we have

$$n^{-1}\sum_{i=1}^{n}|\hat\phi(W_i) - \phi_{n0}(W_i)|^v = O_p(\|\hat\phi - \phi_{n0}\|_2^v). \quad (A.54)$$

By (2.3), it is easy to see that

$$\sum_{i=1}^{n}\delta_i B_{j,k'}(W_i)\Big/\sum_{i=1}^{n}\delta_i = O_p(h_j). \quad (A.55)$$

Then

$$K^{(2)}_{nv,k} = O_p(h_j)\, O_p(\|\hat\phi - \phi_{n0}\|_2^v). \quad (A.56)$$

Using the Hölder inequality, we obtain that

$$K^{(1)}_{n1,k} \le \Big\{n^{-1}\sum_{i=1}^{n} B^2_{j,k'}(W_i)\cdot n^{-1}\sum_{i=1}^{n}|\hat\phi(W_i) - \phi_{n0}(W_i)|^2\Big\}^{1/2}.$$

Again by (2.3), we have $n^{-1}\sum_{i=1}^{n} B^2_{j,k'}(W_i) = O_p(h_j)$. Therefore,

$$K^{(1)}_{n1,k} = O_p(\sqrt{h_j})\, O_p(\|\hat\phi - \phi_{n0}\|_2). \quad (A.57)$$

Combining (A.56) and (A.57), we obtain that

$$K_{n1,k} = O_p(\sqrt{h_j})\, O_p(\|\hat\phi - \phi_{n0}\|_2). \quad (A.58)$$

Since $B_{j,k'}(\xi_{j,k'-\ell_j}) = B_{j,k'}(\xi_{j,k'}) = 0$, by Taylor's expansion we obtain that, for $w \in [\xi_{j,k'-\ell_j}, \xi_{j,k'}]$, $B_{j,k'}(w) = O(h_j)$, uniformly for $k'$. Then

$$K^{(1)}_{n2,k} = O(h_j)\, n^{-1}\sum_{i=1}^{n}|\hat\phi(W_i) - \phi_{n0}(W_i)|^2 = O_p(h_j\|\hat\phi - \phi_{n0}\|_2^2).$$

This, combined with (A.56), leads to

$$K_{n2,k} = O_p(h_j\|\hat\phi - \phi_{n0}\|_2^2). \quad (A.59)$$

Applying Lemma 3 to (A.58) and (A.59), we obtain that

$$K_{n1,k} = O_p(\sqrt{h})\, O_p(h^{\ell} + 1/\sqrt{nh}) \quad (A.60)$$

and

$$K_{n2,k} = O_p(h^{2\ell+1} + 1/n). \quad (A.61)$$

By Condition (A3), $\sup_{i,s}|r_i^*(s)| = O_p(1)$. Then the results of the lemma follow from (A.53), (A.60) and (A.61).

Lemma 7. Let $A(s) = R^*_2(s) R^*_0(s) - \{R^*_1(s)\}^{\otimes 2}$. Under Conditions (A) and (B), $A(s) \ge 0$ for all $s \in [0, \tau]$, and there exists an $s_0 \in [0, \tau]$ such that $A(s_0) > 0$.

Proof. By the definition of $R^*_k(s)$, for any nonzero $a \in \mathbb{R}^{Jq_n}$, we have

$$a' A(s) a = E\{a' B(W)^{\otimes 2} a\, r^*(s)\}\, E\{r^*(s)\} - [E\{a' B(W) r^*(s)\}]^2 = E(\xi^2) E(\eta^2) - (E\xi\eta)^2,$$

where $\xi = a' B(W)\{r^*(s)\}^{1/2}$ and $\eta = \{r^*(s)\}^{1/2}$. Using the Hölder inequality, we obtain that $a' A(s) a \ge 0$ (that is, $A(s)$ is nonnegative definite), with equality holding if and only if there exists a real number $c$ such that $c\xi + \eta = 0$ almost surely, or equivalently $\{c\, a' B(W) + 1\}\{r^*(s)\}^{1/2} = 0$ almost surely, which is not possible. Hence, there exists at least one $s_0 \in [0, \tau]$ such that $A(s_0) > 0$.

Lemma 8. For any unit vector $a$, let $v(w) = a' B(w)$ and $r_n^*(s) = Y(s)\exp\{\beta' X + \phi^*_{n0}(W)\}$, where $\phi^*_{n0}(w) = B'(w) b_0$ is defined in Lemma 1. Under the conditions in Theorem 1, we have

(i) $E_{P_n}\{v^k(W) r_n^*(s)\} = E_{P_n}\{v^k(W) r^*(s)\} + o(h)$, uniformly for $s \in [0, \tau]$;

(ii) there exist constants $0 < c_i \le d_i < \infty$ (independent of $n$ and $q_n$, for $i = 1, 2, 3$) such that

(a) $\{c_1 + o(1)\} h \le \lambda_{\min}\{\Sigma_n(b_0)\} \le \lambda_{\max}\{\Sigma_n(b_0)\} \le \{d_1 + o(1)\} h$,

(b) $\{c_2 + o(1)\} h \le \lambda_{\min}(\Sigma_{n0}) \le \lambda_{\max}(\Sigma_{n0}) \le \{d_2 + o(1)\} h$, and

(c) $\{c_3 + o(1)\} h \le \lambda_{\min}(\Sigma_0) \le \lambda_{\max}(\Sigma_0) \le \{d_3 + o(1)\} h$;

(iii) $\|\Sigma_{n1}(b_0)\| = O_p(h/\sqrt{n})$;

(iv) $\|\Sigma_n^{-1}(b_0)\| = O(h^{-1})$, $\|\Sigma_{n0}^{-1}\| = O(h^{-1})$, $\|\Sigma_{n2}^{-1}(b_0)\| = O(h^{-1})$, and $\|\Sigma_0^{-1}\| = O(h^{-1})$.

Proof. (i) Using the Rayleigh-Ritz theorem, one has

$$\lambda_{\min}\{\Sigma_n(b_0)\} = \min_{\|a\|=1} a'\Sigma_n(b_0) a \quad\text{and}\quad \lambda_{\max}\{\Sigma_n(b_0)\} = \max_{\|a\|=1} a'\Sigma_n(b_0) a,$$

where $a$ is a $Jq_n \times 1$ unit vector. Let $a = (a_{11}, \ldots, a_{1q_n}, \ldots, a_{J1}, \ldots, a_{Jq_n})'$, $v(w) = a' B(w)$, and $r^*_{ni}(s) = Y_i(s)\exp\{\beta' X_i + \phi^*_{n0}(W_i)\}$. By the definition of $\Sigma_n(b)$, it is easy to see that

$$a'\Sigma_n(b_0) a = n^{-1}\int_0^\tau \Big\{\frac{V_{n0}(s, b_0)\eta_2(a, s) - \eta_1(a, s)^{\otimes 2}}{V_{n0}(s, b_0)^2}\Big\}\, d\bar N(s),$$

where $\eta_1(a, s) = n^{-1}\sum_{i=1}^{n} r^*_{ni}(s) B'(W_i) a$ and

$$\eta_2(a, s) = n^{-1}\sum_{i=1}^{n} r^*_{ni}(s)\, a' B(W_i)^{\otimes 2} a.$$

Then $\eta_1(a, s) = E_{P_n}\{v(W) r_n^*(s)\}$ and

$$\eta_2(a, s) = n^{-1}\sum_{i=1}^{n}\{a' B(W_i)\}^2 r^*_{ni}(s) = E_{P_n}\{v^2(W) r_n^*(s)\},$$

where $v(w)$ is defined in Lemma 4. Put

$$G_n(a, s) = \frac{E_{P_n}\{v^2(W) r_n^*(s)\}}{E_{P_n}\{r_n^*(s)\}} - \Big[\frac{E_{P_n}\{v(W) r_n^*(s)\}}{E_{P_n}\{r_n^*(s)\}}\Big]^2. \quad (A.62)$$

It can be written that

$$a'\Sigma_n(b_0) a = n^{-1}\int_0^\tau G_n(a, s)\, d\bar N(s). \quad (A.63)$$

By Taylor's expansion, Lemma 1 and Condition (B3), $r_n^*(s) = r^*(s)\{1 + O(h^{\ell})\}$, uniformly for $s \in [0, \tau]$ and $w \in \mathcal{W}$. Since $|v(w)| = |a' B(w)| \le \|a\|\,\|B(w)\| \le J^{1/2}$ for any unit vector $a$,

$$v^k(w) r_n^*(s) = v^k(w) r^*(s) + O(h^{\ell}),$$

uniformly for $s \in [0, \tau]$ and $w \in \mathcal{W}$. It follows that, for $k = 0, 1, 2$,

$$E_{P_n}\{v^k(W) r_n^*(s)\} = E_{P_n}\{v^k(W) r^*(s)\} + o(h), \quad (A.64)$$

uniformly for $s \in [0, \tau]$.


(ii) Combining (A.62) and (A.64) leads to

$$G_n(a, s) = \frac{E_{P_n}\{v^2(W) r^*(s)\}}{E_{P_n}\{r^*(s)\}} - \Big[\frac{E_{P_n}\{v(W) r^*(s)\}}{E_{P_n}\{r^*(s)\}}\Big]^2 + o(h) = \frac{E_P\{v^2(W) r^*(s)\}}{E_P\{r^*(s)\}} - \Big[\frac{E_P\{v(W) r^*(s)\}}{E_P\{r^*(s)\}}\Big]^2 + o(h), \quad (A.65)$$

uniformly for $s \in [0, \tau]$. Note that $v(w) = a' B(w)$. It follows that

$$G_n(a, s) = \frac{a' E_P\{B(W)^{\otimes 2} r^*(s)\} a}{E_P\{r^*(s)\}} - \Big[\frac{a' E_P\{B(W) r^*(s)\}}{E_P\{r^*(s)\}}\Big]^2 + o(h) = a' A(s) a/R^*_0(s) + o(h),$$

uniformly for $s \in [0, \tau]$, where $A(s)$ is defined in Lemma 7. By Condition (A3), with probability tending to one, $R^*_0(s)$ is bounded away from zero and infinity. By Lemma 7, there exists an $s_0 \in [0, \tau]$ such that $A(s_0) > 0$. It is easy to see that $A(s)$ is continuous; therefore, there is a neighborhood of $s_0$ in which $A(s) > 0$ and thus $G_n(a, s) > 0$. Combining (A.63), (A.65) and Lemma 4 leads to the result in ii(a). Parts ii(b)-(c) follow from the same argument.

(iii) Similar to (ii), for any $Jq_n \times 1$ unit vector $a$, one has

$$a'\Sigma_{n1}(b_0) a = n^{-1}\sum_{i=1}^{n}\int_0^1 G_n(a, s)\, dM_i(s).$$

Since $M_i(s)$ is a martingale, $a'\Sigma_{n1}(b_0) a$ has mean zero and variance

$$n^{-1}\int_0^1 E\{G_n^2(a, s) Y_i(s) r_i(\beta, \phi)\}\, d\Lambda_0(s),$$

which is of order $O(h^2/n)$ by Lemma 4 and (A.62). Therefore,

$$a'\Sigma_{n1}(b_0) a = O_p(h/\sqrt{n}).$$

It follows from the Rayleigh-Ritz theorem that $\lambda_{\min}\{\Sigma_{n1}(b_0)\} = O_p(h/\sqrt{n})$ and $\lambda_{\max}\{\Sigma_{n1}(b_0)\} = O_p(h/\sqrt{n})$. Since $\Sigma_{n1}(b_0)$ is a symmetric matrix, $\|\Sigma_{n1}(b_0)\| = O_p(h/\sqrt{n})$.

(iv) Since $\Sigma_n(b_0)$, $\Sigma_{n0}$, and $\Sigma_0$ are all symmetric, the results for them follow from (ii). The result for $\Sigma_{n2}$ can be proven similarly.

Lemma 9. Under the conditions in Theorem 1, we have $B'(w) r_n = o_p(h^{\ell} + 1/\sqrt{nh})$ and $B'(w) r_{n1} = o_p(h^{\ell} + 1/\sqrt{nh})$, where $r_n$ and $r_{n1}$ are defined in the proof of Theorem 1.

Proof. The two statements can be proved along the same lines, but the first is more difficult to prove than the second, since $\|\hat\beta - \beta\|$ converges faster than $\|\hat b - b_0\|$. In the following we provide only the proof of the first statement. Let $H_k(b) = n^{-1}(\partial^2 U_k/\partial b\,\partial b')$, where $U_k(\beta, b)$ is the $k$th component of the score function $U(\beta, b) = \partial\ell(\beta, b)/\partial b$. By (A.1) and direct calculation, it is straightforward to show that

$$H_k(b) = n^{-1}\int_0^\tau \Big[\Big\{\frac{K_{n2}(s, b)}{V_{n0}(s, b)} - \frac{V_{n2}(s, b) K_{n0}(s, b)}{V_{n0}^2(s, b)}\Big\} - 2\Big\{\frac{V_{n1}(s, b)\otimes K_{n1}(s, b)}{V_{n0}^2(s, b)} - \frac{V_{n1}^{\otimes 2}(s, b) K_{n0}(s, b)}{V_{n0}^3(s, b)}\Big\}\Big]\, d\bar N(s),$$

where, for $j = 0, 1, 2$, $V_{nj}(s, b) = n^{-1}\sum_{i=1}^{n} B(W_i)^{\otimes j}\,\tilde r_i^*(s, b)$ and

$$K_{nj}(s, b) = n^{-1}\sum_{i=1}^{n} B_k(W_i) B(W_i)^{\otimes j}\,\tilde r_i^*(s, b),$$

with $\tilde r_i^*(s, b) = Y_i(s)\exp\{\beta' X_i + b' B(W_i)\}$. Let $\eta_k(b^*) = 0.5(\hat b - b_0)' H_k(b^*)(\hat b - b_0)$. Then, by the definition of $R_n^*$, $\eta_n = n^{-1} R_n^*$, and the $k$th component of $\eta_n$ is

$$\eta_k(b^*) = 0.5 n^{-1}\int_0^\tau \Big[\Big\{\frac{K^*_{n2}(s, b^*)}{V^*_{n0}(s, b^*)} - \frac{V^*_{n2}(s, b^*) K^*_{n0}(s, b^*)}{V^{*2}_{n0}(s, b^*)}\Big\} - 2\Big\{\frac{V^*_{n1}(s, b^*) K^*_{n1}(s, b^*)}{V^{*2}_{n0}(s, b^*)} - \frac{V^{*2}_{n1}(s, b^*) K^*_{n0}(s, b^*)}{V^{*3}_{n0}(s, b^*)}\Big\}\Big]\, d\bar N(s),$$

where $K^*_{nj}(s, b^*)$ is defined in Lemma 6, and

$$V^*_{nj}(s, b^*) = n^{-1}\sum_{i=1}^{n}\tilde r_i^*(s, b^*)\{\hat\phi(W_i) - \phi_{n0}(W_i)\}^j,$$

for $j = 1, 2$. By Lemma 8, we have

$$|B'(w) r_n| = |B'(w)\Sigma_n^{-1}(b_0)\, n^{-1} R_n^*| \le \|B(w)\|\,\|\Sigma_n^{-1}(b_0)\|\,\|n^{-1} R_n^*\| = O_p(h^{-1})\|\eta_n\|. \quad (A.66)$$

By Lemma 6, we have

$$K^*_{n2}(s, b^*) = O_p(h^{2\ell+1} + n^{-1}) \quad\text{and}\quad K^*_{n1}(s, b^*) = O_p(h^{\ell+0.5} + n^{-0.5}),$$

uniformly for $s \in [0, \tau]$. Similarly, it can be shown that $V^*_{n2}(s, b^*) = O_p\{h^{2\ell} + 1/(nh)\}$, $V^*_{n1}(s, b^*) = O_p(h^{\ell} + 1/\sqrt{nh})$, and $K^*_{n0}(s, b^*) = O_p(h)$, uniformly for $s \in [0, \tau]$, and there exist positive constants $a_1$ and $a_2$ such that

$$a_1 + o_p(1) < V^*_{n0}(s, b^*) < a_2 + o_p(1).$$

Then $\eta_k(b^*) = o_p(h^{\ell+1.5} + h/\sqrt{n})$ uniformly for $k$, if $nh^2 \to \infty$. Hence, $\|\eta_n\| = o_p(h^{\ell+1} + \sqrt{h/n})$. This, combined with (A.66), leads to

$$|B'(w) r_n| = O(h^{-1})\, o_p(h^{\ell+1} + \sqrt{h/n}) = o_p(h^{\ell} + 1/\sqrt{nh}).$$


Lemma 10. (Barrow and Smith 1978; Lemma 6.8 of Agarwal and Studden 1980) Assume that $\phi_j \in C^{\ell_j}[0, 1]$ and $\xi \in [0, 1)$. Let $i$ be chosen so that $\xi_{j,i} \le \xi < \xi_{j,i+1}$ and let $h_{j,i} = \xi_{j,i+1} - \xi_{j,i}$, where the $\xi_{j,i}$ are the knots defined in Section 2. For $y \in [0, 1)$, let

$$R_k(y, \xi) = k^{\ell_j}(\phi_j - \phi^*_{n0,j})(\xi_{j,i} + y h_{j,i})$$

and $K(y, \xi) = \{\phi_j^{(\ell_j)}(\xi)/\ell_j!\}\, p^{-\ell_j}(\xi)\, B^*_{\ell_j}(y)$. Then there exists a sequence of positive constants $\{\epsilon_k\}_{k=1}^{\infty}$ tending to zero, which may be chosen independently of $\xi$, such that

$$\sup_y |R_k(y, \xi) - K(y, \xi)| < \epsilon_k.$$

Lemma 11. For $k = 0, 1$, let $L_{nk}(s) \equiv n^{-1}\sum_{i=1}^{n} B(W_i)^{\otimes k}\, r_i^*(s)\{\phi(W_i) - \phi^*_{n0}(W_i)\}$, where $r_i^*(s) = Y_i(s)\exp\{\beta' X_i + \phi(W_i)\}$. Under the conditions in Theorem 1, we have

(i) $\int_0^\tau L_{nk}(s) g(s)\, d\Lambda_0(s) = o_p(h^{\ell+1})$, uniformly for components, where $g(s)$ is a bounded function;

(ii) $\alpha_n(w) = n^{-1}\sum_{i=1}^{n}\int_0^\tau B'(w)\Sigma_n(b_0)^{-1}\Gamma_n(W_i, s)\, r_i^*(s)\, d\Lambda_0(s) = o_p(h^{\ell})$, uniformly for $w \in \mathcal{W}$, where $\Gamma_n(W_i, s) = B(W_i) - V_{n1}(s, b_0)/V_{n0}(s, b_0)$.

Proof. (i) It can be rewritten that $L_{nk}(s) = L_{nk1}(s) + L_{nk2}(s)$, where

$$L_{nk1}(s) = n^{-1}\sum_{i=1}^{n} B(W_i)^{\otimes k} f(W_i, s)\{\phi(W_i) - \phi^*_{n0}(W_i)\},$$

$$L_{nk2}(s) = n^{-1}\sum_{i=1}^{n} B(W_i)^{\otimes k}\{r_i^*(s) - f(W_i, s)\}\{\phi(W_i) - \phi^*_{n0}(W_i)\},$$

and $f(\cdot, s)$ is defined in Condition (A3). Let $L^{[m]}_{nk2}(s)$ be the $m$th component of $L_{nk2}(s)$. Then, for any $\varepsilon > 0$, by Markov's inequality,

$$P\Big\{\sup_{1\le m\le Jq_n}\Big|\int_0^\tau L^{[m]}_{nk2}(s) g(s)\, d\Lambda_0(s)\Big| > \varepsilon h^{\ell+1}\Big\} \le \varepsilon^{-2} h^{-2(\ell+1)}\sum_{m=1}^{Jq_n} E\Big\{\int_0^\tau L^{[m]}_{nk2}(s) g(s)\, d\Lambda_0(s)\Big\}^2. \quad (A.67)$$

By Lemma 1, $\|\phi(w) - \phi^*_{n0}(w)\|_\infty = O(h^{\ell})$. Then, by Condition (A3) and by interchanging integration and expectation, it is straightforward to show that

$$E\Big\{\int_0^\tau L^{[m]}_{nk2}(s) g(s)\, d\Lambda_0(s)\Big\}^2 = O(h^{2\ell}/n).$$

This, combined with (A.67), yields that

$$P\Big\{\sup_{1\le m\le Jq_n}\Big|\int_0^\tau L^{[m]}_{nk2}(s) g(s)\, d\Lambda_0(s)\Big| > \varepsilon h^{\ell+1}\Big\} = O\{1/(nh^3)\} \to 0,$$


as $n \to \infty$. Therefore,

$$\int_0^\tau L_{nk2}(s) g(s)\, d\Lambda_0(s) = o_p(h^{\ell+1}), \quad (A.68)$$

uniformly for components. It suffices to show that

$$\int_0^\tau L_{nk1}(s) g(s)\, d\Lambda_0(s) = o_p(h^{\ell+1}) \quad (A.69)$$

uniformly for components. For $j = 1, \ldots, J$ and $p = 1, \ldots, q_j$, let $m = \sum_{j'=1}^{j-1} q_{j'} + p$. Then $m = 1, \ldots, q$, where $q = \sum_{j=1}^{J} q_j$. The $m$th component, $L^{(m)}_{nk1}(s)$, of $L_{nk1}(s)$ can be decomposed as

$$L^{(m)}_{nk1}(s) = L^{(m)}_{nk11}(s) + L^{(m)}_{nk12}(s), \quad (A.70)$$

where $L^{(m)}_{nk11}(s) = n^{-1}\sum_{j=1}^{J}\sum_{i=1}^{n} B^k_{j,p}(W_{ij})\, f^*(W_{ij}, s)\{\phi_j(W_{ij}) - \phi^*_{n0,j}(W_{ij})\}$ and $L^{(m)}_{nk12}(s) = n^{-1}\sum_{j=1}^{J}\sum_{i=1}^{n} B^k_{j,p}(W_{ij})\{f(W_i, s) - f^*(W_{ij}, s)\}\{\phi_j(W_{ij}) - \phi^*_{n0,j}(W_{ij})\}$, with $f^*(W_{ij}, s) = E\{f(W_i, s)\mid W_{ij}\}$. Using the same argument as for (A.68), we obtain

$$\int_0^\tau L^{(m)}_{nk12}(s) g(s)\, d\Lambda_0(s) = o_p(h^{\ell+1}), \quad (A.71)$$

uniformly for $m = 1, \ldots, q$. Note that, for $k = 0, 1$,

$$L^{(m)}_{nk11}(s) = \sum_{j=1}^{J}\int_{\xi_{j,p-\ell_j}}^{\xi_{j,p}} B^k_{j,p}(w_j)\, f^*(w_j, s)\{\phi_j(w_j) - \phi^*_{n0,j}(w_j)\}\, dQ_{nj}(w_j),$$

where $Q_{nj}$ is the empirical distribution of $\{W_{ij}\}_{i=1}^n$. It follows from (2.3) that, by a change of variable,

$$\begin{aligned} L^{(m)}_{nk11}(s) &= \sum_{j=1}^{J}\sum_{i=p-\ell_j}^{p-1}\int_{\xi_{j,i}}^{\xi_{j,i+1}} B^k_{j,p}(w)\, f^*(w, s)\{\phi_j(w) - \phi^*_{n0,j}(w)\}\, dQ_{nj}(w)\\ &= \sum_{j=1}^{J}\sum_{i=p-\ell_j}^{p-1}\int_0^1 B^k_{j,p}(\xi_{j,i} + z h_{j,i})\, f^*(\xi_{j,i} + z h_{j,i}, s)\{(\phi_j - \phi^*_{n0,j})(\xi_{j,i} + z h_{j,i})\}\, dQ_{nj}(\xi_{j,i} + z h_{j,i}). \end{aligned}$$

Using Lemma 10, we obtain that

$$\begin{aligned} q^{\ell} L^{(m)}_{nk11}(s) &= \sum_{j=1}^{J}\sum_{i=p-\ell_j}^{p-1}\int_0^1 B^k_{j,p}(\xi_{j,i} + z h_{j,i})\, f^*(\xi_{j,i} + z h_{j,i}, s)\, R_q(z, \xi_{j,i})\, dQ_{nj}(\xi_{j,i} + z h_{j,i})\\ &= \sum_{j=1}^{J}\sum_{i=p-\ell_j}^{p-1}\int_0^1 B^k_{j,p}(\xi_{j,i} + z h_{j,i})\, f^*(\xi_{j,i} + z h_{j,i}, s)\{\rho_{ij} B^*_{\ell_j}(z) + \epsilon_q\}\, dQ_{nj}(\xi_{j,i} + z h_{j,i}), \end{aligned}$$


where $\rho_{ij} = \phi_j^{(\ell_j)}(\xi_{j,i})/\{p_{\ell_j}(\xi_{j,i})\ell_j!\}$ and $\{\epsilon_q\}$ is a sequence of positive numbers tending to zero and independent of $\xi_{j,i}$. Let $g^*(w) = \int_0^\tau f^*(w, s) g(s)\, d\Lambda_0(s)$. Then $g^*(w)$ is bounded, and

$$\begin{aligned} q^{\ell}\int_0^\tau L^{(m)}_{nk11}(s) g(s)\, d\Lambda_0(s) &= \sum_{j=1}^{J}\sum_{i=p-\ell_j}^{p-1}\rho_{ij}\int_{\xi_{j,i}}^{\xi_{j,i+1}} B^k_{j,p}(w)\, g^*(w)\, B^*_{\ell_j}((w - \xi_{j,i})/h_{j,i})\, dQ_{nj}(w) + o(h)\\ &= \sum_{j=1}^{J}\sum_{i=p-\ell_j}^{p-1}\rho_{ij}\int_{\xi_{j,i}}^{\xi_{j,i+1}} B^k_{j,p}(w)\, g^*(w)\, B^*_{\ell_j}((w - \xi_{j,i})/h_{j,i})\, dQ_j(w)\\ &\quad + \sum_{j=1}^{J}\sum_{i=p-\ell_j}^{p-1}\rho_{ij}\int_{\xi_{j,i}}^{\xi_{j,i+1}} B^k_{j,p}(w)\, g^*(w)\, B^*_{\ell_j}((w - \xi_{j,i})/h_{j,i})\, d\{Q_{nj}(w) - Q_j(w)\} + o(h), \end{aligned} \quad (A.72)$$

uniformly for $m$. Following the arguments for (6.30) and (6.31) in Agarwal and Studden (1980), the second term on the right-hand side of (A.72) is $o(h)$. Then

$$q^{\ell}\int_0^\tau L^{(m)}_{nk11}(s) g(s)\, d\Lambda_0(s) = \sum_{j=1}^{J}\sum_{i=p-\ell_j}^{p-1}\rho_{ij}\int_{\xi_{j,i}}^{\xi_{j,i+1}} B^k_{j,p}(w)\, B^*_{\ell_j}((w - \xi_{j,i})/h_{j,i})\, g^*(w)\, dQ_j(w) + o(h),$$

uniformly in $m$. Let $q_j(w) = dQ_j(w)/dw$. It can be written that

$$\begin{aligned} q^{\ell}\int_0^\tau L^{(m)}_{nk11}(s) g(s)\, d\Lambda_0(s) &= \sum_{j=1}^{J}\sum_{i=p-\ell_j}^{p-1}\rho_{ij}\, g^*(\xi_{j,i})\, q_j(\xi_{j,i})\int_{\xi_{j,i}}^{\xi_{j,i+1}} B^k_{j,p}(w)\, B^*_{\ell_j}((w - \xi_{j,i})/h_{j,i})\, dw\\ &\quad + \sum_{j=1}^{J}\sum_{i=p-\ell_j}^{p-1}\rho_{ij}\int_{\xi_{j,i}}^{\xi_{j,i+1}} B^k_{j,p}(w)\, B^*_{\ell_j}((w - \xi_{j,i})/h_{j,i})\{g^*(w) q_j(w) - g^*(\xi_{j,i}) q_j(\xi_{j,i})\}\, dw + o(h), \end{aligned}$$

uniformly for $m$. By the continuity of $g^*(\cdot)$ and $q_j(\cdot)$, it is easy to show that the second term on the right-hand side of the above equation is $o(h)$. Hence,

$$q^{\ell}\int_0^\tau L^{(m)}_{nk11}(s) g(s)\, d\Lambda_0(s) = \sum_{j=1}^{J}\sum_{i=p-\ell_j}^{p-1}\rho_{ij}\, g^*(\xi_{j,i})\, q_j(\xi_{j,i})\int_{\xi_{j,i}}^{\xi_{j,i+1}} B^k_{j,p}(w)\, B^*_{\ell_j}((w - \xi_{j,i})/h_{j,i})\, dw + o(h),$$

uniformly for $m$. Applying Lemma 5, we obtain that

$$q^{\ell}\int_0^\tau L^{(m)}_{nk11}(s) g(s)\, d\Lambda_0(s) = o(h)$$


uniformly for $m$. Note that $q^{-1} = O(h)$. It follows that

$$\int_0^\tau L^{(m)}_{nk11}(s) g(s)\, d\Lambda_0(s) = o(h^{\ell+1}), \quad (A.73)$$

uniformly for $m$. Hence, (A.69) follows from (A.70), (A.71) and (A.73).

(ii) Each term $\Gamma_n(W_i, s)$ may be viewed as a covariate centered by its empirical average, calculated with the probability mass function that assigns weight proportional to $r^*_{ni}(s) \equiv Y_i(s)\exp\{\beta' X_i + \phi^*_{n0}(W_i)\}$; the average of these centered terms with respect to this same discrete probability function is therefore zero. That is,

$$\sum_{i=1}^{n}\int_0^\tau \Gamma_n(W_i, s)\, r^*_{ni}(s)\, d\Lambda_0(s) = 0.$$

Therefore,

$$\alpha_n(w) = n^{-1}\sum_{i=1}^{n}\int_0^\tau B'(w)\Sigma_n(b_0)^{-1}\Gamma_n(W_i, s)\{r_i^*(s) - r^*_{ni}(s)\}\, d\Lambda_0(s).$$

By Taylor's expansion,

$$r_i^*(s) - r^*_{ni}(s) = r_i^*(s)[1 - \exp\{\phi^*_{n0}(W_i) - \phi(W_i)\}] = r_i^*(s)\{\phi(W_i) - \phi^*_{n0}(W_i)\} + O(|\phi^*_{n0}(W_i) - \phi(W_i)|^2).$$

The leading term of $\alpha_n(w)$ is

$$\alpha_{n1}(w) = B'(w)\Sigma_n(b_0)^{-1} n^{-1}\sum_{i=1}^{n}\int_0^\tau \Gamma_n(W_i, s)\, r_i^*(s)\{\phi(W_i) - \phi^*_{n0}(W_i)\}\, d\Lambda_0(s).$$

Let $\theta_n = \int_0^\tau \{L_{n1}(s) - L_{n0}(s) V_{n1}(s, b_0)/V_{n0}(s, b_0)\}\, d\Lambda_0(s)$. Then it is straightforward to verify that

$$\alpha_{n1}(w) = B'(w)\Sigma_n^{-1}(b_0)\theta_n.$$

Hence,

$$|\alpha_{n1}(w)| = |B'(w)\Sigma_n^{-1}(b_0)\theta_n| = |\mathrm{tr}\{B'(w)\Sigma_n^{-1}(b_0)\theta_n\}| = |\mathrm{tr}\{\Sigma_n^{-1}(b_0)\theta_n B'(w)\}| \le \mathrm{tr}[\{\Sigma_n^{-1}(b_0)\}^{\otimes 2}]\,\mathrm{tr}[\{\theta_n B'(w)\}^{\otimes 2}]. \quad (A.74)$$

By the spectral decomposition, Lemma 2, and Lemma 8,

$$\mathrm{tr}[\{\Sigma_n^{-1}(b_0)\}^{\otimes 2}] \le q\lambda_{\max}^2(\Sigma_n^{-1}(b_0)) = O(h^{-3}). \quad (A.75)$$

Directly calculating the mean and variance, we obtain that

$$V_{nk}(s, b_0) = E\{V_{nk}(s, b_0)\} + o_p(1).$$


By Lemma 1, we have $b_0' B(w) = \phi^*_{n0}(w) = \phi(w) + O(h^{\ell})$, uniformly for $w \in [0, 1]^J$. Then $E\{V_{nk}(s, b_0)\} = R^*_k(s) + o(1)$, and hence

$$V_{nk}(s, b_0) = R^*_k(s) + o_p(1).$$

Using the chaining argument in Bickel (1975), we establish that

$$V_{nk}(s, b_0) = R^*_k(s) + o_p(1),$$

uniformly for $s \in [0, \tau]$. By (2.3) and Condition (A3), $R^*_1(s)/R^*_0(s)$ is bounded. Therefore,

$$\theta_n = \int_0^\tau \{L_{n1}(s) - L_{n0}(s) R^*_1(s)/R^*_0(s)\}\, d\Lambda_0(s)\{1 + o_p(1)\}.$$

By Lemma 11(i), $\theta_n = o_p(h^{\ell+1})$, uniformly for components. Note that

$$\mathrm{tr}[\{\theta_n B'(w)\}^{\otimes 2}] = \mathrm{tr}\{B(w) B'(w)\theta_n\theta_n'\} \le \mathrm{tr}\{B(w) B'(w)\}\,\mathrm{tr}\{\theta_n\theta_n'\}.$$

It follows that

$$\mathrm{tr}[\{\theta_n B'(w)\}^{\otimes 2}] \le \|B(w)\|^2\cdot\|\theta_n\|^2 = o_p(q h^{2\ell+2}). \quad (A.76)$$

Therefore, by (A.74)-(A.76), if $\ell \ge 2$,

$$\alpha_{n1}(w) = O(h^{-3})\, o_p(q h^{2\ell+2}) = o_p(h^{2\ell-2}) = o_p(h^{\ell}),$$

uniformly for $w$. Then $\alpha_n(w) = o_p(h^{\ell})$, uniformly for $w$.

Lemma 12. For any square matrix sequence A∗n, if ∥A∗

n∥ → 0, then (I+A∗n)

−1 = I−A∗n+γ∗

n,

where ∥γ∗n∥ = O(∥A∗

n∥2).

Proof. Let $\gamma_n^* = (I + A_n^*)^{-1} - I + A_n^*$. It suffices to show that $\|\gamma_n^*\| = O(\|A_n^*\|^2)$. In fact,
\[
I = (I + A_n^*)(I + A_n^*)^{-1} = (I + A_n^*)(I - A_n^* + \gamma_n^*) = I - A_n^{*2} + (I + A_n^*)\gamma_n^*,
\]
which leads to
\[
\gamma_n^* = (I + A_n^*)^{-1}A_n^{*2} = (I - A_n^* + \gamma_n^*)A_n^{*2} = A_n^{*2} - A_n^{*3} + \gamma_n^* A_n^{*2}.
\]
Therefore,
\[
\|\gamma_n^*\| \le \|A_n^*\|^2 + \|A_n^*\|^3 + \|\gamma_n^*\|\,\|A_n^*\|^2,
\]
which yields $\|\gamma_n^*\| = O(\|A_n^*\|^2)$ whenever $\|A_n^*\| \to 0$.
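A quick numerical check of the lemma (a sketch with a random matrix, not part of the proof): the ratio $\|\gamma_n^*\|/\|A_n^*\|^2$ stays bounded as $\|A_n^*\| \to 0$.

```python
import numpy as np

# Check of Lemma 12: gamma = (I + A)^{-1} - I + A satisfies
# ||gamma|| = O(||A||^2) in spectral norm as ||A|| -> 0.
rng = np.random.default_rng(2)
q = 6
A0 = rng.normal(size=(q, q))
A0 /= np.linalg.norm(A0, 2)            # normalize so ||A0|| = 1
for eps in [1e-1, 1e-2, 1e-3]:
    A = eps * A0                       # ||A|| = eps
    gamma = np.linalg.inv(np.eye(q) + A) - np.eye(q) + A
    print(eps, np.linalg.norm(gamma, 2) / eps ** 2)   # ratio stays bounded
```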


Lemma 13. Under the conditions in Theorem 1, there are positive constants $c_1$ and $c_2$ such that $\{c_1 + o(1)\}/(nh) \le \sigma_n^2(w) \le \{c_2 + o(1)\}/(nh)$, uniformly in $w \in \mathcal{W}$.

Proof. Recalling that $0 \le B_{j,i}(w) \le 1$ for $1 \le i \le q_j$, $\sum_{i=1}^{q_j} B_{j,i}(w) = 1$, and $\sigma_n^2(w) = n^{-1}B'(w)\Sigma_0^{-1}B(w)$, we have
\[
\sigma_n^2(w) = n^{-1}\mathrm{tr}\{\Sigma_0^{-1}B(w)^{\otimes 2}\}.
\]
By Lemma 2,
\[
\sigma_n^2(w) \le n^{-1}\{\lambda_{\min}(\Sigma_0)\}^{-1}\mathrm{tr}\{B(w)^{\otimes 2}\}
= n^{-1}\{\lambda_{\min}(\Sigma_0)\}^{-1}\sum_{j=1}^{J}\sum_{i=1}^{q_j} B_{j,i}^2(w)
\le n^{-1}\{\lambda_{\min}(\Sigma_0)\}^{-1}\sum_{j=1}^{J}\Big\{\sum_{i=1}^{q_j} B_{j,i}(w)\Big\}^2
= J n^{-1}\{\lambda_{\min}(\Sigma_0)\}^{-1},
\]
uniformly in $w \in \mathcal{W}$. By Lemma 8, there is a constant $c_2 > 0$ such that $\sigma_n^2(w) \le \{c_2 + o(1)\}/(nh)$, uniformly in $w \in \mathcal{W}$. Similarly, there exists a constant $c_1 > 0$ such that
\[
\sigma_n^2(w) \ge \{c_1 + o(1)\}/(nh),
\]
uniformly in $w \in \mathcal{W}$. Therefore, the result of the lemma holds.
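The $(nh)^{-1}$ order of $\sigma_n^2(w)$ can also be seen empirically in a one-dimensional spline design. The sketch below (assuming SciPy $\ge$ 1.8 for `BSpline.design_matrix`, a uniform design, and $\Sigma_0$ replaced by the empirical Gram matrix) shows that $nh\,\sigma_n^2(w)$ is roughly stable across $(n, h)$:

```python
import numpy as np
from scipy.interpolate import BSpline

# Empirical order check: sigma_n^2(w) = n^{-1} B'(w) Sigma_0^{-1} B(w)
# should scale like (n h)^{-1}; Sigma_0 is the empirical Gram matrix here.
rng = np.random.default_rng(3)
k = 3                                        # cubic B-splines
for n, nknots in [(1000, 10), (8000, 20)]:
    h = 1.0 / nknots                         # knot spacing
    t = np.r_[[0.0] * k, np.linspace(0.0, 1.0, nknots + 1), [1.0] * k]
    W = rng.uniform(0.0, 1.0, n)
    B = BSpline.design_matrix(W, t, k).toarray()      # n x q basis matrix
    Sigma0 = B.T @ B / n                              # empirical Gram matrix
    Bw = BSpline.design_matrix(np.array([0.5]), t, k).toarray().ravel()
    sigma2 = Bw @ np.linalg.solve(Sigma0, Bw) / n
    print(n, h, sigma2 * n * h)              # approximately constant
```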

Lemma 14. Let $R_n(s) = R_{n1}(s)/R_{n0}(s) - V_{n1}(s,b_0)/V_{n0}(s,b_0)$. Then, under Conditions (A) and (B), $\mathrm{tr}\{R_n(s)^{\otimes 2}\} = O(h^{2\ell})$, uniformly for $s \in [0,\tau]$.

Proof. By Lemma 1, Condition (A3) and a Taylor expansion,
\[
V_{nk}(s,b_0) - R_{nk}(s) = \frac{1}{n}\sum_{i=1}^{n} r_i^*(s)B(W_i)^{\otimes k}\{\phi^*_{n0}(W_i) - \phi(W_i)\} + O(h^{2\ell}) \quad \text{a.s.}, \qquad (A.77)
\]
uniformly over components and for $s \in [0,\tau]$. Then, applying Lemma 1 and Condition (A3), we obtain that
\[
\mathrm{tr}[\{V_{n1}(s,b_0) - R_{n1}(s)\}^{\otimes 2}]
\le 2n^{-2}\sum_{i,j=1}^{n} r_i^*(s)r_j^*(s)\,\mathrm{tr}\{B(W_i)B'(W_j)\}\{\phi^*_{n0}(W_i) - \phi(W_i)\}\{\phi^*_{n0}(W_j) - \phi(W_j)\} + O(h^{4\ell-1}) = O(h^{2\ell}), \qquad (A.78)
\]
uniformly for $s \in [0,\tau]$. Similarly,
\[
\mathrm{tr}[\{V_{n0}(s,b_0) - R_{n0}(s)\}^{\otimes 2}] = O(h^{2\ell}), \qquad (A.79)
\]


uniformly for $s \in [0,\tau]$. Note that
\[
\mathrm{tr}\{R_n(s)^{\otimes 2}\}
= R_{n0}^{-2}(s)V_{n0}^{-2}(s,b_0)\,\mathrm{tr}\big[\big(R_{n1}(s)\{V_{n0}(s,b_0) - R_{n0}(s)\} - R_{n0}(s)\{V_{n1}(s,b_0) - R_{n1}(s)\}\big)^{\otimes 2}\big]
= O\{R_{n0}^{-2}(s)V_{n0}^{-2}(s,b_0)\}\big[\mathrm{tr}\{R_{n1}(s)^{\otimes 2}\}\{V_{n0}(s,b_0) - R_{n0}(s)\}^2 + R_{n0}^2(s)\,\mathrm{tr}[\{V_{n1}(s,b_0) - R_{n1}(s)\}^{\otimes 2}]\big].
\]
By (2.3), it is straightforward to verify that $\mathrm{tr}\{R_{n1}(s)^{\otimes 2}\} = O(1)$. Therefore,
\[
\mathrm{tr}\{R_n(s)^{\otimes 2}\} = O(1)\{V_{n0}(s,b_0) - R_{n0}(s)\}^2 + O(1)\,\mathrm{tr}[\{V_{n1}(s,b_0) - R_{n1}(s)\}^{\otimes 2}].
\]
This, combined with (A.78) and (A.79), leads to the result of the lemma.

Lemma 15. Under Conditions (A) and (B), $\|\Sigma_{n0} - \Sigma_0\| = O_p(h/\sqrt{n})$.

Proof. For any vector $a$ such that $\|a\| = 1$, let $v(W_i) = a'B(W_i)$ and $G_n = \sqrt{n}(P_n - P)$. Then $G_n$ converges in distribution to $G$, and
\[
|a'\{R_{n2}(s) - R_2^*(s)\}a| = \Big|n^{-1}\sum_{i=1}^{n}\big[v^2(W_i)r_i^*(s) - E\{v^2(W_i)r_i^*(s)\}\big]\Big|
= |E_{P_n}\{v^2(w)r^*(s)\} - E_P\{v^2(w)r^*(s)\}|
= \Big|n^{-1/2}\int v^2(w)r^*(s)\,dG(w)\Big|\{1 + o(1)\}.
\]
By an argument similar to that for Lemma 4,
\[
|a'\{R_{n2}(s) - R_2^*(s)\}a| = O_p(h/\sqrt{n}).
\]
Similarly, $|a'\{R_{n1}(s)^{\otimes 2} - R_1^*(s)^{\otimes 2}\}a| = O_p(h/\sqrt{n})$. This leads to $\lambda_{\max}(\Sigma_{n0} - \Sigma_0) = O_p(h/\sqrt{n})$. Note that $\Sigma_{n0} - \Sigma_0$ is symmetric. It follows that $\|\Sigma_{n0} - \Sigma_0\| = O_p(h/\sqrt{n})$.

Lemma 16. Let $\Gamma_1 = e_1\Sigma_0^{-1}$, $A_n = \int_0^1\{B_1(w_1)\}^{\otimes 2}a(w_1)\,dw_1$, $\varepsilon_i = \int_0^{\tau}\{B(W_i) - R_1^*(s)/R_0^*(s)\}\,dM_i(s)$, and $T_{n2} = 2n^{-1}h_1\sum_{i<j}\varepsilon_i'\Gamma_1'A_n\Gamma_1\varepsilon_j$. Suppose the conditions in Theorem 7 hold. Then $\sigma_n^{*2} = O(h_1)$, and $\sigma_n^{*-1}T_{n2}$ converges in distribution to $N(0,1)$ as $n \to \infty$.

Proof. Note that $E(T_{n2}) = 0$ and
\[
\mathrm{var}(T_{n2}) = 4n^{-2}h_1^2(n^2 - n)\,\mathrm{var}(\varepsilon_1'\Gamma_1'A_n\Gamma_1\varepsilon_2).
\]
By simple algebra and the independence of $\varepsilon_1$ and $\varepsilon_2$, we have
\[
\mathrm{var}(\varepsilon_1'\Gamma_1'A_n\Gamma_1\varepsilon_2)
= E\{\mathrm{tr}(\varepsilon_2^{\otimes 2}\Gamma_1'A_n\Gamma_1\varepsilon_1^{\otimes 2}\Gamma_1'A_n\Gamma_1)\}
= \mathrm{tr}(\Sigma_0\Gamma_1'A_n\Gamma_1\Sigma_0\Gamma_1'A_n\Gamma_1)
= \mathrm{tr}(e_1'A_ne_1\Sigma_0^{-1}e_1'A_ne_1\Sigma_0^{-1}).
\]


Note that $e_1\Sigma_0^{-1}e_1' = \Sigma_0^{11}$. It follows that
\[
\mathrm{var}(\varepsilon_1'\Gamma_1'A_n\Gamma_1\varepsilon_2) = \mathrm{tr}(A_n\Sigma_0^{11}A_n\Sigma_0^{11})
= n^2\int_0^1\!\!\int_0^1 \sigma_n^2(u,v)\,a(u)a(v)\,du\,dv,
\]
where $\sigma_n(u,v) = n^{-1}B_1'(u)\Sigma_0^{11}B_1(v) = \sigma_n(v,u)$. Hence, $\mathrm{var}(T_{n2}) = \sigma_n^{*2}\{1 + o(1)\}$. Note that the $(i,i')$th component of $A_n$ is
\[
A_{ii'} = \int_0^1 B_{1,i}(w_1)B_{1,i'}(w_1)a(w_1)\,dw_1.
\]
By (2.3), we have
\[
A_{ii'} = \begin{cases} 0, & \text{if } |i-i'| \ge \ell_1; \\ O(h_1), & \text{if } |i-i'| < \ell_1. \end{cases} \qquad (A.80)
\]
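The banded structure in (A.80) reflects the compact support of B-splines: basis functions whose indices differ by at least the spline order $\ell_1$ have disjoint supports, so their product integrates to zero. A small check (with $a(w_1) \equiv 1$ and uniform knots, both assumptions of this illustration only):

```python
import numpy as np
from scipy.interpolate import BSpline
from scipy.integrate import quad

# Check of (A.80): A_{ii'} = int_0^1 B_{1,i} B_{1,i'} a vanishes for
# |i - i'| >= l1 (the spline order) and is O(h1) otherwise; here a = 1.
k = 2                                        # degree; order l1 = k + 1
nknots = 10
h1 = 1.0 / nknots                            # uniform knot spacing
t = np.r_[[0.0] * k, np.linspace(0.0, 1.0, nknots + 1), [1.0] * k]
q = len(t) - k - 1
def basis(i):
    c = np.zeros(q); c[i] = 1.0
    return BSpline(t, c, k, extrapolate=False)
A = np.zeros((q, q))
for i in range(q):
    for j in range(q):
        f = lambda w: np.nan_to_num(basis(i)(w) * basis(j)(w))
        A[i, j] = quad(f, 0.0, 1.0, limit=200)[0]
print(np.max(np.abs(np.triu(A, k + 1))))     # ~0: entries with |i-j| >= l1
print(np.max(np.abs(A)) / h1)                # O(1): entries are O(h1)
```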

Then $\mathrm{tr}(A_n^{\otimes 2}) = O(h_1)$. By an argument similar to that for Lemma 8, $\lambda_{\max}(\Sigma_0^{11}) \asymp h_1^{-1}$, and hence it follows from Lemma 2 that
\[
\mathrm{tr}(A_n\Sigma_0^{11}A_n\Sigma_0^{11}) \asymp h_1^{-2}\,\mathrm{tr}(A_n^{\otimes 2}) = O(h_1^{-1}).
\]
Therefore, $\sigma_n^{*2} = O(h_1)$. Let $H_n(\varepsilon_i,\varepsilon_j) = 2n^{-1}h_1\varepsilon_i'\Gamma_1'A_n\Gamma_1\varepsilon_j$, $Y_{ni} = \sum_{j=1}^{i-1}H_n(\varepsilon_i,\varepsilon_j)$ for $2 \le i \le n$, $S_{nm} = \sum_{i=2}^{m}Y_{ni}$, and let $\mathcal{F}_{nm} = \sigma(\varepsilon_1,\ldots,\varepsilon_m)$ be the $\sigma$-field generated by $\{\varepsilon_1,\ldots,\varepsilon_m\}$. Then $S_{nn} = T_{n2}$ and, for each $n$, $\{S_{nm},\mathcal{F}_{nm}\}_{m=2}^{n}$ is a zero-mean, square-integrable martingale. Let $s_n^2 = E(S_{nn}^2)$. Then $s_n^2 = \mathrm{var}(T_{n2}) = \sigma_n^{*2}\{1 + o(1)\}$. Define $V_n^2 = \sum_{i=2}^{n}E(Y_{ni}^2\,|\,\mathcal{F}_{n,i-1})$. It is straightforward to show that $s_n^{-2}V_n^2 \to 1$ in probability and, for each $\epsilon > 0$,
\[
s_n^{-2}\sum_{i=2}^{n}E\{Y_{ni}^2 I(|Y_{ni}| > \epsilon s_n)\,|\,\mathcal{F}_{n,i-1}\} \to 0
\]
in probability, as $n \to \infty$. Applying the martingale central limit theorem (Corollary 3.1 of Hall and Heyde, 1980), $\sigma_n^{*-1}T_{n2} \to N(0,1)$ in distribution as $n \to \infty$.
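As an illustration of this limiting behavior (not part of the proof), one can simulate a degenerate quadratic form of the same shape with i.i.d. Gaussian $\varepsilon_i$ and a generic symmetric kernel matrix $K$ standing in for $\Gamma_1'A_n\Gamma_1$; the statistic, standardized by its exact standard deviation, is close to $N(0,1)$:

```python
import numpy as np

# Monte Carlo sketch of the CLT for T = (2 h1 / n) sum_{i<j} eps_i' K eps_j
# with iid eps_i ~ N(0, I_q) and a generic symmetric kernel K (illustrative).
rng = np.random.default_rng(4)
q, n, reps, h1 = 6, 300, 2000, 0.1
M = rng.normal(size=(q, q))
K = (M + M.T) / 2.0                          # symmetric kernel matrix
var_pair = np.trace(K @ K)                   # var(eps_1' K eps_2) when Var = I
sd = 2.0 * h1 / n * np.sqrt(n * (n - 1) / 2.0 * var_pair)
z = np.empty(reps)
for r in range(reps):
    eps = rng.normal(size=(n, q))
    Q = eps @ K @ eps.T                      # all pairwise quadratic forms
    z[r] = (2.0 * h1 / n) * np.triu(Q, 1).sum() / sd
print(z.mean(), z.std())                     # approximately 0 and 1
```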
