stat.ethz.ch/~maathuis/papers/thesis.pdf · 2007. 9. 6.
Nonparametric estimation for current status data with
competing risks
Marloes Henriette Maathuis
A dissertation submitted in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
University of Washington
2006
Program Authorized to Offer Degree: Statistics
University of Washington
Graduate School
This is to certify that I have examined this copy of a doctoral dissertation by
Marloes Henriette Maathuis
and have found that it is complete and satisfactory in all respects, and that any and all revisions required by the final
examining committee have been made.
Co-Chairs of the Supervisory Committee:
Piet Groeneboom
Jon A. Wellner
Reading Committee:
Piet Groeneboom
Michael G. Hudgens
Jon A. Wellner
Date:
In presenting this dissertation in partial fulfillment of the requirements for the doctoral degree at the University of Washington, I agree that the Library shall make its copies freely available for inspection. I further agree that extensive copying of this dissertation is allowable only for scholarly purposes, consistent with “fair use” as prescribed in the U.S. Copyright Law. Requests for copying or reproduction of this dissertation may be referred to Proquest Information and Learning, 300 North Zeeb Road, Ann Arbor, MI 48106-1346, 1-800-521-0600, to whom the author has granted “the right to reproduce and sell (a) copies of the manuscript in microform and/or (b) printed copies of the manuscript made from microform.”
Signature
Date
University of Washington
Abstract
Nonparametric estimation for current status data with competing risks
Marloes Henriette Maathuis
Co-Chairs of the Supervisory Committee:
Professor Piet Groeneboom, Statistics
Professor Jon A. Wellner, Statistics
We study current status data with competing risks. Such data arise naturally in
cross-sectional survival studies with several failure causes. Moreover, generalizations
of these data arise in HIV vaccine clinical trials.
The general framework is as follows. We analyze a system that can fail from K
competing risks, where K ∈ N is fixed. The random variables of interest are (X, Y ),
where X ∈ R+ = (0,∞) is the failure time of the system, and Y ∈ {1, . . . , K} is
the corresponding failure cause. However, we cannot observe (X, Y ) directly. Rather,
we observe the ‘current status’ of the system at a single random observation time
T ∈ R+, where T is independent of (X, Y ). This means that at time T , we observe
whether or not failure occurred, and if and only if failure occurred, we also observe
the failure cause Y .
We study nonparametric estimation of the sub-distribution functions F0k(t) =
P (X ≤ t, Y = k), k = 1, . . . , K, t ∈ R+. We focus on two estimators: the nonpara-
metric maximum likelihood estimator (MLE) and the ‘naive estimator’ introduced
by Jewell, Van der Laan and Henneman (2003). Our main interest is in asymptotic
properties of the MLE, and the naive estimator is considered for comparison.
Until now, the asymptotic properties of the MLE have been largely unknown. We
resolve this issue by proving its consistency, n1/3-rate of convergence, and limiting
distribution. The limiting distribution involves a new self-induced limiting process,
consisting of the convex minorants of K correlated two-sided Brownian motion pro-
cesses plus parabolic drifts, plus an additional term involving the difference between
the sum of the K drifting Brownian motions and their convex minorants.
Various other aspects that we consider include characterizations of the estimators,
uniqueness, graph theory, and computational algorithms. Furthermore, we show that
both the MLE and the naive estimator are asymptotically efficient for a family of
smooth functionals, with √n-rate convergence to a normal limit. Finally, we study an
extension of the model, where X is subject to interval censoring and Y is a continuous
random variable. We show that the MLE is typically inconsistent in this model, and
propose a simple method to repair this inconsistency.
TABLE OF CONTENTS
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivation and problem description . . . . . . . . . . . . . . . . . . . 1
1.2 Overview of previous work . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Overview of new results and outline of this thesis . . . . . . . . . . . 4
Chapter 2: The estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1 Definition of the estimators . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Censored data perspective . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Graph theory and uniqueness . . . . . . . . . . . . . . . . . . . . . . 23
2.4 Characterizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Chapter 3: Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.1 Reduction and optimization . . . . . . . . . . . . . . . . . . . . . . . 63
3.2 Iterative convex minorant algorithms . . . . . . . . . . . . . . . . . . 66
Chapter 4: Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.1 Hellinger consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2 Local and uniform consistency . . . . . . . . . . . . . . . . . . . . . . 77
Chapter 5: Rate of convergence . . . . . . . . . . . . . . . . . . . . . . . . 85
5.1 Hellinger rate of convergence . . . . . . . . . . . . . . . . . . . . . . . 85
5.2 Asymptotic local minimax lower bound . . . . . . . . . . . . . . . . . 90
5.3 Local rate of convergence . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.4 Technical lemmas and proofs . . . . . . . . . . . . . . . . . . . . . . . 118
Chapter 6: Limiting distribution . . . . . . . . . . . . . . . . . . . . . . . . 132
6.1 The limiting distribution of the naive estimator . . . . . . . . . . . . 133
6.2 The limiting distribution of the MLE . . . . . . . . . . . . . . . . . . 146
6.3 Technical lemmas and proofs . . . . . . . . . . . . . . . . . . . . . . . 177
Chapter 7: A family of smooth functionals . . . . . . . . . . . . . . . . . . 186
7.1 Information bound calculations . . . . . . . . . . . . . . . . . . . . . 187
7.2 Asymptotic normality of functionals of the MLE . . . . . . . . . . . . 194
Chapter 8: Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
8.1 Menopause data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
8.2 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
Chapter 9: An extension: interval censored continuous mark data . . . . . 229
9.1 The model and an explicit formula for the MLE . . . . . . . . . . . . 230
9.2 Inconsistency of the MLE . . . . . . . . . . . . . . . . . . . . . . . . 236
9.3 Repaired MLE via discretization of marks . . . . . . . . . . . . . . . 246
9.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
LIST OF FIGURES
Figure Number Page
2.1 The estimators: Graphical representation of the observed data. . . . . 9
2.2 Graph theory: Intersection graph for the MLE. . . . . . . . . . . . . 30
2.3 Convex minorant characterizations: Plots for the data in Table 2.5. . 59
5.1 Asymptotic local minimax lower bound: The perturbation Fnk. . . . . 91
5.2 Local rate: Plot of vn(t) for various values of β. . . . . . . . . . . . . 100
5.3 Local rate: Example clarifying the proof of Lemma 5.16. . . . . . . . 128
6.1 Limiting distribution: Processes for the naive estimator at t0 = 1 . . . 136
6.2 Limiting distribution: Processes for the naive estimator at t0 = 2 . . . 137
6.3 Limiting distribution: Processes for the MLE at t0 = 1 . . . . . . . . 153
6.4 Limiting distribution: Processes for the MLE at t0 = 2 . . . . . . . . 154
6.5 Limiting distribution: Comparison of limiting processes at t0 = 1. . . 155
6.6 Limiting distribution: Comparison of limiting processes at t0 = 2. . . 156
8.1 Menopause data: Question of the Health Examination Study. . . . . . 211
8.2 Menopause data: The MLE and the naive estimator. . . . . . . . . . 212
8.3 Simulations: The true underlying sub-distribution functions. . . . . . 218
8.4 Simulations: The estimators in a single simulation. . . . . . . . . . . 219
8.5 Simulations: Pointwise bias. . . . . . . . . . . . . . . . . . . . . . . . 220
8.6 Simulations: Pointwise variance. . . . . . . . . . . . . . . . . . . . . . 221
8.7 Simulations: Pointwise mean squared error. . . . . . . . . . . . . . . . 222
8.8 Simulations: Pointwise relative efficiency. . . . . . . . . . . . . . . . . 223
8.9 Simulations: Smooth functionals of the MLE for t0 = 2. . . . . . . . . 225
8.10 Simulations: Smooth functionals of the naive estimator for t0 = 2. . . 226
8.11 Simulations: Smooth functionals of the MLE for t0 = 10. . . . . . . . 227
8.12 Simulations: Smooth functionals of the naive estimator for t0 = 10. . 228
9.1 Continuous mark data: Contour lines for estimates of F0(x, y). . . . . 254
9.2 Continuous mark data: Estimates of F0X(x). . . . . . . . . . . . . . . 255
9.3 Continuous mark data: Estimates of F0(x0, y). . . . . . . . . . . . . . 256
LIST OF TABLES
Table Number Page
2.1 Censored data perspective: Example data. . . . . . . . . . . . . . . . 22
2.2 Censored data perspective: Estimators for the data in Table 2.1. . . . 23
2.3 Graph theory: Example data. . . . . . . . . . . . . . . . . . . . . . . 29
2.4 Graph theory: Clique matrix for the data in Table 2.3. . . . . . . . . 35
2.5 Convex minorant characterizations: Example data . . . . . . . . . . . 58
8.1 Simulations: Pointwise bias, variance and MSE at t = 10. . . . . . . . 224
9.1 Continuous mark data: Summary of the examples. . . . . . . . . . . . 249
ACKNOWLEDGMENTS
I sincerely thank my advisors, Piet Groeneboom and Jon Wellner, for their
mentorship over the past years. Their knowledge, guidance, inspiration and
encouragement have been very important to me.
I thank Peter Gilbert, Tilmann Gneiting, Peter Hoff and Michael Hudgens
for serving on my committee, with special thanks to Michael for suggesting
this research problem. I thank Bernard Deconinck for serving as the graduate
school representative.
I am grateful to the faculty, staff and students in our department for provid-
ing a stimulating and supportive research environment. In particular, I thank
Fadoua Balabdaoui, Moulinath Banerjee and Hanna Jankowski for helpful dis-
cussions. Finally, I want to express my deep gratitude to Steven, my parents,
my family and my friends, for their continuous support.
Chapter 1
INTRODUCTION
1.1 Motivation and problem description
The work in this thesis is motivated by recent clinical trials of candidate vaccines
against HIV/AIDS. The main purpose of such trials is to determine the overall effi-
cacy of a candidate vaccine. Like many viruses, HIV exhibits significant genotypic
and phenotypic variation, so that it can be distinguished into several subtypes. There-
fore, it is also of interest to determine the efficacy of a vaccine against each subtype
of the virus. Establishing vaccine efficacy for certain subtypes can warrant vaccina-
tion of populations in which the given subtypes are highly prevalent. Furthermore,
establishing that the vaccine is efficacious for some subtypes, but not for others, gives
important information for possible improvements of the vaccine.
Thus, the variables of interest are the time of infection and the subtype of the
infecting virus. These variables cannot be observed directly, because participants of a
trial are only tested for the virus at several follow-up times. Since each test indicates
whether or not infection happened before the time of the test, the time of infection
is interval censored, i.e., only known to lie within a time interval determined by the
follow-up times. Since simultaneous infections with several subtypes of a virus are
rare, the subtypes are often analyzed as competing risks (see, e.g., Hudgens, Satten
and Longini (2001)). Hence, these trials yield interval censored survival data with
competing risks.
In this thesis, we analyze current status data with competing risks. Current sta-
tus censoring is the simplest form of interval censoring, where there is exactly one
observation time for each subject. We study these data for two reasons. First, such
data arise naturally in cross-sectional studies with several failure causes. Second,
understanding current status data with competing risks is a first step towards under-
standing the more complicated interval censored data with competing risks that arise
in vaccine clinical trials.
We consider the following general framework. We analyze a system that can fail
from K competing risks, where K ∈ N is fixed. The random variables of interest
are (X, Y ), where X ∈ R+ = (0,∞) is the failure time of the system, and Y ∈ {1, . . . , K} is the corresponding failure cause. Due to censoring, we cannot observe
(X, Y ) directly. Rather, we observe the ‘current status’ of the system at a single
random observation time T ∈ R+, where T is independent of (X, Y ). Thus, at time
T we observe whether or not failure occurred, and if and only if failure occurred, we
also observe the failure cause Y .
Examples that fit into this framework can be found in reliability and survival
analysis. For an example, see the menopause data analyzed by Krailo and Pike
(1983), where X is the age at menopause, Y is the cause of menopause (natural or
operative), and T is the age at the time of the survey. In cross-sectional HIV studies
we think of X as the time of HIV infection, Y as the subtype of the infecting HIV
virus, and T as the time of the HIV test. Note that one is free to choose the origin of the time scale. Common choices include the date of birth and the beginning of
the study.
Given current status data with competing risks, we consider nonparametric estima-
tion of the sub-distribution functions F0k(t) = P (X ≤ t, Y = k), k = 1, . . . , K. This
problem, or close variants thereof, has been studied by Hudgens, Satten and Longini
(2001), Jewell, Van der Laan and Henneman (2003), and Jewell and Kalbfleisch
(2004). However, there are still many open problems. In particular, until now, the
asymptotic properties of the nonparametric maximum likelihood estimator (MLE)
have been largely unknown. In this thesis, we resolve this problem. We prove consistency, the rate of convergence and the limiting distribution of the MLE. These
asymptotic results form an important step towards making inference about the sub-
distribution functions.
The outline of the remainder of this chapter is as follows. In Section 1.2 we give
an overview of previous work in this area. In Section 1.3 we give an outline of this
thesis, together with a discussion of our main results.
1.2 Overview of previous work
Hudgens, Satten and Longini (2001) study competing risks data subject to interval
censoring and truncation. They derive the nonparametric maximum likelihood esti-
mator (MLE) and provide an EM algorithm for its computation. They also introduce
an alternative pseudo-likelihood estimator. They apply their methods to data from
a cohort of injecting drug users in Thailand, where the event of interest is infection
with HIV-1, and the competing risks are HIV-1 subtypes B and E.
Jewell, Van der Laan and Henneman (2003) study current status data with com-
peting risks. They consider some simple parametric models, some ad-hoc nonparamet-
ric estimators, and the MLE. They compare these estimators in a simulation study.
Furthermore, they apply their methods to data analyzed by Krailo and Pike (1983),
where the event of interest is menopause and the competing risks are natural and
operative menopause. Finally, the authors discuss results suggesting that the simple
ad-hoc estimators might yield fully efficient estimators for smooth functionals of the
sub-distribution functions.
Jewell and Kalbfleisch (2004) study maximum likelihood estimation of a series of
ordered multinomial parameters. Current status data with competing risks can be
viewed as a special case of this setting. The authors focus on the computation of the
MLE, and introduce an iterative version of the Pool Adjacent Violators Algorithm.
1.3 Overview of new results and outline of this thesis
We focus on the following two nonparametric estimators for the sub-distribution func-
tions: the MLE F̂n = (F̂n1, . . . , F̂nK), and the ‘naive estimator’ F̃n = (F̃n1, . . . , F̃nK)
introduced by Jewell, Van der Laan and Henneman (2003).1 Our main interest is in
asymptotic properties of the MLE, and the naive estimator is considered for compar-
ison.
In Chapter 2 we define the estimators, and discuss the relationship between them.
We show that both the MLE and the naive estimator can be viewed as maximum like-
lihood estimators for censored data. This observation is useful, because it allows us
to use readily available theory and computational algorithms. In particular, the naive
estimator can be viewed as the maximum likelihood estimator for reduced univariate
current status data. Hence, many properties of the naive estimator follow straight-
forwardly from known results on current status data. The censored data perspective
also allows us to use graph theory to study uniqueness properties of the estimators.
Finally, we characterize the estimators in terms of necessary and sufficient condi-
tions, in the form of Fenchel characterizations and (self-induced) convex minorant
characterizations. These characterizations play a key role in the development of the
asymptotic theory, and also lead to computational algorithms.
Computational aspects of the MLE are discussed in Chapter 3. Since there are
no explicit formulas available for the MLE, we compute the MLE with an iterative
algorithm. We discuss two classes of algorithms and the connections between them.
The first class is based on sequential quadratic programming, where each quadratic
programming problem is solved using a support reduction algorithm. The second class
consists of iterative convex minorant algorithms. We prove convergence of algorithms
in both classes. Furthermore, we show that one particular iterative convex minorant
algorithm can be viewed as a sequential quadratic programming method that only
uses the diagonal elements of the Hessian matrix.

1The subscript n denotes the sample size.
In Chapter 4 we discuss consistency of the estimators. We prove that both esti-
mators are Hellinger consistent, and we use this to derive various forms of local and
uniform consistency.
The rate of convergence is discussed in Chapter 5. The Hellinger rate of conver-
gence and the local rate of convergence of the naive estimator are n1/3. This follows
from known results on current status data without competing risks. For the MLE, we
prove that the Hellinger rate of convergence is n1/3. Next, we derive a local asymp-
totic minimax lower bound of n1/3, meaning that no estimator can have a better local
rate of convergence than n1/3, in a minimax sense. We proceed by proving that the
local rate of convergence of the MLE is n1/3. This result comes as no surprise given
the local asymptotic minimax lower bound and the local rate of convergence of the
naive estimator. However, the proof of this result turned out to be rather involved,
and required new methods. The key idea is to first establish a rate result for ∑_{k=1}^K F̂nk that holds uniformly on a fixed neighborhood around a point t0, instead of on the usual shrinking neighborhood of order O(n^{−1/3}).
In Chapter 6 we discuss the limiting distribution of the estimators. The limiting
distribution of the naive estimator is given by the slopes of the convex minorants of
K correlated two-sided Brownian motion processes plus parabolic drifts. The limiting
distribution of the MLE involves a new self-induced limiting process, consisting of the
convex minorants of K correlated two-sided Brownian motion processes plus parabolic
drifts, plus an additional term involving the difference between the sum of the K
drifting Brownian motion processes and their convex minorants.
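For a single sub-distribution function, the building block of these limits is a two-sided Brownian motion plus parabolic drift together with its greatest convex minorant. The sketch below is our own numerical illustration, not part of the thesis: it discretizes W(t) + t² on a grid and reads off the slope of its greatest convex minorant near the origin. The window [−c, c], the grid size and the standard parabola are arbitrary illustrative choices; the actual limiting processes involve distribution-dependent scaling constants and K correlated motions.

```python
import numpy as np

def gcm_interval_slopes(t, y):
    """Greatest convex minorant (lower convex hull) of the points (t[i], y[i]),
    computed by a left-to-right monotone chain scan; returns the minorant's
    slope on each grid interval [t[i], t[i+1])."""
    hull = [0]
    for i in range(1, len(t)):
        hull.append(i)
        # pop the middle point while it lies on or above the chord of its neighbors
        while len(hull) > 2:
            a, b, c = hull[-3], hull[-2], hull[-1]
            if (y[b] - y[a]) * (t[c] - t[a]) >= (y[c] - y[a]) * (t[b] - t[a]):
                hull.pop(-2)
            else:
                break
    slopes = np.empty(len(t) - 1)
    for a, b in zip(hull[:-1], hull[1:]):
        slopes[a:b] = (y[b] - y[a]) / (t[b] - t[a])
    return slopes

def slope_at_zero(c=3.0, m=6001, rng=None):
    """Simulate W(t) + t^2 on [-c, c], with W a two-sided Brownian motion
    pinned at W(0) = 0, and return the convex minorant's slope just left of 0."""
    rng = np.random.default_rng(rng)
    t = np.linspace(-c, c, m)
    dt = t[1] - t[0]
    w = np.concatenate([[0.0], np.cumsum(rng.normal(scale=np.sqrt(dt), size=m - 1))])
    w -= np.interp(0.0, t, w)      # shift so that W(0) = 0
    y = w + t**2                   # Brownian motion plus parabolic drift
    return gcm_interval_slopes(t, y)[np.searchsorted(t, 0.0) - 1]
```

Averaging `slope_at_zero` over many replications approximates the distribution of the slope process at the origin for this single, uncorrelated component.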
In Chapter 7 we consider estimation of smooth functionals. Jewell, Van der Laan
and Henneman (2003) suggested that the naive estimator yields asymptotically effi-
cient smooth functionals. We show that this is indeed the case, and that the same
holds for the MLE.
In Chapter 8 we apply our methods to real and simulated data. We compare
the MLE and the naive estimator in a simulation study, considering both pointwise
estimation and the estimation of smooth functionals. For pointwise estimation, we
show that the MLE is superior to the naive estimator in terms of mean squared error,
both for small and large sample sizes. For the estimation of smooth functionals,
we show that the behavior of the MLE and the naive estimator is similar, and in
agreement with the results in Chapter 7.
Finally, in Chapter 9 we consider an extension of the model, where X is subject
to interval censoring case k, and Y is a continuous random variable. This model is
referred to as the interval censored continuous mark model. It is applicable to HIV
vaccine clinical trials by letting X be the time of HIV infection, and Y be the ‘viral
distance’ between the infecting HIV virus and the virus present in the vaccine. We
derive the limit of the MLE in this model, and show that the MLE is inconsistent in
general. We also suggest a simple method for repairing the MLE by discretizing Y ,
an operation that transforms the data to interval censored data with competing risks.
We illustrate the behavior of the MLE and the repaired MLE in four examples.
Chapter 2
THE ESTIMATORS
In this chapter we study finite sample properties of the MLE and the naive esti-
mator. In Section 2.1 we formally define the model and the estimators. Since both
estimators can be viewed as maximum likelihood estimators for censored data, Sec-
tion 2.2 provides a general discussion on the MLE for censored data. In Section 2.3
we use a graph theoretic perspective to derive properties of the estimators. Finally, in
Section 2.4, we characterize the estimators in terms of necessary and sufficient Fenchel
and convex minorant conditions.
2.1 Definition of the estimators
Before we define the MLE and the naive estimator, we introduce some assumptions
and notation. Recall that K ∈ N denotes the number of competing risks. The
variables of interest are (X, Y ), where X ∈ R+ is the failure time of a system, and
Y ∈ {1, . . . , K} is the corresponding failure cause. We do not observe (X, Y ) directly.
Rather, we observe the system at a random observation time T ∈ R+. At this time,
we observe whether or not failure occurred, and if and only if failure occurred, we
also observe the failure cause Y . Our goal is nonparametric estimation of the bivari-
ate distribution function of (X, Y ), or equivalently, of the vector of sub-distribution
functions F0 = (F01, . . . , F0K), where
F0k(t) = P (X ≤ t, Y = k), k = 1, . . . , K.
We make the following assumptions:
(a) T is independent of (X, Y );
(b) The system cannot fail from two or more causes at the same time.
Assumption (a) is essential for the development of the theory, and is used in the
definition of the estimators in Sections 2.1.2 and 2.1.3. Assumption (b) ensures that
the failure cause is well defined. This assumption is always satisfied by defining
simultaneous failure from several causes as a new failure cause. We do not make any
other assumptions. In particular, we do not require that all observation times are
distinct.
2.1.1 Notation
We denote the observed data by Z = (T,∆), where ∆ = (∆1, . . . ,∆K+1) and
∆k = 1{X ≤ T, Y = k}, k = 1, . . . , K, (2.1)
∆K+1 = 1{X > T}. (2.2)
Thus, for k = 1, . . . , K, ∆k = 1 if and only if failure happened by time T and was due
to cause k. Furthermore, ∆K+1 = 1 if and only if failure did not happen by time T .
Note that ∑_{k=1}^{K+1} ∆k = 1, and hence ∆K+1 = 1 − ∑_{k=1}^K ∆k. A graphical representation
of the observed data is given in Figure 2.1.
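To make the observation scheme concrete, the following sketch simulates observed data Z = (T,∆) as in (2.1)–(2.2). The distributional choices (standard exponential X and T, failure cause uniform on {1, . . . , K} and independent of X) are purely illustrative choices of ours, not assumptions of the thesis.

```python
import numpy as np

def simulate_current_status_cr(n, K=3, rng=None):
    """Simulate current status data with competing risks.

    Illustrative choices (not from the thesis): X ~ Exp(1) is the failure
    time, Y is uniform on {1, ..., K} and independent of X, and T ~ Exp(1)
    is the observation time, independent of (X, Y). Returns (T, Delta)
    with Delta the (n, K+1) indicator array of (2.1)-(2.2).
    """
    rng = np.random.default_rng(rng)
    X = rng.exponential(size=n)         # failure times
    Y = rng.integers(1, K + 1, size=n)  # failure causes
    T = rng.exponential(size=n)         # observation times, independent of (X, Y)

    Delta = np.zeros((n, K + 1), dtype=int)
    failed = X <= T
    Delta[np.flatnonzero(failed), Y[failed] - 1] = 1  # Delta_k = 1{X <= T, Y = k}
    Delta[~failed, K] = 1                             # Delta_{K+1} = 1{X > T}
    return T, Delta
```

Each row of `Delta` has exactly one entry equal to one, mirroring the constraint ∑_{k=1}^{K+1} ∆k = 1.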
Let Z1, . . . , Zn be n i.i.d. observations of Z, where Zi = (Ti,∆i) and ∆i =
(∆i1, . . . ,∆i,K+1). We call an observation Zi right censored if ∆i,K+1 = 1, and left
censored otherwise. Let T(1), . . . , T(n) be the order statistics of T1, . . . , Tn, where ties
are broken arbitrarily after ensuring that left censored observations are ordered before
right censored observations. We denote the corresponding ∆-vectors by ∆(1), . . . ,∆(n),
where ∆(i) = (∆(i)1, . . . ,∆(i),K+1).
Figure 2.1: Graphical representation of the observed data (T,∆) in an example with K = 3 competing risks. The grey sets indicate the values of (X, Y ) that are consistent with (T,∆), for each of the four possible values of ∆: (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0) and (0, 0, 0, 1).
Let ek, k = 1, . . . , K + 1, be the kth unit vector in R^{K+1}, and let

Z = {(t, ek) : t ∈ R+, k = 1, . . . , K + 1}. (2.3)
Let G be the distribution of T , and let Gn be the empirical distribution of T1, . . . , Tn.
Furthermore, let Pn be the empirical distribution of Z1, . . . , Zn, i.e., for any function h : Z → R we have Pn h(Z) = ∫ h(z) dPn(z) = (1/n) ∑_{i=1}^n h(Zi). For vectors x = (x1, . . . , xK) ∈ R^K, we define x+ = ∑_{k=1}^K xk and xK+1 = 1 − x+. For example, we write ∆+ = ∑_{k=1}^K ∆k, F0+(t) = ∑_{k=1}^K F0k(t) and F0,K+1(t) = 1 − F0+(t). The
only exception to the notation xK+1 = 1 − x+ is that we do not use it for the naive
estimator. The reason for this will become clear in Section 2.1.3.
2.1.2 The MLE
We now define the MLE F̂n = (F̂n1, . . . , F̂nK) for F0 = (F01, . . . , F0K). Note that
∆|T ∼ MultinomialK+1(1, (F01(T ), . . . , F0,K+1(T ))). (2.4)
Hence, under F = (F1, . . . , FK), the density for a single observation z = (t, δ) is
pF (z) = ∏_{k=1}^{K+1} Fk(t)^{δk}, (2.5)
with respect to the dominating measure µ = G×#, where # is counting measure on
{ek : k = 1, . . . , K + 1}. The corresponding log likelihood (divided by n)1 is
ln(F ) = ∫ log pF (u, δ) dPn(u, δ) = ∑_{k=1}^{K+1} ∫ δk logFk(u) dPn(u, δ), (2.6)
and the MLE (if it exists)2 is defined by
ln(F̂n) = max_{F∈FK} ln(F ), (2.7)
where FK is the set of all K-tuples of sub-distribution functions on R+ with pointwise
sum bounded by one. Note that we can absorb G in the dominating measure µ because
of the assumed independence between T and (X, Y ).
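As an illustration of (2.5)–(2.6), the sketch below (our own, not from the thesis) evaluates the normalized log likelihood ln(F ) for a candidate F supplied as a function; the particular candidate used in the toy check is an arbitrary choice of ours.

```python
import numpy as np

def log_likelihood(T, Delta, F):
    """Evaluate the log likelihood ln(F) of (2.6), i.e. divided by n.

    T: (n,) observation times; Delta: (n, K+1) indicator vectors as in
    (2.1)-(2.2); F: callable returning the (n, K) array of candidate
    sub-distribution values (F1(t), ..., FK(t)). The last column
    F_{K+1} = 1 - (F1 + ... + FK) is appended internally.
    """
    FK = np.asarray(F(T), dtype=float)                  # (n, K)
    full = np.column_stack([FK, 1.0 - FK.sum(axis=1)])  # (n, K+1)
    logs = np.log(full, out=np.full_like(full, -np.inf), where=full > 0)
    # for each observation exactly one delta_k equals 1, so only that term counts
    return np.sum(np.where(Delta == 1, logs, 0.0)) / len(T)

# toy check with K = 1 and the arbitrary candidate F1(t) = 1 - exp(-t)
T = np.array([0.5, 1.0])
Delta = np.array([[1, 0], [0, 1]])  # first subject failed by time T, second had not
ll = log_likelihood(T, Delta, lambda t: (1 - np.exp(-t))[:, None])
```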
2.1.3 The naive estimator
We now define the naive estimator F̃n = (F̃n1, . . . , F̃n,K+1). The naive estimator F̃nk
can be viewed as the MLE for the reduced current status data Zk = (T,∆k). To see
1In order to efficiently use the empirical process notation, we use the convention of dividing all log likelihoods by n.
2Existence of the estimators will follow from Theorem 2.1 ahead.
this, let pk,Fk(u, δ) be the marginal density of the reduced current status data Zk:
pk,Fk(u, δ) = Fk(u)^{δk} {1 − Fk(u)}^{1−δk}.
Then the naive estimator F̃nk maximizes the marginal log likelihood

lnk(Fk) = ∫ log pk,Fk(u, δ) dPn(u, δ) = ∫ [δk logFk(u) + (1 − δk) log(1 − Fk(u))] dPn(u, δ), (2.8)
for k = 1, . . . , K + 1. Thus, the naive estimators (if they exist) are defined by
lnk(F̃nk) = max_{Fk∈F} lnk(Fk), k = 1, . . . , K, (2.9)
ln,K+1(F̃n,K+1) = max_{S∈S} ln,K+1(S), (2.10)
where F is the collection of all sub-distribution functions on R+, and S is the collection
of all sub-survival functions on R+. Note that we can omit G in the marginal log
likelihood, since T and (X, Y ) are independent.
The naive estimator provides two different estimators for the overall failure time
distribution F0+, namely F̃n+ = ∑_{k=1}^K F̃nk and 1 − F̃n,K+1. Since the naive estimator
does not require the sum of the sub-distribution functions to be bounded by one, F̃n+ may exceed one. In contrast, 1 − F̃n,K+1 is always bounded between zero and one. This estimator is simply the MLE for the overall failure time distribution when information on the failure causes is ignored. In general, F̃n,K+1 ≠ 1 − F̃n+, and we therefore do not use the shorthand notation xK+1 = 1 − x+ for the naive estimator.
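Since each F̃nk is the maximum likelihood estimator for the reduced current status data (T,∆k), it can be computed as the isotonic regression of the indicators ∆(i)k in the order of the observation times, for instance with the pool adjacent violators algorithm. The sketch below is our own minimal implementation of this standard reduction; the function names are ours.

```python
import numpy as np

def pava(y, w=None):
    """Pool adjacent violators: weighted, nondecreasing isotonic fit to y."""
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    vals, wts, cnts = [], [], []  # current blocks: value, weight, length
    for yi, wi in zip(y, w):
        vals.append(yi); wts.append(wi); cnts.append(1)
        while len(vals) > 1 and vals[-2] >= vals[-1]:
            # pool the two rightmost blocks into their weighted average
            pooled = (wts[-2] * vals[-2] + wts[-1] * vals[-1]) / (wts[-2] + wts[-1])
            vals[-2], wts[-2], cnts[-2] = pooled, wts[-2] + wts[-1], cnts[-2] + cnts[-1]
            vals.pop(); wts.pop(); cnts.pop()
    return np.repeat(vals, cnts)

def naive_estimator(T, Delta):
    """Naive estimates F_nk, k = 1, ..., K, at the ordered observation times.

    Each column is the isotonic regression of the indicators Delta_k in
    the order of T, i.e., the current status MLE for the reduced data
    (T, Delta_k). (The sub-survival estimate for k = K+1 would analogously
    be an antitonic fit and is omitted here.)
    """
    order = np.argsort(T, kind="stable")
    K = Delta.shape[1] - 1
    return np.column_stack([pava(Delta[order, k]) for k in range(K)])
```

For example, indicators (0, 1, 0, 1) in time order yield the monotone fit (0, 1/2, 1/2, 1).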
2.1.4 Comparison of the two estimators
In order to point out the similarities and differences between the MLE and the naive
estimator, we give the following alternative but equivalent definition of the naive
estimator. For F = (F1, . . . , FK), we define
ln(F ) = ∫ ∑_{k=1}^K [δk logFk(u) + (1 − δk) log(1 − Fk(u))] dPn(u, δ). (2.11)
Then the naive estimator F̃n = (F̃n1, . . . , F̃nK) (if it exists) is defined by

ln(F̃n) = max_{F∈FK} ln(F ), (2.12)
where FK is the space of all K-tuples of sub-distribution functions on R+. Comparing
this optimization problem with the optimization problem (2.7) for the MLE, we see
the following two differences:
(a) The log likelihood (2.6) for the MLE contains a term involving FK+1(u) = 1 − F+(u), while the log likelihood (2.11) for the naive estimator does not include such a term;
(b) The space over which the MLE maximizes (2.6) includes the constraint that the sum of the sub-distribution functions is bounded by one, while the space over which the naive estimator maximizes (2.11) does not include such a constraint.
Thus, the MLE takes into account the K-dimensional system of sub-distribution
functions, while the naive estimator ignores this aspect of the problem. In fact,
since the sub-distribution functions in optimization problem (2.12) are not related to
each other, the optimization problem can be split into the K optimization problems
defined in (2.9). Since these optimization problems correspond to the MLE for uni-
variate current status data, both computational results and asymptotic theory follow
straightforwardly from known results for current status data (see Groeneboom and
Wellner (1992, Part II, Sections 1.1, 4.1 and 5.1)).
The fact that the MLE takes into account the system of sub-distribution functions
leads to more complicated computation and asymptotic theory. However, these complications result in a better pointwise behavior of the MLE, as shown in the simulation
study in Section 8.2.
2.2 Censored data perspective
From the definitions of the MLE and the naive estimator, we see that both estima-
tors can be viewed as nonparametric maximum likelihood estimators for censored
data. Viewing the estimators from this perspective allows us to use readily available
computational algorithms and theory for the MLE for censored data.
We consider the following general framework. Let W be a random variable taking
values in W. Suppose that W has distribution F0. Our goal is to estimate this
distribution. However, we do not observe W directly. Rather, we observe a vector
of random sets D = (D1, . . . , Dp) that form a partition of W, i.e., ∪_{j=1}^p Dj = W and Dj ∩ Dk = ∅ for j ≠ k ∈ {1, . . . , p}. We assume that D is independent of
W . In principle, we can allow the number of random sets to be random, but for
our purposes that is not needed. Furthermore, we observe an indicator vector ∆ =
(∆1, . . . ,∆p), where ∆j = 1{W ∈ Dj}, j = 1, . . . , p. Thus, we observe a vector D
containing a random partition of W, and an indicator vector ∆ indicating which set
R ∈ {D1, . . . , Dp} contains the unobservable W . We call the set R an observed set.
Using the convention 0 ·Dj = ∅, we can write
R = ∪_{j=1}^p ∆j Dj.
Let Z1, . . . , Zn be n i.i.d. copies of Z = (D,∆). These data define n i.i.d. observed
sets R1, . . . , Rn. Writing the log likelihood in terms of these sets gives
ln(F ) = (1/n) ∑_{i=1}^n logPF (Ri),
where PF (Ri) denotes the probability mass in Ri under distribution F . The maximum
likelihood estimator (if it exists) is defined by
ln(F̂n) = max_{F∈F} ln(F ), (2.13)
where F is the space of all distribution functions on W. Since ln(F ) is optimized
over the function space F , the optimization problem (2.13) is infinite dimensional.
However, the number of parameters can be reduced by generalizing the reasoning
of Turnbull (1976) for univariate censored data. It follows that the estimators can
only assign mass to a finite collection of disjoint sets A1, . . . , Am, called maximal
intersections by Wong and Yu (1999). In the literature, there are several equivalent
definitions of maximal intersections. Wong and Yu (1999) define Aj to be a maximal
intersection if and only if it is a finite intersection of the Ri’s such that, for each i, Aj ∩ Ri = ∅ or Aj ∩ Ri = Aj. Gentleman and Vandal (2002) use a graph theoretic
perspective. They show that the maximal intersections correspond to maximal cliques
of the intersection graph of the observed sets. We discuss this perspective in detail
in the next section. For observed sets that take the form of rectangles in Rp, p ∈ N,
Maathuis (2005) introduces yet another way to view the maximal intersections, using
a height map of the observed sets. This height map is a function h : Rp → {0, 1, . . .}, where h(x) is defined as the number of observed sets that overlap at the point x ∈ Rp.
Maathuis (2005) shows that the maximal intersections are exactly the local maxima of
the height map of a canonical version of the observed sets. We say that R′1, . . . , R′n are a canonical version of R1, . . . , Rn if the following three properties hold: (i) R1, . . . , Rn and R′1, . . . , R′n have the same intersection structure, i.e., Ri ∩ Rj = ∅ if and only if R′i ∩ R′j = ∅, for all i, j ∈ {1, . . . , n}; (ii) the x-coordinates of R′1, . . . , R′n are distinct and take values in {1, . . . , 2n}; (iii) the y-coordinates of R′1, . . . , R′n are distinct and take values in {1, . . . , 2n}. Thus, any ties that may have been present in R1, . . . , Rn are resolved in R′1, . . . , R′n, but in a way that does not affect the intersection structure.
For details on the transformation to canonical sets, see Maathuis (2005, Section 2.1).
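The height map makes the reduction step concrete. The sketch below is hypothetical illustrative code, written for the simpler case of half-open intervals (a, b] on the real line with distinct (canonical) endpoints: a maximal intersection is exactly a local maximum of the height map, i.e., it ends wherever a right endpoint immediately follows a left endpoint in the sorted sweep.

```python
def maximal_intersections(intervals):
    """Maximal intersections of half-open intervals (a, b] via the
    height map: sweep the sorted endpoints; the height rises at each
    left endpoint and falls at each right endpoint, and a local maximum
    is reached whenever a right endpoint directly follows a left one.
    Assumes canonical sets: all 2n endpoints are distinct.
    """
    events = []
    for a, b in intervals:
        events.append((a, "L"))   # left endpoint: height goes up
        events.append((b, "R"))   # right endpoint: height goes down
    events.sort()
    result, prev = [], None
    for x, kind in events:
        if kind == "R" and prev is not None and prev[1] == "L":
            result.append((prev[0], x))   # the interval (prev, x]
        prev = (x, kind)
    return result
```

For example, the observed sets (1, 5], (2, 7], (6, 9] yield the maximal intersections (2, 5] and (6, 7].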
By generalizing the reasoning of Turnbull (1976), it follows that the MLE is indifferent to the distribution of mass within the maximal intersections. As a result,
the MLE is typically not uniquely defined on the maximal intersections. This type of
non-uniqueness is called representational non-uniqueness by Gentleman and Vandal
(2002). Thus, we can at best hope to determine the probability masses αj = PF (Aj),
j = 1, . . . , m. We let α = (α1, . . . , αm) and write the probability mass in an observed
set Ri in terms of α:
Pα(Ri) = ∑_{j=1}^{m} αj 1{Aj ⊆ Ri}. (2.14)
Then we can write the log likelihood as
ln(α) = (1/n) ∑_{i=1}^{n} log Pα(Ri) = (1/n) ∑_{i=1}^{n} log ( ∑_{j=1}^{m} αj 1{Aj ⊆ Ri} ). (2.15)
Thus, we can think of the computation of the estimators as a two step process. First,
in the reduction step, we compute the maximal intersections A1, . . . , Am. Next, in the
optimization step, we solve the optimization problem
ln(α̂) = max_{α ∈ A} ln(α), (2.16)

where

A = {α ∈ Rm : αj ≥ 0, j = 1, . . . , m, 1^T α = 1}
and 1 is the all-one vector in Rm. This optimization problem is an m-dimensional
convex constrained optimization problem. Existence of the MLE follows directly from
standard methods in optimization theory.
Theorem 2.1 The MLE α̂ defined by (2.16) exists.
Proof: Letting log(0) = −∞, ln(α) is a continuous extended real-valued function on the nonempty compact set A. Hence, the maximum exists by, e.g., Zeidler (1985, Corollary 38.10). □
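For concreteness, one standard way to carry out the optimization step (2.16) is Turnbull's self-consistency (EM) iteration, which repeatedly redistributes the mass of each observation over the maximal intersections it contains. The sketch below is a generic illustration of this idea in terms of the clique matrix H with Hij = 1{Aj ⊆ Ri} (introduced in Section 2.3.1), not necessarily the algorithm used elsewhere in this thesis; faster methods exist.

```python
import numpy as np

def em_mle(H, n_iter=1000):
    """Self-consistency (EM) iterations for (2.16). H is the n x m
    0/1 matrix with H[i, j] = 1{A_j in R_i}, so (H @ alpha)[i] is the
    probability mass P_alpha(R_i). Each step redistributes the mass of
    every observation over the maximal intersections it contains."""
    n, m = H.shape
    alpha = np.full(m, 1.0 / m)          # start from the uniform vector
    for _ in range(n_iter):
        p = H @ alpha                    # p[i] = P_alpha(R_i)
        alpha = alpha * (H.T @ (1.0 / p)) / n
    return alpha
```

Each iterate stays in the simplex A, and the log likelihood is nondecreasing along the iterations.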
The optimization problem (2.16) may have several solutions. This forms a second
source of non-uniqueness for the MLE, called mixture non-uniqueness by Gentleman
and Vandal (2002). We will show in Section 2.3 that for current status data with
competing risks, both the MLE and the naive estimator are mixture unique. However,
we first show how both estimators fit into the censored data framework.
2.2.1 Censored data perspective of the MLE
For the MLE, the variable of interest is W = (X, Y ), taking values in the space W = R+ × {1, . . . , K}. The observation time T defines a partition of p = K + 1 random sets in W:

Dk = (0, T ] × {k}, k = 1, . . . , K, (2.17)
DK+1 = (T,∞) × {1, . . . , K}. (2.18)
Since there is a one-to-one correspondence between D = (D1, . . . , DK+1) and T , the
assumption that T is independent of (X, Y ) is equivalent to the assumption that D is
independent of (X, Y ). Furthermore, note that ∆k = 1{X ≤ T, Y = k} = 1{(X, Y ) ∈ Dk} for k = 1, . . . , K, and ∆K+1 = 1{X > T} = 1{(X, Y ) ∈ DK+1}. Hence, the ∆
vector indicates which set contains the unobservable (X, Y ), and the observed data
(T,∆) give exactly the same information as (D,∆).
The corresponding observed sets are R = ∪_{k=1}^{K+1} ∆kDk, so that

R = (0, T ] × {k} if ∆k = 1, k = 1, . . . , K,
R = (T,∞) × {1, . . . , K} if ∆K+1 = 1. (2.19)
It follows that we can write the log likelihood (2.6) as ln(F ) = (1/n) ∑_{i=1}^{n} log PF (Ri).
The MLE maximizes this expression over all bivariate sub-distribution functions F
on R+ × {1, . . . , K}, or equivalently, over all K-tuples of sub-distribution functions
F = (F1, . . . , FK) with pointwise sum bounded by one.
We now consider the maximal intersections of the observed sets R1, . . . , Rn. Note
that the observed sets can take the form (t,∞) × {1, . . . , K} for some t ∈ R+. Such sets
are not rectangles in R2, and hence we cannot directly use the concept of the height
map of Maathuis (2005). However, by transforming such sets into (t,∞)× [1, K], we
do have rectangles in R2. We can then compute the maximal intersections using the
concept of the height map. Afterwards we transform sets of the form (t,∞) × [1, K]
back to (t,∞) × {1, . . . , K}.
Once we have computed α̂, we obtain F̂nk(t) by summing the mass in (0, t] × {k}, for k = 1, . . . , K and t ∈ R+. For each k ∈ {1, . . . , K + 1}, we call A a maximal intersection for F̂nk if A is involved in the computation of F̂nk. A precise definition
is given below.
Definition 2.2 Let k ∈ {1, . . . , K}, and let R = {R1, . . . , Rn} be the observed sets as defined in (2.19). We call A a maximal intersection for F̂nk if it is a maximal intersection of R and A ∩ (R × {k}) ≠ ∅. We call A a maximal intersection for F̂n+ (or equivalently, for F̂n,K+1) if A is a maximal intersection for some F̂nk, k = 1, . . . , K.
Note that maximal intersections for F̂n+ are sets in R+ × {1, . . . , K}, although F̂n+ is a function on R+. Recall from Section 2.1.1 that we order the observations such
that their observation times are nondecreasing, where ties are broken arbitrarily after ensuring that left censored observations are ordered before right censored observations. Hence, if there is an observation Zi such that Ti = T(n) and ∆i,K+1 = 1, then ∆(n),K+1 = 1 holds, even if there are other observations with Ti = T(n) and ∆ik = 1 for some k ∈ {1, . . . , K}. This is used in the following lemma, which provides information on the form of the maximal intersections for F̂nk. The lemma follows directly
from the idea of the height map.
Lemma 2.3 Let k ∈ {1, . . . , K}. Each maximal intersection for F̂nk satisfies one of the following two conditions:

(i) A = (T(i), T(j)] × {k}, with i < j, ∆(i),K+1 = 1, ∆(j)k = 1, and ∆(l),K+1 = ∆(l)k = 0 for all l such that T(i) < T(l) < T(j);

(ii) A = (T(n),∞) × {1, . . . , K}, with ∆(n),K+1 = 1.

Moreover, if a set A satisfies one of these conditions, then A is a maximal intersection for F̂nk.
2.2.2 Censored data perspective of the naive estimator
For the naive estimator F̃nk, we consider the reduced current status data Zk = (T,∆k).
Define the variables
Wk = X · 1{Y = k} + ∞ · 1{Y ≠ k}, k = 1, . . . , K,
WK+1 = X,
taking values in W = R+ ∪ {∞}. Note that F0k(t) = P (Wk ≤ t) for k = 1, . . . , K,
and F0,K+1(t) = P (WK+1 > t). Hence we can take W1, . . . ,WK+1 to be our variables
of interest.
The observation time T defines a partition of p = 2 random sets in W:
D1 = (0, T ] and D2 = (T,∞]. (2.20)
Since there is a one-to-one correspondence between D = (D1, D2) and T , the assumption that T is independent of (X, Y ) is equivalent to the assumption that D is independent of W1, . . . ,WK+1.
For k = 1, . . . , K, note that ∆k = 1{X ≤ T, Y = k} = 1{Wk ≤ T} = 1{Wk ∈ D1}. Hence, the vector (∆k, 1 − ∆k) indicates whether D1 or D2 contains the unobservable Wk, and the reduced current status data (T,∆k) give exactly the same information as (D,∆k). The corresponding observed sets are R(k) = ∆kD1 ∪ (1 − ∆k)D2,
so that
R(k) = (0, T ] if ∆k = 1,
R(k) = (T,∞) if ∆k = 0. (2.21)
We can write the log likelihood (2.8) as lnk(Fk) = (1/n) ∑_{i=1}^{n} log PFk(R(k)i). The naive estimator maximizes this expression over all sub-distribution functions Fk on R+.
For k = K + 1, note that ∆K+1 = 1{X > T} = 1{WK+1 ∈ D2}. Hence, the vector (1 − ∆K+1, ∆K+1) indicates whether D1 or D2 contains the unobservable X, and the reduced current status data (T,∆K+1) give exactly the same information as (D,∆K+1). The corresponding observed sets are R(K+1) = (1 − ∆K+1)D1 ∪ ∆K+1D2,
so that
R(K+1) = (0, T ] if ∆K+1 = 0,
R(K+1) = (T,∞) if ∆K+1 = 1. (2.22)
We can write the log likelihood (2.8) as ln,K+1(S) = (1/n) ∑_{i=1}^{n} log PS(R(K+1)i). The naive estimator F̃n,K+1 maximizes this expression over all sub-survival functions S on R+.
Definition 2.4 For k = 1, . . . , K + 1, we call A a maximal intersection for F̃nk if it is a maximal intersection of the observed sets R(k)1, . . . , R(k)n as defined in (2.21) and (2.22).
The maximal intersections for the naive estimator are described in Lemmas 2.5 and 2.6. Both lemmas follow directly from the idea of the height map.

Lemma 2.5 Let k ∈ {1, . . . , K}. Each maximal intersection A for F̃nk satisfies one of the following two conditions:
(i) A = (T(i), T(j)], with (T(i), T(j)) ∩ {T1, . . . , Tn} = ∅, ∆(i)k = 0, and ∆(j)k = 1.
(ii) A = (T(n),∞), with ∆(n)k = 0.
Moreover, if an interval A satisfies one of these conditions, then it is a maximal intersection for F̃nk.
Lemma 2.6 Each maximal intersection for F̃n,K+1 satisfies one of the following two conditions:

(i) A = (T(i), T(j)], with (T(i), T(j)) ∩ {T1, . . . , Tn} = ∅, ∆(i),K+1 = 1, and ∆(j),K+1 = 0.

(ii) A = (T(n),∞), with ∆(n),K+1 = 1.

Moreover, if an interval A satisfies one of these conditions, then A is a maximal intersection for F̃n,K+1.
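Lemmas 2.5 and 2.6 turn the reduction step for the naive estimator into a single scan over the ordered observation times. The sketch below is a hypothetical illustration for distinct observation times: to get the maximal intersections for F̃nk pass the indicators ∆ik, and for F̃n,K+1 pass 1 − ∆i,K+1 (so that the two conditions line up with Lemma 2.6).

```python
import math

def naive_maximal_intersections(times, deltas):
    """Maximal intersections for a naive estimator (Lemmas 2.5/2.6),
    for distinct observation times. deltas[i] = 1 means the i-th
    observed set is (0, times[i]], and deltas[i] = 0 means it is
    (times[i], infinity). Returns half-open intervals (a, b], with
    b = math.inf for the unbounded maximal intersection."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    t = [times[i] for i in order]
    d = [deltas[i] for i in order]
    mis = []
    for i in range(len(t) - 1):
        if d[i] == 0 and d[i + 1] == 1:   # condition (i): (T(i), T(i+1)]
            mis.append((t[i], t[i + 1]))
    if d[-1] == 0:                        # condition (ii): (T(n), inf)
        mis.append((t[-1], math.inf))
    return mis
```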
2.2.3 Comparing the maximal intersections for both estimators
Definition 2.7 For any set A ⊆ R2, we define the x-interval and y-interval of A to
be the projections of A on the x-axis and y-axis. Furthermore, we define the lower
and upper endpoint of A to be the lower and upper endpoint of its x-interval.
We now compare the maximal intersections for F̂nk and F̃nk, for k ∈ {1, . . . , K}.
Lemma 2.8 For each k = 1, . . . , K, the number of maximal intersections for the MLE F̂nk is at least as large as the number of maximal intersections for the naive estimator F̃nk. Moreover, each upper endpoint of a maximal intersection for F̂nk is an upper endpoint of a maximal intersection for F̃nk.
Proof: Let A be a maximal intersection for F̂nk. We show that there is a maximal intersection for F̃nk with the same upper endpoint. Note that A must satisfy one of the two conditions of Lemma 2.3. First, suppose that A = (T(n),∞) × {1, . . . , K} with ∆(n),K+1 = 1. Then ∆(n)k = 0, and (T(n),∞) is a maximal intersection for F̃nk by Lemma 2.5. Next, suppose that A = (T(i), T(j)] × {k}, with ∆(i),K+1 = 1, ∆(j)k = 1 and ∆(l)k = ∆(l),K+1 = 0 for all l such that T(i) < T(l) < T(j). Then ∆(j−1)k = 0, and hence (T(j−1), T(j)] is a maximal intersection for F̃nk by Lemma 2.5. □
Lemma 2.9 The number of maximal intersections for F̃n,K+1 is at most as large as the number of maximal intersections for F̂n,K+1. Moreover, the collection of lower endpoints of the maximal intersections for F̃n,K+1 is identical to the collection of lower endpoints of the maximal intersections for F̂n,K+1. As a result, the number of regions on the x-axis where F̃n,K+1 can put mass is identical to the number of regions on the x-axis where F̂n,K+1 can put mass. Finally, the union of the maximal intersections for F̃n,K+1 is contained in the union of the x-intervals of the maximal intersections for F̂n,K+1.
Proof: Let A be a maximal intersection for F̃n,K+1. We show that there is a maximal intersection for F̂n,K+1 with the same lower endpoint. Note that A must satisfy one of the two conditions of Lemma 2.6. First, suppose that A = (T(i), T(j)] with (T(i), T(j)) ∩ {T1, . . . , Tn} = ∅, ∆(i),K+1 = 1 and ∆(j),K+1 = 0. Since ∆(j),K+1 = 0, there must be a k ∈ {1, . . . , K} such that ∆(j)k = 1. But this implies that (T(i), T(j)] × {k} is a maximal intersection for F̂nk, by Lemma 2.3. Next, suppose that A = (T(n),∞) with ∆(n),K+1 = 1. Then (T(n),∞) × {1, . . . , K} is a maximal intersection for F̂n1, . . . , F̂nK by Lemma 2.3, and hence it is a maximal intersection for F̂n,K+1 by definition.

Next, let A be a maximal intersection for F̂n,K+1. We show that there is a maximal intersection for F̃n,K+1 with the same lower endpoint. By definition, it follows that there is a k ∈ {1, . . . , K} so that A is a maximal intersection for F̂nk. Hence, A must satisfy one of the two conditions of Lemma 2.3. First, suppose that A = (T(i), T(j)] × {k}, with ∆(i),K+1 = 1, ∆(j)k = 1 and ∆(l)k = ∆(l),K+1 = 0 for all l
Table 2.1: Example data with K = 2 competing risks, illustrating that the number of positive maximal intersections for F̃n,K+1 can be larger than the number of positive maximal intersections for F̂n,K+1.

  i   t(i)   δ(i)1   δ(i)2   δ(i)3
  1    1      1       0       0
  2    2      0       0       1
  3    3      0       0       1
  4    4      1       0       0
  5    5      0       0       1
  6    6      0       1       0
  7    7      0       1       0
  8    8      1       0       0
  9    9      0       1       0
 10   10      0       1       0
such that T(i) < T(l) < T(j). Let S = (T(i), T(j)) ∩ {T1, . . . , Tn}. If S = ∅, then (T(i), T(j)] is a maximal intersection for F̃n,K+1 by Lemma 2.6. Otherwise, (T(i), min S] is a maximal intersection for F̃n,K+1. Next, suppose that A = (T(n),∞) × {1, . . . , K} with ∆(n),K+1 = 1. Then (T(n),∞) is a maximal intersection for F̃n,K+1 by Lemma 2.6.
The last statement follows by combining the fact that the collections of lower endpoints of the maximal intersections for F̃n,K+1 and F̂n,K+1 are identical, with the fact that maximal intersections for F̃n,K+1 cannot contain observation times in their interior (Lemma 2.6). □
Remark 2.10 The last statement of Lemma 2.9 has implications for representational non-uniqueness of the estimators. It shows that it is possible that the area in which the MLE F̂n,K+1 suffers from representational non-uniqueness is larger than the area in which F̃n,K+1 suffers from representational non-uniqueness. This was also noted by Hudgens, Satten and Longini (2001), and partly motivated their pseudo-likelihood estimator. However, note that it can also happen that F̃n,K+1 is non-unique over a larger area, if many of the maximal intersections for F̂n,K+1 get zero mass. For an example, see Tables 2.1 and 2.2.
Motivated by Remark 2.10, we now consider maximal intersections that get positive
mass. We introduce the following terminology:
Table 2.2: The estimators for the data in Table 2.1, in terms of their maximal intersections (MIs) and the corresponding probability masses.

  F̂n,K+1:                       F̃n,K+1:
  MIs             mass           MIs       mass
  (0, 1] × {1}    3/10           (0, 1]    1/3
  (3, 4] × {1}    0              (3, 4]    1/6
  (5, 8] × {1}    0              (5, 6]    1/2
  (5, 6] × {2}    7/10
Definition 2.11 Let k ∈ {1, . . . , K + 1}. We say that A is a positive maximal intersection for F̂nk if A is a maximal intersection for F̂nk and the MLE assigns positive mass to A. Similarly, we say that A is a positive maximal intersection for F̃nk if A is a maximal intersection for F̃nk and F̃nk assigns positive mass to A.
After reading Lemma 2.9, one may wonder whether the number of positive maximal intersections for F̃n,K+1 is at most as large as the number of positive maximal intersections for F̂n,K+1. This is indeed often the case in simulations, but not always. A counterexample can be found in Table 2.1. In this example, F̂n,K+1 has four maximal intersections, given in Table 2.2. The naive estimator F̃n,K+1 has three maximal intersections, with corresponding masses given in Table 2.2. Note that the maximal intersections satisfy the statement in Lemma 2.9. However, there are only two positive maximal intersections for F̂n,K+1, while there are three positive maximal intersections for F̃n,K+1.
2.3 Graph theory and uniqueness
Gentleman and Vandal (2001), Gentleman and Vandal (2002), Maathuis (2003), and
Vandal, Gentleman and Liu (2006) use a graph theoretic perspective to study properties of the maximum likelihood estimator for censored data. Before we apply these
methods to our problem, we give an introduction to graph theory. This introduction
is mostly based on Golumbic (1980), and also partly given in Maathuis (2003, Section
3.3).
2.3.1 Introduction to graph theory for censored data
Let G = (V,E) be an undirected graph, where V is a set of vertices, and E is a set
of edges. An edge is a collection of two vertices. Two vertices v and w are said to
be adjacent in G if there is an edge between v and w, i.e., vw ∈ E. We say that
two sets of vertices S1 and S2 are adjacent if there is at least one pair of vertices
(v, w) such that v ∈ S1, w ∈ S2 and vw ∈ E. A subgraph of G = (V,E) is defined
to be any graph G′ = (V ′, E ′) such that V ′ ⊆ V and E ′ ⊆ E. Given a subset
A ⊆ V of vertices, we define the subgraph induced by A to be GA = (A,EA), where
EA = {xy ∈ E : x ∈ A, y ∈ A}.
We call a subset M ⊆ V of vertices a clique if every pair of distinct vertices in M
is adjacent. We call M ⊆ V a maximal clique if there is no clique in G that properly
contains M as a subset.³ Every finite graph has a finite number of maximal cliques that we denote by C = {C1, . . . , Cm}.
Let R = {R1, . . . , Rn} be a family of sets. The intersection graph of R is obtained
by representing each set in R by a vertex, and connecting two vertices by an edge if
and only if their corresponding sets intersect. An intersection graph of a collection
of intervals on a linearly ordered set is called an interval graph. Alternatively, an
undirected graph G is called an interval graph if it can be thought of as an intersection
graph of a set of intervals on the real line. Every maximal clique Cj in an intersection
graph has a real representation Aj = ∩_{R ∈ Cj} R, given by the intersection of the sets that form the maximal clique.
A sequence of vertices (v0, v1, . . . , vl) is called a cycle of length l + 1 if vi−1vi ∈ E
for all i = 1, . . . , l and vlv0 ∈ E. A cycle (v0, . . . , vl) is called a simple cycle if vi ≠ vj
³Instead of the terms ‘clique’ and ‘maximal clique’, some authors use the terms ‘complete subgraph’ and ‘clique’.
for i ≠ j. A simple cycle (v0, v1, . . . , vl) is called chordless if for all i = 0, . . . , l,
vivj ∈ E only for j = (i± 1) mod (l+ 1). A graph is called triangulated if it does not
contain chordless cycles of length strictly greater than three. Hajos (1957) showed
that every interval graph is triangulated.
A clique graph of R is an intersection graph of the maximal cliques C. Thus,
in this graph each vertex represents a maximal clique, and two vertices Cj and Ck
are adjacent if and only if Cj ∩ Ck 6= ∅, i.e., if there is at least one set in R that is
an element of both Cj and Ck. We define the clique matrix to be a vertices versus
maximal cliques incidence matrix. For n observed sets with m maximal cliques, this
is an n × m matrix H with elements Hij = 1{Aj ⊆ Ri}.⁴
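In the interval case the clique matrix is easy to fill in: since each Aj is a maximal intersection, Aj ∩ Ri is either empty or all of Aj, so Hij = 1{Aj ⊆ Ri} reduces to endpoint comparisons. A hypothetical sketch for half-open intervals (a, b] (with b possibly infinite):

```python
def clique_matrix(observed, mis):
    """Clique matrix H with H[i][j] = 1{A_j in R_i}, for half-open
    intervals (a, b] represented as pairs. Rows index the observed
    sets R_i, columns the maximal intersections A_j."""
    def contains(R, A):
        # (A[0], A[1]] is a subset of (R[0], R[1]]
        # iff R[0] <= A[0] and A[1] <= R[1]
        return R[0] <= A[0] and A[1] <= R[1]
    return [[1 if contains(R, A) else 0 for A in mis] for R in observed]
```

For the observed sets (1, 5], (2, 7], (6, 9] with maximal intersections (2, 5] and (6, 7], this gives the 3 × 2 matrix with rows (1, 0), (1, 1), (0, 1).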
We now return to the maximum likelihood estimator for censored data. Let R = {R1, . . . , Rn} be the observed sets. Gentleman and Vandal (2001) showed that the maximal intersections A1, . . . , Am of R, defined in Section 2.2, are exactly the real
representations of the maximal cliques of the intersection graph of R. Hence, we
can study the intersection graph to deduce properties of the MLE. In particular,
Gentleman and Vandal (2002, Lemma 4) showed that α̂ is unique if the intersection
graph is triangulated. An alternative proof can be found in Maathuis (2003, Lemma
3.13). Finally, we can use the clique matrix H to rewrite the optimization problem
(2.16). Namely, Pα(Ri) = (Hα)i, so that (2.16) becomes

ln(α̂) = max_{α ∈ A} (1/n) ∑_{i=1}^{n} log ((Hα)i).
2.3.2 Graph theoretic aspects and uniqueness of the naive estimator
For k = 1, . . . , K + 1, let R(k) = {R(k)1, . . . , R(k)n} be the observed sets for the naive estimator F̃nk, as defined in (2.21) and (2.22). The following proposition uses the
structure of the intersection graph and the form of the maximal intersections to
⁴Note that our H is the transpose of the incidence matrix defined in Gentleman and Vandal (2002, page 559).
prove uniqueness of the naive estimators at the observation times. Alternatively,
uniqueness can be proved from strict concavity of the marginal log likelihoods lnk(Fk),
k = 1, . . . , K + 1, defined in (2.8).
Proposition 2.12 The naive estimators F̃nk(t), k = 1, . . . , K + 1, are unique at all observation times T1, . . . , Tn.
Proof: Let k ∈ {1, . . . , K + 1}. Note that the observed sets R(k) are intervals in R. Hence, the intersection graph of R(k) is an interval graph, and it follows from Hajos (1957) that the graph is triangulated. Thus, by Gentleman and Vandal (2002, Lemma 4), the naive estimator is mixture unique.

We obtain F̃nk(t) by summing all mass in the interval (0, t]. Thus, F̃nk(t) is unique if t is not in the interior of a maximal intersection. Lemma 2.5 implies that the observation times T1, . . . , Tn can never be contained in the interior of a maximal intersection, so that F̃nk is unique at all observation times. □
In fact, it follows from the proof of Proposition 2.12 that we can make a stronger statement: for each k = 1, . . . , K, F̃nk(t) is unique if and only if t is not in the interior of a positive maximal intersection for F̃nk (see Definitions 2.4 and 2.11 for this terminology).
2.3.3 Graph theoretic aspects of the MLE
In this section we study the intersection graph and clique graph for the MLE. We also
derive a bound on the number of maximal intersections and describe the structure of
the clique matrix.
Recall that the observed sets R = {R1, . . . , Rn} defined in (2.19) are sets in R+ × {1, . . . , K}. We define the following partition of R:
Definition 2.13 Let R = {R1, . . . , Rn} be the observed sets defined in (2.19). We define

Rk = {Ri ∈ R : ∆ik = 1}, k = 1, . . . , K + 1. (2.23)
Furthermore, let nk denote the number of observed sets in Rk, k = 1, . . . , K + 1.
Theorem 2.14 describes the structure of the intersection graph of R.
Theorem 2.14 The intersection graph G = (V,E) of R has the following properties:

(a) Each Rk, k = 1, . . . , K + 1, is a clique in G;

(b) For k ≠ k′ ∈ {1, . . . , K}, Rk and Rk′ are not adjacent in G;

(c) For i ≠ j ∈ {1, . . . , n} and k ∈ {1, . . . , K}, Ri ∈ Rk and Rj ∈ RK+1 are adjacent in G if and only if Ti > Tj;
(d) G is triangulated.
Proof: Properties (a)-(c) follow from the definition of the observed sets in (2.19).
To prove (a), let k ∈ {1, . . . , K} and let Ri and Rj be two different observed sets in Rk. Then Ri ∩ Rj = (0, Ti ∧ Tj ] × {k} ≠ ∅. Hence, the corresponding vertices are
adjacent in G. Similarly, for two different observed sets Ri and Rj in RK+1, we have
Ri ∩ Rj = (Ti ∨ Tj ,∞) × {1, . . . , K}. Hence, for each k = 1, . . . , K + 1, every pair
of distinct vertices in Rk is adjacent in G. By definition, this means that each Rk,
k = 1, . . . , K + 1 is a clique.
To prove (b), let k ≠ k′ ∈ {1, . . . , K} and let Ri ∈ Rk and Rj ∈ Rk′ . Then Ri ⊆ R × {k} and Rj ⊆ R × {k′}. Hence, Ri ∩ Rj = ∅, and Ri and Rj are not adjacent in G.
To prove (c), let k ∈ {1, . . . , K}, Ri ∈ Rk, and Rj ∈ RK+1. Then

Ri ∩ Rj = (Tj , Ti] × {k} if Ti > Tj,
Ri ∩ Rj = ∅ if Ti ≤ Tj.
Hence, Ri and Rj are adjacent in G if and only if Ti > Tj .
We now prove that G is triangulated. We define
Vk = Rk ∪RK+1, k = 1, . . . , K.
Let GVk = (Vk, EVk) be the subgraph of G that is induced by Vk. Note that GVk can be viewed as the intersection graph of the following intervals in R:

{(0, Ti] : Ri ∈ Rk} ∪ {(Ti,∞) : Ri ∈ RK+1}.
This implies that GVk is an interval graph, and hence that it is triangulated (Hajos (1957)). Now consider the original intersection graph G = (V,E) of R. Note that V = ∪_{k=1}^{K} Vk. Furthermore, since for all k ≠ k′ ∈ {1, . . . , K}, Rk is not adjacent to Rk′ and Vk ∩ Vk′ = RK+1, it follows that E = ∪_{k=1}^{K} EVk. Let c = (v0, . . . , vl) be a chordless cycle in G. We show by contradiction that c is completely contained in Vk for some k ∈ {1, . . . , K}. Thus, suppose that c contains vertices from both Rk and Rk′ for some k ≠ k′ ∈ {1, . . . , K}. Then, since Rk and Rk′ are not adjacent, c must contain at least two vertices vi and vj in RK+1, with j ≠ (i ± 1) mod (l + 1). However, since RK+1 is a clique, vi and vj are adjacent in G. This contradicts the assumption that c is chordless. It follows that c must be completely contained in Vk for some k ∈ {1, . . . , K}. Hence, c is a chordless cycle in GVk, and since each subgraph GVk is triangulated, the length of c is at most three. This proves that G is triangulated. □
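Theorem 2.14 reduces adjacency in the intersection graph G to a few comparisons on the raw data, which is convenient for building G without ever forming the sets (2.19). A hypothetical sketch, where an observation is a pair (t, k), with k = K + 1 marking a right censored observation:

```python
def adjacent(obs_i, obs_j, K):
    """Adjacency in the intersection graph of the observed sets (2.19),
    read off from Theorem 2.14: within R_k and within R_{K+1} all
    vertices are adjacent; R_k and R_{k'} (k != k' <= K) are never
    adjacent; R_i in R_k and R_j in R_{K+1} are adjacent iff T_i > T_j."""
    (ti, ki), (tj, kj) = obs_i, obs_j
    if ki == K + 1 and kj == K + 1:
        return True          # property (a): R_{K+1} is a clique
    if ki == K + 1:
        return tj > ti       # property (c), roles of i and j swapped
    if kj == K + 1:
        return ti > tj       # property (c)
    return ki == kj          # properties (a) and (b)
```

For the data in Table 2.3 (K = 2), for instance, the observed set with (t, k) = (7, 1) is adjacent to the right censored observation at t = 4, but the two failure causes are never adjacent to each other.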
Figure 2.2 shows the intersection graph for the MLE, for the example data with
Table 2.3: Example data with K = 2 competing risks. The data are used to illustrate the intersection graph of the MLE in Figure 2.2, and the corresponding clique matrix in Table 2.4.

  i   t(i)   δ(i)1   δ(i)2   δ(i)3
  1    1      1       0       0
  2    2      1       0       0
  3    4      0       0       1
  4    5      0       1       0
  5    7      1       0       0
  6    8      0       1       0
  7    9      0       0       1
  8   11      1       0       0
  9   12      0       0       1
 10   15      0       1       0
 11   16      1       0       0
 12   17      0       0       1
 13   18      0       1       0
 14   20      0       0       1
K = 2 competing risks in Table 2.3. We can think of RK+1 = R3 as the backbone of
the graph, to which R1 and R2 are connected as defined by property (c) of Theorem
2.14. Furthermore, note that the sets R1 and R2 are not adjacent to each other, and
that all sets R1,R2,R3 are cliques. Finally, the graph is triangulated.
We now consider the maximal cliques C = {C1, . . . , Cm} of the intersection graph G of R. Since R1, . . . ,RK are pairwise not adjacent in G (Theorem 2.14 (b)), it follows that a maximal clique can only contain sets from one of the collections R1, . . . ,RK . Hence, each maximal clique is contained in Rk ∪ RK+1 for some k ∈ {1, . . . , K}. We define
the x-interval of a maximal clique C to be the x-interval of the real representation
of C. This allows us to define the following partition of the collection of maximal
cliques:
Definition 2.15 Let C be the collection of maximal cliques of the intersection graph
G of the observed sets R defined in (2.19). We define
Ck = {C ∈ C : C ∩ Rk ≠ ∅}, k = 1, . . . , K, (2.24)
CK+1 = {C ∈ C : C ⊆ RK+1}. (2.25)
Figure 2.2: Intersection graph for the MLE for the data in Table 2.3. Vertices inside a dashed line form a clique, and edges between such vertices are omitted for clarity of the picture.
Furthermore, let mk denote the number of maximal cliques in Ck, k = 1, . . . , K + 1.
Since all maximal intersections are disjoint, we can order the maximal cliques in Ck such that the upper endpoints of their x-intervals are increasing. We denote the
ordered maximal cliques in Ck by Ck(1), . . . , Ck(mk), k = 1, . . . , K + 1.
We now prove some properties of the maximal cliques. These properties rely on the
assumed ordering of the observations, defined in Section 2.1.1. Here we assumed that
in case of ties left censored observations are ordered before right censored observations.
Let R(1), . . . , R(n) be the observed sets corresponding to Z(1), . . . , Z(n).
Lemma 2.16
CK+1 = {RK+1} if ∆(n),K+1 = 1,
CK+1 = ∅ otherwise.
Proof: By the definition of CK+1 in (2.25) and the fact that RK+1 is a clique (Theorem 2.14 (a)), it follows that the only possible element of CK+1 is RK+1.
First suppose that ∆(n),K+1 = 1, or equivalently, R(n) ∈ RK+1. Then there cannot
exist a Tj > T(n) with ∆j+ = 1. By Theorem 2.14 (c), this implies that R(n) ∈ RK+1
is not adjacent to any Rj /∈ RK+1. Hence, RK+1 is a maximal clique.
Now suppose that ∆(n),K+1 = 0. Then there must be a k ∈ {1, . . . , K} such that ∆(n)k = 1, and hence R(n) ∈ Rk. By Theorem 2.14 (c), R(n) is adjacent to all vertices in RK+1, since T(n) > T for all observation times T corresponding to R ∈ RK+1. Hence, RK+1 is not a maximal clique. □
Lemma 2.17 Let k ∈ {1, . . . , K} and j ∈ {1, . . . , mk − 1}. Then
(a) Ck(j+1) ∩Rk is a strict subset of Ck(j) ∩Rk;
(b) Ck(j) ∩RK+1 is a strict subset of Ck(j+1) ∩RK+1.
Proof: All sets of Rk that are in Ck(j+1) must also be in Ck(j). This implies that
Ck(j+1) ∩Rk ⊆ Ck(j) ∩Rk. (2.26)
Similarly, all sets of RK+1 that are in Ck(j) must also be in Ck(j+1). This implies that
Ck(j) ∩RK+1 ⊆ Ck(j+1) ∩RK+1. (2.27)
Now suppose that (2.26) or (2.27) holds with equality. Then, since

Ck(j) = (Ck(j) ∩ Rk) ∪ (Ck(j) ∩ RK+1),

it follows that Ck(j) ⊆ Ck(j+1) or Ck(j+1) ⊆ Ck(j). This contradicts the assumption that Ck(j) and Ck(j+1) are two different maximal cliques. Hence, the relations (2.26) and (2.27) must hold strictly. □
The following theorem gives a bound on the number of maximal cliques of the in-
tersection graph of R. This bound is important for computational purposes, since the
dimensionality of the optimization problem (2.16) is equal to the number of maximal
cliques.
Theorem 2.18 Let m be the number of maximal intersections of R = {R1, . . . , Rn} defined in (2.19). Then

m ≤ {(K/(K + 1)) · (n + 1) + 1} ∧ n. (2.28)
Proof: Recall the notation nk and mk from Definitions 2.13 and 2.15, and note that n = ∑_{k=1}^{K+1} nk and m = ∑_{k=1}^{K+1} mk. Let k ∈ {1, . . . , K} and consider the ordered maximal cliques Ck(1), . . . , Ck(mk) of Definition 2.15. Lemma 2.17 implies that #(Ck(j) ∩ Rk) > #(Ck(j+1) ∩ Rk) for j = 1, . . . , mk − 1, where #C denotes the cardinality of the set C. In other words, each maximal clique Ck(j) contains at least one more set of Rk than Ck(j+1), for j = 1, . . . , mk − 1. Since by definition Ck(mk) contains at least one set of Rk, and since Rk contains nk sets, it then follows that mk ≤ nk. Lemma 2.17 also implies that #(Ck(j+1) ∩ RK+1) > #(Ck(j) ∩ RK+1) for j = 1, . . . , mk − 1. Together with the fact that Ck(1) may contain zero sets of RK+1, it follows by similar reasoning that mk ≤ nK+1 + 1. Hence, mk ≤ nk ∧ (nK+1 + 1),
k = 1, . . . , K, and
m = ∑_{k=1}^{K+1} mk ≤ ∑_{k=1}^{K} mk + 1 ≤ ∑_{k=1}^{K} {nk ∧ (nK+1 + 1)} + 1. (2.29)
The right side of (2.29) is largest if nk = nK+1 + 1 for all k = 1, . . . , K. In that case

n = ∑_{k=1}^{K+1} nk = K(nK+1 + 1) + nK+1 = (K + 1)nK+1 + K,

so that nK+1 = (n − K)/(K + 1) and nk = nK+1 + 1 = (n + 1)/(K + 1). Plugging this into the right side of (2.29), we obtain an upper bound of K(n + 1)/(K + 1) + 1.
This yields the first part of the inequality.
To show that m ≤ n, note that Lemma 2.16 implies that mK+1 ≤ nK+1. Together with mk ≤ nk for k = 1, . . . , K, this implies that m = ∑_{k=1}^{K+1} mk ≤ ∑_{k=1}^{K+1} nk = n. □
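The bound (2.28) is cheap to evaluate, which is useful for sizing the optimization problem in advance. A small hypothetical helper:

```python
def mi_bound(n, K):
    """Upper bound (2.28) on the number m of maximal intersections:
    m <= (K / (K + 1)) * (n + 1) + 1, and always m <= n."""
    return min(K * (n + 1) / (K + 1) + 1, n)
```

For instance, with n = 14 observations and K = 2 competing risks (as in Table 2.3), the dimension of the optimization problem (2.16) is at most 11.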
We now consider the n × m clique matrix H, with Hij = 1{Aj ⊆ Ri}. Each column of H corresponds to a maximal clique Cj, or equivalently, to a maximal intersection Aj. Each row of H corresponds to an observed set Ri. Thus, the jth column of the matrix indicates which observed sets form maximal clique Cj. Analogously, the ith row of the matrix indicates which maximal intersections are contained in Ri. We order the columns of H so that they correspond to the ordered maximal cliques C1(1), . . . , C1(m1), . . . , CK(1), . . . , CK(mK), CK+1(1). We order the rows of H so that they correspond to the observed sets R1, . . . ,RK+1, where the sets in Rk, k = 1, . . . , K + 1, are ordered such that their observation times are nondecreasing. If for some k ∈ {1, . . . , K + 1} the sets Rk or Ck are empty, then the corresponding rows and columns are omitted from H. Finally, we say that a column is a column of Ck if it corresponds to a maximal clique in Ck. We say that a row is a row of Rk if it corresponds to an observed set in Rk.
Lemma 2.19 The n×m clique matrix H of R has the following properties:
(a) m ≤ {(K/(K + 1)) · (n + 1) + 1} ∧ n.
(b) For each k = 1, . . . , K + 1, the columns of Ck can only have nonzero elements
in rows of Rk or RK+1.
(c) For each k = 1, . . . , K, the block matrix formed by the columns of Ck and the
rows of Rk, is an nk×mk matrix with mk ≤ nk∧(nK+1+1). The first column of
this block is completely filled with ones. The other columns consist of a sequence
of zeroes, followed by a sequence of ones, where both sequences must be of positive
length. The length of the sequence of ones in the ith column is strictly smaller
than the length of the sequence of ones in the (i−1)th column, for i = 2, . . . , mk.
(d) For each k = 1, . . . , K + 1, the block matrix formed by the columns of C_k and the
rows of R_{K+1} is an n_{K+1} × m_k matrix with m_k ≤ n_k ∧ (n_{K+1} + 1). All columns of
this block matrix consist of a sequence of ones, followed by a sequence of zeroes.
Either sequence may be of zero length. The length of the sequence of ones in
the ith column is strictly larger than the length of the sequence of ones in the
(i − 1)th column, for i = 2, . . . , m_k.
(e) H can be stored in O(Kn) space.

(f) Hα can be computed in O(Kn) time.

(g) Null(H) = {0}.
Proof: Property (a) follows immediately from Theorem 2.18. Property (b) follows
from Definition 2.15. Properties (c) and (d) follow from Lemma 2.17 and the proof
of Theorem 2.18. To prove (e), let k ∈ {1, . . . , K}. By property (c), the first row
of R_k contains exactly one nonzero element. Furthermore, property (c) implies that
two successive rows of R_k are either identical, or differ in one element. Hence, we
can store the difference between these rows using at most one element, and we need
at most n_k elements to store the information in all rows of R_k. Next, we consider
R_{K+1}. The last row of R_{K+1} contains at most K nonzero elements. Furthermore,
property (d) implies that two successive rows of R_{K+1} differ in at most K elements,
and we need at most K·n_{K+1} elements to store the information in the rows of R_{K+1}.
Hence, we can store the clique matrix H using ∑_{k=1}^{K} n_k + K·n_{K+1} = O(Kn) elements.
Property (f) can be proved by similar reasoning.
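The difference-encoding argument in this proof can be sketched as follows (an illustrative sketch, not the thesis' implementation; `encode_rows` and `decode_rows` are hypothetical helper names):

```python
def encode_rows(rows):
    """Store each row as the list of (column, value) pairs in which it differs
    from the previous row (the first row is diffed against the zero row)."""
    prev, enc = [0] * len(rows[0]), []
    for row in rows:
        enc.append([(j, v) for j, (u, v) in enumerate(zip(prev, row)) if u != v])
        prev = row
    return enc

def decode_rows(enc, ncols):
    """Invert encode_rows by replaying the stored differences."""
    prev, rows = [0] * ncols, []
    for diffs in enc:
        row = prev[:]
        for j, v in diffs:
            row[j] = v
        rows.append(row)
        prev = row
    return rows

# A staircase block as in property (c): first column all ones, the other
# columns zeroes followed by ones, with strictly decreasing runs of ones.
B = [[1, 0, 0], [1, 1, 0], [1, 1, 1], [1, 1, 1]]
enc = encode_rows(B)
assert decode_rows(enc, 3) == B          # the encoding is lossless
assert all(len(d) <= 1 for d in enc)     # at most one stored entry per row
```

For such a block the encoding costs at most one element per row, which is exactly the O(n_k) bound used in the proof; rows of R_{K+1} differ in at most K entries, giving the O(K·n_{K+1}) term.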
Finally, note that all unit vectors e_j ∈ R^m, j = 1, . . . , m, can be generated by
taking differences of rows of H. This implies that the row space of H is of full
rank, and this proves (g). □

Table 2.4: Clique matrix for the data in Table 2.3. The rows are divided into three
groups corresponding to R_1, R_2 and R_3, and the columns are divided into three groups
corresponding to C_1, C_2 and C_3. In each group R_k, k = 1, . . . , 3, the observations
are ordered such that the observation times are nondecreasing. In each group C_k,
k = 1, . . . , 3, the maximal cliques are ordered according to their x-intervals, as defined
in Definition 2.15.

H = [0–1 matrix; the individual entries were not preserved in this copy]
The clique matrix for the example data in Table 2.3 is given in Table 2.4. Note that
the special block structure of the matrix is clearly visible.
2.3.4 Uniqueness of the MLE
Recall from Theorem 2.14 (d) that the intersection graph G of R is triangulated.
Hence, the results of Gentleman and Vandal (2002, Lemma 4) or Maathuis (2003,
Lemma 3.13) immediately imply that α̂ is unique. We also give an alternative proof
of this result, using the clique matrix H.
Theorem 2.20 The MLE α̂ is unique.

Proof: Let q̂_i = P_α̂(R_i) and q_i = P_α(R_i), i = 1, . . . , n. The vector q̂ is uniquely
determined because the log likelihood is strictly concave in q. Hence, the set of
maximum likelihood estimators is

{α ∈ A : Hα = q̂}.

By Lemma 2.19 (g), Null(H) = {0}, and hence the system Hα = q̂ has a unique solution
α̂. □
Remark 2.21 We now discuss generalizations of Theorem 2.20 to the mixed case
interval censored competing risks model, studied by Hudgens, Satten and Longini
(2001). We distinguish the following two cases: (i) the failure cause Y is observed if
and only if an observation is not right censored; (ii) the failure cause Y is observed
only if an observation is not right censored, but not for all such observations.
In case (i), Theorem 2.20 can be generalized, and implies that the MLE is always
mixture unique. The main difference between the intersection graphs for interval
censored data and current status data lies in the sets R1, . . . ,RK . For current status
data these sets are cliques, while this is typically not the case for interval censored
data. However, for both types of data, the set R_{K+1} is a clique. Furthermore, for
both types of data, the subgraphs G_{V_k} induced by V_k = R_k ∪ R_{K+1} are interval graphs
and therefore triangulated. Hence, by similar reasoning as in the proof of Theorem
2.14, it follows that the intersection graph for interval censored data with competing
risks is triangulated.
In case (ii), the intersection graph is generally not triangulated, and the MLE
may be mixture non-unique. Hence, in this case our Theorem 2.20 cannot be gener-
alized, and one should use other methods to assess mixture uniqueness. For example,
Hudgens, Satten and Longini (2001, page 76) suggest checking the Kuhn-Tucker
conditions given in Gentleman and Geyer (1994).
Finally, note that Hudgens, Satten and Longini (2001) use a slightly different
parametrization of the MLE, using K parameters to represent the mass in the maximal
intersection (T_{(n)}, ∞) × {1, . . . , K} which arises when ∆_{(n),K+1} = 1. Only the sum of
these K parameters can be uniquely defined.
We now translate the uniqueness of α into a statement about uniqueness of Fnk(t).
Recall from Proposition 2.12 that the naive estimators Fnk, k = 1, . . . , K + 1, are
unique at all observation times. For the MLE we do not obtain uniqueness at all
observation times. Rather, we get uniqueness at the following sets:
Definition 2.22 For k = 1, . . . , K + 1 we define

T_k = {T_i, i = 1, . . . , n : ∆_{ik} + ∆_{i,K+1} > 0} ∪ {T_{(n)}}. (2.30)
Proposition 2.23 Fnk(t), k = 1, . . . , K + 1 is unique for all t ∈ Tk.
Proof: Let k ∈ {1, . . . , K} and t ∈ R+. Recall that we obtain F_{nk}(t) by summing
all probability mass in (0, t] × {k}. Hence, F_{nk}(t) is unique if t is not contained in
the interior of a maximal intersection for F_{nk}. Observation times T_i ∈ T_k are never
contained in the interior of a maximal intersection for F_{nk}, by Lemma 2.3. □
Similar to the comment after Proposition 2.12, we actually obtain a stronger statement
in the proof of Proposition 2.23: For each k = 1, . . . , K + 1, Fnk(t) is unique if and
only if t is not in the interior of the x-interval of a positive maximal intersection for
Fnk (see Definitions 2.2 and 2.11 for this terminology).
2.4 Characterizations
We now characterize the estimators in terms of necessary and sufficient conditions.
In Section 2.4.1 we consider the maximum likelihood estimator for the general cen-
sored data setting discussed in Section 2.2. Recall that for current status data with
competing risks, both the MLE and the naive estimator can be viewed as maximum
likelihood estimators for censored data. Hence, the characterizations in this section
can be used for both estimators.
In Section 2.4.2 we give Fenchel and convex minorant characterizations for the
naive estimator. In Section 2.4.3 we give analogous characterizations for the MLE.
The main difference between the characterizations for the naive estimator and the
MLE is that the characterizations for the MLE are self-induced, similar to the char-
acterization for univariate interval censored data given in Groeneboom and Wellner
(1992, Proposition 1.4, page 29).
The characterizations in Section 2.4.1 are used to check convergence of the support
reduction algorithm for the MLE, described in Section 3.1. The characterizations in
Section 2.4.3 are used to check convergence of the iterative convex minorant algorithms for the
MLE, discussed in Section 3.2. Furthermore, we use the characterizations in Section
2.4.3 to derive asymptotic properties of the MLE.
2.4.1 Characterizations of the maximum likelihood estimator for censored data
Recall the general censored data setting discussed in Section 2.2. Given n i.i.d. ob-
served sets R_1, . . . , R_n, the MLE α̂ is defined by

l_n(α̂) = max_{α ∈ A} l_n(α), (2.31)

where α = (α_1, . . . , α_m) denotes the mass in the maximal intersections A_1, . . . , A_m,

l_n(α) = (1/n) ∑_{i=1}^{n} log P_α(R_i) = (1/n) ∑_{i=1}^{n} log(∑_{j=1}^{m} α_j 1{A_j ⊆ R_i}),

A = {α ∈ R^m : α_j ≥ 0, j = 1, . . . , m, 1^T α = 1},

and 1 is the all-one vector in R^m.
In order to give characterizations of α̂, we first translate optimization problem
(2.31) into an optimization problem over a cone. To this end, we adjust the objective
function, analogously to Silverman (1982, Theorem 3.1), Jongbloed (1995, Corollary
2.2) and Maathuis (2003, Lemma 3.7).

Lemma 2.24 The vector α̂ ∈ A maximizes l_n(α) over A if and only if it maximizes
l̃_n(α) over Ã, where

Ã = {α ∈ R^m : α_j ≥ 0, j = 1, . . . , m}, (2.32)

l̃_n(α) = l_n(α) − 1^T α. (2.33)

Proof: Suppose α̂ maximizes l_n(α) over A. We will show that l̃_n(α̂) ≥ l̃_n(α) for all
α ∈ Ã. Note that this inequality holds trivially when α = 0. Hence, let α ∈ Ã
with 1^T α = c_α > 0. Then α/c_α ∈ A and therefore l_n(α̂) ≥ l_n(α/c_α). Together with
1^T α̂ = 1, this yields

l̃_n(α̂) = l_n(α̂) − 1 ≥ l_n(α/c_α) − 1 = (1/n) ∑_{i=1}^{n} log((1/c_α) ∑_{j=1}^{m} α_j 1{A_j ⊆ R_i}) − 1
= (1/n) ∑_{i=1}^{n} log(∑_{j=1}^{m} α_j 1{A_j ⊆ R_i}) − log(c_α) − 1
= l̃_n(α) + c_α − log(c_α) − 1 ≥ l̃_n(α),

since x − log(x) − 1 ≥ 0 for x > 0. Hence, α̂ maximizes l̃_n(α) over Ã.

Now suppose α̂ maximizes l̃_n(α) over Ã, and suppose 1^T α̂ = c > 0. Then α̂/c ∈ A
and l̃_n(α̂) ≥ l̃_n(α̂/c). By the same reasoning as above this gives

l̃_n(α̂) ≥ l̃_n(α̂/c) = l_n(α̂/c) − 1 = l̃_n(α̂) + c − log(c) − 1.

Since x − log(x) − 1 ≤ 0 if and only if x = 1, this yields c = 1. Hence, α̂ ∈ A, and α̂
maximizes l̃_n(α), and thus l_n(α), over A ⊆ Ã. □
Using the Fenchel conditions given in Robertson, Wright and Dykstra (1988, Sec-
tion 6.2), it follows that α̂ is the MLE if and only if

⟨α, ∇l̃_n(α̂)⟩ ≤ 0 for all α ∈ Ã and ⟨α̂, ∇l̃_n(α̂)⟩ = 0, (2.34)

where ∇l̃_n(α) is the vector of partial derivatives:

∇l̃_n(α) = (∂l̃_n(α)/∂α_1, . . . , ∂l̃_n(α)/∂α_m).

Since Ã consists of vectors with nonnegative elements, (2.34) is equivalent to

∂l̃_n(α̂)/∂α_j ≤ 0 for all j = 1, . . . , m, with equality if α̂_j > 0. (2.35)

In Proposition 2.25, we use (2.34) and (2.35) to derive characterizations of the max-
imum likelihood estimator for censored data.
Proposition 2.25 (Compare Lemma 3.8 of Maathuis (2003)) The vector α̂ ∈ A
satisfies (2.31) if and only if

(1/n) ∑_{i=1}^{n} P_α(R_i)/P_α̂(R_i) ≤ 1^T α for all α ∈ Ã, and 1^T α̂ = 1; (2.36)

or, equivalently, if and only if

(1/n) ∑_{i=1}^{n} 1{A_j ⊆ R_i}/P_α̂(R_i) ≤ 1 for all j = 1, . . . , m, with equality if α̂_j > 0. (2.37)

Proof: Note that

∂l̃_n(α)/∂α_j = (1/n) ∑_{i=1}^{n} 1{A_j ⊆ R_i}/P_α(R_i) − 1.
Condition (2.37) now follows directly by plugging this expression into (2.35). To prove
(2.36), let α ∈ Ã. We have

⟨α, ∇l̃_n(α̂)⟩ = ∑_{j=1}^{m} α_j {(1/n) ∑_{i=1}^{n} 1{A_j ⊆ R_i}/P_α̂(R_i) − 1}
= (1/n) ∑_{i=1}^{n} {∑_{j=1}^{m} α_j 1{A_j ⊆ R_i}}/P_α̂(R_i) − 1^T α = (1/n) ∑_{i=1}^{n} P_α(R_i)/P_α̂(R_i) − 1^T α.

Similarly, we find

⟨α̂, ∇l̃_n(α̂)⟩ = (1/n) ∑_{i=1}^{n} P_α̂(R_i)/P_α̂(R_i) − 1^T α̂ = 1 − 1^T α̂.

Condition (2.36) then follows by combining the last two displays with (2.34). □
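Condition (2.37) is straightforward to verify numerically for a candidate α̂. A minimal sketch, assuming the clique matrix H is available as a 0–1 array (toy data; `satisfies_2_37` is a hypothetical helper name):

```python
from fractions import Fraction as Fr

def satisfies_2_37(alpha, H):
    """Check (2.37): (1/n) * sum_i 1{A_j in R_i} / P_alpha(R_i) <= 1 for all j,
    with equality whenever alpha_j > 0.  Here H[i][j] = 1{A_j in R_i}."""
    n = len(H)
    P = [sum(a * h for a, h in zip(alpha, row)) for row in H]   # P_alpha(R_i)
    for j, aj in enumerate(alpha):
        s = sum(Fr(row[j]) / p for row, p in zip(H, P) if row[j]) / n
        if s > 1 or (aj > 0 and s != 1):
            return False
    return True

# Toy example: R_1 contains only A_1, R_2 only A_2, R_3 both; by symmetry
# the MLE puts mass 1/2 on each maximal intersection.
H = [[1, 0], [0, 1], [1, 1]]
assert satisfies_2_37([Fr(1, 2), Fr(1, 2)], H)
assert not satisfies_2_37([Fr(2, 3), Fr(1, 3)], H)
```

Exact rational arithmetic avoids false failures of the equality part of (2.37) due to floating-point rounding; in practice a small tolerance would be used instead.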
2.4.2 Fenchel and convex minorant characterizations for the naive estimator
We now give Fenchel and convex minorant characterizations for the naive estimator.
Recall that Gn is the empirical distribution of the observation times T1, . . . , Tn. Fur-
thermore, recall that the naive estimators Fnk, k = 1, . . . , K + 1 can be viewed as
maximum likelihood estimators for the reduced current status data Zk = (T,∆k).
Hence, their characterizations follow directly from the characterizations for current
status data given in Groeneboom and Wellner (1992, Proposition 1.1, page 39 and
Proposition 1.2, page 41). We restate these characterizations in a slightly different
way, so that they can be easily compared to our characterizations of the MLE in
Section 2.4.3. We first introduce some definitions.
Definition 2.26 Given a set of points S ⊆ R², the convex minorant of S is the largest
convex function lying below S. The concave majorant of S is the smallest concave function
lying above S.⁵

⁵Some authors use the terms greatest convex minorant and smallest concave majorant instead.
Definition 2.27 Let B be a collection of points in R. We say that a nondecreasing
right-continuous function F : R+ → [0, 1] with F (0) = 0 has a jump at t w.r.t. B
if either t = minB and F (t) > 0, or minB < t ∈ B and F (t) > F (t′) for all
preceding t′ ∈ B. Similarly, we say that a nonincreasing right-continuous function
S : R+ → [0, 1] with S(0) = 1 has a jump at t w.r.t. B if either t = minB and
S(t) < 1, or minB < t ∈ B and S(t) < S(t′) for all preceding t′ ∈ B.
Proposition 2.28 Let k ∈ {1, . . . , K}. Then F_{nk} is a naive estimator if and only if

∫_{(0,t)} F_{nk}(u) dG_n(u) ≤ ∫_{u∈(0,t)} δ_k dP_n(u, δ), for all t ∈ R+,

and equality holds if (i) F_{nk} has a jump at t w.r.t. {T_1, . . . , T_n}, and (ii) t = ∞.

Proposition 2.29 F_{n,K+1} is a naive estimator if and only if

∫_{(0,t)} F_{n,K+1}(u) dG_n(u) ≥ ∫_{u∈(0,t)} δ_{K+1} dP_n(u, δ), for all t ∈ R+,

and equality holds if (i) F_{n,K+1} has a jump at t w.r.t. {T_1, . . . , T_n}, and (ii) t = ∞.
These characterizations can also be written as convex minorant characterizations:

Proposition 2.30 Let the cumulative sum diagrams φ_{nk} be defined by

φ_{nk}(t) = (G_n(t), ∫_{(0,t]} δ_k dP_n(u, δ)), t ∈ R+, k = 1, . . . , K.

Then F_{nk}(s), s ∈ {T_1, . . . , T_n}, k = 1, . . . , K, are the uniquely defined values of the
naive estimator if and only if for all such s, F_{nk}(s) is the left slope of the convex
minorant of φ_{nk} at G_n(s).

Proposition 2.31 Let the cumulative sum diagram φ_{n,K+1} be defined by

φ_{n,K+1}(t) = (G_n(t), ∫_{(0,t]} δ_{K+1} dP_n(u, δ)), t ∈ R+.

Then F_{n,K+1}(s), s ∈ {T_1, . . . , T_n}, are the uniquely defined values of the naive estima-
tor if and only if for all such s, F_{n,K+1}(s) is the left slope of the concave majorant of
φ_{n,K+1} at G_n(s).
These geometric characterizations provide one-step convex minorant algorithms for
the computation of the naive estimators, as discussed in Groeneboom and Wellner
(1992, pages 40-41). An alternative but equivalent computational method is the Pool
Adjacent Violators Algorithm (PAVA), see, e.g., Ayer, Brunk, Ewing, Reid and Sil-
verman (1955) and Barlow, Bartholomew, Bremner and Brunk (1972, pages 13-15).
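As an illustration, the following is a standard PAVA sketch (not the thesis' code): the naive estimator is the weighted isotonic fit of the observed δ_k-fractions at the distinct observation times. The data are hypothetical, chosen to mimic a small current status sample:

```python
from fractions import Fraction as Fr

def pava_nondecreasing(y, w):
    """Weighted least-squares nondecreasing fit via Pool Adjacent Violators."""
    blocks = []                           # each block: [mean, weight, count]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # pool backwards while the monotonicity constraint is violated
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2, c1 + c2])
    fit = []
    for m, _, c in blocks:
        fit.extend([m] * c)
    return fit

# Hypothetical current status data: distinct observation times 2, 3, 8, 10
# carrying 2, 1, 1, 1 observations; y_i is the fraction with delta_k = 1.
y = [Fr(1, 2), Fr(0), Fr(1), Fr(1)]
w = [2, 1, 1, 1]
assert pava_nondecreasing(y, w) == [Fr(1, 3), Fr(1, 3), 1, 1]
```

The first two values are pooled to 1/3 because the constraint 1/2 ≤ 0 is violated; the pooled means are exactly the left slopes of the convex minorant in Proposition 2.30.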
2.4.3 Fenchel and convex minorant characterizations for the MLE
Groeneboom and Wellner (1992, Proposition 1.4, page 49) show that a characteriza-
tion of the type of Proposition 2.30 also holds in a more complicated situation, where
the cumulative sum diagram is “self-induced”, in the sense that it is constructed from
the solution itself. We now derive such self-induced characterizations for the MLE
for current status data with competing risks. First, in Lemma 2.32 we translate op-
timization problem (2.7) into an optimization problem over a cone, by removing the
constraint F+(u) ≤ 1. In Propositions 2.34 and 2.36 we give Fenchel characterizations
of the MLE. Proposition 2.40 gives a self-induced convex minorant characterization.
We illustrate this characterization in Example 2.41.
Lemma 2.32 F̂_n ∈ F_K maximizes l_n(F) over F_K if and only if it maximizes l̃_n(F)
over F̃_K, where

F̃_K = the collection of all K-tuples of bounded nonnegative
nondecreasing right-continuous functions, (2.38)

l̃_n(F) = ∫ {∑_{k=1}^{K} δ_k log F_k(u) + (1 − δ_+) log(c_F − F_+(u))} dP_n − c_F, (2.39)

and c_F = F_+(∞).
Proof: The proof is analogous to the proof of Lemma 2.24. Suppose that F̂_n maxi-
mizes l_n(F) over F_K. We will show that l̃_n(F̂_n) ≥ l̃_n(F) for all F ∈ F̃_K. Note that this
inequality holds trivially when F_+(∞) = 0. Hence, let F ∈ F̃_K with F_+(∞) = c_F > 0.
Then F/c_F ∈ F_K, and therefore l_n(F̂_n) ≥ l_n(F/c_F). Together with F̂_{n+}(∞) = 1 this
yields

l̃_n(F̂_n) = l_n(F̂_n) − 1 ≥ l_n(F/c_F) − 1
= ∫ {∑_{k=1}^{K} δ_k log F_k(t) + (1 − δ_+) log(c_F − F_+(t))} dP_n − log c_F − 1
= l̃_n(F) + c_F − log c_F − 1 ≥ l̃_n(F),

since x − log x − 1 ≥ 0 for x > 0. Hence F̂_n maximizes l̃_n(F) over F̃_K.

Now suppose F̂_n maximizes l̃_n(F) over F̃_K, and suppose F̂_{n+}(∞) = c > 0. Then
l̃_n(F̂_n) ≥ l̃_n(F̂_n/c), and by the same reasoning as above this gives

l̃_n(F̂_n) ≥ l̃_n(F̂_n/c) = l_n(F̂_n/c) − 1 = l̃_n(F̂_n) + c − log c − 1.

Since x − log x − 1 ≤ 0 if and only if x = 1, this yields c = 1. Hence, F̂_n ∈ F_K, and
F̂_n maximizes l̃_n(F), and thus l_n(F), over F_K ⊆ F̃_K. □
We now give several Fenchel characterizations of the MLE. The proof of Proposition
2.34 is based on the finite dimensional nature of the optimization problem, which
allows us to use Fenchel conditions that are analogous to (2.34). We give two differ-
ent proofs for Proposition 2.36. The first proof follows from Proposition 2.34. The
second proof can be used in cases where the optimization problem is truly infinite di-
mensional. Proposition 2.40 is derived from Proposition 2.36 and gives a self-induced
convex minorant characterization.
Definition 2.33 For F ∈ F̃_K, we define

β_{nF} = 1 − ∫ (1 − δ_+)/(F_+(∞) − F_+(u)) dP_n(u, δ). (2.40)

Furthermore, recall the definition of T_k in Definition 2.22, and the definition of jump
points in Definition 2.27. We use the conventions that 0/0 = 0 and 0 · ∞ = 0.
Proposition 2.34 Let F̂_n ∈ F̃_K. Then F̂_{nk}(t), t ∈ T_k, k = 1, . . . , K, are the uniquely
defined values of the MLE (2.7) if and only if

(a) β_{nF̂_n} ≥ 0, and equality holds if F̂_{n+}(T_{(n)}) < F̂_{n+}(∞).

(b) For all k = 1, . . . , K and t ∈ R+,

∫_{u∈[t,∞)} {δ_k/F̂_{nk}(u) − (1 − δ_+)/(F̂_{n+}(∞) − F̂_{n+}(u))} dP_n(u, δ) ≤ β_{nF̂_n}, (2.41)

and equality holds if F̂_{nk} has a jump at t w.r.t. T_k.
Proof: This proof uses the fact that the optimization problem is finite dimensional.
We introduce the following notation. Let T′_{(i)}, i = 1, . . . , n_d, n_d ≤ n, be the or-
der statistics of the distinct observation times in {T_1, . . . , T_n}, and let T′_{(0)} = 0 and
T′_{(n_d+1)} = ∞. Furthermore, for i = 1, . . . , n_d and k = 1, . . . , K + 1, let

N_{ik} = ∑_{j=1}^{n} ∆_{jk} 1{T_j = T′_{(i)}} and N_i = ∑_{j=1}^{n} 1{T_j = T′_{(i)}}.

Thus, N_i represents the number of observations with observation time T′_{(i)}, and N_{ik} is
the number of those observations with ∆_k = 1. For any set of functions (F_1, . . . , F_K)
and i ∈ {0, . . . , n_d + 1}, we define

F_{ik} = F_k(T′_{(i)}), F̂_{ik} = F̂_{nk}(T′_{(i)}), F_{i+} = ∑_{k=1}^{K} F_{ik}, and F̂_{i+} = ∑_{k=1}^{K} F̂_{ik}.
Let F = (F_1, . . . , F_K), where F_k = (F_{1k}, . . . , F_{n_d+1,k}). Since the extended log like-
lihood function l̃_n(F) defined in Lemma 2.32 only depends on the values F_k(T′_{(i)}), i =
1, . . . , n_d + 1, k = 1, . . . , K, we can write it as:

l̃_n(F) = (1/n) ∑_{i=1}^{n_d} {∑_{k=1}^{K} N_{ik} log F_{ik} + N_{i,K+1} log(F_{n_d+1,+} − F_{i+})} − F_{n_d+1,+}.

We need to maximize this function over the space

F̃_K = {F ∈ R^{(n_d+1)K} : 0 ≤ F_{1k} ≤ · · · ≤ F_{n_d+1,k} for all k = 1, . . . , K}.

Thus, we maximize the concave function l̃_n(F) over a convex cone, and hence we have
Fenchel optimality conditions analogous to (2.34):

⟨F, ∇l̃_n(F̂)⟩ ≤ 0 for all F ∈ F̃_K and ⟨F̂, ∇l̃_n(F̂)⟩ = 0, (2.42)

where

∇l̃_n(F) = ((∂l̃_n(F)/∂F_{1k}, . . . , ∂l̃_n(F)/∂F_{n_d+1,k}), k = 1, . . . , K).
Rewriting the first expression of (2.42) yields

0 ≥ ⟨F, ∇l̃_n(F̂)⟩ = ∑_{k=1}^{K} ∑_{j=1}^{n_d+1} F_{jk} ∂l̃_n(F̂)/∂F_{jk}
= ∑_{k=1}^{K} ∑_{j=1}^{n_d+1} ∑_{i=1}^{j} (F_{ik} − F_{i−1,k}) ∂l̃_n(F̂)/∂F_{jk}
= ∑_{k=1}^{K} ∑_{i=1}^{n_d+1} (F_{ik} − F_{i−1,k}) ∑_{j=i}^{n_d+1} ∂l̃_n(F̂)/∂F_{jk}.

Now fix an l ∈ {1, . . . , n_d + 1} and k ∈ {1, . . . , K}. Since the above inequality must
hold for all F ∈ F̃_K, it must hold for F with

F_{lk} − F_{l−1,k} > 0, and F_{ik′} − F_{i−1,k′} = 0 otherwise.
This implies that ∑_{j=l}^{n_d+1} ∂l̃_n(F̂)/∂F_{jk} ≤ 0. Since this holds for all l ∈ {1, . . . , n_d + 1} and
k ∈ {1, . . . , K}, we obtain

∑_{j=i}^{n_d+1} ∂l̃_n(F̂)/∂F_{jk} ≤ 0, i = 1, . . . , n_d + 1, k = 1, . . . , K. (2.43)

Furthermore, by rewriting the expression for ⟨F̂, ∇l̃_n(F̂)⟩ in a similar way, it follows
that we must have equality in (2.43) if F̂_{ik} > F̂_{i−1,k}. Considering condition (2.43) for
i = n_d + 1 and plugging in

∂l̃_n(F̂)/∂F_{n_d+1,k} = (1/n) ∑_{j=1}^{n_d} N_{j,K+1}/(F̂_{n_d+1,+} − F̂_{j+}) − 1 = −β_{nF̂_n}, k = 1, . . . , K,

yields

β_{nF̂_n} = 1 − (1/n) ∑_{j=1}^{n_d} N_{j,K+1}/(F̂_{n_d+1,+} − F̂_{j+}) = 1 − ∫ (1 − δ_+)/(F̂_{n+}(∞) − F̂_{n+}(u)) dP_n(u, δ) ≥ 0. (2.44)
Furthermore, equality must hold in (2.44) if F̂_{n_d+1,k} − F̂_{n_d,k} > 0 for some
k ∈ {1, . . . , K}, or equivalently, if F̂_{n_d+1,+} − F̂_{n_d,+} > 0. This gives condition (a) of the
proposition. Similarly, condition (2.43) together with

∂l̃_n(F̂)/∂F_{jk} = (1/n) {N_{jk}/F̂_{jk} − N_{j,K+1}/(F̂_{n_d+1,+} − F̂_{j+})}, j = 1, . . . , n_d, k = 1, . . . , K,

yields that

(1/n) ∑_{j=i}^{n_d} {N_{jk}/F̂_{jk} − N_{j,K+1}/(F̂_{n_d+1,+} − F̂_{j+})} ≤ β_{nF̂_n}, (2.45)

where equality must hold if F̂_{ik} > F̂_{i−1,k}. We can rewrite (2.45) as
∫_{[t,∞)} {δ_k/F̂_{nk}(u) − (1 − δ_+)/(F̂_{n+}(∞) − F̂_{n+}(u))} dP_n(u, δ) ≤ β_{nF̂_n}, (2.46)

for t = T′_{(i)}. Now let k ∈ {1, . . . , K} and let s_1 < s_2 be two successive points in T_k.
Note that the left side of (2.46) has the same value for all t ∈ (s_1, s_2]. Furthermore,
note that the following two statements are equivalent:

(a) F̂_{ik} > F̂_{i−1,k} for some i such that T′_{(i)} ∈ (s_1, s_2],

(b) F̂_{nk}(s_2) > F̂_{nk}(s_1),

and that the left side of (2.46) vanishes for t = T′_{(n_d+1)} = ∞. This implies that
satisfying (2.46) for i = 1, . . . , n_d + 1 is equivalent to satisfying (2.46) for t ∈ T_k,
where equality must hold if F̂_{nk} has a jump at t w.r.t. T_k. This is condition (b) of the
proposition. □
We now give a second Fenchel characterization for the MLE:

Definition 2.35 For F ∈ F̃_K, we define

G_{nF}(t) = ∫_{(0,t]} (1 − δ_+)/(F_+(∞) − F_+(u)) dP_n(u, δ) + β_{nF} 1_{[T_{(n)},∞)}(t), t ∈ R+, (2.47)

where β_{nF} is defined in (2.40).
Proposition 2.36 Let F̂_n ∈ F̃_K. Then F̂_{nk}(t), t ∈ T_k, k = 1, . . . , K, are the uniquely
defined values of the MLE (2.7) if and only if

(a) β_{nF̂_n} ≥ 0, and equality holds if F̂_{n+}(T_{(n)}) < F̂_{n+}(∞).

(b) For all k = 1, . . . , K and t ∈ R+,

∫_{u∈(0,t)} F̂_{nk}(u) dG_{nF̂_n}(u) ≤ ∫_{u∈(0,t)} δ_k dP_n(u, δ), (2.48)

and equality holds if (i) F̂_{nk} has a jump at t w.r.t. T_k, and (ii) t = ∞.
We give two proofs of Proposition 2.36. The first proof follows from Proposition 2.34.
The second proof does not require the result in Proposition 2.34, and can be used for
truly infinite dimensional optimization problems.
Proof 1 of Proposition 2.36: We show that condition (b) of Proposition 2.34 is
equivalent to condition (b) of Proposition 2.36. Let k ∈ {1, . . . , K}. Then condition
(b) of Proposition 2.34 is equivalent to the following three conditions:

(i) For the last jump point τ of F̂_{nk} w.r.t. T_k, we have

∫_{[τ,s)} {δ_k/F̂_{nk}(u) − (1 − δ_+)/(F̂_{n+}(∞) − F̂_{n+}(u))} dP_n ≥ β_{nF̂_n} 1{s > T_{(n)}}, for all s > τ,

and equality must hold if s > T_{(n)}.

(ii) For any two successive jump points σ and τ of F̂_{nk} w.r.t. T_k, we have

∫_{[σ,s)} {δ_k/F̂_{nk}(u) − (1 − δ_+)/(F̂_{n+}(∞) − F̂_{n+}(u))} dP_n ≥ 0, for all s ∈ (σ, τ],

and equality holds if s = τ.

(iii) For the first jump point σ of F̂_{nk} w.r.t. T_k, we have

∫_{[s,σ)} {δ_k/F̂_{nk}(u) − (1 − δ_+)/(F̂_{n+}(∞) − F̂_{n+}(u))} dP_n ≤ 0, for all s ∈ [0, σ).

To see this, assume that condition (b) of Proposition 2.34 holds. Let τ be the last
jump point of F̂_{nk} w.r.t. T_k. Then equality holds in (2.41) for t = τ. Furthermore,
inequality holds in (2.41) for t = s. Subtracting these two conditions, we obtain the
inequality part of condition (i). If s > T_{(n)}, then the left side of (2.41) is zero, so that
equality holds in (i) for s > T_{(n)}. Conditions (ii) and (iii) can be proved analogously.
Furthermore, it can be easily verified that conditions (i)-(iii) above imply condition
(b) of Proposition 2.34.

In conditions (i) and (ii), we can multiply both sides of the equations by F̂_{nk}(u),
because this is a constant and positive quantity on the intervals of integration. This
means that (i) and (ii) are equivalent to:

(i') For the last jump point τ of F̂_{nk} w.r.t. T_k, we have for all s > τ:

∫_{[τ,s)} δ_k dP_n ≥ ∫_{[τ,s)} F̂_{nk}(u)(1 − δ_+)/(F̂_{n+}(∞) − F̂_{n+}(u)) dP_n + β_{nF̂_n} 1{s > T_{(n)}} F̂_{nk}(T_{(n)})
= ∫_{[τ,s)} F̂_{nk}(u) dG_{nF̂_n}(u),

where equality must hold if s > T_{(n)}.

(ii') For any two successive jump points σ and τ of F̂_{nk} w.r.t. T_k, we have for s ∈ (σ, τ]:

∫_{[σ,s)} δ_k dP_n ≥ ∫_{[σ,s)} F̂_{nk}(u)(1 − δ_+)/(F̂_{n+}(∞) − F̂_{n+}(u)) dP_n = ∫_{[σ,s)} F̂_{nk}(u) dG_{nF̂_n}(u),

where equality must hold if s = τ.

In (iii) we cannot multiply through by F̂_{nk}(u), because F̂_{nk}(u) = 0 on the interval of
integration. However, note that (iii) is equivalent to the condition that δ_k = 0 before
the first jump point of F̂_{nk}. An alternative way of writing this condition is:

(iii') For the first jump point σ of F̂_{nk} w.r.t. T_k, we have for s ∈ (0, σ]:

∫_{(0,s)} δ_k dP_n = ∫_{(0,s)} F̂_{nk}(u) dG_{nF̂_n}(u).

This completes the proof, since (i')-(iii') are equivalent to condition (b) of Proposition
2.36. □
Proof 2 of Proposition 2.36: Suppose that F̂_n ∈ F̃_K is an MLE. For t ∈ R+
and |h| < min{1/F̂_{n1}(∞), . . . , 1/F̂_{nK}(∞)}, we define the perturbation F̂^{(h)}_{nk}(u) =
F̂_{nk}(u){1 + h F̂_{nk}(u)}. Let F̂^{(h,k)}_n ∈ F̃_K be the corresponding vector of components,
where only the kth component is changed to F̂^{(h)}_{nk}. By Lemma 2.32, F̂_n maximizes
the function l̃_n(F) over F ∈ F̃_K. Hence, we get

0 = lim_{h↓0} h^{−1} {l̃_n(F̂^{(h,k)}_n) − l̃_n(F̂_n)}
= ∫ F̂_{nk}(u) {δ_k − F̂_{nk}(u)(1 − δ_+)/(F̂_{n+}(∞) − F̂_{n+}(u))} dP_n(u, δ) − F̂_{nk}(∞)² β_{nF̂_n}
= ∫ {∫_{[t,∞)} {δ_k − F̂_{nk}(u)(1 − δ_+)/(F̂_{n+}(∞) − F̂_{n+}(u))} dP_n(u, δ) − F̂_{nk}(∞) β_{nF̂_n}} dF̂_{nk}(t). (2.49)

Here we obtain the last line by writing F̂_{nk}(u) = ∫_{(0,u]} dF̂_{nk}(t) and using Fubini's
theorem. Furthermore, for h ≥ 0 we consider the perturbation F̂^{(h,t)}_{nk}(u) = F̂_{nk}(u){1 +
h 1_{[t,∞)}(u)}. Let F̂^{(h,t,k)}_n ∈ F̃_K be the corresponding vector of components, where only
the kth component is changed to F̂^{(h,t)}_{nk}. Then we get

0 ≥ lim_{h↓0} h^{−1} {l̃_n(F̂^{(h,t,k)}_n) − l̃_n(F̂_n)}
= ∫_{[t,∞)} {δ_k − F̂_{nk}(u)(1 − δ_+)/(F̂_{n+}(∞) − F̂_{n+}(u))} dP_n(u, δ) − F̂_{nk}(∞) β_{nF̂_n}. (2.50)

Note that (2.50) is left-continuous in t, and constant between successive points t in T_k.
Hence, by combining (2.49) and (2.50), it follows that equality in (2.50) must hold for
points of jump t ∈ T_k of F̂_{nk} w.r.t. T_k. Taking t > T_{(n)} in (2.50) yields β_{nF̂_n} ≥ 0, which
is the first part of condition (a). Furthermore, (2.49) yields β_{nF̂_n} F̂_{nk}(∞){F̂_{nk}(∞) −
F̂_{nk}(T_{(n)})} = 0. This implies that

β_{nF̂_n} {F̂_{nk}(∞) − F̂_{nk}(T_{(n)})} = 0, (2.51)
so that we can write (2.50) as

0 ≥ ∫_{[t,∞)} {δ_k − F̂_{nk}(u)(1 − δ_+)/(F̂_{n+}(∞) − F̂_{n+}(u))} dP_n(u, δ) − F̂_{nk}(T_{(n)}) β_{nF̂_n}
= ∫_{[t,∞)} δ_k dP_n(u, δ) − ∫_{[t,∞)} F̂_{nk}(u) dG_{nF̂_n}(u). (2.52)

Equality must hold at points of jump t of F̂_{nk} w.r.t. T_k. Note that the MLE F̂_{nk}(u)
must be positive at the first point with δ_k = 1; otherwise, the log likelihood would be
−∞. Hence, for the first jump point σ of F̂_{nk} with respect to T_k, we have ∫_{(0,σ)} δ_k dP_n =
∫_{(0,σ)} F̂_{nk} dG_{nF̂_n} = 0. Together with the equality in (2.52) at t = σ, this implies
∫ δ_k dP_n = ∫ F̂_{nk} dG_{nF̂_n}. Combining this with (2.52) gives condition (b). Furthermore,
if F̂_{n+}(T_{(n)}) < F̂_{n+}(∞), then there must be a k ∈ {1, . . . , K} such that F̂_{nk}(T_{(n)}) <
F̂_{nk}(∞). Together with (2.51) this yields β_{nF̂_n} = 0. This completes condition (a).
Now assume that F̂_n ∈ F̃_K satisfies conditions (a) and (b). Let c = F̂_{n+}(∞). We
show that F̂_n maximizes l̃_n(F) over F ∈ F̃_K. As in Proof 1 of Proposition 2.36, we
can multiply through in condition (b) by a function that has the same jump points
as F̂_{nk} w.r.t. T_k. Multiplying by log F̂_{nk}(u), under the convention 0 · ∞ = 0, gives

∫ δ_k log F̂_{nk}(u) dP_n(u, δ) = ∫ F̂_{nk}(u) log F̂_{nk}(u) dG_{nF̂_n}(u), k = 1, . . . , K.

Furthermore, by definition of G_{nF̂_n}, we have

∫ (1 − δ_+) log(c − F̂_{n+}(u)) dP_n(u, δ) = ∫ (c − F̂_{n+}(u)) log(c − F̂_{n+}(u)) dG_{nF̂_n}(u).

Note that this also holds if β_{nF̂_n} > 0 because of the equality condition in (a). Namely,
β_{nF̂_n} > 0 implies that F̂_{n+}(T_{(n)}) = F̂_{n+}(∞). By combining the last two displays we
can write l̃_n(F̂_n) as:
l̃_n(F̂_n) = ∫ [∑_{k=1}^{K} δ_k log F̂_{nk}(u) + (1 − δ_+) log(c − F̂_{n+}(u))] dP_n(u, δ) − c
= ∫ [∑_{k=1}^{K} F̂_{nk}(u) log F̂_{nk}(u) + (c − F̂_{n+}(u)) log(c − F̂_{n+}(u))] dG_{nF̂_n}(u) − c.
Furthermore, using the stochastic ordering property of condition (b), it follows that
for monotone nondecreasing functions F ∈ F̃_K with F_+(∞) = c_F, we have

∫ δ_k log F_k(u) dP_n(u, δ) ≤ ∫ F̂_{nk}(u) log F_k(u) dG_{nF̂_n}(u), k = 1, . . . , K. (2.53)

We can prove this by writing log F_k(u) = ∫_{(0,u]} d log F_k(s) and using Fubini's theorem:

∫ F̂_{nk}(u) log F_k(u) dG_{nF̂_n}(u) − ∫ δ_k log F_k(u) dP_n(u, δ)
= ∫ [∫_{[s,∞)} F̂_{nk}(u) dG_{nF̂_n}(u) − ∫_{[s,∞)} δ_k dP_n] d log F_k(s) ≥ 0,

since the integrand is nonnegative by condition (b). Furthermore,

∫ (1 − δ_+) log(c_F − F_+(u)) dP_n(u, δ) = ∫ (c − F̂_{n+}(u)) log(c_F − F_+(u)) dG_{nF̂_n}(u),
by the definition of G_{nF̂_n}. Combining the above display with (2.53) gives

l̃_n(F) = ∫ [∑_{k=1}^{K} δ_k log F_k(u) + (1 − δ_+) log(c_F − F_+(u))] dP_n(u, δ) − c_F
≤ ∫ [∑_{k=1}^{K} F̂_{nk}(u) log F_k(u) + (c − F̂_{n+}(u)) log(c_F − F_+(u))] dG_{nF̂_n}(u) − c_F.

It follows that

l̃_n(F̂_n) − l̃_n(F) ≥ ∫ [∑_{k=1}^{K} F̂_{nk}(u) log(F̂_{nk}(u)/F_k(u))
+ (c − F̂_{n+}(u)) log((c − F̂_{n+}(u))/(c_F − F_+(u))) + c_F − c] dG_{nF̂_n}(u). (2.54)
Here we use that ∫ dG_{nF̂_n} = 1, so that we can pull c_F − c inside the integral. We
now show that this expression is nonnegative, by showing that its integrand is non-
negative. To do so, consider two vectors q and p in R_+^{K+1} with ∑_{k=1}^{K+1} q_k = c_q and
∑_{k=1}^{K+1} p_k = c_p, and write q_+ = ∑_{k=1}^{K} q_k and p_+ = ∑_{k=1}^{K} p_k. We use the nonnegativity
of the Kullback-Leibler number for a multinomial distribution to show that

0 ≤ ∑_{k=1}^{K} (q_k/c_q) log((q_k/c_q)/(p_k/c_p)) + (1 − q_+/c_q) log((1 − q_+/c_q)/(1 − p_+/c_p))
= ∑_{k=1}^{K} (q_k/c_q) log(q_k/p_k) + (1 − q_+/c_q) log((c_q − q_+)/(c_p − p_+)) + log(c_p/c_q)
≤ ∑_{k=1}^{K} (q_k/c_q) log(q_k/p_k) + (1 − q_+/c_q) log((c_q − q_+)/(c_p − p_+)) − 1 + c_p/c_q
= (1/c_q) {∑_{k=1}^{K} q_k log(q_k/p_k) + (c_q − q_+) log((c_q − q_+)/(c_p − p_+)) + c_p − c_q},

using again the inequality log(x) ≤ x − 1 for x > 0. Letting q_k = F̂_{nk}(u) and p_k =
F_k(u), it follows that the integrand of (2.54) is nonnegative. Hence, l̃_n(F̂_n) − l̃_n(F) ≥ 0,
so that F̂_n maximizes l̃_n(F) over F ∈ F̃_K. □
Alternatively, we can formulate Proposition 2.36 in terms of a self-induced convex mi-
norant characterization:

Definition 2.37 For k = 1, . . . , K, let the cusum diagram φ_{nkF} be defined by

φ_{nkF}(t) = (G_{nF}(t), ∫_{(0,t]} δ_k dP_n(u, δ)), t ∈ R+. (2.55)
Proposition 2.38 Let F̂_n ∈ F̃_K. Then F̂_{nk}(t), t ∈ T_k, k = 1, . . . , K, are the uniquely
defined values of the MLE (2.7) if and only if

(a) β_{nF̂_n} ≥ 0, and equality holds if F̂_{n+}(T_{(n)}) < F̂_{n+}(∞).

(b) For all k = 1, . . . , K and t ∈ T_k, F̂_{nk}(t) is the slope of the convex minorant of
φ_{nkF̂_n} at G_{nF̂_n}(t), where we take the left-continuous slope if G_{nF̂_n} has a jump at
t, and the right-continuous slope otherwise.
Before we prove Proposition 2.38, we give an equivalent formulation in terms of a
fixed point of a self-induced convex minorant mapping:

Definition 2.39 For each F ∈ F̃_K such that β_{nF} ≥ 0, and each k ∈ {1, . . . , K}, we
define the mapping S_{nk} : F ↦ S_{nk}F by

[S_{nk}F](t) is the slope of the convex minorant of φ_{nkF} at G_{nF}(t), t ∈ T_k, (2.56)

where we take the left-continuous slope if G_{nF} has a jump at t, and the right-
continuous slope otherwise. We define S_n(F) by

S_n(F) = (S_{n1}(F), . . . , S_{nK}(F)). (2.57)

We can now reformulate Proposition 2.38 in terms of a fixed point of the mapping
S_n(F).

Proposition 2.40 Let F̂_n ∈ F̃_K. Then F̂_{nk}(t), t ∈ T_k, k = 1, . . . , K, are the uniquely
defined values of the MLE (2.7) if and only if

(a) β_{nF̂_n} ≥ 0, and equality holds if F̂_{n+}(T_{(n)}) < F̂_{n+}(∞).

(b) F̂_n is a fixed point of the mapping S_n in the sense that

[S_{nk}F̂_n](t) = F̂_{nk}(t) for all t ∈ T_k, k = 1, . . . , K.
Proof of Propositions 2.38 and 2.40: We show that condition (b) of Proposition
2.36 is equivalent to condition (b) of Propositions 2.38 and 2.40. Let k ∈ {1, . . . , K},
and recall that condition (b) of Proposition 2.36 is equivalent to conditions (i'), (ii')
and (iii') given in its Proof 1. Furthermore, note that F̂_{nk}(t) is constant on the
intervals of integration in each of these statements. Hence, we can take F̂_{nk}(t) outside
the integral, yielding:

(i'') For the last jump point τ of F̂_{nk} w.r.t. T_k, we have for all s > τ:

F̂_{nk}(s) = F̂_{nk}(τ) ≤ ∫_{[τ,s)} δ_k dP_n / ∫_{[τ,s)} dG_{nF̂_n}(u),

where equality must hold if s > T_{(n)}.

In terms of the cusum diagram, this means that F̂_{nk}(τ) is the slope of the line segment
connecting the points φ_{nkF̂_n}(τ−) and φ_{nkF̂_n}(T_{(n)}). Furthermore, no points φ_{nkF̂_n}(s)
for s ≥ τ may lie below this line segment. Similarly, condition (ii') of Proof 1 of
Proposition 2.36 is equivalent to:

(ii'') For any two successive jump points σ and τ of F̂_{nk} w.r.t. T_k, we have for s ∈ (σ, τ]:

F̂_{nk}(s) = F̂_{nk}(σ) ≤ ∫_{[σ,s)} δ_k dP_n / ∫_{[σ,s)} dG_{nF̂_n}(u),

where equality must hold if s = τ.

In terms of the cusum diagram, this means that F̂_{nk}(σ) is the slope of the line segment
connecting the points φ_{nkF̂_n}(σ−) and φ_{nkF̂_n}(τ−). Furthermore, no points φ_{nkF̂_n}(s) for
s ∈ [σ, τ) may lie below this line segment.

Next, we consider condition (iii') of Proof 1 of Proposition 2.36. Let σ be the first
jump point of F̂_{nk} w.r.t. T_k. Then by definition F̂_{nk}(t) = 0 for t < σ. Condition (iii')
says that ∫_{(0,σ)} δ_k dP_n = 0, and this is equivalent to F̂_{nk}(t), t ∈ (0, σ), being the slope
of the line segment connecting φ_{nkF̂_n}(0) and φ_{nkF̂_n}(σ−).

Thus, it follows that condition (b) of Proposition 2.36 for the kth component is
equivalent to F̂_{nk} being the slope of a piecewise linear function H_{nk} + c, where H_{nk} is
below the cusum diagram φ_{nkF̂_n}, and touches the cusum diagram whenever it has a
change of slope. Without loss of generality we take c = 0. Taking the monotonicity
constraint on F̂_n into account as well, it follows that H_{nk} must be convex, and hence
H_{nk} must be the convex minorant of the cusum diagram.

Finally, note that H_{nk} can only touch the cusum diagram at points at which G_{nF̂_n}
jumps. By the above reasoning, we must take the left derivative at these points.
Since there can be vertical stretches of points in the cusum diagram, we take the
right derivative at all other points. □
Example 2.41 To illustrate Proposition 2.40, we consider the example data in Table
2.5, and the corresponding plots in Figure 2.3. Note that T1 = 2, 3, 8, 10, T2 = 2, 3and T3 = 3. Note that the cumulative sum diagrams φnkFn
, k = 1, 2, only depend
on Fn through the value of Fn+(Ti) for which ∆i+ = 0. Thus, we can construct the
cumulative sum diagrams for the MLE using Fn+(3) = 3/5. We can then compute
Fnk(s), s ∈ Tk, by taking the slope of the convex minorant of φnkFn at GnFn(s), where
we take the left slope if Fnk has a jump at s w.r.t. Tk, and the right slope otherwise.
Note that GnFn(s) jumps at s = 3 and s = 10. Hence, we get Fn1(2) = Fn1(3) = 2/5
and Fn1(8) = Fn1(10) = 4/5. Note that at GnFn(3) = GnFn(8), the left slope is 2/5,
while the right slope is 4/5. For s = 3 we need the left slope, and for s = 8 we
need the right slope. This minor difficulty with left and right slopes is caused by the
vertical pieces in the cumulative sum diagram φnkFn.
Similarly, for k = 2, we obtain Fn2(2) = Fn2(3) = 1/5 and Fn2(T(n)) = Fn2(10) =
1/5. By monotonicity this implies that also Fn2(8) = 1/5. Note that these values
correspond exactly to the ones given in Table 2.5. Thus, one can recover the values
of Fnk(t), t ∈ Tk, from the values of Zi, i = 1, . . . , n, and Fn+(t), t ∈ TK+1, using a
simple one-step convex minorant algorithm.
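The one-step convex minorant computation can be illustrated with a generic sketch (illustrative only: a toy cusum diagram is used instead of the self-induced diagrams φnkFn of this example, and vertical stretches, which require the left/right slope distinction discussed above, are not handled):

```python
def convex_minorant_slopes(points):
    """Greatest convex minorant of a cusum diagram.

    `points` are (x, y) pairs with strictly increasing x, starting at (0, 0).
    Returns (x_left, x_right, slope) for each linear piece of the minorant,
    i.e. the value of the isotonic estimator on each segment.
    """
    hull = []  # vertices of the greatest convex minorant (lower convex hull)
    for x, y in sorted(points):
        hull.append((x, y))
        # drop middle vertices that lie on or above the chord of their neighbours
        while len(hull) >= 3:
            (x0, y0), (x1, y1), (x2, y2) = hull[-3:]
            if (y1 - y0) * (x2 - x1) >= (y2 - y1) * (x1 - x0):
                del hull[-2]
            else:
                break
    return [(x0, x1, (y1 - y0) / (x1 - x0))
            for (x0, y0), (x1, y1) in zip(hull, hull[1:])]

# cusum diagram of the values (3, 1, 2, 4) with unit weights
print(convex_minorant_slopes([(0, 0), (1, 3), (2, 4), (3, 6), (4, 10)]))
# [(0, 3, 2.0), (3, 4, 4.0)]
```

The slopes 2.0 and 4.0 are exactly the isotonic least squares fit of (3, 1, 2, 4), read off segment by segment.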
Table 2.5: Example data with K = 2 competing risks, to illustrate Propositions 2.30 and 2.40. The convex minorants are given in Figure 2.3. The first pair of Fn-columns belongs to the naive estimator (Proposition 2.30), the second pair and Fn+ to the MLE (Proposition 2.40).

 i   t(i)   δ(i)1   δ(i)2   δ(i)3   Fn1(t(i))   Fn2(t(i))   Fn1(t(i))   Fn2(t(i))   Fn+(t(i))
 1    2       1       0       0        1/3         1/5         2/5         1/5         3/5
 2    2       0       1       0        1/3         1/5         2/5         1/5         3/5
 3    3       0       0       1        1/3         1/5         2/5         1/5         3/5
 4    8       1       0       0        1           1/5         4/5         1/5         1
 5   10       1       0       0        1           1/5         4/5         1/5         1
We now compare the cusum diagrams φnk and φnkFn for the naive estimator and
the MLE. Note that the y-coordinates of the points in both cusum diagrams are
given by ∫(0,t] δk dPn, so that the points in the left and right panels of Figure 2.3 align horizontally. Furthermore, the x-coordinates Gn(t) of φnk(t) and GnFn(t) of φnkFn do not depend on k. Hence, the points in the upper panels of Figure 2.3 align vertically with those in the lower panels. Furthermore, we always have Gn(T(n)) = GnFn(T(n)) =
1, and, for s < T(n) such that Fn+(s) < 1,
\[
G_{nF_n}(s) = G_n(s) + \int_{(0,s]} \frac{F_{n+}(t) - \delta_+}{1 - F_{n+}(t)} \, dP_n(t,\delta) + \beta_{nF_n} 1\{s \ge T_{(n)}\}. \tag{2.58}
\]
This expression shows that the x-coordinates of φnk and φnkFn differ by two terms
that we refer to as the Fn+-term and the βnFn-term. The Fn+-term comes from
the difference in the log likelihoods (2.6) and (2.8). The βnFn-term comes from the
constraint F+ ≤ 1 on the space FK .
Remark 2.42 Note that the Kullback-Leibler number used in Proof 2 of Proposition
2.36 equals zero if and only if Fnk(u) = Fk(u) for all k = 1, . . . , K at points of jump
of GnFn(u). Hence, Fn+(u) is unique at points of jump of GnFn(u). Since these
values uniquely determine Fnk(u), u ∈ Tk, via the convex minorant characterization
in Proposition 2.40, this gives another proof of uniqueness.
[Figure 2.3 appears here: a 2 × 2 array of convex minorant plots (rows k = 1, 2; columns: naive estimator, MLE), with points labeled by the observation times 2, 3, 8, 10.]
Figure 2.3: Convex minorant plots for the data in Table 2.5. The left column corresponds to the naive estimator, and shows the cusum diagrams and their convex minorants (see Proposition 2.30). The right column corresponds to the MLE, and shows the self-induced cusum diagrams (based on the values of Fn+ given in Table 2.5) and their convex minorants (see Proposition 2.40). All points are labeled by their observation times.
In Example 2.41 we saw that the x-coordinates of the cusum diagrams of the MLE
and the naive estimator are different, while their y-coordinates are identical. We can
also derive convex minorant characterizations where the x-coordinates of the cusum
diagrams of both estimators are identical, and their y-coordinates are different. To
illustrate the freedom in the convex minorant characterizations, we now discuss a
particular family of convex minorant characterizations. These new characterizations
look more complicated, but will prove useful for computational purposes in Section
3.2.
We define
\[
s_k = \min\{T_i,\ i = 1, \ldots, n : \Delta_{ik} = 1\}, \qquad k = 1, \ldots, K. \tag{2.59}
\]
Note that it follows from the form of the log likelihood (2.6) that Fnk(t) = 0 for
t < sk. We now define a family of cusum diagrams:
Definition 2.43 Let ck : Z → R+, k = 1, . . . , K, where Z is defined in (2.3). For
F ∈ FK such that βnF ≥ 0, let
\[
G^*_{nk}(s) = \int_{(0,s]} c_k(u,\delta) \, dP_n(u,\delta),
\]
\[
V^*_{nkF}(s) = \int_{(0,s]} \left\{ \frac{\delta_k}{F_k(u)} - \frac{1 - \delta_+}{F_+(\infty) - F_+(u)} + c_k(u,\delta) F_k(u) \right\} dP_n(u,\delta) - \beta_{nF}\, 1\{s \ge T_{(n)}\},
\]
\[
\phi^*_{nkF}(t) = \left( G^*_{nk}(t), V^*_{nkF}(t) \right).
\]
Proposition 2.44 Let Fnk(u) = 0 for u < sk, k = 1, . . . , K. Then Fnk(t), t ∈ Tk ∩ [sk,∞), are the uniquely defined values of the MLE if and only if

(a) βnFn ≥ 0, and equality holds if Fn+(T(n)) < Fn+(∞).

(b) For all k = 1, . . . , K and t ∈ Tk ∩ [sk,∞), Fnk(t) is the slope of the convex minorant of φ∗nkFn at G∗nk(t), where we take the left-continuous slope if G∗nk has a jump at t, and the right-continuous slope otherwise.
Note that the definition of sk is used to avoid the part of the convex minorant where
we may get negative slopes.
Proof: The proof is analogous to the proof of Proposition 2.40. For example, consider
condition (i) in Proof 1 of Proposition 2.36. Let τ be the last jump point of Fnk w.r.t.
Tk. Adding ∫[τ,s) ck(u, δ)Fnk(u) dPn(u, δ) to both sides of the equation yields that we
have for s > τ :
\[
\int_{[\tau,s)} \left\{ \frac{\delta_k}{F_{nk}(u)} - \frac{1 - \delta_+}{F_{n+}(\infty) - F_{n+}(u)} + c_k(u,\delta) F_{nk}(u) \right\} dP_n(u,\delta) - \beta_{nF_n}\, 1\{s > T_{(n)}\}
\ge \int_{[\tau,s)} c_k(u,\delta) F_{nk}(u) \, dP_n(u,\delta),
\]
and equality must hold if s > T(n). Pulling Fnk(u) = Fnk(τ) out of the integral on the right side yields, for s > τ, that Fnk(s) = Fnk(τ), where
\[
F_{nk}(\tau) \le \frac{\displaystyle \int_{[\tau,s)} \left\{ \frac{\delta_k}{F_{nk}(u)} - \frac{1 - \delta_+}{F_{n+}(\infty) - F_{n+}(u)} + c_k(u,\delta) F_{nk}(u) \right\} dP_n(u,\delta) - \beta_{nF_n}\, 1\{s > T_{(n)}\}}{\displaystyle \int_{[\tau,s)} c_k(u,\delta) \, dP_n(u,\delta)}.
\]
In terms of the cusum diagram φ∗nkFn, this means that Fnk(τ) is the slope of the line segment connecting φ∗nkFn(τ−) and φ∗nkFn(T(n)). Furthermore, no points φ∗nkFn(s), s > τ, may lie below this line segment. Condition (ii) in Proof 1 of Proposition 2.36 can be treated analogously, and condition (iii) can be omitted since Fnk(t) = 0 for t < sk, k = 1, . . . , K. □
Chapter 3
COMPUTATION
We already noted in Section 2.4.2 that the naive estimator can be computed with
existing algorithms for current status data, such as a one-step convex minorant algo-
rithm or the Pool Adjacent Violators Algorithm (PAVA). In this chapter we therefore
focus on the computation of the MLE.
There are several algorithms available for the computation of the MLE. Hudgens,
Satten and Longini (2001) and Jewell, Van der Laan and Henneman (2003) use the
EM algorithm, and Jewell and Kalbfleisch (2004) propose an iterative Pool Adjacent
Violators Algorithm. The EM algorithm is known for its slow convergence, and indeed in this problem it requires an extremely large number of iterations and a very long computing time. The algorithm of Jewell and Kalbfleisch (2004) seems to improve on
EM when ∆(n),K+1 = 1. However, when ∆(n),K+1 = 0 the algorithm does not converge
to the MLE directly, and in this case one needs to do a (K − 1)-dimensional search to
find the MLE. Such a search is very costly in computing time.
We propose to compute the MLE using sequential quadratic programming (SQP)
methods. The basic idea of SQP is as follows. Suppose we want to minimize a function
θ(F ) over F ∈ H. Let F (0) ∈ H be a fixed starting point. For each l = 0, 1, . . . , let F new
be the minimizer of θ(l)(F ) over H, where θ(l)(F ) is a quadratic approximation of θ(F )
around F (l). We then obtain the next iterate by taking F (l+1) = F (l) +α(F new−F (l))
for a suitable α > 0. We continue this process until the necessary and sufficient
conditions for the optimum are satisfied within a specified tolerance.
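The damped SQP loop described above can be sketched as follows (an illustration, not the thesis implementation: the function names `theta`, `grad`, `hess_diag`, and `project` are hypothetical, and the toy problem projects onto a box, where the diagonal quadratic model is solved exactly, rather than onto the monotone cone used for the MLE):

```python
import numpy as np

def sqp_minimize(theta, grad, hess_diag, project, F0, tol=1e-8, max_iter=200):
    """Damped sequential quadratic programming with a diagonal quadratic model.

    Each step minimizes the quadratic approximation of `theta` around the
    current iterate (diagonal Hessian, so the model minimizer is a projected
    Newton step), then damps with F <- F + alpha * (F_new - F), halving alpha
    until `theta` decreases.
    """
    F = np.asarray(F0, dtype=float)
    for _ in range(max_iter):
        g, h = grad(F), hess_diag(F)
        F_new = project(F - g / h)  # minimizer of the separable quadratic model
        if np.max(np.abs(F_new - F)) < tol:
            break
        alpha = 1.0
        while theta(F + alpha * (F_new - F)) > theta(F) and alpha > 1e-10:
            alpha /= 2.0  # line search guarantees a descent step
        F = F + alpha * (F_new - F)
    return F

# Toy problem: minimize ||F - y||^2 over the box H = [0, 1]^3.
y = np.array([-0.3, 0.4, 1.7])
F_hat = sqp_minimize(
    theta=lambda F: np.sum((F - y) ** 2),
    grad=lambda F: 2.0 * (F - y),
    hess_diag=lambda F: 2.0 * np.ones_like(F),
    project=lambda F: np.clip(F, 0.0, 1.0),
    F0=np.full(3, 0.5),
)
print(F_hat)  # [0.  0.4 1. ]
```

For the MLE itself, the projection step is replaced by the convex-constrained quadratic solves described in Sections 3.1 and 3.2.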
Thus, in this procedure we need to minimize the quadratic functions θ(l)(F ) over
H for l = 0, 1, . . . . In order to solve these quadratic optimization problems, we can
use either the complete Hessian matrix of θ(F ), or only its diagonal elements. We
developed algorithms for both approaches. Section 3.1 describes a method that uses
the complete Hessian and employs the support reduction algorithm of Groeneboom,
Jongbloed and Wellner (2002). The main advantage of this method is that it can
be used for any censored data problem. Section 3.2 describes a method that uses
only the diagonal elements. Using only the diagonal elements reduces the quadratic
optimization problems to isotonic regression problems which can be solved by a one-
step convex minorant algorithm. Hence, this approach results in an iterative convex
minorant algorithm, and we show that it corresponds to the convex minorant charac-
terization in Proposition 2.44 for a specific choice of the functions ck, k = 1, . . . , K.
If the Hessian matrix is sparse off the diagonal, then this algorithm is expected to
be faster, because the speed gained in solving the quadratic optimization problems
outweighs the fact that we do not solve the exact quadratic optimization problems.
3.1 Reduction and optimization
As noted in Section 2.2, a general approach for the computation of the MLE for
censored data consists of a reduction step followed by an optimization step. In the
reduction step we compute the maximal intersections A1, . . . , Am of the observed sets
R1, . . . , Rn. In the optimization step we solve the optimization problem defined in
(2.15) and (2.16).
The main advantage of this approach is its versatility. The form of the log likeli-
hood (2.15) is the same for all censored data problems, so that the same optimization
algorithm can be used for all problems, and only the reduction step may require ad-
justments. Another advantage of this approach, compared to the one of Jewell and
Kalbfleisch (2004), is that we estimate a significantly smaller number of parameters.
Hudgens, Satten and Longini (2001) also employ a reduction and an optimization
step for the computation of the MLE. However, they use an EM algorithm for the
optimization step, while we use the support reduction algorithm of Groeneboom,
Jongbloed and Wellner (2002). We now describe our implementation of both the
reduction step and the optimization step in more detail.
3.1.1 Reduction step
We first note that we can use the height map algorithm of Maathuis (2005) for the
reduction step. The idea behind this algorithm is as follows. Given n observed sets in
Rp, p ∈ N, taking the form of p-dimensional rectangles1, the height map is a function
h : Rp → {0, 1, . . . }, where h(x) is defined as the number of observed sets that
overlap at the point x ∈ Rp. Maathuis (2005) shows that the maximal intersections
correspond exactly to the local maxima of the height map of a canonical version of
the observed sets. However, this algorithm only works when the observed sets are
rectangles in Rp for some p ∈ N. As discussed in Section 2.2.1, the observed sets for
the MLE can take the form (t,∞) × {1, . . . , K} for t ∈ R+, and such sets are not rectangles in R2. We resolve this problem by transforming the sets (t,∞) × {1, . . . , K} into (t,∞) × [1, K]. After this transformation we compute the maximal intersections, and if we find any maximal intersections of the form (t,∞) × [1, K], then we transform these back to (t,∞) × {1, . . . , K}. This reduction algorithm has time complexity O(n²) (Maathuis (2005)).
We can exploit the special structure of current status data with competing risks to
create a reduction algorithm with lower time complexity. Essentially, the data can be
thought of as K one-dimensional data sets. This is also apparent in the intersection
graph of the observed sets (see Figure 2.2 on page 30), in which the sets R1, . . . ,RK
are not adjacent to each other, but are all adjacent to RK+1 as described in Theorem
2.14 (c). As a result of this one-dimensional structure, we can use the idea of the
height map algorithm for each k = 1, . . . , K separately. This is done in Algorithm 1,
given in pseudo code. This algorithm is of time complexity O(n log n), since the most
1 Thus, an observed set can be written as [x11, x12] × [x21, x22] × · · · × [xp1, xp2], where it is allowed that xj1 = xj2, j = 1, . . . , p, and where the boundaries of the intervals can also be open.
time consuming step consists of sorting the observations.
Algorithm 1: Reduction algorithm(R1, . . . ,RK+1)
Input: The sets R1, . . . ,RK+1 as defined in (2.23).
Output: The maximal intersections of R.
1: for k = 1 to K do
2:   Sort the observations in Rk on their observation times.
3:   Find the maximal intersections of Rk, using the conditions given in Lemma 2.3.
4: Output the union of all maximal intersections.
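In the plain one-dimensional interval-censoring setting, the per-risk step of Algorithm 1 amounts to finding the innermost ("Turnbull") intervals of a family of intervals. A generic sketch with closed intervals (illustration only; the exact open/closed endpoint conventions of Lemma 2.3 are not reproduced here):

```python
def maximal_intersections(intervals):
    """Innermost ("Turnbull") intervals of a family of closed intervals [l, r].

    A maximal intersection is an interval [p, q] such that p is a left
    endpoint, q is a right endpoint, p <= q, and no other endpoint lies
    strictly inside (p, q). Sorting the tagged endpoints reduces this to a
    single O(n log n) sweep.
    """
    # Tag endpoints; at ties, left endpoints (tag 0) sort before right
    # endpoints (tag 1), so touching intervals [a, b], [b, c] yield [b, b].
    events = sorted([(l, 0) for l, _ in intervals] + [(r, 1) for _, r in intervals])
    out, last_left = [], None
    for x, tag in events:
        if tag == 0:
            last_left = x               # remember the most recent left endpoint
        elif last_left is not None:
            out.append((last_left, x))  # a right endpoint closes an innermost interval
            last_left = None
    return out

print(maximal_intersections([(0, 4), (2, 6), (5, 9)]))  # [(2, 4), (5, 6)]
```

As in Algorithm 1, the sort dominates the cost, so the sweep runs in O(n log n) per risk.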
3.1.2 Optimization step
After finding the maximal intersections, we need to solve the m-dimensional convex
constrained optimization problem (2.16). This problem can be approached in various
ways. Hudgens, Satten and Longini (2001) use an EM algorithm. Since the EM
algorithm is known for its slow convergence properties, we instead use the support
reduction algorithm of Groeneboom, Jongbloed and Wellner (2002). Convergence of
this algorithm follows from their Theorem 3.1.
The versatility of this approach is illustrated by the fact that we could re-use
programs that were developed for bivariate interval censored data in Maathuis (2003).
Details of the implementation can be found there.
Remark 3.1 To obtain fast convergence, it is important to use a good starting value
for the iterations. In simulation studies, a suitable starting value can be generated
from the true underlying distribution. If the underlying distribution is unknown, one
can use a starting value based on the naive estimator. We found that fast convergence
is obtained by starting in a value that is close to the ‘truncated naive estimator’ that
we will discuss in Section 8.2.
3.2 Iterative convex minorant algorithms
We now discuss iterative convex minorant algorithms for the computation of the
MLE. First, note that any convex minorant characterization can be turned into an
iterative convex minorant algorithm. To do so, let F (0) ∈ FK (see equation (2.38))
be some starting value. Furthermore, let P(l)k, k = 1, . . . , K, denote the points of the cumulative sum diagrams in the lth iteration step. Thus, using Proposition 2.40 or 2.44, we have P(l)k = {φnkF(l)(t) : t ≥ 0} or P(l)k = {φ∗nkF(l)(t) : t ≥ 0}. Then, for l = 0, 1, . . . , let F new be the slope of the convex minorant of P(l)k, and take as the
next iterate F (l+1) = F (l) + α(F new − F (l)) for a suitable α > 0.
If such an algorithm converges, it converges to the MLE. However, the convergence
properties of the algorithm will depend on the choice of the convex minorant char-
acterization. To illustrate the iterative convex minorant algorithms, we discuss one
algorithm in detail and prove its convergence. The algorithm we discuss corresponds
to a specific choice of the functions ck, k = 1, . . . , K, in Proposition 2.44. Further-
more, the algorithm corresponds to a sequential quadratic programming approach
that only uses the diagonal elements of the Hessian matrix, showing the connection
with the approach discussed in Section 3.1.
To describe the algorithm, we repeat the notation that was used in the proof of
Proposition 2.34. Let T ′(i), i = 1, . . . , nd, nd ≤ n, be the order statistics of the distinct
observation times in T1, . . . , Tn, and let T ′(0) = 0 and T ′(nd+1) = ∞. Furthermore, for
i = 1, . . . , nd and k = 1, . . . , K + 1, let
\[
N_{ik} = \sum_{j=1}^n \Delta_{jk} \, 1\{T_j = T'_{(i)}\} \qquad \text{and} \qquad N_i = \sum_{j=1}^n 1\{T_j = T'_{(i)}\}.
\]
Furthermore, for any set of functions (F1, . . . , FK) and i ∈ {0, . . . , nd + 1}, we define
\[
F_{ik} = F_k(T'_{(i)}) \qquad \text{and} \qquad F_{i+} = \sum_{k=1}^K F_{ik}.
\]
Let F = (F1, . . . , FK), where Fk = (F1k, . . . , Fnd+1,k). We can then write the extended
log likelihood function ln(F ) as
\[
l_n(F) = \frac{1}{n} \sum_{i=1}^{n_d} \left\{ \sum_{k=1}^K N_{ik} \log F_{ik} + N_{i,K+1} \log (F_{n_d+1,+} - F_{i+}) \right\} - F_{n_d+1,+}. \tag{3.1}
\]
We need to maximize this function over the space
\[
\mathcal{F}_K = \left\{ F \in \mathbb{R}^{(n_d+1)K} : 0 \le F_{1k} \le \cdots \le F_{n_d+1,k} \text{ for all } k = 1, \ldots, K \right\}.
\]
If Nnd,K+1 = 0, we can take Fnd+1,+ = Fnd,+, since in this case the MLE will never
put any mass to the right of T ′(nd). Making this substitution in (3.1), and multiplying by −1 to turn the problem into a minimization problem, yields the new criterion
function
\[
\varphi_n(F) = -\frac{1}{n} \sum_{i=1}^{n_d} \left\{ \sum_{k=1}^K N_{ik} \log F_{ik} + N_{i,K+1} \log (F_{n_d,+} - F_{i+}) \right\} + F_{n_d,+}.
\]
On the other hand, if Nnd,K+1 > 0, the constraint Fnd,+ ≤ 1 is automatically satisfied
and we do not need the Lagrange term Fnd+1,+ in (3.1). Thus, in this case we work
with the criterion function
\[
\psi_n(F) = -\frac{1}{n} \sum_{i=1}^{n_d} \left\{ \sum_{k=1}^K N_{ik} \log F_{ik} + N_{i,K+1} \log (1 - F_{i+}) \right\}.
\]
Recall the definitions of sk in (2.59), and of Tk in Definition 2.22. Let Ik denote the
indices of the order statistics corresponding to Tk ∩ [sk,∞):
\[
I_k = \left\{ i = 1, \ldots, n_d : T'_{(i)} \in \mathcal{T}_k \cap [s_k, \infty) \right\}, \qquad k = 1, \ldots, K.
\]
Furthermore, let mk = |Ik| and m = ∑_{k=1}^K mk. We set Fik = 0 for T ′(i) < sk. Then ϕn(F ) and ψn(F ) only depend on Fik for i ∈ Ik, k = 1, . . . , K. Restricting F to only
contain elements Fik for i ∈ Ik, k = 1, . . . , K, the computation of the MLE reduces
to finding the minimizer of θ(F ) over H, where
\[
\mathcal{H} = \left\{ F \in \mathbb{R}^m : 0 \le F_{ik} \le F_{jk} \text{ for all } i < j \in I_k,\ k = 1, \ldots, K \right\},
\]
and where θ(F ) = ϕn(F ) if Nnd,K+1 = 0, and θ(F ) = ψn(F ) if Nnd,K+1 > 0. This
means that in step l of the algorithm we need to solve the quadratic optimization
problem
\[
F^{\text{new}} = \operatorname*{argmin}_{F \in \mathcal{H}} \left\{ \theta(F^{(l)}) + (F - F^{(l)})^T \nabla_l + \tfrac{1}{2} (F - F^{(l)})^T H_l (F - F^{(l)}) \right\},
\]
where ∇l ∈ Rm is the vector of first derivatives of θ(F ) at F (l), and Hl is the m×m
diagonal matrix containing the second derivatives of θ(F ) at F (l). Note that this
optimization problem is equivalent to
\[
F^{\text{new}} = \operatorname*{argmin}_{F \in \mathcal{H}} \tfrac{1}{2} \left( F - \left( F^{(l)} - H_l^{-1} \nabla_l \right) \right)^T H_l \left( F - \left( F^{(l)} - H_l^{-1} \nabla_l \right) \right).
\]
Since F ∈ H is required to be monotone in each of the components F1, . . . , FK , and
since there are no constraints between the components, this minimization problem
can be further reduced to K isotonic least squares problems
\[
F_k^{\text{new}} = \operatorname*{argmin}_{F_k \in \mathcal{G}_k} \frac{1}{2} \sum_{i \in I_k} \left\{ F_{ik} - \left( F^{(l)}_{ik} - \left( \frac{\partial^2 \theta(F^{(l)})}{\partial F_{ik}^2} \right)^{-1} \frac{\partial \theta(F^{(l)})}{\partial F_{ik}} \right) \right\}^2 \frac{\partial^2 \theta(F^{(l)})}{\partial F_{ik}^2}, \tag{3.2}
\]
where Gk = {Fk ∈ Rmk : 0 ≤ Fik ≤ Fjk for all i < j ∈ Ik} for k = 1, . . . , K. It
is well known (see, e.g., Robertson, Wright and Dykstra (1988)) that the solution of
the isotonic least squares problem
\[
\min \ \frac{1}{2} \sum_{i=1}^n (x_i - y_i)^2 h_i
\]
for a fixed y ∈ Rn and positive weights h1, . . . , hn can be found as the left derivative of the convex minorant of the points P = {Pi = (G(i), V(i)), i = 0, . . . , n}, where P0 = (0, 0) and
\[
G_{(i)} = \sum_{j=1}^i h_j, \qquad V_{(i)} = \sum_{j=1}^i h_j y_j.
\]
Hence, for k ∈ {1, . . . , K}, the solution F new_k of the isotonic least squares problem (3.2) is given by the left derivative of the convex minorant of the points (0, 0) and
\[
\left( \sum_{j \in I_k,\, j \le i} \frac{\partial^2 \theta(F^{(l)})}{\partial F_{jk}^2}, \ \sum_{j \in I_k,\, j \le i} \left\{ \frac{\partial^2 \theta(F^{(l)})}{\partial F_{jk}^2} F^{(l)}_{jk} - \frac{\partial \theta(F^{(l)})}{\partial F_{jk}} \right\} \right), \qquad i \in I_k. \tag{3.3}
\]
If Nnd,K+1 = 0, we replace θ(F ) by ϕn(F ) in (3.3). Note that
\[
\frac{\partial \varphi_n(F^{(l)})}{\partial F_{ik}} = -\frac{1}{n} \left( \frac{N_{ik}}{F^{(l)}_{ik}} - \frac{N_{i,K+1}}{F^{(l)}_{n_d,+} - F^{(l)}_{i+}} \right) + 1\{i = n_d\} \left( 1 - \frac{1}{n} \sum_{j=1}^{n_d} \frac{N_{j,K+1}}{F^{(l)}_{n_d,+} - F^{(l)}_{j+}} \right),
\]
and
\[
n \frac{\partial^2 \varphi_n(F^{(l)})}{\partial F_{ik}^2} = \frac{N_{ik}}{(F^{(l)}_{ik})^2} + \frac{N_{i,K+1}}{(F^{(l)}_{n_d,+} - F^{(l)}_{i+})^2} + 1\{i = n_d\} \sum_{j=1}^{n_d} \frac{N_{j,K+1}}{(F^{(l)}_{n_d,+} - F^{(l)}_{j+})^2}.
\]
Hence, this corresponds exactly to Proposition 2.44 with Fnd+1,+ = Fnd,+,
\[
c^{(l)}_k(T'_{(j)}) = c^{(l)}_{jk} = n \frac{\partial^2 \varphi_n(F^{(l)})}{\partial F_{jk}^2} \qquad \text{and} \qquad \beta_{nF^{(l)}} = 1 - \frac{1}{n} \sum_{j=1}^{n_d} \frac{N_{j,K+1}}{F^{(l)}_{n_d,+} - F^{(l)}_{j+}}.
\]
If Nnd,K+1 > 0, we replace θ(F ) by ψn(F ) in (3.3). In this case, we have
\[
\frac{\partial \psi_n(F^{(l)})}{\partial F_{ik}} = -\frac{1}{n} \left( \frac{N_{ik}}{F^{(l)}_{ik}} - \frac{N_{i,K+1}}{1 - F^{(l)}_{i+}} \right),
\]
and
\[
n \frac{\partial^2 \psi_n(F^{(l)})}{\partial F_{ik}^2} = \frac{N_{ik}}{(F^{(l)}_{ik})^2} + \frac{N_{i,K+1}}{(1 - F^{(l)}_{i+})^2}.
\]
Hence, we again obtain an algorithm that corresponds to Proposition 2.44, but now
with Fnd+1,+ = 1,
\[
c^{(l)}_k(T'_{(j)}) = c^{(l)}_{jk} = n \frac{\partial^2 \psi_n(F^{(l)})}{\partial F_{jk}^2} \qquad \text{and} \qquad \beta_{nF^{(l)}} = 0.
\]
Convergence of the iterative convex minorant algorithm is proved in Jongbloed (1998).
Since both criterion functions ϕn and ψn satisfy the conditions of his theorem, it
follows that our iterative convex minorant algorithm yields a direction of descent of
the criterion function. Hence, it converges to the MLE when augmented by a line
search procedure.
Chapter 4
CONSISTENCY
In this chapter we prove global and local consistency of the MLE and the naive es-
timator. In Section 4.1 we prove Hellinger consistency, using empirical process theory
and Glivenko-Cantelli preservation theorems. This also leads to Lr(G) consistency
for r > 0. In Section 4.2, we prove several types of local and uniform consistency,
following the methods of Schick and Yu (2000).
4.1 Hellinger consistency
We first define the Hellinger and total variation distance for both estimators. Recall
the definitions of the MLE and the naive estimator in Sections 2.1.2 and 2.1.3. The
MLE is based on the observed data Z = (T,∆). The density for one observation
z = (t, δ) with respect to µ = # × G is pF (z) = ∏_{k=1}^K Fk(t)^δk (1 − F+(t))^{1−δ+}. Here
G is the distribution function of the observation time T and # is counting measure
on {ek, k = 1, . . . , K + 1}. Recall that Fn+ = ∑_{k=1}^K Fnk, Fn,K+1 = 1 − Fn+ and
F0,K+1 = 1 − F0+. Furthermore, recall that FK is the class of all K-
tuples of sub-distribution functions on R with pointwise sum bounded by one. The
Hellinger and total variation distance between two vectors F = (F1, . . . , FK) and
F ′ = (F ′1, . . . , F
′K) in FK are given by
\[
h^2(p_F, p_{F'}) = \frac{1}{2} \int \left( \sqrt{p_F} - \sqrt{p_{F'}} \right)^2 d\mu = \frac{1}{2} \sum_{k=1}^{K+1} \int \left( \sqrt{F_k} - \sqrt{F'_k} \right)^2 dG, \tag{4.1}
\]
\[
d_{TV}(p_F, p_{F'}) = \sum_{k=1}^{K+1} \int \left| F_k - F'_k \right| dG. \tag{4.2}
\]
The naive estimator Fnk, k = 1, . . . , K + 1, is based on the marginal data Zk =
(T,∆k). The density for one observation zk = (t, δk) with respect to µ = # × G
is pk,Fk(zk) = Fk(t)^δk (1 − Fk(t))^{1−δk}. Here # is counting measure on {(1, 0), (0, 1)}.
Recall that F is the class of all sub-distribution functions on R, and that S is the
class of all sub-survival functions on R. The Hellinger and total variation distance
between two components Fk and F ′k, k ∈ {1, . . . , K}, in F , or FK+1 and F ′K+1 in S,
are given by
\[
h^2(p_{k,F_k}, p_{k,F'_k}) = \frac{1}{2} \int \left( \sqrt{p_{k,F_k}} - \sqrt{p_{k,F'_k}} \right)^2 d\mu
= \frac{1}{2} \int \left\{ \left( \sqrt{F_k} - \sqrt{F'_k} \right)^2 + \left( \sqrt{1 - F_k} - \sqrt{1 - F'_k} \right)^2 \right\} dG \tag{4.3}
\]
and
\[
d_{TV}(p_{k,F_k}, p_{k,F'_k}) = 2 \int \left| F_k - F'_k \right| dG, \tag{4.4}
\]
for k = 1, . . . , K + 1. We now prove Hellinger consistency for the naive estimators
and the MLE. For the naive estimator, Hellinger consistency follows immediately from
Theorem 7 of Van der Vaart and Wellner (2000), which gives Hellinger consistency
for the MLE for mixed case interval censored data.
Theorem 4.1
\[
h(p_{k,F_{nk}}, p_{k,F_{0k}}) \to_{a.s.} 0, \qquad k = 1, \ldots, K + 1. \tag{4.5}
\]
Proof: Let k ∈ {1, . . . , K + 1}. The naive estimator Fnk is the MLE for the marginal
current status data Zk = (T,∆k). Univariate current status data is a special case
of univariate mixed case interval censored data. Hence, Hellinger consistency for the
naive estimator follows immediately from known results for the MLE for univari-
ate mixed case interval censored data (see, e.g., Van der Vaart and Wellner (2000,
Theorem 7)). □
For the MLE, Hellinger consistency follows from Theorem 9 of Van der Vaart and
Wellner (2000). Theorem 9 is a more general version of their Theorem 7. It uses
the concept of VC-class, which is defined as (see, e.g., Dudley (1978), Pollard (1984),
Van der Vaart and Wellner (2000, page 85)):
Definition 4.2 A collection C of subsets of a sample space W is said to pick out a certain subset of the finite set {x1, . . . , xn} ⊆ W if it can be written as {x1, . . . , xn} ∩ C for some C ∈ C. The collection C is said to shatter {x1, . . . , xn} if C picks out each of the 2^n subsets. The VC-index V (C) of C is the smallest n for which no set of size n is shattered by C. A collection C of measurable sets is called a VC-class if its VC-index is finite.
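As a toy illustration of this definition (not from the thesis), a brute-force check shows that the class of half-lines {(−∞, t] : t ∈ R} shatters every one-point set but no two-point set, so its VC-index is 2:

```python
def picks_out(points, thresholds):
    """Subsets of `points` that the class {(-inf, t] : t} picks out."""
    return {tuple(x <= t for x in points) for t in thresholds}

def shatters(points, thresholds):
    """True if every one of the 2^n subsets of `points` is picked out."""
    return len(picks_out(points, thresholds)) == 2 ** len(points)

# A finite grid of thresholds interleaving the points is enough, because
# only the ordering of t relative to the points matters.
ts = [0.5, 1.5, 2.5]
print(shatters([1.0], ts))       # True: both {} and {1.0} are picked out
print(shatters([1.0, 2.0], ts))  # False: {2.0} alone can never be picked out
```

The class D used below for current status data with competing risks is richer, and has VC-index 3.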
Theorem 9 of Van der Vaart and Wellner (2000) holds in the censored data setting
that we discussed in Section 2.2. We briefly recall the set-up. Let W be a random
variable taking values in W. Suppose that W has distribution Q0. We do not observe
W directly. Rather, we observe a vector of random sets D = (D1, . . . , Dp) that form
a partition of W: ∪_{k=1}^p Dk = W and Dk ∩ Dj = ∅ for j ≠ k. Here the number of
random sets p is allowed to be random. However, that is not needed in our case, in
which p = K + 1 and K is the number of competing risks. Furthermore, we observe
an indicator vector ∆ = (∆1, . . . ,∆K+1), where ∆k = 1{W ∈ Dk}, k = 1, . . . , K + 1.
We assume W and D are independent. The observed random variable is Z = (D,∆),
and Z1, . . . , Zn are n i.i.d. copies of Z. Finally, Qn is the nonparametric maximum
likelihood estimator of Q0 based on Z1, . . . , Zn. Then Theorem 9 of Van der Vaart and
Wellner (2000) states that h(pQn, pQ0) →a.s. 0 if all Dk ∈ D and D is a VC collection
of subsets of W. Recall from Section 2.2.1 that the MLE for current status data
with competing risks fits this framework, with W = (X, Y ), W = R × {1, . . . , K}, Dk(T ) = (−∞, T ] × {k} for k = 1, . . . , K, and DK+1(T ) = (T,∞) × {1, . . . , K}. Note
that D and (X, Y ) are indeed independent. This follows from the independence of
(X, Y ) and T (assumption (a) in Section 2.1), and the fact that D only depends on
T . Furthermore, the class D is a VC-class with VC-index 3. Hence, it follows that the
MLE for the bivariate distribution function of (X, Y ) is Hellinger consistent. Since
estimating the bivariate distribution function of (X, Y ) is equivalent to estimating the
sub-distribution functions, it follows that Fn = (Fn1, . . . , FnK) is Hellinger consistent.
For completeness we also give a direct proof of this result. We first recall the
definitions of outer integrals and measurable majorants, as given in Van der Vaart
and Wellner (1996, Section 1.2, page 6).
Definition 4.3 Let (Ω,A,P) be an arbitrary probability space, and let T : Ω 7→[−∞,∞] be an arbitrary map. The outer integral of T with respect to P is
E∗T = inf{EU : U ≥ T, U : Ω → [−∞,∞] is measurable and EU exists}.
Here, EU is understood to exist if at least one of EU+ or EU− is finite. The functions
U are allowed to take the value ∞, so that the infimum is never empty. The outer
probability of an arbitrary subset B of Ω is
P ∗(B) = inf{P (A) : A ⊃ B, A ∈ A}.
The infima in the above definitions are always achieved, and are denoted by T ∗ and
B∗ respectively.
Next, we recall the definitions of (universal) Glivenko-Cantelli classes and envelope
functions, as given in Van der Vaart and Wellner (1996, page 81 and 84).
Definition 4.4 Let (X ,B, P0) be a probability space. Let F be a class of measurable
functions f : X → R. Let Pn be the empirical measure of n i.i.d. copies X1, . . . , Xn
of X ∼ P0. Then F is a P0-Glivenko-Cantelli class if
\[
\|P_n - P_0\|_{\mathcal{F}}^* = \sup_{f \in \mathcal{F}} |(P_n - P_0) f|^* \to_{a.s.} 0.
\]
If the statement above holds for all probability measures P on (X ,B), then F is called
a universal Glivenko-Cantelli class.
Definition 4.5 Let F be a class of measurable functions f : X → R. An envelope
function of the class F is any function F : X → R such that |f(x)| ≤ F (x) for every
x ∈ X and f ∈ F .
We now give a direct proof of Hellinger consistency of the MLE.
Theorem 4.6
h(pFn, pF0) →a.s. 0. (4.6)
Proof: We first give an outline of the proof. Let
P = {pF : F ∈ FK}. (4.7)
Since FK is convex, it follows that P is convex. Hence, we can use the following
inequality for convex classes P:
\[
h^2(p_{F_n}, p_{F_0}) \le (P_n - P_0)\, \phi(p_{F_n}/p_{F_0}), \tag{4.8}
\]
where φ(t) = (t − 1)/(t + 1) (Van der Vaart and Wellner (2000, Proposition 3); see also
Pfanzagl (1988) and Van de Geer (1993, 1996)). This inequality shows that Hellinger
consistency of the MLE follows if
P1 = {φ(pF /pF0) : F ∈ FK} (4.9)
is a P0-Glivenko-Cantelli class.
In the remainder, we prove that P1 is indeed a P0-Glivenko-Cantelli class. We start
by showing that P is a P0-Glivenko-Cantelli class. Note that P is a class of functions on the space X = {(t, ek) : t ∈ R, k = 1, . . . , K + 1}. The spaces Xk = {(t, ek) : t ∈ R}, k = 1, . . . , K + 1, form a partition of X . Define Pk = {pF 1Xk : F ∈ FK} for k = 1, . . . , K + 1. Then we can use Theorem 4 of Van der Vaart and Wellner (2000), which states that P is P0-Glivenko-Cantelli if Pk, k = 1, . . . , K + 1, are P0-Glivenko-Cantelli and P has a P0-integrable envelope function. Note that Pk = {Fk : F ∈ FK} for k = 1, . . . , K, and PK+1 = {1 − F+ : F ∈ FK}. Thus, for each k = 1, . . . , K + 1, Pk
consists of monotone functions bounded by one. Hence, they are universal Glivenko-Cantelli classes (Van der Vaart and Wellner (1996, Theorem 2.4.1, page 122, and
Theorem 2.7.5, page 159); see also Birman and Solomjak (1967) and Van de Geer
(1991)). Furthermore, the class P has an integrable envelope function f(t, ek) = 1 for
all k = 1, . . . , K + 1. Hence, it follows that P is P0-Glivenko-Cantelli.
Next, we use Theorem 3 of Van der Vaart and Wellner (2000), which states that
H = ψ(P1, . . . ,Pj) is P0-Glivenko-Cantelli if the following conditions hold: P1, . . . ,Pj are P0-Glivenko-Cantelli, ψ : Rj → R is a continuous function, and H has a P0-integrable envelope. First, we apply this theorem to
P2 = ψ(P, {p−1F0}), (4.10)
where ψ(t, s) = ts. Note that {p−1F0} is a P0-Glivenko-Cantelli class, because it consists of a single integrable function: P0 p−1F0 = ∫ 1 dµ with µ = # × G. Furthermore, note that p−1F0 is an envelope for P2. Hence, P2 has an integrable envelope. Since ψ(t, s) is a continuous function, it follows that P2 is a P0-Glivenko-Cantelli class.
Finally, note that P1 = φ(P2) with φ(t) = (1− t)/(1 + t). Since φ is a continuous
function which is bounded by one, it follows that P1 = φ(P2) is a P0-Glivenko-Cantelli
class. □
We now give several corollaries of Theorems 4.1 and 4.6, which yield consistency
of the estimators in total variation distance and Lr(G) for r > 0.
Corollary 4.7
\[
d_{TV}(p_{k,F_{nk}}, p_{k,F_{0k}}) = 2 \int \left| F_{nk} - F_{0k} \right| dG \to_{a.s.} 0, \qquad k = 1, \ldots, K + 1, \tag{4.11}
\]
\[
d_{TV}(p_{F_n}, p_{F_0}) = \sum_{k=1}^{K+1} \int \left| F_{nk} - F_{0k} \right| dG \to_{a.s.} 0. \tag{4.12}
\]
Proof: This follows directly from the second part of the well-known inequality relating Hellinger distance and total variation distance:
\[
h^2(p_{F_1}, p_{F_2}) \le d_{TV}(p_{F_1}, p_{F_2}) \le \sqrt{2}\, h(p_{F_1}, p_{F_2}). \qquad \Box
\]
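A quick numeric sanity check of these inequalities on random discrete distributions (illustration only; here the total variation distance is taken with the common convention of half the L1 distance between the densities):

```python
import numpy as np

# Check h^2 <= d_TV <= sqrt(2) * h on random pairs of discrete distributions,
# with d_TV taken as half the L1 distance between the densities.
rng = np.random.default_rng(0)
for _ in range(1000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    h2 = 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)  # squared Hellinger distance
    tv = 0.5 * np.sum(np.abs(p - q))                    # total variation distance
    assert h2 <= tv + 1e-12
    assert tv <= np.sqrt(2.0 * h2) + 1e-12
print("h^2 <= d_TV <= sqrt(2) * h holds on all sampled pairs")
```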
Corollary 4.8 For any r ≥ 1, we have
\[
\sum_{k=1}^{K+1} \int \left| F_{nk}(t) - F_{0k}(t) \right|^r dG(t) \to_{a.s.} 0, \tag{4.13}
\]
\[
\sum_{k=1}^{K} \int \left| F_{nk}(t) - F_{0k}(t) \right|^r dG(t) \to_{a.s.} 0. \tag{4.14}
\]
Proof: This follows directly from Corollary 4.7 and the inequality |a − b|^r ≤ |a − b| for a, b ∈ [0, 1] and r ≥ 1. □
4.2 Local and uniform consistency
It is clear from the Lr(G) consistency that the observation time distribution G plays
a key role in obtaining local consistency of the estimators. For example, it follows
immediately that one cannot expect consistency on intervals on which G has no mass.
We make this observation more precise, and give several different conditions under
which we obtain local or uniform consistency, using techniques from Section 3 of
Schick and Yu (2000). We only give proofs for the MLE, since the proofs for the
naive estimator are analogous and follow almost directly from Schick and Yu (2000).
We start with a simple corollary that is analogous to Corollary 2.3 of Schick and Yu
(2000).
Corollary 4.9 For each point a with G(a) > 0 we have:
Fnk(a) →a.s. F0k(a), k = 1, . . . , K + 1,
Fnk(a) →a.s. F0k(a), k = 1, . . . , K.
Proof: Note that
\[
G(a) \sum_{k=1}^K \left| F_{nk}(a) - F_{0k}(a) \right| \le \sum_{k=1}^K \int \left| F_{nk}(t) - F_{0k}(t) \right| dG(t) \to_{a.s.} 0
\]
by (4.14) with r = 1. Hence, if G(a) > 0, it follows that
\[
\sum_{k=1}^K \left| F_{nk}(a) - F_{0k}(a) \right| \to_{a.s.} 0,
\]
or equivalently, |Fnk(a) − F0k(a)| →a.s. 0 for all k = 1, . . . , K. □
We now introduce some terminology used by Schick and Yu (2000).
Definition 4.10 Let a ∈ R. We say that a is a support point of G if G((a − ε, a + ε)) > 0 for every ε > 0. We say that a is regular if G((a − ε, a]) > 0 and G([a, a + ε)) > 0 for every ε > 0. We say that a is strongly regular if G((a − ε, a)) > 0 and G([a, a + ε)) > 0 for every ε > 0. We say that a is a point of increase of a distribution function F if F (a + ε) − F (a − ε) > 0 for every ε > 0.
Lemmas 4.11 and 4.12 state some properties of the continuity points and the points
of increase of F01, . . . , F0K and F0+. These properties follow easily from monotonicity
of F01, . . . , F0K .
Lemma 4.11 F0+ is continuous at a point t0 if and only if all sub-distribution func-
tions F01, . . . , F0K are continuous at t0.
Lemma 4.12 The point t0 is a point of increase of F0+ if and only if t0 is a point of
increase of at least one of the sub-distribution functions F01, . . . , F0K .
We now define the following set:
\[
\Omega_G = \left\{ \omega \in \Omega : \sum_{k=1}^{K} \int \left| F_{nk}(\cdot\,;\omega) - F_{0k} \right| dG + \sum_{k=1}^{K+1} \int \left| F_{nk}(\cdot\,;\omega) - F_{0k} \right| dG \to 0 \right\}.
\]
By Corollary 4.8 with $r = 1$, we have $P(\Omega_G) = 1$. We can now prove several propositions concerning local consistency. The propositions and proofs are analogous to Schick and Yu (2000), with the difference that we have a collection of sub-distribution functions.
Fix an $\omega \in \Omega_G$. Let $\tilde{F}_k$ be a pointwise limit of $\tilde{F}_{nk}(\cdot\,;\omega)$, meaning that $\tilde{F}_{n'k}(t;\omega) \to \tilde{F}_k(t)$ for all $t \in \mathbb{R}$ along some subsequence $n'$. The existence of such a pointwise limit is guaranteed by Helly's selection theorem (see, e.g., Rudin (1976, page 167)). Similarly, let $\hat{F}_k$ be a pointwise limit of $\hat{F}_{nk}(\cdot\,;\omega)$. We assume without loss of generality that $\lim_{n\to\infty} \tilde{F}_{nk}(t;\omega) = \tilde{F}_k(t)$ and $\lim_{n\to\infty} \hat{F}_{nk}(t;\omega) = \hat{F}_k(t)$ for all $t \in \mathbb{R}$. Let
$$B_k = \bigl\{t \in \mathbb{R} : \tilde{F}_k(t) \ne F_{0k}(t) \text{ or } \hat{F}_k(t) \ne F_{0k}(t)\bigr\}, \qquad k = 1, \ldots, K, \tag{4.15}$$
$$B_{K+1} = \bigl\{t \in \mathbb{R} : \hat{F}_{K+1}(t) \ne F_{0,K+1}(t)\bigr\}, \tag{4.16}$$
$$B = \bigcup_{k=1}^{K+1} B_k. \tag{4.17}$$
By Corollary 4.8, we have G(Bk) = 0 for k = 1, . . . , K + 1, and G(B) = 0. We now
give a proposition that is analogous to Proposition 1 of Schick and Yu (2000).
Proposition 4.13 For each $\omega \in \Omega_G$ and each regular continuity point $a$ of $F_{0+}$,
$$\hat{F}_{nk}(a;\omega) \to F_{0k}(a), \qquad k = 1, \ldots, K+1,$$
$$\tilde{F}_{nk}(a;\omega) \to F_{0k}(a), \qquad k = 1, \ldots, K.$$
Proof: We only prove the result for the MLE. Let $\omega \in \Omega_G$. We need to show that $B$ does not contain regular continuity points of $F_{0+}$. Let $t_0$ be a continuity point of $F_{0+}$. Then $t_0$ is a continuity point of $F_{01}, \ldots, F_{0K}$ by Lemma 4.11. If $t_0 \in B$, then there is a $k \in \{1, \ldots, K\}$ such that $\hat{F}_k(t_0) \ne F_{0k}(t_0)$. Continuity of $F_{0k}$ at $t_0$, and monotonicity of $\hat{F}_k$ and $F_{0k}$, imply there exists an $\epsilon > 0$ such that either $(t_0-\epsilon, t_0] \subseteq B$ or $[t_0, t_0+\epsilon) \subseteq B$. Since $G(B) = 0$, this implies that either $G((t_0-\epsilon, t_0]) = 0$ or $G([t_0, t_0+\epsilon)) = 0$. Hence, $t_0$ is not regular. $\Box$
We obtain the following corollary for a fixed $k \in \{1, \ldots, K\}$, by replacing $B$ by $B_k$ in the proof of Proposition 4.13.

Corollary 4.14 Let $k \in \{1, \ldots, K\}$. Then
$$\hat{F}_{nk}(a;\omega) \to F_{0k}(a) \quad \text{and} \quad \tilde{F}_{nk}(a;\omega) \to F_{0k}(a),$$
for all regular continuity points $a$ of $F_{0k}$.
Such corollaries can be derived for many of the results that follow. However, we will
not point this out each time, and focus on the joint consistency results instead. We
now give a proposition that is analogous to Proposition 2 of Schick and Yu (2000).
Proposition 4.15 Suppose every point in an open interval $(a, b)$ is a support point of $G$. Then
$$\hat{F}_{nk}(t;\omega) \to F_{0k}(t), \qquad k = 1, \ldots, K+1,$$
$$\tilde{F}_{nk}(t;\omega) \to F_{0k}(t), \qquad k = 1, \ldots, K,$$
for every continuity point $t$ of $F_{0+}$ in $(a, b)$ and every $\omega \in \Omega_G$. If also $F_{0+}(a) = 0$ and $F_{0+}(b-) = 1$, then
$$\hat{F}_{nk}(t;\omega) \to F_{0k}(t), \qquad k = 1, \ldots, K,$$
for all continuity points $t$ of $F_{0+}$ and all $\omega \in \Omega_G$.
Proof: We only prove the result for the MLE. Let $t_0 \in (a, b)$ be a continuity point of $F_{0+}$. Then $t_0$ is a continuity point of $F_{01}, \ldots, F_{0K}$ by Lemma 4.11. Suppose $t_0 \in B$. This implies that there exists a $k \in \{1, \ldots, K\}$ such that $\hat{F}_k(t_0) \ne F_{0k}(t_0)$. Since $F_{0k}$ is continuous at $t_0$, there exists an $\epsilon > 0$ such that either $(t_0-\epsilon, t_0] \subseteq B$ or $[t_0, t_0+\epsilon) \subseteq B$. Furthermore, since $t_0 \in (a, b)$ and all points in $(a, b)$ are support points of $G$, there exist support points $t_1$ and $t_2$ of $G$ and an $\eta > 0$ such that $(t_1-\eta, t_1+\eta) \subseteq (t_0-\epsilon, t_0]$ and $(t_2-\eta, t_2+\eta) \subseteq [t_0, t_0+\epsilon)$. This leads to the contradiction $G(B) > 0$. Hence, $B$ does not contain continuity points $t \in (a, b)$ of $F_{0+}$. This proves the first part of the proposition.

If $F_{0+}(a) = 0$ and $F_{0+}(b-) = 1$, then $F_{0k}(a) = 0$ and $F_{0k}(b-) = \lim_{t\to\infty} F_{0k}(t)$ for all $k = 1, \ldots, K$. In this case we obtain $\hat{F}_k(t) = F_{0k}(t)$ for all continuity points $t$ of $F_{0+}$ and for all $k = 1, \ldots, K$. This follows from the monotonicity of $\hat{F}_k$ and $F_{0k}$, and the fact that $\hat{F}_+$ and $F_{0+}$ are bounded by zero and one. This second part does not follow automatically for the naive estimator, since $\tilde{F}_{n+}$ is not bounded by one. $\Box$
Next, we give propositions that are analogous to Propositions 3 and 4 of Schick and
Yu (2000).
Proposition 4.16 If every point of increase of $F_{0+}$ is strongly regular, then
$$\hat{F}_{nk}(t;\omega) \to F_{0k}(t), \qquad k = 1, \ldots, K+1,$$
$$\tilde{F}_{nk}(t;\omega) \to F_{0k}(t), \qquad k = 1, \ldots, K,$$
for all continuity points $t$ of $F_{0+}$ and all $\omega \in \Omega_G$.
Proof: We only prove the result for the MLE. Suppose every point of increase of $F_{0+}$ is strongly regular. We show that $B$ does not contain continuity points of $F_{0+}$. First, let $t_0$ be a continuity point of $F_{0+}$. If $t_0$ is a point of increase of $F_{0+}$, then it must be strongly regular, and hence regular. Proposition 4.13 then implies that $t_0$ cannot belong to $B$.

Now let $t_0$ be a continuity point, but not a point of increase, of $F_{0+}$. Then $t_0$ is not a point of increase of any of $F_{01}, \ldots, F_{0K}$ by Lemma 4.12, and it is a continuity point of $F_{01}, \ldots, F_{0K}$ by Lemma 4.11. We now show by contradiction that $t_0$ does not belong to $B$. Thus, suppose $t_0 \in B$. Then there is a $k \in \{1, \ldots, K\}$ such that $\hat{F}_k(t_0) \ne F_{0k}(t_0)$. This means that $\hat{F}_k(t_0) > F_{0k}(t_0)$ or $\hat{F}_k(t_0) < F_{0k}(t_0)$. In either case we can derive the contradiction $G(B) > 0$. Suppose first that $\hat{F}_k(t_0) > F_{0k}(t_0)$. Then $b = \sup\{t : F_{0k}(t) = F_{0k}(t_0)\}$ is a point of increase of $F_{0k}$, $b > t_0$, and $\hat{F}_k(b-) \ge \hat{F}_k(t_0) > F_{0k}(t_0) = F_{0k}(b-)$. Hence, $(t_0, b) \subseteq B$ and, since $b$ is strongly regular by assumption, $G(B) \ge G((t_0, b)) > 0$. Suppose now that $\hat{F}_k(t_0) < F_{0k}(t_0)$. Then $a = \inf\{t \in \mathbb{R} : F_{0k}(t) = F_{0k}(t_0)\}$ is a point of increase of $F_{0k}$, $a < t_0$, and $\hat{F}_k(a) \le \hat{F}_k(t_0) < F_{0k}(t_0) = F_{0k}(a)$. Hence $(a, t_0) \subseteq B$ and, since $a$ is strongly regular by assumption, $G(B) \ge G((a, t_0)) > 0$. This shows that $B$ does not contain continuity points of $F_{0+}$. $\Box$
Proposition 4.17 Suppose $F_{0+}$ is continuous and that, for all $a < b$, $0 < F_{0+}(a) < F_{0+}(b) < 1$ implies that $G((a, b)) > 0$. Then the naive estimator and the MLE are uniformly strongly consistent, i.e.,
$$\sup_{t\in\mathbb{R}} \bigl|\hat{F}_{nk}(t) - F_{0k}(t)\bigr| \to_{a.s.} 0, \qquad k = 1, \ldots, K+1,$$
$$\sup_{t\in\mathbb{R}} \bigl|\tilde{F}_{nk}(t) - F_{0k}(t)\bigr| \to_{a.s.} 0, \qquad k = 1, \ldots, K.$$
Proof: We only prove the result for the MLE, under the assumptions of the proposition. Suppose $B$ contains a point $t_0$. Then there is a $k \in \{1, \ldots, K\}$ such that $\hat{F}_k(t_0) \ne F_{0k}(t_0)$. Since $F_{0+}$ is continuous, all sub-distribution functions $F_{01}, \ldots, F_{0K}$ are continuous (Lemma 4.11). Hence, we can construct an open interval $(a, b) \subseteq B$ that contains a point of increase of $F_{0k}$. Every point of increase of $F_{0k}$ is a point of increase of $F_{0+}$, by Lemma 4.12. Hence, $F_{0+}(b) > F_{0+}(a)$, so that by assumption $G((a, b)) > 0$. This leads to the contradiction $G(B) \ge G((a, b)) > 0$. Hence, $B$ is empty. This implies that $\hat{F}_{nk} \to F_{0k}$ pointwise, and this convergence is uniform since $F_{0k}$ is continuous. $\Box$
Finally, we prove a proposition that is analogous to Proposition 5 of Schick and Yu
(2000).
Proposition 4.18 Suppose the following four conditions hold for real numbers $\tau_1 < \tau_2$:
(a) $F_{0+}$ is continuous at every point in the interval $(\tau_1, \tau_2]$;
(b) either $G(\{\tau_1\}) > 0$ or $F_{0+}(\tau_1) = 0$;
(c) either $G(\{\tau_2\}) > 0$ or $F_{0+}(\tau_2-) = 1$;
(d) for all $a$ and $b$ in $(\tau_1, \tau_2)$, $0 < F_{0+}(a) < F_{0+}(b) < 1$ implies $G((a, b)) > 0$.
Then the MLE is uniformly strongly consistent on $[\tau_1, \tau_2]$:
$$\sup_{t\in[\tau_1,\tau_2]} \bigl|\hat{F}_{nk}(t) - F_{0k}(t)\bigr| \to_{a.s.} 0, \qquad k = 1, \ldots, K.$$
Proof: We only prove the result for the MLE, and for the case that $G(\{\tau_1\}) > 0$ and $F_{0+}(\tau_2-) = 1$. Note that $F_{0+}(\tau_2-) = 1$ implies that $F_{0k}(\tau_2-) = \lim_{t\to\infty} F_{0k}(t)$ for all $k = 1, \ldots, K$. We show that $B' = B \cap [\tau_1, \tau_2] = \emptyset$. This implies that $\hat{F}_{nk}(t) \to F_{0k}(t)$ for all $t \in [\tau_1, \tau_2]$. Since $F_{0+}$ is continuous, all sub-distribution functions $F_{01}, \ldots, F_{0K}$ are continuous (Lemma 4.11), and hence this convergence is uniform on $[\tau_1, \tau_2]$.

It follows from Corollary 4.9 that $\hat{F}_k(\tau_1) = F_{0k}(\tau_1)$ for all $k = 1, \ldots, K$. This gives the desired result if $F_{0+}(\tau_1) = 1$, using the monotonicity of $F_{0k}$ and $\hat{F}_k$ and the fact that $F_{0+}$ and $\hat{F}_+$ are bounded by one. (Note that this does not follow automatically for the naive estimator, since $\tilde{F}_{n+}$ is not bounded by one.) Therefore, assume $F_{0+}(\tau_1) < 1$. We need to show that $B'$ is empty. Suppose $B'$ contains a point $t_0$. This implies there is a $k \in \{1, \ldots, K\}$ such that $\hat{F}_k(t_0) \ne F_{0k}(t_0)$. We can then use the continuity of $F_{0k}$, the monotonicity of $\hat{F}_k$ and $F_{0k}$, and $\hat{F}_k(\tau_1) = F_{0k}(\tau_1) < F_{0k}(\tau_2-) = \lim_{t\to\infty} F_{0k}(t)$ to show that $B'$ contains an interval $(a, b)$, strictly contained in $(\tau_1, \tau_2)$, such that $0 < F_{0k}(a) < F_{0k}(b) < \lim_{t\to\infty} F_{0k}(t)$. This implies $0 < F_{0+}(a) < F_{0+}(b) < 1$, and hence by assumption $G((a, b)) > 0$. This gives the contradiction $G(B') \ge G((a, b)) > 0$. We can conclude that $B'$ is empty. $\Box$
Remark 4.19 The consistency results of this section show that the observation time distribution $G$ plays a key role in the local consistency of the estimators. This observation is important for the design of clinical trials. For example, if we let $G$ have a positive density on an interval $(a, b)$, then the estimators $\tilde{F}_{nk}$ and $\hat{F}_{nk}$ are consistent at all continuity points of $F_{0k}$ in $(a, b)$ by Proposition 4.15. On the other hand, if we choose $G$ to have zero mass on an interval $(a, b)$, then we cannot expect the estimators $\tilde{F}_{nk}$ and $\hat{F}_{nk}$ to be consistent on this interval.
Chapter 5
RATE OF CONVERGENCE
The Hellinger rate of convergence of the naive estimator is $n^{1/3}$. This follows from Van de Geer (1996) or Van der Vaart and Wellner (1996, Theorem 3.4.4, page 327). Under certain regularity conditions, the local rate of convergence of the naive estimator is also $n^{1/3}$. This follows from Groeneboom and Wellner (1992, Lemma 5.4, page 95). Furthermore, this local rate result implies that the distance between two successive jump points of $\tilde{F}_{nk}$ around a point $t_0$ is of order $O_p(n^{-1/3})$.

In this chapter we discuss similar results for the MLE. In Section 5.1 we show that the global rate of convergence is $n^{1/3}$. Subsequently, we prove in Section 5.2 that $n^{1/3}$ is an asymptotic local minimax lower bound for the rate of convergence, meaning that no estimator can converge locally at a rate faster than $n^{1/3}$, in a minimax sense. Hence, the naive estimator converges locally at the optimal rate. Since the MLE is expected to be at least as good as the naive estimator, one may expect that the MLE also converges locally at the optimal rate of $n^{1/3}$. This is indeed the case, and this is proved in Section 5.3. Our main tool for proving this result is Theorem 5.10, which gives a uniform rate of convergence of $\hat{F}_{n+}$ on a fixed neighborhood of a point, rather than on the usual shrinking neighborhood of order $n^{-1/3}$. Technical lemmas and proofs are collected in Section 5.4.
5.1 Hellinger rate of convergence
We prove the global rate of convergence of the MLE using the rate theorem for M-
estimators of Van der Vaart and Wellner (1996, Theorem 3.4.1, page 322). A slightly
simplified version of this theorem can be found in Wellner (2003):
Theorem 5.1 Let $\{\mathbb{M}_n, n \ge 1\}$ be stochastic processes indexed by a set $\Theta$, and let $\mathbb{M} : \Theta \to \mathbb{R}$ be a deterministic function. Furthermore, let
$$\hat{\theta}_n = \operatorname{argmax}_{\theta\in\Theta} \mathbb{M}_n(\theta), \qquad \theta_0 = \operatorname{argmax}_{\theta\in\Theta} \mathbb{M}(\theta),$$
and assume that $\mathbb{M}$ satisfies¹
$$\mathbb{M}(\theta) - \mathbb{M}(\theta_0) \lesssim -d^2(\theta, \theta_0) \tag{5.1}$$
for every $\theta$ in a neighborhood of $\theta_0$. Suppose there exists a $\gamma_0 > 0$ such that for all $n \ge 1$ and $\gamma < \gamma_0$, the centered process $\mathbb{M}_n - \mathbb{M}$ satisfies²
$$E^* \sup_{d(\theta,\theta_0)<\gamma} \bigl|(\mathbb{M}_n - \mathbb{M})(\theta) - (\mathbb{M}_n - \mathbb{M})(\theta_0)\bigr| \lesssim \frac{\phi_n(\gamma)}{\sqrt{n}}, \tag{5.2}$$
where $\phi_n$ are functions such that $\gamma \mapsto \phi_n(\gamma)/\gamma^\alpha$ is decreasing for some $\alpha < 2$ not depending on $n$. Let $r_n$, with $r_n^{-1} < \gamma_0$, satisfy
$$r_n^2\, \phi_n\Bigl(\frac{1}{r_n}\Bigr) \le \sqrt{n}, \quad \text{for every } n.$$
If $\hat{\theta}_n$ satisfies $\mathbb{M}_n(\hat{\theta}_n) \ge \mathbb{M}_n(\theta_0) - O_p(r_n^{-2})$ and converges in (outer) probability to $\theta_0$, then $r_n\, d(\hat{\theta}_n, \theta_0) = O_p(1)$. If the given conditions hold for every $\theta$ and $\gamma$, then the hypothesis that $\hat{\theta}_n$ is consistent is unnecessary.
In order to verify condition (5.2), we will use bracketing numbers. We recall the following definitions, adapted from Van der Vaart and Wellner (1996, pages 83 and 324).

¹The notation $\lesssim$ means "is bounded above up to a universal constant".
²The star indicates an outer integral, see Definition 4.3.
Definition 5.2 Given two functions $l$ and $u$, the bracket $[l, u]$ is the set of all functions $f$ with $l \le f \le u$. An $\epsilon$-bracket w.r.t. a norm $\|\cdot\|$ is a bracket $[l, u]$ with $\|u - l\| < \epsilon$. The bracketing number $N_{[\,]}(\epsilon, \mathcal{F}, \|\cdot\|)$ is the minimum number of $\epsilon$-brackets w.r.t. $\|\cdot\|$ needed to cover $\mathcal{F}$. In this definition, the upper and lower bounds $u$ and $l$ of the brackets need not belong to $\mathcal{F}$ themselves, but they are assumed to have finite norms. The entropy with bracketing, or bracketing entropy, is the logarithm of the bracketing number. Finally, the bracketing integral is defined by
$$J_{[\,]}(\gamma, \mathcal{F}, \|\cdot\|) = \int_0^\gamma \sqrt{1 + \log N_{[\,]}(\epsilon, \mathcal{F}, \|\cdot\|)}\; d\epsilon. \tag{5.3}$$
Recall the definitions of $p_F$ and $h(p_F, p_{F_0})$ in (2.5) and (4.1). Theorem 5.3 gives the Hellinger rate of convergence of the MLE.

Theorem 5.3 $n^{1/3} h(p_{\hat{F}_n}, p_{F_0}) = O_p(1)$.
Proof: We use Theorem 5.1 with $\Theta = \mathcal{F}^K$, $\theta = F = (F_1, \ldots, F_K)$, $d(F, F_0) = h(p_F, p_{F_0})$, $\mathbb{M}_n(F) = \mathbb{P}_n m_{p_F}$, $\mathbb{M}(F) = P_0 m_{p_F}$, $\mathbb{G}_n(F) = \sqrt{n}(\mathbb{M}_n - \mathbb{M})(F)$, and
$$m_{p_F}(t, \delta) = \log\Bigl(\frac{p_F(t,\delta) + p_{F_0}(t,\delta)}{2\, p_{F_0}(t,\delta)}\Bigr).$$
We use Theorem 3.4.4 of Van der Vaart and Wellner (1996, page 327) to verify the conditions of Theorem 5.1. The former theorem directly implies that condition (5.1) of Theorem 5.1 is satisfied. Furthermore, it implies that
$$E^*\|\mathbb{G}_n\|_{\mathcal{M}_\gamma} \lesssim J_{[\,]}(\gamma, \mathcal{P}, h)\Bigl(1 + \frac{J_{[\,]}(\gamma, \mathcal{P}, h)}{\gamma^2\sqrt{n}}\Bigr), \tag{5.4}$$
where $\mathcal{P} = \{p_F : F \in \mathcal{F}^K\}$, $\mathcal{M}_\gamma = \{F \in \mathcal{F}^K : h(p_F, p_{F_0}) < \gamma\}$, and $E^*\|\mathbb{G}_n\|_{\mathcal{M}_\gamma} = E^* \sup_{F\in\mathcal{M}_\gamma} \mathbb{G}_n(F)$. Since $m_{p_{F_0}} \equiv 0$, the key condition (5.2) of Theorem 5.1 can be written as $E^*\|\mathbb{G}_n\|_{\mathcal{M}_\gamma} \lesssim \phi_n(\gamma)$. Thus, a bound on the right side of (5.4) is a candidate for the function $\phi_n(\gamma)$ in (5.2).
In view of (5.3) and (5.4), we need to bound the bracketing entropy $\log N_{[\,]}(\epsilon, \mathcal{P}, h)$. Let $F = (F_1, \ldots, F_K) \in \mathcal{F}^K$, and recall that $F_{K+1} = 1 - F_+$. For each $k = 1, \ldots, K+1$, let $[l_k, u_k]$ be a bracket containing $F_k$, with size $\epsilon/\sqrt{K+1}$ w.r.t. the $L_2(G)$ norm:
$$\bigl\|\sqrt{u_k} - \sqrt{l_k}\bigr\|_{L_2(G)}^2 = \int \bigl(\sqrt{u_k} - \sqrt{l_k}\bigr)^2 dG \le \frac{\epsilon^2}{K+1}. \tag{5.5}$$
Then
$$[p_l(t,\delta),\, p_u(t,\delta)] = \Bigl[\prod_{k=1}^{K+1} l_k(t)^{\delta_k},\; \prod_{k=1}^{K+1} u_k(t)^{\delta_k}\Bigr]$$
is a bracket containing $p_F$, and by assumption (5.5) its size w.r.t. the Hellinger distance is bounded by:
$$h^2(p_l, p_u) = \frac{1}{2}\sum_{k=1}^{K+1} \int \bigl(\sqrt{u_k} - \sqrt{l_k}\bigr)^2 dG \le \epsilon^2.$$
Note that $p_l$ and $p_u$ are typically not in the class $\mathcal{P}$, since we do not require that $l_{K+1} = 1 - l_+$ and $u_{K+1} = 1 - u_+$. However, the upper and lower bounds of the brackets in Definition 5.2 are not required to be in the class $\mathcal{P}$, so this does not pose a problem.
We now count how many brackets $[p_l, p_u]$ we need to cover the class $\mathcal{P}$. First, note that $[\sqrt{l_k}, \sqrt{u_k}]$ contains $\sqrt{F_k}$ for all $k = 1, \ldots, K+1$. Furthermore, note that all $\sqrt{F_k}$, $k = 1, \ldots, K+1$, are contained in the class
$$\mathcal{F} = \{F : \mathbb{R} \to [0,1] \text{ is monotone}\}.$$
It is well-known that
$$\log N_{[\,]}(\delta, \mathcal{F}, L_2(Q)) \lesssim 1/\delta, \tag{5.6}$$
uniformly in $Q$ (see, e.g., Van de Geer (2000, page 18, equation (2.5)) or Van der Vaart and Wellner (1996, Theorem 2.7.5, page 159)). Hence, considering all possible combinations of $(K+1)$-tuples of brackets $[\sqrt{l_k}, \sqrt{u_k}]$ with $\|\sqrt{u_k} - \sqrt{l_k}\|_{L_2(G)} \le \epsilon/\sqrt{K+1}$, it follows that
$$\log N_{[\,]}(\epsilon, \mathcal{P}, h) \le \log\Bigl(N_{[\,]}\bigl(\epsilon/\sqrt{K+1}, \mathcal{F}, L_2(G)\bigr)^{K+1}\Bigr) = (K+1)\log N_{[\,]}\bigl(\epsilon/\sqrt{K+1}, \mathcal{F}, L_2(G)\bigr) \lesssim \frac{(K+1)^{3/2}}{\epsilon}.$$
Dropping the dependence on $K$ (since $K$ is fixed), this implies that $J_{[\,]}(\gamma, \mathcal{P}, h) \lesssim \gamma^{1/2}$, and together with (5.4) we obtain $E^*\|\mathbb{G}_n\|_{\mathcal{M}_\gamma} \lesssim \sqrt{\gamma} + (\gamma\sqrt{n})^{-1}$. Note that $(\sqrt{\gamma} + (\gamma\sqrt{n})^{-1})/\gamma$ is decreasing in $\gamma$. Hence, we can take $\phi_n(\gamma) = \sqrt{\gamma} + (\gamma\sqrt{n})^{-1}$ in Theorem 5.1. We then obtain that $r_n h(p_{\hat{F}_n}, p_{F_0}) = O_p(1)$ provided that $h(p_{\hat{F}_n}, p_{F_0}) \to 0$ in outer probability, and $r_n^2 \phi_n(r_n^{-1}) \le \sqrt{n}$ for all $n$. The first condition is fulfilled by the almost sure Hellinger consistency of the MLE (Theorem 4.6). The second condition holds for $r_n = cn^{1/3}$ and $c = ((\sqrt{5}-1)/2)^{2/3}$. $\Box$
Analogously to the comments directly following Theorem 4.6, Theorem 5.3 implies $n^{1/3} d_{TV}(p_{\hat{F}_n}, p_{F_0}) = O_p(1)$ and $n^{1/3}\|\hat{F}_n - F_0\|_1 = O_p(1)$. Furthermore, we have
$$n^{1/3}\|\hat{F}_n - F_0\|_2 = n^{1/3}\Bigl(\sum_{k=1}^{K} \int \bigl(\hat{F}_{nk} - F_{0k}\bigr)^2 dG\Bigr)^{1/2} = O_p(1), \tag{5.7}$$
since
$$\|F - F_0\|_2^2 = \sum_{k=1}^{K} \int \bigl(\sqrt{F_k} - \sqrt{F_{0k}}\bigr)^2 \bigl(\sqrt{F_k} + \sqrt{F_{0k}}\bigr)^2 dG \le 4\sum_{k=1}^{K} \int \bigl(\sqrt{F_k} - \sqrt{F_{0k}}\bigr)^2 dG \le 8h^2(p_F, p_{F_0}).$$
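This inequality chain rests on two elementary facts: pointwise, $(F_k - F_{0k})^2 = (\sqrt{F_k} - \sqrt{F_{0k}})^2(\sqrt{F_k} + \sqrt{F_{0k}})^2$ with $(\sqrt{F_k} + \sqrt{F_{0k}})^2 \le 4$ for values in $[0,1]$, and $\sum_{k=1}^{K}\int(\sqrt{F_k} - \sqrt{F_{0k}})^2\, dG \le 2h^2(p_F, p_{F_0})$ by the definition of $h$. A numerical sanity check of the pointwise step (illustrative only):

```python
import random

random.seed(0)
for _ in range(10_000):
    a, b = random.random(), random.random()   # values of F_k and F_0k in [0, 1]
    # (a - b)^2 = (sqrt(a) - sqrt(b))^2 (sqrt(a) + sqrt(b))^2
    #          <= 4 (sqrt(a) - sqrt(b))^2, since (sqrt(a) + sqrt(b))^2 <= 4
    assert (a - b) ** 2 <= 4 * (a ** 0.5 - b ** 0.5) ** 2 + 1e-12
```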
5.2 Asymptotic local minimax lower bound

In this section we prove that $n^{1/3}$ is an asymptotic local minimax lower bound for the rate of convergence of $\hat{F}_{nk}$, $k = 1, \ldots, K$. We use the set-up of Groeneboom (1996, Section 4.1, page 132). Let $\mathcal{P}$ be a set of probability densities on a measurable space $(\Omega, \mathcal{A})$ with respect to a $\sigma$-finite dominating measure. We estimate a parameter $\theta = Up \in \mathbb{R}$, where $U$ is a real-valued functional and $p \in \mathcal{P}$. Let $U_n$, $n \ge 1$, be a sequence of estimators based on a sample of size $n$, i.e., $U_n = t_n(Z_1, \ldots, Z_n)$, where $Z_1, \ldots, Z_n$ is a sample from the density $p$, and $t_n : \Omega^n \to \mathbb{R}$ is a Borel measurable function. Let $l : [0,\infty) \to [0,\infty)$ be an increasing convex loss function with $l(0) = 0$. The risk of the estimator $U_n$ in estimating $Up$ is defined by $E_{n,p}\, l(|U_n - Up|)$, where $E_{n,p}$ denotes the expectation with respect to the product measure $P^{\otimes n}$ corresponding to the sample $Z_1, \ldots, Z_n$. We now recall Lemma 4.1 of Groeneboom (1996, page 132).
Lemma 5.4 For any $p_1, p_2 \in \mathcal{P}$ such that the Hellinger distance $h(p_1, p_2) < 1$:
$$\inf_{U_n} \max\bigl\{E_{n,p_1}\, l(|U_n - Up_1|),\; E_{n,p_2}\, l(|U_n - Up_2|)\bigr\} \ge l\Bigl(\tfrac{1}{4}\,|Up_1 - Up_2|\,\bigl(1 - h^2(p_1, p_2)\bigr)^{2n}\Bigr).$$
Let $k \in \{1, \ldots, K\}$. We apply Lemma 5.4 to the estimation of $F_{0k}(t_0)$. Let $U_{nk}$, $n \ge 1$, be a sequence of estimators of $F_{0k}(t_0)$. Furthermore, let $c > 0$ and let $F_n^k = (F_{n1}, \ldots, F_{nK})$ be a perturbation of $F_0$ where only the $k$th component is changed in the following way (see Figure 5.1):
$$F_{nk}(x) = \begin{cases} F_{0k}(t_0 - cn^{-1/3}) & \text{if } x \in [t_0 - cn^{-1/3},\, t_0),\\ F_{0k}(t_0 + cn^{-1/3}) & \text{if } x \in [t_0,\, t_0 + cn^{-1/3}),\\ F_{0k}(x) & \text{otherwise}, \end{cases} \tag{5.8}$$
and $F_{nj}(x) = F_{0j}(x)$ for $j \ne k$. Note that $F_n^k$ is a valid vector of sub-distribution functions with corresponding survival function $F_{n,K+1} = 1 - F_{n+}$.

Figure 5.1: Perturbation used to derive the asymptotic local minimax lower bound.

Proposition 5.5 gives a minimax lower bound for the rate of convergence for estimating $F_{0k}(t_0)$.
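In code, the perturbation (5.8) flattens $F_{0k}$ on the two intervals of length $cn^{-1/3}$ around $t_0$, creating a jump of roughly $2cn^{-1/3}f_{0k}(t_0)$ at $t_0$. A minimal sketch (the helper name and the use of a callable $F_{0k}$ are assumptions for illustration):

```python
def perturbed_F(F0k, t0, c, n):
    """The perturbation F_{nk} of display (5.8); F0k is the true
    sub-distribution function, t0 the point of interest."""
    h = c * n ** (-1 / 3)
    def Fnk(x):
        if t0 - h <= x < t0:
            return F0k(t0 - h)   # flat at the left value on [t0 - h, t0)
        if t0 <= x < t0 + h:
            return F0k(t0 + h)   # flat at the right value on [t0, t0 + h)
        return F0k(x)            # unchanged outside [t0 - h, t0 + h)
    return Fnk

# example with F0k(x) = x, c = 1, n = 1000, so h = 0.1:
F = perturbed_F(lambda x: x, 0.5, 1.0, 1000)
assert abs(F(0.45) - 0.4) < 1e-6   # left plateau
assert abs(F(0.55) - 0.6) < 1e-6   # right plateau; jump of about 2h at t0
assert F(0.3) == 0.3               # untouched away from t0
```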
Proposition 5.5 Fix $k \in \{1, \ldots, K\}$. Let $0 < F_{0k}(t_0) < F_{0k}(\infty)$, and let $F_{0k}$ and $G$ be continuously differentiable at $t_0$ with strictly positive derivatives $f_{0k}(t_0)$ and $g(t_0)$. Then for $r \ge 1$ we have:
$$\liminf_{n\to\infty} n^{r/3} \inf_{U_n} \max\Bigl\{E_{n,p_{F_0}} |U_{nk} - F_{0k}(t_0)|^r,\; E_{n,p_{F_n^k}} |U_{nk} - F_{nk}(t_0)|^r\Bigr\} \ge d^r \Bigl[\frac{g(t_0)}{f_{0k}(t_0)}\Bigl(\frac{1}{F_{0k}(t_0)} + \frac{1}{1 - F_{0+}(t_0)}\Bigr)\Bigr]^{-r/3}, \tag{5.9}$$
where $d = 2^{-5/3} e^{-1/3}$.
Proof: Let $r \ge 1$. We apply Lemma 5.4 with $l(x) = x^r$, $p_1 = p_{F_0}$ and $p_2 = p_{F_n^k}$, where $p_F$ is defined in (2.5). This yields:
$$n^{r/3} \inf_{U_n} \max\Bigl\{E_{n,p_{F_0}} |U_{nk} - F_{0k}(t_0)|^r,\; E_{n,p_{F_n^k}} |U_{nk} - F_{nk}(t_0)|^r\Bigr\} \ge n^{r/3}\Bigl(\frac{1}{4}\,|F_{nk}(t_0) - F_{0k}(t_0)|\,\bigl(1 - h^2(p_{F_n^k}, p_{F_0})\bigr)^{2n}\Bigr)^r. \tag{5.10}$$
We now compute the quantities in this expression. First, continuous differentiability of $F_{0k}$ in a neighborhood around $t_0$ yields
$$n^{1/3}\,|F_{nk}(t_0) - F_{0k}(t_0)| = n^{1/3}\,\bigl|F_{0k}(t_0 + cn^{-1/3}) - F_{0k}(t_0)\bigr| = c f_{0k}(t_0) + o(1). \tag{5.11}$$
Next, we compute the Hellinger distance $h^2(p_{F_0}, p_{F_n^k})$, defined in (4.1). Since $F_{nj} = F_{0j}$ for $j \ne k$, we only need to compute $\int (\sqrt{F_{0j}} - \sqrt{F_{nj}})^2\, dG$ for $j = k$ and $j = K+1$. We first consider $j = k$:
$$\int \bigl(\sqrt{F_{0k}} - \sqrt{F_{nk}}\bigr)^2 dG = \int_{t_0 - cn^{-1/3}}^{t_0} \bigl(\sqrt{F_{0k}} - \sqrt{F_{nk}}\bigr)^2 dG + \int_{t_0}^{t_0 + cn^{-1/3}} \bigl(\sqrt{F_{0k}} - \sqrt{F_{nk}}\bigr)^2 dG. \tag{5.12}$$
Using the definition of $F_{nk}$ in (5.8), the condition $F_{0k}(t_0) > 0$, and the continuous differentiability of $G$ and $F_{0k}$ in a neighborhood around $t_0$, we can write the first term of (5.12) as
$$\int_{t_0 - cn^{-1/3}}^{t_0} \Bigl(\sqrt{F_{0k}(t)} - \sqrt{F_{0k}(t_0 - cn^{-1/3})}\Bigr)^2 dG(t) = \int_{t_0 - cn^{-1/3}}^{t_0} g(t_0)\,\bigl(t - t_0 + cn^{-1/3}\bigr)^2\, \frac{(f_{0k}(t_0))^2}{4F_{0k}(t_0)}\, dt + o(n^{-1}) = \frac{1}{n}\,\frac{(f_{0k}(t_0))^2 g(t_0)\, c^3}{12\, F_{0k}(t_0)} + o(n^{-1}).$$
Using an analogous derivation for the second term of (5.12), we obtain
$$\int \bigl(\sqrt{F_{0k}} - \sqrt{F_{nk}}\bigr)^2 dG = \frac{1}{n}\,\frac{(f_{0k}(t_0))^2 g(t_0)\, c^3}{6\, F_{0k}(t_0)} + o(n^{-1}).$$
Similarly, for $j = K+1$, we get
$$\int \bigl(\sqrt{F_{0,K+1}} - \sqrt{F_{n,K+1}}\bigr)^2 dG = \frac{1}{n}\,\frac{(f_{0k}(t_0))^2 g(t_0)\, c^3}{6\, F_{0,K+1}(t_0)} + o(n^{-1}),$$
so that
$$h^2(p_{F_0}, p_{F_n^k}) = \frac{1}{2n}\cdot\frac{1}{6}\, g(t_0)\, c^3 (f_{0k}(t_0))^2 \Bigl(\frac{1}{F_{0k}(t_0)} + \frac{1}{F_{0,K+1}(t_0)}\Bigr) + o(n^{-1}). \tag{5.13}$$
Plugging the expressions (5.11) and (5.13) into the lower bound (5.10), and using that $\lim_{n\to\infty}(1 + x/n)^n = \exp(x)$, gives the asymptotic lower bound
$$\Bigl[\frac{1}{4}\, c f_{0k}(t_0) \exp\Bigl(-\frac{1}{6}\, g(t_0)\, c^3 (f_{0k}(t_0))^2 \Bigl(\frac{1}{F_{0k}(t_0)} + \frac{1}{1 - F_{0+}(t_0)}\Bigr)\Bigr)\Bigr]^r. \tag{5.14}$$
The maximum of (5.14) over $c$ is attained at
$$c = \Bigl(\frac{1}{2}\, g(t_0)\, (f_{0k}(t_0))^2 \Bigl(\frac{1}{F_{0k}(t_0)} + \frac{1}{1 - F_{0+}(t_0)}\Bigr)\Bigr)^{-1/3}$$
and its value is given in (5.9). $\Box$
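The maximization in the last step is of the form $c \mapsto \frac{1}{4}cf_{0k}(t_0)\exp(-\frac{1}{6}g(t_0)c^3(f_{0k}(t_0))^2 A)$ with $A = 1/F_{0k}(t_0) + 1/(1 - F_{0+}(t_0))$; differentiating gives the stated maximizer, and substituting it back produces the constant $d = 2^{-5/3}e^{-1/3}$ in (5.9). A numerical check with illustrative parameter values (the values are arbitrary assumptions, not from the thesis):

```python
import math

g, f0k, F0k, F0p = 1.3, 0.7, 0.4, 0.6        # g(t0), f0k(t0), F0k(t0), F0+(t0)
A = 1 / F0k + 1 / (1 - F0p)

def bound(c):
    # the expression inside [...]^r in display (5.14), for r = 1
    return 0.25 * c * f0k * math.exp(-g * c ** 3 * f0k ** 2 * A / 6)

c_star = (0.5 * g * f0k ** 2 * A) ** (-1 / 3)        # claimed maximizer
d = 2 ** (-5 / 3) * math.exp(-1 / 3)
claimed_max = d * (g / f0k * A) ** (-1 / 3)          # right side of (5.9), r = 1

assert abs(bound(c_star) - claimed_max) < 1e-12      # maximum equals (5.9)
# c_star beats a grid of competitors
assert all(bound(c_star) + 1e-12 >= bound(0.01 * i) for i in range(1, 1000))
```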
Remark 5.6 Note that the lower bound (5.9) consists of a part depending on the underlying distribution, and a universal constant $d$. It is not clear whether the constant depending on the underlying distribution is sharp, because it has not been proved that any estimator achieves this constant. However, we do know that the naive estimator $\tilde{F}_{nk}$ generally does not achieve this constant. To see this, recall that $\tilde{F}_{nk}$ is the MLE for the reduced data $(T_i, \Delta_{ki})$, $i = 1, \ldots, n$. Hence, its asymptotic risk is bounded below by the asymptotic local minimax lower bound for current status data:
$$d^r \Bigl[\frac{g(t_0)}{f_{0k}(t_0)}\Bigl(\frac{1}{F_{0k}(t_0)} + \frac{1}{1 - F_{0k}(t_0)}\Bigr)\Bigr]^{-r/3} \tag{5.15}$$
(see Groeneboom (1996, page 135, equation (4.2)), or take $K = 1$ in Proposition 5.5). Since $1 - F_{0k}(t_0) > 1 - F_{0+}(t_0)$ if $F_{0j}(t_0) > 0$ for some $j \in \{1, \ldots, K\}$, $j \ne k$, this bound is larger than the lower bound of Proposition 5.5.
We can also apply a generalized version of Lemma 5.4 to the vector of components $(\hat{F}_{n1}, \ldots, \hat{F}_{nK})$. To do this, we use the following set-up. Let $\mathcal{P}$ be a set of probability densities on a measurable space $(\Omega, \mathcal{A})$ with respect to a $\sigma$-finite dominating measure. We estimate a parameter $\theta = Up \in \mathbb{R}^K$, where $U$ is a vector-valued functional and $p \in \mathcal{P}$. Let $U_n$, $n \ge 1$, be a sequence of estimators based on a sample of size $n$. Let $l : [0,\infty) \to [0,\infty)$ be an increasing convex loss function with $l(0) = 0$. The risk of the estimator $U_n$ in estimating $Up$ is defined by $E_{n,p}\, l(\|U_n - Up\|)$, where $\|\cdot\|$ is a norm on $\mathbb{R}^K$. We now state a generalized version of Lemma 5.4, which can be derived by replacing the absolute values $|\cdot|$ in the proof of Lemma 5.4 by norms $\|\cdot\|$.

Lemma 5.7 For any $p_1, p_2 \in \mathcal{P}$ such that the Hellinger distance $h(p_1, p_2) < 1$:
$$\inf_{U_n} \max\bigl\{E_{n,p_1}\, l(\|U_n - Up_1\|),\; E_{n,p_2}\, l(\|U_n - Up_2\|)\bigr\} \ge l\Bigl(\tfrac{1}{4}\,\|Up_1 - Up_2\|\,\bigl(1 - h^2(p_1, p_2)\bigr)^{2n}\Bigr).$$
We apply this lemma to the estimation of $(F_{01}(t_0), \ldots, F_{0K}(t_0))$. Let $U_n$, $n \ge 1$, be a sequence of estimators for $F_0(t_0) = (F_{01}(t_0), \ldots, F_{0K}(t_0))$. For $c > 0$, let $F_n = (F_{n1}, \ldots, F_{nK})$ be a perturbation of $F_0$, where each component is changed in the following way (see Figure 5.1):
$$F_{nk}(x) = \begin{cases} F_{0k}(t_0 - cn^{-1/3}) & \text{if } x \in [t_0 - cn^{-1/3},\, t_0),\\ F_{0k}(t_0 + cn^{-1/3}) & \text{if } x \in [t_0,\, t_0 + cn^{-1/3}),\\ F_{0k}(x) & \text{otherwise}. \end{cases}$$
Note that $(F_{n1}(x), \ldots, F_{nK}(x))$ is a valid vector of sub-distribution functions with corresponding survival function $F_{n,K+1}(x) = 1 - F_{n+}(x)$.
Proposition 5.8 For each $k \in \{1, \ldots, K\}$, let $0 < F_{0k}(t_0) < F_{0k}(\infty)$, and let $F_{0k}$ and $G$ be continuously differentiable at $t_0$ with positive derivatives. Then, for $r \ge 1$ and any norm $\|\cdot\|$ on $\mathbb{R}^K$, we have:
$$\liminf_{n\to\infty} n^{r/3} \inf_{U_n} \max\Bigl\{E_{n,p_{F_0}} \|U_n - F_0(t_0)\|^r,\; E_{n,p_{F_n}} \|U_n - F_n(t_0)\|^r\Bigr\} \ge d^r \Biggl[\|f_0(t_0)\| \Bigl(g(t_0) \sum_{k=1}^{K+1} \frac{(f_{0k}(t_0))^2}{F_{0k}(t_0)}\Bigr)^{-1/3}\Biggr]^r, \tag{5.16}$$
where $d = 2^{-5/3} e^{-1/3}$.
Proof: Let $r \ge 1$. We apply Lemma 5.7 with $l(x) = x^r$, $p_1 = p_{F_0}$ and $p_2 = p_{F_n}$, where $p_F$ is defined in (2.5). This yields:
$$n^{r/3} \inf_{U_n} \max\Bigl\{E_{n,p_{F_0}} \|U_n - F_0(t_0)\|^r,\; E_{n,p_{F_n}} \|U_n - F_n(t_0)\|^r\Bigr\} \ge n^{r/3}\Bigl(\frac{1}{4}\,\|F_n(t_0) - F_0(t_0)\|\,\bigl(1 - h^2(p_{F_n}, p_{F_0})\bigr)^{2n}\Bigr)^r. \tag{5.17}$$
We now compute the quantities in the expression on the right side. Analogously to the proof of Proposition 5.5, we get that
$$\int \bigl(\sqrt{F_{0k}} - \sqrt{F_{nk}}\bigr)^2 dG = \frac{1}{n}\,\frac{(f_{0k}(t_0))^2 g(t_0)\, c^3}{6\, F_{0k}(t_0)} + o(n^{-1})$$
for $k = 1, \ldots, K$. Furthermore, using $F_{0+}(t_0) < 1$, we get
$$\int \bigl(\sqrt{F_{0,K+1}} - \sqrt{F_{n,K+1}}\bigr)^2 dG = \frac{1}{n}\,\frac{\bigl(\sum_{k=1}^K f_{0k}(t_0)\bigr)^2 g(t_0)\, c^3}{6\, F_{0,K+1}(t_0)} + o(n^{-1}) = \frac{1}{n}\,\frac{(f_{0,K+1}(t_0))^2 g(t_0)\, c^3}{6\, F_{0,K+1}(t_0)} + o(n^{-1}).$$
Hence,
$$h^2(p_{F_0}, p_{F_n}) = \frac{1}{2}\sum_{k=1}^{K+1} \int \bigl(\sqrt{F_{0k}} - \sqrt{F_{nk}}\bigr)^2 dG = \frac{1}{2n}\cdot\frac{1}{6}\, g(t_0)\, c^3 \sum_{k=1}^{K+1} \frac{(f_{0k}(t_0))^2}{F_{0k}(t_0)} + o(n^{-1}).$$
Furthermore, the continuous differentiability of $F_0$ in a neighborhood around $t_0$ yields that
$$n^{1/3}\,\|F_n(t_0) - F_0(t_0)\| = n^{1/3}\,\|F_0(t_0 + cn^{-1/3}) - F_0(t_0)\| = c\,\|f_0(t_0)\| + o(1).$$
Analogously to the proof of Proposition 5.5, plugging these expressions into the lower bound (5.17) and using that $\lim_{n\to\infty}(1 + x/n)^n = \exp(x)$ yields the following asymptotic lower bound:
$$\Bigl[\frac{1}{4}\, c\,\|f_0(t_0)\| \exp\Bigl(-\frac{1}{6}\, g(t_0)\, c^3 \sum_{k=1}^{K+1} \frac{(f_{0k}(t_0))^2}{F_{0k}(t_0)}\Bigr)\Bigr]^r. \tag{5.18}$$
The maximum of (5.18) over $c$ is attained at
$$c = \Bigl(\frac{1}{2}\, g(t_0) \sum_{k=1}^{K+1} \frac{(f_{0k}(t_0))^2}{F_{0k}(t_0)}\Bigr)^{-1/3}$$
and its value is given in (5.16). $\Box$
5.3 Local rate of convergence

As mentioned in the introduction of this chapter, the $n^{1/3}$ local rate of convergence of the naive estimator and the $n^{1/3}$ local minimax lower bound for the rate of convergence suggest that the MLE converges locally at rate $n^{1/3}$. This is indeed the case, and we now give the proof of this result. Although this result is intuitively clear, the proof is rather involved and requires new methods. The main difficulties are that the MLE has no closed form, and that we have to handle the system of sub-distribution functions.

There are currently no general methods available to prove the local rate of convergence of the maximum likelihood estimator in similar estimation problems. This is in contrast to the global rate of convergence, for which there are fairly standard methods from empirical process theory. Thus, the local rate of convergence is still proved on a case-by-case basis. The common theme in existing proofs is to rely heavily on the
characterization of the MLE in terms of Fenchel conditions (see, e.g., Groeneboom
and Wellner (1992) for case 2 interval censored data, and Groeneboom, Jongbloed
and Wellner (2001b) for convex density estimation). This is done because the MLE
has no closed form in these problems, so that the characterization is all one has to
work with. We will use this approach as well.
The outline of this section is as follows. In Section 5.3.1 we revisit the Fenchel conditions. These conditions will show that the term $\hat{F}_{n+}$ plays an important role. Therefore, in Section 5.3.2 we first prove a rate result for $\hat{F}_{n+}$ (Theorem 5.10). This rate result is stronger than the usual local rate result, because it holds uniformly on a fixed neighborhood of a point $t_0$, instead of on the usual shrinking neighborhood of order $n^{-1/3}$. In Remark 5.11, we discuss the meaning of Theorem 5.10 by comparing it to several existing results for current status data without competing risks. Subsequently, we give the proof of Theorem 5.10. Finally, in Section 5.3.3 we use Theorem 5.10 to prove the local rate of convergence for the components $\hat{F}_{n1}, \ldots, \hat{F}_{nK}$ in Theorem 5.20. Technical lemmas and proofs are deferred to Section 5.4. Throughout, we assume that for each $k \in \{1, \ldots, K\}$, $\hat{F}_{nk}$ is piecewise constant and right-continuous, with jumps only at points in $\mathcal{T}_k$ (see Definition 2.22).
5.3.1 Revisiting the Fenchel conditions

Assume without loss of generality that $\hat{F}_{n+}(\infty) = 1$ and recall the definition of $G_{n,\hat{F}_n}$ in (2.47). Let $\tau_{nk}$ be a jump point of $\hat{F}_{nk}$, and let $\tau_{nk} < s$. Then Proposition 2.36 implies that
$$\int_{[\tau_{nk},s)} \delta_k\, d\mathbb{P}_n(u, \delta) - \int_{[\tau_{nk},s)} \hat{F}_{nk}(u)\, dG_{n,\hat{F}_n}(u) \ge 0. \tag{5.19}$$
To see this, note that equality must hold in (2.48) at $t = \tau_{nk}$ and that inequality must hold at $t = s$. Subtracting these two relations yields (5.19).
For $s < T_{(n)}$, we can rewrite (5.19) as follows:
$$0 \le \int_{[\tau_{nk},s)} \delta_k\, d\mathbb{P}_n(u,\delta) - \int_{[\tau_{nk},s)} \hat{F}_{nk}(u)\, dG_{n,\hat{F}_n}(u)$$
$$= \int_{[\tau_{nk},s)} \Bigl\{\delta_k - \hat{F}_{nk}(u) + \hat{F}_{nk}(u)\Bigl(1 - \frac{1 - \delta_+}{1 - \hat{F}_{n+}(u)}\Bigr)\Bigr\}\, d\mathbb{P}_n(u,\delta)$$
$$= \int_{[\tau_{nk},s)} \Bigl\{\delta_k - \hat{F}_{nk}(u) + \hat{F}_{nk}(u)\,\frac{\delta_+ - \hat{F}_{n+}(u)}{1 - \hat{F}_{n+}(u)}\Bigr\}\, d\mathbb{P}_n(u,\delta)$$
$$= \int_{[\tau_{nk},s)} \Bigl\{\delta_k - \hat{F}_{nk}(u) + (\delta_+ - \hat{F}_{n+}(u))\,\frac{F_{0k}(s)}{1 - F_{0+}(s)} + R_{ks\hat{F}_n}(u,\delta)\Bigr\}\, d\mathbb{P}_n(u,\delta), \tag{5.20}$$
where
$$R_{ks\hat{F}_n}(u,\delta) = (\delta_+ - \hat{F}_{n+}(u))\Bigl(\frac{\hat{F}_{nk}(u)}{1 - \hat{F}_{n+}(u)} - \frac{F_{0k}(s)}{1 - F_{0+}(s)}\Bigr) = (\delta_+ - \hat{F}_{n+}(u))\,\frac{\hat{F}_{nk}(u)(1 - F_{0+}(s)) - F_{0k}(s)(1 - \hat{F}_{n+}(u))}{(1 - \hat{F}_{n+}(u))(1 - F_{0+}(s))}. \tag{5.21}$$
The term $R_{ks\hat{F}_n}$ arises since we replace $\hat{F}_{nk}(u)/(1 - \hat{F}_{n+}(u))$ by the constant and deterministic factor $F_{0k}(s)/(1 - F_{0+}(s))$. Lemma 5.9 provides a bound on
$$\Bigl|\int_{[w,s)} R_{ks\hat{F}_n}(u,\delta)\, d\mathbb{P}_n(u,\delta)\Bigr|$$
for $w < s$ in a neighborhood of $t_0$. Note that the given bound grows with the length of the integration interval. However, this growth is dominated by terms with quadratic growth that we will encounter later. Hence, $R_{ks\hat{F}_n}$ can be viewed as a remainder term. The proof of Lemma 5.9 is given in Section 5.4.
Lemma 5.9 Let the conditions of Theorem 5.10 be satisfied. Then there exists an $r > 0$ such that, uniformly in $t_0 - 2r < w < s < t_0 + 2r$, and for $k = 1, \ldots, K$,
$$\Bigl|\int_{[w,s)} R_{ks\hat{F}_n}(u,\delta)\, d\mathbb{P}_n\Bigr| = O_p\bigl(n^{-2/3} + n^{-1/3}(s - w)^{3/2}\bigr).$$
Given that $R_{ks\hat{F}_n}$ can be viewed as a remainder term, and that $F_{0k}(s)/(1 - F_{0+}(s))$ is a constant factor, the Fenchel conditions (5.20) contain two important parts:
$$\int_{[\tau_{nk},s)} \bigl\{\delta_k - \hat{F}_{nk}(u)\bigr\}\, d\mathbb{P}_n(u,\delta) \quad \text{and} \quad \int_{[\tau_{nk},s)} \bigl\{\delta_+ - \hat{F}_{n+}(u)\bigr\}\, d\mathbb{P}_n(u,\delta). \tag{5.22}$$
The first term is equivalent to the Fenchel conditions for the naive estimator, and can be handled without much difficulty. In order to control the second term, we need the rate result for $\hat{F}_{n+}$ that is given in the next section.
5.3.2 Uniform rate of convergence for $\hat{F}_{n+}$ on a fixed neighborhood of $t_0$

The important rate result for $\hat{F}_{n+}$ is given in Theorem 5.10. The main virtue of this theorem is that it holds uniformly on a fixed neighborhood $[t_0 - r, t_0 + r]$ of $t_0$, rather than on a shrinking neighborhood of the form $[t_0 - Mn^{-1/3}, t_0 + Mn^{-1/3}]$. Such a fixed neighborhood is needed, because we will use Theorem 5.10 to derive a bound on the second term in (5.22). The usual result on a shrinking neighborhood is not enough for this purpose, because in the proof of the local rate of the components (Theorem 5.20), we cannot assume that the length of the interval $[\tau_{nk}, s)$ is of order $O_p(n^{-1/3})$.

Theorem 5.10 For all $k = 1, \ldots, K$, let $0 < F_{0k}(t_0) < F_{0k}(\infty)$, and let $F_{0k}$ and $G$ be continuously differentiable at $t_0$ with strictly positive derivatives $f_{0k}(t_0)$ and $g(t_0)$. For $\beta \in (0, 1)$ we define
$$v_n(t) = \begin{cases} n^{-1/3} & \text{if } |t| \le n^{-1/3},\\ n^{-(1-\beta)/3}\,|t|^\beta & \text{if } |t| > n^{-1/3}. \end{cases} \tag{5.23}$$
Then there exists a constant $r > 0$ so that
$$\sup_{t\in[t_0-r,\,t_0+r]} \frac{\bigl|\hat{F}_{n+}(t) - F_{0+}(t)\bigr|}{v_n(t - t_0)} = O_p(1). \tag{5.24}$$
Figure 5.2: Plot of $v_n(t)$ for various values of $\beta$. The dotted lines are $y = x$ and $y = n^{-1/3}$. Note that $\beta$ close to zero gives the sharpest bound.
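A direct implementation of (5.23) makes the shape of $v_n$ easy to inspect (sketch, not code from the thesis):

```python
def v_n(t, n, beta):
    """The comparison function of display (5.23), for beta in (0, 1)."""
    if abs(t) <= n ** (-1 / 3):
        return n ** (-1 / 3)
    return n ** (-(1 - beta) / 3) * abs(t) ** beta

n = 1000
# flat at n^{-1/3} on the central interval, and continuous at |t| = n^{-1/3}
assert v_n(0.0, n, 0.5) == n ** (-1 / 3)
assert abs(v_n(n ** (-1 / 3), n, 0.5) - n ** (-1 / 3)) < 1e-12
# for |t| > n^{-1/3} we have v_n(t) = n^{-1/3} (|t| n^{1/3})^beta, increasing
# in beta, so beta close to zero gives the sharpest (smallest) envelope:
assert v_n(0.5, n, 0.05) < v_n(0.5, n, 0.95)
```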
Before giving the proof of this theorem, we discuss its meaning by comparing it to
several known results for current status data without competing risks.
Remark 5.11 By taking $K = 1$ in Theorem 5.10, it follows that the theorem holds for the MLE $\hat{F}_n$ for current status data without competing risks. Thus, to clarify the meaning of Theorem 5.10, we can compare it to known results for $\hat{F}_n$. First, we consider the local rate of convergence given in Groeneboom and Wellner (1992, Lemma 5.4, page 95). For $M > 0$, they prove that
$$\sup_{t\in[-M,M]} \bigl|\hat{F}_n(t_0 + n^{-1/3}t) - F_0(t_0)\bigr| = O_p(n^{-1/3}). \tag{5.25}$$
Applying Theorem 5.10 to $t \in [t_0 - Mn^{-1/3}, t_0 + Mn^{-1/3}]$ yields
$$\sup_{t\in[t_0-Mn^{-1/3},\,t_0+Mn^{-1/3}]} \frac{\bigl|\hat{F}_{n+}(t) - F_{0+}(t)\bigr|}{v_n(t - t_0)} = O_p(1).$$
Combining this with the continuous differentiability of $F_{0+}$ at $t_0$, and with the fact that
$$v_n(t - t_0) \le v_n(Mn^{-1/3}) = M^\beta n^{-1/3}, \qquad \text{for } M \ge 1,$$
yields the bound in (5.25). Hence, Theorem 5.10 is stronger than (5.25) for $M \ge 1$.
Next, we consider the global bound of Groeneboom and Wellner (1992, Lemma 5.9):
$$\sup_{t\in\mathbb{R}} \bigl|\hat{F}_n(t) - F_0(t)\bigr| = O_p(n^{-1/3}\log n). \tag{5.26}$$
The result in Theorem 5.10 is fundamentally different from (5.26), because it is stronger in some ranges, but weaker in others. For example, for $|t - t_0| = n^{-1/3}\log n$, Theorem 5.10 is stronger, since
$$v_n(t - t_0) = n^{-(1-\beta)/3}\,|t - t_0|^\beta = n^{-1/3}(\log n)^\beta < n^{-1/3}\log n, \qquad \text{for } n \ge 3,$$
for all $\beta \in (0, 1)$. Similarly, for $|t - t_0| = n^{-1/3}\log\log n$ we have $v_n(t - t_0) < n^{-1/3}\log\log n$. On the other hand, for $|t - t_0| = n^{-1/3+\gamma}$ for some $\gamma > 0$, Theorem 5.10 is weaker, because $v_n(t - t_0) = n^{-1/3+\gamma\beta}$ and $n^{-1/3}\log n = o(n^{-1/3+\gamma\beta})$, for any $\beta \in (0, 1)$ and $\gamma > 0$.
Remark 5.12 Note that Theorem 5.10 gives a family of bounds in $\beta$. Choosing $\beta$ close to zero gives the tightest bound, as illustrated in Figure 5.2. For the proof of the local rate of convergence of $\hat{F}_{nk}$, $k = 1, \ldots, K$, it is sufficient that Theorem 5.10 holds for one arbitrary value of $\beta \in (0, 1)$. Stating the theorem for one fixed $\beta$ leads to a somewhat simpler proof. However, for completeness we present the result for all $\beta \in (0, 1)$.
We now provide several lemmas that are needed in the proof of Theorem 5.10. First, Lemma 5.13 shows that we can replace $\int_{[t,s)} \{F(s) - F(u)\}\, d\mathbb{G}_n(u)$ by $\int_{[t,s)} \{F(s) - F(u)\}\, dG(u)$, at the cost of a term of order $O_p(n^{-1/2}(s - t))$.

Lemma 5.13 Let $F : \mathbb{R} \to \mathbb{R}$ be continuously differentiable at $t_0$ with strictly positive derivative $f(t_0)$. Then there exists an $r > 0$ such that, uniformly in $t_0 - 2r \le t \le s \le t_0 + 2r$:
$$\Bigl|\int_{[t,s)} \{F(s) - F(u)\}\, d(\mathbb{G}_n - G)(u)\Bigr| = O_p\bigl(n^{-1/2}(s - t)\bigr).$$
Proof: Integration by parts yields
\[
n^{1/2} \int_{[t,s)} \{F(s) - F(u)\}\, d(G_n - G)(u)
= -n^{1/2} \{F(s) - F(t)\} \{G_n(t) - G(t)\} + n^{1/2} \int_{[t,s)} \{G_n(u) - G(u)\}\, dF(u).
\]
Note that $n^{1/2} \sup_{u \in \mathbb{R}} |G_n(u) - G(u)|$ is tight, since it converges in distribution to $\sup_{u \in \mathbb{R}} |B(G(u))| \le \sup_{x \in [0,1]} |B(x)|$, where $B$ is a standard Brownian bridge on $[0,1]$. Hence, both terms on the right side of the display are of order $O_p(1)\{F(s) - F(t)\} = O_p(1)(s - t)$. $\Box$
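The tightness of $n^{1/2} \sup_u |G_n(u) - G(u)|$ used above can be illustrated by simulation. The sketch below (illustrative only) takes $G$ to be the Uniform$(0,1)$ distribution, so the statistic is the classical Kolmogorov statistic, and checks that its median is stable across sample sizes:

```python
import numpy as np

# Monte Carlo illustration that n^{1/2} sup_u |G_n(u) - G(u)| stays bounded
# in n. With G = Uniform(0,1) this is the Kolmogorov statistic, which
# converges in distribution to sup |B(x)| for a Brownian bridge B.
rng = np.random.default_rng(0)

def scaled_ks(n):
    u = np.sort(rng.uniform(size=n))
    grid = np.arange(1, n + 1) / n
    # sup |G_n - G|, evaluated at the jump points of the empirical df
    d = np.maximum(np.abs(grid - u), np.abs(grid - 1.0 / n - u)).max()
    return np.sqrt(n) * d

stats = [np.median([scaled_ks(n) for _ in range(200)]) for n in (100, 400, 1600)]
# The medians are stable in n (tightness); the limiting median is about 0.83.
assert all(0.5 < s < 1.5 for s in stats)
```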
Next, Lemma 5.14 shows that $\int_{[w,s)} \{F_{0k}(s) - F_{0k}(u)\}\, dG_n(u)$ has a quadratic drift. This result follows by replacing $G_n$ by $G$ using Lemma 5.13, and then using the continuous differentiability of $F_{0k}$. This quadratic drift plays an important role in the proof of the local rate result, because it dominates all other terms.
Lemma 5.14 Let the conditions of Theorem 5.10 be satisfied. Then there exists an $r > 0$ such that for all $k = 1, \dots, K$,
\[
P\Bigl( \int_{[w,s)} \{F_{0k}(s) - F_{0k}(u)\}\, dG_n(u) \ge g(t_0) f_{0k}(t_0)(s - w)^2/8 \ \text{ for all } w, s \in [t_0 - 2r, t_0 + 2r] \text{ such that } s - w > n^{-1/3} \Bigr) \to 1, \quad n \to \infty.
\]
Proof: Let $k \in \{1, \dots, K\}$. Note that
\[
\begin{aligned}
\int_{[w,s)} \{F_{0k}(s) - F_{0k}(u)\}\, dG_n(u)
&= \int_{[w,s)} \{F_{0k}(s) - F_{0k}(u)\}\, d(G_n - G)(u) + \int_{[w,s)} \{F_{0k}(s) - F_{0k}(u)\}\, dG(u) \\
&\ge \int_{[w,s)} \{F_{0k}(s) - F_{0k}(u)\}\, dG(u) - \Bigl| \int_{[w,s)} \{F_{0k}(s) - F_{0k}(u)\}\, d(G_n - G)(u) \Bigr|.
\end{aligned} \tag{5.27}
\]
We write (5.27) as I $-$ II. Note that I $\ge f_{0k}(t_0) g(t_0)(s - w)^2/4$ for $r$ small enough, by the assumption that $F_{0k}$ and $G$ are continuously differentiable with positive derivatives. Furthermore, II is of order $n^{-1/2}(s - w) O_p(1)$ by Lemma 5.13. Since $s - w > n^{-1/3}$, this is in turn bounded above by $f_{0k}(t_0) g(t_0)(s - w)^2/8$ with probability arbitrarily close to one for $n$ sufficiently large. Plugging these results into (5.27) completes the proof. $\Box$
We are now ready to give the proof of Theorem 5.10.
Proof of Theorem 5.10: Let $\beta \in (0,1)$ and $\epsilon > 0$. It is sufficient to show that we can choose $n_1$, $M$ and $r$ such that for all $n > n_1$
\[
P\bigl\{ \exists t \in [t_0 - r, t_0 + r] : \hat F_{n+}(t) \notin \bigl( F_{0+}(t - M v_n(t - t_0)),\ F_{0+}(t + M v_n(t - t_0)) \bigr) \bigr\} < \epsilon, \tag{5.28}
\]
since, for $r$ small enough, the continuous differentiability of $F_{0+}$ gives
\[
\begin{aligned}
F_{0+}(t + M v_n(t - t_0)) &\le F_{0+}(t) + 2 M v_n(t - t_0) f_{0+}(t_0), \quad t \in [t_0 - r, t_0 + r], \\
F_{0+}(t - M v_n(t - t_0)) &\ge F_{0+}(t) - 2 M v_n(t - t_0) f_{0+}(t_0), \quad t \in [t_0 - r, t_0 + r],
\end{aligned}
\]
and combining this with (5.28) proves (5.24):
\[
P\bigl\{ \exists t \in [t_0 - r, t_0 + r] : |\hat F_{n+}(t) - F_{0+}(t)| \ge 2 M v_n(t - t_0) f_{0+}(t_0) \bigr\} < \epsilon, \quad n > n_1.
\]
Thus, in the remainder we prove (5.28). In fact, we only prove that there exist $n_1$, $M$ and $r$ such that for all $n > n_1$
\[
P\bigl\{ \exists t \in [t_0, t_0 + r] : \hat F_{n+}(t) \ge F_{0+}(t + M v_n(t - t_0)) \bigr\} < \frac{\epsilon}{4}, \tag{5.29}
\]
since the proofs for $\hat F_{n+}(t) \le F_{0+}(t - M v_n(t - t_0))$ and the interval $[t_0 - r, t_0]$ are analogous. To prove this, we make use of the fact that we can choose $n_1$, $r$ and $C > 0$, such that for all $n > n_1$ the following event holds with high probability:
\[
E_{nrC} = E^{(1)}_{nr} \cap E^{(2)}_{nr} \cap E^{(3)}_{nrC}, \tag{5.30}
\]
where
\[
\begin{aligned}
E^{(1)}_{nr} &= \cap_{k=1}^{K} \bigl\{ \hat F_{nk} \text{ has a jump in } [t_0 - 2r, t_0 - r] \bigr\}, \\
E^{(2)}_{nr} &= \cap_{k=1}^{K} \Bigl\{ \int_{[w,s)} (F_{0k}(s) - F_{0k}(u))\, dG_n(u) \ge g(t_0) f_{0k}(t_0)(s - w)^2/8 \\
&\hspace{5em} \text{for all } w, s \in [t_0 - 2r, t_0 + 2r] \text{ with } s - w > n^{-1/3} \Bigr\}, \\
E^{(3)}_{nrC} &= \cap_{k=1}^{K} \Bigl\{ \Bigl| \int_{[w,s)} R_{ks\hat F_n}(u, \delta)\, dP_n(u, \delta) \Bigr| \le \bigl( n^{-2/3} + n^{-1/3}(s - w)^{3/2} \bigr) C \\
&\hspace{5em} \text{for all } w, s \in [t_0 - 2r, t_0 + 2r] \Bigr\}.
\end{aligned}
\]
To see that event $E^{(1)}_{nr}$ holds with high probability, let $k \in \{1, \dots, K\}$, and note that by Proposition 4.15 and the continuity of $F_{0k}$ in a neighborhood of $t_0$, it follows that $\hat F_{nk}$ is almost surely uniformly consistent on $[t_0 - 2r, t_0 - r]$ for $r$ small enough. Together with the fact that $F_{0k}$ is strictly increasing in a neighborhood of $t_0$, this implies that for $n$ large $\hat F_{nk}$ must have a jump on $[t_0 - 2r, t_0 - r]$. Lemmas 5.14 and 5.9 imply that events $E^{(2)}_{nr}$ and $E^{(3)}_{nrC}$, respectively, hold with high probability.

Hence, we can choose $n_1$, $r$ and $C$ such that $P(E^c_{nrC}) < \epsilon/8$ for all $n > n_1$. By writing
\[
\begin{aligned}
&P\bigl\{ \exists t \in [t_0, t_0 + r] : \hat F_{n+}(t) \ge F_{0+}(t + M v_n(t - t_0)) \bigr\} \\
&\quad \le P(E^c_{nrC}) + P\bigl( \bigl\{ \exists t \in [t_0, t_0 + r] : \hat F_{n+}(t) \ge F_{0+}(t + M v_n(t - t_0)) \bigr\} \cap E_{nrC} \bigr),
\end{aligned} \tag{5.31}
\]
it follows that we can complete the proof by showing that we can choose $n_1$, $M$ and $r$ such that the second term of (5.31) is bounded by $\epsilon/8$ for all $n > n_1$.
In order to prove this, we put a grid on the interval $[t_0, t_0 + r]$, analogously to Kim and Pollard (1990, Lemma 4.1). The grid points $t_{nj}$ and grid cells $I_{nj}$ are denoted by
\[
t_{nj} = t_0 + j n^{-1/3} \quad \text{and} \quad I_{nj} = [t_{nj}, t_{n,j+1}), \tag{5.32}
\]
where $j = 0, \dots, J_n = \lceil r n^{1/3} \rceil$. Then it is sufficient to show that we can choose $n_1$, $M$ and $r$ such that for all $n > n_1$ and $j = 0, \dots, J_n$,
\[
P\bigl( \bigl\{ \exists t \in I_{nj} : \hat F_{n+}(t) \ge F_{0+}(t + M v_n(t - t_0)) \bigr\} \cap E_{nrC} \bigr) \le p_{jM}, \tag{5.33}
\]
where $p_{jM}$ is defined by
\[
p_{jM} = \begin{cases} d_1 \exp(-d_2 M^{3/2}) & \text{if } j = 0, \\ d_1 \exp(-d_2 (M j^{\beta})^{3/2}) & \text{if } j = 1, \dots, J_n, \end{cases} \tag{5.34}
\]
for some positive constants $d_1$ and $d_2$. To see that this is sufficient, note that (5.33) implies that
\[
\begin{aligned}
&P\bigl( \bigl\{ \exists t \in [t_0, t_0 + r] : \hat F_{n+}(t) \ge F_{0+}(t + M v_n(t - t_0)) \bigr\} \cap E_{nrC} \bigr) \\
&\quad \le \sum_{j=0}^{\infty} p_{jM} = d_1 \exp(-d_2 M^{3/2}) + \sum_{j=1}^{\infty} d_1 \exp\bigl(-d_2 (M j^{\beta})^{3/2}\bigr),
\end{aligned}
\]
and for any $\beta \in (0,1)$ this is a convergent sum that can be made arbitrarily small by choosing $M$ large.
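The convergence of this series, and its decay in $M$, is easy to check numerically; in the sketch below the constants $d_1 = d_2 = 1$ and the truncation point are arbitrary illustrative choices:

```python
import math

# Truncated evaluation of d1*exp(-d2*M^{3/2}) + sum_j d1*exp(-d2*(M*j^beta)^{3/2});
# d1 = d2 = 1 and the 10000-term truncation are illustrative (the tail decays
# faster than geometrically, so the truncation error is negligible).
def series_bound(M, beta, d1=1.0, d2=1.0, terms=10000):
    total = d1 * math.exp(-d2 * M ** 1.5)
    total += sum(d1 * math.exp(-d2 * (M * j ** beta) ** 1.5)
                 for j in range(1, terms))
    return total

for beta in (0.25, 0.5, 0.75):
    b5, b10 = series_bound(5.0, beta), series_bound(10.0, beta)
    assert b10 < b5       # the bound decreases in M
    assert b10 < 1e-6     # and can be made arbitrarily small
```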
Thus, we are left with proving (5.33). Using the monotonicity of $\hat F_{n+}$, it is in turn sufficient to prove that for all $n > n_1$ and $j = 0, \dots, J_n$,
\[
P\bigl( \bigl\{ \hat F_{n+}(t_{n,j+1}) \ge F_{0+}(s_{njM}) \bigr\} \cap E_{nrC} \bigr) = P(A_{njM} \cap E_{nrC}) \le p_{jM}, \tag{5.35}
\]
where
\[
s_{njM} = t_{nj} + M v_n(t_{nj} - t_0), \tag{5.36}
\]
\[
A_{njM} = \bigl\{ \hat F_{n+}(t_{n,j+1}) \ge F_{0+}(s_{njM}) \bigr\}. \tag{5.37}
\]
Let $\tau_{nk}$ be the last jump point of $\hat F_{nk}$ before $t_{n,j+1}$, for $k = 1, \dots, K$. On the event $E_{nrC}$, these jump points exist and are in $[t_0 - 2r, t_{n,j+1})$. Without loss of generality we assume that the sub-distribution functions are labeled such that $\tau_{n1} \le \dots \le \tau_{nK}$. On the event $A_{njM}$ there must be a $k \in \{1, \dots, K\}$ for which $\hat F_{nk}(t_{n,j+1}) \ge F_{0k}(s_{njM})$. Hence, on the event $A_{njM}$, we can define $l \in \{1, \dots, K\}$ such that
\[
\hat F_{nk}(t_{n,j+1}) < F_{0k}(s_{njM}), \quad k = l+1, \dots, K, \tag{5.38}
\]
\[
\hat F_{nl}(t_{n,j+1}) \ge F_{0l}(s_{njM}). \tag{5.39}
\]
Next, we note that the Fenchel conditions imply that
\[
\int_{[\tau_{nl}, s_{njM})} \delta_l \, dP_n(u, \delta) - \int_{[\tau_{nl}, s_{njM})} \hat F_{nl}(u)\, dG_{n\hat F_n}(u) \ge 0
\]
must hold. Hence,
\[
P(A_{njM} \cap E_{nrC})
= P\Bigl( \Bigl\{ \int_{[\tau_{nl}, s_{njM})} \delta_l \, dP_n(u, \delta) - \int_{[\tau_{nl}, s_{njM})} \hat F_{nl}(u)\, dG_{n\hat F_n}(u) \ge 0 \Bigr\} \cap A_{njM} \cap E_{nrC} \Bigr).
\]
Using (5.20), this probability is bounded above by
\[
P\Bigl( \Bigl\{ \int_{[\tau_{nl}, s_{njM})} \bigl\{ \delta_l - \hat F_{nl}(u) + R_{l s_{njM} \hat F_n}(u, \delta) \bigr\}\, dP_n \ge 0 \Bigr\} \cap A_{njM} \cap E_{nrC} \Bigr) \tag{5.40}
\]
\[
+\, P\Bigl( \Bigl\{ \int_{[\tau_{nl}, s_{njM})} \bigl\{ \delta_+ - \hat F_{n+}(u) \bigr\}\, dP_n \ge 0 \Bigr\} \cap A_{njM} \cap E_{nrC} \Bigr). \tag{5.41}
\]
Note that we can discard the factor $F_{0l}(s_{njM})/\{1 - F_{0+}(s_{njM})\}$, because it is a finite and positive constant and therefore plays no role in the sign of the integral in (5.41). Using (5.39), the definition of $\tau_{nl}$, and the fact that $\hat F_{nl}$ is piecewise constant and monotone nondecreasing, it follows that on the event $A_{njM}$ we have, for $u > \tau_{nl}$,
\[
\hat F_{nl}(u) \ge \hat F_{nl}(\tau_{nl}) = \hat F_{nl}(t_{n,j+1}) \ge F_{0l}(s_{njM}).
\]
Hence, we can bound (5.40) by
\[
\begin{aligned}
&P\Bigl( \Bigl\{ \int_{[\tau_{nl}, s_{njM})} \bigl\{ \delta_l - F_{0l}(s_{njM}) + R_{l s_{njM} \hat F_n}(u, \delta) \bigr\}\, dP_n(u, \delta) \ge 0 \Bigr\} \cap E_{nrC} \Bigr) \\
&\le P\Bigl( \Bigl\{ \sup_{\substack{k \in \{1, \dots, K\} \\ w \in [t_0 - 2r,\, t_{n,j+1}]}} \Bigl[ \int_{[w, s_{njM})} \bigl\{ \delta_k - F_{0k}(s_{njM}) \bigr\}\, dP_n(u, \delta) + \bigl( n^{-2/3} + n^{-1/3}(s_{njM} - w)^{3/2} \bigr) C \Bigr] \ge 0 \Bigr\} \cap E_{nrC} \Bigr) \\
&\le P\Bigl( \Bigl\{ \sup_{\substack{k \in \{1, \dots, K\} \\ w \in [t_0 - 2r,\, t_{n,j+1}]}} \Bigl[ \int_{[w, s_{njM})} \bigl\{ \delta_k - F_{0k}(s_{njM}) \bigr\}\, dP_n(u, \delta) + \bigl( n^{-2/3} + n^{-1/6}(s_{njM} - w)^{3/2} \bigr) C \Bigr] \ge 0 \Bigr\} \cap E_{nrC} \Bigr),
\end{aligned}
\]
using the definition of $E_{nrC}$ in (5.30) for the first inequality, and $n^{-1/3} \le n^{-1/6}$ for the second. We can bound this probability by $p_{jM}/2$ for $M$ sufficiently large, using Lemma 5.15 below. Expression (5.41) is also bounded above by $p_{jM}/2$ for $M$ large, using Lemma 5.16 below. This proves (5.35) and completes the proof. $\Box$
Lemmas 5.15 and 5.16 are crucial lemmas in the proof of Theorem 5.10. The main idea of Lemma 5.15 is that we can write
\[
\int_{[w, s_{njM})} \bigl\{ \delta_k - F_{0k}(s_{njM}) \bigr\}\, dP_n
= \int_{[w, s_{njM})} \bigl\{ \delta_k - F_{0k}(u) \bigr\}\, dP_n + \int_{[w, s_{njM})} \bigl\{ F_{0k}(u) - F_{0k}(s_{njM}) \bigr\}\, dG_n.
\]
The first term on the right side is a martingale, and the second term on the right side has a negative quadratic drift on the event $E_{nrC}$. This quadratic drift dominates both the martingale part and the term $\bigl( n^{-2/3} + n^{-1/6}(s_{njM} - w)^{3/2} \bigr) C$ for sufficiently large $M$. We obtain the uniformity in $w$ by using a second grid with grid size $n^{-1/3}$, and we get the exponential bound $p_{jM}$ by using Orlicz norms. The proof is given in Section 5.4.
Lemma 5.15 Let the conditions of Theorem 5.10 be satisfied, and let $C > 0$. Then there exist $r > 0$, $n_1 > 0$ and $M > 5$ such that for all $n > n_1$ and $j \in \{0, \dots, J_n\}$ we have
\[
P\Bigl( \Bigl\{ \sup_{\substack{k \in \{1, \dots, K\} \\ w \in (t_0 - 2r,\, t_{n,j+1})}} \Bigl[ \int_{[w, s_{njM})} \bigl\{ \delta_k - F_{0k}(s_{njM}) \bigr\}\, dP_n + \bigl( n^{-2/3} + n^{-1/6}(s_{njM} - w)^{3/2} \bigr) C \Bigr] \ge 0 \Bigr\} \cap E_{nrC} \Bigr) \le \frac{p_{jM}}{2},
\]
where $s_{njM} = t_{nj} + M v_n(t_{nj} - t_0)$, and $v_n(\cdot)$, $E_{nrC}$ and $p_{jM}$ are defined in (5.23), (5.30) and (5.34), respectively.
Lemma 5.16 gives a similar bound, but now for the sum of the components. In this lemma the key idea is to exploit the system of sub-distribution functions. On the event $A_{njM}$, we play off the sub-distribution functions against each other until the problem is reduced to a situation to which Lemma 5.15 can be applied. The proof of this lemma is given in Section 5.4.
Lemma 5.16 Let the conditions of Theorem 5.10 be satisfied, and let $C > 0$. Then there are $M > 0$, $n_1 > 0$ and $r > 0$ such that for all $n > n_1$ and $j \in \{0, \dots, J_n\}$:
\[
P\Bigl( \Bigl\{ \int_{[\tau_{nl}, s_{njM})} \bigl\{ \delta_+ - \hat F_{n+}(u) \bigr\}\, dP_n(u, \delta) \ge 0 \Bigr\} \cap A_{njM} \cap E_{nrC} \Bigr) \le \frac{p_{jM}}{2},
\]
where $l$ is defined in (5.38), $\tau_{nl}$ is the last jump point of $\hat F_{nl}$ before $t_{n,j+1}$, $s_{njM} = t_{nj} + M v_n(t_{nj} - t_0)$, and $E_{nrC}$, $p_{jM}$ and $A_{njM}$ are defined in (5.30), (5.34) and (5.37).
Remark 5.17 The conditions of Theorem 5.10 also hold when $t_0$ is replaced by $s$, for $s$ in a neighborhood of $t_0$. Hence, the results in this section continue to hold when $t_0$ is replaced by $s \in [t_0 - r/2, t_0 + r/2]$, for $r > 0$ sufficiently small. To be precise, there exists an $r > 0$ such that for every $\epsilon > 0$ there exist $M > 0$ and $n_1 > 0$ such that for all $s \in [t_0 - r/2, t_0 + r/2]$ and $n > n_1$:
\[
P\Bigl( \sup_{t \in [s - r,\, s + r]} \frac{\bigl| \hat F_{n+}(t) - F_{0+}(t) \bigr|}{v_n(t - s)} > M \Bigr) < \epsilon.
\]
5.3.3 Local rate of convergence of $\hat F_{n1}, \dots, \hat F_{nK}$

We now prove the local rate of convergence for the components $\hat F_{n1}, \dots, \hat F_{nK}$. Recall from the introduction of this chapter that our proof relies on the Fenchel conditions (5.20). These Fenchel conditions consist of three parts: the integral of $\delta_k - \hat F_{nk}(u)$, the integral of $\delta_+ - \hat F_{n+}(u)$, and the integral of $R_{ks\hat F_n}(u, \delta)$. We can bound the part involving $R_{ks\hat F_n}$ using Lemma 5.9. For the term involving $\delta_+ - \hat F_{n+}$ we write
\[
\int_{[w,s)} \bigl\{ \delta_+ - \hat F_{n+}(u) \bigr\}\, dP_n
= \int_{[w,s)} \bigl\{ \delta_+ - F_{0+}(u) \bigr\}\, dP_n + \int_{[w,s)} \bigl\{ F_{0+}(u) - \hat F_{n+}(u) \bigr\}\, dG_n. \tag{5.42}
\]
The first term of (5.42) is bounded in Lemma 5.18. This lemma is very similar to Lemma 4.1 of Kim and Pollard (1990), with the only difference that our class of functions depends on $n$. In Corollary 5.19 we bound the second term of (5.42), using Theorem 5.10 with $\beta = 1/2$. It then follows that the term involving $\delta_k - \hat F_{nk}$ drives the local rate of convergence for $\hat F_{nk}$, just as for current status data without competing risks. The local rate of convergence for the components $\hat F_{n1}, \dots, \hat F_{nK}$ is given in Theorem 5.20.
Lemma 5.18 Let the conditions of Theorem 5.10 be satisfied. Then there exists an $r > 0$ such that for all $M > 0$ and every $\gamma > 0$ there exist random variables $A_n$ of order $O_p(1)$ such that
\[
\Bigl| \int_{[t, s_{nM})} \bigl\{ \delta_+ - F_{0+}(u) \bigr\}\, dP_n(u, \delta) \Bigr| \le \gamma (s_{nM} - t)^2 + n^{-2/3} A_n^2, \tag{5.43}
\]
for all $t \in [t_0 - r, s_{nM})$, where $s_{nM} = t_0 + 2 M n^{-1/3}$.
Proof: We use a slightly generalized version of Lemma 4.1 of Kim and Pollard (1990). We introduce the following notation:
\[
\begin{aligned}
q_{nt}(u, \delta) &= (\delta_+ - F_{0+}(u))\, 1_{[t, s_{nM})}(u), \quad t \le s_{nM}, \\
\mathcal{Q}_{nr} &= \{ q_{nt} : t \in (t_0 - r, s_{nM}) \}, \quad r > 0, \\
Q_{nr}(u, \delta) &= |\delta_+ - F_{0+}(u)| \, 1_{[t_0 - r, s_{nM}]}(u).
\end{aligned}
\]
Here $\mathcal{Q}_{nr}$ is the class of functions of interest and $Q_{nr}$ is its envelope. Note that $\mathcal{Q}_{nr}$ is uniformly manageable in the sense of Kim and Pollard (1990), since the functions $q_{nt}$ are the product of a fixed bounded function and an indicator of a VC class of sets. Furthermore,
\[
P Q_{nr}^2 \le P 1_{[t_0 - r, s_{nM}]}(u) \le 2 g(t_0)(r + M n^{-1/3}),
\]
for $r$ small and $n$ large, since $G$ is continuously differentiable at $t_0$ with a positive derivative. Hence, we can choose $r_1 > 0$ such that $P Q_{nr}^2 \le 2 g(t_0)(r + M n^{-1/3})$ for all $r \le r_1$. We can use this bound in the proof of Lemma 4.1 of Kim and Pollard (1990) without making any other modifications, and we obtain that for every $\gamma > 0$ there exist random variables $A_n$ of order $O_p(1)$ such that
\[
\Bigl| \int_{[t, s_{nM})} \bigl\{ \delta_+ - F_{0+}(u) \bigr\}\, dP_n(u, \delta) \Bigr| = |(P_n - P) q_{nt}| \le \gamma (s_{nM} - t)^2 + n^{-2/3} A_n^2,
\]
for all $t \in (t_0 - r, s_{nM})$. $\Box$
Corollary 5.19 provides a bound on the second term of (5.42). The proof uses the modulus of continuity result of Van de Geer (2000) and Theorem 5.10 with $\beta = 1/2$.

Corollary 5.19 Let the conditions of Theorem 5.10 be satisfied. Then there exists an $r > 0$ such that for all $M > 1$ we have
\[
\Bigl| \int_{[t, s_{nM})} \bigl\{ \hat F_{n+}(u) - F_{0+}(u) \bigr\}\, dG_n(u) \Bigr| = O_p\bigl( n^{-2/3} + n^{-1/6}(s_{nM} - t)^{3/2} \bigr), \tag{5.44}
\]
uniformly in $t \in [t_0 - r, t_0 + M n^{-1/3}]$, where $s_{nM} = t_0 + 2 M n^{-1/3}$.
Proof: We write
\[
\Bigl| \int_{[t, s_{nM})} \bigl\{ \hat F_{n+}(u) - F_{0+}(u) \bigr\}\, dG_n(u) \Bigr|
\le \Bigl| \int_{[t, s_{nM})} \bigl\{ \hat F_{n+}(u) - F_{0+}(u) \bigr\}\, d(G_n - G)(u) \Bigr| + \int_{[t, s_{nM})} \bigl| \hat F_{n+}(u) - F_{0+}(u) \bigr|\, dG(u).
\]
The first term is of order $O_p(n^{-2/3})$, uniformly in $t \le s_{nM}$, by the modulus of continuity result of Van de Geer (2000, Lemma 5.13, eq. (5.42)). To see this, let
\[
\mathcal{Q} = \bigl\{ q(u) = q_{tF}(u) = \{F(u) - F_{0+}(u)\}\, 1_{[t, s_{nM})}(u) : F \in \mathcal{F}, \ t \le s_{nM} \bigr\},
\]
where $\mathcal{F}$ is the class of monotone functions $F : \mathbb{R} \to [0,1]$. Taking $q_0 \equiv 0$, it is clear that
\[
\sup_{q \in \mathcal{Q}} \|q - q_0\|_{\infty} \le 1,
\]
so that Van de Geer's condition (5.39) is satisfied. In order to satisfy her condition (5.40), we need to show that the bracketing entropy $\log N_{[\,]}(\gamma, \mathcal{Q}, L_2(G))$ is bounded by $A \gamma^{-1}$, for some constant $A > 0$. It is well known that $\log N_{[\,]}(\gamma, \mathcal{F}, L_2(H)) \lesssim 1/\gamma$, uniformly in probability measures $H$ on the underlying sample space (see, e.g., Van de Geer (2000, page 18, equation (2.5)) or Van der Vaart and Wellner (1996, Theorem 2.7.5, page 159)). Furthermore, the same bound holds for the class of indicator functions $\{1_{[t, s_{nM})} : t \le s_{nM}\}$, since they are of bounded variation (see, e.g., Van de Geer (2000, page 18, equation (2.6))). In fact we can get a much sharper bound on the bracketing numbers for this class, but that is not needed here. Since the functions $q \in \mathcal{Q}$ consist of the product of functions from these two classes, it follows by Proposition 5.23 that $\log N_{[\,]}(\gamma, \mathcal{Q}, L_2(G)) \le A' \gamma^{-1}$ for some $A' > 0$. Hence, Van de Geer's condition (5.40) is satisfied. Next, we define
\[
\mathcal{Q}(\gamma) = \{ q \in \mathcal{Q} : \|q\|_2 \le \gamma \}.
\]
Using the $L_2(G)$ rate of convergence (5.7), we have
\[
\| q_{t \hat F_{n+}} \|_2^2 = \int_{[t, s_{nM})} \bigl\{ \hat F_{n+}(u) - F_{0+}(u) \bigr\}^2\, dG(u) = O_p(n^{-2/3}),
\]
uniformly in $t \le s_{nM}$. Hence, for every $\epsilon > 0$ we can find a $C > 0$ such that
\[
P\bigl( q_{t \hat F_{n+}} \in \mathcal{Q}(C n^{-1/3}) \text{ for all } t \le s_{nM} \bigr) > 1 - \epsilon.
\]
Finally, applying Van de Geer (2000, Lemma 5.13, eq. (5.42)) with $\alpha = 1$ and $\beta = 0$ to the class $\mathcal{Q}(C n^{-1/3})$ yields
\[
\sup_{q \in \mathcal{Q}(C n^{-1/3})} \Bigl| \int q \, d(P_n - P) \Bigr| = O_p(n^{-2/3}).
\]
To bound the second term, note that Theorem 5.10 with $\beta = 1/2$ implies that, uniformly in $t \in [t_0 - r, t_0 + r]$,
\[
\int_{t_0 \wedge t}^{t_0 \vee t} \bigl| \hat F_{n+}(u) - F_{0+}(u) \bigr|\, dG(u)
= \int_{t_0 \wedge t}^{t_0 \vee t} O_p(v_n(t - t_0))\, dG(u)
= O_p\bigl( n^{-2/3} \vee n^{-1/6} |t - t_0|^{3/2} \bigr). \tag{5.45}
\]
We now distinguish the following two cases: (i) $t < t_0$ and (ii) $t \in [t_0, t_0 + M n^{-1/3})$. In case (i), using (5.45) and $s_{nM} = t_0 + 2 M n^{-1/3}$, we get
\[
\begin{aligned}
\int_{[t, s_{nM})} \bigl| \hat F_{n+}(u) - F_{0+}(u) \bigr|\, dG(u)
&= \int_{[t, t_0)} \bigl| \hat F_{n+}(u) - F_{0+}(u) \bigr|\, dG(u) + \int_{[t_0, s_{nM})} \bigl| \hat F_{n+}(u) - F_{0+}(u) \bigr|\, dG(u) \\
&= O_p\bigl( n^{-2/3} \vee n^{-1/6}(t_0 - t)^{3/2} \bigr) + O_p\bigl( n^{-1/6}(2 M n^{-1/3})^{3/2} \bigr) \\
&= O_p\bigl( n^{-1/6}(s_{nM} - t)^{3/2} \bigr),
\end{aligned}
\]
uniformly in $t \in [t_0 - r, t_0)$. Similarly, in case (ii), using (5.45) and $M n^{-1/3} \le s_{nM} - t$, we get
\[
\int_{[t, s_{nM})} \bigl| \hat F_{n+}(u) - F_{0+}(u) \bigr|\, dG(u)
\le \int_{[t_0, s_{nM})} \bigl| \hat F_{n+}(u) - F_{0+}(u) \bigr|\, dG(u)
= O_p\bigl( n^{-1/6}(2 M n^{-1/3})^{3/2} \bigr) = O_p\bigl( n^{-1/6}(s_{nM} - t)^{3/2} \bigr),
\]
uniformly in $t \in [t_0, t_0 + M n^{-1/3})$. $\Box$
We are now ready to prove the local rate of convergence of $\hat F_{n1}, \dots, \hat F_{nK}$.

Theorem 5.20 Let the conditions of Theorem 5.10 be satisfied, and let $M_1 > 0$. Then
\[
\sup_{t \in [-M_1, M_1]} \bigl| \hat F_{nk}(t_0 + n^{-1/3} t) - F_{0k}(t_0) \bigr| = O_p(n^{-1/3}), \quad k = 1, \dots, K.
\]
Proof: Let $M_1 > 0$ be given, let $k \in \{1, \dots, K\}$ and let $\epsilon > 0$. It is sufficient to show that there exist constants $M > M_1$ and $n_1 > 0$ such that for all $n > n_1$
\[
P\bigl\{ \hat F_{nk}(t_0 + M n^{-1/3}) \ge F_{0k}(t_0 + 2 M n^{-1/3}) \bigr\} < \epsilon, \tag{5.46}
\]
\[
P\bigl\{ \hat F_{nk}(t_0 - M n^{-1/3}) \le F_{0k}(t_0 - 2 M n^{-1/3}) \bigr\} < \epsilon, \tag{5.47}
\]
since together with the monotonicity of $\hat F_{nk}$ this implies that with probability at least $1 - 2\epsilon$,
\[
\sup_{t \in [-M, M]} \bigl| \hat F_{nk}(t_0 + n^{-1/3} t) - F_{0k}(t_0) \bigr|
\le \max\bigl\{ F_{0k}(t_0 + 2 M n^{-1/3}) - F_{0k}(t_0),\ F_{0k}(t_0) - F_{0k}(t_0 - 2 M n^{-1/3}) \bigr\},
\]
which is bounded by $4 f_{0k}(t_0) M n^{-1/3}$ for large $n$. Since $M > M_1$, the result then follows.

We only prove (5.46), since the proof of (5.47) is analogous. Thus, we need to show that there exist constants $M > M_1$ and $n_1 > 0$ such that for all $n > n_1$
\[
P\bigl( \hat F_{nk}(t_0 + M n^{-1/3}) \ge F_{0k}(s_{nM}) \bigr) = P(B_{nkM}) \le \epsilon, \tag{5.48}
\]
where
\[
s_{nM} = t_0 + 2 M n^{-1/3}, \qquad
B_{nkM} = \bigl\{ \hat F_{nk}(t_0 + M n^{-1/3}) \ge F_{0k}(s_{nM}) \bigr\}.
\]
Let $\tau_{nk}$ be the largest jump point of $\hat F_{nk}$ before $t_0 + M n^{-1/3}$. As discussed in the proof of Theorem 5.10, we can choose $n_1$ and $r$ so that for all $n > n_1$
\[
P\bigl( \hat F_{nk} \text{ does not have a jump in } [t_0 - r, t_0] \bigr) < \epsilon/4.
\]
Next, note that the Fenchel conditions imply that
\[
\int_{[\tau_{nk}, s_{nM})} \delta_k \, dP_n(u, \delta) - \int_{[\tau_{nk}, s_{nM})} \hat F_{nk}(u)\, dG_{n\hat F_n}(u) \ge 0
\]
must hold. Hence,
\[
\begin{aligned}
P(B_{nkM})
&= P\Bigl( \Bigl\{ \int_{[\tau_{nk}, s_{nM})} \delta_k \, dP_n(u, \delta) - \int_{[\tau_{nk}, s_{nM})} \hat F_{nk}(u)\, dG_{n\hat F_n}(u) \ge 0 \Bigr\} \cap B_{nkM} \Bigr) \\
&\le \epsilon/4 + P\Bigl( \Bigl\{ \sup_{w \in [t_0 - r,\, t_0 + M n^{-1/3})} \Bigl[ \int_{[w, s_{nM})} \delta_k \, dP_n(u, \delta) - \int_{[w, s_{nM})} \hat F_{nk}(u)\, dG_{n\hat F_n}(u) \Bigr] \ge 0 \Bigr\} \cap B_{nkM} \Bigr).
\end{aligned} \tag{5.49}
\]
Using (5.20), we have
\[
\begin{aligned}
&\int_{[w, s_{nM})} \delta_k \, dP_n(u, \delta) - \int_{[w, s_{nM})} \hat F_{nk}(u)\, dG_{n\hat F_n}(u) \\
&\quad = \int_{[w, s_{nM})} \Bigl\{ \delta_k - \hat F_{nk}(u) + \frac{F_{0k}(s_{nM})}{1 - F_{0+}(s_{nM})} \bigl\{ \delta_+ - \hat F_{n+}(u) \bigr\} + R_{k s_{nM} \hat F_n} \Bigr\}\, dP_n.
\end{aligned} \tag{5.50}
\]
We now derive an upper bound for the last two terms in the integral (5.50). Starting with the part that involves $\delta_+ - \hat F_{n+}(u)$, we write
\[
\Bigl| \int_{[w, s_{nM})} \bigl\{ \delta_+ - \hat F_{n+}(u) \bigr\}\, dP_n(u, \delta) \Bigr|
\le \Bigl| \int_{[w, s_{nM})} \bigl\{ \delta_+ - F_{0+}(u) \bigr\}\, dP_n(u, \delta) \Bigr| + \Bigl| \int_{[w, s_{nM})} \bigl\{ F_{0+}(u) - \hat F_{n+}(u) \bigr\}\, dG_n(u) \Bigr|.
\]
We bound these terms using Lemma 5.18 and Corollary 5.19. Furthermore, we use Lemma 5.9 to bound $\int_{[w, s_{nM})} R_{k s_{nM} \hat F_n}\, dP_n$. It follows that we can choose $r > 0$ such that for all $M > 0$ and $\gamma > 0$ we can find $C > 0$ and $n_1 > 0$ such that for all $n > n_1$:
\[
\begin{aligned}
P\Bigl( \exists w \in [t_0 - r,\, t_0 + M n^{-1/3}) :\ 
&\int_{[w, s_{nM})} \Bigl[ \frac{F_{0k}(s_{nM})}{1 - F_{0+}(s_{nM})} \bigl\{ \delta_+ - \hat F_{n+}(u) \bigr\} + R_{k s_{nM} \hat F_n} \Bigr]\, dP_n \\
&> \gamma (s_{nM} - w)^2 + \bigl( n^{-2/3} + n^{-1/6}(s_{nM} - w)^{3/2} \bigr) C \Bigr) < \frac{\epsilon}{4}.
\end{aligned}
\]
This implies
\[
\begin{aligned}
P(B_{nkM}) \le \frac{\epsilon}{2} + P\Bigl( \Bigl\{ \sup_{w \in [t_0 - r,\, t_0 + M n^{-1/3})} \Bigl[ &\int_{[w, s_{nM})} \bigl\{ \delta_k - \hat F_{nk}(u) \bigr\}\, dP_n + \gamma (s_{nM} - w)^2 \\
&+ \bigl( n^{-2/3} + n^{-1/6}(s_{nM} - w)^{3/2} \bigr) C \Bigr] \ge 0 \Bigr\} \cap B_{nkM} \Bigr).
\end{aligned}
\]
Next, we consider the driving part $\int_{[w, s_{nM})} \{\delta_k - \hat F_{nk}(u)\}\, dP_n(u, \delta)$ of (5.50). The definition of $\tau_{nk}$, and the fact that $\hat F_{nk}$ is piecewise constant and monotone nondecreasing, imply that on the event $B_{nkM}$ we have, for $u \ge \tau_{nk}$:
\[
\hat F_{nk}(u) \ge \hat F_{nk}(\tau_{nk}) = \hat F_{nk}(t_0 + M n^{-1/3}) \ge F_{0k}(s_{nM}). \tag{5.51}
\]
Hence,
\[
\begin{aligned}
P(B_{nkM}) \le \frac{\epsilon}{2} + P\Bigl( \sup_{w \in [t_0 - r,\, t_0 + M n^{-1/3})} \Bigl[ &\int_{[w, s_{nM})} \bigl\{ \delta_k - F_{0k}(s_{nM}) \bigr\}\, dP_n + \gamma (s_{nM} - w)^2 \\
&+ \bigl( n^{-2/3} + n^{-1/6}(s_{nM} - w)^{3/2} \bigr) C \Bigr] \ge 0 \Bigr),
\end{aligned}
\]
and this can be bounded by $\epsilon$ by choosing $\gamma$ appropriately and $M$ large, by a slight adaptation of Lemma 5.15. Note that $\gamma$ should be chosen such that the negative quadratic drift arising from $\int_{[w, s_{nM})} \{\delta_k - F_{0k}(s_{nM})\}\, dP_n$ dominates $\gamma (s_{nM} - w)^2$. The choice $\gamma = g(t_0) f_{0k}(t_0)/32$ works. $\Box$
Remark 5.21 Theorem 5.20 also holds when $t_0$ is replaced by $s \in [t_0 - r/2, t_0 + r/2]$, for $r$ sufficiently small, for the reason discussed in Remark 5.17. To be precise, there exists an $r > 0$ such that for every $M_1 > 0$ and $\epsilon > 0$ there exist $M > 0$ and $n_1 > 0$ such that for all $s \in [t_0 - r/2, t_0 + r/2]$ and $n > n_1$:
\[
P\Bigl( \sup_{h \in [-M_1, M_1]} \bigl| \hat F_{nk}(s + n^{-1/3} h) - F_{0k}(s) \bigr| > M n^{-1/3} \Bigr) < \epsilon.
\]
Theorem 5.20 and Remark 5.21 lead to the following corollary about the distance between the jump points of $\hat F_{nk}$ around $t_0$:

Corollary 5.22 Let $t_0 \in \mathbb{R}$, and let the conditions of Theorem 5.20 be satisfied. Let $\tau^-_{nk}(t)$ be the last jump point of $\hat F_{nk}$ before $t$, and let $\tau^+_{nk}(t)$ be the first jump point of $\hat F_{nk}$ after $t$, for $k = 1, \dots, K$. Then for every $\epsilon > 0$ there exist $C > 0$ and $n_1 > 0$ such that for all $n > n_1$ and $s \in [t_0 - r/2, t_0 + r/2]$:
\[
P\bigl( \tau^+_{nk}(s) - \tau^-_{nk}(s) > C n^{-1/3} \bigr) < \epsilon.
\]
Proof: Let $s \in [t_0 - r/2, t_0 + r/2]$. Using Remark 5.21, we apply Theorem 5.20 two times: one time with $t_0$ replaced by $s$, and one time with $t_0$ replaced by $s - C_2 n^{-1/3}$ for $C_2 > 0$. This yields that for every $\epsilon > 0$ and $M_1 > 0$, there exist $n_1 > 0$ and $M > 0$, not depending on $s$ and $C_2$, such that for all $n > n_1$:
\[
P\Bigl( \sup_{t \in [-M_1, M_1]} \bigl| \hat F_{nk}(s - C_2 n^{-1/3} + n^{-1/3} t) - F_{0k}(s - C_2 n^{-1/3}) \bigr| > M n^{-1/3} \Bigr) < \epsilon,
\]
\[
P\Bigl( \sup_{t \in [-M_1, M_1]} \bigl| \hat F_{nk}(s + n^{-1/3} t) - F_{0k}(s) \bigr| > M n^{-1/3} \Bigr) < \epsilon.
\]
Furthermore, for $C_2$ sufficiently large we have
\[
F_{0k}(s - C_2 n^{-1/3}) + M n^{-1/3} < F_{0k}(s) - M n^{-1/3}.
\]
It follows that for each $s \in [t_0 - r/2, t_0 + r/2]$,
\[
P\bigl( \hat F_{nk} \text{ has a jump in the interval } (s - C_2 n^{-1/3}, s) \bigr) > 1 - 2\epsilon.
\]
The statement now follows by using similar reasoning for $\tau^+_{nk}(s)$. $\Box$
5.4 Technical lemmas and proofs
Propositions 5.23 and 5.24 give preservation theorems for bracketing entropy. These
propositions are used to verify conditions about bracketing entropy in Lemma 5.9,
Corollary 5.19, and also in Lemma 7.13 in Chapter 7. For completeness we give
proofs, although these may be known results.
Proposition 5.23 Let $P$ be a probability measure on $(\mathcal{X}, \mathcal{A})$. For $h : \mathcal{X} \to \mathbb{R}$ with $h \in L_2(P)$, let $\|h\|_2 = (\int h^2 \, dP)^{1/2}$. Let $\mathcal{H}_1$ and $\mathcal{H}_2$ be two classes of nonnegative functions from $\mathcal{X}$ to $\mathbb{R}_+$, with
\[
\log N_{[\,]}(\gamma, \mathcal{H}_i, \|\cdot\|_2) \le A_i \gamma^{-\alpha_i}, \quad i = 1, 2,
\]
for some constants $A_i > 0$ and $\alpha_i > 0$. Let $H_1$ and $H_2$ be the envelopes of $\mathcal{H}_1$ and $\mathcal{H}_2$, and assume that $\|H_1\|_2$, $\|H_2\|_2$ and $\|H_1 H_2\|_2$ are finite. Furthermore, define
\[
\mathcal{H}_3 = \{ h_1 h_2 : h_1 \in \mathcal{H}_1, h_2 \in \mathcal{H}_2 \}.
\]
Then $\log N_{[\,]}(\gamma, \mathcal{H}_3, \|\cdot\|_2) \le A' \gamma^{-(\alpha_1 \vee \alpha_2)}$ for $\gamma \le 1$, for some constant $A' > 0$.
Proof: Let $h_1 \in \mathcal{H}_1$ and $h_2 \in \mathcal{H}_2$. Let $[l_1, u_1]$ be a $\gamma$-bracket containing $h_1$ and let $[l_2, u_2]$ be a $\gamma$-bracket containing $h_2$, where the size of both brackets is computed w.r.t. $\|\cdot\|_2$. Without loss of generality we assume that $0 \le l_1 \le u_1 \le H_1$ and $0 \le l_2 \le u_2 \le H_2$. Then we can define a new bracket $[l_3, u_3] = [l_1 l_2, u_1 u_2]$ which contains $h_1 h_2$. The upper and lower bounds of this bracket are guaranteed to have finite norms by the assumption that $\|H_1 H_2\|_2$ is finite. Using the triangle inequality and the Cauchy–Schwarz inequality, we obtain
\[
\begin{aligned}
\|u_1 u_2 - l_1 l_2\|_2 &= \|u_1 (u_2 - l_2) + l_2 (u_1 - l_1)\|_2 \\
&\le \|u_1\|_2 \cdot \|u_2 - l_2\|_2 + \|l_2\|_2 \cdot \|u_1 - l_1\|_2 \\
&\le \gamma (\|u_1\|_2 + \|l_2\|_2) \le \gamma (\|H_1\|_2 + \|H_2\|_2) \equiv \gamma M.
\end{aligned}
\]
Let $N_i = N_{[\,]}(\gamma, \mathcal{H}_i, \|\cdot\|_2)$ for $i = 1, 2$. Then $N_{[\,]}(\gamma M, \mathcal{H}_3, \|\cdot\|_2) \le N_1 N_2$. Hence, the $\gamma M$-bracketing entropy for $\mathcal{H}_3$ is bounded by $\log(N_1) + \log(N_2) \le (A_1 + A_2) \gamma^{-(\alpha_1 \vee \alpha_2)}$, for $\gamma \le 1$. This implies that $\log N_{[\,]}(\gamma, \mathcal{H}_3, \|\cdot\|_2) \le M^{\alpha_1 \vee \alpha_2} (A_1 + A_2) \gamma^{-(\alpha_1 \vee \alpha_2)} \equiv A' \gamma^{-(\alpha_1 \vee \alpha_2)}$. $\Box$
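The bracket-size computation in this proof can be sanity-checked numerically. The sketch below (illustrative only) builds brackets of nonnegative functions on a discrete space whose pointwise gap is at most $\gamma$ (so their $L_2(P)$-size is too), and verifies the final inequality; the grid size and $\gamma$ are arbitrary choices:

```python
import numpy as np

# Monte Carlo sanity check of the bracket-size bound in Proposition 5.23:
# ||u1*u2 - l1*l2||_2 <= gamma * (||u1||_2 + ||l2||_2), for brackets of
# nonnegative functions with pointwise gap at most gamma.
rng = np.random.default_rng(1)
m, gamma = 1000, 0.05  # m grid points; P puts mass 1/m on each point

def norm2(h):
    return np.sqrt(np.mean(h ** 2))

for _ in range(100):
    l1 = rng.uniform(0.0, 1.0, size=m)
    l2 = rng.uniform(0.0, 1.0, size=m)
    # upper bracket ends lie at most gamma above the lower ends, pointwise
    u1 = l1 + rng.uniform(0.0, gamma, size=m)
    u2 = l2 + rng.uniform(0.0, gamma, size=m)
    assert norm2(u1 - l1) <= gamma and norm2(u2 - l2) <= gamma
    assert norm2(u1 * u2 - l1 * l2) <= gamma * (norm2(u1) + norm2(l2)) + 1e-12
```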
Proposition 5.24 Let $\|\cdot\|$ be an arbitrary norm. Let $\mathcal{H}_1$ and $\mathcal{H}_2$ be two classes of functions with
\[
\log N_{[\,]}(\gamma, \mathcal{H}_i, \|\cdot\|) \le A_i \gamma^{-\alpha_i}, \quad i = 1, 2,
\]
for some constants $A_i > 0$ and $\alpha_i > 0$. Let $H_1$ and $H_2$ be the envelopes of $\mathcal{H}_1$ and $\mathcal{H}_2$, and assume that $\|H_1\|$ and $\|H_2\|$ are finite. Furthermore, define
\[
\mathcal{H}_3 = \{ h_1 + h_2 : h_1 \in \mathcal{H}_1, h_2 \in \mathcal{H}_2 \}.
\]
Then $\log N_{[\,]}(\gamma, \mathcal{H}_3, \|\cdot\|) \le A' \gamma^{-(\alpha_1 \vee \alpha_2)}$ for $\gamma \le 1$, for some constant $A' > 0$.
Proof: Let $h_1 \in \mathcal{H}_1$ and $h_2 \in \mathcal{H}_2$. Let $[l_1, u_1]$ be a $\gamma$-bracket containing $h_1$ and let $[l_2, u_2]$ be a $\gamma$-bracket containing $h_2$, where the size of both brackets is computed w.r.t. $\|\cdot\|$. Without loss of generality, we assume that $l_i$ and $u_i$ are contained in $[-H_i, H_i]$, for $i = 1, 2$. Note that we can define a new bracket $[l_3, u_3] = [l_1 + l_2, u_1 + u_2]$ which contains $h_1 + h_2$. The upper and lower bounds of this bracket are guaranteed to have finite norms by the assumption that $\|H_1\|$ and $\|H_2\|$ are finite. The size of the bracket $[l_3, u_3]$ is
\[
\|(u_1 + u_2) - (l_1 + l_2)\| \le \|u_1 - l_1\| + \|u_2 - l_2\| \le 2\gamma.
\]
Let $N_i = N_{[\,]}(\gamma, \mathcal{H}_i, \|\cdot\|)$ for $i = 1, 2$. Then $N_{[\,]}(2\gamma, \mathcal{H}_3, \|\cdot\|) \le N_1 N_2$. Hence, the $2\gamma$-bracketing entropy for $\mathcal{H}_3$ is bounded by $\log(N_1) + \log(N_2) \le (A_1 + A_2) \gamma^{-(\alpha_1 \vee \alpha_2)}$. This implies that $\log N_{[\,]}(\gamma, \mathcal{H}_3, \|\cdot\|) \le 2^{\alpha_1 \vee \alpha_2}(A_1 + A_2) \gamma^{-(\alpha_1 \vee \alpha_2)} \equiv A' \gamma^{-(\alpha_1 \vee \alpha_2)}$. $\Box$
Next, we provide the proofs of Lemmas 5.9, 5.15 and 5.16.
Proof of Lemma 5.9: Fix $k \in \{1, \dots, K\}$. Note that
\[
\begin{aligned}
R_{ks\hat F_n}(u, \delta)
&= \frac{\delta_+ - \hat F_{n+}(u)}{(1 - \hat F_{n+}(u))(1 - F_{0+}(s))} \bigl[ \hat F_{nk}(u)(1 - F_{0+}(s)) - F_{0k}(s)(1 - \hat F_{n+}(u)) \bigr] \\
&= \frac{\delta_+ - \hat F_{n+}(u)}{(1 - \hat F_{n+}(u))(1 - F_{0+}(s))} \bigl[ \hat F_{nk}(u)(\hat F_{n+}(u) - F_{0+}(s)) + (1 - \hat F_{n+}(u))(\hat F_{nk}(u) - F_{0k}(s)) \bigr] \\
&= (\delta_+ - \hat F_{n+}(u))(\hat F_{n+}(u) - F_{0+}(s)) \frac{F_{0k}(s)}{(1 - F_{0+}(s))^2} (1 + O(s - u) + o_p(1)) \\
&\quad + (\delta_+ - \hat F_{n+}(u))(\hat F_{nk}(u) - F_{0k}(s)) \frac{1}{1 - F_{0+}(s)}.
\end{aligned}
\]
The last line follows from continuity of $F_{0k}$ and $F_{0+}$ and consistency of $\hat F_{nk}$ and $\hat F_{n+}$ (Proposition 4.15), so that we can replace $\hat F_{nk}(u)/(1 - \hat F_{n+}(u))$ by
\[
\frac{F_{0k}(s)}{1 - F_{0+}(s)} + \frac{\hat F_{nk}(u)(\hat F_{n+}(u) - F_{0+}(s)) + (\hat F_{nk}(u) - F_{0k}(s))(1 - \hat F_{n+}(u))}{(1 - \hat F_{n+}(u))(1 - F_{0+}(s))}
= \frac{F_{0k}(s)}{1 - F_{0+}(s)} (1 + O(s - u) + o_p(1)).
\]
It is sufficient to analyze the leading terms $(\delta_+ - \hat F_{n+}(u))(\hat F_{n+}(u) - F_{0+}(s))$ and $(\delta_+ - \hat F_{n+}(u))(\hat F_{nk}(u) - F_{0k}(s))$ of $R_{ks\hat F_n}$. In fact, we only need to analyze the latter term, since the result for the first term then follows by summing over $k = 1, \dots, K$. We can discard the factors $F_{0k}(s)/\{1 - F_{0+}(s)\}^2$ and $1/\{1 - F_{0+}(s)\}$, since these are bounded between two positive constants under the conditions of Theorem 5.10. We now write:
\[
\begin{aligned}
&\{\delta_+ - \hat F_{n+}(u)\}\{\hat F_{nk}(u) - F_{0k}(s)\} \\
&\quad = \{\delta_+ - F_{0+}(u)\}\{\hat F_{nk}(u) - F_{0k}(u)\} + \{\delta_+ - F_{0+}(u)\}\{F_{0k}(u) - F_{0k}(s)\} \\
&\qquad + \{F_{0+}(u) - \hat F_{n+}(u)\}\{\hat F_{nk}(u) - F_{0k}(u)\} + \{F_{0+}(u) - \hat F_{n+}(u)\}\{F_{0k}(u) - F_{0k}(s)\} \\
&\quad \equiv R^{(1)}(u, \delta) + R^{(2)}(u, \delta) + R^{(3)}(u, \delta) + R^{(4)}(u, \delta).
\end{aligned}
\]
For $j = 1, \dots, 4$, we write
\[
\Bigl| \int_{[w,s)} R^{(j)}\, dP_n \Bigr| \le \Bigl| \int_{[w,s)} R^{(j)}\, d(P_n - P) \Bigr| + \Bigl| \int_{[w,s)} R^{(j)}\, dP \Bigr|. \tag{5.52}
\]
We first show that the second term on the right side of (5.52) is of order $O_p(n^{-2/3} + n^{-1/3}(s - w)^{3/2})$, uniformly in $w < s$. Namely, let $w < s$, and note that
\[
\int_{[w,s)} R^{(1)}\, dP = \int_{[w,s)} R^{(2)}\, dP = 0.
\]
Furthermore, using Cauchy–Schwarz and the $L_2(G)$ rate of convergence (5.7) yields
\[
\Bigl| \int_{[w,s)} R^{(3)}\, dP \Bigr| \le \sqrt{ \int \{F_{0+}(u) - \hat F_{n+}(u)\}^2\, dG }\, \sqrt{ \int \{\hat F_{nk}(u) - F_{0k}(u)\}^2\, dG } = O_p(n^{-2/3}).
\]
Similarly, we obtain $\bigl| \int_{[w,s)} R^{(4)}\, dP \bigr| = O_p(n^{-1/3}(s - w)^{3/2})$.
We now consider the first term on the right side of (5.52), starting with $j = 2$:
\[
\Bigl| \int_{[w,s)} R^{(2)}\, d(P_n - P) \Bigr|
\le \Bigl| \int_{[w,s)} \delta_+ \{F_{0k}(u) - F_{0k}(s)\}\, d(P_n - P) \Bigr|
+ \Bigl| \int_{[w,s)} F_{0+}(u)\{F_{0k}(u) - F_{0k}(s)\}\, d(G_n - G) \Bigr|. \tag{5.53}
\]
The second term on the right side of (5.53) is of order $O_p(n^{-1/2}(s - w))$, uniformly in $t_0 - 2r < w < s < t_0 + 2r$, by Lemma 5.13. Letting $G_{n+}(u) = P_n \{\Delta_+ 1\{T \le u\}\}$ and $G_+(u) = P \{\Delta_+ 1\{T \le u\}\}$, the first term on the right side of (5.53) can be written as $\int_{[w,s)} \{F_{0k}(u) - F_{0k}(s)\}\, d(G_{n+} - G_+)(u)$. Note that $n^{1/2}(G_{n+} - G_+)$ converges in distribution to a mean zero Gaussian process, and satisfies $\sup_u |G_{n+}(u) - G_+(u)| = O_p(n^{-1/2})$. Hence, we can also bound the first term on the right side of (5.53) by $O_p(n^{-1/2}(s - w))$, along the lines of Lemma 5.13.

We are left with the terms $\int_{[w,s)} R^{(j)}\, d(P_n - P)$, for $j = 1, 3, 4$. We bound these terms using the modulus of continuity result of Van de Geer (2000, Lemma 5.13, page 79). We only consider $j = 1$, since $j = 3$ and $j = 4$ are analogous. Let
\[
\mathcal{Q} = \bigl\{ q(u, \delta) = q_{wsF}(u, \delta) = \{\delta_+ - F_{0+}(u)\}\{F(u) - F_{0k}(u)\}\, 1_{[w,s)}(u) : w < s, \ F \in \mathcal{F} \bigr\},
\]
where $\mathcal{F}$ is the class of monotone functions $F : \mathbb{R} \to [0,1]$. Taking $q_0 \equiv 0$, it is clear that
\[
\sup_{q \in \mathcal{Q}} \|q - q_0\|_{\infty} \le 1,
\]
so that Van de Geer's condition (5.39) is satisfied. In order to satisfy her condition (5.40), we need to show that $\log N_{[\,]}(\gamma, \mathcal{Q}, L_2(G))$ is bounded by $A \gamma^{-1}$, for some constant $A > 0$. It is well known that $\log N_{[\,]}(\gamma, \mathcal{F}, L_2(H)) \lesssim 1/\gamma$, uniformly in probability measures $H$ on the underlying sample space (see, e.g., Van de Geer (2000, page 18, equation (2.5)) or Van der Vaart and Wellner (1996, Theorem 2.7.5, page 159)). Furthermore, the same bound holds for the class of indicator functions $\{1_{[w,s)} : w < s\}$, since they are of bounded variation (see, e.g., Van de Geer (2000, page 18, equation (2.6))). Since the functions $q \in \mathcal{Q}$ consist of sums and products of functions from classes with bracketing entropy bounded by $A \gamma^{-\alpha}$, it follows from Propositions 5.23 and 5.24 that $\log N_{[\,]}(\gamma, \mathcal{Q}, L_2(G))$ is bounded by $A' \gamma^{-1}$ for some constant $A' > 0$. Hence, Van de Geer's condition (5.40) is satisfied.

Next, we define $\mathcal{Q}(\gamma) = \{ q \in \mathcal{Q} : \|q\|_2 \le \gamma \}$. Using the $L_2(G)$ rate of convergence (5.7), we have
\[
\| q_{ws\hat F_{nk}} \|_2^2 = \int_{[w,s)} \{\delta_+ - F_{0+}(u)\}^2 \{\hat F_{nk}(u) - F_{0k}(u)\}^2\, dP(u, \delta)
\le \int \{\hat F_{nk}(u) - F_{0k}(u)\}^2\, dG(u) = O_p(n^{-2/3}),
\]
uniformly in $w < s$. Hence, for every $\epsilon > 0$ we can choose $C > 0$ such that
\[
P\bigl( q_{ws\hat F_{nk}} \in \mathcal{Q}(C n^{-1/3}) \text{ for all } w < s \bigr) > 1 - \epsilon.
\]
Applying Van de Geer (2000, Lemma 5.13, page 79, eq. (5.42)) with $\alpha = 1$ and $\beta = 0$ to the class $\mathcal{Q}(C n^{-1/3})$ yields:
\[
\sup_{q \in \mathcal{Q}(C n^{-1/3})} \Bigl| \int q \, d(P_n - P) \Bigr| = O_p(n^{-2/3}).
\]
Hence $\int_{[w,s)} R^{(1)}\, d(P_n - P) = O_p(n^{-2/3})$ uniformly in $w < s$. The integrals involving $R^{(3)}$ and $R^{(4)}$ can be handled similarly.

Combining everything, we have shown that there exists an $r > 0$ such that
\[
\Bigl| \int_{[w,s)} R_{ks\hat F_n}(u, \delta)\, dP_n \Bigr|
= O_p\bigl\{ n^{-2/3} + n^{-1/2}(s - w) + n^{-1/3}(s - w)^{3/2} \bigr\}
= O_p\bigl( n^{-2/3} + n^{-1/3}(s - w)^{3/2} \bigr),
\]
uniformly in $t_0 - 2r < w < s < t_0 + 2r$, and for $k = 1, \dots, K$. $\Box$
Proof of Lemma 5.15: Let $C > 0$. It is sufficient to show that there exist $r > 0$, $n_1 > 0$ and $M > 0$ such that the statement holds for a fixed $k$. Let $k \in \{1, \dots, K\}$, $n > 0$ and $j \in \{0, \dots, J_n\}$. On the event $E_{nrC}$ we have, for $w \in [t_0 - 2r, t_{n,j+1})$:
\[
\begin{aligned}
\int_{[w, s_{njM})} \{\delta_k - F_{0k}(s_{njM})\}\, dP_n
&= \int_{[w, s_{njM})} \{\delta_k - F_{0k}(u) + F_{0k}(u) - F_{0k}(s_{njM})\}\, dP_n \\
&\le \int_{[w, s_{njM})} \{\delta_k - F_{0k}(u)\}\, dP_n - \frac{g(t_0) f_{0k}(t_0)(s_{njM} - w)^2}{8},
\end{aligned}
\]
since $s_{njM} - w \ge s_{njM} - t_{n,j+1} \ge (M - 1) n^{-1/3} > n^{-1/3}$ for $M > 2$. Furthermore, for $M$ large we have
\[
\bigl( n^{-2/3} + n^{-1/6}(s_{njM} - w)^{3/2} \bigr) C \le g(t_0) f_{0k}(t_0)(s_{njM} - w)^2 / 16,
\]
since $s_{njM} - w > (M - 1) n^{-1/3}$. Hence, it is sufficient to bound
\[
P\Bigl[ \sup_{w \in (t_0 - 2r,\, t_{n,j+1})} \Bigl\{ \int_{[w, s_{njM})} \{\delta_k - F_{0k}(u)\}\, dP_n - \frac{g(t_0) f_{0k}(t_0)(s_{njM} - w)^2}{16} \Bigr\} \ge 0 \Bigr]. \tag{5.54}
\]
To do so, we put a grid on the interval $[t_0 - 2r, t_{n,j+1})$. The grid points $t_{n,j-q}$ and grid cells $I_{n,j-q}$ are given by
\[
I_{n,j-q} = [t_{n,j-q}, t_{n,j-q+1}) = [t_0 + (j - q) n^{-1/3}, t_0 + (j - q + 1) n^{-1/3}),
\]
for $q = 0, \dots, Q_{nj} = \lceil 2 r n^{1/3} + j \rceil$. We then bound (5.54) above by
\[
\sum_{q=0}^{Q_{nj}} P\Bigl[ \sup_{w \in I_{n,j-q}} \int_{[w, s_{njM})} \{\delta_k - F_{0k}(u)\}\, dP_n \ge \lambda_{nkjqM} \Bigr], \tag{5.55}
\]
where $\lambda_{nkjqM} = f_{0k}(t_0) g(t_0) (s_{njM} - t_{n,j-q+1})^2 / 16$. If we bound the $q$th term in (5.55)
by
\[
p_{jqM} = \begin{cases} 2 \exp\{-d_2 (q + M)^{3/2}\} & \text{if } j = 0,\ q = 0, \dots, Q_{n0}, \\ 2 \exp\{-d_2 (q + M j^{\beta})^{3/2}\} & \text{if } j = 1, \dots, J_n,\ q = 0, \dots, Q_{nj}, \end{cases}
\]
then we are done, because summing over $q$ and using $(a + b)^{3/2} \ge a^{3/2} + b^{3/2}$ for $a, b > 0$ yields
\[
\sum_{q=0}^{Q_{nj}} p_{jqM} \le \begin{cases} d_1 \exp\{-d_2 M^{3/2}\} & \text{if } j = 0, \\ d_1 \exp\{-d_2 (M j^{\beta})^{3/2}\} & \text{if } j = 1, \dots, J_n, \end{cases}
\]
where $d_1 = 2 \sum_{q=0}^{\infty} \exp(-d_2 q^{3/2}) < \infty$.
To prove that the $q$th term in (5.55) is bounded by $p_{jqM}$, we use the fact that a bounded Orlicz norm $\|X\|_{\psi_p}$, for $p \ge 1$, gives an exponential bound on tail probabilities; see, e.g., Van der Vaart and Wellner (1996, page 96 or 239):
\[
P(|X| > t) \le 2 \exp\bigl( -t^p / \|X\|_{\psi_p}^p \bigr). \tag{5.56}
\]
Here the Orlicz norm is $\|X\|_{\psi_p} = \inf\{ c > 0 : E \psi_p(|X|/c) \le 1 \}$ with $\psi_p(x) = \exp(x^p) - 1$. In order to apply inequality (5.56), we define
\[
\mathcal{F}_{nkjqM} = \bigl\{ (\delta_k - F_{0k}(u))\, 1_{[w, s_{njM})}(u) : w \in I_{n,j-q} \bigr\}
\]
and $\|\mathbb{G}_n\|_{\mathcal{F}_{nkjqM}} = \sup_{f \in \mathcal{F}_{nkjqM}} \mathbb{G}_n f = \sup_{f \in \mathcal{F}_{nkjqM}} \sqrt{n}(P_n - P) f$. Then the $q$th term of (5.55) equals
\[
P\bigl\{ \|\mathbb{G}_n\|_{\mathcal{F}_{nkjqM}} \ge \sqrt{n}\, \lambda_{nkjqM} \bigr\} \le 2 \exp\Bigl( -\sqrt{n}\, \lambda_{nkjqM} \big/ \bigl\| \|\mathbb{G}_n\|_{\mathcal{F}_{nkjqM}} \bigr\|_{\psi_1} \Bigr), \tag{5.57}
\]
where the inequality follows by applying (5.56) with $p = 1$. Thus, if we can bound the $\psi_1$-Orlicz norm of $\|\mathbb{G}_n\|_{\mathcal{F}_{nkjqM}}$, then we are done. Let $F_{nkjqM}$ be the envelope of $\mathcal{F}_{nkjqM}$:
\[
F_{nkjqM}(u, \delta) = |\delta_k - F_{0k}(u)| \, 1_{[t_{n,j-q}, s_{njM})}(u) \le 1_{[t_{n,j-q}, s_{njM})}(u).
\]
Using Theorem 2.14.5 of Van der Vaart and Wellner (1996, page 244) with $p = 1$, followed by their Theorem 2.14.1 on page 239, we get
\[
\bigl\| \|\mathbb{G}_n\|^*_{\mathcal{F}_{nkjqM}} \bigr\|_{\psi_1}
\lesssim \bigl\| \|\mathbb{G}_n\|_{\mathcal{F}_{nkjqM}} \bigr\|_1 + n^{-1/2}(1 + \log n) \|F_{nkjqM}\|_{\psi_1}
\lesssim J(1, \mathcal{F}_{nkjqM}) \|F_{nkjqM}\|_2 + n^{-1/2} \log n\, \|F_{nkjqM}\|_{\psi_1}. \tag{5.58}
\]
Note that
\[
P F_{nkjqM}^2 \le G(1_{[t_{n,j-q}, s_{njM})}).
\]
The function $J(1, \mathcal{F}_{nkjqM})$ is constant in our case. Hence the first term of (5.58) is given by $\sqrt{G(1_{[t_{n,j-q}, s_{njM})})}$.

We now compute the second term of (5.58). Since $F_{nkjqM}(u, \delta) \le 1_{[t_{n,j-q}, s_{njM})}(u)$, we have
\[
\psi_1(F_{nkjqM}/c) = \exp(F_{nkjqM}/c) - 1 \le \{\exp(1/c) - 1\}\, 1_{[t_{n,j-q}, s_{njM})}(u),
\]
and $P \psi_1(F_{nkjqM}/c) \le \{\exp(1/c) - 1\}\, G(1_{[t_{n,j-q}, s_{njM})})$. This expectation is bounded by one if and only if $c \ge [\log\{1 + 1/G(1_{[t_{n,j-q}, s_{njM})})\}]^{-1}$. Hence,
\[
\|F_{nkjqM}\|_{\psi_1} \le \bigl[ \log\bigl\{ 1 + 1/G(1_{[t_{n,j-q}, s_{njM})}) \bigr\} \bigr]^{-1}.
\]
Plugging this into (5.58) gives
\[
\bigl\| \|\mathbb{G}_n\|^*_{\mathcal{F}_{nkjqM}} \bigr\|_{\psi_1}
\lesssim J(1, \mathcal{F}_{nkjqM}) \|F_{nkjqM}\|_2 + n^{-1/2} \log n\, \|F_{nkjqM}\|_{\psi_1}
\lesssim \sqrt{G(1_{[t_{n,j-q}, s_{njM})})} + n^{-1/2} \log n\, \bigl[ \log\bigl\{ 1 + 1/G(1_{[t_{n,j-q}, s_{njM})}) \bigr\} \bigr]^{-1}. \tag{5.59}
\]
The first term of (5.59) dominates the expression. To see this, let $x = G(1_{[t_{n,j-q}, s_{njM})})$ and note that $x \in [0,1]$. Since the length of the interval $[t_{n,j-q}, s_{njM})$ is at least $M n^{-1/3}$, we can assume that $x \ge d_0 n^{-1/3}$ with $d_0 = g(t_0) M / 2$. Now note that
\[
\frac{2}{\sqrt{d_0}} \sqrt{x}\, \log(1 + 1/x) \ge \frac{2}{\sqrt{d_0}} \sqrt{x}\, \log 2 \ge 2 (\log 2)\, n^{-1/6} \ge n^{-1/2} \log n. \tag{5.60}
\]
Here the first inequality follows from the fact that $x \mapsto \log(1 + 1/x)$ is decreasing, so that $\log(1 + 1/x) \ge \log 2$ for $x \in [0,1]$. The second inequality follows from $x \ge d_0 n^{-1/3}$, and the third inequality follows from $n^{-1/3} \log n \le 2 \log 2$ for all $n \ge 1$. Dividing both sides of (5.60) by $\log(1 + 1/x)$ yields that (5.59) is bounded by $(1 + 2/\sqrt{d_0})$ times its first term. Plugging this into (5.57) yields, for some constant $b > 0$,
\[
P\bigl( \|\mathbb{G}_n\|_{\mathcal{F}_{nkjqM}} > \sqrt{n}\, \lambda_{nkjqM} \bigr) \le 2 \exp\Bigl( - \frac{b \sqrt{n}\, \lambda_{nkjqM}}{\sqrt{G(1_{[t_{n,j-q}, s_{njM})})}} \Bigr).
\]
Now recall that
\[
s_{njM} - t_{n,j-q} = q n^{-1/3} + M v_n(t_{nj} - t_0), \qquad
\lambda_{nkjqM} = f_{0k}(t_0) g(t_0) \bigl\{ (q - 1) n^{-1/3} + M v_n(t_{nj} - t_0) \bigr\}^2 / 16,
\]
and let
\[
x_{njqM} = q n^{-1/3} + M v_n(t_{nj} - t_0) = \begin{cases} (q + M) n^{-1/3} & \text{if } j = 0, \\ (q + M j^{\beta}) n^{-1/3} & \text{if } j = 1, \dots, J_n. \end{cases}
\]
Then we have $\lambda_{nkjqM} \ge f_{0k}(t_0) g(t_0) x_{njqM}^2 / 32$ for $M \ge 5$, and $G(1_{[t_{n,j-q}, s_{njM})}) \le 2 g(t_0) x_{njqM}$. Hence,
\[
\frac{b \sqrt{n}\, \lambda_{nkjqM}}{\sqrt{G(1_{[t_{n,j-q}, s_{njM})})}} \ge \frac{d_2 \sqrt{n}\, x_{njqM}^2}{\sqrt{x_{njqM}}} = \begin{cases} d_2 (q + M)^{3/2} & \text{if } j = 0, \\ d_2 (q + M j^{\beta})^{3/2} & \text{if } j = 1, \dots, J_n, \end{cases}
\]
where $d_2 = b f_{0k}(t_0) \sqrt{g(t_0)} / (32 \sqrt{2})$. $\Box$

[Figure 5.3 here: curves of $F_{01}$, $F_{02}$, $F_{0+}$ and $\hat F_{n1}$, $\hat F_{n2}$, $\hat F_{n+}$, with the points $\tau_{n1}$, $\tau_{n2}$, $t_{n,j+1}$ and $s_{njM}$ marked on the time axis.]

Figure 5.3: Example clarifying the treatment of the $\hat F_{n+}$ term in Lemma 5.16. Note that $\hat F_{n+}(t_{n,j+1}) > F_{0+}(s_{njM})$, $\hat F_{n1}(t_{n,j+1}) > F_{01}(s_{njM})$, and $\hat F_{n2}(t_{n,j+1}) < F_{02}(s_{njM})$. Thus, in this example $l = 1$ (see (5.38) and (5.39)). Since $\hat F_{n+}(\tau_{n1}) < F_{0+}(s_{njM})$ we cannot apply the method of Lemma 5.15.
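The exponential tail inequality (5.56) that drives the bound $p_{jqM}$ can be illustrated in a simple closed-form case (not tied to the empirical process above):

```python
import math

# Illustrative check of the Orlicz tail bound (5.56) with p = 1, for
# X ~ Exponential(1): E exp(X/c) = 1/(1 - 1/c) for c > 1, so the psi_1 norm
# solves 1/(1 - 1/c) - 1 = 1, giving ||X||_{psi_1} = c = 2.
c = 2.0
assert abs((1.0 / (1.0 - 1.0 / c) - 1.0) - 1.0) < 1e-12

# (5.56) with p = 1 then reads P(X > t) = exp(-t) <= 2 exp(-t/2) for t >= 0.
for t in (0.1, 1.0, 5.0, 25.0, 100.0):
    assert math.exp(-t) <= 2.0 * math.exp(-t / c)
```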
Proof of Lemma 5.16: We first note that $l$ is only defined on the event $A_{njM} = \{\hat F_{n+}(t_{n,j+1}) \ge F_{0+}(s_{njM})\}$. Hence, this entire proof should be read on the event $A_{njM}$. Furthermore, note that we can apply the method of proof of Lemma 5.15 if $\hat F_{n+}(u) \ge F_{0+}(s_{njM})$ for all $u \ge \tau_{nl}$. This situation occurs if $l = K$, because in that case none of the sub-distribution functions jump on the interval $(\tau_{nl}, t_{n,j+1})$.

Now suppose that $l < K$. Then we typically do not have that $\hat F_{n+}(u) \ge F_{0+}(s_{njM})$ for all $u \ge \tau_{nl}$, as illustrated in Figure 5.3. Hence, we cannot apply the method of Lemma 5.15. Instead, we exploit the $K$-dimensional system of sub-distribution functions by breaking $\int_{[\tau_{nl}, s_{njM})} \{\delta_+ - \hat F_{n+}(u)\}\, dP_n$ into pieces that we analyze separately. First, we define $l^* \in \{l, \dots, K\}$ as follows. If
\[
\int_{[\tau_{nl}, \tau_{nk})} \{\delta_+ - \hat F_{n+}(u)\}\, dP_n \ge 0, \quad \text{for all } k = l+1, \dots, K, \tag{5.61}
\]
we let $l^* = l$. Otherwise we define $l^*$ such that
\[
\int_{[\tau_{nl}, \tau_{nk})} \{\delta_+ - \hat F_{n+}(u)\}\, dP_n \ge 0, \quad k = l^*+1, \dots, K, \tag{5.62}
\]
\[
\int_{[\tau_{nl}, \tau_{nl^*})} \{\delta_+ - \hat F_{n+}(u)\}\, dP_n < 0. \tag{5.63}
\]
Then, by (5.63) and the decomposition [τnl, snjM) = [τnl, τnl*) ∪ [τnl*, snjM), we get

∫_{[τnl, snjM)} {δ+ − Fn+(u)} dPn ≤ ∫_{[τnl*, snjM)} {δ+ − Fn+(u)} dPn,   (5.64)

where strict inequality holds if l ≠ l*. By rearranging the sum and using the notation τn,K+1 = snjM, we can rewrite the right side of (5.64) as

∑_{k=l*+1}^{K} ∫_{[τnl*, τnk)} {δk − Fnk(u)} dPn + ∑_{k=l*}^{K} ∑_{p=1}^{k} ∫_{[τnk, τn,k+1)} {δp − Fnp(u)} dPn.   (5.65)
We now derive upper bounds for both terms in (5.65), on the event AnjM ∩ EnrC. Starting with the first term, note that

∫_{[τnl*, τnk)} {δ+ − Fn+(u)} dPn ≥ 0, k = l* + 1, …, K.   (5.66)

Namely, if l = l* then (5.66) is the same as (5.61). On the other hand, if l < l* then (5.66) follows (with strict inequality) from (5.62), (5.63) and the decomposition [τnl, τnk) = [τnl, τnl*) ∪ [τnl*, τnk). Furthermore, the Fenchel conditions (see Proposition 2.36 and expression (5.20)) imply that

∫_{[t, τnk)} { δk − Fnk(u) + [F0k(τnk)/{1 − F0+(τnk)}] {δ+ − Fn+(u)} + R_{kτnkFn}(u, δ) } dPn ≤ 0,

for k = 1, …, K, t ≤ τnk. Using this inequality with t = τnl* together with (5.66) and
F0k(τnk)/{1 − F0+(τnk)} > 0 yields that

∫_{[τnl*, τnk)} {δk − Fnk(u) + R_{kτnkFn}(u, δ)} dPn ≤ 0.

Hence, on the event EnrC we have

∫_{[τnl*, τnk)} {δk − Fnk(u)} dPn ≤ −∫_{[τnl*, τnk)} R_{kτnkFn}(u, δ) dPn
≤ {n^{−2/3} + n^{−1/6}(τnk − τnl*)^{3/2}} C
≤ {n^{−2/3} + n^{−1/6}(snjM − τnl*)^{3/2}} C,

for k = l* + 1, …, K, using the definition of EnrC in (5.30). This implies that, on the event EnrC, the first term of (5.65) is bounded by {n^{−2/3} + n^{−1/6}(snjM − τnl*)^{3/2}} KC.
We now derive an upper bound for the second term of (5.65). Note that the inequalities (5.38) in the definition of l imply that on the event AnjM

∑_{p=k+1}^{K} Fnp(tn,j+1) < ∑_{p=k+1}^{K} F0p(snjM), k = l, …, K.

Together with the definition of τn1, …, τnK, this yields that on the event AnjM = {Fn+(tn,j+1) ≥ F0+(snjM)}, we have

∑_{p=1}^{k} Fnp(τnp) = ∑_{p=1}^{k} Fnp(tn,j+1) > ∑_{p=1}^{k} F0p(snjM), k = l, …, K.

Furthermore, Fnp(τnp) ≤ Fnp(τnk) for p ≤ k, by the monotonicity of Fnp and the ordering τn1 ≤ ⋯ ≤ τnK. Hence, we get for k = l, …, K and u ≥ τnk:

∑_{p=1}^{k} Fnp(u) ≥ ∑_{p=1}^{k} Fnp(τnk) ≥ ∑_{p=1}^{k} Fnp(τnp) > ∑_{p=1}^{k} F0p(snjM).
This means that on the event AnjM the second term of (5.65) is bounded above by

∑_{k=l*}^{K} ∑_{p=1}^{k} ∫_{[τnk, τn,k+1)} {δp − F0p(snjM)} dPn = ∑_{k=1}^{K} ∫_{[τnk ∨ τnl*, snjM)} {δk − F0k(snjM)} dPn.
Combining (5.64), (5.65) and the upper bound for (5.65) on the event AnjM ∩ EnrC, we obtain:

P( {∫_{[τnl, snjM)} (δ+ − Fn+) dPn ≥ 0} ∩ AnjM ∩ EnrC )
≤ P( {∫_{[τnl*, snjM)} (δ+ − Fn+) dPn ≥ 0} ∩ AnjM ∩ EnrC )
≤ P( { {n^{−2/3} + n^{−1/6}(snjM − τnl*)^{3/2}} KC + ∑_{k=1}^{K} ∫_{[τnk ∨ τnl*, snjM)} {δk − F0k(snjM)} dPn ≥ 0 } ∩ EnrC )
≤ P( { {n^{−2/3} + n^{−1/6}(snjM − τnl*)^{3/2}} KC + ∫_{[τnl*, snjM)} {δ1 − F01(snjM)} dPn ≥ 0 } ∩ EnrC )
  + P( { ∑_{k=2}^{K} ∫_{[τnk ∨ τnl*, snjM)} {δk − F0k(snjM)} dPn ≥ 0 } ∩ EnrC )
≤ P( { sup_{w ∈ (t0−2r, tn,j+1)} [ {n^{−2/3} + n^{−1/6}(snjM − w)^{3/2}} KC + ∫_{[w, snjM)} {δ1 − F01(snjM)} dPn ] ≥ 0 } ∩ EnrC )
  + P( { sup_{k ∈ {1,…,K}, w ∈ (t0−2r, tn,j+1)} ∫_{[w, snjM)} {δk − F0k(snjM)} dPn ≥ 0 } ∩ EnrC ).

We can bound both terms on the last two lines by pjM/4, using Lemma 5.15. □
Chapter 6
LIMITING DISTRIBUTION
In Section 5.3 we showed that, for k = 1, …, K,

n^{1/3}{F̂nk(t0) − F0k(t0)} = Op(1) and n^{1/3}{F̃nk(t0) − F0k(t0)} = Op(1),

where F̂nk denotes the MLE and F̃nk the naive estimator. In this chapter we discuss the limiting distributions of these quantities. In Section 6.1 we show that the limiting distribution of the naive estimator (F̃n1, …, F̃nK) is given by the slopes of the convex minorants of a K-tuple of two-sided correlated Brownian motion processes plus parabolic drifts. In Section 6.2 we discuss analogous results for the MLE. We will see that the limiting distribution of the MLE is given by the slopes of the convex minorants of the K-tuple of two-sided Brownian motion processes plus parabolic drifts, plus an extra term involving the difference between the sum of the K drifting Brownian motions and their convex minorants. This extra term makes the system of processes self-induced. Hence, existence and uniqueness of these processes are not automatic, and we formally establish these properties in Theorem 6.9. In Theorem 6.10 we prove convergence of the MLE to its limiting distribution. Technical proofs are collected in Section 6.3.
Throughout this chapter, we use the following conventions and notation. We assume that the naive estimators F̃nk are right-continuous and piecewise constant, with jumps only at T1, …, Tn. Similarly, we assume that for each k ∈ {1, …, K}, the MLE F̂nk is right-continuous and piecewise constant, with jumps only at points in T_k (see Definition 2.22). We denote the right-continuous derivative of a function f : R → R by f′ (if it exists). Furthermore, N is the collection of nonnegative integers {0, 1, …}, l∞[−m, m] denotes the set of uniformly bounded real functions on [−m, m], C[−m, m] is the set of continuous real functions on [−m, m], and D[−m, m] is the set of cadlag functions on [−m, m]. Finally, we use the following definition for integrals and indicator functions:
Definition 6.1 For t < t0 we define

1_{[t0,t]}(u) = −1_{[t,t0]}(u) and 1_{[t0,t)}(u) = −1_{[t,t0)}(u).

Furthermore, in analogy with the definition of the signed Riemann integral, we define for t < t0:

∫_{[t0,t)} f(u) dA(u) = ∫ f(u) 1_{[t0,t)}(u) dA(u) = −∫ f(u) 1_{[t,t0)}(u) dA(u) = −∫_{[t,t0)} f(u) dA(u),

if dA is a Lebesgue–Stieltjes measure, with a similar definition if both endpoints of the interval are closed. We use the same notation for integrals with respect to Brownian motion W(·). Thus, we define for t < t0:

∫_{t0}^{t} f(u) dW(u) = −∫_{t}^{t0} f(u) dW(u).
6.1 The limiting distribution of the naive estimator
The limiting distribution of the naive estimator follows by generalizing known results
on the MLE for univariate current status data (Groeneboom and Wellner (1992,
Theorem 5.1, page 89)). To describe this limiting distribution, we define the following
processes:
Definition 6.2 Let W = (W1, …, WK) be a K-tuple of two-sided Brownian motion processes originating from zero, with mean zero and covariances

E{Wj(t)Wk(s)} = (|s| ∧ |t|) 1{st > 0} Σjk, s, t ∈ R,

where Σjk = g(t0)[1{j = k} F0k(t0) − F0j(t0)F0k(t0)], for j, k ∈ {1, …, K}. Furthermore, let

Xk(t) = Wk(t)/g(t0) + (1/2) f0k(t0) t², k = 1, …, K, t ∈ R.
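The processes of Definition 6.2 are straightforward to simulate on a grid. The sketch below is our own illustration (the function name `simulate_X` and all grid parameters are assumptions, not part of the thesis): it builds the multinomial covariance matrix Σ, glues two independent one-sided Brownian motions at zero to obtain the two-sided process, and adds the parabolic drifts.

```python
import numpy as np

def simulate_X(t0_density, F0, f0, tmax=15.0, ngrid=3000, seed=1):
    """Simulate X_k(t) = W_k(t)/g(t0) + f_0k(t0) t^2 / 2 of Definition 6.2
    on a grid of [-tmax, tmax].  t0_density is g(t0); F0 and f0 are arrays
    holding F_0k(t0) and f_0k(t0), k = 1, ..., K."""
    rng = np.random.default_rng(seed)
    K = len(F0)
    # multinomial covariance of Delta | T = t0, scaled by g(t0)
    Sigma = t0_density * (np.diag(F0) - np.outer(F0, F0))
    L = np.linalg.cholesky(Sigma)
    h = np.linspace(0.0, tmax, ngrid)      # one-sided time grid
    dt = h[1] - h[0]

    def one_sided():
        # correlated increments with covariance Sigma * dt per step
        dW = L @ rng.standard_normal((K, ngrid - 1)) * np.sqrt(dt)
        return np.concatenate([np.zeros((K, 1)), np.cumsum(dW, axis=1)], axis=1)

    Wr, Wl = one_sided(), one_sided()      # independent right and left halves
    t = np.concatenate([-h[::-1], h[1:]])  # full two-sided grid
    W = np.concatenate([Wl[:, ::-1], Wr[:, 1:]], axis=1)
    X = W / t0_density + 0.5 * f0[:, None] * t[None, :] ** 2
    return t, X
```

Because the two halves are independent and both start at zero, the covariance E{Wj(t)Wk(s)} = (|s| ∧ |t|) 1{st > 0} Σjk holds on the grid.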
Definition 6.3 Let Hk be the convex minorant of Xk, i.e., Hk is convex and satisfies the following conditions:

Hk(t) ≤ Xk(t), k = 1, …, K, t ∈ R,

∫ {Hk(t) − Xk(t)} dH′k(t) = 0, k = 1, …, K.

Furthermore, let H = (H1, …, HK), and let U(t) = (U1(t), …, UK(t)) be the vector of right derivatives of H at t, i.e., Uk(t) = H′k(t) for k = 1, …, K and t ∈ R.
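On a finite grid, the convex minorant of Definition 6.3 can be computed with a single lower-convex-hull pass over the points (t_i, X_k(t_i)). The function below is our own illustrative sketch (its name is an assumption); it returns the minorant values at the grid points, from which the slope process U_k is obtained by differencing.

```python
import numpy as np

def convex_minorant(t, x):
    """Greatest convex minorant of the points (t_i, x_i), t sorted increasing.
    Builds the lower convex hull with a stack, then interpolates linearly
    between hull vertices -- the discrete analogue of H_k in Definition 6.3."""
    hull = [0]                              # indices of current hull vertices
    for i in range(1, len(t)):
        while len(hull) >= 2:
            i0, i1 = hull[-2], hull[-1]
            # cross product of (i0 -> i1) and (i0 -> i); <= 0 means the middle
            # point i1 lies on or above the chord i0 -> i, so it is removed
            cross = (t[i1] - t[i0]) * (x[i] - x[i0]) - (x[i1] - x[i0]) * (t[i] - t[i0])
            if cross <= 0:
                hull.pop()
            else:
                break
        hull.append(i)
    return np.interp(t, t[hull], x[hull])
```

For a convex input the hull keeps every point and the minorant equals the input; for a non-convex input the minorant linearly bridges the concave stretches.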
Note that the processes H1, . . . , HK exist and are unique. The main result of this
section is given in Theorem 6.4.
Theorem 6.4 For each k = 1, …, K, let F0k be continuously differentiable at t0 with strictly positive derivative f0k(t0). Furthermore, let G be continuously differentiable at t0 with strictly positive derivative g(t0). Let U be as defined in Definition 6.3. Then

n^{1/3}{F̃n(t0) − F0(t0)} →d U(0), in R^K.
For K = 1, Theorem 6.4 simply gives the limiting distribution of the maximum likelihood estimator for univariate current status data. For K > 1, we obtain for each k = 1, …, K the limiting distribution of the maximum likelihood estimator for the reduced current status data (T, ∆k). The multinomial covariance structure of the Brownian motions comes from the multinomial distribution of ∆|T.
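Since the naive estimator for cause k is the MLE for the reduced current status data (T, ∆k), it can be computed as the isotonic (nondecreasing) regression of the indicators ∆k on the ordered observation times, e.g. by pooling adjacent violators. The sketch below is our own illustration under that standard characterization; the function names are assumptions.

```python
import numpy as np

def pava(y, w):
    """Weighted pool-adjacent-violators: nondecreasing fit to y with weights w."""
    means, weights, sizes = [], [], []      # blocks of (mean, weight, size)
    for yi, wi in zip(y, w):
        means.append(yi); weights.append(wi); sizes.append(1)
        # merge blocks from the right while monotonicity is violated
        while len(means) >= 2 and means[-2] > means[-1]:
            m2, w2, s2 = means.pop(), weights.pop(), sizes.pop()
            means[-1] = (weights[-1] * means[-1] + w2 * m2) / (weights[-1] + w2)
            weights[-1] += w2
            sizes[-1] += s2
    return np.repeat(means, sizes)

def naive_estimator(T, delta_k):
    """Naive estimator for cause k: isotonic regression of the current status
    indicators delta_k on the sorted observation times T."""
    order = np.argsort(T)
    fit = pava(np.asarray(delta_k, float)[order], np.ones(len(T)))
    return np.sort(T), fit
```

The fitted values are the left derivatives of the convex minorant of the cumulative sum diagram of the ordered indicators, which is exactly the finite-sample analogue of the slope processes appearing in Theorem 6.4.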
Example 6.5 Throughout this chapter, we consider the following example. Let K = 2, let T be independent of (X, Y), and let T, Y and X|Y have the following distributions:

G(t) = P(T ≤ t) = 1 − exp(−t),   (6.1)
P(Y = k) = k/3, k = 1, 2,
P(X ≤ t | Y = k) = 1 − exp(−kt), k = 1, 2,

so that

F0k(t) = (k/3){1 − exp(−kt)}, k = 1, 2.
Figures 6.1 and 6.2¹ show the limiting processes for the naive estimator, for t0 = 1 and t0 = 2. Comparing the two figures, we see that for t0 = 2 the variance of the Brownian motions Wk(h)/g(t0) is larger, the negative correlation between the two component processes W1(h)/g(t0) and W2(h)/g(t0) is stronger, and the parabolic drifts f0k(t0)h²/2 are weaker. These observations follow from the definition of the processes (Definition 6.2) and the fact that F0k is increasing while g and f0k are decreasing. Finally, note that the slope processes Uk(h) have far fewer jumps for t0 = 2.
We now provide a proof of Theorem 6.4 that is in the same spirit as the proof of the limiting distribution of the MLE in Section 6.2 below. The idea is as follows. First, we characterize the localized estimator in terms of localized processes X^loc_n,

¹These figures are constructed using the localized processes defined in the proof of Theorem 6.4, for sample size n = 100,000.
Figure 6.1: Limiting processes for the naive estimator, for the model given in Example 6.5 and t0 = 1. The top row shows Wk(h)/g(t0), k = 1, 2. The middle row shows Xk(h) (grey) and its convex minorant Hk(h) (red), k = 1, 2. The parabolic drifts f0k(t0)h²/2 are denoted by dashed lines. The bottom row shows the slope process Uk(h), together with a dashed line of slope f0k(t0), k = 1, 2.
Figure 6.2: Limiting processes for the naive estimator, for the model given in Example 6.5 and t0 = 2. Please see Figure 6.1 for further explanation.
H^loc_n, U^loc_n. Next, we show that these processes, restricted to [−m, m], are tight in an appropriate space, for each m ∈ N. Via a diagonal argument, it then follows that every subsequence (X^loc_{n′}, H^loc_{n′}, U^loc_{n′}) has a further subsequence (X^loc_{n″}, H^loc_{n″}, U^loc_{n″}) converging to a limit (X, H, U), with U = H′ and with its component processes defined on R. By the continuous mapping theorem, this limit satisfies the conditions of Definition 6.3 on intervals [−m, m], for each m ∈ N. By letting m → ∞ we obtain that these conditions are satisfied on R. This shows existence of a process satisfying the conditions of Definition 6.3. Since the processes defined in Definition 6.3 are unique, all subsequences must converge to the same limit (X, H, U). Hence (X^loc_n, H^loc_n, U^loc_n) →d (X, H, U).
Note that we obtain existence of the limiting processes while proving convergence
of the naive estimator to its limiting distribution. For the naive estimator, this proof
is vacuous, since existence of the convex minorants of X1, . . . , XK is well-known.
However, existence of the limiting processes for the MLE is not known, and hence
this step will be important for the MLE. Furthermore, note that uniqueness of the
limiting processes is used in the proof. For the naive estimator, uniqueness of the
convex minorants of X1, . . . , XK is known and hence we use it without proof. For
the MLE, uniqueness of the limiting processes is not known, and we establish this
separately in Section 6.2.2. Finally, note that our approach is different from the
one used by Groeneboom, Jongbloed and Wellner (2001a,b) for maximum likelihood
estimation of convex densities. They first establish existence and uniqueness of the
limiting process separately, and then prove convergence to the limiting distribution.
We now provide several results that are needed in the proof of Theorem 6.4. Let τnk be the last jump point of F̃nk before t0, k = 1, …, K. Lemma 6.6 shows that n^{2/3} ∫_{[τnk, t0)} {δk − F̃nk(u)} dPn(u, δ) is tight.
Lemma 6.6 Let τnk be the last jump point of F̃nk before t0. Then

∫_{[τnk, t0)} {F̃nk(u) − δk} dPn(u, δ) = Op(n^{−2/3}), k = 1, …, K.

Proof: We write

∫_{[τnk, t0)} {F̃nk(u) − δk} dPn(u, δ)
= ∫_{[τnk, t0)} {F̃nk(u) − F0k(u)} dGn(u) + ∫_{[τnk, t0)} {F0k(u) − δk} dPn(u, δ)
= ∫_{[τnk, t0)} {F̃nk(u) − F0k(u)} d(Gn − G)(u) + ∫_{[τnk, t0)} {F̃nk(u) − F0k(u)} dG(u)
  + ∫_{[τnk, t0)} {F0k(u) − δk} d(Pn − P)(u, δ)
≡ I + II + III.

As mentioned in the introduction of Chapter 5, we know that t0 − τnk = Op(n^{−1/3}). Combining this with Lemma 4.1 of Kim and Pollard (1990) yields that terms I and III are of order Op(n^{−2/3}). Term II is of order Op(n^{−2/3}) by the local rate of convergence and t0 − τnk = Op(n^{−1/3}). □
The next lemma, Lemma 6.7, formalizes that dGn(u) ≈ dG(u) ≈ g(t0) du for u ∈ [t0 − mn^{−1/3}, t0 + mn^{−1/3}].

Lemma 6.7 Let the conditions of Theorem 6.4 be satisfied, let m > 0, and let k ∈ {1, …, K}. Then

(1/g(t0)) ∫_{[t0, t0+n^{−1/3}h]} {F̃nk(u) − F0k(t0)} dGn(u)
= ∫_{[t0, t0+n^{−1/3}h]} {F̃nk(u) − F0k(t0)} du + op(n^{−2/3}),   (6.2)

uniformly in h ∈ [−m, m].
Proof: Let m ∈ N and k ∈ {1, …, K}. We write

(1/g(t0)) ∫_{[t0, t0+n^{−1/3}h]} {F̃nk(u) − F0k(t0)} dGn(u)
= (1/g(t0)) ∫_{[t0, t0+n^{−1/3}h]} {F̃nk(u) − F0k(t0)} d(Gn − G)(u)
  + (1/g(t0)) ∫_{[t0, t0+n^{−1/3}h]} {F̃nk(u) − F0k(t0)} dG(u)
≡ I + II.

To see that term I is of order Op(n^{−1}), note that by the local rate of convergence, we can assume at the cost of probability ε that

sup_{h ∈ [−m,m]} |F̃nk(t0 + n^{−1/3}h) − F0k(t0)| ≤ C n^{−1/3}.

Applying Theorem 2.11.22 of Van der Vaart and Wellner (1996) to the class Qn, where

Qn = { q_{nFh}(u) = n^{1/2}{Fn(u) − F0k(t0)} 1_{[t0, t0+n^{−1/3}h]}(u) : h ∈ [−m, m], Fn ∈ Fn },
Fn = { Fn : R → [0, 1], Fn monotone, sup_{h ∈ [−m,m]} |Fn(t0 + n^{−1/3}h) − F0k(t0)| ≤ C n^{−1/3} },

yields that {Gn q_{nFh} : q_{nFh} ∈ Qn} is tight in l∞[−m, m]. Since

Gn q_{nFh} = √n (Pn − P) q_{nFh} = n ∫_{[t0, t0+n^{−1/3}h]} {Fn(u) − F0k(t0)} d(Gn − G)(u),

this implies that term I is of order Op(n^{−1}).

For the second term (II) we write:

II = ∫_{t0}^{t0+n^{−1/3}h} {F̃nk(u) − F0k(t0)} du
  + ∫_{t0}^{t0+n^{−1/3}h} {F̃nk(u) − F0k(t0)} [{g(u) − g(t0)}/g(t0)] du ≡ IIa + IIb.

Note that IIa is present in (6.2). Term IIb is of order op(n^{−2/3}), uniformly in h ∈ [−m, m], using the Cauchy–Schwarz inequality, the local rate of convergence and the continuity of g:

|IIb| ≤ (1/g(t0)) ( ∫_{t0}^{t0+n^{−1/3}h} {F̃nk(u) − F0k(t0)}² du )^{1/2} ( ∫_{t0}^{t0+n^{−1/3}h} {g(u) − g(t0)}² du )^{1/2}
= Op(n^{−1/2}) o(n^{−1/6}) = op(n^{−2/3}). □
Proposition 6.8 gives convergence to the Brownian motion processes plus parabolic drifts, as defined in Definition 6.2.

Proposition 6.8 Let the conditions of Theorem 6.4 be satisfied. Let m > 0. Then

X^loc_{nk}(h) ≡ (n^{2/3}/g(t0)) ∫_{[t0, t0+n^{−1/3}h]} {δk − F0k(t0)} dPn(u, δ)
→d Wk(h)/g(t0) + (1/2) f0k(t0) h² = Xk(h),

jointly for k = 1, …, K in (l∞[−m, m])^K.
Proposition 6.8 is quite standard. To show where the Brownian motion and the parabolic drift come from, we write

δk − F0k(t0) = {δk − F0k(u)} + {F0k(u) − F0k(t0)}.

The part δk − F0k(u) gives a martingale that converges to the Brownian motion Wk, and the part F0k(u) − F0k(t0) gives the quadratic drift. The multinomial covariance structure of the Brownian motions W1, …, WK comes from the multinomial distribution of ∆|T, given in (2.4). For completeness, we give a proof of Proposition 6.8 in Section 6.3.
We are now ready to prove Theorem 6.4.
Proof of Theorem 6.4: Let τnk be the last jump point of F̃nk before t0, for k = 1, …, K. Recall from Proposition 2.28 that the naive estimators F̃nk(t), k = 1, …, K, are characterized by

∫_{[τnk, t)} F̃nk(u) dGn(u) ≤ ∫_{[τnk, t)} δk dPn(u, δ), k = 1, …, K, t ∈ R,   (6.3)

where equality must hold if t is a jump point of F̃nk. In order to change the integration interval [τnk, t) to [t0, t), we define

cnk = ∫_{[τnk, t0)} {F̃nk(u) − δk} dPn(u, δ), k = 1, …, K.

Then (6.3) is equivalent to

cnk + ∫_{[t0, t)} F̃nk(u) dGn(u) ≤ ∫_{[t0, t)} δk dPn(u, δ), k = 1, …, K, t ∈ R,   (6.4)

where equality must hold if t is a jump point of F̃nk.

We now localize this expression, by subtracting ∫_{[t0, t)} F0k(t0) dGn(u) from both sides, and applying the change of variable t → t0 + n^{−1/3}h. This yields

cnk + ∫_{[t0, t0+n^{−1/3}h)} {F̃nk(u) − F0k(t0)} dGn(u)
≤ ∫_{[t0, t0+n^{−1/3}h)} {δk − F0k(t0)} dPn(u, δ), k = 1, …, K, h ∈ R,   (6.5)

where equality must hold if t0 + n^{−1/3}h is a jump point of F̃nk. Next, we define the following localized processes for k = 1, …, K and h ∈ R:

X^loc_{nk}(h) = (n^{2/3}/g(t0)) ∫_{[t0, t0+n^{−1/3}h]} {δk − F0k(t0)} dPn(u, δ),
H^loc_{nk}(h) = n^{2/3} ∫_{[t0, t0+n^{−1/3}h]} {F̃nk(u) − F0k(t0)} du,
and

U^loc_{nk}(h) = n^{1/3}{F̃nk(t0 + n^{−1/3}h) − F0k(t0)}.

Note that U^loc_{nk} = (H^loc_{nk})′ at continuity points of U^loc_{nk}. Furthermore, define

c^loc_{nk} = (n^{2/3}/g(t0)) cnk,
R^loc_{nk}(h) = (n^{2/3}/g(t0)) ∫_{[t0, t0+n^{−1/3}h]} {F̃nk(u) − F0k(t0)} dGn(u)
  − n^{2/3} ∫_{[t0, t0+n^{−1/3}h]} {F̃nk(u) − F0k(t0)} du.

Then multiplying (6.5) by n^{2/3}/g(t0) yields

c^loc_{nk} + R^loc_{nk}(h) + H^loc_{nk}(h) ≤ X^loc_{nk}(h), h ∈ R, k = 1, …, K,

and c^loc_{nk} + R^loc_{nk}(h−) + H^loc_{nk}(h−) = X^loc_{nk}(h−) if U^loc_{nk} has a jump at h. Combining these statements, we obtain:

c^loc_{nk} + R^loc_{nk}(h) + H^loc_{nk}(h) ≤ X^loc_{nk}(h), h ∈ R, k = 1, …, K,   (6.6)

∫ {c^loc_{nk} + R^loc_{nk}(h−) + H^loc_{nk}(h−) − X^loc_{nk}(h−)} dU^loc_{nk}(h) = 0, k = 1, …, K.   (6.7)

Note that these conditions also hold when the processes are restricted to [−m, m], for each m ∈ N.

We define the following vectors:

c^loc_n = (c^loc_{n1}, …, c^loc_{nK}), H^loc_n = (H^loc_{n1}, …, H^loc_{nK}),
R^loc_n = (R^loc_{n1}, …, R^loc_{nK}), U^loc_n = (U^loc_{n1}, …, U^loc_{nK}),
X^loc_n = (X^loc_{n1}, …, X^loc_{nK}),
and for m ∈ N, we define the space

E[−m, m] = R^K × (D[−m, m])^K × (D[−m, m])^K × (C[−m, m])^K × (D[−m, m])^K
  ≡ R^K × I × II × III × IV,

endowed with the product topology induced by the uniform topology on I × II × III, and the Skorohod topology on IV. Note that this space supports the vector

Vn|[−m, m] ≡ (c^loc_n, R^loc_n, X^loc_n, H^loc_n, U^loc_n)|[−m, m],

where the notation |[−m, m] denotes that all processes R^loc_{nk}, X^loc_{nk}, H^loc_{nk} and U^loc_{nk}, k = 1, …, K, are restricted to [−m, m].

Analogously to Groeneboom, Jongbloed and Wellner (2001b), we now show that Vn|[−m, m] is tight in E[−m, m] for each m ∈ N. Note that Lemma 6.6 implies tightness of c^loc_n in R^K. Furthermore, Lemma 6.7 implies that R^loc_n|[−m, m] is of order op(1). Next, note that the subset of D[−m, m] consisting of absolutely bounded nondecreasing functions is compact in the Skorohod topology. Hence, Theorem 5.20 and the monotonicity of U^loc_{nk}, k = 1, …, K, yield that U^loc_n|[−m, m] is tight in (D[−m, m])^K endowed with the Skorohod topology. Moreover, since the set of absolutely bounded continuous functions with absolutely bounded derivatives is compact in C[−m, m] with the uniform topology, it follows that H^loc_n|[−m, m] is tight in (C[−m, m])^K endowed with the uniform topology. Furthermore, by Proposition 6.8 we have that

X^loc_n(h) →d W(h)/g(t0) + (1/2) f0(t0) h² = X(h),   (6.8)

uniformly on compacta, where f0(t0) = (f01(t0), …, f0K(t0)), and W = (W1, …, WK) and X = (X1, …, XK) are defined in Definition 6.2. Hence, X^loc_n|[−m, m] is tight in (D[−m, m])^K endowed with the uniform topology. Combining everything, we have
that Vn|[−m, m] is tight in E[−m, m] for each m ∈ N.

It now follows by a diagonal argument that any subsequence Vn′ of Vn has a further subsequence Vn″ that converges in distribution to a limit

V = (c, 0, X, H, U) ∈ R^K × (C(−∞, ∞))^K × (C(−∞, ∞))^K × (C(−∞, ∞))^K × (D(−∞, ∞))^K.

Using a representation theorem (see, e.g., Dudley (1968), Pollard (1984, Representation Theorem 13, page 71) or Van der Vaart and Wellner (1996, Theorem 1.10.4, page 59)), we can assume that Vn″ →a.s. V. Hence, U = H′ at continuity points of U; see Lemma 6.26 on page 177.

Conditions (6.6) and (6.7) and the continuous mapping theorem imply that the vector (c, X, H, U) must satisfy, for all m ∈ N:

inf_{t ∈ [−m,m]} {Xk(t) − Hk(t) − ck} ≥ 0, k = 1, …, K,
∫_{[−m,m]} {Xk(t−) − Hk(t−) − ck} dUk(t) = 0, k = 1, …, K.

Since Xk and Hk are continuous, we can write the second condition as

∫_{[−m,m]} {Xk(t) − Hk(t) − ck} dUk(t) = 0, k = 1, …, K.

Letting m → ∞ gives

inf_{t ∈ R} {Xk(t) − Hk(t) − ck} ≥ 0, k = 1, …, K,
∫ {Xk(t) − Hk(t) − ck} dUk(t) = 0, k = 1, …, K.

Defining H̆k(t) = Hk(t) + ck, k = 1, …, K, we have H̆′k = H′k and

inf_{t ∈ R} {Xk(t) − H̆k(t)} ≥ 0, k = 1, …, K,
∫ {Xk(t) − H̆k(t)} dUk(t) = 0, k = 1, …, K.

This proves existence of a process satisfying the conditions of Definition 6.3. Since the processes defined in Definition 6.3 are unique, (H̆1, …, H̆K) must equal the K-tuple of convex minorants of Definition 6.3, and U must be its vector of slope processes. Hence, each subsequence converges in distribution to the same limit, so that U^loc_n →d U in the Skorohod topology. In particular,

U^loc_n(0) = n^{1/3}{F̃n(t0) − F0(t0)} →d U(0), in R^K. □
6.2 The limiting distribution of the MLE
As noted in the introduction of this chapter, the limiting processes for the MLE
contain an extra term involving the difference between the sum of the drifting Brow-
nian motions and their convex minorants. We now prove existence and uniqueness of
this system of processes (Theorem 6.9), and convergence of the MLE to its limiting
distribution (Theorem 6.10). We first state the main results.
Theorem 6.9 Let

ak = 1/F0k(t0), k = 1, …, K + 1,   (6.9)

where F0,K+1(t0) = 1 − F0+(t0), and recall the definition of X1, …, XK in Definition 6.2. Then there exists an almost surely unique K-tuple Ĥ = (Ĥ1, …, ĤK) of convex functions with right-continuous derivatives Û = (Û1, …, ÛK), satisfying the following conditions:

(i) ak Ĥk(t) + aK+1 Ĥ+(t) ≤ ak Xk(t) + aK+1 X+(t), for k = 1, …, K, t ∈ R.

(ii) ∫ {ak Ĥk(t) + aK+1 Ĥ+(t) − ak Xk(t) − aK+1 X+(t)} dÛk(t) = 0, k = 1, …, K.

(iii) For each M > 0 and each k = 1, …, K, there exist points τ1k < −M and τ2k > M so that

ak Ĥk(t) + aK+1 Ĥ+(t) = ak Xk(t) + aK+1 X+(t) for t = τ1k and t = τ2k.
Theorem 6.10 For each k = 1, …, K, let F0k be continuously differentiable at t0 with strictly positive derivative f0k(t0). Furthermore, let G be continuously differentiable at t0 with strictly positive derivative g(t0). Let Û = (Û1, …, ÛK) be defined as in Theorem 6.9. Then

n^{1/3}{F̂n(t0) − F0(t0)} →d Û(0), in R^K.
The outline of this section is as follows. In Section 6.2.1 we discuss the processes Ĥ1, …, ĤK, and compare them to the processes H1, …, HK for the naive estimator. In Section 6.2.2 we prove that the processes Ĥ1, …, ĤK are unique. Next, in Section 6.2.3 we prove convergence of the MLE to its limiting distribution. In this proof, we automatically obtain existence of the limiting processes Ĥ1, …, ĤK, hence completing the proof of Theorem 6.9. In this sense our approach differs from the one followed by Groeneboom, Jongbloed and Wellner (2001a,b), who first establish existence and uniqueness of the limiting processes before proving convergence. However, apart from this difference, our approaches are very similar.
6.2.1 The process Ĥ = (Ĥ1, …, ĤK)

We now discuss the processes Ĥ1, …, ĤK. In Lemma 6.11 we study the collection of points of touch between ak Ĥk + aK+1 Ĥ+ and ak Xk + aK+1 X+. The results in this lemma rely on the observation that the process ak Ĥk + aK+1 Ĥ+ is pointwise bounded above by the convex minorant of ak Xk + aK+1 X+. Since the convex minorant of a Brownian motion process plus parabolic drift is well-studied (Groeneboom (1989)), this point of view allows us to deduce properties of ak Ĥk + aK+1 Ĥ+.
Lemma 6.11 Let Sk be the collection of points of touch between ak Ĥk(t) + aK+1 Ĥ+(t) and ak Xk(t) + aK+1 X+(t). Then

(i) Sk is a subset of the points of touch of ak Xk(t) + aK+1 X+(t) and its convex minorant.

(ii) At points t ∈ Sk, the right and left derivatives of ak Ĥk(t) + aK+1 Ĥ+(t) are bounded above and below by the right and left derivatives of the convex minorant of ak Xk(t) + aK+1 X+(t).

Proof: Note that ak Ĥk(t) + aK+1 Ĥ+(t) is a convex function, bounded above by ak Xk(t) + aK+1 X+(t). Hence, ak Ĥk(t) + aK+1 Ĥ+(t) is bounded above by the convex minorant of ak Xk(t) + aK+1 X+(t). This yields (i). Property (ii) then follows immediately from a graphical argument. □
Property (i) of Lemma 6.11 leads to Corollary 6.12, which states that Ĥk is piecewise linear, and Ûk is piecewise constant, for all k = 1, …, K.

Corollary 6.12 Let Ĥ be defined as in Theorem 6.9. Then for each k ∈ {1, …, K}, Ĥk is a piecewise linear function, and Ûk is piecewise constant.

Proof: With probability one, the collection of points of touch between ak Xk(t) + aK+1 X+(t) and its convex minorant has no condensation points in a finite interval (Groeneboom (1989)). By property (i) of Lemma 6.11, this implies that with probability one, Sk has no condensation points in a finite interval. Conditions (i) and (ii) of Theorem 6.9 imply that Ûk can only increase at points t ∈ Sk. Hence, Ûk is piecewise constant and Ĥk is piecewise linear. □
In the discussion preceding Lemma 6.11, we interpreted ak Ĥk(t) + aK+1 Ĥ+(t) as a convex function below ak Xk(t) + aK+1 X+(t). We now make this interpretation more precise. Note that conditions (i) and (ii) of Theorem 6.9 imply that

ak Ĥk(h) + aK+1 Ĥ+(h) = ak Xk(h) + aK+1 X+(h)

at points of change of slope of Ĥk, k = 1, …, K. But ak Ĥk + aK+1 Ĥ+ has a change of slope if any Ĥj, j = 1, …, K, has a change of slope. Thus, ak Ĥk(h) + aK+1 Ĥ+(h) can have changes of slope without touching ak Xk(h) + aK+1 X+(h). This is illustrated in Figures 6.3 and 6.4, for t0 = 1 and t0 = 2 respectively.² For example, in Figure 6.3, we see that a1 Ĥ1(h) + aK+1 Ĥ+(h) has a change of slope just before zero, without touching a1 X1(h) + aK+1 X+(h). This is allowed, since Û1(h) does not have a jump at this point. On the other hand, Û2(h) does have a jump at this point, and we indeed see that a2 Ĥ2(h) + aK+1 Ĥ+(h) touches a2 X2(h) + aK+1 X+(h).
In Lemma 6.13 we give two different interpretations of Ĥ1, …, ĤK that emphasize the difference between the MLE and the naive estimator.

Lemma 6.13 Let Ĥ be defined as in Theorem 6.9. Then Ĥ satisfies the following self-induced convex minorant characterizations:

(a) For each k = 1, …, K, Ĥk(t) is the convex minorant of

Xk(t) + (aK+1/ak) {X+(t) − Ĥ+(t)}.   (6.10)

(b) For each k = 1, …, K, Ĥk(t) is the convex minorant of

Xk(t) + [aK+1/(ak + aK+1)] {X+^{(−k)}(t) − Ĥ+^{(−k)}(t)},   (6.11)

²These figures are made using the localized processes defined in the proof of Theorem 6.10, with n = 100,000. The convex minorant does not fit exactly, due to omission of the term R^loc_{nk}.
where X+^{(−k)}(t) = ∑_{j=1, j≠k}^{K} Xj(t) and Ĥ+^{(−k)}(t) = ∑_{j=1, j≠k}^{K} Ĥj(t).
Proof: Characterization (a) holds since conditions (i) and (ii) of Theorem 6.9 are equivalent to:

Ĥk(t) ≤ Xk(t) + (aK+1/ak) {X+(t) − Ĥ+(t)}, t ∈ R,
∫ { Ĥk(t) − Xk(t) − (aK+1/ak) {X+(t) − Ĥ+(t)} } dĤ′k(t) = 0,

for k = 1, …, K.

Characterization (b) holds since conditions (i) and (ii) of Theorem 6.9 are equivalent to:

Ĥk(t) ≤ Xk(t) + [aK+1/(ak + aK+1)] {X+^{(−k)}(t) − Ĥ+^{(−k)}(t)}, t ∈ R,
∫ { Ĥk(t) − Xk(t) − [aK+1/(ak + aK+1)] {X+^{(−k)}(t) − Ĥ+^{(−k)}(t)} } dĤ′k(t) = 0,

for k = 1, …, K. □
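Characterization (a) suggests a natural fixed-point scheme for computing the self-induced system on a grid: start from the plain convex minorants of the Xk (the naive limits) and repeatedly recompute each minorant with the current correction term. The sketch below is purely our own illustration (function names are assumptions; the thesis does not propose this iteration, and its convergence is assumed here, not proved).

```python
import numpy as np

def gcm(t, x):
    """Greatest convex minorant of (t_i, x_i) on a sorted grid (lower hull)."""
    hull = [0]
    for i in range(1, len(t)):
        while len(hull) >= 2:
            i0, i1 = hull[-2], hull[-1]
            if (t[i1] - t[i0]) * (x[i] - x[i0]) - (x[i1] - x[i0]) * (t[i] - t[i0]) <= 0:
                hull.pop()          # middle point lies on/above the chord
            else:
                break
        hull.append(i)
    return np.interp(t, t[hull], x[hull])

def self_induced_H(t, X, a, n_iter=200):
    """Fixed-point iteration for characterization (a) of Lemma 6.13:
    H_k = GCM( X_k + (a_{K+1}/a_k) (X_+ - H_+) ), started from the naive
    minorants GCM(X_k).  X is a (K, len(t)) array; a has length K + 1."""
    K = X.shape[0]
    H = np.array([gcm(t, X[k]) for k in range(K)])
    for _ in range(n_iter):
        Hplus, Xplus = H.sum(axis=0), X.sum(axis=0)
        H = np.array([gcm(t, X[k] + a[K] / a[k] * (Xplus - Hplus))
                      for k in range(K)])
    return H
```

Since the correction term X+ − H+ is nonnegative (Lemma 6.14), each iterate lies above the naive minorant, consistent with Lemma 6.15.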
Characterization (a) of Lemma 6.13 is illustrated in Figures 6.5 and 6.6, for t0 = 1 and t0 = 2. The top row shows the extra term (aK+1/ak){X+(h) − Ĥ+(h)} in the processes for the MLE. Note that this term appears to be nonnegative. This is indeed the case, and will be proved in Lemma 6.14. Furthermore, note that the extra term (aK+1/ak){X+(h) − Ĥ+(h)} is more prominent for t0 = 2 than for t0 = 1, due to the larger variance of the Brownian motions at t0 = 2, and the fact that aK+1/ak = F0k(t0)/{1 − F0+(t0)} increases with t0. The middle row of the figures depicts Ĥk and Hk, for k = 1, 2. It appears that Ĥk(h) ≥ Hk(h). This is indeed the case and will be proved in Lemma 6.15.

We now discuss the origin of the extra term X+ − Ĥ+ that appears in the limiting processes for the MLE. Recall the differences between the MLE and the naive estimator, discussed in Section 2.1.4:
(a) The log likelihood (2.6) for the MLE contains a term involving FK+1(u) = 1 − F+(u), while the log likelihood (2.11) for the naive estimator does not include such a term;

(b) The space FK for the MLE includes the constraint that the sum of the sub-distribution functions is bounded by one, while the corresponding space for the naive estimator does not include such a constraint.

These differences were also present in the convex minorant characterization in equation (2.58), where the convex minorant characterization for the MLE contained two extra terms: an F̂n+-term and a βnF̂n-term. The F̂n+-term came from the term 1 − F+(t) in the log likelihood (2.6), and the βnF̂n-term came from the constraint on FK. For the local limiting distribution at an interior point, the constraint on FK does not play a role. Hence, we do not see the βnF̂n-term in the limiting process for the MLE. On the other hand, the F̂n+-term does play a role, and results in the extra term X+ − Ĥ+.
In Lemma 6.14 we show that X+(t) − Ĥ+(t) ≥ 0 for all t ∈ R. This inequality is illustrated in Figures 6.5 and 6.6.

Lemma 6.14 Let Ĥ be defined as in Theorem 6.9. Then

Ĥ+(t) ≤ X+(t), t ∈ R.

Proof: Note that condition (i) of Theorem 6.9 can be written as

Ĥk(t) + (aK+1/ak) Ĥ+(t) ≤ Xk(t) + (aK+1/ak) X+(t), k = 1, …, K, t ∈ R.

Plugging in the values of a1, …, aK+1 as defined in (6.9) yields

Ĥk(t) + [F0k(t0)/{1 − F0+(t0)}] Ĥ+(t) ≤ Xk(t) + [F0k(t0)/{1 − F0+(t0)}] X+(t), k = 1, …, K, t ∈ R.

Summing over k = 1, …, K gives

Ĥ+(t) + [F0+(t0)/{1 − F0+(t0)}] Ĥ+(t) ≤ X+(t) + [F0+(t0)/{1 − F0+(t0)}] X+(t), t ∈ R,

and this is equivalent to Ĥ+(t) ≤ X+(t) for all t ∈ R. □
We now use Lemma 6.14 to compare the MLE and the naive estimator, and find that Hk ≤ Ĥk. This inequality is also illustrated in Figures 6.5 and 6.6.

Lemma 6.15 The following relation holds:

Hk(t) ≤ Ĥk(t), k = 1, …, K.

Proof: Recall that Hk(t) is the convex minorant of Xk(t). In Lemma 6.14, we saw that the adjustment (aK+1/ak){X+(t) − Ĥ+(t)} for the MLE is nonnegative. Hence, Hk(t) is a convex function below Xk(t) + (aK+1/ak){X+(t) − Ĥ+(t)}. Since Ĥk(t) is the convex minorant of Xk(t) + (aK+1/ak){X+(t) − Ĥ+(t)} (Lemma 6.13), it follows that Hk(t) ≤ Ĥk(t), k = 1, …, K. □

The following point of view is related to Lemma 6.15. By Theorem 6.4, we know that Hk is the convex minorant of Xk(t). This implies that Hk(t) ≤ Xk(t), and by summing over k = 1, …, K, we also have H+(t) ≤ X+(t). Hence, Hk(t) ≤ Xk(t) + (aK+1/ak){X+(t) − H+(t)}, so that the naive estimator satisfies the inequality conditions for the MLE. However, the naive estimator does not satisfy the equality conditions, since typically H+(t) does not equal X+(t) when Hk(t) has a change of slope. Hence, Hk(t) is a convex function below Xk(t) + (aK+1/ak){X+(t) − H+(t)}, but it is typically not the convex minorant.
6.2.2 Uniqueness of the limiting process

In order to prove that the limiting process Ĥ = (Ĥ1, …, ĤK) defined in Theorem 6.9 is unique, we need that Ûk(t) is tight for each t ∈ R. Such a tightness result
Figure 6.3: Limiting processes for the MLE, for the model given in Example 6.5 and t0 = 1. The top row shows the drifted Brownian motion process ak Xk(h) + aK+1 X+(h) (black) and the convex function ak Ĥk(h) + aK+1 Ĥ+(h) (green), k = 1, 2. The parabolic drift is denoted by a black dashed line. The bottom row shows the slope process Ûk(h) (green), with a black dashed line of slope f0k(t0), k = 1, 2.
Figure 6.4: Limiting processes for the MLE, for the model given in Example 6.5 andt0 = 2. Please see Figure 6.3 for further explanation.
[Figure 6.5 appears here: six panels ("Correction term MLE, k=1", "Correction term MLE, k=2", "Convex minorant, k=1", "Convex minorant, k=2", "Slope convex minorant, k=1", "Slope convex minorant, k=2"), each plotted against h, with a legend distinguishing the MLE and the naive estimator.]

Figure 6.5: Comparison of the limiting processes for the MLE and the naive estimator, for the model given in Example 6.5 and \(t_0 = 1\). The top row shows \((a_{K+1}/a_k)\{X_+(h) - \hat H_+(h)\}\), \(k = 1, 2\). The middle row shows \(X_k(h)\) (grey) with convex minorant \(\tilde H_k(h)\) (red), and \(X_k(h) + (a_{K+1}/a_k)\{X_+(h) - \hat H_+(h)\}\) (black) with convex minorant \(\hat H_k(h)\) (green), \(k = 1, 2\). The bottom row shows \(\tilde U_k(h)\) (red) and \(\hat U_k(h)\) (green), together with a dashed line of slope \(f_{0k}(t_0)\), \(k = 1, 2\).
[Figure 6.6 appears here: the same six panels as Figure 6.5, plotted against h.]

Figure 6.6: Comparison of the limiting processes for the MLE and the naive estimator, for the model given in Example 6.5 and \(t_0 = 2\). Please see Figure 6.5 for further explanation.
is analogous to the local rate of convergence result in Theorem 5.20. Hence, as in
Section 5.3, we first prove a stronger tightness result for U+(t) in Lemma 6.16. We
then use this to prove tightness of the components Uk(t) in Lemma 6.20. Finally, we
use this tightness result in Lemma 6.21 to prove that the limiting process is unique.
Lemma 6.16 Let
\[
F_{0k}(t) = f_{0k}(t_0)\,t, \qquad k = 1, \dots, K + 1. \tag{6.12}
\]
Let \(U = (U_1, \dots, U_K)\) satisfy the characterization of Theorem 6.9. For \(\beta \in (0, 1)\) we define
\[
v(t) = \begin{cases} 1, & \text{if } |t| \le 1, \\ |t|^{\beta}, & \text{if } |t| > 1. \end{cases} \tag{6.13}
\]
Then for every \(\epsilon > 0\) there exists an \(M = M(\epsilon)\) such that for every \(s \in \mathbb{R}\),
\[
P\left( \sup_{t \in \mathbb{R}} \frac{\bigl| U_+(t) - F_{0+}(t) \bigr|}{v(t - s)} \ge M \right) < \epsilon.
\]
Proof: Let \(\epsilon > 0\). We first prove the result for \(s = 0\). It is sufficient to show that we can choose \(M > 0\) such that
\[
P\bigl(\exists t \in \mathbb{R} : U_+(t) \notin \bigl(F_{0+}(t - Mv(t)),\, F_{0+}(t + Mv(t))\bigr)\bigr)
= P\bigl(\exists t \in \mathbb{R} : |U_+(t) - F_{0+}(t)| \ge f_{0+}(t_0)Mv(t)\bigr) < \epsilon.
\]
In fact, we only prove that there exists an \(M\) such that
\[
P\bigl(\exists t \in [0, \infty) : U_+(t) \ge F_{0+}(t + Mv(t))\bigr) < \frac{\epsilon}{4}, \tag{6.14}
\]
since the proofs for \(U_+(t) \le F_{0+}(t - Mv(t))\) and \((-\infty, 0]\) are analogous. We put a grid on \([0, \infty)\), with grid points \(j \in \mathbb{N} = \{0, 1, \dots\}\). Then it is sufficient to show that we can choose \(M\) such that
\[
P\bigl(\exists t \in [j, j + 1) : U_+(t) \ge F_{0+}(t + Mv(t))\bigr) \le p_{jM}, \qquad j \in \mathbb{N}, \tag{6.15}
\]
where \(p_{jM}\) is defined by
\[
p_{jM} = d_1 \exp\bigl(-d_2 (Mv(j))^3\bigr), \tag{6.16}
\]
and \(d_1\) and \(d_2\) are positive constants. To see that this is sufficient, note that (6.15) yields
\[
P\bigl(\exists t \in [0, \infty) : U_+(t) \ge F_{0+}(t + Mv(t))\bigr) \le \sum_{j=0}^{\infty} p_{jM}.
\]
For each \(\beta \in (0, 1)\), the sum \(\sum_{j=0}^{\infty} p_{jM}\) can be made arbitrarily small by choosing \(M\) large, which proves (6.14).

In the remainder we prove (6.15). Using the monotonicity of \(U_+\), it is sufficient to show that \(P(A_j) \le p_{jM}\) for all \(j \in \mathbb{N}\), where
\[
A_j = \bigl\{ U_+(j + 1) \ge F_{0+}(j + Mv(j)) \bigr\}.
\]
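To make the claim that \(\sum_{j=0}^{\infty} p_{jM}\) is small for large \(M\) concrete, one possible bound (our elaboration, using only (6.13) and (6.16)) starts from \(M^3 v(j)^3 \ge \frac{1}{2}\{M^3 + v(j)^3\}\) for \(M \ge 1\):

```latex
\sum_{j=0}^{\infty} p_{jM}
  = d_1 \sum_{j=0}^{\infty} \exp\bigl(-d_2 M^3 v(j)^3\bigr)
  \le d_1 e^{-d_2 M^3/2} \sum_{j=0}^{\infty} e^{-d_2 v(j)^3/2}
  \le d_1 e^{-d_2 M^3/2}
     \Bigl( 2e^{-d_2/2} + \sum_{j=2}^{\infty} e^{-d_2 j^{3\beta}/2} \Bigr),
```

and the last series is finite for every \(\beta \in (0, 1)\), so the bound tends to zero as \(M \to \infty\).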
Fix \(j \in \mathbb{N}\). By property (iii) of Theorem 6.9, \(a_kH_k + a_{K+1}H_+\) and \(a_kX_k + a_{K+1}X_+\) have points of touch to the left of \(j + 1\), for each \(k = 1, \dots, K\). For each \(k\), define \(\tau_k\) to be the largest such point. Without loss of generality, we assume that the sub-distribution functions are labeled such that \(\tau_1 \le \dots \le \tau_K\). On the event \(A_j\), there is a \(k \in \{1, \dots, K\}\) such that \(U_k(j + 1) \ge F_{0k}(j + Mv(j))\). Hence, we can define \(l \in \{1, \dots, K\}\) such that
\[
U_k(j + 1) < F_{0k}(j + Mv(j)), \qquad k = l + 1, \dots, K, \tag{6.17}
\]
\[
U_l(j + 1) \ge F_{0l}(j + Mv(j)). \tag{6.18}
\]
The definition of \(\tau_l\) and condition (i) of Theorem 6.9 imply that
\[
a_lH_l(\tau_l) + a_{K+1}H_+(\tau_l) = a_lX_l(\tau_l) + a_{K+1}X_+(\tau_l),
\]
\[
a_lH_l(t) + a_{K+1}H_+(t) \le a_lX_l(t) + a_{K+1}X_+(t), \qquad t \in \mathbb{R}.
\]
Dividing both lines by \(a_l\), and subtracting the first line from the second yields, for \(t = j + Mv(j)\):
\[
\int_{\tau_l}^{j+Mv(j)} \Bigl[ U_l(t)dt - dX_l(t) + \frac{a_{K+1}}{a_l}\bigl\{ U_+(t)dt - dX_+(t) \bigr\} \Bigr] \le 0.
\]
Hence,
\[
P(A_j) = P\left( \Bigl\{ \int_{\tau_l}^{j+Mv(j)} \Bigl[ U_l(t)dt - dX_l(t) + \frac{a_{K+1}}{a_l}\bigl\{ U_+(t)dt - dX_+(t) \bigr\} \Bigr] \le 0 \Bigr\} \cap A_j \right)
\]
\[
\le P\left( \Bigl\{ \int_{\tau_l}^{j+Mv(j)} \bigl[ U_l(t)dt - dX_l(t) \bigr] \le 0 \Bigr\} \cap A_j \right) \tag{6.19}
\]
\[
+ P\left( \Bigl\{ \int_{\tau_l}^{j+Mv(j)} \bigl[ U_+(t)dt - dX_+(t) \bigr] \le 0 \Bigr\} \cap A_j \right). \tag{6.20}
\]
Using the definition of \(\tau_l\) and the fact that \(U_l\) is monotone nondecreasing and piecewise constant (Corollary 6.12), it follows that on the event \(A_j\) we have for \(t \ge \tau_l\)
\[
U_l(t) \ge U_l(\tau_l) = U_l(j + 1) \ge F_{0l}(j + Mv(j)).
\]
Hence we can bound (6.19) above by
\[
P\left( \int_{\tau_l}^{j+Mv(j)} \bigl[ F_{0l}(j + Mv(j))dt - dX_l(t) \bigr] \le 0 \right)
\le P\left( \sup_{\substack{k \in \{1, \dots, K\} \\ w \le j+1}} \int_{w}^{j+Mv(j)} \bigl[ dX_k(t) - F_{0k}(j + Mv(j))dt \bigr] \ge 0 \right) \le p_{jM}/2,
\]
where the last inequality follows from Lemma 6.17 below. The term (6.20) is also bounded by \(p_{jM}/2\), using Lemma 6.18 below. For \(s \ne 0\), the proof is exactly the same, using stationarity of the increments of Brownian motion. \(\Box\)
Lemmas 6.17 and 6.18 are the key lemmas in the proof of Lemma 6.16. They are analogous to Lemmas 5.15 and 5.16. To gain insight into Lemma 6.17, recall that \(dX_k(t) = dW_k(t)/g(t_0) + F_{0k}(t)dt\). Hence,
\[
\int_{w}^{j+Mv(j)} \bigl[ F_{0k}(j + Mv(j))dt - dX_k(t) \bigr]
= \int_{w}^{j+Mv(j)} \bigl[ F_{0k}(j + Mv(j) - t)dt - dW_k(t)/g(t_0) \bigr]
= \frac{1}{2} f_{0k}(t_0)(j + Mv(j) - w)^2 - \int_{w}^{j+Mv(j)} dW_k(t)/g(t_0).
\]
The quadratic drift dominates the term \((j + Mv(j) - w)^{3/2}C\) and the Brownian motion. We obtain uniformity in \(w\) by using a second grid, and we obtain the exponential bounds by using standard properties of Brownian motion. Finally, note that the lemma also holds when \((j + Mv(j) - w)^{3/2}C\) is omitted, since this term is positive for \(M > 1\) and \(w \le j + 1\).
Lemma 6.17 There exists an \(M > 1\) such that for all \(j \in \mathbb{N}\):
\[
P\left( \sup_{\substack{k \in \{1, \dots, K\} \\ w \le j+1}} \Bigl\{ \int_{w}^{j+Mv(j)} \bigl[ dX_k(t) - F_{0k}(j + Mv(j))dt \bigr] + (j + Mv(j) - w)^{3/2}C \Bigr\} \ge 0 \right) \le p_{jM}/2,
\]
where \(v(\cdot)\) and \(p_{jM}\) are defined in (6.13) and (6.16).
where v(·) and pjM are defined in (6.13) and (6.16).
Analogously to Lemma 5.16, Lemma 6.18 relies on the system of component processes.
By playing out the different component processes against each other, we can reduce
the problem to a situation to which Lemma 6.17 can be applied.
Lemma 6.18 There exists an \(M > 0\) such that for all \(j \in \mathbb{N}\),
\[
P\left( \Bigl\{ \int_{\tau_l}^{j+Mv(j)} \bigl[ dX_+(t) - U_+(t)dt \bigr] \ge 0 \Bigr\} \cap A_j \right) \le p_{jM}/2,
\]
where \(v(\cdot)\) and \(p_{jM}\) are defined in (6.13) and (6.16), \(\tau_l\) is the last point of touch between \(a_lH_l(t) + a_{K+1}H_+(t)\) and \(a_lX_l(t) + a_{K+1}X_+(t)\) before \(j + 1\), and \(l\) is defined in (6.17) and (6.18).
Lemma 6.16 with \(\beta = 1/2\) leads to the following corollary, which we give without proof. It is analogous to Corollary 5.19.

Corollary 6.19 Let \(U = (U_1, \dots, U_K)\) satisfy the characterization of Theorem 6.9. Then for every \(\epsilon > 0\) there is a \(C = C(\epsilon)\) such that for all \(s \in \mathbb{R}\):
\[
P\left( \sup_{u \in \mathbb{R}_+} \frac{\int_{s-u}^{s} \bigl| U_+(t) - F_{0+}(t) \bigr|\,dt}{u \vee u^{3/2}} \ge C \right) < \epsilon.
\]
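A sketch of how the corollary can be deduced from Lemma 6.16 (the thesis omits the proof; this is our reading): taking \(\beta = 1/2\), and working on the event where the supremum in Lemma 6.16 is below \(M\), we have, using that \(v\) is even,

```latex
\int_{s-u}^{s} \bigl| U_+(t) - F_{0+}(t) \bigr|\,dt
  \le M \int_{s-u}^{s} v(t-s)\,dt
  = M \int_{0}^{u} v(r)\,dr
  \le M \bigl( u \vee u^{3/2} \bigr),
```

since \(\int_0^u v(r)\,dr = u\) for \(u \le 1\), while for \(u > 1\) it equals \(1 + \frac{2}{3}(u^{3/2} - 1) = \frac{1}{3} + \frac{2}{3}u^{3/2} \le u^{3/2}\). Dividing by \(u \vee u^{3/2}\) and taking \(C = M\) gives the statement of the corollary.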
We can now prove tightness of \(U_k(t)\), for \(k = 1, \dots, K\), \(t \in \mathbb{R}\). Lemma 6.20 is analogous to Theorem 5.20.

Lemma 6.20 Let \(U = (U_1, \dots, U_K)\) satisfy the characterization of Theorem 6.9. Then for every \(\epsilon > 0\) there is an \(M = M(\epsilon)\) such that for all \(k = 1, \dots, K\) and \(t \in \mathbb{R}\):
\[
P\bigl( \bigl| U_k(t) - F_{0k}(t) \bigr| \ge M \bigr) < \epsilon.
\]

Proof: Let \(k \in \{1, \dots, K\}\), \(t \in \mathbb{R}\) and \(\epsilon > 0\). It is sufficient to show that there exists an \(M > 1\) such that
\[
P\bigl( U_k(t) \ge F_{0k}(t + M) \bigr) < \epsilon, \tag{6.21}
\]
\[
P\bigl( U_k(t) \le F_{0k}(t - M) \bigr) < \epsilon. \tag{6.22}
\]
We only prove (6.21), since the proof of (6.22) is analogous. Define
\[
B_k = \bigl\{ U_k(t) \ge F_{0k}(t + M) \bigr\}. \tag{6.23}
\]
Let \(\tau_k\) be the last point of touch between \(a_kH_k + a_{K+1}H_+\) and \(a_kX_k + a_{K+1}X_+\) before \(t\). Such a point exists by condition (iii) of Theorem 6.9. Together with condition (i) of Theorem 6.9, this implies that
\[
\int_{\tau_k}^{t+M} \Bigl[ U_k(s)ds - dX_k(s) + \frac{a_{K+1}}{a_k}\bigl( U_+(s)ds - dX_+(s) \bigr) \Bigr] \le 0.
\]
Hence,
\[
P(B_k) = P\left( \Bigl\{ \int_{\tau_k}^{t+M} \Bigl[ U_k(s)ds - dX_k(s) + \frac{a_{K+1}}{a_k}\bigl( U_+(s)ds - dX_+(s) \bigr) \Bigr] \le 0 \Bigr\} \cap B_k \right).
\]
Note that
\[
\left| \int_{\tau_k}^{t+M} \bigl\{ U_+(s)ds - dX_+(s) \bigr\} \right|
\le \int_{\tau_k}^{t+M} \bigl| U_+(s) - F_{0+}(s) \bigr|\,ds + \left| \int_{\tau_k}^{t+M} \bigl\{ F_{0+}(s)ds - dX_+(s) \bigr\} \right|. \tag{6.24}
\]
We bound the first term of (6.24) by Corollary 6.19: for every \(\epsilon > 0\) there exists a \(C > 0\), such that for all \(t \in \mathbb{R}\):
\[
P\left( \sup_{u \in \mathbb{R}_+} \frac{\int_{t-u}^{t+M} \bigl| U_+(s) - F_{0+}(s) \bigr|\,ds}{(M + u)^{3/2}} > C \right) < \frac{\epsilon}{2},
\]
using that \((M + u) \vee (M + u)^{3/2} = (M + u)^{3/2}\) for \(M > 1\) and \(u > 0\). For the second term of (6.24), note that \(dX_+(s) = dW_+(s)/g(t_0) + F_{0+}(s)ds\), so that
\[
\int_{\tau_k}^{t+M} \bigl\{ F_{0+}(s)ds - dX_+(s) \bigr\} = -\int_{\tau_k}^{t+M} \frac{dW_+(s)}{g(t_0)}.
\]
For every \(\gamma > 0\) and \(\epsilon > 0\), we can find \(M > 0\) such that
\[
P\left( \sup_{u \in \mathbb{R}_+} \Bigl\{ \int_{[t-u, t+M)} \frac{dW_+(s)}{g(t_0)} - \gamma(M + u)^2 \Bigr\} \ge 0 \right) < \epsilon,
\]
see (6.40).

Furthermore, using the definition of \(\tau_k\) and the fact that \(U_k\) is monotone nondecreasing and piecewise constant (Corollary 6.12), it follows that on the event \(B_k\), we have for all \(s \ge \tau_k\):
\[
U_k(s) \ge U_k(\tau_k) = U_k(t) \ge F_{0k}(t + M).
\]
Hence, for every \(\gamma > 0\) and \(\epsilon > 0\), we can find \(C > 0\) and \(M_1 > 0\) such that for all \(M \ge M_1\):
\[
P(B_k) = P\left( \Bigl\{ \int_{\tau_k}^{t+M} \Bigl[ U_k(s)ds - dX_k(s) + \frac{a_{K+1}}{a_k}\bigl( U_+(s)ds - dX_+(s) \bigr) \Bigr] \le 0 \Bigr\} \cap B_k \right)
\]
\[
\le \frac{\epsilon}{2} + P\left( \sup_{w \le t} \Bigl\{ \int_{w}^{t+M} \bigl[ dX_k(s) - F_{0k}(t + M)ds \bigr] + (t + M - w)^{3/2}C + \gamma(t + M - w)^2 \Bigr\} \ge 0 \right).
\]
This probability can be made arbitrarily small by choosing \(\gamma\) small and \(M\) large, by a slight adaptation of Lemma 6.17. The choice \(\gamma = \frac{1}{8} f_{0k}(t_0)\) works. \(\Box\)
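One way to see why \(\gamma = \frac{1}{8} f_{0k}(t_0)\) suffices (our check, following the proof of Lemma 6.17): write \(x = t + M - w\). The quadratic drift \(\frac{1}{2} f_{0k}(t_0)x^2\) produced in that proof still dominates after subtracting both perturbation terms, since for \(x\) large enough that \(Cx^{3/2} \le \frac{1}{4} f_{0k}(t_0)x^2\),

```latex
\frac{1}{2} f_{0k}(t_0)\,x^2 - C x^{3/2} - \gamma x^2
  \ge \Bigl( \frac{1}{2} - \frac{1}{4} - \frac{1}{8} \Bigr) f_{0k}(t_0)\,x^2
  = \frac{1}{8} f_{0k}(t_0)\,x^2,
```

so a positive quadratic drift remains and the exponential bounds of Lemma 6.17 go through unchanged.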
Lemma 6.21 Let \(H\) and \(\tilde H\) be two \(K\)-tuples satisfying the conditions of Theorem 6.9. Then \(H \equiv \tilde H\) almost surely.
Proof: Let \(H = (H_1, \dots, H_K)\) and \(\tilde H = (\tilde H_1, \dots, \tilde H_K)\) be two processes satisfying the conditions of Theorem 6.9, and let \(U = (U_1, \dots, U_K)\) and \(\tilde U = (\tilde U_1, \dots, \tilde U_K)\) be the corresponding derivatives, i.e., \(U_k = H_k'\) and \(\tilde U_k = \tilde H_k'\) for \(k = 1, \dots, K\). We define
\[
\phi_m(U) = \sum_{k=1}^{K} a_k \Bigl[ \frac{1}{2} \int_{-m}^{m} U_k^2(t)dt - \int_{-m}^{m} U_k(t)dX_k(t) \Bigr]
+ a_{K+1} \Bigl[ \frac{1}{2} \int_{-m}^{m} U_+^2(t)dt - \int_{-m}^{m} U_+(t)dX_+(t) \Bigr], \qquad m \in \mathbb{N}.
\]
Note that
\[
\phi_m(\tilde U) - \phi_m(U)
= \sum_{k=1}^{K} \frac{a_k}{2} \int_{-m}^{m} \bigl\{ \tilde U_k^2(t) - U_k^2(t) \bigr\} dt - \sum_{k=1}^{K} a_k \int_{-m}^{m} \bigl\{ \tilde U_k(t) - U_k(t) \bigr\} dX_k(t) \tag{6.25}
\]
\[
+ \frac{a_{K+1}}{2} \int_{-m}^{m} \bigl\{ \tilde U_+^2(t) - U_+^2(t) \bigr\} dt - a_{K+1} \int_{-m}^{m} \bigl\{ \tilde U_+(t) - U_+(t) \bigr\} dX_+(t). \tag{6.26}
\]
Using \(\tilde U_k^2 - U_k^2 = (\tilde U_k - U_k)^2 + 2U_k(\tilde U_k - U_k)\), we rewrite the first term of (6.25) as
\[
\sum_{k=1}^{K} \frac{a_k}{2} \int_{-m}^{m} \bigl\{ \tilde U_k^2(t) - U_k^2(t) \bigr\} dt
= \sum_{k=1}^{K} \frac{a_k}{2} \int_{-m}^{m} \bigl\{ \tilde U_k(t) - U_k(t) \bigr\}^2 dt + \sum_{k=1}^{K} a_k \int_{-m}^{m} U_k(t)\bigl\{ \tilde U_k(t) - U_k(t) \bigr\} dt.
\]
Similarly, we rewrite the first term of (6.26) as
\[
\frac{a_{K+1}}{2} \int_{-m}^{m} \bigl\{ \tilde U_+^2(t) - U_+^2(t) \bigr\} dt
= \frac{a_{K+1}}{2} \int_{-m}^{m} \bigl\{ \tilde U_+(t) - U_+(t) \bigr\}^2 dt + a_{K+1} \int_{-m}^{m} U_+(t)\bigl\{ \tilde U_+(t) - U_+(t) \bigr\} dt.
\]
Defining
\[
A_k(t) = a_k\bigl\{ H_k(t) - X_k(t) \bigr\} + a_{K+1}\bigl\{ H_+(t) - X_+(t) \bigr\},
\]
\[
\tilde A_k(t) = a_k\bigl\{ \tilde H_k(t) - X_k(t) \bigr\} + a_{K+1}\bigl\{ \tilde H_+(t) - X_+(t) \bigr\},
\]
this yields
\[
\phi_m(\tilde U) - \phi_m(U) = \sum_{k=1}^{K} \frac{a_k}{2} \int_{-m}^{m} \bigl\{ \tilde U_k(t) - U_k(t) \bigr\}^2 dt + \frac{a_{K+1}}{2} \int_{-m}^{m} \bigl\{ \tilde U_+(t) - U_+(t) \bigr\}^2 dt
+ \sum_{k=1}^{K} \int_{-m}^{m} \bigl\{ \tilde U_k(t) - U_k(t) \bigr\} dA_k(t). \tag{6.27}
\]
Using integration by parts, we rewrite the third term on the right side of (6.27):
\[
\sum_{k=1}^{K} \int_{-m}^{m} \bigl\{ \tilde U_k(t) - U_k(t) \bigr\} dA_k(t)
= \sum_{k=1}^{K} \bigl\{ \tilde U_k(t) - U_k(t) \bigr\} A_k(t) \Big|_{-m}^{m} - \sum_{k=1}^{K} \int_{-m}^{m} A_k(t)\,d\bigl\{ \tilde U_k(t) - U_k(t) \bigr\}
\ge \sum_{k=1}^{K} \bigl\{ \tilde U_k(t) - U_k(t) \bigr\} A_k(t) \Big|_{-m}^{m}. \tag{6.28}
\]
The inequality on the last line follows from the following two facts:

(a) \(\int_{-m}^{m} A_k(t)\,dU_k(t) = 0\), since \(A_k(t) = 0\) at points of jump of \(U_k\) by conditions (i) and (ii) of Theorem 6.9;

(b) \(\int_{-m}^{m} A_k(t)\,d\tilde U_k(t) \le 0\), since \(A_k(t) \le 0\) by condition (i) of Theorem 6.9, and \(\tilde U_k\) is monotone nondecreasing.

Together, (a) and (b) give \(-\int_{-m}^{m} A_k(t)\,d\{\tilde U_k(t) - U_k(t)\} = -\int_{-m}^{m} A_k(t)\,d\tilde U_k(t) \ge 0\).

By combining (6.27) and (6.28), we obtain
\[
\phi_m(\tilde U) - \phi_m(U) \ge \sum_{k=1}^{K} \frac{a_k}{2} \int_{-m}^{m} \bigl\{ \tilde U_k(t) - U_k(t) \bigr\}^2 dt + \frac{a_{K+1}}{2} \int_{-m}^{m} \bigl\{ \tilde U_+(t) - U_+(t) \bigr\}^2 dt
+ \sum_{k=1}^{K} \bigl\{ \tilde U_k(t) - U_k(t) \bigr\} A_k(t) \Big|_{-m}^{m}.
\]
Using the same expression, but with \(U\) and \(\tilde U\) interchanged, we get
\[
0 = \phi_m(\tilde U) - \phi_m(U) + \phi_m(U) - \phi_m(\tilde U)
\ge \sum_{k=1}^{K} a_k \int_{-m}^{m} \bigl\{ \tilde U_k(t) - U_k(t) \bigr\}^2 dt + a_{K+1} \int_{-m}^{m} \bigl\{ \tilde U_+(t) - U_+(t) \bigr\}^2 dt
\]
\[
+ \sum_{k=1}^{K} \bigl\{ \tilde U_k(t) - U_k(t) \bigr\} A_k(t) \Big|_{-m}^{m} + \sum_{k=1}^{K} \bigl\{ U_k(t) - \tilde U_k(t) \bigr\} \tilde A_k(t) \Big|_{-m}^{m}.
\]
We rewrite the last two terms of this display as follows:
\[
\sum_{k=1}^{K} \bigl\{ \tilde U_k(t) - U_k(t) \bigr\} A_k(t) \Big|_{-m}^{m} + \sum_{k=1}^{K} \bigl\{ U_k(t) - \tilde U_k(t) \bigr\} \tilde A_k(t) \Big|_{-m}^{m}
\]
\[
= \sum_{k=1}^{K} \Bigl[ \bigl\{ \tilde U_k(m) - U_k(m) \bigr\}\bigl\{ A_k(m) - \tilde A_k(m) \bigr\} + \bigl\{ \tilde U_k(-m) - U_k(-m) \bigr\}\bigl\{ \tilde A_k(-m) - A_k(-m) \bigr\} \Bigr].
\]
It then follows that
\[
\sum_{k=1}^{K} a_k \int_{-m}^{m} \bigl\{ \tilde U_k(t) - U_k(t) \bigr\}^2 dt + a_{K+1} \int_{-m}^{m} \bigl\{ \tilde U_+(t) - U_+(t) \bigr\}^2 dt
\]
\[
\le \sum_{k=1}^{K} \Bigl[ \bigl\{ \tilde U_k(m) - U_k(m) \bigr\}\bigl\{ \tilde A_k(m) - A_k(m) \bigr\} + \bigl\{ \tilde U_k(-m) - U_k(-m) \bigr\}\bigl\{ A_k(-m) - \tilde A_k(-m) \bigr\} \Bigr].
\]
This inequality holds for all \(m \in \mathbb{N}\), and hence we can take \(\liminf_{m \to \infty}\). On the left side we can replace \(\liminf_{m \to \infty}\) by \(\lim_{m \to \infty}\), since this is a monotone sequence in \(m\):
\[
\sum_{k=1}^{K} a_k \int \bigl\{ \tilde U_k(t) - U_k(t) \bigr\}^2 dt + a_{K+1} \int \bigl\{ \tilde U_+(t) - U_+(t) \bigr\}^2 dt
\]
\[
\le \liminf_{m \to \infty} \sum_{k=1}^{K} \Bigl[ \bigl\{ \tilde U_k(m) - U_k(m) \bigr\}\bigl\{ \tilde A_k(m) - A_k(m) \bigr\} + \bigl\{ \tilde U_k(-m) - U_k(-m) \bigr\}\bigl\{ A_k(-m) - \tilde A_k(-m) \bigr\} \Bigr]. \tag{6.29}
\]
We will now show that the right side of (6.29) is almost surely equal to zero. We prove this in two steps. First, we show that it is of order \(O_p(1)\), and then we use this to show that it is almost surely equal to zero.

To show that the right side of (6.29) is of order \(O_p(1)\), let \(k \in \{1, \dots, K\}\), and note that the tightness of Lemma 6.20 yields that \(U_k(m) - F_{0k}(m)\) and \(\tilde U_k(m) - F_{0k}(m)\) are of order \(O_p(1)\). This implies that also \(\tilde U_k(m) - U_k(m)\) is of order \(O_p(1)\). Furthermore, Lemma 6.20 implies that the distance of \(m\) to jump points of \(U_k\) and \(\tilde U_k\) is of order \(O_p(1)\). This means that both \(A_k(m)\) and \(\tilde A_k(m)\) are of order \(O_p(1)\), and hence also \(\tilde A_k(m) - A_k(m)\) is of order \(O_p(1)\). Using the same argument for \(-m\), this proves that the right side of (6.29) is of order \(O_p(1)\).

We will now use this result to show that the right hand side of (6.29) is almost surely equal to zero. Let \(k \in \{1, \dots, K\}\) and \(\eta > 0\). We will show that
\[
P\Bigl( \liminf_{m \to \infty} \bigl| \tilde U_k(m) - U_k(m) \bigr| \bigl| \tilde A_k(m) - A_k(m) \bigr| > \eta \Bigr) = 0. \tag{6.30}
\]
Since
\[
P\Bigl( \liminf_{m \to \infty} \bigl| \tilde U_k(m) - U_k(m) \bigr| \bigl| \tilde A_k(m) - A_k(m) \bigr| > \eta \Bigr)
\le \liminf_{m \to \infty} P\Bigl( \bigl| \tilde U_k(m) - U_k(m) \bigr| \bigl| \tilde A_k(m) - A_k(m) \bigr| > \eta \Bigr),
\]
it is sufficient to show that
\[
\liminf_{m \to \infty} P\Bigl( \bigl| \tilde U_k(m) - U_k(m) \bigr| \bigl| \tilde A_k(m) - A_k(m) \bigr| > \eta \Bigr) = 0. \tag{6.31}
\]
Let \(\tau_{km}\) be the last jump point of \(U_k\) before \(m\). Let \(\tau_{km}^-\) be the last jump point of \(\tilde U_k\) at or before \(\tau_{km}\), and let \(\tau_{km}^+\) be the first jump point of \(\tilde U_k\) after \(\tau_{km}\). We now define the following events:
\[
E_{1m} = E_{1m}(\epsilon) = \Bigl\{ \int_{\tau_{km}^-}^{\infty} \bigl\{ \tilde U_k(t) - U_k(t) \bigr\}^2 dt < \epsilon \Bigr\},
\]
\[
E_{2m} = E_{2m}(\delta) = \bigl\{ \text{size of jump of } U_k \text{ at } \tau_{km} > \delta \bigr\},
\]
\[
E_{3m} = E_{3m}(C) = \bigl\{ \bigl| \tilde U_k(m) - U_k(m) \bigr| < C \bigr\},
\]
\[
E_m = E_m(\epsilon, \delta, C) = E_{1m}(\epsilon) \cap E_{2m}(\delta) \cap E_{3m}(C).
\]
Let \(\epsilon_1 > 0\) and \(\epsilon_2 > 0\). Since the right side of (6.29) is of order \(O_p(1)\), it follows that \(\int \{\tilde U_k(t) - U_k(t)\}^2 dt = O_p(1)\) for every \(k \in \{1, \dots, K\}\). This implies that \(\int_m^{\infty} \{\tilde U_k(t) - U_k(t)\}^2 dt \to_p 0\) as \(m \to \infty\). Together with the fact that \(m - \tau_{km}^- = O_p(1)\), this implies that there is an \(m_1 > 0\) such that \(P(E_{1m}(\epsilon_1)^c) < \epsilon_1\) for all \(m > m_1\). Using a stationarity argument (which actually needs some further elaboration) it follows that there are \(\delta > 0\) and \(m_2 > 0\) so that \(P(E_{2m}(\delta)^c) < \epsilon_2/2\) for all \(m > m_2\). By tightness of \(\tilde U_k(m) - U_k(m)\), there are \(C > 0\) and \(m_3 > 0\) so that \(P(E_{3m}(C)^c) < \epsilon_2/2\) for all \(m > m_3\). Combining these observations yields that \(P(E_m(\epsilon_1, \delta, C)^c) < \epsilon_1 + \epsilon_2\) for all \(m > m_0 = \max\{m_1, m_2, m_3\}\).
Returning to (6.31), we now have
\[
\liminf_{m \to \infty} P\Bigl( \bigl| \tilde U_k(m) - U_k(m) \bigr| \bigl| \tilde A_k(m) - A_k(m) \bigr| > \eta \Bigr)
\]
\[
\le \epsilon_1 + \epsilon_2 + \liminf_{m \to \infty} P\Bigl( \bigl\{ \bigl| \tilde U_k(m) - U_k(m) \bigr| \bigl| \tilde A_k(m) - A_k(m) \bigr| > \eta \bigr\} \cap E_m(\epsilon_1, \delta, C) \Bigr)
\]
\[
\le \epsilon_1 + \epsilon_2 + \liminf_{m \to \infty} P\Bigl( \Bigl\{ \bigl| \tilde A_k(m) - A_k(m) \bigr| > \frac{\eta}{C} \Bigr\} \cap E_m(\epsilon_1, \delta, C) \Bigr), \tag{6.32}
\]
using the definition of \(E_{3m}(C)\) in the last line. On the event \(E_{1m}(\epsilon_1)\),
\[
\epsilon_1 \ge \int_{\tau_{km}^-}^{\infty} \bigl\{ \tilde U_k(t) - U_k(t) \bigr\}^2 dt
\ge \int_{\tau_{km}^-}^{\tau_{km}} \bigl\{ \tilde U_k(t) - U_k(t) \bigr\}^2 dt + \int_{\tau_{km}}^{\tau_{km}^+} \bigl\{ \tilde U_k(t) - U_k(t) \bigr\}^2 dt = I + II.
\]
Furthermore, on the event \(E_{2m}(\delta)\) one of the following must hold: (i) \(\tau_{km}^- = \tau_{km}\), (ii) \(|\tilde U_k(\tau_{km}) - U_k(\tau_{km})| \ge \delta/2\), or (iii) \(|\tilde U_k(\tau_{km}-) - U_k(\tau_{km}-)| \ge \delta/2\). Suppose (ii) holds. Then \(II \ge \delta^2(\tau_{km}^+ - \tau_{km})/4\) since \(\tilde U_k - U_k\) is piecewise constant, and hence \(\tau_{km}^+ - \tau_{km} \le 4\epsilon_1/\delta^2\). Next, suppose that (iii) holds. Then \(I \ge \delta^2(\tau_{km} - \tau_{km}^-)/4\), so that \(\tau_{km} - \tau_{km}^- \le 4\epsilon_1/\delta^2\). Thus, on the event \(E_{1m}(\epsilon_1) \cap E_{2m}(\delta)\) there is a jump point of \(\tilde U_k\) that is within \(4\epsilon_1/\delta^2\) of \(\tau_{km}\). Without loss of generality, we assume that this holds for \(\tau_{km}^-\).

We now return to (6.32) and consider the quantity \(A_k(m) - \tilde A_k(m)\). First, note that
\[
a_kH_k(m) + a_{K+1}H_+(m) = a_kX_k(\tau_{km}) + a_{K+1}X_+(\tau_{km}) + \int_{\tau_{km}}^{m} \bigl\{ a_kU_k(u) + a_{K+1}U_+(u) \bigr\} du,
\]
\[
a_k\tilde H_k(m) + a_{K+1}\tilde H_+(m) = a_kX_k(\tau_{km}^-) + a_{K+1}X_+(\tau_{km}^-) + \int_{\tau_{km}^-}^{m} \bigl\{ a_k\tilde U_k(u) + a_{K+1}\tilde U_+(u) \bigr\} du.
\]
Then
\[
A_k(m) - \tilde A_k(m) = a_k\bigl( H_k(m) - \tilde H_k(m) \bigr) + a_{K+1}\bigl( H_+(m) - \tilde H_+(m) \bigr)
\]
\[
= a_k\bigl\{ X_k(\tau_{km}) - X_k(\tau_{km}^-) \bigr\} + a_{K+1}\bigl\{ X_+(\tau_{km}) - X_+(\tau_{km}^-) \bigr\}
+ \int_{\tau_{km}}^{m} \bigl\{ a_k\bigl( U_k(u) - \tilde U_k(u) \bigr) + a_{K+1}\bigl( U_+(u) - \tilde U_+(u) \bigr) \bigr\} du
- \int_{\tau_{km}^-}^{\tau_{km}} \bigl\{ a_k\tilde U_k(u) + a_{K+1}\tilde U_+(u) \bigr\} du.
\]
Since \(\tau_{km} - \tau_{km}^- \le 4\epsilon_1/\delta^2\), and since \(X_k\), \(X_+\), \(\tilde H_k\) and \(\tilde H_+\) are continuous, it follows that the first and third terms on the right side of this expression can be made arbitrarily small by choosing \(\epsilon_1\) small. Hence, we only need to consider the second term:
\[
\left| \int_{\tau_{km}}^{m} \bigl\{ a_k\bigl( U_k(u) - \tilde U_k(u) \bigr) + a_{K+1}\bigl( U_+(u) - \tilde U_+(u) \bigr) \bigr\} du \right|
\]
\[
\le a_k \left( \int_{\tau_{km}}^{m} \bigl\{ U_k(u) - \tilde U_k(u) \bigr\}^2 du \right)^{1/2} (m - \tau_{km})^{1/2}
+ a_{K+1} \left( \int_{\tau_{km}}^{m} \bigl\{ U_+(u) - \tilde U_+(u) \bigr\}^2 du \right)^{1/2} (m - \tau_{km})^{1/2}
\]
\[
= O_p\left( \Bigl( \int_{\tau_{km}}^{m} \bigl\{ U_k(u) - \tilde U_k(u) \bigr\}^2 du \Bigr)^{1/2} \right)
+ O_p\left( \Bigl( \int_{\tau_{km}}^{m} \bigl\{ U_+(u) - \tilde U_+(u) \bigr\}^2 du \Bigr)^{1/2} \right)
= o_p(1), \qquad m \to \infty.
\]
The inequality follows from the Cauchy-Schwarz inequality. The first equality follows from \(m - \tau_{km} = O_p(1)\), and the second equality follows since the integrals \(\int \{U_k(u) - \tilde U_k(u)\}^2 du\) and \(\int \{U_+(u) - \tilde U_+(u)\}^2 du\) are of order \(O_p(1)\), so that the tail distributions of the integrals must be of order \(o_p(1)\).

This implies that
\[
\liminf_{m \to \infty} P\Bigl( \Bigl\{ \bigl| \tilde A_k(m) - A_k(m) \bigr| > \frac{\eta}{C} \Bigr\} \cap E_m(\epsilon_1, \delta, C) \Bigr) = 0.
\]
Using similar reasoning for \(-m\), it follows that the right side of (6.29) equals zero with probability one. In turn, this implies that with probability one, \(U_k = \tilde U_k\) almost everywhere for \(k = 1, \dots, K\). Taking into account the monotonicity and right continuity of \(U_k\) and \(\tilde U_k\), we find that \(U_k\) must be identical to \(\tilde U_k\) with probability one. \(\Box\)
Remark 6.22 An alternative method for proving uniqueness could proceed along the following lines. Let \(\epsilon > 0\), and define the following event
\[
A_m = \Bigl\{ \int_{m}^{m+1} \bigl\{ \tilde U_k(t) - U_k(t) \bigr\}^2 dt > \epsilon \Bigr\}, \qquad m \in \mathbb{N}.
\]
Since the integrals \(\int \{\tilde U_k(t) - U_k(t)\}^2 dt\) are of order \(O_p(1)\), we know that \(P(A_m \text{ i.o.}) = 0\). If we can show that the sequence \(A_m\) is strongly mixing in the sense of Theorem 2 of Yoshihara (1979), then it follows that the second Borel-Cantelli lemma holds, so that \(\sum P(A_m) < \infty\). Since the \(A_m\) are identically distributed, this implies that \(P(A_m) = 0\) for all \(m \in \mathbb{N}\), and this implies \(U_k \equiv \tilde U_k\).
6.2.3 Convergence of the MLE to the limiting distribution

We prove the limiting distribution of the MLE (Theorem 6.10) along the same lines as the limiting distribution of the naive estimator (Theorem 6.4). Thus, we start by localizing the characterization. However, the characterization of the MLE is more complicated than the characterization of the naive estimator. To simplify it, we replace \((\hat F_{nk}(u))^{-1}\) and \((1 - \hat F_{n+}(u))^{-1}\) by \((F_{0k}(t_0))^{-1}\) and \((1 - F_{0+}(t_0))^{-1}\). This results in a rest term which is bounded in Lemma 6.23. The proof of this lemma is given in Section 6.3.
Lemma 6.23 Let \(\tau_{nk}\) be the last jump point of \(\hat F_{nk}\) before \(t_0\), for \(k = 1, \dots, K\). Then for every \(m > 0\), and \(k = 1, \dots, K\):
\[
\int_{[\tau_{nk},\, t_0 + n^{-1/3}h)} \Bigl\{ \frac{\hat F_{nk}(u) - \delta_k}{\hat F_{nk}(u)} + \frac{\hat F_{n+}(u) - \delta_+}{1 - \hat F_{n+}(u)} \Bigr\} d\mathbb{P}_n(u, \delta)
= \int_{[\tau_{nk},\, t_0 + n^{-1/3}h)} \Bigl\{ \frac{\hat F_{nk}(u) - \delta_k}{F_{0k}(t_0)} + \frac{\hat F_{n+}(u) - \delta_+}{1 - F_{0+}(t_0)} \Bigr\} d\mathbb{P}_n(u, \delta) + o_p(n^{-2/3}),
\]
uniformly in \(h \in [-m, m]\).
Next, we give analogues of Lemmas 6.6 and 6.7. We only mention the key ingredients of the proofs, since the proofs themselves are completely analogous to the proofs of Lemmas 6.6 and 6.7. The key ingredients are the local rate of convergence of the MLE (Theorem 5.20) and \(t_0 - \tau_{nk} = O_p(n^{-1/3})\) (Corollary 5.22), where \(\tau_{nk}\) is the last jump point of \(\hat F_{nk}\) before \(t_0\), \(k = 1, \dots, K\).
Lemma 6.24 Let \(\tau_{nk}\) be the last jump point of \(\hat F_{nk}\) before \(t_0\), for \(k = 1, \dots, K\). Then
\[
\int_{[\tau_{nk},\, t_0)} \bigl\{ \hat F_{nk}(u) - \delta_k \bigr\} d\mathbb{P}_n = O_p(n^{-2/3}), \qquad k = 1, \dots, K.
\]
Lemma 6.25 Let the conditions of Theorem 6.4 be satisfied, let \(m > 0\), and \(k \in \{1, \dots, K\}\). Then
\[
\frac{1}{g(t_0)} \int_{[t_0,\, t_0 + n^{-1/3}h]} \bigl\{ \hat F_{nk}(u) - F_{0k}(t_0) \bigr\} d\mathbb{G}_n(u)
= \int_{[t_0,\, t_0 + n^{-1/3}h]} \bigl\{ \hat F_{nk}(u) - F_{0k}(t_0) \bigr\} du + o_p(n^{-2/3}),
\]
uniformly in \(h \in [-m, m]\).
We now give the proof of Theorem 6.10.
Proof of Theorem 6.10: Let \(\tau_{nk}\) be the last jump point of \(\hat F_{nk}\) before \(t_0\), for \(k = 1, \dots, K\). Recall from Proposition 2.34 that the MLE \(\hat F_{nk}(t)\), \(k = 1, \dots, K\), is characterized by
\[
\int_{[\tau_{nk},\, t)} \Bigl\{ \frac{\delta_k}{\hat F_{nk}(u)} - \frac{1 - \delta_+}{1 - \hat F_{n+}(u)} \Bigr\} d\mathbb{P}_n(u, \delta) \ge 0,
\]
for \(k = 1, \dots, K\) and \(t < T_{(n)}\), where equality must hold if \(t\) is a jump point of \(\hat F_{nk}\). This is equivalent to:
\[
\int_{[\tau_{nk},\, t)} \Bigl\{ \frac{\hat F_{nk}(u) - \delta_k}{\hat F_{nk}(u)} + \frac{\hat F_{n+}(u) - \delta_+}{1 - \hat F_{n+}(u)} \Bigr\} d\mathbb{P}_n(u, \delta) \le 0, \tag{6.33}
\]
for \(k = 1, \dots, K\) and \(t < T_{(n)}\), where equality must hold if \(t\) is a jump point of \(\hat F_{nk}\).
We now replace \((\hat F_{nk}(u))^{-1}\) by \((F_{0k}(t_0))^{-1} = a_k\), and \((1 - \hat F_{n+}(u))^{-1}\) by \((1 - F_{0+}(t_0))^{-1} = a_{K+1}\), at the cost of a term \(\int_{[\tau_{nk}, t)} R_{nk}\,d\mathbb{P}_n\), where
\[
R_{nk}(u, \delta) = \Bigl\{ \frac{\hat F_{nk}(u) - \delta_k}{\hat F_{nk}(u)} + \frac{\hat F_{n+}(u) - \delta_+}{1 - \hat F_{n+}(u)} \Bigr\} - \Bigl\{ \frac{\hat F_{nk}(u) - \delta_k}{F_{0k}(t_0)} + \frac{\hat F_{n+}(u) - \delta_+}{1 - F_{0+}(t_0)} \Bigr\}.
\]
This yields
\[
\int_{[\tau_{nk},\, t)} \Bigl[ a_k\bigl\{ \hat F_{nk}(u) - \delta_k \bigr\} + a_{K+1}\bigl\{ \hat F_{n+}(u) - \delta_+ \bigr\} + R_{nk}(u, \delta) \Bigr] d\mathbb{P}_n(u, \delta) \le 0, \tag{6.34}
\]
for \(k = 1, \dots, K\) and \(t < T_{(n)}\), where equality must hold if \(t\) is a jump point of \(\hat F_{nk}\).
In order to change the integration interval \([\tau_{nk}, t)\) to \([t_0, t)\), we define for \(k = 1, \dots, K\):
\[
c_{nk} = \int_{[\tau_{nk},\, t_0)} \Bigl[ a_k\bigl\{ \hat F_{nk}(u) - \delta_k \bigr\} + a_{K+1}\bigl\{ \hat F_{n+}(u) - \delta_+ \bigr\} + R_{nk}(u, \delta) \Bigr] d\mathbb{P}_n(u, \delta).
\]
Then (6.34) is equivalent to
\[
c_{nk} + \int_{[t_0,\, t)} \Bigl[ a_k\bigl\{ \hat F_{nk}(u) - \delta_k \bigr\} + a_{K+1}\bigl\{ \hat F_{n+}(u) - \delta_+ \bigr\} + R_{nk}(u, \delta) \Bigr] d\mathbb{P}_n(u, \delta) \le 0,
\]
for \(k = 1, \dots, K\), where equality must hold if \(t\) is a jump point of \(\hat F_{nk}\).
We now localize this expression, by adding and subtracting \(\int_{[t_0, t)} F_{0k}(t_0)\,d\mathbb{G}_n(u)\) and \(\int_{[t_0, t)} F_{0+}(t_0)\,d\mathbb{G}_n(u)\), and applying the change of variable \(t \to t_0 + n^{-1/3}h\). This yields
\[
c_{nk} + \int_{[t_0,\, t_0 + n^{-1/3}h)} R_{nk}(u, \delta)\,d\mathbb{P}_n(u, \delta) + a_k \int_{[t_0,\, t_0 + n^{-1/3}h)} \bigl\{ \hat F_{nk}(u) - F_{0k}(t_0) \bigr\} d\mathbb{G}_n(u)
+ a_{K+1} \int_{[t_0,\, t_0 + n^{-1/3}h)} \bigl\{ \hat F_{n+}(u) - F_{0+}(t_0) \bigr\} d\mathbb{G}_n(u) \tag{6.35}
\]
\[
\le a_k \int_{[t_0,\, t_0 + n^{-1/3}h)} \bigl\{ \delta_k - F_{0k}(t_0) \bigr\} d\mathbb{P}_n(u, \delta) + a_{K+1} \int_{[t_0,\, t_0 + n^{-1/3}h)} \bigl\{ \delta_+ - F_{0+}(t_0) \bigr\} d\mathbb{P}_n(u, \delta),
\]
for \(k = 1, \dots, K\), \(h < n^{1/3}(T_{(n)} - t_0)\), where equality must hold if \(t_0 + n^{-1/3}h\) is a jump point of \(\hat F_{nk}\). Next, for \(k = 1, \dots, K\) and \(h \in \mathbb{R}\), we define the following processes:
\[
X^{loc}_{nk}(h) = \frac{n^{2/3}}{g(t_0)} \int_{[t_0,\, t_0 + n^{-1/3}h]} \bigl\{ \delta_k - F_{0k}(t_0) \bigr\} d\mathbb{P}_n(u, \delta),
\]
\[
H^{loc}_{nk}(h) = n^{2/3} \int_{[t_0,\, t_0 + n^{-1/3}h]} \bigl\{ \hat F_{nk}(u) - F_{0k}(t_0) \bigr\} du,
\]
\[
U^{loc}_{nk}(h) = n^{1/3}\bigl\{ \hat F_{nk}(t_0 + n^{-1/3}h) - F_{0k}(t_0) \bigr\}.
\]
Note that \(U^{loc}_{nk} = (H^{loc}_{nk})'\) at continuity points of \(U^{loc}_{nk}\). Furthermore, define
\[
c^{loc}_{nk} = \frac{n^{2/3}}{g(t_0)}\, c_{nk},
\]
\[
R^{loc}_{nk}(h) = \frac{n^{2/3}}{g(t_0)} \int_{[t_0,\, t_0 + n^{-1/3}h]} R_{nk}(u, \delta)\,d\mathbb{P}_n(u, \delta)
+ a_k \left( \frac{n^{2/3}}{g(t_0)} \int_{[t_0,\, t_0 + n^{-1/3}h]} \bigl\{ \hat F_{nk}(u) - F_{0k}(t_0) \bigr\} d\mathbb{G}_n(u) - n^{2/3} \int_{[t_0,\, t_0 + n^{-1/3}h]} \bigl\{ \hat F_{nk}(u) - F_{0k}(t_0) \bigr\} du \right)
\]
\[
+ a_{K+1} \left( \frac{n^{2/3}}{g(t_0)} \int_{[t_0,\, t_0 + n^{-1/3}h]} \bigl\{ \hat F_{n+}(u) - F_{0+}(t_0) \bigr\} d\mathbb{G}_n(u) - n^{2/3} \int_{[t_0,\, t_0 + n^{-1/3}h]} \bigl\{ \hat F_{n+}(u) - F_{0+}(t_0) \bigr\} du \right).
\]
Then multiplying (6.35) by \(n^{2/3}/g(t_0)\) yields, for all \(k = 1, \dots, K\) and \(h < n^{1/3}(T_{(n)} - t_0)\):
\[
c^{loc}_{nk} + R^{loc}_{nk}(h) + a_kH^{loc}_{nk}(h) + a_{K+1}H^{loc}_{n+}(h) \le a_kX^{loc}_{nk}(h) + a_{K+1}X^{loc}_{n+}(h), \tag{6.36}
\]
and \(c^{loc}_{nk} + R^{loc}_{nk}(h-) + a_kH^{loc}_{nk}(h-) + a_{K+1}H^{loc}_{n+}(h-) = a_kX^{loc}_{nk}(h-) + a_{K+1}X^{loc}_{n+}(h-)\) if \(U^{loc}_{nk}\) has a jump at \(h\). Combining these statements gives
\[
c^{loc}_{nk} + R^{loc}_{nk}(h) + a_kH^{loc}_{nk}(h) + a_{K+1}H^{loc}_{n+}(h) \le a_kX^{loc}_{nk}(h) + a_{K+1}X^{loc}_{n+}(h), \tag{6.37}
\]
\[
\int_{-\infty}^{n^{1/3}(T_{(n)} - t_0)} \Bigl\{ c^{loc}_{nk} + R^{loc}_{nk}(h-) + a_kH^{loc}_{nk}(h-) + a_{K+1}H^{loc}_{n+}(h-) - a_kX^{loc}_{nk}(h-) - a_{K+1}X^{loc}_{n+}(h-) \Bigr\} dU^{loc}_{nk}(h) = 0, \tag{6.38}
\]
where (6.37) must hold for all \(h < n^{1/3}(T_{(n)} - t_0)\). Note that these conditions also hold when the processes are restricted to \([-m, m] \cap (-\infty, n^{1/3}(T_{(n)} - t_0))\), for each \(m \in \mathbb{N}\).
Next, we define the following vectors:
\[
c^{loc}_n = (c^{loc}_{n1}, \dots, c^{loc}_{nK}), \qquad H^{loc}_n = (H^{loc}_{n1}, \dots, H^{loc}_{nK}),
\]
\[
R^{loc}_n = (R^{loc}_{n1}, \dots, R^{loc}_{nK}), \qquad U^{loc}_n = (U^{loc}_{n1}, \dots, U^{loc}_{nK}),
\]
\[
X^{loc}_n = (X^{loc}_{n1}, \dots, X^{loc}_{nK}).
\]
Furthermore, for \(m \in \mathbb{N}\), we define the space
\[
E[-m, m] = \mathbb{R}^K \times (D[-m, m])^K \times (D[-m, m])^K \times (C[-m, m])^K \times (D[-m, m])^K
\equiv \mathbb{R}^K \times I \times II \times III \times IV,
\]
endowed with the product topology induced by the uniform topology on \(I \times II \times III\), and the Skorohod topology on \(IV\). Note that this space supports the vector
\[
V_n|_{[-m,m]} = (c^{loc}_n, R^{loc}_n, X^{loc}_n, H^{loc}_n, U^{loc}_n)|_{[-m,m]},
\]
where the notation \(|_{[-m,m]}\) denotes that the processes \(R^{loc}_{nk}\), \(X^{loc}_{nk}\), \(H^{loc}_{nk}\) and \(U^{loc}_{nk}\) are restricted to \([-m, m]\) for all \(k = 1, \dots, K\).
The remainder of the proof is analogous to the proof of Theorem 6.4, and we omit some details that can be found there. First, we show that \(V_n|_{[-m,m]}\) is tight in \(E[-m, m]\) for each \(m \in \mathbb{N}\). To do so, we use Lemmas 6.24 and 6.25 to show that \(c_{nk} = O_p(1)\) and \(R^{loc}_{nk} = o_p(1)\) uniformly in \(h \in [-m, m]\), for all \(k = 1, \dots, K\). Hence, by a diagonal argument it follows that for every subsequence \(V_{n'}\) there is a further subsequence that converges in distribution to a limit
\[
V = (c, 0, X, H, U) \in \mathbb{R}^K \times (C(-\infty, \infty))^K \times (C(-\infty, \infty))^K \times (C(-\infty, \infty))^K \times (D(-\infty, \infty))^K,
\]
with \(H' = U\) at continuity points of \(U\). By the continuous mapping theorem and (6.37) and (6.38), it follows that for each \(m \in \mathbb{N}\):
\[
\inf_{[-m,m]} \bigl\{ a_kX_k(t) + a_{K+1}X_+(t) - a_kH_k(t) - a_{K+1}H_+(t) - c_k \bigr\} \ge 0,
\]
\[
\int_{-m}^{m} \bigl\{ a_kX_k(t-) + a_{K+1}X_+(t-) - a_kH_k(t-) - a_{K+1}H_+(t-) - c_k \bigr\} dU_k(t) = 0.
\]
Since \(X_k\) and \(H_k\) are continuous, we can write the second condition as
\[
\int_{-m}^{m} \bigl\{ a_kX_k(t) + a_{K+1}X_+(t) - a_kH_k(t) - a_{K+1}H_+(t) - c_k \bigr\} dU_k(t) = 0.
\]
Defining \(\tilde H_k = H_k + c_k/a_k - F_{0k}(t_0)\sum_{l=1}^{K}(c_l/a_l)\), we have
\[
a_k\tilde H_k = a_kH_k + c_k - \sum_{l=1}^{K}\frac{c_l}{a_l},
\]
\[
a_{K+1}\tilde H_+ = a_{K+1}H_+ + \sum_{l=1}^{K}\frac{c_l}{a_l},
\]
using \(a_{K+1} = (1 - F_{0+}(t_0))^{-1}\) to obtain the second line. This gives
\[
a_kH_k + a_{K+1}H_+ + c_k = a_k\tilde H_k + a_{K+1}\tilde H_+,
\]
so that
\[
\inf_{[-m,m]} \bigl\{ a_kX_k(t) + a_{K+1}X_+(t) - a_k\tilde H_k(t) - a_{K+1}\tilde H_+(t) \bigr\} \ge 0,
\]
\[
\int_{-m}^{m} \bigl\{ a_kX_k(t) + a_{K+1}X_+(t) - a_k\tilde H_k(t) - a_{K+1}\tilde H_+(t) \bigr\} dU_k(t) = 0.
\]
Letting \(m \to \infty\) it follows that \(\tilde H_1, \dots, \tilde H_K\) satisfy conditions (i) and (ii) of Theorem 6.9. Furthermore, condition (iii) of Theorem 6.9 is satisfied by Corollary 5.22.

Hence, there exists a \(K\)-tuple of processes \((\tilde H_1, \dots, \tilde H_K)\) that satisfies the conditions of Theorem 6.9. Furthermore, there is only one such \(K\)-tuple, by the uniqueness established in Lemma 6.21. Hence, each subsequence converges to the same limit \(\tilde H\), with \(\tilde H\) as defined in Theorem 6.10. This implies that \(U^{loc}_n \to_d U\) in the Skorohod topology. In particular,
\[
U^{loc}_n(0) = n^{1/3}\bigl( \hat F_n(t_0) - F_0(t_0) \bigr) \to_d U(0), \qquad \text{in } \mathbb{R}^K. \qquad \Box
\]
6.3 Technical lemmas and proofs
Lemma 6.26 states a well-known fact about convex functions. We provide this lemma
and its proof for completeness.
Lemma 6.26 Let \((f_n)\) be a sequence of convex functions on \(\mathbb{R}\) converging pointwise to a convex function \(f\) on \(\mathbb{R}\). Then, at each point \(t\) where the two-sided derivative \(f'\) of \(f\) exists, we have:
\[
\lim_{n \to \infty} D^+f_n(t) = \lim_{n \to \infty} D^-f_n(t) = f'(t),
\]
where \(D^+f_n\) and \(D^-f_n\) are the right and left derivative of \(f_n\), respectively.
Proof: Fix \(\epsilon > 0\) and suppose that \(f\) is differentiable at \(t\). Then there exists an \(\eta > 0\) such that
\[
f'(t) - \epsilon \le \frac{f(t) - f(t - \eta)}{\eta} \le \frac{f(t + \eta) - f(t)}{\eta} \le f'(t) + \epsilon.
\]
Moreover,
\[
\lim_{n \to \infty} \frac{f_n(t + \eta) - f_n(t)}{\eta} = \frac{f(t + \eta) - f(t)}{\eta},
\]
and also
\[
\lim_{n \to \infty} \frac{f_n(t) - f_n(t - \eta)}{\eta} = \frac{f(t) - f(t - \eta)}{\eta}.
\]
The statement now follows from
\[
\frac{f_n(t) - f_n(t - \eta)}{\eta} \le D^-f_n(t) \le D^+f_n(t) \le \frac{f_n(t + \eta) - f_n(t)}{\eta}. \qquad \Box
\]
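A small numerical illustration of Lemma 6.26 (ours, not from the thesis): the convex functions \(f_n(t) = \sqrt{t^2 + 1/n}\) converge pointwise to the convex function \(f(t) = |t|\), which is differentiable at every \(t \ne 0\), so the derivatives of \(f_n\) at \(t = 1\) must approach \(f'(1) = 1\):

```python
# Lemma 6.26 in action: f_n(t) = sqrt(t^2 + 1/n) is convex and converges
# pointwise to f(t) = |t|; f is differentiable at t = 1 with f'(1) = 1,
# so the (one-sided) derivatives of f_n at t = 1 converge to 1.
import math

def fn(t, n):
    # convex smoothing of |t|; pointwise limit is |t| as n -> infinity
    return math.sqrt(t * t + 1.0 / n)

def approx_right_derivative(f, t, h=1e-6):
    # finite-difference approximation of the right derivative D+ f(t)
    return (f(t + h) - f(t)) / h

derivs = [approx_right_derivative(lambda s, n=n: fn(s, n), 1.0)
          for n in (1, 10, 100, 10000)]
```

The exact values are \(f_n'(1) = 1/\sqrt{1 + 1/n}\), which increase toward \(1\) as \(n\) grows.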
Proof of Proposition 6.8: Note that
\[
X^{loc}_{nk}(h) = n^{2/3} \int_{[t_0,\, t_0 + n^{-1/3}h]} \bigl\{ \delta_k - F_{0k}(u) \bigr\} d\mathbb{P}_n(u, \delta)
+ n^{2/3} \int_{[t_0,\, t_0 + n^{-1/3}h]} \bigl\{ F_{0k}(u) - F_{0k}(t_0) \bigr\} d\mathbb{G}_n(u). \tag{6.39}
\]
For each \(k = 1, \dots, K\), the first term on the right side of (6.39) converges in distribution to \(W_k(h)\) in \(l^\infty[-M, M]\), where \(W_k(h)\) is the Brownian motion process defined in Definition 6.2. This follows by applying Theorem 2.11.22 of Van der Vaart and Wellner (1996, page 220) to the class of functions
\[
\mathcal{F}_{nk} = \bigl\{ f_{nkh}(u, \delta) = n^{1/6} 1_{[t_0,\, t_0 + n^{-1/3}h]}(u)\bigl\{ \delta_k - F_{0k}(u) \bigr\} : h \in [-M, M] \bigr\}, \qquad n \in \mathbb{N}.
\]
This convergence also holds jointly in \(k = 1, \dots, K\). Namely, marginal tightness of the processes implies joint tightness. Hence, there is a subsequence that converges weakly to a tight Borel measure, jointly for \(k = 1, \dots, K\). The marginal distributions of this Borel measure are given by \(W_1, \dots, W_K\), and its covariance structure can be determined by considering the finite dimensional distributions. Since \(\Delta|T\) has a multinomial distribution (see (2.4)), \((W_1, \dots, W_K)\) has a multinomial covariance structure.
The second term on the right side of (6.39) can be written as
\[
n^{2/3} \int_{[t_0,\, t_0 + n^{-1/3}h]} \bigl\{ F_{0k}(u) - F_{0k}(t_0) \bigr\} d\mathbb{G}_n(u)
= n^{2/3} \int_{[t_0,\, t_0 + n^{-1/3}h]} \bigl\{ F_{0k}(u) - F_{0k}(t_0) \bigr\} dG(u)
+ n^{2/3} \int_{[t_0,\, t_0 + n^{-1/3}h]} \bigl\{ F_{0k}(u) - F_{0k}(t_0) \bigr\} d(\mathbb{G}_n - G)(u)
\]
\[
= n^{2/3} \int_{[t_0,\, t_0 + n^{-1/3}h]} \bigl\{ F_{0k}(u) - F_{0k}(t_0) \bigr\} dG(u) + o_p(1)
\to \frac{1}{2} f_{0k}(t_0)g(t_0)h^2,
\]
where the convergence is uniform for \(h \in [-M, M]\), and joint for \(k = 1, \dots, K\). Here the second to last line of the display follows from Lemma 5.13. The last line of the display follows from the continuity and positivity of \(f_{0k}(t)\) and \(g(t)\) in a neighborhood of \(t_0\). \(\Box\)
Proof of Lemma 6.17: It is sufficient to show that there exists an \(M > 0\) such that the statement holds for a fixed \(k\). Let \(k \in \{1, \dots, K\}\) and \(j \in \{0, 1, \dots\}\). Note that \(dX_k(t) = dW_k(t)/g(t_0) + F_{0k}(t)dt\). Hence,
\[
\int_{w}^{j+Mv(j)} \bigl[ F_{0k}(j + Mv(j))dt - dX_k(t) \bigr]
= \int_{w}^{j+Mv(j)} \bigl[ F_{0k}(j + Mv(j) - t)dt - dW_k(t)/g(t_0) \bigr]
= \frac{1}{2} f_{0k}(t_0)(j + Mv(j) - w)^2 - \int_{w}^{j+Mv(j)} dW_k(t)/g(t_0).
\]
Furthermore, for any \(C > 0\) fixed, we have that
\[
(j + Mv(j) - w)^{3/2}C \le \frac{1}{4} f_{0k}(t_0)(j + Mv(j) - w)^2,
\]
for \(M\) sufficiently large. Hence, it is sufficient to show that
\[
P\left( \sup_{w \le j+1} \Bigl\{ \int_{w}^{j+Mv(j)} dW_k(t)/g(t_0) - \frac{1}{4} f_{0k}(t_0)(j + Mv(j) - w)^2 \Bigr\} \ge 0 \right) \le \frac{p_{jM}}{2}. \tag{6.40}
\]
We again consider a grid, now with grid points \(j + 1 - q\), \(q \in \mathbb{N}\). Then we can bound the left side of the above display by
\[
\sum_{q=0}^{\infty} P\left( \sup_{w \in [j-q,\, j-q+1)} \Bigl\{ \int_{w}^{j+Mv(j)} \frac{dW_k(t)}{g(t_0)} - \frac{1}{4} f_{0k}(t_0)(j + Mv(j) - w)^2 \Bigr\} \ge 0 \right)
\le \sum_{q=0}^{\infty} P\left( \sup_{w \in [j-q,\, j-q+1)} \int_{w}^{j+Mv(j)} dW_k(t) \ge \lambda_{kjq} \right), \tag{6.41}
\]
where \(\lambda_{kjq}\) is obtained by plugging in \(w = j + 1 - q\) in the quadratic term:
\[
\lambda_{kjq} = \frac{1}{4} f_{0k}(t_0)g(t_0)(Mv(j) - 1 + q)^2.
\]
Let \(B_k(\cdot)\) denote standard Brownian motion. We write the \(q\)th term in (6.41) as
\[
P\left( \sup_{w \in [j-q,\, j-q+1)} W_k(j + Mv(j) - w) \ge \lambda_{kjq} \right)
\le P\left( \sup_{w \in [0,\, Mv(j)+q)} W_k(w) \ge \lambda_{kjq} \right)
= P\left( \sup_{w \in [0, 1)} W_k\bigl( (Mv(j) + q)w \bigr) \ge \lambda_{kjq} \right)
\]
\[
= P\left( \sup_{w \in [0, 1)} W_k(w) \ge \frac{\lambda_{kjq}}{\sqrt{Mv(j) + q}} \right)
\le P\left( \sup_{w \in [0, 1]} B_k(w) \ge \frac{\lambda_{kjq}}{b_k\sqrt{Mv(j) + q}} \right)
\]
\[
\le 2P\left( N(0, 1) \ge \frac{\lambda_{kjq}}{b_k\sqrt{Mv(j) + q}} \right)
\le 2b_{kq} \exp\left( -\frac{1}{2} \Bigl( \frac{\lambda_{kjq}}{b_k\sqrt{Mv(j) + q}} \Bigr)^2 \right),
\]
where
\[
b_k = \sqrt{F_{0k}(t_0)(1 - F_{0k}(t_0))g(t_0)}, \qquad k = 1, \dots, K,
\]
\[
b_{kq} = \frac{b_k\sqrt{Mv(j) + q}}{\lambda_{kjq}\sqrt{2\pi}}, \qquad k = 1, \dots, K, \; q \in \mathbb{N}.
\]
Here we used standard properties of Brownian motion. The second to last inequality is given in for example Shorack and Wellner (1986, equation 6, page 33), and the last inequality follows from Mills' ratio (Gordon (1941, Equation (10), page 366)). Note that \(b_{kq} \le b_k/(f_{0k}(t_0)g(t_0)\sqrt{2\pi})\) for \(M > 3\). Hence, returning to (6.41), we have
\[
\sum_{q=0}^{\infty} P\left( \sup_{w \in [j-q,\, j-q+1)} \int_{w}^{j+Mv(j)} dW_k(t) \ge \lambda_{kjq} \right)
\le \sum_{q=0}^{\infty} \frac{2b_k}{f_{0k}(t_0)g(t_0)\sqrt{2\pi}} \exp\left( -\frac{1}{2} \Bigl( \frac{\lambda_{kjq}}{b_k\sqrt{Mv(j) + q}} \Bigr)^2 \right)
\]
\[
\approx \sum_{q=0}^{\infty} \frac{2b_k}{f_{0k}(t_0)g(t_0)\sqrt{2\pi}} \exp\left( -\frac{1}{2} \frac{(Mv(j) + q)^3}{b_k^2} \right)
\le d_1 \exp\bigl( -d_2(Mv(j))^3 \bigr),
\]
using \((a + b)^3 \ge a^3 + b^3\) for \(a, b \ge 0\). \(\Box\)
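For convenience, the two standard facts used in the last two inequalities of the display above are (reflection principle and Mills' ratio; our restatement): for \(x > 0\),

```latex
P\Bigl( \sup_{w \in [0,1]} B_k(w) \ge x \Bigr)
  = 2\,P\bigl( N(0,1) \ge x \bigr)
  \le \frac{2}{x\sqrt{2\pi}}\, e^{-x^2/2},
```

applied with \(x = \lambda_{kjq}/(b_k\sqrt{Mv(j) + q})\); this choice of \(x\) produces exactly the constant \(b_{kq} = b_k\sqrt{Mv(j)+q}/(\lambda_{kjq}\sqrt{2\pi})\) appearing in the bound.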
Proof of Lemma 6.18: Since \(l\) is only defined on the event \(A_j\), this entire proof should be read on the event \(A_j\). If \(l = K\), then we can apply the method of Lemma 6.17. Therefore, assume that \(l < K\). In this case, we cannot apply the method of Lemma 6.17, for the reason discussed in the proof of Lemma 5.16 and illustrated in Figure 5.3. Hence, we break the term \(\int_{\tau_l}^{j+Mv(j)} [dX_+(t) - U_+(t)dt]\) into pieces that we analyze separately. We define \(l^* \in \{l, \dots, K\}\) as follows. If
\[
\int_{\tau_l}^{\tau_k} \bigl[ dX_+(t) - U_+(t)dt \bigr] \ge 0, \qquad \text{for all } k = l + 1, \dots, K, \tag{6.42}
\]
we let \(l^* = l\). Otherwise we define \(l^*\) such that
\[
\int_{\tau_l}^{\tau_k} \bigl[ dX_+(t) - U_+(t)dt \bigr] \ge 0, \qquad k = l^* + 1, \dots, K, \tag{6.43}
\]
\[
\int_{\tau_l}^{\tau_{l^*}} \bigl[ dX_+(t) - U_+(t)dt \bigr] < 0. \tag{6.44}
\]
Then, by (6.44) and the decomposition \(\int_{\tau_l}^{j+Mv(j)} = \int_{\tau_l}^{\tau_{l^*}} + \int_{\tau_{l^*}}^{j+Mv(j)}\), we get
\[
\int_{\tau_l}^{j+Mv(j)} \bigl[ dX_+(t) - U_+(t)dt \bigr] \le \int_{\tau_{l^*}}^{j+Mv(j)} \bigl[ dX_+(t) - U_+(t)dt \bigr], \tag{6.45}
\]
where strict inequality holds if \(l \ne l^*\). Rearranging the sum and using the notation \(\tau_{K+1} = j + Mv(j)\), we can rewrite the right side of (6.45) as
\[
\sum_{k=1}^{K} \int_{\tau_{l^*}}^{j+Mv(j)} \bigl[ dX_k(t) - U_k(t)dt \bigr]
= \sum_{k=l^*+1}^{K} \int_{\tau_{l^*}}^{\tau_k} \bigl[ dX_k(t) - U_k(t)dt \bigr] + \sum_{k=l^*}^{K} \sum_{p=1}^{k} \int_{\tau_k}^{\tau_{k+1}} \bigl[ dX_p(t) - U_p(t)dt \bigr]. \tag{6.46}
\]
We now derive upper bounds for the terms in (6.46). For the first term, note that
\[
\int_{\tau_{l^*}}^{\tau_k} \bigl[ dX_+(t) - U_+(t)dt \bigr] \ge 0, \qquad k = l^* + 1, \dots, K. \tag{6.47}
\]
Namely, if \(l = l^*\), then (6.47) is the same as (6.42). If \(l < l^*\), then (6.47) follows from (6.43), (6.44) and the decomposition \(\int_{\tau_l}^{\tau_k} = \int_{\tau_l}^{\tau_{l^*}} + \int_{\tau_{l^*}}^{\tau_k}\). Furthermore, the definition of \(\tau_1, \dots, \tau_K\) and condition (i) of Theorem 6.9 imply that
\[
\int_{t}^{\tau_k} \Bigl[ dX_k(s) - U_k(s)ds + \frac{a_{K+1}}{a_k}\bigl\{ dX_+(s) - U_+(s)ds \bigr\} \Bigr] \le 0, \qquad k = 1, \dots, K, \; t \le \tau_k.
\]
Using this inequality with \(t = \tau_{l^*}\) together with (6.47) yields that
\[
\int_{\tau_{l^*}}^{\tau_k} \bigl[ dX_k(t) - U_k(t)dt \bigr] \le 0,
\]
for \(k = l^* + 1, \dots, K\). This implies that the first term of (6.46) is bounded above by zero.
We now derive an upper bound for the second term of (6.46). On the event \(A_j\), the inequalities (6.17) in the definition of \(l\) imply that
\[
\sum_{p=k+1}^{K} U_p(j + 1) < \sum_{p=k+1}^{K} F_{0p}(j + Mv(j)), \qquad k = l, \dots, K.
\]
Together with the definition of \(\tau_k\), it follows that on the event \(A_j\) we have
\[
\sum_{p=1}^{k} U_p(\tau_p) = \sum_{p=1}^{k} U_p(j + 1) > \sum_{p=1}^{k} F_{0p}(j + Mv(j)), \qquad k = l, \dots, K.
\]
Furthermore, \(U_p(\tau_p) \le U_p(\tau_k)\) for \(p \le k\) by the monotonicity of \(U_p\) and the ordering \(\tau_1 \le \dots \le \tau_K\). Hence, we get for \(k = l, \dots, K\), and \(u \ge \tau_k\):
\[
\sum_{p=1}^{k} U_p(u) \ge \sum_{p=1}^{k} U_p(\tau_k) > \sum_{p=1}^{k} F_{0p}(j + Mv(j)).
\]
This means that on the event \(A_j\), the second term of (6.46) is bounded above by
\[
\sum_{k=l^*}^{K} \sum_{p=1}^{k} \int_{\tau_k}^{\tau_{k+1}} \bigl[ dX_p(t) - F_{0p}(j + Mv(j))dt \bigr]
= \sum_{k=1}^{K} \int_{\tau_k \vee \tau_{l^*}}^{j+Mv(j)} \bigl[ dX_k(t) - F_{0k}(j + Mv(j))dt \bigr].
\]
Combining (6.45), (6.46) and the upper bound for (6.46), we obtain
\begin{align*}
&P\biggl(\biggl\{\int_{\tau_l}^{j+Mv(j)} \bigl[dX_+(t) - U_+(t)\,dt\bigr] \ge 0\biggr\} \cap A_j\biggr)\\
&\quad\le P\biggl(\biggl\{\int_{\tau_{l^*}}^{j+Mv(j)} \bigl[dX_+(t) - U_+(t)\,dt\bigr] \ge 0\biggr\} \cap A_j\biggr)\\
&\quad\le P\biggl(\sum_{k=1}^{K} \int_{\tau_k \vee \tau_{l^*}}^{j+Mv(j)} \bigl[dX_k(t) - F_{0k}(j+Mv(j))\,dt\bigr] \ge 0\biggr).
\end{align*}
In turn, this is bounded above by
\[
P\biggl(\sup_{k \in \{1,\dots,K\},\, w \le j+1} \int_{w}^{j+Mv(j)} \bigl[dX_k(t) - F_{0k}(j+Mv(j))\,dt\bigr] \ge 0\biggr),
\]
and this can be bounded by $p_{jM}/2$ using Lemma 6.17. $\Box$
Proof of Lemma 6.23: Note that
\begin{align*}
&\int_{[\tau_{nk},\,t_0+n^{-1/3}h)} \biggl\{\frac{\hat F_{nk}(u) - \delta_k}{\hat F_{nk}(u)} + \frac{\hat F_{n+}(u) - \delta_+}{1 - \hat F_{n+}(u)}\biggr\}\,dP_n(u,\delta)\\
&\quad= \int_{[\tau_{nk},\,t_0+n^{-1/3}h)} \biggl\{\frac{\hat F_{nk}(u) - \delta_k}{F_{0k}(t_0)} + \frac{\hat F_{n+}(u) - \delta_+}{1 - F_{0+}(t_0)}\biggr\}\,dP_n(u,\delta)\\
&\qquad+ \int_{[\tau_{nk},\,t_0+n^{-1/3}h)} \frac{\{\hat F_{nk}(u) - \delta_k\}\{F_{0k}(t_0) - \hat F_{nk}(u)\}}{\hat F_{nk}(u)F_{0k}(t_0)}\,dP_n(u,\delta)\\
&\qquad+ \int_{[\tau_{nk},\,t_0+n^{-1/3}h)} \frac{\{\hat F_{n+}(u) - \delta_+\}\{\hat F_{n+}(u) - F_{0+}(t_0)\}}{\{1 - \hat F_{n+}(u)\}\{1 - F_{0+}(t_0)\}}\,dP_n(u,\delta)\\
&\quad\equiv I + II + III.
\end{align*}
Since the terms $II$ and $III$ are analogous, we only show that $II$ is of order $o_p(n^{-2/3})$. As in Lemma 5.9, it is sufficient to consider the numerator. We write
\begin{align*}
&\int_{[\tau_{nk},\,t_0+n^{-1/3}h)} \{\hat F_{nk}(u) - \delta_k\}\{F_{0k}(t_0) - \hat F_{nk}(u)\}\,dP_n(u,\delta)\\
&\quad= \int_{[\tau_{nk},\,t_0+n^{-1/3}h)} \{\hat F_{nk}(u) - F_{0k}(u)\}\{F_{0k}(t_0) - \hat F_{nk}(u)\}\,d(P_n - P)(u,\delta)\\
&\qquad+ \int_{[\tau_{nk},\,t_0+n^{-1/3}h)} \{\hat F_{nk}(u) - F_{0k}(u)\}\{F_{0k}(t_0) - \hat F_{nk}(u)\}\,dP(u,\delta)\\
&\qquad+ \int_{[\tau_{nk},\,t_0+n^{-1/3}h)} \{F_{0k}(u) - \delta_k\}\{F_{0k}(t_0) - \hat F_{nk}(u)\}\,d(P_n - P)(u,\delta)\\
&\quad\equiv II_a + II_b + II_c.
\end{align*}
Note that $II_b$ is of order $O_p(n^{-1})$, using that the length of the integration interval is $O_p(n^{-1/3})$ (Corollary 5.22) and the local rate of convergence (Theorem 5.10). The terms $II_a$ and $II_c$ are of order $o_p(n^{-2/3})$ by Theorem 2.11.22 of Van der Vaart and Wellner (1996), analogously to the treatment of term $I$ in Lemma 6.7. $\Box$
Chapter 7
A FAMILY OF SMOOTH FUNCTIONALS
Let $c : \mathbb{R} \to \mathbb{R}$ be a fixed function. We consider estimation of the following smooth functionals of the sub-distribution functions:
\[
V_k(F) = \int F_k(t)c(t)\,dG(t) = \int C_g(x)\,dF_k(x), \qquad k = 1,\dots,K+1,
\]
where $C_g(t) = \int_{[t,\infty)} c(x)\,dG(x)$. The second equality follows from Fubini's theorem if $\int F_k(t)|c(t)|\,dG(t) < \infty$. We choose to consider the functionals $\int F_k(t)c(t)\,dG(t)$ instead of $\int F_k(t)c(t)\,dt$, because doing so allows us to get asymptotic results with few assumptions on $G$. Furthermore, the functionals $\int F_k(t)c(t)\,dt$ fit into the family $\int F_k(t)b(t)\,dG(t)$ by assuming that $G$ has a density $g$ with respect to Lebesgue measure, and choosing $b(t) = c(t)/g(t)$.
Jewell, Van der Laan and Henneman (2003, Section 8) discuss results that suggest that the naive estimators yield fully efficient estimators for these smooth functionals, and that under some conditions
\[
\sqrt{n}\bigl\{V_k(\tilde F_n) - V_k(F_0)\bigr\} = \sqrt{n}\int \bigl\{\tilde F_{nk}(t) - F_{0k}(t)\bigr\}c(t)\,dG(t)
\to_d N\Bigl(0, \int F_{0k}(t)(1 - F_{0k}(t))c^2(t)\,dG(t)\Bigr).
\]
We show that the same is true for the MLE, and hence that the naive estimator and the MLE are asymptotically equivalent for these smooth functionals. In Section 7.1 we derive the information lower bound for our model, and in Section 7.2 we show that the MLE achieves this lower bound. We assume that the MLEs $\hat F_{nk}$ are piecewise constant and right-continuous, with jumps only at points in $T_k$ (see Definition 2.22).
7.1 Information bound calculations
Since our variables of interest $(X, Y)$ are subject to censoring, we can consider so-called hidden and observed models for our data. The hidden data consist of the triplets $(T, X, Y)$, and the hidden model is $\mathcal{Q} = \{Q_F : F \in \mathcal{F}_K\}$. The corresponding density $q_F(x, y = k)$ is simply $f_k(x)$. The observed data are $H(T, X, Y) = (T, \Delta)$ and the observed model is
\[
\mathcal{P} = \{Q_F H^{-1} : F \in \mathcal{F}_K\}. \tag{7.1}
\]
The density of $P_F \in \mathcal{P}$ with respect to $\mu$ is
\[
p_F(t,\delta) = \prod_{k=1}^{K} F_k(t)^{\delta_k}\,(1 - F_+(t))^{1-\delta_+},
\]
where $\mu = G \times \#$ and $\#$ is counting measure on the unit vectors $e_k \in \mathbb{R}^{K+1}$, $k = 1,\dots,K+1$.

Let $L_2(P)$ be the equivalence class of $P$-square integrable functions, with inner product $\langle g_1, g_2\rangle_{L_2(P)} = \int g_1 g_2\,dP$ and norm $\|g\| = \{\int g^2\,dP\}^{1/2}$. Let $L_2^0(P)$ be the subset of $g \in L_2(P)$ with $E_P(g) = \int g\,dP = 0$. Finally, note that both $P \in \mathcal{P}$ and $Q \in \mathcal{Q}$ depend on the underlying distribution $F$. However, we often suppress this dependence in the notation.
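As an aside, the observed-data density above translates directly into the log-likelihood of a sample. A minimal sketch of its evaluation, assuming `numpy` (the function name is hypothetical and the sub-distribution values are supplied as arrays):

```python
import numpy as np

def log_likelihood(Fk_vals, delta):
    """Evaluate sum_i log p_F(t_i, delta_i) for current status competing risks data.

    Fk_vals : (n, K) array; entry (i, k) is F_{k+1}(t_i)
    delta   : (n, K+1) array of indicators; each row is a unit vector e_j
    """
    Fplus = Fk_vals.sum(axis=1)
    probs = np.column_stack([Fk_vals, 1.0 - Fplus])  # (F_1(t), ..., F_K(t), 1 - F_+(t))
    # log p_F(t, delta) = sum_k delta_k log F_k(t) + (1 - delta_+) log(1 - F_+(t));
    # assumes every probability attached to an observed outcome is positive
    return float(np.sum(delta * np.log(np.where(delta > 0, probs, 1.0))))
```

The `np.where` guard only avoids evaluating `log(0)` at outcomes that were not observed; the indicator structure of each row then selects exactly one log-probability per observation.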
The functionals $V_k(F)$, $F \in \mathcal{F}_K$, are implicitly defined in the sense that the observed data are from $P_F$ rather than directly from $F$. In terms of the observed data, we can write the functionals as $\Theta_k(P_F)$, $P_F \in \mathcal{P}$. Here the observation time distribution $G$ acts as a nuisance parameter. Information bounds for such implicitly defined functionals were studied by Van der Vaart (1991). Discussions of Van der Vaart's work can be found in Groeneboom and Wellner (1992, pages 23-32), and Bickel, Klaassen, Ritov and Wellner (1993, pages 201-210).
Throughout, we need the following assumptions:

(a) The distribution $G$ of $T$ is fixed and known (see Remark 7.16 for a discussion of the effects of not knowing $G$);

(b) $I_{F_k}^{-1} = \int F_k(t)(1 - F_k(t))c^2(t)\,dG(t) < \infty$;

(c) $\int F_k(t)|c(t)|\,dG(t) < \infty$.
Proposition 7.1 The score operator $l_F$ relates the observed model to the hidden model. It is the bounded linear operator from $L_2^0(Q)$ to $L_2^0(P)$ given by
\[
[l_F a](t,\delta) = \sum_{k=1}^{K} \int_{(-\infty,t]} a(x,k)\,dF_k(x) \left(\frac{\delta_k}{F_k(t)} - \frac{\delta_{K+1}}{1 - F_+(t)}\right) \quad \text{a.e. } P_F. \tag{7.2}
\]
The adjoint $l^T$ of the score operator is the bounded linear operator from $L_2^0(P)$ to $L_2^0(Q)$ given by
\[
[l^T b](x,k) = \int_{[x,\infty)} b(t,e_k)\,dG(t) + \int_{(-\infty,x)} b(t,e_{K+1})\,dG(t) \quad \text{a.e. } F. \tag{7.3}
\]
Note that $l^T$ does not depend on $F$.
Proof: Let $a \in L_2^0(Q)$. By for example Groeneboom and Wellner (1992, page 8, equation (1.5)), we have
\begin{align*}
[l_F a](t,\delta) &= E\bigl(a(X,Y) \mid H(T,X,Y) = (t,\delta)\bigr)\\
&= \sum_{k=1}^{K} \left\{\delta_k \frac{\int_{(-\infty,t]} a(x,k)\,dF_k(x)}{F_k(t)} + \delta_{K+1} \frac{\int_{(t,\infty)} a(x,k)\,dF_k(x)}{1 - F_+(t)}\right\}\\
&= \sum_{k=1}^{K} \left\{\delta_k \frac{\int_{(-\infty,t]} a(x,k)\,dF_k(x)}{F_k(t)} - \delta_{K+1} \frac{\int_{(-\infty,t]} a(x,k)\,dF_k(x)}{1 - F_+(t)}\right\}\\
&= \sum_{k=1}^{K} \int_{(-\infty,t]} a(x,k)\,dF_k(x) \left(\frac{\delta_k}{F_k(t)} - \frac{\delta_{K+1}}{1 - F_+(t)}\right) \quad \text{a.e. } P_F,
\end{align*}
where we use $\sum_{k=1}^{K} \int a(x,k)\,dF_k(x) = 0$ (since $a \in L_2^0(Q)$) to obtain the third line.

Let $b \in L_2^0(P)$. Then
\begin{align*}
[l^T b](x,k) &= E\bigl(b(T,\Delta) \mid (X,Y) = (x,k)\bigr)\\
&= \int_{[x,\infty)} b(t,e_k)\,dG(t) + \int_{(-\infty,x)} b(t,e_{K+1})\,dG(t) \quad \text{a.e. } F. \qquad \Box
\end{align*}
The functional $V_k(F)$ is said to be pathwise differentiable at $F$ in the hidden model $\mathcal{Q}$ if there is a continuous linear map $v_{kF}$ from $L_2^0(Q)$ to $\mathbb{R}$ such that
\[
\eta^{-1}\bigl(V_k(F_\eta) - V_k(F)\bigr) \to v_{kF}a
\]
for every path $F_\eta$ in $\mathcal{Q}$ with score $a$. We call $v_{kF}$ the canonical gradient of $V_k(F)$ in the hidden model.
Proposition 7.2 The canonical gradients $v_{1F},\dots,v_{K+1,F}$ of $V_1,\dots,V_{K+1}$ at $F$ in the hidden model are bounded linear functionals from $L_2^0(Q)$ to $\mathbb{R}$, given by
\begin{align*}
v_{kF}a &= \sum_{j=1}^{K} \int \Bigl(C_g(x)1\{j = k\} - \int C_g(t)\,dF_k(t)\Bigr) a(x,j)\,dF_j(x),\\
v_{K+1,F}a &= -\sum_{k=1}^{K} \int \Bigl(C_g(x) - \int C_g(t)\,dF_+(t)\Bigr) a(x,k)\,dF_k(x),
\end{align*}
where the first equality holds for $k = 1,\dots,K$. Furthermore, their adjoints are bounded linear maps from $\mathbb{R}$ to $L_2^0(Q)$, given by
\begin{align*}
[v_{kF}^T b](x,j) &= \Bigl(C_g(x)1\{j = k\} - \int C_g(t)\,dF_k(t)\Bigr) b, \qquad k = 1,\dots,K,\\
[v_{K+1,F}^T b](x,k) &= -\Bigl(C_g(x) - \int C_g(t)\,dF_+(t)\Bigr) b.
\end{align*}
Proof: Let $a \in L_2^0(Q)$ be bounded. Consider the perturbation
\[
F_{k\eta}(t) = F_k(t) + \eta \int_{(-\infty,t]} a(x,k)\,dF_k(x).
\]
Note that these functions are monotone nondecreasing for small $\eta$. Furthermore, $a \in L_2^0(Q)$ implies $\sum_{k=1}^{K} \int a(x,k)\,dF_k(x) = 0$, so that
\[
\sum_{k=1}^{K} F_{k\eta}(\infty) = F_+(\infty) + \eta \sum_{k=1}^{K} \int a(x,k)\,dF_k(x) = F_+(\infty) = 1.
\]
It follows that the $F_{k\eta}$'s are valid sub-distribution functions. Now let $k \in \{1,\dots,K\}$. Then
\[
V_k(F_\eta) - V_k(F) = \int C_g(x)\,d\Bigl[\eta \int_{(-\infty,x]} a(t,k)\,dF_k(t)\Bigr] = \eta \int C_g(x)a(x,k)\,dF_k(x),
\]
so that
\begin{align*}
v_{kF}a &= \int C_g(x)a(x,k)\,dF_k(x) = \sum_{j=1}^{K} \int C_g(x)a(x,j)1\{j = k\}\,dF_j(x)\\
&= \sum_{j=1}^{K} \int \Bigl(C_g(x)1\{j = k\} - \int C_g(t)\,dF_k(t)\Bigr) a(x,j)\,dF_j(x),
\end{align*}
again using that $a \in L_2^0(Q)$. Furthermore,
\[
F_{K+1,\eta}(t) = 1 - F_{+\eta}(t) = 1 - F_+(t) - \eta \sum_{k=1}^{K} \int_{(-\infty,t]} a(x,k)\,dF_k(x).
\]
Hence,
\[
V_{K+1}(F_\eta) - V_{K+1}(F) = \int C_g(x)\,d\Bigl[-\eta \sum_{k=1}^{K} \int_{(-\infty,x]} a(t,k)\,dF_k(t)\Bigr] = -\eta \sum_{k=1}^{K} \int C_g(x)a(x,k)\,dF_k(x).
\]
Hence,
\begin{align*}
v_{K+1,F}a &= -\sum_{k=1}^{K} \int C_g(x)a(x,k)\,dF_k(x)\\
&= -\sum_{k=1}^{K} \int \Bigl(C_g(x) - \int C_g(t)\,dF_+(t)\Bigr) a(x,k)\,dF_k(x).
\end{align*}
We find the adjoints $v_{kF}^T$, $k = 1,\dots,K+1$, using the relation $\langle v_{kF}a, b\rangle_{\mathbb{R}} = \langle a, v_{kF}^T b\rangle_{L_2(Q)}$. This yields
\begin{align*}
\langle v_{kF}a, b\rangle_{\mathbb{R}} &= \sum_{j=1}^{K} \int b \Bigl(C_g(x)1\{j = k\} - \int C_g(t)\,dF_k(t)\Bigr) a(x,j)\,dF_j(x)\\
&= \Bigl\langle \Bigl(C_g(x)1\{j = k\} - \int C_g(t)\,dF_k(t)\Bigr) b,\ a(x,j)\Bigr\rangle_{L_2(Q)},
\end{align*}
so that
\[
[v_{kF}^T b](x,j) = \Bigl(C_g(x)1\{j = k\} - \int C_g(t)\,dF_k(t)\Bigr) b, \qquad k = 1,\dots,K.
\]
The adjoint $v_{K+1,F}^T$ can be derived analogously. $\Box$
It now follows from Van der Vaart (1991, Theorem 3.1) that the functionals $V_k(F) = \Theta_k(P_F)$, $k = 1,\dots,K+1$, are pathwise differentiable in the observed model if and only if
\[
v_{kF}^T \in \mathcal{R}(l^T),
\]
and if this holds, then the canonical gradient is the unique element $\theta_{kF} \in \overline{\mathcal{R}(l)}$ satisfying
\[
l^T \theta_{kF} = v_{kF}^T.
\]
Proposition 7.3 The canonical gradients $\theta_{1F},\dots,\theta_{K+1,F}$ of $V_1,\dots,V_{K+1}$ in the observed model are bounded linear functionals from $L_2^0(P)$ to $\mathbb{R}$, given by
\[
\theta_{jF}(t,\delta) = \{\delta_j - F_j(t)\}c(t), \qquad j = 1,\dots,K+1. \tag{7.4}
\]
Furthermore,
\[
\theta_{jF}(t,\delta) = \sum_{k=1}^{K} \left\{\frac{\delta_k}{F_k(t)} - \frac{1-\delta_+}{1-F_+(t)}\right\} d_{jF}(t,k), \qquad j = 1,\dots,K+1, \tag{7.5}
\]
where
\[
d_{jF}(t,k) = F_j(t)\bigl(1\{j = k\} - F_k(t)\bigr)c(t), \qquad j = 1,\dots,K+1. \tag{7.6}
\]
The information lower bounds for estimating $V_1,\dots,V_{K+1}$ in the observed model are
\[
I_{F_j}^{-1} = \|\theta_{jF}\|_{L_2(P)}^2 = \int F_j(t)(1 - F_j(t))c^2(t)\,dG(t), \qquad j = 1,\dots,K+1. \tag{7.7}
\]
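Since (7.4) and (7.5) agree pointwise in $(t,\delta)$, their equality is a finite algebraic identity that can be checked numerically before reading the proof. A small sketch, assuming `numpy` (all numerical values are hypothetical; components are indexed from 0 in the code):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4
F = rng.uniform(0.05, 0.15, size=K)   # F_1(t), ..., F_K(t) at a fixed t, with F_+(t) < 1
Fp = F.sum()
Fall = np.append(F, 1.0 - Fp)         # append F_{K+1}(t) = 1 - F_+(t)
c = 0.7                               # value of c(t)

for j in range(K + 1):                # component j+1
    for m in range(K + 1):            # observed outcome: delta = e_{m+1}
        delta = np.zeros(K + 1)
        delta[m] = 1.0
        dplus = delta[:K].sum()       # delta_+
        d = Fall[j] * ((np.arange(K) == j).astype(float) - F) * c        # d_{jF}(t,k), (7.6)
        rhs = np.sum((delta[:K] / F - (1.0 - dplus) / (1.0 - Fp)) * d)   # (7.5)
        lhs = (delta[j] - Fall[j]) * c                                   # (7.4)
        assert abs(lhs - rhs) < 1e-12
```

The loop raises an `AssertionError` if the two expressions ever disagree; with all $F_k(t)$ strictly positive, the $0/0$ convention never comes into play.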
Proof: We first consider $\theta_{jF}$ for $j \in \{1,\dots,K\}$. For all $k = 1,\dots,K$, we have
\begin{align*}
[l^T \theta_{jF}](x,k) &= \int_{[x,\infty)} \theta_{jF}(t,e_k)\,dG(t) + \int_{(-\infty,x)} \theta_{jF}(t,e_{K+1})\,dG(t)\\
&= \int_{[x,\infty)} \bigl\{1\{k = j\} - F_j(t)\bigr\}c(t)\,dG(t) - \int_{(-\infty,x)} F_j(t)c(t)\,dG(t)\\
&= C_g(x)1\{k = j\} - \int C_g(t)\,dF_j(t) = v_{jF}^T(x,k).
\end{align*}
We now consider $\theta_{K+1,F}$:
\begin{align*}
[l^T \theta_{K+1,F}](x,k) &= -\int_{[x,\infty)} F_{K+1}(t)c(t)\,dG(t) + \int_{(-\infty,x)} \bigl\{1 - F_{K+1}(t)\bigr\}c(t)\,dG(t)\\
&= -\int F_{K+1}(t)c(t)\,dG(t) + \int_{(-\infty,x)} c(t)\,dG(t).
\end{align*}
This can be written as
\[
\int F_+(t)c(t)\,dG(t) - \int_{[x,\infty)} c(t)\,dG(t) = \int C_g(t)\,dF_+(t) - C_g(x) = v_{K+1,F}^T(x,k).
\]
Next, we check expression (7.5). For $j = 1,\dots,K$, we have
\begin{align*}
&\sum_{k=1}^{K} \left\{\frac{\delta_k}{F_k(x)} - \frac{1-\delta_+}{1-F_+(x)}\right\} d_{jF}(x,k)\\
&\quad= \sum_{k=1}^{K} \left\{\frac{\delta_k}{F_k(x)} - \frac{1-\delta_+}{1-F_+(x)}\right\} F_j(x)\bigl(1\{j = k\} - F_k(x)\bigr)c(x)\\
&\quad= c(x)\left[\left\{\frac{\delta_j}{F_j(x)} - \frac{1-\delta_+}{1-F_+(x)}\right\} F_j(x)(1 - F_j(x)) - \sum_{k \ne j} \left\{\frac{\delta_k}{F_k(x)} - \frac{1-\delta_+}{1-F_+(x)}\right\} F_j(x)F_k(x)\right]\\
&\quad= c(x)\left[\delta_j(1 - F_j(x)) - \frac{(1-\delta_+)F_j(x)(1 - F_j(x))}{1 - F_+(x)} - \sum_{k \ne j} \left\{\delta_k F_j(x) - \frac{(1-\delta_+)F_j(x)F_k(x)}{1 - F_+(x)}\right\}\right]\\
&\quad= c(x)\left[\delta_j - \delta_+ F_j(x) - \frac{(1-\delta_+)F_j(x)}{1 - F_+(x)} + \frac{(1-\delta_+)F_j(x)F_+(x)}{1 - F_+(x)}\right]\\
&\quad= c(x)\bigl\{\delta_j - \delta_+ F_j(x) - (1-\delta_+)F_j(x)\bigr\}\\
&\quad= c(x)\bigl\{\delta_j - F_j(x)\bigr\} = \theta_{jF}(x,\delta).
\end{align*}
We verify the expression for $j = K+1$ analogously:
\[
\sum_{k=1}^{K} \left\{\frac{\delta_k}{F_k(x)} - \frac{1-\delta_+}{1-F_+(x)}\right\} d_{K+1,F}(x,k)
= -\sum_{k=1}^{K} \left\{\frac{\delta_k}{F_k(x)} - \frac{\delta_{K+1}}{F_{K+1}(x)}\right\} F_{K+1}(x)F_k(x)c(x).
\]
This can be written as
\[
-c(x)\sum_{k=1}^{K} \bigl\{\delta_k F_{K+1}(x) - \delta_{K+1}F_k(x)\bigr\}
= -c(x)\bigl\{\delta_+ F_{K+1}(x) - \delta_{K+1}F_+(x)\bigr\}
= \bigl\{\delta_{K+1} - F_{K+1}(x)\bigr\}c(x) = \theta_{K+1,F}(x,\delta).
\]
The expressions for the information lower bounds follow from direct computation. $\Box$
Remark 7.4 The expressions $d_{jF}(t,k)$ given in (7.6) typically have discontinuities that do not coincide with discontinuities of $F_k$, for two reasons: (i) $F_j$ can have jumps at other locations than $F_k$; (ii) the function $c(t)$ may have jumps at other locations than $F_k$. In such cases we cannot express $d_{jF}(t,k)$ as $\int_{(-\infty,t]} a(x,k)\,dF_k(x)$ for some $a \in L_2^0(Q)$. Hence $\theta_{jF} \in \overline{\mathcal{R}(l)} \setminus \mathcal{R}(l)$.
7.2 Asymptotic normality of functionals of the MLE

We now let $c(x) = \xi(x)1_{[0,t]}(x)$, where $\xi : \mathbb{R}_+ \to \mathbb{R}_+$ is Lipschitz continuous. With this choice of the function $c(\cdot)$, assumptions (b) and (c) of Section 7.1 are automatically satisfied when $t$ is finite. Furthermore, this choice of the function $c(\cdot)$ yields
\begin{align*}
\theta_{j,F,t}(u,\delta) &= \{\delta_j - F_j(u)\}\xi(u)1_{[0,t]}(u),\\
d_{j,F,t}(u,k) &= F_j(u)\bigl(1\{j = k\} - F_k(u)\bigr)\xi(u)1_{[0,t]}(u).
\end{align*}
Throughout, we assume $F_{0+}(0) = 0$, and we use the convention $0/0 = 0$. We now give the main result of this chapter.
Theorem 7.5 Let $t_0 \in \mathbb{R}$ and $c(t) = \xi(t)1_{[0,t_0]}(t)$, where $\xi : \mathbb{R}_+ \to \mathbb{R}_+$ is Lipschitz continuous. Assume that $F_{01},\dots,F_{0K}$ are absolutely continuous with respect to Lebesgue measure on $[0,t_0]$, with densities $f_{01},\dots,f_{0K}$. Assume that $\epsilon < f_{0k}(t) < M$ for some constants $0 < \epsilon < M$, for all $t \in [0,t_0]$ and $k = 1,\dots,K$. Furthermore, assume that $F_{0+}(t_0) < 1$, that $F_{0+}$ is continuous at $t_0$, and that $G$ has a strictly positive density $g$ on a neighborhood of $t_0$. Then
\[
\sqrt{n}\bigl(V_k(F_0) - V_k(\hat F_n)\bigr) = \sqrt{n}\int_{[0,t_0]} \bigl\{F_{0k}(t) - \hat F_{nk}(t)\bigr\}\xi(t)\,dG(t)
\to_d N\Bigl(0, \int_{[0,t_0]} F_{0k}(t)(1 - F_{0k}(t))\xi^2(t)\,dG(t)\Bigr),
\]
for $k = 1,\dots,K+1$.
Proof: The proof is similar in spirit to the proofs of Huang and Wellner (1995) and Geskus and Groeneboom (1996, 1997, 1999). However, the new aspect here is that we have a system of sub-distribution functions.

We first consider $F_{K+1}$. In Lemma 7.14 (ahead), we show that
\[
\int_{[0,t_0]} \bigl\{F_{0,K+1}(t) - \hat F_{n,K+1}(t)\bigr\}\xi(t)\,dG(t) \le \int \theta_{K+1,F_0,t_0}\,d(P_0 - P_n) + O_p(n^{-2/3}).
\]
Letting $\theta_{+,F,t} = \sum_{k=1}^{K} \theta_{k,F,t}$ and noting that $F_{K+1} = 1 - F_+$ and $\theta_{+,F,t} = -\theta_{K+1,F,t}$, we find that this is equivalent to
\[
\int_{[0,t_0]} \bigl\{F_{0+}(t) - \hat F_{n+}(t)\bigr\}\xi(t)\,dG(t) \ge \int \theta_{+,F_0,t_0}\,d(P_0 - P_n) + O_p(n^{-2/3}). \tag{7.8}
\]
Next, we consider the components $F_1,\dots,F_K$. In Lemma 7.15 (ahead), we show that
\[
\int_{[0,t_0]} \bigl\{F_{0k}(t) - \hat F_{nk}(t)\bigr\}\xi(t)\,dG(t) \le \max_{\tau \in \{\sigma_{nk},\tau_{nk}\}} \int \theta_{k,F_0,\tau-}\,d(P_0 - P_n) + O_p(n^{-2/3}), \tag{7.9}
\]
where $\sigma_{nk}$ is the last jump point of $\hat F_{nk}$ before $t_0$, and $\tau_{nk}$ is the first jump point of $\hat F_{nk}$ after $t_0$. Consistency of $\hat F_{nk}$ in a neighborhood of $t_0$ (Proposition 4.15) and $f_{0k}(t_0) > 0$ imply that $\tau_{nk} - \sigma_{nk} \to_{a.s.} 0$. Furthermore, the central limit theorem implies that for any $\alpha > 0$
\begin{align*}
\sqrt{n}\int \bigl\{\theta_{k,F_0,t_0+\alpha} - \theta_{k,F_0,t_0}\bigr\}\,d(P_0 - P_n)
&= \sqrt{n}\int_{(t_0,t_0+\alpha]} \{\delta_k - F_{0k}(t)\}\xi(t)\,d(P_0 - P_n)\\
&\to_d N\Bigl(0, \int_{(t_0,t_0+\alpha]} F_{0k}(t)(1 - F_{0k}(t))\xi^2(t)\,dG(t)\Bigr).
\end{align*}
Hence,
\[
\max_{\tau \in \{\sigma_{nk},\tau_{nk}\}} \int \theta_{k,F_0,\tau-}\,d(P_0 - P_n) = \int \theta_{k,F_0,t_0}\,d(P_0 - P_n) + o_p(n^{-1/2}).
\]
Combining this with (7.9) yields
\[
\int_{[0,t_0]} \bigl\{F_{0k}(t) - \hat F_{nk}(t)\bigr\}\xi(t)\,dG(t) \le \int \theta_{k,F_0,t_0}\,d(P_0 - P_n) + o_p(n^{-1/2}), \tag{7.10}
\]
and summing over $k = 1,\dots,K$ gives
\[
\int_{[0,t_0]} \bigl\{F_{0+}(t) - \hat F_{n+}(t)\bigr\}\xi(t)\,dG(t) \le \int \theta_{+,F_0,t_0}\,d(P_0 - P_n) + o_p(n^{-1/2}). \tag{7.11}
\]
Combining (7.11) and (7.8) yields
\[
\sqrt{n}\int_{[0,t_0]} \bigl\{F_{0+}(t) - \hat F_{n+}(t)\bigr\}\xi(t)\,dG(t) = \sqrt{n}\int \theta_{+,F_0,t_0}\,d(P_0 - P_n) + o_p(1)
\to_d N\bigl(0, \|\theta_{+,F_0,t_0}\|_{L_2(P_0)}^2\bigr),
\]
where the convergence follows from the central limit theorem. Since $F_{K+1} = 1 - F_+$ and $\|\theta_{+,F_0,t_0}\|^2 = \|\theta_{K+1,F_0,t_0}\|^2$, this is equivalent to
\[
\sqrt{n}\int_{[0,t_0]} \bigl\{F_{0,K+1}(t) - \hat F_{n,K+1}(t)\bigr\}\xi(t)\,dG(t) \to_d N\bigl(0, \|\theta_{K+1,F_0,t_0}\|_{L_2(P_0)}^2\bigr).
\]
Finally, (7.10) and (7.8) imply that
\[
\sqrt{n}\int_{[0,t_0]} \bigl\{F_{0k}(t) - \hat F_{nk}(t)\bigr\}\xi(t)\,dG(t) = \sqrt{n}\int \theta_{k,F_0,t_0}\,d(P_0 - P_n) + o_p(1), \tag{7.12}
\]
for $k = 1,\dots,K$. The convergence result then again follows from the central limit theorem. $\Box$
Remark 7.6 Theorem 7.5 requires that the underlying $F_{01},\dots,F_{0K}$ are absolutely continuous with respect to Lebesgue measure, with densities bounded away from zero. This assumption was not used when computing the MLE, and in fact the MLE does not satisfy this assumption: the MLE can be taken to be piecewise constant, so it always contains horizontal pieces, where its density equals zero. Thus, under the assumptions of Theorem 7.5, we expect that one can construct better estimators than the MLE. However, such estimators are not better (asymptotically) for the estimation of the smooth functionals we consider, since the MLE is asymptotically efficient for these smooth functionals.
In order to prove the key Lemmas 7.14 and 7.15 that are needed in the proof of
Theorem 7.5, we need to establish a few results. We start with a basic but important
fact.
Lemma 7.7 For any $F = (F_1,\dots,F_K) \in \mathcal{F}_K$, we have
\[
\int_{[0,t_0]} \bigl(F_{0k}(t) - F_k(t)\bigr)\xi(t)\,dG(t) = \int \theta_{k,F,t_0}\,dP_0, \qquad k = 1,\dots,K+1.
\]
Proof:
\[
\int \theta_{k,F,t_0}\,dP_0 = \int \{\delta_k - F_k(t)\}\xi(t)1_{[0,t_0]}(t)\,dP_0 = \int_{[0,t_0]} \bigl\{F_{0k}(t) - F_k(t)\bigr\}\xi(t)\,dG(t),
\]
since $E_{P_0}(\delta_k \mid T = t) = F_{0k}(t)$. $\Box$
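Because the identity rests only on $E_{P_0}(\delta_k \mid T = t) = F_{0k}(t)$, it holds exactly when $G$ is discrete, which makes it easy to verify numerically. A small sketch, assuming `numpy` (all numerical values hypothetical; components indexed from 0):

```python
import numpy as np

rng = np.random.default_rng(1)
K, m = 3, 8
tgrid = np.linspace(0.1, 2.0, m)                      # support of a discrete G
g = np.full(m, 1.0 / m)                               # G puts mass 1/m on each point
F0 = np.sort(rng.uniform(0.0, 0.2, (m, K)), axis=0)   # true sub-distribution values (monotone columns)
F  = np.sort(rng.uniform(0.0, 0.2, (m, K)), axis=0)   # an arbitrary F in the model class
xi = 1.0 + 0.5 * tgrid                                # a Lipschitz xi
ind = (tgrid <= 1.5).astype(float)                    # c(t) = xi(t) 1{t <= t0}, with t0 = 1.5
k = 1                                                 # component to check

# left side: E_{P_0} theta_{k,F,t0}, written out over the K+1 outcomes of delta
p = np.column_stack([F0, 1.0 - F0.sum(axis=1)])       # P(delta = e_j | T = t)
theta = lambda j: ((j == k) - F[:, k]) * xi * ind     # theta_{k,F,t0} at delta = e_{j+1}
lhs = sum(np.sum(g * p[:, j] * theta(j)) for j in range(K + 1))
# right side: integral of (F_{0k} - F_k) xi over [0, t0] against G
rhs = np.sum(g * (F0[:, k] - F[:, k]) * xi * ind)
assert abs(lhs - rhs) < 1e-12
```

The check is exact (not Monte Carlo): summing $\theta$ against the conditional outcome probabilities reproduces $(F_{0k} - F_k)\xi$ pointwise.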
Next, we introduce an adapted version $\bar\theta_{j,F,t_0}$ of $\theta_{j,F,t_0}$. The main goal of this adaptation is that the corresponding functions $\bar d_{j,F,t_0}(x,k)$ are constant on the same intervals as $F_k$, so that we can use the (in)equalities given by the characterization in Proposition 2.34.
Definition 7.8 Let $F = (F_1,\dots,F_K) \in \mathcal{F}_K$ be piecewise constant, and let $k \in \{1,\dots,K\}$. Let $0 = \tau_{k0} < \tau_{k1} < \cdots < \tau_{k,p_k} < \tau_{k,p_k+1} = \infty$ be the ordered jump points of $F_k$. For $i \in \{0,\dots,p_k\}$, let $J_{ki} = [\tau_{ki}, \tau_{k,i+1})$ and distinguish the following three cases:

(i) $F_k(s_{ki}) = F_{0k}(s_{ki})$ for some $s_{ki} \in J_{ki}$;

(ii) $F_k(u) < F_{0k}(\tau_{ki})$ for all $u \in J_{ki}$;

(iii) $F_k(u) > F_{0k}(\tau_{k,i+1}-)$ for all $u \in J_{ki}$.

In case (ii) we define $s_{ki} = \tau_{ki}$, and in case (iii) we define $s_{ki} = \tau_{k,i+1}-$. Furthermore, for $j = 1,\dots,K+1$, we choose a point $u_{kji} \in J_{ki}$ such that
\[
|F_j(u_{kji}) - F_{0j}(s_{ki})| = \min_{x \in J_{ki}} |F_j(x) - F_{0j}(s_{ki})|.
\]
Then, for $t \in J_{ki}$, we define
\[
\bar F_j^{(k)}(t) = F_j(u_{kji}), \qquad \bar\xi^{(k)}(t) = \xi(s_{ki}).
\]
Finally, we define
\begin{align*}
\bar d_{j,F,t_0}(t,k) &= \bar F_j^{(k)}(t)\bigl(1\{j = k\} - F_k(t)\bigr)\bar\xi^{(k)}(t)1_{[0,t_0]}(t),\\
\bar\theta_{j,F,t_0}(t,\delta) &= \sum_{k=1}^{K} \left\{\frac{\delta_k}{F_k(t)} - \frac{1-\delta_+}{1-F_+(t)}\right\} \bar d_{j,F,t_0}(t,k).
\end{align*}
This adapted version of $\theta_{j,\hat F_n,t_0}$ is useful, because we have information on the sign of $\int \bar\theta_{j,\hat F_n,t_0}\,dP_n$, $j = 1,\dots,K+1$. In Lemma 7.9 we show that $\int \bar\theta_{K+1,\hat F_n,t_0}\,dP_n \le 0$. In Lemma 7.10 we show that $\int \bar\theta_{j,\hat F_n,\tau_{nj}-}\,dP_n \le 0$ for $j = 1,\dots,K$ and $\tau_{nj}$ a jump point of $\hat F_{nj}$.
Lemma 7.9 We have $\int \bar\theta_{K+1,\hat F_n,t_0}\,dP_n \le 0$.

Proof: Note that
\[
\int \bar\theta_{K+1,\hat F_n,t_0}\,dP_n = \sum_{k=1}^{K} \int \left\{\frac{\delta_k}{\hat F_{nk}(t)} - \frac{1-\delta_+}{1-\hat F_{n+}(t)}\right\} \bar d_{K+1,\hat F_n,t_0}(t,k)\,dP_n(t,\delta),
\]
and that $\bar d_{K+1,\hat F_n,t_0}(t,k)$ is constant on the same intervals as $\hat F_{nk}(t)$, except for the one containing $t_0$. Using the characterization given in Proposition 2.34, it follows that for each $k = 1,\dots,K$
\begin{align*}
&\int_{t \in [0,t_0]} \left\{\frac{\delta_k}{\hat F_{nk}(t)} - \frac{1-\delta_+}{1-\hat F_{n+}(t)}\right\} \bar d_{K+1,\hat F_n,t_0}(t,k)\,dP_n(t,\delta)\\
&\quad= \int_{t \in [\tau_{nk1},t_0]} \left\{\frac{\delta_k}{\hat F_{nk}(t)} - \frac{1-\delta_+}{1-\hat F_{n+}(t)}\right\} \bar d_{K+1,\hat F_n,t_0}(t,k)\,dP_n(t,\delta)\\
&\quad= -\int_{t \in [\tau_{nk1},t_0]} \left\{\frac{\delta_k}{\hat F_{nk}(t)} - \frac{1-\delta_+}{1-\hat F_{n+}(t)}\right\} \bar F_{n,K+1}^{(k)}(t)\hat F_{nk}(t)\bar\xi^{(k)}(t)\,dP_n(t,\delta) \le 0.
\end{align*}
Here $\tau_{nk1}$ is the first jump point of $\hat F_{nk}$, and the first equality follows from $\hat F_{nk}(t) = 0$ for $t < \tau_{nk1}$. The last inequality follows from equality in (2.41) for $t = \tau_{nk1}$, inequality in expression (2.41) for $t = t_0+$, and the fact that $\bar F_{n,K+1}^{(k)}(t)\hat F_{nk}(t)\bar\xi^{(k)}(t)$ is constant on the same intervals as $\hat F_{nk}$. $\Box$
We can say something similar about $\int \bar\theta_{j,\hat F_n,t_0}\,dP_n$, $j = 1,\dots,K$, but only for jump points of $\hat F_{nj}$.
Lemma 7.10 Let $\tau_{nj}$ be a jump point of $\hat F_{nj}$. Then we have for $j = 1,\dots,K$,
\[
\int \bar\theta_{j,\hat F_n,\tau_{nj}-}\,dP_n \le 0.
\]
Proof: Let $\tau_{nk1}$ be the first jump point of $\hat F_{nk}$, $k = 1,\dots,K$. Note that
\begin{align*}
&\int \bar\theta_{j,\hat F_n,\tau_{nj}-}\,dP_n\\
&\quad= \sum_{k=1}^{K} \int \left\{\frac{\delta_k}{\hat F_{nk}(t)} - \frac{1-\delta_+}{1-\hat F_{n+}(t)}\right\} \bar d_{j,\hat F_n,\tau_{nj}-}(t,k)\,dP_n(t,\delta)\\
&\quad= \int_{t \in [0,\tau_{nj})} \left\{\frac{\delta_j}{\hat F_{nj}(t)} - \frac{1-\delta_+}{1-\hat F_{n+}(t)}\right\} \hat F_{nj}(t)(1 - \hat F_{nj}(t))\bar\xi^{(j)}(t)\,dP_n(t,\delta)\\
&\qquad- \sum_{k \ne j} \int_{t \in [0,\tau_{nj})} \left\{\frac{\delta_k}{\hat F_{nk}(t)} - \frac{1-\delta_+}{1-\hat F_{n+}(t)}\right\} \bar F_{nj}^{(k)}(t)\hat F_{nk}(t)\bar\xi^{(k)}(t)\,dP_n(t,\delta)\\
&\quad= \int_{t \in [\tau_{nj1},\tau_{nj})} \left\{\frac{\delta_j}{\hat F_{nj}(t)} - \frac{1-\delta_+}{1-\hat F_{n+}(t)}\right\} \hat F_{nj}(t)(1 - \hat F_{nj}(t))\bar\xi^{(j)}(t)\,dP_n(t,\delta)\\
&\qquad- \sum_{k \ne j} \int_{t \in [\tau_{nk1},\tau_{nj})} \left\{\frac{\delta_k}{\hat F_{nk}(t)} - \frac{1-\delta_+}{1-\hat F_{n+}(t)}\right\} \bar F_{nj}^{(k)}(t)\hat F_{nk}(t)\bar\xi^{(k)}(t)\,dP_n(t,\delta)\\
&\quad= I - II \le 0.
\end{align*}
This inequality follows from the characterization in Proposition 2.34, which implies $I = 0$ and $II \ge 0$. Here $I = 0$ follows from the fact that we have equality at $t = \tau_{nj1}$ and $t = \tau_{nj}$ in (2.41) for the $j$th component. Similarly, $II \ge 0$ follows from expression (2.41) for the $k$th component ($k \ne j$), where we have equality at $t = \tau_{nk1}$ and inequality at $t = \tau_{nj}$ (because $\tau_{nj}$ is typically not a jump point of $\hat F_{nk}$). $\Box$
The last two ingredients for the proofs of Lemmas 7.14 and 7.15 are
\begin{align}
\left|\int \bigl\{\theta_{j,\hat F_n,t_0} - \bar\theta_{j,\hat F_n,t_0}\bigr\}\,dP_0\right| &= O_p(n^{-2/3}), \tag{7.13}\\
\left|\int \bigl\{\bar\theta_{j,\hat F_n,t_0} - \theta_{j,F_0,t_0}\bigr\}\,d(P_n - P_0)\right| &= O_p(n^{-2/3}). \tag{7.14}
\end{align}
In order to prove this, we first bound the differences $\bar F_{nj}^{(k)}(t) - \hat F_{nj}(t)$ and $\bar\xi^{(k)}(t) - \xi(t)$.
Lemma 7.11 Under the conditions of Theorem 7.5, we have for all $j = 1,\dots,K+1$ and $k = 1,\dots,K$,
\begin{align}
\bigl|\hat F_{nj}(x) - \bar F_{nj}^{(k)}(x)\bigr| &\le 2\bigl|\hat F_{nj}(x) - F_{0j}(x)\bigr| + (2M/\epsilon)\bigl|\hat F_{nk}(x) - F_{0k}(x)\bigr|, \tag{7.15}\\
\bigl|\xi(x) - \bar\xi^{(k)}(x)\bigr| &\le (C/\epsilon)\bigl|F_{0k}(x) - \hat F_{nk}(x)\bigr|, \tag{7.16}
\end{align}
where $\epsilon$ and $M$ are defined in Theorem 7.5 and $C > 0$ is a constant.
Proof: Let $j \in \{1,\dots,K+1\}$, $k \in \{1,\dots,K\}$, $i \in \{0,\dots,p_k\}$, and $x \in J_{ki}$. Then the triangle inequality and the definition of $\bar F_{nj}^{(k)}$ yield:
\begin{align*}
\bigl|\hat F_{nj}(x) - \bar F_{nj}^{(k)}(x)\bigr|
&\le \bigl|\hat F_{nj}(x) - F_{0j}(x)\bigr| + \bigl|F_{0j}(x) - F_{0j}(s_{ki})\bigr| + \bigl|F_{0j}(s_{ki}) - \bar F_{nj}^{(k)}(x)\bigr|\\
&= \bigl|\hat F_{nj}(x) - F_{0j}(x)\bigr| + \bigl|F_{0j}(x) - F_{0j}(s_{ki})\bigr| + \bigl|F_{0j}(s_{ki}) - \hat F_{nj}(u_{kji})\bigr|\\
&\le \bigl|\hat F_{nj}(x) - F_{0j}(x)\bigr| + \bigl|F_{0j}(x) - F_{0j}(s_{ki})\bigr| + \bigl|F_{0j}(s_{ki}) - \hat F_{nj}(x)\bigr|.
\end{align*}
Furthermore, the triangle inequality implies
\[
\bigl|F_{0j}(s_{ki}) - \hat F_{nj}(x)\bigr| \le \bigl|F_{0j}(s_{ki}) - F_{0j}(x)\bigr| + \bigl|F_{0j}(x) - \hat F_{nj}(x)\bigr|.
\]
Hence,
\[
\bigl|\hat F_{nj}(x) - \bar F_{nj}^{(k)}(x)\bigr| \le 2\bigl|\hat F_{nj}(x) - F_{0j}(x)\bigr| + 2\bigl|F_{0j}(x) - F_{0j}(s_{ki})\bigr|. \tag{7.17}
\]
Using $\epsilon < f_{0j}(t) < M$ for $t \in [0,t_0]$, we get
\[
\bigl|F_{0j}(x) - F_{0j}(s_{ki})\bigr| \le M|x - s_{ki}| \le (M/\epsilon)\bigl|F_{0k}(x) - F_{0k}(s_{ki})\bigr|. \tag{7.18}
\]
For the interval $J_{ki}$, we now consider the three possible cases in Definition 7.8. In case (i) we have
\[
\bigl|F_{0k}(x) - F_{0k}(s_{ki})\bigr| = \bigl|F_{0k}(x) - \hat F_{nk}(s_{ki})\bigr| = \bigl|F_{0k}(x) - \hat F_{nk}(x)\bigr|.
\]
In case (ii), we have $\hat F_{nk}(x) < F_{0k}(\tau_{ki})$ for all $x \in J_{ki}$ and $s_{ki} = \tau_{ki}$. This yields
\[
\bigl|F_{0k}(x) - F_{0k}(s_{ki})\bigr| = \bigl|F_{0k}(x) - F_{0k}(\tau_{ki})\bigr| < \bigl|F_{0k}(x) - \hat F_{nk}(\tau_{ki})\bigr| = \bigl|F_{0k}(x) - \hat F_{nk}(x)\bigr|.
\]
Similarly, in case (iii) we have $\hat F_{nk}(x) > F_{0k}(\tau_{k,i+1}-)$ for all $x \in J_{ki}$ and $s_{ki} = \tau_{k,i+1}-$. This yields
\[
\bigl|F_{0k}(x) - F_{0k}(s_{ki})\bigr| = \bigl|F_{0k}(x) - F_{0k}(\tau_{k,i+1}-)\bigr| < \bigl|F_{0k}(x) - \hat F_{nk}(\tau_{k,i+1}-)\bigr| = \bigl|F_{0k}(x) - \hat F_{nk}(x)\bigr|.
\]
Hence, in all three cases we have $\bigl|F_{0k}(x) - F_{0k}(s_{ki})\bigr| \le \bigl|F_{0k}(x) - \hat F_{nk}(x)\bigr|$. Combining this with (7.17) and (7.18) gives (7.15).

To prove (7.16), we again consider the three cases in Definition 7.8. In case (i) we have
\[
\bigl|\xi(x) - \bar\xi^{(k)}(x)\bigr| = \bigl|\xi(x) - \xi(s_{ki})\bigr| \le C|x - s_{ki}| \le (C/\epsilon)\bigl|F_{0k}(x) - F_{0k}(s_{ki})\bigr|
= (C/\epsilon)\bigl|F_{0k}(x) - \hat F_{nk}(s_{ki})\bigr| = (C/\epsilon)\bigl|F_{0k}(x) - \hat F_{nk}(x)\bigr|,
\]
where $C$ is the Lipschitz constant of $\xi$. The expressions for cases (ii) and (iii) follow analogously. $\Box$
We can now prove (7.13) and (7.14).

Lemma 7.12 Under the conditions of Theorem 7.5, we have for all $j = 1,\dots,K+1$,
\[
\left|\int \bigl\{\theta_{j,\hat F_n,t_0} - \bar\theta_{j,\hat F_n,t_0}\bigr\}\,dP_0\right| = O_p(n^{-2/3}).
\]
Proof: First, note that for $j = 1,\dots,K$, we have
\begin{align*}
&\int \bigl\{\theta_{j,\hat F_n,t_0} - \bar\theta_{j,\hat F_n,t_0}\bigr\}\,dP_0\\
&\quad= \sum_{k=1}^{K} \int \left\{\frac{\delta_k}{\hat F_{nk}(t)} - \frac{1-\delta_+}{1-\hat F_{n+}(t)}\right\} \bigl\{d_{j,\hat F_n,t_0}(t,k) - \bar d_{j,\hat F_n,t_0}(t,k)\bigr\}\,dP_0(t,\delta)\\
&\quad= \sum_{k=1}^{K} \int \left\{\frac{F_{0k}(t)}{\hat F_{nk}(t)} - \frac{1-F_{0+}(t)}{1-\hat F_{n+}(t)}\right\} \bigl\{d_{j,\hat F_n,t_0}(t,k) - \bar d_{j,\hat F_n,t_0}(t,k)\bigr\}\,dG(t)\\
&\quad= \int_{[0,t_0]} \left\{\frac{F_{0j}(t)}{\hat F_{nj}(t)} - \frac{1-F_{0+}(t)}{1-\hat F_{n+}(t)}\right\} \hat F_{nj}(t)(1 - \hat F_{nj}(t))\bigl(\xi(t) - \bar\xi^{(j)}(t)\bigr)\,dG(t)\\
&\qquad- \sum_{k \ne j} \int_{[0,t_0]} \left\{\frac{F_{0k}(t)}{\hat F_{nk}(t)} - \frac{1-F_{0+}(t)}{1-\hat F_{n+}(t)}\right\} \hat F_{nk}(t)\bigl\{\hat F_{nj}(t)\xi(t) - \bar F_{nj}^{(k)}(t)\bar\xi^{(k)}(t)\bigr\}\,dG(t).
\end{align*}
Similarly, for $j = K+1$, we write
\[
\int \bigl\{\theta_{K+1,\hat F_n,t_0} - \bar\theta_{K+1,\hat F_n,t_0}\bigr\}\,dP_0
= -\sum_{k=1}^{K} \int_{[0,t_0]} \left\{\frac{F_{0k}(t)}{\hat F_{nk}(t)} - \frac{1-F_{0+}(t)}{1-\hat F_{n+}(t)}\right\} \hat F_{nk}(t)\bigl\{\hat F_{n,K+1}(t)\xi(t) - \bar F_{n,K+1}^{(k)}(t)\bar\xi^{(k)}(t)\bigr\}\,dG(t).
\]
Note that all terms in these expressions contain the common factor
\[
\left\{\frac{F_{0k}(t)}{\hat F_{nk}(t)} - \frac{1-F_{0+}(t)}{1-\hat F_{n+}(t)}\right\} \hat F_{nk}(t).
\]
For $k = 1,\dots,K$, we rewrite the absolute value of this expression as
\[
\left|\frac{F_{0k}(t)(1 - \hat F_{n+}(t)) - \hat F_{nk}(t)(1 - F_{0+}(t))}{1 - \hat F_{n+}(t)}\right|
= \left|\frac{F_{0k}(t)(F_{0+}(t) - \hat F_{n+}(t)) + (1 - F_{0+}(t))(F_{0k}(t) - \hat F_{nk}(t))}{1 - \hat F_{n+}(t)}\right|. \tag{7.19}
\]
Due to the assumption $F_{0+}(t_0) < 1$ and consistency of $\hat F_{n+}$ in a neighborhood of $t_0$ (Proposition 4.15), we can assume at the cost of a small probability that $1 - \hat F_{n+}(t_0) > (1 - F_{0+}(t_0))/2 > 0$ for $n$ sufficiently large. Hence, with large probability, (7.19) is bounded by
\[
C_1\bigl\{\bigl|F_{0+}(t) - \hat F_{n+}(t)\bigr| + \bigl|F_{0k}(t) - \hat F_{nk}(t)\bigr|\bigr\},
\]
for some constant $C_1 > 0$.
We now consider the remaining parts of $\int \{\theta_{j,\hat F_n,t_0} - \bar\theta_{j,\hat F_n,t_0}\}\,dP_0$. First, by Lemma 7.11 we have $|\xi(t) - \bar\xi^{(j)}(t)| \le C_2|\hat F_{nj}(t) - F_{0j}(t)|$. Furthermore, using the same lemma we obtain
\begin{align*}
\bigl|\hat F_{nj}(t)\xi(t) - \bar F_{nj}^{(k)}(t)\bar\xi^{(k)}(t)\bigr|
&= \bigl|(\hat F_{nj}(t) - \bar F_{nj}^{(k)}(t))\xi(t) + \bar F_{nj}^{(k)}(t)(\xi(t) - \bar\xi^{(k)}(t))\bigr|\\
&\le C_3\bigl\{\bigl|\hat F_{nk}(t) - F_{0k}(t)\bigr| + \bigl|\hat F_{nj}(t) - F_{0j}(t)\bigr|\bigr\},
\end{align*}
for $j = 1,\dots,K+1$ and some constant $C_3 > 0$. The result now follows by combining these expressions, and using the Cauchy-Schwarz inequality and the $L_2(G)$ rate of convergence given in (5.7). $\Box$
Lemma 7.13 Under the conditions of Theorem 7.5, we have for all $j = 1,\dots,K+1$,
\[
\left|\int \bigl\{\bar\theta_{j,\hat F_n,t_0} - \theta_{j,F_0,t_0}\bigr\}\,d(P_n - P_0)\right| = O_p(n^{-2/3}).
\]
Proof: We use the modulus of continuity result of Van de Geer (2000, Lemma 5.13, page 79). For $j \in \{1,\dots,K+1\}$ we define
\begin{align*}
h_{jF} &= \bar\theta_{j,F,t_0} - \theta_{j,F_0,t_0}, \qquad F \in \mathcal{F}_K,\\
\mathcal{H}_j &= \bigl\{h_{jF} : F \in \mathcal{F}_K,\ F_+(t_0) < 1 - a/2\bigr\},
\end{align*}
where $a = 1 - F_{0+}(t_0) > 0$. Note that
\begin{align*}
\bar\theta_{j,F,t_0}(t,\delta) &= \sum_{k=1}^{K} \left\{\frac{\delta_k}{F_k(t)} - \frac{1-\delta_+}{1-F_+(t)}\right\} \bar d_{j,F,t_0}(t,k)\\
&= \left\{\delta_j(1 - F_j(t)) - \frac{(1-\delta_+)F_j(t)(1 - F_j(t))}{1 - F_+(t)}\right\} \bar\xi^{(j)}(t)1_{[0,t_0]}(t)\\
&\qquad- \sum_{k \ne j} \left\{\delta_k \bar F_j^{(k)}(t) - \frac{(1-\delta_+)\bar F_j^{(k)}(t)F_k(t)}{1 - F_+(t)}\right\} \bar\xi^{(k)}(t)1_{[0,t_0]}(t).
\end{align*}
The class $\mathcal{H}_j$ is uniformly bounded, since $F_+(t_0) < 1 - a/2$, $F_{0+}(t_0) = 1 - a$, and $\xi$ is continuous and hence bounded on $[0,t_0]$. This implies that we can rescale the functions in $\mathcal{H}_j$ so that Van de Geer's condition (5.39) is satisfied.
In order to satisfy Van de Geer's condition (5.40), we must show that the $\gamma$-entropy with bracketing of $\mathcal{H}_j$ is bounded by $A\gamma^{-1}$ for some constant $A > 0$. To see this, note that the function $\xi$ is fixed, but that the adapted versions $\bar\xi^{(k)}$ depend on $F_k$. Since $\xi$ is Lipschitz continuous, its restriction to $[0,t_0]$ is of bounded variation. Since the adaptations $\bar\xi^{(k)}$ are `more constant' versions of $\xi$, they are also of bounded variation. Furthermore, the functions $F_j$, $1 - F_j$, $\bar F_j^{(k)}$, $(1 - F_+)^{-1}$, for $j = 1,\dots,K+1$ and $k = 1,\dots,K$, are bounded and monotone, and hence of bounded variation. Here we again use that $F_+(t_0) < 1 - a/2$. Hence, our class of functions consists of sums and products of functions of bounded variation. It then follows from Propositions 5.23 and 5.24 that the $\gamma$-bracketing entropy of $\mathcal{H}_j$ is bounded by $A'\gamma^{-1}$ for some $A' > 0$.
Next, we define $\mathcal{H}_j(s) = \{h_{jF} \in \mathcal{H}_j : \|h_{jF}\|_{L_2(P_0)} \le s\}$. After some algebra, we obtain
\begin{align*}
\bar\theta_{j,\hat F_n,t_0} - \theta_{j,F_0,t_0} &= \delta_j\bigl[(1 - \hat F_{nj}(t))\bar\xi^{(j)}(t) - (1 - F_{0j}(t))\xi(t)\bigr]\\
&\quad+ (1-\delta_+)\left[\frac{F_{0j}(t)(1 - F_{0j}(t))\xi(t)}{1 - F_{0+}(t)} - \frac{\hat F_{nj}(t)(1 - \hat F_{nj}(t))\bar\xi^{(j)}(t)}{1 - \hat F_{n+}(t)}\right]\\
&\quad+ \sum_{k \ne j} \delta_k\bigl[F_{0j}(t)\xi(t) - \bar F_{nj}^{(k)}(t)\bar\xi^{(k)}(t)\bigr]\\
&\quad+ \sum_{k \ne j} (1-\delta_+)\left[\frac{\bar F_{nj}^{(k)}(t)\hat F_{nk}(t)\bar\xi^{(k)}(t)}{1 - \hat F_{n+}(t)} - \frac{F_{0j}(t)F_{0k}(t)\xi(t)}{1 - F_{0+}(t)}\right],
\end{align*}
on $[0,t_0]$. Using the $L_2(G)$ rate of convergence of the MLE and Lemma 7.11, we find that the four terms on the right side have $L_2(P_0)$-norms of order $O_p(n^{-1/3})$. This implies that we can find some $C > 0$ such that $h_{j\hat F_n} \in \mathcal{H}_j(Cn^{-1/3})$ with large probability. We now apply Van de Geer (2000, Lemma 5.13, page 79, equation (5.42)) with $\alpha = 1$ and $\beta = 0$. This yields
\[
\sup_{h_{jF} \in \mathcal{H}_j(Cn^{-1/3})} \left|\int h_{jF}\,d(P_0 - P_n)\right| = O_p(n^{-2/3}),
\]
which completes the proof. $\Box$
We are now ready to prove Lemmas 7.14 and 7.15.

Lemma 7.14 Under the conditions of Theorem 7.5, we have
\[
\int_{[0,t_0]} \bigl\{F_{0,K+1}(t) - \hat F_{n,K+1}(t)\bigr\}\xi(t)\,dG(t) \le \int \theta_{K+1,F_0,t_0}\,d(P_0 - P_n) + O_p(n^{-2/3}).
\]
Proof: By Lemma 7.7, we have
\begin{align*}
\int_{[0,t_0]} \bigl\{F_{0,K+1}(t) - \hat F_{n,K+1}(t)\bigr\}\xi(t)\,dG(t)
&= \int \theta_{K+1,\hat F_n,t_0}\,dP_0\\
&= \int \bar\theta_{K+1,\hat F_n,t_0}\,dP_0 + \int \bigl\{\theta_{K+1,\hat F_n,t_0} - \bar\theta_{K+1,\hat F_n,t_0}\bigr\}\,dP_0.
\end{align*}
In Lemma 7.9 we showed that $\int \bar\theta_{K+1,\hat F_n,t_0}\,dP_n \le 0$. Hence,
\[
\int_{[0,t_0]} \bigl\{F_{0,K+1}(t) - \hat F_{n,K+1}(t)\bigr\}\xi(t)\,dG(t)
\le \int \bar\theta_{K+1,\hat F_n,t_0}\,d(P_0 - P_n) + \int \bigl\{\theta_{K+1,\hat F_n,t_0} - \bar\theta_{K+1,\hat F_n,t_0}\bigr\}\,dP_0.
\]
The right side of this expression can be written as
\begin{align*}
&\int \theta_{K+1,F_0,t_0}\,d(P_0 - P_n) + \int \bigl\{\bar\theta_{K+1,\hat F_n,t_0} - \theta_{K+1,F_0,t_0}\bigr\}\,d(P_0 - P_n)\\
&\qquad+ \int \bigl\{\theta_{K+1,\hat F_n,t_0} - \bar\theta_{K+1,\hat F_n,t_0}\bigr\}\,dP_0
= \int \theta_{K+1,F_0,t_0}\,d(P_0 - P_n) + O_p(n^{-2/3}),
\end{align*}
where the last equality follows from Lemmas 7.12 and 7.13. $\Box$
Lemma 7.15 Under the conditions of Theorem 7.5, we have for $k = 1,\dots,K$,
\[
\int_{[0,t_0]} \bigl\{F_{0k}(t) - \hat F_{nk}(t)\bigr\}\xi(t)\,dG(t) \le \max_{\tau \in \{\sigma_{nk},\tau_{nk}\}} \int \theta_{k,F_0,\tau-}\,d(P_0 - P_n) + O_p(n^{-2/3}),
\]
where $\sigma_{nk}$ is the last jump point of $\hat F_{nk}$ at or before $t_0$, and $\tau_{nk}$ is the first jump point of $\hat F_{nk}$ strictly after $t_0$.

Proof: Let $k \in \{1,\dots,K\}$ and $\tau \in \{\sigma_{nk},\tau_{nk}\}$. We use Lemmas 7.10, 7.12 and 7.13 to show that
\[
\int_{[0,\tau)} \bigl\{F_{0k}(t) - \hat F_{nk}(t)\bigr\}\xi(t)\,dG(t) \le \int \theta_{k,F_0,\tau-}\,d(P_0 - P_n) + O_p(n^{-2/3}),
\]
analogously to the proof of Lemma 7.14. Next, we relax the requirement that the upper endpoint of the integration interval is a jump point of $\hat F_{nk}$. We define
\[
\psi_{nk}(t) = \int_{[0,t)} \bigl\{F_{0k}(x) - \hat F_{nk}(x)\bigr\}\xi(x)\,dG(x).
\]
Suppose there is a point $s \in [\sigma_{nk},\tau_{nk})$ so that $F_{0k}(s) = \hat F_{nk}(s)$. Then $\psi_{nk}(t)$ is decreasing on the interval $[\sigma_{nk},s)$ and increasing on the interval $[s,\tau_{nk})$. Hence, $\psi_{nk}(t) \le \max\{\psi_{nk}(\sigma_{nk}), \psi_{nk}(\tau_{nk})\}$. If there is no such $s$, then either $\hat F_{nk}(u) < F_{0k}(\sigma_{nk})$ for all $u \in [\sigma_{nk},\tau_{nk})$, or $\hat F_{nk}(u) > F_{0k}(\tau_{nk}-)$ for all $u \in [\sigma_{nk},\tau_{nk})$. In the former case, $\psi_{nk}(t)$ is increasing for all $t \in [\sigma_{nk},\tau_{nk})$, so that $\psi_{nk}(t) \le \psi_{nk}(\tau_{nk})$. In the latter case, $\psi_{nk}(t)$ is decreasing for all $t \in [\sigma_{nk},\tau_{nk})$, so that $\psi_{nk}(t) \le \psi_{nk}(\sigma_{nk})$. $\Box$
Remark 7.16 We now briefly discuss what happens if we do not know $G$. Note that our results imply that
\[
\sqrt{n}\int_{[0,t_0]} \bigl\{\hat F_{nk}(t) - F_{0k}(t)\bigr\}\,dG_n(t) \to_d N\Bigl(0, \int_{[0,t_0]} F_{0k}(t)(1 - F_{0k}(t))\,dG(t)\Bigr),
\]
since
\[
\sqrt{n}\int_{[0,t_0]} \bigl\{\hat F_{nk}(t) - F_{0k}(t)\bigr\}\,dG_n(t)
= \sqrt{n}\int_{[0,t_0]} \bigl\{\hat F_{nk}(t) - F_{0k}(t)\bigr\}\,dG(t) + \sqrt{n}\int_{[0,t_0]} \bigl\{\hat F_{nk}(t) - F_{0k}(t)\bigr\}\,d(G_n - G)(t),
\]
and the last term on the right hand side is of order $O_p(n^{-1/6})$ by a modulus of continuity result (Van de Geer (2000, Lemma 5.13, page 79)). Hence, in this sense we do not lose anything by not knowing the distribution $G$.
Furthermore, note that
\[
\sqrt{n}\left(\int_{[0,t_0]} \hat F_{nk}(t)\,dG_n(t) - \int_{[0,t_0]} F_{0k}(t)\,dG(t)\right)
= \sqrt{n}\int_{[0,t_0]} \bigl\{\hat F_{nk}(t) - F_{0k}(t)\bigr\}\,dG_n(t) + \sqrt{n}\int_{[0,t_0]} F_{0k}(t)\,d(G_n - G)(t).
\]
The first term on the right side converges to a normal distribution with variance $\int_{[0,t_0]} F_{0k}(t)(1 - F_{0k}(t))\,dG(t)$, by the argument given above. However, the second term on the right side also gives a contribution. Thus, considering $\int_{[0,t_0]} \hat F_{nk}(t)\,dG_n(t)$ as an estimator for $\int_{[0,t_0]} F_{0k}(t)\,dG(t)$, not knowing $G$ does result in a bigger asymptotic variance.
Remark 7.17 It may be of interest to consider joint convergence of smooth functionals of several components. For example, we can use (7.12) to show that the limit of the vector
\[
\left(\sqrt{n}\int_{[0,t_0]} \bigl\{\hat F_{n1}(t) - F_{01}(t)\bigr\}\,dG(t),\ \dots,\ \sqrt{n}\int_{[0,t_0]} \bigl\{\hat F_{nK}(t) - F_{0K}(t)\bigr\}\,dG(t)\right)
\]
is a multivariate normal distribution $N_K(0,\Sigma)$, where
\[
\Sigma_{jk} = \int_{[0,t_0]} F_{0j}(t)\bigl(1\{j = k\} - F_{0k}(t)\bigr)\,dG(t), \qquad j,k \in \{1,\dots,K\}.
\]
In turn, this result can be used to study smooth functionals that consist of a linear combination of several components.
Chapter 8
EXAMPLES
In this chapter we apply our methods to real and simulated data. First, in Section
8.1, we reanalyze a data set on the menopausal status of women, and verify that our
results agree with those of Jewell, Van der Laan and Henneman (2003) and Jewell and
Kalbfleisch (2004). Next, in Section 8.2, we compare the MLE and several variants of
the naive estimator in a simulation study. We consider both pointwise estimation and
estimation of smooth functionals. For pointwise estimation, we find that the MLE is
superior to the naive estimator in terms of mean squared error, both for small and
large samples. For the estimation of smooth functionals, we find that the MLE and
the naive estimator behave similarly, and in agreement with the theoretical results in
Chapter 7.
8.1 Menopause data
MacMahon and Worcestor (1966) and Krailo and Pike (1983) studied the menopausal
status of women participating in Cycle I of the Health Examination Survey of the National Center for Health Statistics. This study consisted of a nationwide probability sample of persons between age 18 and 79 from the United States civilian, noninstitutional population. The participants were asked to complete a self-administered questionnaire. The sample contained 4211 females, of whom 3581 completed the questionnaire. The question regarding menopausal status is given in Figure 8.1. MacMahon and Worcestor (1966) found that there was marked terminal digit clustering in the response to part c of this question, especially for women who had a natural menopause. Therefore, Krailo and Pike (1983) decided to only analyze the responses to parts b and d. These parts provide current status data with competing risks, where $X$ is the age at menopause, $Y$ is the cause of menopause, and $T$ is the age at the time of the survey.

Question 74. WOMEN ONLY

a. Age when periods started ______
b. Have periods stopped? (not counting pregnancy) Yes / No

IF YES
c. Age when periods stopped _____
d. Was this due to an operation? Yes / No

IF NO
e. Have they begun to stop? Yes / No
f. Date of last period _____

Figure 8.1: Question 74 of the Health Examination Survey (taken from MacMahon and Worcestor (1966)).

Krailo and Pike (1983) performed a parametric analysis. Nonparametric analyses of the same data have been performed by Jewell, Van der Laan and Henneman (2003) and Jewell and Kalbfleisch (2004). In these three analyses, attention was restricted to the age range 25-59 years. Furthermore, seven women who were less than 35 years of age and reported having had a natural menopause were excluded as being an error or abnormal. The remaining data set contained information on 2423 women.

In order to verify our methods, we reanalyzed these data and computed the MLE and the naive estimator. The results are given in Figure 8.2. Note that the MLE and the naive estimators are very similar, and that they are indistinguishable for the sub-distribution function for operative menopause. Furthermore, note that operative menopause seems to occur at a constant rate between age 30 and 55, while the rate of natural menopause peaks around age 50-55. Our results agree with those of Jewell, Van der Laan and Henneman (2003) and Jewell and Kalbfleisch (2004).
Figure 8.2: The MLE and the naive estimator (NE) for the sub-distribution functionsfor the menopause data analyzed by Krailo and Pike (1983). The MLE and naiveestimator for operative menopause are indistinguishable.
8.2 Simulations

In order to compare the MLE and the naive estimator, we simulated data from the following model with $K = 5$ competing risks:
\[
\begin{aligned}
P(T \le t) &= 1 - \exp(-5t/2),\\
P(Y = k) &= k/15, \qquad k = 1,\dots,5,\\
P(X \le t \mid Y = k) &= 1 - \exp(-kt), \qquad k = 1,\dots,5,
\end{aligned} \tag{8.1}
\]
with $T$ independent of $(X,Y)$. The true sub-distribution functions in this model are
\[
F_{0k}(t) = \frac{k}{15}\bigl(1 - \exp(-kt)\bigr), \qquad k = 1,\dots,5,
\]
and are shown in Figure 8.3 on page 218.
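Data from model (8.1) can be generated directly; a minimal sketch, assuming `numpy` (the function name is hypothetical):

```python
import numpy as np

def simulate_model_81(n, rng=None):
    """Simulate current status data with K = 5 competing risks from model (8.1)."""
    K = 5
    rng = np.random.default_rng(rng)
    T = rng.exponential(scale=2.0 / 5.0, size=n)      # P(T <= t) = 1 - exp(-5t/2)
    Y = rng.choice(np.arange(1, K + 1), size=n,
                   p=np.arange(1, K + 1) / 15.0)      # P(Y = k) = k/15
    X = rng.exponential(scale=1.0 / Y)                # X | Y = k has rate k
    delta = np.zeros((n, K + 1))
    obs = X <= T                                      # failure has occurred by time T
    delta[np.nonzero(obs)[0], Y[obs] - 1] = 1.0       # delta_k = 1{X <= T, Y = k}
    delta[~obs, K] = 1.0                              # delta_{K+1} = 1{X > T}
    return T, delta
```

Only $(T, \Delta)$ is returned, since $(X, Y)$ is never observed under current status censoring.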
We simulated 1000 data sets of sizes 100, 1000 and 10000. For each data set we computed the MLEs and the naive estimators of the five sub-distribution functions. The MLE was computed using sequential quadratic programming (SQP) and the support reduction algorithm, as described in Section 3.1. As convergence criterion we used the conditions in (2.37) of Proposition 2.25, with a tolerance of 10^{-10}. The naive estimators were computed with a convex minorant algorithm.
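The data-generating mechanism in (8.1) is simple to reproduce. The following Python sketch (our own illustration; function and variable names are not from the thesis) draws current status competing risks observations from the model:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n):
    """Draw n current status competing risks observations from model (8.1):
    T ~ Exp(rate 5/2) independent of (X, Y), P(Y = k) = k/15, and
    X | Y = k ~ Exp(rate k), for k = 1, ..., 5."""
    t = rng.exponential(scale=2 / 5, size=n)        # observation times T (scale = 1/rate)
    y = rng.choice(np.arange(1, 6), size=n, p=np.arange(1, 6) / 15)
    x = rng.exponential(scale=1.0 / y)              # failure times X, rate k given Y = k
    delta = np.zeros((n, 5), dtype=int)             # current status indicators
    failed = x <= t                                 # failure occurred before observation time
    delta[np.nonzero(failed)[0], y[failed] - 1] = 1
    return t, delta                                 # delta[i, k-1] = 1{X_i <= T_i, Y_i = k}

t, delta = simulate(1000)
```

Only (T, Δ1, . . . , Δ5) is returned, mirroring the fact that X and Y themselves are never observed directly.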
8.2.1 Pointwise estimation
We now compare the behavior of the estimators for pointwise estimation. We do this
by computing the bias, variance and mean squared error of the estimators on the
following grid:
0.0, 0.01, 0.02, . . . , 3.0. (8.2)
Recall that the estimators are not uniquely defined for all t ∈ R+, due to representational non-uniqueness (see Section 2.2). Thus, in order to compare the estimators on a grid, we use the convention that the naive estimators Fnk are right-continuous and piecewise constant, with jumps only at points in {T1, . . . , Tn}. Similarly, we assume that the MLEs Fnk are right-continuous and piecewise constant with jumps only at points in T_k (see Definition 2.22). These conventions are equivalent to assigning all probability mass to the right endpoints of the maximal intersections.
Jewell, Van der Laan and Henneman (2003) stated that the performance of the
naive estimators can be improved by suitably modifying them when their sum exceeds
one. To investigate this claim, we computed two variants of the naive estimator: a scaled naive estimator Fnk^(s), and a truncated naive estimator Fnk^(t). The scaled naive estimator is defined as follows:

Fnk^(s)(t) = Fnk(t)                  if Fn+(T(n)) ≤ 1,
Fnk^(s)(t) = Fnk(t)/Fn+(T(n))        if Fn+(T(n)) > 1,

for k = 1, . . . , 5. To define the truncated naive estimator, we let

t* = min({t ∈ (0, T(n)] : Fn+(t) > 1} ∪ {T(n) + 1}).

If t* = T(n) + 1, then the naive estimator does not violate the constraint Fn+(T(n)) ≤ 1, and hence we let the truncated naive estimator be equal to the naive estimator. If t* ≤ T(n), then the constraint Fn+(T(n)) ≤ 1 is violated, and we define

Fnk^(t)(t) = Fnk(t)                  if t < t*,
Fnk^(t)(t) = Fnk(t*−) + αnk          if t ≥ t*,

for k = 1, . . . , 5, where

αnk = [Fnk(t*) − Fnk(t*−)] / [Fn+(t*) − Fn+(t*−)] · (1 − Fn+(t*−)),   k = 1, . . . , 5.
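The two repairs can be sketched in code as follows (an illustrative Python reconstruction; the grid representation and all names are our own). Each naive estimator Fnk is represented by its values at the common sorted jump points:

```python
import numpy as np

def scaled_and_truncated(F):
    """F: array of shape (K, m), F[k, j] = value of the k-th naive estimator at
    the j-th jump point (an increasing grid ending at T_(n)).  Returns the
    scaled and the truncated naive estimators on the same grid."""
    Fsum = F.sum(axis=0)                       # F_{n+} on the grid
    # Scaled variant: divide everywhere by F_{n+}(T_(n)) if the total mass exceeds 1.
    Fs = F / Fsum[-1] if Fsum[-1] > 1 else F.copy()
    # Truncated variant: from the first grid point t* where F_{n+} > 1, freeze each
    # component at F_nk(t*-) + alpha_nk, dividing the remaining mass 1 - F_{n+}(t*-)
    # over the components in proportion to their jumps at t*.
    Ft = F.copy()
    viol = np.nonzero(Fsum > 1)[0]
    if viol.size > 0:
        j = viol[0]                                             # index of t*
        Fprev = F[:, j - 1] if j > 0 else np.zeros(F.shape[0])  # F_nk(t*-)
        jumps = F[:, j] - Fprev                                 # jumps at t*
        alpha = jumps / jumps.sum() * (1.0 - Fprev.sum())       # alpha_nk
        Ft[:, j:] = (Fprev + alpha)[:, None]
    return Fs, Ft
```

For example, if two components sum to 1.2 at the last grid point, the truncated variant redistributes only the final jumps so that the total is exactly one, while the scaled variant shrinks both curves everywhere by the factor 1.2.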
In order to limit the number of plots, we only show the results for the estimation of F01, F03 and F05. In the legends of the plots we use the following abbreviations:
naive estimator (NE), scaled naive estimator (SNE), and truncated naive estimator
(TNE).
Figure 8.4 shows the estimators in one simulation for each sample size. Note that
the MLE is close to the true underlying distribution. Furthermore, note that the MLE
and the naive estimator tend to be similar for smaller values of t, while they tend to
diverge for larger values of t, where the naive estimator becomes too large and violates
the constraint Fn+(t) ≤ 1. The truncated naive estimator repairs such a violation by
only changing the estimator at points for which the constraint is violated, while the
scaled naive estimator changes the values at all points. As a result, the scaled naive
estimator tends to yield a significant underestimate for smaller values of t.
Figure 8.5 shows the sample bias of the estimators, scaled by a factor n^{1/3}. We
see that the bias of the MLE is smallest in absolute value. Furthermore, note that
the bias of the scaled naive estimator is largely negative, for the reason discussed in
the previous paragraph.
Figure 8.6 shows the sample variance of the estimators, scaled by a factor n^{2/3}.
We see that the scaled naive estimator has the smallest variance for small values of
t. This can be explained by the fact that the estimator is scaled down. Among the
remaining estimators, the MLE tends to have the smallest variance.
Figure 8.7 shows the sample mean squared error of the estimators, scaled by a factor n^{2/3}. We see that the mean squared error of the MLE is in general smaller
than that of the naive estimators. Considering the three naive estimators, we see
that the truncated naive estimator performs best and is significantly better than the
naive estimator. On the other hand, we see that the mean squared error of the scaled
naive estimator tends to be worse than that of the naive estimator. The latter can
be explained by the large negative bias of the scaled naive estimator.
Figure 8.8 shows the relative efficiency of the estimators, in the form of the mean
squared error of the MLE divided by the mean squared error of each estimator. We
clearly see that the MLE is most efficient. The only exception is the upper left plot
for k = 1 and n = 100, where the scaled naive estimator is more efficient for smaller
values of t. This can be viewed as an anomaly due to the small values of F01. In
all other cases the relative efficiency of the scaled naive estimator quickly drops to
almost zero. The relative efficiency of the naive estimator also decreases to a number
close to zero, but its decrease is more gradual in t. The truncated naive estimator
seems to stabilize at a relative efficiency of about 75%.
Considering Figures 8.5 to 8.8, we see that the curves of the naive estimator and
the truncated naive estimator coincide until a certain point, and then start to diverge.
This point is the smallest time s for which Fn+(s) > 1 in one of the 1000 simulated
data sets. The value of this point increases as the sample size increases, due to
consistency of the naive estimator.
To investigate the behavior of the estimators at larger values of t, we computed
the sample bias, variance and mean squared error at the point t = 10. The results
are given in Table 8.1. We see that the bias, variance and mean squared error of the
naive estimator do not decrease with n. The bias can be as large as 0.3, even for
sample size 10000. On the other hand, the MLE still behaves well.
8.2.2 Smooth functionals
We now consider estimation of the following smooth functional:

∫[0,t0] F0k(t) dG(t),

for t0 = 2 and t0 = 10. In Chapter 7 we proved that

√n ∫[0,t0] {Fnk(t) − F0k(t)} dG(t) →d N(0, ∫[0,t0] F0k(t){1 − F0k(t)} dG(t)),   (8.3)
where Fnk can be either the naive estimator or the MLE.
We computed the left hand side of (8.3) for the MLE and the naive estimator, for the 1000 simulated data sets for each sample size. Note that the integral ∫[0,t0] Fnk(t) dG(t) can be computed easily by using partial integration:

∫[0,t0] Fnk(t) dG(t) = ∫[0,t0] {G(t0) − G(t)} dFnk(t).
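Since Fnk is piecewise constant, the right hand side reduces to a finite sum over the jump points of Fnk. A minimal sketch, assuming G(t) = 1 − exp(−5t/2) as in model (8.1) (the function name is our own):

```python
import numpy as np

def smooth_functional(jump_times, jump_masses, t0):
    """Evaluate the integral of F_nk over [0, t0] with respect to G via the
    partial-integration identity: sum of (G(t0) - G(t)) over the jumps of F_nk."""
    G = lambda t: 1.0 - np.exp(-2.5 * t)        # observation time distribution in (8.1)
    inside = jump_times <= t0
    return float(np.sum((G(t0) - G(jump_times[inside])) * jump_masses[inside]))

# Sanity check: a unit point mass at 0 recovers G(t0) - G(0) = G(t0).
val = smooth_functional(np.array([0.0]), np.array([1.0]), 2.0)
```

Jumps beyond t0 contribute nothing, which already hints at why tail misbehavior of the naive estimator is harmless for this functional.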
The results for t0 = 2 are given in Figures 8.9 and 8.10. The results for t0 = 10 are
given in Figures 8.11 and 8.12. Note that the MLE and the naive estimator behave
similarly. Furthermore, their behavior agrees with the theoretical limit (8.3), depicted
by a black line in the figures.
Given the fact that the naive estimator performs very badly at pointwise estimation at t = 10, it may come as a surprise that the naive estimator behaves well for estimation of the smooth functional when t0 = 10. However, the smooth functionals are integrated with respect to G, and the density g(t) = (5/2) exp(−5t/2) is very small for large t, so that any effects in the tails of the distributions are suppressed.
Figure 8.3: The true underlying sub-distribution functions F0k, k = 1, . . . , 5, in model (8.1), plotted against t ∈ [0, 3].
[Figure 8.4 here: nine panels (k = 1, 3, 5 by n = 100, 1000, 10000) showing the MLE, NE, SNE and TNE against t ∈ [0, 3].]

Figure 8.4: The estimators for F0k, k = 1, 3, 5, for one simulation for each sample size. The solid black lines denote the true underlying sub-distribution functions.
Figure 8.5: Sample bias of the estimators for F0k, k = 1, 3, 5, scaled by a factor n^{1/3}. The results are computed over 1000 simulations for each sample size n, on the grid defined in (8.2).
Figure 8.6: Sample variance of the estimators for F0k, k = 1, 3, 5, scaled by a factor n^{2/3}. The results are computed over 1000 simulations for each sample size n, on the grid defined in (8.2).
Figure 8.7: Sample mean squared error of the estimators for F0k, k = 1, 3, 5, scaled by a factor n^{2/3}. The results are computed over 1000 simulations for each sample size n, on the grid defined in (8.2).
Figure 8.8: Sample relative efficiency of the estimators for F0k, k = 1, 3, 5, with respect to the MLE. The sample relative efficiency for each estimator is computed using the formula (MSE MLE)/(MSE estimator), where the sample mean squared errors (MSEs) were computed as in Figure 8.7.
Table 8.1: Sample bias, variance and mean squared error for estimating F0k(10) for k = 1, 3, 5. The results are computed over 1000 simulations for each sample size n.

Bias
                  k = 1                                k = 3                                k = 5
n        MLE      NE      TNE      SNE       MLE      NE      TNE      SNE       MLE     NE      TNE     SNE
100     -1.5e-2  1.1e-1  -2.2e-2   1.0e-2   -2.3e-3  2.8e-1  -1.1e-3   1.9e-2   1.2e-2  2.9e-1  1.9e-2  -4.2e-2
1000    -9.0e-3  1.6e-1  -1.7e-2   2.8e-2    1.0e-3  2.7e-1   2.7e-3   7.4e-3   8.1e-3  3.0e-1  1.4e-2  -5.1e-2
10000   -7.6e-3  1.6e-1  -1.2e-2   2.7e-2    4.7e-4  2.7e-1   2.0e-3   6.1e-3   5.6e-3  3.2e-1  8.0e-3  -4.6e-2

Variance
                  k = 1                                k = 3                                k = 5
n        MLE      NE      TNE      SNE       MLE      NE      TNE      SNE       MLE     NE      TNE     SNE
100      1.3e-3  6.6e-2   1.3e-3   1.1e-2    3.0e-3  9.8e-2   4.1e-3   1.9e-2   4.5e-3  7.1e-2  5.8e-3   1.9e-2
1000     3.2e-4  6.6e-2   2.9e-4   1.0e-2    6.0e-4  8.1e-2   7.9e-4   1.5e-2   8.0e-4  6.8e-2  9.3e-4   1.6e-2
10000    7.5e-5  6.0e-2   6.5e-5   9.3e-3    1.2e-4  7.9e-2   1.4e-4   1.5e-2   1.5e-4  7.0e-2  1.7e-4   1.6e-2

MSE
                  k = 1                                k = 3                                k = 5
n        MLE      NE      TNE      SNE       MLE      NE      TNE      SNE       MLE     NE      TNE     SNE
100      1.5e-3  7.8e-2   1.8e-3   1.1e-2    3.0e-3  1.8e-1   4.1e-3   2.0e-2   4.7e-3  1.6e-1  6.2e-3   2.0e-2
1000     4.0e-4  9.1e-2   5.6e-4   1.1e-2    6.0e-4  1.6e-1   8.0e-4   1.5e-2   8.7e-4  1.6e-1  1.1e-3   1.9e-2
10000    1.3e-4  8.4e-2   2.2e-4   1.0e-2    1.2e-4  1.5e-1   1.4e-4   1.5e-2   1.8e-4  1.7e-1  2.3e-4   1.8e-2
Figure 8.9: Smooth functionals of the MLE for t0 = 2. The histograms and density estimates (green) are based on 1000 simulations for each sample size from √n ∫[0,2] {Fnk(u) − F0k(u)} dG(u), k = 1, 3, 5. The density of the theoretical limiting distribution is given in black.
Figure 8.10: Smooth functionals of the naive estimator for t0 = 2. The histograms and density estimates (red) are based on 1000 simulations for each sample size from √n ∫[0,2] {Fnk(u) − F0k(u)} dG(u), k = 1, 3, 5. The density of the theoretical limiting distribution is given in black.
Figure 8.11: Smooth functionals of the MLE for t0 = 10. The histograms and density estimates (green) are based on 1000 simulations for each sample size from √n ∫[0,10] {Fnk(u) − F0k(u)} dG(u), k = 1, 3, 5. The density of the theoretical limiting distribution is given in black.
Figure 8.12: Smooth functionals of the naive estimator for t0 = 10. The histograms and density estimates (red) are based on 1000 simulations for each sample size from √n ∫[0,10] {Fnk(u) − F0k(u)} dG(u), k = 1, 3, 5. The density of the theoretical limiting distribution is given in black.
Chapter 9
AN EXTENSION:
INTERVAL CENSORED CONTINUOUS MARK DATA
In the preceding chapters we considered the situation in which X ∈ R+ is subject to current status censoring (or interval censoring case 1, i.e., one observation time per subject), and Y ∈ {1, . . . , K} is a discrete variable. In this chapter we study an extension of this model in the following two directions. First, we let the survival time X ∈ R+ be subject to case k interval censoring, meaning that there are exactly k observation times for each subject. Second, we let Y ∈ R be a continuous random variable. The variable Y is also called a mark variable, so that this model is sometimes referred to as the interval censored continuous mark model.
Interval censored continuous mark data arise in various situations. For example,
X can be the time of onset of a disease and Y its incubation period. Alternatively, X
can be the time of death and Y a measure of utility or cost, such as quality adjusted
lifetime or lifetime medical costs (Huang and Louis (1998)). A third example is the
HIV vaccine trial data analyzed in Hudgens, Maathuis and Gilbert (2006), where X
is the time of HIV infection and Y is the viral distance between the infecting HIV
virus and the virus in the vaccine.
The work in this chapter is largely taken from Maathuis and Wellner (2006),
and was motivated by Hudgens, Maathuis and Gilbert (2006). Our main focus is
on asymptotic properties of the MLE for interval censored continuous mark data,
and in particular on consistency. In Section 9.1 we use the analogy with univariate
right censored data to obtain an explicit formula for the MLE. In Section 9.2 we use
this formula and the mark specific cumulative hazard function of Huang and Louis
(1998) to derive the almost sure limit of the MLE. We conclude that the MLE is
inconsistent in general. In Section 9.3 we show that the inconsistency can be repaired
by discretizing the marks. In Section 9.4 we illustrate the behavior of the inconsistent
and repaired MLE in four examples.
9.1 The model and an explicit formula for the MLE
9.1.1 Intermezzo: univariate right censored data
Hudgens, Maathuis and Gilbert (2006) noted a close connection between the MLE for
univariate right censored data and the MLE for interval censored continuous mark
data. We will use this connection in Section 9.1.2 to derive a new explicit formula for
the MLE in the interval censored continuous mark model. However, we first briefly
review univariate right censored data in a way that shows the similarity between the
two models.
Let X > 0 be a survival time subject to right censoring. Let T > 0 be the censoring variable, with T independent of X. Let U = X ∧ T = min(X, T) and Γ = 1{X ≤ T}. We are interested in the MLE Fn(x) of F0(x) = P(X ≤ x) based on n independent and identically distributed copies (U1, Γ1), . . . , (Un, Γn) of (U, Γ). Using the censored data perspective of Section 2.2, we find that the observed sets for these data can have two forms: R = {U} if Γ = 1 and R = (U, ∞) if Γ = 0. Let U(1), . . . , U(n) be the order statistics of U1, . . . , Un, and let Γ(i) and R(i) be the corresponding values of Γ and R. We assume that all Ri with Γi = 1 are distinct, since this will be the case for the continuous mark data. However, we allow ties in the T's and U's provided this assumption is not violated. We break such ties in U arbitrarily after ensuring that observations with Γ = 1 are ordered before those with Γ = 0.
Assuming that F has a density f with respect to some dominating measure µ, the likelihood (conditional on G) is Ln(F) = ∏_{i=1}^n q(Ui, Γi), where q(u, γ) = f(u)^γ {1 − F(u)}^{1−γ}. The first term of q is a density-type term, and hence Ln(F) can be made arbitrarily large by letting f peak at some value Ui with Γi = 1. This problem is usually solved by maximizing Ln(F) over the class of distribution functions that have a density with respect to counting measure on the observed failure times. We can then write Ln(F) = ∏_{i=1}^n PF(Ri), where PF(R) is the probability of R under F.
We now consider the maximal intersections of R1, . . . , Rn. Using the idea of the height map of Maathuis (2005), we find that each R(i) with i ∈ I = {i ∈ {1, . . . , n} : Γ(i) = 1} is a maximal intersection. We denote these maximal intersections by A(i). This notation may seem redundant since A(i) = R(i), but it will be useful in the next section. Furthermore, there is an extra maximal intersection A(n+1) = R(n) = (U(n), ∞) if and only if Γ(n) = 0. Let Ī be the collection of indices of all maximal intersections. Thus, Ī = I if Γ(n) = 1 and Ī = I ∪ {n + 1} if Γ(n) = 0. Let αi be the probability mass of maximal intersection A(i), i ∈ Ī. We can then write the likelihood in terms of the αi's, analogously to the expression of the log likelihood in (2.15):
∏_{i=1}^n P(Ri) = ∏_{i=1}^n ∑_{j∈Ī} αj 1{A(j) ⊆ R(i)} = ∏_{i=1}^n αi^{Γ(i)} (∑_{j≥i+1, j∈Ī} αj)^{1−Γ(i)}.   (9.1)

The MLE α̂ maximizes this expression under the constraints

∑_{i∈Ī} αi = 1 and αi ≥ 0 for all i ∈ Ī.   (9.2)
It is well-known that α̂ is the Kaplan-Meier or product-limit estimator, given by

α̂i = ∏_{j=1}^{i−1} (1 − Γ(j)/(n − j + 1)) · Γ(i)/(n − i + 1),   i ∈ I,

and α̂_{n+1} = 1 − ∑_{i∈I} α̂i if Γ(n) = 0 (see for example Shorack and Wellner (1986), Chapter 7, pages 332-333). Equivalently, we can write

∑_{j≥i, j∈Ī} α̂j = ∏_{j≤i−1} (1 − Γ(j)/(n − j + 1)),   i ∈ Ī.

The vector α̂ is uniquely determined. We obtain Fn(x) by summing the probability mass in the interval (0, x]. Note that the maximal intersections {A(i) : i ∈ I} are points. The extra maximal intersection A(n+1), which exists if and only if Γ(n) = 0, takes the form of a half line. Hence, representational non-uniqueness occurs if and only if Γ(n) = 0, and if it occurs then it affects Fn(x) for x > T(n).
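As a concrete illustration, the explicit product-limit formula above can be evaluated in a single pass over the sorted data. The Python sketch below (names are our own) returns the masses α̂i and the mass left on (U(n), ∞) when Γ(n) = 0:

```python
import numpy as np

def km_masses(u, gamma):
    """Kaplan-Meier masses from right censored data (U_i, Gamma_i), using the
    tie-breaking convention of the text: at tied U's, observations with
    Gamma = 1 are ordered before those with Gamma = 0."""
    order = np.lexsort((1 - gamma, u))       # sort by U; uncensored first at ties
    g = gamma[order].astype(float)
    n = len(g)
    i = np.arange(n)                         # 0-based, so n - i plays the role of n - i + 1
    factors = 1.0 - g / (n - i)              # factors (1 - Gamma_(j)/(n - j + 1))
    prods = np.concatenate(([1.0], np.cumprod(factors)[:-1]))
    alpha = prods * g / (n - i)              # mass at each uncensored U_(i)
    return u[order], alpha, 1.0 - alpha.sum()   # last value: mass beyond U_(n) if censored

# Events at 1 and 3, censored at 2 and 4: masses 1/4 and 3/8, with 3/8 beyond U_(4).
u = np.array([1.0, 2.0, 3.0, 4.0])
gamma = np.array([1, 0, 1, 0])
times, alpha, tail = km_masses(u, gamma)
```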
9.1.2 Continuous mark data
We now formally introduce the interval censored continuous mark model. Let X ∈ R+ = (0, ∞) be a survival time, let Y ∈ R be a continuous mark variable, and let F0(x, y) = P(X ≤ x, Y ≤ y) be their joint distribution. Let F0X(x) = F0(x, ∞) and F0Y(y) = F0(∞, y) be the marginal distributions of X and Y. Let X be subject to interval censoring case k, using the terminology of Groeneboom and Wellner (1992). Let T = (T1, . . . , Tk) be the k observation times and let G be their distribution (in the case of current status censoring (k = 1), we denote the observation time simply by T). We assume that T is independent of (X, Y) and G(0 < T1 < · · · < Tk) = 1. We use subscripts to denote the marginal distributions of G. For example, G1 is the distribution of T1 and G23 is the distribution of (T2, T3). Let Γ = (Γ1, . . . , Γk+1) be a vector of indicator functions, where Γj = 1{Tj−1 < X ≤ Tj} for j = 1, . . . , k + 1, with T0 = 0 and Tk+1 = ∞. We say that X is right censored if Γk+1 = 1, and we assume that Y is observed if and only if X is not right censored. Thus, we observe Z = (T, Γ, W), where W = Γ+ Y and Γ+ = ∑_{j=1}^k Γj = 1 − Γk+1. We study the nonparametric maximum likelihood estimator Fn(x, y) for F0(x, y) based on n independent and identically distributed copies Z1, . . . , Zn of Z, where Zi = (Ti, Γi, Wi), Ti = (T1i, . . . , Tki) and Γi = (Γ1i, . . . , Γk+1,i). We allow ties between components of the vectors Ti and Tj for i ≠ j.
The observed sets for this model are defined as follows:

R = (Tj−1, Tj] × {W}   if Γj = 1, j = 1, . . . , k,
R = (Tk, ∞) × R        if Γk+1 = 1.
Note that R is a line segment if Γ+ = 1 and R is a half plane if Γk+1 = 1. Assuming
that F has a density f with respect to some dominating measure µX × µY, the likelihood (conditional on G) is given by Ln(F) = ∏_{i=1}^n q(Zi), where

q(z) = q(t, γ, w) = ∏_{j=1}^k (∫_{(tj−1, tj]} f(s, w) µX(ds))^{γj} · (1 − FX(tk))^{γk+1},   (9.3)
where FX(x) = F(x, ∞) is the marginal distribution of X under F. The first term of q
is a density-type term. Hence, Ln(F ) can be made arbitrarily large by letting f(s, w)
peak at w = Wi for some observation with Γ+i = 1. We therefore define the MLE
Fn(x, y) to be the maximizer of Ln(F ) over the class F of all bivariate distribution
functions that have a marginal density fY with respect to counting measure on the
observed marks. We can then write Ln(F ) =∏n
i=1 PF (Ri).
As in Maathuis (2005), we call the projection of R on the x- and y-axis its x-
interval and y-interval. We denote the left and right endpoint of the x-interval of R
by TL and TR:

TL = ∑_{j=1}^{k+1} Γj Tj−1,   TR = ∑_{j=1}^{k+1} Γj Tj.   (9.4)

Furthermore, we define a new variable U:

U = Γ+ TR + Γk+1 TL.   (9.5)
Note that U equals T if X is subject to current status censoring. The variable U plays an important role, because it will determine the order of the observations. Let U(1), . . . , U(n) be the order statistics of U1, . . . , Un and let Γ(i) = (Γ1(i), . . . , Γk+1,(i)), W(i), R(i), TL(i) and TR(i) be the corresponding values of Γ, W, R, TL and TR. We break ties in U arbitrarily after ensuring that observations with Γ+ = 1 are ordered before those with Γ+ = 0. Let I = {i ∈ {1, . . . , n} : Γ+(i) = 1}. Recall that the maximal intersections are the local maxima of the height map of the canonical observed sets. Since Y is continuous, the observed sets R(i), i ∈ I, are completely distinct with probability one. Hence, each such R(i) contains exactly one maximal intersection A(i):

A(i) = (max({TL(j) : j ∉ I, j < i} ∪ {TL(i)}), TR(i)] × {W(i)},   i ∈ I.   (9.6)
To understand this expression, let S(i) be the set of right censored observed sets R(j) with TL(i) < TL(j) < TR(i). Then (9.6) implies that A(i) = R(i) if S(i) = ∅ and A(i) ⊆ R(i) otherwise. Furthermore, in the latter case the left endpoint of A(i) is determined by the largest TL(j) with R(j) ∈ S(i). The right endpoints of A(i) and R(i) are always identical. Equation (9.6) also implies that the maximal intersections can be computed in O(n log n) time, which is faster than the height map algorithm of Maathuis (2005), due to the special structure in the data. We again have an extra maximal intersection A(n+1) = R(n) = (U(n), ∞) × R if and only if Γ+(n) = 0. Let Ī be the collection of indices of all maximal intersections. Thus, Ī = I if Γ+(n) = 1 and Ī = I ∪ {n + 1} if Γ+(n) = 0. Let αi be the probability mass of maximal intersection A(i), i ∈ Ī. We can then write the likelihood as
∏_{i=1}^n P(Ri) = ∏_{i=1}^n ∑_{j∈Ī} αj 1{A(j) ⊆ R(i)} = ∏_{i=1}^n αi^{Γ+(i)} (∑_{j≥i+1, j∈Ī} αj)^{1−Γ+(i)}.   (9.7)
The MLE α̂ maximizes this expression under the constraints (9.2). From the analogy with likelihood (9.1) it follows immediately that

α̂i = ∏_{j=1}^{i−1} (1 − Γ+(j)/(n − j + 1)) · Γ+(i)/(n − i + 1),   i ∈ I,

and α̂_{n+1} = 1 − ∑_{i∈I} α̂i if Γ+(n) = 0. Equivalently, we can write

∑_{j≥i, j∈Ī} α̂j = ∏_{j≤i−1} (1 − Γ+(j)/(n − j + 1)),   i ∈ Ī.   (9.8)
These formulas are different from (but equivalent to) the ones given in Section 3.1 of
Hudgens, Maathuis and Gilbert (2006). The form given here has several advantages.
First, the tail probabilities (9.8) can be computed in time complexity O(n log n), since the computationally most intensive step consists of sorting the U's. Furthermore, the current form provides additional insight into the behavior of the MLE. In particular,
it shows that the MLE can be viewed as a right endpoint imputation estimator (see
Remark 9.1), and it allows for an easy derivation of the almost sure limit of the MLE
(see Section 9.2).
The vector α̂ is again uniquely determined. This was noted by Hudgens, Maathuis and Gilbert (2006) and also follows from our derivation here. We obtain Fn(x, y) by summing all mass in the region (0, x] × (−∞, y]. We define a marginal MLE for X by letting FXn(x) = Fn(x, ∞). The estimators Fn and FXn can suffer considerably from representational non-uniqueness, since the maximal intersections {A(i) : i ∈ I} are line segments and A(n+1) extends to infinity in two dimensions. We denote the estimator that assigns all mass to the upper right corners of the maximal intersections by Fn^l, since it is a lower bound for the MLE. Similarly, we denote the estimator that assigns all mass to the lower left corners of the maximal intersections by Fn^u, since it is an upper bound for the MLE. The formulas for Fn^l simplify considerably:
1 − FXn^l(x) = ∏_{U(i)≤x} (1 − Γ+(i)/(n − i + 1)),   (9.9)

Fn^l(x, y) = ∑_{i=1}^n α̂i 1{U(i) ≤ x, W(i) ≤ y}
           = ∑_{U(i)≤x} ∏_{U(j)<U(i)} (1 − Γ+(j)/(n − j + 1)) · Γ+(i) 1{W(i) ≤ y}/(n − i + 1),   (9.10)

where U was defined in (9.5).
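Formulas (9.9) and (9.10) show that Fn^l places the product-limit mass α̂i at the right endpoint (U(i), W(i)) of each non-right-censored observation, so it can be evaluated directly. A sketch under the same conventions as above (names are our own):

```python
import numpy as np

def F_lower(u, gamma_plus, w, x, y):
    """Evaluate F_n^l(x, y) from (9.10): product-limit mass alpha_i at the
    point (U_(i), W_(i)) for each observation with Gamma_+ = 1."""
    order = np.lexsort((1 - gamma_plus, u))       # U ascending; Gamma_+ = 1 first at ties
    g = gamma_plus[order].astype(float)
    uu, ww = u[order], w[order]
    n = len(g)
    i = np.arange(n)                              # 0-based index, n - i = n - i + 1 (1-based)
    prods = np.concatenate(([1.0], np.cumprod(1.0 - g / (n - i))[:-1]))
    alpha = prods * g / (n - i)                   # masses from the product formula
    return float(np.sum(alpha * (uu <= x) * (ww <= y)))
```

With no right censoring this reduces to the empirical distribution of the imputed points (U_i, W_i), consistent with the right endpoint imputation view of Remark 9.1.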
Remark 9.1 The MLE Fn^l can be viewed as a right endpoint imputation estimator. Namely, replace the observed sets R(i) with Γ+(i) = 1 by their right endpoints:

R′(i) = {U(i)} × {W(i)} if i ∈ I,   and   R′(i) = R(i) if i ∉ I.

Then the intersection structures of {R(i)}_{i=1}^n and {R′(i)}_{i=1}^n are identical, meaning that R(i) ∩ R(j) = ∅ if and only if R′(i) ∩ R′(j) = ∅, for all i, j ∈ {1, . . . , n}. Furthermore, the maximal intersections of {R′(i)}_{i=1}^n are {A(i) = R′(i) : i ∈ I}. Hence, writing the likelihood for the imputed data in terms of α yields exactly the same likelihood as (9.7). As a result the values α̂i, i ∈ Ī, are identical to the ones for the original data. Furthermore, since Fn^l assigns mass to the upper right corners of the maximal intersections, Fn^l is completely equivalent to the MLE for the imputed data. Since the observed sets R′(i) impute an x-value that is always at least as large as the unobserved value X, FXn^l tends to have a negative bias.
9.2 Inconsistency of the MLE
We now derive the almost sure limits FX∞^l and F∞^l of the MLEs FXn^l and Fn^l. In some cases representational non-uniqueness disappears in the limit, so that FX∞ = FX∞^l and F∞ = F∞^l. This occurs for all (x, y) ∈ R+ × R if and only if the maximal intersections A(i), i ∈ I, converge to points and ∑_{i∈I} α̂i → 1 as n → ∞; see Examples 1 and 2 in Section 9.4. If these conditions fail, then the upper bounds FX∞^u and F∞^u can be obtained from their lower bounds by reassigning mass from the upper right corners to the lower left corners of the maximal intersections. We illustrate this in Examples 3 and 4 in Section 9.4.
However, we first derive the lower bounds FX∞^l and F∞^l. Let

Hn(x) = Pn 1{U ≤ x},   x ≥ 0,
Vn(x, y) = Pn Γ+ 1{U ≤ x, W ≤ y},   x ≥ 0, y ∈ R,

and V1n(x) ≡ Vn(x, ∞) = Pn Γ+ 1{U ≤ x}. Here U is defined in (9.5) and Pn f(X) = n^{-1} ∑_{i=1}^n f(Xi). Furthermore, let

Λn(x, y) = ∫_{[0,x]} Vn(ds, y)/(1 − Hn(s−)),
Λ1n(x) ≡ Λn(x, ∞) = ∫_{[0,x]} V1n(ds)/(1 − Hn(s−)).

Since

Λn(dx, y) = Pn Γ+ 1{U = x, W ≤ y} / Pn 1{U ≥ x}   and   Λ1n(dx) = Pn Γ+ 1{U = x} / Pn 1{U ≥ x},

we can write equations (9.9) and (9.10) in terms of Λ1n and Λn:

1 − FXn^l(x) = ∏_{s≤x} {1 − Λ1n(ds)},   (9.11)
Fn^l(x, y) = ∫_{s≤x} ∏_{u<s} {1 − Λ1n(du)} Λn(ds, y).   (9.12)
Note that (9.11) is analogous to the Kaplan-Meier estimator for right censored data,
and that (9.12) is analogous to equation (3.3) of Huang and Louis (1998). However,
our functions Λ1n and Λn are defined differently. As we will see in the following
lemma and theorems, this difference lies at the root of the inconsistency problems of
the MLE.
Lemma 9.2 For I ⊆ R^d, d ≥ 1, let D(I) be the space of cadlag functions on I (cadlag = right-continuous with left limits). Let ‖·‖∞ be the supremum norm on (D(R+), D(R+), D(R+ × R)). Then

‖(Hn − H, V1n − V1, Vn − V)‖∞ →a.s. 0,   (9.13)

where

V(x, y) = ∑_{j=1}^k ∫_{[0,x]} F0(t, y) dGj(t) − ∑_{j=2}^k ∫_{0≤s≤t≤x} F0(s, y) dG_{j−1,j}(s, t),   (9.14)
V1(x) = ∑_{j=1}^k ∫_{[0,x]} F0X(t) dGj(t) − ∑_{j=2}^k ∫_{0≤s≤t≤x} F0X(s) dG_{j−1,j}(s, t),   (9.15)
H(x) = V1(x) + ∫_{[0,x]} {1 − F0X(s)} dGk(s).   (9.16)
Proof: Equation (9.13) follows immediately from the Glivenko-Cantelli theorem, with H(x) = E 1{U ≤ x}, V(x, y) = E(Γ+ 1{U ≤ x, W ≤ y}) and V1(x) = V(x, ∞) = E(Γ+ 1{U ≤ x}). We now express H, V and V1 in terms of F0 and G. Note that the events [Γj = 1], j = 1, . . . , k + 1, are disjoint. Furthermore, U = Tj and W = Y on [Γj = 1], j = 1, . . . , k, and U = Tk on [Γk+1 = 1]. Hence,

V(x, y) = E(Γ+ 1{U ≤ x, W ≤ y}) = ∑_{j=1}^k P(Γj = 1, Y ≤ y, Tj ≤ x)
        = ∑_{j=1}^k P(X ∈ (Tj−1, Tj], Y ≤ y, Tj ≤ x)
        = ∑_{j=1}^k ∫_{0≤s≤t≤x} {F0(t, y) − F0(s, y)} dG_{j−1,j}(s, t),

and, using T0 = 0, X > 0 and G(0 < T1 < · · · < Tk) = 1, this can be written as

∑_{j=1}^k ∫_{[0,x]} F0(t, y) dGj(t) − ∑_{j=2}^k ∫_{0≤s≤t≤x} F0(s, y) dG_{j−1,j}(s, t).

Taking y = ∞ yields the expression for V1(x). The expression for H follows similarly, using

H(x) = E 1{U ≤ x} = ∑_{j=1}^k P(Γj = 1, Tj ≤ x) + P(Γk+1 = 1, Tk ≤ x).   □
The differentials of $V$ and $V_1$ with respect to $x$ are
\[
V(dx, y) = \sum_{j=1}^k F_0(x,y)\, dG_j(x) - \sum_{j=2}^k \int_{[0,x]} F_0(s,y)\, dG_{j-1,j}(s,x), \tag{9.17}
\]
\[
V_1(dx) = \sum_{j=1}^k F_{0X}(x)\, dG_j(x) - \sum_{j=2}^k \int_{[0,x]} F_{0X}(s)\, dG_{j-1,j}(s,x). \tag{9.18}
\]
Let $\tau$ be such that $H(\tau) < 1$. In the next theorem we derive the limits of $\Lambda_{1n}$ and $\Lambda_n$ for $x \in [0,\tau]$ and $y \in \mathbb{R}$.
Theorem 9.3 Let $\|\cdot\|_\infty$ be the supremum norm on $(D[0,\tau], D([0,\tau] \times \mathbb{R}))$. Then
\[
\|(\Lambda_{1n} - \Lambda_{1\infty},\, \Lambda_n - \Lambda_\infty)\|_\infty \to_{a.s.} 0,
\]
where
\[
\Lambda_\infty(x,y) = \int_{[0,x]} \frac{V(ds,y)}{1 - H(s-)}, \quad x \in [0,\tau],\ y \in \mathbb{R}, \tag{9.19}
\]
\[
\Lambda_{1\infty}(x) = \Lambda_\infty(x,\infty) = \int_{[0,x]} \frac{V_1(ds)}{1 - H(s-)}, \quad x \in [0,\tau]. \tag{9.20}
\]
Proof: The proof is similar to the discussion on page 1536 of Gill and Johansen (1990). For all $x \ge 0$, let $H_n^-(x) \equiv H_n(x-)$. Consider the mappings
\[
(H_n^-, V_{1n}, V_n) \to ((1 - H_n^-)^{-1}, V_{1n}, V_n) \to (\Lambda_{1n}, \Lambda_n)
\]
on the spaces
\[
(D^-[0,\tau], D[0,\tau], D([0,\tau] \times \mathbb{R})) \to (D^-[0,\tau], D[0,\tau], D([0,\tau] \times \mathbb{R})) \to (D[0,\tau], D([0,\tau] \times \mathbb{R})),
\]
where $D^-[0,\tau]$ is the space of 'caglad' (left-continuous with right limits) functions on $[0,\tau]$. The first mapping is continuous with respect to the supremum norm when we restrict the domain of its first argument to elements of $D^-[0,\tau]$ that are bounded by, say, $\{1 + H(\tau)\}/2 < 1$. Strong consistency of $H_n^-$ ensures that it satisfies this bound with probability one for $n$ large enough. The second mapping is continuous with respect to the supremum norm by the Helly-Bray lemma. Combining the continuity of these mappings with Lemma 9.2 yields the result of the theorem. $\Box$
Next, we derive the limits of $F^l_{Xn}$ and $F^l_n$.

Theorem 9.4 Let $\|\cdot\|_\infty$ be the supremum norm on $(D[0,\tau], D([0,\tau] \times \mathbb{R}))$. Then
\[
\|(F^l_{Xn} - F^l_{X\infty},\, F^l_n - F^l_\infty)\|_\infty \to 0 \quad \text{almost surely},
\]
where
\[
F^l_{X\infty}(x) = 1 - \prod_{s \le x} \{1 - \Lambda_{1\infty}(ds)\}, \tag{9.21}
\]
\[
F^l_\infty(x,y) = \int_{u \le x} \prod_{s < u} \{1 - \Lambda_{1\infty}(ds)\}\, \Lambda_\infty(du, y). \tag{9.22}
\]
Proof: To derive the almost sure limit of $F^l_{Xn}$ consider the mapping
\[
\Lambda_{1n} \mapsto \prod_{s \le x} \{1 - \Lambda_{1n}(ds)\} = 1 - F^l_{Xn}(x) \tag{9.23}
\]
on the space $D[0,\tau]$ to itself. This mapping is continuous with respect to the supremum norm when its domain is restricted to functions of uniformly bounded variation (Gill and Johansen (1990), Theorem 7). Note that $\Lambda_{1n} \le 1/\{1 - H_n(\tau)\} < 2/\{1 - H(\tau)\}$ with probability one for $n$ large enough. Together with the monotonicity of $\Lambda_{1n}$ this implies that with probability one $\Lambda_{1n}$ is of uniformly bounded variation for $n$ large enough. The almost sure limit of $F^l_{Xn}$ now follows by combining Theorem 9.3 and the continuity of (9.23).

To derive the almost sure limit of $F^l_n$ consider the mapping
\[
(\Lambda_{1n}, \Lambda_n) \mapsto \int_{u \le x} \prod_{s < u} \{1 - \Lambda_{1n}(ds)\}\, \Lambda_n(du, y) = F^l_n(x,y)
\]
on the space $(D[0,\tau], D([0,\tau] \times \mathbb{R}))$ to $D([0,\tau] \times \mathbb{R})$. This mapping is continuous with respect to the supremum norm when its domain is restricted to functions of uniformly bounded variation (Huang and Louis (1998), Theorem 1). Note that $\Lambda_n(x,y) \le \Lambda_{1n}(x)$, so that with probability one the pair $(\Lambda_n, \Lambda_{1n})$ is uniformly bounded for $n$ large enough. The result then follows as in the first part of the proof. $\Box$
In Corollaries 9.5 - 9.7, we rewrite $F^l_\infty$ in various ways.

Corollary 9.5 For $x \in [0,\tau]$, $y \in \mathbb{R}$, we can write
\[
F^l_\infty(x,y) = \int_{[0,x]} \frac{\Lambda_\infty(ds,y)}{\Lambda_{1\infty}(ds)}\, dF^l_{X\infty}(s) = \int_{[0,x]} \frac{V(ds,y)}{V_1(ds)}\, dF^l_{X\infty}(s). \tag{9.24}
\]

Proof: Combining equations (9.21) and (9.22) yields
\[
F^l_\infty(x,y) = \int_{[0,x]} \{1 - F^l_{X\infty}(s-)\}\, \Lambda_\infty(ds, y). \tag{9.25}
\]
Taking $y = \infty$ gives $F^l_{X\infty}(x) = F^l_\infty(x,\infty) = \int_{[0,x]} \{1 - F^l_{X\infty}(s-)\}\, \Lambda_{1\infty}(ds)$. Hence, $dF^l_{X\infty}(s) = \{1 - F^l_{X\infty}(s-)\}\, \Lambda_{1\infty}(ds)$. Combining this with equation (9.25) yields the first equality of (9.24). The second equality follows from the identities
\[
\Lambda_\infty(ds, y) = V(ds,y)/\{1 - H(s-)\}, \qquad \Lambda_{1\infty}(ds) = V_1(ds)/\{1 - H(s-)\}. \qquad \Box
\]
Corollary 9.6 Let $X$ and $Y$ be independent. Then
\[
F^l_\infty(x,y) = F^l_{X\infty}(x) F_{0Y}(y), \quad x \in [0,\tau],\ y \in \mathbb{R}. \tag{9.26}
\]

Proof: If $X$ and $Y$ are independent, equations (9.17) and (9.18) yield $V(ds,y) = F_{0Y}(y) V_1(ds)$. Substituting this into equation (9.24) gives the result. $\Box$
Corollary 9.7 Let $X$ be subject to current status censoring ($k = 1$). Then
\[
F^l_\infty(x,y) = \int_{[0,x]} P(Y \le y \mid X \le s)\, dF^l_{X\infty}(s), \quad x \in [0,\tau],\ y \in \mathbb{R}. \tag{9.27}
\]

Proof: For $k = 1$ equations (9.17) and (9.18) reduce to $V(ds,y) = F_0(s,y)\, dG(s)$ and $V_1(ds) = F_{0X}(s)\, dG(s)$. Hence, $V(ds,y)/V_1(ds) = F_0(s,y)/F_{0X}(s) = P(Y \le y \mid X \le s)$. $\Box$
We now consider necessary and sufficient conditions for consistency of $F^l_{Xn}$ and $F^l_n$. From the one-to-one correspondence between a univariate distribution function and its cumulative hazard function it follows that $F^l_{Xn}$ is consistent for $F_{0X}$ if and only if $\Lambda_{1\infty}$ equals the cumulative hazard function $\Lambda_X$ of $F_{0X}$. Similarly, it follows that $F^l_n(x,y)$ is consistent for $F_0(x,y)$ if and only if $\Lambda_\infty$ equals the mark specific cumulative hazard function $\Lambda$ of $F_0$. This is made precise in the following corollary.
Corollary 9.8 We introduce the following conditions:
\[
\Lambda_{1\infty}(x) = \int_{[0,x]} \frac{V_1(ds)}{1 - H(s-)} = \int_{[0,x]} \frac{F_{0X}(ds)}{1 - F_{0X}(s-)} = \Lambda_X(x), \tag{9.28}
\]
\[
\Lambda_\infty(x,y) = \int_{[0,x]} \frac{V(ds,y)}{1 - H(s-)} = \int_{[0,x]} \frac{F_0(ds,y)}{1 - F_{0X}(s-)} = \Lambda(x,y). \tag{9.29}
\]
Then $F^l_{Xn}$ is consistent for $F_{0X}$ on $(0,\tau]$ if and only if (9.28) holds for all $x \in (0,\tau]$. Furthermore, $F^l_n$ is consistent for $F_0$ on $(0,\tau] \times \mathbb{R}$ if and only if (9.29) holds for all $x \in (0,\tau]$, $y \in \mathbb{R}$. Finally, let $x_0 \in (0,\tau]$ with $F_{X\infty}(x_0) > 0$. Then $F^l_n(x_0,y)/F^l_{Xn}(x_0)$ is consistent for $F_{0Y}(y)$ if $X$ and $Y$ are independent.
The last claim of the corollary follows from (9.26). Conditions (9.28) and (9.29) are
hard to interpret in general, since F0X and F0 enter on both sides of the equations
when we plug in expressions (9.16), (9.17) and (9.18) for H(s−), V (ds, y) and V1(ds).
However, it is clear that the conditions force a relation between F0 and G, and such
a relation will typically not hold and cannot be assumed, since F0 is unknown. The
following corollary further strengthens this result when X is subject to current status
censoring.
Corollary 9.9 Let $X$ be subject to current status censoring, and let $F_{0X}$ and $G$ be continuous. Then the MLE $F^l_{Xn}$ is inconsistent for any choice of $F_{0X}$ and $G$.

Proof: Let $\gamma = \inf\{x : F_{0X}(x) > 0\} < \tau$. For continuous distribution functions $G$ and $F_{0X}$ condition (9.28) can be rewritten as
\[
\int_{(\gamma,x]} \frac{dG(s)}{1 - G(s)} = \int_{(\gamma,x]} \frac{dF_{0X}(s)}{F_{0X}(s)\{1 - F_{0X}(s)\}}, \quad x \in (\gamma,\tau].
\]
For continuous $G$ and $F_{0X}$ this integral equation is solved by
\[
-\log\{1 - G(x)\} + C = \log \frac{F_{0X}(x)}{1 - F_{0X}(x)}, \quad x \in (\gamma,\tau].
\]
This yields $F_{0X}(x) = [1 + \exp(-C)\{1 - G(x)\}]^{-1}$ for $x \in (\gamma,\tau]$. But there is no finite $C$ such that $F_{0X}(\gamma) = 0$ holds, and hence condition (9.28) fails for all continuous distributions $G$ and $F_{0X}$. $\Box$
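The contradiction in this proof is easy to check numerically. The sketch below is ours, with the hypothetical concrete choices $G = \mathrm{Unif}(0,1)$ (so $\gamma = 0$) and $C = 0$: midpoint-rule integration confirms that the logistic family solves the integral equation, while $F_{0X}$ stays bounded away from zero at $\gamma$ for every finite $C$.

```python
import math

# Hypothetical concrete case for Corollary 9.9: G = Unif(0, 1), gamma = 0, C = 0.
# The logistic family F0X(x) = 1 / (1 + exp(-C) * (1 - G(x))) solves the integral
# equation, but F0X(0) = 1 / (1 + exp(-C)) > 0 for every finite C.
C = 0.0
G = lambda s: s
F0X = lambda s: 1.0 / (1.0 + math.exp(-C) * (1.0 - G(s)))

def lhs(a, b, m=20000):
    # int_a^b dG(s) / (1 - G(s)) by the midpoint rule
    h = (b - a) / m
    return sum(h / (1.0 - G(a + (i + 0.5) * h)) for i in range(m))

def rhs(a, b, m=20000):
    # int_a^b dF0X(s) / [F0X(s) {1 - F0X(s)}] by the midpoint rule
    h = (b - a) / m
    total = 0.0
    for i in range(m):
        s = a + (i + 0.5) * h
        dF = F0X(s + h / 2) - F0X(s - h / 2)
        total += dF / (F0X(s) * (1.0 - F0X(s)))
    return total

print(lhs(0.1, 0.8), rhs(0.1, 0.8), F0X(0.0))
```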
The following corollary shows that the asymptotic bias of the MLE goes to zero as
the number k of observation times per subject increases, for at least one particular
distribution of T = (T1, . . . , Tk), namely if T1, . . . , Tk are distributed as the order
statistics of a uniform sample on [0, θ].
Corollary 9.10 Let $X$ be subject to interval censoring case $k$. Assume that the elements $T_1, \dots, T_k$ of $T$ are the order statistics of $k$ independent and identically distributed uniform random variables on $[0,\theta]$. We denote the resulting limits by $V^k(x,y)$, $V^k_1(x)$, $H^k(x)$, $\Lambda^k_\infty(x,y)$ and $\Lambda^k_{1\infty}(x)$, using the superscript $k$ to denote the dependence on $k$. Then
\[
\Lambda^k_{1\infty}(x) = \int_{[0,x]} \frac{dV^k_1(s)}{1 - H^k(s-)} \to \int_{[0,x]} \frac{dF_{0X}(s)}{1 - F_{0X}(s-)} = \Lambda_X(x), \quad k \to \infty,
\]
\[
\Lambda^k_\infty(x,y) = \int_{[0,x]} \frac{dV^k(s,y)}{1 - H^k(s-)} \to \int_{[0,x]} \frac{F_0(ds,y)}{1 - F_{0X}(s-)} = \Lambda(x,y), \quad k \to \infty,
\]
for all continuity points $x < \theta$ of $\Lambda_X(x)$ and $\Lambda(x,y)$ and for all $y \in \mathbb{R}$.
Proof: Since the observation times are order statistics of $k$ independent and identically distributed uniform random variables, the marginal densities $g_j$, $j = 1, \dots, k$, and the joint densities $g_{j-1,j}$, $j = 2, \dots, k$, are known (see, e.g., Shorack and Wellner (1986), page 97). Summing them over $j$ yields:
\[
\sum_{j=1}^k g_j(t) = \frac{k}{\theta}\, 1_{[0,\theta]}(t) \sum_{j-1=0}^{k-1} \binom{k-1}{j-1} \left(\frac{t}{\theta}\right)^{j-1} \left(1 - \frac{t}{\theta}\right)^{k-1-(j-1)} = \frac{k}{\theta}\, 1_{[0,\theta]}(t),
\]
\[
\sum_{j=2}^k g_{j-1,j}(s,t) = \frac{k(k-1)}{\theta^2}\, 1_{[0 \le s \le t \le \theta]} \left(1 - \frac{t-s}{\theta}\right)^{k-2}.
\]
Let $x < \theta$. We compute, using Fubini's theorem to rewrite the second term,
\begin{align*}
V^k(x,y) &= \sum_{j=1}^k \int_{[0,x]} F_0(t,y)\, dG_j(t) - \sum_{j=2}^k \iint_{0 \le s \le t \le x} F_0(s,y)\, dG_{j-1,j}(s,t) \\
&= \frac{k}{\theta} \int_{[0,x]} F_0(t,y)\, dt - \iint_{0 \le s \le t \le x} F_0(s,y)\, \frac{k(k-1)}{\theta^2} \left(1 - \frac{t-s}{\theta}\right)^{k-2} ds\, dt \\
&= \frac{k}{\theta} \int_{[0,x]} F_0(s,y) \left(1 - \frac{x-s}{\theta}\right)^{k-1} ds = \int_{[0,x]} F_0(s,y)\, dQ^k_x(s),
\end{align*}
where, for $s \le x$,
\[
Q^k_x(s) = \int_0^s \frac{k}{\theta} \left(1 - \frac{x-r}{\theta}\right)^{k-1} dr = \left(1 - \frac{x-s}{\theta}\right)^k - \left(1 - \frac{x}{\theta}\right)^k.
\]
Thus, as $k \to \infty$, $Q^k_x(s)$ converges weakly to the distribution function corresponding to the measure with mass 1 at $x$. Plugging in $y = \infty$ in $V^k(x,y)$ yields $V^k_1(x) = \int_{[0,x]} F_{0X}(s)\, dQ^k_x(s)$. Furthermore, plugging in the expressions for $V^k_1$ and $G_k$ in (9.16) gives
\[
H^k(x) = \int_{[0,x]} F_{0X}(s)\, dQ^k_x(s) + \int_{[0,x]} \{1 - F_{0X}(s)\}\, \frac{k}{\theta} (s/\theta)^{k-1}\, ds.
\]
Hence, for $x < \theta$ we have $V^k(x,y) \to F_0(x,y)$, $V^k_1(x) \to F_{0X}(x)$ and $1 - H^k(x) \to 1 - F_{0X}(x)$ as $k \to \infty$ for continuity points of the limits. The corollary then follows from the extended Helly-Bray theorem. $\Box$
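The concentration of $Q^k_x$ at $x$, which drives this proof, can be illustrated numerically. The sketch below is ours: it takes the hypothetical choices $\theta = 1$ and $F_{0X}(s) = s$ and evaluates $V^k_1(x) = \int_{[0,x]} F_{0X}\, dQ^k_x$ by a midpoint rule for growing $k$.

```python
# Our illustration of the proof of Corollary 9.10: theta = 1, F0X(s) = s.
# Q^k_x has density (k/theta)(1 - (x-s)/theta)^(k-1) on [0, x], which piles
# up near x as k grows, so V^k_1(x) approaches F0X(x).
theta = 1.0
F0X = lambda s: s

def Vk1(x, k, m=20000):
    h = x / m
    total = 0.0
    for i in range(m):
        s = (i + 0.5) * h
        total += F0X(s) * (k / theta) * (1.0 - (x - s) / theta) ** (k - 1) * h
    return total

for k in (1, 5, 50, 500):
    print(k, Vk1(0.6, k))
```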
Remark 9.11 The MLE for the distribution function of bivariate censored data has been found to be inconsistent before, namely when X and Y are both right censored (Van der Laan (1996)), and when X is current status censored and Y is uncensored (Maathuis (2003), Section 6.2). In the latter model the inconsistency could be explained by representational non-uniqueness of the MLE. However, this is not the case
for interval censored continuous mark data, where the MLE is typically inconsistent
even if representational non-uniqueness plays no role in the limit. Rather, the inconsistency in this model is related to the fact that the functions Λ1n and Λn that
define the MLE in (9.9) and (9.10) do not converge to the true underlying cumulative
hazard functions.
However, there is a similarity between these three bivariate censored data models
with inconsistent MLEs. Namely, in each model the observed sets can take the form of
line segments, and the likelihood contains corresponding partial density-type terms.
Thus, observed line segments can be viewed as a warning sign for consistency prob-
lems, and whenever they occur consistency of the MLE should be carefully studied.
These warning signs arise in the model for HIV vaccine data in Hudgens, Maathuis
and Gilbert (2006). This model is slightly different from ours, since it allows the mark
variable to be missing for observations that are not right censored. As a result, there
is no explicit formula for the MLE and hence it is more difficult to derive its almost
sure limit. Consistency of the MLE in this model is currently still an open problem,
but simulation results clearly point to inconsistency (Hudgens, Maathuis and Gilbert
(2006)).
9.3 Repaired MLE via discretization of marks
We now define a simple repaired estimator Fn(x, y) which is consistent for F0(x, y) for
y on a grid. The idea behind the estimator is that one can define discrete competing
risks based on a continuous random variable. Doing so transforms interval censored
continuous mark data into interval censored competing risks data, for which the MLE
is consistent.
To describe the method, we let $K > 0$ and define a grid $y_1 < \cdots < y_K$. We let $y_0 = -\infty$ and $y_{K+1} = \infty$, and introduce a new random variable $C \in \{1, \dots, K+1\}$:
\[
C = \sum_{j=1}^{K+1} j\, 1\{y_{j-1} < Y \le y_j\}.
\]
We can determine the value of C for all observations with an observed mark. Hence,
we can transform the observations $(T, \Gamma, W)$ into $(T, \Gamma, W^*)$, where $W^* = \Gamma_+ C$. This
gives interval censored competing risks data with K + 1 competing risks. Hence, this
repaired MLE can be computed with one of the algorithms described in Chapter 4.
Since the observed sets for interval censored competing risks data form a partition of the space $\mathbb{R}_+ \times \{1, \dots, K+1\}$, global consistency of the MLE follows from Theorems 9 and 10 of Van der Vaart and Wellner (2000). We can derive local consistency from the global consistency as done in Section 4.2. This means that we can consistently estimate the sub-distribution functions $F_{0j}(x) = P(X \le x, C = j) = P(X \le x, y_{j-1} < Y \le y_j)$. Hence, we can consistently estimate $F_0(x, y_j) = \sum_{l=1}^j F_{0l}(x)$ for $x \in \mathbb{R}_+$ and $y_j$ on the grid.
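The discretization step can be sketched as follows. This is our own illustration, not code from the thesis: the function name, the data layout, and the convention that a missing mark is represented by `None` are all hypothetical, and the subsequent MLE computation via the algorithms of Chapter 4 is not shown.

```python
import bisect

def discretize_marks(observations, grid):
    """Map interval censored continuous mark observations (T, Gamma, W) to
    interval censored competing risks observations with K + 1 competing risks.
    `grid` is y_1 < ... < y_K; the new cause is C = j iff y_{j-1} < Y <= y_j,
    with y_0 = -inf and y_{K+1} = +inf.  An observation is (t, gamma, w),
    where w is None when the mark is unobserved (right censored case)."""
    out = []
    for t, gamma, w in observations:
        if w is None:
            out.append((t, gamma, None))      # cause stays undetermined
        else:
            # C = 1-based index of the first grid cell containing w;
            # bisect_left handles w exactly on a grid point (y_{j-1} < Y <= y_j).
            c = bisect.bisect_left(grid, w) + 1
            out.append((t, gamma, c))
    return out

grid = [0.5, 1.0, 1.5]                        # K = 3, so K + 1 = 4 causes
obs = [((0.2,), (1,), 0.7), ((0.4,), (0,), None), ((0.3,), (1,), 2.0)]
print(discretize_marks(obs, grid))
```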
It may be tempting to choose $K$ large, so that $F_0(x,y)$ can be estimated for $y$ on a fine grid. However, this may result in a poor estimator. To obtain a good estimator one should choose the grid such that there are ample observations for each value of $C$. In practice, one can start with a coarse grid, and then refine the grid as long as the estimator stays close to the one computed on the coarse grid.
We close this section with some general remarks about this method. First, note
that the repaired MLE corresponds to an existing consistent MLE in the following
two cases: (a) estimation of F0(x, y) for right censored continuous mark data, and (b)
estimation of F0X(x) for interval censored continuous mark data. In the first case the
discretization does not change the intersection structure of the data if the distribution
of the observation times is continuous. Hence, the repaired MLE equals the consistent
MLE as defined by Huang and Louis (1998) for y on the grid. In the second case
we can take K = 0, thereby ignoring any information on Y . This means that we
compute the MLE for univariate interval censored data (T,Γ) which is known to be
consistent (Schick and Yu (2000), Van der Vaart and Wellner (2000)). In a simulation
study we found that moderate values of K tend to give better estimates for F0X, and in Section 9.4 we present results for n = 10,000 and K = 20. Finally, note that the
grouping of the data that occurs in the discretization tends to yield smaller maximal
intersections in the x-direction and hence diminishes problems with representational
non-uniqueness. This is visible in Examples 3 and 4 in Section 9.4.
9.4 Examples
We illustrate the asymptotic behavior of the inconsistent and repaired MLE in four
examples. The examples are chosen to cover a range of scenarios, summarized in Table
9.1. In each example we compute the MLEs $F^l_n$ and $F^u_n$ and the repaired estimators $\bar F^l_n$ and $\bar F^u_n$ for sample size $n = 10{,}000$. For the repaired estimators we use an equidistant grid with $K = 20$ points as shown in Figure 9.3. We compare these estimators to the true underlying distribution $F_0$ and the derived limits $F^l_\infty$ and $F^u_\infty$.
Figure 9.1 shows the contour lines of the MLE $F^l_n$, its limit $F^l_\infty$ and the true underlying distribution $F_0$. Note that $F^l_n$ and $F^l_\infty$ are almost indistinguishable, while there is a clear difference between $F^l_\infty$ and $F_0$. The results for the upper limits $F^u_n$ and $F^u_\infty$ are similar and not shown. Figure 9.2 contains the results for $F_{0X}$ and shows that the MLE tends to underestimate $F_{0X}$, which can be understood through Remark 9.1. However, the repaired MLE $\bar F_n$ closely follows $F_{0X}$. Figure 9.3 shows the results for $F_0(x_0, y)$ for fixed $x_0$. This function is often estimated as an alternative for $F_{0Y}$, since $F_{0Y}$ cannot be consistently estimated if the support of $T_1, \dots, T_k$ is contained in the support of $X$, a situation that typically occurs in practice. The values of $x_0$ were chosen to show a range of possible scenarios for the behavior of the MLE, and we see that $F_n$ can suffer from significant positive or negative bias and non-uniqueness. However, the repaired MLE is again close to the underlying distribution. We now discuss each example in detail.
Example 9.12 Let X and Y be independent, with X ∼ Unif(0, 1) and Y ∼ Exp(1).
Let X be subject to current status censoring with observation time T ∼ Unif(0, 0.5)
independent of (X, Y ). Thus, F0X(x) = x, F0Y (y) = 1 − exp(−y) and F0(x, y) =
x(1 − exp(−y)) for x ∈ [0, 1] and y ≥ 0.
Table 9.1: Summary of the examples for interval censored continuous mark data.

                              Example 1     Example 2    Example 3    Example 4
  (In)dependence of (X, Y)    independent   dependent    dependent    dependent
  Censoring mechanism for X   case 1        case 1       case 2       case 2
  Distribution of T           continuous    continuous   continuous   discrete
We derive the limits for $(x,y) \in [0,\tau] \times \mathbb{R}_+$ for $\tau < 0.5$. Using equations (9.18), (9.20), (9.21) and the fact that $\prod_{s \le x} \{1 - \Lambda_{1\infty}(ds)\} = \exp\{-\Lambda_{1\infty}(x)\}$ when $\Lambda_{1\infty}$ is continuous, we obtain
\[
\Lambda_{1\infty}(x) = \int_0^x \frac{F_{0X}}{1 - G}\, dG = \int_0^x \frac{2s}{1 - 2s}\, ds = -x - \log\sqrt{2 - 4x} + \log\sqrt{2},
\]
\[
1 - F^l_{X\infty}(x) = \exp\{-\Lambda_{1\infty}(x)\} = \sqrt{1 - 2x}\, \exp(x) \neq 1 - F_{0X}(x) = 1 - x.
\]
Since all maximal intersections $A(i)$, $i \in I$, converge to points and $F^l_{X\infty}(0.5) = 1$, the limit $F_{X\infty}$ does not suffer from representational non-uniqueness. Hence, $F_{X\infty} = F^l_{X\infty}$. Figure 9.2 shows that $F_{X\infty}(x) < F_{0X}(x)$ for small values of $x$, but $F_{X\infty}(x) > F_{0X}(x)$ for large values of $x$. In particular, $F_{X\infty}(0.5) = 1 > F_{0X}(0.5) = 0.5$. The fact that $F_{X\infty}$ equals one at the upper support point of $T$ is true in some generality and can be explained as follows. Let $\eta = G^{-1}(1)$, let $X$ be subject to current status censoring, let $F_{0X}(\eta) > 0$, and let $F_{0X}$ and $G$ be continuous at $\eta$. Then $\Lambda_{1\infty}(x) = \int_0^x F_{0X}/(1-G)\, dG$ can be viewed as a scaled down version of the cumulative hazard function of $G$, and hence it converges to infinity for $x \uparrow \eta$. This implies that $F_{X\infty}(x)$ converges to one for $x \uparrow \eta$. This observation is relevant in practice since it often happens in medical studies that the support of $G$ is strictly contained in the support of $X$. Figure 9.2 also shows that the repaired estimator $\bar F_{Xn}(x)$ closely follows $F_{0X}(x)$ for $x < 0.5$. Neither estimator behaves well for $x > 0.5$, but this was to be expected since we cannot estimate outside of the support of $G$.
Since $X$ and $Y$ are independent, the bivariate limit $F_\infty$ follows from equation (9.26): $F_\infty(x,y) = F_{X\infty}(x) F_{0Y}(y) = \{1 - \sqrt{1-2x}\, \exp(x)\}\{1 - \exp(-y)\}$. This implies that $F_0(x_0, y)$ for $x_0 = 0.49$ is overestimated by a factor $F_{X\infty}(0.49)/F_{0X}(0.49) \approx 1.57$, as shown in Figure 9.3. The repaired estimator $\bar F_n(0.49, y)$ behaves quite well, but is slightly off for larger values of $x$.
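The closed-form limit of Example 9.12 and the overestimation factor 1.57 can be checked numerically (our own sketch; the function names are ours):

```python
import math

# Closed-form limit from Example 9.12 (X, Y independent, T ~ Unif(0, 0.5)):
FX_inf = lambda x: 1.0 - math.sqrt(1.0 - 2.0 * x) * math.exp(x)   # x in [0, 0.5)
F0X = lambda x: x

def FX_inf_numeric(x, m=20000):
    """Check against direct midpoint-rule evaluation of
    Lambda_{1,inf}(x) = int_0^x 2s/(1-2s) ds and 1 - F = exp(-Lambda)."""
    h = x / m
    lam = sum(2.0 * ((i + 0.5) * h) / (1.0 - 2.0 * ((i + 0.5) * h)) * h
              for i in range(m))
    return 1.0 - math.exp(-lam)

ratio = FX_inf(0.49) / F0X(0.49)   # the overestimation factor at x0 = 0.49
print(FX_inf(0.3), FX_inf_numeric(0.3), round(ratio, 2))
```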
Example 9.13 Let $X \sim \mathrm{Unif}(0,1)$, and let $Y|X$ be exponentially distributed with mean $1/(X + a)$, where $a = 0.5$. Let $X$ be subject to current status censoring with observation time $T \sim \mathrm{Unif}(0,1)$ independent of $(X,Y)$. Thus, $F_{0X}(x) = x$, $F_{0Y}(y) = 1 - \exp(-ay)\{1 - \exp(-y)\}/y$ and $F_0(x,y) = x - \exp(-ay)\{1 - \exp(-xy)\}/y$ for $x \in [0,1]$ and $y \ge 0$.
Let $(x,y) \in [0,\tau] \times \mathbb{R}_+$ for $\tau < 1$. Equations (9.18), (9.20) and (9.21) yield
\[
\Lambda_{1\infty}(x) = \int_0^x \frac{F_{0X}}{1 - G}\, dG = \int_0^x \frac{s}{1-s}\, ds = -x - \log(1-x),
\]
\[
1 - F^l_{X\infty}(x) = \exp\{-\Lambda_{1\infty}(x)\} = (1-x) \exp(x) \ge 1 - F_{0X}(x) = 1 - x,
\]
where the inequality in the last line is strict for all $x \in (0,1]$. As in Example 1, $F_{X\infty} = F^l_{X\infty}$ is unique. Note that $P(Y \le y \mid X \le x) = 1 - \exp(-ay)\{1 - \exp(-xy)\}/(xy)$ and $f_{X\infty}(x) = x \exp(x)$. Hence, equation (9.27) yields
\[
F_\infty(x,y) = x \exp(x) + \frac{\exp(-ay)}{y(1-y)}\{\exp(x - xy) - 1\} - \left\{1 + \frac{\exp(-ay)}{y}\right\}\{\exp(x) - 1\}.
\]
Figures 9.2 and 9.3 show that $F_{Xn}(x)$ and $F_n(0.5, y)$ underestimate $F_{0X}(x)$ and $F_0(0.5, y)$, while the repaired MLE behaves very well.
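As a numerical sanity check (ours, with hypothetical helper names), one can compare the closed form for $F_\infty$ against direct midpoint-rule integration of (9.27), and confirm the underestimation $F_\infty(0.5, 2) < F_0(0.5, 2)$:

```python
import math

a = 0.5

def F_inf(x, y):
    # Closed form for F_inf(x, y) in Example 9.13 (y != 1 to avoid the
    # removable singularity in the 1/(1 - y) factor).
    return (x * math.exp(x)
            + math.exp(-a * y) / (y * (1.0 - y)) * (math.exp(x - x * y) - 1.0)
            - (1.0 + math.exp(-a * y) / y) * (math.exp(x) - 1.0))

def F_inf_numeric(x, y, m=20000):
    # Direct evaluation of (9.27) with f_{X,inf}(s) = s e^s and
    # P(Y <= y | X <= s) = 1 - exp(-a y)(1 - exp(-s y)) / (s y).
    h = x / m
    total = 0.0
    for i in range(m):
        s = (i + 0.5) * h
        cond = 1.0 - math.exp(-a * y) * (1.0 - math.exp(-s * y)) / (s * y)
        total += cond * s * math.exp(s) * h
    return total

print(F_inf(0.5, 2.0), F_inf_numeric(0.5, 2.0))
```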
Example 9.14 Let $X \sim \mathrm{Unif}(0,2)$, and let $Y \equiv X$. Let $X$ be subject to interval censoring case 2 with observation times $T = (T_1, T_2)$, independent of $(X,Y)$ and uniformly distributed over $\{(t_1, t_2) : 0 \le t_1 \le 1,\ 1 \le t_2 \le 2\}$. Thus, $F_{0X}(x) = \frac{1}{2}x$, $F_{0Y}(y) = \frac{1}{2}y$ and $F_0(x,y) = \frac{1}{2}(x \wedge y)$ for $(x,y) \in [0,2]^2$.
We derive the limits for $(x,y) \in [0,\tau] \times [0,2]$ for $\tau < 2$. Using equations (9.16), (9.18), (9.20) and (9.21), we get
\[
\Lambda_{1\infty}(x) = -\log\left\{1 - \tfrac{1}{4}(1 \wedge x)^2\right\} + \left\{\tfrac{2}{3} - \tfrac{2}{3}x - \log(2-x)\right\} 1\{x > 1\},
\]
\[
F^l_{X\infty}(x) = \tfrac{1}{4}x^2\, 1\{x \le 1\} + \left\{1 - \tfrac{3}{4}(2-x) \exp\left(\tfrac{2}{3}x - \tfrac{2}{3}\right)\right\} 1\{x > 1\}.
\]
In this example the limit $F_{X\infty}$ is non-unique and hence we also derive the upper bound $F^u_{X\infty}$. To do so, we look at the $x$-intervals of the observed sets, which take the form $(0, t_1]$, $(t_1, t_2]$ and $(t_2, \infty)$, with $t_1 \in (0,1]$ and $t_2 \in (1,2]$. Since there are no right censored observations with $T_L < 1$, equation (9.6) implies that observed sets with $x$-interval $(0, t_1]$ are maximal intersections, and these maximal intersections do not converge to points when $n$ goes to infinity. On the other hand, maximal intersections corresponding to observed sets with $x$-interval $(t_1, t_2]$ do converge to points. Hence, we obtain the upper bound $F^u_{X\infty}$ by reassigning all mass at points $t_1 \le 1$ to $x = 0^+$, where $0^+$ denotes a point slightly bigger than zero to account for the fact that the $x$-intervals are left-open. This yields
\[
F^u_{X\infty}(x) = \tfrac{1}{4}\, 1\{0 < x \le 1\} + \left\{1 - \tfrac{3}{4}(2-x) \exp\left(\tfrac{2}{3}x - \tfrac{2}{3}\right)\right\} 1\{x > 1\}.
\]
Note that $F^u_{X\infty}$ is left-continuous at zero. We obtain $F^l_\infty$ by first computing $V(dx, y)$ using (9.17), and then integrating $V(dx,y)/V_1(dx)$ against $F^l_{X\infty}(x)$ using (9.24):
\[
F^l_\infty(x,y) =
\begin{cases}
F^l_{X\infty}(x), & x \le y, \\[2pt]
F^l_{X\infty}(y) + \tfrac{1}{2}y(x - y), & y \le x \le 1, \\[2pt]
F^l_{X\infty}(y) + \tfrac{3}{8}(2y - 1)\left\{\exp\left(\tfrac{2}{3}x - \tfrac{2}{3}\right) - \exp\left(\tfrac{2}{3}y - \tfrac{2}{3}\right)\right\}, & 1 \le y \le x, \\[2pt]
F^l_{X\infty}(y) + \tfrac{1}{2}y(1 - y) + \tfrac{3}{8}y^2\left\{\exp\left(\tfrac{2}{3}x - \tfrac{2}{3}\right) - 1\right\}, & y \le 1 \le x.
\end{cases}
\]
We find $F^u_\infty$ by reassigning mass from the upper right to the lower left corners of the maximal intersections, as outlined for $F_{X\infty}$. Figure 9.1 shows that $F^l_\infty$ is smoother than $F_0$ and clearly different. Figure 9.2 shows that $F^l_{X\infty}(x) < F_{0X}(x)$ for all $x \in (0,\tau]$ and $F^l_{X\infty}(x) = F^u_{X\infty}(x)$ for $x \ge 1$, and Figure 9.3 shows that both $F^l_\infty(0.75, y)$ and $F^u_\infty(0.75, y)$ are smaller than $F_0(0.75, y)$. However, the repaired estimators $\bar F_{Xn}$ and $\bar F_n(0.75, y)$ are unique and behave very well.
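The two pieces of these limit formulas can be checked for internal consistency, since $1 - F^l_{X\infty} = \exp(-\Lambda_{1\infty})$ must hold wherever the limit is continuous. A small numerical check (our own transcription of the formulas above):

```python
import math

def Lam1_inf(x):
    # Lambda_{1,inf} from Example 9.14, as transcribed above.
    val = -math.log(1.0 - 0.25 * min(1.0, x) ** 2)
    if x > 1.0:
        val += 2.0 / 3.0 - 2.0 * x / 3.0 - math.log(2.0 - x)
    return val

def FlX_inf(x):
    # F^l_{X,inf} from Example 9.14, as transcribed above.
    if x <= 1.0:
        return 0.25 * x * x
    return 1.0 - 0.75 * (2.0 - x) * math.exp(2.0 * x / 3.0 - 2.0 / 3.0)

# 1 - FlX_inf(x) and exp(-Lam1_inf(x)) should agree on both pieces,
# including at the boundary x = 1.
for x in (0.5, 1.0, 1.5, 1.9):
    print(x, 1.0 - FlX_inf(x), math.exp(-Lam1_inf(x)))
```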
Example 9.15 Let $(X, Y)$ be uniformly distributed over $\{(x,y) : 0 \le x \le y \le 1\}$. Let $X$ be subject to interval censoring case 2 with observation times $T = (T_1, T_2)$ independent of $(X,Y)$. Let the distribution of $T$ be discrete: $G\{(0.25, 0.5)\} = 0.3$, $G\{(0.25, 0.75)\} = 0.3$ and $G\{(0.5, 0.75)\} = 0.4$. Thus, $F_{0X}(x) = 2x - x^2$, $F_{0Y}(y) = y^2$ and $F_0(x,y) = (2xy - x^2)\, 1\{x \le y\} + y^2\, 1\{x > y\}$ for $(x,y) \in [0,1]^2$.
Since we can only expect to get sensible estimates for $F_0(x,y)$ for values of $x$ in the support of the observation time distribution, we derive the limits for $x \in \{0.25, 0.5, 0.75\}$ and $y \in [0,1]$. Equations (9.16), (9.18), (9.20) and (9.21) yield $F^l_{X\infty}(0.25) \approx 0.26$, $F^l_{X\infty}(0.5) \approx 0.66$ and $F^l_{X\infty}(0.75) \approx 0.94$. Since $G$ is discrete, we do not use the exponential function in (9.21), but compute the product. As in Example 9.14, $F_{X\infty}$ is non-unique. We obtain $F^u_{X\infty}$ from $F^l_{X\infty}$ by moving the probability mass from the right endpoints to the left endpoints of the maximal intersections. The possible $x$-intervals of the maximal intersections are $(0, 0.25]$, $(0, 0.5]$, $(0.25, 0.5]$, $(0.5, 0.75]$ and $(0.75, \infty)$. Consider the interval $(0, 0.25]$ and note that moving mass from $x = 0.25$ to $x = 0^+$ does not change the value of $F_{X\infty}(x)$ for $x \in \{0, 0.25, 0.5, 0.75\}$. This also holds if we move mass in the other intervals, except for the interval $(0, 0.5]$, where moving the mass from $x = 0.5$ to $x = 0^+$ increases the value of $F_{X\infty}(x)$ at $x = 0.25$. Note that the mass of $F^l_{X\infty}$ at $x = 0.5$ comes from maximal intersections with $x$-intervals $(0, 0.5]$ and $(0.25, 0.5]$. The proportion of mass coming from the latter is
\[
\alpha = P(T_L = 0.25, T_R = 0.5 \mid T_R = 0.5)
= \frac{G\{(0.25, 0.5)\}\{F_{0X}(0.5) - F_{0X}(0.25)\}}{G\{(0.25, 0.5)\}\{F_{0X}(0.5) - F_{0X}(0.25)\} + G\{(0.5, 0.75)\}\, F_{0X}(0.5)} \approx 0.238.
\]
Hence, we get $F^u_{X\infty}(0.25) = F^l_{X\infty}(0.25) + (1 - \alpha)\{F^l_{X\infty}(0.5) - F^l_{X\infty}(0.25)\} \approx 0.56$ and $F^u_{X\infty}(x) = F^l_{X\infty}(x)$ for $x \in \{0, 0.5, 0.75\}$. To derive the bivariate limit $F^l_\infty$, we first find $V(dx, y)$ using equation (9.17) and then integrate $V(dx,y)/V_1(dx)$ against $F^l_{X\infty}(x)$ using equation (9.24). This yields $F^l_\infty(0.25, y) = 0.6\, F_0(0.25, y)$, $F^l_\infty(0.5, y) = 0.3\, F_0(0.25, y) + 0.7\, F_0(0.5, y)$ and $F^l_\infty(0.75, y) \approx 0.90\, F_0(0.75, y) + 0.19\, F_0(0.5, y) - 0.084\, F_0(0.25, y)$. The upper bound $F^u_\infty(x, y)$ can be found by reassigning mass to the lower left corners of the maximal intersections. To do so, we compute
\[
\alpha(y) = P(T_L = 0.25, T_R = 0.5 \mid T_R = 0.5, Y \le y)
= \frac{G\{(0.25, 0.5)\}\{F_0(0.5, y) - F_0(0.25, y)\}}{G\{(0.25, 0.5)\}\{F_0(0.5, y) - F_0(0.25, y)\} + G\{(0.5, 0.75)\}\, F_0(0.5, y)}.
\]
We then get $F^u_\infty(0.25, y) = F^l_\infty(0.25, y) + \{1 - \alpha(y)\}\{F^l_\infty(0.5, y) - F^l_\infty(0.25, y)\}$, and the value of $F_\infty(x, y)$ is unchanged for $x \in \{0, 0.5, 0.75\}$. The discrete nature of the limit $F^l_\infty$ is visible in Figure 9.1. Figure 9.2 shows significant non-uniqueness in all estimators for $x$-values outside the support of $G$. However, $\bar F_{Xn}(x)$ is unique for $x \in \{0.25, 0.5, 0.75\}$ and very close to $F_{0X}(x)$. Finally, Figure 9.3 shows that $F_\infty(0.25, y)$ is non-unique, while the repaired MLE is unique and closely follows $F_0(0.25, y)$.
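The values $\alpha \approx 0.238$ and $F^u_{X\infty}(0.25) \approx 0.56$ can be reproduced directly (our own sketch, reusing the approximate values 0.26 and 0.66 quoted above):

```python
# Example 9.15: G{(0.25, 0.5)} = 0.3, G{(0.5, 0.75)} = 0.4, F0X(x) = 2x - x^2.
F0X = lambda x: 2.0 * x - x * x

num = 0.3 * (F0X(0.5) - F0X(0.25))
alpha = num / (num + 0.4 * F0X(0.5))

# Upper bound at x = 0.25, using the approximate limits F^l_{X,inf}(0.25) ~ 0.26
# and F^l_{X,inf}(0.5) ~ 0.66 quoted in the text:
FuX_025 = 0.26 + (1.0 - alpha) * (0.66 - 0.26)
print(round(alpha, 3), round(FuX_025, 2))
```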
[Figure 9.1 appears here: a 4 x 3 grid of contour plots, with panels $F_n$, $F_\infty$ and $F$ for Examples 1-4.]

Figure 9.1: Contour lines for the bivariate functions $F^l_n$, $F^l_\infty$ and $F_0$. All functions were computed on an equidistant grid with mesh size 0.02, and $n = 10{,}000$.
[Figure 9.2 appears here: one panel $F_X$ for each of Examples 1-4.]

Figure 9.2: Dotted: $F_{0X}$. Dashed: $F^l_{X\infty}$ and $F^u_{X\infty}$. Solid black: $\bar F^l_{Xn}$ and $\bar F^u_{Xn}$ using the equidistant grid with $K = 20$ shown in Figure 9.3. Solid grey: $F^l_{Xn}$ and $F^u_{Xn}$. In all cases $n = 10{,}000$.
[Figure 9.3 appears here: panels $F(0.49, y)$ (Example 1), $F(0.5, y)$ (Example 2), $F(0.75, y)$ (Example 3) and $F(0.25, y)$ (Example 4).]

Figure 9.3: Dotted: $F_0(x_0, y)$. Dashed: $F^l_\infty(x_0, y)$ and $F^u_\infty(x_0, y)$. Circles: $\bar F^l_n(x_0, y) = \bar F^u_n(x_0, y)$ using an equidistant grid with $K = 20$. Solid grey: $F^l_n(x_0, y)$ and $F^u_n(x_0, y)$. In all cases $n = 10{,}000$.
BIBLIOGRAPHY

Ayer, M., Brunk, H. D., Ewing, G. M., Reid, W. T. and Silverman, E. (1955). An empirical distribution function for sampling with incomplete information. Annals of Mathematical Statistics 26 641–647.

Barlow, R. E., Bartholomew, D. J., Bremner, J. M. and Brunk, H. D. (1972). Statistical Inference Under Order Restrictions. The Theory and Application of Isotonic Regression. John Wiley & Sons, London - New York - Sydney.

Bickel, P. J., Klaassen, C. A. J., Ritov, Y. and Wellner, J. A. (1993). Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press.

Birman, M. and Solomjak, M. (1967). Piece-wise polynomial approximations of functions of the classes W^α_p. Mathematics of the USSR Sbornik 73 295–317.

Dudley, R. M. (1968). Distances of probability measures and random variables. Ann. Math. Statist. 39 1563–1572.

Dudley, R. M. (1978). Central limit theorems for empirical measures. Ann. Probab. 6 899–929. (Correction: (1979) Ann. Probab. 7 909–911.)

Gentleman, R. and Geyer, C. J. (1994). Maximum likelihood for interval censored data: Consistency and computation. Biometrika 81 618–623.

Gentleman, R. and Vandal, A. C. (2001). Computational algorithms for censored-data problems using intersection graphs. J. Comput. Graph. Statist. 10 403–421.

Gentleman, R. and Vandal, A. C. (2002). Nonparametric estimation of the bivariate CDF for arbitrarily censored data. Can. J. Statist. 30 557–571.

Geskus, R. B. and Groeneboom, P. (1996). Asymptotically optimal estimation of smooth functionals for interval censoring, part 1. Statistica Neerlandica 50 69–88.

Geskus, R. B. and Groeneboom, P. (1997). Asymptotically optimal estimation of smooth functionals for interval censoring, part 2. Statistica Neerlandica 51 201–219.

Geskus, R. B. and Groeneboom, P. (1999). Asymptotically optimal estimation of smooth functionals for interval censoring, case 2. Ann. Statist. 27 627–674.

Gill, R. D. and Johansen, S. (1990). A survey of product-integration with a view toward application in survival analysis. Ann. Statist. 18 1501–1555.

Golumbic, M. C. (1980). Algorithmic Graph Theory and Perfect Graphs. Academic Press, New York.

Gordon, R. D. (1941). Values of Mills' ratio of area to bounding ordinate and of the normal probability integral for large values of the argument. Ann. Math. Statistics 12 364–366.

Groeneboom, P. (1989). Brownian motion with a parabolic drift and Airy functions. Probability Theory and Related Fields 81 79–109.

Groeneboom, P. (1996). Lectures on inverse problems. In Lectures on Probability Theory and Statistics. École d'Été de Probabilités de Saint-Flour XXIV, 1994. Springer, Berlin.

Groeneboom, P., Jongbloed, G. and Wellner, J. (2002). The support reduction algorithm for computing nonparametric function estimates in mixture models. Technical Report 2002-13, Vrije Universiteit Amsterdam, The Netherlands. Available at arXiv:math/ST/0405511.

Groeneboom, P., Jongbloed, G. and Wellner, J. A. (2001a). A canonical process for estimation of convex functions: The "invelope" of integrated Brownian motion + t^4. Ann. Statist. 29.

Groeneboom, P., Jongbloed, G. and Wellner, J. A. (2001b). Estimation of a convex function: Characterizations and asymptotic theory. Ann. Statist. 29.

Groeneboom, P. and Wellner, J. A. (1992). Information Bounds and Nonparametric Maximum Likelihood Estimation. Birkhäuser Verlag, Basel.

Hajós, G. (1957). Über eine Art von Graphen. Internationale Mathematische Nachrichten 11. Problem 65.

Huang, J. and Wellner, J. A. (1995). Asymptotic normality of the NPMLE of linear functionals for interval censored data, case 1. Statistica Neerlandica 49 153–163.

Huang, Y. and Louis, T. A. (1998). Nonparametric estimation of the joint distribution of survival time and mark variables. Biometrika 85 785–798.

Hudgens, M. G., Maathuis, M. H. and Gilbert, P. B. (2006). Nonparametric estimation of the joint distribution of a survival time subject to interval censoring and a continuous mark variable. Submitted.
Hudgens, M. G., Satten, G. A. and Longini, I. M. (2001). Nonparametric maximum likelihood estimation for competing risks survival data subject to interval censoring and truncation. Biometrics 57 74–80.

Jewell, N. P. and Kalbfleisch, J. D. (2004). Maximum likelihood estimation of ordered multinomial parameters. Biostatistics 5 291–306.

Jewell, N. P., Van der Laan, M. J. and Henneman, T. (2003). Nonparametric estimation from current status data with competing risks. Biometrika 90 183–197.

Jongbloed, G. (1995). Three Statistical Inverse Problems. Ph.D. thesis, Delft University of Technology, The Netherlands.

Jongbloed, G. (1998). The iterative convex minorant algorithm for nonparametric estimation. J. Comput. Graph. Statist. 7 310–321.

Kim, J. and Pollard, D. (1990). Cube root asymptotics. Ann. Statist. 18 191–219.

Krailo, M. D. and Pike, M. C. (1983). Estimation of the distribution of age at natural menopause from prevalence data. American Journal of Epidemiology 117 356–361.

Maathuis, M. H. (2003). Nonparametric Maximum Likelihood Estimation for Bivariate Censored Data. Master's thesis, Delft University of Technology, The Netherlands.

Maathuis, M. H. (2005). Reduction algorithm for the MLE for the distribution function of bivariate interval censored data. J. Comput. Graph. Statist. 14 352–362.

Maathuis, M. H. and Wellner, J. A. (2006). Inconsistency of the MLE for the joint distribution of interval censored survival times and continuous marks. Submitted.

MacMahon, B. and Worcester, J. (1966). Age at menopause, United States 1960–1962. National Center for Health Statistics. Vital and Health Statistics 11.

Pfanzagl, J. (1988). Consistency of maximum likelihood estimators for certain nonparametric families, in particular: mixtures. J. Statist. Plann. Inference 19 137–158.

Pollard, D. (1984). Convergence of Stochastic Processes. Springer-Verlag, New York. Available at http://ameliabedelia.library.yale.edu/dbases/pollard1984.pdf.

Robertson, T., Wright, F. T. and Dykstra, R. L. (1988). Order Restricted Statistical Inference. John Wiley & Sons, Chichester.

Rudin, W. (1976). Principles of Mathematical Analysis. 3rd ed. McGraw-Hill, New York.
Schick, A. and Yu, Q. (2000). Consistency of the GMLE with mixed case interval-censored data. Scand. J. Statist. 27 45–55.
Shorack, G. R. and Wellner, J. A. (1986). Empirical Processes with Applicationsto Statistics. John Wiley & Sons, New York.
Silverman, B. W. (1982). On the estimation of a probability density function bythe maximum penalized likelihood method. Ann. Statist. 10 795–810.
Turnbull, B. W. (1976). The empirical distribution function with arbitrarilygrouped, censored, and truncated data. J. R. Statist. Soc. B 38 290–295.
Van de Geer, S. A. (1991). The entropy bound for monotone functions. Tech. Rep. 91-10, University of Leiden, The Netherlands.
Van de Geer, S. A. (1993). Hellinger-consistency of certain nonparametric maximum likelihood estimators. Ann. Statist. 21 14–44.
Van de Geer, S. A. (1996). Rates of convergence of the maximum likelihood estimator in mixture models. J. Nonparametr. Stat. 6 293–310.
Van de Geer, S. A. (2000). Applications of Empirical Process Theory. Cambridge University Press, Cambridge.
Van der Laan, M. J. (1996). Efficient estimation in the bivariate censoring model and repairing NPMLE. Ann. Statist. 24 596–627.
Van der Vaart, A. W. (1991). On differentiable functionals. Ann. Statist. 19 178–204.
Van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer-Verlag, New York.
Van der Vaart, A. W. and Wellner, J. A. (2000). Preservation theorems for Glivenko-Cantelli and uniform Glivenko-Cantelli classes. In High Dimensional Probability II. Birkhäuser, Boston, 115–133.
Vandal, A. C., Gentleman, R. and Liu, X. (2006). Mixture nonuniqueness of the CDF NPMLE with censored data. Submitted.
Wellner, J. A. (2003). Empirical processes: Theory and applications. Lecture notes for summer school on statistics and probability, Bocconi University, Milan. Available at http://www.stat.washington.edu/jaw/RESEARCH/TALKS/talks.html.
Wong, G. Y. and Yu, Q. (1999). Generalized MLE of a joint distribution function with multivariate interval-censored data. Journal of Multivariate Analysis 69 155–166.
Yoshihara, K.-i. (1979). The Borel-Cantelli lemma for strong mixing sequences of events and their applications to LIL. Kodai Math. J. 2 148–157.
Zeidler, E. (1985). Nonlinear Functional Analysis and its Applications III: Variational Methods and Optimization. Springer-Verlag, New York.
VITA
Marloes Henriette Maathuis was born to Harry and Ina Maathuis on May 28, 1978,
in Groningen, The Netherlands. After graduating from the Praedinius Gymnasium in
Groningen in 1996, she started studies in Applied Mathematics at the Delft University
of Technology. As part of this program, she did an internship at the Ethiopian
Netherlands AIDS Research Project in Addis Ababa, Ethiopia. In 2001 she came to
the University of Washington to write her Master’s thesis, resulting in a Master of
Science degree in Applied Mathematics from the Delft University of Technology in
2003. She simultaneously started graduate studies at the University of Washington,
and graduated with a Doctor of Philosophy in Statistics in 2006. She will remain
associated with the University of Washington in the following year, as an Acting
Assistant Professor in Statistics.