stat.ethz.ch/~maathuis/papers/thesis.pdf · 2007. 9. 6.
Nonparametric estimation for current status data with
competing risks
Marloes Henriette Maathuis
A dissertation submitted in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
University of Washington
2006
Program Authorized to Offer Degree: Statistics
University of Washington
Graduate School
This is to certify that I have examined this copy of a doctoral dissertation by
Marloes Henriette Maathuis
and have found that it is complete and satisfactory in all respects, and that any and all revisions required by the final
examining committee have been made.
Co-Chairs of the Supervisory Committee:
Piet Groeneboom
Jon A. Wellner
Reading Committee:
Piet Groeneboom
Michael G. Hudgens
Jon A. Wellner
Date:
In presenting this dissertation in partial fulfillment of the requirements for the doctoral degree at the University of Washington, I agree that the Library shall make its copies freely available for inspection. I further agree that extensive copying of this dissertation is allowable only for scholarly purposes, consistent with “fair use” as prescribed in the U.S. Copyright Law. Requests for copying or reproduction of this dissertation may be referred to Proquest Information and Learning, 300 North Zeeb Road, Ann Arbor, MI 48106-1346, 1-800-521-0600, to whom the author has granted “the right to reproduce and sell (a) copies of the manuscript in microform and/or (b) printed copies of the manuscript made from microform.”
Signature
Date
University of Washington
Abstract
Nonparametric estimation for current status data with competing risks
Marloes Henriette Maathuis
Co-Chairs of the Supervisory Committee:
Professor Piet Groeneboom, Statistics
Professor Jon A. Wellner, Statistics
We study current status data with competing risks. Such data arise naturally in
cross-sectional survival studies with several failure causes. Moreover, generalizations
of these data arise in HIV vaccine clinical trials.
The general framework is as follows. We analyze a system that can fail from K
competing risks, where K ∈ N is fixed. The random variables of interest are (X, Y ),
where X ∈ R+ = (0,∞) is the failure time of the system, and Y ∈ {1, . . . , K} is
the corresponding failure cause. However, we cannot observe (X, Y ) directly. Rather,
we observe the ‘current status’ of the system at a single random observation time
T ∈ R+, where T is independent of (X, Y ). This means that at time T , we observe
whether or not failure occurred, and if and only if failure occurred, we also observe
the failure cause Y .
We study nonparametric estimation of the sub-distribution functions F0k(t) =
P (X ≤ t, Y = k), k = 1, . . . , K, t ∈ R+. We focus on two estimators: the nonpara-
metric maximum likelihood estimator (MLE) and the ‘naive estimator’ introduced
by Jewell, Van der Laan and Henneman (2003). Our main interest is in asymptotic
properties of the MLE, and the naive estimator is considered for comparison.
Until now, the asymptotic properties of the MLE have been largely unknown. We
resolve this issue by proving its consistency, n1/3-rate of convergence, and limiting
distribution. The limiting distribution involves a new self-induced limiting process,
consisting of the convex minorants of K correlated two-sided Brownian motion pro-
cesses plus parabolic drifts, plus an additional term involving the difference between
the sum of the K drifting Brownian motions and their convex minorants.
Various other aspects that we consider include characterizations of the estimators,
uniqueness, graph theory, and computational algorithms. Furthermore, we show that
both the MLE and the naive estimator are asymptotically efficient for a family of
smooth functionals, with √n-rate convergence to a normal limit. Finally, we study an
extension of the model, where X is subject to interval censoring and Y is a continuous
random variable. We show that the MLE is typically inconsistent in this model, and
propose a simple method to repair this inconsistency.
TABLE OF CONTENTS
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivation and problem description . . . . . . . . . . . . . . . . . . . 1
1.2 Overview of previous work . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Overview of new results and outline of this thesis . . . . . . . . . . . 4
Chapter 2: The estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1 Definition of the estimators . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Censored data perspective . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Graph theory and uniqueness . . . . . . . . . . . . . . . . . . . . . . 23
2.4 Characterizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Chapter 3: Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.1 Reduction and optimization . . . . . . . . . . . . . . . . . . . . . . . 63
3.2 Iterative convex minorant algorithms . . . . . . . . . . . . . . . . . . 66
Chapter 4: Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.1 Hellinger consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2 Local and uniform consistency . . . . . . . . . . . . . . . . . . . . . . 77
Chapter 5: Rate of convergence . . . . . . . . . . . . . . . . . . . . . . . . 85
5.1 Hellinger rate of convergence . . . . . . . . . . . . . . . . . . . . . . . 85
5.2 Asymptotic local minimax lower bound . . . . . . . . . . . . . . . . . 90
5.3 Local rate of convergence . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.4 Technical lemmas and proofs . . . . . . . . . . . . . . . . . . . . . . . 118
Chapter 6: Limiting distribution . . . . . . . . . . . . . . . . . . . . . . . . 132
6.1 The limiting distribution of the naive estimator . . . . . . . . . . . . 133
6.2 The limiting distribution of the MLE . . . . . . . . . . . . . . . . . . 146
6.3 Technical lemmas and proofs . . . . . . . . . . . . . . . . . . . . . . . 177
Chapter 7: A family of smooth functionals . . . . . . . . . . . . . . . . . . 186
7.1 Information bound calculations . . . . . . . . . . . . . . . . . . . . . 187
7.2 Asymptotic normality of functionals of the MLE . . . . . . . . . . . . 194
Chapter 8: Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
8.1 Menopause data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
8.2 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
Chapter 9: An extension: interval censored continuous mark data . . . . . 229
9.1 The model and an explicit formula for the MLE . . . . . . . . . . . . 230
9.2 Inconsistency of the MLE . . . . . . . . . . . . . . . . . . . . . . . . 236
9.3 Repaired MLE via discretization of marks . . . . . . . . . . . . . . . 246
9.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
LIST OF FIGURES
Figure Number Page
2.1 The estimators: Graphical representation of the observed data. . . . . 9
2.2 Graph theory: Intersection graph for the MLE. . . . . . . . . . . . . 30
2.3 Convex minorant characterizations: Plots for the data in Table 2.5. . 59
5.1 Asymptotic local minimax lower bound: The perturbation Fnk. . . . . 91
5.2 Local rate: Plot of vn(t) for various values of β. . . . . . . . . . . . . 100
5.3 Local rate: Example clarifying the proof of Lemma 5.16. . . . . . . . 128
6.1 Limiting distribution: Processes for the naive estimator at t0 = 1 . . . 136
6.2 Limiting distribution: Processes for the naive estimator at t0 = 2 . . . 137
6.3 Limiting distribution: Processes for the MLE at t0 = 1 . . . . . . . . 153
6.4 Limiting distribution: Processes for the MLE at t0 = 2 . . . . . . . . 154
6.5 Limiting distribution: Comparison of limiting processes at t0 = 1. . . 155
6.6 Limiting distribution: Comparison of limiting processes at t0 = 2. . . 156
8.1 Menopause data: Question of the Health Examination Study. . . . . . 211
8.2 Menopause data: The MLE and the naive estimator. . . . . . . . . . 212
8.3 Simulations: The true underlying sub-distribution functions. . . . . . 218
8.4 Simulations: The estimators in a single simulation. . . . . . . . . . . 219
8.5 Simulations: Pointwise bias. . . . . . . . . . . . . . . . . . . . . . . . 220
8.6 Simulations: Pointwise variance. . . . . . . . . . . . . . . . . . . . . . 221
8.7 Simulations: Pointwise mean squared error. . . . . . . . . . . . . . . . 222
8.8 Simulations: Pointwise relative efficiency. . . . . . . . . . . . . . . . . 223
8.9 Simulations: Smooth functionals of the MLE for t0 = 2. . . . . . . . . 225
8.10 Simulations: Smooth functionals of the naive estimator for t0 = 2. . . 226
8.11 Simulations: Smooth functionals of the MLE for t0 = 10. . . . . . . . 227
8.12 Simulations: Smooth functionals of the naive estimator for t0 = 10. . 228
9.1 Continuous mark data: Contour lines for estimates of F0(x, y). . . . . 254
9.2 Continuous mark data: Estimates of F0X(x). . . . . . . . . . . . . . . 255
9.3 Continuous mark data: Estimates of F0(x0, y). . . . . . . . . . . . . . 256
LIST OF TABLES
Table Number Page
2.1 Censored data perspective: Example data. . . . . . . . . . . . . . . . 22
2.2 Censored data perspective: Estimators for the data in Table 2.1. . . . 23
2.3 Graph theory: Example data. . . . . . . . . . . . . . . . . . . . . . . 29
2.4 Graph theory: Clique matrix for the data in Table 2.3. . . . . . . . . 35
2.5 Convex minorant characterizations: Example data . . . . . . . . . . . 58
8.1 Simulations: Pointwise bias, variance and MSE at t = 10. . . . . . . . 224
9.1 Continuous mark data: Summary of the examples. . . . . . . . . . . . 249
ACKNOWLEDGMENTS
I sincerely thank my advisors, Piet Groeneboom and Jon Wellner, for their
mentorship over the past years. Their knowledge, guidance, inspiration and
encouragement have been very important to me.
I thank Peter Gilbert, Tilmann Gneiting, Peter Hoff and Michael Hudgens
for serving on my committee, with special thanks to Michael for suggesting
this research problem. I thank Bernard Deconinck for serving as the graduate
school representative.
I am grateful to the faculty, staff and students in our department for provid-
ing a stimulating and supportive research environment. In particular, I thank
Fadoua Balabdaoui, Moulinath Banerjee and Hanna Jankowski for helpful dis-
cussions. Finally, I want to express my deep gratitude to Steven, my parents,
my family and my friends, for their continuous support.
Chapter 1
INTRODUCTION
1.1 Motivation and problem description
The work in this thesis is motivated by recent clinical trials of candidate vaccines
against HIV/AIDS. The main purpose of such trials is to determine the overall effi-
cacy of a candidate vaccine. Like many viruses, HIV exhibits significant genotypic
and phenotypic variation, so that it can be distinguished into several subtypes. There-
fore, it is also of interest to determine the efficacy of a vaccine against each subtype
of the virus. Establishing vaccine efficacy for certain subtypes can warrant vaccina-
tion of populations in which the given subtypes are highly prevalent. Furthermore,
establishing that the vaccine is efficacious for some subtypes, but not for others, gives
important information for possible improvements of the vaccine.
Thus, the variables of interest are the time of infection and the subtype of the
infecting virus. These variables cannot be observed directly, because participants of a
trial are only tested for the virus at several follow-up times. Since each test indicates
whether or not infection happened before the time of the test, the time of infection
is interval censored, i.e., only known to lie within a time interval determined by the
follow-up times. Since simultaneous infections with several subtypes of a virus are
rare, the subtypes are often analyzed as competing risks (see, e.g., Hudgens, Satten
and Longini (2001)). Hence, these trials yield interval censored survival data with
competing risks.
In this thesis, we analyze current status data with competing risks. Current sta-
tus censoring is the simplest form of interval censoring, where there is exactly one
observation time for each subject. We study these data for two reasons. First, such
data arise naturally in cross-sectional studies with several failure causes. Second,
understanding current status data with competing risks is a first step towards under-
standing the more complicated interval censored data with competing risks that arise
in vaccine clinical trials.
We consider the following general framework. We analyze a system that can fail
from K competing risks, where K ∈ N is fixed. The random variables of interest
are (X, Y ), where X ∈ R+ = (0,∞) is the failure time of the system, and Y ∈ {1, . . . , K} is the corresponding failure cause. Due to censoring, we cannot observe
(X, Y ) directly. Rather, we observe the ‘current status’ of the system at a single
random observation time T ∈ R+, where T is independent of (X, Y ). Thus, at time
T we observe whether or not failure occurred, and if and only if failure occurred, we
also observe the failure cause Y .
Examples that fit into this framework can be found in reliability and survival
analysis. For an example, see the menopause data analyzed by Krailo and Pike
(1983), where X is the age at menopause, Y is the cause of menopause (natural or
operative), and T is the age at the time of the survey. In cross-sectional HIV studies
we think of X as the time of HIV infection, Y as the subtype of the infecting HIV
virus, and T as the time of the HIV test. Note that one is free to choose the origin of the time scale. Common choices include the date of birth and the beginning of
the study.
Given current status data with competing risks, we consider nonparametric estima-
tion of the sub-distribution functions F0k(t) = P (X ≤ t, Y = k), k = 1, . . . , K. This
problem, or close variants thereof, has been studied by Hudgens, Satten and Longini
(2001), Jewell, Van der Laan and Henneman (2003), and Jewell and Kalbfleisch
(2004). However, there are still many open problems. In particular, until now, the
asymptotic properties of the nonparametric maximum likelihood estimator (MLE)
have been largely unknown. In this thesis, we resolve this problem. We prove consistency, the rate of convergence and the limiting distribution of the MLE. These
asymptotic results form an important step towards making inference about the sub-
distribution functions.
The outline of the remainder of this chapter is as follows. In Section 1.2 we give
an overview of previous work in this area. In Section 1.3 we give an outline of this
thesis, together with a discussion of our main results.
1.2 Overview of previous work
Hudgens, Satten and Longini (2001) study competing risks data subject to interval
censoring and truncation. They derive the nonparametric maximum likelihood esti-
mator (MLE) and provide an EM algorithm for its computation. They also introduce
an alternative pseudo-likelihood estimator. They apply their methods to data from
a cohort of injecting drug users in Thailand, where the event of interest is infection
with HIV-1, and the competing risks are HIV-1 subtypes B and E.
Jewell, Van der Laan and Henneman (2003) study current status data with com-
peting risks. They consider some simple parametric models, some ad-hoc nonparamet-
ric estimators, and the MLE. They compare these estimators in a simulation study.
Furthermore, they apply their methods to data analyzed by Krailo and Pike (1983),
where the event of interest is menopause and the competing risks are natural and
operative menopause. Finally, the authors discuss results suggesting that the simple
ad-hoc estimators might yield fully efficient estimators for smooth functionals of the
sub-distribution functions.
Jewell and Kalbfleisch (2004) study maximum likelihood estimation of a series of
ordered multinomial parameters. Current status data with competing risks can be
viewed as a special case of this setting. The authors focus on the computation of the
MLE, and introduce an iterative version of the Pool Adjacent Violators Algorithm.
1.3 Overview of new results and outline of this thesis
We focus on the following two nonparametric estimators for the sub-distribution func-
tions: the MLE F̂n = (F̂n1, . . . , F̂nK), and the ‘naive estimator’ F̃n = (F̃n1, . . . , F̃nK)
introduced by Jewell, Van der Laan and Henneman (2003).1 Our main interest is in
asymptotic properties of the MLE, and the naive estimator is considered for compar-
ison.
In Chapter 2 we define the estimators, and discuss the relationship between them.
We show that both the MLE and the naive estimator can be viewed as maximum like-
lihood estimators for censored data. This observation is useful, because it allows us
to use readily available theory and computational algorithms. In particular, the naive
estimator can be viewed as the maximum likelihood estimator for reduced univariate
current status data. Hence, many properties of the naive estimator follow straight-
forwardly from known results on current status data. The censored data perspective
also allows us to use graph theory to study uniqueness properties of the estimators.
Finally, we characterize the estimators in terms of necessary and sufficient condi-
tions, in the form of Fenchel characterizations and (self-induced) convex minorant
characterizations. These characterizations play a key role in the development of the
asymptotic theory, and also lead to computational algorithms.
Computational aspects of the MLE are discussed in Chapter 3. Since there are
no explicit formulas available for the MLE, we compute the MLE with an iterative
algorithm. We discuss two classes of algorithms and the connections between them.
The first class is based on sequential quadratic programming, where each quadratic
programming problem is solved using a support reduction algorithm. The second class
consists of iterative convex minorant algorithms. We prove convergence of algorithms
in both classes. Furthermore, we show that one particular iterative convex minorant
algorithm can be viewed as a sequential quadratic programming method that only
uses the diagonal elements of the Hessian matrix.

1The subscript n denotes the sample size.
In Chapter 4 we discuss consistency of the estimators. We prove that both esti-
mators are Hellinger consistent, and we use this to derive various forms of local and
uniform consistency.
The rate of convergence is discussed in Chapter 5. The Hellinger rate of conver-
gence and the local rate of convergence of the naive estimator are n1/3. This follows
from known results on current status data without competing risks. For the MLE, we
prove that the Hellinger rate of convergence is n1/3. Next, we derive a local asymp-
totic minimax lower bound of n1/3, meaning that no estimator can have a better local
rate of convergence than n1/3, in a minimax sense. We proceed by proving that the
local rate of convergence of the MLE is n1/3. This result comes as no surprise given
the local asymptotic minimax lower bound and the local rate of convergence of the
naive estimator. However, the proof of this result turned out to be rather involved,
and required new methods. The key idea is to first establish a rate result for ∑_{k=1}^K F̂nk that holds uniformly on a fixed neighborhood around a point t0, instead of on the usual shrinking neighborhood of order O(n^{−1/3}).
In Chapter 6 we discuss the limiting distribution of the estimators. The limiting
distribution of the naive estimator is given by the slopes of the convex minorants of
K correlated two-sided Brownian motion processes plus parabolic drifts. The limiting
distribution of the MLE involves a new self-induced limiting process, consisting of the
convex minorants of K correlated two-sided Brownian motion processes plus parabolic
drifts, plus an additional term involving the difference between the sum of the K
drifting Brownian motion processes and their convex minorants.
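For a single sub-distribution function, the building block of these limits is a two-sided Brownian motion plus parabolic drift together with its greatest convex minorant. The sketch below is our own numerical illustration, not part of the thesis: it discretizes W(t) + t² on a grid and reads off the slope of its greatest convex minorant near the origin. The window [−c, c], the grid size and the standard parabola are arbitrary illustrative choices; the actual limiting processes involve distribution-dependent scaling constants and K correlated motions.

```python
import numpy as np

def gcm_interval_slopes(t, y):
    """Greatest convex minorant (lower convex hull) of the points (t[i], y[i]),
    computed by a left-to-right monotone chain scan; returns the minorant's
    slope on each grid interval [t[i], t[i+1])."""
    hull = [0]
    for i in range(1, len(t)):
        hull.append(i)
        # pop the middle point while it lies on or above the chord of its neighbors
        while len(hull) > 2:
            a, b, c = hull[-3], hull[-2], hull[-1]
            if (y[b] - y[a]) * (t[c] - t[a]) >= (y[c] - y[a]) * (t[b] - t[a]):
                hull.pop(-2)
            else:
                break
    slopes = np.empty(len(t) - 1)
    for a, b in zip(hull[:-1], hull[1:]):
        slopes[a:b] = (y[b] - y[a]) / (t[b] - t[a])
    return slopes

def slope_at_zero(c=3.0, m=6001, rng=None):
    """Simulate W(t) + t^2 on [-c, c], with W a two-sided Brownian motion
    pinned at W(0) = 0, and return the convex minorant's slope just left of 0."""
    rng = np.random.default_rng(rng)
    t = np.linspace(-c, c, m)
    dt = t[1] - t[0]
    w = np.concatenate([[0.0], np.cumsum(rng.normal(scale=np.sqrt(dt), size=m - 1))])
    w -= np.interp(0.0, t, w)      # shift so that W(0) = 0
    y = w + t**2                   # Brownian motion plus parabolic drift
    return gcm_interval_slopes(t, y)[np.searchsorted(t, 0.0) - 1]
```

Averaging `slope_at_zero` over many replications approximates the distribution of the slope process at the origin for this single, uncorrelated component.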
In Chapter 7 we consider estimation of smooth functionals. Jewell, Van der Laan
and Henneman (2003) suggested that the naive estimator yields asymptotically effi-
cient smooth functionals. We show that this is indeed the case, and that the same
holds for the MLE.
In Chapter 8 we apply our methods to real and simulated data. We compare
the MLE and the naive estimator in a simulation study, considering both pointwise
estimation and the estimation of smooth functionals. For pointwise estimation, we
show that the MLE is superior to the naive estimator in terms of mean squared error,
both for small and large sample sizes. For the estimation of smooth functionals,
we show that the behavior of the MLE and the naive estimator is similar, and in
agreement with the results in Chapter 7.
Finally, in Chapter 9 we consider an extension of the model, where X is subject
to interval censoring case k, and Y is a continuous random variable. This model is
referred to as the interval censored continuous mark model. It is applicable to HIV
vaccine clinical trials by letting X be the time of HIV infection, and Y be the ‘viral
distance’ between the infecting HIV virus and the virus present in the vaccine. We
derive the limit of the MLE in this model, and show that the MLE is inconsistent in
general. We also suggest a simple method for repairing the MLE by discretizing Y ,
an operation that transforms the data to interval censored data with competing risks.
We illustrate the behavior of the MLE and the repaired MLE in four examples.
Chapter 2
THE ESTIMATORS
In this chapter we study finite sample properties of the MLE and the naive esti-
mator. In Section 2.1 we formally define the model and the estimators. Since both
estimators can be viewed as maximum likelihood estimators for censored data, Sec-
tion 2.2 provides a general discussion on the MLE for censored data. In Section 2.3
we use a graph theoretic perspective to derive properties of the estimators. Finally, in
Section 2.4, we characterize the estimators in terms of necessary and sufficient Fenchel
and convex minorant conditions.
2.1 Definition of the estimators
Before we define the MLE and the naive estimator, we introduce some assumptions
and notation. Recall that K ∈ N denotes the number of competing risks. The
variables of interest are (X, Y ), where X ∈ R+ is the failure time of a system, and
Y ∈ {1, . . . , K} is the corresponding failure cause. We do not observe (X, Y ) directly.
Rather, we observe the system at a random observation time T ∈ R+. At this time,
we observe whether or not failure occurred, and if and only if failure occurred, we
also observe the failure cause Y . Our goal is nonparametric estimation of the bivari-
ate distribution function of (X, Y ), or equivalently, of the vector of sub-distribution
functions F0 = (F01, . . . , F0K), where
F0k(t) = P (X ≤ t, Y = k), k = 1, . . . , K.
We make the following assumptions:
(a) T is independent of (X, Y );
(b) The system cannot fail from two or more causes at the same time.
Assumption (a) is essential for the development of the theory, and is used in the
definition of the estimators in Sections 2.1.2 and 2.1.3. Assumption (b) ensures that
the failure cause is well defined. This assumption is always satisfied by defining
simultaneous failure from several causes as a new failure cause. We do not make any
other assumptions. In particular, we do not require that all observation times are
distinct.
2.1.1 Notation
We denote the observed data by Z = (T,∆), where ∆ = (∆1, . . . ,∆K+1) and
∆k = 1{X ≤ T, Y = k}, k = 1, . . . , K, (2.1)
∆K+1 = 1{X > T}. (2.2)
Thus, for k = 1, . . . , K, ∆k = 1 if and only if failure happened by time T and was due
to cause k. Furthermore, ∆K+1 = 1 if and only if failure did not happen by time T .
Note that ∑_{k=1}^{K+1} ∆k = 1, and hence ∆K+1 = 1 − ∑_{k=1}^K ∆k. A graphical representation
of the observed data is given in Figure 2.1.
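To make the observation scheme concrete, the following sketch simulates observed data Z = (T,∆) as in (2.1)–(2.2). The distributional choices (standard exponential X and T, failure cause uniform on {1, . . . , K} and independent of X) are purely illustrative choices of ours, not assumptions of the thesis.

```python
import numpy as np

def simulate_current_status_cr(n, K=3, rng=None):
    """Simulate current status data with competing risks.

    Illustrative choices (not from the thesis): X ~ Exp(1) is the failure
    time, Y is uniform on {1, ..., K} and independent of X, and T ~ Exp(1)
    is the observation time, independent of (X, Y). Returns (T, Delta)
    with Delta the (n, K+1) indicator array of (2.1)-(2.2).
    """
    rng = np.random.default_rng(rng)
    X = rng.exponential(size=n)         # failure times
    Y = rng.integers(1, K + 1, size=n)  # failure causes
    T = rng.exponential(size=n)         # observation times, independent of (X, Y)

    Delta = np.zeros((n, K + 1), dtype=int)
    failed = X <= T
    Delta[np.flatnonzero(failed), Y[failed] - 1] = 1  # Delta_k = 1{X <= T, Y = k}
    Delta[~failed, K] = 1                             # Delta_{K+1} = 1{X > T}
    return T, Delta
```

Each row of `Delta` has exactly one entry equal to one, mirroring the constraint ∑_{k=1}^{K+1} ∆k = 1.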
Let Z1, . . . , Zn be n i.i.d. observations of Z, where Zi = (Ti,∆i) and ∆i =
(∆i1, . . . ,∆i,K+1). We call an observation Zi right censored if ∆i,K+1 = 1, and left
censored otherwise. Let T(1), . . . , T(n) be the order statistics of T1, . . . , Tn, where ties
are broken arbitrarily after ensuring that left censored observations are ordered before
right censored observations. We denote the corresponding ∆-vectors by ∆(1), . . . ,∆(n),
where ∆(i) = (∆(i)1, . . . ,∆(i),K+1).
Figure 2.1: Graphical representation of the observed data (T,∆) in an example with K = 3 competing risks. The grey sets indicate the values of (X, Y ) that are consistent with (T,∆), for each of the four possible values of ∆: (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0) and (0, 0, 0, 1).
Let ek, k = 1, . . . , K + 1, be the kth unit vector in R^{K+1}, and let

Z = {(t, ek) : t ∈ R+, k = 1, . . . , K + 1}. (2.3)
Let G be the distribution of T , and let Gn be the empirical distribution of T1, . . . , Tn.
Furthermore, let Pn be the empirical distribution of Z1, . . . , Zn, i.e., for any function h : Z → R we have Pn h(Z) = ∫ h(z) dPn(z) = (1/n) ∑_{i=1}^n h(Zi). For vectors x = (x1, . . . , xK) ∈ R^K, we define x+ = ∑_{k=1}^K xk and xK+1 = 1 − x+. For example, we write ∆+ = ∑_{k=1}^K ∆k, F0+(t) = ∑_{k=1}^K F0k(t) and F0,K+1(t) = 1 − F0+(t). The
only exception to the notation xK+1 = 1 − x+ is that we do not use it for the naive
estimator. The reason for this will become clear in Section 2.1.3.
2.1.2 The MLE
We now define the MLE F̂n = (F̂n1, . . . , F̂nK) for F0 = (F01, . . . , F0K). Note that
∆|T ∼ MultinomialK+1(1, (F01(T ), . . . , F0,K+1(T ))). (2.4)
Hence, under F = (F1, . . . , FK), the density for a single observation z = (t, δ) is
pF (z) = ∏_{k=1}^{K+1} Fk(t)^{δk}, (2.5)
with respect to the dominating measure µ = G×#, where # is counting measure on
{ek : k = 1, . . . , K + 1}. The corresponding log likelihood (divided by n)1 is
ln(F ) = ∫ log pF (u, δ) dPn(u, δ) = ∑_{k=1}^{K+1} ∫ δk logFk(u) dPn(u, δ), (2.6)
and the MLE (if it exists)2 is defined by
ln(F̂n) = max_{F∈FK} ln(F ), (2.7)
where FK is the set of all K-tuples of sub-distribution functions on R+ with pointwise
sum bounded by one. Note that we can absorb G in the dominating measure µ because
of the assumed independence between T and (X, Y ).
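As an illustration of (2.5)–(2.6), the sketch below (our own, not from the thesis) evaluates the normalized log likelihood ln(F ) for a candidate F supplied as a function; the particular candidate used in the toy check is an arbitrary choice of ours.

```python
import numpy as np

def log_likelihood(T, Delta, F):
    """Evaluate the log likelihood ln(F) of (2.6), i.e. divided by n.

    T: (n,) observation times; Delta: (n, K+1) indicator vectors as in
    (2.1)-(2.2); F: callable returning the (n, K) array of candidate
    sub-distribution values (F1(t), ..., FK(t)). The last column
    F_{K+1} = 1 - (F1 + ... + FK) is appended internally.
    """
    FK = np.asarray(F(T), dtype=float)                  # (n, K)
    full = np.column_stack([FK, 1.0 - FK.sum(axis=1)])  # (n, K+1)
    logs = np.log(full, out=np.full_like(full, -np.inf), where=full > 0)
    # for each observation exactly one delta_k equals 1, so only that term counts
    return np.sum(np.where(Delta == 1, logs, 0.0)) / len(T)

# toy check with K = 1 and the arbitrary candidate F1(t) = 1 - exp(-t)
T = np.array([0.5, 1.0])
Delta = np.array([[1, 0], [0, 1]])  # first subject failed by time T, second had not
ll = log_likelihood(T, Delta, lambda t: (1 - np.exp(-t))[:, None])
```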
2.1.3 The naive estimator
We now define the naive estimator F̃n = (F̃n1, . . . , F̃n,K+1). The naive estimator F̃nk
can be viewed as the MLE for the reduced current status data Zk = (T,∆k). To see
1In order to efficiently use the empirical process notation, we use the convention of dividing all log likelihoods by n.
2Existence of the estimators will follow from Theorem 2.1 ahead.
this, let pk,Fk(u, δ) be the marginal density of the reduced current status data Zk:
pk,Fk(u, δ) = Fk(u)^{δk} {1 − Fk(u)}^{1−δk}.
Then the naive estimator F̃nk maximizes the marginal log likelihood

lnk(Fk) = ∫ log pk,Fk(u, δ) dPn(u, δ) = ∫ [δk logFk(u) + (1 − δk) log(1 − Fk(u))] dPn(u, δ), (2.8)
for k = 1, . . . , K + 1. Thus, the naive estimators (if they exist) are defined by
lnk(F̃nk) = max_{Fk∈F} lnk(Fk), k = 1, . . . , K, (2.9)
ln,K+1(F̃n,K+1) = max_{S∈S} ln,K+1(S), (2.10)
where F is the collection of all sub-distribution functions on R+, and S is the collection
of all sub-survival functions on R+. Note that we can omit G in the marginal log
likelihood, since T and (X, Y ) are independent.
The naive estimator provides two different estimators for the overall failure time
distribution F0+, namely F̃n+ = ∑_{k=1}^K F̃nk and 1 − F̃n,K+1. Since the naive estimator
does not require the sum of the sub-distribution functions to be bounded by one, F̃n+ may exceed one. In contrast, 1 − F̃n,K+1 is always bounded between zero and one. This estimator is simply the MLE for the overall failure time distribution when information on the failure causes is ignored. In general, F̃n,K+1 ≠ 1 − F̃n+, and we therefore do not use the shorthand notation xK+1 = 1 − x+ for the naive estimator.
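Since each F̃nk is the maximum likelihood estimator for the reduced current status data (T,∆k), it can be computed as the isotonic regression of the indicators ∆(i)k in the order of the observation times, for instance with the pool adjacent violators algorithm. The sketch below is our own minimal implementation of this standard reduction; the function names are ours.

```python
import numpy as np

def pava(y, w=None):
    """Pool adjacent violators: weighted, nondecreasing isotonic fit to y."""
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    vals, wts, cnts = [], [], []  # current blocks: value, weight, length
    for yi, wi in zip(y, w):
        vals.append(yi); wts.append(wi); cnts.append(1)
        while len(vals) > 1 and vals[-2] >= vals[-1]:
            # pool the two rightmost blocks into their weighted average
            pooled = (wts[-2] * vals[-2] + wts[-1] * vals[-1]) / (wts[-2] + wts[-1])
            vals[-2], wts[-2], cnts[-2] = pooled, wts[-2] + wts[-1], cnts[-2] + cnts[-1]
            vals.pop(); wts.pop(); cnts.pop()
    return np.repeat(vals, cnts)

def naive_estimator(T, Delta):
    """Naive estimates F_nk, k = 1, ..., K, at the ordered observation times.

    Each column is the isotonic regression of the indicators Delta_k in
    the order of T, i.e., the current status MLE for the reduced data
    (T, Delta_k). (The sub-survival estimate for k = K+1 would analogously
    be an antitonic fit and is omitted here.)
    """
    order = np.argsort(T, kind="stable")
    K = Delta.shape[1] - 1
    return np.column_stack([pava(Delta[order, k]) for k in range(K)])
```

For example, indicators (0, 1, 0, 1) in time order yield the monotone fit (0, 1/2, 1/2, 1).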
2.1.4 Comparison of the two estimators
In order to point out the similarities and differences between the MLE and the naive
estimator, we give the following alternative but equivalent definition of the naive
estimator. For F = (F1, . . . , FK), we define
ln(F ) = ∫ ∑_{k=1}^K [δk logFk(u) + (1 − δk) log(1 − Fk(u))] dPn(u, δ). (2.11)
Then the naive estimator F̃n = (F̃n1, . . . , F̃nK) (if it exists) is defined by

ln(F̃n) = max_{F∈FK} ln(F ), (2.12)
where FK is the space of all K-tuples of sub-distribution functions on R+. Comparing
this optimization problem with the optimization problem (2.7) for the MLE, we see
the following two differences:
(a) The log likelihood (2.6) for the MLE contains a term involving FK+1(u) = 1 − F+(u), while the log likelihood (2.11) for the naive estimator does not include such a term;
(b) The space over which the MLE maximizes (2.6) includes the constraint that the sum of the sub-distribution functions is bounded by one, while the space over which the naive estimator maximizes (2.11) does not include such a constraint.
Thus, the MLE takes into account the K-dimensional system of sub-distribution
functions, while the naive estimator ignores this aspect of the problem. In fact,
since the sub-distribution functions in optimization problem (2.12) are not related to
each other, the optimization problem can be split into the K optimization problems
defined in (2.9). Since these optimization problems correspond to the MLE for uni-
variate current status data, both computational results and asymptotic theory follow
straightforwardly from known results for current status data (see Groeneboom and
Wellner (1992, Part II, Sections 1.1, 4.1 and 5.1)).
The fact that the MLE takes into account the system of sub-distribution functions
leads to more complicated computation and asymptotic theory. However, these complications result in a better pointwise behavior of the MLE, as shown in the simulation
study in Section 8.2.
2.2 Censored data perspective
From the definitions of the MLE and the naive estimator, we see that both estima-
tors can be viewed as nonparametric maximum likelihood estimators for censored
data. Viewing the estimators from this perspective allows us to use readily available
computational algorithms and theory for the MLE for censored data.
We consider the following general framework. Let W be a random variable taking
values in W. Suppose that W has distribution F0. Our goal is to estimate this
distribution. However, we do not observe W directly. Rather, we observe a vector
of random sets D = (D1, . . . , Dp) that form a partition of W, i.e., ∪_{j=1}^p Dj = W and Dj ∩ Dk = ∅ for j ≠ k ∈ {1, . . . , p}. We assume that D is independent of
W . In principle, we can allow the number of random sets to be random, but for
our purposes that is not needed. Furthermore, we observe an indicator vector ∆ =
(∆1, . . . ,∆p), where ∆j = 1{W ∈ Dj}, j = 1, . . . , p. Thus, we observe a vector D
containing a random partition of W, and an indicator vector ∆ indicating which set
R ∈ {D1, . . . , Dp} contains the unobservable W . We call the set R an observed set.
Using the convention 0 ·Dj = ∅, we can write
R = ∪_{j=1}^p ∆j Dj.
Let Z1, . . . , Zn be n i.i.d. copies of Z = (D,∆). These data define n i.i.d. observed
sets R1, . . . , Rn. Writing the log likelihood in terms of these sets gives
ln(F ) = (1/n) ∑_{i=1}^n logPF (Ri),
where PF (Ri) denotes the probability mass in Ri under distribution F . The maximum
likelihood estimator (if it exists) is defined by
ln(F̂n) = max_{F∈F} ln(F ), (2.13)
where F is the space of all distribution functions on W. Since ln(F ) is optimized
over the function space F , the optimization problem (2.13) is infinite dimensional.
However, the number of parameters can be reduced by generalizing the reasoning
of Turnbull (1976) for univariate censored data. It follows that the estimators can
only assign mass to a finite collection of disjoint sets A1, . . . , Am, called maximal
intersections by Wong and Yu (1999). In the literature, there are several equivalent
definitions of maximal intersections. Wong and Yu (1999) define Aj to be a maximal
intersection if and only if it is a finite intersection of the Ri’s such that, for each i, Aj ∩ Ri = ∅ or Aj ∩ Ri = Aj. Gentleman and Vandal (2002) use a graph theoretic
perspective. They show that the maximal intersections correspond to maximal cliques
of the intersection graph of the observed sets. We discuss this perspective in detail
in the next section. For observed sets that take the form of rectangles in Rp, p ∈ N,
Maathuis (2005) introduces yet another way to view the maximal intersections, using
a height map of the observed sets. This height map is a function h : Rp → {0, 1, . . .}, where h(x) is defined as the number of observed sets that overlap at the point x ∈ Rp.
Maathuis (2005) shows that the maximal intersections are exactly the local maxima of
the height map of a canonical version of the observed sets. We say that R′1, . . . , R′n are a canonical version of R1, . . . , Rn if the following three properties hold: (i) R1, . . . , Rn and R′1, . . . , R′n have the same intersection structure, i.e., Ri ∩ Rj = ∅ if and only if R′i ∩ R′j = ∅, for all i, j ∈ {1, . . . , n}; (ii) the x-coordinates of R′1, . . . , R′n are distinct and take values in {1, . . . , 2n}; (iii) the y-coordinates of R′1, . . . , R′n are distinct and take values in {1, . . . , 2n}. Thus, any ties that may have been present in R1, . . . , Rn are resolved in R′1, . . . , R′n, but in a way that does not affect the intersection structure.
For details on the transformation to canonical sets, see Maathuis (2005, Section 2.1).
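The height map makes the reduction step concrete. The sketch below is hypothetical illustrative code, written for the simpler case of half-open intervals (a, b] on the real line with distinct (canonical) endpoints: a maximal intersection is exactly a local maximum of the height map, i.e., it ends wherever a right endpoint immediately follows a left endpoint in the sorted sweep.

```python
def maximal_intersections(intervals):
    """Maximal intersections of half-open intervals (a, b] via the
    height map: sweep the sorted endpoints; the height rises at each
    left endpoint and falls at each right endpoint, and a local maximum
    is reached whenever a right endpoint directly follows a left one.
    Assumes canonical sets: all 2n endpoints are distinct.
    """
    events = []
    for a, b in intervals:
        events.append((a, "L"))   # left endpoint: height goes up
        events.append((b, "R"))   # right endpoint: height goes down
    events.sort()
    result, prev = [], None
    for x, kind in events:
        if kind == "R" and prev is not None and prev[1] == "L":
            result.append((prev[0], x))   # the interval (prev, x]
        prev = (x, kind)
    return result
```

For example, the observed sets (1, 5], (2, 7], (6, 9] yield the maximal intersections (2, 5] and (6, 7].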
By generalizing the reasoning of Turnbull (1976), it follows that the MLE is indifferent to the distribution of mass within the maximal intersections. As a result,
the MLE is typically not uniquely defined on the maximal intersections. This type of
non-uniqueness is called representational non-uniqueness by Gentleman and Vandal
(2002). Thus, we can at best hope to determine the probability masses αj = PF (Aj),
j = 1, . . . , m. We let α = (α1, . . . , αm) and write the probability mass in an observed
set Ri in terms of α:
Pα(Ri) = ∑_{j=1}^{m} αj 1{Aj ⊆ Ri}. (2.14)
Then we can write the log likelihood as
ln(α) = (1/n) ∑_{i=1}^{n} log Pα(Ri) = (1/n) ∑_{i=1}^{n} log ( ∑_{j=1}^{m} αj 1{Aj ⊆ Ri} ). (2.15)
Thus, we can think of the computation of the estimators as a two step process. First,
in the reduction step, we compute the maximal intersections A1, . . . , Am. Next, in the
optimization step, we solve the optimization problem
ln(α̂) = max_{α ∈ A} ln(α), (2.16)

where

A = {α ∈ Rm : αj ≥ 0, j = 1, . . . , m, 1^T α = 1}
and 1 is the all-one vector in Rm. This optimization problem is an m-dimensional
convex constrained optimization problem. Existence of the MLE follows directly from
standard methods in optimization theory.
Theorem 2.1 The MLE α̂ defined by (2.16) exists.
Proof: Letting log(0) = −∞, ln(α) is a continuous extended real-valued function on the nonempty compact set A. Hence, the maximum exists by, e.g., Zeidler (1985, Corollary 38.10). □
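For concreteness, one standard way to carry out the optimization step (2.16) is Turnbull's self-consistency (EM) iteration, which repeatedly redistributes the mass of each observation over the maximal intersections it contains. The sketch below is a generic illustration of this idea in terms of the clique matrix H with Hij = 1{Aj ⊆ Ri} (introduced in Section 2.3.1), not necessarily the algorithm used elsewhere in this thesis; faster methods exist.

```python
import numpy as np

def em_mle(H, n_iter=1000):
    """Self-consistency (EM) iterations for (2.16). H is the n x m
    0/1 matrix with H[i, j] = 1{A_j in R_i}, so (H @ alpha)[i] is the
    probability mass P_alpha(R_i). Each step redistributes the mass of
    every observation over the maximal intersections it contains."""
    n, m = H.shape
    alpha = np.full(m, 1.0 / m)          # start from the uniform vector
    for _ in range(n_iter):
        p = H @ alpha                    # p[i] = P_alpha(R_i)
        alpha = alpha * (H.T @ (1.0 / p)) / n
    return alpha
```

Each iterate stays in the simplex A, and the log likelihood is nondecreasing along the iterations.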
The optimization problem (2.16) may have several solutions. This forms a second
source of non-uniqueness for the MLE, called mixture non-uniqueness by Gentleman
and Vandal (2002). We will show in Section 2.3 that for current status data with
competing risks, both the MLE and the naive estimator are mixture unique. However,
we first show how both estimators fit into the censored data framework.
2.2.1 Censored data perspective of the MLE
For the MLE, the variable of interest is W = (X, Y ), taking values in the space W = R+ × {1, . . . , K}. The observation time T defines a partition of p = K + 1 random sets in W:

Dk = (0, T ] × {k}, k = 1, . . . , K, (2.17)
DK+1 = (T,∞) × {1, . . . , K}. (2.18)
Since there is a one-to-one correspondence between D = (D1, . . . , DK+1) and T , the
assumption that T is independent of (X, Y ) is equivalent to the assumption that D is
independent of (X, Y ). Furthermore, note that ∆k = 1{X ≤ T, Y = k} = 1{(X, Y ) ∈ Dk} for k = 1, . . . , K, and ∆K+1 = 1{X > T} = 1{(X, Y ) ∈ DK+1}. Hence, the ∆
vector indicates which set contains the unobservable (X, Y ), and the observed data
(T,∆) give exactly the same information as (D,∆).
The corresponding observed sets are R = ∪_{k=1}^{K+1} ∆kDk, so that

R = (0, T ] × {k} if ∆k = 1, k = 1, . . . , K,
R = (T,∞) × {1, . . . , K} if ∆K+1 = 1. (2.19)
It follows that we can write the log likelihood (2.6) as ln(F ) = (1/n) ∑_{i=1}^{n} log PF (Ri).
The MLE maximizes this expression over all bivariate sub-distribution functions F
on R+ × {1, . . . , K}, or equivalently, over all K-tuples of sub-distribution functions
F = (F1, . . . , FK) with pointwise sum bounded by one.
We now consider the maximal intersections of the observed sets R1, . . . , Rn. Note
that the observed sets can take the form (t,∞) × {1, . . . , K} for some t ∈ R+. Such sets
are not rectangles in R2, and hence we cannot directly use the concept of the height
map of Maathuis (2005). However, by transforming such sets into (t,∞)× [1, K], we
do have rectangles in R2. We can then compute the maximal intersections using the
concept of the height map. Afterwards we transform sets of the form (t,∞) × [1, K]
back to (t,∞) × {1, . . . , K}.
Once we have computed α̂, we obtain F̂nk(t) by summing the mass in (0, t] × {k}, for k = 1, . . . , K and t ∈ R+. For each k ∈ {1, . . . , K + 1}, we call A a maximal intersection for F̂nk if A is involved in the computation of F̂nk. A precise definition
is given below.
Definition 2.2 Let k ∈ {1, . . . , K}, and let R = {R1, . . . , Rn} be the observed sets as defined in (2.19). We call A a maximal intersection for F̂nk if it is a maximal intersection of R and A ∩ (R × {k}) ≠ ∅. We call A a maximal intersection for F̂n+ (or equivalently, for F̂n,K+1) if A is a maximal intersection for some F̂nk, k = 1, . . . , K.
Note that maximal intersections for F̂n+ are sets in R+ × {1, . . . , K}, although F̂n+ is a function on R+. Recall from Section 2.1.1 that we order the observations such
that their observation times are nondecreasing, where ties are broken arbitrarily after ensuring that left censored observations are ordered before right censored observations. Hence, if there is an observation Zi such that Ti = T(n) and ∆i,K+1 = 1, then ∆(n),K+1 = 1 holds, even if there are other observations with Ti = T(n) and ∆ik = 1 for some k ∈ {1, . . . , K}. This is used in the following lemma, which provides information on the form of the maximal intersections for F̂nk. The lemma follows directly
from the idea of the height map.
Lemma 2.3 Let k ∈ {1, . . . , K}. Each maximal intersection for F̂nk satisfies one of the following two conditions:

(i) A = (T(i), T(j)] × {k}, with i < j, ∆(i),K+1 = 1, ∆(j)k = 1, and ∆(l),K+1 = ∆(l)k = 0 for all l such that T(i) < T(l) < T(j);

(ii) A = (T(n),∞) × {1, . . . , K}, with ∆(n),K+1 = 1.

Moreover, if a set A satisfies one of these conditions, then A is a maximal intersection for F̂nk.
2.2.2 Censored data perspective of the naive estimator
For the naive estimator F̃nk, we consider the reduced current status data Zk = (T,∆k).
Define the variables
Wk = X · 1{Y = k} + ∞ · 1{Y ≠ k}, k = 1, . . . , K,
WK+1 = X,
taking values in W = R+ ∪ {∞}. Note that F0k(t) = P (Wk ≤ t) for k = 1, . . . , K,
and F0,K+1(t) = P (WK+1 > t). Hence we can take W1, . . . ,WK+1 to be our variables
of interest.
The observation time T defines a partition of p = 2 random sets in W:
D1 = (0, T ] and D2 = (T,∞]. (2.20)
Since there is a one-to-one correspondence between D = (D1, D2) and T , the assumption that T is independent of (X, Y ) is equivalent to the assumption that D is independent of W1, . . . ,WK+1.
For k = 1, . . . , K, note that ∆k = 1{X ≤ T, Y = k} = 1{Wk ≤ T} = 1{Wk ∈ D1}. Hence, the vector (∆k, 1 − ∆k) indicates whether D1 or D2 contains the unobservable Wk, and the reduced current status data (T,∆k) give exactly the same information as (D,∆k). The corresponding observed sets are R(k) = ∆kD1 ∪ (1 − ∆k)D2,
so that
R(k) = (0, T ] if ∆k = 1,
R(k) = (T,∞) if ∆k = 0. (2.21)
We can write the log likelihood (2.8) as lnk(Fk) = (1/n) ∑_{i=1}^{n} log PFk(R(k)i). The naive estimator maximizes this expression over all sub-distribution functions Fk on R+.
For k = K + 1, note that ∆K+1 = 1{X > T} = 1{WK+1 ∈ D2}. Hence, the vector (1 − ∆K+1, ∆K+1) indicates whether D1 or D2 contains the unobservable X, and the reduced current status data (T,∆K+1) give exactly the same information as (D,∆K+1). The corresponding observed sets are R(K+1) = (1 − ∆K+1)D1 ∪ ∆K+1D2,
so that
R(K+1) = (0, T ] if ∆K+1 = 0,
R(K+1) = (T,∞) if ∆K+1 = 1. (2.22)
We can write the log likelihood (2.8) as ln,K+1(S) = (1/n) ∑_{i=1}^{n} log PS(R(K+1)i). The naive estimator F̃n,K+1 maximizes this expression over all sub-survival functions S on R+.
Definition 2.4 For k = 1, . . . , K + 1, we call A a maximal intersection for F̃nk if it is a maximal intersection of the observed sets R(k)1, . . . , R(k)n as defined in (2.21) and (2.22).
The maximal intersections for the naive estimator are described in Lemmas 2.5 and 2.6. Both lemmas follow directly from the idea of the height map.

Lemma 2.5 Let k ∈ {1, . . . , K}. Each maximal intersection A for F̃nk satisfies one of the following two conditions:
(i) A = (T(i), T(j)], with (T(i), T(j)) ∩ {T1, . . . , Tn} = ∅, ∆(i)k = 0, and ∆(j)k = 1.
(ii) A = (T(n),∞), with ∆(n)k = 0.
Moreover, if an interval A satisfies one of these conditions, then it is a maximal intersection for F̃nk.
Lemma 2.6 Each maximal intersection for F̃n,K+1 satisfies one of the following two conditions:

(i) A = (T(i), T(j)], with (T(i), T(j)) ∩ {T1, . . . , Tn} = ∅, ∆(i),K+1 = 1, and ∆(j),K+1 = 0.

(ii) A = (T(n),∞), with ∆(n),K+1 = 1.

Moreover, if an interval A satisfies one of these conditions, then A is a maximal intersection for F̃n,K+1.
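Lemmas 2.5 and 2.6 turn the reduction step for the naive estimator into a single scan over the ordered observation times. The sketch below is a hypothetical illustration for distinct observation times: to get the maximal intersections for F̃nk pass the indicators ∆ik, and for F̃n,K+1 pass 1 − ∆i,K+1 (so that the two conditions line up with Lemma 2.6).

```python
import math

def naive_maximal_intersections(times, deltas):
    """Maximal intersections for a naive estimator (Lemmas 2.5/2.6),
    for distinct observation times. deltas[i] = 1 means the i-th
    observed set is (0, times[i]], and deltas[i] = 0 means it is
    (times[i], infinity). Returns half-open intervals (a, b], with
    b = math.inf for the unbounded maximal intersection."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    t = [times[i] for i in order]
    d = [deltas[i] for i in order]
    mis = []
    for i in range(len(t) - 1):
        if d[i] == 0 and d[i + 1] == 1:   # condition (i): (T(i), T(i+1)]
            mis.append((t[i], t[i + 1]))
    if d[-1] == 0:                        # condition (ii): (T(n), inf)
        mis.append((t[-1], math.inf))
    return mis
```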
2.2.3 Comparing the maximal intersections for both estimators
Definition 2.7 For any set A ⊆ R2, we define the x-interval and y-interval of A to
be the projections of A on the x-axis and y-axis. Furthermore, we define the lower
and upper endpoint of A to be the lower and upper endpoint of its x-interval.
We now compare the maximal intersections for F̂nk and F̃nk, for k ∈ {1, . . . , K}.
Lemma 2.8 For each k = 1, . . . , K, the number of maximal intersections for the MLE F̂nk is at least as large as the number of maximal intersections for the naive estimator F̃nk. Moreover, each upper endpoint of a maximal intersection for F̂nk is an upper endpoint of a maximal intersection for F̃nk.
Proof: Let A be a maximal intersection for F̂nk. We show that there is a maximal intersection for F̃nk with the same upper endpoint. Note that A must satisfy one of the two conditions of Lemma 2.3. First, suppose that A = (T(n),∞) × {1, . . . , K} with ∆(n),K+1 = 1. Then ∆(n)k = 0, and (T(n),∞) is a maximal intersection for F̃nk by Lemma 2.5. Next, suppose that A = (T(i), T(j)] × {k}, with ∆(i),K+1 = 1, ∆(j)k = 1 and ∆(l)k = ∆(l),K+1 = 0 for all l such that T(i) < T(l) < T(j). Then ∆(j−1)k = 0, and hence (T(j−1), T(j)] is a maximal intersection for F̃nk by Lemma 2.5. □
Lemma 2.9 The number of maximal intersections for F̃n,K+1 is at most as large as the number of maximal intersections for F̂n,K+1. Moreover, the collection of lower endpoints of the maximal intersections for F̃n,K+1 is identical to the collection of lower endpoints of the maximal intersections for F̂n,K+1. As a result, the number of regions on the x-axis where F̃n,K+1 can put mass is identical to the number of regions on the x-axis where F̂n,K+1 can put mass. Finally, the union of the maximal intersections for F̃n,K+1 is contained in the union of the x-intervals of the maximal intersections for F̂n,K+1.
Proof: Let A be a maximal intersection for F̃n,K+1. We show that there is a maximal intersection for F̂n,K+1 with the same lower endpoint. Note that A must satisfy one of the two conditions of Lemma 2.6. First, suppose that A = (T(i), T(j)] with (T(i), T(j)) ∩ {T1, . . . , Tn} = ∅, ∆(i),K+1 = 1 and ∆(j),K+1 = 0. Since ∆(j),K+1 = 0, there must be a k ∈ {1, . . . , K} such that ∆(j)k = 1. But this implies that (T(i), T(j)] × {k} is a maximal intersection for F̂nk, by Lemma 2.3. Next, suppose that A = (T(n),∞) with ∆(n),K+1 = 1. Then (T(n),∞) × {1, . . . , K} is a maximal intersection for F̂n1, . . . , F̂nK by Lemma 2.3, and hence it is a maximal intersection for F̂n,K+1 by definition.

Next, let A be a maximal intersection for F̂n,K+1. We show that there is a maximal intersection for F̃n,K+1 with the same lower endpoint. By definition, it follows that there is a k ∈ {1, . . . , K} so that A is a maximal intersection for F̂nk. Hence, A must satisfy one of the two conditions of Lemma 2.3. First, suppose that A = (T(i), T(j)] × {k}, with ∆(i),K+1 = 1, ∆(j)k = 1 and ∆(l)k = ∆(l),K+1 = 0 for all l
Table 2.1: Example data with K = 2 competing risks, illustrating that the number of positive maximal intersections for F̃n,K+1 can be larger than the number of positive maximal intersections for F̂n,K+1.

  i   t(i)   δ(i)1   δ(i)2   δ(i)3
  1    1      1       0       0
  2    2      0       0       1
  3    3      0       0       1
  4    4      1       0       0
  5    5      0       0       1
  6    6      0       1       0
  7    7      0       1       0
  8    8      1       0       0
  9    9      0       1       0
 10   10      0       1       0
such that T(i) < T(l) < T(j). Let S = (T(i), T(j)) ∩ {T1, . . . , Tn}. If S = ∅, then (T(i), T(j)] is a maximal intersection for F̃n,K+1 by Lemma 2.6. Otherwise, (T(i), min S] is a maximal intersection for F̃n,K+1. Next, suppose that A = (T(n),∞) × {1, . . . , K} with ∆(n),K+1 = 1. Then (T(n),∞) is a maximal intersection for F̃n,K+1 by Lemma 2.6.
The last statement follows by combining the fact that the collections of lower endpoints of the maximal intersections for F̃n,K+1 and F̂n,K+1 are identical, with the fact that maximal intersections for F̃n,K+1 cannot contain observation times in their interior (Lemma 2.6). □
Remark 2.10 The last statement of Lemma 2.9 has implications for representational non-uniqueness of the estimators. It shows that it is possible that the area in which the MLE F̂n,K+1 suffers from representational non-uniqueness is larger than the area in which F̃n,K+1 suffers from representational non-uniqueness. This was also noted by Hudgens, Satten and Longini (2001), and partly motivated their pseudo-likelihood estimator. However, note that it can also happen that F̃n,K+1 is non-unique over a larger area, if many of the maximal intersections for F̂n,K+1 get zero mass. For an example, see Tables 2.1 and 2.2.
Motivated by Remark 2.10, we now consider maximal intersections that get positive
mass. We introduce the following terminology:
Table 2.2: The estimators for the data in Table 2.1, in terms of their maximal intersections (MIs) and the corresponding probability masses.

  F̂n,K+1:                       F̃n,K+1:
  MIs             mass           MIs       mass
  (0, 1] × {1}    3/10           (0, 1]    1/3
  (3, 4] × {1}    0              (3, 4]    1/6
  (5, 8] × {1}    0              (5, 6]    1/2
  (5, 6] × {2}    7/10
Definition 2.11 Let k ∈ {1, . . . , K + 1}. We say that A is a positive maximal intersection for F̂nk if A is a maximal intersection for F̂nk and the MLE assigns positive mass to A. Similarly, we say that A is a positive maximal intersection for F̃nk if A is a maximal intersection for F̃nk and F̃nk assigns positive mass to A.
After reading Lemma 2.9, one may wonder whether the number of positive maximal intersections for F̃n,K+1 is at most as large as the number of positive maximal intersections for F̂n,K+1. This is indeed often the case in simulations, but not always. A counterexample can be found in Table 2.1. In this example, F̂n,K+1 has four maximal intersections, given in Table 2.2. The naive estimator F̃n,K+1 has three maximal intersections, with corresponding masses given in Table 2.2. Note that the maximal intersections satisfy the statement in Lemma 2.9. However, there are only two positive maximal intersections for F̂n,K+1, while there are three positive maximal intersections for F̃n,K+1.
2.3 Graph theory and uniqueness
Gentleman and Vandal (2001), Gentleman and Vandal (2002), Maathuis (2003), and
Vandal, Gentleman and Liu (2006) use a graph theoretic perspective to study properties of the maximum likelihood estimator for censored data. Before we apply these
methods to our problem, we give an introduction to graph theory. This introduction
is mostly based on Golumbic (1980), and also partly given in Maathuis (2003, Section
3.3).
2.3.1 Introduction to graph theory for censored data
Let G = (V,E) be an undirected graph, where V is a set of vertices, and E is a set
of edges. An edge is a collection of two vertices. Two vertices v and w are said to
be adjacent in G if there is an edge between v and w, i.e., vw ∈ E. We say that
two sets of vertices S1 and S2 are adjacent if there is at least one pair of vertices
(v, w) such that v ∈ S1, w ∈ S2 and vw ∈ E. A subgraph of G = (V,E) is defined
to be any graph G′ = (V ′, E ′) such that V ′ ⊆ V and E ′ ⊆ E. Given a subset
A ⊆ V of vertices, we define the subgraph induced by A to be GA = (A,EA), where
EA = {xy ∈ E : x ∈ A, y ∈ A}.
We call a subset M ⊆ V of vertices a clique if every pair of distinct vertices in M
is adjacent. We call M ⊆ V a maximal clique if there is no clique in G that properly
contains M as a subset.³ Every finite graph has a finite number of maximal cliques that we denote by C = {C1, . . . , Cm}.
Let R = {R1, . . . , Rn} be a family of sets. The intersection graph of R is obtained
by representing each set in R by a vertex, and connecting two vertices by an edge if
and only if their corresponding sets intersect. An intersection graph of a collection
of intervals on a linearly ordered set is called an interval graph. Alternatively, an
undirected graph G is called an interval graph if it can be thought of as an intersection
graph of a set of intervals on the real line. Every maximal clique Cj in an intersection
graph has a real representation Aj = ∩_{R ∈ Cj} R, given by the intersection of the sets that form the maximal clique.
A sequence of vertices (v0, v1, . . . , vl) is called a cycle of length l + 1 if vi−1vi ∈ E
for all i = 1, . . . , l and vlv0 ∈ E. A cycle (v0, . . . , vl) is called a simple cycle if vi ≠ vj
³Instead of the terms ‘clique’ and ‘maximal clique’, some authors use the terms ‘complete subgraph’ and ‘clique’.
for i ≠ j. A simple cycle (v0, v1, . . . , vl) is called chordless if for all i = 0, . . . , l,
vivj ∈ E only for j = (i± 1) mod (l+ 1). A graph is called triangulated if it does not
contain chordless cycles of length strictly greater than three. Hajos (1957) showed
that every interval graph is triangulated.
A clique graph of R is an intersection graph of the maximal cliques C. Thus,
in this graph each vertex represents a maximal clique, and two vertices Cj and Ck
are adjacent if and only if Cj ∩ Ck 6= ∅, i.e., if there is at least one set in R that is
an element of both Cj and Ck. We define the clique matrix to be a vertices versus
maximal cliques incidence matrix. For n observed sets with m maximal cliques, this
is an n × m matrix H with elements Hij = 1{Aj ⊆ Ri}.⁴
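In the interval case the clique matrix is easy to fill in: since each Aj is a maximal intersection, Aj ∩ Ri is either empty or all of Aj, so Hij = 1{Aj ⊆ Ri} reduces to endpoint comparisons. A hypothetical sketch for half-open intervals (a, b] (with b possibly infinite):

```python
def clique_matrix(observed, mis):
    """Clique matrix H with H[i][j] = 1{A_j in R_i}, for half-open
    intervals (a, b] represented as pairs. Rows index the observed
    sets R_i, columns the maximal intersections A_j."""
    def contains(R, A):
        # (A[0], A[1]] is a subset of (R[0], R[1]]
        # iff R[0] <= A[0] and A[1] <= R[1]
        return R[0] <= A[0] and A[1] <= R[1]
    return [[1 if contains(R, A) else 0 for A in mis] for R in observed]
```

For the observed sets (1, 5], (2, 7], (6, 9] with maximal intersections (2, 5] and (6, 7], this gives the 3 × 2 matrix with rows (1, 0), (1, 1), (0, 1).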
We now return to the maximum likelihood estimator for censored data. Let R = {R1, . . . , Rn} be the observed sets. Gentleman and Vandal (2001) showed that the maximal intersections A1, . . . , Am of R, defined in Section 2.2, are exactly the real
representations of the maximal cliques of the intersection graph of R. Hence, we
can study the intersection graph to deduce properties of the MLE. In particular,
Gentleman and Vandal (2002, Lemma 4) showed that α̂ is unique if the intersection
graph is triangulated. An alternative proof can be found in Maathuis (2003, Lemma
3.13). Finally, we can use the clique matrix H to rewrite the optimization problem
(2.16). Namely, Pα(Ri) = (Hα)i, so that (2.16) becomes

ln(α̂) = max_{α ∈ A} (1/n) ∑_{i=1}^{n} log ((Hα)i).
2.3.2 Graph theoretic aspects and uniqueness of the naive estimator
For k = 1, . . . , K + 1, let R(k) = {R(k)1, . . . , R(k)n} be the observed sets for the naive estimator F̃nk, as defined in (2.21) and (2.22). The following proposition uses the
structure of the intersection graph and the form of the maximal intersections to
⁴Note that our H is the transpose of the incidence matrix defined in Gentleman and Vandal (2002, page 559).
prove uniqueness of the naive estimators at the observation times. Alternatively,
uniqueness can be proved from strict concavity of the marginal log likelihoods lnk(Fk),
k = 1, . . . , K + 1, defined in (2.8).
Proposition 2.12 The naive estimators F̃nk(t), k = 1, . . . , K + 1, are unique at all observation times T1, . . . , Tn.
Proof: Let k ∈ {1, . . . , K + 1}. Note that the observed sets R(k) are intervals in R. Hence, the intersection graph of R(k) is an interval graph, and it follows from Hajos (1957) that the graph is triangulated. Thus, by Gentleman and Vandal (2002, Lemma 4), the naive estimator is mixture unique.

We obtain F̃nk(t) by summing all mass in the interval (0, t]. Thus, F̃nk(t) is unique if t is not in the interior of a maximal intersection. Lemma 2.5 implies that the observation times T1, . . . , Tn can never be contained in the interior of a maximal intersection, so that F̃nk is unique at all observation times. □
In fact, it follows from the proof of Proposition 2.12 that we can make a stronger statement: for each k = 1, . . . , K, F̃nk(t) is unique if and only if t is not in the interior of a positive maximal intersection for F̃nk (see Definitions 2.4 and 2.11 for this terminology).
2.3.3 Graph theoretic aspects of the MLE
In this section we study the intersection graph and clique graph for the MLE. We also
derive a bound on the number of maximal intersections and describe the structure of
the clique matrix.
Recall that the observed sets R = {R1, . . . , Rn} defined in (2.19) are sets in R+ × {1, . . . , K}. We define the following partition of R:
Definition 2.13 Let R = {R1, . . . , Rn} be the observed sets defined in (2.19). We define

Rk = {Ri ∈ R : ∆ik = 1}, k = 1, . . . , K + 1. (2.23)
Furthermore, let nk denote the number of observed sets in Rk, k = 1, . . . , K + 1.
Theorem 2.14 describes the structure of the intersection graph of R.
Theorem 2.14 The intersection graph G = (V,E) of R has the following properties:

(a) Each Rk, k = 1, . . . , K + 1, is a clique in G;

(b) For k ≠ k′ ∈ {1, . . . , K}, Rk and Rk′ are not adjacent in G;

(c) For i ≠ j ∈ {1, . . . , n} and k ∈ {1, . . . , K}, Ri ∈ Rk and Rj ∈ RK+1 are adjacent in G if and only if Ti > Tj;
(d) G is triangulated.
Proof: Properties (a)-(c) follow from the definition of the observed sets in (2.19).
To prove (a), let k ∈ {1, . . . , K} and let Ri and Rj be two different observed sets in Rk. Then Ri ∩ Rj = (0, Ti ∧ Tj ] × {k} ≠ ∅. Hence, the corresponding vertices are
adjacent in G. Similarly, for two different observed sets Ri and Rj in RK+1, we have
Ri ∩ Rj = (Ti ∨ Tj ,∞) × {1, . . . , K}. Hence, for each k = 1, . . . , K + 1, every pair
of distinct vertices in Rk is adjacent in G. By definition, this means that each Rk,
k = 1, . . . , K + 1 is a clique.
To prove (b), let k ≠ k′ ∈ {1, . . . , K} and let Ri ∈ Rk and Rj ∈ Rk′ . Then Ri ⊆ R × {k} and Rj ⊆ R × {k′}. Hence, Ri ∩ Rj = ∅, and Ri and Rj are not adjacent in G.
To prove (c), let k ∈ {1, . . . , K}, Ri ∈ Rk, and Rj ∈ RK+1. Then

Ri ∩ Rj = (Tj , Ti] × {k} if Ti > Tj,
Ri ∩ Rj = ∅ if Ti ≤ Tj.
Hence, Ri and Rj are adjacent in G if and only if Ti > Tj .
We now prove that G is triangulated. We define
Vk = Rk ∪RK+1, k = 1, . . . , K.
Let GVk = (Vk, EVk) be the subgraph of G that is induced by Vk. Note that GVk can be viewed as the intersection graph of the following intervals in R:

{(0, Ti] : Ri ∈ Rk} ∪ {(Ti,∞) : Ri ∈ RK+1}.
This implies that GVk is an interval graph, and hence that it is triangulated (Hajos (1957)). Now consider the original intersection graph G = (V,E) of R. Note that V = ∪_{k=1}^{K} Vk. Furthermore, since for all k ≠ k′ ∈ {1, . . . , K}, Rk is not adjacent to Rk′ and Vk ∩ Vk′ = RK+1, it follows that E = ∪_{k=1}^{K} EVk. Let c = (v0, . . . , vl) be a chordless cycle in G. We show by contradiction that c is completely contained in Vk for some k ∈ {1, . . . , K}. Thus, suppose that c contains vertices from both Rk and Rk′ for some k ≠ k′ ∈ {1, . . . , K}. Then, since Rk and Rk′ are not adjacent, c must contain at least two vertices vi and vj in RK+1, with j ≠ (i ± 1) mod (l + 1). However, since RK+1 is a clique, vi and vj are adjacent in G. This contradicts the assumption that c is chordless. It follows that c must be completely contained in Vk for some k ∈ {1, . . . , K}. Hence, c is a chordless cycle in GVk, and since each subgraph GVk is triangulated, the length of c is at most three. This proves that G is triangulated. □
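Theorem 2.14 reduces adjacency in the intersection graph G to a few comparisons on the raw data, which is convenient for building G without ever forming the sets (2.19). A hypothetical sketch, where an observation is a pair (t, k), with k = K + 1 marking a right censored observation:

```python
def adjacent(obs_i, obs_j, K):
    """Adjacency in the intersection graph of the observed sets (2.19),
    read off from Theorem 2.14: within R_k and within R_{K+1} all
    vertices are adjacent; R_k and R_{k'} (k != k' <= K) are never
    adjacent; R_i in R_k and R_j in R_{K+1} are adjacent iff T_i > T_j."""
    (ti, ki), (tj, kj) = obs_i, obs_j
    if ki == K + 1 and kj == K + 1:
        return True          # property (a): R_{K+1} is a clique
    if ki == K + 1:
        return tj > ti       # property (c), roles of i and j swapped
    if kj == K + 1:
        return ti > tj       # property (c)
    return ki == kj          # properties (a) and (b)
```

For the data in Table 2.3 (K = 2), for instance, the observed set with (t, k) = (7, 1) is adjacent to the right censored observation at t = 4, but the two failure causes are never adjacent to each other.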
Figure 2.2 shows the intersection graph for the MLE, for the example data with
Table 2.3: Example data with K = 2 competing risks. The data are used to illustrate the intersection graph of the MLE in Figure 2.2, and the corresponding clique matrix in Table 2.4.

  i   t(i)   δ(i)1   δ(i)2   δ(i)3
  1    1      1       0       0
  2    2      1       0       0
  3    4      0       0       1
  4    5      0       1       0
  5    7      1       0       0
  6    8      0       1       0
  7    9      0       0       1
  8   11      1       0       0
  9   12      0       0       1
 10   15      0       1       0
 11   16      1       0       0
 12   17      0       0       1
 13   18      0       1       0
 14   20      0       0       1
K = 2 competing risks in Table 2.3. We can think of RK+1 = R3 as the backbone of
the graph, to which R1 and R2 are connected as defined by property (c) of Theorem
2.14. Furthermore, note that the sets R1 and R2 are not adjacent to each other, and
that all sets R1,R2,R3 are cliques. Finally, the graph is triangulated.
We now consider the maximal cliques C = {C1, . . . , Cm} of the intersection graph G of R. Since R1, . . . ,RK are pairwise not adjacent in G (Theorem 2.14 (b)), it follows that a maximal clique can only contain sets from one of the collections R1, . . . ,RK . Hence, each maximal clique is contained in Rk ∪ RK+1 for some k ∈ {1, . . . , K}. We define
the x-interval of a maximal clique C to be the x-interval of the real representation
of C. This allows us to define the following partition of the collection of maximal
cliques:
Definition 2.15 Let C be the collection of maximal cliques of the intersection graph
G of the observed sets R defined in (2.19). We define
Ck = {C ∈ C : C ∩ Rk ≠ ∅}, k = 1, . . . , K, (2.24)
CK+1 = {C ∈ C : C ⊆ RK+1}. (2.25)
Figure 2.2: Intersection graph for the MLE for the data in Table 2.3. Vertices inside a dashed line form a clique, and edges between such vertices are omitted for clarity of the picture.
Furthermore, let mk denote the number of maximal cliques in Ck, k = 1, . . . , K + 1.
Since all maximal intersections are disjoint, we can order the maximal cliques in Ck such that the upper endpoints of their x-intervals are increasing. We denote the
ordered maximal cliques in Ck by Ck(1), . . . , Ck(mk), k = 1, . . . , K + 1.
We now prove some properties of the maximal cliques. These properties rely on the
assumed ordering of the observations, defined in Section 2.1.1. Here we assumed that
in case of ties left censored observations are ordered before right censored observations.
Let R(1), . . . , R(n) be the observed sets corresponding to Z(1), . . . , Z(n).
Lemma 2.16
CK+1 = {RK+1} if ∆(n),K+1 = 1,
CK+1 = ∅ otherwise.
Proof: By the definition of CK+1 in (2.25) and the fact that RK+1 is a clique (Theorem 2.14 (a)), it follows that the only possible element of CK+1 is RK+1.
First suppose that ∆(n),K+1 = 1, or equivalently, R(n) ∈ RK+1. Then there cannot
exist a Tj > T(n) with ∆j+ = 1. By Theorem 2.14 (c), this implies that R(n) ∈ RK+1
is not adjacent to any Rj /∈ RK+1. Hence, RK+1 is a maximal clique.
Now suppose that ∆(n),K+1 = 0. Then there must be a k ∈ {1, . . . , K} such that ∆(n)k = 1, and hence R(n) ∈ Rk. By Theorem 2.14 (c), R(n) is adjacent to all vertices in RK+1, since T(n) > T for all observation times T corresponding to R ∈ RK+1. Hence, RK+1 is not a maximal clique. □
Lemma 2.17 Let k ∈ {1, . . . , K} and j ∈ {1, . . . , mk − 1}. Then
(a) Ck(j+1) ∩Rk is a strict subset of Ck(j) ∩Rk;
(b) Ck(j) ∩RK+1 is a strict subset of Ck(j+1) ∩RK+1.
Proof: All sets of Rk that are in Ck(j+1) must also be in Ck(j). This implies that
Ck(j+1) ∩Rk ⊆ Ck(j) ∩Rk. (2.26)
Similarly, all sets of RK+1 that are in Ck(j) must also be in Ck(j+1). This implies that
Ck(j) ∩RK+1 ⊆ Ck(j+1) ∩RK+1. (2.27)
Now suppose that (2.26) or (2.27) holds with equality. Then, since

Ck(j) = (Ck(j) ∩ Rk) ∪ (Ck(j) ∩ RK+1),

it follows that Ck(j) ⊆ Ck(j+1) or Ck(j+1) ⊆ Ck(j). This contradicts the assumption that Ck(j) and Ck(j+1) are two different maximal cliques. Hence, the relations (2.26) and (2.27) must hold strictly. □
The following theorem gives a bound on the number of maximal cliques of the in-
tersection graph of R. This bound is important for computational purposes, since the
dimensionality of the optimization problem (2.16) is equal to the number of maximal
cliques.
Theorem 2.18 Let m be the number of maximal intersections of R = {R1, . . . , Rn} defined in (2.19). Then

m ≤ {(K/(K + 1)) · (n + 1) + 1} ∧ n. (2.28)
Proof: Recall the notation nk and mk from Definitions 2.13 and 2.15, and note that n = ∑_{k=1}^{K+1} nk and m = ∑_{k=1}^{K+1} mk. Let k ∈ {1, . . . , K} and consider the ordered maximal cliques Ck(1), . . . , Ck(mk) of Definition 2.15. Lemma 2.17 implies that #(Ck(j) ∩ Rk) > #(Ck(j+1) ∩ Rk) for j = 1, . . . , mk − 1, where #C denotes the cardinality of the set C. In other words, each maximal clique Ck(j) contains at least one more set of Rk than Ck(j+1), for j = 1, . . . , mk − 1. Since by definition Ck(mk) contains at least one set of Rk, and since Rk contains nk sets, it then follows that mk ≤ nk. Lemma 2.17 also implies that #(Ck(j+1) ∩ RK+1) > #(Ck(j) ∩ RK+1) for j = 1, . . . , mk − 1. Together with the fact that Ck(1) may contain zero sets of RK+1, it follows by similar reasoning that mk ≤ nK+1 + 1. Hence, mk ≤ nk ∧ (nK+1 + 1),
k = 1, . . . , K, and
m = ∑_{k=1}^{K+1} mk ≤ ∑_{k=1}^{K} mk + 1 ≤ ∑_{k=1}^{K} {nk ∧ (nK+1 + 1)} + 1. (2.29)
The right side of (2.29) is largest if nk = nK+1 + 1 for all k = 1, . . . , K. In that case

n = ∑_{k=1}^{K+1} nk = K(nK+1 + 1) + nK+1 = (K + 1)nK+1 + K,

so that nK+1 = (n − K)/(K + 1) and nk = nK+1 + 1 = (n + 1)/(K + 1). Plugging this into the right side of (2.29), we obtain an upper bound of K(n + 1)/(K + 1) + 1.
This yields the first part of the inequality.
To show that m ≤ n, note that Lemma 2.16 implies that mK+1 ≤ nK+1. Together with mk ≤ nk for k = 1, . . . , K, this implies that m = ∑_{k=1}^{K+1} mk ≤ ∑_{k=1}^{K+1} nk = n. □
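The bound (2.28) is cheap to evaluate, which is useful for sizing the optimization problem in advance. A small hypothetical helper:

```python
def mi_bound(n, K):
    """Upper bound (2.28) on the number m of maximal intersections:
    m <= (K / (K + 1)) * (n + 1) + 1, and always m <= n."""
    return min(K * (n + 1) / (K + 1) + 1, n)
```

For instance, with n = 14 observations and K = 2 competing risks (as in Table 2.3), the dimension of the optimization problem (2.16) is at most 11.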
We now consider the n × m clique matrix H, with Hij = 1{Aj ⊆ Ri}. Each column of H corresponds to a maximal clique Cj, or equivalently, to a maximal intersection Aj. Each row of H corresponds to an observed set Ri. Thus, the jth column of the matrix indicates which observed sets form maximal clique Cj. Analogously, the ith row of the matrix indicates which maximal intersections are contained in Ri. We order the columns of H so that they correspond to the ordered maximal cliques C1(1), . . . , C1(m1), . . . , CK(1), . . . , CK(mK), CK+1(1). We order the rows of H so that they correspond to the observed sets R1, . . . ,RK+1, where the sets in Rk, k = 1, . . . , K + 1, are ordered such that their observation times are nondecreasing. If for some k ∈ {1, . . . , K + 1} the sets Rk or Ck are empty, then the corresponding rows and columns are omitted from H. Finally, we say that a column is a column of Ck if it corresponds to a maximal clique in Ck. We say that a row is a row of Rk if it corresponds to an observed set in Rk.
Lemma 2.19 The n×m clique matrix H of R has the following properties:
(a) m ≤ {(K/(K + 1)) · (n + 1) + 1} ∧ n.
(b) For each k = 1, . . . , K + 1, the columns of Ck can only have nonzero elements
in rows of Rk or RK+1.
(c) For each k = 1, . . . , K, the block matrix formed by the columns of Ck and the
rows of Rk, is an nk×mk matrix with mk ≤ nk∧(nK+1+1). The first column of
this block is completely filled with ones. The other columns consist of a sequence
of zeroes, followed by a sequence of ones, where both sequences must be of positive
length. The length of the sequence of ones in the ith column is strictly smaller
than the length of the sequence of ones in the (i−1)th column, for i = 2, . . . , mk.
(d) For each k = 1, . . . , K + 1, the block matrix formed by the columns of C_k and the
rows of R_{K+1} is an n_{K+1} × m_k matrix with m_k ≤ n_k ∧ (n_{K+1} + 1). All columns of
this block matrix consist of a sequence of ones, followed by a sequence of zeroes.
Either sequence may be of zero length. The length of the sequence of ones in
the ith column is strictly larger than the length of the sequence of ones in the
(i − 1)th column, for i = 2, . . . , m_k.
(e) H can be stored in O(Kn) space.

(f) Hα can be computed in O(Kn) time.

(g) Null(H) = {0}.
Proof: Property (a) follows immediately from Theorem 2.18. Property (b) follows
from Definition 2.15. Properties (c) and (d) follow from Lemma 2.17 and the proof
of Theorem 2.18. To prove (e), let k ∈ {1, . . . , K}. By property (c), the first row
of R_k contains exactly one nonzero element. Furthermore, property (c) implies that
two successive rows of R_k are either identical, or differ in one element. Hence, we
can store the difference between these rows using at most one element, and we need
at most n_k elements to store the information in all rows of R_k. Next, we consider
R_{K+1}. The last row of R_{K+1} contains at most K nonzero elements. Furthermore,
property (d) implies that two successive rows of R_{K+1} differ in at most K elements,
and we need at most K·n_{K+1} elements to store the information in the rows of R_{K+1}.
Hence, we can store the clique matrix H using ∑_{k=1}^{K} n_k + K·n_{K+1} = O(Kn) elements.
Property (f) can be proved by similar reasoning.
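The difference-encoding argument in this proof can be sketched as follows (an illustrative sketch, not the thesis' implementation; `encode_rows` and `decode_rows` are hypothetical helper names):

```python
def encode_rows(rows):
    """Store each row as the list of (column, value) pairs in which it differs
    from the previous row (the first row is diffed against the zero row)."""
    prev, enc = [0] * len(rows[0]), []
    for row in rows:
        enc.append([(j, v) for j, (u, v) in enumerate(zip(prev, row)) if u != v])
        prev = row
    return enc

def decode_rows(enc, ncols):
    """Invert encode_rows by replaying the stored differences."""
    prev, rows = [0] * ncols, []
    for diffs in enc:
        row = prev[:]
        for j, v in diffs:
            row[j] = v
        rows.append(row)
        prev = row
    return rows

# A staircase block as in property (c): first column all ones, the other
# columns zeroes followed by ones, with strictly decreasing runs of ones.
B = [[1, 0, 0], [1, 1, 0], [1, 1, 1], [1, 1, 1]]
enc = encode_rows(B)
assert decode_rows(enc, 3) == B          # the encoding is lossless
assert all(len(d) <= 1 for d in enc)     # at most one stored entry per row
```

For such a block the encoding costs at most one element per row, which is exactly the O(n_k) bound used in the proof; rows of R_{K+1} differ in at most K entries, giving the O(K·n_{K+1}) term.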
Finally, note that all unit vectors e_j ∈ R^m, j = 1, . . . , m, can be generated by
taking differences of rows of H. This implies that the row space of H is of full
rank, and this proves (g). □

Table 2.4: Clique matrix for the data in Table 2.3. The rows are divided into three
groups corresponding to R_1, R_2 and R_3, and the columns are divided into three groups
corresponding to C_1, C_2 and C_3. In each group R_k, k = 1, . . . , 3, the observations
are ordered such that the observation times are nondecreasing. In each group C_k,
k = 1, . . . , 3, the maximal cliques are ordered according to their x-intervals, as defined
in Definition 2.15.

H = [0–1 matrix; the individual entries were not preserved in this copy]
The clique matrix for the example data in Table 2.3 is given in Table 2.4. Note that
the special block structure of the matrix is clearly visible.
2.3.4 Uniqueness of the MLE
Recall from Theorem 2.14 (d) that the intersection graph G of R is triangulated.
Hence, the results of Gentleman and Vandal (2002, Lemma 4) or Maathuis (2003,
Lemma 3.13) immediately imply that α̂ is unique. We also give an alternative proof
of this result, using the clique matrix H.
Theorem 2.20 The MLE α̂ is unique.

Proof: Let q̂_i = P_α̂(R_i) and q_i = P_α(R_i), i = 1, . . . , n. The vector q̂ is uniquely
determined because the log likelihood is strictly concave in q. Hence, the set of
maximum likelihood estimators is

{α ∈ A : Hα = q̂}.

By Lemma 2.19 (g), Null(H) = {0}, and hence the system Hα = q̂ has a unique solution
α̂. □
Remark 2.21 We now discuss generalizations of Theorem 2.20 to the mixed case
interval censored competing risks model, studied by Hudgens, Satten and Longini
(2001). We distinguish the following two cases: (i) the failure cause Y is observed if
and only if an observation is not right censored; (ii) the failure cause Y is observed
only if an observation is not right censored, but not for all such observations.
In case (i), Theorem 2.20 can be generalized, and implies that the MLE is always
mixture unique. The main difference between the intersection graphs for interval
censored data and current status data lies in the sets R1, . . . ,RK . For current status
data these sets are cliques, while this is typically not the case for interval censored
data. However, for both types of data, the set R_{K+1} is a clique. Furthermore, for
both types of data, the subgraphs G_{V_k} induced by V_k = R_k ∪ R_{K+1} are interval graphs
and therefore triangulated. Hence, by similar reasoning as in the proof of Theorem
2.14, it follows that the intersection graph for interval censored data with competing
risks is triangulated.
In case (ii), the intersection graph is generally not triangulated, and the MLE
may be mixture non-unique. Hence, in this case our Theorem 2.20 cannot be gener-
alized, and one should use other methods to assess mixture uniqueness. For example,
Hudgens, Satten and Longini (2001, page 76) suggest checking the Kuhn-Tucker
conditions given in Gentleman and Geyer (1994).
Finally, note that Hudgens, Satten and Longini (2001) use a slightly different
parametrization of the MLE, using K parameters to represent the mass in the maximal
intersection (T_{(n)}, ∞) × {1, . . . , K} which arises when ∆_{(n),K+1} = 1. Only the sum of
these K parameters can be uniquely defined.
We now translate the uniqueness of α into a statement about uniqueness of Fnk(t).
Recall from Proposition 2.12 that the naive estimators Fnk, k = 1, . . . , K + 1, are
unique at all observation times. For the MLE we do not obtain uniqueness at all
observation times. Rather, we get uniqueness at the following sets:
Definition 2.22 For k = 1, . . . , K + 1 we define

T_k = {T_i, i = 1, . . . , n : ∆_{ik} + ∆_{i,K+1} > 0} ∪ {T_{(n)}}. (2.30)
Proposition 2.23 Fnk(t), k = 1, . . . , K + 1 is unique for all t ∈ Tk.
Proof: Let k ∈ {1, . . . , K} and t ∈ R+. Recall that we obtain F_{nk}(t) by summing
all probability mass in (0, t] × {k}. Hence, F_{nk}(t) is unique if t is not contained in
the interior of a maximal intersection for F_{nk}. Observation times T_i ∈ T_k are never
contained in the interior of a maximal intersection for F_{nk}, by Lemma 2.3. □
Similar to the comment after Proposition 2.12, we actually obtain a stronger statement
in the proof of Proposition 2.23: For each k = 1, . . . , K + 1, Fnk(t) is unique if and
only if t is not in the interior of the x-interval of a positive maximal intersection for
Fnk (see Definitions 2.2 and 2.11 for this terminology).
2.4 Characterizations
We now characterize the estimators in terms of necessary and sufficient conditions.
In Section 2.4.1 we consider the maximum likelihood estimator for the general cen-
sored data setting discussed in Section 2.2. Recall that for current status data with
competing risks, both the MLE and the naive estimator can be viewed as maximum
likelihood estimators for censored data. Hence, the characterizations in this section
can be used for both estimators.
In Section 2.4.2 we give Fenchel and convex minorant characterizations for the
naive estimator. In Section 2.4.3 we give analogous characterizations for the MLE.
The main difference between the characterizations for the naive estimator and the
MLE is that the characterizations for the MLE are self-induced, similar to the char-
acterization for univariate interval censored data given in Groeneboom and Wellner
(1992, Proposition 1.4, page 29).
The characterizations in Section 2.4.1 are used to check convergence of the support
reduction algorithm for the MLE, described in Section 3.1. The characterizations in
Section 2.4.3 are used to check convergence of the iterative convex minorant algorithms for the
MLE, discussed in Section 3.2. Furthermore, we use the characterizations in Section
2.4.3 to derive asymptotic properties of the MLE.
2.4.1 Characterizations of the maximum likelihood estimator for censored data
Recall the general censored data setting discussed in Section 2.2. Given n i.i.d. ob-
served sets R_1, . . . , R_n, the MLE α̂ is defined by

l_n(α̂) = max_{α ∈ A} l_n(α), (2.31)

where α = (α_1, . . . , α_m) denotes the mass in the maximal intersections A_1, . . . , A_m,

l_n(α) = (1/n) ∑_{i=1}^{n} log P_α(R_i) = (1/n) ∑_{i=1}^{n} log(∑_{j=1}^{m} α_j 1{A_j ⊆ R_i}),

A = {α ∈ R^m : α_j ≥ 0, j = 1, . . . , m, 1^T α = 1},

and 1 is the all-one vector in R^m.
In order to give characterizations of α̂, we first translate optimization problem
(2.31) into an optimization problem over a cone. To this end, we adjust the objective
function, analogously to Silverman (1982, Theorem 3.1), Jongbloed (1995, Corollary
2.2) and Maathuis (2003, Lemma 3.7).

Lemma 2.24 The vector α̂ ∈ A maximizes l_n(α) over A if and only if it maximizes
l̃_n(α) over Ã, where

Ã = {α ∈ R^m : α_j ≥ 0, j = 1, . . . , m}, (2.32)

l̃_n(α) = l_n(α) − 1^T α. (2.33)

Proof: Suppose α̂ maximizes l_n(α) over A. We will show that l̃_n(α̂) ≥ l̃_n(α) for all
α ∈ Ã. Note that this inequality holds trivially when α = 0. Hence, let α ∈ Ã
with 1^T α = c_α > 0. Then α/c_α ∈ A and therefore l_n(α̂) ≥ l_n(α/c_α). Together with
1^T α̂ = 1, this yields

l̃_n(α̂) = l_n(α̂) − 1 ≥ l_n(α/c_α) − 1 = (1/n) ∑_{i=1}^{n} log((1/c_α) ∑_{j=1}^{m} α_j 1{A_j ⊆ R_i}) − 1
= (1/n) ∑_{i=1}^{n} log(∑_{j=1}^{m} α_j 1{A_j ⊆ R_i}) − log(c_α) − 1
= l̃_n(α) + c_α − log(c_α) − 1 ≥ l̃_n(α),

since x − log(x) − 1 ≥ 0 for x > 0. Hence, α̂ maximizes l̃_n(α) over Ã.

Now suppose α̂ maximizes l̃_n(α) over Ã, and suppose 1^T α̂ = c > 0. Then α̂/c ∈ A
and l̃_n(α̂) ≥ l̃_n(α̂/c). By the same reasoning as above this gives

l̃_n(α̂) ≥ l̃_n(α̂/c) = l_n(α̂/c) − 1 = l̃_n(α̂) + c − log(c) − 1.

Since x − log(x) − 1 ≤ 0 if and only if x = 1, this yields c = 1. Hence, α̂ ∈ A, and α̂
maximizes l̃_n(α), and thus l_n(α), over A ⊆ Ã. □
Using the Fenchel conditions given in Robertson, Wright and Dykstra (1988, Sec-
tion 6.2), it follows that α̂ is the MLE if and only if

⟨α, ∇l̃_n(α̂)⟩ ≤ 0 for all α ∈ Ã and ⟨α̂, ∇l̃_n(α̂)⟩ = 0, (2.34)

where ∇l̃_n(α) is the vector of partial derivatives:

∇l̃_n(α) = (∂l̃_n(α)/∂α_1, . . . , ∂l̃_n(α)/∂α_m).

Since Ã consists of vectors with nonnegative elements, (2.34) is equivalent to

∂l̃_n(α̂)/∂α_j ≤ 0 for all j = 1, . . . , m, with equality if α̂_j > 0. (2.35)

In Proposition 2.25, we use (2.34) and (2.35) to derive characterizations of the max-
imum likelihood estimator for censored data.
Proposition 2.25 (Compare Lemma 3.8 of Maathuis (2003)) The vector α̂ ∈ A
satisfies (2.31) if and only if

(1/n) ∑_{i=1}^{n} P_α(R_i)/P_α̂(R_i) ≤ 1^T α for all α ∈ Ã, and 1^T α̂ = 1; (2.36)

or, equivalently, if and only if

(1/n) ∑_{i=1}^{n} 1{A_j ⊆ R_i}/P_α̂(R_i) ≤ 1 for all j = 1, . . . , m, with equality if α̂_j > 0. (2.37)

Proof: Note that

∂l̃_n(α)/∂α_j = (1/n) ∑_{i=1}^{n} 1{A_j ⊆ R_i}/P_α(R_i) − 1.
Condition (2.37) now follows directly by plugging this expression into (2.35). To prove
(2.36), let α ∈ Ã. We have

⟨α, ∇l̃_n(α̂)⟩ = ∑_{j=1}^{m} α_j {(1/n) ∑_{i=1}^{n} 1{A_j ⊆ R_i}/P_α̂(R_i) − 1}
= (1/n) ∑_{i=1}^{n} {∑_{j=1}^{m} α_j 1{A_j ⊆ R_i}}/P_α̂(R_i) − 1^T α = (1/n) ∑_{i=1}^{n} P_α(R_i)/P_α̂(R_i) − 1^T α.

Similarly, we find

⟨α̂, ∇l̃_n(α̂)⟩ = (1/n) ∑_{i=1}^{n} P_α̂(R_i)/P_α̂(R_i) − 1^T α̂ = 1 − 1^T α̂.

Condition (2.36) then follows by combining the last two displays with (2.34). □
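Condition (2.37) is straightforward to verify numerically for a candidate α̂. A minimal sketch, assuming the clique matrix H is available as a 0–1 array (toy data; `satisfies_2_37` is a hypothetical helper name):

```python
from fractions import Fraction as Fr

def satisfies_2_37(alpha, H):
    """Check (2.37): (1/n) * sum_i 1{A_j in R_i} / P_alpha(R_i) <= 1 for all j,
    with equality whenever alpha_j > 0.  Here H[i][j] = 1{A_j in R_i}."""
    n = len(H)
    P = [sum(a * h for a, h in zip(alpha, row)) for row in H]   # P_alpha(R_i)
    for j, aj in enumerate(alpha):
        s = sum(Fr(row[j]) / p for row, p in zip(H, P) if row[j]) / n
        if s > 1 or (aj > 0 and s != 1):
            return False
    return True

# Toy example: R_1 contains only A_1, R_2 only A_2, R_3 both; by symmetry
# the MLE puts mass 1/2 on each maximal intersection.
H = [[1, 0], [0, 1], [1, 1]]
assert satisfies_2_37([Fr(1, 2), Fr(1, 2)], H)
assert not satisfies_2_37([Fr(2, 3), Fr(1, 3)], H)
```

Exact rational arithmetic avoids false failures of the equality part of (2.37) due to floating-point rounding; in practice a small tolerance would be used instead.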
2.4.2 Fenchel and convex minorant characterizations for the naive estimator
We now give Fenchel and convex minorant characterizations for the naive estimator.
Recall that Gn is the empirical distribution of the observation times T1, . . . , Tn. Fur-
thermore, recall that the naive estimators Fnk, k = 1, . . . , K + 1 can be viewed as
maximum likelihood estimators for the reduced current status data Zk = (T,∆k).
Hence, their characterizations follow directly from the characterizations for current
status data given in Groeneboom and Wellner (1992, Proposition 1.1, page 39 and
Proposition 1.2, page 41). We restate these characterizations in a slightly different
way, so that they can be easily compared to our characterizations of the MLE in
Section 2.4.3. We first introduce some definitions.
Definition 2.26 Given a set of points S ⊆ R², the convex minorant of S is the largest
convex function lying below S. The concave majorant of S is the smallest concave function
lying above S.⁵

⁵Some authors use the terms greatest convex minorant and smallest concave majorant instead.
Definition 2.27 Let B be a collection of points in R. We say that a nondecreasing
right-continuous function F : R+ → [0, 1] with F (0) = 0 has a jump at t w.r.t. B
if either t = minB and F (t) > 0, or minB < t ∈ B and F (t) > F (t′) for all
preceding t′ ∈ B. Similarly, we say that a nonincreasing right-continuous function
S : R+ → [0, 1] with S(0) = 1 has a jump at t w.r.t. B if either t = minB and
S(t) < 1, or minB < t ∈ B and S(t) < S(t′) for all preceding t′ ∈ B.
Proposition 2.28 Let k ∈ {1, . . . , K}. Then F_{nk} is a naive estimator if and only if

∫_{(0,t)} F_{nk}(u) dG_n(u) ≤ ∫_{u∈(0,t)} δ_k dP_n(u, δ), for all t ∈ R+,

and equality holds if (i) F_{nk} has a jump at t w.r.t. {T_1, . . . , T_n}, and (ii) t = ∞.

Proposition 2.29 F_{n,K+1} is a naive estimator if and only if

∫_{(0,t)} F_{n,K+1}(u) dG_n(u) ≥ ∫_{u∈(0,t)} δ_{K+1} dP_n(u, δ), for all t ∈ R+,

and equality holds if (i) F_{n,K+1} has a jump at t w.r.t. {T_1, . . . , T_n}, and (ii) t = ∞.
These characterizations can also be written as convex minorant characterizations:

Proposition 2.30 Let the cumulative sum diagrams φ_{nk} be defined by

φ_{nk}(t) = (G_n(t), ∫_{(0,t]} δ_k dP_n(u, δ)), t ∈ R+, k = 1, . . . , K.

Then F_{nk}(s), s ∈ {T_1, . . . , T_n}, k = 1, . . . , K, are the uniquely defined values of the
naive estimator if and only if for all such s, F_{nk}(s) is the left slope of the convex
minorant of φ_{nk} at G_n(s).

Proposition 2.31 Let the cumulative sum diagram φ_{n,K+1} be defined by

φ_{n,K+1}(t) = (G_n(t), ∫_{(0,t]} δ_{K+1} dP_n(u, δ)), t ∈ R+.

Then F_{n,K+1}(s), s ∈ {T_1, . . . , T_n}, are the uniquely defined values of the naive estima-
tor if and only if for all such s, F_{n,K+1}(s) is the left slope of the concave majorant of
φ_{n,K+1} at G_n(s).
These geometric characterizations provide one-step convex minorant algorithms for
the computation of the naive estimators, as discussed in Groeneboom and Wellner
(1992, pages 40-41). An alternative but equivalent computational method is the Pool
Adjacent Violators Algorithm (PAVA), see, e.g., Ayer, Brunk, Ewing, Reid and Sil-
verman (1955) and Barlow, Bartholomew, Bremner and Brunk (1972, pages 13-15).
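As an illustration, the following is a standard PAVA sketch (not the thesis' code): the naive estimator is the weighted isotonic fit of the observed δ_k-fractions at the distinct observation times. The data are hypothetical, chosen to mimic a small current status sample:

```python
from fractions import Fraction as Fr

def pava_nondecreasing(y, w):
    """Weighted least-squares nondecreasing fit via Pool Adjacent Violators."""
    blocks = []                           # each block: [mean, weight, count]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # pool backwards while the monotonicity constraint is violated
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2, c1 + c2])
    fit = []
    for m, _, c in blocks:
        fit.extend([m] * c)
    return fit

# Hypothetical current status data: distinct observation times 2, 3, 8, 10
# carrying 2, 1, 1, 1 observations; y_i is the fraction with delta_k = 1.
y = [Fr(1, 2), Fr(0), Fr(1), Fr(1)]
w = [2, 1, 1, 1]
assert pava_nondecreasing(y, w) == [Fr(1, 3), Fr(1, 3), 1, 1]
```

The first two values are pooled to 1/3 because the constraint 1/2 ≤ 0 is violated; the pooled means are exactly the left slopes of the convex minorant in Proposition 2.30.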
2.4.3 Fenchel and convex minorant characterizations for the MLE
Groeneboom and Wellner (1992, Proposition 1.4, page 49) show that a characteriza-
tion of the type of Proposition 2.30 also holds in a more complicated situation, where
the cumulative sum diagram is “self-induced”, in the sense that it is constructed from
the solution itself. We now derive such self-induced characterizations for the MLE
for current status data with competing risks. First, in Lemma 2.32 we translate op-
timization problem (2.7) into an optimization problem over a cone, by removing the
constraint F+(u) ≤ 1. In Propositions 2.34 and 2.36 we give Fenchel characterizations
of the MLE. Proposition 2.40 gives a self-induced convex minorant characterization.
We illustrate this characterization in Example 2.41.
Lemma 2.32 F̂_n ∈ F_K maximizes l_n(F) over F_K if and only if it maximizes l̃_n(F)
over F̃_K, where

F̃_K = the collection of all K-tuples of bounded nonnegative
nondecreasing right-continuous functions, (2.38)

l̃_n(F) = ∫ {∑_{k=1}^{K} δ_k log F_k(u) + (1 − δ_+) log(c_F − F_+(u))} dP_n − c_F, (2.39)

and c_F = F_+(∞).
Proof: The proof is analogous to the proof of Lemma 2.24. Suppose that F̂_n maxi-
mizes l_n(F) over F_K. We will show that l̃_n(F̂_n) ≥ l̃_n(F) for all F ∈ F̃_K. Note that this
inequality holds trivially when F_+(∞) = 0. Hence, let F ∈ F̃_K with F_+(∞) = c_F > 0.
Then F/c_F ∈ F_K, and therefore l_n(F̂_n) ≥ l_n(F/c_F). Together with F̂_{n+}(∞) = 1 this
yields

l̃_n(F̂_n) = l_n(F̂_n) − 1 ≥ l_n(F/c_F) − 1
= ∫ {∑_{k=1}^{K} δ_k log F_k(t) + (1 − δ_+) log(c_F − F_+(t))} dP_n − log c_F − 1
= l̃_n(F) + c_F − log c_F − 1 ≥ l̃_n(F),

since x − log x − 1 ≥ 0 for x > 0. Hence F̂_n maximizes l̃_n(F) over F̃_K.

Now suppose F̂_n maximizes l̃_n(F) over F̃_K, and suppose F̂_{n+}(∞) = c > 0. Then
l̃_n(F̂_n) ≥ l̃_n(F̂_n/c), and by the same reasoning as above this gives

l̃_n(F̂_n) ≥ l̃_n(F̂_n/c) = l_n(F̂_n/c) − 1 = l̃_n(F̂_n) + c − log c − 1.

Since x − log x − 1 ≤ 0 if and only if x = 1, this yields c = 1. Hence, F̂_n ∈ F_K, and
F̂_n maximizes l̃_n(F), and thus l_n(F), over F_K ⊆ F̃_K. □
We now give several Fenchel characterizations of the MLE. The proof of Proposition
2.34 is based on the finite dimensional nature of the optimization problem, which
allows us to use Fenchel conditions that are analogous to (2.34). We give two differ-
ent proofs for Proposition 2.36. The first proof follows from Proposition 2.34. The
second proof can be used in cases where the optimization problem is truly infinite di-
mensional. Proposition 2.40 is derived from Proposition 2.36 and gives a self-induced
convex minorant characterization.
Definition 2.33 For F ∈ F̃_K, we define

β_{nF} = 1 − ∫ (1 − δ_+)/(F_+(∞) − F_+(u)) dP_n(u, δ). (2.40)

Furthermore, recall the definition of T_k in Definition 2.22, and the definition of jump
points in Definition 2.27. We use the conventions that 0/0 = 0 and 0 · ∞ = 0.
Proposition 2.34 Let F̂_n ∈ F̃_K. Then F̂_{nk}(t), t ∈ T_k, k = 1, . . . , K, are the uniquely
defined values of the MLE (2.7) if and only if

(a) β_{nF̂_n} ≥ 0, and equality holds if F̂_{n+}(T_{(n)}) < F̂_{n+}(∞).

(b) For all k = 1, . . . , K and t ∈ R+,

∫_{u∈[t,∞)} {δ_k/F̂_{nk}(u) − (1 − δ_+)/(F̂_{n+}(∞) − F̂_{n+}(u))} dP_n(u, δ) ≤ β_{nF̂_n}, (2.41)

and equality holds if F̂_{nk} has a jump at t w.r.t. T_k.
Proof: This proof uses the fact that the optimization problem is finite dimensional.
We introduce the following notation. Let T′_{(i)}, i = 1, . . . , n_d, n_d ≤ n, be the or-
der statistics of the distinct observation times in {T_1, . . . , T_n}, and let T′_{(0)} = 0 and
T′_{(n_d+1)} = ∞. Furthermore, for i = 1, . . . , n_d and k = 1, . . . , K + 1, let

N_{ik} = ∑_{j=1}^{n} ∆_{jk} 1{T_j = T′_{(i)}} and N_i = ∑_{j=1}^{n} 1{T_j = T′_{(i)}}.

Thus, N_i represents the number of observations with observation time T′_{(i)}, and N_{ik} is
the number of those observations with ∆_k = 1. For any set of functions (F_1, . . . , F_K)
and i ∈ {0, . . . , n_d + 1}, we define

F_{ik} = F_k(T′_{(i)}), F̂_{ik} = F̂_{nk}(T′_{(i)}), F_{i+} = ∑_{k=1}^{K} F_{ik}, and F̂_{i+} = ∑_{k=1}^{K} F̂_{ik}.
Let F = (F_1, . . . , F_K), where F_k = (F_{1k}, . . . , F_{n_d+1,k}). Since the extended log like-
lihood function l̃_n(F) defined in Lemma 2.32 only depends on the values F_k(T′_{(i)}), i =
1, . . . , n_d + 1, k = 1, . . . , K, we can write it as:

l̃_n(F) = (1/n) ∑_{i=1}^{n_d} {∑_{k=1}^{K} N_{ik} log F_{ik} + N_{i,K+1} log(F_{n_d+1,+} − F_{i+})} − F_{n_d+1,+}.

We need to maximize this function over the space

F̃_K = {F ∈ R^{(n_d+1)K} : 0 ≤ F_{1k} ≤ · · · ≤ F_{n_d+1,k} for all k = 1, . . . , K}.

Thus, we maximize the concave function l̃_n(F) over a convex cone, and hence we have
Fenchel optimality conditions analogous to (2.34):

⟨F, ∇l̃_n(F̂)⟩ ≤ 0 for all F ∈ F̃_K and ⟨F̂, ∇l̃_n(F̂)⟩ = 0, (2.42)

where

∇l̃_n(F) = ((∂l̃_n(F)/∂F_{1k}, . . . , ∂l̃_n(F)/∂F_{n_d+1,k}), k = 1, . . . , K).
Rewriting the first expression of (2.42) yields

0 ≥ ⟨F, ∇l̃_n(F̂)⟩ = ∑_{k=1}^{K} ∑_{j=1}^{n_d+1} F_{jk} ∂l̃_n(F̂)/∂F_{jk}
= ∑_{k=1}^{K} ∑_{j=1}^{n_d+1} ∑_{i=1}^{j} (F_{ik} − F_{i−1,k}) ∂l̃_n(F̂)/∂F_{jk}
= ∑_{k=1}^{K} ∑_{i=1}^{n_d+1} (F_{ik} − F_{i−1,k}) ∑_{j=i}^{n_d+1} ∂l̃_n(F̂)/∂F_{jk}.

Now fix an l ∈ {1, . . . , n_d + 1} and k ∈ {1, . . . , K}. Since the above inequality must
hold for all F ∈ F̃_K, it must hold for F with

F_{lk} − F_{l−1,k} > 0, and F_{ik′} − F_{i−1,k′} = 0 otherwise.
This implies that ∑_{j=l}^{n_d+1} ∂l̃_n(F̂)/∂F_{jk} ≤ 0. Since this holds for all l ∈ {1, . . . , n_d + 1} and
k ∈ {1, . . . , K}, we obtain

∑_{j=i}^{n_d+1} ∂l̃_n(F̂)/∂F_{jk} ≤ 0, i = 1, . . . , n_d + 1, k = 1, . . . , K. (2.43)

Furthermore, by rewriting the expression for ⟨F̂, ∇l̃_n(F̂)⟩ in a similar way, it follows
that we must have equality in (2.43) if F̂_{ik} > F̂_{i−1,k}. Considering condition (2.43) for
i = n_d + 1 and plugging in

∂l̃_n(F̂)/∂F_{n_d+1,k} = (1/n) ∑_{j=1}^{n_d} N_{j,K+1}/(F̂_{n_d+1,+} − F̂_{j+}) − 1 = −β_{nF̂_n}, k = 1, . . . , K,

yields

β_{nF̂_n} = 1 − (1/n) ∑_{j=1}^{n_d} N_{j,K+1}/(F̂_{n_d+1,+} − F̂_{j+}) = 1 − ∫ (1 − δ_+)/(F̂_{n+}(∞) − F̂_{n+}(u)) dP_n(u, δ) ≥ 0. (2.44)
Furthermore, equality must hold in (2.44) if F̂_{n_d+1,k} − F̂_{n_d,k} > 0 for some
k ∈ {1, . . . , K}, or equivalently, if F̂_{n_d+1,+} − F̂_{n_d,+} > 0. This gives condition (a) of the
proposition. Similarly, condition (2.43) together with

∂l̃_n(F̂)/∂F_{jk} = (1/n) {N_{jk}/F̂_{jk} − N_{j,K+1}/(F̂_{n_d+1,+} − F̂_{j+})}, j = 1, . . . , n_d, k = 1, . . . , K,

yields that

(1/n) ∑_{j=i}^{n_d} {N_{jk}/F̂_{jk} − N_{j,K+1}/(F̂_{n_d+1,+} − F̂_{j+})} ≤ β_{nF̂_n}, (2.45)

where equality must hold if F̂_{ik} > F̂_{i−1,k}. We can rewrite (2.45) as
∫_{[t,∞)} {δ_k/F̂_{nk}(u) − (1 − δ_+)/(F̂_{n+}(∞) − F̂_{n+}(u))} dP_n(u, δ) ≤ β_{nF̂_n}, (2.46)

for t = T′_{(i)}. Now let k ∈ {1, . . . , K} and let s_1 < s_2 be two successive points in T_k.
Note that the left side of (2.46) has the same value for all t ∈ (s_1, s_2]. Furthermore,
note that the following two statements are equivalent:

(a) F̂_{ik} > F̂_{i−1,k} for some i such that T′_{(i)} ∈ (s_1, s_2],

(b) F̂_{nk}(s_2) > F̂_{nk}(s_1),

and that the left side of (2.46) vanishes for t = T′_{(n_d+1)} = ∞. This implies that
satisfying (2.46) for i = 1, . . . , n_d + 1 is equivalent to satisfying (2.46) for t ∈ T_k,
where equality must hold if F̂_{nk} has a jump at t w.r.t. T_k. This is condition (b) of the
proposition. □
We now give a second Fenchel characterization for the MLE:

Definition 2.35 For F ∈ F̃_K, we define

G_{nF}(t) = ∫_{(0,t]} (1 − δ_+)/(F_+(∞) − F_+(u)) dP_n(u, δ) + β_{nF} 1_{[T_{(n)},∞)}(t), t ∈ R+, (2.47)

where β_{nF} is defined in (2.40).
Proposition 2.36 Let F̂_n ∈ F̃_K. Then F̂_{nk}(t), t ∈ T_k, k = 1, . . . , K, are the uniquely
defined values of the MLE (2.7) if and only if

(a) β_{nF̂_n} ≥ 0, and equality holds if F̂_{n+}(T_{(n)}) < F̂_{n+}(∞).

(b) For all k = 1, . . . , K and t ∈ R+,

∫_{u∈(0,t)} F̂_{nk}(u) dG_{nF̂_n}(u) ≤ ∫_{u∈(0,t)} δ_k dP_n(u, δ), (2.48)

and equality holds if (i) F̂_{nk} has a jump at t w.r.t. T_k, and (ii) t = ∞.
We give two proofs of Proposition 2.36. The first proof follows from Proposition 2.34.
The second proof does not require the result in Proposition 2.34, and can be used for
truly infinite dimensional optimization problems.
Proof 1 of Proposition 2.36: We show that condition (b) of Proposition 2.34 is
equivalent to condition (b) of Proposition 2.36. Let k ∈ {1, . . . , K}. Then condition
(b) of Proposition 2.34 is equivalent to the following three conditions:

(i) For the last jump point τ of F̂_{nk} w.r.t. T_k, we have

∫_{[τ,s)} {δ_k/F̂_{nk}(u) − (1 − δ_+)/(F̂_{n+}(∞) − F̂_{n+}(u))} dP_n ≥ β_{nF̂_n} 1{s > T_{(n)}}, for all s > τ,

and equality must hold if s > T_{(n)}.

(ii) For any two successive jump points σ and τ of F̂_{nk} w.r.t. T_k, we have

∫_{[σ,s)} {δ_k/F̂_{nk}(u) − (1 − δ_+)/(F̂_{n+}(∞) − F̂_{n+}(u))} dP_n ≥ 0, for all s ∈ (σ, τ],

and equality holds if s = τ.

(iii) For the first jump point σ of F̂_{nk} w.r.t. T_k, we have

∫_{[s,σ)} {δ_k/F̂_{nk}(u) − (1 − δ_+)/(F̂_{n+}(∞) − F̂_{n+}(u))} dP_n ≤ 0, for all s ∈ [0, σ).

To see this, assume that condition (b) of Proposition 2.34 holds. Let τ be the last
jump point of F̂_{nk} w.r.t. T_k. Then equality holds in (2.41) for t = τ. Furthermore,
inequality holds in (2.41) for t = s. Subtracting these two conditions, we obtain the
inequality part of condition (i). If s > T_{(n)}, then the left side of (2.41) is zero, so that
equality holds in (i) for s > T_{(n)}. Conditions (ii) and (iii) can be proved analogously.
Furthermore, it can be easily verified that conditions (i)-(iii) above imply condition
(b) of Proposition 2.34.

In conditions (i) and (ii), we can multiply both sides of the equations by F̂_{nk}(u),
because this is a constant and positive quantity on the intervals of integration. This
means that (i) and (ii) are equivalent to:

(i') For the last jump point τ of F̂_{nk} w.r.t. T_k, we have for all s > τ:

∫_{[τ,s)} δ_k dP_n ≥ ∫_{[τ,s)} F̂_{nk}(u)(1 − δ_+)/(F̂_{n+}(∞) − F̂_{n+}(u)) dP_n + β_{nF̂_n} 1{s > T_{(n)}} F̂_{nk}(T_{(n)})
= ∫_{[τ,s)} F̂_{nk}(u) dG_{nF̂_n}(u),

where equality must hold if s > T_{(n)}.

(ii') For any two successive jump points σ and τ of F̂_{nk} w.r.t. T_k, we have for s ∈ (σ, τ]:

∫_{[σ,s)} δ_k dP_n ≥ ∫_{[σ,s)} F̂_{nk}(u)(1 − δ_+)/(F̂_{n+}(∞) − F̂_{n+}(u)) dP_n = ∫_{[σ,s)} F̂_{nk}(u) dG_{nF̂_n}(u),

where equality must hold if s = τ.

In (iii) we cannot multiply through by F̂_{nk}(u), because F̂_{nk}(u) = 0 on the interval of
integration. However, note that (iii) is equivalent to the condition that δ_k = 0 before
the first jump point of F̂_{nk}. An alternative way of writing this condition is:

(iii') For the first jump point σ of F̂_{nk} w.r.t. T_k, we have for s ∈ (0, σ]:

∫_{(0,s)} δ_k dP_n = ∫_{(0,s)} F̂_{nk}(u) dG_{nF̂_n}(u).

This completes the proof, since (i')-(iii') are equivalent to condition (b) of Proposition
2.36. □
Proof 2 of Proposition 2.36: Suppose that F̂_n ∈ F̃_K is an MLE. For t ∈ R+
and |h| < min{1/F̂_{n1}(∞), . . . , 1/F̂_{nK}(∞)}, we define the perturbation F̂^{(h)}_{nk}(u) =
F̂_{nk}(u){1 + h F̂_{nk}(u)}. Let F̂^{(h,k)}_n ∈ F̃_K be the corresponding vector of components,
where only the kth component is changed to F̂^{(h)}_{nk}. By Lemma 2.32, F̂_n maximizes
the function l̃_n(F) over F ∈ F̃_K. Hence, we get

0 = lim_{h↓0} h^{−1} {l̃_n(F̂^{(h,k)}_n) − l̃_n(F̂_n)}
= ∫ F̂_{nk}(u) {δ_k − F̂_{nk}(u)(1 − δ_+)/(F̂_{n+}(∞) − F̂_{n+}(u))} dP_n(u, δ) − F̂_{nk}(∞)² β_{nF̂_n}
= ∫ {∫_{[t,∞)} {δ_k − F̂_{nk}(u)(1 − δ_+)/(F̂_{n+}(∞) − F̂_{n+}(u))} dP_n(u, δ) − F̂_{nk}(∞) β_{nF̂_n}} dF̂_{nk}(t). (2.49)

Here we obtain the last line by writing F̂_{nk}(u) = ∫_{(0,u]} dF̂_{nk}(t) and using Fubini's
theorem. Furthermore, for h ≥ 0 we consider the perturbation F̂^{(h,t)}_{nk}(u) = F̂_{nk}(u){1 +
h 1_{[t,∞)}(u)}. Let F̂^{(h,t,k)}_n ∈ F̃_K be the corresponding vector of components, where only
the kth component is changed to F̂^{(h,t)}_{nk}. Then we get

0 ≥ lim_{h↓0} h^{−1} {l̃_n(F̂^{(h,t,k)}_n) − l̃_n(F̂_n)}
= ∫_{[t,∞)} {δ_k − F̂_{nk}(u)(1 − δ_+)/(F̂_{n+}(∞) − F̂_{n+}(u))} dP_n(u, δ) − F̂_{nk}(∞) β_{nF̂_n}. (2.50)

Note that (2.50) is left-continuous in t, and constant between successive points t in T_k.
Hence, by combining (2.49) and (2.50), it follows that equality in (2.50) must hold for
points of jump t ∈ T_k of F̂_{nk} w.r.t. T_k. Taking t > T_{(n)} in (2.50) yields β_{nF̂_n} ≥ 0, which
is the first part of condition (a). Furthermore, (2.49) yields β_{nF̂_n} F̂_{nk}(∞){F̂_{nk}(∞) −
F̂_{nk}(T_{(n)})} = 0. This implies that

β_{nF̂_n} {F̂_{nk}(∞) − F̂_{nk}(T_{(n)})} = 0, (2.51)
so that we can write (2.50) as

0 ≥ ∫_{[t,∞)} {δ_k − F̂_{nk}(u)(1 − δ_+)/(F̂_{n+}(∞) − F̂_{n+}(u))} dP_n(u, δ) − F̂_{nk}(T_{(n)}) β_{nF̂_n}
= ∫_{[t,∞)} δ_k dP_n(u, δ) − ∫_{[t,∞)} F̂_{nk}(u) dG_{nF̂_n}(u). (2.52)

Equality must hold at points of jump t of F̂_{nk} w.r.t. T_k. Note that the MLE F̂_{nk}(u)
must be positive at the first point with δ_k = 1; otherwise, the log likelihood would be
−∞. Hence, for the first jump point σ of F̂_{nk} with respect to T_k, we have ∫_{(0,σ)} δ_k dP_n =
∫_{(0,σ)} F̂_{nk} dG_{nF̂_n} = 0. Together with the equality in (2.52) at t = σ, this implies
∫ δ_k dP_n = ∫ F̂_{nk} dG_{nF̂_n}. Combining this with (2.52) gives condition (b). Furthermore,
if F̂_{n+}(T_{(n)}) < F̂_{n+}(∞), then there must be a k ∈ {1, . . . , K} such that F̂_{nk}(T_{(n)}) <
F̂_{nk}(∞). Together with (2.51) this yields β_{nF̂_n} = 0. This completes condition (a).
Now assume that F̂_n ∈ F̃_K satisfies conditions (a) and (b). Let c = F̂_{n+}(∞). We
show that F̂_n maximizes l̃_n(F) over F ∈ F̃_K. As in Proof 1 of Proposition 2.36, we
can multiply through in condition (b) by a function that has the same jump points
as F̂_{nk} w.r.t. T_k. Multiplying by log F̂_{nk}(u), under the convention 0 · ∞ = 0, gives

∫ δ_k log F̂_{nk}(u) dP_n(u, δ) = ∫ F̂_{nk}(u) log F̂_{nk}(u) dG_{nF̂_n}(u), k = 1, . . . , K.

Furthermore, by definition of G_{nF̂_n}, we have

∫ (1 − δ_+) log(c − F̂_{n+}(u)) dP_n(u, δ) = ∫ (c − F̂_{n+}(u)) log(c − F̂_{n+}(u)) dG_{nF̂_n}(u).

Note that this also holds if β_{nF̂_n} > 0 because of the equality condition in (a). Namely,
β_{nF̂_n} > 0 implies that F̂_{n+}(T_{(n)}) = F̂_{n+}(∞). By combining the last two displays we
can write l̃_n(F̂_n) as:
l̃_n(F̂_n) = ∫ [∑_{k=1}^{K} δ_k log F̂_{nk}(u) + (1 − δ_+) log(c − F̂_{n+}(u))] dP_n(u, δ) − c
= ∫ [∑_{k=1}^{K} F̂_{nk}(u) log F̂_{nk}(u) + (c − F̂_{n+}(u)) log(c − F̂_{n+}(u))] dG_{nF̂_n}(u) − c.
Furthermore, using the stochastic ordering property of condition (b), it follows that
for monotone nondecreasing functions F ∈ F̃_K with F_+(∞) = c_F, we have

∫ δ_k log F_k(u) dP_n(u, δ) ≤ ∫ F̂_{nk}(u) log F_k(u) dG_{nF̂_n}(u), k = 1, . . . , K. (2.53)

We can prove this by writing log F_k(u) = ∫_{(0,u]} d log F_k(s) and using Fubini's theorem:

∫ F̂_{nk}(u) log F_k(u) dG_{nF̂_n}(u) − ∫ δ_k log F_k(u) dP_n(u, δ)
= ∫ [∫_{[s,∞)} F̂_{nk}(u) dG_{nF̂_n}(u) − ∫_{[s,∞)} δ_k dP_n] d log F_k(s) ≥ 0,

since the integrand is nonnegative by condition (b). Furthermore,

∫ (1 − δ_+) log(c_F − F_+(u)) dP_n(u, δ) = ∫ (c − F̂_{n+}(u)) log(c_F − F_+(u)) dG_{nF̂_n}(u),
by the definition of G_{nF̂_n}. Combining the above display with (2.53) gives

l̃_n(F) = ∫ [∑_{k=1}^{K} δ_k log F_k(u) + (1 − δ_+) log(c_F − F_+(u))] dP_n(u, δ) − c_F
≤ ∫ [∑_{k=1}^{K} F̂_{nk}(u) log F_k(u) + (c − F̂_{n+}(u)) log(c_F − F_+(u))] dG_{nF̂_n}(u) − c_F.

It follows that

l̃_n(F̂_n) − l̃_n(F) ≥ ∫ [∑_{k=1}^{K} F̂_{nk}(u) log(F̂_{nk}(u)/F_k(u))
+ (c − F̂_{n+}(u)) log((c − F̂_{n+}(u))/(c_F − F_+(u))) + c_F − c] dG_{nF̂_n}(u). (2.54)
Here we use that ∫ dG_{nF̂_n} = 1, so that we can pull c_F − c inside the integral. We
now show that this expression is nonnegative, by showing that its integrand is non-
negative. To do so, consider two vectors q and p in R_+^{K+1} with ∑_{k=1}^{K+1} q_k = c_q and
∑_{k=1}^{K+1} p_k = c_p, and write q_+ = ∑_{k=1}^{K} q_k and p_+ = ∑_{k=1}^{K} p_k. We use the nonnegativity
of the Kullback-Leibler number for a multinomial distribution to show that

0 ≤ ∑_{k=1}^{K} (q_k/c_q) log((q_k/c_q)/(p_k/c_p)) + (1 − q_+/c_q) log((1 − q_+/c_q)/(1 − p_+/c_p))
= ∑_{k=1}^{K} (q_k/c_q) log(q_k/p_k) + (1 − q_+/c_q) log((c_q − q_+)/(c_p − p_+)) + log(c_p/c_q)
≤ ∑_{k=1}^{K} (q_k/c_q) log(q_k/p_k) + (1 − q_+/c_q) log((c_q − q_+)/(c_p − p_+)) − 1 + c_p/c_q
= (1/c_q) {∑_{k=1}^{K} q_k log(q_k/p_k) + (c_q − q_+) log((c_q − q_+)/(c_p − p_+)) + c_p − c_q},

using again the inequality log(x) ≤ x − 1 for x > 0. Letting q_k = F̂_{nk}(u) and p_k =
F_k(u), it follows that the integrand of (2.54) is nonnegative. Hence, l̃_n(F̂_n) − l̃_n(F) ≥ 0,
so that F̂_n maximizes l̃_n(F) over F ∈ F̃_K. □
Alternatively, we can formulate Proposition 2.36 in terms of a self-induced convex mi-
norant characterization:

Definition 2.37 For k = 1, . . . , K, let the cusum diagram φ_{nkF} be defined by

φ_{nkF}(t) = (G_{nF}(t), ∫_{(0,t]} δ_k dP_n(u, δ)), t ∈ R+. (2.55)
Proposition 2.38 Let F̂_n ∈ F̃_K. Then F̂_{nk}(t), t ∈ T_k, k = 1, . . . , K, are the uniquely
defined values of the MLE (2.7) if and only if

(a) β_{nF̂_n} ≥ 0, and equality holds if F̂_{n+}(T_{(n)}) < F̂_{n+}(∞).

(b) For all k = 1, . . . , K and t ∈ T_k, F̂_{nk}(t) is the slope of the convex minorant of
φ_{nkF̂_n} at G_{nF̂_n}(t), where we take the left-continuous slope if G_{nF̂_n} has a jump at
t, and the right-continuous slope otherwise.
Before we prove Proposition 2.38, we give an equivalent formulation in terms of a
fixed point of a self-induced convex minorant mapping:

Definition 2.39 For each F ∈ F̃_K such that β_{nF} ≥ 0, and each k ∈ {1, . . . , K}, we
define the mapping S_{nk} : F ↦ S_{nk}F by

[S_{nk}F](t) is the slope of the convex minorant of φ_{nkF} at G_{nF}(t), t ∈ T_k, (2.56)

where we take the left-continuous slope if G_{nF} has a jump at t, and the right-
continuous slope otherwise. We define S_n(F) by

S_n(F) = (S_{n1}(F), . . . , S_{nK}(F)). (2.57)

We can now reformulate Proposition 2.38 in terms of a fixed point of the mapping
S_n(F).

Proposition 2.40 Let F̂_n ∈ F̃_K. Then F̂_{nk}(t), t ∈ T_k, k = 1, . . . , K, are the uniquely
defined values of the MLE (2.7) if and only if

(a) β_{nF̂_n} ≥ 0, and equality holds if F̂_{n+}(T_{(n)}) < F̂_{n+}(∞).

(b) F̂_n is a fixed point of the mapping S_n in the sense that

[S_{nk}F̂_n](t) = F̂_{nk}(t) for all t ∈ T_k, k = 1, . . . , K.
Proof of Propositions 2.38 and 2.40: We show that condition (b) of Proposition
2.36 is equivalent to condition (b) of Propositions 2.38 and 2.40. Let k ∈ {1, . . . , K},
and recall that condition (b) of Proposition 2.36 is equivalent to conditions (i'), (ii')
and (iii') given in its Proof 1. Furthermore, note that F̂_{nk}(t) is constant on the
intervals of integration in each of these statements. Hence, we can take F̂_{nk}(t) outside
the integral, yielding:

(i'') For the last jump point τ of F̂_{nk} w.r.t. T_k, we have for all s > τ:

F̂_{nk}(s) = F̂_{nk}(τ) ≤ ∫_{[τ,s)} δ_k dP_n / ∫_{[τ,s)} dG_{nF̂_n}(u),

where equality must hold if s > T_{(n)}.

In terms of the cusum diagram, this means that F̂_{nk}(τ) is the slope of the line segment
connecting the points φ_{nkF̂_n}(τ−) and φ_{nkF̂_n}(T_{(n)}). Furthermore, no points φ_{nkF̂_n}(s)
for s ≥ τ may lie below this line segment. Similarly, condition (ii') of Proof 1 of
Proposition 2.36 is equivalent to:

(ii'') For any two successive jump points σ and τ of F̂_{nk} w.r.t. T_k, we have for s ∈ (σ, τ]:

F̂_{nk}(s) = F̂_{nk}(σ) ≤ ∫_{[σ,s)} δ_k dP_n / ∫_{[σ,s)} dG_{nF̂_n}(u),

where equality must hold if s = τ.

In terms of the cusum diagram, this means that F̂_{nk}(σ) is the slope of the line segment
connecting the points φ_{nkF̂_n}(σ−) and φ_{nkF̂_n}(τ−). Furthermore, no points φ_{nkF̂_n}(s) for
s ∈ [σ, τ) may lie below this line segment.

Next, we consider condition (iii') of Proof 1 of Proposition 2.36. Let σ be the first
jump point of F̂_{nk} w.r.t. T_k. Then by definition F̂_{nk}(t) = 0 for t < σ. Condition (iii')
says that ∫_{(0,σ)} δ_k dP_n = 0, and this is equivalent to F̂_{nk}(t), t ∈ (0, σ), being the slope
of the line segment connecting φ_{nkF̂_n}(0) and φ_{nkF̂_n}(σ−).

Thus, it follows that condition (b) of Proposition 2.36 for the kth component is
equivalent to F̂_{nk} being the slope of a piecewise linear function H_{nk} + c, where H_{nk} is
below the cusum diagram φ_{nkF̂_n}, and touches the cusum diagram whenever it has a
change of slope. Without loss of generality we take c = 0. Taking the monotonicity
constraint on F̂_n into account as well, it follows that H_{nk} must be convex, and hence
H_{nk} must be the convex minorant of the cusum diagram.

Finally, note that H_{nk} can only touch the cusum diagram at points at which G_{nF̂_n}
jumps. By the above reasoning, we must take the left derivative at these points.
Since there can be vertical stretches of points in the cusum diagram, we take the
right derivative at all other points. □
Example 2.41 To illustrate Proposition 2.40, we consider the example data in Table
2.5, and the corresponding plots in Figure 2.3. Note that T1 = 2, 3, 8, 10, T2 = 2, 3and T3 = 3. Note that the cumulative sum diagrams φnkFn
, k = 1, 2, only depend
on Fn through the value of Fn+(Ti) for which ∆i+ = 0. Thus, we can construct the
cumulative sum diagrams for the MLE using Fn+(3) = 3/5. We can then compute
Fnk(s), s ∈ Tk, by taking the slope of the convex minorant of φnkFn at GnFn(s), where
we take the left slope if Fnk has a jump at s w.r.t. Tk, and the right slope otherwise.
Note that GnFn(s) jumps at s = 3 and s = 10. Hence, we get Fn1(2) = Fn1(3) = 2/5
and Fn1(8) = Fn1(10) = 4/5. Note that at GnFn(3) = GnFn(8), the left slope is 2/5,
while the right slope is 4/5. For s = 3 we need the left slope, and for s = 8 we
need the right slope. This minor difficulty with left and right slopes is caused by the
vertical pieces in the cumulative sum diagram φnkFn.
Similarly, for k = 2, we obtain Fn2(2) = Fn2(3) = 1/5 and Fn2(T(n)) = Fn2(10) =
1/5. By monotonicity this implies that also Fn2(8) = 1/5. Note that these values
correspond exactly to the ones given in Table 2.5. Thus, one can recover the values
of Fnk(t), t ∈ Tk, from the values of Zi, i = 1, . . . , n, and Fn+(t), t ∈ TK+1, using a
simple one-step convex minorant algorithm.
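The one-step convex minorant computation can be illustrated with a generic sketch (illustrative only: a toy cusum diagram is used instead of the self-induced diagrams φnkFn of this example, and vertical stretches, which require the left/right slope distinction discussed above, are not handled):

```python
def convex_minorant_slopes(points):
    """Greatest convex minorant of a cusum diagram.

    `points` are (x, y) pairs with strictly increasing x, starting at (0, 0).
    Returns (x_left, x_right, slope) for each linear piece of the minorant,
    i.e. the value of the isotonic estimator on each segment.
    """
    hull = []  # vertices of the greatest convex minorant (lower convex hull)
    for x, y in sorted(points):
        hull.append((x, y))
        # drop middle vertices that lie on or above the chord of their neighbours
        while len(hull) >= 3:
            (x0, y0), (x1, y1), (x2, y2) = hull[-3:]
            if (y1 - y0) * (x2 - x1) >= (y2 - y1) * (x1 - x0):
                del hull[-2]
            else:
                break
    return [(x0, x1, (y1 - y0) / (x1 - x0))
            for (x0, y0), (x1, y1) in zip(hull, hull[1:])]

# cusum diagram of the values (3, 1, 2, 4) with unit weights
print(convex_minorant_slopes([(0, 0), (1, 3), (2, 4), (3, 6), (4, 10)]))
# [(0, 3, 2.0), (3, 4, 4.0)]
```

The slopes 2.0 and 4.0 are exactly the isotonic least squares fit of (3, 1, 2, 4), read off segment by segment.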
Table 2.5: Example data with K = 2 competing risks, to illustrate Propositions 2.30 and 2.40. The convex minorants are given in Figure 2.3. The first pair of Fn-columns belongs to the naive estimator (Proposition 2.30), the second pair and Fn+ to the MLE (Proposition 2.40).

 i   t(i)   δ(i)1   δ(i)2   δ(i)3   Fn1(t(i))   Fn2(t(i))   Fn1(t(i))   Fn2(t(i))   Fn+(t(i))
 1    2       1       0       0        1/3         1/5         2/5         1/5         3/5
 2    2       0       1       0        1/3         1/5         2/5         1/5         3/5
 3    3       0       0       1        1/3         1/5         2/5         1/5         3/5
 4    8       1       0       0        1           1/5         4/5         1/5         1
 5   10       1       0       0        1           1/5         4/5         1/5         1
We now compare the cusum diagrams φnk and φnkFn for the naive estimator and
the MLE. Note that the y-coordinates of the points in both cusum diagrams are
given by ∫(0,t] δk dPn, so that the points in the left and right panels of Figure 2.3 align horizontally. Furthermore, the x-coordinates Gn(t) of φnk(t) and GnFn(t) of φnkFn do not depend on k. Hence, the points in the upper panels of Figure 2.3 align vertically with those in the lower panels. Furthermore, we always have Gn(T(n)) = GnFn(T(n)) =
1, and, for s < T(n) such that Fn+(s) < 1,
\[
G_{nF_n}(s) = G_n(s) + \int_{(0,s]} \frac{F_{n+}(t) - \delta_+}{1 - F_{n+}(t)} \, dP_n(t,\delta) + \beta_{nF_n} 1\{s \ge T_{(n)}\}. \tag{2.58}
\]
This expression shows that the x-coordinates of φnk and φnkFn differ by two terms
that we refer to as the Fn+-term and the βnFn-term. The Fn+-term comes from
the difference in the log likelihoods (2.6) and (2.8). The βnFn-term comes from the
constraint F+ ≤ 1 on the space FK .
Remark 2.42 Note that the Kullback-Leibler number used in Proof 2 of Proposition
2.36 equals zero if and only if Fnk(u) = Fk(u) for all k = 1, . . . , K at points of jump
of GnFn(u). Hence, Fn+(u) is unique at points of jump of GnFn(u). Since these
values uniquely determine Fnk(u), u ∈ Tk, via the convex minorant characterization
in Proposition 2.40, this gives another proof of uniqueness.
[Figure 2.3 appears here: a 2 × 2 array of convex minorant plots (rows k = 1, 2; columns: naive estimator, MLE), with points labeled by the observation times 2, 3, 8, 10.]
Figure 2.3: Convex minorant plots for the data in Table 2.5. The left column corresponds to the naive estimator, and shows the cusum diagrams and their convex minorants (see Proposition 2.30). The right column corresponds to the MLE, and shows the self-induced cusum diagrams (based on the values of Fn+ given in Table 2.5) and their convex minorants (see Proposition 2.40). All points are labeled by their observation times.
In Example 2.41 we saw that the x-coordinates of the cusum diagrams of the MLE
and the naive estimator are different, while their y-coordinates are identical. We can
also derive convex minorant characterizations where the x-coordinates of the cusum
diagrams of both estimators are identical, and their y-coordinates are different. To
illustrate the freedom in the convex minorant characterizations, we now discuss a
particular family of convex minorant characterizations. These new characterizations
look more complicated, but will prove useful for computational purposes in Section
3.2.
We define
\[
s_k = \min\{T_i,\ i = 1, \ldots, n : \Delta_{ik} = 1\}, \qquad k = 1, \ldots, K. \tag{2.59}
\]
Note that it follows from the form of the log likelihood (2.6) that Fnk(t) = 0 for
t < sk. We now define a family of cusum diagrams:
Definition 2.43 Let ck : Z → R+, k = 1, . . . , K, where Z is defined in (2.3). For
F ∈ FK such that βnF ≥ 0, let
\[
G^*_{nk}(s) = \int_{(0,s]} c_k(u,\delta) \, dP_n(u,\delta),
\]
\[
V^*_{nkF}(s) = \int_{(0,s]} \left\{ \frac{\delta_k}{F_k(u)} - \frac{1 - \delta_+}{F_+(\infty) - F_+(u)} + c_k(u,\delta) F_k(u) \right\} dP_n(u,\delta) - \beta_{nF}\, 1\{s \ge T_{(n)}\},
\]
\[
\phi^*_{nkF}(t) = \left( G^*_{nk}(t), V^*_{nkF}(t) \right).
\]
Proposition 2.44 Let Fnk(u) = 0 for u < sk, k = 1, . . . , K. Then Fnk(t), t ∈ Tk ∩ [sk,∞), are the uniquely defined values of the MLE if and only if

(a) βnFn ≥ 0, and equality holds if Fn+(T(n)) < Fn+(∞).

(b) For all k = 1, . . . , K and t ∈ Tk ∩ [sk,∞), Fnk(t) is the slope of the convex minorant of φ∗nkFn at G∗nk(t), where we take the left-continuous slope if G∗nk has a jump at t, and the right-continuous slope otherwise.
Note that the definition of sk is used to avoid the part of the convex minorant where
we may get negative slopes.
Proof: The proof is analogous to the proof of Proposition 2.40. For example, consider
condition (i) in Proof 1 of Proposition 2.36. Let τ be the last jump point of Fnk w.r.t.
Tk. Adding ∫[τ,s) ck(u, δ)Fnk(u) dPn(u, δ) to both sides of the equation yields that we
have for s > τ :
\[
\int_{[\tau,s)} \left\{ \frac{\delta_k}{F_{nk}(u)} - \frac{1 - \delta_+}{F_{n+}(\infty) - F_{n+}(u)} + c_k(u,\delta) F_{nk}(u) \right\} dP_n(u,\delta) - \beta_{nF_n}\, 1\{s > T_{(n)}\}
\ge \int_{[\tau,s)} c_k(u,\delta) F_{nk}(u) \, dP_n(u,\delta),
\]
and equality must hold if s > T(n). Pulling Fnk(u) = Fnk(τ) out of the integral on the right side yields, for s > τ, that Fnk(s) = Fnk(τ), where
\[
F_{nk}(\tau) \le \frac{\displaystyle \int_{[\tau,s)} \left\{ \frac{\delta_k}{F_{nk}(u)} - \frac{1 - \delta_+}{F_{n+}(\infty) - F_{n+}(u)} + c_k(u,\delta) F_{nk}(u) \right\} dP_n(u,\delta) - \beta_{nF_n}\, 1\{s > T_{(n)}\}}{\displaystyle \int_{[\tau,s)} c_k(u,\delta) \, dP_n(u,\delta)}.
\]
In terms of the cusum diagram φ∗nkFn, this means that Fnk(τ) is the slope of the line segment connecting φ∗nkFn(τ−) and φ∗nkFn(T(n)). Furthermore, no points φ∗nkFn(s), s > τ, may lie below this line segment. Condition (ii) in Proof 1 of Proposition 2.36 can be treated analogously, and condition (iii) can be omitted since Fnk(t) = 0 for t < sk, k = 1, . . . , K. □
Chapter 3
COMPUTATION
We already noted in Section 2.4.2 that the naive estimator can be computed with
existing algorithms for current status data, such as a one-step convex minorant algo-
rithm or the Pool Adjacent Violators Algorithm (PAVA). In this chapter we therefore
focus on the computation of the MLE.
There are several algorithms available for the computation of the MLE. Hudgens,
Satten and Longini (2001) and Jewell, Van der Laan and Henneman (2003) use the
EM algorithm, and Jewell and Kalbfleisch (2004) propose an iterative Pool Adjacent
Violators Algorithm. The EM algorithm is known for its slow convergence, and indeed in this problem it requires an extremely large number of iterations and a very long computing time. The algorithm of Jewell and Kalbfleisch (2004) seems to improve on
EM when ∆(n),K+1 = 1. However, when ∆(n),K+1 = 0 the algorithm does not converge
to the MLE directly, and in this case one needs to do a (K − 1)-dimensional search to
find the MLE. Such a search is very costly in computing time.
We propose to compute the MLE using sequential quadratic programming (SQP)
methods. The basic idea of SQP is as follows. Suppose we want to minimize a function
θ(F ) over F ∈ H. Let F (0) ∈ H be a fixed starting point. For each l = 0, 1, . . . , let F new
be the minimizer of θ(l)(F ) over H, where θ(l)(F ) is a quadratic approximation of θ(F )
around F (l). We then obtain the next iterate by taking F (l+1) = F (l) +α(F new−F (l))
for a suitable α > 0. We continue this process until the necessary and sufficient
conditions for the optimum are satisfied within a specified tolerance.
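The damped SQP loop described above can be sketched as follows (an illustration, not the thesis implementation: the function names `theta`, `grad`, `hess_diag`, and `project` are hypothetical, and the toy problem projects onto a box, where the diagonal quadratic model is solved exactly, rather than onto the monotone cone used for the MLE):

```python
import numpy as np

def sqp_minimize(theta, grad, hess_diag, project, F0, tol=1e-8, max_iter=200):
    """Damped sequential quadratic programming with a diagonal quadratic model.

    Each step minimizes the quadratic approximation of `theta` around the
    current iterate (diagonal Hessian, so the model minimizer is a projected
    Newton step), then damps with F <- F + alpha * (F_new - F), halving alpha
    until `theta` decreases.
    """
    F = np.asarray(F0, dtype=float)
    for _ in range(max_iter):
        g, h = grad(F), hess_diag(F)
        F_new = project(F - g / h)  # minimizer of the separable quadratic model
        if np.max(np.abs(F_new - F)) < tol:
            break
        alpha = 1.0
        while theta(F + alpha * (F_new - F)) > theta(F) and alpha > 1e-10:
            alpha /= 2.0  # line search guarantees a descent step
        F = F + alpha * (F_new - F)
    return F

# Toy problem: minimize ||F - y||^2 over the box H = [0, 1]^3.
y = np.array([-0.3, 0.4, 1.7])
F_hat = sqp_minimize(
    theta=lambda F: np.sum((F - y) ** 2),
    grad=lambda F: 2.0 * (F - y),
    hess_diag=lambda F: 2.0 * np.ones_like(F),
    project=lambda F: np.clip(F, 0.0, 1.0),
    F0=np.full(3, 0.5),
)
print(F_hat)  # [0.  0.4 1. ]
```

For the MLE itself, the projection step is replaced by the convex-constrained quadratic solves described in Sections 3.1 and 3.2.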
Thus, in this procedure we need to minimize the quadratic functions θ(l)(F ) over
H for l = 0, 1, . . . . In order to solve these quadratic optimization problems, we can
use either the complete Hessian matrix of θ(F ), or only its diagonal elements. We
developed algorithms for both approaches. Section 3.1 describes a method that uses
the complete Hessian and employs the support reduction algorithm of Groeneboom,
Jongbloed and Wellner (2002). The main advantage of this method is that it can
be used for any censored data problem. Section 3.2 describes a method that uses
only the diagonal elements. Using only the diagonal elements reduces the quadratic
optimization problems to isotonic regression problems which can be solved by a one-
step convex minorant algorithm. Hence, this approach results in an iterative convex
minorant algorithm, and we show that it corresponds to the convex minorant charac-
terization in Proposition 2.44 for a specific choice of the functions ck, k = 1, . . . , K.
If the Hessian matrix is sparse off the diagonal, then this algorithm is expected to
be faster, because the speed gained in solving the quadratic optimization problems
outweighs the fact that we do not solve the exact quadratic optimization problems.
3.1 Reduction and optimization
As noted in Section 2.2, a general approach for the computation of the MLE for
censored data consists of a reduction step followed by an optimization step. In the
reduction step we compute the maximal intersections A1, . . . , Am of the observed sets
R1, . . . , Rn. In the optimization step we solve the optimization problem defined in
(2.15) and (2.16).
The main advantage of this approach is its versatility. The form of the log likeli-
hood (2.15) is the same for all censored data problems, so that the same optimization
algorithm can be used for all problems, and only the reduction step may require ad-
justments. Another advantage of this approach, compared to the one of Jewell and
Kalbfleisch (2004), is that we estimate a significantly smaller number of parameters.
Hudgens, Satten and Longini (2001) also employ a reduction and an optimization
step for the computation of the MLE. However, they use an EM algorithm for the
optimization step, while we use the support reduction algorithm of Groeneboom,
Jongbloed and Wellner (2002). We now describe our implementation of both the
reduction step and the optimization step in more detail.
3.1.1 Reduction step
We first note that we can use the height map algorithm of Maathuis (2005) for the
reduction step. The idea behind this algorithm is as follows. Given n observed sets in
Rp, p ∈ N, taking the form of p-dimensional rectangles1, the height map is a function
h : Rp → {0, 1, . . . }, where h(x) is defined as the number of observed sets that
overlap at the point x ∈ Rp. Maathuis (2005) shows that the maximal intersections
correspond exactly to the local maxima of the height map of a canonical version of
the observed sets. However, this algorithm only works when the observed sets are
rectangles in Rp for some p ∈ N. As discussed in Section 2.2.1, the observed sets for
the MLE can take the form (t,∞) × {1, . . . , K} for t ∈ R+, and such sets are not rectangles in R2. We resolve this problem by transforming the sets (t,∞) × {1, . . . , K} into (t,∞) × [1, K]. After this transformation we compute the maximal intersections, and if we find any maximal intersections of the form (t,∞) × [1, K], then we transform these back to (t,∞) × {1, . . . , K}. This reduction algorithm has time complexity O(n²) (Maathuis (2005)).
We can exploit the special structure of current status data with competing risks to
create a reduction algorithm with lower time complexity. Essentially, the data can be
thought of as K one-dimensional data sets. This is also apparent in the intersection
graph of the observed sets (see Figure 2.2 on page 30), in which the sets R1, . . . ,RK
are not adjacent to each other, but are all adjacent to RK+1 as described in Theorem
2.14 (c). As a result of this one-dimensional structure, we can use the idea of the
height map algorithm for each k = 1, . . . , K separately. This is done in Algorithm 1,
given in pseudo code. This algorithm is of time complexity O(n log n), since the most
1 Thus, an observed set can be written as [x11, x12] × [x21, x22] × · · · × [xp1, xp2], where it is allowed that xj1 = xj2, j = 1, . . . , p, and where the boundaries of the intervals can also be open.
time consuming step consists of sorting the observations.
Algorithm 1: Reduction algorithm(R1, . . . ,RK+1)
Input: The sets R1, . . . ,RK+1 as defined in (2.23).
Output: The maximal intersections of R.
1: for k = 1 to K do
2:   Sort the observations in Rk on their observation times.
3:   Find the maximal intersections of Rk, using the conditions given in Lemma 2.3.
4: Output the union of all maximal intersections.
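In the plain one-dimensional interval-censoring setting, the per-risk step of Algorithm 1 amounts to finding the innermost ("Turnbull") intervals of a family of intervals. A generic sketch with closed intervals (illustration only; the exact open/closed endpoint conventions of Lemma 2.3 are not reproduced here):

```python
def maximal_intersections(intervals):
    """Innermost ("Turnbull") intervals of a family of closed intervals [l, r].

    A maximal intersection is an interval [p, q] such that p is a left
    endpoint, q is a right endpoint, p <= q, and no other endpoint lies
    strictly inside (p, q). Sorting the tagged endpoints reduces this to a
    single O(n log n) sweep.
    """
    # Tag endpoints; at ties, left endpoints (tag 0) sort before right
    # endpoints (tag 1), so touching intervals [a, b], [b, c] yield [b, b].
    events = sorted([(l, 0) for l, _ in intervals] + [(r, 1) for _, r in intervals])
    out, last_left = [], None
    for x, tag in events:
        if tag == 0:
            last_left = x               # remember the most recent left endpoint
        elif last_left is not None:
            out.append((last_left, x))  # a right endpoint closes an innermost interval
            last_left = None
    return out

print(maximal_intersections([(0, 4), (2, 6), (5, 9)]))  # [(2, 4), (5, 6)]
```

As in Algorithm 1, the sort dominates the cost, so the sweep runs in O(n log n) per risk.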
3.1.2 Optimization step
After finding the maximal intersections, we need to solve the m-dimensional convex
constrained optimization problem (2.16). This problem can be approached in various
ways. Hudgens, Satten and Longini (2001) use an EM algorithm. Since the EM
algorithm is known for its slow convergence properties, we instead use the support
reduction algorithm of Groeneboom, Jongbloed and Wellner (2002). Convergence of
this algorithm follows from their Theorem 3.1.
The versatility of this approach is illustrated by the fact that we could re-use
programs that were developed for bivariate interval censored data in Maathuis (2003).
Details of the implementation can be found there.
Remark 3.1 To obtain fast convergence, it is important to use a good starting value
for the iterations. In simulation studies, a suitable starting value can be generated
from the true underlying distribution. If the underlying distribution is unknown, one
can use a starting value based on the naive estimator. We found that fast convergence
is obtained by starting in a value that is close to the ‘truncated naive estimator’ that
we will discuss in Section 8.2.
3.2 Iterative convex minorant algorithms
We now discuss iterative convex minorant algorithms for the computation of the
MLE. First, note that any convex minorant characterization can be turned into an
iterative convex minorant algorithm. To do so, let F (0) ∈ FK (see equation (2.38))
be some starting value. Furthermore, let P(l)k, k = 1, . . . , K, denote the points of the cumulative sum diagrams in the lth iteration step. Thus, using Proposition 2.40 or 2.44, we have P(l)k = {φnkF(l)(t) : t ≥ 0} or P(l)k = {φ∗nkF(l)(t) : t ≥ 0}. Then, for l = 0, 1, . . . , let F new be the slope of the convex minorant of P(l)k, and take as the
next iterate F (l+1) = F (l) + α(F new − F (l)) for a suitable α > 0.
If such an algorithm converges, it converges to the MLE. However, the convergence
properties of the algorithm will depend on the choice of the convex minorant char-
acterization. To illustrate the iterative convex minorant algorithms, we discuss one
algorithm in detail and prove its convergence. The algorithm we discuss corresponds
to a specific choice of the functions ck, k = 1, . . . , K, in Proposition 2.44. Further-
more, the algorithm corresponds to a sequential quadratic programming approach
that only uses the diagonal elements of the Hessian matrix, showing the connection
with the approach discussed in Section 3.1.
To describe the algorithm, we repeat the notation that was used in the proof of
Proposition 2.34. Let T ′(i), i = 1, . . . , nd, nd ≤ n, be the order statistics of the distinct
observation times in T1, . . . , Tn, and let T ′(0) = 0 and T ′(nd+1) = ∞. Furthermore, for
i = 1, . . . , nd and k = 1, . . . , K + 1, let
\[
N_{ik} = \sum_{j=1}^n \Delta_{jk} \, 1\{T_j = T'_{(i)}\} \qquad \text{and} \qquad N_i = \sum_{j=1}^n 1\{T_j = T'_{(i)}\}.
\]
Furthermore, for any set of functions (F1, . . . , FK) and i ∈ {0, . . . , nd + 1}, we define
\[
F_{ik} = F_k(T'_{(i)}) \qquad \text{and} \qquad F_{i+} = \sum_{k=1}^K F_{ik}.
\]
Let F = (F1, . . . , FK), where Fk = (F1k, . . . , Fnd+1,k). We can then write the extended
log likelihood function ln(F ) as
\[
l_n(F) = \frac{1}{n} \sum_{i=1}^{n_d} \left\{ \sum_{k=1}^K N_{ik} \log F_{ik} + N_{i,K+1} \log (F_{n_d+1,+} - F_{i+}) \right\} - F_{n_d+1,+}. \tag{3.1}
\]
We need to maximize this function over the space
\[
\mathcal{F}_K = \left\{ F \in \mathbb{R}^{(n_d+1)K} : 0 \le F_{1k} \le \cdots \le F_{n_d+1,k} \text{ for all } k = 1, \ldots, K \right\}.
\]
If Nnd,K+1 = 0, we can take Fnd+1,+ = Fnd,+, since in this case the MLE will never
put any mass to the right of T ′(nd). Making this substitution in (3.1), and multiplying by −1 to turn the problem into a minimization problem, yields the new criterion
function
\[
\varphi_n(F) = -\frac{1}{n} \sum_{i=1}^{n_d} \left\{ \sum_{k=1}^K N_{ik} \log F_{ik} + N_{i,K+1} \log (F_{n_d,+} - F_{i+}) \right\} + F_{n_d,+}.
\]
On the other hand, if Nnd,K+1 > 0, the constraint Fnd,+ ≤ 1 is automatically satisfied
and we do not need the Lagrange term Fnd+1,+ in (3.1). Thus, in this case we work
with the criterion function
\[
\psi_n(F) = -\frac{1}{n} \sum_{i=1}^{n_d} \left\{ \sum_{k=1}^K N_{ik} \log F_{ik} + N_{i,K+1} \log (1 - F_{i+}) \right\}.
\]
Recall the definitions of sk in (2.59), and of Tk in Definition 2.22. Let Ik denote the
indices of the order statistics corresponding to Tk ∩ [sk,∞):
\[
I_k = \left\{ i = 1, \ldots, n_d : T'_{(i)} \in \mathcal{T}_k \cap [s_k, \infty) \right\}, \qquad k = 1, \ldots, K.
\]
Furthermore, let mk = |Ik| and m = ∑_{k=1}^K mk. We set Fik = 0 for T ′(i) < sk. Then ϕn(F ) and ψn(F ) only depend on Fik for i ∈ Ik, k = 1, . . . , K. Restricting F to only
contain elements Fik for i ∈ Ik, k = 1, . . . , K, the computation of the MLE reduces
to finding the minimizer of θ(F ) over H, where
\[
\mathcal{H} = \left\{ F \in \mathbb{R}^m : 0 \le F_{ik} \le F_{jk} \text{ for all } i < j \in I_k,\ k = 1, \ldots, K \right\},
\]
and where θ(F ) = ϕn(F ) if Nnd,K+1 = 0, and θ(F ) = ψn(F ) if Nnd,K+1 > 0. This
means that in step l of the algorithm we need to solve the quadratic optimization
problem
\[
F^{\text{new}} = \operatorname*{argmin}_{F \in \mathcal{H}} \left\{ \theta(F^{(l)}) + (F - F^{(l)})^T \nabla_l + \tfrac{1}{2} (F - F^{(l)})^T H_l (F - F^{(l)}) \right\},
\]
where ∇l ∈ Rm is the vector of first derivatives of θ(F ) at F (l), and Hl is the m×m
diagonal matrix containing the second derivatives of θ(F ) at F (l). Note that this
optimization problem is equivalent to
\[
F^{\text{new}} = \operatorname*{argmin}_{F \in \mathcal{H}} \tfrac{1}{2} \left( F - \left( F^{(l)} - H_l^{-1} \nabla_l \right) \right)^T H_l \left( F - \left( F^{(l)} - H_l^{-1} \nabla_l \right) \right).
\]
Since F ∈ H is required to be monotone in each of the components F1, . . . , FK , and
since there are no constraints between the components, this minimization problem
can be further reduced to K isotonic least squares problems
\[
F_k^{\text{new}} = \operatorname*{argmin}_{F_k \in \mathcal{G}_k} \frac{1}{2} \sum_{i \in I_k} \left\{ F_{ik} - \left( F^{(l)}_{ik} - \left( \frac{\partial^2 \theta(F^{(l)})}{\partial F_{ik}^2} \right)^{-1} \frac{\partial \theta(F^{(l)})}{\partial F_{ik}} \right) \right\}^2 \frac{\partial^2 \theta(F^{(l)})}{\partial F_{ik}^2}, \tag{3.2}
\]
where Gk = {Fk ∈ Rmk : 0 ≤ Fik ≤ Fjk for all i < j ∈ Ik} for k = 1, . . . , K. It
is well known (see, e.g., Robertson, Wright and Dykstra (1988)) that the solution of
the isotonic least squares problem
\[
\min \ \frac{1}{2} \sum_{i=1}^n (x_i - y_i)^2 h_i
\]
for a fixed y ∈ Rn and positive weights h1, . . . , hn can be found as the left derivative of the convex minorant of the points P = {Pi = (G(i), V(i)), i = 0, . . . , n}, where P0 = (0, 0) and
\[
G_{(i)} = \sum_{j=1}^i h_j, \qquad V_{(i)} = \sum_{j=1}^i h_j y_j.
\]
Hence, for k ∈ {1, . . . , K}, the solution F new_k of the isotonic least squares problem (3.2) is given by the left derivative of the convex minorant of the points (0, 0) and
\[
\left( \sum_{j \in I_k,\, j \le i} \frac{\partial^2 \theta(F^{(l)})}{\partial F_{jk}^2}, \ \sum_{j \in I_k,\, j \le i} \left\{ \frac{\partial^2 \theta(F^{(l)})}{\partial F_{jk}^2} F^{(l)}_{jk} - \frac{\partial \theta(F^{(l)})}{\partial F_{jk}} \right\} \right), \qquad i \in I_k. \tag{3.3}
\]
If Nnd,K+1 = 0, we replace θ(F ) by ϕn(F ) in (3.3). Note that
\[
\frac{\partial \varphi_n(F^{(l)})}{\partial F_{ik}} = -\frac{1}{n} \left( \frac{N_{ik}}{F^{(l)}_{ik}} - \frac{N_{i,K+1}}{F^{(l)}_{n_d,+} - F^{(l)}_{i+}} \right) + 1\{i = n_d\} \left( 1 - \frac{1}{n} \sum_{j=1}^{n_d} \frac{N_{j,K+1}}{F^{(l)}_{n_d,+} - F^{(l)}_{j+}} \right),
\]
and
\[
n \frac{\partial^2 \varphi_n(F^{(l)})}{\partial F_{ik}^2} = \frac{N_{ik}}{(F^{(l)}_{ik})^2} + \frac{N_{i,K+1}}{(F^{(l)}_{n_d,+} - F^{(l)}_{i+})^2} + 1\{i = n_d\} \sum_{j=1}^{n_d} \frac{N_{j,K+1}}{(F^{(l)}_{n_d,+} - F^{(l)}_{j+})^2}.
\]
Hence, this corresponds exactly to Proposition 2.44 with Fnd+1,+ = Fnd,+,
\[
c^{(l)}_k(T'_{(j)}) = c^{(l)}_{jk} = n \frac{\partial^2 \varphi_n(F^{(l)})}{\partial F_{jk}^2} \qquad \text{and} \qquad \beta_{nF^{(l)}} = 1 - \frac{1}{n} \sum_{j=1}^{n_d} \frac{N_{j,K+1}}{F^{(l)}_{n_d,+} - F^{(l)}_{j+}}.
\]
If Nnd,K+1 > 0, we replace θ(F ) by ψn(F ) in (3.3). In this case, we have
\[
\frac{\partial \psi_n(F^{(l)})}{\partial F_{ik}} = -\frac{1}{n} \left( \frac{N_{ik}}{F^{(l)}_{ik}} - \frac{N_{i,K+1}}{1 - F^{(l)}_{i+}} \right),
\]
and
\[
n \frac{\partial^2 \psi_n(F^{(l)})}{\partial F_{ik}^2} = \frac{N_{ik}}{(F^{(l)}_{ik})^2} + \frac{N_{i,K+1}}{(1 - F^{(l)}_{i+})^2}.
\]
Hence, we again obtain an algorithm that corresponds to Proposition 2.44, but now
with Fnd+1,+ = 1,
\[
c^{(l)}_k(T'_{(j)}) = c^{(l)}_{jk} = n \frac{\partial^2 \psi_n(F^{(l)})}{\partial F_{jk}^2} \qquad \text{and} \qquad \beta_{nF^{(l)}} = 0.
\]
Convergence of the iterative convex minorant algorithm is proved in Jongbloed (1998).
Since both criterion functions ϕn and ψn satisfy the conditions of his theorem, it
follows that our iterative convex minorant algorithm yields a direction of descent of
the criterion function. Hence, it converges to the MLE when augmented by a line
search procedure.
Chapter 4
CONSISTENCY
In this chapter we prove global and local consistency of the MLE and the naive es-
timator. In Section 4.1 we prove Hellinger consistency, using empirical process theory
and Glivenko-Cantelli preservation theorems. This also leads to Lr(G) consistency
for r > 0. In Section 4.2, we prove several types of local and uniform consistency,
following the methods of Schick and Yu (2000).
4.1 Hellinger consistency
We first define the Hellinger and total variation distance for both estimators. Recall
the definitions of the MLE and the naive estimator in Sections 2.1.2 and 2.1.3. The
MLE is based on the observed data Z = (T,∆). The density for one observation
z = (t, δ) with respect to µ = # × G is pF (z) = ∏_{k=1}^K Fk(t)^δk (1 − F+(t))^{1−δ+}. Here
G is the distribution function of the observation time T and # is counting measure
on {ek, k = 1, . . . , K + 1}. Recall that Fn+ = ∑_{k=1}^K Fnk, Fn,K+1 = 1 − Fn+ and
F0,K+1 = 1 − F0+. Furthermore, recall that FK is the class of all K-
tuples of sub-distribution functions on R with pointwise sum bounded by one. The
Hellinger and total variation distance between two vectors F = (F1, . . . , FK) and
F ′ = (F ′1, . . . , F
′K) in FK are given by
\[
h^2(p_F, p_{F'}) = \frac{1}{2} \int \left( \sqrt{p_F} - \sqrt{p_{F'}} \right)^2 d\mu = \frac{1}{2} \sum_{k=1}^{K+1} \int \left( \sqrt{F_k} - \sqrt{F'_k} \right)^2 dG, \tag{4.1}
\]
\[
d_{TV}(p_F, p_{F'}) = \sum_{k=1}^{K+1} \int \left| F_k - F'_k \right| dG. \tag{4.2}
\]
The naive estimator Fnk, k = 1, . . . , K + 1, is based on the marginal data Zk =
(T,∆k). The density for one observation zk = (t, δk) with respect to µ = # × G
is pk,Fk(zk) = Fk(t)^δk (1 − Fk(t))^{1−δk}. Here # is counting measure on {(1, 0), (0, 1)}.
Recall that F is the class of all sub-distribution functions on R, and that S is the
class of all sub-survival functions on R. The Hellinger and total variation distance
between two components Fk and F ′k, k ∈ {1, . . . , K}, in F , or FK+1 and F ′K+1 in S,
are given by
\[
h^2(p_{k,F_k}, p_{k,F'_k}) = \frac{1}{2} \int \left( \sqrt{p_{k,F_k}} - \sqrt{p_{k,F'_k}} \right)^2 d\mu
= \frac{1}{2} \int \left\{ \left( \sqrt{F_k} - \sqrt{F'_k} \right)^2 + \left( \sqrt{1 - F_k} - \sqrt{1 - F'_k} \right)^2 \right\} dG \tag{4.3}
\]
and
\[
d_{TV}(p_{k,F_k}, p_{k,F'_k}) = 2 \int \left| F_k - F'_k \right| dG, \tag{4.4}
\]
for k = 1, . . . , K + 1. We now prove Hellinger consistency for the naive estimators
and the MLE. For the naive estimator, Hellinger consistency follows immediately from
Theorem 7 of Van der Vaart and Wellner (2000), which gives Hellinger consistency
for the MLE for mixed case interval censored data.
Theorem 4.1
\[
h(p_{k,F_{nk}}, p_{k,F_{0k}}) \to_{a.s.} 0, \qquad k = 1, \ldots, K + 1. \tag{4.5}
\]
Proof: Let k ∈ {1, . . . , K + 1}. The naive estimator Fnk is the MLE for the marginal
current status data Zk = (T,∆k). Univariate current status data is a special case
of univariate mixed case interval censored data. Hence, Hellinger consistency for the
naive estimator follows immediately from known results for the MLE for univari-
ate mixed case interval censored data (see, e.g., Van der Vaart and Wellner (2000,
Theorem 7)). □
For the MLE, Hellinger consistency follows from Theorem 9 of Van der Vaart and
Wellner (2000). Theorem 9 is a more general version of their Theorem 7. It uses
the concept of VC-class, which is defined as (see, e.g., Dudley (1978), Pollard (1984),
Van der Vaart and Wellner (2000, page 85)):
Definition 4.2 A collection C of subsets of a sample space W is said to pick out a certain subset of the finite set {x1, . . . , xn} ⊆ W if it can be written as {x1, . . . , xn} ∩ C for some C ∈ C. The collection C is said to shatter {x1, . . . , xn} if C picks out each of the 2^n subsets. The VC-index V (C) of C is the smallest n for which no set of size n is shattered by C. A collection C of measurable sets is called a VC-class if its VC-index is finite.
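As a toy illustration of this definition (not from the thesis), a brute-force check shows that the class of half-lines {(−∞, t] : t ∈ R} shatters every one-point set but no two-point set, so its VC-index is 2:

```python
def picks_out(points, thresholds):
    """Subsets of `points` that the class {(-inf, t] : t} picks out."""
    return {tuple(x <= t for x in points) for t in thresholds}

def shatters(points, thresholds):
    """True if every one of the 2^n subsets of `points` is picked out."""
    return len(picks_out(points, thresholds)) == 2 ** len(points)

# A finite grid of thresholds interleaving the points is enough, because
# only the ordering of t relative to the points matters.
ts = [0.5, 1.5, 2.5]
print(shatters([1.0], ts))       # True: both {} and {1.0} are picked out
print(shatters([1.0, 2.0], ts))  # False: {2.0} alone can never be picked out
```

The class D used below for current status data with competing risks is richer, and has VC-index 3.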
Theorem 9 of Van der Vaart and Wellner (2000) holds in the censored data setting
that we discussed in Section 2.2. We briefly recall the set-up. Let W be a random
variable taking values in W. Suppose that W has distribution Q0. We do not observe
W directly. Rather, we observe a vector of random sets D = (D1, . . . , Dp) that form
a partition of W: ∪_{k=1}^p Dk = W and Dk ∩ Dj = ∅ for j ≠ k. Here the number of
random sets p is allowed to be random. However, that is not needed in our case, in
which p = K + 1 and K is the number of competing risks. Furthermore, we observe
an indicator vector ∆ = (∆1, . . . ,∆K+1), where ∆k = 1{W ∈ Dk}, k = 1, . . . , K + 1.
We assume W and D are independent. The observed random variable is Z = (D,∆),
and Z1, . . . , Zn are n i.i.d. copies of Z. Finally, Qn is the nonparametric maximum
likelihood estimator of Q0 based on Z1, . . . , Zn. Then Theorem 9 of Van der Vaart and
Wellner (2000) states that h(pQn, pQ0) →a.s. 0 if all Dk ∈ D and D is a VC collection
of subsets of W. Recall from Section 2.2.1 that the MLE for current status data
with competing risks fits this framework, with W = (X, Y ), W = R × {1, . . . , K}, Dk(T ) = (−∞, T ] × {k} for k = 1, . . . , K, and DK+1(T ) = (T,∞) × {1, . . . , K}. Note
that D and (X, Y ) are indeed independent. This follows from the independence of
(X, Y ) and T (assumption (a) in Section 2.1), and the fact that D only depends on
T . Furthermore, the class D is a VC-class with VC-index 3. Hence, it follows that the
MLE for the bivariate distribution function of (X, Y ) is Hellinger consistent. Since
estimating the bivariate distribution function of (X, Y ) is equivalent to estimating the
sub-distribution functions, it follows that Fn = (Fn1, . . . , FnK) is Hellinger consistent.
For completeness we also give a direct proof of this result. We first recall the
definitions of outer integrals and measurable majorants, as given in Van der Vaart
and Wellner (1996, Section 1.2, page 6).
Definition 4.3 Let (Ω,A,P) be an arbitrary probability space, and let T : Ω 7→[−∞,∞] be an arbitrary map. The outer integral of T with respect to P is
E∗T = inf{EU : U ≥ T, U : Ω → [−∞,∞] is measurable and EU exists}.
Here, EU is understood to exist if at least one of EU+ or EU− is finite. The functions
U are allowed to take the value ∞, so that the infimum is never empty. The outer
probability of an arbitrary subset B of Ω is
P ∗(B) = inf{P (A) : A ⊃ B, A ∈ A}.
The infima in the above definitions are always achieved, and are denoted by T ∗ and
B∗ respectively.
Next, we recall the definitions of (universal) Glivenko-Cantelli classes and envelope
functions, as given in Van der Vaart and Wellner (1996, page 81 and 84).
Definition 4.4 Let (X ,B, P0) be a probability space. Let F be a class of measurable
functions f : X → R. Let Pn be the empirical measure of n i.i.d. copies X1, . . . , Xn
of X ∼ P0. Then F is a P0-Glivenko-Cantelli class if
\[
\|P_n - P_0\|_{\mathcal{F}}^* = \sup_{f \in \mathcal{F}} |(P_n - P_0) f|^* \to_{a.s.} 0.
\]
If the statement above holds for all probability measures P on (X ,B), then F is called
a universal Glivenko-Cantelli class.
Definition 4.5 Let F be a class of measurable functions f : X → R. An envelope
function of the class F is any function F : X → R such that |f(x)| ≤ F (x) for every
x ∈ X and f ∈ F .
We now give a direct proof of Hellinger consistency of the MLE.
Theorem 4.6
h(pFn, pF0) →a.s. 0. (4.6)
Proof: We first give an outline of the proof. Let
P = {pF : F ∈ FK}. (4.7)
Since FK is convex, it follows that P is convex. Hence, we can use the following
inequality for convex classes P:
\[
h^2(p_{F_n}, p_{F_0}) \le (P_n - P_0)\, \phi(p_{F_n}/p_{F_0}), \tag{4.8}
\]
where φ(t) = (t − 1)/(t + 1) (Van der Vaart and Wellner (2000, Proposition 3); see also
Pfanzagl (1988) and Van de Geer (1993, 1996)). This inequality shows that Hellinger
consistency of the MLE follows if
P1 = {φ(pF /pF0) : F ∈ FK} (4.9)
is a P0-Glivenko-Cantelli class.
In the remainder, we prove that P1 is indeed a P0-Glivenko-Cantelli class. We start
by showing that P is a P0-Glivenko-Cantelli class. Note that P is a class of functions on the space X = {(t, ek) : t ∈ R, k = 1, . . . , K + 1}. The spaces Xk = {(t, ek) : t ∈ R}, k = 1, . . . , K + 1, form a partition of X . Define Pk = {pF 1Xk : F ∈ FK} for k = 1, . . . , K + 1. Then we can use Theorem 4 of Van der Vaart and Wellner (2000), which states that P is P0-Glivenko-Cantelli if Pk, k = 1, . . . , K + 1, are P0-Glivenko-Cantelli and P has a P0-integrable envelope function. Note that Pk = {Fk : F ∈ FK} for k = 1, . . . , K, and PK+1 = {1 − F+ : F ∈ FK}. Thus, for each k = 1, . . . , K + 1, Pk
consists of monotone functions bounded by one. Hence, they are universal Glivenko-Cantelli classes (Van der Vaart and Wellner (1996, Theorem 2.4.1, page 122, and
Theorem 2.7.5, page 159); see also Birman and Solomjak (1967) and Van de Geer
(1991)). Furthermore, the class P has an integrable envelope function f(t, ek) = 1 for
all k = 1, . . . , K + 1. Hence, it follows that P is P0-Glivenko-Cantelli.
Next, we use Theorem 3 of Van der Vaart and Wellner (2000), which states that
H = ψ(P1, . . . ,Pj) is P0-Glivenko-Cantelli if the following conditions hold: P1, . . . ,Pj are P0-Glivenko-Cantelli, ψ : Rj → R is a continuous function, and H has a P0-integrable envelope. First, we apply this theorem to
P2 = ψ(P, {p−1F0}), (4.10)
where ψ(t, s) = ts. Note that {p−1F0} is a P0-Glivenko-Cantelli class, because it consists of a single integrable function: P0 p−1F0 = ∫ 1 dµ with µ = # × G. Furthermore, note that p−1F0 is an envelope for P2. Hence, P2 has an integrable envelope. Since ψ(t, s) is a continuous function, it follows that P2 is a P0-Glivenko-Cantelli class.
Finally, note that P1 = φ(P2) with φ(t) = (1− t)/(1 + t). Since φ is a continuous
function which is bounded by one, it follows that P1 = φ(P2) is a P0-Glivenko-Cantelli
class. □
We now give several corollaries of Theorems 4.1 and 4.6, which yield consistency
of the estimators in total variation distance and Lr(G) for r > 0.
Corollary 4.7
\[
d_{TV}(p_{k,F_{nk}}, p_{k,F_{0k}}) = 2 \int \left| F_{nk} - F_{0k} \right| dG \to_{a.s.} 0, \qquad k = 1, \ldots, K + 1, \tag{4.11}
\]
\[
d_{TV}(p_{F_n}, p_{F_0}) = \sum_{k=1}^{K+1} \int \left| F_{nk} - F_{0k} \right| dG \to_{a.s.} 0. \tag{4.12}
\]
Proof: This follows directly from the second part of the well-known inequality relating Hellinger distance and total variation distance:
\[
h^2(p_{F_1}, p_{F_2}) \le d_{TV}(p_{F_1}, p_{F_2}) \le \sqrt{2}\, h(p_{F_1}, p_{F_2}). \qquad \Box
\]
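A quick numeric sanity check of these inequalities on random discrete distributions (illustration only; here the total variation distance is taken with the common convention of half the L1 distance between the densities):

```python
import numpy as np

# Check h^2 <= d_TV <= sqrt(2) * h on random pairs of discrete distributions,
# with d_TV taken as half the L1 distance between the densities.
rng = np.random.default_rng(0)
for _ in range(1000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    h2 = 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)  # squared Hellinger distance
    tv = 0.5 * np.sum(np.abs(p - q))                    # total variation distance
    assert h2 <= tv + 1e-12
    assert tv <= np.sqrt(2.0 * h2) + 1e-12
print("h^2 <= d_TV <= sqrt(2) * h holds on all sampled pairs")
```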
Corollary 4.8 For any r ≥ 1, we have
\[
\sum_{k=1}^{K+1} \int \left| F_{nk}(t) - F_{0k}(t) \right|^r dG(t) \to_{a.s.} 0, \tag{4.13}
\]
\[
\sum_{k=1}^{K} \int \left| F_{nk}(t) - F_{0k}(t) \right|^r dG(t) \to_{a.s.} 0. \tag{4.14}
\]
Proof: This follows directly from Corollary 4.7 and the inequality |a − b|^r ≤ |a − b| for a, b ∈ [0, 1] and r ≥ 1. □
4.2 Local and uniform consistency
It is clear from the Lr(G) consistency that the observation time distribution G plays
a key role in obtaining local consistency of the estimators. For example, it follows
immediately that one cannot expect consistency on intervals on which G has no mass.
We make this observation more precise, and give several different conditions under
which we obtain local or uniform consistency, using techniques from Section 3 of
Schick and Yu (2000). We only give proofs for the MLE, since the proofs for the
naive estimator are analogous and follow almost directly from Schick and Yu (2000).
We start with a simple corollary that is analogous to Corollary 2.3 of Schick and Yu
(2000).
Corollary 4.9 For each point a with G(a) > 0 we have:
Fnk(a) →a.s. F0k(a), k = 1, . . . , K + 1,
Fnk(a) →a.s. F0k(a), k = 1, . . . , K.
Proof: Note that
\[
G(a) \sum_{k=1}^K \left| F_{nk}(a) - F_{0k}(a) \right| \le \sum_{k=1}^K \int \left| F_{nk}(t) - F_{0k}(t) \right| dG(t) \to_{a.s.} 0
\]
by (4.14) with r = 1. Hence, if G(a) > 0, it follows that
\[
\sum_{k=1}^K \left| F_{nk}(a) - F_{0k}(a) \right| \to_{a.s.} 0,
\]
or equivalently, |Fnk(a) − F0k(a)| →a.s. 0 for all k = 1, . . . , K. □
We now introduce some terminology used by Schick and Yu (2000).
Definition 4.10 Let a ∈ R. We say that a is a support point of G if G((a − ε, a + ε)) > 0 for every ε > 0. We say that a is regular if G((a − ε, a]) > 0 and G([a, a + ε)) > 0 for every ε > 0. We say that a is strongly regular if G((a − ε, a)) > 0 and G([a, a + ε)) > 0 for every ε > 0. We say that a is a point of increase of a distribution function F if F (a + ε) − F (a − ε) > 0 for every ε > 0.
Lemmas 4.11 and 4.12 state some properties of the continuity points and the points
of increase of F01, . . . , F0K and F0+. These properties follow easily from monotonicity
of F01, . . . , F0K .
Lemma 4.11 F0+ is continuous at a point t0 if and only if all sub-distribution func-
tions F01, . . . , F0K are continuous at t0.
Lemma 4.12 The point t0 is a point of increase of F0+ if and only if t0 is a point of
increase of at least one of the sub-distribution functions F01, . . . , F0K .
We now define the following set:
\[
\Omega_G = \left\{ \omega \in \Omega : \sum_{k=1}^{K} \int \left| F_{nk}(\cdot\,;\omega) - F_{0k} \right| dG + \sum_{k=1}^{K+1} \int \left| F_{nk}(\cdot\,;\omega) - F_{0k} \right| dG \to 0 \right\}.
\]
By Corollary 4.8 with $r = 1$, we have $P(\Omega_G) = 1$. We can now prove several propositions concerning local consistency. The propositions and proofs are analogous to Schick and Yu (2000), with the difference that we have a collection of sub-distribution functions.
Fix an $\omega \in \Omega_G$. Let $\tilde{F}_k$ be a pointwise limit of $\tilde{F}_{nk}(\cdot\,;\omega)$, meaning that $\tilde{F}_{n'k}(t;\omega) \to \tilde{F}_k(t)$ for all $t \in \mathbb{R}$ along some subsequence $n'$. The existence of such a pointwise limit is guaranteed by Helly's selection theorem (see, e.g., Rudin (1976, page 167)). Similarly, let $\hat{F}_k$ be a pointwise limit of $\hat{F}_{nk}(\cdot\,;\omega)$. We assume without loss of generality that $\lim_{n\to\infty} \tilde{F}_{nk}(t;\omega) = \tilde{F}_k(t)$ and $\lim_{n\to\infty} \hat{F}_{nk}(t;\omega) = \hat{F}_k(t)$ for all $t \in \mathbb{R}$. Let
$$B_k = \bigl\{t \in \mathbb{R} : \tilde{F}_k(t) \ne F_{0k}(t) \text{ or } \hat{F}_k(t) \ne F_{0k}(t)\bigr\}, \qquad k = 1, \ldots, K, \tag{4.15}$$
$$B_{K+1} = \bigl\{t \in \mathbb{R} : \hat{F}_{K+1}(t) \ne F_{0,K+1}(t)\bigr\}, \tag{4.16}$$
$$B = \bigcup_{k=1}^{K+1} B_k. \tag{4.17}$$
By Corollary 4.8, we have G(Bk) = 0 for k = 1, . . . , K + 1, and G(B) = 0. We now
give a proposition that is analogous to Proposition 1 of Schick and Yu (2000).
Proposition 4.13 For each $\omega \in \Omega_G$ and each regular continuity point $a$ of $F_{0+}$,
$$\hat{F}_{nk}(a;\omega) \to F_{0k}(a), \qquad k = 1, \ldots, K+1,$$
$$\tilde{F}_{nk}(a;\omega) \to F_{0k}(a), \qquad k = 1, \ldots, K.$$
Proof: We only prove the result for the MLE. Let $\omega \in \Omega_G$. We need to show that $B$ does not contain regular continuity points of $F_{0+}$. Let $t_0$ be a continuity point of $F_{0+}$. Then $t_0$ is a continuity point of $F_{01}, \ldots, F_{0K}$ by Lemma 4.11. If $t_0 \in B$, then there is a $k \in \{1, \ldots, K\}$ such that $\hat{F}_k(t_0) \ne F_{0k}(t_0)$. Continuity of $F_{0k}$ at $t_0$, and monotonicity of $\hat{F}_k$ and $F_{0k}$, imply there exists an $\epsilon > 0$ such that either $(t_0-\epsilon, t_0] \subseteq B$ or $[t_0, t_0+\epsilon) \subseteq B$. Since $G(B) = 0$, this implies that either $G((t_0-\epsilon, t_0]) = 0$ or $G([t_0, t_0+\epsilon)) = 0$. Hence, $t_0$ is not regular. $\Box$
We obtain the following corollary for a fixed $k \in \{1, \ldots, K\}$, by replacing $B$ by $B_k$ in the proof of Proposition 4.13.

Corollary 4.14 Let $k \in \{1, \ldots, K\}$. Then
$$\hat{F}_{nk}(a;\omega) \to F_{0k}(a) \quad \text{and} \quad \tilde{F}_{nk}(a;\omega) \to F_{0k}(a),$$
for all regular continuity points $a$ of $F_{0k}$.
Such corollaries can be derived for many of the results that follow. However, we will
not point this out each time, and focus on the joint consistency results instead. We
now give a proposition that is analogous to Proposition 2 of Schick and Yu (2000).
Proposition 4.15 Suppose every point in an open interval $(a, b)$ is a support point of $G$. Then
$$\hat{F}_{nk}(t;\omega) \to F_{0k}(t), \qquad k = 1, \ldots, K+1,$$
$$\tilde{F}_{nk}(t;\omega) \to F_{0k}(t), \qquad k = 1, \ldots, K,$$
for every continuity point $t$ of $F_{0+}$ in $(a, b)$ and every $\omega \in \Omega_G$. If also $F_{0+}(a) = 0$ and $F_{0+}(b-) = 1$, then
$$\hat{F}_{nk}(t;\omega) \to F_{0k}(t), \qquad k = 1, \ldots, K,$$
for all continuity points $t$ of $F_{0+}$ and all $\omega \in \Omega_G$.
Proof: We only prove the result for the MLE. Let $t_0 \in (a, b)$ be a continuity point of $F_{0+}$. Then $t_0$ is a continuity point of $F_{01}, \ldots, F_{0K}$ by Lemma 4.11. Suppose $t_0 \in B$. This implies that there exists a $k \in \{1, \ldots, K\}$ such that $\hat{F}_k(t_0) \ne F_{0k}(t_0)$. Since $F_{0k}$ is continuous at $t_0$, there exists an $\epsilon > 0$ such that either $(t_0-\epsilon, t_0] \subseteq B$ or $[t_0, t_0+\epsilon) \subseteq B$. Furthermore, since $t_0 \in (a, b)$ and all points in $(a, b)$ are support points of $G$, there exist support points $t_1$ and $t_2$ of $G$ and an $\eta > 0$ such that $(t_1-\eta, t_1+\eta) \subseteq (t_0-\epsilon, t_0]$ and $(t_2-\eta, t_2+\eta) \subseteq [t_0, t_0+\epsilon)$. This leads to the contradiction $G(B) > 0$. Hence, $B$ does not contain continuity points $t \in (a, b)$ of $F_{0+}$. This proves the first part of the proposition.

If $F_{0+}(a) = 0$ and $F_{0+}(b-) = 1$, then $F_{0k}(a) = 0$ and $F_{0k}(b-) = \lim_{t\to\infty} F_{0k}(t)$ for all $k = 1, \ldots, K$. In this case we obtain $\hat{F}_k(t) = F_{0k}(t)$ for all continuity points $t$ of $F_{0+}$ and for all $k = 1, \ldots, K$. This follows from the monotonicity of $\hat{F}_k$ and $F_{0k}$, and the fact that $\hat{F}_+$ and $F_{0+}$ are bounded by zero and one. This second part does not follow automatically for the naive estimator, since $\tilde{F}_{n+}$ is not bounded by one. $\Box$
Next, we give propositions that are analogous to Propositions 3 and 4 of Schick and
Yu (2000).
Proposition 4.16 If every point of increase of $F_{0+}$ is strongly regular, then
$$\hat{F}_{nk}(t;\omega) \to F_{0k}(t), \qquad k = 1, \ldots, K+1,$$
$$\tilde{F}_{nk}(t;\omega) \to F_{0k}(t), \qquad k = 1, \ldots, K,$$
for all continuity points $t$ of $F_{0+}$ and all $\omega \in \Omega_G$.
Proof: We only prove the result for the MLE. Suppose every point of increase of $F_{0+}$ is strongly regular. We show that $B$ does not contain continuity points of $F_{0+}$. First, let $t_0$ be a continuity point of $F_{0+}$. If $t_0$ is a point of increase of $F_{0+}$, then it must be strongly regular, and hence regular. Proposition 4.13 then implies that $t_0$ cannot belong to $B$.

Now let $t_0$ be a continuity point, but not a point of increase, of $F_{0+}$. Then $t_0$ is not a point of increase of any of $F_{01}, \ldots, F_{0K}$ by Lemma 4.12, and it is a continuity point of $F_{01}, \ldots, F_{0K}$ by Lemma 4.11. We now show by contradiction that $t_0$ does not belong to $B$. Thus, suppose $t_0 \in B$. Then there is a $k \in \{1, \ldots, K\}$ such that $\hat{F}_k(t_0) \ne F_{0k}(t_0)$. This means that $\hat{F}_k(t_0) > F_{0k}(t_0)$ or $\hat{F}_k(t_0) < F_{0k}(t_0)$. In either case we can derive the contradiction $G(B) > 0$. Suppose first that $\hat{F}_k(t_0) > F_{0k}(t_0)$. Then $b = \sup\{t : F_{0k}(t) = F_{0k}(t_0)\}$ is a point of increase of $F_{0k}$, $b > t_0$, and $\hat{F}_k(b-) \ge \hat{F}_k(t_0) > F_{0k}(t_0) = F_{0k}(b-)$. Hence, $(t_0, b) \subseteq B$ and, since $b$ is strongly regular by assumption, $G(B) \ge G((t_0, b)) > 0$. Suppose now that $\hat{F}_k(t_0) < F_{0k}(t_0)$. Then $a = \inf\{t \in \mathbb{R} : F_{0k}(t) = F_{0k}(t_0)\}$ is a point of increase of $F_{0k}$, $a < t_0$, and $\hat{F}_k(a) \le \hat{F}_k(t_0) < F_{0k}(t_0) = F_{0k}(a)$. Hence $(a, t_0) \subseteq B$ and, since $a$ is strongly regular by assumption, $G(B) \ge G((a, t_0)) > 0$. This shows that $B$ does not contain continuity points of $F_{0+}$. $\Box$
Proposition 4.17 Suppose $F_{0+}$ is continuous and that, for all $a < b$, $0 < F_{0+}(a) < F_{0+}(b) < 1$ implies that $G((a, b)) > 0$. Then the naive estimator and the MLE are uniformly strongly consistent, i.e.,
$$\sup_{t\in\mathbb{R}} \bigl|\hat{F}_{nk}(t) - F_{0k}(t)\bigr| \to_{a.s.} 0, \qquad k = 1, \ldots, K+1,$$
$$\sup_{t\in\mathbb{R}} \bigl|\tilde{F}_{nk}(t) - F_{0k}(t)\bigr| \to_{a.s.} 0, \qquad k = 1, \ldots, K.$$
Proof: We only prove the result for the MLE, under the assumptions of the proposition. Suppose $B$ contains a point $t_0$. Then there is a $k \in \{1, \ldots, K\}$ such that $\hat{F}_k(t_0) \ne F_{0k}(t_0)$. Since $F_{0+}$ is continuous, all sub-distribution functions $F_{01}, \ldots, F_{0K}$ are continuous (Lemma 4.11). Hence, we can construct an open interval $(a, b) \subseteq B$ that contains a point of increase of $F_{0k}$. Every point of increase of $F_{0k}$ is a point of increase of $F_{0+}$, by Lemma 4.12. Hence, $F_{0+}(b) > F_{0+}(a)$, so that by assumption $G((a, b)) > 0$. This leads to the contradiction $G(B) \ge G((a, b)) > 0$. Hence, $B$ is empty. This implies that $\hat{F}_{nk} \to F_{0k}$ pointwise, and this convergence is uniform since $F_{0k}$ is continuous. $\Box$
Finally, we prove a proposition that is analogous to Proposition 5 of Schick and Yu
(2000).
Proposition 4.18 Suppose the following four conditions hold for real numbers $\tau_1 < \tau_2$:
(a) $F_{0+}$ is continuous at every point in the interval $(\tau_1, \tau_2]$;
(b) either $G(\{\tau_1\}) > 0$ or $F_{0+}(\tau_1) = 0$;
(c) either $G(\{\tau_2\}) > 0$ or $F_{0+}(\tau_2-) = 1$;
(d) for all $a$ and $b$ in $(\tau_1, \tau_2)$, $0 < F_{0+}(a) < F_{0+}(b) < 1$ implies $G((a, b)) > 0$.
Then the MLE is uniformly strongly consistent on $[\tau_1, \tau_2]$:
$$\sup_{t\in[\tau_1,\tau_2]} \bigl|\hat{F}_{nk}(t) - F_{0k}(t)\bigr| \to_{a.s.} 0, \qquad k = 1, \ldots, K.$$
Proof: We only prove the result for the MLE, and for the case that $G(\{\tau_1\}) > 0$ and $F_{0+}(\tau_2-) = 1$. Note that $F_{0+}(\tau_2-) = 1$ implies that $F_{0k}(\tau_2-) = \lim_{t\to\infty} F_{0k}(t)$ for all $k = 1, \ldots, K$. We show that $B' = B \cap [\tau_1, \tau_2] = \emptyset$. This implies that $\hat{F}_{nk}(t) \to F_{0k}(t)$ for all $t \in [\tau_1, \tau_2]$. Since $F_{0+}$ is continuous, all sub-distribution functions $F_{01}, \ldots, F_{0K}$ are continuous (Lemma 4.11), and hence this convergence is uniform on $[\tau_1, \tau_2]$.

It follows from Corollary 4.9 that $\hat{F}_k(\tau_1) = F_{0k}(\tau_1)$ for all $k = 1, \ldots, K$. This gives the desired result if $F_{0+}(\tau_1) = 1$, using the monotonicity of $F_{0k}$ and $\hat{F}_k$ and the fact that $F_{0+}$ and $\hat{F}_+$ are bounded by one. (Note that this does not follow automatically for the naive estimator, since $\tilde{F}_{n+}$ is not bounded by one.) Therefore, assume $F_{0+}(\tau_1) < 1$. We need to show that $B'$ is empty. Suppose $B'$ contains a point $t_0$. This implies there is a $k \in \{1, \ldots, K\}$ such that $\hat{F}_k(t_0) \ne F_{0k}(t_0)$. We can then use the continuity of $F_{0k}$, the monotonicity of $\hat{F}_k$ and $F_{0k}$, and $\hat{F}_k(\tau_1) = F_{0k}(\tau_1) < F_{0k}(\tau_2-) = \lim_{t\to\infty} F_{0k}(t)$ to show that $B'$ contains an interval $(a, b)$, strictly contained in $(\tau_1, \tau_2)$, such that $0 < F_{0k}(a) < F_{0k}(b) < \lim_{t\to\infty} F_{0k}(t)$. This implies $0 < F_{0+}(a) < F_{0+}(b) < 1$, and hence by assumption $G((a, b)) > 0$. This gives the contradiction $G(B') \ge G((a, b)) > 0$. We can conclude that $B'$ is empty. $\Box$
Remark 4.19 The consistency results of this section show that the observation time distribution $G$ plays a key role in the local consistency of the estimators. This observation is important for the design of clinical trials. For example, if we let $G$ have a positive density on an interval $(a, b)$, then the estimators $\tilde{F}_{nk}$ and $\hat{F}_{nk}$ are consistent at all continuity points of $F_{0k}$ in $(a, b)$ by Proposition 4.15. On the other hand, if we choose $G$ to have zero mass on an interval $(a, b)$, then we cannot expect the estimators $\tilde{F}_{nk}$ and $\hat{F}_{nk}$ to be consistent on this interval.
Chapter 5
RATE OF CONVERGENCE
The Hellinger rate of convergence of the naive estimator is $n^{1/3}$. This follows from Van de Geer (1996) or Van der Vaart and Wellner (1996, Theorem 3.4.4, page 327). Under certain regularity conditions, the local rate of convergence of the naive estimator is also $n^{1/3}$. This follows from Groeneboom and Wellner (1992, Lemma 5.4, page 95). Furthermore, this local rate result implies that the distance between two successive jump points of $\tilde{F}_{nk}$ around a point $t_0$ is of order $O_p(n^{-1/3})$.

In this chapter we discuss similar results for the MLE. In Section 5.1 we show that the global rate of convergence is $n^{1/3}$. Subsequently, we prove in Section 5.2 that $n^{1/3}$ is an asymptotic local minimax lower bound for the rate of convergence, meaning that no estimator can converge locally at a rate faster than $n^{1/3}$, in a minimax sense. Hence, the naive estimator converges locally at the optimal rate. Since the MLE is expected to be at least as good as the naive estimator, one may expect that the MLE also converges locally at the optimal rate of $n^{1/3}$. This is indeed the case, and this is proved in Section 5.3. Our main tool for proving this result is Theorem 5.10, which gives a uniform rate of convergence of $\hat{F}_{n+}$ on a fixed neighborhood of a point, rather than on the usual shrinking neighborhood of order $n^{-1/3}$. Technical lemmas and proofs are collected in Section 5.4.
5.1 Hellinger rate of convergence
We prove the global rate of convergence of the MLE using the rate theorem for M-
estimators of Van der Vaart and Wellner (1996, Theorem 3.4.1, page 322). A slightly
simplified version of this theorem can be found in Wellner (2003):
Theorem 5.1 Let $\{\mathbb{M}_n, n \ge 1\}$ be stochastic processes indexed by a set $\Theta$, and let $\mathbb{M} : \Theta \to \mathbb{R}$ be a deterministic function. Furthermore, let
$$\hat{\theta}_n = \operatorname{argmax}_{\theta\in\Theta} \mathbb{M}_n(\theta), \qquad \theta_0 = \operatorname{argmax}_{\theta\in\Theta} \mathbb{M}(\theta),$$
and assume that $\mathbb{M}$ satisfies¹
$$\mathbb{M}(\theta) - \mathbb{M}(\theta_0) \lesssim -d^2(\theta, \theta_0) \tag{5.1}$$
for every $\theta$ in a neighborhood of $\theta_0$. Suppose there exists a $\gamma_0 > 0$ such that for all $n \ge 1$ and $\gamma < \gamma_0$, the centered process $\mathbb{M}_n - \mathbb{M}$ satisfies²
$$E^* \sup_{d(\theta,\theta_0)<\gamma} \bigl|(\mathbb{M}_n - \mathbb{M})(\theta) - (\mathbb{M}_n - \mathbb{M})(\theta_0)\bigr| \lesssim \frac{\phi_n(\gamma)}{\sqrt{n}}, \tag{5.2}$$
where $\phi_n$ are functions such that $\gamma \mapsto \phi_n(\gamma)/\gamma^\alpha$ is decreasing for some $\alpha < 2$ not depending on $n$. Let $r_n$, with $r_n^{-1} < \gamma_0$, satisfy
$$r_n^2\, \phi_n\Bigl(\frac{1}{r_n}\Bigr) \le \sqrt{n}, \quad \text{for every } n.$$
If $\hat{\theta}_n$ satisfies $\mathbb{M}_n(\hat{\theta}_n) \ge \mathbb{M}_n(\theta_0) - O_p(r_n^{-2})$ and converges in (outer) probability to $\theta_0$, then $r_n\, d(\hat{\theta}_n, \theta_0) = O_p(1)$. If the given conditions hold for every $\theta$ and $\gamma$, then the hypothesis that $\hat{\theta}_n$ is consistent is unnecessary.
In order to verify condition (5.2), we will use bracketing numbers. We recall the following definitions, adapted from Van der Vaart and Wellner (1996, pages 83 and 324).

¹The notation $\lesssim$ means "is bounded above up to a universal constant".
²The star indicates an outer integral, see Definition 4.3.
Definition 5.2 Given two functions $l$ and $u$, the bracket $[l, u]$ is the set of all functions $f$ with $l \le f \le u$. An $\epsilon$-bracket w.r.t. a norm $\|\cdot\|$ is a bracket $[l, u]$ with $\|u - l\| < \epsilon$. The bracketing number $N_{[\,]}(\epsilon, \mathcal{F}, \|\cdot\|)$ is the minimum number of $\epsilon$-brackets w.r.t. $\|\cdot\|$ needed to cover $\mathcal{F}$. In this definition, the upper and lower bounds $u$ and $l$ of the brackets need not belong to $\mathcal{F}$ themselves, but they are assumed to have finite norms. The entropy with bracketing, or bracketing entropy, is the logarithm of the bracketing number. Finally, the bracketing integral is defined by
$$J_{[\,]}(\gamma, \mathcal{F}, \|\cdot\|) = \int_0^\gamma \sqrt{1 + \log N_{[\,]}(\epsilon, \mathcal{F}, \|\cdot\|)}\; d\epsilon. \tag{5.3}$$
Recall the definitions of $p_F$ and $h(p_F, p_{F_0})$ in (2.5) and (4.1). Theorem 5.3 gives the Hellinger rate of convergence of the MLE.

Theorem 5.3 $n^{1/3} h(p_{\hat{F}_n}, p_{F_0}) = O_p(1)$.
Proof: We use Theorem 5.1 with $\Theta = \mathcal{F}^K$, $\theta = F = (F_1, \ldots, F_K)$, $d(F, F_0) = h(p_F, p_{F_0})$, $\mathbb{M}_n(F) = \mathbb{P}_n m_{p_F}$, $\mathbb{M}(F) = P_0 m_{p_F}$, $\mathbb{G}_n(F) = \sqrt{n}(\mathbb{M}_n - \mathbb{M})(F)$, and
$$m_{p_F}(t, \delta) = \log\Bigl(\frac{p_F(t,\delta) + p_{F_0}(t,\delta)}{2\, p_{F_0}(t,\delta)}\Bigr).$$
We use Theorem 3.4.4 of Van der Vaart and Wellner (1996, page 327) to verify the conditions of Theorem 5.1. The former theorem directly implies that condition (5.1) of Theorem 5.1 is satisfied. Furthermore, it implies that
$$E^*\|\mathbb{G}_n\|_{\mathcal{M}_\gamma} \lesssim J_{[\,]}(\gamma, \mathcal{P}, h)\Bigl(1 + \frac{J_{[\,]}(\gamma, \mathcal{P}, h)}{\gamma^2\sqrt{n}}\Bigr), \tag{5.4}$$
where $\mathcal{P} = \{p_F : F \in \mathcal{F}^K\}$, $\mathcal{M}_\gamma = \{F \in \mathcal{F}^K : h(p_F, p_{F_0}) < \gamma\}$, and $E^*\|\mathbb{G}_n\|_{\mathcal{M}_\gamma} = E^* \sup_{F\in\mathcal{M}_\gamma} \mathbb{G}_n(F)$. Since $m_{p_{F_0}} \equiv 0$, the key condition (5.2) of Theorem 5.1 can be written as $E^*\|\mathbb{G}_n\|_{\mathcal{M}_\gamma} \lesssim \phi_n(\gamma)$. Thus, a bound on the right side of (5.4) is a candidate for the function $\phi_n(\gamma)$ in (5.2).
In view of (5.3) and (5.4), we need to bound the bracketing entropy $\log N_{[\,]}(\epsilon, \mathcal{P}, h)$. Let $F = (F_1, \ldots, F_K) \in \mathcal{F}^K$, and recall that $F_{K+1} = 1 - F_+$. For each $k = 1, \ldots, K+1$, let $[l_k, u_k]$ be a bracket containing $F_k$, with size $\epsilon/\sqrt{K+1}$ w.r.t. the $L_2(G)$ norm:
$$\bigl\|\sqrt{u_k} - \sqrt{l_k}\bigr\|_{L_2(G)}^2 = \int \bigl(\sqrt{u_k} - \sqrt{l_k}\bigr)^2 dG \le \frac{\epsilon^2}{K+1}. \tag{5.5}$$
Then
$$[p_l(t,\delta),\, p_u(t,\delta)] = \Bigl[\prod_{k=1}^{K+1} l_k(t)^{\delta_k},\; \prod_{k=1}^{K+1} u_k(t)^{\delta_k}\Bigr]$$
is a bracket containing $p_F$, and by assumption (5.5) its size w.r.t. the Hellinger distance is bounded by:
$$h^2(p_l, p_u) = \frac{1}{2}\sum_{k=1}^{K+1} \int \bigl(\sqrt{u_k} - \sqrt{l_k}\bigr)^2 dG \le \epsilon^2.$$
Note that $p_l$ and $p_u$ are typically not in the class $\mathcal{P}$, since we do not require that $l_{K+1} = 1 - l_+$ and $u_{K+1} = 1 - u_+$. However, the upper and lower bounds of the brackets in Definition 5.2 are not required to be in the class $\mathcal{P}$, so this does not pose a problem.
We now count how many brackets $[p_l, p_u]$ we need to cover the class $\mathcal{P}$. First, note that $[\sqrt{l_k}, \sqrt{u_k}]$ contains $\sqrt{F_k}$ for all $k = 1, \ldots, K+1$. Furthermore, note that all $\sqrt{F_k}$, $k = 1, \ldots, K+1$, are contained in the class
$$\mathcal{F} = \{F : \mathbb{R} \to [0,1] \text{ is monotone}\}.$$
It is well-known that
$$\log N_{[\,]}(\delta, \mathcal{F}, L_2(Q)) \lesssim 1/\delta, \tag{5.6}$$
uniformly in $Q$ (see, e.g., Van de Geer (2000, page 18, equation (2.5)) or Van der Vaart and Wellner (1996, Theorem 2.7.5, page 159)). Hence, considering all possible combinations of $(K+1)$-tuples of brackets $[\sqrt{l_k}, \sqrt{u_k}]$ with $\|\sqrt{u_k} - \sqrt{l_k}\|_{L_2(G)} \le \epsilon/\sqrt{K+1}$, it follows that
$$\log N_{[\,]}(\epsilon, \mathcal{P}, h) \le \log\Bigl(N_{[\,]}\bigl(\epsilon/\sqrt{K+1}, \mathcal{F}, L_2(G)\bigr)^{K+1}\Bigr) = (K+1)\log N_{[\,]}\bigl(\epsilon/\sqrt{K+1}, \mathcal{F}, L_2(G)\bigr) \lesssim \frac{(K+1)^{3/2}}{\epsilon}.$$
Dropping the dependence on $K$ (since $K$ is fixed), this implies that $J_{[\,]}(\gamma, \mathcal{P}, h) \lesssim \gamma^{1/2}$, and together with (5.4) we obtain $E^*\|\mathbb{G}_n\|_{\mathcal{M}_\gamma} \lesssim \sqrt{\gamma} + (\gamma\sqrt{n})^{-1}$. Note that $(\sqrt{\gamma} + (\gamma\sqrt{n})^{-1})/\gamma$ is decreasing in $\gamma$. Hence, we can take $\phi_n(\gamma) = \sqrt{\gamma} + (\gamma\sqrt{n})^{-1}$ in Theorem 5.1. We then obtain that $r_n h(p_{\hat{F}_n}, p_{F_0}) = O_p(1)$ provided that $h(p_{\hat{F}_n}, p_{F_0}) \to 0$ in outer probability, and $r_n^2 \phi_n(r_n^{-1}) \le \sqrt{n}$ for all $n$. The first condition is fulfilled by the almost sure Hellinger consistency of the MLE (Theorem 4.6). The second condition holds for $r_n = cn^{1/3}$ and $c = ((\sqrt{5}-1)/2)^{2/3}$. $\Box$
Analogously to the comments directly following Theorem 4.6, Theorem 5.3 implies $n^{1/3} d_{TV}(p_{\hat{F}_n}, p_{F_0}) = O_p(1)$ and $n^{1/3}\|\hat{F}_n - F_0\|_1 = O_p(1)$. Furthermore, we have
$$n^{1/3}\|\hat{F}_n - F_0\|_2 = n^{1/3}\Bigl(\sum_{k=1}^{K} \int \bigl(\hat{F}_{nk} - F_{0k}\bigr)^2 dG\Bigr)^{1/2} = O_p(1), \tag{5.7}$$
since
$$\|F - F_0\|_2^2 = \sum_{k=1}^{K} \int \bigl(\sqrt{F_k} - \sqrt{F_{0k}}\bigr)^2 \bigl(\sqrt{F_k} + \sqrt{F_{0k}}\bigr)^2 dG \le 4\sum_{k=1}^{K} \int \bigl(\sqrt{F_k} - \sqrt{F_{0k}}\bigr)^2 dG \le 8h^2(p_F, p_{F_0}).$$
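This inequality chain rests on two elementary facts: pointwise, $(F_k - F_{0k})^2 = (\sqrt{F_k} - \sqrt{F_{0k}})^2(\sqrt{F_k} + \sqrt{F_{0k}})^2$ with $(\sqrt{F_k} + \sqrt{F_{0k}})^2 \le 4$ for values in $[0,1]$, and $\sum_{k=1}^{K}\int(\sqrt{F_k} - \sqrt{F_{0k}})^2\, dG \le 2h^2(p_F, p_{F_0})$ by the definition of $h$. A numerical sanity check of the pointwise step (illustrative only):

```python
import random

random.seed(0)
for _ in range(10_000):
    a, b = random.random(), random.random()   # values of F_k and F_0k in [0, 1]
    # (a - b)^2 = (sqrt(a) - sqrt(b))^2 (sqrt(a) + sqrt(b))^2
    #          <= 4 (sqrt(a) - sqrt(b))^2, since (sqrt(a) + sqrt(b))^2 <= 4
    assert (a - b) ** 2 <= 4 * (a ** 0.5 - b ** 0.5) ** 2 + 1e-12
```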
5.2 Asymptotic local minimax lower bound

In this section we prove that $n^{1/3}$ is an asymptotic local minimax lower bound for the rate of convergence of $\hat{F}_{nk}$, $k = 1, \ldots, K$. We use the set-up of Groeneboom (1996, Section 4.1, page 132). Let $\mathcal{P}$ be a set of probability densities on a measurable space $(\Omega, \mathcal{A})$ with respect to a $\sigma$-finite dominating measure. We estimate a parameter $\theta = Up \in \mathbb{R}$, where $U$ is a real-valued functional and $p \in \mathcal{P}$. Let $U_n$, $n \ge 1$, be a sequence of estimators based on a sample of size $n$, i.e., $U_n = t_n(Z_1, \ldots, Z_n)$, where $Z_1, \ldots, Z_n$ is a sample from the density $p$, and $t_n : \Omega^n \to \mathbb{R}$ is a Borel measurable function. Let $l : [0,\infty) \to [0,\infty)$ be an increasing convex loss function with $l(0) = 0$. The risk of the estimator $U_n$ in estimating $Up$ is defined by $E_{n,p}\, l(|U_n - Up|)$, where $E_{n,p}$ denotes the expectation with respect to the product measure $P^{\otimes n}$ corresponding to the sample $Z_1, \ldots, Z_n$. We now recall Lemma 4.1 of Groeneboom (1996, page 132).
Lemma 5.4 For any $p_1, p_2 \in \mathcal{P}$ such that the Hellinger distance $h(p_1, p_2) < 1$:
$$\inf_{U_n} \max\bigl\{E_{n,p_1}\, l(|U_n - Up_1|),\; E_{n,p_2}\, l(|U_n - Up_2|)\bigr\} \ge l\Bigl(\tfrac{1}{4}\,|Up_1 - Up_2|\,\bigl(1 - h^2(p_1, p_2)\bigr)^{2n}\Bigr).$$
Let $k \in \{1, \ldots, K\}$. We apply Lemma 5.4 to the estimation of $F_{0k}(t_0)$. Let $U_{nk}$, $n \ge 1$, be a sequence of estimators of $F_{0k}(t_0)$. Furthermore, let $c > 0$ and let $F_n^k = (F_{n1}, \ldots, F_{nK})$ be a perturbation of $F_0$ where only the $k$th component is changed in the following way (see Figure 5.1):
$$F_{nk}(x) = \begin{cases} F_{0k}(t_0 - cn^{-1/3}) & \text{if } x \in [t_0 - cn^{-1/3},\, t_0),\\ F_{0k}(t_0 + cn^{-1/3}) & \text{if } x \in [t_0,\, t_0 + cn^{-1/3}),\\ F_{0k}(x) & \text{otherwise}, \end{cases} \tag{5.8}$$
and $F_{nj}(x) = F_{0j}(x)$ for $j \ne k$. Note that $F_n^k$ is a valid vector of sub-distribution functions with corresponding survival function $F_{n,K+1} = 1 - F_{n+}$.

Figure 5.1: Perturbation used to derive the asymptotic local minimax lower bound.

Proposition 5.5 gives a minimax lower bound for the rate of convergence for estimating $F_{0k}(t_0)$.
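In code, the perturbation (5.8) flattens $F_{0k}$ on the two intervals of length $cn^{-1/3}$ around $t_0$, creating a jump of roughly $2cn^{-1/3}f_{0k}(t_0)$ at $t_0$. A minimal sketch (the helper name and the use of a callable $F_{0k}$ are assumptions for illustration):

```python
def perturbed_F(F0k, t0, c, n):
    """The perturbation F_{nk} of display (5.8); F0k is the true
    sub-distribution function, t0 the point of interest."""
    h = c * n ** (-1 / 3)
    def Fnk(x):
        if t0 - h <= x < t0:
            return F0k(t0 - h)   # flat at the left value on [t0 - h, t0)
        if t0 <= x < t0 + h:
            return F0k(t0 + h)   # flat at the right value on [t0, t0 + h)
        return F0k(x)            # unchanged outside [t0 - h, t0 + h)
    return Fnk

# example with F0k(x) = x, c = 1, n = 1000, so h = 0.1:
F = perturbed_F(lambda x: x, 0.5, 1.0, 1000)
assert abs(F(0.45) - 0.4) < 1e-6   # left plateau
assert abs(F(0.55) - 0.6) < 1e-6   # right plateau; jump of about 2h at t0
assert F(0.3) == 0.3               # untouched away from t0
```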
Proposition 5.5 Fix $k \in \{1, \ldots, K\}$. Let $0 < F_{0k}(t_0) < F_{0k}(\infty)$, and let $F_{0k}$ and $G$ be continuously differentiable at $t_0$ with strictly positive derivatives $f_{0k}(t_0)$ and $g(t_0)$. Then for $r \ge 1$ we have:
$$\liminf_{n\to\infty} n^{r/3} \inf_{U_n} \max\Bigl\{E_{n,p_{F_0}} |U_{nk} - F_{0k}(t_0)|^r,\; E_{n,p_{F_n^k}} |U_{nk} - F_{nk}(t_0)|^r\Bigr\} \ge d^r \Bigl[\frac{g(t_0)}{f_{0k}(t_0)}\Bigl(\frac{1}{F_{0k}(t_0)} + \frac{1}{1 - F_{0+}(t_0)}\Bigr)\Bigr]^{-r/3}, \tag{5.9}$$
where $d = 2^{-5/3} e^{-1/3}$.
Proof: Let $r \ge 1$. We apply Lemma 5.4 with $l(x) = x^r$, $p_1 = p_{F_0}$ and $p_2 = p_{F_n^k}$, where $p_F$ is defined in (2.5). This yields:
$$n^{r/3} \inf_{U_n} \max\Bigl\{E_{n,p_{F_0}} |U_{nk} - F_{0k}(t_0)|^r,\; E_{n,p_{F_n^k}} |U_{nk} - F_{nk}(t_0)|^r\Bigr\} \ge n^{r/3}\Bigl(\frac{1}{4}\,|F_{nk}(t_0) - F_{0k}(t_0)|\,\bigl(1 - h^2(p_{F_n^k}, p_{F_0})\bigr)^{2n}\Bigr)^r. \tag{5.10}$$
We now compute the quantities in this expression. First, continuous differentiability of $F_{0k}$ in a neighborhood around $t_0$ yields
$$n^{1/3}\,|F_{nk}(t_0) - F_{0k}(t_0)| = n^{1/3}\,\bigl|F_{0k}(t_0 + cn^{-1/3}) - F_{0k}(t_0)\bigr| = c f_{0k}(t_0) + o(1). \tag{5.11}$$
Next, we compute the Hellinger distance $h^2(p_{F_0}, p_{F_n^k})$, defined in (4.1). Since $F_{nj} = F_{0j}$ for $j \ne k$, we only need to compute $\int (\sqrt{F_{0j}} - \sqrt{F_{nj}})^2\, dG$ for $j = k$ and $j = K+1$. We first consider $j = k$:
$$\int \bigl(\sqrt{F_{0k}} - \sqrt{F_{nk}}\bigr)^2 dG = \int_{t_0 - cn^{-1/3}}^{t_0} \bigl(\sqrt{F_{0k}} - \sqrt{F_{nk}}\bigr)^2 dG + \int_{t_0}^{t_0 + cn^{-1/3}} \bigl(\sqrt{F_{0k}} - \sqrt{F_{nk}}\bigr)^2 dG. \tag{5.12}$$
Using the definition of $F_{nk}$ in (5.8), the condition $F_{0k}(t_0) > 0$, and the continuous differentiability of $G$ and $F_{0k}$ in a neighborhood around $t_0$, we can write the first term of (5.12) as
$$\int_{t_0 - cn^{-1/3}}^{t_0} \Bigl(\sqrt{F_{0k}(t)} - \sqrt{F_{0k}(t_0 - cn^{-1/3})}\Bigr)^2 dG(t) = \int_{t_0 - cn^{-1/3}}^{t_0} g(t_0)\,\bigl(t - t_0 + cn^{-1/3}\bigr)^2\, \frac{(f_{0k}(t_0))^2}{4F_{0k}(t_0)}\, dt + o(n^{-1}) = \frac{1}{n}\,\frac{(f_{0k}(t_0))^2 g(t_0)\, c^3}{12\, F_{0k}(t_0)} + o(n^{-1}).$$
Using an analogous derivation for the second term of (5.12), we obtain
$$\int \bigl(\sqrt{F_{0k}} - \sqrt{F_{nk}}\bigr)^2 dG = \frac{1}{n}\,\frac{(f_{0k}(t_0))^2 g(t_0)\, c^3}{6\, F_{0k}(t_0)} + o(n^{-1}).$$
Similarly, for $j = K+1$, we get
$$\int \bigl(\sqrt{F_{0,K+1}} - \sqrt{F_{n,K+1}}\bigr)^2 dG = \frac{1}{n}\,\frac{(f_{0k}(t_0))^2 g(t_0)\, c^3}{6\, F_{0,K+1}(t_0)} + o(n^{-1}),$$
so that
$$h^2(p_{F_0}, p_{F_n^k}) = \frac{1}{2n}\cdot\frac{1}{6}\, g(t_0)\, c^3 (f_{0k}(t_0))^2 \Bigl(\frac{1}{F_{0k}(t_0)} + \frac{1}{F_{0,K+1}(t_0)}\Bigr) + o(n^{-1}). \tag{5.13}$$
Plugging the expressions (5.11) and (5.13) into the lower bound (5.10), and using that $\lim_{n\to\infty}(1 + x/n)^n = \exp(x)$, gives the asymptotic lower bound
$$\Bigl[\frac{1}{4}\, c f_{0k}(t_0) \exp\Bigl(-\frac{1}{6}\, g(t_0)\, c^3 (f_{0k}(t_0))^2 \Bigl(\frac{1}{F_{0k}(t_0)} + \frac{1}{1 - F_{0+}(t_0)}\Bigr)\Bigr)\Bigr]^r. \tag{5.14}$$
The maximum of (5.14) over $c$ is attained at
$$c = \Bigl(\frac{1}{2}\, g(t_0)\, (f_{0k}(t_0))^2 \Bigl(\frac{1}{F_{0k}(t_0)} + \frac{1}{1 - F_{0+}(t_0)}\Bigr)\Bigr)^{-1/3}$$
and its value is given in (5.9). $\Box$
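The maximization in the last step is of the form $c \mapsto \frac{1}{4}cf_{0k}(t_0)\exp(-\frac{1}{6}g(t_0)c^3(f_{0k}(t_0))^2 A)$ with $A = 1/F_{0k}(t_0) + 1/(1 - F_{0+}(t_0))$; differentiating gives the stated maximizer, and substituting it back produces the constant $d = 2^{-5/3}e^{-1/3}$ in (5.9). A numerical check with illustrative parameter values (the values are arbitrary assumptions, not from the thesis):

```python
import math

g, f0k, F0k, F0p = 1.3, 0.7, 0.4, 0.6        # g(t0), f0k(t0), F0k(t0), F0+(t0)
A = 1 / F0k + 1 / (1 - F0p)

def bound(c):
    # the expression inside [...]^r in display (5.14), for r = 1
    return 0.25 * c * f0k * math.exp(-g * c ** 3 * f0k ** 2 * A / 6)

c_star = (0.5 * g * f0k ** 2 * A) ** (-1 / 3)        # claimed maximizer
d = 2 ** (-5 / 3) * math.exp(-1 / 3)
claimed_max = d * (g / f0k * A) ** (-1 / 3)          # right side of (5.9), r = 1

assert abs(bound(c_star) - claimed_max) < 1e-12      # maximum equals (5.9)
# c_star beats a grid of competitors
assert all(bound(c_star) + 1e-12 >= bound(0.01 * i) for i in range(1, 1000))
```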
Remark 5.6 Note that the lower bound (5.9) consists of a part depending on the underlying distribution, and a universal constant $d$. It is not clear whether the constant depending on the underlying distribution is sharp, because it has not been proved that any estimator achieves this constant. However, we do know that the naive estimator $\tilde{F}_{nk}$ generally does not achieve this constant. To see this, recall that $\tilde{F}_{nk}$ is the MLE for the reduced data $(T_i, \Delta_{ki})$, $i = 1, \ldots, n$. Hence, its asymptotic risk is bounded below by the asymptotic local minimax lower bound for current status data:
$$d^r \Bigl[\frac{g(t_0)}{f_{0k}(t_0)}\Bigl(\frac{1}{F_{0k}(t_0)} + \frac{1}{1 - F_{0k}(t_0)}\Bigr)\Bigr]^{-r/3} \tag{5.15}$$
(see Groeneboom (1996, page 135, equation (4.2)), or take $K = 1$ in Proposition 5.5). Since $1 - F_{0k}(t_0) > 1 - F_{0+}(t_0)$ if $F_{0j}(t_0) > 0$ for some $j \in \{1, \ldots, K\}$, $j \ne k$, this bound is larger than the lower bound of Proposition 5.5.
We can also apply a generalized version of Lemma 5.4 to the vector of components $(\hat{F}_{n1}, \ldots, \hat{F}_{nK})$. To do this, we use the following set-up. Let $\mathcal{P}$ be a set of probability densities on a measurable space $(\Omega, \mathcal{A})$ with respect to a $\sigma$-finite dominating measure. We estimate a parameter $\theta = Up \in \mathbb{R}^K$, where $U$ is a vector-valued functional and $p \in \mathcal{P}$. Let $U_n$, $n \ge 1$, be a sequence of estimators based on a sample of size $n$. Let $l : [0,\infty) \to [0,\infty)$ be an increasing convex loss function with $l(0) = 0$. The risk of the estimator $U_n$ in estimating $Up$ is defined by $E_{n,p}\, l(\|U_n - Up\|)$, where $\|\cdot\|$ is a norm on $\mathbb{R}^K$. We now state a generalized version of Lemma 5.4, which can be derived by replacing the absolute values $|\cdot|$ in the proof of Lemma 5.4 by norms $\|\cdot\|$.

Lemma 5.7 For any $p_1, p_2 \in \mathcal{P}$ such that the Hellinger distance $h(p_1, p_2) < 1$:
$$\inf_{U_n} \max\bigl\{E_{n,p_1}\, l(\|U_n - Up_1\|),\; E_{n,p_2}\, l(\|U_n - Up_2\|)\bigr\} \ge l\Bigl(\tfrac{1}{4}\,\|Up_1 - Up_2\|\,\bigl(1 - h^2(p_1, p_2)\bigr)^{2n}\Bigr).$$
We apply this lemma to the estimation of $(F_{01}(t_0), \ldots, F_{0K}(t_0))$. Let $U_n$, $n \ge 1$, be a sequence of estimators for $F_0(t_0) = (F_{01}(t_0), \ldots, F_{0K}(t_0))$. For $c > 0$, let $F_n = (F_{n1}, \ldots, F_{nK})$ be a perturbation of $F_0$, where each component is changed in the following way (see Figure 5.1):
$$F_{nk}(x) = \begin{cases} F_{0k}(t_0 - cn^{-1/3}) & \text{if } x \in [t_0 - cn^{-1/3},\, t_0),\\ F_{0k}(t_0 + cn^{-1/3}) & \text{if } x \in [t_0,\, t_0 + cn^{-1/3}),\\ F_{0k}(x) & \text{otherwise}. \end{cases}$$
Note that $(F_{n1}(x), \ldots, F_{nK}(x))$ is a valid vector of sub-distribution functions with corresponding survival function $F_{n,K+1}(x) = 1 - F_{n+}(x)$.
Proposition 5.8 For each $k \in \{1, \ldots, K\}$, let $0 < F_{0k}(t_0) < F_{0k}(\infty)$, and let $F_{0k}$ and $G$ be continuously differentiable at $t_0$ with positive derivatives. Then, for $r \ge 1$ and any norm $\|\cdot\|$ on $\mathbb{R}^K$, we have:
$$\liminf_{n\to\infty} n^{r/3} \inf_{U_n} \max\Bigl\{E_{n,p_{F_0}} \|U_n - F_0(t_0)\|^r,\; E_{n,p_{F_n}} \|U_n - F_n(t_0)\|^r\Bigr\} \ge d^r \Biggl[\|f_0(t_0)\| \Bigl(g(t_0) \sum_{k=1}^{K+1} \frac{(f_{0k}(t_0))^2}{F_{0k}(t_0)}\Bigr)^{-1/3}\Biggr]^r, \tag{5.16}$$
where $d = 2^{-5/3} e^{-1/3}$.
Proof: Let $r \ge 1$. We apply Lemma 5.7 with $l(x) = x^r$, $p_1 = p_{F_0}$ and $p_2 = p_{F_n}$, where $p_F$ is defined in (2.5). This yields:
$$n^{r/3} \inf_{U_n} \max\Bigl\{E_{n,p_{F_0}} \|U_n - F_0(t_0)\|^r,\; E_{n,p_{F_n}} \|U_n - F_n(t_0)\|^r\Bigr\} \ge n^{r/3}\Bigl(\frac{1}{4}\,\|F_n(t_0) - F_0(t_0)\|\,\bigl(1 - h^2(p_{F_n}, p_{F_0})\bigr)^{2n}\Bigr)^r. \tag{5.17}$$
We now compute the quantities in the expression on the right side. Analogously to the proof of Proposition 5.5, we get that
$$\int \bigl(\sqrt{F_{0k}} - \sqrt{F_{nk}}\bigr)^2 dG = \frac{1}{n}\,\frac{(f_{0k}(t_0))^2 g(t_0)\, c^3}{6\, F_{0k}(t_0)} + o(n^{-1})$$
for $k = 1, \ldots, K$. Furthermore, using $F_{0+}(t_0) < 1$, we get
$$\int \bigl(\sqrt{F_{0,K+1}} - \sqrt{F_{n,K+1}}\bigr)^2 dG = \frac{1}{n}\,\frac{\bigl(\sum_{k=1}^K f_{0k}(t_0)\bigr)^2 g(t_0)\, c^3}{6\, F_{0,K+1}(t_0)} + o(n^{-1}) = \frac{1}{n}\,\frac{(f_{0,K+1}(t_0))^2 g(t_0)\, c^3}{6\, F_{0,K+1}(t_0)} + o(n^{-1}).$$
Hence,
$$h^2(p_{F_0}, p_{F_n}) = \frac{1}{2}\sum_{k=1}^{K+1} \int \bigl(\sqrt{F_{0k}} - \sqrt{F_{nk}}\bigr)^2 dG = \frac{1}{2n}\cdot\frac{1}{6}\, g(t_0)\, c^3 \sum_{k=1}^{K+1} \frac{(f_{0k}(t_0))^2}{F_{0k}(t_0)} + o(n^{-1}).$$
Furthermore, the continuous differentiability of $F_0$ in a neighborhood around $t_0$ yields that
$$n^{1/3}\,\|F_n(t_0) - F_0(t_0)\| = n^{1/3}\,\|F_0(t_0 + cn^{-1/3}) - F_0(t_0)\| = c\,\|f_0(t_0)\| + o(1).$$
Analogously to the proof of Proposition 5.5, plugging these expressions into the lower bound (5.17) and using that $\lim_{n\to\infty}(1 + x/n)^n = \exp(x)$ yields the following asymptotic lower bound:
$$\Bigl[\frac{1}{4}\, c\,\|f_0(t_0)\| \exp\Bigl(-\frac{1}{6}\, g(t_0)\, c^3 \sum_{k=1}^{K+1} \frac{(f_{0k}(t_0))^2}{F_{0k}(t_0)}\Bigr)\Bigr]^r. \tag{5.18}$$
The maximum of (5.18) over $c$ is attained at
$$c = \Bigl(\frac{1}{2}\, g(t_0) \sum_{k=1}^{K+1} \frac{(f_{0k}(t_0))^2}{F_{0k}(t_0)}\Bigr)^{-1/3}$$
and its value is given in (5.16). $\Box$
5.3 Local rate of convergence

As mentioned in the introduction of this chapter, the $n^{1/3}$ local rate of convergence of the naive estimator and the $n^{1/3}$ local minimax lower bound for the rate of convergence suggest that the MLE converges locally at rate $n^{1/3}$. This is indeed the case, and we now give the proof of this result. Although this result is intuitively clear, the proof is rather involved and requires new methods. The main difficulties are that the MLE has no closed form, and that we have to handle the system of sub-distribution functions.

There are currently no general methods available to prove the local rate of convergence of the maximum likelihood estimator in similar estimation problems. This is in contrast to the global rate of convergence, for which there are fairly standard methods from empirical process theory. Thus, the local rate of convergence is still proved on a case-by-case basis. The common theme in existing proofs is to rely heavily on the
characterization of the MLE in terms of Fenchel conditions (see, e.g., Groeneboom
and Wellner (1992) for case 2 interval censored data, and Groeneboom, Jongbloed
and Wellner (2001b) for convex density estimation). This is done because the MLE
has no closed form in these problems, so that the characterization is all one has to
work with. We will use this approach as well.
The outline of this section is as follows. In Section 5.3.1 we revisit the Fenchel conditions. These conditions will show that the term $\hat{F}_{n+}$ plays an important role. Therefore, in Section 5.3.2 we first prove a rate result for $\hat{F}_{n+}$ (Theorem 5.10). This rate result is stronger than the usual local rate result, because it holds uniformly on a fixed neighborhood of a point $t_0$, instead of on the usual shrinking neighborhood of order $n^{-1/3}$. In Remark 5.11, we discuss the meaning of Theorem 5.10 by comparing it to several existing results for current status data without competing risks. Subsequently, we give the proof of Theorem 5.10. Finally, in Section 5.3.3 we use Theorem 5.10 to prove the local rate of convergence for the components $\hat{F}_{n1}, \ldots, \hat{F}_{nK}$ in Theorem 5.20. Technical lemmas and proofs are deferred to Section 5.4. Throughout, we assume that for each $k \in \{1, \ldots, K\}$, $\hat{F}_{nk}$ is piecewise constant and right-continuous, with jumps only at points in $\mathcal{T}_k$ (see Definition 2.22).
5.3.1 Revisiting the Fenchel conditions

Assume without loss of generality that $\hat{F}_{n+}(\infty) = 1$ and recall the definition of $G_{n,\hat{F}_n}$ in (2.47). Let $\tau_{nk}$ be a jump point of $\hat{F}_{nk}$, and let $\tau_{nk} < s$. Then Proposition 2.36 implies that
$$\int_{[\tau_{nk},s)} \delta_k\, d\mathbb{P}_n(u, \delta) - \int_{[\tau_{nk},s)} \hat{F}_{nk}(u)\, dG_{n,\hat{F}_n}(u) \ge 0. \tag{5.19}$$
To see this, note that equality must hold in (2.48) at $t = \tau_{nk}$ and that inequality must hold at $t = s$. Subtracting these two relations yields (5.19).
For $s < T_{(n)}$, we can rewrite (5.19) as follows:
$$0 \le \int_{[\tau_{nk},s)} \delta_k\, d\mathbb{P}_n(u,\delta) - \int_{[\tau_{nk},s)} \hat{F}_{nk}(u)\, dG_{n,\hat{F}_n}(u)$$
$$= \int_{[\tau_{nk},s)} \Bigl\{\delta_k - \hat{F}_{nk}(u) + \hat{F}_{nk}(u)\Bigl(1 - \frac{1 - \delta_+}{1 - \hat{F}_{n+}(u)}\Bigr)\Bigr\}\, d\mathbb{P}_n(u,\delta)$$
$$= \int_{[\tau_{nk},s)} \Bigl\{\delta_k - \hat{F}_{nk}(u) + \hat{F}_{nk}(u)\,\frac{\delta_+ - \hat{F}_{n+}(u)}{1 - \hat{F}_{n+}(u)}\Bigr\}\, d\mathbb{P}_n(u,\delta)$$
$$= \int_{[\tau_{nk},s)} \Bigl\{\delta_k - \hat{F}_{nk}(u) + (\delta_+ - \hat{F}_{n+}(u))\,\frac{F_{0k}(s)}{1 - F_{0+}(s)} + R_{ks\hat{F}_n}(u,\delta)\Bigr\}\, d\mathbb{P}_n(u,\delta), \tag{5.20}$$
where
$$R_{ks\hat{F}_n}(u,\delta) = (\delta_+ - \hat{F}_{n+}(u))\Bigl(\frac{\hat{F}_{nk}(u)}{1 - \hat{F}_{n+}(u)} - \frac{F_{0k}(s)}{1 - F_{0+}(s)}\Bigr) = (\delta_+ - \hat{F}_{n+}(u))\,\frac{\hat{F}_{nk}(u)(1 - F_{0+}(s)) - F_{0k}(s)(1 - \hat{F}_{n+}(u))}{(1 - \hat{F}_{n+}(u))(1 - F_{0+}(s))}. \tag{5.21}$$
The term $R_{ks\hat{F}_n}$ arises since we replace $\hat{F}_{nk}(u)/(1 - \hat{F}_{n+}(u))$ by the constant and deterministic factor $F_{0k}(s)/(1 - F_{0+}(s))$. Lemma 5.9 provides a bound on
$$\Bigl|\int_{[w,s)} R_{ks\hat{F}_n}(u,\delta)\, d\mathbb{P}_n(u,\delta)\Bigr|$$
for $w < s$ in a neighborhood of $t_0$. Note that the given bound grows with the length of the integration interval. However, this growth is dominated by terms with quadratic growth that we will encounter later. Hence, $R_{ks\hat{F}_n}$ can be viewed as a remainder term. The proof of Lemma 5.9 is given in Section 5.4.
Lemma 5.9 Let the conditions of Theorem 5.10 be satisfied. Then there exists an $r > 0$ such that, uniformly in $t_0 - 2r < w < s < t_0 + 2r$, and for $k = 1, \ldots, K$,
$$\Bigl|\int_{[w,s)} R_{ks\hat{F}_n}(u,\delta)\, d\mathbb{P}_n\Bigr| = O_p\bigl(n^{-2/3} + n^{-1/3}(s - w)^{3/2}\bigr).$$
Given that $R_{ks\hat{F}_n}$ can be viewed as a remainder term, and that $F_{0k}(s)/(1 - F_{0+}(s))$ is a constant factor, the Fenchel conditions (5.20) contain two important parts:
$$\int_{[\tau_{nk},s)} \bigl\{\delta_k - \hat{F}_{nk}(u)\bigr\}\, d\mathbb{P}_n(u,\delta) \quad \text{and} \quad \int_{[\tau_{nk},s)} \bigl\{\delta_+ - \hat{F}_{n+}(u)\bigr\}\, d\mathbb{P}_n(u,\delta). \tag{5.22}$$
The first term is equivalent to the Fenchel conditions for the naive estimator, and can be handled without much difficulty. In order to control the second term, we need the rate result for $\hat{F}_{n+}$ that is given in the next section.
5.3.2 Uniform rate of convergence for $\hat{F}_{n+}$ on a fixed neighborhood of $t_0$

The important rate result for $\hat{F}_{n+}$ is given in Theorem 5.10. The main virtue of this theorem is that it holds uniformly on a fixed neighborhood $[t_0 - r, t_0 + r]$ of $t_0$, rather than on a shrinking neighborhood of the form $[t_0 - Mn^{-1/3}, t_0 + Mn^{-1/3}]$. Such a fixed neighborhood is needed, because we will use Theorem 5.10 to derive a bound on the second term in (5.22). The usual result on a shrinking neighborhood is not enough for this purpose, because in the proof of the local rate of the components (Theorem 5.20), we cannot assume that the length of the interval $[\tau_{nk}, s)$ is of order $O_p(n^{-1/3})$.

Theorem 5.10 For all $k = 1, \ldots, K$, let $0 < F_{0k}(t_0) < F_{0k}(\infty)$, and let $F_{0k}$ and $G$ be continuously differentiable at $t_0$ with strictly positive derivatives $f_{0k}(t_0)$ and $g(t_0)$. For $\beta \in (0, 1)$ we define
$$v_n(t) = \begin{cases} n^{-1/3} & \text{if } |t| \le n^{-1/3},\\ n^{-(1-\beta)/3}\,|t|^\beta & \text{if } |t| > n^{-1/3}. \end{cases} \tag{5.23}$$
Then there exists a constant $r > 0$ so that
$$\sup_{t\in[t_0-r,\,t_0+r]} \frac{\bigl|\hat{F}_{n+}(t) - F_{0+}(t)\bigr|}{v_n(t - t_0)} = O_p(1). \tag{5.24}$$
Figure 5.2: Plot of $v_n(t)$ for various values of $\beta$. The dotted lines are $y = x$ and $y = n^{-1/3}$. Note that $\beta$ close to zero gives the sharpest bound.
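A direct implementation of (5.23) makes the shape of $v_n$ easy to inspect (sketch, not code from the thesis):

```python
def v_n(t, n, beta):
    """The comparison function of display (5.23), for beta in (0, 1)."""
    if abs(t) <= n ** (-1 / 3):
        return n ** (-1 / 3)
    return n ** (-(1 - beta) / 3) * abs(t) ** beta

n = 1000
# flat at n^{-1/3} on the central interval, and continuous at |t| = n^{-1/3}
assert v_n(0.0, n, 0.5) == n ** (-1 / 3)
assert abs(v_n(n ** (-1 / 3), n, 0.5) - n ** (-1 / 3)) < 1e-12
# for |t| > n^{-1/3} we have v_n(t) = n^{-1/3} (|t| n^{1/3})^beta, increasing
# in beta, so beta close to zero gives the sharpest (smallest) envelope:
assert v_n(0.5, n, 0.05) < v_n(0.5, n, 0.95)
```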
Before giving the proof of this theorem, we discuss its meaning by comparing it to
several known results for current status data without competing risks.
Remark 5.11 By taking $K = 1$ in Theorem 5.10, it follows that the theorem holds for the MLE $\hat{F}_n$ for current status data without competing risks. Thus, to clarify the meaning of Theorem 5.10, we can compare it to known results for $\hat{F}_n$. First, we consider the local rate of convergence given in Groeneboom and Wellner (1992, Lemma 5.4, page 95). For $M > 0$, they prove that
$$\sup_{t\in[-M,M]} \bigl|\hat{F}_n(t_0 + n^{-1/3}t) - F_0(t_0)\bigr| = O_p(n^{-1/3}). \tag{5.25}$$
Applying Theorem 5.10 to $t \in [t_0 - Mn^{-1/3}, t_0 + Mn^{-1/3}]$ yields
$$\sup_{t\in[t_0-Mn^{-1/3},\,t_0+Mn^{-1/3}]} \frac{\bigl|\hat{F}_{n+}(t) - F_{0+}(t)\bigr|}{v_n(t - t_0)} = O_p(1).$$
Combining this with the continuous differentiability of $F_{0+}$ at $t_0$, and with the fact that
$$v_n(t - t_0) \le v_n(Mn^{-1/3}) = M^\beta n^{-1/3}, \qquad \text{for } M \ge 1,$$
yields the bound in (5.25). Hence, Theorem 5.10 is stronger than (5.25) for $M \ge 1$.
Next, we consider the global bound of Groeneboom and Wellner (1992, Lemma 5.9):
$$\sup_{t\in\mathbb{R}} \bigl|\hat{F}_n(t) - F_0(t)\bigr| = O_p(n^{-1/3}\log n). \tag{5.26}$$
The result in Theorem 5.10 is fundamentally different from (5.26), because it is stronger in some ranges, but weaker in others. For example, for $|t - t_0| = n^{-1/3}\log n$, Theorem 5.10 is stronger, since
$$v_n(t - t_0) = n^{-(1-\beta)/3}\,|t - t_0|^\beta = n^{-1/3}(\log n)^\beta < n^{-1/3}\log n, \qquad \text{for } n \ge 3,$$
for all $\beta \in (0, 1)$. Similarly, for $|t - t_0| = n^{-1/3}\log\log n$ we have $v_n(t - t_0) < n^{-1/3}\log\log n$. On the other hand, for $|t - t_0| = n^{-1/3+\gamma}$ for some $\gamma > 0$, Theorem 5.10 is weaker, because $v_n(t - t_0) = n^{-1/3+\gamma\beta}$ and $n^{-1/3}\log n = o(n^{-1/3+\gamma\beta})$, for any $\beta \in (0, 1)$ and $\gamma > 0$.
Remark 5.12 Note that Theorem 5.10 gives a family of bounds in $\beta$. Choosing $\beta$ close to zero gives the tightest bound, as illustrated in Figure 5.2. For the proof of the local rate of convergence of $\hat{F}_{nk}$, $k = 1, \ldots, K$, it is sufficient that Theorem 5.10 holds for one arbitrary value of $\beta \in (0, 1)$. Stating the theorem for one fixed $\beta$ leads to a somewhat simpler proof. However, for completeness we present the result for all $\beta \in (0, 1)$.
We now provide several lemmas that are needed in the proof of Theorem 5.10. First, Lemma 5.13 shows that we can replace $\int_{[t,s)} \{F(s) - F(u)\}\, d\mathbb{G}_n(u)$ by $\int_{[t,s)} \{F(s) - F(u)\}\, dG(u)$, at the cost of a term of order $O_p(n^{-1/2}(s - t))$.

Lemma 5.13 Let $F : \mathbb{R} \to \mathbb{R}$ be continuously differentiable at $t_0$ with strictly positive derivative $f(t_0)$. Then there exists an $r > 0$ such that, uniformly in $t_0 - 2r \le t \le s \le t_0 + 2r$:
$$\Bigl|\int_{[t,s)} \{F(s) - F(u)\}\, d(\mathbb{G}_n - G)(u)\Bigr| = O_p\bigl(n^{-1/2}(s - t)\bigr).$$
Proof: Integration by parts yields
\[
n^{1/2} \int_{[t,s)} \{F(s) - F(u)\}\, d(G_n - G)(u)
= -n^{1/2} \{F(s) - F(t)\} \{G_n(t) - G(t)\} + n^{1/2} \int_{[t,s)} \{G_n(u) - G(u)\}\, dF(u).
\]
Note that $n^{1/2} \sup_{u \in \mathbb{R}} |G_n(u) - G(u)|$ is tight, since it converges in distribution to $\sup_{u \in \mathbb{R}} |B(G(u))| \le \sup_{x \in [0,1]} |B(x)|$, where $B$ is a standard Brownian bridge on $[0,1]$. Hence, both terms on the right side of the display are of order $O_p(1)\{F(s) - F(t)\} = O_p(1)(s - t)$. $\Box$
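The tightness of $n^{1/2} \sup_u |G_n(u) - G(u)|$ used above can be illustrated by simulation. The sketch below (illustrative only) takes $G$ to be the Uniform$(0,1)$ distribution, so the statistic is the classical Kolmogorov statistic, and checks that its median is stable across sample sizes:

```python
import numpy as np

# Monte Carlo illustration that n^{1/2} sup_u |G_n(u) - G(u)| stays bounded
# in n. With G = Uniform(0,1) this is the Kolmogorov statistic, which
# converges in distribution to sup |B(x)| for a Brownian bridge B.
rng = np.random.default_rng(0)

def scaled_ks(n):
    u = np.sort(rng.uniform(size=n))
    grid = np.arange(1, n + 1) / n
    # sup |G_n - G|, evaluated at the jump points of the empirical df
    d = np.maximum(np.abs(grid - u), np.abs(grid - 1.0 / n - u)).max()
    return np.sqrt(n) * d

stats = [np.median([scaled_ks(n) for _ in range(200)]) for n in (100, 400, 1600)]
# The medians are stable in n (tightness); the limiting median is about 0.83.
assert all(0.5 < s < 1.5 for s in stats)
```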
Next, Lemma 5.14 shows that $\int_{[w,s)} \{F_{0k}(s) - F_{0k}(u)\}\, dG_n(u)$ has a quadratic drift. This result follows by replacing $G_n$ by $G$ using Lemma 5.13, and then using the continuous differentiability of $F_{0k}$. This quadratic drift plays an important role in the proof of the local rate result, because it dominates all other terms.
Lemma 5.14 Let the conditions of Theorem 5.10 be satisfied. Then there exists an $r > 0$ such that for all $k = 1, \dots, K$,
\[
P\Bigl( \int_{[w,s)} \{F_{0k}(s) - F_{0k}(u)\}\, dG_n(u) \ge g(t_0) f_{0k}(t_0)(s - w)^2/8 \ \text{ for all } w, s \in [t_0 - 2r, t_0 + 2r] \text{ such that } s - w > n^{-1/3} \Bigr) \to 1, \quad n \to \infty.
\]
Proof: Let $k \in \{1, \dots, K\}$. Note that
\[
\begin{aligned}
\int_{[w,s)} \{F_{0k}(s) - F_{0k}(u)\}\, dG_n(u)
&= \int_{[w,s)} \{F_{0k}(s) - F_{0k}(u)\}\, d(G_n - G)(u) + \int_{[w,s)} \{F_{0k}(s) - F_{0k}(u)\}\, dG(u) \\
&\ge \int_{[w,s)} \{F_{0k}(s) - F_{0k}(u)\}\, dG(u) - \Bigl| \int_{[w,s)} \{F_{0k}(s) - F_{0k}(u)\}\, d(G_n - G)(u) \Bigr|.
\end{aligned} \tag{5.27}
\]
We write (5.27) as I $-$ II. Note that I $\ge f_{0k}(t_0) g(t_0)(s - w)^2/4$ for $r$ small enough, by the assumption that $F_{0k}$ and $G$ are continuously differentiable with positive derivatives. Furthermore, II is of order $n^{-1/2}(s - w) O_p(1)$ by Lemma 5.13. Since $s - w > n^{-1/3}$, this is in turn bounded above by $f_{0k}(t_0) g(t_0)(s - w)^2/8$ with probability arbitrarily close to one for $n$ sufficiently large. Plugging these results into (5.27) completes the proof. $\Box$
We are now ready to give the proof of Theorem 5.10.
Proof of Theorem 5.10: Let $\beta \in (0,1)$ and $\epsilon > 0$. It is sufficient to show that we can choose $n_1$, $M$ and $r$ such that for all $n > n_1$
\[
P\bigl\{ \exists t \in [t_0 - r, t_0 + r] : \hat F_{n+}(t) \notin \bigl( F_{0+}(t - M v_n(t - t_0)),\ F_{0+}(t + M v_n(t - t_0)) \bigr) \bigr\} < \epsilon, \tag{5.28}
\]
since, for $r$ small enough, the continuous differentiability of $F_{0+}$ gives
\[
\begin{aligned}
F_{0+}(t + M v_n(t - t_0)) &\le F_{0+}(t) + 2 M v_n(t - t_0) f_{0+}(t_0), \quad t \in [t_0 - r, t_0 + r], \\
F_{0+}(t - M v_n(t - t_0)) &\ge F_{0+}(t) - 2 M v_n(t - t_0) f_{0+}(t_0), \quad t \in [t_0 - r, t_0 + r],
\end{aligned}
\]
and combining this with (5.28) proves (5.24):
\[
P\bigl\{ \exists t \in [t_0 - r, t_0 + r] : |\hat F_{n+}(t) - F_{0+}(t)| \ge 2 M v_n(t - t_0) f_{0+}(t_0) \bigr\} < \epsilon, \quad n > n_1.
\]
Thus, in the remainder we prove (5.28). In fact, we only prove that there exist $n_1$, $M$ and $r$ such that for all $n > n_1$
\[
P\bigl\{ \exists t \in [t_0, t_0 + r] : \hat F_{n+}(t) \ge F_{0+}(t + M v_n(t - t_0)) \bigr\} < \frac{\epsilon}{4}, \tag{5.29}
\]
since the proofs for $\hat F_{n+}(t) \le F_{0+}(t - M v_n(t - t_0))$ and the interval $[t_0 - r, t_0]$ are analogous. To prove this, we make use of the fact that we can choose $n_1$, $r$ and $C > 0$, such that for all $n > n_1$ the following event holds with high probability:
\[
E_{nrC} = E^{(1)}_{nr} \cap E^{(2)}_{nr} \cap E^{(3)}_{nrC}, \tag{5.30}
\]
where
\[
\begin{aligned}
E^{(1)}_{nr} &= \cap_{k=1}^{K} \bigl\{ \hat F_{nk} \text{ has a jump in } [t_0 - 2r, t_0 - r] \bigr\}, \\
E^{(2)}_{nr} &= \cap_{k=1}^{K} \Bigl\{ \int_{[w,s)} (F_{0k}(s) - F_{0k}(u))\, dG_n(u) \ge g(t_0) f_{0k}(t_0)(s - w)^2/8 \\
&\hspace{5em} \text{for all } w, s \in [t_0 - 2r, t_0 + 2r] \text{ with } s - w > n^{-1/3} \Bigr\}, \\
E^{(3)}_{nrC} &= \cap_{k=1}^{K} \Bigl\{ \Bigl| \int_{[w,s)} R_{ks\hat F_n}(u, \delta)\, dP_n(u, \delta) \Bigr| \le \bigl( n^{-2/3} + n^{-1/3}(s - w)^{3/2} \bigr) C \\
&\hspace{5em} \text{for all } w, s \in [t_0 - 2r, t_0 + 2r] \Bigr\}.
\end{aligned}
\]
To see that event $E^{(1)}_{nr}$ holds with high probability, let $k \in \{1, \dots, K\}$, and note that by Proposition 4.15 and the continuity of $F_{0k}$ in a neighborhood of $t_0$, it follows that $\hat F_{nk}$ is almost surely uniformly consistent on $[t_0 - 2r, t_0 - r]$ for $r$ small enough. Together with the fact that $F_{0k}$ is strictly increasing in a neighborhood of $t_0$, this implies that for $n$ large $\hat F_{nk}$ must have a jump on $[t_0 - 2r, t_0 - r]$. Lemmas 5.14 and 5.9 imply that events $E^{(2)}_{nr}$ and $E^{(3)}_{nrC}$, respectively, hold with high probability.

Hence, we can choose $n_1$, $r$ and $C$ such that $P(E^c_{nrC}) < \epsilon/8$ for all $n > n_1$. By writing
\[
\begin{aligned}
&P\bigl\{ \exists t \in [t_0, t_0 + r] : \hat F_{n+}(t) \ge F_{0+}(t + M v_n(t - t_0)) \bigr\} \\
&\quad \le P(E^c_{nrC}) + P\bigl( \bigl\{ \exists t \in [t_0, t_0 + r] : \hat F_{n+}(t) \ge F_{0+}(t + M v_n(t - t_0)) \bigr\} \cap E_{nrC} \bigr),
\end{aligned} \tag{5.31}
\]
it follows that we can complete the proof by showing that we can choose $n_1$, $M$ and $r$ such that the second term of (5.31) is bounded by $\epsilon/8$ for all $n > n_1$.
In order to prove this, we put a grid on the interval $[t_0, t_0 + r]$, analogously to Kim and Pollard (1990, Lemma 4.1). The grid points $t_{nj}$ and grid cells $I_{nj}$ are denoted by
\[
t_{nj} = t_0 + j n^{-1/3} \quad \text{and} \quad I_{nj} = [t_{nj}, t_{n,j+1}), \tag{5.32}
\]
where $j = 0, \dots, J_n = \lceil r n^{1/3} \rceil$. Then it is sufficient to show that we can choose $n_1$, $M$ and $r$ such that for all $n > n_1$ and $j = 0, \dots, J_n$,
\[
P\bigl( \bigl\{ \exists t \in I_{nj} : \hat F_{n+}(t) \ge F_{0+}(t + M v_n(t - t_0)) \bigr\} \cap E_{nrC} \bigr) \le p_{jM}, \tag{5.33}
\]
where $p_{jM}$ is defined by
\[
p_{jM} = \begin{cases} d_1 \exp(-d_2 M^{3/2}) & \text{if } j = 0, \\ d_1 \exp(-d_2 (M j^{\beta})^{3/2}) & \text{if } j = 1, \dots, J_n, \end{cases} \tag{5.34}
\]
for some positive constants $d_1$ and $d_2$. To see that this is sufficient, note that (5.33) implies that
\[
\begin{aligned}
&P\bigl( \bigl\{ \exists t \in [t_0, t_0 + r] : \hat F_{n+}(t) \ge F_{0+}(t + M v_n(t - t_0)) \bigr\} \cap E_{nrC} \bigr) \\
&\quad \le \sum_{j=0}^{\infty} p_{jM} = d_1 \exp(-d_2 M^{3/2}) + \sum_{j=1}^{\infty} d_1 \exp\bigl(-d_2 (M j^{\beta})^{3/2}\bigr),
\end{aligned}
\]
and for any $\beta \in (0,1)$ this is a convergent sum that can be made arbitrarily small by choosing $M$ large.
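The convergence of this series, and its decay in $M$, is easy to check numerically; in the sketch below the constants $d_1 = d_2 = 1$ and the truncation point are arbitrary illustrative choices:

```python
import math

# Truncated evaluation of d1*exp(-d2*M^{3/2}) + sum_j d1*exp(-d2*(M*j^beta)^{3/2});
# d1 = d2 = 1 and the 10000-term truncation are illustrative (the tail decays
# faster than geometrically, so the truncation error is negligible).
def series_bound(M, beta, d1=1.0, d2=1.0, terms=10000):
    total = d1 * math.exp(-d2 * M ** 1.5)
    total += sum(d1 * math.exp(-d2 * (M * j ** beta) ** 1.5)
                 for j in range(1, terms))
    return total

for beta in (0.25, 0.5, 0.75):
    b5, b10 = series_bound(5.0, beta), series_bound(10.0, beta)
    assert b10 < b5       # the bound decreases in M
    assert b10 < 1e-6     # and can be made arbitrarily small
```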
Thus, we are left with proving (5.33). Using the monotonicity of $\hat F_{n+}$, it is in turn sufficient to prove that for all $n > n_1$ and $j = 0, \dots, J_n$,
\[
P\bigl( \bigl\{ \hat F_{n+}(t_{n,j+1}) \ge F_{0+}(s_{njM}) \bigr\} \cap E_{nrC} \bigr) = P(A_{njM} \cap E_{nrC}) \le p_{jM}, \tag{5.35}
\]
where
\[
s_{njM} = t_{nj} + M v_n(t_{nj} - t_0), \tag{5.36}
\]
\[
A_{njM} = \bigl\{ \hat F_{n+}(t_{n,j+1}) \ge F_{0+}(s_{njM}) \bigr\}. \tag{5.37}
\]
Let $\tau_{nk}$ be the last jump point of $\hat F_{nk}$ before $t_{n,j+1}$, for $k = 1, \dots, K$. On the event $E_{nrC}$, these jump points exist and are in $[t_0 - 2r, t_{n,j+1})$. Without loss of generality we assume that the sub-distribution functions are labeled such that $\tau_{n1} \le \dots \le \tau_{nK}$. On the event $A_{njM}$ there must be a $k \in \{1, \dots, K\}$ for which $\hat F_{nk}(t_{n,j+1}) \ge F_{0k}(s_{njM})$. Hence, on the event $A_{njM}$, we can define $l \in \{1, \dots, K\}$ such that
\[
\hat F_{nk}(t_{n,j+1}) < F_{0k}(s_{njM}), \quad k = l+1, \dots, K, \tag{5.38}
\]
\[
\hat F_{nl}(t_{n,j+1}) \ge F_{0l}(s_{njM}). \tag{5.39}
\]
Next, we note that the Fenchel conditions imply that
\[
\int_{[\tau_{nl}, s_{njM})} \delta_l \, dP_n(u, \delta) - \int_{[\tau_{nl}, s_{njM})} \hat F_{nl}(u)\, dG_{n\hat F_n}(u) \ge 0
\]
must hold. Hence,
\[
P(A_{njM} \cap E_{nrC})
= P\Bigl( \Bigl\{ \int_{[\tau_{nl}, s_{njM})} \delta_l \, dP_n(u, \delta) - \int_{[\tau_{nl}, s_{njM})} \hat F_{nl}(u)\, dG_{n\hat F_n}(u) \ge 0 \Bigr\} \cap A_{njM} \cap E_{nrC} \Bigr).
\]
Using (5.20), this probability is bounded above by
\[
P\Bigl( \Bigl\{ \int_{[\tau_{nl}, s_{njM})} \bigl\{ \delta_l - \hat F_{nl}(u) + R_{l s_{njM} \hat F_n}(u, \delta) \bigr\}\, dP_n \ge 0 \Bigr\} \cap A_{njM} \cap E_{nrC} \Bigr) \tag{5.40}
\]
\[
+\, P\Bigl( \Bigl\{ \int_{[\tau_{nl}, s_{njM})} \bigl\{ \delta_+ - \hat F_{n+}(u) \bigr\}\, dP_n \ge 0 \Bigr\} \cap A_{njM} \cap E_{nrC} \Bigr). \tag{5.41}
\]
Note that we can discard the factor $F_{0l}(s_{njM})/\{1 - F_{0+}(s_{njM})\}$, because it is a finite and positive constant and therefore plays no role in the sign of the integral in (5.41). Using (5.39), the definition of $\tau_{nl}$, and the fact that $\hat F_{nl}$ is piecewise constant and monotone nondecreasing, it follows that on the event $A_{njM}$ we have, for $u > \tau_{nl}$,
\[
\hat F_{nl}(u) \ge \hat F_{nl}(\tau_{nl}) = \hat F_{nl}(t_{n,j+1}) \ge F_{0l}(s_{njM}).
\]
Hence, we can bound (5.40) by
\[
\begin{aligned}
&P\Bigl( \Bigl\{ \int_{[\tau_{nl}, s_{njM})} \bigl\{ \delta_l - F_{0l}(s_{njM}) + R_{l s_{njM} \hat F_n}(u, \delta) \bigr\}\, dP_n(u, \delta) \ge 0 \Bigr\} \cap E_{nrC} \Bigr) \\
&\le P\Bigl( \Bigl\{ \sup_{\substack{k \in \{1, \dots, K\} \\ w \in [t_0 - 2r,\, t_{n,j+1}]}} \Bigl[ \int_{[w, s_{njM})} \bigl\{ \delta_k - F_{0k}(s_{njM}) \bigr\}\, dP_n(u, \delta) + \bigl( n^{-2/3} + n^{-1/3}(s_{njM} - w)^{3/2} \bigr) C \Bigr] \ge 0 \Bigr\} \cap E_{nrC} \Bigr) \\
&\le P\Bigl( \Bigl\{ \sup_{\substack{k \in \{1, \dots, K\} \\ w \in [t_0 - 2r,\, t_{n,j+1}]}} \Bigl[ \int_{[w, s_{njM})} \bigl\{ \delta_k - F_{0k}(s_{njM}) \bigr\}\, dP_n(u, \delta) + \bigl( n^{-2/3} + n^{-1/6}(s_{njM} - w)^{3/2} \bigr) C \Bigr] \ge 0 \Bigr\} \cap E_{nrC} \Bigr),
\end{aligned}
\]
using the definition of $E_{nrC}$ in (5.30) for the first inequality, and $n^{-1/3} \le n^{-1/6}$ for the second. We can bound this probability by $p_{jM}/2$ for $M$ sufficiently large, using Lemma 5.15 below. Expression (5.41) is also bounded above by $p_{jM}/2$ for $M$ large, using Lemma 5.16 below. This proves (5.35) and completes the proof. $\Box$
Lemmas 5.15 and 5.16 are crucial lemmas in the proof of Theorem 5.10. The main idea of Lemma 5.15 is that we can write
\[
\int_{[w, s_{njM})} \bigl\{ \delta_k - F_{0k}(s_{njM}) \bigr\}\, dP_n
= \int_{[w, s_{njM})} \bigl\{ \delta_k - F_{0k}(u) \bigr\}\, dP_n + \int_{[w, s_{njM})} \bigl\{ F_{0k}(u) - F_{0k}(s_{njM}) \bigr\}\, dG_n.
\]
The first term on the right side is a martingale, and the second term on the right side has a negative quadratic drift on the event $E_{nrC}$. This quadratic drift dominates both the martingale part and the term $\bigl( n^{-2/3} + n^{-1/6}(s_{njM} - w)^{3/2} \bigr) C$ for sufficiently large $M$. We obtain the uniformity in $w$ by using a second grid with grid size $n^{-1/3}$, and we get the exponential bound $p_{jM}$ by using Orlicz norms. The proof is given in Section 5.4.
Lemma 5.15 Let the conditions of Theorem 5.10 be satisfied, and let $C > 0$. Then there exist $r > 0$, $n_1 > 0$ and $M > 5$ such that for all $n > n_1$ and $j \in \{0, \dots, J_n\}$ we have
\[
P\Bigl( \Bigl\{ \sup_{\substack{k \in \{1, \dots, K\} \\ w \in (t_0 - 2r,\, t_{n,j+1})}} \Bigl[ \int_{[w, s_{njM})} \bigl\{ \delta_k - F_{0k}(s_{njM}) \bigr\}\, dP_n + \bigl( n^{-2/3} + n^{-1/6}(s_{njM} - w)^{3/2} \bigr) C \Bigr] \ge 0 \Bigr\} \cap E_{nrC} \Bigr) \le \frac{p_{jM}}{2},
\]
where $s_{njM} = t_{nj} + M v_n(t_{nj} - t_0)$, and $v_n(\cdot)$, $E_{nrC}$ and $p_{jM}$ are defined in (5.23), (5.30) and (5.34), respectively.
Lemma 5.16 gives a similar bound, but now for the sum of the components. In this lemma the key idea is to exploit the system of sub-distribution functions. On the event $A_{njM}$, we play off the sub-distribution functions against each other until the problem is reduced to a situation to which Lemma 5.15 can be applied. The proof of this lemma is given in Section 5.4.
Lemma 5.16 Let the conditions of Theorem 5.10 be satisfied, and let $C > 0$. Then there are $M > 0$, $n_1 > 0$ and $r > 0$ such that for all $n > n_1$ and $j \in \{0, \dots, J_n\}$:
\[
P\Bigl( \Bigl\{ \int_{[\tau_{nl}, s_{njM})} \bigl\{ \delta_+ - \hat F_{n+}(u) \bigr\}\, dP_n(u, \delta) \ge 0 \Bigr\} \cap A_{njM} \cap E_{nrC} \Bigr) \le \frac{p_{jM}}{2},
\]
where $l$ is defined in (5.38), $\tau_{nl}$ is the last jump point of $\hat F_{nl}$ before $t_{n,j+1}$, $s_{njM} = t_{nj} + M v_n(t_{nj} - t_0)$, and $E_{nrC}$, $p_{jM}$ and $A_{njM}$ are defined in (5.30), (5.34) and (5.37).
Remark 5.17 The conditions of Theorem 5.10 also hold when $t_0$ is replaced by $s$, for $s$ in a neighborhood of $t_0$. Hence, the results in this section continue to hold when $t_0$ is replaced by $s \in [t_0 - r/2, t_0 + r/2]$, for $r > 0$ sufficiently small. To be precise, there exists an $r > 0$ such that for every $\epsilon > 0$ there exist $M > 0$ and $n_1 > 0$ such that for all $s \in [t_0 - r/2, t_0 + r/2]$ and $n > n_1$:
\[
P\Bigl( \sup_{t \in [s - r,\, s + r]} \frac{\bigl| \hat F_{n+}(t) - F_{0+}(t) \bigr|}{v_n(t - s)} > M \Bigr) < \epsilon.
\]
5.3.3 Local rate of convergence of $\hat F_{n1}, \dots, \hat F_{nK}$

We now prove the local rate of convergence for the components $\hat F_{n1}, \dots, \hat F_{nK}$. Recall from the introduction of this chapter that our proof relies on the Fenchel conditions (5.20). These Fenchel conditions consist of three parts: the integral of $\delta_k - \hat F_{nk}(u)$, the integral of $\delta_+ - \hat F_{n+}(u)$, and the integral of $R_{ks\hat F_n}(u, \delta)$. We can bound the part involving $R_{ks\hat F_n}$ using Lemma 5.9. For the term involving $\delta_+ - \hat F_{n+}$ we write
\[
\int_{[w,s)} \bigl\{ \delta_+ - \hat F_{n+}(u) \bigr\}\, dP_n
= \int_{[w,s)} \bigl\{ \delta_+ - F_{0+}(u) \bigr\}\, dP_n + \int_{[w,s)} \bigl\{ F_{0+}(u) - \hat F_{n+}(u) \bigr\}\, dG_n. \tag{5.42}
\]
The first term of (5.42) is bounded in Lemma 5.18. This lemma is very similar to Lemma 4.1 of Kim and Pollard (1990), with the only difference that our class of functions depends on $n$. In Corollary 5.19 we bound the second term of (5.42), using Theorem 5.10 with $\beta = 1/2$. It then follows that the term involving $\delta_k - \hat F_{nk}$ drives the local rate of convergence for $\hat F_{nk}$, just as for current status data without competing risks. The local rate of convergence for the components $\hat F_{n1}, \dots, \hat F_{nK}$ is given in Theorem 5.20.
Lemma 5.18 Let the conditions of Theorem 5.10 be satisfied. Then there exists an $r > 0$ such that for all $M > 0$ and every $\gamma > 0$ there exist random variables $A_n$ of order $O_p(1)$ such that
\[
\Bigl| \int_{[t, s_{nM})} \bigl\{ \delta_+ - F_{0+}(u) \bigr\}\, dP_n(u, \delta) \Bigr| \le \gamma (s_{nM} - t)^2 + n^{-2/3} A_n^2, \tag{5.43}
\]
for all $t \in [t_0 - r, s_{nM})$, where $s_{nM} = t_0 + 2 M n^{-1/3}$.
Proof: We use a slightly generalized version of Lemma 4.1 of Kim and Pollard (1990). We introduce the following notation:
\[
\begin{aligned}
q_{nt}(u, \delta) &= (\delta_+ - F_{0+}(u))\, 1_{[t, s_{nM})}(u), \quad t \le s_{nM}, \\
\mathcal{Q}_{nr} &= \{ q_{nt} : t \in (t_0 - r, s_{nM}) \}, \quad r > 0, \\
Q_{nr}(u, \delta) &= |\delta_+ - F_{0+}(u)| \, 1_{[t_0 - r, s_{nM}]}(u).
\end{aligned}
\]
Here $\mathcal{Q}_{nr}$ is the class of functions of interest and $Q_{nr}$ is its envelope. Note that $\mathcal{Q}_{nr}$ is uniformly manageable in the sense of Kim and Pollard (1990), since the functions $q_{nt}$ are the product of a fixed bounded function and an indicator of a VC class of sets. Furthermore,
\[
P Q_{nr}^2 \le P 1_{[t_0 - r, s_{nM}]}(u) \le 2 g(t_0)(r + M n^{-1/3}),
\]
for $r$ small and $n$ large, since $G$ is continuously differentiable at $t_0$ with a positive derivative. Hence, we can choose $r_1 > 0$ such that $P Q_{nr}^2 \le 2 g(t_0)(r + M n^{-1/3})$ for all $r \le r_1$. We can use this bound in the proof of Lemma 4.1 of Kim and Pollard (1990) without making any other modifications, and we obtain that for every $\gamma > 0$ there exist random variables $A_n$ of order $O_p(1)$ such that
\[
\Bigl| \int_{[t, s_{nM})} \bigl\{ \delta_+ - F_{0+}(u) \bigr\}\, dP_n(u, \delta) \Bigr| = |(P_n - P) q_{nt}| \le \gamma (s_{nM} - t)^2 + n^{-2/3} A_n^2,
\]
for all $t \in (t_0 - r, s_{nM})$. $\Box$
Corollary 5.19 provides a bound on the second term of (5.42). The proof uses the modulus of continuity result of Van de Geer (2000) and Theorem 5.10 with $\beta = 1/2$.

Corollary 5.19 Let the conditions of Theorem 5.10 be satisfied. Then there exists an $r > 0$ such that for all $M > 1$ we have
\[
\Bigl| \int_{[t, s_{nM})} \bigl\{ \hat F_{n+}(u) - F_{0+}(u) \bigr\}\, dG_n(u) \Bigr| = O_p\bigl( n^{-2/3} + n^{-1/6}(s_{nM} - t)^{3/2} \bigr), \tag{5.44}
\]
uniformly in $t \in [t_0 - r, t_0 + M n^{-1/3}]$, where $s_{nM} = t_0 + 2 M n^{-1/3}$.
Proof: We write
\[
\Bigl| \int_{[t, s_{nM})} \bigl\{ \hat F_{n+}(u) - F_{0+}(u) \bigr\}\, dG_n(u) \Bigr|
\le \Bigl| \int_{[t, s_{nM})} \bigl\{ \hat F_{n+}(u) - F_{0+}(u) \bigr\}\, d(G_n - G)(u) \Bigr| + \int_{[t, s_{nM})} \bigl| \hat F_{n+}(u) - F_{0+}(u) \bigr|\, dG(u).
\]
The first term is of order $O_p(n^{-2/3})$, uniformly in $t \le s_{nM}$, by the modulus of continuity result of Van de Geer (2000, Lemma 5.13, eq. (5.42)). To see this, let
\[
\mathcal{Q} = \bigl\{ q(u) = q_{tF}(u) = \{F(u) - F_{0+}(u)\}\, 1_{[t, s_{nM})}(u) : F \in \mathcal{F}, \ t \le s_{nM} \bigr\},
\]
where $\mathcal{F}$ is the class of monotone functions $F : \mathbb{R} \to [0,1]$. Taking $q_0 \equiv 0$, it is clear that
\[
\sup_{q \in \mathcal{Q}} \|q - q_0\|_{\infty} \le 1,
\]
so that Van de Geer's condition (5.39) is satisfied. In order to satisfy her condition (5.40), we need to show that the bracketing entropy $\log N_{[\,]}(\gamma, \mathcal{Q}, L_2(G))$ is bounded by $A \gamma^{-1}$, for some constant $A > 0$. It is well known that $\log N_{[\,]}(\gamma, \mathcal{F}, L_2(H)) \lesssim 1/\gamma$, uniformly in probability measures $H$ on the underlying sample space (see, e.g., Van de Geer (2000, page 18, equation (2.5)) or Van der Vaart and Wellner (1996, Theorem 2.7.5, page 159)). Furthermore, the same bound holds for the class of indicator functions $\{1_{[t, s_{nM})} : t \le s_{nM}\}$, since they are of bounded variation (see, e.g., Van de Geer (2000, page 18, equation (2.6))). In fact we can get a much sharper bound on the bracketing numbers for this class, but that is not needed here. Since the functions $q \in \mathcal{Q}$ consist of the product of functions from these two classes, it follows by Proposition 5.23 that $\log N_{[\,]}(\gamma, \mathcal{Q}, L_2(G)) \le A' \gamma^{-1}$ for some $A' > 0$. Hence, Van de Geer's condition (5.40) is satisfied. Next, we define
\[
\mathcal{Q}(\gamma) = \{ q \in \mathcal{Q} : \|q\|_2 \le \gamma \}.
\]
Using the $L_2(G)$ rate of convergence (5.7), we have
\[
\| q_{t \hat F_{n+}} \|_2^2 = \int_{[t, s_{nM})} \bigl\{ \hat F_{n+}(u) - F_{0+}(u) \bigr\}^2\, dG(u) = O_p(n^{-2/3}),
\]
uniformly in $t \le s_{nM}$. Hence, for every $\epsilon > 0$ we can find a $C > 0$ such that
\[
P\bigl( q_{t \hat F_{n+}} \in \mathcal{Q}(C n^{-1/3}) \text{ for all } t \le s_{nM} \bigr) > 1 - \epsilon.
\]
Finally, applying Van de Geer (2000, Lemma 5.13, eq. (5.42)) with $\alpha = 1$ and $\beta = 0$ to the class $\mathcal{Q}(C n^{-1/3})$ yields
\[
\sup_{q \in \mathcal{Q}(C n^{-1/3})} \Bigl| \int q \, d(P_n - P) \Bigr| = O_p(n^{-2/3}).
\]
To bound the second term, note that Theorem 5.10 with $\beta = 1/2$ implies that, uniformly in $t \in [t_0 - r, t_0 + r]$,
\[
\int_{t_0 \wedge t}^{t_0 \vee t} \bigl| \hat F_{n+}(u) - F_{0+}(u) \bigr|\, dG(u)
= \int_{t_0 \wedge t}^{t_0 \vee t} O_p(v_n(t - t_0))\, dG(u)
= O_p\bigl( n^{-2/3} \vee n^{-1/6} |t - t_0|^{3/2} \bigr). \tag{5.45}
\]
We now distinguish the following two cases: (i) $t < t_0$ and (ii) $t \in [t_0, t_0 + M n^{-1/3})$. In case (i), using (5.45) and $s_{nM} = t_0 + 2 M n^{-1/3}$, we get
\[
\begin{aligned}
\int_{[t, s_{nM})} \bigl| \hat F_{n+}(u) - F_{0+}(u) \bigr|\, dG(u)
&= \int_{[t, t_0)} \bigl| \hat F_{n+}(u) - F_{0+}(u) \bigr|\, dG(u) + \int_{[t_0, s_{nM})} \bigl| \hat F_{n+}(u) - F_{0+}(u) \bigr|\, dG(u) \\
&= O_p\bigl( n^{-2/3} \vee n^{-1/6}(t_0 - t)^{3/2} \bigr) + O_p\bigl( n^{-1/6}(2 M n^{-1/3})^{3/2} \bigr) \\
&= O_p\bigl( n^{-1/6}(s_{nM} - t)^{3/2} \bigr),
\end{aligned}
\]
uniformly in $t \in [t_0 - r, t_0)$. Similarly, in case (ii), using (5.45) and $M n^{-1/3} \le s_{nM} - t$, we get
\[
\int_{[t, s_{nM})} \bigl| \hat F_{n+}(u) - F_{0+}(u) \bigr|\, dG(u)
\le \int_{[t_0, s_{nM})} \bigl| \hat F_{n+}(u) - F_{0+}(u) \bigr|\, dG(u)
= O_p\bigl( n^{-1/6}(2 M n^{-1/3})^{3/2} \bigr) = O_p\bigl( n^{-1/6}(s_{nM} - t)^{3/2} \bigr),
\]
uniformly in $t \in [t_0, t_0 + M n^{-1/3})$. $\Box$
We are now ready to prove the local rate of convergence of $\hat F_{n1}, \dots, \hat F_{nK}$.

Theorem 5.20 Let the conditions of Theorem 5.10 be satisfied, and let $M_1 > 0$. Then
\[
\sup_{t \in [-M_1, M_1]} \bigl| \hat F_{nk}(t_0 + n^{-1/3} t) - F_{0k}(t_0) \bigr| = O_p(n^{-1/3}), \quad k = 1, \dots, K.
\]
Proof: Let $M_1 > 0$ be given, let $k \in \{1, \dots, K\}$ and let $\epsilon > 0$. It is sufficient to show that there exist constants $M > M_1$ and $n_1 > 0$ such that for all $n > n_1$
\[
P\bigl\{ \hat F_{nk}(t_0 + M n^{-1/3}) \ge F_{0k}(t_0 + 2 M n^{-1/3}) \bigr\} < \epsilon, \tag{5.46}
\]
\[
P\bigl\{ \hat F_{nk}(t_0 - M n^{-1/3}) \le F_{0k}(t_0 - 2 M n^{-1/3}) \bigr\} < \epsilon, \tag{5.47}
\]
since together with the monotonicity of $\hat F_{nk}$ this implies that with probability at least $1 - 2\epsilon$,
\[
\sup_{t \in [-M, M]} \bigl| \hat F_{nk}(t_0 + n^{-1/3} t) - F_{0k}(t_0) \bigr|
\le \max\bigl\{ F_{0k}(t_0 + 2 M n^{-1/3}) - F_{0k}(t_0),\ F_{0k}(t_0) - F_{0k}(t_0 - 2 M n^{-1/3}) \bigr\},
\]
which is bounded by $4 f_{0k}(t_0) M n^{-1/3}$ for large $n$. Since $M > M_1$, the result then follows.

We only prove (5.46), since the proof of (5.47) is analogous. Thus, we need to show that there exist constants $M > M_1$ and $n_1 > 0$ such that for all $n > n_1$
\[
P\bigl( \hat F_{nk}(t_0 + M n^{-1/3}) \ge F_{0k}(s_{nM}) \bigr) = P(B_{nkM}) \le \epsilon, \tag{5.48}
\]
where
\[
s_{nM} = t_0 + 2 M n^{-1/3}, \qquad
B_{nkM} = \bigl\{ \hat F_{nk}(t_0 + M n^{-1/3}) \ge F_{0k}(s_{nM}) \bigr\}.
\]
Let $\tau_{nk}$ be the largest jump point of $\hat F_{nk}$ before $t_0 + M n^{-1/3}$. As discussed in the proof of Theorem 5.10, we can choose $n_1$ and $r$ so that for all $n > n_1$
\[
P\bigl( \hat F_{nk} \text{ does not have a jump in } [t_0 - r, t_0] \bigr) < \epsilon/4.
\]
Next, note that the Fenchel conditions imply that
\[
\int_{[\tau_{nk}, s_{nM})} \delta_k \, dP_n(u, \delta) - \int_{[\tau_{nk}, s_{nM})} \hat F_{nk}(u)\, dG_{n\hat F_n}(u) \ge 0
\]
must hold. Hence,
\[
\begin{aligned}
P(B_{nkM})
&= P\Bigl( \Bigl\{ \int_{[\tau_{nk}, s_{nM})} \delta_k \, dP_n(u, \delta) - \int_{[\tau_{nk}, s_{nM})} \hat F_{nk}(u)\, dG_{n\hat F_n}(u) \ge 0 \Bigr\} \cap B_{nkM} \Bigr) \\
&\le \epsilon/4 + P\Bigl( \Bigl\{ \sup_{w \in [t_0 - r,\, t_0 + M n^{-1/3})} \Bigl[ \int_{[w, s_{nM})} \delta_k \, dP_n(u, \delta) - \int_{[w, s_{nM})} \hat F_{nk}(u)\, dG_{n\hat F_n}(u) \Bigr] \ge 0 \Bigr\} \cap B_{nkM} \Bigr).
\end{aligned} \tag{5.49}
\]
Using (5.20), we have
\[
\begin{aligned}
&\int_{[w, s_{nM})} \delta_k \, dP_n(u, \delta) - \int_{[w, s_{nM})} \hat F_{nk}(u)\, dG_{n\hat F_n}(u) \\
&\quad = \int_{[w, s_{nM})} \Bigl\{ \delta_k - \hat F_{nk}(u) + \frac{F_{0k}(s_{nM})}{1 - F_{0+}(s_{nM})} \bigl\{ \delta_+ - \hat F_{n+}(u) \bigr\} + R_{k s_{nM} \hat F_n} \Bigr\}\, dP_n.
\end{aligned} \tag{5.50}
\]
We now derive an upper bound for the last two terms in the integral (5.50). Starting with the part that involves $\delta_+ - \hat F_{n+}(u)$, we write
\[
\Bigl| \int_{[w, s_{nM})} \bigl\{ \delta_+ - \hat F_{n+}(u) \bigr\}\, dP_n(u, \delta) \Bigr|
\le \Bigl| \int_{[w, s_{nM})} \bigl\{ \delta_+ - F_{0+}(u) \bigr\}\, dP_n(u, \delta) \Bigr| + \Bigl| \int_{[w, s_{nM})} \bigl\{ F_{0+}(u) - \hat F_{n+}(u) \bigr\}\, dG_n(u) \Bigr|.
\]
We bound these terms using Lemma 5.18 and Corollary 5.19. Furthermore, we use Lemma 5.9 to bound $\int_{[w, s_{nM})} R_{k s_{nM} \hat F_n}\, dP_n$. It follows that we can choose $r > 0$ such that for all $M > 0$ and $\gamma > 0$ we can find $C > 0$ and $n_1 > 0$ such that for all $n > n_1$:
\[
\begin{aligned}
P\Bigl( \exists w \in [t_0 - r,\, t_0 + M n^{-1/3}) :\ 
&\int_{[w, s_{nM})} \Bigl[ \frac{F_{0k}(s_{nM})}{1 - F_{0+}(s_{nM})} \bigl\{ \delta_+ - \hat F_{n+}(u) \bigr\} + R_{k s_{nM} \hat F_n} \Bigr]\, dP_n \\
&> \gamma (s_{nM} - w)^2 + \bigl( n^{-2/3} + n^{-1/6}(s_{nM} - w)^{3/2} \bigr) C \Bigr) < \frac{\epsilon}{4}.
\end{aligned}
\]
This implies
\[
\begin{aligned}
P(B_{nkM}) \le \frac{\epsilon}{2} + P\Bigl( \Bigl\{ \sup_{w \in [t_0 - r,\, t_0 + M n^{-1/3})} \Bigl[ &\int_{[w, s_{nM})} \bigl\{ \delta_k - \hat F_{nk}(u) \bigr\}\, dP_n + \gamma (s_{nM} - w)^2 \\
&+ \bigl( n^{-2/3} + n^{-1/6}(s_{nM} - w)^{3/2} \bigr) C \Bigr] \ge 0 \Bigr\} \cap B_{nkM} \Bigr).
\end{aligned}
\]
Next, we consider the driving part $\int_{[w, s_{nM})} \{\delta_k - \hat F_{nk}(u)\}\, dP_n(u, \delta)$ of (5.50). The definition of $\tau_{nk}$, and the fact that $\hat F_{nk}$ is piecewise constant and monotone nondecreasing, imply that on the event $B_{nkM}$ we have, for $u \ge \tau_{nk}$:
\[
\hat F_{nk}(u) \ge \hat F_{nk}(\tau_{nk}) = \hat F_{nk}(t_0 + M n^{-1/3}) \ge F_{0k}(s_{nM}). \tag{5.51}
\]
Hence,
\[
\begin{aligned}
P(B_{nkM}) \le \frac{\epsilon}{2} + P\Bigl( \sup_{w \in [t_0 - r,\, t_0 + M n^{-1/3})} \Bigl[ &\int_{[w, s_{nM})} \bigl\{ \delta_k - F_{0k}(s_{nM}) \bigr\}\, dP_n + \gamma (s_{nM} - w)^2 \\
&+ \bigl( n^{-2/3} + n^{-1/6}(s_{nM} - w)^{3/2} \bigr) C \Bigr] \ge 0 \Bigr),
\end{aligned}
\]
and this can be bounded by $\epsilon$ by choosing $\gamma$ appropriately and $M$ large, by a slight adaptation of Lemma 5.15. Note that $\gamma$ should be chosen such that the negative quadratic drift arising from $\int_{[w, s_{nM})} \{\delta_k - F_{0k}(s_{nM})\}\, dP_n$ dominates $\gamma (s_{nM} - w)^2$. The choice $\gamma = g(t_0) f_{0k}(t_0)/32$ works. $\Box$
Remark 5.21 Theorem 5.20 also holds when $t_0$ is replaced by $s \in [t_0 - r/2, t_0 + r/2]$, for $r$ sufficiently small, for the reason discussed in Remark 5.17. To be precise, there exists an $r > 0$ such that for every $M_1 > 0$ and $\epsilon > 0$ there exist $M > 0$ and $n_1 > 0$ such that for all $s \in [t_0 - r/2, t_0 + r/2]$ and $n > n_1$:
\[
P\Bigl( \sup_{h \in [-M_1, M_1]} \bigl| \hat F_{nk}(s + n^{-1/3} h) - F_{0k}(s) \bigr| > M n^{-1/3} \Bigr) < \epsilon.
\]
Theorem 5.20 and Remark 5.21 lead to the following corollary about the distance between the jump points of $\hat F_{nk}$ around $t_0$:

Corollary 5.22 Let $t_0 \in \mathbb{R}$, and let the conditions of Theorem 5.20 be satisfied. Let $\tau^-_{nk}(t)$ be the last jump point of $\hat F_{nk}$ before $t$, and let $\tau^+_{nk}(t)$ be the first jump point of $\hat F_{nk}$ after $t$, for $k = 1, \dots, K$. Then for every $\epsilon > 0$ there exist $C > 0$ and $n_1 > 0$ such that for all $n > n_1$ and $s \in [t_0 - r/2, t_0 + r/2]$:
\[
P\bigl( \tau^+_{nk}(s) - \tau^-_{nk}(s) > C n^{-1/3} \bigr) < \epsilon.
\]
Proof: Let $s \in [t_0 - r/2, t_0 + r/2]$. Using Remark 5.21, we apply Theorem 5.20 two times: one time with $t_0$ replaced by $s$, and one time with $t_0$ replaced by $s - C_2 n^{-1/3}$ for $C_2 > 0$. This yields that for every $\epsilon > 0$ and $M_1 > 0$, there exist $n_1 > 0$ and $M > 0$, not depending on $s$ and $C_2$, such that for all $n > n_1$:
\[
P\Bigl( \sup_{t \in [-M_1, M_1]} \bigl| \hat F_{nk}(s - C_2 n^{-1/3} + n^{-1/3} t) - F_{0k}(s - C_2 n^{-1/3}) \bigr| > M n^{-1/3} \Bigr) < \epsilon,
\]
\[
P\Bigl( \sup_{t \in [-M_1, M_1]} \bigl| \hat F_{nk}(s + n^{-1/3} t) - F_{0k}(s) \bigr| > M n^{-1/3} \Bigr) < \epsilon.
\]
Furthermore, for $C_2$ sufficiently large we have
\[
F_{0k}(s - C_2 n^{-1/3}) + M n^{-1/3} < F_{0k}(s) - M n^{-1/3}.
\]
It follows that for each $s \in [t_0 - r/2, t_0 + r/2]$,
\[
P\bigl( \hat F_{nk} \text{ has a jump in the interval } (s - C_2 n^{-1/3}, s) \bigr) > 1 - 2\epsilon.
\]
The statement now follows by using similar reasoning for $\tau^+_{nk}(s)$. $\Box$
5.4 Technical lemmas and proofs
Propositions 5.23 and 5.24 give preservation theorems for bracketing entropy. These
propositions are used to verify conditions about bracketing entropy in Lemma 5.9,
Corollary 5.19, and also in Lemma 7.13 in Chapter 7. For completeness we give
proofs, although these may be known results.
Proposition 5.23 Let $P$ be a probability measure on $(\mathcal{X}, \mathcal{A})$. For $h : \mathcal{X} \to \mathbb{R}$ with $h \in L_2(P)$, let $\|h\|_2 = (\int h^2 \, dP)^{1/2}$. Let $\mathcal{H}_1$ and $\mathcal{H}_2$ be two classes of nonnegative functions from $\mathcal{X}$ to $\mathbb{R}_+$, with
\[
\log N_{[\,]}(\gamma, \mathcal{H}_i, \|\cdot\|_2) \le A_i \gamma^{-\alpha_i}, \quad i = 1, 2,
\]
for some constants $A_i > 0$ and $\alpha_i > 0$. Let $H_1$ and $H_2$ be the envelopes of $\mathcal{H}_1$ and $\mathcal{H}_2$, and assume that $\|H_1\|_2$, $\|H_2\|_2$ and $\|H_1 H_2\|_2$ are finite. Furthermore, define
\[
\mathcal{H}_3 = \{ h_1 h_2 : h_1 \in \mathcal{H}_1, h_2 \in \mathcal{H}_2 \}.
\]
Then $\log N_{[\,]}(\gamma, \mathcal{H}_3, \|\cdot\|_2) \le A' \gamma^{-(\alpha_1 \vee \alpha_2)}$ for $\gamma \le 1$, for some constant $A' > 0$.
Proof: Let $h_1 \in \mathcal{H}_1$ and $h_2 \in \mathcal{H}_2$. Let $[l_1, u_1]$ be a $\gamma$-bracket containing $h_1$ and let $[l_2, u_2]$ be a $\gamma$-bracket containing $h_2$, where the size of both brackets is computed w.r.t. $\|\cdot\|_2$. Without loss of generality we assume that $0 \le l_1 \le u_1 \le H_1$ and $0 \le l_2 \le u_2 \le H_2$. Then we can define a new bracket $[l_3, u_3] = [l_1 l_2, u_1 u_2]$ which contains $h_1 h_2$. The upper and lower bounds of this bracket are guaranteed to have finite norms by the assumption that $\|H_1 H_2\|_2$ is finite. Using the triangle inequality and the Cauchy–Schwarz inequality, we obtain
\[
\begin{aligned}
\|u_1 u_2 - l_1 l_2\|_2 &= \|u_1 (u_2 - l_2) + l_2 (u_1 - l_1)\|_2 \\
&\le \|u_1\|_2 \cdot \|u_2 - l_2\|_2 + \|l_2\|_2 \cdot \|u_1 - l_1\|_2 \\
&\le \gamma (\|u_1\|_2 + \|l_2\|_2) \le \gamma (\|H_1\|_2 + \|H_2\|_2) \equiv \gamma M.
\end{aligned}
\]
Let $N_i = N_{[\,]}(\gamma, \mathcal{H}_i, \|\cdot\|_2)$ for $i = 1, 2$. Then $N_{[\,]}(\gamma M, \mathcal{H}_3, \|\cdot\|_2) \le N_1 N_2$. Hence, the $\gamma M$-bracketing entropy for $\mathcal{H}_3$ is bounded by $\log(N_1) + \log(N_2) \le (A_1 + A_2) \gamma^{-(\alpha_1 \vee \alpha_2)}$, for $\gamma \le 1$. This implies that $\log N_{[\,]}(\gamma, \mathcal{H}_3, \|\cdot\|_2) \le M^{\alpha_1 \vee \alpha_2} (A_1 + A_2) \gamma^{-(\alpha_1 \vee \alpha_2)} \equiv A' \gamma^{-(\alpha_1 \vee \alpha_2)}$. $\Box$
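The bracket-size computation in this proof can be sanity-checked numerically. The sketch below (illustrative only) builds brackets of nonnegative functions on a discrete space whose pointwise gap is at most $\gamma$ (so their $L_2(P)$-size is too), and verifies the final inequality; the grid size and $\gamma$ are arbitrary choices:

```python
import numpy as np

# Monte Carlo sanity check of the bracket-size bound in Proposition 5.23:
# ||u1*u2 - l1*l2||_2 <= gamma * (||u1||_2 + ||l2||_2), for brackets of
# nonnegative functions with pointwise gap at most gamma.
rng = np.random.default_rng(1)
m, gamma = 1000, 0.05  # m grid points; P puts mass 1/m on each point

def norm2(h):
    return np.sqrt(np.mean(h ** 2))

for _ in range(100):
    l1 = rng.uniform(0.0, 1.0, size=m)
    l2 = rng.uniform(0.0, 1.0, size=m)
    # upper bracket ends lie at most gamma above the lower ends, pointwise
    u1 = l1 + rng.uniform(0.0, gamma, size=m)
    u2 = l2 + rng.uniform(0.0, gamma, size=m)
    assert norm2(u1 - l1) <= gamma and norm2(u2 - l2) <= gamma
    assert norm2(u1 * u2 - l1 * l2) <= gamma * (norm2(u1) + norm2(l2)) + 1e-12
```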
Proposition 5.24 Let $\|\cdot\|$ be an arbitrary norm. Let $\mathcal{H}_1$ and $\mathcal{H}_2$ be two classes of functions with
\[
\log N_{[\,]}(\gamma, \mathcal{H}_i, \|\cdot\|) \le A_i \gamma^{-\alpha_i}, \quad i = 1, 2,
\]
for some constants $A_i > 0$ and $\alpha_i > 0$. Let $H_1$ and $H_2$ be the envelopes of $\mathcal{H}_1$ and $\mathcal{H}_2$, and assume that $\|H_1\|$ and $\|H_2\|$ are finite. Furthermore, define
\[
\mathcal{H}_3 = \{ h_1 + h_2 : h_1 \in \mathcal{H}_1, h_2 \in \mathcal{H}_2 \}.
\]
Then $\log N_{[\,]}(\gamma, \mathcal{H}_3, \|\cdot\|) \le A' \gamma^{-(\alpha_1 \vee \alpha_2)}$ for $\gamma \le 1$, for some constant $A' > 0$.
Proof: Let $h_1 \in \mathcal{H}_1$ and $h_2 \in \mathcal{H}_2$. Let $[l_1, u_1]$ be a $\gamma$-bracket containing $h_1$ and let $[l_2, u_2]$ be a $\gamma$-bracket containing $h_2$, where the size of both brackets is computed w.r.t. $\|\cdot\|$. Without loss of generality, we assume that $l_i$ and $u_i$ are contained in $[-H_i, H_i]$, for $i = 1, 2$. Note that we can define a new bracket $[l_3, u_3] = [l_1 + l_2, u_1 + u_2]$ which contains $h_1 + h_2$. The upper and lower bounds of this bracket are guaranteed to have finite norms by the assumption that $\|H_1\|$ and $\|H_2\|$ are finite. The size of the bracket $[l_3, u_3]$ is
\[
\|(u_1 + u_2) - (l_1 + l_2)\| \le \|u_1 - l_1\| + \|u_2 - l_2\| \le 2\gamma.
\]
Let $N_i = N_{[\,]}(\gamma, \mathcal{H}_i, \|\cdot\|)$ for $i = 1, 2$. Then $N_{[\,]}(2\gamma, \mathcal{H}_3, \|\cdot\|) \le N_1 N_2$. Hence, the $2\gamma$-bracketing entropy for $\mathcal{H}_3$ is bounded by $\log(N_1) + \log(N_2) \le (A_1 + A_2) \gamma^{-(\alpha_1 \vee \alpha_2)}$. This implies that $\log N_{[\,]}(\gamma, \mathcal{H}_3, \|\cdot\|) \le 2^{\alpha_1 \vee \alpha_2}(A_1 + A_2) \gamma^{-(\alpha_1 \vee \alpha_2)} \equiv A' \gamma^{-(\alpha_1 \vee \alpha_2)}$. $\Box$
Next, we provide the proofs of Lemmas 5.9, 5.15 and 5.16.
Proof of Lemma 5.9: Fix $k \in \{1, \dots, K\}$. Note that
\[
\begin{aligned}
R_{ks\hat F_n}(u, \delta)
&= \frac{\delta_+ - \hat F_{n+}(u)}{(1 - \hat F_{n+}(u))(1 - F_{0+}(s))} \bigl[ \hat F_{nk}(u)(1 - F_{0+}(s)) - F_{0k}(s)(1 - \hat F_{n+}(u)) \bigr] \\
&= \frac{\delta_+ - \hat F_{n+}(u)}{(1 - \hat F_{n+}(u))(1 - F_{0+}(s))} \bigl[ \hat F_{nk}(u)(\hat F_{n+}(u) - F_{0+}(s)) + (1 - \hat F_{n+}(u))(\hat F_{nk}(u) - F_{0k}(s)) \bigr] \\
&= (\delta_+ - \hat F_{n+}(u))(\hat F_{n+}(u) - F_{0+}(s)) \frac{F_{0k}(s)}{(1 - F_{0+}(s))^2} (1 + O(s - u) + o_p(1)) \\
&\quad + (\delta_+ - \hat F_{n+}(u))(\hat F_{nk}(u) - F_{0k}(s)) \frac{1}{1 - F_{0+}(s)}.
\end{aligned}
\]
The last line follows from continuity of $F_{0k}$ and $F_{0+}$ and consistency of $\hat F_{nk}$ and $\hat F_{n+}$ (Proposition 4.15), so that we can replace $\hat F_{nk}(u)/(1 - \hat F_{n+}(u))$ by
\[
\frac{F_{0k}(s)}{1 - F_{0+}(s)} + \frac{\hat F_{nk}(u)(\hat F_{n+}(u) - F_{0+}(s)) + (\hat F_{nk}(u) - F_{0k}(s))(1 - \hat F_{n+}(u))}{(1 - \hat F_{n+}(u))(1 - F_{0+}(s))}
= \frac{F_{0k}(s)}{1 - F_{0+}(s)} (1 + O(s - u) + o_p(1)).
\]
It is sufficient to analyze the leading terms $(\delta_+ - \hat F_{n+}(u))(\hat F_{n+}(u) - F_{0+}(s))$ and $(\delta_+ - \hat F_{n+}(u))(\hat F_{nk}(u) - F_{0k}(s))$ of $R_{ks\hat F_n}$. In fact, we only need to analyze the latter term, since the result for the first term then follows by summing over $k = 1, \dots, K$. We can discard the factors $F_{0k}(s)/\{1 - F_{0+}(s)\}^2$ and $1/\{1 - F_{0+}(s)\}$, since these are bounded between two positive constants under the conditions of Theorem 5.10. We now write:
\[
\begin{aligned}
&\{\delta_+ - \hat F_{n+}(u)\}\{\hat F_{nk}(u) - F_{0k}(s)\} \\
&\quad = \{\delta_+ - F_{0+}(u)\}\{\hat F_{nk}(u) - F_{0k}(u)\} + \{\delta_+ - F_{0+}(u)\}\{F_{0k}(u) - F_{0k}(s)\} \\
&\qquad + \{F_{0+}(u) - \hat F_{n+}(u)\}\{\hat F_{nk}(u) - F_{0k}(u)\} + \{F_{0+}(u) - \hat F_{n+}(u)\}\{F_{0k}(u) - F_{0k}(s)\} \\
&\quad \equiv R^{(1)}(u, \delta) + R^{(2)}(u, \delta) + R^{(3)}(u, \delta) + R^{(4)}(u, \delta).
\end{aligned}
\]
For $j = 1, \dots, 4$, we write
\[
\Bigl| \int_{[w,s)} R^{(j)}\, dP_n \Bigr| \le \Bigl| \int_{[w,s)} R^{(j)}\, d(P_n - P) \Bigr| + \Bigl| \int_{[w,s)} R^{(j)}\, dP \Bigr|. \tag{5.52}
\]
We first show that the second term on the right side of (5.52) is of order $O_p(n^{-2/3} + n^{-1/3}(s - w)^{3/2})$, uniformly in $w < s$. Namely, let $w < s$, and note that
\[
\int_{[w,s)} R^{(1)}\, dP = \int_{[w,s)} R^{(2)}\, dP = 0.
\]
Furthermore, using Cauchy–Schwarz and the $L_2(G)$ rate of convergence (5.7) yields
\[
\Bigl| \int_{[w,s)} R^{(3)}\, dP \Bigr| \le \sqrt{ \int \{F_{0+}(u) - \hat F_{n+}(u)\}^2\, dG }\, \sqrt{ \int \{\hat F_{nk}(u) - F_{0k}(u)\}^2\, dG } = O_p(n^{-2/3}).
\]
Similarly, we obtain $\bigl| \int_{[w,s)} R^{(4)}\, dP \bigr| = O_p(n^{-1/3}(s - w)^{3/2})$.
We now consider the first term on the right side of (5.52), starting with $j = 2$:
\[
\Bigl| \int_{[w,s)} R^{(2)}\, d(P_n - P) \Bigr|
\le \Bigl| \int_{[w,s)} \delta_+ \{F_{0k}(u) - F_{0k}(s)\}\, d(P_n - P) \Bigr|
+ \Bigl| \int_{[w,s)} F_{0+}(u)\{F_{0k}(u) - F_{0k}(s)\}\, d(G_n - G) \Bigr|. \tag{5.53}
\]
The second term on the right side of (5.53) is of order $O_p(n^{-1/2}(s - w))$, uniformly in $t_0 - 2r < w < s < t_0 + 2r$, by Lemma 5.13. Letting $G_{n+}(u) = P_n \{\Delta_+ 1\{T \le u\}\}$ and $G_+(u) = P \{\Delta_+ 1\{T \le u\}\}$, the first term on the right side of (5.53) can be written as $\int_{[w,s)} \{F_{0k}(u) - F_{0k}(s)\}\, d(G_{n+} - G_+)(u)$. Note that $n^{1/2}(G_{n+} - G_+)$ converges in distribution to a mean zero Gaussian process, and satisfies $\sup_u |G_{n+}(u) - G_+(u)| = O_p(n^{-1/2})$. Hence, we can also bound the first term on the right side of (5.53) by $O_p(n^{-1/2}(s - w))$, along the lines of Lemma 5.13.

We are left with the terms $\int_{[w,s)} R^{(j)}\, d(P_n - P)$, for $j = 1, 3, 4$. We bound these terms using the modulus of continuity result of Van de Geer (2000, Lemma 5.13, page 79). We only consider $j = 1$, since $j = 3$ and $j = 4$ are analogous. Let
\[
\mathcal{Q} = \bigl\{ q(u, \delta) = q_{wsF}(u, \delta) = \{\delta_+ - F_{0+}(u)\}\{F(u) - F_{0k}(u)\}\, 1_{[w,s)}(u) : w < s, \ F \in \mathcal{F} \bigr\},
\]
where $\mathcal{F}$ is the class of monotone functions $F : \mathbb{R} \to [0,1]$. Taking $q_0 \equiv 0$, it is clear that
\[
\sup_{q \in \mathcal{Q}} \|q - q_0\|_{\infty} \le 1,
\]
so that Van de Geer's condition (5.39) is satisfied. In order to satisfy her condition (5.40), we need to show that $\log N_{[\,]}(\gamma, \mathcal{Q}, L_2(G))$ is bounded by $A \gamma^{-1}$, for some constant $A > 0$. It is well known that $\log N_{[\,]}(\gamma, \mathcal{F}, L_2(H)) \lesssim 1/\gamma$, uniformly in probability measures $H$ on the underlying sample space (see, e.g., Van de Geer (2000, page 18, equation (2.5)) or Van der Vaart and Wellner (1996, Theorem 2.7.5, page 159)). Furthermore, the same bound holds for the class of indicator functions $\{1_{[w,s)} : w < s\}$, since they are of bounded variation (see, e.g., Van de Geer (2000, page 18, equation (2.6))). Since the functions $q \in \mathcal{Q}$ consist of sums and products of functions from classes with bracketing entropy bounded by $A \gamma^{-\alpha}$, it follows from Propositions 5.23 and 5.24 that $\log N_{[\,]}(\gamma, \mathcal{Q}, L_2(G))$ is bounded by $A' \gamma^{-1}$ for some constant $A' > 0$. Hence, Van de Geer's condition (5.40) is satisfied.

Next, we define $\mathcal{Q}(\gamma) = \{ q \in \mathcal{Q} : \|q\|_2 \le \gamma \}$. Using the $L_2(G)$ rate of convergence (5.7), we have
\[
\| q_{ws\hat F_{nk}} \|_2^2 = \int_{[w,s)} \{\delta_+ - F_{0+}(u)\}^2 \{\hat F_{nk}(u) - F_{0k}(u)\}^2\, dP(u, \delta)
\le \int \{\hat F_{nk}(u) - F_{0k}(u)\}^2\, dG(u) = O_p(n^{-2/3}),
\]
uniformly in $w < s$. Hence, for every $\epsilon > 0$ we can choose $C > 0$ such that
\[
P\bigl( q_{ws\hat F_{nk}} \in \mathcal{Q}(C n^{-1/3}) \text{ for all } w < s \bigr) > 1 - \epsilon.
\]
Applying Van de Geer (2000, Lemma 5.13, page 79, eq. (5.42)) with $\alpha = 1$ and $\beta = 0$ to the class $\mathcal{Q}(C n^{-1/3})$ yields:
\[
\sup_{q \in \mathcal{Q}(C n^{-1/3})} \Bigl| \int q \, d(P_n - P) \Bigr| = O_p(n^{-2/3}).
\]
Hence $\int_{[w,s)} R^{(1)}\, d(P_n - P) = O_p(n^{-2/3})$ uniformly in $w < s$. The integrals involving $R^{(3)}$ and $R^{(4)}$ can be handled similarly.

Combining everything, we have shown that there exists an $r > 0$ such that
\[
\Bigl| \int_{[w,s)} R_{ks\hat F_n}(u, \delta)\, dP_n \Bigr|
= O_p\bigl\{ n^{-2/3} + n^{-1/2}(s - w) + n^{-1/3}(s - w)^{3/2} \bigr\}
= O_p\bigl( n^{-2/3} + n^{-1/3}(s - w)^{3/2} \bigr),
\]
uniformly in $t_0 - 2r < w < s < t_0 + 2r$, and for $k = 1, \dots, K$. $\Box$
Proof of Lemma 5.15: Let $C > 0$. It is sufficient to show that there exist $r > 0$, $n_1 > 0$ and $M > 0$ such that the statement holds for a fixed $k$. Let $k \in \{1, \dots, K\}$, $n > 0$ and $j \in \{0, \dots, J_n\}$. On the event $E_{nrC}$ we have, for $w \in [t_0 - 2r, t_{n,j+1})$:
\[
\begin{aligned}
\int_{[w, s_{njM})} \{\delta_k - F_{0k}(s_{njM})\}\, dP_n
&= \int_{[w, s_{njM})} \{\delta_k - F_{0k}(u) + F_{0k}(u) - F_{0k}(s_{njM})\}\, dP_n \\
&\le \int_{[w, s_{njM})} \{\delta_k - F_{0k}(u)\}\, dP_n - \frac{g(t_0) f_{0k}(t_0)(s_{njM} - w)^2}{8},
\end{aligned}
\]
since $s_{njM} - w \ge s_{njM} - t_{n,j+1} \ge (M - 1) n^{-1/3} > n^{-1/3}$ for $M > 2$. Furthermore, for $M$ large we have
\[
\bigl( n^{-2/3} + n^{-1/6}(s_{njM} - w)^{3/2} \bigr) C \le g(t_0) f_{0k}(t_0)(s_{njM} - w)^2 / 16,
\]
since $s_{njM} - w > (M - 1) n^{-1/3}$. Hence, it is sufficient to bound
\[
P\Bigl[ \sup_{w \in (t_0 - 2r,\, t_{n,j+1})} \Bigl\{ \int_{[w, s_{njM})} \{\delta_k - F_{0k}(u)\}\, dP_n - \frac{g(t_0) f_{0k}(t_0)(s_{njM} - w)^2}{16} \Bigr\} \ge 0 \Bigr]. \tag{5.54}
\]
To do so, we put a grid on the interval $[t_0 - 2r, t_{n,j+1})$. The grid points $t_{n,j-q}$ and grid cells $I_{n,j-q}$ are given by
\[
I_{n,j-q} = [t_{n,j-q}, t_{n,j-q+1}) = [t_0 + (j - q) n^{-1/3}, t_0 + (j - q + 1) n^{-1/3}),
\]
for $q = 0, \dots, Q_{nj} = \lceil 2 r n^{1/3} + j \rceil$. We then bound (5.54) above by
\[
\sum_{q=0}^{Q_{nj}} P\Bigl[ \sup_{w \in I_{n,j-q}} \int_{[w, s_{njM})} \{\delta_k - F_{0k}(u)\}\, dP_n \ge \lambda_{nkjqM} \Bigr], \tag{5.55}
\]
where $\lambda_{nkjqM} = f_{0k}(t_0) g(t_0) (s_{njM} - t_{n,j-q+1})^2 / 16$. If we bound the $q$th term in (5.55)
by
\[
p_{jqM} = \begin{cases} 2 \exp\{-d_2 (q + M)^{3/2}\} & \text{if } j = 0,\ q = 0, \dots, Q_{n0}, \\ 2 \exp\{-d_2 (q + M j^{\beta})^{3/2}\} & \text{if } j = 1, \dots, J_n,\ q = 0, \dots, Q_{nj}, \end{cases}
\]
then we are done, because summing over $q$ and using $(a + b)^{3/2} \ge a^{3/2} + b^{3/2}$ for $a, b > 0$ yields
\[
\sum_{q=0}^{Q_{nj}} p_{jqM} \le \begin{cases} d_1 \exp\{-d_2 M^{3/2}\} & \text{if } j = 0, \\ d_1 \exp\{-d_2 (M j^{\beta})^{3/2}\} & \text{if } j = 1, \dots, J_n, \end{cases}
\]
where $d_1 = 2 \sum_{q=0}^{\infty} \exp(-d_2 q^{3/2}) < \infty$.
To prove that the $q$th term in (5.55) is bounded by $p_{jqM}$, we use the fact that a bounded Orlicz norm $\|X\|_{\psi_p}$, for $p \ge 1$, gives an exponential bound on tail probabilities; see, e.g., Van der Vaart and Wellner (1996, page 96 or 239):
\[
P(|X| > t) \le 2 \exp\bigl( -t^p / \|X\|_{\psi_p}^p \bigr). \tag{5.56}
\]
Here the Orlicz norm is $\|X\|_{\psi_p} = \inf\{ c > 0 : E \psi_p(|X|/c) \le 1 \}$ with $\psi_p(x) = \exp(x^p) - 1$. In order to apply inequality (5.56), we define
\[
\mathcal{F}_{nkjqM} = \bigl\{ (\delta_k - F_{0k}(u))\, 1_{[w, s_{njM})}(u) : w \in I_{n,j-q} \bigr\}
\]
and $\|\mathbb{G}_n\|_{\mathcal{F}_{nkjqM}} = \sup_{f \in \mathcal{F}_{nkjqM}} \mathbb{G}_n f = \sup_{f \in \mathcal{F}_{nkjqM}} \sqrt{n}(P_n - P) f$. Then the $q$th term of (5.55) equals
\[
P\bigl\{ \|\mathbb{G}_n\|_{\mathcal{F}_{nkjqM}} \ge \sqrt{n}\, \lambda_{nkjqM} \bigr\} \le 2 \exp\Bigl( -\sqrt{n}\, \lambda_{nkjqM} \big/ \bigl\| \|\mathbb{G}_n\|_{\mathcal{F}_{nkjqM}} \bigr\|_{\psi_1} \Bigr), \tag{5.57}
\]
where the inequality follows by applying (5.56) with $p = 1$. Thus, if we can bound the $\psi_1$-Orlicz norm of $\|\mathbb{G}_n\|_{\mathcal{F}_{nkjqM}}$, then we are done. Let $F_{nkjqM}$ be the envelope of $\mathcal{F}_{nkjqM}$:
\[
F_{nkjqM}(u, \delta) = |\delta_k - F_{0k}(u)| \, 1_{[t_{n,j-q}, s_{njM})}(u) \le 1_{[t_{n,j-q}, s_{njM})}(u).
\]
Using Theorem 2.14.5 of Van der Vaart and Wellner (1996, page 244) with $p = 1$, followed by their Theorem 2.14.1 on page 239, we get
\[
\bigl\| \|\mathbb{G}_n\|^*_{\mathcal{F}_{nkjqM}} \bigr\|_{\psi_1}
\lesssim \bigl\| \|\mathbb{G}_n\|_{\mathcal{F}_{nkjqM}} \bigr\|_1 + n^{-1/2}(1 + \log n) \|F_{nkjqM}\|_{\psi_1}
\lesssim J(1, \mathcal{F}_{nkjqM}) \|F_{nkjqM}\|_2 + n^{-1/2} \log n\, \|F_{nkjqM}\|_{\psi_1}. \tag{5.58}
\]
Note that
\[
P F_{nkjqM}^2 \le G(1_{[t_{n,j-q}, s_{njM})}).
\]
The function $J(1, \mathcal{F}_{nkjqM})$ is constant in our case. Hence the first term of (5.58) is given by $\sqrt{G(1_{[t_{n,j-q}, s_{njM})})}$.

We now compute the second term of (5.58). Since $F_{nkjqM}(u, \delta) \le 1_{[t_{n,j-q}, s_{njM})}(u)$, we have
\[
\psi_1(F_{nkjqM}/c) = \exp(F_{nkjqM}/c) - 1 \le \{\exp(1/c) - 1\}\, 1_{[t_{n,j-q}, s_{njM})}(u),
\]
and $P \psi_1(F_{nkjqM}/c) \le \{\exp(1/c) - 1\}\, G(1_{[t_{n,j-q}, s_{njM})})$. This expectation is bounded by one if and only if $c \ge [\log\{1 + 1/G(1_{[t_{n,j-q}, s_{njM})})\}]^{-1}$. Hence,
\[
\|F_{nkjqM}\|_{\psi_1} \le \bigl[ \log\bigl\{ 1 + 1/G(1_{[t_{n,j-q}, s_{njM})}) \bigr\} \bigr]^{-1}.
\]
Plugging this into (5.58) gives
\[
\bigl\| \|\mathbb{G}_n\|^*_{\mathcal{F}_{nkjqM}} \bigr\|_{\psi_1}
\lesssim J(1, \mathcal{F}_{nkjqM}) \|F_{nkjqM}\|_2 + n^{-1/2} \log n\, \|F_{nkjqM}\|_{\psi_1}
\lesssim \sqrt{G(1_{[t_{n,j-q}, s_{njM})})} + n^{-1/2} \log n\, \bigl[ \log\bigl\{ 1 + 1/G(1_{[t_{n,j-q}, s_{njM})}) \bigr\} \bigr]^{-1}. \tag{5.59}
\]
The first term of (5.59) dominates the expression. To see this, let $x = G(1_{[t_{n,j-q}, s_{njM})})$ and note that $x \in [0,1]$. Since the length of the interval $[t_{n,j-q}, s_{njM})$ is at least $M n^{-1/3}$, we can assume that $x \ge d_0 n^{-1/3}$ with $d_0 = g(t_0) M / 2$. Now note that
\[
\frac{2}{\sqrt{d_0}} \sqrt{x}\, \log(1 + 1/x) \ge \frac{2}{\sqrt{d_0}} \sqrt{x}\, \log 2 \ge 2 (\log 2)\, n^{-1/6} \ge n^{-1/2} \log n. \tag{5.60}
\]
Here the first inequality follows from the fact that $x \mapsto \log(1 + 1/x)$ is decreasing, so that $\log(1 + 1/x) \ge \log 2$ for $x \in [0,1]$. The second inequality follows from $x \ge d_0 n^{-1/3}$, and the third inequality follows from $n^{-1/3} \log n \le 2 \log 2$ for all $n \ge 1$. Dividing both sides of (5.60) by $\log(1 + 1/x)$ yields that (5.59) is bounded by $(1 + 2/\sqrt{d_0})$ times its first term. Plugging this into (5.57) yields, for some constant $b > 0$,
\[
P\bigl( \|\mathbb{G}_n\|_{\mathcal{F}_{nkjqM}} > \sqrt{n}\, \lambda_{nkjqM} \bigr) \le 2 \exp\Bigl( - \frac{b \sqrt{n}\, \lambda_{nkjqM}}{\sqrt{G(1_{[t_{n,j-q}, s_{njM})})}} \Bigr).
\]
Now recall that
\[
s_{njM} - t_{n,j-q} = q n^{-1/3} + M v_n(t_{nj} - t_0), \qquad
\lambda_{nkjqM} = f_{0k}(t_0) g(t_0) \bigl\{ (q - 1) n^{-1/3} + M v_n(t_{nj} - t_0) \bigr\}^2 / 16,
\]
and let
\[
x_{njqM} = q n^{-1/3} + M v_n(t_{nj} - t_0) = \begin{cases} (q + M) n^{-1/3} & \text{if } j = 0, \\ (q + M j^{\beta}) n^{-1/3} & \text{if } j = 1, \dots, J_n. \end{cases}
\]
Then we have $\lambda_{nkjqM} \ge f_{0k}(t_0) g(t_0) x_{njqM}^2 / 32$ for $M \ge 5$, and $G(1_{[t_{n,j-q}, s_{njM})}) \le 2 g(t_0) x_{njqM}$. Hence,
\[
\frac{b \sqrt{n}\, \lambda_{nkjqM}}{\sqrt{G(1_{[t_{n,j-q}, s_{njM})})}} \ge \frac{d_2 \sqrt{n}\, x_{njqM}^2}{\sqrt{x_{njqM}}} = \begin{cases} d_2 (q + M)^{3/2} & \text{if } j = 0, \\ d_2 (q + M j^{\beta})^{3/2} & \text{if } j = 1, \dots, J_n, \end{cases}
\]
where $d_2 = b f_{0k}(t_0) \sqrt{g(t_0)} / (32 \sqrt{2})$. $\Box$

[Figure 5.3 here: curves of $F_{01}$, $F_{02}$, $F_{0+}$ and $\hat F_{n1}$, $\hat F_{n2}$, $\hat F_{n+}$, with the points $\tau_{n1}$, $\tau_{n2}$, $t_{n,j+1}$ and $s_{njM}$ marked on the time axis.]

Figure 5.3: Example clarifying the treatment of the $\hat F_{n+}$ term in Lemma 5.16. Note that $\hat F_{n+}(t_{n,j+1}) > F_{0+}(s_{njM})$, $\hat F_{n1}(t_{n,j+1}) > F_{01}(s_{njM})$, and $\hat F_{n2}(t_{n,j+1}) < F_{02}(s_{njM})$. Thus, in this example $l = 1$ (see (5.38) and (5.39)). Since $\hat F_{n+}(\tau_{n1}) < F_{0+}(s_{njM})$ we cannot apply the method of Lemma 5.15.
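The exponential tail inequality (5.56) that drives the bound $p_{jqM}$ can be illustrated in a simple closed-form case (not tied to the empirical process above):

```python
import math

# Illustrative check of the Orlicz tail bound (5.56) with p = 1, for
# X ~ Exponential(1): E exp(X/c) = 1/(1 - 1/c) for c > 1, so the psi_1 norm
# solves 1/(1 - 1/c) - 1 = 1, giving ||X||_{psi_1} = c = 2.
c = 2.0
assert abs((1.0 / (1.0 - 1.0 / c) - 1.0) - 1.0) < 1e-12

# (5.56) with p = 1 then reads P(X > t) = exp(-t) <= 2 exp(-t/2) for t >= 0.
for t in (0.1, 1.0, 5.0, 25.0, 100.0):
    assert math.exp(-t) <= 2.0 * math.exp(-t / c)
```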
Proof of Lemma 5.16: We first note that $l$ is only defined on the event $A_{njM} = \{\hat F_{n+}(t_{n,j+1}) \ge F_{0+}(s_{njM})\}$. Hence, this entire proof should be read on the event $A_{njM}$. Furthermore, note that we can apply the method of proof of Lemma 5.15 if $\hat F_{n+}(u) \ge F_{0+}(s_{njM})$ for all $u \ge \tau_{nl}$. This situation occurs if $l = K$, because in that case none of the sub-distribution functions jump on the interval $(\tau_{nl}, t_{n,j+1})$.

Now suppose that $l < K$. Then we typically do not have that $\hat F_{n+}(u) \ge F_{0+}(s_{njM})$ for all $u \ge \tau_{nl}$, as illustrated in Figure 5.3. Hence, we cannot apply the method of Lemma 5.15. Instead, we exploit the $K$-dimensional system of sub-distribution functions by breaking $\int_{[\tau_{nl}, s_{njM})} \{\delta_+ - \hat F_{n+}(u)\}\, dP_n$ into pieces that we analyze separately. First, we define $l^* \in \{l, \dots, K\}$ as follows. If
\[
\int_{[\tau_{nl}, \tau_{nk})} \{\delta_+ - \hat F_{n+}(u)\}\, dP_n \ge 0, \quad \text{for all } k = l+1, \dots, K, \tag{5.61}
\]
we let $l^* = l$. Otherwise we define $l^*$ such that
\[
\int_{[\tau_{nl}, \tau_{nk})} \{\delta_+ - \hat F_{n+}(u)\}\, dP_n \ge 0, \quad k = l^*+1, \dots, K, \tag{5.62}
\]
\[
\int_{[\tau_{nl}, \tau_{nl^*})} \{\delta_+ - \hat F_{n+}(u)\}\, dP_n < 0. \tag{5.63}
\]
Then, by (5.63) and the decomposition [τnl, snjM) = [τnl, τnl*) ∪ [τnl*, snjM), we get

∫_{[τnl, snjM)} {δ+ − Fn+(u)} dPn ≤ ∫_{[τnl*, snjM)} {δ+ − Fn+(u)} dPn,   (5.64)

where strict inequality holds if l ≠ l*. By rearranging the sum and using the notation τn,K+1 = snjM, we can rewrite the right side of (5.64) as

∑_{k=l*+1}^{K} ∫_{[τnl*, τnk)} {δk − Fnk(u)} dPn + ∑_{k=l*}^{K} ∑_{p=1}^{k} ∫_{[τnk, τn,k+1)} {δp − Fnp(u)} dPn.   (5.65)
We now derive upper bounds for both terms in (5.65), on the event AnjM ∩ EnrC. Starting with the first term, note that

∫_{[τnl*, τnk)} {δ+ − Fn+(u)} dPn ≥ 0, k = l* + 1, …, K.   (5.66)

Namely, if l = l* then (5.66) is the same as (5.61). On the other hand, if l < l* then (5.66) follows (with strict inequality) from (5.62), (5.63) and the decomposition [τnl, τnk) = [τnl, τnl*) ∪ [τnl*, τnk). Furthermore, the Fenchel conditions (see Proposition 2.36 and expression (5.20)) imply that

∫_{[t, τnk)} { δk − Fnk(u) + [F0k(τnk)/{1 − F0+(τnk)}] {δ+ − Fn+(u)} + R_{kτnkFn}(u, δ) } dPn ≤ 0,

for k = 1, …, K, t ≤ τnk. Using this inequality with t = τnl* together with (5.66) and
F0k(τnk)/{1 − F0+(τnk)} > 0 yields that

∫_{[τnl*, τnk)} {δk − Fnk(u) + R_{kτnkFn}(u, δ)} dPn ≤ 0.

Hence, on the event EnrC we have

∫_{[τnl*, τnk)} {δk − Fnk(u)} dPn ≤ −∫_{[τnl*, τnk)} R_{kτnkFn}(u, δ) dPn
≤ {n^{−2/3} + n^{−1/6}(τnk − τnl*)^{3/2}} C
≤ {n^{−2/3} + n^{−1/6}(snjM − τnl*)^{3/2}} C,

for k = l* + 1, …, K, using the definition of EnrC in (5.30). This implies that, on the event EnrC, the first term of (5.65) is bounded by {n^{−2/3} + n^{−1/6}(snjM − τnl*)^{3/2}} KC.
We now derive an upper bound for the second term of (5.65). Note that the inequalities (5.38) in the definition of l imply that on the event AnjM

∑_{p=k+1}^{K} Fnp(tn,j+1) < ∑_{p=k+1}^{K} F0p(snjM), k = l, …, K.

Together with the definition of τn1, …, τnK, this yields that on the event AnjM = {Fn+(tn,j+1) ≥ F0+(snjM)}, we have

∑_{p=1}^{k} Fnp(τnp) = ∑_{p=1}^{k} Fnp(tn,j+1) > ∑_{p=1}^{k} F0p(snjM), k = l, …, K.

Furthermore, Fnp(τnp) ≤ Fnp(τnk) for p ≤ k, by the monotonicity of Fnp and the ordering τn1 ≤ ⋯ ≤ τnK. Hence, we get for k = l, …, K and u ≥ τnk:

∑_{p=1}^{k} Fnp(u) ≥ ∑_{p=1}^{k} Fnp(τnk) ≥ ∑_{p=1}^{k} Fnp(τnp) > ∑_{p=1}^{k} F0p(snjM).
This means that on the event AnjM the second term of (5.65) is bounded above by

∑_{k=l*}^{K} ∑_{p=1}^{k} ∫_{[τnk, τn,k+1)} {δp − F0p(snjM)} dPn = ∑_{k=1}^{K} ∫_{[τnk ∨ τnl*, snjM)} {δk − F0k(snjM)} dPn.
Combining (5.64), (5.65) and the upper bound for (5.65) on the event AnjM ∩ EnrC, we obtain:

P( {∫_{[τnl, snjM)} (δ+ − Fn+) dPn ≥ 0} ∩ AnjM ∩ EnrC )
≤ P( {∫_{[τnl*, snjM)} (δ+ − Fn+) dPn ≥ 0} ∩ AnjM ∩ EnrC )
≤ P( { {n^{−2/3} + n^{−1/6}(snjM − τnl*)^{3/2}} KC + ∑_{k=1}^{K} ∫_{[τnk ∨ τnl*, snjM)} {δk − F0k(snjM)} dPn ≥ 0 } ∩ EnrC )
≤ P( { {n^{−2/3} + n^{−1/6}(snjM − τnl*)^{3/2}} KC + ∫_{[τnl*, snjM)} {δ1 − F01(snjM)} dPn ≥ 0 } ∩ EnrC )
  + P( { ∑_{k=2}^{K} ∫_{[τnk ∨ τnl*, snjM)} {δk − F0k(snjM)} dPn ≥ 0 } ∩ EnrC )
≤ P( { sup_{w ∈ (t0−2r, tn,j+1)} [ {n^{−2/3} + n^{−1/6}(snjM − w)^{3/2}} KC + ∫_{[w, snjM)} {δ1 − F01(snjM)} dPn ] ≥ 0 } ∩ EnrC )
  + P( { sup_{k ∈ {1,…,K}, w ∈ (t0−2r, tn,j+1)} ∫_{[w, snjM)} {δk − F0k(snjM)} dPn ≥ 0 } ∩ EnrC ).

We can bound both terms on the last two lines by pjM/4, using Lemma 5.15. □
Chapter 6
LIMITING DISTRIBUTION
In Section 5.3 we showed that, for k = 1, …, K,

n^{1/3}{F̂nk(t0) − F0k(t0)} = Op(1) and n^{1/3}{F̃nk(t0) − F0k(t0)} = Op(1),

where F̂nk denotes the MLE and F̃nk the naive estimator. In this chapter we discuss the limiting distributions of these quantities. In Section 6.1 we show that the limiting distribution of the naive estimator (F̃n1, …, F̃nK) is given by the slopes of the convex minorants of a K-tuple of two-sided correlated Brownian motion processes plus parabolic drifts. In Section 6.2 we discuss analogous results for the MLE. We will see that the limiting distribution of the MLE is given by the slopes of the convex minorants of the K-tuple of two-sided Brownian motion processes plus parabolic drifts, plus an extra term involving the difference between the sum of the K drifting Brownian motions and their convex minorants. This extra term makes the system of processes self-induced. Hence, existence and uniqueness of these processes are not automatic, and we formally establish these properties in Theorem 6.9. In Theorem 6.10 we prove convergence of the MLE to its limiting distribution. Technical proofs are collected in Section 6.3.
Throughout this chapter, we use the following conventions and notation. We assume that the naive estimators F̃nk are right-continuous and piecewise constant, with jumps only at T1, …, Tn. Similarly, we assume that for each k ∈ {1, …, K}, the MLE F̂nk is right-continuous and piecewise constant, with jumps only at points in T_k (see Definition 2.22). We denote the right-continuous derivative of a function f : R → R by f′ (if it exists). Furthermore, N is the collection of nonnegative integers {0, 1, …}, l∞[−m, m] denotes the set of uniformly bounded real functions on [−m, m], C[−m, m] is the set of continuous real functions on [−m, m], and D[−m, m] is the set of cadlag functions on [−m, m]. Finally, we use the following definition for integrals and indicator functions:
Definition 6.1 For t < t0 we define

1_{[t0,t]}(u) = −1_{[t,t0]}(u) and 1_{[t0,t)}(u) = −1_{[t,t0)}(u).

Furthermore, in analogy with the definition of the signed Riemann integral, we define for t < t0:

∫_{[t0,t)} f(u) dA(u) = ∫ f(u) 1_{[t0,t)}(u) dA(u) = −∫ f(u) 1_{[t,t0)}(u) dA(u) = −∫_{[t,t0)} f(u) dA(u),

if dA is a Lebesgue–Stieltjes measure, with a similar definition if both endpoints of the interval are closed. We use the same notation for integrals with respect to Brownian motion W(·). Thus, we define for t < t0:

∫_{t0}^{t} f(u) dW(u) = −∫_{t}^{t0} f(u) dW(u).
6.1 The limiting distribution of the naive estimator
The limiting distribution of the naive estimator follows by generalizing known results
on the MLE for univariate current status data (Groeneboom and Wellner (1992,
Theorem 5.1, page 89)). To describe this limiting distribution, we define the following
processes:
Definition 6.2 Let W = (W1, …, WK) be a K-tuple of two-sided Brownian motion processes originating from zero, with mean zero and covariances

E{Wj(t)Wk(s)} = (|s| ∧ |t|) 1{st > 0} Σjk, s, t ∈ R,

where Σjk = g(t0)[1{j = k} F0k(t0) − F0j(t0)F0k(t0)], for j, k ∈ {1, …, K}. Furthermore, let

Xk(t) = Wk(t)/g(t0) + (1/2) f0k(t0) t², k = 1, …, K, t ∈ R.
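The processes of Definition 6.2 are straightforward to simulate on a grid. The sketch below is our own illustration (the function name `simulate_X` and all grid parameters are assumptions, not part of the thesis): it builds the multinomial covariance matrix Σ, glues two independent one-sided Brownian motions at zero to obtain the two-sided process, and adds the parabolic drifts.

```python
import numpy as np

def simulate_X(t0_density, F0, f0, tmax=15.0, ngrid=3000, seed=1):
    """Simulate X_k(t) = W_k(t)/g(t0) + f_0k(t0) t^2 / 2 of Definition 6.2
    on a grid of [-tmax, tmax].  t0_density is g(t0); F0 and f0 are arrays
    holding F_0k(t0) and f_0k(t0), k = 1, ..., K."""
    rng = np.random.default_rng(seed)
    K = len(F0)
    # multinomial covariance of Delta | T = t0, scaled by g(t0)
    Sigma = t0_density * (np.diag(F0) - np.outer(F0, F0))
    L = np.linalg.cholesky(Sigma)
    h = np.linspace(0.0, tmax, ngrid)      # one-sided time grid
    dt = h[1] - h[0]

    def one_sided():
        # correlated increments with covariance Sigma * dt per step
        dW = L @ rng.standard_normal((K, ngrid - 1)) * np.sqrt(dt)
        return np.concatenate([np.zeros((K, 1)), np.cumsum(dW, axis=1)], axis=1)

    Wr, Wl = one_sided(), one_sided()      # independent right and left halves
    t = np.concatenate([-h[::-1], h[1:]])  # full two-sided grid
    W = np.concatenate([Wl[:, ::-1], Wr[:, 1:]], axis=1)
    X = W / t0_density + 0.5 * f0[:, None] * t[None, :] ** 2
    return t, X
```

Because the two halves are independent and both start at zero, the covariance E{Wj(t)Wk(s)} = (|s| ∧ |t|) 1{st > 0} Σjk holds on the grid.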
Definition 6.3 Let Hk be the convex minorant of Xk, i.e., Hk is convex and satisfies the following conditions:

Hk(t) ≤ Xk(t), k = 1, …, K, t ∈ R,

∫ {Hk(t) − Xk(t)} dH′k(t) = 0, k = 1, …, K.

Furthermore, let H = (H1, …, HK), and let U(t) = (U1(t), …, UK(t)) be the vector of right derivatives of H at t, i.e., Uk(t) = H′k(t) for k = 1, …, K and t ∈ R.
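On a finite grid, the convex minorant of Definition 6.3 can be computed with a single lower-convex-hull pass over the points (t_i, X_k(t_i)). The function below is our own illustrative sketch (its name is an assumption); it returns the minorant values at the grid points, from which the slope process U_k is obtained by differencing.

```python
import numpy as np

def convex_minorant(t, x):
    """Greatest convex minorant of the points (t_i, x_i), t sorted increasing.
    Builds the lower convex hull with a stack, then interpolates linearly
    between hull vertices -- the discrete analogue of H_k in Definition 6.3."""
    hull = [0]                              # indices of current hull vertices
    for i in range(1, len(t)):
        while len(hull) >= 2:
            i0, i1 = hull[-2], hull[-1]
            # cross product of (i0 -> i1) and (i0 -> i); <= 0 means the middle
            # point i1 lies on or above the chord i0 -> i, so it is removed
            cross = (t[i1] - t[i0]) * (x[i] - x[i0]) - (x[i1] - x[i0]) * (t[i] - t[i0])
            if cross <= 0:
                hull.pop()
            else:
                break
        hull.append(i)
    return np.interp(t, t[hull], x[hull])
```

For a convex input the hull keeps every point and the minorant equals the input; for a non-convex input the minorant linearly bridges the concave stretches.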
Note that the processes H1, . . . , HK exist and are unique. The main result of this
section is given in Theorem 6.4.
Theorem 6.4 For each k = 1, …, K, let F0k be continuously differentiable at t0 with strictly positive derivative f0k(t0). Furthermore, let G be continuously differentiable at t0 with strictly positive derivative g(t0). Let U be as defined in Definition 6.3. Then

n^{1/3}{F̃n(t0) − F0(t0)} →d U(0), in R^K.
For K = 1, Theorem 6.4 simply gives the limiting distribution of the maximum likelihood estimator for univariate current status data. For K > 1, we obtain for each k = 1, …, K the limiting distribution of the maximum likelihood estimator for the reduced current status data (T, ∆k). The multinomial covariance structure of the Brownian motions comes from the multinomial distribution of ∆|T.
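Since the naive estimator for cause k is the MLE for the reduced current status data (T, ∆k), it can be computed as the isotonic (nondecreasing) regression of the indicators ∆k on the ordered observation times, e.g. by pooling adjacent violators. The sketch below is our own illustration under that standard characterization; the function names are assumptions.

```python
import numpy as np

def pava(y, w):
    """Weighted pool-adjacent-violators: nondecreasing fit to y with weights w."""
    means, weights, sizes = [], [], []      # blocks of (mean, weight, size)
    for yi, wi in zip(y, w):
        means.append(yi); weights.append(wi); sizes.append(1)
        # merge blocks from the right while monotonicity is violated
        while len(means) >= 2 and means[-2] > means[-1]:
            m2, w2, s2 = means.pop(), weights.pop(), sizes.pop()
            means[-1] = (weights[-1] * means[-1] + w2 * m2) / (weights[-1] + w2)
            weights[-1] += w2
            sizes[-1] += s2
    return np.repeat(means, sizes)

def naive_estimator(T, delta_k):
    """Naive estimator for cause k: isotonic regression of the current status
    indicators delta_k on the sorted observation times T."""
    order = np.argsort(T)
    fit = pava(np.asarray(delta_k, float)[order], np.ones(len(T)))
    return np.sort(T), fit
```

The fitted values are the left derivatives of the convex minorant of the cumulative sum diagram of the ordered indicators, which is exactly the finite-sample analogue of the slope processes appearing in Theorem 6.4.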
Example 6.5 Throughout this chapter, we consider the following example. Let K = 2, let T be independent of (X, Y), and let T, Y and X|Y have the following distributions:

G(t) = P(T ≤ t) = 1 − exp(−t),   (6.1)
P(Y = k) = k/3, k = 1, 2,
P(X ≤ t | Y = k) = 1 − exp(−kt), k = 1, 2,

so that

F0k(t) = (k/3){1 − exp(−kt)}, k = 1, 2.
Figures 6.1 and 6.2¹ show the limiting processes for the naive estimator, for t0 = 1 and t0 = 2. Comparing the two figures, we see that for t0 = 2 the variance of the Brownian motions Wk(h)/g(t0) is larger, the negative correlation between the two component processes W1(h)/g(t0) and W2(h)/g(t0) is stronger, and the parabolic drifts f0k(t0)h²/2 are weaker. These observations follow from the definition of the processes (Definition 6.2) and the fact that F0k is increasing while g and f0k are decreasing. Finally, note that the slope processes Uk(h) have far fewer jumps for t0 = 2.
We now provide a proof of Theorem 6.4 that is in the same spirit as the proof of the limiting distribution of the MLE in Section 6.2 below. The idea is as follows. First, we characterize the localized estimator in terms of localized processes X^loc_n,

¹These figures are constructed using the localized processes defined in the proof of Theorem 6.4, for sample size n = 100,000.
Figure 6.1: Limiting processes for the naive estimator, for the model given in Example 6.5 and t0 = 1. The top row shows Wk(h)/g(t0), k = 1, 2. The middle row shows Xk(h) (grey) and its convex minorant Hk(h) (red), k = 1, 2. The parabolic drifts f0k(t0)h²/2 are denoted by dashed lines. The bottom row shows the slope process Uk(h), together with a dashed line of slope f0k(t0), k = 1, 2.
Figure 6.2: Limiting processes for the naive estimator, for the model given in Example 6.5 and t0 = 2. Please see Figure 6.1 for further explanation.
H^loc_n, U^loc_n. Next, we show that these processes, restricted to [−m, m], are tight in an appropriate space, for each m ∈ N. Via a diagonal argument, it then follows that every subsequence (X^loc_{n′}, H^loc_{n′}, U^loc_{n′}) has a further subsequence (X^loc_{n″}, H^loc_{n″}, U^loc_{n″}) converging to a limit (X, H, U), with U = H′ and with its component processes defined on R. By the continuous mapping theorem, this limit satisfies the conditions of Definition 6.3 on intervals [−m, m], for each m ∈ N. By letting m → ∞ we obtain that these conditions are satisfied on R. This shows existence of a process satisfying the conditions of Definition 6.3. Since the processes defined in Definition 6.3 are unique, all subsequences must converge to the same limit (X, H, U). Hence (X^loc_n, H^loc_n, U^loc_n) →d (X, H, U).
Note that we obtain existence of the limiting processes while proving convergence
of the naive estimator to its limiting distribution. For the naive estimator, this proof
is vacuous, since existence of the convex minorants of X1, . . . , XK is well-known.
However, existence of the limiting processes for the MLE is not known, and hence
this step will be important for the MLE. Furthermore, note that uniqueness of the
limiting processes is used in the proof. For the naive estimator, uniqueness of the
convex minorants of X1, . . . , XK is known and hence we use it without proof. For
the MLE, uniqueness of the limiting processes is not known, and we establish this
separately in Section 6.2.2. Finally, note that our approach is different from the
one used by Groeneboom, Jongbloed and Wellner (2001a,b) for maximum likelihood
estimation of convex densities. They first establish existence and uniqueness of the
limiting process separately, and then prove convergence to the limiting distribution.
We now provide several results that are needed in the proof of Theorem 6.4. Let τnk be the last jump point of F̃nk before t0, k = 1, …, K. Lemma 6.6 shows that n^{2/3} ∫_{[τnk, t0)} {δk − F̃nk(u)} dPn(u, δ) is tight.
Lemma 6.6 Let τnk be the last jump point of F̃nk before t0. Then

∫_{[τnk, t0)} {F̃nk(u) − δk} dPn(u, δ) = Op(n^{−2/3}), k = 1, …, K.

Proof: We write

∫_{[τnk, t0)} {F̃nk(u) − δk} dPn(u, δ)
= ∫_{[τnk, t0)} {F̃nk(u) − F0k(u)} dGn(u) + ∫_{[τnk, t0)} {F0k(u) − δk} dPn(u, δ)
= ∫_{[τnk, t0)} {F̃nk(u) − F0k(u)} d(Gn − G)(u) + ∫_{[τnk, t0)} {F̃nk(u) − F0k(u)} dG(u)
  + ∫_{[τnk, t0)} {F0k(u) − δk} d(Pn − P)(u, δ)
≡ I + II + III.

As mentioned in the introduction of Chapter 5, we know that t0 − τnk = Op(n^{−1/3}). Combining this with Lemma 4.1 of Kim and Pollard (1990) yields that terms I and III are of order Op(n^{−2/3}). Term II is of order Op(n^{−2/3}) by the local rate of convergence and t0 − τnk = Op(n^{−1/3}). □
The next lemma, Lemma 6.7, formalizes that dGn(u) ≈ dG(u) ≈ g(t0) du for u ∈ [t0 − mn^{−1/3}, t0 + mn^{−1/3}].

Lemma 6.7 Let the conditions of Theorem 6.4 be satisfied, let m > 0, and let k ∈ {1, …, K}. Then

(1/g(t0)) ∫_{[t0, t0+n^{−1/3}h]} {F̃nk(u) − F0k(t0)} dGn(u)
= ∫_{[t0, t0+n^{−1/3}h]} {F̃nk(u) − F0k(t0)} du + op(n^{−2/3}),   (6.2)

uniformly in h ∈ [−m, m].
Proof: Let m ∈ N and k ∈ {1, …, K}. We write

(1/g(t0)) ∫_{[t0, t0+n^{−1/3}h]} {F̃nk(u) − F0k(t0)} dGn(u)
= (1/g(t0)) ∫_{[t0, t0+n^{−1/3}h]} {F̃nk(u) − F0k(t0)} d(Gn − G)(u)
  + (1/g(t0)) ∫_{[t0, t0+n^{−1/3}h]} {F̃nk(u) − F0k(t0)} dG(u)
≡ I + II.

To see that term I is of order Op(n^{−1}), note that by the local rate of convergence, we can assume at the cost of probability ε that

sup_{h ∈ [−m,m]} |F̃nk(t0 + n^{−1/3}h) − F0k(t0)| ≤ C n^{−1/3}.

Applying Theorem 2.11.22 of Van der Vaart and Wellner (1996) to the class Qn, where

Qn = { q_{nFh}(u) = n^{1/2}{Fn(u) − F0k(t0)} 1_{[t0, t0+n^{−1/3}h]}(u) : h ∈ [−m, m], Fn ∈ Fn },
Fn = { Fn : R → [0, 1], Fn monotone, sup_{h ∈ [−m,m]} |Fn(t0 + n^{−1/3}h) − F0k(t0)| ≤ C n^{−1/3} },

yields that {Gn q_{nFh} : q_{nFh} ∈ Qn} is tight in l∞[−m, m]. Since

Gn q_{nFh} = √n (Pn − P) q_{nFh} = n ∫_{[t0, t0+n^{−1/3}h]} {Fn(u) − F0k(t0)} d(Gn − G)(u),

this implies that term I is of order Op(n^{−1}).

For the second term (II) we write:

II = ∫_{t0}^{t0+n^{−1/3}h} {F̃nk(u) − F0k(t0)} du
  + ∫_{t0}^{t0+n^{−1/3}h} {F̃nk(u) − F0k(t0)} [{g(u) − g(t0)}/g(t0)] du ≡ IIa + IIb.

Note that IIa is present in (6.2). Term IIb is of order op(n^{−2/3}), uniformly in h ∈ [−m, m], using the Cauchy–Schwarz inequality, the local rate of convergence and the continuity of g:

|IIb| ≤ (1/g(t0)) ( ∫_{t0}^{t0+n^{−1/3}h} {F̃nk(u) − F0k(t0)}² du )^{1/2} ( ∫_{t0}^{t0+n^{−1/3}h} {g(u) − g(t0)}² du )^{1/2}
= Op(n^{−1/2}) o(n^{−1/6}) = op(n^{−2/3}). □
Proposition 6.8 gives convergence to the Brownian motion processes plus parabolic drifts, as defined in Definition 6.2.

Proposition 6.8 Let the conditions of Theorem 6.4 be satisfied. Let m > 0. Then

X^loc_{nk}(h) ≡ (n^{2/3}/g(t0)) ∫_{[t0, t0+n^{−1/3}h]} {δk − F0k(t0)} dPn(u, δ)
→d Wk(h)/g(t0) + (1/2) f0k(t0) h² = Xk(h),

jointly for k = 1, …, K in (l∞[−m, m])^K.
Proposition 6.8 is quite standard. To show where the Brownian motion and the parabolic drift come from, we write

δk − F0k(t0) = {δk − F0k(u)} + {F0k(u) − F0k(t0)}.

The part δk − F0k(u) gives a martingale that converges to the Brownian motion Wk, and the part F0k(u) − F0k(t0) gives the quadratic drift. The multinomial covariance structure of the Brownian motions W1, …, WK comes from the multinomial distribution of ∆|T, given in (2.4). For completeness, we give a proof of Proposition 6.8 in Section 6.3.
We are now ready to prove Theorem 6.4.
Proof of Theorem 6.4: Let τnk be the last jump point of F̃nk before t0, for k = 1, …, K. Recall from Proposition 2.28 that the naive estimators F̃nk(t), k = 1, …, K, are characterized by

∫_{[τnk, t)} F̃nk(u) dGn(u) ≤ ∫_{[τnk, t)} δk dPn(u, δ), k = 1, …, K, t ∈ R,   (6.3)

where equality must hold if t is a jump point of F̃nk. In order to change the integration interval [τnk, t) to [t0, t), we define

cnk = ∫_{[τnk, t0)} {F̃nk(u) − δk} dPn(u, δ), k = 1, …, K.

Then (6.3) is equivalent to

cnk + ∫_{[t0, t)} F̃nk(u) dGn(u) ≤ ∫_{[t0, t)} δk dPn(u, δ), k = 1, …, K, t ∈ R,   (6.4)

where equality must hold if t is a jump point of F̃nk.

We now localize this expression, by subtracting ∫_{[t0, t)} F0k(t0) dGn(u) from both sides, and applying the change of variable t → t0 + n^{−1/3}h. This yields

cnk + ∫_{[t0, t0+n^{−1/3}h)} {F̃nk(u) − F0k(t0)} dGn(u)
≤ ∫_{[t0, t0+n^{−1/3}h)} {δk − F0k(t0)} dPn(u, δ), k = 1, …, K, h ∈ R,   (6.5)

where equality must hold if t0 + n^{−1/3}h is a jump point of F̃nk. Next, we define the following localized processes for k = 1, …, K and h ∈ R:

X^loc_{nk}(h) = (n^{2/3}/g(t0)) ∫_{[t0, t0+n^{−1/3}h]} {δk − F0k(t0)} dPn(u, δ),
H^loc_{nk}(h) = n^{2/3} ∫_{[t0, t0+n^{−1/3}h]} {F̃nk(u) − F0k(t0)} du,
and

U^loc_{nk}(h) = n^{1/3}{F̃nk(t0 + n^{−1/3}h) − F0k(t0)}.

Note that U^loc_{nk} = (H^loc_{nk})′ at continuity points of U^loc_{nk}. Furthermore, define

c^loc_{nk} = (n^{2/3}/g(t0)) cnk,
R^loc_{nk}(h) = (n^{2/3}/g(t0)) ∫_{[t0, t0+n^{−1/3}h]} {F̃nk(u) − F0k(t0)} dGn(u)
  − n^{2/3} ∫_{[t0, t0+n^{−1/3}h]} {F̃nk(u) − F0k(t0)} du.

Then multiplying (6.5) by n^{2/3}/g(t0) yields

c^loc_{nk} + R^loc_{nk}(h) + H^loc_{nk}(h) ≤ X^loc_{nk}(h), h ∈ R, k = 1, …, K,

and c^loc_{nk} + R^loc_{nk}(h−) + H^loc_{nk}(h−) = X^loc_{nk}(h−) if U^loc_{nk} has a jump at h. Combining these statements, we obtain:

c^loc_{nk} + R^loc_{nk}(h) + H^loc_{nk}(h) ≤ X^loc_{nk}(h), h ∈ R, k = 1, …, K,   (6.6)

∫ {c^loc_{nk} + R^loc_{nk}(h−) + H^loc_{nk}(h−) − X^loc_{nk}(h−)} dU^loc_{nk}(h) = 0, k = 1, …, K.   (6.7)

Note that these conditions also hold when the processes are restricted to [−m, m], for each m ∈ N.

We define the following vectors:

c^loc_n = (c^loc_{n1}, …, c^loc_{nK}), H^loc_n = (H^loc_{n1}, …, H^loc_{nK}),
R^loc_n = (R^loc_{n1}, …, R^loc_{nK}), U^loc_n = (U^loc_{n1}, …, U^loc_{nK}),
X^loc_n = (X^loc_{n1}, …, X^loc_{nK}),
and for m ∈ N, we define the space

E[−m, m] = R^K × (D[−m, m])^K × (D[−m, m])^K × (C[−m, m])^K × (D[−m, m])^K
  ≡ R^K × I × II × III × IV,

endowed with the product topology induced by the uniform topology on I × II × III, and the Skorohod topology on IV. Note that this space supports the vector

Vn|[−m, m] ≡ (c^loc_n, R^loc_n, X^loc_n, H^loc_n, U^loc_n)|[−m, m],

where the notation |[−m, m] denotes that all processes R^loc_{nk}, X^loc_{nk}, H^loc_{nk} and U^loc_{nk}, k = 1, …, K, are restricted to [−m, m].

Analogously to Groeneboom, Jongbloed and Wellner (2001b), we now show that Vn|[−m, m] is tight in E[−m, m] for each m ∈ N. Note that Lemma 6.6 implies tightness of c^loc_n in R^K. Furthermore, Lemma 6.7 implies that R^loc_n|[−m, m] is of order op(1). Next, note that the subset of D[−m, m] consisting of absolutely bounded nondecreasing functions is compact in the Skorohod topology. Hence, Theorem 5.20 and the monotonicity of U^loc_{nk}, k = 1, …, K, yield that U^loc_n|[−m, m] is tight in (D[−m, m])^K endowed with the Skorohod topology. Moreover, since the set of absolutely bounded continuous functions with absolutely bounded derivatives is compact in C[−m, m] with the uniform topology, it follows that H^loc_n|[−m, m] is tight in (C[−m, m])^K endowed with the uniform topology. Furthermore, by Proposition 6.8 we have that

X^loc_n(h) →d W(h)/g(t0) + (1/2) f0(t0) h² = X(h),   (6.8)

uniformly on compacta, where f0(t0) = (f01(t0), …, f0K(t0)), and W = (W1, …, WK) and X = (X1, …, XK) are defined in Definition 6.2. Hence, X^loc_n|[−m, m] is tight in (D[−m, m])^K endowed with the uniform topology. Combining everything, we have
that Vn|[−m, m] is tight in E[−m, m] for each m ∈ N.

It now follows by a diagonal argument that any subsequence Vn′ of Vn has a further subsequence Vn″ that converges in distribution to a limit

V = (c, 0, X, H, U) ∈ R^K × (C(−∞, ∞))^K × (C(−∞, ∞))^K × (C(−∞, ∞))^K × (D(−∞, ∞))^K.

Using a representation theorem (see, e.g., Dudley (1968), Pollard (1984, Representation Theorem 13, page 71) or Van der Vaart and Wellner (1996, Theorem 1.10.4, page 59)), we can assume that Vn″ →a.s. V. Hence, U = H′ at continuity points of U; see Lemma 6.26 on page 177.

Conditions (6.6) and (6.7) and the continuous mapping theorem imply that the vector (c, X, H, U) must satisfy, for all m ∈ N:

inf_{t ∈ [−m,m]} {Xk(t) − Hk(t) − ck} ≥ 0, k = 1, …, K,
∫_{[−m,m]} {Xk(t−) − Hk(t−) − ck} dUk(t) = 0, k = 1, …, K.

Since Xk and Hk are continuous, we can write the second condition as

∫_{[−m,m]} {Xk(t) − Hk(t) − ck} dUk(t) = 0, k = 1, …, K.

Letting m → ∞ gives

inf_{t ∈ R} {Xk(t) − Hk(t) − ck} ≥ 0, k = 1, …, K,
∫ {Xk(t) − Hk(t) − ck} dUk(t) = 0, k = 1, …, K.

Defining H̆k(t) = Hk(t) + ck, k = 1, …, K, we have H̆′k = H′k and

inf_{t ∈ R} {Xk(t) − H̆k(t)} ≥ 0, k = 1, …, K,
∫ {Xk(t) − H̆k(t)} dUk(t) = 0, k = 1, …, K.

This proves existence of a process satisfying the conditions of Definition 6.3. Since the processes defined in Definition 6.3 are unique, (H̆1, …, H̆K) must equal the K-tuple of convex minorants of Definition 6.3, and U must be its vector of slope processes. Hence, each subsequence converges in distribution to the same limit, so that U^loc_n →d U in the Skorohod topology. In particular,

U^loc_n(0) = n^{1/3}{F̃n(t0) − F0(t0)} →d U(0), in R^K. □
6.2 The limiting distribution of the MLE
As noted in the introduction of this chapter, the limiting processes for the MLE
contain an extra term involving the difference between the sum of the drifting Brow-
nian motions and their convex minorants. We now prove existence and uniqueness of
this system of processes (Theorem 6.9), and convergence of the MLE to its limiting
distribution (Theorem 6.10). We first state the main results.
Theorem 6.9 Let

ak = 1/F0k(t0), k = 1, …, K + 1,   (6.9)

where F0,K+1(t0) = 1 − F0+(t0), and recall the definition of X1, …, XK in Definition 6.2. Then there exists an almost surely unique K-tuple Ĥ = (Ĥ1, …, ĤK) of convex functions with right-continuous derivatives Û = (Û1, …, ÛK), satisfying the following conditions:

(i) ak Ĥk(t) + aK+1 Ĥ+(t) ≤ ak Xk(t) + aK+1 X+(t), for k = 1, …, K, t ∈ R.

(ii) ∫ {ak Ĥk(t) + aK+1 Ĥ+(t) − ak Xk(t) − aK+1 X+(t)} dÛk(t) = 0, k = 1, …, K.

(iii) For each M > 0 and each k = 1, …, K, there exist points τ1k < −M and τ2k > M so that

ak Ĥk(t) + aK+1 Ĥ+(t) = ak Xk(t) + aK+1 X+(t) for t = τ1k and t = τ2k.
Theorem 6.10 For each k = 1, …, K, let F0k be continuously differentiable at t0 with strictly positive derivative f0k(t0). Furthermore, let G be continuously differentiable at t0 with strictly positive derivative g(t0). Let Û = (Û1, …, ÛK) be defined as in Theorem 6.9. Then

n^{1/3}{F̂n(t0) − F0(t0)} →d Û(0), in R^K.
The outline of this section is as follows. In Section 6.2.1 we discuss the processes Ĥ1, …, ĤK, and compare them to the processes H1, …, HK for the naive estimator. In Section 6.2.2 we prove that the processes Ĥ1, …, ĤK are unique. Next, in Section 6.2.3 we prove convergence of the MLE to its limiting distribution. In this proof, we automatically obtain existence of the limiting processes Ĥ1, …, ĤK, hence completing the proof of Theorem 6.9. In this sense our approach differs from the one followed by Groeneboom, Jongbloed and Wellner (2001a,b), who first establish existence and uniqueness of the limiting processes before proving convergence. However, apart from this difference, our approaches are very similar.
6.2.1 The process Ĥ = (Ĥ1, …, ĤK)

We now discuss the processes Ĥ1, …, ĤK. In Lemma 6.11 we study the collection of points of touch between ak Ĥk + aK+1 Ĥ+ and ak Xk + aK+1 X+. The results in this lemma rely on the observation that the process ak Ĥk + aK+1 Ĥ+ is pointwise bounded above by the convex minorant of ak Xk + aK+1 X+. Since the convex minorant of a Brownian motion process plus parabolic drift is well-studied (Groeneboom (1989)), this point of view allows us to deduce properties of ak Ĥk + aK+1 Ĥ+.
Lemma 6.11 Let Sk be the collection of points of touch between ak Ĥk(t) + aK+1 Ĥ+(t) and ak Xk(t) + aK+1 X+(t). Then

(i) Sk is a subset of the points of touch of ak Xk(t) + aK+1 X+(t) and its convex minorant.

(ii) At points t ∈ Sk, the right and left derivatives of ak Ĥk(t) + aK+1 Ĥ+(t) are bounded above and below by the right and left derivatives of the convex minorant of ak Xk(t) + aK+1 X+(t).

Proof: Note that ak Ĥk(t) + aK+1 Ĥ+(t) is a convex function, bounded above by ak Xk(t) + aK+1 X+(t). Hence, ak Ĥk(t) + aK+1 Ĥ+(t) is bounded above by the convex minorant of ak Xk(t) + aK+1 X+(t). This yields (i). Property (ii) then follows immediately from a graphical argument. □
Property (i) of Lemma 6.11 leads to Corollary 6.12, which states that Ĥk is piecewise linear, and Ûk is piecewise constant, for all k = 1, …, K.

Corollary 6.12 Let Ĥ be defined as in Theorem 6.9. Then for each k ∈ {1, …, K}, Ĥk is a piecewise linear function, and Ûk is piecewise constant.

Proof: With probability one, the collection of points of touch between ak Xk(t) + aK+1 X+(t) and its convex minorant has no condensation points in a finite interval (Groeneboom (1989)). By property (i) of Lemma 6.11, this implies that with probability one, Sk has no condensation points in a finite interval. Conditions (i) and (ii) of Theorem 6.9 imply that Ûk can only increase at points t ∈ Sk. Hence, Ûk is piecewise constant and Ĥk is piecewise linear. □
In the discussion preceding Lemma 6.11, we interpreted ak Ĥk(t) + aK+1 Ĥ+(t) as a convex function below ak Xk(t) + aK+1 X+(t). We now make this interpretation more precise. Note that conditions (i) and (ii) of Theorem 6.9 imply that

ak Ĥk(h) + aK+1 Ĥ+(h) = ak Xk(h) + aK+1 X+(h)

at points of change of slope of Ĥk, k = 1, …, K. But ak Ĥk + aK+1 Ĥ+ has a change of slope if any Ĥj, j = 1, …, K, has a change of slope. Thus, ak Ĥk(h) + aK+1 Ĥ+(h) can have changes of slope without touching ak Xk(h) + aK+1 X+(h). This is illustrated in Figures 6.3 and 6.4, for t0 = 1 and t0 = 2 respectively.² For example, in Figure 6.3, we see that a1 Ĥ1(h) + aK+1 Ĥ+(h) has a change of slope just before zero, without touching a1 X1(h) + aK+1 X+(h). This is allowed, since Û1(h) does not have a jump at this point. On the other hand, Û2(h) does have a jump at this point, and we indeed see that a2 Ĥ2(h) + aK+1 Ĥ+(h) touches a2 X2(h) + aK+1 X+(h).
In Lemma 6.13 we give two different interpretations of Ĥ1, …, ĤK that emphasize the difference between the MLE and the naive estimator.

Lemma 6.13 Let Ĥ be defined as in Theorem 6.9. Then Ĥ satisfies the following self-induced convex minorant characterizations:

(a) For each k = 1, …, K, Ĥk(t) is the convex minorant of

Xk(t) + (aK+1/ak) {X+(t) − Ĥ+(t)}.   (6.10)

(b) For each k = 1, …, K, Ĥk(t) is the convex minorant of

Xk(t) + [aK+1/(ak + aK+1)] {X+^{(−k)}(t) − Ĥ+^{(−k)}(t)},   (6.11)

²These figures are made using the localized processes defined in the proof of Theorem 6.10, with n = 100,000. The convex minorant does not fit exactly, due to omission of the term R^loc_{nk}.
where X+^{(−k)}(t) = ∑_{j=1, j≠k}^{K} Xj(t) and Ĥ+^{(−k)}(t) = ∑_{j=1, j≠k}^{K} Ĥj(t).
Proof: Characterization (a) holds since conditions (i) and (ii) of Theorem 6.9 are equivalent to:

Ĥk(t) ≤ Xk(t) + (aK+1/ak) {X+(t) − Ĥ+(t)}, t ∈ R,
∫ { Ĥk(t) − Xk(t) − (aK+1/ak) {X+(t) − Ĥ+(t)} } dĤ′k(t) = 0,

for k = 1, …, K.

Characterization (b) holds since conditions (i) and (ii) of Theorem 6.9 are equivalent to:

Ĥk(t) ≤ Xk(t) + [aK+1/(ak + aK+1)] {X+^{(−k)}(t) − Ĥ+^{(−k)}(t)}, t ∈ R,
∫ { Ĥk(t) − Xk(t) − [aK+1/(ak + aK+1)] {X+^{(−k)}(t) − Ĥ+^{(−k)}(t)} } dĤ′k(t) = 0,

for k = 1, …, K. □
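Characterization (a) suggests a natural fixed-point scheme for computing the self-induced system on a grid: start from the plain convex minorants of the Xk (the naive limits) and repeatedly recompute each minorant with the current correction term. The sketch below is purely our own illustration (function names are assumptions; the thesis does not propose this iteration, and its convergence is assumed here, not proved).

```python
import numpy as np

def gcm(t, x):
    """Greatest convex minorant of (t_i, x_i) on a sorted grid (lower hull)."""
    hull = [0]
    for i in range(1, len(t)):
        while len(hull) >= 2:
            i0, i1 = hull[-2], hull[-1]
            if (t[i1] - t[i0]) * (x[i] - x[i0]) - (x[i1] - x[i0]) * (t[i] - t[i0]) <= 0:
                hull.pop()          # middle point lies on/above the chord
            else:
                break
        hull.append(i)
    return np.interp(t, t[hull], x[hull])

def self_induced_H(t, X, a, n_iter=200):
    """Fixed-point iteration for characterization (a) of Lemma 6.13:
    H_k = GCM( X_k + (a_{K+1}/a_k) (X_+ - H_+) ), started from the naive
    minorants GCM(X_k).  X is a (K, len(t)) array; a has length K + 1."""
    K = X.shape[0]
    H = np.array([gcm(t, X[k]) for k in range(K)])
    for _ in range(n_iter):
        Hplus, Xplus = H.sum(axis=0), X.sum(axis=0)
        H = np.array([gcm(t, X[k] + a[K] / a[k] * (Xplus - Hplus))
                      for k in range(K)])
    return H
```

Since the correction term X+ − H+ is nonnegative (Lemma 6.14), each iterate lies above the naive minorant, consistent with Lemma 6.15.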
Characterization (a) of Lemma 6.13 is illustrated in Figures 6.5 and 6.6, for t0 = 1 and t0 = 2. The top row shows the extra term (aK+1/ak){X+(h) − Ĥ+(h)} in the processes for the MLE. Note that this term appears to be nonnegative. This is indeed the case, and will be proved in Lemma 6.14. Furthermore, note that the extra term (aK+1/ak){X+(h) − Ĥ+(h)} is more prominent for t0 = 2 than for t0 = 1, due to the larger variance of the Brownian motions at t0 = 2, and the fact that aK+1/ak = F0k(t0)/{1 − F0+(t0)} increases with t0. The middle row of the figures depicts Ĥk and Hk, for k = 1, 2. It appears that Ĥk(h) ≥ Hk(h). This is indeed the case and will be proved in Lemma 6.15.

We now discuss the origin of the extra term X+ − Ĥ+ that appears in the limiting processes for the MLE. Recall the differences between the MLE and the naive estimator, discussed in Section 2.1.4:
(a) The log likelihood (2.6) for the MLE contains a term involving FK+1(u) = 1 − F+(u), while the log likelihood (2.11) for the naive estimator does not include such a term;

(b) The space FK for the MLE includes the constraint that the sum of the sub-distribution functions is bounded by one, while the corresponding space for the naive estimator does not include such a constraint.

These differences were also present in the convex minorant characterization in equation (2.58), where the convex minorant characterization for the MLE contained two extra terms: an F̂n+-term and a βnF̂n-term. The F̂n+-term came from the term 1 − F+(t) in the log likelihood (2.6), and the βnF̂n-term came from the constraint on FK. For the local limiting distribution at an interior point, the constraint on FK does not play a role. Hence, we do not see the βnF̂n-term in the limiting process for the MLE. On the other hand, the F̂n+-term does play a role, and results in the extra term X+ − Ĥ+.
In Lemma 6.14 we show that X+(t) − Ĥ+(t) ≥ 0 for all t ∈ R. This inequality is illustrated in Figures 6.5 and 6.6.

Lemma 6.14 Let Ĥ be defined as in Theorem 6.9. Then

Ĥ+(t) ≤ X+(t), t ∈ R.

Proof: Note that condition (i) of Theorem 6.9 can be written as

Ĥk(t) + (aK+1/ak) Ĥ+(t) ≤ Xk(t) + (aK+1/ak) X+(t), k = 1, …, K, t ∈ R.

Plugging in the values of a1, …, aK+1 as defined in (6.9) yields

Ĥk(t) + [F0k(t0)/{1 − F0+(t0)}] Ĥ+(t) ≤ Xk(t) + [F0k(t0)/{1 − F0+(t0)}] X+(t), k = 1, …, K, t ∈ R.

Summing over k = 1, …, K gives

Ĥ+(t) + [F0+(t0)/{1 − F0+(t0)}] Ĥ+(t) ≤ X+(t) + [F0+(t0)/{1 − F0+(t0)}] X+(t), t ∈ R,

and this is equivalent to Ĥ+(t) ≤ X+(t) for all t ∈ R. □
We now use Lemma 6.14 to compare the MLE and the naive estimator, and find that Hk ≤ Ĥk. This inequality is also illustrated in Figures 6.5 and 6.6.

Lemma 6.15 The following relation holds:

Hk(t) ≤ Ĥk(t), k = 1, …, K.

Proof: Recall that Hk(t) is the convex minorant of Xk(t). In Lemma 6.14, we saw that the adjustment (aK+1/ak){X+(t) − Ĥ+(t)} for the MLE is nonnegative. Hence, Hk(t) is a convex function below Xk(t) + (aK+1/ak){X+(t) − Ĥ+(t)}. Since Ĥk(t) is the convex minorant of Xk(t) + (aK+1/ak){X+(t) − Ĥ+(t)} (Lemma 6.13), it follows that Hk(t) ≤ Ĥk(t), k = 1, …, K. □

The following point of view is related to Lemma 6.15. By Theorem 6.4, we know that Hk is the convex minorant of Xk(t). This implies that Hk(t) ≤ Xk(t), and by summing over k = 1, …, K, we also have H+(t) ≤ X+(t). Hence, Hk(t) ≤ Xk(t) + (aK+1/ak){X+(t) − H+(t)}, so that the naive estimator satisfies the inequality conditions for the MLE. However, the naive estimator does not satisfy the equality conditions, since typically H+(t) does not equal X+(t) when Hk(t) has a change of slope. Hence, Hk(t) is a convex function below Xk(t) + (aK+1/ak){X+(t) − H+(t)}, but it is typically not the convex minorant.
6.2.2 Uniqueness of the limiting process

In order to prove that the limiting process Ĥ = (Ĥ1, …, ĤK) defined in Theorem 6.9 is unique, we need that Ûk(t) is tight for each t ∈ R. Such a tightness result
Figure 6.3: Limiting processes for the MLE, for the model given in Example 6.5 and t0 = 1. The top row shows the drifted Brownian motion process ak Xk(h) + aK+1 X+(h) (black) and the convex function ak Ĥk(h) + aK+1 Ĥ+(h) (green), k = 1, 2. The parabolic drift is denoted by a black dashed line. The bottom row shows the slope process Ûk(h) (green), with a black dashed line of slope f0k(t0), k = 1, 2.
Figure 6.4: Limiting processes for the MLE, for the model given in Example 6.5 andt0 = 2. Please see Figure 6.3 for further explanation.
[Figure 6.5 appears here: six panels ("Correction term MLE, k=1", "Correction term MLE, k=2", "Convex minorant, k=1", "Convex minorant, k=2", "Slope convex minorant, k=1", "Slope convex minorant, k=2"), each plotted against h, with a legend distinguishing the MLE and the naive estimator.]

Figure 6.5: Comparison of the limiting processes for the MLE and the naive estimator, for the model given in Example 6.5 and \(t_0 = 1\). The top row shows \((a_{K+1}/a_k)\{X_+(h) - \hat H_+(h)\}\), \(k = 1, 2\). The middle row shows \(X_k(h)\) (grey) with convex minorant \(\tilde H_k(h)\) (red), and \(X_k(h) + (a_{K+1}/a_k)\{X_+(h) - \hat H_+(h)\}\) (black) with convex minorant \(\hat H_k(h)\) (green), \(k = 1, 2\). The bottom row shows \(\tilde U_k(h)\) (red) and \(\hat U_k(h)\) (green), together with a dashed line of slope \(f_{0k}(t_0)\), \(k = 1, 2\).
[Figure 6.6 appears here: the same six panels as Figure 6.5, plotted against h.]

Figure 6.6: Comparison of the limiting processes for the MLE and the naive estimator, for the model given in Example 6.5 and \(t_0 = 2\). Please see Figure 6.5 for further explanation.
is analogous to the local rate of convergence result in Theorem 5.20. Hence, as in
Section 5.3, we first prove a stronger tightness result for U+(t) in Lemma 6.16. We
then use this to prove tightness of the components Uk(t) in Lemma 6.20. Finally, we
use this tightness result in Lemma 6.21 to prove that the limiting process is unique.
Lemma 6.16 Let
\[
F_{0k}(t) = f_{0k}(t_0)\,t, \qquad k = 1, \dots, K + 1. \tag{6.12}
\]
Let \(U = (U_1, \dots, U_K)\) satisfy the characterization of Theorem 6.9. For \(\beta \in (0, 1)\) we define
\[
v(t) = \begin{cases} 1, & \text{if } |t| \le 1, \\ |t|^{\beta}, & \text{if } |t| > 1. \end{cases} \tag{6.13}
\]
Then for every \(\epsilon > 0\) there exists an \(M = M(\epsilon)\) such that for every \(s \in \mathbb{R}\),
\[
P\left( \sup_{t \in \mathbb{R}} \frac{\bigl| U_+(t) - F_{0+}(t) \bigr|}{v(t - s)} \ge M \right) < \epsilon.
\]
Proof: Let \(\epsilon > 0\). We first prove the result for \(s = 0\). It is sufficient to show that we can choose \(M > 0\) such that
\[
P\bigl(\exists t \in \mathbb{R} : U_+(t) \notin \bigl(F_{0+}(t - Mv(t)),\, F_{0+}(t + Mv(t))\bigr)\bigr)
= P\bigl(\exists t \in \mathbb{R} : |U_+(t) - F_{0+}(t)| \ge f_{0+}(t_0)Mv(t)\bigr) < \epsilon.
\]
In fact, we only prove that there exists an \(M\) such that
\[
P\bigl(\exists t \in [0, \infty) : U_+(t) \ge F_{0+}(t + Mv(t))\bigr) < \frac{\epsilon}{4}, \tag{6.14}
\]
since the proofs for \(U_+(t) \le F_{0+}(t - Mv(t))\) and \((-\infty, 0]\) are analogous. We put a grid on \([0, \infty)\), with grid points \(j \in \mathbb{N} = \{0, 1, \dots\}\). Then it is sufficient to show that we can choose \(M\) such that
\[
P\bigl(\exists t \in [j, j + 1) : U_+(t) \ge F_{0+}(t + Mv(t))\bigr) \le p_{jM}, \qquad j \in \mathbb{N}, \tag{6.15}
\]
where \(p_{jM}\) is defined by
\[
p_{jM} = d_1 \exp\bigl(-d_2 (Mv(j))^3\bigr), \tag{6.16}
\]
and \(d_1\) and \(d_2\) are positive constants. To see that this is sufficient, note that (6.15) yields
\[
P\bigl(\exists t \in [0, \infty) : U_+(t) \ge F_{0+}(t + Mv(t))\bigr) \le \sum_{j=0}^{\infty} p_{jM}.
\]
For each \(\beta \in (0, 1)\), the sum \(\sum_{j=0}^{\infty} p_{jM}\) can be made arbitrarily small by choosing \(M\) large, which proves (6.14).

In the remainder we prove (6.15). Using the monotonicity of \(U_+\), it is sufficient to show that \(P(A_j) \le p_{jM}\) for all \(j \in \mathbb{N}\), where
\[
A_j = \bigl\{ U_+(j + 1) \ge F_{0+}(j + Mv(j)) \bigr\}.
\]
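To make the claim that \(\sum_{j=0}^{\infty} p_{jM}\) is small for large \(M\) concrete, one possible bound (our elaboration, using only (6.13) and (6.16)) starts from \(M^3 v(j)^3 \ge \frac{1}{2}\{M^3 + v(j)^3\}\) for \(M \ge 1\):

```latex
\sum_{j=0}^{\infty} p_{jM}
  = d_1 \sum_{j=0}^{\infty} \exp\bigl(-d_2 M^3 v(j)^3\bigr)
  \le d_1 e^{-d_2 M^3/2} \sum_{j=0}^{\infty} e^{-d_2 v(j)^3/2}
  \le d_1 e^{-d_2 M^3/2}
     \Bigl( 2e^{-d_2/2} + \sum_{j=2}^{\infty} e^{-d_2 j^{3\beta}/2} \Bigr),
```

and the last series is finite for every \(\beta \in (0, 1)\), so the bound tends to zero as \(M \to \infty\).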
Fix \(j \in \mathbb{N}\). By property (iii) of Theorem 6.9, \(a_kH_k + a_{K+1}H_+\) and \(a_kX_k + a_{K+1}X_+\) have points of touch to the left of \(j + 1\), for each \(k = 1, \dots, K\). For each \(k\), define \(\tau_k\) to be the largest such point. Without loss of generality, we assume that the sub-distribution functions are labeled such that \(\tau_1 \le \dots \le \tau_K\). On the event \(A_j\), there is a \(k \in \{1, \dots, K\}\) such that \(U_k(j + 1) \ge F_{0k}(j + Mv(j))\). Hence, we can define \(l \in \{1, \dots, K\}\) such that
\[
U_k(j + 1) < F_{0k}(j + Mv(j)), \qquad k = l + 1, \dots, K, \tag{6.17}
\]
\[
U_l(j + 1) \ge F_{0l}(j + Mv(j)). \tag{6.18}
\]
The definition of \(\tau_l\) and condition (i) of Theorem 6.9 imply that
\[
a_lH_l(\tau_l) + a_{K+1}H_+(\tau_l) = a_lX_l(\tau_l) + a_{K+1}X_+(\tau_l),
\]
\[
a_lH_l(t) + a_{K+1}H_+(t) \le a_lX_l(t) + a_{K+1}X_+(t), \qquad t \in \mathbb{R}.
\]
Dividing both lines by \(a_l\), and subtracting the first line from the second yields, for \(t = j + Mv(j)\):
\[
\int_{\tau_l}^{j+Mv(j)} \Bigl[ U_l(t)dt - dX_l(t) + \frac{a_{K+1}}{a_l}\bigl\{ U_+(t)dt - dX_+(t) \bigr\} \Bigr] \le 0.
\]
Hence,
\[
P(A_j) = P\left( \Bigl\{ \int_{\tau_l}^{j+Mv(j)} \Bigl[ U_l(t)dt - dX_l(t) + \frac{a_{K+1}}{a_l}\bigl\{ U_+(t)dt - dX_+(t) \bigr\} \Bigr] \le 0 \Bigr\} \cap A_j \right)
\]
\[
\le P\left( \Bigl\{ \int_{\tau_l}^{j+Mv(j)} \bigl[ U_l(t)dt - dX_l(t) \bigr] \le 0 \Bigr\} \cap A_j \right) \tag{6.19}
\]
\[
+ P\left( \Bigl\{ \int_{\tau_l}^{j+Mv(j)} \bigl[ U_+(t)dt - dX_+(t) \bigr] \le 0 \Bigr\} \cap A_j \right). \tag{6.20}
\]
Using the definition of \(\tau_l\) and the fact that \(U_l\) is monotone nondecreasing and piecewise constant (Corollary 6.12), it follows that on the event \(A_j\) we have for \(t \ge \tau_l\)
\[
U_l(t) \ge U_l(\tau_l) = U_l(j + 1) \ge F_{0l}(j + Mv(j)).
\]
Hence we can bound (6.19) above by
\[
P\left( \int_{\tau_l}^{j+Mv(j)} \bigl[ F_{0l}(j + Mv(j))dt - dX_l(t) \bigr] \le 0 \right)
\le P\left( \sup_{\substack{k \in \{1, \dots, K\} \\ w \le j+1}} \int_{w}^{j+Mv(j)} \bigl[ dX_k(t) - F_{0k}(j + Mv(j))dt \bigr] \ge 0 \right) \le p_{jM}/2,
\]
where the last inequality follows from Lemma 6.17 below. The term (6.20) is also bounded by \(p_{jM}/2\), using Lemma 6.18 below. For \(s \ne 0\), the proof is exactly the same, using stationarity of the increments of Brownian motion. \(\Box\)
Lemmas 6.17 and 6.18 are the key lemmas in the proof of Lemma 6.16. They are analogous to Lemmas 5.15 and 5.16. To gain insight into Lemma 6.17, recall that \(dX_k(t) = dW_k(t)/g(t_0) + F_{0k}(t)dt\). Hence,
\[
\int_{w}^{j+Mv(j)} \bigl[ F_{0k}(j + Mv(j))dt - dX_k(t) \bigr]
= \int_{w}^{j+Mv(j)} \bigl[ F_{0k}(j + Mv(j) - t)dt - dW_k(t)/g(t_0) \bigr]
= \frac{1}{2} f_{0k}(t_0)(j + Mv(j) - w)^2 - \int_{w}^{j+Mv(j)} dW_k(t)/g(t_0).
\]
The quadratic drift dominates the term \((j + Mv(j) - w)^{3/2}C\) and the Brownian motion. We obtain uniformity in \(w\) by using a second grid, and we obtain the exponential bounds by using standard properties of Brownian motion. Finally, note that the lemma also holds when \((j + Mv(j) - w)^{3/2}C\) is omitted, since this term is positive for \(M > 1\) and \(w \le j + 1\).
Lemma 6.17 There exists an \(M > 1\) such that for all \(j \in \mathbb{N}\):
\[
P\left( \sup_{\substack{k \in \{1, \dots, K\} \\ w \le j+1}} \Bigl\{ \int_{w}^{j+Mv(j)} \bigl[ dX_k(t) - F_{0k}(j + Mv(j))dt \bigr] + (j + Mv(j) - w)^{3/2}C \Bigr\} \ge 0 \right) \le p_{jM}/2,
\]
where \(v(\cdot)\) and \(p_{jM}\) are defined in (6.13) and (6.16).
where v(·) and pjM are defined in (6.13) and (6.16).
Analogously to Lemma 5.16, Lemma 6.18 relies on the system of component processes.
By playing out the different component processes against each other, we can reduce
the problem to a situation to which Lemma 6.17 can be applied.
Lemma 6.18 There exists an \(M > 0\) such that for all \(j \in \mathbb{N}\),
\[
P\left( \Bigl\{ \int_{\tau_l}^{j+Mv(j)} \bigl[ dX_+(t) - U_+(t)dt \bigr] \ge 0 \Bigr\} \cap A_j \right) \le p_{jM}/2,
\]
where \(v(\cdot)\) and \(p_{jM}\) are defined in (6.13) and (6.16), \(\tau_l\) is the last point of touch between \(a_lH_l(t) + a_{K+1}H_+(t)\) and \(a_lX_l(t) + a_{K+1}X_+(t)\) before \(j + 1\), and \(l\) is defined in (6.17) and (6.18).
Lemma 6.16 with \(\beta = 1/2\) leads to the following corollary, which we give without proof. It is analogous to Corollary 5.19.

Corollary 6.19 Let \(U = (U_1, \dots, U_K)\) satisfy the characterization of Theorem 6.9. Then for every \(\epsilon > 0\) there is a \(C = C(\epsilon)\) such that for all \(s \in \mathbb{R}\):
\[
P\left( \sup_{u \in \mathbb{R}_+} \frac{\int_{s-u}^{s} \bigl| U_+(t) - F_{0+}(t) \bigr|\,dt}{u \vee u^{3/2}} \ge C \right) < \epsilon.
\]
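A sketch of how the corollary can be deduced from Lemma 6.16 (the thesis omits the proof; this is our reading): taking \(\beta = 1/2\), and working on the event where the supremum in Lemma 6.16 is below \(M\), we have, using that \(v\) is even,

```latex
\int_{s-u}^{s} \bigl| U_+(t) - F_{0+}(t) \bigr|\,dt
  \le M \int_{s-u}^{s} v(t-s)\,dt
  = M \int_{0}^{u} v(r)\,dr
  \le M \bigl( u \vee u^{3/2} \bigr),
```

since \(\int_0^u v(r)\,dr = u\) for \(u \le 1\), while for \(u > 1\) it equals \(1 + \frac{2}{3}(u^{3/2} - 1) = \frac{1}{3} + \frac{2}{3}u^{3/2} \le u^{3/2}\). Dividing by \(u \vee u^{3/2}\) and taking \(C = M\) gives the statement of the corollary.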
We can now prove tightness of \(U_k(t)\), for \(k = 1, \dots, K\), \(t \in \mathbb{R}\). Lemma 6.20 is analogous to Theorem 5.20.

Lemma 6.20 Let \(U = (U_1, \dots, U_K)\) satisfy the characterization of Theorem 6.9. Then for every \(\epsilon > 0\) there is an \(M = M(\epsilon)\) such that for all \(k = 1, \dots, K\) and \(t \in \mathbb{R}\):
\[
P\bigl( \bigl| U_k(t) - F_{0k}(t) \bigr| \ge M \bigr) < \epsilon.
\]

Proof: Let \(k \in \{1, \dots, K\}\), \(t \in \mathbb{R}\) and \(\epsilon > 0\). It is sufficient to show that there exists an \(M > 1\) such that
\[
P\bigl( U_k(t) \ge F_{0k}(t + M) \bigr) < \epsilon, \tag{6.21}
\]
\[
P\bigl( U_k(t) \le F_{0k}(t - M) \bigr) < \epsilon. \tag{6.22}
\]
We only prove (6.21), since the proof of (6.22) is analogous. Define
\[
B_k = \bigl\{ U_k(t) \ge F_{0k}(t + M) \bigr\}. \tag{6.23}
\]
Let \(\tau_k\) be the last point of touch between \(a_kH_k + a_{K+1}H_+\) and \(a_kX_k + a_{K+1}X_+\) before \(t\). Such a point exists by condition (iii) of Theorem 6.9. Together with condition (i) of Theorem 6.9, this implies that
\[
\int_{\tau_k}^{t+M} \Bigl[ U_k(s)ds - dX_k(s) + \frac{a_{K+1}}{a_k}\bigl( U_+(s)ds - dX_+(s) \bigr) \Bigr] \le 0.
\]
Hence,
\[
P(B_k) = P\left( \Bigl\{ \int_{\tau_k}^{t+M} \Bigl[ U_k(s)ds - dX_k(s) + \frac{a_{K+1}}{a_k}\bigl( U_+(s)ds - dX_+(s) \bigr) \Bigr] \le 0 \Bigr\} \cap B_k \right).
\]
Note that
\[
\left| \int_{\tau_k}^{t+M} \bigl\{ U_+(s)ds - dX_+(s) \bigr\} \right|
\le \int_{\tau_k}^{t+M} \bigl| U_+(s) - F_{0+}(s) \bigr|\,ds + \left| \int_{\tau_k}^{t+M} \bigl\{ F_{0+}(s)ds - dX_+(s) \bigr\} \right|. \tag{6.24}
\]
We bound the first term of (6.24) by Corollary 6.19: for every \(\epsilon > 0\) there exists a \(C > 0\), such that for all \(t \in \mathbb{R}\):
\[
P\left( \sup_{u \in \mathbb{R}_+} \frac{\int_{t-u}^{t+M} \bigl| U_+(s) - F_{0+}(s) \bigr|\,ds}{(M + u)^{3/2}} > C \right) < \frac{\epsilon}{2},
\]
using that \((M + u) \vee (M + u)^{3/2} = (M + u)^{3/2}\) for \(M > 1\) and \(u > 0\). For the second term of (6.24), note that \(dX_+(s) = dW_+(s)/g(t_0) + F_{0+}(s)ds\), so that
\[
\int_{\tau_k}^{t+M} \bigl\{ F_{0+}(s)ds - dX_+(s) \bigr\} = -\int_{\tau_k}^{t+M} \frac{dW_+(s)}{g(t_0)}.
\]
For every \(\gamma > 0\) and \(\epsilon > 0\), we can find \(M > 0\) such that
\[
P\left( \sup_{u \in \mathbb{R}_+} \Bigl\{ \int_{[t-u, t+M)} \frac{dW_+(s)}{g(t_0)} - \gamma(M + u)^2 \Bigr\} \ge 0 \right) < \epsilon,
\]
see (6.40).

Furthermore, using the definition of \(\tau_k\) and the fact that \(U_k\) is monotone nondecreasing and piecewise constant (Corollary 6.12), it follows that on the event \(B_k\), we have for all \(s \ge \tau_k\):
\[
U_k(s) \ge U_k(\tau_k) = U_k(t) \ge F_{0k}(t + M).
\]
Hence, for every \(\gamma > 0\) and \(\epsilon > 0\), we can find \(C > 0\) and \(M_1 > 0\) such that for all \(M \ge M_1\):
\[
P(B_k) = P\left( \Bigl\{ \int_{\tau_k}^{t+M} \Bigl[ U_k(s)ds - dX_k(s) + \frac{a_{K+1}}{a_k}\bigl( U_+(s)ds - dX_+(s) \bigr) \Bigr] \le 0 \Bigr\} \cap B_k \right)
\]
\[
\le \frac{\epsilon}{2} + P\left( \sup_{w \le t} \Bigl\{ \int_{w}^{t+M} \bigl[ dX_k(s) - F_{0k}(t + M)ds \bigr] + (t + M - w)^{3/2}C + \gamma(t + M - w)^2 \Bigr\} \ge 0 \right).
\]
This probability can be made arbitrarily small by choosing \(\gamma\) small and \(M\) large, by a slight adaptation of Lemma 6.17. The choice \(\gamma = \frac{1}{8} f_{0k}(t_0)\) works. \(\Box\)
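One way to see why \(\gamma = \frac{1}{8} f_{0k}(t_0)\) suffices (our check, following the proof of Lemma 6.17): write \(x = t + M - w\). The quadratic drift \(\frac{1}{2} f_{0k}(t_0)x^2\) produced in that proof still dominates after subtracting both perturbation terms, since for \(x\) large enough that \(Cx^{3/2} \le \frac{1}{4} f_{0k}(t_0)x^2\),

```latex
\frac{1}{2} f_{0k}(t_0)\,x^2 - C x^{3/2} - \gamma x^2
  \ge \Bigl( \frac{1}{2} - \frac{1}{4} - \frac{1}{8} \Bigr) f_{0k}(t_0)\,x^2
  = \frac{1}{8} f_{0k}(t_0)\,x^2,
```

so a positive quadratic drift remains and the exponential bounds of Lemma 6.17 go through unchanged.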
Lemma 6.21 Let \(H\) and \(\tilde H\) be two \(K\)-tuples satisfying the conditions of Theorem 6.9. Then \(H \equiv \tilde H\) almost surely.
Proof: Let \(H = (H_1, \dots, H_K)\) and \(\tilde H = (\tilde H_1, \dots, \tilde H_K)\) be two processes satisfying the conditions of Theorem 6.9, and let \(U = (U_1, \dots, U_K)\) and \(\tilde U = (\tilde U_1, \dots, \tilde U_K)\) be the corresponding derivatives, i.e., \(U_k = H_k'\) and \(\tilde U_k = \tilde H_k'\) for \(k = 1, \dots, K\). We define
\[
\phi_m(U) = \sum_{k=1}^{K} a_k \Bigl[ \frac{1}{2} \int_{-m}^{m} U_k^2(t)dt - \int_{-m}^{m} U_k(t)dX_k(t) \Bigr]
+ a_{K+1} \Bigl[ \frac{1}{2} \int_{-m}^{m} U_+^2(t)dt - \int_{-m}^{m} U_+(t)dX_+(t) \Bigr], \qquad m \in \mathbb{N}.
\]
Note that
\[
\phi_m(\tilde U) - \phi_m(U)
= \sum_{k=1}^{K} \frac{a_k}{2} \int_{-m}^{m} \bigl\{ \tilde U_k^2(t) - U_k^2(t) \bigr\} dt - \sum_{k=1}^{K} a_k \int_{-m}^{m} \bigl\{ \tilde U_k(t) - U_k(t) \bigr\} dX_k(t) \tag{6.25}
\]
\[
+ \frac{a_{K+1}}{2} \int_{-m}^{m} \bigl\{ \tilde U_+^2(t) - U_+^2(t) \bigr\} dt - a_{K+1} \int_{-m}^{m} \bigl\{ \tilde U_+(t) - U_+(t) \bigr\} dX_+(t). \tag{6.26}
\]
Using \(\tilde U_k^2 - U_k^2 = (\tilde U_k - U_k)^2 + 2U_k(\tilde U_k - U_k)\), we rewrite the first term of (6.25) as
\[
\sum_{k=1}^{K} \frac{a_k}{2} \int_{-m}^{m} \bigl\{ \tilde U_k^2(t) - U_k^2(t) \bigr\} dt
= \sum_{k=1}^{K} \frac{a_k}{2} \int_{-m}^{m} \bigl\{ \tilde U_k(t) - U_k(t) \bigr\}^2 dt + \sum_{k=1}^{K} a_k \int_{-m}^{m} U_k(t)\bigl\{ \tilde U_k(t) - U_k(t) \bigr\} dt.
\]
Similarly, we rewrite the first term of (6.26) as
\[
\frac{a_{K+1}}{2} \int_{-m}^{m} \bigl\{ \tilde U_+^2(t) - U_+^2(t) \bigr\} dt
= \frac{a_{K+1}}{2} \int_{-m}^{m} \bigl\{ \tilde U_+(t) - U_+(t) \bigr\}^2 dt + a_{K+1} \int_{-m}^{m} U_+(t)\bigl\{ \tilde U_+(t) - U_+(t) \bigr\} dt.
\]
Defining
\[
A_k(t) = a_k\bigl\{ H_k(t) - X_k(t) \bigr\} + a_{K+1}\bigl\{ H_+(t) - X_+(t) \bigr\},
\]
\[
\tilde A_k(t) = a_k\bigl\{ \tilde H_k(t) - X_k(t) \bigr\} + a_{K+1}\bigl\{ \tilde H_+(t) - X_+(t) \bigr\},
\]
this yields
\[
\phi_m(\tilde U) - \phi_m(U) = \sum_{k=1}^{K} \frac{a_k}{2} \int_{-m}^{m} \bigl\{ \tilde U_k(t) - U_k(t) \bigr\}^2 dt + \frac{a_{K+1}}{2} \int_{-m}^{m} \bigl\{ \tilde U_+(t) - U_+(t) \bigr\}^2 dt
+ \sum_{k=1}^{K} \int_{-m}^{m} \bigl\{ \tilde U_k(t) - U_k(t) \bigr\} dA_k(t). \tag{6.27}
\]
Using integration by parts, we rewrite the third term on the right side of (6.27):
\[
\sum_{k=1}^{K} \int_{-m}^{m} \bigl\{ \tilde U_k(t) - U_k(t) \bigr\} dA_k(t)
= \sum_{k=1}^{K} \bigl\{ \tilde U_k(t) - U_k(t) \bigr\} A_k(t) \Big|_{-m}^{m} - \sum_{k=1}^{K} \int_{-m}^{m} A_k(t)\,d\bigl\{ \tilde U_k(t) - U_k(t) \bigr\}
\ge \sum_{k=1}^{K} \bigl\{ \tilde U_k(t) - U_k(t) \bigr\} A_k(t) \Big|_{-m}^{m}. \tag{6.28}
\]
The inequality on the last line follows from the following two facts:

(a) \(\int_{-m}^{m} A_k(t)\,dU_k(t) = 0\), since \(A_k(t) = 0\) at points of jump of \(U_k\) by conditions (i) and (ii) of Theorem 6.9;

(b) \(\int_{-m}^{m} A_k(t)\,d\tilde U_k(t) \le 0\), since \(A_k(t) \le 0\) by condition (i) of Theorem 6.9, and \(\tilde U_k\) is monotone nondecreasing.

Together, (a) and (b) give \(-\int_{-m}^{m} A_k(t)\,d\{\tilde U_k(t) - U_k(t)\} = -\int_{-m}^{m} A_k(t)\,d\tilde U_k(t) \ge 0\).

By combining (6.27) and (6.28), we obtain
\[
\phi_m(\tilde U) - \phi_m(U) \ge \sum_{k=1}^{K} \frac{a_k}{2} \int_{-m}^{m} \bigl\{ \tilde U_k(t) - U_k(t) \bigr\}^2 dt + \frac{a_{K+1}}{2} \int_{-m}^{m} \bigl\{ \tilde U_+(t) - U_+(t) \bigr\}^2 dt
+ \sum_{k=1}^{K} \bigl\{ \tilde U_k(t) - U_k(t) \bigr\} A_k(t) \Big|_{-m}^{m}.
\]
Using the same expression, but with \(U\) and \(\tilde U\) interchanged, we get
\[
0 = \phi_m(\tilde U) - \phi_m(U) + \phi_m(U) - \phi_m(\tilde U)
\ge \sum_{k=1}^{K} a_k \int_{-m}^{m} \bigl\{ \tilde U_k(t) - U_k(t) \bigr\}^2 dt + a_{K+1} \int_{-m}^{m} \bigl\{ \tilde U_+(t) - U_+(t) \bigr\}^2 dt
\]
\[
+ \sum_{k=1}^{K} \bigl\{ \tilde U_k(t) - U_k(t) \bigr\} A_k(t) \Big|_{-m}^{m} + \sum_{k=1}^{K} \bigl\{ U_k(t) - \tilde U_k(t) \bigr\} \tilde A_k(t) \Big|_{-m}^{m}.
\]
We rewrite the last two terms of this display as follows:
\[
\sum_{k=1}^{K} \bigl\{ \tilde U_k(t) - U_k(t) \bigr\} A_k(t) \Big|_{-m}^{m} + \sum_{k=1}^{K} \bigl\{ U_k(t) - \tilde U_k(t) \bigr\} \tilde A_k(t) \Big|_{-m}^{m}
\]
\[
= \sum_{k=1}^{K} \Bigl[ \bigl\{ \tilde U_k(m) - U_k(m) \bigr\}\bigl\{ A_k(m) - \tilde A_k(m) \bigr\} + \bigl\{ \tilde U_k(-m) - U_k(-m) \bigr\}\bigl\{ \tilde A_k(-m) - A_k(-m) \bigr\} \Bigr].
\]
It then follows that
\[
\sum_{k=1}^{K} a_k \int_{-m}^{m} \bigl\{ \tilde U_k(t) - U_k(t) \bigr\}^2 dt + a_{K+1} \int_{-m}^{m} \bigl\{ \tilde U_+(t) - U_+(t) \bigr\}^2 dt
\]
\[
\le \sum_{k=1}^{K} \Bigl[ \bigl\{ \tilde U_k(m) - U_k(m) \bigr\}\bigl\{ \tilde A_k(m) - A_k(m) \bigr\} + \bigl\{ \tilde U_k(-m) - U_k(-m) \bigr\}\bigl\{ A_k(-m) - \tilde A_k(-m) \bigr\} \Bigr].
\]
This inequality holds for all \(m \in \mathbb{N}\), and hence we can take \(\liminf_{m \to \infty}\). On the left side we can replace \(\liminf_{m \to \infty}\) by \(\lim_{m \to \infty}\), since this is a monotone sequence in \(m\):
\[
\sum_{k=1}^{K} a_k \int \bigl\{ \tilde U_k(t) - U_k(t) \bigr\}^2 dt + a_{K+1} \int \bigl\{ \tilde U_+(t) - U_+(t) \bigr\}^2 dt
\]
\[
\le \liminf_{m \to \infty} \sum_{k=1}^{K} \Bigl[ \bigl\{ \tilde U_k(m) - U_k(m) \bigr\}\bigl\{ \tilde A_k(m) - A_k(m) \bigr\} + \bigl\{ \tilde U_k(-m) - U_k(-m) \bigr\}\bigl\{ A_k(-m) - \tilde A_k(-m) \bigr\} \Bigr]. \tag{6.29}
\]
We will now show that the right side of (6.29) is almost surely equal to zero. We prove this in two steps. First, we show that it is of order \(O_p(1)\), and then we use this to show that it is almost surely equal to zero.

To show that the right side of (6.29) is of order \(O_p(1)\), let \(k \in \{1, \dots, K\}\), and note that the tightness of Lemma 6.20 yields that \(U_k(m) - F_{0k}(m)\) and \(\tilde U_k(m) - F_{0k}(m)\) are of order \(O_p(1)\). This implies that also \(\tilde U_k(m) - U_k(m)\) is of order \(O_p(1)\). Furthermore, Lemma 6.20 implies that the distance of \(m\) to jump points of \(U_k\) and \(\tilde U_k\) is of order \(O_p(1)\). This means that both \(A_k(m)\) and \(\tilde A_k(m)\) are of order \(O_p(1)\), and hence also \(\tilde A_k(m) - A_k(m)\) is of order \(O_p(1)\). Using the same argument for \(-m\), this proves that the right side of (6.29) is of order \(O_p(1)\).

We will now use this result to show that the right hand side of (6.29) is almost surely equal to zero. Let \(k \in \{1, \dots, K\}\) and \(\eta > 0\). We will show that
\[
P\Bigl( \liminf_{m \to \infty} \bigl| \tilde U_k(m) - U_k(m) \bigr| \bigl| \tilde A_k(m) - A_k(m) \bigr| > \eta \Bigr) = 0. \tag{6.30}
\]
Since
\[
P\Bigl( \liminf_{m \to \infty} \bigl| \tilde U_k(m) - U_k(m) \bigr| \bigl| \tilde A_k(m) - A_k(m) \bigr| > \eta \Bigr)
\le \liminf_{m \to \infty} P\Bigl( \bigl| \tilde U_k(m) - U_k(m) \bigr| \bigl| \tilde A_k(m) - A_k(m) \bigr| > \eta \Bigr),
\]
it is sufficient to show that
\[
\liminf_{m \to \infty} P\Bigl( \bigl| \tilde U_k(m) - U_k(m) \bigr| \bigl| \tilde A_k(m) - A_k(m) \bigr| > \eta \Bigr) = 0. \tag{6.31}
\]
Let \(\tau_{km}\) be the last jump point of \(U_k\) before \(m\). Let \(\tau_{km}^-\) be the last jump point of \(\tilde U_k\) at or before \(\tau_{km}\), and let \(\tau_{km}^+\) be the first jump point of \(\tilde U_k\) after \(\tau_{km}\). We now define the following events:
\[
E_{1m} = E_{1m}(\epsilon) = \Bigl\{ \int_{\tau_{km}^-}^{\infty} \bigl\{ \tilde U_k(t) - U_k(t) \bigr\}^2 dt < \epsilon \Bigr\},
\]
\[
E_{2m} = E_{2m}(\delta) = \bigl\{ \text{size of jump of } U_k \text{ at } \tau_{km} > \delta \bigr\},
\]
\[
E_{3m} = E_{3m}(C) = \bigl\{ \bigl| \tilde U_k(m) - U_k(m) \bigr| < C \bigr\},
\]
\[
E_m = E_m(\epsilon, \delta, C) = E_{1m}(\epsilon) \cap E_{2m}(\delta) \cap E_{3m}(C).
\]
Let \(\epsilon_1 > 0\) and \(\epsilon_2 > 0\). Since the right side of (6.29) is of order \(O_p(1)\), it follows that \(\int \{\tilde U_k(t) - U_k(t)\}^2 dt = O_p(1)\) for every \(k \in \{1, \dots, K\}\). This implies that \(\int_m^{\infty} \{\tilde U_k(t) - U_k(t)\}^2 dt \to_p 0\) as \(m \to \infty\). Together with the fact that \(m - \tau_{km}^- = O_p(1)\), this implies that there is an \(m_1 > 0\) such that \(P(E_{1m}(\epsilon_1)^c) < \epsilon_1\) for all \(m > m_1\). Using a stationarity argument (which actually needs some further elaboration) it follows that there are \(\delta > 0\) and \(m_2 > 0\) so that \(P(E_{2m}(\delta)^c) < \epsilon_2/2\) for all \(m > m_2\). By tightness of \(\tilde U_k(m) - U_k(m)\), there are \(C > 0\) and \(m_3 > 0\) so that \(P(E_{3m}(C)^c) < \epsilon_2/2\) for all \(m > m_3\). Combining these observations yields that \(P(E_m(\epsilon_1, \delta, C)^c) < \epsilon_1 + \epsilon_2\) for all \(m > m_0 = \max\{m_1, m_2, m_3\}\).
Returning to (6.31), we now have
\[
\liminf_{m \to \infty} P\Bigl( \bigl| \tilde U_k(m) - U_k(m) \bigr| \bigl| \tilde A_k(m) - A_k(m) \bigr| > \eta \Bigr)
\]
\[
\le \epsilon_1 + \epsilon_2 + \liminf_{m \to \infty} P\Bigl( \bigl\{ \bigl| \tilde U_k(m) - U_k(m) \bigr| \bigl| \tilde A_k(m) - A_k(m) \bigr| > \eta \bigr\} \cap E_m(\epsilon_1, \delta, C) \Bigr)
\]
\[
\le \epsilon_1 + \epsilon_2 + \liminf_{m \to \infty} P\Bigl( \Bigl\{ \bigl| \tilde A_k(m) - A_k(m) \bigr| > \frac{\eta}{C} \Bigr\} \cap E_m(\epsilon_1, \delta, C) \Bigr), \tag{6.32}
\]
using the definition of \(E_{3m}(C)\) in the last line. On the event \(E_{1m}(\epsilon_1)\),
\[
\epsilon_1 \ge \int_{\tau_{km}^-}^{\infty} \bigl\{ \tilde U_k(t) - U_k(t) \bigr\}^2 dt
\ge \int_{\tau_{km}^-}^{\tau_{km}} \bigl\{ \tilde U_k(t) - U_k(t) \bigr\}^2 dt + \int_{\tau_{km}}^{\tau_{km}^+} \bigl\{ \tilde U_k(t) - U_k(t) \bigr\}^2 dt = I + II.
\]
Furthermore, on the event \(E_{2m}(\delta)\) one of the following must hold: (i) \(\tau_{km}^- = \tau_{km}\), (ii) \(|\tilde U_k(\tau_{km}) - U_k(\tau_{km})| \ge \delta/2\), or (iii) \(|\tilde U_k(\tau_{km}-) - U_k(\tau_{km}-)| \ge \delta/2\). Suppose (ii) holds. Then \(II \ge \delta^2(\tau_{km}^+ - \tau_{km})/4\) since \(\tilde U_k - U_k\) is piecewise constant, and hence \(\tau_{km}^+ - \tau_{km} \le 4\epsilon_1/\delta^2\). Next, suppose that (iii) holds. Then \(I \ge \delta^2(\tau_{km} - \tau_{km}^-)/4\), so that \(\tau_{km} - \tau_{km}^- \le 4\epsilon_1/\delta^2\). Thus, on the event \(E_{1m}(\epsilon_1) \cap E_{2m}(\delta)\) there is a jump point of \(\tilde U_k\) that is within \(4\epsilon_1/\delta^2\) of \(\tau_{km}\). Without loss of generality, we assume that this holds for \(\tau_{km}^-\).

We now return to (6.32) and consider the quantity \(A_k(m) - \tilde A_k(m)\). First, note that
\[
a_kH_k(m) + a_{K+1}H_+(m) = a_kX_k(\tau_{km}) + a_{K+1}X_+(\tau_{km}) + \int_{\tau_{km}}^{m} \bigl\{ a_kU_k(u) + a_{K+1}U_+(u) \bigr\} du,
\]
\[
a_k\tilde H_k(m) + a_{K+1}\tilde H_+(m) = a_kX_k(\tau_{km}^-) + a_{K+1}X_+(\tau_{km}^-) + \int_{\tau_{km}^-}^{m} \bigl\{ a_k\tilde U_k(u) + a_{K+1}\tilde U_+(u) \bigr\} du.
\]
Then
\[
A_k(m) - \tilde A_k(m) = a_k\bigl( H_k(m) - \tilde H_k(m) \bigr) + a_{K+1}\bigl( H_+(m) - \tilde H_+(m) \bigr)
\]
\[
= a_k\bigl\{ X_k(\tau_{km}) - X_k(\tau_{km}^-) \bigr\} + a_{K+1}\bigl\{ X_+(\tau_{km}) - X_+(\tau_{km}^-) \bigr\}
+ \int_{\tau_{km}}^{m} \bigl\{ a_k\bigl( U_k(u) - \tilde U_k(u) \bigr) + a_{K+1}\bigl( U_+(u) - \tilde U_+(u) \bigr) \bigr\} du
- \int_{\tau_{km}^-}^{\tau_{km}} \bigl\{ a_k\tilde U_k(u) + a_{K+1}\tilde U_+(u) \bigr\} du.
\]
Since \(\tau_{km} - \tau_{km}^- \le 4\epsilon_1/\delta^2\), and since \(X_k\), \(X_+\), \(\tilde H_k\) and \(\tilde H_+\) are continuous, it follows that the first and third terms on the right side of this expression can be made arbitrarily small by choosing \(\epsilon_1\) small. Hence, we only need to consider the second term:
\[
\left| \int_{\tau_{km}}^{m} \bigl\{ a_k\bigl( U_k(u) - \tilde U_k(u) \bigr) + a_{K+1}\bigl( U_+(u) - \tilde U_+(u) \bigr) \bigr\} du \right|
\]
\[
\le a_k \left( \int_{\tau_{km}}^{m} \bigl\{ U_k(u) - \tilde U_k(u) \bigr\}^2 du \right)^{1/2} (m - \tau_{km})^{1/2}
+ a_{K+1} \left( \int_{\tau_{km}}^{m} \bigl\{ U_+(u) - \tilde U_+(u) \bigr\}^2 du \right)^{1/2} (m - \tau_{km})^{1/2}
\]
\[
= O_p\left( \Bigl( \int_{\tau_{km}}^{m} \bigl\{ U_k(u) - \tilde U_k(u) \bigr\}^2 du \Bigr)^{1/2} \right)
+ O_p\left( \Bigl( \int_{\tau_{km}}^{m} \bigl\{ U_+(u) - \tilde U_+(u) \bigr\}^2 du \Bigr)^{1/2} \right)
= o_p(1), \qquad m \to \infty.
\]
The inequality follows from the Cauchy-Schwarz inequality. The first equality follows from \(m - \tau_{km} = O_p(1)\), and the second equality follows since the integrals \(\int \{U_k(u) - \tilde U_k(u)\}^2 du\) and \(\int \{U_+(u) - \tilde U_+(u)\}^2 du\) are of order \(O_p(1)\), so that the tail distributions of the integrals must be of order \(o_p(1)\).

This implies that
\[
\liminf_{m \to \infty} P\Bigl( \Bigl\{ \bigl| \tilde A_k(m) - A_k(m) \bigr| > \frac{\eta}{C} \Bigr\} \cap E_m(\epsilon_1, \delta, C) \Bigr) = 0.
\]
Using similar reasoning for \(-m\), it follows that the right side of (6.29) equals zero with probability one. In turn, this implies that with probability one, \(U_k = \tilde U_k\) almost everywhere for \(k = 1, \dots, K\). Taking into account the monotonicity and right continuity of \(U_k\) and \(\tilde U_k\), we find that \(U_k\) must be identical to \(\tilde U_k\) with probability one. \(\Box\)
Remark 6.22 An alternative method for proving uniqueness could proceed along the following lines. Let \(\epsilon > 0\), and define the following event
\[
A_m = \Bigl\{ \int_{m}^{m+1} \bigl\{ \tilde U_k(t) - U_k(t) \bigr\}^2 dt > \epsilon \Bigr\}, \qquad m \in \mathbb{N}.
\]
Since the integrals \(\int \{\tilde U_k(t) - U_k(t)\}^2 dt\) are of order \(O_p(1)\), we know that \(P(A_m \text{ i.o.}) = 0\). If we can show that the sequence \(A_m\) is strongly mixing in the sense of Theorem 2 of Yoshihara (1979), then it follows that the second Borel-Cantelli lemma holds, so that \(\sum P(A_m) < \infty\). Since the \(A_m\) are identically distributed, this implies that \(P(A_m) = 0\) for all \(m \in \mathbb{N}\), and this implies \(U_k \equiv \tilde U_k\).
6.2.3 Convergence of the MLE to the limiting distribution

We prove the limiting distribution of the MLE (Theorem 6.10) along the same lines as the limiting distribution of the naive estimator (Theorem 6.4). Thus, we start by localizing the characterization. However, the characterization of the MLE is more complicated than the characterization of the naive estimator. To simplify it, we replace \((\hat F_{nk}(u))^{-1}\) and \((1 - \hat F_{n+}(u))^{-1}\) by \((F_{0k}(t_0))^{-1}\) and \((1 - F_{0+}(t_0))^{-1}\). This results in a rest term which is bounded in Lemma 6.23. The proof of this lemma is given in Section 6.3.
Lemma 6.23 Let \(\tau_{nk}\) be the last jump point of \(\hat F_{nk}\) before \(t_0\), for \(k = 1, \dots, K\). Then for every \(m > 0\), and \(k = 1, \dots, K\):
\[
\int_{[\tau_{nk},\, t_0 + n^{-1/3}h)} \Bigl\{ \frac{\hat F_{nk}(u) - \delta_k}{\hat F_{nk}(u)} + \frac{\hat F_{n+}(u) - \delta_+}{1 - \hat F_{n+}(u)} \Bigr\} d\mathbb{P}_n(u, \delta)
= \int_{[\tau_{nk},\, t_0 + n^{-1/3}h)} \Bigl\{ \frac{\hat F_{nk}(u) - \delta_k}{F_{0k}(t_0)} + \frac{\hat F_{n+}(u) - \delta_+}{1 - F_{0+}(t_0)} \Bigr\} d\mathbb{P}_n(u, \delta) + o_p(n^{-2/3}),
\]
uniformly in \(h \in [-m, m]\).
Next, we give analogues of Lemmas 6.6 and 6.7. We only mention the key ingredients of the proofs, since the proofs themselves are completely analogous to the proofs of Lemmas 6.6 and 6.7. The key ingredients are the local rate of convergence of the MLE (Theorem 5.20) and \(t_0 - \tau_{nk} = O_p(n^{-1/3})\) (Corollary 5.22), where \(\tau_{nk}\) is the last jump point of \(\hat F_{nk}\) before \(t_0\), \(k = 1, \dots, K\).
Lemma 6.24 Let \(\tau_{nk}\) be the last jump point of \(\hat F_{nk}\) before \(t_0\), for \(k = 1, \dots, K\). Then
\[
\int_{[\tau_{nk},\, t_0)} \bigl\{ \hat F_{nk}(u) - \delta_k \bigr\} d\mathbb{P}_n = O_p(n^{-2/3}), \qquad k = 1, \dots, K.
\]
Lemma 6.25 Let the conditions of Theorem 6.4 be satisfied, let \(m > 0\), and \(k \in \{1, \dots, K\}\). Then
\[
\frac{1}{g(t_0)} \int_{[t_0,\, t_0 + n^{-1/3}h]} \bigl\{ \hat F_{nk}(u) - F_{0k}(t_0) \bigr\} d\mathbb{G}_n(u)
= \int_{[t_0,\, t_0 + n^{-1/3}h]} \bigl\{ \hat F_{nk}(u) - F_{0k}(t_0) \bigr\} du + o_p(n^{-2/3}),
\]
uniformly in \(h \in [-m, m]\).
We now give the proof of Theorem 6.10.
Proof of Theorem 6.10: Let \(\tau_{nk}\) be the last jump point of \(\hat F_{nk}\) before \(t_0\), for \(k = 1, \dots, K\). Recall from Proposition 2.34 that the MLE \(\hat F_{nk}(t)\), \(k = 1, \dots, K\), is characterized by
\[
\int_{[\tau_{nk},\, t)} \Bigl\{ \frac{\delta_k}{\hat F_{nk}(u)} - \frac{1 - \delta_+}{1 - \hat F_{n+}(u)} \Bigr\} d\mathbb{P}_n(u, \delta) \ge 0,
\]
for \(k = 1, \dots, K\) and \(t < T_{(n)}\), where equality must hold if \(t\) is a jump point of \(\hat F_{nk}\). This is equivalent to:
\[
\int_{[\tau_{nk},\, t)} \Bigl\{ \frac{\hat F_{nk}(u) - \delta_k}{\hat F_{nk}(u)} + \frac{\hat F_{n+}(u) - \delta_+}{1 - \hat F_{n+}(u)} \Bigr\} d\mathbb{P}_n(u, \delta) \le 0, \tag{6.33}
\]
for \(k = 1, \dots, K\) and \(t < T_{(n)}\), where equality must hold if \(t\) is a jump point of \(\hat F_{nk}\).
We now replace \((\hat F_{nk}(u))^{-1}\) by \((F_{0k}(t_0))^{-1} = a_k\), and \((1 - \hat F_{n+}(u))^{-1}\) by \((1 - F_{0+}(t_0))^{-1} = a_{K+1}\), at the cost of a term \(\int_{[\tau_{nk}, t)} R_{nk}\,d\mathbb{P}_n\), where
\[
R_{nk}(u, \delta) = \Bigl\{ \frac{\hat F_{nk}(u) - \delta_k}{\hat F_{nk}(u)} + \frac{\hat F_{n+}(u) - \delta_+}{1 - \hat F_{n+}(u)} \Bigr\} - \Bigl\{ \frac{\hat F_{nk}(u) - \delta_k}{F_{0k}(t_0)} + \frac{\hat F_{n+}(u) - \delta_+}{1 - F_{0+}(t_0)} \Bigr\}.
\]
This yields
\[
\int_{[\tau_{nk},\, t)} \Bigl[ a_k\bigl\{ \hat F_{nk}(u) - \delta_k \bigr\} + a_{K+1}\bigl\{ \hat F_{n+}(u) - \delta_+ \bigr\} + R_{nk}(u, \delta) \Bigr] d\mathbb{P}_n(u, \delta) \le 0, \tag{6.34}
\]
for \(k = 1, \dots, K\) and \(t < T_{(n)}\), where equality must hold if \(t\) is a jump point of \(\hat F_{nk}\).
In order to change the integration interval \([\tau_{nk}, t)\) to \([t_0, t)\), we define for \(k = 1, \dots, K\):
\[
c_{nk} = \int_{[\tau_{nk},\, t_0)} \Bigl[ a_k\bigl\{ \hat F_{nk}(u) - \delta_k \bigr\} + a_{K+1}\bigl\{ \hat F_{n+}(u) - \delta_+ \bigr\} + R_{nk}(u, \delta) \Bigr] d\mathbb{P}_n(u, \delta).
\]
Then (6.34) is equivalent to
\[
c_{nk} + \int_{[t_0,\, t)} \Bigl[ a_k\bigl\{ \hat F_{nk}(u) - \delta_k \bigr\} + a_{K+1}\bigl\{ \hat F_{n+}(u) - \delta_+ \bigr\} + R_{nk}(u, \delta) \Bigr] d\mathbb{P}_n(u, \delta) \le 0,
\]
for \(k = 1, \dots, K\), where equality must hold if \(t\) is a jump point of \(\hat F_{nk}\).
We now localize this expression, by adding and subtracting \(\int_{[t_0, t)} F_{0k}(t_0)\,d\mathbb{G}_n(u)\) and \(\int_{[t_0, t)} F_{0+}(t_0)\,d\mathbb{G}_n(u)\), and applying the change of variable \(t \to t_0 + n^{-1/3}h\). This yields
\[
c_{nk} + \int_{[t_0,\, t_0 + n^{-1/3}h)} R_{nk}(u, \delta)\,d\mathbb{P}_n(u, \delta) + a_k \int_{[t_0,\, t_0 + n^{-1/3}h)} \bigl\{ \hat F_{nk}(u) - F_{0k}(t_0) \bigr\} d\mathbb{G}_n(u)
+ a_{K+1} \int_{[t_0,\, t_0 + n^{-1/3}h)} \bigl\{ \hat F_{n+}(u) - F_{0+}(t_0) \bigr\} d\mathbb{G}_n(u) \tag{6.35}
\]
\[
\le a_k \int_{[t_0,\, t_0 + n^{-1/3}h)} \bigl\{ \delta_k - F_{0k}(t_0) \bigr\} d\mathbb{P}_n(u, \delta) + a_{K+1} \int_{[t_0,\, t_0 + n^{-1/3}h)} \bigl\{ \delta_+ - F_{0+}(t_0) \bigr\} d\mathbb{P}_n(u, \delta),
\]
for \(k = 1, \dots, K\), \(h < n^{1/3}(T_{(n)} - t_0)\), where equality must hold if \(t_0 + n^{-1/3}h\) is a jump point of \(\hat F_{nk}\). Next, for \(k = 1, \dots, K\) and \(h \in \mathbb{R}\), we define the following processes:
\[
X^{loc}_{nk}(h) = \frac{n^{2/3}}{g(t_0)} \int_{[t_0,\, t_0 + n^{-1/3}h]} \bigl\{ \delta_k - F_{0k}(t_0) \bigr\} d\mathbb{P}_n(u, \delta),
\]
\[
H^{loc}_{nk}(h) = n^{2/3} \int_{[t_0,\, t_0 + n^{-1/3}h]} \bigl\{ \hat F_{nk}(u) - F_{0k}(t_0) \bigr\} du,
\]
\[
U^{loc}_{nk}(h) = n^{1/3}\bigl\{ \hat F_{nk}(t_0 + n^{-1/3}h) - F_{0k}(t_0) \bigr\}.
\]
Note that \(U^{loc}_{nk} = (H^{loc}_{nk})'\) at continuity points of \(U^{loc}_{nk}\). Furthermore, define
\[
c^{loc}_{nk} = \frac{n^{2/3}}{g(t_0)}\, c_{nk},
\]
\[
R^{loc}_{nk}(h) = \frac{n^{2/3}}{g(t_0)} \int_{[t_0,\, t_0 + n^{-1/3}h]} R_{nk}(u, \delta)\,d\mathbb{P}_n(u, \delta)
+ a_k \left( \frac{n^{2/3}}{g(t_0)} \int_{[t_0,\, t_0 + n^{-1/3}h]} \bigl\{ \hat F_{nk}(u) - F_{0k}(t_0) \bigr\} d\mathbb{G}_n(u) - n^{2/3} \int_{[t_0,\, t_0 + n^{-1/3}h]} \bigl\{ \hat F_{nk}(u) - F_{0k}(t_0) \bigr\} du \right)
\]
\[
+ a_{K+1} \left( \frac{n^{2/3}}{g(t_0)} \int_{[t_0,\, t_0 + n^{-1/3}h]} \bigl\{ \hat F_{n+}(u) - F_{0+}(t_0) \bigr\} d\mathbb{G}_n(u) - n^{2/3} \int_{[t_0,\, t_0 + n^{-1/3}h]} \bigl\{ \hat F_{n+}(u) - F_{0+}(t_0) \bigr\} du \right).
\]
Then multiplying (6.35) by \(n^{2/3}/g(t_0)\) yields, for all \(k = 1, \dots, K\) and \(h < n^{1/3}(T_{(n)} - t_0)\):
\[
c^{loc}_{nk} + R^{loc}_{nk}(h) + a_kH^{loc}_{nk}(h) + a_{K+1}H^{loc}_{n+}(h) \le a_kX^{loc}_{nk}(h) + a_{K+1}X^{loc}_{n+}(h), \tag{6.36}
\]
and \(c^{loc}_{nk} + R^{loc}_{nk}(h-) + a_kH^{loc}_{nk}(h-) + a_{K+1}H^{loc}_{n+}(h-) = a_kX^{loc}_{nk}(h-) + a_{K+1}X^{loc}_{n+}(h-)\) if \(U^{loc}_{nk}\) has a jump at \(h\). Combining these statements gives
\[
c^{loc}_{nk} + R^{loc}_{nk}(h) + a_kH^{loc}_{nk}(h) + a_{K+1}H^{loc}_{n+}(h) \le a_kX^{loc}_{nk}(h) + a_{K+1}X^{loc}_{n+}(h), \tag{6.37}
\]
\[
\int_{-\infty}^{n^{1/3}(T_{(n)} - t_0)} \Bigl\{ c^{loc}_{nk} + R^{loc}_{nk}(h-) + a_kH^{loc}_{nk}(h-) + a_{K+1}H^{loc}_{n+}(h-) - a_kX^{loc}_{nk}(h-) - a_{K+1}X^{loc}_{n+}(h-) \Bigr\} dU^{loc}_{nk}(h) = 0, \tag{6.38}
\]
where (6.37) must hold for all \(h < n^{1/3}(T_{(n)} - t_0)\). Note that these conditions also hold when the processes are restricted to \([-m, m] \cap (-\infty, n^{1/3}(T_{(n)} - t_0))\), for each \(m \in \mathbb{N}\).
Next, we define the following vectors:
\[
c^{loc}_n = (c^{loc}_{n1}, \dots, c^{loc}_{nK}), \qquad H^{loc}_n = (H^{loc}_{n1}, \dots, H^{loc}_{nK}),
\]
\[
R^{loc}_n = (R^{loc}_{n1}, \dots, R^{loc}_{nK}), \qquad U^{loc}_n = (U^{loc}_{n1}, \dots, U^{loc}_{nK}),
\]
\[
X^{loc}_n = (X^{loc}_{n1}, \dots, X^{loc}_{nK}).
\]
Furthermore, for \(m \in \mathbb{N}\), we define the space
\[
E[-m, m] = \mathbb{R}^K \times (D[-m, m])^K \times (D[-m, m])^K \times (C[-m, m])^K \times (D[-m, m])^K
\equiv \mathbb{R}^K \times I \times II \times III \times IV,
\]
endowed with the product topology induced by the uniform topology on \(I \times II \times III\), and the Skorohod topology on \(IV\). Note that this space supports the vector
\[
V_n|_{[-m,m]} = (c^{loc}_n, R^{loc}_n, X^{loc}_n, H^{loc}_n, U^{loc}_n)|_{[-m,m]},
\]
where the notation \(|_{[-m,m]}\) denotes that the processes \(R^{loc}_{nk}\), \(X^{loc}_{nk}\), \(H^{loc}_{nk}\) and \(U^{loc}_{nk}\) are restricted to \([-m, m]\) for all \(k = 1, \dots, K\).
The remainder of the proof is analogous to the proof of Theorem 6.4, and we omit some details that can be found there. First, we show that \(V_n|_{[-m,m]}\) is tight in \(E[-m, m]\) for each \(m \in \mathbb{N}\). To do so, we use Lemmas 6.24 and 6.25 to show that \(c_{nk} = O_p(1)\) and \(R^{loc}_{nk} = o_p(1)\) uniformly in \(h \in [-m, m]\), for all \(k = 1, \dots, K\). Hence, by a diagonal argument it follows that for every subsequence \(V_{n'}\) there is a further subsequence that converges in distribution to a limit
\[
V = (c, 0, X, H, U) \in \mathbb{R}^K \times (C(-\infty, \infty))^K \times (C(-\infty, \infty))^K \times (C(-\infty, \infty))^K \times (D(-\infty, \infty))^K,
\]
with \(H' = U\) at continuity points of \(U\). By the continuous mapping theorem and (6.37) and (6.38), it follows that for each \(m \in \mathbb{N}\):
\[
\inf_{[-m,m]} \bigl\{ a_kX_k(t) + a_{K+1}X_+(t) - a_kH_k(t) - a_{K+1}H_+(t) - c_k \bigr\} \ge 0,
\]
\[
\int_{-m}^{m} \bigl\{ a_kX_k(t-) + a_{K+1}X_+(t-) - a_kH_k(t-) - a_{K+1}H_+(t-) - c_k \bigr\} dU_k(t) = 0.
\]
Since \(X_k\) and \(H_k\) are continuous, we can write the second condition as
\[
\int_{-m}^{m} \bigl\{ a_kX_k(t) + a_{K+1}X_+(t) - a_kH_k(t) - a_{K+1}H_+(t) - c_k \bigr\} dU_k(t) = 0.
\]
Defining \(\tilde H_k = H_k + c_k/a_k - F_{0k}(t_0)\sum_{l=1}^{K}(c_l/a_l)\), we have
\[
a_k\tilde H_k = a_kH_k + c_k - \sum_{l=1}^{K}\frac{c_l}{a_l},
\]
\[
a_{K+1}\tilde H_+ = a_{K+1}H_+ + \sum_{l=1}^{K}\frac{c_l}{a_l},
\]
using \(a_{K+1} = (1 - F_{0+}(t_0))^{-1}\) to obtain the second line. This gives
\[
a_kH_k + a_{K+1}H_+ + c_k = a_k\tilde H_k + a_{K+1}\tilde H_+,
\]
so that
\[
\inf_{[-m,m]} \bigl\{ a_kX_k(t) + a_{K+1}X_+(t) - a_k\tilde H_k(t) - a_{K+1}\tilde H_+(t) \bigr\} \ge 0,
\]
\[
\int_{-m}^{m} \bigl\{ a_kX_k(t) + a_{K+1}X_+(t) - a_k\tilde H_k(t) - a_{K+1}\tilde H_+(t) \bigr\} dU_k(t) = 0.
\]
Letting \(m \to \infty\) it follows that \(\tilde H_1, \dots, \tilde H_K\) satisfy conditions (i) and (ii) of Theorem 6.9. Furthermore, condition (iii) of Theorem 6.9 is satisfied by Corollary 5.22.

Hence, there exists a \(K\)-tuple of processes \((\tilde H_1, \dots, \tilde H_K)\) that satisfies the conditions of Theorem 6.9. Furthermore, there is only one such \(K\)-tuple, by the uniqueness established in Lemma 6.21. Hence, each subsequence converges to the same limit \(\tilde H\), with \(\tilde H\) as defined in Theorem 6.10. This implies that \(U^{loc}_n \to_d U\) in the Skorohod topology. In particular,
\[
U^{loc}_n(0) = n^{1/3}\bigl( \hat F_n(t_0) - F_0(t_0) \bigr) \to_d U(0), \qquad \text{in } \mathbb{R}^K. \qquad \Box
\]
6.3 Technical lemmas and proofs
Lemma 6.26 states a well-known fact about convex functions. We provide this lemma
and its proof for completeness.
Lemma 6.26 Let \((f_n)\) be a sequence of convex functions on \(\mathbb{R}\) converging pointwise to a convex function \(f\) on \(\mathbb{R}\). Then, at each point \(t\) where the two-sided derivative \(f'\) of \(f\) exists, we have:
\[
\lim_{n \to \infty} D^+f_n(t) = \lim_{n \to \infty} D^-f_n(t) = f'(t),
\]
where \(D^+f_n\) and \(D^-f_n\) are the right and left derivative of \(f_n\), respectively.
Proof: Fix \(\epsilon > 0\) and suppose that \(f\) is differentiable at \(t\). Then there exists an \(\eta > 0\) such that
\[
f'(t) - \epsilon \le \frac{f(t) - f(t - \eta)}{\eta} \le \frac{f(t + \eta) - f(t)}{\eta} \le f'(t) + \epsilon.
\]
Moreover,
\[
\lim_{n \to \infty} \frac{f_n(t + \eta) - f_n(t)}{\eta} = \frac{f(t + \eta) - f(t)}{\eta},
\]
and also
\[
\lim_{n \to \infty} \frac{f_n(t) - f_n(t - \eta)}{\eta} = \frac{f(t) - f(t - \eta)}{\eta}.
\]
The statement now follows from
\[
\frac{f_n(t) - f_n(t - \eta)}{\eta} \le D^-f_n(t) \le D^+f_n(t) \le \frac{f_n(t + \eta) - f_n(t)}{\eta}. \qquad \Box
\]
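A small numerical illustration of Lemma 6.26 (ours, not from the thesis): the convex functions \(f_n(t) = \sqrt{t^2 + 1/n}\) converge pointwise to the convex function \(f(t) = |t|\), which is differentiable at every \(t \ne 0\), so the derivatives of \(f_n\) at \(t = 1\) must approach \(f'(1) = 1\):

```python
# Lemma 6.26 in action: f_n(t) = sqrt(t^2 + 1/n) is convex and converges
# pointwise to f(t) = |t|; f is differentiable at t = 1 with f'(1) = 1,
# so the (one-sided) derivatives of f_n at t = 1 converge to 1.
import math

def fn(t, n):
    # convex smoothing of |t|; pointwise limit is |t| as n -> infinity
    return math.sqrt(t * t + 1.0 / n)

def approx_right_derivative(f, t, h=1e-6):
    # finite-difference approximation of the right derivative D+ f(t)
    return (f(t + h) - f(t)) / h

derivs = [approx_right_derivative(lambda s, n=n: fn(s, n), 1.0)
          for n in (1, 10, 100, 10000)]
```

The exact values are \(f_n'(1) = 1/\sqrt{1 + 1/n}\), which increase toward \(1\) as \(n\) grows.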
Proof of Proposition 6.8: Note that
\[
X^{loc}_{nk}(h) = n^{2/3} \int_{[t_0,\, t_0 + n^{-1/3}h]} \bigl\{ \delta_k - F_{0k}(u) \bigr\} d\mathbb{P}_n(u, \delta)
+ n^{2/3} \int_{[t_0,\, t_0 + n^{-1/3}h]} \bigl\{ F_{0k}(u) - F_{0k}(t_0) \bigr\} d\mathbb{G}_n(u). \tag{6.39}
\]
For each \(k = 1, \dots, K\), the first term on the right side of (6.39) converges in distribution to \(W_k(h)\) in \(l^\infty[-M, M]\), where \(W_k(h)\) is the Brownian motion process defined in Definition 6.2. This follows by applying Theorem 2.11.22 of Van der Vaart and Wellner (1996, page 220) to the class of functions
\[
\mathcal{F}_{nk} = \bigl\{ f_{nkh}(u, \delta) = n^{1/6} 1_{[t_0,\, t_0 + n^{-1/3}h]}(u)\bigl\{ \delta_k - F_{0k}(u) \bigr\} : h \in [-M, M] \bigr\}, \qquad n \in \mathbb{N}.
\]
This convergence also holds jointly in \(k = 1, \dots, K\). Namely, marginal tightness of the processes implies joint tightness. Hence, there is a subsequence that converges weakly to a tight Borel measure, jointly for \(k = 1, \dots, K\). The marginal distributions of this Borel measure are given by \(W_1, \dots, W_K\), and its covariance structure can be determined by considering the finite dimensional distributions. Since \(\Delta|T\) has a multinomial distribution (see (2.4)), \((W_1, \dots, W_K)\) has a multinomial covariance structure.
The second term on the right side of (6.39) can be written as
\[
n^{2/3} \int_{[t_0,\, t_0 + n^{-1/3}h]} \bigl\{ F_{0k}(u) - F_{0k}(t_0) \bigr\} d\mathbb{G}_n(u)
= n^{2/3} \int_{[t_0,\, t_0 + n^{-1/3}h]} \bigl\{ F_{0k}(u) - F_{0k}(t_0) \bigr\} dG(u)
+ n^{2/3} \int_{[t_0,\, t_0 + n^{-1/3}h]} \bigl\{ F_{0k}(u) - F_{0k}(t_0) \bigr\} d(\mathbb{G}_n - G)(u)
\]
\[
= n^{2/3} \int_{[t_0,\, t_0 + n^{-1/3}h]} \bigl\{ F_{0k}(u) - F_{0k}(t_0) \bigr\} dG(u) + o_p(1)
\to \frac{1}{2} f_{0k}(t_0)g(t_0)h^2,
\]
where the convergence is uniform for \(h \in [-M, M]\), and joint for \(k = 1, \dots, K\). Here the second to last line of the display follows from Lemma 5.13. The last line of the display follows from the continuity and positivity of \(f_{0k}(t)\) and \(g(t)\) in a neighborhood of \(t_0\). \(\Box\)
Proof of Lemma 6.17: It is sufficient to show that there exists an \(M > 0\) such that the statement holds for a fixed \(k\). Let \(k \in \{1, \dots, K\}\) and \(j \in \{0, 1, \dots\}\). Note that \(dX_k(t) = dW_k(t)/g(t_0) + F_{0k}(t)dt\). Hence,
\[
\int_{w}^{j+Mv(j)} \bigl[ F_{0k}(j + Mv(j))dt - dX_k(t) \bigr]
= \int_{w}^{j+Mv(j)} \bigl[ F_{0k}(j + Mv(j) - t)dt - dW_k(t)/g(t_0) \bigr]
= \frac{1}{2} f_{0k}(t_0)(j + Mv(j) - w)^2 - \int_{w}^{j+Mv(j)} dW_k(t)/g(t_0).
\]
Furthermore, for any \(C > 0\) fixed, we have that
\[
(j + Mv(j) - w)^{3/2}C \le \frac{1}{4} f_{0k}(t_0)(j + Mv(j) - w)^2,
\]
for \(M\) sufficiently large. Hence, it is sufficient to show that
\[
P\left( \sup_{w \le j+1} \Bigl\{ \int_{w}^{j+Mv(j)} dW_k(t)/g(t_0) - \frac{1}{4} f_{0k}(t_0)(j + Mv(j) - w)^2 \Bigr\} \ge 0 \right) \le \frac{p_{jM}}{2}. \tag{6.40}
\]
We again consider a grid, now with grid points \(j + 1 - q\), \(q \in \mathbb{N}\). Then we can bound the left side of the above display by
\[
\sum_{q=0}^{\infty} P\left( \sup_{w \in [j-q,\, j-q+1)} \Bigl\{ \int_{w}^{j+Mv(j)} \frac{dW_k(t)}{g(t_0)} - \frac{1}{4} f_{0k}(t_0)(j + Mv(j) - w)^2 \Bigr\} \ge 0 \right)
\le \sum_{q=0}^{\infty} P\left( \sup_{w \in [j-q,\, j-q+1)} \int_{w}^{j+Mv(j)} dW_k(t) \ge \lambda_{kjq} \right), \tag{6.41}
\]
where \(\lambda_{kjq}\) is obtained by plugging in \(w = j + 1 - q\) in the quadratic term:
\[
\lambda_{kjq} = \frac{1}{4} f_{0k}(t_0)g(t_0)(Mv(j) - 1 + q)^2.
\]
Let \(B_k(\cdot)\) denote standard Brownian motion. We write the \(q\)th term in (6.41) as
\[
P\left( \sup_{w \in [j-q,\, j-q+1)} W_k(j + Mv(j) - w) \ge \lambda_{kjq} \right)
\le P\left( \sup_{w \in [0,\, Mv(j)+q)} W_k(w) \ge \lambda_{kjq} \right)
= P\left( \sup_{w \in [0, 1)} W_k\bigl( (Mv(j) + q)w \bigr) \ge \lambda_{kjq} \right)
\]
\[
= P\left( \sup_{w \in [0, 1)} W_k(w) \ge \frac{\lambda_{kjq}}{\sqrt{Mv(j) + q}} \right)
\le P\left( \sup_{w \in [0, 1]} B_k(w) \ge \frac{\lambda_{kjq}}{b_k\sqrt{Mv(j) + q}} \right)
\]
\[
\le 2P\left( N(0, 1) \ge \frac{\lambda_{kjq}}{b_k\sqrt{Mv(j) + q}} \right)
\le 2b_{kq} \exp\left( -\frac{1}{2} \Bigl( \frac{\lambda_{kjq}}{b_k\sqrt{Mv(j) + q}} \Bigr)^2 \right),
\]
where
\[
b_k = \sqrt{F_{0k}(t_0)(1 - F_{0k}(t_0))g(t_0)}, \qquad k = 1, \dots, K,
\]
\[
b_{kq} = \frac{b_k\sqrt{Mv(j) + q}}{\lambda_{kjq}\sqrt{2\pi}}, \qquad k = 1, \dots, K, \; q \in \mathbb{N}.
\]
Here we used standard properties of Brownian motion. The second to last inequality is given in for example Shorack and Wellner (1986, equation 6, page 33), and the last inequality follows from Mills' ratio (Gordon (1941, Equation (10), page 366)). Note that \(b_{kq} \le b_k/(f_{0k}(t_0)g(t_0)\sqrt{2\pi})\) for \(M > 3\). Hence, returning to (6.41), we have
\[
\sum_{q=0}^{\infty} P\left( \sup_{w \in [j-q,\, j-q+1)} \int_{w}^{j+Mv(j)} dW_k(t) \ge \lambda_{kjq} \right)
\le \sum_{q=0}^{\infty} \frac{2b_k}{f_{0k}(t_0)g(t_0)\sqrt{2\pi}} \exp\left( -\frac{1}{2} \Bigl( \frac{\lambda_{kjq}}{b_k\sqrt{Mv(j) + q}} \Bigr)^2 \right)
\]
\[
\approx \sum_{q=0}^{\infty} \frac{2b_k}{f_{0k}(t_0)g(t_0)\sqrt{2\pi}} \exp\left( -\frac{1}{2} \frac{(Mv(j) + q)^3}{b_k^2} \right)
\le d_1 \exp\bigl( -d_2(Mv(j))^3 \bigr),
\]
using \((a + b)^3 \ge a^3 + b^3\) for \(a, b \ge 0\). \(\Box\)
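For convenience, the two standard facts used in the last two inequalities of the display above are (reflection principle and Mills' ratio; our restatement): for \(x > 0\),

```latex
P\Bigl( \sup_{w \in [0,1]} B_k(w) \ge x \Bigr)
  = 2\,P\bigl( N(0,1) \ge x \bigr)
  \le \frac{2}{x\sqrt{2\pi}}\, e^{-x^2/2},
```

applied with \(x = \lambda_{kjq}/(b_k\sqrt{Mv(j) + q})\); this choice of \(x\) produces exactly the constant \(b_{kq} = b_k\sqrt{Mv(j)+q}/(\lambda_{kjq}\sqrt{2\pi})\) appearing in the bound.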
Proof of Lemma 6.18: Since \(l\) is only defined on the event \(A_j\), this entire proof should be read on the event \(A_j\). If \(l = K\), then we can apply the method of Lemma 6.17. Therefore, assume that \(l < K\). In this case, we cannot apply the method of Lemma 6.17, for the reason discussed in the proof of Lemma 5.16 and illustrated in Figure 5.3. Hence, we break the term \(\int_{\tau_l}^{j+Mv(j)} [dX_+(t) - U_+(t)dt]\) into pieces that we analyze separately. We define \(l^* \in \{l, \dots, K\}\) as follows. If
\[
\int_{\tau_l}^{\tau_k} \bigl[ dX_+(t) - U_+(t)dt \bigr] \ge 0, \qquad \text{for all } k = l + 1, \dots, K, \tag{6.42}
\]
we let \(l^* = l\). Otherwise we define \(l^*\) such that
\[
\int_{\tau_l}^{\tau_k} \bigl[ dX_+(t) - U_+(t)dt \bigr] \ge 0, \qquad k = l^* + 1, \dots, K, \tag{6.43}
\]
\[
\int_{\tau_l}^{\tau_{l^*}} \bigl[ dX_+(t) - U_+(t)dt \bigr] < 0. \tag{6.44}
\]
Then, by (6.44) and the decomposition \(\int_{\tau_l}^{j+Mv(j)} = \int_{\tau_l}^{\tau_{l^*}} + \int_{\tau_{l^*}}^{j+Mv(j)}\), we get
\[
\int_{\tau_l}^{j+Mv(j)} \bigl[ dX_+(t) - U_+(t)dt \bigr] \le \int_{\tau_{l^*}}^{j+Mv(j)} \bigl[ dX_+(t) - U_+(t)dt \bigr], \tag{6.45}
\]
where strict inequality holds if \(l \ne l^*\). Rearranging the sum and using the notation \(\tau_{K+1} = j + Mv(j)\), we can rewrite the right side of (6.45) as
\[
\sum_{k=1}^{K} \int_{\tau_{l^*}}^{j+Mv(j)} \bigl[ dX_k(t) - U_k(t)dt \bigr]
= \sum_{k=l^*+1}^{K} \int_{\tau_{l^*}}^{\tau_k} \bigl[ dX_k(t) - U_k(t)dt \bigr] + \sum_{k=l^*}^{K} \sum_{p=1}^{k} \int_{\tau_k}^{\tau_{k+1}} \bigl[ dX_p(t) - U_p(t)dt \bigr]. \tag{6.46}
\]
We now derive upper bounds for the terms in (6.46). For the first term, note that
\[
\int_{\tau_{l^*}}^{\tau_k} \bigl[ dX_+(t) - U_+(t)dt \bigr] \ge 0, \qquad k = l^* + 1, \dots, K. \tag{6.47}
\]
Namely, if \(l = l^*\), then (6.47) is the same as (6.42). If \(l < l^*\), then (6.47) follows from (6.43), (6.44) and the decomposition \(\int_{\tau_l}^{\tau_k} = \int_{\tau_l}^{\tau_{l^*}} + \int_{\tau_{l^*}}^{\tau_k}\). Furthermore, the definition of \(\tau_1, \dots, \tau_K\) and condition (i) of Theorem 6.9 imply that
\[
\int_{t}^{\tau_k} \Bigl[ dX_k(s) - U_k(s)ds + \frac{a_{K+1}}{a_k}\bigl\{ dX_+(s) - U_+(s)ds \bigr\} \Bigr] \le 0, \qquad k = 1, \dots, K, \; t \le \tau_k.
\]
Using this inequality with \(t = \tau_{l^*}\) together with (6.47) yields that
\[
\int_{\tau_{l^*}}^{\tau_k} \bigl[ dX_k(t) - U_k(t)dt \bigr] \le 0,
\]
for \(k = l^* + 1, \dots, K\). This implies that the first term of (6.46) is bounded above by zero.
We now derive an upper bound for the second term of (6.46). On the event \(A_j\), the inequalities (6.17) in the definition of \(l\) imply that
\[
\sum_{p=k+1}^{K} U_p(j + 1) < \sum_{p=k+1}^{K} F_{0p}(j + Mv(j)), \qquad k = l, \dots, K.
\]
Together with the definition of \(\tau_k\), it follows that on the event \(A_j\) we have
\[
\sum_{p=1}^{k} U_p(\tau_p) = \sum_{p=1}^{k} U_p(j + 1) > \sum_{p=1}^{k} F_{0p}(j + Mv(j)), \qquad k = l, \dots, K.
\]
Furthermore, \(U_p(\tau_p) \le U_p(\tau_k)\) for \(p \le k\) by the monotonicity of \(U_p\) and the ordering \(\tau_1 \le \dots \le \tau_K\). Hence, we get for \(k = l, \dots, K\), and \(u \ge \tau_k\):
\[
\sum_{p=1}^{k} U_p(u) \ge \sum_{p=1}^{k} U_p(\tau_k) > \sum_{p=1}^{k} F_{0p}(j + Mv(j)).
\]
This means that on the event \(A_j\), the second term of (6.46) is bounded above by
\[
\sum_{k=l^*}^{K} \sum_{p=1}^{k} \int_{\tau_k}^{\tau_{k+1}} \bigl[ dX_p(t) - F_{0p}(j + Mv(j))dt \bigr]
= \sum_{k=1}^{K} \int_{\tau_k \vee \tau_{l^*}}^{j+Mv(j)} \bigl[ dX_k(t) - F_{0k}(j + Mv(j))dt \bigr].
\]
Combining (6.45), (6.46) and the upper bound for (6.46), we obtain
\begin{align*}
&P\biggl(\biggl\{\int_{\tau_l}^{j+Mv(j)} \bigl[dX_+(t) - U_+(t)\,dt\bigr] \ge 0\biggr\} \cap A_j\biggr)\\
&\quad\le P\biggl(\biggl\{\int_{\tau_{l^*}}^{j+Mv(j)} \bigl[dX_+(t) - U_+(t)\,dt\bigr] \ge 0\biggr\} \cap A_j\biggr)\\
&\quad\le P\biggl(\sum_{k=1}^{K} \int_{\tau_k \vee \tau_{l^*}}^{j+Mv(j)} \bigl[dX_k(t) - F_{0k}(j+Mv(j))\,dt\bigr] \ge 0\biggr).
\end{align*}
In turn, this is bounded above by
\[
P\biggl(\sup_{k \in \{1,\dots,K\},\, w \le j+1} \int_{w}^{j+Mv(j)} \bigl[dX_k(t) - F_{0k}(j+Mv(j))\,dt\bigr] \ge 0\biggr),
\]
and this can be bounded by $p_{jM}/2$ using Lemma 6.17. $\Box$
Proof of Lemma 6.23: Note that
\begin{align*}
&\int_{[\tau_{nk},\,t_0+n^{-1/3}h)} \biggl\{\frac{\hat F_{nk}(u) - \delta_k}{\hat F_{nk}(u)} + \frac{\hat F_{n+}(u) - \delta_+}{1 - \hat F_{n+}(u)}\biggr\}\,dP_n(u,\delta)\\
&\quad= \int_{[\tau_{nk},\,t_0+n^{-1/3}h)} \biggl\{\frac{\hat F_{nk}(u) - \delta_k}{F_{0k}(t_0)} + \frac{\hat F_{n+}(u) - \delta_+}{1 - F_{0+}(t_0)}\biggr\}\,dP_n(u,\delta)\\
&\qquad+ \int_{[\tau_{nk},\,t_0+n^{-1/3}h)} \frac{\{\hat F_{nk}(u) - \delta_k\}\{F_{0k}(t_0) - \hat F_{nk}(u)\}}{\hat F_{nk}(u)F_{0k}(t_0)}\,dP_n(u,\delta)\\
&\qquad+ \int_{[\tau_{nk},\,t_0+n^{-1/3}h)} \frac{\{\hat F_{n+}(u) - \delta_+\}\{\hat F_{n+}(u) - F_{0+}(t_0)\}}{\{1 - \hat F_{n+}(u)\}\{1 - F_{0+}(t_0)\}}\,dP_n(u,\delta)\\
&\quad\equiv I + II + III.
\end{align*}
Since the terms $II$ and $III$ are analogous, we only show that $II$ is of order $o_p(n^{-2/3})$. As in Lemma 5.9, it is sufficient to consider the numerator. We write
\begin{align*}
&\int_{[\tau_{nk},\,t_0+n^{-1/3}h)} \{\hat F_{nk}(u) - \delta_k\}\{F_{0k}(t_0) - \hat F_{nk}(u)\}\,dP_n(u,\delta)\\
&\quad= \int_{[\tau_{nk},\,t_0+n^{-1/3}h)} \{\hat F_{nk}(u) - F_{0k}(u)\}\{F_{0k}(t_0) - \hat F_{nk}(u)\}\,d(P_n - P)(u,\delta)\\
&\qquad+ \int_{[\tau_{nk},\,t_0+n^{-1/3}h)} \{\hat F_{nk}(u) - F_{0k}(u)\}\{F_{0k}(t_0) - \hat F_{nk}(u)\}\,dP(u,\delta)\\
&\qquad+ \int_{[\tau_{nk},\,t_0+n^{-1/3}h)} \{F_{0k}(u) - \delta_k\}\{F_{0k}(t_0) - \hat F_{nk}(u)\}\,d(P_n - P)(u,\delta)\\
&\quad\equiv II_a + II_b + II_c.
\end{align*}
Note that $II_b$ is of order $O_p(n^{-1})$, using that the length of the integration interval is $O_p(n^{-1/3})$ (Corollary 5.22) and the local rate of convergence (Theorem 5.10). The terms $II_a$ and $II_c$ are of order $o_p(n^{-2/3})$ by Theorem 2.11.22 of Van der Vaart and Wellner (1996), analogously to the treatment of term $I$ in Lemma 6.7. $\Box$
Chapter 7
A FAMILY OF SMOOTH FUNCTIONALS
Let $c : \mathbb{R} \to \mathbb{R}$ be a fixed function. We consider estimation of the following smooth functionals of the sub-distribution functions:
\[
V_k(F) = \int F_k(t)c(t)\,dG(t) = \int C_g(x)\,dF_k(x), \qquad k = 1,\dots,K+1,
\]
where $C_g(t) = \int_{[t,\infty)} c(x)\,dG(x)$. The second equality follows from Fubini's theorem if $\int F_k(t)|c(t)|\,dG(t) < \infty$. We choose to consider the functionals $\int F_k(t)c(t)\,dG(t)$ instead of $\int F_k(t)c(t)\,dt$, because doing so allows us to get asymptotic results with few assumptions on $G$. Furthermore, the functionals $\int F_k(t)c(t)\,dt$ fit into the family $\int F_k(t)b(t)\,dG(t)$ by assuming that $G$ has a density $g$ with respect to Lebesgue measure, and choosing $b(t) = c(t)/g(t)$.
Jewell, Van der Laan and Henneman (2003, Section 8) discuss results that suggest that the naive estimators yield fully efficient estimators for these smooth functionals, and that under some conditions
\[
\sqrt{n}\bigl\{V_k(\tilde F_n) - V_k(F_0)\bigr\} = \sqrt{n}\int \bigl\{\tilde F_{nk}(t) - F_{0k}(t)\bigr\}c(t)\,dG(t)
\to_d N\Bigl(0, \int F_{0k}(t)(1 - F_{0k}(t))c^2(t)\,dG(t)\Bigr).
\]
We show that the same is true for the MLE, and hence that the naive estimator and the MLE are asymptotically equivalent for these smooth functionals. In Section 7.1 we derive the information lower bound for our model, and in Section 7.2 we show that the MLE achieves this lower bound. We assume that the MLEs $\hat F_{nk}$ are piecewise constant and right-continuous, with jumps only at points in $T_k$ (see Definition 2.22).
7.1 Information bound calculations
Since our variables of interest $(X, Y)$ are subject to censoring, we can consider so-called hidden and observed models for our data. The hidden data consist of the triplets $(T, X, Y)$, and the hidden model is $\mathcal{Q} = \{Q_F : F \in \mathcal{F}_K\}$. The corresponding density $q_F(x, y = k)$ is simply $f_k(x)$. The observed data are $H(T, X, Y) = (T, \Delta)$ and the observed model is
\[
\mathcal{P} = \{Q_F H^{-1} : F \in \mathcal{F}_K\}. \tag{7.1}
\]
The density of $P_F \in \mathcal{P}$ with respect to $\mu$ is
\[
p_F(t,\delta) = \prod_{k=1}^{K} F_k(t)^{\delta_k}\,(1 - F_+(t))^{1-\delta_+},
\]
where $\mu = G \times \#$ and $\#$ is counting measure on the unit vectors $e_k \in \mathbb{R}^{K+1}$, $k = 1,\dots,K+1$.

Let $L_2(P)$ be the equivalence class of $P$-square integrable functions, with inner product $\langle g_1, g_2\rangle_{L_2(P)} = \int g_1 g_2\,dP$ and norm $\|g\| = \{\int g^2\,dP\}^{1/2}$. Let $L_2^0(P)$ be the subset of $g \in L_2(P)$ with $E_P(g) = \int g\,dP = 0$. Finally, note that both $P \in \mathcal{P}$ and $Q \in \mathcal{Q}$ depend on the underlying distribution $F$. However, we often suppress this dependence in the notation.
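As an aside, the observed-data density above translates directly into the log-likelihood of a sample. A minimal sketch of its evaluation, assuming `numpy` (the function name is hypothetical and the sub-distribution values are supplied as arrays):

```python
import numpy as np

def log_likelihood(Fk_vals, delta):
    """Evaluate sum_i log p_F(t_i, delta_i) for current status competing risks data.

    Fk_vals : (n, K) array; entry (i, k) is F_{k+1}(t_i)
    delta   : (n, K+1) array of indicators; each row is a unit vector e_j
    """
    Fplus = Fk_vals.sum(axis=1)
    probs = np.column_stack([Fk_vals, 1.0 - Fplus])  # (F_1(t), ..., F_K(t), 1 - F_+(t))
    # log p_F(t, delta) = sum_k delta_k log F_k(t) + (1 - delta_+) log(1 - F_+(t));
    # assumes every probability attached to an observed outcome is positive
    return float(np.sum(delta * np.log(np.where(delta > 0, probs, 1.0))))
```

The `np.where` guard only avoids evaluating `log(0)` at outcomes that were not observed; the indicator structure of each row then selects exactly one log-probability per observation.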
The functionals $V_k(F)$, $F \in \mathcal{F}_K$, are implicitly defined in the sense that the observed data are from $P_F$ rather than directly from $F$. In terms of the observed data, we can write the functionals as $\Theta_k(P_F)$, $P_F \in \mathcal{P}$. Here the observation time distribution $G$ acts as a nuisance parameter. Information bounds for such implicitly defined functionals were studied by Van der Vaart (1991). Discussions of Van der Vaart's work can be found in Groeneboom and Wellner (1992, pages 23-32), and Bickel, Klaassen, Ritov and Wellner (1993, pages 201-210).
Throughout, we need the following assumptions:

(a) The distribution $G$ of $T$ is fixed and known (see Remark 7.16 for a discussion of the effects of not knowing $G$);

(b) $I_{F_k}^{-1} = \int F_k(t)(1 - F_k(t))c^2(t)\,dG(t) < \infty$;

(c) $\int F_k(t)|c(t)|\,dG(t) < \infty$.
Proposition 7.1 The score operator $l_F$ relates the observed model to the hidden model. It is the bounded linear operator from $L_2^0(Q)$ to $L_2^0(P)$ given by
\[
[l_F a](t,\delta) = \sum_{k=1}^{K} \int_{(-\infty,t]} a(x,k)\,dF_k(x) \left(\frac{\delta_k}{F_k(t)} - \frac{\delta_{K+1}}{1 - F_+(t)}\right) \quad \text{a.e. } P_F. \tag{7.2}
\]
The adjoint $l^T$ of the score operator is the bounded linear operator from $L_2^0(P)$ to $L_2^0(Q)$ given by
\[
[l^T b](x,k) = \int_{[x,\infty)} b(t,e_k)\,dG(t) + \int_{(-\infty,x)} b(t,e_{K+1})\,dG(t) \quad \text{a.e. } F. \tag{7.3}
\]
Note that $l^T$ does not depend on $F$.
Proof: Let $a \in L_2^0(Q)$. By for example Groeneboom and Wellner (1992, page 8, equation (1.5)), we have
\begin{align*}
[l_F a](t,\delta) &= E\bigl(a(X,Y) \mid H(T,X,Y) = (t,\delta)\bigr)\\
&= \sum_{k=1}^{K} \left\{\delta_k \frac{\int_{(-\infty,t]} a(x,k)\,dF_k(x)}{F_k(t)} + \delta_{K+1} \frac{\int_{(t,\infty)} a(x,k)\,dF_k(x)}{1 - F_+(t)}\right\}\\
&= \sum_{k=1}^{K} \left\{\delta_k \frac{\int_{(-\infty,t]} a(x,k)\,dF_k(x)}{F_k(t)} - \delta_{K+1} \frac{\int_{(-\infty,t]} a(x,k)\,dF_k(x)}{1 - F_+(t)}\right\}\\
&= \sum_{k=1}^{K} \int_{(-\infty,t]} a(x,k)\,dF_k(x) \left(\frac{\delta_k}{F_k(t)} - \frac{\delta_{K+1}}{1 - F_+(t)}\right) \quad \text{a.e. } P_F,
\end{align*}
where we use $\sum_{k=1}^{K} \int a(x,k)\,dF_k(x) = 0$ (since $a \in L_2^0(Q)$) to obtain the third line.

Let $b \in L_2^0(P)$. Then
\begin{align*}
[l^T b](x,k) &= E\bigl(b(T,\Delta) \mid (X,Y) = (x,k)\bigr)\\
&= \int_{[x,\infty)} b(t,e_k)\,dG(t) + \int_{(-\infty,x)} b(t,e_{K+1})\,dG(t) \quad \text{a.e. } F. \qquad \Box
\end{align*}
The functional $V_k(F)$ is said to be pathwise differentiable at $F$ in the hidden model $\mathcal{Q}$ if there is a continuous linear map $v_{kF}$ from $L_2^0(Q)$ to $\mathbb{R}$ such that
\[
\eta^{-1}\bigl(V_k(F_\eta) - V_k(F)\bigr) \to v_{kF}a
\]
for every path $F_\eta$ in $\mathcal{Q}$ with score $a$. We call $v_{kF}$ the canonical gradient of $V_k(F)$ in the hidden model.
Proposition 7.2 The canonical gradients $v_{1F},\dots,v_{K+1,F}$ of $V_1,\dots,V_{K+1}$ at $F$ in the hidden model are bounded linear functionals from $L_2^0(Q)$ to $\mathbb{R}$, given by
\begin{align*}
v_{kF}a &= \sum_{j=1}^{K} \int \Bigl(C_g(x)1\{j = k\} - \int C_g(t)\,dF_k(t)\Bigr) a(x,j)\,dF_j(x),\\
v_{K+1,F}a &= -\sum_{k=1}^{K} \int \Bigl(C_g(x) - \int C_g(t)\,dF_+(t)\Bigr) a(x,k)\,dF_k(x),
\end{align*}
where the first equality holds for $k = 1,\dots,K$. Furthermore, their adjoints are bounded linear maps from $\mathbb{R}$ to $L_2^0(Q)$, given by
\begin{align*}
[v_{kF}^T b](x,j) &= \Bigl(C_g(x)1\{j = k\} - \int C_g(t)\,dF_k(t)\Bigr) b, \qquad k = 1,\dots,K,\\
[v_{K+1,F}^T b](x,k) &= -\Bigl(C_g(x) - \int C_g(t)\,dF_+(t)\Bigr) b.
\end{align*}
Proof: Let $a \in L_2^0(Q)$ be bounded. Consider the perturbation
\[
F_{k\eta}(t) = F_k(t) + \eta \int_{(-\infty,t]} a(x,k)\,dF_k(x).
\]
Note that these functions are monotone nondecreasing for small $\eta$. Furthermore, $a \in L_2^0(Q)$ implies $\sum_{k=1}^{K} \int a(x,k)\,dF_k(x) = 0$, so that
\[
\sum_{k=1}^{K} F_{k\eta}(\infty) = F_+(\infty) + \eta \sum_{k=1}^{K} \int a(x,k)\,dF_k(x) = F_+(\infty) = 1.
\]
It follows that the $F_{k\eta}$'s are valid sub-distribution functions. Now let $k \in \{1,\dots,K\}$. Then
\[
V_k(F_\eta) - V_k(F) = \int C_g(x)\,d\Bigl[\eta \int_{(-\infty,x]} a(t,k)\,dF_k(t)\Bigr] = \eta \int C_g(x)a(x,k)\,dF_k(x),
\]
so that
\begin{align*}
v_{kF}a &= \int C_g(x)a(x,k)\,dF_k(x) = \sum_{j=1}^{K} \int C_g(x)a(x,j)1\{j = k\}\,dF_j(x)\\
&= \sum_{j=1}^{K} \int \Bigl(C_g(x)1\{j = k\} - \int C_g(t)\,dF_k(t)\Bigr) a(x,j)\,dF_j(x),
\end{align*}
again using that $a \in L_2^0(Q)$. Furthermore,
\[
F_{K+1,\eta}(t) = 1 - F_{+\eta}(t) = 1 - F_+(t) - \eta \sum_{k=1}^{K} \int_{(-\infty,t]} a(x,k)\,dF_k(x).
\]
Hence,
\[
V_{K+1}(F_\eta) - V_{K+1}(F) = \int C_g(x)\,d\Bigl[-\eta \sum_{k=1}^{K} \int_{(-\infty,x]} a(t,k)\,dF_k(t)\Bigr] = -\eta \sum_{k=1}^{K} \int C_g(x)a(x,k)\,dF_k(x).
\]
Hence,
\begin{align*}
v_{K+1,F}a &= -\sum_{k=1}^{K} \int C_g(x)a(x,k)\,dF_k(x)\\
&= -\sum_{k=1}^{K} \int \Bigl(C_g(x) - \int C_g(t)\,dF_+(t)\Bigr) a(x,k)\,dF_k(x).
\end{align*}
We find the adjoints $v_{kF}^T$, $k = 1,\dots,K+1$, using the relation $\langle v_{kF}a, b\rangle_{\mathbb{R}} = \langle a, v_{kF}^T b\rangle_{L_2(Q)}$. This yields
\begin{align*}
\langle v_{kF}a, b\rangle_{\mathbb{R}} &= \sum_{j=1}^{K} \int b \Bigl(C_g(x)1\{j = k\} - \int C_g(t)\,dF_k(t)\Bigr) a(x,j)\,dF_j(x)\\
&= \Bigl\langle \Bigl(C_g(x)1\{j = k\} - \int C_g(t)\,dF_k(t)\Bigr) b,\ a(x,j)\Bigr\rangle_{L_2(Q)},
\end{align*}
so that
\[
[v_{kF}^T b](x,j) = \Bigl(C_g(x)1\{j = k\} - \int C_g(t)\,dF_k(t)\Bigr) b, \qquad k = 1,\dots,K.
\]
The adjoint $v_{K+1,F}^T$ can be derived analogously. $\Box$
It now follows from Van der Vaart (1991, Theorem 3.1) that the functionals $V_k(F) = \Theta_k(P_F)$, $k = 1,\dots,K+1$, are pathwise differentiable in the observed model if and only if
\[
v_{kF}^T \in \mathcal{R}(l^T),
\]
and if this holds, then the canonical gradient is the unique element $\theta_{kF} \in \overline{\mathcal{R}(l)}$ satisfying
\[
l^T \theta_{kF} = v_{kF}^T.
\]
Proposition 7.3 The canonical gradients $\theta_{1F},\dots,\theta_{K+1,F}$ of $V_1,\dots,V_{K+1}$ in the observed model are bounded linear functionals from $L_2^0(P)$ to $\mathbb{R}$, given by
\[
\theta_{jF}(t,\delta) = \{\delta_j - F_j(t)\}c(t), \qquad j = 1,\dots,K+1. \tag{7.4}
\]
Furthermore,
\[
\theta_{jF}(t,\delta) = \sum_{k=1}^{K} \left\{\frac{\delta_k}{F_k(t)} - \frac{1-\delta_+}{1-F_+(t)}\right\} d_{jF}(t,k), \qquad j = 1,\dots,K+1, \tag{7.5}
\]
where
\[
d_{jF}(t,k) = F_j(t)\bigl(1\{j = k\} - F_k(t)\bigr)c(t), \qquad j = 1,\dots,K+1. \tag{7.6}
\]
The information lower bounds for estimating $V_1,\dots,V_{K+1}$ in the observed model are
\[
I_{F_j}^{-1} = \|\theta_{jF}\|_{L_2(P)}^2 = \int F_j(t)(1 - F_j(t))c^2(t)\,dG(t), \qquad j = 1,\dots,K+1. \tag{7.7}
\]
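Since (7.4) and (7.5) agree pointwise in $(t,\delta)$, their equality is a finite algebraic identity that can be checked numerically before reading the proof. A small sketch, assuming `numpy` (all numerical values are hypothetical; components are indexed from 0 in the code):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4
F = rng.uniform(0.05, 0.15, size=K)   # F_1(t), ..., F_K(t) at a fixed t, with F_+(t) < 1
Fp = F.sum()
Fall = np.append(F, 1.0 - Fp)         # append F_{K+1}(t) = 1 - F_+(t)
c = 0.7                               # value of c(t)

for j in range(K + 1):                # component j+1
    for m in range(K + 1):            # observed outcome: delta = e_{m+1}
        delta = np.zeros(K + 1)
        delta[m] = 1.0
        dplus = delta[:K].sum()       # delta_+
        d = Fall[j] * ((np.arange(K) == j).astype(float) - F) * c        # d_{jF}(t,k), (7.6)
        rhs = np.sum((delta[:K] / F - (1.0 - dplus) / (1.0 - Fp)) * d)   # (7.5)
        lhs = (delta[j] - Fall[j]) * c                                   # (7.4)
        assert abs(lhs - rhs) < 1e-12
```

The loop raises an `AssertionError` if the two expressions ever disagree; with all $F_k(t)$ strictly positive, the $0/0$ convention never comes into play.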
Proof: We first consider $\theta_{jF}$ for $j \in \{1,\dots,K\}$. For all $k = 1,\dots,K$, we have
\begin{align*}
[l^T \theta_{jF}](x,k) &= \int_{[x,\infty)} \theta_{jF}(t,e_k)\,dG(t) + \int_{(-\infty,x)} \theta_{jF}(t,e_{K+1})\,dG(t)\\
&= \int_{[x,\infty)} \bigl\{1\{k = j\} - F_j(t)\bigr\}c(t)\,dG(t) - \int_{(-\infty,x)} F_j(t)c(t)\,dG(t)\\
&= C_g(x)1\{k = j\} - \int C_g(t)\,dF_j(t) = v_{jF}^T(x,k).
\end{align*}
We now consider $\theta_{K+1,F}$:
\begin{align*}
[l^T \theta_{K+1,F}](x,k) &= -\int_{[x,\infty)} F_{K+1}(t)c(t)\,dG(t) + \int_{(-\infty,x)} \bigl\{1 - F_{K+1}(t)\bigr\}c(t)\,dG(t)\\
&= -\int F_{K+1}(t)c(t)\,dG(t) + \int_{(-\infty,x)} c(t)\,dG(t).
\end{align*}
This can be written as
\[
\int F_+(t)c(t)\,dG(t) - \int_{[x,\infty)} c(t)\,dG(t) = \int C_g(t)\,dF_+(t) - C_g(x) = v_{K+1,F}^T(x,k).
\]
Next, we check expression (7.5). For $j = 1,\dots,K$, we have
\begin{align*}
&\sum_{k=1}^{K} \left\{\frac{\delta_k}{F_k(x)} - \frac{1-\delta_+}{1-F_+(x)}\right\} d_{jF}(x,k)\\
&\quad= \sum_{k=1}^{K} \left\{\frac{\delta_k}{F_k(x)} - \frac{1-\delta_+}{1-F_+(x)}\right\} F_j(x)\bigl(1\{j = k\} - F_k(x)\bigr)c(x)\\
&\quad= c(x)\left[\left\{\frac{\delta_j}{F_j(x)} - \frac{1-\delta_+}{1-F_+(x)}\right\} F_j(x)(1 - F_j(x)) - \sum_{k \ne j} \left\{\frac{\delta_k}{F_k(x)} - \frac{1-\delta_+}{1-F_+(x)}\right\} F_j(x)F_k(x)\right]\\
&\quad= c(x)\left[\delta_j(1 - F_j(x)) - \frac{(1-\delta_+)F_j(x)(1 - F_j(x))}{1 - F_+(x)} - \sum_{k \ne j} \left\{\delta_k F_j(x) - \frac{(1-\delta_+)F_j(x)F_k(x)}{1 - F_+(x)}\right\}\right]\\
&\quad= c(x)\left[\delta_j - \delta_+ F_j(x) - \frac{(1-\delta_+)F_j(x)}{1 - F_+(x)} + \frac{(1-\delta_+)F_j(x)F_+(x)}{1 - F_+(x)}\right]\\
&\quad= c(x)\bigl\{\delta_j - \delta_+ F_j(x) - (1-\delta_+)F_j(x)\bigr\}\\
&\quad= c(x)\bigl\{\delta_j - F_j(x)\bigr\} = \theta_{jF}(x,\delta).
\end{align*}
We verify the expression for $j = K+1$ analogously:
\[
\sum_{k=1}^{K} \left\{\frac{\delta_k}{F_k(x)} - \frac{1-\delta_+}{1-F_+(x)}\right\} d_{K+1,F}(x,k)
= -\sum_{k=1}^{K} \left\{\frac{\delta_k}{F_k(x)} - \frac{\delta_{K+1}}{F_{K+1}(x)}\right\} F_{K+1}(x)F_k(x)c(x).
\]
This can be written as
\[
-c(x)\sum_{k=1}^{K} \bigl\{\delta_k F_{K+1}(x) - \delta_{K+1}F_k(x)\bigr\}
= -c(x)\bigl\{\delta_+ F_{K+1}(x) - \delta_{K+1}F_+(x)\bigr\}
= \bigl\{\delta_{K+1} - F_{K+1}(x)\bigr\}c(x) = \theta_{K+1,F}(x,\delta).
\]
The expressions for the information lower bounds follow from direct computation. $\Box$
Remark 7.4 The expressions $d_{jF}(t,k)$ given in (7.6) typically have discontinuities that do not coincide with discontinuities of $F_k$, for two reasons: (i) $F_j$ can have jumps at other locations than $F_k$; (ii) the function $c(t)$ may have jumps at other locations than $F_k$. In such cases we cannot express $d_{jF}(t,k)$ as $\int_{(-\infty,t]} a(x,k)\,dF_k(x)$ for some $a \in L_2^0(Q)$. Hence $\theta_{jF} \in \overline{\mathcal{R}(l)} \setminus \mathcal{R}(l)$.
7.2 Asymptotic normality of functionals of the MLE

We now let $c(x) = \xi(x)1_{[0,t]}(x)$, where $\xi : \mathbb{R}_+ \to \mathbb{R}_+$ is Lipschitz continuous. With this choice of the function $c(\cdot)$, assumptions (b) and (c) of Section 7.1 are automatically satisfied when $t$ is finite. Furthermore, this choice of the function $c(\cdot)$ yields
\begin{align*}
\theta_{j,F,t}(u,\delta) &= \{\delta_j - F_j(u)\}\xi(u)1_{[0,t]}(u),\\
d_{j,F,t}(u,k) &= F_j(u)\bigl(1\{j = k\} - F_k(u)\bigr)\xi(u)1_{[0,t]}(u).
\end{align*}
Throughout, we assume $F_{0+}(0) = 0$, and we use the convention $0/0 = 0$. We now give the main result of this chapter.
Theorem 7.5 Let $t_0 \in \mathbb{R}$ and $c(t) = \xi(t)1_{[0,t_0]}(t)$, where $\xi : \mathbb{R}_+ \to \mathbb{R}_+$ is Lipschitz continuous. Assume that $F_{01},\dots,F_{0K}$ are absolutely continuous with respect to Lebesgue measure on $[0,t_0]$, with densities $f_{01},\dots,f_{0K}$. Assume that $\epsilon < f_{0k}(t) < M$ for some constants $0 < \epsilon < M$, for all $t \in [0,t_0]$ and $k = 1,\dots,K$. Furthermore, assume that $F_{0+}(t_0) < 1$, that $F_{0+}$ is continuous at $t_0$, and that $G$ has a strictly positive density $g$ on a neighborhood of $t_0$. Then
\[
\sqrt{n}\bigl(V_k(F_0) - V_k(\hat F_n)\bigr) = \sqrt{n}\int_{[0,t_0]} \bigl\{F_{0k}(t) - \hat F_{nk}(t)\bigr\}\xi(t)\,dG(t)
\to_d N\Bigl(0, \int_{[0,t_0]} F_{0k}(t)(1 - F_{0k}(t))\xi^2(t)\,dG(t)\Bigr),
\]
for $k = 1,\dots,K+1$.
Proof: The proof is similar in spirit to the proofs of Huang and Wellner (1995) and Geskus and Groeneboom (1996, 1997, 1999). However, the new aspect here is that we have a system of sub-distribution functions.

We first consider $F_{K+1}$. In Lemma 7.14 (ahead), we show that
\[
\int_{[0,t_0]} \bigl\{F_{0,K+1}(t) - \hat F_{n,K+1}(t)\bigr\}\xi(t)\,dG(t) \le \int \theta_{K+1,F_0,t_0}\,d(P_0 - P_n) + O_p(n^{-2/3}).
\]
Letting $\theta_{+,F,t} = \sum_{k=1}^{K} \theta_{k,F,t}$ and noting that $F_{K+1} = 1 - F_+$ and $\theta_{+,F,t} = -\theta_{K+1,F,t}$, we find that this is equivalent to
\[
\int_{[0,t_0]} \bigl\{F_{0+}(t) - \hat F_{n+}(t)\bigr\}\xi(t)\,dG(t) \ge \int \theta_{+,F_0,t_0}\,d(P_0 - P_n) + O_p(n^{-2/3}). \tag{7.8}
\]
Next, we consider the components $F_1,\dots,F_K$. In Lemma 7.15 (ahead), we show that
\[
\int_{[0,t_0]} \bigl\{F_{0k}(t) - \hat F_{nk}(t)\bigr\}\xi(t)\,dG(t) \le \max_{\tau \in \{\sigma_{nk},\tau_{nk}\}} \int \theta_{k,F_0,\tau-}\,d(P_0 - P_n) + O_p(n^{-2/3}), \tag{7.9}
\]
where $\sigma_{nk}$ is the last jump point of $\hat F_{nk}$ before $t_0$, and $\tau_{nk}$ is the first jump point of $\hat F_{nk}$ after $t_0$. Consistency of $\hat F_{nk}$ in a neighborhood of $t_0$ (Proposition 4.15) and $f_{0k}(t_0) > 0$ imply that $\tau_{nk} - \sigma_{nk} \to_{a.s.} 0$. Furthermore, the central limit theorem implies that for any $\alpha > 0$
\begin{align*}
\sqrt{n}\int \bigl\{\theta_{k,F_0,t_0+\alpha} - \theta_{k,F_0,t_0}\bigr\}\,d(P_0 - P_n)
&= \sqrt{n}\int_{(t_0,t_0+\alpha]} \{\delta_k - F_{0k}(t)\}\xi(t)\,d(P_0 - P_n)\\
&\to_d N\Bigl(0, \int_{(t_0,t_0+\alpha]} F_{0k}(t)(1 - F_{0k}(t))\xi^2(t)\,dG(t)\Bigr).
\end{align*}
Hence,
\[
\max_{\tau \in \{\sigma_{nk},\tau_{nk}\}} \int \theta_{k,F_0,\tau-}\,d(P_0 - P_n) = \int \theta_{k,F_0,t_0}\,d(P_0 - P_n) + o_p(n^{-1/2}).
\]
Combining this with (7.9) yields
\[
\int_{[0,t_0]} \bigl\{F_{0k}(t) - \hat F_{nk}(t)\bigr\}\xi(t)\,dG(t) \le \int \theta_{k,F_0,t_0}\,d(P_0 - P_n) + o_p(n^{-1/2}), \tag{7.10}
\]
and summing over $k = 1,\dots,K$ gives
\[
\int_{[0,t_0]} \bigl\{F_{0+}(t) - \hat F_{n+}(t)\bigr\}\xi(t)\,dG(t) \le \int \theta_{+,F_0,t_0}\,d(P_0 - P_n) + o_p(n^{-1/2}). \tag{7.11}
\]
Combining (7.11) and (7.8) yields
\[
\sqrt{n}\int_{[0,t_0]} \bigl\{F_{0+}(t) - \hat F_{n+}(t)\bigr\}\xi(t)\,dG(t) = \sqrt{n}\int \theta_{+,F_0,t_0}\,d(P_0 - P_n) + o_p(1)
\to_d N\bigl(0, \|\theta_{+,F_0,t_0}\|_{L_2(P_0)}^2\bigr),
\]
where the convergence follows from the central limit theorem. Since $F_{K+1} = 1 - F_+$ and $\|\theta_{+,F_0,t_0}\|^2 = \|\theta_{K+1,F_0,t_0}\|^2$, this is equivalent to
\[
\sqrt{n}\int_{[0,t_0]} \bigl\{F_{0,K+1}(t) - \hat F_{n,K+1}(t)\bigr\}\xi(t)\,dG(t) \to_d N\bigl(0, \|\theta_{K+1,F_0,t_0}\|_{L_2(P_0)}^2\bigr).
\]
Finally, (7.10) and (7.8) imply that
\[
\sqrt{n}\int_{[0,t_0]} \bigl\{F_{0k}(t) - \hat F_{nk}(t)\bigr\}\xi(t)\,dG(t) = \sqrt{n}\int \theta_{k,F_0,t_0}\,d(P_0 - P_n) + o_p(1), \tag{7.12}
\]
for $k = 1,\dots,K$. The convergence result then again follows from the central limit theorem. $\Box$
Remark 7.6 Theorem 7.5 requires that the underlying $F_{01},\dots,F_{0K}$ are absolutely continuous with respect to Lebesgue measure, with densities bounded away from zero. This assumption was not used when computing the MLE, and in fact the MLE does not satisfy this assumption: the MLE can be taken to be piecewise constant, so it always contains horizontal pieces, where its density equals zero. Thus, under the assumptions of Theorem 7.5, we expect that one can construct better estimators than the MLE. However, such estimators are not better (asymptotically) for the estimation of the smooth functionals we consider, since the MLE is asymptotically efficient for these smooth functionals.
In order to prove the key Lemmas 7.14 and 7.15 that are needed in the proof of
Theorem 7.5, we need to establish a few results. We start with a basic but important
fact.
Lemma 7.7 For any $F = (F_1,\dots,F_K) \in \mathcal{F}_K$, we have
\[
\int_{[0,t_0]} \bigl(F_{0k}(t) - F_k(t)\bigr)\xi(t)\,dG(t) = \int \theta_{k,F,t_0}\,dP_0, \qquad k = 1,\dots,K+1.
\]
Proof:
\[
\int \theta_{k,F,t_0}\,dP_0 = \int \{\delta_k - F_k(t)\}\xi(t)1_{[0,t_0]}(t)\,dP_0 = \int_{[0,t_0]} \bigl\{F_{0k}(t) - F_k(t)\bigr\}\xi(t)\,dG(t),
\]
since $E_{P_0}(\delta_k \mid T = t) = F_{0k}(t)$. $\Box$
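Because the identity rests only on $E_{P_0}(\delta_k \mid T = t) = F_{0k}(t)$, it holds exactly when $G$ is discrete, which makes it easy to verify numerically. A small sketch, assuming `numpy` (all numerical values hypothetical; components indexed from 0):

```python
import numpy as np

rng = np.random.default_rng(1)
K, m = 3, 8
tgrid = np.linspace(0.1, 2.0, m)                      # support of a discrete G
g = np.full(m, 1.0 / m)                               # G puts mass 1/m on each point
F0 = np.sort(rng.uniform(0.0, 0.2, (m, K)), axis=0)   # true sub-distribution values (monotone columns)
F  = np.sort(rng.uniform(0.0, 0.2, (m, K)), axis=0)   # an arbitrary F in the model class
xi = 1.0 + 0.5 * tgrid                                # a Lipschitz xi
ind = (tgrid <= 1.5).astype(float)                    # c(t) = xi(t) 1{t <= t0}, with t0 = 1.5
k = 1                                                 # component to check

# left side: E_{P_0} theta_{k,F,t0}, written out over the K+1 outcomes of delta
p = np.column_stack([F0, 1.0 - F0.sum(axis=1)])       # P(delta = e_j | T = t)
theta = lambda j: ((j == k) - F[:, k]) * xi * ind     # theta_{k,F,t0} at delta = e_{j+1}
lhs = sum(np.sum(g * p[:, j] * theta(j)) for j in range(K + 1))
# right side: integral of (F_{0k} - F_k) xi over [0, t0] against G
rhs = np.sum(g * (F0[:, k] - F[:, k]) * xi * ind)
assert abs(lhs - rhs) < 1e-12
```

The check is exact (not Monte Carlo): summing $\theta$ against the conditional outcome probabilities reproduces $(F_{0k} - F_k)\xi$ pointwise.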
Next, we introduce an adapted version $\bar\theta_{j,F,t_0}$ of $\theta_{j,F,t_0}$. The main goal of this adaptation is that the corresponding functions $\bar d_{j,F,t_0}(x,k)$ are constant on the same intervals as $F_k$, so that we can use the (in)equalities given by the characterization in Proposition 2.34.
Definition 7.8 Let $F = (F_1,\dots,F_K) \in \mathcal{F}_K$ be piecewise constant, and let $k \in \{1,\dots,K\}$. Let $0 = \tau_{k0} < \tau_{k1} < \cdots < \tau_{k,p_k} < \tau_{k,p_k+1} = \infty$ be the ordered jump points of $F_k$. For $i \in \{0,\dots,p_k\}$, let $J_{ki} = [\tau_{ki}, \tau_{k,i+1})$ and distinguish the following three cases:

(i) $F_k(s_{ki}) = F_{0k}(s_{ki})$ for some $s_{ki} \in J_{ki}$;

(ii) $F_k(u) < F_{0k}(\tau_{ki})$ for all $u \in J_{ki}$;

(iii) $F_k(u) > F_{0k}(\tau_{k,i+1}-)$ for all $u \in J_{ki}$.

In case (ii) we define $s_{ki} = \tau_{ki}$, and in case (iii) we define $s_{ki} = \tau_{k,i+1}-$. Furthermore, for $j = 1,\dots,K+1$, we choose a point $u_{kji} \in J_{ki}$ such that
\[
|F_j(u_{kji}) - F_{0j}(s_{ki})| = \min_{x \in J_{ki}} |F_j(x) - F_{0j}(s_{ki})|.
\]
Then, for $t \in J_{ki}$, we define
\[
\bar F_j^{(k)}(t) = F_j(u_{kji}), \qquad \bar\xi^{(k)}(t) = \xi(s_{ki}).
\]
Finally, we define
\begin{align*}
\bar d_{j,F,t_0}(t,k) &= \bar F_j^{(k)}(t)\bigl(1\{j = k\} - F_k(t)\bigr)\bar\xi^{(k)}(t)1_{[0,t_0]}(t),\\
\bar\theta_{j,F,t_0}(t,\delta) &= \sum_{k=1}^{K} \left\{\frac{\delta_k}{F_k(t)} - \frac{1-\delta_+}{1-F_+(t)}\right\} \bar d_{j,F,t_0}(t,k).
\end{align*}
This adapted version of $\theta_{j,\hat F_n,t_0}$ is useful, because we have information on the sign of $\int \bar\theta_{j,\hat F_n,t_0}\,dP_n$, $j = 1,\dots,K+1$. In Lemma 7.9 we show that $\int \bar\theta_{K+1,\hat F_n,t_0}\,dP_n \le 0$. In Lemma 7.10 we show that $\int \bar\theta_{j,\hat F_n,\tau_{nj}-}\,dP_n \le 0$ for $j = 1,\dots,K$ and $\tau_{nj}$ a jump point of $\hat F_{nj}$.
Lemma 7.9 We have $\int \bar\theta_{K+1,\hat F_n,t_0}\,dP_n \le 0$.

Proof: Note that
\[
\int \bar\theta_{K+1,\hat F_n,t_0}\,dP_n = \sum_{k=1}^{K} \int \left\{\frac{\delta_k}{\hat F_{nk}(t)} - \frac{1-\delta_+}{1-\hat F_{n+}(t)}\right\} \bar d_{K+1,\hat F_n,t_0}(t,k)\,dP_n(t,\delta),
\]
and that $\bar d_{K+1,\hat F_n,t_0}(t,k)$ is constant on the same intervals as $\hat F_{nk}(t)$, except for the one containing $t_0$. Using the characterization given in Proposition 2.34, it follows that for each $k = 1,\dots,K$
\begin{align*}
&\int_{t \in [0,t_0]} \left\{\frac{\delta_k}{\hat F_{nk}(t)} - \frac{1-\delta_+}{1-\hat F_{n+}(t)}\right\} \bar d_{K+1,\hat F_n,t_0}(t,k)\,dP_n(t,\delta)\\
&\quad= \int_{t \in [\tau_{nk1},t_0]} \left\{\frac{\delta_k}{\hat F_{nk}(t)} - \frac{1-\delta_+}{1-\hat F_{n+}(t)}\right\} \bar d_{K+1,\hat F_n,t_0}(t,k)\,dP_n(t,\delta)\\
&\quad= -\int_{t \in [\tau_{nk1},t_0]} \left\{\frac{\delta_k}{\hat F_{nk}(t)} - \frac{1-\delta_+}{1-\hat F_{n+}(t)}\right\} \bar F_{n,K+1}^{(k)}(t)\hat F_{nk}(t)\bar\xi^{(k)}(t)\,dP_n(t,\delta) \le 0.
\end{align*}
Here $\tau_{nk1}$ is the first jump point of $\hat F_{nk}$, and the first equality follows from $\hat F_{nk}(t) = 0$ for $t < \tau_{nk1}$. The last inequality follows from equality in (2.41) for $t = \tau_{nk1}$, inequality in expression (2.41) for $t = t_0+$, and the fact that $\bar F_{n,K+1}^{(k)}(t)\hat F_{nk}(t)\bar\xi^{(k)}(t)$ is constant on the same intervals as $\hat F_{nk}$. $\Box$
We can say something similar about $\int \bar\theta_{j,\hat F_n,t_0}\,dP_n$, $j = 1,\dots,K$, but only for jump points of $\hat F_{nj}$.
Lemma 7.10 Let $\tau_{nj}$ be a jump point of $\hat F_{nj}$. Then we have for $j = 1,\dots,K$,
\[
\int \bar\theta_{j,\hat F_n,\tau_{nj}-}\,dP_n \le 0.
\]
Proof: Let $\tau_{nk1}$ be the first jump point of $\hat F_{nk}$, $k = 1,\dots,K$. Note that
\begin{align*}
&\int \bar\theta_{j,\hat F_n,\tau_{nj}-}\,dP_n\\
&\quad= \sum_{k=1}^{K} \int \left\{\frac{\delta_k}{\hat F_{nk}(t)} - \frac{1-\delta_+}{1-\hat F_{n+}(t)}\right\} \bar d_{j,\hat F_n,\tau_{nj}-}(t,k)\,dP_n(t,\delta)\\
&\quad= \int_{t \in [0,\tau_{nj})} \left\{\frac{\delta_j}{\hat F_{nj}(t)} - \frac{1-\delta_+}{1-\hat F_{n+}(t)}\right\} \hat F_{nj}(t)(1 - \hat F_{nj}(t))\bar\xi^{(j)}(t)\,dP_n(t,\delta)\\
&\qquad- \sum_{k \ne j} \int_{t \in [0,\tau_{nj})} \left\{\frac{\delta_k}{\hat F_{nk}(t)} - \frac{1-\delta_+}{1-\hat F_{n+}(t)}\right\} \bar F_{nj}^{(k)}(t)\hat F_{nk}(t)\bar\xi^{(k)}(t)\,dP_n(t,\delta)\\
&\quad= \int_{t \in [\tau_{nj1},\tau_{nj})} \left\{\frac{\delta_j}{\hat F_{nj}(t)} - \frac{1-\delta_+}{1-\hat F_{n+}(t)}\right\} \hat F_{nj}(t)(1 - \hat F_{nj}(t))\bar\xi^{(j)}(t)\,dP_n(t,\delta)\\
&\qquad- \sum_{k \ne j} \int_{t \in [\tau_{nk1},\tau_{nj})} \left\{\frac{\delta_k}{\hat F_{nk}(t)} - \frac{1-\delta_+}{1-\hat F_{n+}(t)}\right\} \bar F_{nj}^{(k)}(t)\hat F_{nk}(t)\bar\xi^{(k)}(t)\,dP_n(t,\delta)\\
&\quad= I - II \le 0.
\end{align*}
This inequality follows from the characterization in Proposition 2.34, which implies $I = 0$ and $II \ge 0$. Here $I = 0$ follows from the fact that we have equality at $t = \tau_{nj1}$ and $t = \tau_{nj}$ in (2.41) for the $j$th component. Similarly, $II \ge 0$ follows from expression (2.41) for the $k$th component ($k \ne j$), where we have equality at $t = \tau_{nk1}$ and inequality at $t = \tau_{nj}$ (because $\tau_{nj}$ is typically not a jump point of $\hat F_{nk}$). $\Box$
The last two ingredients for the proofs of Lemmas 7.14 and 7.15 are
\begin{align}
\left|\int \bigl\{\theta_{j,\hat F_n,t_0} - \bar\theta_{j,\hat F_n,t_0}\bigr\}\,dP_0\right| &= O_p(n^{-2/3}), \tag{7.13}\\
\left|\int \bigl\{\bar\theta_{j,\hat F_n,t_0} - \theta_{j,F_0,t_0}\bigr\}\,d(P_n - P_0)\right| &= O_p(n^{-2/3}). \tag{7.14}
\end{align}
In order to prove this, we first bound the differences $\bar F_{nj}^{(k)}(t) - \hat F_{nj}(t)$ and $\bar\xi^{(k)}(t) - \xi(t)$.
Lemma 7.11 Under the conditions of Theorem 7.5, we have for all $j = 1,\dots,K+1$ and $k = 1,\dots,K$,
\begin{align}
\bigl|\hat F_{nj}(x) - \bar F_{nj}^{(k)}(x)\bigr| &\le 2\bigl|\hat F_{nj}(x) - F_{0j}(x)\bigr| + (2M/\epsilon)\bigl|\hat F_{nk}(x) - F_{0k}(x)\bigr|, \tag{7.15}\\
\bigl|\xi(x) - \bar\xi^{(k)}(x)\bigr| &\le (C/\epsilon)\bigl|F_{0k}(x) - \hat F_{nk}(x)\bigr|, \tag{7.16}
\end{align}
where $\epsilon$ and $M$ are defined in Theorem 7.5 and $C > 0$ is a constant.
Proof: Let $j \in \{1,\dots,K+1\}$, $k \in \{1,\dots,K\}$, $i \in \{0,\dots,p_k\}$, and $x \in J_{ki}$. Then the triangle inequality and the definition of $\bar F_{nj}^{(k)}$ yield:
\begin{align*}
\bigl|\hat F_{nj}(x) - \bar F_{nj}^{(k)}(x)\bigr|
&\le \bigl|\hat F_{nj}(x) - F_{0j}(x)\bigr| + \bigl|F_{0j}(x) - F_{0j}(s_{ki})\bigr| + \bigl|F_{0j}(s_{ki}) - \bar F_{nj}^{(k)}(x)\bigr|\\
&= \bigl|\hat F_{nj}(x) - F_{0j}(x)\bigr| + \bigl|F_{0j}(x) - F_{0j}(s_{ki})\bigr| + \bigl|F_{0j}(s_{ki}) - \hat F_{nj}(u_{kji})\bigr|\\
&\le \bigl|\hat F_{nj}(x) - F_{0j}(x)\bigr| + \bigl|F_{0j}(x) - F_{0j}(s_{ki})\bigr| + \bigl|F_{0j}(s_{ki}) - \hat F_{nj}(x)\bigr|.
\end{align*}
Furthermore, the triangle inequality implies
\[
\bigl|F_{0j}(s_{ki}) - \hat F_{nj}(x)\bigr| \le \bigl|F_{0j}(s_{ki}) - F_{0j}(x)\bigr| + \bigl|F_{0j}(x) - \hat F_{nj}(x)\bigr|.
\]
Hence,
\[
\bigl|\hat F_{nj}(x) - \bar F_{nj}^{(k)}(x)\bigr| \le 2\bigl|\hat F_{nj}(x) - F_{0j}(x)\bigr| + 2\bigl|F_{0j}(x) - F_{0j}(s_{ki})\bigr|. \tag{7.17}
\]
Using $\epsilon < f_{0j}(t) < M$ for $t \in [0,t_0]$, we get
\[
\bigl|F_{0j}(x) - F_{0j}(s_{ki})\bigr| \le M|x - s_{ki}| \le (M/\epsilon)\bigl|F_{0k}(x) - F_{0k}(s_{ki})\bigr|. \tag{7.18}
\]
For the interval $J_{ki}$, we now consider the three possible cases in Definition 7.8. In case (i) we have
\[
\bigl|F_{0k}(x) - F_{0k}(s_{ki})\bigr| = \bigl|F_{0k}(x) - \hat F_{nk}(s_{ki})\bigr| = \bigl|F_{0k}(x) - \hat F_{nk}(x)\bigr|.
\]
In case (ii), we have $\hat F_{nk}(x) < F_{0k}(\tau_{ki})$ for all $x \in J_{ki}$ and $s_{ki} = \tau_{ki}$. This yields
\[
\bigl|F_{0k}(x) - F_{0k}(s_{ki})\bigr| = \bigl|F_{0k}(x) - F_{0k}(\tau_{ki})\bigr| < \bigl|F_{0k}(x) - \hat F_{nk}(\tau_{ki})\bigr| = \bigl|F_{0k}(x) - \hat F_{nk}(x)\bigr|.
\]
Similarly, in case (iii) we have $\hat F_{nk}(x) > F_{0k}(\tau_{k,i+1}-)$ for all $x \in J_{ki}$ and $s_{ki} = \tau_{k,i+1}-$. This yields
\[
\bigl|F_{0k}(x) - F_{0k}(s_{ki})\bigr| = \bigl|F_{0k}(x) - F_{0k}(\tau_{k,i+1}-)\bigr| < \bigl|F_{0k}(x) - \hat F_{nk}(\tau_{k,i+1}-)\bigr| = \bigl|F_{0k}(x) - \hat F_{nk}(x)\bigr|.
\]
Hence, in all three cases we have $\bigl|F_{0k}(x) - F_{0k}(s_{ki})\bigr| \le \bigl|F_{0k}(x) - \hat F_{nk}(x)\bigr|$. Combining this with (7.17) and (7.18) gives (7.15).

To prove (7.16), we again consider the three cases in Definition 7.8. In case (i) we have
\[
\bigl|\xi(x) - \bar\xi^{(k)}(x)\bigr| = \bigl|\xi(x) - \xi(s_{ki})\bigr| \le C|x - s_{ki}| \le (C/\epsilon)\bigl|F_{0k}(x) - F_{0k}(s_{ki})\bigr|
= (C/\epsilon)\bigl|F_{0k}(x) - \hat F_{nk}(s_{ki})\bigr| = (C/\epsilon)\bigl|F_{0k}(x) - \hat F_{nk}(x)\bigr|,
\]
where $C$ is the Lipschitz constant of $\xi$. The expressions for cases (ii) and (iii) follow analogously. $\Box$
We can now prove (7.13) and (7.14).

Lemma 7.12 Under the conditions of Theorem 7.5, we have for all $j = 1,\dots,K+1$,
\[
\left|\int \bigl\{\theta_{j,\hat F_n,t_0} - \bar\theta_{j,\hat F_n,t_0}\bigr\}\,dP_0\right| = O_p(n^{-2/3}).
\]
Proof: First, note that for $j = 1,\dots,K$, we have
\begin{align*}
&\int \bigl\{\theta_{j,\hat F_n,t_0} - \bar\theta_{j,\hat F_n,t_0}\bigr\}\,dP_0\\
&\quad= \sum_{k=1}^{K} \int \left\{\frac{\delta_k}{\hat F_{nk}(t)} - \frac{1-\delta_+}{1-\hat F_{n+}(t)}\right\} \bigl\{d_{j,\hat F_n,t_0}(t,k) - \bar d_{j,\hat F_n,t_0}(t,k)\bigr\}\,dP_0(t,\delta)\\
&\quad= \sum_{k=1}^{K} \int \left\{\frac{F_{0k}(t)}{\hat F_{nk}(t)} - \frac{1-F_{0+}(t)}{1-\hat F_{n+}(t)}\right\} \bigl\{d_{j,\hat F_n,t_0}(t,k) - \bar d_{j,\hat F_n,t_0}(t,k)\bigr\}\,dG(t)\\
&\quad= \int_{[0,t_0]} \left\{\frac{F_{0j}(t)}{\hat F_{nj}(t)} - \frac{1-F_{0+}(t)}{1-\hat F_{n+}(t)}\right\} \hat F_{nj}(t)(1 - \hat F_{nj}(t))\bigl(\xi(t) - \bar\xi^{(j)}(t)\bigr)\,dG(t)\\
&\qquad- \sum_{k \ne j} \int_{[0,t_0]} \left\{\frac{F_{0k}(t)}{\hat F_{nk}(t)} - \frac{1-F_{0+}(t)}{1-\hat F_{n+}(t)}\right\} \hat F_{nk}(t)\bigl\{\hat F_{nj}(t)\xi(t) - \bar F_{nj}^{(k)}(t)\bar\xi^{(k)}(t)\bigr\}\,dG(t).
\end{align*}
Similarly, for $j = K+1$, we write
\[
\int \bigl\{\theta_{K+1,\hat F_n,t_0} - \bar\theta_{K+1,\hat F_n,t_0}\bigr\}\,dP_0
= -\sum_{k=1}^{K} \int_{[0,t_0]} \left\{\frac{F_{0k}(t)}{\hat F_{nk}(t)} - \frac{1-F_{0+}(t)}{1-\hat F_{n+}(t)}\right\} \hat F_{nk}(t)\bigl\{\hat F_{n,K+1}(t)\xi(t) - \bar F_{n,K+1}^{(k)}(t)\bar\xi^{(k)}(t)\bigr\}\,dG(t).
\]
Note that all terms in these expressions contain the common factor
\[
\left\{\frac{F_{0k}(t)}{\hat F_{nk}(t)} - \frac{1-F_{0+}(t)}{1-\hat F_{n+}(t)}\right\} \hat F_{nk}(t).
\]
For $k = 1,\dots,K$, we rewrite the absolute value of this expression as
\[
\left|\frac{F_{0k}(t)(1 - \hat F_{n+}(t)) - \hat F_{nk}(t)(1 - F_{0+}(t))}{1 - \hat F_{n+}(t)}\right|
= \left|\frac{F_{0k}(t)(F_{0+}(t) - \hat F_{n+}(t)) + (1 - F_{0+}(t))(F_{0k}(t) - \hat F_{nk}(t))}{1 - \hat F_{n+}(t)}\right|. \tag{7.19}
\]
Due to the assumption $F_{0+}(t_0) < 1$ and consistency of $\hat F_{n+}$ in a neighborhood of $t_0$ (Proposition 4.15), we can assume at the cost of a small probability that $1 - \hat F_{n+}(t_0) > (1 - F_{0+}(t_0))/2 > 0$ for $n$ sufficiently large. Hence, with large probability, (7.19) is bounded by
\[
C_1\bigl\{\bigl|F_{0+}(t) - \hat F_{n+}(t)\bigr| + \bigl|F_{0k}(t) - \hat F_{nk}(t)\bigr|\bigr\},
\]
for some constant $C_1 > 0$.
We now consider the remaining parts of $\int \{\theta_{j,\hat F_n,t_0} - \bar\theta_{j,\hat F_n,t_0}\}\,dP_0$. First, by Lemma 7.11 we have $|\xi(t) - \bar\xi^{(j)}(t)| \le C_2|\hat F_{nj}(t) - F_{0j}(t)|$. Furthermore, using the same lemma we obtain
\begin{align*}
\bigl|\hat F_{nj}(t)\xi(t) - \bar F_{nj}^{(k)}(t)\bar\xi^{(k)}(t)\bigr|
&= \bigl|(\hat F_{nj}(t) - \bar F_{nj}^{(k)}(t))\xi(t) + \bar F_{nj}^{(k)}(t)(\xi(t) - \bar\xi^{(k)}(t))\bigr|\\
&\le C_3\bigl\{\bigl|\hat F_{nk}(t) - F_{0k}(t)\bigr| + \bigl|\hat F_{nj}(t) - F_{0j}(t)\bigr|\bigr\},
\end{align*}
for $j = 1,\dots,K+1$ and some constant $C_3 > 0$. The result now follows by combining these expressions, and using the Cauchy-Schwarz inequality and the $L_2(G)$ rate of convergence given in (5.7). $\Box$
Lemma 7.13 Under the conditions of Theorem 7.5, we have for all $j = 1,\dots,K+1$,
\[
\left|\int \bigl\{\bar\theta_{j,\hat F_n,t_0} - \theta_{j,F_0,t_0}\bigr\}\,d(P_n - P_0)\right| = O_p(n^{-2/3}).
\]
Proof: We use the modulus of continuity result of Van de Geer (2000, Lemma 5.13, page 79). For $j \in \{1,\dots,K+1\}$ we define
\begin{align*}
h_{jF} &= \bar\theta_{j,F,t_0} - \theta_{j,F_0,t_0}, \qquad F \in \mathcal{F}_K,\\
\mathcal{H}_j &= \bigl\{h_{jF} : F \in \mathcal{F}_K,\ F_+(t_0) < 1 - a/2\bigr\},
\end{align*}
where $a = 1 - F_{0+}(t_0) > 0$. Note that
\begin{align*}
\bar\theta_{j,F,t_0}(t,\delta) &= \sum_{k=1}^{K} \left\{\frac{\delta_k}{F_k(t)} - \frac{1-\delta_+}{1-F_+(t)}\right\} \bar d_{j,F,t_0}(t,k)\\
&= \left\{\delta_j(1 - F_j(t)) - \frac{(1-\delta_+)F_j(t)(1 - F_j(t))}{1 - F_+(t)}\right\} \bar\xi^{(j)}(t)1_{[0,t_0]}(t)\\
&\qquad- \sum_{k \ne j} \left\{\delta_k \bar F_j^{(k)}(t) - \frac{(1-\delta_+)\bar F_j^{(k)}(t)F_k(t)}{1 - F_+(t)}\right\} \bar\xi^{(k)}(t)1_{[0,t_0]}(t).
\end{align*}
The class $\mathcal{H}_j$ is uniformly bounded, since $F_+(t_0) < 1 - a/2$, $F_{0+}(t_0) = 1 - a$, and $\xi$ is continuous and hence bounded on $[0,t_0]$. This implies that we can rescale the functions in $\mathcal{H}_j$ so that Van de Geer's condition (5.39) is satisfied.
In order to satisfy Van de Geer's condition (5.40), we must show that the $\gamma$-entropy with bracketing of $\mathcal{H}_j$ is bounded by $A\gamma^{-1}$ for some constant $A > 0$. To see this, note that the function $\xi$ is fixed, but that the adapted versions $\bar\xi^{(k)}$ depend on $F_k$. Since $\xi$ is Lipschitz continuous, its restriction to $[0,t_0]$ is of bounded variation. Since the adaptations $\bar\xi^{(k)}$ are `more constant' versions of $\xi$, they are also of bounded variation. Furthermore, the functions $F_j$, $1 - F_j$, $\bar F_j^{(k)}$, $(1 - F_+)^{-1}$, for $j = 1,\dots,K+1$ and $k = 1,\dots,K$, are bounded and monotone, and hence of bounded variation. Here we again use that $F_+(t_0) < 1 - a/2$. Hence, our class of functions consists of sums and products of functions of bounded variation. It then follows from Propositions 5.23 and 5.24 that the $\gamma$-bracketing entropy of $\mathcal{H}_j$ is bounded by $A'\gamma^{-1}$ for some $A' > 0$.
Next, we define $\mathcal{H}_j(s) = \{h_{jF} \in \mathcal{H}_j : \|h_{jF}\|_{L_2(P_0)} \le s\}$. After some algebra, we obtain
\begin{align*}
\bar\theta_{j,\hat F_n,t_0} - \theta_{j,F_0,t_0} &= \delta_j\bigl[(1 - \hat F_{nj}(t))\bar\xi^{(j)}(t) - (1 - F_{0j}(t))\xi(t)\bigr]\\
&\quad+ (1-\delta_+)\left[\frac{F_{0j}(t)(1 - F_{0j}(t))\xi(t)}{1 - F_{0+}(t)} - \frac{\hat F_{nj}(t)(1 - \hat F_{nj}(t))\bar\xi^{(j)}(t)}{1 - \hat F_{n+}(t)}\right]\\
&\quad+ \sum_{k \ne j} \delta_k\bigl[F_{0j}(t)\xi(t) - \bar F_{nj}^{(k)}(t)\bar\xi^{(k)}(t)\bigr]\\
&\quad+ \sum_{k \ne j} (1-\delta_+)\left[\frac{\bar F_{nj}^{(k)}(t)\hat F_{nk}(t)\bar\xi^{(k)}(t)}{1 - \hat F_{n+}(t)} - \frac{F_{0j}(t)F_{0k}(t)\xi(t)}{1 - F_{0+}(t)}\right],
\end{align*}
on $[0,t_0]$. Using the $L_2(G)$ rate of convergence of the MLE and Lemma 7.11, we find that the four terms on the right side have $L_2(P_0)$-norms of order $O_p(n^{-1/3})$. This implies that we can find some $C > 0$ such that $h_{j\hat F_n} \in \mathcal{H}_j(Cn^{-1/3})$ with large probability. We now apply Van de Geer (2000, Lemma 5.13, page 79, equation (5.42)) with $\alpha = 1$ and $\beta = 0$. This yields
\[
\sup_{h_{jF} \in \mathcal{H}_j(Cn^{-1/3})} \left|\int h_{jF}\,d(P_0 - P_n)\right| = O_p(n^{-2/3}),
\]
which completes the proof. $\Box$
We are now ready to prove Lemmas 7.14 and 7.15.

Lemma 7.14 Under the conditions of Theorem 7.5, we have
\[
\int_{[0,t_0]} \bigl\{F_{0,K+1}(t) - \hat F_{n,K+1}(t)\bigr\}\xi(t)\,dG(t) \le \int \theta_{K+1,F_0,t_0}\,d(P_0 - P_n) + O_p(n^{-2/3}).
\]
Proof: By Lemma 7.7, we have
\begin{align*}
\int_{[0,t_0]} \bigl\{F_{0,K+1}(t) - \hat F_{n,K+1}(t)\bigr\}\xi(t)\,dG(t)
&= \int \theta_{K+1,\hat F_n,t_0}\,dP_0\\
&= \int \bar\theta_{K+1,\hat F_n,t_0}\,dP_0 + \int \bigl\{\theta_{K+1,\hat F_n,t_0} - \bar\theta_{K+1,\hat F_n,t_0}\bigr\}\,dP_0.
\end{align*}
In Lemma 7.9 we showed that $\int \bar\theta_{K+1,\hat F_n,t_0}\,dP_n \le 0$. Hence,
\[
\int_{[0,t_0]} \bigl\{F_{0,K+1}(t) - \hat F_{n,K+1}(t)\bigr\}\xi(t)\,dG(t)
\le \int \bar\theta_{K+1,\hat F_n,t_0}\,d(P_0 - P_n) + \int \bigl\{\theta_{K+1,\hat F_n,t_0} - \bar\theta_{K+1,\hat F_n,t_0}\bigr\}\,dP_0.
\]
The right side of this expression can be written as
\begin{align*}
&\int \theta_{K+1,F_0,t_0}\,d(P_0 - P_n) + \int \bigl\{\bar\theta_{K+1,\hat F_n,t_0} - \theta_{K+1,F_0,t_0}\bigr\}\,d(P_0 - P_n)\\
&\qquad+ \int \bigl\{\theta_{K+1,\hat F_n,t_0} - \bar\theta_{K+1,\hat F_n,t_0}\bigr\}\,dP_0
= \int \theta_{K+1,F_0,t_0}\,d(P_0 - P_n) + O_p(n^{-2/3}),
\end{align*}
where the last equality follows from Lemmas 7.12 and 7.13. $\Box$
Lemma 7.15 Under the conditions of Theorem 7.5, we have for $k = 1,\dots,K$,
\[
\int_{[0,t_0]} \bigl\{F_{0k}(t) - \hat F_{nk}(t)\bigr\}\xi(t)\,dG(t) \le \max_{\tau \in \{\sigma_{nk},\tau_{nk}\}} \int \theta_{k,F_0,\tau-}\,d(P_0 - P_n) + O_p(n^{-2/3}),
\]
where $\sigma_{nk}$ is the last jump point of $\hat F_{nk}$ at or before $t_0$, and $\tau_{nk}$ is the first jump point of $\hat F_{nk}$ strictly after $t_0$.

Proof: Let $k \in \{1,\dots,K\}$ and $\tau \in \{\sigma_{nk},\tau_{nk}\}$. We use Lemmas 7.10, 7.12 and 7.13 to show that
\[
\int_{[0,\tau)} \bigl\{F_{0k}(t) - \hat F_{nk}(t)\bigr\}\xi(t)\,dG(t) \le \int \theta_{k,F_0,\tau-}\,d(P_0 - P_n) + O_p(n^{-2/3}),
\]
analogously to the proof of Lemma 7.14. Next, we relax the requirement that the upper endpoint of the integration interval is a jump point of $\hat F_{nk}$. We define
\[
\psi_{nk}(t) = \int_{[0,t)} \bigl\{F_{0k}(x) - \hat F_{nk}(x)\bigr\}\xi(x)\,dG(x).
\]
Suppose there is a point $s \in [\sigma_{nk},\tau_{nk})$ so that $F_{0k}(s) = \hat F_{nk}(s)$. Then $\psi_{nk}(t)$ is decreasing on the interval $[\sigma_{nk},s)$ and increasing on the interval $[s,\tau_{nk})$. Hence, $\psi_{nk}(t) \le \max\{\psi_{nk}(\sigma_{nk}), \psi_{nk}(\tau_{nk})\}$. If there is no such $s$, then either $\hat F_{nk}(u) < F_{0k}(\sigma_{nk})$ for all $u \in [\sigma_{nk},\tau_{nk})$, or $\hat F_{nk}(u) > F_{0k}(\tau_{nk}-)$ for all $u \in [\sigma_{nk},\tau_{nk})$. In the former case, $\psi_{nk}(t)$ is increasing for all $t \in [\sigma_{nk},\tau_{nk})$, so that $\psi_{nk}(t) \le \psi_{nk}(\tau_{nk})$. In the latter case, $\psi_{nk}(t)$ is decreasing for all $t \in [\sigma_{nk},\tau_{nk})$, so that $\psi_{nk}(t) \le \psi_{nk}(\sigma_{nk})$. $\Box$
Remark 7.16 We now briefly discuss what happens if we do not know $G$. Note that our results imply that
\[
\sqrt{n}\int_{[0,t_0]} \bigl\{\hat F_{nk}(t) - F_{0k}(t)\bigr\}\,dG_n(t) \to_d N\Bigl(0, \int_{[0,t_0]} F_{0k}(t)(1 - F_{0k}(t))\,dG(t)\Bigr),
\]
since
\[
\sqrt{n}\int_{[0,t_0]} \bigl\{\hat F_{nk}(t) - F_{0k}(t)\bigr\}\,dG_n(t)
= \sqrt{n}\int_{[0,t_0]} \bigl\{\hat F_{nk}(t) - F_{0k}(t)\bigr\}\,dG(t) + \sqrt{n}\int_{[0,t_0]} \bigl\{\hat F_{nk}(t) - F_{0k}(t)\bigr\}\,d(G_n - G)(t),
\]
and the last term on the right hand side is of order $O_p(n^{-1/6})$ by a modulus of continuity result (Van de Geer (2000, Lemma 5.13, page 79)). Hence, in this sense we do not lose anything by not knowing the distribution $G$.
Furthermore, note that
\[
\sqrt{n}\left(\int_{[0,t_0]} \hat F_{nk}(t)\,dG_n(t) - \int_{[0,t_0]} F_{0k}(t)\,dG(t)\right)
= \sqrt{n}\int_{[0,t_0]} \bigl\{\hat F_{nk}(t) - F_{0k}(t)\bigr\}\,dG_n(t) + \sqrt{n}\int_{[0,t_0]} F_{0k}(t)\,d(G_n - G)(t).
\]
The first term on the right side converges to a normal distribution with variance $\int_{[0,t_0]} F_{0k}(t)(1 - F_{0k}(t))\,dG(t)$, by the argument given above. However, the second term on the right side also gives a contribution. Thus, considering $\int_{[0,t_0]} \hat F_{nk}(t)\,dG_n(t)$ as an estimator for $\int_{[0,t_0]} F_{0k}(t)\,dG(t)$, not knowing $G$ does result in a bigger asymptotic variance.
Remark 7.17 It may be of interest to consider joint convergence of smooth functionals of several components. For example, we can use (7.12) to show that the limit of the vector
\[
\left(\sqrt{n}\int_{[0,t_0]} \bigl\{\hat F_{n1}(t) - F_{01}(t)\bigr\}\,dG(t),\ \dots,\ \sqrt{n}\int_{[0,t_0]} \bigl\{\hat F_{nK}(t) - F_{0K}(t)\bigr\}\,dG(t)\right)
\]
is a multivariate normal distribution $N_K(0,\Sigma)$, where
\[
\Sigma_{jk} = \int_{[0,t_0]} F_{0j}(t)\bigl(1\{j = k\} - F_{0k}(t)\bigr)\,dG(t), \qquad j,k \in \{1,\dots,K\}.
\]
In turn, this result can be used to study smooth functionals that consist of a linear combination of several components.
Chapter 8
EXAMPLES
In this chapter we apply our methods to real and simulated data. First, in Section
8.1, we reanalyze a data set on the menopausal status of women, and verify that our
results agree with those of Jewell, Van der Laan and Henneman (2003) and Jewell and
Kalbfleisch (2004). Next, in Section 8.2, we compare the MLE and several variants of
the naive estimator in a simulation study. We consider both pointwise estimation and
estimation of smooth functionals. For pointwise estimation, we find that the MLE is
superior to the naive estimator in terms of mean squared error, both for small and
large samples. For the estimation of smooth functionals, we find that the MLE and
the naive estimator behave similarly, and in agreement with the theoretical results in
Chapter 7.
8.1 Menopause data
MacMahon and Worcestor (1966) and Krailo and Pike (1983) studied the menopausal
status of women participating in Cycle I of the Health Examination Survey of the National Center for Health Statistics. This study consisted of a nationwide probability sample of persons between age 18 and 79 from the United States civilian, noninstitutional population. The participants were asked to complete a self-administered questionnaire. The sample contained 4211 females, of whom 3581 completed the questionnaire. The question regarding menopausal status is given in Figure 8.1. MacMahon and Worcestor (1966) found that there was marked terminal digit clustering in the response to part c of this question, especially for women who had a natural menopause. Therefore, Krailo and Pike (1983) decided to only analyze the responses to parts b and d. These parts provide current status data with competing risks, where $X$ is the age at menopause, $Y$ is the cause of menopause, and $T$ is the age at the time of the survey.

Question 74. WOMEN ONLY

a. Age when periods started ______
b. Have periods stopped? (not counting pregnancy) Yes / No

IF YES
c. Age when periods stopped _____
d. Was this due to an operation? Yes / No

IF NO
e. Have they begun to stop? Yes / No
f. Date of last period _____

Figure 8.1: Question 74 of the Health Examination Survey (taken from MacMahon and Worcestor (1966)).

Krailo and Pike (1983) performed a parametric analysis. Nonparametric analyses of the same data have been performed by Jewell, Van der Laan and Henneman (2003) and Jewell and Kalbfleisch (2004). In these three analyses, attention was restricted to the age range 25-59 years. Furthermore, seven women who were less than 35 years of age and reported having had a natural menopause were excluded as being an error or abnormal. The remaining data set contained information on 2423 women.

In order to verify our methods, we reanalyzed these data and computed the MLE and the naive estimator. The results are given in Figure 8.2. Note that the MLE and the naive estimators are very similar, and that they are indistinguishable for the sub-distribution function for operative menopause. Furthermore, note that operative menopause seems to occur at a constant rate between age 30 and 55, while the rate of natural menopause peaks around age 50-55. Our results agree with those of Jewell, Van der Laan and Henneman (2003) and Jewell and Kalbfleisch (2004).
Figure 8.2: The MLE and the naive estimator (NE) for the sub-distribution functionsfor the menopause data analyzed by Krailo and Pike (1983). The MLE and naiveestimator for operative menopause are indistinguishable.
8.2 Simulations

In order to compare the MLE and the naive estimator, we simulated data from the following model with $K = 5$ competing risks:
\[
\begin{aligned}
P(T \le t) &= 1 - \exp(-5t/2),\\
P(Y = k) &= k/15, \qquad k = 1,\dots,5,\\
P(X \le t \mid Y = k) &= 1 - \exp(-kt), \qquad k = 1,\dots,5,
\end{aligned} \tag{8.1}
\]
with $T$ independent of $(X,Y)$. The true sub-distribution functions in this model are
\[
F_{0k}(t) = \frac{k}{15}\bigl(1 - \exp(-kt)\bigr), \qquad k = 1,\dots,5,
\]
and are shown in Figure 8.3 on page 218.
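Data from model (8.1) can be generated directly; a minimal sketch, assuming `numpy` (the function name is hypothetical):

```python
import numpy as np

def simulate_model_81(n, rng=None):
    """Simulate current status data with K = 5 competing risks from model (8.1)."""
    K = 5
    rng = np.random.default_rng(rng)
    T = rng.exponential(scale=2.0 / 5.0, size=n)      # P(T <= t) = 1 - exp(-5t/2)
    Y = rng.choice(np.arange(1, K + 1), size=n,
                   p=np.arange(1, K + 1) / 15.0)      # P(Y = k) = k/15
    X = rng.exponential(scale=1.0 / Y)                # X | Y = k has rate k
    delta = np.zeros((n, K + 1))
    obs = X <= T                                      # failure has occurred by time T
    delta[np.nonzero(obs)[0], Y[obs] - 1] = 1.0       # delta_k = 1{X <= T, Y = k}
    delta[~obs, K] = 1.0                              # delta_{K+1} = 1{X > T}
    return T, delta
```

Only $(T, \Delta)$ is returned, since $(X, Y)$ is never observed under current status censoring.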
We simulated 1000 data sets of sizes 100, 1000 and 10000. For each data set we computed the MLEs and the naive estimators of the five sub-distribution functions. The MLE was computed using sequential quadratic programming (SQP) and the support reduction algorithm, as described in Section 3.1. As convergence criterion we used the conditions in (2.37) of Proposition 2.25, with a tolerance of 10^{-10}. The naive estimators were computed with a convex minorant algorithm.
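The data-generating mechanism in (8.1) is simple to reproduce. The following Python sketch (our own illustration; function and variable names are not from the thesis) draws current status competing risks observations from the model:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n):
    """Draw n current status competing risks observations from model (8.1):
    T ~ Exp(rate 5/2) independent of (X, Y), P(Y = k) = k/15, and
    X | Y = k ~ Exp(rate k), for k = 1, ..., 5."""
    t = rng.exponential(scale=2 / 5, size=n)        # observation times T (scale = 1/rate)
    y = rng.choice(np.arange(1, 6), size=n, p=np.arange(1, 6) / 15)
    x = rng.exponential(scale=1.0 / y)              # failure times X, rate k given Y = k
    delta = np.zeros((n, 5), dtype=int)             # current status indicators
    failed = x <= t                                 # failure occurred before observation time
    delta[np.nonzero(failed)[0], y[failed] - 1] = 1
    return t, delta                                 # delta[i, k-1] = 1{X_i <= T_i, Y_i = k}

t, delta = simulate(1000)
```

Only (T, Δ1, . . . , Δ5) is returned, mirroring the fact that X and Y themselves are never observed directly.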
8.2.1 Pointwise estimation
We now compare the behavior of the estimators for pointwise estimation. We do this
by computing the bias, variance and mean squared error of the estimators on the
following grid:
0.0, 0.01, 0.02, . . . , 3.0. (8.2)
Recall that the estimators are not uniquely defined for all t ∈ R+, due to representational non-uniqueness (see Section 2.2). Thus, in order to compare the estimators on a grid, we use the convention that the naive estimators Fnk are right-continuous and piecewise constant, with jumps only at points in {T1, . . . , Tn}. Similarly, we assume that the MLEs Fnk are right-continuous and piecewise constant with jumps only at points in T_k (see Definition 2.22). These conventions are equivalent to assigning all probability mass to the right endpoints of the maximal intersections.
Jewell, Van der Laan and Henneman (2003) stated that the performance of the
naive estimators can be improved by suitably modifying them when their sum exceeds
one. To investigate this claim, we computed two variants of the naive estimator: a scaled naive estimator Fnk^(s), and a truncated naive estimator Fnk^(t). The scaled naive estimator is defined as follows:

Fnk^(s)(t) = Fnk(t)                  if Fn+(T(n)) ≤ 1,
Fnk^(s)(t) = Fnk(t)/Fn+(T(n))        if Fn+(T(n)) > 1,

for k = 1, . . . , 5. To define the truncated naive estimator, we let

t* = min({t ∈ (0, T(n)] : Fn+(t) > 1} ∪ {T(n) + 1}).

If t* = T(n) + 1, then the naive estimator does not violate the constraint Fn+(T(n)) ≤ 1, and hence we let the truncated naive estimator be equal to the naive estimator. If t* ≤ T(n), then the constraint Fn+(T(n)) ≤ 1 is violated, and we define

Fnk^(t)(t) = Fnk(t)                  if t < t*,
Fnk^(t)(t) = Fnk(t*−) + αnk          if t ≥ t*,

for k = 1, . . . , 5, where

αnk = [Fnk(t*) − Fnk(t*−)] / [Fn+(t*) − Fn+(t*−)] · (1 − Fn+(t*−)),   k = 1, . . . , 5.
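The two repairs can be sketched in code as follows (an illustrative Python reconstruction; the grid representation and all names are our own). Each naive estimator Fnk is represented by its values at the common sorted jump points:

```python
import numpy as np

def scaled_and_truncated(F):
    """F: array of shape (K, m), F[k, j] = value of the k-th naive estimator at
    the j-th jump point (an increasing grid ending at T_(n)).  Returns the
    scaled and the truncated naive estimators on the same grid."""
    Fsum = F.sum(axis=0)                       # F_{n+} on the grid
    # Scaled variant: divide everywhere by F_{n+}(T_(n)) if the total mass exceeds 1.
    Fs = F / Fsum[-1] if Fsum[-1] > 1 else F.copy()
    # Truncated variant: from the first grid point t* where F_{n+} > 1, freeze each
    # component at F_nk(t*-) + alpha_nk, dividing the remaining mass 1 - F_{n+}(t*-)
    # over the components in proportion to their jumps at t*.
    Ft = F.copy()
    viol = np.nonzero(Fsum > 1)[0]
    if viol.size > 0:
        j = viol[0]                                             # index of t*
        Fprev = F[:, j - 1] if j > 0 else np.zeros(F.shape[0])  # F_nk(t*-)
        jumps = F[:, j] - Fprev                                 # jumps at t*
        alpha = jumps / jumps.sum() * (1.0 - Fprev.sum())       # alpha_nk
        Ft[:, j:] = (Fprev + alpha)[:, None]
    return Fs, Ft
```

For example, if two components sum to 1.2 at the last grid point, the truncated variant redistributes only the final jumps so that the total is exactly one, while the scaled variant shrinks both curves everywhere by the factor 1.2.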
In order to limit the number of plots, we only show the results for the estimation of F01, F03 and F05. In the legends of the plots we use the following abbreviations:
naive estimator (NE), scaled naive estimator (SNE), and truncated naive estimator
(TNE).
Figure 8.4 shows the estimators in one simulation for each sample size. Note that
the MLE is close to the true underlying distribution. Furthermore, note that the MLE
and the naive estimator tend to be similar for smaller values of t, while they tend to
diverge for larger values of t, where the naive estimator becomes too large and violates
the constraint Fn+(t) ≤ 1. The truncated naive estimator repairs such a violation by
only changing the estimator at points for which the constraint is violated, while the
scaled naive estimator changes the values at all points. As a result, the scaled naive
estimator tends to yield a significant underestimate for smaller values of t.
Figure 8.5 shows the sample bias of the estimators, scaled by a factor n^{1/3}. We
see that the bias of the MLE is smallest in absolute value. Furthermore, note that
the bias of the scaled naive estimator is largely negative, for the reason discussed in
the previous paragraph.
Figure 8.6 shows the sample variance of the estimators, scaled by a factor n^{2/3}.
We see that the scaled naive estimator has the smallest variance for small values of
t. This can be explained by the fact that the estimator is scaled down. Among the
remaining estimators, the MLE tends to have the smallest variance.
Figure 8.7 shows the sample mean squared error of the estimators, scaled by a factor n^{2/3}. We see that the mean squared error of the MLE is in general smaller
than that of the naive estimators. Considering the three naive estimators, we see
that the truncated naive estimator performs best and is significantly better than the
naive estimator. On the other hand, we see that the mean squared error of the scaled
naive estimator tends to be worse than that of the naive estimator. The latter can
be explained by the large negative bias of the scaled naive estimator.
Figure 8.8 shows the relative efficiency of the estimators, in the form of the mean
squared error of the MLE divided by the mean squared error of each estimator. We
clearly see that the MLE is most efficient. The only exception is the upper left plot
for k = 1 and n = 100, where the scaled naive estimator is more efficient for smaller
values of t. This can be viewed as an anomaly due to the small values of F01. In
all other cases the relative efficiency of the scaled naive estimator quickly drops to
almost zero. The relative efficiency of the naive estimator also decreases to a number
close to zero, but its decrease is more gradual in t. The truncated naive estimator
seems to stabilize at a relative efficiency of about 75%.
Considering Figures 8.5 to 8.8, we see that the curves of the naive estimator and
the truncated naive estimator coincide until a certain point, and then start to diverge.
This point is the smallest time s for which Fn+(s) > 1 in one of the 1000 simulated
data sets. The value of this point increases as the sample size increases, due to
consistency of the naive estimator.
To investigate the behavior of the estimators at larger values of t, we computed
the sample bias, variance and mean squared error at the point t = 10. The results
are given in Table 8.1. We see that the bias, variance and mean squared error of the
naive estimator do not decrease with n. The bias can be as large as 0.3, even for
sample size 10000. On the other hand, the MLE still behaves well.
8.2.2 Smooth functionals
We now consider estimation of the following smooth functional:

∫[0,t0] F0k(t) dG(t),

for t0 = 2 and t0 = 10. In Chapter 7 we proved that

√n ∫[0,t0] {Fnk(t) − F0k(t)} dG(t) →d N(0, ∫[0,t0] F0k(t){1 − F0k(t)} dG(t)),   (8.3)
where Fnk can be either the naive estimator or the MLE.
We computed the left hand side of (8.3) for the MLE and the naive estimator, for the 1000 simulated data sets for each sample size. Note that the integral ∫[0,t0] Fnk(t) dG(t) can be computed easily by using partial integration:

∫[0,t0] Fnk(t) dG(t) = ∫[0,t0] {G(t0) − G(t)} dFnk(t).
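Since Fnk is piecewise constant, the right hand side reduces to a finite sum over the jump points of Fnk. A minimal sketch, assuming G(t) = 1 − exp(−5t/2) as in model (8.1) (the function name is our own):

```python
import numpy as np

def smooth_functional(jump_times, jump_masses, t0):
    """Evaluate the integral of F_nk over [0, t0] with respect to G via the
    partial-integration identity: sum of (G(t0) - G(t)) over the jumps of F_nk."""
    G = lambda t: 1.0 - np.exp(-2.5 * t)        # observation time distribution in (8.1)
    inside = jump_times <= t0
    return float(np.sum((G(t0) - G(jump_times[inside])) * jump_masses[inside]))

# Sanity check: a unit point mass at 0 recovers G(t0) - G(0) = G(t0).
val = smooth_functional(np.array([0.0]), np.array([1.0]), 2.0)
```

Jumps beyond t0 contribute nothing, which already hints at why tail misbehavior of the naive estimator is harmless for this functional.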
The results for t0 = 2 are given in Figures 8.9 and 8.10. The results for t0 = 10 are
given in Figures 8.11 and 8.12. Note that the MLE and the naive estimator behave
similarly. Furthermore, their behavior agrees with the theoretical limit (8.3), depicted
by a black line in the figures.
Given the fact that the naive estimator performs very badly at pointwise estimation at t = 10, it may come as a surprise that the naive estimator behaves well for estimation of the smooth functional when t0 = 10. However, the smooth functionals are integrated with respect to G, and the density g(t) = (5/2) exp(−5t/2) is very small for large t, so that any effects in the tails of the distributions are suppressed.
Figure 8.3: The true underlying sub-distribution functions F0k, k = 1, . . . , 5, in model (8.1), plotted against t ∈ [0, 3].
[Figure 8.4 here: nine panels (k = 1, 3, 5 by n = 100, 1000, 10000) showing the MLE, NE, SNE and TNE against t ∈ [0, 3].]

Figure 8.4: The estimators for F0k, k = 1, 3, 5, for one simulation for each sample size. The solid black lines denote the true underlying sub-distribution functions.
Figure 8.5: Sample bias of the estimators for F0k, k = 1, 3, 5, scaled by a factor n^{1/3}. The results are computed over 1000 simulations for each sample size n, on the grid defined in (8.2).
Figure 8.6: Sample variance of the estimators for F0k, k = 1, 3, 5, scaled by a factor n^{2/3}. The results are computed over 1000 simulations for each sample size n, on the grid defined in (8.2).
Figure 8.7: Sample mean squared error of the estimators for F0k, k = 1, 3, 5, scaled by a factor n^{2/3}. The results are computed over 1000 simulations for each sample size n, on the grid defined in (8.2).
Figure 8.8: Sample relative efficiency of the estimators for F0k, k = 1, 3, 5, with respect to the MLE. The sample relative efficiency for each estimator is computed using the formula (MSE MLE)/(MSE estimator), where the sample mean squared errors (MSEs) were computed as in Figure 8.7.
Table 8.1: Sample bias, variance and mean squared error for estimating F0k(10) for k = 1, 3, 5. The results are computed over 1000 simulations for each sample size n.

Bias
                  k = 1                                k = 3                                k = 5
n        MLE      NE      TNE      SNE       MLE      NE      TNE      SNE       MLE     NE      TNE     SNE
100     -1.5e-2  1.1e-1  -2.2e-2   1.0e-2   -2.3e-3  2.8e-1  -1.1e-3   1.9e-2   1.2e-2  2.9e-1  1.9e-2  -4.2e-2
1000    -9.0e-3  1.6e-1  -1.7e-2   2.8e-2    1.0e-3  2.7e-1   2.7e-3   7.4e-3   8.1e-3  3.0e-1  1.4e-2  -5.1e-2
10000   -7.6e-3  1.6e-1  -1.2e-2   2.7e-2    4.7e-4  2.7e-1   2.0e-3   6.1e-3   5.6e-3  3.2e-1  8.0e-3  -4.6e-2

Variance
                  k = 1                                k = 3                                k = 5
n        MLE      NE      TNE      SNE       MLE      NE      TNE      SNE       MLE     NE      TNE     SNE
100      1.3e-3  6.6e-2   1.3e-3   1.1e-2    3.0e-3  9.8e-2   4.1e-3   1.9e-2   4.5e-3  7.1e-2  5.8e-3   1.9e-2
1000     3.2e-4  6.6e-2   2.9e-4   1.0e-2    6.0e-4  8.1e-2   7.9e-4   1.5e-2   8.0e-4  6.8e-2  9.3e-4   1.6e-2
10000    7.5e-5  6.0e-2   6.5e-5   9.3e-3    1.2e-4  7.9e-2   1.4e-4   1.5e-2   1.5e-4  7.0e-2  1.7e-4   1.6e-2

MSE
                  k = 1                                k = 3                                k = 5
n        MLE      NE      TNE      SNE       MLE      NE      TNE      SNE       MLE     NE      TNE     SNE
100      1.5e-3  7.8e-2   1.8e-3   1.1e-2    3.0e-3  1.8e-1   4.1e-3   2.0e-2   4.7e-3  1.6e-1  6.2e-3   2.0e-2
1000     4.0e-4  9.1e-2   5.6e-4   1.1e-2    6.0e-4  1.6e-1   8.0e-4   1.5e-2   8.7e-4  1.6e-1  1.1e-3   1.9e-2
10000    1.3e-4  8.4e-2   2.2e-4   1.0e-2    1.2e-4  1.5e-1   1.4e-4   1.5e-2   1.8e-4  1.7e-1  2.3e-4   1.8e-2
Figure 8.9: Smooth functionals of the MLE for t0 = 2. The histograms and density estimates (green) are based on 1000 simulations for each sample size from √n ∫[0,2] {Fnk(u) − F0k(u)} dG(u), k = 1, 3, 5. The density of the theoretical limiting distribution is given in black.
Figure 8.10: Smooth functionals of the naive estimator for t0 = 2. The histograms and density estimates (red) are based on 1000 simulations for each sample size from √n ∫[0,2] {Fnk(u) − F0k(u)} dG(u), k = 1, 3, 5. The density of the theoretical limiting distribution is given in black.
Figure 8.11: Smooth functionals of the MLE for t0 = 10. The histograms and density estimates (green) are based on 1000 simulations for each sample size from √n ∫[0,10] {Fnk(u) − F0k(u)} dG(u), k = 1, 3, 5. The density of the theoretical limiting distribution is given in black.
Figure 8.12: Smooth functionals of the naive estimator for t0 = 10. The histograms and density estimates (red) are based on 1000 simulations for each sample size from √n ∫[0,10] {Fnk(u) − F0k(u)} dG(u), k = 1, 3, 5. The density of the theoretical limiting distribution is given in black.
Chapter 9
AN EXTENSION:
INTERVAL CENSORED CONTINUOUS MARK DATA
In the preceding chapters we considered the situation in which X ∈ R+ is subject to current status censoring (or interval censoring case 1, i.e., one observation time per subject), and Y ∈ {1, . . . , K} is a discrete variable. In this chapter we study an extension of this model in the following two directions. First, we let the survival time X ∈ R+ be subject to case k interval censoring, meaning that there are exactly k observation times for each subject. Second, we let Y ∈ R be a continuous random variable. The variable Y is also called a mark variable, so that this model is sometimes referred to as the interval censored continuous mark model.
Interval censored continuous mark data arise in various situations. For example,
X can be the time of onset of a disease and Y its incubation period. Alternatively, X
can be the time of death and Y a measure of utility or cost, such as quality adjusted
lifetime or lifetime medical costs (Huang and Louis (1998)). A third example is the
HIV vaccine trial data analyzed in Hudgens, Maathuis and Gilbert (2006), where X
is the time of HIV infection and Y is the viral distance between the infecting HIV
virus and the virus in the vaccine.
The work in this chapter is largely taken from Maathuis and Wellner (2006),
and was motivated by Hudgens, Maathuis and Gilbert (2006). Our main focus is
on asymptotic properties of the MLE for interval censored continuous mark data,
and in particular on consistency. In Section 9.1 we use the analogy with univariate
right censored data to obtain an explicit formula for the MLE. In Section 9.2 we use
this formula and the mark specific cumulative hazard function of Huang and Louis
(1998) to derive the almost sure limit of the MLE. We conclude that the MLE is
inconsistent in general. In Section 9.3 we show that the inconsistency can be repaired
by discretizing the marks. In Section 9.4 we illustrate the behavior of the inconsistent
and repaired MLE in four examples.
9.1 The model and an explicit formula for the MLE
9.1.1 Intermezzo: univariate right censored data
Hudgens, Maathuis and Gilbert (2006) noted a close connection between the MLE for
univariate right censored data and the MLE for interval censored continuous mark
data. We will use this connection in Section 9.1.2 to derive a new explicit formula for
the MLE in the interval censored continuous mark model. However, we first briefly
review univariate right censored data in a way that shows the similarity between the
two models.
Let X > 0 be a survival time subject to right censoring. Let T > 0 be the censoring variable, with T independent of X. Let U = X ∧ T = min(X, T) and Γ = 1{X ≤ T}. We are interested in the MLE Fn(x) of F0(x) = P(X ≤ x) based on n independent and identically distributed copies (U1, Γ1), . . . , (Un, Γn) of (U, Γ). Using the censored data perspective of Section 2.2, we find that the observed sets for these data can have two forms: R = {U} if Γ = 1 and R = (U, ∞) if Γ = 0. Let U(1), . . . , U(n) be the order statistics of U1, . . . , Un, and let Γ(i) and R(i) be the corresponding values of Γ and R. We assume that all Ri with Γi = 1 are distinct, since this will be the case for the continuous mark data. However, we allow ties in the T's and U's provided this assumption is not violated. We break such ties in U arbitrarily after ensuring that observations with Γ = 1 are ordered before those with Γ = 0.
Assuming that F has a density f with respect to some dominating measure µ, the likelihood (conditional on G) is Ln(F) = ∏_{i=1}^n q(Ui, Γi), where q(u, γ) = f(u)^γ {1 − F(u)}^{1−γ}. The first term of q is a density-type term, and hence Ln(F) can be made arbitrarily large by letting f peak at some value Ui with Γi = 1. This problem is usually solved by maximizing Ln(F) over the class of distribution functions that have a density with respect to counting measure on the observed failure times. We can then write Ln(F) = ∏_{i=1}^n PF(Ri), where PF(R) is the probability of R under F.
We now consider the maximal intersections of R1, . . . , Rn. Using the idea of the height map of Maathuis (2005), we find that each R(i) with i ∈ I = {i ∈ {1, . . . , n} : Γ(i) = 1} is a maximal intersection. We denote these maximal intersections by A(i). This notation may seem redundant since A(i) = R(i), but it will be useful in the next section. Furthermore, there is an extra maximal intersection A(n+1) = R(n) = (U(n), ∞) if and only if Γ(n) = 0. Let Ī be the collection of indices of all maximal intersections. Thus, Ī = I if Γ(n) = 1 and Ī = I ∪ {n + 1} if Γ(n) = 0. Let αi be the probability mass of maximal intersection A(i), i ∈ Ī. We can then write the likelihood in terms of the αi's, analogously to the expression of the log likelihood in (2.15):
∏_{i=1}^n P(Ri) = ∏_{i=1}^n ∑_{j∈Ī} αj 1{A(j) ⊆ R(i)} = ∏_{i=1}^n αi^{Γ(i)} (∑_{j≥i+1, j∈Ī} αj)^{1−Γ(i)}.   (9.1)

The MLE α̂ maximizes this expression under the constraints

∑_{i∈Ī} αi = 1 and αi ≥ 0 for all i ∈ Ī.   (9.2)
It is well-known that α̂ is the Kaplan-Meier or product-limit estimator, given by

α̂i = ∏_{j=1}^{i−1} (1 − Γ(j)/(n − j + 1)) · Γ(i)/(n − i + 1),   i ∈ I,

and α̂_{n+1} = 1 − ∑_{i∈I} α̂i if Γ(n) = 0 (see for example Shorack and Wellner (1986), Chapter 7, pages 332-333). Equivalently, we can write

∑_{j≥i, j∈Ī} α̂j = ∏_{j≤i−1} (1 − Γ(j)/(n − j + 1)),   i ∈ Ī.

The vector α̂ is uniquely determined. We obtain Fn(x) by summing the probability mass in the interval (0, x]. Note that the maximal intersections {A(i) : i ∈ I} are points. The extra maximal intersection A(n+1), which exists if and only if Γ(n) = 0, takes the form of a half line. Hence, representational non-uniqueness occurs if and only if Γ(n) = 0, and if it occurs then it affects Fn(x) for x > T(n).
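As a concrete illustration, the explicit product-limit formula above can be evaluated in a single pass over the sorted data. The Python sketch below (names are our own) returns the masses α̂i and the mass left on (U(n), ∞) when Γ(n) = 0:

```python
import numpy as np

def km_masses(u, gamma):
    """Kaplan-Meier masses from right censored data (U_i, Gamma_i), using the
    tie-breaking convention of the text: at tied U's, observations with
    Gamma = 1 are ordered before those with Gamma = 0."""
    order = np.lexsort((1 - gamma, u))       # sort by U; uncensored first at ties
    g = gamma[order].astype(float)
    n = len(g)
    i = np.arange(n)                         # 0-based, so n - i plays the role of n - i + 1
    factors = 1.0 - g / (n - i)              # factors (1 - Gamma_(j)/(n - j + 1))
    prods = np.concatenate(([1.0], np.cumprod(factors)[:-1]))
    alpha = prods * g / (n - i)              # mass at each uncensored U_(i)
    return u[order], alpha, 1.0 - alpha.sum()   # last value: mass beyond U_(n) if censored

# Events at 1 and 3, censored at 2 and 4: masses 1/4 and 3/8, with 3/8 beyond U_(4).
u = np.array([1.0, 2.0, 3.0, 4.0])
gamma = np.array([1, 0, 1, 0])
times, alpha, tail = km_masses(u, gamma)
```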
9.1.2 Continuous mark data
We now formally introduce the interval censored continuous mark model. Let X ∈ R+ = (0, ∞) be a survival time, let Y ∈ R be a continuous mark variable, and let F0(x, y) = P(X ≤ x, Y ≤ y) be their joint distribution. Let F0X(x) = F0(x, ∞) and F0Y(y) = F0(∞, y) be the marginal distributions of X and Y. Let X be subject to interval censoring case k, using the terminology of Groeneboom and Wellner (1992). Let T = (T1, . . . , Tk) be the k observation times and let G be their distribution (in the case of current status censoring (k = 1), we denote the observation time simply by T). We assume that T is independent of (X, Y) and G(0 < T1 < · · · < Tk) = 1. We use subscripts to denote the marginal distributions of G. For example, G1 is the distribution of T1 and G23 is the distribution of (T2, T3). Let Γ = (Γ1, . . . , Γk+1) be a vector of indicator functions, where Γj = 1{Tj−1 < X ≤ Tj} for j = 1, . . . , k + 1, with T0 = 0 and Tk+1 = ∞. We say that X is right censored if Γk+1 = 1, and we assume that Y is observed if and only if X is not right censored. Thus, we observe Z = (T, Γ, W), where W = Γ+ Y and Γ+ = ∑_{j=1}^k Γj = 1 − Γk+1. We study the nonparametric maximum likelihood estimator Fn(x, y) for F0(x, y) based on n independent and identically distributed copies Z1, . . . , Zn of Z, where Zi = (Ti, Γi, Wi), Ti = (T1i, . . . , Tki) and Γi = (Γ1i, . . . , Γk+1,i). We allow ties between components of the vectors Ti and Tj for i ≠ j.
The observed sets for this model are defined as follows:

R = (Tj−1, Tj] × {W}   if Γj = 1, j = 1, . . . , k,
R = (Tk, ∞) × R        if Γk+1 = 1.
Note that R is a line segment if Γ+ = 1 and R is a half plane if Γk+1 = 1. Assuming
that F has a density f with respect to some dominating measure µX × µY, the likelihood (conditional on G) is given by Ln(F) = ∏_{i=1}^n q(Zi), where

q(z) = q(t, γ, w) = ∏_{j=1}^k (∫_{(tj−1, tj]} f(s, w) µX(ds))^{γj} · (1 − FX(tk))^{γk+1},   (9.3)
where FX(x) = F(x, ∞) is the marginal distribution of X under F. The first term of q
is a density-type term. Hence, Ln(F ) can be made arbitrarily large by letting f(s, w)
peak at w = Wi for some observation with Γ+i = 1. We therefore define the MLE
Fn(x, y) to be the maximizer of Ln(F ) over the class F of all bivariate distribution
functions that have a marginal density fY with respect to counting measure on the
observed marks. We can then write Ln(F ) =∏n
i=1 PF (Ri).
As in Maathuis (2005), we call the projection of R on the x- and y-axis its x-
interval and y-interval. We denote the left and right endpoint of the x-interval of R
by TL and TR:

TL = ∑_{j=1}^{k+1} Γj Tj−1,   TR = ∑_{j=1}^{k+1} Γj Tj.   (9.4)

Furthermore, we define a new variable U:

U = Γ+ TR + Γk+1 TL.   (9.5)
Note that U equals T if X is subject to current status censoring. The variable U plays an important role, because it will determine the order of the observations. Let U(1), . . . , U(n) be the order statistics of U1, . . . , Un and let Γ(i) = (Γ1(i), . . . , Γk+1,(i)), W(i), R(i), TL(i) and TR(i) be the corresponding values of Γ, W, R, TL and TR. We break ties in U arbitrarily after ensuring that observations with Γ+ = 1 are ordered before those with Γ+ = 0. Let I = {i ∈ {1, . . . , n} : Γ+(i) = 1}. Recall that the maximal intersections are the local maxima of the height map of the canonical observed sets. Since Y is continuous, the observed sets R(i), i ∈ I, are completely distinct with probability one. Hence, each such R(i) contains exactly one maximal intersection A(i):

A(i) = (max({TL(j) : j ∉ I, j < i} ∪ {TL(i)}), TR(i)] × {W(i)},   i ∈ I.   (9.6)
To understand this expression, let S(i) be the set of right censored observed sets R(j) with TL(i) < TL(j) < TR(i). Then (9.6) implies that A(i) = R(i) if S(i) = ∅ and A(i) ⊆ R(i) otherwise. Furthermore, in the latter case the left endpoint of A(i) is determined by the largest TL(j) with R(j) ∈ S(i). The right endpoints of A(i) and R(i) are always identical. Equation (9.6) also implies that the maximal intersections can be computed in O(n log n) time, which is faster than the height map algorithm of Maathuis (2005), due to the special structure in the data. We again have an extra maximal intersection A(n+1) = R(n) = (U(n), ∞) × R if and only if Γ+(n) = 0. Let Ī be the collection of indices of all maximal intersections. Thus, Ī = I if Γ+(n) = 1 and Ī = I ∪ {n + 1} if Γ+(n) = 0. Let αi be the probability mass of maximal intersection A(i), i ∈ Ī. We can then write the likelihood as
∏_{i=1}^n P(Ri) = ∏_{i=1}^n ∑_{j∈Ī} αj 1{A(j) ⊆ R(i)} = ∏_{i=1}^n αi^{Γ+(i)} (∑_{j≥i+1, j∈Ī} αj)^{1−Γ+(i)}.   (9.7)
The MLE α̂ maximizes this expression under the constraints (9.2). From the analogy with likelihood (9.1) it follows immediately that

α̂i = ∏_{j=1}^{i−1} (1 − Γ+(j)/(n − j + 1)) · Γ+(i)/(n − i + 1),   i ∈ I,

and α̂_{n+1} = 1 − ∑_{i∈I} α̂i if Γ+(n) = 0. Equivalently, we can write

∑_{j≥i, j∈Ī} α̂j = ∏_{j≤i−1} (1 − Γ+(j)/(n − j + 1)),   i ∈ Ī.   (9.8)
These formulas are different from (but equivalent to) the ones given in Section 3.1 of
Hudgens, Maathuis and Gilbert (2006). The form given here has several advantages.
First, the tail probabilities (9.8) can be computed in time complexity O(n log n), since the computationally most intensive step consists of sorting the U's. Furthermore, the current form provides additional insight into the behavior of the MLE. In particular,
it shows that the MLE can be viewed as a right endpoint imputation estimator (see
Remark 9.1), and it allows for an easy derivation of the almost sure limit of the MLE
(see Section 9.2).
The vector α̂ is again uniquely determined. This was noted by Hudgens, Maathuis and Gilbert (2006) and also follows from our derivation here. We obtain Fn(x, y) by summing all mass in the region (0, x] × (−∞, y]. We define a marginal MLE for X by letting FXn(x) = Fn(x, ∞). The estimators Fn and FXn can suffer considerably from representational non-uniqueness, since the maximal intersections {A(i) : i ∈ I} are line segments and A(n+1) extends to infinity in two dimensions. We denote the estimator that assigns all mass to the upper right corners of the maximal intersections by Fn^l, since it is a lower bound for the MLE. Similarly, we denote the estimator that assigns all mass to the lower left corners of the maximal intersections by Fn^u, since it is an upper bound for the MLE. The formulas for Fn^l simplify considerably:
1 − FXn^l(x) = ∏_{U(i)≤x} (1 − Γ+(i)/(n − i + 1)),   (9.9)

Fn^l(x, y) = ∑_{i=1}^n α̂i 1{U(i) ≤ x, W(i) ≤ y}
           = ∑_{U(i)≤x} ∏_{U(j)<U(i)} (1 − Γ+(j)/(n − j + 1)) · Γ+(i) 1{W(i) ≤ y}/(n − i + 1),   (9.10)

where U was defined in (9.5).
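Formulas (9.9) and (9.10) show that Fn^l places the product-limit mass α̂i at the right endpoint (U(i), W(i)) of each non-right-censored observation, so it can be evaluated directly. A sketch under the same conventions as above (names are our own):

```python
import numpy as np

def F_lower(u, gamma_plus, w, x, y):
    """Evaluate F_n^l(x, y) from (9.10): product-limit mass alpha_i at the
    point (U_(i), W_(i)) for each observation with Gamma_+ = 1."""
    order = np.lexsort((1 - gamma_plus, u))       # U ascending; Gamma_+ = 1 first at ties
    g = gamma_plus[order].astype(float)
    uu, ww = u[order], w[order]
    n = len(g)
    i = np.arange(n)                              # 0-based index, n - i = n - i + 1 (1-based)
    prods = np.concatenate(([1.0], np.cumprod(1.0 - g / (n - i))[:-1]))
    alpha = prods * g / (n - i)                   # masses from the product formula
    return float(np.sum(alpha * (uu <= x) * (ww <= y)))
```

With no right censoring this reduces to the empirical distribution of the imputed points (U_i, W_i), consistent with the right endpoint imputation view of Remark 9.1.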
Remark 9.1 The MLE Fn^l can be viewed as a right endpoint imputation estimator. Namely, replace the observed sets R(i) with Γ+(i) = 1 by their right endpoints:

R′(i) = {U(i)} × {W(i)} if i ∈ I,   and   R′(i) = R(i) if i ∉ I.

Then the intersection structures of {R(i)}_{i=1}^n and {R′(i)}_{i=1}^n are identical, meaning that R(i) ∩ R(j) = ∅ if and only if R′(i) ∩ R′(j) = ∅, for all i, j ∈ {1, . . . , n}. Furthermore, the maximal intersections of {R′(i)}_{i=1}^n are {A(i) = R′(i) : i ∈ I}. Hence, writing the likelihood for the imputed data in terms of α yields exactly the same likelihood as (9.7). As a result the values α̂i, i ∈ Ī, are identical to the ones for the original data. Furthermore, since Fn^l assigns mass to the upper right corners of the maximal intersections, Fn^l is completely equivalent to the MLE for the imputed data. Since the observed sets R′(i) impute an x-value that is always at least as large as the unobserved value X, FXn^l tends to have a negative bias.
9.2 Inconsistency of the MLE
We now derive the almost sure limits FX∞^l and F∞^l of the MLEs FXn^l and Fn^l. In some cases representational non-uniqueness disappears in the limit, so that FX∞ = FX∞^l and F∞ = F∞^l. This occurs for all (x, y) ∈ R+ × R if and only if the maximal intersections A(i), i ∈ I, converge to points and ∑_{i∈I} α̂i → 1 as n → ∞; see Examples 1 and 2 in Section 9.4. If these conditions fail, then the upper bounds FX∞^u and F∞^u can be obtained from their lower bounds by reassigning mass from the upper right corners to the lower left corners of the maximal intersections. We illustrate this in Examples 3 and 4 in Section 9.4.
However, we first derive the lower bounds FX∞^l and F∞^l. Let

Hn(x) = Pn 1{U ≤ x},   x ≥ 0,
Vn(x, y) = Pn Γ+ 1{U ≤ x, W ≤ y},   x ≥ 0, y ∈ R,

and V1n(x) ≡ Vn(x, ∞) = Pn Γ+ 1{U ≤ x}. Here U is defined in (9.5) and Pn f(X) = n^{-1} ∑_{i=1}^n f(Xi). Furthermore, let

Λn(x, y) = ∫_{[0,x]} Vn(ds, y)/(1 − Hn(s−)),
Λ1n(x) ≡ Λn(x, ∞) = ∫_{[0,x]} V1n(ds)/(1 − Hn(s−)).

Since

Λn(dx, y) = Pn Γ+ 1{U = x, W ≤ y} / Pn 1{U ≥ x}   and   Λ1n(dx) = Pn Γ+ 1{U = x} / Pn 1{U ≥ x},

we can write equations (9.9) and (9.10) in terms of Λ1n and Λn:

1 − FXn^l(x) = ∏_{s≤x} {1 − Λ1n(ds)},   (9.11)
Fn^l(x, y) = ∫_{s≤x} ∏_{u<s} {1 − Λ1n(du)} Λn(ds, y).   (9.12)
Note that (9.11) is analogous to the Kaplan-Meier estimator for right censored data,
and that (9.12) is analogous to equation (3.3) of Huang and Louis (1998). However,
our functions Λ1n and Λn are defined differently. As we will see in the following
lemma and theorems, this difference lies at the root of the inconsistency problems of
the MLE.
Lemma 9.2 For I ⊆ R^d, d ≥ 1, let D(I) be the space of cadlag functions on I (cadlag = right-continuous with left limits). Let ‖·‖∞ be the supremum norm on (D(R+), D(R+), D(R+ × R)). Then

‖(Hn − H, V1n − V1, Vn − V)‖∞ →a.s. 0,   (9.13)

where

V(x, y) = ∑_{j=1}^k ∫_{[0,x]} F0(t, y) dGj(t) − ∑_{j=2}^k ∫_{0≤s≤t≤x} F0(s, y) dG_{j−1,j}(s, t),   (9.14)
V1(x) = ∑_{j=1}^k ∫_{[0,x]} F0X(t) dGj(t) − ∑_{j=2}^k ∫_{0≤s≤t≤x} F0X(s) dG_{j−1,j}(s, t),   (9.15)
H(x) = V1(x) + ∫_{[0,x]} {1 − F0X(s)} dGk(s).   (9.16)
Proof: Equation (9.13) follows immediately from the Glivenko-Cantelli theorem, with H(x) = E 1{U ≤ x}, V(x, y) = E(Γ+ 1{U ≤ x, W ≤ y}) and V1(x) = V(x, ∞) = E(Γ+ 1{U ≤ x}). We now express H, V and V1 in terms of F0 and G. Note that the events [Γj = 1], j = 1, . . . , k + 1, are disjoint. Furthermore, U = Tj and W = Y on [Γj = 1], j = 1, . . . , k, and U = Tk on [Γk+1 = 1]. Hence,

V(x, y) = E(Γ+ 1{U ≤ x, W ≤ y}) = ∑_{j=1}^k P(Γj = 1, Y ≤ y, Tj ≤ x)
        = ∑_{j=1}^k P(X ∈ (Tj−1, Tj], Y ≤ y, Tj ≤ x)
        = ∑_{j=1}^k ∫_{0≤s≤t≤x} {F0(t, y) − F0(s, y)} dG_{j−1,j}(s, t),

and, using T0 = 0, X > 0 and G(0 < T1 < · · · < Tk) = 1, this can be written as

∑_{j=1}^k ∫_{[0,x]} F0(t, y) dGj(t) − ∑_{j=2}^k ∫_{0≤s≤t≤x} F0(s, y) dG_{j−1,j}(s, t).

Taking y = ∞ yields the expression for V1(x). The expression for H follows similarly, using

H(x) = E 1{U ≤ x} = ∑_{j=1}^k P(Γj = 1, Tj ≤ x) + P(Γk+1 = 1, Tk ≤ x).   □
The differentials of $V$ and $V_1$ with respect to $x$ are
\[
V(dx, y) = \sum_{j=1}^k F_0(x,y)\, dG_j(x) - \sum_{j=2}^k \int_{[0,x]} F_0(s,y)\, dG_{j-1,j}(s,x), \tag{9.17}
\]
\[
V_1(dx) = \sum_{j=1}^k F_{0X}(x)\, dG_j(x) - \sum_{j=2}^k \int_{[0,x]} F_{0X}(s)\, dG_{j-1,j}(s,x). \tag{9.18}
\]
Let $\tau$ be such that $H(\tau) < 1$. In the next theorem we derive the limits of $\Lambda_{1n}$ and $\Lambda_n$ for $x \in [0,\tau]$ and $y \in \mathbb{R}$.
Theorem 9.3 Let $\|\cdot\|_\infty$ be the supremum norm on $(D[0,\tau], D([0,\tau] \times \mathbb{R}))$. Then
\[
\|(\Lambda_{1n} - \Lambda_{1\infty},\, \Lambda_n - \Lambda_\infty)\|_\infty \to_{a.s.} 0,
\]
where
\[
\Lambda_\infty(x,y) = \int_{[0,x]} \frac{V(ds,y)}{1 - H(s-)}, \quad x \in [0,\tau],\ y \in \mathbb{R}, \tag{9.19}
\]
\[
\Lambda_{1\infty}(x) = \Lambda_\infty(x,\infty) = \int_{[0,x]} \frac{V_1(ds)}{1 - H(s-)}, \quad x \in [0,\tau]. \tag{9.20}
\]
Proof: The proof is similar to the discussion on page 1536 of Gill and Johansen (1990). For all $x \ge 0$, let $H_n^-(x) \equiv H_n(x-)$. Consider the mappings
\[
(H_n^-, V_{1n}, V_n) \to ((1 - H_n^-)^{-1}, V_{1n}, V_n) \to (\Lambda_{1n}, \Lambda_n)
\]
on the spaces
\[
(D^-[0,\tau], D[0,\tau], D([0,\tau] \times \mathbb{R})) \to (D^-[0,\tau], D[0,\tau], D([0,\tau] \times \mathbb{R})) \to (D[0,\tau], D([0,\tau] \times \mathbb{R})),
\]
where $D^-[0,\tau]$ is the space of 'caglad' (left-continuous with right limits) functions on $[0,\tau]$. The first mapping is continuous with respect to the supremum norm when we restrict the domain of its first argument to elements of $D^-[0,\tau]$ that are bounded by, say, $\{1 + H(\tau)\}/2 < 1$. Strong consistency of $H_n^-$ ensures that it satisfies this bound with probability one for $n$ large enough. The second mapping is continuous with respect to the supremum norm by the Helly-Bray lemma. Combining the continuity of these mappings with Lemma 9.2 yields the result of the theorem. $\Box$
Next, we derive the limits of $F^l_{Xn}$ and $F^l_n$.

Theorem 9.4 Let $\|\cdot\|_\infty$ be the supremum norm on $(D[0,\tau], D([0,\tau] \times \mathbb{R}))$. Then
\[
\|(F^l_{Xn} - F^l_{X\infty},\, F^l_n - F^l_\infty)\|_\infty \to 0 \quad \text{almost surely},
\]
where
\[
F^l_{X\infty}(x) = 1 - \prod_{s \le x} \{1 - \Lambda_{1\infty}(ds)\}, \tag{9.21}
\]
\[
F^l_\infty(x,y) = \int_{u \le x} \prod_{s < u} \{1 - \Lambda_{1\infty}(ds)\}\, \Lambda_\infty(du, y). \tag{9.22}
\]
Proof: To derive the almost sure limit of $F^l_{Xn}$ consider the mapping
\[
\Lambda_{1n} \mapsto \prod_{s \le x} \{1 - \Lambda_{1n}(ds)\} = 1 - F^l_{Xn}(x) \tag{9.23}
\]
on the space $D[0,\tau]$ to itself. This mapping is continuous with respect to the supremum norm when its domain is restricted to functions of uniformly bounded variation (Gill and Johansen (1990), Theorem 7). Note that $\Lambda_{1n} \le 1/\{1 - H_n(\tau)\} < 2/\{1 - H(\tau)\}$ with probability one for $n$ large enough. Together with the monotonicity of $\Lambda_{1n}$ this implies that with probability one $\Lambda_{1n}$ is of uniformly bounded variation for $n$ large enough. The almost sure limit of $F^l_{Xn}$ now follows by combining Theorem 9.3 and the continuity of (9.23).

To derive the almost sure limit of $F^l_n$ consider the mapping
\[
(\Lambda_{1n}, \Lambda_n) \mapsto \int_{u \le x} \prod_{s < u} \{1 - \Lambda_{1n}(ds)\}\, \Lambda_n(du, y) = F^l_n(x,y)
\]
on the space $(D[0,\tau], D([0,\tau] \times \mathbb{R}))$ to $D([0,\tau] \times \mathbb{R})$. This mapping is continuous with respect to the supremum norm when its domain is restricted to functions of uniformly bounded variation (Huang and Louis (1998), Theorem 1). Note that $\Lambda_n(x,y) \le \Lambda_{1n}(x)$, so that with probability one the pair $(\Lambda_n, \Lambda_{1n})$ is uniformly bounded for $n$ large enough. The result then follows as in the first part of the proof. $\Box$
In Corollaries 9.5 - 9.7, we rewrite $F^l_\infty$ in various ways.

Corollary 9.5 For $x \in [0,\tau]$, $y \in \mathbb{R}$, we can write
\[
F^l_\infty(x,y) = \int_{[0,x]} \frac{\Lambda_\infty(ds,y)}{\Lambda_{1\infty}(ds)}\, dF^l_{X\infty}(s) = \int_{[0,x]} \frac{V(ds,y)}{V_1(ds)}\, dF^l_{X\infty}(s). \tag{9.24}
\]

Proof: Combining equations (9.21) and (9.22) yields
\[
F^l_\infty(x,y) = \int_{[0,x]} \{1 - F^l_{X\infty}(s-)\}\, \Lambda_\infty(ds, y). \tag{9.25}
\]
Taking $y = \infty$ gives $F^l_{X\infty}(x) = F^l_\infty(x,\infty) = \int_{[0,x]} \{1 - F^l_{X\infty}(s-)\}\, \Lambda_{1\infty}(ds)$. Hence, $dF^l_{X\infty}(s) = \{1 - F^l_{X\infty}(s-)\}\, \Lambda_{1\infty}(ds)$. Combining this with equation (9.25) yields the first equality of (9.24). The second equality follows from the identities
\[
\Lambda_\infty(ds, y) = V(ds,y)/\{1 - H(s-)\}, \qquad \Lambda_{1\infty}(ds) = V_1(ds)/\{1 - H(s-)\}. \qquad \Box
\]
Corollary 9.6 Let $X$ and $Y$ be independent. Then
\[
F^l_\infty(x,y) = F^l_{X\infty}(x) F_{0Y}(y), \quad x \in [0,\tau],\ y \in \mathbb{R}. \tag{9.26}
\]

Proof: If $X$ and $Y$ are independent, equations (9.17) and (9.18) yield $V(ds,y) = F_{0Y}(y) V_1(ds)$. Substituting this into equation (9.24) gives the result. $\Box$
Corollary 9.7 Let $X$ be subject to current status censoring ($k = 1$). Then
\[
F^l_\infty(x,y) = \int_{[0,x]} P(Y \le y \mid X \le s)\, dF^l_{X\infty}(s), \quad x \in [0,\tau],\ y \in \mathbb{R}. \tag{9.27}
\]

Proof: For $k = 1$ equations (9.17) and (9.18) reduce to $V(ds,y) = F_0(s,y)\, dG(s)$ and $V_1(ds) = F_{0X}(s)\, dG(s)$. Hence, $V(ds,y)/V_1(ds) = F_0(s,y)/F_{0X}(s) = P(Y \le y \mid X \le s)$. $\Box$
We now consider necessary and sufficient conditions for consistency of $F^l_{Xn}$ and $F^l_n$. From the one-to-one correspondence between a univariate distribution function and its cumulative hazard function it follows that $F^l_{Xn}$ is consistent for $F_{0X}$ if and only if $\Lambda_{1\infty}$ equals the cumulative hazard function $\Lambda_X$ of $F_{0X}$. Similarly, it follows that $F^l_n(x,y)$ is consistent for $F_0(x,y)$ if and only if $\Lambda_\infty$ equals the mark specific cumulative hazard function $\Lambda$ of $F_0$. This is made precise in the following corollary.
Corollary 9.8 We introduce the following conditions:
\[
\Lambda_{1\infty}(x) = \int_{[0,x]} \frac{V_1(ds)}{1 - H(s-)} = \int_{[0,x]} \frac{F_{0X}(ds)}{1 - F_{0X}(s-)} = \Lambda_X(x), \tag{9.28}
\]
\[
\Lambda_\infty(x,y) = \int_{[0,x]} \frac{V(ds,y)}{1 - H(s-)} = \int_{[0,x]} \frac{F_0(ds,y)}{1 - F_{0X}(s-)} = \Lambda(x,y). \tag{9.29}
\]
Then $F^l_{Xn}$ is consistent for $F_{0X}$ on $(0,\tau]$ if and only if (9.28) holds for all $x \in (0,\tau]$. Furthermore, $F^l_n$ is consistent for $F_0$ on $(0,\tau] \times \mathbb{R}$ if and only if (9.29) holds for all $x \in (0,\tau]$, $y \in \mathbb{R}$. Finally, let $x_0 \in (0,\tau]$ with $F_{X\infty}(x_0) > 0$. Then $F^l_n(x_0,y)/F^l_{Xn}(x_0)$ is consistent for $F_{0Y}(y)$ if $X$ and $Y$ are independent.
The last claim of the corollary follows from (9.26). Conditions (9.28) and (9.29) are
hard to interpret in general, since F0X and F0 enter on both sides of the equations
when we plug in expressions (9.16), (9.17) and (9.18) for H(s−), V (ds, y) and V1(ds).
However, it is clear that the conditions force a relation between F0 and G, and such
a relation will typically not hold and cannot be assumed, since F0 is unknown. The
following corollary further strengthens this result when X is subject to current status
censoring.
Corollary 9.9 Let $X$ be subject to current status censoring, and let $F_{0X}$ and $G$ be continuous. Then the MLE $F^l_{Xn}$ is inconsistent for any choice of $F_{0X}$ and $G$.

Proof: Let $\gamma = \inf\{x : F_{0X}(x) > 0\} < \tau$. For continuous distribution functions $G$ and $F_{0X}$ condition (9.28) can be rewritten as
\[
\int_{(\gamma,x]} \frac{dG(s)}{1 - G(s)} = \int_{(\gamma,x]} \frac{dF_{0X}(s)}{F_{0X}(s)\{1 - F_{0X}(s)\}}, \quad x \in (\gamma,\tau].
\]
For continuous $G$ and $F_{0X}$ this integral equation is solved by
\[
-\log\{1 - G(x)\} + C = \log \frac{F_{0X}(x)}{1 - F_{0X}(x)}, \quad x \in (\gamma,\tau].
\]
This yields $F_{0X}(x) = [1 + \exp(-C)\{1 - G(x)\}]^{-1}$ for $x \in (\gamma,\tau]$. But there is no finite $C$ such that $F_{0X}(\gamma) = 0$ holds, and hence condition (9.28) fails for all continuous distributions $G$ and $F_{0X}$. $\Box$
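The contradiction in this proof is easy to check numerically. The sketch below is ours, with the hypothetical concrete choices $G = \mathrm{Unif}(0,1)$ (so $\gamma = 0$) and $C = 0$: midpoint-rule integration confirms that the logistic family solves the integral equation, while $F_{0X}$ stays bounded away from zero at $\gamma$ for every finite $C$.

```python
import math

# Hypothetical concrete case for Corollary 9.9: G = Unif(0, 1), gamma = 0, C = 0.
# The logistic family F0X(x) = 1 / (1 + exp(-C) * (1 - G(x))) solves the integral
# equation, but F0X(0) = 1 / (1 + exp(-C)) > 0 for every finite C.
C = 0.0
G = lambda s: s
F0X = lambda s: 1.0 / (1.0 + math.exp(-C) * (1.0 - G(s)))

def lhs(a, b, m=20000):
    # int_a^b dG(s) / (1 - G(s)) by the midpoint rule
    h = (b - a) / m
    return sum(h / (1.0 - G(a + (i + 0.5) * h)) for i in range(m))

def rhs(a, b, m=20000):
    # int_a^b dF0X(s) / [F0X(s) {1 - F0X(s)}] by the midpoint rule
    h = (b - a) / m
    total = 0.0
    for i in range(m):
        s = a + (i + 0.5) * h
        dF = F0X(s + h / 2) - F0X(s - h / 2)
        total += dF / (F0X(s) * (1.0 - F0X(s)))
    return total

print(lhs(0.1, 0.8), rhs(0.1, 0.8), F0X(0.0))
```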
The following corollary shows that the asymptotic bias of the MLE goes to zero as
the number k of observation times per subject increases, for at least one particular
distribution of T = (T1, . . . , Tk), namely if T1, . . . , Tk are distributed as the order
statistics of a uniform sample on [0, θ].
Corollary 9.10 Let $X$ be subject to interval censoring case $k$. Assume that the elements $T_1, \dots, T_k$ of $T$ are the order statistics of $k$ independent and identically distributed uniform random variables on $[0,\theta]$. We denote the resulting limits by $V^k(x,y)$, $V^k_1(x)$, $H^k(x)$, $\Lambda^k_\infty(x,y)$ and $\Lambda^k_{1\infty}(x)$, using the superscript $k$ to denote the dependence on $k$. Then
\[
\Lambda^k_{1\infty}(x) = \int_{[0,x]} \frac{dV^k_1(s)}{1 - H^k(s-)} \to \int_{[0,x]} \frac{dF_{0X}(s)}{1 - F_{0X}(s-)} = \Lambda_X(x), \quad k \to \infty,
\]
\[
\Lambda^k_\infty(x,y) = \int_{[0,x]} \frac{dV^k(s,y)}{1 - H^k(s-)} \to \int_{[0,x]} \frac{F_0(ds,y)}{1 - F_{0X}(s-)} = \Lambda(x,y), \quad k \to \infty,
\]
for all continuity points $x < \theta$ of $\Lambda_X(x)$ and $\Lambda(x,y)$ and for all $y \in \mathbb{R}$.
Proof: Since the observation times are order statistics of $k$ independent and identically distributed uniform random variables, the marginal densities $g_j$, $j = 1, \dots, k$, and the joint densities $g_{j-1,j}$, $j = 2, \dots, k$, are known (see, e.g., Shorack and Wellner (1986), page 97). Summing them over $j$ yields:
\[
\sum_{j=1}^k g_j(t) = \frac{k}{\theta}\, 1_{[0,\theta]}(t) \sum_{j-1=0}^{k-1} \binom{k-1}{j-1} \left(\frac{t}{\theta}\right)^{j-1} \left(1 - \frac{t}{\theta}\right)^{k-1-(j-1)} = \frac{k}{\theta}\, 1_{[0,\theta]}(t),
\]
\[
\sum_{j=2}^k g_{j-1,j}(s,t) = \frac{k(k-1)}{\theta^2}\, 1_{[0 \le s \le t \le \theta]} \left(1 - \frac{t-s}{\theta}\right)^{k-2}.
\]
Let $x < \theta$. We compute, using Fubini's theorem to rewrite the second term,
\begin{align*}
V^k(x,y) &= \sum_{j=1}^k \int_{[0,x]} F_0(t,y)\, dG_j(t) - \sum_{j=2}^k \iint_{0 \le s \le t \le x} F_0(s,y)\, dG_{j-1,j}(s,t) \\
&= \frac{k}{\theta} \int_{[0,x]} F_0(t,y)\, dt - \iint_{0 \le s \le t \le x} F_0(s,y)\, \frac{k(k-1)}{\theta^2} \left(1 - \frac{t-s}{\theta}\right)^{k-2} ds\, dt \\
&= \frac{k}{\theta} \int_{[0,x]} F_0(s,y) \left(1 - \frac{x-s}{\theta}\right)^{k-1} ds = \int_{[0,x]} F_0(s,y)\, dQ^k_x(s),
\end{align*}
where, for $s \le x$,
\[
Q^k_x(s) = \int_0^s \frac{k}{\theta} \left(1 - \frac{x-r}{\theta}\right)^{k-1} dr = \left(1 - \frac{x-s}{\theta}\right)^k - \left(1 - \frac{x}{\theta}\right)^k.
\]
Thus, as $k \to \infty$, $Q^k_x(s)$ converges weakly to the distribution function corresponding to the measure with mass 1 at $x$. Plugging in $y = \infty$ in $V^k(x,y)$ yields $V^k_1(x) = \int_{[0,x]} F_{0X}(s)\, dQ^k_x(s)$. Furthermore, plugging in the expressions for $V^k_1$ and $G_k$ in (9.16) gives
\[
H^k(x) = \int_{[0,x]} F_{0X}(s)\, dQ^k_x(s) + \int_{[0,x]} \{1 - F_{0X}(s)\}\, \frac{k}{\theta} (s/\theta)^{k-1}\, ds.
\]
Hence, for $x < \theta$ we have $V^k(x,y) \to F_0(x,y)$, $V^k_1(x) \to F_{0X}(x)$ and $1 - H^k(x) \to 1 - F_{0X}(x)$ as $k \to \infty$ for continuity points of the limits. The corollary then follows from the extended Helly-Bray theorem. $\Box$
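The concentration of $Q^k_x$ at $x$, which drives this proof, can be illustrated numerically. The sketch below is ours: it takes the hypothetical choices $\theta = 1$ and $F_{0X}(s) = s$ and evaluates $V^k_1(x) = \int_{[0,x]} F_{0X}\, dQ^k_x$ by a midpoint rule for growing $k$.

```python
# Our illustration of the proof of Corollary 9.10: theta = 1, F0X(s) = s.
# Q^k_x has density (k/theta)(1 - (x-s)/theta)^(k-1) on [0, x], which piles
# up near x as k grows, so V^k_1(x) approaches F0X(x).
theta = 1.0
F0X = lambda s: s

def Vk1(x, k, m=20000):
    h = x / m
    total = 0.0
    for i in range(m):
        s = (i + 0.5) * h
        total += F0X(s) * (k / theta) * (1.0 - (x - s) / theta) ** (k - 1) * h
    return total

for k in (1, 5, 50, 500):
    print(k, Vk1(0.6, k))
```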
Remark 9.11 The MLE for the distribution function of bivariate censored data has been found to be inconsistent before, namely when X and Y are both right censored (Van der Laan (1996)), and when X is current status censored and Y is uncensored (Maathuis (2003), Section 6.2). In the latter model the inconsistency could be explained by representational non-uniqueness of the MLE. However, this is not the case
for interval censored continuous mark data, where the MLE is typically inconsistent
even if representational non-uniqueness plays no role in the limit. Rather, the inconsistency in this model is related to the fact that the functions Λ1n and Λn that
define the MLE in (9.9) and (9.10) do not converge to the true underlying cumulative
hazard functions.
However, there is a similarity between these three bivariate censored data models
with inconsistent MLEs. Namely, in each model the observed sets can take the form of
line segments, and the likelihood contains corresponding partial density-type terms.
Thus, observed line segments can be viewed as a warning sign for consistency prob-
lems, and whenever they occur consistency of the MLE should be carefully studied.
These warning signs arise in the model for HIV vaccine data in Hudgens, Maathuis
and Gilbert (2006). This model is slightly different from ours, since it allows the mark
variable to be missing for observations that are not right censored. As a result, there
is no explicit formula for the MLE and hence it is more difficult to derive its almost
sure limit. Consistency of the MLE in this model is currently still an open problem,
but simulation results clearly point to inconsistency (Hudgens, Maathuis and Gilbert
(2006)).
9.3 Repaired MLE via discretization of marks
We now define a simple repaired estimator Fn(x, y) which is consistent for F0(x, y) for
y on a grid. The idea behind the estimator is that one can define discrete competing
risks based on a continuous random variable. Doing so transforms interval censored
continuous mark data into interval censored competing risks data, for which the MLE
is consistent.
To describe the method, we let $K > 0$ and define a grid $y_1 < \cdots < y_K$. We let $y_0 = -\infty$ and $y_{K+1} = \infty$, and introduce a new random variable $C \in \{1, \dots, K+1\}$:
\[
C = \sum_{j=1}^{K+1} j\, 1\{y_{j-1} < Y \le y_j\}.
\]
We can determine the value of C for all observations with an observed mark. Hence,
we can transform the observations $(T, \Gamma, W)$ into $(T, \Gamma, W^*)$, where $W^* = \Gamma_+ C$. This
gives interval censored competing risks data with K + 1 competing risks. Hence, this
repaired MLE can be computed with one of the algorithms described in Chapter 4.
Since the observed sets for interval censored competing risks data form a partition of the space $\mathbb{R}_+ \times \{1, \dots, K+1\}$, global consistency of the MLE follows from Theorems 9 and 10 of Van der Vaart and Wellner (2000). We can derive local consistency from the global consistency as done in Section 4.2. This means that we can consistently estimate the sub-distribution functions $F_{0j}(x) = P(X \le x, C = j) = P(X \le x, y_{j-1} < Y \le y_j)$. Hence, we can consistently estimate $F_0(x, y_j) = \sum_{l=1}^j F_{0l}(x)$ for $x \in \mathbb{R}_+$ and $y_j$ on the grid.
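The discretization step can be sketched as follows. This is our own illustration, not code from the thesis: the function name, the data layout, and the convention that a missing mark is represented by `None` are all hypothetical, and the subsequent MLE computation via the algorithms of Chapter 4 is not shown.

```python
import bisect

def discretize_marks(observations, grid):
    """Map interval censored continuous mark observations (T, Gamma, W) to
    interval censored competing risks observations with K + 1 competing risks.
    `grid` is y_1 < ... < y_K; the new cause is C = j iff y_{j-1} < Y <= y_j,
    with y_0 = -inf and y_{K+1} = +inf.  An observation is (t, gamma, w),
    where w is None when the mark is unobserved (right censored case)."""
    out = []
    for t, gamma, w in observations:
        if w is None:
            out.append((t, gamma, None))      # cause stays undetermined
        else:
            # C = 1-based index of the first grid cell containing w;
            # bisect_left handles w exactly on a grid point (y_{j-1} < Y <= y_j).
            c = bisect.bisect_left(grid, w) + 1
            out.append((t, gamma, c))
    return out

grid = [0.5, 1.0, 1.5]                        # K = 3, so K + 1 = 4 causes
obs = [((0.2,), (1,), 0.7), ((0.4,), (0,), None), ((0.3,), (1,), 2.0)]
print(discretize_marks(obs, grid))
```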
It may be tempting to choose $K$ large, so that $F_0(x,y)$ can be estimated for $y$ on a fine grid. However, this may result in a poor estimator. To obtain a good estimator one should choose the grid such that there are ample observations for each value of $C$. In practice, one can start with a coarse grid, and then refine the grid as long as the estimator stays close to the one computed on the coarse grid.
We close this section with some general remarks about this method. First, note
that the repaired MLE corresponds to an existing consistent MLE in the following
two cases: (a) estimation of F0(x, y) for right censored continuous mark data, and (b)
estimation of F0X(x) for interval censored continuous mark data. In the first case the
discretization does not change the intersection structure of the data if the distribution
of the observation times is continuous. Hence, the repaired MLE equals the consistent
MLE as defined by Huang and Louis (1998) for y on the grid. In the second case
we can take K = 0, thereby ignoring any information on Y . This means that we
compute the MLE for univariate interval censored data (T,Γ) which is known to be
consistent (Schick and Yu (2000), Van der Vaart and Wellner (2000)). In a simulation
study we found that moderate values of K tend to give better estimates for F0X, and in Section 9.4 we present results for n = 10,000 and K = 20. Finally, note that the
grouping of the data that occurs in the discretization tends to yield smaller maximal
intersections in the x-direction and hence diminishes problems with representational
non-uniqueness. This is visible in Examples 3 and 4 in Section 9.4.
9.4 Examples
We illustrate the asymptotic behavior of the inconsistent and repaired MLE in four
examples. The examples are chosen to cover a range of scenarios, summarized in Table
9.1. In each example we compute the MLEs $F^l_n$ and $F^u_n$ and the repaired estimators $\bar F^l_n$ and $\bar F^u_n$ for sample size $n = 10{,}000$. For the repaired estimators we use an equidistant grid with $K = 20$ points as shown in Figure 9.3. We compare these estimators to the true underlying distribution $F_0$ and the derived limits $F^l_\infty$ and $F^u_\infty$.
Figure 9.1 shows the contour lines of the MLE $F^l_n$, its limit $F^l_\infty$ and the true underlying distribution $F_0$. Note that $F^l_n$ and $F^l_\infty$ are almost indistinguishable, while there is a clear difference between $F^l_\infty$ and $F_0$. The results for the upper limits $F^u_n$ and $F^u_\infty$ are similar and not shown. Figure 9.2 contains the results for $F_{0X}$ and shows that the MLE tends to underestimate $F_{0X}$, which can be understood through Remark 9.1. However, the repaired MLE $\bar F_n$ closely follows $F_{0X}$. Figure 9.3 shows the results for $F_0(x_0, y)$ for fixed $x_0$. This function is often estimated as an alternative for $F_{0Y}$, since $F_{0Y}$ cannot be consistently estimated if the support of $T_1, \dots, T_k$ is contained in the support of $X$, a situation that typically occurs in practice. The values of $x_0$ were chosen to show a range of possible scenarios for the behavior of the MLE, and we see that $F_n$ can suffer from significant positive or negative bias and non-uniqueness. However, the repaired MLE is again close to the underlying distribution. We now discuss each example in detail.
Example 9.12 Let X and Y be independent, with X ∼ Unif(0, 1) and Y ∼ Exp(1).
Let X be subject to current status censoring with observation time T ∼ Unif(0, 0.5)
independent of (X, Y ). Thus, F0X(x) = x, F0Y (y) = 1 − exp(−y) and F0(x, y) =
x(1 − exp(−y)) for x ∈ [0, 1] and y ≥ 0.
Table 9.1: Summary of the examples for interval censored continuous mark data.

                              Example 1     Example 2    Example 3    Example 4
  (In)dependence of (X, Y)    independent   dependent    dependent    dependent
  Censoring mechanism for X   case 1        case 1       case 2       case 2
  Distribution of T           continuous    continuous   continuous   discrete
We derive the limits for $(x,y) \in [0,\tau] \times \mathbb{R}_+$ for $\tau < 0.5$. Using equations (9.18), (9.20), (9.21) and the fact that $\prod_{s \le x} \{1 - \Lambda_{1\infty}(ds)\} = \exp\{-\Lambda_{1\infty}(x)\}$ when $\Lambda_{1\infty}$ is continuous, we obtain
\[
\Lambda_{1\infty}(x) = \int_0^x \frac{F_{0X}}{1 - G}\, dG = \int_0^x \frac{2s}{1 - 2s}\, ds = -x - \log\sqrt{2 - 4x} + \log\sqrt{2},
\]
\[
1 - F^l_{X\infty}(x) = \exp\{-\Lambda_{1\infty}(x)\} = \sqrt{1 - 2x}\, \exp(x) \neq 1 - F_{0X}(x) = 1 - x.
\]
Since all maximal intersections $A(i)$, $i \in I$, converge to points and $F^l_{X\infty}(0.5) = 1$, the limit $F_{X\infty}$ does not suffer from representational non-uniqueness. Hence, $F_{X\infty} = F^l_{X\infty}$. Figure 9.2 shows that $F_{X\infty}(x) < F_{0X}(x)$ for small values of $x$, but $F_{X\infty}(x) > F_{0X}(x)$ for large values of $x$. In particular, $F_{X\infty}(0.5) = 1 > F_{0X}(0.5) = 0.5$. The fact that $F_{X\infty}$ equals one at the upper support point of $T$ is true in some generality and can be explained as follows. Let $\eta = G^{-1}(1)$, let $X$ be subject to current status censoring, let $F_{0X}(\eta) > 0$, and let $F_{0X}$ and $G$ be continuous at $\eta$. Then $\Lambda_{1\infty}(x) = \int_0^x F_{0X}/(1-G)\, dG$ can be viewed as a scaled down version of the cumulative hazard function of $G$, and hence it converges to infinity for $x \uparrow \eta$. This implies that $F_{X\infty}(x)$ converges to one for $x \uparrow \eta$. This observation is relevant in practice since it often happens in medical studies that the support of $G$ is strictly contained in the support of $X$. Figure 9.2 also shows that the repaired estimator $\bar F_{Xn}(x)$ closely follows $F_{0X}(x)$ for $x < 0.5$. Neither estimator behaves well for $x > 0.5$, but this was to be expected since we cannot estimate outside of the support of $G$.
Since $X$ and $Y$ are independent, the bivariate limit $F_\infty$ follows from equation (9.26): $F_\infty(x,y) = F_{X\infty}(x) F_{0Y}(y) = \{1 - \sqrt{1-2x}\, \exp(x)\}\{1 - \exp(-y)\}$. This implies that $F_0(x_0, y)$ for $x_0 = 0.49$ is overestimated by a factor $F_{X\infty}(0.49)/F_{0X}(0.49) \approx 1.57$, as shown in Figure 9.3. The repaired estimator $\bar F_n(0.49, y)$ behaves quite well, but is slightly off for larger values of $x$.
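The closed-form limit of Example 9.12 and the overestimation factor 1.57 can be checked numerically (our own sketch; the function names are ours):

```python
import math

# Closed-form limit from Example 9.12 (X, Y independent, T ~ Unif(0, 0.5)):
FX_inf = lambda x: 1.0 - math.sqrt(1.0 - 2.0 * x) * math.exp(x)   # x in [0, 0.5)
F0X = lambda x: x

def FX_inf_numeric(x, m=20000):
    """Check against direct midpoint-rule evaluation of
    Lambda_{1,inf}(x) = int_0^x 2s/(1-2s) ds and 1 - F = exp(-Lambda)."""
    h = x / m
    lam = sum(2.0 * ((i + 0.5) * h) / (1.0 - 2.0 * ((i + 0.5) * h)) * h
              for i in range(m))
    return 1.0 - math.exp(-lam)

ratio = FX_inf(0.49) / F0X(0.49)   # the overestimation factor at x0 = 0.49
print(FX_inf(0.3), FX_inf_numeric(0.3), round(ratio, 2))
```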
Example 9.13 Let $X \sim \mathrm{Unif}(0,1)$, and let $Y|X$ be exponentially distributed with mean $1/(X + a)$, where $a = 0.5$. Let $X$ be subject to current status censoring with observation time $T \sim \mathrm{Unif}(0,1)$ independent of $(X,Y)$. Thus, $F_{0X}(x) = x$, $F_{0Y}(y) = 1 - \exp(-ay)\{1 - \exp(-y)\}/y$ and $F_0(x,y) = x - \exp(-ay)\{1 - \exp(-xy)\}/y$ for $x \in [0,1]$ and $y \ge 0$.
Let $(x,y) \in [0,\tau] \times \mathbb{R}_+$ for $\tau < 1$. Equations (9.18), (9.20) and (9.21) yield
\[
\Lambda_{1\infty}(x) = \int_0^x \frac{F_{0X}}{1 - G}\, dG = \int_0^x \frac{s}{1-s}\, ds = -x - \log(1-x),
\]
\[
1 - F^l_{X\infty}(x) = \exp\{-\Lambda_{1\infty}(x)\} = (1-x) \exp(x) \ge 1 - F_{0X}(x) = 1 - x,
\]
where the inequality in the last line is strict for all $x \in (0,1]$. As in Example 1, $F_{X\infty} = F^l_{X\infty}$ is unique. Note that $P(Y \le y \mid X \le x) = 1 - \exp(-ay)\{1 - \exp(-xy)\}/(xy)$ and $f_{X\infty}(x) = x \exp(x)$. Hence, equation (9.27) yields
\[
F_\infty(x,y) = x \exp(x) + \frac{\exp(-ay)}{y(1-y)}\{\exp(x - xy) - 1\} - \left\{1 + \frac{\exp(-ay)}{y}\right\}\{\exp(x) - 1\}.
\]
Figures 9.2 and 9.3 show that $F_{Xn}(x)$ and $F_n(0.5, y)$ underestimate $F_{0X}(x)$ and $F_0(0.5, y)$, while the repaired MLE behaves very well.
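As a numerical sanity check (ours, with hypothetical helper names), one can compare the closed form for $F_\infty$ against direct midpoint-rule integration of (9.27), and confirm the underestimation $F_\infty(0.5, 2) < F_0(0.5, 2)$:

```python
import math

a = 0.5

def F_inf(x, y):
    # Closed form for F_inf(x, y) in Example 9.13 (y != 1 to avoid the
    # removable singularity in the 1/(1 - y) factor).
    return (x * math.exp(x)
            + math.exp(-a * y) / (y * (1.0 - y)) * (math.exp(x - x * y) - 1.0)
            - (1.0 + math.exp(-a * y) / y) * (math.exp(x) - 1.0))

def F_inf_numeric(x, y, m=20000):
    # Direct evaluation of (9.27) with f_{X,inf}(s) = s e^s and
    # P(Y <= y | X <= s) = 1 - exp(-a y)(1 - exp(-s y)) / (s y).
    h = x / m
    total = 0.0
    for i in range(m):
        s = (i + 0.5) * h
        cond = 1.0 - math.exp(-a * y) * (1.0 - math.exp(-s * y)) / (s * y)
        total += cond * s * math.exp(s) * h
    return total

print(F_inf(0.5, 2.0), F_inf_numeric(0.5, 2.0))
```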
Example 9.14 Let $X \sim \mathrm{Unif}(0,2)$, and let $Y \equiv X$. Let $X$ be subject to interval censoring case 2 with observation times $T = (T_1, T_2)$, independent of $(X,Y)$ and uniformly distributed over $\{(t_1, t_2) : 0 \le t_1 \le 1,\ 1 \le t_2 \le 2\}$. Thus, $F_{0X}(x) = \frac{1}{2}x$, $F_{0Y}(y) = \frac{1}{2}y$ and $F_0(x,y) = \frac{1}{2}(x \wedge y)$ for $(x,y) \in [0,2]^2$.
We derive the limits for $(x,y) \in [0,\tau] \times [0,2]$ for $\tau < 2$. Using equations (9.16), (9.18), (9.20) and (9.21), we get
\[
\Lambda_{1\infty}(x) = -\log\left\{1 - \tfrac{1}{4}(1 \wedge x)^2\right\} + \left\{\tfrac{2}{3} - \tfrac{2}{3}x - \log(2-x)\right\} 1\{x > 1\},
\]
\[
F^l_{X\infty}(x) = \tfrac{1}{4}x^2\, 1\{x \le 1\} + \left\{1 - \tfrac{3}{4}(2-x) \exp\left(\tfrac{2}{3}x - \tfrac{2}{3}\right)\right\} 1\{x > 1\}.
\]
In this example the limit $F_{X\infty}$ is non-unique and hence we also derive the upper bound $F^u_{X\infty}$. To do so, we look at the $x$-intervals of the observed sets, which take the form $(0, t_1]$, $(t_1, t_2]$ and $(t_2, \infty)$, with $t_1 \in (0,1]$ and $t_2 \in (1,2]$. Since there are no right censored observations with $T_L < 1$, equation (9.6) implies that observed sets with $x$-interval $(0, t_1]$ are maximal intersections, and these maximal intersections do not converge to points when $n$ goes to infinity. On the other hand, maximal intersections corresponding to observed sets with $x$-interval $(t_1, t_2]$ do converge to points. Hence, we obtain the upper bound $F^u_{X\infty}$ by reassigning all mass at points $t_1 \le 1$ to $x = 0^+$, where $0^+$ denotes a point slightly bigger than zero to account for the fact that the $x$-intervals are left-open. This yields
\[
F^u_{X\infty}(x) = \tfrac{1}{4}\, 1\{0 < x \le 1\} + \left\{1 - \tfrac{3}{4}(2-x) \exp\left(\tfrac{2}{3}x - \tfrac{2}{3}\right)\right\} 1\{x > 1\}.
\]
Note that $F^u_{X\infty}$ is left-continuous at zero. We obtain $F^l_\infty$ by first computing $V(dx, y)$ using (9.17), and then integrating $V(dx,y)/V_1(dx)$ against $F^l_{X\infty}(x)$ using (9.24):
\[
F^l_\infty(x,y) =
\begin{cases}
F^l_{X\infty}(x), & x \le y, \\[2pt]
F^l_{X\infty}(y) + \tfrac{1}{2}y(x - y), & y \le x \le 1, \\[2pt]
F^l_{X\infty}(y) + \tfrac{3}{8}(2y - 1)\left\{\exp\left(\tfrac{2}{3}x - \tfrac{2}{3}\right) - \exp\left(\tfrac{2}{3}y - \tfrac{2}{3}\right)\right\}, & 1 \le y \le x, \\[2pt]
F^l_{X\infty}(y) + \tfrac{1}{2}y(1 - y) + \tfrac{3}{8}y^2\left\{\exp\left(\tfrac{2}{3}x - \tfrac{2}{3}\right) - 1\right\}, & y \le 1 \le x.
\end{cases}
\]
We find $F^u_\infty$ by reassigning mass from the upper right to the lower left corners of the maximal intersections, as outlined for $F_{X\infty}$. Figure 9.1 shows that $F^l_\infty$ is smoother than $F_0$ and clearly different. Figure 9.2 shows that $F^l_{X\infty}(x) < F_{0X}(x)$ for all $x \in (0,\tau]$ and $F^l_{X\infty}(x) = F^u_{X\infty}(x)$ for $x \ge 1$, and Figure 9.3 shows that both $F^l_\infty(0.75, y)$ and $F^u_\infty(0.75, y)$ are smaller than $F_0(0.75, y)$. However, the repaired estimators $\bar F_{Xn}$ and $\bar F_n(0.75, y)$ are unique and behave very well.
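The two pieces of these limit formulas can be checked for internal consistency, since $1 - F^l_{X\infty} = \exp(-\Lambda_{1\infty})$ must hold wherever the limit is continuous. A small numerical check (our own transcription of the formulas above):

```python
import math

def Lam1_inf(x):
    # Lambda_{1,inf} from Example 9.14, as transcribed above.
    val = -math.log(1.0 - 0.25 * min(1.0, x) ** 2)
    if x > 1.0:
        val += 2.0 / 3.0 - 2.0 * x / 3.0 - math.log(2.0 - x)
    return val

def FlX_inf(x):
    # F^l_{X,inf} from Example 9.14, as transcribed above.
    if x <= 1.0:
        return 0.25 * x * x
    return 1.0 - 0.75 * (2.0 - x) * math.exp(2.0 * x / 3.0 - 2.0 / 3.0)

# 1 - FlX_inf(x) and exp(-Lam1_inf(x)) should agree on both pieces,
# including at the boundary x = 1.
for x in (0.5, 1.0, 1.5, 1.9):
    print(x, 1.0 - FlX_inf(x), math.exp(-Lam1_inf(x)))
```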
Example 9.15 Let $(X, Y)$ be uniformly distributed over $\{(x,y) : 0 \le x \le y \le 1\}$. Let $X$ be subject to interval censoring case 2 with observation times $T = (T_1, T_2)$ independent of $(X,Y)$. Let the distribution of $T$ be discrete: $G\{(0.25, 0.5)\} = 0.3$, $G\{(0.25, 0.75)\} = 0.3$ and $G\{(0.5, 0.75)\} = 0.4$. Thus, $F_{0X}(x) = 2x - x^2$, $F_{0Y}(y) = y^2$ and $F_0(x,y) = (2xy - x^2)\, 1\{x \le y\} + y^2\, 1\{x > y\}$ for $(x,y) \in [0,1]^2$.
Since we can only expect to get sensible estimates for $F_0(x,y)$ for values of $x$ in the support of the observation time distribution, we derive the limits for $x \in \{0.25, 0.5, 0.75\}$ and $y \in [0,1]$. Equations (9.16), (9.18), (9.20) and (9.21) yield $F^l_{X\infty}(0.25) \approx 0.26$, $F^l_{X\infty}(0.5) \approx 0.66$ and $F^l_{X\infty}(0.75) \approx 0.94$. Since $G$ is discrete, we do not use the exponential function in (9.21), but compute the product. As in Example 9.14, $F_{X\infty}$ is non-unique. We obtain $F^u_{X\infty}$ from $F^l_{X\infty}$ by moving the probability mass from the right endpoints to the left endpoints of the maximal intersections. The possible $x$-intervals of the maximal intersections are $(0, 0.25]$, $(0, 0.5]$, $(0.25, 0.5]$, $(0.5, 0.75]$ and $(0.75, \infty)$. Consider the interval $(0, 0.25]$ and note that moving mass from $x = 0.25$ to $x = 0^+$ does not change the value of $F_{X\infty}(x)$ for $x \in \{0, 0.25, 0.5, 0.75\}$. This also holds if we move mass in the other intervals, except for the interval $(0, 0.5]$, where moving the mass from $x = 0.5$ to $x = 0^+$ increases the value of $F_{X\infty}(x)$ at $x = 0.25$. Note that the mass of $F^l_{X\infty}$ at $x = 0.5$ comes from maximal intersections with $x$-intervals $(0, 0.5]$ and $(0.25, 0.5]$. The proportion of mass coming from the latter is
\[
\alpha = P(T_L = 0.25, T_R = 0.5 \mid T_R = 0.5)
= \frac{G\{(0.25, 0.5)\}\{F_{0X}(0.5) - F_{0X}(0.25)\}}{G\{(0.25, 0.5)\}\{F_{0X}(0.5) - F_{0X}(0.25)\} + G\{(0.5, 0.75)\}\, F_{0X}(0.5)} \approx 0.238.
\]
Hence, we get $F^u_{X\infty}(0.25) = F^l_{X\infty}(0.25) + (1 - \alpha)\{F^l_{X\infty}(0.5) - F^l_{X\infty}(0.25)\} \approx 0.56$ and $F^u_{X\infty}(x) = F^l_{X\infty}(x)$ for $x \in \{0, 0.5, 0.75\}$. To derive the bivariate limit $F^l_\infty$, we first find $V(dx, y)$ using equation (9.17) and then integrate $V(dx,y)/V_1(dx)$ against $F^l_{X\infty}(x)$ using equation (9.24). This yields $F^l_\infty(0.25, y) = 0.6\, F_0(0.25, y)$, $F^l_\infty(0.5, y) = 0.3\, F_0(0.25, y) + 0.7\, F_0(0.5, y)$ and $F^l_\infty(0.75, y) \approx 0.90\, F_0(0.75, y) + 0.19\, F_0(0.5, y) - 0.084\, F_0(0.25, y)$. The upper bound $F^u_\infty(x, y)$ can be found by reassigning mass to the lower left corners of the maximal intersections. To do so, we compute
\[
\alpha(y) = P(T_L = 0.25, T_R = 0.5 \mid T_R = 0.5, Y \le y)
= \frac{G\{(0.25, 0.5)\}\{F_0(0.5, y) - F_0(0.25, y)\}}{G\{(0.25, 0.5)\}\{F_0(0.5, y) - F_0(0.25, y)\} + G\{(0.5, 0.75)\}\, F_0(0.5, y)}.
\]
We then get $F^u_\infty(0.25, y) = F^l_\infty(0.25, y) + \{1 - \alpha(y)\}\{F^l_\infty(0.5, y) - F^l_\infty(0.25, y)\}$, and the value of $F_\infty(x, y)$ is unchanged for $x \in \{0, 0.5, 0.75\}$. The discrete nature of the limit $F^l_\infty$ is visible in Figure 9.1. Figure 9.2 shows significant non-uniqueness in all estimators for $x$-values outside the support of $G$. However, $\bar F_{Xn}(x)$ is unique for $x \in \{0.25, 0.5, 0.75\}$ and very close to $F_{0X}(x)$. Finally, Figure 9.3 shows that $F_\infty(0.25, y)$ is non-unique, while the repaired MLE is unique and closely follows $F_0(0.25, y)$.
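The values $\alpha \approx 0.238$ and $F^u_{X\infty}(0.25) \approx 0.56$ can be reproduced directly (our own sketch, reusing the approximate values 0.26 and 0.66 quoted above):

```python
# Example 9.15: G{(0.25, 0.5)} = 0.3, G{(0.5, 0.75)} = 0.4, F0X(x) = 2x - x^2.
F0X = lambda x: 2.0 * x - x * x

num = 0.3 * (F0X(0.5) - F0X(0.25))
alpha = num / (num + 0.4 * F0X(0.5))

# Upper bound at x = 0.25, using the approximate limits F^l_{X,inf}(0.25) ~ 0.26
# and F^l_{X,inf}(0.5) ~ 0.66 quoted in the text:
FuX_025 = 0.26 + (1.0 - alpha) * (0.66 - 0.26)
print(round(alpha, 3), round(FuX_025, 2))
```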
[Figure 9.1 appears here: a 4 x 3 grid of contour plots, with panels $F_n$, $F_\infty$ and $F$ for Examples 1-4.]

Figure 9.1: Contour lines for the bivariate functions $F^l_n$, $F^l_\infty$ and $F_0$. All functions were computed on an equidistant grid with mesh size 0.02, and $n = 10{,}000$.
[Figure 9.2 appears here: one panel $F_X$ for each of Examples 1-4.]

Figure 9.2: Dotted: $F_{0X}$. Dashed: $F^l_{X\infty}$ and $F^u_{X\infty}$. Solid black: $\bar F^l_{Xn}$ and $\bar F^u_{Xn}$ using the equidistant grid with $K = 20$ shown in Figure 9.3. Solid grey: $F^l_{Xn}$ and $F^u_{Xn}$. In all cases $n = 10{,}000$.
[Figure 9.3 appears here: panels $F(0.49, y)$ (Example 1), $F(0.5, y)$ (Example 2), $F(0.75, y)$ (Example 3) and $F(0.25, y)$ (Example 4).]

Figure 9.3: Dotted: $F_0(x_0, y)$. Dashed: $F^l_\infty(x_0, y)$ and $F^u_\infty(x_0, y)$. Circles: $\bar F^l_n(x_0, y) = \bar F^u_n(x_0, y)$ using an equidistant grid with $K = 20$. Solid grey: $F^l_n(x_0, y)$ and $F^u_n(x_0, y)$. In all cases $n = 10{,}000$.
BIBLIOGRAPHY

Ayer, M., Brunk, H. D., Ewing, G. M., Reid, W. T. and Silverman, E. (1955). An empirical distribution function for sampling with incomplete information. Annals of Mathematical Statistics 26 641–647.

Barlow, R. E., Bartholomew, D. J., Bremner, J. M. and Brunk, H. D. (1972). Statistical Inference Under Order Restrictions. The Theory and Application of Isotonic Regression. John Wiley & Sons, London - New York - Sydney.

Bickel, P. J., Klaassen, C. A. J., Ritov, Y. and Wellner, J. A. (1993). Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press.

Birman, M. and Solomjak, M. (1967). Piece-wise polynomial approximations of functions of the classes W^α_p. Mathematics of the USSR Sbornik 73 295–317.

Dudley, R. M. (1968). Distances of probability measures and random variables. Ann. Math. Statist. 39 1563–1572.

Dudley, R. M. (1978). Central limit theorems for empirical measures. Ann. Probab. 6 899–929. (Correction: (1979) Ann. Probab. 7 909–911.)

Gentleman, R. and Geyer, C. J. (1994). Maximum likelihood for interval censored data: Consistency and computation. Biometrika 81 618–623.

Gentleman, R. and Vandal, A. C. (2001). Computational algorithms for censored-data problems using intersection graphs. J. Comput. Graph. Statist. 10 403–421.

Gentleman, R. and Vandal, A. C. (2002). Nonparametric estimation of the bivariate CDF for arbitrarily censored data. Can. J. Statist. 30 557–571.

Geskus, R. B. and Groeneboom, P. (1996). Asymptotically optimal estimation of smooth functionals for interval censoring, part 1. Statistica Neerlandica 50 69–88.

Geskus, R. B. and Groeneboom, P. (1997). Asymptotically optimal estimation of smooth functionals for interval censoring, part 2. Statistica Neerlandica 51 201–219.

Geskus, R. B. and Groeneboom, P. (1999). Asymptotically optimal estimation of smooth functionals for interval censoring, case 2. Ann. Statist. 27 627–674.

Gill, R. D. and Johansen, S. (1990). A survey of product-integration with a view toward application in survival analysis. Ann. Statist. 18 1501–1555.

Golumbic, M. C. (1980). Algorithmic Graph Theory and Perfect Graphs. Academic Press, New York.

Gordon, R. D. (1941). Values of Mills' ratio of area to bounding ordinate and of the normal probability integral for large values of the argument. Ann. Math. Statistics 12 364–366.

Groeneboom, P. (1989). Brownian motion with a parabolic drift and Airy functions. Probability Theory and Related Fields 81 79–109.

Groeneboom, P. (1996). Lectures on inverse problems. In Lectures on Probability Theory and Statistics. École d'Été de Probabilités de Saint-Flour XXIV, 1994. Springer, Berlin.

Groeneboom, P., Jongbloed, G. and Wellner, J. (2002). The support reduction algorithm for computing nonparametric function estimates in mixture models. Technical Report 2002-13, Vrije Universiteit Amsterdam, The Netherlands. Available at arXiv:math/ST/0405511.

Groeneboom, P., Jongbloed, G. and Wellner, J. A. (2001a). A canonical process for estimation of convex functions: The "invelope" of integrated Brownian motion + t^4. Ann. Statist. 29.

Groeneboom, P., Jongbloed, G. and Wellner, J. A. (2001b). Estimation of a convex function: Characterizations and asymptotic theory. Ann. Statist. 29.

Groeneboom, P. and Wellner, J. A. (1992). Information Bounds and Nonparametric Maximum Likelihood Estimation. Birkhäuser Verlag, Basel.

Hajós, G. (1957). Über eine Art von Graphen. Internationale Mathematische Nachrichten 11. Problem 65.

Huang, J. and Wellner, J. A. (1995). Asymptotic normality of the NPMLE of linear functionals for interval censored data, case 1. Statistica Neerlandica 49 153–163.

Huang, Y. and Louis, T. A. (1998). Nonparametric estimation of the joint distribution of survival time and mark variables. Biometrika 85 785–798.

Hudgens, M. G., Maathuis, M. H. and Gilbert, P. B. (2006). Nonparametric estimation of the joint distribution of a survival time subject to interval censoring and a continuous mark variable. Submitted.
Hudgens, M. G., Satten, G. A. and Longini, I. M. (2001). Nonparametric maximum likelihood estimation for competing risks survival data subject to interval censoring and truncation. Biometrics 57 74–80.

Jewell, N. P. and Kalbfleisch, J. D. (2004). Maximum likelihood estimation of ordered multinomial parameters. Biostatistics 5 291–306.

Jewell, N. P., Van der Laan, M. J. and Henneman, T. (2003). Nonparametric estimation from current status data with competing risks. Biometrika 90 183–197.

Jongbloed, G. (1995). Three Statistical Inverse Problems. Ph.D. thesis, Delft University of Technology, The Netherlands.

Jongbloed, G. (1998). The iterative convex minorant algorithm for nonparametric estimation. J. Comput. Graph. Statist. 7 310–321.

Kim, J. and Pollard, D. (1990). Cube root asymptotics. Ann. Statist. 18 191–219.

Krailo, M. D. and Pike, M. C. (1983). Estimation of the distribution of age at natural menopause from prevalence data. American Journal of Epidemiology 117 356–361.

Maathuis, M. H. (2003). Nonparametric Maximum Likelihood Estimation for Bivariate Censored Data. Master's thesis, Delft University of Technology, The Netherlands.

Maathuis, M. H. (2005). Reduction algorithm for the MLE for the distribution function of bivariate interval censored data. J. Comput. Graph. Statist. 14 352–362.

Maathuis, M. H. and Wellner, J. A. (2006). Inconsistency of the MLE for the joint distribution of interval censored survival times and continuous marks. Submitted.

MacMahon, B. and Worcester, J. (1966). Age at menopause, United States 1960–1962. National Center for Health Statistics. Vital and Health Statistics 11.

Pfanzagl, J. (1988). Consistency of maximum likelihood estimators for certain nonparametric families, in particular: mixtures. J. Statist. Plann. Inference 19 137–158.

Pollard, D. (1984). Convergence of Stochastic Processes. Springer-Verlag, New York. Available at http://ameliabedelia.library.yale.edu/dbases/pollard1984.pdf.

Robertson, T., Wright, F. T. and Dykstra, R. L. (1988). Order Restricted Statistical Inference. John Wiley & Sons, Chichester.

Rudin, W. (1976). Principles of Mathematical Analysis. 3rd ed. McGraw-Hill, New York.
Schick, A. and Yu, Q. (2000). Consistency of the GMLE with mixed case interval-censored data. Scand. J. Statist. 27 45–55.
Shorack, G. R. and Wellner, J. A. (1986). Empirical Processes with Applicationsto Statistics. John Wiley & Sons, New York.
Silverman, B. W. (1982). On the estimation of a probability density function bythe maximum penalized likelihood method. Ann. Statist. 10 795–810.
Turnbull, B. W. (1976). The empirical distribution function with arbitrarilygrouped, censored, and truncated data. J. R. Statist. Soc. B 38 290–295.
Van de Geer, S. A. (1991). The entropy bound for monotone functions. Tech. Rep. 91-10, University of Leiden, The Netherlands.
Van de Geer, S. A. (1993). Hellinger-consistency of certain nonparametric maximum likelihood estimators. Ann. Statist. 21 14–44.
Van de Geer, S. A. (1996). Rates of convergence of the maximum likelihood estimator in mixture models. J. Nonparametr. Stat. 6 293–310.
Van de Geer, S. A. (2000). Applications of Empirical Process Theory. Cambridge University Press, Cambridge.
Van der Laan, M. J. (1996). Efficient estimation in the bivariate censoring model and repairing NPMLE. Ann. Statist. 24 596–627.
Van der Vaart, A. W. (1991). On differentiable functionals. Ann. Statist. 19 178–204.
Van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer-Verlag, New York.
Van der Vaart, A. W. and Wellner, J. A. (2000). Preservation theorems for Glivenko-Cantelli and uniform Glivenko-Cantelli classes. In High Dimensional Probability II. Birkhäuser, Boston, 115–133.
Vandal, A. C., Gentleman, R. and Liu, X. (2006). Mixture nonuniqueness of the CDF NPMLE with censored data. Submitted.
Wellner, J. A. (2003). Empirical processes: Theory and applications. Lecture notes for summer school on statistics and probability, Bocconi University, Milan. Available at http://www.stat.washington.edu/jaw/RESEARCH/TALKS/talks.html.
Wong, G. Y. and Yu, Q. (1999). Generalized MLE of a joint distribution function with multivariate interval-censored data. Journal of Multivariate Analysis 69 155–166.
Yoshihara, K.-i. (1979). The Borel-Cantelli lemma for strong mixing sequences of events and their applications to LIL. Kodai Math. J. 2 148–157.
Zeidler, E. (1985). Nonlinear Functional Analysis and its Applications III: Variational Methods and Optimization. Springer-Verlag, New York.
VITA
Marloes Henriette Maathuis was born to Harry and Ina Maathuis on May 28, 1978,
in Groningen, The Netherlands. After graduating from the Praedinius Gymnasium in
Groningen in 1996, she started studies in Applied Mathematics at the Delft University
of Technology. As part of this program, she did an internship at the Ethiopian
Netherlands AIDS Research Project in Addis Ababa, Ethiopia. In 2001 she came to
the University of Washington to write her Master’s thesis, resulting in a Master of
Science degree in Applied Mathematics from the Delft University of Technology in
2003. She simultaneously started graduate studies at the University of Washington,
and graduated with a Doctor of Philosophy in Statistics in 2006. She will remain
associated with the University of Washington in the following year, as an Acting
Assistant Professor in Statistics.