A Bayesian analysis of doubly censored data using a ...€¦ · infection is of interest. In this...
Transcript of A Bayesian analysis of doubly censored data using a ...€¦ · infection is of interest. In this...
A Bayesian analysis of doubly censored data using
a hierarchical Cox model
Wei Zhang1,2,∗,†, Kathryn Chaloner1,3, Mary Kathryn Cowles1,3,
Ying Zhang1, Jack T. Stapleton4
1Department of Biostatistics, University of Iowa, Iowa City, IA
2Department of Biometrics and Data Management, Boehringer Ingelheim Pharmaceuticals, Ridgefield, CT
3Department of Statistics and Actuarial Science, University of Iowa, Iowa City, IA
4Department of Internal Medicine, University of Iowa and Iowa City VA Medical Center, Iowa City, IA
SUMMARY
Two common statistical problems in pooling survival data from several studies are addressed. The first
problem is that the data are doubly censored in that the origin is interval censored and the endpoint
event may be right censored. Two approaches to incorporating the uncertainty of interval censored
origins are developed, and then compared to more usual analyses using imputation of a single fixed
∗Correspondence to: Wei Zhang, Department of Biometrics and Data Management, Boehringer Ingelheim
Pharmaceuticals, Ridgefield, CT 06877, U.S.A.
†E-mail: [email protected]
Contract/grant sponsor: NIH/NIAID; contract/grant number: R01 058740
Contract/grant sponsor: National Security Agency; contract/grant number: H98230-04-1-0042
A BAYESIAN ANALYSIS OF DOUBLY CENSORED DATA 1
value for each origin. The second problem is that the data are collected from multiple studies and
it is likely that heterogeneity exists among the study populations. A random-effects hierarchical Cox
proportional hazards model is therefore used. The scientific problem motivating this work is a pooled
survival analysis of data sets from three studies to examine the effect of GB virus type C (GBV-C)
coinfection on survival of HIV-infected individuals. The time of HIV infection is the origin and for
each subject this time is unknown, but is known to lie later than the last time at which the subject
was known to be HIV negative, and earlier than the first time the subject was known to be HIV
positive. The use of an approximate Bayesian approach using the partial likelihood as the likelihood
is recommended because it more appropriately incorporates the uncertainty of interval censored HIV
infection times. Copyright c© 200000 John Wiley & Sons, Ltd.
KEY WORDS: GBV-C; human immunodeficiency virus; interval censoring; MCMC;
Multicenter AIDS Cohort Study; partial likelihood
1. INTRODUCTION
Infection with GB virus type C (GBV-C) in humans is common, but no association between
the virus and any known disease state has been demonstrated [1, 2, 3]. Individuals infected with
human immunodeficiency virus (HIV) are commonly coinfected with GBV-C, since GBV-C
shares the same modes of transmission as HIV. The prevalence of coinfection with GBV-C in
HIV-infected individuals ranges from 14% to 43% [4]. Several recent studies of data from early
in the epidemic, before the availablility of effective therapy, suggest that coinfection with GBV-
C is associated with prolonged survival among HIV-infected people [5, 6, 7, 8]: other studies
have concluded that there is no association [9, 10]. A meta-analysis of summary statistics was
performed and published in Zhang et al. [11] and this study indicates that persistent GBV-C
coinfection is associated with prolonged survival. To further investigate this conclusion, which
Copyright c© 200000 John Wiley & Sons, Ltd. Statist. Med. 200000; 00:0–0
Prepared using simauth.cls
2 W. ZHANG ET AL.
remains controversial [12], individual level data from separate studies is modeled here.
Original data sets from three published studies [6, 7, 8] are obtained. These data sets are
doubly censored. First, the origin (HIV infection time) Y is interval censored in that it is
known to lie in an interval Y ∈ [L,U ]. Second, the endpoint (death) time is possibly right
censored. Denote survival time T = E − Y , where E is the minimum of death time and last
followup time. The dependence of T on covariates, and in particular on the indicator of GBV-C
infection is of interest. In this paper, approaches are developed for the pooled survival analysis
of doubly censored data from multiple studies and applied to the pooled data from the three
studies.
A random effects model for the indicator of GBV-C coinfection incorporates the
heterogeneity of patient characteristics between the studies. The most popular modeling
method in survival analysis, the Cox proportional hazards model [14], avoids making any
assumptions about the baseline hazard function λ0(t). Several authors have considered Cox
survival models with random effects [15, 16, 17, 18, 19], but these random effects models for
survival data either require assumptions regarding the form of the baseline hazard function or
restrictions on the classes of models that can be fit. Sargent [20] and Gustafson [21] present a
framework through which random effects can be introduced into the Cox model, which uses
the Cox partial likelihood [22], and allows very general random-effect structures for the model
parameters.
The remainder of this paper is organized as follows. Section 2 gives an introduction to
the hierarchical Cox proportional hazards model. Section 3 presents different approaches to
incorporating the interval censoring. In Section 4, these approaches are applied to the case
study and results are summarized and compared. Section 5 concludes with discussion.
Copyright c© 200000 John Wiley & Sons, Ltd. Statist. Med. 200000; 00:0–0
Prepared using simauth.cls
A BAYESIAN ANALYSIS OF DOUBLY CENSORED DATA 3
2. HIERARCHICAL COX PROPORTIONAL HAZARDS MODEL
The standard Cox regression represents the relationship between the covariates of interest and
the hazard of event at time t through a proportional hazards model:
λ(t;xi) = λ0(t) exp(xiβ)
where λ(t;xi) is the hazard function, xi is a 1× p covariate vector for individual i, and β is a
p× 1 vector of coefficients corresponding to fixed effects.
2.1. The hierarchical model
Suppose there are q covariates to be modeled as random effects in addition to p covariates
with fixed effects. For each of the q random covariates, there are rl levels (e.g., trials, centers,
studies), for l = 1, . . . , q, with parameter vector γl = (γl1, . . . , γlrl)T corresponding to the lth
random covariate. Let wil be a scalar indicating the ith subject’s value of the lth random effect
covariate. Let zil = (zil1, . . . , zilrl), with zilj = Iiljwil, where Iilj = 1 if subject i falls in the jth
level of the lth random covariate, 0 otherwise. Define γ = (γT1 , . . . , γT
q )T and zi = (zi1, . . . , ziq).
Using this notation, the Cox model with random effects, a reparameterized frailty model [20],
can be written as λ(t; xi, zi) = λ0(t) exp(xiβ + ziγ). Let D denote a right censored data set,
D = {(ti, δi, xi, zi) : i = 1, 2, . . . , n}, where ti is time from baseline to the minimum of endpoint
time and censoring time, δi is the indicator for censoring with δi = 1 if censored and δi = 0
otherwise, and n is the number of observations. Let <ti be the set of subjects at risk at time
ti. If there are no ties, the partial likelihood incorporating random effects is then given by
L(β, γ|D) ∝n∏
i=1
[exp(xiβ + ziγ)∑
j∈<tiexp(xjβ + zjγ)
]δi
, (1)
The partial likelihood in (1) serves as the first stage of the hierarchical model and can
Copyright c© 200000 John Wiley & Sons, Ltd. Statist. Med. 200000; 00:0–0
Prepared using simauth.cls
4 W. ZHANG ET AL.
be treated as a likelihood for computing a posterior density. Kalbfleisch [23] demonstrates
that treating the partial likelihood as a likelihood leads to a limiting marginal posterior
distribution of the regression parameters, assuming an independent increments gamma process
prior distribution for the baseline cumulative hazard and independently a uniform distribution
on the regression parameters. This result is shown with a different proof in Sinha et al. [24]
which extends results to situations with time-dependent covariates, time-varying regression
parameters and grouped survival data, and presents a Bayesian justification of a modified
partial likelihood for handling ties. See also Chapter 4 of [25]. Chen at al. [26] carry out
an in-depth theoretical investigation of Bayesian inference for the Cox regression model and
discuss posterior propriety and computation based on Cox’s partial likelihood. Sargent [20]
and Gustafson [21] present methods for Bayesian analysis of multivariate survival data using
(1).
The level-specific parameters γlj are modeled as draws from a distribution gl with mean µl
and variance νl. Let g denote the joint density for γ, and assume γlj ’s are independent of each
other given µl and νl, for l = 1, . . . , q and j = 1, . . . , rl, then
g(γ|µ, ν) =q∏
l=1
rl∏
j=1
gl(γlj |µl, νl), (2)
where µ = (µ1, . . . , µq)T , and ν = (ν1, . . . , νq)T .
For the final stage of the hierarchical model, prior distributions need to be specified. A
proper prior distribution for the variance component is typically essential for proper posterior
distribution and computational stability. Let f(µ, ν|ω) represent this prior distribution, where
ω is the vector of hyperparameters and is taken to be known. Also let β have, independent of
γ, a uniform prior distribution.
Copyright c© 200000 John Wiley & Sons, Ltd. Statist. Med. 200000; 00:0–0
Prepared using simauth.cls
A BAYESIAN ANALYSIS OF DOUBLY CENSORED DATA 5
An approximate posterior distribution for the model parameters is then be assumed to be:
π(β, γ, µ, ν|D, ω) ∝ L(β, γ|D) g(γ|µ, ν) f(µ, ν|ω), (3)
where D is right censored data.
2.2. Estimation of parameters using MCMC methods
The approximate posterior distribution (3) can be estimated with Markov chain Monte Carlo
(MCMC) methods. The Metropolis-Hastings algorithm [27, 28] is a general term for a family of
MCMC methods that are useful for drawing samples from Bayesian posterior distributions. Let
θ denote the set of parameters involved in the hierarchical model. The parameter vector θ is
divided into components corresponding to the hierarchy and the single-component Metropolis-
Hastings algorithm is used [29].
3. ANALYSIS OF DOUBLY CENSORED DATA
Denote doubly censored data C = {([li, ui], ei, δi, xi, zi) : i = 1, 2, , . . . , n}, where [li, ui] is the
interval of dates within which origin yi falls, and ei is the minimum of endpoint date and
last followup date. Note that dates are defined as length of time from a fixed time point.
Further define a function D(·, ·) mapping a doubly censored data set C and a set of origin
dates y = (y1, y2, . . . , yn)T to a right censored data set; specifically D(C,y) = {(ti, δi, xi, zi) :
i = 1, 2, . . . , n}, where ti = ei − yi is time from yi to ei. A common approach in medical
applications [8, 30, 31] is that midpoints of censoring intervals are used to impute interval-
censored origins, yi = (li + ui)/2, and are then used to compute ti = ei − yi in analysis as if
they are right censored data. In this case, D(C, y) = {(ti, δi, xi, zi) : i = 1, 2, . . . , n}, where
y = (y1, y2, . . . , yn)T . Law and Brookmeyer [32] demonstrate that in HIV studies the Kaplan-
Copyright c© 200000 John Wiley & Sons, Ltd. Statist. Med. 200000; 00:0–0
Prepared using simauth.cls
6 W. ZHANG ET AL.
Meier estimate based on this method is notably biased when origin intervals are longer than
two years.
Three alternative methods are proposed below which can be implemented using MCMC.
If the partial likelihood is used as a likelihood in estimating the posterior distribution using
MCMC, the assumption has to be made that this is valid under interval censoring of the origin:
an assumption which has not been proved, but which seems reasonable given the results in
[23, 24, 26].
3.1. MCMC for imputed data (MCMCid) approach
As an improvement to using the midpoint of censoring interval, MCMCid is proposed here
that samples a value of each interval censored origin generated from an estimated distribution
of origins. Let G denote the distribution function of origins y. The estimate of G, G, can be
either obtained parametrically using the maximum likelihood estimation based on a known
distribution (e.g., Weibull, log Normal), or obtained nonparametrically using Turnbull’s self-
consistency algorithm [33]. To perform an analysis for doubly censored survival data using
MCMCid, for each subject i, i = 1, . . . , n, a value of yi, denoted yi, is randomly sampled
from G, conditional on the interval [li, ui] within which yi falls. The doubly censored data
set C and imputed origins y = (y1, y2, . . . , yn)T are then mapped to a right censored data
set D(C, y) = {(ti, δi, xi, zi) : i = 1, 2, . . . , n}, where ti = ei − yi. The hierarchical Cox
proportional hazards model can then be fit to the right censored data D(C, y) using MCMC
methods.
The MCMCid approach is straightforward to implement and understand but underestimates
the variability of parameter estimates because the uncertainty of imputed origins y is not
Copyright c© 200000 John Wiley & Sons, Ltd. Statist. Med. 200000; 00:0–0
Prepared using simauth.cls
A BAYESIAN ANALYSIS OF DOUBLY CENSORED DATA 7
incorporated.
3.2. Imputation-embedded MCMC (ieMCMC) approach
The imputation-embedded MCMC (ieMCMC) approach is developed as an alternative to the
MCMCid approach. For each MCMC iteration step m, m = 1, . . . ,M (e.g., M = 20000),
origins ym = (ym1 , . . . , ym
n )T are randomly sampled based on their distribution G, conditional
on the intervals {[li, ui] : i = 1, 2, . . . , n} within which they fall. A right censored data set,
D(C,ym) = {(tmi , δi, xi, zi) : i = 1, 2, . . . , n}, where tmi = ei−ymi , is generated at each iteration
step m. In the ieMCMC approach the origins ym vary at each MCMC sampler iteration with
a distribution based on G, but the estimate G is still fixed. The uncertainty in estimating G
by G is not therefore taken into account, although the uncertainty in ym conditional on G is
considered.
3.3. Bayesian partial likelihood (Bayesian-PL) approach
Finally, an approach is proposed with a parametric assumption on origins which is more
complete. Denote h(y|ξ) as the probability density function of the origins, with parameter
vector ξ. Let p(ξ|$) represent the prior distribution for ξ, where $ is the vector of
hyperparameters governing p(·). For a doubly censored data set C = {([li, ui], ei, δi, xi1, zi) :
i = 1, 2, , . . . , n}, define Iyi to be the indicator function which is equal to 1 if yi is in [li, ui]
and 0 otherwise. The approximate posterior distribution, based on using the partial likelihood
as the likelihood, for all the model parameters, which include the interval censored unknown
Copyright c© 200000 John Wiley & Sons, Ltd. Statist. Med. 200000; 00:0–0
Prepared using simauth.cls
8 W. ZHANG ET AL.
origins y, can be expressed as
π(β, γ, µ, ν, ξ,y|C, ω, $) ∝ L(β, γ|D(C,y)) g(γ|µ, ν)
{∏
i
h(yi|ξ) Iyi
[H(ui|ξ)−H(li|ξ)]} f(µ, ν|ω) p(ξ|$), (4)
where H(·|ξ) is cumulative density function of y. At each iteration of the MCMC sampler, the
origins y are drawn from their full conditional distribution, given all other model quantities.
The resulting analysis provides an updated estimate of the distribution of origins, as well as
correctly capturing the effect of the uncertainty in origins on estimation of the parameters of
primary interest. Because (4) is an approximation of the posterior distribution, in what follows
we refer to this as the Bayesian-PL method.
4. CASE STUDY
The pooled dataset consists of doubly censored data sets from three studies [6, 7, 8] where
GBV-C is measured late in HIV disease. Each study corresponds to a different population.
The “late” data set of the Williams study [8] is used, from the Multicenter AIDS Cohort Study
(MACS) which has a documented HIV seroconversion window of approximately 6 months on
average. In the MACS study, the intervals are based on the retrospective testing of stored blood
samples obtained on a regular basis, including those before and after HIV seroconversion. In
the other two studies, testing is also on stored blood samples, and the date of subjects first
known positive HIV test is used as the right limit of the interval, and January 1st 1978 (or date
of birth for subjects born after January 1st 1978) is treated as the left limit of the interval.
January 1st 1978 was chosen because an analysis of stored blood samples from a study in San
Francisco indicates an extremely low prevalence of HIV infection before this date [34]. The
Copyright c© 200000 John Wiley & Sons, Ltd. Statist. Med. 200000; 00:0–0
Prepared using simauth.cls
A BAYESIAN ANALYSIS OF DOUBLY CENSORED DATA 9
sample sizes of [6, 7, 8] are 362, 197 and 138, respectively. A summary of these three studies
can be found in [11]. All studies follow subjects through the time period before the advent of
highly effective therapy for HIV in 1996.
Fitting the regular Cox model to three imputed data sets separately, the estimated log
hazard ratio of GBV-C coinfection, controlling for baseline log(CD4+ count in cells/mL) and
age at HIV infection, is -1.23, -1.62 and -0.97 for [6, 7, 8], respectively. For both Xiang and
Tillmann studies [6, 7], HIV infection time is heavily interval censored, with a mean interval
width of about 10 years. In contrast, HIV infection time in the Williams study [8] has much
narrower intervals. The subjects in the Tillmann study are from Germany, and the subjects
in the other two studies are from the U.S.A. The differences among the studies could be due
to several reasons, including the fact that GBV-C testing is not standardized and each study
used a different primer for a qualitative test. A recent study [35] indicates that the sensitivity
and specificity of each test varies and the sensitivity of one particular test depends on GBV-C
RNA levels. This motivates the need for an analysis that has the ability to account for the
possibly differing effect of GBV-C infection within each population. The primary endpoint for
this pooled analysis is overall survival. All 695 eligible patients are included in the analysis.
Let xi1, xi2, and wi denote log(CD4+ count in cells/mL), age at HIV infection and GBV-
C coinfection status for subject i, respectively. Let Iij = 1 if subject i is from study j, 0
otherwise, with j = 1, 2, 3 corresponding to the three studies [6, 7, 8]. Define zi = (zi1, zi2, zi3),
where zij = Iijwi. Let β = (β1, β2)T denote the fixed effects of covariates xi1 and xi2. Let
γ = (γ1, γ2, γ3)T denote the random effects of covariates zi = (zi1, zi2, zi3), with constraint
γj ∼ N(µ, σ2), where µ is the population effect of GBV-C infection and σ2 population variance.
Copyright c© 200000 John Wiley & Sons, Ltd. Statist. Med. 200000; 00:0–0
Prepared using simauth.cls
10 W. ZHANG ET AL.
Specifically, the hazard function for individual i at time tk is given by
λ(tk;xi1, xi2, zi) = λ0(tk) exp(xi1β1 + xi2β2 + ziγ). (5)
Note that age at HIV infection xi2 is a known deterministic function of infection time yi. Fixed
effects, rather than random effects, are used for the age and CD4+ cell count as the effects
from each data set of the three studies are very similar. This is unlike the effect of GBV-C in
each study which varies more.
In what follows approximate posterior density, mean and corresponding approximate highest
posterior density region (HDR) are calculated using the approximation based on the partial
likelihood. These will be referred to as posterior density, mean and HDR for simplicity, without
explicitly qualifying that they are approximations.
4.1. Prior distributions
Other than the three studies for which we have data, there are four additional studies
[5, 9, 10, 38] providing only summary statistics (hazard ratio and corresponding 95% confidence
interval). A meta-analysis of summary statistics for these four studies was done, similarly to
the meta-analysis of all summary statistics in [11]. The estimated combined effect of GBV-C
in these four studies is −0.41 with estimated standard error 0.42. This result helps postulate
the prior distribution for µ. To be conservative, the standard error is multiplied by 2, so that
µ is normally distributed as N [−0.41, (0.42× 2)2].
A proper prior distribution for σ2 is used for the sake of computational stability of the
MCMC methods and to generate a proper approximate posterior distribution. A gamma
distribution on τ = σ−2 is used with τ distributed as Γ(0.25, 0.005). This distribution is
specified by considering what values for the random effects are reasonable. For example, a
Copyright c© 200000 John Wiley & Sons, Ltd. Statist. Med. 200000; 00:0–0
Prepared using simauth.cls
A BAYESIAN ANALYSIS OF DOUBLY CENSORED DATA 11
belief representing moderate heterogeneity between the studies would be that γj (j = 1, 2, 3)
vary around µ by ±0.10. Using 0.10 as an estimate of the standard error σ leads to a prior
estimate of τ = 100. A prior belief that presents substantial heterogeneity might be that the
γj vary around µ by ±1.0. Using 1.0 as the prior standard error of γj leads to a prior estimate
of τ = 1. The prior distribution Γ(0.25, 0.005), with mean 50 and standard error of 100, gives
reasonable weight to these extremes. A flat prior distribution is used for β.
4.2. Joint posterior distributions
A parametric model for the distribution of y is implemented in the example instead of the
non-parametric method. Reasons for this choice include that the non-parametric maximum
likelihood estimator G given by Turnbull in [33] is only unique up to an equivalence class
and also has discrete components. The parametric model gives a smoother distribution for the
times of infection, which is thought to more realistically model the reality of the spread of HIV
infection. In addition, for the pooled data set, Turnbull’s estimate of the survival function is not
very different from the maximum likelihood estimate based on assuming a Weibull distribution
(Figure 1).
4.2.1. MCMCid and ieMCMC approaches The distribution function G of HIV infection times
y is assumed to be a Weibull(α, γ). Conditional on the maximum likelihood estimates (α,
λ) and intervals for infection times, infection times y are randomly sampled. Let C denote
the doubly censored data C = {([li, ui], ei, δi, xi1, zi) : i = 1, 2, , . . . , n}. Following the prior
distributions in Section 4.1, the joint approximate posterior density of all parameters for the
Copyright c© 200000 John Wiley & Sons, Ltd. Statist. Med. 200000; 00:0–0
Prepared using simauth.cls
12 W. ZHANG ET AL.
MCMCid approach is given by
π(β, γ, µ, τ |D(C, y)) ∝ L(β, γ|D(C, y))3∏
j=1
τ12 exp(− (γj − µ)2τ
2)
exp(− (µ− µ0)2
2σ20
) τa0−1 exp(− τ
b0), (6)
where D(C, y) = {(ti, δi, xi1, xi2, zi) : i = 1, 2, . . . , n}, ti = ei − yi, xi2 is a function of yi, and
(a0, b0, µ0, σ0) = (0.25, 0.005,−0.41, 0.42× 2).
The joint posterior density of all parameters for the ieMCMC approach is the same as (6),
except that D(C, y) is replaced by D(C,ym), where D(C,ym) = {(tmi , δi, xi1, xmi2, zi) : i =
1, 2, . . . , n}, tmi = ei − ymi , and xm
i2 is a function of ymi . In the ieMCMC approach, D(C,ym)
changes at each MCMC iteration m, while D(C, y) is fixed during the process of the MCMCid
approach.
For both MCMCid and ieMCMC approaches, given data and other parameters in the
model, the full conditional posterior distribution for τ has a gamma distribution π(τ |·) ∝
Γ[a0 + 32 , b0 +
∑3j=1
(γj−µ)2
2 )], and the full conditional posterior distribution for µ has a normal
distribution π(µ|·) ∝ N [λµ0+(1−λ)γ, (1−λ)(3τ)−1], where λ = (3τ)−1
(3τ)−1+σ20
and γ =∑3
j=1 γj/3.
It is straightforward to perform the Gibbs sampling on µ and τ , but there is no direct way
to draw from parameters β and γ, and the single-component Metropolis-Hastings algorithm
[29] for the sampling of these two parameter components is used.
4.2.2. Bayesian-PL approach The distribution of y is also assumed to be a Weibull(α, λ)
for the purpose of a fair comparison to the MCMCid and ieMCMC approaches, and the
prior distributions for α and λ are specified independently as log normal distributions:
LogNorm(µα, σ2α) and LogNorm(µλ, σ2
λ), respectively. For yi ∈ [li, ui], i = 1, . . . , n, the joint
approximate posterior density of all parameters for the Bayesian-PL approach is then given
Copyright c© 200000 John Wiley & Sons, Ltd. Statist. Med. 200000; 00:0–0
Prepared using simauth.cls
A BAYESIAN ANALYSIS OF DOUBLY CENSORED DATA 13
by
π(β, γ, µ, τ, α, λ,y|C) ∝ L(β, γ|D(C,y))3∏
j=1
τ12 exp[− (γj − µ)2τ
2]
exp[− (µ− µ0)2
2σ20
] τa0−1 exp(− τ
b0)
n∏
i=1
{αλ−αyα−1i exp[−(
yi
λ)α]}{exp[−(
liλ
)α]− exp[−(ui
λ)α]}−1
α−1 exp[− (log(α)− µα)2
2σ2α
] λ−1 exp[− (log(λ)− µλ)2
2σ2λ
], (7)
where (µα, σα, µλ, σλ) = (1.31, 0.4, 3.51, 0.5).
4.3. Results of primary analysis
All methods are implemented in R [39], using the Metropolis-within-Gibbs algorithm. The
posterior full conditional distributions of µ and τ are normal and gamma respectively, so these
parameters were drawn using Gibbs sampling. The approximate posterior full conditional
distributions for parameters using the single-component Metropolis-Hastings algorithm [29] in
the Bayesian-PL approach are given in the Appendix. Code is available from the first author.
The WinBUGS software package [40] could not be used easily because of the doubly censored
data complicated by an interval censored covariate. Three independent chains are generated
for each of the 3 approaches (MCMCid, ieMCMC and Bayesian-PL approach). Each chain
consists of 14,000 iterations after a series of 6,000 burn-in iterations. The Brooks and Gelman
convergence diagnostic [41] indicates that there is no evidence against the convergence of
sampler for each parameter in all approaches.
Table I summarizes results from the MCMCid, ieMCMC and Bayesian-PL approaches
based on the hierarchical Cox proportional hazards model. The point estimates from the
three approaches are similar for each parameter except for σ. For each parameter estimate,
Copyright c© 200000 John Wiley & Sons, Ltd. Statist. Med. 200000; 00:0–0
Prepared using simauth.cls
14 W. ZHANG ET AL.
the standard error from the Bayesian-PL approach is, appropriately, the largest among
three approaches, while the standard error from the MCMCid approach is the smallest one.
Consequently, the 95% HDR from the Bayesian-PL approach is generally wider than the one
from the MCMCid or ieMCMC approach; the 95% HDR from the MCMCid approach tends
to be the narrowest. The differences are substantial, especially for the parameters of most
interest: µ and σ.
Results from all three approaches indicate that GBV-C infection is associated with
prolonged survival. From the Bayesian-PL approach, the estimated hazard ratio for GBV-
C viremia is e−µ = e−0.891 = 0.41 with 95% probability falling into the interval
(e−1.423, e−0.335) = (0.24, 0.72) after adjusting for baseline log(CD4+ count in cells/mL) and
age at HIV infection. Baseline log(CD4+ count in cells/mL) is also associated with prolonged
survival: estimated hazard ratio eβ1 = e−0.645 = 0.52, with 95% probability falling into the
interval (e−0.842, e−0.444) = (0.43, 0.64).
4.4. Results of sensitivity analysis
To examine the behavior of the estimate of µ from the Bayesian-PL approach when the
distribution for infection times y is modified, the hierarchical Cox proportional hazards model
was fit using different distributions for y. Specifically, we used a flat distribution for y, a
single Weibull(α, λ) for y, and a set of three Weibull distributions, one for each of the three
studies, Weibull(αj , λj) for yj , j = 1, 2, 3. The choice of Γ(0.25, 0.005) as the prior for τ is
also examined by using Γ(0.001, 0.001). Overall, there are 3×2 = 6 scenarios in the sensitivity
analyses, with the first one corresponding to the primary analysis (see Table II and Figure 2).
The value of µ, the estimate for the logarithm of hazard ratio of GBV-C infection, changes
Copyright c© 200000 John Wiley & Sons, Ltd. Statist. Med. 200000; 00:0–0
Prepared using simauth.cls
A BAYESIAN ANALYSIS OF DOUBLY CENSORED DATA 15
only slightly, as does the standard error and 95% HDR. This analysis suggests that the results
are insensitive to the choice of prior distribution.
5. DISCUSSION
The imputation-embedded MCMC (ieMCMC) and the Bayesian-PL approaches are developed
to deal with doubly censored survival data and compared to an MCMC analysis of a
right censored data set constructed by imputing a single value for each interval censored
origin (MCMCid). The MCMCid approach considerably underestimates the variability of the
estimates. The ieMCMC allows for some uncertainty of imputed origins to play a role but again
results in underestimation of the variability of the parameter estimates. In comparison the
Bayesian-PL approach treats unobservable origins y as unknown quantities with a parametric
distribution G. Prior distributions are then assigned for the hyperparameters of G. Interval
censoring is treated by data augmentation [42] with y drawn from their posterior predictive
distribution. The results from the Bayesian-PL approach more appropriately reflect uncertainty
than the MCMCid and ieMCMC approaches. This paper demonstrates the ability of the
Bayesian-PL approach to incorporate the uncertainty of imputed origins in doubly censored
survival data. Our sensitivity study shows the results from the Bayesian-PL approach are
reasonably insensitive to the specification of the parametric form of G (Figure 2).
Harkanen et al. [43] presented a non-parametric Bayesian intensity model for doubly
censored data in the fully Bayesian framework, which treats unobservable origins y as unknown
quantities with piece-wise constant hazard functions. Hazard functions are assigned gamma
prior distributions. Komarek et al. [44] applied a modified version of this approach to doubly
censored dental data to examine the effect of fluoride-intake on the time to caries development
Copyright c© 200000 John Wiley & Sons, Ltd. Statist. Med. 200000; 00:0–0
Prepared using simauth.cls
16 W. ZHANG ET AL.
in children. However, these approaches are complex to implement and computationally
demanding.
It should be noted that a formal justification of using the Bayesian-PL approach has not
been provided for interval censored origins and random effects. Given the results in [23, 24, 26]
this assumption is not unreasonable, but additional work needs to be done.
A parametric model could have been used for the case study and an estimate of the survival
curve obtained. The primary interest in this data analysis is however the question of whether
or not co-infection with GBV-C is associated with prolonged survival of individuals infected
with HIV disease. All three studies, and the other studies in the meta-analysis [11] use data
before the advent of highly effective therapy for HIV infection, and so the survival curve itself
is not of current interest.
The Bayesian hierarchical Cox model has accommodated random study-specific effects
and therefore incorporated between-study heterogeneity. Through the specification of prior
distributions, the prior information relevant to the parameters of interest has been taken into
account.
The methods for doubly censored survival data developed in this paper have enabled
the analysis of these data sets and lead to the conclusion that the hazard ratio with
GBV-C infection is approximately 40% of the hazard without GBV-C infection, and the
hypothesis of no difference in hazard can be ruled out with high probability. The pooled
analysis of the individual subject data therefore augments and supports the meta-analysis
result of the summary statistics previously reported in [11]. Biological plausibility for a
beneficial mechanism and in vitro evidence in inhibiting HIV replication are provided in
[6, 45, 46, 47, 48, 49]. However, as in all observational data, the results of our analyses do
Copyright c© 200000 John Wiley & Sons, Ltd. Statist. Med. 200000; 00:0–0
Prepared using simauth.cls
A BAYESIAN ANALYSIS OF DOUBLY CENSORED DATA 17
not provide evidence that GBV-C is causally related to improved survival, and it is possible
that GBV-C infection is not the reason that HIV-positive individuals coinfected with GBV-C
live longer but, rather, that it serves as a biological marker of a different factor related to
HIV disease progression. This warrants further investigation and is a subject of debate in the
scientific literature [9, 12, 13, 49, 50].
Copyright c© 200000 John Wiley & Sons, Ltd. Statist. Med. 200000; 00:0–0
Prepared using simauth.cls
18 W. ZHANG ET AL.
APPENDIX
The posterior full conditional distributions of most of the parameters in the Bayesian-PL
model were of nonstandard forms and were sampled using Metropolis or Metropolis-Hastings
updates. This appendix lists these full conditional distributions and any unusual features of
the sampling algorithms used.
A.1 Full conditional distribution of y
The unnormalized full conditional distribution of y is
π(y|β, γ, µ, τ, α, λ,C) ∝ L(β, γ|D(C,y))(n∏
i=1
yi)α−1 exp[−n∑
i=1
(yi
λ)α],
where C = {([li, ui], ei, δi, xi, zi) : i = 1, 2, , . . . , n}. At each iteration m, candidates ynewi ,
i = 1, . . . , n, are generated independently from truncated normal densities:
ynewi |ym−1
i ∼ TN(ym−1i , σ2
y|li, ui),
where [li, ui] is the interval within which subject i’s infection time yi is known to lie. Then the
entire vector ynew = (ynew1 , . . . , ynew
n ) is accepted based on the acceptance probability
R = min{1,L(βm−1, γm−1|D(C,ynew))L(βm−1, γm−1|D(C,ym−1))
× (∏n
i=1 ynewi∏n
i=1 ym−1i
)(αm−1−1)
× exp[n∑
i=1
(ym−1
i
λm−1)αm−1 −
n∑
i=1
(ynew
i
λm−1)αm−1
]×n∏
i=1
Φ(ui−ynewi
σy)− Φ( li−ynew
i
σy)
Φ(ui−ym−1i
σy)− Φ( li−ym−1
i
σy)},
where Φ(·) is the standard normal cumulative distribution function. Note that a new set of
infection times y results in a new set of calculated failure and censoring times. This in turn
produces new values of distinct survival times {tk : k = 1, . . . ,K}, so that different individuals
appear in the risk set <k involved in the partial likelihood calculations in (2).
Copyright c© 200000 John Wiley & Sons, Ltd. Statist. Med. 200000; 00:0–0
Prepared using simauth.cls
A BAYESIAN ANALYSIS OF DOUBLY CENSORED DATA 19
A.2 Full conditional distributions of β, γ, α, and λ
The unnormalized full conditional distributions for β, γ, α, and λ are given below. Each of
these parameters is sampled using a random-walk Metropolis update in which a candidate
value is drawn from a normal or multivariate normal density centered at the value from the
previous iteration.
• π(β|γ, µ, τ, α, λ,C,y) ∝ L(β, γ|D(C,y))
• π(γ|β, µ, τ, α, λ,C,y) ∝ L(β, γ|D(C,y))∏3
j=1 exp[− (γj−µ)2τ2 ]
• π(α|β, γ, µ, τ, λ,C,y) ∝ αn−1λ−nα(∏n
i=1 yi)α−1 exp[−∑ni=1(
yi
λ )α] exp[− (log(α)−µα)2
2σ2α
]
• π(λ|β, γ, µ, τ, α,C,y) ∝ λ−nα−1 exp[−∑ni=1(
yi
λ )α] exp[− (log(α)−µα)2
2σ2α
].
ACKNOWLEDGEMENTS
The authors also wish to thank Dr Hans Tillmann from the University of Leipzig, Germany, and the
Multicenter AIDS Cohort Study (MACS) for providing data. The MACS has centers located at: The
Johns Hopkins Bloomberg School of Public Health (Joseph Margolick); Howard Brown Health Center
and Northwestern University Medical School (John Phair); University of California, Los Angeles
(Roger Detels); University of Pittsburgh (Charles Rinaldo); and Data Analysis Center (Lisa Jacobson).
This research was supported by NIH/NIAID (R01 058740) and National Security Agency (H98230-
04-1-0042).
Copyright c© 200000 John Wiley & Sons, Ltd. Statist. Med. 200000; 00:0–0
Prepared using simauth.cls
20 W. ZHANG ET AL.
Table
I.C
om
pari
son
of
resu
lts
from
diff
eren
tappro
ach
esfo
rpoole
danaly
sis
usi
ng
the
hie
rarc
hic
al
Cox
pro
port
ionalhaza
rds
model
.
MC
MC
id∗
ieM
CM
C∗
Bay
esia
n-P
L∗∗
Para
met
erM
ean(S
E)
95%
HD
RM
ean(S
E)
95%
HD
RM
ean(S
E)
95%
HD
R
µ-0
.797(0
.174)
(-1.1
36,-0
.475)
-0.8
46(0
.208)
(-1.2
48,-0
.436)
-0.8
91(0
.277)
(-1.4
23,-0
.335)
σ0.3
27(0
.413)
(0.0
27,1.0
16)
0.4
25(0
.519)
(0.0
29,1.2
92)
0.7
27(0
.783)
(0.0
28,1.9
47)
γ1
-0.7
79(0
.145)
(-1.0
67,-0
.494)
-0.8
45(0
.158)
(-1.1
62,-0
.544)
-0.9
98(0
.185)
(-1.3
60,-0
.623)
γ2
-1.0
73(0
.359)
(-1.8
12,-0
.486)
-1.2
27(0
.416)
(-2.1
23,-0
.596)
-1.5
85(0
.476)
(-2.4
97,-0
.775)
γ3
-0.6
36(0
.274)
(-1.1
11,-0
.039)
-0.6
23(0
.336)
(-1.1
90,0.1
09)
-0.4
63(0
.412)
(-1.1
89,0.3
39)
β1
-0.5
38(0
.086)
(-0.7
07,-0
.371)
-0.5
69(0
.098)
(-0.7
56,-0
.377)
-0.6
45(0
.101)
(-0.8
42,-0
.444)
β2
0.0
13(0
.006)
(0.0
01,0.0
27)
0.0
17(0
.007)
(0.0
02,0.0
30)
0.0
08(0
.007)
(-0.0
06,0.0
21)
∗R
andom
imputa
tion
for
infe
ctio
nti
mes
ybase
don
the
esti
mate
dW
eibu
ll(α
,λ).
∗∗P
rior
dis
trib
uti
on
for
infe
ctio
nti
mes
y:W
eibu
ll(α
,λ),
wher
eα∼
Log
Nor
m(1
.31,0
.4)
and
λ∼
Log
Nor
m(3
.51,0
.5).
Copyright c© 200000 John Wiley & Sons, Ltd. Statist. Med. 200000; 00:0–0
Prepared using simauth.cls
A BAYESIAN ANALYSIS OF DOUBLY CENSORED DATA 21
Table II. Sensitivity analysis for the Bayesian-PL approach using different prior distributions for τ
and y.
Analysis τ Infection times y µ (SE) 95% HDR
Tau1Y1 Γ(0.25, 0.005) Weibull(α, λ) -0.891(0.277) (-1.423, -0.335)
Tau1Y2 Γ(0.25, 0.005) Weibull(αj , λj), j = 1, 2, 3 -0.866(0.288) (-1.398, -0.263)
Tau1Y3 Γ(0.25, 0.005) flat prior -0.924(0.272) (-1.417, -0.350)
Tau2Y1 Γ(0.001, 0.001) Weibull(α, λ) -0.876(0.271) (-1.385, -0.326)
Tau2Y2 Γ(0.001, 0.001) Weibull(αj , λj), j = 1, 2, 3 -0.878(0.283) (-1.392, -0.279)
Tau2Y3 Γ(0.001, 0.001) flat prior -0.911(0.250) (-1.360, -0.412)
Copyright c© 200000 John Wiley & Sons, Ltd. Statist. Med. 200000; 00:0–0
Prepared using simauth.cls
22 W. ZHANG ET AL.
20 30 40 50 60
0.0
0.2
0.4
0.6
0.8
1.0
Time in years
Sur
viva
l Pro
babi
lity
TurnbullWeibull
Figure 1. Estimated survival curves for HIV infection time of the pooled data set based on different
distribution assumptions.
Copyright c© 200000 John Wiley & Sons, Ltd. Statist. Med. 200000; 00:0–0
Prepared using simauth.cls
A BAYESIAN ANALYSIS OF DOUBLY CENSORED DATA 23
Hazard Ratio
0.0 0.2 0.4 0.6 0.8 1.0
Tau2Y3
Tau2Y2
Tau2Y1
Tau1Y3
Tau1Y2
Tau1Y1
Analysis
Bayesian−PL
MCMCid
ieMCMC
Figure 2. Estimated hazard ratio and 95% HDR from the MCMCid and ieMCMC approaches (above
the dashed line), and from sensitivity analysis with different prior distributions for τ and HIV infection
times y using the Bayesian-PL approach (below the dashed line). The primary analysis is Tau1Y1.
Copyright c© 200000 John Wiley & Sons, Ltd. Statist. Med. 200000; 00:0–0
Prepared using simauth.cls
24 W. ZHANG ET AL.
REFERENCES
1. Alter HJ. The cloning and clinical implications of HGV and HGBV-C. New England Journal of Medicine
1996; 334:1536-1537.
2. Rambusch EG, Wedemeyer H, Tillmann HL, Heringlake S, Manns MP. Significance of coinfection with
hepatitis G virus for chronic hepatitis C–a review of the literature. Z Gastroenterol 1998; 36:41-53.
3. Tillmann HL, Heringlake S, Trautwein C, et al. Antibodies against the GB virus C envelope 2 protein
before liver transplantation protect against GB virus C de novo infection. Hepatology 1998; 28:379-384.
4. Stapleton JT. GB virus type C/hepatitis G virus. Semin Liver Disease 2003; 23:137-148.
5. Lefrere JJ, Roudot-Thraval F, Morand-Joubert L, et al. Carriage of GB virus C/Hepatitis G virus RNA
is associated with a slower immunologic, virologic, and clinical progression of human immunodeficiency
virus disease in coinfected persons. Journal of Infectious Diseases 1999; 179:783-789.
6. Xiang J, Wunschmann S, Diekema DJ, et al. Effect of coinfection with GB virus C on survival among
patients with HIV infection. New England Journal of Medicine 2001; 345:707-714.
7. Tillmann HL, Heiken H, Knapir-Botor A, et al. Infection with GB virus C and reduced mortality among
HIV-infected patients. New England Journal of Medicine 2001; 345:715-724.
8. Williams CF, Klinzman D, Yamashita TE, et al. Persistent GB virus C infection and survival in HIV-
infected men. New England Journal of Medicine 2004; 350:981-990.
9. Bjorkman P, Flamholc L, Naucler A, et al. GB virus C during the natural course of HIV-1 infection:
viremia at diagnosis does not predict mortality. AIDS 2004; 18:877-886.
10. Birk M, Lindback S, Lidman C. No influence of GB virus C replication on the prognosis in a cohort of
HIV-1-infected patients. AIDS 2002; 16:2482-2485.
11. Zhang W, Chaloner K, Tillmann HL, Williams CF, Stapleton JT. Effect of early and late GBV-C viremia
on survival of HIV infected individuals: a meta-analysis. HIV Medicine 2006; 7:173-180.
12. Stapleton JT, Chaloner K, Williams CF. GB virus C infection and survival in the Amsterdam Cohort
Study. Journal of Infectious Diseases 2005; 191:2157-2158.
13. Van der Bij AK, Kloosterboer N, Prins M, et al. Reply to George and Stapleton et al. Journal of Infectious
Diseases 2005; 191:2158-2160.
14. Cox DR. Regression models and life-tables (with discussion). Journal of the Royal Statistical Society,
Series B 1972; 34:187-200.
15. Clayton DG. A model for association in biovariate life tables and its applications in epidemiological studies
Copyright c© 200000 John Wiley & Sons, Ltd. Statist. Med. 200000; 00:0–0
Prepared using simauth.cls
A BAYESIAN ANALYSIS OF DOUBLY CENSORED DATA 25
of familial tendency in chronic disease indigence. Biometrika 1978; 65:141-151.
16. Clayton DG, Cuzick J. Multivariate associations of the proportional hazards model. Journal of the Royal
Statistical Society, Series A 1985; 148:82-108.
17. Gustafson P. A Bayesian analysis of bivariate survival data from a multicenter cancer clinical trial.
Statistics in Medicine 1995; 14:2523-2535.
18. Stangl D. Prediction and decision making using Bayesian hierarchical models. Statistics in Medicine 1995;
14:2173-2190.
19. Stangl D, Greenhouse J. Assessing placebo response using Bayesian hierarchical survival models. Lifetime
Data Analysis 1998; 4:5-28.
20. Sargent DJ. A general framework for random effects survival analysis in the Cox proportional hazards
setting. Biometrics 1998; 54:1486-1497.
21. Gustafson P. Large hierarchical Bayesian analysis of multivariate survival data. Biometrics 1997; 53:230-
242.
22. Cox DR. Partial likelihood. Biometrika 1975; 62:269-275.
23. Kalbfleisch JD. Nonparametric Bayesian analysis of survival time data. Journal of the Royal Statistical
Society, Series B 1978; 40:214-221.
24. Sinha D, Ibrahim JG, Chen M. A Bayesian justification of Cox’s partial likelihood. Biometrics 2003;
90:629-641.
25. Ibrahim JG, Chen M-H, Sinha D. Bayesian Survival Analysis. Springer-Verlag Inc: New York, 2001.
26. Chen M-H, Ibrahim JG, Shao Q-M. Posterior propriety and computation for the Cox regression model
with applications to missing covariates. Biometrika 2006; 93:791-807.
27. Metropolis N, Rosenbluth AW, Rosenbluth MN, et al. Equations of state calculations by fast computing
machines. Journal of Chemical Physics 1953; 21:1087-1092.
28. Hastings WK. Monte Carlo sampling methos using Markov chains and their applications. Biometrika 1970;
57:97-109.
29. Gilks WR, Richardson S, Spiegelhalter DJ. Markov Chain Monte Carlo in Practice. Chapman and Hall,
1996.
30. Liu KJ, Darrow WW, Rutherford GW. A model-based estimate of the mean incubation period for AIDS
in homosexual men. Science 1988; 240:1333-1335.
31. Mariotto AB, Mariotti S, Pezzotti P, et al. Estimation of the acquired immunodeficiency syndrome
incubation period in intravenous drug users: a comparison with male homosexuals. American Journal
Copyright c© 200000 John Wiley & Sons, Ltd. Statist. Med. 200000; 00:0–0
Prepared using simauth.cls
26 W. ZHANG ET AL.
of Epidemiology 1992; 135:428-437.
32. Law CG, Brookmeyer R. Effects of mid-point imputation on the analysis of doubly censored data. Statistics
in Medicine 1992; 11:1569-1578.
33. Turnbull BW. The empirical distribution function with arbitrarily grouped, censored and truncated data.
Journal of the Royal Statistical Society, Series B 1976; 38:290-295.
34. Jaffe HW, Darrow WW, Echenberg DF, et al. The Aquired Immunodeficiency Syndrome in a cohort of
homosexual men: a six year follow-up study. Annals of Internal Medicine 1985; 103:210-214.
35. Souza IE, Allen JB, Xiang J, et al. Effect of primer selection on estimates of GB virus C (GBV-C)
prevalence and response to antiretroviral therapy for optimal testing for GBV-C viremia. Journal of
Clinical Microbiology 2006; 44:3105-3113.
36. Zhang W. Analysis of doubly censored survival data with applications to GBV-C and HIV studies. Ph.D.
dissertation, 2005, University of Iowa, Iowa City, Iowa.
37. Gauvreau K, DeGruttola V, Pagano M, et al. The effect of covariates on the induction time of AIDS using
improved imputation of exact seroconverson times. Statistics in Medicine 1994; 13:2021-2030.
38. Toyoda H, Fukuda Y, Hayakawa T, Takamatsu, Saito H. Effect of GB virus C/hepatitis G virus coinfection
on the course of HIV infection in hemophilia patients in Japan. Journal of Acquired Immune Deficiency
Syndrome and Human Retrovirology 1998; 17:209-213.
39. R Development Core Team. R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria. IBSN 3-900051-07-0, URL http://www.r-project.org, 2005.
40. Spiegelhalter D, Thomas A, Best N, Lunn D. WinBUGS User Manual, 2003. URL http://www.mrc-
bsu.cam.ac.uk/bugs.
41. Brooks SP, Gelman A. General methods for monitoring convergence of iterative simulations. Journal of
Computational and Graphical Statistics 1998; 7:434-455.
42. Tanner MA, Wong WH. The calculation of posterior distributions by data augmentation. Journal of
American Statistical Association 1987; 82:528-540.
43. Harkanen T, Virtanen JI, Arjas E. Caries on permanent teeth: a non-parametric Bayesian analysis.
Scandinavian Journal of Statistics 2000; 27:577-588.
44. Komarek A, Lesaffre E, Harkanen T, Declerck D, Virtanen JI. A Bayesian analysis of multivariate doubly-
interval-censored dental data. Biostatistics 2005; 6:145-155
45. Nattermann J, Nischalke HD, Kupfer B, et al. Regulation of CC chemokine receptor 5 in hepatitis G virus
infection. AIDS 2003; 17:1457-1462.
Copyright c© 200000 John Wiley & Sons, Ltd. Statist. Med. 200000; 00:0–0
Prepared using simauth.cls
A BAYESIAN ANALYSIS OF DOUBLY CENSORED DATA 27
46. Xiang J, George SL, Wunschmann S, et al. Inhibition of HIV-1 replication by GB virus C infection through
increases in RANTES, MIP-1alpha, MIP-1beta, and SDF-1. Lancet 2004; 363:2040-2046.
47. Jung S, Knauer O, Donhauser N, et al. Inhibition of HIV strains by GB virus C in cell culture can be
mediated by CD4 and CD8 T-lymphocyte derived soluble factors. AIDS 2005; 19:1267-1272.
48. Xiang J, Sathar MA, McLinden JH, Klinzman D, Chang Q, Stapleton JT. South African GB virus C
isolates: interactions between genotypes 1 and 5 isolates and HIV. Journal of Infectious Diseases 2005;
192:2147-2151.
49. George SL, Varmaz D, Stapleton JT. GB virus C replicates in primary T and B lymphocytes. Journal of
Infectious Diseases 2006; 193:451-454.
50. Stiehm ER. Disease versus disease: how one disease may ameliorate another. Pediatrics 2006; 117:184-191.
Copyright c© 200000 John Wiley & Sons, Ltd. Statist. Med. 200000; 00:0–0
Prepared using simauth.cls