A Bayesian analysis of doubly censored data using a ...€¦ · infection is of interest. In this...

A Bayesian analysis of doubly censored data using

a hierarchical Cox model

Wei Zhang1,2,∗,†, Kathryn Chaloner1,3, Mary Kathryn Cowles1,3,

Ying Zhang1, Jack T. Stapleton4

1Department of Biostatistics, University of Iowa, Iowa City, IA

2Department of Biometrics and Data Management, Boehringer Ingelheim Pharmaceuticals, Ridgefield, CT

3Department of Statistics and Actuarial Science, University of Iowa, Iowa City, IA

4Department of Internal Medicine, University of Iowa and Iowa City VA Medical Center, Iowa City, IA

SUMMARY

Two common statistical problems in pooling survival data from several studies are addressed. The first

problem is that the data are doubly censored in that the origin is interval censored and the endpoint

event may be right censored. Two approaches to incorporating the uncertainty of interval censored

origins are developed, and then compared to more usual analyses using imputation of a single fixed

∗Correspondence to: Wei Zhang, Department of Biometrics and Data Management, Boehringer Ingelheim

Pharmaceuticals, Ridgefield, CT 06877, U.S.A.

†E-mail: [email protected]

Contract/grant sponsor: NIH/NIAID; contract/grant number: R01 058740

Contract/grant sponsor: National Security Agency; contract/grant number: H98230-04-1-0042

A BAYESIAN ANALYSIS OF DOUBLY CENSORED DATA 1

value for each origin. The second problem is that the data are collected from multiple studies and

it is likely that heterogeneity exists among the study populations. A random-effects hierarchical Cox

proportional hazards model is therefore used. The scientific problem motivating this work is a pooled

survival analysis of data sets from three studies to examine the effect of GB virus type C (GBV-C)

coinfection on survival of HIV-infected individuals. The time of HIV infection is the origin and for

each subject this time is unknown, but is known to lie later than the last time at which the subject

was known to be HIV negative, and earlier than the first time the subject was known to be HIV

positive. The use of an approximate Bayesian approach using the partial likelihood as the likelihood

is recommended because it more appropriately incorporates the uncertainty of interval censored HIV

infection times. Copyright c© 200000 John Wiley & Sons, Ltd.

KEY WORDS: GBV-C; human immunodeficiency virus; interval censoring; MCMC;

Multicenter AIDS Cohort Study; partial likelihood

1. INTRODUCTION

Infection with GB virus type C (GBV-C) in humans is common, but no association between

the virus and any known disease state has been demonstrated [1, 2, 3]. Individuals infected with

human immunodeficiency virus (HIV) are commonly coinfected with GBV-C, since GBV-C

shares the same modes of transmission as HIV. The prevalence of coinfection with GBV-C in

HIV-infected individuals ranges from 14% to 43% [4]. Several recent studies of data from early

in the epidemic, before the availablility of effective therapy, suggest that coinfection with GBV-

C is associated with prolonged survival among HIV-infected people [5, 6, 7, 8]: other studies

have concluded that there is no association [9, 10]. A meta-analysis of summary statistics was

performed and published in Zhang et al. [11] and this study indicates that persistent GBV-C

coinfection is associated with prolonged survival. To further investigate this conclusion, which

Copyright c© 200000 John Wiley & Sons, Ltd. Statist. Med. 200000; 00:0–0

Prepared using simauth.cls

2 W. ZHANG ET AL.

remains controversial [12], individual level data from separate studies is modeled here.

Original data sets from three published studies [6, 7, 8] are obtained. These data sets are

doubly censored. First, the origin (HIV infection time) Y is interval censored in that it is

known to lie in an interval Y ∈ [L,U ]. Second, the endpoint (death) time is possibly right

censored. Denote survival time T = E − Y , where E is the minimum of death time and last

followup time. The dependence of T on covariates, and in particular on the indicator of GBV-C

infection is of interest. In this paper, approaches are developed for the pooled survival analysis

of doubly censored data from multiple studies and applied to the pooled data from the three

studies.

A random effects model for the indicator of GBV-C coinfection incorporates the

heterogeneity of patient characteristics between the studies. The most popular modeling

method in survival analysis, the Cox proportional hazards model [14], avoids making any

assumptions about the baseline hazard function λ0(t). Several authors have considered Cox

survival models with random effects [15, 16, 17, 18, 19], but these random effects models for

survival data either require assumptions regarding the form of the baseline hazard function or

restrictions on the classes of models that can be fit. Sargent [20] and Gustafson [21] present a

framework through which random effects can be introduced into the Cox model, which uses

the Cox partial likelihood [22], and allows very general random-effect structures for the model

parameters.

The remainder of this paper is organized as follows. Section 2 gives an introduction to

the hierarchical Cox proportional hazards model. Section 3 presents different approaches to

incorporating the interval censoring. In Section 4, these approaches are applied to the case

study and results are summarized and compared. Section 5 concludes with discussion.




2. HIERARCHICAL COX PROPORTIONAL HAZARDS MODEL

The standard Cox regression represents the relationship between the covariates of interest and

the hazard of event at time t through a proportional hazards model:

λ(t;xi) = λ0(t) exp(xiβ)

where λ(t;xi) is the hazard function, xi is a 1× p covariate vector for individual i, and β is a

p× 1 vector of coefficients corresponding to fixed effects.

2.1. The hierarchical model

Suppose there are q covariates to be modeled as random effects in addition to p covariates

with fixed effects. For each of the q random covariates, there are rl levels (e.g., trials, centers,

studies), for l = 1, . . . , q, with parameter vector γl = (γl1, . . . , γlrl)T corresponding to the lth

random covariate. Let wil be a scalar indicating the ith subject’s value of the lth random effect

covariate. Let zil = (zil1, . . . , zilrl), with zilj = Iiljwil, where Iilj = 1 if subject i falls in the jth

level of the lth random covariate, 0 otherwise. Define γ = (γT1 , . . . , γT

q )T and zi = (zi1, . . . , ziq).

Using this notation, the Cox model with random effects, a reparameterized frailty model [20],

can be written as λ(t; xi, zi) = λ0(t) exp(xiβ + ziγ). Let D denote a right censored data set,

D = {(ti, δi, xi, zi) : i = 1, 2, . . . , n}, where ti is time from baseline to the minimum of endpoint

time and censoring time, δi is the indicator for censoring with δi = 1 if censored and δi = 0

otherwise, and n is the number of observations. Let <ti be the set of subjects at risk at time

ti. If there are no ties, the partial likelihood incorporating random effects is then given by

L(β, γ|D) ∝n∏

i=1

[exp(xiβ + ziγ)∑

j∈<tiexp(xjβ + zjγ)

]δi

, (1)

The partial likelihood in (1) serves as the first stage of the hierarchical model and can



4 W. ZHANG ET AL.

be treated as a likelihood for computing a posterior density. Kalbfleisch [23] demonstrates

that treating the partial likelihood as a likelihood leads to a limiting marginal posterior

distribution of the regression parameters, assuming an independent increments gamma process

prior distribution for the baseline cumulative hazard and independently a uniform distribution

on the regression parameters. This result is shown with a different proof in Sinha et al. [24]

which extends results to situations with time-dependent covariates, time-varying regression

parameters and grouped survival data, and presents a Bayesian justification of a modified

partial likelihood for handling ties. See also Chapter 4 of [25]. Chen at al. [26] carry out

an in-depth theoretical investigation of Bayesian inference for the Cox regression model and

discuss posterior propriety and computation based on Cox’s partial likelihood. Sargent [20]

and Gustafson [21] present methods for Bayesian analysis of multivariate survival data using

(1).

The level-specific parameters γlj are modeled as draws from a distribution gl with mean µl

and variance νl. Let g denote the joint density for γ, and assume γlj ’s are independent of each

other given µl and νl, for l = 1, . . . , q and j = 1, . . . , rl, then

g(γ|µ, ν) =q∏

l=1

rl∏

j=1

gl(γlj |µl, νl), (2)

where µ = (µ1, . . . , µq)T , and ν = (ν1, . . . , νq)T .

For the final stage of the hierarchical model, prior distributions need to be specified. A

proper prior distribution for the variance component is typically essential for proper posterior

distribution and computational stability. Let f(µ, ν|ω) represent this prior distribution, where

ω is the vector of hyperparameters and is taken to be known. Also let β have, independent of

γ, a uniform prior distribution.




An approximate posterior distribution for the model parameters is then be assumed to be:

π(β, γ, µ, ν|D, ω) ∝ L(β, γ|D) g(γ|µ, ν) f(µ, ν|ω), (3)

where D is right censored data.

2.2. Estimation of parameters using MCMC methods

The approximate posterior distribution (3) can be estimated with Markov chain Monte Carlo

(MCMC) methods. The Metropolis-Hastings algorithm [27, 28] is a general term for a family of

MCMC methods that are useful for drawing samples from Bayesian posterior distributions. Let

θ denote the set of parameters involved in the hierarchical model. The parameter vector θ is

divided into components corresponding to the hierarchy and the single-component Metropolis-

Hastings algorithm is used [29].

3. ANALYSIS OF DOUBLY CENSORED DATA

Denote doubly censored data C = {([li, ui], ei, δi, xi, zi) : i = 1, 2, , . . . , n}, where [li, ui] is the

interval of dates within which origin yi falls, and ei is the minimum of endpoint date and

last followup date. Note that dates are defined as length of time from a fixed time point.

Further define a function D(·, ·) mapping a doubly censored data set C and a set of origin

dates y = (y1, y2, . . . , yn)T to a right censored data set; specifically D(C,y) = {(ti, δi, xi, zi) :

i = 1, 2, . . . , n}, where ti = ei − yi is time from yi to ei. A common approach in medical

applications [8, 30, 31] is that midpoints of censoring intervals are used to impute interval-

censored origins, yi = (li + ui)/2, and are then used to compute ti = ei − yi in analysis as if

they are right censored data. In this case, D(C, y) = {(ti, δi, xi, zi) : i = 1, 2, . . . , n}, where

y = (y1, y2, . . . , yn)T . Law and Brookmeyer [32] demonstrate that in HIV studies the Kaplan-



6 W. ZHANG ET AL.

Meier estimate based on this method is notably biased when origin intervals are longer than

two years.

Three alternative methods are proposed below which can be implemented using MCMC.

If the partial likelihood is used as a likelihood in estimating the posterior distribution using

MCMC, the assumption has to be made that this is valid under interval censoring of the origin:

an assumption which has not been proved, but which seems reasonable given the results in

[23, 24, 26].

3.1. MCMC for imputed data (MCMCid) approach

As an improvement to using the midpoint of censoring interval, MCMCid is proposed here

that samples a value of each interval censored origin generated from an estimated distribution

of origins. Let G denote the distribution function of origins y. The estimate of G, G, can be

either obtained parametrically using the maximum likelihood estimation based on a known

distribution (e.g., Weibull, log Normal), or obtained nonparametrically using Turnbull’s self-

consistency algorithm [33]. To perform an analysis for doubly censored survival data using

MCMCid, for each subject i, i = 1, . . . , n, a value of yi, denoted yi, is randomly sampled

from G, conditional on the interval [li, ui] within which yi falls. The doubly censored data

set C and imputed origins y = (y1, y2, . . . , yn)T are then mapped to a right censored data

set D(C, y) = {(ti, δi, xi, zi) : i = 1, 2, . . . , n}, where ti = ei − yi. The hierarchical Cox

proportional hazards model can then be fit to the right censored data D(C, y) using MCMC

methods.

The MCMCid approach is straightforward to implement and understand but underestimates

the variability of parameter estimates because the uncertainty of imputed origins y is not




incorporated.

3.2. Imputation-embedded MCMC (ieMCMC) approach

The imputation-embedded MCMC (ieMCMC) approach is developed as an alternative to the

MCMCid approach. For each MCMC iteration step m, m = 1, . . . ,M (e.g., M = 20000),

origins ym = (ym1 , . . . , ym

n )T are randomly sampled based on their distribution G, conditional

on the intervals {[li, ui] : i = 1, 2, . . . , n} within which they fall. A right censored data set,

D(C,ym) = {(tmi , δi, xi, zi) : i = 1, 2, . . . , n}, where tmi = ei−ymi , is generated at each iteration

step m. In the ieMCMC approach the origins ym vary at each MCMC sampler iteration with

a distribution based on G, but the estimate G is still fixed. The uncertainty in estimating G

by G is not therefore taken into account, although the uncertainty in ym conditional on G is

considered.

3.3. Bayesian partial likelihood (Bayesian-PL) approach

Finally, an approach is proposed with a parametric assumption on origins which is more

complete. Denote h(y|ξ) as the probability density function of the origins, with parameter

vector ξ. Let p(ξ|$) represent the prior distribution for ξ, where $ is the vector of

hyperparameters governing p(·). For a doubly censored data set C = {([li, ui], ei, δi, xi1, zi) :

i = 1, 2, , . . . , n}, define Iyi to be the indicator function which is equal to 1 if yi is in [li, ui]

and 0 otherwise. The approximate posterior distribution, based on using the partial likelihood

as the likelihood, for all the model parameters, which include the interval censored unknown



8 W. ZHANG ET AL.

origins y, can be expressed as

π(β, γ, µ, ν, ξ,y|C, ω, $) ∝ L(β, γ|D(C,y)) g(γ|µ, ν)

{∏

i

h(yi|ξ) Iyi

[H(ui|ξ)−H(li|ξ)]} f(µ, ν|ω) p(ξ|$), (4)

where H(·|ξ) is cumulative density function of y. At each iteration of the MCMC sampler, the

origins y are drawn from their full conditional distribution, given all other model quantities.

The resulting analysis provides an updated estimate of the distribution of origins, as well as

correctly capturing the effect of the uncertainty in origins on estimation of the parameters of

primary interest. Because (4) is an approximation of the posterior distribution, in what follows

we refer to this as the Bayesian-PL method.

4. CASE STUDY

The pooled dataset consists of doubly censored data sets from three studies [6, 7, 8] where

GBV-C is measured late in HIV disease. Each study corresponds to a different population.

The “late” data set of the Williams study [8] is used, from the Multicenter AIDS Cohort Study

(MACS) which has a documented HIV seroconversion window of approximately 6 months on

average. In the MACS study, the intervals are based on the retrospective testing of stored blood

samples obtained on a regular basis, including those before and after HIV seroconversion. In

the other two studies, testing is also on stored blood samples, and the date of subjects first

known positive HIV test is used as the right limit of the interval, and January 1st 1978 (or date

of birth for subjects born after January 1st 1978) is treated as the left limit of the interval.

January 1st 1978 was chosen because an analysis of stored blood samples from a study in San

Francisco indicates an extremely low prevalence of HIV infection before this date [34]. The




sample sizes of [6, 7, 8] are 362, 197 and 138, respectively. A summary of these three studies

can be found in [11]. All studies follow subjects through the time period before the advent of

highly effective therapy for HIV in 1996.

Fitting the regular Cox model to three imputed data sets separately, the estimated log

hazard ratio of GBV-C coinfection, controlling for baseline log(CD4+ count in cells/mL) and

age at HIV infection, is -1.23, -1.62 and -0.97 for [6, 7, 8], respectively. For both Xiang and

Tillmann studies [6, 7], HIV infection time is heavily interval censored, with a mean interval

width of about 10 years. In contrast, HIV infection time in the Williams study [8] has much

narrower intervals. The subjects in the Tillmann study are from Germany, and the subjects

in the other two studies are from the U.S.A. The differences among the studies could be due

to several reasons, including the fact that GBV-C testing is not standardized and each study

used a different primer for a qualitative test. A recent study [35] indicates that the sensitivity

and specificity of each test varies and the sensitivity of one particular test depends on GBV-C

RNA levels. This motivates the need for an analysis that has the ability to account for the

possibly differing effect of GBV-C infection within each population. The primary endpoint for

this pooled analysis is overall survival. All 695 eligible patients are included in the analysis.

Let xi1, xi2, and wi denote log(CD4+ count in cells/mL), age at HIV infection and GBV-

C coinfection status for subject i, respectively. Let Iij = 1 if subject i is from study j, 0

otherwise, with j = 1, 2, 3 corresponding to the three studies [6, 7, 8]. Define zi = (zi1, zi2, zi3),

where zij = Iijwi. Let β = (β1, β2)T denote the fixed effects of covariates xi1 and xi2. Let

γ = (γ1, γ2, γ3)T denote the random effects of covariates zi = (zi1, zi2, zi3), with constraint

γj ∼ N(µ, σ2), where µ is the population effect of GBV-C infection and σ2 population variance.



10 W. ZHANG ET AL.

Specifically, the hazard function for individual i at time tk is given by

λ(tk;xi1, xi2, zi) = λ0(tk) exp(xi1β1 + xi2β2 + ziγ). (5)

Note that age at HIV infection xi2 is a known deterministic function of infection time yi. Fixed

effects, rather than random effects, are used for the age and CD4+ cell count as the effects

from each data set of the three studies are very similar. This is unlike the effect of GBV-C in

each study which varies more.

In what follows approximate posterior density, mean and corresponding approximate highest

posterior density region (HDR) are calculated using the approximation based on the partial

likelihood. These will be referred to as posterior density, mean and HDR for simplicity, without

explicitly qualifying that they are approximations.

4.1. Prior distributions

Other than the three studies for which we have data, there are four additional studies

[5, 9, 10, 38] providing only summary statistics (hazard ratio and corresponding 95% confidence

interval). A meta-analysis of summary statistics for these four studies was done, similarly to

the meta-analysis of all summary statistics in [11]. The estimated combined effect of GBV-C

in these four studies is −0.41 with estimated standard error 0.42. This result helps postulate

the prior distribution for µ. To be conservative, the standard error is multiplied by 2, so that

µ is normally distributed as N [−0.41, (0.42× 2)2].

A proper prior distribution for σ2 is used for the sake of computational stability of the

MCMC methods and to generate a proper approximate posterior distribution. A gamma

distribution on τ = σ−2 is used with τ distributed as Γ(0.25, 0.005). This distribution is

specified by considering what values for the random effects are reasonable. For example, a




belief representing moderate heterogeneity between the studies would be that γj (j = 1, 2, 3)

vary around µ by ±0.10. Using 0.10 as an estimate of the standard error σ leads to a prior

estimate of τ = 100. A prior belief that presents substantial heterogeneity might be that the

γj vary around µ by ±1.0. Using 1.0 as the prior standard error of γj leads to a prior estimate

of τ = 1. The prior distribution Γ(0.25, 0.005), with mean 50 and standard error of 100, gives

reasonable weight to these extremes. A flat prior distribution is used for β.

4.2. Joint posterior distributions

A parametric model for the distribution of y is implemented in the example instead of the

non-parametric method. Reasons for this choice include that the non-parametric maximum

likelihood estimator G given by Turnbull in [33] is only unique up to an equivalence class

and also has discrete components. The parametric model gives a smoother distribution for the

times of infection, which is thought to more realistically model the reality of the spread of HIV

infection. In addition, for the pooled data set, Turnbull’s estimate of the survival function is not

very different from the maximum likelihood estimate based on assuming a Weibull distribution

(Figure 1).

4.2.1. MCMCid and ieMCMC approaches The distribution function G of HIV infection times

y is assumed to be a Weibull(α, γ). Conditional on the maximum likelihood estimates (α,

λ) and intervals for infection times, infection times y are randomly sampled. Let C denote

the doubly censored data C = {([li, ui], ei, δi, xi1, zi) : i = 1, 2, , . . . , n}. Following the prior

distributions in Section 4.1, the joint approximate posterior density of all parameters for the



12 W. ZHANG ET AL.

MCMCid approach is given by

π(β, γ, µ, τ |D(C, y)) ∝ L(β, γ|D(C, y))3∏

j=1

τ12 exp(− (γj − µ)2τ

2)

exp(− (µ− µ0)2

2σ20

) τa0−1 exp(− τ

b0), (6)

where D(C, y) = {(ti, δi, xi1, xi2, zi) : i = 1, 2, . . . , n}, ti = ei − yi, xi2 is a function of yi, and

(a0, b0, µ0, σ0) = (0.25, 0.005,−0.41, 0.42× 2).

The joint posterior density of all parameters for the ieMCMC approach is the same as (6),

except that D(C, y) is replaced by D(C,ym), where D(C,ym) = {(tmi , δi, xi1, xmi2, zi) : i =

1, 2, . . . , n}, tmi = ei − ymi , and xm

i2 is a function of ymi . In the ieMCMC approach, D(C,ym)

changes at each MCMC iteration m, while D(C, y) is fixed during the process of the MCMCid

approach.

For both MCMCid and ieMCMC approaches, given data and other parameters in the

model, the full conditional posterior distribution for τ has a gamma distribution π(τ |·) ∝

Γ[a0 + 32 , b0 +

∑3j=1

(γj−µ)2

2 )], and the full conditional posterior distribution for µ has a normal

distribution π(µ|·) ∝ N [λµ0+(1−λ)γ, (1−λ)(3τ)−1], where λ = (3τ)−1

(3τ)−1+σ20

and γ =∑3

j=1 γj/3.

It is straightforward to perform the Gibbs sampling on µ and τ , but there is no direct way

to draw from parameters β and γ, and the single-component Metropolis-Hastings algorithm

[29] for the sampling of these two parameter components is used.

4.2.2. Bayesian-PL approach The distribution of y is also assumed to be a Weibull(α, λ)

for the purpose of a fair comparison to the MCMCid and ieMCMC approaches, and the

prior distributions for α and λ are specified independently as log normal distributions:

LogNorm(µα, σ2α) and LogNorm(µλ, σ2

λ), respectively. For yi ∈ [li, ui], i = 1, . . . , n, the joint

approximate posterior density of all parameters for the Bayesian-PL approach is then given




by

π(β, γ, µ, τ, α, λ,y|C) ∝ L(β, γ|D(C,y))3∏

j=1

τ12 exp[− (γj − µ)2τ

2]

exp[− (µ− µ0)2

2σ20

] τa0−1 exp(− τ

b0)

n∏

i=1

{αλ−αyα−1i exp[−(

yi

λ)α]}{exp[−(

liλ

)α]− exp[−(ui

λ)α]}−1

α−1 exp[− (log(α)− µα)2

2σ2α

] λ−1 exp[− (log(λ)− µλ)2

2σ2λ

], (7)

where (µα, σα, µλ, σλ) = (1.31, 0.4, 3.51, 0.5).

4.3. Results of primary analysis

All methods are implemented in R [39], using the Metropolis-within-Gibbs algorithm. The

posterior full conditional distributions of µ and τ are normal and gamma respectively, so these

parameters were drawn using Gibbs sampling. The approximate posterior full conditional

distributions for parameters using the single-component Metropolis-Hastings algorithm [29] in

the Bayesian-PL approach are given in the Appendix. Code is available from the first author.

The WinBUGS software package [40] could not be used easily because of the doubly censored

data complicated by an interval censored covariate. Three independent chains are generated

for each of the 3 approaches (MCMCid, ieMCMC and Bayesian-PL approach). Each chain

consists of 14,000 iterations after a series of 6,000 burn-in iterations. The Brooks and Gelman

convergence diagnostic [41] indicates that there is no evidence against the convergence of

sampler for each parameter in all approaches.

Table I summarizes results from the MCMCid, ieMCMC and Bayesian-PL approaches

based on the hierarchical Cox proportional hazards model. The point estimates from the

three approaches are similar for each parameter except for σ. For each parameter estimate,



14 W. ZHANG ET AL.

the standard error from the Bayesian-PL approach is, appropriately, the largest among

three approaches, while the standard error from the MCMCid approach is the smallest one.

Consequently, the 95% HDR from the Bayesian-PL approach is generally wider than the one

from the MCMCid or ieMCMC approach; the 95% HDR from the MCMCid approach tends

to be the narrowest. The differences are substantial, especially for the parameters of most

interest: µ and σ.

Results from all three approaches indicate that GBV-C infection is associated with

prolonged survival. From the Bayesian-PL approach, the estimated hazard ratio for GBV-

C viremia is e−µ = e−0.891 = 0.41 with 95% probability falling into the interval

(e−1.423, e−0.335) = (0.24, 0.72) after adjusting for baseline log(CD4+ count in cells/mL) and

age at HIV infection. Baseline log(CD4+ count in cells/mL) is also associated with prolonged

survival: estimated hazard ratio eβ1 = e−0.645 = 0.52, with 95% probability falling into the

interval (e−0.842, e−0.444) = (0.43, 0.64).

4.4. Results of sensitivity analysis

To examine the behavior of the estimate of µ from the Bayesian-PL approach when the

distribution for infection times y is modified, the hierarchical Cox proportional hazards model

was fit using different distributions for y. Specifically, we used a flat distribution for y, a

single Weibull(α, λ) for y, and a set of three Weibull distributions, one for each of the three

studies, Weibull(αj , λj) for yj , j = 1, 2, 3. The choice of Γ(0.25, 0.005) as the prior for τ is

also examined by using Γ(0.001, 0.001). Overall, there are 3×2 = 6 scenarios in the sensitivity

analyses, with the first one corresponding to the primary analysis (see Table II and Figure 2).

The value of µ, the estimate for the logarithm of hazard ratio of GBV-C infection, changes




only slightly, as does the standard error and 95% HDR. This analysis suggests that the results

are insensitive to the choice of prior distribution.

5. DISCUSSION

The imputation-embedded MCMC (ieMCMC) and the Bayesian-PL approaches are developed

to deal with doubly censored survival data and compared to an MCMC analysis of a

right censored data set constructed by imputing a single value for each interval censored

origin (MCMCid). The MCMCid approach considerably underestimates the variability of the

estimates. The ieMCMC allows for some uncertainty of imputed origins to play a role but again

results in underestimation of the variability of the parameter estimates. In comparison the

Bayesian-PL approach treats unobservable origins y as unknown quantities with a parametric

distribution G. Prior distributions are then assigned for the hyperparameters of G. Interval

censoring is treated by data augmentation [42] with y drawn from their posterior predictive

distribution. The results from the Bayesian-PL approach more appropriately reflect uncertainty

than the MCMCid and ieMCMC approaches. This paper demonstrates the ability of the

Bayesian-PL approach to incorporate the uncertainty of imputed origins in doubly censored

survival data. Our sensitivity study shows the results from the Bayesian-PL approach are

reasonably insensitive to the specification of the parametric form of G (Figure 2).

Harkanen et al. [43] presented a non-parametric Bayesian intensity model for doubly

censored data in the fully Bayesian framework, which treats unobservable origins y as unknown

quantities with piece-wise constant hazard functions. Hazard functions are assigned gamma

prior distributions. Komarek et al. [44] applied a modified version of this approach to doubly

censored dental data to examine the effect of fluoride-intake on the time to caries development



16 W. ZHANG ET AL.

in children. However, these approaches are complex to implement and computationally

demanding.

It should be noted that a formal justification of using the Bayesian-PL approach has not

been provided for interval censored origins and random effects. Given the results in [23, 24, 26]

this assumption is not unreasonable, but additional work needs to be done.

A parametric model could have been used for the case study and an estimate of the survival

curve obtained. The primary interest in this data analysis is however the question of whether

or not co-infection with GBV-C is associated with prolonged survival of individuals infected

with HIV disease. All three studies, and the other studies in the meta-analysis [11] use data

before the advent of highly effective therapy for HIV infection, and so the survival curve itself

is not of current interest.

The Bayesian hierarchical Cox model has accommodated random study-specific effects

and therefore incorporated between-study heterogeneity. Through the specification of prior

distributions, the prior information relevant to the parameters of interest has been taken into

account.

The methods for doubly censored survival data developed in this paper have enabled

the analysis of these data sets and lead to the conclusion that the hazard ratio with

GBV-C infection is approximately 40% of the hazard without GBV-C infection, and the

hypothesis of no difference in hazard can be ruled out with high probability. The pooled

analysis of the individual subject data therefore augments and supports the meta-analysis

result of the summary statistics previously reported in [11]. Biological plausibility for a

beneficial mechanism and in vitro evidence in inhibiting HIV replication are provided in

[6, 45, 46, 47, 48, 49]. However, as in all observational data, the results of our analyses do




not provide evidence that GBV-C is causally related to improved survival, and it is possible

that GBV-C infection is not the reason that HIV-positive individuals coinfected with GBV-C

live longer but, rather, that it serves as a biological marker of a different factor related to

HIV disease progression. This warrants further investigation and is a subject of debate in the

scientific literature [9, 12, 13, 49, 50].



18 W. ZHANG ET AL.

APPENDIX

The posterior full conditional distributions of most of the parameters in the Bayesian-PL

model were of nonstandard forms and were sampled using Metropolis or Metropolis-Hastings

updates. This appendix lists these full conditional distributions and any unusual features of

the sampling algorithms used.

A.1 Full conditional distribution of y

The unnormalized full conditional distribution of y is

π(y|β, γ, µ, τ, α, λ,C) ∝ L(β, γ|D(C,y))(n∏

i=1

yi)α−1 exp[−n∑

i=1

(yi

λ)α],

where C = {([li, ui], ei, δi, xi, zi) : i = 1, 2, , . . . , n}. At each iteration m, candidates ynewi ,

i = 1, . . . , n, are generated independently from truncated normal densities:

ynewi |ym−1

i ∼ TN(ym−1i , σ2

y|li, ui),

where [li, ui] is the interval within which subject i’s infection time yi is known to lie. Then the

entire vector ynew = (ynew1 , . . . , ynew

n ) is accepted based on the acceptance probability

R = min{1,L(βm−1, γm−1|D(C,ynew))L(βm−1, γm−1|D(C,ym−1))

× (∏n

i=1 ynewi∏n

i=1 ym−1i

)(αm−1−1)

× exp[n∑

i=1

(ym−1

i

λm−1)αm−1 −

n∑

i=1

(ynew

i

λm−1)αm−1

]×n∏

i=1

Φ(ui−ynewi

σy)− Φ( li−ynew

i

σy)

Φ(ui−ym−1i

σy)− Φ( li−ym−1

i

σy)},

where Φ(·) is the standard normal cumulative distribution function. Note that a new set of

infection times y results in a new set of calculated failure and censoring times. This in turn

produces new values of distinct survival times {tk : k = 1, . . . ,K}, so that different individuals

appear in the risk set <k involved in the partial likelihood calculations in (2).




A.2 Full conditional distributions of β, γ, α, and λ

The unnormalized full conditional distributions for β, γ, α, and λ are given below. Each of

these parameters is sampled using a random-walk Metropolis update in which a candidate

value is drawn from a normal or multivariate normal density centered at the value from the

previous iteration.

• π(β|γ, µ, τ, α, λ,C,y) ∝ L(β, γ|D(C,y))

• π(γ|β, µ, τ, α, λ,C,y) ∝ L(β, γ|D(C,y))∏3

j=1 exp[− (γj−µ)2τ2 ]

• π(α|β, γ, µ, τ, λ,C,y) ∝ αn−1λ−nα(∏n

i=1 yi)α−1 exp[−∑ni=1(

yi

λ )α] exp[− (log(α)−µα)2

2σ2α

]

• π(λ|β, γ, µ, τ, α,C,y) ∝ λ−nα−1 exp[−∑ni=1(

yi

λ )α] exp[− (log(α)−µα)2

2σ2α

].

ACKNOWLEDGEMENTS

The authors also wish to thank Dr Hans Tillmann from the University of Leipzig, Germany, and the

Multicenter AIDS Cohort Study (MACS) for providing data. The MACS has centers located at: The

Johns Hopkins Bloomberg School of Public Health (Joseph Margolick); Howard Brown Health Center

and Northwestern University Medical School (John Phair); University of California, Los Angeles

(Roger Detels); University of Pittsburgh (Charles Rinaldo); and Data Analysis Center (Lisa Jacobson).

This research was supported by NIH/NIAID (R01 058740) and National Security Agency (H98230-

04-1-0042).



20 W. ZHANG ET AL.

Table

I.C

om

pari

son

of

resu

lts

from

diff

eren

tappro

ach

esfo

rpoole

danaly

sis

usi

ng

the

hie

rarc

hic

al

Cox

pro

port

ionalhaza

rds

model

.

MC

MC

id∗

ieM

CM

C∗

Bay

esia

n-P

L∗∗

Para

met

erM

ean(S

E)

95%

HD

RM

ean(S

E)

95%

HD

RM

ean(S

E)

95%

HD

R

µ-0

.797(0

.174)

(-1.1

36,-0

.475)

-0.8

46(0

.208)

(-1.2

48,-0

.436)

-0.8

91(0

.277)

(-1.4

23,-0

.335)

σ0.3

27(0

.413)

(0.0

27,1.0

16)

0.4

25(0

.519)

(0.0

29,1.2

92)

0.7

27(0

.783)

(0.0

28,1.9

47)

γ1

-0.7

79(0

.145)

(-1.0

67,-0

.494)

-0.8

45(0

.158)

(-1.1

62,-0

.544)

-0.9

98(0

.185)

(-1.3

60,-0

.623)

γ2

-1.0

73(0

.359)

(-1.8

12,-0

.486)

-1.2

27(0

.416)

(-2.1

23,-0

.596)

-1.5

85(0

.476)

(-2.4

97,-0

.775)

γ3

-0.6

36(0

.274)

(-1.1

11,-0

.039)

-0.6

23(0

.336)

(-1.1

90,0.1

09)

-0.4

63(0

.412)

(-1.1

89,0.3

39)

β1

-0.5

38(0

.086)

(-0.7

07,-0

.371)

-0.5

69(0

.098)

(-0.7

56,-0

.377)

-0.6

45(0

.101)

(-0.8

42,-0

.444)

β2

0.0

13(0

.006)

(0.0

01,0.0

27)

0.0

17(0

.007)

(0.0

02,0.0

30)

0.0

08(0

.007)

(-0.0

06,0.0

21)

∗R

andom

imputa

tion

for

infe

ctio

nti

mes

ybase

don

the

esti

mate

dW

eibu

ll(α

,λ).

∗∗P

rior

dis

trib

uti

on

for

infe

ctio

nti

mes

y:W

eibu

ll(α

,λ),

wher

eα∼

Log

Nor

m(1

.31,0

.4)

and

λ∼

Log

Nor

m(3

.51,0

.5).




Table II. Sensitivity analysis for the Bayesian-PL approach using different prior distributions for τ

and y.

Analysis τ Infection times y µ (SE) 95% HDR

Tau1Y1 Γ(0.25, 0.005) Weibull(α, λ) -0.891(0.277) (-1.423, -0.335)

Tau1Y2 Γ(0.25, 0.005) Weibull(αj , λj), j = 1, 2, 3 -0.866(0.288) (-1.398, -0.263)

Tau1Y3 Γ(0.25, 0.005) flat prior -0.924(0.272) (-1.417, -0.350)

Tau2Y1 Γ(0.001, 0.001) Weibull(α, λ) -0.876(0.271) (-1.385, -0.326)

Tau2Y2 Γ(0.001, 0.001) Weibull(αj , λj), j = 1, 2, 3 -0.878(0.283) (-1.392, -0.279)

Tau2Y3 Γ(0.001, 0.001) flat prior -0.911(0.250) (-1.360, -0.412)



22 W. ZHANG ET AL.

20 30 40 50 60

0.0

0.2

0.4

0.6

0.8

1.0

Time in years

Sur

viva

l Pro

babi

lity

TurnbullWeibull

Figure 1. Estimated survival curves for HIV infection time of the pooled data set based on different

distribution assumptions.




Hazard Ratio

0.0 0.2 0.4 0.6 0.8 1.0

Tau2Y3

Tau2Y2

Tau2Y1

Tau1Y3

Tau1Y2

Tau1Y1

Analysis

Bayesian−PL

MCMCid

ieMCMC

Figure 2. Estimated hazard ratio and 95% HDR from the MCMCid and ieMCMC approaches (above

the dashed line), and from sensitivity analysis with different prior distributions for τ and HIV infection

times y using the Bayesian-PL approach (below the dashed line). The primary analysis is Tau1Y1.



24 W. ZHANG ET AL.

REFERENCES

1. Alter HJ. The cloning and clinical implications of HGV and HGBV-C. New England Journal of Medicine

1996; 334:1536-1537.

2. Rambusch EG, Wedemeyer H, Tillmann HL, Heringlake S, Manns MP. Significance of coinfection with

hepatitis G virus for chronic hepatitis C–a review of the literature. Z Gastroenterol 1998; 36:41-53.

3. Tillmann HL, Heringlake S, Trautwein C, et al. Antibodies against the GB virus C envelope 2 protein

before liver transplantation protect against GB virus C de novo infection. Hepatology 1998; 28:379-384.

4. Stapleton JT. GB virus type C/hepatitis G virus. Semin Liver Disease 2003; 23:137-148.

5. Lefrere JJ, Roudot-Thraval F, Morand-Joubert L, et al. Carriage of GB virus C/Hepatitis G virus RNA

is associated with a slower immunologic, virologic, and clinical progression of human immunodeficiency

virus disease in coinfected persons. Journal of Infectious Diseases 1999; 179:783-789.

6. Xiang J, Wunschmann S, Diekema DJ, et al. Effect of coinfection with GB virus C on survival among

patients with HIV infection. New England Journal of Medicine 2001; 345:707-714.

7. Tillmann HL, Heiken H, Knapir-Botor A, et al. Infection with GB virus C and reduced mortality among

HIV-infected patients. New England Journal of Medicine 2001; 345:715-724.

8. Williams CF, Klinzman D, Yamashita TE, et al. Persistent GB virus C infection and survival in HIV-

infected men. New England Journal of Medicine 2004; 350:981-990.

9. Bjorkman P, Flamholc L, Naucler A, et al. GB virus C during the natural course of HIV-1 infection:

viremia at diagnosis does not predict mortality. AIDS 2004; 18:877-886.

10. Birk M, Lindback S, Lidman C. No influence of GB virus C replication on the prognosis in a cohort of

HIV-1-infected patients. AIDS 2002; 16:2482-2485.

11. Zhang W, Chaloner K, Tillmann HL, Williams CF, Stapleton JT. Effect of early and late GBV-C viremia

on survival of HIV infected individuals: a meta-analysis. HIV Medicine 2006; 7:173-180.

12. Stapleton JT, Chaloner K, Williams CF. GB virus C infection and survival in the Amsterdam Cohort

Study. Journal of Infectious Diseases 2005; 191:2157-2158.

13. Van der Bij AK, Kloosterboer N, Prins M, et al. Reply to George and Stapleton et al. Journal of Infectious

Diseases 2005; 191:2158-2160.

14. Cox DR. Regression models and life-tables (with discussion). Journal of the Royal Statistical Society,

Series B 1972; 34:187-200.

15. Clayton DG. A model for association in biovariate life tables and its applications in epidemiological studies




of familial tendency in chronic disease indigence. Biometrika 1978; 65:141-151.

16. Clayton DG, Cuzick J. Multivariate associations of the proportional hazards model. Journal of the Royal

Statistical Society, Series A 1985; 148:82-108.

17. Gustafson P. A Bayesian analysis of bivariate survival data from a multicenter cancer clinical trial.

Statistics in Medicine 1995; 14:2523-2535.

18. Stangl D. Prediction and decision making using Bayesian hierarchical models. Statistics in Medicine 1995;

14:2173-2190.

19. Stangl D, Greenhouse J. Assessing placebo response using Bayesian hierarchical survival models. Lifetime

Data Analysis 1998; 4:5-28.

20. Sargent DJ. A general framework for random effects survival analysis in the Cox proportional hazards

setting. Biometrics 1998; 54:1486-1497.

21. Gustafson P. Large hierarchical Bayesian analysis of multivariate survival data. Biometrics 1997; 53:230-

242.

22. Cox DR. Partial likelihood. Biometrika 1975; 62:269-275.

23. Kalbfleisch JD. Nonparametric Bayesian analysis of survival time data. Journal of the Royal Statistical

Society, Series B 1978; 40:214-221.

24. Sinha D, Ibrahim JG, Chen M. A Bayesian justification of Cox’s partial likelihood. Biometrics 2003;

90:629-641.

25. Ibrahim JG, Chen M-H, Sinha D. Bayesian Survival Analysis. Springer-Verlag Inc: New York, 2001.

26. Chen M-H, Ibrahim JG, Shao Q-M. Posterior propriety and computation for the Cox regression model

with applications to missing covariates. Biometrika 2006; 93:791-807.

27. Metropolis N, Rosenbluth AW, Rosenbluth MN, et al. Equations of state calculations by fast computing

machines. Journal of Chemical Physics 1953; 21:1087-1092.

28. Hastings WK. Monte Carlo sampling methos using Markov chains and their applications. Biometrika 1970;

57:97-109.

29. Gilks WR, Richardson S, Spiegelhalter DJ. Markov Chain Monte Carlo in Practice. Chapman and Hall,

1996.

30. Liu KJ, Darrow WW, Rutherford GW. A model-based estimate of the mean incubation period for AIDS

in homosexual men. Science 1988; 240:1333-1335.

31. Mariotto AB, Mariotti S, Pezzotti P, et al. Estimation of the acquired immunodeficiency syndrome

incubation period in intravenous drug users: a comparison with male homosexuals. American Journal



26 W. ZHANG ET AL.

of Epidemiology 1992; 135:428-437.

32. Law CG, Brookmeyer R. Effects of mid-point imputation on the analysis of doubly censored data. Statistics

in Medicine 1992; 11:1569-1578.

33. Turnbull BW. The empirical distribution function with arbitrarily grouped, censored and truncated data.

Journal of the Royal Statistical Society, Series B 1976; 38:290-295.

34. Jaffe HW, Darrow WW, Echenberg DF, et al. The Aquired Immunodeficiency Syndrome in a cohort of

homosexual men: a six year follow-up study. Annals of Internal Medicine 1985; 103:210-214.

35. Souza IE, Allen JB, Xiang J, et al. Effect of primer selection on estimates of GB virus C (GBV-C)

prevalence and response to antiretroviral therapy for optimal testing for GBV-C viremia. Journal of

Clinical Microbiology 2006; 44:3105-3113.

36. Zhang W. Analysis of doubly censored survival data with applications to GBV-C and HIV studies. Ph.D.

dissertation, 2005, University of Iowa, Iowa City, Iowa.

37. Gauvreau K, DeGruttola V, Pagano M, et al. The effect of covariates on the induction time of AIDS using

improved imputation of exact seroconverson times. Statistics in Medicine 1994; 13:2021-2030.

38. Toyoda H, Fukuda Y, Hayakawa T, Takamatsu, Saito H. Effect of GB virus C/hepatitis G virus coinfection

on the course of HIV infection in hemophilia patients in Japan. Journal of Acquired Immune Deficiency

Syndrome and Human Retrovirology 1998; 17:209-213.

39. R Development Core Team. R: A language and environment for statistical computing. R Foundation for

Statistical Computing, Vienna, Austria. IBSN 3-900051-07-0, URL http://www.r-project.org, 2005.

40. Spiegelhalter D, Thomas A, Best N, Lunn D. WinBUGS User Manual, 2003. URL http://www.mrc-

bsu.cam.ac.uk/bugs.

41. Brooks SP, Gelman A. General methods for monitoring convergence of iterative simulations. Journal of

Computational and Graphical Statistics 1998; 7:434-455.

42. Tanner MA, Wong WH. The calculation of posterior distributions by data augmentation. Journal of

American Statistical Association 1987; 82:528-540.

43. Harkanen T, Virtanen JI, Arjas E. Caries on permanent teeth: a non-parametric Bayesian analysis.

Scandinavian Journal of Statistics 2000; 27:577-588.

44. Komarek A, Lesaffre E, Harkanen T, Declerck D, Virtanen JI. A Bayesian analysis of multivariate doubly-

interval-censored dental data. Biostatistics 2005; 6:145-155

45. Nattermann J, Nischalke HD, Kupfer B, et al. Regulation of CC chemokine receptor 5 in hepatitis G virus

infection. AIDS 2003; 17:1457-1462.




46. Xiang J, George SL, Wunschmann S, et al. Inhibition of HIV-1 replication by GB virus C infection through

increases in RANTES, MIP-1alpha, MIP-1beta, and SDF-1. Lancet 2004; 363:2040-2046.

47. Jung S, Knauer O, Donhauser N, et al. Inhibition of HIV strains by GB virus C in cell culture can be

mediated by CD4 and CD8 T-lymphocyte derived soluble factors. AIDS 2005; 19:1267-1272.

48. Xiang J, Sathar MA, McLinden JH, Klinzman D, Chang Q, Stapleton JT. South African GB virus C

isolates: interactions between genotypes 1 and 5 isolates and HIV. Journal of Infectious Diseases 2005;

192:2147-2151.

49. George SL, Varmaz D, Stapleton JT. GB virus C replicates in primary T and B lymphocytes. Journal of

Infectious Diseases 2006; 193:451-454.

50. Stiehm ER. Disease versus disease: how one disease may ameliorate another. Pediatrics 2006; 117:184-191.



A Bayesian analysis of doubly censored data using a ...€¦ · infection is of interest. In this...

Documents

Transcript of A Bayesian analysis of doubly censored data using a ...€¦ · infection is of interest. In this...