Download - Right-truncated data

Transcript
Page 1: Right-truncated data

Right-truncated data

For this data, only individuals for whom the event has occurred by a givendate are included in the study. Right truncation can occur in infectiousdisease studies.

Let Ti denote the infection time and Xi the time between infection andonset of the disease (for example, HIV infection versus development ofAIDS). We have a right-truncation time of τ , so that only patients whohave the disease prior to τ are in the study.

STAT474/STAT574 February 7, 2017 1 / 44

Page 2: Right-truncated data

Right-censoring versus right-truncation

For AIDS, for example, if we followed a cohort of people who had beeninfected with HIV, and at the end of the study some of them had AIDSand some didn’t, we would consider those who hadn’t developed AIDS bythe end of the study to be right-censored rather than right-truncated.

Instead, if the study only includes people who have developed AIDS, andthen we determine how long it took to develop AIDS from the time of theHIV infection, then the data is right-trucnated rather than right-censoredsince we are not allowing for the possibility of people who had HIV butdidn’t develop AIDS.

STAT474/STAT574 February 7, 2017 2 / 44

Page 3: Right-truncated data

Transfusion-caused HIV infection in children (p. 20)

infectionTime inductionTime

1.00 5.5

1.50 2.25

2.25 3.0

2.75 1.0

3.00 1.75

3.50 0.75

3.75 0.75, 1, 2.75, 3, 3.5, 4.25

4.00 1.0

4.25 1.75

4.50 3.25

4.75 1.0, 2.25

5.00 0.5, 0.75, 1.5, 2.5

5.25 0.25, 1.0, 1.5

5.50 0.5, 1.5, 2.5

5.75 1.75

6.00 0.5, 1.25

6.25 0.5, 1.25

6.50 0.75

6.75 0.5, 0.75

7.00 0.75

7.25 0.25

STAT474/STAT574 February 7, 2017 3 / 44

Page 4: Right-truncated data

AIDS example

For this example, the study had a window of 8 years, so data isright-truncated, but the truncation time depends upon when the infectiontime took place. Let Ri = τ − Xi . Then the trick of thinking of the Ri

values as left-truncated can be used, and the method for left-truncationfrom the previous section can be used. Here Ri is constrained to begreater than Ti , so the Ri s are left-truncated at Ti .

This procedure estimates Pr [R > t|R ≥ 0] which is equivalent toPr [X < τ − t|X ≤ τ ]. This function is decreasing in t and increasing butincreasing on the original timescale, i.e. Pr [X < x |X ≤ τ ] is increasing inx .

STAT474/STAT574 February 7, 2017 4 / 44

Page 5: Right-truncated data

AIDS example

STAT474/STAT574 February 7, 2017 5 / 44

Page 6: Right-truncated data

Cohort Life Table

In a cohort life table, we follow a group, traditionally from birth, andrecord when they die within an interval (aj−1, aj ], j = 1, . . . , k + 1. Ideally,all individuals are followed until the last one dies; however, censoring mayoccur due to moving out of the study. The life table can also be applied toanimals, and can be applied to other events, such as time to weaning forinfants, first evidence of a certain disease, time to first marriage, etc.

We use the following steps to make a Cohort Life Table

STAT474/STAT574 February 7, 2017 6 / 44

Page 7: Right-truncated data

Cohort Life Table

I The first column gives the mutually exclusive and adjacent timeintervals Ij = (aj−1, aj ], where a0 = 0, and ak =∞. Intervals do notneed to be equal in length.

I The second column gives the number of individuals Y ′j enteringinterval Ij who have not experienced the event

I The third column gives the number of individuals lost to follow up(cannot be reached) in interval Ij . We assume that the censoringtimes are independent of the event times

I The fourth column gives Yj , the number of intervals at risk ofexperiencing the event in interval Ij assuming that censoring times areuniformly distributed over the interval, so that Yj = Y ′j −Wj/2

I The fifth column gives the number of individuals dj who experiencedthe event in interval Ij

I The sixth column gives the estimated survival function at the start ofthe interval S(aj−1)

STAT474/STAT574 February 7, 2017 7 / 44

Page 8: Right-truncated data

Cohort Life Table

The survival probabilities follow the Kaplan-Meier formula. The maindifference is in how the Yi values are defined since we average over theinterval when dealing with censoring instead of using exact time points.

S(a0) = 1; S(aj) = S(aj−1)(1− dj/Yj) =

j∏i=1

(1− di/Yi )

STAT474/STAT574 February 7, 2017 8 / 44

Page 9: Right-truncated data

Cohort Life Table

Other columns:

I The seventh column gives the estimated probability density at themidpont of the jth interval

f (amj) =S(aj−1)− S(aj)

aj − aj−1

(This looks like a difference quotient from calculus)I The eighth column gives the estimated hazard rate, based on a linear

approximation for the survival function:

h(amj) = f (amj)/S(amj)

= f (amj)/{S(aj) + [S(aj−1) + S(aj)]/2}

=2f (amj)

S(aj−1) + S(aj)

STAT474/STAT574 February 7, 2017 9 / 44

Page 10: Right-truncated data

Cohort Life Table: other columns

I The 9th column is the standard error estimate at the beginning of thejth interval, which is the same as for the Kaplan-Meier,

SE [S(aj−1)] = Saj−1

√∑j−1i=1

diYi (Yi−di )

I The 10th column gives the standard error esimtate at the midpoint ofthe jth interval which is

S(aj−1)qjaj − aj−1

√√√√ j−1∑i=1

qi/(Yi pi ) + pj/(Yj qj)

where qj = dj/Yj and pj = 1− qjI Finally, the last column gives the estimate standard deviation of the

hazard function at the midpoint of the jth interval as{1− [h(amj)(aj − aj−1)/2]2

dj

}1/2

· h(amj)

STAT474/STAT574 February 7, 2017 10 / 44

Page 11: Right-truncated data

Cohort Life Table: weaning example

STAT474/STAT574 February 7, 2017 11 / 44

Page 12: Right-truncated data

Cohort Life Table: weaning example Hazard function

STAT474/STAT574 February 7, 2017 12 / 44

Page 13: Right-truncated data

Weaning example Hazard function

The hazard function for this example is initially high (indicate highprobability that mother will wean the infant soon), then decreases, thengoes up again.

STAT474/STAT574 February 7, 2017 13 / 44

Page 14: Right-truncated data

Median survival time

If the last observation is right-censored, then the mean survival time isn’twell-defined, so the median is often used. For the median survival time,first determine the interval Ij = (aj−1, aj ] where S(aj−1) ≥ 0.5 and

S(aj) ≤ .5. Then the median survival time can be estimated byinterpolation as

x0.5 = aj + [S(aj−1)− 0.5]aj − aj−1

S(aj−1)− S(aj)= aj + [S(aj−1)− 0.5]/f (amj)

The derivation of this uses college algebra. Draw a line through(aj−1, Sa(j − 1) and (aj , S(aj)). Determine the point on this line that goesthrough 0.5 on the y -axis, and you are done.

STAT474/STAT574 February 7, 2017 14 / 44

Page 15: Right-truncated data

Median residual life

You can also determine the median residual life. From a certain time pointaj−1 the median residual life is the amount of time added to aj−1 forwhich half of the remaining cohort is expected to survive. For example, if40% of the cohort have died by time aj−1, then the median residual life isthe additional time expected until 70% have died.

STAT474/STAT574 February 7, 2017 15 / 44

Page 16: Right-truncated data

Cohort Life Table: Example problem

STAT474/STAT574 February 7, 2017 16 / 44

Page 17: Right-truncated data

Cohort Life Table: Example problem

To do this problem, note that the number of people entering the studywas 100. To construct the table, the first column in the form of timeintervals is given.

To construct the second column, the number of individuals entering theinterval who have not experienced the event, we consider the number whoentered the previous interval without experiencing the event, minus thenumber who experienced the event in the previous interval, and minus thenumber who were lost to follow up in the previous interval

STAT474/STAT574 February 7, 2017 17 / 44

Page 18: Right-truncated data

Cohort Life Table: Example problem

It is easiest to set this up in Excel. Here I just did up to the estimatedstandard errors for the survival function. For this, it is easiest to constructa column for dj/[(Yj(Yj − dj), then another column for∑j

i=1 dj/[(Yj(Yj − dj) before getting the standard error.

STAT474/STAT574 February 7, 2017 18 / 44

Page 19: Right-truncated data

Chapter 6: Topics in Univariate Estimation

This chapter considers miscellaneous topics before going into hypothesistesting (Chapter 7). Topics include

I smoothing estimates of the hazard function

I describing excess mortality compared to a reference population

I Bayesian nonparametric approaches for right-censored data.

STAT474/STAT574 February 7, 2017 19 / 44

Page 20: Right-truncated data

Estimating the hazard function

The Nelson-Aalen estimator (Chapter 4 and week 4 of Notes) is typicallyused to estimate the cumulative hazard function as

H(t) =

{0, if t ≤ t1∑

ti≤tdiYi, if t > t1

The survival function is related to the cumulative hazard byS(t) = exp[−H(t)]. The hazard is the derivative (i.e., slope) of thecumulative hazard, which is crude for a step function. So a smoothedversion of the cumulative hazard is often used to make a more reasonableestimate of the hazard function.

STAT474/STAT574 February 7, 2017 20 / 44

Page 21: Right-truncated data

Smoothing with Kernels

Before jumping in to the idea of smoothing the hazard, we’ll talk aboutsmoothing in a simpler setting first.

Smoothing is also used for nonparametric density estimation.

In particular, think of a histogram, and remove the bars underneath. Thenthe remaining horizontal lines form a rough estimate of the densityfunction.

STAT474/STAT574 February 7, 2017 21 / 44

Page 22: Right-truncated data

Histogram as density estimate

> x <- rnorm(100)

> a <- hist(x,prob=T,cex.axis=1.3,cex.lab=1.3,cex.main=1.3,

border=FALSE,col="grey")

names(a)

[1] "breaks" "counts" "density" "mids" "xname"

"equidist"

> a$density

[1] 0.02 0.06 0.04 0.22 0.30 0.32 0.40 0.36 0.16 0.08 0.04

> plot(a$breaks,a$density,type="s")

Error in xy.coords(x, y, xlabel, ylabel, log) :

’x’ and ’y’ lengths differ

> points(a$mids,a$density,type="s")

> points(a$mids-.25,a$density,type="s")

STAT474/STAT574 February 7, 2017 22 / 44

Page 23: Right-truncated data

Histogram as density estimate

STAT474/STAT574 February 7, 2017 23 / 44

Page 24: Right-truncated data

Histogram as density estimate

STAT474/STAT574 February 7, 2017 24 / 44

Page 25: Right-truncated data

Histogram as density estimate

This isn’t quite right because the height of the histogram should start atthe left-endpoint of the bin, not the midpoint, so you need to shift if overor use the a$breaks variable

> a <- hist(x,prob=T,cex.axis=1.3,cex.lab=1.3,cex.main=1.3,

border=FALSE,col="grey")

> points(a$breaks[1:length(a$density)],a$density,type="s",

lwd=3)

STAT474/STAT574 February 7, 2017 25 / 44

Page 26: Right-truncated data

Histogram as density estimate

STAT474/STAT574 February 7, 2017 26 / 44

Page 27: Right-truncated data

Histogram as density estimate

Now we’ll remove the grey, and just think of this as an estimate of thestandard normal density.

> a <- hist(x,prob=T,cex.axis=1.3,cex.lab=1.3,cex.main=1.3,border=FALSE)

> x2 <- ((1:100)/100)*6 - 3

> points(x2,dnorm(x2),type="l",lwd=2)

> points(a$breaks[1:length(a$density)],a$density,type="s",lwd=3,lty=3)

> legend(-3,.4,legend=c("true","estimated"),lty=c(1,2),cex=1.5)

> legend(-3,.4,legend=c("true","estimated"),lty=c(1,2),cex=1.5,lwd=c(3,2))

STAT474/STAT574 February 7, 2017 27 / 44

Page 28: Right-truncated data

Histogram as density estimate

STAT474/STAT574 February 7, 2017 28 / 44

Page 29: Right-truncated data

Histogram as a density estimate

Of course, with more data, the better the histogram estimates the density.

> x <- rnorm(100000)

> a <- hist(x,prob=T,cex.axis=1.3,cex.lab=1.3,cex.main=1.3,

border=FALSE,nclass=50)

> x2 <- ((1:100)/100)*8 - 4

> points(x2,dnorm(x2),type="l",lwd=2)

> points(a$breaks[1:length(a$density)],a$density,

type="s",lwd=3,lty=1,col="red")

> legend(-4.2,0.4,legend=c("true","estimated"),

col=c("black","red"),cex=1.3,lty=c(1,1))

STAT474/STAT574 February 7, 2017 29 / 44

Page 30: Right-truncated data

Histogram as density estimate

STAT474/STAT574 February 7, 2017 30 / 44

Page 31: Right-truncated data

Histogram as density estimate

One way of thinking about what the histogram does is that it that it stacksrectangular blocks on top of each other for each observation. The locationof the block depends on which bin the observation occurs in. If there are10 observations between -3 and -2, then it stacks 10 blocks in thatcategory. If there are 15 between -2 and -1, then it stacks 15 blocks in thatcategory. Each block is the same height and width (the width of the bin).The heights are chosen so that the total area under the curve is equal to 1.

STAT474/STAT574 February 7, 2017 31 / 44

Page 32: Right-truncated data

Histogram as density estimate

The blocky of appearance of the histogram is due to this business ofstacking blocks. If we want a smoother estimate of the density, we canimagine stacking objects that aren’t so....blocky.

The idea is that instead of adding a rectangle at every data point (within abin), we instead put the value of a function at the observed data pointwithout worrying about binning, and add up these functions. Thefunctions are scaled in such a way that the total area under the curve willstill be 1.0.

STAT474/STAT574 February 7, 2017 32 / 44

Page 33: Right-truncated data

Kernel density estimation

If a square Kernel is used, then you add one unit of height, but instead ofputting it into a bin, you center it wherever the data happens to be.

STAT474/STAT574 February 7, 2017 33 / 44

Page 34: Right-truncated data

Kernel density estimation

Let’s say a normal Kernel is used. Then if you observe data x1, x2, and x3(only three data points), you draw normal curves centered at x1, x2 andx3. The Kernel density estimate at a point say x distinct from the datapoints, is the sum of the normal curves evaluated at x .

The total density estimate is therefore a sum of normals with differentmeans, but the same variance, and this leads to a smooth, though possiblymultimodal, density estimate

STAT474/STAT574 February 7, 2017 34 / 44

Page 35: Right-truncated data

Kernel density estimation: normal Kernel

STAT474/STAT574 February 7, 2017 35 / 44

Page 36: Right-truncated data

Kernel density estimation

To see how this works in more detail, we’ll try the example with 3 datapoints. Say x1 = 1.2, x2 = 1.5, and x3 = 3.1. To use a Kernel function,you need to specify a bandwidth h as well as the Kernel function. TheKernel density estimate at a point x can be written as

f (x) =1

nh

n∑i=1

K

(x − xi

h

)where K is the Kernel function. For a rectangular (also called uniform)Kernel, we might let K (u) = 1/2(−1 ≤ u ≤ 1) This adds a rectangularblock at each point xi . For example, we have

f (2) =1

3 · 1

[K

(2− 1.2

1

)+ K

(2− 1.5

1

)+ K

(2− 3.1

1

)]=

1

3·(

1

2+

1

2+ 0

)

STAT474/STAT574 February 7, 2017 36 / 44

Page 37: Right-truncated data

Kernel density estimation

For the same data, if we used the standard normal kernel with bin width 2,we have K (u) = φ(u), and we use

f (x) =1

2n

n∑i=1

K

(x − xi

2

)We then get

f (2) =1

6(φ(0.4) + φ(0.25) + φ(−0.55))

> (1/6)*(dnorm(.4) + dnorm(.25) + dnorm(-.55))

[1] 0.1829804

STAT474/STAT574 February 7, 2017 37 / 44

Page 38: Right-truncated data

Kernel density estimation

If instead we had used a binwidth of h = 1, then we’d have

f (2) =1

3(φ(0.8) + φ(0.5) + φ(−1.1)) = 0.2865364

This changes the answer quite a bit (and makes it closer to the uniformKernel answer), but this is a small sample size, making the answer moresensitive to bin width

STAT474/STAT574 February 7, 2017 38 / 44

Page 39: Right-truncated data

Kernel density estimation in R

To get an idea of how to do this in R, we can use, for a basndwidth of0.5122:

> (1/(3*.5122)) *(dnorm((1-x[1])/.5122) + dnorm((1-x[2])/.5122)

+ dnorm((1-x[3])/.5122))

[1] 0.4018495

to get f (1).

> plot(xaxis,(1/(3*.5122)) *(dnorm((xaxis-x[1])/.5122) +

dnorm((xaxis-x[2])/.5122) + dnorm((xaxis-x[3])/.5122)),

main="",cex.axis=1.3,cex.lab=1.3,ylab="Density estimate",

type="l",xlab="x")

plots the whole curve from -4 to 4.

STAT474/STAT574 February 7, 2017 39 / 44

Page 40: Right-truncated data

Kernel density estimation in R: bandwidth = 0.5

STAT474/STAT574 February 7, 2017 40 / 44

Page 41: Right-truncated data

Kernel density estimation in R: bandwidth = 1.0

STAT474/STAT574 February 7, 2017 41 / 44

Page 42: Right-truncated data

Kernel density estimation in R

Of course, there is an easier way, which is to use the density() functionin R. The default is a Guassian kernel, but others are possible also. It usesit’s own algorithm to determine the bin width, but you can override andchoose your own. If you rely on the density() function, you are limitedto the built-in kernels. If you want to try a different one, you have to writethe code yourself.

STAT474/STAT574 February 7, 2017 42 / 44

Page 43: Right-truncated data

Kernel density estimation in R: effect of bandwidth forrectangular kernel

STAT474/STAT574 February 7, 2017 43 / 44

Page 44: Right-truncated data

Kernel density estimation

There are lots of popular Kernel density estimates, and statisticians haveput a lot of work into establishing their properties, showing when someKernels work better than others (for example, using mean integratedsquare error as a criterion), determining how to choose bandwidths, and soon.

STAT474/STAT574 February 7, 2017 44 / 44