Chapter 7: Hypothesis testing
Hypothesis testing is typically done based on the cumulative hazard function. Here we’ll use the Nelson-Aalen estimate of the cumulative hazard. The survival function is used to weight differences between the observed and expected cumulative hazard.
Recall that the Nelson-Aalen estimate of the cumulative hazard is
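$$\hat{H}(t) = \sum_{t_i \le t} \frac{d_i}{Y_i}$$
where d_i is the number of deaths at time t_i and Y_i is the number at risk at time t_i.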
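As a small illustration of this estimate, the following R sketch computes the Nelson-Aalen cumulative hazard directly from its definition. The function name nelson_aalen, the coding 1 = death / 0 = censored, and the toy data are assumptions made for this example; they do not come from the course data.

```r
# Nelson-Aalen estimate of the cumulative hazard from right-censored data.
# time: observed times; status: 1 = death, 0 = censored (assumed coding).
nelson_aalen <- function(time, status) {
  t_i <- sort(unique(time[status == 1]))                        # distinct death times
  d_i <- sapply(t_i, function(t) sum(time == t & status == 1))  # deaths at t_i
  Y_i <- sapply(t_i, function(t) sum(time >= t))                # number at risk at t_i
  data.frame(time = t_i, d = d_i, Y = Y_i, H = cumsum(d_i / Y_i))
}

# Toy data, made up for illustration
time   <- c(2, 3, 3, 5, 7, 8)
status <- c(1, 1, 0, 1, 0, 1)
na_est <- nelson_aalen(time, status)
print(na_est)
```

Each row of the result gives a death time, the deaths and number at risk there, and the running sum of d_i/Y_i.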
In a one-sample problem, you test whether the hazard rate h(t) is equal to some reference hazard, h0(t). The null hypothesis is H0 : h(t) = h0(t). Under the null hypothesis, the expected hazard rate at time ti is h0(ti).
SAS Programming March 6, 2015 1 / 43
Hypothesis testing: one sample
The idea is then to compare the observed and expected cumulative hazards at time τ, the largest time in the study (τ = tD if the largest time is a death time). The test statistic is then
$$Z(\tau) = O(\tau) - E(\tau) = \sum_{i=1}^{D} W(t_i)\,\frac{d_i}{Y_i} - \int_0^{\tau} W(s)\,h_0(s)\,ds$$
where W (·) is a weight function.
The variance is
$$V[Z(\tau)] = \int_0^{\tau} W^2(s)\,\frac{h_0(s)}{Y(s)}\,ds$$
Hypothesis testing
Under the null hypothesis, the expected value of Z(τ) is 0, so standardizing Z(τ) (subtracting the mean and dividing by the standard deviation) gives
$$\frac{Z(\tau)}{\sqrt{V[Z(\tau)]}}$$
which has an approximate standard normal distribution. This can be used for either a two-sided or a one-sided test. For example, a one-sided alternative would be H1 : h(t) > h0(t), and you would reject only for large values of Z(τ)/√V[Z(τ)].
The most popular choice for a weighting function is W(t) = Y(t), which, since Y(ti) = Yi, leads to
$$O(\tau) = \sum_{i=1}^{D} Y(t_i)\,\frac{d_i}{Y_i} = \sum_{i=1}^{D} d_i$$
This is also called the log-rank test (not sure why).
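A minimal R sketch of the one-sample test with W(t) = Y(t) follows. With this weight, O(τ) is the total number of deaths, and if every subject enters at time 0 and τ is the largest observed time, the expected count is E(τ) = ∫ Y(s) h0(s) ds = Σ_j H0(T_j), summed over all subjects' observed times T_j, with V[Z(τ)] = E(τ). The function name one_sample_logrank, the constant reference hazard 0.05, and the toy data are assumptions for illustration only.

```r
# One-sample log-rank test with W(t) = Y(t), assuming no delayed entry.
# H0 is the reference cumulative hazard, supplied as a function of time.
one_sample_logrank <- function(time, status, H0) {
  O <- sum(status)          # observed deaths: sum of d_i
  E <- sum(H0(time))        # expected deaths: sum over subjects of H0(T_j)
  z <- (O - E) / sqrt(E)    # the variance equals E for this weight
  c(O = O, E = E, z = z, p_upper = pnorm(z, lower.tail = FALSE))
}

# Toy data and a hypothetical constant reference hazard h0(t) = 0.05
time   <- c(2, 3, 3, 5, 7, 8)
status <- c(1, 1, 0, 1, 0, 1)
res <- one_sample_logrank(time, status, H0 = function(t) 0.05 * t)
print(res)
```

The upper-tail p-value corresponds to the one-sided alternative h(t) > h0(t); double the smaller tail for a two-sided test.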
Other weight functions are possible. For example
$$W(t) = Y(t)\,S_0(t)^p\,[1 - S_0(t)]^q$$
with 0 ≤ p, q ≤ 1 (you don’t necessarily need q = 1 − p here). The choice of p and q determines whether departures from the hypothesized hazard matter more for small t or for large t. For example, if p is large relative to q, then more emphasis is placed on the estimated hazard matching the null hazard for small values of t, where S0(t) is close to 1.
S0(t) can be obtained from the null cumulative hazard as S0(t) = exp[−H0(t)].
An example where you would use the one-sided hypothesis test is in testing whether some population has a higher hazard than a reference population, such as the psychiatric patients from Iowa. Recall that for this example, we looked at excess mortality previously.
Hypothesis testing: two or more samples
If you have two or more samples (e.g., mortality for three different treatments or three different risk groups), then the null and alternative hypotheses are similar to those for ANOVA:
$$H_0 : h_1(t) = h_2(t) = \cdots = h_K(t), \quad \text{for all } t \le \tau$$
$$H_A : h_i(t) \ne h_j(t) \quad \text{for some } i \ne j \text{ and some } t \le \tau$$
where τ is the largest time at which all of the groups have at least one subject at risk.
We now define ti as the unique death times for the pooled data (i.e., ignoring the group that each observation comes from), and again tD is the largest death time.
We observe dij deaths at time ti in sample j, and there are Yij individuals at risk at time ti in sample j. We let
$$d_i = \sum_{j=1}^{K} d_{ij}$$
be the total number of deaths at time ti and
$$Y_i = \sum_{j=1}^{K} Y_{ij}$$
be the total number of individuals at risk (alive and under observation just before ti) at time ti.
The idea for testing the hypothesis is that under the null hypothesis, the estimate of the hazard (and cumulative hazard) should be the same (in expectation) using the pooled data (ignoring the group the samples are from) and for the individual samples. We can think of the pooled data as providing a more precise estimate of the hazard for the jth sample than the jth sample itself, so using the idea of observed minus expected, we can write
$$Z_j(\tau) = \sum_{i=1}^{D} W_j(t_i)\left(\frac{d_{ij}}{Y_{ij}} - \frac{d_i}{Y_i}\right), \quad j = 1, \ldots, K$$
If all of the Zj(τ) terms are close to 0, then all of the sample estimated cumulative hazards are close to the pooled cumulative hazard, so they all must be close to each other, and this supports the null hypothesis.
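The typical weight function used is Wj(ti) = Yij W(ti), where W(ti) is a common weight shared by each group. For this weighting scheme,
$$Z_j(\tau) = \sum_{i=1}^{D} W(t_i)\left[d_{ij} - Y_{ij}\left(\frac{d_i}{Y_i}\right)\right], \quad j = 1, \ldots, K$$
$$V[Z_j(\tau)] = \sigma_{jj} = \sum_{i=1}^{D} W(t_i)^2\,\frac{Y_{ij}}{Y_i}\left(1 - \frac{Y_{ij}}{Y_i}\right)\left(\frac{Y_i - d_i}{Y_i - 1}\right) d_i, \quad j = 1, \ldots, K$$
$$\operatorname{cov}(Z_j(\tau), Z_k(\tau)) = \sigma_{jk} = -\sum_{i=1}^{D} W(t_i)^2\,\frac{Y_{ij}}{Y_i}\,\frac{Y_{ik}}{Y_i}\left(\frac{Y_i - d_i}{Y_i - 1}\right) d_i, \quad j \ne k$$
Note the minus sign in the covariance: an excess of deaths in one group must be offset by a deficit in the others.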
Based on the second formula for Zj(τ), the sum Z1(τ) + · · · + ZK(τ) is equal to 0, meaning that the Zj(τ) are not independent of one another. In particular, ZK(τ) is a linear combination of Z1(τ), . . . , ZK−1(τ). Consequently, we construct a test statistic based on just the first K − 1 Zj(τ) terms:
$$\chi^2 = (Z_1(\tau), \ldots, Z_{K-1}(\tau))\,\Sigma^{-1}\,(Z_1(\tau), \ldots, Z_{K-1}(\tau))'$$
where (Z1(τ), . . . , ZK−1(τ)) is interpreted as a row vector of length K − 1 and Σ is the corresponding (K − 1) × (K − 1) covariance matrix (if you had made a K × K matrix using all the variables, it wouldn’t be full rank, and therefore would not be invertible). Under the null hypothesis, the χ2 statistic has approximately a chi-square distribution with K − 1 degrees of freedom, and you can base the test on this distribution.
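The construction of the Zj(τ), the covariance matrix, and the χ2 statistic can be sketched in base R as follows. This is an illustration under stated assumptions rather than a reference implementation: the function name ksample_test, the 1 = death / 0 = censored coding, and the toy data are made up; the sum runs over all pooled death times (terms with nobody at risk in a group contribute 0 for that group); and the off-diagonal covariances are taken with a minus sign, which is what makes the rows of Σ sum to 0.

```r
# K-sample weighted log-rank test (sketch). status: 1 = death, 0 = censored.
# W maps the pooled numbers at risk Y_i to the common weights W(t_i).
ksample_test <- function(time, status, group,
                         W = function(Y) rep(1, length(Y))) {
  group <- factor(group)
  K     <- nlevels(group)
  t_i   <- sort(unique(time[status == 1]))       # pooled distinct death times
  D     <- length(t_i)
  d <- matrix(0, D, K, dimnames = list(NULL, levels(group)))  # d_ij
  Y <- d                                                      # Y_ij
  for (j in seq_len(K)) {
    in_j <- group == levels(group)[j]
    for (i in seq_len(D)) {
      d[i, j] <- sum(in_j & status == 1 & time == t_i[i])
      Y[i, j] <- sum(in_j & time >= t_i[i])
    }
  }
  d_tot <- rowSums(d)
  Y_tot <- rowSums(Y)
  w     <- W(Y_tot)
  Z <- colSums(w * (d - Y * d_tot / Y_tot))      # observed minus expected
  # factor (Y_i - d_i) / (Y_i - 1), set to 0 when only one subject is at risk
  fac <- ifelse(Y_tot > 1, (Y_tot - d_tot) / (Y_tot - 1), 0)
  Sigma <- matrix(0, K, K, dimnames = list(levels(group), levels(group)))
  for (i in seq_len(D)) {
    p <- Y[i, ] / Y_tot[i]
    # diagonal: p_j (1 - p_j); off-diagonal: -p_j p_k
    Sigma <- Sigma + w[i]^2 * fac[i] * d_tot[i] * (diag(p, K) - outer(p, p))
  }
  keep  <- seq_len(K - 1)                        # drop the redundant K-th group
  chisq <- drop(Z[keep] %*% solve(Sigma[keep, keep, drop = FALSE]) %*% Z[keep])
  list(Z = Z, Sigma = Sigma, chisq = chisq, df = K - 1,
       p = pchisq(chisq, df = K - 1, lower.tail = FALSE))
}

# Toy data, made up for illustration (three groups of three subjects)
time   <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
status <- c(1, 1, 0, 1, 1, 1, 0, 1, 1)
group  <- rep(c("A", "B", "C"), times = 3)
out <- ksample_test(time, status, group)
print(out$Z)
print(out$chisq)
```

Passing W = function(Y) Y or W = function(Y) sqrt(Y) gives the other common weights. With only nine observations the chi-square approximation is of course not trustworthy; the data are there only to make the sketch runnable.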
Hypothesis testing: two samples
Several weight functions are possible. W(t) = 1 for all t leads to the two-sample log-rank test. W(ti) = Yi (Gehan's generalized Wilcoxon test) and W(ti) = √Yi (the Tarone-Ware test) have also been used.
In the case of K = 2 samples, the test statistic can be written as
$$Z = \frac{\sum_{i=1}^{D} W(t_i)\left[d_{i1} - Y_{i1}\left(\frac{d_i}{Y_i}\right)\right]}{\sqrt{\sum_{i=1}^{D} W(t_i)^2\,\frac{Y_{i1}}{Y_i}\left(1 - \frac{Y_{i1}}{Y_i}\right)\left(\frac{Y_i - d_i}{Y_i - 1}\right) d_i}}$$
Since we don’t have to square in this case, we can do one-sided as well as two-sided hypothesis tests based on a standard normal distribution, or you can square the statistic and use a χ2 distribution with 1 degree of freedom.
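For the two-sample case the statistic can be computed directly, as in the following base R sketch. The function name two_sample_z, the status coding (1 = death, 0 = censored), and the toy data are assumptions for illustration; the variance term includes the factor d_i so that it matches σ11 from the K-sample formulas.

```r
# Two-sample weighted log-rank Z statistic (sketch).
# "Sample 1" is the group with the smallest label.
two_sample_z <- function(time, status, group,
                         W = function(Y) rep(1, length(Y))) {
  g1  <- group == sort(unique(group))[1]
  t_i <- sort(unique(time[status == 1]))          # pooled distinct death times
  d1 <- sapply(t_i, function(t) sum(time == t & status == 1 & g1))
  d  <- sapply(t_i, function(t) sum(time == t & status == 1))
  Y1 <- sapply(t_i, function(t) sum(time >= t & g1))
  Y  <- sapply(t_i, function(t) sum(time >= t))
  w   <- W(Y)
  num <- sum(w * (d1 - Y1 * d / Y))               # observed minus expected in sample 1
  fac <- ifelse(Y > 1, (Y - d) / (Y - 1), 0)      # 0 when only one subject is at risk
  v   <- sum(w^2 * (Y1 / Y) * (1 - Y1 / Y) * fac * d)
  z   <- num / sqrt(v)
  c(z = z,
    p_two_sided = 2 * pnorm(-abs(z)),
    p_upper     = pnorm(z, lower.tail = FALSE))   # HA: higher hazard in sample 1
}

# Toy data, made up for illustration: sample 1 tends to die earlier
time   <- c(1, 2, 3, 4, 5, 6, 7, 8)
status <- c(1, 1, 1, 0, 1, 1, 0, 1)
group  <- rep(1:2, each = 4)
res_log   <- two_sample_z(time, status, group)                        # log-rank
res_gehan <- two_sample_z(time, status, group, W = function(Y) Y)     # W(t_i) = Y_i
print(rbind(logrank = res_log, gehan = res_gehan))
```

Squaring z gives the corresponding 1 degree of freedom chi-square statistic.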
This example compared kidney dialysis patients with surgically implanted catheters versus percutaneous (needle-puncture) placement of the catheter. Even though the survival curves look fairly different after 1 year or so, the differences are not statistically significant. Note that there are also very few observations for the percutaneous sample.
Actually the number of observations is fairly small for both samples, so the confidence intervals would be fairly wide.
Different choices for the weight function affect the p-value. It is reassuring if many weighting schemes give the same conclusion. The cases where the p-value was low were those where the weighting scheme gave a lot of weight to differences in the hazard for large values of ti, which of course is where the curves appear different. Such weights can also be sensitive to differences in censoring patterns in the two samples, so they should be used cautiously.
A problem with using lots of weighting schemes arises if different weights conflict and you report only the weighting schemes that give the results you want. This would be dishonest, so you should either pick a weighting scheme in advance and stick to it, or report the results of all the weighting schemes that you used.
Hypothesis testing: weight functions
The most common weight functions are either flat, W(ti) = 1, or decreasing, with W(ti) = Yi. A weight function that is increasing might be used to compare longer term survival when differences in early survival might be due to complications rather than to the long term effectiveness of a treatment.
An example is in comparing autologous transplants versus allogeneic transplants of bone marrow for leukemia. Allogeneic transplant patients (receiving bone marrow from a sibling) tend to have more complications early on, reducing early survival rates (and increasing early hazard rates), but if interest is in long term survival, then a weight function could be used that emphasizes later times.
Hypothesis testing in R
To test the difference in survival curves in R, you can use survdiff() from the survival library. An example is with the allo- versus auto-patients in the leukemia data.
> library(survival)
> # columns: V1 = time, V2 = death indicator, V3 = transplant group
> x <- read.table("leukemia2.txt")
> a <- survdiff(Surv(x$V1,x$V2)~factor(x$V3))
> a
Call:
survdiff(formula = Surv(x$V1, x$V2) ~ factor(x$V3))

                N Observed Expected (O-E)^2/E (O-E)^2/V
factor(x$V3)=1 51       28     25.8     0.182     0.382
factor(x$V3)=2 50       22     24.2     0.195     0.382

 Chisq= 0.4  on 1 degrees of freedom, p= 0.537
The results suggest that the two groups had survival experiences that were not statistically significantly different from each other.
To plot the two survival curves together you can use
> library(survival)
> x <- read.table("leukemia2.txt")
> # separate Kaplan-Meier fits for each group
> a <- survfit(Surv(x$V1[x$V3==1],x$V2[x$V3==1])~1)
> b <- survfit(Surv(x$V1[x$V3==2],x$V2[x$V3==2])~1)
> plot(a,conf.int=FALSE)
> # overlay the second curve as a step function
> lines(b,conf.int=FALSE,col="red",lwd=3)
> legend(20,1,legend=c("auto","allo"),col=c("black","red"),
+ lty=c(1,1),lwd=c(1,3),cex=1.3)
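The survdiff() function in R has an optional parameter rho whose default is 0, which results in the log-rank test. The weights are S(t)^rho, where S(t) is the Kaplan-Meier estimate from the pooled data, so larger values of rho put more weight on earlier times (rho = 1 gives the Peto and Peto modification of the Gehan-Wilcoxon test). The choice of rho can have a big impact on the p-value.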
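The effect of rho can be checked with a short comparison. The sketch below uses the aml data set that ships with the survival package as a stand-in, since leukemia2.txt is not available here; that substitution is an assumption of this example. Because survdiff() weights by S(t)^rho with S(t) decreasing in t, rho = 1 emphasizes early differences relative to the log-rank test.

```r
library(survival)

# `aml` ships with the survival package (columns time, status, x);
# it stands in for leukemia2.txt, which is not available here.
fit_logrank <- survdiff(Surv(time, status) ~ x, data = aml, rho = 0)  # log-rank
fit_peto    <- survdiff(Surv(time, status) ~ x, data = aml, rho = 1)  # weights S(t)^1

# p-values from the chi-square statistics (1 degree of freedom for two groups)
p_logrank <- pchisq(fit_logrank$chisq, df = 1, lower.tail = FALSE)
p_peto    <- pchisq(fit_peto$chisq,    df = 1, lower.tail = FALSE)
print(c(logrank = p_logrank, peto = p_peto))
```

If the two p-values lead to different conclusions, both should be reported rather than only the more favorable one.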
Hypothesis testing in SAS
You can use PROC LIFETEST in SAS to do hypothesis testing. We’ll take a look at examples after the break.
Tests of trend
For multiple samples (K > 2), a different alternative hypothesis is the following:
$$H_A : h_1(t) \le h_2(t) \le \cdots \le h_K(t), \quad \text{for } t \le \tau$$
where at least one inequality is strict. This is equivalent to
$$H_A : S_1(t) \ge \cdots \ge S_K(t)$$
We construct the Zj(τ)s as before and use any weight functions Wj(ti). We also pick a new set of weights aj, j = 1, . . . , K, where aj = j is often used.
The test statistic is now
$$Z = \frac{\sum_{j=1}^{K} a_j Z_j(\tau)}{\sqrt{\sum_{j=1}^{K} \sum_{k=1}^{K} a_j a_k\,\sigma_{jk}}}$$
where Σ = (σjk) is the K × K covariance matrix. (It isn’t full rank, but we don’t need the inverse.) The test statistic can be compared to a standard normal.
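Given the vector of Zj(τ) values and the full K × K covariance matrix, the trend statistic is a one-line computation. In the sketch below the function name trend_test and the numbers in Z and Sigma are made up for illustration (they are chosen so that the Zj sum to 0 and each row of Σ sums to 0, as they must); the scores default to aj = j.

```r
# Test for trend: Z = sum_j a_j Z_j / sqrt(a' Sigma a)
trend_test <- function(Z, Sigma, a = seq_along(Z)) {
  num <- sum(a * Z)
  den <- sqrt(drop(t(a) %*% Sigma %*% a))
  z   <- num / den
  # large positive z supports h_1 <= h_2 <= ... <= h_K
  c(z = z, p_upper = pnorm(z, lower.tail = FALSE))
}

# Hypothetical values for K = 3 groups (not from any real data set)
Z <- c(-3, 0.5, 2.5)
Sigma <- matrix(c( 2.0, -1.2, -0.8,
                  -1.2,  2.2, -1.0,
                  -0.8, -1.0,  1.8), nrow = 3, byrow = TRUE)
res <- trend_test(Z, Sigma)
print(res)
```

Here the numerator is 5.5 and a'Σa is 5.4, so the statistic is 5.5/√5.4. The upper tail is used because groups with higher hazard have more deaths than expected, and increasing scores aj turn that into a positive sum.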
Stratified tests
If different populations have different covariates (age, sex, etc.), then ideally, you could use a regression approach to survival analysis to adjust for covariates before comparing survival curves or hazard rates. This is done in Chapter 8.
If there are a small number of levels for a predictor, then you can use a stratified test instead.
Let
$$H_0 : h_{1s}(t) = h_{2s}(t) = \cdots = h_{Ks}(t), \quad s = 1, \ldots, M, \quad t \le \tau$$
The idea is that within each level of the covariate (indexed by s), the hazard rate should be the same across the K groups. Typically, M is small.
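For the stratified test, compute Zjs(τ) and σjks within each stratum s as before, and let
$$Z_{j\cdot}(\tau) = \sum_{s=1}^{M} Z_{js}(\tau)$$
$$\sigma_{jk\cdot} = \sum_{s=1}^{M} \sigma_{jks}$$
Then the test statistic is as before with multiple samples:
$$(Z_{1\cdot}(\tau), \ldots, Z_{K-1,\cdot}(\tau))\,\Sigma_{\cdot}^{-1}\,(Z_{1\cdot}(\tau), \ldots, Z_{K-1,\cdot}(\tau))'$$
which is approximately χ2 with K − 1 degrees of freedom, where Σ· is the (K − 1) × (K − 1) matrix of the summed covariances. Here we have K samples and M strata within each sample.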
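In R, a stratified test can be requested by adding a strata() term to the survdiff() formula. The sketch below uses the ovarian data set from the survival package purely as a convenient example (treatment rx compared within strata of residual disease resid.ds); that choice of data is an assumption of this illustration and is not the example from these notes.

```r
library(survival)

# `ovarian` ships with the survival package:
#   futime = follow-up time, fustat = death indicator,
#   rx = treatment group, resid.ds = residual disease (used as the stratum)
fit_unstrat <- survdiff(Surv(futime, fustat) ~ rx, data = ovarian)
fit_strat   <- survdiff(Surv(futime, fustat) ~ rx + strata(resid.ds),
                        data = ovarian)

# chi-square statistics, each on K - 1 = 1 degree of freedom
chisq <- c(unstratified = fit_unstrat$chisq, stratified = fit_strat$chisq)
print(chisq)
print(pchisq(chisq, df = 1, lower.tail = FALSE))
```

The stratified fit sums the observed-minus-expected counts and the variances over the strata before forming the statistic, which is the construction described above.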
Renyi type tests
For a two sample problem, if hazard functions cross, then the previous tests might not detect much overall difference in the hazard rates. The overall survival experience might look similar even though the groups differ in one direction in the short term and in the opposite direction in the long term. If one group is more at risk in the short term, and the other in the long term, these changes of direction could cancel out, leading one to not reject the null hypothesis that the hazards are the same even though they differ.
Renyi-type tests are based on the maximum absolute value of the differences between cumulative hazard rates rather than the summed differences.
The idea is similar to the Kolmogorov-Smirnov test for comparing two distributions, which uses the largest absolute value of the difference between the two empirical CDFs, but Renyi tests allow for censoring.
To construct this test, let
$$Z(t_i) = \sum_{t_k \le t_i} W(t_k)\left[d_{k1} - Y_{k1}\left(\frac{d_k}{Y_k}\right)\right], \quad i = 1, \ldots, D$$
where as usual dk = dk1 + dk2 and Yk = Yk1 + Yk2 (i.e., dk and Yk are the pooled number of deaths and number at risk at time tk over both samples). The variance of Z(τ) is
$$\sigma^2(\tau) = \sum_{t_k \le \tau} W(t_k)^2\left(\frac{Y_{k1}}{Y_k}\right)\left(\frac{Y_{k2}}{Y_k}\right)\left(\frac{Y_k - d_k}{Y_k - 1}\right) d_k$$
so the standard error is σ(τ), where τ is the largest death time tk with Yk1, Yk2 > 0.
The test statistic is
$$Q = \frac{\sup\{|Z(t)| : t \le \tau\}}{\sigma(\tau)}$$
You can think of the supremum here as just the maximum of the absolute values of the Z(ti) values. Critical values are given in the Appendix, table C.5, and are based on the theory of Brownian motion.
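The Renyi statistic can be sketched in base R by accumulating the observed-minus-expected terms over the death times. The names renyi_test and renyi_p, the status coding (1 = death, 0 = censored), and the toy data are assumptions for illustration. Death times at which one sample has nobody at risk contribute 0 to both the running sum and the variance, so no explicit truncation at τ is needed. Instead of a table lookup, the p-value uses the standard series for the supremum of the absolute value of Brownian motion on [0, 1], truncated at a finite number of terms; treat it as an approximation.

```r
# Approximate P(sup |B(t)| > q) for standard Brownian motion on [0, 1]
renyi_p <- function(q, terms = 200) {
  k <- 0:terms
  s <- sum((-1)^k / (2 * k + 1) * exp(-pi^2 * (2 * k + 1)^2 / (8 * q^2)))
  min(1, max(0, 1 - (4 / pi) * s))
}

# Two-sample Renyi-type test (sketch); "sample 1" has the smallest label.
renyi_test <- function(time, status, group,
                       W = function(Y) rep(1, length(Y))) {
  g1  <- group == sort(unique(group))[1]
  t_k <- sort(unique(time[status == 1]))          # pooled distinct death times
  d1 <- sapply(t_k, function(t) sum(time == t & status == 1 & g1))
  d  <- sapply(t_k, function(t) sum(time == t & status == 1))
  Y1 <- sapply(t_k, function(t) sum(time >= t & g1))
  Y  <- sapply(t_k, function(t) sum(time >= t))
  w  <- W(Y)
  Z_path <- cumsum(w * (d1 - Y1 * d / Y))         # Z(t_i), i = 1, ..., D
  fac    <- ifelse(Y > 1, (Y - d) / (Y - 1), 0)
  sigma2 <- sum(w^2 * (Y1 / Y) * ((Y - Y1) / Y) * fac * d)
  Q <- max(abs(Z_path)) / sqrt(sigma2)
  list(Q = Q, p = renyi_p(Q), Z = Z_path)
}

# Toy data, made up for illustration
time   <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
status <- c(1, 1, 1, 0, 1, 1, 0, 1, 1, 1)
group  <- c(1, 1, 1, 2, 2, 2, 2, 1, 1, 2)
out <- renyi_test(time, status, group)
print(out$Q)
print(out$p)
```

Plotting out$Z against the death times shows where the largest departure occurs.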
Renyi type tests: finding the maximum |Z (tj)|
Testing based on a fixed point in time
Instead of testing survival and hazard rates over all time points, you might be interested in the 1-yr survival rate. Note that the time being tested should be chosen before doing the test. If you look at two survival curves and say, “Wow, they look really different at year 3, is that significant?” then the p-value will be biased too low.
It is similar to testing at many time points but then not adjusting for multiple comparisons. In practice, this is what happens all the time though. People look at a graph of the data, which is maybe meant to be descriptive, something jumps out at them as being unusual, and they say, “Wow, is that significant?” It’s extremely difficult to answer this type of question. A better approach in this type of case might be the Renyi type of test, because it accounts for the fact that you are looking at maximum differences over the entire time frame.
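Here we want to test
$$H_0 : S_1(t_0) = S_2(t_0)$$
against
$$H_A : S_1(t_0) \ne S_2(t_0)$$
for two survival curves. (The method can be generalized to more survival curves.) The test statistic is
$$Z = \frac{\hat{S}_1(t_0) - \hat{S}_2(t_0)}{\sqrt{\hat{V}[\hat{S}_1(t_0)] + \hat{V}[\hat{S}_2(t_0)]}}$$
where the estimates come from the two samples separately (for example, Kaplan-Meier estimates with Greenwood variances). The statistic has an approximate standard normal distribution for large samples.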
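A base R sketch of the fixed-time comparison follows. It assumes the survival estimates are Kaplan-Meier estimates with variances from Greenwood's formula, which is one common choice rather than something specified on this slide; the function name km_at, the status coding (1 = death, 0 = censored), and the toy data are also made up for illustration.

```r
# Kaplan-Meier estimate of S(t0) and its Greenwood variance for one sample.
km_at <- function(time, status, t0) {
  t_i <- sort(unique(time[status == 1 & time <= t0]))  # death times up to t0
  if (length(t_i) == 0) return(c(S = 1, V = 0))        # no deaths by t0
  d <- sapply(t_i, function(t) sum(time == t & status == 1))
  Y <- sapply(t_i, function(t) sum(time >= t))
  S <- prod(1 - d / Y)
  V <- S^2 * sum(d / (Y * (Y - d)))                    # Greenwood's formula
  c(S = S, V = V)
}

# Toy data, made up for illustration; t0 is fixed before looking at the curves
t0 <- 8
s1 <- km_at(time = c(2, 4, 6, 8, 10),  status = c(1, 1, 0, 1, 0), t0 = t0)
s2 <- km_at(time = c(5, 7, 9, 11, 12), status = c(1, 0, 1, 1, 0), t0 = t0)

z <- unname((s1["S"] - s2["S"]) / sqrt(s1["V"] + s2["V"]))
p_two_sided <- 2 * pnorm(-abs(z))
print(c(S1 = unname(s1["S"]), S2 = unname(s2["S"]), z = z, p = p_two_sided))
```

The Greenwood term is undefined when everyone still at risk dies at the same time (Y = d), so this sketch is only meaningful when both estimated survival probabilities at t0 are strictly between 0 and 1.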
If you want to test multiple fixed time points, such as the 1-yr and 5-yr survival rates, then you should adjust for multiple comparisons. For testing two time points, a Bonferroni adjustment could be made, meaning that you reject each hypothesis only if the p-value is less than α/2. The more time points you check, the less power you will have to find significant differences.
Bonferroni adjustments
Probably the most popular and simplest adjustment for multiple testing is the Bonferroni adjustment. The idea is that to have k tests at overall level α (meaning that if the null hypotheses are true for all k tests, there is at most a probability α, for example 5%, of making an error on at least one of them), you use a level of α/k for each test.
What is the rationale for doing this?
There are several ways to justify Bonferroni adjustments. One is to look at the expected number of false positives under the null. Let Xi = 1 if you make an incorrect decision on test i (rejecting a true null hypothesis), and otherwise Xi = 0. What type of variable is Xi? What is the probability that Xi = 1 if the null hypothesis (for experiment i) is true? What is the expected value of Xi?
Xi as defined previously is Bernoulli with p = α if testing using level α. The expected value of a Bernoulli(p) random variable is p (why?), so the expected value of Xi is α.
If you do k experiments, the expected number of false positives is
$$E\left[\sum_{i=1}^{k} X_i\right] = k\alpha$$
However, if you test at the α/k level, then the expected number of false positives is k(α/k) = α. Thus, the Bonferroni adjustment controls the expected number of false positives.
Another approach is to use something called Bonferroni’s inequality. Let Ai be the event that you don’t reject the null hypothesis in test i. Suppose we set P(Ai) = 1 − α/k when the null is true. From the Inclusion-Exclusion formula
$$P(A_1 A_2) = P(A_1) + P(A_2) - P(A_1 \cup A_2) \ge P(A_1) + P(A_2) - 1$$
If we apply the formula again, setting B = A1A2, we get
$$P(A_1 A_2 A_3) = P(B A_3) \ge P(B) + P(A_3) - 1 \ge [P(A_1) + P(A_2) - 1] + P(A_3) - 1 = P(A_1) + P(A_2) + P(A_3) - 2$$
In general for k events
$$P(A_1 \cdots A_k) \ge \sum_{i=1}^{k} P(A_i) - (k - 1)$$
If P(Ai) = 1 − α/k, then we get
$$P(A_1 \cdots A_k) \ge k\left(1 - \frac{\alpha}{k}\right) - k + 1 = 1 - \alpha$$
Thus, the probability of all decisions being correct is at least 1 − α, and the probability of making any wrong decision is at most α.
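Both justifications can be checked numerically. The simulation below is a sketch under the assumption that all k null hypotheses are true and the tests are independent, so the p-values are independent Uniform(0, 1) draws; the Bonferroni bound itself does not need independence. The values of alpha, k, and the number of simulations are arbitrary choices for illustration.

```r
set.seed(1)
alpha <- 0.05
k     <- 10
n_sim <- 20000

# Under the null, each p-value is Uniform(0, 1); one row per simulated study
p <- matrix(runif(n_sim * k), nrow = n_sim)

mean_fp_unadj <- mean(rowSums(p < alpha))          # close to k * alpha = 0.5
mean_fp_bonf  <- mean(rowSums(p < alpha / k))      # close to alpha = 0.05
fwer_unadj    <- mean(rowSums(p < alpha) > 0)      # chance of any false positive
fwer_bonf     <- mean(rowSums(p < alpha / k) > 0)  # should be at most about alpha

print(c(mean_fp_unadj = mean_fp_unadj, mean_fp_bonf = mean_fp_bonf,
        fwer_unadj = fwer_unadj, fwer_bonf = fwer_bonf))

# Exact value under independence, for comparison with fwer_bonf
print(1 - (1 - alpha / k)^k)
```

In practice the same adjustment can be applied to a vector of p-values with p.adjust(pvals, method = "bonferroni"), which multiplies each p-value by k (capped at 1) so that it can be compared with α directly.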
Bonferroni’s inequality can be useful in other probabilistic arguments as well.