COURSE 6: CALCULATION OF SAMPLE SIZE

COURSE 6: CALCULATION OF SAMPLE SIZE ........................................ 85
Preparing to Calculate Sample Size .......................................... 86
Sample Size Calculations for Dichotomous Response Variables ................. 89
Sample Size Calculations for Continuous Response Variables ................. 102
Sample Size for Time-to-failure Data (censored data case) .................. 106
Sample Size for Testing Equivalence of Treatments .......................... 109
Pocock’s Table 9.1 ......................................................... 112
SAMPLE SIZE TABLES ......................................................... 114
POWER AND SAMPLE SIZE (PS) SOFTWARE ........................................ 135
How to Download PS ......................................................... 135
How to Use PS: Examples .................................................... 138
APPENDIX: PAGANO TABLE A.3 ................................................... I



CALCULATION OF SAMPLE SIZE

Clinical trials should have sufficient statistical power to detect differences between groups considered to be of clinical interest. Therefore, calculation of sample size with provision for adequate levels of significance and power is an essential part of trial planning.

Type I error, Type II error, p-value, and power of a test

                      H0 is true          H0 is not true
Reject H0             α (type I error)    1 – β (power)
Do not reject H0      1 – α               β (type II error)

Power = the probability of REJECTING the null hypothesis if a specific alternative is true

Power = Fn(α, variation, clinically significant difference, and SAMPLE SIZE)

Sample size = Fn(power, variation, clinically significant difference)

p-value = the probability we would have observed this difference (or a greater difference) if the null hypothesis were true

Calculation of a proper sample size is necessary to ensure adequate levels of significance and power to detect differences of clinical interest.

Biggest danger: sample size too small → no significant difference found → a treatment that may be useful is discarded.

Sample size calculations are approximate:
- Often based on roughly estimated parameter values.
- Usually based on mathematical models that only approximate the truth.
- Changes may occur in the target population, the eligibility criteria, or the expected treatment effect before the study begins.

Be conservative when estimating sample size.


Preparing to Calculate Sample Size

1. What is the main purpose of the trial? (This is the question on which sample size is based.)
2. What is the principal measure of patient outcome (endpoint)? Is this measure continuous or discrete? Censoring?
3. What statistical test will be used to assess treatment difference (e.g., t-test, log-rank, chi-square)? With what α-level? One-tailed or two-tailed?
4. What result is anticipated with the standard treatment (e.g., average value or rate)?
5. How small a treatment difference is it important to detect (δ), and with what degree of certainty (power = 1 – β)?

α = type I error = probability of rejecting H0: δ = 0 when H0: δ = 0 is true.

β = type II error = probability of not rejecting H0: δ = 0 when δ ≠ 0. β changes as a function of δ: for δ near zero, β is large; for δ far from zero, β is small.

1 – β = power (as a function of δ) = probability of rejecting H0: δ = 0 when δ ≠ 0. For δ near zero, power is low; for δ far from zero, power is high.

Often we set α = 0.05 or 0.01, but want to check various values of n, β, and δ. For fixed n, a plot of power = 1 – β vs. δ is a power curve. For a two-sided test, it looks like this:

[Figure: power curves for a two-sided test. Power = 1 – β is plotted against δ; each curve has its minimum, α, at δ = 0 (H0 true) and rises toward 1 as |δ| increases (H0 false). The large-n curve rises more steeply than the small-n curve.]

Alternatively, we could plot any two parameters for fixed values of the others. For example, for fixed α and β, we could plot n vs. δ:

[Figure: sample size curves, n vs. |δ| for fixed α. The required sample size falls sharply as |δ| increases, and is larger for power 1 – β = 0.9 than for 1 – β = 0.8.]

Because sample size planning often involves a trade-off between desired sample size, cost, and patient resources, such curves are useful.


Alternatively, sample sizes may be based on lengths of confidence intervals instead of power. If this is done, it is best to still check that the power is adequate. In either case, C.I.’s are useful for reporting results.

These sample size methods assume a single final analysis at the end of the trial. Interim analyses increase the chance of finding a significant difference → either make adjustments to the sample size or use group sequential testing methods.

Sample size methods will next be given for dichotomous, continuous, and continuous-but-censored data.

Sample Size Calculations for Dichotomous Response Variables

Compare drug A (standard) vs. drug B (new).

pA = proportion of failures expected on drug A
pB = proportion of failures on drug B that one would want to detect as being different
Note: δ = pA – pB

We want to test H0: pA = pB vs. Ha: pA ≠ pB (p = true value) with significance level α and power 1 – β to detect a difference of δ = pA – pB. The total sample size required (with N in each group) is:

    2N = 2[Z_{α/2} √(2p̄(1 – p̄)) + Z_β √(pA(1 – pA) + pB(1 – pB))]² / (pA – pB)²,

where:

    p̄ = (pA + pB)/2,

and Z_{α/2} and Z_β are critical values of the standard normal distribution; for example, for α = 0.05 (two-sided test), Z_{0.05/2} = 1.96. The table below gives Z_{α/2} and Z_β for common values of α and β:

    α       Z_{α/2}     1 – β     Z_β
    0.10    1.645       0.80      0.84
    0.05    1.960       0.85      1.03
    0.025   2.240       0.90      1.282
    0.01    2.576       0.95      1.645
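As a quick arithmetic check, the 2N formula above is easy to compute directly. A minimal sketch (in Python, which is not part of the original notes; the function name is ours):

```python
from math import sqrt

def total_n_dichotomous(p_a, p_b, z_alpha2=1.96, z_beta=1.282):
    """Total sample size 2N (both groups combined) for comparing two
    proportions, using the uncorrected formula from the notes."""
    p_bar = (p_a + p_b) / 2
    term = (z_alpha2 * sqrt(2 * p_bar * (1 - p_bar))
            + z_beta * sqrt(p_a * (1 - p_a) + p_b * (1 - p_b)))
    return 2 * term ** 2 / (p_a - p_b) ** 2

# Example from the notes: pA = 0.4, pB = 0.3, alpha = 0.05 (two-sided), power = 0.90
print(round(total_n_dichotomous(0.4, 0.3), 1))  # 952.3
```

Note that the formula is symmetric in pA and pB, so swapping the two proportions gives the same 2N.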


Example

pA = 0.4, pB = 0.3. (An “event” is a failure, so we want a reduced proportion on the new therapy.) Let α = 0.05, 1 – β = 0.90, two-sided test. Note:

    p̄ = (0.4 + 0.3)/2 = 0.35

From the table provided, we have Z_{α/2} = 1.96 and Z_β = 1.282. Substituting those values into the formula gives:

    2N = 2[1.96 √(2(0.35)(0.65)) + 1.282 √((0.4)(0.6) + (0.3)(0.7))]² / (0.4 – 0.3)² = 952.3

Rounding up to the nearest 10 yields 2N = 960, or N = 480 in each group. The tables from Fleiss (handout) give sample sizes for several cases, calculated using an adjusted version of the above formula.


Pocock’s sample size formula for dichotomous response variables


The variance of p̂A – p̂B equals var(p̂A) + var(p̂B) if the two samples are independent. The binomial variance [i.e., the variance of p̂ = x/n, where x has the binomial(n, p) distribution] is a function of p(1 – p). The trouble is that we don’t know the true values, pA and pB, needed to compute the true variance. (If we did, we wouldn’t have to do the experiment in the first place!)

The variance of p̂A – p̂B under H0: pA = pB = p is a function of 2p(1 – p), and the variance under Ha: pA ≠ pB is a function of pA(1 – pA) + pB(1 – pB). (The sample size formula derivation, given later, shows why the former is multiplied by Z_{α/2} and the latter by Z_β.) Often these two values will be very similar.

Pocock uses pA(1 – pA) + pB(1 – pB) in both places in the sample size formula above, which simplifies the formula considerably:

    2N = 2[pA(1 – pA) + pB(1 – pB)](Z_{α/2} + Z_β)² / (pA – pB)²

Pocock’s formula uses proportions multiplied by 100% (e.g., 75% instead of 0.75), but this change in scale cancels in the numerator and denominator, and gives the same result as using proportions. Pocock’s Table 9.1 gives (Z_{α/2} + Z_β)² for several values of α and β.

Table 9.1 (Pocock). Values of ƒ(α, β) to calculate the required number of patients for a trial

                          β (type II error)
                     0.05    0.1     0.2     0.5
    α (type I  0.10  10.8    8.6     6.2     2.7
    error)     0.05  13.0    10.5    7.9     3.8
               0.02  15.8    13.0    10.0    5.4
               0.01  17.8    14.9    11.7    6.6

N adjusted for continuity correction (Fleiss, 1981; Casagrande et al., 1978)

Recall: the underlying distribution is binomial (discrete), which we approximate with a normal distribution (continuous).

Using the continuity correction leads to the following adjustment in sample size:

    N_corrected = (N/4)[1 + √(1 + 4/(N |pA – pB|))]²

Using the previous example, with pA = 0.4, pB = 0.3, N = 480:

    N_corrected = (480/4)[1 + √(1 + 4/(480 |0.4 – 0.3|))]² = 499.8 ≈ 500

Using the uncorrected N, the sample size would be too small by 2 × (500 – 480) = 40 patients. The corrected N is recommended, and the continuity-corrected test statistic also should be used. Corrected values are tabulated for extensive combinations of α, β, pA, and pB in the references.
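The correction is a simple transformation of the uncorrected N. A minimal sketch (Python; the function name is ours):

```python
from math import sqrt

def n_corrected(n, p_a, p_b):
    """Continuity-corrected per-group sample size (Fleiss, 1981;
    Casagrande et al., 1978), applied to the uncorrected N."""
    return (n / 4) * (1 + sqrt(1 + 4 / (n * abs(p_a - p_b)))) ** 2

# Example from the notes: N = 480, pA = 0.4, pB = 0.3
print(round(n_corrected(480, 0.4, 0.3), 1))  # 499.8
```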


For example, for α = 0.05 (two-sided) and power = 0.80 (Z_{α/2} = 1.96, Z_β = 0.84):

    pA      pB      N
    0.05    0.15    140
    0.10    0.20    199
    0.20    0.30    293
    0.30    0.40    356
    0.40    0.50    387
    0.45    0.55    391
    0.50    0.60    387
    0.60    0.70    356
    0.70    0.80    293
    0.80    0.90    199
    0.85    0.95    140

References:
Fleiss JL. Statistical Methods for Rates and Proportions. 2nd ed. New York, NY: Wiley; 1981.
Casagrande JT, Pike MC, Smith PG. An improved approximate formula for calculating sample sizes for comparing two binomial distributions. Biometrics 1978;34(3):483-486.

Effect of binomial variance on sample size

Recall that the variance of p̂ is a function of p(1 – p), which is graphed below:

[Figure: p(1 – p) vs. p, a parabola reaching its maximum of 0.25 at p = ½ and falling to 0 at p = 0 and p = 1.]

The variance of p̂ is largest when p = 0.5, and smallest when p is near 0 or 1. Larger sample sizes are required to detect a change, pA – pB, when pA and pB are near 0.5. Smaller sample sizes are required for pA and pB near 0 or 1.

If one has no idea about the true value of p, then one can conservatively use p = 0.5 in the variance formula for sample size calculations. In general, dichotomous outcomes require substantial sample sizes to detect moderate differences. Continuous outcomes usually require smaller sample sizes.


Derivation of the (uncorrected) sample size formula

Let p̂A and p̂B be the sample proportions, and let N be the sample size in each group.

Test statistic: To test H0: pA = pB vs. Ha: pA > pB (a one-tailed test, used for simpler calculations), we use the test statistic

    Z = (p̂A – p̂B) / √(2p̄q̄/N),

where p̄ = (p̂A + p̂B)/2 and q̄ = 1 – p̄.

Testing at level α means:

    P(rejecting H0 | H0 true) = P(Z > Zα | H0 true) = α

We can perform an α-level test for any sample size (recall the power curve). To determine N, we need to specify α and β. For a given α and δ = pA – pB, we have:

    P(rejecting H0 | Ha true) = P(Z > Zα | pA – pB = δ) = 1 – β

This probability is a function of N (because Z is a function of N), so we can solve the equation for N. However, the Z statistic does not have a standard normal distribution if Ha is true: Z was standardized assuming H0: pA = pB true, so we must un-standardize and then re-standardize. Recall that under Ha:

    p̂A – p̂B ~ N( pA – pB , (pA qA + pB qB)/N ),

where q = 1 – p.

So:

    1 – β = P(Z > Zα | pA – pB = δ) = P( p̂A – p̂B > Zα √(2p̄q̄/N) | pA – pB = δ )

Un-standardizing:

    1 – β = P( p̂A – p̂B – (pA – pB) > Zα √(2p̄q̄/N) – (pA – pB) )

And re-standardizing:

    1 – β = P( [p̂A – p̂B – (pA – pB)] / √((pA qA + pB qB)/N) > [Zα √(2p̄q̄/N) – (pA – pB)] / √((pA qA + pB qB)/N) ),

where the left-hand ratio is standard normal under Ha. So:

    1 – β = P(Z > –Zβ),

where Zβ is a critical value from the standard normal distribution, and therefore:

    –Zβ = [Zα √(2p̄q̄/N) – (pA – pB)] / √((pA qA + pB qB)/N)


We can now solve this equation for N. Multiplying both sides by √((pA qA + pB qB)/N) and then by √N gives:

    –Zβ √(pA qA + pB qB) = Zα √(2p̄q̄) – (pA – pB) √N

    (pA – pB) √N = Zα √(2p̄q̄) + Zβ √(pA qA + pB qB)

    N = [Zα √(2p̄q̄) + Zβ √(pA qA + pB qB)]² / (pA – pB)²,

which is the sample size formula given earlier; N is the number required in each group. The formula given earlier for 2N multiplies this result by 2. (Note that p̄q̄ is still an unknown quantity, but we approximate p̄ with [pA + pB]/2.)


Sample size based on width of confidence intervals (McHugh & Le, 1984)

If we want a C.I. of width 2d (i.e., ±d), then solve for N:

    d = Z_{α/2} √((pA(1 – pA) + pB(1 – pB))/N)

    N = (Z_{α/2}/d)² [pA(1 – pA) + pB(1 – pB)]

For the previous example, pA = 0.4, pB = 0.3, N = 480, and Z_{α/2} = 1.96:

    d = 1.96 √(((0.4)(0.6) + (0.3)(0.7))/480) = 0.06

If we wanted a C.I. of width 2(0.05) instead of 2(0.06), the required sample size would be:

    N = (1.96/0.05)² [(0.4)(0.6) + (0.3)(0.7)] = 691.5

Reference: McHugh RB, Le CT. Confidence estimation and the size of a clinical trial. Control Clin Trials 1984;5(2):157-163.
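Both directions of this calculation (N → half-width, and half-width → N) are one-liners. A sketch (Python; function names are ours):

```python
from math import sqrt

def ci_halfwidth(n, p_a, p_b, z_alpha2=1.96):
    """Half-width d of the C.I. for pA - pB at per-group sample size n."""
    return z_alpha2 * sqrt((p_a * (1 - p_a) + p_b * (1 - p_b)) / n)

def n_for_ci_halfwidth(d, p_a, p_b, z_alpha2=1.96):
    """Per-group N so the C.I. for pA - pB has half-width d (McHugh & Le, 1984)."""
    return (z_alpha2 / d) ** 2 * (p_a * (1 - p_a) + p_b * (1 - p_b))

print(round(ci_halfwidth(480, 0.4, 0.3), 2))         # 0.06
print(round(n_for_ci_halfwidth(0.05, 0.4, 0.3), 1))  # 691.5
```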


Adjustment for noncompliance (crossovers)

Assume a new treatment is being compared with a standard treatment.

Dropouts: those who refuse the new treatment some time after randomization and revert to the standard treatment.
Drop-ins: those who receive the new treatment some time after initial randomization to the standard treatment.

These generally dilute the treatment effect.

Example: drug A vs. placebo

Suppose the true values are pA = 0.6 and p_placebo = 0.4, so:

    δ = 0.6 – 0.4 = 0.2

Enroll N = 100 patients in each treatment group. Suppose 25% of the drug A group drops out and 10% of the placebo group drops in. Then, instead of observing E(p̂A) = 0.6 and E(p̂B) = 0.4, we observe:

    E(p̂A) = (75/100)(0.6) + (25/100)(0.4) = 0.55

    E(p̂B) = (90/100)(0.4) + (10/100)(0.6) = 0.42

    δ_observed = 0.55 – 0.42 = 0.13 (instead of 0.20)

The power of the study will be less than intended, or else the sample size must be increased to compensate for the dilution effect.


For a dropout or drop-in rate of R (crossovers in one direction only), the adjusted sample size is:

    N_adjusted = N / (1 – R)²

For example, if R = 0.25 in the previous example with N = 480:

    N_adjusted = 480 / (1 – 0.25)² = 480(1.78) = 853.3

For a dropout rate of R1 (A → placebo) and a drop-in rate of R2 (placebo → A), the adjusted sample size is:

    N_adjusted = N / (1 – R1 – R2)²

For example, if R1 = 0.25 and R2 = 0.10:

    N_adjusted = 480 / (1 – 0.25 – 0.10)² = 1,136

The large increase in sample size shows the considerable impact of noncompliance on the ability to detect treatment differences.

Keep noncompliance to a minimum during trials.
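The inflation factor is a one-line computation. A minimal sketch (Python; the function name is ours):

```python
def n_adjusted(n, r1, r2=0.0):
    """Inflate the per-group sample size n for a dropout rate r1 and
    (optionally) a drop-in rate r2, per N / (1 - R1 - R2)^2."""
    return n / (1 - r1 - r2) ** 2

print(round(n_adjusted(480, 0.25), 1))        # 853.3
print(round(n_adjusted(480, 0.25, 0.10), 1))  # 1136.1
```

Note how quickly the factor grows: 25% dropout alone inflates N by 78%, and adding 10% drop-in more than doubles it.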

Justification for sample size adjustment formula for noncompliance

Expected difference between treatments: pA – pB = δ. Let R1 = dropout rate on treatment A. Then:

    pA* = E(p̂A) = pA(1 – R1) + pB R1 = pA – R1(pA – pB)

Recall the (uncorrected) sample size formula:

    N = [Z_{α/2} √(2p̄(1 – p̄)) + Z_β √(pA(1 – pA) + pB(1 – pB))]² / (pA – pB)²

A small change in pA or pB will have little effect on the numerator. The denominator, however, will become:

    (pA* – pB)² = (pA – R1(pA – pB) – pB)² = (pA – pB)²(1 – R1)²

Thus, the adjustment to N for a dropout rate of R1 is:

    1 / (1 – R1)²

Similarly, if there is also a drop-in rate of R2 (treatment B → A):

    pB* = E(p̂B) = pB(1 – R2) + pA R2 = pB + R2(pA – pB)

The denominator of the sample size formula becomes:

    (pA* – pB*)² = (pA – pB)²(1 – R1 – R2)²

So the adjustment to N is:

    1 / (1 – R1 – R2)²


Sample Size Calculations for Continuous Response Variables

Examples of continuous response variables:
- Blood pressure
- Time to tumor clearance
- Length of hospital stay

Assume all observations are known completely (no censoring). Data are assumed to be approximately normally distributed. A transformation (e.g., log or square root) may be required to normalize skewed data.


To test H0: δ = μA – μB = 0 vs. Ha: δ = μA – μB ≠ 0, use the test statistic:

    Z = (X̄A – X̄B) / (σ √(1/NA + 1/NB))

Using the same technique as in the dichotomous response case to derive the sample size formula, we obtain (for given α, β, δ, and σ):

    2N = 4(Z_{α/2} + Z_β)² σ² / δ²

Note: This formula is based on a normal (not a t) distribution → either σ is known, or N is large enough (N > 30 in both groups) to make this assumption valid. If σ² is not known, compute N for a range of σ² values to determine its effect on sample size. If N < 30, this formula will underestimate the correct sample size when σ is not known. If the variances in the two groups are not equal, base N on the larger value.

Example

In a study of a new diet to reduce cholesterol, a 10 mg/dl difference would be clinically significant. From other data, σ is estimated to be 50 mg/dl. We want a two-sided test with α = 0.05 and power 1 – β = 0.9 to detect a 10 mg/dl difference, so Z_{α/2} = 1.96 and Z_β = 1.282. Thus:

    2N = 4(1.96 + 1.282)²(50)² / (10)² = 1,051

How different would the required sample size be if σ were actually 60?

    2N = 4(1.96 + 1.282)²(60)² / (10)² = 1,513.5

A big difference in N, considering the relatively small increase in σ. Be conservative in estimates of σ!
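Since N grows with σ², checking a range of σ values is cheap. A sketch of the continuous-outcome formula (Python; the function name is ours):

```python
def total_n_continuous(delta, sigma, z_alpha2=1.96, z_beta=1.282):
    """Total sample size 2N for comparing two means with known sigma
    (two-sided test), per 2N = 4(Z_{a/2} + Z_b)^2 sigma^2 / delta^2."""
    return 4 * (z_alpha2 + z_beta) ** 2 * sigma ** 2 / delta ** 2

# Example from the notes: delta = 10 mg/dl, sigma = 50 vs. 60 mg/dl
print(round(total_n_continuous(10, 50)))     # 1051
print(round(total_n_continuous(10, 60), 1))  # 1513.5
```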


Sample size based on width of confidence intervals

    d = Z_{α/2} σ √(2/N)

Here, the relationship with power is the same as in the dichotomous response case.


Sample size for change-from-baseline response variables

For example, Δ = final – baseline cholesterol level. We test:

    H0: μ_ΔA – μ_ΔB = 0 vs. Ha: μ_ΔA – μ_ΔB ≠ 0

The variance of Δ may be much smaller than the variance of the original values (person-to-person variability is removed). Smaller sample sizes result.

Example

If, in the example above, we used the change in cholesterol level, we may have found σ_Δ = 20 (compared with σ = 50 above), so 2N is now:

    2N = 4(1.96 + 1.282)²(20)² / (10)² = 170

(This is much smaller than 1,051!)


Sample Size for Time-to-failure Data (censored data case)

Generally we want to compare the survival curves s(t) from two groups, where s(t) = P(T > t) = P(surviving beyond time t).


Generally, the log-rank or Wilcoxon (nonparametric) tests are used to test differences between survival functions for two groups. However, sample size calculations are often based on assuming that time to failure has an exponential distribution (a parametric assumption):

    s(t) = e^(–λt),

where λ is the hazard rate (force of mortality):

    λ = 1 / (mean survival time)

[Figure: exponential survival curve s(t) = e^(–λt), declining from 1 toward 0 over time t.]

If T is the length of the study, and λA and λB are the hazard rates for patients under treatments A and B, respectively:

    2N = 2(Z_{α/2} + Z_β)² [φ(λA) + φ(λB)] / (λA – λB)²,

where:

    φ(λ) = λ² / (1 – e^(–λT))

This assumes all patients enter at the beginning of the study.

Example

We plan a 5-year study (T = 5) with λA = 0.20 and λB = 0.30, α = 0.05, 1 – β = 0.90, so Z_{α/2} = 1.96 and Z_β = 1.282. Assume all patients will enter at the beginning of the first year. Then:

    φ(λA) = (0.2)² / (1 – e^(–0.2(5))) = 0.0633,

    φ(λB) = (0.3)² / (1 – e^(–0.3(5))) = 0.1158,

and

    2N = 2(1.96 + 1.282)²(0.0633 + 0.1158) / (0.2 – 0.3)² = 376.5

For patients recruited continually during the study period, use instead:

    φ(λ) = λ³T / (λT – 1 + e^(–λT))

For the same parameters used above, this would give:

    φ(λA) = (0.2)³(5) / ((0.2)(5) – 1 + e^(–(0.2)(5))) = 0.1087,

    φ(λB) = (0.3)³(5) / ((0.3)(5) – 1 + e^(–(0.3)(5))) = 0.1867,

and 2N = 620.9.

Accrual throughout the period requires more patients than if all start at the beginning of the study.


For the situation where accrual occurs over a fixed time period T0, followed by a fixed interval of follow-up T, use:

    φ(λ) = λ² / {1 – [e^(–λT) – e^(–λ(T + T0))] / (λT0)}

Early accrual builds information faster, and can lead to reduced sample sizes.

See also: Freedman LS. Tables of the number of patients required in clinical trials using the logrank test. Stat Med 1982;1:121-129.
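The accrual patterns differ only in φ(λ). A sketch covering the two worked cases above (Python; function names are ours):

```python
from math import exp

def phi_entry_at_start(lam, t):
    """phi(lambda) when all patients enter at time 0 and are followed for T = t."""
    return lam ** 2 / (1 - exp(-lam * t))

def phi_uniform_accrual(lam, t):
    """phi(lambda) when patients accrue continually over the study period T = t."""
    return lam ** 3 * t / (lam * t - 1 + exp(-lam * t))

def total_n_survival(lam_a, lam_b, t, phi, z_alpha2=1.96, z_beta=1.282):
    """Total 2N for comparing two exponential hazard rates."""
    return (2 * (z_alpha2 + z_beta) ** 2 * (phi(lam_a, t) + phi(lam_b, t))
            / (lam_a - lam_b) ** 2)

print(round(total_n_survival(0.2, 0.3, 5, phi_entry_at_start), 1))  # 376.5
# The notes report 620.9 for continual accrual (using rounded phi values);
# computing without intermediate rounding gives about 621.
print(round(total_n_survival(0.2, 0.3, 5, phi_uniform_accrual)))    # 621
```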


Sample Size for Testing Equivalence of Treatments

We may be testing a less expensive, less toxic, or less invasive procedure, and want to make sure that it is “as good” as the standard treatment in terms of efficacy. If we do not reject H0: μA = μB, that does not mean that we conclude the treatments are equivalent.

We want high power to detect differences of clinical importance, and low power to detect differences that are not clinically important.

[Figure: power curve, power = 1 – β plotted against δ, rising from near its minimum at δ ≈ 0 toward 1 for δ < 0 and δ > 0.]

Often this will mean switching the emphasis of α and β (e.g., using α = 0.10 and 1 – β = 0.90).

References:

Based on a C.I. approach:
Makuch R, Simon R. Sample size requirements for evaluating a conservative therapy. Cancer Treat Rep 1978;62(7):1037-1040.

Based on hypothesis testing, but switching H0 and Ha:
Blackwelder WC. ‘Proving the null hypothesis’ in clinical trials. Control Clin Trials 1982;3(4):345-353.
Blackwelder WC. Sample size graphs for ‘proving the null hypothesis.’ Control Clin Trials 1984;5(2):97-105.


Sample size for testing equality of several normal means (i.e., continuous response variables)

The procedure is straightforward, but requires tables.

References:
Mace AE. Sample Size Determination. Malabar, FL: Krieger; 1974.
Neter J, Kutner MH, Wasserman W, Nachtsheim CJ. Applied Linear Statistical Models. 4th ed. New York, NY: McGraw-Hill/Irwin; 1996.
Cohen J. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Philadelphia, PA: Lawrence Erlbaum; 1988.


Sample size for testing equality of several proportions

N is based on the chi-square test for homogeneity.

Reference: Lachin JM. Sample size determinations for r × c comparative trials. Biometrics 1977;33(2):315-324.

Pocock’s Table 9.1 Pocock’s Table 9.1 gives values of ƒ(,) to calculate the required number of patients for a trial.

Table 9.1 (Pocock)

                          β (type II error)
                     0.05    0.1     0.2     0.5
    α (type I  0.10  10.8    8.6     6.2     2.7
    error)     0.05  13.0    10.5    7.9     3.8
               0.02  15.8    13.0    10.0    5.4
               0.01  17.8    14.9    11.7    6.6

α = the level of the χ² significance test used for detecting a treatment difference (often set at α = 0.05).

1 – β = the degree of certainty that the difference p1 – p2, if present, would be detected (often set at 1 – β = 0.90).

α (commonly called the type I error) is the probability of detecting a significant difference when the treatments are really equally effective (i.e., it represents the risk of a false-positive result).

β (commonly called the type II error) is the probability of not detecting a significant difference when there really is a difference of magnitude p1 – p2 (i.e., it represents the risk of a false-negative result).

1 – β (commonly called the power) is the probability of detecting a difference of magnitude p1 – p2.

Here, p1 and p2 are the hypothetical percentage successes on the two treatments that might be achieved if each were given to a large population of patients. They merely reflect the realistic expectations or goals that one aims for when planning the trial and do not relate directly to the eventual results.


Example

In a trial of anturan, the investigators chose:

    p1 = 90% (the percentage on placebo expected to survive one year)
    p2 = 95%
    α = 0.05, β = 0.1

The required number of patients on each treatment is:

    n = [p1(100 – p1) + p2(100 – p2)] × ƒ(α, β) / (p2 – p1)²,

where ƒ(α, β) is a function of α and β, the values of which are given in Pocock’s Table 9.1 (reproduced above). In fact,

    ƒ(α, β) = [Φ⁻¹(1 – α/2) + Φ⁻¹(1 – β)]²,

where Φ is the cumulative distribution function of a standardized normal deviate. Numerical values for Φ⁻¹ may be obtained from statistical tables such as Geigy (1970, p. 28). Hence, for the anturan trial:

    n = [90 × 10 + 95 × 5] × 10.5 / (95 – 90)² = 577.5

Thus, 578 patients are required on each treatment.
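Pocock’s recipe can be reproduced end to end, computing ƒ(α, β) from normal quantiles rather than the table. A sketch (Python; `statistics.NormalDist` requires Python 3.8+, and the function names are ours):

```python
from statistics import NormalDist

def f(alpha, beta):
    """f(alpha, beta) = [Phi^-1(1 - alpha/2) + Phi^-1(1 - beta)]^2."""
    inv = NormalDist().inv_cdf
    return (inv(1 - alpha / 2) + inv(1 - beta)) ** 2

def pocock_n(p1, p2, alpha, beta):
    """Patients required per treatment, with p1 and p2 as percentages."""
    return (p1 * (100 - p1) + p2 * (100 - p2)) * f(alpha, beta) / (p2 - p1) ** 2

print(round(f(0.05, 0.1), 1))              # 10.5  (matches Table 9.1)
print(round(pocock_n(90, 95, 0.05, 0.1)))  # 578
```

Computed from exact quantiles, n comes out slightly higher (about 577.9) than the 577.5 obtained with the tabulated ƒ = 10.5; both round to 578 patients per arm.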


SAMPLE SIZE TABLES

[The sample size tables on pp. 114–134 are not reproduced in this transcript.]

POWER AND SAMPLE SIZE (PS) SOFTWARE

PS is a free resource available for download on the Department of Biostatistics web site.

How to Download PS

1. In your favorite browser, type the following URL:

http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/PowerSampleSize

2. When the following page appears, click on the “Get PS” link:


3. A screen similar to the following one should appear. Click OK.


4. The “save as” dialog box will appear, and you can choose the location to save your file. You may want to save to C:\Temp, so that you can easily remove the setup files after you have installed the software. When you have chosen your location, click Save.

5. Go to your C:\Temp folder and double click on the PS icon.

6. The PS software will be automatically installed to your machine.


How to Use PS: Examples

[The worked examples on pp. 138–141 consist of screenshots that are not reproduced in this transcript.]

APPENDIX: PAGANO TABLE A.3

[Pagano Table A.3 is not reproduced in this transcript.]