
Web-based Supplementary Materials for "Using Regression Models to Analyze Randomized Trials: Asymptotically Valid Hypothesis Tests Despite Incorrectly Specified Models"

    Michael Rosenblum and Mark J. van der Laan

    September 8, 2008

These supplementary materials consist of the following six appendices to the main paper:

Appendix A: Details of Simulation Studies from Section 6 and Additional Simulation Results

    Appendix B: Robust Variance Estimators

    Appendix C: Robustness Results for Regression-Based Test of Effect Modification

    Appendix D: Proof of Theorem from Section 4

    Appendix E: R Code for Data Example from Section 7

Appendix F: Comparison of Superpopulation Inference to the Randomization Inference Approach of Rosenbaum (2002)


A Details of Simulation Studies from Section 6 and Additional Simulation Results

We first describe in full detail the hypothesis testing methods M0-M5 that were compared in Section 6. Next, we present results of simulations showing the Type I error for these methods for various data generating distributions satisfying the null hypothesis (2) of the paper, for different sample sizes. Then we present results from additional simulations comparing the power of the different methods under various additional working models, data generating distributions, and sample sizes. Next, we discuss selection of coefficients for use in our regression-based method, and the possibility of combining test statistics based on several methods or working models. Lastly, we give the R code used for our simulations. Note that each reported value in the tables below is the result of 100,000 Monte Carlo simulations.

    A.1 Details of Hypothesis Testing Methods M0-M5

We give the details of hypothesis testing methods M0-M5 below; illustrative R sketches of several of these tests follow the list.

Hypothesis Testing Methods:

M0: Regression-based test: This is the hypothesis testing method (*) described in Section 4. The hypothesis test (*) requires that one pre-specify a set of coefficients corresponding to treatment terms in the working model used. In all of the simulation results below and in the paper, we used the estimated coefficients corresponding to all terms in the working model that contain the treatment variable, and combined these into a Wald statistic, as described in Web Appendix B.

M1: Intention-to-treat based test: Reject the null hypothesis whenever the 95% confidence interval for the estimated risk difference excludes 0. This is a standard z-test for comparing two sample means.

M2: Cochran-Mantel-Haenszel test: (Cochran, 1954; Mantel and Haenszel, 1959) First, the baseline variable is discretized, and then the Cochran-Mantel-Haenszel test is run. We discretized into five levels, corresponding to the quintiles of the distribution of the baseline variable. That is, the odds ratio within the strata corresponding to each of these levels was computed, and these were then combined into a single test statistic using the weights specified by the Cochran-Mantel-Haenszel test (see e.g. pg. 130 of (Jewell, 2004)).

M3: Permutation test: (Rosenbaum, 2002) First, the binary outcome Y is regressed on the baseline variable V using a logistic regression working model for P(Y = 1|V), which we define in the subsection on working models below. Pearson residuals for each observation are calculated based on the model fit. Then, the residuals for observations with A = 1 are compared to those with A = 0 using the Wilcoxon rank sum test.

M4: Targeted Maximum Likelihood based test: (Moore and van der Laan, 2007; van der Laan and Rubin, 2006) The risk difference is estimated, adjusting for the baseline variable using the targeted maximum likelihood approach; the null hypothesis is rejected if the 95% confidence interval for the risk difference excludes 0. The details of this approach are given in Section 3.1 of (Moore and van der Laan, 2007).

M5: Augmented Estimating Function based test: (Tsiatis et al., 2007; Zhang et al., 2007) The log odds ratio is estimated, using an estimating function that is augmented to adjust for the baseline variable; the null hypothesis is rejected if the 95% confidence interval for the log odds ratio excludes 0. The details are given in Section 4 of (Zhang et al., 2007) (where the direct method was used).
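To make methods M1-M3 concrete, the following minimal R sketches compute each test's two-sided p-value. The function names and interfaces are illustrative rather than the code used for the simulations (which appears in Web Appendix A.6), and M3 is shown with a simple main-term working model for P(Y = 1|V).

# M1: intention-to-treat z-test for the risk difference
M1 <- function(Y, A) {
  p1 <- mean(Y[A == 1]); p0 <- mean(Y[A == 0])
  se <- sqrt(p1 * (1 - p1) / sum(A == 1) + p0 * (1 - p0) / sum(A == 0))
  2 * pnorm(-abs((p1 - p0) / se))
}

# M2: Cochran-Mantel-Haenszel test, stratifying on quintiles of V
M2 <- function(Y, A, V) {
  strata <- cut(V, quantile(V, 0:5 / 5), include.lowest = TRUE)
  mantelhaen.test(table(A, Y, strata))$p.value
}

# M3: Wilcoxon rank sum test comparing Pearson residuals between arms
M3 <- function(Y, A, V) {
  fit <- glm(Y ~ V, family = binomial)   # working model for P(Y = 1|V)
  r <- residuals(fit, type = "pearson")
  wilcox.test(r[A == 1], r[A == 0])$p.value
}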

    A.2 Working Models 1, 2, and 3 from the Paper

In Section 6 of the paper, we gave informal descriptions of the three working models used in the simulations there. We now formally define these working models. Consider the working models used by methods M0, M4, and M5. (The working models used by method M3 need to be slightly different, and we deal with these next.) These working models are for the probability that the outcome Y = 1 given treatment A and baseline variable V. Working Model 1, which is correctly specified for all the data generating distributions defined in Section 6, is

$\mathrm{logit}^{-1}(\beta_0 + \beta_1 A + \beta_2 V + \beta_3 AV)$. (1)

Working Model 2 has a different functional form than the true data generating distributions. Working Model 2 is defined as

$\mathrm{logit}^{-1}(\beta_0 + \beta_1 A + \beta_2 V^2 + \beta_3 A V^2)$. (2)

Working Model 3 incorporates a noisy version of the baseline variable V, to represent measurement error. More precisely, Working Model 3 is obtained by replacing V in (1) by the variable $\tilde V = \mathrm{sign}(V) + W$, where sign(x) is 1 for x > 0, 0 if x = 0, and -1 for x < 0, and where W is a standard normal random variable independent of V, A, Y. Working Models 2 and 3 are misspecified for the data generating distributions corresponding to those described in Section 6 for which the outcome is generated according to the model $P(Y = 1|A, V) = \mathrm{logit}^{-1}(A + V)$ or the model $P(Y = 1|A, V) = \mathrm{logit}^{-1}(A + V - AV)$. (Working Models 2 and 3 are correctly specified when the outcome is generated according to $P(Y = 1|A, V) = \mathrm{logit}^{-1}(A)$, as is any logistic regression model containing A as a main term.)
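In R model-formula notation, the three working models can be sketched as follows (the variable names are ours; V is the observed baseline variable):

V <- rnorm(200)                        # example baseline variable
# Working Model 1: logit^{-1}(beta0 + beta1*A + beta2*V + beta3*A*V)
wm1 <- Y ~ A * V
# Working Model 2: V replaced by V^2
wm2 <- Y ~ A * I(V^2)
# Working Model 3: V replaced by the noisy version sign(V) + W,
# with W standard normal, drawn independently of V, A, Y
Vtilde <- sign(V) + rnorm(length(V))
wm3 <- Y ~ A * Vtilde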


Method M3 (the permutation-based test) requires a working model for the probability that the outcome Y = 1 given just the baseline variable V. In order to make the comparison with the other methods a fair one, we define Working Model 1 used by method M3 so that it is correctly specified. This requires use of a slightly complex logistic regression model: $\mathrm{logit}^{-1}(\beta_0 + \beta_1 f(V))$, where we set

$f(V) = \mathrm{logit}^{-1}\{[\mathrm{logit}(P(Y = 1|1, V)) + \mathrm{logit}(P(Y = 1|0, V))]/2\}$.

Working Model 2 for method M3 is formed by instead using $f(V) = V^2$. Working Model 3 for method M3 is formed by instead using $f(V) = \mathrm{sign}(V) + W$, where W is a standard normal random variable independent of V, A, Y.

    A.3 Type I error

We now explore the Type I error of methods M0-M5, using simulations. Since all the methods we consider have asymptotically correct Type I error, the remaining question is how large their Type I error is for different sample sizes under various data generating distributions. We consider two cases: first, we look at what happens when we use Working Model 1 defined above, under various data generating distributions for which the null hypothesis is true. Next, we look at what happens to Type I error when working models include a large number of variables. To summarize our findings: in all cases considered, Type I error was either correct (at most 0.05) or nearly correct (at most 0.06) for all methods in all situations we considered.

We consider five different data generating distributions, described in Table 1 of this Web Appendix below. For each of these five distributions, we consider two sample sizes: 200 and 400 subjects. For both of these sample sizes and all five of the data generating distributions considered in Table 1 of this Web Appendix, the Type I error behavior of all the methods was either correct (at most 0.05) or nearly correct (at most 0.06).

We now consider what happens to Type I error when models use a large number of baseline variables. This is of interest since it is useful to know if using a long list of predictors could adversely affect Type I error for moderate sample sizes. We only consider methods M0, M3, M4, and M5, since the other methods do not use working models.

We consider data generated as follows: first, we generate 10 baseline variables $V_1, \ldots, V_{10}$, each of which is normally distributed with mean 0 and variance 1 and independent of the others; next, the outcome Y is set to be 1 with probability $\mathrm{logit}^{-1}((V_1 + \cdots + V_{10})/\sqrt{10})$, and 0 otherwise; the treatment variable A is binary and is generated independently of $V_1, \ldots, V_{10}$ and Y. Since in this case the treatment has no effect on the outcome, the null hypothesis (2) of the paper is true; thus, any rejections of the null are false rejections (Type I error).
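A minimal R sketch of this data generating distribution, and of fitting a working model with the first j baseline variables (the object names and the choice j = 6 are illustrative):

n <- 200; j <- 6
V <- matrix(rnorm(n * 10), nrow = n,
            dimnames = list(NULL, paste0("V", 1:10)))  # V1,...,V10 i.i.d. N(0,1)
Y <- rbinom(n, 1, plogis(rowSums(V) / sqrt(10)))  # logit^{-1}((V1+...+V10)/sqrt(10))
A <- rbinom(n, 1, 0.5)                            # independent of V and Y: null holds
dat <- data.frame(Y, A, V)
fit <- glm(reformulate(c("A", paste0("V", 1:j)), "Y"), binomial, data = dat)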

    We look at Type I error for methods M0, M3, M4, M5 when they use workingmodels containing different numbers of the baseline variables. We let methods M0,


Table 1 of this Web Appendix: Type I error for methods M0-M5, at various data generating distributions. The working model used by methods M0, M4, and M5 below is $\mathrm{logit}(P(Y = 1|A, V)) = \beta_0 + \beta_1 A + \beta_2 V + \beta_3 AV$. Below, W is a standard normal random variable independent of V, A, Y. Involving W in the data generating distribution represents the situation, which will almost always occur in practice, where unmeasured variables affect the mean outcome. We consider sample sizes 400 and 200, respectively.

Type I Error When Data Generated by Logistic Regression Model logit(P(Y|A, V)) =

Hypothesis Testing Methods      V      V^2/2   e^V/1.63   V + W   e^V/1.63 + W

At Sample Size 400
M0: Regression Based           0.05    0.05     0.05      0.05       0.05
M1: Intention-to-Treat         0.05    0.05     0.05      0.05       0.05
M2: C-M-H Test                 0.04    0.04     0.04      0.04       0.04
M3: Permutation Based          0.05    0.05     0.05      0.05       0.05
M4: Targeted MLE               0.05    0.05     0.05      0.05       0.05
M5: Aug. Estimating Fn.        0.05    0.05     0.05      0.05       0.05

At Sample Size 200
M0: Regression Based           0.04    0.04     0.04      0.05       0.05
M1: Intention-to-Treat         0.05    0.05     0.05      0.05       0.05
M2: C-M-H Test                 0.04    0.04     0.04      0.04       0.04
M3: Permutation Based          0.05    0.05     0.05      0.05       0.05
M4: Targeted MLE               0.05    0.05     0.05      0.05       0.05
M5: Aug. Estimating Fn.        0.05    0.05     0.05      0.05       0.05


Table 2 of this Web Appendix: Type I error for methods M0, M3, M4, M5 at a single data generating distribution, for various working models containing different numbers of baseline variables. The working model used by methods M0, M4, and M5 is $m(A, V|\beta) = \mathrm{logit}^{-1}(\beta_0 + \alpha_0 A + \beta_1 V_1 + \cdots + \beta_j V_j)$, for each $j \in \{2, 4, 6, 8, 10\}$. The working model used by method M3 is $\mathrm{logit}^{-1}(\beta_0 + \beta_1 V_1 + \cdots + \beta_j V_j)$, for each $j \in \{2, 4, 6, 8, 10\}$. We consider sample sizes 400 and 200, respectively.

Type I Error When Working Model Used Contains j Baseline Variables:

Hypothesis Testing Methods     j = 2   j = 4   j = 6   j = 8   j = 10

At Sample Size 400
M0: Regression Based            0.05    0.05    0.05    0.05    0.05
M3: Permutation Based           0.05    0.05    0.05    0.05    0.05
M4: Targeted MLE                0.05    0.05    0.05    0.06    0.06
M5: Aug. Estimating Fn.         0.05    0.05    0.05    0.06    0.06

At Sample Size 200
M0: Regression Based            0.05    0.05    0.05    0.05    0.04
M3: Permutation Based           0.05    0.05    0.05    0.05    0.05
M4: Targeted MLE                0.05    0.06    0.06    0.06    0.06
M5: Aug. Estimating Fn.         0.05    0.06    0.06    0.06    0.06

M4, and M5 use the following working model for P(Y = 1|A, V):

$m(A, V|\beta) = \mathrm{logit}^{-1}(\beta_0 + \alpha_0 A + \beta_1 V_1 + \cdots + \beta_j V_j)$,

for each $j \in \{2, 4, 6, 8, 10\}$; this corresponds to five different working models, and we look at Type I error when each one is used. For method M3, we use as the working model for P(Y = 1|V) the following: $\mathrm{logit}^{-1}(\beta_0 + \beta_1 V_1 + \cdots + \beta_j V_j)$, for each $j \in \{2, 4, 6, 8, 10\}$. The results of simulations using this set of working models are given in Table 2 of this Web Appendix below.

For all the methods considered (M0, M3, M4, M5), the Type I error at nominal level 0.05 was always at most 0.06, for all working models considered (ranging from 4 terms to 12 terms).


A.4 Power under Additional Working Models and Data Generating Distributions

Table 3 of this Web Appendix gives the approximate power using the same set of data generating distributions as in Table 1 of the paper, but under several other misspecified working models. Working Model 4 is similar to Working Model 1, except that the logit link is replaced by the probit link. Working Model 5 is similar to Working Model 2, except that we use the function $\sqrt{|V|}$ instead of $V^2$. Working Model 6 is similar to Working Model 3, except that we replace V in (1) by the variable $\tilde V = \mathrm{sign}(V) + 2W$, where W is a standard normal random variable independent of V, A, Y.

We can compare the power of methods M0-M5 between Table 1 in the paper and Table 3 of this Web Appendix, under the various working models considered. Comparing Working Model 4, in which the probit link was used while the data were actually generated using the logit link, to Working Model 1 (which used the correct link function), we see virtually no difference in power; the only difference is that the power of method M0 is slightly increased under Working Model 4, under data generating distribution 2.¹ Comparing Working Models 2 and 5, where the functional forms of the working models were wrongly set using $V^2$ and $\sqrt{|V|}$, respectively, instead of V, we see the power is quite similar. Comparing Working Models 3 and 6, where noise was added to the baseline variable (with more noise added in Working Model 6), we see that the additional noise added in Working Model 6 has almost no effect, except for reducing the power of method M0 under data generating distribution 3 defined in the paper. The power of method M0 decreases when more noise is added since this added noise attenuates the coefficients corresponding to the treatment terms. This indicates that the power of method M0 can be sensitive to measurement error, depending on how much it erodes the coefficients being used in the hypothesis test (*).

In Table 4 of this Web Appendix, we further examine for which data generating distributions the regression-based method M0 has more power than the other methods we considered. We consider a different set of data generating distributions than in the above Table 3 of this Web Appendix. Working Model 3 (defined above) is used, and is misspecified for these data generating distributions. The first column in Table 4 of this Web Appendix corresponds to a data generating distribution for which the treatment A always helps (increases the probability that Y = 1) within all strata of the baseline variable V. The second column corresponds to a data generating distribution for which the treatment sometimes helps (when $V^2 < 1.5$) and sometimes hurts (when

¹ This occurs since for data generating distribution 2, under Working Model 4 the expected value of the estimated coefficient $\beta_1$ is smaller than the corresponding expected value under Working Model 1, but the robust standard error is also smaller under Working Model 4 than under Working Model 1; it turns out that the magnitude of the reduction in standard error, in this case, is relatively larger than the magnitude of the reduction in expected values, resulting in slightly increased power overall. (Under both working models, the expected value of $\beta_3$ is 0.)


Table 3 of this Web Appendix: Power when Working Model is Incorrectly Specified. Sample size is 200. The working models used are defined above. The data generating distributions corresponding to each column are those described in Section 6, exactly the same as in Table 1 of the paper.

Power When Data Generated by Logistic Regression Model With:

Hypothesis Testing Methods    Treatment       Main Terms    Main + Interaction
                              Term Only: A    Only: A, V    Terms: A, V, AV

Using Working Model 4 (Wrong Link Function)
M0: Regression Based          0.85            0.73          0.93
M1: Intention-to-Treat        0.93            0.76          0.52
M2: C-M-H Test                0.91            0.79          0.49
M3: Permutation Based         0.92            0.79          0.65
M4: Targeted MLE              0.93            0.84          0.54
M5: Aug. Estimating Fn.       0.92            0.83          0.53

Using Working Model 5 (Wrong Functional Form)
M0: Regression Based          0.85            0.65          0.56
M1: Intention-to-Treat        0.93            0.76          0.52
M2: C-M-H Test                0.91            0.73          0.48
M3: Permutation Based         0.81            0.67          0.48
M4: Targeted MLE              0.93            0.77          0.52
M5: Aug. Estimating Fn.       0.92            0.76          0.51

Using Working Model 6 (Measurement Error)
M0: Regression Based          0.85            0.63          0.48
M1: Intention-to-Treat        0.93            0.76          0.52
M2: C-M-H Test                0.91            0.73          0.48
M3: Permutation Based         0.81            0.64          0.44
M4: Targeted MLE              0.93            0.76          0.52
M5: Aug. Estimating Fn.       0.92            0.76          0.51


Table 4 of this Web Appendix: Power when Working Model is Incorrectly Specified. Sample size is 200. Working Model 3 is used, in which the baseline variable V is replaced by a noisy version, to represent measurement error.

Power When Data Generated Using Logistic Regression Model logit(P(Y|A, V)) =

Hypothesis Testing Methods    A + V^2/1.5   A + V^2/1.5 - AV^2/1.5   A + sign(V) - A sign(V)

M0: Regression Based          0.75          0.16                     0.73
M1: Intention-to-Treat        0.79          0.16                     0.63
M2: C-M-H Test                0.75          0.14                     0.60
M3: Permutation Based         0.52          0.12                     0.66
M4: Targeted MLE              0.78          0.16                     0.64
M5: Aug. Estimating Fn.       0.78          0.16                     0.63

$V^2 > 1.5$). The third column corresponds to a data generating distribution for which the treatment always helps or has no effect, within each stratum of V. The regression-based method outperforms the other methods in column 3 and performs as well as the others in column 2. The simulation results from Table 1 of the paper, and Tables 3 and 4 of this Web Appendix, are consistent with the regression method performing well, in comparison to methods M1-M5, when the treatment effect is large in some subpopulations and either negative or null in other subpopulations. However, since this only holds for the data generating distributions we considered, we caution against generalizing this finding to all data generating distributions. In particular, when the regression model is severely misspecified, we expect the regression-based method to perform quite poorly.

    A.5 Coefficient Selection and Combining Test Statistics.

Here we focus on two issues related to the hypothesis testing methods we have considered. First, for the regression-based method of this paper, we consider the question of which coefficients $\beta_i$ from the working model to use. We then consider ways to combine tests based on several methods or working models. Both of these are open problems, and we only outline several ideas here.

We first turn to the problem of selecting a subset of the coefficients $\beta_i$ from a given working model to use in our hypothesis test (*). Recall from Section 4 that using any subset of the coefficients $\beta_i$ corresponding to terms containing the treatment variable


Table 5 of this Web Appendix: Power of Regression-Based Method M0, Based on Different Sets of Coefficients from the Working Model $\mathrm{logit}^{-1}(\beta_0 + \beta_1 A + \beta_2 V + \beta_3 AV)$. Sample size is 200.

Power When Data Generated Using Logistic Regression Model with:

Hypothesis Testing Methods    Treatment       Main Terms    Main + Interaction
                              Term Only: A    Only: A, V    Terms: A, V, AV

M0: Using β1 only             0.86            0.80          0.83
M0: Using β3 only             0.04            0.04          0.88
M0: Using β1 and β3           0.86            0.71          0.93

A, the hypothesis test (*) based on a Wald statistic combining the estimates of these coefficients (as described in Web Appendix B) will have asymptotically correct Type I error under the assumptions of Sections 3 and 4. Thus, the choice of which coefficients to use can be based solely on power. However, this choice of coefficients must be made prior to looking at the data. In Table 5 of this Web Appendix below, we give the power for the regression-based method M0 of our paper, using the correctly specified Working Model 1 (1), based on different sets of coefficients. The data generating distributions are those defined in Section 6 and used in Table 1 of the paper.

The results in Table 5 of this Web Appendix are consistent with it being advantageous to use both $\beta_1$ and $\beta_3$ if there are strong main effects and interaction effects (as in the data generating distribution in column 3). The power using both $\beta_1$ and $\beta_3$ is the same as when using only $\beta_1$ for data generating distribution 1, in which the baseline variable is independent of the outcome. For data generating distribution 2, where the baseline variable does influence the outcome and there is no interaction effect, using $\beta_1$ gives more power than using both $\beta_1$ and $\beta_3$. It seems quite risky to use only $\beta_3$, which corresponds to the interaction term, when there is no interaction effect in the data generating distribution (columns 1 and 2), since in these cases power is quite low. Thus, $\beta_3$ should not be used alone in the hypothesis test (*) if no interaction is suspected. In general, the best choice of coefficients is a function of the unknown data generating distribution, and so no general rule for choosing a subset will work in all situations. However, since the test using both $\beta_1$ and $\beta_3$ does relatively well in the situations in Table 5 of this Web Appendix, and since it is able to gain power from both main and interaction effects, it may make sense in many situations to use both of these.
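As a hedged illustration of these choices (assuming Working Model 1 has been fit as fit <- glm(Y ~ A * V, family = binomial), so the treatment terms are named A and A:V):

library(sandwich)
# Test based on beta_1 only, using the robust standard error:
p_beta1 <- function(fit) {
  se <- sqrt(vcovHC(fit, type = "HC0")["A", "A"])
  2 * pnorm(-abs(coef(fit)["A"] / se))
}
# Test based on beta_1 and beta_3 jointly, via the Wald statistic of Web Appendix B:
p_beta1_beta3 <- function(fit) {
  idx <- c("A", "A:V")
  b <- coef(fit)[idx]
  w <- as.numeric(t(b) %*% solve(vcovHC(fit, type = "HC0")[idx, idx]) %*% b)
  pchisq(w, df = length(idx), lower.tail = FALSE)
}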

    A natural and important question, raised by a referee of this paper, is whether


it ever makes sense, when using the regression-based method of our paper, to use a working model but not use all coefficients corresponding to terms containing the treatment variable. That is, since Type I error is asymptotically correct for any working model, rather than use only a subset of the possible coefficients from a working model, why not just use a different working model that omits the terms corresponding to unused coefficients? For example, instead of using the working model

$\mathrm{logit}^{-1}(\beta_0 + \beta_1 A + \beta_2 V + \beta_3 AV)$, (3)

and only the coefficient $\beta_1$, why not just use the following working model:

$\mathrm{logit}^{-1}(\beta_0 + \beta_1 A + \beta_2 V)$, (4)

with coefficient $\beta_1$? Our simulations are consistent with an answer of yes to the above question. That is, in our simulations, for the data generating distributions 1, 2, and 3 (defined in Section 6), the working model and corresponding set of coefficients with the most power was always either model (3) above using both coefficients $\beta_1, \beta_3$ (for data generating distribution 3) or model (4) above using coefficient $\beta_1$ (for data generating distributions 1 and 2).² However, we do not have a proof that it is always better, in terms of power, to restrict to using all coefficients corresponding to terms containing the treatment variable in a given working model. This is an important open research question.

We now consider ways to combine test statistics based on several methods or working models. The goal is to construct a single test based on several test statistics, so that the Type I error of the combined test is asymptotically correct. This is appealing since one may be able to simultaneously have adequate power against different sets of alternatives targeted by different tests. Since test statistics based on different methods or working models will likely be correlated with each other, we must use appropriate methods for combining them. For example, we could use the Bonferroni multiple testing correction to combine test statistics from method M0 (the regression-based method of this paper) with method M4 (targeted maximum likelihood estimation), where the combined test rejects if the p-value from either of these methods is less than 0.025 (so as to maintain overall nominal level 0.05). The power of this combined test, when both methods use Working Model 1 under the data generating distributions 1, 2, and 3 defined in Section 6, is, respectively, 0.89, 0.77, and 0.88. Comparing this to the power of each of the individual methods M0-M5 under these data generating distributions (see Table 1 of the paper), we see that the combined test has consistently good power for every data generating distribution (though it is never the best), while all the other methods do well for some data generating distributions and poorly in others. Thus, it may be advantageous to use the combined method.

² The power for these methods is given in Table 1 of the paper and Tables 5 and 6 of this Web Appendix.
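The Bonferroni combination itself is one line; a sketch (the argument names are ours):

# Reject at overall level 0.05 if either test's p-value is below 0.025
combined_reject <- function(p_M0, p_M4, alpha = 0.05) {
  (p_M0 < alpha / 2) || (p_M4 < alpha / 2)
}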


Table 6 of this Web Appendix: Power of Regression-Based Method M0, Based on Two Working Models. 1st Working Model: $\mathrm{logit}^{-1}(\beta_0 + \beta_1 A + \beta_2 V)$; 2nd Working Model: $\mathrm{logit}^{-1}(\beta_0 + \beta_1 A + \beta_2 V + \beta_3 AV)$. Sample size is 200.

Power When Data Generated Using Logistic Regression Model with:

Hypothesis Testing Methods    Treatment       Main Terms    Main + Interaction
                              Term Only: A    Only: A, V    Terms: A, V, AV

M0: 1st Working Model         0.92            0.82          0.50
M0: 2nd Working Model         0.86            0.71          0.93
M0: Both Working Models       0.87            0.74          0.88

As another example, we look at combining test statistics based on the regression method of our paper, where each test statistic is based on a different working model. Consider the following two working models:

$\mathrm{logit}^{-1}(\beta_0 + \beta_1 A + \beta_2 V)$, (5)

and

$\mathrm{logit}^{-1}(\beta_0 + \beta_1 A + \beta_2 V + \beta_3 AV)$. (6)

In this example, we use both coefficients $\beta_1$ and $\beta_3$ to carry out the hypothesis test using the second working model. Table 6 of this Web Appendix gives the power, at data generating distributions 1, 2, and 3, as in the previous example, based on each working model separately, and then based on a combined test statistic using the Bonferroni correction as in the previous example. The first working model above (with just main terms) gives more power under data generating distributions 1 and 2 (where no interaction effect is present). The second working model above has much more power at data generating distribution 3, where there is a strong interaction effect. The test based on combining test statistics from both working models performs well under all the data generating distributions (though it is never the most powerful).

Other methods for combining correlated test statistics can be found in (Hochberg and Tamhane, 1987; Westfall and Young, 1993; van der Laan and Hubbard, 2005). It is an open problem to determine which tests and working models might be combined to form tests that have adequate power at a variety of alternatives of interest, and still have asymptotically correct Type I error regardless of whether the working model is correctly specified.


    A.6 R Code for Power and Type I Error Simulations

Below, we give the R code for the simulations from this appendix and from Section 6. This particular set of code is for the simulations when data are generated from the logistic regression model $P(Y = 1|A, V) = \mathrm{logit}^{-1}(A + V - AV)$. Comments in the code below explain how to adapt the code to other data generating distributions. The code below is for Working Model 1 defined above, but can be easily adapted for the other working models considered.

# Load library containing function to compute sandwich estimator:
library(sandwich)

# We give the code for each testing method first.
# Each function returns a (2-sided) p-value.

# M0: Regression-based method
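A minimal sketch of the M0 function, implementing the Wald test of Web Appendix B under Working Model 1; the body below is our reconstruction of the method as described in Web Appendix A.1, not necessarily the original listing:

M0 <- function(Y, A, V) {
  # Fit Working Model 1: logit^{-1}(beta0 + beta1*A + beta2*V + beta3*A*V)
  fit <- glm(Y ~ A * V, family = binomial)
  # Sandwich (robust) covariance matrix of the coefficient estimates
  Sigma <- vcovHC(fit, type = "HC0")
  # Use all coefficients on terms containing the treatment variable A
  idx <- c("A", "A:V")
  b <- coef(fit)[idx]
  # Wald statistic; asymptotically chi-squared with length(idx) df under the null
  w <- as.numeric(t(b) %*% solve(Sigma[idx, idx]) %*% b)
  pchisq(w, df = length(idx), lower.tail = FALSE)
}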


# Construct function Z based on V so that the regression model
# logit^(-1)(beta_0 + beta_1 Z) for P(Y=1|V) is correctly specified.
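For the data generating distribution used here, logit P(Y = 1|1, V) = 1 and logit P(Y = 1|0, V) = V, so the construction of f(V) from Web Appendix A.2 reduces to the following sketch:

# Z = logit^{-1}{[logit P(Y=1|1,V) + logit P(Y=1|0,V)]/2} = logit^{-1}((1 + V)/2)
Z1 <- plogis((1 + V) / 2)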


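The simulation loop tallies rejections over the Monte Carlo runs; a minimal sketch (the counter name total_M0 and the loop structure are illustrative, and analogous counters would be kept for M1-M5):

n_sim <- 100000; n <- 200
total_M0 <- 0
for (s in 1:n_sim) {
  V <- rnorm(n)
  A <- rbinom(n, 1, 0.5)
  Y <- rbinom(n, 1, plogis(A + V - A * V))  # change this line for other distributions
  total_M0 <- total_M0 + (M0(Y, A, V) < 0.05)
}
total_M0 / n_sim  # proportion of rejections: power (or Type I error under the null)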

B Robust Variance Estimators

Robust variance estimators are implemented in standard statistical software; see, e.g., (Hardin and Hilbe, 2007, Section 3.6.3). In R, the function vcovHC in the contributed package {sandwich} gives robust estimates of the covariance matrix of the maximum likelihood estimator (R Development Core Team, 2004); the diagonal elements of this matrix are robust variance estimators for the model coefficients. All of these methods are based on the sandwich estimator (Huber, 1967), which we describe below.

We give the sandwich estimator (Huber, 1967) in the setting of generalized linear models estimated via maximum likelihood. This includes ordinary least squares estimation in linear models as a special case, since this is equivalent to maximum likelihood estimation using a Gaussian (Normal) generalized linear model. Assume we are using a regression model from Section 5 of the paper. Denote the conditional density (or frequency function for discrete random variables) implied by the regression model by $p(Y|A, V, \beta)$. (Note that we do not require that the actual data generating distribution belongs to this model.) We denote the corresponding log-likelihood of Y given A, V, for a single subject, as $l(\beta; V, A, Y) = \log p(Y|A, V, \beta)$. Assume there exists a finite, unique maximizer $\beta^*$ of $E(l(\beta; V, A, Y))$, where the expectation here and below is taken with respect to the true (unknown to the experimenter) distribution of the data. (At the end of this Web Appendix B, we explain why the hypothesis test (*) given in Section 4 of the paper will still have asymptotically correct Type I error even when no such unique maximizer exists.) Let $\hat\beta_n$ denote the maximum likelihood estimator at sample size n. Then by Theorem 5.23 in (van der Vaart, 1998), even when the model is misspecified, the covariance matrix of $\sqrt{n}(\hat\beta_n - \beta^*)$ converges to the sandwich formula:

$\Sigma = B^{-1} W (B^{-1})^T$, (7)

for B the matrix with (i, j) entry

$B_{ij} = E\left(\frac{\partial^2 l}{\partial\beta_i \partial\beta_j}\right)$, (8)

and W the matrix with (i, j) entry

$W_{ij} = E\left(\frac{\partial l}{\partial\beta_i}\frac{\partial l}{\partial\beta_j}\right)$, (9)

where all the above derivatives are taken at $\beta = \beta^*$.

The matrix (7) can be estimated based on the data. First, one approximates B and W by using the empirical distribution instead of the true data generating distribution in (8) and (9), where one replaces $\beta^*$ by the maximum likelihood estimate $\hat\beta_n$. One then combines these estimates for B and W as in (7) to get an estimated matrix that we denote by $\hat\Sigma$. Sometimes this estimated matrix is multiplied by a finite sample adjustment, such as $n/(n - t)$, where n is the sample size and t is the number of terms in the regression model. The classes of regression models defined in Section 5


are sufficiently regular that $\hat\Sigma$ converges to $\Sigma$ in probability as sample size goes to infinity, even when the model is misspecified. The robust variance estimates for the estimated coefficients are the diagonal elements of the matrix $\hat\Sigma$.
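As a concrete check of (7)-(9), the pieces of the sandwich estimator can be computed directly in R with the bread() and meat() helpers from the sandwich package (a sketch; the simulated data are illustrative):

library(sandwich)
set.seed(1)
V <- rnorm(200); A <- rbinom(200, 1, 0.5)
Y <- rbinom(200, 1, plogis(A + V))
fit <- glm(Y ~ A * V, family = binomial)
# bread(fit) estimates B^{-1} (up to scale) and meat(fit) estimates W
Sigma_hat <- bread(fit) %*% meat(fit) %*% bread(fit) / nobs(fit)
all.equal(Sigma_hat, sandwich(fit))  # sandwich() assembles the same product
sqrt(diag(Sigma_hat))                # robust standard errors of the coefficients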

We said in Assumption (A3) of Section 4 of the paper that one can select more than one coefficient from the regression model for use in the hypothesis test (*). Here we describe how to create and use a Wald statistic to test the null hypothesis given in Section 3 of the paper. Say we are using m coefficients; recall from (A3) that each such coefficient must correspond to a term containing the treatment variable. Denote the $m \times 1$ vector of estimated coefficients one has selected by $\hat\beta$. Denote the $m \times m$ covariance matrix resulting from the sandwich estimator, restricted to these coefficients, by $\hat\Sigma$. Then the Wald statistic we use is $w = n\hat\beta^T\hat\Sigma^{-1}\hat\beta$. Under the null hypothesis, this statistic has as asymptotic distribution the $\chi^2$ distribution with m degrees of freedom; this is proved below in Web Appendix D. We reject the null hypothesis, then, if the statistic w exceeds the 0.95-quantile of the $\chi^2$ distribution with m degrees of freedom (where m is the number of coefficients used).

We now consider the case when there does not exist a unique, finite maximizer $\beta^*$ of the expected log-likelihood, where the expectation is taken with respect to the true data generating distribution. We prove below in Web Appendix D that for the classes of generalized linear models described in Section 5, the expected log-likelihood is a strictly concave function. This guarantees either the existence of a unique, finite maximizer, or that for large enough sample size the maximum likelihood algorithm will fail to converge. In the latter case, statistical software will issue a warning, and our hypothesis testing procedure, as described in the paragraph above the main theorem in Section 4, is to fail to reject the null hypothesis in such situations. Thus, when there does not exist a unique, finite maximizer of the expected log-likelihood, Type I error converges to 0 as sample size tends to infinity.

C Robustness Results for Test of Effect Modification

The results presented in the paper were for tests of the null hypothesis of no mean treatment effect within strata of baseline variables, as formally defined in (2) of the paper. Here we prove a result for testing a different null hypothesis: that of no effect modification by selected baseline variables. This result only holds when the treatment A is dichotomous, the outcome Y is continuous, and a linear model of the form (5) is used. We now describe a regression-based hypothesis test that, in this setting, can be used to test whether V is an effect modifier on an additive scale, that is, to test the null hypothesis that the treatment effect E(Y|A = 1, V) - E(Y|A = 0, V) is a constant. This is a weaker null hypothesis than that of no mean treatment effect within strata of V: E(Y|A = 1, V) - E(Y|A = 0, V) = 0, which is the null hypothesis (2) of the paper.
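A minimal sketch of such a test for effect modification (assuming a continuous outcome and a linear working model with main terms and an AV interaction; names are illustrative):

library(sandwich)
test_effect_modification <- function(Y, A, V) {
  fit <- lm(Y ~ A * V)   # linear model with terms A, V, and A:V
  est <- coef(fit)["A:V"]
  se <- sqrt(vcovHC(fit, type = "HC0")["A:V", "A:V"])
  abs(est / se) > 1.96   # TRUE = reject "treatment effect is constant in V"
}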

D Proof of Theorem from Section 4

(A3) $\beta_i$ is a pre-specified coefficient of a term that contains the treatment variable A in the linear part of this model. Such coefficients are denoted by $\beta_j^{(0)}$ in (5) and (6) in Section 5. One can also use more than one of these coefficients; for example, one can use a Wald statistic as described in Web Appendix B.

Consider the following hypothesis test given in Section 4:

Hypothesis Test (*): For concreteness, we consider hypothesis tests at nominal level $\alpha = 0.05$. The parameter $\beta$ is estimated with ordinary least squares estimation if the model used is linear; otherwise it is estimated with maximum likelihood estimation. The standard error is estimated by the sandwich estimator (Huber, 1967), which can easily be computed with standard statistical software; we describe the sandwich estimator in detail in Web Appendix B. If a single coefficient $\beta_i$ is chosen in (A3), then the null hypothesis of no mean treatment effect within strata of V is rejected at level 0.05 if the estimate for $\beta_i$ is more than 1.96 standard errors from 0. Formally, this null hypothesis, defined in (2) in the paper, is that for all treatments $a_1, a_2$, $E(Y|A = a_1, V) = E(Y|A = a_2, V)$; note that this is equivalent to the single equality E(Y|A, V) = E(Y|V). If several coefficients are chosen in (A3), one can perform a similar test based on a Wald statistic that uses the estimates of these coefficients along with their covariance matrix based on the sandwich estimator; this is described above in Web Appendix B.

We note that in some cases, the estimators we consider will be undefined. For example, the ordinary least squares estimator will not be unique if the design matrix has less than full rank. Also, the maximum likelihood estimator will be undefined if no finite $\beta$ maximizes the likelihood of the data; furthermore, statistical software will fail to converge to a finite vector if the maximum of the likelihood is achieved at a finite $\beta$, but this $\beta$ has a component whose magnitude exceeds the maximum allowed by the statistical software. We therefore specify that, regardless of whether the estimate for the coefficient $\beta_i$ is more than 1.96 standard errors from 0, we always fail to reject the null hypothesis if the design matrix has less than full rank or if the maximum likelihood algorithm fails to converge. Since standard statistical software (e.g. Stata or R) will return a warning message when the design matrix is not full rank or when the maximum likelihood algorithm fails to converge, this condition is easy to check.

We assume that the statistical software used to implement maximum likelihood estimation uses the Fisher scoring method as described in Section 2.5 of (McCullagh and Nelder, 1998). We further assume this algorithm will fail to converge to a finite vector if the maximum of the likelihood of the data is not achieved for any finite $\beta$, or if this maximum is achieved at a finite $\beta$ having a component whose magnitude exceeds the maximum allowed by the statistical software. We denote the maximum magnitude allowed for any variable by the statistical software by M.


Theorem (10): Under assumptions A1-A3, the hypothesis test (*) has the robustness property (3) of the paper. That is, it has asymptotic Type I error at most 0.05, even when the model is misspecified.

We also prove this theorem under a modified version of assumption A1 above; this modified assumption involves treatments being randomly assigned to fixed proportions of subjects. We call this modified assumption A1′.

Assumption A1′: Each treatment $a \in \{0, 1, \ldots, k-1\}$ is randomly assigned to a fixed proportion $p_a > 0$ of the subjects, where $\sum_{a=0}^{k-1} p_a = 1$. For each subject, there is a vector of baseline variables V and a vector of unobserved potential outcomes $[Y(0), \ldots, Y(k-1)]$, representing the outcomes that subject would have had, had he/she been assigned the different possible treatments. The set of values $[V_j, Y_j(0), \ldots, Y_j(k-1)]$ for each subject j is drawn i.i.d. from an unknown distribution. The observed data vector for each subject j is the triple $(V_j, A_j, Y_j(A_j))$, where $A_j$ is the treatment assigned, and $A_j$ is independent (by randomization) of $V_j$ and the potential outcomes $Y_j(0), \ldots, Y_j(k-1)$. All variables are bounded.

We also prove, in the subsection of the proof for linear models below, that the test for effect modification given in Web Appendix C above has the robustness property (3) of the paper.

Proof of Theorem: We prove this theorem in two parts. First, we prove it for the linear models described in Section 5.1; next, we prove it for the non-linear models described in Section 5.2. In this Web Appendix, we prove the above theorem using the assumptions A1-A3 above. The proof of the above theorem under the modified version of assumption A1 (that is, using assumption A1′ instead of assumption A1) is given in our technical report (Rosenblum and van der Laan, 2007); the web address for this technical report is given in the bibliography.

Throughout the proof, expectations are with respect to the true data generating distribution Q (which is unknown to the experimenter); we make no assumptions on Q beyond (A1)-(A3) above. Convergence refers to convergence in probability, unless otherwise stated.


    D.1 Proof of Theorem for Linear Models

We prove the theorem above for the case when the model $m(A, V|\beta)$ is a linear model of form (5) from Section 5. That is, it is of the form

$m(A, V|\beta) = \sum_{j=1}^{t} \beta_j^{(0)} f_j(A, V) + \sum_{k=1}^{t'} \beta_k^{(1)} g_k(V)$, (11)

where $\{f_j, g_k\}$ can be any square-integrable functions such that for each term $f_j(A, V)$, we have that $E(f_j(A, V)|V)$ is a linear combination of the terms $\{g_k(V)\}$. We denote the parameter vector $(\beta^{(0)}, \beta^{(1)})$ simply by $\beta$. The parameter $\beta$ of this model is estimated by ordinary least squares. We also prove a converse to the theorem above for linear models: when the hypothesis test (*) uses a linear model not of the form (11) (but still such that all terms are square-integrable and such that the set of terms is linearly independent), then it will not have the robustness property (3) of the paper.

We denote the terms in the model $m(A, V|\beta)$ by the column vector

$x = [f_1(A, V), f_2(A, V), \ldots, f_t(A, V), g_1(V), g_2(V), \ldots, g_{t'}(V)]^T$.

Note that this vector has $t + t'$ components. Denote the values of this vector for each subject s by $x(s)$. If the components of the vector x, considered as random variables, are linearly dependent (that is, if for some non-zero column vector c, we have $c^T x$ equals 0 with probability 1), then for sample size $n \ge t + t'$, the design matrix $X = [x(1), x(2), \ldots, x(n)]^T$ will not be full rank, with probability 1. Therefore, since the above algorithm, by construction, fails to reject the null hypothesis whenever the design matrix is not of full rank, it will in this case fail to reject the null hypothesis with probability 1. This implies that if the components of x are linearly dependent, then the Type I error of our hypothesis testing procedure will be 0 for any sample size $n \ge t + t'$. We can therefore restrict attention to the case in which the components of x are linearly independent, which implies (by strict convexity of the function $x^2$) that there is a unique minimizer $\beta^*$ of $E(Y - m(A, V|\beta))^2$. We therefore restrict to the case in which such a unique minimizer $\beta^*$ exists for the remainder of this proof for linear models.

Theorem 5.23 in (van der Vaart, 1998, pg. 53) on the asymptotic normality of M-estimators implies that the ordinary least squares estimate of $\beta$ is asymptotically normal and converges in probability to the minimizer $\beta^*$ of $E(Y - m(A, V|\beta))^2$. We now show that such a minimizer is 0 for all components corresponding to terms in the model $m(A, V|\beta)$ containing the treatment variable A, when the null hypothesis that E(Y|A, V) = E(Y|V) is true. That is, in the notation of (11), we will show that $\beta^{*(0)} = 0$ when the null hypothesis is true. For the remainder of the proof, assume the null hypothesis E(Y|A, V) = E(Y|V) is true.

We start by showing that the unique minimizer $\beta^*$ of $E(Y - m(A, V|\beta))^2$ is also the unique minimizer of $E(E(Y|V) - m(A, V|\beta))^2$. This follows since

$E(Y - m(A, V|\beta))^2 = E(Y - E(Y|A, V) + E(Y|A, V) - m(A, V|\beta))^2$
$= E(Y - E(Y|A, V))^2 + E(E(Y|A, V) - m(A, V|\beta))^2$
$= E(Y - E(Y|A, V))^2 + E(E(Y|V) - m(A, V|\beta))^2$, (12)

where the cross term in the second line vanishes after first conditioning on (A, V), and where in the last line we used our assumption of no mean treatment effect within strata of V (that is, E(Y|A, V) = E(Y|V)).

We now show that the unique minimizer of $E(E(Y|V) - m(A, V|\beta))^2$ must have $\beta^{(0)} = 0$. This follows immediately from the following lemma, setting $c(A, V, \beta) = (E(Y|V) - m(A, V|\beta))^2$. Note that $(E(Y|V) - m(A, V|\beta))^2$ is integrable for all finite $\beta$, by our assumption A1 that V, A, Y are bounded, and our restriction in (11) that the functions $\{f_j, g_k\}$ defining $m(A, V|\beta)$ are square-integrable.

Lemma 1: Consider any function h(V), and any function $c(A, V, \beta)$ of the form

$c(A, V, \beta) = \left(h(V) - \sum_j \beta_j^{(0)} f_j(A, V) - \sum_k \beta_k^{(1)} g_k(V)\right)^2$, (13)

where for each $f_j(A, V)$, the function $E(f_j(A, V)|V)$ is a linear combination of the terms $\{g_k(V)\}$. Assume $c(A, V, \beta)$ is integrable for any finite $\beta$. Assume that A is independent of V and that there is a unique set of coefficients $\beta_{\min}$ achieving the minimum $\min_\beta E(c(A, V, \beta))$. Then $\beta_{\min}^{(0)} = 0$.

Proof of Lemma 1: Let $\bar h(V)$ be the $L^2$ projection of h(V) on the space of linear combinations of the functions $\{g_k(V)\}$. (See Chapter 6 of (Williams, 1991) for the definition and properties of the space $L^2$ of square-integrable random variables.) Note that all the functions $E(f_j(A, V)|V)$ are contained in the space of linear combinations of the functions $\{g_k(V)\}$ by the assumptions of the lemma. We will show that for all j, $f_j(A, V)$ is orthogonal to $h(V) - \bar h(V)$, which suffices to prove the lemma. This follows since $E[(h(V) - \bar h(V)) f_j(A, V)] = E\{E[(h(V) - \bar h(V)) f_j(A, V)|V]\} = E[(h(V) - \bar h(V)) E(f_j(A, V)|V)] = 0$, where the last equality follows since $E(f_j(A, V)|V)$ is orthogonal to $h(V) - \bar h(V)$ by our choice of $\bar h(V)$. Thus, $\bar h(V)$ is the $L^2$ projection of h(V) on the space generated by linear combinations of the set of functions $\{f_j(A, V)\} \cup \{g_k(V)\}$. Since $\bar h(V)$, by construction, only involves the terms $g_k(V)$, the lemma is proved.

Lemma 1 applied to the function $c(A, V, \beta) = (E(Y|V) - m(A, V|\beta))^2$ gives that the unique minimizer $\beta^*$ of $E(Y - m(A, V|\beta))^2$ satisfies $\beta^{*(0)} = 0$. Since the argument above implies that the ordinary least squares estimator of $\beta$ converges to $\beta^*$, we have that all the components of the estimator that correspond to functions $f_j(A, V)$ in the model (11) converge to 0. The theorem above for linear models of the form (11) then follows since the robust variance estimator given in Web Appendix B is asymptotically consistent, regardless of whether the model is correctly specified. This completes the proof that the above theorem holds for linear models of the form (11).

We now prove that for linear models, (11) is a necessary condition for the robustness property (3) of the paper to hold. That is, when the hypothesis test (*) uses a linear model not of the form (11) (but such that all terms are square-integrable and such that the set of terms is linearly independent³), then it will not have the robustness property (3) of the paper. More precisely, given any linear model and probability distribution on A, V such that the terms in the model are square-integrable and linearly independent, but for which the key property from (11) is false⁴, we show how to construct an outcome random variable Y such that (i) the null hypothesis E(Y|A, V) = E(Y|V) is true and (ii) the hypothesis test (*) rejects the null hypothesis with probability tending to 1 as sample size goes to infinity.

To construct a random variable Y with properties (i) and (ii) defined in the previous paragraph, first note that any linear model can be written in the form:

$m(A, V|\beta) = \sum_{j=1}^{t} \beta_j^{(0)} f_j(A, V) + \sum_{k=1}^{t'} \beta_k^{(1)} g_k(V)$. (14)

Take any model as in the previous display, and probability distribution under which $\{f_j(A, V), g_k(V)\}$ are square-integrable and linearly independent. Assume that the key property from (11) is false; that is, assume for some term $f_{j^*}(A, V)$, we have that $E(f_{j^*}(A, V)|V)$ is not a linear combination of the terms $\{g_k(V)\}$. Then we will show how to construct a random variable Y satisfying (i) and (ii) defined in the previous paragraph. Define $Y := E(f_{j^*}(A, V)|V) + \epsilon$, where $\epsilon$ is a standard normal random variable mutually independent of (A, V). Then (i) above follows immediately since Y is a function of V and a variable that is independent of A, V. Next, in order to prove (ii), we show that there is a coefficient $\beta_i^{(0)}$ corresponding to a treatment term that, if used in hypothesis test (*), will result in asymptotic Type I error tending to 1 as sample size goes to infinity.

By the assumed linear independence of the terms in the model, we have that there is a unique minimizer $\beta^*$ of

$E\left[Y - \left(\sum_{j=1}^{t} \beta_j^{(0)} f_j(A, V) + \sum_{k=1}^{t'} \beta_k^{(1)} g_k(V)\right)\right]^2$. (15)

Using the definition of Y, we have that $\beta^*$ also minimizes

$E\left[E(f_{j^*}(A, V)|V) - \left(\sum_{j=1}^{t} \beta_j^{(0)} f_j(A, V) + \sum_{k=1}^{t'} \beta_k^{(1)} g_k(V)\right)\right]^2$. (16)

Assume for the sake of contradiction that for all j, $\beta_j^{*(0)} = 0$. This and the previous

³ We say a finite set of random variables $\{Z_i\}$ is linearly independent if for any set of coefficients $c_i$, $\sum_i c_i Z_i = 0$ a.s. implies that for all i, $c_i = 0$.
⁴ This key property is that for all terms $f_j(A, V)$, we have that $E(f_j(A, V)|V)$ is a linear combination of the terms $\{g_k(V)\}$ in the model.


display imply that for all k,

$E\left\{\left[E(f_{j^*}(A, V)|V) - \sum_{k'=1}^{t'} \beta_{k'}^{*(1)} g_{k'}(V)\right] g_k(V)\right\} = 0$, (17)

and for all j,

$E\left\{\left[E(f_{j^*}(A, V)|V) - \sum_{k=1}^{t'} \beta_k^{*(1)} g_k(V)\right] f_j(A, V)\right\} = 0$. (18)

Taking $j = j^*$ in the previous display and using basic rules of conditional expectation gives

$E\left\{\left[E(f_{j^*}(A, V)|V) - \sum_{k=1}^{t'} \beta_k^{*(1)} g_k(V)\right] E(f_{j^*}(A, V)|V)\right\} = 0$. (19)

Equations (19) and (17) imply

$E\left[E(f_{j^*}(A, V)|V) - \sum_{k=1}^{t'} \beta_k^{*(1)} g_k(V)\right]^2 = 0$,

which contradicts the assumption that $E(f_{j^*}(A, V)|V)$ is not a linear combination of the terms $\{g_k(V)\}$. Thus, it must be that for some j, $\beta_j^{*(0)} \neq 0$. Let i be an index such that $\beta_i^{*(0)} \neq 0$.

The above arguments then imply that for Y defined as above, the minimizer $\beta^*$ of $E(Y - m(A, V|\beta))^2$ has a non-zero component $\beta_i^{*(0)}$ corresponding to a term $f_i(A, V)$ that contains the treatment variable. Since the ordinary least squares estimate of $\beta$ is asymptotically normal and converges in probability to $\beta^*$, we have that the estimator for $\beta_i^{(0)}$ will converge to a non-zero value. This causes the hypothesis test (*) using the estimate for $\beta_i^{(0)}$ to reject with arbitrarily large probability as sample size tends to infinity. This completes the proof that when the hypothesis test (*) uses a linear model not of the form (11), then it will not have the robustness property (3) of the paper.

Before proving Theorem (10) for the class of generalized linear models given in Section 5.2, we show how the above lemma implies that the hypothesis test for effect modification given in Web Appendix C has the robustness property (3) of the paper. Consider a linear model containing at least the main term A and interaction term AV, of the form $m(A, V|\beta) = \beta_0 + \beta_1 A + \beta_2 AV + \sum_{i \ge 3} \beta_i f_i(A, V)$. The proof above for linear models can be used unchanged up until (12). This is the point in the proof at which the null hypothesis (2) of the paper was used. Using instead the null hypothesis that E(Y|A = 1, V) - E(Y|A = 0, V) is a constant b, we can replace the expression in (12) by $E(Y - E(Y|A, V))^2 + E(E(Y|A = 0, V) + bA - m(A, V|\beta))^2$. This


shows that minimizing $E(Y - m(A, V|\beta))^2$ is equivalent to minimizing $E(E(Y|A = 0, V) + bA - m(A, V|\beta))^2$ under this null hypothesis of no effect modification. Let $\beta^*$ be the unique, finite minimizer of this function (if no such minimizer exists, then as argued above, Type I error goes to 0 as sample size goes to infinity). Then by Lemma 1, we have under this null hypothesis that $\beta^*_2$ (which corresponds to the AV term in $m(A, V|\beta)$) is 0 and that $\beta^*_1$ equals b. Since we are using robust variance estimators, we then have that the test for effect modification that rejects the null hypothesis when the design matrix has full rank and the estimate for the coefficient $\beta_2$ is more than 1.96 standard errors from 0 has asymptotic Type I error at most 0.05.

    D.2 Proof of Theorem for Generalized Linear Models

We prove Theorem (10) above for the large class of generalized linear models defined in Section 5.2 of the paper. We repeat this definition here. Consider the following types of generalized linear models: logistic regression, probit regression, binary regression with complementary log-log link function, and Poisson regression (using the log link function). We define our class of generalized linear models to be any generalized linear model from the previous list, coupled with a linear part of the following form:

$\eta(A, V|\beta) = \sum_{j=1}^{t} \beta_j^{(0)} f_j(A) g_j(V) + \sum_{k=1}^{t'} \beta_k^{(1)} h_k(V)$, (20)

for any measurable functions $\{f_j, g_j, h_k\}$ such that for all j, there is some k for which $g_j(V) = h_k(V)$; we also assume the functions $\{g_j, h_k\}$ are bounded on compact subsets of $\mathbb{R}^q$, where V has dimension q. We denote the parameter vector $(\beta^{(0)}, \beta^{(1)})$ simply by $\beta$. Let x denote the column vector of random variables corresponding to the terms in $\eta(A, V|\beta)$. That is, denote

$x := [f_1(A) g_1(V), f_2(A) g_2(V), \ldots, f_t(A) g_t(V), h_1(V), h_2(V), \ldots, h_{t'}(V)]^T$.

We can restrict attention to the case in which the components of x are linearly independent random variables, by virtue of the same arguments given above for the case of linear models; we assume the components of x are linearly independent for the remainder of the proof.

As described in (McCullagh and Nelder, 1998), the log-likelihood for any of the above generalized linear models can be represented, for suitable choices of functions b, d, g, as

$l(\beta; V, A, Y) = Y\theta - b(\theta) + d(Y)$, (21)

where $\theta$ is called the canonical parameter of the model, and is related to the parameter $\beta$ through the following equality: $b'(\theta) = g^{-1}(\eta(A, V|\beta))$, where $b'(\theta) := db/d\theta$ and g is called the link function. For binary outcomes, the function $b(\theta) = \log(1 + e^\theta)$


and $d(y) = 0$; for Poisson regression, in which the outcome is a nonnegative integer, $b(\theta) = e^\theta$ and $d(y) = -\log y!$. Note that in both cases, $b''(\theta) := d^2 b/d\theta^2 > 0$ for all $\theta$. Also note that $b'$ is invertible in both cases. Theorem (10) also holds when a dispersion parameter is included, and holds for other families of generalized linear models, such as the Gamma and Inverse Gaussian families with canonical link functions; however, in these cases additional regularity conditions on the likelihood functions are required. For the class of generalized linear models given in Section 5.2, no additional regularity conditions beyond the linear part having the form (20), with all terms measurable and the functions $\{g_j, h_k\}$ bounded on compact subsets of $\mathbb{R}^q$, are needed for the theorem to hold.

Before giving the detailed proof that Theorem (10) above holds for models of the above type, we give an outline of the main steps in the proof. First, we show that due to the strict concavity of the expected log-likelihood for models from the exponential family, it suffices to consider the case in which the expected log-likelihood has a unique, finite maximum. Next, we show that when there is a unique, finite maximizer $\beta^*$ of the expected log-likelihood, all components of $\beta^*$ corresponding to terms in (20) that contain the treatment variable A are 0. Lastly, we apply Theorem 5.39 on the asymptotic normality of maximum likelihood estimators in (van der Vaart, 1998), completing the proof of Theorem (10) for the class of generalized linear models given in Section 5.2.

We now turn to proving the strict concavity of the expected log-likelihood for the class of models defined in Section 5.2. This is equivalent to showing the Hessian matrix $H := \left[\frac{\partial^2}{\partial\beta_j \partial\beta_k} E(l(\beta; V, A, Y))\right]$ is negative definite, or in other words, that for any $\beta$ and any non-zero column vector a of length $t + t'$, we have $a^T H a < 0$.

Consider the case in which the link function in the generalized linear model is the canonical link for that family (that is, assuming the canonical parameter $\theta = \eta(A, V|\beta)$), which is the case for logistic regression and for Poisson regression with log link. We then have from (21)

$H_{ij} = E\left(\frac{\partial^2 l}{\partial\beta_i \partial\beta_j}\right) = -E\left[b''(\theta) x_i x_j\right]$.

Since, as noted above, $b''(\theta) > 0$ for all $\theta$, and since we have restricted to the case in which the random variables in the vector x are linearly independent, we have

$a^T H a = -E\left[b''(\theta) (a^T x)^2\right] < 0$.

Thus, we have shown that the expected log-likelihood $E(l(\beta; V, A, Y))$ is a strictly concave function of $\beta$ whenever the canonical link function is used in a generalized linear model.

We now give a similar argument to that given above, but now applied to the generalized linear models from Section 5.2 that have non-canonical link functions. More precisely, we will show the expected log-likelihood is a strictly concave function of $\beta$ for binary regression models using either of the following link functions:


1. $\Phi^{-1}(\mu)$, for $\Phi$ the distribution function of a standard normal random variable, which corresponds to probit regression,

2. $\log(-\log(1 - \mu))$, called the complementary log-log link.

For a generic link function g, let $\mu$ denote its inverse. For a binary outcome Y, taking values in {0, 1}, the log-likelihood for a single subject is:

$l(\beta; V, A, Y) = \log\left[\mu(\eta)^Y (1 - \mu(\eta))^{1-Y}\right]$, (22)

where $\eta = \eta(A, V|\beta)$. The Hessian matrix of the expected log-likelihood is then

$H = E\left[x x^T z\right]$, (23)

for

$z = Y \frac{d^2}{d\eta^2} \log\mu(\eta) + (1 - Y) \frac{d^2}{d\eta^2} \log(1 - \mu(\eta))$,

and x containing the terms in $\eta(A, V|\beta)$ as defined in (20) above. Thus, to show that the expected log-likelihood is strictly concave, it suffices to show that $\log\mu$ and $\log(1 - \mu)$ are strictly concave; this follows since when these two functions are strictly concave, z defined above is strictly negative, and so for any non-zero vector a, we have

$a^T H a = E\left[(a^T x)^2 z\right] < 0$.

We now verify that for the two link functions listed above, their inverses $\mu$ are such that $\log\mu$ and $\log(1 - \mu)$ are strictly concave.

1. For the link function $g(\mu) = \Phi^{-1}(\mu)$, its inverse is $\mu(\eta) = \Phi(\eta)$, and $\frac{d^2}{d\eta^2} \log\Phi(\eta) < 0$, since $\Phi$ is strictly log-concave (being the integral of the log-concave standard normal density); also, $\frac{d^2}{d\eta^2} \log(1 - \Phi(\eta)) = \frac{d^2}{d\eta^2} \log\Phi(-\eta) < 0$.

2. For the link function $g(\mu) = \log(-\log(1 - \mu))$, its inverse is $\mu(\eta) = 1 - e^{-e^\eta}$, and

$\frac{d^2}{d\eta^2} \log\left(1 - e^{-e^\eta}\right) = \frac{\left(1 - e^\eta - e^{-e^\eta}\right) e^\eta e^{-e^\eta}}{\left(1 - e^{-e^\eta}\right)^2} < 0$,

since the term $1 - e^\eta - e^{-e^\eta}$ is strictly negative for all $\eta$, which follows by substituting $-e^\eta$ for x in the well-known inequality $e^x > 1 + x$ for all $x \neq 0$; also, $\frac{d^2}{d\eta^2} \log\left(e^{-e^\eta}\right) = -e^\eta < 0$.

    Thus, for both of the above link functions, log and log(1 ) are strictly concave,which by (23) implies that the expected log-likelihood is strictly concave.
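The sign claims above can also be checked numerically. The following R snippet (a sanity check only, not part of the proof; the function and grid names are ours) approximates the second derivatives by central second differences on a grid:

    # Verify that the second derivatives of log(psi) and log(1 - psi) are
    # negative on a grid of eta values, for the probit and complementary
    # log-log inverse links.
    second_deriv <- function(f, eta, h = 1e-3) {
      (f(eta + h) - 2 * f(eta) + f(eta - h)) / h^2
    }
    eta_grid <- seq(-4, 4, by = 0.1)
    probit_inv <- pnorm
    cloglog_inv <- function(eta) 1 - exp(-exp(eta))
    all(second_deriv(function(e) log(probit_inv(e)), eta_grid) < 0)      # TRUE
    all(second_deriv(function(e) log(1 - probit_inv(e)), eta_grid) < 0)  # TRUE
    all(second_deriv(function(e) log(cloglog_inv(e)), eta_grid) < 0)     # TRUE
    all(second_deriv(function(e) log(1 - cloglog_inv(e)), eta_grid) < 0) # TRUE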

The purpose of having shown above that the expected log-likelihood is strictly concave for each of the models in the class defined in Section 5.2 is that this implies one of the following two cases must hold:

Case 1: There is no finite $\beta$ that maximizes $E(l(\beta; V, A, Y))$.


Case 2: There is a unique maximizer of $E(l(\beta; V, A, Y))$.

Consider what happens if Case 1 holds. Let $Q_n$ be the distribution of the maximizer $\hat\beta_n$ of the log-likelihood $\sum_{j=1}^n l(\beta; V_j, A_j, Y_j)$. By Theorem 5.7 of (van der Vaart, 1998), we have for any $w > 0$ that $Q_n(|\hat\beta_n| > w)$ converges to 1. Thus, the maximum likelihood algorithm will fail to converge to a finite vector that is within the bounds allowed by the algorithm (as explained just before (10) above), with probability tending to 1 as sample size $n$ tends to infinity. Since, by construction, the hypothesis testing algorithm given above fails to reject the null hypothesis when the maximum likelihood estimation procedure fails to converge, we have that under Case 1 the asymptotic Type I error of the above hypothesis test converges to 0. Thus, it suffices to restrict attention to when Case 2 above holds, and we assume this case holds for the rest of the proof.

When Case 2 above holds, by Theorem 5.39 on the asymptotic normality of maximum likelihood estimators in (van der Vaart, 1998), the maximum likelihood estimator for $\beta$ is asymptotically normal, and converges to the unique value $\beta^*$ of $\beta$ maximizing the expected log-likelihood $E(l(\beta; V, A, Y))$. Call this maximizer $\beta^* = (\beta^{*(0)}, \beta^{*(1)})$, where we partition the components of $\beta^*$ as described just below (20). We will show $\beta^{*(0)} = 0$. Using (21), under the null hypothesis that $E(Y \mid A, V) = E(Y \mid V)$, the expected log-likelihood can be expressed as follows:
\begin{align*}
E(l(\beta; V, A, Y)) &= E\, E(l(\beta; V, A, Y) \mid A, V) \\
&= E\left( E(Y \mid A, V)\, \theta - b(\theta) + E(d(Y) \mid A, V) \right) \\
&= E\left( E(Y \mid V)\, \theta - b(\theta) + E(d(Y) \mid A, V) \right).
\end{align*}
Thus, the parameter $\beta$ that maximizes $E(l(\beta; V, A, Y))$ also maximizes $E(E(Y \mid V)\, \theta - b(\theta))$. (Recall that $\theta = (b')^{-1}(g^{-1}(\eta(A, V \mid \beta)))$, and so is a function only of $A$, $V$, and $\beta$.) Using assumption A1 above that all variables $V, A, Y$ are bounded and our restriction in (20) that the terms in $\eta(A, V \mid \beta)$ are bounded on compact sets, it is straightforward to show, for each of the generalized linear models in the class defined in Section 5.2, that the corresponding expected log-likelihood is integrable for any finite $\beta$. The lemma below, analogous to Lemma 1, implies that $\beta^*$ must have $\beta^{*(0)} = 0$. This lemma is the main technical contribution of this paper.
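For instance, in the logistic regression case (a worked instance of the display above, using the standard facts that $\theta = \eta(A, V \mid \beta)$ and $b(\theta) = \log(1 + e^\theta)$ for this family), the maximizer $\beta^*$ maximizes
\[ E\left( E(Y \mid V)\, \eta(A, V \mid \beta) - \log\left( 1 + e^{\eta(A, V \mid \beta)} \right) \right). \]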

Lemma 2: Consider any function $c(A, V, \beta)$ of the form
\[ c(A, V, \beta) = s_V \Bigl( \sum_j \beta^{(0)}_j f_j(A) g_j(V) + \sum_k \beta^{(1)}_k h_k(V) \Bigr), \quad (24) \]
where $s_V$ is a function that may depend on $V$, and where for all $j$, there is some $k$ for which $g_j(V) = h_k(V)$. Assume $c(A, V, \beta)$ is integrable for any finite $\beta$. Assume that $A$ is independent of $V$ and that there is a unique set of coefficients $\beta_{\min}$ achieving the minimum $\min_\beta E(c(A, V, \beta))$. Then $\beta^{(0)}_{\min} = 0$; that is, $\beta_{\min}$ assigns 0 to all coefficients in (24) of terms containing the variable $A$.
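To make the form (24) concrete, consider (purely as an illustration of the notation; this model is not taken from the data example) a working model with linear part $\beta_0 + \beta_1 A + \beta_2 V + \beta_3 A V$. The terms containing $A$ have $f_1(A) = A$, $g_1(V) = 1$ and $f_2(A) = A$, $g_2(V) = V$, while the remaining terms have $h_1(V) = 1$ and $h_2(V) = V$. Each $g_j$ equals some $h_k$, so the condition of Lemma 2 is satisfied, and the lemma then implies that the minimizer sets the coefficients $\beta_1$ and $\beta_3$ (the $\beta^{(0)}$ components) to 0.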


Proof of Lemma 2: Since from elementary probability theory
\[ E\, c(A, V, \beta_{\min}) = E\, E[c(A, V, \beta_{\min}) \mid A], \]
we must have for some $a_0$ in the range of $A$ that
\[ E\, c(A, V, \beta_{\min}) \geq E[c(A, V, \beta_{\min}) \mid A = a_0]. \quad (25) \]

We now construct a set of coefficients $\beta^* = (\beta^{*(0)}, \beta^{*(1)})$ with $\beta^{*(0)} = 0$ that we will later show attains the minimum $\min_\beta E(c(A, V, \beta))$. We leverage the assumed property of $\sum_j \beta^{(0)}_j f_j(A) g_j(V) + \sum_k \beta^{(1)}_k h_k(V)$ that for all $j$, there is some $k$ for which $g_j(V) = h_k(V)$; this property allows us to group all terms together that correspond to the same $h_k(V)$ as follows:
\[ \sum_j \beta^{(0)}_j f_j(A) g_j(V) + \sum_k \beta^{(1)}_k h_k(V) = \sum_k \Bigl( \sum_{j : g_j(V) = h_k(V)} \beta^{(0)}_j f_j(A) + \beta^{(1)}_k \Bigr) h_k(V). \]
This motivates defining $\beta^*$ as follows:
\[ \beta^{*(0)}_j := 0 \text{ for all } j, \quad \text{and} \quad \beta^{*(1)}_k := \sum_{j : g_j(V) = h_k(V)} \beta^{(0)}_{\min, j} f_j(a_0) + \beta^{(1)}_{\min, k} \text{ for all } k. \]
Note that $\beta^*$ has the following property:
\[ E\, c(a_0, V, \beta^*) = E\, c(a_0, V, \beta_{\min}) = E[c(A, V, \beta_{\min}) \mid A = a_0], \quad (26) \]
where the first equality follows by the construction of $\beta^*$, which guarantees that for all $v$, $c(a_0, v, \beta^*) = c(a_0, v, \beta_{\min})$, and the second equality follows by the independence of $A$ and $V$.

We now show that $\beta^*$ minimizes $E(c(A, V, \beta))$. Since by definition $\beta^*$ assigns 0 to the coefficients of all terms containing the variable $A$, $c(a, V, \beta^*)$ has the same value for all $a$. We then have
\[ E\, c(A, V, \beta^*) = E\, c(a_0, V, \beta^*) = E[c(A, V, \beta_{\min}) \mid A = a_0] \leq E\, c(A, V, \beta_{\min}), \]
where the second equality follows from the above property (26) of $\beta^*$ and the inequality follows from our choice of $a_0$ satisfying (25). Thus $\beta^*$ minimizes $E(c(A, V, \beta))$, and by our assumption of a unique minimizer of this quantity, $\beta^* = \beta_{\min}$. Since by construction $\beta^*$ assigns 0 to the coefficients of all terms containing the variable $A$, the lemma follows.

We can apply Lemma 2 to
\[ c(A, V, \beta) = -\bigl( E(Y \mid V)\, \theta - b(\theta) \bigr) = -\Bigl( E(Y \mid V)\, (b')^{-1}\bigl(g^{-1}(\eta(A, V \mid \beta))\bigr) - b\bigl( (b')^{-1}(g^{-1}(\eta(A, V \mid \beta))) \bigr) \Bigr), \]
since we had restricted to the case in which the expected log-likelihood has a unique, finite maximizer. This implies that the unique maximizer $\beta^*$ of the expected log-likelihood $E(l(\beta; V, A, Y))$ has $\beta^{*(0)} = 0$. This completes the argument above that the maximum likelihood estimator for $\beta$ converges to $\beta^*$ with $\beta^{*(0)} = 0$. Theorem (10) above then follows for the generalized linear models given in Section 5.2, since each of the robust variance estimators in Web Appendix B is asymptotically consistent for this class of models, regardless of whether the model is correctly specified.

Q.E.D.
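As a numerical illustration of this conclusion, the following short R simulation (a sketch included for the reader; the object names and the data generating distribution below are our own choices, subject only to the null hypothesis holding) checks that the fitted treatment coefficients of a deliberately misspecified logistic working model are approximately 0 in large samples:

    # Under the null hypothesis E(Y|A,V) = E(Y|V), the maximum likelihood
    # estimates of all coefficients on terms containing A converge to 0, even
    # though the working model below (linear in V on the logit scale) is
    # misspecified.
    set.seed(1)
    n <- 100000
    V <- runif(n)
    A <- rbinom(n, 1, 0.5)                   # randomized, so independent of V
    Y <- rbinom(n, 1, plogis(-1 + 2 * V^2))  # depends on V only, not on A
    misspecified_fit <- glm(Y ~ A + V + A:V, family = binomial)
    coef(misspecified_fit)[c("A", "A:V")]    # both approximately 0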

We now give an example of a class of models that do not have the robustness property (3) of the paper. In general, median regression models, that is, working models m(A, V|β) for the median of Y given A, V, do not have the robustness property (3) of the paper when fit with maximum likelihood estimation. This is not surprising, since the null hypothesis we consider is in terms of the mean (not the median) of Y given A, V. We note that even under the null hypothesis that the conditional median of Y given A, V does not depend on A, the robustness property (3) of the paper still does not hold for median regression.

    E R Code for Data Example from Section 7

We give R code for the hypothesis test (*) from the data example in Section 7. This hypothesis test used the following logistic regression model for the probability of HIV infection by the end of the trial, given treatment arm A and baseline variables V1, V2, V3, V4, V5:
\[ m(A, V \mid \beta) = \mathrm{logit}^{-1}(\beta_0 + \beta_1 A + \beta_2 V_1 + \beta_3 V_2 + \beta_4 V_3 + \beta_5 V_4 + \beta_6 V_5 + \beta_7 A V_1 + \beta_8 A V_2 + \beta_9 A V_3 + \beta_{10} A V_4 + \beta_{11} A V_5). \quad (27) \]

The following R code sketches hypothesis test (*). Apart from the object names rct_data and A_coefficient_estimate, the names below (including the input file name and the use of the sandwich package for the robust variance estimator) are illustrative assumptions rather than a verbatim listing:

    library(sandwich)  # assumed here; supplies the robust (sandwich) covariance estimator

    # Read in the trial data; the file name is illustrative. The data frame is
    # assumed to have columns Y (indicator of HIV infection by the end of the
    # trial), A (treatment arm), and baseline variables V1,...,V5.
    rct_data <- read.csv("rct_data.csv")

    # Fit working model (27) by maximum likelihood:
    model_fit <- glm(Y ~ A * (V1 + V2 + V3 + V4 + V5),
                     family = binomial, data = rct_data)

    # Estimated coefficient on the main treatment term:
    A_coefficient_estimate <- coef(model_fit)["A"]

    # Combine the estimated coefficients on all terms containing the treatment
    # variable A into a single Wald statistic, using a sandwich covariance
    # estimator. (The grep below assumes no baseline variable name contains "A".)
    treatment_terms <- grep("A", names(coef(model_fit)))
    beta_hat <- coef(model_fit)[treatment_terms]
    robust_cov <- vcovHC(model_fit, type = "HC0")[treatment_terms, treatment_terms]
    wald_statistic <- as.numeric(t(beta_hat) %*% solve(robust_cov) %*% beta_hat)
    p_value <- 1 - pchisq(wald_statistic, df = length(treatment_terms))

    # Reject the null hypothesis (2) at level 0.05 if the p-value is below 0.05,
    # failing to reject whenever the model fit above did not converge:
    model_fit$converged && (p_value < 0.05)


F Comparison of Superpopulation Inference to the Randomization Inference Approach of Rosenbaum (2002)

We compare the framework of superpopulation inference, used in our paper, to the randomization inference framework and hypothesis tests of Rosenbaum (2002). For a more detailed comparison of these frameworks, see (Lehmann, 1986, Section 5.10), Rosenbaum (2002), and Robins (2002). These frameworks differ in the assumptions they make about how data is generated and in the hypotheses tested. Below, we discuss these differences and how they relate to the hypothesis tests considered in our paper.

We first consider the ways superpopulation inference and randomization inference differ regarding the assumptions they make about how data is generated. The main difference lies in which characteristics of subjects are assumed to be fixed (but possibly unknown) and which are assumed to be random. To emphasize the contrast between superpopulation inference and randomization inference, we use uppercase letters to denote random quantities, and lowercase letters to denote fixed (non-random) quantities. The framework of superpopulation inference we use in our paper assumes that all data on each subject are random, and can be represented as a simple random sample from a hypothetical, infinite superpopulation. That is, we assume the set of variables observed for each subject i, denoted by (Vi, Ai, Yi), is an i.i.d. draw from an unknown probability distribution P. Examples of work cited in our paper in which superpopulation inference is used include (Robins, 2004; Tsiatis et al., 2007; Zhang et al., 2007; Moore and van der Laan, 2007).

In contrast to superpopulation inference, the randomization inference framework used by Rosenbaum (2002) (which can be traced back to Neyman (1923)) assumes that subjects' baseline characteristics and potential outcomes (defined below) are not random, but are fixed quantities for each subject. For example, in a randomized trial setting in which baseline variables, treatment, and outcome are observed, it would be assumed that for each subject i, there is a fixed set of baseline variables vi, and a fixed set of potential outcomes yi(a) for each possible value a of the treatment; yi(a) represents the outcome that subject i would have if he/she were assigned to treatment arm a. In randomization inference, the only random quantity is the treatment assignment Ai for each subject. For each subject i, the only potential outcome that is observed is the one corresponding to the treatment assignment that is actually received: yi(Ai). It is assumed that the set of observed variables for each subject i is (vi, Ai, yi(Ai)), denoting the baseline variables, treatment assignment, and observed outcome, respectively. Note that we put vi and yi(a) in lowercase to emphasize that these are assumed to be fixed (non-random) in the randomization inference framework. Examples of work cited in our paper in which randomization inference is used include (Freedman, 2007a,b, 2008).


Because of the differences in assumptions between superpopulation inference and randomization inference, the hypotheses tested differ as well. In superpopulation inference, the hypotheses refer to parameters of the underlying data generating distribution P; inferences are made about what the effects of treatment would be for the hypothetical superpopulation from which the subjects are assumed to have been drawn. In contrast, in randomization inference, hypotheses refer only to the fixed quantities (such as baseline variables and potential outcomes) of the subjects actually in the study. We consider examples of hypotheses tested in each of these frameworks next.

In superpopulation inference, in a two-armed randomized trial, one may test whether the mean outcome is affected by treatment assignment, as done by Tsiatis et al. (2007); Zhang et al. (2007); Moore and van der Laan (2007); this corresponds to testing the null hypothesis that E(Y|A = 0) = E(Y|A = 1), where the expectations are taken with respect to the unknown data generating distribution P. In our paper, we test the null hypothesis (2) of no mean treatment effect within strata of baseline variables V; for a two-armed randomized trial this corresponds to E(Y|A = 0, V) = E(Y|A = 1, V). Note that these null hypotheses refer to parameters of the underlying distribution P. In contrast, the hypotheses tested in the randomization inference framework of Rosenbaum (2002) are hypotheses about the potential outcomes of the subjects in the trial. For example, in a trial with two arms, the null hypothesis of no treatment effect at all would correspond to equality of the potential outcomes yi(0) = yi(1) for each subject i; this null hypothesis is that every subject in the trial would have exactly the same outcome regardless of which treatment arm they were assigned to. A null hypothesis considered by Rosenbaum (2002) is that of additive treatment effects; that is, for some constant τ0, yi(1) = yi(0) + τ0 for all subjects i. This means that for every subject in the study, the impact of having received the treatment (A = 1) compared to the control (A = 0) would have been to increase the value of their outcome by exactly τ0.
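In compact form, the two kinds of null hypotheses just described are (the superscripts on $H_0$ are labels we introduce only for this summary):
\[ H_0^{\text{superpop}}: E(Y \mid A = 1, V) = E(Y \mid A = 0, V); \qquad H_0^{\text{randomization}}: y_i(1) = y_i(0) + \tau_0 \text{ for all subjects } i. \]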

Because of the differences in assumptions and hypotheses tested in the superpopulation inference framework of our paper and the randomization inference framework of Rosenbaum (2002), the methods used for each situation are generally different. Due to these differences, the methods of Rosenbaum (2002) are in general not appropriate for testing the hypotheses considered in our paper. (The one exception is when the outcome is binary, as we further discuss below.) More precisely, the methods of Rosenbaum (2002) will not have asymptotically correct Type I error for testing the hypotheses considered in our paper in many situations. Before showing this, we explain the intuition for why this is the case. The null hypotheses of Rosenbaum (2002) imply that certain distributions are invariant under permutations of the treatment variable. For example, the null hypothesis of no treatment effect at all in randomization inference (defined above) implies that rearranging which subjects got the active treatment vs. control will have no effect on the outcomes. Therefore, under this null hypothesis, permutation-based tests (such as the Wilcoxon rank-sum test described


in method M3 above) are appropriate. This is in stark contrast with the implications of the null hypotheses (2) considered in our paper, which only imply that conditional means of the outcome given baseline variables are unaffected by treatment. The distribution of the data, under our framework and null hypothesis (2) of the paper, is not invariant to permutations of the treatment variable, and so permutation tests will not in general work, except in the special case considered next.

In the special case in which the outcome Y of a randomized trial is binary (taking values 0 or 1), our null hypothesis (2) is that E(Y|A = 1, V) = E(Y|A = 0, V). Assuming treatment assignment A is independent of the baseline variables V (due to randomization), this null hypothesis is equivalent to the outcome Y and baseline variables V being mutually independent of the treatment A. This means that under our null hypothesis (2), the treatment has no effect at all on the distribution of the outcome, and so permutation tests make sense; in fact, they have exactly correct Type I error (not just asymptotically correct Type I error) under the null hypothesis (2) of

the paper in this special case. Thus, in this case, the permutation tests of Rosenbaum (2002) have the robustness property (3) of the paper, even under the superpopulation framework of this paper defined in Section 3. In our simulations in Section 6 comparing the power of various methods for a binary outcome, the permutation-based method (M3) generally performed well in terms of power; compared to the regression-based method (M0) of this paper, the permutation-based method sometimes had more power and sometimes had less power.

We now give an example to show that the permutation-based methods of Rosenbaum (2002) are not guaranteed to have the robustness property (3) of the paper when testing the null hypothesis (2) (except in the case of binary outcomes, as discussed above). We show that the basic permutation-based method of Rosenbaum (2002) does not have asymptotically correct Type I error under the null hypothesis (2) of the paper, under the framework of our paper (given in Section 3), for the following data generating distribution: Let X be a random variable with the following skewed distribution: X = 2 with probability 1/3 and X = −1 with probability 2/3; thus, the mean of X is 0. Define baseline variable V to be a random variable independent of X, such as a standard normal random variable. Let the treatment A take values 0 and 1 each with probability 1/2, mutually independent of {V, X}. Let the outcome Y = (2A − 1)X + V. Then the null hypothesis (2) of the paper is true since E(Y|A, V) = (2A − 1)E(X) + V = V, and so is a function of V only (and not a function of A). This also implies that E(Y|V) = V. Assume, for example, a working model m(V|β) = β0 + β1V + β2V² for E(Y|V) is used by the permutation-based method, and is fit with ordinary least squares regression. Asymptotically, the estimates of the coefficient vector β will tend to (0, 1, 0). So for large sample sizes, the residuals {εi} corresponding to this model fit will be, approximately, εi = Yi − Vi = (2Ai − 1)Xi, where the subscript i denotes the value corresponding to the ith subject. Note, as mentioned above, we are assuming the framework of Section 3 in which each observation is i.i.d. Then for large sample sizes the empirical distribution of the residuals εi


corresponding to treatment (Ai = 1) will be right-skewed (having roughly the same distribution as X), while the empirical distribution of the residuals εi corresponding to the control arm (Ai = 0) will be left-skewed (having roughly the same distribution as −X). Thus, the Wilcoxon rank-sum test on the set of residuals εi will reject the null with probability tending to 1 as sample size goes to infinity. This implies the Type I error for testing the null hypothesis (2) will tend to 1. We stress this occurs because the methods of Rosenbaum (2002) are designed for a different framework and different type of null hypothesis than considered here.
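The following short R simulation (a sketch included for illustration; all object names are ours) exhibits this failure numerically: the null hypothesis (2) holds by construction, yet the Wilcoxon rank-sum test applied to the residuals rejects essentially always at large sample sizes:

    # Data generating distribution from the example above:
    set.seed(1)
    n <- 10000
    X <- ifelse(rbinom(n, 1, 1/3) == 1, 2, -1)  # skewed distribution with mean 0
    V <- rnorm(n)
    A <- rbinom(n, 1, 0.5)
    Y <- (2 * A - 1) * X + V                    # E(Y|A,V) = V, so null (2) holds
    # Fit the working model for E(Y|V) by ordinary least squares; take residuals:
    residuals_hat <- residuals(lm(Y ~ V + I(V^2)))
    # The rank-sum test comparing residuals across arms rejects (p-value near 0)
    # even though the null hypothesis (2) is true:
    wilcox.test(residuals_hat[A == 1], residuals_hat[A == 0])$p.value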

    References

Cochran, W. G. (1954). Some methods for strengthening the common χ² tests. Biometrics 10, 417–451.

Freedman, D. A. (2007a). On regression adjustments to experimental data. Advances in Applied Mathematics (To Appear).

Freedman, D. A. (2007b). On regression adjustments to experiments with several treatments. Annals of Applied Statistics (To Appear).

Freedman, D. A. (2008). Randomization does not justify logistic regression. Statistical Science 23, 237–249.

Hardin, J. W. and Hilbe, J. M. (2007). Generalized Linear Models and Extensions, 2nd Edition. Stata Press, College Station.

Hochberg, Y. and Tamhane, A. C. (1987). Multiple Comparison Procedures. Wiley-Interscience, New York.

Huber, P. J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. Proc. Fifth Berkeley Symp. Math. Statist. Prob. 1, pages 221–233.

Jewell, N. P. (2004). Statistics for Epidemiology. Chapman and Hall/CRC, Boca Raton.

Lehmann, E. L. (1986). Testing Statistical Hypotheses. Wiley, New York, 2nd edition.

Mantel, N. and Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute 22, 719–748.

McCullagh, P. and Nelder, J. A. (1998). Generalized Linear Models. Chapman and Hall/CRC, Monographs on Statistics and Applied Probability 37, Boca Raton, Florida, 2nd edition.

Moore, K. L. and van der Laan, M. J. (2007). Covariate adjustment in randomized trials with binary outcomes: Targeted maximum likelihood estimation. U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 215. http://www.bepress.com/ucbbiostat/paper215.

Neyman, J. (1923). Sur les applications de la théorie des probabilités aux expériences agricoles: Essai des principes. Roczniki Nauk Rolniczych (in Polish) 10, 1–51.

R Development Core Team (2004). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, ISBN 3-900051-00-3, URL http://www.R-project.org/, Vienna, Austria.

Robins, J. M. (2002). Comment: Covariance adjustment in randomized experiments and observational studies. Statistical Science 17, 309–321.

Robins, J. M. (2004). Optimal structural nested models for optimal sequential decisions. Proceedings of the Second Seattle Symposium on Biostatistics, D. Y. Lin and P. Heagerty (Eds.), pages 6–11.

Rosenbaum, P. R. (2002). Covariance adjustment in randomized experiments and observational studies. Statistical Science 17, 286–327.

Rosenblum, M. and van der Laan, M. (2007). Using regression to analyze randomized trials: Valid hypothesis tests despite incorrectly specified models. U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 219. http://www.bepress.com/ucbbiostat/paper219.

Tsiatis, A. A., Davidian, M., Zhang, M., and Lu, X. (2007). Covariate adjustment for two-sample treatment comparisons in randomized clinical trials: A principled yet flexible approach. Statistics in Medicine (To Appear).

van der Laan, M. J. and Hubbard, A. E. (2005). Quantile-function based null distribution in resampling based multiple testing. Statistical Applications in Genetics and Molecular Biology (Technical Report available at http://www.bepress.com/ucbbiostat/paper198) 5.

van der Laan, M. J. and Rubin, D. (2006). Targeted maximum likelihood learning. U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 213. http://www.bepress.com/ucbbiostat/paper213.

van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press, New York.

Westfall, P. H. and Young, S. S. (1993). Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment. John Wiley and Sons.

Williams, D. (1991). Probability with Martingales. Cambridge University Press, Cambridge, UK.

Zhang, M., Tsiatis, A. A., and Davidian, M. (2007). Improving efficiency of inferences in randomized clinical trials using auxiliary covariates. Biometrics (To Appear).