Oracle and Adaptive Compound Decision Rules for False Discovery Rate Control
Wenguang SUN and T. Tony CAI
We develop a compound decision theory framework for multiple-testing problems and derive an oracle rule based on the z values that minimizes the false nondiscovery rate (FNR) subject to a constraint on the false discovery rate (FDR). We show that many commonly used multiple-testing procedures, which are p value–based, are inefficient, and propose an adaptive procedure based on the z values. The z value–based adaptive procedure asymptotically attains the performance of the z value oracle procedure and is more efficient than the conventional p value–based methods. We investigate the numerical performance of the adaptive procedure using both simulated and real data. In particular, we demonstrate our method in an analysis of the microarray data from a human immunodeficiency virus study that involves testing a large number of hypotheses simultaneously.
KEY WORDS: Adaptive procedure; Compound decision rule; False discovery rate; Local false discovery rate; Monotone likelihood ratio; Weighted classification.
1. INTRODUCTION
In large-scale multiple comparisons, where hundreds or thousands of hypotheses are tested simultaneously, the goal is to separate the nonnull cases from the null cases. Commonly used multiple-testing procedures are typically based on the p values of the individual tests. For example, the well-known step-up procedure of Benjamini and Hochberg (1995), which aims to maximize the number of true positives while controlling the proportion of false positives among all rejections, thresholds the p values of the individual tests. Other examples include the adaptive procedure (Benjamini and Hochberg 2000), the plug-in procedure (Genovese and Wasserman 2004), and the augmentation procedure (van der Laan, Dudoit, and Pollard 2004). The operation of these procedures essentially involves first ranking the hypotheses based on the individual p values and then choosing a cutoff along the rankings.
The outcomes of a multiple-testing procedure can be categorized as in Table 1. The false discovery rate (FDR) is defined as E(N10/R | R > 0) Pr(R > 0), with an FDR level of 0 when no hypotheses are rejected. Other similar measures include the positive false discovery rate (pFDR), E(N10/R | R > 0), and the marginal FDR (mFDR), E(N10)/E(R). The pFDR and mFDR are equivalent when the test statistics come from a random mixture of the null and nonnull distributions (Storey 2003). Genovese and Wasserman (2002) showed that under weak conditions mFDR = FDR + O(m^(−1/2)), where m is the number of hypotheses.
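The closeness of these error measures can be checked empirically. The following sketch (the mixture parameters and the fixed rejection rule |z| > 2 are assumed purely for illustration, not taken from the paper) estimates the FDR and the mFDR of the same rule by simulation:

```python
import numpy as np

rng = np.random.default_rng(0)
p, mu, m, reps = 0.2, 3.0, 1000, 200      # assumed illustration parameters

n10_tot, r_tot, fdp_sum = 0, 0, 0.0
for _ in range(reps):
    theta = rng.random(m) < p             # unobserved nonnull indicators
    z = rng.normal(theta * mu, 1.0)       # null: N(0,1); nonnull: N(3,1)
    reject = np.abs(z) > 2.0              # a fixed rejection rule, for illustration
    n10 = int(np.sum(reject & ~theta))    # false positives N10
    r = int(np.sum(reject))               # total rejections R
    n10_tot, r_tot = n10_tot + n10, r_tot + r
    if r > 0:
        fdp_sum += n10 / r                # false discovery proportion N10/R

mfdr = n10_tot / r_tot                    # estimates E(N10)/E(R)
fdr = fdp_sum / reps                      # estimates E(N10/R | R > 0) Pr(R > 0)
```

With m = 1,000 tests per replication, the two estimates agree closely, consistent with the mFDR = FDR + O(m^(−1/2)) relation.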
Table 2 summarizes the data notation used in this article, which is standard for many microarray studies (see, e.g., Efron 2004a,b). Suppose that two groups of subjects, X1, ..., Xn1 and Y1, ..., Yn2, each have measured expression levels for the same m genes. For i = 1, ..., m, a two-sample t statistic Ti is obtained for comparing the two groups on gene i and is then transformed to a z value through Zi = Φ^(−1)(F(Ti)), where F and Φ are the cdf's of the t variable Ti and the standard normal variable. The corresponding p values of all tests are recorded and denoted by P1, ..., Pm.
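This transformation can be sketched as follows (scipy is assumed available; the t statistics and degrees of freedom below are made up for illustration):

```python
import numpy as np
from scipy.stats import t as t_dist, norm

def t_to_z(t_stats, df):
    """Z_i = Phi^(-1)(F(T_i)): map t statistics to equivalent z values."""
    return norm.ppf(t_dist.cdf(t_stats, df))

def t_to_p(t_stats, df):
    """Two-sided p values for the same t statistics."""
    return 2 * t_dist.sf(np.abs(t_stats), df)

# Hypothetical t statistics; df = n1 + n2 - 2 for a pooled two-sample t test.
T = np.array([-2.5, 0.0, 1.0, 3.2])
Z = t_to_z(T, df=18)
P = t_to_p(T, df=18)
```

The transform is monotone, so it preserves the ordering and signs of the t statistics; in particular T = 0 maps to Z = 0.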
Wenguang Sun is PhD Candidate, Department of Biostatistics and Epidemiology, School of Medicine, University of Pennsylvania, Philadelphia, PA 19104 (E-mail: wgsun@mail.med.upenn.edu). T. Tony Cai is Professor of Statistics, Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104 (E-mail: tcai@wharton.upenn.edu). Cai's research was supported in part by National Science Foundation grants DMS-03-06576 and DMS-06-04954.
In this article we develop a compound decision theory framework for multiple testing and derive an oracle rule based on the z values that, subject to a constraint on the FDR, minimizes the false nondiscovery rate (FNR), E(N01/S | S > 0) Pr(S > 0). A major goal of this article is to show that the p value–based approaches generally do not lead to efficient multiple-testing procedures. The reason for the inefficiency of the p value methods can be traced back to the work of Robbins (1951), who showed in a compound decision problem that simple rules (defined in Sec. 2) are usually inferior to compound decision rules. In addition, a result of Copas (1974) implies that the p value oracle procedure, defined in Section 3, is inadmissible in a compound decision problem. Robbins' argument that compound rules are superior to simple rules is especially important in large-scale multiple testing, in which the precision of the tests can be increased by pooling information from different samples. Our approach is to use the z values to learn the distribution of the test statistics and to use this information to construct a more efficient test.
In this article we first develop a z value–based oracle procedure that minimizes the mFNR subject to mFDR ≤ α. We then compare this procedure with the p value oracle procedure proposed by Genovese and Wasserman (2002). A comparison of these two oracle procedures in Figure 1 shows that the p value oracle procedure is dominated by the z value oracle procedure and that the gain in efficiency can be substantial when the alternative is asymmetric. More numerical examples are given in Section 5. It can be seen that the z value oracle procedure outperforms the p value oracle procedure in all cases except when the alternative is perfectly symmetric about the null, which implies that it is possible that p value–based procedures can be uniformly improved by using the z values.
We then develop a data-driven adaptive procedure based on the z values. We show that the z value–based adaptive procedure asymptotically attains the performance of the z value oracle procedure and is more efficient than the conventional p value–based methods, including the step-up procedure of Benjamini and Hochberg (1995) and the plug-in procedure of Genovese and Wasserman (2004). By treating multiple testing
© 2007 American Statistical Association. Journal of the American Statistical Association, September 2007, Vol. 102, No. 479, Theory and Methods. DOI 10.1198/016214507000000545
Table 1. Classification of tested hypotheses

           Claimed nonsignificant   Claimed significant   Total
Null       N00                      N10                   m0
Nonnull    N01                      N11                   m1
Total      S                        R                     m
as a compound decision problem, our results show that individual p values, although appropriate for testing a single hypothesis, fail to serve as the fundamental building block in large-scale multiple testing.
In addition to the theoretical properties, we study the numerical performance of the z value–based adaptive procedure using both simulated and real data. Simulations reported in Section 5 demonstrate that our z value procedure yields a lower mFNR than the p value–based methods at the same mFDR level. The gain in efficiency is substantial in many situations in which the alternative is asymmetric about the null. We then apply our procedure to the analysis of microarray data from a human immunodeficiency virus (HIV) study in Section 6 and find that, at the same FDR level, it rejects more hypotheses than the p value–based procedures. The type of asymmetry in the HIV data is commonly observed in microarray studies, suggesting that our new z value–based adaptive procedure can aid in the discovery of additional meaningful findings in many scientific applications.
The article is organized as follows. In Section 2 we develop a compound decision-theoretic framework and study various aspects of weighted classification and multiple-testing problems in this setting. In Sections 3 and 4 we propose oracle and adaptive testing procedures based on the z values for FDR control. In Section 5 we present numerical examples to show that our new z value–based procedure is more efficient than the traditional p value–based procedures. In Section 6 we illustrate our adaptive procedure with the analysis of microarray data in an HIV study. We conclude the article with a discussion of the results and some open problems. We provide proofs in the Appendix.
2. COMPOUND DECISION PROBLEM
Let X be the sample space and let Θ be the parameter space. Suppose that x = (x1, ..., xm) ∈ X is observed and that we are interested in inference about the unknown θ = (θ1, ..., θm) ∈ Θ based on x. This involves solving m decision problems simultaneously and is called a compound decision problem. Let δ = (δ1, ..., δm) be a general decision rule. Then δ is simple if δi is a function only of xi, that is, δi(x) = δi(xi). Simple rules correspond to solving the m component problems separately. In contrast, δ is compound if δi depends on other xj's,
Table 2. Summary of the data notation

Original data                                T statistic   z value   p value
X11, ..., Xn1,1; Y11, ..., Yn2,1             T1            Z1        P1
X12, ..., Xn1,2; Y12, ..., Yn2,2             T2            Z2        P2
...
X1m, ..., Xn1,m; Y1m, ..., Yn2,m             Tm            Zm        Pm
Figure 1. A comparison of the p value and z value oracle procedures with the mFDR level at .10. The test statistics have the normal mixture distribution .8N(0,1) + p1 N(−3,1) + (.2 − p1) N(3,1). The mFNRs of the two procedures are plotted as a function of p1.
j ≠ i. A decision rule δ is symmetric if δ(τ(x)) = τ(δ(x)) for all permutation operators τ.
Robbins (1951) considered a compound decision problem where Xi ~ N(θi, 1), i = 1, ..., n, are independent normal variables with mean θi = 1 or −1. The goal is to classify each θi under the usual misclassification error. He showed that the unique minimax rule R: δi = sgn(xi) does not perform well in this compound decision problem, by exhibiting a compound rule R*: δi = sgn{xi − (1/2) log[(1 − x̄)/(1 + x̄)]} that substantially outperforms R when p, the proportion of θi = 1, approaches 0 or 1, with only slightly higher risk near p = .5.
Let θ1, ..., θm be independent Bernoulli(p) variables, and let Xi be generated as

Xi | θi ~ (1 − θi)F0 + θi F1.  (1)
The variables Xi are observed, and the variables θi are unobserved. The marginal cumulative distribution function (cdf) of Xi is the mixture distribution F(x) = (1 − p)F0(x) + pF1(x), and the probability density function (pdf) is f(x) = (1 − p)f0(x) + pf1(x), where f is assumed to be continuous and positive on the real line. In statistical and scientific applications, the goal is to separate the nonnull cases (θi = 1) from the null cases (θi = 0), which can be formulated either as a weighted classification problem or as a multiple-testing problem. The solution to both problems can be represented by a decision rule δ = (δ1, ..., δm) ∈ I = {0,1}^m.
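A draw from model (1) can be simulated directly; the mixture proportion and the alternative F1 below are assumed for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
m, p = 10_000, 0.2                        # illustrative values, not from the paper

theta = (rng.random(m) < p).astype(int)   # unobserved indicators theta_i
# Model (1): X_i | theta_i ~ (1 - theta_i) F0 + theta_i F1,
# here with F0 = N(0,1) and an assumed alternative F1 = N(3,1).
x = np.where(theta == 1, rng.normal(3.0, 1.0, m), rng.normal(0.0, 1.0, m))
```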
In Section 6 we consider a problem in which we are interested in identifying a set of genes that are differentially expressed between HIV-positive patients and HIV-negative controls. This naturally gives rise to a multiple-testing problem in which the goal is to find as many true positives as possible while controlling the proportion of false positives among all rejections within level α. As in the work of Genovese and Wasserman (2002), we define the marginal FNR as mFNR = E(N01)/E(S), the proportion of the expected number of nonnulls among the expected number of nonrejections. The multiple-testing problem is then to find δ that minimizes the mFNR
while controlling the mFDR at level α. On the other hand, in applications it is possible to rate the relative cost of a false positive (type I error) to a false negative (type II error). This naturally gives rise to a weighted classification problem with loss function

Lλ(θ, δ) = (1/m) Σi [λ I(θi = 0)δi + I(θi = 1)(1 − δi)],  (2)

where λ > 0 is the relative weight for a false positive. The weighted classification problem is then to find δ that minimizes the classification risk E[Lλ(θ, δ)]. We develop a compound decision theory framework for inference about the mixture model (1) and make connections between multiple testing and weighted classification.
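The loss (2) is a weighted count of the two error types; a minimal sketch (the states, decisions, and weight below are made up for illustration):

```python
import numpy as np

def weighted_classification_loss(theta, delta, lam):
    """Loss (2): the average of lam * I(theta_i = 0) * delta_i
    + I(theta_i = 1) * (1 - delta_i) over the m decisions."""
    theta, delta = np.asarray(theta), np.asarray(delta)
    false_pos = (theta == 0) & (delta == 1)
    false_neg = (theta == 1) & (delta == 0)
    return np.mean(lam * false_pos + false_neg)

# Made-up states and decisions: one false positive and one false negative.
theta = np.array([0, 0, 1, 1, 0])
delta = np.array([1, 0, 1, 0, 0])
loss = weighted_classification_loss(theta, delta, lam=2.0)  # (2 + 1)/5 = 0.6
```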
We consider δ(x) ∈ {0,1}^m defined in terms of a statistic T(x) = [Ti(x): i = 1, ..., m] and a threshold c = (c1, ..., cm) such that

δ(x) = I(T < c) = [I(Ti(x) < ci): i = 1, ..., m].  (3)
Then δ can be used in both weighted classification and multiple-testing problems. It is easy to verify that δ given in (3) is symmetric if Ti(x) = T(xi) and c = c1, where T is a function. In this section we allow T to depend on unknown quantities, such as the proportion of nonnulls and/or the distribution of Xi. We assume that T(Xi) ~ G = (1 − p)G0 + pG1, where G0 and G1 are the cdf's of T(Xi) under the null and the alternative, respectively. The pdf of T(Xi) is g = (1 − p)g0 + pg1. Let 𝒯 be the collection of functions such that for any T ∈ 𝒯, T(Xi) has monotone likelihood ratio (MLR); that is,

g1(t)/g0(t) is decreasing in t.  (4)

We call 𝒯 the SMLR class. Let x = {x1, ..., xm} be a random sample from the mixture distribution F, and let δ(T, c) = {I[T(xi) < c]: i = 1, ..., m}. We call δ(T, c) an SMLR decision rule if T ∈ 𝒯, and we let Ds denote the collection of all SMLR decision rules.
Suppose that T(Xi) = p(Xi), the p value of an individual test based on the observation Xi. Assume that p(Xi) ~ G = (1 − p)G0 + pG1, where G0 and G1 are the p value distributions under the null and the alternative, and assume that G1(t) is twice differentiable. Noting that g0(t) = G0′(t) = 1, the assumption p(·) ∈ 𝒯 implies that G1″(t) = g1′(t) < 0; that is, G1(t) is concave. Therefore, the SMLR assumption can be viewed as a generalized version of the assumption of Storey (2002) and Genovese and Wasserman (2002, 2004) that the p value distribution under the alternative is concave. The following proposition shows that the SMLR assumption is a desirable condition in both multiple-testing and weighted classification problems.
Proposition 1. Let θi, i = 1, ..., m, be iid Bernoulli(p) variables, and let xi | θi, i = 1, ..., m, be independent observations from the model (1). Suppose that T ∈ 𝒯; then the implementation of δ(T, c) = {I[T(xi) < c]: i = 1, ..., m} in both the weighted classification and multiple-testing problems implies that (a) P(θi = 1 | T(Xi) ≤ c) is monotonically decreasing in the threshold c; (b) the mFDR is monotonically increasing in c and in the expected number of rejections r; (c) the mFNR is monotonically decreasing in c and r; (d) the mFNR is monotonically decreasing in the mFDR; and (e) in the classification problem, c (and thus r) is monotonically decreasing in the classification weight λ.
The following theorem makes a connection between a multiple-testing problem and a weighted classification problem. In particular, the former can be solved by solving the latter with an appropriately chosen λ.
Theorem 1. Let θi, i = 1, ..., m, be iid Bernoulli(p) variables, and let xi | θi, i = 1, ..., m, be independent observations from the mixture model (1). Let Λ ∈ 𝒯, and suppose that the classification risk with the loss function (2) is minimized by δλ[Λ, c(λ)] = (δ1^λ, ..., δm^λ), where δi^λ = I{Λ(xi) < c(λ)}. Then for any given mFDR level α in a multiple-testing problem, there exists a unique λ(α) that defines a weighted classification problem. The optimal solution to the classification problem, δλ(α)[Λ, c{λ(α)}], is also optimal in the multiple-testing problem, in the sense that it controls the mFDR at level α with the smallest mFNR among all decision rules in Ds.
The next step is to develop such an optimal rule δλ[Λ, c(λ)] as stated in Theorem 1. We first study an ideal setup in which there is an oracle that knows p, f0, and f1. The oracle rule in this weighted classification problem then gives the optimal choice of δ.
Theorem 2 (The oracle rule for weighted classification). Let θi, i = 1, ..., m, be iid Bernoulli(p) variables, and let xi | θi, i = 1, ..., m, be independent observations from the mixture model (1). Suppose that p, f0, and f1 are known. Then the classification risk E[Lλ(θ, δ)], with Lλ given in (2), is minimized by δλ(Λ, 1/λ) = (δ1, ..., δm), where

δi = I{Λ(xi) = (1 − p)f0(xi)/[pf1(xi)] < 1/λ},  i = 1, ..., m.  (5)

The minimum classification risk is R*λ = inf_δ E[Lλ(θ, δ)] = p + ∫_K [λ(1 − p)f0(x) − pf1(x)] dx, where K = {x: λ(1 − p)f0(x) < pf1(x)}.

Remark 1. Let the likelihood ratio (LR) be defined as Li = f0(xi)/f1(xi). A compound decision rule is said to be ordered if, for almost all x, Li > Lj and δi(x) > 0 imply that δj(x) = 1. Copas (1974) showed that if a symmetric compound decision rule for dichotomies is admissible, then it is ordered. Note that (5) is a Bayes rule and so is symmetric, ordered, and admissible. But because the p value–based procedure δ[p(·), c] = {I[p(Xi) < c]: i = 1, ..., m} is symmetric but not ordered by the LR, it is inadmissible in the compound decision problem.
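A minimal sketch of the oracle classification rule (5), with an assumed normal mixture standing in for the known p, f0, and f1 (all parameter values below are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

def oracle_classify(x, p, lam, f0, f1):
    """Oracle rule (5): delta_i = 1 when
    Lambda(x_i) = (1 - p) f0(x_i) / (p f1(x_i)) < 1/lam."""
    Lam = (1 - p) * f0(x) / (p * f1(x))
    return (Lam < 1.0 / lam).astype(int)

# Assumed setup: null N(0,1), alternative N(3,1), 20% nonnulls, lam = 1.
x = np.array([-1.0, 0.5, 2.0, 4.0])
delta = oracle_classify(x, p=0.2, lam=1.0,
                        f0=lambda z: norm.pdf(z, 0.0, 1.0),
                        f1=lambda z: norm.pdf(z, 3.0, 1.0))
```

With these parameters the rule classifies a point as nonnull exactly when the likelihood-ratio evidence overcomes the prior odds by the margin 1/λ.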
Remark 2. In practice, some of p, f0, and f1 are unknown but estimable. We can then estimate the unknown quantities and use the rule δλ(Λ̂, 1/λ). The subminimax rule δi = sgn{xi − (1/2) log[(1 − x̄)/(1 + x̄)]} given by Robbins (1951) is recovered by letting λ = 1, f0 = φ(· + 1), f1 = φ(· − 1), and p̂ = (1 + x̄)/2 in (5), where φ is the pdf of a standard normal.
Remark 3. Suppose that two weights λ1 < λ2 are chosen in the loss (2). Let Ω1 = {x: λ1(1 − p)f0(x) < pf1(x)} and Ω2 = {x: λ2(1 − p)f0(x) < pf1(x)}. Then the classification risk R*λ is increasing in λ, because R*λ1 − R*λ2 = ∫_{Ω1\Ω2} [λ1(1 − p)f0(x) − pf1(x)] dx + ∫_{Ω2} [(λ1 − λ2)(1 − p)f0(x)] dx < 0. Also, it is easy to see that Ω1 ⊃ Ω2. Thus the expected number of subjects classified to the nonnull population is decreasing in λ, demonstrating that (e) in Proposition 1 is satisfied by the classification rule δλ(Λ, 1/λ).
3. THE ORACLE PROCEDURES FOR mFDR CONTROL
We have shown that δλ(Λ, 1/λ) = [I{Λ(x1) < 1/λ}, ..., I{Λ(xm) < 1/λ}] is the oracle rule in the weighted classification problem. Theorem 1 implies that the optimal rule for the multiple-testing problem is also of the form δλ(α)[Λ, 1/λ(α)] if Λ ∈ 𝒯, although the cutoff 1/λ(α) is not obvious. Note that Λ(x) = Lfdr(x)/[1 − Lfdr(x)] is monotonically increasing in Lfdr(x), where Lfdr(·) = (1 − p)f0(·)/f(·) is the local FDR (Lfdr) introduced by Efron, Tibshirani, Storey, and Tusher (2001) and Efron (2004a); so the optimal rule for mFDR control is of the form δ[Lfdr(·), c] = {I[Lfdr(xi) < c]: i = 1, ..., m}. The Lfdr has been widely used in the FDR literature to provide a Bayesian version of the frequentist FDR measure and to interpret results for individual cases (Efron 2004a). We "rediscover" it here as the optimal (oracle) statistic in the multiple-testing problem, in the sense that the thresholding rule based on Lfdr(X) controls the mFDR with the smallest mFNR.
The Lfdr statistic is defined in terms of the z values, which can be converted from other test statistics, including the t and chi-squared statistics, using inverse-probability transformations. Note that p is a global parameter. Therefore, the expression Lfdr(z) = (1 − p)f0(z)/f(z) implies that we actually rank the relative importance of the observations according to their LRs, and the rankings are generally different from the rankings of p values unless the alternative distribution is symmetric about the null. An interesting consequence of using the Lfdr statistic in multiple testing is that an observation located farther from the null may have a lower significance level. Therefore, it is possible that the test accepts a more extreme observation while rejecting a less extreme observation, implying that the rejection region is asymmetric. This is not possible for a testing procedure based on the individual p values, whose rejection region is always symmetric about the null. This phenomenon is illustrated in Figure 2 at the end of this section.
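The asymmetry can be seen directly from the Lfdr statistic. In the sketch below (the asymmetric mixture is an assumed illustration, with p and the densities treated as known), z = −2 receives a smaller Lfdr than z = 2 because most of the nonnull mass sits near −3:

```python
import numpy as np
from scipy.stats import norm

# Assumed asymmetric mixture: 80% null N(0,1), 15% N(-3,1), 5% N(4,1).
p0, p1, p2 = 0.8, 0.15, 0.05

def lfdr(z):
    """Local FDR: Lfdr(z) = (1 - p) f0(z) / f(z) for the assumed mixture."""
    f = p0 * norm.pdf(z) + p1 * norm.pdf(z + 3.0) + p2 * norm.pdf(z - 4.0)
    return p0 * norm.pdf(z) / f

# A p value ranking treats -2 and 2 symmetrically; the Lfdr ranking does not:
lfdr_neg2, lfdr_pos2 = lfdr(-2.0), lfdr(2.0)
```

Under this mixture the point −2 is ranked as far stronger evidence against the null than +2, even though both have the same two-sided p value.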
Setting the threshold for the test statistics has been the focus of the FDR literature (see, e.g., Benjamini and Hochberg 1995; Genovese and Wasserman 2004). Consider an ideal situation in which we assume that an oracle knows the true underlying distribution of the test statistics. Note that the SMLR assumption implies that the mFNR is decreasing in the mFDR; therefore, the oracle's response to such a thresholding problem is to "spend" all of the mFDR to minimize the mFNR. Besides providing a target for evaluating different multiple-testing procedures, the oracle procedure also sheds light on the development of the adaptive procedure that we propose in Section 4.
3.1 The Oracle Procedure Based on the p Values
Let G1(t) be the distribution of the p values under the alternative, and let p be the proportion of nonnulls. We assume that G1 is concave; then, according to Genovese and Wasserman (2002), the oracle procedure thresholds the p value at u*, the solution to the equation G1(u)/u = (1/p − 1)(1/α − 1). Here u* is the optimal cutoff for a concave G1(t), in the sense that the rule δ[p(·), u*] = {I[Pi < u*]: i = 1, ..., m} yields the smallest mFNR among all p value–based procedures that control the mFDR at level α. Let Ḡ(t) = 1 − G(t). The resulting mFDR, (1 − p)u*/[(1 − p)u* + pG1(u*)], is just α, and the mFNR is given by pḠ1(u*)/[pḠ1(u*) + (1 − p)(1 − u*)].
Figure 2. Symmetric rejection region versus asymmetric rejection region. (a) Comparison of oracle rules (p value, z value). (b) Rejection regions. The normal mixture model is .8N(0,1) + p1 N(−3,1) + (.2 − p1) N(4,1). Both procedures control the mFDR at .10.
3.2 The Oracle Procedure Based on z Values
We write the Lfdr statistic as TOR(Zi) = (1 − p)f0(Zi)/f(Zi) and call it the oracle test statistic. We assume that TOR(Zi) is distributed with marginal cdf GOR = (1 − p)G0,OR + pG1,OR, where G0,OR and G1,OR are the cdf's of TOR(Zi) under the null and the alternative. Set ḠOR(t) = 1 − GOR(t). The mFDR of the oracle rule δ(TOR, λ) is QOR(λ) = (1 − p)G0,OR(λ)/GOR(λ). We assume that TOR ∈ 𝒯 satisfies the SMLR assumption; then part (b) of Proposition 1 implies that QOR(λ) is increasing in the threshold λ. Therefore, the oracle procedure thresholds TOR(z) at λOR = sup{t ∈ (0,1): QOR(t) ≤ α} = QOR^(−1)(α). Thus the oracle testing rule based on the z values is

δ(TOR, λOR) = {I[TOR(z1) < λOR], ..., I[TOR(zm) < λOR]}.  (6)

The corresponding mFDR = QOR(λOR) is just α, and the mFNR is

Q̃OR(λOR) = pḠ1,OR(λOR)/ḠOR(λOR).  (7)
Example 1. We consider a random sample z1, ..., zm from a normal mixture model

f(z) = (1 − p1 − p2)φ(z) + p1 φ(z − μ1) + p2 φ(z − μ2),  (8)

along with the corresponding m tests H0^i: μ = 0 versus H1^i: μ ≠ 0, i = 1, ..., m. Let p = p1 + p2 be the total proportion of nonnulls. We assume that μ1 < 0 and μ2 > 0, so that the rejection region is two-sided.

The p value distribution under the alternative can be derived as G1(t) = (p1/p){Φ̄(z(t/2) + μ1) + Φ̄(z(t/2) − μ1)} + (p2/p){Φ̄(z(t/2) + μ2) + Φ̄(z(t/2) − μ2)}, where z(t/2) = Φ^(−1)(1 − t/2). Thus the optimal p value cutoff u* is the solution to the equation G1(u)/u = (1/p − 1)(1/α − 1), with corresponding mFNR pḠ1(u*)/[pḠ1(u*) + (1 − p)(1 − u*)].
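To make the cutoff concrete, the following sketch solves G1(u)/u = (1/p − 1)(1/α − 1) by bisection for the asymmetric mixture used later in this section (p1 = .15, p2 = .05, μ1 = −3, μ2 = 4, α = .10 are taken from that example; scipy is assumed available):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

p1, p2, mu1, mu2, alpha = 0.15, 0.05, -3.0, 4.0, 0.10   # example parameters
p = p1 + p2

def G1(t):
    """cdf of the p value under the alternative for the two-sided z test."""
    z = norm.ppf(1 - t / 2)                    # z_(t/2)
    comp = lambda mu: norm.sf(z + mu) + norm.sf(z - mu)
    return (p1 / p) * comp(mu1) + (p2 / p) * comp(mu2)

target = (1 / p - 1) * (1 / alpha - 1)         # (1/p - 1)(1/alpha - 1) = 36
u_star = brentq(lambda u: G1(u) - target * u, 1e-8, 0.5)
z_cut = norm.ppf(1 - u_star / 2)               # reject when |z_i| > z_cut
```

The resulting symmetric cutoff is close to the |z| > 2.27 region reported for this mixture later in the section.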
It is easy to verify that in the normal mixture model (8), the oracle testing rule (6) is equivalent to rejecting the null when zi < cl or zi > cu. The threshold λOR can be obtained using the bisection method. For a given λ, cl and cu can be solved numerically from the equation λ[p1 exp(μ1 z − μ1²/2) + p2 exp(μ2 z − μ2²/2)] − (1 − λ)(1 − p) = 0. The corresponding mFDR and mFNR can be calculated as

mFDR = (1 − p)[Φ(cl) + Φ̄(cu)] / {(1 − p)[Φ(cl) + Φ̄(cu)] + p1[Φ(cl − μ1) + Φ̄(cu − μ1)] + p2[Φ(cl − μ2) + Φ̄(cu − μ2)]}

and

mFNR = {p1[Φ(cu − μ1) − Φ(cl − μ1)] + p2[Φ(cu − μ2) − Φ(cl − μ2)]} / {(1 − p)[Φ(cu) − Φ(cl)] + p1[Φ(cu − μ1) − Φ(cl − μ1)] + p2[Φ(cu − μ2) − Φ(cl − μ2)]},

where Φ(·) is the cdf of a standard normal and Φ̄(·) = 1 − Φ(·).

In this example we choose the mixture model .8N(0,1) + p1 N(−3,1) + (.2 − p1) N(4,1). This is different from the model shown in Figure 1, which has alternative means −3 and 3; therefore, different patterns are seen. Both oracle procedures control the mFDR at level .10, and we plot the mFNRs of the two procedures as a function of p1 in Figure 2(a). We can see that again the p value oracle procedure is dominated by the z value oracle procedure. We set p1 = .15 and characterize the rejection regions of the two oracle procedures in Figure 2(b). Some calculations show that the rejection region of the p value oracle procedure is |zi| > 2.27, with mFNR = .046, whereas the rejection region of the z value oracle procedure is zi < −1.97 and zi > 3.41, with mFNR = .038. It is interesting to note that the z value oracle procedure rejects the observation x = −2 (p value = .046) but accepts the observation x = 3 (p value = .003). We provide more numerical comparisons in Section 5, where we show that the z value oracle procedure yields a lower mFNR level than the p value oracle procedure in all possible configurations of alternative hypotheses, with the difference significant in many cases.
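The nested bisection described above can be sketched as follows (same assumed mixture parameters; the inner bisections solve the displayed equation for cl and cu at a given λ, and the outer search adjusts λ until the mFDR equals .10):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

p1, p2, mu1, mu2 = 0.15, 0.05, -3.0, 4.0   # assumed example parameters
p = p1 + p2

def cutoffs(lam):
    """Roots c_l < 0 < c_u of
    lam*[p1 e^(mu1 z - mu1^2/2) + p2 e^(mu2 z - mu2^2/2)] = (1-lam)(1-p)."""
    h = lambda z: (lam * (p1 * np.exp(mu1 * z - mu1**2 / 2)
                          + p2 * np.exp(mu2 * z - mu2**2 / 2))
                   - (1 - lam) * (1 - p))
    return brentq(h, -20.0, 0.0), brentq(h, 0.0, 20.0)

def mfdr(cl, cu):
    rej = lambda mu: norm.cdf(cl - mu) + norm.sf(cu - mu)   # rejection prob.
    num = (1 - p) * rej(0.0)
    return num / (num + p1 * rej(mu1) + p2 * rej(mu2))

def mfnr(cl, cu):
    acc = lambda mu: norm.cdf(cu - mu) - norm.cdf(cl - mu)  # acceptance prob.
    num = p1 * acc(mu1) + p2 * acc(mu2)
    return num / ((1 - p) * acc(0.0) + num)

# Outer bisection over lambda for mFDR = .10; the region is asymmetric.
lam_or = brentq(lambda lam: mfdr(*cutoffs(lam)) - 0.10, 1e-4, 0.99)
cl, cu = cutoffs(lam_or)
mfnr_val = mfnr(cl, cu)
```

The rejection region that comes out is asymmetric, with a less extreme cutoff on the heavy (−3) side than on the light (+4) side, while the mFNR stays below that of the symmetric p value oracle rule.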
3.3 Connection to the Work of Spjøtvoll (1972) and Storey (2007)
An "optimal" multiple-testing procedure was introduced by Spjøtvoll (1972). Let f01, ..., f0N and f1, ..., fN be integrable functions. Let S′(γ) denote the set of all tests (ψ1, ..., ψN) that satisfy Σ_{t=1}^N ∫ ψt(x) f0t(x) dμ(x) = γ, where γ > 0 is prespecified. Spjøtvoll (1972) showed that the test (φ1, ..., φN) = {I[ft(x) > c f0t(x)]: t = 1, ..., N} maximizes

Σ_{t=1}^N ∫ φt(x) ft(x) dμ(x)  (9)

among all tests (ψ1, ..., ψN) ∈ S′(γ). Storey (2007) proposed the optimal discovery procedure (ODP), based on a shrinkage test statistic that maximizes the expected number of true positives (ETP) for a fixed level of the expected number of false positives (EFP). The result of Storey (2007) on the ODP follows directly from the theorem of Spjøtvoll (1972) by choosing appropriate f0t and ft.
In the setting of the present article, both Spjøtvoll's optimal procedure and Storey's ODP depend on unknown functions that are not estimable from the data. Moreover, the optimal cutoffs for a given test level are not specified in either procedure. These limitations make the two methods inapplicable in terms of the goal of FDR control.
The formulation of our testing procedure (6) has two advantages over the procedures of Spjøtvoll (1972) and Storey (2007). First, we can form good estimates (Jin and Cai 2007) from the data for the unknown functions in the oracle test statistic that we propose. Second, we specify the optimal cutoff for each mFDR level, which was not discussed by Spjøtvoll (1972) or Storey (2007). These advantages greatly facilitate the development of the adaptive procedure (10), in which a consistent cutoff (relative to the oracle cutoff) is suggested for each test level based on a simple step-up procedure. This task is impossible based on the formulation of the work of Spjøtvoll or Storey, because good estimates of the unknown quantities in their test "statistics" are not available and obtaining appropriate cutoffs is difficult.
4. ADAPTIVE PROCEDURES FOR mFDR CONTROL
Genovese and Wasserman (2004) discussed a class of plug-in procedures for the purpose of practical implementation of the oracle procedure based on the p values. However, the plug-in idea is difficult to apply to the z value oracle procedure, because it essentially involves estimating the distribution of the oracle test statistic TOR(z), which is usually very difficult. Instead, we develop an adaptive procedure that requires only estimation of the distribution of the z values, so that the difficulty of estimating the distribution of TOR(z) is avoided. The adaptive procedure can be easily implemented by noting that the z values are usually distributed as a normal mixture after appropriate transformations are applied, and several methods for consistently estimating the normal nulls have been developed in the literature (see, e.g., Efron 2004b; Jin and Cai 2007).
In this section we first introduce an adaptive p value–based procedure proposed by Benjamini and Hochberg (2000). We then turn to the development of an adaptive procedure based on the z values. We show that the adaptive z value–based procedure asymptotically attains the performance of the oracle procedure (6). We report simulation studies in Section 5 demonstrating that the z value–based adaptive procedure is more efficient than the traditional p value–based testing approaches.
4.1 Adaptive Procedure Based on p Values
Suppose that P(1), ..., P(m) are the ranked p values from m independent tests, and let k = max{i: P(i) ≤ αi/[m(1 − p̂)]}. Then the adaptive procedure of Benjamini and Hochberg (2000), designated as BH hereinafter, rejects all H(i), i ≤ k. Genovese and Wasserman (2004) proposed the plug-in threshold t(p̂, Ĝ) for p values, where t(p, G) = sup{t: (1 − p)t/G(t) ≤ α} is the threshold of the oracle p value procedure, and p̂ and Ĝ are estimates of p and G. The next theorem shows that the GW plug-in procedure and the BH adaptive procedure are equivalent when the empirical distribution Gm of the p values is used to estimate G.
Theorem 3. Let p̂ be an estimate of p, and let Gm be the empirical cdf of the p values. Then the GW plug-in procedure is equivalent to the BH adaptive procedure.

It follows from our Theorem 3 and theorem 5 of Genovese and Wasserman (2004) that the BH adaptive procedure controls the mFDR at level α asymptotically.
4.2 Adaptive Procedure Based on z Values
Here we outline the steps of an intuitive derivation of the adaptive z value–based procedure. The derivation essentially involves mimicking the operation of the z value oracle procedure and evaluating the distribution of TOR(z) empirically.
Let z1, ..., zm be a random sample from the mixture model (1) with cdf F = (1 − p)F0 + pF1 and pdf f = (1 − p)f0 + pf1. Let p̂, f̂0, and f̂ be consistent estimates of p, f0, and f. Such estimates have been provided by, for example, Jin and Cai (2007). Define T̂OR(zi) = [(1 − p̂)f̂0(zi)/f̂(zi)] ∧ 1.

The mFDR of the decision rule δ(TOR, λ) = {I[TOR(zi) < λ]: i = 1, ..., m} is given by QOR(λ) = (1 − p)G0,OR(λ)/GOR(λ), where GOR(t) and G0,OR(t) are as defined in Section 3. Let Sλ = {z: TOR(z) < λ} be the rejection region. Then GOR(λ) = ∫_{Sλ} f(z) dz = ∫ 1{TOR(z) < λ} f(z) dz. We estimate GOR(λ) by ĜOR(λ) = (1/m) Σ_{i=1}^m 1{T̂OR(zi) < λ}. The numerator of QOR(λ) can be written as (1 − p)G0,OR(λ) = (1 − p) ∫_{Sλ} f0(z) dz = ∫ 1{TOR(z) < λ} TOR(z) f(z) dz, and we estimate this quantity by (1/m) Σ_{i=1}^m 1{T̂OR(zi) < λ} T̂OR(zi). Then QOR(λ) can be estimated as

Q̂OR(λ) = [Σ_{i=1}^m 1{T̂OR(zi) < λ} T̂OR(zi)] / [Σ_{i=1}^m 1{T̂OR(zi) < λ}].

Set the estimated threshold as λ̂OR = sup{t ∈ (0,1): Q̂OR(t) ≤ α}, and let R be the set of the ranked T̂OR(zi): R = {L̂fdr(1), ..., L̂fdr(m)}. We consider only the discrete cutoffs in the set R, where the estimated mFDR reduces to Q̂OR(L̂fdr(k)) = (1/k) Σ_{i=1}^k L̂fdr(i). We propose the following adaptive step-up procedure:

Let k = max{i: (1/i) Σ_{j=1}^i L̂fdr(j) ≤ α};  then reject all H(i), i = 1, ..., k.  (10)
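The step-up rule (10) needs only the ranked estimated Lfdr values; a minimal sketch (the values below are made up; in practice they come from the estimates p̂, f̂0, and f̂ described above):

```python
import numpy as np

def adaptive_z_procedure(lfdr_hat, alpha):
    """Procedure (10): rank the estimated Lfdr values, take the largest k whose
    running mean is <= alpha, and reject those k hypotheses."""
    lfdr_hat = np.asarray(lfdr_hat)
    order = np.argsort(lfdr_hat)
    running_mean = np.cumsum(lfdr_hat[order]) / np.arange(1, lfdr_hat.size + 1)
    below = np.nonzero(running_mean <= alpha)[0]
    reject = np.zeros(lfdr_hat.size, dtype=bool)
    if below.size > 0:
        reject[order[: below[-1] + 1]] = True
    return reject

# Made-up estimated Lfdr values, for illustration only.
lfdr_hat = np.array([0.02, 0.60, 0.05, 0.11, 0.90])
rej = adaptive_z_procedure(lfdr_hat, alpha=0.10)
```

Note that the rule can reject a hypothesis whose individual estimated Lfdr exceeds α (here .11), because (10) controls the average Lfdr over the rejection set rather than each value separately.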
Similar to the discussion in the proof of Theorem 3, it is sufficient to consider only the discrete cutoffs in R, and the adaptive procedure (10) is equivalent to the plug-in procedure

δ(T̂OR, λ̂OR) = {I[T̂OR(z1) < λ̂OR], ..., I[T̂OR(zm) < λ̂OR]},

which is very difficult to implement directly because obtaining λ̂OR is difficult.
Remark 4. The procedure (10) is more adaptive than the BH adaptive procedure in the sense that it adapts to both the global feature p and the local feature f0/f. In contrast, the BH method adapts only to the global feature p. Suppose that we use the theoretical null N(0,1) in the expression TOR = (1 − p)f0/f. The p value approaches treat the points −z and z equally, whereas the z value approaches evaluate the relative importance of −z and z according to their estimated densities. For example, if there is evidence in the data that there are more nonnulls around −z [i.e., f̂(−z) is larger], then the observation −z will be correspondingly ranked higher than the observation z.
Remark 5. In the FDR literature, z value–based methods, such as the Lfdr procedure (Efron 2004a,b), are used only to calculate individual significance levels, whereas the p value–based procedures are used for global FDR control to identify nonnull cases. It is also notable that the goals of global error control and individual case interpretation are naturally unified in the adaptive procedure (10).
The next two theorems show that the adaptive procedure (10) asymptotically attains the performance of the oracle procedure based on the z values, in the sense that both the mFDR and mFNR levels achieved by the oracle procedure (6) are also asymptotically achieved by the adaptive z value–based procedure (10).
Theorem 4 (Asymptotic validity of the adaptive procedure). Let θᵢ, i = 1, ..., m, be iid Bernoulli(p) variables, and let xᵢ | θᵢ, i = 1, ..., m, be independent observations from the mixture model (1) with pdf f = (1 − p)f₀ + pf₁. Suppose that
Sun and Cai Rules for False Discovery Rate Control 907
f is continuous and positive on the real line. Assume that T_OR(zᵢ) = (1 − p)f₀(zᵢ)/f(zᵢ) is distributed with the marginal pdf g = (1 − p)g₀ + pg₁ and that T_OR ∈ 𝒯 satisfies the SMLR assumption. Let p̂, f̂₀, and f̂ be estimates of p, f₀, and f such that p̂ →_p p, E‖f̂ − f‖² → 0, and E‖f̂₀ − f₀‖² → 0. Define the test statistic T̂_OR(zᵢ) = (1 − p̂)f̂₀(zᵢ)/f̂(zᵢ), and let L̂fdr_(1), ..., L̂fdr_(m) be the ranked values of T̂_OR(zᵢ); then the adaptive procedure (10) controls the mFDR at level α asymptotically.
Theorem 5 (Asymptotic attainment of the adaptive procedure). Assume that the random sample z₁, ..., z_m and the test statistics T_OR(zᵢ), T̂_OR(zᵢ) are the same as in Theorem 4. Then the mFNR level of the adaptive procedure (10) is Q̃_OR(λ_OR) + o(1), where Q̃_OR(λ_OR), given in (7), is the mFNR level achieved by the z value oracle procedure (6).
5 NUMERICAL RESULTS
We now turn to the numerical performance of our adaptive z value–based procedure (10). The procedure is easy to implement; the R code for it is available at http://stat.wharton.upenn.edu/~tcai/paper/html/FDR.html. The goal of this section is to provide numerical examples showing that the conventional p value–based procedures are inefficient. We also explore situations in which these p value–based procedures can be substantially improved by the z value–based procedures. We describe the application of our procedure to the analysis of microarray data from an HIV study in Section 6.
In our numerical studies, we consider the following normal mixture model:

Zᵢ ~ (1 − p)N(μ₀, σ₀²) + pN(μᵢ, σᵢ²), (11)

where (μᵢ, σᵢ²) follows some bivariate distribution F(μ, σ²). This model can be used to approximate many mixture distributions and is found in a wide range of applications (see, e.g., Magder and Zeger 1996; Efron 2004b). We compare the p value–based and z value–based procedures in two numerical examples.
Numerical Study 1. We compare the p value and z value oracle procedures in the normal mixture model (8), a special case of (11). The algorithm for calculating the oracle cutoffs and the corresponding mFNRs is given in Example 1 at the end of Section 3. Figure 3 compares the performance of these two oracle procedures. Panel (a) plots the mFNR of the two oracle procedures as a function of p₁, where the mFDR level is set at .10, the alternative means are μ₁ = −3 and μ₂ = 3, and the total proportion of nonnulls is p = .20. Panel (b) plots the mFNR as a function of p₁ in the same setting, except with alternative means μ₁ = −3 and μ₂ = 6. In panel (c), mFDR = .10, p₁ = .18, p₂ = .02, and μ₁ = −3, and we plot the mFNR as a function of μ₂. Panel (d) plots the mFNR as a function of the mFDR level while holding μ₁ = −3, μ₂ = 1, p₁ = .02, and p₂ = .18 fixed. We discuss these results at the end of this section.
Numerical Study 2. We compare the step-up procedure (Benjamini and Hochberg 1995), the adaptive p value–based procedure (Benjamini and Hochberg 2000; Genovese and Wasserman 2004), and the adaptive z value–based procedure (10), designated BH95, BHGW, and AdaptZ hereinafter. Note that the BH step-up procedure is easier to implement than either BHGW or AdaptZ: the BHGW procedure requires estimating the proportion of nonnulls, and the AdaptZ procedure also requires an additional density estimation step.
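As the proof of Theorem 3 shows, the BHGW plug-in procedure rejects the k smallest p values, where k is the largest i with p_(i) ≤ iα/[m(1 − p̂)]. A hedged sketch (the function name and the toy numbers are ours):

```python
import numpy as np

def bhgw_rejections(pvals, p_hat, alpha=0.10):
    """Plug-in (BHGW) procedure: the number of rejections is the largest i
    with p_(i) <= i * alpha / (m * (1 - p_hat))."""
    m = len(pvals)
    p_sorted = np.sort(pvals)
    crit = np.arange(1, m + 1) * alpha / (m * (1.0 - p_hat))
    below = np.nonzero(p_sorted <= crit)[0]
    return below[-1] + 1 if below.size else 0

# with p_hat = 0 this reduces exactly to the BH95 step-up procedure
print(bhgw_rejections(np.array([0.001, 0.009, 0.04, 0.2, 0.8]), p_hat=0.2))
```

Dividing α by (1 − p̂) is what makes the procedure adaptive to the global proportion of nonnulls: the larger p̂ is, the more liberal the critical values become.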
Figure 3. Comparison of the p value and z value oracle rules: (a) mFDR = .10, μ = (−3, 3); (b) mFDR = .10, μ = (−3, 6); (c) mFDR = .10, p₁ = .18, p₂ = .02, μ₁ = −3; (d) μ = (−3, 1), p = (.02, .18).
908 Journal of the American Statistical Association September 2007
Figure 4. Comparison of procedures for FDR control: (a) μ = (−3, 3), p = .20; (b) μ = (−3, 6), p = .20; (c) μ₁ = −3, p = (.08, .02); (d) μ₁ = −3, p = (.18, .02). The procedures shown are the BH step-up procedure based on the p values, the BHGW adaptive procedure based on the p values, and the adaptive procedure based on the z values. The mFDR is controlled at level .10.
The alternative means μ₁ᵢ and μ₂ᵢ are generated from the uniform distributions U(θ₁ − δ₁, θ₁ + δ₁) and U(θ₂ − δ₂, θ₂ + δ₂), after which zᵢ is generated from the normal mixture model (8) based on μ₁ᵢ and μ₂ᵢ, i = 1, ..., m. This hierarchical model can also be viewed as a special case of the mixture model (11). In estimating the Lfdr statistic Lfdr(zᵢ) = (1 − p)f₀(zᵢ)/f(zᵢ), f₀ is chosen to be the theoretical null density N(0,1), p is estimated consistently using the approach of Jin and Cai (2007), and f is estimated using a kernel density estimator with bandwidth chosen by cross-validation. The comparison results are displayed in Figure 4, where the top row gives plots of the mFNR as a function of p₁ and the bottom row gives plots of the mFNR as a function of θ₂, with the other quantities held fixed. All points in the plots are calculated based on a sample size of m = 5,000 and 200 replications.
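The Lfdr estimation step just described can be sketched as follows. This is our own simplified sketch: it uses SciPy's Gaussian kernel density estimate with the default Scott bandwidth rather than the cross-validated bandwidth used in the study, and it assumes the nonnull proportion has already been estimated:

```python
import numpy as np
from scipy.stats import norm, gaussian_kde

def lfdr_hat(z, p_hat):
    """Estimated Lfdr(z) = (1 - p_hat) * f0(z) / f_hat(z), with the
    theoretical null f0 = N(0, 1) and a kernel estimate of the marginal f."""
    f_hat = gaussian_kde(z)(z)          # marginal density evaluated at the data
    return np.clip((1 - p_hat) * norm.pdf(z) / f_hat, 0.0, 1.0)

# illustrative data: 80% nulls from N(0,1), 20% nonnulls from N(3,1)
rng = np.random.default_rng(1)
z = np.concatenate([rng.standard_normal(800), 3 + rng.standard_normal(200)])
lfdr = lfdr_hat(z, p_hat=0.2)           # small near the nonnull component
```

Feeding these estimated Lfdr values into the step-up rule (10) gives the AdaptZ procedure.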
The following observations can be made based on the results from these two numerical studies:
a. When |μ₁| = |μ₂|, the mFNR remains constant for the p value oracle procedure; in contrast, the mFNR for the z value oracle procedure first increases and then decreases [Fig. 3(a)]. The p value and z value oracle procedures yield the same mFNR levels only when the alternative is symmetric about the null. This reveals that the z value procedure adapts to asymmetry in the alternative distribution, whereas the p value procedure does not. Similar phenomena are seen in Figure 4 for the adaptive procedures.
b. The p value oracle procedure is dominated by the z value oracle procedure. The largest difference occurs when |μ₁| < μ₂ and p₁ > p₂, where the alternative distribution is highly asymmetric about the null [Figs. 3(b) and 3(c)]. Similarly, BH95 is dominated by BHGW, which is in turn dominated by AdaptZ (Fig. 4).
c. The mFNRs of both the p value and z value procedures decrease as μ₂ moves away from 0.

d. Within a reasonable range (mFDR < .6), the mFNR is decreasing in the mFDR [Fig. 3(d)], which verifies part (d) of Proposition 1, one consequence of our SMLR assumption.

Additional simulation results show that the difference in the mFNRs of the p value and z value procedures increases with the proportion of nonnulls, and that the adaptive procedures (BHGW and AdaptZ) control the mFDR approximately at the nominal level α, whereas the BH95 procedure controls the mFDR at a lower level.
6 APPLICATION TO MICROARRAY DATA
We now illustrate our method with an analysis of microarray data from an HIV study. The goal of the HIV study (van't Wout et al. 2003) is to discover differentially expressed genes
Figure 5. Histograms of the HIV data: (a) z values; (b) p values; (c) transformed p values, approximately distributed as uniform (0,1) for the null cases.
between HIV-positive patients and HIV-negative controls. Gene expression levels were measured for four HIV-positive patients and four HIV-negative controls on the same N = 7,680 genes. A total of N two-sample t tests were performed, and N two-sided p values were obtained. The z values were converted from the t statistics using the transformation zᵢ = Φ⁻¹[G₀(tᵢ)], where Φ and G₀ are the cdf's of a standard normal variable and a t variable with 6 degrees of freedom. The histograms of the z values and p values are presented in Figure 5. An important feature of this data set is that the z value distribution is asymmetric about the null; the distribution is skewed to the right.
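The transformation zᵢ = Φ⁻¹[G₀(tᵢ)] is monotone, so it preserves the ranking of the t statistics while giving them a standard normal null distribution. A small sketch (the t values are made up for illustration):

```python
import numpy as np
from scipy import stats

t_stats = np.array([-4.2, 0.0, 1.1, 3.7])       # hypothetical t statistics, df = 6
z = stats.norm.ppf(stats.t.cdf(t_stats, df=6))  # z_i = Phi^{-1}[G0(t_i)]
```

Because the t distribution has heavier tails than the normal, |z| < |t| in the tails, and t = 0 maps exactly to z = 0.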
When the null hypothesis is true, the p values and z values should follow their theoretical null distributions, which are uniform and standard normal. But the theoretical nulls are usually quite different from the empirical nulls for data arising from microarray experiments (see Efron 2004b for more discussion on the choice of the null in multiple testing). We take the approach of Jin and Cai (2007) to estimate the null distribution as N(μ̂₀, σ̂₀²). The estimates μ̂₀ and σ̂₀² are consistent. We then proceed to estimate the proportion of nonnulls, p, based on μ̂₀ and σ̂₀². The marginal density f is estimated by a kernel density estimate f̂, with the bandwidth chosen by cross-validation. The Lfdr statistics are then calculated as L̂fdr(zᵢ) = (1 − p̂)f̂₀(zᵢ)/f̂(zᵢ). The transformed p values are obtained as F̂₀(zᵢ), where F̂₀ is the estimated null cdf Φ((x − μ̂₀)/σ̂₀). As shown in Figure 5(c), after transformation the distribution of the transformed p values is approximately uniform when the null is true.
We compare the BH95, BHGW, and AdaptZ procedures using both the theoretical nulls and the estimated nulls. We calculate the number of rejections at each mFDR level; the results are shown in Figure 6. For Figure 6(a), f₀ is chosen to be the theoretical null N(0,1), and the estimate of the proportion of nulls is 1; therefore, the BH and BHGW procedures yield the same number of rejections. For Figure 6(b), the estimated null distribution is N(−.08, .77²), with estimated proportion of nulls p̂₀ = .94. The transformed p values as well as the Lfdr statistics are calculated according to the estimated null. The following observations can be made from the results displayed:
a. The number of rejections is increasing as a function of the mFDR.

b. For both the p value–based and z value–based approaches, more hypotheses are rejected when the estimated null is used.

c. Both comparisons show that AdaptZ is more powerful than BH95 and BHGW, which are based on the p values.
7 DISCUSSION
We have developed a compound decision theory framework for inference about the mixture model (1) and have shown that the oracle procedure δ(T_OR, λ_OR) given in (6) is optimal in the multiple-testing problems for mFDR control. We have proposed an adaptive procedure based on the z values that mimics the oracle procedure (6). The adaptive z value–based procedure attains the performance of the oracle procedure asymptotically and outperforms the traditional p value–based approaches. The decision-theoretic framework provides insight into the superiority of the adaptive z value–based procedure. Each multiple-testing procedure involves two steps: ranking and thresholding. The process of ranking the relative importance of the m
tests can be viewed as a compound decision problem, P = {P_ij : 1 ≤ i < j ≤ m}, with m(m − 1)/2 components. Each component problem P_ij is a pairwise comparison with its own data y_ij = (x_i, x_j). The solution to P based on p values is simple (note that δ_ij = I[p(x_i) < p(x_j)] depends on y_ij alone), whereas the rule based on the estimated Lfdr is compound. The gain in efficiency of the adaptive z value–based procedure is due to the fact that the scope of attention is extended from the
910 Journal of the American Statistical Association September 2007
(a) (b)
Figure 6 Analysis of the HIV data Number of rejections versus FDR levels for the theoretical null (a) and the estimated null (b) the BHstep-up procedure based on the p values 13 the BHGW adaptive procedure based on the p values + the adaptive procedure based on thez values
class of simple rules to the class of compound rules in the ranking process.
The compound decision-theoretic approach to multiple testing also suggests several future research directions. First, the oracle statistic (Lfdr) may no longer be optimal for ranking the significance of tests when the observations are correlated. In addition, different cutoffs should be chosen under dependency, because multiple-testing procedures developed for independent tests can be too liberal or too conservative for dependent tests (Efron 2006). Second, the oracle procedure is optimal only in the class of symmetric decision rules. We expect to develop more general models in the compound decision-theoretic framework and extend our attention to asymmetric rules to further increase the efficiency of multiple-testing procedures. Finally, Genovese, Roeder, and Wasserman (2005), Wasserman and Roeder (2006), Rubin, Dudoit, and van der Laan (2006), and Foster and Stine (2006) have discussed methods that use different cutoffs for the p values of different tests when prior information is available or some correlation structure is known. It would be interesting to construct z value–based testing procedures that also exploit external information. We leave this topic for future research.
APPENDIX: PROOFS
Proof of Proposition 1
To prove part (a), note that the MLR assumption (4) implies that

∫₀ᶜ g₀(t) dt / ∫₀ᶜ g₁(t) dt = ∫₀ᶜ g₀(t) dt / ∫₀ᶜ [g₁(t)/g₀(t)]g₀(t) dt
< ∫₀ᶜ g₀(t) dt / ∫₀ᶜ [g₁(c)/g₀(c)]g₀(t) dt = g₀(c)/g₁(c),

which is equivalent to g₁(c)G₀(c) < g₀(c)G₁(c). Thus the derivative of P(θᵢ = 1 | T(Xᵢ) ≤ c) = pG₁(c)/G(c) is (1 − p)p{g₁(c)G₀(c) − g₀(c)G₁(c)}/G(c)² < 0; therefore P(θᵢ = 1 | T(Xᵢ) ≤ c) is decreasing in c. Parts (b) and (c) can be proved similarly to part (a), by noting that mFDR = p₀G₀(c)/G(c) and mFNR = p₁[1 − G₁(c)]/[1 − G(c)]. Part (d) follows from parts (b) and (c). For part (e), the classification risk of δ is R_λ = E[L_λ(θ, δ)] = λ(1 − p)G₀(c) + p{1 − G₁(c)}. The optimal cutoff c* that minimizes R_λ satisfies g₀(c*)/g₁(c*) = p/[λ(1 − p)], or equivalently λ = pg₁(c*)/[(1 − p)g₀(c*)]. Therefore λ (c*) is monotone decreasing in c* (λ) if (4) holds. Noting that r = G(c*) and that G is increasing, we conclude that r is decreasing in λ.
Proof of Theorem 1
According to part (b) of Proposition 1 for any T isin T and mFDRlevel α there exists a unique threshold c such that the mFDR is con-trolled at level α by δ(T c) = I [T (x1) lt c] I [T (xm) lt c] Letr be the expected number of rejections of δ(T c) Then again bypart (b) of Proposition 1 there exists a unique threshold clowast such thatthe expected number of rejections of δ( clowast) is also r Next accord-ing to part (e) of Proposition 1 there exists a unique λ(α) with respectto the choice of clowast such that the classification risk with λ(α) as theweight is minimized by δ( clowast) and the expected number of subjectsclassified to the nonnull population is r
Suppose that among the r rejected hypotheses there are vL fromthe null and kL from the nonnull when δ(L cL) is applied where(L cL) = (T c) or ( clowast) Then r = vT + kT = v + k Now weargue that vT ge v and kT le k If not then suppose that vT lt v
and kT gt k Note that the classification risk can be expressed asRλ(α) = p + 1
m λ(α)vL minus kL and it is easy to see that δ(T c) yieldsa lower classification risk than δ( clowast) This contradicts the factthat δ( clowast) minimizes the classification risk for the choice of λ(α)Therefore we must have that vT ge v and kT le k
Let mFDR_L and mFNR_L be the mFDR and mFNR of δ(L, c_L), where L = T, Λ. Now we apply both δ = δ(T, c) and δ* = δ(Λ, c*) in the multiple-testing problem. Then the mFDR using rule δ* is mFDR_Λ = v_Λ/r ≤ v_T/r = mFDR_T = α, and the mFNR using δ* is mFNR_Λ = (m₁ − k_Λ)/(m − r) ≤ (m₁ − k_T)/(m − r) = mFNR_T. Therefore δ* is the testing rule in the class 𝒟_S that controls the mFDR at level α with the smallest mFNR.
Proof of Theorem 2
The joint distribution of θ = (θ₁, ..., θ_m) is π(θ) = ∏ᵢ(1 − p)^{1−θᵢ}p^{θᵢ}. The posterior distribution of θ given x can be calculated as P_{θ|X}(θ|x) = ∏ᵢ P_{θᵢ|Xᵢ}(θᵢ|xᵢ), where

P_{θᵢ|Xᵢ}(θᵢ|xᵢ) = [I(θᵢ = 0)(1 − p)f₀(xᵢ) + I(θᵢ = 1)pf₁(xᵢ)] / [(1 − p)f₀(xᵢ) + pf₁(xᵢ)]. (A.1)

Let Iᵢ = I(θᵢ = 1), and let I = {(I₁, ..., I_m) : Iᵢ = 0 or 1} be the collection of all m-binary vectors; then |I| = 2^m. The elements of I are designated by superscripts, I^(j), j = 1, ..., 2^m. Note that ∑_{θᵢ=0}^{1} [I(θᵢ = 0)(1 − p)f₀(xᵢ) + I(θᵢ = 1)pf₁(xᵢ)][λI(θᵢ = 0)δᵢ + I(θᵢ = 1)(1 − δᵢ)] = λ(1 − p)f₀(xᵢ)δᵢ + pf₁(xᵢ)(1 − δᵢ); then the posterior risk is

E_{θ|X}L(θ, δ) = ∑_{j=1}^{2^m} L(θ = I^(j), δ) ∏ᵢ P_{θᵢ|Xᵢ}(θᵢ = Iᵢ^(j) | xᵢ)
= (1/m)∑ᵢ ∑_{θᵢ=0}^{1} P_{θᵢ|Xᵢ}(θᵢ|xᵢ){λI(θᵢ = 0)δᵢ + I(θᵢ = 1)(1 − δᵢ)}
= (1/m)∑ᵢ pf₁(xᵢ)/f(xᵢ) + (1/m)∑ᵢ {[λ(1 − p)f₀(xᵢ) − pf₁(xᵢ)]/f(xᵢ)}δᵢ.

The Bayes rule is the simple rule δ^π(x) = [δ₁^π(x₁), ..., δ_m^π(x_m)], where δᵢ^π = I{λ(1 − p)f₀(xᵢ) < pf₁(xᵢ)}. The expected misclassification risk of δ^π is

E E_{θ|X}L(θ, δ^π) = E[pf₁(X)/f(X)] + E{([λ(1 − p)f₀(X) − pf₁(X)]/f(X)) I[λ(1 − p)f₀(X) < pf₁(X)]}
= p + ∫_{λ(1−p)f₀<pf₁} [λ(1 − p)f₀(x) − pf₁(x)] dx.
Proof of Theorem 3
Observe that for each t, if p_(i) ≤ t < p_(i+1), then the number of rejections is i. In addition, the ratio (1 − p̂)t/G_m(t) increases in t over any range on which the number of rejections is constant, which implies that if t = p_(i) does not satisfy the constraint, then neither does any choice of t between p_(i) and p_(i+1). Thus it is sufficient to investigate thresholds equal to one of the p_(i)'s, for which the ratio (1 − p̂)t/G_m(t) becomes (m/i)(1 − p̂)p_(i). Therefore, the plug-in threshold is given by sup{p_(i) : (m/i)(1 − p̂)p_(i) ≤ α}, which is equivalent to choosing the largest i such that p_(i) ≤ iα/[m(1 − p̂)].

Proof of Theorem 4
To prove this theorem, we need the following lemmas.
Lemma A.1. Let p̂, f̂, and f̂₀ be estimates such that p̂ →_p p, E‖f̂ − f‖² → 0, and E‖f̂₀ − f₀‖² → 0. Then E‖T̂_OR − T_OR‖² → 0.
Proof. Note that f is continuous and positive on the real line; then there exists K₁ = [−M, M] such that Pr(z ∈ K₁ᶜ) → 0 as M → ∞. Let inf_{z∈K₁} f(z) = l₀ and A_ε^f = {z : |f̂(z) − f(z)| ≥ l₀/2}. Note that E‖f̂ − f‖² ≥ (l₀/2)²P(A_ε^f); then Pr(A_ε^f) → 0. Thus f and f̂ are bounded below by a positive number for large m, except on an event of low probability. Similar arguments apply to the upper bounds of f and f̂, as well as to the upper and lower bounds of f₀ and f̂₀. Therefore we conclude that f₀, f̂₀, f, and f̂ are all bounded in an interval [l_a, l_b], 0 < l_a < l_b < ∞, for large m, except on an event, say A_ε, of algebraically low probability; that is, 0 < l_a < inf_{z∈A_εᶜ} min{f₀, f̂₀, f, f̂} ≤ sup_{z∈A_εᶜ} max{f₀, f̂₀, f, f̂} < l_b < ∞.

Noting that T̂_OR − T_OR = [f₀f̂(p − p̂) + (1 − p̂)f(f̂₀ − f₀) + (1 − p̂)f₀(f − f̂)]/(f̂f), we conclude that (T̂_OR − T_OR)² ≤ c₁(p̂ − p)² + c₂(f̂₀ − f₀)² + c₃(f̂ − f)² on A_εᶜ. It is easy to see that ‖T̂_OR − T_OR‖² is bounded, say by L. Then E‖T̂_OR − T_OR‖² ≤ L Pr(A_ε) + c₁E(p̂ − p)² + c₂E‖f̂ − f‖² + c₃E‖f̂₀ − f₀‖². Note that E(p̂ − p)² → 0 by lemma 2.2 of van der Vaart (1998), and E‖f̂ − f‖² → 0 and E‖f̂₀ − f₀‖² → 0 by assumption; then for a given ε > 0, there exists M ∈ Z⁺ such that we can find A_ε with Pr(A_ε) < ε/(4L) and, at the same time, E(p̂ − p)² < ε/(4c₁), E‖f̂ − f‖² < ε/(4c₂), and E‖f̂₀ − f₀‖² < ε/(4c₃) for all m ≥ M. Consequently, E‖T̂_OR − T_OR‖² < ε for m ≥ M, and the result follows.
Lemma A.2. E‖T̂_OR − T_OR‖² → 0 implies that T̂_OR(Zᵢ) →_p T_OR(Zᵢ).

Proof. Let A_ε = {z : |T̂_OR(z) − T_OR(z)| ≥ ε}. Then ε² Pr(A_ε) ≤ E‖T̂_OR − T_OR‖² → 0; consequently, Pr(A_ε) → 0. Therefore Pr(|T̂_OR(Zᵢ) − T_OR(Zᵢ)| ≥ ε) ≤ Pr(A_ε) + Pr({|T̂_OR(Zᵢ) − T_OR(Zᵢ)| ≥ ε} ∩ A_εᶜ) = Pr(A_ε) → 0, and the result follows.
Lemma A.3. For α < t < 1, E[1{T̂_OR(Zᵢ) < t}T̂_OR(Zᵢ)] → E[1{T_OR(Zᵢ) < t}T_OR(Zᵢ)].

Proof. T̂_OR(Zᵢ) →_p T_OR(Zᵢ) (Lemma A.2) implies that T̂_OR(Zᵢ) →_d T_OR(Zᵢ). Let h(x) = 1{x < t}x; then h(x) is bounded and continuous for x < t. By lemma 2.2 of van der Vaart (1998), E[h(T̂_OR(Zᵢ))] → E[h(T_OR(Zᵢ))], and the result follows.
Lemma A.4. Construct the empirical distributions Ĝ_OR(t) = (1/m)∑_{i=1}^m 1{T̂_OR(zᵢ) ≤ t} and Ĝ⁰_OR(t) = (1/m)∑_{i=1}^m 1{T̂_OR(zᵢ) ≤ t}T̂_OR(zᵢ), and define Q̂_OR(t) = Ĝ⁰_OR(t)/Ĝ_OR(t), the estimated mFDR. Then for α < t < 1, Q̂_OR(t) →_p Q_OR(t).

Proof. Let ρ_m = cov[1{T̂_OR(Zᵢ) ≤ t}, 1{T̂_OR(Z_j) ≤ t}], where Zᵢ and Z_j are two independent random variables from the mixture distribution F. T̂_OR(Zᵢ) →_p T_OR(Zᵢ) (Lemma A.2) implies that E[1{T̂_OR(Zᵢ) ≤ t}1{T̂_OR(Z_j) ≤ t}] → E[1{T_OR(Zᵢ) ≤ t}1{T_OR(Z_j) ≤ t}]. Therefore ρ_m → cov[1{T_OR(Zᵢ) ≤ t}, 1{T_OR(Z_j) ≤ t}] = 0. Let σ²_m = var(1{T̂_OR(Z₁) ≤ t}); then σ²_m ≤ E[1{T̂_OR(Z₁) ≤ t}²] ≤ 1.

Let μ_m = Pr{T̂_OR(Zᵢ) ≤ t} and S_m = ∑_{i=1}^m 1{T̂_OR(zᵢ) ≤ t}. Note that var(S_m)/m² = (1/m)σ²_m + [(m − 1)/m]ρ_m ≤ 1/m + ρ_m → 0. According to the weak law of large numbers for triangular arrays, we have (1/m)∑_{i=1}^m 1{T̂_OR(zᵢ) ≤ t} − μ_m →_p 0. Also note that μ_m = Pr{T̂_OR(Zᵢ) ≤ t} → Pr{T_OR(Zᵢ) ≤ t} = G_OR(t); we conclude that Ĝ_OR(t) →_p G_OR(t). Next, let v_m = E[1{T̂_OR(Zᵢ) ≤ t}T̂_OR(Zᵢ)]. Similarly, we can prove that Ĝ⁰_OR(t) − v_m = (1/m)∑_{i=1}^m 1{T̂_OR(zᵢ) ≤ t}T̂_OR(zᵢ) − v_m →_p 0. Note that by Lemma A.3 we have E[1{T̂_OR(Zᵢ) < t}T̂_OR(Zᵢ)] → E[1{T_OR(Zᵢ) < t}T_OR(Zᵢ)]; thus v_m → E[1{T_OR(Zᵢ) < t}T_OR(Zᵢ)] = (1 − p)∫1{T_OR(z) < t}f₀(z) dz = (1 − p)G⁰_OR(t), and Ĝ⁰_OR(t) →_p (1 − p)G⁰_OR(t). Finally, note that for t > α, G_OR(t) is bounded away from 0, and we obtain Q̂_OR(t) →_p (1 − p)G⁰_OR(t)/G_OR(t) = Q_OR(t).
Lemma A.5. Define the estimate of the plug-in threshold λ̂_OR = sup{t ∈ (0,1) : Q̂_OR(t) ≤ α}. If Q̂_OR(t) →_p Q_OR(t), then λ̂_OR →_p λ_OR.
Proof. Note that Q̂_OR(t) is not continuous, and the consistency of Q̂_OR(t) does not necessarily imply the consistency of λ̂_OR. Thus we first construct an envelope for Q̂_OR(t) using two continuous random functions Q̂⁻_OR(t) and Q̂⁺_OR(t) such that, for L̂fdr_(k) < t < L̂fdr_(k+1),

Q̂⁻_OR(t) = Q̂_OR(L̂fdr_(k−1)) [L̂fdr_(k+1) − t]/[L̂fdr_(k+1) − L̂fdr_(k)] + Q̂_OR(L̂fdr_(k)) [t − L̂fdr_(k)]/[L̂fdr_(k+1) − L̂fdr_(k)]

and

Q̂⁺_OR(t) = Q̂_OR(L̂fdr_(k)) [L̂fdr_(k+1) − t]/[L̂fdr_(k+1) − L̂fdr_(k)] + Q̂_OR(L̂fdr_(k+1)) [t − L̂fdr_(k)]/[L̂fdr_(k+1) − L̂fdr_(k)].

Noting that Q̂_OR(L̂fdr_(k+1)) − Q̂_OR(L̂fdr_(k)) = [kL̂fdr_(k+1) − ∑_{i=1}^k L̂fdr_(i)]/[k(k + 1)] > 0, we have Q̂⁻_OR(t) ≤ Q̂_OR(t) ≤ Q̂⁺_OR(t), with both Q̂⁻_OR(t) and Q̂⁺_OR(t) strictly increasing in t. Let λ̂⁻_OR = sup{t ∈ (0,1) : Q̂⁻_OR(t) ≤ α} and λ̂⁺_OR = sup{t ∈ (0,1) : Q̂⁺_OR(t) ≤ α}; then λ̂⁺_OR ≤ λ̂_OR ≤ λ̂⁻_OR.

We claim that λ̂⁻_OR →_p λ_OR. If not, then there must exist ε₀ and η₀ such that, for all M ∈ Z⁺, Pr(|λ̂⁻_OR − λ_OR| > ε₀) > 2η₀ holds for some m ≥ M. Suppose that Pr(λ̂⁻_OR > λ_OR + ε₀) > 2η₀. It is easy to see that Q̂⁻_OR →_p Q_OR, because Q̂⁻_OR(t) − Q̂⁺_OR(t) →_a.s. 0, Q̂⁻_OR(t) ≤ Q̂_OR(t) ≤ Q̂⁺_OR(t), and Q̂_OR(t) →_p Q_OR(t). Thus there exists M ∈ Z⁺ such that Pr(|Q̂⁻_OR(λ_OR + ε₀) − Q_OR(λ_OR + ε₀)| < δ₀) > 1 − η₀ for all m ≥ M, where we let 2δ₀ = Q_OR(λ_OR + ε₀) − α. Now, for this choice of m, there exists an event K₁ₘ such that Pr(K₁ₘ) ≥ 1 − η₀, and for all outcomes ω ∈ K₁ₘ, |Q̂⁻_OR(λ_OR + ε₀) − Q_OR(λ_OR + ε₀)| < δ₀. At the same time, there exists an event K₂ₘ such that Pr(K₂ₘ) ≥ 2η₀, and for all outcomes ω ∈ K₂ₘ, λ̂⁻_OR > λ_OR + ε₀. Let Kₘ = K₁ₘ ∩ K₂ₘ; then Pr(Kₘ) = Pr(K₁ₘ) + Pr(K₂ₘ) − Pr(K₁ₘ ∪ K₂ₘ) ≥ Pr(K₁ₘ) + Pr(K₂ₘ) − 1 ≥ η₀. That is, Kₘ has positive measure. Then for all outcomes in Kₘ, Q̂⁻_OR(t) is continuous and strictly increasing, and we have α = Q̂⁻_OR(λ̂⁻_OR) > Q̂⁻_OR(λ_OR + ε₀) > Q_OR(λ_OR + ε₀) − δ₀ = α + δ₀. This is a contradiction. Therefore we must have λ̂⁻_OR →_p λ_OR. Similarly, we can prove λ̂⁺_OR →_p λ_OR. Because Q̂⁻_OR(t) − Q̂⁺_OR(t) →_a.s. 0 implies λ̂⁻_OR − λ̂⁺_OR →_a.s. 0, the result follows by noting that λ̂⁺_OR ≤ λ̂_OR ≤ λ̂⁻_OR.
Proof of Theorem 4. Implementing the adaptive procedure (10) is equivalent to choosing T̂_OR(Zᵢ) as the test statistic and λ̂_OR as the threshold. The mFDR of the decision rule δ(T̂_OR, λ̂_OR) is mFDR = E(N₁₀)/E(R) = [(1 − p)P_{H₀}(T̂_OR < λ̂_OR)]/[P(T̂_OR < λ̂_OR)]. Noting that T̂_OR →_p T_OR by Lemma A.2 and λ̂_OR →_p λ_OR by Lemma A.5, it follows that (1 − p)P_{H₀}(T̂_OR < λ̂_OR) → (1 − p)P_{H₀}(T_OR < λ_OR) = (1 − p)G⁰_OR(λ_OR) and P(T̂_OR < λ̂_OR) → P(T_OR < λ_OR) = G_OR(λ_OR). Noting that P(T_OR < λ_OR) is bounded away from 0, we have mFDR → (1 − p)G⁰_OR(λ_OR)/G_OR(λ_OR) = Q_OR(λ_OR) = α.
Proof of Theorem 5
Similar to the proof of Theorem 4, the mFNR of the adaptive procedure (10) is mFNR = E(N₀₁)/E(S) = p P_{H₁}(T̂_OR > λ̂_OR)/P(T̂_OR > λ̂_OR). By Lemmas A.2 and A.5, we have P_{H₁}(T̂_OR > λ̂_OR) → P_{H₁}(T_OR > λ_OR) = G̃¹_OR(λ_OR) and P(T̂_OR > λ̂_OR) → P(T_OR > λ_OR) = G̃_OR(λ_OR). Noting that G̃_OR(λ_OR) is bounded away from 0, we obtain mFNR → p G̃¹_OR(λ_OR)/G̃_OR(λ_OR) = Q̃_OR(λ_OR).
[Received November 2006 Revised March 2007]
REFERENCES
Benjamini, Y., and Hochberg, Y. (1995), "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing," Journal of the Royal Statistical Society, Ser. B, 57, 289–300.
——— (2000), "On the Adaptive Control of the False Discovery Rate in Multiple Testing With Independent Statistics," Journal of Educational and Behavioral Statistics, 25, 60–83.
Copas, J. (1974), "On Symmetric Compound Decision Rules for Dichotomies," The Annals of Statistics, 2, 199–204.
Efron, B. (2004a), "Local False Discovery Rate," technical report, Stanford University, Dept. of Statistics, available at http://www-stat.stanford.edu/~brad/papers/False.pdf.
——— (2004b), "Large-Scale Simultaneous Hypothesis Testing: The Choice of a Null Hypothesis," Journal of the American Statistical Association, 99, 96–104.
——— (2006), "Correlation and Large-Scale Simultaneous Testing," technical report, Stanford University, Dept. of Statistics, available at http://www-stat.stanford.edu/~brad/papers/Correlation-2006.pdf.
Efron, B., Tibshirani, R., Storey, J., and Tusher, V. (2001), "Empirical Bayes Analysis of a Microarray Experiment," Journal of the American Statistical Association, 96, 1151–1160.
Foster, D., and Stine, R. (2006), "Alpha Investing: A New Multiple Hypothesis Testing Procedure That Controls mFDR," technical report, University of Pennsylvania, Dept. of Statistics.
Genovese, C., and Wasserman, L. (2002), "Operating Characteristic and Extensions of the False Discovery Rate Procedure," Journal of the Royal Statistical Society, Ser. B, 64, 499–517.
——— (2004), "A Stochastic Process Approach to False Discovery Control," The Annals of Statistics, 32, 1035–1061.
Genovese, C., Roeder, K., and Wasserman, L. (2005), "False Discovery Control With p-Value Weighting," Biometrika, 93, 509–524.
Jin, J., and Cai, T. (2007), "Estimating the Null and the Proportion of Non-Null Effects in Large-Scale Multiple Comparisons," Journal of the American Statistical Association, 102, 495–506.
Magder, L., and Zeger, S. (1996), "A Smooth Nonparametric Estimate of a Mixing Distribution Using Mixtures of Gaussians," Journal of the American Statistical Association, 91, 1141–1151.
Robbins, H. (1951), "Asymptotically Subminimax Solutions of Compound Statistical Decision Problems," in Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA: University of California Press, pp. 131–148.
Rubin, D., Dudoit, S., and van der Laan, M. (2006), "A Method to Increase the Power of Multiple Testing Procedures Through Sample Splitting," working paper, University of California, Berkeley, Dept. of Biostatistics, available at http://www.bepress.com/ucbbiostat/paper171.
Spjøtvoll, E. (1972), "On the Optimality of Some Multiple Comparison Procedures," The Annals of Mathematical Statistics, 43, 398–411.
Storey, J. (2002), "A Direct Approach to False Discovery Rates," Journal of the Royal Statistical Society, Ser. B, 64, 479–498.
——— (2003), "The Positive False Discovery Rate: A Bayesian Interpretation and the q-Value," The Annals of Statistics, 31, 2013–2035.
——— (2007), "The Optimal Discovery Procedure: A New Approach to Simultaneous Significance Testing," Journal of the Royal Statistical Society, Ser. B, 69, 347–368.
van der Laan, M., Dudoit, S., and Pollard, K. (2004), "Multiple Testing. Part III. Augmentation Procedures for Control of the Generalized Family-Wise Error Rate and Tail Probabilities for the Proportion of False Positives," technical report, University of California, Berkeley, Dept. of Biostatistics.
van der Vaart, A. (1998), Asymptotic Statistics, Cambridge, U.K.: Cambridge University Press.
van't Wout, A., Lehrman, G., Mikheeva, S., O'Keeffe, G., Katze, M., Bumgarner, R., Geiss, G., and Mullins, J. (2003), "Cellular Gene Expression Upon Human Immunodeficiency Virus Type 1 Infection of CD4+ T-Cell Lines," Journal of Virology, 77, 1392–1402.
Wasserman, L., and Roeder, K. (2006), "Weighted Hypothesis Testing," technical report, Carnegie Mellon University, Dept. of Statistics, available at http://www.arxiv.org/abs/math.ST/0604172.
Table 1. Classification of tested hypotheses

            Claimed nonsignificant   Claimed significant   Total
Null        N00                      N10                   m0
Nonnull     N01                      N11                   m1
Total       S                        R                     m
as a compound decision problem, our results show that individual p values, although appropriate for testing a single hypothesis, fail to serve as the fundamental building block in large-scale multiple testing.

In addition to the theoretical properties, we study the numerical performance of the z value–based adaptive procedure using both simulated and real data. Simulations reported in Section 5 demonstrate that our z value procedure yields a lower mFNR than the p value–based methods at the same mFDR level. The gain in efficiency is substantial in many situations, particularly when the alternative is asymmetric about the null. We then apply our procedure to the analysis of microarray data from a human immunodeficiency virus (HIV) study in Section 6 and find that, at the same FDR level, it rejects more hypotheses than the p value–based procedures. The type of asymmetry seen in the HIV data is commonly observed in microarray studies, suggesting that our new z value–based adaptive procedure can aid in the discovery of additional meaningful findings in many scientific applications.

The article is organized as follows. In Section 2 we develop a compound decision-theoretic framework and study various aspects of weighted classification and multiple-testing problems in this setting. In Sections 3 and 4 we propose oracle and adaptive testing procedures based on the z values for FDR control. In Section 5 we present numerical examples showing that our new z value–based procedure is more efficient than the traditional p value–based procedures. In Section 6 we illustrate our adaptive procedure with the analysis of microarray data from an HIV study. We conclude the article with a discussion of the results and some open problems. Proofs are provided in the Appendix.
2 COMPOUND DECISION PROBLEM
Let 𝒳 be the sample space and let Θ be the parameter space. Suppose that x = (x₁, ..., x_m) ∈ 𝒳 are observed and that we are interested in inference about the unknown θ = (θ₁, ..., θ_m) ∈ Θ based on x. This involves solving m decision problems simultaneously and is called a compound decision problem. Let δ = (δ₁, ..., δ_m) be a general decision rule. Then δ is simple if δᵢ is a function only of xᵢ, that is, δᵢ(x) = δᵢ(xᵢ). The simple rules correspond to solving the m component problems separately. In contrast, δ is compound if δᵢ depends on the other xⱼ's,
Table 2. Summary of the data notation

Original data                              t statistic   z value   p value
X1, ..., Xn1; Y1, ..., Yn2                 T             Z         P
X11, ..., Xn1,1; Y11, ..., Yn2,1           T1            Z1        P1
X12, ..., Xn1,2; Y12, ..., Yn2,2           T2            Z2        P2
...
X1m, ..., Xn1,m; Y1m, ..., Yn2,m           Tm            Zm        Pm
Figure 1. A comparison of the p value and z value oracle procedures with the mFDR level at .10. The test statistics have the normal mixture distribution .8N(0,1) + p₁N(−3,1) + (.2 − p₁)N(3,1). The mFNRs of the two procedures are plotted as a function of p₁.
j ≠ i. A decision rule δ is symmetric if δ(τ(x)) = τ(δ(x)) for all permutation operators τ.
Robbins (1951) considered a compound decision problem where Xi ∼ N(θi, 1), i = 1, ..., n, are independent normal variables with mean θi = 1 or −1. The goal is to classify each θi under the usual misclassification error. He showed that the unique minimax rule R: δi = sgn(xi) does not perform well in this compound decision problem by exhibiting a compound rule R*: δi = sgn(xi − (1/2) log[(1 − x̄)/(1 + x̄)]) that substantially outperforms R when p, the proportion of θi = 1, approaches 0 or 1, with only slightly higher risk near p = .5.
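Robbins' comparison can be illustrated with a short simulation (a sketch, not from the paper): the simple minimax rule classifies each θi by sgn(xi) alone, whereas the compound rule shifts the cutoff by a function of the grand mean x̄, pooling information across all component problems.

```python
import numpy as np

rng = np.random.default_rng(0)

def simple_rule(x):
    # Minimax rule R: classify each theta_i by the sign of x_i alone.
    return np.sign(x)

def compound_rule(x):
    # Robbins' rule R*: shift the cutoff by a function of the grand
    # mean, which pools information across all component problems.
    xbar = np.clip(np.mean(x), -0.99, 0.99)  # keep the log argument positive
    return np.sign(x - 0.5 * np.log((1 - xbar) / (1 + xbar)))

def risks(p, m=200_000):
    theta = np.where(rng.random(m) < p, 1.0, -1.0)
    x = theta + rng.standard_normal(m)
    return (np.mean(simple_rule(x) != theta),
            np.mean(compound_rule(x) != theta))

# With p far from 1/2, the compound rule is far more accurate.
r_s, r_c = risks(p=0.95)
```

The simple rule's risk stays near Φ(−1) ≈ .159 regardless of p, while the compound rule exploits the skewed proportion.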
Let θ1, ..., θm be independent Bernoulli(p) variables, and let Xi be generated as

Xi | θi ∼ (1 − θi)F0 + θiF1. (1)

The variables Xi are observed, and the variables θi are unobserved. The marginal cumulative distribution function (cdf) of X is the mixture distribution F(x) = (1 − p)F0(x) + pF1(x), and the probability density function (pdf) is f(x) = (1 − p)f0(x) + pf1(x), where f is assumed to be continuous and positive on the real line. In statistical and scientific applications, the goal is to separate the nonnull cases (θi = 1) from the null cases (θi = 0), which can be formulated either as a weighted classification problem or as a multiple-testing problem. The solution to both problems can be represented by a decision rule δ = (δ1, ..., δm) ∈ I = {0, 1}^m.
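Sampling from the two-group model (1) is straightforward; a minimal sketch follows, with the illustrative choices F0 = N(0,1) and F1 = N(2,1) (these densities are assumptions, not the paper's).

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_mixture(m, p, draw_null, draw_alt):
    """Draw (theta_i, x_i), i = 1, ..., m, from model (1):
    theta_i ~ Bernoulli(p); x_i comes from F0 when theta_i = 0
    and from F1 when theta_i = 1."""
    theta = (rng.random(m) < p).astype(int)
    x = np.where(theta == 1, draw_alt(m), draw_null(m))
    return theta, x

theta, x = sample_mixture(
    10_000, 0.2,
    draw_null=lambda m: rng.standard_normal(m),      # F0 = N(0,1)
    draw_alt=lambda m: 2 + rng.standard_normal(m))   # F1 = N(2,1)
```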
In Section 6 we consider a problem in which we are interested in identifying a set of genes that are differentially expressed between HIV-positive patients and HIV-negative controls. This naturally gives rise to a multiple-testing problem in which the goal is to find as many true positives as possible while controlling the proportion of false positives among all rejections within level α. As in the work of Genovese and Wasserman (2002), we define the marginal FNR as mFNR = E(N01)/E(S), the proportion of the expected number of nonnulls among the expected number of nonrejections. The multiple-testing problem is then to find δ that minimizes the mFNR
Sun and Cai Rules for False Discovery Rate Control 903
while controlling the mFDR at level α. On the other hand, in applications it is possible to rate the relative cost of a false positive (type I error) to a false negative (type II error). This naturally gives rise to a weighted classification problem with loss function

Lλ(θ, δ) = (1/m) Σi [λI(θi = 0)δi + I(θi = 1)(1 − δi)], (2)

where λ > 0 is the relative weight for a false positive. The weighted classification problem is then to find δ that minimizes the classification risk E[Lλ(θ, δ)]. We develop a compound decision theory framework for inference about the mixture model (1) and make connections between multiple testing and weighted classification.
We consider δ(x) ∈ {0, 1}^m defined in terms of a statistic T(x) = [Ti(x), i = 1, ..., m] and a threshold c = (c1, ..., cm) such that

δ(x) = I(T < c) = [I(Ti(x) < ci), i = 1, ..., m]. (3)

Then δ can be used in both weighted classification and multiple-testing problems. It is easy to verify that δ given in (3) is symmetric if Ti(x) = T(xi) and c = c1 (a constant vector), where T is a function. In this section we allow T to depend on unknown quantities, such as the proportion of nonnulls and/or the distribution of Xi. We assume that T(Xi) ∼ G = (1 − p)G0 + pG1, where G0 and G1 are the cdf's of T(Xi) under the null and the alternative, respectively. The pdf of T(Xi) is g = (1 − p)g0 + pg1. Let 𝒯 be the collection of functions such that for any T ∈ 𝒯, T(Xi) has monotone likelihood ratio (MLR); that is,

g1(t)/g0(t) is decreasing in t. (4)

We call 𝒯 the SMLR class. Let x = (x1, ..., xm) be a random sample from the mixture distribution F, and let δ(T, c) = {I[T(xi) < c], i = 1, ..., m}. We call δ(T, c) an SMLR decision rule if T ∈ 𝒯, and we let Ds denote the collection of all SMLR decision rules.
Suppose that T(Xi) = p(Xi), the p value of an individual test based on the observation Xi. Assume that p(Xi) ∼ G = (1 − p)G0 + pG1, where G0 and G1 are the p value distributions under the null and the alternative. Assume that G1(t) is twice differentiable. Note that g0(t) = G′0(t) = 1; the assumption p(·) ∈ 𝒯 implies that G″1(t) = g′1(t) < 0, that is, G1(t) is concave. Therefore, the SMLR assumption can be viewed as a generalized version of the assumption of Storey (2002) and Genovese and Wasserman (2002, 2004) that the p value distribution under the alternative is concave. The following proposition shows that the SMLR assumption is a desirable condition in both multiple-testing and weighted classification problems.
Proposition 1. Let θi, i = 1, ..., m, be iid Bernoulli(p) variables, and let xi | θi, i = 1, ..., m, be independent observations from the model (1). Suppose that T ∈ 𝒯; then the implementation of δ(T, c) = {I[T(xi) < c], i = 1, ..., m} in both the weighted classification and the multiple-testing problems implies that (a) P(θi = 1 | T(Xi) ≤ c) is monotonically decreasing in the threshold c; (b) the mFDR is monotonically increasing in c and in the expected number of rejections r; (c) the mFNR is monotonically decreasing in c and r; (d) the mFNR is monotonically decreasing in the mFDR; and (e) in the classification problem, c (and thus r) is monotonically decreasing in the classification weight λ.
The following theorem makes a connection between a multiple-testing problem and a weighted classification problem. In particular, the former can be solved by solving the latter with an appropriately chosen λ.
Theorem 1. Let θi, i = 1, ..., m, be iid Bernoulli(p) variables, and let xi | θi, i = 1, ..., m, be independent observations from the mixture model (1). Let Λ ∈ 𝒯, and suppose that the classification risk with the loss function (2) is minimized by δλ[Λ, c(λ)] = (δλ1, ..., δλm), where δλi = I{Λ(xi) < c(λ)}. Then, for any given mFDR level α in a multiple-testing problem, there exists a unique λ(α) that defines a weighted classification problem. The optimal solution to that classification problem, δλ(α)[Λ, c(λ(α))], is also optimal in the multiple-testing problem in the sense that it controls the mFDR at level α with the smallest mFNR among all decision rules in Ds.
The next step is to develop such an optimal rule δλ[Λ, c(λ)] as stated in Theorem 1. We first study an ideal setup in which there is an oracle that knows p, f0, and f1. Then the oracle rule in this weighted classification problem gives the optimal choice of δ.
Theorem 2 (The oracle rule for weighted classification). Let θi, i = 1, ..., m, be iid Bernoulli(p) variables, and let xi | θi, i = 1, ..., m, be independent observations from the mixture model (1). Suppose that p, f0, and f1 are known. Then the classification risk E[Lλ(θ, δ)], with Lλ given in (2), is minimized by δλ(Λ, 1/λ) = (δ1, ..., δm), where

δi = I[Λ(xi) = (1 − p)f0(xi)/(pf1(xi)) < 1/λ], i = 1, ..., m. (5)

The minimum classification risk is R*λ = infδ E[Lλ(θ, δ)] = p + ∫_K [λ(1 − p)f0(x) − pf1(x)] dx, where K = {x : λ(1 − p)f0(x) < pf1(x)}.

Remark 1. Let the likelihood ratio (LR) be defined as Li = f0(xi)/f1(xi). A compound decision rule is said to be ordered if, for almost all x, Li > Lj and δi(x) > 0 imply that δj(x) = 1. Copas (1974) showed that if a symmetric compound decision rule for dichotomies is admissible, then it is ordered. Note that (5) is a Bayes rule and so is symmetric, ordered, and admissible. But because the p value–based procedure δ[p(·), c] = {I[p(Xi) < c], i = 1, ..., m} is symmetric but not ordered by the LR, it is inadmissible in the compound decision problem.
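The oracle rule (5) takes only a few lines to evaluate. In the sketch below, the densities f0 = N(0,1) and f1 = N(μ1, 1) are illustrative assumptions, not part of the theorem.

```python
import math

def phi(x, mu=0.0):
    # Normal density with unit variance centered at mu.
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def oracle_classify(x, p, lam, mu1):
    """Oracle rule (5): classify x as nonnull when
    Lambda(x) = (1 - p) f0(x) / (p f1(x)) < 1/lam,
    with f0 = N(0,1) and f1 = N(mu1, 1) as illustrative densities."""
    Lam = (1 - p) * phi(x) / (p * phi(x, mu1))
    return 1 if Lam < 1.0 / lam else 0

# With p = .2, f1 = N(2,1), lam = 1, the rule reduces to
# 4*exp(2 - 2x) < 1, i.e., classify as nonnull when x > 1 + log(4)/2.
```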
Remark 2. In practice, some of p, f0, and f1 are unknown but estimable. Then we can estimate the unknown quantities and use the rule δλ(Λ̂, 1/λ). The subminimax rule δi = sgn(xi − (1/2) log[(1 − x̄)/(1 + x̄)]) given by Robbins (1951) is recovered by letting λ = 1, f0 = φ(· + 1), f1 = φ(· − 1), and p̂ = (1 + x̄)/2 in (5), where φ is the pdf of a standard normal.
Remark 3. Suppose that two weights λ1 < λ2 are chosen in the loss (2). Let K1 = {x : λ1(1 − p)f0(x) < pf1(x)} and K2 = {x : λ2(1 − p)f0(x) < pf1(x)}. Then the classification risk R*λ is increasing in λ, because R*λ1 − R*λ2 = ∫_{K1\K2} [λ1(1 − p)f0(x) − pf1(x)] dx + ∫_{K2} [(λ1 − λ2)(1 − p)f0(x)] dx < 0. Also, it is easy to see that K1 ⊃ K2. Thus the expected number of subjects classified to the nonnull population is decreasing in λ, demonstrating that (e) in Proposition 1 is satisfied by the classification rule δλ(Λ, 1/λ).
904 Journal of the American Statistical Association September 2007
3 THE ORACLE PROCEDURES FOR mFDR CONTROL

We have shown that δλ(Λ, 1/λ) = [I{Λ(x1) < 1/λ}, ..., I{Λ(xm) < 1/λ}] is the oracle rule in the weighted classification problem. Theorem 1 implies that the optimal rule for the multiple-testing problem is also of the form δλ(α)[Λ, 1/λ(α)] if Λ ∈ 𝒯, although the cutoff 1/λ(α) is not obvious. Note that Λ(x) = Lfdr(x)/[1 − Lfdr(x)] is monotonically increasing in Lfdr(x), where Lfdr(·) = (1 − p)f0(·)/f(·) is the local FDR (Lfdr) introduced by Efron, Tibshirani, Storey, and Tusher (2001) and Efron (2004a), so the optimal rule for mFDR control is of the form δ[Lfdr(·), c] = {I[Lfdr(xi) < c], i = 1, ..., m}. Lfdr has been widely used in the FDR literature to provide a Bayesian version of the frequentist FDR measure and to interpret results for individual cases (Efron 2004a). We "rediscover" it here as the optimal (oracle) statistic in the multiple-testing problem, in the sense that the thresholding rule based on Lfdr(X) controls the mFDR with the smallest mFNR.
The Lfdr statistic is defined in terms of the z values, which can be converted from other test statistics, including the t and chi-squared statistics, using inverse-probability transformations. Note that p is a global parameter. Therefore, the expression Lfdr(z) = (1 − p)f0(z)/f(z) implies that we actually rank the relative importance of the observations according to their LRs, and the rankings are generally different from the rankings of the p values unless the alternative distribution is symmetric about the null. An interesting consequence of using the Lfdr statistic in multiple testing is that an observation located farther from the null may have a lower significance level. Therefore, it is possible that the test accepts a more extreme observation while rejecting a less extreme observation, implying that the rejection region is asymmetric. This is not possible for a testing procedure based on the individual p values, which has a rejection region always symmetric about the null. This phenomenon is illustrated in Figure 2 at the end of this section.
Setting the threshold for the test statistics has been the focus of the FDR literature (see, e.g., Benjamini and Hochberg 1995; Genovese and Wasserman 2004). Consider an ideal situation in which we assume that an oracle knows the true underlying distribution of the test statistics. Note that the SMLR assumption implies that the mFNR is decreasing in the mFDR; therefore, the oracle's response to such a thresholding problem is to "spend" all of the mFDR to minimize the mFNR. Besides providing a target for evaluating different multiple-testing procedures, the oracle procedure also sheds light on the development of the adaptive procedure that we propose in Section 4.
3.1 The Oracle Procedure Based on the p Values
Let G1(t) be the distribution of the p values under the alternative, and let p be the proportion of nonnulls. We assume that G1 is concave; then, according to Genovese and Wasserman (2002), the oracle procedure thresholds the p value at u*, the solution to the equation G1(u)/u = (1/p − 1)(1/α − 1). Here u* is the optimal cutoff for a concave G1(t), in the sense that the rule δ[p(·), u*] = {I[Pi < u*], i = 1, ..., m} yields the smallest mFNR among all p value–based procedures that control the mFDR at level α. Let Ḡ(t) = 1 − G(t). The resulting mFDR of (1 − p)u*/[(1 − p)u* + pG1(u*)] is just α, and the mFNR is given by pḠ1(u*)/[pḠ1(u*) + (1 − p)(1 − u*)].
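The cutoff u* can be computed numerically. The sketch below (an illustration under assumptions: a single two-sided alternative N(μ, 1) stands in for G1) solves G1(u)/u = (1/p − 1)(1/α − 1) by bisection, using the fact that G1(u)/u is decreasing in u when G1 is concave.

```python
import math

def Phi(x):
    # Standard normal cdf via the error function.
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def Phi_inv(q):
    # Standard normal quantile by bisection (illustration-grade accuracy).
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if Phi(mid) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def G1(t, mu=3.0):
    # p value cdf under a two-sided alternative N(mu, 1):
    # P(|Z| > z_{t/2}), with z_{t/2} = Phi^{-1}(1 - t/2).
    z = Phi_inv(1 - t / 2)
    return (1 - Phi(z - mu)) + (1 - Phi(z + mu))

def oracle_pvalue_cutoff(p, alpha, mu=3.0):
    # Solve G1(u)/u = (1/p - 1)(1/alpha - 1); the ratio is decreasing
    # in u for concave G1, so bisection applies.
    target = (1 / p - 1) * (1 / alpha - 1)
    lo, hi = 1e-10, 1 - 1e-10
    for _ in range(100):
        mid = (lo + hi) / 2
        if G1(mid, mu) / mid > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

u = oracle_pvalue_cutoff(p=0.2, alpha=0.10)
mfdr = (1 - 0.2) * u / ((1 - 0.2) * u + 0.2 * G1(u))
```

By construction, the resulting mFDR equals α up to the bisection tolerance.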
Figure 2. Symmetric rejection region versus asymmetric rejection region: (a) comparison of the p value and z value oracle rules; (b) rejection regions. The normal mixture model is .8N(0,1) + p1N(−3,1) + (.2 − p1)N(4,1). Both procedures control the mFDR at .10.
3.2 The Oracle Procedure Based on z Values

We write the Lfdr statistic as TOR(Zi) = (1 − p)f0(Zi)/f(Zi) and call it the oracle test statistic. We assume that TOR(Zi) is distributed with marginal cdf GOR = (1 − p)G0OR + pG1OR, where G0OR and G1OR are the cdf's of TOR(Zi) under the null and the alternative. Set ḠOR(t) = 1 − GOR(t). The mFDR of the oracle rule δ(TOR, λ) is QOR(λ) = (1 − p)G0OR(λ)/GOR(λ). We assume that TOR ∈ 𝒯 satisfies the SMLR assumption; then (b) of Proposition 1 implies that QOR(λ) is increasing in the threshold λ. Therefore, the oracle procedure thresholds TOR(z) at λOR = sup{t ∈ (0, 1) : QOR(t) ≤ α} = Q⁻¹OR(α). Thus the oracle testing rule based on z values is

δ(TOR, λOR) = {I[TOR(z1) < λOR], ..., I[TOR(zm) < λOR]}. (6)

The corresponding mFDR = QOR(λOR) is just α, and the mFNR is

Q̃OR(λOR) = pḠ1OR(λOR)/ḠOR(λOR). (7)
Example 1. We consider a random sample z1, ..., zm from a normal mixture model

f(z) = (1 − p1 − p2)φ(z) + p1φ(z − μ1) + p2φ(z − μ2), (8)

along with the corresponding m tests Hi0: μ = 0 versus Hi1: μ ≠ 0, i = 1, ..., m. Let p = p1 + p2 be the total proportion of nonnulls. We assume that μ1 < 0 and μ2 > 0, so that the rejection region will be two-sided.

The p value distribution under the alternative can be derived as G1(t) = (p1/p)[Φ̃(zt/2 + μ1) + Φ̃(zt/2 − μ1)] + (p2/p)[Φ̃(zt/2 + μ2) + Φ̃(zt/2 − μ2)], where zt/2 = Φ⁻¹(1 − t/2). Thus the optimal p value cutoff u* is the solution to the equation G1(u)/u = (1/p − 1)(1/α − 1), with corresponding mFNR of pḠ1(u*)/[pḠ1(u*) + (1 − p)(1 − u*)].
It is easy to verify that in the normal mixture model (8), the oracle testing rule (6) is equivalent to rejecting the null when zi < cl or zi > cu. The threshold λOR can be obtained using the bisection method. For a given λ, cl and cu can be solved numerically from the equation λ[p1 exp(μ1z − μ1²/2) + p2 exp(μ2z − μ2²/2)] − (1 − λ)(1 − p) = 0. The corresponding mFDR and mFNR can be calculated as

mFDR = (1 − p)[Φ(cl) + Φ̃(cu)] / ((1 − p)[Φ(cl) + Φ̃(cu)] + p1[Φ(cl − μ1) + Φ̃(cu − μ1)] + p2[Φ(cl − μ2) + Φ̃(cu − μ2)])

and

mFNR = (p1[Φ(cu − μ1) − Φ(cl − μ1)] + p2[Φ(cu − μ2) − Φ(cl − μ2)]) / ((1 − p)[Φ(cu) − Φ(cl)] + p1[Φ(cu − μ1) − Φ(cl − μ1)] + p2[Φ(cu − μ2) − Φ(cl − μ2)]),

where Φ(·) is the cdf of a standard normal and Φ̃(·) = 1 − Φ(·).

In this example we choose the mixture model .8N(0,1) + p1N(−3,1) + (.2 − p1)N(4,1). This differs from the model shown in Figure 1, which has alternative means −3 and 3; therefore different patterns are seen. Both oracle procedures control the mFDR at level .10, and we plot the mFNRs of the two procedures as functions of p1 in Figure 2(a). We can see that again the p value oracle procedure is dominated by the z value oracle procedure. We set p1 = .15 and characterize the rejection regions of the two oracle procedures in Figure 2(b). Some calculations show that the rejection region of the p value oracle procedure is |zi| > 2.27 with mFNR = .046, whereas the rejection region of the z value oracle procedure is zi < −1.97 and zi > 3.41 with mFNR = .038. It is interesting to note that the z value oracle procedure rejects the observation x = −2 (p value = .046) but accepts the observation x = 3 (p value = .003). We provide more numerical comparisons in Section 5, where we show that the z value oracle procedure yields a lower mFNR level than the p value oracle procedure for all possible configurations of alternative hypotheses, with the difference significant in many cases.
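The bisection scheme of this example can be sketched as follows (the parameter values p1 = .15, p2 = .05, μ1 = −3, μ2 = 4 are assumed for concreteness; the inner bisection finds cl and cu for a given λ, and the outer bisection tunes λ until the mFDR equals α).

```python
import math

# Parameters assumed for this sketch.
p1, p2, mu1, mu2 = 0.15, 0.05, -3.0, 4.0
p = p1 + p2
alpha = 0.10

def Phi(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def cutoffs(lam):
    """Solve lam*[p1*exp(mu1*z - mu1^2/2) + p2*exp(mu2*z - mu2^2/2)]
    = (1 - lam)*(1 - p) for the two cutoffs cl < cu; the rejection
    region {T_OR(z) < lam} is z < cl or z > cu."""
    def h(z):
        return (lam * (p1 * math.exp(mu1 * z - mu1 ** 2 / 2)
                       + p2 * math.exp(mu2 * z - mu2 ** 2 / 2))
                - (1 - lam) * (1 - p))
    def root(lo, hi):
        for _ in range(100):
            mid = (lo + hi) / 2
            if h(lo) * h(mid) <= 0:
                hi = mid
            else:
                lo = mid
        return (lo + hi) / 2
    return root(-20.0, 0.0), root(0.0, 20.0)

def mfdr(lam):
    cl, cu = cutoffs(lam)
    null_mass = (1 - p) * (Phi(cl) + 1 - Phi(cu))
    alt_mass = (p1 * (Phi(cl - mu1) + 1 - Phi(cu - mu1))
                + p2 * (Phi(cl - mu2) + 1 - Phi(cu - mu2)))
    return null_mass / (null_mass + alt_mass)

# Outer bisection on lam: the mFDR is increasing in the threshold.
lo, hi = 1e-6, 0.99
for _ in range(60):
    lam = (lo + hi) / 2
    if mfdr(lam) < alpha:
        lo = lam
    else:
        hi = lam
cl, cu = cutoffs(lam)
```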
3.3 Connection to the Work of Spjøtvoll (1972) and Storey (2007)
An "optimal" multiple-testing procedure was introduced by Spjøtvoll (1972). Let f01, ..., f0N and f1, ..., fN be integrable functions. Let S′(γ) denote the set of all tests (ψ1, ..., ψN) that satisfy Σ_{t=1}^N ∫ ψt(x)f0t(x) dμ(x) = γ, where γ > 0 is prespecified. Spjøtvoll (1972) showed that the test (φ1, ..., φN) = {I[ft(x) > cf0t(x)], t = 1, ..., N} maximizes

Σ_{t=1}^N ∫ φt(x)ft(x) dμ(x) (9)

among all tests (ψ1, ..., ψN) ∈ S′(γ). Storey (2007) proposed the optimal discovery procedure (ODP) based on a shrinkage test statistic that maximizes the expected number of true positives (ETP) for a fixed level of the expected number of false positives (EFP). The result of Storey (2007) on the ODP follows directly from the theorem of Spjøtvoll (1972) by choosing appropriate f0t and ft.
In the setting of the present article, both Spjøtvoll's optimal procedure and Storey's ODP procedure depend on unknown functions that are not estimable from the data. Moreover, the optimal cutoffs for a given test level are not specified in either procedure. These limitations make the two methods inapplicable in terms of the goal of FDR control.
The formulation of our testing procedure (6) has two advantages over the procedures of Spjøtvoll (1972) and Storey (2007). First, we can form good estimates (Jin and Cai 2007) from the data for the unknown functions in the oracle test statistic that we propose. Second, we specify the optimal cutoff for each mFDR level, which was not discussed by Spjøtvoll (1972) or Storey (2007). These advantages greatly facilitate the development of the adaptive procedure (10), where a consistent cutoff (relative to the oracle cutoff) is suggested for each test level based on a simple step-up procedure. This task is impossible based on the formulation of the work of Spjøtvoll or Storey, because good estimates of the unknown quantities in their test "statistics" are not available, and obtaining appropriate cutoffs is difficult.
4 ADAPTIVE PROCEDURES FOR mFDR CONTROL
Genovese and Wasserman (2004) discussed a class of plug-in procedures for the purpose of practical implementation of the oracle procedure based on the p values. However, the plug-in idea is difficult to apply to the z value oracle procedure, because it essentially involves estimating the distribution of the oracle test statistic TOR(z), which is usually very difficult. Instead, we develop an adaptive procedure that requires only estimation of the distribution of the z values, so that the difficulty of estimating the distribution of TOR(z) is avoided. The adaptive procedure can be easily implemented by noting that the z values are usually distributed as a normal mixture after appropriate transformations are applied, and several methods for consistently estimating the normal nulls have been developed in the literature (see, e.g., Efron 2004b; Jin and Cai 2007).
In this section we first introduce an adaptive p value–based procedure proposed by Benjamini and Hochberg (2000). We then turn to the development of an adaptive procedure based on the z values. We show that the adaptive z value–based procedure asymptotically attains the performance of the oracle procedure (6). We report simulation studies in Section 5 demonstrating that the z value–based adaptive procedure is more efficient than the traditional p value–based testing approaches.
4.1 Adaptive Procedure Based on p Values
Suppose that P(1), ..., P(m) are the ranked p values from m independent tests, and let k = max{i : P(i) ≤ αi/[m(1 − p̂)]}. Then the adaptive procedure of Benjamini and Hochberg (2000), designated as BH hereinafter, rejects all H(i), i ≤ k. Genovese and Wasserman (2004) proposed the plug-in threshold t(p̂, Ĝ) for the p values, where t(p, G) = sup{t : (1 − p)t/G(t) ≤ α} is the threshold of the oracle p value procedure, and p̂ and Ĝ are estimates of p and G. The next theorem shows that the GW plug-in procedure and the BH adaptive procedure are equivalent when the empirical distribution of the p values, Ĝm, is used to estimate G.

Theorem 3. Let p̂ be an estimate of p, and let Ĝm be the empirical cdf of the p values. Then the GW plug-in procedure is equivalent to the BH adaptive procedure.

It follows from our Theorem 3 and theorem 5 of Genovese and Wasserman (2004) that the BH adaptive procedure controls the mFDR at level α asymptotically.
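A compact sketch of the BH adaptive step-up rule follows (setting p̂ = 0 reduces it to the ordinary BH95 step-up procedure).

```python
def bh_adaptive(pvals, alpha, p_hat=0.0):
    """BH adaptive step-up: reject the k hypotheses with the smallest
    p values, where k = max{i : P_(i) <= alpha * i / [m * (1 - p_hat)]}.
    p_hat is the estimated proportion of nonnulls; p_hat = 0 gives the
    ordinary BH95 step-up procedure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= alpha * rank / (m * (1 - p_hat)):
            k = rank
    return sorted(order[:k])  # indices of the rejected hypotheses

pv = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 0.5, 0.7, 0.9]
rejected = bh_adaptive(pv, alpha=0.05)  # ordinary BH95
```

A positive p̂ inflates every threshold by the factor 1/(1 − p̂), so the adaptive version rejects at least as many hypotheses.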
4.2 Adaptive Procedure Based on z Values
Here we outline the steps of an intuitive derivation of the adaptive z value–based procedure. The derivation essentially involves mimicking the operation of the z value oracle procedure and evaluating the distribution of TOR(z) empirically.

Let z1, ..., zm be a random sample from the mixture model (1) with cdf F = (1 − p)F0 + pF1 and pdf f = (1 − p)f0 + pf1. Let p̂, f̂0, and f̂ be consistent estimates of p, f0, and f. Such estimates have been provided by, for example, Jin and Cai (2007). Define T̂OR(zi) = [(1 − p̂)f̂0(zi)/f̂(zi)] ∧ 1. The mFDR of the decision rule δ(TOR, λ) = {I[TOR(zi) < λ], i = 1, ..., m} is given by QOR(λ) = (1 − p)G0OR(λ)/GOR(λ), where GOR(t) and G0OR(t) are as defined in Section 3. Let Sλ = {z : TOR(z) < λ} be the rejection region. Then GOR(λ) = ∫_{Sλ} f(z) dz = ∫ 1{TOR(z) < λ}f(z) dz. We estimate GOR(λ) by ĜOR(λ) = (1/m) Σ_{i=1}^m 1{T̂OR(zi) < λ}. The numerator of QOR(λ) can be written as (1 − p)G0OR(λ) = (1 − p) ∫_{Sλ} f0(z) dz = ∫ 1{TOR(z) < λ}TOR(z)f(z) dz, and we estimate this quantity by (1/m) Σ_{i=1}^m 1{T̂OR(zi) < λ}T̂OR(zi). Then QOR(λ) can be estimated as

Q̂OR(λ) = [Σ_{i=1}^m 1{T̂OR(zi) < λ}T̂OR(zi)] / [Σ_{i=1}^m 1{T̂OR(zi) < λ}].
Set the estimated threshold as λ̂OR = sup{t ∈ (0, 1) : Q̂OR(t) ≤ α}, and let R be the set of the ranked T̂OR(zi): R = {L̂fdr(1), ..., L̂fdr(m)}. We consider only the discrete cutoffs in the set R, where the estimated mFDR reduces to Q̂OR(L̂fdr(k)) = (1/k) Σ_{i=1}^k L̂fdr(i). We propose the following adaptive step-up procedure:

Let k = max{i : (1/i) Σ_{j=1}^i L̂fdr(j) ≤ α}; then reject all H(i), i = 1, ..., k. (10)
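Procedure (10) amounts to a step-up rule on the sorted Lfdr values, with the running average playing the role of the estimated mFDR; a minimal sketch:

```python
def adaptz_stepup(lfdr, alpha):
    """Step-up procedure (10): sort the estimated Lfdr values and take
    k = max{i : (1/i) * (sum of the i smallest Lfdr values) <= alpha};
    the running average is exactly the estimated mFDR of rejecting the
    i most significant cases."""
    m = len(lfdr)
    order = sorted(range(m), key=lambda i: lfdr[i])
    total, k = 0.0, 0
    for rank, idx in enumerate(order, start=1):
        total += lfdr[idx]
        if total / rank <= alpha:
            k = rank
    return sorted(order[:k])

# Sorted values 0.01, 0.05, 0.12, ... give running means
# 0.01, 0.03, 0.06, 0.12, ..., so k = 3 at alpha = .10.
hits = adaptz_stepup([0.30, 0.01, 0.12, 0.05, 0.60], alpha=0.10)
```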
Similar to the discussion in the proof of Theorem 3, it is sufficient to consider only the discrete cutoffs in R, and the adaptive procedure (10) is equivalent to the plug-in procedure

δ(T̂OR, λ̂OR) = {I[T̂OR(z1) < λ̂OR], ..., I[T̂OR(zm) < λ̂OR]},

which would be very difficult to implement directly, because obtaining λ̂OR is difficult.
Remark 4. The procedure (10) is more adaptive than the BH adaptive procedure in the sense that it adapts to both the global feature p and the local feature f0/f. In contrast, the BH method adapts only to the global feature p. Suppose that we use the theoretical null N(0,1) in the expression TOR = (1 − p)f0/f. The p value approaches treat the points −z and z equally, whereas the z value approaches evaluate the relative importance of −z and z according to their estimated densities. For example, if there is evidence in the data that there are more nonnulls around −z [i.e., f(−z) is larger], then the observation −z will be correspondingly ranked higher than the observation z.
Remark 5. In the FDR literature, z value–based methods such as the Lfdr procedure (Efron 2004a,b) are used only to calculate individual significance levels, whereas the p value–based procedures are used for global FDR control to identify nonnull cases. It is also notable that the goals of global error control and individual case interpretation are naturally unified in the adaptive procedure (10).
The next two theorems show that the adaptive procedure (10) asymptotically attains the performance of the oracle procedure based on the z values, in the sense that both the mFDR and mFNR levels achieved by the oracle procedure (6) are also asymptotically achieved by the adaptive z value–based procedure (10).
Theorem 4 (Asymptotic validity of the adaptive procedure). Let θi, i = 1, ..., m, be iid Bernoulli(p) variables, and let xi | θi, i = 1, ..., m, be independent observations from the mixture model (1) with pdf f = (1 − p)f0 + pf1. Suppose that f is continuous and positive on the real line. Assume that TOR(zi) = (1 − p)f0(zi)/f(zi) is distributed with marginal pdf g = (1 − p)g0 + pg1 and that TOR ∈ 𝒯 satisfies the SMLR assumption. Let p̂, f̂0, and f̂ be estimates of p, f0, and f such that p̂ → p in probability, E‖f̂ − f‖² → 0, and E‖f̂0 − f0‖² → 0. Define the test statistic T̂OR(zi) = (1 − p̂)f̂0(zi)/f̂(zi). Let L̂fdr(1), ..., L̂fdr(m) be the ranked values of T̂OR(zi); then the adaptive procedure (10) controls the mFDR at level α asymptotically.
Theorem 5 (Asymptotic attainment of the adaptive procedure). Assume that the random sample z1, ..., zm and the test statistics TOR(zi), T̂OR(zi) are the same as in Theorem 4. Then the mFNR level of the adaptive procedure (10) is Q̃OR(λOR) + o(1), where Q̃OR(λOR), given in (7), is the mFNR level achieved by the z value oracle procedure (6).
5 NUMERICAL RESULTS
We now turn to the numerical performance of our adaptive z value–based procedure (10). The procedure is easy to implement; the R code for the procedure is available at http://stat.wharton.upenn.edu/~tcai/paper/html/FDR.html. The goal of this section is to provide numerical examples showing that the conventional p value–based procedures are inefficient. We also explore the situations in which these p value–based procedures can be substantially improved by the z value–based procedures. We describe the application of our procedure to the analysis of microarray data from an HIV study in Section 6.
In our numerical studies we consider the following normal mixture model:

Zi ∼ (1 − p)N(μ0, σ0²) + pN(μi, σi²), (11)

where (μi, σi²) follows some bivariate distribution F(μ, σ²). This model can be used to approximate many mixture distributions and is found in a wide range of applications (see, e.g., Magder and Zeger 1996; Efron 2004b). We compare the p value–based and z value–based procedures in two numerical examples.
Numerical Study 1. We compare the p value and z value oracle procedures in the normal mixture model (8), a special case of (11). The algorithm for calculating the oracle cutoffs and the corresponding mFNRs is given in Example 1 at the end of Section 3. Figure 3 compares the performance of these two oracle procedures. Panel (a) plots the mFNR of the two oracle procedures as a function of p1, where the mFDR level is set at .10, the alternative means are μ1 = −3 and μ2 = 3, and the total proportion of nonnulls is p = .2. Panel (b) plots the mFNR as a function of p1 in the same setting, except with alternative means μ1 = −3 and μ2 = 6. In panel (c), mFDR = .10, p1 = .18, p2 = .02, and μ1 = −3, and we plot the mFNR as a function of μ2. Panel (d) plots the mFNR as a function of the mFDR level while holding μ1 = −3, μ2 = 1, p1 = .02, and p2 = .18 fixed. We discuss these results at the end of this section.
Numerical Study 2. We compare the step-up procedure (Benjamini and Hochberg 1995), the adaptive p value–based procedure (Benjamini and Hochberg 2000; Genovese and Wasserman 2004), and the adaptive z value–based procedure (10), designated by BH95, BHGW, and AdaptZ hereinafter. Note that the BH step-up procedure is easier to implement than either BHGW or AdaptZ; the BHGW procedure requires estimating the proportion of nonnulls, and the AdaptZ procedure also requires an additional density estimation step.
Figure 3. Comparison of the p value and z value oracle rules: (a) mFDR = .10, μ = (−3, 3); (b) mFDR = .10, μ = (−3, 6); (c) mFDR = .10, p1 = .18, p2 = .02, μ1 = −3; (d) μ = (−3, 1), p = (.02, .18).
Figure 4. Comparison of procedures for FDR control: (a) μ = (−3, 3), p = .20; (b) μ = (−3, 6), p = .20; (c) μ1 = −3, p = (.08, .02); (d) μ1 = −3, p = (.18, .02). Shown are the BH step-up procedure based on the p values, the BHGW adaptive procedure based on the p values, and the adaptive procedure based on the z values. The mFDR is controlled at level .10.
The alternative means μ1i and μ2i are generated from the uniform distributions U(θ1 − δ1, θ1 + δ1) and U(θ2 − δ2, θ2 + δ2), after which zi is generated from the normal mixture model (8) based on μ1i and μ2i, i = 1, ..., m. This hierarchical model can also be viewed as a special case of the mixture model (11). In estimating the Lfdr statistic Lfdr(zi) = (1 − p)f0(zi)/f(zi), f0 is chosen to be the theoretical null density N(0,1), p is estimated consistently using the approach of Jin and Cai, and f is estimated using a kernel density estimator with bandwidth chosen by cross-validation. The comparison results are displayed in Figure 4, where the top row gives plots of the mFNR as a function of p1 and the bottom row gives plots of the mFNR as a function of θ2, with other quantities held fixed. All points in the plots are calculated based on a sample size of m = 5,000 and 200 replications.
The following observations can be made based on the results from these two numerical studies:
a. When |μ1| = |μ2|, the mFNR remains constant for the p value oracle procedure. In contrast, the mFNR for the z value oracle procedure first increases and then decreases [Fig. 3(a)]. The p value and z value oracle procedures yield the same mFNR levels only when the alternative is symmetric about the null. This reveals the fact that the z value procedure adapts to asymmetry in the alternative distribution, whereas the p value procedure does not. Similar phenomena are shown in Figure 4 for the adaptive procedures.
b. The p value oracle procedure is dominated by the z value oracle procedure. The largest difference occurs when |μ1| < μ2 and p1 > p2, where the alternative distribution is highly asymmetric about the null [Figs. 3(b) and 3(c)]. Similarly, BH95 is dominated by BHGW, which is again dominated by AdaptZ (Fig. 4).
c. The mFNRs for both the p value and z value procedures decrease as μ2 moves away from 0.
d. Within a reasonable range (mFDR < .6), the mFNR is decreasing in the mFDR [Fig. 3(d)], which verifies part (d) of Proposition 1, a consequence of our SMLR assumption.
Additional simulation results show that the difference in the mFNRs of the p value and z value procedures is increasing in the proportion of nonnulls, and that the adaptive procedures (BHGW and AdaptZ) control the mFDR at the nominal level α approximately, whereas the BH95 procedure controls the mFDR at a lower level.
6 APPLICATION TO MICROARRAY DATA
We now illustrate our method with an analysis of microarray data from an HIV study. The goal of the HIV study (van't Wout et al. 2003) is to discover differentially expressed genes
Figure 5. Histograms of the HIV data: (a) z values; (b) p values; (c) transformed p values, approximately distributed as uniform (0, 1) for the null cases.
between HIV-positive patients and HIV-negative controls. Gene expression levels were measured for four HIV-positive patients and four HIV-negative controls on the same N = 7,680 genes. A total of N two-sample t tests were performed, and N two-sided p values were obtained. The z values were converted from the t statistics using the transformation zi = Φ⁻¹[G0(ti)], where Φ and G0 are the cdf's of a standard normal and of a t variable with 6 degrees of freedom. The histograms of the z values and p values are presented in Figure 5. An important feature of this data set is that the z value distribution is asymmetric about the null; the distribution is skewed to the right.
When the null hypothesis is true, the p values and z values should follow their theoretical null distributions, which are uniform and standard normal. But the theoretical nulls are usually quite different from the empirical nulls for data arising from microarray experiments (see Efron 2004b for more discussion on the choice of the null in multiple testing). We take the approach of Jin and Cai (2007) to estimate the null distribution as N(μ̂0, σ̂0²); the estimates μ̂0 and σ̂0² are consistent. We then proceed to estimate the proportion of nonnulls p based on μ̂0 and σ̂0². The marginal density f is estimated by a kernel density estimate f̂, with the bandwidth chosen by cross-validation. The Lfdr statistics are then calculated as L̂fdr(zi) = (1 − p̂)f̂0(zi)/f̂(zi). The transformed p values are obtained as F̂0(zi), where F̂0 is the estimated null cdf Φ((x − μ̂0)/σ̂0). As shown in Figure 5(c), after transformation the distribution of the transformed p values is approximately uniform when the null is true.
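These plug-in quantities are simple to compute once the null parameters are estimated. The sketch below uses the estimates reported in this section (μ̂0 = −.08, σ̂0 = .77, p̂0 = .94) and a placeholder f̂, which in practice would be a kernel density estimate of the observed z values.

```python
import math

# Estimated empirical-null parameters reported in this section.
mu0, sigma0, p0 = -0.08, 0.77, 0.94

def Phi(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def f0_hat(z):
    # Estimated null density N(mu0, sigma0^2).
    u = (z - mu0) / sigma0
    return math.exp(-0.5 * u * u) / (sigma0 * math.sqrt(2 * math.pi))

def transformed_pvalue(z):
    # F0_hat(z) = Phi((z - mu0)/sigma0); uniform under the empirical null.
    return Phi((z - mu0) / sigma0)

def lfdr_hat(z, f_hat):
    # Plug-in Lfdr statistic p0_hat * f0_hat(z) / f_hat(z), capped at 1;
    # f_hat is a placeholder for the kernel density estimate.
    return min(1.0, p0 * f0_hat(z) / f_hat(z))
```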
We compare the BH95, BHGW, and AdaptZ procedures using both the theoretical nulls and the estimated nulls. We calculate the number of rejections for each mFDR level; the results are shown in Figure 6. For Figure 6(a), f0 is chosen to be the theoretical null N(0,1), and the estimate for the proportion of nulls is 1; therefore, the BH and BHGW procedures yield the same number of rejections. For Figure 6(b), the estimated null distribution is N(−.08, .77²), with estimated proportion of nulls p̂0 = .94. Transformed p values as well as the Lfdr statistics are calculated according to the estimated null. The following observations can be made from the results displayed:
a. The number of rejections is increasing as a function of the mFDR level.
b. For both the p value–based and z value–based approaches, more hypotheses are rejected when the estimated null is used.
c. Both comparisons show that AdaptZ is more powerful than BH95 and BHGW, which are based on p values.
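The z value–based adaptive procedure used in this comparison thresholds the estimated Lfdr statistics with a step-up rule. The sketch below assumes the form in which the i smallest Lfdr values are averaged and the largest i whose running average stays at or below α is chosen (our reading of procedure (10); not the authors' code):

```python
import numpy as np

def adaptz_rejections(lfdr, alpha):
    """Step-up rule on estimated Lfdr statistics: reject the k hypotheses with
    smallest Lfdr, where k is the largest i such that the average of the i
    smallest Lfdr values is <= alpha."""
    order = np.argsort(lfdr)
    running_mean = np.cumsum(lfdr[order]) / np.arange(1, lfdr.size + 1)
    passing = np.nonzero(running_mean <= alpha)[0]
    k = passing[-1] + 1 if passing.size else 0
    reject = np.zeros(lfdr.size, dtype=bool)
    reject[order[:k]] = True
    return reject
```

Because the criterion averages the Lfdr values of the rejected set, the rule directly targets the estimated mFDR of the rejections.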
7 DISCUSSION
We have developed a compound decision theory framework for inference about the mixture model (1) and have shown that the oracle procedure δ(T_OR, λ_OR) given in (6) is optimal in the multiple-testing problem for mFDR control. We have proposed an adaptive procedure based on the z values that mimics the oracle procedure (6). The adaptive z value–based procedure attains the performance of the oracle procedure asymptotically and outperforms the traditional p value–based approaches. The decision-theoretic framework provides insight into the superiority of the adaptive z value–based procedure: each multiple-testing procedure involves two steps, ranking and thresholding. The process of ranking the relative importance of the m tests can be viewed as a compound decision problem, P = {P_ij : 1 ≤ i < j ≤ m}, with m(m − 1)/2 components. Each component problem P_ij is a pairwise comparison with its own data y_ij = (xᵢ, xⱼ). The solution to P based on p values is simple (note that δ_ij = I[p(xᵢ) < p(xⱼ)] depends on y_ij alone), whereas the rule based on the estimated Lfdr is compound. The gain in efficiency of the adaptive z value–based procedure is due to the fact that the scope of attention is extended from the
910 Journal of the American Statistical Association September 2007
Figure 6. Analysis of the HIV data: the number of rejections versus the FDR level for the theoretical null (a) and the estimated null (b), comparing the BH step-up procedure based on the p values, the BHGW adaptive procedure based on the p values, and the adaptive procedure based on the z values.
class of simple rules to the class of compound rules in the ranking process.
The compound decision-theoretic approach to multiple testing also suggests several future research directions. First, the oracle statistic (Lfdr) may no longer be optimal for ranking the significance of tests when the observations are correlated. In addition, different cutoffs should be chosen under dependency, because the multiple-testing procedures developed for independent tests can be too liberal or too conservative for dependent tests (Efron 2006). Second, the oracle procedure is optimal only in the class of symmetric decision rules. We expect to develop more general models in the compound decision-theoretic framework and to extend our attention to asymmetric rules to further increase the efficiency of multiple-testing procedures. Finally, Genovese, Roeder, and Wasserman (2005), Wasserman and Roeder (2006), Rubin, Dudoit, and van der Laan (2006), and Foster and Stine (2006) have discussed methods that use different cutoffs for the p values of different tests when prior information is available or some correlation structure is known. It would be interesting to construct z value–based testing procedures that also exploit external information. We leave this topic for future research.
APPENDIX: PROOFS
Proof of Proposition 1
To prove part (a), note that the MLR assumption (4) implies that

∫₀ᶜ g₀(t) dt / ∫₀ᶜ g₁(t) dt = ∫₀ᶜ g₀(t) dt / ∫₀ᶜ [g₁(t)/g₀(t)]g₀(t) dt < ∫₀ᶜ g₀(t) dt / ∫₀ᶜ [g₁(c)/g₀(c)]g₀(t) dt = g₀(c)/g₁(c),

which is equivalent to g₁(c)G₀(c) < g₀(c)G₁(c). Thus the derivative of P(θᵢ = 1 | T(Xᵢ) ≤ c) = pG₁(c)/G(c), namely (1 − p)p[g₁(c)G₀(c) − g₀(c)G₁(c)]/G(c)², is negative. Therefore P(θᵢ = 1 | T(Xᵢ) ≤ c) is decreasing in c. Parts (b) and (c) can be proved similarly to part (a), by noting that mFDR = (1 − p)G₀(c)/G(c) and mFNR = p[1 − G₁(c)]/[1 − G(c)]. Part (d) follows from parts (b) and (c). For part (e), the classification risk of δ is R_λ = E[L_λ(θ, δ)] = λ(1 − p)G₀(c) + p[1 − G₁(c)]. The optimal cutoff c* that minimizes R_λ satisfies g₀(c*)/g₁(c*) = p/[λ(1 − p)], or equivalently λ = pg₁(c*)/[(1 − p)g₀(c*)]. Therefore λ (c*) is monotone decreasing in c* (λ) if (4) holds. Noting that r = G(c*) and that G is increasing, we conclude that r is decreasing in λ.
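Part (b) can be illustrated numerically for a concrete MLR family. The mixture below (null N(0,1), alternative N(−2,1), rejection region {X < c}) is an arbitrary illustrative choice satisfying (4), since g₁(t)/g₀(t) = exp(−2t − 2) is decreasing:

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def mfdr(c, p=0.2):
    """mFDR of the rule I(X < c) when X ~ (1-p)N(0,1) + pN(-2,1),
    i.e. (1-p)G0(c)/G(c) with G0(c) = Phi(c) and G1(c) = Phi(c + 2)."""
    num = (1.0 - p) * Phi(c)
    return num / (num + p * Phi(c + 2.0))

cuts = [-3.0 + 0.1 * i for i in range(61)]   # thresholds from -3 to 3
values = [mfdr(c) for c in cuts]
is_monotone = all(a <= b for a, b in zip(values, values[1:]))
```

The computed mFDR is increasing in the threshold c, as part (b) asserts.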
Proof of Theorem 1
According to part (b) of Proposition 1, for any T ∈ 𝒯 and mFDR level α, there exists a unique threshold c such that the mFDR is controlled at level α by δ(T, c) = {I[T(x₁) < c], …, I[T(xₘ) < c]}. Let r be the expected number of rejections of δ(T, c). Then, again by part (b) of Proposition 1, there exists a unique threshold c* such that the expected number of rejections of δ(Λ, c*) is also r. Next, according to part (e) of Proposition 1, there exists a unique λ(α) with respect to the choice of c* such that the classification risk with λ(α) as the weight is minimized by δ(Λ, c*), and the expected number of subjects classified to the nonnull population is r.
Suppose that among the r rejected hypotheses there are v_L from the null and k_L from the nonnull when δ(L, c_L) is applied, where (L, c_L) = (T, c) or (Λ, c*). Then r = v_T + k_T = v_Λ + k_Λ. Now we argue that v_T ≥ v_Λ and k_T ≤ k_Λ. If not, then suppose that v_T < v_Λ and k_T > k_Λ. Note that the classification risk can be expressed as R_{λ(α)} = p + (1/m)[λ(α)v_L − k_L], and it is easy to see that δ(T, c) then yields a lower classification risk than δ(Λ, c*). This contradicts the fact that δ(Λ, c*) minimizes the classification risk for the choice of λ(α). Therefore we must have v_T ≥ v_Λ and k_T ≤ k_Λ.
Sun and Cai Rules for False Discovery Rate Control 911
Let mFDR_L and mFNR_L be the mFDR and mFNR of δ(L, c_L), where L = T, Λ. Now we apply both δ = δ(T, c) and δ* = δ(Λ, c*) in the multiple-testing problem. Then the mFDR using rule δ* is mFDR_Λ = v_Λ/r ≤ v_T/r = mFDR_T = α, and the mFNR using δ* is mFNR_Λ = (m₁ − k_Λ)/(m − r) ≤ (m₁ − k_T)/(m − r) = mFNR_T. Therefore δ* is the testing rule in class D_S that controls the mFDR at level α with the smallest mFNR.
Proof of Theorem 2
The joint distribution of θ = (θ₁, …, θₘ) is π(θ) = ∏ᵢ (1 − p)^{1−θᵢ}p^{θᵢ}. The posterior distribution of θ given x can be calculated as P_{θ|X}(θ | x) = ∏ᵢ P_{θᵢ|Xᵢ}(θᵢ | xᵢ), where

P_{θᵢ|Xᵢ}(θᵢ | xᵢ) = [I(θᵢ = 0)(1 − p)f₀(xᵢ) + I(θᵢ = 1)pf₁(xᵢ)] / [(1 − p)f₀(xᵢ) + pf₁(xᵢ)].  (A.1)
Let Iᵢ = I(θᵢ = 1), and let I = {(I₁, …, Iₘ) : Iᵢ = 0 or 1} be the collection of all m-binary vectors; then |I| = 2^m. The elements of I are designated by superscripts, I^(j), j = 1, …, 2^m. Note that

Σ_{θᵢ=0}^{1} [I(θᵢ = 0)(1 − p)f₀(xᵢ) + I(θᵢ = 1)pf₁(xᵢ)][λI(θᵢ = 0)δᵢ + I(θᵢ = 1)(1 − δᵢ)] = λ(1 − p)f₀(xᵢ)δᵢ + pf₁(xᵢ)(1 − δᵢ);

then the posterior risk is

E_{θ|X}L(θ, δ) = Σ_{j=1}^{2^m} L(θ = I^(j), δ) ∏ᵢ P_{θᵢ|Xᵢ}(θᵢ = Iᵢ^(j) | xᵢ)
 = (1/m) Σᵢ Σ_{θᵢ=0}^{1} P_{θᵢ|Xᵢ}(θᵢ | xᵢ)[λI(θᵢ = 0)δᵢ + I(θᵢ = 1)(1 − δᵢ)]
 = (1/m) Σᵢ pf₁(xᵢ)/f(xᵢ) + (1/m) Σᵢ {[λ(1 − p)f₀(xᵢ) − pf₁(xᵢ)]/f(xᵢ)}δᵢ.

The Bayes rule is the simple rule δ^π(x) = [δ^π₁(x₁), …, δ^π_m(xₘ)], where δ^π_i = I[λ(1 − p)f₀(xᵢ) < pf₁(xᵢ)]. The expected misclassification risk of δ^π is

E{E_{θ|X}L(θ, δ^π)} = E[pf₁(X)/f(X)] + E({[λ(1 − p)f₀(X) − pf₁(X)]/f(X)} I{λ(1 − p)f₀(X) < pf₁(X)})
 = p + ∫_{λ(1−p)f₀<pf₁} [λ(1 − p)f₀(x) − pf₁(x)] dx.
Proof of Theorem 3
Observe that, for each t, if p_(i) ≤ t < p_(i+1), then the number of rejections is i. In addition, the ratio (1 − p̂)t/G_m(t) increases in t over any range on which the number of rejections is constant, which implies that if t = p_(i) does not satisfy the constraint, then neither does any choice of t between p_(i) and p_(i+1). Thus it is sufficient to investigate thresholds equal to one of the p_(i)'s. At t = p_(i), the ratio (1 − p̂)t/G_m(t) becomes (m/i)(1 − p̂)p_(i). Therefore the plug-in threshold is given by sup{p_(i) : (m/i)(1 − p̂)p_(i) ≤ α}, which is equivalent to choosing the largest i such that p_(i) ≤ iα/[m(1 − p̂)].

Proof of Theorem 4

To prove this theorem, we first state the following lemmas.
Lemma A.1. Let p̂, f̂, and f̂₀ be estimates such that p̂ →ᵖ p, E‖f̂ − f‖² → 0, and E‖f̂₀ − f₀‖² → 0. Then E‖T̂_OR − T_OR‖² → 0.

Proof. Note that f is continuous and positive on the real line; then there exists K₁ = [−M, M] such that Pr(z ∈ K₁ᶜ) → 0 as M → ∞. Let inf_{z∈K₁} f(z) = l₀ and A^f_ε = {z : |f̂(z) − f(z)| ≥ l₀/2}. Note that E‖f̂ − f‖² ≥ (l₀/2)² Pr(A^f_ε); then Pr(A^f_ε) → 0. Thus f and f̂ are bounded below by a positive number for large m, except on an event that has low probability. Similar arguments apply to the upper bounds of f and f̂, as well as to the upper and lower bounds of f₀ and f̂₀. Therefore we conclude that f₀, f̂₀, f, and f̂ are all bounded in an interval [l_a, l_b], 0 < l_a < l_b < ∞, for large m, except on an event, say A_ε, that has algebraically low probability; that is, 0 < l_a < inf_{z∈A^c_ε} min{f₀, f̂₀, f, f̂} ≤ sup_{z∈A^c_ε} max{f₀, f̂₀, f, f̂} < l_b < ∞.

Noting that T̂_OR − T_OR = [f₀f̂(p − p̂) + (1 − p̂)f(f̂₀ − f₀) + (1 − p̂)f₀(f − f̂)]/(f̂f), we conclude that (T̂_OR − T_OR)² ≤ c₁(p̂ − p)² + c₂(f̂₀ − f₀)² + c₃(f̂ − f)² on A^c_ε. It is easy to see that ‖T̂_OR − T_OR‖² is bounded by some L. Then E‖T̂_OR − T_OR‖² ≤ L Pr(A_ε) + c₁E(p̂ − p)² + c₂E‖f̂₀ − f₀‖² + c₃E‖f̂ − f‖². Note that E(p̂ − p)² → 0 by lemma 2.2 of van der Vaart (1998), and E‖f̂ − f‖² → 0 and E‖f̂₀ − f₀‖² → 0 by assumption; then, for a given ε > 0, there exists M ∈ Z₊ such that we can find A_ε with Pr(A_ε) < ε/(4L) and, at the same time, E(p̂ − p)² < ε/(4c₁), E‖f̂₀ − f₀‖² < ε/(4c₂), and E‖f̂ − f‖² < ε/(4c₃) for all m ≥ M. Consequently, E‖T̂_OR − T_OR‖² < ε for m ≥ M, and the result follows.

Lemma A.2. E‖T̂_OR − T_OR‖² → 0 implies that T̂_OR(Zᵢ) →ᵖ T_OR(Zᵢ).

Proof. Let A_ε = {z : |T̂_OR(z) − T_OR(z)| ≥ ε}. Then ε² Pr(A_ε) ≤ E‖T̂_OR − T_OR‖² → 0; consequently, Pr(A_ε) → 0. Therefore Pr(|T̂_OR(Zᵢ) − T_OR(Zᵢ)| ≥ ε) ≤ Pr(A_ε) + Pr({|T̂_OR(Zᵢ) − T_OR(Zᵢ)| ≥ ε} ∩ A^c_ε) = Pr(A_ε) → 0, and the result follows.
Lemma A.3. For α < t < 1, E[1{T̂_OR(Zᵢ) < t}T̂_OR(Zᵢ)] → E[1{T_OR(Zᵢ) < t}T_OR(Zᵢ)].

Proof. T̂_OR(Zᵢ) →ᵖ T_OR(Zᵢ) (Lemma A.2) implies that T̂_OR(Zᵢ) →ᵈ T_OR(Zᵢ). Let h(x) = 1{x < t}x; then h(x) is bounded and continuous for x < t. By lemma 2.2 of van der Vaart (1998), E[h(T̂_OR(Zᵢ))] → E[h(T_OR(Zᵢ))], and the result follows.
Lemma A.4. Construct the empirical distributions Ĝ_OR(t) = (1/m)Σᵢ₌₁ᵐ 1{T̂_OR(zᵢ) ≤ t} and Ĝ⁰_OR(t) = (1/m)Σᵢ₌₁ᵐ 1{T̂_OR(zᵢ) ≤ t}T̂_OR(zᵢ), and define Q̂_OR(t) = Ĝ⁰_OR(t)/Ĝ_OR(t), the estimated mFDR. Then, for α < t < 1, Q̂_OR(t) →ᵖ Q_OR(t).

Proof. Let ρₘ = cov[T̂_OR(Zᵢ), T̂_OR(Zⱼ)], where Zᵢ and Zⱼ are two independent random variables from the mixture distribution F. T̂_OR(Zᵢ) →ᵖ T_OR(Zᵢ) (Lemma A.2) implies that E[T̂_OR(Zᵢ)T̂_OR(Zⱼ)] → E[T_OR(Zᵢ)T_OR(Zⱼ)]. Therefore ρₘ = cov[T̂_OR(Zᵢ), T̂_OR(Zⱼ)] → cov[T_OR(Zᵢ), T_OR(Zⱼ)] = 0. Let σₘ² = var(T̂_OR(Z₁)); then σₘ² ≤ E[T̂_OR(Z₁)]² ≤ 1.

Let μₘ = Pr{T̂_OR(Zᵢ) ≤ t} and Sₘ = Σᵢ₌₁ᵐ 1{T̂_OR(zᵢ) ≤ t}. Note that var(Sₘ)/m² = (1/m)σₘ² + [(m − 1)/m]ρₘ ≤ 1/m + ρₘ → 0. According to the weak law of large numbers for triangular arrays, we have (1/m)Σᵢ₌₁ᵐ 1{T̂_OR(zᵢ) ≤ t} − μₘ →ᵖ 0. Also note that μₘ = Pr{T̂_OR(Zᵢ) ≤ t} → Pr{T_OR(Zᵢ) ≤ t} = G_OR(t); we conclude that Ĝ_OR(t) →ᵖ G_OR(t). Next let vₘ = E[1{T̂_OR(Zᵢ) ≤ t}T̂_OR(Zᵢ)]. Similarly, we can prove that Ĝ⁰_OR(t) − vₘ = (1/m)Σᵢ₌₁ᵐ 1{T̂_OR(zᵢ) ≤ t}T̂_OR(zᵢ) − vₘ →ᵖ 0. By Lemma A.3, we have vₘ → E[1{T_OR(Zᵢ) < t}T_OR(Zᵢ)] = (1 − p)∫1{T_OR < t}f₀ dt = (1 − p)G⁰_OR(t); thus Ĝ⁰_OR(t) →ᵖ (1 − p)G⁰_OR(t). Finally, note that for t > α, G_OR(t) is bounded away from 0, and we obtain Q̂_OR(t) →ᵖ (1 − p)G⁰_OR(t)/G_OR(t) = Q_OR(t).
Lemma A.5. Define the estimated plug-in threshold λ̂_OR = sup{t ∈ (0,1) : Q̂_OR(t) ≤ α}. If Q̂_OR(t) →ᵖ Q_OR(t), then λ̂_OR →ᵖ λ_OR.
Proof. Note that Q̂_OR(t) is not continuous, and the consistency of Q̂_OR(t) does not necessarily imply the consistency of λ̂_OR. Thus we first construct an envelope for Q̂_OR(t) using two continuous random functions Q̂⁻_OR(t) and Q̂⁺_OR(t), such that, for Lfdr_(k) < t < Lfdr_(k+1), where Lfdr_(1) ≤ ⋯ ≤ Lfdr_(m) denote the ordered estimated Lfdr statistics,

Q̂⁻_OR(t) = Q̂_OR(Lfdr_(k−1)) (t − Lfdr_(k))/(Lfdr_(k+1) − Lfdr_(k)) + Q̂_OR(Lfdr_(k)) (Lfdr_(k+1) − t)/(Lfdr_(k+1) − Lfdr_(k))

and

Q̂⁺_OR(t) = Q̂_OR(Lfdr_(k)) (t − Lfdr_(k))/(Lfdr_(k+1) − Lfdr_(k)) + Q̂_OR(Lfdr_(k+1)) (Lfdr_(k+1) − t)/(Lfdr_(k+1) − Lfdr_(k)).

Noting that Q̂_OR(Lfdr_(k+1)) − Q̂_OR(Lfdr_(k)) = [kLfdr_(k+1) − Σᵢ₌₁ᵏ Lfdr_(i)]/[k(k + 1)] > 0, we have Q̂⁻_OR(t) ≤ Q̂_OR(t) ≤ Q̂⁺_OR(t), with both Q̂⁻_OR(t) and Q̂⁺_OR(t) strictly increasing in t. Let λ̂⁻_OR = sup{t ∈ (0,1) : Q̂⁻_OR(t) ≤ α} and λ̂⁺_OR = sup{t ∈ (0,1) : Q̂⁺_OR(t) ≤ α}; then λ̂⁺_OR ≤ λ̂_OR ≤ λ̂⁻_OR.

We claim that λ̂⁻_OR →ᵖ λ_OR. If not, then there must exist ε₀ and η₀ such that, for all M ∈ Z₊, Pr(|λ̂⁻_OR − λ_OR| > ε₀) > 2η₀ holds for some m ≥ M. Suppose that Pr(λ̂⁻_OR > λ_OR + ε₀) > 2η₀. It is easy to see that Q̂⁻_OR →ᵖ Q_OR, because Q̂⁻_OR(t) − Q̂⁺_OR(t) →a.s. 0, Q̂⁻_OR(t) ≤ Q̂_OR(t) ≤ Q̂⁺_OR(t), and Q̂_OR(t) →ᵖ Q_OR(t). Thus there exists M ∈ Z₊ such that Pr(|Q̂⁻_OR(λ_OR + ε₀) − Q_OR(λ_OR + ε₀)| < δ₀) > 1 − η₀ for all m ≥ M, where we let 2δ₀ = Q_OR(λ_OR + ε₀) − α. Now, for this choice of m, there exists an event K₁ₘ such that Pr(K₁ₘ) ≥ 1 − η₀ and, for all outcomes ω ∈ K₁ₘ, |Q̂⁻_OR(λ_OR + ε₀) − Q_OR(λ_OR + ε₀)| < δ₀. At the same time, there exists an event K₂ₘ such that Pr(K₂ₘ) ≥ 2η₀ and, for all outcomes ω ∈ K₂ₘ, λ̂⁻_OR > λ_OR + ε₀. Let Kₘ = K₁ₘ ∩ K₂ₘ; then Pr(Kₘ) = Pr(K₁ₘ) + Pr(K₂ₘ) − Pr(K₁ₘ ∪ K₂ₘ) ≥ Pr(K₁ₘ) + Pr(K₂ₘ) − 1 ≥ η₀. That is, Kₘ has positive measure. Then, for all outcomes in Kₘ, Q̂⁻_OR(t) is continuous and strictly increasing, and we have α = Q̂⁻_OR(λ̂⁻_OR) > Q̂⁻_OR(λ_OR + ε₀) > Q_OR(λ_OR + ε₀) − δ₀ = α + δ₀. This is a contradiction. Therefore we must have λ̂⁻_OR →ᵖ λ_OR. Similarly, we can prove λ̂⁺_OR →ᵖ λ_OR. Noting that Q̂⁻_OR(t) − Q̂⁺_OR(t) →a.s. 0 implies that λ̂⁻_OR − λ̂⁺_OR →a.s. 0, the result follows by noting that λ̂⁺_OR ≤ λ̂_OR ≤ λ̂⁻_OR.
Proof of Theorem 4. Implementing the adaptive procedure (10) is equivalent to choosing T̂_OR(Zᵢ) as the test statistic and λ̂_OR as the threshold. The mFDR of the decision rule δ(T̂_OR, λ̂_OR) is mFDR = E(N₁₀)/E(R) = (1 − p)P_{H₀}(T̂_OR < λ̂_OR)/P(T̂_OR < λ̂_OR). Noting that T̂_OR →ᵖ T_OR by Lemma A.2 and λ̂_OR →ᵖ λ_OR by Lemma A.5, it follows that (1 − p)P_{H₀}(T̂_OR < λ̂_OR) → (1 − p)P_{H₀}(T_OR < λ_OR) = (1 − p)G⁰_OR(λ_OR) and P(T̂_OR < λ̂_OR) → P(T_OR < λ_OR) = G_OR(λ_OR). Noting that P(T_OR < λ_OR) is bounded away from 0, we have mFDR → (1 − p)G⁰_OR(λ_OR)/G_OR(λ_OR) = Q_OR(λ_OR) = α.
Proof of Theorem 5
Similar to the proof of Theorem 4, the mFNR of the adaptive procedure (10) is mFNR = E(N₀₁)/E(S) = pP_{H₁}(T̂_OR > λ̂_OR)/P(T̂_OR > λ̂_OR). By Lemmas A.2 and A.5, we have P_{H₁}(T̂_OR > λ̂_OR) → P_{H₁}(T_OR > λ_OR) = G̃¹_OR(λ_OR) and P(T̂_OR > λ̂_OR) → P(T_OR > λ_OR) = G̃_OR(λ_OR). Noting that G̃_OR(λ_OR) is bounded away from 0, we obtain mFNR → pG̃¹_OR(λ_OR)/G̃_OR(λ_OR) = Q̃_OR(λ_OR).
[Received November 2006 Revised March 2007]
REFERENCES
Benjamini, Y., and Hochberg, Y. (1995), "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing," Journal of the Royal Statistical Society, Ser. B, 57, 289–300.
——— (2000), "On the Adaptive Control of the False Discovery Rate in Multiple Testing With Independent Statistics," Journal of Educational and Behavioral Statistics, 25, 60–83.
Copas, J. (1974), "On Symmetric Compound Decision Rules for Dichotomies," The Annals of Statistics, 2, 199–204.
Efron, B. (2004a), "Local False Discovery Rate," technical report, Stanford University, Dept. of Statistics, available at http://www-stat.stanford.edu/~brad/papers/False.pdf.
——— (2004b), "Large-Scale Simultaneous Hypothesis Testing: The Choice of a Null Hypothesis," Journal of the American Statistical Association, 99, 96–104.
——— (2006), "Correlation and Large-Scale Simultaneous Testing," technical report, Stanford University, Dept. of Statistics, available at http://www-stat.stanford.edu/~brad/papers/Correlation-2006.pdf.
Efron, B., Tibshirani, R., Storey, J., and Tusher, V. (2001), "Empirical Bayes Analysis of a Microarray Experiment," Journal of the American Statistical Association, 96, 1151–1160.
Foster, D., and Stine, R. (2006), "Alpha Investing: A New Multiple Hypothesis Testing Procedure That Controls mFDR," technical report, University of Pennsylvania, Dept. of Statistics.
Genovese, C., and Wasserman, L. (2002), "Operating Characteristics and Extensions of the False Discovery Rate Procedure," Journal of the Royal Statistical Society, Ser. B, 64, 499–517.
——— (2004), "A Stochastic Process Approach to False Discovery Control," The Annals of Statistics, 32, 1035–1061.
Genovese, C., Roeder, K., and Wasserman, L. (2005), "False Discovery Control With p-Value Weighting," Biometrika, 93, 509–524.
Jin, J., and Cai, T. (2007), "Estimating the Null and the Proportion of Nonnull Effects in Large-Scale Multiple Comparisons," Journal of the American Statistical Association, 102, 495–506.
Magder, L., and Zeger, S. (1996), "A Smooth Nonparametric Estimate of a Mixing Distribution Using Mixtures of Gaussians," Journal of the American Statistical Association, 91, 1141–1151.
Robbins, H. (1951), "Asymptotically Subminimax Solutions of Compound Statistical Decision Problems," in Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA: University of California Press, pp. 131–148.
Rubin, D., Dudoit, S., and van der Laan, M. (2006), "A Method to Increase the Power of Multiple Testing Procedures Through Sample Splitting," working paper, University of California, Berkeley, Dept. of Biostatistics, available at http://www.bepress.com/ucbbiostat/paper171.
Spjøtvoll, E. (1972), "On the Optimality of Some Multiple Comparison Procedures," The Annals of Mathematical Statistics, 43, 398–411.
Storey, J. (2002), "A Direct Approach to False Discovery Rates," Journal of the Royal Statistical Society, Ser. B, 64, 479–498.
——— (2003), "The Positive False Discovery Rate: A Bayesian Interpretation and the q-Value," The Annals of Statistics, 31, 2013–2035.
——— (2007), "The Optimal Discovery Procedure: A New Approach to Simultaneous Significance Testing," Journal of the Royal Statistical Society, Ser. B, 69, 347–368.
van der Laan, M., Dudoit, S., and Pollard, K. (2004), "Multiple Testing, Part III: Augmentation Procedures for Control of the Generalized Family-Wise Error Rate and Tail Probabilities for the Proportion of False Positives," technical report, University of California, Berkeley, Dept. of Biostatistics.
van der Vaart, A. (1998), Asymptotic Statistics, Cambridge, U.K.: Cambridge University Press.
van't Wout, A., Lehrman, G., Mikheeva, S., O'Keeffe, G., Katze, M., Bumgarner, R., Geiss, G., and Mullins, J. (2003), "Cellular Gene Expression Upon Human Immunodeficiency Virus Type 1 Infection of CD4+ T-Cell Lines," Journal of Virology, 77, 1392–1402.
Wasserman, L., and Roeder, K. (2006), "Weighted Hypothesis Testing," technical report, Carnegie Mellon University, Dept. of Statistics, available at http://www.arxiv.org/abs/math.ST/0604172.
while controlling the mFDR at level α. On the other hand, in applications it is possible to rate the relative cost of a false positive (type I error) to a false negative (type II error). This naturally gives rise to a weighted classification problem with loss function

L_λ(θ, δ) = (1/m) Σᵢ [λI(θᵢ = 0)δᵢ + I(θᵢ = 1)(1 − δᵢ)],  (2)

where λ > 0 is the relative weight for a false positive. The weighted classification problem is then to find δ that minimizes the classification risk E[L_λ(θ, δ)]. We develop a compound decision theory framework for inference about the mixture model (1) and make connections between multiple testing and weighted classification.
We consider δ(x) ∈ {0,1}^m defined in terms of a statistic T(x) = [Tᵢ(x) : i = 1, …, m] and a threshold c = (c₁, …, cₘ), such that

δ(x) = I(T < c) = [I(Tᵢ(x) < cᵢ) : i = 1, …, m].  (3)
Then δ can be used in both weighted classification and multiple-testing problems. It is easy to verify that δ given in (3) is symmetric if Tᵢ(x) = T(xᵢ) and c = c1 (a constant threshold vector), where T is a function. In this section we allow T to depend on unknown quantities, such as the proportion of nonnulls and/or the distribution of Xᵢ. We assume that T(Xᵢ) ∼ G = (1 − p)G₀ + pG₁, where G₀ and G₁ are the cdfs of T(Xᵢ) under the null and the alternative, respectively. The pdf of T(Xᵢ) is g = (1 − p)g₀ + pg₁. Let 𝒯 be the collection of functions such that, for any T ∈ 𝒯, T(Xᵢ) has a monotone likelihood ratio (MLR); that is,

g₁(t)/g₀(t) is decreasing in t.  (4)

We call 𝒯 the SMLR class. Let x = (x₁, …, xₘ) be a random sample from the mixture distribution F, and let δ(T, c) = {I[T(xᵢ) < c] : i = 1, …, m}. We call δ(T, c) an SMLR decision rule if T ∈ 𝒯, and we let D_S denote the collection of all SMLR decision rules.
Suppose that T(Xᵢ) = p(Xᵢ), the p value of an individual test based on the observation Xᵢ. Assume that p(Xᵢ) ∼ G = (1 − p)G₀ + pG₁, where G₀ and G₁ are the p value distributions under the null and the alternative, and assume that G₁(t) is twice differentiable. Noting that g₀(t) = G₀′(t) = 1, the assumption p(·) ∈ 𝒯 implies that G₁″(t) = g₁′(t) < 0; that is, G₁(t) is concave. Therefore the SMLR assumption can be viewed as a generalized version of the assumption of Storey (2002) and Genovese and Wasserman (2002, 2004) that the p value distribution under the alternative is concave. The following proposition shows that the SMLR assumption is a desirable condition in both multiple-testing and weighted classification problems.
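This concavity can be checked numerically for a concrete alternative. The sketch below uses a two-sided test with alternative N(2,1) (an arbitrary illustrative choice) and verifies that the second differences of G₁ on a grid are nonpositive:

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def upper_quantile(q):
    """Solve 1 - Phi(z) = q for z by bisection (q in (0, 0.5])."""
    lo, hi = 0.0, 10.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if 1.0 - Phi(mid) > q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def G1(t, mu=2.0):
    """Cdf of the two-sided p value under the alternative N(mu, 1):
    G1(t) = P(|Z| > z_{t/2}) with Z ~ N(mu, 1)."""
    z = upper_quantile(t / 2.0)
    return (1.0 - Phi(z - mu)) + (1.0 - Phi(z + mu))

ts = [0.01 * i for i in range(1, 100)]
vals = [G1(t) for t in ts]
second_diffs = [vals[i + 1] - 2.0 * vals[i] + vals[i - 1] for i in range(1, 98)]
concave = all(d <= 1e-9 for d in second_diffs)
```

Here g₁(t) works out to exp(−μ²/2)·cosh(μ·z_{t/2}), which decreases in t, so the concavity holds for any single normal alternative and hence for mixtures of them.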
Proposition 1. Let θᵢ, i = 1, …, m, be iid Bernoulli(p) variables, and let xᵢ|θᵢ, i = 1, …, m, be independent observations from the model (1). Suppose that T ∈ 𝒯. Then the implementation of δ(T, c) = {I[T(xᵢ) < c] : i = 1, …, m} in both the weighted classification and the multiple-testing problems implies that (a) P(θᵢ = 1 | T(Xᵢ) ≤ c) is monotonically decreasing in the threshold c; (b) the mFDR is monotonically increasing in c and in the expected number of rejections r; (c) the mFNR is monotonically decreasing in c and r; (d) the mFNR is monotonically decreasing in the mFDR; and (e) in the classification problem, c (and thus r) is monotonically decreasing in the classification weight λ.
The following theorem makes a connection between a multiple-testing problem and a weighted classification problem; in particular, the former can be solved by solving the latter with an appropriately chosen λ.

Theorem 1. Let θᵢ, i = 1, …, m, be iid Bernoulli(p) variables, and let xᵢ|θᵢ, i = 1, …, m, be independent observations from the mixture model (1). Let Λ ∈ 𝒯, and suppose that the classification risk with the loss function (2) is minimized by δ_λ(Λ, c(λ)) = (δ₁^λ, …, δₘ^λ), where δᵢ^λ = I{Λ(xᵢ) < c(λ)}. Then, for any given mFDR level α in a multiple-testing problem, there exists a unique λ(α) that defines a weighted classification problem. The optimal solution to the classification problem, δ_{λ(α)}(Λ, c{λ(α)}), is also optimal in the multiple-testing problem in the sense that it controls the mFDR at level α with the smallest mFNR among all decision rules in D_S.
The next step is to develop such an optimal rule δ_λ(Λ, c(λ)) as stated in Theorem 1. We first study an ideal setting in which there is an oracle that knows p, f₀, and f₁. The oracle rule in this weighted classification problem then gives the optimal choice of δ.
Theorem 2 (The oracle rule for weighted classification). Let θᵢ, i = 1, …, m, be iid Bernoulli(p) variables, and let xᵢ|θᵢ, i = 1, …, m, be independent observations from the mixture model (1). Suppose that p, f₀, and f₁ are known. Then the classification risk E[L_λ(θ, δ)], with L given in (2), is minimized by δ_λ(Λ, 1/λ) = (δ₁, …, δₘ), where

δᵢ = I{Λ(xᵢ) = (1 − p)f₀(xᵢ)/[pf₁(xᵢ)] < 1/λ},  i = 1, …, m.  (5)

The minimum classification risk is R*_λ = inf_δ E[L_λ(θ, δ)] = p + ∫_K [λ(1 − p)f₀(x) − pf₁(x)] dx, where K = {x : λ(1 − p)f₀(x) < pf₁(x)}.

Remark 1. Let the likelihood ratio (LR) be defined as Lᵢ = f₀(xᵢ)/f₁(xᵢ). A compound decision rule is said to be ordered if, for almost all x, Lᵢ > Lⱼ and δᵢ(x) > 0 imply that δⱼ(x) = 1. Copas (1974) showed that if a symmetric compound decision rule for dichotomies is admissible, then it is ordered. Note that (5) is a Bayes rule, and so is symmetric, ordered, and admissible; but because the p value–based procedure δ(p(·), c) = {I[p(Xᵢ) < c] : i = 1, …, m} is symmetric but not ordered by the LR, it is inadmissible in the compound decision problem.
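A minimal sketch of rule (5); the null N(0,1), alternative N(2,1), p = 0.5, and λ = 1 below are illustrative assumptions, under which the rule reduces to rejecting when f₁(x) > f₀(x), that is, when x > 1:

```python
import numpy as np

def phi(x, mu=0.0):
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi)

def oracle_classify(x, p, lam, f0, f1):
    """Oracle rule (5): classify x_i to the nonnull population when
    Lambda(x_i) = (1 - p) f0(x_i) / (p f1(x_i)) < 1 / lam."""
    Lambda = (1.0 - p) * f0(x) / (p * f1(x))
    return Lambda < 1.0 / lam

x = np.array([0.0, 1.5, 3.0])
decisions = oracle_classify(x, p=0.5, lam=1.0,
                            f0=lambda v: phi(v),
                            f1=lambda v: phi(v, mu=2.0))
```

Raising λ (a higher cost for a false positive) shrinks the classification region, in line with part (e) of Proposition 1.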
Remark 2. In practice, some of p, f₀, and f₁ are unknown but estimable; then we can estimate the unknown quantities and use the rule δ_λ(Λ̂, 1/λ). The subminimax rule δᵢ = sgn{xᵢ − (1/2) log[(1 − x̄)/(1 + x̄)]} given by Robbins (1951) is recovered by letting λ = 1, f₀ = φ(· + 1), f₁ = φ(· − 1), and p = (1 + x̄)/2 in (5), where φ is the pdf of a standard normal and x̄ is the sample mean.
Remark 3. Suppose that two weights λ₁ < λ₂ are chosen in the loss (2). Let Ω₁ = {x : λ₁(1 − p)f₀(x) < pf₁(x)} and Ω₂ = {x : λ₂(1 − p)f₀(x) < pf₁(x)}. Then the classification risk R*_λ is increasing in λ, because

R*_{λ₁} − R*_{λ₂} = ∫_{Ω₁\Ω₂} [λ₁(1 − p)f₀(x) − pf₁(x)] dx + ∫_{Ω₂} [(λ₁ − λ₂)(1 − p)f₀(x)] dx < 0.

Also, it is easy to see that Ω₁ ⊃ Ω₂; thus the expected number of subjects classified to the nonnull population is decreasing in λ, demonstrating that (e) in Proposition 1 is satisfied by the classification rule δ_λ(Λ, 1/λ).
3. THE ORACLE PROCEDURES FOR mFDR CONTROL
We have shown that δ_λ(Λ, 1/λ) = [I{Λ(x₁) < 1/λ}, …, I{Λ(xₘ) < 1/λ}] is the oracle rule in the weighted classification problem. Theorem 1 implies that the optimal rule for the multiple-testing problem is also of the form δ_{λ(α)}(Λ, 1/λ(α)) if Λ ∈ 𝒯, although the cutoff 1/λ(α) is not obvious. Note that Λ(x) = Lfdr(x)/[1 − Lfdr(x)] is monotonically increasing in Lfdr(x), where Lfdr(·) = (1 − p)f₀(·)/f(·) is the local FDR (Lfdr) introduced by Efron, Tibshirani, Storey, and Tusher (2001) and Efron (2004a), so the optimal rule for mFDR control is of the form δ(Lfdr(·), c) = {I[Lfdr(xᵢ) < c] : i = 1, …, m}. Lfdr has been widely used in the FDR literature to provide a Bayesian version of the frequentist FDR measure and to interpret results for individual cases (Efron 2004a). We "rediscover" it here as the optimal (oracle) statistic in the multiple-testing problem, in the sense that the thresholding rule based on Lfdr(X) controls the mFDR with the smallest mFNR.
The Lfdr statistic is defined in terms of the z values, which can be converted from other test statistics, including the t and chi-squared statistics, using inverse-probability transformations. Note that p is a global parameter; therefore, the expression Lfdr(z) = (1 − p)f₀(z)/f(z) implies that we actually rank the relative importance of the observations according to their LRs, and these rankings are generally different from the rankings of the p values unless the alternative distribution is symmetric about the null. An interesting consequence of using the Lfdr statistic in multiple testing is that an observation located farther from the null may have a lower significance level. Therefore it is possible that the test accepts a more extreme observation while rejecting a less extreme observation, implying that the rejection region is asymmetric. This is not possible for a testing procedure based on the individual p values, which always has a rejection region symmetric about the null. This phenomenon is illustrated in Figure 2 at the end of this section.
Setting the threshold for the test statistics has been the focus of the FDR literature (see, e.g., Benjamini and Hochberg 1995; Genovese and Wasserman 2004). Consider an ideal situation in which we assume that an oracle knows the true underlying distribution of the test statistics. Note that the SMLR assumption implies that the mFNR is decreasing in the mFDR; therefore, the oracle's response to such a thresholding problem is to "spend" all of the mFDR to minimize the mFNR. Besides providing a target for evaluating different multiple-testing procedures, the oracle procedure also sheds light on the development of the adaptive procedure that we propose in Section 4.
3.1 The Oracle Procedure Based on the p Values

Let G₁(t) be the distribution of the p values under the alternative, and let p be the proportion of nonnulls. We assume that G₁ is concave; then, according to Genovese and Wasserman (2002), the oracle procedure thresholds the p value at u*, the solution to the equation G₁(u)/u = (1/p − 1)(1/α − 1). Here u* is the optimal cutoff for a concave G₁ in the sense that the rule δ(p(·), u*) = {I[Pᵢ < u*] : i = 1, …, m} yields the smallest mFNR among all p value–based procedures that control the mFDR at level α. The resulting mFDR, (1 − p)u*/[(1 − p)u* + pG₁(u*)], is just α, and the mFNR is given by p[1 − G₁(u*)]/{p[1 − G₁(u*)] + (1 − p)(1 − u*)}.
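The equation for u* can be solved by bisection, because G₁(u)/u is decreasing when G₁ is concave. The sketch below uses the toy concave alternative G₁(u) = √u, an assumption for illustration only:

```python
from math import sqrt

def pvalue_oracle_cutoff(G1, p, alpha, tol=1e-12):
    """Solve G1(u)/u = (1/p - 1)(1/alpha - 1) for u in (0, 1) by bisection."""
    target = (1.0 / p - 1.0) * (1.0 / alpha - 1.0)
    lo, hi = 1e-15, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if G1(mid) / mid > target:   # ratio still too large: u* lies to the right
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

u_star = pvalue_oracle_cutoff(sqrt, p=0.2, alpha=0.1)
# the resulting mFDR, (1 - p)u / [(1 - p)u + p G1(u)], equals alpha at u_star
mfdr_at_cutoff = 0.8 * u_star / (0.8 * u_star + 0.2 * sqrt(u_star))
```

For this toy G₁ the root is available in closed form (u* = 1/36² at p = 0.2, α = 0.1), which makes the bisection easy to check.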
Figure 2. Symmetric rejection region versus asymmetric rejection region: (a) comparison of the oracle rules based on the p values and on the z values; (b) rejection regions. The normal mixture model is 0.8N(0,1) + p₁N(−3,1) + (0.2 − p₁)N(4,1). Both procedures control the mFDR at 0.10.
3.2 The Oracle Procedure Based on z Values

We write the Lfdr statistic as T_OR(Zᵢ) = (1 − p)f₀(Zᵢ)/f(Zᵢ) and call it the oracle test statistic. We assume that T_OR(Zᵢ) is distributed with marginal cdf G_OR = (1 − p)G⁰_OR + pG¹_OR, where G⁰_OR and G¹_OR are the cdfs of T_OR(Zᵢ) under the null and the alternative. Set G̃_OR(t) = 1 − G_OR(t). The mFDR of the oracle rule δ(T_OR, λ) is Q_OR(λ) = (1 − p)G⁰_OR(λ)/G_OR(λ). We assume that T_OR ∈ 𝒯 satisfies the SMLR assumption; then part (b) of Proposition 1 implies that Q_OR(λ) is increasing in the threshold λ. Therefore the oracle procedure thresholds T_OR(z) at λ_OR = sup{t ∈ (0,1) : Q_OR(t) ≤ α} = Q⁻¹_OR(α). Thus the oracle testing rule based on the z values is

δ(T_OR, λ_OR) = {I[T_OR(z₁) < λ_OR], …, I[T_OR(zₘ) < λ_OR]}.  (6)

The corresponding mFDR = Q_OR(λ_OR) is just α, and the mFNR is

Q̃_OR(λ_OR) = pG̃¹_OR(λ_OR)/G̃_OR(λ_OR),  (7)

where G̃¹_OR(t) = 1 − G¹_OR(t).
Example 1. Consider a random sample z₁, …, zₘ from the normal mixture model

f(z) = (1 − p₁ − p₂)φ(z) + p₁φ(z − μ₁) + p₂φ(z − μ₂),  (8)

along with the corresponding m tests H₀ⁱ : μ = 0 versus H₁ⁱ : μ ≠ 0, i = 1, …, m. Let p = p₁ + p₂ be the total proportion of nonnulls. We assume that μ₁ < 0 and μ₂ > 0, so that the rejection region will be two-sided.

The p value distribution under the alternative can be derived as G₁(t) = (p₁/p)[Φ̃(z_{t/2} + μ₁) + Φ̃(z_{t/2} − μ₁)] + (p₂/p)[Φ̃(z_{t/2} + μ₂) + Φ̃(z_{t/2} − μ₂)], where Φ̃ = 1 − Φ and z_{t/2} is the upper t/2 quantile of the standard normal. Thus the optimal p value cutoff u* is the solution to the equation G₁(u)/u = (1/p − 1)(1/α − 1), with corresponding mFNR p[1 − G₁(u*)]/{p[1 − G₁(u*)] + (1 − p)(1 − u*)}.
It is easy to verify that in the normal mixture model (8), the oracle testing rule (6) is equivalent to rejecting the null when zᵢ < c_l or zᵢ > c_u. The threshold λ_OR can be obtained using the bisection method. For a given λ, c_l and c_u can be solved numerically from the equation

λ[p₁ exp(μ₁z − μ₁²/2) + p₂ exp(μ₂z − μ₂²/2)] − (1 − λ)(1 − p) = 0.

The corresponding mFDR and mFNR can be calculated as

mFDR = (1 − p)[Φ(c_l) + Φ̃(c_u)] / ((1 − p)[Φ(c_l) + Φ̃(c_u)] + p₁[Φ(c_l − μ₁) + Φ̃(c_u − μ₁)] + p₂[Φ(c_l − μ₂) + Φ̃(c_u − μ₂)])

and

mFNR = (p₁[Φ(c_u − μ₁) − Φ(c_l − μ₁)] + p₂[Φ(c_u − μ₂) − Φ(c_l − μ₂)]) / ((1 − p)[Φ(c_u) − Φ(c_l)] + p₁[Φ(c_u − μ₁) − Φ(c_l − μ₁)] + p₂[Φ(c_u − μ₂) − Φ(c_l − μ₂)]),
where Φ(·) is the cdf of a standard normal and Φ̃(·) = 1 − Φ(·).

In this example we choose the mixture model 0.8N(0,1) + p₁N(−3,1) + (0.2 − p₁)N(4,1). This is different from the model shown in Figure 1, which has alternative means −3 and 3; therefore different patterns are seen. Both oracle procedures control the mFDR at level 0.10, and we plot the mFNRs of the two procedures as functions of p₁ in Figure 2(a). We can see that, again, the p value oracle procedure is dominated by the z value oracle procedure. We set p₁ = 0.15 and characterize the rejection regions of the two oracle procedures in Figure 2(b). Some calculations show that the rejection region of the p value oracle procedure is |zᵢ| > 2.27 with mFNR = 0.046, whereas the rejection region of the z value oracle procedure is zᵢ < −1.97 and zᵢ > 3.41 with mFNR = 0.038. It is interesting to note that the z value oracle procedure rejects the observation x = −2 (p value = 0.046) but accepts the observation x = 3 (p value = 0.003). We provide more numerical comparisons in Section 5 and show that the z value oracle procedure yields a lower mFNR level than the p value oracle procedure on all possible configurations of alternative hypotheses, with the difference significant in many cases.
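A numerical sketch of how the z value oracle region can be computed for this mixture: evaluate Lfdr on a grid and bisect on the threshold λ until the mFDR of the region {Lfdr < λ} reaches 0.10. Grid integration makes the values approximate; the point of the check is the asymmetry of the region, not the exact cutoffs.

```python
import numpy as np

def phi(z, mu=0.0):
    return np.exp(-0.5 * (z - mu) ** 2) / np.sqrt(2.0 * np.pi)

# mixture of Example 1 / Figure 2(b): 0.8 N(0,1) + 0.15 N(-3,1) + 0.05 N(4,1)
z = np.linspace(-10.0, 10.0, 200001)
f0 = phi(z)
f = 0.8 * f0 + 0.15 * phi(z, -3.0) + 0.05 * phi(z, 4.0)
lfdr = 0.8 * f0 / f

def mfdr_at(lam):
    """mFDR of 'reject when Lfdr(z) < lam', by numerical integration."""
    region = lfdr < lam
    if not region.any():
        return 0.0
    return 0.8 * f0[region].sum() / f[region].sum()

lo, hi = 0.0, 1.0
for _ in range(60):                  # mFDR is increasing in lam (Proposition 1(b))
    mid = 0.5 * (lo + hi)
    if mfdr_at(mid) < 0.10:
        lo = mid
    else:
        hi = mid
lam_or = 0.5 * (lo + hi)
region = lfdr < lam_or
c_l = z[region & (z < 0)].max()      # left cutoff of the rejection region
c_u = z[region & (z > 0)].min()      # right cutoff
```

The resulting region is asymmetric about the null, with the right cutoff farther from zero than the left one, as described above.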
3.3 Connection to the Work of Spjøtvoll (1972) and Storey (2007)
An "optimal" multiple-testing procedure was introduced by Spjøtvoll (1972). Let f01, ..., f0N and f1, ..., fN be integrable functions, and let S′(γ) denote the set of all tests (ψ1, ..., ψN) that satisfy Σ_{t=1}^N ∫ ψt(x)f0t(x) dμ(x) = γ, where γ > 0 is prespecified. Spjøtvoll (1972) showed that the test (φ1, ..., φN) = {I[ft(x) > cf0t(x)], t = 1, ..., N} maximizes

Σ_{t=1}^N ∫ φt(x)ft(x) dμ(x) (9)

among all tests (ψ1, ..., ψN) ∈ S′(γ). Storey (2007) proposed the optimal discovery procedure (ODP), based on a shrinkage test statistic, that maximizes the expected number of true positives (ETP) for a fixed level of the expected number of false positives (EFP). The result of Storey (2007) on the ODP follows directly from the theorem of Spjøtvoll (1972) by choosing appropriate f0t and ft.
In the setting of the present article, both Spjøtvoll's optimal procedure and Storey's ODP depend on unknown functions that are not estimable from the data. Moreover, the optimal cutoffs for a given test level are not specified in either procedure. These limitations make the two methods inapplicable for the goal of FDR control.
The formulation of our testing procedure (6) has two advantages over the procedures of Spjøtvoll (1972) and Storey (2007). First, good estimates of the unknown functions in our oracle test statistic can be formed from the data (Jin and Cai 2007). Second, we specify the optimal cutoff for each mFDR level, which was not discussed by Spjøtvoll (1972) or Storey (2007). These advantages greatly facilitate the development of the adaptive procedure (10), in which a consistent cutoff (relative to the oracle cutoff) is suggested for each test level based on a simple step-up procedure. This task is impossible under the formulation of Spjøtvoll or Storey, because good estimates of the unknown quantities in their test "statistics" are not available and obtaining appropriate cutoffs is difficult.
906 Journal of the American Statistical Association September 2007
4 ADAPTIVE PROCEDURES FOR mFDR CONTROL
Genovese and Wasserman (2004) discussed a class of plug-in procedures for the practical implementation of the oracle procedure based on the p values. However, the plug-in idea is difficult to apply to the z value oracle procedure, because it essentially involves estimating the distribution of the oracle test statistic TOR(z), which is usually very difficult. Instead, we develop an adaptive procedure that requires only estimation of the distribution of the z values, so that the difficulty of estimating the distribution of TOR(z) is avoided. The adaptive procedure can be easily implemented by noting that the z values are usually distributed as a normal mixture after appropriate transformations are applied, and several methods for consistently estimating the normal nulls have been developed in the literature (see, e.g., Efron 2004b; Jin and Cai 2007).
In this section we first introduce an adaptive p value–based procedure proposed by Benjamini and Hochberg (2000). We then turn to the development of an adaptive procedure based on the z values and show that the adaptive z value–based procedure asymptotically attains the performance of the oracle procedure (6). Simulation studies reported in Section 5 demonstrate that the z value–based adaptive procedure is more efficient than the traditional p value–based testing approaches.
4.1 Adaptive Procedure Based on p Values
Suppose that P(1), ..., P(m) are the ranked p values from m independent tests, and let k = max{i: P(i) ≤ αi/[m(1 − p̂)]}. Then the adaptive procedure of Benjamini and Hochberg (2000), designated as BH hereinafter, rejects all H(i), i ≤ k. Genovese and Wasserman (2004) proposed the plug-in threshold t(p̂, Ĝ) for p values, where t(p, G) = sup{t: (1 − p)t/G(t) ≤ α} is the threshold of the oracle p value procedure and p̂ and Ĝ are estimates of p and G. The next theorem shows that the GW plug-in procedure and the BH adaptive procedure are equivalent when the empirical distribution Gm of the p values is used to estimate G.
Theorem 3. Let p̂ be an estimate of p, and let Gm be the empirical cdf of the p values. Then the GW plug-in procedure is equivalent to the BH adaptive procedure.
It follows from our Theorem 3 and theorem 5 of Genovese and Wasserman (2004) that the BH adaptive procedure controls the mFDR at level α asymptotically.
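To make the equivalence concrete, here is a small self-contained sketch (the data are synthetic, and the estimate p̂ is simply taken as given) that applies both the BH adaptive step-up rule and the GW plug-in threshold to the same p values:

```python
import random

def bh_adaptive(pvals, p_hat, alpha):
    """BH (2000) adaptive step-up: reject H_(1),...,H_(k) with
    k = max{i : p_(i) <= alpha * i / [m * (1 - p_hat)]}."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= alpha * rank / (m * (1.0 - p_hat)):
            k = rank
    return set(order[:k])

def gw_plugin(pvals, p_hat, alpha):
    """GW plug-in rule: reject p values below sup{t : (1-p_hat)*t/G_m(t) <= alpha},
    with G_m the empirical cdf; as in Theorem 3, only order statistics matter."""
    m = len(pvals)
    thresh = 0.0
    for i, pv in enumerate(sorted(pvals), start=1):
        if (1.0 - p_hat) * pv * m / i <= alpha:   # G_m(p_(i)) = i/m
            thresh = pv
    return {i for i, pv in enumerate(pvals) if pv <= thresh}

random.seed(1)
# synthetic data: 90 null p values kept away from 0, 10 tiny "signal" p values
pvals = ([random.uniform(0.2, 1.0) for _ in range(90)]
         + [random.random() * 1e-4 for _ in range(10)])
rej_bh = bh_adaptive(pvals, p_hat=0.1, alpha=0.10)
rej_gw = gw_plugin(pvals, p_hat=0.1, alpha=0.10)
print(len(rej_bh), rej_bh == rej_gw)
```

Both rules select exactly the ten "signal" hypotheses here, since the two inequalities are algebraic rearrangements of each other.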
4.2 Adaptive Procedure Based on z Values
Here we outline an intuitive derivation of the adaptive z value–based procedure. The derivation essentially mimics the operation of the z value oracle procedure, evaluating the distribution of TOR(z) empirically.

Let z1, ..., zm be a random sample from the mixture model (1) with cdf F = (1 − p)F0 + pF1 and pdf f = (1 − p)f0 + pf1. Let p̂, f̂0, and f̂ be consistent estimates of p, f0, and f; such estimates have been provided by, for example, Jin and Cai (2007). Define T̂OR(zi) = [(1 − p̂)f̂0(zi)/f̂(zi)] ∧ 1.
The mFDR of the decision rule δ(T̂OR, λ) = {I[T̂OR(zi) < λ]: i = 1, ..., m} is given by QOR(λ) = (1 − p)G0_OR(λ)/GOR(λ), where GOR(t) and G0_OR(t) are as defined in Section 3. Let Sλ = {z: TOR(z) < λ} be the rejection region. Then GOR(λ) = ∫_{Sλ} f(z) dz = ∫ 1{TOR(z) < λ}f(z) dz, and we estimate GOR(λ) by ĜOR(λ) = (1/m)Σ_{i=1}^m 1{T̂OR(zi) < λ}. The numerator of QOR(λ) can be written as (1 − p)G0_OR(λ) = (1 − p)∫_{Sλ} f0(z) dz = ∫ 1{TOR(z) < λ}TOR(z)f(z) dz, and we estimate this quantity by (1/m)Σ_{i=1}^m 1{T̂OR(zi) < λ}T̂OR(zi). Then QOR(λ) can be estimated as

Q̂OR(λ) = [Σ_{i=1}^m 1{T̂OR(zi) < λ}T̂OR(zi)] / [Σ_{i=1}^m 1{T̂OR(zi) < λ}].
Set the estimated threshold as λ̂OR = sup{t ∈ (0,1): Q̂OR(t) ≤ α}, and let R = {L̂fdr(1), ..., L̂fdr(m)} be the set of ranked T̂OR(zi) values. We consider only the discrete cutoffs in R, for which the estimated mFDR reduces to Q̂OR(L̂fdr(k)) = (1/k)Σ_{i=1}^k L̂fdr(i). We propose the following adaptive step-up procedure:

Let k = max{i: (1/i)Σ_{j=1}^i L̂fdr(j) ≤ α}; (10)
then reject all H(i), i = 1, ..., k.
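The step-up rule (10) is a one-pass scan over the sorted Lfdr values; a minimal sketch with toy numbers (not real data):

```python
def adaptive_z(lfdr, alpha):
    """Step-up rule (10): sort the (estimated) Lfdr values and reject
    the k smallest, k = max{i : (1/i) * sum_{j<=i} Lfdr_(j) <= alpha}."""
    order = sorted(range(len(lfdr)), key=lambda i: lfdr[i])
    k, running = 0, 0.0
    for i, idx in enumerate(order, start=1):
        running += lfdr[idx]
        if running / i <= alpha:
            k = i
    return set(order[:k])

# toy Lfdr values: the running average, not each value, must stay below alpha
lfdr = [0.01, 0.02, 0.15, 0.30, 0.8, 0.9]
print(sorted(adaptive_z(lfdr, alpha=0.10)))
```

Note that the third hypothesis has Lfdr .15 > α yet is rejected, because the running average (.01 + .02 + .15)/3 = .06 is still below α; rejections are calibrated by the average, not the individual, local false discovery rate.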
Similar to the discussion in the proof of Theorem 3, it is sufficient to consider only the discrete cutoffs in R, and the adaptive procedure (10) is equivalent to the plug-in procedure δ(T̂OR, λ̂OR) = {I[T̂OR(z1) < λ̂OR], ..., I[T̂OR(zm) < λ̂OR]}, which would otherwise be very difficult to implement because obtaining λ̂OR directly is difficult.
Remark 4. The procedure (10) is more adaptive than the BH adaptive procedure in the sense that it adapts to both the global feature p and the local feature f0/f; in contrast, the BH method adapts only to the global feature p. Suppose that we use the theoretical null N(0,1) in the expression TOR = (1 − p)f0/f. The p value approaches treat the points −z and z equally, whereas the z value approaches evaluate the relative importance of −z and z according to their estimated densities. For example, if there is evidence in the data that there are more nonnulls around −z [i.e., f(−z) is larger], then an observation at −z will be ranked correspondingly higher than an observation at z.
Remark 5. In the FDR literature, z value–based methods such as the Lfdr procedure (Efron 2004a,b) have been used only to calculate individual significance levels, whereas p value–based procedures have been used for global FDR control to identify nonnull cases. It is notable that the goals of global error control and individual case interpretation are naturally unified in the adaptive procedure (10).
The next two theorems show that the adaptive procedure (10) asymptotically attains the performance of the oracle procedure based on the z values, in the sense that both the mFDR and mFNR levels achieved by the oracle procedure (6) are also asymptotically achieved by the adaptive z value–based procedure (10).
Theorem 4 (Asymptotic validity of the adaptive procedure). Let θi, i = 1, ..., m, be iid Bernoulli(p) variables, and let xi|θi, i = 1, ..., m, be independent observations from the mixture model (1) with pdf f = (1 − p)f0 + pf1. Suppose that
Sun and Cai Rules for False Discovery Rate Control 907
f is continuous and positive on the real line. Assume that TOR(zi) = (1 − p)f0(zi)/f(zi) is distributed with marginal pdf g = (1 − p)g0 + pg1 and that TOR ∈ T satisfies the SMLR assumption. Let p̂, f̂0, and f̂ be estimates of p, f0, and f such that p̂ →p p, E‖f̂ − f‖² → 0, and E‖f̂0 − f0‖² → 0. Define the test statistic T̂OR(zi) = (1 − p̂)f̂0(zi)/f̂(zi), and let L̂fdr(1), ..., L̂fdr(m) be the ranked values of T̂OR(zi); then the adaptive procedure (10) controls the mFDR at level α asymptotically.
Theorem 5 (Asymptotic attainment of the adaptive procedure). Assume that the random sample z1, ..., zm and the test statistics TOR(zi) and T̂OR(zi) are the same as in Theorem 4. Then the mFNR level of the adaptive procedure (10) is Q̃OR(λOR) + o(1), where Q̃OR(λOR), given in (7), is the mFNR level achieved by the z value oracle procedure (6).
5 NUMERICAL RESULTS
We now turn to the numerical performance of our adaptive z value–based procedure (10). The procedure is easy to implement; the R code for it is available at http://stat.wharton.upenn.edu/~tcai/paper/html/FDR.html. The goal of this section is to provide numerical examples showing that the conventional p value–based procedures are inefficient. We also explore the situations in which these p value–based procedures can be substantially improved by the z value–based procedures. We describe the application of our procedure to the analysis of microarray data from an HIV study in Section 6.
In our numerical studies we consider the following normal mixture model:

Zi ∼ (1 − p)N(μ0, σ0²) + pN(μi, σi²), (11)

where (μi, σi²) follows some bivariate distribution F(μ, σ²). This model can be used to approximate many mixture distributions and is found in a wide range of applications (see, e.g., Magder and Zeger 1996; Efron 2004b). We compare the p value–based and z value–based procedures in two numerical examples.
Numerical Study 1. We compare the p value and z value oracle procedures in the normal mixture model (8), a special case of (11). The algorithm for calculating the oracle cutoffs and the corresponding mFNRs is given in Example 1 at the end of Section 3. Figure 3 compares the performance of these two oracle procedures. Panel (a) plots the mFNR of the two oracle procedures as a function of p1, where the mFDR level is set at .10, the alternative means are μ1 = −3 and μ2 = 3, and the total proportion of nonnulls is p = .20. Panel (b) plots the mFNR as a function of p1 in the same setting, except with alternative means μ1 = −3 and μ2 = 6. In panel (c), mFDR = .10, p1 = .18, p2 = .02, and μ1 = −3, and we plot the mFNR as a function of μ2. Panel (d) plots the mFNR as a function of the mFDR level while holding μ1 = −3, μ2 = 1, p1 = .02, and p2 = .18 fixed. We discuss these results at the end of this section.
Numerical Study 2. We compare the step-up procedure (Benjamini and Hochberg 1995), the adaptive p value–based procedure (Benjamini and Hochberg 2000; Genovese and Wasserman 2004), and the adaptive z value–based procedure (10), designated by BH95, BHGW, and AdaptZ hereinafter. Note that the BH95 step-up procedure is easier to implement than either BHGW or AdaptZ: the BHGW procedure requires estimating the proportion of nonnulls, and the AdaptZ procedure also requires an additional density estimation step.
Figure 3. Comparison of the p value and z value oracle rules. (a) mFDR = .10, μ = (−3, 3); (b) mFDR = .10, μ = (−3, 6); (c) mFDR = .10, p1 = .18, p2 = .02, μ1 = −3; (d) μ = (−3, 1), p = (.02, .18).
Figure 4. Comparison of procedures for FDR control: the BH step-up procedure based on the p values, the BHGW adaptive procedure based on the p values, and the adaptive procedure based on the z values. (a) μ = (−3, 3), p = .20; (b) μ = (−3, 6), p = .20; (c) μ1 = −3, p = (.08, .02); (d) μ1 = −3, p = (.18, .02). The mFDR is controlled at level .10.
The alternative means μ1i and μ2i are generated from the uniform distributions U(θ1 − δ1, θ1 + δ1) and U(θ2 − δ2, θ2 + δ2), after which zi is generated from the normal mixture model (8) based on μ1i and μ2i, i = 1, ..., m. This hierarchical model can also be viewed as a special case of the mixture model (11). In estimating the Lfdr statistic Lfdr(zi) = (1 − p)f0(zi)/f(zi), f0 is chosen to be the theoretical null density N(0,1), p is estimated consistently using the approach of Jin and Cai (2007), and f is estimated using a kernel density estimator with bandwidth chosen by cross-validation. The comparison results are displayed in Figure 4, where the top row plots the mFNR as a function of p1 and the bottom row plots the mFNR as a function of θ2, with other quantities held fixed. All points in the plots are calculated based on a sample size of m = 5,000 and 200 replications.
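A stripped-down version of such a simulation, using the true mixture densities in place of the estimated ones (so it implements the oracle-statistic step-up rather than the full AdaptZ procedure with kernel estimation), might look like:

```python
import math
import random

random.seed(7)
m, alpha = 5000, 0.10
p0, p1, p2 = 0.80, 0.15, 0.05
mus = [0.0, -3.0, 4.0]   # null mean and two (assumed) alternative means

def phi(z, mu):
    # N(mu, 1) density
    return math.exp(-0.5 * (z - mu) ** 2) / math.sqrt(2.0 * math.pi)

def lfdr(z):
    # oracle Lfdr: true densities assumed known for this sketch
    f = p0 * phi(z, mus[0]) + p1 * phi(z, mus[1]) + p2 * phi(z, mus[2])
    return p0 * phi(z, mus[0]) / f

theta, z = [], []
for _ in range(m):
    u = random.random()
    comp = 0 if u < p0 else (1 if u < p0 + p1 else 2)
    theta.append(0 if comp == 0 else 1)
    z.append(random.gauss(mus[comp], 1.0))

# step-up rule (10) applied to the oracle Lfdr values
order = sorted(range(m), key=lambda i: lfdr(z[i]))
k, running = 0, 0.0
for i, idx in enumerate(order, start=1):
    running += lfdr(z[idx])
    if running / i <= alpha:
        k = i
rejected = order[:k]
fdp = sum(1 for i in rejected if theta[i] == 0) / max(k, 1)
print(k, round(fdp, 3))
```

By construction the average Lfdr of the rejected set is at most α, and the realized false discovery proportion fluctuates around that level.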
The following observations can be made based on the results from these two numerical studies:

a. When |μ1| = |μ2|, the mFNR remains constant for the p value oracle procedure; in contrast, the mFNR for the z value oracle procedure first increases and then decreases [Fig. 3(a)]. The p value and z value oracle procedures yield the same mFNR levels only when the alternative is symmetric about the null. This reveals that the z value procedure adapts to asymmetry in the alternative distribution, whereas the p value procedure does not. Similar phenomena are seen in Figure 4 for the adaptive procedures.

b. The p value oracle procedure is dominated by the z value oracle procedure. The largest difference occurs when |μ1| < μ2 and p1 > p2, where the alternative distribution is highly asymmetric about the null [Figs. 3(b) and 3(c)]. Similarly, BH95 is dominated by BHGW, which is in turn dominated by AdaptZ (Fig. 4).

c. The mFNRs of both the p value and z value procedures decrease as μ2 moves away from 0.

d. Within a reasonable range (mFDR < .6), the mFNR is decreasing in the mFDR [Fig. 3(d)], which verifies part (d) of Proposition 1, one consequence of our SMLR assumption.

Additional simulation results show that the difference in the mFNRs of the p value and z value procedures increases with the proportion of nonnulls, and that the adaptive procedures (BHGW and AdaptZ) control the mFDR at the nominal level α approximately, whereas the BH95 procedure controls the mFDR at a lower level.
6 APPLICATION TO MICROARRAY DATA
We now illustrate our method in the analysis of the microarray data from an HIV study. The goal of the HIV study (van't Wout et al. 2003) is to discover differentially expressed genes
Figure 5. Histograms of the HIV data: (a) z values; (b) p values; (c) transformed p values, approximately distributed as uniform (0,1) for the null cases.
between HIV-positive patients and HIV-negative controls. Gene expression levels were measured for four HIV-positive patients and four HIV-negative controls on the same N = 7,680 genes. A total of N two-sample t-tests were performed, and N two-sided p values were obtained. The z values were converted from the t statistics using the transformation zi = Φ−1[G0(ti)], where Φ and G0 are the cdf's of a standard normal variable and a t variable with 6 degrees of freedom. The histograms of the z values and p values are presented in Figure 5. An important feature of this data set is that the z value distribution is asymmetric about the null; the distribution is skewed to the right.
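The t-to-z conversion can be sketched as follows; the t cdf with 6 degrees of freedom is obtained here by numerical integration of its density, and the inverse normal cdf by bisection (an illustration only, not the authors' code):

```python
import math

def t6_pdf(x):
    # Student-t density with 6 df: c * (1 + x^2/6)^(-7/2)
    c = math.gamma(3.5) / (math.sqrt(6.0 * math.pi) * math.gamma(3.0))
    return c * (1.0 + x * x / 6.0) ** (-3.5)

def t6_cdf(x, lo=-60.0, n=20000):
    # trapezoidal integration of the density; adequate for an illustration
    h = (x - lo) / n
    s = 0.5 * (t6_pdf(lo) + t6_pdf(x))
    for i in range(1, n):
        s += t6_pdf(lo + i * h)
    return s * h

def Phi_inv(q):
    Phi = lambda v: 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if Phi(mid) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def t_to_z(t):
    # z_i = Phi^{-1}[G0(t_i)], with G0 the cdf of t with 6 df
    return Phi_inv(t6_cdf(t))

print(t_to_z(0.0), t_to_z(2.0))
```

The transformation is monotone, maps t = 0 to z = 0, and pulls the heavy t tails in toward normal tails.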
When the null hypothesis is true, the p values and z values should follow their theoretical null distributions, which are uniform and standard normal. But the theoretical nulls are usually quite different from the empirical nulls for data arising from microarray experiments (see Efron 2004b for more discussion on the choice of the null in multiple testing). We take the approach of Jin and Cai (2007) to estimate the null distribution as N(μ̂0, σ̂0²); the estimates μ̂0 and σ̂0² are consistent. We then estimate the proportion of nonnulls p based on μ̂0 and σ̂0². The marginal density f is estimated by a kernel density estimate f̂ with the bandwidth chosen by cross-validation. The Lfdr statistics are then calculated as L̂fdr(zi) = (1 − p̂)f̂0(zi)/f̂(zi), and the transformed p values are obtained as F̂0(zi), where F̂0 is the estimated null cdf Φ((x − μ̂0)/σ̂0). As shown in Figure 5(c), after transformation the distribution of the transformed p values is approximately uniform when the null is true.
We compare the BH95, BHGW, and AdaptZ procedures using both the theoretical nulls and the estimated nulls. We calculate the number of rejections at each mFDR level; the results are shown in Figure 6. For Figure 6(a), f0 is chosen to be the theoretical null N(0,1), and the estimated proportion of nulls is 1; therefore, the BH95 and BHGW procedures yield the same number of rejections. For Figure 6(b), the estimated null distribution is N(−.08, .77²), with estimated proportion of nulls p̂0 = .94. Transformed p values as well as the Lfdr statistics are calculated according to the estimated null. The following observations can be made from the results displayed:
a. The number of rejections is increasing as a function of the mFDR.

b. For both the p value–based and z value–based approaches, more hypotheses are rejected when the estimated null is used.

c. Both comparisons show that AdaptZ is more powerful than BH95 and BHGW, which are based on p values.
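The role of the estimated null in the p value transformation can be checked with a short simulation; the values μ̂0 = −.08 and σ̂0 = .77 are taken from the estimates reported above:

```python
import math
import random

random.seed(3)
mu0, sigma0 = -0.08, 0.77   # estimated empirical null from the text
Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# z values drawn from the empirical null become uniform after the F0-transform
z = [random.gauss(mu0, sigma0) for _ in range(10000)]
tp = [Phi((zi - mu0) / sigma0) for zi in z]
print(round(sum(tp) / len(tp), 3))
```

Transforming with the theoretical null Φ(z) instead of Φ((z − μ̂0)/σ̂0) would leave these null cases visibly non-uniform, which is why the estimated null changes the rejection counts in Figure 6(b).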
7 DISCUSSION
We have developed a compound decision theory framework for inference about the mixture model (1) and have shown that the oracle procedure δ(TOR, λOR) given in (6) is optimal for mFDR control in multiple testing. We have proposed an adaptive procedure based on the z values that mimics the oracle procedure (6); the adaptive z value–based procedure attains the performance of the oracle procedure asymptotically and outperforms the traditional p value–based approaches. The decision-theoretic framework provides insight into the superiority of the adaptive z value–based procedure. Each multiple-testing procedure involves two steps: ranking and thresholding. The process of ranking the relative importance of the m tests can be viewed as a compound decision problem P = {Pij: 1 ≤ i < j ≤ m} with m(m − 1)/2 components. Each component problem Pij is a pairwise comparison with its own data yij = (xi, xj). The solution to P based on p values is simple (note that δij = I[p(xi) < p(xj)] depends on yij alone), whereas the rule based on the estimated Lfdr is compound. The gain in efficiency of the adaptive z value–based procedure is due to the fact that the scope of attention is extended from the
Figure 6. Analysis of the HIV data: number of rejections versus FDR level, for the theoretical null (a) and the estimated null (b), for the BH step-up procedure based on the p values, the BHGW adaptive procedure based on the p values, and the adaptive procedure based on the z values.
class of simple rules to the class of compound rules in the ranking process.
The compound decision-theoretic approach to multiple testing also suggests several future research directions. First, the oracle statistic (Lfdr) may no longer be optimal for ranking the significance of tests when the observations are correlated. In addition, different cutoffs should be chosen under dependency, because multiple-testing procedures developed for independent tests can be too liberal or too conservative for dependent tests (Efron 2006). Second, the oracle procedure is optimal only in the class of symmetric decision rules. We expect to develop more general models in the compound decision-theoretic framework and to extend our attention to asymmetric rules to further increase the efficiency of multiple-testing procedures. Finally, Genovese, Roeder, and Wasserman (2005), Wasserman and Roeder (2006), Rubin, Dudoit, and van der Laan (2006), and Foster and Stine (2006) have discussed methods that use different p value cutoffs for different tests when prior information is available or some correlation structure is known. It would be interesting to construct z value–based testing procedures that also exploit external information. We leave this topic for future research.
APPENDIX: PROOFS
Proof of Proposition 1
To prove part (a), note that the MLR assumption (4) implies that

∫_0^c g0(t) dt / ∫_0^c g1(t) dt = ∫_0^c g0(t) dt / ∫_0^c [g1(t)/g0(t)]g0(t) dt < ∫_0^c g0(t) dt / ∫_0^c [g1(c)/g0(c)]g0(t) dt = g0(c)/g1(c),

which is equivalent to g1(c)G0(c) < g0(c)G1(c). Thus the derivative of P(θi = 1|T(Xi) ≤ c) = pG1(c)/G(c) is (1 − p)p[g1(c)G0(c) − g0(c)G1(c)]/G(c)² < 0, so P(θi = 1|T(Xi) ≤ c) is decreasing in c. Parts (b) and (c) can be proved similarly to part (a) by noting that mFDR = p0G0(c)/G(c) and mFNR = p1[1 − G1(c)]/[1 − G(c)]. Part (d) follows from parts (b) and (c). For part (e), the classification risk of δ is Rλ = E[Lλ(θ, δ)] = λ(1 − p)G0(c) + p[1 − G1(c)]. The optimal cutoff c* that minimizes Rλ satisfies g0(c*)/g1(c*) = p/[λ(1 − p)], or equivalently λ = pg1(c*)/[(1 − p)g0(c*)]. Therefore λ (respectively c*) is monotone decreasing in c* (respectively λ) if (4) holds. Noting that r = G(c*) and that G is increasing, we conclude that r is decreasing in λ.
Proof of Theorem 1
According to part (b) of Proposition 1, for any T ∈ T and mFDR level α, there exists a unique threshold c such that the mFDR is controlled at level α by δ(T, c) = {I[T(x1) < c], ..., I[T(xm) < c]}. Let r be the expected number of rejections of δ(T, c). Then, again by part (b) of Proposition 1, there exists a unique threshold c* such that the expected number of rejections of δ(Λ, c*) is also r. Next, according to part (e) of Proposition 1, there exists a unique λ(α) with respect to the choice of c* such that the classification risk with λ(α) as the weight is minimized by δ(Λ, c*), and the expected number of subjects classified to the nonnull population is r.

Suppose that among the r rejected hypotheses, there are vL from the null and kL from the nonnull when δ(L, cL) is applied, where (L, cL) = (T, c) or (Λ, c*). Then r = vT + kT = vΛ + kΛ. Now we argue that vT ≥ vΛ and kT ≤ kΛ. If not, then suppose that vT < vΛ and kT > kΛ. Note that the classification risk can be expressed as Rλ(α) = p + (1/m)[λ(α)vL − kL], and it is easy to see that δ(T, c) would then yield a lower classification risk than δ(Λ, c*). This contradicts the fact that δ(Λ, c*) minimizes the classification risk for the choice of λ(α). Therefore we must have vT ≥ vΛ and kT ≤ kΛ.
Let mFDRL and mFNRL denote the mFDR and mFNR of δ(L, cL), where L = T, Λ. Now apply both δ = δ(T, c) and δ* = δ(Λ, c*) in the multiple-testing problem. Then the mFDR using rule δ* is mFDR* = vΛ/r ≤ vT/r = mFDRT = α, and the mFNR using δ* is mFNR* = (m1 − kΛ)/(m − r) ≤ (m1 − kT)/(m − r) = mFNRT. Therefore δ* is the testing rule in class DS that controls the mFDR at level α with the smallest mFNR.
Proof of Theorem 2
The joint distribution of θ = (θ1, ..., θm) is π(θ) = Π_i (1 − p)^(1−θi) p^θi. The posterior distribution of θ given x can be calculated as Pθ|X(θ|x) = Π_i Pθi|Xi(θi|xi), where

Pθi|Xi(θi|xi) = [I(θi = 0)(1 − p)f0(xi) + I(θi = 1)pf1(xi)] / [(1 − p)f0(xi) + pf1(xi)]. (A.1)

Let Ii = I(θi = 1), and let I = {(I1, ..., Im): Ii = 0 or 1} be the collection of all m-binary vectors; then |I| = 2^m. The elements of I are designated by superscripts, I^(j), j = 1, ..., 2^m. Note that Σ_{θi=0}^{1} [I(θi = 0)(1 − p)f0(xi) + I(θi = 1)pf1(xi)][λI(θi = 0)δi + I(θi = 1)(1 − δi)] = λ(1 − p)f0(xi)δi + pf1(xi)(1 − δi); then the posterior risk is

Eθ|X L(θ, δ) = Σ_{j=1}^{2^m} L(θ = I^(j), δ) Π_i Pθi|Xi(θi = Ii^(j)|xi)
= (1/m) Σ_i Σ_{θi=0}^{1} Pθi|Xi(θi|xi)[λI(θi = 0)δi + I(θi = 1)(1 − δi)]
= (1/m) Σ_i pf1(xi)/f(xi) + (1/m) Σ_i {[λ(1 − p)f0(xi) − pf1(xi)]/f(xi)} δi.

The Bayes rule is the simple rule δπ(x) = [δ1π(x1), ..., δmπ(xm)], where δiπ = I[λ(1 − p)f0(xi) < pf1(xi)]. The expected misclassification risk of δπ is

E Eθ|X L(θ, δπ) = E[pf1(X)/f(X)] + E{([λ(1 − p)f0(X) − pf1(X)]/f(X)) I[λ(1 − p)f0(X) < pf1(X)]}
= p + ∫_{λ(1−p)f0<pf1} [λ(1 − p)f0(x) − pf1(x)] dx.
Proof of Theorem 3
Observe that for each t, if p(i) ≤ t < p(i+1), then the number of rejections is i. In addition, the ratio (1 − p̂)t/Gm(t) increases in t over any range on which the number of rejections is constant, which implies that if t = p(i) does not satisfy the constraint, then neither does any choice of t between p(i) and p(i+1). Thus it is sufficient to investigate thresholds equal to one of the p(i)'s, for which the ratio (1 − p̂)t/Gm(t) becomes (m/i)(1 − p̂)p(i). Therefore the plug-in threshold is given by sup{p(i): (m/i)(1 − p̂)p(i) ≤ α}, which is equivalent to choosing the largest i such that p(i) ≤ iα/[m(1 − p̂)].

Proof of Theorem 4
To prove this theorem, we need the following lemmas.

Lemma A.1. Let p̂, f̂, and f̂0 be estimates such that p̂ →p p, E‖f̂ − f‖² → 0, and E‖f̂0 − f0‖² → 0; then E‖T̂OR − TOR‖² → 0.

Proof. Note that f is continuous and positive on the real line; then there exists K1 = [−M, M] such that Pr(z ∈ K1^c) → 0 as M → ∞. Let inf_{z∈K1} f(z) = l0 and Aε^f = {z: |f̂(z) − f(z)| ≥ l0/2}. Note that E‖f̂ − f‖² ≥ (l0/2)² Pr(Aε^f); then Pr(Aε^f) → 0. Thus f and f̂ are bounded below by a positive number for large m, except on an event of low probability. Similar arguments apply to the upper bounds of f and f̂, as well as to the upper and lower bounds of f0 and f̂0. Therefore we conclude that f0, f̂0, f, and f̂ are all bounded in an interval [la, lb], 0 < la < lb < ∞, for large m, except on an event, say Aε, of algebraically low probability; that is, 0 < la < inf_{z∈Aε^c} min{f0, f̂0, f, f̂} ≤ sup_{z∈Aε^c} max{f0, f̂0, f, f̂} < lb < ∞.

Noting that T̂OR − TOR = [f0 f̂(p − p̂) + (1 − p̂)f(f̂0 − f0) + (1 − p̂)f0(f − f̂)]/(f f̂), we conclude that (T̂OR − TOR)² ≤ c1(p̂ − p)² + c2(f̂0 − f0)² + c3(f̂ − f)² on Aε^c. It is easy to see that ‖T̂OR − TOR‖² is bounded by some L. Then E‖T̂OR − TOR‖² ≤ L Pr(Aε) + c1E(p̂ − p)² + c2E‖f̂ − f‖² + c3E‖f̂0 − f0‖². Note that E(p̂ − p)² → 0 by lemma 2.2 of van der Vaart (1998), and E‖f̂ − f‖² → 0 and E‖f̂0 − f0‖² → 0 by assumption; then for a given ε > 0, there exists M ∈ Z+ such that we can find Aε with Pr(Aε) < ε/(4L) and, at the same time, E(p̂ − p)² < ε/(4c1), E‖f̂ − f‖² < ε/(4c2), and E‖f̂0 − f0‖² < ε/(4c3) for all m ≥ M. Consequently, E‖T̂OR − TOR‖² < ε for m ≥ M, and the result follows.
Lemma A.2. E‖T̂OR − TOR‖² → 0 implies that T̂OR(Zi) →p TOR(Zi).

Proof. Let Aε = {z: |T̂OR(z) − TOR(z)| ≥ ε}. Then ε² Pr(Aε) ≤ E‖T̂OR − TOR‖² → 0; consequently, Pr(Aε) → 0. Therefore Pr(|T̂OR(Zi) − TOR(Zi)| ≥ ε) ≤ Pr(Aε) + Pr({|T̂OR(Zi) − TOR(Zi)| ≥ ε} ∩ Aε^c) = Pr(Aε) → 0, and the result follows.
Lemma A.3. For α < t < 1, E[1{T̂OR(Zi) < t}T̂OR(Zi)] → E[1{TOR(Zi) < t}TOR(Zi)].

Proof. T̂OR(Zi) →p TOR(Zi) (Lemma A.2) implies that T̂OR(Zi) →d TOR(Zi). Let h(x) = 1{x < t}x; then h(x) is bounded and continuous for x < t. By lemma 2.2 of van der Vaart (1998), E[h(T̂OR(Zi))] → E[h(TOR(Zi))], and the result follows.
Lemma A.4. Construct the empirical distributions ĜOR(t) = (1/m)Σ_{i=1}^m 1{T̂OR(zi) ≤ t} and Ĝ0_OR(t) = (1/m)Σ_{i=1}^m 1{T̂OR(zi) ≤ t}T̂OR(zi), and define Q̂OR(t) = Ĝ0_OR(t)/ĜOR(t), the estimated mFDR. Then, for α < t < 1, Q̂OR(t) →p QOR(t).

Proof. Let ρm = cov[T̂OR(Zi), T̂OR(Zj)], where Zi and Zj are two independent random variables from the mixture distribution F. T̂OR(Zi) →p TOR(Zi) (Lemma A.2) implies that E[T̂OR(Zi)T̂OR(Zj)] → E[TOR(Zi)TOR(Zj)]; therefore ρm = cov[T̂OR(Zi), T̂OR(Zj)] → cov[TOR(Zi), TOR(Zj)] = 0. Let σm² = var(T̂OR(Z1)); then σm² ≤ E[T̂OR(Z1)]² ≤ 1.

Let μm = Pr{T̂OR(Zi) < t} and Sm = Σ_{i=1}^m 1{T̂OR(zi) ≤ t}. Note that var(Sm)/m² = (1/m)σm² + [(m − 1)/m]ρm ≤ 1/m + ρm → 0. According to the weak law of large numbers for triangular arrays, we have (1/m)Σ_{i=1}^m 1{T̂OR(zi) ≤ t} − μm →p 0. Also note that μm = Pr{T̂OR(Zi) < t} → Pr{TOR(Zi) < t} = GOR(t); we conclude that ĜOR(t) →p GOR(t). Next let vm = E[1{T̂OR(Zi) ≤ t}T̂OR(Zi)]. Similarly, we can prove that Ĝ0_OR(t) − vm = (1/m)Σ_{i=1}^m 1{T̂OR(zi) ≤ t}T̂OR(zi) − vm →p 0. By Lemma A.3, vm → E[1{TOR(Zi) < t}TOR(Zi)] = (1 − p)∫ 1{TOR(z) < t}f0(z) dz = (1 − p)G0_OR(t), and thus Ĝ0_OR(t) →p (1 − p)G0_OR(t). Finally, note that for t > α, GOR(t) is bounded away from 0, and we obtain Q̂OR(t) →p (1 − p)G0_OR(t)/GOR(t) = QOR(t).
Lemma A.5. Define the estimated plug-in threshold λ̂OR = sup{t ∈ (0,1): Q̂OR(t) ≤ α}. If Q̂OR(t) →p QOR(t), then λ̂OR →p λOR.
Proof. Note that Q̂OR(t) is not continuous, so the consistency of Q̂OR(t) does not necessarily imply the consistency of λ̂OR. Thus we first construct an envelope for Q̂OR(t) using two continuous random functions Q̂−OR(t) and Q̂+OR(t) such that, for L̂fdr(k) < t < L̂fdr(k+1),

Q̂−OR(t) = Q̂OR(L̂fdr(k−1)) (L̂fdr(k+1) − t)/(L̂fdr(k+1) − L̂fdr(k)) + Q̂OR(L̂fdr(k)) (t − L̂fdr(k))/(L̂fdr(k+1) − L̂fdr(k))

and

Q̂+OR(t) = Q̂OR(L̂fdr(k)) (L̂fdr(k+1) − t)/(L̂fdr(k+1) − L̂fdr(k)) + Q̂OR(L̂fdr(k+1)) (t − L̂fdr(k))/(L̂fdr(k+1) − L̂fdr(k)).

Noting that Q̂OR(L̂fdr(k+1)) − Q̂OR(L̂fdr(k)) = [kL̂fdr(k+1) − Σ_{i=1}^k L̂fdr(i)]/[k(k + 1)] > 0, we have Q̂−OR(t) ≤ Q̂OR(t) ≤ Q̂+OR(t), with both Q̂−OR(t) and Q̂+OR(t) strictly increasing in t. Let λ̂−OR = sup{t ∈ (0,1): Q̂−OR(t) ≤ α} and λ̂+OR = sup{t ∈ (0,1): Q̂+OR(t) ≤ α}; then λ̂+OR ≤ λ̂OR ≤ λ̂−OR.

We claim that λ̂−OR →p λOR. If not, then there must exist ε0 and η0 such that, for every M ∈ Z+, Pr(|λ̂−OR − λOR| > ε0) > 2η0 holds for some m ≥ M. Suppose that Pr(λ̂−OR > λOR + ε0) > 2η0. It is easy to see that Q̂−OR →p QOR, because Q̂−OR(t) − Q̂+OR(t) →a.s. 0, Q̂−OR(t) ≤ Q̂OR(t) ≤ Q̂+OR(t), and Q̂OR(t) →p QOR(t). Thus there exists M ∈ Z+ such that Pr(|Q̂−OR(λOR + ε0) − QOR(λOR + ε0)| < δ0) > 1 − η0 for all m ≥ M, where we let 2δ0 = QOR(λOR + ε0) − α. Now, for this choice of m, there exists an event K1m such that Pr(K1m) ≥ 1 − η0 and, for all outcomes ω ∈ K1m, |Q̂−OR(λOR + ε0) − QOR(λOR + ε0)| < δ0. At the same time, there exists an event K2m such that Pr(K2m) ≥ 2η0 and, for all outcomes ω ∈ K2m, λ̂−OR > λOR + ε0. Let Km = K1m ∩ K2m; then Pr(Km) = Pr(K1m) + Pr(K2m) − Pr(K1m ∪ K2m) ≥ Pr(K1m) + Pr(K2m) − 1 ≥ η0. That is, Km has positive measure. For all outcomes in Km, Q̂−OR(t) is continuous and strictly increasing, and we have α = Q̂−OR(λ̂−OR) > Q̂−OR(λOR + ε0) > QOR(λOR + ε0) − δ0 = α + δ0, a contradiction. Therefore we must have λ̂−OR →p λOR. Similarly, we can prove λ̂+OR →p λOR. Noting that Q̂−OR(t) − Q̂+OR(t) →a.s. 0 implies λ̂−OR − λ̂+OR →a.s. 0, the result follows from λ̂+OR ≤ λ̂OR ≤ λ̂−OR.
Proof of Theorem 4. Implementing the adaptive procedure (10) is equivalent to choosing T̂OR(Zi) as the test statistic and λ̂OR as the threshold. The mFDR of the decision rule δ(T̂OR, λ̂OR) is mFDR = E(N10)/E(R) = (1 − p)PH0(T̂OR < λ̂OR)/P(T̂OR < λ̂OR). Noting that T̂OR →p TOR by Lemma A.2 and λ̂OR →p λOR by Lemma A.5, it follows that (1 − p)PH0(T̂OR < λ̂OR) → (1 − p)PH0(TOR < λOR) = (1 − p)G0_OR(λOR) and P(T̂OR < λ̂OR) → P(TOR < λOR) = GOR(λOR). Noting that P(TOR < λOR) is bounded away from 0, we have mFDR → (1 − p)G0_OR(λOR)/GOR(λOR) = QOR(λOR) = α.
Proof of Theorem 5
Similar to the proof of Theorem 4, the mFNR of the adaptive procedure (10) is mFNR = E(N01)/E(S) = pPH1(T̂OR > λ̂OR)/P(T̂OR > λ̂OR). By Lemmas A.2 and A.5, we have PH1(T̂OR > λ̂OR) → PH1(TOR > λOR) = G̃1_OR(λOR) and P(T̂OR > λ̂OR) → P(TOR > λOR) = G̃OR(λOR), where G̃ = 1 − G denotes a survival function. Noting that G̃OR(λOR) is bounded away from 0, we obtain mFNR → pG̃1_OR(λOR)/G̃OR(λOR) = Q̃OR(λOR).
[Received November 2006. Revised March 2007.]
REFERENCES
Benjamini, Y., and Hochberg, Y. (1995), "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing," Journal of the Royal Statistical Society, Ser. B, 57, 289–300.
——— (2000), "On the Adaptive Control of the False Discovery Rate in Multiple Testing With Independent Statistics," Journal of Educational and Behavioral Statistics, 25, 60–83.
Copas, J. (1974), "On Symmetric Compound Decision Rules for Dichotomies," The Annals of Statistics, 2, 199–204.
Efron, B. (2004a), "Local False Discovery Rate," technical report, Stanford University, Dept. of Statistics, available at http://www-stat.stanford.edu/~brad/papers/False.pdf.
——— (2004b), "Large-Scale Simultaneous Hypothesis Testing: The Choice of a Null Hypothesis," Journal of the American Statistical Association, 99, 96–104.
——— (2006), "Correlation and Large-Scale Simultaneous Testing," technical report, Stanford University, Dept. of Statistics, available at http://www-stat.stanford.edu/~brad/papers/Correlation-2006.pdf.
Efron, B., Tibshirani, R., Storey, J., and Tusher, V. (2001), "Empirical Bayes Analysis of a Microarray Experiment," Journal of the American Statistical Association, 96, 1151–1160.
Foster, D., and Stine, R. (2006), "Alpha Investing: A New Multiple Hypothesis Testing Procedure That Controls mFDR," technical report, University of Pennsylvania, Dept. of Statistics.
Genovese, C., and Wasserman, L. (2002), "Operating Characteristic and Extensions of the False Discovery Rate Procedure," Journal of the Royal Statistical Society, Ser. B, 64, 499–517.
——— (2004), "A Stochastic Process Approach to False Discovery Control," The Annals of Statistics, 32, 1035–1061.
Genovese, C., Roeder, K., and Wasserman, L. (2005), "False Discovery Control With p-Value Weighting," Biometrika, 93, 509–524.
Jin, J., and Cai, T. (2007), "Estimating the Null and the Proportion of Nonnull Effects in Large-Scale Multiple Comparisons," Journal of the American Statistical Association, 102, 495–506.
Magder, L., and Zeger, S. (1996), "A Smooth Nonparametric Estimate of a Mixing Distribution Using Mixtures of Gaussians," Journal of the American Statistical Association, 91, 1141–1151.
Robbins, H. (1951), "Asymptotically Subminimax Solutions of Compound Statistical Decision Problems," in Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA: University of California Press, pp. 131–148.
Rubin, D., Dudoit, S., and van der Laan, M. (2006), "A Method to Increase the Power of Multiple Testing Procedures Through Sample Splitting," working paper, University of California, Berkeley, Dept. of Biostatistics, available at http://www.bepress.com/ucbbiostat/paper171.
Spjøtvoll, E. (1972), "On the Optimality of Some Multiple Comparison Procedures," The Annals of Mathematical Statistics, 43, 398–411.
Storey, J. (2002), "A Direct Approach to False Discovery Rates," Journal of the Royal Statistical Society, Ser. B, 64, 479–498.
——— (2003), "The Positive False Discovery Rate: A Bayesian Interpretation and the q-Value," The Annals of Statistics, 31, 2013–2035.
——— (2007), "The Optimal Discovery Procedure: A New Approach to Simultaneous Significance Testing," Journal of the Royal Statistical Society, Ser. B, 69, 347–368.
van der Laan, M., Dudoit, S., and Pollard, K. (2004), "Multiple Testing. Part III. Augmentation Procedures for Control of the Generalized Family-Wise Error Rate and Tail Probabilities for the Proportion of False Positives," working paper, University of California, Berkeley, Dept. of Biostatistics.
van der Vaart, A. (1998), Asymptotic Statistics, Cambridge, U.K.: Cambridge University Press.
van't Wout, A., Lehrman, G., Mikheeva, S., O'Keeffe, G., Katze, M., Bumgarner, R., Geiss, G., and Mullins, J. (2003), "Cellular Gene Expression Upon Human Immunodeficiency Virus Type 1 Infection of CD4+-T-Cell Lines," Journal of Virology, 77, 1392–1402.
Wasserman, L., and Roeder, K. (2006), "Weighted Hypothesis Testing," technical report, Carnegie Mellon University, Dept. of Statistics, available at http://www.arxiv.org/abs/math.ST/0604172.
904 Journal of the American Statistical Association September 2007
3. THE ORACLE PROCEDURES FOR mFDR CONTROL

We have shown that δ(Λ, 1/λ) = {I[Λ(x1) < 1/λ], …, I[Λ(xm) < 1/λ]} is the oracle rule in the weighted classification problem. Theorem 1 implies that the optimal rule for the multiple-testing problem is also of the form δ(Λ, 1/λ(α)) if Λ ∈ T, although the cutoff 1/λ(α) is not obvious. Note that Λ(x) = Lfdr(x)/[1 − Lfdr(x)] is monotonically increasing in Lfdr(x), where Lfdr(·) = (1 − p)f0(·)/f(·) is the local FDR (Lfdr) introduced by Efron, Tibshirani, Storey, and Tusher (2001) and Efron (2004a), so the optimal rule for mFDR control is of the form δ(Lfdr, c) = {I[Lfdr(x_i) < c] : i = 1, …, m}. The Lfdr has been widely used in the FDR literature to provide a Bayesian version of the frequentist FDR measure and to interpret results for individual cases (Efron 2004a). We "rediscover" it here as the optimal (oracle) statistic in the multiple-testing problem, in the sense that the thresholding rule based on Lfdr(X) controls the mFDR with the smallest mFNR.
The Lfdr statistic is defined in terms of the z values, which can be converted from other test statistics, including the t and chi-squared statistics, using inverse-probability transformations. Note that p is a global parameter. Therefore, the expression Lfdr(z) = (1 − p)f0(z)/f(z) implies that we actually rank the relative importance of the observations according to their likelihood ratios, and that the rankings are generally different from the rankings of p values unless the alternative distribution is symmetric about the null. An interesting consequence of using the Lfdr statistic in multiple testing is that an observation located farther from the null may have a lower significance level. Therefore, it is possible that the test accepts a more extreme observation while rejecting a less extreme observation, implying that the rejection region is asymmetric. This is not possible for a testing procedure based on the individual p values, whose rejection region is always symmetric about the null. This phenomenon is illustrated in Figure 2 at the end of this section.
Setting the threshold for the test statistics has been the focus of the FDR literature (see, e.g., Benjamini and Hochberg 1995; Genovese and Wasserman 2004). Consider an ideal situation in which an oracle knows the true underlying distribution of the test statistics. Note that the SMLR assumption implies that the mFNR is decreasing in the mFDR; therefore, the oracle's response to such a thresholding problem is to "spend" all of the mFDR to minimize the mFNR. Besides providing a target for evaluating different multiple-testing procedures, the oracle procedure also sheds light on the development of the adaptive procedure that we propose in Section 4.
3.1 The Oracle Procedure Based on the p Values
Let G1(t) be the distribution of the p values under the alternative, and let p be the proportion of nonnulls. We assume that G1 is concave; then, according to Genovese and Wasserman (2002), the oracle procedure thresholds the p value at u*, the solution to the equation G1(u)/u = (1/p − 1)(1/α − 1). Here u* is the optimal cutoff for a concave G1(t), in the sense that the rule δ(p(·), u*) = {I[P_i < u*] : i = 1, …, m} yields the smallest mFNR among all p value–based procedures that control the mFDR at level α. Let Ḡ(t) = 1 − G(t). The resulting mFDR, (1 − p)u*/[(1 − p)u* + pG1(u*)], is just α, and the mFNR is given by pḠ1(u*)/[pḠ1(u*) + (1 − p)(1 − u*)].
Figure 2. Symmetric rejection region versus asymmetric rejection region. (a) Comparison of oracle rules (p value vs. z value). (b) Rejection regions. The normal mixture model is 0.8N(0, 1) + p1N(−3, 1) + (0.2 − p1)N(4, 1). Both procedures control the mFDR at 0.10.
Sun and Cai Rules for False Discovery Rate Control 905
3.2 The Oracle Procedure Based on z Values
We write the Lfdr statistic as T_OR(Z_i) = (1 − p)f0(Z_i)/f(Z_i) and call it the oracle test statistic. We assume that T_OR(Z_i) is distributed with marginal cdf G_OR = (1 − p)G0_OR + pG1_OR, where G0_OR and G1_OR are the cdf's of T_OR(Z_i) under the null and the alternative. Set Ḡ_OR(t) = 1 − G_OR(t). The mFDR of the oracle rule δ(T_OR, λ) is Q_OR(λ) = (1 − p)G0_OR(λ)/G_OR(λ). We assume that T_OR ∈ T satisfies the SMLR assumption; then part (b) of Proposition 1 implies that Q_OR(λ) is increasing in the threshold λ. Therefore, the oracle procedure thresholds T_OR(z) at λ_OR = sup{t ∈ (0, 1) : Q_OR(t) ≤ α} = Q_OR^−1(α). Thus the oracle testing rule based on z values is

δ(T_OR, λ_OR) = {I[T_OR(z1) < λ_OR], …, I[T_OR(zm) < λ_OR]}. (6)

The corresponding mFDR = Q_OR(λ_OR) is just α, and the mFNR is

Q̃_OR(λ_OR) = pḠ1_OR(λ_OR)/Ḡ_OR(λ_OR). (7)
Example 1. We consider a random sample z1, …, zm from a normal mixture model

f(z) = (1 − p1 − p2)φ(z) + p1φ(z − μ1) + p2φ(z − μ2), (8)

along with the corresponding m tests H0_i: μ = 0 versus H1_i: μ ≠ 0, i = 1, …, m. Let p = p1 + p2 be the total proportion of nonnulls. We assume that μ1 < 0 and μ2 > 0, so that the rejection region will be two-sided.

The p value distribution under the alternative can be derived as G1(t) = (p1/p){Φ̄(z_{t/2} + μ1) + Φ̄(z_{t/2} − μ1)} + (p2/p){Φ̄(z_{t/2} + μ2) + Φ̄(z_{t/2} − μ2)}, where z_{t/2} is the upper t/2 quantile of the standard normal. Thus the optimal p value cutoff u* is the solution to the equation G1(u)/u = (1/p − 1)(1/α − 1), with corresponding mFNR pḠ1(u*)/[pḠ1(u*) + (1 − p)(1 − u*)].
It is easy to verify that in the normal mixture model (8), the oracle testing rule (6) is equivalent to rejecting the null when z_i < c_l or z_i > c_u. The threshold λ_OR can be obtained using the bisection method. For a given λ, c_l and c_u can be solved numerically from the equation λ[p1 exp(μ1z − μ1²/2) + p2 exp(μ2z − μ2²/2)] − (1 − λ)(1 − p) = 0. The corresponding mFDR and mFNR can be calculated as

mFDR = (1 − p)[Φ(c_l) + Φ̄(c_u)] / {(1 − p)[Φ(c_l) + Φ̄(c_u)] + p1[Φ(c_l − μ1) + Φ̄(c_u − μ1)] + p2[Φ(c_l − μ2) + Φ̄(c_u − μ2)]}

and

mFNR = {p1[Φ(c_u − μ1) − Φ(c_l − μ1)] + p2[Φ(c_u − μ2) − Φ(c_l − μ2)]} / {(1 − p)[Φ(c_u) − Φ(c_l)] + p1[Φ(c_u − μ1) − Φ(c_l − μ1)] + p2[Φ(c_u − μ2) − Φ(c_l − μ2)]},

where Φ(·) is the cdf of a standard normal and Φ̄(·) = 1 − Φ(·).

In this example we choose the mixture model 0.8N(0, 1) + p1N(−3, 1) + (0.2 − p1)N(4, 1). This is different from the model shown in Figure 1, which has alternative means −3 and 3; therefore, different patterns are seen. Both oracle procedures control the mFDR at level 0.10, and we plot the mFNRs of the two procedures as functions of p1 in Figure 2(a). We can see that, again, the p value oracle procedure is dominated by the z value oracle procedure. We set p1 = 0.15 and characterize the rejection regions of the two oracle procedures in Figure 2(b). Some calculations show that the rejection region of the p value oracle procedure is |z_i| > 2.27 with mFNR = 0.046, whereas the rejection region of the z value oracle procedure is z_i < −1.97 and z_i > 3.41 with mFNR = 0.038. It is interesting to note that the z value oracle procedure rejects observation x = −2 (p value = 0.046) but accepts observation x = 3 (p value = 0.003). We provide more numerical comparisons in Section 5, where we show that the z value oracle procedure yields a lower mFNR level than the p value oracle procedure for all possible configurations of alternative hypotheses, with the difference significant in many cases.
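The computations described in Example 1 can be reproduced numerically. The sketch below (pure Python; the function names are ours, and p1 = 0.15 follows the text) finds c_l and c_u for a given λ by bisection on the threshold equation, then searches for λ_OR so that the mFDR formula above equals α:

```python
import math

# Parameters of Example 1 / Figure 2: 0.8 N(0,1) + p1 N(-3,1) + (0.2 - p1) N(4,1)
p1, mu1, mu2, alpha = 0.15, -3.0, 4.0, 0.10
p2 = 0.20 - p1
p = p1 + p2                      # total proportion of nonnulls

def Phi(x):                      # standard normal cdf
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def Sv(x):                       # survival function, 1 - Phi
    return 1.0 - Phi(x)

def cutoffs(lam):
    """Solve lam*[p1*exp(mu1*z - mu1^2/2) + p2*exp(mu2*z - mu2^2/2)]
    = (1 - lam)*(1 - p) for the two roots c_l < 0 < c_u by bisection."""
    def h(z):
        return (lam * (p1 * math.exp(mu1 * z - mu1 ** 2 / 2)
                       + p2 * math.exp(mu2 * z - mu2 ** 2 / 2))
                - (1 - lam) * (1 - p))

    def bisect(lo, hi):
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            if h(lo) * h(mid) <= 0:
                hi = mid
            else:
                lo = mid
        return 0.5 * (lo + hi)

    return bisect(-15.0, 0.0), bisect(0.0, 15.0)

def mfdr(lam):
    cl, cu = cutoffs(lam)
    null_rej = (1 - p) * (Phi(cl) + Sv(cu))
    alt_rej = (p1 * (Phi(cl - mu1) + Sv(cu - mu1))
               + p2 * (Phi(cl - mu2) + Sv(cu - mu2)))
    return null_rej / (null_rej + alt_rej)

def mfnr(cl, cu):
    alt_acc = (p1 * (Phi(cu - mu1) - Phi(cl - mu1))
               + p2 * (Phi(cu - mu2) - Phi(cl - mu2)))
    null_acc = (1 - p) * (Phi(cu) - Phi(cl))
    return alt_acc / (alt_acc + null_acc)

# Outer bisection on lambda: the mFDR is increasing in lambda (Proposition 1).
lo, hi = 1e-6, 0.99
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if mfdr(mid) > alpha:
        hi = mid
    else:
        lo = mid
lam_or = 0.5 * (lo + hi)
cl, cu = cutoffs(lam_or)
print(f"lambda_OR = {lam_or:.3f}, reject z < {cl:.2f} or z > {cu:.2f}, "
      f"mFNR = {mfnr(cl, cu):.3f}")
```

Under this setup the reported region should agree with the asymmetric region described in the text, with c_l close to −2 and c_u well above 3.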
3.3 Connection to the Work of Spjøtvoll (1972) and Storey (2007)
An "optimal" multiple-testing procedure was introduced by Spjøtvoll (1972). Let f01, …, f0N and f1, …, fN be integrable functions. Let S′(γ) denote the set of all tests (ψ1, …, ψN) that satisfy ∑_{t=1}^N ∫ ψ_t(x)f_{0t}(x) dμ(x) = γ, where γ > 0 is prespecified. Spjøtvoll (1972) showed that the test (φ1, …, φN) = {I[f_t(x) > cf_{0t}(x)] : t = 1, …, N} maximizes

∑_{t=1}^N ∫ φ_t(x)f_t(x) dμ(x) (9)

among all tests (ψ1, …, ψN) ∈ S′(γ). Storey (2007) proposed the optimal discovery procedure (ODP), based on a shrinkage test statistic, which maximizes the expected number of true positives (ETP) for a fixed level of the expected number of false positives (EFP). The result of Storey (2007) on the ODP follows directly from the theorem of Spjøtvoll (1972) by choosing appropriate f_{0t} and f_t.
In the setting of the present article, both Spjøtvoll's optimal procedure and Storey's ODP depend on unknown functions that are not estimable from the data. Moreover, neither procedure specifies the optimal cutoff for a given test level. These limitations make the two methods inapplicable for the goal of FDR control.
The formulation of our testing procedure (6) has two advantages over the procedures of Spjøtvoll (1972) and Storey (2007). First, good estimates of the unknown functions in our oracle test statistic can be formed from the data (Jin and Cai 2007). Second, we specify the optimal cutoff for each mFDR level, which was not discussed by Spjøtvoll (1972) or Storey (2007). These advantages greatly facilitate the development of the adaptive procedure (10), in which a consistent cutoff (relative to the oracle cutoff) is suggested for each test level based on a simple step-up procedure. This task is impossible based on the formulations of Spjøtvoll or Storey, because good estimates of the unknown quantities in their test "statistics" are not available and obtaining appropriate cutoffs is difficult.
4. ADAPTIVE PROCEDURES FOR mFDR CONTROL
Genovese and Wasserman (2004) discussed a class of plug-in procedures for the practical implementation of the oracle procedure based on the p values. However, the plug-in idea is difficult to apply to the z value oracle procedure, because it essentially involves estimating the distribution of the oracle test statistic T_OR(z), which is usually very difficult. Instead, we develop an adaptive procedure that requires only estimation of the distribution of the z values, so that the difficulty of estimating the distribution of T_OR(z) is avoided. The adaptive procedure can be easily implemented by noting that the z values are usually distributed as a normal mixture after appropriate transformations are applied, and several methods for consistently estimating the normal nulls have been developed in the literature (see, e.g., Efron 2004b; Jin and Cai 2007).
In this section we first introduce an adaptive p value–based procedure proposed by Benjamini and Hochberg (2000). We then turn to the development of an adaptive procedure based on the z values. We show that the adaptive z value–based procedure asymptotically attains the performance of the oracle procedure (6). We report simulation studies in Section 5, demonstrating that the z value–based adaptive procedure is more efficient than the traditional p value–based testing approaches.
4.1 Adaptive Procedure Based on p Values
Suppose that P_(1), …, P_(m) are the ranked p values from m independent tests, and let k = max{i : P_(i) ≤ αi/[m(1 − p̂)]}. Then the adaptive procedure of Benjamini and Hochberg (2000), designated as BH hereinafter, rejects all H_(i), i ≤ k. Genovese and Wasserman (2004) proposed the plug-in threshold t(p̂, Ĝ) for p values, where t(p, G) = sup{t : (1 − p)t/G(t) ≤ α} is the threshold of the oracle p value procedure and p̂ and Ĝ are estimates of p and G. The next theorem shows that the GW plug-in procedure and the BH adaptive procedure are equivalent when the empirical distribution of the p values, Ĝm, is used to estimate G.

Theorem 3. Let p̂ be an estimate of p, and let Ĝm be the empirical cdf of the p values. Then the GW plug-in procedure is equivalent to the BH adaptive procedure.

It follows from our Theorem 3 and theorem 5 of Genovese and Wasserman (2004) that the BH adaptive procedure controls the mFDR at level α asymptotically.
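In code, the BH adaptive step-up rule amounts to a single pass over the sorted p values. A minimal sketch (the function name is ours; the estimate p̂ of the nonnull proportion is taken as given):

```python
def bh_adaptive(pvalues, alpha, p_hat=0.0):
    """Adaptive Benjamini-Hochberg step-up: reject the k smallest p values,
    where k = max{i : p_(i) <= alpha * i / [m * (1 - p_hat)]}.
    With p_hat = 0 this reduces to the BH95 step-up procedure.
    Returns the indices (into the original array) of rejected hypotheses."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / (m * (1.0 - p_hat)):
            k = rank
    return order[:k]
```

For example, with p values (0.001, 0.008, 0.039, 0.041, 0.9) and α = 0.05, the plain BH95 rule (p̂ = 0) rejects the two smallest hypotheses, whereas p̂ = 0.5 doubles each threshold and rejects four, illustrating the power gain from adapting to the nonnull proportion.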
4.2 Adaptive Procedure Based on z Values
Here we outline the steps of an intuitive derivation of the adaptive z value–based procedure. The derivation essentially involves mimicking the operation of the z value oracle procedure and evaluating the distribution of T_OR(z) empirically.
Let z1, …, zm be a random sample from the mixture model (1) with cdf F = (1 − p)F0 + pF1 and pdf f = (1 − p)f0 + pf1. Let p̂, f̂0, and f̂ be consistent estimates of p, f0, and f. Such estimates have been provided by, for example, Jin and Cai (2007). Define T̂_OR(z_i) = {(1 − p̂)f̂0(z_i)/f̂(z_i)} ∧ 1.

The mFDR of the decision rule δ(T_OR, λ) = {I[T_OR(z_i) < λ] : i = 1, …, m} is given by Q_OR(λ) = (1 − p)G0_OR(λ)/G_OR(λ), where G_OR(t) and G0_OR(t) are as defined in Section 3. Let S_λ = {z : T_OR(z) < λ} be the rejection region. Then G_OR(λ) = ∫_{S_λ} f(z) dz = ∫ 1{T_OR(z) < λ}f(z) dz. We estimate G_OR(λ) by Ĝ_OR(λ) = (1/m)∑_{i=1}^m 1{T̂_OR(z_i) < λ}. The numerator of Q_OR(λ) can be written as (1 − p)G0_OR(λ) = (1 − p)∫_{S_λ} f0(z) dz = ∫ 1{T_OR(z) < λ}T_OR(z)f(z) dz, and we estimate this quantity by (1/m)∑_{i=1}^m 1{T̂_OR(z_i) < λ}T̂_OR(z_i). Then Q_OR(λ) can be estimated as

Q̂_OR(λ) = [∑_{i=1}^m 1{T̂_OR(z_i) < λ}T̂_OR(z_i)] / [∑_{i=1}^m 1{T̂_OR(z_i) < λ}].

Set the estimated threshold as λ̂_OR = sup{t ∈ (0, 1) : Q̂_OR(t) ≤ α}, and let R be the set of ranked values of T̂_OR(z_i): R = {L̂fdr_(1), …, L̂fdr_(m)}. We consider only the discrete cutoffs in the set R, for which the estimated mFDR reduces to Q̂_OR(L̂fdr_(k)) = (1/k)∑_{i=1}^k L̂fdr_(i). We propose the following adaptive step-up procedure:

Let k = max{i : (1/i)∑_{j=1}^i L̂fdr_(j) ≤ α}; then reject all H_(i), i = 1, …, k. (10)
Similar to the discussion in the proof of Theorem 3, it is sufficient to consider only the discrete cutoffs in R, and the adaptive procedure (10) is equivalent to the plug-in procedure

δ(T̂_OR, λ̂_OR) = {I[T̂_OR(z1) < λ̂_OR], …, I[T̂_OR(zm) < λ̂_OR]},

which would be very difficult to implement directly, because obtaining λ̂_OR is difficult.
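The step-up rule (10) is straightforward once the estimated Lfdr statistics are in hand. A minimal sketch (the function name is ours): sort the statistics and find the largest k whose running average stays below α:

```python
def adaptz_stepup(lfdr, alpha):
    """Adaptive z-value procedure (10): take
    k = max{i : (1/i) * (sum of the i smallest Lfdr values) <= alpha}
    and reject the corresponding hypotheses.
    Returns indices of rejected hypotheses (in the original order)."""
    order = sorted(range(len(lfdr)), key=lambda i: lfdr[i])
    running_sum, k = 0.0, 0
    for rank, i in enumerate(order, start=1):
        running_sum += lfdr[i]
        # running average = estimated mFDR of rejecting the top `rank` cases
        if running_sum / rank <= alpha:
            k = rank
    return order[:k]
```

With Lfdr values (0.01, 0.05, 0.20, 0.50, 0.90) and α = 0.10, the running averages are 0.01, 0.03, 0.0867, 0.19, …, so k = 3 and the three smallest are rejected. Note that the running average of the rejected Lfdr values is itself the estimated mFDR of the rejection set, which is how the rule unifies global error control with individual-case interpretation.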
Remark 4. The procedure (10) is more adaptive than the BH adaptive procedure, in the sense that it adapts to both the global feature p and the local feature f0/f. In contrast, the BH method adapts only to the global feature p. Suppose that we use the theoretical null N(0, 1) in the expression T_OR = (1 − p)f0/f. The p value approaches treat the points −z and z equally, whereas the z value approaches evaluate the relative importance of −z and z according to their estimated densities. For example, if there is evidence in the data that there are more nonnulls around −z [i.e., f(−z) is larger], then observation −z will be correspondingly ranked higher than observation z.
Remark 5. In the FDR literature, z value–based methods such as the Lfdr procedure (Efron 2004a,b) are used only to calculate individual significance levels, whereas the p value–based procedures are used for global FDR control to identify nonnull cases. It is also notable that the goals of global error control and individual case interpretation are naturally unified in the adaptive procedure (10).
The next two theorems show that the adaptive procedure (10) asymptotically attains the performance of the oracle procedure based on the z values, in the sense that both the mFDR and mFNR levels achieved by the oracle procedure (6) are also asymptotically achieved by the adaptive z value–based procedure (10).
Theorem 4 (Asymptotic validity of the adaptive procedure). Let θ_i, i = 1, …, m, be iid Bernoulli(p) variables, and let x_i|θ_i, i = 1, …, m, be independent observations from the mixture model (1) with pdf f = (1 − p)f0 + pf1. Suppose that f is continuous and positive on the real line. Assume that T_OR(z_i) = (1 − p)f0(z_i)/f(z_i) is distributed with marginal pdf g = (1 − p)g0 + pg1 and that T_OR ∈ T satisfies the SMLR assumption. Let p̂, f̂0, and f̂ be estimates of p, f0, and f such that p̂ → p in probability, E‖f̂ − f‖² → 0, and E‖f̂0 − f0‖² → 0. Define the test statistic T̂_OR(z_i) = (1 − p̂)f̂0(z_i)/f̂(z_i), and let L̂fdr_(1), …, L̂fdr_(m) be the ranked values of T̂_OR(z_i). Then the adaptive procedure (10) controls the mFDR at level α asymptotically.
Theorem 5 (Asymptotic attainment of the adaptive procedure). Assume that the random sample z1, …, zm and the test statistics T_OR(z_i), T̂_OR(z_i) are the same as in Theorem 4. Then the mFNR level of the adaptive procedure (10) is Q̃_OR(λ_OR) + o(1), where Q̃_OR(λ_OR), given in (7), is the mFNR level achieved by the z value oracle procedure (6).
5. NUMERICAL RESULTS
We now turn to the numerical performance of our adaptive z value–based procedure (10). The procedure is easy to implement; the R code for it is available at http://stat.wharton.upenn.edu/~tcai/paper/html/FDR.html. The goal of this section is to provide numerical examples showing that the conventional p value–based procedures are inefficient. We also explore the situations in which these p value–based procedures can be substantially improved by the z value–based procedures. We describe the application of our procedure to the analysis of microarray data from an HIV study in Section 6.
In our numerical studies we consider the following normal mixture model:

Z_i ~ (1 − p)N(μ0, σ0²) + pN(μ_i, σ_i²), (11)

where (μ_i, σ_i²) follows some bivariate distribution F(μ, σ²). This model can be used to approximate many mixture distributions and is found in a wide range of applications (see, e.g., Magder and Zeger 1996; Efron 2004b). We compare the p value–based and z value–based procedures in two numerical examples.
Numerical Study 1. We compare the p value and z value oracle procedures in the normal mixture model (8), a special case of (11). The algorithm for calculating the oracle cutoffs and the corresponding mFNRs is given in Example 1 at the end of Section 3. Figure 3 compares the performance of these two oracle procedures. Panel (a) plots the mFNR of the two oracle procedures as a function of p1, where the mFDR level is set at 0.10, the alternative means are μ1 = −3 and μ2 = 3, and the total proportion of nonnulls is p = 0.2. Panel (b) plots the mFNR as a function of p1 in the same setting, except with alternative means μ1 = −3 and μ2 = 6. In panel (c), mFDR = 0.10, p1 = 0.18, p2 = 0.02, and μ1 = −3, and we plot the mFNR as a function of μ2. Panel (d) plots the mFNR as a function of the mFDR level while holding μ1 = −3, μ2 = 1, p1 = 0.02, and p2 = 0.18 fixed. We discuss these results at the end of this section.
Numerical Study 2. We compare the step-up procedure (Benjamini and Hochberg 1995), the adaptive p value–based procedure (Benjamini and Hochberg 2000; Genovese and Wasserman 2004), and the adaptive z value–based procedure (10), designated as BH95, BHGW, and AdaptZ hereinafter. Note that the BH step-up procedure is easier to implement than either BHGW or AdaptZ: the BHGW procedure requires estimating the proportion of nonnulls, and the AdaptZ procedure also requires an additional density estimation step.
Figure 3. Comparison of the p value and z value oracle rules. (a) mFDR = 0.10, μ = (−3, 3); (b) mFDR = 0.10, μ = (−3, 6); (c) mFDR = 0.10, p1 = 0.18, p2 = 0.02, μ1 = −3; (d) μ = (−3, 1), p = (0.02, 0.18).
Figure 4. Comparison of procedures for FDR control: the BH step-up procedure based on the p values, the BHGW adaptive procedure based on the p values, and the adaptive procedure based on the z values. (a) μ = (−3, 3), p = 0.20; (b) μ = (−3, 6), p = 0.20; (c) μ1 = −3, p = (0.08, 0.02); (d) μ1 = −3, p = (0.18, 0.02). The mFDR is controlled at level 0.10.
The alternative means μ_1i and μ_2i are generated from the uniform distributions U(θ1 − δ1, θ1 + δ1) and U(θ2 − δ2, θ2 + δ2), after which z_i is generated from the normal mixture model (8) based on μ_1i and μ_2i, i = 1, …, m. This hierarchical model can also be viewed as a special case of the mixture model (11). In estimating the Lfdr statistic L̂fdr(z_i) = (1 − p̂)f̂0(z_i)/f̂(z_i), f0 is chosen to be the theoretical null density N(0, 1), p is estimated consistently using the approach of Jin and Cai (2007), and f is estimated using a kernel density estimator with bandwidth chosen by cross-validation. The comparison results are displayed in Figure 4, where the top row plots the mFNR as a function of p1 and the bottom row plots the mFNR as a function of θ2, with other quantities held fixed. All points in the plots are calculated based on a sample size of m = 5,000 and 200 replications.
The following observations can be made based on the results from these two numerical studies:
a. When |μ1| = |μ2|, the mFNR remains constant for the p value oracle procedure. In contrast, the mFNR for the z value oracle procedure first increases and then decreases [Fig. 3(a)]. The p value and z value oracle procedures yield the same mFNR levels only when the alternative is symmetric about the null. This reveals the fact that the z value procedure adapts to asymmetry in the alternative distribution, whereas the p value procedure does not. Similar phenomena are shown in Figure 4 for the adaptive procedures.
b. The p value oracle procedure is dominated by the z value oracle procedure. The largest difference occurs when |μ1| < μ2 and p1 > p2, where the alternative distribution is highly asymmetric about the null [Figs. 3(b) and 3(c)]. Similarly, BH95 is dominated by BHGW, which is again dominated by AdaptZ (Fig. 4).
c. The mFNRs of both the p value and z value procedures decrease as μ2 moves away from 0.
d. Within a reasonable range (mFDR < 0.6), the mFNR is decreasing in the mFDR [Fig. 3(d)], which verifies part (d) of Proposition 1, one consequence of our SMLR assumption.
Additional simulation results show that the difference in the mFNRs of the p value and z value procedures increases with the proportion of nonnulls, and that the adaptive procedures (BHGW and AdaptZ) control the mFDR at the nominal level α approximately, whereas the BH95 procedure controls the mFDR at a lower level.
6. APPLICATION TO MICROARRAY DATA
We now illustrate our method in the analysis of microarray data from an HIV study. The goal of the HIV study (van't Wout et al. 2003) is to discover differentially expressed genes
Figure 5. Histograms of the HIV data. (a) z values; (b) p values; (c) transformed p values, approximately distributed as uniform(0, 1) for the null cases.
between HIV-positive patients and HIV-negative controls. Gene expression levels were measured for four HIV-positive patients and four HIV-negative controls on the same N = 7,680 genes. A total of N two-sample t-tests were performed, and N two-sided p values were obtained. The z values were converted from the t statistics using the transformation z_i = Φ^−1[G0(t_i)], where Φ and G0 are the cdf's of a standard normal and a t variable with 6 degrees of freedom. The histograms of the z values and p values are presented in Figure 5. An important feature of this data set is that the z value distribution is asymmetric about the null; the distribution is skewed to the right.
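The conversion z_i = Φ^−1[G0(t_i)] can be sketched with the standard library alone, computing the t cdf with 6 degrees of freedom by numerical integration of its density and inverting Φ by bisection (an illustrative sketch; in practice a statistics library would supply both distributions):

```python
import math

NU = 6  # degrees of freedom (4 + 4 - 2 samples)
# normalizing constant of the t density
C = math.gamma((NU + 1) / 2) / (math.sqrt(NU * math.pi) * math.gamma(NU / 2))

def t_pdf(x):
    return C * (1.0 + x * x / NU) ** (-(NU + 1) / 2)

def t_cdf(t, steps=2000):
    """G0(t) via Simpson's rule on [0, |t|], using symmetry about 0."""
    a = abs(t)
    h = a / steps
    s = t_pdf(0.0) + t_pdf(a)
    for j in range(1, steps):
        s += (4 if j % 2 else 2) * t_pdf(j * h)
    half = s * h / 3.0
    return 0.5 + half if t >= 0 else 0.5 - half

def z_value(t):
    """z = Phi^{-1}[G0(t)], inverting the normal cdf by bisection."""
    target = t_cdf(t)
    Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if Phi(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Because the t6 distribution has heavier tails than the normal, |z_value(t)| < |t| for moderate t; the transformation preserves sign and monotonicity, so the skewness visible in the z value histogram reflects the data rather than the conversion.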
When the null hypothesis is true, the p values and z values should follow their theoretical null distributions, which are uniform and standard normal. But the theoretical nulls are usually quite different from the empirical nulls for data arising from microarray experiments (see Efron 2004b for more discussion on the choice of the null in multiple testing). We take the approach of Jin and Cai (2007) to estimate the null distribution as N(μ̂0, σ̂0²). The estimates μ̂0 and σ̂0² are consistent. We then proceed to estimate the proportion of nonnulls p based on μ̂0 and σ̂0². The marginal density f is estimated by a kernel density estimate f̂, with the bandwidth chosen by cross-validation. The Lfdr statistics are then calculated as L̂fdr(z_i) = (1 − p̂)f̂0(z_i)/f̂(z_i). The transformed p values are obtained as F̂0(z_i), where F̂0 is the estimated null cdf Φ((x − μ̂0)/σ̂0). As shown in Figure 5(c), after transformation the distribution of the transformed p values is approximately uniform when the null is true.
We compare the BH95, BHGW, and AdaptZ procedures using both the theoretical nulls and the estimated nulls. We calculate the number of rejections for each mFDR level; the results are shown in Figure 6. For Figure 6(a), f0 is chosen to be the theoretical null N(0, 1), and the estimate of the proportion of nulls is 1; therefore, the BH95 and BHGW procedures yield the same number of rejections. For Figure 6(b), the estimated null distribution is N(−0.08, 0.77²), with estimated proportion of nulls p̂0 = 0.94. Transformed p values as well as the Lfdr statistics are calculated according to the estimated null. The following observations can be made from the results displayed:
a. The number of rejections is increasing as a function of the mFDR.
b. For both the p value–based and z value–based approaches, more hypotheses are rejected when the estimated null is used.
c. Both comparisons show that AdaptZ is more powerful than BH95 and BHGW, which are based on p values.
7. DISCUSSION
Figure 6. Analysis of the HIV data: the number of rejections versus the FDR level for the theoretical null (a) and the estimated null (b), for the BH step-up procedure based on the p values, the BHGW adaptive procedure based on the p values, and the adaptive procedure based on the z values.

We have developed a compound decision theory framework for inference about the mixture model (1) and have shown that the oracle procedure δ(T_OR, λ_OR) given in (6) is optimal for mFDR control in the multiple-testing problem. We have proposed an adaptive procedure based on the z values that mimics the oracle procedure (6). The adaptive z value–based procedure attains the performance of the oracle procedure asymptotically and outperforms the traditional p value–based approaches. The decision-theoretic framework provides insight into the superiority of the adaptive z value–based procedure. Each multiple-testing procedure involves two steps: ranking and thresholding. The process of ranking the relative importance of the m tests can be viewed as a compound decision problem P = {P_ij : 1 ≤ i < j ≤ m} with m(m − 1)/2 components. Each component problem P_ij is a pairwise comparison with its own data y_ij = (x_i, x_j). The solution to P based on p values is simple (note that δ_ij = I[p(x_i) < p(x_j)] depends on y_ij alone), whereas the rule based on the estimated Lfdr is compound. The gain in efficiency of the adaptive z value–based procedure is due to the fact that the scope of attention is extended from the class of simple rules to the class of compound rules in the ranking process.
The compound decision-theoretic approach to multiple testing also suggests several directions for future research. First, the oracle statistic (Lfdr) may no longer be optimal for ranking the significance of tests when the observations are correlated. In addition, different cutoffs should be chosen under dependency, because multiple-testing procedures developed for independent tests can be too liberal or too conservative for dependent tests (Efron 2006). Second, the oracle procedure is optimal only in the class of symmetric decision rules. We expect to develop more general models in the compound decision-theoretic framework and to extend our attention to asymmetric rules to further increase the efficiency of multiple-testing procedures. Finally, Genovese, Roeder, and Wasserman (2005), Wasserman and Roeder (2006), Rubin, Dudoit, and van der Laan (2006), and Foster and Stine (2006) have discussed methods that use different p value cutoffs for different tests when prior information is available or some correlation structure is known. It would be interesting to construct z value–based testing procedures that also exploit external information. We leave this topic for future research.
APPENDIX: PROOFS
Proof of Proposition 1

To prove part (a), note that the MLR assumption (4) implies that

∫_0^c g0(t) dt / ∫_0^c g1(t) dt = ∫_0^c g0(t) dt / ∫_0^c [g1(t)/g0(t)]g0(t) dt < ∫_0^c g0(t) dt / ∫_0^c [g1(c)/g0(c)]g0(t) dt = g0(c)/g1(c),

which is equivalent to g1(c)G0(c) < g0(c)G1(c). Thus the derivative of P(θ_i = 1|T(X_i) ≤ c) = pG1(c)/G(c) is (1 − p)p[g1(c)G0(c) − g0(c)G1(c)]/G(c)² < 0. Therefore, P(θ_i = 1|T(X_i) ≤ c) is decreasing in c. Parts (b) and (c) can be proved similarly to part (a), by noting that mFDR = p0G0(c)/G(c) and mFNR = p1[1 − G1(c)]/[1 − G(c)]. Part (d) follows from parts (b) and (c). For part (e), the classification risk of δ is R_λ = E[L_λ(θ, δ)] = λ(1 − p)G0(c) + p[1 − G1(c)]. The optimal cutoff c* that minimizes R_λ satisfies g0(c*)/g1(c*) = p/[λ(1 − p)]; equivalently, λ = pg1(c*)/[(1 − p)g0(c*)]. Therefore, λ (c*) is monotone decreasing in c* (λ) if (4) holds. Noting that r = G(c*) and that G is increasing, we conclude that r is decreasing in λ.
Proof of Theorem 1
According to part (b) of Proposition 1, for any $T \in \mathcal{T}$ and mFDR level $\alpha$, there exists a unique threshold $c$ such that the mFDR is controlled at level $\alpha$ by $\delta(T, c) = \{I[T(x_1) < c], \ldots, I[T(x_m) < c]\}$. Let $r$ be the expected number of rejections of $\delta(T, c)$. Then, again by part (b) of Proposition 1, there exists a unique threshold $c^*$ such that the expected number of rejections of $\delta(\Lambda, c^*)$ is also $r$. Next, according to part (e) of Proposition 1, there exists a unique $\lambda(\alpha)$ with respect to the choice of $c^*$ such that the classification risk with $\lambda(\alpha)$ as the weight is minimized by $\delta(\Lambda, c^*)$ and the expected number of subjects classified to the nonnull population is $r$.

Suppose that among the $r$ rejected hypotheses there are $v_L$ from the null and $k_L$ from the nonnull when $\delta(L, c_L)$ is applied, where $(L, c_L) = (T, c)$ or $(\Lambda, c^*)$. Then $r = v_T + k_T = v_\Lambda + k_\Lambda$. Now we argue that $v_T \ge v_\Lambda$ and $k_T \le k_\Lambda$. If not, then suppose that $v_T < v_\Lambda$ and $k_T > k_\Lambda$. Note that the classification risk can be expressed as $R_{\lambda(\alpha)} = p + \frac{1}{m}\{\lambda(\alpha)v_L - k_L\}$, and it is easy to see that $\delta(T, c)$ would then yield a lower classification risk than $\delta(\Lambda, c^*)$. This contradicts the fact that $\delta(\Lambda, c^*)$ minimizes the classification risk for the choice of $\lambda(\alpha)$. Therefore we must have $v_T \ge v_\Lambda$ and $k_T \le k_\Lambda$.
Sun and Cai Rules for False Discovery Rate Control 911
Let $\mathrm{mFDR}_L$ and $\mathrm{mFNR}_L$ be the mFDR and mFNR of $\delta(L, c_L)$, where $L = T, \Lambda$. Now we apply both $\delta = \delta(T, c)$ and $\delta^* = \delta(\Lambda, c^*)$ in the multiple-testing problem. Then the mFDR using rule $\delta^*$ is $\mathrm{mFDR}_\Lambda = v_\Lambda/r \le v_T/r = \mathrm{mFDR}_T = \alpha$, and the mFNR using $\delta^*$ is $\mathrm{mFNR}_\Lambda = (m_1 - k_\Lambda)/(m - r) \le (m_1 - k_T)/(m - r) = \mathrm{mFNR}_T$. Therefore $\delta^*$ is the testing rule in the class $\mathcal{D}_S$ that controls the mFDR at level $\alpha$ with the smallest mFNR.
Proof of Theorem 2
The joint distribution of $\theta = (\theta_1, \ldots, \theta_m)$ is $\pi(\theta) = \prod_i (1-p)^{1-\theta_i} p^{\theta_i}$. The posterior distribution of $\theta$ given $x$ can be calculated as $P_{\theta|X}(\theta|x) = \prod_i P_{\theta_i|X_i}(\theta_i|x_i)$, where
$$P_{\theta_i|X_i}(\theta_i|x_i) = \frac{I(\theta_i = 0)(1-p)f_0(x_i) + I(\theta_i = 1)pf_1(x_i)}{(1-p)f_0(x_i) + pf_1(x_i)}. \quad \mathrm{(A.1)}$$
Let $I_i = I(\theta_i = 1)$, and let $\mathcal{I} = \{(I_1, \ldots, I_m) : I_i = 0 \text{ or } 1\}$ be the collection of all $m$-binary vectors; then $|\mathcal{I}| = 2^m$. The elements of $\mathcal{I}$ are designated by superscripts, $I^{(j)}$, $j = 1, \ldots, 2^m$. Note that $\sum_{\theta_i = 0}^{1} [I(\theta_i = 0)(1-p)f_0(x_i) + I(\theta_i = 1)pf_1(x_i)][\lambda I(\theta_i = 0)\delta_i + I(\theta_i = 1)(1 - \delta_i)] = \lambda(1-p)f_0(x_i)\delta_i + pf_1(x_i)(1 - \delta_i)$; then the posterior risk is
$$E_{\theta|X}L(\theta, \delta) = \sum_{j=1}^{2^m} L\big(I^{(j)}, \delta\big)\prod_i P_{\theta_i|X_i}\big(\theta_i = I_i^{(j)}\big|x_i\big) = \frac{1}{m}\sum_i \sum_{\theta_i = 0}^{1} P_{\theta_i|X_i}(\theta_i|x_i)\{\lambda I(\theta_i = 0)\delta_i + I(\theta_i = 1)(1 - \delta_i)\}$$
$$= \frac{1}{m}\sum_i \frac{pf_1(x_i)}{f(x_i)} + \frac{1}{m}\sum_i \frac{\lambda(1-p)f_0(x_i) - pf_1(x_i)}{f(x_i)}\,\delta_i.$$
The Bayes rule is the simple rule $\delta^\pi(x) = (\delta_1^\pi(x_1), \ldots, \delta_m^\pi(x_m))$, where $\delta_i^\pi = I[\lambda(1-p)f_0(x_i) < pf_1(x_i)]$. The expected misclassification risk of $\delta^\pi$ is
$$E\,E_{\theta|X}L(\theta, \delta^\pi) = E\!\left[\frac{pf_1(X)}{f(X)}\right] + E\!\left[\frac{\lambda(1-p)f_0(X) - pf_1(X)}{f(X)}\,I\{\lambda(1-p)f_0(X) < pf_1(X)\}\right]$$
$$= p + \int_{\lambda(1-p)f_0 < pf_1} [\lambda(1-p)f_0(x) - pf_1(x)]\,dx.$$
Proof of Theorem 3
Observe that for each $t$, if $p_{(i)} \le t < p_{(i+1)}$, then the number of rejections is $i$. In addition, the ratio $(1 - \hat p)t/G_m(t)$ increases in $t$ over any range on which the number of rejections is constant, which implies that if $t = p_{(i)}$ does not satisfy the constraint, then neither does any choice of $t$ between $p_{(i)}$ and $p_{(i+1)}$. Thus it is sufficient to investigate the thresholds that are equal to one of the $p_{(i)}$'s. Then the ratio $(1 - \hat p)t/G_m(t)$ becomes $m(1 - \hat p)p_{(i)}/i$. Therefore the plug-in threshold is given by $\sup\{p_{(i)} : m(1 - \hat p)p_{(i)}/i \le \alpha\}$, which is equivalent to choosing the largest $i$ such that $p_{(i)} \le i\alpha/[m(1 - \hat p)]$.

Proof of Theorem 4

To prove this theorem, we need the following lemmas.
Lemma A.1. Let $\hat p$, $\hat f$, and $\hat f_0$ be estimates such that $\hat p \overset{p}{\to} p$, $E\|\hat f - f\|^2 \to 0$, and $E\|\hat f_0 - f_0\|^2 \to 0$. Then $E\|\hat T_{\mathrm{OR}} - T_{\mathrm{OR}}\|^2 \to 0$.

Proof. Note that $f$ is continuous and positive on the real line; then there exists $K_1 = [-M, M]$ such that $\Pr(z \in K_1^c) \to 0$ as $M \to \infty$. Let $\inf_{z \in K_1} f(z) = l_0$ and $A_\varepsilon^f = \{z : |\hat f(z) - f(z)| \ge l_0/2\}$. Note that $E\|\hat f - f\|^2 \ge (l_0/2)^2 P(A_\varepsilon^f)$; then $\Pr(A_\varepsilon^f) \to 0$. Thus $f$ and $\hat f$ are bounded below by a positive number for large $m$, except on an event that has low probability. Similar arguments can be applied to the upper bounds of $f$ and $\hat f$, as well as to the upper and lower bounds for $f_0$ and $\hat f_0$. Therefore we conclude that $f_0$, $\hat f_0$, $f$, and $\hat f$ are all bounded in an interval $[l_a, l_b]$, $0 < l_a < l_b < \infty$, for large $m$, except on an event, say $A_\varepsilon$, that has algebraically low probability; that is, $0 < l_a < \inf_{z \in A_\varepsilon^c} \min\{f_0, \hat f_0, f, \hat f\} \le \sup_{z \in A_\varepsilon^c} \max\{f_0, \hat f_0, f, \hat f\} < l_b < \infty$.

Noting that $\hat T_{\mathrm{OR}} - T_{\mathrm{OR}} = [f_0\hat f(p - \hat p) + (1 - \hat p)f(\hat f_0 - f_0) + (1 - \hat p)f_0(f - \hat f)]/(f\hat f)$, we conclude that $(\hat T_{\mathrm{OR}} - T_{\mathrm{OR}})^2 \le c_1(\hat p - p)^2 + c_2(\hat f_0 - f_0)^2 + c_3(\hat f - f)^2$ on $A_\varepsilon^c$. It is easy to see that $\|\hat T_{\mathrm{OR}} - T_{\mathrm{OR}}\|^2$ is bounded by $L$. Then $E\|\hat T_{\mathrm{OR}} - T_{\mathrm{OR}}\|^2 \le L\Pr(A_\varepsilon) + c_1E(\hat p - p)^2 + c_2E\|\hat f_0 - f_0\|^2 + c_3E\|\hat f - f\|^2$. Note that $E(\hat p - p)^2 \to 0$ by lemma 2.2 of van der Vaart (1998), and $E\|\hat f - f\|^2 \to 0$ and $E\|\hat f_0 - f_0\|^2 \to 0$ by assumption; then, for a given $\varepsilon > 0$, there exists $M \in \mathbb{Z}^+$ such that we can find $A_\varepsilon$ with $\Pr(A_\varepsilon) < \varepsilon/(4L)$ and, at the same time, $E(\hat p - p)^2 < \varepsilon/(4c_1)$, $E\|\hat f_0 - f_0\|^2 < \varepsilon/(4c_2)$, and $E\|\hat f - f\|^2 < \varepsilon/(4c_3)$ for all $m \ge M$. Consequently, $E\|\hat T_{\mathrm{OR}} - T_{\mathrm{OR}}\|^2 < \varepsilon$ for $m \ge M$, and the result follows.
Lemma A.2. $E\|\hat T_{\mathrm{OR}} - T_{\mathrm{OR}}\|^2 \to 0$ implies that $\hat T_{\mathrm{OR}}(Z_i) \overset{p}{\to} T_{\mathrm{OR}}(Z_i)$.

Proof. Let $A_\varepsilon = \{z : |\hat T_{\mathrm{OR}}(z) - T_{\mathrm{OR}}(z)| \ge \varepsilon\}$. Then $\varepsilon^2\Pr(A_\varepsilon) \le E\|\hat T_{\mathrm{OR}} - T_{\mathrm{OR}}\|^2 \to 0$; consequently, $\Pr(A_\varepsilon) \to 0$. Therefore $\Pr(|\hat T_{\mathrm{OR}}(Z_i) - T_{\mathrm{OR}}(Z_i)| \ge \varepsilon) \le \Pr(A_\varepsilon) + \Pr(\{|\hat T_{\mathrm{OR}}(Z_i) - T_{\mathrm{OR}}(Z_i)| \ge \varepsilon\} \cap A_\varepsilon^c) = \Pr(A_\varepsilon) \to 0$, and the result follows.
Lemma A.3. For $\alpha < t < 1$, $E[1\{\hat T_{\mathrm{OR}}(Z_i) < t\}\hat T_{\mathrm{OR}}(Z_i)] \to E[1\{T_{\mathrm{OR}}(Z_i) < t\}T_{\mathrm{OR}}(Z_i)]$.

Proof. $\hat T_{\mathrm{OR}}(Z_i) \overset{p}{\to} T_{\mathrm{OR}}(Z_i)$ (Lemma A.2) implies that $\hat T_{\mathrm{OR}}(Z_i) \overset{d}{\to} T_{\mathrm{OR}}(Z_i)$. Let $h(x) = 1\{x < t\}x$; then $h(x)$ is bounded and continuous except at $x = t$. By lemma 2.2 of van der Vaart (1998), $E[h(\hat T_{\mathrm{OR}}(Z_i))] \to E[h(T_{\mathrm{OR}}(Z_i))]$, and the result follows.
Lemma A.4. Construct the empirical distributions $\hat G_{\mathrm{OR}}(t) = \frac{1}{m}\sum_{i=1}^m 1\{\hat T_{\mathrm{OR}}(z_i) \le t\}$ and $\hat G_{\mathrm{OR}}^0(t) = \frac{1}{m}\sum_{i=1}^m 1\{\hat T_{\mathrm{OR}}(z_i) \le t\}\hat T_{\mathrm{OR}}(z_i)$. Define $\hat Q_{\mathrm{OR}}(t) = \hat G_{\mathrm{OR}}^0(t)/\hat G_{\mathrm{OR}}(t)$, the estimated mFDR. Then, for $\alpha < t < 1$, $\hat Q_{\mathrm{OR}}(t) \overset{p}{\to} Q_{\mathrm{OR}}(t)$.

Proof. Let $\rho_m = \mathrm{cov}[\hat T_{\mathrm{OR}}(Z_i), \hat T_{\mathrm{OR}}(Z_j)]$, where $Z_i$ and $Z_j$ are two independent random variables from the mixture distribution $F$. $\hat T_{\mathrm{OR}}(Z_i) \overset{p}{\to} T_{\mathrm{OR}}(Z_i)$ (Lemma A.2) implies that $E[\hat T_{\mathrm{OR}}(Z_i)\hat T_{\mathrm{OR}}(Z_j)] \to E[T_{\mathrm{OR}}(Z_i)T_{\mathrm{OR}}(Z_j)]$. Therefore $\rho_m \to \mathrm{cov}[T_{\mathrm{OR}}(Z_i), T_{\mathrm{OR}}(Z_j)] = 0$. Let $\sigma_m^2 = \mathrm{var}(\hat T_{\mathrm{OR}}(Z_1))$; then $\sigma_m^2 \le E[\hat T_{\mathrm{OR}}(Z_1)^2] \le 1$.

Let $\mu_m = \Pr\{\hat T_{\mathrm{OR}}(Z_i) < t\}$ and $S_m = \sum_{i=1}^m 1\{\hat T_{\mathrm{OR}}(z_i) \le t\}$. Note that $\mathrm{var}(S_m)/m^2 = (1/m)\sigma_m^2 + [(m-1)/m]\rho_m \le 1/m + \rho_m \to 0$. According to the weak law of large numbers for triangular arrays, $\frac{1}{m}\sum_{i=1}^m 1\{\hat T_{\mathrm{OR}}(z_i) \le t\} - \mu_m \overset{p}{\to} 0$. Also noting that $\mu_m = \Pr\{\hat T_{\mathrm{OR}}(Z_i) < t\} \to \Pr\{T_{\mathrm{OR}}(Z_i) < t\} = G_{\mathrm{OR}}(t)$, we conclude that $\hat G_{\mathrm{OR}}(t) \overset{p}{\to} G_{\mathrm{OR}}(t)$. Next, let $v_m = E[1\{\hat T_{\mathrm{OR}}(Z_i) \le t\}\hat T_{\mathrm{OR}}(Z_i)]$. Similarly, we can prove that $\hat G_{\mathrm{OR}}^0(t) - v_m = \frac{1}{m}\sum_{i=1}^m 1\{\hat T_{\mathrm{OR}}(z_i) \le t\}\hat T_{\mathrm{OR}}(z_i) - v_m \overset{p}{\to} 0$. By Lemma A.3, we have $v_m \to E[1\{T_{\mathrm{OR}}(Z_i) < t\}T_{\mathrm{OR}}(Z_i)] = (1-p)\int 1\{T_{\mathrm{OR}}(z) < t\}f_0(z)\,dz = (1-p)G_{\mathrm{OR}}^0(t)$; thus $\hat G_{\mathrm{OR}}^0(t) \overset{p}{\to} (1-p)G_{\mathrm{OR}}^0(t)$. Finally, noting that for $t > \alpha$, $G_{\mathrm{OR}}(t)$ is bounded away from 0, we obtain $\hat Q_{\mathrm{OR}}(t) \overset{p}{\to} (1-p)G_{\mathrm{OR}}^0(t)/G_{\mathrm{OR}}(t) = Q_{\mathrm{OR}}(t)$.
Lemma A.5. Define the estimate of the plug-in threshold $\hat\lambda_{\mathrm{OR}} = \sup\{t \in (0,1) : \hat Q_{\mathrm{OR}}(t) \le \alpha\}$. If $\hat Q_{\mathrm{OR}}(t) \overset{p}{\to} Q_{\mathrm{OR}}(t)$, then $\hat\lambda_{\mathrm{OR}} \overset{p}{\to} \lambda_{\mathrm{OR}}$.
912 Journal of the American Statistical Association September 2007
Proof. Note that $\hat Q_{\mathrm{OR}}(t)$ is not continuous, and the consistency of $\hat Q_{\mathrm{OR}}(t)$ does not necessarily imply the consistency of $\hat\lambda_{\mathrm{OR}}$. Thus we first construct an envelope for $\hat Q_{\mathrm{OR}}(t)$ using two continuous random functions $\hat Q_{\mathrm{OR}}^-(t)$ and $\hat Q_{\mathrm{OR}}^+(t)$ such that, for $\widehat{\mathrm{Lfdr}}_{(k)} < t < \widehat{\mathrm{Lfdr}}_{(k+1)}$,
$$\hat Q_{\mathrm{OR}}^-(t) = \hat Q_{\mathrm{OR}}\big(\widehat{\mathrm{Lfdr}}_{(k-1)}\big)\,\frac{\widehat{\mathrm{Lfdr}}_{(k+1)} - t}{\widehat{\mathrm{Lfdr}}_{(k+1)} - \widehat{\mathrm{Lfdr}}_{(k)}} + \hat Q_{\mathrm{OR}}\big(\widehat{\mathrm{Lfdr}}_{(k)}\big)\,\frac{t - \widehat{\mathrm{Lfdr}}_{(k)}}{\widehat{\mathrm{Lfdr}}_{(k+1)} - \widehat{\mathrm{Lfdr}}_{(k)}}$$
and
$$\hat Q_{\mathrm{OR}}^+(t) = \hat Q_{\mathrm{OR}}\big(\widehat{\mathrm{Lfdr}}_{(k)}\big)\,\frac{\widehat{\mathrm{Lfdr}}_{(k+1)} - t}{\widehat{\mathrm{Lfdr}}_{(k+1)} - \widehat{\mathrm{Lfdr}}_{(k)}} + \hat Q_{\mathrm{OR}}\big(\widehat{\mathrm{Lfdr}}_{(k+1)}\big)\,\frac{t - \widehat{\mathrm{Lfdr}}_{(k)}}{\widehat{\mathrm{Lfdr}}_{(k+1)} - \widehat{\mathrm{Lfdr}}_{(k)}}.$$
Noting that $\hat Q_{\mathrm{OR}}(\widehat{\mathrm{Lfdr}}_{(k+1)}) - \hat Q_{\mathrm{OR}}(\widehat{\mathrm{Lfdr}}_{(k)}) = [k\widehat{\mathrm{Lfdr}}_{(k+1)} - \sum_{i=1}^k \widehat{\mathrm{Lfdr}}_{(i)}]/[k(k+1)] > 0$, we have that $\hat Q_{\mathrm{OR}}^-(t) \le \hat Q_{\mathrm{OR}}(t) \le \hat Q_{\mathrm{OR}}^+(t)$, with both $\hat Q_{\mathrm{OR}}^-(t)$ and $\hat Q_{\mathrm{OR}}^+(t)$ strictly increasing in $t$. Let $\hat\lambda_{\mathrm{OR}}^- = \sup\{t \in (0,1) : \hat Q_{\mathrm{OR}}^-(t) \le \alpha\}$ and $\hat\lambda_{\mathrm{OR}}^+ = \sup\{t \in (0,1) : \hat Q_{\mathrm{OR}}^+(t) \le \alpha\}$; then $\hat\lambda_{\mathrm{OR}}^+ \le \hat\lambda_{\mathrm{OR}} \le \hat\lambda_{\mathrm{OR}}^-$.

We claim that $\hat\lambda_{\mathrm{OR}}^- \overset{p}{\to} \lambda_{\mathrm{OR}}$. If not, then there must exist $\varepsilon_0$ and $\eta_0$ such that, for all $M \in \mathbb{Z}^+$, $\Pr(|\hat\lambda_{\mathrm{OR}}^- - \lambda_{\mathrm{OR}}| > \varepsilon_0) > 2\eta_0$ holds for some $m \ge M$. Suppose that $\Pr(\hat\lambda_{\mathrm{OR}}^- > \lambda_{\mathrm{OR}} + \varepsilon_0) > 2\eta_0$. It is easy to see that $\hat Q_{\mathrm{OR}}^- \overset{p}{\to} Q_{\mathrm{OR}}$, because $\hat Q_{\mathrm{OR}}^-(t) - \hat Q_{\mathrm{OR}}^+(t) \overset{\mathrm{a.s.}}{\to} 0$, $\hat Q_{\mathrm{OR}}^-(t) \le \hat Q_{\mathrm{OR}}(t) \le \hat Q_{\mathrm{OR}}^+(t)$, and $\hat Q_{\mathrm{OR}}(t) \overset{p}{\to} Q_{\mathrm{OR}}(t)$. Thus there exists $M \in \mathbb{Z}^+$ such that $\Pr(|\hat Q_{\mathrm{OR}}^-(\lambda_{\mathrm{OR}} + \varepsilon_0) - Q_{\mathrm{OR}}(\lambda_{\mathrm{OR}} + \varepsilon_0)| < \delta_0) > 1 - \eta_0$ for all $m \ge M$, where we let $2\delta_0 = Q_{\mathrm{OR}}(\lambda_{\mathrm{OR}} + \varepsilon_0) - \alpha$. Now, for this choice of $m$, there exists an event $K_{1m}$ such that $\Pr(K_{1m}) \ge 1 - \eta_0$ and, for all outcomes $\omega \in K_{1m}$, $|\hat Q_{\mathrm{OR}}^-(\lambda_{\mathrm{OR}} + \varepsilon_0) - Q_{\mathrm{OR}}(\lambda_{\mathrm{OR}} + \varepsilon_0)| < \delta_0$. At the same time, there exists an event $K_{2m}$ such that $\Pr(K_{2m}) \ge 2\eta_0$ and, for all outcomes $\omega \in K_{2m}$, $\hat\lambda_{\mathrm{OR}}^- > \lambda_{\mathrm{OR}} + \varepsilon_0$. Let $K_m = K_{1m} \cap K_{2m}$; then $\Pr(K_m) = \Pr(K_{1m}) + \Pr(K_{2m}) - \Pr(K_{1m} \cup K_{2m}) \ge \Pr(K_{1m}) + \Pr(K_{2m}) - 1 \ge \eta_0$. That is, $K_m$ has positive measure. Then, for all outcomes in $K_m$, $\hat Q_{\mathrm{OR}}^-(t)$ is continuous and strictly increasing, and we have $\alpha = \hat Q_{\mathrm{OR}}^-(\hat\lambda_{\mathrm{OR}}^-) > \hat Q_{\mathrm{OR}}^-(\lambda_{\mathrm{OR}} + \varepsilon_0) > Q_{\mathrm{OR}}(\lambda_{\mathrm{OR}} + \varepsilon_0) - \delta_0 = \alpha + \delta_0$. This is a contradiction. Therefore we must have $\hat\lambda_{\mathrm{OR}}^- \overset{p}{\to} \lambda_{\mathrm{OR}}$. Similarly, we can prove that $\hat\lambda_{\mathrm{OR}}^+ \overset{p}{\to} \lambda_{\mathrm{OR}}$. Noting that $\hat Q_{\mathrm{OR}}^-(t) - \hat Q_{\mathrm{OR}}^+(t) \overset{\mathrm{a.s.}}{\to} 0$ implies that $\hat\lambda_{\mathrm{OR}}^- - \hat\lambda_{\mathrm{OR}}^+ \overset{\mathrm{a.s.}}{\to} 0$, the result follows by noting that $\hat\lambda_{\mathrm{OR}}^+ \le \hat\lambda_{\mathrm{OR}} \le \hat\lambda_{\mathrm{OR}}^-$.
Proof of Theorem 4. Implementing the adaptive procedure (10) is equivalent to choosing $\hat T_{\mathrm{OR}}(Z_i)$ as the test statistic and $\hat\lambda_{\mathrm{OR}}$ as the threshold. The mFDR of the decision rule $\delta(\hat T_{\mathrm{OR}}, \hat\lambda_{\mathrm{OR}})$ is $\mathrm{mFDR} = E(N_{10})/E(R) = (1-p)P_{H_0}(\hat T_{\mathrm{OR}} < \hat\lambda_{\mathrm{OR}})/P(\hat T_{\mathrm{OR}} < \hat\lambda_{\mathrm{OR}})$. Noting that $\hat T_{\mathrm{OR}} \overset{p}{\to} T_{\mathrm{OR}}$ by Lemma A.2 and $\hat\lambda_{\mathrm{OR}} \overset{p}{\to} \lambda_{\mathrm{OR}}$ by Lemma A.5, it follows that $(1-p)P_{H_0}(\hat T_{\mathrm{OR}} < \hat\lambda_{\mathrm{OR}}) \to (1-p)P_{H_0}(T_{\mathrm{OR}} < \lambda_{\mathrm{OR}}) = (1-p)G_{\mathrm{OR}}^0(\lambda_{\mathrm{OR}})$ and $P(\hat T_{\mathrm{OR}} < \hat\lambda_{\mathrm{OR}}) \to P(T_{\mathrm{OR}} < \lambda_{\mathrm{OR}}) = G_{\mathrm{OR}}(\lambda_{\mathrm{OR}})$. Noting that $P(T_{\mathrm{OR}} < \lambda_{\mathrm{OR}})$ is bounded away from 0, we have $\mathrm{mFDR} \to (1-p)G_{\mathrm{OR}}^0(\lambda_{\mathrm{OR}})/G_{\mathrm{OR}}(\lambda_{\mathrm{OR}}) = Q_{\mathrm{OR}}(\lambda_{\mathrm{OR}}) = \alpha$.
Proof of Theorem 5
Similar to the proof of Theorem 4, the mFNR of the adaptive procedure (10) is $\mathrm{mFNR} = E(N_{01})/E(S) = pP_{H_1}(\hat T_{\mathrm{OR}} > \hat\lambda_{\mathrm{OR}})/P(\hat T_{\mathrm{OR}} > \hat\lambda_{\mathrm{OR}})$. By Lemmas A.2 and A.5, we have $P_{H_1}(\hat T_{\mathrm{OR}} > \hat\lambda_{\mathrm{OR}}) \to P_{H_1}(T_{\mathrm{OR}} > \lambda_{\mathrm{OR}}) = \bar G_{\mathrm{OR}}^1(\lambda_{\mathrm{OR}})$ and $P(\hat T_{\mathrm{OR}} > \hat\lambda_{\mathrm{OR}}) \to P(T_{\mathrm{OR}} > \lambda_{\mathrm{OR}}) = \bar G_{\mathrm{OR}}(\lambda_{\mathrm{OR}})$. Noting that $\bar G_{\mathrm{OR}}(\lambda_{\mathrm{OR}})$ is bounded away from 0, we obtain $\mathrm{mFNR} \to p\bar G_{\mathrm{OR}}^1(\lambda_{\mathrm{OR}})/\bar G_{\mathrm{OR}}(\lambda_{\mathrm{OR}}) = \widetilde Q_{\mathrm{OR}}(\lambda_{\mathrm{OR}})$.
[Received November 2006. Revised March 2007.]
REFERENCES
Benjamini, Y., and Hochberg, Y. (1995), "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing," Journal of the Royal Statistical Society, Ser. B, 57, 289–300.

——— (2000), "On the Adaptive Control of the False Discovery Rate in Multiple Testing With Independent Statistics," Journal of Educational and Behavioral Statistics, 25, 60–83.

Copas, J. (1974), "On Symmetric Compound Decision Rules for Dichotomies," The Annals of Statistics, 2, 199–204.

Efron, B. (2004a), "Local False Discovery Rate," technical report, Stanford University, Dept. of Statistics, available at http://www-stat.stanford.edu/~brad/papers/False.pdf.

——— (2004b), "Large-Scale Simultaneous Hypothesis Testing: The Choice of a Null Hypothesis," Journal of the American Statistical Association, 99, 96–104.

——— (2006), "Correlation and Large-Scale Simultaneous Testing," technical report, Stanford University, Dept. of Statistics, available at http://www-stat.stanford.edu/~brad/papers/Correlation-2006.pdf.

Efron, B., Tibshirani, R., Storey, J., and Tusher, V. (2001), "Empirical Bayes Analysis of a Microarray Experiment," Journal of the American Statistical Association, 96, 1151–1160.

Foster, D., and Stine, R. (2006), "Alpha Investing: A New Multiple Hypothesis Testing Procedure That Controls mFDR," technical report, University of Pennsylvania, Dept. of Statistics.

Genovese, C., and Wasserman, L. (2002), "Operating Characteristic and Extensions of the False Discovery Rate Procedure," Journal of the Royal Statistical Society, Ser. B, 64, 499–517.

——— (2004), "A Stochastic Process Approach to False Discovery Control," The Annals of Statistics, 32, 1035–1061.

Genovese, C., Roeder, K., and Wasserman, L. (2005), "False Discovery Control With p-Value Weighting," Biometrika, 93, 509–524.

Jin, J., and Cai, T. (2007), "Estimating the Null and the Proportion of Nonnull Effects in Large-Scale Multiple Comparisons," Journal of the American Statistical Association, 102, 495–506.

Magder, L., and Zeger, S. (1996), "A Smooth Nonparametric Estimate of a Mixing Distribution Using Mixtures of Gaussians," Journal of the American Statistical Association, 91, 1141–1151.

Robbins, H. (1951), "Asymptotically Subminimax Solutions of Compound Statistical Decision Problems," in Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA: University of California Press, pp. 131–148.

Rubin, D., Dudoit, S., and van der Laan, M. (2006), "A Method to Increase the Power of Multiple Testing Procedures Through Sample Splitting," working paper, University of California, Berkeley, Dept. of Biostatistics, available at http://www.bepress.com/ucbbiostat/paper171.

Spjøtvoll, E. (1972), "On the Optimality of Some Multiple Comparison Procedures," The Annals of Mathematical Statistics, 43, 398–411.

Storey, J. (2002), "A Direct Approach to False Discovery Rates," Journal of the Royal Statistical Society, Ser. B, 64, 479–498.

——— (2003), "The Positive False Discovery Rate: A Bayesian Interpretation and the q-Value," The Annals of Statistics, 31, 2012–2035.

——— (2007), "The Optimal Discovery Procedure: A New Approach to Simultaneous Significance Testing," Journal of the Royal Statistical Society, Ser. B, 69, 347–368.

van der Laan, M., Dudoit, S., and Pollard, K. (2004), "Multiple Testing. Part III. Augmentation Procedures for Control of the Generalized Family-Wise Error Rate and Tail Probabilities for the Proportion of False Positives," technical report, University of California, Berkeley, Dept. of Biostatistics.

van der Vaart, A. (1998), Asymptotic Statistics, Cambridge, U.K.: Cambridge University Press.

van't Wout, A., Lehrman, G., Mikheeva, S., O'Keeffe, G., Katze, M., Bumgarner, R., Geiss, G., and Mullins, J. (2003), "Cellular Gene Expression Upon Human Immunodeficiency Virus Type 1 Infection of CD4+ T-Cell Lines," Journal of Virology, 77, 1392–1402.

Wasserman, L., and Roeder, K. (2006), "Weighted Hypothesis Testing," technical report, Carnegie Mellon University, Dept. of Statistics, available at http://www.arxiv.org/abs/math.ST/0604172.
3.2 The Oracle Procedure Based on z Values

We write the Lfdr statistic as $T_{\mathrm{OR}}(Z_i) = (1-p)f_0(Z_i)/f(Z_i)$ and call it the oracle test statistic. We assume that $T_{\mathrm{OR}}(Z_i)$ is distributed with marginal cdf $G_{\mathrm{OR}} = (1-p)G_{\mathrm{OR}}^0 + pG_{\mathrm{OR}}^1$, where $G_{\mathrm{OR}}^0$ and $G_{\mathrm{OR}}^1$ are the cdf's of $T_{\mathrm{OR}}(Z_i)$ under the null and the alternative. Set $\bar G_{\mathrm{OR}}(t) = 1 - G_{\mathrm{OR}}(t)$. The mFDR of the oracle rule $\delta(T_{\mathrm{OR}}, \lambda)$ is $Q_{\mathrm{OR}}(\lambda) = (1-p)G_{\mathrm{OR}}^0(\lambda)/G_{\mathrm{OR}}(\lambda)$. We assume that $T_{\mathrm{OR}} \in \mathcal{T}$ satisfies the SMLR assumption; then part (b) of Proposition 1 implies that $Q_{\mathrm{OR}}(\lambda)$ is increasing in the threshold $\lambda$. Therefore the oracle procedure thresholds $T_{\mathrm{OR}}(z)$ at $\lambda_{\mathrm{OR}} = \sup\{t \in (0,1) : Q_{\mathrm{OR}}(t) \le \alpha\} = Q_{\mathrm{OR}}^{-1}(\alpha)$. Thus the oracle testing rule based on z values is
$$\delta(T_{\mathrm{OR}}, \lambda_{\mathrm{OR}}) = \{I[T_{\mathrm{OR}}(z_1) < \lambda_{\mathrm{OR}}], \ldots, I[T_{\mathrm{OR}}(z_m) < \lambda_{\mathrm{OR}}]\}. \quad (6)$$
The corresponding mFDR $= Q_{\mathrm{OR}}(\lambda_{\mathrm{OR}})$ is just $\alpha$, and the mFNR is
$$\widetilde Q_{\mathrm{OR}}(\lambda_{\mathrm{OR}}) = p\bar G_{\mathrm{OR}}^1(\lambda_{\mathrm{OR}})/\bar G_{\mathrm{OR}}(\lambda_{\mathrm{OR}}). \quad (7)$$
Example 1. We consider a random sample $z_1, \ldots, z_m$ from a normal mixture model,
$$f(z) = (1 - p_1 - p_2)\phi(z) + p_1\phi(z - \mu_1) + p_2\phi(z - \mu_2), \quad (8)$$
along with the corresponding $m$ tests $H_0^i: \mu = 0$ versus $H_1^i: \mu \ne 0$, $i = 1, \ldots, m$. Let $p = p_1 + p_2$ be the total proportion of nonnulls. We assume that $\mu_1 < 0$ and $\mu_2 > 0$, so that the rejection region will be two-sided.

The p value distribution under the alternative can be derived as $G_1(t) = (p_1/p)\{\bar\Phi(z_{t/2} + \mu_1) + \bar\Phi(z_{t/2} - \mu_1)\} + (p_2/p)\{\bar\Phi(z_{t/2} + \mu_2) + \bar\Phi(z_{t/2} - \mu_2)\}$, where $z_{t/2}$ is the upper $t/2$ quantile of the standard normal. Thus the optimal p value cutoff $u^*$ is the solution to the equation $G_1(u)/u = (1/p - 1)(1/\alpha - 1)$, with corresponding mFNR of $p\bar G_1(u^*)/[p\bar G_1(u^*) + (1-p)(1-u^*)]$.
It is easy to verify that in the normal mixture model (8), the oracle testing rule (6) is equivalent to rejecting the null when $z_i < c_l$ or $z_i > c_u$. The threshold $\lambda_{\mathrm{OR}}$ can be obtained using the bisection method. For a given $\lambda$, $c_l$ and $c_u$ can be solved numerically from the equation $\lambda\{p_1\exp(\mu_1 z - \frac{1}{2}\mu_1^2) + p_2\exp(\mu_2 z - \frac{1}{2}\mu_2^2)\} - (1-\lambda)(1-p) = 0$. The corresponding mFDR and mFNR can be calculated as
$$\mathrm{mFDR} = \frac{(1-p)[\Phi(c_l) + \bar\Phi(c_u)]}{(1-p)[\Phi(c_l) + \bar\Phi(c_u)] + p_1[\Phi(c_l - \mu_1) + \bar\Phi(c_u - \mu_1)] + p_2[\Phi(c_l - \mu_2) + \bar\Phi(c_u - \mu_2)]}$$
and
$$\mathrm{mFNR} = \frac{p_1[\Phi(c_u - \mu_1) - \Phi(c_l - \mu_1)] + p_2[\Phi(c_u - \mu_2) - \Phi(c_l - \mu_2)]}{(1-p)[\Phi(c_u) - \Phi(c_l)] + p_1[\Phi(c_u - \mu_1) - \Phi(c_l - \mu_1)] + p_2[\Phi(c_u - \mu_2) - \Phi(c_l - \mu_2)]},$$
where $\Phi(\cdot)$ is the cdf of a standard normal and $\bar\Phi(\cdot) = 1 - \Phi(\cdot)$.

In this example we choose the mixture model $.8N(0,1) + p_1N(-3,1) + (.2 - p_1)N(4,1)$. This is different from the model shown in Figure 1, which has alternative means $-3$ and $3$; therefore, different patterns are seen. Both oracle procedures control the mFDR at level .10, and we plot the mFNRs of the two procedures as functions of $p_1$ in Figure 2(a). We can see that, again, the p value oracle procedure is dominated by the z value oracle procedure. We set $p_1 = .15$ and characterize the rejection regions of the two oracle procedures in Figure 2(b). Some calculations show that the rejection region of the p value oracle procedure is $|z_i| > 2.27$ with mFNR = .046, whereas the rejection region of the z value oracle procedure is $z_i < -1.97$ and $z_i > 3.41$ with mFNR = .038. It is interesting to note that the z value oracle procedure rejects observation $x = -2$ (p value = .046) but accepts observation $x = 3$ (p value = .003). We provide more numerical comparisons in Section 5 and show that the z value oracle procedure yields a lower mFNR level than the p value oracle procedure for all possible configurations of alternative hypotheses, with the difference significant in many cases.
3.3 Connection to Work of Spjøtvoll (1972) and Storey (2007)

An "optimal" multiple-testing procedure was introduced by Spjøtvoll (1972). Let $f_{01}, \ldots, f_{0N}$ and $f_1, \ldots, f_N$ be integrable functions. Let $S'(\gamma)$ denote the set of all tests $(\psi_1, \ldots, \psi_N)$ that satisfy $\sum_{t=1}^N \int \psi_t(x)f_{0t}(x)\,d\mu(x) = \gamma$, where $\gamma > 0$ is prespecified. Spjøtvoll (1972) showed that the test $(\phi_1, \ldots, \phi_N) = \{I[f_t(x) > cf_{0t}(x)], t = 1, \ldots, N\}$ maximizes
$$\sum_{t=1}^N \int \phi_t(x)f_t(x)\,d\mu(x) \quad (9)$$
among all tests $(\psi_1, \ldots, \psi_N) \in S'(\gamma)$. Storey (2007) proposed the optimal discovery procedure (ODP), based on a shrinkage test statistic, which maximizes the expected number of true positives (ETP) for a fixed level of the expected number of false positives (EFP). The result of Storey (2007) on the ODP follows directly from the theorem of Spjøtvoll (1972) by choosing appropriate $f_{0t}$ and $f_t$.

In the setting of the present article, both Spjøtvoll's optimal procedure and Storey's ODP procedure depend on unknown functions that are not estimable from the data. Moreover, the optimal cutoffs for a given test level are not specified by either procedure. These limitations make the two methods inapplicable in terms of the goal of FDR control.

The formulation of our testing procedure (6) has two advantages over the procedures of Spjøtvoll (1972) and Storey (2007). First, we can form good estimates (Jin and Cai 2007) from the data for the unknown functions in the oracle test statistic that we propose. Second, we specify the optimal cutoff for each mFDR level, which was not discussed by Spjøtvoll (1972) or Storey (2007). These advantages greatly facilitate the development of the adaptive procedure (10), where a consistent cutoff (relative to the oracle cutoff) is suggested for each test level based on a simple step-up procedure. This task is impossible based on the formulation of the work of Spjøtvoll or Storey, because good estimates of the unknown quantities in their test "statistics" are not available and obtaining appropriate cutoffs is difficult.
4 ADAPTIVE PROCEDURES FOR mFDR CONTROL
Genovese and Wasserman (2004) discussed a class of plug-in procedures for the purpose of practical implementation of the oracle procedure based on the p values. However, the idea of plug-in is difficult to apply to the z value oracle procedure, because it essentially involves estimating the distribution of the oracle test statistic $T_{\mathrm{OR}}(z)$, which is usually very difficult. Instead, we develop an adaptive procedure that requires only estimation of the distribution of the z values, so that the difficulty of estimating the distribution of $T_{\mathrm{OR}}(z)$ is avoided. The adaptive procedure can be easily implemented by noting that the z values are usually distributed as a normal mixture after appropriate transformations are applied, and several methods for consistently estimating the normal nulls have been developed in the literature (see, e.g., Efron 2004b; Jin and Cai 2007).

In this section we first introduce an adaptive p value–based procedure proposed by Benjamini and Hochberg (2000). We then turn to the development of an adaptive procedure based on the z values. We show that the adaptive z value–based procedure asymptotically attains the performance of the oracle procedure (6). We report simulation studies in Section 5 demonstrating that the z value–based adaptive procedure is more efficient than the traditional p value–based testing approaches.
4.1 Adaptive Procedure Based on p Values

Suppose that $P_{(1)}, \ldots, P_{(m)}$ are the ranked p values from $m$ independent tests, and let $k = \max\{i : P_{(i)} \le \alpha i/[m(1 - \hat p)]\}$. Then the adaptive procedure of Benjamini and Hochberg (2000), designated as BH hereinafter, rejects all $H_{(i)}$, $i \le k$. Genovese and Wasserman (2004) proposed the plug-in threshold $t(\hat p, \hat G)$ for p values, where $t(p, G) = \sup\{t : (1-p)t/G(t) \le \alpha\}$ is the threshold of the oracle p value procedure, and $\hat p$ and $\hat G$ are estimates of $p$ and $G$. The next theorem shows that the GW plug-in procedure and the BH adaptive procedure are equivalent when the empirical distribution of the p values, $G_m$, is used to estimate $G$.

Theorem 3. Let $\hat p$ be an estimate of $p$, and let $G_m$ be the empirical cdf of the p values. Then the GW plug-in procedure is equivalent to the BH adaptive procedure.

It follows from our Theorem 3 and theorem 5 of Genovese and Wasserman (2004) that the BH adaptive procedure controls the mFDR at level $\alpha$ asymptotically.
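The BH adaptive step-up rule $k = \max\{i : P_{(i)} \le \alpha i/[m(1-\hat p)]\}$ can be sketched in a few lines. The code below is our own minimal illustration with a toy p-value vector; `bh_adaptive` is a hypothetical helper name, and setting $\hat p = 0$ recovers the original BH (1995) procedure.

```python
def bh_adaptive(pvals, alpha, p_hat=0.0):
    """Adaptive BH step-up: find the largest rank i with
    p_(i) <= alpha * i / (m * (1 - p_hat)) and reject those i hypotheses.
    With p_hat = 0 this reduces to the BH (1995) step-up procedure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by increasing p
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= alpha * rank / (m * (1 - p_hat)):
            k = rank
    return sorted(order[:k])  # indices of rejected hypotheses

print(bh_adaptive([0.01, 0.30, 0.02, 0.50, 0.04], alpha=0.10))  # → [0, 2, 4]
```

Note that the scan records the *largest* qualifying rank rather than stopping at the first failure, which is what makes this a step-up (rather than step-down) procedure.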
4.2 Adaptive Procedure Based on z Values

Here we outline the steps of an intuitive derivation of the adaptive z value–based procedure. The derivation essentially involves mimicking the operation of the z value oracle procedure and evaluating the distribution of $T_{\mathrm{OR}}(z)$ empirically.

Let $z_1, \ldots, z_m$ be a random sample from the mixture model (1) with cdf $F = (1-p)F_0 + pF_1$ and pdf $f = (1-p)f_0 + pf_1$. Let $\hat p$, $\hat f_0$, and $\hat f$ be consistent estimates of $p$, $f_0$, and $f$; such estimates have been provided by, for example, Jin and Cai (2007). Define $\hat T_{\mathrm{OR}}(z_i) = [(1 - \hat p)\hat f_0(z_i)/\hat f(z_i)] \wedge 1$.
The mFDR of the decision rule $\delta(T_{\mathrm{OR}}, \lambda) = \{I[T_{\mathrm{OR}}(z_i) < \lambda] : i = 1, \ldots, m\}$ is given by $Q_{\mathrm{OR}}(\lambda) = (1-p)G_{\mathrm{OR}}^0(\lambda)/G_{\mathrm{OR}}(\lambda)$, where $G_{\mathrm{OR}}(t)$ and $G_{\mathrm{OR}}^0(t)$ are as defined in Section 3. Let $S_\lambda = \{z : T_{\mathrm{OR}}(z) < \lambda\}$ be the rejection region. Then $G_{\mathrm{OR}}(\lambda) = \int_{S_\lambda} f(z)\,dz = \int 1\{T_{\mathrm{OR}}(z) < \lambda\}f(z)\,dz$. We estimate $G_{\mathrm{OR}}(\lambda)$ by $\hat G_{\mathrm{OR}}(\lambda) = \frac{1}{m}\sum_{i=1}^m 1\{\hat T_{\mathrm{OR}}(z_i) < \lambda\}$. The numerator of $Q_{\mathrm{OR}}(\lambda)$ can be written as $(1-p)G_{\mathrm{OR}}^0(\lambda) = (1-p)\int_{S_\lambda} f_0(z)\,dz = \int 1\{T_{\mathrm{OR}}(z) < \lambda\}T_{\mathrm{OR}}(z)f(z)\,dz$, and we estimate this quantity by $\frac{1}{m}\sum_{i=1}^m 1\{\hat T_{\mathrm{OR}}(z_i) < \lambda\}\hat T_{\mathrm{OR}}(z_i)$. Then $Q_{\mathrm{OR}}(\lambda)$ can be estimated as
$$\hat Q_{\mathrm{OR}}(\lambda) = \left[\sum_{i=1}^m 1\{\hat T_{\mathrm{OR}}(z_i) < \lambda\}\hat T_{\mathrm{OR}}(z_i)\right]\Bigg/\left[\sum_{i=1}^m 1\{\hat T_{\mathrm{OR}}(z_i) < \lambda\}\right].$$
Set the estimated threshold as $\hat\lambda_{\mathrm{OR}} = \sup\{t \in (0,1) : \hat Q_{\mathrm{OR}}(t) \le \alpha\}$, and let $R$ be the set of the ranked values of $\hat T_{\mathrm{OR}}(z_i)$: $R = \{\widehat{\mathrm{Lfdr}}_{(1)}, \ldots, \widehat{\mathrm{Lfdr}}_{(m)}\}$. We consider only the discrete cutoffs in the set $R$, where the estimated mFDR reduces to $\hat Q_{\mathrm{OR}}(\widehat{\mathrm{Lfdr}}_{(k)}) = \frac{1}{k}\sum_{i=1}^k \widehat{\mathrm{Lfdr}}_{(i)}$. We propose the following adaptive step-up procedure:
$$\text{Let } k = \max\Big\{i : \tfrac{1}{i}\textstyle\sum_{j=1}^i \widehat{\mathrm{Lfdr}}_{(j)} \le \alpha\Big\}; \text{ then reject all } H_{(i)},\ i = 1, \ldots, k. \quad (10)$$
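Once the estimated Lfdr values are available, the step-up rule (10) is straightforward to implement. A minimal sketch (our own, with a hypothetical helper name and toy inputs):

```python
def adaptz_stepup(lfdr_hat, alpha):
    """Adaptive z-value step-up (10): reject the k hypotheses with the
    smallest estimated Lfdr, where k is the largest i such that the
    running average of the i smallest Lfdr values is <= alpha."""
    order = sorted(range(len(lfdr_hat)), key=lambda i: lfdr_hat[i])
    running_sum, k = 0.0, 0
    for i, idx in enumerate(order, start=1):
        running_sum += lfdr_hat[idx]
        if running_sum / i <= alpha:  # estimated mFDR of the top-i rejections
            k = i
    return sorted(order[:k])

print(adaptz_stepup([0.90, 0.01, 0.20, 0.05], alpha=0.10))  # → [1, 2, 3]
```

Note that a hypothesis with individual Lfdr above $\alpha$ (here, .20) can still be rejected, because it is the running average of the smallest Lfdr values, not each individual value, that must stay below $\alpha$.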
Similar to the discussion in the proof of Theorem 3, it is sufficient to consider only the discrete cutoffs in $R$, and the adaptive procedure (10) is equivalent to the plug-in procedure $\delta(\hat T_{\mathrm{OR}}, \hat\lambda_{\mathrm{OR}}) = \{I[\hat T_{\mathrm{OR}}(z_1) < \hat\lambda_{\mathrm{OR}}], \ldots, I[\hat T_{\mathrm{OR}}(z_m) < \hat\lambda_{\mathrm{OR}}]\}$, which would be difficult to implement directly because obtaining $\hat\lambda_{\mathrm{OR}}$ as a supremum is difficult.
Remark 4. The procedure (10) is more adaptive than the BH adaptive procedure in the sense that it adapts to both the global feature $p$ and the local feature $f_0/f$; in contrast, the BH method adapts only to the global feature $p$. Suppose that we use the theoretical null $N(0,1)$ in the expression $T_{\mathrm{OR}} = (1-p)f_0/f$. The p value approaches treat the points $-z$ and $z$ equally, whereas the z value approaches evaluate the relative importance of $-z$ and $z$ according to their estimated densities. For example, if there is evidence in the data that there are more nonnulls around $-z$ [i.e., $\hat f(-z)$ is larger], then observation $-z$ will be correspondingly ranked higher than observation $z$.
Remark 5. In the FDR literature, z value–based methods such as the Lfdr procedure (Efron 2004a,b) are used only to calculate individual significance levels, whereas the p value–based procedures are used for global FDR control to identify nonnull cases. It is also notable that the goals of global error control and individual case interpretation are naturally unified in the adaptive procedure (10).
The next two theorems show that the adaptive procedure (10) asymptotically attains the performance of the oracle procedure based on the z values, in the sense that both the mFDR and mFNR levels achieved by the oracle procedure (6) are also asymptotically achieved by the adaptive z value–based procedure (10).
Theorem 4 (Asymptotic validity of the adaptive procedure). Let $\theta_i$, $i = 1, \ldots, m$, be iid Bernoulli($p$) variables, and let $x_i|\theta_i$, $i = 1, \ldots, m$, be independent observations from the mixture model (1) with pdf $f = (1-p)f_0 + pf_1$. Suppose that $f$ is continuous and positive on the real line. Assume that $T_{\mathrm{OR}}(z_i) = (1-p)f_0(z_i)/f(z_i)$ is distributed with marginal pdf $g = (1-p)g_0 + pg_1$ and that $T_{\mathrm{OR}} \in \mathcal{T}$ satisfies the SMLR assumption. Let $\hat p$, $\hat f_0$, and $\hat f$ be estimates of $p$, $f_0$, and $f$ such that $\hat p \overset{p}{\to} p$, $E\|\hat f - f\|^2 \to 0$, and $E\|\hat f_0 - f_0\|^2 \to 0$. Define the test statistic $\hat T_{\mathrm{OR}}(z_i) = (1-\hat p)\hat f_0(z_i)/\hat f(z_i)$, and let $\widehat{\mathrm{Lfdr}}_{(1)}, \ldots, \widehat{\mathrm{Lfdr}}_{(m)}$ be the ranked values of $\hat T_{\mathrm{OR}}(z_i)$. Then the adaptive procedure (10) controls the mFDR at level $\alpha$ asymptotically.
Theorem 5 (Asymptotic attainment of the adaptive procedure). Assume that the random sample $z_1, \ldots, z_m$ and the test statistics $T_{\mathrm{OR}}(z_i)$ and $\hat T_{\mathrm{OR}}(z_i)$ are the same as in Theorem 4. Then the mFNR level of the adaptive procedure (10) is $\widetilde Q_{\mathrm{OR}}(\lambda_{\mathrm{OR}}) + o(1)$, where $\widetilde Q_{\mathrm{OR}}(\lambda_{\mathrm{OR}})$, given in (7), is the mFNR level achieved by the z value oracle procedure (6).
5 NUMERICAL RESULTS
We now turn to the numerical performance of our adaptive z value–based procedure (10). The procedure is easy to implement; the R code for the procedure is available at http://stat.wharton.upenn.edu/~tcai/paper/html/FDR.html. The goal of this section is to provide numerical examples showing that the conventional p value–based procedures are inefficient. We also explore the situations in which these p value–based procedures can be substantially improved by the z value–based procedures. We describe the application of our procedure to the analysis of microarray data from an HIV study in Section 6.
In our numerical studies we consider the following normal mixture model:
$$Z_i \sim (1-p)N(\mu_0, \sigma_0^2) + pN(\mu_i, \sigma_i^2), \quad (11)$$
where $(\mu_i, \sigma_i^2)$ follows some bivariate distribution $F(\mu, \sigma^2)$. This model can be used to approximate many mixture distributions and is found in a wide range of applications (see, e.g., Magder and Zeger 1996; Efron 2004b). We compare the p value–based and z value–based procedures in two numerical examples.

Numerical Study 1. We compare the p value and z value oracle procedures in the normal mixture model (8), a special case of (11). The algorithm for calculating the oracle cutoffs and the corresponding mFNRs is given in Example 1 at the end of Section 3. Figure 3 compares the performance of these two oracle procedures. Panel (a) plots the mFNR of the two oracle procedures as a function of $p_1$, where the mFDR level is set at .10, the alternative means are $\mu_1 = -3$ and $\mu_2 = 3$, and the total proportion of nonnulls is $p = .2$. Panel (b) plots the mFNR as a function of $p_1$ in the same setting, except with alternative means $\mu_1 = -3$ and $\mu_2 = 6$. In panel (c), mFDR = .10, $p_1 = .18$, $p_2 = .02$, and $\mu_1 = -3$, and we plot the mFNR as a function of $\mu_2$. Panel (d) plots the mFNR as a function of the mFDR level while holding $\mu_1 = -3$, $\mu_2 = 1$, $p_1 = .02$, and $p_2 = .18$ fixed. We discuss these results at the end of this section.
Numerical Study 2. We compare the step-up procedure (Benjamini and Hochberg 1995), the adaptive p value–based procedure (Benjamini and Hochberg 2000; Genovese and Wasserman 2004), and the adaptive z value–based procedure (10), designated BH95, BHGW, and AdaptZ hereinafter. Note that the BH step-up procedure is easier to implement than either BHGW or AdaptZ: the BHGW procedure requires estimating the proportion of nonnulls, and the AdaptZ procedure also requires an additional density estimation step.
Figure 3. Comparison of the p value and z value oracle rules: (a) mFDR = .10, μ = (−3, 3); (b) mFDR = .10, μ = (−3, 6); (c) mFDR = .10, p1 = .18, p2 = .02, μ1 = −3; (d) μ = (−3, 1), p = (.02, .18).
Figure 4. Comparison of procedures for FDR control: (a) μ = (−3, 3), p = .20; (b) μ = (−3, 6), p = .20; (c) μ1 = −3, p = (.08, .02); (d) μ1 = −3, p = (.18, .02). Shown are the BH step-up procedure based on the p values, the BHGW adaptive procedure based on the p values, and the adaptive procedure based on the z values. The mFDR is controlled at level .10.
The alternative means $\mu_{1i}$ and $\mu_{2i}$ are generated from the uniform distributions $U(\theta_1 - \delta_1, \theta_1 + \delta_1)$ and $U(\theta_2 - \delta_2, \theta_2 + \delta_2)$, after which $z_i$ is generated from the normal mixture model (8) based on $\mu_{1i}$ and $\mu_{2i}$, $i = 1, \ldots, m$. This hierarchical model can also be viewed as a special case of the mixture model (11). In estimating the Lfdr statistic $\mathrm{Lfdr}(z_i) = (1-p)f_0(z_i)/f(z_i)$, $f_0$ is chosen to be the theoretical null density $N(0,1)$, $p$ is estimated consistently using the approach of Jin and Cai, and $f$ is estimated using a kernel density estimator with bandwidth chosen by cross-validation. The comparison results are displayed in Figure 4, where the top row gives plots of the mFNR as a function of $p_1$ and the bottom row gives plots of the mFNR as a function of $\theta_2$, with other quantities held fixed. All points in the plots are calculated based on a sample size of $m = 5{,}000$ and 200 replications.
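A simplified, self-contained version of such a simulation can be sketched as follows. This is our own illustration, not the study above: it uses fixed alternative means $\pm 3$ rather than uniform draws, and it computes the oracle Lfdr from the true densities instead of estimating $p$ and $f$; it checks that the realized false discovery proportion of the step-up rule lands near the nominal level.

```python
import random
from statistics import NormalDist

random.seed(7)
phi = NormalDist().pdf

# Illustrative setup (our own parameter choices): m draws from
# .8*N(0,1) + .1*N(-3,1) + .1*N(3,1); oracle Lfdr uses the true densities.
m, alpha = 5000, 0.10
theta, z = [], []
for _ in range(m):
    u = random.random()
    if u < 0.8:
        theta.append(0); z.append(random.gauss(0, 1))
    elif u < 0.9:
        theta.append(1); z.append(random.gauss(-3, 1))
    else:
        theta.append(1); z.append(random.gauss(3, 1))

def lfdr(x):
    f = 0.8 * phi(x) + 0.1 * phi(x + 3) + 0.1 * phi(x - 3)
    return 0.8 * phi(x) / f

# Step-up rule (10) applied to the oracle Lfdr values
order = sorted(range(m), key=lambda i: lfdr(z[i]))
running_sum, k = 0.0, 0
for i, idx in enumerate(order, start=1):
    running_sum += lfdr(z[idx])
    if running_sum / i <= alpha:
        k = i
rejected = order[:k]
fdp = sum(1 for i in rejected if theta[i] == 0) / max(k, 1)
print(k, round(fdp, 3))  # realized false discovery proportion, near alpha
```

With estimated rather than true densities, the same skeleton reproduces the AdaptZ procedure compared in Figure 4.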
The following observations can be made based on the resultsfrom these two numerical studies
a When |μ1| = |μ2| the mFNR remains a constant for thep value oracle procedure In contrast the mFNR for the z valueoracle procedure increases first and then decreases [Fig 3(a)]The p value and z value oracle procedures yield the samemFNR levels only when the alternative is symmetric about thenull This reveals the fact that the z value procedure adapts tothe asymmetry in the alternative distribution but the p value
procedure does not Similar phenomena are shown in Figure 4for adaptive procedures
b The p value oracle procedure is dominated by the z valueoracle procedure The largest difference occurs when |μ1| lt μ2and p1 gt p2 where the alternative distribution is highly asym-metric about the null [Figs 3(b) and 3(c)] Similarly the BH95is dominated by BHGW which is again dominated by AdaptZ(Fig 4)
c The mFNRs for both p value and z value procedures de-crease as μ2 moves away from 0
d Within a reasonable range (mFDR lt 6) the mFNR isdecreasing in the mFDR [Fig 3(d)] which verifies part (d) ofProposition 1 one consequence of our SMLR assumption
Additional simulation results show that the difference in the mFNRs of the p value and z value procedures is increasing in the proportion of nonnulls, and that the adaptive procedures (BHGW and AdaptZ) control the mFDR at the nominal level α approximately, whereas the BH95 procedure controls the mFDR at a lower level.
6 APPLICATION TO MICROARRAY DATA
Sun and Cai: Rules for False Discovery Rate Control 909

We now illustrate our method in the analysis of the microarray data from an HIV study. The goal of the HIV study (van't Wout et al. 2003) is to discover differentially expressed genes between HIV-positive patients and HIV-negative controls. Gene expression levels were measured for four HIV-positive patients and four HIV-negative controls on the same N = 7680 genes. A total of N two-sample t-tests were performed, and N two-sided p values were obtained. The z values were converted from the t statistics using the transformation zi = Φ−1[G0(ti)], where Φ and G0 are the cdf's of a standard normal variable and a t variable with 6 degrees of freedom. The histograms of the z values and p values are presented in Figure 5. An important feature of this data set is that the z value distribution is asymmetric about the null; the distribution is skewed to the right.

Figure 5. Histograms of the HIV data: (a) z values; (b) p values; (c) transformed p values, approximately distributed as Uniform(0,1) for the null cases.
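The t-to-z conversion zi = Φ−1[G0(ti)] can be sketched as follows, with the t cdf obtained by numerical integration so that only the standard library and NumPy are needed:

```python
import numpy as np
from statistics import NormalDist
from math import gamma, sqrt, pi

def t_pdf(x, df=6):
    """Density of the t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1.0 + x ** 2 / df) ** (-(df + 1) / 2)

def t_cdf(x, df=6):
    # trapezoid rule over [-40, x]; the t tail below -40 is negligible
    grid = np.linspace(-40.0, float(x), 20001)
    vals = t_pdf(grid, df)
    return float(np.sum((vals[1:] + vals[:-1]) * 0.5 * np.diff(grid)))

def t_to_z(t_stat, df=6):
    # z_i = Phi^{-1}[G0(t_i)], with G0 the cdf of t with df degrees of freedom
    return NormalDist().inv_cdf(t_cdf(t_stat, df))
```

Because the t distribution has heavier tails than the normal, a t statistic of 2.0 maps to a z value below 2.0, and t = 0 maps to z = 0.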
When the null hypothesis is true, the p values and z values should follow their theoretical null distributions, which are uniform and standard normal. But the theoretical nulls are usually quite different from the empirical nulls for data arising from microarray experiments (see Efron 2004b for more discussion on the choice of the null in multiple testing). We take the approach of Jin and Cai (2007) to estimate the null distribution as N(μ̂0, σ̂0²). The estimates μ̂0 and σ̂0² are consistent. We then proceed to estimate the proportion of nonnulls, p, based on μ̂0 and σ̂0². The marginal density f is estimated by a kernel density estimate f̂, with the bandwidth chosen by cross-validation. The Lfdr statistics are then calculated as L̂fdr(zi) = (1 − p̂)f̂0(zi)/f̂(zi). The transformed p values are obtained as F̂0(zi), where F̂0 is the estimated null cdf Φ((x − μ̂0)/σ̂0). As shown in Figure 5(b), after transformation, the distribution of the transformed p values is approximately uniform when the null is true.
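A minimal sketch of this transformation, using the estimated null reported later in this section; when the z values truly follow the estimated null, the transformed p values F̂0(zi) are Uniform(0,1):

```python
import numpy as np
from statistics import NormalDist

def transformed_pvalues(z, mu0, sigma0):
    # F0_hat(z_i) = Phi((z_i - mu0) / sigma0), Uniform(0,1) under the
    # estimated null N(mu0, sigma0^2)
    nd = NormalDist()
    return np.array([nd.cdf((zi - mu0) / sigma0) for zi in z])

rng = np.random.default_rng(1)
mu0, sigma0 = -0.08, 0.77   # estimated null from this section
z_null = rng.normal(mu0, sigma0, 10000)   # simulated null z values
u = transformed_pvalues(z_null, mu0, sigma0)
```

The histogram of `u` is flat, which is exactly the diagnostic displayed in Figure 5.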
We compare the BH95, BHGW, and AdaptZ procedures using both the theoretical nulls and the estimated nulls. We calculate the number of rejections for each mFDR level; the results are shown in Figure 6. For Figure 6(a), f0 is chosen to be the theoretical null N(0,1), and the estimate for the proportion of nulls is 1; therefore, the BH95 and BHGW procedures yield the same number of rejections. For Figure 6(b), the estimated null distribution is N(−.08, .77²), with estimated proportion of nulls p̂0 = .94. Transformed p values as well as the Lfdr statistics are calculated according to the estimated null. The following observations can be made from the results displayed:
a. The number of rejections is increasing as a function of the mFDR.
b. For both the p value-based and z value-based approaches, more hypotheses are rejected by using the estimated null.
c. Both comparisons show that AdaptZ is more powerful than BH95 and BHGW, which are based on p values.
7 DISCUSSION
We have developed a compound decision theory framework for inference about the mixture model (1) and showed that the oracle procedure δ(TOR, λOR) given in (6) is optimal in the multiple-testing problems for mFDR control. We have proposed an adaptive procedure based on the z values that mimics the oracle procedure (6). The adaptive z value-based procedure attains the performance of the oracle procedure asymptotically and outperforms the traditional p value-based approaches. The decision-theoretic framework provides insight into the superiority of the adaptive z value-based procedure. Each multiple-testing procedure involves two steps: ranking and thresholding. The process of ranking the relative importance of the m tests can be viewed as a compound decision problem P = {Pij : 1 ≤ i < j ≤ m} with m(m − 1)/2 components. Each component problem Pij is a pairwise comparison with its own data yij = (xi, xj). The solution to P based on p values is simple (note that δij = I[p(xi) < p(xj)] depends on yij alone), whereas the rule based on the estimated Lfdr is compound. The gain in efficiency of the adaptive z value-based procedure is due to the fact that the scope of attention is extended from the class of simple rules to the class of compound rules in the ranking process.

910 Journal of the American Statistical Association, September 2007

Figure 6. Analysis of the HIV data: number of rejections versus FDR levels for the theoretical null (a) and the estimated null (b), comparing the BH step-up procedure based on the p values, the BHGW adaptive procedure based on the p values, and the adaptive procedure based on the z values.
The compound decision-theoretic approach to multiple testing also suggests several future research directions. First, the oracle statistic (Lfdr) may no longer be optimal for ranking the significance of tests when the observations are correlated. In addition, different cutoffs should be chosen under dependency, because the multiple-testing procedures developed for independent tests can be too liberal or too conservative for dependent tests (Efron 2006). Second, the oracle procedure is optimal only in the class of symmetric decision rules. We expect to develop more general models in the compound decision-theoretic framework and extend our attention to asymmetric rules to further increase the efficiency of the multiple-testing procedures. Finally, Genovese, Roeder, and Wasserman (2005), Wasserman and Roeder (2006), Rubin, Dudoit, and van der Laan (2006), and Foster and Stine (2006) have discussed methods that use different cutoffs for the p values of different tests when prior information is available or some correlation structure is known. It would be interesting to construct z value-based testing procedures that also exploit external information. We leave this topic for future research.
APPENDIX: PROOFS

Proof of Proposition 1

To prove part (a), note that the MLR assumption (4) implies that

∫_0^c g0(t) dt / ∫_0^c g1(t) dt = ∫_0^c g0(t) dt / ∫_0^c [g1(t)/g0(t)] g0(t) dt
< ∫_0^c g0(t) dt / ∫_0^c [g1(c)/g0(c)] g0(t) dt = g0(c)/g1(c),

which is equivalent to g1(c)G0(c) < g0(c)G1(c). Thus the derivative of P(θi = 1 | T(Xi) ≤ c) = pG1(c)/G(c) is (1 − p)p{g1(c)G0(c) − g0(c)G1(c)}/G(c)² < 0. Therefore P(θi = 1 | T(Xi) ≤ c) is decreasing in c. Parts (b) and (c) can be proved similarly to part (a) by noting that mFDR = p0G0(c)/G(c) and mFNR = p1[1 − G1(c)]/[1 − G(c)]. Part (d) follows from parts (b) and (c). For part (e), the classification risk of δ is Rλ = E[Lλ(θ, δ)] = λ(1 − p)G0(c) + p{1 − G1(c)}. The optimal cutoff c* that minimizes Rλ satisfies g0(c*)/g1(c*) = p/[λ(1 − p)], or, equivalently, λ = pg1(c*)/[(1 − p)g0(c*)]. Therefore λ (c*) is monotone decreasing in c* (λ) if (4) holds. Noting that r = G(c*) and that G is increasing, we conclude that r is decreasing in λ.
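The key inequality g1(c)G0(c) < g0(c)G1(c), and the monotonicity of the posterior probability and the mFDR that it implies, can be checked numerically for a hypothetical MLR pair on (0,1): g0 uniform and g1(t) = 2(1 − t), so that g1/g0 is decreasing (small values of the statistic indicate the nonnull):

```python
import numpy as np

# MLR pair on (0,1): g0 uniform, g1(t) = 2(1 - t); closed-form cdf's
g0 = lambda t: np.ones_like(t)
g1 = lambda t: 2.0 * (1.0 - t)
G0 = lambda t: t
G1 = lambda t: 2.0 * t - t ** 2

p = 0.2
c = np.linspace(0.01, 0.99, 99)

lhs = g1(c) * G0(c)                                   # g1(c) G0(c)
rhs = g0(c) * G1(c)                                   # g0(c) G1(c)
G = (1 - p) * G0(c) + p * G1(c)                       # mixture cdf
post = p * G1(c) / G                                  # P(theta = 1 | T <= c)
mfdr = (1 - p) * G0(c) / G                            # mFDR as a function of c
```

Along the grid, `lhs < rhs` holds everywhere, `post` is strictly decreasing in c, and `mfdr` is strictly increasing, matching parts (a) and (b) of the proposition.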
Proof of Theorem 1

According to part (b) of Proposition 1, for any T ∈ T and mFDR level α, there exists a unique threshold c such that the mFDR is controlled at level α by δ(T, c) = {I[T(x1) < c], ..., I[T(xm) < c]}. Let r be the expected number of rejections of δ(T, c). Then, again by part (b) of Proposition 1, there exists a unique threshold c* such that the expected number of rejections of δ(Λ, c*) is also r. Next, according to part (e) of Proposition 1, there exists a unique λ(α) with respect to the choice of c* such that the classification risk with λ(α) as the weight is minimized by δ(Λ, c*), and the expected number of subjects classified to the nonnull population is r.

Suppose that among the r rejected hypotheses there are vL from the null and kL from the nonnull when δ(L, cL) is applied, where (L, cL) = (T, c) or (Λ, c*). Then r = vT + kT = vΛ + kΛ. Now we argue that vT ≥ vΛ and kT ≤ kΛ. If not, then suppose that vT < vΛ and kT > kΛ. Note that the classification risk can be expressed as Rλ(α) = p + (1/m){λ(α)vL − kL}, and it is easy to see that δ(T, c) would then yield a lower classification risk than δ(Λ, c*). This contradicts the fact that δ(Λ, c*) minimizes the classification risk for the choice of λ(α). Therefore we must have vT ≥ vΛ and kT ≤ kΛ.

Let mFDRL and mFNRL be the mFDR and mFNR of δ(L, cL), where L = T, Λ. Now we apply both δ = δ(T, c) and δ* = δ(Λ, c*) in the multiple-testing problem. Then the mFDR using rule δ* is mFDR* = vΛ/r ≤ vT/r = mFDRT = α, and the mFNR using δ* is mFNR* = (m1 − kΛ)/(m − r) ≤ (m1 − kT)/(m − r) = mFNRT. Therefore δ* is the testing rule in class DS that controls the mFDR at level α with the smallest mFNR.
Proof of Theorem 2

The joint distribution of θ = (θ1, ..., θm) is π(θ) = ∏i (1 − p)^(1−θi) p^θi. The posterior distribution of θ given x can be calculated as Pθ|X(θ|x) = ∏i Pθi|Xi(θi|xi), where

Pθi|Xi(θi|xi) = [I(θi = 0)(1 − p)f0(xi) + I(θi = 1)pf1(xi)] / [(1 − p)f0(xi) + pf1(xi)].  (A.1)

Let Ii = I(θi = 1), and let I = {(I1, ..., Im) : Ii = 0 or 1} be the collection of all m-binary vectors; then |I| = 2^m. The elements of I are designated by superscripts, I^(j), j = 1, ..., 2^m. Note that Σ_{θi=0}^{1} [I(θi = 0)(1 − p)f0(xi) + I(θi = 1)pf1(xi)][λI(θi = 0)δi + I(θi = 1)(1 − δi)] = λ(1 − p)f0(xi)δi + pf1(xi)(1 − δi); then the posterior risk is

Eθ|X L(θ, δ) = Σ_{j=1}^{2^m} L(θ = I^(j), δ) ∏i Pθi|Xi(θi = I^(j)_i | xi)
= (1/m) Σi Σ_{θi=0}^{1} Pθi|Xi(θi|xi){λI(θi = 0)δi + I(θi = 1)(1 − δi)}
= (1/m) Σi pf1(xi)/f(xi) + (1/m) Σi [{λ(1 − p)f0(xi) − pf1(xi)}/f(xi)] δi.

The Bayes rule is the simple rule δ^π(x) = {δ1^π(x1), ..., δm^π(xm)}, where δi^π = I[λ(1 − p)f0(xi) < pf1(xi)]. The expected misclassification risk of δ^π is

E Eθ|X L(θ, δ^π) = E[pf1(X)/f(X)] + E[{λ(1 − p)f0(X) − pf1(X)}/f(X) × I{λ(1 − p)f0(X) < pf1(X)}]
= p + ∫_{λ(1−p)f0<pf1} [λ(1 − p)f0(x) − pf1(x)] dx.
Proof of Theorem 3

Observe that for each t, if p(i) ≤ t < p(i+1), then the number of rejections is i. In addition, the ratio (1 − p̂)t/Gm(t) increases in t over the range on which the number of rejections is constant, which implies that if t = p(i) does not satisfy the constraint, then neither does any choice of t between p(i) and p(i+1). Thus it is sufficient to investigate the thresholds that are equal to one of the p(i)'s. At t = p(i), the ratio (1 − p̂)t/Gm(t) becomes (m/i)(1 − p̂)p(i). Therefore the plug-in threshold is given by sup{p(i) : (m/i)(1 − p̂)p(i) ≤ α}, which is equivalent to choosing the largest i such that p(i) ≤ iα/[m(1 − p̂)].

Proof of Theorem 4
To prove this theorem, we need to state the following lemmas.

Lemma A.1. Let p̂, f̂, and f̂0 be estimates such that p̂ →p p, E‖f̂ − f‖² → 0, and E‖f̂0 − f0‖² → 0. Then E‖T̂OR − TOR‖² → 0.

Proof. Note that f is continuous and positive on the real line; then there exists K1 = [−M, M] such that Pr(z ∈ K1^c) → 0 as M → ∞. Let inf_{z∈K1} f(z) = l0 and A_ε^f = {z : |f̂(z) − f(z)| ≥ l0/2}. Note that E‖f̂ − f‖² ≥ (l0/2)² Pr(A_ε^f); then Pr(A_ε^f) → 0. Thus f̂ and f are bounded below by a positive number for large m, except on an event of low probability. Similar arguments can be applied to the upper bounds of f̂ and f, as well as to the upper and lower bounds of f̂0 and f0. Therefore we conclude that f0, f̂0, f, and f̂ are all bounded in an interval [la, lb], 0 < la < lb < ∞, for large m, except on an event, say Aε, that has algebraically low probability; that is, 0 < la < inf_{z∈Aε^c} min{f0, f̂0, f, f̂} ≤ sup_{z∈Aε^c} max{f0, f̂0, f, f̂} < lb < ∞.

Noting that T̂OR − TOR = [f0f(p − p̂) + (1 − p̂)f(f̂0 − f0) + (1 − p)f0(f − f̂)]/(f̂f), we conclude that (T̂OR − TOR)² ≤ c1(p̂ − p)² + c2(f̂0 − f0)² + c3(f̂ − f)² on Aε^c. It is easy to see that ‖T̂OR − TOR‖² is bounded, say by L. Then E‖T̂OR − TOR‖² ≤ L Pr(Aε) + c1E(p̂ − p)² + c2E‖f̂0 − f0‖² + c3E‖f̂ − f‖². Note that E(p̂ − p)² → 0 by lemma 2.2 of van der Vaart (1998), and E‖f̂ − f‖² → 0, E‖f̂0 − f0‖² → 0 by assumption; then, for a given ε > 0, there exists M ∈ Z+ such that we can find Aε with Pr(Aε) < ε/(4L) and, at the same time, E(p̂ − p)² < ε/(4c1), E‖f̂0 − f0‖² < ε/(4c2), and E‖f̂ − f‖² < ε/(4c3) for all m ≥ M. Consequently, E‖T̂OR − TOR‖² < ε for m ≥ M, and the result follows.
Lemma A.2. E‖T̂OR − TOR‖² → 0 implies that T̂OR(Zi) →p TOR(Zi).

Proof. Let Aε = {z : |T̂OR(z) − TOR(z)| ≥ ε}. Then ε² Pr(Aε) ≤ E‖T̂OR − TOR‖² → 0; consequently, Pr(Aε) → 0. Therefore Pr(|T̂OR(Zi) − TOR(Zi)| ≥ ε) ≤ Pr(Aε) + Pr({|T̂OR(Zi) − TOR(Zi)| ≥ ε} ∩ Aε^c) = Pr(Aε) → 0, and the result follows.
Lemma A.3. For α < t < 1, E[1{T̂OR(Zi) < t}T̂OR(Zi)] → E[1{TOR(Zi) < t}TOR(Zi)].

Proof. T̂OR(Zi) →p TOR(Zi) (Lemma A.2) implies that T̂OR(Zi) →d TOR(Zi). Let h(x) = 1{x < t}x; then h(x) is bounded and continuous for x < t. By lemma 2.2 of van der Vaart (1998), E[h(T̂OR(Zi))] → E[h(TOR(Zi))], and the result follows.
Lemma A.4. Construct the empirical distributions ĜOR(t) = (1/m) Σ_{i=1}^m 1{T̂OR(zi) ≤ t} and Ĝ0OR(t) = (1/m) Σ_{i=1}^m 1{T̂OR(zi) ≤ t}T̂OR(zi). Define Q̂OR(t) = Ĝ0OR(t)/ĜOR(t), the estimated mFDR. Then, for α < t < 1, Q̂OR(t) →p QOR(t).

Proof. Let ρm = cov[T̂OR(Zi), T̂OR(Zj)], where Zi and Zj are two independent random variables from the mixture distribution F. T̂OR(Zi) →p TOR(Zi) (Lemma A.2) implies that E[T̂OR(Zi)T̂OR(Zj)] → E[TOR(Zi)TOR(Zj)]. Therefore ρm = cov[T̂OR(Zi), T̂OR(Zj)] → cov[TOR(Zi), TOR(Zj)] = 0. Let σm² = var(T̂OR(Z1)); then σm² ≤ E[T̂OR(Z1)²] ≤ 1.

Let μm = Pr{T̂OR(Zi) < t} and Sm = Σ_{i=1}^m 1{T̂OR(zi) ≤ t}. Note that var(Sm)/m² = (1/m)σm² + [(m − 1)/m]ρm ≤ 1/m + ρm → 0. According to the weak law of large numbers for triangular arrays, we have (1/m) Σ_{i=1}^m 1{T̂OR(zi) ≤ t} − μm →p 0. Also note that μm = Pr{T̂OR(Zi) < t} → Pr{TOR(Zi) < t} = GOR(t); we conclude that ĜOR(t) →p GOR(t). Next we let vm = E[1{T̂OR(Zi) ≤ t}T̂OR(Zi)]. Similarly, we can prove that Ĝ0OR(t) − vm = (1/m) Σ_{i=1}^m 1{T̂OR(zi) ≤ t}T̂OR(zi) − vm →p 0. Note that by Lemma A.3 we have E[1{T̂OR(Zi) < t}T̂OR(Zi)] → E[1{TOR(Zi) < t}TOR(Zi)]; thus vm → E[1{TOR(Zi) < t}TOR(Zi)] = (1 − p) ∫ 1{TOR(z) < t}f0(z) dz = (1 − p)G0OR(t), and Ĝ0OR(t) →p (1 − p)G0OR(t). Finally, note that for t > α, GOR(t) is bounded away from 0, and we obtain Q̂OR(t) →p (1 − p)G0OR(t)/GOR(t) = QOR(t).
Lemma A.5. Define the estimated plug-in threshold λ̂OR = sup{t ∈ (0,1) : Q̂OR(t) ≤ α}. If Q̂OR(t) →p QOR(t), then λ̂OR →p λOR.

Proof. Note that Q̂OR(t) is not continuous, and the consistency of Q̂OR(t) does not necessarily imply the consistency of λ̂OR. Thus we first construct an envelope for Q̂OR(t) using two continuous random functions Q̂−OR(t) and Q̂+OR(t) such that, for L̂fdr(k) < t < L̂fdr(k+1),

Q̂−OR(t) = Q̂OR(L̂fdr(k−1)) (t − L̂fdr(k))/(L̂fdr(k+1) − L̂fdr(k)) + Q̂OR(L̂fdr(k)) (L̂fdr(k+1) − t)/(L̂fdr(k+1) − L̂fdr(k))

and

Q̂+OR(t) = Q̂OR(L̂fdr(k)) (t − L̂fdr(k))/(L̂fdr(k+1) − L̂fdr(k)) + Q̂OR(L̂fdr(k+1)) (L̂fdr(k+1) − t)/(L̂fdr(k+1) − L̂fdr(k)).

Noting that Q̂OR(L̂fdr(k+1)) − Q̂OR(L̂fdr(k)) = [kL̂fdr(k+1) − Σ_{i=1}^k L̂fdr(i)]/[k(k + 1)] > 0, we have Q̂−OR(t) ≤ Q̂OR(t) ≤ Q̂+OR(t), with both Q̂−OR(t) and Q̂+OR(t) strictly increasing in t. Let λ̂−OR = sup{t ∈ (0,1) : Q̂−OR(t) ≤ α} and λ̂+OR = sup{t ∈ (0,1) : Q̂+OR(t) ≤ α}; then λ̂+OR ≤ λ̂OR ≤ λ̂−OR.

We claim that λ̂−OR →p λOR. If not, then there must exist ε0 and η0 such that, for all M ∈ Z+, Pr(|λ̂−OR − λOR| > ε0) > 2η0 holds for some m ≥ M. Suppose that Pr(λ̂−OR > λOR + ε0) > 2η0. It is easy to see that Q̂−OR →p QOR, because Q̂−OR(t) − Q̂+OR(t) →a.s. 0, Q̂−OR(t) ≤ Q̂OR(t) ≤ Q̂+OR(t), and Q̂OR(t) →p QOR(t). Thus there exists M ∈ Z+ such that Pr(|Q̂−OR(λOR + ε0) − QOR(λOR + ε0)| < δ0) > 1 − η0 for all m ≥ M, where we let 2δ0 = QOR(λOR + ε0) − α. Now, for this choice of m, there exists an event K1m such that Pr(K1m) ≥ 1 − η0 and, for all outcomes ω ∈ K1m, |Q̂−OR(λOR + ε0) − QOR(λOR + ε0)| < δ0. At the same time, there exists an event K2m such that Pr(K2m) ≥ 2η0 and, for all outcomes ω ∈ K2m, λ̂−OR > λOR + ε0. Let Km = K1m ∩ K2m; then Pr(Km) = Pr(K1m) + Pr(K2m) − Pr(K1m ∪ K2m) ≥ Pr(K1m) + Pr(K2m) − 1 ≥ η0. That is, Km has positive measure. Then, for all outcomes in Km, Q̂−OR(t) is continuous and strictly increasing, and we have α = Q̂−OR(λ̂−OR) > Q̂−OR(λOR + ε0) > QOR(λOR + ε0) − δ0 > α + δ0. This is a contradiction. Therefore we must have λ̂−OR →p λOR. Similarly, we can prove that λ̂+OR →p λOR. Noting that Q̂−OR(t) − Q̂+OR(t) →a.s. 0 implies λ̂−OR − λ̂+OR →a.s. 0, the result follows from λ̂+OR ≤ λ̂OR ≤ λ̂−OR.
Proof of Theorem 4. Implementing the adaptive procedure (10) is equivalent to choosing T̂OR(Zi) as the test statistic and λ̂OR as the threshold. The mFDR of the decision rule δ(T̂OR, λ̂OR) is mFDR = E(N10)/E(R) = [(1 − p)PH0(T̂OR < λ̂OR)]/[P(T̂OR < λ̂OR)]. Noting that T̂OR →p TOR by Lemma A.2 and λ̂OR →p λOR by Lemma A.5, it follows that (1 − p)PH0(T̂OR < λ̂OR) → (1 − p)PH0(TOR < λOR) = (1 − p)G0OR(λOR) and P(T̂OR < λ̂OR) → P(TOR < λOR) = GOR(λOR). Noting that P(TOR < λOR) is bounded away from 0, we have mFDR → (1 − p)G0OR(λOR)/GOR(λOR) = QOR(λOR) = α.
Proof of Theorem 5

Similar to the proof of Theorem 4, the mFNR of the adaptive procedure (10) is mFNR = E(N01)/E(S) = pPH1(T̂OR > λ̂OR)/P(T̂OR > λ̂OR). By Lemmas A.2 and A.5, we have PH1(T̂OR > λ̂OR) → PH1(TOR > λOR) = G̃1OR(λOR) and P(T̂OR > λ̂OR) → P(TOR > λOR) = G̃OR(λOR), where G̃ denotes the corresponding survival function. Noting that G̃OR(λOR) is bounded away from 0, we obtain mFNR → pG̃1OR(λOR)/G̃OR(λOR) = Q̃OR(λOR).
[Received November 2006. Revised March 2007.]
REFERENCES

Benjamini, Y., and Hochberg, Y. (1995), "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing," Journal of the Royal Statistical Society, Ser. B, 57, 289-300.
Benjamini, Y., and Hochberg, Y. (2000), "On the Adaptive Control of the False Discovery Rate in Multiple Testing With Independent Statistics," Journal of Educational and Behavioral Statistics, 25, 60-83.
Copas, J. (1974), "On Symmetric Compound Decision Rules for Dichotomies," The Annals of Statistics, 2, 199-204.
Efron, B. (2004a), "Local False Discovery Rate," technical report, Stanford University, Dept. of Statistics, available at http://www-stat.stanford.edu/~brad/papers/False.pdf.
Efron, B. (2004b), "Large-Scale Simultaneous Hypothesis Testing: The Choice of a Null Hypothesis," Journal of the American Statistical Association, 99, 96-104.
Efron, B. (2006), "Correlation and Large-Scale Simultaneous Testing," technical report, Stanford University, Dept. of Statistics, available at http://www-stat.stanford.edu/~brad/papers/Correlation-2006.pdf.
Efron, B., Tibshirani, R., Storey, J., and Tusher, V. (2001), "Empirical Bayes Analysis of a Microarray Experiment," Journal of the American Statistical Association, 96, 1151-1160.
Foster, D., and Stine, R. (2006), "Alpha Investing: A New Multiple Hypothesis Testing Procedure That Controls mFDR," technical report, University of Pennsylvania, Dept. of Statistics.
Genovese, C., and Wasserman, L. (2002), "Operating Characteristic and Extensions of the False Discovery Rate Procedure," Journal of the Royal Statistical Society, Ser. B, 64, 499-517.
Genovese, C., and Wasserman, L. (2004), "A Stochastic Process Approach to False Discovery Control," The Annals of Statistics, 32, 1035-1061.
Genovese, C., Roeder, K., and Wasserman, L. (2005), "False Discovery Control With p-Value Weighting," Biometrika, 93, 509-524.
Jin, J., and Cai, T. (2007), "Estimating the Null and the Proportion of Non-Null Effects in Large-Scale Multiple Comparisons," Journal of the American Statistical Association, 102, 495-506.
Magder, L., and Zeger, S. (1996), "A Smooth Nonparametric Estimate of a Mixing Distribution Using Mixtures of Gaussians," Journal of the American Statistical Association, 91, 1141-1151.
Robbins, H. (1951), "Asymptotically Subminimax Solutions of Compound Statistical Decision Problems," in Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA: University of California Press, pp. 131-148.
Rubin, D., Dudoit, S., and van der Laan, M. (2006), "A Method to Increase the Power of Multiple Testing Procedures Through Sample Splitting," working paper, University of California, Berkeley, Dept. of Biostatistics, available at http://www.bepress.com/ucbbiostat/paper171.
Spjøtvoll, E. (1972), "On the Optimality of Some Multiple Comparison Procedures," The Annals of Mathematical Statistics, 43, 398-411.
Storey, J. (2002), "A Direct Approach to False Discovery Rates," Journal of the Royal Statistical Society, Ser. B, 64, 479-498.
Storey, J. (2003), "The Positive False Discovery Rate: A Bayesian Interpretation and the q-Value," The Annals of Statistics, 31, 2012-2035.
Storey, J. (2007), "The Optimal Discovery Procedure: A New Approach to Simultaneous Significance Testing," Journal of the Royal Statistical Society, Ser. B, 69, 347-368.
van der Laan, M., Dudoit, S., and Pollard, K. (2004), "Multiple Testing. Part III: Augmentation Procedures for Control of the Generalized Family-Wise Error Rate and Tail Probabilities for the Proportion of False Positives," technical report, University of California, Berkeley, Dept. of Biostatistics.
van der Vaart, A. (1998), Asymptotic Statistics, Cambridge, U.K.: Cambridge University Press.
van't Wout, A., Lehrman, G., Mikheeva, S., O'Keeffe, G., Katze, M., Bumgarner, R., Geiss, G., and Mullins, J. (2003), "Cellular Gene Expression Upon Human Immunodeficiency Virus Type 1 Infection of CD4+ T-Cell Lines," Journal of Virology, 77, 1392-1402.
Wasserman, L., and Roeder, K. (2006), "Weighted Hypothesis Testing," technical report, Carnegie Mellon University, Dept. of Statistics, available at http://www.arxiv.org/abs/math.ST/0604172.
4 ADAPTIVE PROCEDURES FOR mFDR CONTROL
Genovese and Wasserman (2004) discussed a class of plug-in procedures for the purpose of practical implementation of the oracle procedure based on the p values. However, the plug-in idea is difficult to apply to the z value oracle procedure, because it essentially involves estimating the distribution of the oracle test statistic TOR(z), which is usually very difficult. Instead, we develop an adaptive procedure that requires only estimation of the distribution of the z values, so that the difficulty of estimating the distribution of TOR(z) is avoided. The adaptive procedure can be easily implemented by noting that the z values are usually distributed as a normal mixture after appropriate transformations are applied, and several methods for consistently estimating the normal nulls have been developed in the literature (see, e.g., Efron 2004b; Jin and Cai 2007).
In this section we first introduce an adaptive p value-based procedure proposed by Benjamini and Hochberg (2000). We then turn to the development of an adaptive procedure based on the z values. We show that the adaptive z value-based procedure asymptotically attains the performance of the oracle procedure (6). We report simulation studies in Section 5 demonstrating that the z value-based adaptive procedure is more efficient than the traditional p value-based testing approaches.
4.1 Adaptive Procedure Based on p Values
Suppose that P(1), ..., P(m) are the ranked p values from m independent tests, and let k = max{i : P(i) ≤ αi/[m(1 − p̂)]}. Then the adaptive procedure of Benjamini and Hochberg (2000), designated BH hereinafter, rejects all H(i), i ≤ k. Genovese and Wasserman (2004) proposed the plug-in threshold t(p̂, Ĝ) for p values, where t(p, G) = sup{t : (1 − p)t/G(t) ≤ α} is the threshold of the oracle p value procedure, and p̂ and Ĝ are estimates of p and G. The next theorem shows that the GW plug-in procedure and the BH adaptive procedure are equivalent when the empirical distribution of the p values, Gm, is used to estimate G.
Theorem 3. Let p̂ be an estimate of p, and let Gm be the empirical cdf of the p values. Then the GW plug-in procedure is equivalent to the BH adaptive procedure.
It follows from our Theorem 3 and theorem 5 of Genovese and Wasserman (2004) that the BH adaptive procedure controls the mFDR at level α asymptotically.
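The BH adaptive step-up rule can be sketched as follows, with hypothetical data and the estimate p̂ taken as given:

```python
import numpy as np

def bh_adaptive(pvals, alpha, p_hat):
    """Number of rejections of the BH adaptive step-up rule:
    k = max{ i : p_(i) <= i * alpha / (m * (1 - p_hat)) }."""
    p_sorted = np.sort(pvals)
    m = len(p_sorted)
    crit = np.arange(1, m + 1) * alpha / (m * (1.0 - p_hat))
    passed = np.nonzero(p_sorted <= crit)[0]
    return 0 if passed.size == 0 else int(passed[-1]) + 1

rng = np.random.default_rng(3)
m, p_hat, alpha = 1000, 0.1, 0.10
# 900 null p values (uniform) plus 100 very small nonnull p values
pvals = np.concatenate([rng.random(900), rng.random(100) * 1e-4])
k = bh_adaptive(pvals, alpha, p_hat)
```

By Theorem 3, rejecting the k smallest p values here coincides with thresholding at the GW plug-in level sup{t : (1 − p̂)t/Gm(t) ≤ α}.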
4.2 Adaptive Procedure Based on z Values
Here we outline the steps of an intuitive derivation of the adaptive z value-based procedure. The derivation essentially involves mimicking the operation of the z value oracle procedure and evaluating the distribution of TOR(z) empirically.
Let z1, ..., zm be a random sample from the mixture model (1), with cdf F = (1 − p)F0 + pF1 and pdf f = (1 − p)f0 + pf1. Let p̂, f̂0, and f̂ be consistent estimates of p, f0, and f. Such estimates have been provided by, for example, Jin and Cai (2007). Define T̂OR(zi) = [(1 − p̂)f̂0(zi)/f̂(zi)] ∧ 1.

The mFDR of the decision rule δ(TOR, λ) = {I[TOR(zi) < λ] : i = 1, ..., m} is given by QOR(λ) = (1 − p)G0OR(λ)/GOR(λ), where GOR(t) and G0OR(t) are as defined in Section 3. Let Sλ = {z : TOR(z) < λ} be the rejection region. Then GOR(λ) = ∫_Sλ f(z) dz = ∫ 1{TOR(z) < λ}f(z) dz. We estimate GOR(λ) by ĜOR(λ) = (1/m) Σ_{i=1}^m 1{T̂OR(zi) < λ}. The numerator of QOR(λ) can be written as (1 − p)G0OR(λ) = (1 − p) ∫_Sλ f0(z) dz = ∫ 1{TOR(z) < λ}TOR(z)f(z) dz, and we estimate this quantity by (1/m) Σ_{i=1}^m 1{T̂OR(zi) < λ}T̂OR(zi). Then QOR(λ) can be estimated as

Q̂OR(λ) = [Σ_{i=1}^m 1{T̂OR(zi) < λ}T̂OR(zi)] / [Σ_{i=1}^m 1{T̂OR(zi) < λ}].

Set the estimated threshold as λ̂OR = sup{t ∈ (0,1) : Q̂OR(t) ≤ α}, and let R be the set of ranked T̂OR(zi): R = {L̂fdr(1), ..., L̂fdr(m)}. We consider only the discrete cutoffs in the set R, where the estimated mFDR reduces to Q̂OR(L̂fdr(k)) = (1/k) Σ_{i=1}^k L̂fdr(i). We propose the following adaptive step-up procedure:

Let k = max{i : (1/i) Σ_{j=1}^i L̂fdr(j) ≤ α}; then reject all H(i), i = 1, ..., k. (10)
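The step-up rule (10) amounts to sorting the estimated Lfdr values and rejecting as long as their running mean stays at or below α; a minimal sketch with hypothetical Lfdr values:

```python
import numpy as np

def adaptz(lfdr, alpha):
    """Adaptive step-up rule (10): reject the k hypotheses with the smallest
    Lfdr values, where k is the largest i whose running mean of sorted
    Lfdr values is at most alpha."""
    order = np.argsort(lfdr)
    running_mean = np.cumsum(lfdr[order]) / np.arange(1, len(lfdr) + 1)
    passed = np.nonzero(running_mean <= alpha)[0]
    k = 0 if passed.size == 0 else int(passed[-1]) + 1
    reject = np.zeros(len(lfdr), dtype=bool)
    reject[order[:k]] = True
    return reject

lfdr = np.array([0.01, 0.40, 0.03, 0.95, 0.10, 0.70])
rej = adaptz(lfdr, alpha=0.10)   # rejects the three smallest Lfdr values
```

With these values the running means are .01, .02, .047, .135, ..., so k = 3, and the estimated mFDR among the rejected cases is their average Lfdr, about .047.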
Similar to the discussion in the proof of Theorem 3, it is sufficient to consider only the discrete cutoffs in R, and the adaptive procedure (10) is equivalent to the plug-in procedure δ(T̂OR, λ̂OR) = {I[T̂OR(z1) < λ̂OR], ..., I[T̂OR(zm) < λ̂OR]}, which would be very difficult to implement directly because obtaining λ̂OR is difficult.
Remark 4. The procedure (10) is more adaptive than the BH adaptive procedure in the sense that it adapts to both the global feature p and the local feature f0/f; in contrast, the BH method adapts only to the global feature p. Suppose that we use the theoretical null N(0,1) in the expression TOR = (1 − p)f0/f. The p value approaches treat the points −z and z equally, whereas the z value approaches evaluate the relative importance of −z and z according to their estimated densities. For example, if there is evidence in the data that there are more nonnulls around −z [i.e., f̂(−z) is larger], then the observation −z will be correspondingly ranked higher than the observation z.
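Remark 4 in miniature, for a hypothetical one-sided mixture with all nonnulls centered at −3: the two-sided p value cannot distinguish −z from z, but the Lfdr statistic can, because f is larger at −z:

```python
import numpy as np

# Hypothetical mixture 0.8 N(0,1) + 0.2 N(-3,1)
p, mu1 = 0.2, -3.0
phi = lambda x, mu: np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi)
f = lambda x: (1 - p) * phi(x, 0.0) + p * phi(x, mu1)
T = lambda z: (1 - p) * phi(z, 0.0) / f(z)   # T_OR with the theoretical null

z = 2.5
# phi(z, 0) == phi(-z, 0), so the two-sided p values agree at -z and z,
# yet T(-z) < T(z): the z value approach ranks -z higher.
```

This is exactly the asymmetry adaptation that separates the z value procedures from the p value procedures in Figures 3 and 4.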
Remark 5. In the FDR literature, z value-based methods such as the Lfdr procedure (Efron 2004a,b) are used only to calculate individual significance levels, whereas the p value-based procedures are used for global FDR control to identify nonnull cases. It is also notable that the goals of global error control and individual case interpretation are naturally unified in the adaptive procedure (10).
The next two theorems show that the adaptive procedure (10) asymptotically attains the performance of the oracle procedure based on the z values, in the sense that both the mFDR and mFNR levels achieved by the oracle procedure (6) are also asymptotically achieved by the adaptive z value-based procedure (10).
Theorem 4 (Asymptotic validity of the adaptive procedure). Let θi, i = 1, ..., m, be iid Bernoulli(p) variables, and let xi|θi, i = 1, ..., m, be independent observations from the mixture model (1) with pdf f = (1 − p)f0 + pf1. Suppose that f is continuous and positive on the real line. Assume that TOR(zi) = (1 − p)f0(zi)/f(zi) is distributed with marginal pdf g = (1 − p)g0 + pg1 and that TOR ∈ T satisfies the SMLR assumption. Let p̂, f̂0, and f̂ be estimates of p, f0, and f such that p̂ →p p, E‖f̂ − f‖² → 0, and E‖f̂0 − f0‖² → 0. Define the test statistic T̂OR(zi) = (1 − p̂)f̂0(zi)/f̂(zi), and let L̂fdr(1), ..., L̂fdr(m) be the ranked values of T̂OR(zi); then the adaptive procedure (10) controls the mFDR at level α asymptotically.
Theorem 5 (Asymptotic attainment of the adaptive procedure). Assume that the random sample z1, ..., zm and the test statistics TOR(zi) and T̂OR(zi) are the same as in Theorem 4. Then the mFNR level of the adaptive procedure (10) is Q̃OR(λOR) + o(1), where Q̃OR(λOR), given in (7), is the mFNR level achieved by the z value oracle procedure (6).
5 NUMERICAL RESULTS
We now turn to the numerical performance of our adaptive z value-based procedure (10). The procedure is easy to implement; the R code for the procedure is available at http://stat.wharton.upenn.edu/~tcai/paper/html/FDR.html. The goal of this section is to provide numerical examples showing that the conventional p value-based procedures are inefficient. We also explore situations in which these p value-based procedures can be substantially improved by the z value-based procedures. We describe the application of our procedure to the analysis of microarray data from an HIV study in Section 6.
In our numerical studies we consider the following normal mixture model:

Zi ~ (1 − p)N(μ0, σ0²) + pN(μi, σi²), (11)

where (μi, σi²) follows some bivariate distribution F(μ, σ²). This model can be used to approximate many mixture distributions and is found in a wide range of applications (see, e.g., Magder and Zeger 1996; Efron 2004b). We compare the p value-based and z value-based procedures in two numerical examples.
Numerical Study 1. We compare the p value and z value oracle procedures in the normal mixture model (8), a special case of (11). The algorithm for calculating the oracle cutoffs and the corresponding mFNRs is given in Example 1 at the end of Section 3. Figure 3 compares the performance of these two oracle procedures. Panel (a) plots the mFNR of the two oracle procedures as a function of p1, where the mFDR level is set at .10, the alternative means are μ1 = −3 and μ2 = 3, and the total proportion of nonnulls is p = .2. Panel (b) plots the mFNR as a function of p1 in the same setting, except with alternative means μ1 = −3 and μ2 = 6. In panel (c), mFDR = .10, p1 = .18, p2 = .02, and μ1 = −3, and we plot the mFNR as a function of μ2. Panel (d) plots the mFNR as a function of the mFDR level while holding μ1 = −3, μ2 = 1, p1 = .02, and p2 = .18 fixed. We discuss these results at the end of this section.
Numerical Study 2. We compare the step-up procedure (Benjamini and Hochberg 1995), the adaptive p value-based procedure (Benjamini and Hochberg 2000; Genovese and Wasserman 2004), and the adaptive z value-based procedure (10), designated BH95, BHGW, and AdaptZ hereinafter. Note that the BH95 step-up procedure is easier to implement than either BHGW or AdaptZ: the BHGW procedure requires estimating the proportion of nonnulls, and the AdaptZ procedure also requires an additional density estimation step.
Figure 3. Comparison of the p value and z value oracle rules. (a) mFDR = .10, μ = (−3, 3); (b) mFDR = .10, μ = (−3, 6); (c) mFDR = .10, p1 = .18, p2 = .02, μ1 = −3; (d) μ = (−3, 1), p = (.02, .18).
Figure 4. Comparison of procedures for FDR control: the BH step-up procedure based on the p values, the BHGW adaptive procedure based on the p values, and the adaptive procedure based on the z values. (a) μ = (−3, 3), p = .20; (b) μ = (−3, 6), p = .20; (c) μ1 = −3, p = (.08, .02); (d) μ1 = −3, p = (.18, .02). The mFDR is controlled at level .10.
The alternative means μ1i and μ2i are generated from uni-form distributions U(θ1 minus δ1 θ1 + δ1) and U(θ2 minus δ2 θ2 + δ2)after which zi is generated from the normal mixture model (8)based on μ1i and μ2i i = 1 m This hierarchical modelalso can be viewed as a special case of the mixture model (11)In estimating the Lfdr statistic Lfdr(zi) = (1 minus p)f0(zi)f (zi)f0 is chosen to be the theoretical null density N(01) p is esti-mated consistently using the approach of Jin and Cai and f isestimated using a kernel density estimator with bandwidth cho-sen by cross-validation The comparison results are displayedin Figure 4 where the top row gives plots of the mFNR as afunction of p1 and the bottom row gives plots of the mFNR as afunction of θ2 with other quantities held fixed All points in theplots are calculated based on a sample size of m = 5000 and200 replications
The following observations can be made based on the results from these two numerical studies.

a. When |μ1| = |μ2|, the mFNR remains a constant for the p value oracle procedure. In contrast, the mFNR for the z value oracle procedure increases first and then decreases [Fig. 3(a)]. The p value and z value oracle procedures yield the same mFNR levels only when the alternative is symmetric about the null. This reveals the fact that the z value procedure adapts to the asymmetry in the alternative distribution but the p value procedure does not. Similar phenomena are shown in Figure 4 for the adaptive procedures.

b. The p value oracle procedure is dominated by the z value oracle procedure. The largest difference occurs when |μ1| < μ2 and p1 > p2, where the alternative distribution is highly asymmetric about the null [Figs. 3(b) and 3(c)]. Similarly, BH95 is dominated by BHGW, which is again dominated by AdaptZ (Fig. 4).

c. The mFNRs for both p value and z value procedures decrease as μ2 moves away from 0.

d. Within a reasonable range (mFDR < .6), the mFNR is decreasing in the mFDR [Fig. 3(d)], which verifies part (d) of Proposition 1, one consequence of our SMLR assumption.

Additional simulation results show that the difference in the mFNRs of the p value and z value procedures is increasing in the proportion of nonnulls, and that the adaptive procedures (BHGW and AdaptZ) control the mFDR at the nominal level α approximately, whereas the BH95 procedure controls the mFDR at a lower level.
6 APPLICATION TO MICROARRAY DATA
We now illustrate our method in the analysis of the microarray data from an HIV study. The goal of the HIV study (van't Wout et al. 2003) is to discover differentially expressed genes
Sun and Cai Rules for False Discovery Rate Control 909
Figure 5. Histograms of the HIV data: (a) z values; (b) p values; (c) transformed p values, approximately distributed as uniform (0, 1) for the null cases.
between HIV-positive patients and HIV-negative controls. Gene expression levels were measured for four HIV-positive patients and four HIV-negative controls on the same N = 7,680 genes. A total of N two-sample t-tests were performed, and N two-sided p values were obtained. The z values were converted from the t statistics using the transformation zi = Φ−1[G0(ti)], where Φ and G0 are the cdf's of a standard normal variable and a t variable with 6 degrees of freedom. The histograms of the z values and p values are presented in Figure 5. An important feature of this data set is that the z value distribution is asymmetric about the null; the distribution is skewed to the right.
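The quantile transformation from t statistics to z values can be sketched as:

```python
import numpy as np
from scipy.stats import norm, t as t_dist

def t_to_z(t_stats, df=6):
    """Quantile transformation z_i = Phi^{-1}[G_0(t_i)]: maps t statistics
    with df degrees of freedom to z values that are N(0,1) under the null."""
    return norm.ppf(t_dist.cdf(np.asarray(t_stats), df))
```

The transformation is monotone and symmetric about 0, so it preserves the ranking of the test statistics while placing them on the standard normal scale.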
When the null hypothesis is true, the p values and z values should follow their theoretical null distributions, which are uniform and standard normal. But the theoretical nulls are usually quite different from the empirical nulls for data arising from microarray experiments (see Efron 2004b for more discussion on the choice of the null in multiple testing). We take the approach of Jin and Cai (2007) to estimate the null distribution as N(μ̂0, σ̂0²). The estimates μ̂0 and σ̂0² are consistent. We then proceed to estimate the proportion of the nonnulls, p, based on μ̂0 and σ̂0². The marginal density f is estimated by a kernel density estimate f̂ with the bandwidth chosen by cross-validation. The Lfdr statistics are then calculated as ˆLfdr(zi) = (1 − p̂)f̂0(zi)/f̂(zi). The transformed p values are obtained as F̂0(zi), where F̂0 is the estimated null cdf Φ((x − μ̂0)/σ̂0). As shown in Figure 5(b), after transformation the distribution of the transformed p values is approximately uniform when the null is true.
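Jin and Cai's (2007) estimator of the empirical null is Fourier-based and beyond the scope of a short sketch. As a hedged stand-in, the following fits N(μ̂0, σ̂0²) to the central bulk of the z values using the median and a normalized interquartile range; this is an assumed simplification, not the estimator used in the paper.

```python
import numpy as np
from scipy.stats import norm

def empirical_null_iqr(z):
    """Crude stand-in for an empirical-null estimate: fit N(mu0, sigma0^2)
    to the central bulk of the z values via the median and a normalized
    interquartile range, both insensitive to sparse nonnull tails.
    (The Jin-Cai estimator used in the paper is different.)"""
    mu0 = np.median(z)
    q25, q75 = np.percentile(z, [25, 75])
    sigma0 = (q75 - q25) / (norm.ppf(0.75) - norm.ppf(0.25))
    return float(mu0), float(sigma0)
```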
We compare the BH95, BHGW, and AdaptZ procedures using both the theoretical nulls and the estimated nulls. We calculate the number of rejections for each mFDR level; the results are shown in Figure 6. For Figure 6(a), f0 is chosen to be the theoretical null N(0, 1), and the estimate for the proportion of nulls is 1; therefore, the BH95 and BHGW procedures yield the same number of rejections. For Figure 6(b), the estimated null distribution is N(−.08, .77²), with estimated proportion of nulls p̂0 = .94. Transformed p values, as well as the Lfdr statistics, are calculated according to the estimated null. The following observations can be made from the results displayed.
a. The number of rejections is increasing as a function of the mFDR.

b. For both the p value–based and z value–based approaches, more hypotheses are rejected by using the estimated null.

c. Both comparisons show that AdaptZ is more powerful than BH95 and BHGW, which are based on p values.
7 DISCUSSION
We have developed a compound decision theory framework for inference about the mixture model (1) and showed that the oracle procedure δ(TOR, λOR) given in (6) is optimal in the multiple-testing problems for mFDR control. We have proposed an adaptive procedure based on the z values that mimics the oracle procedure (6). The adaptive z value–based procedure attains the performance of the oracle procedure asymptotically and outperforms the traditional p value–based approaches. The decision-theoretic framework provides insight into the superiority of the adaptive z value–based procedure. Each multiple-testing procedure involves two steps: ranking and thresholding. The process of ranking the relative importance of the m tests can be viewed as a compound decision problem P = {Pij : 1 ≤ i < j ≤ m} with m(m − 1)/2 components. Each component problem Pij is a pairwise comparison with its own data yij = (xi, xj). The solution to P based on p values is simple (note that δij = I[p(xi) < p(xj)] depends on yij alone), whereas the rule based on the estimated Lfdr is compound. The gain in efficiency of the adaptive z value–based procedure is due to the fact that the scope of attention is extended from the
Figure 6. Analysis of the HIV data: number of rejections versus FDR levels for the theoretical null (a) and the estimated null (b), for the BH step-up procedure based on the p values, the BHGW adaptive procedure based on the p values, and the adaptive procedure based on the z values.
class of simple rules to the class of compound rules in the ranking process.
The compound decision-theoretic approach to multiple testing also suggests several future research directions. First, the oracle statistic (Lfdr) may no longer be optimal in ranking the significance of tests when the observations are correlated. In addition, different cutoffs should be chosen under dependency, because the multiple-testing procedures developed for independent tests can be too liberal or too conservative for dependent tests (Efron 2006). Second, the oracle procedure is optimal only in the class of symmetric decision rules. We expect to develop more general models in the compound decision-theoretic framework and extend our attention to asymmetric rules to further increase the efficiency of the multiple-testing procedures. Finally, Genovese, Roeder, and Wasserman (2005), Wasserman and Roeder (2006), Rubin, Dudoit, and van der Laan (2006), and Foster and Stine (2006) have discussed methods that use different cutoffs of the p values for different tests when prior information is available or some correlation structure is known. It would be interesting to construct z value–based testing procedures that also exploit external information. We leave this topic for future research.
APPENDIX: PROOFS
Proof of Proposition 1
To prove part (a), note that the MLR assumption (4) implies that

$$\frac{\int_0^c g_0(t)\,dt}{\int_0^c g_1(t)\,dt}=\frac{\int_0^c g_0(t)\,dt}{\int_0^c \frac{g_1(t)}{g_0(t)}\,g_0(t)\,dt}<\frac{\int_0^c g_0(t)\,dt}{\int_0^c \frac{g_1(c)}{g_0(c)}\,g_0(t)\,dt}=\frac{g_0(c)}{g_1(c)},$$

which is equivalent to g1(c)G0(c) < g0(c)G1(c). Thus the derivative of P(θi = 1 | T(Xi) ≤ c) = pG1(c)/G(c) is (1 − p)p{g1(c)G0(c) − g0(c)G1(c)}/G(c)² < 0. Therefore P(θi = 1 | T(Xi) ≤ c) is decreasing in c. Parts (b) and (c) can be proved similarly to part (a), by noting that mFDR = p0G0(c)/G(c) and mFNR = p1[1 − G1(c)]/[1 − G(c)]. Part (d) follows from parts (b) and (c). For part (e), the classification risk of δ is Rλ = E[Lλ(θ, δ)] = λ(1 − p)G0(c) + p{1 − G1(c)}. The optimal cutoff c* that minimizes Rλ satisfies g0(c*)/g1(c*) = p/{λ(1 − p)}, or, equivalently, λ = pg1(c*)/{(1 − p)g0(c*)}. Therefore λ (c*) is monotone decreasing in c* (λ) if (4) holds. Noting that r = G(c*) and that G is increasing, we conclude that r is decreasing in λ.
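The monotonicity in parts (a) and (b) of Proposition 1 can be checked numerically for a concrete two-point normal mixture. The sketch below (mixture parameters are illustrative) uses Monte Carlo draws of the oracle statistic to trace mFDR(c) = p0G0(c)/G(c) along increasing cutoffs.

```python
import numpy as np
from scipy.stats import norm

p, mu = 0.2, 3.0
f0 = norm(0, 1).pdf
f1 = norm(mu, 1).pdf
f = lambda z: (1 - p) * f0(z) + p * f1(z)
lfdr = lambda z: (1 - p) * f0(z) / f(z)      # oracle statistic T_OR(z)

rng = np.random.default_rng(1)
t0 = lfdr(rng.normal(0.0, 1.0, 200_000))     # null draws of T_OR
t1 = lfdr(rng.normal(mu, 1.0, 200_000))      # nonnull draws of T_OR

def mfdr(c):
    # mFDR(c) = p0 * G0(c) / G(c), with G = (1-p)G0 + p*G1 (Proposition 1)
    G0, G1 = (t0 <= c).mean(), (t1 <= c).mean()
    return (1 - p) * G0 / ((1 - p) * G0 + p * G1)

cuts = [0.05, 0.2, 0.5, 0.8]
vals = [mfdr(c) for c in cuts]               # increasing in c, per part (b)
```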
Proof of Theorem 1
According to part (b) of Proposition 1, for any T ∈ T and mFDR level α, there exists a unique threshold c such that the mFDR is controlled at level α by δ(T, c) = {I[T(x1) < c], ..., I[T(xm) < c]}. Let r be the expected number of rejections of δ(T, c). Then, again by part (b) of Proposition 1, there exists a unique threshold c* such that the expected number of rejections of δ(Λ, c*) is also r. Next, according to part (e) of Proposition 1, there exists a unique λ(α) with respect to the choice of c* such that the classification risk with λ(α) as the weight is minimized by δ(Λ, c*), and the expected number of subjects classified to the nonnull population is r.

Suppose that among the r rejected hypotheses there are vL from the null and kL from the nonnull when δ(L, cL) is applied, where (L, cL) = (T, c) or (Λ, c*). Then r = vT + kT = vΛ + kΛ. Now we argue that vT ≥ vΛ and kT ≤ kΛ. If not, then suppose that vT < vΛ and kT > kΛ. Note that the classification risk can be expressed as Rλ(α) = p + (1/m){λ(α)vL − kL}, and it is easy to see that δ(T, c) then yields a lower classification risk than δ(Λ, c*). This contradicts the fact that δ(Λ, c*) minimizes the classification risk for the choice of λ(α). Therefore we must have vT ≥ vΛ and kT ≤ kΛ.
Let mFDRL and mFNRL be the mFDR and mFNR of δ(L, cL), where L = T, Λ. Now we apply both δ = δ(T, c) and δ* = δ(Λ, c*) in the multiple-testing problem. Then the mFDR using rule δ* is mFDRΛ = vΛ/r ≤ vT/r = mFDRT = α, and the mFNR using δ* is mFNRΛ = (m1 − kΛ)/(m − r) ≤ (m1 − kT)/(m − r) = mFNRT. Therefore δ* is the testing rule in class DS that controls the mFDR at level α with the smallest mFNR.
Proof of Theorem 2
The joint distribution of θ = (θ1, ..., θm) is π(θ) = ∏i (1 − p)^{1−θi} p^{θi}. The posterior distribution of θ given x can be calculated as P_{θ|X}(θ|x) = ∏i P_{θi|Xi}(θi|xi), where

$$P_{\theta_i|X_i}(\theta_i|x_i)=\frac{I(\theta_i=0)(1-p)f_0(x_i)+I(\theta_i=1)pf_1(x_i)}{(1-p)f_0(x_i)+pf_1(x_i)}.\qquad(A.1)$$

Let Ii = I(θi = 1), and let I = {(I1, ..., Im) : Ii = 0 or 1} be the collection of all m-binary vectors; then |I| = 2^m. The elements of I are designated by superscripts, I^(j), j = 1, ..., 2^m. Note that

$$\sum_{\theta_i=0}^{1}\bigl[I(\theta_i=0)(1-p)f_0(x_i)+I(\theta_i=1)pf_1(x_i)\bigr]\bigl[\lambda I(\theta_i=0)\delta_i+I(\theta_i=1)(1-\delta_i)\bigr]=\lambda(1-p)f_0(x_i)\delta_i+pf_1(x_i)(1-\delta_i);$$

then the posterior risk is

$$E_{\theta|X}L(\theta,\delta)=\sum_{j=1}^{2^m}L\bigl(\theta=I^{(j)},\delta\bigr)\prod_i P_{\theta_i|X_i}\bigl(\theta_i=I^{(j)}_i\,\big|\,x_i\bigr)=\frac{1}{m}\sum_i\sum_{\theta_i=0}^{1}P_{\theta_i|X_i}(\theta_i|x_i)\bigl\{\lambda I(\theta_i=0)\delta_i+I(\theta_i=1)(1-\delta_i)\bigr\}=\frac{1}{m}\sum_i\frac{pf_1(x_i)}{f(x_i)}+\frac{1}{m}\sum_i\frac{\lambda(1-p)f_0(x_i)-pf_1(x_i)}{f(x_i)}\,\delta_i.$$

The Bayes rule is the simple rule δ^π(x) = [δ^π_1(x1), ..., δ^π_m(xm)], where δ^π_i = I[λ(1 − p)f0(xi) < pf1(xi)]. The expected misclassification risk of δ^π is

$$E\,E_{\theta|X}L(\theta,\delta^\pi)=E\,\frac{pf_1(X)}{f(X)}+E\left[\frac{\lambda(1-p)f_0(X)-pf_1(X)}{f(X)}\,I\{\lambda(1-p)f_0(X)<pf_1(X)\}\right]=p+\int_{\lambda(1-p)f_0<pf_1}\bigl[\lambda(1-p)f_0(x)-pf_1(x)\bigr]\,dx.$$
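The closing identity can be verified by Monte Carlo for a concrete normal mixture (the parameters below are illustrative): the average classification loss of the Bayes rule δ^π matches p plus the integral of λ(1 − p)f0 − pf1 over the rejection region.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

p, lam, mu = 0.2, 1.0, 2.0
f0, f1 = norm(0, 1).pdf, norm(mu, 1).pdf

# Closed form: R = p + integral over {lam(1-p)f0 < p f1} of [lam(1-p)f0 - p f1]
integrand = lambda x: min(0.0, lam * (1 - p) * f0(x) - p * f1(x))
risk_formula = p + quad(integrand, -10.0, 10.0)[0]

# Monte Carlo: average classification loss of the Bayes rule delta^pi
rng = np.random.default_rng(2)
theta = rng.random(500_000) < p                    # True = nonnull
x = rng.normal(np.where(theta, mu, 0.0), 1.0)
delta = lam * (1 - p) * f0(x) < p * f1(x)          # Bayes rule
risk_mc = np.mean(lam * (~theta & delta) + (theta & ~delta))
```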
Proof of Theorem 3
Observe that for each t, if p(i) ≤ t < p(i+1), then the number of rejections is i. In addition, the ratio (1 − p̂)t/Gm(t) increases in t over the range on which the number of rejections is a constant, which implies that if t = p(i) does not satisfy the constraint, then neither does any choice of t between p(i) and p(i+1). Thus it is sufficient to investigate the thresholds that are equal to one of the p(i)'s. For t = p(i), the ratio (1 − p̂)t/Gm(t) becomes (m/i)(1 − p̂)p(i). Therefore the plug-in threshold is given by sup{p(i) : (m/i)(1 − p̂)p(i) ≤ α}, which is equivalent to choosing the largest i such that p(i) ≤ iα/[m(1 − p̂)].

Proof of Theorem 4
To prove this theorem, we need to state the following lemmas.
Lemma A.1. Let p̂, f̂, and f̂0 be estimates such that p̂ →p p, E‖f̂ − f‖² → 0, and E‖f̂0 − f0‖² → 0. Then E‖T̂_OR − T_OR‖² → 0.

Proof. Because f is continuous and positive on the real line, there exists K1 = [−M, M] such that Pr(z ∈ K1^c) → 0 as M → ∞. Let inf_{z∈K1} f(z) = l0, and let A^f_ε = {z : |f̂(z) − f(z)| ≥ l0/2}. Note that E‖f̂ − f‖² ≥ (l0/2)² Pr(A^f_ε); then Pr(A^f_ε) → 0. Thus f̂ and f are bounded below by a positive number for large m, except on an event that has low probability. Similar arguments can be applied to the upper bounds of f̂ and f, as well as to the upper and lower bounds for f̂0 and f0. Therefore we conclude that f0, f̂0, f, and f̂ are all bounded in an interval [la, lb], 0 < la < lb < ∞, for large m, except on an event, say Aε, that has algebraically low probability; that is, 0 < la < inf_{z∈Aε^c} min{f0, f̂0, f, f̂} ≤ sup_{z∈Aε^c} max{f0, f̂0, f, f̂} < lb < ∞.

Noting that T̂_OR − T_OR = {f0 f(p − p̂) + (1 − p̂)f(f̂0 − f0) + (1 − p̂)f0(f − f̂)}/(f f̂), we conclude that (T̂_OR − T_OR)² ≤ c1(p̂ − p)² + c2(f̂0 − f0)² + c3(f̂ − f)² on Aε^c. It is easy to see that ‖T̂_OR − T_OR‖² is bounded, say by L. Then E‖T̂_OR − T_OR‖² ≤ L Pr(Aε) + c1E(p̂ − p)² + c2E‖f̂ − f‖² + c3E‖f̂0 − f0‖². Note that E(p̂ − p)² → 0 by lemma 2.2 of van der Vaart (1998), and E‖f̂ − f‖² → 0 and E‖f̂0 − f0‖² → 0 by assumption. Then, for a given ε > 0, there exists M ∈ Z+ such that Pr(Aε) < ε/(4L), E(p̂ − p)² < ε/(4c1), E‖f̂ − f‖² < ε/(4c2), and E‖f̂0 − f0‖² < ε/(4c3) for all m ≥ M. Consequently, E‖T̂_OR − T_OR‖² < ε for m ≥ M, and the result follows.
Lemma A.2. E‖T̂_OR − T_OR‖² → 0 implies that T̂_OR(Zi) →p T_OR(Zi).

Proof. Let Aε = {z : |T̂_OR(z) − T_OR(z)| ≥ ε}. Then ε² Pr(Aε) ≤ E‖T̂_OR − T_OR‖² → 0; consequently, Pr(Aε) → 0. Therefore Pr(|T̂_OR(Zi) − T_OR(Zi)| ≥ ε) ≤ Pr(Aε) + Pr({|T̂_OR(Zi) − T_OR(Zi)| ≥ ε} ∩ Aε^c) = Pr(Aε) → 0, and the result follows.
Lemma A.3. For α < t < 1, E[1{T̂_OR(Zi) < t}T̂_OR(Zi)] → E[1{T_OR(Zi) < t}T_OR(Zi)].

Proof. T̂_OR(Zi) →p T_OR(Zi) (Lemma A.2) implies that T̂_OR(Zi) →d T_OR(Zi). Let h(x) = 1{x < t}x; then h(x) is bounded and continuous for x < t. By lemma 2.2 of van der Vaart (1998), E[h(T̂_OR(Zi))] → E[h(T_OR(Zi))], and the result follows.
Lemma A.4. Construct the empirical distributions Ĝ_OR(t) = (1/m)Σ_{i=1}^m 1{T̂_OR(zi) ≤ t} and Ĝ⁰_OR(t) = (1/m)Σ_{i=1}^m 1{T̂_OR(zi) ≤ t}T̂_OR(zi), and define Q̂_OR(t) = Ĝ⁰_OR(t)/Ĝ_OR(t), the estimated mFDR. Then, for α < t < 1, Q̂_OR(t) →p Q_OR(t).

Proof. Let ρm = cov[T̂_OR(Zi), T̂_OR(Zj)], where Zi and Zj are two independent random variables from the mixture distribution F. T̂_OR(Zi) →p T_OR(Zi) (Lemma A.2) implies that E[T̂_OR(Zi)T̂_OR(Zj)] → E[T_OR(Zi)T_OR(Zj)]; therefore, ρm → cov[T_OR(Zi), T_OR(Zj)] = 0. Let σ²m = var(T̂_OR(Z1)); then σ²m ≤ E[T̂_OR(Z1)]² ≤ 1.

Let μm = Pr{T̂_OR(Zi) < t} and Sm = Σ_{i=1}^m 1{T̂_OR(zi) ≤ t}. Note that var(Sm)/m² = (1/m)σ²m + [(m − 1)/m]ρm ≤ 1/m + ρm → 0. According to the weak law of large numbers for triangular arrays, (1/m)Σ_{i=1}^m 1{T̂_OR(zi) ≤ t} − μm →p 0. Also note that μm = Pr{T̂_OR(Zi) < t} → Pr{T_OR(Zi) < t} = G_OR(t); we conclude that Ĝ_OR(t) →p G_OR(t). Next, let vm = E[1{T̂_OR(Zi) ≤ t}T̂_OR(Zi)]. Similarly, we can prove that Ĝ⁰_OR(t) − vm →p 0. Note that by Lemma A.3 we have E[1{T̂_OR(Zi) < t}T̂_OR(Zi)] → E[1{T_OR(Zi) < t}T_OR(Zi)]; thus vm → E[1{T_OR(Zi) < t}T_OR(Zi)] = (1 − p)∫1{T_OR(z) < t}f0(z) dz = (1 − p)G⁰_OR(t), and Ĝ⁰_OR(t) →p (1 − p)G⁰_OR(t). Finally, note that for t > α, G_OR(t) is bounded away from 0, and we obtain Q̂_OR(t) →p (1 − p)G⁰_OR(t)/G_OR(t) = Q_OR(t).
Lemma A.5. Define the estimated plug-in threshold λ̂_OR = sup{t ∈ (0, 1) : Q̂_OR(t) ≤ α}. If Q̂_OR(t) →p Q_OR(t), then λ̂_OR →p λ_OR.
Proof. Note that Q̂_OR(t) is not continuous, and the consistency of Q̂_OR(t) does not necessarily imply the consistency of λ̂_OR. Thus we first construct an envelope for Q̂_OR(t) using two continuous random functions Q̂⁻_OR(t) and Q̂⁺_OR(t) such that, for ˆLfdr_(k) < t < ˆLfdr_(k+1),

$$\hat Q^-_{OR}(t)=\hat Q_{OR}\bigl(\widehat{\mathrm{Lfdr}}_{(k)}\bigr)\,\frac{t-\widehat{\mathrm{Lfdr}}_{(k)}}{\widehat{\mathrm{Lfdr}}_{(k+1)}-\widehat{\mathrm{Lfdr}}_{(k)}}+\hat Q_{OR}\bigl(\widehat{\mathrm{Lfdr}}_{(k-1)}\bigr)\,\frac{\widehat{\mathrm{Lfdr}}_{(k+1)}-t}{\widehat{\mathrm{Lfdr}}_{(k+1)}-\widehat{\mathrm{Lfdr}}_{(k)}}$$

and

$$\hat Q^+_{OR}(t)=\hat Q_{OR}\bigl(\widehat{\mathrm{Lfdr}}_{(k+1)}\bigr)\,\frac{t-\widehat{\mathrm{Lfdr}}_{(k)}}{\widehat{\mathrm{Lfdr}}_{(k+1)}-\widehat{\mathrm{Lfdr}}_{(k)}}+\hat Q_{OR}\bigl(\widehat{\mathrm{Lfdr}}_{(k)}\bigr)\,\frac{\widehat{\mathrm{Lfdr}}_{(k+1)}-t}{\widehat{\mathrm{Lfdr}}_{(k+1)}-\widehat{\mathrm{Lfdr}}_{(k)}}.$$

Noting that Q̂_OR(ˆLfdr_(k+1)) − Q̂_OR(ˆLfdr_(k)) = {k ˆLfdr_(k+1) − Σ_{i=1}^k ˆLfdr_(i)}/{k(k + 1)} > 0, we have Q̂⁻_OR(t) ≤ Q̂_OR(t) ≤ Q̂⁺_OR(t), with both Q̂⁻_OR(t) and Q̂⁺_OR(t) strictly increasing in t. Let λ̂⁻_OR = sup{t ∈ (0, 1) : Q̂⁻_OR(t) ≤ α} and λ̂⁺_OR = sup{t ∈ (0, 1) : Q̂⁺_OR(t) ≤ α}; then λ̂⁺_OR ≤ λ̂_OR ≤ λ̂⁻_OR.

We claim that λ̂⁻_OR →p λ_OR. If not, then there must exist ε0 and η0 such that, for all M ∈ Z+, Pr(|λ̂⁻_OR − λ_OR| > ε0) > 2η0 holds for some m ≥ M. Suppose that Pr(λ̂⁻_OR > λ_OR + ε0) > 2η0. It is easy to see that Q̂⁻_OR →p Q_OR, because Q̂⁻_OR(t) − Q̂⁺_OR(t) →a.s. 0, Q̂⁻_OR(t) ≤ Q̂_OR(t) ≤ Q̂⁺_OR(t), and Q̂_OR(t) →p Q_OR(t). Thus there exists M ∈ Z+ such that Pr(|Q̂⁻_OR(λ_OR + ε0) − Q_OR(λ_OR + ε0)| < δ0) > 1 − η0 for all m ≥ M, where we let 2δ0 = Q_OR(λ_OR + ε0) − α. Now, for this choice of m, there exists an event K1m such that Pr(K1m) ≥ 1 − η0 and such that |Q̂⁻_OR(λ_OR + ε0) − Q_OR(λ_OR + ε0)| < δ0 for all outcomes ω ∈ K1m. At the same time, there exists an event K2m such that Pr(K2m) ≥ 2η0 and λ̂⁻_OR > λ_OR + ε0 for all outcomes ω ∈ K2m. Let Km = K1m ∩ K2m; then Pr(Km) = Pr(K1m) + Pr(K2m) − Pr(K1m ∪ K2m) ≥ Pr(K1m) + Pr(K2m) − 1 ≥ η0. That is, Km has positive measure. For all outcomes in Km, Q̂⁻_OR(t) is continuous and strictly increasing, and we have α = Q̂⁻_OR(λ̂⁻_OR) > Q̂⁻_OR(λ_OR + ε0) > Q_OR(λ_OR + ε0) − δ0 = α + δ0. This is a contradiction. Therefore we must have λ̂⁻_OR →p λ_OR. Similarly, we can prove that λ̂⁺_OR →p λ_OR. Because Q̂⁻_OR(t) − Q̂⁺_OR(t) →a.s. 0 implies λ̂⁻_OR − λ̂⁺_OR →a.s. 0, the result follows by noting that λ̂⁺_OR ≤ λ̂_OR ≤ λ̂⁻_OR.
Proof of Theorem 4. Implementing the adaptive procedure (10) is equivalent to choosing T̂_OR(Zi) as the test statistic and λ̂_OR as the threshold. The mFDR of the decision rule δ(T̂_OR, λ̂_OR) is mFDR = E(N10)/E(R) = (1 − p)P_{H0}(T̂_OR < λ̂_OR)/P(T̂_OR < λ̂_OR). Noting that T̂_OR →p T_OR by Lemma A.2 and λ̂_OR →p λ_OR by Lemma A.5, it follows that (1 − p)P_{H0}(T̂_OR < λ̂_OR) → (1 − p)P_{H0}(T_OR < λ_OR) = (1 − p)G⁰_OR(λ_OR) and P(T̂_OR < λ̂_OR) → P(T_OR < λ_OR) = G_OR(λ_OR). Noting that P(T_OR < λ_OR) is bounded away from 0, we have mFDR → (1 − p)G⁰_OR(λ_OR)/G_OR(λ_OR) = Q_OR(λ_OR) = α.
Proof of Theorem 5
Similar to the proof of Theorem 4, the mFNR of the adaptive procedure (10) is mFNR = E(N01)/E(S) = p P_{H1}(T̂_OR > λ̂_OR)/P(T̂_OR > λ̂_OR). By Lemmas A.2 and A.5, we have P_{H1}(T̂_OR > λ̂_OR) → P_{H1}(T_OR > λ_OR) = 1 − G¹_OR(λ_OR) and P(T̂_OR > λ̂_OR) → P(T_OR > λ_OR) = 1 − G_OR(λ_OR). Noting that 1 − G_OR(λ_OR) is bounded away from 0, we obtain mFNR → p{1 − G¹_OR(λ_OR)}/{1 − G_OR(λ_OR)} = Q̃_OR(λ_OR).
[Received November 2006 Revised March 2007]
REFERENCES
Benjamini, Y., and Hochberg, Y. (1995), "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing," Journal of the Royal Statistical Society, Ser. B, 57, 289–300.
(2000), "On the Adaptive Control of the False Discovery Rate in Multiple Testing With Independent Statistics," Journal of Educational and Behavioral Statistics, 25, 60–83.
Copas, J. (1974), "On Symmetric Compound Decision Rules for Dichotomies," The Annals of Statistics, 2, 199–204.
Efron, B. (2004a), "Local False Discovery Rate," technical report, Stanford University, Dept. of Statistics, available at http://www-stat.stanford.edu/~brad/papers/False.pdf.
(2004b), "Large-Scale Simultaneous Hypothesis Testing: The Choice of a Null Hypothesis," Journal of the American Statistical Association, 99, 96–104.
(2006), "Correlation and Large-Scale Simultaneous Testing," technical report, Stanford University, Dept. of Statistics, available at http://www-stat.stanford.edu/~brad/papers/Correlation-2006.pdf.
Efron, B., Tibshirani, R., Storey, J., and Tusher, V. (2001), "Empirical Bayes Analysis of a Microarray Experiment," Journal of the American Statistical Association, 96, 1151–1160.
Foster, D., and Stine, R. (2006), "Alpha Investing: A New Multiple Hypothesis Testing Procedure That Controls mFDR," technical report, University of Pennsylvania, Dept. of Statistics.
Genovese, C., and Wasserman, L. (2002), "Operating Characteristic and Extensions of the False Discovery Rate Procedure," Journal of the Royal Statistical Society, Ser. B, 64, 499–517.
(2004), "A Stochastic Process Approach to False Discovery Control," The Annals of Statistics, 32, 1035–1061.
Genovese, C., Roeder, K., and Wasserman, L. (2005), "False Discovery Control With p-Value Weighting," Biometrika, 93, 509–524.
Jin, J., and Cai, T. (2007), "Estimating the Null and the Proportion of Non-Null Effects in Large-Scale Multiple Comparisons," Journal of the American Statistical Association, 102, 495–506.
Magder, L., and Zeger, S. (1996), "A Smooth Nonparametric Estimate of a Mixing Distribution Using Mixtures of Gaussians," Journal of the American Statistical Association, 91, 1141–1151.
Robbins, H. (1951), "Asymptotically Subminimax Solutions of Compound Statistical Decision Problems," in Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA: University of California Press, pp. 131–148.
Rubin, D., Dudoit, S., and van der Laan, M. (2006), "A Method to Increase the Power of Multiple Testing Procedures Through Sample Splitting," working paper, University of California, Berkeley, Dept. of Biostatistics, available at http://www.bepress.com/ucbbiostat/paper171.
Spjøtvoll, E. (1972), "On the Optimality of Some Multiple Comparison Procedures," The Annals of Mathematical Statistics, 43, 398–411.
Storey, J. (2002), "A Direct Approach to False Discovery Rates," Journal of the Royal Statistical Society, Ser. B, 64, 479–498.
(2003), "The Positive False Discovery Rate: A Bayesian Interpretation and the q-Value," The Annals of Statistics, 31, 2012–2035.
(2007), "The Optimal Discovery Procedure: A New Approach to Simultaneous Significance Testing," Journal of the Royal Statistical Society, Ser. B, 69, 347–368.
van der Laan, M., Dudoit, S., and Pollard, K. (2004), "Multiple Testing. Part III. Augmentation Procedures for Control of the Generalized Family-Wise Error Rate and Tail Probabilities for the Proportion of False Positives," technical report, University of California, Berkeley, Dept. of Biostatistics.
van der Vaart, A. (1998), Asymptotic Statistics, Cambridge, U.K.: Cambridge University Press.
van't Wout, A., Lehrman, G., Mikheeva, S., O'Keeffe, G., Katze, M., Bumgarner, R., Geiss, G., and Mullins, J. (2003), "Cellular Gene Expression Upon Human Immunodeficiency Virus Type 1 Infection of CD4+ T-Cell Lines," Journal of Virology, 77, 1392–1402.
Wasserman, L., and Roeder, K. (2006), "Weighted Hypothesis Testing," technical report, Carnegie Mellon University, Dept. of Statistics, available at http://www.arxiv.org/abs/math.ST/0604172.
f is continuous and positive on the real line. Assume that T_OR(zi) = (1 − p)f0(zi)/f(zi) is distributed with the marginal pdf g = (1 − p)g0 + pg1 and that T_OR ∈ T satisfies the SMLR assumption. Let p̂, f̂0, and f̂ be estimates of p, f0, and f such that p̂ →p p, E‖f̂ − f‖² → 0, and E‖f̂0 − f0‖² → 0. Define the test statistic T̂_OR(zi) = (1 − p̂)f̂0(zi)/f̂(zi). Let ˆLfdr_(1), ..., ˆLfdr_(m) be the ranked values of T̂_OR(zi); then the adaptive procedure (10) controls the mFDR at level α asymptotically.

Theorem 5 (Asymptotic attainment of adaptive procedure). Assume that the random sample z1, ..., zm and the test statistics T_OR(zi), T̂_OR(zi) are the same as in Theorem 4. Then the mFNR level of the adaptive procedure (10) is Q̃_OR(λ_OR) + o(1), where Q̃_OR(λ_OR), given in (7), is the mFNR level achieved by the z value oracle procedure (6).
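In implementation terms, the adaptive procedure (10) rejects the k hypotheses with the smallest estimated Lfdr values, where k is the largest i whose running average (1/i)Σ_{j≤i} ˆLfdr_(j), the estimated mFDR at that cutoff, stays below α. A minimal sketch (not the authors' R implementation):

```python
import numpy as np

def adaptz_reject(lfdr_hat, alpha=0.10):
    """Adaptive z-value procedure: rank hypotheses by estimated Lfdr and
    reject the k smallest, where k is the largest i with running mean
    (1/i) * sum_{j<=i} Lfdr_(j) <= alpha (the estimated mFDR)."""
    order = np.argsort(lfdr_hat)
    running_mean = np.cumsum(lfdr_hat[order]) / np.arange(1, len(lfdr_hat) + 1)
    passing = np.nonzero(running_mean <= alpha)[0]
    k = int(passing[-1]) + 1 if passing.size else 0
    reject = np.zeros(len(lfdr_hat), dtype=bool)
    reject[order[:k]] = True
    return reject
```

Because the running mean averages Lfdr values below the cutoff, individual hypotheses with Lfdr somewhat above α can still be rejected, which is one source of the power gain over thresholding each Lfdr at α.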
5 NUMERICAL RESULTS
We now turn to the numerical performance of our adaptive z value–based procedure (10). The procedure is easy to implement; the R code for the procedure is available at http://stat.wharton.upenn.edu/~tcai/paper/html/FDR.html. The goal of this section is to provide numerical examples to show that the conventional p value–based procedures are inefficient. We also explore the situations in which these p value–based procedures can be substantially improved by the z value–based procedures. We describe the application of our procedure to the analysis of microarray data from an HIV study in Section 6.
In our numerical studies, we consider the following normal mixture model:

$$Z_i \sim (1 - p)N(\mu_0, \sigma_0^2) + pN(\mu_i, \sigma_i^2),\qquad(11)$$

where (μi, σi²) follows some bivariate distribution F(μ, σ²). This model can be used to approximate many mixture distributions and is found in a wide range of applications (see, e.g., Magder and Zeger 1996; Efron 2004b). We compare the p value–based and z value–based procedures in two numerical examples.
Numerical Study 1. We compare the p value and z value oracle procedures in the normal mixture model (8), a special case of (11). The algorithm for calculating the oracle cutoffs and the corresponding mFNRs is given in Example 1 at the end of Section 3. Figure 3 compares the performance of these two oracle procedures. Panel (a) plots the mFNR of the two oracle procedures as a function of p1, where the mFDR level is set at .10, the alternative means are μ1 = −3 and μ2 = 3, and the total proportion of nonnulls is p = .2. Panel (b) plots the mFNR as a function of p1 in the same setting, except with alternative means μ1 = −3 and μ2 = 6. In panel (c), mFDR = .10, p1 = .18, p2 = .02, and μ1 = −3, and we plot the mFNR as a function of μ2. Panel (d) plots the mFNR as a function of the mFDR level while holding μ1 = −3, μ2 = 1, p1 = .02, and p2 = .18 fixed. We discuss these results at the end of this section.
Numerical Study 2 We compare the step-up procedure(Benjamini and Hochberg 1995) the adaptive p valuendashbasedprocedure (Benjamini and Hochberg 2000 Genovese andWasserman 2004) and the adaptive z valuendashbased proce-dure (10) designated by BH95 BHGW and AdaptZ here-inafter Note that the BH step-up procedure is easier to im-plement than either BHGW or AdaptZ The BHGW procedurerequires estimating the proportion of nonnulls and the AdaptZprocedure also requires an additional density estimation step
(a) (b)
(c) (d)
Figure 3 Comparison of the p value ( ) and z value ( ) oracle rules (a) mFDR = 10 μ = (minus33) (b) mFDR = 10 μ = (minus36)(c) mFDR = 10 p1 = 18 p2 = 02 μ1 = minus3 (d) μ = (minus31) p = (02 18)
908 Journal of the American Statistical Association September 2007
(a) (b)
(c) (d)
Figure 4 Comparison of procedures for FDR control (a) μ = (minus33) p = 20 (b) μ = (minus36) p = 20 (c) μ1 = minus3 p = (08 02)(d) μ1 = minus3 p = (18 02) ( the BH step-up procedure based on the p values 13 the BHGW adaptive procedure based on the p values+ the adaptive procedure based on the z-values) The mFDR is controlled at level 10
The alternative means μ1i and μ2i are generated from uni-form distributions U(θ1 minus δ1 θ1 + δ1) and U(θ2 minus δ2 θ2 + δ2)after which zi is generated from the normal mixture model (8)based on μ1i and μ2i i = 1 m This hierarchical modelalso can be viewed as a special case of the mixture model (11)In estimating the Lfdr statistic Lfdr(zi) = (1 minus p)f0(zi)f (zi)f0 is chosen to be the theoretical null density N(01) p is esti-mated consistently using the approach of Jin and Cai and f isestimated using a kernel density estimator with bandwidth cho-sen by cross-validation The comparison results are displayedin Figure 4 where the top row gives plots of the mFNR as afunction of p1 and the bottom row gives plots of the mFNR as afunction of θ2 with other quantities held fixed All points in theplots are calculated based on a sample size of m = 5000 and200 replications
The following observations can be made based on the resultsfrom these two numerical studies
a When |μ1| = |μ2| the mFNR remains a constant for thep value oracle procedure In contrast the mFNR for the z valueoracle procedure increases first and then decreases [Fig 3(a)]The p value and z value oracle procedures yield the samemFNR levels only when the alternative is symmetric about thenull This reveals the fact that the z value procedure adapts tothe asymmetry in the alternative distribution but the p value
procedure does not Similar phenomena are shown in Figure 4for adaptive procedures
b The p value oracle procedure is dominated by the z valueoracle procedure The largest difference occurs when |μ1| lt μ2and p1 gt p2 where the alternative distribution is highly asym-metric about the null [Figs 3(b) and 3(c)] Similarly the BH95is dominated by BHGW which is again dominated by AdaptZ(Fig 4)
c The mFNRs for both p value and z value procedures de-crease as μ2 moves away from 0
d Within a reasonable range (mFDR lt 6) the mFNR isdecreasing in the mFDR [Fig 3(d)] which verifies part (d) ofProposition 1 one consequence of our SMLR assumption
Additional simulation results show that the difference in themFNRs of the p value and z value procedures is increas-ing in the proportion of nonnulls and that the adaptive proce-dures (BHGW and AdaptZ) control the mFDR at the nominallevel α approximately whereas the BH95 procedure controlsthe mFDR at a lower level
6 APPLICATION TO MICROARRAY DATA
We now illustrate our method in the analysis of the microar-ray data from an HIV study The goal of the HIV study (vanrsquotWout et al 2003) is to discover differentially expressed genes
Sun and Cai Rules for False Discovery Rate Control 909
(a) (b) (c)
Figure 5 Histograms of the HIV data (a) z values (b) p values transformed p values approximately distributed as uniform (01) for thenull cases
between HIV positive patients and HIV negative controls Geneexpression levels were measured for four HIV-positive patientsand four HIV-negative controls on the same N = 7680 genesA total of N two-sample t-tests were performed and N two-sided p values were obtained The z values were convertedfrom the t statistics using the transformation zi = minus1[G0(ti)]where and G0 are the cdfrsquos of a standard normal and a t vari-able with 6 degrees of freedom The histograms of the z valuesand p values were presented in Figure 5 An important featurein this data set is that the z value distribution is asymmetricabout the null The distribution is skewed to the right
When the null hypothesis is true the p values and z valuesshould follow their theoretical null distributions which are uni-form and standard normal But the theoretical nulls are usu-ally quite different from the empirical nulls for the data aris-ing from microarray experiments (see Efron 2004b for morediscussion on the choice of the null in multiple testing) Wetake the approach of Jin and Cai (2007) to estimate the nulldistribution as N(μ0 σ
20 ) The estimates μ0 and σ 2
0 are consis-tent We then proceed to estimate the proportion of the nonnullsp based on μ0 and σ 2
0 The marginal density f is estimated
by a kernel density estimate f with the bandwidth chosenby cross-validation The Lfdr statistics are then calculated as
ˆLfdr(zi) = (1 minus p)f0(zi)f (zi) The transformed p values areobtained as F0(zi) where F0 is the estimated null cdf (
xminusμ0σ0
)As shown in Figure 5(b) after transformation the distributionof the transformed p values is approximately uniform when thenull is true
We compare the BH95 BHGW and AdaptZ procedures us-ing both the theoretical nulls and estimated nulls We calculatethe number of rejections for each mFDR level the results areshown in Figure 6 For Figure 6(a) f0 is chosen to be the theo-retical null N(01) and the estimate for the proportion of nullsis 1 Therefore the BH and BHGW procedures yield the same
number of rejections For Figure 6(b) the estimated null dis-tribution is N(minus08 772) with estimated proportion of nullsp0 = 94 Transformed p values as well as the Lfdr statisticsare calculated according to the estimated null The followingobservations can be made from the results displayed
a. The number of rejections is increasing as a function of the mFDR level.
b. For both the p value–based and z value–based approaches, more hypotheses are rejected by using the estimated null.
c. Both comparisons show that AdaptZ is more powerful than BH95 and BHGW, which are based on p values.
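The BHGW procedure compared above is the plug-in p value procedure: it rejects the i smallest p values for the largest i with $p_{(i)} \le i\alpha/[m(1-\hat p)]$ (see the proof of Theorem 3 in the Appendix). A minimal sketch with made-up p values:

```python
import numpy as np

def plugin_threshold(pvals, p_hat, alpha=0.10):
    """Plug-in (adaptive) p value step-up: reject the i smallest p values,
    where i is the largest index with p_(i) <= i*alpha / (m*(1 - p_hat));
    p_hat is the estimated proportion of nonnulls."""
    m = len(pvals)
    p_sorted = np.sort(pvals)
    crit = np.arange(1, m + 1) * alpha / (m * (1.0 - p_hat))
    below = np.nonzero(p_sorted <= crit)[0]
    if below.size == 0:
        return 0.0, 0  # no rejections
    k = int(below[-1]) + 1  # number of rejections
    return float(p_sorted[k - 1]), k

# Toy data: 95 uniform null p values plus 5 very small ones (hypothetical).
rng = np.random.default_rng(2)
pvals = np.concatenate([rng.uniform(size=95), np.full(5, 1e-5)])
thr, n_rej = plugin_threshold(pvals, p_hat=0.05, alpha=0.10)
print(n_rej)
```

With p_hat = 0 this reduces to the BH95 step-up rule, which explains why the two procedures coincide when the estimated proportion of nulls is 1.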
7 DISCUSSION
We have developed a compound decision theory framework for inference about the mixture model (1) and showed that the oracle procedure $\delta(T_{OR}, \lambda_{OR})$ given in (6) is optimal in the multiple-testing problems for mFDR control. We have proposed an adaptive procedure based on the z values that mimics the oracle procedure (6). The adaptive z value–based procedure attains the performance of the oracle procedure asymptotically and outperforms the traditional p value–based approaches. The decision-theoretic framework provides insight into the superiority of the adaptive z value–based procedure. Each multiple-testing procedure involves two steps: ranking and thresholding. The process of ranking the relative importance of the m tests can be viewed as a compound decision problem $\mathcal{P} = (P_{ij} : 1 \le i < j \le m)$ with $m(m-1)/2$ components. Each component problem $P_{ij}$ is a pairwise comparison with its own data $y_{ij} = (x_i, x_j)$. The solution to $\mathcal{P}$ based on p values is simple (note that $\delta_{ij} = I[p(x_i) < p(x_j)]$ depends on $y_{ij}$ alone), whereas the rule based on the estimated Lfdr is compound. The gain in efficiency of the adaptive z value–based procedure is due to the fact that the scope of attention is extended from the
910 Journal of the American Statistical Association September 2007
Figure 6. Analysis of the HIV data: number of rejections versus FDR level for (a) the theoretical null and (b) the estimated null. The curves show the BH step-up procedure based on the p values, the BHGW adaptive procedure based on the p values, and the adaptive procedure based on the z values.
class of simple rules to the class of compound rules in the ranking process.
The compound decision-theoretic approach to multiple testing also suggests several future research directions. First, the oracle statistic (Lfdr) may no longer be optimal in ranking the significance of tests when the observations are correlated. In addition, different cutoffs should be chosen under dependency, because the multiple-testing procedures developed for independent tests can be too liberal or too conservative for dependent tests (Efron 2006). Second, the oracle procedure is optimal only in the class of symmetric decision rules. We expect to develop more general models in the compound decision-theoretic framework and extend our attention to asymmetric rules to further increase the efficiency of multiple-testing procedures. Finally, Genovese, Roeder, and Wasserman (2005), Wasserman and Roeder (2006), Rubin, Dudoit, and van der Laan (2006), and Foster and Stine (2006) have discussed methods that use different cutoffs for p values for different tests when prior information is available or some correlation structure is known. It would be interesting to construct z value–based testing procedures that also exploit external information. We leave this topic for future research.
APPENDIX PROOFS
Proof of Proposition 1
To prove part (a), note that the MLR assumption (4) implies that

$$\frac{\int_0^c g_0(t)\,dt}{\int_0^c g_1(t)\,dt} = \frac{\int_0^c g_0(t)\,dt}{\int_0^c \frac{g_1(t)}{g_0(t)}\,g_0(t)\,dt} < \frac{\int_0^c g_0(t)\,dt}{\int_0^c \frac{g_1(c)}{g_0(c)}\,g_0(t)\,dt} = \frac{g_0(c)}{g_1(c)},$$

which is equivalent to $g_1(c)G_0(c) < g_0(c)G_1(c)$. Thus the derivative of $P(\theta_i = 1 \mid T(X_i) \le c) = pG_1(c)/G(c)$ is $(1-p)p\{g_1(c)G_0(c) - g_0(c)G_1(c)\}/\{G(c)\}^2 < 0$. Therefore $P(\theta_i = 1 \mid T(X_i) \le c)$ is decreasing in $c$. Parts (b) and (c) can be proved similarly to part (a), by noting that $\mathrm{mFDR} = p_0G_0(c)/G(c)$ and $\mathrm{mFNR} = p_1[1 - G_1(c)]/[1 - G(c)]$. Part (d) follows from parts (b) and (c). For part (e), the classification risk of $\delta$ is $R_\lambda = E[L_\lambda(\theta, \delta)] = \lambda(1-p)G_0(c) + p\{1 - G_1(c)\}$. The optimal cutoff $c^*$ that minimizes $R_\lambda$ satisfies $g_0(c^*)/g_1(c^*) = p/\{\lambda(1-p)\}$; equivalently, $\lambda = pg_1(c^*)/[(1-p)g_0(c^*)]$. Therefore $\lambda$ ($c^*$) is monotone decreasing in $c^*$ ($\lambda$) if (4) holds. Noting that $r = G(c^*)$ and $G$ is increasing, we conclude that $r$ is decreasing in $\lambda$.
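The monotonicity claims of parts (a) and (b) can be checked numerically for a concrete mixture. The sketch below assumes a N(0, 1) null, a N(3, 1) alternative, and p = .2 (all choices illustrative), and uses the oracle Lfdr statistic as T:

```python
import numpy as np
from scipy import stats

# Illustrative mixture: null N(0,1) with weight 1-p, nonnull N(3,1) with p=.2.
p = 0.2
x = np.linspace(-10.0, 14.0, 20001)
dx = x[1] - x[0]
f0 = stats.norm.pdf(x)
f1 = stats.norm.pdf(x, 3.0, 1.0)
f = (1.0 - p) * f0 + p * f1
t_stat = (1.0 - p) * f0 / f  # oracle Lfdr statistic T_OR(x)

def masses(c):
    """G0(c), G1(c), G(c): probabilities of {T_OR(X) <= c} under f0, f1, f."""
    mask = t_stat <= c
    return f0[mask].sum() * dx, f1[mask].sum() * dx, f[mask].sum() * dx

cs = np.linspace(0.05, 0.95, 19)
mfdr, post = [], []
for c in cs:
    G0, G1, G = masses(c)
    mfdr.append((1.0 - p) * G0 / G)  # mFDR of the rule I(T_OR <= c), part (b)
    post.append(p * G1 / G)          # P(theta = 1 | T_OR(X) <= c), part (a)

print(np.all(np.diff(mfdr) > 0), np.all(np.diff(post) < 0))
```

The printed booleans confirm that, on this grid, the mFDR is increasing and the posterior nonnull probability is decreasing in the cutoff c, as the proposition asserts.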
Proof of Theorem 1
According to part (b) of Proposition 1, for any $T \in \mathcal{T}$ and mFDR level $\alpha$, there exists a unique threshold $c$ such that the mFDR is controlled at level $\alpha$ by $\delta(T, c) = \{I[T(x_1) < c], \ldots, I[T(x_m) < c]\}$. Let $r$ be the expected number of rejections of $\delta(T, c)$. Then, again by part (b) of Proposition 1, there exists a unique threshold $c^*$ such that the expected number of rejections of $\delta(\Lambda, c^*)$, where $\Lambda$ denotes the corresponding weighted-classification statistic, is also $r$. Next, according to part (e) of Proposition 1, there exists a unique $\lambda(\alpha)$ with respect to the choice of $c^*$ such that the classification risk with $\lambda(\alpha)$ as the weight is minimized by $\delta(\Lambda, c^*)$, and the expected number of subjects classified to the nonnull population is $r$.
Suppose that among the $r$ rejected hypotheses, there are $v_L$ from the null and $k_L$ from the nonnull when $\delta(L, c_L)$ is applied, where $(L, c_L) = (T, c)$ or $(\Lambda, c^*)$. Then $r = v_T + k_T = v_\Lambda + k_\Lambda$. Now we argue that $v_T \ge v_\Lambda$ and $k_T \le k_\Lambda$. If not, then suppose that $v_T < v_\Lambda$ and $k_T > k_\Lambda$. Note that the classification risk can be expressed as $R_{\lambda(\alpha)} = p + \frac{1}{m}\{\lambda(\alpha)v_L - k_L\}$, and it is easy to see that $\delta(T, c)$ yields a lower classification risk than $\delta(\Lambda, c^*)$. This contradicts the fact that $\delta(\Lambda, c^*)$ minimizes the classification risk for the choice of $\lambda(\alpha)$. Therefore we must have $v_T \ge v_\Lambda$ and $k_T \le k_\Lambda$.
Sun and Cai Rules for False Discovery Rate Control 911
Let $\mathrm{mFDR}_L$ and $\mathrm{mFNR}_L$ be the mFDR and mFNR of $\delta(L, c_L)$, where $L = T, \Lambda$. Now we apply both $\delta = \delta(T, c)$ and $\delta^* = \delta(\Lambda, c^*)$ in the multiple-testing problem. Then the mFDR using rule $\delta^*$ is $\mathrm{mFDR}_\Lambda = v_\Lambda/r \le v_T/r = \mathrm{mFDR}_T = \alpha$, and the mFNR using $\delta^*$ is $\mathrm{mFNR}_\Lambda = (m_1 - k_\Lambda)/(m - r) \le (m_1 - k_T)/(m - r) = \mathrm{mFNR}_T$. Therefore $\delta^*$ is the testing rule in class $\mathcal{D}_S$ that controls the mFDR at level $\alpha$ with the smallest mFNR.
Proof of Theorem 2
The joint distribution of $\theta = (\theta_1, \ldots, \theta_m)$ is $\pi(\theta) = \prod_i (1-p)^{1-\theta_i}p^{\theta_i}$. The posterior distribution of $\theta$ given $x$ can be calculated as $P_{\theta|X}(\theta|x) = \prod_i P_{\theta_i|X_i}(\theta_i|x_i)$, where

$$P_{\theta_i|X_i}(\theta_i|x_i) = \frac{I(\theta_i = 0)(1-p)f_0(x_i) + I(\theta_i = 1)pf_1(x_i)}{(1-p)f_0(x_i) + pf_1(x_i)}. \quad \text{(A.1)}$$

Let $I_i = I(\theta_i = 1)$ and $\mathcal{I} = \{(I_1, \ldots, I_m) : I_i = 0 \text{ or } 1\}$ be the collection of all $m$-binary vectors; then $|\mathcal{I}| = 2^m$. The elements of $\mathcal{I}$ are designated by superscripts, $I^{(j)}$, $j = 1, \ldots, 2^m$. Note that $\sum_{\theta_i=0}^{1}[I(\theta_i = 0)(1-p)f_0(x_i) + I(\theta_i = 1)pf_1(x_i)][\lambda I(\theta_i = 0)\delta_i + I(\theta_i = 1)(1-\delta_i)] = \lambda(1-p)f_0(x_i)\delta_i + pf_1(x_i)(1-\delta_i)$; then the posterior risk is

$$E_{\theta|X}L(\theta, \delta) = \sum_{j=1}^{2^m} L\big(\theta = I^{(j)}, \delta\big)\prod_i P_{\theta_i|X_i}\big(\theta_i = I_i^{(j)}\,\big|\,x_i\big)$$
$$= \frac{1}{m}\sum_i \sum_{\theta_i=0}^{1} P_{\theta_i|X_i}(\theta_i|x_i)\{\lambda I(\theta_i = 0)\delta_i + I(\theta_i = 1)(1-\delta_i)\}$$
$$= \frac{1}{m}\sum_i \frac{pf_1(x_i)}{f(x_i)} + \frac{1}{m}\sum_i \frac{\lambda(1-p)f_0(x_i) - pf_1(x_i)}{f(x_i)}\,\delta_i.$$

The Bayes rule is the simple rule $\delta^\pi(x) = [\delta_1^\pi(x_1), \ldots, \delta_m^\pi(x_m)]$, where $\delta_i^\pi = I[\lambda(1-p)f_0(x_i) < pf_1(x_i)]$. The expected misclassification risk of $\delta^\pi$ is

$$E\{E_{\theta|X}L(\theta, \delta^\pi)\} = E\frac{pf_1(X)}{f(X)} + E\left[\frac{\lambda(1-p)f_0(X) - pf_1(X)}{f(X)}\,I\{\lambda(1-p)f_0(X) < pf_1(X)\}\right]$$
$$= p + \int_{\{\lambda(1-p)f_0 < pf_1\}}[\lambda(1-p)f_0(x) - pf_1(x)]\,dx.$$
Proof of Theorem 3
Observe that for each $t$, if $p_{(i)} \le t < p_{(i+1)}$, then the number of rejections is $i$. In addition, the ratio $(1-\hat p)t/G_m(t)$ increases in $t$ over the range on which the number of rejections is constant, which implies that if $t = p_{(i)}$ does not satisfy the constraint, then neither does any choice of $t$ between $p_{(i)}$ and $p_{(i+1)}$. Thus it is sufficient to investigate the thresholds that are equal to one of the $p_{(i)}$'s. At $t = p_{(i)}$ the ratio $(1-\hat p)t/G_m(t)$ becomes $(m/i)(1-\hat p)p_{(i)}$. Therefore the plug-in threshold is given by $\sup\{p_{(i)} : (m/i)(1-\hat p)p_{(i)} \le \alpha\}$, which is equivalent to choosing the largest $i$ such that $p_{(i)} \le i\alpha/[m(1-\hat p)]$.

Proof of Theorem 4

To prove this theorem, we need to state the following lemmas.
Lemma A.1. Let $\hat p$, $\hat f$, and $\hat f_0$ be estimates such that $\hat p \xrightarrow{p} p$, $E\|\hat f - f\|^2 \to 0$, and $E\|\hat f_0 - f_0\|^2 \to 0$; then $E\|\hat T_{OR} - T_{OR}\|^2 \to 0$.

Proof. Note that $f$ is continuous and positive on the real line; then there exists $K_1 = [-M, M]$ such that $\Pr(z \in K_1^c) \to 0$ as $M \to \infty$. Let $\inf_{z \in K_1} f(z) = l_0$ and $A^f_\varepsilon = \{z : |\hat f(z) - f(z)| \ge l_0/2\}$. Note that $E\|\hat f - f\|^2 \ge (l_0/2)^2 P(A^f_\varepsilon)$; then $\Pr(A^f_\varepsilon) \to 0$. Thus $\hat f$ and $f$ are bounded below by a positive number for large $m$, except on an event that has a low probability. Similar arguments can be applied to the upper bounds of $\hat f$ and $f$, as well as to the upper and lower bounds for $\hat f_0$ and $f_0$. Therefore we conclude that $f_0$, $\hat f_0$, $f$, and $\hat f$ are all bounded in an interval $[l_a, l_b]$, $0 < l_a < l_b < \infty$, for large $m$, except on an event, say $A_\varepsilon$, that has algebraically low probability; that is, $0 < l_a < \inf_{z \in A_\varepsilon^c} \min\{f_0, \hat f_0, f, \hat f\} \le \sup_{z \in A_\varepsilon^c} \max\{f_0, \hat f_0, f, \hat f\} < l_b < \infty$.

Noting that $\hat T_{OR} - T_{OR} = [f_0\hat f(p - \hat p) + (1 - \hat p)f(\hat f_0 - f_0) + (1 - \hat p)f_0(f - \hat f)]/(\hat f f)$, we conclude that $(\hat T_{OR} - T_{OR})^2 \le c_1(\hat p - p)^2 + c_2(\hat f - f)^2 + c_3(\hat f_0 - f_0)^2$ on $A_\varepsilon^c$. It is easy to see that $\|\hat T_{OR} - T_{OR}\|^2$ is bounded, say by $L$. Then $E\|\hat T_{OR} - T_{OR}\|^2 \le L\Pr(A_\varepsilon) + c_1E(\hat p - p)^2 + c_2E\|\hat f - f\|^2 + c_3E\|\hat f_0 - f_0\|^2$. Note that $E(\hat p - p)^2 \to 0$ by lemma 2.2 of van der Vaart (1998), and $E\|\hat f - f\|^2 \to 0$, $E\|\hat f_0 - f_0\|^2 \to 0$ by assumption; then we have that, for a given $\varepsilon > 0$, there exists $M \in \mathbb{Z}^+$ such that we can find $A_\varepsilon$ with $\Pr(A_\varepsilon) < \varepsilon/(4L)$, and at the same time $E(\hat p - p)^2 < \varepsilon/(4c_1)$, $E\|\hat f - f\|^2 < \varepsilon/(4c_2)$, and $E\|\hat f_0 - f_0\|^2 < \varepsilon/(4c_3)$, for all $m \ge M$. Consequently, $E\|\hat T_{OR} - T_{OR}\|^2 < \varepsilon$ for $m \ge M$, and the result follows.
Lemma A.2. $E\|\hat T_{OR} - T_{OR}\|^2 \to 0$ implies that $\hat T_{OR}(Z_i) \xrightarrow{p} T_{OR}(Z_i)$.

Proof. Let $A_\varepsilon = \{z : |\hat T_{OR}(z) - T_{OR}(z)| \ge \varepsilon\}$. Then $\varepsilon^2\Pr(A_\varepsilon) \le E\|\hat T_{OR} - T_{OR}\|^2 \to 0$; consequently, $\Pr(A_\varepsilon) \to 0$. Therefore $\Pr(|\hat T_{OR}(Z_i) - T_{OR}(Z_i)| \ge \varepsilon) \le \Pr(A_\varepsilon) + \Pr(\{|\hat T_{OR}(Z_i) - T_{OR}(Z_i)| \ge \varepsilon\} \cap A_\varepsilon^c) = \Pr(A_\varepsilon) \to 0$, and the result follows.
Lemma A.3. For $\alpha < t < 1$, $E[1\{\hat T_{OR}(Z_i) < t\}\hat T_{OR}(Z_i)] \to E[1\{T_{OR}(Z_i) < t\}T_{OR}(Z_i)]$.

Proof. $\hat T_{OR}(Z_i) \xrightarrow{p} T_{OR}(Z_i)$ (Lemma A.2) implies that $\hat T_{OR}(Z_i) \xrightarrow{d} T_{OR}(Z_i)$. Let $h(x) = 1\{x < t\}x$; then $h(x)$ is bounded and continuous for $x < t$. By lemma 2.2 of van der Vaart (1998), $E[h(\hat T_{OR}(Z_i))] \to E[h(T_{OR}(Z_i))]$, and the result follows.
Lemma A.4. Construct the empirical distributions $\hat G_{OR}(t) = \frac{1}{m}\sum_{i=1}^m 1\{\hat T_{OR}(z_i) \le t\}$ and $\hat G^0_{OR}(t) = \frac{1}{m}\sum_{i=1}^m 1\{\hat T_{OR}(z_i) \le t\}\hat T_{OR}(z_i)$. Define $\hat Q_{OR}(t) = \hat G^0_{OR}(t)/\hat G_{OR}(t)$, the estimated mFDR. Then for $\alpha < t < 1$, $\hat Q_{OR}(t) \xrightarrow{p} Q_{OR}(t)$.
Proof. Let $\rho_m = \mathrm{cov}[1\{\hat T_{OR}(Z_i) \le t\}, 1\{\hat T_{OR}(Z_j) \le t\}]$, where $Z_i$ and $Z_j$ are two independent random variables from the mixture distribution $F$. $\hat T_{OR}(Z_i) \xrightarrow{p} T_{OR}(Z_i)$ (Lemma A.2) implies that $E[1\{\hat T_{OR}(Z_i) \le t\}1\{\hat T_{OR}(Z_j) \le t\}] \to E[1\{T_{OR}(Z_i) \le t\}1\{T_{OR}(Z_j) \le t\}]$. Therefore $\rho_m \to \mathrm{cov}[1\{T_{OR}(Z_i) \le t\}, 1\{T_{OR}(Z_j) \le t\}] = 0$. Let $\sigma_m^2 = \mathrm{var}(1\{\hat T_{OR}(Z_1) \le t\})$; then $\sigma_m^2 \le 1$.

Let $\mu_m = \Pr\{\hat T_{OR}(Z_i) \le t\}$ and $S_m = \sum_{i=1}^m 1\{\hat T_{OR}(z_i) \le t\}$. Note that $\mathrm{var}(S_m)/m^2 = (1/m)\sigma_m^2 + [(m-1)/m]\rho_m \le 1/m + \rho_m \to 0$. According to the weak law of large numbers for triangular arrays, we have that $(1/m)\sum_{i=1}^m 1\{\hat T_{OR}(z_i) \le t\} - \mu_m \xrightarrow{p} 0$. Also noting that $\mu_m = \Pr\{\hat T_{OR}(Z_i) \le t\} \to \Pr\{T_{OR}(Z_i) \le t\} = G_{OR}(t)$, we conclude that $\hat G_{OR}(t) \xrightarrow{p} G_{OR}(t)$. Next we let $v_m = E[1\{\hat T_{OR}(Z_i) \le t\}\hat T_{OR}(Z_i)]$. Similarly, we can prove that $\frac{1}{m}\sum_{i=1}^m 1\{\hat T_{OR}(z_i) \le t\}\hat T_{OR}(z_i) - v_m \xrightarrow{p} 0$; that is, $\hat G^0_{OR}(t) - v_m \xrightarrow{p} 0$. Note that by Lemma A.3 we have $E[1\{\hat T_{OR}(Z_i) < t\}\hat T_{OR}(Z_i)] \to E[1\{T_{OR}(Z_i) < t\}T_{OR}(Z_i)]$; thus $v_m \to E[1\{T_{OR}(Z_i) < t\}T_{OR}(Z_i)] = (1-p)\int 1\{T_{OR}(z) < t\}f_0(z)\,dz = (1-p)G^0_{OR}(t)$, and $\hat G^0_{OR}(t) \xrightarrow{p} (1-p)G^0_{OR}(t)$. Finally, noting that for $t > \alpha$, $G_{OR}(t)$ is bounded away from 0, we obtain $\hat Q_{OR}(t) \xrightarrow{p} (1-p)G^0_{OR}(t)/G_{OR}(t) = Q_{OR}(t)$.
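Lemma A.4's estimated mFDR has a simple empirical form: it is the average of the estimated Lfdr values that fall at or below the cutoff. A sketch evaluating $\hat Q_{OR}(t)$ on made-up Lfdr values:

```python
import numpy as np

def q_hat(t, lfdr_hat):
    """Estimated mFDR of the rule 'reject when Lfdr_hat <= t' (Lemma A.4):
    Q_hat(t) = G0_hat(t) / G_hat(t), i.e., the average of the Lfdr_hat
    values that fall at or below t."""
    below = lfdr_hat <= t
    if not below.any():
        return 0.0
    return float((lfdr_hat * below).mean() / below.mean())

# Made-up estimated Lfdr values.
lfdr_hat = np.array([0.01, 0.02, 0.95, 0.30, 0.03, 0.99, 0.88, 0.05])
print(round(q_hat(0.10, lfdr_hat), 4))  # prints 0.0275
```

Here four values (0.01, 0.02, 0.03, 0.05) fall below t = 0.10, and their average, 0.0275, is the estimated mFDR of rejecting exactly those cases.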
Lemma A.5. Define the estimate of the plug-in threshold $\hat\lambda_{OR} = \sup\{t \in (0,1) : \hat Q_{OR}(t) \le \alpha\}$. If $\hat Q_{OR}(t) \xrightarrow{p} Q_{OR}(t)$, then $\hat\lambda_{OR} \xrightarrow{p} \lambda_{OR}$.
Proof. Note that $\hat Q_{OR}(t)$ is not continuous, and the consistency of $\hat Q_{OR}(t)$ does not necessarily imply the consistency of $\hat\lambda_{OR}$. Thus we first construct an envelope for $\hat Q_{OR}(t)$ using two continuous random functions $\hat Q^-_{OR}(t)$ and $\hat Q^+_{OR}(t)$ such that, for $\widehat{\mathrm{Lfdr}}_{(k)} < t < \widehat{\mathrm{Lfdr}}_{(k+1)}$,

$$\hat Q^-_{OR}(t) = \hat Q_{OR}\big(\widehat{\mathrm{Lfdr}}_{(k-1)}\big)\,\frac{\widehat{\mathrm{Lfdr}}_{(k+1)} - t}{\widehat{\mathrm{Lfdr}}_{(k+1)} - \widehat{\mathrm{Lfdr}}_{(k)}} + \hat Q_{OR}\big(\widehat{\mathrm{Lfdr}}_{(k)}\big)\,\frac{t - \widehat{\mathrm{Lfdr}}_{(k)}}{\widehat{\mathrm{Lfdr}}_{(k+1)} - \widehat{\mathrm{Lfdr}}_{(k)}}$$

and

$$\hat Q^+_{OR}(t) = \hat Q_{OR}\big(\widehat{\mathrm{Lfdr}}_{(k)}\big)\,\frac{\widehat{\mathrm{Lfdr}}_{(k+1)} - t}{\widehat{\mathrm{Lfdr}}_{(k+1)} - \widehat{\mathrm{Lfdr}}_{(k)}} + \hat Q_{OR}\big(\widehat{\mathrm{Lfdr}}_{(k+1)}\big)\,\frac{t - \widehat{\mathrm{Lfdr}}_{(k)}}{\widehat{\mathrm{Lfdr}}_{(k+1)} - \widehat{\mathrm{Lfdr}}_{(k)}}.$$

Noting that $\hat Q_{OR}(\widehat{\mathrm{Lfdr}}_{(k+1)}) - \hat Q_{OR}(\widehat{\mathrm{Lfdr}}_{(k)}) = \{k\widehat{\mathrm{Lfdr}}_{(k+1)} - \sum_{i=1}^k \widehat{\mathrm{Lfdr}}_{(i)}\}/\{k(k+1)\} > 0$, we have that $\hat Q^-_{OR}(t) \le \hat Q_{OR}(t) \le \hat Q^+_{OR}(t)$, with both $\hat Q^-_{OR}(t)$ and $\hat Q^+_{OR}(t)$ strictly increasing in $t$. Let $\hat\lambda^-_{OR} = \sup\{t \in (0,1) : \hat Q^-_{OR}(t) \le \alpha\}$ and $\hat\lambda^+_{OR} = \sup\{t \in (0,1) : \hat Q^+_{OR}(t) \le \alpha\}$; then $\hat\lambda^+_{OR} \le \hat\lambda_{OR} \le \hat\lambda^-_{OR}$.

We claim that $\hat\lambda^-_{OR} \xrightarrow{p} \lambda_{OR}$. If not, then there must exist $\varepsilon_0$ and $\eta_0$ such that for all $M \in \mathbb{Z}^+$, $\Pr(|\hat\lambda^-_{OR} - \lambda_{OR}| > \varepsilon_0) > 2\eta_0$ holds for some $m \ge M$. Suppose that $\Pr(\hat\lambda^-_{OR} > \lambda_{OR} + \varepsilon_0) > 2\eta_0$. It is easy to see that $\hat Q^-_{OR} \xrightarrow{p} Q_{OR}$, because $\hat Q^-_{OR}(t) - \hat Q^+_{OR}(t) \xrightarrow{a.s.} 0$, $\hat Q^-_{OR}(t) \le \hat Q_{OR}(t) \le \hat Q^+_{OR}(t)$, and $\hat Q_{OR}(t) \xrightarrow{p} Q_{OR}(t)$. Thus there exists $M \in \mathbb{Z}^+$ such that $\Pr(|\hat Q^-_{OR}(\lambda_{OR} + \varepsilon_0) - Q_{OR}(\lambda_{OR} + \varepsilon_0)| < \delta_0) > 1 - \eta_0$ for all $m \ge M$, where we let $2\delta_0 = Q_{OR}(\lambda_{OR} + \varepsilon_0) - \alpha$. Now for this choice of $m$, there exists an event $K_{1m}$ such that $\Pr(K_{1m}) \ge 1 - \eta_0$, and for all outcomes $\omega \in K_{1m}$, $|\hat Q^-_{OR}(\lambda_{OR} + \varepsilon_0) - Q_{OR}(\lambda_{OR} + \varepsilon_0)| < \delta_0$. At the same time, there exists an event $K_{2m}$ such that $\Pr(K_{2m}) \ge 2\eta_0$, and for all outcomes $\omega \in K_{2m}$, $\hat\lambda^-_{OR} > \lambda_{OR} + \varepsilon_0$. Let $K_m = K_{1m} \cap K_{2m}$; then $\Pr(K_m) = \Pr(K_{1m}) + \Pr(K_{2m}) - \Pr(K_{1m} \cup K_{2m}) \ge \Pr(K_{1m}) + \Pr(K_{2m}) - 1 \ge \eta_0$. That is, $K_m$ has positive measure. Then for all outcomes in $K_m$, $\hat Q^-_{OR}(t)$ is continuous and strictly increasing, and we have $\alpha = \hat Q^-_{OR}(\hat\lambda^-_{OR}) > \hat Q^-_{OR}(\lambda_{OR} + \varepsilon_0) > Q_{OR}(\lambda_{OR} + \varepsilon_0) - \delta_0 > \alpha + \delta_0$. This is a contradiction. Therefore we must have $\hat\lambda^-_{OR} \xrightarrow{p} \lambda_{OR}$. Similarly, we can prove $\hat\lambda^+_{OR} \xrightarrow{p} \lambda_{OR}$. Noting that $\hat Q^-_{OR}(t) - \hat Q^+_{OR}(t) \xrightarrow{a.s.} 0$ implies that $\hat\lambda^-_{OR} - \hat\lambda^+_{OR} \xrightarrow{a.s.} 0$, the result follows by noting that $\hat\lambda^+_{OR} \le \hat\lambda_{OR} \le \hat\lambda^-_{OR}$.
Proof of Theorem 4. Implementing the adaptive procedure (10) is equivalent to choosing $\hat T_{OR}(Z_i)$ as the test statistic and $\hat\lambda_{OR}$ as the threshold. The mFDR of the decision rule $\delta(\hat T_{OR}, \hat\lambda_{OR})$ is $\mathrm{mFDR} = E(N_{10})/E(R) = (1-p)P_{H_0}(\hat T_{OR} < \hat\lambda_{OR})/P(\hat T_{OR} < \hat\lambda_{OR})$. Noting that $\hat T_{OR} \xrightarrow{p} T_{OR}$ by Lemma A.2 and $\hat\lambda_{OR} \xrightarrow{p} \lambda_{OR}$ by Lemma A.5, it follows that $(1-p)P_{H_0}(\hat T_{OR} < \hat\lambda_{OR}) \to (1-p)P_{H_0}(T_{OR} < \lambda_{OR}) = (1-p)G^0_{OR}(\lambda_{OR})$ and $P(\hat T_{OR} < \hat\lambda_{OR}) \to P(T_{OR} < \lambda_{OR}) = G_{OR}(\lambda_{OR})$. Noting that $P(T_{OR} < \lambda_{OR})$ is bounded away from 0, we have that $\mathrm{mFDR} \to (1-p)G^0_{OR}(\lambda_{OR})/G_{OR}(\lambda_{OR}) = Q_{OR}(\lambda_{OR}) = \alpha$.
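Operationally, the adaptive z value procedure of Theorem 4 reduces to a step-up rule on the ordered Lfdr statistics: because the estimated mFDR evaluated at the kth order statistic is the average of the k smallest Lfdr values, one rejects the largest set whose running average stays at or below α. A sketch with hypothetical Lfdr values:

```python
import numpy as np

def adapt_z(lfdr, alpha=0.10):
    """Step-up form of the adaptive z value procedure: reject the k hypotheses
    with smallest estimated Lfdr, where k is the largest index at which the
    running average of the ordered Lfdr values (= the estimated mFDR at the
    kth order statistic) is <= alpha."""
    order = np.argsort(lfdr)
    running_avg = np.cumsum(lfdr[order]) / np.arange(1, len(lfdr) + 1)
    passing = np.nonzero(running_avg <= alpha)[0]
    reject = np.zeros(len(lfdr), dtype=bool)
    if passing.size:
        reject[order[: passing[-1] + 1]] = True
    return reject

# Hypothetical estimated Lfdr values.
lfdr = np.array([0.01, 0.02, 0.95, 0.30, 0.03, 0.99, 0.88, 0.05])
print(int(adapt_z(lfdr, alpha=0.10).sum()))  # prints 5
```

Note that the case with Lfdr = .30 is rejected even though .30 > α: the mFDR constrains the average Lfdr of the rejected set, not each case individually.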
Proof of Theorem 5
Similar to the proof of Theorem 4, the mFNR of the adaptive procedure (10) is $\mathrm{mFNR} = E(N_{01})/E(S) = pP_{H_1}(\hat T_{OR} > \hat\lambda_{OR})/P(\hat T_{OR} > \hat\lambda_{OR})$. By Lemmas A.2 and A.5, we have that $P_{H_1}(\hat T_{OR} > \hat\lambda_{OR}) \to P_{H_1}(T_{OR} > \lambda_{OR}) = 1 - G^1_{OR}(\lambda_{OR})$ and $P(\hat T_{OR} > \hat\lambda_{OR}) \to P(T_{OR} > \lambda_{OR}) = 1 - G_{OR}(\lambda_{OR})$. Noting that $1 - G_{OR}(\lambda_{OR})$ is bounded away from 0, we obtain $\mathrm{mFNR} \to p[1 - G^1_{OR}(\lambda_{OR})]/[1 - G_{OR}(\lambda_{OR})]$, the mFNR of the oracle procedure.
[Received November 2006. Revised March 2007.]
REFERENCES
Benjamini, Y., and Hochberg, Y. (1995), "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing," Journal of the Royal Statistical Society, Ser. B, 57, 289–300.
Benjamini, Y., and Hochberg, Y. (2000), "On the Adaptive Control of the False Discovery Rate in Multiple Testing With Independent Statistics," Journal of Educational and Behavioral Statistics, 25, 60–83.
Copas, J. (1974), "On Symmetric Compound Decision Rules for Dichotomies," The Annals of Statistics, 2, 199–204.
Efron, B. (2004a), "Local False Discovery Rate," technical report, Stanford University, Dept. of Statistics, available at http://www-stat.stanford.edu/~brad/papers/False.pdf.
Efron, B. (2004b), "Large-Scale Simultaneous Hypothesis Testing: The Choice of a Null Hypothesis," Journal of the American Statistical Association, 99, 96–104.
Efron, B. (2006), "Correlation and Large-Scale Simultaneous Testing," technical report, Stanford University, Dept. of Statistics, available at http://www-stat.stanford.edu/~brad/papers/Correlation-2006.pdf.
Efron, B., Tibshirani, R., Storey, J., and Tusher, V. (2001), "Empirical Bayes Analysis of a Microarray Experiment," Journal of the American Statistical Association, 96, 1151–1160.
Foster, D., and Stine, R. (2006), "Alpha Investing: A New Multiple Hypothesis Testing Procedure That Controls mFDR," technical report, University of Pennsylvania, Dept. of Statistics.
Genovese, C., and Wasserman, L. (2002), "Operating Characteristic and Extensions of the False Discovery Rate Procedure," Journal of the Royal Statistical Society, Ser. B, 64, 499–517.
Genovese, C., and Wasserman, L. (2004), "A Stochastic Process Approach to False Discovery Control," The Annals of Statistics, 32, 1035–1061.
Genovese, C., Roeder, K., and Wasserman, L. (2005), "False Discovery Control With p-Value Weighting," Biometrika, 93, 509–524.
Jin, J., and Cai, T. (2007), "Estimating the Null and the Proportion of Nonnull Effects in Large-Scale Multiple Comparisons," Journal of the American Statistical Association, 102, 495–506.
Magder, L., and Zeger, S. (1996), "A Smooth Nonparametric Estimate of a Mixing Distribution Using Mixtures of Gaussians," Journal of the American Statistical Association, 91, 1141–1151.
Robbins, H. (1951), "Asymptotically Subminimax Solutions of Compound Statistical Decision Problems," in Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA: University of California Press, pp. 131–148.
Rubin, D., Dudoit, S., and van der Laan, M. (2006), "A Method to Increase the Power of Multiple Testing Procedures Through Sample Splitting," working paper, University of California, Berkeley, Dept. of Biostatistics, available at http://www.bepress.com/ucbbiostat/paper171.
Spjøtvoll, E. (1972), "On the Optimality of Some Multiple Comparison Procedures," The Annals of Mathematical Statistics, 43, 398–411.
Storey, J. (2002), "A Direct Approach to False Discovery Rates," Journal of the Royal Statistical Society, Ser. B, 64, 479–498.
Storey, J. (2003), "The Positive False Discovery Rate: A Bayesian Interpretation and the q-Value," The Annals of Statistics, 31, 2012–2035.
Storey, J. (2007), "The Optimal Discovery Procedure: A New Approach to Simultaneous Significance Testing," Journal of the Royal Statistical Society, Ser. B, 69, 347–368.
van der Laan, M., Dudoit, S., and Pollard, K. (2004), "Multiple Testing. Part III. Augmentation Procedures for Control of the Generalized Family-Wise Error Rate and Tail Probabilities for the Proportion of False Positives," technical report, University of California, Berkeley, Dept. of Biostatistics.
van der Vaart, A. (1998), Asymptotic Statistics, Cambridge, U.K.: Cambridge University Press.
van't Wout, A., Lehrman, G., Mikheeva, S., O'Keeffe, G., Katze, M., Bumgarner, R., Geiss, G., and Mullins, J. (2003), "Cellular Gene Expression Upon Human Immunodeficiency Virus Type 1 Infection of CD4+ T-Cell Lines," Journal of Virology, 77, 1392–1402.
Wasserman, L., and Roeder, K. (2006), "Weighted Hypothesis Testing," technical report, Carnegie Mellon University, Dept. of Statistics, available at http://www.arxiv.org/abs/math.ST/0604172.
Figure 4. Comparison of procedures for FDR control: (a) μ = (−3, 3), p = .20; (b) μ = (−3, 6), p = .20; (c) μ1 = −3, p = (.08, .02); (d) μ1 = −3, p = (.18, .02). The curves show the BH step-up procedure based on the p values, the BHGW adaptive procedure based on the p values, and the adaptive procedure based on the z values. The mFDR is controlled at level .10.
The alternative means $\mu_{1i}$ and $\mu_{2i}$ are generated from the uniform distributions $U(\theta_1 - \delta_1, \theta_1 + \delta_1)$ and $U(\theta_2 - \delta_2, \theta_2 + \delta_2)$, after which $z_i$ is generated from the normal mixture model (8) based on $\mu_{1i}$ and $\mu_{2i}$, $i = 1, \ldots, m$. This hierarchical model can also be viewed as a special case of the mixture model (11). In estimating the Lfdr statistic $\mathrm{Lfdr}(z_i) = (1-p)f_0(z_i)/f(z_i)$, $f_0$ is chosen to be the theoretical null density N(0, 1), $p$ is estimated consistently using the approach of Jin and Cai, and $f$ is estimated using a kernel density estimator with bandwidth chosen by cross-validation. The comparison results are displayed in Figure 4, where the top row gives plots of the mFNR as a function of $p_1$ and the bottom row gives plots of the mFNR as a function of $\theta_2$, with other quantities held fixed. All points in the plots are calculated based on a sample size of m = 5,000 and 200 replications.
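The data-generating scheme just described can be sketched as follows. The values θ1 = −3, θ2 = 4, δ1 = δ2 = .5, and p = (.08, .02) are stand-ins (the study varies these across panels of Figure 4), and the kernel bandwidth uses Scott's rule rather than cross-validation:

```python
import numpy as np
from scipy import stats

# Assumed stand-in parameters; the study varies these across simulations.
rng = np.random.default_rng(3)
m, p1, p2 = 5000, 0.08, 0.02
theta1, delta1, theta2, delta2 = -3.0, 0.5, 4.0, 0.5

labels = rng.choice(3, size=m, p=[1.0 - p1 - p2, p1, p2])
mu = np.zeros(m)
mu[labels == 1] = rng.uniform(theta1 - delta1, theta1 + delta1, (labels == 1).sum())
mu[labels == 2] = rng.uniform(theta2 - delta2, theta2 + delta2, (labels == 2).sum())
z = rng.normal(mu, 1.0)  # hierarchical normal mixture draw

# Lfdr with theoretical null f0 = N(0,1), known p, and a kernel estimate of
# the marginal f (Scott's-rule bandwidth here).
p = p1 + p2
lfdr = np.minimum((1.0 - p) * stats.norm.pdf(z) / stats.gaussian_kde(z)(z), 1.0)
print(float(lfdr[labels == 0].mean()) > float(lfdr[labels != 0].mean()))
```

As expected, the null cases carry much higher Lfdr values on average than the nonnull cases, which is what makes Lfdr-based ranking effective.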
The following observations can be made based on the results from these two numerical studies:
a. When $|\mu_1| = |\mu_2|$, the mFNR remains constant for the p value oracle procedure. In contrast, the mFNR for the z value oracle procedure first increases and then decreases [Fig. 3(a)]. The p value and z value oracle procedures yield the same mFNR levels only when the alternative is symmetric about the null. This reveals the fact that the z value procedure adapts to asymmetry in the alternative distribution, whereas the p value procedure does not. Similar phenomena are shown in Figure 4 for the adaptive procedures.
b. The p value oracle procedure is dominated by the z value oracle procedure. The largest difference occurs when $|\mu_1| < \mu_2$ and $p_1 > p_2$, where the alternative distribution is highly asymmetric about the null [Figs. 3(b) and 3(c)]. Similarly, BH95 is dominated by BHGW, which in turn is dominated by AdaptZ (Fig. 4).
c. The mFNRs of both the p value and z value procedures decrease as $\mu_2$ moves away from 0.
d. Within a reasonable range (mFDR < .6), the mFNR is decreasing in the mFDR [Fig. 3(d)], which verifies part (d) of Proposition 1, one consequence of our SMLR assumption.
Additional simulation results show that the difference in the mFNRs of the p value and z value procedures increases with the proportion of nonnulls, and that the adaptive procedures (BHGW and AdaptZ) control the mFDR approximately at the nominal level α, whereas the BH95 procedure controls the mFDR at a lower level.
6 APPLICATION TO MICROARRAY DATA
We now illustrate our method in the analysis of microarray data from an HIV study. The goal of the HIV study (van't Wout et al. 2003) is to discover differentially expressed genes
Figure 5. Histograms of the HIV data: (a) z values; (b) p values; (c) transformed p values, approximately distributed as uniform(0, 1) for the null cases.
Proof TOR(Zi)prarr TOR(Zi) (Lemma A2) implies that TOR(Zi)
drarrTOR(Zi) Let h(x) = 1x lt tx then h(x) is bounded and continuousfor x lt t By lemma 22 of van der Vaart (1998) E[h(TOR(Zi))] rarrE[h(TOR(Zi))] and the result follows
Lemma A4 Construct the empirical distributions GOR(t) = 1m timessumm
i=1 1TOR(zi) le t and G0PI
(t) = 1m
summi=1 1TOR(zi) le t times
TOR(zi) Define QOR(t) = G0OR(t)GOR(t) the estimated mFDR
Then for α lt t lt 1 QOR(t)prarr QOR(t)
Proof Let ρm = cov[TOR(Zi) TOR(Zj )] where Zi and Zj aretwo independent random variables from the mixture distribution
F TOR(Zi)prarr TOR(Zi) (Lemma A2) implies that E[TOR(Zi) times
TOR(Zj )] rarr E[TOR(Zi)TOR(Zj )] Therefore ρm = cov[TOR(Zi)
TOR(Zj )] rarr cov[TOR(Zi) TOR(Zj )] = 0 Let σ 2m = var(TOR(Z1))
then σ 2m le E[TOR(Z1)]2 le 1
Let μm = PrTOR(Zi) lt t and Sm = summi=1 1TOR(zi) le t Note
that var(Sm)m2 = (1m)σ 2m + [(m minus 1)m]ρm le 1m + ρm rarr 0
According to the weak law of large numbers for triangular arrays
we have that (1m)summ
i=1 1TOR(zi) le t minus μmprarr 0 Also note
that μm = PrTOR(Zi) lt t rarr PrTOR(Zi) lt t = GOR(t) we con-
clude that GOR(t)prarr GOR(t) Next we let vm = E[1TOR(Zi) le
tTOR(Zi)] Similarly we can prove that G0OR(t) = 1
m timessummi=1 1TOR(zi) le tTOR(zi) minus vm
prarr 0 Note that by Lemma A2
we have E[1TOR(Zi) lt tTOR(Zi)] rarr E[1TOR(Zi) lt tTOR(Zi)]thus vm rarr E(1TOR(Zi) lt tTOR(Zi)) = (1 minus p)
int1TOR(Zi) lt
tf0dt = (1 minus p)G0OR(t) and G0
OR(t)prarr (1 minus p)G0
OR(t) Finallynote that for t gt α GOR(t) is bounded away from 0 and we obtain
QOR(t)prarr (1 minus p)G0
OR(t)GOR(t) = QOR(t)
Lemma A5 Define the estimate of the plug-in threshold λOR =supt isin (01) QOR(t) le α If QOR(t)
prarr QOR(t) then λORprarr λOR
912 Journal of the American Statistical Association September 2007
Proof Note that QOR(t) is not continuous and the consistency ofQOR(t) does not necessarily imply the consistency of λOR Thus wefirst construct an envelope for QOR(t) using two continuous randomfunctions Qminus
OR(t) and Q+OR(t) such that for Lfdr(k) lt t lt Lfdr(k+1)
QminusOR(t) = QOR
(Lfdr(kminus1)
) t minus Lfdr(k)
Lfdr(k+1) minus Lfdr(k)
+ QOR(Lfdr(k)
) Lfdr(k+1) minus t
Lfdr(k+1) minus Lfdr(k)
and
Q+OR(t) = QOR
(Lfdr(k)
) t minus Lfdr(k)
Lfdr(k+1) minus Lfdr(k)
+ QOR(Lfdr(k+1)
) Lfdr(k+1) minus t
Lfdr(k+1) minus Lfdr(k)
Noting that QOR(Lfdr(k+1)) minus QOR(Lfdr(k)) = [kLfdr(k+1) minussumki=1 Lfdr(i)][k(k + 1)] gt 0 we have that Qminus
OR(t) le QOR(t) leQ+
OR(t) with both QminusOR(t) and Q+
OR(t) strictly increasing in t
Let λminusOR = supt isin (01) Qminus
OR(t) le α and λ+OR = supt isin (01)
Q+OR(t) le α then λ+
OR le λOR le λminusOR
We claim that λminusOR
prarr λOR If not then there must exist ε0 and η0
such that for all m isin Z+ Pr(|λminusOR minus λOR| gt ε0) gt 2η0 holds for some
m ge M Suppose that Pr(λminusOR gt λOR + ε0) gt 2η0 It is easy to see that
QminusOR
prarr QOR because QminusOR(t) minus Q+
OR(t)asrarr 0 Qminus
OR(t) le QOR(t) leQ+
OR(t) and QOR(t)prarr QOR(t) Thus there exists M isin Z
+ such
that Pr(|QminusOR(λOR + ε0) minus QOR(λOR + ε0)| lt δ0) gt 1 minus η0 for all
m ge M where we let 2δ0 = QOR(λOR + ε0) minus α Now for thechoice of m there exists event K1m such that Pr(K1m) ge 1 minus η0 andfor all outcomes ω isin K1m |Qminus
OR(λOR + ε0) minus QOR(λOR + ε0)| lt
δ0 At the same time there exists event K2m such that Pr(K2m) ge2η0 and for all outcomes ω isin K2m λminus
OR gt λOR + ε0 Let Km =K1m capK2m then Pr(Km) = Pr(K1m)+Pr(K2m)minusPr(K1m cupK2m) gePr(K1m)+Pr(K2m)minus1 ge η0 That is Km has positive measure Thenfor all outcomes in Km Qminus
OR(t) is continuous and strictly increas-
ing and we have α = QminusOR(λminus
OR) gt QminusOR(λOR + ε0) gt QOR(λOR +
ε0) minus δ0 gt α + δ0 This is a contradiction Therefore we must
have λminusOR
prarr λOR Similarly we can prove λ+OR
prarr λOR Note that
QminusOR(t) minus Q+
OR(t)asrarr 0 implies that λminus
OR minus λ+OR
asrarr 0 the result fol-
lows by noting that λ+OR le λOR le λminus
OR
Proof of Theorem 4 Implementing the adaptive procedure (10)is equivalent to choosing TOR(Zi) as the test statistic and λOR asthe threshold The mFDR of decision rule δ(TOR λOR) is mFDR =E(N10)E(R) = [(1 minus p)PH0(TOR lt λOR)][P(TOR lt λOR)] Not-
ing that TORprarr TOR by Lemma A2 and λOR
prarr λOR by Lemma A5it follows that (1 minus p)PH0(TOR lt λOR) rarr (1 minus p)PH0(TOR lt
λOR) = (1 minus p)G0OR(λOR) and P(TOR lt λOR) rarr P(TOR lt λOR) =
GOR(λOR) Noting that P(TOR lt λOR) is bounded away from 0 wehave that mFDR rarr (1 minus p)G0
OR(λOR)GOR(λOR) = QOR(λOR) =α
Proof of Theorem 5
Similar to the proof of Theorem 4 the mFNR of the adaptive pro-cedure (10) is mFNR = E(N01)E(S) = PH1 (TOR gt λOR)P (TOR gt
λOR) By Lemmas A2 and A5 we have that PH1 (TOR gt λOR) rarrPH1(TOR gt λOR) = G1
OR(λOR) and P(TOR gt λOR) rarr P(TOR gt
λOR) = GOR(λOR) Noting that GOR(λOR) is bounded away from 0we obtain mFNR rarr pG1
OR(λOR)GOR(λOR) = QOR(λOR)
[Received November 2006 Revised March 2007]
REFERENCES
Benjamini Y and Hochberg Y (1995) ldquoControlling the False Discovery RateA Practical and Powerful Approach to Multiple Testingrdquo Journal of the RoyalStatistical Society Ser B 57 289ndash300
(2000) ldquoOn the Adaptive Control of the False Discovery Rate in Mul-tiple Testing With Independent Statisticsrdquo Journal of Educational and Be-havioral Statistics 25 60ndash83
Copas J (1974) ldquoOn Symmetric Compound Decision Rules for DichotomiesrdquoThe Annals of Statistics 2 199ndash204
Efron B (2004a) ldquoLocal False Discovery Raterdquo technical report Stan-ford University Dept of Statistics available at httpwww-statstanfordedu~bradpapersFalsepdf
(2004b) ldquoLarge-Scale Simultaneous Hypothesis Testing the Choiceof a Null Hypothesisrdquo Journal of the American Statistical Association 9996ndash104
(2006) ldquoCorrelation and Large-Scale Simultaneous Testingrdquo technicalreport Stanford University Dept of Statistics available at httpwww-statstanfordedu~bradpapersCorrelation-2006pdf
Efron B Tibshirani R Storey J and Tusher V (2001) ldquoEmpirical BayesAnalysis of a Microarray Experimentrdquo Journal of the American StatisticalAssociation 96 1151ndash1160
Foster D and Stine R (2006) ldquoAlpha Investing A New Multiple Hypothe-sis Testing Procedure That Controls mFDRrdquo technical report University ofPennsylvania Dept of Statistics
Genovese C and Wasserman L (2002) ldquoOperating Characteristic and Exten-sions of the False Discovery Rate Procedurerdquo Journal of the Royal StatisticalSociety Ser B 64 499ndash517
(2004) ldquoA Stochastic Process Approach to False Discovery ControlrdquoThe Annals of Statistics 32 1035ndash1061
Genovese C Roeder K and Wasserman L (2005) ldquoFalse Discovery ControlWith p-Value Weightingrdquo Biometrika 93 509ndash524
Jin J and Cai T (2007) ldquoEstimating the Null and the Proportion of Non-Null Effects in Large-Scale Multiple Comparisonsrdquo Journal of the AmericanStatistical Association 102 495ndash506
Magder L and Zeger S (1996) ldquoA Smooth Nonparametric Estimate of aMixing Distribution Using Mixtures of Gaussiansrdquo Journal of the AmericanStatistical Association 91 1141ndash1151
Robbins H (1951) ldquoAsymptotically Subminimax Solutions of Compound Sta-tistical Decision Problemsrdquo in Proceedings of Second Berkeley Symposiumon Mathematical Statistics and Probability Berkeley CA University of Cal-ifornia Press pp 131ndash148
Rubin D Dudoit S and van der Laan M (2006) ldquoA Method to Increase thePower of Multiple Testing Procedures Through Sample Splittingrdquo workingpaper University of California Berkeley Dept of Biostatistics available athttpwwwbepresscomucbbiostatpaper171
Spjoslashtvoll E (1972) ldquoOn the Optimality of Some Multiple Comparison Proce-duresrdquo The Annals of Mathematical Statistics 43 398ndash411
Storey J (2002) ldquoA Direct Approach to False Discovery Ratesrdquo Journal of theRoyal Statistical Society Ser B 64 479ndash498
(2003) ldquoThe Positive False Discovery Rate A Bayesian Interpretationand the Q-Valuerdquo The Annals of Statistics 31 2012ndash2035
(2007) ldquoThe Optimal Discovery Procedure A New Approach to Si-multaneous Significance Testingrdquo Journal of the Royal Statistical SocietySer B 69 347ndash368
van der Laan M Dudoit S and Pollard K (2004) ldquoMultiple Testing Part IIIAugmentation Procedures for Control of the Generalized Family-Wise ErrorRate and Tail Probabilities for the Proportion of False Positivesrdquo Universityof California Berkeley Dept of Biostatistics
van der Vaart A (1998) Asymptotic Statistics Cambridge UK CambridgeUniversity Press
vanrsquot Wout A Lehrman G Mikheeva S OrsquoKeeffe G Katze MBumgarner R Geiss G and Mullins J (2003) ldquoCellular Gene Expres-sion Upon Human Immunodeficiency Virus Type 1 Infection of CD4+-T-CellLinesrdquo Journal of Virology 77 1392ndash1402
Wasserman L and Roeder K (2006) ldquoWeighted Hypothesis Testingrdquo tech-nical report Carnegie Mellon University Dept of Statistics available athttpwwwarxivorgabsmathST0604172
Sun and Cai Rules for False Discovery Rate Control 909
Figure 5. Histograms of the HIV data: (a) z values; (b) p values; (c) transformed p values, which are approximately distributed as Uniform(0, 1) for the null cases.
between HIV-positive patients and HIV-negative controls. Gene expression levels were measured for four HIV-positive patients and four HIV-negative controls on the same N = 7680 genes. A total of N two-sample t-tests were performed, and N two-sided p values were obtained. The z values were converted from the t statistics using the transformation zi = Φ−1[G0(ti)], where Φ and G0 are the cdf's of a standard normal variable and a t variable with 6 degrees of freedom. The histograms of the z values and p values are presented in Figure 5. An important feature of this data set is that the z value distribution is asymmetric about the null; the distribution is skewed to the right.
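Concretely, this t-to-z transformation is straightforward to compute (a sketch assuming SciPy; the function name `t_to_z` is illustrative, not from the paper):

```python
import numpy as np
from scipy.stats import norm, t as t_dist

def t_to_z(t_stats, df=6):
    """Convert t statistics to z values via z_i = Phi^{-1}[G_0(t_i)],
    where Phi is the standard normal cdf and G_0 is the cdf of a
    t variable with the given degrees of freedom."""
    return norm.ppf(t_dist.cdf(np.asarray(t_stats, dtype=float), df))
```

Under the theoretical null, ti follows a t distribution with 6 degrees of freedom, so the resulting zi are standard normal.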
When the null hypothesis is true, the p values and z values should follow their theoretical null distributions, which are uniform and standard normal. But the theoretical nulls are usually quite different from the empirical nulls for data arising from microarray experiments (see Efron 2004b for more discussion on the choice of the null in multiple testing). We take the approach of Jin and Cai (2007) to estimate the null distribution as N(μ̂0, σ̂0²); the estimates μ̂0 and σ̂0² are consistent. We then proceed to estimate the proportion of the nonnulls, p̂, based on μ̂0 and σ̂0². The marginal density f is estimated by a kernel density estimate f̂, with the bandwidth chosen by cross-validation. The Lfdr statistics are then calculated as L̂fdr(zi) = (1 − p̂)f̂0(zi)/f̂(zi). The transformed p values are obtained as F̂0(zi), where F̂0 is the estimated null cdf, F̂0(x) = Φ((x − μ̂0)/σ̂0). As shown in Figure 5(c), after transformation the distribution of the transformed p values is approximately uniform when the null is true.
We compare the BH95, BHGW, and AdaptZ procedures using both the theoretical nulls and the estimated nulls, calculating the number of rejections at each mFDR level; the results are shown in Figure 6. For Figure 6(a), f0 is chosen to be the theoretical null N(0, 1), and the estimate for the proportion of nulls is 1; therefore, the BH95 and BHGW procedures yield the same number of rejections. For Figure 6(b), the estimated null distribution is N(−.08, .77²), with estimated proportion of nulls p̂0 = .94. The transformed p values, as well as the Lfdr statistics, are calculated according to the estimated null. The following observations can be made from the results displayed:
a. The number of rejections is increasing as a function of the mFDR level.

b. For both the p value–based and z value–based approaches, more hypotheses are rejected by using the estimated null.

c. Both comparisons show that AdaptZ is more powerful than BH95 and BHGW, which are based on p values.
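For reference, the BHGW plug-in procedure used in this comparison rejects the i0 smallest p values, where i0 is the largest i with p(i) ≤ iα/[m(1 − p̂)] (see the proof of Theorem 3 in the Appendix). A minimal sketch, with illustrative names:

```python
import numpy as np

def plug_in_stepup(pvalues, alpha, p_hat):
    """Reject the i0 smallest p values, where i0 is the largest i with
    p_(i) <= i * alpha / (m * (1 - p_hat)); p_hat is the estimated
    proportion of nonnull cases. Returns a boolean rejection vector."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresh = np.arange(1, m + 1) * alpha / (m * (1.0 - p_hat))
    below = p[order] <= thresh
    reject = np.zeros(m, dtype=bool)
    if below.any():
        i0 = int(np.nonzero(below)[0].max())   # largest index meeting the bound
        reject[order[: i0 + 1]] = True
    return reject
```

Setting `p_hat = 0` recovers the BH95 step-up procedure.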
7 DISCUSSION
We have developed a compound decision theory framework for inference about the mixture model (1) and showed that the oracle procedure δ(TOR, λOR) given in (6) is optimal in the multiple-testing problems for mFDR control. We have proposed an adaptive procedure based on the z values that mimics the oracle procedure (6). The adaptive z value–based procedure attains the performance of the oracle procedure asymptotically and outperforms the traditional p value–based approaches. The decision-theoretic framework provides insight into the superiority of the adaptive z value–based procedure. Each multiple-testing procedure involves two steps: ranking and thresholding. The process of ranking the relative importance of the m tests can be viewed as a compound decision problem P = {Pij: 1 ≤ i < j ≤ m} with m(m − 1)/2 components. Each component problem Pij is a pairwise comparison with its own data yij = (xi, xj). The solution to P based on p values is simple (note that δij = I[p(xi) < p(xj)] depends on yij alone), whereas the rule based on the estimated Lfdr is compound. The gain in efficiency of the adaptive z value–based procedure is due to the fact that the scope of attention is extended from the class of simple rules to the class of compound rules in the ranking process.

910 Journal of the American Statistical Association, September 2007

Figure 6. Analysis of the HIV data: number of rejections versus FDR level, for the theoretical null (a) and the estimated null (b), comparing the BH step-up procedure based on the p values, the BHGW adaptive procedure based on the p values, and the adaptive procedure based on the z values.
The compound decision-theoretic approach to multiple test-ing also suggests several future research directions First theoracle statistic (Lfdr) may no longer be optimal in ranking thesignificance of tests when the observations are correlated Inaddition different cutoffs should be chosen under dependencybecause the multiple-testing procedures developed for indepen-dent tests can be too liberal or too conservative for dependenttests (Efron 2006) Second the oracle procedure is only opti-mal in the class of symmetric decision rules We expect to de-velop more general models in the compound decision-theoreticframework and extend our attention to asymmetric rules to fur-ther increase the efficiency of the multiple-testing proceduresFinally Genovese Roeder and Wasserman (2005) Wassermanand Roeder (2006) Rubin Dudoit and van der Laan (2006)and Foster and Stine (2006) have discussed methods that usedifferent cutoffs for p values for different tests when prior in-formation is available or some correlation structure is known Itwould be interesting to construct z valuendashbased testing proce-dures that also exploit external information We leave this topicfor future research
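The ranking comparison above (simple p value ranking versus compound Lfdr ranking) can be made concrete with a small numerical example: under an asymmetric alternative, the two orderings of a pair of z values can disagree. This is a sketch with an illustrative one-sided normal alternative, not the HIV data:

```python
import numpy as np
from scipy.stats import norm

# Null N(0, 1); nonnull density concentrated on the positive side,
# f1 = N(3, 1), with nonnull proportion p = 0.2 (illustrative values).
p = 0.2
f0 = lambda z: norm.pdf(z)
f1 = lambda z: norm.pdf(z, loc=3.0)
lfdr = lambda z: (1 - p) * f0(z) / ((1 - p) * f0(z) + p * f1(z))
pval = lambda z: 2 * norm.sf(abs(z))   # two-sided p value

z1, z2 = -2.5, 2.0
# The two-sided p values rank z1 as more significant ...
assert pval(z1) < pval(z2)
# ... but the Lfdr ranks z2 first, because the alternative is on the right.
assert lfdr(z2) < lfdr(z1)
```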
APPENDIX: PROOFS
Proof of Proposition 1
To prove part (a), note that the MLR assumption (4) implies that

∫0^c g0(t) dt / ∫0^c g1(t) dt = ∫0^c g0(t) dt / ∫0^c [g1(t)/g0(t)]g0(t) dt < ∫0^c g0(t) dt / {[g1(c)/g0(c)] ∫0^c g0(t) dt} = g0(c)/g1(c),

which is equivalent to g1(c)G0(c) < g0(c)G1(c). Thus the derivative of P(θi = 1 | T(Xi) ≤ c) = pG1(c)/G(c) is (1 − p)p{g1(c)G0(c) − g0(c)G1(c)}/G(c)² < 0; therefore, P(θi = 1 | T(Xi) ≤ c) is decreasing in c. Parts (b) and (c) can be proved similarly to part (a), by noting that mFDR = p0G0(c)/G(c) and mFNR = p1[1 − G1(c)]/[1 − G(c)]. Part (d) follows from parts (b) and (c). For part (e), the classification risk of δ is Rλ = E[Lλ(θ, δ)] = λ(1 − p)G0(c) + p{1 − G1(c)}. The optimal cutoff c∗ that minimizes Rλ satisfies g0(c∗)/g1(c∗) = p/[λ(1 − p)], or equivalently λ = pg1(c∗)/[(1 − p)g0(c∗)]. Therefore λ (c∗) is monotone decreasing in c∗ (λ) if (4) holds. Noting that r = G(c∗) and that G is increasing, we conclude that r is decreasing in λ.
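Part (b) can be checked numerically for a concrete model. The following sketch approximates mFDR(c) = (1 − p)G0(c)/G(c) by Monte Carlo for the Lfdr statistic under an illustrative normal mixture and verifies that it is nondecreasing in the cutoff c (all parameters illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
p = 0.2                                    # nonnull proportion
lfdr = lambda z: (1 - p) * norm.pdf(z) / (
    (1 - p) * norm.pdf(z) + p * norm.pdf(z, loc=2.0))

n = 200_000
z0 = rng.normal(0.0, 1.0, n)               # null draws, so T ~ G0
zm = np.where(rng.random(n) < p,           # mixture draws, so T ~ G
              rng.normal(2.0, 1.0, n),
              rng.normal(0.0, 1.0, n))
t0, tm = lfdr(z0), lfdr(zm)

cutoffs = np.linspace(0.05, 0.95, 10)
mfdr = [(1 - p) * np.mean(t0 <= c) / np.mean(tm <= c) for c in cutoffs]
# mFDR(c) = (1 - p) G0(c) / G(c) should be nondecreasing in c (part (b)),
# up to Monte Carlo noise.
assert all(a <= b + 1e-3 for a, b in zip(mfdr, mfdr[1:]))
```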
Proof of Theorem 1
According to part (b) of Proposition 1, for any T ∈ T and mFDR level α, there exists a unique threshold c such that the mFDR is controlled at level α by δ(T, c) = {I[T(x1) < c], ..., I[T(xm) < c]}. Let r be the expected number of rejections of δ(T, c). Then, again by part (b) of Proposition 1, there exists a unique threshold c∗ such that the expected number of rejections of δ(Λ, c∗) is also r. Next, according to part (e) of Proposition 1, there exists a unique λ(α) with respect to the choice of c∗ such that the classification risk with λ(α) as the weight is minimized by δ(Λ, c∗), and the expected number of subjects classified to the nonnull population is r.

Suppose that among the r rejected hypotheses there are vL from the null and kL from the nonnull when δ(L, cL) is applied, where (L, cL) = (T, c) or (Λ, c∗). Then r = vT + kT = vΛ + kΛ. Now we argue that vT ≥ vΛ and kT ≤ kΛ. If not, suppose that vT < vΛ and kT > kΛ. Note that the classification risk can be expressed as Rλ(α) = p + (1/m){λ(α)vL − kL}; then it is easy to see that δ(T, c) would yield a lower classification risk than δ(Λ, c∗). This contradicts the fact that δ(Λ, c∗) minimizes the classification risk for the choice of λ(α). Therefore we must have vT ≥ vΛ and kT ≤ kΛ.
Let mFDRL and mFNRL be the mFDR and mFNR of δ(L, cL), where L = T or Λ. Now apply both δ = δ(T, c) and δ∗ = δ(Λ, c∗) in the multiple-testing problem. Then the mFDR using rule δ∗ is mFDR∗ = vΛ/r ≤ vT/r = mFDRT = α, and the mFNR using δ∗ is mFNR∗ = (m1 − kΛ)/(m − r) ≤ (m1 − kT)/(m − r) = mFNRT. Therefore δ∗ is the testing rule in class DS that controls the mFDR at level α with the smallest mFNR.
Proof of Theorem 2
The joint distribution of θ = (θ1, ..., θm) is π(θ) = ∏i (1 − p)^(1−θi) p^θi. The posterior distribution of θ given x can be calculated as Pθ|X(θ|x) = ∏i Pθi|Xi(θi|xi), where

Pθi|Xi(θi|xi) = {I(θi = 0)(1 − p)f0(xi) + I(θi = 1)pf1(xi)} / {(1 − p)f0(xi) + pf1(xi)}.   (A1)

Let Ii = I(θi = 1), and let I = {(I1, ..., Im): Ii = 0 or 1} be the collection of all m-binary vectors; then |I| = 2^m. The elements of I are designated by superscripts, I^(j), j = 1, ..., 2^m. Note that

Σθi=0,1 {I(θi = 0)(1 − p)f0(xi) + I(θi = 1)pf1(xi)}{λI(θi = 0)δi + I(θi = 1)(1 − δi)} = λ(1 − p)f0(xi)δi + pf1(xi)(1 − δi);

then the posterior risk is

Eθ|X L(θ, δ) = Σj=1..2^m L(θ = I^(j), δ) ∏i Pθi|Xi(θi = Ii^(j) | xi)
= (1/m) Σi Σθi=0,1 Pθi|Xi(θi | xi){λI(θi = 0)δi + I(θi = 1)(1 − δi)}
= (1/m) Σi pf1(xi)/f(xi) + (1/m) Σi [{λ(1 − p)f0(xi) − pf1(xi)}/f(xi)]δi.

The Bayes rule is the simple rule δπ(x) = [δπ1(x1), ..., δπm(xm)], where δπi = I{λ(1 − p)f0(xi) < pf1(xi)}. The expected misclassification risk of δπ is

E{Eθ|X L(θ, δπ)} = E{pf1(X)/f(X)} + E[{λ(1 − p)f0(X) − pf1(X)}/f(X) · I{λ(1 − p)f0(X) < pf1(X)}]
= p + ∫{λ(1−p)f0<pf1} {λ(1 − p)f0(x) − pf1(x)} dx.
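The Bayes rule δπi = I{λ(1 − p)f0(xi) < pf1(xi)} is immediate to implement once the mixture components are specified (a sketch with an illustrative normal mixture, f0 = N(0, 1) and f1 = N(mu1, 1); SciPy assumed):

```python
import numpy as np
from scipy.stats import norm

def bayes_rule(x, lam, p, mu1=2.0):
    """delta_i = I{ lam * (1 - p) * f0(x_i) < p * f1(x_i) }, the Bayes
    rule for the weighted classification loss, with f0 = N(0, 1) and
    f1 = N(mu1, 1) as illustrative mixture components."""
    x = np.asarray(x, dtype=float)
    return lam * (1 - p) * norm.pdf(x) < p * norm.pdf(x, loc=mu1)
```

Increasing λ makes false positives costlier, so fewer observations are classified to the nonnull population, consistent with part (e) of Proposition 1.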
Proof of Theorem 3
Observe that for each t, if p(i) ≤ t < p(i+1), then the number of rejections is i. In addition, the ratio (1 − p̂)t/Gm(t) increases in t over the range on which the number of rejections is a constant, which implies that if t = p(i) does not satisfy the constraint, then neither does any choice of t between p(i) and p(i+1). Thus it is sufficient to investigate thresholds equal to one of the p(i)'s. At t = p(i), the ratio (1 − p̂)t/Gm(t) becomes (m/i)(1 − p̂)p(i). Therefore, the plug-in threshold is given by sup{p(i): (m/i)(1 − p̂)p(i) ≤ α}, which is equivalent to choosing the largest i such that p(i) ≤ iα/[m(1 − p̂)].

Proof of Theorem 4
To prove this theorem, we need the following lemmas.

Lemma A1. Let p̂, f̂, and f̂0 be estimates such that p̂ →p p, E‖f̂ − f‖² → 0, and E‖f̂0 − f0‖² → 0. Then E‖T̂OR − TOR‖² → 0.

Proof. Note that f is continuous and positive on the real line; then there exists K1 = [−M, M] such that Pr(z ∈ K1^c) → 0 as M → ∞. Let inf{f(z): z ∈ K1} = l0 and Aεf = {z: |f̂(z) − f(z)| ≥ l0/2}. Note that E‖f̂ − f‖² ≥ (l0/2)² Pr(Aεf); then Pr(Aεf) → 0. Thus f and f̂ are bounded below by a positive number for large m, except on an event that has a low probability. Similar arguments apply to the upper bounds of f and f̂, as well as to the upper and lower bounds of f0 and f̂0. Therefore, we conclude that f0, f̂0, f, and f̂ are all bounded in an interval [la, lb], 0 < la < lb < ∞, for large m, except on an event, say Aε, that has algebraically low probability; that is, 0 < la < inf{min{f0, f̂0, f, f̂}: z ∈ Aε^c} ≤ sup{max{f0, f̂0, f, f̂}: z ∈ Aε^c} < lb < ∞.

Noting that T̂OR − TOR = {f0f(p − p̂) + (1 − p̂)f(f̂0 − f0) + (1 − p)f0(f − f̂)}/(f f̂), we conclude that (T̂OR − TOR)² ≤ c1(p̂ − p)² + c2(f̂ − f)² + c3(f̂0 − f0)² on Aε^c. It is easy to see that ‖T̂OR − TOR‖² is bounded, by L say. Then E‖T̂OR − TOR‖² ≤ L Pr(Aε) + c1E(p̂ − p)² + c2E‖f̂ − f‖² + c3E‖f̂0 − f0‖². Note that E(p̂ − p)² → 0 by lemma 2.2 of van der Vaart (1998) and that E‖f̂ − f‖² → 0, E‖f̂0 − f0‖² → 0 by assumption; then, for a given ε > 0, there exists M ∈ Z+ such that we can find Aε with Pr(Aε) < ε/(4L) and, at the same time, E(p̂ − p)² < ε/(4c1), E‖f̂ − f‖² < ε/(4c2), and E‖f̂0 − f0‖² < ε/(4c3) for all m ≥ M. Consequently, E‖T̂OR − TOR‖² < ε for m ≥ M, and the result follows.
Lemma A2. E‖T̂OR − TOR‖² → 0 implies that T̂OR(Zi) →p TOR(Zi).

Proof. Let Aε = {z: |T̂OR(z) − TOR(z)| ≥ ε}. Then ε² Pr(Aε) ≤ E‖T̂OR − TOR‖² → 0; consequently, Pr(Aε) → 0. Therefore, Pr(|T̂OR(Zi) − TOR(Zi)| ≥ ε) ≤ Pr(Aε) + Pr({|T̂OR(Zi) − TOR(Zi)| ≥ ε} ∩ Aε^c) = Pr(Aε) → 0, and the result follows.
Lemma A3. For α < t < 1, E[1{T̂OR(Zi) < t}T̂OR(Zi)] → E[1{TOR(Zi) < t}TOR(Zi)].

Proof. T̂OR(Zi) →p TOR(Zi) (Lemma A2) implies that T̂OR(Zi) →d TOR(Zi). Let h(x) = 1{x < t}x; then h(x) is bounded and continuous except at x = t. By lemma 2.2 of van der Vaart (1998), E[h(T̂OR(Zi))] → E[h(TOR(Zi))], and the result follows.
Lemma A4. Construct the empirical distributions ĜOR(t) = (1/m) Σi=1..m 1{T̂OR(zi) ≤ t} and Ĝ0OR(t) = (1/m) Σi=1..m 1{T̂OR(zi) ≤ t}T̂OR(zi). Define Q̂OR(t) = Ĝ0OR(t)/ĜOR(t), the estimated mFDR. Then for α < t < 1, Q̂OR(t) →p QOR(t).
Proof. Let ρm = cov[T̂OR(Zi), T̂OR(Zj)], where Zi and Zj are two independent random variables from the mixture distribution F. T̂OR(Zi) →p TOR(Zi) (Lemma A2) implies that E[T̂OR(Zi)T̂OR(Zj)] → E[TOR(Zi)TOR(Zj)]; therefore, ρm = cov[T̂OR(Zi), T̂OR(Zj)] → cov[TOR(Zi), TOR(Zj)] = 0. Let σm² = var(T̂OR(Z1)); then σm² ≤ E[T̂OR(Z1)²] ≤ 1.

Let μm = Pr{T̂OR(Zi) < t} and Sm = Σi=1..m 1{T̂OR(zi) ≤ t}. Note that var(Sm)/m² = (1/m)σm² + [(m − 1)/m]ρm ≤ 1/m + ρm → 0. According to the weak law of large numbers for triangular arrays, we have that (1/m) Σi=1..m 1{T̂OR(zi) ≤ t} − μm →p 0. Also noting that μm = Pr{T̂OR(Zi) < t} → Pr{TOR(Zi) < t} = GOR(t), we conclude that ĜOR(t) →p GOR(t). Next, let vm = E[1{T̂OR(Zi) ≤ t}T̂OR(Zi)]. Similarly, we can prove that Ĝ0OR(t) − vm = (1/m) Σi=1..m 1{T̂OR(zi) ≤ t}T̂OR(zi) − vm →p 0. By Lemma A3, E[1{T̂OR(Zi) < t}T̂OR(Zi)] → E[1{TOR(Zi) < t}TOR(Zi)]; thus vm → E[1{TOR(Zi) < t}TOR(Zi)] = (1 − p) ∫ 1{TOR(z) < t}f0(z) dz = (1 − p)G0OR(t), and Ĝ0OR(t) →p (1 − p)G0OR(t). Finally, noting that for t > α, GOR(t) is bounded away from 0, we obtain Q̂OR(t) →p (1 − p)G0OR(t)/GOR(t) = QOR(t).
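At the observed points t = L̂fdr(k), the estimate Q̂OR(t) reduces to the running average of the k smallest Lfdr statistics, so thresholding at the largest t with Q̂OR(t) ≤ α amounts to a step-up rule. A sketch (illustrative names; the Lfdr statistics are assumed already computed):

```python
import numpy as np

def adaptive_z_rejections(lfdr_values, alpha):
    """Return the number k0 of rejections of the adaptive z-value
    procedure: reject the k0 hypotheses with the smallest Lfdr values,
    where k0 is the largest k whose running average of sorted Lfdr
    values (the estimated mFDR at t = Lfdr_(k)) is <= alpha."""
    lf = np.sort(np.asarray(lfdr_values, dtype=float))
    running_avg = np.cumsum(lf) / np.arange(1, lf.size + 1)
    ok = np.nonzero(running_avg <= alpha)[0]
    return 0 if ok.size == 0 else int(ok[-1]) + 1
```

The rejected hypotheses are those whose Lfdr statistic is among the k0 smallest.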
Lemma A5. Define the estimate of the plug-in threshold λ̂OR = sup{t ∈ (0, 1): Q̂OR(t) ≤ α}. If Q̂OR(t) →p QOR(t), then λ̂OR →p λOR.
Proof. Note that Q̂OR(t) is not continuous, and the consistency of Q̂OR(t) does not necessarily imply the consistency of λ̂OR. Thus we first construct an envelope for Q̂OR(t) using two continuous random functions Q̂−OR(t) and Q̂+OR(t), such that for L̂fdr(k) < t < L̂fdr(k+1),

Q̂−OR(t) = Q̂OR(L̂fdr(k−1)) · [L̂fdr(k+1) − t]/[L̂fdr(k+1) − L̂fdr(k)] + Q̂OR(L̂fdr(k)) · [t − L̂fdr(k)]/[L̂fdr(k+1) − L̂fdr(k)]

and

Q̂+OR(t) = Q̂OR(L̂fdr(k)) · [L̂fdr(k+1) − t]/[L̂fdr(k+1) − L̂fdr(k)] + Q̂OR(L̂fdr(k+1)) · [t − L̂fdr(k)]/[L̂fdr(k+1) − L̂fdr(k)].

Noting that Q̂OR(L̂fdr(k+1)) − Q̂OR(L̂fdr(k)) = {kL̂fdr(k+1) − Σi=1..k L̂fdr(i)}/{k(k + 1)} > 0, we have that Q̂−OR(t) ≤ Q̂OR(t) ≤ Q̂+OR(t), with both Q̂−OR(t) and Q̂+OR(t) strictly increasing in t. Let λ̂−OR = sup{t ∈ (0, 1): Q̂−OR(t) ≤ α} and λ̂+OR = sup{t ∈ (0, 1): Q̂+OR(t) ≤ α}; then λ̂+OR ≤ λ̂OR ≤ λ̂−OR.

We claim that λ̂−OR →p λOR. If not, then there must exist ε0 and η0 such that for all M ∈ Z+, Pr(|λ̂−OR − λOR| > ε0) > 2η0 holds for some m ≥ M. Suppose that Pr(λ̂−OR > λOR + ε0) > 2η0. It is easy to see that Q̂−OR →p QOR, because Q̂−OR(t) − Q̂+OR(t) →a.s. 0, Q̂−OR(t) ≤ Q̂OR(t) ≤ Q̂+OR(t), and Q̂OR(t) →p QOR(t). Thus there exists M ∈ Z+ such that Pr(|Q̂−OR(λOR + ε0) − QOR(λOR + ε0)| < δ0) > 1 − η0 for all m ≥ M, where we let 2δ0 = QOR(λOR + ε0) − α. Now, for this choice of m, there exists an event K1m such that Pr(K1m) ≥ 1 − η0 and, for all outcomes ω ∈ K1m, |Q̂−OR(λOR + ε0) − QOR(λOR + ε0)| < δ0. At the same time, there exists an event K2m such that Pr(K2m) ≥ 2η0 and, for all outcomes ω ∈ K2m, λ̂−OR > λOR + ε0. Let Km = K1m ∩ K2m; then Pr(Km) = Pr(K1m) + Pr(K2m) − Pr(K1m ∪ K2m) ≥ Pr(K1m) + Pr(K2m) − 1 ≥ η0. That is, Km has positive measure. For all outcomes in Km, Q̂−OR(t) is continuous and strictly increasing, and we have α = Q̂−OR(λ̂−OR) > Q̂−OR(λOR + ε0) > QOR(λOR + ε0) − δ0 = α + δ0, so α > α + δ0. This is a contradiction. Therefore, we must have λ̂−OR →p λOR. Similarly, we can prove λ̂+OR →p λOR. Noting that Q̂−OR(t) − Q̂+OR(t) →a.s. 0 implies λ̂−OR − λ̂+OR →a.s. 0, the result follows by noting that λ̂+OR ≤ λ̂OR ≤ λ̂−OR.
Proof of Theorem 4. Implementing the adaptive procedure (10) is equivalent to choosing T̂OR(Zi) as the test statistic and λ̂OR as the threshold. The mFDR of the decision rule δ(T̂OR, λ̂OR) is mFDR = E(N10)/E(R) = [(1 − p)PH0(T̂OR < λ̂OR)]/[P(T̂OR < λ̂OR)]. Noting that T̂OR →p TOR by Lemma A2 and λ̂OR →p λOR by Lemma A5, it follows that (1 − p)PH0(T̂OR < λ̂OR) → (1 − p)PH0(TOR < λOR) = (1 − p)G0OR(λOR) and that P(T̂OR < λ̂OR) → P(TOR < λOR) = GOR(λOR). Because P(TOR < λOR) is bounded away from 0, we have mFDR → (1 − p)G0OR(λOR)/GOR(λOR) = QOR(λOR) = α.
Proof of Theorem 5
Similar to the proof of Theorem 4, the mFNR of the adaptive procedure (10) is mFNR = E(N01)/E(S) = [pPH1(T̂OR > λ̂OR)]/[P(T̂OR > λ̂OR)]. By Lemmas A2 and A5, we have that PH1(T̂OR > λ̂OR) → PH1(TOR > λOR) = 1 − G1OR(λOR) and that P(T̂OR > λ̂OR) → P(TOR > λOR) = 1 − GOR(λOR). Because 1 − GOR(λOR) is bounded away from 0, we obtain mFNR → p[1 − G1OR(λOR)]/[1 − GOR(λOR)], the mFNR of the oracle procedure.
[Received November 2006. Revised March 2007.]
REFERENCES
Benjamini, Y., and Hochberg, Y. (1995), "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing," Journal of the Royal Statistical Society, Ser. B, 57, 289–300.
Benjamini, Y., and Hochberg, Y. (2000), "On the Adaptive Control of the False Discovery Rate in Multiple Testing With Independent Statistics," Journal of Educational and Behavioral Statistics, 25, 60–83.
Copas, J. (1974), "On Symmetric Compound Decision Rules for Dichotomies," The Annals of Statistics, 2, 199–204.
Efron, B. (2004a), "Local False Discovery Rate," technical report, Stanford University, Dept. of Statistics, available at http://www-stat.stanford.edu/~brad/papers/False.pdf.
Efron, B. (2004b), "Large-Scale Simultaneous Hypothesis Testing: The Choice of a Null Hypothesis," Journal of the American Statistical Association, 99, 96–104.
Efron, B. (2006), "Correlation and Large-Scale Simultaneous Testing," technical report, Stanford University, Dept. of Statistics, available at http://www-stat.stanford.edu/~brad/papers/Correlation-2006.pdf.
Efron, B., Tibshirani, R., Storey, J., and Tusher, V. (2001), "Empirical Bayes Analysis of a Microarray Experiment," Journal of the American Statistical Association, 96, 1151–1160.
Foster, D., and Stine, R. (2006), "Alpha Investing: A New Multiple Hypothesis Testing Procedure That Controls mFDR," technical report, University of Pennsylvania, Dept. of Statistics.
Genovese, C., and Wasserman, L. (2002), "Operating Characteristic and Extensions of the False Discovery Rate Procedure," Journal of the Royal Statistical Society, Ser. B, 64, 499–517.
Genovese, C., and Wasserman, L. (2004), "A Stochastic Process Approach to False Discovery Control," The Annals of Statistics, 32, 1035–1061.
Genovese, C., Roeder, K., and Wasserman, L. (2005), "False Discovery Control With p-Value Weighting," Biometrika, 93, 509–524.
Jin, J., and Cai, T. (2007), "Estimating the Null and the Proportion of Non-Null Effects in Large-Scale Multiple Comparisons," Journal of the American Statistical Association, 102, 495–506.
Magder, L., and Zeger, S. (1996), "A Smooth Nonparametric Estimate of a Mixing Distribution Using Mixtures of Gaussians," Journal of the American Statistical Association, 91, 1141–1151.
Robbins, H. (1951), "Asymptotically Subminimax Solutions of Compound Statistical Decision Problems," in Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA: University of California Press, pp. 131–148.
Rubin, D., Dudoit, S., and van der Laan, M. (2006), "A Method to Increase the Power of Multiple Testing Procedures Through Sample Splitting," working paper, University of California, Berkeley, Dept. of Biostatistics, available at http://www.bepress.com/ucbbiostat/paper171.
Spjøtvoll, E. (1972), "On the Optimality of Some Multiple Comparison Procedures," The Annals of Mathematical Statistics, 43, 398–411.
Storey, J. (2002), "A Direct Approach to False Discovery Rates," Journal of the Royal Statistical Society, Ser. B, 64, 479–498.
Storey, J. (2003), "The Positive False Discovery Rate: A Bayesian Interpretation and the q-Value," The Annals of Statistics, 31, 2012–2035.
Storey, J. (2007), "The Optimal Discovery Procedure: A New Approach to Simultaneous Significance Testing," Journal of the Royal Statistical Society, Ser. B, 69, 347–368.
van der Laan, M., Dudoit, S., and Pollard, K. (2004), "Multiple Testing. Part III. Augmentation Procedures for Control of the Generalized Family-Wise Error Rate and Tail Probabilities for the Proportion of False Positives," technical report, University of California, Berkeley, Dept. of Biostatistics.
van der Vaart, A. (1998), Asymptotic Statistics, Cambridge, U.K.: Cambridge University Press.
van't Wout, A., Lehrman, G., Mikheeva, S., O'Keeffe, G., Katze, M., Bumgarner, R., Geiss, G., and Mullins, J. (2003), "Cellular Gene Expression Upon Human Immunodeficiency Virus Type 1 Infection of CD4+ T-Cell Lines," Journal of Virology, 77, 1392–1402.
Wasserman, L., and Roeder, K. (2006), "Weighted Hypothesis Testing," technical report, Carnegie Mellon University, Dept. of Statistics, available at http://www.arxiv.org/abs/math.ST/0604172.
910 Journal of the American Statistical Association September 2007
Figure 6. Analysis of the HIV data: number of rejections versus FDR level for the theoretical null (a) and the estimated null (b), comparing the BH step-up procedure based on the p values, the BH/GW adaptive procedure based on the p values, and the adaptive procedure based on the z values. [Plot panels not reproduced.]
class of simple rules to the class of compound rules in the ranking process.

The compound decision-theoretic approach to multiple testing also suggests several future research directions. First, the oracle statistic (Lfdr) may no longer be optimal for ranking the significance of tests when the observations are correlated. In addition, different cutoffs should be chosen under dependency, because multiple-testing procedures developed for independent tests can be too liberal or too conservative for dependent tests (Efron 2006). Second, the oracle procedure is optimal only in the class of symmetric decision rules. We expect to develop more general models in the compound decision-theoretic framework and to extend our attention to asymmetric rules to further increase the efficiency of multiple-testing procedures. Finally, Genovese, Roeder, and Wasserman (2005), Wasserman and Roeder (2006), Rubin, Dudoit, and van der Laan (2006), and Foster and Stine (2006) have discussed methods that use different p-value cutoffs for different tests when prior information is available or some correlation structure is known. It would be interesting to construct z value-based testing procedures that also exploit external information. We leave this topic for future research.
APPENDIX: PROOFS
Proof of Proposition 1
To prove part (a), note that the MLR assumption (4) implies that

\[
\frac{\int_0^c g_0(t)\,dt}{\int_0^c g_1(t)\,dt}
= \frac{\int_0^c g_0(t)\,dt}{\int_0^c \frac{g_1(t)}{g_0(t)}\,g_0(t)\,dt}
< \frac{\int_0^c g_0(t)\,dt}{\int_0^c \frac{g_1(c)}{g_0(c)}\,g_0(t)\,dt}
= \frac{g_0(c)}{g_1(c)},
\]

which is equivalent to g1(c)G0(c) < g0(c)G1(c). Thus the derivative of P(θi = 1 | T(Xi) ≤ c) = pG1(c)/G(c), namely (1 − p)p{g1(c)G0(c) − g0(c)G1(c)}/{G(c)}², is negative, so P(θi = 1 | T(Xi) ≤ c) is decreasing in c. Parts (b) and (c) can be proved similarly to part (a), by noting that mFDR = p0G0(c)/G(c) and mFNR = p1[1 − G1(c)]/[1 − G(c)]. Part (d) follows from parts (b) and (c). For part (e), the classification risk of δ is Rλ = E[Lλ(θ, δ)] = λ(1 − p)G0(c) + p{1 − G1(c)}. The optimal cutoff c* that minimizes Rλ satisfies g0(c*)/g1(c*) = p/[λ(1 − p)], or, equivalently, λ = pg1(c*)/[(1 − p)g0(c*)]. Therefore λ (c*) is monotone decreasing in c* (λ) if (4) holds. Noting that r = G(c*) and that G is increasing, we conclude that r is decreasing in λ.
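The monotonicity claims in parts (a) and (b) are easy to check numerically. The sketch below assumes an illustrative two-component normal mixture (null N(0,1), nonnull N(2,1), p = 0.2; these values are chosen for the example and are not from the paper) and verifies that, as the cutoff c on the oracle statistic grows, the posterior nonnull probability given rejection decreases while the mFDR increases:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
p, mu = 0.2, 2.0                 # nonnull proportion and alternative mean (assumed)

def t_or(z):
    # oracle statistic T_OR(z) = (1 - p) f0(z) / f(z), the local fdr
    f0, f1 = norm.pdf(z), norm.pdf(z, loc=mu)
    return (1 - p) * f0 / ((1 - p) * f0 + p * f1)

t0 = t_or(rng.standard_normal(200_000))      # T_OR under the null
t1 = t_or(rng.normal(mu, 1.0, 200_000))      # T_OR under the alternative

mfdrs, posts = [], []
for c in [0.1, 0.3, 0.5, 0.7, 0.9]:
    G0, G1 = (t0 <= c).mean(), (t1 <= c).mean()   # null / nonnull cdfs of T_OR
    G = (1 - p) * G0 + p * G1
    mfdrs.append((1 - p) * G0 / G)           # part (b): increasing in c
    posts.append(p * G1 / G)                 # part (a): decreasing in c

print([round(v, 3) for v in mfdrs])
print([round(v, 3) for v in posts])
```

With these well-separated cutoffs the Monte Carlo error is far smaller than the gaps between successive values, so the monotone ordering is clearly visible.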
Proof of Theorem 1
According to part (b) of Proposition 1, for any T ∈ 𝒯 and mFDR level α, there exists a unique threshold c such that the mFDR is controlled at level α by δ(T, c) = (I[T(x1) < c], …, I[T(xm) < c]). Let r be the expected number of rejections of δ(T, c). Then, again by part (b) of Proposition 1, there exists a unique threshold c* such that the expected number of rejections of δ(Λ, c*) is also r. Next, according to part (e) of Proposition 1, there exists a unique λ(α) with respect to the choice of c* such that the classification risk with λ(α) as the weight is minimized by δ(Λ, c*), and the expected number of subjects classified to the nonnull population is r.

Suppose that among the r rejected hypotheses there are vL from the null and kL from the nonnull when δ(L, cL) is applied, where (L, cL) = (T, c) or (Λ, c*). Then r = vT + kT = vΛ + kΛ. We now argue that vT ≥ vΛ and kT ≤ kΛ. If not, suppose that vT < vΛ and kT > kΛ. Note that the classification risk can be expressed as Rλ(α) = p + (1/m){λ(α)vL − kL}, so it is easy to see that δ(T, c) would yield a lower classification risk than δ(Λ, c*). This contradicts the fact that δ(Λ, c*) minimizes the classification risk for the choice of λ(α). Therefore we must have vT ≥ vΛ and kT ≤ kΛ.
Sun and Cai Rules for False Discovery Rate Control 911
Let mFDRL and mFNRL be the mFDR and mFNR of δ(L, cL), L ∈ {T, Λ}. Now apply both δ = δ(T, c) and δ* = δ(Λ, c*) in the multiple-testing problem. Then the mFDR using rule δ* is mFDRΛ = vΛ/r ≤ vT/r = mFDRT = α, and the mFNR using δ* is mFNRΛ = (m1 − kΛ)/(m − r) ≤ (m1 − kT)/(m − r) = mFNRT. Therefore δ* is the testing rule in the class 𝒟S that controls the mFDR at level α with the smallest mFNR.
Proof of Theorem 2
The joint distribution of θ = (θ1, …, θm) is π(θ) = ∏i (1 − p)^{1−θi} p^{θi}. The posterior distribution of θ given x can be calculated as P_{θ|X}(θ|x) = ∏i P_{θi|Xi}(θi|xi), where

\[
P_{\theta_i|X_i}(\theta_i|x_i)
= \frac{I(\theta_i = 0)(1-p)f_0(x_i) + I(\theta_i = 1)\,p f_1(x_i)}{(1-p)f_0(x_i) + p f_1(x_i)}. \tag{A1}
\]

Let Ii = I(θi = 1), and let 𝕀 = {(I1, …, Im) : Ii = 0 or 1} be the collection of all binary m-vectors; then |𝕀| = 2^m. Designate the elements of 𝕀 by superscripts, I^{(j)}, j = 1, …, 2^m. Note that

\[
\sum_{\theta_i=0}^{1} \big[I(\theta_i = 0)(1-p)f_0(x_i) + I(\theta_i = 1)p f_1(x_i)\big]\big[\lambda I(\theta_i = 0)\delta_i + I(\theta_i = 1)(1-\delta_i)\big]
= \lambda(1-p)f_0(x_i)\delta_i + p f_1(x_i)(1-\delta_i);
\]

then the posterior risk is

\[
\begin{aligned}
E_{\theta|X} L(\theta, \delta)
&= \sum_{j=1}^{2^m} L\big(I^{(j)}, \delta\big) \prod_i P_{\theta_i|X_i}\big(\theta_i = I_i^{(j)} \,\big|\, x_i\big) \\
&= \frac{1}{m} \sum_i \sum_{\theta_i=0}^{1} P_{\theta_i|X_i}(\theta_i|x_i)\big\{\lambda I(\theta_i = 0)\delta_i + I(\theta_i = 1)(1-\delta_i)\big\} \\
&= \frac{1}{m} \sum_i \frac{p f_1(x_i)}{f(x_i)} + \frac{1}{m} \sum_i \frac{\lambda(1-p)f_0(x_i) - p f_1(x_i)}{f(x_i)}\, \delta_i.
\end{aligned}
\]

The Bayes rule is the simple rule δ^π(x) = (δ^π_1(x1), …, δ^π_m(xm)), where δ^π_i = I[λ(1 − p)f0(xi) < pf1(xi)]. The expected misclassification risk of δ^π is

\[
\begin{aligned}
E\,E_{\theta|X} L(\theta, \delta^{\pi})
&= E\,\frac{p f_1(X)}{f(X)} + E\!\left[\frac{\lambda(1-p)f_0(X) - p f_1(X)}{f(X)}\, I\{\lambda(1-p)f_0(X) < p f_1(X)\}\right] \\
&= p + \int_{\{\lambda(1-p)f_0 < p f_1\}} \big[\lambda(1-p)f_0(x) - p f_1(x)\big]\,dx.
\end{aligned}
\]
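The Bayes rule δ^π_i = I[λ(1 − p)f0(xi) < pf1(xi)] and its closed-form risk are easy to check by simulation. The sketch below assumes an illustrative mixture (null N(0,1), nonnull N(2,1), p = 0.2) and loss weight λ = 1.5; the cutoff x_star is derived for this particular mixture and is not from the paper:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

rng = np.random.default_rng(1)
p, mu, lam = 0.2, 2.0, 1.5            # illustrative mixture and loss weight (assumed)

theta = rng.random(500_000) < p                   # latent indicators theta_i
x = rng.normal(np.where(theta, mu, 0.0), 1.0)     # observations

f0, f1 = norm.pdf(x), norm.pdf(x, loc=mu)
delta = lam * (1 - p) * f0 < p * f1               # Bayes rule delta_i^pi

# empirical weighted classification risk: lambda * (false positive) + (false negative)
emp_risk = np.mean(lam * (~theta & delta) + (theta & ~delta))

# closed form p + integral over {lam(1-p)f0 < p f1}; for this normal mixture the
# region is {x > x_star} with x_star = log(lam(1-p)/p)/mu + mu/2 (derived here)
x_star = np.log(lam * (1 - p) / p) / mu + mu / 2
theory = p + quad(lambda u: lam * (1 - p) * norm.pdf(u) - p * norm.pdf(u, loc=mu),
                  x_star, np.inf)[0]
print(round(emp_risk, 3), round(theory, 3))
```

The Monte Carlo risk and the closed-form expression agree to within simulation error, and both lie below p, the risk of never classifying to the nonnull population.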
Proof of Theorem 3
Observe that, for each t, if p(i) ≤ t < p(i+1), then the number of rejections is i. In addition, the ratio (1 − p̂)t/Gm(t) increases in t over any range on which the number of rejections is constant, which implies that if t = p(i) does not satisfy the constraint, then neither does any choice of t between p(i) and p(i+1). Thus it is sufficient to investigate thresholds equal to one of the p(i)'s. At t = p(i) the ratio (1 − p̂)t/Gm(t) becomes (m/i)(1 − p̂)p(i). Therefore the plug-in threshold is given by sup{p(i) : (m/i)(1 − p̂)p(i) ≤ α}, which is equivalent to choosing the largest i such that p(i) ≤ iα/[m(1 − p̂)].

Proof of Theorem 4
To prove this theorem we need to state the following lemmas
Lemma A1. Let p̂, f̂, and f̂0 be estimates such that p̂ →p p, E‖f̂ − f‖² → 0, and E‖f̂0 − f0‖² → 0. Then E‖T̂OR − TOR‖² → 0.
Proof. Note that f is continuous and positive on the real line, so for K1 = [−M, M] we have Pr(z ∈ K1^c) → 0 as M → ∞. Let inf_{z∈K1} f(z) = l0 and A^f_ε = {z : |f̂(z) − f(z)| ≥ l0/2}. Note that E‖f̂ − f‖² ≥ (l0/2)² Pr(A^f_ε), so Pr(A^f_ε) → 0. Thus f̂ and f are bounded below by a positive number for large m, except on an event of low probability. Similar arguments apply to the upper bounds of f̂ and f, as well as to the upper and lower bounds of f̂0 and f0. Therefore f0, f̂0, f, and f̂ are all bounded in an interval [la, lb], 0 < la < lb < ∞, for large m, except on an event, say Aε, of algebraically low probability; that is, 0 < la < inf_{z∈Aε^c} min{f0, f̂0, f, f̂} ≤ sup_{z∈Aε^c} max{f0, f̂0, f, f̂} < lb < ∞.

Noting that T̂OR − TOR = [f0 f(p − p̂) + (1 − p̂)f(f̂0 − f0) + (1 − p)f0(f − f̂)]/(f̂ f), we conclude that (T̂OR − TOR)² ≤ c1(p̂ − p)² + c2(f̂ − f)² + c3(f̂0 − f0)² on Aε^c. It is easy to see that ‖T̂OR − TOR‖² is bounded by some L. Then E‖T̂OR − TOR‖² ≤ L Pr(Aε) + c1E(p̂ − p)² + c2E‖f̂ − f‖² + c3E‖f̂0 − f0‖². Note that E(p̂ − p)² → 0 by lemma 2.2 of van der Vaart (1998), and E‖f̂ − f‖² → 0 and E‖f̂0 − f0‖² → 0 by assumption. Then, for a given ε > 0, there exists M ∈ Z+ such that Pr(Aε) < ε/(4L) and, at the same time, E(p̂ − p)² < ε/(4c1), E‖f̂ − f‖² < ε/(4c2), and E‖f̂0 − f0‖² < ε/(4c3), for all m ≥ M. Consequently, E‖T̂OR − TOR‖² < ε for m ≥ M, and the result follows.
Lemma A2. E‖T̂OR − TOR‖² → 0 implies that T̂OR(Zi) →p TOR(Zi).

Proof. Let Aε = {z : |T̂OR(z) − TOR(z)| ≥ ε}. Then ε² Pr(Aε) ≤ E‖T̂OR − TOR‖² → 0; consequently Pr(Aε) → 0. Therefore Pr(|T̂OR(Zi) − TOR(Zi)| ≥ ε) ≤ Pr(Aε) + Pr({|T̂OR(Zi) − TOR(Zi)| ≥ ε} ∩ Aε^c) = Pr(Aε) → 0, and the result follows.
Lemma A3. For α < t < 1, E[1{T̂OR(Zi) < t}T̂OR(Zi)] → E[1{TOR(Zi) < t}TOR(Zi)].

Proof. T̂OR(Zi) →p TOR(Zi) (Lemma A2) implies that T̂OR(Zi) →d TOR(Zi). Let h(x) = 1{x < t}x; then h(x) is bounded and continuous for x < t. By lemma 2.2 of van der Vaart (1998), E[h(T̂OR(Zi))] → E[h(TOR(Zi))], and the result follows.
Lemma A4. Construct the empirical distributions ĜOR(t) = (1/m)Σ_{i=1}^m 1{T̂OR(zi) ≤ t} and Ĝ⁰OR(t) = (1/m)Σ_{i=1}^m 1{T̂OR(zi) ≤ t}T̂OR(zi), and define Q̂OR(t) = Ĝ⁰OR(t)/ĜOR(t), the estimated mFDR. Then, for α < t < 1, Q̂OR(t) →p QOR(t).

Proof. Let ρm = cov[1{T̂OR(Zi) ≤ t}, 1{T̂OR(Zj) ≤ t}], where Zi and Zj are two independent random variables from the mixture distribution F. T̂OR(Zi) →p TOR(Zi) (Lemma A2) implies that E[1{T̂OR(Zi) ≤ t}1{T̂OR(Zj) ≤ t}] → E[1{TOR(Zi) ≤ t}]E[1{TOR(Zj) ≤ t}]; therefore ρm → cov[1{TOR(Zi) ≤ t}, 1{TOR(Zj) ≤ t}] = 0. Let σm² = var(1{T̂OR(Z1) ≤ t}); then σm² ≤ 1.

Let μm = Pr{T̂OR(Zi) < t} and Sm = Σ_{i=1}^m 1{T̂OR(zi) ≤ t}. Note that var(Sm/m) = (1/m)σm² + [(m − 1)/m]ρm ≤ 1/m + ρm → 0. According to the weak law of large numbers for triangular arrays, (1/m)Σ_{i=1}^m 1{T̂OR(zi) ≤ t} − μm →p 0. Also note that μm = Pr{T̂OR(Zi) < t} → Pr{TOR(Zi) < t} = GOR(t); we conclude that ĜOR(t) →p GOR(t). Next, let vm = E[1{T̂OR(Zi) ≤ t}T̂OR(Zi)]. Similarly, we can prove that Ĝ⁰OR(t) − vm = (1/m)Σ_{i=1}^m 1{T̂OR(zi) ≤ t}T̂OR(zi) − vm →p 0. By Lemma A3, vm → E[1{TOR(Zi) < t}TOR(Zi)] = (1 − p)∫ 1{TOR(z) < t}f0(z) dz = (1 − p)G⁰OR(t), so Ĝ⁰OR(t) →p (1 − p)G⁰OR(t). Finally, note that for t > α, GOR(t) is bounded away from 0, and we obtain Q̂OR(t) →p (1 − p)G⁰OR(t)/GOR(t) = QOR(t).
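Because Ĝ⁰OR(t)/ĜOR(t) is simply the average statistic value among the hypotheses with T̂OR(zi) ≤ t, the lemma says that this average consistently estimates the mFDR at threshold t. A quick Monte Carlo sketch with the oracle statistic (an illustrative Gaussian mixture, not the paper's data) shows it tracking the realized false discovery proportion:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
p, mu, m, t = 0.2, 2.0, 100_000, 0.5          # illustrative mixture and threshold

theta = rng.random(m) < p                      # true indicators (for checking only)
z = rng.normal(np.where(theta, mu, 0.0), 1.0)
f0, f1 = norm.pdf(z), norm.pdf(z, loc=mu)
tor = (1 - p) * f0 / ((1 - p) * f0 + p * f1)   # oracle statistic T_OR(z)

keep = tor <= t
q_hat = tor[keep].mean()       # Q_OR-hat(t): average Lfdr among "rejections"
fdp = (~theta[keep]).mean()    # realized false discovery proportion at t
print(round(q_hat, 3), round(fdp, 3))
```

With m = 100,000 the two quantities agree to a few decimal places, which is exactly the consistency statement of the lemma in empirical form.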
Lemma A5. Define the estimated plug-in threshold λ̂OR = sup{t ∈ (0, 1) : Q̂OR(t) ≤ α}. If Q̂OR(t) →p QOR(t), then λ̂OR →p λOR.
Proof. Note that Q̂OR(t) is not continuous, so the consistency of Q̂OR(t) does not necessarily imply the consistency of λ̂OR. We therefore first construct an envelope for Q̂OR(t) using two continuous random functions Q̂⁻OR(t) and Q̂⁺OR(t) such that, for L̂fdr(k) < t < L̂fdr(k+1),

\[
\hat{Q}^-_{\mathrm{OR}}(t) = \hat{Q}_{\mathrm{OR}}\big(\widehat{\mathrm{Lfdr}}_{(k-1)}\big)\,
\frac{\widehat{\mathrm{Lfdr}}_{(k+1)} - t}{\widehat{\mathrm{Lfdr}}_{(k+1)} - \widehat{\mathrm{Lfdr}}_{(k)}}
+ \hat{Q}_{\mathrm{OR}}\big(\widehat{\mathrm{Lfdr}}_{(k)}\big)\,
\frac{t - \widehat{\mathrm{Lfdr}}_{(k)}}{\widehat{\mathrm{Lfdr}}_{(k+1)} - \widehat{\mathrm{Lfdr}}_{(k)}}
\]

and

\[
\hat{Q}^+_{\mathrm{OR}}(t) = \hat{Q}_{\mathrm{OR}}\big(\widehat{\mathrm{Lfdr}}_{(k)}\big)\,
\frac{\widehat{\mathrm{Lfdr}}_{(k+1)} - t}{\widehat{\mathrm{Lfdr}}_{(k+1)} - \widehat{\mathrm{Lfdr}}_{(k)}}
+ \hat{Q}_{\mathrm{OR}}\big(\widehat{\mathrm{Lfdr}}_{(k+1)}\big)\,
\frac{t - \widehat{\mathrm{Lfdr}}_{(k)}}{\widehat{\mathrm{Lfdr}}_{(k+1)} - \widehat{\mathrm{Lfdr}}_{(k)}}.
\]

Noting that Q̂OR(L̂fdr(k+1)) − Q̂OR(L̂fdr(k)) = [kL̂fdr(k+1) − Σ_{i=1}^k L̂fdr(i)]/[k(k + 1)] > 0, we have Q̂⁻OR(t) ≤ Q̂OR(t) ≤ Q̂⁺OR(t), with both Q̂⁻OR(t) and Q̂⁺OR(t) strictly increasing in t. Let λ̂⁻OR = sup{t ∈ (0, 1) : Q̂⁻OR(t) ≤ α} and λ̂⁺OR = sup{t ∈ (0, 1) : Q̂⁺OR(t) ≤ α}; then λ̂⁺OR ≤ λ̂OR ≤ λ̂⁻OR.

We claim that λ̂⁻OR →p λOR. If not, then there exist ε0 and η0 such that for every M ∈ Z+, Pr(|λ̂⁻OR − λOR| > ε0) > 2η0 holds for some m ≥ M. Suppose that Pr(λ̂⁻OR > λOR + ε0) > 2η0. It is easy to see that Q̂⁻OR →p QOR, because Q̂⁻OR(t) − Q̂⁺OR(t) →a.s. 0, Q̂⁻OR(t) ≤ Q̂OR(t) ≤ Q̂⁺OR(t), and Q̂OR(t) →p QOR(t). Thus there exists M ∈ Z+ such that Pr(|Q̂⁻OR(λOR + ε0) − QOR(λOR + ε0)| < δ0) > 1 − η0 for all m ≥ M, where we let 2δ0 = QOR(λOR + ε0) − α. Now, for this choice of m, there exists an event K1m such that Pr(K1m) ≥ 1 − η0 and, for all outcomes ω ∈ K1m, |Q̂⁻OR(λOR + ε0) − QOR(λOR + ε0)| < δ0. At the same time, there exists an event K2m such that Pr(K2m) ≥ 2η0 and, for all outcomes ω ∈ K2m, λ̂⁻OR > λOR + ε0. Let Km = K1m ∩ K2m; then Pr(Km) = Pr(K1m) + Pr(K2m) − Pr(K1m ∪ K2m) ≥ Pr(K1m) + Pr(K2m) − 1 ≥ η0. That is, Km has positive measure. For all outcomes in Km, Q̂⁻OR(t) is continuous and strictly increasing, and we have α = Q̂⁻OR(λ̂⁻OR) > Q̂⁻OR(λOR + ε0) > QOR(λOR + ε0) − δ0 = α + δ0, a contradiction. Therefore we must have λ̂⁻OR →p λOR. Similarly, we can prove λ̂⁺OR →p λOR. Noting that Q̂⁻OR(t) − Q̂⁺OR(t) →a.s. 0 implies λ̂⁻OR − λ̂⁺OR →a.s. 0, the result follows from λ̂⁺OR ≤ λ̂OR ≤ λ̂⁻OR.
Proof of Theorem 4. Implementing the adaptive procedure (10) is equivalent to choosing T̂OR(Zi) as the test statistic and λ̂OR as the threshold. The mFDR of the decision rule δ(T̂OR, λ̂OR) is mFDR = E(N10)/E(R) = [(1 − p)P_{H0}(T̂OR < λ̂OR)]/[P(T̂OR < λ̂OR)]. Noting that T̂OR →p TOR by Lemma A2 and λ̂OR →p λOR by Lemma A5, it follows that (1 − p)P_{H0}(T̂OR < λ̂OR) → (1 − p)P_{H0}(TOR < λOR) = (1 − p)G⁰OR(λOR) and P(T̂OR < λ̂OR) → P(TOR < λOR) = GOR(λOR). Noting that P(TOR < λOR) is bounded away from 0, we have mFDR → (1 − p)G⁰OR(λOR)/GOR(λOR) = QOR(λOR) = α.
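Operationally, thresholding the estimated Lfdr statistic at λ̂OR is the same as a step-up rule on the ordered statistic values: reject the k smallest, where k is the largest cut whose running mean stays at or below α (the running mean is Q̂OR evaluated at the jump points, per Lemma A4). A minimal sketch of this rule; the function and variable names are ours, not the paper's:

```python
import numpy as np

def adaptive_lfdr_procedure(lfdr, alpha):
    """Step-up rule: reject the k hypotheses with smallest Lfdr values,
    where k is the largest i whose running mean of ordered values is <= alpha."""
    order = np.argsort(lfdr)
    avg = np.cumsum(lfdr[order]) / np.arange(1, len(lfdr) + 1)  # Q-hat at each cut
    ok = np.where(avg <= alpha)[0]
    k = int(ok.max()) + 1 if ok.size else 0
    reject = np.zeros(len(lfdr), dtype=bool)
    reject[order[:k]] = True
    return reject

lfdr = np.array([0.01, 0.40, 0.03, 0.90, 0.10, 0.75])
print(adaptive_lfdr_procedure(lfdr, 0.10).tolist())  # [True, False, True, False, True, False]
```

Here the ordered values 0.01, 0.03, 0.10 have running means 0.01, 0.02, 0.047 (all at or below α = 0.10), while adding the fourth value pushes the mean above α, so exactly three hypotheses are rejected.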
Proof of Theorem 5
Similar to the proof of Theorem 4, the mFNR of the adaptive procedure (10) is mFNR = E(N01)/E(S) = p P_{H1}(T̂OR > λ̂OR)/P(T̂OR > λ̂OR). By Lemmas A2 and A5, we have P_{H1}(T̂OR > λ̂OR) → P_{H1}(TOR > λOR) = 1 − G¹OR(λOR) and P(T̂OR > λ̂OR) → P(TOR > λOR) = 1 − GOR(λOR). Noting that 1 − GOR(λOR) is bounded away from 0, we obtain mFNR → p[1 − G¹OR(λOR)]/[1 − GOR(λOR)], which is the mFNR level attained by the oracle procedure.
[Received November 2006. Revised March 2007.]
REFERENCES
Benjamini, Y., and Hochberg, Y. (1995), "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing," Journal of the Royal Statistical Society, Ser. B, 57, 289-300.
——— (2000), "On the Adaptive Control of the False Discovery Rate in Multiple Testing With Independent Statistics," Journal of Educational and Behavioral Statistics, 25, 60-83.
Copas, J. (1974), "On Symmetric Compound Decision Rules for Dichotomies," The Annals of Statistics, 2, 199-204.
Efron, B. (2004a), "Local False Discovery Rate," technical report, Stanford University, Dept. of Statistics, available at http://www-stat.stanford.edu/~brad/papers/False.pdf.
——— (2004b), "Large-Scale Simultaneous Hypothesis Testing: The Choice of a Null Hypothesis," Journal of the American Statistical Association, 99, 96-104.
——— (2006), "Correlation and Large-Scale Simultaneous Testing," technical report, Stanford University, Dept. of Statistics, available at http://www-stat.stanford.edu/~brad/papers/Correlation-2006.pdf.
Efron, B., Tibshirani, R., Storey, J., and Tusher, V. (2001), "Empirical Bayes Analysis of a Microarray Experiment," Journal of the American Statistical Association, 96, 1151-1160.
Foster, D., and Stine, R. (2006), "Alpha Investing: A New Multiple Hypothesis Testing Procedure That Controls mFDR," technical report, University of Pennsylvania, Dept. of Statistics.
Genovese, C., and Wasserman, L. (2002), "Operating Characteristics and Extensions of the False Discovery Rate Procedure," Journal of the Royal Statistical Society, Ser. B, 64, 499-517.
——— (2004), "A Stochastic Process Approach to False Discovery Control," The Annals of Statistics, 32, 1035-1061.
Genovese, C., Roeder, K., and Wasserman, L. (2005), "False Discovery Control With p-Value Weighting," Biometrika, 93, 509-524.
Jin, J., and Cai, T. (2007), "Estimating the Null and the Proportion of Nonnull Effects in Large-Scale Multiple Comparisons," Journal of the American Statistical Association, 102, 495-506.
Magder, L., and Zeger, S. (1996), "A Smooth Nonparametric Estimate of a Mixing Distribution Using Mixtures of Gaussians," Journal of the American Statistical Association, 91, 1141-1151.
Robbins, H. (1951), "Asymptotically Subminimax Solutions of Compound Statistical Decision Problems," in Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA: University of California Press, pp. 131-148.
Rubin, D., Dudoit, S., and van der Laan, M. (2006), "A Method to Increase the Power of Multiple Testing Procedures Through Sample Splitting," working paper, University of California, Berkeley, Dept. of Biostatistics, available at http://www.bepress.com/ucbbiostat/paper171.
Spjøtvoll, E. (1972), "On the Optimality of Some Multiple Comparison Procedures," The Annals of Mathematical Statistics, 43, 398-411.
Storey, J. (2002), "A Direct Approach to False Discovery Rates," Journal of the Royal Statistical Society, Ser. B, 64, 479-498.
——— (2003), "The Positive False Discovery Rate: A Bayesian Interpretation and the q-Value," The Annals of Statistics, 31, 2013-2035.
——— (2007), "The Optimal Discovery Procedure: A New Approach to Simultaneous Significance Testing," Journal of the Royal Statistical Society, Ser. B, 69, 347-368.
van der Laan, M., Dudoit, S., and Pollard, K. (2004), "Multiple Testing. Part III. Augmentation Procedures for Control of the Generalized Family-Wise Error Rate and Tail Probabilities for the Proportion of False Positives," technical report, University of California, Berkeley, Dept. of Biostatistics.
van der Vaart, A. (1998), Asymptotic Statistics, Cambridge, U.K.: Cambridge University Press.
van't Wout, A., Lehrman, G., Mikheeva, S., O'Keeffe, G., Katze, M., Bumgarner, R., Geiss, G., and Mullins, J. (2003), "Cellular Gene Expression Upon Human Immunodeficiency Virus Type 1 Infection of CD4+ T-Cell Lines," Journal of Virology, 77, 1392-1402.
Wasserman, L., and Roeder, K. (2006), "Weighted Hypothesis Testing," technical report, Carnegie Mellon University, Dept. of Statistics, available at http://www.arxiv.org/abs/math.ST/0604172.
Sun and Cai Rules for False Discovery Rate Control 911
Let mFDRL and mFNRL be the mFDR and mFNR of δ(L cL)where L = T Now we apply both δ = δ(T c) and δlowast = δ( clowast)
in the multiple-testing problem Then the mFDR using rule δlowast ismFDR = vr le vT r = mFDRT = α and the mFNR using δlowast ismFNR = [(m1 minus k)(m minus r)] le [(m1 minus kT )(m minus r)] = mFNRT Therefore δlowast is the testing rule in class DS that controls the mFDR atlevel α with the smallest mFNR
Proof of Theorem 2
The joint distribution of θ = (θ1 θm) is π(θ) = prodi (1 minus
p)1minusθi pθi The posterior distribution of θ given x can be calculated asPθ |X(θ |x) = prod
i Pθi |Xi(θi |xi) where
Pθi |Xi(θi |xi) = I (θi = 0)(1 minus p)f0(xi) + I (θi = 1)pf1(xi)
(1 minus p)f0(xi) + pf1(xi) (A1)
Let Ii = I (θi = 1) and I = (I1 Im) Ii = 0 or 1 be collectionsof all m-binary vectors then |I| = 2m The elements in I are desig-nated by superscripts I(j) j = 1 2m Note that
sum1θi=0[I (θi =
0)(1 minus p)f0(xi) + I (θi = 1)pf1(xi)][λI (θi = 0)δi + I (θi = 1)(1 minusδi)] = λ(1 minus p)f0(xi)δi + pf1(xi)(1 minus δi) then the posterior risk is
Eθ |XL(θ δ)
=2msumj=1
L(θ = I(j) δ
)prodi
Pθi=I
(j)i |Xi
(θi = I
(j)i
|xi
)
= 1
m
sumi
1sumθi=0
Pθi |Xi(θi |xi)λI (θi = 0)δi + I (θi = 1)(1 minus δi)
= 1
m
sumi
pf1(xi)
f (xi)+ 1
m
sumi
λ(1 minus p)f0(xi) minus pf1(xi)
f (xi)δi
The Bayes rule is the simple rule δπ (x) = [δπ1 (xi) δ
πm(xm)]
where δπi
= I [λ(1 minus p)f0(xi) lt pf1(xi)] The expected misclassifi-cation risk of δπ is
EEθ |XL(θ δπ ) = Epf1(X)
f (X)minus E
[λ(1 minus p)f0(X) minus pf1(X)
f (X)
times I λ(1 minus p)f0(X) lt pf1(X)]
= p +intλ(1minusp)f0ltpf1
[λ(1 minus p)f0(x) minus pf1(x)]dx
Proof of Theorem 3
Observe that for each t if p(i) le t lt p(i+1) then the number of re-jections is i In addition the ratio (1 minus p)tGm(t) increases in t overthe range on which the number of rejections is a constant which im-plies that if t = p(i) does not satisfy the constraint then neither doesany choice of t between p(i) and p(i+1) Thus it is sufficient to in-vestigate the thresholds that are equal to one of the p(i)rsquos Then the ra-tio (1 minus p)tGm(t) becomes (mi)(1minus p)p(i) Therefore the plug-inthreshold is given by supp(i) (mi)(1minus p)p(i) le α which is equiv-alent to choosing the largest i such that p(i) le iα[m(1 minus p)]Proof of Theorem 4
To prove this theorem we need to state the following lemmas
Lemma A1 Let p f and f0 be estimates such that pprarr Ef minus
f 2 rarr 0 Ef0 minus f02 rarr 0 and then ETOR minus TOR2 rarr 0
Proof Note that f is continuous and positive on the real line thenthere exists K1 = [minusMM] such that Pr(z isin Kc
1) rarr 0 as M rarr infin
Let infzisinK1 f (z) = l0 and Afε = z |f (z) minus f (z)| ge l02 Note that
Ef minus f 2 ge (l02)2P(Afε ) then Pr(Af
ε ) rarr 0 Thus f and f are
bounded below by a positive number for large m except for an eventthat has a low probability Similar arguments can be applied to the up-per bound of f and f as well as to the upper and lower bounds for f0and f0 Therefore we conclude that f0 f0 f and f are all boundedin the interval [la lb]0 lt la lt lb lt infin for large m except for an eventsay Aε that has algebraically low probability Therefore 0 lt la lt
infzisinAεminf0 f0 f f lt supzisinAc
εmaxf0 f0 f f lt lb lt infin
Noting that TOR minus TOR = [f0f (p minus p) + (1 minus p)f (f0 minus f0) +(1 minus p)f0(f minus f )](f f ) we conclude that (TOR minus TOR)2 le c1(p minusp)2 + c2(f0 minus f0)2 + c3(f minus f )2 in Ac
ε It is easy to see thatTOR minusTOR2 is bounded by L Then ETOR minusTOR2 le LPr(Aε)+c1E(p0 minus p)2 + c2Ef minus f 2 + c3Ef0 minus f02 Note that E(p0 minusp)2 rarr 0 by lemma 22 of van der Vaart (1998) and Ef minus f 2 rarr 0Ef0 minusf02 rarr 0 by assumption then we have that for a given ε gt 0there exists M isin Z
+ such that we can find Aε Pr(Aε) lt ε(4L)and at the same time E(p0 minus p)2 lt ε(4c1) Ef minus f 2 lt ε(4c2)and Ef0 minus f02 lt ε(4c3) for all m ge M Consequently ETOR minusTOR2 lt ε for m ge M and the result follows
Lemma A2 ETOR minus TOR2 rarr 0 implies that TOR(Zi)prarr
TOR(Zi)
Proof Let Aε = z |TOR(z) minus TOR(z)| ge ε Then ε2 Pr(Aε) leETOR minus TOR2 rarr 0 Consequently Pr(Aε) rarr 0 ThereforePr(|TOR(Zi)minusTOR(Zi)| ge ε) le Pr(Aε)+Pr(|TOR(Zi)minusTOR(Zi)| geε cap Ac
ε) = Pr(Aε) rarr 0 and the result follows
Lemma A3 For α lt t lt 1 E[1TOR(Zi) lt tTOR(Zi)] rarrE[1TOR(Zi) lt tTOR(Zi)]
Proof TOR(Zi)prarr TOR(Zi) (Lemma A2) implies that TOR(Zi)
drarrTOR(Zi) Let h(x) = 1x lt tx then h(x) is bounded and continuousfor x lt t By lemma 22 of van der Vaart (1998) E[h(TOR(Zi))] rarrE[h(TOR(Zi))] and the result follows
Lemma A4 Construct the empirical distributions GOR(t) = 1m timessumm
i=1 1TOR(zi) le t and G0PI
(t) = 1m
summi=1 1TOR(zi) le t times
TOR(zi) Define QOR(t) = G0OR(t)GOR(t) the estimated mFDR
Then for α lt t lt 1 QOR(t)prarr QOR(t)
Proof Let ρm = cov[TOR(Zi) TOR(Zj )] where Zi and Zj aretwo independent random variables from the mixture distribution
F TOR(Zi)prarr TOR(Zi) (Lemma A2) implies that E[TOR(Zi) times
TOR(Zj )] rarr E[TOR(Zi)TOR(Zj )] Therefore ρm = cov[TOR(Zi)
TOR(Zj )] rarr cov[TOR(Zi) TOR(Zj )] = 0 Let σ 2m = var(TOR(Z1))
then σ 2m le E[TOR(Z1)]2 le 1
Let μm = PrTOR(Zi) lt t and Sm = summi=1 1TOR(zi) le t Note
that var(Sm)m2 = (1m)σ 2m + [(m minus 1)m]ρm le 1m + ρm rarr 0
According to the weak law of large numbers for triangular arrays
we have that (1m)summ
i=1 1TOR(zi) le t minus μmprarr 0 Also note
that μm = PrTOR(Zi) lt t rarr PrTOR(Zi) lt t = GOR(t) we con-
clude that GOR(t)prarr GOR(t) Next we let vm = E[1TOR(Zi) le
tTOR(Zi)] Similarly we can prove that G0OR(t) = 1
m timessummi=1 1TOR(zi) le tTOR(zi) minus vm
prarr 0 Note that by Lemma A2
we have E[1TOR(Zi) lt tTOR(Zi)] rarr E[1TOR(Zi) lt tTOR(Zi)]thus vm rarr E(1TOR(Zi) lt tTOR(Zi)) = (1 minus p)
int1TOR(Zi) lt
tf0dt = (1 minus p)G0OR(t) and G0
OR(t)prarr (1 minus p)G0
OR(t) Finallynote that for t gt α GOR(t) is bounded away from 0 and we obtain
QOR(t)prarr (1 minus p)G0
OR(t)GOR(t) = QOR(t)
Lemma A5 Define the estimate of the plug-in threshold λOR =supt isin (01) QOR(t) le α If QOR(t)
prarr QOR(t) then λORprarr λOR
912 Journal of the American Statistical Association September 2007
Proof Note that QOR(t) is not continuous and the consistency ofQOR(t) does not necessarily imply the consistency of λOR Thus wefirst construct an envelope for QOR(t) using two continuous randomfunctions Qminus
OR(t) and Q+OR(t) such that for Lfdr(k) lt t lt Lfdr(k+1)
QminusOR(t) = QOR
(Lfdr(kminus1)
) t minus Lfdr(k)
Lfdr(k+1) minus Lfdr(k)
+ QOR(Lfdr(k)
) Lfdr(k+1) minus t
Lfdr(k+1) minus Lfdr(k)
and
Q+OR(t) = QOR
(Lfdr(k)
) t minus Lfdr(k)
Lfdr(k+1) minus Lfdr(k)
+ QOR(Lfdr(k+1)
) Lfdr(k+1) minus t
Lfdr(k+1) minus Lfdr(k)
Noting that QOR(Lfdr(k+1)) minus QOR(Lfdr(k)) = [kLfdr(k+1) minussumki=1 Lfdr(i)][k(k + 1)] gt 0 we have that Qminus
OR(t) le QOR(t) leQ+
OR(t) with both QminusOR(t) and Q+
OR(t) strictly increasing in t
Let λminusOR = supt isin (01) Qminus
OR(t) le α and λ+OR = supt isin (01)
Q+OR(t) le α then λ+
OR le λOR le λminusOR
We claim that λminusOR
prarr λOR If not then there must exist ε0 and η0
such that for all m isin Z+ Pr(|λminusOR minus λOR| gt ε0) gt 2η0 holds for some
m ge M Suppose that Pr(λminusOR gt λOR + ε0) gt 2η0 It is easy to see that
QminusOR
prarr QOR because QminusOR(t) minus Q+
OR(t)asrarr 0 Qminus
OR(t) le QOR(t) leQ+
OR(t) and QOR(t)prarr QOR(t) Thus there exists M isin Z
+ such
that Pr(|QminusOR(λOR + ε0) minus QOR(λOR + ε0)| lt δ0) gt 1 minus η0 for all
m ge M where we let 2δ0 = QOR(λOR + ε0) minus α Now for thechoice of m there exists event K1m such that Pr(K1m) ge 1 minus η0 andfor all outcomes ω isin K1m |Qminus
OR(λOR + ε0) minus QOR(λOR + ε0)| lt
δ0 At the same time there exists event K2m such that Pr(K2m) ge2η0 and for all outcomes ω isin K2m λminus
OR gt λOR + ε0 Let Km =K1m capK2m then Pr(Km) = Pr(K1m)+Pr(K2m)minusPr(K1m cupK2m) gePr(K1m)+Pr(K2m)minus1 ge η0 That is Km has positive measure Thenfor all outcomes in Km Qminus
OR(t) is continuous and strictly increas-
ing and we have α = QminusOR(λminus
OR) gt QminusOR(λOR + ε0) gt QOR(λOR +
ε0) minus δ0 gt α + δ0 This is a contradiction Therefore we must
have λminusOR
prarr λOR Similarly we can prove λ+OR
prarr λOR Note that
QminusOR(t) minus Q+
OR(t)asrarr 0 implies that λminus
OR minus λ+OR
asrarr 0 the result fol-
lows by noting that λ+OR le λOR le λminus
OR
Proof of Theorem 4 Implementing the adaptive procedure (10)is equivalent to choosing TOR(Zi) as the test statistic and λOR asthe threshold The mFDR of decision rule δ(TOR λOR) is mFDR =E(N10)E(R) = [(1 minus p)PH0(TOR lt λOR)][P(TOR lt λOR)] Not-
ing that TORprarr TOR by Lemma A2 and λOR
prarr λOR by Lemma A5it follows that (1 minus p)PH0(TOR lt λOR) rarr (1 minus p)PH0(TOR lt
λOR) = (1 minus p)G0OR(λOR) and P(TOR lt λOR) rarr P(TOR lt λOR) =
GOR(λOR) Noting that P(TOR lt λOR) is bounded away from 0 wehave that mFDR rarr (1 minus p)G0
OR(λOR)GOR(λOR) = QOR(λOR) =α
Proof of Theorem 5
Similar to the proof of Theorem 4 the mFNR of the adaptive pro-cedure (10) is mFNR = E(N01)E(S) = PH1 (TOR gt λOR)P (TOR gt
λOR) By Lemmas A2 and A5 we have that PH1 (TOR gt λOR) rarrPH1(TOR gt λOR) = G1
OR(λOR) and P(TOR gt λOR) rarr P(TOR gt
λOR) = GOR(λOR) Noting that GOR(λOR) is bounded away from 0we obtain mFNR rarr pG1
OR(λOR)GOR(λOR) = QOR(λOR)
[Received November 2006 Revised March 2007]
REFERENCES
Benjamini Y and Hochberg Y (1995) ldquoControlling the False Discovery RateA Practical and Powerful Approach to Multiple Testingrdquo Journal of the RoyalStatistical Society Ser B 57 289ndash300
(2000) ldquoOn the Adaptive Control of the False Discovery Rate in Mul-tiple Testing With Independent Statisticsrdquo Journal of Educational and Be-havioral Statistics 25 60ndash83
Copas J (1974) ldquoOn Symmetric Compound Decision Rules for DichotomiesrdquoThe Annals of Statistics 2 199ndash204
Efron B (2004a) ldquoLocal False Discovery Raterdquo technical report Stan-ford University Dept of Statistics available at httpwww-statstanfordedu~bradpapersFalsepdf
(2004b) ldquoLarge-Scale Simultaneous Hypothesis Testing the Choiceof a Null Hypothesisrdquo Journal of the American Statistical Association 9996ndash104
(2006) ldquoCorrelation and Large-Scale Simultaneous Testingrdquo technicalreport Stanford University Dept of Statistics available at httpwww-statstanfordedu~bradpapersCorrelation-2006pdf
Efron B Tibshirani R Storey J and Tusher V (2001) ldquoEmpirical BayesAnalysis of a Microarray Experimentrdquo Journal of the American StatisticalAssociation 96 1151ndash1160
Foster D and Stine R (2006) ldquoAlpha Investing A New Multiple Hypothe-sis Testing Procedure That Controls mFDRrdquo technical report University ofPennsylvania Dept of Statistics
Genovese C and Wasserman L (2002) ldquoOperating Characteristic and Exten-sions of the False Discovery Rate Procedurerdquo Journal of the Royal StatisticalSociety Ser B 64 499ndash517
(2004) ldquoA Stochastic Process Approach to False Discovery ControlrdquoThe Annals of Statistics 32 1035ndash1061
Genovese C Roeder K and Wasserman L (2005) ldquoFalse Discovery ControlWith p-Value Weightingrdquo Biometrika 93 509ndash524
Jin J and Cai T (2007) ldquoEstimating the Null and the Proportion of Non-Null Effects in Large-Scale Multiple Comparisonsrdquo Journal of the AmericanStatistical Association 102 495ndash506
Magder L and Zeger S (1996) ldquoA Smooth Nonparametric Estimate of aMixing Distribution Using Mixtures of Gaussiansrdquo Journal of the AmericanStatistical Association 91 1141ndash1151
Robbins H (1951) ldquoAsymptotically Subminimax Solutions of Compound Sta-tistical Decision Problemsrdquo in Proceedings of Second Berkeley Symposiumon Mathematical Statistics and Probability Berkeley CA University of Cal-ifornia Press pp 131ndash148
Rubin D Dudoit S and van der Laan M (2006) ldquoA Method to Increase thePower of Multiple Testing Procedures Through Sample Splittingrdquo workingpaper University of California Berkeley Dept of Biostatistics available athttpwwwbepresscomucbbiostatpaper171
Spjoslashtvoll E (1972) ldquoOn the Optimality of Some Multiple Comparison Proce-duresrdquo The Annals of Mathematical Statistics 43 398ndash411
Storey J (2002) ldquoA Direct Approach to False Discovery Ratesrdquo Journal of theRoyal Statistical Society Ser B 64 479ndash498
(2003) ldquoThe Positive False Discovery Rate A Bayesian Interpretationand the Q-Valuerdquo The Annals of Statistics 31 2012ndash2035
(2007) ldquoThe Optimal Discovery Procedure A New Approach to Si-multaneous Significance Testingrdquo Journal of the Royal Statistical SocietySer B 69 347ndash368
van der Laan M Dudoit S and Pollard K (2004) ldquoMultiple Testing Part IIIAugmentation Procedures for Control of the Generalized Family-Wise ErrorRate and Tail Probabilities for the Proportion of False Positivesrdquo Universityof California Berkeley Dept of Biostatistics
van der Vaart A (1998) Asymptotic Statistics Cambridge UK CambridgeUniversity Press
vanrsquot Wout A Lehrman G Mikheeva S OrsquoKeeffe G Katze MBumgarner R Geiss G and Mullins J (2003) ldquoCellular Gene Expres-sion Upon Human Immunodeficiency Virus Type 1 Infection of CD4+-T-CellLinesrdquo Journal of Virology 77 1392ndash1402
Wasserman L and Roeder K (2006) ldquoWeighted Hypothesis Testingrdquo tech-nical report Carnegie Mellon University Dept of Statistics available athttpwwwarxivorgabsmathST0604172
912 Journal of the American Statistical Association September 2007
Proof. Note that $\widehat{Q}_{\mathrm{OR}}(t)$ is not continuous, and the consistency of $\widehat{Q}_{\mathrm{OR}}(t)$ does not necessarily imply the consistency of $\widehat{\lambda}_{\mathrm{OR}}$. Thus we first construct an envelope for $\widehat{Q}_{\mathrm{OR}}(t)$ using two continuous random functions $\widehat{Q}^{-}_{\mathrm{OR}}(t)$ and $\widehat{Q}^{+}_{\mathrm{OR}}(t)$ such that, for $\widehat{\mathrm{Lfdr}}_{(k)} < t < \widehat{\mathrm{Lfdr}}_{(k+1)}$,
\[
\widehat{Q}^{-}_{\mathrm{OR}}(t) = \widehat{Q}_{\mathrm{OR}}\bigl(\widehat{\mathrm{Lfdr}}_{(k-1)}\bigr)\,
\frac{\widehat{\mathrm{Lfdr}}_{(k+1)} - t}{\widehat{\mathrm{Lfdr}}_{(k+1)} - \widehat{\mathrm{Lfdr}}_{(k)}}
+ \widehat{Q}_{\mathrm{OR}}\bigl(\widehat{\mathrm{Lfdr}}_{(k)}\bigr)\,
\frac{t - \widehat{\mathrm{Lfdr}}_{(k)}}{\widehat{\mathrm{Lfdr}}_{(k+1)} - \widehat{\mathrm{Lfdr}}_{(k)}}
\]
and
\[
\widehat{Q}^{+}_{\mathrm{OR}}(t) = \widehat{Q}_{\mathrm{OR}}\bigl(\widehat{\mathrm{Lfdr}}_{(k)}\bigr)\,
\frac{\widehat{\mathrm{Lfdr}}_{(k+1)} - t}{\widehat{\mathrm{Lfdr}}_{(k+1)} - \widehat{\mathrm{Lfdr}}_{(k)}}
+ \widehat{Q}_{\mathrm{OR}}\bigl(\widehat{\mathrm{Lfdr}}_{(k+1)}\bigr)\,
\frac{t - \widehat{\mathrm{Lfdr}}_{(k)}}{\widehat{\mathrm{Lfdr}}_{(k+1)} - \widehat{\mathrm{Lfdr}}_{(k)}}.
\]
Noting that $\widehat{Q}_{\mathrm{OR}}(\widehat{\mathrm{Lfdr}}_{(k+1)}) - \widehat{Q}_{\mathrm{OR}}(\widehat{\mathrm{Lfdr}}_{(k)}) = \bigl[k\,\widehat{\mathrm{Lfdr}}_{(k+1)} - \sum_{i=1}^{k}\widehat{\mathrm{Lfdr}}_{(i)}\bigr]/[k(k+1)] > 0$, we have that $\widehat{Q}^{-}_{\mathrm{OR}}(t) \le \widehat{Q}_{\mathrm{OR}}(t) \le \widehat{Q}^{+}_{\mathrm{OR}}(t)$, with both $\widehat{Q}^{-}_{\mathrm{OR}}(t)$ and $\widehat{Q}^{+}_{\mathrm{OR}}(t)$ strictly increasing in $t$.

Let $\widehat{\lambda}^{-}_{\mathrm{OR}} = \sup\{t \in (0,1) : \widehat{Q}^{-}_{\mathrm{OR}}(t) \le \alpha\}$ and $\widehat{\lambda}^{+}_{\mathrm{OR}} = \sup\{t \in (0,1) : \widehat{Q}^{+}_{\mathrm{OR}}(t) \le \alpha\}$; then $\widehat{\lambda}^{+}_{\mathrm{OR}} \le \widehat{\lambda}_{\mathrm{OR}} \le \widehat{\lambda}^{-}_{\mathrm{OR}}$.

We claim that $\widehat{\lambda}^{-}_{\mathrm{OR}} \xrightarrow{p} \lambda_{\mathrm{OR}}$. If not, then there must exist $\varepsilon_0$ and $\eta_0$ such that, for all $M \in \mathbb{Z}^{+}$, $\Pr(|\widehat{\lambda}^{-}_{\mathrm{OR}} - \lambda_{\mathrm{OR}}| > \varepsilon_0) > 2\eta_0$ holds for some $m \ge M$. Suppose that $\Pr(\widehat{\lambda}^{-}_{\mathrm{OR}} > \lambda_{\mathrm{OR}} + \varepsilon_0) > 2\eta_0$. It is easy to see that $\widehat{Q}^{-}_{\mathrm{OR}} \xrightarrow{p} Q_{\mathrm{OR}}$, because $\widehat{Q}^{-}_{\mathrm{OR}}(t) - \widehat{Q}^{+}_{\mathrm{OR}}(t) \xrightarrow{a.s.} 0$, $\widehat{Q}^{-}_{\mathrm{OR}}(t) \le \widehat{Q}_{\mathrm{OR}}(t) \le \widehat{Q}^{+}_{\mathrm{OR}}(t)$, and $\widehat{Q}_{\mathrm{OR}}(t) \xrightarrow{p} Q_{\mathrm{OR}}(t)$. Thus there exists $M \in \mathbb{Z}^{+}$ such that $\Pr(|\widehat{Q}^{-}_{\mathrm{OR}}(\lambda_{\mathrm{OR}} + \varepsilon_0) - Q_{\mathrm{OR}}(\lambda_{\mathrm{OR}} + \varepsilon_0)| < \delta_0) > 1 - \eta_0$ for all $m \ge M$, where we let $2\delta_0 = Q_{\mathrm{OR}}(\lambda_{\mathrm{OR}} + \varepsilon_0) - \alpha$. Now, for this choice of $m$, there exists an event $K_{1m}$ such that $\Pr(K_{1m}) \ge 1 - \eta_0$ and, for all outcomes $\omega \in K_{1m}$, $|\widehat{Q}^{-}_{\mathrm{OR}}(\lambda_{\mathrm{OR}} + \varepsilon_0) - Q_{\mathrm{OR}}(\lambda_{\mathrm{OR}} + \varepsilon_0)| < \delta_0$. At the same time, there exists an event $K_{2m}$ such that $\Pr(K_{2m}) \ge 2\eta_0$ and, for all outcomes $\omega \in K_{2m}$, $\widehat{\lambda}^{-}_{\mathrm{OR}} > \lambda_{\mathrm{OR}} + \varepsilon_0$. Let $K_m = K_{1m} \cap K_{2m}$; then $\Pr(K_m) = \Pr(K_{1m}) + \Pr(K_{2m}) - \Pr(K_{1m} \cup K_{2m}) \ge \Pr(K_{1m}) + \Pr(K_{2m}) - 1 \ge \eta_0$. That is, $K_m$ has positive measure. Then, for all outcomes in $K_m$, $\widehat{Q}^{-}_{\mathrm{OR}}(t)$ is continuous and strictly increasing, and we have $\alpha = \widehat{Q}^{-}_{\mathrm{OR}}(\widehat{\lambda}^{-}_{\mathrm{OR}}) > \widehat{Q}^{-}_{\mathrm{OR}}(\lambda_{\mathrm{OR}} + \varepsilon_0) > Q_{\mathrm{OR}}(\lambda_{\mathrm{OR}} + \varepsilon_0) - \delta_0 = \alpha + \delta_0$. This is a contradiction; therefore, we must have $\widehat{\lambda}^{-}_{\mathrm{OR}} \xrightarrow{p} \lambda_{\mathrm{OR}}$. Similarly, we can prove that $\widehat{\lambda}^{+}_{\mathrm{OR}} \xrightarrow{p} \lambda_{\mathrm{OR}}$. Noting that $\widehat{Q}^{-}_{\mathrm{OR}}(t) - \widehat{Q}^{+}_{\mathrm{OR}}(t) \xrightarrow{a.s.} 0$ implies that $\widehat{\lambda}^{-}_{\mathrm{OR}} - \widehat{\lambda}^{+}_{\mathrm{OR}} \xrightarrow{a.s.} 0$, the result follows by noting that $\widehat{\lambda}^{+}_{\mathrm{OR}} \le \widehat{\lambda}_{\mathrm{OR}} \le \widehat{\lambda}^{-}_{\mathrm{OR}}$.
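The envelope construction is easy to check numerically. The sketch below is our own illustration (the function name is hypothetical, and taking $\widehat{Q}_{\mathrm{OR}}(\widehat{\mathrm{Lfdr}}_{(k)})$ to be the running mean of the $k$ smallest Lfdr values is an assumption, though it is consistent with the increment formula in the proof): the two linear interpolants sandwich the step function and are nondecreasing in $t$.

```python
import numpy as np

def qhat_and_envelopes(lfdr_sorted, t):
    """For a threshold t with Lfdr_(k) < t < Lfdr_(k+1) (1-based k >= 2), return
    (Qhat_minus(t), Qhat(t), Qhat_plus(t)).  Here Qhat(Lfdr_(k)) is taken to be
    the running mean of the k smallest Lfdr values (illustrative assumption)."""
    # Q[j] = running mean of the (j+1) smallest values = Qhat(Lfdr_(j+1))
    Q = np.cumsum(lfdr_sorted) / np.arange(1, len(lfdr_sorted) + 1)
    k = int(np.searchsorted(lfdr_sorted, t))      # number of Lfdr values below t
    lo, hi = lfdr_sorted[k - 1], lfdr_sorted[k]   # Lfdr_(k) and Lfdr_(k+1)
    w = (t - lo) / (hi - lo)                      # position of t inside the interval
    q_minus = (1 - w) * Q[k - 2] + w * Q[k - 1]   # interpolates Q(Lfdr_(k-1)) -> Q(Lfdr_(k))
    q_plus = (1 - w) * Q[k - 1] + w * Q[k]        # interpolates Q(Lfdr_(k))   -> Q(Lfdr_(k+1))
    return q_minus, Q[k - 1], q_plus
```

Because the running mean of a sorted sequence is nondecreasing, the lower interpolant never exceeds the step value and the upper one never falls below it, exactly as the sandwich inequality requires.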
Proof of Theorem 4. Implementing the adaptive procedure (10) is equivalent to choosing $\widehat{T}_{\mathrm{OR}}(Z_i)$ as the test statistic and $\widehat{\lambda}_{\mathrm{OR}}$ as the threshold. The mFDR of the decision rule $\delta(\widehat{T}_{\mathrm{OR}}, \widehat{\lambda}_{\mathrm{OR}})$ is
\[
\mathrm{mFDR} = \frac{E(N_{10})}{E(R)}
= \frac{(1-p)\,P_{H_0}(\widehat{T}_{\mathrm{OR}} < \widehat{\lambda}_{\mathrm{OR}})}{P(\widehat{T}_{\mathrm{OR}} < \widehat{\lambda}_{\mathrm{OR}})}.
\]
Noting that $\widehat{T}_{\mathrm{OR}} \xrightarrow{p} T_{\mathrm{OR}}$ by Lemma A.2 and $\widehat{\lambda}_{\mathrm{OR}} \xrightarrow{p} \lambda_{\mathrm{OR}}$ by Lemma A.5, it follows that $(1-p)P_{H_0}(\widehat{T}_{\mathrm{OR}} < \widehat{\lambda}_{\mathrm{OR}}) \to (1-p)P_{H_0}(T_{\mathrm{OR}} < \lambda_{\mathrm{OR}}) = (1-p)G^{0}_{\mathrm{OR}}(\lambda_{\mathrm{OR}})$ and $P(\widehat{T}_{\mathrm{OR}} < \widehat{\lambda}_{\mathrm{OR}}) \to P(T_{\mathrm{OR}} < \lambda_{\mathrm{OR}}) = G_{\mathrm{OR}}(\lambda_{\mathrm{OR}})$. Noting that $P(T_{\mathrm{OR}} < \lambda_{\mathrm{OR}})$ is bounded away from 0, we have that $\mathrm{mFDR} \to (1-p)G^{0}_{\mathrm{OR}}(\lambda_{\mathrm{OR}})/G_{\mathrm{OR}}(\lambda_{\mathrm{OR}}) = Q_{\mathrm{OR}}(\lambda_{\mathrm{OR}}) = \alpha$.
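In implementation terms, the adaptive procedure (10) amounts to ranking the hypotheses by their estimated Lfdr values and rejecting the largest initial block of the ranking whose running average stays at or below $\alpha$; the data-driven threshold $\widehat{\lambda}_{\mathrm{OR}}$ picks out exactly that block. A minimal sketch of this running-average rule (our own illustrative code; the function name and example values are not from the paper):

```python
import numpy as np

def adaptive_lfdr_rejections(lfdr_hat, alpha):
    """Step-up rule on estimated Lfdr values: sort them, find the largest k
    whose running mean is <= alpha, and reject those k hypotheses."""
    order = np.argsort(lfdr_hat)                       # rank by estimated Lfdr
    running_mean = np.cumsum(lfdr_hat[order]) / np.arange(1, len(lfdr_hat) + 1)
    ok = np.nonzero(running_mean <= alpha)[0]          # indices where the rule passes
    reject = np.zeros(len(lfdr_hat), dtype=bool)
    if ok.size:
        reject[order[: ok[-1] + 1]] = True             # reject the k smallest
    return reject
```

For example, with estimated Lfdr values (0.01, 0.5, 0.02, 0.9, 0.05) and $\alpha = 0.1$, the sorted running means are 0.01, 0.015, 0.027, 0.145, 0.296, so the three hypotheses with the smallest Lfdr values are rejected.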
Proof of Theorem 5. Similar to the proof of Theorem 4, the mFNR of the adaptive procedure (10) is
\[
\mathrm{mFNR} = \frac{E(N_{01})}{E(S)}
= \frac{p\,P_{H_1}(\widehat{T}_{\mathrm{OR}} > \widehat{\lambda}_{\mathrm{OR}})}{P(\widehat{T}_{\mathrm{OR}} > \widehat{\lambda}_{\mathrm{OR}})}.
\]
By Lemmas A.2 and A.5, we have that $P_{H_1}(\widehat{T}_{\mathrm{OR}} > \widehat{\lambda}_{\mathrm{OR}}) \to P_{H_1}(T_{\mathrm{OR}} > \lambda_{\mathrm{OR}}) = \bar{G}^{1}_{\mathrm{OR}}(\lambda_{\mathrm{OR}})$ and $P(\widehat{T}_{\mathrm{OR}} > \widehat{\lambda}_{\mathrm{OR}}) \to P(T_{\mathrm{OR}} > \lambda_{\mathrm{OR}}) = \bar{G}_{\mathrm{OR}}(\lambda_{\mathrm{OR}})$. Noting that $\bar{G}_{\mathrm{OR}}(\lambda_{\mathrm{OR}})$ is bounded away from 0, we obtain $\mathrm{mFNR} \to p\,\bar{G}^{1}_{\mathrm{OR}}(\lambda_{\mathrm{OR}})/\bar{G}_{\mathrm{OR}}(\lambda_{\mathrm{OR}}) = \widetilde{Q}_{\mathrm{OR}}(\lambda_{\mathrm{OR}})$.
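The conclusions of Theorems 4 and 5 can be illustrated with a small Monte Carlo experiment. The sketch below is our own (the two-point normal mixture, its parameter values, and all variable names are assumptions, not the paper's simulation design): it applies the running-average rule to the oracle Lfdr statistic and records the empirical $\mathrm{mFDR} = E(N_{10})/E(R)$ and $\mathrm{mFNR} = E(N_{01})/E(S)$, with the empirical mFDR expected to land near $\alpha$.

```python
import numpy as np

rng = np.random.default_rng(1)
p, mu, alpha, m, reps = 0.2, 2.0, 0.10, 5000, 20   # assumed mixture settings

def phi(z, loc=0.0):
    """Normal(loc, 1) density."""
    return np.exp(-0.5 * (z - loc) ** 2) / np.sqrt(2.0 * np.pi)

n10 = r_tot = n01 = s_tot = 0.0
for _ in range(reps):
    theta = rng.random(m) < p                       # 1 = nonnull case
    z = rng.normal(np.where(theta, mu, 0.0), 1.0)   # observed z values
    # oracle Lfdr: posterior probability of the null given z
    lfdr = (1 - p) * phi(z) / ((1 - p) * phi(z) + p * phi(z, mu))
    order = np.argsort(lfdr)                        # rank by oracle Lfdr
    ok = np.nonzero(np.cumsum(lfdr[order]) / np.arange(1, m + 1) <= alpha)[0]
    reject = np.zeros(m, dtype=bool)
    if ok.size:
        reject[order[: ok[-1] + 1]] = True
    n10 += np.sum(reject & ~theta); r_tot += reject.sum()
    n01 += np.sum(~reject & theta); s_tot += (~reject).sum()

mfdr_hat = n10 / r_tot   # empirical mFDR, close to alpha
mfnr_hat = n01 / s_tot   # empirical mFNR of the oracle-based rule
```

Because the rejection set is measurable with respect to the z values, the average Lfdr of the rejected hypotheses estimates their expected false discovery proportion, which is why the running-average rule keeps the empirical mFDR near $\alpha$.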
[Received November 2006. Revised March 2007.]
REFERENCES
Benjamini, Y., and Hochberg, Y. (1995), "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing," Journal of the Royal Statistical Society, Ser. B, 57, 289–300.
——— (2000), "On the Adaptive Control of the False Discovery Rate in Multiple Testing With Independent Statistics," Journal of Educational and Behavioral Statistics, 25, 60–83.
Copas, J. (1974), "On Symmetric Compound Decision Rules for Dichotomies," The Annals of Statistics, 2, 199–204.
Efron, B. (2004a), "Local False Discovery Rate," technical report, Stanford University, Dept. of Statistics, available at http://www-stat.stanford.edu/~brad/papers/False.pdf.
——— (2004b), "Large-Scale Simultaneous Hypothesis Testing: The Choice of a Null Hypothesis," Journal of the American Statistical Association, 99, 96–104.
——— (2006), "Correlation and Large-Scale Simultaneous Testing," technical report, Stanford University, Dept. of Statistics, available at http://www-stat.stanford.edu/~brad/papers/Correlation-2006.pdf.
Efron, B., Tibshirani, R., Storey, J., and Tusher, V. (2001), "Empirical Bayes Analysis of a Microarray Experiment," Journal of the American Statistical Association, 96, 1151–1160.
Foster, D., and Stine, R. (2006), "Alpha Investing: A New Multiple Hypothesis Testing Procedure That Controls mFDR," technical report, University of Pennsylvania, Dept. of Statistics.
Genovese, C., and Wasserman, L. (2002), "Operating Characteristic and Extensions of the False Discovery Rate Procedure," Journal of the Royal Statistical Society, Ser. B, 64, 499–517.
——— (2004), "A Stochastic Process Approach to False Discovery Control," The Annals of Statistics, 32, 1035–1061.
Genovese, C., Roeder, K., and Wasserman, L. (2005), "False Discovery Control With p-Value Weighting," Biometrika, 93, 509–524.
Jin, J., and Cai, T. (2007), "Estimating the Null and the Proportion of Nonnull Effects in Large-Scale Multiple Comparisons," Journal of the American Statistical Association, 102, 495–506.
Magder, L., and Zeger, S. (1996), "A Smooth Nonparametric Estimate of a Mixing Distribution Using Mixtures of Gaussians," Journal of the American Statistical Association, 91, 1141–1151.
Robbins, H. (1951), "Asymptotically Subminimax Solutions of Compound Statistical Decision Problems," in Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA: University of California Press, pp. 131–148.
Rubin, D., Dudoit, S., and van der Laan, M. (2006), "A Method to Increase the Power of Multiple Testing Procedures Through Sample Splitting," working paper, University of California, Berkeley, Dept. of Biostatistics, available at http://www.bepress.com/ucbbiostat/paper171.
Spjøtvoll, E. (1972), "On the Optimality of Some Multiple Comparison Procedures," The Annals of Mathematical Statistics, 43, 398–411.
Storey, J. (2002), "A Direct Approach to False Discovery Rates," Journal of the Royal Statistical Society, Ser. B, 64, 479–498.
——— (2003), "The Positive False Discovery Rate: A Bayesian Interpretation and the q-Value," The Annals of Statistics, 31, 2012–2035.
——— (2007), "The Optimal Discovery Procedure: A New Approach to Simultaneous Significance Testing," Journal of the Royal Statistical Society, Ser. B, 69, 347–368.
van der Laan, M., Dudoit, S., and Pollard, K. (2004), "Multiple Testing. Part III. Augmentation Procedures for Control of the Generalized Family-Wise Error Rate and Tail Probabilities for the Proportion of False Positives," technical report, University of California, Berkeley, Dept. of Biostatistics.
van der Vaart, A. (1998), Asymptotic Statistics, Cambridge, U.K.: Cambridge University Press.
van't Wout, A., Lehrman, G., Mikheeva, S., O'Keeffe, G., Katze, M., Bumgarner, R., Geiss, G., and Mullins, J. (2003), "Cellular Gene Expression Upon Human Immunodeficiency Virus Type 1 Infection of CD4+-T-Cell Lines," Journal of Virology, 77, 1392–1402.
Wasserman, L., and Roeder, K. (2006), "Weighted Hypothesis Testing," technical report, Carnegie Mellon University, Dept. of Statistics, available at http://www.arxiv.org/abs/math.ST/0604172.