Chapter 5 Nonparametric Post Hoc Test

When data consist of one nominal variable and one measurement variable, a one-way ANOVA is usually used. When the measurement variable does not meet the normality assumption of a one-way ANOVA, the parametric method is not applicable, and when the original data set actually consists of one nominal variable and one ranked variable, ANOVA cannot be applied at all. The nonparametric techniques that have been developed for the k-sample problem require no assumptions beyond continuous populations and are therefore applicable in a much wider range of circumstances.

One of the assumptions of parametric analysis is that the variability is approximately the same across all groups. If this assumption does not hold, the researcher should first try to transform the response variable, perhaps using a log or square root transformation, in the hope of stabilizing the variance across the groups. In certain situations, however, no transformation resolves the problem, and the researcher should then consider using a nonparametric test (Newman, 1995).

Nonparametric tests are simple and easy to understand, and no distributional form is assumed for the parent population. If the normality assumption is violated, or the sample sizes from each of the k populations are too small to assess normality, the Kruskal-Wallis (KW) test is used to compare the distributions of the different populations.

The Kruskal-Wallis (KW) test is the nonparametric equivalent of the omnibus F test in a one-way ANOVA (which is used with a metric dependent variable). The KW test is used when the dependent variable consists of ranks. It tests the null hypothesis that the location of each group is the same in the population. If the null hypothesis is rejected, then at least one of the locations differs from the others. When the KW test is significant, follow-up pairwise tests are performed.

It is important to realize that the Kruskal-Wallis H test is an omnibus test. It tests the general hypothesis that all population medians are equal, but it cannot tell which specific groups of the independent variable differ significantly from each other; it only indicates that at least two groups are different. The researcher, however, is usually interested not just in this general hypothesis but in comparisons among the individual groups. Since a study design may have three, four, five or more groups, determining which of these groups differ from each other is important, and a post hoc test is used for this purpose.

There are two ways to apply nonparametric post hoc procedures. The first is to use Mann-Whitney tests. Running many Mann-Whitney tests without adjustment inflates the Type I error rate and is therefore not recommended. Mann-Whitney tests can still be used to follow up a Kruskal-Wallis test, provided some adjustment is made to ensure that the Type I errors do not build up to more than .05. The easiest method is a Bonferroni correction, which in its simplest form means that instead of using .05 as the critical value for significance for each test, you use a

critical value of .05 divided by the number of tests conducted.
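The Bonferroni rule just described amounts to one division. The following is a minimal sketch, assuming that all k(k-1)/2 pairs of groups are compared; the function name `bonferroni_threshold` is illustrative and not taken from any package.

```python
from math import comb


def bonferroni_threshold(k_groups: int, alpha: float = 0.05) -> float:
    """Per-test significance level when all pairs of k groups are compared."""
    n_tests = comb(k_groups, 2)  # k(k-1)/2 pairwise Mann-Whitney tests
    return alpha / n_tests


# Number of pairwise tests and the per-test level for 3, 4 and 5 groups
for k in (3, 4, 5):
    print(k, comb(k, 2), round(bonferroni_threshold(k), 4))
```

With four groups there are six pairwise tests, so each Mann-Whitney test would be judged against .05/6 rather than .05.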

For a long time investigators have tested comparisons after a KW test by applying the Wilcoxon test or the Mann-Whitney test. This procedure is equivalent to performing simple t tests among treatment means following an ANOVA. Like multiple t tests, multiple Wilcoxon tests probably err on the side of leniency, because no protection is provided for the number of comparisons. The multiple comparison situation has improved with the emergence of a number of procedures, namely Nemenyi (1963), Dunn (1964), Dunn control (1964), Steel Dwass (1960) and Steel control (1959). The second way is to apply a nonparametric post hoc test with an adjusted p value, such as Bonferroni, Holm, Hochberg, Hommel, Holland or Rom.

This chapter is divided into 2 sections.

1. Nonparametric Post Hoc test. (Discussed in section 5.1)

2. Nonparametric Post Hoc test with adjusted P value. (Discussed in section 5.2)

Figure 4: Non Parametric Conservative Post Hoc Tests

[Figure: classification diagram of the nonparametric conservative post hoc tests. Branch SS: Bonferroni, Nemenyi, Dunn pairwise, Dunn control, Steel Dwass, Steel control. Branches SU and SD: Hommel, Hochberg, Rom, Holm, Holland & Copenhaver.]

5.1: Nonparametric Post Hoc Test

5.1.1.: Introduction:

This chapter is devoted to multiple comparison procedures that are distribution

free (also referred to as non parametric) in the sense that their type I error rates don’t

depend on distributions that generate the sample data. In the context of testing of

hypotheses, this means that the marginal or also the joint null distributions of the test

statistics do not depend on the underlying distributions of the observations.

There are two broad groups of nonparametric UMCPs (unplanned multiple comparison procedures) for pairwise comparisons, which use two quite different approaches. One group uses joint ranking, i.e. each pairwise comparison is based on the ranks assigned over all k treatments in the study. The result of the comparison of each pair of treatments therefore depends on the data from the other k-2 treatments, a situation not found in any of the commonly used parametric UMCPs.

The other group uses pairwise ranking, i.e. the data are re-ranked for each pair of treatments being compared. The test for each pair of treatments does not depend on the other treatments in the study, as is always the case in parametric procedures. This group usually calculates the maximum or minimum rank sum and uses the Wilcoxon or Mann-Whitney U statistics.
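The difference between the two ranking schemes can be seen on a small example. This is only a sketch on hypothetical data (three groups of three made-up observations); the helper names `average_ranks` and `rank_sums` are illustrative, and tied values receive the mean of the ranks they occupy.

```python
def average_ranks(values):
    """Return ranks (1 = smallest); tied values get the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value equals the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # average of positions pos+1 .. end+1
        for idx in order[pos:end + 1]:
            ranks[idx] = mean_rank
        pos = end + 1
    return ranks


def rank_sums(groups, names):
    """Rank the observations of the named groups together; return rank sums."""
    pooled = [x for name in names for x in groups[name]]
    ranks = average_ranks(pooled)
    sums, start = {}, 0
    for name in names:
        size = len(groups[name])
        sums[name] = sum(ranks[start:start + size])
        start += size
    return sums


# Hypothetical data, for illustration only
groups = {"A": [1.0, 5.0, 9.0], "B": [2.0, 6.0, 10.0], "C": [3.0, 4.0, 8.0]}

joint = rank_sums(groups, ["A", "B", "C"])  # joint ranking over all k groups
pair_ab = rank_sums(groups, ["A", "B"])     # pairwise ranking: A and B only
print("joint:", joint)
print("pairwise A vs B:", pair_ab)
```

The rank sums of A and B change when group C is dropped from the ranking, which is why joint-ranking comparisons depend on the other treatments while pairwise-ranking comparisons do not.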

Difference between pair wise ranking and joint ranking

Test statistics computed from joint rankings do not yield testing families,

while those computed from separate rankings do. The lack of the testing family

property also implies that single step test procedures based on joint rankings do not

have corresponding confidence analogs. Lehmann (1975) has noted that joint rankings

may provide more information than separate rankings in location problems. He has

also noted a lack of transitivity that can arise with separate rankings in which, say,


treatment 1 can be declared better than treatment 2, and treatment 2 can be declared

better than treatment 3, but treatment 1 cannot be declared better than treatment 3.

Such intransitivity cannot arise with joint rankings. Despite these limitations separate

rankings are still generally preferred in practice (Hochberg & Tamhane, 1987).

Both methods of assigning ranks pair wise and jointly have well known

drawbacks (Lehmann 2006, Miller 1981). When observations are ranked in a pair

wise fashion, an inconsistency known as cycling can arise where treatment j is

declared superior to treatment i and treatment k superior to j, but without k being

superior to i. When observations are ranked jointly or within blocks, the significance

of a comparison between a pair of treatments depends upon the observations from

treatments not involved in the comparison. Thus, results may change depending upon

the number of treatments being considered. This type of inconsistency is known as the

problem of irrelevant alternatives.

This section is divided into 2 sub sections. In the 1st sub section the theoretical aspects are explained and in the 2nd sub section a simulation study is carried out.

5.1.1.1 Nemenyi Test (1963):

This method is analogous to the Tukey test and is known as the Nemenyi joint rank test. It is a nonparametric multiple comparison test for medians. Like the Wilcoxon multiple comparison test, it is used to compare the sample groups when the data are measured on at least an ordinal scale and when the sample size is the same in each group. Nemenyi proposed a test that was originally based on rank sums. This method controls inflation of the familywise error rate (FWE) (Israel, 2008).

Assumptions:

1. Measurement of variables should be at least an ordinal scale.


2. There should be an equal sample size for all groups.

Test statistics:

$$q_{cal} = \frac{\lvert R_i - R_j \rvert}{\sqrt{\dfrac{n(nk)(nk+1)}{12}}} \tag{5.1.1.1.1}$$

Where,

Ri= Sum of the joint rank from the ith group.

Rj= Sum of the joint rank from the jth group.

n=number of observations in a group.

k= Total number of groups.

Critical Value:

$$q_{critical} = q_{\alpha,\infty,k} \tag{5.1.1.1.2}$$

The critical value in this test is the studentized range, abbreviated q, which depends on α (the significance level) and k (the total number of groups).

Decision procedure:

Reject the null hypothesis if qcal ≥ qcritical; do not reject H0 otherwise.

Advantages:

1. This test is protected.

2. There is no restriction on the number of groups to be compared with each other (Israel, 2008).

Disadvantages:

1. This test becomes extremely conservative as the gap between groups increases, because the joint ranking of a group of treatments with other very different treatments reduces the relative differences between rank sums within a group.

2. It has very low power in sub groups.

5.1.1.2. Dunn Test (1964):

Dunn (1964) proposed a single-step procedure that is based on joint rankings of observations from all the treatments. When sample sizes are unequal, rank sums can no longer be used; rank means must be used instead, since they are adjusted for sample size. This procedure is called Dunn's test. It is used when the researcher is interested in comparing the location parameters of k experimental groups simultaneously while preserving the FWE.

It tests whether pairs of medians are equal using a rank test. The comparison-wise error rate is adjusted to give the familywise error rate, αFWE. Instead of means, it uses average ranks, and it is used for all pairwise comparisons. In this test we compare mean ranks, not sums of ranks, arranged in order of magnitude (Dmitrienko, A. et al 2007).

The familywise error rate represents a conservative approach to multiple comparisons: it holds the probability of making only correct decisions at 1-α when the null hypothesis of no difference among populations is true. This approach protects well against error when H0 is true, but it makes it more difficult to detect differences when the null hypothesis is false.

Assumptions:

1. It is a protected test.

2. Sample sizes of at least five (but preferably larger) for each treatment are recommended.

Test Statistics:

$$Q_{cal} = \frac{\lvert \bar{R}_i - \bar{R}_j \rvert}{SE} \tag{5.1.1.2.1}$$

To compute it, first combine the data, rank them, find the mean rank of each group, and then take the standardized absolute differences of these average ranks.

For equal sample sizes:

$$SE = \sqrt{\frac{k(N+1)}{6}}$$

For unequal sample sizes:

$$SE = \sqrt{\frac{N(N+1)}{12}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$

If tied ranks are present, for equal sample sizes:

$$SE = \sqrt{\frac{k\left[N(N^2-1) - \sum_{i=1}^{m}(t_i^3 - t_i)\right]}{6N(N-1)}}$$

and for unequal sample sizes:

$$SE = \sqrt{\frac{N(N^2-1) - \sum_{i=1}^{m}(t_i^3 - t_i)}{12(N-1)}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$

Where,

$\bar{R}_i = R_i / n_i$ is the mean of the joint ranks for the ith group (i = 1, 2, …, k; j = 1, 2, …, k; i ≠ j).

$\bar{R}_j$ is the mean of the joint ranks for the jth group.

ni = the number of observations for the ith treatment.

nj = the number of observations for the jth treatment.

N is the total number of observations in all groups, N = ∑ni.

k is the total number of groups.

ti is the number of ties in the ith group of ties.

m is the number of groups of tied ranks.

Critical Value:

$$z^{*} = Z_{\alpha/(k(k-1))} \tag{5.1.1.2.2}$$

The quantity α is called the FWE or the overall significance level, which is the probability of at least one erroneous rejection among the k(k-1)/2 pairwise comparisons.

When making multiple comparisons with a FWE, we usually select a value of α larger than those customarily encountered in single-comparison inference procedures, for example 0.15, 0.20 or perhaps 0.25, depending on the size of k. Dunn recommended choosing a high significance level, say 10 per cent, 15 per cent, 20 per cent, or even 25 per cent (Neave & Worthington, 1988).

Decision Procedure:

Reject the null hypothesis if Qcal ≥ z* ; do not reject H0 otherwise.

Advantages:

1. This test is very flexible as it takes ties into account (Israel, 2008).

2. This test is useful for comparing groups with very small sample sizes (Israel, 2008). Relatively small total sample sizes may be analyzed with this technique, e.g. three groups with five experimental units each, or more than three groups with four units each (Lehmann, 1975).

3. The symmetry assumption, which is often difficult to assess in drug discovery settings with small sample sizes, may be relaxed or ignored (Dmitrienko, A. et al 2007).

4. It is useful for evaluating a few a priori comparisons from a large set of possible comparisons (Edward, 1971).

5. Equal sample sizes are not required.

6. This procedure is more powerful for detecting differences between extreme treatments when there are intermediate treatments present.

Disadvantages:

1. It uses a Bonferroni-like correction to the FWE, so it may be too conservative.

2. This method is somewhat conservative for pairwise testing (Edward, 1971). Because it is overly conservative on Type I error, it has weak power.

3. This method employs joint ranking, so the comparison of two groups is strongly influenced by the behavior of the other groups in the experiment, as the data are initially ranked over the entire experiment.

4. It makes it more difficult to detect differences when the null hypothesis is false.

5. This test is meaningless if the main test of k independent samples has not revealed significant results.

6. The larger the k value, the more difficult it is to detect differences.

5.1.1.3 Dunn Control Test (1964):

This method is used when each group of data is to be tested against a control group. Sometimes the research situation is such that one of the k treatments is a control condition. When this is the case, the investigator is frequently interested in comparing each treatment with the control condition without regard to whether the overall test for a treatment effect is significant, and irrespective of any potentially significant differences between other pairs of treatments. When interest focuses on comparing all treatments with a control condition, there are k-1 comparisons to be made. A significant Kruskal-Wallis test is not required for this test.

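Test statistics:

$$Q_{cal} = \frac{\lvert \bar{R}_i - \bar{R}_c \rvert}{SE} \tag{5.1.1.3.1}$$

To compute it, first combine the data, rank them, find the mean rank of each group, and then take the standardized absolute differences of these average ranks.

For equal sample sizes:

$$SE = \sqrt{\frac{k(N+1)}{6}}$$

For unequal sample sizes:

$$SE = \sqrt{\frac{N(N+1)}{12}\left(\frac{1}{n_i} + \frac{1}{n_c}\right)}$$

If tied ranks are present, for equal sample sizes:

$$SE = \sqrt{\frac{k\left[N(N^2-1) - \sum_{i=1}^{m}(t_i^3 - t_i)\right]}{6N(N-1)}}$$

and for unequal sample sizes:

$$SE = \sqrt{\frac{N(N^2-1) - \sum_{i=1}^{m}(t_i^3 - t_i)}{12(N-1)}\left(\frac{1}{n_i} + \frac{1}{n_c}\right)}$$

Where,

$\bar{R}_i = R_i / n_i$ is the mean of the joint ranks for group i.

$\bar{R}_c$ is the mean of the joint ranks for the control group c.

ni and nc are the sample sizes for group i and the control group c respectively.

N is the total sample size, N = ∑ni.

k is the total number of groups.

ti is the number of ties in the ith group of ties.

m is the number of groups of tied ranks.

Critical Value:

$$z^{*} = Z_{\alpha/(2(k-1))} \tag{5.1.1.3.2}$$

where z* is the 100α/(2(k-1))th upper quantile of the standard Gaussian (normal) distribution.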

Decision Procedure:

Reject the null hypothesis if Qcal ≥ z* ; do not reject H0 otherwise.

Advantages:

1. This test is very flexible as it takes into account ties (Israel, 2008).

2. This test is useful for comparing groups with very small sample size (Israel,

2008). Relatively small total sample sizes may be analyzed with this technique,

i.e. three groups with five experimental units or more than 3 groups with 4 units

(Lehmann, 1975).

3. The symmetry assumption, which is often difficult to assess in drug discovery

settings with small sample sizes, may be relaxed or ignored (Dmitrienko, A et al

2007).

4. This procedure can also be used for unequal sample size.

5. This procedure is more powerful for detecting differences between extreme

treatments when there are intermediate treatments present.

Disadvantages:

1. It uses a Bonferroni-like correction to the FWE and may be too conservative.

2. This method is somewhat conservative for pairwise testing (Edward, 1971). Because it is overly conservative on Type I error, it has weak power.

3. This method employs joint ranking, so the comparison of two groups is strongly influenced by the behavior of the other groups in the experiment, as the data are initially ranked over the entire experiment.

4. It makes it more difficult to detect differences when the null hypothesis is false.

5. This test is meaningless if the main test of k independent samples has not revealed significant results.

6. The larger the k value, the more difficult it is to detect differences.

5.1.1.4. Steel Dwass Test (1960):

This is a multiple comparison test that follows the KW test in a manner analogous to the Tukey test. It is the nonparametric version for all pairwise comparisons and uses rank sums rather than sample means. It is used as a planned multiple comparison procedure, and only for the balanced case n1=n2=…=nk = n. It provides simultaneous nonparametric inference for all pairwise comparisons and is recommended for making pairwise comparisons after a significant overall H has been obtained.

If one is interested in comparing the location parameters of the k experimental groups simultaneously while preserving the FWE, the approach suggested by Dwass (1960) and Steel (1960) is used. Steel (1960) and Dwass (1960) independently proposed a single-step procedure for this family that is based on separate pairwise rankings of observations.

Assumptions:

1. All the random samples are independent.

2. Equal sample sizes are required.

Procedure:

$$H_0: F_i = F_{i'} \quad (1 \le i < i' \le k)$$

Let $RS^{+}_{ii'}$ be the rank sum of the n observations from treatment i when the 2n observations from treatments i and i' are ranked together, and let

$$RS^{-}_{ii'} = n(2n+1) - RS^{+}_{ii'} = RS^{+}_{i'i} \quad (1 \le i < i' \le k) \tag{5.1.1.4.1}$$

Let

$$RS^{*}_{k} = \max_{1 \le i < i' \le k} \left\{ \max(RS^{+}_{ii'}, RS^{-}_{ii'}) \right\} \tag{5.1.1.4.2}$$

and let $RS^{*(\alpha)}_{k}$ be the upper α point of the distribution of $RS^{*}_{k}$ under the overall null hypothesis H0: F1=F2=…=Fk. (Note that the notation for pairwise comparisons is different from that for comparisons with a control. Here the subscript on RS* denotes the number of treatments and not the number of pairwise comparisons among them.)

The Steel Dwass procedure rejects Hii' : Fi=Fi' in favor of the two-sided alternative if

$$\max(RS^{+}_{ii'}, RS^{-}_{ii'}) \ge RS^{*(\alpha)}_{k} \quad (1 \le i < i' \le k)$$

Note that

$$\max(RS^{+}_{ii'}, RS^{-}_{ii'}) = \frac{n(2n+1)}{2} + \left\lvert RS^{+}_{ii'} - \frac{n(2n+1)}{2} \right\rvert$$

which shows that $RS^{*(\alpha)}_{k}$ can be determined from the $\binom{k}{2}$-variate joint distribution of the statistics $RS^{+}_{ii'}$, where, for large samples,

$$RS^{*(\alpha)}_{k} \approx \frac{n(2n+1)}{2} + \frac{1}{2} + Q^{(\alpha)}_{k,\infty} \sqrt{\frac{n^2(2n+1)}{24}} \tag{5.1.1.4.3}$$

Where $Q^{(\alpha)}_{k,\infty}$ is the upper α point of the $Q_{k,\infty}$ random variable.

Advantages:

1. It is a simultaneous test procedure so confidence interval can be obtained by this

method.

2. Steel-Dwass procedure is not affected by the presence of other treatments and

hence higher power for detecting differences between adjacent treatments.

3. As this test uses the data in each pair of treatments separately, it should perform

best when the sample size is large compared to the number of treatments.

Disadvantages:

1. It cannot be used for unequal sample size.

2. It tends to be very conservative i.e. having a type I error much less than the stated

α (Zar, 1999).

3. It has very limited exact tables and the large sample approximation can be very

conservative when there are many treatments.
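The large-sample approximation (5.1.1.4.3) and the rejection rule are easy to evaluate numerically. The sketch below assumes n = 10 observations per group, as in Simulation Study 1; the value 3.63 for the upper 5 per cent point of the range of four standard normal variables is an assumption, chosen because it reproduces the critical value 139.4555 reported in Table 20. The function names are illustrative.

```python
from math import sqrt


def steel_dwass_critical(n: int, q_upper: float) -> float:
    """Large-sample approximation (5.1.1.4.3) to the critical rank sum."""
    centre = n * (2 * n + 1) / 2             # n(2n+1)/2
    spread = sqrt(n * n * (2 * n + 1) / 24)  # sqrt(n^2 (2n+1) / 24)
    return centre + 0.5 + q_upper * spread


def steel_dwass_reject(rs_plus: float, n: int, critical: float) -> bool:
    """Two-sided rule: compare max(RS+, RS-) with the critical value."""
    rs_minus = n * (2 * n + 1) - rs_plus     # equation (5.1.1.4.1)
    return max(rs_plus, rs_minus) >= critical


# q_upper = 3.63 is an assumed table value, not stated in the text
crit = steel_dwass_critical(n=10, q_upper=3.63)
print("critical value:", round(crit, 4))

# RS+ values for two of the pairs in Table 20
for pair, rs_plus in {"(1,2)": 154, "(2,3)": 121}.items():
    print(pair, "reject" if steel_dwass_reject(rs_plus, 10, crit) else "do not reject")
```

Because only the two groups in a pair enter each rank sum, the same critical value is reused for every pair.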

5.1.1.5. Steel control Test (1959):

It is a nonparametric analogue of the Dunnett procedure: a test that compares treatments with a control. It compares the medians of all groups against a control using the Steel pairwise ranking nonparametric method.

It controls the error rate simultaneously for all k-1 comparisons and is a generalization of the Wilcoxon test. Steel presented a rank sum multiple comparison test for comparing treatments with a control.

This procedure was developed to meet the needs of researchers whose experiments generally include a recognized standard treatment for comparison with each of k treatments; such inclusion is required where environmental conditions may change from experiment to experiment.

Assumption:

1. Although equal variance is a formal requirement for this test, it is believed to be relatively robust to variance heterogeneity (Newman, 1965).

2. It assumes a continuous distribution for the measured variable.

3. This method is applicable when there are equal numbers of observations for all treatments.

4. This method is recommended for making pairwise comparisons after a significant overall H has been obtained, i.e. it is a protected test.

Procedure:

The formal null hypothesis is that all observations come from the same population regardless of treatment.

Consider a control treatment labeled k and test treatments labeled 1, 2, …, k-1, where k ≥ 3. It is also assumed that n1=n2=…=nk-1=n (say), which may be different from nk. Steel (1959) proposed a single-step test procedure for the family of hypotheses $H_{0i}: F_i = F_k$ $(1 \le i \le k-1)$. In this procedure the n observations from Fi and the nk observations from Fk are pooled and rank ordered from the smallest to the largest. Because only the observations from the treatments being compared are ranked, this is referred to as the method of separate rankings. Let Rij be the rank of Yij in this ranking (1 ≤ j ≤ n) and let $RS^{+}_{ik}$ be the Wilcoxon rank sum statistic:

$$RS^{+}_{ik} = \sum_{j=1}^{n} R_{ij} \quad (1 \le i \le k-1) \tag{5.1.1.5.1}$$

Suppose that the alternatives to the H0i are the one-sided hypotheses H1i: Fi < Fk (1 ≤ i ≤ k-1). In this case Steel's procedure rejects H0i if

$$RS^{+}_{ik} \ge RS^{(\alpha)}_{k-1} \quad (1 \le i \le k-1) \tag{5.1.1.5.2}$$

where $RS^{(\alpha)}_{k-1}$ is the upper α point of the distribution of

$$RS_{k-1} = \max_{1 \le i \le k-1} RS^{+}_{ik}$$

under the overall null hypothesis H0: F1=F2=…=Fk.

If the alternatives to the H0i are the one-sided hypotheses H1i-: Fi > Fk (1 ≤ i ≤ k-1), Steel's procedure rejects H0i if

$$RS^{-}_{ik} = n(n+n_k+1) - RS^{+}_{ik} \ge RS^{(\alpha)}_{k-1} \quad (1 \le i \le k-1) \tag{5.1.1.5.3}$$

(Note that $RS^{-}_{ik}$ is the rank sum for sample i if all n+nk observations from treatments i and k are assigned ranks in the reverse order. The same critical point $RS^{(\alpha)}_{k-1}$ is used because the joint distribution of the $RS^{-}_{ik}$ is the same as that of the $RS^{+}_{ik}$ under H0.)

For the two-sided alternative, Steel's procedure rejects H0i if

$$RS_{ik} = \max(RS^{+}_{ik}, RS^{-}_{ik}) \ge RS^{*(\alpha)}_{k-1} \quad (1 \le i \le k-1) \tag{5.1.1.5.4}$$

where $RS^{*(\alpha)}_{k-1}$ is the upper α point of the distribution of

$$RS^{*}_{k-1} = \max_{1 \le i \le k-1} RS_{ik} \tag{5.1.1.5.5}$$

Steel (1959) computed exact upper tail probabilities of the null distribution of RSk-1 for k=3,4; n=3,4,5 where n=ni (1≤i≤k). For n1=n2=…=nk-1=n (say), a large-sample approximation to $RS^{*(\alpha)}_{k-1}$ is given by

$$RS^{*(\alpha)}_{k-1} \approx \frac{n(n+n_k+1)}{2} + \frac{1}{2} + \lvert Z \rvert^{(\alpha)}_{k-1,\rho} \sqrt{\frac{n\,n_k(n+n_k+1)}{12}} \tag{5.1.1.5.6}$$

Where $\lvert Z \rvert^{(\alpha)}_{k-1,\rho}$ is the corresponding two-sided upper α equicoordinate point, with

$$\operatorname{corr}(RS^{+}_{ik}, RS^{+}_{jk}) = \sqrt{\frac{n_i n_j}{(n_i+n_k+1)(n_j+n_k+1)}} \quad (1 \le i \ne j \le k-1) \tag{5.1.1.5.7}$$

Steel's procedure rejects H0i if

$$\max(RS^{+}_{ik}, RS^{-}_{ik}) \ge RS^{*(\alpha)}_{k-1}$$

Advantage:

This method is relatively robust to variance heterogeneity.

Disadvantages:

1. These tests can only be used for one way designs, in contrast to the joint rank

tests.

2. Equal sample size required.
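The large-sample critical value (5.1.1.5.6) and the two-sided rule (5.1.1.5.4) can be sketched as follows. The settings n = nk = 9 match Simulation Study 2; the equicoordinate point 2.16 is an assumption, chosen because it reproduces the critical value 110.4615 reported in Table 23, and the function names are illustrative.

```python
from math import sqrt


def steel_critical(n: int, n_control: int, z_equi: float) -> float:
    """Large-sample approximation (5.1.1.5.6) to the critical rank sum."""
    total = n + n_control + 1
    return n * total / 2 + 0.5 + z_equi * sqrt(n * n_control * total / 12)


def steel_two_sided_reject(rs_plus: float, n: int, n_control: int,
                           critical: float) -> bool:
    """Two-sided rule (5.1.1.5.4) for one treatment against the control."""
    rs_minus = n * (n + n_control + 1) - rs_plus  # equation (5.1.1.5.3)
    return max(rs_plus, rs_minus) >= critical


# z_equi = 2.16 is an assumed value, not stated in the text
crit = steel_critical(n=9, n_control=9, z_equi=2.16)
print("critical value:", round(crit, 4))

# RS+ values for fertilizers A and B from Table 23
for label, rs_plus in {"A": 122.5, "B": 126}.items():
    print(label, "reject" if steel_two_sided_reject(rs_plus, 9, 9, crit) else "do not reject")
```

The control sample size enters separately, so the same functions cover designs in which nk differs from n.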

5.1.2 Comparisons of Tests:

The methods discussed above are compared with respect to conservatism and power, and through simulation studies.

Conservative:

The procedure developed by Steel and Dwass is somewhat more advantageous than the tests of Nemenyi and Dunn, but it is less convenient to use and tends to be very conservative (Miller, 1981). The Dunn method appears to be the most conservative, in that it requires a larger critical value at every k than the Steel technique (Edward, 1971).

Power:

The Dunn method is more powerful than the Steel method. The Steel Dwass test is slightly more robust than the Nemenyi joint rank test. However, both tests are less robust when two or more variances are large, and unequal variances are expected to have more effect when sample sizes differ.

Skillings (1983) provided some useful guidelines based on a simulation study.

He found that neither procedure is uniformly superior in terms of power for all non

null configurations. The Dunn procedure is more powerful for detecting differences

between extreme treatments when there are intermediate treatments present. On the

other hand, the Steel-Dwass procedure is not affected by the presence of other

treatments and hence has higher power for detecting differences between adjacent

treatments.

Other:

The Dunn method is different in that it computes ranks on all the data, not just the pair being compared. Dunn control is similar to Steel with the control option. For both methods (i.e. the Dunn and Steel Dwass methods) the reported p-value reflects a Bonferroni adjustment: it is the unadjusted p-value multiplied by the number of

comparisons. If the adjusted p-value exceeds 1, it is reported as 1.
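The adjustment described here can be written in one line. The following is a minimal sketch on hypothetical unadjusted p-values; the function name `bonferroni_adjust` is illustrative.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of comparisons and cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical unadjusted p-values from six pairwise comparisons
raw = [0.001, 0.004, 0.030, 0.120, 0.200, 0.600]
for p, p_adj in zip(raw, bonferroni_adjust(raw)):
    print(p, "->", round(p_adj, 3))
```

An adjusted p-value can then be compared directly with the overall significance level α.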

Simulation Study 1:

The staff of a mental hospital is concerned with which kind of treatment is most effective for a particular type of mental disorder (Table no. 41). A battery of tests administered to all patients delineated a group of 40 patients who were similar as regards diagnosis and also personality, intelligence, and projective and physiological factors. These people were randomly divided into four groups of 10 each for treatment. For 6 months the respective groups received (1) electroshock, (2) psychotherapy, (3) electroshock plus psychotherapy, and (4) no treatment. At the end of this period the battery of tests was repeated on each patient. The only type of measurement possible for these tests is a ranking of all 40 patients on the basis of their relative degree of improvement at the end of the treatment period; rank 1 indicates the highest level of improvement, rank 2 the second highest, and so forth. On the basis of these data, does there seem to be any difference in effectiveness of the types of treatment?

H0: θ1=θ2=…=θ4, i.e. the four groups have the same location parameter.

H1: θi≠θj for some i ≠ j, i.e. at least one group has a different location parameter.

Here the Kruskal-Wallis test (Table no. 43) is used to see whether the four groups have the same location parameter. The reported probability is 0.000 (i.e. p < 0.001), so we reject the null hypothesis of equal medians for the four groups.

When the null hypothesis is rejected, as in the normal theory case, one can compare any two groups, say i and j (with 1≤i<j≤k), by a multiple comparison procedure. This can be done by the Nemenyi, Dunn and Steel Dwass methods.

Dunn method:

Table 18: Multiple comparisons through Dunn Test

| Pair | Difference in mean ranks $\lvert \bar{R}_i - \bar{R}_j \rvert$ | Standard Error (SE) | Test statistic Qcal = Difference/SE | Critical value $z^{*} = Z_{\alpha/(k(k-1))}$ | Null hypothesis | Result |
|---|---|---|---|---|---|---|
| (1,2) | 13.8 | 5.228129 | 2.6395676 | 2.13 | Reject | Significant |
| (1,3) | 17 | 5.228129 | 3.2516412 | 2.13 | Reject | Significant |
| (1,4) | 8.8 | 5.228129 | 1.6832025 | 2.13 | Do not reject | Not significant |
| (2,3) | 3.2 | 5.228129 | 0.6120736 | 2.13 | Do not reject | Not significant |
| (2,4) | 22.6 | 5.228129 | 4.3227701 | 2.13 | Reject | Significant |
| (3,4) | 25.8 | 5.228129 | 4.9348438 | 2.13 | Reject | Significant |

The quantity α is called the familywise error rate or the overall significance level, which is the probability of at least one erroneous rejection among the k(k-1)/2 pairwise comparisons. A higher value of α than the usual 5 per cent level should be chosen as the number of groups k grows. In this case, since k = 4, we use α = 20 per cent (i.e. 4 × 5 per cent) and find the value of Z corresponding to an upper-tail probability of α/(k(k – 1)). This upper-tail probability is 0.2/(4 × (4 – 1)) = 0.2/12 = 0.01667, so the critical value is 2.13.

Given this upper-tail probability, the table of standard normal cumulative probabilities is searched for the value at which 0.01667 lies; it is found at Z = 2.13. Therefore, the null hypothesis of no difference in effectiveness of the types of treatment will be rejected if Qcal ≥ 2.13.

Compare Qcal with the critical value of Z and make a decision. Comparing the value of Qcal for each pair with the critical value of 2.13, we find significant differences between pair 1 and 2 (i.e. electroshock and psychotherapy), between pair 1 and 3 (i.e. electroshock and electroshock plus psychotherapy), between pair 2 and 4 (i.e. psychotherapy and no type of treatment) and between pair 3 and 4 (i.e.

electroshock plus psychotherapy and no type of treatment).
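The entries of Table 18 can be reproduced with a few lines of code. This is a sketch under the settings stated above (k = 4 groups of n = 10, α = 0.20); the mean-rank differences are copied from Table 18, and `statistics.NormalDist` from the Python standard library stands in for the normal table. Variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

k, n = 4, 10
N = k * n
alpha = 0.20

# Standard error for equal sample sizes without ties: sqrt(k(N+1)/6)
se = sqrt(k * (N + 1) / 6)

# Critical value: upper alpha/(k(k-1)) point of the standard normal
z_star = NormalDist().inv_cdf(1 - alpha / (k * (k - 1)))

# Absolute differences in mean ranks, copied from Table 18
mean_rank_diffs = {(1, 2): 13.8, (1, 3): 17.0, (1, 4): 8.8,
                   (2, 3): 3.2, (2, 4): 22.6, (3, 4): 25.8}

significant = []
for pair, diff in mean_rank_diffs.items():
    q_cal = diff / se
    if q_cal >= z_star:
        significant.append(pair)
    print(pair, round(q_cal, 4), "reject" if q_cal >= z_star else "do not reject")

print("SE =", round(se, 6), " z* =", round(z_star, 2))
```

The computed critical value agrees with the tabulated 2.13 to two decimals, and the same four pairs are flagged as significant.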

Nemenyi method:

Table 19: Multiple comparisons through Nemenyi Test

| Pair | Difference in rank sums | Standard Error $S.E. = \sqrt{n(nk)(nk+1)/12}$ | Q = Difference in rank sums/SE | Critical Q value at 0.05 level | Null hypothesis | Result |
|---|---|---|---|---|---|---|
| (1,2) | 138 | 23.3809 | 5.902253 | 3.633 | Reject | Significant |
| (1,3) | 170 | 23.3809 | 7.270891 | 3.633 | Reject | Significant |
| (1,4) | 88 | 23.3809 | 3.763755 | 3.633 | Reject | Significant |
| (2,3) | 32 | 23.3809 | 1.368638 | 3.633 | Do not reject | Not significant |
| (2,4) | 226 | 23.3809 | 9.666008 | 3.633 | Reject | Significant |
| (3,4) | 258 | 23.3809 | 11.03465 | 3.633 | Reject | Significant |

The result column of the table shows that, according to the Nemenyi multiple comparison, the effect is the same for treatments 2 and 3 but differs between treatments 1 and 2, 1 and 3, 1 and 4, 2 and 4, and 3 and 4. Looking at the rank sums, we find that treatment 3 is more effective than 1, 2

and 4.
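The standard error in the header of Table 19 can be evaluated directly as a check. The following is a minimal sketch, assuming n = 10 observations per group and k = 4 groups from Simulation Study 1, with the rank-sum differences and the critical value 3.633 taken from Table 19; the function and variable names are illustrative. With these inputs the printed formula gives an SE of about 36.97, whereas the 23.3809 reported in the table equals sqrt(k(nk)(nk+1)/12), so the sketch prints both and the Q values in Table 19 should be read with that in mind.

```python
from math import sqrt


def nemenyi_se(n: int, k: int) -> float:
    """Standard error of a rank-sum difference as printed in (5.1.1.1.1)."""
    return sqrt(n * (n * k) * (n * k + 1) / 12)


n, k = 10, 4
se_formula = nemenyi_se(n, k)
# Value that matches the SE column of Table 19 (k in place of the leading n)
se_table = sqrt(k * (n * k) * (n * k + 1) / 12)

q_critical = 3.633  # critical Q value at the 0.05 level, from Table 19
rank_sum_diffs = {(1, 2): 138, (1, 3): 170, (1, 4): 88,
                  (2, 3): 32, (2, 4): 226, (3, 4): 258}

print("SE from formula:", round(se_formula, 4))
print("SE in Table 19: ", round(se_table, 4))
for pair, diff in rank_sum_diffs.items():
    q_formula = diff / se_formula
    q_table = diff / se_table
    print(pair, round(q_formula, 3), round(q_table, 3),
          "reject" if q_formula >= q_critical else "do not reject")
```

The decision printed in the last column uses the SE from the printed formula; the middle column reproduces the Q values of Table 19.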

Steel Dwass method:

Table 20: Multiple comparisons through Steel Dwass Test

| Pair | $RS^{+}_{ij}$ | $RS^{-}_{ij}$ | $\max(RS^{+}_{ij}, RS^{-}_{ij})$ | $RS^{*(\alpha)}_{k}$ | Null hypothesis (compare column 4 with 5) | Result |
|---|---|---|---|---|---|---|
| (1,2) | 154 | 56 | 154 | 139.4555 | Reject | Significant |
| (1,3) | 154 | 56 | 154 | 139.4555 | Reject | Significant |
| (1,4) | 62 | 148 | 148 | 139.4555 | Reject | Significant |
| (2,3) | 121 | 89 | 121 | 139.4555 | Do not reject | Not significant |
| (2,4) | 55 | 155 | 155 | 139.4555 | Reject | Significant |
| (3,4) | 55 | 155 | 155 | 139.4555 | Reject | Significant |

The result column of the table shows that, according to the Steel Dwass multiple comparison, the effect is the same for treatments 2 and 3 but differs between treatments 1 and 2, 1 and 3, 1 and 4, 2 and 4, and 3 and 4.

From the simulation study we can also see that the Steel Dwass and Nemenyi procedures reject more hypotheses than the Dunn procedure. In this respect the Steel Dwass procedure is more advantageous than the Dunn test. The Dunn method also appears to be the most conservative, in that it requires a larger critical value.

Simulation Study 2:

Experimental group vs. control group

A fertilizer manufacturer conducted an experiment to compare the effects of four types of fertilizer on the yield of a certain grain (Table no. 44). Homogeneous, equal-size experimental plots of soil were made available for the experiment. They were randomly assigned to receive one of the four fertilizers, and plots receiving no fertilizer served as controls. Nine plots were randomly selected from those previously assigned to each of the fertilizers and from the control plots. The yields (in coded form) for each plot are given in the appendix.

Dunn control:

Table 21: Ranks, rank totals and mean ranks

| Fertilizer | 1: None (0) | 2: A | 3: B | 4: C | 5: D |
|---|---|---|---|---|---|
| | 10.5 | 16 | 28.5 | 33 | 45 |
| | 1 | 15 | 23 | 37 | 42.5 |
| | 2.5 | 17 | 23 | 23 | 38 |
| | 5 | 10.5 | 26 | 35.5 | 40 |
| | 6 | 12.5 | 30 | 31.5 | 42.5 |
| | 2.5 | 7 | 21 | 25 | 34 |
| | 8.5 | 12.5 | 20 | 31.5 | 42.5 |
| | 8.5 | 19 | 28.5 | 42.5 | 39 |
| | 4 | 14 | 18 | 27 | 35.5 |
| Total (R) | 48.5 | 123.5 | 218 | 286 | 359 |
| Mean ($\bar{R}$) | 5.39 | 13.72 | 24.22 | 31.78 | 39.89 |

Table 22: Comparison of yields of plots receiving fertilizer to yields of plots receiving no fertilizer by Dunn Control Test

| Pair | $\bar{R}_i - \bar{R}_0$ | SE | Test statistic Qcal = Difference/SE | Critical value $z^{*} = Z_{\alpha/(2(k-1))}$ | Null hypothesis | Result |
|---|---|---|---|---|---|---|
| A(2) | 8.33 | 6.187108 | 1.34634796 | 1.96 | Do not reject | Not significant |
| B(3) | 18.83 | 6.187108 | 3.04342523 | 1.96 | Reject | Significant |
| C(4) | 26.39 | 6.187108 | 4.26532086 | 1.96 | Reject | Significant |
| D(5) | 34.50 | 6.187108 | 5.57611101 | 1.96 | Reject | Significant |

Since 1.34634796 is less than 1.96, we cannot conclude that fertilizer A is better than no fertilizer. Since 3.04342523, 4.26532086 and 5.57611101 are all greater than 1.96, we conclude that fertilizers B, C and D all result in higher yields than using no fertilizer at all.
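The arithmetic behind Table 22 can be sketched in a few lines of Python. The sketch assumes the usual tie-corrected standard error for Dunn's test, $SE = \sqrt{\left[N(N+1)/12 - \sum (t^3 - t)/(12(N-1))\right](1/n_i + 1/n_0)}$, which is not stated in this section but does reproduce the tabulated SE of 6.187108. All variable names are illustrative. Because the sketch works from unrounded mean ranks, its statistics can differ from Table 22 in the third decimal place.

```python
from collections import Counter
from math import sqrt

# Joint ranks from Table 21 (nine plots per group).
ranks = {
    "None": [10.5, 1, 2.5, 5, 6, 2.5, 8.5, 8.5, 4],
    "A": [16, 15, 17, 10.5, 12.5, 7, 12.5, 19, 14],
    "B": [28.5, 23, 23, 26, 30, 21, 20, 28.5, 18],
    "C": [33, 37, 23, 35.5, 31.5, 25, 31.5, 42.5, 27],
    "D": [45, 42.5, 38, 40, 42.5, 34, 42.5, 39, 35.5],
}

n = 9                                        # plots per group
N = sum(len(v) for v in ranks.values())      # total number of plots
mean_rank = {g: sum(v) / len(v) for g, v in ranks.items()}

# Tie correction (assumed form): sum of t^3 - t over groups of t tied ranks.
tie_sizes = Counter(r for v in ranks.values() for r in v).values()
tie_term = sum(t ** 3 - t for t in tie_sizes)

se = sqrt((N * (N + 1) / 12 - tie_term / (12 * (N - 1))) * (1 / n + 1 / n))

critical = 1.96                              # critical value used in Table 22
q = {g: (mean_rank[g] - mean_rank["None"]) / se for g in ranks if g != "None"}
decision = {g: "Reject" if q[g] > critical else "Do not reject" for g in q}

for g in q:
    print(g, round(q[g], 3), decision[g])
```

Only fertilizer A falls below the critical value, which agrees with the conclusions drawn from Table 22.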

Steel Control:

Table 23: Comparison of yields of plots receiving fertilizer to yields of plots receiving no fertilizer by Steel Control Method

| Pair | $RS_{ik}$ | $RS_{ki}$ | $\max(RS_{ik}, RS_{ki})$ | Critical value $RS^{*}_{(\alpha,k)}$ | Null hypothesis (compare column 4 with column 5) | Result |
|---|---|---|---|---|---|---|
| A(2) | 122.5 | 48.5 | 122.5 | 110.4615 | Reject | Significant |
| B(3) | 126 | 45 | 126 | 110.4615 | Reject | Significant |
| C(4) | 126 | 45 | 126 | 110.4615 | Reject | Significant |
| D(5) | 126 | 45 | 126 | 110.4615 | Reject | Significant |


From the table, we can see that the maximum rank sums, 122.5 and 126, are all greater than 110.4615. We conclude that fertilizers A, B, C and D all result in higher yields than using no fertilizer at all.
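A minimal sketch of the Steel control computation follows. The raw yields are given only in the appendix, so the sketch re-ranks the joint ranks of Table 21 within each control-treatment pair. This is an assumption of convenience: it yields the same pairwise ranks as the raw yields would, because ranking preserves both order and ties. The helper `average_ranks` and the other names are hypothetical, and the critical value 110.4615 is simply copied from Table 23.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_position = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_position
        start = end + 1
    return ranks


# Joint ranks from Table 21, used here as stand-ins for the raw yields.
control = [10.5, 1, 2.5, 5, 6, 2.5, 8.5, 8.5, 4]
treated = {
    "A": [16, 15, 17, 10.5, 12.5, 7, 12.5, 19, 14],
    "B": [28.5, 23, 23, 26, 30, 21, 20, 28.5, 18],
    "C": [33, 37, 23, 35.5, 31.5, 25, 31.5, 42.5, 27],
    "D": [45, 42.5, 38, 40, 42.5, 34, 42.5, 39, 35.5],
}
critical = 110.4615  # copied from Table 23

results = {}
for name, group in treated.items():
    pair_ranks = average_ranks(control + group)   # rank the pair on its own
    rs_control = sum(pair_ranks[:len(control)])
    rs_group = sum(pair_ranks[len(control):])
    results[name] = (rs_group, rs_control, max(rs_group, rs_control) > critical)

for name, row in results.items():
    print(name, row)
```

The rank sums obtained this way match the $RS_{ik}$ and $RS_{ki}$ columns of Table 23.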

The simulation study with a control group shows that the Steel control procedure is more advantageous than the Dunn control test: Steel control rejects more hypotheses than Dunn control. The Dunn method again appears to be the more conservative of the two because it requires a larger critical value.

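Table 24: Comparison of Multiple Comparison Procedure for Non Parametric Test

| Test | Use | Test statistic | Critical value | Equal sample size | Joint ranking / pairwise ranking | Rank sum / mean rank | CI |
|---|---|---|---|---|---|---|---|
| Steel-Dwass | Pairwise | $\max(RS_{ii'}, RS_{i'i})$ | $RS^{*}_{(\alpha,k)}$ | Yes | Pairwise | Rank sum | Yes |
| Nemenyi joint rank | Pairwise | $\dfrac{\lvert R_i - R_j \rvert}{\sqrt{n(nk)(nk+1)/12}}$ | $q_{\alpha,k,\infty}$ | Yes | Joint | Rank sum | No |
| Dunn | Pairwise | $\dfrac{\lvert \bar{R}_i - \bar{R}_j \rvert}{SE}$ | $z_{\alpha/(k(k-1))}$ | No | Joint | Mean rank | No |
| Dunn control | Contrast of control group with each experimental group | $\dfrac{\lvert \bar{R}_i - \bar{R}_c \rvert}{SE}$ | $z_{\alpha/(2(k-1))}$ | No | Joint | Mean rank | No |
| Steel | Contrast of control group with each experimental group | $\max(RS_{ik}, RS_{ki})$ | $RS_{(\alpha,k-1)}$ | Yes | Pairwise | Rank sum | Yes |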

5.2: Nonparametric Post Hoc Test with adjusted p value

An adjusted p-value is defined as the smallest significance level at which the given hypothesis would be rejected when the entire family of tests is considered. The decision rule is to reject the null hypothesis when the adjusted p-value is less than α; in most cases, this procedure controls the FWE at or below the α level.

5.2.1. Introduction:

This section discusses tests based on adjusted p-values: if the adjusted p-value for an individual hypothesis is less than the chosen significance level α, then the hypothesis is rejected with an FWE of at most α. The tests covered are the Bonferroni procedure and the modifications of that procedure by Holm, Holland & Copenhaver, Hommel, Hochberg and Rom. Some of these are single-step procedures and the others are stepwise methods. Stepwise methods are further divided into step-up and step-down methods.

Figure 5: Non parametric Adjusted P Value Methods

5.2.1.1 Bonferroni Test (1961):

Bonferroni-based procedures are recommended when the data are not continuous, because these procedures make no distributional assumptions (Ludbrook, 1991); the Bonferroni method applies to both continuous and discrete data. The method is flexible because it controls the FWE for tests of joint hypotheses about any subset of the c separate hypotheses (including individual contrasts). The procedure rejects a joint hypothesis H0 if any p-value for the individual hypotheses included in H0 is less than α/c. The Bonferroni method, however, yields conservative bounds on the Type I error and therefore has low power. It controls the FWE at α without any further assumption on the dependence structure of the p-values.

The test is discussed in Section 3.1.5.
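As a small illustration of the single-step rule, the sketch below applies the α/c criterion and the equivalent adjusted p-values, min(1, c·p), to the ten p-values used in the simulation study of Section 5.2.2. The function name `bonferroni` and its return format are hypothetical choices made for this example.

```python
def bonferroni(p_values, alpha=0.05):
    """Single-step Bonferroni: every p-value is compared with alpha / c."""
    c = len(p_values)
    adjusted = [min(1.0, c * p) for p in p_values]  # adjusted p-values, capped at 1
    rejected = [p < alpha / c for p in p_values]    # "less than alpha / c", as in the text
    return adjusted, rejected


p_values = [0.002, 0.0054, 0.007, 0.008, 0.009, 0.0094, 0.012, 0.015, 0.028, 0.067]
adjusted, rejected = bonferroni(p_values)
print(sum(rejected), "hypothesis rejected at alpha = 0.05")
```

With ten tests the fixed threshold is 0.05/10 = 0.005, so only the smallest p-value (0.002) leads to a rejection.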

[Figure 5 diagram. Adjusted p-value methods: single step (Bonferroni); stepwise, step up (Hommel, Hochberg, Rom); stepwise, step down (Holm, Holland).]

5.2.1.2 Holm Test (1979):

Holm proposed a modification of the Bonferroni procedure (discussed in 5.2.1.1) that yields a more powerful test. The goal of the Holm method is to increase the power of the statistical tests while keeping the FWE under control. It is a step-down procedure. It is also called a sequentially rejective method because it examines the hypotheses in an ordered sequence, and the decision to accept or reject each null hypothesis depends on the results of the previous hypothesis tests (Tamhane et al, 1998). Holm was the first to formally introduce a sequentially rejective Bonferroni procedure, and his procedure uniformly improves on the Bonferroni approach. Like the Bonferroni method, however, it does not account for the correlations between the test statistics, so the Holm procedure can itself be improved.

The Holm method can be applied to almost any data because of its nonparametric nature. It can be used in any pairwise comparison where the classical Bonferroni test is usually applied, including pairwise comparisons of medians and linear or nonlinear combinations of medians. It is also used for a priori comparisons: for several a priori contrasts, not necessarily pairwise, it controls the FWE while maximizing the power (Howell, 2007).

Assumptions:

There are no restrictions on the type of test; the only requirement is that it should be possible to calculate the obtained significance level for each separate test. Further, the analysis may be restricted to the hypotheses that are of a priori interest, whereas more specialized multiple tests usually include all hypotheses of a certain kind.

Holm's procedure may be used either as a protected test or as an unprotected test, but the protected version is preferred because of the additional power gains. When there are logical implications among the hypotheses, however, problems arise that have to be taken into consideration (Holm, 1979). In summary, Holm's procedure makes no distributional assumptions, makes no logical assumptions about the hierarchy of the hypotheses to be tested, and does not assume independence of the comparisons (Zweifel, 2014).

Procedure:

Order the p-values, $p_{(1)} \le \dots \le p_{(c)}$, and denote the corresponding hypotheses by $H_{(1)}, \dots, H_{(c)}$. Start with the smallest p-value, $p_{(1)}$. If $p_{(1)} > \alpha/c$, then stop testing and accept all the hypotheses; otherwise reject $H_{(1)}$ and go to the next step. In general, if testing has continued to the $i$th step ($1 \le i \le c$) and if $p_{(i)} > \alpha/(c-i+1)$, then stop testing and accept all the remaining hypotheses, $H_{(i)}, \dots, H_{(c)}$; otherwise reject $H_{(i)}$ and go to the next step.

In short, this procedure rejects the specific hypothesis $H_{(i)}$, for $i = 1, 2, \dots, c$, provided that $p_{(i)} \le \alpha/(c-i+1)$ and $H_{(1)}, \dots, H_{(i-1)}$ have all been rejected.

As with the Bonferroni procedure, Holm's procedure can also be applied by adjusting the p-values directly: each ordered p-value $p_{(i)}$ is multiplied by $c-i+1$, where $i$ is the index of the step associated with that p-value.

For an example, see Cohen (2007), Explaining Psychological Statistics, p. 411.
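The step-down rule described above can be sketched as follows, using the ten p-values from the simulation study of Section 5.2.2. The function name `holm` is hypothetical. The adjusted p-values use the common convention of taking a running maximum of $(c-i+1)\,p_{(i)}$ so that they never decrease along the ordered sequence; that convention is an addition here and is not spelled out in the text.

```python
def holm(p_values, alpha=0.05):
    """Holm step-down test; returns (rejected flags, adjusted p-values)."""
    c = len(p_values)
    order = sorted(range(c), key=lambda i: p_values[i])  # indices, smallest p first
    rejected = [False] * c
    adjusted = [0.0] * c
    still_rejecting = True
    running_max = 0.0
    for step, idx in enumerate(order, start=1):
        remaining = c - step + 1                 # hypotheses still under test
        if still_rejecting and p_values[idx] <= alpha / remaining:
            rejected[idx] = True
        else:
            still_rejecting = False              # first failure stops all later rejections
        # Adjusted p-value: (c - i + 1) * p_(i), capped at 1 and kept monotone.
        running_max = max(running_max, min(1.0, remaining * p_values[idx]))
        adjusted[idx] = running_max
    return rejected, adjusted


p_values = [0.002, 0.0054, 0.007, 0.008, 0.009, 0.0094, 0.012, 0.015, 0.028, 0.067]
rejected, adjusted = holm(p_values)
print(sum(rejected), "hypotheses rejected")
```

Here the first two hypotheses are rejected (0.002 ≤ 0.05/10 and 0.0054 ≤ 0.05/9), and testing stops at the third because 0.007 > 0.05/8.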

For unequal sample sizes, the test statistic is the same as for Bonferroni, given by (3.1.1.1).

For equal sample sizes, the test statistic is given by

$$t' = \frac{\bar{x}_i - \bar{x}_j}{\sqrt{\dfrac{2\,MS_{error}}{n}}} \qquad (5.2.1.2.1)$$

Calculate $t'$ for all contrasts of interest and then arrange the $t'$ values in increasing order without regard to sign. This ordering can be represented as $|t'_1| \le |t'_2| \le |t'_3| \le \dots \le |t'_c|$, where $c$ is the total number of contrasts to be tested.

The first significance test is carried out by evaluating $t'_c$ against the critical value in Dunn's table corresponding to $c$ contrasts. In other words, $t'_c$ is evaluated at $\alpha' = \alpha/c$. If this largest $t'$ is significant, then we test the next largest $t'$ (i.e. $t'_{c-1}$) against the critical value in Dunn's table corresponding to $c-1$ contrasts. Thus, $t'_{c-1}$ is evaluated at $\alpha' = \alpha/(c-1)$. The same procedure continues for $t'_{c-2}, t'_{c-3}, t'_{c-4}, \dots$ until the test returns a non-significant result. At that point we stop testing. Holm has shown that such a procedure continues to keep FWE ≤ α, while offering a more powerful test.

The logic behind the test is that when we reject the null hypothesis for $t'_c$, we declare that null hypothesis to be false. If it is false, that leaves only $c-1$ possibly true null hypotheses, and so we only need to protect against $c-1$ contrasts. A similar logic applies as we carry out additional tests. This logic makes particular sense when, even before the experiment is conducted, we know that some of the null hypotheses are almost certain to be false. If they are false, there is no point in protecting against erroneously rejecting them.

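Critical Value:

$$\frac{\alpha}{c-i+1} \qquad (5.2.1.2.2)$$

Decision procedure:

Reject $H_{(i)}$ if $H_{(1)}, \dots, H_{(i-1)}$ have all been rejected and

$$p_{(i)} \le \frac{\alpha}{c-i+1} \qquad (5.2.1.2.3)$$

The threshold $\alpha/(c-i+1)$ changes at every stage because of the step-down nature of the procedure.

The critical value of this method is based on the Bonferroni inequality.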

Advantages:

1. This method is flexible and simple to implement.

2. It controls the FWE in the strong sense, i.e. it guarantees that the generalized Type I error probability is at most α (Hochberg, 1988; Schochet, 2008; Ekenstierna, 2004; Hochberg & Benjamini, 1990; De Muth, 2006).

3. It does not require any assumptions regarding the population distribution (Holm, 1979; Qian et al, 2013), nor strong assumptions such as independence.

4. It is based on the Bonferroni inequality and is valid regardless of the joint distribution of the test statistics (Li, 2009).

5. This procedure maintains the Type I error rate below α for all combinations of variance heterogeneity, non-normality, sample size, effect size and pattern of mean differences (Zweifel, 2014).

6. It achieves a lower Type II error rate while keeping the Type I error rate below α (Hochberg & Benjamini, 1990).

7. It can be used for equal as well as for unequal sample sizes.

Disadvantages:

1. The power of this method is small if all the hypotheses are almost true, but it may be considerable if a number of hypotheses are completely wrong (Holm, 1979).

2. It indicates which comparisons are statistically significant but does not provide confidence intervals.

3. It does not consider the logical interrelationships among the c hypotheses.

4. It becomes very conservative when the number of comparisons is large and when the tests are not independent (De Muth, 2006).

5. Holm's procedure has low power when conditions are not ideal, such as when the sample sizes or effect sizes are small (Zweifel, 2014).

5.2.1.3. Holland & Copenhaver Test (1987):

It uses the Sidak (1967) inequality to set the criterion for each hypothesis test. It is a step-down procedure. The method is useful when further research is needed in situations where there is no logical interrelationship among the hypotheses.

Assumptions:

Positive orthant dependence of the test statistics is required.

Procedure:

Let $p_{(1)}, \dots, p_{(c)}$ be the ordered p-values (smallest to largest) and $H_{(1)}, \dots, H_{(c)}$ be the corresponding hypotheses. Suppose $i$ is the smallest integer from 1 to $c$ such that $p_{(i)} > 1 - (1-\alpha)^{1/(c-i+1)}$; the Holland-Copenhaver procedure then rejects $H_{(1)}$ to $H_{(i-1)}$ and retains $H_{(i)}$ to $H_{(c)}$ (Olejnik et al, 1997).
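The procedure above can be sketched as a short loop. The function name `holland_copenhaver` and its return format are hypothetical; the p-values are the ten used in the simulation study of Section 5.2.2.

```python
def holland_copenhaver(p_values, alpha=0.05):
    """Step-down test with Sidak-type critical values 1 - (1 - alpha)^(1/(c-i+1))."""
    c = len(p_values)
    order = sorted(range(c), key=lambda i: p_values[i])  # indices, smallest p first
    rejected = [False] * c
    critical_values = []
    for step, idx in enumerate(order, start=1):
        critical = 1 - (1 - alpha) ** (1 / (c - step + 1))
        critical_values.append(critical)
        if p_values[idx] > critical:
            break                                # retain this and all larger p-values
        rejected[idx] = True
    return rejected, critical_values


p_values = [0.002, 0.0054, 0.007, 0.008, 0.009, 0.0094, 0.012, 0.015, 0.028, 0.067]
rejected, critical_values = holland_copenhaver(p_values)
print(sum(rejected), [round(cv, 6) for cv in critical_values])
```

The loop stops at the third ordered p-value, since 0.007 exceeds the third critical value of about 0.006391, so only the first two hypotheses are rejected.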

Test Statistics:

For unequal sample sizes, the test statistic is the same as for Bonferroni, given by (3.1.1.1).

For equal sample sizes, the test statistic is the same as for Holm, given by (5.2.1.2.1).

Critical Value:

$$1 - (1-\alpha)^{1/(c-i+1)} \qquad (5.2.1.3.1)$$

Decision procedure:

Reject $H_{(1)}$ to $H_{(i-1)}$, where $i$ is the smallest index at which the following inequality fails:

$$p_{(i)} < 1 - (1-\alpha)^{1/(c-i+1)} \qquad (5.2.1.3.2)$$

Advantages:

This method is conservative under the condition that the test statistics are positive

orthant dependent.

Disadvantages:

Applicability of this method is slightly less than the Holm procedure because of the

requirement of positive orthant dependent condition for test statistics.


5.2.1.4 Hommel Test (1988):

Hommel (1988) employed the closure principle to extend the Simes test and developed a stepwise multiple testing procedure that controls the FWE. It is based on the Simes (1986) equality. It is a step-up method and a protected test. The procedure is conservative only when the test statistics are independent, because it is based on the Simes equality for independent p-values. It is not always necessary to test every possible combination of hypotheses, i.e. it can also be used for a few comparisons.

Hommel generalized the Simes procedure so that it gives strong control of the FWE whenever the original Simes procedure achieves weak control (e.g. with independent tests).

Assumptions:

Test statistics are independent.

Procedure:

Reject all hypotheses that have a p-value ≤ α/j', where j' is defined through the set

$$J = \left\{ i' \in \{1, \dots, c\} : p_{(c-i'+k)} > \frac{k\alpha}{i'} \ \text{ for } k = 1, \dots, i' \right\}$$

If $J$ is non-empty, reject $H_i$ whenever $p_i \le \alpha/j'$ with $j' = \max J$. If $J$ is empty, reject all $H_i$ ($i = 1, 2, \dots, c$).

The procedure has two stages. The first stage uses the obtained p-values to determine the members of $J$. The second stage obtains the significance level of rejection as $\alpha' = \alpha/j'$, where $j'$ is the largest number in $J$.
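The two stages can be sketched as follows, taking $J$ to be the set of indices $i'$ for which $p_{(c-i'+k)} > k\alpha/i'$ holds for every $k = 1, \dots, i'$ (the standard form of Hommel's rule, assumed here). The function name `hommel` and the returned pair are illustrative; the p-values are the ten from the simulation study of Section 5.2.2.

```python
def hommel(p_values, alpha=0.05):
    """Hommel's two-stage rule; returns (rejected flags, j' or None if J is empty)."""
    c = len(p_values)
    ordered = sorted(p_values)                   # ordered[k - 1] is p_(k)
    # Stage 1: collect every i for which the i largest p-values all exceed k * alpha / i.
    J = [
        i for i in range(1, c + 1)
        if all(ordered[c - i + k - 1] > k * alpha / i for k in range(1, i + 1))
    ]
    if not J:
        return [True] * c, None                  # J empty: reject every hypothesis
    # Stage 2: reject whenever p <= alpha / j', with j' the largest member of J.
    j_max = max(J)
    return [p <= alpha / j_max for p in p_values], j_max


p_values = [0.002, 0.0054, 0.007, 0.008, 0.009, 0.0094, 0.012, 0.015, 0.028, 0.067]
rejected, j_max = hommel(p_values)
print(j_max, sum(rejected))
```

For these data $J = \{1, 2\}$, so $j' = 2$ and every hypothesis with a p-value of at most 0.05/2 = 0.025 is rejected.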

Test Statistics:

The test statistic is the same as for Holm, given in (5.2.1.2.1).

Critical Value:

α/j' (5.2.1.4.1)

Advantages:

1. It protects the FWE only when test statistics are independent (Dmitrienko et al,

2009; Olejnik et al, 1997).

2. The uniqueness of this procedure is that it not only considers the order of the tests

but also takes the obtained p values into the calculation while computing the α'.

Disadvantages:

1. This method is relatively complicated.

2. When correlations between variables are negative, the test can sometimes allow

slightly more Type I errors than the stated maximum family wise error.

3. It controls overall type I error rate only when test statistics are independent

(Olejnik et al,1997).

5.2.1.5 Hochberg Test (1988):

It is a modification of the Dunn procedure. The procedure uses critical values identical to those used in the Holm procedure but offers potentially greater power by conducting the tests in a step-up rather than a step-down sequence. It is based on the Simes (1986) equality.

Assumptions:

Tests are independent of one another.

Procedure:

Hochberg derived an even sharper procedure, which uses the ordered p-values $p_{(i)}$ but in a different way from Holm's procedure. The procedure starts by examining the largest p-value, $p_{(c)}$. If $p_{(c)} \le \alpha$, then $H_{(c)}$ and all other hypotheses are rejected. If not, $H_{(c)}$ is not rejected and one proceeds to compare $p_{(c-1)}$ with $\alpha/2$. If $p_{(c-1)} \le \alpha/2$, then $H_{(c-1)}$ and all hypotheses with smaller p-values are rejected. In general, one proceeds from the highest to the lowest p-value, retaining $H_{(i)}$ if its p-value satisfies $p_{(i)} > \alpha/(c-i+1)$. The procedure stops at the first ordered hypothesis for which that inequality is reversed; this hypothesis is rejected, together with all hypotheses with lower or equal p-values. This is always a sharper procedure than Holm's.
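A minimal sketch of the step-up sequence is given below, applied to the ten p-values from the simulation study of Section 5.2.2. The function name `hochberg` is hypothetical.

```python
def hochberg(p_values, alpha=0.05):
    """Hochberg step-up test; returns a list of rejection flags."""
    c = len(p_values)
    order = sorted(range(c), key=lambda i: p_values[i])  # order[i - 1] indexes p_(i)
    rejected = [False] * c
    for i in range(c, 0, -1):                    # i = c, c-1, ..., 1 (largest p first)
        if p_values[order[i - 1]] <= alpha / (c - i + 1):
            for idx in order[:i]:                # reject H_(1), ..., H_(i)
                rejected[idx] = True
            break
    return rejected


p_values = [0.002, 0.0054, 0.007, 0.008, 0.009, 0.0094, 0.012, 0.015, 0.028, 0.067]
rejected = hochberg(p_values)
print(sum(rejected), "hypotheses rejected")
```

The two largest p-values fail their thresholds (0.067 > 0.05 and 0.028 > 0.025), but the third largest passes (0.015 ≤ 0.05/3), so it and the seven smaller p-values are all rejected. Holm's step-down rule applied to the same data stops after two rejections.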

Critical Value:

$$\frac{\alpha}{c-i+1} \qquad (5.2.1.5.1)$$

Decision procedure:

Reject $H_{(1)}$ to $H_{(i)}$ for any $i = c, c-1, \dots, 1$ if

$$p_{(i)} \le \frac{\alpha}{c-i+1} \qquad (5.2.1.5.2)$$

Advantages:

1. This procedure has strong control over the FWE α even if the free combination

condition is not satisfied (Holm, 1979; Holland & Copenhaver, 1987; Olejnik et

al, 1997).

2. It controls the FWE under the same conditions for which the Simes global test controls the Type I error rate.

3. This method always achieves the same Type I FWE control and lower Type II error rates (Hochberg & Benjamini, 1990).

4. It has the nice characteristic that no adjusted p-value can be larger than the largest of the unadjusted p-values (Wright, 1992).

5. This method is able to reject at least one individual hypothesis when the global

null hypothesis is rejected. This property of consonance makes Hochberg

procedure easy to interpret (Rom, 1990).

Disadvantages:

1. It lacks stability under certain conditions, for example when the test statistics are dependent or correlated (Schochet, 2008).

2. It can only be applied to independent hypothesis tests (Olejnik et al, 1997; Schochet, 2008).

5.2.1.6 Rom Test (1990):

It is a modification of Hochberg procedure to increase the statistical power. It is a step

up procedure. Increased power is achieved by identifying the appropriate adjusted

significance levels that control the Type I error rate at exactly the nominal level when

test statistics are independent (Olejnik et al, 1997).

Assumptions:

Test statistics are independent.

Procedure:

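The Rom procedure differs from the Hochberg procedure in how the adjusted significance levels are obtained. Both procedures set $\alpha'_{(m)}$ equal to $\alpha$ and $\alpha'_{(m-1)}$ equal to $\alpha/2$, but the remaining $m-2$ adjusted significance levels differ. The adjusted significance levels are determined recursively as

$$\alpha'_{(m-i+1)} = \frac{1}{i}\left[\sum_{j=1}^{i-1}\alpha^{j} - \sum_{j=1}^{i-2}\binom{i}{j}\left(\alpha'_{(m-j)}\right)^{i-j}\right] \qquad (5.2.1.6.1)$$

for $i = 3, \dots, m$, where $\alpha'_{(m)} = \alpha$ and $\alpha'_{(m-1)} = \alpha/2$.

It is a step-up procedure with critical values $c_1 = \alpha$, $c_2 = \alpha/2$, $c_3 = \alpha/3 + \alpha^2/12$, etc.

For the computation it is convenient to renumber the hypotheses: denote by $H_{(1)}$ the hypothesis with the largest p-value and by $H_{(m)}$ the hypothesis with the smallest p-value, so that $\alpha'_{(i)} = c_i$.

Testing starts by comparing $p_{(1)}$ with $\alpha'_{(1)}$ and stops at the first $i$ for which $p_{(i)} < \alpha'_{(i)}$. Then $H_{(1)}$ to $H_{(i-1)}$ are retained and $H_{(i)}$ to $H_{(m)}$ are rejected. The computing equation for the $\alpha'_{(i)}$ can be divided into three parts. The first part is $\alpha^{1} + \alpha^{2} + \dots + \alpha^{i-1}$ and the second part is

$$\binom{i}{1}\left(\alpha'_{(2)}\right)^{i-1} + \binom{i}{2}\left(\alpha'_{(3)}\right)^{i-2} + \dots + \binom{i}{i-2}\left(\alpha'_{(i-1)}\right)^{2}$$

The third part solves for $\alpha'_{(i)}$ by subtracting the second part from the first part and dividing the difference by $i$.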
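The three-part computation can be sketched in Python. The sketch assumes the recursion in the form $c_i = \frac{1}{i}\left[\sum_{j=1}^{i-1}\alpha^j - \sum_{j=1}^{i-2}\binom{i}{j} c_{j+1}^{\,i-j}\right]$ with $c_1 = \alpha$ and $c_2 = \alpha/2$, where $c_i$ is compared with the $i$th largest p-value. This form reproduces $c_3 = \alpha/3 + \alpha^2/12$ and the Rom column of Table 25. The function names are hypothetical, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def rom_critical_values(m, alpha=0.05):
    """Return [c_1, ..., c_m]; c_i is compared with the i-th largest p-value."""
    crit = [alpha, alpha / 2][:m]
    for i in range(3, m + 1):
        first = sum(alpha ** j for j in range(1, i))          # alpha + ... + alpha^(i-1)
        # crit[j] holds c_(j+1), so this is sum of C(i, j) * c_(j+1)^(i-j), j = 1..i-2.
        second = sum(comb(i, j) * crit[j] ** (i - j) for j in range(1, i - 1))
        crit.append((first - second) / i)                     # third part
    return crit


def rom(p_values, alpha=0.05):
    """Rom step-up test; returns a list of rejection flags."""
    m = len(p_values)
    crit = rom_critical_values(m, alpha)
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)  # largest p first
    rejected = [False] * m
    for step, idx in enumerate(order):
        if p_values[idx] < crit[step]:           # first p-value below its critical value
            for k in order[step:]:               # reject it and all smaller p-values
                rejected[k] = True
            break
    return rejected


p_values = [0.002, 0.0054, 0.007, 0.008, 0.009, 0.0094, 0.012, 0.015, 0.028, 0.067]
crit = rom_critical_values(len(p_values))
print([round(c, 6) for c in crit[:4]], sum(rom(p_values)))
```

The third critical value, 0.016875, is slightly larger than Hochberg's 0.05/3 ≈ 0.016667, which is where Rom's small gain in power comes from.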

Advantages:

1. It controls the FWE exactly at α for independent test statistics (Schochet, 2008).

2. It is motivated by the aim of lowering the Type II error.

3. The Rom procedure has the desired FWE only for independent tests, including complex comparisons.

Disadvantages:

1. The calculation of this method is complicated and iterative.

2. Adjusted critical values are provided only for up to 10 tests, for an overall alpha of 0.05 and 0.01. As the number of hypothesis tests increases, the calculations become impractical even when a computer is used.

5.2.2. Comparisons of Tests:

The methods discussed above are compared with respect to different aspects like

Conservatism, Power and Confidence Interval estimation and simulation study.

Conservatism:

The Bonferroni method has the largest adjusted p-values and is thus the most conservative method, followed by the Holm (1979), Hochberg (1988) and Hommel (1988) methods. The Bonferroni and Holm (1979) methods show the lowest Type I error, whereas the Hochberg (1988) and Hommel (1988) methods allow more error but are still conservative when ρ (the correlation) exceeds 0.5.

The Holm procedure is a closed testing procedure in which each intersection hypothesis is tested using a global test based on the Bonferroni procedure. The Holm procedure rejects the global hypothesis if and only if the Bonferroni procedure does, and therefore the conclusions regarding the conservative nature of the Bonferroni procedure also apply to the Holm procedure (Dmitrienko et al, 2009).

The Hochberg procedure uses the same criterion for each hypothesis as the Holm procedure but tests hypotheses with larger p-values first. Consequently, it will test and possibly reject hypotheses not examined by the Holm procedure while rejecting the same hypotheses that are rejected by the Holm procedure (Dunnett & Tamhane, 1992; Hochberg, 1988; Olejnik et al, 1997). In most real-life cases, the conclusions from the two methods, Holm and Hochberg, will rarely differ.

Power:

The Holm procedure is at least as powerful as the Bonferroni method because its bound increases sequentially whereas the Bonferroni bound remains fixed; statistical power is gained by sequentially relaxing the criterion for statistical significance. Because any hypothesis rejected by the original Bonferroni procedure will also be rejected by the Holm procedure, the latter cannot have lower power for an individual hypothesis test. Moreover, Holm claims that in actual practice the gain in power with his procedure as compared to Bonferroni is non-negligible because α/(c − i + 1) is much larger than α/c for many values of i (Olejnik et al, 1997).

Any hypothesis rejected by Holm's procedure will always be rejected by Hochberg's procedure (Dunnett & Tamhane, 1992; Hochberg, 1988), although the power differences tend to be negligible (Olejnik et al., 1997). The Hochberg procedure is uniformly more powerful than the Holm procedure (Hochberg, 1988) but, on the other hand, it is uniformly less powerful than the Hommel procedure (Hommel, 1989). However, because of the independence assumption required by Hochberg, the Holm procedure may be the best choice if independence of the tests is not certain. The criterion used by the Holland and Copenhaver procedure is slightly larger than that of the Holm procedure, leading to slightly greater power for an individual hypothesis test (Olejnik et al, 1997).
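That the Holland-Copenhaver criterion is never smaller than the Holm criterion can be checked directly. The short derivation below is a sketch that writes $m = c - i + 1$ for the number of hypotheses still under test (a symbol introduced here only for brevity) and applies Bernoulli's inequality, $(1+x)^m \ge 1 + mx$ for $x \ge -1$, with $x = -\alpha/m$:

$$
\left(1 - \frac{\alpha}{m}\right)^{m} \ge 1 - \alpha
\;\Longrightarrow\;
1 - \frac{\alpha}{m} \ge (1 - \alpha)^{1/m}
\;\Longrightarrow\;
1 - (1 - \alpha)^{1/m} \ge \frac{\alpha}{m}.
$$

For example, with α = 0.05 and $m = 10$ the Sidak-based criterion is $1 - 0.95^{1/10} \approx 0.005116$, compared with the Bonferroni-based value $0.05/10 = 0.005$. The two criteria coincide when $m = 1$.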

The Hommel method is uniformly more powerful than the Holm procedure because the Simes test is uniformly more powerful than the global test based on the Bonferroni procedure (Dmitrienko et al, 2009). For n > 2, there are situations where Hommel rejects and Hochberg does not (Hommel, 1989). The Hommel procedure rejects more hypotheses than either the Rom or the Holland-Copenhaver procedure; however, the difference in the number of hypotheses rejected is very small. The Hochberg and Hommel procedures are more powerful, but they are known to have the desired FWE only for independent tests (Hommel, 1989). Rom gives slightly higher critical p-values that can be used with Hochberg's procedure, making it somewhat more powerful.

Among the modifications of the Bonferroni procedure, Holm's procedure is the least powerful because it is based on the Bonferroni inequality. The Rom and Hommel procedures are more powerful than Hochberg's procedure because sharper inequalities (or equalities) are used in both; however, the power improvement is negligible compared with their complexity.

The increase in power for individual hypothesis tests provided by the Hommel and Rom procedures over the Hochberg approach is at best marginal, with the Rom procedure having only a slight advantage over the Hommel (Dunnett and Tamhane, 1992; Olejnik, 1997).

The Holland-Copenhaver and Hochberg procedures provide power very close to that obtained by the Hommel and Rom procedures, particularly when the total number of hypotheses tested is not too large. If the number of false null hypotheses is large, the Hochberg procedure might provide a better chance of detecting all of them than the Holland-Copenhaver procedure.

As the sample size increases, statistical power increases, but when the number of variables in a matrix increases, the probability of rejecting all of the non-null hypotheses decreases. All five of the enhancements are more sensitive than the original Bonferroni procedure in detecting all true nonzero relationships. The difference between the original Bonferroni procedure and the enhancements increases as the number of true nonzero relationships increases. Only very small differences in statistical power are found among the five enhancements to the original Bonferroni procedure. The Holm procedure has the lowest sensitivity in detecting all true nonzero relationships, whereas the Rom procedure has the greatest power. When all the correlations are nonzero, the Hochberg, Hommel and Rom procedures have the same estimated power.

Because step-up sequential multiple comparisons are based on the Simes equality, which assumes independence of comparisons, it is reasonable to suggest that dependence or correlation between the group means should affect Type I error control and power (Zweifel, 2014).

In summary, among the procedures compared (Bonferroni, Holm, Holland, Hochberg, Hommel, Rom), the Bonferroni procedure has the lowest percentage of rejections and the Hommel procedure has the highest percentage of rejections whenever differences exist among the procedures. Overall, the step-up (SU) procedures are a little more powerful than the step-down (SD) procedures. Within the SU procedures, whenever differences occur, the Hommel procedure has a slightly higher percentage of rejections than the Hochberg procedure. Within the SD procedures, whenever differences occur, the Holland procedure has a slightly higher percentage of rejections than the Holm procedure (Olejnik et al, 1997).

Confidence Interval:

All the methods except Bonferroni are stepwise methods, for which confidence intervals cannot be obtained, so a comparison with respect to confidence intervals is not possible.

Simulation Study:

This section reports results for the tests in terms of adjusted p-values: if the adjusted p-value for an individual hypothesis is less than the chosen significance level α, then the hypothesis is rejected with an FWE of at most α. The tests are the Bonferroni procedure and the modifications of that procedure by Holm, Holland & Copenhaver, Hommel, Hochberg and Rom.

As a concrete example, imagine that we have ten p-values, and they are (in order from smallest to largest) as follows: 0.002, 0.0054, 0.007, 0.008, 0.009, 0.0094, 0.012, 0.015, 0.028 and 0.067.

We compare each p-value with the critical value given by the Bonferroni method and by the modifications of that procedure due to Holm, Holland & Copenhaver, Hommel, Hochberg and Rom.

Table 25: Rejection criteria according to different available Tests (for adjusted p value)

| No | Prob. | Bonferroni | Holm | Holland & Copenhaver | Hommel | Hochberg | Rom |
|---|---|---|---|---|---|---|---|
| 1 | 0.002 | 0.005 | 0.005 | 0.005116197 | 0.025 | 0.005 | 0.005115 |
| 2 | 0.0054 | 0.005 | 0.005556 | 0.005683045 | 0.025 | 0.005556 | 0.005681 |
| 3 | 0.007 | 0.005 | 0.00625 | 0.006391151 | 0.025 | 0.00625 | 0.006388 |
| 4 | 0.008 | 0.005 | 0.007143 | 0.007300832 | 0.025 | 0.007143 | 0.0073 |
| 5 | 0.009 | 0.005 | 0.008333 | 0.008512445 | 0.025 | 0.008333 | 0.008505 |
| 6 | 0.0094 | 0.005 | 0.01 | 0.010206218 | 0.025 | 0.01 | 0.0102 |
| 7 | 0.012 | 0.005 | 0.0125 | 0.012741455 | 0.025 | 0.0125 | 0.0127 |
| 8 | 0.015 | 0.005 | 0.016667 | 0.016952428 | 0.025 | 0.016667 | 0.016875 |
| 9 | 0.028 | 0.005 | 0.025 | 0.025320566 | 0.025 | 0.025 | 0.025 |
| 10 | 0.067 | 0.005 | 0.05 | 0.05 | 0.025 | 0.05 | 0.05 |

Table 26: Hypotheses Rejection by all these multiple comparison procedures with adjusted p value

| No. | Bonferroni | Holm | Holland & Copenhaver | Hommel | Hochberg | Rom |
|---|---|---|---|---|---|---|
| 1 | Reject | Reject | Reject | Reject | Reject | Reject |
| 2 | Accept | Reject | Reject | Reject | Reject | Reject |
| 3 | Accept | Accept | Accept | Reject | Reject | Reject |
| 4 | Accept | Accept | Accept | Reject | Reject | Reject |
| 5 | Accept | Accept | Accept | Reject | Reject | Reject |
| 6 | Accept | Accept | Accept | Reject | Reject | Reject |
| 7 | Accept | Accept | Accept | Reject | Reject | Reject |
| 8 | Accept | Accept | Accept | Reject | Reject | Reject |
| 9 | Accept | Accept | Accept | Accept | Accept | Accept |
| 10 | Accept | Accept | Accept | Accept | Accept | Accept |

The simulation study also shows that the Holm procedure is more powerful than the Bonferroni method because its bound increases sequentially whereas the Bonferroni bound remains fixed. Any hypothesis rejected by the original Bonferroni procedure is also rejected by the Holm procedure, so the latter cannot have lower power for an individual hypothesis test. Any hypothesis rejected by Holm's procedure is always rejected by Hochberg's procedure. The Hochberg procedure is uniformly more powerful than the Holm procedure but, on the other hand, it is uniformly less powerful than the Hommel procedure. The criterion used by the Holland and Copenhaver procedure is slightly larger than that of the Holm procedure, leading to slightly greater power for an individual hypothesis test.

The Hommel procedure rejects more hypotheses than either the Rom or the Holland-Copenhaver procedure; however, the difference in the number of hypotheses rejected is very small. Rom gives slightly larger critical p-values that can be used with Hochberg's procedure, making it somewhat more powerful.

Table 27: Comparison of Multiple Comparison procedure with adjusted p value

| Test | SS/SW | Based on | Modification of | Critical value | Remarks |
|---|---|---|---|---|---|
| Bonferroni (1961) | SS | Bonferroni inequality | ____ | $\alpha/c$ | Planned contrasts, both simple and complex |
| Holm (1979) | SD | Bonferroni inequality | Bonferroni | $\dfrac{\alpha}{c-i+1}$ | When comparisons are not independent |
| Holland (1987) | SD | Sidak inequality | Bonferroni | $1 - (1-\alpha)^{1/(c-i+1)}$ | Positive orthant dependence |
| Hommel (1988) | SU | Simes inequality | Holm | $\alpha/j'$ | When comparisons are independent |
| Hochberg (1988) | SU | Simes inequality | Holm | $\dfrac{\alpha}{c-i+1}$ | When comparisons are independent |
| Rom (1990) | SU | Simes inequality | Hochberg | $\alpha'_{(m-i+1)} = \dfrac{1}{i}\left[\sum_{j=1}^{i-1}\alpha^{j} - \sum_{j=1}^{i-2}\binom{i}{j}\left(\alpha'_{(m-j)}\right)^{i-j}\right]$ | When comparisons are independent |

Figure 6: Non Parametric Post Hoc Tests

[Figure 6 diagram. Nonparametric post hoc tests. By adjusting the p value (for equal and unequal sample size): Bonferroni, Holm, Hommel, Hochberg, Rom. Equal sample size: Nemenyi, Dunn control, Steel Dwass, Steel control. Unequal sample size: Dunn pairwise, Dunn control.]
