
Master thesis Department of Statistics


Effects of unbalancedness and heteroscedasticity on two-way MANOVA tests

Patrik Zetterberg

Master's thesis, 30 higher education credits, spring term 2013

Supervisor: Tatjana von Rosen

Abstract

Multivariate analysis of variance is a widely used multivariate method that is generally robust to minor deviations from normality and homoscedasticity. When data is balanced, standard multivariate tests for factor effects are exact. However, these tests can be biased when data is unbalanced and covariance matrices are heteroscedastic, which emphasizes the need for proper methods. This master thesis aims to investigate how some newly proposed modified tests, which take unbalancedness and heteroscedasticity into account, perform in relation to standard tests for two-way multivariate analysis of variance models with interactions. Two numerical examples are set up in order to compare the performance of the modified and standard tests. The obtained results show that differences between these tests are marginal when data is balanced. The modified tests are overall less prone than standard tests to yield significant results when data is unbalanced. The main implications of the results are that further studies of the testing procedure are needed, but that modified tests are useful as a statistical tool in the presence of unbalancedness and heteroscedasticity.

Table of Contents

1 Introduction
2 Background
3 Univariate analysis of variance models
3.1 One-way ANOVA
3.2 Two-way ANOVA with interactions
3.3 Matrix formulation of the ANOVA model
3.4 Estimation in the linear model
3.4.1 Estimation in the two-way ANOVA with interactions
3.5 Hypothesis testing in the linear model
3.5.1 Hypothesis testing in the one-way ANOVA
3.5.2 Hypothesis testing in the two-way ANOVA with interactions
4 Multivariate analysis of variance models
4.1 Heteroscedasticity of covariance matrices
4.1.1 Box’s M test
4.2 One-way MANOVA
4.3 Two-way MANOVA with interactions
4.4 Estimation in the two-way MANOVA with interactions
4.5 Hypothesis testing in the MANOVA model
4.5.1 Hypothesis testing in the two-way MANOVA with interactions
4.5.2 Wilks’ Λ
4.5.3 Hotelling-Lawley Trace
4.5.4 Pillai’s Trace
4.5.5 Characteristics of the multivariate tests
5 Unbalanced data
5.1 Unbalanced two-way ANOVA with interactions
5.1.1 Estimation
5.1.2 Hypothesis testing
6 Unbalanced two-way MANOVA with interactions
6.1 Estimation
6.2 Hypothesis testing using modified test statistics
7 Numerical examples
7.1 Real-life data example
7.1.1 Data
7.1.2 The two-way MANOVA model with interactions
7.1.3 Structure of the numerical example
7.1.4 Testing model assumptions
7.1.5 Results from the testing procedure
7.2 Simulation study
7.2.1 Construction of the simulated data
7.2.2 Testing model assumptions
7.2.3 Results from the testing procedure
7.3 Summary of test results
8 Discussion
References
Appendices
A Matrix algebra
B Summary statistics for the real-life data
C Summary statistics for the simulated data
D Univariate models for the 2 real-life data
E Univariate models for the 2 simulated data
F Codes
F.1 SAS codes
F.2 MATLAB codes
F.3 R code

1 Introduction

Analysis of variance models (ANOVA) have proven useful and applicable, especially as

a tool for experimental design, in a large variety of disciplines ranging from biostatis-

tics to economics. The models have several advantages; they are generally robust and

produce powerful tests (Littell et al., 2002; Hill & Lewicki, 2007). ANOVA-models be-

long to a class of linear models suitable when modeling a continuous response variable

against one or several qualitative explanatory variables, generally called factors, that

are measured either on a nominal or ordinal measurement scale. A main purpose of

fitting ANOVA-models is to determine how the value of the response variable is altered

by the manipulation of factors, but foremost to study differences in means between

factor levels (Sawyer, 2009).

In many situations, it is further of interest how altering the combination of factors could explain variations in not only one, but several response variables simultaneously. This multivariate set of ANOVA-models is generally referred to as MANOVA-models.

There are several advantages of using MANOVA-models instead of many univariate

ANOVA-models separately. With MANOVA, it is possible to test joint hypotheses of

differences for factor level means. MANOVA also takes into account the correlation between response variables and thus makes better use of the information in data (Littell et al., 2002).

ANOVA and MANOVA-models rely on a set of assumptions which need to be fulfilled.

These models require data to be normally distributed, homoscedastic and balanced.

However, these assumptions are seldom fair representations of real-life data. In many

situations it is crucial for an experimenter, a company or a scientist to provide solu-

tions and possible corrections when standard assumptions are violated, for instance

when data is unbalanced.

In this master thesis it is shown how the MANOVA-model can be affected if important

model assumptions are violated. The aim is to study effects of unbalanced data and

covariance heteroscedasticity on the testing procedure in the two-way MANOVA-model

with interactions. In particular, the performance of newly proposed multivariate tests in Zhang & Xiao (2012) will be evaluated and compared to the most commonly used multivariate tests provided by statistical software.


The structure of this master thesis is as follows. In Section 2, a brief overview of

the literature related to ANOVA and MANOVA models will be presented. Section 3

provides a short introduction to univariate balanced ANOVA-models. Matrix notation is introduced for the model specification, and estimation and hypothesis testing for the linear model are reviewed. Section 4 extends the balanced ANOVA-model to a balanced MANOVA-model by considering several response variables. This section further

presents the multivariate model, estimation and multivariate tests. Section 5 is devoted

to unbalanced data and its effects on model specification, estimation and hypothesis

testing in the ANOVA-model. Section 6 further investigates the consequences of unbal-

anced data for the MANOVA-model. Thorough numerical examples, which implement

the two-way MANOVA-model with interactions, will be presented and analyzed in Sec-

tion 7. Finally, the results from the numerical examples will be discussed in Section 8

together with some concluding remarks and suggestions of future studies.

2 Background

Sir Ronald A. Fisher first developed ANOVA in the 1920s as a method for analyzing agricultural and biological data. Since then, it has been extensively used in various applications (Rutherford, 2012). ANOVA was initially applied to balanced data until Frank Yates presented methods for unbalanced data analysis in the 1930s (Herr, 1986). Following in the footsteps of Yates, numerous authors have addressed unbalanced data in ANOVA-models, some of the more recent being Fujikoshi (1993), Weber & Skillings (2000), Rencher (2000), Bao & Ananda (2001) and Langsrud (2003).

After Theodore W. Anderson published his famous book "An Introduction to Multivariate Statistical Analysis" in 1958, methods of multivariate statistics including MANOVA were rapidly established (Sen, 1986). Searle (1987) was one of the first authors to present methods for estimating parameters and testing hypotheses in MANOVA-models with unbalanced data. As he also mentions, "It is preferable by far to think of the analysis of unbalanced data as quite separate from that of balanced data", thus emphasizing the need for specific methods when data is unbalanced (Searle, 1987).

Many authors, such as Shaw & Mitchell-Olds (1993), Bao & Ananda (2001) and Zhang & Xiao (2012), point out that univariate and multivariate tests for main and interaction effects in ANOVA and MANOVA models are exact when data is balanced and covariance matrices are homoscedastic. When data is unbalanced and covariances are heteroscedastic, these tests for main and interaction effects are only approximate (Searle, 1987; Shaw & Mitchell-Olds, 1993; Littell et al., 2002). In addition, these tests become too conservative and of low power, which further highlights the importance of using modified test statistics (Ananda & Weerahandi, 1997; Zhang & Xiao, 2012). Ways of adjusting tests for heteroscedasticity in ANOVA-models have been presented by, for example, Ananda & Weerahandi (1997) and Bao & Ananda (2001), who use generalized p-values to obtain exact F-statistics. For MANOVA-models, modified multivariate tests have been presented by Harrar & Bathke (2008) and Zhang & Xiao (2012).

In Harrar & Bathke (2008), non-parametric alternatives to Wilks’ Λ, Hotelling-Lawley Trace and Pillai’s Trace are proposed. Zhang & Xiao (2012) further propose two other modifications of these tests using matching of covariance matrix variance components and affine-invariant covariance matrix transformations. In a simulation study as well as in a real data example, Zhang & Xiao (2012) show that the two modified tests are indeed less conservative and of higher power than standard multivariate tests and the modified test proposed by Harrar & Bathke (2008).

Solutions to the problem of unbalanced data and heteroscedastic covariances in MANOVA-

models have not been well addressed despite the vast literature on multivariate methods.

As mentioned above, Harrar & Bathke (2008) and Zhang & Xiao (2012) have recently

presented solutions to these issues, but broader studies concerning their results are

missing and a thorough assessment of their methodology is needed. This master thesis

will therefore start to fill in this gap by investigating the performance of these newly

proposed methods.

3 Univariate analysis of variance models

ANOVA is a tool for estimating the effects of factors on a continuous response variable

with the goal of detecting differences in means for different factor categories, called

levels (Sawyer, 2009). To estimate the factor level means, it is necessary to observe

several outcomes, called replicates, given a certain combination of factor levels. If the


number of replicates for each factor combination is equal, the data is referred to as balanced. However, it is often the case that the number of replicates varies over factor levels. In this case, data is said to be unbalanced. As will be shown in more detail in Sections 5–6, the distinction between balanced and unbalanced data is of major importance for the model specification, estimation and hypothesis testing in ANOVA and MANOVA-models.

Throughout this master thesis, the focus will be entirely on fixed effects ANOVA-

models. Fixed effects models are part of a larger set of general linear models including

random effects models and mixed models. Thus, since factors are assumed to be fixed,

levels of factors are not considered to be random samples from a larger population

of levels. Hence, inference from fixed effects models is only valid within the specific

population and factors included in the model (Sawyer, 2009).

The ANOVA-model relies on several assumptions:

1. (Normality). The observed sample is assumed to be drawn from a normally

distributed population.

2. (Independence). Observations in the observed sample are independent of each

other.

3. (Homoscedasticity). The variance-covariance matrices are equal across levels of

factors.

A brief overview of one-way and two-way ANOVA models will be given in the following

subsections.

3.1 One-way ANOVA

The simplest form within the set of ANOVA-models is the one-way ANOVA means

model. In the means model, a single response variable is related to the level means of

a single factor so that

$$y_{ik} = \mu_i + \varepsilon_{ik}, \tag{1}$$

where $y_{ik}$ is the value of the response variable of the $k$th replicate for the $i$th level of a factor A, $\mu_i$ is the mean of the $i$th level of factor A, and $\varepsilon_{ik}$ is a random error, $i = 1, 2, \ldots, a$, $k = 1, 2, \ldots, n$. Further, it is assumed that the error terms are independently normally distributed with a zero mean and constant variance, $\varepsilon_{ik} \overset{iid}{\sim} N(0, \sigma^2_{\varepsilon})$. Thus, $E(y_{ik}) = \mu_i$ and $V(y_{ik}) = V(\varepsilon_{ik}) = \sigma^2_{\varepsilon}$.

In model (1), each µi is treated as an unknown fixed parameter. The random error

is assumed to vary over both replicates and factor levels, representing the difference

between the observations in each sample and the corresponding population means (?).

By expressing each factor level mean as the deviation from the overall population mean,

i.e. µi = µ + αi where αi = µi − µ, it is possible to formulate model (1) as the factor

effects model

yik = µ+ αi + εik, (2)

where µ is the overall mean and αi is the effect of the ith level of factor A, i = 1, 2, ..., a.

Due to the re-parametrization, E(yik) = µ + αi but the variance equals that in model

(1). In the factor effects model, αi represents the difference between the overall mean

µ and the mean of factor level i.

3.2 Two-way ANOVA with interactions

A natural extension of model (2) is to consider effects of two factors on the response

variable. Given two factors in the model, there are several types of effects to investigate. The main effect of a factor level is defined as the difference between the mean of that level, averaged over the levels of the second factor, and the overall population mean. It is often the case that there is a combined effect on the response variable which depends on the level combination of the two factors. Hence, one may define the interaction effect as the change in the effect of one factor on the response variable across the levels of the second factor (Littell et al., 2002).

The presence of interaction effects can be discovered when plotting the means of the

response variable for the two factors. Figure 1 shows an example of interaction and no

interaction effects between two factors A and B which have 2 and 3 levels, respectively.

In the right plot in Figure 1 it can be seen that the level means of factor B are varying

with the levels of factor A, showing that factors A and B influence each other. Hence

there are interaction effects between factors A and B.


Figure 1: Examples of interaction and no interaction effects between factors A and B. [Plot not reproduced: left panel "No Interaction", right panel "Interaction"; horizontal axis: Factor B levels (1, 2, 3); vertical axis: mean values of y; one line per level of Factor A (level 1, level 2).]

If, on the other hand, the differences between the level means of factor B are the same at every level of factor A, so that the mean profiles are parallel, there are no interaction effects. This situation is exemplified in the left plot in Figure 1.
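The distinction between the two panels can be illustrated numerically. The following is a minimal sketch, assuming two hypothetical tables of cell means (the values are invented for illustration and are not taken from Figure 1). It computes the quantities m_ij − m_i. − m_.j + m_.. for every cell; these are all zero exactly when the profiles are parallel.

```python
# Hypothetical cell means: rows are the 2 levels of factor A, columns the 3 levels of factor B.
no_interaction = [[2.0, 4.0, 6.0],
                  [4.0, 6.0, 8.0]]   # parallel profiles
interaction = [[2.0, 4.0, 6.0],
               [8.0, 6.0, 2.0]]      # crossing profiles


def interaction_effects(cell_means):
    """Return m_ij - m_i. - m_.j + m_.. for every cell of a table of cell means."""
    a = len(cell_means)
    b = len(cell_means[0])
    row_means = [sum(row) / b for row in cell_means]
    col_means = [sum(cell_means[i][j] for i in range(a)) / a for j in range(b)]
    grand_mean = sum(row_means) / a
    return [[cell_means[i][j] - row_means[i] - col_means[j] + grand_mean
             for j in range(b)] for i in range(a)]


def has_interaction(cell_means, tol=1e-12):
    """True if at least one interaction effect differs from zero by more than tol."""
    return any(abs(e) > tol for row in interaction_effects(cell_means) for e in row)


print(has_interaction(no_interaction), has_interaction(interaction))
```

With sample data the estimated interaction effects are rarely exactly zero, so in practice their size is judged with the F-test for interaction rather than with a fixed tolerance.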

The two-way model with interactions is the following:

$$y_{ijk} = \mu + \alpha_i + \beta_j + \alpha\beta_{ij} + \varepsilon_{ijk}, \tag{3}$$

where $y_{ijk}$ is the response of the $k$th replicate on the $i$th level of A and $j$th level of B, $i = 1, 2, \ldots, a$, $j = 1, 2, \ldots, b$, $k = 1, 2, \ldots, n$. Further, $\mu$ is the overall mean, $\alpha_i$ and $\beta_j$ are the main effects of the $i$th and $j$th levels of factors A and B, respectively, and $\alpha\beta_{ij}$ is the interaction effect of the $i$th and $j$th levels of A and B. In (3), it is assumed that $\varepsilon_{ijk} \overset{iid}{\sim} N(0, \sigma^2_{\varepsilon})$, so that $E(y_{ijk}) = \mu + \alpha_i + \beta_j + \alpha\beta_{ij}$ and $Var(y_{ijk}) = \sigma^2_{\varepsilon}$.

3.3 Matrix formulation of the ANOVA model

Generally, factor effects models (1)–(3) can be written as a linear model in matrix form:

y = Xβ + ε, (4)

where $y : n \times 1$ represents the vector of responses, $X : n \times p$ is a known design matrix, $\beta : p \times 1$ is a vector of fixed effects to be estimated, and $\varepsilon : n \times 1$ is a vector of random errors such that $\varepsilon \sim N_n(0_n, \sigma^2_{\varepsilon} I_n)$. Hence, $E(y) = X\beta$ and $Var(y) = \sigma^2_{\varepsilon} I_n$. Here, $0_n : n \times 1$ is a vector with all components equal to zero, and $I_n$ denotes the identity matrix of size $n$.

The utilization of matrix notation facilitates the analysis of factor effects models due

to its ability to express computations in a compact format. Matrix calculations in this

section mainly focus on inference for model (4). All necessary definitions are given in

appendix A.

The estimation in a linear model with fixed effects is briefly discussed in the next

section.

3.4 Estimation in the linear model

A two-way ANOVA with interactions, which is of interest in this master thesis, belongs

to a class of fixed effects models. This model can also be written as:

y = Xβ + ε,

where $\beta = (\mu, \alpha_i, \beta_j, \alpha\beta_{ij})'$. There are several approaches to estimating the parameter vector $\beta$ in the model $y = X\beta + \varepsilon$. A common approach is to use the method of least squares, i.e. to find an estimator $\hat{\beta}$ so that the error sum of squares is minimized:

$$\varepsilon'\varepsilon = (y - X\beta)'(y - X\beta) \rightarrow \min.$$

Observe that:

$$\varepsilon'\varepsilon = (y - X\beta)'(y - X\beta) = y'y - 2\beta'X'y + \beta'X'X\beta, \tag{5}$$

since the transpose of a scalar is the scalar itself, i.e. $y'X\beta = \beta'X'y$. Using matrix differentiation rules presented in e.g. Harville (2008), the minimization problem reduces to solving the following equation:

$$\frac{d\,\varepsilon'\varepsilon}{d\beta} = -2X'y + 2X'X\beta = 0, \tag{6}$$

which gives the system of normal equations:

$$X'X\hat{\beta} = X'y. \tag{7}$$

It can be shown that, when $X'X$ is nonsingular, the unique solution to (7) is $\hat{\beta} = (X'X)^{-1}X'y$, where $(X'X)^{-1}$ is the inverse of $X'X$. This parameter value indeed gives the minimum of $\varepsilon'\varepsilon$ in equation (5), since the second derivative of (5), $2X'X$, is positive definite (Rencher, 2000). The least squares estimator $\hat{\beta}$ has many useful properties. For instance, as shown in Weber & Skillings (2000) and Rencher (2000) among others, $\hat{\beta}$ is a best linear unbiased estimator (BLUE).
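As an illustration of the normal equations (7), the following minimal sketch solves X'Xβ = X'y for a small full-rank design. The data, the helper functions `transpose`, `matmul` and `solve`, and the choice of a one-way means model are assumptions made for this example; they are not the code used in the thesis. For a means model the least squares solution should reproduce the level means.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def solve(a, rhs):
    """Solve a x = rhs by Gauss-Jordan elimination with partial pivoting (a nonsingular)."""
    n = len(a)
    aug = [list(a[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * p for x, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


# Hypothetical one-way means model with a = 2 levels and n = 3 replicates per level.
# Each row of X indicates the level of the observation, so X has full column rank.
X = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]
y = [4.0, 5.0, 6.0, 9.0, 10.0, 11.0]

Xt = transpose(X)
XtX = matmul(Xt, X)                                       # X'X
Xty = [row[0] for row in matmul(Xt, [[v] for v in y])]    # X'y
beta_hat = solve(XtX, Xty)                                # solution of X'X beta = X'y

print(beta_hat)  # the two level means
```

The design is deliberately of full rank so that X'X can be inverted; the factor effects parametrization discussed next does not have this property.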

3.4.1 Estimation in the two-way ANOVA with interactions

Since factor effects models (2) and (3) are expressed in terms of differences from an overall mean µ, these models include more parameters than there are linearly independent equations in the system of normal equations. In other words, these models are less than full rank. As a result, when (2) and (3) are expressed in terms of (4), there are infinitely many solutions to the normal equations in (7), implying that β is not estimable for factor effects models.

One way to solve the system of normal equations (7) uniquely is to impose restrictions

on the included parameters in β using linear constraints. The parameter restrictions

in the two-way ANOVA model with interactions (3) can be expressed as the following

independent linear constraints:

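$$\sum_{i=1}^{a} \alpha_i = 0, \qquad \sum_{j=1}^{b} \beta_j = 0, \qquad \sum_{i=1}^{a} \alpha\beta_{ij} = 0 \ \text{for each } j, \qquad \sum_{j=1}^{b} \alpha\beta_{ij} = 0 \ \text{for each } i. \tag{8}$$

The least squares estimators subject to the constraints (8) can then be derived as:

$$\hat{\mu} = \bar{y}_{...},$$
$$\hat{\alpha}_i = \bar{y}_{i..} - \hat{\mu} = \bar{y}_{i..} - \bar{y}_{...}, \quad i = 1, \ldots, a,$$
$$\hat{\beta}_j = \bar{y}_{.j.} - \hat{\mu} = \bar{y}_{.j.} - \bar{y}_{...}, \quad j = 1, \ldots, b,$$
$$\widehat{\alpha\beta}_{ij} = \bar{y}_{ij.} - \bar{y}_{i..} - \bar{y}_{.j.} + \bar{y}_{...}.$$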


In the above expressions, the dot-notation of for instance $\bar{y}_{i..}$ represents the average of $y$ for level $i$ of factor A, taken over all possible $j = 1, 2, \ldots, b$ and $k = 1, 2, \ldots, n$, so that:

$$\bar{y}_{i..} = \frac{1}{bn} \sum_{j=1}^{b} \sum_{k=1}^{n} y_{ijk}.$$
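The estimators above are simple functions of cell, row, column and overall averages. The sketch below computes them for a small balanced data set and checks that they satisfy the sum-to-zero constraints (8). The data values and variable names are hypothetical and chosen only for illustration.

```python
# Hypothetical balanced data: y[i][j] holds the n replicates for level i of A and level j of B.
y = [
    [[3.0, 5.0], [6.0, 8.0], [9.0, 11.0]],     # level 1 of A
    [[10.0, 12.0], [7.0, 9.0], [2.0, 4.0]],    # level 2 of A
]
a, b, n = len(y), len(y[0]), len(y[0][0])


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# With balanced data, averages of cell means equal the corresponding marginal averages.
cell = [[mean(y[i][j]) for j in range(b)] for i in range(a)]      # cell averages
row = [mean(cell[i]) for i in range(a)]                           # averages per level of A
col = [mean(cell[i][j] for i in range(a)) for j in range(b)]      # averages per level of B
mu_hat = mean(row)                                                # overall average

alpha_hat = [row[i] - mu_hat for i in range(a)]
beta_hat = [col[j] - mu_hat for j in range(b)]
ab_hat = [[cell[i][j] - row[i] - col[j] + mu_hat for j in range(b)] for i in range(a)]

print(mu_hat, alpha_hat, beta_hat, ab_hat)
```

The shortcut of averaging cell means is only valid because every cell has the same number of replicates; with unbalanced data the marginal averages would have to be weighted.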

3.5 Hypothesis testing in the linear model

Hypothesis tests concerning parameters in linear models written in the form $y = X\beta + \varepsilon$ can be expressed in terms of the general null hypothesis:

$$H_0 : C\beta = 0, \tag{9}$$

where $C : m \times p$ is a coefficient matrix, $\beta : p \times 1$ is the vector of unknown parameters and $m$ is the number of linearly independent estimable functions of $C\beta$. The general hypothesis is a convenient way of expressing the subsets of hypotheses that one is interested in for a given linear model. Assuming that $y \sim N_n(X\beta, \sigma^2 I_n)$, it can be shown that under the null hypothesis (9), the test statistic

$$F = \frac{(C\hat{\beta})'[C(X'X)^{-1}C']^{-1}C\hat{\beta}/m}{SSE/(n-k)} \overset{H_0}{\sim} F(m, n-k). \tag{10}$$

In equation (10), $(C\hat{\beta})'[C(X'X)^{-1}C']^{-1}C\hat{\beta}$ is the sum of squares corresponding to the null hypothesis in (9), $SSE$ denotes the error sum of squares and $k$ denotes the rank of $X$. The general hypothesis can be used for testing null hypotheses about parameters of interest in specific models, for instance ANOVA and MANOVA-models.

3.5.1 Hypothesis testing in the one-way ANOVA

The classic null hypothesis in the one-way ANOVA-model (2) can be stated as follows:

$$H_0 : \alpha_1 = \alpha_2 = \ldots = \alpha_a = 0, \tag{11}$$

i.e. under the null hypothesis it is assumed that there is no effect of factor A. As mentioned by for instance Casella & Berger (2002) and Sawyer (2009), the idea of ANOVA is to partition the total variance into components. Under the hypothesis (11), it is possible to show that the total variation can be partitioned as:

$$SST = SSA + SSE,$$

where SST is the total sum of squares, SSA is the sum of squares of factor A and SSE is the error sum of squares. The one-way ANOVA table can then be expressed as:

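Table 1: One-way ANOVA table

$$
\begin{array}{lllll}
\text{Source} & \text{df} & \text{Sum of Squares} & \text{MS} & F \\
\hline
\text{A} & a-1 & \sum_{i=1}^{a}\sum_{k=1}^{n}(\bar{y}_{i.} - \bar{y}_{..})^2 & \frac{SSA}{a-1} & \frac{MSA}{MSE} \\
\text{Error} & a(n-1) & \sum_{i=1}^{a}\sum_{k=1}^{n}(y_{ik} - \bar{y}_{i.})^2 & \frac{SSE}{a(n-1)} & \\
\text{Total} & an-1 & \sum_{i=1}^{a}\sum_{k=1}^{n}(y_{ik} - \bar{y}_{..})^2 & &
\end{array}
$$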

The sums of squares presented in Table 1 can equivalently be written in matrix form:

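Table 2: Matrix notation for the one-way ANOVA table

$$
\begin{array}{lllll}
\text{Source} & \text{df} & \text{Sum of Squares} & \text{MS} & F \\
\hline
\text{A} & a-1 & y'(H - \tfrac{1}{n}J_n)y & \frac{SSA}{a-1} & \frac{MSA}{MSE} \\
\text{Error} & a(n-1) & y'(I_n - H)y & \frac{SSE}{a(n-1)} & \\
\text{Total} & an-1 & y'(I_n - \tfrac{1}{n}J_n)y & &
\end{array}
$$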

Here, In is the identity matrix, Jn is a matrix of ones, and H = X(X ′X)−1X ′ denotes

the hat matrix, all matrices being of size n.
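The entries of the one-way ANOVA table can be computed directly from the data. The following is a minimal sketch on a hypothetical balanced data set with a = 3 levels and n = 4 replicates per level (the values are invented for illustration). It verifies the partition SST = SSA + SSE and forms the F statistic MSA/MSE.

```python
# Hypothetical balanced one-way layout: a = 3 levels, n = 4 replicates per level.
groups = [
    [4.0, 5.0, 6.0, 5.0],
    [7.0, 8.0, 9.0, 8.0],
    [1.0, 2.0, 3.0, 2.0],
]
a, n = len(groups), len(groups[0])

grand_mean = sum(sum(g) for g in groups) / (a * n)
level_means = [sum(g) / n for g in groups]

# Sums of squares as defined in the one-way ANOVA table.
ssa = sum(n * (m - grand_mean) ** 2 for m in level_means)
sse = sum((v - m) ** 2 for g, m in zip(groups, level_means) for v in g)
sst = sum((v - grand_mean) ** 2 for g in groups for v in g)

msa = ssa / (a - 1)          # mean square for factor A, df = a - 1
mse = sse / (a * (n - 1))    # mean square error, df = a(n - 1)
f_stat = msa / mse

print(ssa, sse, sst, f_stat)
```

Under the null hypothesis of no factor effect, `f_stat` would be compared with an F distribution with a − 1 and a(n − 1) degrees of freedom; the p-value is omitted here because it requires a distribution function that is not part of the Python standard library.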

3.5.2 Hypothesis testing in the two-way ANOVA with interactions

In the two-way ANOVA model with interactions (3), the general hypothesis (9) under the constraints (8) can be written as the following three sets of null hypotheses:

$$H_{0A} : \alpha_1 = \alpha_2 = \ldots = \alpha_a = 0,$$
$$H_{0B} : \beta_1 = \beta_2 = \ldots = \beta_b = 0, \tag{12}$$
$$H_{0AB} : \alpha\beta_{11} = \ldots = \alpha\beta_{1b} = \ldots = \alpha\beta_{a1} = \ldots = \alpha\beta_{ab} = 0.$$

The first two null hypotheses test the presence of main effects of factors A and B, respectively, and the third hypothesis tests the presence of an interaction effect between factors A and B. As with the test statistic for the general linear hypothesis stated in (10), the test statistics under the three null hypotheses follow F-distributions. The total sum of squares of all observations in the two-way ANOVA model can be partitioned into independent sources of variation in the following way:

$$SST = SSA + SSB + SSAB + SSE. \tag{13}$$

Since SSA, SSB and SSAB are independent, it is possible to test these three hypotheses concerning factor effects separately. A summary of the decomposition of sums of squares in the two-way ANOVA is given in Table 3.

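Table 3: Two-way ANOVA table

$$
\begin{array}{lllll}
\text{Source} & \text{df} & \text{Sum of Squares} & \text{MS} & F \\
\hline
\text{A} & a-1 & \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n}(\bar{y}_{i..} - \bar{y}_{...})^2 & \frac{SSA}{a-1} & \frac{MSA}{MSE} \\
\text{B} & b-1 & \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n}(\bar{y}_{.j.} - \bar{y}_{...})^2 & \frac{SSB}{b-1} & \frac{MSB}{MSE} \\
\text{AB} & (a-1)(b-1) & \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n}(\bar{y}_{ij.} - \bar{y}_{i..} - \bar{y}_{.j.} + \bar{y}_{...})^2 & \frac{SSAB}{(a-1)(b-1)} & \frac{MSAB}{MSE} \\
\text{Error} & ab(n-1) & \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n}(y_{ijk} - \bar{y}_{ij.})^2 & \frac{SSE}{ab(n-1)} & \\
\text{Total} & abn-1 & \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n}(y_{ijk} - \bar{y}_{...})^2 & &
\end{array}
$$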
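The decomposition in Table 3 can be checked numerically. The sketch below, which assumes a small hypothetical balanced data set (a = 2, b = 3, n = 2), computes the four sums of squares, verifies that they add up to the total sum of squares, and forms the three F statistics.

```python
# Hypothetical balanced data: y[i][j] holds the n replicates for level i of A and level j of B.
y = [
    [[3.0, 5.0], [6.0, 8.0], [9.0, 11.0]],
    [[10.0, 12.0], [7.0, 9.0], [2.0, 4.0]],
]
a, b, n = len(y), len(y[0]), len(y[0][0])


def mean(values):
    values = list(values)
    return sum(values) / len(values)


cell = [[mean(y[i][j]) for j in range(b)] for i in range(a)]
row = [mean(cell[i]) for i in range(a)]
col = [mean(cell[i][j] for i in range(a)) for j in range(b)]
grand = mean(row)

cells = [(i, j) for i in range(a) for j in range(b)]
ssa = b * n * sum((row[i] - grand) ** 2 for i in range(a))
ssb = a * n * sum((col[j] - grand) ** 2 for j in range(b))
ssab = n * sum((cell[i][j] - row[i] - col[j] + grand) ** 2 for i, j in cells)
sse = sum((v - cell[i][j]) ** 2 for i, j in cells for v in y[i][j])
sst = sum((v - grand) ** 2 for i, j in cells for v in y[i][j])

mse = sse / (a * b * (n - 1))
f_a = (ssa / (a - 1)) / mse
f_b = (ssb / (b - 1)) / mse
f_ab = (ssab / ((a - 1) * (b - 1))) / mse

print(ssa, ssb, ssab, sse, sst)
```

The additive identity relies on the data being balanced; with unequal cell sizes the same formulas no longer give an orthogonal partition.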

4 Multivariate analysis of variance models

The multivariate analysis of variance (MANOVA) is an extension of ANOVA in which

the effects of factors are assessed on a linear combination of several response variables.

A multivariate generalization of the ANOVA-model was first addressed by Wilks (1932). Nowadays, the MANOVA methodology is well established and widely used in many research areas, ranging from biology to psychology (Casella & Berger, 2002; Zhang & Xiao, 2012).

The MANOVA-model has many advantages over simultaneous estimation of several

ANOVA-models:


• MANOVA tests whether there are significant differences among combinations of

factor levels on several response variables. Thus using MANOVA, one is able to

test joint hypotheses of all univariate ANOVA models and more likely to observe

differences between factor levels. For instance, two factors may have no main or

interaction effects on two different response variables separately but only jointly.

• Fitting one MANOVA-model instead of several ANOVA-models decreases the experimentwise Type I error probability. As a simple example, suppose that α = 5% for the F-tests in 6 separate ANOVA-models. Then, the experimentwise Type I error could be as large as 30% (the Bonferroni bound 6 × 5%; about 26.5% if the six tests are independent), whereas an overall F-test for the included models in the MANOVA-model would imply a 5% Type I error probability (Littell et al., 2002).

• Several ANOVA-models estimated separately do not take into account the covariance pattern among response variables. On the other hand, the MANOVA-model is sensitive not only to mean differences of factor levels but also to the covariation between response variables. When response variables are studied together, they are likely to be correlated to at least some extent, and by conducting several ANOVA analyses this correlation would be lost (Littell et al., 2002).
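The experimentwise Type I error quoted in the second point can be made concrete. The figure of 30% for six tests at α = 5% corresponds to the Bonferroni upper bound mα; if the six tests are additionally assumed to be independent (an assumption made here only for illustration), the probability of at least one false rejection is 1 − (1 − α)^m. A minimal sketch with hypothetical function names:

```python
def familywise_error_independent(alpha, m):
    """Probability of at least one Type I error among m independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_bound(alpha, m):
    """Upper bound on the familywise error that holds for any dependence between the tests."""
    return min(1.0, m * alpha)


alpha, m = 0.05, 6
fwer = familywise_error_independent(alpha, m)
bound = bonferroni_bound(alpha, m)

print(round(fwer, 4), round(bound, 4))
```

Response variables analysed in separate ANOVA-models are usually correlated, so the true experimentwise error lies somewhere between α and the Bonferroni bound.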

As for the univariate ANOVA-model, the complexity of the MANOVA-model is rapidly

increasing with the number of factors included in the model. The model specification is

in many ways similar to its univariate analogues presented in Section 3. The assumptions for the MANOVA-model are the same as for the ANOVA-model, but extended to comprise multivariate normality. Equality of covariance matrices across factor combinations is still assumed, so that:

$$\Sigma_{11} = \Sigma_{12} = \ldots = \Sigma_{1b} = \ldots = \Sigma_{a1} = \ldots = \Sigma_{ab} = \Sigma,$$

where $\Sigma : p \times p$ is an unknown covariance matrix.

4.1 Heteroscedasticity of covariance matrices

For the standard ANOVA and MANOVA-model, it is assumed that investigated samples

are independent, follow a normal distribution, and have constant covariance matrices

over factor level combinations. Balanced data does not in itself imply that covariances are equal, and for a given sample, covariances may not in fact be equal for each factor combination. It has been shown that estimation in balanced ANOVA and MANOVA-models is robust even with minor deviations from the assumption of equal covariance matrices (Timm, 2002; Rencher, 2003). This is however not the case when data is unbalanced, which is discussed in Section 6.

4.1.1 Box’s M test

One multivariate test of equality of covariance matrices is Box’s M test, named after Box (1949) who first developed the test. In a similar manner to Levene’s test for the univariate model, Box’s M tests the equality of covariance matrices across factor levels in the MANOVA-model.

Thus, with Box’s M, one is interested in testing the null hypothesis:

$$H_0 : \Sigma_1 = \Sigma_2 = \ldots = \Sigma_i = \ldots = \Sigma_k = \Sigma, \tag{14}$$

where $\Sigma_i : p \times p$ is the covariance matrix of the $i$th combination of factors, $i = 1, 2, \ldots, k$, in the MANOVA-model with $p$ response variables. Setting $n = \sum_{i=1}^{k} n_i$ and $v_i = n_i - 1$, under the null hypothesis (14), the pooled estimator of the common covariance matrix is:

$$S = \sum_{i=1}^{k} \frac{v_i S_i}{n - k},$$

where $n_i$ is the number of replicates on the $i$th factor combination, and $S_i$ is an unbiased estimator of $\Sigma_i$. A generalized likelihood ratio test statistic can then be calculated as:

$$M = (n - k) \log |S| - \sum_{i=1}^{k} v_i \log |S_i|.$$

Using scale factors, Box’s M can be approximated by either a $\chi^2$ or an $F$-distribution. For both approximations, the null hypothesis of homoscedasticity is rejected for large values of the scaled test statistics (Box, 1949). As Timm (2002) notes, the $\chi^2$ approximation is preferred when $n_i < 20$, $p < 6$ and $k < 6$. Otherwise, an $F$ approximation is recommended.
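To make the construction of Box's M concrete, the sketch below computes the unscaled statistic M = (n − k) log|S| − Σ v_i log|S_i| for p = 2 response variables. The two samples, the function names, and the restriction to 2 × 2 determinants are assumptions made to keep the example free of external libraries. The scale factors needed for the χ² or F approximation are not included.

```python
import math


def covariance(sample):
    """Unbiased sample covariance matrix S_i (divisor n_i - 1) for a list of p-vectors."""
    n, p = len(sample), len(sample[0])
    m = [sum(obs[j] for obs in sample) / n for j in range(p)]
    return [[sum((obs[r] - m[r]) * (obs[c] - m[c]) for obs in sample) / (n - 1)
             for c in range(p)] for r in range(p)]


def det2(m):
    """Determinant of a 2 x 2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def box_m(samples):
    """Unscaled Box's M for k samples of bivariate observations."""
    k = len(samples)
    n_total = sum(len(s) for s in samples)
    dfs = [len(s) - 1 for s in samples]                 # v_i = n_i - 1
    covs = [covariance(s) for s in samples]             # S_i
    pooled = [[sum(v * s[r][c] for v, s in zip(dfs, covs)) / (n_total - k)
               for c in range(2)] for r in range(2)]    # pooled estimator S
    return ((n_total - k) * math.log(det2(pooled))
            - sum(v * math.log(det2(s)) for v, s in zip(dfs, covs)))


# Hypothetical unbalanced samples for two factor combinations (p = 2).
group_1 = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0)]
group_2 = [(2.0, 6.0), (4.0, 1.0), (6.0, 9.0), (8.0, 2.0), (10.0, 7.0)]

m_stat = box_m([group_1, group_2])
print(m_stat)
```

M equals zero when all sample covariance matrices coincide and grows as they diverge, which is why the null hypothesis is rejected for large values.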


4.2 One-way MANOVA

In the one-way MANOVA-model, a single factor explains the variation in a set of

response variables. The factor effects one-way MANOVA-model is the following:

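$$\boldsymbol{y}_{ik} = \boldsymbol{\mu} + \boldsymbol{\alpha}_i + \boldsymbol{\varepsilon}_{ik}, \tag{15}$$

where $\boldsymbol{y}_{ik} : p \times 1$ is a vector of $p$ response variables for the $k$th replicate on the $i$th level of factor A, $i = 1, 2, \ldots, a$, and $\boldsymbol{\alpha}_i : p \times 1$ is the vector of effects for level $i$ of factor A. More specifically, $\boldsymbol{\alpha}_i = \boldsymbol{\mu}_i - \boldsymbol{\mu}$, showing that the vector of effects can be interpreted as the deviation from the vector of overall means. Further, it is assumed that errors are independently normally distributed with a zero mean and constant covariance matrix, $\boldsymbol{\varepsilon}_{ik} \sim N_p(\boldsymbol{0}, \Sigma)$. Thus,

$$E(\boldsymbol{y}_{ik}) = \boldsymbol{\mu} + \boldsymbol{\alpha}_i \quad \text{and} \quad V(\boldsymbol{y}_{ik}) = V(\boldsymbol{\varepsilon}_{ik}) = \Sigma.$$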

Using matrix notation, one could express the one-way MANOVA-model (15) as:

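$$\boldsymbol{y} = 1_{ak} \otimes \boldsymbol{\mu} + I_a \otimes 1_k \otimes \boldsymbol{\alpha} + I_{ak} \otimes \boldsymbol{\varepsilon},$$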

where ⊗ denotes the Kronecker product of two matrices (see Appendix A).

4.3 Two-way MANOVA with interactions

Similarly to the univariate model (3), the two-way MANOVA model with interactions

is expressed as:

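$$\boldsymbol{y}_{ijk} = \boldsymbol{\mu} + \boldsymbol{\alpha}_i + \boldsymbol{\beta}_j + \boldsymbol{\alpha\beta}_{ij} + \boldsymbol{\varepsilon}_{ijk}, \tag{16}$$

where $\boldsymbol{y}_{ijk} : p \times 1$ is a vector of $p$ response variables for the $k$th replicate on the $i$th level of factor A, and the $j$th level of factor B, $i = 1, 2, \ldots, a$, $j = 1, 2, \ldots, b$, $k = 1, 2, \ldots, n$. In the two-way MANOVA-model, vectors $\boldsymbol{\alpha}_i$, $\boldsymbol{\beta}_j$ and $\boldsymbol{\alpha\beta}_{ij}$ represent main and interaction effects, respectively. Also, it is assumed that $\boldsymbol{\varepsilon}_{ijk} \overset{iid}{\sim} N_p(\boldsymbol{0}, \Sigma)$ so that:

$$E(\boldsymbol{y}_{ijk}) = \boldsymbol{\mu} + \boldsymbol{\alpha}_i + \boldsymbol{\beta}_j + \boldsymbol{\alpha\beta}_{ij} \quad \text{and} \quad V(\boldsymbol{y}_{ijk}) = V(\boldsymbol{\varepsilon}_{ijk}) = \Sigma.$$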


The matrix notation for model (16) is the following:

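$$\boldsymbol{y} = 1_{abk} \otimes \boldsymbol{\mu} + I_a \otimes 1_{bk} \otimes \boldsymbol{\alpha} + 1_a \otimes I_b \otimes 1_k \otimes \boldsymbol{\beta} + I_{ab} \otimes 1_k \otimes \boldsymbol{\alpha\beta} + I_{abk} \otimes \boldsymbol{\varepsilon}.$$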

4.4 Estimation in the two-way MANOVA with interactions

As for the univariate models described in Section 3, the effects in model (16) are not

estimable due to over-parametrization of the model. As a result, constraints must be

imposed on the parameters αi,βj,αβij. Based on the notation in e.g. Zhang & Xiao

(2012), the constraints could be expressed as:

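$$\sum_{i=1}^{a} \boldsymbol{\alpha}_i = \boldsymbol{0}, \qquad \sum_{j=1}^{b} \boldsymbol{\beta}_j = \boldsymbol{0}, \qquad \sum_{i=1}^{a} \boldsymbol{\alpha\beta}_{ij} = \boldsymbol{0} \ \text{for each } j, \qquad \sum_{j=1}^{b} \boldsymbol{\alpha\beta}_{ij} = \boldsymbol{0} \ \text{for each } i. \tag{17}$$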

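Given the above constraints, the estimators in the two-way MANOVA model with interactions are obtained as the solutions to the system of normal equations, in analogy with Section 3.4:

$$\hat{\boldsymbol{\mu}} = \bar{\boldsymbol{y}}_{...},$$
$$\hat{\boldsymbol{\alpha}}_i = \bar{\boldsymbol{y}}_{i..} - \bar{\boldsymbol{y}}_{...},$$
$$\hat{\boldsymbol{\beta}}_j = \bar{\boldsymbol{y}}_{.j.} - \bar{\boldsymbol{y}}_{...},$$
$$\widehat{\boldsymbol{\alpha\beta}}_{ij} = \bar{\boldsymbol{y}}_{ij.} - \bar{\boldsymbol{y}}_{i..} - \bar{\boldsymbol{y}}_{.j.} + \bar{\boldsymbol{y}}_{...},$$

where the dot-notation of e.g. $\bar{\boldsymbol{y}}_{...}$ represents the average of $\boldsymbol{y}$ over all possible $(i, j, k)$, so that:

$$\bar{\boldsymbol{y}}_{...} = \frac{1}{abn} \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n} \boldsymbol{y}_{ijk}.$$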


4.5 Hypothesis testing in the MANOVA model

For the MANOVA-model, the testing of hypotheses based on the partitioning of sums of squares becomes more complex because of the interrelationships between the p included ANOVA-models. Unlike in the univariate models, one must now consider not only sums of squares but also cross products for the factors in the MANOVA-model. In the resulting matrices, called sums of squares and cross products (SSCP) matrices, diagonal elements correspond to the usual sums of squares for each of the p response variables, whereas the off-diagonal elements correspond to the cross products for each response variable pair.

When data is balanced, the partitioning of SSCP matrices is independent in anal-

ogy with the ANOVA-models described in Section 3.5. For example, in the one-way

MANOVA-model:

T = H +E,

where T : p× p is the total SSCP matrix, H : p× p is the hypothesis SSCP matrix and

E : p× p is the error SSCP matrix.

4.5.1 Hypothesis testing in the two-way MANOVA with interactions

In this master thesis, the main focus will be on the two-way MANOVA-model with

interactions. As for the univariate model, it is possible to set up hypotheses about

vectors of parameters in the MANOVA-model using the general hypothesis in equation

(9). Partitioning the associated SSCP matrices makes it possible to conduct multivari-

ate tests of both main effects of A and B as well as the interaction effects between the

two factors A and B. Using the notation proposed by Zhang & Xiao (2012), the general

hypothesis under the two-way MANOVA with interaction (16) can be written as the

following set of hypotheses:

H0A : α1 = α2 = . . . = αa = 0,

H0B : β1 = β2 = . . . = βb = 0, (18)

H0AB : αβ11 = . . . = αβ1b = . . . = αβa1 = . . . = αβab = 0,


where H0A tests the main effects of factor A, H0B tests the main effects of factor B and

H0AB tests the interaction effects of A and B. The independent partitioning of SSCP

matrices associated with the hypotheses in (18) could be written as:

T = HA +HB +HAB +E.

In order to test hypotheses about several response variables simultaneously in MANOVA-

models, the standard F-tests for main and interaction effects in the ANOVA-models

have to be generalized. The multivariate tests concerning the effects in the linear

model are in many ways similar to univariate F-tests except that the sums of squares

for effects are replaced, due to the covariance between responses, by SSCP matrices

(Littell et al., 2002).

The SSCP matrices HA, HB, HAB, T and E are expressed explicitly in Table 4:

Table 4: Multivariate analysis of variance table

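| Source | df | Sums of squares and cross products matrix |
|---|---|---|
| A | $a-1$ | $H_A = nb\sum_{i=1}^{a}(\bar{y}_{i..}-\bar{y}_{...})(\bar{y}_{i..}-\bar{y}_{...})'$ |
| B | $b-1$ | $H_B = na\sum_{j=1}^{b}(\bar{y}_{.j.}-\bar{y}_{...})(\bar{y}_{.j.}-\bar{y}_{...})'$ |
| AB | $(a-1)(b-1)$ | $H_{AB} = n\sum_{i=1}^{a}\sum_{j=1}^{b}(\bar{y}_{ij.}-\bar{y}_{i..}-\bar{y}_{.j.}+\bar{y}_{...})(\bar{y}_{ij.}-\bar{y}_{i..}-\bar{y}_{.j.}+\bar{y}_{...})'$ |
| Error | $ab(n-1)$ | $E = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n}(y_{ijk}-\bar{y}_{ij.})(y_{ijk}-\bar{y}_{ij.})'$ |
| Total | $abn-1$ | $T = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n}(y_{ijk}-\bar{y}_{...})(y_{ijk}-\bar{y}_{...})'$ |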
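The entries of Table 4 and the partitioning T = HA + HB + HAB + E can be checked numerically. The sketch below assumes balanced data stored in an array of shape (a, b, n, p); the helper names `sscp` and `balanced_sscp` are hypothetical.

```python
import numpy as np

def sscp(d):
    """Sum of outer products d d' over all rows; d has shape (..., p)."""
    d = d.reshape(-1, d.shape[-1])
    return d.T @ d

def balanced_sscp(y):
    """SSCP matrices of Table 4 for balanced data of shape (a, b, n, p)."""
    a, b, n, p = y.shape
    grand = y.mean(axis=(0, 1, 2))
    mean_a = y.mean(axis=(1, 2))
    mean_b = y.mean(axis=(0, 2))
    cell = y.mean(axis=2)
    H_A = n * b * sscp(mean_a - grand)
    H_B = n * a * sscp(mean_b - grand)
    H_AB = n * sscp(cell - mean_a[:, None, :] - mean_b[None, :, :] + grand)
    E = sscp(y - cell[:, :, None, :])
    T = sscp(y - grand)
    return H_A, H_B, H_AB, E, T

# Hypothetical balanced data: a = 2, b = 4, n = 15, p = 3.
rng = np.random.default_rng(1)
y = rng.normal(size=(2, 4, 15, 3))
H_A, H_B, H_AB, E, T = balanced_sscp(y)
```

For balanced data the four component matrices add up to T; with unequal cell sizes this identity generally fails, which is the motivation for Sections 5 and 6.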

The multivariate tests in MANOVA are based on the relation between the hypothesis

SSCP matrix H and the error SSCP matrix E. A basis for these tests is the matrix $E^{-1}H$, showing that H corresponds to the numerator of the test and E to the denominator of the test. Three commonly used multivariate tests, all being functions of the matrix $E^{-1}H$, are Wilks’ Λ, Hotelling-Lawley Trace and Pillai’s Trace.

It should be noted that in Sections 4.5.2–4.5.4, H symbolizes the hypothesis tested

in the MANOVA-model. For instance, H = HAB when testing for interaction effects.


4.5.2 Wilks’ Λ

Under the null hypothesis of no factor γ effects:

H0 : γ = 0, (19)

the likelihood ratio test statistic,

$$\Lambda = \frac{|E|}{|E + H|},$$

is generally known as Wilks’ Λ after Wilks (1932). The null hypothesis is rejected for

small values of Λ, showing that E is small compared to the total SSCP matrix E+H .

4.5.3 Hotelling-Lawley Trace

The test statistic:

$$U = \operatorname{tr}(E^{-1}H),$$

is often referred to as the Hotelling-Lawley Trace after Lawley (1938) and Hotelling (1947), who took part in developing the statistic. Naturally, a large H relative to E gives a larger trace and stronger evidence against the null hypothesis. Hence the null hypothesis (19) of no effects is rejected for large values of U.

4.5.4 Pillai’s Trace

Pillai (1955) developed the following statistic:

$$V = \operatorname{tr}\left((E + H)^{-1}H\right),$$

which is commonly known as Pillai’s Trace. As with the Hotelling-Lawley Trace, the null hypothesis (19) is rejected for large values of V, indicating a large H relative to E.
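All three statistics in Sections 4.5.2–4.5.4 are functions of the eigenvalues of $E^{-1}H$, which suggests computing them together. The following is a minimal sketch; the function name `manova_statistics` and the small matrices H and E are made up for illustration.

```python
import numpy as np

def manova_statistics(H, E):
    """Wilks' Lambda, Hotelling-Lawley Trace and Pillai's Trace from H and E."""
    # Eigenvalues of E^{-1} H; solve() avoids forming the inverse explicitly.
    eig = np.linalg.eigvals(np.linalg.solve(E, H)).real
    wilks = np.prod(1.0 / (1.0 + eig))    # |E| / |E + H|
    hotelling_lawley = np.sum(eig)        # tr(E^{-1} H)
    pillai = np.sum(eig / (1.0 + eig))    # tr((E + H)^{-1} H)
    return wilks, hotelling_lawley, pillai

# Hypothetical 2 x 2 SSCP matrices (both positive definite).
E = np.array([[4.0, 1.0], [1.0, 3.0]])
H = np.array([[2.0, 0.5], [0.5, 1.0]])
wilks, hlt, pillai = manova_statistics(H, E)
```

Small values of `wilks` and large values of `hlt` and `pillai` all point in the same direction, namely towards rejecting the null hypothesis (19).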

4.5.5 Characteristics of the multivariate tests

Wilks’ Lambda, Hotelling-Lawley Trace and Pillai’s Trace are all exact tests, meaning

that the probability of rejecting H0 in (19) when H0 is true exactly equals α (Rencher,

2003). However, these tests have different probabilities of rejection when H0 is false,


thus implying that the tests have different power for a given sample. In general, none

of these multivariate tests is uniformly better than the other two, although there might

be situations where one test is preferred (Littell et al., 2002; Harrar & Bathke, 2008).

All three tests are also robust when data is balanced (Timm, 2002; Rencher, 2003).

Wilks’ Lambda, Hotelling-Lawley Trace and Pillai’s Trace are usually approximated

with the F-distribution (see e.g. Rencher (2003) for more details on these approxima-

tions).

5 Unbalanced data

So far in this master thesis, the analysis has only been considering the case when

the data is balanced, i.e. when there are equally many observations for each factor

level combination. Nevertheless, it is not always the case that the observed data is

balanced due to either a designed unbalance or missing observations (Searle, 1987; Shaw

& Mitchell-Olds, 1993). In these cases, data is said to be unbalanced. The difference

between balanced and unbalanced data might seem to be trivial, but the similarities

between analyses of balanced data and unbalanced data are few. Instead, as Searle (1987) points out, one should consider unbalanced data a separate setting rather than a special case of balanced data.

5.1 Unbalanced two-way ANOVA with interactions

Since the first papers on analysis of unbalanced data design were published in the mid

1930’s, there has been a long and fruitful debate on how one should express unbalanced

linear models and make inference in the best possible way. Following the notation that

was introduced for balanced ANOVA models in Section 3, the two-way unbalanced

ANOVA-model with interactions could be expressed as a factor effects model:

yijk = µ+ αi + βj + αβij + εijk, (20)

where model assumptions and notation are equal to those for model (3) except that

k = 1, 2, . . . , nij, where nij is the number of replicates at each factor level combination

(i, j), i = 1, 2, . . . , a, j = 1, 2, . . . , b.


5.1.1 Estimation

Similar to balanced models one must impose constraints, defined in equation (8), on

the over-parameterized model (20) in order to obtain solutions to the system of normal

equations and unique values of the estimators µ, αi, βj, αβij. These estimators in model

(20) can be written as follows:

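$$\hat{\mu} = \bar{y}_{...}, \quad \hat{\alpha}_i = \bar{y}_{i..} - \bar{y}_{...}, \quad \hat{\beta}_j = \bar{y}_{.j.} - \bar{y}_{...}, \quad \widehat{\alpha\beta}_{ij} = \bar{y}_{ij.} - \bar{y}_{i..} - \bar{y}_{.j.} + \bar{y}_{...}, \quad i = 1, \ldots, a, \ j = 1, \ldots, b, \tag{21}$$

where

$$\bar{y}_{...} = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n_{ij}} (abn_{ij})^{-1} y_{ijk}, \quad \bar{y}_{i..} = \sum_{j=1}^{b}\sum_{k=1}^{n_{ij}} (bn_{ij})^{-1} y_{ijk}, \quad \bar{y}_{.j.} = \sum_{i=1}^{a}\sum_{k=1}^{n_{ij}} (an_{ij})^{-1} y_{ijk}, \quad \bar{y}_{ij.} = \sum_{k=1}^{n_{ij}} n_{ij}^{-1} y_{ijk}. \tag{22}$$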

As can be seen in expressions (21) and (22), each observation is weighted by the inverse of its cell size nij, so the estimates differ from those of the balanced model. For instance, y... is an unweighted average of the cell means and no longer equals the overall mean of the sample, as it does when data is balanced.
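A small numerical illustration may help here. The sketch below uses a made-up 2 × 2 univariate layout with unequal cell sizes and computes the means in (22) as averages of cell means; the function name `unweighted_means` and the data are hypothetical.

```python
import numpy as np

def unweighted_means(cells):
    """Means as in (22); cells[i][j] is a 1-D array with the n_ij observations."""
    cell_means = np.array([[np.mean(c) for c in row] for row in cells])  # ybar_ij.
    grand = cell_means.mean()            # ybar_...
    row_means = cell_means.mean(axis=1)  # ybar_i..
    col_means = cell_means.mean(axis=0)  # ybar_.j.
    return grand, row_means, col_means, cell_means

# Hypothetical data with cell sizes 3, 1, 1 and 2.
cells = [[np.array([1.0, 2.0, 3.0]), np.array([10.0])],
         [np.array([4.0]), np.array([6.0, 8.0])]]
grand, row_means, col_means, cell_means = unweighted_means(cells)

# Ordinary mean of all seven observations, for comparison.
sample_mean = np.mean(np.concatenate([c for row in cells for c in row]))
```

Here the cell means are 2, 10, 4 and 7, so `grand` is 5.75, whereas `sample_mean` is 34/7 (about 4.86). The two coincide only when all cell sizes are equal.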

5.1.2 Hypothesis testing

Testing the hypotheses of main and interaction effects (12) in the ANOVA-model when

data is unbalanced is not as straightforward as when data is balanced. Searle (1987)

notes that it is quite easy to realize that the usual partitioning of sum of squares in (13)

is not possible when data is unbalanced because main and interaction sum of squares

are no longer independent. As a consequence, the obtained test statistics under the

null hypotheses of no main and interaction effects (12) will not be exactly F-distributed

(Shaw & Mitchell-Olds, 1993). Modifications of methods for computing effect sum of

squares are therefore required when data is unbalanced (Langsrud, 2003).

To adjust for the fact that F-tests for effects in the model are not exact when data

is unbalanced, three methods to partition sums of squares for factors in the ANOVA-model have been implemented: Type I, Type II and Type III. For details on Type I,

Type II and Type III sum of squares, see e.g. Littell et al. (2002) or Langsrud (2003).

In this master thesis, Type III sums of squares will be used for calculations. The Type III sum of squares for an effect is calculated so that it is adjusted for all other effects in the ANOVA-model, regardless of the order in which they are included. For instance, in the two-way ANOVA with interaction, the Type III sum of squares for factor A is calculated conditionally on factor B and the interaction between factors A and B already being included in the model (Littell et al., 2002). Expressed symbolically, the partitioning of Type III sums of squares in the unbalanced two-way ANOVA is shown in Table 5.

Table 5: Partitioning of Type III sums of squares in the two-way ANOVA table

| Source | df | Type III Sum of Squares |
|---|---|---|
| A | a − 1 | SS(α \| µ, β, αβ) |
| B | b − 1 | SS(β \| µ, α, αβ) |
| AB | (a − 1)(b − 1) | SS(αβ \| µ, α, β) |
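One way to make the conditional sums of squares in Table 5 concrete is through model comparison: each Type III sum of squares is the increase in residual sum of squares when the columns of that effect are dropped from the full model. The sketch below assumes sum-to-zero (effect) coding, for which this comparison is commonly taken to reproduce Type III sums of squares. It is not the SAS procedure used in this thesis, and all function names are hypothetical.

```python
import numpy as np

def effect_coding(levels, n_levels):
    """Sum-to-zero coding: level l < L-1 gets a unit row, the last level a row of -1."""
    C = np.vstack([np.eye(n_levels - 1), -np.ones(n_levels - 1)])
    return C[levels]

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def type3_ss(y, fa, fb, a, b):
    """Type III sums of squares for a univariate two-way model with interaction."""
    one = np.ones((len(y), 1))
    XA = effect_coding(fa, a)
    XB = effect_coding(fb, b)
    XAB = np.einsum('ni,nj->nij', XA, XB).reshape(len(y), -1)
    full = rss(np.hstack([one, XA, XB, XAB]), y)
    return {
        "A": rss(np.hstack([one, XB, XAB]), y) - full,   # SS(alpha | mu, beta, alphabeta)
        "B": rss(np.hstack([one, XA, XAB]), y) - full,   # SS(beta | mu, alpha, alphabeta)
        "AB": rss(np.hstack([one, XA, XB]), y) - full,   # SS(alphabeta | mu, alpha, beta)
    }

# Hypothetical unbalanced 2 x 2 layout with cell sizes 3, 2, 1 and 3.
fa = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1])
fb = np.array([0, 0, 0, 1, 1, 0, 1, 1, 1])
y = np.random.default_rng(2).normal(size=9) + fa
ss = type3_ss(y, fa, fb, a=2, b=2)
```

For balanced data the three quantities reduce to the classical sums of squares, which gives a simple way to validate the sketch.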

6 Unbalanced two-way MANOVA with interactions

The methodology in this section is mainly based on the article by Zhang & Xiao (2012).

Hence, a similar notation and structure will be used throughout this Section. The unbal-

anced and heteroscedastic two-way MANOVA-model with interactions could formally

be expressed as follows:

yijk = µ+αi + βj +αβij + εijk, εijk ∼ N(0,Σij) (23)

where Σij : p×p is the covariance matrix of the (i, j)th combination of levels for factors

A and B, k = 1, 2, . . . , nij, where nij is the number of replicates at each factor level

combination (i, j), i = 1, 2, . . . , a, j = 1, 2, . . . , b, and other notation is the same as for

the balanced two-way MANOVA-model with interactions in Section 4.3.

As mentioned in Section 4, a balanced data setting results in robust estimation of the MANOVA-model even with minor deviations from the assumption of covariance homoscedasticity. However, when covariance heteroscedasticity is severe, in particular when the data is also unbalanced, the standard multivariate tests become biased (Zhang & Xiao, 2012). In this case, it is necessary to use modifications of the standard multivariate tests that protect against the bias. Zhang & Xiao (2012) propose ways to modify Wilks’ Λ, Hotelling-Lawley Trace and Pillai’s Trace in order to obtain reliable tests.

6.1 Estimation

In analogy to unbalanced ANOVA-models, the estimation for unbalanced MANOVA-

models is affected by the fact that weights for different factor combinations are no

longer equal. Under the constraints proposed in equation (17), the vector of estimators

of effects could be uniquely derived as:

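$$\hat{\mu} = \bar{y}_{...}, \quad \hat{\alpha}_i = \bar{y}_{i..} - \bar{y}_{...}, \quad \hat{\beta}_j = \bar{y}_{.j.} - \bar{y}_{...}, \quad \widehat{\alpha\beta}_{ij} = \bar{y}_{ij.} - \bar{y}_{i..} - \bar{y}_{.j.} + \bar{y}_{...}, \quad i = 1, \ldots, a, \ j = 1, \ldots, b,$$

where the dot-notation is the same as proposed by Zhang & Xiao (2012):

$$\bar{y}_{...} = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n_{ij}} (abn_{ij})^{-1} y_{ijk}, \quad \bar{y}_{i..} = \sum_{j=1}^{b}\sum_{k=1}^{n_{ij}} (bn_{ij})^{-1} y_{ijk}, \quad \bar{y}_{.j.} = \sum_{i=1}^{a}\sum_{k=1}^{n_{ij}} (an_{ij})^{-1} y_{ijk}, \quad \bar{y}_{ij.} = \sum_{k=1}^{n_{ij}} n_{ij}^{-1} y_{ijk}.$$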

6.2 Hypothesis testing using modified test statistics

Zhang & Xiao (2012) propose three types of modifications to the standard test statistics

which adjust for unbalanced data and heteroscedastic covariance matrices. Under the

null hypotheses of no main or interaction effects stated in equation (18), one may define

SSCP matrices for the unbalanced two-way MANOVA-model with interactions in the

following way:

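$$H_A = \frac{1}{a-1}\sum_{i=1}^{a}\sum_{j=1}^{b}(\bar{y}_{i..}-\bar{y}_{...})(\bar{y}_{i..}-\bar{y}_{...})',$$

$$H_B = \frac{1}{b-1}\sum_{i=1}^{a}\sum_{j=1}^{b}(\bar{y}_{.j.}-\bar{y}_{...})(\bar{y}_{.j.}-\bar{y}_{...})',$$

$$H_{AB} = \frac{1}{(b-1)(a-1)}\sum_{i=1}^{a}\sum_{j=1}^{b}(\bar{y}_{ij.}-\bar{y}_{i..}-\bar{y}_{.j.}+\bar{y}_{...})(\bar{y}_{ij.}-\bar{y}_{i..}-\bar{y}_{.j.}+\bar{y}_{...})',$$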

where HA,HB and HAB are the SSCP matrices associated with the hypothesis of no

main effects for factor A, factor B and the interaction effect between factors A and B,

respectively. It should be noted that throughout the rest of this section, H = HA,HB

or HAB depending on the hypothesis tested.
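The hypothesis matrices of this section can be computed directly from the cell means. The sketch below assumes the data are held as a nested list `cells[i][j]` of arrays of shape (nij, p) and uses the unweighted means of Section 6.1; the function name `hypothesis_matrices` and the cell sizes are hypothetical.

```python
import numpy as np

def hypothesis_matrices(cells):
    """H_A, H_B and H_AB for unbalanced data; cells[i][j] has shape (n_ij, p)."""
    a, b = len(cells), len(cells[0])
    cell = np.array([[c.mean(axis=0) for c in row] for row in cells])  # (a, b, p)
    grand = cell.mean(axis=(0, 1))   # ybar_...
    mean_a = cell.mean(axis=1)       # ybar_i..  shape (a, p)
    mean_b = cell.mean(axis=0)       # ybar_.j.  shape (b, p)
    dA = mean_a - grand
    dB = mean_b - grand
    dAB = cell - mean_a[:, None, :] - mean_b[None, :, :] + grand
    # The double sum over (i, j) repeats each A-term b times and each B-term a times.
    H_A = b * dA.T @ dA / (a - 1)
    H_B = a * dB.T @ dB / (b - 1)
    dAB = dAB.reshape(-1, dAB.shape[-1])
    H_AB = dAB.T @ dAB / ((a - 1) * (b - 1))
    return H_A, H_B, H_AB

# Hypothetical unbalanced 2 x 4 layout with p = 3 responses.
rng = np.random.default_rng(1)
sizes = [[3, 5, 4, 6], [7, 2, 5, 3]]
cells = [[rng.normal(size=(n, 3)) for n in row] for row in sizes]
H_A, H_B, H_AB = hypothesis_matrices(cells)
```

Note that the cell sizes enter only through the cell means, so each factor-level combination contributes equally regardless of how many replicates it holds.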

When deriving the modified test statistics, Zhang & Xiao (2012) start from the relationship between H and a natural unbiased estimator of the covariance matrix Σ:

$$G = (ab)^{-1}\sum_{i=1}^{a}\sum_{j=1}^{b} n_{ij}^{-1}\hat{\Sigma}_{ij},$$

where $\hat{\Sigma}_{ij} = (n_{ij}-1)^{-1}\sum_{k=1}^{n_{ij}}(y_{ijk}-\bar{y}_{ij.})(y_{ijk}-\bar{y}_{ij.})'$ is the unbiased estimator of Σij. It can be shown that under model (23), H and G will be approximately Wishart-distributed:

$$H \sim W(f_H, \Sigma/f_H), \quad G \sim W(f_G, \Sigma/f_G),$$

where fH and fG are unknown approximate degrees of freedom belonging to the distributions of H and G, respectively. Then, by defining W1 = fHH and W2 = fGG, modifications of Wilks’ Λ (WL), Hotelling-Lawley Trace (HLT) and Pillai’s trace (PT) could be derived as:

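$$T_{WL} = -\log\left(\frac{|W_2|}{|W_1 + W_2|}\right), \quad T_{HLT} = \operatorname{tr}\left(W_1 W_2^{-1}\right), \quad T_{PT} = \operatorname{tr}\left(W_1 (W_1 + W_2)^{-1}\right), \tag{24}$$

due to the fact that W1 and W2 are independent (Harrar & Bathke, 2008; Zhang & Xiao, 2012).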
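The construction of G and of the modified statistics can be sketched as follows. The statistics are written in the form that mirrors Sections 4.5.2–4.5.4, with W1 in the role of H and W2 in the role of E. The degrees of freedom fH and fG are treated as given inputs, and the values used below, as well as the placeholder matrix H and all function names, are purely hypothetical.

```python
import numpy as np

def pooled_G(cells):
    """G = (ab)^{-1} sum_ij n_ij^{-1} S_ij, with S_ij the unbiased cell covariance."""
    a, b = len(cells), len(cells[0])
    p = cells[0][0].shape[1]
    G = np.zeros((p, p))
    for row in cells:
        for c in row:
            n_ij = c.shape[0]
            S_ij = np.cov(c, rowvar=False, ddof=1)   # divides by n_ij - 1
            G += S_ij / n_ij
    return G / (a * b)

def modified_statistics(H, G, f_H, f_G):
    """Modified Wilks, Hotelling-Lawley and Pillai statistics from W1 and W2."""
    W1, W2 = f_H * H, f_G * G
    T_WL = -np.log(np.linalg.det(W2) / np.linalg.det(W1 + W2))
    T_HLT = np.trace(W1 @ np.linalg.inv(W2))
    T_PT = np.trace(W1 @ np.linalg.inv(W1 + W2))
    return T_WL, T_HLT, T_PT

# Hypothetical unbalanced 2 x 4 layout with p = 3 responses.
rng = np.random.default_rng(2)
sizes = [[6, 9, 12, 8], [10, 7, 15, 11]]
cells = [[rng.normal(size=(n, 3)) for n in row] for row in sizes]
G = pooled_G(cells)
H = np.diag([0.05, 0.02, 0.03])   # placeholder hypothesis matrix, not estimated from data
T_WL, T_HLT, T_PT = modified_statistics(H, G, f_H=3.0, f_G=40.0)
```

In an actual analysis, H would come from the hypothesis tested and fH, fG from one of the approximations discussed next.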

By looking at the modified MANOVA test statistics in equation (24), one can clearly

see the resemblance to the standard MANOVA test statistics in sections 4.5.2–4.5.4.

However, unlike the standard multivariate test statistics, the modified test statistics

depend on the unknown quantities Σ, fH and fG. Zhang & Xiao (2012) derive three

sets of expressions for fH and fG using the information contained in the matrices H

and G:

1. fH and fG are obtained as proposed by Harrar & Bathke (2008).

2. fH and fG are obtained by matching total variances.

3. fH and fG are obtained under an affine invariant transformation of the MANOVA

model.

It can be shown that the three sets of approximate degrees of freedom are the following:

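$$f_H = \frac{\operatorname{tr}(\Sigma^2)}{\sum_{i,j}\sum_{\alpha,\beta}\frac{c^2_{ij,\alpha\beta}}{n_{ij}n_{\alpha\beta}}\operatorname{tr}(\Sigma_{ij}\Sigma_{\alpha\beta})}, \quad f_G = \frac{\operatorname{tr}(\Sigma^2)}{(ab)^{-2}\sum_{i,j}(n_{ij}-1)^{-1}n_{ij}^{-2}\operatorname{tr}(\Sigma_{ij}^2)}, \tag{25}$$

$$f_H = \frac{\operatorname{tr}(\Sigma^2) + \operatorname{tr}^2(\Sigma)}{\sum_{i,j}\sum_{\alpha,\beta}\frac{c^2_{ij,\alpha\beta}}{n_{ij}n_{\alpha\beta}}\left[\operatorname{tr}(\Sigma_{ij}\Sigma_{\alpha\beta}) + \operatorname{tr}(\Sigma_{ij})\operatorname{tr}(\Sigma_{\alpha\beta})\right]}, \quad f_G = \frac{\operatorname{tr}(\Sigma^2) + \operatorname{tr}^2(\Sigma)}{(ab)^{-2}\sum_{i,j}(n_{ij}-1)^{-1}n_{ij}^{-2}\left[\operatorname{tr}(\Sigma_{ij}^2) + \operatorname{tr}^2(\Sigma_{ij})\right]}, \tag{26}$$

$$f_H = \frac{p(p+1)}{\sum_{i,j}\sum_{\alpha,\beta}\frac{c^2_{ij,\alpha\beta}}{n_{ij}n_{\alpha\beta}}\left\{\operatorname{tr}(\Sigma_{ij}\Sigma^{-1}\Sigma_{\alpha\beta}\Sigma^{-1}) + \operatorname{tr}(\Sigma_{ij}\Sigma^{-1})\operatorname{tr}(\Sigma_{\alpha\beta}\Sigma^{-1})\right\}}, \quad f_G = \frac{p(p+1)}{(ab)^{-2}\sum_{i,j}(n_{ij}-1)^{-1}n_{ij}^{-2}\left\{\operatorname{tr}([\Sigma_{ij}\Sigma^{-1}]^2) + \operatorname{tr}^2(\Sigma_{ij}\Sigma^{-1})\right\}}, \tag{27}$$

where $\sum_{i,j}$ symbolizes summation over all i’s and j’s, $c^2_{ij,\alpha\beta}$ are design weights related to the hypothesis tested, and α = 1, 2, . . . , a, β = 1, 2, . . . , b are additional indices used for convenience of calculations. To estimate fG and fH in equations (25)–(27) in real data analysis, the unknown quantities Σ and Σij are replaced by their estimates.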
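As an illustration of the plug-in step, the sketch below evaluates fG in (25) from a set of cell covariance matrices and cell sizes. It assumes that Σ is taken as the average of the Σij/nij over all cells, in line with G being its estimator; this reading is an assumption made for the sketch. fH is left out because the design weights are not spelled out in this section. The function name `f_G_trace` is hypothetical.

```python
import numpy as np

def f_G_trace(S, n):
    """Approximate degrees of freedom f_G as in (25).

    S: array of shape (a, b, p, p) with cell covariance matrices (or estimates).
    n: array of shape (a, b) with the cell sizes n_ij.
    """
    a, b = n.shape
    # Assumed form of Sigma: (ab)^{-1} sum_ij Sigma_ij / n_ij.
    Sigma = np.einsum('ij,ijkl->kl', 1.0 / n, S) / (a * b)
    num = np.trace(Sigma @ Sigma)
    tr_sq = np.einsum('ijkl,ijlk->ij', S, S)     # tr(Sigma_ij^2) for every cell
    den = np.sum(tr_sq / ((n - 1) * n**2)) / (a * b) ** 2
    return num / den

# Balanced, homoscedastic check case: a = 2, b = 4, n_ij = 15 for all cells.
S0 = np.array([[2.0, 0.3, 0.0], [0.3, 1.0, 0.2], [0.0, 0.2, 0.5]])
S = np.broadcast_to(S0, (2, 4, 3, 3))
n = np.full((2, 4), 15)
f_G = f_G_trace(S, n)   # equals ab(n - 1) = 112 in this special case
```

In the balanced and homoscedastic special case the expression collapses to ab(n − 1), the usual error degrees of freedom, which is a useful sanity check before applying it to unbalanced data.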

7 Numerical examples

This section presents two numerical examples, one that applies the methodology to two real-life data sets and one that applies it to two simulated data sets. The main purpose of the numerical examples is to examine the performance of the modified MANOVA tests, Wilks’ Λ, Hotelling-Lawley Trace and Pillai’s Trace, when the assumption of covariance homoscedasticity is unlikely to hold and the data is unbalanced. In both examples, the performance of the modified MANOVA tests is compared to the results of standard MANOVA tests so as to allow a broader discussion.

The two numerical examples are presented in Sections 7.1–7.2, and main results of

standard and modified MANOVA tests are presented in Section 7.3.

7.1 Real-life data example

In this section, the methodology presented in Sections 3–6 will be illustrated using a

synthetic real-life data set. The data used for this numerical example will be discussed

briefly in Section 7.1.1. As part of the numerical example, assumptions regarding

distributional properties and covariance matrices of the response variables will be in-

vestigated. The results from these diagnostic tests will be presented in Section 7.1.4.

Ultimately, the results from the multivariate testing procedure will be presented in

Section 7.1.5.

7.1.1 Data

The data for the numerical example is collected from the database Integrated Public Use

Microdata Series (IPUMS). IPUMS comprises data for various samples of the American

population, drawn from federal censuses and the American Community Surveys (ACS) between the years 2000 and 2011 (Ruggles et al., 2010). In this master thesis, the ACS 2011 sample data is used. The ACS 2011 sample is collected using mixed modes, including e-mail, phone, mail and personal interviews (Ruggles et al., 2010). From the original ACS sample, a subset with a density of 1%, containing 251215 individuals, is extracted. Ultimately, using a second round of random sampling, two final samples of 120 observations are obtained. Three variables from ACS are chosen as a continuous

multivariate response in the MANOVA-model:

• Total personal income – the total pre-tax income or losses for each individual

during the previous calendar year, measured in dollars.

• Hauser and Warren Socioeconomic Index – an index score assigned to each in-

dividual based on occupation. The index measures occupational status based on

earnings and educational attainment for each category.

• Occupational Income Score – an income score assigned to each individual based on

occupation. The score is measured as the weighted average occupational income,

thereby reflecting relative economic standing of occupations for each individual.

Variable definitions are found in Ruggles et al. (2010). In the numerical example below, all three variables are log-transformed to avoid severe departure from the assumption of normality of responses.

7.1.2 The two-way MANOVA model with interactions

The two studied factors for each individual are Sex and Census region, having 2

and 4 levels, respectively: males and females coming from Northeast, Midwest, South

and West census regions in the U.S. In the MANOVA-model, it is assumed that Sex

and Census region have main effects as well as interaction effects. Further, Sex is

representing the first factor A and Census region the second factor B, so that there are

8 combinations of the two factors, as shown in Table 6.

7.1.3 Structure of the numerical example

As mentioned earlier, the performance of the standard MANOVA tests is likely to be affected when data is unbalanced and covariance matrices are heteroscedastic. This numerical example thus examines two situations: one sample where data is balanced and one sample where nij varies over factor combinations. In both samples, the total sample size is $n = \sum_{i=1}^{a}\sum_{j=1}^{b} n_{ij} = 120$. It should be noted that there are endless ways in which nij can be altered when studying effects of unbalancedness, and that future studies might consider other alternatives.

Table 6: Data for the unbalanced two-way MANOVA with k replicates, k = 1, 2, . . . , nij.

| Census region | Sex: Males | Sex: Females | Totals |
|---|---|---|---|
| Northeast | y11k | y21k | y.1k |
| Midwest | y12k | y22k | y.2k |
| South | y13k | y23k | y.3k |
| West | y14k | y24k | y.4k |
| Totals | y1.k | y2.k | y..k |

7.1.4 Testing model assumptions

This section summarizes the results from goodness-of-fit tests of univariate and multivariate normality of the response variables in the MANOVA model as well as the results from Box’s M test of covariance matrix homoscedasticity.

The Shapiro-Wilk goodness-of-fit test is used for testing univariate normality whereas

the Mardia test is used for testing multivariate normality. Additionally, QQ-plots of

multivariate fit and histograms of univariate residuals of the estimated MANOVA model

are constructed. Low p-values of obtained goodness-of-fit test statistics suggest devia-

tion from both univariate and multivariate normality in both samples. However, neither

residual histograms nor QQ-plots suggest that this deviation is severe (Tables and Fig-

ures are presented in Appendix B).
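For reference, Box’s M statistic with its usual χ2 approximation can be sketched as below. This follows the standard textbook form of the test and is not the SAS implementation used to produce the results in this thesis; the function name `box_m` and the simulated groups are hypothetical.

```python
import numpy as np

def box_m(groups):
    """Box's M test statistic (chi-square approximation) and its degrees of freedom.

    groups: list of arrays of shape (n_g, p), one per factor-level combination.
    """
    def logdet(S):
        return np.linalg.slogdet(S)[1]

    g = len(groups)
    p = groups[0].shape[1]
    n = np.array([x.shape[0] for x in groups])
    covs = [np.cov(x, rowvar=False, ddof=1) for x in groups]
    N = n.sum()
    pooled = sum((ni - 1) * S for ni, S in zip(n, covs)) / (N - g)
    M = (N - g) * logdet(pooled) - sum((ni - 1) * logdet(S) for ni, S in zip(n, covs))
    # Scaling constant of the chi-square approximation.
    c = (np.sum(1.0 / (n - 1)) - 1.0 / (N - g)) * (2 * p**2 + 3 * p - 1) / (6.0 * (p + 1) * (g - 1))
    chi2 = (1.0 - c) * M
    df = (g - 1) * p * (p + 1) // 2
    return chi2, df

# Hypothetical data: 8 factor-level combinations, 15 observations each, p = 3.
rng = np.random.default_rng(4)
groups = [rng.normal(size=(15, 3)) for _ in range(8)]
chi2, df = box_m(groups)
```

With 8 groups and 3 response variables, the degrees of freedom are (8 − 1) · 3 · 4 / 2 = 42, which agrees with the df column reported for Box’s M test in this section.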

Table 7 summarizes the results from Box’s M test for the samples. The null hypothesis of equality of covariance matrices is rejected at a 5% significance level for the unbalanced sample, but not when data is balanced.

Table 7: Results from Box’s M test of H0 : Σ11 = Σ12 = . . . = Σ24

| Sample | χ2 | df | p |
|---|---|---|---|
| Balanced | 54.7 | 42 | 0.0910 |
| Unbalanced | 72.7 | 42 | 0.0023 |

7.1.5 Results from the testing procedure

Tables 8–9 below show the test results of Wilks’ Λ (WL), Hotelling-Lawley Trace (HLT) and Pillai’s Trace (PT) as produced in SAS as well as the three modifications proposed by Zhang & Xiao (2012). The modified tests are denoted HLTi, PTi and WLi, where i = 1 stands for the modification proposed initially by Harrar & Bathke (2008), i = 2 for the modification based on matching of variance components, and i = 3 for the modification based on affine invariant transformation of the two-way MANOVA-model with interactions. When data is unbalanced, standard tests are based on Type III partitioning of SSCP matrices described in Section 5.1.2.

Table 8 shows the test results for the balanced sample. The null hypotheses of no main

and interaction effects for Sex and Census region are not rejected at a 5% significance

level. Overall, it can be noted that test results are nearly identical for all 12 test statis-

tics. There is a slight tendency that modified MANOVA tests are less significant than

the standard MANOVA tests produced in SAS, but this difference is so small that it

can be neglected.

Table 8: Test results for balanced sample

| Statistic | FA | df1,A | df2,A | pA | FB | df1,B | df2,B | pB | FAB | df1,AB | df2,AB | pAB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HLT | 2.44 | 3 | 110 | 0.0678 | 0.86 | 9 | 169.7 | 0.5647 | 0.85 | 9 | 169.7 | 0.5685 |
| PT | 2.44 | 3 | 110 | 0.0678 | 0.86 | 9 | 336 | 0.5623 | 0.86 | 9 | 336 | 0.5590 |
| WL | 2.44 | 3 | 110 | 0.0678 | 0.86 | 9 | 267.86 | 0.5652 | 0.86 | 9 | 267.86 | 0.5655 |
| HLT1 | 2.44 | 3 | 93.235 | 0.0695 | 0.85 | 8.4399 | 221.6 | 0.5622 | 0.85 | 8.4399 | 221.6 | 0.5622 |
| PT1 | 2.44 | 3 | 93.235 | 0.0695 | 0.85 | 8.4399 | 258.67 | 0.5662 | 0.84 | 8.4399 | 258.67 | 0.5698 |
| WL1 | 2.44 | 3 | 93.235 | 0.0695 | 0.86 | 8.4399 | 267.4 | 0.5585 | 0.86 | 8.4399 | 267.4 | 0.5547 |
| HLT2 | 2.44 | 3 | 95.517 | 0.0692 | 0.85 | 8.5088 | 227.75 | 0.5625 | 0.85 | 8.5088 | 227.75 | 0.5625 |
| PT2 | 2.44 | 3 | 95.517 | 0.0692 | 0.85 | 8.5088 | 267.24 | 0.5663 | 0.84 | 8.5088 | 267.24 | 0.5700 |
| WL2 | 2.44 | 3 | 95.517 | 0.0692 | 0.86 | 8.5088 | 276.12 | 0.5589 | 0.86 | 8.5088 | 276.12 | 0.5552 |
| HLT3 | 2.44 | 3 | 98.208 | 0.0689 | 0.85 | 8.742 | 236.61 | 0.5647 | 0.85 | 8.742 | 236.61 | 0.5647 |
| PT3 | 2.44 | 3 | 98.208 | 0.0689 | 0.85 | 8.742 | 282.35 | 0.5683 | 0.85 | 8.742 | 282.35 | 0.5720 |
| WL3 | 2.44 | 3 | 98.208 | 0.0689 | 0.86 | 8.742 | 291.75 | 0.5612 | 0.86 | 8.742 | 291.75 | 0.5575 |

Notation: In the above table, subscripts declare the tested hypothesis. For instance, FA represents the F-statistic for the null hypothesis of no effects for Sex.

Looking at Table 9 one can see that standard MANOVA tests reject the null hypothesis of no main effect for Sex at a 5% significance level. The null hypotheses of no main effect for Census region and no interaction effect between Sex and Census region are clearly not rejected.

Table 9: Test results for the unbalanced sample.


| Statistic | FA | df1,A | df2,A | pA | FB | df1,B | df2,B | pB | FAB | df1,AB | df2,AB | pAB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HLT | 3.49 | 3 | 110 | 0.0181 | 0.85 | 9 | 169.7 | 0.5680 | 1.43 | 9 | 169.7 | 0.1769 |
| PT | 3.49 | 3 | 110 | 0.0181 | 0.86 | 9 | 336 | 0.5636 | 1.41 | 9 | 336 | 0.1803 |
| WL | 3.49 | 3 | 110 | 0.0181 | 0.85 | 9 | 267.86 | 0.5676 | 1.42 | 9 | 267.86 | 0.1778 |
| HLT1 | 2.34 | 3 | 39.629 | 0.0881 | 0.93 | 7.6064 | 90.136 | 0.4951 | 1.26 | 7.6064 | 90.136 | 0.2740 |
| PT1 | 2.34 | 3 | 39.629 | 0.0881 | 0.92 | 7.6064 | 97.407 | 0.5037 | 1.27 | 7.6064 | 97.407 | 0.2707 |
| WL1 | 2.34 | 3 | 39.629 | 0.0881 | 0.94 | 7.6064 | 104.37 | 0.4874 | 1.25 | 7.6064 | 104.37 | 0.2785 |
| HLT2 | 2.34 | 3 | 41.464 | 0.0868 | 0.93 | 7.6777 | 94.71 | 0.4935 | 1.27 | 7.6777 | 94.71 | 0.2710 |
| PT2 | 2.34 | 3 | 41.464 | 0.0868 | 0.92 | 7.6777 | 103 | 0.5016 | 1.27 | 7.6777 | 103 | 0.2674 |
| WL2 | 2.34 | 3 | 41.464 | 0.0868 | 0.94 | 7.6777 | 110.11 | 0.4863 | 1.26 | 7.6777 | 110.11 | 0.2757 |
| HLT3 | 2.37 | 3 | 52.568 | 0.0813 | 0.94 | 7.9645 | 122.02 | 0.4854 | 1.29 | 7.9645 | 122.02 | 0.2567 |
| PT3 | 2.37 | 3 | 52.568 | 0.0813 | 0.93 | 7.9645 | 136.25 | 0.4912 | 1.29 | 7.9645 | 136.25 | 0.2523 |
| WL3 | 2.37 | 3 | 52.568 | 0.0813 | 0.95 | 7.9645 | 143.95 | 0.4801 | 1.27 | 7.9645 | 143.95 | 0.2619 |

Notation: In the above table, subscripts declare the tested hypothesis. For instance, FA represents the F-statistic for the null hypothesis of no effects for Sex.

These results differ from the test results obtained by the modified tests which do not

reject the null hypothesis of no main effect for Sex at a 5% level of significance. Further,

main effects for Census region and interaction effects between Sex and Census region

are not statistically significant.

7.2 Simulation study

A simulation study is conducted in order to validate the results obtained from studying

the real-life data. Two data sets, one balanced and one unbalanced are simulated and

investigated in the same way as the real-life data. The aim is to get further indica-

tions about the performance of the modified MANOVA tests in relation to the standard

MANOVA tests. A short description of the simulation study is given in Section 7.2.1.

Assumptions of univariate and multivariate normality as well as covariance homoscedas-

ticity are tested in Section 7.2.2. Results from the testing procedure are then shown in

Section 7.2.3.

7.2.1 Construction of the simulated data

The simulation study is based on the algorithms presented in Zhang & Xiao (2012).

Two data sets of size n = 120 are simulated from a multivariate normal distribution


with 3 response variables y1, y2 and y3, and 2 factors A and B having 2 and 4 levels,

respectively. The layout of the two simulated data sets can therefore be represented as

shown in Table 6. The simulated data sets are generated in the following way:

$$y_{ijk} = \mu_{ij} + \Sigma_{ij}^{1/2}\varepsilon_{ijk},$$

where k = 1, 2, . . . , nij. The mean vectors are defined as µij = µ11 + ijδh/(ab) where

µ11 = 0 and h = 2.4. It is further assumed that εijk ∼ Np(0, Ip). The covariance

structure for the unbalanced simulated data is assumed to vary over the two levels of

factor A. Explicitly, Σ1j = I3 and Σ2j = diag (1.0, 5.0, 0.1) , j = 1, 2, 3, 4.
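The data-generating step can be sketched as follows. Several ingredients are not specified in this section and are therefore assumptions of the sketch: the direction vector δ (set to a vector of ones), the cell sizes nij of the unbalanced design, and the random seed. The function name `simulate_cells` is hypothetical, and the mean formula is applied literally as written above.

```python
import numpy as np

def simulate_cells(sizes, delta, h=2.4, seed=0):
    """Simulate y_ijk = mu_ij + Sigma_ij^{1/2} eps_ijk for a 2 x b layout.

    sizes: integer array of shape (2, b) with the cell sizes n_ij (assumed).
    delta: direction vector of length p = 3 (assumed, not given in the text).
    """
    rng = np.random.default_rng(seed)
    a, b = sizes.shape
    delta = np.asarray(delta, dtype=float)
    p = len(delta)
    cov = [np.eye(p), np.diag([1.0, 5.0, 0.1])]   # Sigma_1j and Sigma_2j
    cells = []
    for i in range(1, a + 1):
        root = np.sqrt(cov[i - 1])   # element-wise root is valid for diagonal matrices
        row = []
        for j in range(1, b + 1):
            mu_ij = i * j * delta * h / (a * b)   # formula taken literally, with mu_11 = 0 as offset
            eps = rng.standard_normal((sizes[i - 1, j - 1], p))
            row.append(mu_ij + eps @ root.T)
        cells.append(row)
    return cells

# Hypothetical unbalanced cell sizes summing to n = 120.
sizes = np.array([[10, 20, 10, 20], [20, 10, 20, 10]])
delta = np.ones(3)
cells = simulate_cells(sizes, delta)
```

A balanced counterpart is obtained by setting every entry of `sizes` to 15. Other choices of δ and nij would change the size of the effects and the degree of unbalancedness, so results from such a sketch are not comparable to Tables 11–12.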

7.2.2 Testing model assumptions

Goodness-of-fit tests of univariate and multivariate normality for the two simulated data sets

are presented in Appendix C together with QQ-plots of multivariate fit and histograms

of univariate residuals from the estimated MANOVA model. While figures indicate a

better fit to univariate and multivariate normality than for the real-life data samples,

the goodness-of-fit tests suggest a slight departure from normality in some cases.

Table 10 summarizes the results from Box’s M test for the two simulated data sets. The

null hypothesis of equality of covariance matrices is clearly rejected at a 5% significance

level for both the balanced and the unbalanced case which implies heteroscedasticity of

covariance matrices.

Table 10: Results from Box’s M test of H0 : Σ11 = Σ12 = . . . = Σ24

| Sample | χ2 | df | p |
|---|---|---|---|
| 1 | 306.3 | 42 | <.0001 |
| 2 | 293.9 | 42 | <.0001 |

7.2.3 Results from the testing procedure

Tables 11–12 show the test results for the two simulated data sets. As can be seen in

Table 11, all tests show statistically significant main effects of factors A and B when

data is balanced. The interaction effect between factors A and B is however not sta-

tistically significant. Overall, p-values are marginally higher for modified tests than for


standard tests.

Table 11: Test results for balanced simulated data.


| Statistic | FA | df1,A | df2,A | pA | FB | df1,B | df2,B | pB | FAB | df1,AB | df2,AB | pAB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HLT | 9.30 | 3 | 110 | <.0001 | 6.29 | 9 | 169.7 | <.0001 | 1.14 | 9 | 169.7 | 0.3397 |
| PT | 9.30 | 3 | 110 | <.0001 | 4.96 | 9 | 336 | <.0001 | 1.12 | 9 | 336 | 0.3463 |
| WL | 9.30 | 3 | 110 | <.0001 | 5.66 | 9 | 267.86 | <.0001 | 1.13 | 9 | 267.86 | 0.3434 |
| HLT1 | 9.14 | 3 | 54.507 | <.0001 | 5.59 | 5.6492 | 104.82 | <.0001 | 1.11 | 5.6492 | 104.82 | 0.3614 |
| PT1 | 9.14 | 3 | 54.507 | <.0001 | 6.11 | 5.6492 | 100.88 | <.0001 | 1.10 | 5.6492 | 100.88 | 0.3647 |
| WL1 | 9.14 | 3 | 54.507 | <.0001 | 5.09 | 5.6492 | 104.3 | 0.0002 | 1.11 | 5.6492 | 104.3 | 0.3584 |
| HLT2 | 9.17 | 3 | 59.467 | <.0001 | 5.59 | 6.0228 | 119.22 | <.0001 | 1.11 | 6.0228 | 119.22 | 0.3596 |
| PT2 | 9.17 | 3 | 59.467 | <.0001 | 6.13 | 6.0228 | 117.37 | <.0001 | 1.11 | 6.0228 | 117.37 | 0.3621 |
| WL2 | 9.17 | 3 | 59.467 | <.0001 | 5.04 | 6.0228 | 121.41 | 0.0001 | 1.12 | 6.0228 | 121.41 | 0.3574 |
| HLT3 | 9.26 | 3 | 84.829 | <.0001 | 5.58 | 8.0237 | 197.47 | <.0001 | 1.12 | 8.0237 | 197.47 | 0.3512 |
| PT3 | 9.26 | 3 | 84.829 | <.0001 | 6.20 | 8.0237 | 223.53 | <.0001 | 1.12 | 8.0237 | 223.53 | 0.3499 |
| WL3 | 9.26 | 3 | 84.829 | <.0001 | 4.87 | 8.0237 | 231.36 | <.0001 | 1.12 | 8.0237 | 231.36 | 0.3528 |

Notation: In the above table, subscripts declare the tested hypothesis. For instance, FA represents the F-statistic for the null hypothesis of no effects for factor A.

For the unbalanced simulated data, all tests show statistically significant main effects for factors A and B, as can be seen in Table 12.

Table 12: Test results for unbalanced simulated data.


| Statistic | FA | df1,A | df2,A | pA | FB | df1,B | df2,B | pB | FAB | df1,AB | df2,AB | pAB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HLT | 7.20 | 3 | 110 | 0.0002 | 5.37 | 9 | 169.7 | <.0001 | 1.72 | 9 | 169.7 | 0.0880 |
| PT | 7.20 | 3 | 110 | 0.0002 | 4.69 | 9 | 336 | <.0001 | 1.62 | 9 | 336 | 0.1071 |
| WL | 7.20 | 3 | 110 | 0.0002 | 5.07 | 9 | 267.86 | <.0001 | 1.67 | 9 | 267.86 | 0.0959 |
| HLT1 | 7.52 | 3 | 35.478 | 0.0005 | 5.09 | 5.792 | 69.365 | 0.0003 | 1.68 | 5.792 | 69.365 | 0.1416 |
| PT1 | 7.52 | 3 | 35.478 | 0.0005 | 5.38 | 5.792 | 66.634 | 0.0002 | 1.73 | 5.792 | 66.634 | 0.1305 |
| WL1 | 7.52 | 3 | 35.478 | 0.0005 | 4.82 | 5.792 | 70.292 | 0.0004 | 1.63 | 5.792 | 70.292 | 0.1551 |
| HLT2 | 7.55 | 3 | 38.145 | 0.0004 | 5.09 | 6.093 | 77.024 | 0.0002 | 1.68 | 6.093 | 77.024 | 0.1353 |
| PT2 | 7.55 | 3 | 38.145 | 0.0004 | 5.40 | 6.093 | 75.41 | 0.0001 | 1.74 | 6.093 | 75.41 | 0.1232 |
| WL2 | 7.55 | 3 | 38.145 | 0.0004 | 4.76 | 6.093 | 79.566 | 0.0003 | 1.63 | 6.093 | 79.566 | 0.1498 |
| HLT3 | 7.69 | 3 | 60.757 | 0.0002 | 5.13 | 8.0793 | 141.88 | <.0001 | 1.72 | 8.0793 | 141.88 | 0.0984 |
| PT3 | 7.69 | 3 | 60.757 | 0.0002 | 5.54 | 8.0793 | 160.24 | <.0001 | 1.78 | 8.0793 | 160.24 | 0.0840 |
| WL3 | 7.69 | 3 | 60.757 | 0.0002 | 4.60 | 8.0793 | 168.18 | <.0001 | 1.64 | 8.0793 | 168.18 | 0.1154 |

Notation: In the above table, subscripts declare the tested hypothesis. For instance, FA represents the F-statistic for the null hypothesis of no effects for factor A.

The interaction effect is not significant at a 5% significance level, but 2 out of 3 standard tests suggest the presence of a significant effect at a 10% level. This last result is not supported by the first two modifications, whose p-values for the interaction effect lie between 0.12 and 0.16, whereas HLT3 and PT3 remain below 0.10. In line with previous results, the p-values from the modified tests are higher compared to those from the standard tests in most cases.

7.3 Summary of test results

Combining the results from the two studies it is evident that standard MANOVA tests

overall have lower p-values than the three modified MANOVA tests. These findings

are similar to those presented by Zhang & Xiao (2012). In both numerical examples,

differences between tests are small when data is balanced but substantially larger when

data is unbalanced. Results for the real-life data samples are generally not as clear-cut as for the simulated data, but the overall tendency is that p-values of the standard MANOVA tests are lower than for the modified MANOVA tests. In one case (the test of main effects for Sex in the unbalanced real-life sample), standard tests are statistically significant at a 5% level while modified tests are not.

8 Discussion

Based on the obtained results from the empirical study, it can be debated whether

modified MANOVA tests are a better choice than standard MANOVA tests. It is de-

sirable to adjust tests when heteroscedasticity is severe, since one wants to make valid

inference when the variability in data is large. This property of the modified MANOVA

tests is highlighted by Zhang & Xiao (2012) as one of the main reasons why one should

use these tests.

Looking at the results, the modified tests appear to be less prone to

reject the null hypotheses when data is unbalanced which is in line with results ob-

tained by Zhang & Xiao (2012). Test results from modified tests and standard tests

are almost identical when data is balanced, even when covariances are heteroscedastic.

Once again, this is highlighting the problems with unbalanced data and supporting the

fact that balanced data is preferable in empirical studies.

The results obtained from the conducted studies on MANOVA tests raise interesting

questions on what could be learned and improved. Even though results are pointing in the same direction, namely that modified MANOVA tests have higher p-values than standard MANOVA tests, one must be careful not to overinterpret these results. Further, a deeper

analysis of the underlying covariance structures in the data must be made in order to

generalize these results. For example, the covariance structure for the simulated data

is quite simple, so it is questionable whether complicated covariance structures would

yield equivalent results.

This master thesis aimed to contribute to the study of the performance of newly proposed

modified two-way MANOVA tests. The obtained results indicate that further studies

with a special emphasis on unbalancedness are needed. Moreover, different types of

factors (with respect to number of factor levels) and different covariance structures

(with respect to covariance complexity) must be implemented in these studies. The

heteroscedasticity of covariance matrices for factor level combinations might affect the

testing procedure even for balanced data. Further, one could recommend to investigate

the effect of different covariance structures on MANOVA-tests in the presence of het-

eroscedasticity.

Despite the difficulties of analyzing unbalanced and heteroscedastic data, the results obtained in this master thesis suggest that modified MANOVA tests are a useful statistical tool. As noted by Zhang & Xiao (2012), these tests are relatively powerful and unbiased, which further supports their wide application.


References

Ananda, M. M. A. & Weerahandi, S. (1997). Two-way ANOVA with unequal cell frequencies and unequal variances. Statistica Sinica, 7, 631–646.

Bao, P. & Ananda, M. M. A. (2001). Performance of two-way ANOVA procedures when cell frequencies and variances are unequal. Communications in Statistics - Simulation and Computation, 30, 805–829.

Box, G. E. P. (1949). A general distribution theory for a class of likelihood criteria. Biometrika, 36, 317–346.

Casella, G. & Berger, R. (2002). Statistical inference. Cengage Learning, Stamford.

Fujikoshi, Y. (1993). Two-way ANOVA models with unbalanced data. Discrete Mathematics, 116, 315–334.

Harrar, S. W. & Bathke, A. C. (2008). Nonparametric methods for unbalanced multivariate data and many factor levels. Journal of Multivariate Analysis, 99, 1635–1664.

Harville, D. (2008). Matrix algebra from a statistician's perspective. Springer, New York.

Herr, D. G. (1986). On the history of ANOVA in unbalanced, factorial designs: The first 30 years. The American Statistician, 40, 265–270.

Hill, T. & Lewicki, P. (2007). Statistics: Methods and applications. StatSoft.

Hotelling, H. (1947). Multivariate quality control: illustrated by the air testing of sample bombsights. McGraw-Hill, New York.

Langsrud, O. (2003). ANOVA for unbalanced data: Use Type II instead of Type III sums of squares. Statistics and Computing, 13, 163–167.

Lawley, D. N. (1938). A generalization of Fisher's z-test. Biometrika, 30, 180–187.

Littell, R., Stroup, W. & Freund, R. (2002). SAS for linear models. SAS Institute.

Pillai, K. C. S. (1955). Some new test criteria in multivariate analysis. The Annals of Mathematical Statistics, 26, 117–121.

Rencher, A. (2000). Linear models in statistics. Wiley, New York.

Rencher, A. (2003). Methods of multivariate analysis. Wiley, New York.

Ruggles, S., Sobek, M., Genadek, K., Alexander, J., Schroeder, M. & Goeken, R. (2010). Integrated public use microdata series: Version 5.0.

Rutherford, A. (2012). ANOVA and ANCOVA: A GLM approach. Wiley, New York.

Sawyer, S. (2009). Analysis of variance: The fundamental concepts. The Journal of Manual and Manipulative Therapy, 17, E27–E38.

Schott, J. (2005). Matrix analysis for statistics. Wiley Series in Probability and Statistics. Wiley, New York.

Searle, S. (1987). Linear models for unbalanced data. Wiley, New York.

Sen, P. K. (1986). Contemporary textbooks on multivariate statistical analysis: A panoramic appraisal and critique. Journal of the American Statistical Association, 81, 560–564.

Shaw, R. G. & Mitchell-Olds, T. (1993). ANOVA for unbalanced data: An overview. Ecology, 74, 1638–1645.

Timm, N. (2002). Applied multivariate analysis: methods and case studies. Springer, New York.

Weber, D. & Skillings, J. (2000). A first course in the design of experiments: A linear model approach. CRC Press, New York.

Wilks, S. S. (1932). Certain generalizations in the analysis of variance. Biometrika, 24, 471–494.

Zhang, J.-T. & Xiao, S. (2012). A note on the modified two-way MANOVA tests. Statistics and Probability Letters, 82, 519–527.


Appendices

A Matrix algebra

Definition 1 (Transpose). Let A be an n × m matrix. Then the transpose of A is an m × n matrix A′ such that the ith row, jth column element of A′ is the jth row, ith column element of A.

Definition 2 (Determinant). The following definition of determinants is found in Schott (2005). Let A be an m × m matrix. Then its determinant |A| is given by:

$$|A| = \sum (-1)^{f(i_1,\dots,i_m)} a_{1 i_1} a_{2 i_2} \cdots a_{m i_m} = \sum (-1)^{f(i_1,\dots,i_m)} a_{i_1 1} a_{i_2 2} \cdots a_{i_m m},$$

where the summation is taken over all permutations (i1, . . . , im) of the set of integers (1, . . . , m), and the function f(i1, . . . , im) equals the number of transpositions necessary to change (i1, . . . , im) to (1, . . . , m).
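The permutation expansion in Definition 2 can be evaluated directly for small matrices. The following Python sketch is an illustration only and is not part of the thesis code; the function names `permutation_sign` and `determinant` are hypothetical, and the approach is only practical for small m because the sum has m! terms.

```python
from itertools import permutations


def permutation_sign(perm):
    """Return (-1)**f, where f counts transpositions needed to sort perm."""
    perm = list(perm)
    swaps = 0
    for i in range(len(perm)):
        # Swap the element at position i into its home position until i holds i.
        while perm[i] != i:
            j = perm[i]
            perm[i], perm[j] = perm[j], perm[i]
            swaps += 1
    return -1 if swaps % 2 else 1


def determinant(a):
    """Determinant of a square matrix (list of rows) via the permutation sum."""
    m = len(a)
    total = 0
    for perm in permutations(range(m)):
        term = permutation_sign(perm)
        for row, col in enumerate(perm):
            term *= a[row][col]  # picks a_{1 i_1}, a_{2 i_2}, ..., a_{m i_m}
        total += term
    return total


# Hypothetical example matrix; the transpose illustrates the second equality
# in Definition 2 (summing over row indices instead of column indices).
a = [[2, 0, 1], [1, 3, 2], [1, 1, 4]]
a_transposed = [list(col) for col in zip(*a)]
print(determinant(a), determinant(a_transposed))
```

The two printed values agree, which reflects the equality of the two sums in the definition.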

Definition 3 (Trace). The following definition of the trace is found in Schott (2005). Let A be a p × p matrix. Then its trace, tr (A), is defined as the sum of the diagonal elements of A:

$$\operatorname{tr}(A) = \sum_{i=1}^{p} a_{ii}.$$

Definition 4 (Diagonal). Let A = (aij) be a square p× p matrix. Then the diagonal

of A is a p× 1 vector containing the elements a11, a22, . . . , app.

Definition 5 (Invertibility). A square matrix A is said to be invertible (or non-singular) if there exists a matrix $A^{-1}$ such that $AA^{-1} = I$, where I is the identity matrix and $A^{-1}$ is the inverse of A.

Definition 6 (Rank). The rank of a matrix A is the number of linearly independent

columns or rows in A.


Definition 7 (Full rank). A matrix A is said to be of full rank if all columns and/or

all rows in A are linearly independent. Thus, A : n × n is of full rank if and only if

all rows and columns of A are linearly independent. If A : n × p is of full rank, then rank(A) = min(n, p).

Definition 8 (Kronecker product). Let A be an m × n matrix and B be a p × q matrix. Then the Kronecker product, A ⊗ B, is the mp × nq block matrix:

$$A \otimes B = \begin{pmatrix} a_{11}B & \cdots & a_{1n}B \\ \vdots & \ddots & \vdots \\ a_{m1}B & \cdots & a_{mn}B \end{pmatrix}$$
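As a small illustration of Definition 8, the block structure can be built row by row. The Python sketch below is not taken from the thesis; the function name `kronecker` and the example matrices are hypothetical. The same operation is written with the `@` operator in the SAS/IML listing of Appendix F.1.

```python
def kronecker(a, b):
    """Kronecker product of an m x n matrix a and a p x q matrix b (lists of rows)."""
    p, q = len(b), len(b[0])
    result = []
    for a_row in a:              # one block row per row of a
        for r in range(p):       # each block row contains p ordinary rows
            result.append([a_ij * b[r][c] for a_ij in a_row for c in range(q)])
    return result


# Hypothetical 2 x 2 example: every entry a_ij is replaced by the block a_ij * B.
A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
for row in kronecker(A, B):
    print(row)
```

The result has mp rows and nq columns, as stated in the definition.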

Definition 9 (Linear independence). Vectors a1, a2, ..., ap are said to be linearly independent if there exist no scalars c1, c2, ..., cp, with at least one ci ≠ 0 (i = 1, 2, . . . , p), such that

c1a1 + c2a2 + ... + cpap = 0.

Definition 10 (Estimable functions). Let y = Xβ + ε with E(ε) = 0, and let λ : p × 1 be a vector of constants. Then λ′β is an estimable function if and only if λ′ is a linear combination of the rows of X, that is, if there exists a vector a such that a′X = λ′.
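Definitions 6, 9 and 10 can be checked numerically on a small design matrix. The sketch below is an illustration under stated assumptions, not part of the thesis code: the helper names `rank` and `is_estimable` are hypothetical, and estimability is tested through the equivalent condition that appending λ′ as an extra row of X leaves the rank unchanged (which holds exactly when λ′ lies in the row space of X).

```python
from fractions import Fraction


def rank(matrix):
    """Rank of a matrix (list of rows) by Gaussian elimination with exact fractions."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_cols = len(rows[0]) if rows else 0
    pivots = 0
    for col in range(n_cols):
        # Find a row at or below the current pivot row with a non-zero entry.
        pivot = next((r for r in range(pivots, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[pivots], rows[pivot] = rows[pivot], rows[pivots]
        for r in range(pivots + 1, len(rows)):
            factor = rows[r][col] / rows[pivots][col]
            rows[r] = [x - factor * y for x, y in zip(rows[r], rows[pivots])]
        pivots += 1
    return pivots


def is_estimable(x, lam):
    """True if lam' is a linear combination of the rows of x."""
    return rank(x + [list(lam)]) == rank(x)


# Hypothetical one-way layout with two groups; columns are (mu, alpha_1, alpha_2).
X = [[1, 1, 0], [1, 1, 0], [1, 0, 1], [1, 0, 1]]
print(rank(X))                      # X is not of full column rank
print(is_estimable(X, [1, 1, 0]))   # mu + alpha_1
print(is_estimable(X, [0, 1, -1]))  # alpha_1 - alpha_2
print(is_estimable(X, [0, 1, 0]))   # alpha_1 on its own
```

In this over-parameterized example the group mean and the group contrast are estimable, whereas a single effect parameter is not.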


B Summary statistics for the real-life data

Table 13: Univariate and multivariate tests of normality for balanced data.

Variable     Test           Statistic   Value    p
Income       Shapiro-Wilk   W           0.96     0.0196
H-W Score    Shapiro-Wilk   W           0.96     0.0151
Occ. Score   Shapiro-Wilk   W           0.96     0.0097
System       Mardia         Skewness    27.40    0.0022
             Mardia         Kurtosis    0.61     0.5419

Table 14: Univariate and multivariate tests of normality for unbalanced data.


Variable     Test           Statistic   Value    p
Income       Shapiro-Wilk   W           0.93     <.0001
H-W Score    Shapiro-Wilk   W           0.94     <.0001
Occ. Score   Shapiro-Wilk   W           0.95     0.0003
System       Mardia         Skewness    43.86    <.0001


Figure 2: QQ-plots of Squared Mahalanobis distances for balanced & unbalanced data.


Figure 3: Residuals (Income, H-W score and Occ. score) for balanced data

Figure 4: Residuals (Income, H-W score and Occ. score) for unbalanced data


C Summary statistics for the simulated data

Table 15: Univariate and multivariate tests of normality for balanced data.


Variable     Test           Statistic   Value    p
y1           Shapiro-Wilk   W           0.99     0.8368
y2           Shapiro-Wilk   W           0.92     <.0001
y3           Shapiro-Wilk   W           0.97     0.1069
System       Mardia         Skewness    31.18    0.0005


Table 16: Univariate and multivariate tests of normality for unbalanced data.


Variable     Test           Statistic   Value    p
y1           Shapiro-Wilk   W           0.98     0.6153
y2           Shapiro-Wilk   W           0.95     0.0013
y3           Shapiro-Wilk   W           0.97     0.1565
System       Mardia         Skewness    11.13    0.3471


Figure 5: QQ-plots of Squared Mahalanobis distances for balanced & unbalanced data.


Figure 6: Residuals (y1, y2 and y3) for balanced data

Figure 7: Residuals (y1, y2 and y3) for unbalanced data


D Univariate models for the two real-life data sets

Table 17: Test results for balanced data models. The first table uses Income as response variable, the second uses the H-W score and the third uses Occ. score.

Response variable: Income
Test   F_a    Df_a1  Df_a2   Pa      F_b    Df_b1   Df_b2   Pb      F_ab   Df_ab1  Df_ab2  Pab
HLT    0.41   1      112     0.5216  0.19   3       112     0.9040  1.18   3       112     0.3219
PT     0.41   1      112     0.5216  0.19   3       112     0.9040  1.18   3       112     0.3219
WL     0.41   1      112     0.5216  0.19   3       112     0.9040  1.18   3       112     0.3219

HLT1   0.41   1      95.829  0.5219  0.19   2.8243  95.829  0.8940  1.18   2.8243  95.829  0.3216
PT1    0.41   1      95.829  0.5219  0.19   2.8243  95.829  0.8940  1.18   2.8243  95.829  0.3216
WL1    0.41   1      95.829  0.5219  0.19   2.8243  95.829  0.8940  1.18   2.8243  95.829  0.3216

HLT2   0.41   1      95.829  0.5219  0.19   2.8243  95.829  0.8940  1.18   2.8243  95.829  0.3216
PT2    0.41   1      95.829  0.5219  0.19   2.8243  95.829  0.8940  1.18   2.8243  95.829  0.3216
WL2    0.41   1      95.829  0.5219  0.19   2.8243  95.829  0.8940  1.18   2.8243  95.829  0.3216

HLT3   0.41   1      95.829  0.5219  0.19   2.8243  95.829  0.8940  1.18   2.8243  95.829  0.3216
PT3    0.41   1      95.829  0.5219  0.19   2.8243  95.829  0.8940  1.18   2.8243  95.829  0.3216
WL3    0.41   1      95.829  0.5219  0.19   2.8243  95.829  0.8940  1.18   2.8243  95.829  0.3216

Response variable: H-W score
Test   F_a    Df_a1  Df_a2   Pa      F_b    Df_b1   Df_b2   Pb      F_ab   Df_ab1  Df_ab2  Pab
HLT    0.01   1      112     0.9331  0.79   3       112     0.5024  0.69   3       112     0.5577
PT     0.01   1      112     0.9331  0.79   3       112     0.5024  0.69   3       112     0.5577
WL     0.01   1      112     0.9331  0.79   3       112     0.5024  0.69   3       112     0.5577

HLT1   0.01   1      106.95  0.9331  0.79   2.9536  106.95  0.5008  0.69   2.9536  106.95  0.5557
PT1    0.01   1      106.95  0.9331  0.79   2.9536  106.95  0.5008  0.69   2.9536  106.95  0.5557
WL1    0.01   1      106.95  0.9331  0.79   2.9536  106.95  0.5008  0.69   2.9536  106.95  0.5557

HLT2   0.01   1      106.95  0.9331  0.79   2.9536  106.95  0.5008  0.69   2.9536  106.95  0.5557
PT2    0.01   1      106.95  0.9331  0.79   2.9536  106.95  0.5008  0.69   2.9536  106.95  0.5557
WL2    0.01   1      106.95  0.9331  0.79   2.9536  106.95  0.5008  0.69   2.9536  106.95  0.5557

HLT3   0.01   1      106.95  0.9331  0.79   2.9536  106.95  0.5008  0.69   2.9536  106.95  0.5557
PT3    0.01   1      106.95  0.9331  0.79   2.9536  106.95  0.5008  0.69   2.9536  106.95  0.5557
WL3    0.01   1      106.95  0.9331  0.79   2.9536  106.95  0.5008  0.69   2.9536  106.95  0.5557

Response variable: Occ. score
Test   F_a    Df_a1  Df_a2   Pa      F_b    Df_b1   Df_b2   Pb      F_ab   Df_ab1  Df_ab2  Pab
HLT    3.71   1      112     0.0567  0.80   3       112     0.4987  0.55   3       112     0.6508
PT     3.71   1      112     0.0567  0.80   3       112     0.4987  0.55   3       112     0.6508
WL     3.71   1      112     0.0567  0.80   3       112     0.4987  0.55   3       112     0.6508

HLT1   3.71   1      100.15  0.0570  0.80   2.9246  100.15  0.4962  0.55   2.9246  100.15  0.6464
PT1    3.71   1      100.15  0.0570  0.80   2.9246  100.15  0.4962  0.55   2.9246  100.15  0.6464
WL1    3.71   1      100.15  0.0570  0.80   2.9246  100.15  0.4962  0.55   2.9246  100.15  0.6464

HLT2   3.71   1      100.15  0.0570  0.80   2.9246  100.15  0.4962  0.55   2.9246  100.15  0.6464
PT2    3.71   1      100.15  0.0570  0.80   2.9246  100.15  0.4962  0.55   2.9246  100.15  0.6464
WL2    3.71   1      100.15  0.0570  0.80   2.9246  100.15  0.4962  0.55   2.9246  100.15  0.6464

HLT3   3.71   1      100.15  0.0570  0.80   2.9246  100.15  0.4962  0.55   2.9246  100.15  0.6464
PT3    3.71   1      100.15  0.0570  0.80   2.9246  100.15  0.4962  0.55   2.9246  100.15  0.6464
WL3    3.71   1      100.15  0.0570  0.80   2.9246  100.15  0.4962  0.55   2.9246  100.15  0.6464


Table 18: Test results for unbalanced model. The first table uses Income as response variable, the second uses the H-W score and the third uses Occ. score.

Response variable: Income
Test   F_a    Df_a1  Df_a2   Pa      F_b    Df_b1   Df_b2   Pb      F_ab   Df_ab1  Df_ab2  Pab
HLT    7.49   1      112     0.0072  0.13   3       112     0.9427  3.02   3       112     0.0328
PT     7.49   1      112     0.0072  0.13   3       112     0.9427  3.02   3       112     0.0328
WL     7.49   1      112     0.0072  0.13   3       112     0.9427  3.02   3       112     0.0328

HLT1   5.72   1      41.974  0.0214  0.11   2.5323  41.974  0.9307  1.94   2.5323  41.974  0.1457
PT1    5.72   1      41.974  0.0214  0.11   2.5323  41.974  0.9307  1.94   2.5323  41.974  0.1457
WL1    5.72   1      41.974  0.0214  0.11   2.5323  41.974  0.9307  1.94   2.5323  41.974  0.1457

HLT2   5.72   1      41.974  0.0214  0.11   2.5323  41.974  0.9307  1.94   2.5323  41.974  0.1457
PT2    5.72   1      41.974  0.0214  0.11   2.5323  41.974  0.9307  1.94   2.5323  41.974  0.1457
WL2    5.72   1      41.974  0.0214  0.11   2.5323  41.974  0.9307  1.94   2.5323  41.974  0.1457

HLT3   5.72   1      41.974  0.0214  0.11   2.5323  41.974  0.9307  1.94   2.5323  41.974  0.1457
PT3    5.72   1      41.974  0.0214  0.11   2.5323  41.974  0.9307  1.94   2.5323  41.974  0.1457
WL3    5.72   1      41.974  0.0214  0.11   2.5323  41.974  0.9307  1.94   2.5323  41.974  0.1457

Response variable: H-W score
Test   F_a    Df_a1  Df_a2   Pa      F_b    Df_b1   Df_b2   Pb      F_ab   Df_ab1  Df_ab2  Pab
HLT    0.07   1      112     0.7980  1.30   3       112     0.2780  2.91   3       112     0.0375
PT     0.07   1      112     0.7980  1.30   3       112     0.2780  2.91   3       112     0.0375
WL     0.07   1      112     0.7980  1.30   3       112     0.2780  2.91   3       112     0.0375

HLT1   0.07   1      59.726  0.7897  1.49   2.6923  59.726  0.2279  3.37   2.6923  59.726  0.0285
PT1    0.07   1      59.726  0.7897  1.49   2.6923  59.726  0.2279  3.37   2.6923  59.726  0.0285
WL1    0.07   1      59.726  0.7897  1.49   2.6923  59.726  0.2279  3.37   2.6923  59.726  0.0285

HLT2   0.07   1      59.726  0.7897  1.49   2.6923  59.726  0.2279  3.37   2.6923  59.726  0.0285
PT2    0.07   1      59.726  0.7897  1.49   2.6923  59.726  0.2279  3.37   2.6923  59.726  0.0285
WL2    0.07   1      59.726  0.7897  1.49   2.6923  59.726  0.2279  3.37   2.6923  59.726  0.0285

HLT3   0.07   1      59.726  0.7897  1.49   2.6923  59.726  0.2279  3.37   2.6923  59.726  0.0285
PT3    0.07   1      59.726  0.7897  1.49   2.6923  59.726  0.2279  3.37   2.6923  59.726  0.0285
WL3    0.07   1      59.726  0.7897  1.49   2.6923  59.726  0.2279  3.37   2.6923  59.726  0.0285

Response variable: Occ. score
Test   F_a    Df_a1  Df_a2   Pa      F_b    Df_b1   Df_b2   Pb      F_ab   Df_ab1  Df_ab2  Pab
HLT    0.67   1      112     0.4153  1.17   3       112     0.3253  1.97   3       112     0.1229
PT     0.67   1      112     0.4153  1.17   3       112     0.3253  1.97   3       112     0.1229
WL     0.67   1      112     0.4153  1.17   3       112     0.3253  1.97   3       112     0.1229

HLT1   0.82   1      53.615  0.3687  1.70   2.6206  53.615  0.1846  2.26   2.6206  53.615  0.0998
PT1    0.82   1      53.615  0.3687  1.70   2.6206  53.615  0.1846  2.26   2.6206  53.615  0.0998
WL1    0.82   1      53.615  0.3687  1.70   2.6206  53.615  0.1846  2.26   2.6206  53.615  0.0998

HLT2   0.82   1      53.615  0.3687  1.70   2.6206  53.615  0.1846  2.26   2.6206  53.615  0.0998
PT2    0.82   1      53.615  0.3687  1.70   2.6206  53.615  0.1846  2.26   2.6206  53.615  0.0998
WL2    0.82   1      53.615  0.3687  1.70   2.6206  53.615  0.1846  2.26   2.6206  53.615  0.0998

HLT3   0.82   1      53.615  0.3687  1.70   2.6206  53.615  0.1846  2.26   2.6206  53.615  0.0998
PT3    0.82   1      53.615  0.3687  1.70   2.6206  53.615  0.1846  2.26   2.6206  53.615  0.0998
WL3    0.82   1      53.615  0.3687  1.70   2.6206  53.615  0.1846  2.26   2.6206  53.615  0.0998


E Univariate models for the two simulated data sets

Table 19: Test results for balanced data models. The first table uses y1 as response variable, the second uses y2 and the third uses y3.

Response variable: y1
Test   F_a    Df_a1  Df_a2   Pa      F_b    Df_b1   Df_b2   Pb      F_ab   Df_ab1  Df_ab2  Pab
HLT    0.36   1      112     0.5482  0.81   3       112     0.4901  2.64   3       112     0.0530
PT     0.36   1      112     0.5482  0.81   3       112     0.4901  2.64   3       112     0.0530
WL     0.36   1      112     0.5482  0.81   3       112     0.4901  2.64   3       112     0.0530

HLT1   0.36   1      79.603  0.5486  0.81   2.795   79.603  0.4837  2.64   2.795   79.603  0.0591
PT1    0.36   1      79.603  0.5486  0.81   2.795   79.603  0.4837  2.64   2.795   79.603  0.0591
WL1    0.36   1      79.603  0.5486  0.81   2.795   79.603  0.4837  2.64   2.795   79.603  0.0591

HLT2   0.36   1      79.603  0.5486  0.81   2.795   79.603  0.4837  2.64   2.795   79.603  0.0591
PT2    0.36   1      79.603  0.5486  0.81   2.795   79.603  0.4837  2.64   2.795   79.603  0.0591
WL2    0.36   1      79.603  0.5486  0.81   2.795   79.603  0.4837  2.64   2.795   79.603  0.0591

HLT3   0.36   1      79.603  0.5486  0.81   2.795   79.603  0.4837  2.64   2.795   79.603  0.0591
PT3    0.36   1      79.603  0.5486  0.81   2.795   79.603  0.4837  2.64   2.795   79.603  0.0591
WL3    0.36   1      79.603  0.5486  0.81   2.795   79.603  0.4837  2.64   2.795   79.603  0.0591

Response variable: y2
Test   F_a    Df_a1  Df_a2   Pa      F_b    Df_b1   Df_b2   Pb      F_ab   Df_ab1  Df_ab2  Pab
HLT    0.29   1      112     0.5928  0.05   3       112     0.9861  0.27   3       112     0.8467
PT     0.29   1      112     0.5928  0.05   3       112     0.9861  0.27   3       112     0.8467
WL     0.29   1      112     0.5928  0.05   3       112     0.9861  0.27   3       112     0.8467

HLT1   0.29   1      56.517  0.5939  0.05   1.8803  56.517  0.9458  0.27   1.8803  56.517  0.7506
PT1    0.29   1      56.517  0.5939  0.05   1.8803  56.517  0.9458  0.27   1.8803  56.517  0.7506
WL1    0.29   1      56.517  0.5939  0.05   1.8803  56.517  0.9458  0.27   1.8803  56.517  0.7506

HLT2   0.29   1      56.517  0.5939  0.05   1.8803  56.517  0.9458  0.27   1.8803  56.517  0.7506
PT2    0.29   1      56.517  0.5939  0.05   1.8803  56.517  0.9458  0.27   1.8803  56.517  0.7506
WL2    0.29   1      56.517  0.5939  0.05   1.8803  56.517  0.9458  0.27   1.8803  56.517  0.7506

HLT3   0.29   1      56.517  0.5939  0.05   1.8803  56.517  0.9458  0.27   1.8803  56.517  0.7506
PT3    0.29   1      56.517  0.5939  0.05   1.8803  56.517  0.9458  0.27   1.8803  56.517  0.7506
WL3    0.29   1      56.517  0.5939  0.05   1.8803  56.517  0.9458  0.27   1.8803  56.517  0.7506

Response variable: y3
Test   F_a    Df_a1  Df_a2   Pa      F_b    Df_b1   Df_b2   Pb      F_ab   Df_ab1  Df_ab2  Pab
HLT    27.70  1      112     <.0001  18.53  3       112     <.0001  0.59   3       112     0.6233
PT     27.70  1      112     <.0001  18.53  3       112     <.0001  0.59   3       112     0.6233
WL     27.70  1      112     <.0001  18.53  3       112     <.0001  0.59   3       112     0.6233

HLT1   27.70  1      53.264  <.0001  18.53  1.8224  53.264  <.0001  0.59   1.8224  53.264  0.5431
PT1    27.70  1      53.264  <.0001  18.53  1.8224  53.264  <.0001  0.59   1.8224  53.264  0.5431
WL1    27.70  1      53.264  <.0001  18.53  1.8224  53.264  <.0001  0.59   1.8224  53.264  0.5431

HLT2   27.70  1      53.264  <.0001  18.53  1.8224  53.264  <.0001  0.59   1.8224  53.264  0.5431
PT2    27.70  1      53.264  <.0001  18.53  1.8224  53.264  <.0001  0.59   1.8224  53.264  0.5431
WL2    27.70  1      53.264  <.0001  18.53  1.8224  53.264  <.0001  0.59   1.8224  53.264  0.5431

HLT3   27.70  1      53.264  <.0001  18.53  1.8224  53.264  <.0001  0.59   1.8224  53.264  0.5431
PT3    27.70  1      53.264  <.0001  18.53  1.8224  53.264  <.0001  0.59   1.8224  53.264  0.5431
WL3    27.70  1      53.264  <.0001  18.53  1.8224  53.264  <.0001  0.59   1.8224  53.264  0.5431


Table 20: Test results for unbalanced model. The first table uses y1 as response variable, the second uses y2 and the third uses y3.

Response variable: y1
Test   F_a    Df_a1  Df_a2   Pa      F_b    Df_b1   Df_b2   Pb      F_ab   Df_ab1  Df_ab2  Pab
HLT    11.97  1      112     0.0008  3.31   3       112     0.0228  0.59   3       112     0.6218
PT     11.97  1      112     0.0008  3.31   3       112     0.0228  0.59   3       112     0.6218
WL     11.97  1      112     0.0008  3.31   3       112     0.0228  0.59   3       112     0.6218

HLT1   12.71  1      74.414  0.0006  3.51   2.869   74.414  0.0208  0.63   2.869   74.414  0.5920
PT1    12.71  1      74.414  0.0006  3.51   2.869   74.414  0.0208  0.63   2.869   74.414  0.5920
WL1    12.71  1      74.414  0.0006  3.51   2.869   74.414  0.0208  0.63   2.869   74.414  0.5920

HLT2   12.71  1      74.414  0.0006  3.51   2.869   74.414  0.0208  0.63   2.869   74.414  0.5920
PT2    12.71  1      74.414  0.0006  3.51   2.869   74.414  0.0208  0.63   2.869   74.414  0.5920
WL2    12.71  1      74.414  0.0006  3.51   2.869   74.414  0.0208  0.63   2.869   74.414  0.5920

HLT3   12.71  1      74.414  0.0006  3.51   2.869   74.414  0.0208  0.63   2.869   74.414  0.5920
PT3    12.71  1      74.414  0.0006  3.51   2.869   74.414  0.0208  0.63   2.869   74.414  0.5920
WL3    12.71  1      74.414  0.0006  3.51   2.869   74.414  0.0208  0.63   2.869   74.414  0.5920

Response variable: y2
Test   F_a    Df_a1  Df_a2   Pa      F_b    Df_b1   Df_b2   Pb      F_ab   Df_ab1  Df_ab2  Pab
HLT    0.74   1      112     0.3917  0.56   3       112     0.6450  0.17   3       112     0.9163
PT     0.74   1      112     0.3917  0.56   3       112     0.6450  0.17   3       112     0.9163
WL     0.74   1      112     0.3917  0.56   3       112     0.6450  0.17   3       112     0.9163

HLT1   0.73   1      37.441  0.3989  0.55   1.9299  37.441  0.5766  0.17   1.9299  37.441  0.8390
PT1    0.73   1      37.441  0.3989  0.55   1.9299  37.441  0.5766  0.17   1.9299  37.441  0.8390
WL1    0.73   1      37.441  0.3989  0.55   1.9299  37.441  0.5766  0.17   1.9299  37.441  0.8390

HLT2   0.73   1      37.441  0.3989  0.55   1.9299  37.441  0.5766  0.17   1.9299  37.441  0.8390
PT2    0.73   1      37.441  0.3989  0.55   1.9299  37.441  0.5766  0.17   1.9299  37.441  0.8390
WL2    0.73   1      37.441  0.3989  0.55   1.9299  37.441  0.5766  0.17   1.9299  37.441  0.8390

HLT3   0.73   1      37.441  0.3989  0.55   1.9299  37.441  0.5766  0.17   1.9299  37.441  0.8390
PT3    0.73   1      37.441  0.3989  0.55   1.9299  37.441  0.5766  0.17   1.9299  37.441  0.8390
WL3    0.73   1      37.441  0.3989  0.55   1.9299  37.441  0.5766  0.17   1.9299  37.441  0.8390

Response variable: y3
Test   F_a    Df_a1  Df_a2   Pa      F_b    Df_b1   Df_b2   Pb      F_ab   Df_ab1  Df_ab2  Pab
HLT    4.55   1      112     0.0350  10.89  3       112     <.0001  3.90   3       112     0.0107
PT     4.55   1      112     0.0350  10.89  3       112     <.0001  3.90   3       112     0.0107
WL     4.55   1      112     0.0350  10.89  3       112     <.0001  3.90   3       112     0.0107

HLT1   4.58   1      35.54   0.0392  10.96  1.773   35.54   0.0003  3.93   1.773   35.54   0.0331
PT1    4.58   1      35.54   0.0392  10.96  1.773   35.54   0.0003  3.93   1.773   35.54   0.0331
WL1    4.58   1      35.54   0.0392  10.96  1.773   35.54   0.0003  3.93   1.773   35.54   0.0331

HLT2   4.58   1      35.54   0.0392  10.96  1.773   35.54   0.0003  3.93   1.773   35.54   0.0331
PT2    4.58   1      35.54   0.0392  10.96  1.773   35.54   0.0003  3.93   1.773   35.54   0.0331
WL2    4.58   1      35.54   0.0392  10.96  1.773   35.54   0.0003  3.93   1.773   35.54   0.0331

HLT3   4.58   1      35.54   0.0392  10.96  1.773   35.54   0.0003  3.93   1.773   35.54   0.0331
PT3    4.58   1      35.54   0.0392  10.96  1.773   35.54   0.0003  3.93   1.773   35.54   0.0331
WL3    4.58   1      35.54   0.0392  10.96  1.773   35.54   0.0003  3.93   1.773   35.54   0.0331


F Codes

F.1 SAS codes

This SAS program calculates the modified and standard MANOVA test statistics analyzed in this master thesis.

/*--------------------------------------------------------------------

libname IPUMS is created where all output is saved. The data from

http://usa.ipums.org/ is extracted as a .csv file

and formatted in Excel. Importing the file "uppsatsdata.xlsx" to SAS.

------------------------------------------------------------------*/

%put ?????---;

libname IPUMS "F:\";

ods listing;

proc import out=ipums.data

datafile="F:\uppsatsdata.xlsx" dbms=excel replace;

getnames=yes;

run;

/*---------------------------------------------------------------

Creating a macro for importing simulated datasets from Matlab.

Renaming variables.

-----------------------------------------------------------------*/

%macro matlab(in,out=,dbms=);

proc import out=&out datafile=&in dbms=&dbms replace;

getnames=no;

data &out;

set &out;

sex=VAR1;

area=VAR2;

y1=VAR3;

y2=VAR4;

y3=VAR5;

drop VAR1-VAR5;

if VAR1=1 and VAR2=1 then group=1;

else if VAR1=1 and VAR2=2 then group=2;







else group=8;

run;

%mend matlab;

%matlab("F:\Matlab\myfile1.txt",out=ipums.simdata1,dbms=csv)



/*----------------------------------------------------------

Checking the content/variables of the Ipums data.

Variables that are possibly important/relevant are kept.

------------------------------------------------------------*/

proc contents data=ipums.data;

run;

Proc freq data=ipums.data nlevels;

tables sex region sex*region;

proc means data=ipums.data n nmiss min max mean std kurt skew;

var inctot incwage ftotinc occscore sei hwsei;

where inctot<999999 & incwage<999999 & ftotinc<9999999 &

0<occscore & 0<sei & 0<hwsei;

proc univariate data=ipums.data noprint;

var inctot incwage ftotinc occscore sei hwsei;

where inctot<999999 & incwage<999999 & ftotinc<9999999 &

0<occscore & 0<sei & 0<hwsei;

histogram;

run;

/*-----------------------------------------------------

Put constraints on the imported data.

--------------------------------------------------------*/

data ipums.data11;

set ipums.data (keep=age sex region occscore sei hwsei inctot);

where 0<inctot<999999 & 0<occscore & 0<sei & 0<hwsei & age>=16;

*if inctot=0 and inwage=0 and ftotinc=0 then delete;

if region in (11,12,13) then regions="northeast" ;

else if region in (21,22,23) then regions="midwest";

else if region in (31,32,33) then regions="south" ;

else if region in (41,42,43) then regions="west" ;

else regions="Missing"; /* area is set to 0 by the else branch of the area assignment */

if region in (11,12,13) then area=1;

else if region in (21,22,23) then area=2;




else area=0;

if SEX=1 then Gender="Male ";

else Gender="Female";

if sex=1 and region in (21,22,23) then group=1;

else if sex=1 and region in (11,12,13) then group=2;






else group=8;

drop region;

run;

proc sort data=ipums.data11 out=ipums.data11;

by sex area;

run;

/*------------------------------------------------------------------

Splitting the data based on the 8 factor combinations to create

samples for each combination from which final samples are drawn.

------------------------------------------------------------------*/

%macro datasets(sex=,area=,out=);

data &out;

set ipums.data11;

if sex=&sex and area=&area then do;

output &out;

end;

run;

%mend datasets;

%datasets(sex=1,area=1,out=ipums.data01);









/*----------------------------------------------------------------

Macro for sampling observations from each of the 8 factor

combinations. These obs. are later combined to final samples.

----------------------------------------------------------------*/

%macro c(out1,out2,out3,out4,out5,out6,out7,out8,

a1=,a2=,a3=,a4=,a5=,a6=,a7=,a8=);

%SAMPLE(ipums.data01,&out1,0.5,MRSS=&a1,OVERSAM=0);








%mend c;

/*-------------------------------------------------------------------

The macro below is written by Chang Jiang and is collected

from: http://www.nesug.org/proceedings/nesug00/ps/ps7012.pdf.

It samples observations from a larger dataset using SRS.

-------------------------------------------------------------------*/

%MACRO SAMPLE(EMDS,SAMPLE,RAND,MRSS=,OVERSAM=0.05);

DATA _NULL_;

FSS=CEIL(&MRSS*(1+&OVERSAM));

CALL SYMPUT('FSS',LEFT(PUT(FSS,8.)));

RUN;

/* get the number of FSS and store it in &FSS */

DATA _NULL_;

IF 0 THEN SET &EMDS NOBS=EM;

CALL SYMPUT('EM', LEFT(PUT(EM,8.)));

STOP;

RUN;

/* get the number of EM and store it in &EM at compile time */

DATA &EMDS; SET &EMDS;

OBSNUM=_N_;

/*use OBSNUM to track chosen members */

RUN;

DATA _NULL_;

N=FLOOR(&EM/&FSS);

START=MAX(ROUND(&RAND*N),1);

/* round START using .5 rule */

CALL SYMPUT('N', LEFT(PUT(N,8.)));

CALL SYMPUT('START',LEFT(PUT(START,8.)));

RUN;

DATA &SAMPLE(DROP=I);

LENGTH LIST $7;

DO I=1 TO &FSS;

OBSIN=&START+FLOOR((I-1)*(&EM/&FSS));

SET &EMDS POINT=OBSIN;

/*draw members by their observation #*/

IF I <= &MRSS THEN LIST='PRIMARY';

ELSE LIST='AUXILIA';

OUTPUT;

END;

STOP;

RUN;

%PUT EM=&EM MRSS=&MRSS FSS=&FSS N=&N START=&START;

/* output the values of these macro variables to SAS LOG */

%MEND SAMPLE;

/*------------------------------------------------------------------

This macro appends datasets.

-----------------------------------------------------------------*/

%macro append(in,out,a=);

data &out;

set

%do i = 1 %to &a;

&in&i

%end;

;

run;

%mend append;

/*-------------------------------------------------------------------------------

calculating log of the variables to give a better fit to a Normal distribution.

----------------------------------------------------------------------------------*/

%macro log(in=,out=);

data &out;

set &in;

loginc=log(inctot);

loghw=log(hwsei);

logocc=log(occscore);

keep group sex area loginc loghw logocc;

run;


%mend log;

/*-----------------------------------------------------------------

Run macros c, append and log to obtain 3 samples, n=120. 1 sample is

balanced and 2 are unbalanced. All response variables are logged.

-------------------------------------------------------------------*/

%c(ipums.nytest1,ipums.nytest2,ipums.nytest3,ipums.nytest4,

ipums.nytest5,ipums.nytest6,ipums.nytest7,ipums.nytest8,

a1=15,a2=15,a3=15,a4=15,a5=15,a6=15,a7=15,a8=15)

%append(ipums.nytest,ipums.urval10,a=8)

%log(in=ipums.urval10,out=ipums.urval1)



a1=9,a2=7,a3=14,a4=10,a5=18,a6=15,a7=29,a8=18)





a1=10,a2=10,a3=10,a4=30,a5=10,a6=10,a7=10,a8=30)



/*-----------------------------------------------------------------------------

A macro with the program for obtaining modified test statistics as proposed by

Zhang and Xiao (2012) using proc iml. All results are collected and outputted.

-----------------------------------------------------------------------------*/

%macro teststatistics(var1,var2,var3,in=,out=);

ods listing;

/* Loading data into proc iml and naming variables */

proc iml;

use &in;

read all var {sex} into x2;

read all var {area} into x3;

read all var {&var1 &var2 &var3} into y;

close &in;

rows=nrow(y);

p=ncol(y); /*dimensions of y*/

a=2; /*number of levels of factor a*/

b=4; /*number of levels of factor b*/

ab=a*b; /*number of combinations*/


ny1=J(rows,1,0);

ny2=J(rows,1,0);

ny3=J(rows,1,0);

ny4=J(rows,1,0);

ny5=J(rows,1,0);

ny6=J(rows,1,0);

ny7=J(rows,1,0);

ny8=J(rows,1,0);

do i=1 to rows;

if x2[i]=1 & x3[i]=1 then ny1[i]=1;

else if x2[i]=1 & x3[i]=2 then ny2[i]=1;






else ny8[i]=1;

end;

n=sum(ny1)//sum(ny2)//sum(ny3)//sum(ny4)//

sum(ny5)//sum(ny6)//sum(ny7)//sum(ny8);

print n; /*number of observations for each ij*/

/* Creating means and covariances for each ij combination. */

covij=j(ab*p,p,0);

meanij=j(ab,p,0);

do i=1 to ab;

jn=J(n[i],1);

jnn=J(n[i]);

in=I(n[i]);

s=y[sum(n[1:i])-(n[i]-1):sum(n[1:i]),];

meanij[i,]=((1/n[i])*(jn`*s)`)`;

covij[(i*p)-(p-1):i*p,]=(1/(n[i]-1))*(s`*(in-(1/n[i])*jnn)*s);

end;

/*calculating covariance estimator G.*/

G1=J(p,p,0);

print G1;

do i=1 to ab;

G1=G1+(1/n[i])*covij[(p*i)-(p-1):p*i,];

end;

G=(1/ab)*G1;


print meanij covij G;

/*calculating degrees of freedom fH, fG for 3 modifications*/

traceG2=trace(G*G);

trace2G=trace(G)**2;

sumij1=0;

sumij2=0;

sumij3=0;

fG=J(3,1,0);

do i=1 to ab;

Sij=covij[(p*i)-(p-1):p*i,];

/* Harrar and Bathkes method*/

trace1=(1/(n[i]-1))*(n[i]**-2)*trace(Sij**2);

sumij1=sumij1+trace1;

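/* Zhang and Xiao's method 1 */

trace2=(1/(n[i]-1))*(n[i]**-2)*(trace(Sij**2)+trace(Sij)**2);

sumij2=sumij2+trace2; /* accumulation restored by analogy with sumij1 */

/* Zhang and Xiao's method 2 */

trace3=(1/(n[i]-1))*(n[i]**-2)*(trace((Sij*inv(G))**2)+

trace(Sij*inv(G))**2);

sumij3=sumij3+trace3; /* accumulation restored by analogy with sumij1 */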


end;

fG[1]=(ab**2)*traceG2/sumij1;

fG[2]=(ab**2)*(traceG2+trace2G)/sumij2;

fG[3]=(ab**2)*p*(p+1)/sumij3;

print sumij1 sumij2 sumij3 fG;

/* calculating design weights C needed for calculation of fH.

C are depending on the hypothesis tested.*/

C_A=(1/(a-1))*((I(a)-(J(a)/a))@(J(b)/b));

C_B=(1/(b-1))*((J(a)/a)@(I(b)-(J(b)/b)));

C_AB=(1/((b-1)*(a-1)))*((I(a)-(J(a)/a))@(I(b)-(J(b)/b)));

print C_A C_B C_AB;

/*Defining a covariance matrix that is already divided by n.*/

covar=J(p*ab,p,0);

do i=1 to ab;

covar[(p*i)-(p-1):p*i,]=covij[(p*i)-(p-1):p*i,]/n[i];

end;

print covar;


/* calculating fG and fH for 3 hypotheses and 3 methods.*/

q=3; /* number of hypotheses*/

cov_ab=covar;

fH=J(3,q,0);

do m=1 to q;

summa1=0;

summa2=0;

summa3=0;

if m=1 then C=C_A;

else if m=2 then C=C_B;

else if m=3 then C=C_AB;

do i=1 to ab;

do j=1 to ab;

Sj=cov_ab[(p*j)-(p-1):p*j,];

Si=covar[(p*i)-(p-1):p*i,];

temp1=(C[i,j]**2)*trace(Si*Sj);

temp2=(C[i,j]**2)*(trace(Si*Sj)+

trace(Si)*trace(Sj));

temp3=(C[i,j]**2)*(trace(Si*inv(G)*Sj*inv(G))

+trace(Si*inv(G))*trace(Sj*inv(G)));

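summa1=summa1+temp1;

summa2=summa2+temp2; /* accumulation restored by analogy with summa1 */

summa3=summa3+temp3; /* accumulation restored by analogy with summa1 */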



end;

end;

/*rows=method. 1=Harrar&Bathke, 2=Zhang&Xiao1, 3=Zhang&Xiao2.

columns=hypothesis. 1=A, 2=B, 3=AB*/

fH[1,m]=traceG2/summa1;

fH[2,m]=(traceG2+trace2G)/summa2;

fH[3,m]=p*(p+1)/summa3;

end;

print fH;

/*calculating modified test statistics. Defining H=vmu`*C*vmu.*/

T_wlr=J(3,q,0); /*modified Wilks Lambda*/

T_lht=J(3,q,0); /*modified Hotelling-Lawley trace*/

T_bnp=J(3,q,0); /*modified Pillais trace*/

W1=J(3*p,q*p,0);

H=J(3,3,0);

W2=fG[1]*G//fG[2]*G//fG[3]*G;

do k=1 to 3; /*method*/

do m=1 to 3; /*hypothesis*/

if m=1 then C=C_A;

else if m=2 then C=C_B;

else if m=3 then C=C_AB;

W1[(p*k)-(p-1):p*k,p*(m-1)+1:p*m]=fH[k,m]*(meanij`*C*meanij);

T_wlr[k,m]=-log(det(W2[(p*k)-(p-1):p*k,])/det(W1[(p*k)

-(p-1):p*k,(p*m)-(p-1):p*m]+W2[(p*k)-(p-1):p*k,]));

/* "/" is elementwise division in IML, so inv() is used for the matrix products */

T_lht[k,m]=trace(W1[(p*k)-(p-1):p*k,(p*m)-(p-1):p*m]*

inv(W2[(p*k)-(p-1):p*k,]));

T_bnp[k,m]=trace(W1[(p*k)-(p-1):p*k,(p*m)-(p-1):p*m]*inv(W1[(p*k)

-(p-1):p*k,(p*m)-(p-1):p*m]+W2[(p*k)-(p-1):p*k,]));

end;

end;

print W1 W2;

/*calculating F-approximations of T_wlr, T_lht and T_bnp*/

tabell=J(9,12,0);

do k=1 to 3; /*method*/

do m=1 to q; /*hypothesis. q=3*/

B=W1[(p*k)-(p-1):p*k,p*(m-1)+1:p*m];

W=W2[(p*k)-(p-1):p*k,];

f_H=fH[k,m];

f_G=fG[k];

teststat=J(3,1,0);

ps=J(3,1,0);

df=J(3,2,0);

do t=1 to 3; /*Test statistic*/

if t=1 then do; /* Wilks Lambda*/

stat=det(W*inv(B+W));

mo=f_H;

no=f_G;

ko=no-0.5*(p-mo+1);

ro=p*mo/2-1;

so=sqrt(((p*mo)**2-4)/(p**2+mo**2-5));

df_1=p*mo;

df_2=ko*so-ro;

f_stat=(stat**(-1/so)-1)*(df_2/df_1);

pval=1-probf(f_stat,df_1,df_2);

end;

if t=2 then do; /*Hotelling-Lawley trace*/

stat1=trace(B*inv(W));


d1=f_H;

d2=f_G;

s1=min(p,d1);

m1=(abs(p-d1)-1)/2;

n1=(d2-p-1)/2;

df_1=s1*(2*m1+s1+1);

df_2=2*(n1*s1+1);

f_stat=(stat1*df_2)/(s1*df_1);

pval=1-probf(f_stat,df_1,df_2); /* restored by analogy with the Wilks branch */


end;

if t=3 then do; /*Pillai trace*/

stat2=trace(inv(B+W)*B);

d1=f_H;

d2=f_G;

s1=min(p,d1);

m1=(abs(p-d1)-1)/2;

n1=(d2-p-1)/2;

df_1=s1*((2*m1)+s1+1);

df_2=s1*((2*n1)+s1+1);

f_stat=(df_2/df_1)*(stat2/(s1-stat2));

pval=1-probf(f_stat,df_1,df_2); /* restored by analogy with the Wilks branch */


end;

teststat[t]=f_stat;

ps[t]=pval;

df[t,1]=df_1;

df[t,2]=df_2;

end;

tab=teststat||df||ps;

tabell[(3*k)-2:3*k,(4*m)-3:4*m]=tab;

end;

end;

print tabell;

/*Creating a table with all results, exporting to a sas data set.*/

b={"FValue_A" "NumDf_A" "DenDf_A" "ProbF_A" "FValue_B" "NumDf_B" "DenDf_B"

"ProbF_B" "FValue_AB" "NumDf_AB" "DenDf_AB" "ProbF_AB"};

create &out from tabell [ colname=b ];

append from tabell;

Statistic={"Hotelling-Lawley Trace", "Pillai's Trace" ,"Wilks' Lambda",

"Hotelling-Lawley Trace", "Pillai's Trace" ,"Wilks' Lambda",

"Hotelling-Lawley Trace", "Pillai's Trace" ,"Wilks' Lambda"};

name={"Statistic"};


create ipums.name from Statistic [ colname=name ];

append from Statistic;

quit; /*end of proc iml program*/

%mend teststatistics;

/*--------------------------------------------------------------------

The below macro runs the 2-way MANOVA model on the samples.

Extracting results from standard test statistics. This is done

for comparison of the modified MANOVA statistics.

--------------------------------------------------------------------*/

%macro manova(var1,var2,var3,in=,in2=,out=);

data ipums.name;

set ipums.name;

n=_n_;

data &in2;

set &in2;

n=_n_;

data &in2;

merge ipums.name &in2;

by n;

drop n;

run;

ods listing close;

ods trace on;

ods output "Multivariate Tests"=ipums.manova;

proc glm data=&in plots=none;

class sex area;

model &var1 &var2 &var3 = sex area sex*area /solution P;

manova h=_all_ / printh;

run;

quit;

ods output close;

ods listing;

data ipums.manova;

set ipums.manova;

if _n_ in (4,8,12) then delete;

drop hypothesis error pvalue Value;

data ipums.manova1;

set ipums.manova;

if _n_ in (1,2,3);

data ipums.manova2;


set ipums.manova;

if _n_ in (4,5,6);

data ipums.manova3;

set ipums.manova;

if _n_ in (7,8,9);

proc sort data=ipums.manova1 out=ipums.manova1;

by statistic;

proc sort data=ipums.manova2 out=ipums.manova2;

by statistic;

proc sort data=ipums.manova3 out=ipums.manova3;

by statistic;

data ipums.manova;

merge ipums.manova1 (rename=(Fvalue=FValue_A NumDF=NumDF_A

DenDF=DenDF_A ProbF=ProbF_A))

ipums.manova2 (rename=(Fvalue=FValue_B NumDF=NumDF_B

DenDF=DenDF_B ProbF=ProbF_B))

ipums.manova3 (rename=(Fvalue=FValue_AB NumDF=NumDF_AB

DenDF=DenDF_AB ProbF=ProbF_AB));

by statistic;

proc append base=ipums.manova data=&in2;

data &out;

set ipums.manova(rename=(FValue_A=F_a NumDF_A=Df_a1 DenDF_A=Df_a2

ProbF_A=Pa FValue_B=F_b NumDF_B=Df_b1 DenDF_B=Df_b2 ProbF_B=Pb

FValue_AB=F_ab NumDF_AB=Df_ab1 DenDF_AB=Df_ab2 ProbF_AB=Pab));

run;

%mend manova;

/*-------------------------------------------------------------------

Macro with results from Box’s M test. Exporting results as latex-files.

---------------------------------------------------------------------*/

%macro boxmtest(var1,var2,var3,out,in=);

ods listing close;

ods output "Homogeneity Test"=&out;

proc discrim data=&in pool=test wcov;

class group;

var &var1 &var2 &var3;

run;

ods output close;

ods listing;

%mend boxmtest;

/*-------------------------------------------------------------------


macro for generating univariate residuals which are later analyzed.

--------------------------------------------------------------------*/

%macro univnorm(var1,var2,var3,in=,out=);

proc glm data=&in PLOTS=none noprint;

class sex area;

model &var1 &var2 &var3=sex area sex*area /solution P;

manova h=_all_ / printh;

output out=&out residual=res1-res3;

run;

quit;

proc univariate data=&out noprint;

var res1 res2 res3;

histogram res1 res2 res3 /normal ;

run;

%mend univnorm;

/*--------------------------------------------------------------------------------

Importing macro %multnorm which was collected from http://www.srce.unizg.hr/

fileadmin/Srce/proizvodi_usluge/referalni_centri/SAS/stat-sasprog/MultNormMacro.sas

---------------------------------------------------------------------------------*/

%inc "F:\multnorm.sas";

/*--------------------------------------------------------------

macro for getting all results from the iml program, univariate

and multivariate data analysis, manova models and box’s M test.

---------------------------------------------------------------*/

%macro resultat(var1,var2,var3,outbox,outuni,in1=,in2=,in3=,

out1=,out2=,out3=,file1=,file2=,file3=);

/*modified and standard MANOVA test results for the 3 samples*/

%teststatistics(&var1,&var2,&var3,in=&in1,out=ipums.teststats)

%manova(&var1,&var2,&var3,in=&in1,in2=ipums.teststats,out=&out1)





/*Box M test*/

%boxmtest(&var1,&var2,&var3,ipums.Box1,in=&in1)

%boxmtest(&var1,&var2,&var3,ipums.Box2,in=&in2)

%boxmtest(&var1,&var2,&var3,ipums.Box3,in=&in3)

data &outbox;


set ipums.Box1 ipums.Box2 ipums.Box3;

Chi=round(ChiSq,0.1);

n=_n_;

data &outbox;

merge &outbox(drop=DF ProbChiSq ChiSq)

&outbox(drop=Chi ChiSq);

by n;

drop n;

run;

/*modified and standard ANOVA test results for the 3 samples*/

%teststatistics(&var1,in=&in1,out=ipums.teststats20)

%manova(&var1,in=&in1,in2=ipums.teststats20,out=ipums.sum1)

















%append(ipums.sum,&outuni,a=9)

/*generating residual plots and exporting results*/

ods tagsets.simplelatex file=&file2 stylesheet="sas.sty"(url="sas");

%univnorm(&var1,&var2,&var3,in=&in1,out=ipums.residlog1);



ods tagsets.simplelatex close;

/*generating info on the multivariate normal distribution and exporting results*/


%multnorm(data=&in1, var=&var1 &var2 &var3, plot=mult)





*Exporting other results obtained above;


proc print data=&out1;



proc print data=&outbox;

proc print data=&outuni;

run;


%mend resultat; /*End of all macros*/

%put *****;

/*------------------------------------------------------------------

Obtaining results by running above macros. Do this for 2 situations:

1. 3 real life data samples obtained above

2. 3 simulated data sets imported from MATLAB

--------------------------------------------------------------------*/

/*simulated data*/

%resultat(y1,y2,y3,ipums.box01,ipums.sum10,in1=ipums.simdata1,

in2=ipums.simdata2,in3=ipums.simdata3,out1=ipums.manovatab11,

out2=ipums.manovatab12,out3=ipums.manovatab13,

file1="F:\Latex\simulations01.tex",file2="F:\Latex\univar01.tex",

file3="F:\Latex\multivar01.tex")

/*real life data*/

%resultat(loginc,loghw,logocc,ipums.box02,ipums.sum20,in1=ipums.urval1,

in2=ipums.urval2,in3=ipums.urval3,out1=ipums.manovatab21,

out2=ipums.manovatab22,out3=ipums.manovatab23,

file1="F:\Latex\urval01.tex",file2="F:\Latex\univar02.tex",

file3="F:\Latex\multivar02.tex")

/*----------------------------------------------------------------

END OF PROGRAM!

----------------------------------------------------------------*/
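As a cross-check on the F approximations coded in the IML step of %teststatistics, the following minimal Python sketch recomputes the approximate F statistic and its degrees of freedom for the three criteria. It is not part of the thesis program: the function names are hypothetical, the fallback s = 1 for an undefined ratio in the Wilks case is an assumption that the SAS code does not make, and p-values are omitted because the Python standard library has no F distribution function.

```python
import math


def _s_m_n(p, f_h, f_g):
    # s, m and n as defined in the Hotelling-Lawley and Pillai branches
    s = min(p, f_h)
    m = (abs(p - f_h) - 1) / 2
    n = (f_g - p - 1) / 2
    return s, m, n


def wilks_f(lam, p, f_h, f_g):
    """Rao's F approximation for Wilks' Lambda (mirrors the first branch)."""
    k = f_g - 0.5 * (p - f_h + 1)
    r = p * f_h / 2 - 1
    denom = p ** 2 + f_h ** 2 - 5
    # Assumption: use s = 1 when the ratio under the root is undefined
    s = math.sqrt(((p * f_h) ** 2 - 4) / denom) if denom > 0 else 1.0
    df1 = p * f_h
    df2 = k * s - r
    f = (lam ** (-1 / s) - 1) * (df2 / df1)
    return f, df1, df2


def hotelling_lawley_f(u, p, f_h, f_g):
    """F approximation for the Hotelling-Lawley trace u = trace(B W^-1)."""
    s, m, n = _s_m_n(p, f_h, f_g)
    df1 = s * (2 * m + s + 1)
    df2 = 2 * (n * s + 1)
    return (u * df2) / (s * df1), df1, df2


def pillai_f(v, p, f_h, f_g):
    """F approximation for the Pillai trace v = trace((B + W)^-1 B)."""
    s, m, n = _s_m_n(p, f_h, f_g)
    df1 = s * (2 * m + s + 1)
    df2 = s * (2 * n + s + 1)
    return (df2 / df1) * (v / (s - v)), df1, df2


# Univariate check: p = 1, f_H = 1, f_G = 10 and SSH = SSE,
# so Lambda = 0.5, the Hotelling-Lawley trace is 1 and the Pillai trace is 0.5.
for name, result in [("Wilks", wilks_f(0.5, 1, 1, 10)),
                     ("Hotelling-Lawley", hotelling_lawley_f(1.0, 1, 1, 10)),
                     ("Pillai", pillai_f(0.5, 1, 1, 10))]:
    print(name, result)
```

In this univariate case all three approximations reduce to the ordinary F statistic, here F = 10 on (1, 10) degrees of freedom, which gives a quick sanity check of the formulas.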

F.2 MATLAB codes

This MATLAB program was written to obtain the simulated data sets.


%----------------------------------------------------------
% Simulation of a balanced data set with 120 observations.
% The code is based on simulation methods presented
% in Zhang & Xiao (2012).
%----------------------------------------------------------
nsim=2;
A=repmat(0,[120 5 nsim]);
kkdelta=2.4; %determining differences in means;
hSigma2=[1,0,0;0,5,0;0,0,0.1];
hSigma=[];
hSigma=[hSigma;eye(3,3)]; %defining 2 covariance structures.
hSigma=[hSigma;hSigma2];
hSigma=kron(ones(4,1),hSigma);
Gsize=[15,15];
gsize=kron(ones(4,1),Gsize);
p=3; %dimensions of y;
u=[1:3]/sqrt(sum([1:3].^2));
data=[];
ij=0;
a=2; %number of levels for factor A;
b=4; %number of levels for factor B;

% generates the data;
for ia=1:2,
    for ib=1:4,
        ij=ij+1;
        ijflag=((ij-1)*p+1):(ij*p);
        nij=gsize(ij);
        if (ia==1)&&(ib==1),
            yij=randn(nij,p)*hSigma(ijflag,:);
        else
            yij=ones(nij,1)*kkdelta*u*ia*ib/b/a+randn(nij,p)*hSigma(ijflag,:);
        end
        ijdata=[ia*ones(nij,1),ib*ones(nij,1),yij];
        data=[data;ijdata];
    end
end
A(:,:,1)=data;

%------------------------------------------------------
% Simulation of an unbalanced data set
% with 120 observations. Covariance structures
% and mean differences are the same as above.
%------------------------------------------------------
kkdelta=2.4;
hSigma2=[1,0,0;0,5,0;0,0,0.1];
hSigma=[];
hSigma=[hSigma;eye(3,3)];
hSigma=[hSigma;hSigma2];
hSigma=kron(ones(4,1),hSigma);
Gsize=[10,20];
gsize=kron(ones(4,1),Gsize);
p=3;
u=[1:3]/sqrt(sum([1:3].^2));
data=[];
ij=0;
a=2;
b=4;
for ia=1:2,
    for ib=1:4,
        ij=ij+1;
        ijflag=((ij-1)*p+1):(ij*p);
        nij=gsize(ij);
        if (ia==1)&&(ib==1),
            yij=randn(nij,p)*hSigma(ijflag,:);
        else
            yij=ones(nij,1)*kkdelta*u*ia*ib/b/a+randn(nij,p)*hSigma(ijflag,:);
        end
        ijdata=[ia*ones(nij,1),ib*ones(nij,1),yij];
        data=[data;ijdata];
    end
end
A(:,:,2)=data;
data1=A(:,:,1);
data2=A(:,:,2);

%exporting the data to text files;
dlmwrite('myfile1.txt',data1)
dlmwrite('myfile2.txt',data2)

%end of program;
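
The simulation design of the MATLAB program above can be restated compactly. The following Python sketch is a hedged re-expression, not a replacement for that program: the function name `simulate` and its arguments are hypothetical, it relies on both scale matrices being diagonal (so that multiplying by them simply rescales each response column), and it uses Python's own random number generator, so it will not reproduce the MATLAB draws. The unbalanced group sizes are written out as four cells of 10 followed by four cells of 20, which is what MATLAB's column-wise linear indexing into the 4 x 2 matrix `gsize` yields.

```python
import math
import random


def simulate(group_sizes, kkdelta=2.4, a=2, b=4, seed=None):
    """Return rows [ia, ib, y1, y2, y3] for an a x b layout with p = 3 responses."""
    rng = random.Random(seed)
    p = 3
    norm = math.sqrt(sum(k * k for k in range(1, p + 1)))
    u = [k / norm for k in range(1, p + 1)]  # unit direction of the mean shift
    # Diagonals of eye(3) and hSigma2; consecutive cells alternate between them
    scales = [[1.0, 1.0, 1.0], [1.0, 5.0, 0.1]]
    rows = []
    cell = 0
    for ia in range(1, a + 1):
        for ib in range(1, b + 1):
            scale = scales[cell % 2]
            # Cell (1, 1) has mean zero; every other cell is shifted along u
            shift = 0.0 if (ia == 1 and ib == 1) else kkdelta * ia * ib / b / a
            for _ in range(group_sizes[cell]):
                y = [shift * u[k] + scale[k] * rng.gauss(0.0, 1.0) for k in range(p)]
                rows.append([ia, ib] + y)
            cell += 1
    return rows


balanced = simulate([15] * 8, seed=1)
unbalanced = simulate([10] * 4 + [20] * 4, seed=1)
print(len(balanced), len(unbalanced))  # both data sets have 120 rows
```

Both designs contain 120 observations; in the unbalanced design the first level of factor A holds 40 observations and the second holds 80.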


F.3 R code

R program for constructing interaction plots in section 3.2.

# Showing one example of interaction effects and

# one of no interaction effects

Y <- c(4,5,9,2,3,7)

A <- c(1,2,3,1,2,3)

B <- c(1,1,1,2,2,2)

hej = aov(Y~A+B+A*B) #do the analysis of variance

par(mfrow=c(1,2)) # Two-way Interaction Plot

A1 <- factor(A)

B1 <- factor(B)

interaction.plot(A1, B1, Y,type="b", col=c("red","blue"), legend=F,lty=c(1,2),

lwd=2, pch=c(1,24),xlab="Factor B levels",ylim=c(0,10),

ylab="Mean values of y",main="No Interaction")

par(family = "")

legend("topleft",c("level 1 ","level 2"),border="black", bty="o",

bg="beige", lty=c(1,2),lwd=2,pch=c(1,24),

col=c("red","blue"), title="Factor A levels",inset = .02)

Y1 <- c(4,3,9,2,5,7)

interaction.plot(A1, B1, Y1,type="b", col=c("red","blue"), legend=F,lty=c(1,2),

lwd=2, pch=c(1,24),xlab="Factor B levels",ylim=c(0,10),

ylab="Mean values of y",main="Interaction")

par(family = "")

legend("topleft",c("level 1 ","level 2"),border="black", bty="o",

bg="beige", lty=c(1,2),lwd=2,pch=c(1,24),

col=c("red","blue"), title="Factor A levels",inset = .02)

#end of program
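
With one observation per cell, the presence of interaction in the two examples can also be checked numerically: the two plotted lines are parallel exactly when their vertical gap is the same at every level on the horizontal axis. The Python sketch below is an illustrative check under that assumption; the function names are hypothetical and not part of the R program.

```python
def interaction_gaps(y, n_levels=3):
    """Gap between the two plotted lines at each level on the horizontal axis."""
    line1, line2 = y[:n_levels], y[n_levels:]
    return [second - first for first, second in zip(line1, line2)]


def has_interaction(y, n_levels=3):
    # Parallel lines (no interaction) give one and the same gap everywhere
    return len(set(interaction_gaps(y, n_levels))) > 1


Y = [4, 5, 9, 2, 3, 7]    # data behind the "No Interaction" panel
Y1 = [4, 3, 9, 2, 5, 7]   # data behind the "Interaction" panel

print(interaction_gaps(Y), has_interaction(Y))     # [-2, -2, -2] False
print(interaction_gaps(Y1), has_interaction(Y1))   # [-2, 2, -2] True
```

The constant gap of -2 for Y corresponds to the parallel lines in the first panel, while the changing gap for Y1 corresponds to the crossing lines in the second.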
