BAYES FACTORS FOR VARIANCE COMPONENTS
IN THE MIXED LINEAR MODEL
by
MITHAT GONEN, B.S., M.S., M.S.
A DISSERTATION
IN
BUSINESS ADMINISTRATION
Submitted to the Graduate Faculty of Texas Tech University in
Partial Fulfillment of the Requirements for
the Degree of
DOCTOR OF PHILOSOPHY
Approved
Accepted
December, 1996
ACKNOWLEDGMENTS
There are many people who have made the completion of this dissertation
easier and more pleasant. Among them are Banu Altunbaş and Demet Nalbant
who showed great companionship and hospitality during my visits to Lubbock,
and Patsy and the staff of the Graduate School who handled the intricacies of my
dual-degree program.
My committee members have also been very helpful. Paul Randolph is the one
who started it all by recruiting and encouraging me to attend graduate school.
Ronald Bremer and William Conover made suggestions that led to considerable
improvements in the dissertation and Benjamin Duran provided his usual friendship
and understanding.
However, none has influenced my education more than Peter Westfall, my
advisor. He has been a teacher and a role model for me, an example of
how hard work, insightfulness and poise can lead to professional success. I am sure
that without him I would have emerged from my years of study as a less capable
teacher and researcher.
My graduate study was funded by a scholarship from Middle East Technical
University, Turkey. I take this as generous support from the people of Turkey, to
whom I am indebted.
Finally, my heart is with Elza, who makes everything meaningful and worthwhile.
CONTENTS
ACKNOWLEDGMENTS
ABSTRACT
LIST OF TABLES
I. INTRODUCTION
   1.1 Overview
   1.2 Purpose and Importance of the Research
II. LITERATURE REVIEW
   2.1 Linear Models
   2.2 Bayesian Analysis
   2.3 Bayesian Analysis of Linear Models
III. THEORETICAL DEVELOPMENT
   3.1 Preliminaries
   3.2 Prior Selection
   3.3 Deriving the Bayes Factor
   3.4 Hierarchical Approach
   3.5 Missing Observations
IV. NUMERICAL METHODS
   4.1 Monte Carlo Estimation of the Bayes Factor
   4.2 Data and Model
   4.3 Simple Random Sampling
   4.4 Latin Hypercube Sampling
   4.5 Gibbs Sampling
V. CONCLUSIONS AND FUTURE RESEARCH
REFERENCES
APPENDIX A: SAS IML CODE FOR SRS
APPENDIX B: SAS IML CODE FOR LHS
APPENDIX C: RATIO-OF-UNIFORMS
APPENDIX D: SAS IML CODE FOR THE GIBBS SAMPLER
ABSTRACT
The Bayes Factor is a widely-used summary measure that can be used to test
hypotheses in a Bayesian setting. It also performs well in problems of model
selection. In this study, Bayes Factors for variance components in the mixed linear
model are derived. The formulation used avoids the assumption of a priori
independence between the variance components by using a Dirichlet prior on the intraclass
correlations. A reference prior, which results in a Bayes Factor that is flexible and
easy to use, is suggested. Hypothesis tests using the Bayes Factor avoid difficulties
of the classical tests, such as non-uniqueness and invalid asymptotics.
The priors on the nuisance parameters are chosen to be non-informative and the
corresponding integrals are carried out analytically. For the parameters of interest,
however, numerical methods have to be used. For this purpose, Monte Carlo
methods have been investigated. Simple random sampling and Latin hypercube
sampling are employed for simulating the prior and a Gibbs sampling scheme
has been implemented for simulating the posterior. The resulting estimators are
compared on a small data set.
LIST OF TABLES
4.1 Data
4.2 ANOVA Table for the Model
4.3 ANOVA Table for Main Effects
4.4 Simple Random Sampling
4.5 SRS for B
4.6 SRS for p0
4.7 LHS for B
4.8 LHS for p0
4.9 Estimates from the Gibbs Sampler
CHAPTER I
INTRODUCTION
1.1 Overview
This study is concerned with a Bayesian approach to testing hypotheses about
variance components in the mixed linear model. The following sections will define
precisely what we mean by the mixed linear model, but for the purposes of this
introduction we assume a rough understanding of the term.
Linear models have been around since the early nineteenth century. Legendre
(1806) and Gauss (1809) seem to have been the first to consider comparison of
means and estimation of fixed effects. Some fifty years later, we see the first studies
involving variances and random effects in Airy (1861) and Chauvenet (1863). Of
course, those authors did not use any of the terms linear model, fixed effect or
random effect. All of those studies were motivated by observations arising from
astronomical studies.
Fisher (1918) is credited as the first person to study problems involving linear
models from a statistical perspective. His famous work "Statistical Methods for
Research Workers" helped the methodology spread to practitioners quickly,
especially in the areas of biology and agriculture. He also originated the somewhat
misleading term "analysis of variance," whose shorthand ANOVA may be the
single most well-known acronym of applied statistics. But it was Eisenhart (1947)
who introduced the terminology that we use here, involving the terms "fixed,"
"random," "mixed," "Model I" and "Model II." Searle, Casella and McCulloch
(1992) have an introductory chapter covering the historical developments in this
area. Also, Khuri and Sahai (1985) have a very good literature survey that touches
upon history as well.
Bayesian analysis seems to be even older than linear models. Appropriately
named after Bayes (1763), it has been a strong school of thought in the
development of statistics. It has found itself at the center of severe disagreements with
the "classical" or "frequentist" school. Most, if not all, classical statistical
methods rely on the work of Neyman and Pearson (1933, 1967), which has led to the
now well-known theory of hypothesis testing, named after them. This approach
is called "classical" for obvious reasons, or "frequentist," reflecting their definition
of probability as a limiting relative frequency. Lehmann (1986) is an excellent
source for these results. This development coincides with the period in which the
use of statistical models was becoming more and more popular among researchers.
Hence, most of the earlier work on linear models was frequentist in flavor,
including some of the now-famous methods (such as the F-test and ANOVA estimation).
Scheffe (1956), Graybill (1986) and Searle (1971) are the classical references that
contain accounts of those developments. Hocking (1985) has a general approach
based on a different parameterization of the problem.
The same statistical problems were being attacked by Bayesians as well.
Jeffreys (1939) provided us with several Bayesian methodologies. Savage (1954) is
another pioneer in this area. As we will see, Bayesian answers to most problems
involve integrals that are not analytically tractable. This was a detriment for the
practitioners of that time, who lacked the computing power of today, and explains
why it was not until the 1960s that Bayesian methods emerged as an alternative
methodology to analyze linear models. The earlier Bayesian results (including
linear models) are surveyed extensively and carefully in Box and Tiao (1973).
Another good account for that period is Zellner (1971). Broemeling (1985) reflects
later developments.
A preliminary treatment of Bayesian statistics in most books emphasizes the
models about which Bayesian results and classical results agree. This gives the
misleading impression that Bayesian analysis is another route to arrive at the same
destination. In fact, in many cases Bayesian and classical results do not agree (see,
for example, the rejoinder of Berger and Sellke, 1987). We will demonstrate this
in the case of mixed models as well; however, we do not intend to debate the
fundamental issues of probability and statistical inference, especially with regard
to the philosophical differences between the two schools of statistics.
The organization of this dissertation is as follows: In the next section we will
discuss the purpose and importance of the research. Chapter II provides a
literature review on Linear Models and Bayesian Analysis. Chapter III is the main
body of the research and derives the Bayes Factors for the variance components
in the mixed linear model. Chapter IV is devoted to developing and investigating
numerical methods to evaluate the Bayes Factors. We present our conclusions and
directions for future research in Chapter V.
1.2 Purpose and Importance of the Research
In this study we will attempt to provide a unified methodology for testing
hypotheses about variance components in the mixed model. Our emphasis will be
on unbalanced data, mainly because frequentist methods fail to provide a general
answer in this case. Our main tool of analysis will be the Bayes Factor introduced
by Jeffreys (1935). Specifically, we will derive the Bayes Factors for the variance
components in the mixed linear model. Our approach will be a multivariate
generalization of Westfall and Gonen (1996), who have developed Bayes Factors for the
one-way random model. This generalization, however, is not straightforward. Not
only is the algebra more complicated, but selection of a family of suitable prior
distributions requires further work.
Bayes Factors are commonly used for model selection and hypothesis testing.
Previous studies have shown that they may overcome the difficulties that their
frequentist counterparts suffer from. In model selection, it is widely accepted that
the F-test for individual components is too unwilling to discard the individual
effects, thus suggesting unnecessarily complicated models. In the context of
regression models, the suggested frequentist remedies are generally either sequences
of hypothesis tests (like stepwise regression) or some type of mean square error
criterion (like Mallows' Cp; see Mallows, 1973). Two other measures that have been
successfully used in a number of different statistical models are Akaike's
Information Criterion (AIC) and a modification known as Schwarz's Bayesian Information
Criterion (BIC). A good review of model selection methods and their applications
to regression analysis is Miller (1990). The Bayes Factor has close relationships
with AIC and BIC, as mentioned by Kass and Raftery (1995), and as such is a
promising tool for model selection. Some recent works in Bayesian model selection
are Mitchell and Beauchamp (1988), George and McCulloch (1993) and Carlin and
Chib (1995). It has been argued by Smith and Spiegelhalter (1980) that the Bayes
Factor acts as a fully automatic Occam's razor, cutting back to the simpler model
at once when the additional parameters are not needed, hence resulting in simpler
models with good predictive power.
In this study our main focus will be hypothesis testing. Of course, model
selection can be viewed as an application of hypothesis testing, and the Bayes
factor we will derive should work for that purpose as well. In the context of
hypothesis testing, the Bayes Factor not only emerges as a pragmatic solution to
many difficult problems, but also challenges the results of existing methods. We
have already mentioned that there are several instances in which classical and
Bayesian results do not agree. This has been demonstrated by Edwards, Lindman
and Savage (1963), and their work has been considerably extended by Berger and
Sellke (1987). This theme will come up over and over in this study as we discuss
the so-called irreconcilability of p-values and posterior probabilities in Chapter II
and demonstrate it on a small data set in Chapter IV.
CHAPTER II
LITERATURE REVIEW
2.1 Linear Models
Throughout this study, we will use uppercase letters to denote matrices,
lowercase letters to denote scalars in mathematical formulas and lowercase boldface
letters to denote vectors, unless otherwise stated. Also, I will denote an identity
matrix, 0 will denote a vector whose elements are all 0 and 1 will denote a vector
whose elements are all 1. We will assume that I, 0 and 1 have the appropriate
dimensions.
There are several (equivalent) formulations of the mixed model. We start with
the one introduced by Hartley and Rao (1967):

y = Xβ + Zu + ε,  (2.1)

where y is an n × 1 vector of observable quantities, β is a p × 1 vector of fixed
effects and X is a fixed n × p matrix corresponding to the occurrence in the data
of the elements of β. It may contain the observed values of the covariates, when
covariates are part of the model; otherwise it is an incidence (or design) matrix.
The elements of an incidence matrix are either 0 or 1, denoting the presence (or
absence) of the corresponding parameter in that particular combination of levels.
Also, u is a k × 1 vector of random, unobservable effects and Z is a fixed n × k
incidence matrix. Finally, ε is an n × 1 vector of error terms. Which effects should
be fixed and which should be random is a critical issue and changes the whole flow
(and possibly the conclusions) of the analysis. Bremer (1994) is a good source on
this important topic.
We will focus on the random effects, so it is useful to partition u as

u = [u₁ᵀ, …, u_rᵀ]ᵀ,

where uᵢ contains qᵢ elements, qᵢ being the number of levels of the i-th random
effect. So we have a total of r random effects. Corresponding to the partition of
u, we also partition Z as

Z = [Z₁, …, Z_r].
We will make the following assumptions, which are realistic and at the same
time lead to a tractable mathematical analysis. The notation N_n refers to an
n-dimensional normal distribution.

• ε | β, σ², {σᵢ²}ᵢ₌₁ʳ ~ N_n(0, σ²I).

• uᵢ | β, σ², {σⱼ²}ⱼ₌₁ʳ ~ N_{qᵢ}(0, σᵢ²I) for all i, where qᵢ is the number of levels
within the i-th factor.

• uᵢ | β, σ², {σⱼ²}ⱼ₌₁ʳ and uⱼ | β, σ², {σₗ²}ₗ₌₁ʳ are statistically independent for all
i ≠ j.

• uᵢ | β, σ², {σⱼ²}ⱼ₌₁ʳ and ε | β, σ², {σⱼ²}ⱼ₌₁ʳ are statistically independent for all i.

Some authors find it easier to write ε as another random effect within the
vector u. Although it is notationally convenient, we prefer to stick with (2.1),
mainly because we will not treat ε in the same way we will treat the uᵢ from a
Bayesian standpoint.
Now we can rewrite (2.1) as

y = Xβ + Σᵢ₌₁ʳ Zᵢuᵢ + ε.  (2.2)

This is the version of the model we will use throughout this study. Using (2.2) and
the assumptions, one can derive that

y ~ N_n(Xβ, V),  (2.3)

where

V = Σᵢ₌₁ʳ σᵢ² ZᵢZᵢᵀ + σ²I,  (2.4)

where the σᵢ² are called "variance components" and σ² is called the "error variance."
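The dissertation's own computational code (Appendices A–D) is in SAS IML; purely as an illustration of (2.3) and (2.4), the following Python sketch builds V for a hypothetical small two-way layout and draws y from it. All dimensions, incidence matrices and variance values below are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layout: one overall mean, two random effects with
# 3 and 4 levels, and n = 12 observations (values are illustrative).
n = 12
X = np.ones((n, 1))                       # design matrix for the fixed effect
Z1 = np.kron(np.eye(3), np.ones((4, 1)))  # incidence matrix, q1 = 3 levels
Z2 = np.kron(np.ones((3, 1)), np.eye(4))  # incidence matrix, q2 = 4 levels

beta = np.array([5.0])
sigma2 = 1.0                  # error variance sigma^2
sigma2_i = [2.0, 0.5]         # variance components sigma_1^2, sigma_2^2

# Covariance of y from (2.4): V = sum_i sigma_i^2 Z_i Z_i' + sigma^2 I
V = sigma2 * np.eye(n)
for s2, Z in zip(sigma2_i, [Z1, Z2]):
    V += s2 * Z @ Z.T

# Draw y ~ N_n(X beta, V) as in (2.3)
y = rng.multivariate_normal((X @ beta).ravel(), V)
print(V.shape, y.shape)
```

Since each row of each incidence matrix here contains exactly one 1, every diagonal element of V equals the sum of all the variance parameters, which is a quick sanity check on the construction.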
In this setup we will investigate the following hypothesis about the variance
components: H₀: σⱼ² = 0 against H₁: σⱼ² > 0 for j ∈ J = {1, …, r}. This is
known as the "main-effects" test, measuring the significance of the j-th random
effect, and is commonly used for model selection purposes.
In the context of balanced models, there is an extensively researched and fairly
satisfactory frequentist theory. Most of the ideas are based on sums of squares
as developed in an analysis of variance table. The main point is that the sums
of squares are distributed as scalar multiples of central or non-central χ² variates,
under the usual assumptions of normality, homogeneity of variances and
independence of the model's random effects. So, in many cases an exact F-test is available.
But still there may be cases where an exact test is not present. Then one has to
use a pseudo-F-test, which is simple to construct, but whose exact size and power
are unknown. A similar situation occurs with unbalanced models involving fixed
effects only.
In unbalanced models with random effects, the state of the art is considerably
worse. Even in the one-way random model (p = 1, X = 1 and r = 1 in (2.2)),
there is confusion as to which test to use. These shortcomings are illustrated by
Westfall (1989), who shows that there is no most powerful invariant test in a
one-way random model for testing H₀: σ₁² = 0 against H₁: σ₁² > 0. Self and Liang
(1992) show that the likelihood ratio test performs poorly, since the null value is on
the boundary of the parameter space. Those results are enough of an indication to
convince us that the present situation of hypothesis testing for variance components
leaves a lot of room for improvement. In fact, even before the mentioned studies
appeared, Khuri and Sahai (1985) offered the following perspective:
The main difficulty stems from the fact that in unbalanced data situations, the partitioning of the total sum of squares can be done in a variety of ways; hence there is no unique way to write the analysis of variance table as is the case with balanced data. Furthermore the sums of squares in an analysis of variance table for an unbalanced model, with the exception of residual sum of squares, do not in general, have known distributional properties and are not independently distributed, even under the usual assumptions of independence, homogeneity of variances and normality of the model's random effects. Consequently, there are no known exact procedures that can be used for tests of hypotheses of variance components. It is therefore, not surprising that most research in the area of variance components for unbalanced models has centered on point estimation, [p. 256]
This perspective is apparently shared by Searle, Casella and McCulloch (1992),
whose book is devoted entirely to the study of variance components. It has a total
of 12 chapters that are concerned with estimation only, and mentions hypothesis
testing only in small subsections on special models.

So, not only do frequentist methods fail to give us a unified approach now,
there is not much hope (and little ongoing research) to provide the practitioners
with better methods in the future. As we will see in the next section, a Bayesian
approach is more promising, but not so well investigated; this is a gap which
we aspire to fill.
2.2 Bayesian Analysis
Bayesian analysis of statistical models exhibits some important fundamental
differences from classical methods. The differences are sometimes simple issues,
resolved using asymptotic arguments. More often than not, however, those
differences involve several basic issues regarding the definition of probability, the
philosophy of statistical inference, the nature of probabilistic modeling and technical details
of measure-theoretic probability. We will not consider those issues in this study.
There are several published works involving those differences. Some examples are
Jeffreys (1939), de Finetti (1964, 1972, 1974, 1975) and Savage (1954, 1962). Also,
several studies have addressed the practical aspects of the long-standing Bayesian-
frequentist argument. Berger and Deely (1987) and Berger and Sellke (1987) are recent
views from the Bayesian side, whereas Casella and Berger (1987) and Efron (1986)
provide us with frequentist arguments. Our main reason for presenting a Bayesian
analysis here is mostly pragmatic, because it works well for an important class of
models where classical methods are not satisfactory.
Mainly, the frequentist school builds a statistical model by allowing the data to
be random variables and the parameters of the model to be deterministic variables
(i.e., unknown, but non-random quantities). The main tool of inference is the
sampling distribution of an appropriately chosen statistic on which estimators and
tests can be based. On the other hand, Bayesian school models both data and
parameters as random variables. The distribution to be assumed for the data may
be placed through frequentist arguments, but the distribution of the parameters
generally requires subjective arguments. Then the inferences are based on the
posterior distributions of the parameters that are found by Bayes" theorem (hence
the name Bayesian), conditional upon the values of the observed data.
In the sequel we will use P(·) for what is actually a probability density function.
As unusual as it may seem, this is common notation in Bayesian studies, such
as Hill (1965) and Chaloner (1987).
Assume we have a statistical model, where Y denotes the observables and
θ denotes the parameters. Naturally, the distribution of Y depends on θ. In
a frequentist context this distribution is termed P(Y), that is, the (marginal)
distribution of Y (also called the likelihood function), since θ is not random. In
a Bayesian model, the very same distribution is viewed as P(Y | θ), since θ is
random, but considered known (given) for the purposes of that distribution. We
can furthermore specify a marginal distribution for θ, that is, P(θ). We will call this
the prior distribution of θ, since it does not depend on Y, and hence is specified
before observing Y. This is where subjectivity enters the picture, since before
observing Y, our information on θ is very likely to be subjective. Then, by Bayes'
theorem, we can derive the conditional distribution of θ:

P(θ | Y) = P(Y | θ) P(θ) / P(Y).
This distribution is called the posterior distribution of θ and reflects how the
information contained in Y changed our prior beliefs about θ. In Bayes' theorem,
it is usually necessary to calculate the marginal distribution of Y through the
following identity:

P(Y) = ∫ P(Y | θ) P(θ) dθ.

So the denominator is nothing but a normalization constant (constant with respect
to θ) that guarantees the resulting function is a probability distribution
function. It is for this reason that, in some Bayesian contexts, the following notation
is very common:

P(θ | Y) ∝ P(Y | θ) P(θ),

meaning that the posterior distribution is proportional to the product of likelihood
and prior. Later, we will use the notation ∝ for other probability distribution
functions. It should be understood that, when such notation is used, the right-
hand side contains every term related to the random variable in question (the
so-called kernel of the distribution), but may or may not contain the constants
involved.
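The proportionality above can be made concrete with a small numerical sketch: multiply likelihood by prior on a grid and normalize, so that the sum plays the role of P(Y). The model below (normal data with known variance, normal prior on the mean) and all numbers are invented for illustration and are not from the dissertation.

```python
import numpy as np

# Grid illustration of P(theta | Y) ∝ P(Y | theta) P(theta)
y = np.array([1.2, 0.8, 1.5, 1.1])   # illustrative data
sigma2 = 1.0                          # known data variance
m0, v0 = 0.0, 4.0                     # prior mean and variance for theta

theta = np.linspace(-5, 5, 2001)      # grid over the parameter space
dx = theta[1] - theta[0]
log_lik = -0.5 * ((y[:, None] - theta[None, :])**2 / sigma2).sum(axis=0)
log_prior = -0.5 * (theta - m0)**2 / v0

# Unnormalized posterior; dividing by its (Riemann) sum normalizes it,
# which is exactly the role played by P(Y) in Bayes' theorem
post = np.exp(log_lik + log_prior)
post /= post.sum() * dx

# Conjugate (exact) posterior mean for comparison
vn = 1.0 / (len(y) / sigma2 + 1.0 / v0)
mn = vn * (y.sum() / sigma2 + m0 / v0)
grid_mean = np.sum(theta * post) * dx
print(round(grid_mean, 4), round(mn, 4))
```

Note that the normalizing constants of the likelihood and prior were dropped from the exponentials; the grid normalization recovers a proper density anyway, which is precisely the point of the kernel notation.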
The posterior distribution contains all the necessary information to make
inferences. Point estimates, confidence intervals (in Bayesian terminology called
credible sets) and hypothesis tests are readily available. Moreover, we can directly
talk about the probability of an interval or a hypothesis, unlike in the frequentist
case. For example, we can make statements like P(θ ∈ A) = p, or P(H₀) = p for
some p ∈ [0, 1], because θ is a random variable. As opposed to this, frequentist
results like a confidence interval or a p-value are much harder to interpret. In
fact, DeGroot (1973) argues that practitioners tend to interpret frequentist
measures as if they were Bayesian measures. For example, many people think that
a 95% confidence interval has 0.95 probability of including the parameter, or that a
p-value of 0.04 means that the probability that the null hypothesis is true is 0.04.
There are two major difficulties with a Bayesian approach. First, one has to
assign prior probabilities and in most cases the choice of an entire distribution is
necessary. Second, the calculation of the posterior distribution usually involves
integrations that are very difficult or impossible to perform analytically. There are
some computer-intensive methods suggested in the current literature to get around
the problem of integration. We provide a review of those in Chapter IV.
The problem of prior selection is a difficult one. Samaniego and Reneau (1994)
argue that most statisticians would agree to perform a Bayesian analysis in the presence
of substantial prior information. However when prior information is vague or not
available at all, it is not very clear what one can do. In some cases there are families
of distributions such that, when used as a prior, they lead to a posterior from the
same family. These are called conjugate families. When they are available, it may
be wise to make a choice among them, even at the expense of slightly distorting
the representation of prior beliefs. DeGroot (1970) examines several models for
which conjugate families exist. However, such families are not available for most
problems. Then one has to use the so-called non-informative priors. In some cases,
those priors will be improper probability distributions in the sense that they do
not integrate to 1. We will have more to say about non-informative priors in the
next chapter.
As we mentioned, the only thing that is needed by an analyst is the posterior
distribution. In fact, some Bayesians argue that it is sufficient (and necessary) to
report the entire posterior distribution. However, this may be an inefficient way
to communicate the results, as argued by Berger (1985, section 4.10). So some
summarizing measures have been proposed, such as the Bayes Factor. Kass and
Raftery (1995) have an extensive review on Bayes Factors that contains several
results that we will frequently use.
We will now define the Bayes Factor and explore its connections with the
posterior probabilities. Since we are interested in testing hypotheses using Bayesian
methods, we can talk about the prior and posterior probabilities of hypotheses.
Let π₀ and p₀ denote the prior and posterior probabilities of H₀. We adopt the
same convention in π₁ and p₁ for the alternative hypothesis. Then BF is defined
by (see Berger, 1985, p. 149)

B = (p₀/p₁) / (π₀/π₁).

(We will abbreviate Bayes Factor as BF in text and B in mathematical
expressions.) Assume that the two hypotheses are mutually exclusive and collectively
exhaustive. Then π₁ = 1 − π₀, p₁ = 1 − p₀ and

B = [p₀/(1 − p₀)] / [π₀/(1 − π₀)].

The quantity p/(1 − p) is usually called the "odds" of an event with probability p.
Therefore we can interpret the Bayes Factor as the ratio of posterior odds to prior
odds.
We can now express p₀ in terms of B using the following argument:

p₀/(1 − p₀) = B π₀/(1 − π₀)

(1 − p₀)/p₀ = (1 − π₀)/(B π₀)

1/p₀ − 1 = (1 − π₀)/(B π₀),

which leads to

p₀ = [1 + (1 − π₀)/(B π₀)]⁻¹.  (2.5)
This shows us that, in order to calculate the posterior probability of H₀, we need
to calculate BF and assign a prior probability to H₀. Therefore the contribution
of the data to the posterior probability is through BF only. If, in addition to the
null hypothesis, the alternative hypothesis is also simple, then BF does not involve
any prior information and is sometimes interpreted as "the evidence against H₁, as
provided by the data." However, when the alternative hypothesis is composite, the data
will be weighted by the prior probability distribution specified by the alternative
model.
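The conversion in (2.5) is a one-line computation; as a quick illustration (the function name and the numbers below are ours, not the dissertation's), the following sketch turns a Bayes Factor and a prior probability into p₀ and recovers B from the odds.

```python
# Posterior probability of H0 from the Bayes Factor, following (2.5):
# p0 = [1 + (1 - pi0) / (B * pi0)]^{-1}. Numbers are illustrative.
def posterior_prob_h0(bayes_factor, pi0):
    """Convert a Bayes Factor B and prior probability pi0 of H0 into p0."""
    return 1.0 / (1.0 + (1.0 - pi0) / (bayes_factor * pi0))

# With equal prior odds (pi0 = 0.5), (2.5) reduces to p0 = B / (1 + B)
p0 = posterior_prob_h0(3.0, 0.5)
print(p0)   # 0.75

# Round trip: the ratio of posterior odds to prior odds recovers B
prior_odds = 0.5 / (1 - 0.5)
post_odds = p0 / (1 - p0)
print(post_odds / prior_odds)   # 3.0
```

A Bayes Factor of 1 leaves the prior odds unchanged, so p₀ equals π₀ in that case, which is a convenient sanity check.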
We will mainly be testing point-null hypotheses. A point-null hypothesis
specifies a single value (rather than a region) for the entire parameter vector or a
subvector of it. In this study, we will focus on point-null hypotheses which are
concerned with a subvector of size 1, that is, a single parameter. One interesting
aspect of testing a point-null hypothesis is that there are drastic differences
between frequentist measures (such as p-values) and Bayesian measures (such as
p₀). This issue has been examined by several researchers, and the seminal works
on this controversy are Lindley (1957), Edwards, Lindman and Savage (1963) and,
more recently, Berger and Sellke (1987). Berger and Delampady (1987) review the
earlier contributions and present new evidence on this issue. Also, there have been
several arguments (Hodges and Lehmann, 1954) as to whether we should discard
point-null hypotheses entirely and replace them with the so-called "interval null
hypotheses," where the analyst chooses a small enough interval of "indifference"
around the point value to use in the null hypothesis. With all due respect to those
arguments, we maintain our pragmatic point of view and develop methodologies
for testing a point-null hypothesis.
2.3 Bayesian Analysis of Linear Models
In this section, we will briefly mention key Bayesian achievements in the
analysis of linear models. Most of the work involving problems of fixed effects models
seems to have been part of the statistical folklore, so it is difficult to trace its roots.
The books by DeGroot (1970) and Box and Tiao (1973) provide detailed accounts
of the earlier results about analysis of fixed effects from a Bayesian perspective.
Lempers (1971) brings together several Bayesian techniques applicable to linear
models. Broemeling (1985) derives the posterior distributions of several different
linear models (including the mixed model), mostly using a hierarchical approach
and non-informative priors. However, he chooses to use confidence intervals to test
hypotheses and barely mentions the Bayes Factor. Among the recent books
placing special emphasis on models and data analysis are Press (1989) and Gelman,
Carlin, Stern and Rubin (1995). The works of Berger (1985) and Bernardo and
Smith (1995) investigate the mathematical and philosophical aspects of Bayesian
analysis in general, but use linear models as examples. O'Hagan (1994) presents
a careful mix of mathematical and philosophical aspects along with a review of
models and issues of computation.
The pioneering work in the area of Bayesian analysis of linear models with
random effects has been done by Hill (1965) and Tiao and Tan (1965), who have
worked independently of each other. Dickey (1974) was the first to derive Bayes Factors
for the linear model. His work is very detailed and carefully presented; however
his priors are fully informative and his BF requires many prior inputs. Smith
and Spiegelhalter (1980) and Spiegelhalter and Smith (1982) also derive Bayes
Factors, with fixed effects in mind, for linear and log-linear models and introduce
a device with which indeterminacies resulting from the use of improper priors can
be handled. Zellner and Siow (1980) and Zellner (1986b) consider Bayes Factors
for regression models. Also, Berger and Deely (1988) consider the problem of
analyzing and ranking several means and derive Bayes Factors for them. Their
framework includes several classical models as special cases. One drawback of
their study is the assumption of known error variance, which is hardly ever the
case in practice. Finally, Westfall and Gonen (1996) derive a Bayes Factor for the
one-way random model. They also suggest a framework in which the asymptotic
performances of the mentioned Bayes Factors can be evaluated and compared and
using this framework point out some difficulties that arise from the device of Smith
and Spiegelhalter (1980).
CHAPTER III
THEORETICAL DEVELOPMENT
In this chapter we will derive the Bayes Factor for variance components in the
setting of the mixed linear model. Our main goals are to prove Theorems 4 and
5. Theorem 4 presents a Bayes Factor resulting from a "standard" model and
Theorem 5 presents one resulting from a hierarchical model.
3.1 Preliminaries
We will follow the notation developed in the previous chapters, unless we
explicitly state otherwise. We will continue to use P(·) for what is actually a probability
density, as we have done in Chapter II. Whenever convenient, we will use dP(·) to
denote the Lebesgue–Stieltjes measure induced by P(·).
We first provide a summary of the problem. We assume y ~ N_n(Xβ, V), where
V = Σⱼ₌₁ʳ σⱼ² ZⱼZⱼᵀ + σ²I. Let J = {1, …, r} and τⱼ² = σⱼ²/σ² for all j ∈ J. Also,
let K = {1, …, j − 1, j + 1, …, r}. To keep the notation as simple as possible,
we introduce two vectors τ²_J and τ²_K, where τ²_J = {τⱼ²}_{j∈J} and τ²_K = {τₖ²}_{k∈K}, so
that τ²_J contains all of the variance ratios, and τ²_K consists of those variance ratios
that are not specified to be 0 under H₀. We keep in mind (2.5), which gives us the
relationship between p₀ and the BF. The reasons to work with the τⱼ² instead of the σⱼ²
will become clear later.

Since the main question is centered around the σⱼ² (or, equivalently, the τⱼ²), we call
β and σ² nuisance parameters. This was the main reason for us to adopt (2.2),
rather than treating σ² as a variance component, because, as we will see later, we will
use different priors for the nuisance parameters than the ones we will use for the variance
components.
We will call this model "standard" for lack of a better label. The need
for a label will become clear when the "hierarchical" model, whose label is
widely accepted and natural, is introduced in Section 3.4.
In this section, we will develop some tools that will be of use to prove our main
results in Sections 3.3 and 3.4.
Theorem 1 The Bayes Factor for testing H_0 : τ_j² = 0 against H_1 : τ_j² > 0 is
given by

   B = [∫···∫ P(y | β, σ², τ_K²) P(β, σ², τ_K²) dβ dσ² dτ_K²] /
       [∫···∫ P(y | β, σ², τ_J²) P(β, σ², τ_J²) dβ dσ² dτ_J²].   (3.1)
Proof: The proof will follow the lines of Berger (1985, p. 149) with slight
changes in notation. Let m(y) denote the marginal distribution of y. Then one
can write

   m(y) = m_0(y)π_0 + m_1(y)(1 − π_0),

where

   m_0(y) = ∫···∫ P(y | β, σ², τ_K²) P(β, σ², τ_K²) dβ dσ² dτ_K²   (3.2)

and

   m_1(y) = ∫···∫ P(y | β, σ², τ_J²) P(β, σ², τ_J²) dβ dσ² dτ_J².   (3.3)

Any integration without limits should be understood as being evaluated over
the entire parameter space. So, in (3.1) and what follows, the integrals
corresponding to β, σ², τ_J² and τ_K² are over ℝ^p, ℝ⁺, (ℝ⁺)^r and (ℝ⁺)^{r−1},
respectively. Then, by Bayes' theorem,

   p_0 = P(τ_j² = 0 | y) = m_0(y)π_0 / m(y).

Substituting m(y) from above, we arrive at

   p_0 = m_0(y)π_0 / [m_0(y)π_0 + m_1(y)(1 − π_0)].

Dividing both the numerator and the denominator by m_0(y)π_0, we have the
following identity:

   p_0 = [1 + ((1 − π_0)/π_0) · (m_1(y)/m_0(y))]^{−1}.   (3.4)
Comparing (2.5) and (3.4), we see that

   B = [∫···∫ P(y | β, σ², τ_K²) P(β, σ², τ_K²) dβ dσ² dτ_K²] /
       [∫···∫ P(y | β, σ², τ_J²) P(β, σ², τ_J²) dβ dσ² dτ_J²]

for H_0 as defined above, and this completes the proof. □
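The algebra leading to (3.4) can be checked with a few lines of arithmetic; the
values of m_0(y), m_1(y) and π_0 below are arbitrary stand-ins, not quantities
from the dissertation.

```python
# Numeric sanity check of identity (3.4): the posterior probability of H_0
# computed directly from Bayes' theorem equals the form obtained after
# dividing through by m0(y)*pi0.  The inputs are arbitrary stand-ins.
m0, m1, pi0 = 0.37, 0.82, 0.5

p0_direct = m0 * pi0 / (m0 * pi0 + m1 * (1.0 - pi0))        # Bayes' theorem
p0_identity = 1.0 / (1.0 + ((1.0 - pi0) / pi0) * (m1 / m0))  # identity (3.4)

B = m0 / m1                              # the Bayes Factor B = m0(y)/m1(y)
assert abs(p0_direct - p0_identity) < 1e-12
```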
In most linear model contexts, the terms "full model" and "reduced model" are
very common. In this study, we will use the term "full model" to refer to the
model specified by the alternative hypothesis and "reduced model" to refer to
the model specified by the null hypothesis. One popular method of deriving a
test statistic by frequentist methods or the likelihood principle is to form a
ratio where the reduced model is represented in the numerator and the full
model is represented in the denominator. Most often, likelihood functions
corresponding to the reduced and full models are used and the resulting test
statistic is known as a Likelihood Ratio Test Statistic. In that sense (3.1)
resembles such a statistic, but there are two main differences. First, while
deriving a likelihood ratio test statistic, nuisance parameters are handled by
maximizing over the corresponding parameter space; here they are integrated
out. Second, in the Bayesian framework, likelihoods are multiplied by the prior
probabilities, which act as weights. For this reason, the BF is sometimes
called a "Weighted Likelihood Ratio," especially by non-Bayesians.
At this point, we would like to introduce the idea of "marginal likelihood,"
abbreviated as ML. The integrand in (3.2) can be recognized as the joint
density of the data and the parameters, namely P(y, β, σ², τ_K²). We then
integrate out the parameters, which leaves us with a function of the data only,
called the "marginal likelihood" (sometimes the "integrated likelihood"). Thus
m_0(y) is the marginal likelihood for the reduced model. Similar comments apply
to (3.3), so m_1(y) is the marginal likelihood for the full model, and the
Bayes Factor can be summarily written as B = m_0(y)/m_1(y).
We would also like to mention that Kass and Raftery (1995) provide a "granular"
scale of interpretation (mostly based on experience) for using the BF directly
for inferences without calculating the posterior probabilities. Although it may
be easier for some people to avoid specifying π_0, we feel the coherent
Bayesian approach is to specify π_0 as well and to make decisions based on p_0,
rather than the BF itself.
Our next step is to provide the analytic expressions for the likelihood and
prior to substitute in (3.1). The next theorem gives us the likelihood.
Theorem 2 Let

   V_0 = Σ_{k∈K} τ_k² Z_k Z_k^T + I

and

   V_1 = Σ_{j∈J} τ_j² Z_j Z_j^T + I.

Then the likelihood for the reduced model is given by

   P(y | β, σ², τ_K²) = (2πσ²)^{−n/2} |V_0|^{−1/2}
      exp{−(1/(2σ²)) (y − Xβ)^T V_0^{−1} (y − Xβ)}   (3.5)

and the likelihood for the full model is given by

   P(y | β, σ², τ_J²) = (2πσ²)^{−n/2} |V_1|^{−1/2}
      exp{−(1/(2σ²)) (y − Xβ)^T V_1^{−1} (y − Xβ)}.   (3.6)

Proof: Using (2.4), we can see that

   V = Σ_{j=1}^r σ_j² Z_j Z_j^T + σ² I = σ² (Σ_{j=1}^r τ_j² Z_j Z_j^T + I).

The rest of the proof follows from (2.3). We note that the effect of the
reduced model on the likelihood is via the covariance matrix only, i.e., it
only "reduces" the covariance matrix. □
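A quick numeric check of (3.6): the factored form with V_1 and σ² equals the
N_n(Xβ, σ²V_1) density computed directly. The design matrices and parameter
values below are made up for illustration (r = 1).

```python
import numpy as np

# Check of (3.6): the mixed-model likelihood written with
# V1 = tau^2 Z Z^T + I equals the N_n(X beta, sigma^2 V1) density computed
# directly.  Small one-factor example with made-up design matrices.
rng = np.random.default_rng(0)
n, q = 6, 2
X = np.ones((n, 1))                      # intercept-only fixed effects
Z = np.kron(np.eye(q), np.ones((3, 1)))  # incidence matrix: 2 groups of 3
beta, sigma2, tau2 = np.array([1.5]), 2.0, 0.7
y = rng.normal(size=n)

V1 = tau2 * Z @ Z.T + np.eye(n)          # covariance kernel in (3.6)
resid = y - X @ beta

# likelihood as written in (3.6)
lik_36 = ((2 * np.pi * sigma2) ** (-n / 2) * np.linalg.det(V1) ** (-0.5)
          * np.exp(-resid @ np.linalg.solve(V1, resid) / (2 * sigma2)))

# direct N_n(X beta, sigma^2 V1) density
Sigma = sigma2 * V1
lik_direct = ((2 * np.pi) ** (-n / 2) * np.linalg.det(Sigma) ** (-0.5)
              * np.exp(-resid @ np.linalg.solve(Sigma, resid) / 2))

assert np.isclose(lik_36, lik_direct)
```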
3.2 Prior Selection
We now turn our attention to ways of representing our prior information on
the parameters. There are several established methods of doing this. One can
use the so-called g-priors (Zellner, 1986a). Another approach is to use
non-informative priors. These priors, as the name implies, represent absence of
prior information on the parameters. They also have nice mathematical
properties, such as connections with Haar measure and the invariance principle,
as mentioned by Berger (1985). In most cases they arise as a limiting form of
some regular distribution, and as such they can carry some unusual properties.
For example, one non-informative prior
that is commonly used for location parameters is P(μ) ∝ c, c ∈ ℝ. Since
∫ dP(μ) ≠ 1, it is not a probability distribution in the usual sense. Such
priors are called improper. The example we have given arises from a normal
distribution whose variance tends to ∞. Although they are improper in this
sense, their use may result in proper posterior distributions. A common choice
of non-informative prior for scale parameters is P(θ) ∝ θ^{−ν}, where ν ∈ ℝ⁺.
Note that the ∝ notation is slightly abused here, since improper distributions
do not have normalization constants and hence reporting the kernel does not
assure uniqueness. However, improper priors are used when the multiplying
constant of the kernel is irrelevant. In the example for the location
parameter, any real number c would serve the same purpose, and the posterior
distribution does not depend on the choice of c. Box and Tiao (1973) provide
several mathematical and philosophical motivations for the use of these priors.
The use of non-informative priors has its drawbacks as well. It is difficult
for most practitioners to understand them. Also, they generally lead to
indeterminate or infinite Bayes Factors (see, for example, Kass and Raftery,
1995), so they must be used with caution in hypothesis testing situations.
With regard to all these discussions, we find it suitable to choose
non-informative priors for the nuisance parameters and proper priors for the
parameters of interest. This is similar to Berger and Deely (1988) and Westfall
and Gonen (1996).
Many researchers working on similar problems have assumed a priori
independence of the location and scale parameters. Unless there is contrary
evidence, this seems reasonable in the context of linear models as well. A
handful of published studies that treat mixed models from a Bayesian standpoint
have used this assumption without any problems; some examples are Hill (1965),
Tiao and Tan (1965), Dickey (1974), and Berger and Deely (1988). So, we will
make this assumption without further questioning it, i.e., we will assume

   P(β, σ², τ_J²) = P(β) P(σ², τ_J²),   (3.7)

and we choose P(β) such that it is constant over ℝ^p.
It is not as easy, however, to assume a priori independence of the individual
variance components and the error variance. Dickey (1974) did so mainly for
mathematical convenience, but we take issue with that choice. Both the variance
components and the error variance are related to the scale of the problem, and
in most cases information about one of them leads to some information about the
other. This is the main reason why we use the variance ratios (the τ_j²'s)
instead of the variance components themselves (the σ_j²'s). Variance ratios are
unitless quantities, and it is reasonable to argue that, in many cases, they
are independent of σ² a priori. So we choose

   P(σ², τ_J²) = P(σ²) P(τ_J²).   (3.8)

Furthermore, this setup allows us to incorporate a non-informative prior for σ²
very easily; we simply choose

   P(σ²) ∝ (σ²)^{−2}.   (3.9)
Recalling the earlier discussion of non-informative priors for scale
parameters, and realizing that σ (not σ²) is the scale parameter, this
corresponds to taking ν = 4. The more commonly used choice in the literature
seems to be ν = 2 (justifications for this choice are given by Datta and Ghosh
(1995)), although Ye (1994) suggests taking ν = 3, in the context of the
one-way model, based on a derivation involving the use of Fisher information.
Our particular choice of ν = 4 is motivated by invariance considerations. Its
use in the one-way ANOVA has led to a BF that arises after reducing the data to
a maximal invariant (Westfall and Gonen, 1996). We could have easily used the
somewhat richer family of priors suggested by Chaloner (1987). Her priors are
of the form

   P(σ²) ∝ (σ²)^{−λ/2−1}.

Our numerical experience in the one-way ANOVA (Gonen, 1995) has shown that the
BF is somewhat insensitive to the choice of λ, so we prefer not to introduce
another prior parameter. The use of this family, however, does not bring any
additional analytical difficulties to our approach; one can easily incorporate
it into the following derivation. In Chaloner's family of priors, our choice in
(3.9) corresponds to λ = 2.
The choice of P(τ_J²) is not a trivial matter. Past studies suggest that the
use of non-informative priors for the parameters being tested usually results
in infinite Bayes Factors (see Smith and Spiegelhalter, 1980; Spiegelhalter and
Smith, 1982; and Berger and Deely, 1988). So, we have to be informative.
However, there
are no informative choices that will make the integrations in (3.1)
analytically possible. Since this prior will represent the prior information
about the parameters of interest, we want a family of prior distributions that
possesses some nice properties. We would like to allow a rich enough structure
to permit possible a priori dependence of the variance ratios, as this may very
well be the case. Also, we would like to keep the parameters of the prior
distribution as understandable as possible, with potential users in mind, who
may not have the time or background to understand the intricacies of
probability theory.
We will suggest an approach to prior selection based on the intraclass
correlation, which is defined as

   ρ_j = σ_j² / (σ² + Σ_{i∈J} σ_i²) = τ_j² / (1 + Σ_{i∈J} τ_i²).

This is the correlation coefficient between subjects within the same group. By
definition, it takes values on the unit interval. So, if we define a vector
ρ_J = {ρ_j}_{j∈J}, the Dirichlet family presents itself naturally to represent
prior information on ρ_J. The Dirichlet is a well-known and well-studied
(Ferguson, 1967) multivariate generalization of the beta family. A vector
ρ = (ρ_1, ..., ρ_r)^T is said to have a Dirichlet distribution with parameter
vector α = (α_1, ..., α_{r+1})^T if it has the following density function:

   P(ρ) = [Γ(α_0) / Π_{i=1}^{r+1} Γ(α_i)]
          Π_{i=1}^r ρ_i^{α_i − 1} (1 − Σ_{i=1}^r ρ_i)^{α_{r+1} − 1},

where α_0 = Σ_{i=1}^{r+1} α_i, 0 < ρ_i < 1 for all i = 1, ..., r, and α_j > 0
for all j = 1, ..., r + 1. We now suggest using a member of the Dirichlet
family to represent prior information about the intraclass correlation
coefficients.
Using the Dirichlet family has the advantage of incorporating a reference
prior, in the terminology of Box and Tiao (1973). We suggest that, in the
absence of prior information, one should choose α = 1. We have already
mentioned that we have to use proper priors for the parameters of interest to
avoid indeterminate Bayes Factors, and choosing α = 1 gives us a proper prior
which does not favor any of the parameters a priori. This is a generalization
of the choice of a uniform distribution as a prior for the intraclass
correlation in one-way random models; see Westfall and Gonen (1996) for
details. Another consideration here is
the concern for developing a proper reference prior. O'Hagan (1995) argues that
Bayes Factors can be sensitive to the choice of prior inputs for the parameters
of interest and that a prior family should include reasonable choices of
reference priors, which is what the Dirichlet family does for us.
Having represented our prior information in terms of ρ_J, we now face the task
of finding the corresponding distribution on τ_J². We remind ourselves that
τ_J² is a one-to-one transformation of ρ_J, so the inverse transformation
exists and is well-defined. It is, then, conceptually simple to find the
distribution of τ_J² by using the Jacobian of the inverse transformation, a
method that is very well known and widely used (see, for example, Hogg and
Craig, 1978, p. 134). However, in this case, computing the Jacobian is not
trivial. The following theorem establishes this connection for the case α = 1.
Theorem 3 Let τ_J², ρ_J and α be as defined above. If ρ_J has a Dirichlet
distribution with parameter vector α = 1, then

   P(τ_J²) ∝ (1 + Σ_{j=1}^r τ_j²)^{−(r+1)}.

Proof: Let (ρ_1, ..., ρ_{r+1})^T have a Dirichlet distribution with all
parameters equal to 1. We want to find the distribution of (τ_1², ..., τ_r²),
where

   ρ_i = h_i(τ_1², ..., τ_r²) = τ_i² / (1 + Σ_{j=1}^r τ_j²).

Let A = {∂h_i/∂τ_j²} and let f(·) be the density of the Dirichlet distribution.
Then

   P(τ_1², ..., τ_r²) = |A| f(h_1(τ_1², ..., τ_r²), ..., h_r(τ_1², ..., τ_r²)).

Since all the parameters are equal to 1, this reduces to (recalling that
Γ(1) = 1)

   P(τ_1², ..., τ_r²) = |A| Γ(r + 1).

So, we need to find |A|. We let T = Σ_{i=1}^r τ_i² and note that

   ∂h_i/∂τ_j² = (1 + T − τ_i²)/(1 + T)²   if i = j,
   ∂h_i/∂τ_j² = −τ_i²/(1 + T)²           if i ≠ j.

To find the determinant of the matrix A defined above, we convert it into an
upper triangular matrix in the following way. For each i = 1, ..., r − 1, we
perform the following elementary operations:

1. Replace the (i + 1)th row by the sum of the ith row and the (i + 1)th row.

2. Replace the ith column by the difference of the ith column and the
(i + 1)th column.

Since these are elementary operations, the determinant is left unaffected at
each step. After doing this, we are left with an upper triangular matrix whose
diagonal entries multiply to (1 + T)^{−(r+1)}. Then |A| = (1 + T)^{−(r+1)}.
This leads to

   P(τ_1², ..., τ_r²) = Γ(r + 1) (1 + Σ_{i=1}^r τ_i²)^{−(r+1)},

which is a multivariate extension of the Pearson Type VI family. □
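The Jacobian determinant in Theorem 3 can be checked numerically from the
analytic partial derivatives; the τ² values below are arbitrary test points,
not quantities from the dissertation.

```python
import numpy as np

# Numeric check of the Jacobian determinant in Theorem 3: for
# rho_i = tau_i^2 / (1 + sum_j tau_j^2), the matrix of partial derivatives
# A = {d rho_i / d tau_j^2} has determinant (1 + T)^{-(r+1)}, T = sum tau_j^2.
tau2 = np.array([0.4, 1.3, 2.1])         # arbitrary test point
r = len(tau2)
T = tau2.sum()

# analytic partials: (1+T-tau_i^2)/(1+T)^2 on the diagonal, -tau_i^2/(1+T)^2 off it
A = -np.outer(tau2, np.ones(r)) / (1 + T) ** 2 + np.eye(r) / (1 + T)

assert np.isclose(np.linalg.det(A), (1 + T) ** (-(r + 1)))
```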
By combining (3.7), (3.8) and (3.9), we arrive at our prior specification for
the full model,

   P(β, σ², τ_J²) ∝ (σ²)^{−2} P(τ_J²),   (3.10)

and for the reduced model,

   P(β, σ², τ_K²) ∝ (σ²)^{−2} P(τ_K²),   (3.11)

where P(τ_K²) and P(τ_J²) are given by Theorem 3.
3.3 Deriving the Bayes Factor
The main purpose of this section is to prove the following theorem, which gives
us an explicit expression for the Bayes Factor for a single variance component.
Theorem 4 The BF for a single variance component σ_j² in the standard model is
given by

   B = m_0(y)/m_1(y),   (3.12)

where

   m_0(y) = ∫ |V_0|^{−1/2} |X^T V_0^{−1} X|^{−1/2}
      [(y − Xβ̂_0)^T V_0^{−1} (y − Xβ̂_0)]^{−(n−p+2)/2} dP(τ_K²),   (3.13)

   m_1(y) = ∫ |V_1|^{−1/2} |X^T V_1^{−1} X|^{−1/2}
      [(y − Xβ̂_1)^T V_1^{−1} (y − Xβ̂_1)]^{−(n−p+2)/2} dP(τ_J²),   (3.14)

and

   β̂_0 = (X^T V_0^{−1} X)^{−1} X^T V_0^{−1} y,
   β̂_1 = (X^T V_1^{−1} X)^{−1} X^T V_1^{−1} y.
The proof of the theorem will be greatly facilitated by the use of the
following lemmas.
Lemma 1 Let V be a symmetric, positive-definite matrix, and let
β̂ = (X^T V^{−1} X)^{−1} X^T V^{−1} y be the generalized least squares estimator
corresponding to V. Then

   (y − Xβ)^T V^{−1} (y − Xβ)
      = (y − Xβ̂)^T V^{−1} (y − Xβ̂) + (β − β̂)^T X^T V^{−1} X (β − β̂).   (3.15)

Proof: We start by adding and subtracting Xβ̂ and expanding the expression we
get:

   [(y − Xβ̂) + (Xβ̂ − Xβ)]^T V^{−1} [(y − Xβ̂) + (Xβ̂ − Xβ)]   (3.16)
      = (y − Xβ̂)^T V^{−1} (y − Xβ̂) + (β̂ − β)^T X^T V^{−1} X (β̂ − β)
      + (y − Xβ̂)^T V^{−1} X (β̂ − β) + (β̂ − β)^T X^T V^{−1} (y − Xβ̂).

By comparing (3.15) and (3.16), we see that the only thing we need to prove is

   (y − Xβ̂)^T V^{−1} X (β̂ − β) + (β̂ − β)^T X^T V^{−1} (y − Xβ̂) = 0.   (3.17)

To prove (3.17), it is sufficient to prove that

   (y − Xβ̂)^T V^{−1} X (β̂ − β) = 0   (3.18)

and

   (β̂ − β)^T X^T V^{−1} (y − Xβ̂) = 0.   (3.19)

We first tackle (3.18):

   (y − Xβ̂)^T V^{−1} X (β̂ − β)
      = y^T V^{−1} Xβ̂ − y^T V^{−1} Xβ − β̂^T X^T V^{−1} Xβ̂ + β̂^T X^T V^{−1} Xβ.   (3.20)

Keeping in mind that V^{−1} and (X^T V^{−1} X)^{−1} are symmetric matrices and
substituting β̂ = (X^T V^{−1} X)^{−1} X^T V^{−1} y in every term of (3.20), we
obtain the following equations:

   y^T V^{−1} Xβ̂ = y^T V^{−1} X (X^T V^{−1} X)^{−1} X^T V^{−1} y,   (3.21)

   β̂^T X^T V^{−1} Xβ̂
      = y^T V^{−1} X (X^T V^{−1} X)^{−1} (X^T V^{−1} X) (X^T V^{−1} X)^{−1} X^T V^{−1} y
      = y^T V^{−1} X (X^T V^{−1} X)^{−1} X^T V^{−1} y,   (3.22)

   β̂^T X^T V^{−1} Xβ = y^T V^{−1} X (X^T V^{−1} X)^{−1} (X^T V^{−1} X) β
      = y^T V^{−1} Xβ.   (3.23)

Substituting (3.21), (3.22) and (3.23) back in (3.20), we see that (3.18) is
proved. Now we go back to (3.19):

   (β̂ − β)^T X^T V^{−1} (y − Xβ̂)
      = β̂^T X^T V^{−1} y − β̂^T X^T V^{−1} Xβ̂ − β^T X^T V^{−1} y + β^T X^T V^{−1} Xβ̂.   (3.24)

The terms in (3.24) can be expanded in the same way as in (3.20) to get:

   β̂^T X^T V^{−1} y = y^T V^{−1} X (X^T V^{−1} X)^{−1} X^T V^{−1} y,   (3.25)

   β̂^T X^T V^{−1} Xβ̂ = y^T V^{−1} X (X^T V^{−1} X)^{−1} X^T V^{−1} y,   (3.26)

   β^T X^T V^{−1} Xβ̂ = β^T X^T V^{−1} X (X^T V^{−1} X)^{−1} X^T V^{−1} y
      = β^T X^T V^{−1} y.   (3.27)

Now, (3.25), (3.26) and (3.27) substituted back in (3.24) imply (3.19). Then
(3.18) and (3.19) together imply (3.17), which in turn implies (3.15), which is
what we wanted to prove. □
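Lemma 1 is easy to confirm numerically on random inputs; the matrices and
vectors below are made up for the check.

```python
import numpy as np

# Numeric check of Lemma 1, the generalized least squares decomposition (3.15),
# on a random symmetric positive-definite V and an arbitrary beta.
rng = np.random.default_rng(1)
n, p = 8, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
M = rng.normal(size=(n, n))
V = M @ M.T + n * np.eye(n)              # symmetric positive-definite
Vi = np.linalg.inv(V)

beta_hat = np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ y)  # GLS estimator
beta = rng.normal(size=p)                # arbitrary point

lhs = (y - X @ beta) @ Vi @ (y - X @ beta)
rhs = ((y - X @ beta_hat) @ Vi @ (y - X @ beta_hat)
       + (beta - beta_hat) @ X.T @ Vi @ X @ (beta - beta_hat))
assert np.isclose(lhs, rhs)
```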
Lemma 2

   ∫ exp{−(1/2) (y − Xβ)^T V^{−1} (y − Xβ)} dy = (2π)^{p/2} |V|^{1/2},

where p is the dimension of the vector y.

Proof: Follows from the definition of a p-dimensional multivariate normal
distribution. □
Lemma 3 For a > 0 and b > 0,

   ∫₀^∞ (σ²)^{−(a+1)} exp{−b/σ²} dσ² = Γ(a) b^{−a}.

Proof: Follows from the definition of the inverted-gamma distribution. □
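Lemma 3 can be verified by a quick numeric quadrature: after substituting
t = 1/σ² it becomes the Gamma integral. The values of a and b below are
arbitrary.

```python
import math
import numpy as np

# Numeric check of Lemma 3: int_0^inf s^{-(a+1)} exp(-b/s) ds = Gamma(a)/b^a.
# Substituting t = 1/s turns it into int_0^inf t^{a-1} exp(-b t) dt, which is
# evaluated here by trapezoidal quadrature on a fine grid.
a, b = 3.0, 2.0
t = np.linspace(1e-8, 60.0, 400_000)
f = t ** (a - 1) * np.exp(-b * t)
integral = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(t))

assert np.isclose(integral, math.gamma(a) / b ** a, rtol=1e-4)
```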
We are now ready to prove Theorem 4.

Proof (Theorem 4): We will work out the details of the proof for m_1(·) only,
since exactly the same steps can be traced for m_0(·) with a slight change in
notation.
We first substitute (3.6), (3.10) and (3.11) in (3.1) to get

   m_1(y) = ∫∫∫ (2πσ²)^{−n/2} |V_1|^{−1/2}
      exp{−(1/(2σ²)) (y − Xβ)^T V_1^{−1} (y − Xβ)} (σ²)^{−2} dβ dσ² dP(τ_J²).

It is important to keep in mind that V_0 and V_1 are functions of the variance
ratios, but this dependence is not made explicit, to keep the notation to a
minimum. After grouping like terms, we arrive at

   m_1(y) = (2π)^{−n/2} ∫∫∫ (σ²)^{−n/2−2} |V_1|^{−1/2}
      exp{−(1/(2σ²)) (y − Xβ)^T V_1^{−1} (y − Xβ)} dβ dσ² dP(τ_J²).   (3.28)

We will integrate over β first in (3.28). Using Lemma 1, we get

   ∫∫∫ (σ²)^{−n/2−2} |V_1|^{−1/2}
      exp{−(1/(2σ²)) (y − Xβ)^T V_1^{−1} (y − Xβ)} dβ dσ² dP(τ_J²)
   = ∫ |V_1|^{−1/2} ∫ (σ²)^{−n/2−2}
      exp{−(1/(2σ²)) (y − Xβ̂_1)^T V_1^{−1} (y − Xβ̂_1)}
      ∫ exp{−(1/(2σ²)) (β̂_1 − β)^T X^T V_1^{−1} X (β̂_1 − β)} dβ dσ² dP(τ_J²).

Notice how the lemma has enabled us to separate the term in the exponent into
two parts, one involving β and the other not. By virtue of this, we are in a
position to integrate over β analytically, using Lemma 2:

   ∫ exp{−(1/(2σ²)) (β̂_1 − β)^T X^T V_1^{−1} X (β̂_1 − β)} dβ
      = (2π)^{p/2} (σ²)^{p/2} |X^T V_1^{−1} X|^{−1/2}.

Now, we turn our attention to the integral over σ². What remains is

   ∫ (σ²)^{−(n−p)/2−2} exp{−(1/(2σ²)) (y − Xβ̂_1)^T V_1^{−1} (y − Xβ̂_1)} dσ².

By using Lemma 3,

   ∫ (σ²)^{−(n−p)/2−2} exp{−(1/(2σ²)) (y − Xβ̂_1)^T V_1^{−1} (y − Xβ̂_1)} dσ²
      = Γ((n−p+2)/2) [½ (y − Xβ̂_1)^T V_1^{−1} (y − Xβ̂_1)]^{−(n−p+2)/2}.

This development leads to the following expression for m_1(y):

   m_1(y) = (2π)^{−(n−p)/2} Γ((n−p+2)/2)
      ∫ |V_1|^{−1/2} |X^T V_1^{−1} X|^{−1/2}
        [½ (y − Xβ̂_1)^T V_1^{−1} (y − Xβ̂_1)]^{−(n−p+2)/2} dP(τ_J²).

Carrying out similar steps, we arrive at the following expression for the
numerator:

   m_0(y) = (2π)^{−(n−p)/2} Γ((n−p+2)/2)
      ∫ |V_0|^{−1/2} |X^T V_0^{−1} X|^{−1/2}
        [½ (y − Xβ̂_0)^T V_0^{−1} (y − Xβ̂_0)]^{−(n−p+2)/2} dP(τ_K²).

The only difference between the numerator and the denominator that is not
explicit in these two expressions is the domain of integration: for the
numerator, the integration is over (ℝ⁺)^{r−1}, whereas for the denominator it
is over (ℝ⁺)^r. After grouping like terms and cancellations, we arrive at

   B = [∫ |V_0|^{−1/2} |X^T V_0^{−1} X|^{−1/2}
         [(y − Xβ̂_0)^T V_0^{−1} (y − Xβ̂_0)]^{−(n−p+2)/2} dP(τ_K²)] /
       [∫ |V_1|^{−1/2} |X^T V_1^{−1} X|^{−1/2}
         [(y − Xβ̂_1)^T V_1^{−1} (y − Xβ̂_1)]^{−(n−p+2)/2} dP(τ_J²)],

which completes the proof. □
A special case worthy of interest is the one-way random model, for which the BF
was derived by Westfall and Gonen (1996). It can easily be seen that this Bayes
Factor reduces to the one reported in that study.
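As an illustration, the integrals in Theorem 4 can be evaluated by simple grid
quadrature in this one-way case (r = 1), where Theorem 3 makes dP(τ²)
correspond to a uniform prior on ρ = τ²/(1 + τ²) and m_0(y) involves no
integral at all. The sketch below uses simulated data and made-up design
sizes; it is not the dissertation's own code.

```python
import numpy as np

# Sketch: Bayes Factor of Theorem 4 for the one-way random model (r = 1).
# By Theorem 3 with r = 1, dP(tau^2) corresponds to a uniform prior on
# rho = tau^2/(1 + tau^2), so m1(y) reduces to a 1-D quadrature over rho,
# while m0(y) is a single term (tau^2 = 0).  Data and sizes are made up.
rng = np.random.default_rng(2)
g, m = 4, 5                              # groups and replicates per group
n, p = g * m, 1
X = np.ones((n, 1))                      # intercept only
Z = np.kron(np.eye(g), np.ones((m, 1)))  # one-way incidence matrix
y = 2.0 + Z @ rng.normal(scale=1.0, size=g) + rng.normal(size=n)

def integrand(tau2):
    # |V|^{-1/2} |X^T V^{-1} X|^{-1/2} [(y - X bhat)^T V^{-1} (y - X bhat)]^{-(n-p+2)/2}
    V = tau2 * Z @ Z.T + np.eye(n)
    Vi = np.linalg.inv(V)
    bhat = np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ y)
    q = (y - X @ bhat) @ Vi @ (y - X @ bhat)
    return (np.linalg.det(V) ** -0.5 * np.linalg.det(X.T @ Vi @ X) ** -0.5
            * q ** (-(n - p + 2) / 2))

m0 = integrand(0.0)                      # reduced model: nothing left to integrate

rho = np.linspace(1e-6, 1 - 1e-6, 2001)  # uniform prior on intraclass correlation
vals = np.array([integrand(r_ / (1 - r_)) for r_ in rho])
m1 = np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(rho))

B = m0 / m1                              # Bayes Factor (3.12)
assert np.isfinite(B) and B > 0
```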
3.4 Hierarchical Approach
In this section we will provide an alternative expression for the Bayes Factor
using a hierarchical approach. As noted by Searle et al. (1992), hierarchical
models have a distinct Bayesian flavor, but they have also been used
successfully beyond Bayesian analysis. As far as algebraic simplicity is
concerned, the standard model we have worked with in the previous section is
preferable to a hierarchical model. But it will turn out that the BF based on a
hierarchical model is much more efficient computationally than the one we
derived in Theorem 4. Also, in general, hierarchical models are easier to
conceptualize.
The mixed model hierarchy is well investigated, and the seminal work in this
area is Lindley and Smith (1972), followed by the works of Smith (1973a,
1973b). The starting point is the mixed linear model as specified in (2.2):

   y = Xβ + Zu + e.

In what follows, we will treat u as a parameter as well. This is not a problem
from the Bayesian perspective, since u is a vector of unobservable quantities
and hence can be treated as parameters. Then we can specify the likelihood
function for the model as

   y | u, β, σ², τ_J² ~ N_n(Xβ + Zu, σ² I).   (3.29)
Now we have to specify priors for the parameters u, β, σ², τ_J². We will do
this in two stages. In the first stage we will specify a prior for u,
conditional on β, σ², τ_J², and then in the second stage we will specify priors
for β, σ² and τ_J². Because of this strategy of specifying priors, this
approach is termed "hierarchical." Actually, it is a Bayesian example of the
more general topic of "hierarchical modeling"; see Casella and Berger (1990)
for an introductory treatment of hierarchical models. Berger (1985) provides a
treatment of such models from a Bayesian perspective.

Following the model assumptions for the mixed linear model, as stated in
Section 2.1, we specify

   u_i | β, σ², τ_J² ~ N_{q_i}(0, σ² τ_i² I)   (3.30)

with the understanding that u_i | β, σ², τ_J² and u_j | β, σ², τ_J² are
independent for i ≠ j. We also keep in mind that u = (u_1^T, ..., u_r^T)^T. This
is called the "first-stage prior," and the parameters that are conditioned on
in a first-stage prior are known as "hyperparameters." As one might expect, the
second-stage priors are concerned with the hyperparameters:

   P(β, σ², τ_J²) ∝ (σ²)^{−2} P(τ_J²),   (3.31)

which is the same as (3.10). When some of the random factors are not present,
such as in the reduced model, one can easily make the necessary adjustments to
arrive at an appropriate hierarchical specification.
In this context, we will prove the following theorem about the Bayes Factor
expressed under this hierarchical setup. We assume that the variance components
are arranged in such a way that the one to be tested is labeled σ_r². This
assumption is needed only for simplicity of notation.
Theorem 5 The BF for the variance component σ_r² in the hierarchical model is
given by

   B = m_0(y)/m_1(y),   (3.32)

where

   m_0(y) = ∫ |X^T B_{r−1} X|^{−1/2} Q_0^{−(n−p+2)/2}
      [Π_{i=1}^{r−1} |A_i|^{−1/2}] dP(τ_K²),   (3.33)

   m_1(y) = ∫ |X^T B_r X|^{−1/2} Q_1^{−(n−p+2)/2}
      [Π_{i=1}^{r} |A_i|^{−1/2}] dP(τ_J²),   (3.34)

and

   B_0 = I,   (3.35)
   A_j = I + τ_j² Z_j^T B_{j−1} Z_j,   (3.36)
   B_j = B_{j−1} − τ_j² B_{j−1} Z_j A_j^{−1} Z_j^T B_{j−1},   (3.37)
   β̂_{h0} = (X^T B_{r−1} X)^{−1} X^T B_{r−1} y,   (3.38)
   β̂_{h1} = (X^T B_r X)^{−1} X^T B_r y,   (3.39)
   Q_0 = (y − Xβ̂_{h0})^T B_{r−1} (y − Xβ̂_{h0}),   (3.40)
   Q_1 = (y − Xβ̂_{h1})^T B_r (y − Xβ̂_{h1}),   (3.41)

for j = 1, 2, ..., r.
The proof of Theorem 5 will be greatly simplified by the use of the following
lemmas.
Lemma 4

   u_j^T u_j + τ_j² (ξ_j − Z_j u_j)^T B_{j−1} (ξ_j − Z_j u_j)
      = (u_j − A_j^{−1} d_j)^T A_j (u_j − A_j^{−1} d_j) + τ_j² ξ_j^T B_j ξ_j,   (3.42)

where

   ξ_j = y − Xβ − Σ_{i=j+1}^r Z_i u_i   (3.43)

and

   d_j = τ_j² Z_j^T B_{j−1} ξ_j.   (3.44)

Proof: Our proof starts with a straightforward expansion of the left-hand side:

   u_j^T u_j + τ_j² (ξ_j − Z_j u_j)^T B_{j−1} (ξ_j − Z_j u_j)
      = u_j^T (I + τ_j² Z_j^T B_{j−1} Z_j) u_j − 2 u_j^T τ_j² Z_j^T B_{j−1} ξ_j
      + τ_j² ξ_j^T B_{j−1} ξ_j.

Using the definitions of A_j and d_j from (3.36) and (3.44), we have

   u_j^T u_j + τ_j² (ξ_j − Z_j u_j)^T B_{j−1} (ξ_j − Z_j u_j)   (3.45)
      = u_j^T A_j u_j − 2 u_j^T d_j + τ_j² ξ_j^T B_{j−1} ξ_j
      = (u_j − A_j^{−1} d_j)^T A_j (u_j − A_j^{−1} d_j) − d_j^T A_j^{−1} d_j
      + τ_j² ξ_j^T B_{j−1} ξ_j.   (3.46)

Since d_j^T A_j^{−1} d_j = (τ_j²)² ξ_j^T B_{j−1} Z_j A_j^{−1} Z_j^T B_{j−1} ξ_j, we
conclude that

   −d_j^T A_j^{−1} d_j + τ_j² ξ_j^T B_{j−1} ξ_j
      = τ_j² [ξ_j^T (B_{j−1} − τ_j² B_{j−1} Z_j A_j^{−1} Z_j^T B_{j−1}) ξ_j]
      = τ_j² ξ_j^T B_j ξ_j.   (3.47)

Substituting (3.47) back in (3.46) gives us the desired result. □
Lemma 5

   ∫ exp{−(1/(2σ²τ_j²)) [u_j^T u_j + τ_j² (ξ_j − Z_j u_j)^T B_{j−1} (ξ_j − Z_j u_j)]} du_j
      = (2πσ²τ_j²)^{q_j/2} |A_j|^{−1/2} exp{−(1/(2σ²)) ξ_j^T B_j ξ_j}.   (3.48)

Proof: The proof of this lemma is greatly facilitated by the use of Lemma 4,
which enables us to rewrite the exponent in the following manner:

   ∫ exp{−(1/(2σ²τ_j²)) [u_j^T u_j + τ_j² (ξ_j − Z_j u_j)^T B_{j−1} (ξ_j − Z_j u_j)]} du_j
      = exp{−(1/(2σ²)) ξ_j^T B_j ξ_j}
        ∫ exp{−(1/(2σ²τ_j²)) (u_j − A_j^{−1} d_j)^T A_j (u_j − A_j^{−1} d_j)} du_j
      = exp{−(1/(2σ²)) ξ_j^T B_j ξ_j} (2πσ²τ_j²)^{q_j/2} |A_j|^{−1/2},

where the integral is evaluated by using Lemma 2. □
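Lemma 5 can be checked numerically in the simplest case q_j = 1, where the
integral over u_j is one-dimensional; all quantities below are small made-up
inputs, not quantities from the dissertation.

```python
import numpy as np

# Numeric check of Lemma 5 for q_j = 1: a 1-D grid quadrature over u_j
# matches the closed form (3.48) built from A_j (3.36) and B_j (3.37).
rng = np.random.default_rng(3)
n = 4
M = rng.normal(size=(n, n))
B_prev = M @ M.T + n * np.eye(n)        # stands in for B_{j-1}: sym. pos.-def.
z = rng.normal(size=n)                  # Z_j with a single column (q_j = 1)
xi = rng.normal(size=n)
sigma2, tau2 = 1.3, 0.6

A = 1.0 + tau2 * z @ B_prev @ z                                # (3.36), scalar here
B_next = B_prev - tau2 * np.outer(B_prev @ z, z @ B_prev) / A  # (3.37)

# left-hand side of (3.48): expand (xi - z u)^T B_{j-1} (xi - z u) in u
c0 = xi @ B_prev @ xi
c1 = z @ B_prev @ xi
c2 = z @ B_prev @ z
u = np.linspace(-30.0, 30.0, 200_001)
quad_form = c0 - 2.0 * u * c1 + u ** 2 * c2
f = np.exp(-(u ** 2 + tau2 * quad_form) / (2.0 * sigma2 * tau2))
lhs = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(u))

# right-hand side of (3.48)
rhs = (np.sqrt(2.0 * np.pi * sigma2 * tau2) / np.sqrt(A)
       * np.exp(-(xi @ B_next @ xi) / (2.0 * sigma2)))

assert np.isclose(lhs, rhs)
```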
Armed with these lemmas, we turn our attention to Theorem 5.

Proof (Theorem 5): As we have done with Theorem 4, we will derive the marginal
likelihood for the full model only; the derivation for the reduced model turns
out to be very similar.
   m_1(y) = ∫···∫ P(y | u, β, σ², τ_J²) [Π_{i=1}^r P(u_i | β, σ², τ_J²)]
      P(β, σ², τ_J²) [Π_{i=1}^r du_i] dβ dσ² dτ_J².

We substitute the likelihood as specified in (3.29), the first-stage prior as
in (3.30) and the second-stage prior as in (3.31) to get

   m_1(y) = ∫···∫ (2πσ²)^{−n/2} exp{−(1/(2σ²)) ξ_0^T ξ_0}
      [Π_{i=1}^r (2πσ²τ_i²)^{−q_i/2} exp{−u_i^T u_i/(2σ²τ_i²)}] (σ²)^{−2}
      [Π_{i=1}^r du_i] dβ dσ² dP(τ_J²).   (3.49)

Arranging the terms in (3.49), we arrive at

   m_1(y) = ∫∫ (2π)^{−(n+q)/2} (σ²)^{−(n+q+4)/2} [Π_{i=1}^r (τ_i²)^{−q_i/2}]
      I_u dβ dσ² dP(τ_J²),   (3.50)

where q = Σ_{i=1}^r q_i and

   I_u = ∫···∫ exp{−(1/(2σ²)) ξ_0^T ξ_0 − Σ_{i=1}^r u_i^T u_i/(2σ²τ_i²)}
      [Π_{i=1}^r du_i].   (3.51)

It is helpful to keep in mind that ξ_0, as defined by (3.43), is a function of
the u_i's.

We will evaluate I_u first. Define, for i = 1, ..., r − 1,

   u_{−i} = (u_{i+1}, ..., u_r).   (3.52)

Then

   I_u = ∫···∫ exp{−Σ_{i=2}^r u_i^T u_i/(2σ²τ_i²)} I_{u_1} du_{−1},   (3.53)

where

   I_{u_1} = ∫ exp{−(1/(2σ²τ_1²)) [u_1^T u_1
      + τ_1² (ξ_1 − Z_1 u_1)^T B_0 (ξ_1 − Z_1 u_1)]} du_1
      = (2πσ²τ_1²)^{q_1/2} |A_1|^{−1/2} exp{−(1/(2σ²)) ξ_1^T B_1 ξ_1},   (3.54)

where, in the last step, we have used Lemma 5 (noting that ξ_0 = ξ_1 − Z_1 u_1).

We substitute (3.54) in (3.53) and repeat the same process for u_2:

   I_u = (2πσ²τ_1²)^{q_1/2} |A_1|^{−1/2}
      ∫···∫ exp{−Σ_{i=3}^r u_i^T u_i/(2σ²τ_i²)} I_{u_2} du_{−2},   (3.55)

where

   I_{u_2} = ∫ exp{−(1/(2σ²τ_2²)) [u_2^T u_2
      + τ_2² (ξ_2 − Z_2 u_2)^T B_1 (ξ_2 − Z_2 u_2)]} du_2
      = (2πσ²τ_2²)^{q_2/2} |A_2|^{−1/2} exp{−(1/(2σ²)) ξ_2^T B_2 ξ_2},   (3.56)

where, in the last step, we have used Lemma 5 again.

As one might easily realize, these are exactly the same steps we went through
in evaluating the integral over u_1. Repeating this procedure for the remaining
random effects u_3, ..., u_r, and noting that ξ_r = y − Xβ, we find that

   I_u = [Π_{i=1}^r (2πσ²τ_i²)^{q_i/2} |A_i|^{−1/2}]
      exp{−(1/(2σ²)) (y − Xβ)^T B_r (y − Xβ)}.   (3.58)

Using (3.58) in (3.50), we get

   m_1(y) = ∫∫ (2π)^{−(n+q)/2} (σ²)^{−(n+q+4)/2} [Π_{i=1}^r (τ_i²)^{−q_i/2}]
      [Π_{i=1}^r (2πσ²τ_i²)^{q_i/2} |A_i|^{−1/2}]
      exp{−(1/(2σ²)) (y − Xβ)^T B_r (y − Xβ)} dβ dσ² dP(τ_J²),   (3.59)

which can be rewritten as

   m_1(y) = (2π)^{−n/2} ∫∫ (σ²)^{−(n+4)/2} [Π_{i=1}^r |A_i|^{−1/2}]
      I_β dσ² dP(τ_J²),   (3.60)

where

   I_β = ∫ exp{−(1/(2σ²)) (y − Xβ)^T B_r (y − Xβ)} dβ
      = ∫ exp{−(1/(2σ²)) [(y − Xβ̂_{h1})^T B_r (y − Xβ̂_{h1})
        + (β − β̂_{h1})^T X^T B_r X (β − β̂_{h1})]} dβ
      = exp{−(1/(2σ²)) (y − Xβ̂_{h1})^T B_r (y − Xβ̂_{h1})}
        (2πσ²)^{p/2} |X^T B_r X|^{−1/2},   (3.61)

where we have used the techniques developed in Section 3.1 and used to prove
Theorem 4, and where

   β̂_{h1} = (X^T B_r X)^{−1} X^T B_r y.   (3.62)

Substituting (3.61) back in (3.60), we get

   m_1(y) = (2π)^{−(n−p)/2} ∫ |X^T B_r X|^{−1/2} [Π_{i=1}^r |A_i|^{−1/2}]
      I_{σ²} dP(τ_J²),   (3.63)

where

   I_{σ²} = ∫ (σ²)^{−(n−p+4)/2}
      exp{−(1/(2σ²)) (y − Xβ̂_{h1})^T B_r (y − Xβ̂_{h1})} dσ²
      = Γ((n−p+2)/2) [½ (y − Xβ̂_{h1})^T B_r (y − Xβ̂_{h1})]^{−(n−p+2)/2},

which, when substituted in (3.63), leads to

   m_1(y) = (2π)^{−(n−p)/2} Γ((n−p+2)/2)
      ∫ |X^T B_r X|^{−1/2} [½ (y − Xβ̂_{h1})^T B_r (y − Xβ̂_{h1})]^{−(n−p+2)/2}
        [Π_{i=1}^r |A_i|^{−1/2}] dP(τ_J²).   (3.64)

A similar line of argument gives

   m_0(y) = (2π)^{−(n−p)/2} Γ((n−p+2)/2)
      ∫ |X^T B_{r−1} X|^{−1/2} [½ (y − Xβ̂_{h0})^T B_{r−1} (y − Xβ̂_{h0})]^{−(n−p+2)/2}
        [Π_{i=1}^{r−1} |A_i|^{−1/2}] dP(τ_K²),   (3.65)

where

   β̂_{h0} = (X^T B_{r−1} X)^{−1} X^T B_{r−1} y.

Combining (3.64) and (3.65), we arrive at (3.32), and this completes the
proof. □
To appreciate the computational advantage of the hierarchical model, one needs
to compare Theorem 4 and Theorem 5. The expression for the Bayes Factor in
Theorem 4 requires the inversion of n × n matrices. As we know from the
numerical analysis literature (see, for example, Golub and Van Loan, 1989),
matrix inversion requires on the order of n³ operations and as such is
computationally expensive. However, the Bayes Factor in Theorem 5 only requires
inversions of the matrices A_j, whose dimensions are of the order of q_j.
Typically, n (the number of observations) is much larger than q_j (the number
of levels of the jth random effect), and this gives the expression in Theorem 5
a great advantage over that in Theorem 4. We will make more remarks about this
in the next chapter, where we tackle the issues of computation. However, in
fairness to the standard model, we note that the algebra involved was much less
tedious than for the hierarchical model.
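The agreement between the two theorems rests on two standard Woodbury and
matrix-determinant-lemma identities, B_r = V_1^{−1} and Π_i |A_i| = |V_1|, which
are easy to confirm numerically; the design matrices and variance ratios below
are made up, and this is a check rather than part of the dissertation's
derivation.

```python
import numpy as np

# Check of the identities behind the computational remark: the recursions
# (3.35)-(3.37) satisfy B_r = V1^{-1} and prod_i |A_i| = |V1|, so Theorem 5
# reproduces the Theorem 4 quantities while inverting only the small
# q_i x q_i matrices A_i.
rng = np.random.default_rng(4)
n = 12
Zs = [np.kron(np.eye(3), np.ones((4, 1))),            # q_1 = 3 levels
      rng.integers(0, 2, size=(n, 2)).astype(float)]  # q_2 = 2, arbitrary incidence
tau2s = [0.8, 1.7]

B = np.eye(n)                                         # B_0 = I
det_prod = 1.0
for t2, Z in zip(tau2s, Zs):
    A = np.eye(Z.shape[1]) + t2 * Z.T @ B @ Z         # (3.36): q_i x q_i only
    det_prod *= np.linalg.det(A)
    B = B - t2 * B @ Z @ np.linalg.solve(A, Z.T @ B)  # (3.37)

V1 = sum(t2 * Z @ Z.T for t2, Z in zip(tau2s, Zs)) + np.eye(n)
assert np.allclose(B, np.linalg.inv(V1))
assert np.isclose(det_prod, np.linalg.det(V1))
```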
3.5 Missing Observations
The subject of missing observations has been a concern for practitioners for a
long time because of the regularity with which they occur. Even the most
carefully designed experiment under strictly controlled conditions can lose
experimental subjects during the course of the experiment. In studies dealing
with living subjects, which are common in fields like psychology, biology,
medicine and agriculture, the phenomenon of missing observations is much more
common. This partly explains the existence of a lengthy literature on this
topic, and we do not intend to give a complete treatment here. There have been
a variety of proposed solutions, ranging from the naive (discarding the data
associated with missing observations), to the computationally intensive
(multiple imputation), to the analytically challenging (performing an exact
mathematical analysis). The discussion of missing observations is further
complicated by the apparent lack of a common taxonomy. We will loosely follow
the terminology used by Searle et al. (1992), but several of the literature
reviews that we refer to here introduce their own taxonomies. Another confusing
factor is the approach which considers all unbalanced data as missing, as
adopted by Dodge (1985), for example.
One of the earlier reviews of missing observations in the literature is given
by Afifi and Elashoff (1966). They review the work initiated by the ideas of
Allen and Wishart (1930) and Yates (1933). Later, Hartley and Hocking (1971)
provided a taxonomy for incomplete data problems and developed estimation
techniques based on the likelihood principle for normal models, including the
linear model, with missing observations. Anderson, Basilevsky and Hum (1983)
provide their own taxonomy and consider the problem of missing observations in
planned experiments, multivariate parameter estimation and least squares
regression procedures. The most recent literature survey is by Little (1992).
His focus is on regression models with missing values of covariates, but he
reviews contemporary approaches to the topic as well. He, too, provides his own
taxonomy of methods.
The concept of identifiability plays an important role in the discussion of
missing observations. A statistical model with parameters θ and observables y
is said to be identifiable if distinct values of θ correspond to distinct
likelihood functions P(y | θ); that is, if θ ≠ θ′, then P(y | θ) is not the same
function as P(y | θ′).

As an example, consider a 2 × 2 experiment with one observation per cell where
we want to model both the main effects and the interaction. Using (2.4), we see
that

   V = σ_1² Z_1 Z_1^T + σ_2² Z_2 Z_2^T + σ_12² Z_12 Z_12^T + σ² I,

where σ_1² and σ_2² are the main effect variances, σ_12² is the interaction
variance and σ² is the error variance. Also, the Z's are the corresponding
incidence matrices. A quick calculation assures us that Z_12 = I, so

   V = σ_1² Z_1 Z_1^T + σ_2² Z_2 Z_2^T + (σ_12² + σ²) I.   (3.66)

Letting θ = (σ_1², σ_2², σ_12², σ²), we see that two different values of θ such
that σ_12² + σ² is constant will yield the same V and hence the same likelihood
function, so this model is not identifiable. It is sometimes said that the
interaction is confounded with the error variance, expressing the fact that we
cannot distinguish the two parameters; the reasoning behind this terminology
should be clear from (3.66). At this point, let us observe that if two effects
(fixed or random) have the same incidence matrix, the mixed linear model will
not be identifiable.
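The confounding in (3.66) is easy to exhibit numerically: two parameter vectors
with the same σ_12² + σ² produce the same V. The 2 × 2 incidence matrices below
follow the cell ordering (1,1), (1,2), (2,1), (2,2).

```python
import numpy as np

# Numeric illustration of the non-identifiability in (3.66): with one
# observation per cell, the interaction incidence matrix is I, so only the
# sum sigma_12^2 + sigma^2 enters the covariance V.
Z1 = np.kron(np.eye(2), np.ones((2, 1)))   # row main-effect incidence
Z2 = np.kron(np.ones((2, 1)), np.eye(2))   # column main-effect incidence

def V(s1, s2, s12, se):
    return s1 * Z1 @ Z1.T + s2 * Z2 @ Z2.T + (s12 + se) * np.eye(4)

# same sigma_12^2 + sigma^2 => same covariance, hence same likelihood
assert np.allclose(V(1.0, 2.0, 0.5, 1.5), V(1.0, 2.0, 1.0, 1.0))
```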
In the context of designed experiments with a balanced structure, a few missing
observations will result in an unbalanced design, which may be undesirable for
several purposes. In this case, the most common practice is to "estimate" the
missing observations and use the estimated values as if they were the actually
observed ones. Adjustments then have to be made to the degrees of freedom,
36
subtracting one for each estimated observation. This method should be used only
when there are few missing observations indeed, since the degrees of freedom ad
justment may result in a substantial loss of power in the hypothesis tests: or create
a problem with identifiabifity. One should also realize that the validity of infer
ences is now conditioned on that the specified model is true, since the missing
observations are generated according to the model and then used as if the\' were
actually observed. One also makes the additional (and somewhat unreafistic in
some cases) assumption that the distribution of the observed data does not change
in the presence of the missing ones. As a result of these considerations, one should
treat the results of such analyses as only approximate.
In this section, we use the term missing observation to refer to a datum that
we had planned to observe but were unable to, because of uncontrollable
factors in the experimental environment. So, as long as the model remains
identifiable, an experiment where we were already planning for unbalanced data
does not create any further problems for the Bayes Factor, and all of the discussions
in the previous sections apply without exception. It is also not a problem
if our planned-to-be-balanced experiment turns out to be unbalanced because
of the missing values (assuming, of course, that the model is still identifiable), since
the BF works equally well in both balanced and unbalanced situations. However,
sometimes an entire cell is missing, and this leads to difficulties. In such cases,
estimation (in the sense of the previous paragraph) of the missing cell means becomes
necessary and the analysis is carried out using these estimated values. Most of
the literature on handling missing cells in the mixed model is concerned with
parameter estimation, and not much has been done about hypothesis testing. The
distinction between "all-cells-filled" and "some-cells-empty" data is most helpful
for linear models, as noted by Searle (1987), but does not facilitate the analysis in
mixed models.
If, for one of the reasons explained in the previous discussion, estimation of
individual observations or cell means becomes necessary, there are three main routes
which one may take: Least Squares, Maximum Likelihood and Bayesian. This is
also the order in which they have appeared in the literature historically. The process of
using the estimated values in place of the observed values is called "imputation."
Least Squares techniques have been known and applied for a long time, but, as
concluded by Little (1992), they are usually outperformed by likelihood and Bayesian
methods. The method of multiple imputation (Rubin, 1987) uses the predictive
density of the missing observations, based on the EM algorithm for maximum
likelihood estimation, to generate a set of possible values for the missing observations,
and performs the analysis using these generated values. It is inherently a computer-intensive
approach, but seems to have dominated the area, at least in survey research.
Little and Rubin (1987) extend the method of multiple imputation to
several other statistical models.
A Bayesian approach to imputation is advocated by Tanner and Wong (1987);
it models the missing observations as random variables with a prior distribution.
Since the missing values are unobservable to the experimenter, this approach is very
sensible from a Bayesian perspective. Their device, called data augmentation, has
a distinctive Markov Chain Monte Carlo flavor. Later, Tanner (1993) established
data augmentation as a special case of Gibbs sampling. In data augmentation,
one first generates values from the conditional distribution of the parameters given the
observed and missing data. Then a sample of imputed values is generated from
the sampling distribution of the missing data, that is, the conditional probability
distribution of the missing data given the parameters and the observed data. This
iterative scheme is continued until some convergence is reached.
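The two conditional draws of data augmentation can be sketched on a toy problem. The example below is hypothetical (Python, normal data with known unit variance, an assumed N(0, 100) prior on the mean, and two missing values); Tanner and Wong's scheme applies far more generally, but the alternation between the two steps is the same:

```python
import random

random.seed(1)

observed = [4.9, 5.0, 5.1, 4.4, 4.2, 4.5, 4.0, 4.1]  # observed data
n_miss = 2                # number of planned-but-missing observations
sigma2 = 1.0              # known error variance
prior_mean, prior_var = 0.0, 100.0

imputed = [sum(observed) / len(observed)] * n_miss    # crude starting values
mu_draws = []
for sweep in range(2000):
    # Step 1: draw mu from its conditional given observed + imputed data.
    data = observed + imputed
    n = len(data)
    post_var = 1.0 / (1.0 / prior_var + n / sigma2)
    post_mean = post_var * (prior_mean / prior_var + sum(data) / sigma2)
    mu = random.gauss(post_mean, post_var ** 0.5)
    # Step 2: re-impute the missing data from their sampling distribution
    # given the parameter (and, trivially here, the observed data).
    imputed = [random.gauss(mu, sigma2 ** 0.5) for _ in range(n_miss)]
    mu_draws.append(mu)

# After burn-in, mu_draws approximate the posterior of mu given the observed data.
burned = mu_draws[500:]
print(sum(burned) / len(burned))  # close to the observed-data mean, about 4.5
```

The stationary distribution of the chain has the correct observed-data posterior as its marginal for the parameter, which is what makes the imputed values usable for inference.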
There is another way of modeling variance components, advocated by Hocking,
Green and Bremer (1989). The main theme of their approach is interpreting the
variance components as covariances. Although this idea is motivated by the pursuit
of a model which would avoid negative variance estimates, it gives us a tool in the
case of missing data, in both the all-cells-filled and some-cells-empty cases. Their
focus, however, has been mostly on estimation.
We conclude this section by noting that the BF we have proposed suffers no difficulty
in the case of missing observations as long as the model remains identifiable.
If the model is no longer identifiable, or if the analyst prefers to keep the balance
and symmetry in the experiment, then one can use any of the imputation methods
to find estimates of the missing observations and/or cell means. The imputed data
set, then, can be used to calculate the Bayes Factor. In the case of imputations,
the caveats discussed above about the results of the analysis being only approximately
true should be kept in mind. Of course, a novel Bayesian approach
to the problem would be to devise an algorithm that integrates the calculation
of the BF with the data augmentation process, automatically taking care of
the missing observations, but this task is beyond the purposes of this study.
CHAPTER IV
NUMERICAL METHODS
The Bayes Factor, as given by either Theorem 4 or Theorem 5, involves analytically
intractable multi-dimensional integrals in both the numerator and the denominator.
Hence, for calculations, one must resort to numerical methods. Classical
techniques of numerical integration are based on quadrature methods, and their
statistical applications are considered in a recent review by Kahaner (1991). In higher
dimensions, it has been established that quadrature methods are outperformed by
Monte Carlo techniques (Niederreiter, 1992). We will, therefore, focus on Monte
Carlo estimation of the integrals in this chapter. When quadrature methods are
used for numerical evaluation of an integral, it is generally said that the integral
is approximated; in the language of Monte Carlo, however, estimation of integrals
seems to be the more common terminology.
In Section 4.1, we will provide a general framework and notation in which
the Monte Carlo estimators we will use can be defined. In Section 4.2, we will
introduce a small data set and a model to work with. In subsequent sections, we
will implement the estimators we suggest in Section 4.1. In particular, Section 4.3
will be concerned with simple random sampling, Section 4.4 will consider Latin
hypercube sampling and Section 4.5 will deal with Gibbs sampling.
4.1 Monte Carlo Estimation of the Bayes Factor
We will first introduce a framework in which all of the Monte Carlo estimators
considered in this chapter can be examined easily. Our approach and notation are
similar to those of Raftery (1996).
The key quantity in the estimation of the BF is the marginal likelihood. In
what follows, we will work with m(y), a generic marginal likelihood of the form

  m(y) = ∫ L(τ²) P(τ²) dτ². (4.1)

Here L(·) is either a likelihood function or what we will call a partially integrated
likelihood. If we let θ = (φ, ψ) be the vector of parameters, with ψ denoting
the nuisance parameters, then we define the partially integrated likelihood as
∫ l(φ, ψ) P(φ, ψ) dψ, where l(φ, ψ) denotes the likelihood function for the model
and P(φ, ψ) is the prior.
Let g(τ²) be a positive integrable function such that c·g(τ²) is a probability density
function, where c⁻¹ = ∫ g(τ²) dτ², a possibly unknown integration constant.
Then (4.1) can be rewritten as

  m(y) = ∫ L(τ²) [P(τ²) / (c g(τ²))] c g(τ²) dτ². (4.2)

If we have a sample of size T from c g(τ²), which we will denote by τ²₍ᵢ₎ for
i = 1, …, T, then we can use the following estimator for m(y):

  m̂(y) = (1/T) Σᵢ₌₁ᵀ L(τ²₍ᵢ₎) P(τ²₍ᵢ₎) / (c g(τ²₍ᵢ₎)) (4.3)
       = ‖LP/cg‖_g, (4.4)

where the g in the subscript refers to the density (or the kernel of the density)
from which the samples are obtained. We will refer to T as the simulation size (as
opposed to the sample size, which we will use for the number of observations in a data
set). This estimator is simulation-consistent, that is, m̂(y) converges in probability
to m(y) as T → ∞.
In some cases (including ours), c cannot be found analytically. We can then
estimate it by using the following observations:

  ∫ P(τ²) dτ² = 1,
  ∫ c g(τ²) dτ² = 1,
  ∫ [P(τ²)/g(τ²)] c g(τ²) dτ² = c ∫ P(τ²) dτ² = c.

So a simulation-consistent estimator of c is ĉ = ‖P/g‖_g. Substituting this back in
(4.4), we get

  m̂_g(y) = ‖LP/g‖_g / ‖P/g‖_g. (4.5)

We will call this estimator the general importance sampling estimator, following
Newton and Raftery (1994), with importance sampling function g.
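To make (4.5) concrete, consider a hypothetical one-dimensional problem where the marginal likelihood is known in closed form: y ~ N(θ, 1) with prior θ ~ N(0, 1), so that m(y) is the N(0, 2) density at y. The Python sketch below (the dissertation's actual computations use SAS/IML) deliberately specifies g only up to its normalizing constant, so that ĉ = ‖P/g‖_g is exercised exactly as in the derivation above:

```python
import math
import random

random.seed(7)

y = 1.3
L = lambda t: math.exp(-0.5 * (y - t) ** 2) / math.sqrt(2 * math.pi)  # likelihood
P = lambda t: math.exp(-0.5 * t ** 2) / math.sqrt(2 * math.pi)        # prior
g = lambda t: math.exp(-t ** 2 / 8.0)       # unnormalized kernel of N(0, 4)

T = 100_000
draws = [random.gauss(0.0, 2.0) for _ in range(T)]  # sample from c*g = N(0, 4)

num = sum(L(t) * P(t) / g(t) for t in draws) / T    # ||LP/g||_g
den = sum(P(t) / g(t) for t in draws) / T           # ||P/g||_g, estimates c
m_hat = num / den                                   # equation (4.5)

m_true = math.exp(-y ** 2 / 4.0) / math.sqrt(4 * math.pi)  # N(0, 2) density at y
print(m_hat, m_true)  # the estimate is close to the exact value
```

Note that the unknown constant of g cancels in the ratio; only the ability to sample from the normalized density c·g is needed.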
An obvious choice of g is P, the prior. This gives us

  m̂_P(y) = ‖L‖_P (4.6)

as the importance sampling estimator. An attentive look at (4.6) assures us that
this is the usual Monte Carlo method for evaluating integrals using the sample
average; see Hammersley and Handscomb (1964). Historically, this has been done
using a simple random sample (SRS) from P, but we will also consider an alternative,
called "Latin hypercube sampling" (LHS), suggested by McKay, Conover
and Beckman (1979). SRS and LHS will be investigated in Sections 4.3 and 4.4,
respectively.
Recent advances in Markov Chain Monte Carlo (MCMC) methods have enabled
us to produce samples from the posterior distribution with relative ease, so another
reasonable suggestion is to use the posterior as the importance sampling function,
that is, g = LP. This gives us

  m̂_LP(y) = ‖1/L‖⁻¹_LP. (4.7)

This estimator is known as the harmonic mean estimator. Note that the importance
of incorporating an unknown c frequently comes into the picture at this
stage, since in several situations the user, having specified the prior, knows the
integrating constant for P, but not for the posterior, which is given only as
proportional to LP. The use of the harmonic mean estimator in the context of posterior
simulation has been suggested by Newton and Raftery (1994). We investigate this
estimator in Section 4.5.
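On the same hypothetical conjugate problem (y ~ N(θ, 1), θ ~ N(0, 1), posterior θ | y ~ N(y/2, 1/2)), the harmonic mean estimator (4.7) can be sketched as follows; the direct posterior draws here stand in for samples that would in practice come from MCMC:

```python
import math
import random

random.seed(11)

y = 1.3
L = lambda t: math.exp(-0.5 * (y - t) ** 2) / math.sqrt(2 * math.pi)  # likelihood

# Posterior draws: theta | y ~ N(y/2, 1/2) in this conjugate toy model.
T = 100_000
post = [random.gauss(y / 2.0, math.sqrt(0.5)) for _ in range(T)]

# Harmonic mean of the likelihood over posterior draws, equation (4.7).
m_harmonic = 1.0 / (sum(1.0 / L(t) for t in post) / T)
m_true = math.exp(-y ** 2 / 4.0) / math.sqrt(4 * math.pi)  # exact marginal
print(m_harmonic, m_true)
```

The harmonic mean estimator is simulation-consistent but can be very heavy-tailed, since a few draws with small likelihood dominate the sum; this is worth keeping in mind when interpreting its output.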
Let

  L₀ = [∏ᵢ₌₁^{r−1} |Aᵢ|^{-1/2}] (XᵀB_{r−1}X)^{-1/2} [(y − Xμ̂₀)ᵀ B_{r−1} (y − Xμ̂₀)]^{-(n+1)/2}, (4.8)
  L₁ = [∏ᵢ₌₁^{r} |Aᵢ|^{-1/2}] (XᵀB_r X)^{-1/2} [(y − Xμ̂₁)ᵀ B_r (y − Xμ̂₁)]^{-(n+1)/2}. (4.9)

These correspond to the partially integrated likelihoods of the numerator and
the denominator of the BF, given in Theorem 5. By replacing the generic L with
L₀ and P with P₀ (the prior for the reduced model) in the discussion above, one can
get the importance sampling estimator of the marginal likelihood for the reduced
model. The same can also be done for the full model. It will then follow that the
BF can be estimated by the ratio of the two estimated marginal likelihoods, that is,

  B̂ = m̂₀(y) / m̂₁(y), (4.10)

where

  m̂₀(y) = ‖L₀P₀/g‖_g / ‖P₀/g‖_g,
  m̂₁(y) = ‖L₁P₁/g‖_g / ‖P₁/g‖_g.

We will specifically spell out what these estimates will be in the following
sections whenever we evaluate them.
4.2 Data and Model
In this section, we introduce a small data set, made up for the purposes of this
study, and we suggest a model. In subsequent sections, we will employ the
Monte Carlo techniques discussed above to evaluate the Bayes Factor for the data
and model below.
The data set in Table 4.1 has two factors, both of which we assume to be
random effects. We will call those effects a and γ, a being the one with two levels
and γ being the one with three levels. We work with the following random model
for this data set:

  y_ij = μ + a_i + γ_j + e_ij. (4.11)

The GLM procedure in SAS is used to produce the ANOVA table for this data
set, along with tests for the main effects. These results are provided in Tables 4.2 and
4.3.
An investigative analysis suggests the presence of strong interaction, but to
highlight the computational strategies while keeping the details to a minimum,
we ignore this fact and work with the no-interaction model. Tables 4.2 and 4.3
summarize the results needed to perform the frequentist hypothesis tests. Roughly
speaking, a is a highly significant effect, but γ is not. Now, we will look at the same
hypothesis from a Bayesian perspective.
Table 4.1: Data (ten observations on the two random factors a and γ:
4.9, 5.0, 5.1, 4.4, 4.2, 4.5, 4.0, 4.1, 4.9, 5.1)
Table 4.2: ANOVA Table for the Model

  Source        DF   Sum of Squares   Mean Square   F Value   Pr > F
  Model          3           1.2620        0.4207      6.41   0.0267
  Error          6           0.3941        0.0657
  Corr. Total    9           1.6560
Table 4.3: ANOVA Table for Main Effects

  Source   DF   Type I SS     Mean Square   F Value   Pr > F
  a         1      0.9000          0.9000     13.70   0.0101
  γ         2      0.3620          0.1810      2.76   0.1416

  Source   DF   Type III SS   Mean Square   F Value   Pr > F
  a         1      0.6001          0.6001      9.14   0.0233
  γ         2      0.3620          0.1810      2.76   0.1416
We will test H₀: σ_γ² = 0 versus H₁: σ_γ² > 0, or equivalently H₀: τ₂² = 0 versus
H₁: τ₂² > 0. The choice of testing the γ main effect instead of a is, to some extent,
arbitrary. We are hoping that we will be able to demonstrate the "irreconcilability"
of posterior probabilities and p-values mentioned earlier. Previous experience
suggests the difference is more dramatic when the p-values are of moderate
size, instead of being very small, so at the outset, testing γ may serve that
purpose. In the sequel, τ₂² and τ_γ² will denote the same quantity, as will τ₁² and τ_a².
The model matrices can easily be found to be

  Xᵀ  = [ 1 1 1 1 1 1 1 1 1 1 ],

  Z₁ᵀ = [ 1 1 1 1 1 0 0 0 0 0
          0 0 0 0 0 1 1 1 1 1 ],

  Z₂ᵀ = [ 1 0 0 0 0 1 1 0 0 0
          0 1 1 0 0 0 0 1 0 0
          0 0 0 1 1 0 0 0 1 1 ].
The Bayes Factor is given by m₀(y)/m₁(y), where

  m₀(y) = ∫ L₀(τ₁²) P(τ₁²) dτ₁²,
  m₁(y) = ∫ L₁(τ₁², τ₂²) P(τ₁², τ₂²) dτ₁² dτ₂²,

and

  L₀ = |A₁|^{-1/2} (XᵀB₁X)^{-1/2} [(y − Xμ̂₀)ᵀ B₁ (y − Xμ̂₀)]^{-(n+1)/2}, (4.12)
  L₁ = |A₁|^{-1/2} |A₂|^{-1/2} (XᵀB₂X)^{-1/2} [(y − Xμ̂₁)ᵀ B₂ (y − Xμ̂₁)]^{-(n+1)/2}. (4.13)

The A and B matrices are given by (see Theorem 5)

  A₁ = I + τ₁² Z₁ᵀZ₁,
  B₁ = I − τ₁² Z₁A₁⁻¹Z₁ᵀ,
  A₂ = I + τ₂² Z₂ᵀB₁Z₂,
  B₂ = B₁ − τ₂² B₁Z₂A₂⁻¹Z₂ᵀB₁,
and, following (3.62),

  μ̂₀ = (XᵀB₁X)⁻¹XᵀB₁y,
  μ̂₁ = (XᵀB₂X)⁻¹XᵀB₂y.

Finally, P(τ₁²) and P(τ₁², τ₂²) are obtained by one-to-one transformations from the
intraclass correlations, whose prior distributions are selected from the Dirichlet family.
4.3 Simple Random Sampling
In order to use the Monte Carlo method as described in Section 4.1, we
need to be able to generate from the prior distributions. In the literature this has
been done mostly using simple random sampling, or SRS (a term we borrow from
statistical sampling theory to emphasize the difference between different kinds of
sampling; otherwise SRS is simply known as random sampling); however, we will also
investigate the possibility of using Latin hypercube sampling for this purpose in
the next section.
In several computer packages, a routine for generating a simple random sample
from a U(0,1) distribution is readily available. By the definition of a simple random
sample, this method produces draws that are independent of each other. Strictly
speaking, there is no way to generate independent random numbers by using a
computer; however, most of the available random number generators produce deviates
that behave as if they were independent. Still, some authors use the term
"pseudo-random numbers" to remind us of the inherently deterministic nature of
computer-generated random numbers. There is a well-established literature, given an
encyclopedic treatment by Devroye (1986), on how to transform uniform deviates
to obtain a simple random sample from a given distribution. For some applications
of simple random sampling in system simulations, see Law and Kelton (1991)
or Ripley (1987). This is the first of two methods we will use to generate random
numbers, and we will label the resulting Monte Carlo estimator B̂_SRS.
In order to draw samples from P(τ²), we use the fact that the variance ratios
are functions of the intraclass correlations, the latter being modeled by a Dirichlet
distribution a priori. To generate samples from a k-dimensional Dirichlet
distribution with parameter vector α = (α₁, …, α_{k+1})ᵀ, we use the following algorithm
reported by Devroye (1986):
1. Generate an independent sequence of random numbers {xᵢ}, i = 1, …, k+1, where
xᵢ is drawn from a gamma distribution with parameters αᵢ and 1.

2. Let x = Σᵢ₌₁^{k+1} xᵢ and yᵢ = xᵢ/x for i = 1, …, k. Then (y₁, …, y_k) is a random
sample from the k-dimensional Dirichlet distribution with parameter vector α.

There are several methods to generate an SRS from the gamma family, most of
which are summarized by Devroye (1986). For convenience, however, we prefer to use
the built-in generator RANGAM of SAS, the system that we chose to implement
SRS.
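The two steps above amount to normalizing independent gamma draws. A hypothetical Python transcription (the dissertation itself uses the RANGAM generator in SAS):

```python
import random

random.seed(3)

def dirichlet(alpha):
    """Draw (y_1, ..., y_k) from the k-dimensional Dirichlet with
    parameter vector alpha of length k + 1, via normalized gammas."""
    x = [random.gammavariate(a, 1.0) for a in alpha]  # step 1: Gamma(alpha_i, 1)
    total = sum(x)
    return [xi / total for xi in x[:-1]]              # step 2: y_i = x_i / x

# One draw of the two intraclass correlations under a uniform
# Dirichlet(1, 1, 1) prior, as used for the example data set.
pi = dirichlet([1.0, 1.0, 1.0])
print(pi)  # two components, each in (0, 1), summing to less than 1
```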
As we mentioned before, it is quite easy to get an idea of the accuracy of the
Monte Carlo method. One possibility is finding the joint asymptotic distribution
of m̂₀ and m̂₁, as we will now do.

Theorem 6 Let m̂₀ and m̂₁ be as defined above. Then, as T → ∞,

  T^{1/2} [ (m̂₀, m̂₁)ᵀ − (μ₀, μ₁)ᵀ ] →d N( 0, Σ ),  where  Σ = [ σ₀²    ρσ₀σ₁
                                                                 ρσ₀σ₁  σ₁²   ].
Proof. Follows from the multivariate central limit theorem. □

We will now state two results and defer their proofs for a brief moment, until
we mention Theorem 9, which is a vital tool in the proofs of Theorems 7 and 8.
The next theorem establishes the asymptotic distribution of B̂_SRS.

Theorem 7 Let B̂_SRS = m̂₀/m̂₁. Then B̂_SRS is asymptotically distributed as
N(μ_SRS, σ²_SRS), where

  μ_SRS = μ₀/μ₁

and

  σ²_SRS = (1/(T μ₁²)) [ σ₀² − 2ρσ₀σ₁(μ₀/μ₁) + σ₁²(μ₀/μ₁)² ].

Finally, we suggest the following estimator for p₀:

  p̂_SRS = [ 1 + ((1 − π₀)/π₀) (1/B̂_SRS) ]⁻¹.
The asymptotic distribution of p̂_SRS is given by the following theorem:

Theorem 8 p̂_SRS is asymptotically distributed as N(μ_{p_SRS}, σ²_{p_SRS}), where

  μ_{p_SRS} = [ 1 + ((1 − π₀)/π₀) (1/μ_{B_SRS}) ]⁻¹

and

  σ²_{p_SRS} = [ ((1 − π₀)/π₀) / ( (1 − π₀)/π₀ + μ_{B_SRS} )² ]² σ²_{B_SRS}.
Our main tool for proving Theorems 7 and 8, which establish the asymptotic
distributions of the Monte Carlo estimates of the BF and p₀, will be the next theorem,
called the δ-method. A proof of Theorem 9 can be found in Rao (1973).

Theorem 9 Let S_T be a k-dimensional statistic (S_{1T}, …, S_{kT}) whose asymptotic
distribution is k-variate normal with mean (θ₁, …, θ_k) and covariance matrix Σ.
Let h₁, …, h_q be q functions of k variables, each of them totally differentiable.
Then the joint asymptotic distribution of T^{1/2} [ h_i(S_{1T}, …, S_{kT}) − h_i(θ₁, …, θ_k) ],
for all i = 1, …, q, is q-variate normal with mean zero and covariance matrix EΣEᵀ,
where the elements of the q × k matrix E are given by E_ij = ∂h_i/∂θ_j.
Proof (Theorem 7): Let h(m̂₀, m̂₁) = m̂₀/m̂₁, k = 2 and q = 1. Then, by
using Theorems 6 and 9, we state that

  T^{1/2} [ m̂₀/m̂₁ − μ₀/μ₁ ] →d N(0, EΣEᵀ),

where

  Σ = [ σ₀²    ρσ₀σ₁
        ρσ₀σ₁  σ₁²   ]

and

  E = [ ∂h/∂t₁  ∂h/∂t₂ ] evaluated at (t₁, t₂) = (μ₀, μ₁).

Performing the partial differentiation, we have

  E = [ 1/μ₁   −μ₀/μ₁² ].

Then

  μ_B = μ₀/μ₁

and

  T σ_B² = EΣEᵀ,

so we have

  σ_B² = (1/(T μ₁²)) [ σ₀² − 2ρσ₀σ₁(μ₀/μ₁) + σ₁²(μ₀/μ₁)² ],

which completes the proof. □
Proof (Theorem 8): The proof follows the central theme (and notation) of the
previous proof. We apply Theorems 6 and 9 again, this time with k = q = 1 and

  h(B) = [ 1 + ((1 − π₀)/π₀) (1/B) ]⁻¹.

Differentiating h, we get

  dh/dB = ((1 − π₀)/π₀) / ( (1 − π₀)/π₀ + B )².

This leads to

  μ_{p₀} = h(μ_B) = [ 1 + ((1 − π₀)/π₀) (1/μ_B) ]⁻¹

and

  σ²_{p₀} = [ ((1 − π₀)/π₀) / ( (1 − π₀)/π₀ + μ_B )² ]² σ_B²,

and the proof is complete. □
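Theorem 7 can be sanity-checked by simulation: draw correlated pairs with known moments, form the ratio of sample means over many independent replications, and compare the empirical spread of the ratio with the δ-method standard error. The sketch below is a hypothetical Python illustration (the moments chosen are arbitrary), not part of the dissertation's SAS/IML code:

```python
import math
import random

random.seed(5)

mu0, mu1 = 2.0, 4.0          # population means of the two averaged quantities
s0, s1, rho = 1.0, 1.5, 0.8  # standard deviations and correlation
T = 400                      # simulation size per replication
reps = 2000                  # number of independent replications

ratios = []
for _ in range(reps):
    sum0 = sum1 = 0.0
    for _ in range(T):
        z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
        x0 = mu0 + s0 * z1                                        # first coordinate
        x1 = mu1 + s1 * (rho * z1 + math.sqrt(1 - rho**2) * z2)   # correlated second
        sum0 += x0
        sum1 += x1
    ratios.append((sum0 / T) / (sum1 / T))   # B_hat = m0_hat / m1_hat

mean_r = sum(ratios) / reps
emp_sd = math.sqrt(sum((r - mean_r) ** 2 for r in ratios) / (reps - 1))

muB = mu0 / mu1
theory_sd = math.sqrt((s0**2 - 2*rho*s0*s1*muB + (s1*muB)**2) / (T * mu1**2))
print(emp_sd, theory_sd)  # the two standard errors agree closely
```

The high positive correlation between numerator and denominator (ρ = 0.8 here) visibly shrinks the standard error relative to the uncorrelated case, anticipating the behavior observed in Table 4.4.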
The method of simple random sampling is implemented using the SAS/IML
code provided in Appendix A. We ran this program for several different values
of T and report the results in Tables 4.4, 4.5 and 4.6. To estimate the unknown
parameters σ₀², σ₁² and ρ, we use the corresponding sample variances and the sample
correlation coefficient.
We see that the estimates are sufficiently stable, especially for simulation sizes of 10⁴
or larger. We have a very high positive correlation between the numerator and
the denominator. A glance at Theorem 7 assures us that this high positive correlation
plays an important role in considerably reducing the standard error of our
estimates.
Table 4.4, by itself, provides only point estimates for B. To go a step further,
we implement the results of the previous section and form 95% confidence
intervals for B and p_SRS. Those confidence intervals are based on the asymptotic
distributions established in Theorems 7 and 8, so their coverage probability is only
asymptotically correct. However, since our simulation sizes are large enough, we
accept them as satisfactory approximations. The confidence intervals are reported
in Tables 4.5 and 4.6. In calculating the values of p̂_SRS, we assume π₀ = 0.5, where π₀ is the
prior probability of H₀, which can be thought of as a priori indifference between the
two hypotheses.
Tables 4.5 and 4.6 give us a reasonable picture from which to draw conclusions. For the
Bayes Factor, when T = 10² or T = 10³, the standard error of the Monte Carlo
estimate is high and the corresponding confidence intervals are too wide to work
with. However, for T ≥ 10⁴, the standard errors are satisfactory and the confidence
intervals shrink to acceptable lengths. For this example, we estimate the BF to be
0.59. A similar situation occurs for p₀. For T ≤ 10³, we have high standard errors
and wide confidence intervals, but for T ≥ 10⁴, the results become acceptable. We
estimate p₀ to be 0.37 in this example. This is the "irreconcilability" of p-values
and posterior probabilities mentioned previously. For the same hypothesis, the
p-value of the F-test was 0.1416 (Table 4.3). We see that there is a huge difference
between those two measures.
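The step from the estimated Bayes Factor to the posterior probability is the estimator p̂ = [1 + ((1 − π₀)/π₀)(1/B̂)]⁻¹ introduced before Theorem 8; with π₀ = 0.5 it reproduces the figures just quoted. A hypothetical Python rendering:

```python
# Posterior probability of H0 from a Bayes Factor B = m0(y)/m1(y),
# with prior probability pi0 on H0 (pi0 = 0.5 means prior indifference).
def posterior_p0(B, pi0=0.5):
    return 1.0 / (1.0 + ((1.0 - pi0) / pi0) / B)

p0 = posterior_p0(0.59)
print(round(p0, 2))  # 0.37, against a frequentist p-value of 0.1416
```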
Monte Carlo integration using simple random sampling has been studied and
applied extensively in the literature. We have already mentioned that it outperforms
the quadrature methods in the context of multi-dimensional integration;
however, objections have been raised as to the efficiency of the process. The most
comprehensive treatment of these objections is given by McCulloch and Rossi
(1991). They carefully study situations where the posterior is concentrated
around a single mode (not so uncommon with the use of reference priors and a
moderate amount of data) and demonstrate that this will lead to the Monte Carlo
integration estimates being dominated by a few large values of the likelihood. This
leads to a high variance for the estimation process, resulting in considerable inefficiency.
In the next sections we will examine two alternative estimators that attempt
to increase the efficiency.
Table 4.4: Simple Random Sampling

  T     m̂₀          m̂₁          B̂           σ̂₀²         σ̂₁²         ρ̂
  10²   0.1695480   0.2974386   0.5700270   0.0341101   0.1182261   0.8499019
  10³   0.1692819   0.2757402   0.6139182   0.0340637   0.1087973   0.8071927
  10⁴   0.1681263   0.2842017   0.5915737   0.0337364   0.1117144   0.8078687
  10⁵   0.1679399   0.2844701   0.5902148   0.0334786   0.1118142   0.8225059
  10⁶   0.1675505   0.2835661   0.5908692   0.0335596   0.1114368   0.8172925
Table 4.5: SRS for B

  T     B̂_SRS    SE(B̂_SRS)   95% Confidence Interval   Length of Interval
  10²   0.5701   0.0346      (0.5009, 0.6393)          0.1372
  10³   0.6139   0.0137      (0.5865, 0.6413)          0.0548
  10⁴   0.5916   0.0041      (0.5834, 0.5998)          0.0164
  10⁵   0.5902   0.0012      (0.5878, 0.5926)          0.0048
  10⁶   0.5909   0.0004      (0.5901, 0.5917)          0.0016
Table 4.6: SRS for p₀

  T     p̂_SRS    SE(p̂_SRS)   95% Confidence Interval   Length of Interval
  10²   0.3631   0.0145      (0.3341, 0.3921)          0.0580
  10³   0.3804   0.0046      (0.3712, 0.3896)          0.0184
  10⁴   0.3717   0.0016      (0.3685, 0.3749)          0.0064
  10⁵   0.3711   0.0005      (0.3701, 0.3721)          0.0020
  10⁶   0.3714   0.0001      (0.3712, 0.3716)          0.0004
4.4 Latin Hypercube Sampling
In this section we will consider "Latin hypercube sampling" (LHS), which was
introduced by McKay, Conover and Beckman (1979) and extended substantially in
a subsequent study by Iman and Conover (1980). This method originated
as an extension of "stratified sampling," which is widely used in survey research.
Later studies on number-theoretic methods for generating random variates have
established Latin hypercube sampling as a special case of "quasi-Monte Carlo" methods.
For a treatment of quasi-Monte Carlo methods, see Niederreiter (1992).
The idea behind LHS is to divide the range of the random variate into equiprobable
strata and draw a simple random sample from each of them. This approach
ensures that the range space of the random variate is covered adequately,
and the decrease in variance can be understood by means of an analogy with the
intuition behind stratified sampling.
If we are working with a random vector, each component of the vector is treated
by the same method. Then the components are randomly matched. This random
matching ensures that the entire range of the vector is covered appropriately.
As noted by McKay, Conover and Beckman (1979), "One advantage of the
Latin hypercube sampling appears when the output is dominated by only a few
components of the input. This method ensures that each of those components is
represented in a fully stratified manner, no matter which components might turn
out to be important" [p. 239].
This advantage is especially important for statistical integration problems,
where the integrands are typically highly peaked around a maximum point, thus
leading to a situation where numerical approximations are dominated by a few
values of the variable of integration. So, we would expect LHS to improve over
SRS, considering the issues raised by McCulloch and Rossi (1991) mentioned at
the end of the previous section.
There are several ways of obtaining a Latin hypercube sample from the Dirichlet
distribution. We choose to generate a Latin hypercube sample over the unit
hypercube (in this case, having two variance components, the unit hypercube is
simply the unit square) and then convert it into a sample from the Dirichlet
family. In order to obtain a Latin hypercube sample of size T on the unit square,
we devise the following algorithm:
1. Divide the unit interval [0,1) into T equally spaced subintervals.

2. Randomly sample two points, η and ν, within each interval. Call them η_i and
ν_i when they are sampled from the i-th interval, that is, from [(i−1)/T, i/T).

3. At this point we have η₁, …, η_T and ν₁, …, ν_T. Let i = 1 and consider η₁.
Pick an integer j at random such that 1 ≤ j ≤ T and let x₍₁₎ = (η₁, ν_j) be
the first point in the Latin hypercube sample. Now withhold j from further
consideration, increase i by 1 and repeat this step until all the η_i are considered.
This gives us a Latin hypercube sample over the unit square that we
label x₍₁₎, …, x₍T₎.
Having obtained a Latin hypercube sample over the unit square, we convert it into
a sample from the Dirichlet family as follows. First let y = (y₁, y₂) be defined by
y₁,₍ᵢ₎ = 1 − √x₁,₍ᵢ₎ and y₂,₍ᵢ₎ = x₂,₍ᵢ₎. It follows that y₁,₍ᵢ₎ is Beta(1,2) and y₂,₍ᵢ₎ is
Beta(1,1). Now consider z₁,₍ᵢ₎ = y₁,₍ᵢ₎ and z₂,₍ᵢ₎ = y₂,₍ᵢ₎(1 − y₁,₍ᵢ₎). Using a result of
Aitchison (1963), we state that z₍ᵢ₎ = (z₁,₍ᵢ₎, z₂,₍ᵢ₎) has a Dirichlet distribution with
all parameters equal to 1. This approach is necessary since we are working
with a Latin hypercube sample, which does not satisfy the iid assumption.
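The three steps, together with the Beta transformations just described, can be sketched as follows (a hypothetical Python illustration; the dissertation's implementation is the SAS/IML code of Appendix B):

```python
import math
import random

random.seed(9)

def lhs_dirichlet(T):
    """Latin hypercube sample of size T from the Dirichlet(1, 1, 1),
    via stratified uniforms on the unit square."""
    # Steps 1-2: one uniform draw inside each of the T equiprobable strata.
    eta = [(i + random.random()) / T for i in range(T)]
    nu = [(i + random.random()) / T for i in range(T)]
    random.shuffle(nu)               # step 3: random matching of coordinates
    sample = []
    for x1, x2 in zip(eta, nu):
        y1 = 1.0 - math.sqrt(x1)     # Beta(1, 2) marginal
        y2 = x2                      # Beta(1, 1) marginal
        sample.append((y1, y2 * (1.0 - y1)))  # stick-breaking to the Dirichlet
    return sample

z = lhs_dirichlet(1000)
mean_z1 = sum(p[0] for p in z) / len(z)
print(mean_z1)  # near 1/3, the Dirichlet(1,1,1) first-component mean
```

Each coordinate of the underlying uniform sample hits every stratum exactly once, which is the property the variance reduction relies on.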
The algorithm as described above works only for models with two variance
components. To describe the corresponding algorithm with r components precisely
would require further definitions involving random permutations of integers,
although the generalization is conceptually straightforward. Loosely speaking,
one first has to obtain a Latin hypercube sample, x₁,₍ᵢ₎, …, x_r,₍ᵢ₎, where i = 1, …, T,
over the r-dimensional unit hypercube. Then, via a suitable transformation,
y₁,₍ᵢ₎, …, y_r,₍ᵢ₎ are obtained such that y_j,₍ᵢ₎ is Beta(1, r − j + 1). Finally,

  z_j,₍ᵢ₎ = y_j,₍ᵢ₎ (1 − Σₖ₌₁^{j−1} z_k,₍ᵢ₎),   i = 1, …, T,

constitute the desired Latin hypercube sample from the Dirichlet distribution.
In order to use Theorems 7 and 8, we need to find the standard errors of m̂₀ and
m̂₁. Since SRS provided us an iid sample, a good estimate was given by the
corresponding sample standard deviations. A Latin hypercube sample, however, lacks
this inherent randomness mechanism, so the only way we can obtain these estimates is
to use replications. We first simulate T₁ values and calculate the corresponding
estimate, then replicate this process T₂ times. For purposes of efficiency comparison,
T₁ × T₂ should be chosen equal to one of the values of T implemented for SRS. We had
Table 4.7: LHS for B

  T₁      T₂       B̂_LHS    SE(B̂_LHS)   95% Confidence Interval   Length of Interval
  50000   2        0.5912   0.0004      (0.5862, 0.5962)          0.0100
  20000   5        0.5909   0.0003      (0.5901, 0.5917)          0.0016
  10000   10       0.5913   0.0002      (0.5908, 0.5918)          0.0010
  2000    50       0.5906   0.0003      (0.5900, 0.5912)          0.0012
  1000    100      0.5907   0.0004      (0.5899, 0.5915)          0.0016
  400     250      0.5901   0.0006      (0.5889, 0.5913)          0.0024
  250     400      0.5919   0.0009      (0.5901, 0.5937)          0.0036
  100     1000     0.5915   0.0013      (0.5889, 0.5941)          0.0052
  1       100000   0.5902   0.0012      (0.5878, 0.5926)          0.0048
Table 4.8: LHS for p₀

  T₁     T₂       p̂_LHS    SE(p̂_LHS)   95% Confidence Interval   Length of Interval
  1000   100      0.3713   0.0001      (0.3711, 0.3715)          0.0004
  400    250      0.3707   0.0003      (0.3701, 0.3714)          0.0012
  250    400      0.3710   0.0004      (0.3702, 0.3718)          0.0016
  100    1000     0.3715   0.0006      (0.3703, 0.3727)          0.0024
  1      100000   0.3711   0.0005      (0.3701, 0.3721)          0.0020
observed in Section 4.3 that the Monte Carlo estimates stabilized for T ≥ 10⁴, so we
choose T = 10⁵ here. We then face the problem of choosing T₁ and T₂ such that
T = T₁ × T₂. There are no clear guidelines on how to do this in an optimal
manner, so we try different values of T₁ and T₂ such that T = T₁ × T₂ = 10⁵.
The results of Latin hypercube sampling with different values of T₁ and T₂ are
reported in Tables 4.7 and 4.8. The SAS/IML code that we used to obtain
these results is given in Appendix B.
Some caveats are in order before we interpret the results of Table 4.7. First of
all, we only report the results regarding the standard errors, since LHS does not
introduce any bias. All the point estimates for the BF and p₀ display a small random
fluctuation, and their exclusion should not injure the validity of our conclusions
regarding the efficiency of the computational strategies. Also, we should keep in mind
that when T₁ = 1, we subdivide the range of our random variable into one part
only; hence we are effectively performing an SRS. So the last line of Table 4.7 can
be used for comparing the efficiencies of SRS and LHS. Finally, there is another
issue that needs to be kept in mind regarding the last column, which represents the
length of the corresponding confidence interval. Recall that we are estimating
the ratio of two simulated statistics (that is, B̂ = m̂₀/m̂₁) and the standard error
of the ratio is given by Theorem 7, whose proof relies heavily on the assumption
that both m̂₀ and m̂₁ are normally distributed. For small T₂ in Table 4.7 this
assumption is violated, and the best we can say is that they have t-distributions
with T₂ − 1 degrees of freedom. In this situation the probability theory needed
to calculate the distribution of B̂ is beyond our reach. So the results of the last column for
small T₂ should be considered ad hoc, since we have used the critical values of
t_{T₂−1}.
Looking at Tables 4.7 and 4.8, one observes that the choice of T₁ and T₂ is
critical for reducing the standard error of the LHS estimator. In particular, it seems
that smaller values of T₂ improve the efficiency. However, the first three cases,
where T₂ is 2, 5 and 10, must be used with caution because of the remark made in
the previous paragraph. We feel that the normality assumptions required for Theorem
7 are satisfied for T₂ ≥ 50; specifically, we suggest using T₂ = 100. After making
this observation, we report the results of LHS for p₀ for fewer combinations of T₁
and T₂ (see Table 4.8), since we face a similar situation regarding normality in
Theorem 8.
By comparing the results for T₂ = 100 with the results of SRS (last line in Table
4.7), we see that LHS is far more efficient than SRS. The standard error is reduced
by a factor of 3 for B and a factor of 5 for p₀. Thus the resulting confidence intervals
are much narrower. Another interesting feature is that the standard error of LHS
for T = 10⁵ is about the same as that of SRS for T = 10⁶ (for B, also see Table
4.5), hence we can interpret the gain in efficiency as an order of magnitude in terms
of computing time. We also observe that one should choose T₂ to be as small as
long as convergence to normality is satisfied. Considering that implementation of
LHS does not require a significant increase in human effort (programming,
etc.), we advocate its use in calculating the Bayes Factor when the prior is our
choice of importance sampling function.
Latin hypercube sampling has, by and large, been ignored in the Bayesian literature,
so the degree of inefficiency suffered by SRS in the examples of McCulloch
and Rossi (1991) has not been investigated in the case of LHS. By the way it is
constructed, sampling randomly within equi-probable strata, LHS attempts to cover
the range of the prior more efficiently, but how this translates to the case of a
concentrated posterior is, as of yet, unknown. Still, our limited numerical experience
suggests it will not suffer as much as SRS, if at all.
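The stratification idea behind LHS can be sketched in a few lines. The following Python fragment is only an illustration of the mechanism, not the SAS IML code of our appendices; the integrand g and all sample sizes are choices made purely for this example. It draws one point per equi-probable stratum in each dimension, randomly pairing the strata across dimensions, and compares the spread of the resulting mean estimates with that of SRS:

```python
import numpy as np

def latin_hypercube(T, d, rng):
    # One draw per equi-probable stratum in each dimension, randomly paired
    u = rng.random((T, d))                                  # position within each stratum
    strata = np.column_stack([rng.permutation(T) for _ in range(d)])
    return (strata + u) / T                                 # points in [0, 1)^d

rng = np.random.default_rng(0)
g = lambda x: np.exp(-x.sum(axis=1))                        # toy integrand on the unit square

reps, T, d = 200, 100, 2
srs = [g(rng.random((T, d))).mean() for _ in range(reps)]   # simple random sampling
lhs = [g(latin_hypercube(T, d, rng)).mean() for _ in range(reps)]
print(np.std(srs), np.std(lhs))                             # LHS spread is noticeably smaller
```

Because the strata force the draws to cover the cube evenly, the variance reduction is largest when the integrand is dominated by additive, monotone components, a pattern consistent with the efficiency gains reported above.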
4.5 Gibbs Sampling
As we have noted before, the widespread use of Bayesian methods among practitioners
is impeded by the analytical difficulties experienced in the derivation
of posterior distributions. A recent solution emerges in the form of obtaining
a sample from the posterior distribution in a rather easy and computationally
inexpensive way. One can use this sample to calculate functionals of the posterior
distribution, such as quantiles, modes or marginal likelihoods, that will be of use
in inference. Currently, these methods are collectively known as Markov Chain
Monte Carlo (MCMC). The essence of these methods is to create a Markov chain
whose stationary distribution is the desired distribution to sample from, that is, the
posterior. Then, after convergence is reached, the sample path of the Markov
chain is a sample from the posterior. The remarkable part of this approach
is devising ways of constructing a Markov chain with a given stationary
distribution; it turns out that this is easy to do. Although the literature
has seen an explosion of MCMC studies recently, the essential ideas date back to
Metropolis, Rosenbluth, Rosenbluth, Teller and Teller (1953) and Hastings (1970).
Gibbs sampling, our method of choice for this study, was first used by Geman
and Geman (1984) in the context of image processing. Its use in statistical models
has been advocated by Gelfand and Smith (1990).
We will first explain the basic mechanism behind the Gibbs sampler. We want
to generate a sample from the distribution of τ²|y, where τ² can be either τ₁² or
τ₂². Consider, for i = 1, …, r, the set of conditional posterior distributions

P_i(τᵢ²) = P(τᵢ² | τⱼ², j ≠ i, y),   (4.14)
which are commonly called "full conditionals". It is not always the case that the
full conditionals determine the joint, but the conditions for them to do so are
rather mild and are given by Besag (1974). As indicated above, we will use P_i(·) to
refer to the conditional posterior distribution of τᵢ², suppressing the dependence on
the rest of the variance ratios and the data, mainly for notational simplicity.
Consider the iterative scheme in which we start from a set of values, say,
{τ²_{i,(0)}}_{i=1}^r, and then generate the next set of values by

τ²_{1,(1)} ~ P(τ₁² | τ²_{2,(0)}, …, τ²_{r,(0)}, y)
τ²_{2,(1)} ~ P(τ₂² | τ²_{1,(1)}, τ²_{3,(0)}, …, τ²_{r,(0)}, y)
⋮
τ²_{r,(1)} ~ P(τᵣ² | τ²_{1,(1)}, …, τ²_{r−1,(1)}, y)
and continue this updating scheme. The set of values {τ²_{i,(t)}}_{i=1}^r is called the sample
generated at the t-th iteration. This set constitutes, collectively, a sample from
the joint posterior distribution, and furthermore each individual value τ²_{i,(t)} is a
realization from the marginal posterior density of τᵢ². Hence, by continuing this
iterative scheme T times, one can get a realization of the posterior distribution.
This iterative scheme is known as Gibbs sampling. Note that it is not necessary
to consider all the univariate conditionals one by one; instead we can partition
the parameter vector into subvectors and work with the conditional distributions
of these subvectors. If we choose all the subvectors to be univariate, we get the
version of the Gibbs sampler described above.
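The updating scheme above can be illustrated on a toy target whose full conditionals are available in closed form. The sketch below uses a standard bivariate normal with correlation 0.8, a choice made purely for illustration (it is not the model of Section 4.2): each full conditional is a univariate normal, so one Gibbs sweep draws each coordinate from its conditional given the current value of the other.

```python
import numpy as np

def gibbs_bivariate_normal(rho, T, rng):
    # Target: standard bivariate normal with correlation rho.
    # Full conditionals are univariate: x | y ~ N(rho*y, 1 - rho^2), and symmetrically for y.
    x = y = 0.0                                 # arbitrary starting values
    s = np.sqrt(1.0 - rho**2)
    out = np.empty((T, 2))
    for t in range(T):
        x = rng.normal(rho * y, s)              # draw from P(x | y)
        y = rng.normal(rho * x, s)              # draw from P(y | x)
        out[t] = x, y
    return out

rng = np.random.default_rng(1)
draws = gibbs_bivariate_normal(rho=0.8, T=5000, rng=rng)
print(np.corrcoef(draws[500:].T)[0, 1])         # sample correlation, close to 0.8
```

Discarding the first 500 sweeps plays the role of the burn-in discussed below; the retained sweeps reproduce the target correlation.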
There are two major issues surrounding the implementation of and inference
from Gibbs sampling (and more generally MCMC methods). The first one is
assessing convergence and the second one is the dependence among the samples
from the posterior distribution.
The conditions under which a Markov chain has a unique stationary distribution
are well-known theoretically (it has to be aperiodic, irreducible and positive
recurrent; see Roberts (1996)), and the way we construct our chains in the Gibbs
sampler (and in other MCMC methods, for that matter) guarantees the existence
of a stationary distribution. However, in practice, convergence can be painfully
slow, and the major issue is determining when reasonable convergence has been
reached (the period before this point is called the "burn-in"). The samples obtained
up to the burn-in point are then discarded. Although there is considerable theoretical
work in the literature in the form of establishing bounds on convergence rates
(examples being Rosenthal (1993) and Polson (1995)), these bounds generally turn
out to be too loose to be of any practical use. It seems that the rather ad hoc
methods suggested by Gelman and Rubin (1992) and Raftery and Lewis (1992)
enjoy widespread use.
An issue related to convergence is the rate of mixing. Informally, mixing
is the rate at which the Markov chain moves throughout the support of the
stationary distribution. So, if a chain is slow mixing, it may stay in a certain
portion of the state-space for a long time and, unless the chain length is adjusted
accordingly, the inferences will be unduly affected. Previous work suggests that
reparametrization is the best remedy for slow mixing; some examples are given in
Gilks and Roberts (1996).
The second issue is concerned with the fact that the observed values, being the
sample path of a Markov chain, are not independent of each other. Assuming that
convergence is reached, the observed values will form a dependent sample from the
posterior distribution. This is an uncomfortable setting for many statisticians, but
is not necessarily bad in MCMC. In most problems, the typical estimate will be
obtained by averaging over the samples to get some empirical quantity. Although the
samples are dependent, the ergodic theorem assures us that these sample averages
(also called ergodic averages) converge to the true expectations (Gilks, Richardson
and Spiegelhalter, 1996). That the Markov chain is ergodic follows from the
aperiodicity and positive recurrence which have already been required for the
existence of a stationary distribution. So the most common approach to dependence
is to ignore it. Nevertheless, if for any reason one needs an independent (at least
approximately) sample, there are two ways of getting one. One is to run several
chains with independent starting points and use the endpoint of each chain in the
final sample. The second, called thinning, is to use every k-th value from the
chain; these values will be approximately independent because the autocorrelation
decays with the lag. Obviously, one has to forsake some computational efficiency
to obtain independent samples, and this is generally deemed not worthwhile.
There is also a heated debate as to whether running several chains has any other
merit. Supporters of the several-chain approach maintain that comparison of the
sample paths of the chains might reveal essential information for convergence
diagnostics. Those who subscribe to the single-chain school submit that one long
chain has a better chance of finding new modes of the posterior distribution, thus
providing faster convergence in the ergodic sense. To date, in practice, the
choice between single and several chains seems to be a matter of computational
availability and taste, rather than being based on rational justifications.
This lengthy discussion about the implementation of the Gibbs sampler is
necessitated by the lack of agreement and established guidelines in the literature. As
far as our implementation in this study is concerned, we choose to go with a single
chain. We feel the ergodic theorem is secure enough for our purposes, and the
computational and algorithmic simplicity of the single-chain approach is also
welcome. For assessing convergence, the techniques mentioned above do not come to
the rescue: one of them (Gelman and Rubin) is exclusively for multiple chains and
the other (Raftery and Lewis) is most suitable when the target posterior quantities
are quantiles. Hence we will rely on simple graphical methods to monitor convergence.
One major input to our implementation is the set of full conditionals as described
by (4.14). For the data and model described in Section 4.2, the full conditionals
can be derived from the key observation that the conditional posterior distribution
of both τ₁² and τ₂² is proportional to their joint posterior, that is,
P₁(τ₁²) = P(τ₁² | τ₂², y)
        ∝ P(τ₁², τ₂² | y)
        ∝ L₁(τ₁², τ₂²) P(τ₁², τ₂²)

and

P₂(τ₂²) = P(τ₂² | τ₁², y)
        ∝ P(τ₁², τ₂² | y)
        ∝ L₁(τ₁², τ₂²) P(τ₁², τ₂²)

so that

P₁(τ₁²) ∝ |A₁|^{-1/2} |XᵀB₁X|^{-1/2} [(y − Xβ̂₁)ᵀ B₁ (y − Xβ̂₁)]^{-(n-p)/2} (1 + τ₁² + τ₂²)^{-2}   (4.15)

and

P₂(τ₂²) ∝ |A₂|^{-1/2} |XᵀB₂X|^{-1/2} [(y − Xβ̂₂)ᵀ B₂ (y − Xβ̂₂)]^{-(n-p)/2} (1 + τ₁² + τ₂²)^{-2}.   (4.16)
What seems to be a mysterious situation can be better understood by recalling
the discussion about the use of the symbol ∝ in Section 3.2. For P₁(τ₁²), the random
variable in question is τ₁²; hence anything that does not involve τ₁² is considered a
constant, including P(τ₂²). A similar argument holds for P₂(τ₂²). So, in order to get
the full conditionals, we only need to write the expression for the joint posterior
and then pick out the terms involving τ₁² for P₁(τ₁²) and τ₂² for P₂(τ₂²). Because of
the implicit dependence of L₁, as given in (4.13), on τ₁² and τ₂², it is difficult (and
not very helpful) to write the full conditionals explicitly; instead we will work with
what is given in (4.15) and (4.16).
An immediate difficulty with the full conditionals is the absence of the integrating
constants in their expressions. Fortunately, there is a wide variety of methods
to generate samples from distributions with unknown integrating constants. Gilks
(1996) reviews some of them in the context of MCMC, and Bennett, Racine-Poon
and Wakefield (1996) provide a comparative study within the framework of a
nonlinear model.
Figure 4.1: Bayes Factor
Among the ones reviewed by Gilks (1996) are rejection, ratio-of-uniforms and
Metropolis-Hastings (which is itself an MCMC technique). They are generic in the
sense that they apply equally well to arbitrary distributions, regardless
of whether we are implementing an MCMC method. Another recent suggestion is
the "Griddy Gibbs Sampler" of Ritter and Tanner (1992), which is applicable only
when each subvector is univariate. We choose to work with the ratio-of-uniforms
method, which has also been suggested by Gelfand, Hills, Racine-Poon and Smith
(1990). An explanation of how this technique works is given in Appendix C.
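For readers who do not wish to consult Appendix C, the following sketch conveys the idea of the ratio-of-uniforms method on a generic unnormalized density. The Gamma(3, 1) kernel and the sample sizes here are choices made for illustration only; our actual SAS IML implementation targets the full conditionals (4.15) and (4.16).

```python
import numpy as np

def rou_sample(h, u_max, v_max, T, rng):
    # Ratio-of-uniforms: draw (u, v) uniformly on [0, u_max] x [0, v_max];
    # accept x = v/u whenever u <= sqrt(h(x)).  The accepted x's follow the
    # density proportional to h, even though h is unnormalized.
    out = []
    while len(out) < T:
        u = rng.uniform(0.0, u_max, size=1000)
        v = rng.uniform(0.0, v_max, size=1000)
        x = v / u
        out.extend(x[u <= np.sqrt(h(x))])
    return np.array(out[:T])

h = lambda x: x**2 * np.exp(-x)             # unnormalized Gamma(3, 1) density on x > 0
u_max = np.sqrt(h(2.0))                     # sup_x sqrt(h(x)), attained at x = 2
v_max = 4.0 * np.sqrt(h(4.0))               # sup_x x*sqrt(h(x)), attained at x = 4
rng = np.random.default_rng(3)
x = rou_sample(h, u_max, v_max, 5000, rng)
print(x.mean())                             # close to the Gamma(3, 1) mean of 3
```

Only evaluations of h are needed, never its integrating constant, which is exactly why the method suits the full conditionals above.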
Our implementation of the Gibbs sampler, using the SAS IML code given in
Appendix D, starts with an arbitrarily chosen value of τ₁² = 1, then generates
a sample from P₂ using ratio-of-uniforms. The next step is to generate a sample
from P₁, conditioning on the generated value of τ₂², again using ratio-of-uniforms.
We run the chain for T = 1000 values, and the corresponding estimator of the BF,
as given in (4.7), is plotted against the number of iterations in Figure 4.1. The
estimate fluctuates until the number of iterations is around 350, and using this
graphical judgment we set the burn-in to 400. So, the reported estimates of
marginal likelihoods and the BF in Table 4.9 are based on an effective simulation
size of 600.
The estimates reported in Table 4.9 are reasonably close to the ones we obtained
by SRS and LHS. However, as Newton and Raftery (1994) note, the harmonic
Table 4.9: Estimates from the Gibbs Sampler

  m̂₀       m̂₁       B        P
  0.1679    0.2849    0.5893   0.3708
estimator has infinite variance and does not satisfy a central limit theorem. So,
we cannot get standard errors for the harmonic mean estimator of the BF, and a
formal comparison of efficiency with SRS and LHS does not seem possible.
To get around the problem of infinite variance, Newton and Raftery (1994) suggest
some modifications.
Raftery (1996) reviews several methods of estimating marginal likelihoods,
mostly based on posterior simulation. Based on the findings of Rosenkranz (1992)
and Lewis (1994), he encourages the use of the harmonic mean estimator in spite
of the infinite variance problem.
In our numerical experience with the small data set, posterior-based estimators
have taken longer computer time to evaluate. However, they are worthwhile to
investigate, because the BF is usually not the sole aim of data analysis. In most
cases, analysts will support their findings by using posterior distributions. Considering
the widespread use of MCMC methods in exploring posterior distributions,
any Bayesian data analysis will already have set up a scheme to sample from
the posterior, and the harmonic mean estimator (or its modifications) can be easily
obtained as a by-product of this process.
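The harmonic mean computation itself is a one-liner once posterior draws are in hand. The toy sketch below is our own illustrative construction, not the dissertation's mixed model: it uses a conjugate normal model, chosen because its exact marginal likelihood is available in closed form, with direct posterior draws standing in for Gibbs output.

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(1.0, 1.0, size=20)              # data: y_i ~ N(theta, 1), theta unknown
n, T = len(y), 100_000

# Conjugate toy model: prior theta ~ N(0, 1) => posterior theta | y ~ N(n*ybar/(n+1), 1/(n+1)).
# We sample the posterior directly, standing in for the output of a Gibbs sampler.
post = rng.normal(n * y.mean() / (n + 1), np.sqrt(1.0 / (n + 1)), size=T)

# Log-likelihood of each posterior draw
loglik = -0.5 * n * np.log(2 * np.pi) - 0.5 * ((y[:, None] - post) ** 2).sum(axis=0)

# Harmonic mean estimator: m_hat = [ (1/T) * sum_t 1/L(theta_t) ]^(-1), on the log scale
log_m_hat = -(np.logaddexp.reduce(-loglik) - np.log(T))

# Exact log marginal likelihood for this conjugate model: y ~ N(0, I + 11')
exact = (-0.5 * n * np.log(2 * np.pi) - 0.5 * np.log(n + 1)
         - 0.5 * ((y ** 2).sum() - y.sum() ** 2 / (n + 1)))
print(log_m_hat, exact)                         # close, but no standard error is available
```

Consistent with the discussion above, the estimator is simulation-consistent here yet has infinite variance, so no standard error accompanies it.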
Estimating marginal likelihoods from posterior simulation is an active research
area. Recently, Chib (1995) suggested another method which requires known
integrating constants in the set of full conditionals, and hence is not applicable to our
formulation. DiCiccio, Kass, Raftery and Wasserman (1996) give a recent review
of posterior-based marginal likelihood estimators and suggest improvements.
Raftery (1996) reviews analytic approximations as well as the mentioned posterior
simulation methods; however, the performance of all of these methods remains to
be tested.
CHAPTER V
CONCLUSIONS AND FUTURE RESEARCH
In this study we have derived Bayes Factors to test for a null variance ratio
(or a null variance component) in the context of a normal mixed linear model.
We have taken a fully operational Bayesian approach, specifying non-informative
improper prior distributions for nuisance parameters and informative priors that
require little or no prior input from the analyst for the parameters of interest.
The Bayesian approach avoids the difficulties of frequentist and likelihood
solutions to the problem, such as the non-uniqueness of the most powerful invariant
tests or intractable asymptotic distributions of test statistics. Our particular
formulation also overcomes the difficulties of similar Bayesian approaches to the
problem by using a proper family of priors for the variance components. An
investigation by Westfall and Gonen (1996) of some previous studies, which used
improper priors for parameters of interest and employed a device of an "imaginary
training sample," has established asymptotic difficulties with that approach. We
also provide a reasonable choice of reference prior which does not need any prior
input from the analyst and can be used if prior information on variance components
is not available. We have illustrated our technique on both a standard model and a
hierarchical model and found that each has its own merits. The minimally
parameterized model was easier to work with from an analytic standpoint but presented
computational challenges, which were solved by the hierarchical approach. We have
also noted that missing observations or cells can be handled within this framework
just as they are handled in other approaches, i.e., by imputation or by remodeling
to solve issues of identifiability.
It turned out that the calculation of the Bayes Factor necessitates numerical
integration. We have devoted Chapter IV to investigating Monte Carlo approaches
to the problem, employing the prior and the posterior as importance sampling
functions. We have considered Latin hypercube sampling as an alternative to simple
random sampling when the prior is used as the importance sampling function. We
have also used Gibbs sampling to simulate the posterior distribution when the
posterior is used as the importance sampling function. Our numerical experience
favors Latin hypercube sampling over simple random sampling. Also, considering
the popularity and necessity of posterior simulation in contemporary Bayesian data
analysis, we encourage the use of the harmonic mean estimator as well.
Bayesian analysis of variance components is an active research area and there
are many directions which future research may take. Our approach may be
generalized to include non-normal models, that is, mixed linear models with a
non-normal response connected to the linear predictor through a link function. Such
models, known as generalized linear models, have been successfully employed to
explain the variation in 0-1 or count data, but to date no one has attempted to
find Bayes Factors for random effects in generalized mixed linear models.
Preliminary studies of Bayesian analysis concerning estimation and using improper
priors have been done by Zeger and Karim (1991) and Clayton (1996).
Another direction is to look into the possibilities of developing more efficient
numerical strategies. As mentioned in Section 4.5, estimating marginal likelihoods
via posterior simulation receives a great deal of attention in the current literature,
as do efficient reparametrizations to improve mixing. A study reviewing and
comparing the estimators suggested in the literature would be appropriate.
Another possibility is to investigate the performance of the suggested Bayes
Factors in model selection and compare them with other popular measures, such as
the ones mentioned in Section 1.2. Such studies give practitioners a chance to pick
their choice of statistical summary measures according to their particular needs
without going into lengthy numerical studies themselves.
We have already noted at the end of Section 3.5 that a better way to approach
the problem of missing observations in a Bayesian setting is to treat the missing
values as parameters, use strategies like data augmentation to simulate the
values of the missing observations from their posterior distributions, and then impute
those values to perform the analysis. This may further be integrated into a Gibbs
sampling environment that will automatically handle missing values and calculate
the Bayes Factor.

Finally, it should be noted that the development of Bayesian approaches, and
especially Bayes Factors, for popular statistical models has been substantially
delayed by computational concerns, which have led practicing statisticians
to regard Bayesian methods as elegant but inapplicable. The explosion in the
availability of inexpensive computing power in the last decade, along with clever
numerical methods, seems to overcome this problem. Virtually any statistician will
benefit from the addition of easily applicable Bayesian methods to his or her
arsenal.
REFERENCES
Afifi, A. & Elashoff, R. (1966). Missing observations in multivariate statistics I: Review of the literature, Journal of the American Statistical Association 61: 595-605.
Airy, G. (1861). On the Algebraical and Numerical Theory of Errors of Observations, MacMillan, London.
Aitchison, J. (1963). Inverse distributions and independent gamma-distributed products of random variables, Biometrika 50: 505-508.
Allen, F. & Wishart, J. (1930). A method of estimating the yields of a missing plot in field experimental work, Journal of Agricultural Science 30: 399-406.
Anderson, A., Basilevsky, A. & Hum, D. (1983). Missing data, in P. Rossi, J. Wright & A. Anderson (eds), Handbook of Survey Research, Academic Press, New York, pp. 415-494.
Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances, Philosophical Transactions of the Royal Society 53: 370-418.
Bennett, J., Racine-Poon, A. & Wakefield, J. (1996). MCMC for nonlinear hierarchical models, in W. Gilks, S. Richardson & D. Spiegelhalter (eds), Markov Chain Monte Carlo in Practice, Chapman and Hall, London, pp. 339-358.
Berger, J. (1985). Statistical Decision Theory and Bayesian Analysis, second ed, Springer-Verlag, New York.
Berger, J. & Deely, J. (1988). A Bayesian approach to ranking and selection of related means with alternatives to analysis-of-variance methodology, Journal of the American Statistical Association 83: 364-373.
Berger, J. & Delampady, M. (1987). Testing precise hypotheses, Statistical Science 2: 317-352.
Berger, J. & Sellke, T. (1987). Testing a point null hypothesis, Journal of the American Statistical Association 82: 112-139.
Bernardo, J. & Smith, A. (1994). Bayesian Theory, Wiley, New York.
Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems, Journal of the Royal Statistical Society, Series B 36: 192-236.
Box, G. & Tiao, G. (1973). Bayesian Inference in Statistical Analysis, Addison-Wesley, Reading, Massachusetts.
Bremer, R. (1994). Choosing and modeling your mixed linear model, Communications in Statistics: Theory and Methods 22: 3491-3522.
Broemeling, L. (1985). Bayesian Analysis of Linear Models, Marcel-Dekker, New York.
Carlin, B. & Chib, S. (1995). Bayesian model choice via Markov chain Monte Carlo methods, Journal of the Royal Statistical Society, Series B 57: 473-484.
Casella, G. & Berger, R. (1987). Reconciling Bayesian and frequentist evidence in the one-sided testing problem, Journal of the American Statistical Association 82: 106-111.
Casella, G. & Berger, R. (1990). Statistical Inference, Duxbury Press, Belmont, California.
Chaloner, K. (1987). A Bayesian approach to the estimation of variance components for the unbalanced one-way random model, Technometrics 29: 323-337.
Chauvenet, W. (1863). A Manual of Spherical and Practical Astronomy, Lippincott, Philadelphia.
Chib, S. (1995). Marginal likelihood from the Gibbs output, Journal of the American Statistical Association 90: 1313-1321.
Clayton, D. (1996). Generalized mixed linear models, in W. Gilks, S. Richardson & D. Spiegelhalter (eds), Markov Chain Monte Carlo in Practice, Chapman and Hall, London, pp. 275-302.
Datta, G. & Ghosh, M. (1995). Some remarks on noninformative priors, Journal of the American Statistical Association 90: 1357-1363.
de Finetti, B. (1964). Foresight: Its logical laws, its subjective sources, in H. Kyburg & H. Smokler (eds), Studies in Subjective Probability, Wiley, New York, pp. 93-158. Translated and reprinted; originally appeared in 1937.
de Finetti, B. (1972). Probability, Induction and Statistics, Wiley, New York.
de Finetti, B. (1974). Theory of Probability, Vol. 1, Wiley, New York.
de Finetti, B. (1975). Theory of Probability, Vol. 2, Wiley, New York.
DeGroot, M. (1970). Optimal Statistical Decisions, McGraw-Hill, New York.
DeGroot, M. (1973). Doing what comes naturally: Interpreting a tail-area as a posterior probability or as likelihood ratio, Journal of the American Statistical Association 68: 966-969.
Devroye, L. (1986). Non-Uniform Random Variate Generation, Springer-Verlag, New York.
DiCiccio, T., Kass, R., Raftery, A. & Wasserman, L. (1996). Computing Bayes factors by combining simulation and analytic approximations, Technical Report 630, Carnegie Mellon University, Department of Statistics.
Dickey, J. (1974). Bayesian alternatives to the F-test and least squares estimates in the normal linear model, in S. Fienberg & A. Zellner (eds), Studies in Bayesian Econometrics and Statistics, North Holland, Amsterdam, pp. 515-554.
Dodge, Y. (1985). Analysis of Experiments with Missing Data, Wiley, New York.
Edwards, W., Lindman, H. & Savage, L. (1963). Bayesian statistical inference for psychological research, Psychological Review 70: 193-242.
Efron, B. (1986). Why isn't everyone a Bayesian?, American Statistician 40: 1-11.
Eisenhart, C. (1947). The assumptions underlying the analysis of variance, Biometrics 3: 1-21.
Ferguson, T. (1967). Mathematical Statistics: A Decision-Theoretic Approach, Academic Press, New York.
Fisher, R. (1918). The correlation between relatives on the supposition of Mendelian inheritance, Transactions of the Edinburgh Royal Society 52: 399-433.
Gauss, K. (1809). Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium, Perthes and Besser, Hamburg.
Gelfand, A., Hills, S., Racine-Poon, A. & Smith, A. (1990). Illustration of Bayesian inference in normal data models using Gibbs sampling, Journal of the American Statistical Association 85: 972-985.
Gelfand, A. & Smith, A. (1990). Sampling-based approaches to calculating marginal densities, Journal of the American Statistical Association 85: 398-409.
Gelman, A., Carlin, J., Stern, H. & Rubin, D. (1995). Bayesian Data Analysis, Chapman and Hall, London.
Gelman, A. & Rubin, D. (1992). Inference from iterative simulation using multiple sequences, Statistical Science 7: 457-511.
Geman, S. & Geman, D. (1984). Stochastic relaxation, Gibbs distributions and Bayesian restoration of images, IEEE Transactions on Pattern Analysis and Machine Intelligence 6: 721-741.
George, E. & McCulloch, R. (1993). Variable selection via Gibbs sampling, Journal of the American Statistical Association 88: 881-889.
Gilks, W. (1996). Full conditional distributions, in W. Gilks, S. Richardson & D. Spiegelhalter (eds), Markov Chain Monte Carlo in Practice, Chapman and Hall, London, pp. 75-88.
Gilks, W., Richardson, S. & Spiegelhalter, D. (1996). Introducing Markov chain Monte Carlo, in W. Gilks, S. Richardson & D. Spiegelhalter (eds), Markov Chain Monte Carlo in Practice, Chapman and Hall, London, pp. 1-20.
Golub, G. & Van Loan, C. (1989). Matrix Computations, The Johns Hopkins University Press, Baltimore, Maryland.
Gonen, M. (1995). Bayes factors for the ANOVA hypothesis. Presented at the IMS/WNAR Joint Conference, Stanford, California.
Graybill, F. (1986). Theory and Application of the Linear Model, Duxbury, North Scituate, Massachusetts.
Hammersley, J. & Handscomb, D. (1964). Monte Carlo Methods, Methuen, London.
Hartley, H. & Hocking, R. (1971). The analysis of incomplete data, Biometrics 27: 783-823.
Hartley, H. & Rao, J. (1967). Maximum likelihood estimation for the mixed analysis of variance model, Biometrika 54: 93-108.
Hastings, W. (1970). Monte Carlo methods using Markov chains and their applications, Biometrika 57: 97-109.
Hill, B. (1965). Inference about variance components in the one-way model, Journal of the American Statistical Association 60: 806-825.
Hocking, R. (1985). The Analysis of Linear Models, Brooks/Cole, Monterey, California.
Hocking, R., Green, J. & Bremer, R. (1989). Variance component estimation with model-based diagnostics, Technometrics 31: 227-240.
Hodges, J. & Lehmann, E. (1954). Testing the approximate validity of statistical hypotheses, Journal of the Royal Statistical Society, Series B 16: 261-268.
Hogg, R. & Craig, A. (1978). Introduction to Mathematical Statistics, fourth ed, Macmillan, New York.
Iman, R. & Conover, W. (1980). Small sample sensitivity analysis techniques for computer models with an application to risk assessment, Communications in Statistics: Theory and Methods 17: 1749-1842.
Jeffreys, H. (1935). Some tests of significance, treated by the theory of probability, Proceedings of the Cambridge Philosophical Society 31: 203-222.
Jeffreys, H. (1939). Theory of Probability, Oxford University Press, London.
Kahaner, D. (1991). A survey of existing multidimensional quadrature routines, in N. Flournoy & R. Tsutakawa (eds), Statistical Multiple Integration, American Mathematical Society, Providence, Rhode Island, pp. 9-22.
Kass, R. & Raftery, A. (1995). Bayes factors, Journal of the American Statistical Association 90: 773-795.
Khuri, A. & Sahai, H. (1985). Variance components analysis: A selective literature survey, International Statistical Review 53: 279-300.
Law, A. & Kelton, D. (1991). Simulation Modeling and Analysis, second ed, McGraw-Hill, New York.
Legendre, A. (1806). Nouvelles Methodes pour la Determination des Orbites des Cometes; avec un Supplement Contenant Divers Perfectionnements de ces Methodes et leur Application aux deux Cometes de 1805, Courcier, Paris.
Lehmann, E. (1986). Testing Statistical Hypotheses, Wiley, New York.
Lempers, F. (1971). Posterior Probabilities of Alternative Linear Models, University Press, Rotterdam.
Lewis, S. (1994). Multilevel Modeling of Discrete Event History Data Using Markov Chain Monte Carlo Methods, PhD thesis, University of Washington, Department of Statistics.
Lindley, D. (1957). A statistical paradox, Biometrika 44: 187-192.
Lindley, D. & Smith, A. (1972). Bayes estimates for the linear model, Journal of the Royal Statistical Society, Series B 34: 1-41.
Little, R. (1992). Regression with missing X's: A review, Journal of the American Statistical Association 87: 1227-1237.
Little, R. & Rubin, D. (1987). Statistical Analysis with Missing Data, Wiley, New York.
Mallows, C. (1973). Some comments on Cp, Technometrics 15: 661-675.
McCulloch, R. & Rossi, P. (1991). A Bayesian approach to testing the arbitrage pricing theory, Journal of Econometrics 49: 141-168.
McKay, M., Conover, W. & Beckman, R. (1979). A comparison of three methods for selecting values of input variables in the analysis of output from a computer code, Technometrics 21: 239-245.
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A. & Teller, E. (1953). Equations of state calculations by fast computing machines, Journal of Chemical Physics 21: 1087-1091.
Miller, A. (1990). Subset Selection in Regression, Chapman and Hall, New York.
Mitchell, T. & Beauchamp, J. (1988). Bayesian variable selection in linear regression, Journal of the American Statistical Association 83: 1023-1032.
Newton, M. & Raftery, A. (1994). Approximate Bayesian inference by the weighted likelihood bootstrap, Journal of the Royal Statistical Society, Series B 56: 3-48.
Neyman, J. & Pearson, E. (1933). On the problem of most efficient tests of statistical hypotheses, Philosophical Transactions of the Royal Society, Series A 231: 289-337.
Neyman, J. & Pearson, E. (1967). Joint Statistical Papers of J. Neyman and E.S. Pearson, University of California Press, Berkeley.
Niederreiter, H. (1992). Random Number Generation and Quasi-Monte Carlo Methods, Society for Industrial and Applied Mathematics, Philadelphia, Pennsylvania.
O'Hagan, A. (1994). Kendall's Advanced Theory of Statistics, Volume 2B: Bayesian Inference, Edward Arnold, London.
O'Hagan, A. (1995). Fractional Bayes factors for model comparison, Journal of the Royal Statistical Society, Series B 57: 99-138.
Polson, N. (1995). Convergence of Markov chain Monte Carlo algorithms, in J. Bernardo, J. Berger, A. Dawid & A. Smith (eds), Bayesian Statistics 5, Oxford University Press, Oxford.
Press, J. (1989). Bayesian Statistics: Principles, Models and Applications, Wiley, New York.
Raftery, A. (1996). Hypothesis testing and model selection via posterior simulation, in W. Gilks, S. Richardson & D. Spiegelhalter (eds), Markov Chain Monte Carlo in Practice, Chapman and Hall, London, pp. 163-188.
Raftery, A. & Lewis, S. (1992). How many iterations in the Gibbs sampler?, in J. Bernardo, J. Berger, A. Dawid & A. Smith (eds), Bayesian Statistics 4, Oxford University Press, Oxford, pp. 765-776.
Rao, C. (1973). Linear Statistical Inference and Its Applications, second ed, Wiley, New York.
Ripley, B. (1987). Stochastic Simulation, Wiley, New York.
Ritter, C. & Tanner, M. (1992). The Gibbs stopper and the griddy Gibbs sampler, Journal of the American Statistical Association 87: 861-868.
Roberts, G. (1996). Markov chain concepts related to sampling algorithms, in W. Gilks, S. Richardson & D. Spiegelhalter (eds), Markov Chain Monte Carlo in Practice, Chapman and Hall, London, pp. 45-58.
Rosenkranz, S. (1992). The Bayes Factor for Model Evaluation in a Hierarchical Poisson Model for Area Counts, PhD thesis, University of Washington, Department of Biostatistics.
Rosenthal, J. (1993). Rates of convergence for data augmentation on finite sample spaces, Annals of Applied Probability 3: 819-839.
Rubin, D. (1987). Multiple Imputation for Nonresponse in Surveys, Wiley, New York.
Samaniego, F. & Reneau, D. (1994). Toward a reconciliation of the Bayesian and frequentist approaches to point estimation, Journal of the American Statistical Association 89: 947-957.
Savage, L. (1954). The Foundations of Statistics, Wiley, New York.
Savage, L. (1962). The Foundations of Statistical Inference, Methuen, London.
Scheffe, H. (1956). The Analysis of Variance, Wiley, New York.
Searle, S. (1971). Linear Models, Wiley, New York.
Searle, S. (1987). Linear Models for Unbalanced Data, Wiley, New York.
Searle, S., Casella, G. & McCulloch, C. (1992). Variance Components, Wiley, New York.
Self, S. & Liang, K. (1987). Asymptotic properties of maximum likelihood estimators and the likelihood ratio tests under nonstandard conditions, Journal of the American Statistical Association 82: 605-610.
Smith, A. (1973a). Bayes estimates in one-way and two-way models, Biometrika 60: 319-330.
Smith, A. (1973b). A general Bayesian linear model, Journal of the Royal Statistical Society, Series B 35: 67-75.
Smith, A. & Spiegelhalter, D. (1980). Bayes factors and choice criteria for linear models, Journal of the Royal Statistical Society, Series B 42: 213-220.
Spiegelhalter, D. & Smith, A. (1982). Bayes factors for linear and log-linear models with vague prior information, Journal of the Royal Statistical Society, Series B 44: 377-387.
73
Tanner, M. (1993). Tools for Statistical Inference: Methods for Exploring Posterior Distributions and Likelihood Functions, second ed., Springer-Verlag, New York.
Tanner, M. & Wong, W. (1987). The calculation of posterior distributions by data augmentation, Journal of the American Statistical Association 82: 528-550.
Tiao, G. & Tan, W. (1965). Bayesian analysis of random-effects models in the analysis of variance, Biometrika 52: 37-53.
Westfall, P. (1989). Power comparisons for invariant ratio tests in mixed ANOVA models, Annals of Statistics 17: 318-326.
Westfall, P. & Gonen, M. (1996). Asymptotic properties of ANOVA Bayes factors, Communications in Statistics: Theory and Methods 25(12). To appear.
Gilks, W. & Roberts, G. (1996). Strategies for improving MCMC, in W. Gilks, S. Richardson & D. Spiegelhalter (eds), Markov Chain Monte Carlo in Practice, Chapman and Hall, London, pp. 89-114.
Yates, F. (1933). The analysis of replicated experiments when the field results are incomplete, Empire Journal of Experimental Agriculture 1: 129-142.
Ye, K. (1994). Bayesian reference prior analysis on the ratio of variances for the balanced one-way random effect model, Journal of Statistical Planning and Inference 41: 267-280.
Zeger, S. & Karim, M. (1991). Generalized linear models with random effects: A Gibbs sampling approach, Journal of the American Statistical Association 86: 79-86.
Zellner, A. (1971). An Introduction to Bayesian Inference in Econometrics, Wiley, New York.
Zellner, A. (1986a). On assessing prior distributions and Bayesian regression analysis with g-prior distributions, in P. Goel & A. Zellner (eds), Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti, North-Holland, Amsterdam, pp. 233-243.
Zellner, A. (1986b). Posterior odds ratios for regression hypotheses: General considerations and specific results, in A. Zellner (ed.), Basic Issues in Econometrics, University of Chicago Press, Chicago, pp. 275-305.
Zellner, A. & Siow, A. (1980). Posterior odds ratios for selected regression hypotheses, in J. Bernardo, J. Berger, A. Dawid & A. Smith (eds), Bayesian Statistics, University Press, Valencia.
APPENDIX A
SAS IML CODE FOR SRS
/* Same as lhs.sas, but uses simple random sampling */
/* instead of latin hypercube sampling */

/* calculate estimates based on rep repetitions */
/* repeat m times to get the variance of the estimator */

proc iml;

rep=10;
m=50;
sumb=0;sumsqb=0;

do k=1 to m;

e1=j(1,rep,0);
e2=e1;
e3=e1;u3=e1;

do i=1 to rep;
e1[i]=ranexp(0);
e2[i]=ranexp(0);
e3[i]=ranexp(0);
u3[i]=uniform(0);
end;

e6=e1+e2+e3;
z1=e1/e6;
z2=e2/e6;

/* (z1,z2) is Dirichlet(1,1,1) */

t1=z1/(1-z1-z2);
t2=z2/(1-z1-z2);

/* t1 and t2 are the variance ratios for the denominator */

t3=u3/(1-u3);

/* t3 is the variance ratio for the numerator */

y={4.9,4.4,4.2,4,4.1,5,5.1,4.5,4.9,5.1};

n=10;
p=1;

x=j(n,1,1); /* design matrix for the fixed effects */

w1={1 0,1 0,1 0,1 0,1 0,0 1,0 1,0 1,0 1,0 1};
w2={1 0 0,0 1 0,0 1 0,0 0 1,0 0 1,1 0 0,1 0 0,0 1 0,0 0 1,0 0 1};
I10=I(10);

pay=j(1,rep,0);payda=pay;

numbar=j(1,5,0);denombar=numbar;/*b=numbar;*/

sumnum=0;sumdenom=0;

do j=1 to rep;
cov1=I10+t1[j]*w1*w1'+t2[j]*w2*w2';
cov0=I10+t3[j]*w1*w1';

inv1=inv(cov1);
inv0=inv(cov0);

betahat1=inv(x'*inv1*x)*x'*inv1*y;
betahat0=inv(x'*inv0*x)*x'*inv0*y;

de1=det(cov1)**(-0.5);
de2=det(x'*inv1*x)**(-0.5);
ycd=y-x*betahat1;
de3=(ycd'*inv1*ycd)**(-(n-p+2)/2);
payda[j]=de1*de2*de3;

n1=det(cov0)**(-0.5);
n2=det(x'*inv0*x)**(-0.5);
ycn=y-x*betahat0;
n3=(ycn'*inv0*ycn)**(-(n-p+2)/2);
pay[j]=n1*n2*n3;

sumnum=sumnum+pay[j];
sumdenom=sumdenom+payda[j];

end;

numbar=sumnum/rep;
denombar=sumdenom/rep;

sumb=sumb+(numbar/denombar);
sumsqb=sumsqb+((numbar/denombar)**2);

end;

blhs=sumb/m;
varblhs=(sumsqb-(m*(blhs**2)))/m;

print blhs varblhs;
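The first step of the listing, drawing (z1, z2) from a Dirichlet(1,1,1) distribution by normalizing three independent standard exponentials, is easy to sanity-check outside SAS. The following is a minimal Python sketch of just that step; the function name `dirichlet_111` and the sample size are illustrative, not part of the dissertation code.

```python
import random

def dirichlet_111(rng):
    """Draw (z1, z2) ~ Dirichlet(1,1,1) by normalizing three
    independent Exponential(1) variates, mirroring the IML loop."""
    e1, e2, e3 = (rng.expovariate(1.0) for _ in range(3))
    e6 = e1 + e2 + e3
    z1, z2 = e1 / e6, e2 / e6
    # variance ratios used in the denominator of the Bayes factor
    t1 = z1 / (1 - z1 - z2)
    t2 = z2 / (1 - z1 - z2)
    return z1, z2, t1, t2

rng = random.Random(0)
draws = [dirichlet_111(rng) for _ in range(20000)]
mean_z1 = sum(d[0] for d in draws) / len(draws)  # should be near 1/3
```

Each marginal of a Dirichlet(1,1,1) vector is Beta(1,2) with mean 1/3, which gives a quick check on the generator.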
APPENDIX B
SAS IML CODE FOR LHS
/* We generate a sample from multivariate independent uniform */
/* distribution on the unit hypercube and then transform */
/* them into Beta distributed variates and then transform */
/* into a vector with Dirichlet distribution */

proc iml;

rep=100;
m=500;
sumb=0;sumsqb=0;

do k=1 to m;

u1=j(1,rep,0);u2=u1;u3=u2;

do i=1 to rep;
u1[i]=uniform(0);
u2[i]=uniform(0);
u3[i]=uniform(0);
end;

v1=rank(u1);
v2=rank(u2);

x1=(v1/rep)-(1/(2*rep));
x2=(v2/rep)-(1/(2*rep));

/* x1 and x2 are a LHS sample from bivariate independent uniform(0,1) */

y1=1-(1-x1)##(0.5);
y2=x2;

/* y1 is Beta(1,2) and y2 is uniform(0,1) */

z1=y1;
z2=y2#(1-y1);

/* (z1,z2) is Dirichlet(1,1,1) */

t1=z1/(1-z1-z2);
t2=z2/(1-z1-z2);

/* t1 and t2 are the variance ratios for the denominator */

t3=u3/(1-u3);

/* t3 is the variance ratio for the numerator */

y={4.9,4.4,4.2,4,4.1,5,5.1,4.5,4.9,5.1};

n=10;
p=1;

x=j(n,1,1); /* design matrix for the fixed effects */

w1={1 0,1 0,1 0,1 0,1 0,0 1,0 1,0 1,0 1,0 1};
w2={1 0 0,0 1 0,0 1 0,0 0 1,0 0 1,1 0 0,1 0 0,0 1 0,0 0 1,0 0 1};
I10=I(10);

pay=j(1,rep,0);payda=pay;

numbar=j(1,5,0);denombar=numbar;b=numbar;

sumnum=0;sumdenom=0;

do j=1 to rep;
cov1=I10+t1[j]*w1*w1'+t2[j]*w2*w2';
cov0=I10+t3[j]*w1*w1';

inv1=inv(cov1);
inv0=inv(cov0);

betahat1=inv(x'*inv1*x)*x'*inv1*y;
betahat0=inv(x'*inv0*x)*x'*inv0*y;

de1=det(cov1)**(-0.5);
de2=det(x'*inv1*x)**(-0.5);
ycd=y-x*betahat1;
de3=(ycd'*inv1*ycd)**(-(n-p+2)/2);
payda[j]=de1*de2*de3;

n1=det(cov0)**(-0.5);
n2=det(x'*inv0*x)**(-0.5);
ycn=y-x*betahat0;
n3=(ycn'*inv0*ycn)**(-(n-p+2)/2);
pay[j]=n1*n2*n3;

sumnum=sumnum+pay[j];
sumdenom=sumdenom+payda[j];

end;

numbar=sumnum/rep;
denombar=sumdenom/rep;

sumb=sumb+(numbar/denombar);
sumsqb=sumsqb+((numbar/denombar)**2);

end;

blhs=sumb/m;
varblhs=(sumsqb-(m*(blhs**2)))/m;

print blhs varblhs;
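The transform chain in this listing (uniform ranks to stratified uniforms, inverse CDF to Beta(1,2), then stick-breaking to Dirichlet(1,1,1)) can be sketched in Python as follows. This is an illustration under stated assumptions: the listing uses the stratum midpoints (v/rep - 1/(2*rep)), while the sketch jitters uniformly within strata, a common Latin hypercube variant; the function name `lhs_dirichlet` is hypothetical.

```python
import random

def lhs_dirichlet(rep, rng):
    """Latin hypercube sample of size rep from Dirichlet(1,1,1) via the
    Beta(1,2) inverse CDF and stick-breaking, as in the listing above."""
    # One point per stratum of (0,1), randomly paired across coordinates.
    u1 = [(i + rng.random()) / rep for i in range(rep)]
    u2 = [(i + rng.random()) / rep for i in range(rep)]
    rng.shuffle(u1)
    rng.shuffle(u2)
    out = []
    for x1, x2 in zip(u1, u2):
        y1 = 1 - (1 - x1) ** 0.5  # inverse CDF of Beta(1,2): F(y) = 1-(1-y)^2
        z1 = y1
        z2 = x2 * (1 - y1)        # stick-breaking gives (z1,z2) ~ Dirichlet(1,1,1)
        out.append((z1, z2))
    return out

rng = random.Random(1)
sample = lhs_dirichlet(2000, rng)
mean_z1 = sum(z1 for z1, _ in sample) / len(sample)
mean_z2 = sum(z2 for _, z2 in sample) / len(sample)
```

Both marginal means should be close to 1/3, and the stratification makes the z1 average noticeably less variable than under simple random sampling.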
APPENDIX C
RATIO-OF-UNIFORMS
In this appendix, we explain the mechanics of the ratio-of-uniforms method,
which we use to generate from the full conditionals in Section 4.5. Our approach
follows that of Gilks (1996), with a slight change in notation.
Let C_g be the subset of R^2 defined by

    C_g = {(u,v) : 0 < u < sqrt(g(v/u))}                                  (C.1)

and let (u,v) be a bivariate random variable that is uniformly distributed on C_g,
that is, f(u,v) = k if (u,v) is in C_g and 0 otherwise. Let θ = v/u. Then the joint
density of θ and u is f(θ,u) = k*u for 0 < u < sqrt(g(θ)) and 0 otherwise, so the
marginal density of θ is

    f(θ) = ∫ from 0 to sqrt(g(θ)) of f(θ,u) du = (k/2) g(θ),              (C.2)

which implies that f(θ) is proportional to g(θ), with normalizing constant
k = 2 / ∫ g(θ) dθ.
We can therefore generate from a non-normalized density g(.) without evaluating its
normalizing constant, as long as we can identify the region
C_g and generate bivariate uniform variables over it.
One way to generate uniformly on C_g is to use rejection. If we can find a
rectangle R = [0,a] x [b,c] that contains C_g, we generate a candidate
uniformly from R and accept it if it lies in C_g. The minimal such rectangle is

    a = sup over θ of sqrt(g(θ))
    b = inf over θ of θ sqrt(g(θ))
    c = sup over θ of θ sqrt(g(θ))

but any rectangle containing it will do. The
probability of acceptance is 1/(k a (c - b)). The infima and
suprema are usually difficult to derive analytically. One approach is to accept a
decrease in efficiency and use somewhat looser bounds that can be calculated
easily; another is to compute the bounds numerically. We took the second
approach in our implementation. This may increase the required computing
time, but it yields a generic algorithm; otherwise the bounds would have to be
rederived every time the model changes. Hence the implementation of the ratio-
of-uniforms method requires three maximizations per generated value (counting
the minimization as a maximization for purposes of computer time), followed by a
rejection step. In our experience the rejection step is quite efficient:
on average, about 10 out of 13 candidates are accepted.
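The procedure above can be sketched generically in Python. Here g is an arbitrary non-normalized density, with the half-normal kernel exp(-θ^2/2) on θ > 0 standing in for the full conditionals of Section 4.5, and the bounds a, b, c are found by a crude grid search rather than analytically, in the spirit of our numerical implementation; all names are illustrative.

```python
import math
import random

def rou_sample(g, theta_grid, n, rng):
    """Ratio-of-uniforms sampling from a non-normalized density g.
    Bounds a = sup sqrt(g), b = inf theta*sqrt(g), c = sup theta*sqrt(g)
    are estimated over theta_grid and inflated slightly so the bounding
    rectangle still covers C_g; a looser rectangle costs efficiency only."""
    s = [math.sqrt(g(t)) for t in theta_grid]
    ts = [t * si for t, si in zip(theta_grid, s)]
    a = 1.01 * max(s)
    b = min(0.0, 1.01 * min(ts))
    c = 1.01 * max(ts)
    out = []
    while len(out) < n:
        u = a * rng.random()            # u ~ Uniform(0, a)
        if u == 0.0:
            continue
        v = b + (c - b) * rng.random()  # v ~ Uniform(b, c)
        theta = v / u
        if u < math.sqrt(g(theta)):     # accept iff (u, v) lies in C_g
            out.append(theta)
    return out

g = lambda t: math.exp(-t * t / 2) if t > 0 else 0.0  # half-normal kernel
grid = [i * 0.01 for i in range(1, 1001)]             # theta in (0, 10]
rng = random.Random(2)
draws = rou_sample(g, grid, 20000, rng)
mean = sum(draws) / len(draws)  # half-normal mean is sqrt(2/pi), about 0.80
```

For this example the acceptance rate works out to roughly 70 percent, in the same range as the 10-out-of-13 figure quoted above for the full conditionals.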
APPENDIX D
SAS IML CODE FOR THE GIBBS SAMPLER
options ps=32 ls=76 nodate;

proc iml;

file "harden-a.out";

/* epsilon */
eps=10e-6;

/* Setup regarding the observations and model matrices */

y={4.9,4.4,4.2,4,4.1,5,5.1,4.5,4.9,5.1};

n=10;
p=1;
q1=2;
q2=3;

one=j(n,1,1); /* design matrix for the fixed effects */

Z1={1 0,1 0,1 0,1 0,1 0,0 1,0 1,0 1,0 1,0 1};
Z2={1 0 0,0 1 0,0 1 0,0 0 1,0 0 1,1 0 0,1 0 0,0 1 0,0 0 1,0 0 1};

/* Modules to generate from the variance ratios using the
   ratio-of-uniforms method */

start den1(x2) global(q1,q2,n,Z1,Z2,one,y,t1);
A2=I(q2)+x2*Z2'*Z2;
B2=I(n)-x2*Z2*(inv(A2))*Z2';
A1=I(q1)+t1*Z1'*B2*Z1;
B1=B2-t1*B2'*Z1*(inv(A1))*Z1'*B2;
g1=(one'*B1*one)**(-0.5);
g2=(det(A1))**(-0.5);
g3=(det(A2))**(-0.5);
muhat2=(g1**2)*one'*B1*y;
g4=((y-muhat2*one)'*B1*(y-muhat2*one))**(-(n+1)/2);
v1=g1*g2*g3*g4*((1+t1+x2)**(-3));
v2=sqrt(v1);
return(v2);
finish den1;

start den2(x2) global(q1,q2,n,Z1,Z2,one,y,t1);
A2=I(q2)+x2*Z2'*Z2;
B2=I(n)-x2*Z2*(inv(A2))*Z2';
A1=I(q1)+t1*Z1'*B2*Z1;
B1=B2-t1*B2'*Z1*(inv(A1))*Z1'*B2;
g1=(one'*B1*one)**(-0.5);
g2=(det(A1))**(-0.5);
g3=(det(A2))**(-0.5);
muhat2=(g1**2)*one'*B1*y;
g4=((y-muhat2*one)'*B1*(y-muhat2*one))**(-(n+1)/2);
v1=g1*g2*g3*g4*((1+t1+x2)**(-3));
v2=x2*sqrt(v1);
return(v2);
finish den2;

start den3(x1) global(q1,q2,n,Z1,Z2,one,y,t2);
A2=I(q2)+t2*Z2'*Z2;
B2=I(n)-t2*Z2*(inv(A2))*Z2';
A1=I(q1)+x1*Z1'*B2*Z1;
B1=B2-x1*B2'*Z1*(inv(A1))*Z1'*B2;
g1=(one'*B1*one)**(-0.5);
g2=(det(A1))**(-0.5);
g3=(det(A2))**(-0.5);
muhat2=(g1**2)*one'*B1*y;
g4=((y-muhat2*one)'*B1*(y-muhat2*one))**(-(n+1)/2);
v1=g1*g2*g3*g4*((1+x1+t2)**(-3));
v2=sqrt(v1);
return(v2);
finish den3;

start den4(x1) global(q1,q2,n,Z1,Z2,one,y,t2);
A2=I(q2)+t2*Z2'*Z2;
B2=I(n)-t2*Z2*(inv(A2))*Z2';
A1=I(q1)+x1*Z1'*B2*Z1;
B1=B2-x1*B2'*Z1*(inv(A1))*Z1'*B2;
g1=(one'*B1*one)**(-0.5);
g2=(det(A1))**(-0.5);
g3=(det(A2))**(-0.5);
muhat2=(g1**2)*one'*B1*y;
g4=((y-muhat2*one)'*B1*(y-muhat2*one))**(-(n+1)/2);
v1=g1*g2*g3*g4*((1+x1+t2)**(-3));
v2=x1*sqrt(v1);
return(v2);
finish den4;

start likden(x1,x2) global(q1,q2,n,Z1,Z2,one,y);
A2=I(q2)+x2*Z2'*Z2;
B2=I(n)-x2*Z2*(inv(A2))*Z2';
A1=I(q1)+x1*Z1'*B2*Z1;
B1=B2-x1*B2'*Z1*(inv(A1))*Z1'*B2;
g1=(one'*B1*one)**(-0.5);
g2=(det(A1))**(-0.5);
g3=(det(A2))**(-0.5);
muhat2=(g1**2)*one'*B1*y;
g4=((y-muhat2*one)'*B1*(y-muhat2*one))**(-(n+1)/2);
v1=g1*g2*g3*g4;
return(v1);
finish likden;

start modgen2;

inf1=0;
inf2=0;

opt=1;
bc={0,.};
x0=1;

call nlpqn(r1,arg1,"den1",x0,bc,opt);
call nlpqn(r2,arg2,"den2",x0,bc,opt);
sup1=den1(arg1);
sup2=den2(arg2);

accept=0;

do until(accept=1);
v1=sup1*uniform(0);
v2=inf2+(sup2-inf2)*uniform(0);
v=v2/v1;

upper=den1(v);

if v1<upper then accept=1; /* accept iff u < sqrt(g(v/u)) */

end;

t2=v;
finish modgen2;

start modgen1;

inf1=0;
inf2=0;

opt=1;
bc={0,.};
x0=1;

call nlpqn(r1,arg1,"den3",x0,bc,opt);
call nlpqn(r2,arg2,"den4",x0,bc,opt);
sup1=den3(arg1);
sup2=den4(arg2);

accept=0;

do until(accept=1);
v1=sup1*uniform(0);
v2=inf2+(sup2-inf2)*uniform(0);
v=v2/v1;

upper=den3(v);

if v1<upper then accept=1; /* accept iff u < sqrt(g(v/u)) */

end;

t1=v;
finish modgen1;

niter=1000;
burnin=100;
sum=0;
t1=1;
do j=1 to niter;
call modgen2;
call modgen1;
if j>burnin then do;
sum=sum+(1/likden(t1,t2));
mltemp=(j-burnin)/sum;
put t1 +1 t2 +1 mltemp;
end;
end;

ml=(niter-burnin)/sum;
print ml;
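The accumulation at the bottom of this listing is the harmonic-mean form of the marginal-likelihood estimator: after burn-in, the reciprocal likelihood 1/likden(t1,t2) is summed over the Gibbs draws and ml = (niter-burnin)/sum. Isolated from the sampler, the estimator reduces to the following sketch; `harmonic_mean_ml` is a hypothetical name, and the likelihood values stand in for evaluations of likden at posterior draws.

```python
def harmonic_mean_ml(likelihoods):
    """Harmonic-mean estimator of the marginal likelihood from likelihood
    values evaluated at (post burn-in) posterior draws."""
    if not likelihoods:
        raise ValueError("need at least one post-burn-in draw")
    return len(likelihoods) / sum(1.0 / lik for lik in likelihoods)

# A constant likelihood is recovered exactly:
print(harmonic_mean_ml([0.5, 0.5, 0.5]))  # -> 0.5
```

The same running form appears in the listing as mltemp, which lets one monitor the estimate as the chain progresses.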