BAYES FACTORS FOR VARIANCE COMPONENTS
IN THE MIXED LINEAR MODEL
by
MITHAT GONEN, B.S., M.S., M.S.
A DISSERTATION
IN
BUSINESS ADMINISTRATION
Submitted to the Graduate Faculty of Texas Tech University in
Partial Fulfillment of the Requirements for
the Degree of
DOCTOR OF PHILOSOPHY
Approved
Accepted
December, 1996
ACKNOWLEDGMENTS
There are many people who have made the completion of this dissertation
easier and more pleasant. Among them are Banu Altunbaş and Demet Nalbant
who showed great companionship and hospitality during my visits to Lubbock,
and Patsy and the staff of the Graduate School who handled the intricacies of my
dual-degree program.
My committee members have also been very helpful. Paul Randolph is the one
who started it all by recruiting and encouraging me to attend graduate school.
Ronald Bremer and William Conover made suggestions that led to considerable
improvements in the dissertation and Benjamin Duran provided his usual friendship
and understanding.
However, none has influenced my education more than Peter Westfall, my
advisor. He has been a teacher and a role model for me, an example of
how hard work, insightfulness and poise can lead to professional success. I am sure
that without him I would have emerged from my years of study as a less capable
teacher and researcher.
My graduate study was funded by a scholarship from Middle East Technical
University, Turkey. I take this as generous support from the people of Turkey, to
whom I am indebted.
Finally, my heart is with Elza, who makes everything meaningful and worthwhile.
CONTENTS
ACKNOWLEDGMENTS
ABSTRACT
LIST OF TABLES
I. INTRODUCTION
   1.1 Overview
   1.2 Purpose and Importance of the Research
II. LITERATURE REVIEW
   2.1 Linear Models
   2.2 Bayesian Analysis
   2.3 Bayesian Analysis of Linear Models
III. THEORETICAL DEVELOPMENT
   3.1 Preliminaries
   3.2 Prior Selection
   3.3 Deriving the Bayes Factor
   3.4 Hierarchical Approach
   3.5 Missing Observations
IV. NUMERICAL METHODS
   4.1 Monte Carlo Estimation of the Bayes Factor
   4.2 Data and Model
   4.3 Simple Random Sampling
   4.4 Latin Hypercube Sampling
   4.5 Gibbs Sampling
V. CONCLUSIONS AND FUTURE RESEARCH
REFERENCES
APPENDIX A: SAS IML CODE FOR SRS
APPENDIX B: SAS IML CODE FOR LHS
APPENDIX C: RATIO-OF-UNIFORMS
APPENDIX D: SAS IML CODE FOR THE GIBBS SAMPLER
ABSTRACT
The Bayes Factor is a widely-used summary measure that can be used to test
hypotheses in a Bayesian setting. It also performs well in problems of model
selection. In this study, Bayes Factors for variance components in the mixed linear
model are derived. The formulation used avoids the assumption of a priori
independence between the variance components by using a Dirichlet prior on the intraclass
correlations. A reference prior, which results in a Bayes Factor that is flexible and
easy to use, is suggested. Hypothesis tests using the Bayes Factor avoid difficulties
of the classical tests, such as non-uniqueness and invalid asymptotics.
The priors on the nuisance parameters are chosen to be non-informative and the
corresponding integrals are carried out analytically. For the parameters of interest,
however, numerical methods have to be used. For this purpose, Monte Carlo
methods have been investigated. Simple random sampling and Latin hypercube
sampling are employed for simulating the prior and a Gibbs sampling scheme
has been implemented for simulating the posterior. The resulting estimators are
compared on a small data set.
LIST OF TABLES
4.1 Data
4.2 ANOVA Table for the Model
4.3 ANOVA Table for Main Effects
4.4 Simple Random Sampling
4.5 SRS for B
4.6 SRS for p0
4.7 LHS for B
4.8 LHS for p0
4.9 Estimates from the Gibbs Sampler
CHAPTER I
INTRODUCTION
1.1 Overview
This study is concerned with a Bayesian approach to testing hypotheses about
variance components in the mixed linear model. The following sections will define
precisely what we mean by the mixed linear model, but for the purposes of this
introduction we assume a rough understanding of the term.
Linear models have been around since the early nineteenth century. Legendre
(1806) and Gauss (1809) seem to have been the first to consider comparison of
means and estimation of fixed effects. Some fifty years later, we see the first studies
involving variances and random effects in Airy (1861) and Chauvenet (1863). Of
course, those authors did not use any of the terms linear model, fixed effect or
random effect. All of those studies were motivated by observations arising from
astronomical studies.
Fisher (1918) is credited as the first person to study problems involving linear
models from a statistical perspective. His famous work "Statistical Methods for
Research Workers" helped the methodology spread to practitioners quickly,
especially in the areas of biology and agriculture. He also originated the somewhat
misleading term "analysis of variance," whose shorthand ANOVA may be the
single most well-known acronym of applied statistics. But it was Eisenhart (1947)
who introduced the terminology that we use here, involving the terms "fixed,"
"random," "mixed," "Model I" and "Model II." Searle, Casella and McCulloch
(1992) have an introductory chapter covering the historical developments in this
area. Also, Khuri and Sahai (1985) have a very good literature survey that touches
upon history as well.
Bayesian analysis seems to be even older than linear models. Appropriately
named after Bayes (1763), it has been a strong school of thought in the
development of statistics. It has found itself at the center of severe disagreements with
the "classical" or "frequentist" school. Most, if not all, classical statistical
methods rely on the work of Neyman and Pearson (1933, 1967), which has led to the
now well-known theory of hypothesis testing, named after them. This approach
is called "classical" for obvious reasons, or "frequentist," reflecting their definition
of probability as a limiting relative frequency. Lehmann (1986) is an excellent
source for these results. This development coincides with the period in which the
use of statistical models was becoming more and more popular among researchers.
Hence, most of the earlier work on linear models was frequentist in flavor,
including some of the now-famous methods (such as the F-test and ANOVA estimation).
Scheffe (1956), Graybill (1986) and Searle (1971) are the classical references that
contain accounts of those developments. Hocking (1985) has a general approach
based on a different parameterization of the problem.
The same statistical problems were being attacked by Bayesians as well.
Jeffreys (1939) provided us with several Bayesian methodologies. Savage (1954) is
another pioneer in this area. As we will see, Bayesian answers to most problems
involve integrals that are not analytically tractable. This was a detriment for the
practitioners of that time, who lacked the computing power of today, and explains
why it was not until the 1960s that Bayesian methods emerged as an alternative
methodology to analyze linear models. The earlier Bayesian results (including
linear models) are surveyed extensively and carefully in Box and Tiao (1973).
Another good account for that period is Zellner (1971). Broemeling (1985) reflects
later developments.
A preliminary treatment of Bayesian statistics in most books emphasizes the
models about which Bayesian results and classical results agree. This gives the
misleading impression that Bayesian analysis is another route to arrive at the same
destination. In fact, in many cases Bayesian and classical results do not agree (see,
for example, the rejoinder of Berger and Sellke, 1987). We will demonstrate this
in the case of mixed models as well; however, we do not intend to debate the
fundamental issues of probability and statistical inference, especially with regard
to the philosophical differences between the two schools of statistics.
The organization of this dissertation is as follows: In the next section we will
discuss the purpose and importance of the research. Chapter II provides a
literature review on Linear Models and Bayesian Analysis. Chapter III is the main
body of the research and derives the Bayes Factors for the variance components
in the mixed linear model. Chapter IV is devoted to developing and investigating
numerical methods to evaluate the Bayes Factors. We present our conclusions and
directions for future research in Chapter V.
1.2 Purpose and Importance of the Research
In this study we will attempt to provide a unified methodology for testing
hypotheses about variance components in the mixed model. Our emphasis will be
on unbalanced data, mainly because frequentist methods fail to provide a general
answer in this case. Our main tool of analysis will be the Bayes Factor introduced
by Jeffreys (1935). Specifically, we will derive the Bayes Factors for the variance
components in the mixed linear model. Our approach will be a multivariate
generalization of Westfall and Gonen (1996), who have developed Bayes Factors for the
one-way random model. This generalization, however, is not straightforward. Not
only is the algebra more complicated, but selection of a family of suitable prior
distributions requires further work.
Bayes Factors are commonly used for model selection and hypothesis testing.
Previous studies have shown that they may overcome the difficulties that their
frequentist counterparts suffer from. In model selection, it is widely accepted that
the F-test for individual components is too unwilling to discard the individual
effects, thus suggesting unnecessarily complicated models. In the context of
regression models, the suggested frequentist remedies are generally either sequences
of hypothesis tests (like stepwise regression) or some type of mean square error
criterion (like Mallows' Cp; see Mallows, 1973). Two other measures that have been
successfully used in a number of different statistical models are Akaike's
Information Criterion (AIC) and a modification known as Schwarz's Bayesian Information
Criterion (BIC). A good review of model selection methods and their applications
to regression analysis is Miller (1990). The Bayes Factor has close relationships
with AIC and BIC, as mentioned by Kass and Raftery (1995), and as such is a
promising tool for model selection. Some recent works in Bayesian model selection
are Mitchell and Beauchamp (1988), George and McCulloch (1993) and Carlin and
Chib (1995). It has been argued by Smith and Spiegelhalter (1980) that the Bayes
Factor acts as a fully automatic Occam's razor, cutting back to the simpler model
at once when the additional parameters are not needed, hence resulting in simpler
models with good predictive power.
In this study our main focus will be hypothesis testing. Of course, model
selection can be viewed as an application of hypothesis testing, and the Bayes
factor we will derive should work for that purpose as well. In the context of
hypothesis testing, the Bayes Factor not only emerges as a pragmatic solution to
many difficult problems, but also challenges the results of existing methods. We
have already mentioned that there are several instances in which classical and
Bayesian results do not agree. This has been demonstrated by Edwards, Lindman
and Savage (1963), and their work has been considerably extended by Berger and
Sellke (1987). This theme will come up over and over in this study as we discuss
the so-called irreconcilability of p-values and posterior probabilities in Chapter II
and demonstrate it on a small data set in Chapter IV.
CHAPTER II
LITERATURE REVIEW
2.1 Linear Models
Throughout this study, we will use uppercase letters to denote matrices,
lowercase letters to denote scalars in mathematical formulas and lowercase boldface
letters to denote vectors, unless otherwise stated. Also, I will denote an identity
matrix, 0 will denote a vector whose elements are all 0 and 1 will denote a vector
whose elements are all 1. We will assume that I, 0 and 1 have the appropriate
dimensions.
There are several (equivalent) formulations of the mixed model. We start with
the one introduced by Hartley and Rao (1967):

y = Xβ + Zu + ε,  (2.1)

where y is an n × 1 vector of observable quantities, β is a p × 1 vector of fixed
effects and X is a fixed n × p matrix corresponding to the occurrence in the data
of the elements of β. It may contain the observed values of the covariates, when
covariates are part of the model; otherwise it is an incidence (or design) matrix.
The elements of an incidence matrix are either 0 or 1, denoting the presence (or
absence) of the corresponding parameter in that particular combination of levels.
Also, u is a k × 1 vector of random, unobservable effects and Z is a fixed n × k
incidence matrix. Finally, ε is an n × 1 vector of error terms. Which effects should
be fixed and which should be random is a critical issue and changes the whole flow
(and possibly the conclusions) of the analysis. Bremer (1994) is a good source on
this important topic.
We will focus on the random effects, so it is useful to partition u as

u = [u₁ᵀ, …, u_rᵀ]ᵀ,

where uᵢ contains qᵢ elements, qᵢ being the number of levels of the i-th random
effect. So we have a total of r random effects. Corresponding to the partition of
u, we also partition Z as

Z = [Z₁, …, Z_r].
We will make the following assumptions, which are realistic and at the same
time lead to a tractable mathematical analysis. The notation N_n refers to an
n-dimensional normal distribution.

• ε | β, σ², {σᵢ²}ᵢ₌₁ʳ ~ N_n(0, σ²I).

• uᵢ | β, σ², {σⱼ²}ⱼ₌₁ʳ ~ N_{qᵢ}(0, σᵢ²I) for all i, where qᵢ is the number of levels
within the i-th factor.

• uᵢ | β, σ², {σⱼ²}ⱼ₌₁ʳ and uⱼ | β, σ², {σₗ²}ₗ₌₁ʳ are statistically independent for all
i ≠ j.

• uᵢ | β, σ², {σⱼ²}ⱼ₌₁ʳ and ε | β, σ², {σⱼ²}ⱼ₌₁ʳ are statistically independent for all i.

Some authors find it easier to write ε as another random effect within the
vector u. Although it is notationally convenient, we prefer to stick with (2.1),
mainly because we will not treat ε in the same way we will treat the uᵢ from a
Bayesian standpoint.
Now we can rewrite (2.1) as

y = Xβ + Σᵢ₌₁ʳ Zᵢuᵢ + ε.  (2.2)

This is the version of the model we will use throughout this study. Using (2.2) and
the assumptions, one can derive that

y ~ N_n(Xβ, V),  (2.3)

where

V = Σᵢ₌₁ʳ σᵢ² ZᵢZᵢᵀ + σ²I,  (2.4)

where the σᵢ² are called "variance components" and σ² is called the "error variance."
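The dissertation's own computational code (Appendices A–D) is in SAS IML; purely as an illustration of (2.3) and (2.4), the following Python sketch builds V for a hypothetical small two-way layout and draws y from it. All dimensions, incidence matrices and variance values below are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layout: one overall mean, two random effects with
# 3 and 4 levels, and n = 12 observations (values are illustrative).
n = 12
X = np.ones((n, 1))                       # design matrix for the fixed effect
Z1 = np.kron(np.eye(3), np.ones((4, 1)))  # incidence matrix, q1 = 3 levels
Z2 = np.kron(np.ones((3, 1)), np.eye(4))  # incidence matrix, q2 = 4 levels

beta = np.array([5.0])
sigma2 = 1.0                  # error variance sigma^2
sigma2_i = [2.0, 0.5]         # variance components sigma_1^2, sigma_2^2

# Covariance of y from (2.4): V = sum_i sigma_i^2 Z_i Z_i' + sigma^2 I
V = sigma2 * np.eye(n)
for s2, Z in zip(sigma2_i, [Z1, Z2]):
    V += s2 * Z @ Z.T

# Draw y ~ N_n(X beta, V) as in (2.3)
y = rng.multivariate_normal((X @ beta).ravel(), V)
print(V.shape, y.shape)
```

Since each row of each incidence matrix here contains exactly one 1, every diagonal element of V equals the sum of all the variance parameters, which is a quick sanity check on the construction.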
In this setup we will investigate the following hypothesis about the variance
components: H₀: σⱼ² = 0 against H₁: σⱼ² > 0 for j ∈ J = {1, …, r}. This is
known as the "main-effects" test, measuring the significance of the j-th random
effect, and is commonly used for model selection purposes.
In the context of balanced models, there is an extensively researched and fairly
satisfactory frequentist theory. Most of the ideas are based on sums of squares
as developed in an analysis of variance table. The main point is that the sums
of squares are distributed as scalar multiples of central or non-central χ² variates,
under the usual assumptions of normality, homogeneity of variances and
independence of the model's random effects. So, in many cases an exact F-test is available.
But still there may be cases where an exact test is not present. Then one has to
use a pseudo-F-test, which is simple to construct, but whose exact size and power
are unknown. A similar situation occurs with unbalanced models involving fixed
effects only.
In unbalanced models with random effects, the state of the art is considerably
worse. Even in the one-way random model (p = 1, X = 1 and r = 1 in (2.2)),
there is confusion as to which test to use. These shortcomings are illustrated by
Westfall (1989), who shows that there is no most powerful invariant test in a
one-way random model for testing H₀: σ₁² = 0 against H₁: σ₁² > 0. Self and Liang
(1992) show that the likelihood ratio test performs poorly, since the null value is on
the boundary of the parameter space. Those results are enough of an indication to
convince us that the present situation of hypothesis testing for variance components
leaves a lot of room for improvement. In fact, even before the mentioned studies
appeared, Khuri and Sahai (1985) offered the following perspective:
The main difficulty stems from the fact that in unbalanced data situations, the partitioning of the total sum of squares can be done in a variety of ways; hence there is no unique way to write the analysis of variance table as is the case with balanced data. Furthermore the sums of squares in an analysis of variance table for an unbalanced model, with the exception of residual sum of squares, do not in general, have known distributional properties and are not independently distributed, even under the usual assumptions of independence, homogeneity of variances and normality of the model's random effects. Consequently, there are no known exact procedures that can be used for tests of hypotheses of variance components. It is therefore, not surprising that most research in the area of variance components for unbalanced models has centered on point estimation, [p. 256]
This perspective is apparently shared by Searle, Casella and McCulloch (1992),
whose book is devoted entirely to the study of variance components. It has a total
of 12 chapters that are concerned with estimation only, and mentions hypothesis
testing only in small subsections on special models.

So, not only do frequentist methods fail to give us a unified approach now,
there is not much hope (and little ongoing research) to provide the practitioners
with better methods in the future. As we will see in the next section, a Bayesian
approach is more promising, but not so well investigated; this is a gap which
we aspire to fill.
2.2 Bayesian Analysis
Bayesian analysis of statistical models exhibits some important fundamental
differences from classical methods. The differences are sometimes simple issues,
resolved using asymptotic arguments. More often than not, however, those
differences involve several basic issues regarding the definition of probability, the
philosophy of statistical inference, the nature of probabilistic modeling and technical details
of measure-theoretic probability. We will not consider those issues in this study.
There are several published works involving those differences. Some examples are
Jeffreys (1939), de Finetti (1964, 1972, 1974, 1975) and Savage (1954, 1962). Also,
several studies have addressed the practical aspects of the long-standing Bayesian-
frequentist argument. Berger and Deely (1987) and Berger and Sellke (1987) are recent
views from the Bayesian side, whereas Casella and Berger (1987) and Efron (1986)
provide us with frequentist arguments. Our main reason for presenting a Bayesian
analysis here is mostly pragmatic, because it works well for an important class of
models where classical methods are not satisfactory.
Mainly, the frequentist school builds a statistical model by allowing the data to
be random variables and the parameters of the model to be deterministic variables
(i.e., unknown, but non-random quantities). The main tool of inference is the
sampling distribution of an appropriately chosen statistic on which estimators and
tests can be based. On the other hand, Bayesian school models both data and
parameters as random variables. The distribution to be assumed for the data may
be placed through frequentist arguments, but the distribution of the parameters
generally requires subjective arguments. Then the inferences are based on the
posterior distributions of the parameters that are found by Bayes" theorem (hence
the name Bayesian), conditional upon the values of the observed data.
In the sequel we will use P(·) for what is actually a probability density function.
As unusual as it may seem, this is common notation in Bayesian studies, such
as Hill (1965) and Chaloner (1987).
Assume we have a statistical model, where Y denotes the observables and
θ denotes the parameters. Naturally, the distribution of Y depends on θ. In
a frequentist context this distribution is termed P(Y), that is, the (marginal)
distribution of Y (also called the likelihood function), since θ is not random. In
a Bayesian model, the very same distribution is viewed as P(Y | θ), since θ is
random, but considered known (given) for the purposes of that distribution. We
can furthermore specify a marginal distribution for θ, that is, P(θ). We will call this
the prior distribution of θ, since it does not depend on Y, and hence is specified
before observing Y. This is where subjectivity enters the picture, since before
observing Y, our information on θ is very likely to be subjective. Then, by Bayes'
theorem, we can derive the conditional distribution of θ:

P(θ | Y) = P(Y | θ) P(θ) / P(Y).
This distribution is called the posterior distribution of θ and reflects how the
information contained in Y changed our prior beliefs about θ. In Bayes' theorem,
it is usually necessary to calculate the marginal distribution of Y through the
following identity:

P(Y) = ∫ P(Y | θ) P(θ) dθ.

So the denominator is nothing but a normalization constant (constant with respect
to θ) that guarantees the resulting function is a probability distribution
function. It is for this reason that, in some Bayesian contexts, the following notation
is very common:

P(θ | Y) ∝ P(Y | θ) P(θ),

meaning that the posterior distribution is proportional to the product of likelihood
and prior. Later, we will use the notation ∝ for other probability distribution
functions. It should be understood that, when such notation is used, the right-
hand side contains every term related to the random variable in question (the
so-called kernel of the distribution), but may or may not contain the constants
involved.
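The proportionality above can be made concrete with a small numerical sketch: multiply likelihood by prior on a grid and normalize, so that the sum plays the role of P(Y). The model below (normal data with known variance, normal prior on the mean) and all numbers are invented for illustration and are not from the dissertation.

```python
import numpy as np

# Grid illustration of P(theta | Y) ∝ P(Y | theta) P(theta)
y = np.array([1.2, 0.8, 1.5, 1.1])   # illustrative data
sigma2 = 1.0                          # known data variance
m0, v0 = 0.0, 4.0                     # prior mean and variance for theta

theta = np.linspace(-5, 5, 2001)      # grid over the parameter space
dx = theta[1] - theta[0]
log_lik = -0.5 * ((y[:, None] - theta[None, :])**2 / sigma2).sum(axis=0)
log_prior = -0.5 * (theta - m0)**2 / v0

# Unnormalized posterior; dividing by its (Riemann) sum normalizes it,
# which is exactly the role played by P(Y) in Bayes' theorem
post = np.exp(log_lik + log_prior)
post /= post.sum() * dx

# Conjugate (exact) posterior mean for comparison
vn = 1.0 / (len(y) / sigma2 + 1.0 / v0)
mn = vn * (y.sum() / sigma2 + m0 / v0)
grid_mean = np.sum(theta * post) * dx
print(round(grid_mean, 4), round(mn, 4))
```

Note that the normalizing constants of the likelihood and prior were dropped from the exponentials; the grid normalization recovers a proper density anyway, which is precisely the point of the kernel notation.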
The posterior distribution contains all the necessary information to make
inferences. Point estimates, confidence intervals (in Bayesian terminology called
credible sets) and hypothesis tests are readily available. Moreover, we can directly
talk about the probability of an interval or a hypothesis, unlike in the frequentist
case. For example, we can make statements like P(θ ∈ A) = p, or P(H₀) = p for
some p ∈ [0, 1], because θ is a random variable. As opposed to this, frequentist
results like a confidence interval or a p-value are much harder to interpret. In
fact, DeGroot (1973) argues that practitioners tend to interpret frequentist
measures as if they were Bayesian measures. For example, many people think that
a 95% confidence interval has 0.95 probability of including the parameter, or that a
p-value of 0.04 means that the probability that the null hypothesis is true is 0.04.
There are two major difficulties with a Bayesian approach. First, one has to
assign prior probabilities and in most cases the choice of an entire distribution is
necessary. Second, the calculation of the posterior distribution usually involves
integrations that are very difficult or impossible to perform analytically. There are
some computer-intensive methods suggested in the current literature to get around
the problem of integration. We provide a review of those in Chapter IV.
The problem of prior selection is a difficult one. Samaniego and Reneau (1994)
argue that most statisticians would agree to perform a Bayesian analysis in the presence
of substantial prior information. However when prior information is vague or not
available at all, it is not very clear what one can do. In some cases there are families
of distributions such that, when used as a prior, they lead to a posterior from the
same family. These are called conjugate families. When they are available, it may
be wise to make a choice among them, even at the expense of slightly distorting
the representation of prior beliefs. DeGroot (1970) examines several models for
which conjugate families exist. However, such families are not available for most
problems. Then one has to use the so-called non-informative priors. In some cases,
those priors will be improper probability distributions in the sense that they do
not integrate to 1. We will have more to say about non-informative priors in the
next chapter.
As we mentioned, the only thing that is needed by an analyst is the posterior
distribution. In fact, some Bayesians argue that it is sufficient (and necessary) to
report the entire posterior distribution. However, this may be an inefficient way
to communicate the results, as argued by Berger (1985, section 4.10). So some
summarizing measures have been proposed, such as the Bayes Factor. Kass and
Raftery (1995) have an extensive review on Bayes Factors that contains several
results that we will frequently use.
We will now define the Bayes Factor and explore its connections with the
posterior probabilities. Since we are interested in testing hypotheses using Bayesian
methods, we can talk about the prior and posterior probabilities of hypotheses.
Let π₀ and p₀ denote the prior and posterior probabilities of H₀. We adopt the
same convention in π₁ and p₁ for the alternative hypothesis. Then BF is defined
by (see Berger, 1985, p. 149)

B = (p₀/p₁) / (π₀/π₁).

(We will abbreviate Bayes Factor as BF in text and B in mathematical
expressions.) Assume that the two hypotheses are mutually exclusive and collectively
exhaustive. Then π₁ = 1 − π₀, p₁ = 1 − p₀ and

B = [p₀/(1 − p₀)] / [π₀/(1 − π₀)].

The quantity p/(1 − p) is usually called the "odds" of an event with probability p.
Therefore we can interpret the Bayes Factor as the ratio of posterior odds to prior
odds.
We can now express p₀ in terms of B using the following argument:

p₀/(1 − p₀) = B π₀/(1 − π₀)

(1 − p₀)/p₀ = (1 − π₀)/(B π₀)

1/p₀ − 1 = (1 − π₀)/(B π₀),

which leads to

p₀ = [1 + (1 − π₀)/(B π₀)]⁻¹.  (2.5)
This shows us that, in order to calculate the posterior probability of H₀, we need
to calculate BF and assign a prior probability to H₀. Therefore the contribution
of the data to the posterior probability is through BF only. If, in addition to the
null hypothesis, the alternative hypothesis is also simple, then BF does not involve
any prior information and is sometimes interpreted as "the evidence against H₁, as
provided by the data." However, when the alternative hypothesis is composite, the data
will be weighted by the prior probability distribution specified by the alternative
model.
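The conversion in (2.5) is a one-line computation; as a quick illustration (the function name and the numbers below are ours, not the dissertation's), the following sketch turns a Bayes Factor and a prior probability into p₀ and recovers B from the odds.

```python
# Posterior probability of H0 from the Bayes Factor, following (2.5):
# p0 = [1 + (1 - pi0) / (B * pi0)]^{-1}. Numbers are illustrative.
def posterior_prob_h0(bayes_factor, pi0):
    """Convert a Bayes Factor B and prior probability pi0 of H0 into p0."""
    return 1.0 / (1.0 + (1.0 - pi0) / (bayes_factor * pi0))

# With equal prior odds (pi0 = 0.5), (2.5) reduces to p0 = B / (1 + B)
p0 = posterior_prob_h0(3.0, 0.5)
print(p0)   # 0.75

# Round trip: the ratio of posterior odds to prior odds recovers B
prior_odds = 0.5 / (1 - 0.5)
post_odds = p0 / (1 - p0)
print(post_odds / prior_odds)   # 3.0
```

A Bayes Factor of 1 leaves the prior odds unchanged, so p₀ equals π₀ in that case, which is a convenient sanity check.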
We will mainly be testing point-null hypotheses. A point-null hypothesis
specifies a single value (rather than a region) for the entire parameter vector or a
subvector of it. In this study, we will focus on point-null hypotheses which are
concerned with a subvector of size 1, that is, a single parameter. One interesting
aspect of testing a point-null hypothesis is that there are drastic differences
between frequentist measures (such as p-values) and Bayesian measures (such as
p₀). This issue has been examined by several researchers, and the seminal works
on this controversy are Lindley (1957), Edwards, Lindman and Savage (1963) and,
more recently, Berger and Sellke (1987). Berger and Delampady (1987) review the
earlier contributions and present new evidence on this issue. Also, there have been
several arguments (Hodges and Lehmann, 1954) as to whether we should discard
point-null hypotheses entirely and replace them with the so-called "interval null
hypotheses," where the analyst chooses a small enough interval of "indifference"
around the point value to use in the null hypothesis. With all due respect to those
arguments, we maintain our pragmatic point of view and develop methodologies
for testing a point-null hypothesis.
2.3 Bayesian Analysis of Linear Models
In this section, we will briefly mention key Bayesian achievements in the
analysis of linear models. Most of the work involving problems of fixed effects models
seems to have been part of the statistical folklore, so it is difficult to trace its roots.
The books by DeGroot (1970) and Box and Tiao (1973) provide detailed accounts
of the earlier results about analysis of fixed effects from a Bayesian perspective.
Lempers (1971) brings together several Bayesian techniques applicable to linear
models. Broemeling (1985) derives the posterior distributions of several different
linear models (including the mixed model), mostly using a hierarchical approach
and non-informative priors. However, he chooses to use confidence intervals to test
hypotheses and barely mentions the Bayes Factor. Among the recent books
placing special emphasis on models and data analysis are Press (1989) and Gelman,
Carlin, Stern and Rubin (1995). The works of Berger (1985) and Bernardo and
Smith (1995) investigate the mathematical and philosophical aspects of Bayesian
analysis in general, but use linear models as examples. O'Hagan (1994) presents
a careful mix of mathematical and philosophical aspects along with a review of
models and issues of computation.
The pioneering work in the area of Bayesian analysis of linear models with
random effects has been done by Hill (1965) and Tiao and Tan (1965), who have
worked independently of each other. Dickey (1974) was the first to derive Bayes Factors
for the linear model. His work is very detailed and carefully presented; however
his priors are fully informative and his BF requires many prior inputs. Smith
and Spiegelhalter (1980) and Spiegelhalter and Smith (1982) also derive Bayes
Factors, with fixed effects in mind, for linear and log-linear models and introduce
a device with which indeterminacies resulting from the use of improper priors can
be handled. Zellner and Siow (1980) and Zellner (1986b) consider Bayes Factors
for regression models. Also, Berger and Deely (1988) consider the problem of
analyzing and ranking several means and derive Bayes Factors for them. Their
framework includes several classical models as special cases. One drawback of
their study is the assumption of known error variance, which is hardly ever the
case in practice. Finally, Westfall and Gonen (1996) derive a Bayes Factor for the
one-way random model. They also suggest a framework in which the asymptotic
performances of the mentioned Bayes Factors can be evaluated and compared and
using this framework point out some difficulties that arise from the device of Smith
and Spiegelhalter (1980).
CHAPTER III
THEORETICAL DEVELOPMENT
In this chapter we will derive the Bayes Factor for variance components in the
setting of the mixed linear model. Our main goals are to prove Theorems 4 and
5. Theorem 4 presents a Bayes Factor resulting from a "standard" model and
Theorem 5 presents one resulting from a hierarchical model.
3.1 Preliminaries
We will follow the notation developed in the previous chapters, unless we
explicitly state otherwise. We will continue to use P(·) for what is actually a probability
density, as we have done in Chapter II. Whenever convenient, we will use dP(·) to
denote the Lebesgue–Stieltjes measure induced by P(·).
We first provide a summary of the problem. We assume y ~ N_n(Xβ, V), where
V = Σⱼ₌₁ʳ σⱼ² ZⱼZⱼᵀ + σ²I. Let J = {1, …, r} and τⱼ² = σⱼ²/σ² for all j ∈ J. Also,
let K = {1, …, j − 1, j + 1, …, r}. To keep the notation as simple as possible,
we introduce two vectors τ²_J and τ²_K, where τ²_J = {τⱼ²}_{j∈J} and τ²_K = {τₖ²}_{k∈K}, so
that τ²_J contains all of the variance ratios, and τ²_K consists of those variance ratios
that are not specified to be 0 under H₀. We keep in mind (2.5), which gives us the
relationship between p₀ and the BF. The reasons to work with the τⱼ² instead of the σⱼ²
will become clear later.

Since the main question is centered around the σⱼ² (or, equivalently, the τⱼ²), we call
β and σ² nuisance parameters. This was the main reason for us to adopt (2.2),
rather than treating σ² as a variance component, because, as we will see later, we will
use different priors for the nuisance parameters than the ones we will use for the variance
components.
We will call this model "standard" for lack of a better label. The need
for a label will become clear when the "hierarchical" model, whose label is
widely accepted and natural, is introduced in Section 3.4.
In this section, we will develop some tools that will be of use to prove our main
results in Sections 3.3 and 3.4.
Theorem 1 The Bayes Factor for testing H_0 : τ_j² = 0 against H_1 : τ_j² > 0 is
given by

   B = [∫···∫ P(y | β, σ², τ_K²) P(β, σ², τ_K²) dβ dσ² dτ_K²] /
       [∫···∫ P(y | β, σ², τ_J²) P(β, σ², τ_J²) dβ dσ² dτ_J²].   (3.1)
Proof: The proof will follow the lines of Berger (1985, p. 149) with slight
changes in notation. Let m(y) denote the marginal distribution of y. Then one
can write

   m(y) = m_0(y)π_0 + m_1(y)(1 − π_0),

where

   m_0(y) = ∫···∫ P(y | β, σ², τ_K²) P(β, σ², τ_K²) dβ dσ² dτ_K²   (3.2)

and

   m_1(y) = ∫···∫ P(y | β, σ², τ_J²) P(β, σ², τ_J²) dβ dσ² dτ_J².   (3.3)

Any integration without limits should be understood as being evaluated over
the entire parameter space. So, in (3.1) and what follows, the integrals
corresponding to β, σ², τ_J² and τ_K² are over ℝ^p, ℝ⁺, (ℝ⁺)^r and (ℝ⁺)^{r−1},
respectively. Then, by Bayes' theorem,

   p_0 = P(τ_j² = 0 | y) = m_0(y)π_0 / m(y).

Substituting m(y) from above, we arrive at

   p_0 = m_0(y)π_0 / [m_0(y)π_0 + m_1(y)(1 − π_0)].

Dividing both the numerator and the denominator by m_0(y)π_0, we have the
following identity:

   p_0 = [1 + ((1 − π_0)/π_0) · (m_1(y)/m_0(y))]^{−1}.   (3.4)
Comparing (2.5) and (3.4), we see that

   B = [∫···∫ P(y | β, σ², τ_K²) P(β, σ², τ_K²) dβ dσ² dτ_K²] /
       [∫···∫ P(y | β, σ², τ_J²) P(β, σ², τ_J²) dβ dσ² dτ_J²]

for H_0 as defined above, and this completes the proof. □
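The algebra leading to (3.4) can be checked with a few lines of arithmetic; the
values of m_0(y), m_1(y) and π_0 below are arbitrary stand-ins, not quantities
from the dissertation.

```python
# Numeric sanity check of identity (3.4): the posterior probability of H_0
# computed directly from Bayes' theorem equals the form obtained after
# dividing through by m0(y)*pi0.  The inputs are arbitrary stand-ins.
m0, m1, pi0 = 0.37, 0.82, 0.5

p0_direct = m0 * pi0 / (m0 * pi0 + m1 * (1.0 - pi0))        # Bayes' theorem
p0_identity = 1.0 / (1.0 + ((1.0 - pi0) / pi0) * (m1 / m0))  # identity (3.4)

B = m0 / m1                              # the Bayes Factor B = m0(y)/m1(y)
assert abs(p0_direct - p0_identity) < 1e-12
```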
In most linear model contexts, the terms "full model" and "reduced model" are
very common. In this study, we will use the term "full model" to refer to the
model specified by the alternative hypothesis and "reduced model" to refer to
the model specified by the null hypothesis. One popular method of deriving a
test statistic by frequentist methods or the likelihood principle is to form a
ratio where the reduced model is represented in the numerator and the full
model is represented in the denominator. Most often, likelihood functions
corresponding to the reduced and full models are used and the resulting test
statistic is known as a Likelihood Ratio Test Statistic. In that sense (3.1)
resembles such a statistic, but there are two main differences. First, while
deriving a likelihood ratio test statistic, nuisance parameters are handled by
maximizing over the corresponding parameter space; here they are integrated
out. Second, in the Bayesian framework, likelihoods are multiplied by the prior
probabilities, which act as weights. For this reason, the BF is sometimes
called a "Weighted Likelihood Ratio," especially by non-Bayesians.
At this point, we would like to introduce the idea of "marginal likelihood,"
abbreviated as ML. The integrand in (3.2) can be recognized as the joint
density of the data and the parameters, namely P(y, β, σ², τ_K²). We then
integrate out the parameters, which leaves us with a function of the data only,
called the "marginal likelihood" (sometimes the "integrated likelihood"). Thus
m_0(y) is the marginal likelihood for the reduced model. Similar comments apply
to (3.3), so m_1(y) is the marginal likelihood for the full model, and the
Bayes Factor can be summarily written as B = m_0(y)/m_1(y).
We would also like to mention that Kass and Raftery (1995) provide a "granular"
scale of interpretation (mostly based on experience) for using the BF directly
for inferences without calculating the posterior probabilities. Although it may
be easier for some people to avoid specifying π_0, we feel the coherent
Bayesian approach is to specify π_0 as well and to make decisions based on p_0,
rather than the BF itself.
Our next step is to provide the analytic expressions for the likelihood and
prior to substitute in (3.1). The next theorem gives us the likelihood.
Theorem 2 Let

   V_0 = Σ_{k∈K} τ_k² Z_k Z_k^T + I

and

   V_1 = Σ_{j∈J} τ_j² Z_j Z_j^T + I.

Then the likelihood for the reduced model is given by

   P(y | β, σ², τ_K²) = (2πσ²)^{−n/2} |V_0|^{−1/2}
      exp{−(1/(2σ²)) (y − Xβ)^T V_0^{−1} (y − Xβ)}   (3.5)

and the likelihood for the full model is given by

   P(y | β, σ², τ_J²) = (2πσ²)^{−n/2} |V_1|^{−1/2}
      exp{−(1/(2σ²)) (y − Xβ)^T V_1^{−1} (y − Xβ)}.   (3.6)

Proof: Using (2.4), we can see that

   V = Σ_{j=1}^r σ_j² Z_j Z_j^T + σ² I = σ² (Σ_{j=1}^r τ_j² Z_j Z_j^T + I).

The rest of the proof follows from (2.3). We note that the effect of the
reduced model on the likelihood is via the covariance matrix only, i.e., it
only "reduces" the covariance matrix. □
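A quick numeric check of (3.6): the factored form with V_1 and σ² equals the
N_n(Xβ, σ²V_1) density computed directly. The design matrices and parameter
values below are made up for illustration (r = 1).

```python
import numpy as np

# Check of (3.6): the mixed-model likelihood written with
# V1 = tau^2 Z Z^T + I equals the N_n(X beta, sigma^2 V1) density computed
# directly.  Small one-factor example with made-up design matrices.
rng = np.random.default_rng(0)
n, q = 6, 2
X = np.ones((n, 1))                      # intercept-only fixed effects
Z = np.kron(np.eye(q), np.ones((3, 1)))  # incidence matrix: 2 groups of 3
beta, sigma2, tau2 = np.array([1.5]), 2.0, 0.7
y = rng.normal(size=n)

V1 = tau2 * Z @ Z.T + np.eye(n)          # covariance kernel in (3.6)
resid = y - X @ beta

# likelihood as written in (3.6)
lik_36 = ((2 * np.pi * sigma2) ** (-n / 2) * np.linalg.det(V1) ** (-0.5)
          * np.exp(-resid @ np.linalg.solve(V1, resid) / (2 * sigma2)))

# direct N_n(X beta, sigma^2 V1) density
Sigma = sigma2 * V1
lik_direct = ((2 * np.pi) ** (-n / 2) * np.linalg.det(Sigma) ** (-0.5)
              * np.exp(-resid @ np.linalg.solve(Sigma, resid) / 2))

assert np.isclose(lik_36, lik_direct)
```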
3.2 Prior Selection
We now turn our attention to ways of representing our prior information on
the parameters. There are several established methods of doing this. One can
use the so-called g-priors (Zellner, 1986a). Another approach is to use
non-informative priors. These priors, as the name implies, represent absence of
prior information on the parameters. They also have nice mathematical
properties, such as connections with Haar measure and the invariance principle,
as mentioned by Berger (1985). In most cases they arise as a limiting form of
some regular distribution, and as such they can carry some unusual properties.
For example, one non-informative prior
that is commonly used for location parameters is P(μ) ∝ c, c ∈ ℝ. Since
∫ dP(μ) ≠ 1, it is not a probability distribution in the usual sense. Such
priors are called improper. The example we have given arises from a normal
distribution whose variance tends to ∞. Although they are improper in this
sense, their use may result in proper posterior distributions. A common choice
of non-informative prior for scale parameters is P(θ) ∝ θ^{−ν}, where ν ∈ ℝ⁺.
Note that the ∝ notation is slightly abused here, since improper distributions
do not have normalization constants and hence reporting the kernel does not
assure uniqueness. However, improper priors are used when the multiplying
constant of the kernel is irrelevant. In the example for the location
parameter, any real number c would serve the same purpose, and the posterior
distribution does not depend on the choice of c. Box and Tiao (1973) provide
several mathematical and philosophical motivations for the use of these priors.
The use of non-informative priors has its drawbacks as well. It is difficult
for most practitioners to understand them. Also, they generally lead to
indeterminate or infinite Bayes Factors (see, for example, Kass and Raftery,
1995), so they must be used with caution in hypothesis testing situations.
With regard to all these discussions, we find it suitable to choose
non-informative priors for the nuisance parameters and proper priors for the
parameters of interest. This is similar to Berger and Deely (1988) and Westfall
and Gonen (1996).
Many researchers working on similar problems have assumed a priori
independence of the location and scale parameters. Unless there is contrary
evidence, this seems reasonable in the context of linear models as well. A
handful of published studies that treat mixed models from a Bayesian standpoint
have used this assumption without any problems; some examples are Hill (1965),
Tiao and Tan (1965), Dickey (1974), and Berger and Deely (1988). So, we will
make this assumption without further questioning it, i.e., we will assume

   P(β, σ², τ_J²) = P(β) P(σ², τ_J²),   (3.7)

and we choose P(β) such that it is constant over ℝ^p.
It is not as easy, however, to assume a priori independence of the individual
variance components and the error variance. Dickey (1974) did so mainly for
mathematical convenience, but we take issue with that choice. Both the variance
components and the error variance are related to the scale of the problem, and
in most cases information about one of them leads to some information about the
other. This is the main reason why we use the variance ratios (the τ_j²'s)
instead of the variance components themselves (the σ_j²'s). Variance ratios are
unitless quantities, and it is reasonable to argue that, in many cases, they
are independent of σ² a priori. So we choose

   P(σ², τ_J²) = P(σ²) P(τ_J²).   (3.8)

Furthermore, this setup allows us to incorporate a non-informative prior for σ²
very easily; we simply choose

   P(σ²) ∝ (σ²)^{−2}.   (3.9)
Recalling the earlier discussion of non-informative priors for scale
parameters, and realizing that σ (not σ²) is the scale parameter, this
corresponds to taking ν = 4. The more commonly used choice in the literature
seems to be ν = 2 (justifications for this choice are given by Datta and Ghosh
(1995)), although Ye (1994) suggests taking ν = 3, in the context of the
one-way model, based on a derivation involving the use of Fisher information.
Our particular choice of ν = 4 is motivated by invariance considerations. Its
use in the one-way ANOVA has led to a BF that arises after reducing the data to
a maximal invariant (Westfall and Gonen, 1996). We could have easily used the
somewhat richer family of priors suggested by Chaloner (1987). Her priors are
of the form

   P(σ²) ∝ (σ²)^{−λ/2−1}.

Our numerical experience in the one-way ANOVA (Gonen, 1995) has shown that the
BF is somewhat insensitive to the choice of λ, so we prefer not to introduce
another prior parameter. The use of this family, however, does not bring any
additional analytical difficulties to our approach; one can easily incorporate
it into the following derivation. In Chaloner's family of priors, our choice in
(3.9) corresponds to λ = 2.
The choice of P(τ_J²) is not a trivial matter. Past studies suggest that the
use of non-informative priors for the parameters being tested usually results
in infinite Bayes Factors (see Smith and Spiegelhalter, 1980; Spiegelhalter and
Smith, 1982; and Berger and Deely, 1988). So, we have to be informative.
However, there
are no informative choices that will make the integrations in (3.1)
analytically possible. Since this prior will represent the prior information
about the parameters of interest, we want a family of prior distributions that
possesses some nice properties. We would like to allow a rich enough structure
to permit possible a priori dependence of the variance ratios, as this may very
well be the case. Also, we would like to keep the parameters of the prior
distribution as understandable as possible, with potential users in mind, who
may not have the time or background to understand the intricacies of
probability theory.
We will suggest an approach to prior selection based on the intraclass
correlation, which is defined as

   ρ_j = σ_j² / (σ² + Σ_{i∈J} σ_i²) = τ_j² / (1 + Σ_{i∈J} τ_i²).

This is the correlation coefficient between subjects within the same group. By
definition, it takes values on the unit interval. So, if we define a vector
ρ_J = {ρ_j}_{j∈J}, the Dirichlet family presents itself naturally to represent
prior information on ρ_J. The Dirichlet is a well-known and well-studied
(Ferguson, 1967) multivariate generalization of the beta family. A vector
ρ = (ρ_1, ..., ρ_r)^T is said to have a Dirichlet distribution with parameter
vector α = (α_1, ..., α_{r+1})^T if it has the following density function:

   P(ρ) = [Γ(α_0) / Π_{i=1}^{r+1} Γ(α_i)]
          Π_{i=1}^r ρ_i^{α_i − 1} (1 − Σ_{i=1}^r ρ_i)^{α_{r+1} − 1},

where α_0 = Σ_{i=1}^{r+1} α_i, 0 < ρ_i < 1 for all i = 1, ..., r, and α_j > 0
for all j = 1, ..., r + 1. We now suggest using a member of the Dirichlet
family to represent prior information about the intraclass correlation
coefficients.
Using the Dirichlet family has the advantage of incorporating a reference
prior, in the terminology of Box and Tiao (1973). We suggest that, in the
absence of prior information, one should choose α = 1. We have already
mentioned that we have to use proper priors for the parameters of interest to
avoid indeterminate Bayes Factors, and choosing α = 1 gives us a proper prior
which does not favor any of the parameters a priori. This is a generalization
of the choice of a uniform distribution as a prior for the intraclass
correlation in one-way random models; see Westfall and Gonen (1996) for
details. Another consideration here is
the concern for developing a proper reference prior. O'Hagan (1995) argues that
Bayes Factors can be sensitive to the choice of prior inputs for the parameters
of interest and that a prior family should include reasonable choices of
reference priors, which is what the Dirichlet family does for us.
Having represented our prior information in terms of ρ_J, we now face the task
of finding the corresponding distribution on τ_J². We remind ourselves that
τ_J² is a one-to-one transformation of ρ_J, so the inverse transformation
exists and is well-defined. It is, then, conceptually simple to find the
distribution of τ_J² by using the Jacobian of the inverse transformation, a
method that is very well known and widely used (see, for example, Hogg and
Craig, 1978, p. 134). However, in this case, computing the Jacobian is not
trivial. The following theorem establishes this connection for the case α = 1.
Theorem 3 Let τ_J², ρ_J and α be as defined above. If ρ_J has a Dirichlet
distribution with parameter vector α = 1, then

   P(τ_J²) ∝ (1 + Σ_{j=1}^r τ_j²)^{−(r+1)}.

Proof: Let (ρ_1, ..., ρ_{r+1})^T have a Dirichlet distribution with all
parameters equal to 1. We want to find the distribution of (τ_1², ..., τ_r²),
where

   ρ_i = h_i(τ_1², ..., τ_r²) = τ_i² / (1 + Σ_{j=1}^r τ_j²).

Let A = {∂h_i/∂τ_j²} and let f(·) be the density of the Dirichlet distribution.
Then

   P(τ_1², ..., τ_r²) = |A| f(h_1(τ_1², ..., τ_r²), ..., h_r(τ_1², ..., τ_r²)).

Since all the parameters are equal to 1, this reduces to (recalling that
Γ(1) = 1)

   P(τ_1², ..., τ_r²) = |A| Γ(r + 1).

So, we need to find |A|. We let T = Σ_{i=1}^r τ_i² and note that

   ∂h_i/∂τ_j² = (1 + T − τ_i²)/(1 + T)²   if i = j,
   ∂h_i/∂τ_j² = −τ_i²/(1 + T)²           if i ≠ j.

To find the determinant of the matrix A defined above, we convert it into an
upper triangular matrix in the following way. For each i = 1, ..., r − 1, we
perform the following elementary operations:

1. Replace the (i + 1)th row by the sum of the ith row and the (i + 1)th row.

2. Replace the ith column by the difference of the ith column and the
(i + 1)th column.

Since these are elementary operations, the determinant is left unaffected at
each step. After doing this, we are left with an upper triangular matrix whose
diagonal entries multiply to (1 + T)^{−(r+1)}. Then |A| = (1 + T)^{−(r+1)}.
This leads to

   P(τ_1², ..., τ_r²) = Γ(r + 1) (1 + Σ_{i=1}^r τ_i²)^{−(r+1)},

which is a multivariate extension of the Pearson Type VI family. □
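The Jacobian determinant in Theorem 3 can be checked numerically from the
analytic partial derivatives; the τ² values below are arbitrary test points,
not quantities from the dissertation.

```python
import numpy as np

# Numeric check of the Jacobian determinant in Theorem 3: for
# rho_i = tau_i^2 / (1 + sum_j tau_j^2), the matrix of partial derivatives
# A = {d rho_i / d tau_j^2} has determinant (1 + T)^{-(r+1)}, T = sum tau_j^2.
tau2 = np.array([0.4, 1.3, 2.1])         # arbitrary test point
r = len(tau2)
T = tau2.sum()

# analytic partials: (1+T-tau_i^2)/(1+T)^2 on the diagonal, -tau_i^2/(1+T)^2 off it
A = -np.outer(tau2, np.ones(r)) / (1 + T) ** 2 + np.eye(r) / (1 + T)

assert np.isclose(np.linalg.det(A), (1 + T) ** (-(r + 1)))
```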
By combining (3.7), (3.8) and (3.9), we arrive at our prior specification for
the full model,

   P(β, σ², τ_J²) ∝ (σ²)^{−2} P(τ_J²),   (3.10)

and for the reduced model,

   P(β, σ², τ_K²) ∝ (σ²)^{−2} P(τ_K²),   (3.11)

where P(τ_K²) and P(τ_J²) are given by Theorem 3.
3.3 Deriving the Bayes Factor
The main purpose of this section is to prove the following theorem, which gives
us an explicit expression for the Bayes Factor for a single variance component.
Theorem 4 The BF for a single variance component σ_j² in the standard model is
given by

   B = m_0(y)/m_1(y),   (3.12)

where

   m_0(y) = ∫ |V_0|^{−1/2} |X^T V_0^{−1} X|^{−1/2}
      [(y − Xβ̂_0)^T V_0^{−1} (y − Xβ̂_0)]^{−(n−p+2)/2} dP(τ_K²),   (3.13)

   m_1(y) = ∫ |V_1|^{−1/2} |X^T V_1^{−1} X|^{−1/2}
      [(y − Xβ̂_1)^T V_1^{−1} (y − Xβ̂_1)]^{−(n−p+2)/2} dP(τ_J²),   (3.14)

and

   β̂_0 = (X^T V_0^{−1} X)^{−1} X^T V_0^{−1} y,
   β̂_1 = (X^T V_1^{−1} X)^{−1} X^T V_1^{−1} y.
The proof of the theorem will be greatly facilitated by the use of the
following lemmas.
Lemma 1 Let V be a symmetric, positive-definite matrix, and let
β̂ = (X^T V^{−1} X)^{−1} X^T V^{−1} y be the generalized least squares estimator
corresponding to V. Then

   (y − Xβ)^T V^{−1} (y − Xβ)
      = (y − Xβ̂)^T V^{−1} (y − Xβ̂) + (β − β̂)^T X^T V^{−1} X (β − β̂).   (3.15)

Proof: We start by adding and subtracting Xβ̂ and expanding the expression we
get:

   [(y − Xβ̂) + (Xβ̂ − Xβ)]^T V^{−1} [(y − Xβ̂) + (Xβ̂ − Xβ)]   (3.16)
      = (y − Xβ̂)^T V^{−1} (y − Xβ̂) + (β̂ − β)^T X^T V^{−1} X (β̂ − β)
      + (y − Xβ̂)^T V^{−1} X (β̂ − β) + (β̂ − β)^T X^T V^{−1} (y − Xβ̂).

By comparing (3.15) and (3.16), we see that the only thing we need to prove is

   (y − Xβ̂)^T V^{−1} X (β̂ − β) + (β̂ − β)^T X^T V^{−1} (y − Xβ̂) = 0.   (3.17)

To prove (3.17), it is sufficient to prove that

   (y − Xβ̂)^T V^{−1} X (β̂ − β) = 0   (3.18)

and

   (β̂ − β)^T X^T V^{−1} (y − Xβ̂) = 0.   (3.19)

We first tackle (3.18):

   (y − Xβ̂)^T V^{−1} X (β̂ − β)
      = y^T V^{−1} Xβ̂ − y^T V^{−1} Xβ − β̂^T X^T V^{−1} Xβ̂ + β̂^T X^T V^{−1} Xβ.   (3.20)

Keeping in mind that V^{−1} and (X^T V^{−1} X)^{−1} are symmetric matrices and
substituting β̂ = (X^T V^{−1} X)^{−1} X^T V^{−1} y in every term of (3.20), we
obtain the following equations:

   y^T V^{−1} Xβ̂ = y^T V^{−1} X (X^T V^{−1} X)^{−1} X^T V^{−1} y,   (3.21)

   β̂^T X^T V^{−1} Xβ̂
      = y^T V^{−1} X (X^T V^{−1} X)^{−1} (X^T V^{−1} X) (X^T V^{−1} X)^{−1} X^T V^{−1} y
      = y^T V^{−1} X (X^T V^{−1} X)^{−1} X^T V^{−1} y,   (3.22)

   β̂^T X^T V^{−1} Xβ = y^T V^{−1} X (X^T V^{−1} X)^{−1} (X^T V^{−1} X) β
      = y^T V^{−1} Xβ.   (3.23)

Substituting (3.21), (3.22) and (3.23) back in (3.20), we see that (3.18) is
proved. Now we go back to (3.19):

   (β̂ − β)^T X^T V^{−1} (y − Xβ̂)
      = β̂^T X^T V^{−1} y − β̂^T X^T V^{−1} Xβ̂ − β^T X^T V^{−1} y + β^T X^T V^{−1} Xβ̂.   (3.24)

The terms in (3.24) can be expanded in the same way as in (3.20) to get:

   β̂^T X^T V^{−1} y = y^T V^{−1} X (X^T V^{−1} X)^{−1} X^T V^{−1} y,   (3.25)

   β̂^T X^T V^{−1} Xβ̂ = y^T V^{−1} X (X^T V^{−1} X)^{−1} X^T V^{−1} y,   (3.26)

   β^T X^T V^{−1} Xβ̂ = β^T X^T V^{−1} X (X^T V^{−1} X)^{−1} X^T V^{−1} y
      = β^T X^T V^{−1} y.   (3.27)

Now, (3.25), (3.26) and (3.27) substituted back in (3.24) imply (3.19). Then
(3.18) and (3.19) together imply (3.17), which in turn implies (3.15), which is
what we wanted to prove. □
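Lemma 1 is easy to confirm numerically on random inputs; the matrices and
vectors below are made up for the check.

```python
import numpy as np

# Numeric check of Lemma 1, the generalized least squares decomposition (3.15),
# on a random symmetric positive-definite V and an arbitrary beta.
rng = np.random.default_rng(1)
n, p = 8, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
M = rng.normal(size=(n, n))
V = M @ M.T + n * np.eye(n)              # symmetric positive-definite
Vi = np.linalg.inv(V)

beta_hat = np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ y)  # GLS estimator
beta = rng.normal(size=p)                # arbitrary point

lhs = (y - X @ beta) @ Vi @ (y - X @ beta)
rhs = ((y - X @ beta_hat) @ Vi @ (y - X @ beta_hat)
       + (beta - beta_hat) @ X.T @ Vi @ X @ (beta - beta_hat))
assert np.isclose(lhs, rhs)
```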
Lemma 2

   ∫ exp{−(1/2) (y − Xβ)^T V^{−1} (y − Xβ)} dy = (2π)^{p/2} |V|^{1/2},

where p is the dimension of the vector y.

Proof: Follows from the definition of a p-dimensional multivariate normal
distribution. □
Lemma 3 For a > 0 and b > 0,

   ∫₀^∞ (σ²)^{−(a+1)} exp{−b/σ²} dσ² = Γ(a) b^{−a}.

Proof: Follows from the definition of the inverted-gamma distribution. □
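Lemma 3 can be verified by a quick numeric quadrature: after substituting
t = 1/σ² it becomes the Gamma integral. The values of a and b below are
arbitrary.

```python
import math
import numpy as np

# Numeric check of Lemma 3: int_0^inf s^{-(a+1)} exp(-b/s) ds = Gamma(a)/b^a.
# Substituting t = 1/s turns it into int_0^inf t^{a-1} exp(-b t) dt, which is
# evaluated here by trapezoidal quadrature on a fine grid.
a, b = 3.0, 2.0
t = np.linspace(1e-8, 60.0, 400_000)
f = t ** (a - 1) * np.exp(-b * t)
integral = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(t))

assert np.isclose(integral, math.gamma(a) / b ** a, rtol=1e-4)
```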
We are now ready to prove Theorem 4.

Proof (Theorem 4): We will work out the details of the proof for m_1(·) only,
since exactly the same steps can be traced for m_0(·) with a slight change in
notation.
We first substitute (3.6), (3.10) and (3.11) in (3.1) to get

   m_1(y) = ∫∫∫ (2πσ²)^{−n/2} |V_1|^{−1/2}
      exp{−(1/(2σ²)) (y − Xβ)^T V_1^{−1} (y − Xβ)} (σ²)^{−2} dβ dσ² dP(τ_J²).

It is important to keep in mind that V_0 and V_1 are functions of the variance
ratios, but this dependence is not made explicit, to keep the notation to a
minimum. After grouping like terms, we arrive at

   m_1(y) = (2π)^{−n/2} ∫∫∫ (σ²)^{−n/2−2} |V_1|^{−1/2}
      exp{−(1/(2σ²)) (y − Xβ)^T V_1^{−1} (y − Xβ)} dβ dσ² dP(τ_J²).   (3.28)

We will integrate over β first in (3.28). Using Lemma 1, we get

   ∫∫∫ (σ²)^{−n/2−2} |V_1|^{−1/2}
      exp{−(1/(2σ²)) (y − Xβ)^T V_1^{−1} (y − Xβ)} dβ dσ² dP(τ_J²)
   = ∫ |V_1|^{−1/2} ∫ (σ²)^{−n/2−2}
      exp{−(1/(2σ²)) (y − Xβ̂_1)^T V_1^{−1} (y − Xβ̂_1)}
      ∫ exp{−(1/(2σ²)) (β̂_1 − β)^T X^T V_1^{−1} X (β̂_1 − β)} dβ dσ² dP(τ_J²).

Notice how the lemma has enabled us to separate the term in the exponent into
two parts, one involving β and the other not. By virtue of this, we are in a
position to integrate over β analytically, using Lemma 2:

   ∫ exp{−(1/(2σ²)) (β̂_1 − β)^T X^T V_1^{−1} X (β̂_1 − β)} dβ
      = (2π)^{p/2} (σ²)^{p/2} |X^T V_1^{−1} X|^{−1/2}.

Now, we turn our attention to the integral over σ². What remains is

   ∫ (σ²)^{−(n−p)/2−2} exp{−(1/(2σ²)) (y − Xβ̂_1)^T V_1^{−1} (y − Xβ̂_1)} dσ².

By using Lemma 3,

   ∫ (σ²)^{−(n−p)/2−2} exp{−(1/(2σ²)) (y − Xβ̂_1)^T V_1^{−1} (y − Xβ̂_1)} dσ²
      = Γ((n−p+2)/2) [½ (y − Xβ̂_1)^T V_1^{−1} (y − Xβ̂_1)]^{−(n−p+2)/2}.

This development leads to the following expression for m_1(y):

   m_1(y) = (2π)^{−(n−p)/2} Γ((n−p+2)/2)
      ∫ |V_1|^{−1/2} |X^T V_1^{−1} X|^{−1/2}
        [½ (y − Xβ̂_1)^T V_1^{−1} (y − Xβ̂_1)]^{−(n−p+2)/2} dP(τ_J²).

Carrying out similar steps, we arrive at the following expression for the
numerator:

   m_0(y) = (2π)^{−(n−p)/2} Γ((n−p+2)/2)
      ∫ |V_0|^{−1/2} |X^T V_0^{−1} X|^{−1/2}
        [½ (y − Xβ̂_0)^T V_0^{−1} (y − Xβ̂_0)]^{−(n−p+2)/2} dP(τ_K²).

The only difference between the numerator and the denominator that is not
explicit in these two expressions is the domain of integration: for the
numerator, the integration is over (ℝ⁺)^{r−1}, whereas for the denominator it
is over (ℝ⁺)^r. After grouping like terms and cancellations, we arrive at

   B = [∫ |V_0|^{−1/2} |X^T V_0^{−1} X|^{−1/2}
         [(y − Xβ̂_0)^T V_0^{−1} (y − Xβ̂_0)]^{−(n−p+2)/2} dP(τ_K²)] /
       [∫ |V_1|^{−1/2} |X^T V_1^{−1} X|^{−1/2}
         [(y − Xβ̂_1)^T V_1^{−1} (y − Xβ̂_1)]^{−(n−p+2)/2} dP(τ_J²)],

which completes the proof. □
A special case worthy of interest is the one-way random model, for which the BF
was derived by Westfall and Gonen (1996). It can easily be seen that this Bayes
Factor reduces to the one reported in that study.
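As an illustration, the integrals in Theorem 4 can be evaluated by simple grid
quadrature in this one-way case (r = 1), where Theorem 3 makes dP(τ²)
correspond to a uniform prior on ρ = τ²/(1 + τ²) and m_0(y) involves no
integral at all. The sketch below uses simulated data and made-up design
sizes; it is not the dissertation's own code.

```python
import numpy as np

# Sketch: Bayes Factor of Theorem 4 for the one-way random model (r = 1).
# By Theorem 3 with r = 1, dP(tau^2) corresponds to a uniform prior on
# rho = tau^2/(1 + tau^2), so m1(y) reduces to a 1-D quadrature over rho,
# while m0(y) is a single term (tau^2 = 0).  Data and sizes are made up.
rng = np.random.default_rng(2)
g, m = 4, 5                              # groups and replicates per group
n, p = g * m, 1
X = np.ones((n, 1))                      # intercept only
Z = np.kron(np.eye(g), np.ones((m, 1)))  # one-way incidence matrix
y = 2.0 + Z @ rng.normal(scale=1.0, size=g) + rng.normal(size=n)

def integrand(tau2):
    # |V|^{-1/2} |X^T V^{-1} X|^{-1/2} [(y - X bhat)^T V^{-1} (y - X bhat)]^{-(n-p+2)/2}
    V = tau2 * Z @ Z.T + np.eye(n)
    Vi = np.linalg.inv(V)
    bhat = np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ y)
    q = (y - X @ bhat) @ Vi @ (y - X @ bhat)
    return (np.linalg.det(V) ** -0.5 * np.linalg.det(X.T @ Vi @ X) ** -0.5
            * q ** (-(n - p + 2) / 2))

m0 = integrand(0.0)                      # reduced model: nothing left to integrate

rho = np.linspace(1e-6, 1 - 1e-6, 2001)  # uniform prior on intraclass correlation
vals = np.array([integrand(r_ / (1 - r_)) for r_ in rho])
m1 = np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(rho))

B = m0 / m1                              # Bayes Factor (3.12)
assert np.isfinite(B) and B > 0
```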
3.4 Hierarchical Approach
In this section we will provide an alternative expression for the Bayes Factor
using a hierarchical approach. As noted by Searle et al. (1992), hierarchical
models have a distinct Bayesian flavor, but they have also been used
successfully beyond Bayesian analysis. As far as algebraic simplicity is
concerned, the standard model we have worked with in the previous section is
preferable to a hierarchical model. But it will turn out that the BF based on a
hierarchical model is much more efficient computationally than the one we
derived in Theorem 4. Also, in general, hierarchical models are easier to
conceptualize.
The mixed model hierarchy is well investigated, and the seminal work in this
area is Lindley and Smith (1972), followed by the works of Smith (1973a,
1973b). The starting point is the mixed linear model as specified in (2.2):

   y = Xβ + Zu + e.

In what follows, we will treat u as a parameter as well. This is not a problem
from the Bayesian perspective, since u is a vector of unobservable quantities
and hence can be treated as parameters. Then we can specify the likelihood
function for the model as

   y | u, β, σ², τ_J² ~ N_n(Xβ + Zu, σ² I).   (3.29)
Now we have to specify priors for the parameters u, β, σ², τ_J². We will do
this in two stages. In the first stage we will specify a prior for u,
conditional on β, σ², τ_J², and then in the second stage we will specify priors
for β, σ² and τ_J². Because of this strategy of specifying priors, this
approach is termed "hierarchical." Actually, it is a Bayesian example of the
more general topic of "hierarchical modeling"; see Casella and Berger (1990)
for an introductory treatment of hierarchical models. Berger (1985) provides a
treatment of such models from a Bayesian perspective.

Following the model assumptions for the mixed linear model, as stated in
Section 2.1, we specify

   u_i | β, σ², τ_J² ~ N_{q_i}(0, σ² τ_i² I)   (3.30)

with the understanding that u_i | β, σ², τ_J² and u_j | β, σ², τ_J² are
independent for i ≠ j. We also keep in mind that u = (u_1^T, ..., u_r^T)^T. This
is called the "first-stage prior," and the parameters that are conditioned on
in a first-stage prior are known as "hyperparameters." As one might expect, the
second-stage priors are concerned with the hyperparameters:

   P(β, σ², τ_J²) ∝ (σ²)^{−2} P(τ_J²),   (3.31)

which is the same as (3.10). When some of the random factors are not present,
such as in the reduced model, one can easily make the necessary adjustments to
arrive at an appropriate hierarchical specification.
In this context, we will prove the following theorem about the Bayes Factor
expressed under this hierarchical setup. We assume that the variance components
are arranged in such a way that the one to be tested is labeled σ_r². This
assumption is needed only for simplicity of notation.
Theorem 5 The BF for the variance component σ_r² in the hierarchical model is
given by

   B = m_0(y)/m_1(y),   (3.32)

where

   m_0(y) = ∫ |X^T B_{r−1} X|^{−1/2} Q_0^{−(n−p+2)/2}
      [Π_{i=1}^{r−1} |A_i|^{−1/2}] dP(τ_K²),   (3.33)

   m_1(y) = ∫ |X^T B_r X|^{−1/2} Q_1^{−(n−p+2)/2}
      [Π_{i=1}^{r} |A_i|^{−1/2}] dP(τ_J²),   (3.34)

and

   B_0 = I,   (3.35)
   A_j = I + τ_j² Z_j^T B_{j−1} Z_j,   (3.36)
   B_j = B_{j−1} − τ_j² B_{j−1} Z_j A_j^{−1} Z_j^T B_{j−1},   (3.37)
   β̂_{h0} = (X^T B_{r−1} X)^{−1} X^T B_{r−1} y,   (3.38)
   β̂_{h1} = (X^T B_r X)^{−1} X^T B_r y,   (3.39)
   Q_0 = (y − Xβ̂_{h0})^T B_{r−1} (y − Xβ̂_{h0}),   (3.40)
   Q_1 = (y − Xβ̂_{h1})^T B_r (y − Xβ̂_{h1}),   (3.41)

for j = 1, 2, ..., r.
The proof of Theorem 5 will be greatly simplified by the use of the following
lemmas.
Lemma 4

   u_j^T u_j + τ_j² (ξ_j − Z_j u_j)^T B_{j−1} (ξ_j − Z_j u_j)
      = (u_j − A_j^{−1} d_j)^T A_j (u_j − A_j^{−1} d_j) + τ_j² ξ_j^T B_j ξ_j,   (3.42)

where

   ξ_j = y − Xβ − Σ_{i=j+1}^r Z_i u_i   (3.43)

and

   d_j = τ_j² Z_j^T B_{j−1} ξ_j.   (3.44)

Proof: Our proof starts with a straightforward expansion of the left-hand side:

   u_j^T u_j + τ_j² (ξ_j − Z_j u_j)^T B_{j−1} (ξ_j − Z_j u_j)
      = u_j^T (I + τ_j² Z_j^T B_{j−1} Z_j) u_j − 2 u_j^T τ_j² Z_j^T B_{j−1} ξ_j
      + τ_j² ξ_j^T B_{j−1} ξ_j.

Using the definitions of A_j and d_j from (3.36) and (3.44), we have

   u_j^T u_j + τ_j² (ξ_j − Z_j u_j)^T B_{j−1} (ξ_j − Z_j u_j)   (3.45)
      = u_j^T A_j u_j − 2 u_j^T d_j + τ_j² ξ_j^T B_{j−1} ξ_j
      = (u_j − A_j^{−1} d_j)^T A_j (u_j − A_j^{−1} d_j) − d_j^T A_j^{−1} d_j
      + τ_j² ξ_j^T B_{j−1} ξ_j.   (3.46)

Since d_j^T A_j^{−1} d_j = (τ_j²)² ξ_j^T B_{j−1} Z_j A_j^{−1} Z_j^T B_{j−1} ξ_j, we
conclude that

   −d_j^T A_j^{−1} d_j + τ_j² ξ_j^T B_{j−1} ξ_j
      = τ_j² [ξ_j^T (B_{j−1} − τ_j² B_{j−1} Z_j A_j^{−1} Z_j^T B_{j−1}) ξ_j]
      = τ_j² ξ_j^T B_j ξ_j.   (3.47)

Substituting (3.47) back in (3.46) gives us the desired result. □
Lemma 5

   ∫ exp{−(1/(2σ²τ_j²)) [u_j^T u_j + τ_j² (ξ_j − Z_j u_j)^T B_{j−1} (ξ_j − Z_j u_j)]} du_j
      = (2πσ²τ_j²)^{q_j/2} |A_j|^{−1/2} exp{−(1/(2σ²)) ξ_j^T B_j ξ_j}.   (3.48)

Proof: The proof of this lemma is greatly facilitated by the use of Lemma 4,
which enables us to rewrite the exponent in the following manner:

   ∫ exp{−(1/(2σ²τ_j²)) [u_j^T u_j + τ_j² (ξ_j − Z_j u_j)^T B_{j−1} (ξ_j − Z_j u_j)]} du_j
      = exp{−(1/(2σ²)) ξ_j^T B_j ξ_j}
        ∫ exp{−(1/(2σ²τ_j²)) (u_j − A_j^{−1} d_j)^T A_j (u_j − A_j^{−1} d_j)} du_j
      = exp{−(1/(2σ²)) ξ_j^T B_j ξ_j} (2πσ²τ_j²)^{q_j/2} |A_j|^{−1/2},

where the integral is evaluated by using Lemma 2. □
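Lemma 5 can be checked numerically in the simplest case q_j = 1, where the
integral over u_j is one-dimensional; all quantities below are small made-up
inputs, not quantities from the dissertation.

```python
import numpy as np

# Numeric check of Lemma 5 for q_j = 1: a 1-D grid quadrature over u_j
# matches the closed form (3.48) built from A_j (3.36) and B_j (3.37).
rng = np.random.default_rng(3)
n = 4
M = rng.normal(size=(n, n))
B_prev = M @ M.T + n * np.eye(n)        # stands in for B_{j-1}: sym. pos.-def.
z = rng.normal(size=n)                  # Z_j with a single column (q_j = 1)
xi = rng.normal(size=n)
sigma2, tau2 = 1.3, 0.6

A = 1.0 + tau2 * z @ B_prev @ z                                # (3.36), scalar here
B_next = B_prev - tau2 * np.outer(B_prev @ z, z @ B_prev) / A  # (3.37)

# left-hand side of (3.48): expand (xi - z u)^T B_{j-1} (xi - z u) in u
c0 = xi @ B_prev @ xi
c1 = z @ B_prev @ xi
c2 = z @ B_prev @ z
u = np.linspace(-30.0, 30.0, 200_001)
quad_form = c0 - 2.0 * u * c1 + u ** 2 * c2
f = np.exp(-(u ** 2 + tau2 * quad_form) / (2.0 * sigma2 * tau2))
lhs = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(u))

# right-hand side of (3.48)
rhs = (np.sqrt(2.0 * np.pi * sigma2 * tau2) / np.sqrt(A)
       * np.exp(-(xi @ B_next @ xi) / (2.0 * sigma2)))

assert np.isclose(lhs, rhs)
```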
Armed with these lemmas, we turn our attention to Theorem 5.

Proof (Theorem 5): As we have done with Theorem 4, we will derive the marginal
likelihood for the full model only; the derivation for the reduced model turns
out to be very similar.
   m_1(y) = ∫···∫ P(y | u, β, σ², τ_J²) [Π_{i=1}^r P(u_i | β, σ², τ_J²)]
      P(β, σ², τ_J²) [Π_{i=1}^r du_i] dβ dσ² dτ_J².

We substitute the likelihood as specified in (3.29), the first-stage prior as
in (3.30) and the second-stage prior as in (3.31) to get

   m_1(y) = ∫···∫ (2πσ²)^{−n/2} exp{−(1/(2σ²)) ξ_0^T ξ_0}
      [Π_{i=1}^r (2πσ²τ_i²)^{−q_i/2} exp{−u_i^T u_i/(2σ²τ_i²)}] (σ²)^{−2}
      [Π_{i=1}^r du_i] dβ dσ² dP(τ_J²).   (3.49)

Arranging the terms in (3.49), we arrive at

   m_1(y) = ∫∫ (2π)^{−(n+q)/2} (σ²)^{−(n+q+4)/2} [Π_{i=1}^r (τ_i²)^{−q_i/2}]
      I_u dβ dσ² dP(τ_J²),   (3.50)

where q = Σ_{i=1}^r q_i and

   I_u = ∫···∫ exp{−(1/(2σ²)) ξ_0^T ξ_0 − Σ_{i=1}^r u_i^T u_i/(2σ²τ_i²)}
      [Π_{i=1}^r du_i].   (3.51)

It is helpful to keep in mind that ξ_0, as defined by (3.43), is a function of
the u_i's.

We will evaluate I_u first. Define, for i = 1, ..., r − 1,

   u_{−i} = (u_{i+1}, ..., u_r).   (3.52)

Then

   I_u = ∫···∫ exp{−Σ_{i=2}^r u_i^T u_i/(2σ²τ_i²)} I_{u_1} du_{−1},   (3.53)

where

   I_{u_1} = ∫ exp{−(1/(2σ²τ_1²)) [u_1^T u_1
      + τ_1² (ξ_1 − Z_1 u_1)^T B_0 (ξ_1 − Z_1 u_1)]} du_1
      = (2πσ²τ_1²)^{q_1/2} |A_1|^{−1/2} exp{−(1/(2σ²)) ξ_1^T B_1 ξ_1},   (3.54)

where, in the last step, we have used Lemma 5 (noting that ξ_0 = ξ_1 − Z_1 u_1).

We substitute (3.54) in (3.53) and repeat the same process for u_2:

   I_u = (2πσ²τ_1²)^{q_1/2} |A_1|^{−1/2}
      ∫···∫ exp{−Σ_{i=3}^r u_i^T u_i/(2σ²τ_i²)} I_{u_2} du_{−2},   (3.55)

where

   I_{u_2} = ∫ exp{−(1/(2σ²τ_2²)) [u_2^T u_2
      + τ_2² (ξ_2 − Z_2 u_2)^T B_1 (ξ_2 − Z_2 u_2)]} du_2
      = (2πσ²τ_2²)^{q_2/2} |A_2|^{−1/2} exp{−(1/(2σ²)) ξ_2^T B_2 ξ_2},   (3.56)

where, in the last step, we have used Lemma 5 again.

As one might easily realize, these are exactly the same steps we went through
in evaluating the integral over u_1. Repeating this procedure for the remaining
random effects u_3, ..., u_r, and noting that ξ_r = y − Xβ, we find that

   I_u = [Π_{i=1}^r (2πσ²τ_i²)^{q_i/2} |A_i|^{−1/2}]
      exp{−(1/(2σ²)) (y − Xβ)^T B_r (y − Xβ)}.   (3.58)

Using (3.58) in (3.50), we get

   m_1(y) = ∫∫ (2π)^{−(n+q)/2} (σ²)^{−(n+q+4)/2} [Π_{i=1}^r (τ_i²)^{−q_i/2}]
      [Π_{i=1}^r (2πσ²τ_i²)^{q_i/2} |A_i|^{−1/2}]
      exp{−(1/(2σ²)) (y − Xβ)^T B_r (y − Xβ)} dβ dσ² dP(τ_J²),   (3.59)

which can be rewritten as

   m_1(y) = (2π)^{−n/2} ∫∫ (σ²)^{−(n+4)/2} [Π_{i=1}^r |A_i|^{−1/2}]
      I_β dσ² dP(τ_J²),   (3.60)

where

   I_β = ∫ exp{−(1/(2σ²)) (y − Xβ)^T B_r (y − Xβ)} dβ
      = ∫ exp{−(1/(2σ²)) [(y − Xβ̂_{h1})^T B_r (y − Xβ̂_{h1})
        + (β − β̂_{h1})^T X^T B_r X (β − β̂_{h1})]} dβ
      = exp{−(1/(2σ²)) (y − Xβ̂_{h1})^T B_r (y − Xβ̂_{h1})}
        (2πσ²)^{p/2} |X^T B_r X|^{−1/2},   (3.61)

where we have used the techniques developed in Section 3.1 and used to prove
Theorem 4, and where

   β̂_{h1} = (X^T B_r X)^{−1} X^T B_r y.   (3.62)

Substituting (3.61) back in (3.60), we get

   m_1(y) = (2π)^{−(n−p)/2} ∫ |X^T B_r X|^{−1/2} [Π_{i=1}^r |A_i|^{−1/2}]
      I_{σ²} dP(τ_J²),   (3.63)

where

   I_{σ²} = ∫ (σ²)^{−(n−p+4)/2}
      exp{−(1/(2σ²)) (y − Xβ̂_{h1})^T B_r (y − Xβ̂_{h1})} dσ²
      = Γ((n−p+2)/2) [½ (y − Xβ̂_{h1})^T B_r (y − Xβ̂_{h1})]^{−(n−p+2)/2},

which, when substituted in (3.63), leads to

   m_1(y) = (2π)^{−(n−p)/2} Γ((n−p+2)/2)
      ∫ |X^T B_r X|^{−1/2} [½ (y − Xβ̂_{h1})^T B_r (y − Xβ̂_{h1})]^{−(n−p+2)/2}
        [Π_{i=1}^r |A_i|^{−1/2}] dP(τ_J²).   (3.64)

A similar line of argument gives

   m_0(y) = (2π)^{−(n−p)/2} Γ((n−p+2)/2)
      ∫ |X^T B_{r−1} X|^{−1/2} [½ (y − Xβ̂_{h0})^T B_{r−1} (y − Xβ̂_{h0})]^{−(n−p+2)/2}
        [Π_{i=1}^{r−1} |A_i|^{−1/2}] dP(τ_K²),   (3.65)

where

   β̂_{h0} = (X^T B_{r−1} X)^{−1} X^T B_{r−1} y.

Combining (3.64) and (3.65), we arrive at (3.32), and this completes the
proof. □
To appreciate the computational advantage of the hierarchical model, one needs
to compare Theorem 4 and Theorem 5. The expression for the Bayes Factor in
Theorem 4 requires the inversion of n × n matrices. As we know from the
numerical analysis literature (see, for example, Golub and Van Loan, 1989),
matrix inversion requires on the order of n³ operations and as such is
computationally expensive. However, the Bayes Factor in Theorem 5 only requires
inversions of the matrices A_j, whose dimensions are of the order of q_j.
Typically, n (the number of observations) is much larger than q_j (the number
of levels of the jth random effect), and this gives the expression in Theorem 5
a great advantage over that in Theorem 4. We will make more remarks about this
in the next chapter, where we tackle the issues of computation. However, in
fairness to the standard model, we note that the algebra involved was much less
tedious than for the hierarchical model.
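The agreement between the two theorems rests on two standard Woodbury and
matrix-determinant-lemma identities, B_r = V_1^{−1} and Π_i |A_i| = |V_1|, which
are easy to confirm numerically; the design matrices and variance ratios below
are made up, and this is a check rather than part of the dissertation's
derivation.

```python
import numpy as np

# Check of the identities behind the computational remark: the recursions
# (3.35)-(3.37) satisfy B_r = V1^{-1} and prod_i |A_i| = |V1|, so Theorem 5
# reproduces the Theorem 4 quantities while inverting only the small
# q_i x q_i matrices A_i.
rng = np.random.default_rng(4)
n = 12
Zs = [np.kron(np.eye(3), np.ones((4, 1))),            # q_1 = 3 levels
      rng.integers(0, 2, size=(n, 2)).astype(float)]  # q_2 = 2, arbitrary incidence
tau2s = [0.8, 1.7]

B = np.eye(n)                                         # B_0 = I
det_prod = 1.0
for t2, Z in zip(tau2s, Zs):
    A = np.eye(Z.shape[1]) + t2 * Z.T @ B @ Z         # (3.36): q_i x q_i only
    det_prod *= np.linalg.det(A)
    B = B - t2 * B @ Z @ np.linalg.solve(A, Z.T @ B)  # (3.37)

V1 = sum(t2 * Z @ Z.T for t2, Z in zip(tau2s, Zs)) + np.eye(n)
assert np.allclose(B, np.linalg.inv(V1))
assert np.isclose(det_prod, np.linalg.det(V1))
```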
3.5 Missing Observations
The subject of missing observations has been a concern for practitioners for a
long time because of the regularity with which they occur. Even the most
carefully designed experiment under strictly controlled conditions can lose
experimental subjects during the course of the experiment. In studies dealing
with living subjects, which are common in fields like psychology, biology,
medicine and agriculture, the phenomenon of missing observations is much more
common. This partly explains the existence of a lengthy literature on this
topic, and we do not intend to give a complete treatment here. There have been
a variety of proposed solutions, ranging from the naive (discarding the data
associated with missing observations), to the computationally intensive
(multiple imputation), to the analytically challenging (performing an exact
mathematical analysis). The discussion of missing observations is further
complicated by the apparent lack of a common taxonomy. We will loosely follow
the terminology used by Searle et al. (1992), but several of the literature
reviews that we refer to here introduce their own taxonomies. Another confusing
factor is the approach which considers all unbalanced data as missing, as
adopted by Dodge (1985), for example.
One of the earlier reviews of missing observations in the literature is given
by Afifi and Elashoff (1966). They review the work initiated by the ideas of
Allen and Wishart (1930) and Yates (1933). Later, Hartley and Hocking (1971)
provided a taxonomy for incomplete data problems and developed estimation
techniques based on the likelihood principle for normal models, including the
linear model, with missing observations. Anderson, Basilevsky and Hum (1983)
provide their own taxonomy and consider the problem of missing observations in
planned experiments, multivariate parameter estimation and least squares
regression procedures. The most recent literature survey is by Little (1992).
His focus is on regression models with missing values of covariates, but he
reviews contemporary approaches to the topic as well. He, too, provides his own
taxonomy of methods.
The concept of identifiability plays an important role in the discussion of
missing observations. A statistical model with parameters θ and observables y
is said to be identifiable if distinct values of θ correspond to distinct
likelihood functions P(y | θ); that is, if θ ≠ θ′, then P(y | θ) is not the same
function as P(y | θ′).

As an example, consider a 2 × 2 experiment with one observation per cell where
we want to model both the main effects and the interaction. Using (2.4), we see
that

   V = σ_1² Z_1 Z_1^T + σ_2² Z_2 Z_2^T + σ_12² Z_12 Z_12^T + σ² I,

where σ_1² and σ_2² are the main effect variances, σ_12² is the interaction
variance and σ² is the error variance. Also, the Z's are the corresponding
incidence matrices. A quick calculation assures us that Z_12 = I, so

   V = σ_1² Z_1 Z_1^T + σ_2² Z_2 Z_2^T + (σ_12² + σ²) I.   (3.66)

Letting θ = (σ_1², σ_2², σ_12², σ²), we see that two different values of θ such
that σ_12² + σ² is constant will yield the same V and hence the same likelihood
function, so this model is not identifiable. It is sometimes said that the
interaction is confounded with the error variance, expressing the fact that we
cannot distinguish the two parameters; the reasoning behind this terminology
should be clear from (3.66). At this point, let us observe that if two effects
(fixed or random) have the same incidence matrix, the mixed linear model will
not be identifiable.
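The confounding in (3.66) is easy to exhibit numerically: two parameter vectors
with the same σ_12² + σ² produce the same V. The 2 × 2 incidence matrices below
follow the cell ordering (1,1), (1,2), (2,1), (2,2).

```python
import numpy as np

# Numeric illustration of the non-identifiability in (3.66): with one
# observation per cell, the interaction incidence matrix is I, so only the
# sum sigma_12^2 + sigma^2 enters the covariance V.
Z1 = np.kron(np.eye(2), np.ones((2, 1)))   # row main-effect incidence
Z2 = np.kron(np.ones((2, 1)), np.eye(2))   # column main-effect incidence

def V(s1, s2, s12, se):
    return s1 * Z1 @ Z1.T + s2 * Z2 @ Z2.T + (s12 + se) * np.eye(4)

# same sigma_12^2 + sigma^2 => same covariance, hence same likelihood
assert np.allclose(V(1.0, 2.0, 0.5, 1.5), V(1.0, 2.0, 1.0, 1.0))
```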
In the context of designed experiments with a balanced structure, a few missing
observations will result in an unbalanced design, which may be undesirable for
several purposes. In this case, the most common practice is to "estimate" the
missing observations and use the estimated values as if they were the actually
observed ones. Adjustments then have to be made to the degrees of freedom,
36
subtracting one for each estimated observation. This method should be used only
when there are few missing observations indeed, since the degrees of freedom ad
justment may result in a substantial loss of power in the hypothesis tests: or create
a problem with identifiabifity. One should also realize that the validity of infer
ences is now conditioned on that the specified model is true, since the missing
observations are generated according to the model and then used as if the\' were
actually observed. One also makes the additional (and somewhat unreafistic in
some cases) assumption that the distribution of the observed data does not change
in the presence of the missing ones. As a result of these considerations, one should
treat the results of such analyses as only approximate.
In this section, we use the term missing observation to refer to a datum that
we had planned to observe but were unable to, because of uncontrollable
factors in the experimental environment. So, as long as the model remains
identifiable, an experiment where we were already planning for unbalanced data
does not create any further problems for the Bayes Factor, and all of the discussions
in the previous sections apply without exception. It is also not a problem
if our planned-to-be-balanced experiment turns out to be unbalanced because
of the missing values (assuming, of course, that the model is still identifiable), since
the BF works equally well in both balanced and unbalanced situations. However,
sometimes an entire cell is missing, and this leads to difficulties. In such cases,
estimation (in the sense of the previous paragraph) of the missing cell means becomes
necessary and the analysis is carried out using these estimated values. Most of
the literature on handling missing cells in the mixed model is concerned with
parameter estimation, and not much has been done about hypothesis testing. The
distinction between "all-cells-filled" and "some-cells-empty" data is most helpful
for linear models, as noted by Searle (1987), but does not facilitate the analysis in
mixed models.
If, for one of the reasons explained in the previous discussion, estimation of
individual observations or cell means becomes necessary, there are three main routes
which one may take: Least Squares, Maximum Likelihood and Bayesian. This is
also the order in which they have appeared in the literature historically. The process of
using the estimated values in place of the observed values is called "imputation."
Least Squares techniques have been known and applied for a long time, but, as
concluded by Little (1992), they are usually outperformed by likelihood and Bayesian
methods. The method of multiple imputation (Rubin, 1987) uses the predictive
density of the missing observations, based on the EM algorithm for maximum
likelihood estimation, to generate a set of possible values for the missing observations,
and performs the analysis using these generated values. It is inherently a computer-intensive
approach, but seems to have dominated the area, at least in survey research.
Little and Rubin (1987) extend the method of multiple imputation to
several other statistical models.
A Bayesian approach to imputation is advocated by Tanner and Wong (1987);
it models the missing observations as random variables with a prior distribution.
Since the missing values are unobservable to the experimenter, this approach is very
sensible from a Bayesian perspective. Their device, called data augmentation, has
a distinctive Markov Chain Monte Carlo flavor. Later, Tanner (1993) established
data augmentation as a special case of Gibbs sampling. In data augmentation,
one first generates values from the conditional distribution of the parameters given the
observed and missing data. Then a sample of imputed values is generated from
the sampling distribution of the missing data, that is, the conditional probability
distribution of the missing data given the parameters and the observed data. This
iterative scheme is continued until some convergence is reached.
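The two conditional draws of data augmentation can be sketched on a toy problem. The example below is hypothetical (Python, normal data with known unit variance, an assumed N(0, 100) prior on the mean, and two missing values); Tanner and Wong's scheme applies far more generally, but the alternation between the two steps is the same:

```python
import random

random.seed(1)

observed = [4.9, 5.0, 5.1, 4.4, 4.2, 4.5, 4.0, 4.1]  # observed data
n_miss = 2                # number of planned-but-missing observations
sigma2 = 1.0              # known error variance
prior_mean, prior_var = 0.0, 100.0

imputed = [sum(observed) / len(observed)] * n_miss    # crude starting values
mu_draws = []
for sweep in range(2000):
    # Step 1: draw mu from its conditional given observed + imputed data.
    data = observed + imputed
    n = len(data)
    post_var = 1.0 / (1.0 / prior_var + n / sigma2)
    post_mean = post_var * (prior_mean / prior_var + sum(data) / sigma2)
    mu = random.gauss(post_mean, post_var ** 0.5)
    # Step 2: re-impute the missing data from their sampling distribution
    # given the parameter (and, trivially here, the observed data).
    imputed = [random.gauss(mu, sigma2 ** 0.5) for _ in range(n_miss)]
    mu_draws.append(mu)

# After burn-in, mu_draws approximate the posterior of mu given the observed data.
burned = mu_draws[500:]
print(sum(burned) / len(burned))  # close to the observed-data mean, about 4.5
```

The stationary distribution of the chain has the correct observed-data posterior as its marginal for the parameter, which is what makes the imputed values usable for inference.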
There is another way of modeling variance components, advocated by Hocking,
Green and Bremer (1989). The main theme of their approach is interpreting the
variance components as covariances. Although this idea is motivated by the pursuit
of a model which would avoid negative variance estimates, it gives us a tool in the
case of missing data, in both the all-cells-filled and some-cells-empty cases. Their
focus, however, has been mostly on estimation.
We conclude this section by noting that the BF we have proposed suffers no difficulty
in the case of missing observations as long as the model remains identifiable.
If the model is no longer identifiable, or if the analyst prefers to keep the balance
and symmetry in the experiment, then one can use any of the imputation methods
to find estimates of the missing observations and/or cell means. The imputed data
set, then, can be used to calculate the Bayes Factor. In the case of imputations,
the caveats discussed above about the results of the analysis being only approximately
true should be kept in mind. Of course, a novel Bayesian approach
to the problem would be to devise an algorithm that integrates the calculation
of the BF with the data augmentation process, automatically taking care of
the missing observations, but this task is beyond the purposes of this study.
CHAPTER IV
NUMERICAL METHODS
The Bayes Factor, as given by either Theorem 4 or Theorem 5, involves analytically
intractable multi-dimensional integrals in both the numerator and the denominator.
Hence, for calculations, one must resort to numerical methods. Classical
techniques of numerical integration are based on quadrature methods, and their
statistical applications are considered in a recent review by Kahaner (1991). In higher
dimensions, it has been established that quadrature methods are outperformed by
Monte Carlo techniques (Niederreiter, 1992). We will, therefore, focus on Monte
Carlo estimation of the integrals in this chapter. When quadrature methods are
used for numerical evaluation of an integral, it is generally said that the integral
is approximated; in the language of Monte Carlo, however, estimation of integrals
seems to be the more common terminology.
In Section 4.1, we will provide a general framework and notation in which
the Monte Carlo estimators we will use can be defined. In Section 4.2, we will
introduce a small data set and a model to work with. In subsequent sections, we
will implement the estimators we suggest in Section 4.1. In particular, Section 4.3
will be concerned with simple random sampling, Section 4.4 will consider Latin
hypercube sampling and Section 4.5 will deal with Gibbs sampling.
4.1 Monte Carlo Estimation of the Bayes Factor
We will first introduce a framework in which all of the Monte Carlo estimators
considered in this chapter can be examined easily. Our approach and notation are
similar to those of Raftery (1996).
The key quantity in the estimation of the BF is the marginal likelihood. In
what follows, we will work with m(y), a generic marginal likelihood of the form

  m(y) = ∫ L(τ²) P(τ²) dτ². (4.1)

Here L(·) is either a likelihood function or what we will call a partially integrated
likelihood. If we let θ = (φ, ψ) be the vector of parameters, with ψ denoting
the nuisance parameters, then we define the partially integrated likelihood as
∫ l(φ, ψ) P(φ, ψ) dψ, where l(φ, ψ) denotes the likelihood function for the model
and P(φ, ψ) is the prior.
Let g(τ²) be a positive integrable function such that c·g(τ²) is a probability density
function, where c⁻¹ = ∫ g(τ²) dτ², a possibly unknown integration constant.
Then (4.1) can be rewritten as

  m(y) = ∫ L(τ²) [P(τ²) / (c g(τ²))] c g(τ²) dτ². (4.2)

If we have a sample of size T from c g(τ²), which we will denote by τ²₍ᵢ₎ for
i = 1, …, T, then we can use the following estimator for m(y):

  m̂(y) = (1/T) Σᵢ₌₁ᵀ L(τ²₍ᵢ₎) P(τ²₍ᵢ₎) / (c g(τ²₍ᵢ₎)) (4.3)
       = ‖LP/cg‖_g, (4.4)

where the g in the subscript refers to the density (or the kernel of the density)
from which the samples are obtained. We will refer to T as the simulation size (as
opposed to the sample size, which we will use for the number of observations in a data
set). This estimator is simulation-consistent, that is, m̂(y) converges in probability
to m(y) as T → ∞.
In some cases (including ours), c cannot be found analytically. We can then
estimate it by using the following observations:

  ∫ P(τ²) dτ² = 1,
  ∫ c g(τ²) dτ² = 1,
  ∫ [P(τ²)/g(τ²)] c g(τ²) dτ² = c ∫ P(τ²) dτ² = c.

So a simulation-consistent estimator of c is ĉ = ‖P/g‖_g. Substituting this back in
(4.4), we get

  m̂_g(y) = ‖LP/g‖_g / ‖P/g‖_g. (4.5)

We will call this estimator the general importance sampling estimator, following
Newton and Raftery (1994), with importance sampling function g.
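To make (4.5) concrete, consider a hypothetical one-dimensional problem where the marginal likelihood is known in closed form: y ~ N(θ, 1) with prior θ ~ N(0, 1), so that m(y) is the N(0, 2) density at y. The Python sketch below (the dissertation's actual computations use SAS/IML) deliberately specifies g only up to its normalizing constant, so that ĉ = ‖P/g‖_g is exercised exactly as in the derivation above:

```python
import math
import random

random.seed(7)

y = 1.3
L = lambda t: math.exp(-0.5 * (y - t) ** 2) / math.sqrt(2 * math.pi)  # likelihood
P = lambda t: math.exp(-0.5 * t ** 2) / math.sqrt(2 * math.pi)        # prior
g = lambda t: math.exp(-t ** 2 / 8.0)       # unnormalized kernel of N(0, 4)

T = 100_000
draws = [random.gauss(0.0, 2.0) for _ in range(T)]  # sample from c*g = N(0, 4)

num = sum(L(t) * P(t) / g(t) for t in draws) / T    # ||LP/g||_g
den = sum(P(t) / g(t) for t in draws) / T           # ||P/g||_g, estimates c
m_hat = num / den                                   # equation (4.5)

m_true = math.exp(-y ** 2 / 4.0) / math.sqrt(4 * math.pi)  # N(0, 2) density at y
print(m_hat, m_true)  # the estimate is close to the exact value
```

Note that the unknown constant of g cancels in the ratio; only the ability to sample from the normalized density c·g is needed.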
An obvious choice of g is P, the prior. This gives us

  m̂_P(y) = ‖L‖_P (4.6)

as the importance sampling estimator. An attentive look at (4.6) assures us that
this is the usual Monte Carlo method for evaluating integrals using the sample
average; see Hammersley and Handscomb (1964). Historically, this has been done
using a simple random sample (SRS) from P, but we will also consider an alternative,
called "Latin hypercube sampling" (LHS), suggested by McKay, Conover
and Beckman (1979). SRS and LHS will be investigated in Sections 4.3 and 4.4,
respectively.
Recent advances in Markov Chain Monte Carlo (MCMC) methods have enabled
us to produce samples from the posterior distribution with relative ease, so another
reasonable suggestion is to use the posterior as the importance sampling function,
that is, g = LP. This gives us

  m̂_LP(y) = ‖1/L‖⁻¹_LP. (4.7)

This estimator is known as the harmonic mean estimator. Note that the importance
of incorporating an unknown c frequently comes into the picture at this
stage, since in several situations the user, having specified the prior, knows the
integrating constant for P, but not for the posterior, which is given only as
proportional to LP. The use of the harmonic mean estimator in the context of posterior
simulation has been suggested by Newton and Raftery (1994). We investigate this
estimator in Section 4.5.
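On the same hypothetical conjugate problem (y ~ N(θ, 1), θ ~ N(0, 1), posterior θ | y ~ N(y/2, 1/2)), the harmonic mean estimator (4.7) can be sketched as follows; the direct posterior draws here stand in for samples that would in practice come from MCMC:

```python
import math
import random

random.seed(11)

y = 1.3
L = lambda t: math.exp(-0.5 * (y - t) ** 2) / math.sqrt(2 * math.pi)  # likelihood

# Posterior draws: theta | y ~ N(y/2, 1/2) in this conjugate toy model.
T = 100_000
post = [random.gauss(y / 2.0, math.sqrt(0.5)) for _ in range(T)]

# Harmonic mean of the likelihood over posterior draws, equation (4.7).
m_harmonic = 1.0 / (sum(1.0 / L(t) for t in post) / T)
m_true = math.exp(-y ** 2 / 4.0) / math.sqrt(4 * math.pi)  # exact marginal
print(m_harmonic, m_true)
```

The harmonic mean estimator is simulation-consistent but can be very heavy-tailed, since a few draws with small likelihood dominate the sum; this is worth keeping in mind when interpreting its output.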
Let

  L₀ = [∏ᵢ₌₁^{r−1} |Aᵢ|^{-1/2}] (XᵀB_{r−1}X)^{-1/2} [(y − Xμ̂₀)ᵀ B_{r−1} (y − Xμ̂₀)]^{-(n+1)/2}, (4.8)
  L₁ = [∏ᵢ₌₁^{r} |Aᵢ|^{-1/2}] (XᵀB_r X)^{-1/2} [(y − Xμ̂₁)ᵀ B_r (y − Xμ̂₁)]^{-(n+1)/2}. (4.9)

These correspond to the partially integrated likelihoods of the numerator and
the denominator of the BF, given in Theorem 5. By replacing the generic L with
L₀ and P with P₀ (the prior for the reduced model) in the discussion above, one can
get the importance sampling estimator of the marginal likelihood for the reduced
model. The same can also be done for the full model. It will then follow that the
BF can be estimated by the ratio of the two estimated marginal likelihoods, that is,

  B̂ = m̂₀(y) / m̂₁(y), (4.10)

where

  m̂₀(y) = ‖L₀P₀/g‖_g / ‖P₀/g‖_g,
  m̂₁(y) = ‖L₁P₁/g‖_g / ‖P₁/g‖_g.

We will specifically spell out what these estimates will be in the following
sections whenever we evaluate them.
4.2 Data and Model
In this section, we introduce a small data set, made up for the purposes of this
study, and we suggest a model. In subsequent sections, we will employ the
Monte Carlo techniques discussed above to evaluate the Bayes Factor for the data
and model below.
The data set in Table 4.1 has two factors, both of which we assume to be
random effects. We will call those effects a and γ, a being the one with two levels
and γ being the one with three levels. We work with the following random model
for this data set:

  y_ij = μ + a_i + γ_j + e_ij. (4.11)

The GLM procedure in SAS is used to produce the ANOVA table for this data
set, along with tests for the main effects. These results are provided in Tables 4.2 and
4.3.
An investigative analysis suggests the presence of strong interaction, but to
highlight the computational strategies while keeping the details to a minimum,
we ignore this fact and work with the no-interaction model. Tables 4.2 and 4.3
summarize the results needed to perform the frequentist hypothesis tests. Roughly
speaking, a is a highly significant effect, but γ is not. Now, we will look at the same
hypothesis from a Bayesian perspective.
Table 4.1: Data (ten observations on the two random factors a and γ:
4.9, 5.0, 5.1, 4.4, 4.2, 4.5, 4.0, 4.1, 4.9, 5.1)
Table 4.2: ANOVA Table for the Model

  Source        DF   Sum of Squares   Mean Square   F Value   Pr > F
  Model          3           1.2620        0.4207      6.41   0.0267
  Error          6           0.3941        0.0657
  Corr. Total    9           1.6560
Table 4.3: ANOVA Table for Main Effects

  Source   DF   Type I SS     Mean Square   F Value   Pr > F
  a         1      0.9000          0.9000     13.70   0.0101
  γ         2      0.3620          0.1810      2.76   0.1416

  Source   DF   Type III SS   Mean Square   F Value   Pr > F
  a         1      0.6001          0.6001      9.14   0.0233
  γ         2      0.3620          0.1810      2.76   0.1416
We will test H₀: σ_γ² = 0 versus H₁: σ_γ² > 0, or equivalently H₀: τ₂² = 0 versus
H₁: τ₂² > 0. The choice of testing the γ main effect instead of a is, to some extent,
arbitrary. We are hoping that we will be able to demonstrate the "irreconcilability"
of posterior probabilities and p-values mentioned earlier. Previous experience
suggests the difference is more dramatic when the p-values are of moderate
size, instead of being very small, so at the outset, testing γ may serve that
purpose. In the sequel, τ₂² and τ_γ² will denote the same quantity, as will τ₁² and τ_a².
The model matrices can easily be found to be

  Xᵀ  = [ 1 1 1 1 1 1 1 1 1 1 ],

  Z₁ᵀ = [ 1 1 1 1 1 0 0 0 0 0
          0 0 0 0 0 1 1 1 1 1 ],

  Z₂ᵀ = [ 1 0 0 0 0 1 1 0 0 0
          0 1 1 0 0 0 0 1 0 0
          0 0 0 1 1 0 0 0 1 1 ].
The Bayes Factor is given by m₀(y)/m₁(y), where

  m₀(y) = ∫ L₀(τ₁²) P(τ₁²) dτ₁²,
  m₁(y) = ∫ L₁(τ₁², τ₂²) P(τ₁², τ₂²) dτ₁² dτ₂²,

and

  L₀ = |A₁|^{-1/2} (XᵀB₁X)^{-1/2} [(y − Xμ̂₀)ᵀ B₁ (y − Xμ̂₀)]^{-(n+1)/2}, (4.12)
  L₁ = |A₁|^{-1/2} |A₂|^{-1/2} (XᵀB₂X)^{-1/2} [(y − Xμ̂₁)ᵀ B₂ (y − Xμ̂₁)]^{-(n+1)/2}. (4.13)

The A and B matrices are given by (see Theorem 5)

  A₁ = I + τ₁² Z₁ᵀZ₁,
  B₁ = I − τ₁² Z₁A₁⁻¹Z₁ᵀ,
  A₂ = I + τ₂² Z₂ᵀB₁Z₂,
  B₂ = B₁ − τ₂² B₁Z₂A₂⁻¹Z₂ᵀB₁,
and, following (3.62),

  μ̂₀ = (XᵀB₁X)⁻¹XᵀB₁y,
  μ̂₁ = (XᵀB₂X)⁻¹XᵀB₂y.

Finally, P(τ₁²) and P(τ₁², τ₂²) are obtained by one-to-one transformations from the
intraclass correlations, whose prior distributions are selected from the Dirichlet family.
4.3 Simple Random Sampling
In order to use the Monte Carlo method as described in Section 4.1, we
need to be able to generate from the prior distributions. In the literature this has
been done mostly using simple random sampling, or SRS (a term we borrow from
statistical sampling theory to emphasize the difference between different kinds of
sampling; otherwise SRS is simply known as random sampling); however, we will also
investigate the possibility of using Latin hypercube sampling for this purpose in
the next section.
In several computer packages, a routine for generating a simple random sample
from a U(0,1) distribution is readily available. By the definition of a simple random
sample, this method produces draws that are independent of each other. Strictly
speaking, there is no way to generate independent random numbers by using a
computer; however, most of the available random number generators produce deviates
that behave as if they were independent. Still, some authors use the term
"pseudo-random numbers" to remind us of the inherently deterministic nature of
computer-generated random numbers. There is a well-established literature, given an
encyclopedic treatment by Devroye (1986), on how to transform uniform deviates
to obtain a simple random sample from a given distribution. For some applications
of simple random sampling in system simulations, see Law and Kelton (1991)
or Ripley (1987). This is the first of two methods we will use to generate random
numbers, and we will label the resulting Monte Carlo estimator B̂_SRS.
In order to draw samples from P(τ²), we use the fact that the variance ratios
are functions of the intraclass correlations, the latter being modeled by a Dirichlet
distribution a priori. To generate samples from a k-dimensional Dirichlet
distribution with parameter vector α = (α₁, …, α_{k+1})ᵀ, we use the following algorithm
reported by Devroye (1986):
1. Generate an independent sequence of random numbers {xᵢ}, i = 1, …, k+1, where
xᵢ is drawn from a gamma distribution with parameters αᵢ and 1.

2. Let x = Σᵢ₌₁^{k+1} xᵢ and yᵢ = xᵢ/x for i = 1, …, k. Then (y₁, …, y_k) is a random
sample from the k-dimensional Dirichlet distribution with parameter vector α.

There are several methods to generate an SRS from the gamma family, most of
which are summarized by Devroye (1986). For convenience, however, we prefer to use
the built-in generator RANGAM of SAS, the system that we chose to implement
SRS.
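The two steps above amount to normalizing independent gamma draws. A hypothetical Python transcription (the dissertation itself uses the RANGAM generator in SAS):

```python
import random

random.seed(3)

def dirichlet(alpha):
    """Draw (y_1, ..., y_k) from the k-dimensional Dirichlet with
    parameter vector alpha of length k + 1, via normalized gammas."""
    x = [random.gammavariate(a, 1.0) for a in alpha]  # step 1: Gamma(alpha_i, 1)
    total = sum(x)
    return [xi / total for xi in x[:-1]]              # step 2: y_i = x_i / x

# One draw of the two intraclass correlations under a uniform
# Dirichlet(1, 1, 1) prior, as used for the example data set.
pi = dirichlet([1.0, 1.0, 1.0])
print(pi)  # two components, each in (0, 1), summing to less than 1
```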
As we mentioned before, it is quite easy to get an idea of the accuracy of the
Monte Carlo method. One possibility is finding the joint asymptotic distribution
of m̂₀ and m̂₁, as we will now do.

Theorem 6 Let m̂₀ and m̂₁ be as defined above. Then, as T → ∞,

  T^{1/2} [ (m̂₀, m̂₁)ᵀ − (μ₀, μ₁)ᵀ ] →d N( 0, Σ ),  where  Σ = [ σ₀²    ρσ₀σ₁
                                                                 ρσ₀σ₁  σ₁²   ].
Proof. Follows from the multivariate central limit theorem. □

We will now state two results and defer their proofs for a brief moment, until
we mention Theorem 9, which is a vital tool in the proofs of Theorems 7 and 8.
The next theorem establishes the asymptotic distribution of B̂_SRS.

Theorem 7 Let B̂_SRS = m̂₀/m̂₁. Then B̂_SRS is asymptotically distributed as
N(μ_SRS, σ²_SRS), where

  μ_SRS = μ₀/μ₁

and

  σ²_SRS = (1/(T μ₁²)) [ σ₀² − 2ρσ₀σ₁(μ₀/μ₁) + σ₁²(μ₀/μ₁)² ].

Finally, we suggest the following estimator for p₀:

  p̂_SRS = [ 1 + ((1 − π₀)/π₀) (1/B̂_SRS) ]⁻¹.
The asymptotic distribution of p̂_SRS is given by the following theorem:

Theorem 8 p̂_SRS is asymptotically distributed as N(μ_{p_SRS}, σ²_{p_SRS}), where

  μ_{p_SRS} = [ 1 + ((1 − π₀)/π₀) (1/μ_{B_SRS}) ]⁻¹

and

  σ²_{p_SRS} = [ ((1 − π₀)/π₀) / ( (1 − π₀)/π₀ + μ_{B_SRS} )² ]² σ²_{B_SRS}.
Our main tool for proving Theorems 7 and 8, which establish the asymptotic
distributions of the Monte Carlo estimates of the BF and p₀, will be the next theorem,
called the δ-method. A proof of Theorem 9 can be found in Rao (1973).

Theorem 9 Let S_T be a k-dimensional statistic (S_{1T}, …, S_{kT}) whose asymptotic
distribution is k-variate normal with mean (θ₁, …, θ_k) and covariance matrix Σ.
Let h₁, …, h_q be q functions of k variables, each of them totally differentiable.
Then the joint asymptotic distribution of T^{1/2} [ h_i(S_{1T}, …, S_{kT}) − h_i(θ₁, …, θ_k) ],
for all i = 1, …, q, is q-variate normal with mean zero and covariance matrix EΣEᵀ,
where the elements of the q × k matrix E are given by E_ij = ∂h_i/∂θ_j.
Proof (Theorem 7): Let h(m̂₀, m̂₁) = m̂₀/m̂₁, k = 2 and q = 1. Then, by
using Theorems 6 and 9, we state that

  T^{1/2} [ m̂₀/m̂₁ − μ₀/μ₁ ] →d N(0, EΣEᵀ),

where

  Σ = [ σ₀²    ρσ₀σ₁
        ρσ₀σ₁  σ₁²   ]

and

  E = [ ∂h/∂t₁  ∂h/∂t₂ ] evaluated at (t₁, t₂) = (μ₀, μ₁).

Performing the partial differentiation, we have

  E = [ 1/μ₁   −μ₀/μ₁² ].

Then

  μ_B = μ₀/μ₁

and

  T σ_B² = EΣEᵀ,

so we have

  σ_B² = (1/(T μ₁²)) [ σ₀² − 2ρσ₀σ₁(μ₀/μ₁) + σ₁²(μ₀/μ₁)² ],

which completes the proof. □
Proof (Theorem 8): The proof follows the central theme (and notation) of the
previous proof. We apply Theorems 6 and 9 again, this time with k = q = 1 and

  h(B) = [ 1 + ((1 − π₀)/π₀) (1/B) ]⁻¹.

Differentiating h, we get

  dh/dB = ((1 − π₀)/π₀) / ( (1 − π₀)/π₀ + B )².

This leads to

  μ_{p₀} = h(μ_B) = [ 1 + ((1 − π₀)/π₀) (1/μ_B) ]⁻¹

and

  σ²_{p₀} = [ ((1 − π₀)/π₀) / ( (1 − π₀)/π₀ + μ_B )² ]² σ_B²,

and the proof is complete. □
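Theorem 7 can be sanity-checked by simulation: draw correlated pairs with known moments, form the ratio of sample means over many independent replications, and compare the empirical spread of the ratio with the δ-method standard error. The sketch below is a hypothetical Python illustration (the moments chosen are arbitrary), not part of the dissertation's SAS/IML code:

```python
import math
import random

random.seed(5)

mu0, mu1 = 2.0, 4.0          # population means of the two averaged quantities
s0, s1, rho = 1.0, 1.5, 0.8  # standard deviations and correlation
T = 400                      # simulation size per replication
reps = 2000                  # number of independent replications

ratios = []
for _ in range(reps):
    sum0 = sum1 = 0.0
    for _ in range(T):
        z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
        x0 = mu0 + s0 * z1                                        # first coordinate
        x1 = mu1 + s1 * (rho * z1 + math.sqrt(1 - rho**2) * z2)   # correlated second
        sum0 += x0
        sum1 += x1
    ratios.append((sum0 / T) / (sum1 / T))   # B_hat = m0_hat / m1_hat

mean_r = sum(ratios) / reps
emp_sd = math.sqrt(sum((r - mean_r) ** 2 for r in ratios) / (reps - 1))

muB = mu0 / mu1
theory_sd = math.sqrt((s0**2 - 2*rho*s0*s1*muB + (s1*muB)**2) / (T * mu1**2))
print(emp_sd, theory_sd)  # the two standard errors agree closely
```

The high positive correlation between numerator and denominator (ρ = 0.8 here) visibly shrinks the standard error relative to the uncorrelated case, anticipating the behavior observed in Table 4.4.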
The method of simple random sampling is implemented using the SAS/IML
code provided in Appendix A. We ran this program for several different values
of T and report the results in Tables 4.4, 4.5 and 4.6. To estimate the unknown
parameters σ₀², σ₁² and ρ, we use the corresponding sample variances and the sample
correlation coefficient.
We see that the estimates are sufficiently stable, especially for simulation sizes of 10⁴
or larger. We have a very high positive correlation between the numerator and
the denominator. A glance at Theorem 7 assures us that this high positive correlation
plays an important role in considerably reducing the standard error of our
estimates.
Table 4.4, by itself, provides only point estimates for B. To go a step further,
we implement the results of the previous section and form 95% confidence
intervals for B and p_SRS. Those confidence intervals are based on the asymptotic
distributions established in Theorems 7 and 8, so their coverage probability is only
asymptotically correct. However, since our simulation sizes are large enough, we
accept them as satisfactory approximations. The confidence intervals are reported
in Tables 4.5 and 4.6. In calculating the values of p̂_SRS, we assume π₀ = 0.5, where π₀ is the
prior probability of H₀, which can be thought of as a priori indifference between the
two hypotheses.
Tables 4.5 and 4.6 give us a reasonable picture from which to draw conclusions. For the
Bayes Factor, when T = 10² or T = 10³, the standard error of the Monte Carlo
estimate is high and the corresponding confidence intervals are too wide to work
with. However, for T ≥ 10⁴, the standard errors are satisfactory and the confidence
intervals shrink to acceptable lengths. For this example, we estimate the BF to be
0.59. A similar situation occurs for p₀. For T ≤ 10³, we have high standard errors
and wide confidence intervals, but for T ≥ 10⁴, the results become acceptable. We
estimate p₀ to be 0.37 in this example. This is the "irreconcilability" of p-values
and posterior probabilities mentioned previously. For the same hypothesis, the
p-value of the F-test was 0.1416 (Table 4.3). We see that there is a huge difference
between those two measures.
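The step from the estimated Bayes Factor to the posterior probability is the estimator p̂ = [1 + ((1 − π₀)/π₀)(1/B̂)]⁻¹ introduced before Theorem 8; with π₀ = 0.5 it reproduces the figures just quoted. A hypothetical Python rendering:

```python
# Posterior probability of H0 from a Bayes Factor B = m0(y)/m1(y),
# with prior probability pi0 on H0 (pi0 = 0.5 means prior indifference).
def posterior_p0(B, pi0=0.5):
    return 1.0 / (1.0 + ((1.0 - pi0) / pi0) / B)

p0 = posterior_p0(0.59)
print(round(p0, 2))  # 0.37, against a frequentist p-value of 0.1416
```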
Monte Carlo integration using simple random sampling has been studied and
applied extensively in the literature. We have already mentioned that it outperforms
the quadrature methods in the context of multi-dimensional integration;
however, objections have been raised as to the efficiency of the process. The most
comprehensive treatment of these objections is given by McCulloch and Rossi
(1991). They carefully study situations where the posterior is concentrated
around a single mode (not so uncommon with the use of reference priors and a
moderate amount of data) and demonstrate that this will lead to the Monte Carlo
integration estimates being dominated by a few large values of the likelihood. This
leads to a high variance for the estimation process, resulting in considerable inefficiency.
In the next sections we will examine two alternative estimators that attempt
to increase the efficiency.
Table 4.4: Simple Random Sampling

  T     m̂₀          m̂₁          B̂           σ̂₀²         σ̂₁²         ρ̂
  10²   0.1695480   0.2974386   0.5700270   0.0341101   0.1182261   0.8499019
  10³   0.1692819   0.2757402   0.6139182   0.0340637   0.1087973   0.8071927
  10⁴   0.1681263   0.2842017   0.5915737   0.0337364   0.1117144   0.8078687
  10⁵   0.1679399   0.2844701   0.5902148   0.0334786   0.1118142   0.8225059
  10⁶   0.1675505   0.2835661   0.5908692   0.0335596   0.1114368   0.8172925
Table 4.5: SRS for B

  T     B̂_SRS    SE(B̂_SRS)   95% Confidence Interval   Length of Interval
  10²   0.5701   0.0346      (0.5009, 0.6393)          0.1372
  10³   0.6139   0.0137      (0.5865, 0.6413)          0.0548
  10⁴   0.5916   0.0041      (0.5834, 0.5998)          0.0164
  10⁵   0.5902   0.0012      (0.5878, 0.5926)          0.0048
  10⁶   0.5909   0.0004      (0.5901, 0.5917)          0.0016
Table 4.6: SRS for p₀

  T     p̂_SRS    SE(p̂_SRS)   95% Confidence Interval   Length of Interval
  10²   0.3631   0.0145      (0.3341, 0.3921)          0.0580
  10³   0.3804   0.0046      (0.3712, 0.3896)          0.0184
  10⁴   0.3717   0.0016      (0.3685, 0.3749)          0.0064
  10⁵   0.3711   0.0005      (0.3701, 0.3721)          0.0020
  10⁶   0.3714   0.0001      (0.3712, 0.3716)          0.0004
4.4 Latin Hypercube Sampling
In this section we will consider "Latin hypercube sampling" (LHS), which was
introduced by McKay, Conover and Beckman (1979) and extended substantially in
a subsequent study by Iman and Conover (1980). This method originated
as an extension of "stratified sampling," which is widely used in survey research.
Later studies on number-theoretic methods for generating random variates have
established Latin hypercube sampling as a special case of "quasi-Monte Carlo" methods.
For a treatment of quasi-Monte Carlo methods, see Niederreiter (1992).
The idea behind LHS is to divide the range of the random variate into equiprobable
strata and draw a simple random sample from each of them. This approach
ensures that the range space of the random variate is covered adequately,
and the decrease in variance can be understood by means of an analogy with the
intuition behind stratified sampling.
If we are working with a random vector, each component of the vector is treated
by the same method. Then the components are randomly matched. This random
matching ensures that the entire range of the vector is covered appropriately.
As noted by McKay, Conover and Beckman (1979), "One advantage of the
Latin hypercube sampling appears when the output is dominated by only a few
components of the input. This method ensures that each of those components is
represented in a fully stratified manner, no matter which components might turn
out to be important" [p. 239].
This advantage is especially important for statistical integration problems,
where the integrands are typically highly peaked around a maximum point, thus
leading to a situation where numerical approximations are dominated by a few
values of the variable of integration. So, we would expect LHS to improve over
SRS, considering the issues raised by McCulloch and Rossi (1991) mentioned at
the end of the previous section.
There are several ways of obtaining a Latin hypercube sample from the Dirichlet
distribution. We choose to generate a Latin hypercube sample over the unit
hypercube (in this case, having two variance components, the unit hypercube is
simply the unit square) and then convert it into a sample from the Dirichlet
family. In order to obtain a Latin hypercube sample of size T on the unit square,
we devise the following algorithm:
1. Divide the unit interval [0,1) into T equally spaced subintervals.

2. Randomly sample two points, η and ν, within each interval. Call them η_i and
ν_i when they are sampled from the i-th interval, that is, from [(i−1)/T, i/T).

3. At this point we have η₁, …, η_T and ν₁, …, ν_T. Let i = 1 and consider η₁.
Pick an integer j at random such that 1 ≤ j ≤ T and let x₍₁₎ = (η₁, ν_j) be
the first point in the Latin hypercube sample. Now withhold j from further
consideration, increase i by 1 and repeat this step until all the η_i are considered.
This gives us a Latin hypercube sample over the unit square that we
label x₍₁₎, …, x₍T₎.
Having obtained a Latin hypercube sample over the unit square, we convert it into
a sample from the Dirichlet family as follows. First let y = (y₁, y₂) be defined by
y₁,₍ᵢ₎ = 1 − √x₁,₍ᵢ₎ and y₂,₍ᵢ₎ = x₂,₍ᵢ₎. It follows that y₁,₍ᵢ₎ is Beta(1,2) and y₂,₍ᵢ₎ is
Beta(1,1). Now consider z₁,₍ᵢ₎ = y₁,₍ᵢ₎ and z₂,₍ᵢ₎ = y₂,₍ᵢ₎(1 − y₁,₍ᵢ₎). Using a result of
Aitchison (1963), we state that z₍ᵢ₎ = (z₁,₍ᵢ₎, z₂,₍ᵢ₎) has a Dirichlet distribution with
all parameters equal to 1. This approach is necessary since we are working
with a Latin hypercube sample, which does not satisfy the iid assumption.
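The three steps, together with the Beta transformations just described, can be sketched as follows (a hypothetical Python illustration; the dissertation's implementation is the SAS/IML code of Appendix B):

```python
import math
import random

random.seed(9)

def lhs_dirichlet(T):
    """Latin hypercube sample of size T from the Dirichlet(1, 1, 1),
    via stratified uniforms on the unit square."""
    # Steps 1-2: one uniform draw inside each of the T equiprobable strata.
    eta = [(i + random.random()) / T for i in range(T)]
    nu = [(i + random.random()) / T for i in range(T)]
    random.shuffle(nu)               # step 3: random matching of coordinates
    sample = []
    for x1, x2 in zip(eta, nu):
        y1 = 1.0 - math.sqrt(x1)     # Beta(1, 2) marginal
        y2 = x2                      # Beta(1, 1) marginal
        sample.append((y1, y2 * (1.0 - y1)))  # stick-breaking to the Dirichlet
    return sample

z = lhs_dirichlet(1000)
mean_z1 = sum(p[0] for p in z) / len(z)
print(mean_z1)  # near 1/3, the Dirichlet(1,1,1) first-component mean
```

Each coordinate of the underlying uniform sample hits every stratum exactly once, which is the property the variance reduction relies on.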
The algorithm as described above works only for models with two variance
components. To describe the corresponding algorithm with r components precisely
would require further definitions involving random permutations of integers,
although the generalization is conceptually straightforward. Loosely speaking,
one first has to obtain a Latin hypercube sample, x₁,₍ᵢ₎, …, x_r,₍ᵢ₎, where i = 1, …, T,
over the r-dimensional unit hypercube. Then, via a suitable transformation,
y₁,₍ᵢ₎, …, y_r,₍ᵢ₎ are obtained such that y_j,₍ᵢ₎ is Beta(1, r − j + 1). Finally,

  z_j,₍ᵢ₎ = y_j,₍ᵢ₎ (1 − Σₖ₌₁^{j−1} z_k,₍ᵢ₎),   i = 1, …, T,

constitute the desired Latin hypercube sample from the Dirichlet distribution.
In order to use Theorems 7 and 8, we need to find the standard errors of m̂₀ and
m̂₁. Since SRS provided us an iid sample, a good estimate was given by the
corresponding sample standard deviations. A Latin hypercube sample, however, lacks
this inherent randomness mechanism, so the only way we can obtain these estimates is
to use replications. We first simulate T₁ values and calculate the corresponding
estimate, then replicate this process T₂ times. For purposes of efficiency comparison,
T₁ × T₂ should be chosen equal to one of the values of T implemented for SRS. We had
Table 4.7: LHS for B

  T₁      T₂       B̂_LHS    SE(B̂_LHS)   95% Confidence Interval   Length of Interval
  50000   2        0.5912   0.0004      (0.5862, 0.5962)          0.0100
  20000   5        0.5909   0.0003      (0.5901, 0.5917)          0.0016
  10000   10       0.5913   0.0002      (0.5908, 0.5918)          0.0010
  2000    50       0.5906   0.0003      (0.5900, 0.5912)          0.0012
  1000    100      0.5907   0.0004      (0.5899, 0.5915)          0.0016
  400     250      0.5901   0.0006      (0.5889, 0.5913)          0.0024
  250     400      0.5919   0.0009      (0.5901, 0.5937)          0.0036
  100     1000     0.5915   0.0013      (0.5889, 0.5941)          0.0052
  1       100000   0.5902   0.0012      (0.5878, 0.5926)          0.0048
Table 4.8: LHS for p₀

  T₁     T₂       p̂_LHS    SE(p̂_LHS)   95% Confidence Interval   Length of Interval
  1000   100      0.3713   0.0001      (0.3711, 0.3715)          0.0004
  400    250      0.3707   0.0003      (0.3701, 0.3714)          0.0012
  250    400      0.3710   0.0004      (0.3702, 0.3718)          0.0016
  100    1000     0.3715   0.0006      (0.3703, 0.3727)          0.0024
  1      100000   0.3711   0.0005      (0.3701, 0.3721)          0.0020
observed in Section 4.3 that the Monte Carlo estimates stabilized for T ≥ 10⁴, so we
choose T = 10⁵ here. We then face the problem of choosing T₁ and T₂ such that
T = T₁ × T₂. There are no clear guidelines on how to do this in an optimal
manner, so we try different values of T₁ and T₂ such that T = T₁ × T₂ = 10⁵.
The results of Latin hypercube sampling with different values of T₁ and T₂ are
reported in Tables 4.7 and 4.8. The SAS/IML code that we used to obtain
these results is given in Appendix B.
Some caveats are in order before we interpret the results of Table 4.7. First of
all, we only report the results regarding the standard errors, since LHS does not
introduce any bias. All the point estimates for the BF and p₀ display a small random
fluctuation, and their exclusion should not injure the validity of our conclusions
regarding the efficiency of the computational strategies. Also, we should keep in mind
that when T₁ = 1, we subdivide the range of our random variable into one part
only; hence we are effectively performing an SRS. So the last line of Table 4.7 can
be used for comparing the efficiencies of SRS and LHS. Finally, there is another
issue that needs to be kept in mind regarding the last column, which represents the
length of the corresponding confidence interval. Recall that we are estimating
the ratio of two simulated statistics (that is, B̂ = m̂₀/m̂₁) and the standard error
of the ratio is given by Theorem 7, whose proof relies heavily on the assumption
that both m̂₀ and m̂₁ are normally distributed. For small T₂ in Table 4.7 this
assumption is violated, and the best we can say is that they have t-distributions
with T₂ − 1 degrees of freedom. In this situation the probability theory needed
to calculate the distribution of B̂ is beyond our reach. So the results of the last column for
small T₂ should be considered ad hoc, since we have used the critical values of
t_{T₂−1}.
Looking at Tables 4.7 and 4.8, one observes that the choice of T₁ and T₂ is
critical for reducing the standard error of the LHS estimator. In particular, it seems
that smaller values of T₂ improve the efficiency. However, the first three cases,
where T₂ is 2, 5 and 10, must be used with caution because of the remark made in
the previous paragraph. We feel that the normality assumptions required for Theorem
7 are satisfied for T₂ ≥ 50; specifically, we suggest using T₂ = 100. After making
this observation, we report the results of LHS for p₀ for fewer combinations of T₁
and T₂ (see Table 4.8), since we face a similar situation regarding normality in
Theorem 8.
By comparing the results for T₂ = 100 with the results of SRS (last line in Table
4.7), we see that LHS is far more efficient than SRS. The standard error is reduced
by a factor of 3 for B and a factor of 5 for p₀. Thus the resulting confidence intervals
are much narrower. Another interesting feature is that the standard error of LHS
for T = 10⁵ is about the same as that of SRS for T = 10⁶ (for B, also see Table
4.5), hence we can interpret the gain in efficiency as an order of magnitude in terms
of computing time. We also observe that one should choose T₂ to be as small as
long as convergence to normality is satisfied. Considering that implementation of
LHS does not require a significant increase in human effort (programming,
etc.), we advocate its use in calculating the Bayes Factor when the prior is our
choice of importance sampling function.
Latin hypercube sampling has, by and large, been ignored in the Bayesian literature,
so the degree of inefficiency suffered by SRS in the examples of McCulloch
and Rossi (1991) has not been investigated in the case of LHS. By the way it is
constructed, sampling randomly within equi-probable strata, LHS attempts to cover
the range of the prior more efficiently, but how this translates to the case of a
concentrated posterior is, as of yet, unknown. Still, our limited numerical experience
suggests it will not suffer as much as SRS, if at all.
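The stratification idea behind LHS can be sketched in a few lines. The following Python fragment is only an illustration of the mechanism, not the SAS IML code of our appendices; the integrand g and all sample sizes are choices made purely for this example. It draws one point per equi-probable stratum in each dimension, randomly pairing the strata across dimensions, and compares the spread of the resulting mean estimates with that of SRS:

```python
import numpy as np

def latin_hypercube(T, d, rng):
    # One draw per equi-probable stratum in each dimension, randomly paired
    u = rng.random((T, d))                                  # position within each stratum
    strata = np.column_stack([rng.permutation(T) for _ in range(d)])
    return (strata + u) / T                                 # points in [0, 1)^d

rng = np.random.default_rng(0)
g = lambda x: np.exp(-x.sum(axis=1))                        # toy integrand on the unit square

reps, T, d = 200, 100, 2
srs = [g(rng.random((T, d))).mean() for _ in range(reps)]   # simple random sampling
lhs = [g(latin_hypercube(T, d, rng)).mean() for _ in range(reps)]
print(np.std(srs), np.std(lhs))                             # LHS spread is noticeably smaller
```

Because the strata force the draws to cover the cube evenly, the variance reduction is largest when the integrand is dominated by additive, monotone components, a pattern consistent with the efficiency gains reported above.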
4.5 Gibbs Sampling
As we have noted before, the widespread use of Bayesian methods among practitioners
is impeded by the analytical difficulties experienced in the derivation
of posterior distributions. A recent solution emerges in the form of obtaining
a sample from the posterior distribution in a rather easy and computationally
inexpensive way. One can use this sample to calculate functionals of the posterior
distribution, such as quantiles, modes or marginal likelihoods, that will be of use
in inference. Currently, these methods are collectively known as Markov Chain
Monte Carlo (MCMC). The essence of these methods is to create a Markov chain
whose stationary distribution is the desired distribution to sample from, that is, the
posterior. Then, after convergence is reached, the sample path of the Markov
chain is a sample from the posterior. The remarkable part of this approach
is devising ways of constructing a Markov chain with a given stationary
distribution; it turns out that this is easy to do. Although the literature
has seen an explosion of MCMC studies recently, the essential ideas date back to
Metropolis, Rosenbluth, Rosenbluth, Teller and Teller (1953) and Hastings (1970).
Gibbs sampling, our method of choice for this study, was first used by Geman
and Geman (1984) in the context of image processing. Its use in statistical models
has been advocated by Gelfand and Smith (1990).
We will first explain the basic mechanism behind the Gibbs sampler. We want
to generate a sample from the distribution of τ²|y, where τ² can be either τ₁² or
τ₂². Consider, for i = 1, …, r, the set of conditional posterior distributions

P_i(τᵢ²) = P(τᵢ² | τⱼ², j ≠ i, y),   (4.14)
which are commonly called "full conditionals". It is not always the case that the
full conditionals determine the joint, but the conditions for them to do so are
rather mild and are given by Besag (1974). As indicated above, we will use P_i(·) to
refer to the conditional posterior distribution of τᵢ², suppressing the dependence on
the rest of the variance ratios and the data, mainly for notational simplicity.
Consider the iterative scheme in which we start from a set of values, say,
{τ²_{i,(0)}}_{i=1}^r, and then generate the next set of values by

τ²_{1,(1)} ~ P(τ₁² | τ²_{2,(0)}, …, τ²_{r,(0)}, y)
τ²_{2,(1)} ~ P(τ₂² | τ²_{1,(1)}, τ²_{3,(0)}, …, τ²_{r,(0)}, y)
⋮
τ²_{r,(1)} ~ P(τᵣ² | τ²_{1,(1)}, …, τ²_{r−1,(1)}, y)
and continue this updating scheme. The set of values {τ²_{i,(t)}}_{i=1}^r is called the sample
generated at the t-th iteration. This set constitutes, collectively, a sample from
the joint posterior distribution, and furthermore each individual value τ²_{i,(t)} is a
realization from the marginal posterior density of τᵢ². Hence, by continuing this
iterative scheme T times, one can get a realization of the posterior distribution.
This iterative scheme is known as Gibbs sampling. Note that it is not necessary
to consider all the univariate conditionals one by one; instead we can partition
the parameter vector into subvectors and work with the conditional distributions
of these subvectors. If we choose all the subvectors to be univariate, we get the
version of the Gibbs sampler described above.
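The updating scheme above can be illustrated on a toy target whose full conditionals are available in closed form. The sketch below uses a standard bivariate normal with correlation 0.8, a choice made purely for illustration (it is not the model of Section 4.2): each full conditional is a univariate normal, so one Gibbs sweep draws each coordinate from its conditional given the current value of the other.

```python
import numpy as np

def gibbs_bivariate_normal(rho, T, rng):
    # Target: standard bivariate normal with correlation rho.
    # Full conditionals are univariate: x | y ~ N(rho*y, 1 - rho^2), and symmetrically for y.
    x = y = 0.0                                 # arbitrary starting values
    s = np.sqrt(1.0 - rho**2)
    out = np.empty((T, 2))
    for t in range(T):
        x = rng.normal(rho * y, s)              # draw from P(x | y)
        y = rng.normal(rho * x, s)              # draw from P(y | x)
        out[t] = x, y
    return out

rng = np.random.default_rng(1)
draws = gibbs_bivariate_normal(rho=0.8, T=5000, rng=rng)
print(np.corrcoef(draws[500:].T)[0, 1])         # sample correlation, close to 0.8
```

Discarding the first 500 sweeps plays the role of the burn-in discussed below; the retained sweeps reproduce the target correlation.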
There are two major issues surrounding the implementation of and inference
from Gibbs sampling (and more generally MCMC methods). The first one is
assessing convergence and the second one is the dependence among the samples
from the posterior distribution.
The conditions under which a Markov chain has a unique stationary distribution
are well-known theoretically (it has to be aperiodic, irreducible and positive
recurrent; see Roberts (1996)), and the way we construct our chains in the Gibbs
sampler (and in other MCMC methods, for that matter) guarantees the existence
of a stationary distribution. However, in practice, convergence can be painfully
slow, and the major issue is determining when reasonable convergence has been
reached (the period before this point is called the "burn-in"). The samples obtained
up to the burn-in point are then discarded. Although there is considerable theoretical
work in the literature in the form of establishing bounds on convergence rates
(examples being Rosenthal (1993) and Polson (1995)), these bounds generally turn
out to be too loose to be of any practical use. It seems that the rather ad hoc
methods suggested by Gelman and Rubin (1992) and Raftery and Lewis (1992)
enjoy widespread use.
An issue related to convergence is the rate of mixing. Informally, mixing
is the rate at which the Markov chain moves throughout the support of the
stationary distribution. So, if a chain is slow mixing, it may stay in a certain
portion of the state-space for a long time and, unless the chain length is adjusted
accordingly, the inferences will be unduly affected. Previous work suggests that
reparametrization is the best remedy for slow mixing; some examples are given in
Gilks and Roberts (1996).
The second issue is concerned with the fact that the observed values, being the
sample path of a Markov chain, are not independent of each other. Assuming that
convergence is reached, the observed values will form a dependent sample from the
posterior distribution. This is an uncomfortable setting for many statisticians, but
is not necessarily bad in MCMC. In most problems, the typical estimate will be
obtained by averaging over the samples to get some empirical quantity. Although the
samples are dependent, the ergodic theorem assures us that these sample averages
(also called ergodic averages) converge to the true expectations (Gilks, Richardson
and Spiegelhalter, 1996). That the Markov chain is ergodic follows from the
aperiodicity and positive recurrence which have already been required for the
existence of a stationary distribution. So the most common approach to dependence
is to ignore it. Nevertheless, if for any reason one needs an independent (at least
approximately) sample, there are two ways of getting one. One is to run several
chains with independent starting points and use the endpoint of each chain in the
final sample. The second, called thinning, is to use every k-th value from the
chain; these values will be approximately independent because the autocorrelation
decays with the lag. Obviously, one has to forsake some computational efficiency
to obtain independent samples, and this is generally deemed not worthwhile.
There is also a heated debate as to whether running several chains has any other
merit. Supporters of the several-chain approach maintain that comparison of the
sample paths of the chains might reveal essential information for convergence
diagnostics. Those who subscribe to the single-chain school submit that one long
chain has a better chance of finding new modes of the posterior distribution, thus
providing faster convergence in the ergodic sense. To date, in practice, the
choice between single and several chains seems to be a matter of computational
availability and taste, rather than being based on rational justifications.
This lengthy discussion about the implementation of the Gibbs sampler is
necessitated by the lack of agreement and established guidelines in the literature. As
far as our implementation in this study is concerned, we choose to go with a single
chain. We feel the ergodic theorem is secure enough for our purposes, and the
computational and algorithmic simplicity of the single-chain approach is also
welcome. For assessing convergence, the techniques mentioned above do not come to
the rescue: one of them (Gelman and Rubin) is exclusively for multiple chains and
the other (Raftery and Lewis) is most suitable when the target posterior quantities
are quantiles. Hence we will rely on simple graphical methods to monitor convergence.
One major input to our implementation is the set of full conditionals as described
by (4.14). For the data and model described in Section 4.2, the full conditionals
can be derived from the key observation that the conditional posterior distribution
of both τ₁² and τ₂² is proportional to their joint posterior, that is,
P₁(τ₁²) = P(τ₁² | τ₂², y)
        ∝ P(τ₁², τ₂² | y)
        ∝ L₁(τ₁², τ₂²) P(τ₁², τ₂²)

and

P₂(τ₂²) = P(τ₂² | τ₁², y)
        ∝ P(τ₁², τ₂² | y)
        ∝ L₁(τ₁², τ₂²) P(τ₁², τ₂²)

so that

P₁(τ₁²) ∝ |A₁|^{-1/2} |XᵀB₁X|^{-1/2} [(y − Xβ̂₁)ᵀ B₁ (y − Xβ̂₁)]^{-(n-p)/2} (1 + τ₁² + τ₂²)^{-2}   (4.15)

and

P₂(τ₂²) ∝ |A₂|^{-1/2} |XᵀB₂X|^{-1/2} [(y − Xβ̂₂)ᵀ B₂ (y − Xβ̂₂)]^{-(n-p)/2} (1 + τ₁² + τ₂²)^{-2}.   (4.16)
What seems to be a mysterious situation can be better understood by recalling
the discussion about the use of the symbol ∝ in Section 3.2. For P₁(τ₁²), the random
variable in question is τ₁²; hence anything that does not involve τ₁² is considered a
constant, including P(τ₂²). A similar argument holds for P₂(τ₂²). So, in order to get
the full conditionals, we only need to write the expression for the joint posterior
and then pick out the terms involving τ₁² for P₁(τ₁²) and τ₂² for P₂(τ₂²). Because of
the implicit dependence of L₁, as given in (4.13), on τ₁² and τ₂², it is difficult (and
not very helpful) to write the full conditionals explicitly; instead we will work with
what is given in (4.15) and (4.16).
An immediate difficulty with the full conditionals is the absence of the integrating
constants in their expressions. Fortunately, there is a wide variety of methods
to generate samples from distributions with unknown integrating constants. Gilks
(1996) reviews some of them in the context of MCMC, and Bennett, Racine-Poon
and Wakefield (1996) provide a comparative study within the framework of a
nonlinear model.
Figure 4.1: Bayes Factor
Among the ones reviewed by Gilks (1996) are rejection, ratio-of-uniforms and
Metropolis-Hastings (which is itself an MCMC technique). They are generic in the
sense that they apply equally well to arbitrary distributions, regardless
of whether we are implementing an MCMC method. Another recent suggestion is
the "Griddy Gibbs Sampler" of Ritter and Tanner (1992), which is applicable only
when each subvector is univariate. We choose to work with the ratio-of-uniforms
method, which has also been suggested by Gelfand, Hills, Racine-Poon and Smith
(1990). An explanation of how this technique works is given in Appendix C.
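For readers who do not wish to consult Appendix C, the following sketch conveys the idea of the ratio-of-uniforms method on a generic unnormalized density. The Gamma(3, 1) kernel and the sample sizes here are choices made for illustration only; our actual SAS IML implementation targets the full conditionals (4.15) and (4.16).

```python
import numpy as np

def rou_sample(h, u_max, v_max, T, rng):
    # Ratio-of-uniforms: draw (u, v) uniformly on [0, u_max] x [0, v_max];
    # accept x = v/u whenever u <= sqrt(h(x)).  The accepted x's follow the
    # density proportional to h, even though h is unnormalized.
    out = []
    while len(out) < T:
        u = rng.uniform(0.0, u_max, size=1000)
        v = rng.uniform(0.0, v_max, size=1000)
        x = v / u
        out.extend(x[u <= np.sqrt(h(x))])
    return np.array(out[:T])

h = lambda x: x**2 * np.exp(-x)             # unnormalized Gamma(3, 1) density on x > 0
u_max = np.sqrt(h(2.0))                     # sup_x sqrt(h(x)), attained at x = 2
v_max = 4.0 * np.sqrt(h(4.0))               # sup_x x*sqrt(h(x)), attained at x = 4
rng = np.random.default_rng(3)
x = rou_sample(h, u_max, v_max, 5000, rng)
print(x.mean())                             # close to the Gamma(3, 1) mean of 3
```

Only evaluations of h are needed, never its integrating constant, which is exactly why the method suits the full conditionals above.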
Our implementation of the Gibbs sampler, using the SAS IML code given in
Appendix D, starts with an arbitrarily chosen value of τ₁² = 1, then generates
a sample from P₂ using ratio-of-uniforms. The next step is to generate a sample
from P₁, conditioning on the generated value of τ₂², again using ratio-of-uniforms.
We run the chain for T = 1000 values, and the corresponding estimator of the BF,
as given in (4.7), is plotted against the number of iterations in Figure 4.1. The
estimate fluctuates until the number of iterations is around 350, and using this
graphical judgment we set the burn-in to 400. So, the reported estimates of
marginal likelihoods and the BF in Table 4.9 are based on an effective simulation
size of 600.
The estimates reported in Table 4.9 are reasonably close to the ones we obtained
by SRS and LHS. However, as Newton and Raftery (1994) note, the harmonic
Table 4.9: Estimates from the Gibbs Sampler

  m̂₀       m̂₁       B        P
  0.1679    0.2849    0.5893   0.3708
estimator has infinite variance and does not satisfy a central limit theorem. So,
we cannot get standard errors for the harmonic mean estimator of the BF, and a
formal comparison of efficiency with SRS and LHS does not seem possible.
To get around the problem of infinite variance, Newton and Raftery (1994) suggest
some modifications.
Raftery (1996) reviews several methods of estimating marginal likelihoods,
mostly based on posterior simulation. Based on the findings of Rosenkranz (1992)
and Lewis (1994), he encourages the use of the harmonic mean estimator in spite
of the infinite variance problem.
In our numerical experience with the small data set, posterior-based estimators
have taken longer computer time to evaluate. However, they are worthwhile to
investigate, because the BF is usually not the sole aim of data analysis. In most
cases, analysts will support their findings by using posterior distributions. Considering
the widespread use of MCMC methods in exploring posterior distributions,
any Bayesian data analysis will already have set up a scheme to sample from
the posterior, and the harmonic mean estimator (or its modifications) can be easily
obtained as a by-product of this process.
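The harmonic mean computation itself is a one-liner once posterior draws are in hand. The toy sketch below is our own illustrative construction, not the dissertation's mixed model: it uses a conjugate normal model, chosen because its exact marginal likelihood is available in closed form, with direct posterior draws standing in for Gibbs output.

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(1.0, 1.0, size=20)              # data: y_i ~ N(theta, 1), theta unknown
n, T = len(y), 100_000

# Conjugate toy model: prior theta ~ N(0, 1) => posterior theta | y ~ N(n*ybar/(n+1), 1/(n+1)).
# We sample the posterior directly, standing in for the output of a Gibbs sampler.
post = rng.normal(n * y.mean() / (n + 1), np.sqrt(1.0 / (n + 1)), size=T)

# Log-likelihood of each posterior draw
loglik = -0.5 * n * np.log(2 * np.pi) - 0.5 * ((y[:, None] - post) ** 2).sum(axis=0)

# Harmonic mean estimator: m_hat = [ (1/T) * sum_t 1/L(theta_t) ]^(-1), on the log scale
log_m_hat = -(np.logaddexp.reduce(-loglik) - np.log(T))

# Exact log marginal likelihood for this conjugate model: y ~ N(0, I + 11')
exact = (-0.5 * n * np.log(2 * np.pi) - 0.5 * np.log(n + 1)
         - 0.5 * ((y ** 2).sum() - y.sum() ** 2 / (n + 1)))
print(log_m_hat, exact)                         # close, but no standard error is available
```

Consistent with the discussion above, the estimator is simulation-consistent here yet has infinite variance, so no standard error accompanies it.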
Estimating marginal likelihoods from posterior simulation is an active research
area. Recently, Chib (1995) suggested another method which requires known
integrating constants in the set of full conditionals, and hence is not applicable to our
formulation. DiCiccio, Kass, Raftery and Wasserman (1996) give a recent review
of posterior-based marginal likelihood estimators and suggest improvements.
Raftery (1996) reviews analytic approximations as well as the mentioned posterior
simulation methods; however, the performance of all of these methods remains to
be tested.
CHAPTER V
CONCLUSIONS AND FUTURE RESEARCH
In this study we have derived Bayes Factors to test for a null variance ratio
(or a null variance component) in the context of a normal mixed linear model.
We have taken a fully operational Bayesian approach, specifying non-informative
improper prior distributions for nuisance parameters and informative priors that
require little or no prior input from the analyst for the parameters of interest.
The Bayesian approach avoids the difficulties of frequentist and likelihood
solutions to the problem, such as the non-uniqueness of the most powerful invariant
tests or intractable asymptotic distributions of test statistics. Our particular
formulation also overcomes the difficulties of similar Bayesian approaches to the
problem by using a proper family of priors for the variance components. An
investigation by Westfall and Gonen (1996) of some previous studies, which used
improper priors for parameters of interest and employed a device of an "imaginary
training sample," has established asymptotic difficulties with that approach. We
also provide a reasonable choice of reference prior which does not need any prior
input from the analyst and can be used if prior information on variance components
is not available. We have illustrated our technique on both a standard model and a
hierarchical model and found that each has its own merits. The minimally
parameterized model was easier to work with from an analytic standpoint but presented
computational challenges, which were solved by the hierarchical approach. We have
also noted that missing observations or cells can be handled within this framework
just as they are handled in other approaches, i.e., by imputation or by remodeling
to solve issues of identifiability.
It turned out that the calculation of the Bayes Factor necessitates numerical
integration. We have devoted Chapter IV to investigating Monte Carlo approaches
to the problem, employing the prior and the posterior as importance sampling
functions. We have considered Latin hypercube sampling as an alternative to simple
random sampling when the prior is used as the importance sampling function. We
have also used Gibbs sampling to simulate the posterior distribution when the
posterior is used as the importance sampling function. Our numerical experience
favors Latin hypercube sampling over simple random sampling. Also, considering
the popularity and necessity of posterior simulation in contemporary Bayesian data
analysis, we encourage the use of the harmonic mean estimator as well.
Bayesian analysis of variance components is an active research area and there
are many directions which future research may take. Our approach may be
generalized to include non-normal models, that is, mixed linear models with a
non-normal response connected to the linear predictor through a link function. Such
models, known as generalized linear models, have been successfully employed to
explain the variation in 0-1 or count data, but to date no one has attempted to
find Bayes Factors for random effects in generalized mixed linear models.
Preliminary studies of Bayesian analysis concerning estimation and using improper
priors have been done by Zeger and Karim (1991) and Clayton (1996).
Another direction is to look into the possibilities of developing more efficient
numerical strategies. As mentioned in Section 4.5, estimating marginal likelihoods
via posterior simulation receives a great deal of attention in the current literature,
as do efficient reparametrizations to improve mixing. A study reviewing and
comparing the estimators suggested in the literature would be appropriate.
Another possibility is to investigate the performance of the suggested Bayes
Factors in model selection and compare them with other popular measures, such as
the ones mentioned in Section 1.2. Such studies give practitioners a chance to pick
their choice of statistical summary measures according to their particular needs
without going into lengthy numerical studies themselves.
We have already noted at the end of Section 3.5 that a better way to approach
the problem of missing observations in a Bayesian setting is to treat the missing
values as parameters, use strategies like data augmentation to simulate the
values of the missing observations from their posterior distributions, and then impute
those values to perform the analysis. This may further be integrated into a Gibbs
sampling environment that will automatically handle missing values and calculate
the Bayes Factor.

Finally, it should be noted that the development of Bayesian approaches, and
especially Bayes Factors, for popular statistical models has been substantially
delayed by computational concerns, which have led practicing statisticians
to regard Bayesian methods as elegant but inapplicable. The explosion in the
availability of inexpensive computing power in the last decade, along with clever
numerical methods, seems to overcome this problem. Virtually any statistician will
benefit from the addition of easily applicable Bayesian methods to his or her
arsenal.
REFERENCES
Afifi, A. & Elashoff, R. (1966). Missing observations in multivariate statistics I: Review of the literature, Journal of the American Statistical Association 61: 595-605.
Airy, G. (1861). On the Algebraical and Numerical Theory of Errors of Observations, MacMillan, London.
Aitchison, J. (1963). Inverse distributions and independent gamma-distributed products of random variables, Biometrika 50: 505-508.
Allen, F. & Wishart, J. (1930). A method of estimating the yields of a missing plot in field experimental work, Journal of Agricultural Science 30: 399-406.
Anderson, A., Basilevsky, A. & Hum, D. (1983). Missing data, in P. Rossi, J. Wright & A. Anderson (eds), Handbook of Survey Research, Academic Press, New York, pp. 415-494.
Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances, Philosophical Transactions of the Royal Society 53: 370-418.
Bennett, J., Racine-Poon, A. & Wakefield, J. (1996). MCMC for nonlinear hierarchical models, in W. Gilks, S. Richardson & D. Spiegelhalter (eds), Markov Chain Monte Carlo in Practice, Chapman and Hall, London, pp. 339-358.
Berger, J. (1985). Statistical Decision Theory and Bayesian Analysis, second ed, Springer-Verlag, New York.
Berger, J. & Deely, J. (1988). A Bayesian approach to ranking and selection of related means with alternatives to analysis-of-variance methodology, Journal of the American Statistical Association 83: 364-373.
Berger, J. & Delampady, M. (1987). Testing precise hypotheses, Statistical Science 2: 317-352.
Berger, J. & Sellke, T. (1987). Testing a point null hypothesis, Journal of the American Statistical Association 82: 112-139.
Bernardo, J. & Smith, A. (1994). Bayesian Theory, Wiley, New York.
Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems, Journal of the Royal Statistical Society, Series B 36: 192-236.
Box, G. & Tiao, G. (1973). Bayesian Inference in Statistical Analysis, Addison-Wesley, Reading, Massachusetts.
Bremer, R. (1994). Choosing and modeling your mixed linear model, Communications in Statistics: Theory and Methods 22: 3491-3522.
Broemeling, L. (1985). Bayesian Analysis of Linear Models, Marcel-Dekker, New York.
Carlin, B. & Chib, S. (1995). Bayesian model choice via Markov chain Monte Carlo methods, Journal of the Royal Statistical Society, Series B 57: 473-484.
Casella, G. & Berger, R. (1987). Reconciling Bayesian and frequentist evidence in the one-sided testing problem, Journal of the American Statistical Association 82: 106-111.
Casella, G. & Berger, R. (1990). Statistical Inference, Duxbury Press, Belmont, California.
Chaloner, K. (1987). A Bayesian approach to the estimation of variance components for the unbalanced one-way random model, Technometrics 29: 323-337.
Chauvenet, W. (1863). A Manual of Spherical and Practical Astronomy, Lippincott, Philadelphia.
Chib, S. (1995). Marginal likelihood from the Gibbs output, Journal of the American Statistical Association 90: 1313-1321.
Clayton, D. (1996). Generalized mixed linear models, in W. Gilks, S. Richardson & D. Spiegelhalter (eds), Markov Chain Monte Carlo in Practice, Chapman and Hall, London, pp. 275-302.
Datta, G. & Ghosh, M. (1995). Some remarks on noninformative priors, Journal of the American Statistical Association 90: 1357-1363.
de Finetti, B. (1964). Foresight: Its logical laws, its subjective sources, in H. Kyburg & H. Smokler (eds), Studies in Subjective Probability, Wiley, New York, pp. 93-158. Translated and reprinted; originally appeared in 1937.
de Finetti, B. (1972). Probability, Induction and Statistics, Wiley, New York.
de Finetti, B. (1974). Theory of Probability, Vol. 1, Wiley, New York.
de Finetti, B. (1975). Theory of Probability, Vol. 2, Wiley, New York.
DeGroot, M. (1970). Optimal Statistical Decisions, McGraw-Hill, New York.
DeGroot, M. (1973). Doing what comes naturally: Interpreting a tail-area as a posterior probability or as likelihood ratio, Journal of the American Statistical Association 68: 966-969.
Devroye, L. (1986). Non-Uniform Random Variate Generation, Springer-Verlag, New York.
DiCiccio, T., Kass, R., Raftery, A. & Wasserman, L. (1996). Computing Bayes factors by combining simulation and analytic approximations, Technical Report 630, Carnegie Mellon University, Department of Statistics.
Dickey, J. (1974). Bayesian alternatives to the F-test and least squares estimates in the normal linear model, in S. Fienberg & A. Zellner (eds), Studies in Bayesian Econometrics and Statistics, North Holland, Amsterdam, pp. 515-554.
Dodge, Y. (1985). Analysis of Experiments with Missing Data, Wiley, New York.
Edwards, W., Lindman, H. & Savage, L. (1963). Bayesian statistical inference for psychological research, Psychological Review 70: 193-242.
Efron, B. (1986). Why isn't everyone a Bayesian?, American Statistician 40: 1-11.
Eisenhart, C. (1947). The assumptions underlying the analysis of variance, Biometrics 3: 1-21.
Ferguson, T. (1967). Mathematical Statistics: A Decision-Theoretic Approach, Academic Press, New York.
Fisher, R. (1918). The correlation between relatives on the supposition of Mendelian inheritance, Transactions of the Edinburgh Royal Society 52: 399-433.
Gauss, K. (1809). Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium, Perthes and Besser, Hamburg.
Gelfand, A., Hills, S., Racine-Poon, A. & Smith, A. (1990). Illustration of Bayesian inference in normal data models using Gibbs sampling, Journal of the American Statistical Association 85: 972-985.
Gelfand, A. & Smith, A. (1990). Sampling-based approaches to calculating marginal densities, Journal of the American Statistical Association 85: 398-409.
Gelman, A., Carlin, J., Stern, H. & Rubin, D. (1995). Bayesian Data Analysis, Chapman and Hall, London.
Gelman, A. & Rubin, D. (1992). Inference from iterative simulation using multiple sequences, Statistical Science 7: 457-511.
Geman, S. & Geman, D. (1984). Stochastic relaxation, Gibbs distributions and Bayesian restoration of images, IEEE Transactions on Pattern Analysis and Machine Intelligence 6: 721-741.
George, E. & McCulloch, R. (1993). Variable selection via Gibbs sampling, Journal of the American Statistical Association 88: 881-889.
Gilks, W. (1996). Full conditional distributions, in W. Gilks, S. Richardson & D. Spiegelhalter (eds), Markov Chain Monte Carlo in Practice, Chapman and Hall, London, pp. 75-88.
Gilks, W., Richardson, S. & Spiegelhalter, D. (1996). Introducing Markov chain Monte Carlo, in W. Gilks, S. Richardson & D. Spiegelhalter (eds), Markov Chain Monte Carlo in Practice, Chapman and Hall, London, pp. 1-20.
Golub, G. & Van Loan, C. (1989). Matrix Computations, The Johns Hopkins University Press, Baltimore, Maryland.
Gonen, M. (1995). Bayes factors for the ANOVA hypothesis. Presented at the IMS/WNAR Joint Conference, Stanford, California.
Graybill, F. (1986). Theory and Application of the Linear Model, Duxbury, North Scituate, Massachusetts.
Hammersley, J. & Handscomb, D. (1964). Monte Carlo Methods, Methuen, London.
Hartley, H. & Hocking, R. (1971). The analysis of incomplete data, Biometrics 27: 783-823.
Hartley, H. & Rao, J. (1967). Maximum likelihood estimation for the mixed analysis of variance model, Biometrika 54: 93-108.
Hastings, W. (1970). Monte Carlo methods using Markov chains and their applications, Biometrika 57: 97-109.
Hill, B. (1965). Inference about variance components in the one-way model, Journal of the American Statistical Association 60: 806-825.
Hocking, R. (1985). The Analysis of Linear Models, Brooks/Cole, Monterey, California.
Hocking, R., Green, J. & Bremer, R. (1989). Variance component estimation with model-based diagnostics, Technometrics 31: 227-240.
Hodges, J. & Lehmann, E. (1954). Testing the approximate validity of statistical hypotheses, Journal of the Royal Statistical Society, Series B 16: 261-268.
Hogg, R. & Craig, A. (1978). Introduction to Mathematical Statistics, fourth ed, Macmillan, New York.
Iman, R. & Conover, W. (1980). Small sample sensitivity analysis techniques for computer models with an application to risk assessment, Communications in Statistics: Theory and Methods 17: 1749-1842.
Jeffreys, H. (1935). Some tests of significance, treated by the theory of probability, Proceedings of the Cambridge Philosophical Society 31: 203-222.
Jeffreys, H. (1939). Theory of Probability, Oxford University Press, London.
Kahaner, D. (1991). A survey of existing multidimensional quadrature routines, in N. Flournoy & R. Tsutakawa (eds), Statistical Multiple Integration, American Mathematical Society, Providence, Rhode Island, pp. 9-22.
Kass, R. & Raftery, A. (1995). Bayes factors, Journal of the American Statistical Association 90: 773-795.
Khuri, A. & Sahai, H. (1985). Variance components analysis: A selective literature survey, International Statistical Review 53: 279-300.
Law, A. & Kelton, D. (1991). Simulation Modeling and Analysis, second ed, McGraw-Hill, New York.
Legendre, A. (1806). Nouvelles Methodes pour la Determination des Orbites des Cometes; avec un Supplement Contenant Divers Perfectionnements de ces Methodes et leur Application aux deux Cometes de 1805, Courcier, Paris.
Lehmann, E. (1986). Testing Statistical Hypotheses, Wiley, New York.
Lempers, F. (1971). Posterior Probabilities of Alternative Linear Models, University Press, Rotterdam.
Lewis, S. (1994). Multilevel Modeling of Discrete Event History Data Using Markov Chain Monte Carlo Methods, PhD thesis, University of Washington, Department of Statistics.
Lindley, D. (1957). A statistical paradox, Biometrika 44: 187-192.
Lindley, D. & Smith, A. (1972). Bayes estimates for the linear model, Journal of the Royal Statistical Society, Series B 34: 1-41.
Little, R. (1992). Regression with missing X's: A review, Journal of the American Statistical Association 87: 1227-1237.
Little, R. & Rubin, D. (1987). Statistical Analysis with Missing Data, Wiley, New York.
Mallows, C. (1973). Some comments on Cp, Technometrics 15: 661-675.
McCulloch, R. & Rossi, P. (1991). A Bayesian approach to testing the arbitrage pricing theory, Journal of Econometrics 49: 141-168.
McKay, M., Conover, W. & Beckman, R. (1979). A comparison of three methods for selecting values of input variables in the analysis of output from a computer code, Technometrics 21: 239-245.
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A. & Teller, E. (1953). Equations of state calculations by fast computing machines, Journal of Chemical Physics 21: 1087-1091.
Miller, A. (1990). Subset Selection in Regression, Chapman and Hall, New York.
Mitchell, T. & Beauchamp, J. (1988). Bayesian variable selection in linear regression, Journal of the American Statistical Association 83: 1023-1032.
Newton, M. & Raftery, A. (1994). Approximate Bayesian inference by the weighted likelihood bootstrap, Journal of the Royal Statistical Society, Series B 56: 3-48.
Neyman, J. & Pearson, E. (1933). On the problem of most efficient tests of statistical hypotheses, Philosophical Transactions of the Royal Society, Series A 231: 289-337.
Neyman, J. & Pearson, E. (1967). Joint Statistical Papers of J. Neyman and E.S. Pearson, University of California Press, Berkeley.
Niederreiter, H. (1992). Random Number Generation and Quasi-Monte Carlo Methods, Society for Industrial and Applied Mathematics, Philadelphia, Pennsylvania.
O'Hagan, A. (1994). Kendall's Advanced Theory of Statistics, Volume 2B: Bayesian Inference, Edward Arnold, London.
O'Hagan, A. (1995). Fractional Bayes factors for model comparison, Journal of the Royal Statistical Society, Series B 57: 99-138.
Polson, N. (1995). Convergence of Markov chain Monte Carlo algorithms, in J. Bernardo, J. Berger, A. Dawid & A. Smith (eds), Bayesian Statistics 5, Oxford University Press, Oxford.
Press, J. (1989). Bayesian Statistics: Principles, Models and Applications, Wiley, New York.
Raftery, A. (1996). Hypothesis testing and model selection via posterior simulation, in W. Gilks, S. Richardson & D. Spiegelhalter (eds), Markov Chain Monte Carlo in Practice, Chapman and Hall, London, pp. 163-188.
Raftery, A. & Lewis, S. (1992). How many iterations in the Gibbs sampler?, in J. Bernardo, J. Berger, A. Dawid & A. Smith (eds), Bayesian Statistics 4, Oxford University Press, Oxford, pp. 765-776.
Rao, C. (1973). Linear Statistical Inference and Its Applications, second ed, Wiley, New York.
Ripley, B. (1987). Stochastic Simulation, Wiley, New York.
Ritter, C. & Tanner, M. (1992). The Gibbs stopper and the griddy Gibbs sampler, Journal of the American Statistical Association 87: 861-868.
Roberts, G. (1996). Markov chain concepts related to sampling algorithms, in W. Gilks, S. Richardson & D. Spiegelhalter (eds), Markov Chain Monte Carlo in Practice, Chapman and Hall, London, pp. 45-58.
Rosenkranz, S. (1992). The Bayes Factor for Model Evaluation in a Hierarchical Poisson Model for Area Counts, PhD thesis, University of Washington, Department of Biostatistics.
Rosenthal, J. (1993). Rates of convergence for data augmentation on finite sample spaces, Annals of Applied Probability 3: 819-839.
Rubin, D. (1987). Multiple Imputation for Nonresponse in Surveys, Wiley, New York.
Samaniego, F. & Reneau, D. (1994). Toward a reconciliation of the Bayesian and frequentist approaches to point estimation, Journal of the American Statistical Association 89: 947-957.
Savage, L. (1954). The Foundations of Statistics, Wiley, New York.
Savage, L. (1962). The Foundations of Statistical Inference, Methuen, London.
Scheffe, H. (1956). The Analysis of Variance, Wiley, New York.
Searle, S. (1971). Linear Models, Wiley, New York.
Searle, S. (1987). Linear Models for Unbalanced Data, Wiley, New York.
Searle, S., Casella, G. & McCulloch, C. (1992). Variance Components, Wiley, New York.
Self, S. & Liang, K. (1987). Asymptotic properties of maximum likelihood estimators and the likelihood ratio tests under nonstandard conditions, Journal of the American Statistical Association 82: 605-610.
Smith, A. (1973a). Bayes estimates in one-way and two-way models, Biometrika 60: 319-330.
Smith, A. (1973b). A general Bayesian linear model, Journal of the Royal Statistical Society, Series B 35: 67-75.
Smith, A. & Spiegelhalter, D. (1980). Bayes factors and choice criteria for linear models, Journal of the Royal Statistical Society, Series B 42: 213-220.
Spiegelhalter, D. & Smith, A. (1982). Bayes factors for linear and log-linear models with vague prior information, Journal of the Royal Statistical Society, Series B 44: 377-387.
73
Tanner, M. (1993). Tools for Statistical Inference: Methods for Exploring Posterior Distributions and Likelihood Functions, second ed., Springer-Verlag, New York.
Tanner, M. & Wong, W. (1987). The calculation of posterior distributions by data augmentation, Journal of the American Statistical Association 82: 528-550.
Tiao, G. & Tan, W. (1965). Bayesian analysis of random-effects models in the analysis of variance, Biometrika 52: 37-53.
Westfall, P. (1989). Power comparisons for invariant ratio tests in mixed ANOVA models, Annals of Statistics 17: 318-326.
Westfall, P. & Gonen, M. (1996). Asymptotic properties of ANOVA Bayes factors, Communications in Statistics: Theory and Methods 25(12). To appear.
Gilks, W. & Roberts, G. (1996). Strategies for improving MCMC, in W. Gilks, S. Richardson & D. Spiegelhalter (eds), Markov Chain Monte Carlo in Practice, Chapman and Hall, London, pp. 89-114.
Yates, F. (1933). The analysis of replicated experiments when the field results are incomplete, Empire Journal of Experimental Agriculture 1: 129-142.
Ye, K. (1994). Bayesian reference prior analysis on the ratio of variances for the balanced one-way random effect model, Journal of Statistical Planning and Inference 41: 267-280.
Zeger, S. & Karim, M. (1991). Generalized linear models with random effects: A Gibbs sampling approach, Journal of the American Statistical Association 86: 79-86.
Zellner, A. (1971). An Introduction to Bayesian Inference in Econometrics, Wiley, New York.
Zellner, A. (1986a). On assessing prior distributions and Bayesian regression analysis with g-prior distributions, in P. Goel & A. Zellner (eds), Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti, North-Holland, Amsterdam, pp. 233-243.
Zellner, A. (1986b). Posterior odds ratios for regression hypotheses: General considerations and specific results, in A. Zellner (ed.), Basic Issues in Econometrics, University of Chicago Press, Chicago, pp. 275-305.
Zellner, A. & Siow, A. (1980). Posterior odds ratios for selected regression hypotheses, in J. Bernardo, J. Berger, A. Dawid & A. Smith (eds), Bayesian Statistics, University Press, Valencia.
APPENDIX A
SAS IML CODE FOR SRS
/* Same as lhs.sas, but uses simple random sampling */
/* instead of latin hypercube sampling */

/* calculate estimates based on rep repetitions */
/* repeat m times to get the variance of the estimator */

proc iml;

rep=10;
m=50;
sumb=0;sumsqb=0;

do k=1 to m;

e1=j(1,rep,0);
e2=e1;
e3=e1;u3=e1;

do i=1 to rep;
e1[i]=ranexp(0);
e2[i]=ranexp(0);
e3[i]=ranexp(0);
u3[i]=uniform(0);
end;

e6=e1+e2+e3;
z1=e1/e6;
z2=e2/e6;

/* (z1,z2) is Dirichlet(1,1,1) */

t1=z1/(1-z1-z2);
t2=z2/(1-z1-z2);

/* t1 and t2 are the variance ratios for the denominator */

t3=u3/(1-u3);

/* t3 is the variance ratio for the numerator */

y={4.9,4.4,4.2,4,4.1,5,5.1,4.5,4.9,5.1};

n=10;
p=1;

x=j(n,1,1); /* design matrix for the fixed effects */

w1={1 0,1 0,1 0,1 0,1 0,0 1,0 1,0 1,0 1,0 1};
w2={1 0 0,0 1 0,0 1 0,0 0 1,0 0 1,1 0 0,1 0 0,0 1 0,0 0 1,0 0 1};
I10=I(10);

pay=j(1,rep,0);payda=pay;

numbar=j(1,5,0);denombar=numbar;/*b=numbar;*/

sumnum=0;sumdenom=0;

do j=1 to rep;
cov1=I10+t1[j]*w1*w1'+t2[j]*w2*w2';
cov0=I10+t3[j]*w1*w1';

inv1=inv(cov1);
inv0=inv(cov0);

betahat1=inv(x'*inv1*x)*x'*inv1*y;
betahat0=inv(x'*inv0*x)*x'*inv0*y;

de1=det(cov1)**(-0.5);
de2=det(x'*inv1*x)**(-0.5);
ycd=y-x*betahat1;
de3=(ycd'*inv1*ycd)**(-(n-p+2)/2);
payda[j]=de1*de2*de3;

n1=det(cov0)**(-0.5);
n2=det(x'*inv0*x)**(-0.5);
ycn=y-x*betahat0;
n3=(ycn'*inv0*ycn)**(-(n-p+2)/2);
pay[j]=n1*n2*n3;

sumnum=sumnum+pay[j];
sumdenom=sumdenom+payda[j];

end;

numbar=sumnum/rep;
denombar=sumdenom/rep;

sumb=sumb+(numbar/denombar);
sumsqb=sumsqb+((numbar/denombar)**2);

end;

blhs=sumb/m;
varblhs=(sumsqb-(m*(blhs**2)))/m;

print blhs varblhs;
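The first step of the listing, drawing (z1, z2) from a Dirichlet(1,1,1) distribution by normalizing three independent standard exponentials, is easy to sanity-check outside SAS. The following is a minimal Python sketch of just that step; the function name `dirichlet_111` and the sample size are illustrative, not part of the dissertation code.

```python
import random

def dirichlet_111(rng):
    """Draw (z1, z2) ~ Dirichlet(1,1,1) by normalizing three
    independent Exponential(1) variates, mirroring the IML loop."""
    e1, e2, e3 = (rng.expovariate(1.0) for _ in range(3))
    e6 = e1 + e2 + e3
    z1, z2 = e1 / e6, e2 / e6
    # variance ratios used in the denominator of the Bayes factor
    t1 = z1 / (1 - z1 - z2)
    t2 = z2 / (1 - z1 - z2)
    return z1, z2, t1, t2

rng = random.Random(0)
draws = [dirichlet_111(rng) for _ in range(20000)]
mean_z1 = sum(d[0] for d in draws) / len(draws)  # should be near 1/3
```

Each marginal of a Dirichlet(1,1,1) vector is Beta(1,2) with mean 1/3, which gives a quick check on the generator.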
APPENDIX B
SAS IML CODE FOR LHS
/* We generate a sample from multivariate independent uniform */
/* distribution on the unit hypercube and then transform */
/* them into Beta distributed variates and then transform */
/* into a vector with Dirichlet distribution */

proc iml;

rep=100;
m=500;
sumb=0;sumsqb=0;

do k=1 to m;

u1=j(1,rep,0);u2=u1;u3=u2;

do i=1 to rep;
u1[i]=uniform(0);
u2[i]=uniform(0);
u3[i]=uniform(0);
end;

v1=rank(u1);
v2=rank(u2);

x1=(v1/rep)-(1/(2*rep));
x2=(v2/rep)-(1/(2*rep));

/* x1 and x2 are a LHS sample from bivariate independent uniform(0,1) */

y1=1-(1-x1)##(0.5);
y2=x2;

/* y1 is Beta(1,2) and y2 is uniform(0,1) */

z1=y1;
z2=y2#(1-y1);

/* (z1,z2) is Dirichlet(1,1,1) */

t1=z1/(1-z1-z2);
t2=z2/(1-z1-z2);

/* t1 and t2 are the variance ratios for the denominator */

t3=u3/(1-u3);

/* t3 is the variance ratio for the numerator */

y={4.9,4.4,4.2,4,4.1,5,5.1,4.5,4.9,5.1};

n=10;
p=1;

x=j(n,1,1); /* design matrix for the fixed effects */

w1={1 0,1 0,1 0,1 0,1 0,0 1,0 1,0 1,0 1,0 1};
w2={1 0 0,0 1 0,0 1 0,0 0 1,0 0 1,1 0 0,1 0 0,0 1 0,0 0 1,0 0 1};
I10=I(10);

pay=j(1,rep,0);payda=pay;

numbar=j(1,5,0);denombar=numbar;b=numbar;

sumnum=0;sumdenom=0;

do j=1 to rep;
cov1=I10+t1[j]*w1*w1'+t2[j]*w2*w2';
cov0=I10+t3[j]*w1*w1';

inv1=inv(cov1);
inv0=inv(cov0);

betahat1=inv(x'*inv1*x)*x'*inv1*y;
betahat0=inv(x'*inv0*x)*x'*inv0*y;

de1=det(cov1)**(-0.5);
de2=det(x'*inv1*x)**(-0.5);
ycd=y-x*betahat1;
de3=(ycd'*inv1*ycd)**(-(n-p+2)/2);
payda[j]=de1*de2*de3;

n1=det(cov0)**(-0.5);
n2=det(x'*inv0*x)**(-0.5);
ycn=y-x*betahat0;
n3=(ycn'*inv0*ycn)**(-(n-p+2)/2);
pay[j]=n1*n2*n3;

sumnum=sumnum+pay[j];
sumdenom=sumdenom+payda[j];

end;

numbar=sumnum/rep;
denombar=sumdenom/rep;

sumb=sumb+(numbar/denombar);
sumsqb=sumsqb+((numbar/denombar)**2);

end;

blhs=sumb/m;
varblhs=(sumsqb-(m*(blhs**2)))/m;

print blhs varblhs;
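The transform chain in this listing (uniform ranks to stratified uniforms, inverse CDF to Beta(1,2), then stick-breaking to Dirichlet(1,1,1)) can be sketched in Python as follows. This is an illustration under stated assumptions: the listing uses the stratum midpoints (v/rep - 1/(2*rep)), while the sketch jitters uniformly within strata, a common Latin hypercube variant; the function name `lhs_dirichlet` is hypothetical.

```python
import random

def lhs_dirichlet(rep, rng):
    """Latin hypercube sample of size rep from Dirichlet(1,1,1) via the
    Beta(1,2) inverse CDF and stick-breaking, as in the listing above."""
    # One point per stratum of (0,1), randomly paired across coordinates.
    u1 = [(i + rng.random()) / rep for i in range(rep)]
    u2 = [(i + rng.random()) / rep for i in range(rep)]
    rng.shuffle(u1)
    rng.shuffle(u2)
    out = []
    for x1, x2 in zip(u1, u2):
        y1 = 1 - (1 - x1) ** 0.5  # inverse CDF of Beta(1,2): F(y) = 1-(1-y)^2
        z1 = y1
        z2 = x2 * (1 - y1)        # stick-breaking gives (z1,z2) ~ Dirichlet(1,1,1)
        out.append((z1, z2))
    return out

rng = random.Random(1)
sample = lhs_dirichlet(2000, rng)
mean_z1 = sum(z1 for z1, _ in sample) / len(sample)
mean_z2 = sum(z2 for _, z2 in sample) / len(sample)
```

Both marginal means should be close to 1/3, and the stratification makes the z1 average noticeably less variable than under simple random sampling.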
APPENDIX C
RATIO-OF-UNIFORMS
In this appendix, we explain the mechanics of the ratio-of-uniforms method,
which we use to generate from the full conditionals in Section 4.5. Our approach
follows that of Gilks (1996), with a slight change in notation.
Let C_g be the subset of R^2 defined by

    C_g = {(u,v) : 0 < u < sqrt(g(v/u))}                                  (C.1)

and let (u,v) be a bivariate random variable that is uniformly distributed on C_g,
that is, f(u,v) = k if (u,v) is in C_g and 0 otherwise. Let θ = v/u. Then the joint
density of θ and u is f(θ,u) = k*u for 0 < u < sqrt(g(θ)) and 0 otherwise, so the
marginal density of θ is

    f(θ) = ∫ from 0 to sqrt(g(θ)) of f(θ,u) du = (k/2) g(θ),              (C.2)

which implies that f(θ) is proportional to g(θ), with normalizing constant
k = 2 / ∫ g(θ) dθ.
We can therefore generate from a non-normalized density g(.) without evaluating its
normalizing constant, as long as we can identify the region
C_g and generate bivariate uniform variables over it.
One way to generate uniformly on C_g is to use rejection. If we can find a
rectangle R = [0,a] x [b,c] that contains C_g, we generate a candidate
uniformly from R and accept it if it lies in C_g. The minimal such rectangle is

    a = sup over θ of sqrt(g(θ))
    b = inf over θ of θ sqrt(g(θ))
    c = sup over θ of θ sqrt(g(θ))

but any rectangle containing it will do. The
probability of acceptance is 1/(k a (c - b)). The infima and
suprema are usually difficult to derive analytically. One approach is to accept a
decrease in efficiency and use somewhat looser bounds that can be calculated
easily; another is to compute the bounds numerically. We took the second
approach in our implementation. This may increase the required computing
time, but it yields a generic algorithm; otherwise the bounds would have to be
rederived every time the model changes. Hence the implementation of the ratio-
of-uniforms method requires three maximizations per generated value (counting
the minimization as a maximization for purposes of computer time), followed by a
rejection step. In our experience the rejection step is quite efficient:
on average, about 10 out of 13 candidates are accepted.
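The procedure above can be sketched generically in Python. Here g is an arbitrary non-normalized density, with the half-normal kernel exp(-θ^2/2) on θ > 0 standing in for the full conditionals of Section 4.5, and the bounds a, b, c are found by a crude grid search rather than analytically, in the spirit of our numerical implementation; all names are illustrative.

```python
import math
import random

def rou_sample(g, theta_grid, n, rng):
    """Ratio-of-uniforms sampling from a non-normalized density g.
    Bounds a = sup sqrt(g), b = inf theta*sqrt(g), c = sup theta*sqrt(g)
    are estimated over theta_grid and inflated slightly so the bounding
    rectangle still covers C_g; a looser rectangle costs efficiency only."""
    s = [math.sqrt(g(t)) for t in theta_grid]
    ts = [t * si for t, si in zip(theta_grid, s)]
    a = 1.01 * max(s)
    b = min(0.0, 1.01 * min(ts))
    c = 1.01 * max(ts)
    out = []
    while len(out) < n:
        u = a * rng.random()            # u ~ Uniform(0, a)
        if u == 0.0:
            continue
        v = b + (c - b) * rng.random()  # v ~ Uniform(b, c)
        theta = v / u
        if u < math.sqrt(g(theta)):     # accept iff (u, v) lies in C_g
            out.append(theta)
    return out

g = lambda t: math.exp(-t * t / 2) if t > 0 else 0.0  # half-normal kernel
grid = [i * 0.01 for i in range(1, 1001)]             # theta in (0, 10]
rng = random.Random(2)
draws = rou_sample(g, grid, 20000, rng)
mean = sum(draws) / len(draws)  # half-normal mean is sqrt(2/pi), about 0.80
```

For this example the acceptance rate works out to roughly 70 percent, in the same range as the 10-out-of-13 figure quoted above for the full conditionals.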
APPENDIX D
SAS IML CODE FOR THE GIBBS SAMPLER
options ps=32 ls=76 nodate;

proc iml;

file "harden-a.out";

/* epsilon */
eps=10e-6;

/* Setup regarding the observations and model matrices */

y={4.9,4.4,4.2,4,4.1,5,5.1,4.5,4.9,5.1};

n=10;
p=1;
q1=2;
q2=3;

one=j(n,1,1); /* design matrix for the fixed effects */

Z1={1 0,1 0,1 0,1 0,1 0,0 1,0 1,0 1,0 1,0 1};
Z2={1 0 0,0 1 0,0 1 0,0 0 1,0 0 1,1 0 0,1 0 0,0 1 0,0 0 1,0 0 1};

/* Modules to generate from the variance ratios using the
   ratio-of-uniforms method */

start den1(x2) global(q1,q2,n,Z1,Z2,one,y,t1);
A2=I(q2)+x2*Z2'*Z2;
B2=I(n)-x2*Z2*(inv(A2))*Z2';
A1=I(q1)+t1*Z1'*B2*Z1;
B1=B2-t1*B2'*Z1*(inv(A1))*Z1'*B2;
g1=(one'*B1*one)**(-0.5);
g2=(det(A1))**(-0.5);
g3=(det(A2))**(-0.5);
muhat2=(g1**2)*one'*B1*y;
g4=((y-muhat2*one)'*B1*(y-muhat2*one))**(-(n+1)/2);
v1=g1*g2*g3*g4*((1+t1+x2)**(-3));
v2=sqrt(v1);
return(v2);
finish den1;

start den2(x2) global(q1,q2,n,Z1,Z2,one,y,t1);
A2=I(q2)+x2*Z2'*Z2;
B2=I(n)-x2*Z2*(inv(A2))*Z2';
A1=I(q1)+t1*Z1'*B2*Z1;
B1=B2-t1*B2'*Z1*(inv(A1))*Z1'*B2;
g1=(one'*B1*one)**(-0.5);
g2=(det(A1))**(-0.5);
g3=(det(A2))**(-0.5);
muhat2=(g1**2)*one'*B1*y;
g4=((y-muhat2*one)'*B1*(y-muhat2*one))**(-(n+1)/2);
v1=g1*g2*g3*g4*((1+t1+x2)**(-3));
v2=x2*sqrt(v1);
return(v2);
finish den2;

start den3(x1) global(q1,q2,n,Z1,Z2,one,y,t2);
A2=I(q2)+t2*Z2'*Z2;
B2=I(n)-t2*Z2*(inv(A2))*Z2';
A1=I(q1)+x1*Z1'*B2*Z1;
B1=B2-x1*B2'*Z1*(inv(A1))*Z1'*B2;
g1=(one'*B1*one)**(-0.5);
g2=(det(A1))**(-0.5);
g3=(det(A2))**(-0.5);
muhat2=(g1**2)*one'*B1*y;
g4=((y-muhat2*one)'*B1*(y-muhat2*one))**(-(n+1)/2);
v1=g1*g2*g3*g4*((1+x1+t2)**(-3));
v2=sqrt(v1);
return(v2);
finish den3;

start den4(x1) global(q1,q2,n,Z1,Z2,one,y,t2);
A2=I(q2)+t2*Z2'*Z2;
B2=I(n)-t2*Z2*(inv(A2))*Z2';
A1=I(q1)+x1*Z1'*B2*Z1;
B1=B2-x1*B2'*Z1*(inv(A1))*Z1'*B2;
g1=(one'*B1*one)**(-0.5);
g2=(det(A1))**(-0.5);
g3=(det(A2))**(-0.5);
muhat2=(g1**2)*one'*B1*y;
g4=((y-muhat2*one)'*B1*(y-muhat2*one))**(-(n+1)/2);
v1=g1*g2*g3*g4*((1+x1+t2)**(-3));
v2=x1*sqrt(v1);
return(v2);
finish den4;

start likden(x1,x2) global(q1,q2,n,Z1,Z2,one,y);
A2=I(q2)+x2*Z2'*Z2;
B2=I(n)-x2*Z2*(inv(A2))*Z2';
A1=I(q1)+x1*Z1'*B2*Z1;
B1=B2-x1*B2'*Z1*(inv(A1))*Z1'*B2;
g1=(one'*B1*one)**(-0.5);
g2=(det(A1))**(-0.5);
g3=(det(A2))**(-0.5);
muhat2=(g1**2)*one'*B1*y;
g4=((y-muhat2*one)'*B1*(y-muhat2*one))**(-(n+1)/2);
v1=g1*g2*g3*g4;
return(v1);
finish likden;

start modgen2;

inf1=0;
inf2=0;

opt=1;
bc={0,.};
x0=1;

call nlpqn(r1,arg1,"den1",x0,bc,opt);
call nlpqn(r2,arg2,"den2",x0,bc,opt);
sup1=den1(arg1);
sup2=den2(arg2);

accept=0;

do until(accept=1);
v1=sup1*uniform(0);
v2=inf2+(sup2-inf2)*uniform(0);
v=v2/v1;

upper=den1(v);

if v1<upper then accept=1; /* accept iff u < sqrt(g(v/u)) */

end;

t2=v;
finish modgen2;

start modgen1;

inf1=0;
inf2=0;

opt=1;
bc={0,.};
x0=1;

call nlpqn(r1,arg1,"den3",x0,bc,opt);
call nlpqn(r2,arg2,"den4",x0,bc,opt);
sup1=den3(arg1);
sup2=den4(arg2);

accept=0;

do until(accept=1);
v1=sup1*uniform(0);
v2=inf2+(sup2-inf2)*uniform(0);
v=v2/v1;

upper=den3(v);

if v1<upper then accept=1; /* accept iff u < sqrt(g(v/u)) */

end;

t1=v;
finish modgen1;

niter=1000;
burnin=100;
sum=0;
t1=1;
do j=1 to niter;
call modgen2;
call modgen1;
if j>burnin then do;
sum=sum+(1/likden(t1,t2));
mltemp=(j-burnin)/sum;
put t1 +1 t2 +1 mltemp;
end;
end;

ml=(niter-burnin)/sum;
print ml;
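The accumulation at the bottom of this listing is the harmonic-mean form of the marginal-likelihood estimator: after burn-in, the reciprocal likelihood 1/likden(t1,t2) is summed over the Gibbs draws and ml = (niter-burnin)/sum. Isolated from the sampler, the estimator reduces to the following sketch; `harmonic_mean_ml` is a hypothetical name, and the likelihood values stand in for evaluations of likden at posterior draws.

```python
def harmonic_mean_ml(likelihoods):
    """Harmonic-mean estimator of the marginal likelihood from likelihood
    values evaluated at (post burn-in) posterior draws."""
    if not likelihoods:
        raise ValueError("need at least one post-burn-in draw")
    return len(likelihoods) / sum(1.0 / lik for lik in likelihoods)

# A constant likelihood is recovered exactly:
print(harmonic_mean_ml([0.5, 0.5, 0.5]))  # -> 0.5
```

The same running form appears in the listing as mltemp, which lets one monitor the estimate as the chain progresses.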