Analysis of Repeated Measurements (Spring 2017)
J.P.Kim
Dept. of Statistics
Last modified on June 18, 2017
Preface & Disclaimer
This note is a summary of the lecture Analysis of Repeated Measurements (326.748A) held at Seoul National University, Spring 2017. The lecturer was MC Paik, and the note was written by J.P.Kim, a Ph.D student. The textbooks and references for this course are the following.
• Analysis of Longitudinal Data, Diggle, Heagerty, Liang & Zeger, Oxford University Press,
2002.
• Linear Mixed Models in Practice: An SAS-oriented Approach, Verbeke & Molenberghs,
Springer, 1997.
• Applied longitudinal analysis, Fitzmaurice, Laird & Ware, Wiley, 2004.
• Statistical Methods for the Analysis of Repeated Measurements, Charles S. Davis, Springer-
Verlag, 2002.
I also referred to the following books while writing this note. The list will be updated continuously.
• Applied Multivariate Statistical Analysis, Johnson & Wichern, Pearson, 2002.
Most of the content of this summary note comes from these textbooks and references. If you want to report typos or mistakes, please contact: [email protected]
Chapter 1
Introduction
1.1 Introduction
Consider data that are repeatedly measured over time. Such data, so-called longitudinal data, are contrasted with data from cross-sectional studies, because longitudinal studies offer unique information that other study designs cannot. While it is often possible to address the same scientific questions with a longitudinal or a cross-sectional study, the major advantage of the former is its capacity to separate what in the context of population studies are called cohort and age effects.
Example 1.1. Consider the data illustrated in Figure 1.1. In the left panel, score is plotted against age for a hypothetical cross-sectional study of children; data are collected from 8 individuals. With these data alone, little else can be said. However, in the right panel, suppose the same data were obtained in a longitudinal study in which each individual was measured twice (over 2 years). Now it is clear that while younger children began at a higher score, everyone improved with time. A longitudinal study gives unique information that a cross-sectional study cannot.
Figure 1.1: Cross-sectional study vs Longitudinal one
Notations. In this item, we define some notation that we will use throughout the entire course. Let $Y_{ij}$ be the outcome (possibly a vector) for the $i$th individual at the $j$th time point, where $i = 1, 2, \cdots, K$ indexes individuals and $j = 1, 2, \cdots, n_i$ indexes time points (or subunits). Also, let $X_{ij} = (x_{ij1}, x_{ij2}, \cdots, x_{ijp})^\top$ be the $p$-variate vector of covariates for the $i$th individual at the $j$th time point.
Some elements do not change over time and are called time-constant covariates. For example, age at
entry or gender is time-constant. Then the second subscript (of Xij) is not necessary. Some elements
change over time and are called time-varying covariates.
Example 1.2. If the covariate $x_{ij} = x_i$ is time-constant, e.g.
$$E(Y_{ij}\mid x_i) = \beta_0 + \beta_1 x_i + \beta_2 j,$$
then $\beta_1$ is the difference in $Y$ per unit change in $x$. If the covariate is time-varying, we can write the model as
$$E(Y_{ij}\mid x_i) = \beta_0 + \beta_1 x_{i1} + \beta_2(x_{ij} - x_{i1}) + \beta_3 j.$$
Note that $\beta_2$ can be estimated only if time-varying covariates are available. For example, if $x$ denotes the exercise level at each month and all individuals in the study never exercise (just like the summarizer), we cannot estimate $\beta_2$.
Remark 1.3. Note that repeated measurements within a unit are correlated. The most important
part in analysis of longitudinal data is how to deal with such correlation issue. Of course, we can avoid
the correlation issue by using summary measures (e.g. average through the time) for independent
units (Two-stage model), or conditional analysis eliminating nuisance parameters associated with
correlation. However, in this course, we will see some methods for handling such correlation issues. Of course, depending on the goal of the study, different analytical methods should be called for. For example, if the goal of the study is how the outcome varies for one group (e.g. smokers) vs. the other (e.g. nonsmokers), a marginal approach such as the Generalized Estimating Equation fits well. In other
words, if each individual belongs to one of groups and such group does not change, the goal of the
study might be to see difference between groups, and hence we may see marginal (integrated) properties
of each group. On the other hand, if the goal of the study is how the outcome varies within a subject when a condition (e.g. smoking vs. non-smoking) is changed, a conditional approach such as a mixed effects model fits well (e.g. the smoking status of each individual changes over time).
Remark 1.4. In linear models, the two approaches (marginal vs. conditional) are congruent; in
nonlinear models, incongruent. For example, consider linear models. If one uses marginal approach,
then our model might be
E(Y |X) = β0 + βXX,
where X = 0 or X = 1. Then βX is related to the difference between two groups,
βX = E(Y |X = 1)− E(Y |X = 0).
However, if we consider a conditional approach, then outcome of each individual matters, and hence
the model becomes
E(Yi|Xi, bi) = β0 + bi + βXXi,
where $b_i$ denotes the "random effect" which distinguishes each individual (i.e., it is a subject-specific quantity that determines the individual). Then, once an individual is fixed (at $b$),
βX = E(Y |X = 1, b)− E(Y |X = 0, b)
denotes the difference between two groups, and hence the two models do not ‘contradict’ each other,
even though interpretation of βX is different. However, in nonlinear case, two models ‘contradict’ each
other. For example, consider the following two logistic models
logitP (Y = 1|X) = β0 + βXX
and
logitP (Y = 1|X, b) = β0 + b+ βXX.
The former one (a marginal approach) yields
$$\beta_X = \log\frac{P(Y=1\mid X=1)\,P(Y=0\mid X=0)}{P(Y=1\mid X=0)\,P(Y=0\mid X=1)} \qquad (\text{``log odds ratio''}),$$
or equivalently
$$P(Y=1\mid X) = \frac{e^{\beta_0+\beta_X X}}{1+e^{\beta_0+\beta_X X}},$$
while the latter one (a conditional approach) yields
$$\beta_X = \log\frac{P(Y=1\mid X=1,b)\,P(Y=0\mid X=0,b)}{P(Y=1\mid X=0,b)\,P(Y=0\mid X=1,b)},$$
or
$$P(Y=1\mid X,b) = \frac{e^{\beta_0+b+\beta_X X}}{1+e^{\beta_0+b+\beta_X X}}.$$
Denoting the distribution (pdf) of $b$ as $g(b)$, we obtain that
$$P(Y=1\mid X) = \int P(Y=1\mid X,b)\,g(b)\,db = \int \frac{e^{\beta_0+b+\beta_X X}}{1+e^{\beta_0+b+\beta_X X}}\,g(b)\,db \;\neq\; \frac{e^{\beta_0^*+\beta_X^* X}}{1+e^{\beta_0^*+\beta_X^* X}},$$
and hence two approaches contradict each other.
The previous remarks emphasize that we have to make the goal of the study clear before analysis. Exploratory data analysis or descriptive statistics can be helpful for modeling before the main analysis.
Example 1.5. High correlation within the data (from each individual) means that each individual is distinct from the others; we can distinguish them from one another. Low correlation gives a tangled plot over time; individuals are indistinguishable.
Figure 1.2: Example 1.5.
1.2 Review for multivariate analysis
In this section, we recall some multivariate data analysis methods.
Definition 1.6. Let $Y = (Y_1, Y_2, \cdots, Y_p)^\top$ be a random vector. Then
$$E(Y) = \begin{pmatrix} E(Y_1) \\ E(Y_2) \\ \vdots \\ E(Y_p) \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{pmatrix} = \mu$$
and
$$\mathrm{Var}(Y) = E(Y-\mu)(Y-\mu)^\top = \Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{pmatrix} = E(YY^\top) - \mu\mu^\top.$$
Also, the correlation matrix is defined as $R = V^{-1/2}\Sigma V^{-1/2}$, where
$$V = \begin{pmatrix} \sigma_{11} & 0 & \cdots & 0 \\ 0 & \sigma_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{pp} \end{pmatrix}.$$
Example 1.7. Estimation of mean and variance. We estimate $\mu$ as
$$\hat\mu = \bar y = \frac{1}{n}\sum_{i=1}^n Y_i.$$
If $\mu$ is known, we can estimate $\Sigma$ as
$$\tilde\Sigma = \frac{1}{n}\sum_{i=1}^n (Y_i-\mu)(Y_i-\mu)^\top,$$
and if $\mu$ is unknown, then
$$\hat\Sigma = \frac{1}{n-1}\sum_{i=1}^n (Y_i-\hat\mu)(Y_i-\hat\mu)^\top.$$
Now clearly
$$E(AY) = AE(Y) = A\mu \quad\text{and}\quad \mathrm{Var}(A\hat\mu) = \frac{1}{n}A\Sigma A^\top.$$
It can be easily obtained that
$$E(\hat\mu) = \mu \quad\text{and}\quad \mathrm{Var}(\hat\mu) = \frac{1}{n}\Sigma.$$
Also, note that
$$E(Y_i-\hat\mu)(Y_i-\hat\mu)^\top = E(Y_i-\mu)(Y_i-\mu)^\top - E(Y_i-\mu)(\hat\mu-\mu)^\top - E(\hat\mu-\mu)(Y_i-\mu)^\top + E(\hat\mu-\mu)(\hat\mu-\mu)^\top = \Big(1-\frac{1}{n}\Big)\Sigma,$$
and therefore
$$E\Big(\frac{1}{n}\sum_{i=1}^n(Y_i-\hat\mu)(Y_i-\hat\mu)^\top\Big) = \frac{n-1}{n}\Sigma, \qquad E(\hat\Sigma) = \Sigma.$$
Example 1.8. The multivariate normal density uses a multivariate version of a standardized distance, the Mahalanobis distance:
$$f(Y) \propto \exp\Big(-\frac{1}{2}(Y-\mu)^\top\Sigma^{-1}(Y-\mu)\Big).$$
Note that
$$Y \sim N_p(\mu,\Sigma) \iff Z = \Sigma^{-1/2}(Y-\mu) \sim N_p(0, I) \iff z_1, z_2, \cdots, z_p \overset{i.i.d.}{\sim} N(0,1).$$
Example 1.9. Consider a hypothesis testing problem
$$H_0: \mu = \mu_0 \quad\text{vs}\quad H_1: \mu \neq \mu_0$$
based on $Y_1, Y_2, \cdots, Y_n \sim N_p(\mu, \Sigma)$. When $\Sigma$ is known, we can use the fact that, under $H_0$,
$$n(\hat\mu - \mu_0)^\top\Sigma^{-1}(\hat\mu - \mu_0) \sim \chi^2(p).$$
If $\Sigma$ is unknown, we use the following Hotelling's $T^2$ distribution.
• The distribution of
$$Z_1Z_1^\top + \cdots + Z_kZ_k^\top,$$
where $Z_1, \cdots, Z_k \overset{i.i.d.}{\sim} N_p(0, \Sigma)$, is called a Wishart distribution and denoted $W_p(\Sigma, k)$.
• If $\alpha$ can be written as $m\,d^\top M^{-1}d$ where $d \perp M$, $d \sim N_p(0, I)$, $M \sim W_p(I, m)$, then we say that $\alpha$ has the Hotelling $T^2$ distribution with parameter $p$ and df $m$. We write $\alpha \sim T^2(p, m)$.
• Note that
$$T^2 = (\hat\mu-\mu_0)^\top\Big(\frac{S}{n}\Big)^{-1}(\hat\mu-\mu_0) = \big[\sqrt{n}\,\Sigma^{-1/2}(\hat\mu-\mu)\big]^\top\Sigma^{1/2}S^{-1}\Sigma^{1/2}\big[\sqrt{n}\,\Sigma^{-1/2}(\hat\mu-\mu)\big] \sim T^2(p, n-1).$$
• Also note that
$$T^2(p, n-1) \overset{d}{\equiv} \frac{(n-1)p}{n-p}F(p, n-p).$$
These come from the following two facts: let $Z_1, \cdots, Z_n \sim N_p(0, I)$ and let $S_Z$ be their sample variance matrix. Then
$$(n-1)S_Z \sim W_p(I, n-1)$$
and
$$n\bar Z^\top S_Z^{-1}\bar Z \sim \frac{(n-1)p}{n-p}F(p, n-p).$$
For the first fact, let $\mathbf{Z} := (Z_1, Z_2, \cdots, Z_n)^\top$ be the $n\times p$ matrix of observations. Then
$$(Z_1-\bar Z, \cdots, Z_n-\bar Z)^\top = (I - \Pi_1)\mathbf{Z},$$
where $\Pi_1 = \mathbf{1}(\mathbf{1}^\top\mathbf{1})^{-1}\mathbf{1}^\top$. Hence we get
$$(n-1)S_Z = \sum_{i=1}^n(Z_i-\bar Z)(Z_i-\bar Z)^\top = \mathbf{Z}^\top(I-\Pi_1)\mathbf{Z},$$
and then the spectral decomposition of the idempotent matrix $I-\Pi_1$ ends the proof.
For the second fact, consider the following two properties of the Wishart distribution.
• Let $S \sim W_p(\Sigma, k)$ ($k \ge p$) and partition
$$\Sigma = \begin{pmatrix} \Sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix}, \qquad \Sigma^{-1} = \begin{pmatrix} \Sigma^{11} & \sigma^{12} \\ \sigma^{21} & \sigma^{22} \end{pmatrix}, \qquad S^{-1} = \begin{pmatrix} S^{11} & s^{12} \\ s^{21} & s^{22} \end{pmatrix}.$$
Then
$$\frac{\sigma^{22}}{s^{22}} \sim \chi^2(k-p+1).$$
The proof is given as follows. Note that
$$1/s^{22} = s_{22} - s_{21}S_{11}^{-1}s_{12}.$$
Also note that
$$S \overset{d}{\equiv} Z_1Z_1^\top + \cdots + Z_kZ_k^\top,$$
where $Z_1, \cdots, Z_k \overset{i.i.d.}{\sim} N_p(0, \Sigma)$. Letting $Z_i = \begin{pmatrix} X_i \\ Y_i \end{pmatrix}$, we get
$$S = \begin{pmatrix} \sum X_iX_i^\top & \sum X_iY_i \\ \sum Y_iX_i^\top & \sum Y_i^2 \end{pmatrix}$$
and hence
$$1/s^{22} = s_{22\cdot 1} = Y^\top(I - \Pi_X)Y.$$
Now $\mathrm{Var}(Y_i\mid X_i) = \sigma_{22\cdot 1}$ gives
$$Y^\top(I-\Pi_X)Y \mid X \sim \sigma_{22\cdot 1}\,\chi^2(k-(p-1)),$$
which is equivalent to
$$\frac{\sigma^{22}}{s^{22}} \sim \chi^2(k-p+1).$$
• If $S \sim W_p(\Sigma, k)$, then for any vector $d$,
$$\frac{d^\top\Sigma^{-1}d}{d^\top S^{-1}d} \sim \chi^2(k-p+1).$$
The proof is given as follows. WLOG $\|d\| = 1$. Let
$$P = \begin{pmatrix} P_1 \\ d^\top \end{pmatrix}$$
be an orthogonal matrix. Then
$$PSP^\top \sim W_p(P\Sigma P^\top, k)$$
holds. Now note that
$$\frac{(P\Sigma^{-1}P^\top)_{22}}{(PS^{-1}P^\top)_{22}} = \frac{d^\top\Sigma^{-1}d}{d^\top S^{-1}d} \sim \chi^2(k-p+1)$$
by the previous proposition.
• Now we come back to the original problem. Our goal is to show that
$$n\bar Z^\top S_Z^{-1}\bar Z \sim \frac{(n-1)p}{n-p}F(p, n-p).$$
Note that
$$\frac{(n-1)\bar Z^\top\bar Z}{\bar Z^\top S_Z^{-1}\bar Z}\,\Big|\,\bar Z \sim \chi^2(n-p),$$
and hence
$$\frac{(n-1)\bar Z^\top\bar Z}{\bar Z^\top S_Z^{-1}\bar Z} \sim \chi^2(n-p)$$
and it is independent of $\bar Z$. Therefore,
$$\frac{n\bar Z^\top\bar Z/p}{\dfrac{(n-1)\bar Z^\top\bar Z}{\bar Z^\top S_Z^{-1}\bar Z}\Big/(n-p)} \sim F(p, n-p),$$
i.e.,
$$n\bar Z^\top S_Z^{-1}\bar Z \sim \frac{(n-1)p}{n-p}F(p, n-p).$$
Now we can obtain our conclusion:
$$T^2 = (\hat\mu-\mu_0)^\top(S/n)^{-1}(\hat\mu-\mu_0) = \big(\sqrt{n}\,\Sigma^{-1/2}(\hat\mu-\mu_0)\big)^\top\big(\Sigma^{-1/2}S\Sigma^{-1/2}\big)^{-1}\sqrt{n}\,\Sigma^{-1/2}(\hat\mu-\mu_0) \sim T^2(p, n-1) \overset{d}{\equiv} \frac{(n-1)p}{n-p}F(p, n-p)$$
under $H_0: \mu = \mu_0$. Here $S = \hat\Sigma$ denotes the sample covariance matrix (with divisor $n-1$).
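As a quick illustration, here is a minimal Python sketch of the one-sample Hotelling $T^2$ test described above, using only NumPy and SciPy; the data array `Y` and the null mean `mu0` are hypothetical.

```python
import numpy as np
from scipy import stats

def hotelling_one_sample(Y, mu0):
    """One-sample Hotelling's T^2 test of H0: mu = mu0.

    Y : (n, p) array of i.i.d. multivariate observations.
    Returns the T^2 statistic, the equivalent F statistic, and its p-value.
    """
    n, p = Y.shape
    mu_hat = Y.mean(axis=0)
    S = np.cov(Y, rowvar=False)                  # sample covariance (divisor n-1)
    diff = mu_hat - mu0
    T2 = n * diff @ np.linalg.solve(S, diff)     # n (mu_hat-mu0)' S^{-1} (mu_hat-mu0)
    F = (n - p) / ((n - 1) * p) * T2             # T^2(p, n-1) = (n-1)p/(n-p) F(p, n-p)
    pval = stats.f.sf(F, p, n - p)
    return T2, F, pval

# hypothetical example
rng = np.random.default_rng(0)
Y = rng.normal(size=(50, 3))
print(hotelling_one_sample(Y, mu0=np.zeros(3)))
```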
Remark 1.10. Note that Hotelling's $T^2$ statistic is invariant under nonsingular linear transformations: for $Y = CX + d$, where $C$ is a $p\times p$ nonsingular matrix, we get
$$n(\bar Y - \hat\mu_Y)^\top S_Y^{-1}(\bar Y - \hat\mu_Y) = n(C\bar X - C\hat\mu_X)^\top(C^\top)^{-1}S_X^{-1}C^{-1}(C\bar X - C\hat\mu_X) = n(\bar X - \hat\mu_X)^\top S_X^{-1}(\bar X - \hat\mu_X).$$
Example 1.11. Testing for time trend can be expressed as linear hypothesis H0 : C>µ = 0.
(i) For example, testing no time change at all,
$$H_0: \mu_1 - \mu_2 = \mu_2 - \mu_3 = \cdots = \mu_{K-1} - \mu_K = 0,$$
can be treated as the linear hypothesis $H_0: C^\top\mu = 0$ with
$$C^\top = \begin{pmatrix} 1 & -1 & & & \\ & 1 & -1 & & \\ & & \ddots & \ddots & \\ & & & 1 & -1 \end{pmatrix}.$$
For example, if $K = 3$, then
$$C^\top = \begin{pmatrix} 1 & -1 & 0 \\ 0 & 1 & -1 \end{pmatrix},$$
and if $K = 4$, then
$$C^\top = \begin{pmatrix} 1 & -1 & 0 & 0 \\ 0 & 1 & -1 & 0 \\ 0 & 0 & 1 & -1 \end{pmatrix}.$$
We can say that $C$ is a $K\times(K-1)$ matrix. Now from
$$\sqrt{n}(\hat\mu - \mu) \sim N_K(0, \Sigma),$$
we get
$$\sqrt{n}\,C^\top(\hat\mu - \mu) \sim N_{K-1}(0, C^\top\Sigma C),$$
and from
$$(n-1)\hat\Sigma \sim W_K(\Sigma, n-1),$$
we get
$$(n-1)C^\top\hat\Sigma C \sim W_{K-1}(C^\top\Sigma C, n-1).$$
Therefore, we get
$$n(C^\top\hat\mu)^\top(C^\top\hat\Sigma C)^{-1}(C^\top\hat\mu) \underset{H_0}{\sim} T^2(K-1, n-1) \overset{d}{\equiv} \frac{(n-1)(K-1)}{n-K+1}F(K-1, n-K+1).$$
Or, to make the problem simpler, we can use $\hat\Sigma \approx \Sigma$ and the approximation
$$n(C^\top\hat\mu)^\top(C^\top\hat\Sigma C)^{-1}(C^\top\hat\mu) \underset{H_0}{\approx} \chi^2(K-1).$$
(ii) Next, we consider testing for equal change (with equal intervals), or testing for linearity:
$$H_0: \mu_2 - \mu_1 = \mu_3 - \mu_2 = \cdots = \mu_K - \mu_{K-1}$$
is equivalent to
$$H_0: \mu_1 - 2\mu_2 + \mu_3 = 0,\ \mu_2 - 2\mu_3 + \mu_4 = 0,\ \cdots,$$
i.e., $H_0: C^\top\mu = 0$, where
$$C^\top = \begin{pmatrix} 1 & -2 & 1 & & & \\ & 1 & -2 & 1 & & \\ & & \ddots & \ddots & \ddots & \\ & & & 1 & -2 & 1 \end{pmatrix}.$$
For example, if $K = 3$, then
$$C^\top = \begin{pmatrix} 1 & -2 & 1 \end{pmatrix},$$
and if $K = 4$, then
$$C^\top = \begin{pmatrix} 1 & -2 & 1 & 0 \\ 0 & 1 & -2 & 1 \end{pmatrix}.$$
In this case, $C$ is a $K\times(K-2)$ matrix. Now from
$$\sqrt{n}\,C^\top(\hat\mu - \mu) \sim N_{K-2}(0, C^\top\Sigma C)$$
and
$$(n-1)C^\top\hat\Sigma C \sim W_{K-2}(C^\top\Sigma C, n-1),$$
we get
$$n(C^\top\hat\mu)^\top(C^\top\hat\Sigma C)^{-1}(C^\top\hat\mu) \underset{H_0}{\sim} T^2(K-2, n-1) \overset{d}{\equiv} \frac{(n-1)(K-2)}{n-K+2}F(K-2, n-K+2),$$
or approximately,
$$n(C^\top\hat\mu)^\top(C^\top\hat\Sigma C)^{-1}(C^\top\hat\mu) \underset{H_0}{\approx} \chi^2(K-2).$$
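A minimal Python sketch of the chi-square approximation to these contrast tests follows; the helper name and the simulated data are hypothetical, and the contrast matrix can be switched between the "no change" and "linearity" versions above.

```python
import numpy as np
from scipy import stats

def contrast_chisq_test(Y, order=1):
    """Test H0: C'mu = 0 using the chi-square approximation above.

    Y     : (n, K) array, one row per subject, K repeated measurements.
    order : 1 -> successive differences (no time change at all),
            2 -> second differences (linearity / equal change).
    """
    n, K = Y.shape
    C_t = np.diff(np.eye(K), n=order, axis=0) * (-1) ** order  # (K-order) x K contrast rows
    mu_hat = Y.mean(axis=0)
    Sigma_hat = np.cov(Y, rowvar=False)
    d = C_t @ mu_hat
    V = C_t @ Sigma_hat @ C_t.T
    stat = n * d @ np.linalg.solve(V, d)
    df = K - order
    return stat, stats.chi2.sf(stat, df)

# hypothetical example with K = 4 time points and a linear trend
rng = np.random.default_rng(1)
Y = rng.normal(size=(40, 4)) + np.array([0.0, 0.5, 1.0, 1.5])
print(contrast_chisq_test(Y, order=1))  # should reject "no change at all"
print(contrast_chisq_test(Y, order=2))  # linear trend, should not reject
```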
Example 1.12. Tukey's Simultaneous Confidence Intervals. Consider a random sample $Y_1, Y_2, \cdots, Y_n \sim N_p(\mu, \Sigma)$. In this example, we want to find a confidence interval for $\ell^\top\mu$ where $\ell$ is NOT specified or predetermined; in other words, we want to cover various values of $\ell$ simultaneously. Let
$$S = \frac{1}{n-1}\sum_{i=1}^n(Y_i - \bar Y)(Y_i - \bar Y)^\top.$$
Note that for a particular choice of $\ell$, a $100(1-\alpha)\%$ C.I. is
$$\left\{\mu : \left|\frac{n(\ell^\top\bar Y - \ell^\top\mu)^2}{\ell^\top S\ell}\right| \le t^2_{\alpha/2}(n-1)\right\},$$
i.e.,
$$P\left(\left|\frac{n(\ell^\top\bar Y - \ell^\top\mu)^2}{\ell^\top S\ell}\right| \le t^2_{\alpha/2}(n-1)\right) = 1-\alpha.$$
However, we want to make a statement such that for any choice of $\ell$,
$$P\left(\left|\frac{n(\ell^\top\bar Y - \ell^\top\mu)^2}{\ell^\top S\ell}\right| \le c\right) \ge 1-\alpha.$$
It can be achieved if one chooses $c$ s.t.
$$P\left(\max_\ell\left|\frac{n(\ell^\top\bar Y - \ell^\top\mu)^2}{\ell^\top S\ell}\right| \le c\right) = 1-\alpha.$$
Note that
$$\max_\ell\frac{n(\ell^\top\bar Y - \ell^\top\mu)^2}{\ell^\top S\ell} = n(\bar Y - \mu)^\top S^{-1}(\bar Y - \mu) \sim T^2(p, n-1)$$
from the Cauchy–Schwarz inequality ($(\ell^\top S\ell)\,\big((\bar Y-\mu)^\top S^{-1}(\bar Y-\mu)\big) \ge \big(\ell^\top(\bar Y-\mu)\big)^2$, with equality when $\ell \propto S^{-1}(\bar Y-\mu)$). Thus we get
$$P\left(\max_\ell\left|\frac{n(\ell^\top\bar Y - \ell^\top\mu)^2}{\ell^\top S\ell}\right| \le \frac{(n-1)p}{n-p}F_\alpha(p, n-p)\right) \ge 1-\alpha$$
for any $\ell$, and hence the $100(1-\alpha)\%$ simultaneous C.I. for $\ell^\top\mu$ is
$$\ell^\top\bar Y \pm \sqrt{\frac{(n-1)p}{n(n-p)}F_\alpha(p, n-p)\,\ell^\top S\ell}.$$
Example 1.13. Bonferroni's Simultaneous Confidence Intervals. In this example, suppose that we have already prespecified our goals, and hence we want to find confidence intervals for $m$ prespecified linear combinations. Here we control the overall error rate. Let $C_i$ be the event that $\ell_i^\top\mu$ is included in its confidence interval, and let $P(C_i^c) = \alpha_i$. Then the overall error rate is
$$1 - P(\text{all } m \text{ linear combinations are included in their C.I.'s}) = P\Big(\bigcup_{i=1}^m C_i^c\Big) \le \sum_{i=1}^m P(C_i^c) = \sum_{i=1}^m\alpha_i,$$
and hence we can achieve the desired overall error rate by letting $\sum\alpha_i = \alpha$. One choice is $\alpha_i = \alpha/m$:
$$P\left(\sqrt{n}\left|\frac{\ell_i^\top\bar Y - \ell_i^\top\mu}{\sqrt{\ell_i^\top S\ell_i}}\right| < t_{\alpha/(2m)}(n-1)\ \ \forall i\right) \ge 1-\alpha.$$
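A small Python sketch of these Bonferroni-adjusted intervals (a sketch only; the contrast list `L` and the data are hypothetical):

```python
import numpy as np
from scipy import stats

def bonferroni_cis(Y, L, alpha=0.05):
    """Simultaneous CIs for the m prespecified combinations l_i' mu.

    Y : (n, p) data matrix; L : (m, p) matrix whose rows are the l_i.
    Each interval uses the t_{alpha/(2m)}(n-1) critical value.
    """
    n, p = Y.shape
    m = L.shape[0]
    ybar = Y.mean(axis=0)
    S = np.cov(Y, rowvar=False)
    tcrit = stats.t.ppf(1 - alpha / (2 * m), df=n - 1)
    center = L @ ybar
    half = tcrit * np.sqrt(np.einsum('ij,jk,ik->i', L, S, L) / n)  # sqrt(l'Sl / n)
    return np.column_stack([center - half, center + half])

# hypothetical example: CIs for each mean and for mu1 - mu2
rng = np.random.default_rng(2)
Y = rng.normal(size=(30, 3))
L = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, -1, 0]], dtype=float)
print(bonferroni_cis(Y, L))
```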
Remark 1.14. Note that, usually, the length of a Bonferroni C.I. is shorter than that of Tukey's. This is acceptable because Tukey's simultaneous C.I. ensures coverage of every linear combination at the predetermined error rate. However, if the number of combinations $m$ becomes large, then the Bonferroni C.I. becomes poor.
Example 1.15. Multiple Comparison.
(i) Consider a multiple comparison problem: one should reach a decision on several hypotheses $H_{01}, H_{02}, \cdots, H_{0k}$. To make the overall error rate less than $\alpha$, it is not sufficient to perform an ordinary hypothesis test on each hypothesis at significance level $\alpha$. Rather, one should control the probability of at least one incorrect rejection. For this, one can apply the Bonferroni correction: test each hypothesis at significance level $\alpha/k$, so that the family-wise error rate is less than $\alpha$.
(ii) One can apply another approach, called Holm's step-down procedure, or the Holm–Bonferroni method. The method is as follows: let $H_1, H_2, \cdots, H_k$ be a family of hypotheses and $P_1, P_2, \cdots, P_k$ the corresponding $P$-values. Order the $P$-values from lowest to highest, $P_{(1)}, P_{(2)}, \cdots, P_{(k)}$, and let the associated hypotheses be $H_{(1)}, H_{(2)}, \cdots, H_{(k)}$. Now for a given significance level $\alpha$, let $i$ be the minimal index such that
$$P_{(i)} > \frac{\alpha}{k+1-i}.$$
In other words, starting from the hypothesis with the smallest $P$-value, test each hypothesis $H_{(j)}$ at significance level $\alpha/(k-j+1)$. Then,
$$\text{reject } H_{(j)} \text{ if } j = 1, 2, \cdots, i-1; \qquad \text{accept } H_{(j)} \text{ if } j = i, i+1, \cdots, k.$$
Such a procedure ensures $\mathrm{FWER} \le \alpha$, where FWER denotes the family-wise error rate. Why? Let $I_0$ be the set of indices corresponding to the (unknown) true null hypotheses, having $k_0$ members. Further assume that we wrongly reject a true hypothesis. We have to prove that the probability of this event is at most $\alpha$. Let $h$ be the first rejected true hypothesis (first in the ordering $H_{(1)}, H_{(2)}, \cdots, H_{(k)}$). So $h-1$ is the last false hypothesis rejected and $h-1+k_0 \le k$. From there, we get
$$\frac{1}{k+1-h} \le \frac{1}{k_0}.$$
Since $h$ is rejected, we have
$$P_{(h)} \le \frac{\alpha}{k+1-h}$$
by definition of the test, and hence
$$P_{(h)} \le \frac{\alpha}{k_0}$$
is obtained. Therefore, if we wrongly reject a true hypothesis, there has to be a true hypothesis with $P$-value at most $\alpha/k_0$. Now define
$$A = \Big\{P_i \le \frac{\alpha}{k_0}\ \text{for some } i \in I_0\Big\}.$$
Then whatever the (unknown) set of true hypotheses $I_0$ is, we have
$$P(A) \le \alpha$$
by the Bonferroni inequalities.
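The step-down rule above is easy to code; here is a minimal Python sketch (the function name and the example p-values are hypothetical).

```python
import numpy as np

def holm_reject(pvals, alpha=0.05):
    """Holm-Bonferroni step-down procedure.

    Returns a boolean array: True where the hypothesis is rejected.
    Tests H_(j) at level alpha/(k - j + 1), stopping at the first failure.
    """
    p = np.asarray(pvals, dtype=float)
    k = len(p)
    order = np.argsort(p)                         # indices of sorted p-values
    thresholds = alpha / (k - np.arange(k))       # alpha/k, alpha/(k-1), ..., alpha
    sorted_ok = p[order] <= thresholds
    # reject only up to (not including) the first index where the test fails
    cutoff = np.argmax(~sorted_ok) if not sorted_ok.all() else k
    reject = np.zeros(k, dtype=bool)
    reject[order[:cutoff]] = True
    return reject

print(holm_reject([0.001, 0.04, 0.03, 0.20]))  # hypothetical p-values
```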
Example 1.16. Now we consider the two-sample case. Consider a hypothesis comparing the means of two populations,
$$H_0: \mu_1 = \mu_2,$$
with equal variances.
(i) Paired comparison. For example, let $x_i$ be repeated measurements for one eye, and $y_i$ be repeated measurements for the other eye. Then $x_i$ and $y_i$ are paired, i.e., not independent. However, we are interested in the mean of $d_i = x_i - y_i$. Denote $E(d_i) = \delta$ and $\mathrm{Cov}(d_i) = \Sigma_d$. Then our interest is to test
$$H_0: \delta = \delta_0,$$
where $\delta_0 = 0$ (in this case). Applying the result of the one-sample case to the $d_i$'s, we obtain
$$T^2 = n(\bar d - \delta_0)^\top S_d^{-1}(\bar d - \delta_0) \underset{H_0}{\sim} T^2(p, n-1) \overset{d}{\equiv} \frac{(n-1)p}{n-p}F(p, n-p),$$
where
$$\bar d = \frac{1}{n}\sum_{i=1}^n d_i \quad\text{and}\quad S_d = \frac{1}{n-1}\sum_{i=1}^n(d_i - \bar d)(d_i - \bar d)^\top.$$
Paired comparison might be used, for example, for ophthalmologic data or pre-post data.
(ii) Unpaired case. Now assume that the observed data set is unpaired. Assume that
$$X_i \sim N_p(\mu_x, \Sigma),\ i = 1, 2, \cdots, n_1, \qquad Y_j \sim N_p(\mu_y, \Sigma),\ j = 1, 2, \cdots, n_2.$$
Also assume that the $X$'s and $Y$'s are independent. Then
$$\Sigma^{-1/2}(\bar X - \bar Y) \underset{H_0}{\sim} N_p\left(0, \Big(\frac{1}{n_1}+\frac{1}{n_2}\Big)I\right),$$
and hence from
$$(n_1+n_2-2)S_p \equiv \underbrace{\sum_{i=1}^{n_1}(X_i-\bar X)(X_i-\bar X)^\top}_{\sim W_p(\Sigma,\,n_1-1)} + \underbrace{\sum_{j=1}^{n_2}(Y_j-\bar Y)(Y_j-\bar Y)^\top}_{\sim W_p(\Sigma,\,n_2-1)} \sim W_p(\Sigma, n_1+n_2-2),$$
we get
$$\Big(\frac{1}{n_1}+\frac{1}{n_2}\Big)^{-1}(\bar X-\bar Y)^\top S_p^{-1}(\bar X-\bar Y) \underset{H_0}{\sim} T^2(p, n_1+n_2-2).$$
Example 1.17. (Unequal variance) Let
$$X_i \sim N_p(\mu_x, \Sigma_x), \quad Y_j \sim N_p(\mu_y, \Sigma_y), \qquad i = 1, 2, \cdots, n_1,\ j = 1, 2, \cdots, n_2.$$
Then under $H_0: \mu_x = \mu_y$,
$$\bar X - \bar Y \sim N_p\big(0,\ n_1^{-1}\Sigma_x + n_2^{-1}\Sigma_y\big),$$
and hence we get
$$(\bar X - \bar Y)^\top\Big(\frac{1}{n_1}\hat\Sigma_x + \frac{1}{n_2}\hat\Sigma_y\Big)^{-1}(\bar X - \bar Y) \underset{H_0}{\approx} \chi^2(p)$$
asymptotically.
Example 1.18. Multivariate ANOVA (MANOVA). Now we want to compare more than 2 populations. For this, assume that
$$X_i^{(1)} \sim N_p(\mu_1, \Sigma),\ i = 1, \cdots, n_1; \qquad X_i^{(2)} \sim N_p(\mu_2, \Sigma),\ i = 1, \cdots, n_2; \qquad \cdots; \qquad X_i^{(k)} \sim N_p(\mu_k, \Sigma),\ i = 1, \cdots, n_k.$$
Also assume that the $X_i^{(j)}$ are independent and all populations have a common (unknown) variance $\Sigma$. Let $n = \sum_{j=1}^k n_j$. Our goal is to test
$$H_0: \mu_1 = \mu_2 = \cdots = \mu_k.$$
For this, we use a likelihood ratio test; $2(\ell_1^* - \ell_0^*)$ is asymptotically $\chi^2$-distributed. The MLEs are obtained as
$$\hat\mu_j = \bar X^{(j)} = \frac{1}{n_j}\sum_{i=1}^{n_j}X_i^{(j)}, \qquad \hat\Sigma = \frac{1}{n}\sum_{j=1}^k\sum_{i=1}^{n_j}(X_i^{(j)} - \bar X^{(j)})(X_i^{(j)} - \bar X^{(j)})^\top,$$
and
$$\hat\mu_0 = \bar X = \frac{1}{n}\sum_{j=1}^k\sum_{i=1}^{n_j}X_i^{(j)}, \qquad \hat\Sigma_0 = \frac{1}{n}\sum_{j=1}^k\sum_{i=1}^{n_j}(X_i^{(j)} - \bar X)(X_i^{(j)} - \bar X)^\top$$
under $H_0$. Now note that the log-likelihood function is obtained as
$$\ell(\mu, \Sigma) = -\frac{n}{2}\log|2\pi\Sigma| - \frac{1}{2}\sum_{j=1}^k\sum_{i=1}^{n_j}(X_i^{(j)}-\mu)^\top\Sigma^{-1}(X_i^{(j)}-\mu) = -\frac{n}{2}\log|2\pi\Sigma| - \frac{1}{2}\,\mathrm{tr}\Big(\Sigma^{-1}\sum_{j=1}^k\sum_{i=1}^{n_j}(X_i^{(j)}-\mu)(X_i^{(j)}-\mu)^\top\Big),$$
and hence
$$\ell_1^* = \ell(\hat\mu, \hat\Sigma) = -\frac{n}{2}\log|2\pi\hat\Sigma| - \frac{n}{2}\mathrm{tr}\big(\hat\Sigma^{-1}\hat\Sigma\big), \qquad \ell_0^* = \ell(\hat\mu_0, \hat\Sigma_0) = -\frac{n}{2}\log|2\pi\hat\Sigma_0| - \frac{n}{2}\mathrm{tr}\big(\hat\Sigma_0^{-1}\hat\Sigma_0\big).$$
Thus we obtain
$$2(\ell_1^* - \ell_0^*) = n\log\frac{|\hat\Sigma_0|}{|\hat\Sigma|} \underset{H_0}{\approx} \chi^2((k-1)p).$$
Or we can also use
$$\Lambda = \frac{|\hat\Sigma_0|}{|\hat\Sigma|} = \left(\frac{|W|}{|B+W|}\right)^{-1},$$
where
$$W = n\hat\Sigma = \sum_{j=1}^k\sum_{i=1}^{n_j}(X_i^{(j)}-\bar X^{(j)})(X_i^{(j)}-\bar X^{(j)})^\top \quad (\text{``variation within treatments''})$$
and
$$B = \sum_{j=1}^k n_j(\bar X^{(j)}-\bar X)(\bar X^{(j)}-\bar X)^\top. \quad (\text{``variation between treatments''})$$
Note that this statistic is based on the following decomposition of the sums of variation,
$$n\hat\Sigma_0 = \sum_{j=1}^k n_j(\bar X^{(j)}-\bar X)(\bar X^{(j)}-\bar X)^\top + n\hat\Sigma, \qquad \text{i.e.,}\quad T = B + W.$$
Also note that
$$\Sigma^{-1/2}W\Sigma^{-1/2} \sim W_p\Big(I, \sum n_j - k\Big), \qquad \Sigma^{-1/2}B\Sigma^{-1/2} \sim W_p(I, k-1),$$
and $W \perp B$. Differently from the lecture note, we will define
$$\Lambda = \frac{|\hat\Sigma|}{|\hat\Sigma_0|} = \frac{|W|}{|B+W|},$$
so that
$$2(\ell_1^* - \ell_0^*) = -n\log\Lambda.$$
$\Lambda$ is said to follow Wilks' lambda distribution, written $\Lambda \sim \Lambda(p, \sum n_j - k, k-1)$, where
$$A \sim W_p(\Sigma, m),\ B \sim W_p(\Sigma, n) \ \Rightarrow\ \frac{|A|}{|A+B|} \sim \Lambda(p, m, n).$$
Or we can apply
• Roy’s greatest root test, which uses the largest eigenvalue of W−1B;
• Lawley-Hotelling trace test, which uses tr(W−1B);
• Pillai-Bartlett trace test, which uses tr((I +W−1B)−1).
Note that Wilks' lambda equals $\det(I + W^{-1}B)^{-1}$, which motivates the Pillai–Bartlett test. Also note that Bartlett has given a modified version of the likelihood ratio test,
$$-\Big(n - 1 - \frac{p+k}{2}\Big)\log\Lambda = -\Big(n - 1 - \frac{p+k}{2}\Big)\log\frac{|W|}{|B+W|} \underset{H_0}{\approx} \chi^2((k-1)p).$$
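A minimal Python sketch of the one-way MANOVA statistics above ($W$, $B$, Wilks' $\Lambda$, and Bartlett's chi-square approximation); the group arrays are hypothetical.

```python
import numpy as np
from scipy import stats

def manova_wilks(groups):
    """One-way MANOVA via Wilks' lambda with Bartlett's chi-square approximation.

    groups : list of (n_j, p) arrays, one per population.
    """
    X = np.vstack(groups)
    n, p = X.shape
    k = len(groups)
    grand = X.mean(axis=0)
    W = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)   # within
    B = sum(len(g) * np.outer(g.mean(axis=0) - grand, g.mean(axis=0) - grand)
            for g in groups)                                                 # between
    Lam = np.linalg.det(W) / np.linalg.det(B + W)                            # Wilks' lambda
    chi2 = -(n - 1 - (p + k) / 2) * np.log(Lam)                              # Bartlett correction
    df = (k - 1) * p
    return Lam, chi2, stats.chi2.sf(chi2, df)

# hypothetical example: three groups in p = 2 dimensions
rng = np.random.default_rng(3)
groups = [rng.normal(loc=m, size=(20, 2)) for m in (0.0, 0.3, 0.6)]
print(manova_wilks(groups))
```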
Chapter 2
Multivariate Regression
2.1 Introduction
2.1.1 Review of univariate regression
Recall the univariate regression model
$$Y = Z\beta + \varepsilon,$$
where $Y$ is an $n\times 1$ vector, $Z$ is an $n\times p$ design matrix, $\beta$ is a $p\times 1$ coefficient vector, and $\varepsilon$ is an $n\times 1$ error vector. It is equivalent to the model for each unit,
$$y_i = Z_i\beta + \varepsilon_i, \qquad i = 1, 2, \cdots, n.$$
We assume that the $\varepsilon_i$'s are independently (normally) distributed with constant variance. Then the OLS (ordinary least squares) estimator $\hat\beta$ of $\beta$ is a solution of the following normal equation
$$Z^\top(Y - Z\hat\beta) = 0,$$
or equivalently,
$$\sum_{i=1}^n Z_i^\top Z_i\hat\beta = \sum_{i=1}^n Z_i^\top y_i.$$
With this, we can show that
$$E\hat\beta = \beta \quad\text{and}\quad \mathrm{Cov}(\hat\beta) = \sigma^2(Z^\top Z)^{-1},$$
and $\hat\beta$ is BLUE. In the model checking step, we check whether the data (and fitted model) satisfy the "assumptions" of the model, such as the mean structure (linearity; residuals do not contain a systematic pattern),
homoskedasticity, distributional pattern (or outliers), and independence.
2.1.2 Multivariate Regression Model
Consider the following prolactin level data, measured 4 times every 15 minutes on 30 women (Figure 2.1). It is known that there exist 3 groups, one of which every woman belongs to, and our interest is the relationship between the prolactin levels of each group and covariates. For this, one may apply a regression model using dummy variables reflecting a group effect, assuming all observations are independent. However, since the data are repeatedly measured over time, it is not reasonable to assume independence. Rather, the data are correlated; thus we might use a multivariate regression model which treats the responses observed from each individual as a vector.
Figure 2.1: Prolactin levels over time
First we only consider the case of equal dimensions: the number of repeats is the same for every individual. Assume the following mean model
$$Y_i = X_i\beta + \varepsilon_i, \qquad i = 1, 2, \cdots, K,$$
where
$$Y_i = (y_{i1}, y_{i2}, \cdots, y_{in})^\top \quad\text{and}\quad \varepsilon_i = (\varepsilon_{i1}, \varepsilon_{i2}, \cdots, \varepsilon_{in})^\top$$
are $n\times 1$ vectors. Here $i$ is an index for independent units and $n$ is the number of repeated measurements. For instance, in the prolactin data, $K = 30$ and $n = 4$. Equal dimensions means that $n$ is equal for every $i$.
Example 2.1. If we analyze pre-post data, then $n = 2$. The following model can be considered:
$$\begin{pmatrix} \text{pre}_i \\ \text{post}_i \end{pmatrix} = \begin{pmatrix} \beta_0 \\ \beta_0 + \delta \end{pmatrix} + \beta_1\begin{pmatrix} \text{age(pre)}_i \\ \text{age(post)}_i \end{pmatrix} + \begin{pmatrix} \varepsilon_{i1} \\ \varepsilon_{i2} \end{pmatrix} = \begin{pmatrix} 1 & 0 & \text{age(pre)}_i \\ 1 & 1 & \text{age(post)}_i \end{pmatrix}\begin{pmatrix} \beta_0 \\ \delta \\ \beta_1 \end{pmatrix} + \begin{pmatrix} \varepsilon_{i1} \\ \varepsilon_{i2} \end{pmatrix}.$$
Example 2.2. Now consider an example of modeling measurements from 4 time points:
$$\begin{pmatrix} y_{i1} \\ y_{i2} \\ y_{i3} \\ y_{i4} \end{pmatrix} = \begin{pmatrix} 1 & 1 & z_i \\ 1 & 2 & z_i \\ 1 & 3 & z_i \\ 1 & 4 & z_i \end{pmatrix}\begin{pmatrix} \beta_0 \\ \beta_T \\ \beta_Z \end{pmatrix} + \begin{pmatrix} \varepsilon_{i1} \\ \varepsilon_{i2} \\ \varepsilon_{i3} \\ \varepsilon_{i4} \end{pmatrix}. \qquad (\text{``Reduced Model''})$$
Note that here the only time-varying covariate is time itself; the other covariate $Z$ is time-constant. We can also consider the following model, which assumes less than the previous one:
$$\begin{pmatrix} y_{i1} \\ y_{i2} \\ y_{i3} \\ y_{i4} \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & 0 & z_i \\ 1 & 1 & 0 & 0 & z_i \\ 1 & 0 & 1 & 0 & z_i \\ 1 & 0 & 0 & 1 & z_i \end{pmatrix}\begin{pmatrix} \alpha_0 \\ \alpha_1 \\ \alpha_2 \\ \alpha_3 \\ \alpha_Z \end{pmatrix} + \begin{pmatrix} \varepsilon_{i1} \\ \varepsilon_{i2} \\ \varepsilon_{i3} \\ \varepsilon_{i4} \end{pmatrix}. \qquad (\text{``Full Model''})$$
Example 2.3. Testing Linearity. If the time intervals are equal and the trend is linear, using time as a continuous variable is correct (and we select the reduced model). Using dummy variables still yields consistent estimates, and linearity corresponds to
$$\alpha_1 - \alpha_0 = \alpha_2 - \alpha_1 = \alpha_3 - \alpha_2,$$
equivalently $C^\top\alpha = 0$, where
$$C^\top = \begin{pmatrix} 1 & -2 & 1 & 0 \\ 0 & 1 & -2 & 1 \end{pmatrix}.$$
Then we can test $H_0: C^\top\alpha = 0$ with the test statistic
$$(C^\top\hat\alpha)^\top\,\widehat{\mathrm{Var}}(C^\top\hat\alpha)^{-1}\,C^\top\hat\alpha \underset{H_0}{\sim} \chi^2(2).$$
Remark 2.4. Note that such repeatedly measured data are correlated. Is using the OLS estimator for correlated data acceptable? Note that $\mathrm{Var}(Y_i)$ is not a diagonal matrix; independence and homoskedasticity are violated, and therefore OLS is not BLUE anymore. Actually, for the least squares estimator
$$\hat\beta = \Big(\sum_{i=1}^K X_i^\top X_i\Big)^{-1}\sum_{i=1}^K X_i^\top Y_i,$$
we get $E(\hat\beta) = \beta$, but
$$\mathrm{Var}(\hat\beta) = V^* = \Big(\sum_{i=1}^K X_i^\top X_i\Big)^{-1}\sum_{i=1}^K X_i^\top\mathrm{Var}(Y_i)X_i\Big(\sum_{i=1}^K X_i^\top X_i\Big)^{-1} = \Big(\sum_{i=1}^K X_i^\top X_i\Big)^{-1}\Big(\sum_{i=1}^K X_i^\top\Sigma X_i\Big)\Big(\sum_{i=1}^K X_i^\top X_i\Big)^{-1}, \tag{2.1}$$
where $\Sigma = \mathrm{Var}(Y_i\mid X_i)$. In statistical packages such as R, an estimate of $\mathrm{Var}(\hat\beta)$ is available, but such programs give the standard error of $\hat\beta$ regarding all of the data as independent,
$$\widehat{\mathrm{Var}}(\hat\beta) = \hat\sigma^2(\mathbb{X}^\top\mathbb{X})^{-1},$$
which is not acceptable in our case.
Remark 2.5. Note that we can express the model with the stacked (augmented) matrices
$$\mathbb{X}_{(nK\times p)} = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_K \end{pmatrix}, \qquad \mathbb{Y}_{(nK\times 1)} = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_K \end{pmatrix},$$
which makes the least squares estimator
$$\hat\beta = (\mathbb{X}^\top\mathbb{X})^{-1}\mathbb{X}^\top\mathbb{Y}.$$
Note that this notation can be used even if the data are of unequal dimension, i.e., the number of repeats is not the same.
Example 2.6. Then what should we do to find the standard error of $\hat\beta$, i.e., an estimate of $\mathrm{Var}(\hat\beta)$? There are various ways to achieve the goal.
• To make it simple, one can plug
$$\hat\Sigma = \frac{1}{K}\sum_{i=1}^K(Y_i - X_i\hat\beta)(Y_i - X_i\hat\beta)^\top$$
into (2.1), which yields
$$\hat V^* = \Big(\sum_{i=1}^K X_i^\top X_i\Big)^{-1}\sum_{i=1}^K X_i^\top\hat\Sigma X_i\Big(\sum_{i=1}^K X_i^\top X_i\Big)^{-1}.$$
The main advantage of such an estimator is consistency (under technical assumptions). Furthermore, it is unbiased.
• An alternative empirical variance estimate is
$$\hat V^* = \Big(\sum_{i=1}^K X_i^\top X_i\Big)^{-1}\Big(\sum_{i=1}^K X_i^\top(Y_i - X_i\hat\beta)(Y_i - X_i\hat\beta)^\top X_i\Big)\Big(\sum_{i=1}^K X_i^\top X_i\Big)^{-1}.$$
• Which one would be preferred? If the data set has equal dimension, i.e., the number of repeats is the same among all individuals, then the former is as nice an estimate as the latter. However, if we handle data with unequal dimension, then the variance matrix of each individual is not the same; the dimensions do not even agree! More precisely, $\Sigma_i = \mathrm{Var}(Y_i\mid X_i)$ is an $n_i\times n_i$ matrix, where $n_i$ is the number of repeated measurements of the $i$th individual. Therefore, the estimate
$$\hat\Sigma = \frac{1}{K}\sum_{i=1}^K(Y_i - X_i\hat\beta)(Y_i - X_i\hat\beta)^\top$$
is NOT available, and hence the latter estimate $\hat V^*$ must be preferred.
• Another alternative is the following Jackknife estimate, which is asymptotically equivalent to $\hat V^*$ (the second one). It is defined as
$$JV = \sum_{i=1}^K(\hat\beta_{-i}-\hat\beta)(\hat\beta_{-i}-\hat\beta)^\top, \qquad \hat\beta_{-i}-\hat\beta = (\mathbb{X}_{-i}^\top\mathbb{X}_{-i})^{-1}\mathbb{X}_{-i}^\top(\mathbb{Y}_{-i}-\mathbb{X}_{-i}\hat\beta) - (\mathbb{X}^\top\mathbb{X})^{-1}\mathbb{X}^\top(\mathbb{Y}-\mathbb{X}\hat\beta),$$
and
$$JV \approx (\mathbb{X}^\top\mathbb{X})^{-1}\Big(\sum_{i=1}^K X_i^\top(Y_i-X_i\hat\beta)(Y_i-X_i\hat\beta)^\top X_i\Big)(\mathbb{X}^\top\mathbb{X})^{-1} \approx \hat V^*$$
from $(\mathbb{X}_{-i}^\top\mathbb{X}_{-i})^{-1} \approx (\mathbb{X}^\top\mathbb{X})^{-1}$ when $K$ is large.
• Even though it requires computing each $\hat\beta_{-i}$, the Jackknife estimate can be obtained by iterating a simple routine (i.e., further matrix multiplication or inverse computation may not be needed), and hence it can be preferred.
• In general, we can consider the following "model-based" and "robust" variance estimates. Consider a model with $\mathrm{Var}(Y_i\mid X_i) = V_i = V_i(\alpha)$ and let $\hat V_i = V_i(\hat\alpha)$ (here $\alpha$ denotes the variance parameters, such as $\sigma_k$ or $\rho$). Then we can use the following "naive" or "model-based" estimate,
$$\widehat{\mathrm{Var}}_{\mathrm{mod}}(\hat\beta) = \Big(\sum_{i=1}^K X_i^\top\hat V_i^{-1}X_i\Big)^{-1}.$$
However, it might perform poorly if the model (e.g. for the variance) is misspecified. Thus we can also use the following "empirical" or "robust" estimate (i.e., "robust to model misspecification"),
$$\widehat{\mathrm{Var}}_{\mathrm{rob}}(\hat\beta) = \Big(\sum_{i=1}^K X_i^\top\hat V_i^{-1}X_i\Big)^{-1}\Big(\sum_{i=1}^K X_i^\top\hat V_i^{-1}(Y_i-\hat\mu_i)(Y_i-\hat\mu_i)^\top\hat V_i^{-1}X_i\Big)\Big(\sum_{i=1}^K X_i^\top\hat V_i^{-1}X_i\Big)^{-1}.$$
Such a "robust" estimate is motivated as follows. Let $W^{-1}$ be a "working variance," i.e., we "modeled" $\mathrm{Var}(\mathbb{Y}\mid\mathbb{X}) = W^{-1}$. Then we get
$$\hat\beta = (\mathbb{X}^\top W\mathbb{X})^{-1}\mathbb{X}^\top W\mathbb{Y},$$
and hence
$$\mathrm{Var}(\hat\beta) = (\mathbb{X}^\top W\mathbb{X})^{-1}\mathbb{X}^\top W\mathbb{V}W\mathbb{X}(\mathbb{X}^\top W\mathbb{X})^{-1},$$
where $\mathbb{V} = \mathrm{Var}(\mathbb{Y}\mid\mathbb{X})$. If we construct $\hat{\mathbb{V}}$ based on our model, then $\hat{\mathbb{V}} = W^{-1}$, and we get the "model-based" estimate. In contrast, if one plugs in a $\hat{\mathbb{V}}$ which is consistent for $\mathbb{V}$ whatever the true variance structure is, then we obtain a "robust" variance estimate.
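Below is a minimal NumPy sketch of the model-based and robust ("sandwich") variance estimates for a working-covariance WLS fit, following the formulas above; the data-generating code is hypothetical.

```python
import numpy as np

def wls_with_sandwich(X_list, Y_list, V_list):
    """WLS fit with model-based and robust (sandwich) variance estimates.

    X_list : list of (n_i, p) design matrices
    Y_list : list of (n_i,) response vectors
    V_list : list of (n_i, n_i) working covariance matrices V_i
    """
    p = X_list[0].shape[1]
    A = np.zeros((p, p))        # sum X_i' V_i^{-1} X_i
    u = np.zeros(p)             # sum X_i' V_i^{-1} Y_i
    for X, Y, V in zip(X_list, Y_list, V_list):
        Vinv_X = np.linalg.solve(V, X)
        A += X.T @ Vinv_X
        u += Vinv_X.T @ Y
    beta = np.linalg.solve(A, u)
    bread = np.linalg.inv(A)                     # model-based estimate
    meat = np.zeros((p, p))
    for X, Y, V in zip(X_list, Y_list, V_list):
        s = X.T @ np.linalg.solve(V, Y - X @ beta)   # X_i' V_i^{-1} (Y_i - mu_i)
        meat += np.outer(s, s)
    robust = bread @ meat @ bread                # sandwich estimate
    return beta, bread, robust

# hypothetical example: K=100 subjects, n=4 repeats, working independence
rng = np.random.default_rng(4)
K, n = 100, 4
X_list = [np.column_stack([np.ones(n), np.arange(1, n + 1)]) for _ in range(K)]
V_list = [np.eye(n)] * K
Y_list = [X @ np.array([1.0, 0.5]) +
          rng.multivariate_normal(np.zeros(n), 0.5 + 0.5 * np.eye(n)) for X in X_list]
print(wls_with_sandwich(X_list, Y_list, V_list))
```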
Example 2.7. The following is the result of applying the regression model to the prolactin data. In the table, naive s.e. means the standard error that a statistical package gives (i.e., based on $\hat\sigma^2(\mathbb{X}^\top\mathbb{X})^{-1}$), while s.e. means the real s.e. based on the theoretical result.
Note that the naive s.e. assumes all of the data are independent even though our data are correlated, so it cannot estimate $\mathrm{sd}(\hat\beta)$ well. For example, if the data are positively correlated, then the naive s.e. may underestimate the real one, since it pretends that $n\times K$ independent points are observed, while the theoretical one is based on $K$ correlated individuals. Note that the standard error (or variance) converges to 0 as the sample size becomes large, due to the consistency of the estimator.
                 β̂         naive s.e.   s.e.     p-value
Intercept        3.902      0.256        0.306    <.0001
base             0.00680    0.00167      0.0025   <.0001
time            -0.481      0.0727       0.0970   <.0001
group 1         -0.112      0.350        0.3285   0.7492
group 2         -0.288      0.288        0.2356   0.3193
group 3          0.000      .            .        .
time*group1     -0.0794     0.126        0.1358   0.5291
time*group2      0.0741     0.103        0.1318   0.4722
time*group3      0.000      .            .        .
Example 2.8. Weighted Least Squares Estimator. Recall that OLS is consistent and asymptotically normally distributed, but not the BLUE because of the repeatedness of the data. If we can figure out the overall correlation structure, then the weighted least squares (WLS) estimator would be the BLUE, which gives better estimates. Consider the WLSE
$$\hat\beta = \Big(\sum_{i=1}^K X_i^\top\Sigma^{-1}X_i\Big)^{-1}\sum_{i=1}^K X_i^\top\Sigma^{-1}Y_i = (\mathbb{X}^\top\mathbb{V}^{-1}\mathbb{X})^{-1}\mathbb{X}^\top\mathbb{V}^{-1}\mathbb{Y},$$
where
$$\mathbb{V} = \begin{pmatrix} \Sigma & & & \\ & \Sigma & & \\ & & \ddots & \\ & & & \Sigma \end{pmatrix},$$
which minimizes the following weighted sum of squares
$$WSS(\beta) = \sum_{i=1}^K(Y_i - X_i\beta)^\top\Sigma^{-1}(Y_i - X_i\beta).$$
Here we assume that $\Sigma$ does not depend on the data, i.e., the model is homoskedastic. We can easily obtain $\hat\beta$ and
$$\mathrm{Var}(\hat\beta) = \Big(\sum_{i=1}^K X_i^\top\Sigma^{-1}X_i\Big)^{-1}$$
if $\Sigma$ is known. However, $\Sigma$ is unknown in most cases, and hence we should replace $\Sigma$ with the estimate
$$\hat\Sigma = \frac{1}{K}\sum_{i=1}^K(Y_i - X_i\hat\beta)(Y_i - X_i\hat\beta)^\top.$$
Unfortunately, it depends on $\hat\beta$, while $\hat\Sigma$ is needed to obtain $\hat\beta$; thus we cannot obtain $\hat\beta$ or $\hat\Sigma$ directly. Hence we use an iterative approach:
Step 1. First obtain the OLS $\hat\beta$ (as an initial value of the iteration). Then compute
$$\hat\Sigma = \frac{1}{K}\sum_{i=1}^K(Y_i - X_i\hat\beta)(Y_i - X_i\hat\beta)^\top.$$
Step 2. Using $\hat\Sigma$, obtain $\hat\beta^{(1)}$.
Step 3. Using $\hat\beta^{(1)}$, obtain $\hat\Sigma^{(1)}$. Go to Step 2, and repeat until convergence.
Actually, $\hat\beta^{(1)}$ is already good enough.
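A minimal NumPy sketch of this iterative (feasible) WLS scheme under equal dimensions; the stacked arrays and the number of iterations are hypothetical choices.

```python
import numpy as np

def iterative_wls(X, Y, n_iter=5):
    """Feasible WLS for equal-dimension repeated measures.

    X : (K, n, p) design arrays, Y : (K, n) responses.
    Step 1: OLS start; then alternate Sigma-hat and beta-hat updates.
    """
    K, n, p = X.shape
    Xf, Yf = X.reshape(-1, p), Y.reshape(-1)
    beta = np.linalg.lstsq(Xf, Yf, rcond=None)[0]         # OLS initial value
    for _ in range(n_iter):
        R = Y - np.einsum('knp,p->kn', X, beta)            # residual vectors
        Sigma = R.T @ R / K                                # (1/K) sum r_i r_i'
        Sinv = np.linalg.inv(Sigma)
        A = np.einsum('knp,nm,kmq->pq', X, Sinv, X)        # sum X_i' S^{-1} X_i
        b = np.einsum('knp,nm,km->p', X, Sinv, Y)          # sum X_i' S^{-1} Y_i
        beta = np.linalg.solve(A, b)
    var_beta = np.linalg.inv(A)                            # model-based Var(beta-hat)
    return beta, Sigma, var_beta

# hypothetical example
rng = np.random.default_rng(5)
K, n = 200, 4
X = np.stack([np.column_stack([np.ones(n), np.arange(1, n + 1)]) for _ in range(K)])
Sig_true = 0.6 + 0.4 * np.eye(n)
Y = np.einsum('knp,p->kn', X, np.array([2.0, -0.3])) + \
    rng.multivariate_normal(np.zeros(n), Sig_true, size=K)
print(iterative_wls(X, Y)[0])
```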
Remark 2.9. Note that $\Sigma$ was not assumed to have a specific structure in the previous paragraph. What if one would like to assume some structure for $\Sigma$? We rather consider a correlation structure with correlation matrix $\Omega$, which satisfies
$$\Sigma = (\mathrm{diag}(\Sigma))^{1/2}\,\Omega\,(\mathrm{diag}(\Sigma))^{1/2}.$$
Note that $\mathrm{diag}(\Sigma)^{1/2}$ is the part reflecting heteroskedasticity. For example,
$$\Sigma = \begin{pmatrix} \sigma_1 & 0 & 0 & 0 \\ 0 & \sigma_2 & 0 & 0 \\ 0 & 0 & \sigma_3 & 0 \\ 0 & 0 & 0 & \sigma_4 \end{pmatrix}\Omega\begin{pmatrix} \sigma_1 & 0 & 0 & 0 \\ 0 & \sigma_2 & 0 & 0 \\ 0 & 0 & \sigma_3 & 0 \\ 0 & 0 & 0 & \sigma_4 \end{pmatrix}$$
if $\Sigma$ is $4\times 4$. We can use several types of correlation structure, such as
• Intraclass:
$$\Omega = \begin{pmatrix} 1 & \rho & \rho & \rho \\ \rho & 1 & \rho & \rho \\ \rho & \rho & 1 & \rho \\ \rho & \rho & \rho & 1 \end{pmatrix}$$
• AR(1):
$$\Omega = \begin{pmatrix} 1 & \rho & \rho^2 & \rho^3 \\ \rho & 1 & \rho & \rho^2 \\ \rho^2 & \rho & 1 & \rho \\ \rho^3 & \rho^2 & \rho & 1 \end{pmatrix}$$
• or random effects (handled later).
Example 2.10. Note that, depending on the structure, estimators of the correlation coefficients differ. We can use a method-of-moments approach here. Let
$$r_{ij} = \frac{Y_{ij} - \mu_{ij}}{\sigma_j}$$
be the standardized residual of the $i$th individual at the $j$th time point. Then
$$E(r_{ij}r_{ik}) = \rho_{jk}$$
by definition, and hence we get a system of equations whose solution is the MME. For example, if one assumes intraclass correlation, then $\rho$ satisfies
$$\sum_i\hat r_{ij}\hat r_{ik} = \sum_i\rho$$
for any $j$ and $k$, and hence
$$\sum_i\sum_{j>k}\hat r_{ij}\hat r_{ik} = \sum_i\sum_{j>k}\rho$$
holds, which implies
$$\hat\rho = \Big(\sum_{i=1}^K\frac{n_i(n_i-1)}{2}\Big)^{-1}\sum_{i=1}^K\sum_{j>k}\hat r_{ij}\hat r_{ik}.$$
On the other hand, if one assumes an AR(1) correlation structure, then
$$E(r_{ij}r_{ik}) = \rho^{|j-k|}.$$
This is achieved when
$$r_{ij} = \rho\,r_{i,j-1} + u_{ij}$$
for independent r.v.'s $u_{ij}$ with suitable variances, and hence we can obtain $\hat\rho$ as the regression coefficient of a no-intercept regression of $\hat r_{ij}$ on $\hat r_{i,j-1}$.
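A small Python sketch of the two moment estimators of $\rho$ just described (intraclass and AR(1)), applied to a matrix of standardized residuals; the residual matrix here is a hypothetical stand-in for $\hat r_{ij}$.

```python
import numpy as np

def rho_intraclass(r):
    """MME of rho under intraclass correlation: average of all within-subject
    products r_ij r_ik over pairs j > k.  r : (K, n) standardized residuals."""
    K, n = r.shape
    num = 0.0
    for j in range(1, n):
        for k in range(j):
            num += np.sum(r[:, j] * r[:, k])
    return num / (K * n * (n - 1) / 2)

def rho_ar1(r):
    """MME of rho under AR(1): no-intercept regression of r_ij on r_{i,j-1}."""
    x = r[:, :-1].ravel()
    y = r[:, 1:].ravel()
    return np.sum(x * y) / np.sum(x * x)

# hypothetical standardized residuals with exchangeable correlation 0.4
rng = np.random.default_rng(6)
K, n = 500, 4
r = rng.multivariate_normal(np.zeros(n), 0.4 + 0.6 * np.eye(n), size=K)
print(rho_intraclass(r), rho_ar1(r))
```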
Example 2.11. We can also compute the MLE of $\beta$ or $\Sigma$. Here we assume normality of $Y$. Then the MLE of $\beta$ is equal to the WLS estimator, so the remaining part is the ML estimate of $\Sigma$. Note that
$$\log L = -\frac{K}{2}\log|\Sigma| - \frac{1}{2}\sum_{i=1}^K(Y_i - X_i\beta)^\top\Sigma^{-1}(Y_i - X_i\beta),$$
and recall the matrix differentiation rules
$$\frac{\partial\log|\Sigma|}{\partial\theta_k} = \mathrm{tr}\Big(\Sigma^{-1}\frac{\partial\Sigma}{\partial\theta_k}\Big), \qquad \frac{\partial\Sigma^{-1}}{\partial\theta_k} = -\Sigma^{-1}\frac{\partial\Sigma}{\partial\theta_k}\Sigma^{-1}.$$
Then we get
$$\frac{\partial\log L}{\partial\theta_k} = -\frac{K}{2}\mathrm{tr}\Big(\Sigma^{-1}\frac{\partial\Sigma}{\partial\theta_k}\Big) + \frac{1}{2}\sum_{i=1}^K(Y_i-X_i\beta)^\top\Sigma^{-1}\frac{\partial\Sigma}{\partial\theta_k}\Sigma^{-1}(Y_i-X_i\beta) = -\frac{K}{2}\mathrm{tr}\Big(\Sigma^{-1}\frac{\partial\Sigma}{\partial\theta_k}\Big) + \frac{1}{2}\mathrm{tr}\Big(\Sigma^{-1}\frac{\partial\Sigma}{\partial\theta_k}\Sigma^{-1}\sum_{i=1}^K(Y_i-X_i\beta)(Y_i-X_i\beta)^\top\Big),$$
and then the likelihood equation becomes
$$\mathrm{tr}\Big(\Sigma^{-1}\frac{\partial\Sigma}{\partial\theta_k}\Big) = \mathrm{tr}\Big(\Sigma^{-1}\frac{\partial\Sigma}{\partial\theta_k}\Sigma^{-1}K^{-1}\sum_{i=1}^K(Y_i-X_i\beta)(Y_i-X_i\beta)^\top\Big).$$
Here $\theta$ denotes the vector of parameters in $\Sigma$, for example $\theta = (\sigma_1, \sigma_2, \sigma_3, \sigma_4, \rho)^\top$.
Example 2.12. Similarly, we can deal with the heteroskedastic case. Here,
$$\log L = -\frac{1}{2}\sum_{i=1}^K\log|\Sigma_i| - \frac{1}{2}\sum_{i=1}^K(Y_i-X_i\beta)^\top\Sigma_i^{-1}(Y_i-X_i\beta),$$
and hence the likelihood equation is
$$\sum_{i=1}^K\mathrm{tr}\Big(\Sigma_i^{-1}\frac{\partial\Sigma_i}{\partial\theta_k}\Big) = \sum_{i=1}^K\mathrm{tr}\Big(\Sigma_i^{-1}\frac{\partial\Sigma_i}{\partial\theta_k}\Sigma_i^{-1}(Y_i-X_i\beta)(Y_i-X_i\beta)^\top\Big).$$
Remark 2.13. Note that
$$E\left[\frac{\partial\log L}{\partial\theta_k}\right] = -\frac{K}{2}\mathrm{tr}\Big(\Sigma^{-1}\frac{\partial\Sigma}{\partial\theta_k}\Big) + \frac{1}{2}\mathrm{tr}\Big(\Sigma^{-1}\frac{\partial\Sigma}{\partial\theta_k}\Sigma^{-1}E\sum_{i=1}^K(Y_i-X_i\beta)(Y_i-X_i\beta)^\top\Big) = 0,$$
provided that $\mathrm{Var}(Y_i\mid X_i) = \Sigma$, i.e., the variance is correctly specified. In summary,
$$E\left[\frac{\partial\log L}{\partial\theta_k}\right] = 0. \tag{2.2}$$
Note that (2.2) holds even if normality of $Y$ is violated; it comes only from $\mathrm{Var}(Y_i\mid X_i) = \Sigma$. However, recall that (2.2) is the key fact for consistency and asymptotic normality of the MLE (or the MCE with contrast function $-\log L$). It tells us that even when $Y$ is not normal, the "MLE" of $\Sigma$ obtained as if $Y$ were normal is a consistent estimator of $\Sigma$ (under some technical assumptions). For more details, see the following Remark 2.14.
Remark 2.14. Do not assume any distributional characteristics of $Y$. We only know that $E(Y_i\mid X_i) = X_i\beta$ and $\mathrm{Var}(Y_i\mid X_i) = \Sigma$. Our strategy to estimate $\Sigma$ is to find the minimizer of the following contrast function
$$\rho(\Sigma) = \frac{1}{2}\log|\Sigma| + \frac{1}{2K}\sum_{i=1}^K(Y_i-X_i\beta)^\top\Sigma^{-1}(Y_i-X_i\beta),$$
which is equivalent to the (minus) log-likelihood function when $Y$ is normal. As $K\to\infty$, $\rho(\Sigma)$ converges in probability to $E\rho(\Sigma)$, and accordingly the minimizer $\hat\Sigma = \arg\min\rho(\Sigma)$ converges in probability to $\Sigma_{\mathrm{true}} \overset{(*)}{=} \arg\min E\rho(\Sigma)$. We can ensure the $(*)$ part from (2.2), i.e.,
$$\frac{\partial}{\partial\theta}E\rho(\Sigma) = 0 \quad\text{at } \Sigma = \Sigma_{\mathrm{true}},$$
which yields consistency. Similarly, we can derive asymptotic normality from (2.2). Recall that the key point of the derivation of the asymptotic distribution is a second-order approximation and the LLN.
Example 2.15. For certain correlation structures, computation of the MLE becomes simple. For example, consider the following intraclass correlation
$$\Sigma = \begin{pmatrix} 1 & \rho & \rho & \rho \\ \rho & 1 & \rho & \rho \\ \rho & \rho & 1 & \rho \\ \rho & \rho & \rho & 1 \end{pmatrix} = (1-\rho)I + \rho\,\mathbf{1}\mathbf{1}^\top.$$
Then
$$\Sigma^{-1} = \frac{1}{1-\rho}\Big(I - \frac{\rho}{1+(p-1)\rho}\mathbf{1}\mathbf{1}^\top\Big),$$
and hence
$$\mathrm{tr}\Big(\Sigma^{-1}\frac{\partial\Sigma}{\partial\rho}\Big) = \mathrm{tr}\left(\Sigma^{-1}\begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}\right) = \mathrm{tr}\big(\Sigma^{-1}(\mathbf{1}\mathbf{1}^\top - I)\big) = -\frac{\rho\,p(p-1)}{(1-\rho)(1+(p-1)\rho)}.$$
Remark 2.16. There are some remarks.
• Note that the MLE is different from ad-hoc estimates (such as the MME). But plugging in different estimates of $\rho$ does not affect the asymptotic variance of $\hat\beta$. In other words, you don't have to sweat over finding the "best estimate" of $\Sigma$ (equivalently, $\rho$) if we're interested only in the estimates of $\beta$ (this will be handled in the appendix, Section 2.3).
• Consistency of the MLE is guaranteed even if normality of $Y$ is violated, but only when the mean and the variance are correctly specified, i.e., $\mu_i = X_i\beta$ and $\mathrm{Var}(Y_i\mid X_i) = \Sigma$ in reality.
• If the mean is misspecified, neither OLS nor WLS is consistent. If the variance is misspecified, OLS is of course still consistent ($\because$ it does not assume a variance structure), and WLS remains consistent but is no longer guaranteed to be efficient. Thus the efficiency of WLS depends on the true correlation structure and the number of time points.
• In other words, WLS carries the burden of "specifying the variance correctly," but in return you gain efficiency. In contrast, OLS is less efficient, but there is no burden to specify a correct variance structure.
Example 2.17. We saw that ignoring the correlation structure gives us a loss of efficiency, but how much do we lose? One criterion is the asymptotic relative efficiency, or ARE in abbreviation, which is defined as a ratio of (asymptotic) variances. In this case, we can write
$$ARE = \mathrm{Var}(\tilde\beta)\,\mathrm{Var}(\hat\beta)^{-1},$$
where $\tilde\beta$ and $\hat\beta$ denote the WLS (with unstructured covariance) and OLS estimators, respectively. Then we easily obtain
$$ARE = (\mathbb{X}^\top W\mathbb{X})^{-1}(\mathbb{X}^\top\mathbb{X})(\mathbb{X}^\top\mathbb{V}\mathbb{X})^{-1}(\mathbb{X}^\top\mathbb{X}),$$
where $W = \mathbb{V}^{-1}$. For example, see Figure 2.2, which gives a plot of the ARE when the correlation structure is intraclass or AR(1), respectively; the number of repeats is $T = 4$. Note that a higher ARE means that OLS is more efficient. As you can see, if there is no correlation, OLS is "the best," and hence the ARE is highest. You can also see that as the correlation becomes higher, OLS loses efficiency. In addition, since the correlation between measurements is lower when the data have an AR(1) structure, OLS is more efficient when the true correlation structure is AR(1) compared with the intraclass-correlated case.
Now we see how the number of repeated measurements $T$ affects the performance of OLS. The following figure (Figure 2.3) gives an ARE-correlation plot for various values of $T$ when the data set is intraclass-correlated. As you can see, if $T$ becomes larger, the correlation structure becomes more significant, and OLS becomes less efficient.
Figure 2.2: ARE of OLS vs. WLS
Figure 2.3: ARE of OLS vs. WLS
Note that, in a practical study, efficiency means saving money: if the estimator has a lower variance, then the required sample size becomes smaller. In this sense, efficiency (ARE) might be an important issue in the study.
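As an illustration, a short Python sketch computing the ARE matrix above for an intraclass working structure and a linear-in-time design; the design and correlation values are hypothetical.

```python
import numpy as np

def are_ols_vs_wls(Xi, rho):
    """ARE = Var(WLS) Var(OLS)^{-1} for one common design Xi and intraclass corr rho."""
    n, p = Xi.shape
    V = (1 - rho) * np.eye(n) + rho * np.ones((n, n))     # intraclass Var(Y_i)
    XtX = Xi.T @ Xi
    XtVX = Xi.T @ V @ Xi
    XtWX = Xi.T @ np.linalg.solve(V, Xi)                  # X' V^{-1} X
    var_wls = np.linalg.inv(XtWX)
    var_ols = np.linalg.inv(XtX) @ XtVX @ np.linalg.inv(XtX)
    return var_wls @ np.linalg.inv(var_ols)

# hypothetical design: intercept + time, T = 4 repeats
Xi = np.column_stack([np.ones(4), np.arange(1, 5)])
for rho in (0.0, 0.3, 0.6):
    print(rho, np.diag(are_ols_vs_wls(Xi, rho)))          # ARE per coefficient
```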
2.2 Modeling Variance and Correlation
In many problems, the variance changes over time (e.g. see the prolactin data), and how the variability changes over time might be of interest. However, if there are many time points (e.g. $T = 24$), not only is estimating $\sigma_1, \sigma_2, \cdots, \sigma_{24}$ annoying, but such estimation cannot show the trend of the variance change effectively. In this section, we see some approaches for modeling the variance or the correlation coefficients, and estimation methods.
2.2.1 Modeling Variance
Consider a model
$$E(Y_{ij}-\mu_{ij})^2 = \sigma^2_{ij} = \exp(\gamma_0 + \gamma_1 X_{ij}). \tag{2.3}$$
(Since it is a nonlinear model, we cannot use the notation $Y_i$ or $X_i$ as in $EY_i = X_i\beta$.) Then we can treat $(Y_{ij}-\mu_{ij})^2$ as the outcome, $\exp(\gamma_0+\gamma_1 X_{ij})$ as its mean, and find $\gamma$ using least squares estimation, i.e., find the minimizer of
$$\sum_{i,j}\varepsilon_{ij}^2 = \sum_{i,j}\big((Y_{ij}-\mu_{ij})^2 - \exp(\gamma_0+\gamma_1 X_{ij})\big)^2.$$
It is equivalent to finding a solution of
$$\sum_{i,j}2\frac{\partial\varepsilon_{ij}}{\partial\gamma}\varepsilon_{ij} = -2\sum_{i,j}\frac{\partial\sigma^2_{ij}}{\partial\gamma}\big((Y_{ij}-\mu_{ij})^2 - \exp(\gamma_0+\gamma_1 X_{ij})\big) = 0,$$
i.e., finding a zero of
$$U(\beta, \gamma) := \sum_{i,j}\frac{\partial\sigma^2_{ij}}{\partial\gamma}\big((Y_{ij}-\mu_{ij})^2 - \exp(\gamma_0+\gamma_1 X_{ij})\big) = \sum_{i,j}\begin{pmatrix}1\\ X_{ij}\end{pmatrix}\exp(\gamma_0+\gamma_1 X_{ij})\big((Y_{ij}-\mu_{ij})^2 - \exp(\gamma_0+\gamma_1 X_{ij})\big).$$
(Or we can rewrite $U(\beta, \gamma)$ as
$$U(\beta, \gamma) = \sum_{i,j}\begin{pmatrix}1\\ X_{ij}\end{pmatrix}\frac{\partial\sigma^2_{ij}}{\partial\eta_{ij}}\big((Y_{ij}-\mu_{ij})^2 - \exp(\gamma_0+\gamma_1 X_{ij})\big),$$
where $\eta_{ij} = \gamma_0+\gamma_1 X_{ij}$ is the linear component.) Then we can solve the problem using a weighted least squares technique, treating $\partial\sigma^2_{ij}/\partial\eta_{ij}$ as a weight. However, there are two main problems in this procedure. First, the outcome $(Y_{ij}-\mu_{ij})^2$ is unavailable since $\beta$ is unknown. Thus one needs to obtain the estimate of $\beta$ first, and then solve $U(\hat\beta, \gamma) = 0$ after plugging in $\hat\beta$. Or, equivalently, one should solve
$$\begin{pmatrix}\sum_{i=1}^K X_i^\top(Y_i - X_i\beta)\\ U(\beta, \gamma)\end{pmatrix} = 0.$$
Second, even if we can obtain the outcome, we cannot solve $U(\hat\beta, \gamma) = 0$ directly, since the weight
$$\frac{\partial\sigma^2_{ij}}{\partial\eta_{ij}} = \exp(\gamma_0+\gamma_1 X_{ij})$$
depends on the unknown parameter $\gamma$. We can easily solve this problem using an iterative algorithm such as Newton–Raphson,
$$\gamma^{(p+1)} = \gamma^{(p)} + \Big(-\frac{\partial U(\hat\beta,\gamma)}{\partial\gamma^\top}\Big)^{-1}U(\hat\beta,\gamma)\Big|_{\gamma=\gamma^{(p)}},$$
or the scoring method,
$$\gamma^{(p+1)} = \gamma^{(p)} + E\Big(-\frac{\partial U(\hat\beta,\gamma)}{\partial\gamma^\top}\Big)^{-1}U(\hat\beta,\gamma)\Big|_{\gamma=\gamma^{(p)}}.$$
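A small Python sketch of this two-step idea: fit $\beta$ by OLS, form the squared residuals, and solve $U(\hat\beta, \gamma) = 0$ for the log-linear variance model by a Gauss–Newton / scoring iteration. Everything here (names, starting values, the simulated data) is a hypothetical illustration.

```python
import numpy as np

def fit_log_variance(x, sq_resid, n_iter=20):
    """Solve U(gamma) = sum_ij d(sigma^2)/d(gamma) * (r^2 - exp(g0 + g1 x)) = 0.

    x        : flat array of covariates X_ij
    sq_resid : flat array of squared residuals (Y_ij - mu_ij-hat)^2
    """
    D = np.column_stack([np.ones_like(x), x])      # d(eta)/d(gamma)
    gamma = np.array([np.log(sq_resid.mean()), 0.0])
    for _ in range(n_iter):
        s2 = np.exp(D @ gamma)                     # modeled variances
        U = D.T @ (s2 * (sq_resid - s2))           # estimating function
        J = D.T @ ((s2 ** 2)[:, None] * D)         # expected -dU/dgamma (scoring)
        gamma = gamma + np.linalg.solve(J, U)
    return gamma

# hypothetical data: true gamma = (0.2, 0.5)
rng = np.random.default_rng(7)
x = rng.uniform(0, 2, size=5000)
r2 = np.exp(0.2 + 0.5 * x) * rng.chisquare(1, size=5000)   # E r2 = exp(0.2 + 0.5 x)
print(fit_log_variance(x, r2))
```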
Remark 2.18. Note that such an estimator is asymptotically distributed as multivariate normal; details will be discussed later.
Remark 2.19. A consistent variance estimate can be obtained by the Jackknife procedure.
(i) Delete the $i$th subject and obtain the OLS estimate, say $\hat\beta_{-i}$.
(ii) Given the estimate, compute $\widehat{(Y_{ij}-\mu_{ij})^2} = (Y_{ij} - (X_i\hat\beta_{-i})_j)^2$.
(iii) Fit the variance model and obtain $\hat\gamma_{-i}$.
(iv) Repeat (i) to (iii) for $i = 1, 2, \cdots, K$, and compute
$$\sum_{i=1}^K\begin{pmatrix}\hat\beta_{-i}-\hat\beta\\ \hat\gamma_{-i}-\hat\gamma\end{pmatrix}\begin{pmatrix}\hat\beta_{-i}-\hat\beta\\ \hat\gamma_{-i}-\hat\gamma\end{pmatrix}^\top.$$
Remark 2.20. Note that we can also use WLS instead of OLS. WLS should be used if the purpose of variance modeling is to improve the efficiency of $\hat\beta$, but since WLS needs the covariance, we should recompute $\hat\beta$ at every iteration as $\hat\gamma$ is updated. Also note that, to obtain a more efficient estimate of $\gamma$, we may use the weighted criterion
$$\sum_{i=1}^K\frac{\partial\varepsilon_i^\top}{\partial\gamma}\Omega_i^{-1}\varepsilon_i,$$
where $\Omega_i = \mathrm{Var}(\varepsilon_i)$. If our interest is only in the (efficient) estimation of $\beta$, we may use just the least squares form; but we should employ the above (weighted) criterion if one wants to perform efficient inference on $\gamma$.
2.2.2 Modeling Correlation
In genetic studies (or family studies), modeling correlation is of primary interest (e.g. "Is the correlation between siblings higher than that between cousins?"). Note that the correlation can be modeled as
$$E(Y_{ij}-\mu_{ij})(Y_{ik}-\mu_{ik}) = \rho_{ijk} = \frac{\exp(\alpha_0+\alpha_1 Z_{ijk})-1}{\exp(\alpha_0+\alpha_1 Z_{ijk})+1},$$
where $Z_{ijk}$ is a covariate (e.g. whether a pair are siblings or cousins). In correlation modeling, we can treat $(Y_{ij}-\mu_{ij})(Y_{ik}-\mu_{ik})$ as the outcome, $\frac{\exp(\alpha_0+\alpha_1 Z_{ijk})-1}{\exp(\alpha_0+\alpha_1 Z_{ijk})+1}$ as its mean, and find $\alpha$ so that $\sum_{i,j,k}\varepsilon_{ijk}^2$ is minimized, where $\varepsilon_{ijk} = (Y_{ij}-\mu_{ij})(Y_{ik}-\mu_{ik}) - \rho_{ijk}$. Then we should solve
$$U(\beta, \alpha) = 0,$$
where
$$U(\beta, \alpha) = \sum_{i,j,k}\frac{\partial\rho_{ijk}}{\partial\alpha}\big((Y_{ij}-\mu_{ij})(Y_{ik}-\mu_{ik})-\rho_{ijk}\big) = \sum_{i,j,k}\begin{pmatrix}1\\ Z_{ijk}\end{pmatrix}\frac{\partial\rho_{ijk}}{\partial\eta_{ijk}}\big((Y_{ij}-\mu_{ij})(Y_{ik}-\mu_{ik})-\rho_{ijk}\big).$$
Note that our 'outcome' is not known, so one needs to obtain the estimate of $\beta$ first, plug it in, and then solve $U(\hat\beta, \alpha) = 0$. The same arguments used in variance modeling apply to correlation modeling, such as the nonlinear regression routine or the jackknife procedure. Furthermore, one can combine the two models, so that mean, variance, and correlation are modeled simultaneously.
2.3 Appendix: Asymptotic variance of WLS estimator
Consider a WLS estimator for $\beta$,
$$\hat\beta(\theta) = (\mathbb{X}^\top\mathbb{V}^{-1}(\theta)\mathbb{X})^{-1}\mathbb{X}^\top\mathbb{V}^{-1}(\theta)\mathbb{Y},$$
where $\theta$ is a (vector of) parameter(s). Then we get
$$\mathrm{Var}(\hat\beta(\theta)) = (\mathbb{X}^\top\mathbb{V}^{-1}(\theta)\mathbb{X})^{-1},$$
and hence we can estimate this variance as
$$\widehat{\mathrm{Var}}(\hat\beta(\hat\theta)) = (\mathbb{X}^\top\mathbb{V}^{-1}(\hat\theta)\mathbb{X})^{-1}.$$
However, in practice we obtain $\hat\beta(\hat\theta)$ instead of $\hat\beta(\theta)$, and it may seem too optimistic to expect
$$\mathrm{Var}(\hat\beta(\hat\theta)) = \mathrm{Var}(\hat\beta(\theta)).$$
The following theorem tells us that such an expectation is acceptable: if $\hat\theta$ is a sufficiently good estimator of $\theta$ (consistent, at least), the asymptotic variances are the same, i.e.,
$$\mathrm{Var}(\hat\beta(\hat\theta)) \approx \mathrm{Var}(\hat\beta(\theta)).$$
Theorem 2.21. Let $\hat\theta$ be a consistent estimator of $\theta$, and let the index 0 denote the true value; e.g., $\beta_0$ denotes the true $\beta$. Then
$$\sqrt{K}(\hat\beta(\hat\theta)-\beta_0) = \sqrt{K}(\hat\beta(\theta_0)-\beta_0) + O_P(K^{-1/2})$$
as $K\to\infty$.
Proof. Note that
$$\sqrt{K}(\hat\beta(\hat\theta)-\beta_0) = \Big(\frac{1}{K}\mathbb{X}^\top\mathbb{V}^{-1}(\hat\theta)\mathbb{X}\Big)^{-1}K^{-1/2}\mathbb{X}^\top\mathbb{V}^{-1}(\hat\theta)\varepsilon = \Big(\frac{1}{K}\mathbb{X}^\top\mathbb{V}^{-1}(\theta_0)\mathbb{X}\Big)^{-1}K^{-1/2}\mathbb{X}^\top\mathbb{V}^{-1}(\hat\theta)\varepsilon + o_P(1) \quad(\because\text{ consistency of }\hat\theta)$$
holds. Under reasonable assumptions, we can say that $K^{-1}\mathbb{X}^\top\mathbb{V}^{-1}(\theta_0)\mathbb{X}$ converges somewhere, and hence the remaining part is $K^{-1/2}\mathbb{X}^\top\mathbb{V}^{-1}(\hat\theta)\varepsilon$. Now define $S(\theta) = \mathbb{X}^\top\mathbb{V}^{-1}(\theta)\varepsilon$. Then by Taylor's theorem, we get
$$K^{-1/2}S(\hat\theta) = K^{-1/2}S(\theta_0) + K^{-1/2}\frac{\partial S}{\partial\theta}\Big|_{\theta=\theta^*}(\hat\theta-\theta_0) = K^{-1/2}S(\theta_0) + K^{-1}\frac{\partial S}{\partial\theta}\Big|_{\theta=\theta^*}\cdot\sqrt{K}(\hat\theta-\theta_0).$$
From consistency, under some technical assumptions, $\sqrt{K}(\hat\theta-\theta_0) = O_P(1)$, and therefore it is enough to show
$$K^{-1}\frac{\partial S}{\partial\theta}\Big|_{\theta=\theta^*} = O_P(K^{-1/2}).$$
Note that for any component $\theta_k$,
$$\frac{\partial S}{\partial\theta_k} = -\mathbb{X}^\top\mathbb{V}^{-1}(\theta)\frac{\partial\mathbb{V}}{\partial\theta_k}(\theta)\mathbb{V}^{-1}(\theta)\varepsilon = -\sum_{i=1}^K X_i^\top\Sigma^{-1}\frac{\partial\Sigma}{\partial\theta_k}\Sigma^{-1}\varepsilon_i = O_P(K^{1/2})$$
from $E(\varepsilon_i) = 0$.
Chapter 3
Mixed Effects Model
3.1 Motivation
In the previous chapter, we learned analysis methods for repeatedly measured data with equal dimensions. How do we handle data with unequal dimensions? That is, we observe
$$Y_i \sim N_{n_i}(X_i\beta, \Sigma_i), \qquad i = 1, 2, \cdots, K.$$
(Since there are different numbers of observations, the variances also differ; at least, their dimensions differ.) If one estimates $\beta$ and $\Sigma_i$ directly (without an assumption on the correlation structure), then there are too many parameters, which become difficult to estimate. If the maximum number of measurements, say $n$, is known, one can assume that $\Sigma_i$ is a submatrix of $\Sigma_{n\times n}$, but then there might be a problem of missing data or irregular time points (in fact, the two might coincide). Alternatively, one can consider specially structured models, called mixed effects models, which assume that the elements of $Y_i$ are correlated only because they share common characteristics, called random effects. For example, let $y_{ij}$ be the observation of the $i$th individual at the $j$th time point. Then the mixed effects model
$$y_{ij} = \beta_0 + b_i + X_{ij}\beta + \varepsilon_{ij}$$
can be considered, where $\varepsilon_i$ and $b_i$ are independent random variables with
$$\varepsilon_i \sim N(0, \sigma^2 I_i), \qquad b_i \sim N(0, D), \qquad b_i \perp \varepsilon_i \mid X_i.$$
Note that all of the distributional statements and independence are conditional on $X$. Also, do not confuse the notation: here $\sigma^2$ and $D$ are scalars. The random effect term $b_i$ can be interpreted as a specific characteristic or propensity of each individual. Furthermore, we can easily obtain the following
properties.
• $E(y_{ij}\mid x_{ij}) = \beta_0 + x_{ij}\beta$. ("Marginal mean response")
• $E(y_{ij}\mid b_i, x_{ij}) = \beta_0 + b_i + x_{ij}\beta$. ("Conditional mean response")
• $\mathrm{Var}(y_{ij}\mid b_i, x_{ij}) = \sigma^2$. ("Conditional independence")
• $\mathrm{Var}(y_{ij}\mid x_{ij}) = E\,\mathrm{Var}(y_{ij}\mid b_i, x_{ij}) + \mathrm{Var}\,E(y_{ij}\mid b_i, x_{ij}) = \sigma^2 + D$.
• $\mathrm{Cov}(y_{ij}, y_{ik}\mid x_{ij}, x_{ik}) = D$. ("Only sharing $b_i$ yields correlation")
• $\mathrm{Corr}(y_{ij}, y_{ik}\mid x_{ij}, x_{ik}) = \dfrac{D}{\sigma^2+D}$.
Remark 3.1. Random Intercept Model. Recall that the marginal model is
$$E(y_{ij}) = \beta_0 + x_{ij}\beta.$$
However, if we consider the response of a specific individual, then we should see
$$E(y_{ij}\mid b_i) = \beta_0 + x_{ij}\beta + b_i.$$
In other words, the mean response of each individual has a random intercept depending on the random effect $b_i$. Note that the model yields a common slope, i.e., parallel mean response lines.
Figure 3.1: Random intercept model
Remark 3.2. Matrix notation. Consider the following notation.
• $Y_i = (Y_{i1}, \cdots, Y_{in_i})^\top$, $\quad X_i = \begin{pmatrix} 1 & X_{i1} \\ 1 & X_{i2} \\ \vdots & \vdots \\ 1 & X_{in_i} \end{pmatrix}$.
Then our model can be rewritten as
$$\begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_K \end{pmatrix} = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_K \end{pmatrix}\begin{pmatrix} \beta_0 \\ \beta \end{pmatrix} + \begin{pmatrix} \mathbf{1}_1 & & & \\ & \mathbf{1}_2 & & \\ & & \ddots & \\ & & & \mathbf{1}_K \end{pmatrix}\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_K \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_K \end{pmatrix}$$
or
$$\mathbb{Y} = \mathbb{X}\beta + \mathbb{Z}b + \varepsilon.$$
Then we can write:
• $E(Y_i\mid X_i, b_i) = X_i\beta + \mathbf{1}_i b_i$;
• $E(Y_i\mid X_i) = X_i\beta$;
• $\mathrm{Var}(Y_i\mid X_i, b_i) = \sigma^2 I_i$;
• $\mathrm{Var}(Y_i\mid X_i) = \sigma^2 I_i + \mathbf{1}_i D\mathbf{1}_i^\top$;
• $\mathrm{Corr}(Y_i) = \dfrac{D}{\sigma^2+D}J_i + \Big(1-\dfrac{D}{\sigma^2+D}\Big)I_i$.
Also, the marginal (log) likelihood becomes
$$\log L = -\frac{1}{2}\sum_{i=1}^K\log|V_i| - \frac{1}{2}\sum_{i=1}^K(Y_i-X_i\beta)^\top V_i^{-1}(Y_i-X_i\beta),$$
where
$$V_i = \sigma^2 I_i + \mathbf{1}_i D\mathbf{1}_i^\top = \mathrm{Var}(Y_i\mid X_i).$$
Note that
$$V_i^{-1} = \frac{1}{\sigma^2+D}\cdot\frac{1}{1-\rho}\Big(I_i - \frac{\rho}{1+(n_i-1)\rho}J_i\Big),$$
where $\rho = \dfrac{D}{\sigma^2+D}$. Note that if $\sigma^2$ increases, then the correlation $\rho$ decreases; verifying "which observations belong to which individual" becomes more difficult. We can maximize this likelihood using the Newton–Raphson algorithm.
3.2 Model Estimation
Consider the following mixed effects model
$$Y_i = X_i\beta + Z_ib_i + \varepsilon_i, \qquad i = 1, 2, \cdots, K.$$
3.2.1 Notations
From now on, we use the following notation. Let
$$Y_i = (y_{i1}, y_{i2}, \cdots, y_{in_i})^\top$$
be the $n_i\times 1$ vector of responses, and
$$\varepsilon_i = (\varepsilon_{i1}, \varepsilon_{i2}, \cdots, \varepsilon_{in_i})^\top$$
be the $n_i\times 1$ vector of errors. Also, let $X_i$ be the $n_i\times p$ covariate matrix and $\beta$ the $p\times 1$ coefficient vector. For the random effect terms, let $Z_i$ be an $n_i\times q$ matrix and $b_i$ a $q\times 1$ vector of random regression coefficients.
Example 3.3. (Random intercept and random slope) We can consider the following model with $q = 2$ and $n_i = 4$:
$$\begin{pmatrix} y_{i1} \\ y_{i2} \\ y_{i3} \\ y_{i4} \end{pmatrix} = \begin{pmatrix} 1 & 1 & x_{i1} \\ 1 & 2 & x_{i2} \\ 1 & 3 & x_{i3} \\ 1 & 4 & x_{i4} \end{pmatrix}\begin{pmatrix} \beta_0 \\ \underset{\sim}{\beta} \end{pmatrix} + \begin{pmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{pmatrix}\begin{pmatrix} b_{0i} \\ b_{1i} \end{pmatrix} + \begin{pmatrix} \varepsilon_{i1} \\ \varepsilon_{i2} \\ \varepsilon_{i3} \\ \varepsilon_{i4} \end{pmatrix}.$$
Here the notation $\underset{\sim}{\beta}$ is used to emphasize that $\beta = \underset{\sim}{\beta}$ is a vector.
3.2.2 Likelihood approach
Now we assume distributional characteristics. Assume that $\varepsilon_i \sim N_{n_i}(0, \sigma^2 I_i)$ and $b_i \sim N_q(0, D)$, where $D$ is a $q\times q$ matrix. Also assume that $b_i$ and $\varepsilon_i$ are (conditionally) independent. Then:
• Conditionally on $b_i$,
$$E(Y_i\mid X_i, b_i) = X_i\beta + Z_ib_i, \qquad \mathrm{Var}(Y_i\mid X_i, b_i) = \sigma^2 I_i.$$
• Marginally,
$$E(Y_i\mid X_i) = X_i\beta, \qquad \mathrm{Var}(Y_i\mid X_i) = \sigma^2 I_i + Z_iDZ_i^\top.$$
More precisely,
$$Y_i \sim N_{n_i}(X_i\beta,\ \sigma^2 I_i + Z_iDZ_i^\top).$$
Thus we get the marginal likelihood
$$\log L = -\frac{1}{2}\sum_{i=1}^K\log|V_i| - \frac{1}{2}\sum_{i=1}^K(Y_i-X_i\beta)^\top V_i^{-1}(Y_i-X_i\beta),$$
where
$$V_i = \mathrm{Var}(Y_i\mid X_i) = \sigma^2 I_i + Z_iDZ_i^\top.$$
Then we can find the MLE (or equivalently, the WLSE) which maximizes the marginal likelihood. Note that the likelihood equation is
$$\sum_{i=1}^K X_i^\top V_i^{-1}(Y_i - X_i\beta) = 0,$$
which is obtained from $\dfrac{\partial\log L}{\partial\beta} = 0$. For $V$, we can also find the likelihood equation from
$$-\frac{\partial\log L}{\partial\theta_k} = \frac{1}{2}\sum_{i=1}^K\mathrm{tr}\Big(V_i^{-1}\frac{\partial V_i}{\partial\theta_k}\Big) - \frac{1}{2}\sum_{i=1}^K(Y_i-X_i\beta)^\top V_i^{-1}\frac{\partial V_i}{\partial\theta_k}V_i^{-1}(Y_i-X_i\beta) = 0,$$
where $\theta$ is the vector of parameters in $V_i$, e.g., $\theta = (\sigma^2, D_{11}, D_{12}, D_{22})^\top$ in the case $q = 2$.
Remark 3.4. In the mixed effects model, there are a fixed effect term and a random effect term. If $b_i$ were available, then the conditional mean response would give us more information once the individual is specified; but unfortunately, $b_i$ cannot be observed. Thus we should estimate or predict the $b_i$ terms (one says "$b_i$ is estimated" when regarding $b_i$ as a realization of the random effect, and "$b_i$ is predicted" when thinking of $b_i$ as a random quantity itself; the latter usage is more common in the literature). Under our model, $Y_i$ and $b_i$ are jointly distributed as
$$\begin{pmatrix} Y_i \\ b_i \end{pmatrix} \sim N_{n_i+q}\left(\begin{pmatrix} X_i\beta \\ 0 \end{pmatrix}, \begin{pmatrix} V_i & Z_iD \\ DZ_i^\top & D \end{pmatrix}\right).$$
Thus the conditional distribution of $b_i$ given $Y_i$ is also multivariate normal:
$$b_i\mid Y_i \sim N_q\big(E(b_i\mid Y_i),\ \mathrm{Var}(b_i\mid Y_i)\big),$$
where
$$E(b_i\mid Y_i) = DZ_i^\top V_i^{-1}(Y_i - X_i\beta) \quad\text{and}\quad \mathrm{Var}(b_i\mid Y_i) = D - DZ_i^\top V_i^{-1}Z_iD.$$
Then $\hat b_i = E(b_i\mid Y_i) = DZ_i^\top V_i^{-1}(Y_i - X_i\beta)$ is the best linear unbiased predictor (BLUP), in the sense that
$$E\big(b_i - E(b_i\mid Y_i)\big)^2 \le E(b_i - a^\top Y_i)^2 \quad \forall a \text{ s.t. } E(a^\top Y_i) = E(b_i).$$
(Actually, it is also the best predictor (BP).) This gives a heuristic interpretation: if $Y_i > X_i\beta$ in every component, then $\hat b_i$ should be positive.
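A tiny NumPy sketch of the BLUP formula $\hat b_i = DZ_i^\top V_i^{-1}(Y_i - X_i\beta)$ for a random intercept and slope model; the parameter values are hypothetical and would in practice be replaced by their estimates.

```python
import numpy as np

def blup(Yi, Xi, Zi, beta, D, sigma2):
    """BLUP of b_i and its conditional variance in a linear mixed model."""
    Vi = sigma2 * np.eye(len(Yi)) + Zi @ D @ Zi.T        # marginal Var(Y_i | X_i)
    resid = Yi - Xi @ beta
    b_hat = D @ Zi.T @ np.linalg.solve(Vi, resid)         # E(b_i | Y_i)
    b_var = D - D @ Zi.T @ np.linalg.solve(Vi, Zi @ D)    # Var(b_i | Y_i)
    return b_hat, b_var

# hypothetical subject with 4 visits, random intercept + slope
t = np.arange(1, 5, dtype=float)
Xi = np.column_stack([np.ones(4), t])
Zi = Xi.copy()
beta = np.array([2.0, 0.3])
D = np.array([[1.0, 0.2], [0.2, 0.5]])
Yi = np.array([3.5, 4.1, 4.8, 5.9])
print(blup(Yi, Xi, Zi, beta, D, sigma2=0.4))
```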
Remark 3.5. Can we maximize the joint likelihood over $\beta$ and $b_1, b_2, \cdots, b_K$ as if the $b_i$'s were fixed parameters? That is, maximize $\sum_{i=1}^K\big(\log f(Y_i\mid X_i, b_i) + \log g(b_i\mid X_i)\big)$, which is equivalent to minimizing
$$SS(\beta, b_1, b_2, \cdots, b_K) = \sum_{i=1}^K\Big\{(Y_i-X_i\beta-Z_ib_i)^\top R_i^{-1}(Y_i-X_i\beta-Z_ib_i) + b_i^\top D^{-1}b_i\Big\},$$
where $\mathrm{Var}(Y_i\mid b_i) = R_i$. In general, this approach will not lead to correct inference about $\beta$, since the number of unknown $b_i$'s increases as the sample size increases, which implies that we would not be able to enjoy the good properties of the MLE. However, in linear mixed effects models, this does give correct inference for $\beta$. Why? Note that the $b_i$ minimizing $SS$ is
$$\hat b_i = (Z_i^\top R_i^{-1}Z_i + D^{-1})^{-1}Z_i^\top R_i^{-1}(Y_i - X_i\beta),$$
since $SS$ can be viewed as a regularized (weighted) sum of squared residuals with responses $Y_i - X_i\beta$ and covariates $Z_i$. However, from
$$(Z_i^\top R_i^{-1}Z_i + D^{-1})DZ_i^\top = Z_i^\top R_i^{-1}(Z_iDZ_i^\top + R_i),$$
we obtain
$$(Z_i^\top R_i^{-1}Z_i + D^{-1})^{-1}Z_i^\top R_i^{-1} = DZ_i^\top(Z_iDZ_i^\top + R_i)^{-1},$$
which yields
$$\hat b_i = DZ_i^\top(Z_iDZ_i^\top + R_i)^{-1}(Y_i - X_i\beta) = DZ_i^\top V_i^{-1}(Y_i - X_i\beta) = E(b_i\mid Y_i).$$
It is exactly equal to the BLUP of $b_i$, and hence $\hat\beta$ based on $\hat b_i$ will also be the same as that based on the BLUP, by the following identity:
$$\sum_{i=1}^K X_i^\top R_i^{-1}(Y_i - X_i\beta - Z_i\hat b_i) = \sum_{i=1}^K X_i^\top V_i^{-1}(Y_i - X_i\beta). \tag{3.1}$$
Why does (3.1) yield $\hat\beta = \hat\beta_{BLUP}$? The LHS of (3.1) is (proportional to) the derivative of $SS(\beta, \hat b_1, \cdots, \hat b_K)$ with respect to $\beta$; the RHS is the likelihood equation. Thus the remaining part is to prove (3.1). It is easily obtained from
$$Y_i - X_i\beta - Z_i\hat b_i = Y_i - X_i\beta - Z_iDZ_i^\top(Z_iDZ_i^\top + R_i)^{-1}(Y_i - X_i\beta) = R_i(Z_iDZ_i^\top + R_i)^{-1}(Y_i - X_i\beta)$$
and
$$\sum_{i=1}^K X_i^\top R_i^{-1}(Y_i - X_i\beta - Z_i\hat b_i) = \sum_{i=1}^K X_i^\top(Z_iDZ_i^\top + R_i)^{-1}(Y_i - X_i\beta) = \sum_{i=1}^K X_i^\top V_i^{-1}(Y_i - X_i\beta).$$
Remark 3.6. If $b_i$ were observed, the problem would be easy. Construct the full data as $(Y^o, b)$, the observed data as $(Y^o)$, and the missing data as $(b)$.² If $b_i$ were observed, the joint likelihood would be
$$\log L = \sum_{i=1}^K\big\{\log f(Y_i\mid b_i) + \log g(b_i)\big\} = -\frac{1}{2}\sum_{i=1}^K\log|\sigma^2 I_i| - \frac{1}{2}\sum_{i=1}^K(Y_i-X_i\beta-Z_ib_i)^\top(\sigma^2 I_i)^{-1}(Y_i-X_i\beta-Z_ib_i) - \frac{K}{2}\log|D| - \frac{1}{2}\sum_{i=1}^K b_i^\top D^{-1}b_i,$$
and hence
$$\hat D = \frac{1}{K}\sum_{i=1}^K b_ib_i^\top$$
would be obtained. But $b_i$ is unobserved, and hence such an estimate is unavailable.
A possible approach for such a missing-data problem is to implement an EM algorithm.
Theorem 3.7. EM algorithm. Repeat the following steps until the estimates converge:
$$\beta^{(p+1)} = \Big(\sum_{i=1}^K X_i^\top X_i\Big)^{-1}\Big(\sum_{i=1}^K X_i^\top(Y_i - Z_i\hat b_i)\Big),$$
$$D^{(p+1)} = D^{(p)} - \frac{1}{K}\sum_{i=1}^K D^{(p)}Z_i^\top V_i^{-1}Z_iD^{(p)} + \frac{1}{K}\sum_{i=1}^K D^{(p)}Z_i^\top V_i^{-1}(Y_i-X_i\beta^{(p)})(Y_i-X_i\beta^{(p)})^\top V_i^{-1}Z_iD^{(p)},$$
$$\sigma^{2(p+1)} = \frac{1}{n}\sum_{i=1}^K\Big\{(Y_i-X_i\beta^{(p+1)}-Z_i\hat b_i)^\top(Y_i-X_i\beta^{(p+1)}-Z_i\hat b_i) + \mathrm{tr}\big(Z_i^\top Z_i\,\mathrm{Var}(b_i\mid Y_i, \theta^{(p)})\big)\Big\},$$
where
$$\hat b_i = D^{(p)}Z_i^\top V_i^{-1}(Y_i - X_i\beta^{(p)}).$$
² The term "missing" does not refer only to data that we intended to observe but failed to; it is determined by what we define the full data set to be. For example, in the mixed effects model, $b_i$ is not intended to be observed, so there is no failure in collecting the data, but we often regard it as missing. In contrast (assuming that $\max(n_1, \cdots, n_K) = n_1$), $Y_{2,n_2+1}, \cdots, Y_{2,n_1}$ are "failed to be observed" even though we intended to ($\because$ $j$ denotes the time points), but we will NOT call such data missing.
Proof. Note that
$$\log f(Y_c\mid\theta) = -\frac{1}{2}\sum_{i=1}^K\log|\sigma^2 I_i| - \frac{1}{2}\sum_{i=1}^K(Y_i-X_i\beta-Z_ib_i)^\top(\sigma^2 I_i)^{-1}(Y_i-X_i\beta-Z_ib_i) - \frac{K}{2}\log|D| - \frac{1}{2}\sum_{i=1}^K b_i^\top D^{-1}b_i.$$
Then we can obtain $Q(\theta\mid\theta^{(p)})$ for the EM algorithm as
$$Q(\theta\mid\theta^{(p)}) = E_{b\mid Y,\theta^{(p)}}[\log f(Y_c\mid\theta)] = -\frac{n}{2}\log\sigma^2 - \frac{K}{2}\log|D| - \frac{1}{2}\sum_{i=1}^K E_{b\mid Y,\theta^{(p)}}\big[b_i^\top D^{-1}b_i\big] - \frac{1}{2\sigma^2}\sum_{i=1}^K E_{b\mid Y,\theta^{(p)}}\big[(Y_i-X_i\beta-Z_ib_i)^\top(Y_i-X_i\beta-Z_ib_i)\big].$$
If we denote
$$\mu_i^p := E(b_i\mid Y_i, \theta^{(p)}) = D^{(p)}Z_i^\top(Z_iD^{(p)}Z_i^\top + \sigma^{2(p)}I_i)^{-1}(Y_i - X_i\beta^{(p)}),$$
$$\Sigma_i^p := \mathrm{Var}(b_i\mid Y_i, \theta^{(p)}) = D^{(p)} - D^{(p)}Z_i^\top(Z_iD^{(p)}Z_i^\top + \sigma^{2(p)}I_i)^{-1}Z_iD^{(p)},$$
we can see that
$$E_{b\mid Y,\theta^{(p)}}[b_i^\top D^{-1}b_i] = \mathrm{tr}(D^{-1}\Sigma_i^p) + \mu_i^{p\top}D^{-1}\mu_i^p$$
and
$$E_{b\mid Y,\theta^{(p)}}[(Y_i-X_i\beta-Z_ib_i)^\top(Y_i-X_i\beta-Z_ib_i)] = (Y_i-X_i\beta)^\top(Y_i-X_i\beta) - 2\mu_i^{p\top}Z_i^\top(Y_i-X_i\beta) + \mathrm{tr}(Z_i^\top Z_i\Sigma_i^p) + \mu_i^{p\top}Z_i^\top Z_i\mu_i^p.$$
Combining all of the above terms, finally we get
$$Q(\theta\mid\theta^{(p)}) = -\frac{n}{2}\log\sigma^2 - \frac{K}{2}\log|D| - \frac{1}{2}\sum_{i=1}^K\big(\mathrm{tr}(D^{-1}\Sigma_i^p) + \mu_i^{p\top}D^{-1}\mu_i^p\big) - \frac{1}{2\sigma^2}\sum_{i=1}^K\big((Y_i-X_i\beta)^\top(Y_i-X_i\beta) - 2\mu_i^{p\top}Z_i^\top(Y_i-X_i\beta) + \mathrm{tr}(Z_i^\top Z_i\Sigma_i^p) + \mu_i^{p\top}Z_i^\top Z_i\mu_i^p\big).$$
In the M-step, we have to find $\theta^{(p+1)}$ maximizing $Q(\theta\mid\theta^{(p)})$.
• To find $\beta^{(p+1)}$, fix $\sigma^2$ and $D$. Then $\beta^{(p+1)}$ is the minimizer of
$$\sum_{i=1}^K\big((Y_i-X_i\beta)^\top(Y_i-X_i\beta) - 2\mu_i^{p\top}Z_i^\top(Y_i-X_i\beta) + \mu_i^{p\top}Z_i^\top Z_i\mu_i^p\big) = \sum_{i=1}^K(Y_i-X_i\beta-Z_i\mu_i^p)^\top(Y_i-X_i\beta-Z_i\mu_i^p),$$
and hence
$$\beta^{(p+1)} = \Big(\sum_{i=1}^K X_i^\top X_i\Big)^{-1}\sum_{i=1}^K X_i^\top(Y_i - Z_i\mu_i^p).$$
• Now we find $D^{(p+1)}$, which minimizes
$$K\log|D| + \sum_{i=1}^K\big(\mathrm{tr}(D^{-1}\Sigma_i^p) + \mu_i^{p\top}D^{-1}\mu_i^p\big)$$
(note that for any $\beta$ and $\sigma^2$, $Q$ is maximized when this is minimized). It is equal to
$$K\log|D| + \mathrm{tr}\Big(D^{-1}\sum_{i=1}^K(\Sigma_i^p + \mu_i^p\mu_i^{p\top})\Big),$$
and hence
$$D^{(p+1)} = \frac{1}{K}\sum_{i=1}^K(\Sigma_i^p + \mu_i^p\mu_i^{p\top}) = \frac{1}{K}\sum_{i=1}^K E_{b\mid Y,\theta^{(p)}}[b_ib_i^\top].$$
It can be rewritten as
$$D^{(p+1)} = \frac{1}{K}\sum_{i=1}^K\Big(D^{(p)} - D^{(p)}Z_i^\top V_i^{(p)-1}Z_iD^{(p)} + D^{(p)}Z_i^\top V_i^{(p)-1}(Y_i-X_i\beta^{(p)})(Y_i-X_i\beta^{(p)})^\top V_i^{(p)-1}Z_iD^{(p)}\Big) = D^{(p)} - \frac{1}{K}\sum_{i=1}^K D^{(p)}Z_i^\top V_i^{(p)-1}Z_iD^{(p)} + \frac{1}{K}\sum_{i=1}^K D^{(p)}Z_i^\top V_i^{(p)-1}(Y_i-X_i\beta^{(p)})(Y_i-X_i\beta^{(p)})^\top V_i^{(p)-1}Z_iD^{(p)}.$$
• Finally, plugging in $\beta^{(p+1)}$ and $D^{(p+1)}$, we find $\sigma^{2(p+1)}$ which maximizes the profile likelihood. Note that
$$Q(\beta^{(p+1)}, D^{(p+1)}, \sigma^2\mid\theta^{(p)}) = -\frac{n}{2}\log\sigma^2 - \frac{K}{2}\log|D^{(p+1)}| - \frac{1}{2}\sum_{i=1}^K E_{b\mid Y,\theta^{(p)}}\big[b_i^\top D^{(p+1)-1}b_i\big] - \frac{1}{2\sigma^2}\sum_{i=1}^K E_{b\mid Y,\theta^{(p)}}\big[(Y_i-X_i\beta^{(p+1)}-Z_ib_i)^\top(Y_i-X_i\beta^{(p+1)}-Z_ib_i)\big],$$
and hence maximizing $Q(\beta^{(p+1)}, D^{(p+1)}, \sigma^2\mid\theta^{(p)})$ is equivalent to minimizing
$$\frac{n}{2}\log\sigma^2 + \frac{1}{2\sigma^2}\sum_{i=1}^K E_{b\mid Y,\theta^{(p)}}\big[(Y_i-X_i\beta^{(p+1)}-Z_ib_i)^\top(Y_i-X_i\beta^{(p+1)}-Z_ib_i)\big].$$
Note that
$$E_{b\mid Y,\theta^{(p)}}\big[(Y_i-X_i\beta^{(p+1)}-Z_ib_i)^\top(Y_i-X_i\beta^{(p+1)}-Z_ib_i)\big] = (Y_i-X_i\beta^{(p+1)})^\top(Y_i-X_i\beta^{(p+1)}) - 2\mu_i^{p\top}Z_i^\top(Y_i-X_i\beta^{(p+1)}) + E_{b\mid Y,\theta^{(p)}}(b_i^\top Z_i^\top Z_ib_i) = (Y_i-X_i\beta^{(p+1)}-Z_i\mu_i^p)^\top(Y_i-X_i\beta^{(p+1)}-Z_i\mu_i^p) + \mathrm{tr}(Z_i^\top Z_i\Sigma_i^p).$$
So we obtain
$$\sigma^{2(p+1)} = \frac{1}{n}\sum_{i=1}^K\Big((Y_i-X_i\beta^{(p+1)}-Z_i\mu_i^p)^\top(Y_i-X_i\beta^{(p+1)}-Z_i\mu_i^p) + \mathrm{tr}(Z_i^\top Z_i\Sigma_i^p)\Big).$$
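For concreteness, a compact NumPy sketch of one EM pass using the update formulas of Theorem 3.7; all names and the simulated data are hypothetical, and a fixed number of iterations stands in for a convergence check.

```python
import numpy as np

def em_step(Y_list, X_list, Z_list, beta, D, sigma2):
    """One EM update of (beta, D, sigma2) for the linear mixed model."""
    K = len(Y_list)
    n_tot = sum(len(Y) for Y in Y_list)
    mus, Sigmas = [], []
    for Y, X, Z in zip(Y_list, X_list, Z_list):
        V = sigma2 * np.eye(len(Y)) + Z @ D @ Z.T
        r = Y - X @ beta
        mus.append(D @ Z.T @ np.linalg.solve(V, r))           # E(b_i | Y_i)
        Sigmas.append(D - D @ Z.T @ np.linalg.solve(V, Z @ D))  # Var(b_i | Y_i)
    # beta update
    A = sum(X.T @ X for X in X_list)
    u = sum(X.T @ (Y - Z @ m) for X, Y, Z, m in zip(X_list, Y_list, Z_list, mus))
    beta_new = np.linalg.solve(A, u)
    # D update: (1/K) sum E[b_i b_i' | Y_i]
    D_new = sum(S + np.outer(m, m) for S, m in zip(Sigmas, mus)) / K
    # sigma2 update
    rss = sum(np.sum((Y - X @ beta_new - Z @ m) ** 2) + np.trace(Z.T @ Z @ S)
              for Y, X, Z, m, S in zip(Y_list, X_list, Z_list, mus, Sigmas))
    return beta_new, D_new, rss / n_tot

# hypothetical random intercept + slope data
rng = np.random.default_rng(8)
K, n = 100, 4
t = np.arange(1, n + 1, dtype=float)
X_list = [np.column_stack([np.ones(n), t]) for _ in range(K)]
Z_list = X_list
D_true = np.array([[0.8, 0.1], [0.1, 0.3]])
Y_list = [X @ np.array([1.0, 0.5]) + Z @ rng.multivariate_normal(np.zeros(2), D_true)
          + rng.normal(scale=0.5, size=n) for X, Z in zip(X_list, Z_list)]
beta, D, s2 = np.zeros(2), np.eye(2), 1.0
for _ in range(50):
    beta, D, s2 = em_step(Y_list, X_list, Z_list, beta, D, s2)
print(beta, s2)
```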
Remark 3.8. Diagnostic tools. In model diagnostics, we should check all the assumptions of the model. First, we should check whether the mean part (= fixed effects part) is correctly specified. That is, we have to check whether the model includes the correct variables or not. (Two types of possibilities exist in mean misspecification: one is that variables that should be in the model are omitted; the other is that variables that should not be in the model are included.) There are several tools to achieve this goal; we can test $H_0: \beta_i = 0$ (for example) using a likelihood ratio, or inspect the residual terms $\mathbb{V}^{-1/2}(\mathbb{Y}-\hat\mu)$. Note that $\mathbb{V}^{-1/2}(\mathbb{Y}-\hat\mu)$ is a function of the covariates, and its mean would not be zero if the mean part were incorrectly designed. We can also employ $AIC = -2L + 2k$ or $BIC = -2L + k\log K$ (where $k$ is the number of unknown parameters) in order to perform model selection.
The next task is to check the random effect part, i.e., the selection of $Z$. Compared to the mean specification, since the random part $\mathbb{Z}b$ is not directly observable, we cannot "test" $b = 0$. Then how do we verify correctness? One can pay attention to the fact that $Z$ appears in the marginal variance of $Y$,
$$\mathrm{Var}(Y_i) = Z_iDZ_i^\top + \sigma^2 I_i.$$
Thus we can verify the model specification through the variance term. If the model has been correctly specified, then $(Y_{ij}-\mu_{ij})^2 - \sigma^2_{ij}$ should have mean zero, and $\mathbb{V}^{-1/2}(\mathbb{Y}-\hat\mu)$ should be homoskedastic. For a more precise analysis, we can test $D_{12} = 0$ or $D_{22} = 0$, where
$$\mathrm{Var}(Y_i\mid X_i) = \sigma^2 I_i + Z_i\begin{pmatrix} D_{11} & D_{12} \\ D_{21} & D_{22} \end{pmatrix}Z_i^\top.$$
How do we test it? We can intuitively apply a likelihood ratio test. The null $D_{12} = 0$ denotes uncorrelatedness between the random intercept and slope. However, testing $D_{22} = 0$ may cause a problem, because 0 is on the boundary of the parameter space for $D_{22}$; here standard likelihood inference may not perform well.
Finally, we can check the normality assumption using a P-P plot or Q-Q plot with the standardized residuals $\mathbb{V}^{-1/2}(\mathbb{Y}-\hat\mu)$.
Remark 3.9. What happens if the mean or variance model is misspecified? Assume that $E(\mathbb{Y}\mid\mathbb{X}) = \mathbb{X}\beta$ is the true model but we mistakenly fit $E(\mathbb{Y}\mid\mathbb{X}) = \mathbb{X}^*\beta^*$. Then $\hat\beta^*$ is a solution of
$$\sum_{i=1}^K X_i^{*\top}V_i^{-1}(Y_i - X_i^*\beta^*) = 0.$$
Recall that the solution $\hat\beta$ of
$$\sum_{i=1}^K X_i^\top V_i^{-1}(Y_i - X_i\beta) = 0$$
converges to the solution of
$$E\left[\sum_{i=1}^K X_i^\top V_i^{-1}(Y_i - X_i\beta)\right] = 0,$$
i.e., the optimizer of
$$E\left[\sum_{i=1}^K(Y_i - X_i\beta)^\top V_i^{-1}(Y_i - X_i\beta)\right].$$
Under the true model, the solution is the true value of $\beta$. Also, in this case, we can say that $\hat\beta^*$ converges to the solution $\beta^*$ of
$$E\left[\sum_{i=1}^K X_i^{*\top}V_i^{-1}(Y_i - X_i^*\beta^*)\right] = 0,$$
even though we considered a misspecified model.
How about variance misspecification? Also in this case, the solutions $\hat V^*$ and $\hat\beta^*$ converge to the optimizer of
$$\int\underbrace{\Big(-\sum_{i=1}^K\log|V_i^*(\rho)| - \sum_{i=1}^K(Y_i - X_i^*\beta^*)^\top V_i^{*-1}(\rho)(Y_i - X_i^*\beta^*)\Big)}_{\text{loss under the model}}\ \underbrace{|2\pi V_i|^{-1/2}e^{-\frac{1}{2}\sum_{i=1}^K(Y_i - X_i\beta)^\top V_i^{-1}(Y_i - X_i\beta)}}_{\text{pdf under the true model}}\,dY.$$
From this, we can also say that, if only the variance model is misspecified and the mean model is well specified, then $\hat\beta$ is still consistent, because it converges to the solution of
$$E\left(\sum_{i=1}^K X_i^\top V_i^{*-1}(Y_i - X_i\beta)\right) = 0.$$
Example 3.10. Northern Manhattan Stroke Study (Cognitive Function Data). Stroke-free subjects in northern Manhattan are followed annually to detect the outcome of stroke. One of the goals is to examine whether changes in cognitive function over time depend on kidney function measured by creatinine levels. In total 2029 subjects were enrolled, but the number of observed subjects decreases over time (n1 = 2029, n2 = 1695, n3 = 1348, n4 = 920, n5 = 491, n6 = 164). The observed outcome is the repeatedly measured TICS (Telephone Interview for Cognitive Status) score. The following figures describe the data.
Figure 3.2: Box plot. Each box corresponds to one time point. There is a slightly increasing trend as time goes by.

Figure 3.3: Score-time plot. (Left) Subjects are divided into two groups with low and high creatinine level, respectively; the median is used to determine whether the creatinine level is low or high. (Right) Subjects are clustered by the number of repeated measurements. We can see a tendency for subjects observed longer to have higher scores.
Figure 3.4: (Left) Random intercept - random slope plot. Each dot corresponds to one individual. Dots lying on a line are from subjects with ni = 1 (i.e., observed only once). Note that b̂i = E(bi|Yi) = DZi>Vi^{-1}(Yi − µi), and ni = 1 yields Zi = (1, 1) (the first column of Zi is (1, 1, 1, 1)> and the second is (1, 2, 3, 4)>). (Right) Random intercept - subject plot.
                          WLSE                            Mixed model
               β̂       s.e.    p-value           β̂       s.e.    p-value
(Intercept)    47.4    1.21    <.0001            47.8    1.22    <.0001
logCR          -0.88   0.53    0.096             -0.70   0.54    0.1963
time           0.081   0.036   0.0235            0.056   0.029   0.0592
AGE            -0.23   0.013   <.0001            -0.24   0.013   <.0001
woman          -0.68   0.25    0.0063            -0.68   0.26    0.0105
heduc          4.56    0.26    <.0001            4.54    0.26    <.0001
med            -2.00   0.24    <.0001            -2.11   0.24    <.0001
DIAB           -0.46   0.27    0.0892            -0.53   0.27    0.0522
etmod          0.84    0.22    0.0002            0.90    0.22    <.0001
NCAD           0.54    0.26    0.0412            0.52    0.26    0.0482
CR*time        -0.36   0.12    0.0049            -0.41   0.11    0.0001

Table 3.1: WLSE vs LMM fit. In WLSE, an intraclass correlation structure is assumed. In LMM, a random intercept and slope model is used. The two models give similar fits.
Remark 3.11. The conditional independence assumption can also be dropped. That is, we need not assume Var(Yi|bi) = σ2Ii; a more complicated error structure can be used. For example, we can assume that εi|bi ∼ N_{ni}(0, σ2Ωi), where

        ( 1          α          α^2        ...  α^{ni-1} )
        ( α          1          α          ...  α^{ni-2} )
Ωi =    ( α^2        α          1          ...  α^{ni-3} )
        ( ...                              ...           )
        ( α^{ni-1}   α^{ni-2}   α^{ni-3}   ...  1        ).
3.2.3 Restricted Maximum Likelihood (REML)
Note that D̂_MLE (the estimate of the variance components) is criticized because it tends to be biased downward, by ignoring the degrees of freedom lost in estimating β. For this reason, restricted maximum likelihood (REML) estimation is used, which is based on N − p linearly independent error contrasts. Intuitively, it uses a p-dimensional part of the N data points to estimate the main effect β, and uses the remaining N − p error contrasts to estimate the variance components.
Let S = I − X(X>X)^{-1}X> be the N × N residual projection matrix, and let A be an N × (N − p) matrix satisfying S = AA> and A>A = I (spectral decomposition). Then w = A>Y provides a particular set of N − p linearly independent error contrasts which is orthogonal (in the sense of statistical independence) to β̂ = G>Y, where G = V^{-1}X(X>V^{-1}X)^{-1}. Note that the orthogonality of w and β̂ comes from A>X = 0, so that A>VG = A>X(X>V^{-1}X)^{-1} = 0 and hence

Cov(A>Y, G>Y) = A>VG = 0.

(When Y is assumed to be multivariate normal, w = A>Y and β̂ = G>Y are independent.) This point has an important interpretation: since A>Y is independent of β̂, and β may be viewed as a nuisance when we estimate the variance components, a more efficient estimate can be obtained by working with A>Y. Patterson and Thompson (1971) provide a likelihood based on A>Y, later named the restricted likelihood. It turns out that

|A>VA| = |V||X>V^{-1}X|.

(It can be shown via a change of variables; for more details, see the tutorial document by Zhang (2015).) Also note the following:
Proposition 3.12.

Y>A(A>VA)^{-1}A>Y = (Y − Xβ̂)>V^{-1}(Y − Xβ̂).

Proof. The key point of the proof is that A>VG = 0 holds; in other words, A* := V^{1/2}A and X* := V^{-1/2}X are orthogonal. (Motivation: since the column space of (A*, X*) is N-dimensional with A* ⊥ X*, the projection matrix constructed from (A*, X*) is the identity.) Now let

T := I − (A* X*) [ (A* X*)>(A* X*) ]^{-1} (A* X*)>.

Then

T = I − (A* X*) ( (A*>A*)^{-1}  0 ; 0  (X*>X*)^{-1} ) ( A*> ; X*> ) = I − A*(A*>A*)^{-1}A*> − X*(X*>X*)^{-1}X*>

is obtained. Since A*(A*>A*)^{-1}A*> + X*(X*>X*)^{-1}X*> is an idempotent matrix, so is T, but from

tr(TT>) = tr(T) = N − (N − p) − p = 0,

we obtain T = 0. Thus we get

A*(A*>A*)^{-1}A*> + X*(X*>X*)^{-1}X*> = I.

Rearranging the terms (pre- and post-multiplying by V^{-1/2}), we obtain

A(A>VA)^{-1}A> + V^{-1}X(X>V^{-1}X)^{-1}X>V^{-1} = V^{-1}.

Thus we get

A(A>VA)^{-1}A> = V^{-1} − V^{-1}X(X>V^{-1}X)^{-1}X>V^{-1}
 = V^{-1/2}( I − V^{-1/2}X(X>V^{-1}X)^{-1}X>V^{-1/2} )V^{-1/2}
 = V^{-1/2}( I − V^{-1/2}X(X>V^{-1}X)^{-1}X>V^{-1/2} )( I − V^{-1/2}X(X>V^{-1}X)^{-1}X>V^{-1/2} )V^{-1/2}
 = ( V^{-1/2} − V^{-1}X(X>V^{-1}X)^{-1}X>V^{-1/2} )( V^{-1/2} − V^{-1/2}X(X>V^{-1}X)^{-1}X>V^{-1} )
 = (I − GX>)V^{-1}(I − XG>)
 = (I − XG>)>V^{-1}(I − XG>),

and therefore

Y>A(A>VA)^{-1}A>Y = Y>(I − XG>)>V^{-1}(I − XG>)Y = (Y − Xβ̂)>V^{-1}(Y − Xβ̂).
Note that the (restricted) log-likelihood of A>Y ∼ N(0, A>VA) is

l* = −(1/2) log|A>VA| − (1/2) Y>A(A>VA)^{-1}A>Y
   = −(1/2) log|V| − (1/2) log|X>V^{-1}X| − (1/2)(Y − Xβ̂)>V^{-1}(Y − Xβ̂)
   = −(1/2) log| ∑_{i=1}^K Xi>Vi^{-1}Xi | + l(β̂(D)),

where

l(β) = −(1/2) log|V| − (1/2)(Y − Xβ)>V^{-1}(Y − Xβ)

is the marginal log-likelihood. Thus the restricted likelihood can be viewed as a "proper likelihood for V when β is a nuisance." In other words, in the REML approach β is not of interest, so we plug β̂ into the likelihood and estimate the variance components from it; the restricted likelihood is free of β.
Remark 3.13. A Bayesian interpretation is also available. Note that the restricted likelihood can be written as

l* = (1/2) log|Var(β̂)| + l(β̂),

since Var(β̂) = (X>V^{-1}X)^{-1}; i.e., it can be viewed as the likelihood obtained by integrating β out of the model under a flat prior distribution on β.
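As a computational illustration, the restricted log-likelihood derived above can be evaluated directly from (D, σ2) by profiling out β. The following Python sketch is a minimal implementation under the random-effects structure of this section; the function name and argument layout are assumptions for illustration only.

import numpy as np

def reml_loglik(D, sigma2, Y, X, Z):
    """l* = -0.5 log|sum_i Xi' Vi^{-1} Xi| + l(beta_hat(D)), up to additive constants."""
    XtVinvX, XtVinvY, logdet = 0.0, 0.0, 0.0
    parts = []
    for Yi, Xi, Zi in zip(Y, X, Z):
        Vi = Zi @ D @ Zi.T + sigma2 * np.eye(len(Yi))
        Vinv = np.linalg.inv(Vi)
        _, ld = np.linalg.slogdet(Vi)
        logdet += ld
        XtVinvX = XtVinvX + Xi.T @ Vinv @ Xi
        XtVinvY = XtVinvY + Xi.T @ Vinv @ Yi
        parts.append((Yi, Xi, Vinv))
    beta_hat = np.linalg.solve(XtVinvX, XtVinvY)          # GLS profile estimate of beta
    quad = sum(float((Yi - Xi @ beta_hat) @ Vinv @ (Yi - Xi @ beta_hat)) for Yi, Xi, Vinv in parts)
    profile = -0.5 * logdet - 0.5 * quad                  # l(beta_hat(D))
    _, ld_xvx = np.linalg.slogdet(XtVinvX)
    return profile - 0.5 * ld_xvx                          # restricted log-likelihood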
3.2.4 Two-stage Modeling

We can also handle the data in two stages. Such a two-stage approach first summarizes the within-individual trend, and then performs inference across individuals. The model can be summarized as follows. First, "summarize the within-individual trend," i.e.,

Yi | βi ∼ N_{ni}(Xiβi, Vi).

Note that

β̂i = (Xi>Vi^{-1}Xi)^{-1}Xi>Vi^{-1}Yi,

and hence

β̂i | βi ∼ N_p(βi, (Xi>Vi^{-1}Xi)^{-1}).

Next, inference "across individuals" is performed, i.e.,

βi ∼ N_p(Ziα, D).

This can be viewed as follows. First, the individual-level effect βi is drawn from βi ∼ N_p(Ziα, D); βi differs across individuals even if Ziα is the same for all of them. Next, for each individual, the response Yi is generated given βi. The name "two-stage" modeling comes from the two-stage estimation procedure: the first stage is to obtain β̂i. Note that the marginal distribution is

β̂i ∼ N(Ziα, D + (Xi>Vi^{-1}Xi)^{-1}).

Using this, one can infer α and D; this is the second stage.
Remark 3.14. Two-stage modeling is useful when the time series is long, as in diary studies. Note that Xi>Vi^{-1}Xi becomes large when a subject is observed for a long time, i.e., when ni is large. Thus Var(β̂i) = (Xi>Vi^{-1}Xi)^{-1} + D becomes small when the time series is long, and large when the time series is short.
Example 3.15. Recall the prolactin data. Let

       ( 1  1 )
       ( 1  2 )              ( 1  groupi  baselinei  0  0       0         )
Xi =   ( 1  3 ) ,     Zi =   ( 0  0       0          1  groupi  baselinei ) ;
       ( 1  4 )

the first and second rows of Zi correspond to the (random) intercept and the (random) slope, respectively (note that βi is random in this model). Then

β̂i | βi ∼ N(βi, (Xi>Vi^{-1}Xi)^{-1}).
Remark 3.16. The second stage is to estimate α and D. This can be done by maximum likelihood:

∑_{i=1}^K Zi>( D + (Xi>Vi^{-1}Xi)^{-1} )^{-1}( β̂i − Ziα ) = 0

is the likelihood equation for α. It is easily obtained that

α̂ = ( ∑_{i=1}^K Zi>( D + (Xi>Vi^{-1}Xi)^{-1} )^{-1}Zi )^{-1} ( ∑_{i=1}^K Zi>( D + (Xi>Vi^{-1}Xi)^{-1} )^{-1}β̂i ).

For D̂_MLE, an iterative algorithm such as the Newton-Raphson algorithm is required. Alternatively, an EM-type estimate can be computed:

D(p+1) = D(p) − (1/K)∑ D(p)Σi^{-1}D(p) + (1/K)∑ D(p)Σi^{-1}(β̂i − Ziα(p))(β̂i − Ziα(p))>Σi^{-1}D(p),

where Σi = D(p) + (Xi>Vi^{-1}Xi)^{-1}.
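The second-stage GLS estimate of α above is simple to compute once the first-stage quantities β̂i and Ci = (Xi>Vi^{-1}Xi)^{-1} are available. A minimal Python sketch follows (names are hypothetical; D is taken as given here, although in practice it is updated iteratively as described above).

import numpy as np

def second_stage_alpha(beta_hats, covs, Z, D):
    """GLS estimate of alpha in beta_hat_i ~ N(Z_i alpha, D + C_i), with covs[i] = C_i."""
    A, b = 0.0, 0.0
    for bh, Ci, Zi in zip(beta_hats, covs, Z):
        Wi = np.linalg.inv(D + Ci)      # weight for subject i
        A = A + Zi.T @ Wi @ Zi
        b = b + Zi.T @ Wi @ bh
    return np.linalg.solve(A, b)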
3.3 Appendix 1: EM algorithm
The Expectation-Maximization (EM) algorithm is a computational algorithm to find the maximum
likelihood estimate of the parameters of an underlying distribution from a given data set when the data
is incomplete or has missing values. Many statistical problems can be formulated as missing data problems; examples include mixture models, cluster analysis, analysis with latent variables, random effects models, and causal inference. Before we start, we introduce some notation that will be used throughout this section. Let Yo be the observed data and Ym the missing data, so that Yc = (Yo, Ym) is the complete data. Also, let f(Yc|θ) and g(Yo|θ) be the densities of the complete data and the observed data given θ, respectively. Then

g(Yo|θ) = ∫ f(Yc|θ) dYm

holds. Finally, let

k(Ym|Yo, θ) = f(Yc|θ) / g(Yo|θ)

be the density of the missing data Ym given the observed data. In fact, to cover a broader class of incomplete data, one can take the observed data to be a function of the complete data, say T(Yc) = Yo, and let X(y) = {Yc ∈ X | T(Yc) = y} be the set of complete data whose observed data equal y. Then

g(y|θ) = ∫_{X(y)} f(x|θ) dx.

However, in this section we only consider the case where the missing data are overt, so that Yc = (Yo, Ym).

The goal of the EM algorithm is to maximize ℓ(θ) = log g(Yo|θ). For this, at the (p+1)th iteration one finds θ(p+1) satisfying

ℓ(θ(p+1)) ≥ ℓ(θ(p)),

so that θ(p) eventually converges to θ̂_MLE. To do this, the EM algorithm introduces

Q(θ|θ(p)) = E_{Ym}[ log f(Yc|θ) | Yo, θ(p) ]

and maximizes Q(θ|θ(p)) over θ at the (p+1)th iteration. It is motivated as follows. Note that
ℓ(θ) = log g(Yo|θ)
     = log ∫ f(Yc|θ) dYm
     = log ∫ [ f(Yc|θ) / k(Ym|Yo, θ(p)) ] k(Ym|Yo, θ(p)) dYm
     = log E_{Ym}[ f(Yc|θ) / k(Ym|Yo, θ(p)) | Yo, θ(p) ]
     ≥ E_{Ym}[ log ( f(Yc|θ) / k(Ym|Yo, θ(p)) ) | Yo, θ(p) ]

by Jensen's inequality. Also note that

ℓ(θ(p)) = log g(Yo|θ(p)) = E_{Ym}[ log g(Yo|θ(p)) | Yo, θ(p) ] = E_{Ym}[ log ( f(Yc|θ(p)) / k(Ym|Yo, θ(p)) ) | Yo, θ(p) ]

holds. Combining both results gives

ℓ(θ) ≥ ℓ(θ(p)) + E_{Ym}[ log ( f(Yc|θ) / f(Yc|θ(p)) ) | Yo, θ(p) ].

We observe that any θ that increases the right-hand side of this inequality also increases ℓ(θ). In order to get the greatest possible increase, choose

θ(p+1) = argmax_θ E_{Ym}[ log ( f(Yc|θ) / f(Yc|θ(p)) ) | Yo, θ(p) ],

which is equivalent to

θ(p+1) = argmax_θ E_{Ym}[ log f(Yc|θ) | Yo, θ(p) ] = argmax_θ Q(θ|θ(p)).
Algorithm 1 EM algorithm

1: Repeat the following iteration until convergence.
2: for p = 1, 2, 3, ... do
3:    E-step. Evaluate Q(θ|θ(p)).
4:    M-step. Find θ(p+1) = argmax_θ Q(θ|θ(p)).
5: end for
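As a concrete illustration of the E-step/M-step structure, the following Python sketch implements EM for a two-component normal mixture, one of the missing-data problems mentioned at the start of this section (the unobserved component labels play the role of Ym). The parameter names, initialization, and common-variance assumption are illustrative only.

import numpy as np
from scipy.stats import norm

def em_mixture(y, n_iter=200):
    """EM for y_i ~ pi N(mu1, s^2) + (1 - pi) N(mu2, s^2)."""
    pi, mu1, mu2, s = 0.5, np.min(y), np.max(y), np.std(y)
    for _ in range(n_iter):
        # E-step: posterior membership probabilities given the current theta^(p)
        d1 = pi * norm.pdf(y, mu1, s)
        d2 = (1 - pi) * norm.pdf(y, mu2, s)
        w = d1 / (d1 + d2)
        # M-step: maximize Q(theta | theta^(p))
        pi = w.mean()
        mu1 = np.sum(w * y) / np.sum(w)
        mu2 = np.sum((1 - w) * y) / np.sum(1 - w)
        s = np.sqrt(np.sum(w * (y - mu1) ** 2 + (1 - w) * (y - mu2) ** 2) / len(y))
    return pi, mu1, mu2, s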
It is well known that the EM algorithm increases the likelihood monotonically and converges to a stationary point.

Theorem 3.17 (Monotonicity). Every EM iteration increases ℓ(θ), that is,

ℓ(θ(p+1)) ≥ ℓ(θ(p)),

with equality if and only if Q(θ(p+1)|θ(p)) = Q(θ(p)|θ(p)).
Proof. Only a sketch is given. Note that f(Yc|θ(p+1)) = g(Yo|θ(p+1)) k(Ym|Yo, θ(p+1)), and hence

Q(θ(p+1)|θ(p)) = ∫ log f(Yc|θ(p+1)) k(Ym|Yo, θ(p)) dYm
 = ∫ log g(Yo|θ(p+1)) k(Ym|Yo, θ(p)) dYm + ∫ log k(Ym|Yo, θ(p+1)) k(Ym|Yo, θ(p)) dYm
 = log g(Yo|θ(p+1)) + H(θ(p+1)|θ(p))

holds, where H(θ'|θ(p)) := ∫ log k(Ym|Yo, θ') k(Ym|Yo, θ(p)) dYm. Similarly we obtain

Q(θ(p)|θ(p)) = log g(Yo|θ(p)) + H(θ(p)|θ(p)).

Combining both formulae gives

Q(θ(p+1)|θ(p)) − Q(θ(p)|θ(p)) = ℓ(θ(p+1)) − ℓ(θ(p)) + H(θ(p+1)|θ(p)) − H(θ(p)|θ(p)).

By the definition of the EM algorithm, Q(θ(p+1)|θ(p)) − Q(θ(p)|θ(p)) ≥ 0. Thus if we show H(θ(p+1)|θ(p)) − H(θ(p)|θ(p)) ≤ 0, we can derive ℓ(θ(p+1)) ≥ ℓ(θ(p)).

Claim. H(θ(p+1)|θ(p)) − H(θ(p)|θ(p)) ≤ 0.

By the definition of H,

H(θ(p+1)|θ(p)) − H(θ(p)|θ(p)) = ∫ log [ k(Ym|Yo, θ(p+1)) / k(Ym|Yo, θ(p)) ] k(Ym|Yo, θ(p)) dYm
 = E_{Ym}[ log ( k(Ym|Yo, θ(p+1)) / k(Ym|Yo, θ(p)) ) | Yo, θ(p) ].

Now Jensen's inequality gives

E_{Ym}[ log ( k(Ym|Yo, θ(p+1)) / k(Ym|Yo, θ(p)) ) | Yo, θ(p) ] ≤ log E_{Ym}[ k(Ym|Yo, θ(p+1)) / k(Ym|Yo, θ(p)) | Yo, θ(p) ] = log 1 = 0.
References
• Lim, C.Y. (2016). Applied Statistics [LaTeX document].
• Rencher, A. C., & Schaalje, G. B. (2008). Linear models in statistics. John Wiley & Sons.
• Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data
via the EM algorithm. Journal of the royal statistical society. Series B (methodological), 1-38.
3.4 Appendix 2: BLUP approach for mixed effects model

In class, the EM algorithm and the (RE)ML approach were covered for estimation (or prediction) in the mixed effects model. In this section we introduce the BLUP approach, which was only briefly covered in class. Consider a linear mixed effects model

Yi = Xiβ + Zibi + εi,

or equivalently,

Y = Xβ + Zb + ε,

where

Var(b; ε) = (D 0; 0 R).

(The D here differs from the D in the main text; in fact it is the block-diagonal matrix I ⊗ D.) Then

Var(Y) = ZDZ> + R =: V.
Let K>β + M>b be the function of interest to be predicted, and denote a linear predictor by H>Y. Then the "unbiasedness" condition on H yields

E(H>Y) = H>Xβ = E(K>β + M>b) = K>β for all β,

i.e.,

K = X>H.

Now our goal is to find the "best" predictor, i.e., to minimize the variance of the "residual" (in fact, the prediction error). Let

PE = K>β + M>b − H>Y

be the prediction error. Then

Var(PE) = Var(K>β + M>b − H>Y) = Var( (M; −H)>(b; Y) ) = M>DM + H>VH − M>DZ>H − H>ZDM.

Var(PE) should be minimized under the constraint K = X>H. Therefore, our target function to optimize becomes

Q = Var(PE) + (X>H − K)>Φ    ("penalized criterion").

Then the "normal equations" become

∂Q/∂H = 2VH − 2ZDM + XΦ = 0,
∂Q/∂Φ = X>H − K = 0.

From now on, for convenience, let θ = Φ/2. Then the "normal equations" become

VH = ZDM − Xθ,   X>H = K.

From the first one we get

H = V^{-1}(ZDM − Xθ),

and hence

X>V^{-1}(ZDM − Xθ) = K,

i.e.,

Xθ = X(X>V^{-1}X)^−(X>V^{-1}ZDM − K),

where A^− denotes a generalized inverse of A (recall that A = A(A>A)^−A>A). Finally, we get

H = V^{-1}ZDM − V^{-1}X(X>V^{-1}X)^−(X>V^{-1}ZDM − K),

which yields our conclusion,

(K>β + M>b)^BLUP = M>DZ>V^{-1}( Y − X(X>V^{-1}X)^−X>V^{-1}Y ) + K>(X>V^{-1}X)^−X>V^{-1}Y.
Note that exactly the same logic applies when K>β + M>b is not a scalar; then we get

K>β̂^BLUP = K>(X>V^{-1}X)^−X>V^{-1}Y

and

M>b̂^BLUP = M>DZ>V^{-1}(Y − Xβ̂^BLUP),

under the assumption that K>β is estimable. In particular, if X>V^{-1}X is nonsingular, then

β̂^BLUP = (X>V^{-1}X)^{-1}X>V^{-1}Y

is the same as the BLUE, and

b̂^BLUP = DZ>V^{-1}(Y − Xβ̂)

is equal to E(b|Y). Note that the solutions Xβ̂^BLUE and Zb̂^BLUP are also obtained by minimizing

SS = (Y − Xβ − Zb)>R^{-1}(Y − Xβ − Zb) + b>D^{-1}b

(as we saw in Remark 3.5), which is equivalent to solving

( X>R^{-1}X   X>R^{-1}Z          ) (β)     ( X>R^{-1}Y )
( Z>R^{-1}X   Z>R^{-1}Z + D^{-1} ) (b)  =  ( Z>R^{-1}Y ).      (3.2)

In other words, if β̂ and b̂ are the solution of (3.2), then Xβ̂ is the BLUE of Xβ and Zb̂ is the BLUP of Zb. Equation (3.2) is called the Mixed Model Equation (MME), and this method is called Henderson's MME approach.
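Henderson's MME can be solved directly as one linear system. The following Python sketch (a hypothetical helper, with D and R assumed known) assembles the coefficient matrix of (3.2) and returns β̂ and b̂.

import numpy as np

def henderson_mme(Y, X, Z, D, R):
    """Solve the mixed model equations (3.2) for (beta_hat, b_hat)."""
    Rinv = np.linalg.inv(R)
    Dinv = np.linalg.inv(D)
    top = np.hstack([X.T @ Rinv @ X, X.T @ Rinv @ Z])
    bot = np.hstack([Z.T @ Rinv @ X, Z.T @ Rinv @ Z + Dinv])
    lhs = np.vstack([top, bot])
    rhs = np.concatenate([X.T @ Rinv @ Y, Z.T @ Rinv @ Y])
    sol = np.linalg.solve(lhs, rhs)
    p = X.shape[1]
    return sol[:p], sol[p:]    # beta_hat (BLUE of beta), b_hat (BLUP of b)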
References
• Searle, S. R., & Gruber, M. H. (2016). Linear models. John Wiley & Sons.
• Searle, S. R., Casella, G., & McCulloch, C. E. (2009). Variance components (Vol. 391). John
Wiley & Sons.
• Ruppert, D., Wand, M. P., & Carroll, R. J. (2003). Semiparametric regression (No. 12). Cam-
bridge university press.
• Searle, S. R., & Henderson, C. R. (1961). Computing procedures for estimating components of
variance in the two-way classification, mixed model. Biometrics, 17(4), 607-616.
Chapter 4
Missing Data Problem
4.1 Introduction: Missing Mechanisms and Patterns
In a longitudinal study, when subjects are lost to follow-up, the outcome cannot be observed. If baseline covariates are still observed even when a subject is lost to follow-up (i.e., only the outcome and time-varying covariates are unobserved), then missing outcomes (drop-out) and missing covariates require separate treatment.
Definition 4.1. Before the start, we may define some notations.
• Yij be outcome at the jth measurement on the ith subject.
• Yi = (Yi1, Yi2, · · · , Yin)> be complete data.
• Xi be n× p design matrix for fixed effects.
• Zi be n× q design matrix for random effects.
• rij be an observation (non-missing) indicator for yij; rij = 1 if yij is observed.
• ti be the number of observations for i.
• Ri = (ri1, ri2, · · · , rin)> .
• Let Yi = (Yi^o, Yi^m), where Yi^o and Yi^m are the observed and missing outcomes, respectively.

Example 4.2. Let n = 3 and Ri = (1, 0, 1)>. Then Yi^o = (Yi1, Yi3) and Yi^m = (Yi2).
Example 4.3. Why is proper treatment of missing outcomes important? Figure 4.1 shows randomly generated simulation data. The first plot shows the fitted line when all data are used. The fitted lines in the second and third plots are obtained when there are missing data but the missingness is ignored. In the second plot, the non-missing indicator Ri is randomly generated so that Ri depends on Xi. The third plot is generated under a similar situation; the only difference is that Ri depends on Xi × Yi. Even if the complete data are the same, the way missingness arises can affect the fit if missing data are not treated properly. In summary, "some missing is none; some is dangerous."
Figure 4.1: Simulation: Treatment of Missing Data
Definition 4.4 (Missing Mechanisms). (a) If missingness is (conditionally) independent of complete
data, i.e.,
R ⊥ (Y o, Y m)|(X,Z),
then such a missing mechanism is called Missing Completely At Random (MCAR).
(b) If missingness is (conditionally) independent of missing data given the observation, i.e.,
R ⊥ (Y m)|(Y o, X, Z),
then such missing mechanism is called Missing At Random (MAR).
(c) If R depends on Y m given (Y o, X, Z), then such missing mechanism is called Nonignorable
Missingness (NI).
Note that these definitions are stated for the regression setting with missing outcomes.
Example 4.5. (a) Recall the cognitive function example. If the cognitive score of patient A is low, then A's cognitive function is questionable and further diagnosis is needed. Therefore, A will revisit with high probability and missingness may not occur at the next time point. In contrast, if the cognitive score of patient B is high, then B may not have any problem with his/her cognitive function, and hence B's future data are more likely to be missing. In this case, "missingness R depends on the observation (the status at the previous visit)," and hence we may suppose that the missing mechanism is MAR.
(b) (NI)
Definition 4.6. An array of observation indicators (r1, r2, r3) is called a missing pattern. There are seven possible patterns: (1,1,1), (1,1,0), (1,0,0), (1,0,1), (0,1,0), (0,1,1) and (0,0,1). The patterns (1,1,1), (1,1,0) and (1,0,0) are non-increasing and are called monotone. In monotone data, P(rj = 1|rj−1 = 0) = 0, and each subject's indicators take the form

(1, 1, 1, ..., 1, 1, 0, 0, 0, ..., 0, 0),

with all observed time points first, followed by all missed ones. The missing patterns (1,0,1), (0,1,0), (0,1,1) and (0,0,1) are called non-monotone. In non-monotone data, P(rj = 1|rj−1 = 0) ≠ 0.
Remark 4.7. Sometimes the nonresponse mechanism and the missing pattern are confounded. For example, the mechanism r3 ⊥ y3 | y1, y2, r2 = 1, i.e.,

P(r3|y1, y2, y3, r2 = 1) = P(r3|y1, y2, r2 = 1),

is MAR for monotonically missing data (since under monotonicity, r2 = 1 implies r1 = 1, and hence (y1, y2) = Y^o), but is NI for non-monotonically missing data (since y1 could be missing). Therefore, we need a more complex set of assumptions for non-monotonically missing data to be MAR.
Remark 4.8. Note that we can estimate the missing mechanism via binary regression (e.g., logistic regression). For example, consider the situation of the previous remark and assume monotonicity. If P(r3|y1, y2, r2 = 1) does not depend on y1 and y2, the missing mechanism is MCAR. In other words, by testing whether the coefficients corresponding to Y^o equal zero, we can test MCAR vs. MAR. In our example, fit the (logistic) regression model

logit P(r3|y1, y2, r2 = 1) = γ0 + γ1 y1 + γ2 y2,

and if H0 : γ1 = γ2 = 0 is not rejected, we may accept MCAR. However, keep in mind that the study is not powered to detect such an effect; since the study is not designed to verify the missing mechanism, the sample size may not be big enough to detect the difference between MCAR and MAR. Also remark that, to distinguish between MAR and NI, we would need a test statistic to check, for example, whether r3 is independent of y3 given the observed data. In our example, this means a binary regression for P(r3|y1, y2, y3, r2 = 1), checking whether it depends on y3. Even though y3 is missing in practice, this can be handled as a missing covariate problem. Such an approach may be theoretically fine, but we cannot check the parametric model assumptions (i.e., model diagnostics are unavailable). In summary, such a test (MAR vs. NI) is doable but relies on unverifiable model assumptions.
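As an illustration of the MCAR vs. MAR check described above, the following Python sketch fits the logistic model for r3 by Newton-Raphson and compares it with the intercept-only (MCAR) model via a likelihood ratio statistic. The data arrays and function names are hypothetical, for illustration only.

import numpy as np
from scipy.stats import chi2

def logistic_nr(X, r, n_iter=50):
    """Newton-Raphson fit of logit P(r=1|X) = X gamma; returns (gamma_hat, log-likelihood)."""
    g = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ g))
        W = p * (1 - p)
        g = g + np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (r - p))
    p = 1.0 / (1.0 + np.exp(-X @ g))
    ll = np.sum(r * np.log(p) + (1 - r) * np.log(1 - p))
    return g, ll

def test_mcar_vs_mar(y1, y2, r3):
    # y1, y2: outcomes at visits 1 and 2 for subjects with r2 = 1; r3: indicator of observing visit 3
    X_full = np.column_stack([np.ones_like(y1), y1, y2])   # gamma0 + gamma1*y1 + gamma2*y2
    X_null = np.ones((len(y1), 1))                          # MCAR: no dependence on observed outcomes
    _, ll_full = logistic_nr(X_full, r3)
    _, ll_null = logistic_nr(X_null, r3)
    lr = 2 * (ll_full - ll_null)
    return lr, chi2.sf(lr, df=2)   # small p-value is evidence against MCAR (toward MAR)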
4.2 Missing Outcomes
Remark 4.9. There are several approaches to handling missing outcomes: a likelihood approach, an estimating equation approach, imputation, or inverse probability weighting. In this course, only the likelihood approach will be covered. The likelihood approach maximizes the joint likelihood of the observed data (R, Y^o),

L(θ) = ∏_{i=1}^K ∫ f(Yi, Ri|Xi; θ) dYi^m.

To make this approach possible, the joint distribution of (Yi, Ri) should be available; after partitioning it into a conditional and a marginal part, we can model each one. There are two ways of partitioning f(Yi, Ri|Xi; θ):

• Selection model: f(Y, R|X; θ) = P(R|Y, X; θ1) f(Y|X; θ2), or
• Pattern mixture model: f(Y, R|X; θ) = f(Y|X, R; θ3) P(R|X; θ4).

In the selection model, the distribution of R|Y must be characterized; it describes "what will be selected as missing." In the pattern mixture model, the distribution of Y|R must be characterized; given the missing pattern, the distribution of Y is specified.

Note that the parameter of interest is θ2, since we are interested in the relationship between X and Y. Thus the selection model has the advantage that P(R|Y, X; θ1) can sometimes be ignored in the estimation step. Also note that for the selection model, the missing mechanism R|Y, X must be specified. On the other hand, in the pattern mixture model an additional step is needed to convert θ3 into θ2, but the notation can be convenient because the (non-)missing indicator is given when handling Y|R, X.
Example 4.10. Selection Model. Note that

L(θ) = ∏_{i=1}^K ∫ P(Ri|Yi, Xi; θ1) f(Yi|Xi; θ2) dYi^m.

We have to examine P(Ri|Yi, Xi; θ1).

(i) Under MCAR, P(Ri|Yi, Xi; θ1) = P(Ri|Xi; θ1), and hence

L(θ) = ∏_{i=1}^K ∫ P(Ri|Yi, Xi; θ1) f(Yi|Xi; θ2) dYi^m
     = ∏_{i=1}^K ∫ P(Ri|Xi; θ1) f(Yi|Xi; θ2) dYi^m        (P(Ri|Xi; θ1) is constant in Yi)
     = ∏_{i=1}^K P(Ri|Xi; θ1) ∫ f(Yi|Xi; θ2) dYi^m.

Thus the log-likelihood becomes

ℓ(θ) = ∑_{i=1}^K log P(Ri|Xi; θ1) + ∑_{i=1}^K log ∫ f(Yi|Xi; θ2) dYi^m.

Note that our interest is only in θ2, so we can ignore P(Ri|Xi; θ1); in other words, we do not need to construct a missing-data model R|X; θ1!
(ii) Under MAR, P(Ri|Yi, Xi; θ1) = P(Ri|Yi^o, Xi; θ1), and hence

L(θ) = ∏_{i=1}^K ∫ P(Ri|Yi, Xi; θ1) f(Yi|Xi; θ2) dYi^m
     = ∏_{i=1}^K ∫ P(Ri|Yi^o, Xi; θ1) f(Yi|Xi; θ2) dYi^m        (P(Ri|Yi^o, Xi; θ1) is constant in Yi^m)
     = ∏_{i=1}^K P(Ri|Yi^o, Xi; θ1) ∫ f(Yi|Xi; θ2) dYi^m.

Again, the log-likelihood becomes

ℓ(θ) = ∑_{i=1}^K log P(Ri|Yi^o, Xi; θ1) + ∑_{i=1}^K log ∫ f(Yi|Xi; θ2) dYi^m,

and we are not interested in θ1, so we do not need to construct a missing-data model either.

In either case (MCAR or MAR), the step of explicitly constructing the missing-data model can be omitted, i.e., the missingness part (e.g., a model for P(r3|y1, y2, y3), etc.) can be "ignored"; this is in contrast to the nonignorable (NI) case.
Remark 4.11. Recall that we want to estimate E(Yi|Xi) in the regression. However, what we can estimate directly (under MCAR or MAR) is E(Yi|Xi, Ri); if the missing pattern is given, the missing values can be regarded as if they were designed to be missed, and so we can "ignore" the missingness. Then it is equivalent to solving the equation

∑_{i=1}^K Xi>∆i(Yi − Xiβ) = 0,

where ∆i is a diagonal matrix whose jth diagonal element is rij. This is the same as solving

∑_{i=1}^K Xi>∆i(Yi − Xiβ) = ∑_{i=1}^K Xio>(Yio − Xioβ) = 0.

It gives inference for

E(Yi|Xi, Ri) = Xiβ.

However, if the missing mechanism is MCAR, then E(Yi|Xi, Ri) = E(Yi|Xi), and hence we can "ignore the missingness" and still obtain an estimate of E(Yi|Xi) = Xiβ. This means that we do not need a likelihood approach in the MCAR case; we need a likelihood approach under the other missing mechanisms. Thus the rest of this section concentrates on the other missing mechanisms, MAR and NI.
The following proposition shows how to handle a regression model (e.g., the multivariate linear model) via the likelihood approach using a selection model.

Proposition 4.12. Maximizing the likelihood

L = ∏_{i=1}^K ∫ f(Yi|Xi; β) P(Ri|Yi^o, Yi^m, Xi; θ1) dYi^m

is equivalent to solving the equation

∑_{i=1}^K E[ ∂/∂β log f(Yi|Xi; β) | Yi^o, Ri, Xi ] = 0.
Proof. The score equation from the likelihood is

∂/∂β log L = ∑_{i=1}^K ∂/∂β log ∫ f(Yi|Xi; β) P(Ri|Yi^o, Yi^m, Xi; θ1) dYi^m
 = ∑_{i=1}^K [ ∂/∂β ∫ f(Yi|Xi; β) P(Ri|Yi^o, Yi^m, Xi; θ1) dYi^m ] / [ ∫ f(Yi|Xi; β) P(Ri|Yi^o, Yi^m, Xi; θ1) dYi^m ]
 = ∑_{i=1}^K [ ∫ (∂/∂β f(Yi|Xi; β)) P(Ri|Yi^o, Yi^m, Xi; θ1) dYi^m ] / [ ∫ f(Yi|Xi; β) P(Ri|Yi^o, Yi^m, Xi; θ1) dYi^m ]
 = ∑_{i=1}^K [ ∫ (∂/∂β log f(Yi|Xi; β)) f(Yi|Xi; β) P(Ri|Yi^o, Yi^m, Xi; θ1) dYi^m ] / [ ∫ f(Yi|Xi; β) P(Ri|Yi^o, Yi^m, Xi; θ1) dYi^m ]
 = ∑_{i=1}^K ∫ (∂/∂β log f(Yi|Xi; β)) · [ f(Yi|Xi; β) P(Ri|Yi^o, Yi^m, Xi; θ1) / ∫ f(Yi|Xi; β) P(Ri|Yi^o, Yi^m, Xi; θ1) dYi^m ] dYi^m
 = ∑_{i=1}^K ∫ (∂/∂β log f(Yi|Xi; β)) f(Yi^m|Yi^o, Ri, Xi; β, θ1) dYi^m
 = ∑_{i=1}^K E[ ∂/∂β log f(Yi|Xi; β) | Yi^o, Ri, Xi ],

where we used that the bracketed ratio equals f(Yi^m|Yi^o, Ri, Xi; β, θ1).
Remark 4.13. (i) Note that the previous proposition is valid even under NI.

(ii) If Ri ⊥ Yi^m | Yi^o, Xi, i.e., the missing mechanism is MAR, then the likelihood can be factored as

L = ∏_{i=1}^K ∫ f(Yi|Xi; β) P(Ri|Yi^o, Yi^m, Xi; θ1) dYi^m = ∏_{i=1}^K [ ∫ f(Yi|Xi; β) dYi^m ] P(Ri|Yi^o, Xi; θ1).

Thus under MAR, the observation probability P(Ri|Yi^o, Xi) does not have to be explicitly specified in the likelihood approach. This implies that standard mixed effects modeling remains valid, since in the linear mixed model we maximize the "marginal likelihood," which here is ∫ f(Yi|Xi; β) dYi^m (where Yi^m includes both the actual missing values and the random effect term bi).
(iii) Now consider the multivariate linear model with the normal assumption. Then we get

log f(Yi|Xi; β) = −(1/2) log|Vi| − (1/2)(Yi − Xiβ)>Vi^{-1}(Yi − Xiβ).

Now let Yi = (Yio, Yim), Xi = (Xio, Xim) and Vi = (Vi11 Vi12; Vi21 Vi22). To apply Proposition 4.12, we need the gradient of log f(Yi|Xi; β). Using the block decomposition of Vi^{-1},

∂/∂β log f(Yi|Xi; β) = Xi>Vi^{-1}(Yi − Xiβ)
 = (Xio> Xim>) (Vi11 Vi12; Vi21 Vi22)^{-1} (Yio − Xioβ; Yim − Ximβ)
 = (Xio> Xim>) (I  −Vi11^{-1}Vi12; 0  I) (Vi11^{-1}  0; 0  Vi22·1^{-1}) (Yio − Xioβ; Yim − Ximβ − Vi21Vi11^{-1}(Yio − Xioβ))

holds, where Vi22·1 = Vi22 − Vi21Vi11^{-1}Vi12. Now under MAR,

E(Yim − Ximβ | Yio, Ri, Xi) = E(Yim − Ximβ | Yio, Xi) = Vi21Vi11^{-1}(Yio − Xioβ),

and hence we get

E[ ∂/∂β log f(Yi|Xi; β) | Yio, Ri, Xi ] = (Xio> Xim>) (I  −Vi11^{-1}Vi12; 0  I) (Vi11^{-1}  0; 0  Vi22·1^{-1}) (Yio − Xioβ; 0) = Xio>Vi11^{-1}(Yio − Xioβ).

Then the score equation becomes

∑_{i=1}^K Xio>Vio^{-1}(Yio − Xioβ) = 0,

where Vio = Vi11. This implies that the likelihood approach with a selection model is equivalent to "ignoring missing outcomes" in the multivariate linear regression model under the normal assumption.
(iv) In summary:

– both the multivariate linear regression (MLR) model and the linear mixed model can accommodate different numbers of outcome measurements per subject;

– MLR viewed purely as the estimating equation ∑ Xi>Vi^{-1}(Yi − Xiβ) = 0 (computed on the observed components) is valid only under MCAR and is biased under MAR and NI;

– however, if the data are from a multivariate normal distribution, MLR with a correctly specified variance is a likelihood analysis, and so it is valid under MAR;

– the mixed effect model is valid if the missing probability does not depend on any unobserved outcome values or random effects, i.e., R ⊥ (Y^m, b) | (Y^o, X, Z).
Example 4.14. Suppose that the outcome is 2-dimensional,

Yi = (Yi1, Yi2)>,

and Xi is a covariate. Assume that Yi1 is completely observed but Yi2 is not. Let Ri = 1 if Yi2 is observed and Ri = 0 otherwise. For notational simplicity, let Ri = 1 for i = 1, 2, ..., K1 and Ri = 0 for i = K1 + 1, ..., K. Then the likelihood is

L = ∏_{i=1}^{K1} f(Yi|Xi; β) P(Ri|Yi, Xi; θ1) · ∏_{i=K1+1}^{K} ∫ f(Yi|Xi; β) P(Ri|Yi, Xi; θ1) dYi2,

and hence the likelihood equation is

∑_{i=1}^{K1} Xi>Vi^{-1}(Yi − µi) + ∑_{i=K1+1}^{K} Xi>Vi^{-1}( E(Yi|Xi, Yi1, Ri) − µi ) = 0.

Plugging the observed Xi, Yi1 and Ri = 0 into E(Yi|Xi, Yi1, Ri), we get, under MAR,

E(Yi|Xi, Yi1, Ri = 0) = ( Yi1 ; µi2 + (σ21/σ11)(Yi1 − µi1) ),

for Vi = (σ11 σ12; σ12 σ22). Therefore, from

Vi^{-1} = (1/(1 − ρ2)) ( 1/σ11   −σ12/(σ11σ22) ; sym.   1/σ22 ),

we get

Vi^{-1}( E(Yi|Xi, Yi1, Ri = 0) − µi ) = ( (Yi1 − µi1)/σ11 ; 0 )    (using ρ2 = σ12^2/(σ11σ22)).

Thus the likelihood equation becomes

∑_{i=1}^{K1} Xi>Vi^{-1}(Yi − µi) + ∑_{i=K1+1}^{K} Xi1> (Yi1 − µi1)/σ11 = 0.

It is equivalent to "ignoring missing outcomes."

When Ri is not independent of Yi2 given (Yi1, Xi) (NI), E(Yi2|Xi, Yi1, Ri) ≠ E(Yi2|Xi, Yi1), and

E(Yi2|Xi, Yi1, Ri = 0) = [ ∫ Yi2 f(Yi2|Yi1, Xi) P(Ri = 0|Yi2, Yi1, Xi) dYi2 ] / [ ∫ f(Yi2|Yi1, Xi) P(Ri = 0|Yi2, Yi1, Xi) dYi2 ].

In practice, this is evaluated via numerical integration. Note that the conditional expectation of the missing Y2 is a function of the observation probability P(Ri = 0|···) when the data are nonignorably missing. In this case, a model for the observation probability, such as a logistic model, must be specified.
Example 4.15. Missing covariate example. Again suppose a 2-dimensional outcome, Yi = (Yi1, Yi2)>, and let xi be a covariate. Here, assume that Yi is completely observed but xi is not. Let Ri = 1 if xi is observed and Ri = 0 otherwise, and let Xi = (1, xi) be the covariates including an intercept. For notational simplicity, let Ri = 1 for i = 1, 2, ..., K1 and Ri = 0 for i = K1 + 1, ..., K. Then the likelihood becomes

L = ∏_{i=1}^{K1} f(Yi|Xi; β) g(Xi) P(Ri|Yi, Xi) · ∏_{i=K1+1}^{K} ∫ f(Yi|Xi; β) g(Xi) P(Ri|Yi, Xi) dxi,

where the first factor is the joint likelihood of (Yi, Xi, Ri). Recall the missing outcome problem: there it was enough to consider Yi^m|Yi^o (not Xi|Yi), and hence we could "plug in" observed values in the score as

Xi>Vi^{-1} ( Yi^o − Xi^oβ ; E(Yi^m − Xi^mβ | Yi^o, Ri) ).

However, in the missing covariate problem the score is no longer linear in the missing quantity: Xi>Vi^{-1}Xiβ is quadratic in Xi. Thus we cannot simply "plug in";

E( Xim>Vi^{-1}Ximβ | Yio, Xio, Ri ) ≠ E(Xim|Yio, Xio, Ri)> Vi^{-1} E(Xim|Yio, Xio, Ri) β.

Thus computation becomes difficult. In the rest of this example, we consider the case where xi is binary. Then the likelihood becomes

L = ∏_{i=1}^{K1} f(Yi|Xi) g(Xi) P(Ri|Yi, Xi) · ∏_{i=K1+1}^{K} ∑_{xi=0}^{1} f(Yi|Xi) g(Xi) P(Ri|Yi, Xi),

and hence the likelihood equation is

∂ log L/∂β = ∑_{i=1}^{K1} Xi>Vi^{-1}(Yi − µi) + ∑_{i=K1+1}^{K} ∂/∂β log ∑_{xi=0}^{1} f(Yi|Xi) g(Xi) P(Ri|Yi, Xi) = 0.

Now, by the proof of Proposition 4.12,

∂/∂β log ∑_{xi=0}^{1} f(Yi|Xi) g(Xi) P(Ri|Yi, Xi)
 = [ ∑_{xi=0}^{1} (∂/∂β log f(Yi|Xi)) f(Yi|Xi) g(Xi) P(Ri|Yi, Xi) ] / [ ∑_{xi=0}^{1} f(Yi|Xi) g(Xi) P(Ri|Yi, Xi) ]
 = E( Xi>Vi^{-1}(Yi − Xiβ) | Yi, Ri )

is obtained, and therefore the likelihood equation is

∑_{i=1}^{K1} Xi>Vi^{-1}(Yi − µi) + ∑_{i=K1+1}^{K} { (1, 1)>Vi^{-1}(Yi − (1, 1)β) pi + (1, 0)>Vi^{-1}(Yi − (1, 0)β)(1 − pi) } = 0,

where pi = E(xi|Yi, Ri) = P(xi = 1|Yi, Ri), and (1, 1) and (1, 0) denote Xi evaluated at xi = 1 and xi = 0. Solving this equation is not difficult when xi is binary: we can handle the problem as if the missing covariate part were "duplicated," plugging in both 1 and 0 for the missing covariate and taking a weighted average, as in the following table.
Y          X      W
Y1         x1     1
Y2         x2     1
...        ...    ...
YK1        xK1    1
YK1+1      1      pK1+1
YK1+1      0      1 − pK1+1
YK1+2      1      pK1+2
YK1+2      0      1 − pK1+2
...        ...    ...
YK         1      pK
YK         0      1 − pK
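A minimal Python sketch of this weighted-duplication computation for the bivariate-outcome, binary-covariate case is given below. All names are hypothetical, the weights pi are taken as given, and in practice one iterates between updating the weights and re-solving for β.

import numpy as np

def duplicated_wls(Y_obs, x_obs, Y_mis, p_mis, Vinv):
    """Solve the likelihood equation for beta = (beta0, beta1) with a binary covariate
       missing for some subjects; p_mis[i] = P(x_i = 1 | Y_i, R_i). Each Y_i has length 2 and
       the design for covariate value x is Xi(x) = [[1, x], [1, x]]."""
    def Xi(x):
        return np.array([[1.0, x], [1.0, x]])
    A = np.zeros((2, 2)); b = np.zeros(2)
    for Yi, xi in zip(Y_obs, x_obs):                 # complete cases, weight 1
        Xd = Xi(xi)
        A += Xd.T @ Vinv @ Xd; b += Xd.T @ Vinv @ Yi
    for Yi, pi in zip(Y_mis, p_mis):                 # incomplete cases, duplicated with weights p and 1-p
        for x, w in ((1.0, pi), (0.0, 1.0 - pi)):
            Xd = Xi(x)
            A += w * Xd.T @ Vinv @ Xd; b += w * Xd.T @ Vinv @ Yi
    return np.linalg.solve(A, b)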
Example 4.16. Mixed Effect Models with Selection Model. Here, the observation probability can depend on unobservable random effects. The likelihood is

L = ∏_{i=1}^N ∫∫ f(Yi|Xi, Zi, bi) g(bi) P(Ri|Yi^o, Yi^m, Xi, Zi, bi) dbi dYi^m,

where the inner integral over bi is the marginal likelihood in the LMM.

(i) Under MAR, Ri ⊥ (Yi^m, bi) | (Yi^o, Xi, Zi). Then the likelihood becomes

L = ∏_{i=1}^N ∫∫ f(Yi|Xi, Zi, bi) g(bi) P(Ri|Yi^o, Yi^m, Xi, Zi, bi) dbi dYi^m
  = ∏_{i=1}^N ∫ [ ∫ f(Yi|Xi, Zi, bi; β) g(bi) dbi ] dYi^m · P(Ri|Yi^o, Xi, Zi),

where the bracketed integral is the marginal likelihood. Hence we can "ignore" the P(Ri|Yi^o, Xi, Zi) part if our interest is inference for β.

(ii) Under NI, explicit modeling of the observation probability is required. Maximizing

L = ∏_{i=1}^N ∫∫ f(Yi|Xi, Zi, bi) g(bi) P(Ri|Yi^o, Yi^m, Xi, Zi, bi) dbi dYi^m

is to solve

∑_{i=1}^N E[ ∂/∂β log f(Yi|Xi, Zi, bi; β) | Yi^o, Xi, Zi, Ri ] = 0,

i.e.,

∑_{i=1}^N Xi>( E(Yi|Yi^o, Xi, Zi, Ri) − Xiβ ) = 0.

Note that

E(Yij|Yi^o, Xi, Zi, rij = 0) = [ ∫∫ y f(y|Y^o, Xi, Zi, bi) g(bi) P(R = 0|y, Y^o, Xi, Zi, bi) db dy ] / [ ∫∫ f(y|Y^o, Xi, Zi, bi) g(bi) P(R = 0|y, Y^o, Xi, Zi, bi) db dy ],

and in practice it can also be computed via numerical methods.
Chapter 5
Generalized Estimating Equation
5.1 Review on GLM
5.1.1 Basic Concepts
There are three components of GLM (model assumption): They are random component, systematic
component, and link function, respectively.
• Random Component
Y has a distribution in the exponential family taking the form

f_Y(y; θ, φ) = exp( (yθ − b(θ))/φ + c(y, φ) ).
θ is called a “canonical parameter.”
Example 5.1. (i) For N(µ, σ2), θ = µ/σ, φ = σ and b(θ) = µ2/(2σ).
(ii) For Ber(p), θ = log(p/(1 − p)) and b(θ) = −log(1 − p).
(iii) For Poi(µ), θ = log µ and b(θ) = µ.
Proposition 5.2. E(Y) = µ = b′(θ) and Var(Y) = b′′(θ)φ.

Proof. Note that

∫ exp( (yθ − b(θ))/φ + c(y, φ) ) dy = 1.

Differentiating with respect to θ, we get

∂/∂θ ∫ exp( (yθ − b(θ))/φ + c(y, φ) ) dy = ∫ ((y − b′(θ))/φ) exp( (yθ − b(θ))/φ + c(y, φ) ) dy = 0,

and hence

E(Y) − b′(θ) = 0.

Furthermore, from

∂2/∂θ2 ∫ exp( (yθ − b(θ))/φ + c(y, φ) ) dy = ∫ ( −b′′(θ)/φ + ((y − b′(θ))/φ)^2 ) exp( (yθ − b(θ))/φ + c(y, φ) ) dy = 0,

we get

Var(Y) = b′′(θ)φ.
• Systematic component
µ := E(Y |X) is related to X through a linear combination of X, η = Xβ.
• Link function
Linear combination η is a function of µ = E(Y |X) via a “link function” g, i.e., η = g(µ). It is
assumed that g is twice differentiable. g is called “canonical link” if it satisfies θ = η.
Example 5.3.
(i) In the normal model, η = θ = µ is the canonical link.
(ii) In the binary model, η = θ = log(µ/(1 − µ)) is the canonical link, i.e., the canonical link is the logit function.
(iii) In the Poisson model, η = θ = log µ is the canonical link, i.e., the canonical link is the log function.
Note that link function determines the interpretation of β. Also note that, canonical link makes
observed information and expected information the same.
5.1.2 Score and Information
• Consider

ℓ(β, φ) = (yθ − b(θ))/φ + c(y, φ).

Then the score is obtained as

∂ℓ/∂β (β, φ) = ∑_{i=1}^n (∂ηi/∂β)(∂µi/∂ηi)(∂θi/∂µi)(∂ℓ/∂θi)
 = ∑_{i=1}^n Xi> · (1/g′(µi)) · (b′′(θi))^{-1} · (Yi − b′(θi))/φ
 = ∑_{i=1}^n Xi> g′(µi)^{-1} [Var(Yi|Xi)]^{-1}(Yi − µi)
 = ∑_{i=1}^n (∂µi>/∂β) Var(Yi|Xi)^{-1}(Yi − µi).

If one uses the canonical link function, then

∂ℓ/∂β (β, φ) = ∑_{i=1}^n Xi>(Yi − µi)/φ,

since (∂µi/∂ηi)(∂θi/∂µi) = ∂θi/∂ηi = 1.

• Now consider

I(β) = −∂2ℓ(β)/∂β∂β>.

I(β) is called the (observed) information. One can also consider the expected information, i(β) = E I(β). For the canonical link,

I(β) = ∑_{i=1}^n (Xi>/φ)(∂µi/∂β>)

is deterministic, and hence i(β) = I(β). This implies that, when we use the canonical link, the likelihood function is strictly concave; moreover g′(µi)^{-1} = b′′(θi), and hence

i(β) = ∑_{i=1}^n Xi> b′′(θi) Xi / φ.
5.1.3 Computation
• Newton-Raphson Method
In general, the equation U(θ) = 0 does not have a closed form solution. Then one should solve
the equation iteratively. Starting from initial value θ(0), one can iteratively update the estimate
by
θ(p+1) = θ(p) + I(θ(p))−1U(θ(p)).
When the update becomes very small, i.e., |θ(p+1) − θ(p)| < c for a pre-specified small value c (for example, c = 10^{-8}), the iteration is stopped and θ(p+1) is declared the solution. This computational algorithm is called the Newton-Raphson algorithm.
• Iteratively Re-weighted Least Squares (IRLS). The score function of a GLM has the form of a weighted linear regression:

U(β) = ∑_{i=1}^n Xi> ( g′(µi) Var(Yi) )^{-1} (Yi − µi),

where (g′(µi)Var(Yi))^{-1} plays the role of a weight. If the "µi part" could be represented as a linear combination Xiβ, one could carry out WLS estimation, updating the weights at each iteration. However, µi ≠ Xiβ makes an ordinary weighted least squares approach impossible. Thus we "linearize" g(µi) and construct a pseudo-data variable whose mean is Xiβ.

Consider the pseudo-data variable

Zi = g(µi) + g′(µi)(Yi − µi) = ηi + (∂ηi/∂µi)(Yi − µi).

Then E(Zi) = ηi = Xiβ and

Var(Zi) = (∂ηi/∂µi)^2 Var(Yi) = g′(µi)^2 Var(Yi).

Hence the score equation becomes

U(β) = ∑_{i=1}^n Xi> ( g′(µi)^2 Var(Yi) )^{-1} g′(µi)(Yi − µi) = ∑_{i=1}^n Xi> Var(Zi)^{-1}(Zi − ηi),

and we can obtain the WLS estimate

β̂ = ( ∑_{i=1}^n Xi> Var(Zi)^{-1} Xi )^{-1} ∑_{i=1}^n Xi> Var(Zi)^{-1} Zi,

updating Var(Zi) (the weights) at each iteration. This approach is called Iteratively Re-weighted Least Squares (IRLS).
Remark 5.4. IRLS is equivalent to the Newton-Raphson method with Fisher scoring. The IRLS update is

β(p+1) = ( ∑_{i=1}^n Xi> Var^(p)(Zi)^{-1} Xi )^{-1} ∑_{i=1}^n Xi> Var^(p)(Zi)^{-1} Zi
 = ( ∑_{i=1}^n Xi>( g′(µi)^2 Var^(p)(Yi) )^{-1} Xi )^{-1} ( ∑_{i=1}^n Xi>( g′(µi)^2 Var^(p)(Yi) )^{-1} ( ηi^(p) + g′(µi^(p))(Yi − µi^(p)) ) )
 = β(p) + ( ∑_{i=1}^n Xi>( g′(µi)^2 Var^(p)(Yi) )^{-1} Xi )^{-1} ( ∑_{i=1}^n Xi>( g′(µi) Var^(p)(Yi) )^{-1} (Yi − µi^(p)) )
 = β(p) + i(β(p))^{-1} U(β(p)),

which is the same update as in the Newton-Raphson algorithm with Fisher scoring.
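For concreteness, the following Python sketch implements IRLS for the canonical-link binary GLM exactly as described above, recomputing the pseudo-data Zi and the weights Var(Zi)^{-1} at each iteration; the function name and iteration count are illustrative.

import numpy as np

def irls_logit(X, y, n_iter=25):
    """IRLS for the binomial GLM with logit link: logit(mu) = X beta."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))
        gprime = 1.0 / (mu * (1 - mu))            # g'(mu) for the logit link
        z = eta + gprime * (y - mu)               # pseudo-data Z_i
        w = 1.0 / (gprime ** 2 * mu * (1 - mu))   # Var(Z_i)^{-1} = (g'(mu)^2 Var(Y))^{-1}
        WX = w[:, None] * X
        beta = np.linalg.solve(X.T @ WX, X.T @ (w * z))
    return beta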
5.2 Generalized Estimating Equations
5.2.1 Quasi-Likelihood
The quasi-likelihood approach is an extension of likelihood inference suggested by Wedderburn (1974). It is motivated by the score equation of the GLM. Recall that the score equation of a GLM is

U(β) = ∑_{i=1}^n Xi> ( g′(µi) Var(Yi) )^{-1}(Yi − µi) = 0.

Note that the score equation is derived from the likelihood function, and hence from a full distributional assumption, but it only involves the mean µi and the variance Var(Yi) of the response. Thus, if we model only the mean and the variance, we can "mimic" the GLM procedure even when there is no full distributional assumption.

Now consider only U(β) = 0, an estimating equation, rather than the likelihood ℓ(β). In other words, we find an estimator β̂ as a solution of U(β) = 0, not as an optimizer of a likelihood. Such a method is called a quasi-likelihood approach. It can be used to model overdispersed binomial or count data. For example, the Poisson model is commonly used for count data, whose variance equals the mean; however, if it is known that the variance should be larger than the mean, the Poisson model may not be adequate. Here we can apply quasi-likelihood, which does not specify the full distribution but only the first and second moments. Note that we solve

(∂µ>/∂β) V^{-1}(Y − µ) = 0

if E(Y) = µ and Var(Y) = V. The following asymptotic behavior is known.
Proposition 5.5. Let β̂ be the solution of U(β) = 0. Then as n → ∞,

√n(β̂ − β0) →d N(0, Ω^{-1}),

where

Ω = lim_{n→∞} (1/n)(∂µ>/∂β) V^{-1} (∂µ/∂β>).

Proof. By a Taylor expansion,

U(β̂) = 0 = U(β0) + (∂U/∂β>)|_{β=β0}(β̂ − β0) + OP(1),

and note that

(∂U/∂β>)|_{β=β0} = −(∂µ>/∂β)V^{-1}(∂µ/∂β>)|_{β=β0} + (∂2µ/∂β∂β>)|_{β=β0} V^{-1}(Y − µ)
 = −(∂µ>/∂β)V^{-1}(∂µ/∂β>)|_{β=β0} + OP(n^{1/2}),

where the second term has mean zero. Therefore we get

√n(β̂ − β0) = ( (1/n)(∂µ>/∂β)V^{-1}(∂µ/∂β>)|_{β=β0} + OP(n^{-1/2}) )^{-1} ( (1/√n)U(β0) + OP(n^{-1/2}) )
 = Ω^{-1}(1/√n)U(β0) + oP(1),

and (1/√n)U(β0) →d N(0, Ω), which gives the result.
5.2.2 Extension of quasi-likelihood method to multivariate outcome: GEE
Now consider a multivariate outcome Yi = (Yi1, Yi2, · · · , Yini)> and covariate Xij , for j = 1, 2, · · · , ni
and i = 1, 2, · · · ,K. There are two main approaches to treat longitudinal data; one is GEE, the other
one is GLMM. GEE (Generalized Estimating Equation) is a marginal approach: It does not assume
any full distributional properties. GLMM (Generalized Linear Mixed Model) is a conditional approach:
It controls the correlation via random effect term. In this chapter, GEE will be introduced.
GEE extends the quasi-likelihood approach based on the following argument. The true parameter is obtained as the solution of the "population-level estimating equation" E U(β) = 0. If the estimating equation constructed from the data converges to this population-level equation as the amount of data increases, then the estimate converges to the true parameter under some technical assumptions. The GEE approach, proposed by Liang and Zeger (1986) and Zeger and Liang (1986), extends this argument to longitudinal data.
Figure 5.1: Estimating Equation
Definition 5.6. For each subunit, assume that each element Yij given Xij arises from an exponential family distribution and follows a GLM, i.e.,

E(Yij|Xij) = µij,   Var(Yij|Xij) = b′′(θij)φ,   ηij = Xijβ = g(µij).

Here we specify

E(Yi|Xi) = µi and Var(Yi|Xi) = Vi,

where

Vi = Ai(µi)^{1/2} Ωi(α) Ai(µi)^{1/2}

for a working correlation matrix Ωi.

Remark 5.7. Note that we do NOT specify the FULL distribution of Yi; only the marginal distribution of each Yij is specified, and the only specified properties of Yi are its mean and variance.

In GEE, we consider the estimating equation

U(β, α) = ∑_{i=1}^K (∂µi>/∂β) Vi^{-1}(Yi − µi) = 0.

An equivalent representation is

∑_{i=1}^K ( ∂µi1/∂β, ..., ∂µini/∂β )_{p×ni} (Vi^{-1})_{ni×ni} (Yi − µi)_{ni×1} = 0.
Example 5.8. For repeated binary data, assume that

log [ P(Yij = 1|Xij) / P(Yij = 0|Xij) ] = log [ pij / (1 − pij) ] = β0 + β1Xij

and

E(Yi) = pi = (pi1, pi2, ..., pini)>,   Var(Yij) = pij(1 − pij).

We further assume a "working correlation" for Yi. For example, if Ωi(α) = Ii ("working independence correlation"), the GEE solution is the same as the MLE obtained as if the Yij's were independent.

We can obtain the estimates β̂ and α̂ by the following iterative procedure, sketched in the code below. First, set α = 0 (independence working correlation) and obtain an initial estimate β̂(0) of β. Then estimate α given β̂(0) and call the result α̂(1). Then solve U(β, α̂(1)) = 0 and call the solution β̂(1). Repeat this procedure until convergence.
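A minimal Python sketch of this iterative procedure for repeated binary data with an exchangeable (intraclass) working correlation is given below; it alternates a moment update of α with one Fisher-scoring step on U(β, α) = 0. All names are hypothetical and no convergence checks are included.

import numpy as np

def gee_binary_exch(Y, X, n_iter=20):
    """GEE for repeated binary outcomes, logit link, exchangeable working correlation.
       Y: list of (n_i,) arrays; X: list of (n_i, p) design matrices."""
    p = X[0].shape[1]
    beta = np.zeros(p)
    alpha = 0.0
    for _ in range(n_iter):
        # moment estimate of alpha from standardized residuals
        num, den = 0.0, 0.0
        for Yi, Xi in zip(Y, X):
            mu = 1 / (1 + np.exp(-Xi @ beta))
            r = (Yi - mu) / np.sqrt(mu * (1 - mu))
            ni = len(Yi)
            num += (np.sum(r) ** 2 - np.sum(r ** 2)) / 2      # sum over pairs j > k
            den += ni * (ni - 1) / 2
        alpha = num / den
        # one Fisher-scoring step on U(beta, alpha) = 0
        A = np.zeros((p, p)); b = np.zeros(p)
        for Yi, Xi in zip(Y, X):
            mu = 1 / (1 + np.exp(-Xi @ beta))
            Ai = np.diag(mu * (1 - mu))
            Dmu = Ai @ Xi                                     # d mu / d beta'
            ni = len(Yi)
            Omega = (1 - alpha) * np.eye(ni) + alpha * np.ones((ni, ni))
            Vi = np.sqrt(Ai) @ Omega @ np.sqrt(Ai)
            Vinv = np.linalg.inv(Vi)
            A += Dmu.T @ Vinv @ Dmu
            b += Dmu.T @ Vinv @ (Yi - mu)
        beta = beta + np.linalg.solve(A, b)
    return beta, alpha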
Remark 5.9. (i) Recall the likelihood approach: in likelihood inference, U = ∂ log L/∂θ, where log L is, up to constants, −(1/2) log|V(α)| − (1/2)(Y − Xβ)>V(α)^{-1}(Y − Xβ), and so via the iterative approach we can say that

L(β(0), α(0)) ≤ L(β(1), α(0)) ≤ L(β(1), α(1)) ≤ ··· .

That is, "every single step guarantees that the likelihood increases." However, in GEE there is no objective function to optimize, and hence convergence of such an iterative approach is not ensured.

(ii) The GEE estimator β̂ is asymptotically normal with mean β0 and variance W0^{-1}W1W0^{-1}, where

W1(β0, α) = Var(U) and W0(β, α) = E(−∂U/∂β>).

In other words,

√K(β̂ − β0) ≈ N(0, K W0^{-1}W1W0^{-1}).

For the proof, note the following Taylor expansion:

U(β̂, α) = 0 = U(β0, α) + (∂U/∂β>)|_{β=β*}(β̂ − β0),   |β̂ − β0| ≥ |β* − β0|.

Thus we get

β̂ − β0 = ( −(∂U/∂β>)(β*) )^{-1} U(β0, α).

Note that since ∂U/∂β> is a sum of K independent random variables,

−K^{-1}(∂U/∂β>)(β*) ≈ −K^{-1}(∂U/∂β>)(β0) = K^{-1}W0 + oP(1)

as K → ∞. Furthermore, since U is a sum of K independent random variables,

K^{-1/2}U(β0, α) = K^{-1/2}( EU + OP(√Var(U)) ) = OP(1).

Thus we get

√K(β̂ − β0) = ( K^{-1}W0 )^{-1}( K^{-1/2}U(β0, α) ) + oP(1) ≈ N(0, K W0^{-1}W1W0^{-1}).
(iii) Note that

Var(U) = ∑_{i=1}^K (∂µi>/∂β) Vi^{-1} Var(Yi) Vi^{-1} (∂µi/∂β>) = ∑_{i=1}^K (∂µi>/∂β) Vi^{-1} (∂µi/∂β>)

if Var(Yi) = Vi. Thus, if Ωi (and consequently Vi) is correctly specified, W1 = W0 holds. Even if the correlation structure is misspecified, the variance of β̂ has a sandwich form, whose estimate is robust.

(iv) By the inverse function theorem (Foutz, 1977), if θ̂ is the solution of U(θ) = 0, then θ̂ = U^{-1}(0) converges to (EU)^{-1}(0). Thus θ0 = (EU)^{-1}(0), i.e., EU(θ0) = 0, is the ONLY key requirement for the consistency of θ̂ (i.e., only correct specification of the mean model is required). Therefore, even when Ωi(α) is misspecified, the estimate β̂ is consistent: variance misspecification does not affect the consistency of β̂.

(v) Recall that

W0 = E(−∂U/∂β>) = ∑_{i=1}^K (∂µi>/∂β) Vi^{-1} (∂µi/∂β>)

and

W1 = ∑_{i=1}^K (∂µi>/∂β) Vi^{-1} Var(Yi) Vi^{-1} (∂µi/∂β>).

Vi might be misspecified, but for a consistent variance estimate, Var(Yi) should be estimated by its empirical counterpart. Thus we estimate W1 by

Ŵ1 = ∑_{i=1}^K (∂µi>/∂β)|_{β=β̂} Vi(α̂)^{-1}(Yi − µ̂i)(Yi − µ̂i)>Vi(α̂)^{-1} (∂µi/∂β>)|_{β=β̂}.
Remark 5.10. Estimation of α. Note that α is related to the "correlation coefficient,"

α̂jk = ∑_{i=1}^K (Yij − µ̂ij)(Yik − µ̂ik) / (σ̂j σ̂k).

Since σ̂j depends on β̂ and φ̂, the estimate of α depends on both, i.e., α̂ = α̂(β, φ). Also note that

σ̂j^2 = (1/K) ∑_{i=1}^K (Yij − µ̂ij)^2

depends on φ, and so does φ̂; thus φ̂ = φ̂(β). Denoting α̂(β, φ̂(β)) = α*(β), we have

K^{-1/2} ∑_{i=1}^K Ui(β, α*(β)) = K^{-1/2} ∑_{i=1}^K Ui(β, α)  [=: A*]  +  ( K^{-1} ∑_{i=1}^K ∂Ui(β, α)/∂α> )  [=: B*]  ·  √K(α* − α)  [=: C*].

Now under some technical assumptions,

√K(α̂(β, φ) − α) = OP(1),   √K(φ̂ − φ) = OP(1),   |∂α̂(β, φ)/∂φ| ≤ H(Y, β) = OP(1),

we have

C* = √K( α̂(β, φ̂(β)) − α̂(β, φ) + α̂(β, φ) − α ) = √K( (∂α̂(β, φ*)/∂φ)(φ̂ − φ) + α̂(β, φ) − α ) = OP(1),

since φ̂ − φ = OP(K^{-1/2}) and α̂(β, φ) − α = OP(K^{-1/2}). Also, B* = oP(1), because its kth component,

B*_k = K^{-1} ∑_{i=1}^K ∂Ui(β, α)/∂αk = −K^{-1} ∑_{i=1}^K (∂µi>/∂β) Vi^{-1}(∂Vi/∂αk)Vi^{-1}(Yi − µi),

is an average of independent mean-zero random variables. Thus the score with α* plugged in,

K^{-1/2} ∑_{i=1}^K Ui(β, α*(β)),

is asymptotically equivalent to

A* = K^{-1/2} ∑_{i=1}^K Ui(β, α).

In other words, the quality of α̂ does not affect the asymptotic behavior of β̂; the only condition required of α̂ is √K-consistency. Thus we do not have to sweat over finding the "best estimate" of α; we may take a method-of-moments (MME) approach for α.
Remark 5.11. (i) Choice of Ωi(α). Popular choices of Ωi(α) include the intraclass (exchangeable) structure,

          ( 1  α  α  ...  α )
          ( α  1  α  ...  α )
Ωi(α) =   ( α  α  1  ...  α )
          ( ...        ...  )
          ( α  α  α  ...  1 ),

and the AR(1) correlation structure,

          ( 1          α          α^2        ...  α^{ni-1} )
          ( α          1          α          ...  α^{ni-2} )
Ωi(α) =   ( α^2        α          1          ...  α^{ni-3} )
          ( ...                              ...           )
          ( α^{ni-1}   α^{ni-2}   α^{ni-3}   ...  1        ).

In the intraclass case, a moment (MME) estimator can be used for α; for example, in binary regression,

α̂ = [ ∑_{i=1}^K ∑_{j>k} (Yij − p̂ij)/√(p̂ij(1 − p̂ij)) · (Yik − p̂ik)/√(p̂ik(1 − p̂ik)) ] / [ ∑_{i=1}^K ni(ni − 1)/2 ].

Note that α̂ is the solution of

W(α) := ∑_{i=1}^K (ni(ni − 1)/2) α − ∑_{i=1}^K ∑_{j>k} r̂ij r̂ik = 0,

where the r̂ij are standardized residuals, and hence it converges to α*, the solution of EW(α*) = 0. This is the same as the true α0 if Ωi(α) is correctly specified.

(ii) Summary: estimation of α. Since the consistency of β̂ is unaffected by misspecification of Ωi(α), we can first specify Ωi(α) as the identity matrix. Then we can (and need to) estimate α from the residuals. Also note that, when the number of sub-units per unit is the same for all units (ni = n for all i), we can assume Ωi(α) = Ω(α) and estimate all n(n − 1)/2 unknown parameters without specifying a particular structure, as follows (see the sketch below):

Ω̂(α) = (1/K) ∑_{i=1}^K Ai^{-1/2}(Yi − µ̂i)(Yi − µ̂i)>Ai^{-1/2}.
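A minimal Python sketch of this unstructured moment estimate, assuming ni = n for all units; the argument names (means and marginal variances per subject) are hypothetical.

import numpy as np

def unstructured_corr(Y, mu, var):
    """Omega_hat = K^{-1} sum_i A_i^{-1/2} (Y_i - mu_i)(Y_i - mu_i)' A_i^{-1/2},
       where var[i] holds the diagonal of A_i (the marginal variances)."""
    K = len(Y)
    n = len(Y[0])
    Omega = np.zeros((n, n))
    for Yi, mi, vi in zip(Y, mu, var):
        r = (Yi - mi) / np.sqrt(vi)
        Omega += np.outer(r, r)
    return Omega / K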
Remark 5.12. Cautions in using GEE.

(i) Since the asymptotic properties of GEE depend on large K and bounded ni (precisely, max_i ni = O(1) as K → ∞), data with large ni or small K are not well suited to GEE. For example, a study in which data are collected every day, or data with ni increasing as a function of K, may not be appropriate for GEE.

(ii) For some "working correlation" structures such as m-dependence, there is a range of α that yields a non-positive definite matrix. Users should make sure the estimated correlation matrix is positive definite.

(iii) The lack of a definition of α: one of the assumptions in the proof is that the estimate of α is √K-consistent, i.e., √K(α̂ − α) = OP(1). However, α is not a true correlation, but just a parameter of the "working correlation," which is commonly misspecified, and sometimes it is not clear what α̂ estimates. For example, it converges to the solution α* of EW(α*) = 0, but what α* means is not well specified, and it is not even guaranteed that such a solution exists. In summary, α is subject to "uncertainty of definition," which can lead to a breakdown of the asymptotic properties of β̂ (α̂ may not be √K-consistent!).

(iv) (Cont'd) Therefore, it is good practice to fit the independence working correlation first, i.e., Ωi(α) = Ii.

(v) Ωi(α) = Ii yields the OLS fit. β̂_OLS is still consistent even if the variance structure is misspecified; however, we lose efficiency by ignoring the correlation. In contrast, if we specify the correlation structure correctly, we gain efficiency, but there is a risk that the variance model is misspecified, in which case the asymptotic behavior of β̂ might be ruined. One should keep in mind that there is no free lunch when choosing which method to employ.
5.3 More topics on GEE
5.3.1 Hypothesis Testing
Remark 5.13. In likelihood inference, we can test the hypothesis H0 : β1 = β10 via the likelihood ratio test and its asymptotic equivalents. Let β = (β1>, β2>)> be the parameter, where β1 is the parameter of interest of length q and β2 is a nuisance parameter of length p − q. Then for the log-likelihood function ℓ,

2( ℓ(β̂1, β̂2) − ℓ(β10, β̃2) ) ≈_{H0} χ2(q),

where β̃2 is the MLE of β2 under H0. Also, we can consider Wald's test statistic

(β̂ − β0)> Var(β̂)^{-1} (β̂ − β0)

and Rao's test statistic (score test statistic)

U(β̃0)> i(β̃0)^{-1} U(β̃0) = U(β̃0)> ( −∂U/∂β |_{β=β̃0} )^{-1} U(β̃0) = U1(β̃0)> ( ( −∂U/∂β |_{β=β̃0} )_{11·2} )^{-1} U1(β̃0),

where β̃0 = (β10>, β̃2>)>. Note that ( ( −∂U/∂β |_{β=β̃0} )_{11·2} )^{-1} is the upper-left q × q block of ( −∂U/∂β |_{β=β̃0} )^{-1}.

Using this, we can develop "Wald-like" and "score-like" tests,

(β̂ − β0)> Var̂(β̂)^{-1} (β̂ − β0)   and   U(β̃0)> Var̂(U)^{-1} U(β̃0),

respectively. However, we cannot develop an "LRT-like" test, because there is no likelihood or objective function being maximized. In this case, the "likelihood based on the working independence model" plays an important role. Let the product of the densities of the Yij's be L(θ; Y), where θ> = (β>, φ). Although L(θ; Y) is not the true likelihood function, we can consider the "LRT-like statistic"

TL = 2( log L(β̂1, θ̂2) − log L(β10, θ̃2) ),

where θ2 = (β2>, φ)> and θ̃2 is the maximizer of L under the restriction β1 = β10. TL would be asymptotically χ2-distributed if all the Yij's were independent, but in fact they are correlated! The following theorem gives the distribution of TL under the null.
Theorem 5.14. Let P0 be the upper-left q × q block of W0^{-1}, and let P1 be the variance of U1(β10, β̃2), the first q elements of the GEE function evaluated at β2 = β̃2, where β̃2 is the solution of the GEE under the null. Then

TL ≈d ∑_{j=1}^q dj χ2_j,

where d1 ≥ d2 ≥ ··· ≥ dq are the eigenvalues of P0P1 and the χ2_j are independent χ2(1) random variables.
Proof. Let β̃0 = (β10>, β̃2>)> denote the restricted maximizer under the null and β̂ the unrestricted maximizer, and let β0 be the true value under H0. Note that

ℓ(β0) ≈ ℓ(β̂) + (∂ℓ(β)/∂β)|_{β=β̂}(β0 − β̂) + (1/2)(β0 − β̂)>(∂2ℓ(β)/∂β∂β>)|_{β=β̂}(β0 − β̂),

where the first-order term vanishes because β̂ maximizes ℓ, and hence

2( ℓ(β̂) − ℓ(β0) ) ≈ (β̂ − β0)>( −∂2ℓ(β)/∂β∂β> )|_{β=β̂}(β̂ − β0).

Also recall that

β̂ − β0 ≈ ( −∂U/∂β> |_{β=β0} )^{-1} U(β0),

where U = ∂ℓ/∂β. Then we get

2( ℓ(β̂) − ℓ(β0) ) ≈ U(β0)> [ E( −∂U/∂β> |_{β=β0} ) ]^{-1} U(β0) = U(β0)> W0^{-1} U(β0).

Similarly, we get

2( ℓ(β̃0) − ℓ(β0) ) ≈ U2(β0)> [ E( −∂U2/∂β2> |_{β=β0} ) ]^{-1} U2(β0) = U2(β0)> W022^{-1} U2(β0).

Therefore, writing W0 = (W011 W012; W021 W022), we get

TL = 2( ℓ(β̂) − ℓ(β̃0) ) ≈ U(β0)> W0^{-1} U(β0) − U2(β0)> W022^{-1} U2(β0)
 = ( U1(β0) − W012W022^{-1}U2(β0) )> W011·2^{-1} ( U1(β0) − W012W022^{-1}U2(β0) ),

where W011·2 = W011 − W012W022^{-1}W021, so that W011·2^{-1} is the upper-left q × q block of W0^{-1}, i.e., P0. Now note that

Var(U) = W0

only when the variance model is correctly specified! Thus in general

Var( U1(β0) − W012W022^{-1}U2(β0) ) ≠ W011·2.

Hence we need the distribution of x>Ax, where x ∼ N(0, V). Recall that the mgf of x>Ax is

∏_{i=1}^p (1 − 2tλi)^{-1/2},

where the λi are the eigenvalues of AV. Note that t ↦ (1 − 2tλi)^{-1/2} is the mgf of λi · χ2(1), and hence

x>Ax ≡d λ1χ2_1 + λ2χ2_2 + ··· + λpχ2_p,

where the χ2_i are independent χ2(1) random variables. In our problem, A = W011·2^{-1} = P0 and

V = Var( U1(β0) − W012W022^{-1}U2(β0) ) = P1.

The latter can be shown as follows. First note that

( U1(β10, β̃2) ; U2(β10, β̃2) ) = ( U1(β10, β̃2) ; 0 ) ≈ U(β0) − ( −∂U/∂β> )( 0 ; β̃2 − β20 ) ≈ U(β0) − W0( 0 ; β̃2 − β20 ),

and hence

U1(β10, β̃2) ≈ U1(β0) − W012(β̃2 − β20),   0 ≈ U2(β0) − W022(β̃2 − β20)

hold. Now using β̃2 − β20 ≈ W022^{-1}U2(β0), we obtain

U1(β10, β̃2) ≈ U1(β0) − W012W022^{-1}U2(β0).
5.3.2 Model Selection
Remark 5.15. Review of AIC. Let ℓ be the log-likelihood based on the specified model M, and let M* be the true model with true parameter β*. Let

∆(β, β*) = E_{M*}(−2ℓ(β))

be the Kullback-Leibler-type discrepancy between β and β*. Then the Akaike Information Criterion (AIC) is an estimate of E_{M*}(∆(β̂, β*)). (CAUTION: we cannot simply write ∆(β̂, β*) = E_{M*}(−2ℓ(β̂)); in ∆(β̂, β*) the expectation is taken only over the data entering the likelihood, with β̂ held fixed.) Now note that

∆(β̂, β*) = E_{M*}(−2ℓ(β))|_{β=β̂}
 ≈ E_{M*}(−2ℓ(β*)) − 2(β̂ − β*)> E_{M*}( ∂ℓ/∂β |_{β=β*} ) + (β̂ − β*)> E_{M*}( −∂2ℓ/∂β∂β> |_{β=β*} )(β̂ − β*)
 = E_{M*}(−2ℓ(β*)) + (β̂ − β*)> I(β*)(β̂ − β*),

since E_{M*}( ∂ℓ/∂β |_{β=β*} ) = 0 and I(β*) := E_{M*}( −∂2ℓ/∂β∂β> |_{β=β*} ). Hence

∆(β̂, β*) ≈ E_{M*}(−2ℓ(β*)) + (β̂ − β*)> I(β*)(β̂ − β*)

holds. Now note that

−2ℓ(β*) ≈ −2ℓ(β̂) − 2(∂ℓ/∂β |_{β=β̂})(β* − β̂) − (β* − β̂)>(∂2ℓ/∂β∂β> |_{β=β̂})(β* − β̂)
 = −2ℓ(β̂) + (β̂ − β*)> I(β*)(β̂ − β*),

since ∂ℓ/∂β |_{β=β̂} = 0 and −∂2ℓ/∂β∂β> |_{β=β̂} ≈ I(β̂) ≈ I(β*). Plugging this in, we get

E_{M*}(∆(β̂, β*)) ≈ E_{M*}(−2ℓ(β̂)) + 2E_{M*}( (β̂ − β*)> I(β*)(β̂ − β*) ).

Now from

β̂ − β* ≈ I(β*)^{-1} ℓ̇(β*),

we get

(β̂ − β*)> I(β*)(β̂ − β*) ≈ ℓ̇(β*)> I(β*)^{-1} ℓ̇(β*),

and therefore

E_{M*}(∆(β̂, β*)) ≈ E_{M*}(−2ℓ(β̂)) + 2E_{M*}( ℓ̇(β*)> I(β*)^{-1} ℓ̇(β*) )
 = E_{M*}(−2ℓ(β̂)) + 2tr( I(β*)^{-1} E_{M*}( ℓ̇(β*)ℓ̇(β*)> ) )
 = E_{M*}(−2ℓ(β̂)) + 2tr( J(β*) I(β*)^{-1} ),

where J(β*) := E_{M*}( ℓ̇(β*)ℓ̇(β*)> ). AIC is defined as an estimate of E_{M*}(∆(β̂, β*)):

AIC = −2ℓ(β̂) + 2tr( Ĵ(β*) Î(β*)^{-1} ).

(When the model is correctly specified, J(β*) = I(β*) and the trace equals k, the number of parameters, recovering the familiar AIC = −2ℓ(β̂) + 2k.)
Remark 5.16. Extension to GEE. The lack of a likelihood leads to a lack of model-selection devices such as AIC. The Quasi-likelihood under the Independence model Criterion (QIC), an AIC-type criterion for GEE, was suggested by Pan (2001). Let Q(β) be the "quasi-likelihood" under the independence working correlation,

Q(β) = ∑_{i=1}^K ∑_{j=1}^{ni} log g(Yij; θij).

Recall that the score equation is U(β) = 0, where

U(β) = ∑_{i=1}^K (∂µi>/∂β) Vi^{-1}(Yi − µi),

and Vi = Ai^{1/2}RiAi^{1/2} is the working variance. Let U^R(β) denote the score function with working correlation R. Then ∂Q/∂β = U^I. Also, suppressing the dependence of ∆(β, β*) on the true model M*, we get

E_{M*}( −∂Q/∂β |_{β=β*} ) = 0 and Ω_I := E_{M*}( −∂2Q/∂β∂β> |_{β=β*} ) = ∑_{i=1}^K (∂µi>/∂β) Vi^{-1}(∂µi/∂β>).

Thus we get

∆(β̂, β*) = E_{M*}(−2Q(β))|_{β=β̂}
 ≈ E_{M*}(−2Q(β*)) + (β̂ − β*)> E_{M*}( −∂2Q/∂β∂β>(β*) )(β̂ − β*)
 = E_{M*}(−2Q(β*)) + (β̂ − β*)> Ω_I (β̂ − β*).

Hence

∆(β̂, β*) ≈ E_{M*}(−2Q(β*)) + (β̂ − β*)> Ω_I (β̂ − β*)

holds. Here β̂ = β̂(R) is the estimate based on the working correlation R. Now note that

−2Q(β*) ≈ −2Q(β̂) − 2(β* − β̂)>(∂Q/∂β |_{β=β̂}) − (β* − β̂)>(∂2Q/∂β∂β> |_{β=β̂})(β* − β̂),

and that

−∂2Q/∂β∂β> |_{β=β̂} ≈ E_{M*}( −∂2Q/∂β∂β> |_{β=β̂} ) ≈ Ω_I.

However, be cautious that

∂Q/∂β |_{β=β̂} = ∂Q/∂β |_{β=β̂(R)} ≠ 0!

Rather,

∂Q/∂β |_{β=β̂(I)} = U^I(β̂(I)) = 0.

Thus E_{M*}∆(β̂, β*) is approximated as

E_{M*}∆(β̂, β*) ≈ E_{M*}(−2Q(β̂)) + E_{M*}( 2(β̂ − β*)>U^I(β̂) )  [=: (♠)]  + 2E_{M*}( (β̂ − β*)> Ω_I (β̂ − β*) ).

Ignoring the (♠) part, we get

QIC := −2Q(β̂) + 2tr( Ω̂_I Var̂(β̂) ).

Now note that

Ω_I = ∑_{i=1}^K (∂µi>/∂β) Vi^{-1}(∂µi/∂β>) = W0,

and hence Ω̂_I = Ŵ0. On the other hand, we should estimate the variance Var(β̂) robustly with respect to model misspecification, and hence we use

Var̂(β̂) = Ŵ0^{-1}Ŵ1Ŵ0^{-1}.

Therefore, we obtain

QIC = −2Q(β̂) + 2tr( Ŵ1Ŵ0^{-1} ).
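The following Python sketch computes QIC for a fitted binary GEE using the formula above. It uses the binomial quasi-likelihood and, as a simplifying assumption, builds both W0 and W1 from the same working covariance inverses used in the fit; all names are hypothetical.

import numpy as np

def qic_binary(Y, X, beta, Vinv_list):
    """QIC = -2 Q(beta_hat) + 2 tr(W1_hat W0_hat^{-1}) for a logit-link binary GEE fit."""
    Q = 0.0
    p = len(beta)
    W0 = np.zeros((p, p)); W1 = np.zeros((p, p))
    for Yi, Xi, Vinv in zip(Y, X, Vinv_list):
        mu = 1 / (1 + np.exp(-Xi @ beta))
        Q += np.sum(Yi * np.log(mu) + (1 - Yi) * np.log(1 - mu))   # independence quasi-likelihood
        Dmu = (mu * (1 - mu))[:, None] * Xi                        # d mu / d beta'
        W0 += Dmu.T @ Vinv @ Dmu
        resid = Yi - mu
        W1 += Dmu.T @ Vinv @ np.outer(resid, resid) @ Vinv @ Dmu
    return -2 * Q + 2 * np.trace(W1 @ np.linalg.inv(W0))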
Chapter 6
Generalized Linear Mixed Models
6.1 Introduction
Recall the extension from the linear regression model to the linear mixed effects model. There is essentially only one way to extend the linear model to a mixed effects model: adding "additive" random effect terms that are Gaussian distributed. However, the GLM has two ways of being extended for longitudinal data:

(i) ηi = Xiβ + Zibi, where bi ∼ g(bi; D);
(ii) ηi = Xiβ + ν(ui), where ui ∼ g(ui; D) is a random effect term.

The former is called the Generalized Linear Mixed Model (GLMM); the latter can be viewed as a generalization of the GLMM. Due to its hierarchical structure, the second model is called a Hierarchical Generalized Linear Model (HGLM).

How do we handle such a random effect term? One can integrate it out and consider a marginal likelihood. Since the integration usually has no closed form, the integral must be approximated. Depending on the level of approximation, various methods are available, such as the Laplace approximation, penalized quasi-likelihood, or marginal quasi-likelihood. On the other hand, a conditional approach can be used: if we "condition" on each individual, we only work within that individual and the individual effect becomes ignorable. For this, conditional likelihood or pseudo-likelihood is available.
6.2 Extension of GLM
As mentioned in the previous section, there are two ways of extension: one is the GLMM, and the other is the HGLM.
6.2.1 GLMM
The correlation is induced by sharing an unobservable common factor, called a random effect. The model is

Yij|bi ∼ f(Yij|bi; β),

where f comes from an exponential family, and bi ∼ g(bi; D) with known g(·; ·) indexed by an unknown D. Also, there is a systematic component ηij = Xijβ + Zijbi and a link function, ηij = h(E(Yij|bi)). Typically, bi is assumed to be multivariate normal, but this is not necessary; unlike in the classical LMM, normality brings no particular advantage in the GLMM.

Remark 6.1. The GLMM is called a "subject-specific model," whereas GEE is called a "population-average model." GEE models the population average, and hence yields the same predicted value for subjects with the same covariate value. In contrast, due to the random effect term, the GLMM yields different predicted values even under the same covariate value: the prediction differs as the subject differs.
Example 6.2. Consider a logistic model. Given the random effect bi, the response probability is assumed to be

P(Yij = 1|Xij, bi) = exp(β0 + β1Xij + bi) / (1 + exp(β0 + β1Xij + bi)).

Then β1 is a conditional "log odds ratio":

e^{β1} = [P(Yij = 1|Xij = 1, bi)/P(Yij = 0|Xij = 1, bi)] / [P(Yij = 1|Xij = 0, bi)/P(Yij = 0|Xij = 0, bi)],

but this does not hold marginally:

e^{β1} ≠ [P(Yij = 1|Xij = 1)/P(Yij = 0|Xij = 1)] / [P(Yij = 1|Xij = 0)/P(Yij = 0|Xij = 0)].

The right-hand side of the latter is the "marginal" or "population-average" odds ratio. Note that

P(Yij = 1|Xij) = ∫ P(Yij = 1|Xij, bi) dG(bi).
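A quick Monte Carlo sketch (my own illustration, with hypothetical parameter values) of the inequality above: with a N(0, 2²) random intercept, the marginal log odds ratio is noticeably attenuated relative to the conditional β1 = 1.

```python
import numpy as np

rng = np.random.default_rng(2)
b0, b1, sd_b = -1.0, 1.0, 2.0          # conditional intercept, slope, random-effect SD

# Monte Carlo approximation of P(Y=1|X) = integral of expit(b0 + b1 X + b) dG(b).
b = rng.normal(scale=sd_b, size=1_000_000)
expit = lambda t: 1 / (1 + np.exp(-t))
p1 = expit(b0 + b1 * 1 + b).mean()     # marginal P(Y=1 | X=1)
p0 = expit(b0 + b1 * 0 + b).mean()     # marginal P(Y=1 | X=0)

marginal_log_or = np.log(p1 / (1 - p1)) - np.log(p0 / (1 - p0))
print(b1, round(marginal_log_or, 3))   # conditional 1.0 vs attenuated marginal value
```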
Remark 6.3. (i) A population-average model can be used to compare two groups, for example the "smoking population" vs. the "non-smoking population." On the other hand, a subject-specific model can be used to compare outcomes when a specific individual smokes vs. does not smoke.

(ii) The GLMM implies that the conditional log odds ratio (per unit increase of Xij) is the same across individuals (log odds ratio_i ≡ β1). This is a consequence of the "additive modeling" of the random effects.

(iii) Note that the marginal mean is obtained as

E(Yij) = ∫ E(Yij|bi) dG(bi).

Thus, the GLMM models the conditional mean, which in turn determines the marginal mean structure of Yij, rather than directly modeling the marginal mean (as GEE does).

(iv) In general the marginal mean does not have a logit form even if we assume a logit form for the conditional mean. Alternatively, one can consider a distribution of the random effect that makes the marginal and conditional means take the same form. Precisely, one can find G such that

∫ H(βX + b) dG(b) = H(φβX + α),

where H is a given inverse-link function, for example the logistic function. Such a G is called a "bridge distribution."

(v) Usually conditional independence is assumed:

Cov(Yij, Yik) = E Cov(Yij, Yik|bi) + Cov(E(Yij|bi), E(Yik|bi)) = 0 + Cov(E(Yij|bi), E(Yik|bi)).

(vi) In practice, both GEE and GLMM are fitted, the results from both models are compared, and one may check whether they are far from each other or not.
6.2.2 HGLM
Some models are formulated in a slightly different way. Instead of imposing a distributional assumption on a random effect that enters η linearly, one can impose the distributional assumption on a function of that random effect. This type of model is called a hierarchical generalized linear model.

Specifically, let Yi = (Yi1, · · · , Yini)> be the response for the ith unit (i = 1, 2, · · · , K) and let νi be the corresponding unobserved random effect. We restrict our attention to the nested hierarchical structure in which each outcome Yij, j = 1, 2, · · · , ni, of Yi is repeatedly measured within unit i.

Definition 6.4. We assume that Yij, given the random effect νi, arises from an exponential family distribution and follows a generalized linear model (GLM) with density f(Yij|νi; ψij, φ), where

log f(Yij|νi; ψij, φ) = [Yijψij − b(ψij)]/φ + d(Yij, φ).
In here, ψij denotes the canonical parameter, and φ is the dispersion parameter. We denote

µij := E(Yij|νi) = b′(ψij) and ηij = q(µij),

with q(·) the link function and ηij = Xijβ + νi, where νi = ν(ui) for some strictly monotonic differentiable function ν of ui. Xij = (1, X1ij, · · · , Xpij) is a 1 × (p + 1) covariate vector corresponding to the fixed effects β = (β0, · · · , βp)>.

HGLM imposes a distributional assumption on "ui", with density

f(ui; α) = exp( [a1(α)ui − a2(α)]/ϕ + c(ui, ϕ) ).
Let θ = (β>, φ, α>)>, ν = (ν1, ν2, · · · , νK)>, u = (u1, u2, · · · , uK)>, and Y = (Y>1, Y>2, · · · , Y>K)>. The joint likelihood (h-likelihood) for θ and ν is defined as

H(θ, ν(u); Y) = (1/K) Σ_{i=1}^K ni hi(θ, ν(ui); Yi),

where

hi(θ, ν(ui); Yi) = ℓ1i(θ, ν(ui); Yi) + ℓ2i(θ, ν(ui))

with

ℓ1i(θ, ν(ui); Yi) = (1/ni) Σ_{j=1}^{ni} log f(Yij|νi; ψij, φ) = (1/ni) Σ_{j=1}^{ni} ( [Yijψij − b(ψij)]/φ + d(Yij, φ) )

and

ℓ2i(θ, ν(ui)) = (1/ni) log f(ν(ui); α) = (1/ni) [ (a1(α)ui − a2(α))/ϕ + c(ui, ϕ)  (ui part)  + log(dui/dνi)  (Jacobian) ].
Remark 6.5. (i) Note that there is a subtle difference between the h-likelihood and a general joint likelihood. The h-likelihood is the joint density of Y and the random effects ν = ν(u), and is therefore a member of the class of joint likelihoods defined on one particular scale ν(u) of u, out of a class of scales ν∗(u) (i.e., we only consider the particular transformation that yields a convenient form).

(ii) Obviously, maximizing over β and the νi jointly may lead to incorrect inference about β due to the curse of dimensionality (the number of νi grows with the number of units!). However, the "posterior mode" ν̂i plays an important role in inference and computation. Recall that

Π_{i=1}^K k(νi|Yi1, · · · , Yini)  [posterior]  ∝  Π_{i=1}^K [ Π_{j=1}^{ni} f(Yij|νi) × g(ui) dui/dνi ]  [h-likelihood],

regarding g(ui) dui/dνi as if it were a "prior" on νi; hence the maximizer ν̂i of the h-likelihood is a "posterior mode." Below, we examine some popular models and the corresponding joint likelihoods and ν̂i.
Example 6.6 (Normal-Normal Model). Consider a normal-normal model

E(Yij|ν(ui)) = µij + ν(ui), µij = ηij = Xijβ, ν(ui) = ui,
f(ui; D) = (1/√(2πD)) exp( −ui²/(2D) ).

Then since

f(Yij|ν(ui); β, D) = (1/√(2πσ²)) exp( −(Yij − µij − ui)²/(2σ²) ),

we get the h-likelihood

hi(θ, ν(ui); Yi) = (1/ni) [ Σ_{j=1}^{ni} ( −(1/2) log 2πσ² − (Yij − µij − ui)²/(2σ²) ) − (1/2) log 2πD − ui²/(2D) ].

Hence we get

h′i(θ, ν(ui); Yi) := (∂/∂ui) hi(θ, ν(ui); Yi) = (1/ni) [ Σ_{j=1}^{ni} (Yij − µij − ui)/σ² − ui/D ],

and the solution of h′i(θ, ν(ui); Yi) = 0 becomes

ûi = D/(niD + σ²) Σ_{j=1}^{ni} (Yij − µij).

Remark 6.7. Note that ûi = E(ui|Yi), i.e., it coincides with the BLUP of ui. This implies that

– the posterior mode ûi is the same as the posterior mean E(ui|Yi) in the normal case;

– in the normal linear mixed effects model, regarding the ui's as fixed parameters gives correct inference for β.
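A tiny simulation sketch of the shrinkage predictor above (variance components and µij = 0 treated as known, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
K, n_i, D, sigma2 = 50, 5, 2.0, 1.0

# Simulate Y_ij = u_i + e_ij (mu_ij = 0 known for simplicity).
u = rng.normal(scale=np.sqrt(D), size=K)
Y = u[:, None] + rng.normal(scale=np.sqrt(sigma2), size=(K, n_i))

# Posterior mode = posterior mean = BLUP: the residual sum, shrunk toward 0.
u_hat = D / (n_i * D + sigma2) * Y.sum(axis=1)
print(round(np.corrcoef(u, u_hat)[0, 1], 3))   # tracks u_i well, up to shrinkage
```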
Example 6.8 (Poisson-Gamma Model). Consider a Poisson-Gamma model

E(Yij|ν(ui)) = µij = exp(ηij), ηij = Xijβ + ν(ui), ν(ui) = log ui,
f(Yij|ν(ui); θ) = e^{−µij} µij^{Yij} / Yij!,

and

f(ui; λ, k) = ui^{k−1} exp(−ui/λ) / (Γ(k)λ^k).

For identifiability, we set E(ui) = 1, i.e., k = λ⁻¹. Note that

µij = e^{ηij} = e^{Xijβ + log ui} = ui e^{Xijβ},

i.e., ui is a "common multiplier" for the ith unit. The h-likelihood (in product form, with Jacobian dui/dνi = ui) is

Π_{i=1}^K [ Π_{j=1}^{ni} e^{−µij} µij^{Yij} / Yij! · ui^{k−1} e^{−ui/λ} / (Γ(k)λ^k) · ui ]
= Π_{i=1}^K [ Π_{j=1}^{ni} e^{−ui e^{Xijβ}} (ui e^{Xijβ})^{Yij} / Yij! · ui^k e^{−ui/λ} / (Γ(k)λ^k) ].

Thus the h-likelihood becomes

hi(θ, ν(ui); Yi) = (1/ni) [ Σ_{j=1}^{ni} ( −ui e^{Xijβ} + Yij log ui + YijXijβ − log Yij! ) + k log ui − ui/λ − log Γ(k)λ^k ],

and hence ûi is the solution of

h′i(θ, ν(ui); Yi) = (1/ni) [ Σ_{j=1}^{ni} ( −e^{Xijβ} + Yij/ui ) + k/ui − 1/λ ] = 0,

i.e.,

ûi = ( Σ_{j=1}^{ni} Yij + k ) / ( Σ_{j=1}^{ni} e^{Xijβ} + λ⁻¹ ) = ( Σ_{j=1}^{ni} Yij + k ) / ( Σ_{j=1}^{ni} e^{Xijβ} + k ).   (6.1)
Remark 6.9. (i) E(Yij|ui) = µij = ui e^{Xijβ}. The assumption E(ui) = 1 means that the marginal mean becomes E(Yij) = e^{Xijβ}.

(ii) The predictor ûi in (6.1) satisfies E(ûi) = 1 = E(ui), and hence ûi is an "unbiased predictor." Note that if we did not work on the scale ν(ui) = log ui, so that the Jacobian term were omitted from the h-likelihood, ûi would be

ûi = ( Σ_{j=1}^{ni} Yij + k − 1 ) / ( Σ_{j=1}^{ni} e^{Xijβ} + k ),

which is a biased predictor. Also note that the bias becomes negligible as ni becomes large: "the prior becomes negligible as the sample size goes to ∞."

(iii) A simple predictor of ui is the sample-mean-type ratio

Σ_{j=1}^{ni} Yij / Σ_{j=1}^{ni} e^{Xijβ},

which is also unbiased. The "prior" plays the role of "pulling ûi toward 1" by adding k to both the numerator and the denominator.
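The following sketch (simulated data, β treated as known, names hypothetical) compares the predictor (6.1) with the simple ratio predictor; both average to about 1, with (6.1) shrunk toward 1:

```python
import numpy as np

rng = np.random.default_rng(4)
K, m, k = 200, 6, 2.0                        # k = lambda^{-1}, so E(u_i) = 1

X = rng.normal(size=(K, m))
beta = 0.4
u = rng.gamma(shape=k, scale=1 / k, size=K)
Y = rng.poisson(u[:, None] * np.exp(beta * X))

# h-likelihood predictor (6.1): the prior adds k to numerator and denominator.
u_hat = (Y.sum(axis=1) + k) / (np.exp(beta * X).sum(axis=1) + k)
u_naive = Y.sum(axis=1) / np.exp(beta * X).sum(axis=1)
print(round(u_hat.mean(), 3), round(u_naive.mean(), 3))   # both approximately 1
```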
Example 6.10 (Binomial-Beta Model). For a binary outcome, the model with canonical link function is

E(Yij|ν(ui)) = µij = e^{ηij} / (1 + e^{ηij}).

There are two choices for modeling:

i) ηij = Xijβ + bi, µij = e^{Xijβ+bi} / (1 + e^{Xijβ+bi}), bi ∼ N(0, D);

ii) ηij = Xijβ + ν(ui), µij = e^{Xijβ+ν(ui)} / (1 + e^{Xijβ+ν(ui)}), ui ∼ g(ui).

We consider the second model in this example. That is,

ηij = log[ µij/(1 − µij) ] = Xijβ + ν(ui), ν(ui) = log[ ui/(1 − ui) ], dui/dνi = (1/ui + 1/(1 − ui))⁻¹ = ui(1 − ui),
f(Yij|ν(ui); θ) = µij^{Yij} (1 − µij)^{1−Yij},

and

f(ui; α1, α2) = ui^{α1−1} (1 − ui)^{α2−1} / B(α1, α2),

where B(α1, α2) = Γ(α1)Γ(α2)/Γ(α1 + α2) is the Beta function. Here we set α1 = α2 = α to give E(ui) = 1/2. Then

hi(θ, ν(ui); Yi) = (1/ni) [ Σ_{j=1}^{ni} ( Yij log µij + (1 − Yij) log(1 − µij) ) + log( ui^{α−1}(1 − ui)^{α−1} · ui(1 − ui) / B(α, α) ) ]
= (1/ni) [ Σ_{j=1}^{ni} ( Yij log[µij/(1 − µij)] + log(1 − µij) ) + log( ui^α (1 − ui)^α / B(α, α) ) ]
= (1/ni) [ Σ_{j=1}^{ni} ( Yij(Xijβ + ν(ui)) − log(1 + e^{Xijβ+ν(ui)}) ) + log( ui^α (1 − ui)^α / B(α, α) ) ],

and hence we get

h′i(θ, ν(ui); Yi) = (1/ni) [ Σ_{j=1}^{ni} ( Yij/(ui(1 − ui)) − e^{Xijβ+ν(ui)} / ( ui(1 − ui)(1 + e^{Xijβ+ν(ui)}) ) ) + α/ui − α/(1 − ui) ]
= (1/ni) [ Σ_{j=1}^{ni} (Yij − µij)/(ui(1 − ui)) + α(1 − 2ui)/(ui(1 − ui)) ].

Therefore we get (implicitly, since µij depends on ûi)

ûi = ( Σ_{j=1}^{ni} (Yij − µij) + α ) / (2α),  where  µij = e^{Xijβ+ν(ûi)} / (1 + e^{Xijβ+ν(ûi)}) = ûi e^{Xijβ} / ( (1 − ûi) + ûi e^{Xijβ} ).
Example 6.11 (Gamma-Inverse gamma Model). Let

E(Yij|ν(ui)) = kµij, ηij = log µij = Xijβ + ν(ui), ν(ui) = log ui,
f(Yij|ν(ui); β, k) = (1/Γ(k)) (Yij/µij)^k exp(−Yij/µij) (1/Yij),

and let ui arise from an inverse-gamma density

f(ui; α) = (1/Γ(α+1)) (α/ui)^{α+1} exp(−α/ui) (1/ui)

with E(ui) = 1. Then the h-likelihood is

hi(θ, ν(ui); Yi) = (1/ni) [ Σ_{j=1}^{ni} ( k log(Yij/µij) − Yij/µij − log Γ(k) − log Yij ) + (α+1) log(α/ui) − α/ui − log ui − log Γ(α+1) + log ui ]
= (1/ni) [ Σ_{j=1}^{ni} ( k log(Yij/µij) − Yij/µij − log Γ(k) − log Yij ) + (α+1) log(α/ui) − α/ui − log Γ(α+1) ]

(the −log ui from the density and the +log ui from the Jacobian cancel), and hence

h′i(θ, ν(ui); Yi) = (1/ni) [ Σ_{j=1}^{ni} ( −k/µij + Yij/µij² ) (∂µij/∂ui) − (α+1)/ui + α/ui² ]
= (1/ni) [ Σ_{j=1}^{ni} ( −k/(ui e^{Xijβ}) + Yij/(ui² e^{2Xijβ}) ) e^{Xijβ} − (α+1)/ui + α/ui² ]
= (1/ni) [ Σ_{j=1}^{ni} (Yij − k ui e^{Xijβ}) / (ui² e^{Xijβ}) + (α − (α+1)ui)/ui² ].

Thus the predictor of ui becomes

ûi = ( Σ_{j=1}^{ni} Yij e^{−Xijβ} + α ) / (ni k + α + 1).

Remark 6.12. Note that E(Yij) = E(kµij) = E(k ui e^{Xijβ}) = k e^{Xijβ}. This implies

E(ûi) = (ni k + α) / (ni k + α + 1) ≠ 1 = E(ui).

Thus, not every HGLM setting yields an unbiased predictor of ui.
Example 6.13 (Gamma-Normal Model: with log link). Here we consider the same model for Yij as in the previous example (example 6.11), but a different model for the random effect. The model is

E(Yij|ui) = kµij, ηij = log µij = Xijβ + ui, ui ∼ N(0, λ).

Note that this is the same setting as a GLMM. Then

hi(θ, ui; Yi) = (1/ni) [ Σ_{j=1}^{ni} ( (k−1) log Yij − Yij/µij − k log µij − log Γ(k) ) − (1/2) log 2πλ − ui²/(2λ) ].

With µij = e^{Xijβ+ui}, this yields

h′i(θ, ui; Yi) = (1/ni) [ Σ_{j=1}^{ni} ( Yij e^{Xijβ+ui} / e^{2(Xijβ+ui)} − k ) − ui/λ ] = (1/ni) [ Σ_{j=1}^{ni} ( Yij / e^{Xijβ+ui} − k ) − ui/λ ].

However, here h′i(θ, ui; Yi) = 0 does not have a closed form solution in ui.
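Since no closed form exists, ûi can be found numerically; below is a minimal Newton-Raphson sketch for a single cluster, assuming β, k, and λ are known (an assumption made only for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
k, lam, beta = 2.0, 0.5, 0.3
X = rng.normal(size=8)
u_true = rng.normal(scale=np.sqrt(lam))
Y = rng.gamma(shape=k, scale=np.exp(beta * X + u_true))   # E(Y_ij|u) = k mu_ij

# Solve sum_j( Y_ij e^{-(X_ij beta + u)} - k ) - u/lam = 0 by Newton's method.
u = 0.0
for _ in range(50):
    s = np.sum(Y * np.exp(-beta * X - u))
    f = s - len(Y) * k - u / lam           # h_i'(u), up to the positive 1/n_i factor
    fprime = -s - 1 / lam                  # always negative, so Newton is stable
    step = f / fprime
    u -= step
    if abs(step) < 1e-10:
        break
print(round(u, 4), round(u_true, 4))       # u_hat is shrunk toward 0 by the prior
```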
6.3 Marginal Likelihood Approach

Here we consider the "marginal likelihood," which averages the joint likelihood over bi,

LM(β, D; Yi) = ∫ Π_{j=1}^{ni} f(Yij|Xij, bi) dG(bi),

and maximizes it. Since it is defined via an integral, the marginal likelihood usually does not have a closed form, and hence we should approximate the integral via, for example, the Laplace approximation.

6.3.1 Laplace Approximation

Let ℓi(θ; bi, Yi) be the joint log-likelihood of Yi and bi,

ℓi(θ; bi, Yi) = Σ_{j=1}^{ni} log f(Yij|bi; β) + log g(bi; D),

where θ = (β>, D)>, and write ℓi = ni hi. We need to evaluate ∫ e^{ℓi(θ; bi, Yi)} dbi. When the integral is hard to evaluate, one can use the following Laplace approximation.

Proposition 6.14 (Laplace approximation). Let b̂i be the solution of ℓ′i(bi) = 0, i.e., of h′i(bi) = 0. Then

∫ e^{ℓi(θ; bi, Yi)} dbi = e^{ℓi(θ; b̂i, Yi)} ( √(2π) ( −ℓ″i(θ; b̂i, Yi) )^{−1/2} + oP(1) )

as ni → ∞.
Proof. Note that

∫ e^{ℓi(bi)} dbi = ∫ e^{ni hi(bi)} dbi
= ∫ exp( ni ( hi(b̂i) + (bi − b̂i)h′i(b̂i) + ((bi − b̂i)²/2) h″i(b̂i) + ((bi − b̂i)³/6) h⁽³⁾i(b̂i) + · · · ) ) dbi
= e^{ni hi(b̂i)} ∫ e^{ni(bi − b̂i)² h″i(b̂i)/2} exp( ∆i ) dbi,  where ∆i := ni(bi − b̂i)³ h⁽³⁾i(b̂i)/6 + ni(bi − b̂i)⁴ h⁽⁴⁾i(b̂i)/24 + · · · (and h′i(b̂i) = 0),
= e^{ni hi(b̂i)} ∫ e^{ni(bi − b̂i)² h″i(b̂i)/2} ( 1 + ∆i + (1/2)∆i² + (1/6)∆i³ + · · · ) dbi
= e^{ni hi(b̂i)} ∫ e^{ni(bi − b̂i)² h″i(b̂i)/2} ( 1 + ni(bi − b̂i)³ h⁽³⁾i(b̂i)/6 + ni(bi − b̂i)⁴ h⁽⁴⁾i(b̂i)/24 + · · · + (1/2) ni²(bi − b̂i)⁶ h⁽³⁾i(b̂i)²/36 + · · · ) dbi.

Now recall that

∫ e^{−ax²} x^{2m} dx = ( (2m)! / (m! 2^{2m}) ) √π a^{−(2m+1)/2}.

Using this, we obtain

∫ e^{(ni/2)(bi − b̂i)² h″i(b̂i)} dbi = √π ( −(ni/2) h″i(b̂i) )^{−1/2},
∫ e^{(ni/2)(bi − b̂i)² h″i(b̂i)} (bi − b̂i)^{2m−1} dbi = 0,

and

∫ e^{(ni/2)(bi − b̂i)² h″i(b̂i)} (ni/(2m)!) (bi − b̂i)^{2m} h⁽²ᵐ⁾i(b̂i) dbi
= ( (2m)!/(m! 2^{2m}) ) √π ( −(ni/2) h″i(b̂i) )^{−(2m+1)/2} (ni/(2m)!) h⁽²ᵐ⁾i(b̂i)
= ( √π/(m! 2^{2m}) ) ( −(ni/2) h″i(b̂i) )^{−1/2} ni^{1−m} (−2)^m h⁽²ᵐ⁾i(b̂i) / h″i(b̂i)^m
= ( −(ni/2) h″i(b̂i) )^{−1/2} · (−1)^m ( √π/(m! 2^m) ) ℓ⁽²ᵐ⁾i(b̂i) / ℓ″i(b̂i)^m,

which is of order ni^{1−m} for m ≥ 2. Similarly, the ∆i² term involving h⁽³⁾i(b̂i)² satisfies

∫ e^{(ni/2)(bi − b̂i)² h″i(b̂i)} (1/2)(ni²/36)(bi − b̂i)⁶ h⁽³⁾i(b̂i)² dbi = ( −(ni/2) h″i(b̂i) )^{−1/2} · O(ni⁻¹)

(the explicit constant involves h⁽³⁾i(b̂i)²/h″i(b̂i)³). Hence every correction term is O(ni⁻¹) relative to the leading term, and we get

∫ e^{ni hi(bi)} dbi = e^{ni hi(b̂i)} √π ( −(ni/2) h″i(b̂i) )^{−1/2} (1 + oP(1))
≈ e^{ni hi(b̂i)} √π ( −(ni/2) h″i(b̂i) )^{−1/2}.

Therefore, since ℓi = ni hi, we get the approximation

∫ e^{ℓi(θ; bi, Yi)} dbi = e^{ℓi(θ; b̂i, Yi)} ( √(2π) ( −ℓ″i(θ; b̂i, Yi) )^{−1/2} + oP(1) ),

i.e.,

log ∫ e^{ℓi(θ; bi, Yi)} dbi ≈ ℓi(θ; b̂i, Yi) − (1/2) log( −ℓ″i(b̂i) ) + (1/2) log 2π.
Remark 6.15. (i) That is, the marginal log-likelihood can be approximated by

log LM(β, D; Y) ≈ Σ_{i=1}^K ℓi(θ; b̂i, Yi)  ["plug-in" b̂i]  − (1/2) Σ_{i=1}^K log( −ℓ″i(θ; b̂i, Yi) )  ["correction term"].

(ii) A higher-order correction can be made by

log LM(β, D; Y) ≈ Σ_{i=1}^K ℓi(θ; b̂i, Yi) − (1/2) Σ_{i=1}^K log( −ℓ″i(θ; b̂i, Yi) ) + Σ_{i=1}^K log(1 − Cni(θ; b̂i, Yi)),

where

Cni(θ; b̂i, Yi) = J1i(θ; b̂i, Yi)/8 − J2i(θ; b̂i, Yi)/24,
J1i(θ; b̂i, Yi) = −ℓ⁽⁴⁾i(θ; b̂i, Yi) / ℓ″i(θ; b̂i, Yi)² = −(1/ni) h⁽⁴⁾i(θ; b̂i, Yi) / h″i(θ; b̂i, Yi)²,

and

J2i(θ; b̂i, Yi) = −ℓ⁽³⁾i(θ; b̂i, Yi)² / ℓ″i(θ; b̂i, Yi)³ = −(1/ni) h⁽³⁾i(θ; b̂i, Yi)² / h″i(θ; b̂i, Yi)³.
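As a sanity check (my own sketch, not from the lecture), one can compare the Laplace approximation of Proposition 6.14 with brute-force numerical integration for a single Poisson random-intercept cluster; with ni = 20 the two values agree closely.

```python
import numpy as np

rng = np.random.default_rng(6)
D, eta, n_i = 1.0, 0.5, 20
b_true = rng.normal(scale=np.sqrt(D))
y = rng.poisson(np.exp(eta + b_true), size=n_i)

def ell(b):
    # joint log-likelihood of (Y_i, b_i), additive constants dropped
    return np.sum(y * (eta + b) - np.exp(eta + b)) - b**2 / (2 * D)

# Newton for the mode: ell'(b) = sum(y) - n e^{eta+b} - b/D.
b = 0.0
for _ in range(50):
    g = y.sum() - n_i * np.exp(eta + b) - b / D
    H = -n_i * np.exp(eta + b) - 1 / D
    b -= g / H
ell2 = -n_i * np.exp(eta + b) - 1 / D            # ell''(b_hat)

laplace = ell(b) + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(-ell2)

# Brute-force log integral on a fine grid (trapezoid rule, stabilized).
grid = np.linspace(b - 10, b + 10, 20001)
vals = np.array([ell(t) for t in grid])
w = np.exp(vals - vals.max())
exact = np.log(np.sum((w[1:] + w[:-1]) * np.diff(grid) / 2)) + vals.max()

print(round(laplace, 6), round(exact, 6))
```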
Remark 6.16. (i) The goal of this computation is inference about β. In the normal-normal case, h″i(b̂i) does not depend on β, and hence the likelihood function becomes

L(β) = Π_{i=1}^K ∫ e^{ℓi} dbi ≈ Π_{i=1}^K e^{ni hi(b̂i)} · √π ( −(ni/2) h″i(b̂i) )^{−1/2}  [the last factor is constant in β].

Thus the correction term is a constant and can be ignored. In other words, we can just use ℓi(b̂i), i.e., just "plug in" b̂i. This amounts to joint maximization over β and the bi; jointly maximizing β and bi as if the bi's were fixed parameters also works in the normal-normal case.

(ii) The Laplace approximation is valid only when the ni's are large. It would not be acceptable to apply the approximation to data with ni = 1 or ni = 2; it often fails to converge in practice. The methods most commonly used in practice are penalized quasi-likelihood (PQL) and marginal quasi-likelihood (MQL). In the rest of this section, we describe the Laplace approximation, PQL, and MQL when the random effects arise from a normal distribution.
Example 6.17. As before, let Yi = (Yi1, Yi2, · · · , Yini)> be the ni × 1 outcome vector, and let Xi, Zi be the ni × p and ni × q matrices of explanatory variables associated with the fixed and random effects. Also, let bi be a q × 1 random effect, while α is a p × 1 fixed effect.

We assume that the Yij|bi arise from f(yij|bi), conditionally independent, with E(Yij|bi) = µij(bi) and Var(Yij|bi) = φV(µij). Also assume that

ηi = g(µi(bi)) = Xiα + Zibi,

or equivalently, elementwise,

ηij = g(µij(bi)) = Xijα + Zijbi.
The random effects bi are assumed to arise from N(0, D(θ)).

Then the marginal likelihood is

e^{ℓ(α,θ)} = Π_{i=1}^K ∫ [ Π_{j=1}^{ni} f(yij|bi) ] f(bi) dbi ∝ Π_{i=1}^K |D|^{−1/2} ∫ exp( Σ_{j=1}^{ni} ℓij(yij; µij(bi)) − (1/2) b>i D⁻¹ bi ) dbi,

where ℓij(yij; µij(bi)) = log f(yij|bi).

Now we apply the Laplace approximation. Let

k(bi) = Σ_{j=1}^{ni} ℓij(yij; µij(bi)) − (1/2) b>i D⁻¹ bi.

Then it gives

ℓ(α, θ) ≈ Σ_{i=1}^K ( −(1/2) log|D| + k(b̂i) − (1/2) log| −k″(b̂i) | ),

where b̂i = b̂i(α, θ) is the solution of k′(bi) = 0, with¹

k′(bi) = Σ_{j=1}^{ni} (∂ηij/∂bi)(∂µij/∂ηij)(∂θij/∂µij)(∂ℓij/∂θij) − D⁻¹bi
= Σ_{j=1}^{ni} Z>ij g′(µij)⁻¹ b″(θij)⁻¹ (yij − b′(θij))/φ − D⁻¹bi
= Σ_{j=1}^{ni} Z>ij (yij − µij(bi)) / (φV(µij)g′(µij)) − D⁻¹bi.

¹ To maintain consistency of notation (from the GEE chapter), Xij and Zij are regarded as row vectors.

Also note that

−k″(bi) = Σ_{j=1}^{ni} Z>ijZij / (φV(µij)g′(µij)²) − Ri + D⁻¹,  where Ri := Σ_{j=1}^{ni} Z>ij (yij − µij(bi)) (∂/∂bi)( 1/(φV(µij)g′(µij)) ),

and E(Ri) = E E(Ri|bi) = 0. Thus, ignoring Ri (regarding Ri ≈ 0), we can express

−k″(b̂i) = Z>i Wi Zi + D⁻¹,

where Zi = (Z>i1, · · · , Z>ini)> is the ni × q stack of the rows Zij and

Wi = diag( 1/(φV(µi1)g′(µi1)²), · · · , 1/(φV(µini)g′(µini)²) ).

In conclusion, we get

ℓ(α, θ) ≈ Σ_{i=1}^K ( −(1/2) log|D| + k(b̂i) − (1/2) log| Z>iWiZi + D⁻¹ | )
= Σ_{i=1}^K ( Σ_{j=1}^{ni} ℓij(yij; µij(b̂i)) − (1/2) b̂>i D⁻¹ b̂i − (1/2) log| I + Z>iWiZiD | ).
6.3.2 Penalized Quasi-Likelihood

In the Laplace approximation, Z>iWiZi + D⁻¹ is the "correction term" relative to just plugging b̂i in (i.e., e^{ℓ(bi)} ≈ e^{ℓ(b̂i)}). Note that our interest is inference about "α." Since ∂W/∂α ≠ 0 in general, we cannot ignore the correction term; but ∂W/∂α = 0 in the normal case (i.e., the LMM, with µij(bi) = Xijα + Zijbi and identity link g(µi) = µi). In that case, the approximation becomes

ℓ(α, θ) ≈ Σ_{i=1}^K ( k(b̂i) − (1/2) log|D| ).

It can be viewed as a "profile likelihood." In other words, maximizing ℓ(α, θ) is equivalent to maximizing

Σ_{i=1}^K ( Σ_{j=1}^{ni} ℓij(yij; µij(bi)) − (1/2) log|D| − (1/2) b>i D⁻¹ bi )

jointly over (α>, b1, b2, · · · , bK)>, and this gives the correct inference.²

Motivated by this, we can obtain the inference via another approach, called Penalized Quasi-Likelihood (PQL). PQL ignores the correction term −(1/2) Σ_{i=1}^K log| I + Z>iWiZiD | of the Laplace approximation, hoping that ∂W/∂α ≈ 0. That is, PQL maximizes

Π_{i=1}^K [ Π_{j=1}^{ni} f(yij|bi; α) ] · f(bi; D)

jointly over (α>, b1, · · · , bK).

The name "penalized" comes from the fact that this method maximizes the likelihood function

Π_{i=1}^K Π_{j=1}^{ni} f(yij|bi)

over (α>, b1, · · · , bK) with penalty Π_{i=1}^K f(bi; D), so that the "estimates" of the bi's, obtained as if they were fixed, are "pulled toward zero."

² Actually, the correction term is related to the variance term b″(θij). Thus, if one differentiates the correction term (to find the maximizer), one obtains the third derivative of b. This implies that if the distribution of Yi is highly skewed, the correction term strongly affects the maximization of the approximated likelihood ℓ(α, θ).
Example 6.18 (Continued). If we use normal random effects, the penalized quasi-likelihood function is

PQL = Σ_{i=1}^K ( Σ_{j=1}^{ni} ℓij(yij; µij(bi)) − (1/2) b>i D⁻¹ bi ).

Thus we get

∂PQL/∂α = Σ_{i=1}^K Σ_{j=1}^{ni} X>ij (yij − µij(bi)) / (φV(µij)g′(µij)),
∂PQL/∂bi = Σ_{j=1}^{ni} Z>ij (yij − µij(bi)) / (φV(µij)g′(µij)) − D⁻¹bi.
Remark 6.19. Note that the problem in example 6.18 can be treated as one huge GLM with parameter (α>, b>1, · · · , b>K)> ∈ R^{p+Kq}. Let

N = Σ_{i=1}^K ni and X = (X>1, X>2, · · · , X>K)>,

an N × p matrix, where Xi = (X>i1, X>i2, · · · , X>ini)>. Also, let

Z = blockdiag(Z1, Z2, · · · , ZK),

an N × Kq matrix, where Zi = (Z>i1, Z>i2, · · · , Z>ini)>, and let b = (b>1, b>2, · · · , b>K)> be a Kq × 1 vector. Then the information matrix is the (p + Kq) × (p + Kq) matrix

I = E[ −∂²ℓ/∂α∂α>, −∂²ℓ/∂α∂b> ; −∂²ℓ/∂b∂α>, −∂²ℓ/∂b∂b> ]  (rows separated by ";"),

where we use ℓ = PQL. Note that

E[ −∂²ℓ/∂α∂α> ] = Σ_{i=1}^K Σ_{j=1}^{ni} X>ijXij / (φV(µij)g′(µij)²) = Σ_{i=1}^K X>iWiXi = X>WX,

where

Wi = diag( 1/(φV(µi1)g′(µi1)²), · · · , 1/(φV(µini)g′(µini)²) ) and W = blockdiag(W1, W2, · · · , WK).

Similarly,

E[ −∂²ℓ/∂α∂b>i ] = Σ_{j=1}^{ni} X>ijZij / (φV(µij)g′(µij)²) = X>iWiZi,

and hence we get

E[ −∂²ℓ/∂α∂b> ] = ( X>1W1Z1, · · · , X>KWKZK ) = X>WZ.

Finally, from

E[ −∂²ℓ/∂bi∂b>i ] = Σ_{j=1}^{ni} Z>ijZij / (φV(µij)g′(µij)²) + D⁻¹ = Z>iWiZi + D⁻¹

and

E[ −∂²ℓ/∂bi∂b>i′ ] = 0 for i ≠ i′,

we get

E[ −∂²ℓ/∂b∂b> ] = Z>WZ + I ⊗ D⁻¹,

where ⊗ denotes the Kronecker product (and I here is the K × K identity). Therefore we get the summarized form of the information matrix,

I = [ X>WX, X>WZ ; Z>WX, Z>WZ + I ⊗ D⁻¹ ].
Remark 6.20. We can find α̂ and b̂ with the Newton-Raphson algorithm (with Fisher scoring):

( α^{(p+1)} ; b^{(p+1)} ) = ( α^{(p)} ; b^{(p)} ) + I⁻¹ ( ∂ℓ/∂α|_{α=α^{(p)}} ; ∂ℓ/∂b|_{b=b^{(p)}} ).

Note that the update rule at (α0, b0) can be written as

( α0 ; b0 ) + I⁻¹ ( ∂ℓ/∂α0 ; ∂ℓ/∂b0 ) = I⁻¹ ( I ( α0 ; b0 ) + ( ∂ℓ/∂α0 ; ∂ℓ/∂b0 ) )
= I⁻¹ ( [ X>WX, X>WZ ; Z>WX, Z>WZ + I ⊗ D⁻¹ ] ( α0 ; b0 ) + ( ∂ℓ/∂α0 ; ∂ℓ/∂b0 ) )
= I⁻¹ ( ( X>W(Xα0 + Zb0) ; Z>W(Xα0 + Zb0) + (I ⊗ D⁻¹)b0 ) + ( ∂ℓ/∂α0 ; ∂ℓ/∂b0 ) ).

Each score can be rewritten as

∂ℓ/∂α0 = Σ_{i=1}^K Σ_{j=1}^{ni} X>ij (yij − µij) / (φV(µij)g′(µij)) = X>W diag( g′(µ11), · · · , g′(µKnK) ) (Y − µ) = X>W (∂η/∂µ)(Y − µ),
∂ℓ/∂b0 = Σ_{i=1}^K Σ_{j=1}^{ni} Z>ij (yij − µij) / (φV(µij)g′(µij)) − ( D⁻¹b01 ; · · · ; D⁻¹b0K ) = Z>W (∂η/∂µ)(Y − µ) − (I ⊗ D⁻¹)b0,

and hence

( α0 ; b0 ) + I⁻¹ ( ∂ℓ/∂α0 ; ∂ℓ/∂b0 )
= I⁻¹ ( ( X>W(Xα0 + Zb0) ; Z>W(Xα0 + Zb0) + (I ⊗ D⁻¹)b0 ) + ( X>W(∂η/∂µ)(Y − µ) ; Z>W(∂η/∂µ)(Y − µ) − (I ⊗ D⁻¹)b0 ) )
= I⁻¹ ( X>W( Xα0 + Zb0 + (∂η/∂µ)(Y − µ) ) ; Z>W( Xα0 + Zb0 + (∂η/∂µ)(Y − µ) ) ).

Therefore, as in the GLM, we can see from the above that the Newton-Raphson algorithm is equivalent to an IRLS-like procedure, as follows. Let

y∗i = Xiα + Zibi + (∂ηi/∂µi)(yi − µi).

Note that

E(y∗i|bi) = Xiα + Zibi and Var(y∗i|bi) = ∆i Var(yi|bi) ∆i = Wi⁻¹,

where ∆i = diag( ∂ηi1/∂µi1, · · · , ∂ηini/∂µini ). Thus, the update

( α^{(p+1)} ; b^{(p+1)} ) = [ X>WX, X>WZ ; Z>WX, Z>WZ + I ⊗ D⁻¹ ]⁻¹ ( X>Wy∗ ; Z>Wy∗ )

is a weighted least squares estimator of (α>, b1, b2, · · · , bK) with a ridge penalty, applied to the pseudo-outcome y∗, where W is re-evaluated at the current (α, b) in every iteration.³

³ The fact that Fisher scoring in PQL is equivalent to IRLS with a ridge penalty is not surprising; in fact, it is coherent, because in PQL we place a ridge-like penalty term on the bi's in the likelihood function.
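Here is a compact sketch of this IRLS-with-ridge iteration for a Poisson random-intercept model, with D and φ = 1 treated as known (a simplification; a full PQL fit also updates the variance components). All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
K, m, D, p = 60, 5, 0.5, 2
Xmat = rng.normal(size=(K * m, p))
Zmat = np.kron(np.eye(K), np.ones((m, 1)))           # random-intercept design, N x K
alpha_true = np.array([0.8, -0.4])
b_true = rng.normal(scale=np.sqrt(D), size=K)
Y = rng.poisson(np.exp(Xmat @ alpha_true + Zmat @ b_true))

# PQL via IRLS with a ridge penalty on b. Poisson with log link:
# W = diag(mu) and pseudo-outcome y* = eta + (Y - mu)/mu.
alpha, b = np.zeros(p), np.zeros(K)
C = np.hstack([Xmat, Zmat])                          # combined design [X Z]
P = np.diag(np.r_[np.zeros(p), np.full(K, 1 / D)])   # ridge (I kron D^{-1}) on b only
for _ in range(50):
    eta = C @ np.r_[alpha, b]
    mu = np.exp(eta)
    ystar = eta + (Y - mu) / mu
    A = C.T @ (mu[:, None] * C) + P                  # C'WC + penalty
    new = np.linalg.solve(A, C.T @ (mu * ystar))     # C'W y*
    done = np.max(np.abs(new - np.r_[alpha, b])) < 1e-8
    alpha, b = new[:p], new[p:]
    if done:
        break
print(np.round(alpha, 3))                            # close to alpha_true
```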
Remark 6.21. Practical remark. The R function that performs PQL treats X>WZ as 0, so that it updates α and the bi's separately. Thus it may not give the exact solution.
Remark 6.22. Variance component estimation. The variance component θ (for b) can be obtained by maximizing

ℓ(α, θ) ≈ Σ_{i=1}^K ( Σ_{j=1}^{ni} ℓij(yij; µij(b̂i)) − (1/2) b̂>i D⁻¹ b̂i − (1/2) log| I + Z>iWiZiD | ),

which is an "approximated marginal likelihood over θ." Again, the correction term −(1/2) Σ_{i=1}^K log| I + Z>iWiZiD | is what the PQL method ignores. On the other hand, φ (the dispersion component for the conditional variance) can be estimated by maximizing the PQL or by a moment method using the Pearson chi-squared statistic

φ̂ = Σ_{i=1}^K Σ_{j=1}^{ni} (yij − µij(b̂i))² / V(µij(b̂i))

(in practice divided by the residual degrees of freedom). Note that such an estimator does not take the penalty term into account.
6.3.3 Marginal Quasi-Likelihood

Marginal Quasi-Likelihood (MQL) evaluates marginal moments and constructs GEE-type estimating equations, instead of evaluating the marginal likelihood directly. This method obtains approximate forms of the marginal mean and variance using a Taylor expansion around bi = 0 for all i, and uses estimating equations to draw inference. The approximation is close when D ≈ 0 (i.e., the bi's are almost all nearly zero); otherwise it yields biased results. However, there are some special cases in which this approximation yields consistent results, including the normal-normal and Poisson-gamma models.

From now on, let h = g⁻¹ be the inverse link function.
Example 6.23 (Poisson model). Assume that

E(Yij|bi) = biµij, µij = e^{Xijα}, and Var(Yij|bi) = biµij.

Also assume that E(bi) = 1 and Var(bi) = φ. Then the marginal moments become

E(Yij) = E E(Yij|bi) = µij,
Var(Yij) = Var E(Yij|bi) + E Var(Yij|bi) = φµij² + µij.

(Note that in the GLM, Var(Yij) = µij; there is an additional term φµij² due to the random effect.) Also, under the conditional independence assumption,

Cov(Yij, Yik) = Cov(E(Yij|bi), E(Yik|bi)) + E Cov(Yij, Yik|bi) = µijµikφ (j ≠ k).

Thus one can use GEE-type estimating equations to estimate α:

Σ_{i=1}^K (∂µ>i/∂α) Vi⁻¹ (Yi − E(Yi|Xi)) = 0,

where

E(Yi|Xi) = ( e^{Xi1α}, · · · , e^{Xiniα} )> and Vi = φµiµ>i + diag(µi1, · · · , µini).

(Note that α has a different interpretation from β in GEE.) The variance component φ can be estimated by, for example, solving

Σ_{i,j} µij² ( (Yij − µij)² − (µij²φ + µij) ) = 0,

which is equivalent to minimizing Σ_{i,j} ( (Yij − µij)² − (µij²φ + µij) )² over φ. (This is one possible approach to estimating φ; other methods are also available.)
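Since µij does not depend on φ, the displayed moment equation has the closed-form solution φ̂ = Σ µij²((Yij − µij)² − µij) / Σ µij⁴. A small sketch (simulated data, α treated as known for simplicity):

```python
import numpy as np

rng = np.random.default_rng(8)
K, m, phi, alpha = 500, 4, 0.3, 0.5
X = rng.normal(size=(K, m))
b = rng.gamma(shape=1 / phi, scale=phi, size=K)      # E(b) = 1, Var(b) = phi
mu = np.exp(alpha * X)
Y = rng.poisson(b[:, None] * mu)

# Closed-form solution of: sum mu^2 ((Y - mu)^2 - mu^2 phi - mu) = 0.
phi_hat = np.sum(mu**2 * ((Y - mu)**2 - mu)) / np.sum(mu**4)
print(round(phi_hat, 3))                             # approximately phi = 0.3
```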
Now we consider the general form of MQL. MQL uses the following approximations for the conditional moments:

E(Yi|bi) = h(Xiα + Zibi) ≈ h(Xiα) + h′(Xiα)Zibi =: µ∗i,
Var(Yi|bi) = diag( φV(µij(bi)) ) ≈ diag( φV(µ∗ij) ).

Based on these approximations, the marginal moments become

E(Yi) =: µi ≈ h(Xiα)

and

Var(Yi) = Var E(Yi|bi) + E Var(Yi|bi) ≈ V0i + ∆i⁻¹ZiDZ>i∆i⁻¹,

where

V0i = diag(φV(µi)) = diag(φV(Eµ∗i)) ≈ E Var(Yi|bi) and ∆i = diag(g′(µi))

(so that ∆i⁻¹ = diag(g′(µi)⁻¹) ≈ diag(h′(Xiα))). So we can set up the estimating equation

U(α, θ) = (∂µ>/∂α) Var(Y)⁻¹ (Y − µ) = 0,

which (since ∂µ>i/∂α = X>i∆i⁻¹) is to solve

X>∆⁻¹( V0 + ∆⁻¹ZDZ>∆⁻¹ )⁻¹ (Y − µ) = 0,

or equivalently

X>( ∆V0∆ + ZDZ> )⁻¹ ∆(Y − µ) = 0.

This can also be solved using the pseudo-dependent variable

y∗ = η + ∆(Y − µ)

with weight matrix W = V⁻¹ = ( ∆V0∆ + ZDZ> )⁻¹.
6.4 Conditional Likelihood Approach
When the interest is only in the regression coefficient of a time-varying covariate X, say β1, one can "eliminate the individual-specific term" by conditioning within each individual.
Example 6.24 (Motivation). Consider the case of binary Yij with logit link and ηij = Xijβ1 + Ziβ2 + bi. Note that

Σ_{j=1}^{ni} Yij =: Yi+  ("how many events occurred in total?")

does not carry information about the time-varying covariate coefficient β1, and hence conditioning on Σ_j Yij isolates the information about β1. (As motivation only: recall sufficiency of statistics!) The conditional likelihood is

LC = Π_{i=1}^K [ Π_{j=1}^{ni} P(Yij = 1|Xij, bi)^{Yij} P(Yij = 0|Xij, bi)^{1−Yij} ] / [ Σ_{pairs} Π_{j=1}^{ni} P(Yij = 1|Xij, bi)^{Yij} P(Yij = 0|Xij, bi)^{1−Yij} ]
= Π_{i=1}^K [ Π_{j=1}^{ni} e^{Yijβ1Xij} ]  ("what actually happened?")  / [ Σ_{pairs} Π_{j=1}^{ni} e^{Yijβ1Xij} ]  ("summation over all possible cases"),

where Σ_{pairs} means summation over all arrangements (Yi1, · · · , Yini) that satisfy Σ_j Yij = Yi+. The last equality can be easily verified. For example, assume that Yi1 = 1, Yi2 = Yi3 = 0 are observed. Then

LCi = P(Yi1 = 1|bi)P(Yi2 = 0|bi)P(Yi3 = 0|bi) / [ P(Yi1 = 1|bi)P(Yi2 = 0|bi)P(Yi3 = 0|bi) + P(Yi1 = 0|bi)P(Yi2 = 1|bi)P(Yi3 = 0|bi) + · · · ]
= e^{Xi1β1+Ziβ2+bi} / ( e^{Xi1β1+Ziβ2+bi} + e^{Xi2β1+Ziβ2+bi} + e^{Xi3β1+Ziβ2+bi} )  (the common factors Π_j (1 + e^{Xijβ1+Ziβ2+bi})⁻¹ cancel)
= e^{Xi1β1} / ( e^{Xi1β1} + e^{Xi2β1} + e^{Xi3β1} ).
Note that β2 and bi are "eliminated" by the conditioning. Also note that this logic is valid only when we use the canonical link.
The following proposition gives a generalized version of example 6.24.

Proposition 6.25 (Kalbfleisch, 1978). Consider a model with canonical link and ηij = Xijβ1 + Ziβ2 + bi. Conditioning on the order statistics Y(i1), Y(i2), · · · , Y(ini) (i.e., we know only the set of observed values, not which value matches which time point), the conditional likelihood can be written as

LC := Π_{i=1}^K [ Π_{j=1}^{ni} f(Yij|Xij, bi) f(bi) ] / [ Σ_{pairs} Π_{j=1}^{ni} f(Y(ij)|Xij, bi) f(bi) ]  (each factor =: LCi)
= Π_{i=1}^K [ Π_{j=1}^{ni} exp( Yijβ1Xij/φ ) ] / [ Σ_{pairs} Π_{j=1}^{ni} exp( Y(ij)β1Xij/φ ) ].
Proof. Since

f(Yij|Xij, bi) f(bi) = exp( [Yijθij − b(θij)]/φ + c(Yij, φ) ) f(bi),

we get

LCi = exp( Σ_{j=1}^{ni} ( [Yijθij − b(θij)]/φ + c(Yij, φ) ) ) f(bi) / Σ_{pairs} exp( Σ_{j=1}^{ni} ( [Y(ij)θij − b(θij)]/φ + c(Y(ij), φ) ) ) f(bi)
= exp( Σ_{j=1}^{ni} [Yijθij − b(θij)]/φ ) / Σ_{pairs} exp( Σ_{j=1}^{ni} [Y(ij)θij − b(θij)]/φ )  (∵ Σ_j c(Yij, φ) = Σ_j c(Y(ij), φ))
= exp( Σ_{j=1}^{ni} Yijθij/φ ) / Σ_{pairs} exp( Σ_{j=1}^{ni} Y(ij)θij/φ )
= exp( Σ_{j=1}^{ni} Yijηij/φ ) / Σ_{pairs} exp( Σ_{j=1}^{ni} Y(ij)ηij/φ )  (∵ canonical link!)
= exp( Σ_{j=1}^{ni} Yij(Xijβ1 + Ziβ2 + bi)/φ ) / Σ_{pairs} exp( Σ_{j=1}^{ni} Y(ij)(Xijβ1 + Ziβ2 + bi)/φ )
= Π_{j=1}^{ni} exp( Yijβ1Xij/φ ) / Σ_{pairs} Π_{j=1}^{ni} exp( Y(ij)β1Xij/φ )  (∵ Σ_j Yij = Σ_j Y(ij)).
Remark 6.26 (Pseudo-likelihood). Evaluating the denominator can be computationally prohibitive, because it requires exp( Σ_j Y(ij)β1Xij ) for every permutation of (Yi1, · · · , Yini). This means that even for ni as small as 6, we may have to evaluate more than 700 terms! One alternative is to use "pairwise conditioning," which leads to a pseudo-likelihood (Liang and Qin, 2000):

Π_{i=1}^K Π_{j>k} f(Yij|Xij, bi)f(Yik|Xik, bi) / [ f(Yij|Xij, bi)f(Yik|Xik, bi) + f(Yik|Xij, bi)f(Yij|Xik, bi) ]
= Π_{i=1}^K Π_{j>k} e^{Yijβ1Xij} e^{Yikβ1Xik} / ( e^{Yijβ1Xij} e^{Yikβ1Xik} + e^{Yikβ1Xij} e^{Yijβ1Xik} ).
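The sketch below (simulated logistic data with hypothetical parameters) compares a grid maximization of the full conditional likelihood, whose denominator enumerates all arrangements, with the pairwise pseudo-likelihood; both recover β1 ≈ 1 while the pairwise version avoids the combinatorial sum.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(9)
K, m, beta1 = 300, 4, 1.0
X = rng.normal(size=(K, m))
b = rng.normal(scale=1.5, size=K)                # individual effects, eliminated below
Y = rng.binomial(1, 1 / (1 + np.exp(-(b[:, None] + beta1 * X))))

def cond_loglik(beta):
    # full conditional likelihood: condition on the cluster total sum_j Y_ij
    total = 0.0
    for i in range(K):
        s = int(Y[i].sum())
        if s == 0 or s == m:
            continue                             # concordant clusters carry no information
        num = beta * (Y[i] * X[i]).sum()
        # denominator: all arrangements with the same total s
        den = [beta * X[i, list(idx)].sum() for idx in combinations(range(m), s)]
        total += num - np.log(np.sum(np.exp(den)))
    return total

def pairwise_loglik(beta):
    # pseudo-likelihood of Liang and Qin (2000): condition pairwise on Y_ij + Y_ik
    total = 0.0
    for j in range(m):
        for k_ in range(j + 1, m):
            d = Y[:, j] - Y[:, k_]               # only discordant pairs contribute
            disc = d != 0
            t = beta * d[disc] * (X[disc, j] - X[disc, k_])
            total += np.sum(t - np.log(1 + np.exp(t)))
    return total

grid = np.linspace(0.2, 2.0, 91)
print(grid[np.argmax([cond_loglik(t) for t in grid])],
      grid[np.argmax([pairwise_loglik(t) for t in grid])])
```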
6.5 Applications and Further topics
6.5.1 Multi-level Modeling
Suppose we have repeated measures of systolic and diastolic blood pressure over time. Then we have two levels of clustering, i.e., two factors that induce correlation: over time, and across systolic vs. diastolic.

Given b1i and b2i, suppose Y1ij (systolic) and Y2ij (diastolic) arise independently from an exponential family distribution, with

η1ij = X1ijβ1 + b1i,
η2ij = X2ijβ2 + b2i,

where bi = (b1i, b2i)> ∼ N(0, D) and D is a 2 × 2 matrix. Then, under the assumptions Y1ij ⊥ Y2ij | (b1i, b2i) and Y1ij|X1ij, X2ij, b1i =(d) Y1ij|X1ij, b1i, the marginal likelihood becomes

Π_{i=1}^K ∫ Π_{j=1}^{ni} ( f(Y1ij|X1ij, b1i) f(Y2ij|X2ij, b2i) ) g(bi; D) dbi.

If Ykij|Xkij, bki is normally distributed with mean Xkijβk + bki, then Yi|Xi is multivariate normal, and the covariance elements are

Cov(Y1ij, Y1ik) = D11 (j ≠ k), Cov(Y2ij, Y2ik) = D22 (j ≠ k), and Cov(Y1ij, Y2ik) = D12.
6.5.2 Heavy-tail distribution

Suppose the random effects have distributions with heavier tails than the normal distribution. We may handle this by modeling the variance of the random effects:

ηij = Xijβ + bi,

where bi has mean 0 and variance φi with log φi = log D + log ui. That is, assume that bi|ui ∼ N(0, uiD), where ui is another random effect term which is, e.g., inverse-gamma distributed.

Suppose Yij|bi is normally distributed and ui follows the inverse-gamma distribution with E(ui) = 1 and Var(ui) = τ/(1 − τ) for 0 ≤ τ ≤ 1. This model can be viewed as an extension of the multivariate t-distribution in which the degrees of freedom need not be an integer. The marginal likelihood is

Π_{i=1}^K ∫∫ [ Π_{j=1}^{ni} f(yij|bi) ] (1/√(2π)) (1/√(uiD)) e^{−bi²/(2uiD)} (1/ui^{α+1}) e^{−α/ui} (1/ui) dbi dui
= Π_{i=1}^K ∫ [ Π_{j=1}^{ni} f(yij|bi) ] [ ∫ (1/√(2π)) (1/√(uiD)) e^{−bi²/(2uiD)} (1/ui^{α+1}) e^{−α/ui} (1/ui) dui ]  (multivariate t-density)  dbi.

In other words, the marginal distribution of bi is a (multivariate) t-distribution, and hence the model reflects the heavy-tail behavior of the random effects. When τ = 0, the model reduces to the mixed effects model with normal random effects.
6.5.3 Summarizing by individual or by time

Instead of one huge model, we can reduce the dimension by summarizing by individual or by time.

• Summarizing by individual. Consider diary data: asthma patients record in a "diary," once every three days, whether they had an attack. Because each series is very long, we would summarize the data by individual. For this, we consider individual regression coefficients (β0i, β1i), regard the information about the individual-specific effect as summarized in the fitted (β̂0i, β̂1i), and draw inference based on these fitted coefficients. This gives a kind of two-stage modeling (see the sketch after this list). Precisely, at the first stage we assume that

(β̂0i, β̂1i)> | β0i, β1i, Wi(β0i, β1i) ∼ N( (β0i, β1i)>, Wi(β0i, β1i) ),

where Wi(β0i, β1i) is the variance of the individual-specific coefficient estimates; it reflects that the variance becomes small when the series is long. Then at the second stage, the unobservable coefficients (β0i, β1i)> are assumed to be normally distributed, i.e.,

(β0i, β1i)> ∼ N( (α0, α1)>, D ).

Combining the first and second stages, marginally we obtain

(β̂0i, β̂1i)> ∼ N( (α0, α1)>, Wi + D ),

and we can find maximum likelihood estimators for α0, α1, and D using the (β̂0i, β̂1i)'s.
• Summarizing by time. Several authors (Moulton and Zeger, 1989; Wei and Stram, 1988; Wei and Johnson, 1985) considered obtaining summary statistics by time and then combining the time-specific statistics. These methods are useful when the number of time points is limited and the time schedule is balanced, i.e., when each time point has a specific meaning. One summarizes a statistic for each time point and then draws a joint inference. This approach can fit into the GEE framework when interaction terms between time and all covariates are included in the model.
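For the summarizing-by-individual scheme, here is a one-dimensional sketch (one coefficient per subject, first-stage variances wi treated as known, all names hypothetical): profile out α and maximize the marginal likelihood over D on a grid.

```python
import numpy as np

rng = np.random.default_rng(12)
K, alpha, D = 200, 1.5, 0.4

# Stage 1: individual estimates beta_hat_i ~ N(beta_i, w_i), with w_i known
# (longer series -> smaller w_i).
w = rng.uniform(0.05, 0.5, K)
beta_i = rng.normal(alpha, np.sqrt(D), size=K)
beta_hat = rng.normal(beta_i, np.sqrt(w))

# Stage 2: marginally beta_hat_i ~ N(alpha, w_i + D); profile out alpha.
def profile_loglik(d):
    v = w + d
    a = np.sum(beta_hat / v) / np.sum(1 / v)     # weighted-mean MLE of alpha given D = d
    return -0.5 * np.sum(np.log(v) + (beta_hat - a)**2 / v), a

grid = np.linspace(0.01, 2.0, 400)
lls, alphas = zip(*(profile_loglik(d) for d in grid))
j = int(np.argmax(lls))
print(round(grid[j], 3), round(alphas[j], 3))    # approximately (D, alpha)
```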
Chapter 7
Nonparametric Longitudinal Data Analysis
7.1 Local Polynomial Regression
7.1.1 Review of Local Polynomial Regression
Recently many methods have been developed to analyze longitudinal data using nonparametric tools such as kernels and splines. Start with the model

y = g(t) + ε

and independent observations (yi, ti), i = 1, · · · , n, on a closed interval [a, b]. It is assumed that g is unknown but expected to be smooth. Fix a point s in the interior of [a, b]. Since g is smooth, we can consider the Taylor expansion

g(t) ≈ g(s) + g′(s)(t − s) + · · · + (g^{(p)}(s)/p!) (t − s)^p, for t near s.

Now estimate g(s), g′(s), · · · , g^{(p)}(s) using a least squares criterion: first we minimize

Σ_{i=1}^n ( yi − β0 − β1(ti − s) − · · · − βp(ti − s)^p )² Kh(ti − s)

w.r.t. β0, β1, · · · , βp, and as the estimate of g(s), take ĝ(s) = β̂0. Also note that ĝ′(s) = β̂1, · · · , ĝ^{(p)}(s) = p!β̂p. Here, K(·) is a kernel function satisfying K(·) ≥ 0 and ∫ K(t)dt = 1, and Kh(·) = K(·/h)/h. For example, the rectangular kernel K(t) = I(−1/2 < t < 1/2) or the Gaussian kernel K(t) = (2π)^{−1/2} exp(−t²/2) can be used. The bandwidth h plays the role of the "size of the neighborhood": a small h gives a tight neighborhood and makes the regression local.
The criterion can be rewritten in matrix form:

Σ_{i=1}^n ( yi − β0 − β1(ti − s) − · · · − βp(ti − s)^p )² Kh(ti − s) = (y − Xsβ)>Ksh(y − Xsβ),

where y = (y1, · · · , yn)>, β = (β0, β1, · · · , βp)>, Ksh = diag( Kh(t1 − s), · · · , Kh(tn − s) ), and Xs is the n × (p + 1) matrix with ith row

x>i = ( 1, ti − s, · · · , (ti − s)^p ).

Thus the estimating equation is

X>sKsh(y − Xsβ) = 0,

which gives the solution

β̂ = (X>sKshXs)⁻¹X>sKshy.

Since g(s) ≈ β0, we get

ĝ(s) = β̂0 = e>1 (X>sKshXs)⁻¹X>sKshy,

where e1 = (1, 0, 0, · · · , 0)>.
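A minimal sketch of this estimator (Gaussian kernel, p = 1, a fixed bandwidth chosen by eye rather than by CV):

```python
import numpy as np

rng = np.random.default_rng(10)
n = 300
t = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * t) + rng.normal(scale=0.3, size=n)

def local_poly(s, t, y, h=0.1, p=1):
    # WLS on powers of (t - s), with Gaussian kernel weights K_h(t - s)
    Xs = np.vander(t - s, N=p + 1, increasing=True)   # columns 1, (t-s), ...
    w = np.exp(-0.5 * ((t - s) / h)**2) / (h * np.sqrt(2 * np.pi))
    beta = np.linalg.solve(Xs.T @ (w[:, None] * Xs), Xs.T @ (w * y))
    return beta[0]                                    # g_hat(s) = beta_0

ss = np.linspace(0.05, 0.95, 5)
print(np.round([local_poly(s, t, y) for s in ss], 2))
print(np.round(np.sin(2 * np.pi * ss), 2))            # truth, for comparison
```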
Remark 7.1. (a) One WLS computation is needed for each evaluation point s. However, the computation is not heavy, since we invert only a (p + 1) × (p + 1) matrix and in practice p is set to 0 or 1.

(b) (Choice of order p) Practically, p is set to 0 or 1. p = 0 yields the local constant approximation, or Nadaraya-Watson estimator,

ĝ(s) = Σ_{i=1}^n Kh(ti − s)yi / Σ_{i=1}^n Kh(ti − s);

p = 1 yields the local linear approximation.

(c) (Choice of bandwidth h) Too large an h (too much weight on far-off points) yields an inaccurate estimate (larger bias), while too small an h (very few points used for estimation) yields larger variance. Usually we choose the h that minimizes the MSE or MISE empirically, using CV.
7.1.2 Local Polynomial: Population mean model for longitudinal data
Suppose a longitudinal dataset (yij , tij) was observed for i = 1, 2, · · · , n and j = 1, 2, · · · ,mi.
Naive Population Mean model (NPM)

Naively, we can just minimize

Σ_{i=1}^n Σ_{j=1}^{mi} ( yij − β0 − β1(tij − s) − · · · − βp(tij − s)^p )² Kh(tij − s)
= Σ_{i=1}^n (yi − Xs,iβ)>Ksh,i(yi − Xs,iβ) = (y − Xsβ)>Ksh(y − Xsβ),

where β = (β0, · · · , βp)>, yi = (yi1, · · · , yimi)>, Ksh,i = diag( Kh(ti1 − s), · · · , Kh(timi − s) ), Xs,i is the mi × (p + 1) matrix with jth row (1, tij − s, · · · , (tij − s)^p), and

y = (y>1, · · · , y>n)>, Xs = (X>s,1, · · · , X>s,n)>, Ksh = blockdiag(Ksh,1, · · · , Ksh,n).

Note that NPM does not take the correlation structure into account, but naive LS still gives good (e.g., consistent) point estimation. The solution is

β̂ = (X>sKshXs)⁻¹X>sKshy = ( Σ_{i=1}^n X>s,iKsh,iXs,i )⁻¹ Σ_{i=1}^n X>s,iKsh,iyi.

From f(s) ≈ β0, we get

f̂(s) = β̂0 = (1, 0>p)> ( X>sKshXs )⁻¹ X>sKshy.
GEE-type model (NPM-GEE)

We can reflect a "working correlation" in the model. Assume that mi ≡ m and let V (m × m) be a working covariance matrix. Then we minimize

Σ_{i=1}^n (yi − Xs,iβ)>K^{1/2}_{sh,i} V⁻¹ K^{1/2}_{sh,i}(yi − Xs,iβ).

Then the solution is

β̂ = ( Σ_{i=1}^n X>s,iK^{1/2}_{sh,i}V⁻¹K^{1/2}_{sh,i}Xs,i )⁻¹ Σ_{i=1}^n X>s,iK^{1/2}_{sh,i}V⁻¹K^{1/2}_{sh,i}yi,

and hence

f̂(s) = (1, 0>p)> ( Σ_{i=1}^n X>s,iK^{1/2}_{sh,i}V⁻¹K^{1/2}_{sh,i}Xs,i )⁻¹ Σ_{i=1}^n X>s,iK^{1/2}_{sh,i}V⁻¹K^{1/2}_{sh,i}yi.

Remark 7.2. It turns out that ignoring the correlation (i.e., using working independence) gives a more efficient estimate than incorporating the true correlation.
7.2 Spline methods

7.2.1 Review: Spline

There are two types of spline regression:

• Spline regression or penalized spline regression (P-spline). It is just a basis expansion; e.g., the truncated power basis can be used.

• Smoothing spline (S-spline). It is performed via the natural cubic spline basis.

P-spline

Consider y = f(t) + ε and independent observations (yi, ti), i = 1, · · · , n, on a closed interval [a, b]. Let the class F of functions be known, with f ∈ F unknown. Denote the basis functions of F by φ1(t), φ2(t), · · · , φl(t), so that the assumption f ∈ F means

f(t) = β1φ1(t) + β2φ2(t) + · · · + βlφl(t).

Then the coefficients β = (β1, · · · , βl)> are estimated by minimizing the least squares criterion

Σ_{i=1}^n ( yi − β1φ1(ti) − β2φ2(ti) − · · · − βlφl(ti) )²

or the penalized criterion

Σ_{i=1}^n ( yi − β1φ1(ti) − β2φ2(ti) − · · · − βlφl(ti) )² + λP(β),

where P(β) is a specified penalty function of β. There are several choices of basis functions φi(·). For a global basis: the polynomial basis (of order p), the Fourier basis, or the eigenfunctions of a covariance operator (FPCA; Karhunen-Loève theorem). For a local basis, we consider fixed knots a = τ0 < τ1 < · · · < τM < τM+1 = b on [a, b] and take the truncated power basis 1, t, · · · , t^p, (t − τ1)^p_+, · · · , (t − τM)^p_+, or the natural cubic spline basis 1, t, d1(t) − dM−1(t), · · · , dM−2(t) − dM−1(t), where

dk(t) = [ (t − τk)³_+ − (t − τM)³_+ ] / (τM − τk),

or B-splines, wavelets, etc.
Remark 7.3. In spline regression, we need to choose

• the number of knots;

• the locations of the knots;

• the degree p (if one uses a polynomial basis);

• and the smoothing parameter λ (if one uses the P-spline).

These can be selected via, for example, CV or GCV. There are many other criteria, based on goodness-of-fit, model complexity, generalized maximum likelihood, AIC, BIC, etc.
Smoothing spline

Basically, it comes from minimizing

Σ_{i=1}^n ( yi − f(ti) )² + λ ∫_a^b ( f″(t) )² dt

over f ∈ W²₂[a, b], where

W²₂[a, b] = { f : [a, b] → R : f′ is absolutely continuous on (a, b) and f″ ∈ L² }

denotes a Sobolev space. The penalty term gives another way to control the roughness of f. It is known that the minimizer of the above criterion is a natural cubic spline with knots at t1, t2, · · · , tn, if the ti's are distinct. Based on this, the criterion can be rewritten as

Σ_{i=1}^n ( yi − β1φ1(ti) − · · · − βMφM(ti) )² + λβ>Mβ = (y − Xβ)>(y − Xβ) + λβ>Mβ,

where M is the number of distinct time points, φ1(t), · · · , φM(t) denote the natural cubic spline basis functions, and M is the M × M matrix with

Mij = ∫_a^b φ″i(t)φ″j(t) dt.

We also use the conventional notation y = (y1, · · · , yn)>, β = (β1, · · · , βM)>, and X the n × M matrix with (i, j) entry φj(ti). Then

β̂ = (X>X + λM)⁻¹X>y,

and hence

f̂(t) = Φ(t)>(X>X + λM)⁻¹X>y,

where Φ(t) = (φ1(t), · · · , φM(t))>.
7.2.2 Spline regression: Population mean model for longitudinal data
Suppose a longitudinal dataset (yij, tij) is observed for i = 1, · · · , n and j = 1, · · · , mi. We assume that

• E(yij|tij) = f(tij);

• (yi, ti), i = 1, · · · , n, are independent across subjects;

• and Var(yi|ti) = Σ (working covariance).
Naive Population Mean model (NPM)

NPM pretends each (yij, tij) is an independent observation, and

i) (P-spline) minimizes (see the sketch after this list)

Σ_{i=1}^n Σ_{j=1}^{mi} ( yij − β0 − β1tij − · · · − βp+M(tij − τM)^p_+ )² + λ Σ_{k=p+1}^{p+M} βk²
= Σ_{i=1}^n (yi − Xiβ)>(yi − Xiβ) + λβ>Gβ = (y − Xβ)>(y − Xβ) + λβ>Gβ.

Here, β = (β0, · · · , βp+M)>, G = diag(0_{p+1}, 1_M), yi = (yi1, · · · , yimi)>, Xi is the mi × (p + 1 + M) matrix with jth row ( 1, tij, · · · , t^p_{ij}, (tij − τ1)^p_+, · · · , (tij − τM)^p_+ ), y = (y>1, · · · , y>n)>, and X = (X>1, · · · , X>n)>.

Then the solution is

β̂ = (X>X + λG)⁻¹X>y = ( Σ_{i=1}^n X>iXi + λG )⁻¹ Σ_{i=1}^n X>iyi,

and hence

f̂(t) = Φ(t)>(X>X + λG)⁻¹X>y = Φ(t)> ( Σ_{i=1}^n X>iXi + λG )⁻¹ Σ_{i=1}^n X>iyi.
ii) (S-spline) minimizes

Σ_{i=1}^n Σ_{j=1}^{mi} ( yij − β1φ1(tij) − · · · − βMφM(tij) )² + λβ>Mβ
= Σ_{i=1}^n (yi − Xiβ)>(yi − Xiβ) + λβ>Mβ = (y − Xβ)>(y − Xβ) + λβ>Mβ,

where M is the number of distinct time points among the tij and the other quantities are defined analogously.
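A short sketch of the NPM P-spline fit of item i) (degree p = 1, truncated power basis, a fixed λ; the within-subject correlation is deliberately ignored, as NPM does; all names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(11)
n, m = 80, 5                                   # n subjects, m observations each
t = rng.uniform(0, 1, size=(n, m))
f = lambda s: np.sin(2 * np.pi * s)
b = rng.normal(scale=0.3, size=n)              # within-subject correlation, ignored by NPM
y = f(t) + b[:, None] + rng.normal(scale=0.3, size=(n, m))

# Truncated power basis of degree p = 1 with M interior knots.
knots = np.linspace(0.1, 0.9, 9)
def basis(s):
    s = np.asarray(s).ravel()
    return np.column_stack([np.ones_like(s), s] +
                           [np.clip(s - k, 0, None) for k in knots])

X = basis(t)                                   # (n*m) x (2 + M) design
lam = 1.0
G = np.diag(np.r_[np.zeros(2), np.ones(len(knots))])   # penalize knot terms only
beta = np.linalg.solve(X.T @ X + lam * G, X.T @ y.ravel())

ss = np.array([0.25, 0.5, 0.75])
print(np.round(basis(ss) @ beta, 2), np.round(f(ss), 2))   # fit vs truth
```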
NPM-GEE model

The NPM-GEE method was also proposed by Welsh, Lin and Carroll (2002). It reflects a working correlation structure in NPM and constructs a GEE-type estimation procedure. Assume mi ≡ m and let W (m × m) be a working covariance matrix. We estimate via

i) (P-spline) minimizing

Σ_{i=1}^n (yi − Xiβ)>W⁻¹(yi − Xiβ) + λβ>Gβ.

The solution is

β̂ = ( Σ_{i=1}^n X>iW⁻¹Xi + λG )⁻¹ Σ_{i=1}^n X>iW⁻¹yi,

and hence we get

f̂(t) = Φ(t)> ( Σ_{i=1}^n X>iW⁻¹Xi + λG )⁻¹ Σ_{i=1}^n X>iW⁻¹yi.

ii) (S-spline) minimizing

Σ_{i=1}^n (yi − Xiβ)>W⁻¹(yi − Xiβ) + λβ>Mβ.

It gives the solution

β̂ = ( Σ_{i=1}^n X>iW⁻¹Xi + λM )⁻¹ Σ_{i=1}^n X>iW⁻¹yi.
Remark 7.4. Recall that in the ordinary GEE model, if the true covariance structure is used as the working one (W = Σ), the estimator is the most efficient. For the spline NPM-GEE methods this tendency remains the same, but for kernel methods the true covariance as working covariance is inefficient (Lin and Carroll, 2000); rather, working independence is preferable! (cf. remark 7.2) One reason is that kernel methods are applied locally: they would need a local covariance structure, while our working correlation structure acts globally. NPM-GEE works for spline methods because spline methods are applied globally.