Multilevel linear Santiago2012 - UniFI · Sample size requirements to fit multilevel models 10....
Introduction to multilevel analysis
Leonardo Grilli, Department of Statistics “G. Parenti”, University of Florence
Email: [email protected] Web: http://www.ds.unifi.it/grilli/
Teaching staff Erasmus mobility
Santiago de Compostela, 28 February – 1 March 2012
Outline
1. Introduction
2. ANOVA (fixed effects vs random effects)
3. Inference in random effects ANOVA
4. Basics of the two-level linear model – Case #1: a single covariate at level 1
5. Basics of the two-level linear model – Case #2: introduction of a covariate at level 2
6. Between, within and contextual effects
7. Inference in two-level models
8. Use of the residuals
9. Sample size requirements to fit multilevel models
10. Extensions of the hierarchical model
11. Software and books
Introduction
• Multilevel structures
• Basic definitions
• NELS-88 example
A hierarchical structure
level 4: district
level 3: school 1, school 2
level 2: class 1, class 2, class 3, class 4
level 1: students s1, s2, ..., s12
Remark: levels are numbered bottom-up
Hierarchical structures: Units within clusters
It can be the case that all the units and clusters are physical entities:
• pupil, class, school
• patient, doctor, hospital
• worker, firm
• individual, family, region
• interviewee, interviewer
Often the sampling design reflects the hierarchical structure (multi-stage sampling), but this is not necessary
Hierarchical structures: Multiple responses
It can be the case that the bottom units (level 1) are different responses of a given statistical unit (level 2):
• Multivariate data
• Longitudinal data (panel, repeated measurements)
[Diagram: subject j (level 2) with responses resp1, resp2, resp3 (level 1)]
Remark: a hierarchical structure may combine multiple responses and clusters given by physical entities (e.g. a questionnaire on the students: item, student, school)
Multiple responses and missing data
Multivariate data with a missing response: subject 1 has item1, item2, item3; subject 2 has item1 and item3 only
Panel data with drop-out: subject 1 has wave1, wave2, wave3; subject 2 has wave1 and wave2, then drops out
Standard estimation methods for multilevel models allow for non-informative missing data (Little & Rubin’s MAR: Missing At Random)
Some hierarchical terms
                       | Cross-section,      | Cross-section,              | Longitudinal data
                       | univariate response | multivariate response       | (panel data)
-----------------------+---------------------+-----------------------------+----------------------------
Level 2 unit (cluster) | set of subjects     | subject                     | subject
Level 1 unit           | subject             | measurement, response, item | measurement, occasion, wave
Other names for Level 2: between, macro, cluster
Other names for Level 1: within, micro
Warning: subject is a level 1 unit or a level 2 unit depending on the context
Cluster analysis vs multilevel analysis
In cluster analysis the hierarchical structure is unknown: discovering the clusters is precisely the aim of the analysis
In multilevel analysis the hierarchical structure (number of clusters, cluster membership) is known a priori: the aim of the analysis is to understand the relationships within and between clusters
However, a multilevel model can be specified so as to perform a model-based cluster analysis on the clusters of the hierarchy, e.g. a 2-level model for students within schools can be used to build clusters of schools (this task requires specifying the random effects as having a discrete instead of a continuous distribution)
Which is the relevant structure?
Usually the phenomenon under study can be modelled through several alternative structures, e.g.
• Pupil, class
• Pupil, class, school
• Pupil, school
• Pupil, teacher
• Pupil, (school by quarter)
The structure to be used in the analysis depends on the aims of the research
Even if a complex structure may appear more realistic, for most research purposes a simple structure with 2 or 3 levels is enough
Different fields, different names…
• Design of experiments: variance components models
• Statistics: mixed models (Harville, 1977), hierarchical linear models (HLM)
• Econometrics: random coefficients models (Swamy 1972), random effects models for panel data
• Biostatistics: mixed models for repeated measures (Laird and Ware, 1982), random effects models
• Educational statistics: multilevel models (Cronbach 1976, Aitkin and Longford 1986)
• Sociology, demography, small area estimation,…
Types of variables
• Level 1. Example: male/female, grade
• Level 2, global: feature of the cluster with no corresponding level 1 measure. Example: public/private school, number of teachers
• Level 2, compositional (or contextual): feature of the cluster obtained through aggregation of level 1 measures (summary of the features of the level 1 units). Example: average class size, proportion of females, average grade
Levels of the variables
$j = 1, \dots, J$ clusters (level 2 units)
$i = 1, \dots, n_j$ elementary (level 1) units in cluster $j$
In a two-level setting a level 1 variable has a double index, $X_{ij}$, while a level 2 variable has a single, level 2 index, $W_j$
A level 2 variable is by definition constant within clusters: its variation is only between clusters
A level 1 variable has distinct values for the elementary units and, in general, its cluster mean changes from cluster to cluster: its variation is both within clusters and between clusters
$X_{ij} = \bar{X}_{.j} + (X_{ij} - \bar{X}_{.j})$, so that $\mathrm{var}(X_{ij}) = \mathrm{var}(\bar{X}_{.j}) + \mathrm{var}(X_{ij} - \bar{X}_{.j})$
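Relationships among levels
macro: level 2, e.g. school (variable Z)
micro: level 1, e.g. pupil (variables X and Y)
• macro-micro relationship: Z → Y
• adjusted macro-micro relationship: Z → Y, controlling for X → Y
• cross-level interaction: Z modifies the relationship X → Y
• micro-macro relationship: X → Z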
NELS-88 example
We consider 10 handpicked schools from the NELS-88 data (Kreft and De Leeuw, Introducing Multilevel Modeling, Sage, 1998)
• Levels: pupils (level 1); schools (level 2) [schnum]
• Response variable Y [math]: score on a math test
• Level 1 covariate X [homework]: hours per week spent on math homework
• Level 2 covariate W [public]: binary indicator of public vs non-public school (a global level 2 variable)
NELS-88 example: data
Here are some records (9 out of 260); each record refers to a pupil
     +-----------------------------------+
     | schnum   public   math   homework |
     |-----------------------------------|
126. |      6        1     42          2 |
127. |      6        1     47          1 |
128. |      6        1     47          3 |
129. |      6        1     51          1 |
130. |      6        1     53          1 |
131. |      6        1     44          1 |
132. |      7        0     62          4 |
133. |      7        0     68          5 |
134. |      7        0     56          5 |
     +-----------------------------------+
NELS-88 example: summary statistics
There are 10 schools (level 2 units) of different size (unbalanced design)
The total number of pupils (level 1 units) is 260
Nr. of schools | Size
---------------+------
             3 |   20
             2 |   21
             2 |   22
             1 |   23
             1 |   24
             1 |   67
    Variable |  Obs    Mean   Std. Dev.   Min   Max
-------------+--------------------------------------
        math |  260   51.30       11.14    31    71
    homework |  260    2.02        1.55     0     7
      public |   10    0.10           -     0     1
Problems with standard models
Standard models, such as OLS regression, are not adequate for analysing hierarchical data
• Inaccurate modelling: unable to disentangle the contributions of the hierarchical levels
• Inaccurate inference:
  - Wrong sample size for the cluster variables (their sample size should be the number of clusters)
  - Dependence: the units of the same cluster are alike, with a positive within-cluster correlation, so the independence assumption of standard models is violated and the standard errors are biased (often underestimated, leading to type I error rates higher than the nominal level)
Which type of model?
Basic question: is the hierarchical structure (along with the relationships within and between levels and the associated correlation structure) of primary interest for the research?
• Yes, it is of primary interest: use multilevel models
• No, it is merely a nuisance (e.g. the sampling design is multistage but interest is limited to individual level relationships): use methods able to “correct” the standard errors, such as
  - Repeated measure methods like GEE
  - Robust (sandwich, Huber, White) covariance matrix of the estimators
Aggregated analysis
A possible solution to the problem of correlated observations is to aggregate the data at cluster level (by taking cluster means) and apply standard regression to the new dataset. However, this solution is harmful because of the following problems:
• Shift of meaning: the contextual variables obtained through aggregation refer to the clusters (not the level 1 units), so they cannot be used to investigate relationships at level 1
• Ecological fallacy (aggregation bias): relationships at level 2 ≠ relationships at level 1
• Interactions between levels: an aggregated analysis precludes the study of the relationships between levels
ANOVA (fixed effects vs random effects)
…i.e. the simplest multilevel model
Fixed effects ANOVA
$y_{ij} = \mu + \alpha_j + e_{ij}$, with $e_{ij} \overset{iid}{\sim} N(0, \sigma_e^2)$
$j = 1, \dots, J$ clusters (in ANOVA terminology "levels of the factor")
$i = 1, \dots, n_j$ units in cluster $j$
$N = \sum_{j=1}^{J} n_j$ total sample size
$\mu$ is a parameter representing the overall mean
$\alpha_j$ is a parameter representing the deviation of the mean of the j-th cluster (level of the factor) from the overall mean ($J-1$ parameters)
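Random effects ANOVA (RANOVA)
$y_{ij} = \mu + u_j + e_{ij}$
$u_j \overset{iid}{\sim} N(0, \sigma_u^2)$, $e_{ij} \overset{iid}{\sim} N(0, \sigma_e^2)$, $e_{ij} \perp u_j \ \forall i, j$
The $J-1$ parameters $\alpha_j$ (the cluster deviations of the fixed effects ANOVA) are replaced by a single random variable $u_j$ that takes J realizations, one for each cluster (level of the factor)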
Random effects ANOVA: data generating process
$y_{ij} = \mu + u_j + e_{ij}$
Useful to think in terms of a two-stage data generation process:
1) sampling cluster j -> take a realization of $u_j$ from its distribution (so cluster j has a mean of $\mu + u_j$)
2) sampling unit i within cluster j -> take a realization of $e_{ij}$ from its distribution (so unit i of cluster j has value $\mu + u_j + e_{ij}$)
[Figure: cluster means $\mu + u_1$, $\mu + u_2$, $\mu + u_3$]
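The two-stage data generating process can be sketched in a few lines of Python. This is only an illustration: the function name `simulate_ranova`, the cluster sizes and the parameter values are hypothetical and not taken from the slides.

```python
import numpy as np

def simulate_ranova(n_per_cluster, mu, sigma_u, sigma_e, seed=0):
    """Two-stage generation of y_ij = mu + u_j + e_ij (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    sizes = np.asarray(n_per_cluster)
    # Stage 1: draw one random effect u_j per cluster
    u = rng.normal(0.0, sigma_u, size=sizes.size)
    # Stage 2: draw one level 1 error e_ij per unit within its cluster
    cluster = np.repeat(np.arange(sizes.size), sizes)
    e = rng.normal(0.0, sigma_e, size=cluster.size)
    y = mu + u[cluster] + e
    return cluster, u, y

# Illustrative values only: J = 3 clusters of sizes 2, 2 and 3
cluster, u, y = simulate_ranova([2, 2, 3], mu=50.0, sigma_u=5.0, sigma_e=8.0)
print(cluster)  # cluster membership of each unit
print(y)        # simulated responses
```

Units in the same cluster share the same draw of `u[cluster]`, which is what induces the within-cluster correlation.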
Random effects ANOVA: variances and covariances
$y_{ij} = \mu + u_j + e_{ij}$
$Var(y_{ij}) = \sigma_u^2 + \sigma_e^2$
$Cov(y_{ij}, y_{i'j'}) = 0$ if $j \neq j'$; $Cov(y_{ij}, y_{i'j'}) = \sigma_u^2$ if $j = j'$ and $i \neq i'$
Variance of $y_{ij}$ decomposed in two components: cluster (between) level + individual (within) level
Observations belonging to the same cluster are positively correlated
Remark: the correlation is necessarily positive since it is generated by a shared latent variable $u_j$ (it is the same basic idea of factor models, where $u_j$ is called factor; indeed the GLLAMM class of Rabe-Hesketh and Skrondal includes both multilevel and factor models as special cases)
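Random effects ANOVA: covariance matrix
Example: $J = 2$, $n_1 = 2$, $n_2 = 3$
$$
Var(\mathbf{y}) =
\begin{pmatrix}
\sigma_u^2 + \sigma_e^2 & \sigma_u^2 & & & \\
\sigma_u^2 & \sigma_u^2 + \sigma_e^2 & & & \\
 & & \sigma_u^2 + \sigma_e^2 & \sigma_u^2 & \sigma_u^2 \\
 & & \sigma_u^2 & \sigma_u^2 + \sigma_e^2 & \sigma_u^2 \\
 & & \sigma_u^2 & \sigma_u^2 & \sigma_u^2 + \sigma_e^2
\end{pmatrix}
$$
Block diagonal structure (empty space means zero)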
Random effects ANOVA: intraclass correlation coefficient
$\rho = Corr(y_{ij}, y_{i'j}) = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2} = \frac{\text{cluster variance}}{\text{total variance}} \in [0, 1]$
$\rho$ denotes the ICC (Intraclass Correlation Coefficient), also known as VPC (Variance Partitioning Coefficient)
$\rho$ is a measure of the degree of homogeneity of units belonging to the same cluster
The double nature of $\rho$ (correlation and variance ratio) does not hold in models with more than 2 levels
Random effects ANOVA: example with NELS-88 data
Parameter         | Estimate
------------------+----------
Intercept (=mean) |    48.87
Level 2 variance  |    30.54
Level 1 variance  |    72.24
Total variance = 30.54 + 72.24 = 102.78
ICC = 30.54 / 102.78 = 0.297
29.7% of the variance of math scores is due to the clustering of pupils into schools
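The ICC arithmetic of this slide can be checked with a short Python sketch. The helper name `icc` is hypothetical; the two variance estimates are the ones reported for the NELS-88 example.

```python
def icc(var_between, var_within):
    """Intraclass correlation: cluster variance over total variance."""
    return var_between / (var_between + var_within)

# Estimates from the RANOVA fit to the NELS-88 example
rho = icc(30.54, 72.24)
print(round(rho, 3))  # 0.297
```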
Random effects ANOVA: marginal vs conditional covariance
$y_{ij} = \mu + u_j + e_{ij}$
For two units i and i’ of the same cluster j the responses are correlated because they share the same random effect $u_j$
Marginal covariance: $Cov(y_{ij}, y_{i'j}) = \sigma_u^2$
If we condition on $u_j$ the correlation disappears
Conditional covariance: $Cov(y_{ij}, y_{i'j} \mid u_j) = 0$
Fixed effects ANOVA (again): conditional covariance
Parameterization with J intercepts: $y_{ij} = (\mu + \alpha_j) + e_{ij} = \mu_1 d_1 + \mu_2 d_2 + \dots + \mu_J d_J + e_{ij}$, with $\mu_j = \mu + \alpha_j$
$\mathbf{d} = (d_1, d_2, \dots, d_J)$ is the vector of the J dummy variables for the clusters
Ex. $J = 2$, $n_1 = 2$, $n_2 = 3$: the rows of $(d_1, d_2)$ are (1, 0), (1, 0), (0, 1), (0, 1), (0, 1)
This model does not specify a correlation structure among the responses of the same cluster: indeed the covariance is always null. However, the model refers to the distribution of y conditional on d, thus the covariance which is null is the conditional covariance: $Cov(y_{ij}, y_{i'j} \mid \mathbf{d}) = 0$
Since the model does not specify the distribution of d, we cannot compute the marginal covariance, but (by analogy with the random effects version) we can conclude that the marginal covariance is positive, i.e. $Cov(y_{ij}, y_{i'j}) > 0$
ANOVA: fixed or random effects?
Random effects ANOVA, $y_{ij} = \mu + u_j + e_{ij}$ (the $u_j$ are iid random variables): when the clusters (levels of the factor) are a sample from a population of clusters, or in any case when the clusters are many. It gives a parsimonious description of the observed variability among clusters + generalizability
Fixed effects ANOVA, $y_{ij} = \mu + \alpha_j + e_{ij}$ (the $\alpha_j$ are parameters): when the clusters (levels of the factor) are few and represent an exhaustive classification. Distributional assumptions are avoided, but it is impossible to generalize the results to a population of clusters
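Inference in random effects ANOVA
Random effects ANOVA: estimation of the mean /1
$y_{ij} = \mu + u_j + e_{ij}$, with $Var(u_j) = \sigma_u^2$ and $Var(e_{ij}) = \sigma_e^2$
Cluster mean: $\bar{y}_{.j} = \mu + u_j + \bar{e}_{.j}$
$E(\bar{y}_{.j}) = \mu$
$Var(\bar{y}_{.j}) = Var(u_j + \bar{e}_{.j}) = \sigma_u^2 + \sigma_e^2 / n_j$
Each cluster mean is an unbiased estimator of $\mu$, but it is inefficient
Idea: estimate $\mu$ combining $\bar{y}_{.1}, \dots, \bar{y}_{.J}$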
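Random effects ANOVA: estimation of the mean /2
$y_{ij} = \mu + u_j + e_{ij}$, with $Var(u_j) = \sigma_u^2$ and $Var(e_{ij}) = \sigma_e^2$
$\hat{\mu} = \sum_{j=1}^{J} \frac{\Delta_j^{-1}}{\sum_{h=1}^{J} \Delta_h^{-1}} \, \bar{y}_{.j}$, where $\Delta_j = Var(\bar{y}_{.j}) = \sigma_u^2 + \sigma_e^2 / n_j$
$\Delta_j^{-1}$ is the precision of $\bar{y}_{.j}$
• In the balanced case all the precisions are equal: the best estimator can be computed (it is just the arithmetic mean)
• In the unbalanced case it is necessary to have an estimate of $\Delta_j$: the best estimator depends on the variance components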
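A minimal sketch of the precision-weighted estimator of the mean, assuming the variance components are known (in practice they are estimated). The function name `precision_weighted_mean` and the numbers in the usage lines are hypothetical.

```python
import numpy as np

def precision_weighted_mean(cluster_means, cluster_sizes, var_u, var_e):
    """Weight each cluster mean by the inverse of its variance."""
    ybar = np.asarray(cluster_means, dtype=float)
    n = np.asarray(cluster_sizes, dtype=float)
    delta = var_u + var_e / n   # Var(ybar_j) = sigma_u^2 + sigma_e^2 / n_j
    w = 1.0 / delta             # precision of each cluster mean
    return np.sum(w * ybar) / np.sum(w)

# Balanced design: equal precisions, so the result is the arithmetic mean
print(precision_weighted_mean([1.0, 2.0, 6.0], [5, 5, 5], var_u=2.0, var_e=3.0))
# Unbalanced design: the weights now depend on the variance components
print(precision_weighted_mean([1.0, 2.0, 6.0], [2, 5, 50], var_u=2.0, var_e=3.0))
```

When `var_u` is zero the weights are proportional to the cluster sizes, so the estimator reduces to the pooled mean of all units.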
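Random effects ANOVA: estimation of the variance components /1
$\sum_{j=1}^{J} \sum_{i=1}^{n_j} (y_{ij} - \bar{y}_{..})^2 = \sum_{j=1}^{J} \sum_{i=1}^{n_j} (y_{ij} - \bar{y}_{.j})^2 + \sum_{j=1}^{J} n_j (\bar{y}_{.j} - \bar{y}_{..})^2$
SST = SSW + SSB
For simplicity, we consider only the balanced case $n_j = n$
Two possible estimators of the variance components:
$S_W^2 = \frac{SSW}{N - J} = \frac{1}{N - J} \sum_{j=1}^{J} \sum_{i=1}^{n} (y_{ij} - \bar{y}_{.j})^2$
$S_B^2 = \frac{SSB}{n (J - 1)} = \frac{1}{J - 1} \sum_{j=1}^{J} (\bar{y}_{.j} - \bar{y}_{..})^2$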
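Random effects ANOVA: estimation of the variance components /2
$E(S_W^2) = \sigma_e^2$, $E(S_B^2) = \sigma_u^2 + \sigma_e^2 / n$
$S_B^2$ overestimates $\sigma_u^2$ since it actually estimates $Var(\bar{y}_{.j}) = \sigma_u^2 + \sigma_e^2 / n$
Not all the variability of the sample cluster means $\bar{y}_{.j}$ is due to the variability of the population cluster means $\mu + u_j$
An unbiased estimator of $\sigma_u^2$: $\hat{\sigma}_u^2 = S_B^2 - \hat{\sigma}_e^2 / n$, with $\hat{\sigma}_e^2 = S_W^2$
Remark: $\hat{\sigma}_u^2$ can be negative! If $S_B^2 < \hat{\sigma}_e^2 / n$ then $\hat{\sigma}_u^2 = 0$
Remark: under $H_0: \sigma_u^2 = 0$ both $S_W^2$ and $n S_B^2$ estimate $\sigma_e^2$; this fact underlies the classical ANOVA F test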
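The ANOVA estimators for the balanced case can be sketched as follows. The function name `anova_variance_components` and the small data array are illustrative assumptions; the truncation at zero follows the remark on negative estimates.

```python
import numpy as np

def anova_variance_components(y):
    """ANOVA estimators for a balanced design; y is a J x n array (rows = clusters)."""
    y = np.asarray(y, dtype=float)
    J, n = y.shape
    cluster_means = y.mean(axis=1)
    grand_mean = y.mean()
    # S_W^2 = SSW / (N - J), with N = J * n
    s2_w = np.sum((y - cluster_means[:, None]) ** 2) / (J * n - J)
    # S_B^2 = variance of the cluster means (divisor J - 1)
    s2_b = np.sum((cluster_means - grand_mean) ** 2) / (J - 1)
    var_e = s2_w
    # S_B^2 - S_W^2 / n can be negative: truncate at zero
    var_u = max(s2_b - s2_w / n, 0.0)
    return var_u, var_e

# Illustrative data: 5 clusters with 2 units each
y = [[5, 7], [4, 6], [3, 5], [2, 4], [1, 3]]
print(anova_variance_components(y))  # (1.5, 2.0)
```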
Joint estimation of fixed effects and variance-covariance parameters
To estimate the fixed parameters it is necessary to have an estimate of the variance-covariance parameters
However the converse is also true: to estimate the variance-covariance parameters it is necessary to have an estimate of the fixed parameters
The only exception is in strictly balanced designs (all clusters with the same size and the same matrix of covariates), where closed-form estimators of the variance components exist (see the textbook by Searle, McCulloch and Casella)
In general we have to rely upon iterative procedures
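Random effects ANOVA: prediction of the cluster-specific intercept $\beta_{0j}$
$y_{ij} = \mu + u_j + e_{ij}$, with cluster-specific intercept $\beta_{0j} = \mu + u_j$
Estimator 1, from the level 1 model (sample mean as outcome):
$\bar{y}_{.j} = \beta_{0j} + \bar{e}_{.j}$, $\bar{e}_{.j} = (1/n_j) \sum_{i=1}^{n_j} e_{ij} \sim N(0, \sigma_e^2 / n_j)$
$\bar{y}_{.j}$ is an (unbiased) estimator of $\beta_{0j}$ with variance $\sigma_e^2 / n_j$
Estimator 2, from the level 2 model:
$\beta_{0j} = \mu + u_j$, $u_j \sim N(0, \sigma_u^2)$, $Var(\beta_{0j}) = \sigma_u^2$
$\hat{\mu} = \sum_{j=1}^{J} \frac{\Delta_j^{-1}}{\sum_{h=1}^{J} \Delta_h^{-1}} \, \bar{y}_{.j}$, with $\Delta_j = \sigma_u^2 + \sigma_e^2 / n_j$, is a common (biased) estimator of each $\beta_{0j}$
Remark: here unbiasedness is evaluated conditional on $\beta_{0j}$ (i.e. hold $u_j$ fixed and calculate E(·) under repeated sampling of $e_{ij}$)
A linear combination of estimators 1 and 2 yields a new estimator better than both 1 and 2 in terms of MSE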
BLUP
The Best Linear Unbiased Predictor (BLUP) of $\beta_{0j}$ is
$\hat{\beta}_{0j}^{EB} = \lambda_j \, \bar{y}_{.j} + (1 - \lambda_j) \, \hat{\mu}$
• $\bar{y}_{.j}$: prediction using only the level 1 model (i.e. data from cluster j)
• $\hat{\mu}$: prediction using the level 2 model (i.e. data from all clusters)
The weight $\lambda_j$ (defined in the next slide) is a measure of the reliability of $\bar{y}_{.j}$ as an estimator of $\beta_{0j}$
The superscript EB stands for Empirical Bayes since $\hat{\beta}_{0j}^{EB}$ is also the posterior mean of $\beta_{0j}$ calculated by plugging in the ML estimates
The reliability coefficient
The $\lambda_j$ that appears in $\hat{\beta}_{0j}^{EB}$ is the reliability coefficient
$\lambda_j = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2 / n_j}$
In psychometrics, letting i=item and j=individual, the parallel measurement model of classical test theory is $y_{ij} = \mu + u_j + e_{ij}$, where
• $y_{ij}$: observed measurement of individual j on item i
• $\mu + u_j$: true score of individual j
• $e_{ij}$: measurement error of individual j on item i
• $\bar{y}_{.j}$: observed score of individual j
reliability of $\bar{y}_{.j}$ = $\frac{Var(\mu + u_j)}{Var(\bar{y}_{.j})}$ = $\frac{\text{variance of true scores}}{\text{variance of observed scores}}$ = $\lambda_j$
EB prediction and borrowing strength
$\hat{\beta}_{0j}^{EB} = \lambda_j \, \bar{y}_{.j} + (1 - \lambda_j) \, \hat{\mu}$
• $\bar{y}_{.j}$: prediction using only data from cluster j
• $\hat{\mu}$: prediction using data from all clusters
When $\lambda_j$ rises, the EB prediction gets closer to the prediction based on data from the cluster under consideration
$\lambda_j = \frac{1}{1 + \frac{1}{(\sigma_u^2 / \sigma_e^2) \, n_j}}$
$\lambda_j$ is an increasing function of the between/within variance ratio and of the cluster size
Borrowing strength: for a cluster of small size, the prediction of the intercept heavily exploits information from other clusters
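The shrinkage at work in the EB prediction can be illustrated with a small Python sketch. The function name `eb_intercepts` is hypothetical; the variance components and the overall mean are the RANOVA estimates of the NELS-88 example, while the two cluster means of 60 are made-up values chosen to show the effect of cluster size.

```python
import numpy as np

def eb_intercepts(cluster_means, cluster_sizes, mu_hat, var_u, var_e):
    """EB prediction: lambda_j * ybar_j + (1 - lambda_j) * mu_hat."""
    ybar = np.asarray(cluster_means, dtype=float)
    n = np.asarray(cluster_sizes, dtype=float)
    lam = var_u / (var_u + var_e / n)   # reliability of each cluster mean
    beta_eb = lam * ybar + (1.0 - lam) * mu_hat
    return lam, beta_eb

# Two hypothetical schools with the same observed mean (60) but sizes 20 and 67
lam, beta_eb = eb_intercepts([60.0, 60.0], [20, 67],
                             mu_hat=48.87, var_u=30.54, var_e=72.24)
print(lam)      # the larger school has the higher reliability
print(beta_eb)  # so its prediction is shrunk less towards 48.87
```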
Prediction of the random effects /1
Often we are interested in the value of the random effect, especially when it can be interpreted as a measure of effectiveness
The EB method implies that the EB prediction of a random effect, also called EB residual, is
$\hat{u}_j^{EB} = \hat{\beta}_{0j}^{EB} - \hat{\mu} = \hat{\lambda}_j (\bar{y}_{.j} - \hat{\mu})$
• $\hat{\lambda}_j$: shrinkage factor
• $\bar{y}_{.j} - \hat{\mu}$: OLS residual (also called ‘naive’ residual)
EB residuals are better than OLS residuals in terms of MSE
The amount of shrinkage depends on the cluster size; it may happen that the shrinkage is negligible for large clusters and substantial for small clusters (in effectiveness evaluation this fact causes some concern)
Prediction of the random effects /2
$y_{ij} = \mu + u_j + e_{ij}$
$\hat{u}_j^{OLS} = \bar{y}_{.j} - \hat{\mu}$, $\hat{u}_j^{EB} = \frac{\hat{\sigma}_u^2}{\hat{\sigma}_u^2 + \hat{\sigma}_e^2 / n_j} \, \hat{u}_j^{OLS}$
• The random effects $u_j$ are random variables with prior distribution $N(0, \sigma_u^2)$
• The EB prediction is the mean of the posterior distribution with parameter estimates plugged in, i.e. data information (likelihood) combined with population information (prior)
• In a linear model the posterior is Normal, so mean = mode
• The mean of the posterior distribution is a value between 0 (the mean of the prior) and the mode of the likelihood
Marginal sampling variance (diagnostic standard error)
In the RANOVA model, the sampling distribution of the EB prediction is $\hat{u}_j^{EB} \sim N(0, \lambda_j \sigma_u^2)$
The sampling variance is $var(\hat{u}_j^{EB}) = \lambda_j \sigma_u^2 = \sigma_u^2 - (1 - \lambda_j) \sigma_u^2 = \sigma_u^2 - Var(u_j \mid y)$
Note: sampling variance < prior variance
Diagnostic standard error: $\widehat{SD}(\hat{u}_j^{EB}) = \sqrt{\hat{\lambda}_j \hat{\sigma}_u^2}$
It can be used for diagnostic purposes (e.g. to check if the EB prediction for a given cluster is anomalous)
Variance of prediction errors (comparative standard error)
In the linear random intercept model, the variance of prediction errors equals the posterior variance, i.e. the variance of $u_j$ given the data
$Var(\hat{u}_j^{EB} - u_j) = Var(u_j \mid y) = (1 - \lambda_j) \sigma_u^2$
Note: posterior variance < prior variance
Comparative standard error: $\widehat{SD}(\hat{u}_j^{EB} - u_j) = \sqrt{(1 - \hat{\lambda}_j) \hat{\sigma}_u^2}$
It can be used for inferences on differences among random effects (see later)
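The two standard errors of an EB prediction can be computed side by side. The sketch below assumes the RANOVA model with known variance components; the function name `eb_standard_errors` is hypothetical and the inputs are the NELS-88 RANOVA estimates with the smallest and largest school sizes.

```python
import numpy as np

def eb_standard_errors(cluster_sizes, var_u, var_e):
    """Diagnostic and comparative standard errors of the EB residuals."""
    n = np.asarray(cluster_sizes, dtype=float)
    lam = var_u / (var_u + var_e / n)
    diagnostic = np.sqrt(lam * var_u)           # SD of the EB prediction itself
    comparative = np.sqrt((1.0 - lam) * var_u)  # SD of the prediction error
    return diagnostic, comparative

diag, comp = eb_standard_errors([20, 67], var_u=30.54, var_e=72.24)
print(diag)  # grows with the cluster size
print(comp)  # shrinks with the cluster size
```

The two squared standard errors always add up to the prior variance `var_u`, which is the decomposition written on the slides.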
Basics of the two-level linear model
Case #1: a single covariate at level 1
NELS-88 example: separate OLS analyses
One OLS regression for each of the 10 schools
How to perform an all-in-one analysis?
Example: school effectiveness
• Levels: pupils (level 1); schools (level 2)
• Response variable Y: score on the final test
• Explanatory variable (at level 1) X: score on the initial test
First consider a single school and assume a standard linear model:
$y_i = \beta_0 + \beta_1 x_i + e_i$, with $e_i \overset{iid}{\sim} N(0, \sigma_e^2)$
Comparing the schools
Comparison between schools A and B based on the progress of the pupils (value added)
• School A more effective (higher predicted Y for all the range of X)
• School A more egalitarian (lower slope)
[Figure: regression lines of schools A and B; X = score on initial test, Y = score on final test]
The two-level linear model (one covariate at level 1)
Sample of J schools (from a population of schools)
Level 1 model, equation for the j-th school:
$y_{ij} = \beta_{0j} + \beta_{1j} x_{ij} + e_{ij}$, with $e_{ij} \overset{iid}{\sim} N(0, \sigma_e^2)$
Remark: each school has its own slope and intercept
The two-level linear model (one covariate at level 1)
Each school has a pair of “parameters” $(\beta_{0j}, \beta_{1j})$
Assumption: the “parameters” are iid random variables with a bivariate Normal distribution in the population of schools
$$
\begin{pmatrix} \beta_{0j} \\ \beta_{1j} \end{pmatrix}
\overset{iid}{\sim}
N\left(
\begin{pmatrix} \gamma_{00} \\ \gamma_{10} \end{pmatrix},
\begin{pmatrix} \sigma_{u0}^2 & \sigma_{u01} \\ \sigma_{u01} & \sigma_{u1}^2 \end{pmatrix}
\right)
$$
$(\beta_{0j}, \beta_{1j})$ independent from $e_{ij}$
The Normal distribution is the “default” since it has nice properties and works well in many cases. Other choices are possible, such as a different continuous parametric family or an arbitrary discrete distribution
The two-level linear model (one covariate at level 1)
Model parameters
Fixed parameters:
• $\gamma_{00}$: mean intercept
• $\gamma_{10}$: mean slope
Variance-covariance parameters (also called Random parameters even if they are fixed quantities – ‘random’ just means that they refer to the random part of the model):
• $\sigma_{u0}^2$: intercept variance
• $\sigma_{u1}^2$: slope variance
• $\sigma_{u01}$: slope-intercept covariance
• $\sigma_e^2$: residual variance (level 1)
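The two-level linear model (one covariate at level 1)
Correlation between slopes and intercepts:
$corr(\beta_{0j}, \beta_{1j}) = \frac{\sigma_{u01}}{\sigma_{u0} \, \sigma_{u1}}$
[Figure: scatterplot of the pairs $(\beta_{0j}, \beta_{1j})$, example of negative correlation]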
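The two-level linear model (one covariate at level 1)
Level 1 model: $y_{ij} = \beta_{0j} + \beta_{1j} x_{ij} + e_{ij}$
Level 2 models: $\beta_{0j} = \gamma_{00} + u_{0j}$ and $\beta_{1j} = \gamma_{10} + u_{1j}$
Combined model: $y_{ij} = \gamma_{00} + \gamma_{10} x_{ij} + u_{1j} x_{ij} + u_{0j} + e_{ij}$
Fixed part: $\gamma_{00} + \gamma_{10} x_{ij}$
Random part: $u_{1j} x_{ij} + u_{0j} + e_{ij}$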
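The two-level linear model (one covariate at level 1)
Level 2 errors (random effects):
$u_{0j} = \beta_{0j} - \gamma_{00}$, $Var(u_{0j}) = Var(\beta_{0j}) = \sigma_{u0}^2$
$u_{1j} = \beta_{1j} - \gamma_{10}$, $Var(u_{1j}) = Var(\beta_{1j}) = \sigma_{u1}^2$
Random effect = unexplained deviation of the value of the “parameter” in the j-th cluster from the mean value in the population
The covariates may contribute to explain the deviations (so reducing the corresponding variances)
Usually the distributional assumptions refer to the random effects (rather than to the random intercepts/slopes):
$$
\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix}
\overset{iid}{\sim}
N\left(
\begin{pmatrix} 0 \\ 0 \end{pmatrix},
\begin{pmatrix} \sigma_{u0}^2 & \sigma_{u01} \\ \sigma_{u01} & \sigma_{u1}^2 \end{pmatrix}
\right),
\quad
\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} \text{ indep. from } e_{ij}
$$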
The two-level linear model: variance and covariances
The total error is $u_{0j} + u_{1j} x_{ij} + e_{ij}$, implying
• heteroscedasticity: $Var(y_{ij} \mid x_{ij}) = \sigma_{u0}^2 + 2 \sigma_{u01} x_{ij} + \sigma_{u1}^2 x_{ij}^2 + \sigma_e^2$, where $\sigma_{u0}^2 + 2 \sigma_{u01} x_{ij} + \sigma_{u1}^2 x_{ij}^2$ is the between-cluster variance and $\sigma_e^2$ is the within-cluster variance
• non-homogeneous correlation among the responses of the units of the same cluster: $Cov(y_{ij}, y_{i'j} \mid x_{ij}, x_{i'j}) = \sigma_{u0}^2 + \sigma_{u01} (x_{ij} + x_{i'j}) + \sigma_{u1}^2 x_{ij} x_{i'j}$
• no correlation among the responses of units of different clusters: $Cov(y_{ij}, y_{i'j'} \mid x_{ij}, x_{i'j'}) = 0$
The two-level linear model: variance function
$Var(y_{ij} \mid x_{ij}) = \sigma_{u0}^2 + 2 \sigma_{u01} x_{ij} + \sigma_{u1}^2 x_{ij}^2 + \sigma_e^2$
The variance function is a parabola with minimum in $x_{ij} = -\sigma_{u01} / \sigma_{u1}^2$
Depending on the range of x, in a given application the variance function can be descending, ascending or U-shaped, but never ∩-shaped
Up to now we have assumed that level 1 errors are homoscedastic, but this assumption can be relaxed, so a ∩-shaped relationship can be captured by level 1 heteroscedasticity
[Figure: $Var(y_{ij} \mid x_{ij})$ plotted against $x_{ij}$, with minimum at $-\sigma_{u01} / \sigma_{u1}^2$]
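Two-level model in matrix notation
Example with a single level 1 covariate with random slope
$$
\begin{pmatrix} y_{1j} \\ \vdots \\ y_{n_j j} \end{pmatrix}
=
\begin{pmatrix} 1 & x_{1j} \\ \vdots & \vdots \\ 1 & x_{n_j j} \end{pmatrix}
\begin{pmatrix} \gamma_{00} \\ \gamma_{10} \end{pmatrix}
+
\begin{pmatrix} 1 & x_{1j} \\ \vdots & \vdots \\ 1 & x_{n_j j} \end{pmatrix}
\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix}
+
\begin{pmatrix} e_{1j} \\ \vdots \\ e_{n_j j} \end{pmatrix}
$$
$\mathbf{y}_j = \mathbf{X}_j \boldsymbol{\beta} + \mathbf{Z}_j \mathbf{u}_j + \mathbf{e}_j$
The matrix notation is rarely used, but it is needed for writing estimation algorithms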
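Covariance structure of the two-level model
$y_{ij} = \beta_{0j} + \beta_{1j} x_{ij} + e_{ij}$, $\beta_{0j} = \gamma_{00} + u_{0j}$, $\beta_{1j} = \gamma_{10} + u_{1j}$
$(u_{0j}, u_{1j})' \overset{iid}{\sim} N(\mathbf{0}, \Sigma_u)$, $e_{ij} \sim N(0, \sigma_e^2)$
Covariance matrix of the random effects:
$$
\Sigma_u = \begin{pmatrix} \sigma_{u0}^2 & \sigma_{u01} \\ \sigma_{u01} & \sigma_{u1}^2 \end{pmatrix}
$$
Some special cases of $\Sigma_u$
• Standard (OLS) regression
• Random intercept
• Random (intercept and) slope
Remark: when $\sigma_{u0}^2$ or $\sigma_{u1}^2$ is null the covariance $\sigma_{u01}$ is null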
Standard (OLS) regression model
Special case: $\Sigma_u = 0$
The regression line is the same for all clusters
Fixed intercept and slope: standard regression model
$y_{ij} = \gamma_{00} + \gamma_{10} x_{ij} + e_{ij}$
Homoscedasticity: $Var(y_{ij} \mid x_{ij}) = \sigma_e^2$
No correlation even within clusters: $Cov(y_{ij}, y_{i'j} \mid x_{ij}, x_{i'j}) = 0$
[Figure: a single regression line of y on x]
NELS-88 example: OLS regression
Parameter         | Estimate
------------------+----------
Intercept         |    44.07
Homework          |     3.57
Residual variance |    93.73
Each school has the same intercept and slope (= same line)
Random intercept model
Special case: $\Sigma_u = \begin{pmatrix} \sigma_{u0}^2 & 0 \\ 0 & 0 \end{pmatrix}$
The variance of the slope is null (and so is the slope-intercept covariance)
$y_{ij} = \gamma_{00} + \gamma_{10} x_{ij} + u_{0j} + e_{ij}$
The regression lines are parallel: the clusters can be ranked
The variance of the intercept does not depend on X (i.e. centering X is irrelevant)
Homoscedasticity: $Var(y_{ij} \mid x_{ij}) = \sigma_{u0}^2 + \sigma_e^2$
Equi-correlation within clusters: $Cov(y_{ij}, y_{i'j} \mid x_{ij}, x_{i'j}) = \sigma_{u0}^2$
[Figure: parallel regression lines of y on x]
NELS-88 example: random intercept model
Parameter                | Estimate
-------------------------+----------
Intercept                |    44.98   (mean intercept in the population of schools)
Homework                 |     2.21   (each school has this estimated slope)
Residual lev. 2 variance |    22.50
Residual lev. 1 variance |    64.26
Total residual variance = 22.50 + 64.26 = 86.76
Residual ICC = 22.50 / 86.76 = 0.259
25.9% of the residual variance of math scores after adjusting for homework is due to the clustering of pupils into schools
Random (intercept and) slope model
General case: $\Sigma_u = \begin{pmatrix} \sigma_{u0}^2 & \sigma_{u01} \\ \sigma_{u01} & \sigma_{u1}^2 \end{pmatrix}$
$y_{ij} = \gamma_{00} + \gamma_{10} x_{ij} + u_{1j} x_{ij} + u_{0j} + e_{ij}$
Many crossing points: the clusters cannot be ranked
The between-school variance is a quadratic function of X
Heterogeneous correlation within clusters (no unique residual ICC)
The intercept variance and the slope-intercept covariance depend on X since they refer to X=0: it is useful to center X
When the origin of X is arbitrary (which is often the case), the covariance should not be constrained to be zero
[Figure: crossing regression lines of y on x]
NELS-88 example: random slope model
Parameter                 | Estimate
--------------------------+----------
Intercept                 |    44.77
Homework                  |     2.05   (mean slope in the population of schools)
Residual lev. 2 var/cov.  |
  Intercept var.          |    61.81
  Homework var.           |    19.98
  Intercept-Homework cov. |   -28.26   (it amounts to a correlation of -0.80)
Residual lev. 1 variance  |    43.07
Here the ICC is meaningless
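The correlation implied by the estimated covariance, and the shape of the variance function $Var(y_{ij} \mid x_{ij}) = \sigma_{u0}^2 + 2 \sigma_{u01} x_{ij} + \sigma_{u1}^2 x_{ij}^2 + \sigma_e^2$, can be checked with a few lines of Python. The variable names are hypothetical; the numbers are the estimates of the NELS-88 random slope model.

```python
import math

# Estimates from the NELS-88 random slope model
var_u0, var_u1, cov_u01, var_e = 61.81, 19.98, -28.26, 43.07

def total_variance(x):
    """Var(y | x) = var_u0 + 2 * cov_u01 * x + var_u1 * x^2 + var_e."""
    return var_u0 + 2 * cov_u01 * x + var_u1 * x ** 2 + var_e

# Correlation between random intercepts and random slopes
corr = cov_u01 / math.sqrt(var_u0 * var_u1)
# Value of homework at which the variance function is minimal
x_min = -cov_u01 / var_u1

print(round(corr, 2))   # -0.8
print(round(x_min, 2))  # 1.41
print(total_variance(0), total_variance(x_min), total_variance(7))
```

Since homework ranges from 0 to 7 in these data, the minimum falls inside the observed range, so the estimated variance function is U-shaped here.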
Random (intercept and) slope model
What is the unit of measure of the variances/covariances?
mY = unit of measure of Y (e.g. Kilograms)
mX = unit of measure of X (e.g. Metres)
• $\sigma_{u0}$ is expressed in mY (just as $\beta_0$)
• $\sigma_{u1}$ is expressed in mY/mX (just as $\beta_1$)
• $\sigma_{u01}$ is expressed in (mY)²/mX
Be careful in interpreting standard deviations, variances and covariances of the random effects
Random coefficient on a binary covariate
Usually a random coefficient refers to a continuous covariate (‘random slope’)
A binary covariate may also have a random coefficient, but the interpretation is different
Suppose d is a binary covariate (or dummy variable), then, since $d_{ij}^2 = d_{ij}$,
$Var(y_{ij} \mid d_{ij}) = \sigma_{u0}^2 + 2 \sigma_{u01} d_{ij} + \sigma_{u1}^2 d_{ij}^2 + \sigma_e^2 = \sigma_{u0}^2 + (2 \sigma_{u01} + \sigma_{u1}^2) d_{ij} + \sigma_e^2$
Thus the between-cluster variance is heteroscedastic: it equals $\sigma_{u0}^2$ for $d_{ij} = 0$ and $\sigma_{u0}^2 + 2 \sigma_{u01} + \sigma_{u1}^2$ for $d_{ij} = 1$
Basics of the two-level linear model
Case #2: introduction of a covariate at level 2
The two-level linear model (one covariate at level 1 + one covariate at level 2)
Introduction of level 2 covariates: level 2 covariates represent features of the clusters useful to define a model for the level 1 parameters $(\beta_{0j}, \beta_{1j})$ and so reduce the level 2 variances $(\sigma_{u0}^2, \sigma_{u1}^2)$
Example: W is a binary variable coded 1=public school; 0=private school
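The two-level linear model (one covariate at level 1 + one covariate at level 2)
Level 1 model: $y_{ij} = \beta_{0j} + \beta_{1j} x_{ij} + e_{ij}$
Level 2 models: $\beta_{0j} = \gamma_{00} + \gamma_{01} w_j + u_{0j}$ and $\beta_{1j} = \gamma_{10} + \gamma_{11} w_j + u_{1j}$
Combined model: $y_{ij} = \gamma_{00} + \gamma_{01} w_j + \gamma_{10} x_{ij} + \gamma_{11} w_j x_{ij} + u_{0j} + u_{1j} x_{ij} + e_{ij}$
Fixed part: $\gamma_{00} + \gamma_{01} w_j + \gamma_{10} x_{ij} + \gamma_{11} w_j x_{ij}$
Random part: $u_{0j} + u_{1j} x_{ij} + e_{ij}$
Here it becomes clear why the $\gamma$ have a double index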
The two-level linear model (one covariate at level 1 + one covariate at level 2)
Level 2 models: $\beta_{0j} = \gamma_{00} + \gamma_{01} w_j + u_{0j}$ and $\beta_{1j} = \gamma_{10} + \gamma_{11} w_j + u_{1j}$
• $\gamma_{01}$: mean difference in intercept between private and public school
• $\gamma_{11}$: mean difference in slope between private and public school
• $u_{0j}$: deviation of school j from the corresponding mean intercept
• $u_{1j}$: deviation of school j from the corresponding mean slope
$Var(u_{0j}) = \sigma_{u0}^2$, $Var(u_{1j}) = \sigma_{u1}^2$
Remark: the distributional assumptions on the random effects are the same as before, but now the variances have a different meaning (recall: the variances are residual w.r.t. the model covariates)
The two-level linear model (one covariate at level 1 + one covariate at level 2)
In the combined model there is a cross-level interaction, $\gamma_{11} w_j x_{ij}$
It arises because the level 1 coefficient $\beta_{1j}$ depends on the level 2 covariate $w_j$
A multilevel model can be written in two alternative ways:
a) a single combined equation (like most software)
b) a system of hierarchical equations (like the software named HLM)
Those who use approach b) usually end up with a more complex model (notably, with more cross-level interactions)
Example: random slope model with the covariate public (NO cross-level interaction)
Parameter                 | Estimate
--------------------------+----------
Intercept                 |    58.06   (mean intercept of non-public schools)
Homework                  |     1.94   (mean slope of schools, regardless of public/non-public)
Public                    |   -14.65   (difference in the mean intercept, public vs. non-public)
Residual lev. 2 var/cov.  |
  Intercept var.          |    40.68
  Homework var.           |    21.68
  Intercept-Homework cov. |   -29.16
Residual lev. 1 variance  |    42.95
The effect of homework is assumed to be the same for public and non-public schools
Example: random slope model with the covariate public AND cross-level interaction
Parameter                 | Estimate
--------------------------+----------
Intercept                 |    59.21
Homework                  |     1.09   (mean slope of non-public schools)
Public                    |   -15.94
Homework*Public           |     0.95   (difference in the mean slope, public vs. non-public)
Residual lev. 2 var/cov.  |
  Intercept var.          |    40.50
  Homework var.           |    21.58
  Intercept-Homework cov. |   -29.02
Residual lev. 1 variance  |    42.95
The mean slope of homework in the population of schools is 1.09 for non-public schools and 1.09 + 0.95 = 2.04 for public schools
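To see how the cross-level interaction enters the fixed part, the estimates of this model can be plugged into a short Python sketch. The names `g00`, `g10`, `g01`, `g11`, `mean_slope` and `fixed_part` are hypothetical labels for $\gamma_{00}$, $\gamma_{10}$, $\gamma_{01}$, $\gamma_{11}$ and the derived quantities.

```python
# Fixed-part estimates of the model with cross-level interaction
g00, g10, g01, g11 = 59.21, 1.09, -15.94, 0.95

def mean_slope(public):
    """Mean slope of homework for a school type (public = 0 or 1)."""
    return g10 + g11 * public

def fixed_part(homework, public):
    """Predicted math score from the fixed part only (random effects set to 0)."""
    return g00 + g01 * public + mean_slope(public) * homework

print(round(mean_slope(0), 2), round(mean_slope(1), 2))  # 1.09 2.04
print(round(fixed_part(2, 0), 2))  # non-public school, 2 hours of homework
print(round(fixed_part(2, 1), 2))  # public school, 2 hours of homework
```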
Within, between and contextual effects
• Slopes: between, within and total
• Centering the covariates
• The contextual effect
• The fixed effects model
Three regression models
Example from Snijders & Bosker (2011)
 j   i   X_ij   X_.j   Y_ij   Y_.j
 1   1      1      2      5      6
 1   2      3      2      7      6
 2   1      2      3      4      5
 2   2      4      3      6      5
 3   1      3      4      3      4
 3   2      5      4      5      4
 4   1      4      5      2      3
 4   2      6      5      4      3
 5   1      5      6      1      2
 5   2      7      6      3      2
[Figure: scatterplot of Y against X (both axes from 0 to 8) with the Total, Within and Between regression lines]
Difference Between-Within: “Ecological fallacy”
Real data example: i = graduate, j = faculty, Y = employability, X = graduation mark
Regression models for estimating within and between relationships
Total: $\hat{y}_{ij} = 5.33 - 0.33 \, x_{ij}$
Between cluster means: $\hat{\bar{y}}_{.j} = 8.00 - 1.00 \, \bar{x}_{.j}$
Within clusters: $\hat{y}_{ij} = \bar{y}_{.j} + 1.00 \, (x_{ij} - \bar{x}_{.j})$
Multilevel: $\hat{y}_{ij} = 8.00 - 1.00 \, \bar{x}_{.j} + 1.00 \, (x_{ij} - \bar{x}_{.j})$
The multilevel regression model allows us to study between and within relationships at the same time
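The three sets of coefficients can be reproduced from the ten records of the Snijders & Bosker toy example with ordinary least squares. The sketch below uses `numpy.polyfit`; the variable names and the helper `cluster_means` are hypothetical.

```python
import numpy as np

# Toy data: 5 clusters with 2 units each
cluster = np.repeat(np.arange(5), 2)
x = np.array([1, 3, 2, 4, 3, 5, 4, 6, 5, 7], dtype=float)
y = np.array([5, 7, 4, 6, 3, 5, 2, 4, 1, 3], dtype=float)

def cluster_means(v):
    """Mean of v within each cluster (one value per cluster)."""
    return np.bincount(cluster, weights=v) / np.bincount(cluster)

x_bar, y_bar = cluster_means(x), cluster_means(y)

# Total regression: all units pooled, ignoring the clusters
total_slope, total_intercept = np.polyfit(x, y, 1)
# Between regression: cluster means of y on cluster means of x
between_slope, between_intercept = np.polyfit(x_bar, y_bar, 1)
# Within regression: deviations from the cluster means
within_slope = np.polyfit(x - x_bar[cluster], y - y_bar[cluster], 1)[0]

print(round(total_intercept, 2), round(total_slope, 2))      # 5.33 -0.33
print(round(between_intercept, 2), round(between_slope, 2))  # 8.0 -1.0
print(round(within_slope, 2))                                # 1.0
```

The between and within slopes have opposite signs, and the total slope lies between them.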
Centering a covariate
A covariate can be centered w.r.t.
• a given constant, such as the grand mean: this affects the intercept and (in a random slope model) the intercept variance and the intercept-slope covariance
• the cluster mean (CM centering), so if the cluster means are different the centering varies from cluster to cluster: this affects the slope (total effect vs. within effect)
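The Cronbach model
Cronbach model (CM centering & cluster mean): $y_{ij} = \gamma_{00} + \gamma_{10} (x_{ij} - \bar{x}_{.j}) + \gamma_{01} \bar{x}_{.j} + u_j + e_{ij}$
Between slope $\gamma_{01}$: $\bar{y}_{.j} = \gamma_{00} + \gamma_{01} \bar{x}_{.j} + u_j + \bar{e}_{.j}$
Within slope $\gamma_{10}$: $y_{ij} - \bar{y}_{.j} = \gamma_{10} (x_{ij} - \bar{x}_{.j}) + (e_{ij} - \bar{e}_{.j})$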
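The contextual model
‘Contextual’ model (no CM centering, but cluster mean): $y_{ij} = \gamma_{00}^* + \gamma_{10}^* x_{ij} + \gamma_{01}^* \bar{x}_{.j} + u_j + e_{ij}$
Replacing $x_{ij}$ with $(x_{ij} - \bar{x}_{.j}) + \bar{x}_{.j}$ yields
$y_{ij} = \gamma_{00}^* + \gamma_{10}^* (x_{ij} - \bar{x}_{.j}) + (\gamma_{10}^* + \gamma_{01}^*) \bar{x}_{.j} + u_j + e_{ij}$
$\gamma_{10}^* = \gamma_{10}$ = within slope
$\gamma_{01}^* = \gamma_{01} - \gamma_{10}$ = between slope − within slope
Just a reparameterization of the Cronbach model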
The contextual effect
$\gamma_{01} - \gamma_{10}$ = between slope − within slope
In sociology and education it is known as the contextual effect
It is the additional effect of the school mean of X on Y that is not accounted for by the individual level X
Usually X is prior score or SES (Socio-Economic Status)
The estimate of the contextual effect of X will partially encompass the effects of all school level variables that are correlated with X, including
• peer influences
• school climate
• allocation of resources
• organizational and structural features of schools
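Interpreting the three effects: within
$y_{ij} = \gamma_{00} + \gamma_{10} (x_{ij} - \bar{x}_{.j}) + \gamma_{01} \bar{x}_{.j} + u_j + e_{ij}$
Let X=prior score; Y=final score
Sam has X=80 and attends a school with sch_mean(X)=70
Within effect: expected Y for Sam vs expected Y for another pupil with X=81 and sch_mean(X)=70
$E(y_{ij} \mid x_{ij} = 81, \bar{x}_{.j} = 70) - E(y_{ij} \mid x_{ij} = 80, \bar{x}_{.j} = 70) = (\gamma_{00} + 11 \gamma_{10} + 70 \gamma_{01}) - (\gamma_{00} + 10 \gamma_{10} + 70 \gamma_{01}) = \gamma_{10}$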
Interpreting the three effects: between
$y_{ij} = \gamma_{00} + \gamma_{10} (x_{ij} - \bar{x}_{.j}) + \gamma_{01} \bar{x}_{.j} + u_j + e_{ij}$
Let X=prior score; Y=final score
Sam has X=80 and attends a school with sch_mean(X)=70
Between effect: expected Y for Sam vs expected Y for another pupil with X=81 and sch_mean(X)=71
$E(y_{ij} \mid x_{ij} = 81, \bar{x}_{.j} = 71) - E(y_{ij} \mid x_{ij} = 80, \bar{x}_{.j} = 70) = (\gamma_{00} + 10 \gamma_{10} + 71 \gamma_{01}) - (\gamma_{00} + 10 \gamma_{10} + 70 \gamma_{01}) = \gamma_{01}$
Between effect: you increase the school mean score by 1, and you also increase the individual score by 1 in order to leave the deviation unchanged (the pupil to be compared with Sam has the same relative position, namely 10 points above the school mean)
Interpreting the three effects: contextual
Let X = prior score; Y = final score
Sam has X=80 and attends a school with sch_mean(X)=70
Contextual effect: expected Y for Sam vs expected Y for another pupil with X=80 and sch_mean(X)=71
$$y_{ij} = \gamma_{00} + \gamma_{10}(x_{ij} - \bar{x}_{.j}) + \gamma_{01}\bar{x}_{.j} + u_j + e_{ij}$$
$$E(y_{ij} \mid x_{ij}=80, \bar{x}_{.j}=71) - E(y_{ij} \mid x_{ij}=80, \bar{x}_{.j}=70) = (\gamma_{00} + 9\gamma_{10} + 71\gamma_{01}) - (\gamma_{00} + 10\gamma_{10} + 70\gamma_{01}) = \gamma_{01} - \gamma_{10}$$
Contextual effect: you increase the school mean score by 1 and leave the individual score unchanged (this changes the deviation: the pupil to be compared with Sam has a different relative position, namely 9 points above the school mean instead of 10)
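The three comparisons with Sam can be reproduced with a few lines of arithmetic. This is only a sketch: the coefficient values are hypothetical, and `expected_y` is an illustrative helper that evaluates the fixed part of the Cronbach model.

```python
def expected_y(x, school_mean, g00, g10, g01):
    """Fixed part of the Cronbach model (random effect and error have mean 0)."""
    return g00 + g10 * (x - school_mean) + g01 * school_mean


# Hypothetical coefficients: within slope g10, between slope g01.
g00, g10, g01 = 10.0, 0.5, 0.75

sam = expected_y(80, 70, g00, g10, g01)
within = expected_y(81, 70, g00, g10, g01) - sam      # same school mean
between = expected_y(81, 71, g00, g10, g01) - sam     # same deviation from the mean
contextual = expected_y(80, 71, g00, g10, g01) - sam  # same individual score

print(within, between, contextual)  # 0.5 0.75 0.25
```

With these values the contextual effect equals the between slope minus the within slope, 0.75 - 0.5 = 0.25.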
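The raw covariate model
'Raw covariate' model: no CM centering, no cluster mean
$$y_{ij} = \gamma_{00} + \gamma_{10} x_{ij} + u_j + e_{ij}$$
$$\bar{y}_{.j} = \gamma_{00} + \gamma_{10}\bar{x}_{.j} + u_j + \bar{e}_{.j}$$
$$y_{ij} - \bar{y}_{.j} = \gamma_{10}(x_{ij} - \bar{x}_{.j}) + (e_{ij} - \bar{e}_{.j})$$
This model implicitly assumes that the between and within slopes are identical!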
Some model specifications
1) $y_{ij} = \dots + \gamma_{total}\, x_{ij}$
2) $y_{ij} = \dots + \gamma_{within}\, x_{ij} + (\gamma_{between} - \gamma_{within})\, \bar{x}_{.j}$
3) $y_{ij} = \dots + \gamma_{within}\, (x_{ij} - \bar{x}_{.j}) + \gamma_{between}\, \bar{x}_{.j}$
4) $y_{ij} = \dots + \gamma_{within}\, (x_{ij} - \bar{x}_{.j})$
Models 2 and 3 are statistically equivalent.
• Models 1 to 3 try to fully control for X, while model 4 controls only for the within effect of X
• In most applications, within and between slopes are quite different, so model 1 is wrong (though it is the most parsimonious model controlling for X)
• In models with many covariates the parsimony principle may suggest disentangling the slopes only for the covariates of main interest
More details in the textbooks of Raudenbush & Bryk and Snijders & Bosker
References on centering and within/between/contextual effects
Snijders TAB and Bosker RJ (2011) Multilevel analysis: An introduction to basic and advanced multilevel modeling. 2nd ed. London: Sage.
Raudenbush SW and Bryk AS (2002) Hierarchical Linear Models (Second Edition). Thousand Oaks: Sage.
Kreft IGG, de Leeuw J and Aiken L (1995) The effect of different forms of centering in hierarchical linear models. Multivariate Behavioral Research, 1-21.
Paccagnella O (2006) Centering or not centering in multilevel models: The Role of the Group Mean and the Assessment of Group Effects. Evaluation Review 30, 66-85.
The fixed effects model
$$y_{ij} = \beta x_{ij} + \alpha_j + e_{ij}$$
The random effects $u_j$ are replaced by parameters $\alpha_j$. Thus no distributional assumptions!
• No distributional assumptions on the cluster effects: no need to worry about homoscedasticity, normality, or correlation between random effects and covariates
• The slope is not the total effect, but the within effect (in panel data the corresponding estimator is known as the fixed effects estimator): in fact, all the between variation is absorbed by the fixed effects, so the covariates can only explain the within variation
Fixed vs random effects
• Random effects is the standard choice in most fields (Epidemiology, Sociology, Psychometrics …), while in Econometrics the standard choice is fixed effects [e.g. Rivkin S.G., Hanushek E.A., Kain J.F. (2005) Teachers, schools, and academic achievement. Econometrica, 73, 417-458.]
• Fixed effects have the merit of avoiding assumptions on the random effects, and they can be used even with very few clusters (e.g. 5 clusters). But they entail several limitations:
  - Impossible to use level 2 covariates, a dramatic limitation in the (frequent) case where the research questions concern the effect of level 2 covariates!
  - Loss of efficiency (since number of fixed effects = number of clusters)
  - Inefficient estimation of cluster effects (for example, if a cluster has two units its fixed effect is estimated with just two observations)
Inference in two-level models
Parameter estimation
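$$y_{ij} = \underbrace{\gamma_{00} + \gamma_{01} w_j + \gamma_{10} x_{ij} + \gamma_{11} w_j x_{ij}}_{\text{fixed part}} + \underbrace{u_{0j} + u_{1j} x_{ij} + e_{ij}}_{\text{random part}}$$
Variance-covariance parameters: $\sigma^2_e, \sigma^2_{u0}, \sigma^2_{u1}, \sigma_{u01}$
• Maximum likelihood
  - step 1: estimation of the fixed parameters $\gamma_{00}, \gamma_{01}, \gamma_{10}, \gamma_{11}$ and of the variance-covariance parameters
  - step 2: prediction of the random effects ($u_{0j}, u_{1j}$, j=1,…,J)
• Bayesian inference: the parameters are random variables with a prior distribution, so parameters and random effects are all random variables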
OLS estimation
What about estimating the fixed parameters using OLS (Ordinary Least Squares) or ML under the standard assumptions of the linear model?
$$y_{ij} = \gamma_{00} + \gamma_{01} w_j + \gamma_{10} x_{ij} + \gamma_{11} w_j x_{ij} + e^{*}_{ij}$$
The model is linear but the error term
$$e^{*}_{ij} = u_{0j} + u_{1j} x_{ij} + e_{ij}$$
violates the homoscedasticity and no-correlation assumptions. Consequences:
• Inefficient estimators
• Biased standard errors
OLS standard errors
Consider a random intercept model with a single covariate
$$y_{ij} = \alpha + \beta x_{ij} + u_j + e_{ij}$$
Fitting the model with OLS (i.e. omitting the random effects) leads to a wrong standard error (s.e.) for the covariate:
• Level 2 covariate (purely between, i.e. constant within clusters): the OLS s.e. is (substantially) too low
• Purely within level 1 covariate (it varies only within clusters): the OLS s.e. is (slightly) too high
• Level 1 covariate with between variation (it varies both within and between clusters): the OLS s.e. is the result of two opposite effects, since the between part pushes it down and the within part pushes it up, but in practice the OLS s.e. is nearly always too low
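The first case (a level 2 covariate) can be illustrated with a small simulation. This is a sketch under assumed settings that are not taken from the slides: 30 clusters of size 10, unit variances at both levels, and a hypothetical slope of 0.5. It compares the conventional OLS standard error with the actual sampling variability of the OLS slope.

```python
import random
from statistics import mean, stdev


def ols_slope_and_se(x, y):
    """Simple-regression OLS slope and its conventional (i.i.d.-error) s.e."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return slope, (rss / (n - 2) / sxx) ** 0.5


def simulate(n_rep=300, J=30, n=10, sd_u=1.0, sd_e=1.0, seed=1):
    rng = random.Random(seed)
    w = [rng.gauss(0, 1) for _ in range(J)]   # level 2 covariate, fixed design
    slopes, naive_ses = [], []
    for _ in range(n_rep):
        x, y = [], []
        for j in range(J):
            u_j = rng.gauss(0, sd_u)          # random intercept of cluster j
            for _ in range(n):
                x.append(w[j])
                y.append(1.0 + 0.5 * w[j] + u_j + rng.gauss(0, sd_e))
        b, se = ols_slope_and_se(x, y)
        slopes.append(b)
        naive_ses.append(se)
    return stdev(slopes), mean(naive_ses)


true_sd, naive_se = simulate()
print(f"empirical s.d. of the OLS slope: {true_sd:.3f}")
print(f"average OLS standard error:      {naive_se:.3f}")
```

Under these settings the intraclass correlation is 0.5, so the empirical standard deviation should be roughly sqrt(1 + 9 × 0.5) ≈ 2.3 times the reported OLS standard error.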
Likelihood
The equation defining a random effects model includes the random effects, but they are not observable (so they cannot appear in the likelihood): the random effects must be integrated out!
$$L(\theta) = \int f(y \mid u, \theta)\, p(u \mid \theta)\, du$$
where
• $f(y \mid u, \theta)$ is the distribution of responses, conditional on random effects and parameters
• $p(u \mid \theta)$ is the distribution of random effects, conditional on parameters
Problem: the integral has an analytical solution only for conjugate distributions (e.g. Normal-Normal, Binomial-Beta, …)
Closed-form likelihood
• Multiple random effects -> multivariate distribution -> the Normal distribution is preferable (and in fact it is the standard choice in applications)
• Linear model: Normal-Normal (conjugate) -> the integral has an analytical solution -> closed-form likelihood
• Non-linear model: e.g. Binomial-Normal (non-conjugate) -> the likelihood must be evaluated through approximate integration methods
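For the Normal-Normal case the agreement between "integrating out" and the closed form can be checked numerically. The sketch below is illustrative and makes several assumptions: a random intercept model for a single cluster, hypothetical residuals and variances, and a plain grid sum standing in for a proper integration routine. The closed form uses the standard inverse and determinant of a covariance matrix of the form var_e·I + var_u·11'.

```python
import math


def normal_pdf(z, mean, var):
    return math.exp(-(z - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def marginal_lik_numeric(resid, var_u, var_e, half_width=10.0, steps=4000):
    """Integrate out u_j on a grid: sum over u of prod_i f(r_i | u) p(u) du."""
    du = 2 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        u = -half_width + k * du
        cond = 1.0
        for r in resid:                      # conditional independence given u
            cond *= normal_pdf(r, u, var_e)
        total += cond * normal_pdf(u, 0.0, var_u) * du
    return total


def marginal_lik_closed_form(resid, var_u, var_e):
    """Multivariate Normal density with covariance var_e*I + var_u*11'."""
    n = len(resid)
    s1, s2 = sum(resid), sum(r * r for r in resid)
    quad = (s2 - var_u * s1 ** 2 / (var_e + n * var_u)) / var_e
    log_det = (n - 1) * math.log(var_e) + math.log(var_e + n * var_u)
    return math.exp(-0.5 * (n * math.log(2 * math.pi) + log_det + quad))


# Hypothetical residuals y_ij - (alpha + beta * x_ij) for one cluster.
resid = [0.3, -0.2, 1.1, 0.6]
numeric = marginal_lik_numeric(resid, 0.5, 1.0)
closed = marginal_lik_closed_form(resid, 0.5, 1.0)
print(numeric, closed)
```

For a non-conjugate pair (e.g. Binomial-Normal) only the numerical route is available, which is why approximate integration methods are needed there.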
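Maximum Likelihood
$$\hat{\theta}_{ML} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \int f(y \mid u, \theta)\, p(u \mid \theta)\, du$$
Once the ML estimates of fixed and random parameters have been obtained
-> Empirical Bayes estimation of the random effects
$$p(u \mid y, \hat{\theta}_{ML}) = \frac{f(y \mid u, \hat{\theta}_{ML})\, p(u \mid \hat{\theta}_{ML})}{\int f(y \mid u, \hat{\theta}_{ML})\, p(u \mid \hat{\theta}_{ML})\, du}$$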
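In the linear random intercept model with Normal distributions, the Empirical Bayes posterior of u_j is itself Normal and its mean is a shrunken version of the mean residual of the cluster. The sketch below uses that standard result; the function name, the residuals and the variance values are hypothetical, and the parameter estimates are treated as known.

```python
def eb_prediction(resid, var_u, var_e):
    """Posterior mean and variance of u_j given the residuals of cluster j.

    resid: y_ij - (alpha_hat + beta_hat * x_ij) for the units of one cluster.
    """
    n = len(resid)
    shrinkage = var_u / (var_u + var_e / n)   # between 0 and 1
    post_mean = shrinkage * sum(resid) / n    # shrunken mean residual
    post_var = (1 - shrinkage) * var_u        # posterior variance of u_j
    return post_mean, post_var


# Hypothetical estimates and residuals: same mean residual, different cluster size.
small = eb_prediction([0.8, 1.2], var_u=0.5, var_e=1.0)
large = eb_prediction([0.8, 1.2] * 10, var_u=0.5, var_e=1.0)
print(small)
print(large)
```

The small cluster is shrunk more strongly towards zero and has a larger posterior variance, which is why cluster size matters for cluster-specific inference.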
Bayesian inference
For Bayesian inference a prior distribution $p(\theta)$ is defined and the joint posterior distribution is used
$$p(u, \theta \mid y) = \frac{f(y \mid u, \theta)\, p(u \mid \theta)\, p(\theta)}{\iint f(y \mid u, \theta)\, p(u \mid \theta)\, p(\theta)\, du\, d\theta}$$
Inference on the parameters:
$$p(\theta \mid y) = \int p(\theta, u \mid y)\, du$$
Inference on the random effects:
$$p(u \mid y) = \int p(\theta, u \mid y)\, d\theta$$
Bayesian inference: pros and cons /1
• Even for the linear model it is necessary to use approximate integration algorithms (e.g. Gibbs sampling)
• The Bayesian approach has the usual disadvantages, whereas the main advantages here are:
  - Good estimates even for a small number of clusters
  - In complex multilevel models the estimates properly account for all the uncertainty: Bayesian methods yield good estimates of the variance components and confidence intervals with appropriate coverage even in highly complex models, where ML methods show a poor performance
Bayesian inference: pros and cons /2
Seltzer, M.H., Wong W.H., and Bryk A.S. (1996) Bayesian Analysis in Applications of Hierarchical Models: Issues and Methods. Journal of Educational and Behavioral Statistics 21(2): 131-167.
Bian, Guarui (2002) Bayesian Estimates in a One-Way ANOVA Random Effects Model. Australian & New Zealand Journal of Statistics 44(1): 99-108. (simulation: 3 clusters of size 2)
Browne W.J. & Draper D. (2006) A comparison of Bayesian and likelihood-based methods for fitting multilevel models. Bayesian Analysis, 1, 473-514. http://ba.stat.cmu.edu/journal/2006/vol01/issue03/draper2.pdf
Hierarchical specification of the model (useful to write down the likelihood)
The random intercept model
$$y_{ij} = \alpha + \beta x_{ij} + u_j + e_{ij}$$
can also be written hierarchically (which is the standard way to specify multilevel GLMs)
$$1.\quad y_{ij} \mid x_{ij}, u_j \overset{ind}{\sim} N(\alpha + \beta x_{ij} + u_j,\ \sigma^2_e)$$
$$2.\quad u_j \overset{iid}{\sim} N(0,\ \sigma^2_u)$$
Independence stems from conditioning on the random effects
Remark: in the hierarchical specification the level 1 errors $e_{ij}$ are not written (only their variance)
Hierarchical construction of the likelihood for the random intercept model
The likelihood can be written in steps by exploiting the conditional independencies shown by the hierarchical formulation
$$L_j(\psi \mid u_j) = \prod_{i=1}^{n_j} L_{ij}(\psi \mid u_j) \qquad \text{j-th cluster conditional likelihood (conditional independence given } u_j)$$
$$L_j(\theta) = \int L_j(\psi \mid u_j)\, p(u_j \mid \tau)\, du_j \qquad \text{j-th cluster marginal likelihood}$$
$$L(\theta) = \prod_{j=1}^{J} L_j(\theta) \qquad \text{marginal likelihood (independent clusters)}$$
where
• $\theta = (\psi, \tau)$, with $\tau = \sigma^2_u$ the parameter of the random effects and $\psi = (\alpha, \beta, \sigma^2_e)$ all other parameters
• $L_{ij}(\psi \mid u_j)$ is the density of $N(\alpha + \beta x_{ij} + u_j,\ \sigma^2_e)$, for fixed $u_j$, evaluated at the observed $y_{ij}$
FIML and REML
• Full Information Maximum Likelihood (FIML)
  - Full likelihood, joint estimation of fixed and random parameters
  - Underestimates the random parameters since it treats the fixed parameters as known quantities (ignoring degrees of freedom)
• Restricted Maximum Likelihood (REML)
  - The random parameters are estimated by maximizing the restricted likelihood, i.e. the density of the residuals
  - The estimators of the random parameters are approximately unbiased even in small samples
Degrees of freedom
To understand why FIML underestimates the random parameters, recall what happens in standard linear regression
$$y_i = \beta' X_i + e_i, \qquad Var(e_i) = \sigma^2, \qquad \hat{e}_i = y_i - \hat{\beta}' X_i$$
$$\hat{\sigma}^2_{FIML} = \frac{1}{n} \sum_i \hat{e}_i^2, \qquad \hat{\sigma}^2_{OLS} = \frac{1}{n-k} \sum_i \hat{e}_i^2, \qquad E(\hat{\sigma}^2_{OLS}) = \sigma^2$$
If k is large w.r.t. n, then $\hat{\sigma}^2_{FIML}$ is severely downward biased (it does not correct for the degrees of freedom lost in estimating $\beta$)
In a multilevel model the sample size which is relevant for the estimation of the level 2 variance is J, i.e. the number of clusters
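The degrees-of-freedom correction can be seen in a minimal regression sketch. The data below are simulated with hypothetical coefficients, and k = 2 because the model has an intercept and one covariate.

```python
import random


def residual_variances(x, y):
    """ML (divide by n) and OLS (divide by n - k) residual variance estimates
    for a simple regression with intercept and one covariate (k = 2)."""
    n, k = len(x), 2
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return rss / n, rss / (n - k)


rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(8)]
y = [1.0 + 2.0 * xi + rng.gauss(0, 1) for xi in x]

var_ml, var_ols = residual_variances(x, y)
# The ratio of the two estimates is (n - k) / n = 6 / 8.
print(var_ml, var_ols, var_ml / var_ols)
```

With n = 8 the ML estimate is 25% smaller than the unbiased one; the analogous loss for a level 2 variance is governed by the number of clusters J rather than by the number of level 1 units.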
FIML vs REML
• In a two-level model, REML and FIML lead to:
  - Similar estimates for the level 1 variance
  - Discordant estimates for the parameters of the random effects if the number of clusters J is small (in such a case FIML estimates of variances are lower)
• Unless the main aim is the estimation of the random parameters, FIML is preferred because:
  - FIML estimators have a lower sampling variance (the comparison in terms of MSE is often in favour of FIML estimators)
  - With FIML the LRT (Likelihood Ratio Test) can be used not only to test the random parameters, but also to test the fixed parameters
FIML algorithms
• Iterative Generalized Least Squares (Goldstein 1986)
• Fisher Scoring (Longford 1987)
• EM (developed by Dempster, Laird and Rubin for models with missing data, later applied to multilevel models since the random effects are, in a sense, missing data)
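Generalized Least Squares (GLS)
The available estimation procedures differ in the step for estimating the variance-covariance parameters, while the step for estimating the fixed parameters is always Generalized Least Squares (GLS)
$$\begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{pmatrix} = \left( \sum_{j=1}^{J} X_j^T \Omega_j^{-1} X_j \right)^{-1} \sum_{j=1}^{J} X_j^T \Omega_j^{-1} y_j$$
where, for cluster j with $n_j$ units,
$$y_j = \begin{pmatrix} y_{1j} \\ \vdots \\ y_{n_j j} \end{pmatrix}, \qquad X_j = \begin{pmatrix} 1 & x_{1j} \\ \vdots & \vdots \\ 1 & x_{n_j j} \end{pmatrix}, \qquad \Omega_j = \begin{pmatrix} \sigma^2_u + \sigma^2_e & \cdots & \sigma^2_u \\ \vdots & \ddots & \vdots \\ \sigma^2_u & \cdots & \sigma^2_u + \sigma^2_e \end{pmatrix}$$
• Replacing the variances with consistent estimates leads to a feasible GLS
• Maximum likelihood estimators are an instance of feasible GLS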
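The GLS step for a random intercept model with one covariate can be sketched without matrix libraries, because the inverse of a covariance matrix of the form var_e·I + var_u·11' has a simple closed form. Everything below is illustrative: the function name, the data and the variance values are hypothetical, and the variances are treated as known (plugging in estimates would give a feasible GLS).

```python
def gls_random_intercept(clusters, var_u, var_e):
    """GLS for y_ij = b0 + b1 * x_ij + u_j + e_ij with known variances.

    clusters: list of (x_values, y_values) pairs, one pair per cluster.
    Uses Omega_j^{-1} v = (v - c_j * sum(v)) / var_e,
    with c_j = var_u / (var_e + n_j * var_u).
    """
    a00 = a01 = a11 = b0 = b1 = 0.0
    for xs, ys in clusters:
        n = len(xs)
        c = var_u / (var_e + n * var_u)
        sx, sy = sum(xs), sum(ys)
        # Entries of X_j' Omega_j^{-1} X_j and X_j' Omega_j^{-1} y_j
        a00 += (n - c * n * n) / var_e
        a01 += (sx - c * n * sx) / var_e
        a11 += (sum(x * x for x in xs) - c * sx * sx) / var_e
        b0 += (sy - c * n * sy) / var_e
        b1 += (sum(x * y for x, y in zip(xs, ys)) - c * sx * sy) / var_e
    det = a00 * a11 - a01 * a01          # solve the 2x2 system by Cramer's rule
    return (a11 * b0 - a01 * b1) / det, (a00 * b1 - a01 * b0) / det


# Hypothetical data: three clusters of different sizes.
clusters = [
    ([1.0, 2.0, 3.0], [2.9, 5.2, 7.1]),
    ([2.0, 4.0], [4.0, 8.5]),
    ([0.0, 1.0, 5.0, 6.0], [1.5, 3.6, 11.2, 13.4]),
]
print(gls_random_intercept(clusters, var_u=0.5, var_e=1.0))
```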
Properties of FIML estimators
Under mild regularity conditions FIML estimators have good asymptotic properties:
• Consistency
• Normality
• Efficiency
Remark: here asymptotic requires increasing the number of clusters (increasing the cluster sizes is not enough), so the number of clusters J is the key quantity for asymptotics
ML and missing data
• Complete-case or 'listwise' analysis:
  - Reduced power
  - Inconsistent estimates unless missingness only depends on covariates (Missing Completely At Random, MCAR)
• ML of a mixed model based on all available responses:
  - Consistent estimates if missingness only depends on covariates and observed responses (Missing At Random, MAR) and the mixed model is correctly specified
  - Inconsistent estimates if missingness depends on missing responses (Not Missing At Random, NMAR)
• ML of a mixed model with explicit missingness models:
  - Consistent estimates under correct specification of the missingness model (and of the 'substantive' model)
Hypothesis testing on a single fixed parameter
Null hypothesis: $H_0: \gamma_h = 0$
Wald test statistic:
$$T(\gamma_h) = \frac{\hat{\gamma}_h}{s.e.(\hat{\gamma}_h)}, \qquad T(\gamma_h) \overset{approx}{\sim} t_{d.f.}$$
with degrees of freedom
• d.f. = #(level 1 units) - #(covariates) - 1 if $\gamma_h$ is the coefficient of a covariate at level 1
• d.f. = #(clusters) - #(covariates at level 2) - 1 if $\gamma_h$ is the coefficient of a covariate at level 2
A caveat: with few clusters the standard errors are usually underestimated, so the test rejects the null hypothesis too often (i.e. the type I error rate is higher than the nominal level)
Hypothesis testing on a set of fixed parameters
Null hypothesis: $H_0: C'\gamma = 0$ (C = matrix of contrasts with k rows)
Wald test statistic:
$$Q(C'\gamma) = (C'\hat{\gamma})^T \left[ \widehat{Var}(C'\hat{\gamma}) \right]^{-1} (C'\hat{\gamma}), \qquad Q(C'\gamma) \overset{approx}{\sim} \chi^2_k$$
Remark: with few clusters the F distribution is preferable
Alternative: LRT (asymptotically equivalent)
Hypothesis testing on the random parameters
Hypothesis
$$H_0: \Sigma = \Sigma_0 \quad \text{vs.} \quad H_1: \Sigma = \Sigma_1$$
where $\Sigma_0$ is a restricted version of $\Sigma_1$ (e.g. some elements are constrained to 0)
Unless the number of clusters is huge, the Wald test should not be used since the sampling distributions of the estimators of the random parameters are highly asymmetric
LRT on a level 2 variance
D1 = deviance (-2logL) of the unrestricted model
D0 = deviance (-2logL) of the model with a variance at 0
#(restrictions) = 1 + #(corresponding covariances)
$$D_0 - D_1 \overset{approx}{\sim} \begin{cases} \chi^2_{\#(restrictions)} & \text{with prob. } 1/2 \\ 0 & \text{with prob. } 1/2 \end{cases}$$
Practical rule: the p-value must be halved! (otherwise the test is conservative, i.e. the actual probability of type I error is lower than the nominal level $\alpha$)
Remark: with REML the LRT can be used only if the fixed part of the model is unchanged!
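The practical rule can be sketched for the simplest case, a test on a single level 2 variance with no covariances involved, so that the reference chi-square has 1 degree of freedom. The function name and the deviance values are hypothetical; the chi-square tail probability with 1 d.f. is computed through the complementary error function.

```python
import math


def lrt_pvalue_single_variance(dev_restricted, dev_unrestricted):
    """Naive and halved p-values for the LRT on one level 2 variance
    (one restriction, hence a chi-square with 1 d.f. as reference)."""
    lrt = dev_restricted - dev_unrestricted       # D0 - D1
    if lrt <= 0:
        return 1.0, 1.0
    naive = math.erfc(math.sqrt(lrt / 2.0))       # P(chi2_1 > lrt)
    return naive, naive / 2.0


# Hypothetical deviances of the restricted and unrestricted models.
naive, halved = lrt_pvalue_single_variance(1003.2, 1000.0)
print(round(naive, 4), round(halved, 4))
```

In this example the naive p-value is above 0.05 while the halved one is below it, so the conclusion of the test changes.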
Effect of the covariates on the (residual) variances
Starting with a random effects ANOVA model
$$y_{ij} = \mu + u_j + e_{ij}, \qquad Var(u_j) = \sigma^2_u, \qquad Var(e_{ij}) = \sigma^2_e$$
level 2 and level 1 covariates can be added:
• A level 2 covariate reduces (or leaves unchanged) the level 2 variance, but (being constant within each cluster) cannot affect the level 1 variance
• A level 1 covariate reduces (or leaves unchanged) the level 1 variance, but its effect on the level 2 variance is unpredictable
Why the estimated level 2 variance may increase /1
• a level 2 covariate by definition is purely between, i.e. varies only between clusters
• a level 1 covariate can be written as the sum of two components, $x_{ij} = \bar{x}_{.j} + (x_{ij} - \bar{x}_{.j})$:
  - a purely between component $\bar{x}_{.j}$: reduces $\sigma^2_u$ and does not affect $\sigma^2_e$
  - a purely within component $(x_{ij} - \bar{x}_{.j})$: reduces $\sigma^2_e$ and thus increases the estimate of $\sigma^2_u$
A purely within level 1 covariate increases the estimate of $\sigma^2_u$. The effect can be illustrated with the ANOVA formula
$$\hat{\sigma}^2_u = S^2_B - \frac{\hat{\sigma}^2_e}{n}$$
This effect is strong only if the cluster size n is small
Usually a level 1 covariate varies both within and between, so the effect on $\sigma^2_u$ is unpredictable (it often reduces it)
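The ANOVA formula makes the mechanism easy to tabulate. In this sketch all numbers are hypothetical, and S²_B is read as the observed variance of the cluster means in a balanced design (an assumption about the notation): a purely within covariate lowers the level 1 variance estimate from 4.0 to 2.0 while leaving S²_B unchanged.

```python
def anova_level2_variance(var_cluster_means, var_e_hat, n):
    """ANOVA-type estimate: sigma_u^2 = S_B^2 - sigma_e^2 / n (balanced clusters)."""
    return var_cluster_means - var_e_hat / n


s2_between = 3.0   # hypothetical observed variance of the cluster means
for n in (2, 5, 50):
    before = anova_level2_variance(s2_between, var_e_hat=4.0, n=n)
    after = anova_level2_variance(s2_between, var_e_hat=2.0, n=n)
    # The level 2 variance estimate rises by (4.0 - 2.0) / n.
    print(n, before, after, round(after - before, 4))
```

The increase is 1.0 for clusters of size 2 but only 0.04 for clusters of size 50, in line with the remark that the effect is strong only for small clusters.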
Why the estimated level 2 variance may increase /2
A rise in the estimated level 2 variance $\sigma^2_u$ can be due to:
• the addition of a level 1 covariate varying mainly (or exclusively) within clusters: this is a consequence of the estimation method (the model is assumed to be correctly specified) and it is negligible for large cluster sizes
• the inclusion of an endogenous covariate, i.e. one correlated with the random effects (an incorrectly specified model)
Example (Longford). Consider a comparison of hospitals based on the response 'death of the patient' and suppose that the covariate 'severity of the case' is added; if hospitals that tend to treat more severe cases provide higher quality treatment, then after adjustment for severity the between-hospital variance will be much greater.
Effect of the covariates on the (residual) variances: general rule
In a multilevel model (two or more levels) a covariate at an arbitrary level:
• does not affect the variances at lower levels
• reduces (or leaves unchanged) the variances at the same level
• has an unpredictable effect on the variances at higher levels
Use of the residuals
• Diagnostics
• Inference for specific clusters
Diagnostics based on the residuals
• Many residuals:
  - Level 1: $\hat{e}_{ij}$
  - Level 2 (Empirical Bayes): $\hat{u}_{0j}, \hat{u}_{1j}, \dots$
• Extension of the techniques used in standard regression
• Purposes:
  - Check the functional form and the distributional assumptions (normality, homoscedasticity, …)
  - Look for influential units
To check the normality assumption use a histogram or a Q-Q plot
To locate anomalous units look at the standardized residuals, using the diagnostic standard errors
[Figure: plot of the standardized level 2 residuals, with axes standardized $\hat{u}_{0j}$ and standardized $\hat{u}_{1j}$]
To look for misspecifications in the fixed part and/or heteroscedasticity, plot the residuals one by one against the predicted values of the response
Inference on the random effects
The level 2 residuals are predictions of the corresponding random effects $u_j$, so they can be used to make inference
Two relevant inferential questions:
• Is cluster j* significantly different from the mean? Since the mean is 0, the question is whether $u_{j*} \neq 0$
• Is cluster j* significantly different from cluster j**? The question is whether $u_{j*} \neq u_{j**}$
Comparisons with the mean
[Figure: caterpillar plot of the residuals for comparing each cluster with the mean. x-axis: ranks (0 to 60); y-axis: EB prediction of random effects (-1.0 to 1.2); bars: $\hat{u}_j \pm 1.96\, SE(\hat{u}_j)$]
The residuals are ordered and endowed with 95% confidence bars (± 1.96 times the comparative standard errors). The width of the error bar of a given cluster depends on its size
There are many clusters significantly above or below the mean
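Warning: how to compare two means
A common misconception: believing that two quantities are significantly different at 5% only when their 95% intervals are disjoint (this criterion is too conservative)
Let
$$X \sim N(\mu_X, \sigma^2), \qquad Y \sim N(\mu_Y, \sigma^2)$$
with univariate 95% intervals $X \pm 1.96\,\sigma$ and $Y \pm 1.96\,\sigma$
Assuming for simplicity that X and Y are independent,
$$X - Y \sim N(\mu_X - \mu_Y,\ 2\sigma^2)$$
so the 95% interval for the difference is $X - Y \pm 1.96\sqrt{2}\,\sigma$
X is significantly different from Y at the 5% level if and only if:
• the distance (in $\sigma$ units) between X and Y exceeds $1.96\sqrt{2} = 2.77$
• or, equivalently, the univariate intervals of half-length 2.77/2 = 1.39 (in $\sigma$ units) are disjoint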
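The two multipliers and the difference between the two criteria can be verified directly. The estimates x and y below are hypothetical values chosen so that the usual 95% intervals overlap even though the difference is significant at the 5% level.

```python
import math

Z = 1.96                               # two-sided 5% Normal critical value
diff_threshold = Z * math.sqrt(2)      # threshold for |X - Y| in sigma units
half_length = diff_threshold / 2       # bar half-length for pairwise comparisons


def significantly_different(x, y, sigma):
    return abs(x - y) > diff_threshold * sigma


def intervals_disjoint(x, y, sigma, multiplier):
    return abs(x - y) > 2 * multiplier * sigma


x, y, sigma = 0.0, 3.0, 1.0            # hypothetical estimates, common s.d.
print(round(diff_threshold, 2), round(half_length, 2))   # 2.77 1.39
print(significantly_different(x, y, sigma))              # True
print(intervals_disjoint(x, y, sigma, Z))                # False: the 95% bars overlap
print(intervals_disjoint(x, y, sigma, half_length))      # True
```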
Graph for pair-wise comparisons
[Figure: caterpillar plot of the residuals for comparing two clusters. x-axis: ranks (0 to 60); y-axis: EB prediction of random effects (-1.0 to 1.2); bars: $\hat{u}_j \pm 1.39\, SE(\hat{u}_j)$]
• The residuals are ordered and endowed with error bars (± 1.39 times the comparative standard errors)
• The value 1.39 stems from assumptions (Normal distribution, independence, same variance) which usually are not satisfied, so the value 1.39 is an approximation and the significance level 5% is an average level
• Here only a few comparisons are significant. The width of the error bar of a given cluster depends on its size, so one way to increase the number of significant comparisons is to collect more data on each cluster
• The only difference with the previous caterpillar plot is the width of the bars
Sample size requirements to fit multilevel models
How many clusters are needed?
• A multilevel model requires a minimum number of clusters in order to get (approximately) unbiased estimates
• The minimum depends on the type of model (linear vs non-linear, random intercept vs random slope), on the average size of the clusters and on the true parameter values
• In the best situation (a simple linear random intercept model and a dataset with large clusters) the minimum is about 10 (but often we are far from the best situation!)
• In case of few clusters -> fixed effects model
• An alternative solution to handle few clusters is to fit a random effects model with Bayesian methods, since they do not rely on asymptotics
Target of inference
• The sample size requirements are different depending on the target of inference
• The least demanding target is to get unbiased point estimates of regression coefficients (in favourable situations 10 clusters of size 2 may be enough)
• More clusters (say 30 or 50) are needed for unbiased estimation of variance components and standard errors (especially the standard errors of the variance components)
• The requirement is higher for models with random slopes and for non-linear models (e.g. binary responses)
Small clusters
• The cluster size is less relevant than the number of clusters: clusters of size 2 (as in a two-wave panel) are usually ok for a linear random effects model (but not for a logistic random effects model)
• Even a few clusters with a single unit are not harmful
• However, small clusters worsen cluster-specific inferences (for example, the precision of Empirical Bayes predictions of random effects)
• Moreover, data with small clusters carry limited information on the variance-covariance structure at level 2, which should be kept simple (for example, no random slopes)
Degrees of freedom at level 2
degrees of freedom at level 2 = number of clusters minus number of level 2 covariates
• Given the number of level 2 covariates, the number of clusters should ensure enough degrees of freedom at level 2
• For example, we cannot obtain good estimates with 20 clusters and 10 level 2 covariates
References on sample size /1
LINEAR:
Maas C.J.M. & Hox J.J. (2005) Sufficient sample sizes for multilevel modeling. Methodology, 1, 86-92.
Bell, B.A., Morgan, G.B., Schoeneberger, J.A., Loudermilk, B.L., Kromrey, J.D., & Ferron, J.M. (2010) Dancing the Sample Size Limbo with Mixed Models: How Low Can You Go? SAS Global Forum 2010, Paper 197-2010. http://support.sas.com/resources/papers/proceedings10/197-2010.pdf
LOGISTIC:
Moineddin R., Matheson F.I. and Glazier R.H. (2007) A simulation study of sample size for multilevel logistic regression models. BMC Medical Research Methodology, 7. [random slope model]
Austin P.C. (2010) Estimating multilevel logistic regression models when the number of clusters is low: A comparison of different statistical software procedures. International Journal of Biostatistics, 6(1), article 16.
Paccagnella O. (2011) Sample Size and Accuracy of Estimates in Multilevel Models. New Simulation Results. Methodology, 7, 111-120.
References on sample size /2
SMALL CLUSTERS:
Raudenbush SW (2008) Many small groups. In: J. de Leeuw, E. Meijer (eds.), Handbook of Multilevel Analysis, Springer.
OPTIMAL DESIGN:
Snijders TAB, Bosker RJ (1993) Standard errors and sample sizes for two-level research. Journal of Educational Statistics, 18:237-259.
Snijders TAB (2005) Power and Sample Size in Multilevel Linear Models. In: B.S. Everitt and D.C. Howell (eds.), Encyclopedia of Statistics in Behavioral Science. Vol. 3, 1570-1573. Chichester: Wiley. http://stat.gamma.rug.nl/PowerSampleSizeMultilevel.pdf
Cohen MP (1998) Determining sample sizes for surveys with data analyzed by hierarchical linear models. Journal of Official Statistics, 14:267-275.
Moerbeek M, Van Breukelen GJP, Berger MPF (2008) Optimal Designs for Multilevel Studies. In: J. de Leeuw, E. Meijer (eds.), Handbook of Multilevel Analysis, Springer.
Extensions of the hierarchical model
• Complex level 1 variation
• 3-level model
Extensions of the linear mixed model
• Many covariates at both levels: X1, X2, …, W1, W2, …
• Cluster means of the level 1 covariates
• Complex error structure:
  - At level 1: e.g. heteroscedasticity, $Var(e_{ij}) = \sigma^2 x_{ij}$
  - At level 2: e.g. many random slopes
• More than two hierarchical levels
But be aware that the imagination of the researchers “can easily outrun the capacity of the data, the computer, and current optimization techniques to provide robust estimates” (Di Prete & Forristal)
Complex level 1 variation (heteroscedasticity)
A model which allows the level 1 variance to depend on explanatory variables is called a complex level 1 variance model
Example: in an educational setting, it is often observed that boys vary more than girls in their attainment, i.e. there may be heteroscedasticity at the student level. Denoting with $d_{ij}$ the indicator for girl, the level 1 error is
$$e_{ij} = e_{1,ij}\, d_{ij} + e_{0,ij}\, (1 - d_{ij})$$
with variance
$$Var(e_{ij}) = \sigma^2_{e1}\, d_{ij} + \sigma^2_{e0}\, (1 - d_{ij})$$
Browne, W.J., Draper, D., Goldstein, H., & Rasbash, J. (2002). Bayesian and likelihood methods for fitting multilevel models with complex level-1 variation. Computational Statistics & Data Analysis, 39(2), 203-225.
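The 3-level random intercept model
$$y_{ijk} = \text{fixed part} + v_k + u_{jk} + e_{ijk}$$
• $k = 1, \dots, K$: level 3 units (e.g. schools)
• $j = 1, \dots, J_k$: level 2 units (e.g. classes)
• $i = 1, \dots, I_{jk}$: level 1 units (e.g. pupils)
$$v_k \overset{iid}{\sim} N(0, \sigma^2_v), \qquad u_{jk} \overset{iid}{\sim} N(0, \sigma^2_u), \qquad e_{ijk} \overset{iid}{\sim} N(0, \sigma^2_e)$$
with independence among levels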
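The 3-level random intercept model
Two pupils of different schools:
$$corr(y_{ijk}, y_{i'j'k'}) = 0$$
Two pupils of the same school but different classes:
$$corr(y_{ijk}, y_{i'j'k}) = \frac{\sigma^2_v}{\sigma^2_v + \sigma^2_u + \sigma^2_e}$$
Two pupils of the same school and class:
$$corr(y_{ijk}, y_{i'jk}) = \frac{\sigma^2_v + \sigma^2_u}{\sigma^2_v + \sigma^2_u + \sigma^2_e}$$
Variance Partition Coefficients (VPC):
$$\underbrace{\frac{\sigma^2_v}{\sigma^2_v + \sigma^2_u + \sigma^2_e}}_{\text{school level}} \qquad \underbrace{\frac{\sigma^2_u}{\sigma^2_v + \sigma^2_u + \sigma^2_e}}_{\text{class level}} \qquad \underbrace{\frac{\sigma^2_e}{\sigma^2_v + \sigma^2_u + \sigma^2_e}}_{\text{student level}}$$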
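The correlations and the VPCs are simple ratios of the three variance components. A minimal sketch, with a hypothetical function name and hypothetical variance values:

```python
def three_level_summary(var_v, var_u, var_e):
    """VPCs and intra-class correlations for the 3-level random intercept model."""
    total = var_v + var_u + var_e
    return {
        "vpc_school": var_v / total,
        "vpc_class": var_u / total,
        "vpc_student": var_e / total,
        "corr_same_school_diff_class": var_v / total,
        "corr_same_class": (var_v + var_u) / total,
    }


# Hypothetical variance components (school, class, student).
summary = three_level_summary(var_v=1.0, var_u=2.0, var_e=5.0)
print(summary)
```

With these values the three VPCs are 0.125, 0.25 and 0.625, and two pupils of the same class have correlation 0.375.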
Software & Books
Software for multilevel modelling
• Specialized software
• Procedures in general purpose software
• Web resources:
  - Centre for Multilevel Modelling: http://www.cmm.bristol.ac.uk/
  - Multilevel Modeling Resources at UCLA: http://www.ats.ucla.edu/stat/mlm/
Specialized software
• MLwiN (Goldstein)
• HLM (Raudenbush)
• SUPERMIX (Hedeker)
• aML (Panis)
Procedures in general purpose software
• STATA xt suite
• R (packages 'lme4' and 'MCMCglmm')
• SAS PROC MIXED and NLMIXED
• SPSS
• STATA gllamm (Rabe-Hesketh & Skrondal)
• Also structural equations: LISREL (Joreskog), M-plus (Muthén)
• WINBUGS (for Bayesian analysis)
Good introductory books
Snijders & Bosker, 2nd ed.
(to appear soon)
Hox, 2nd ed. Raudenbush & Bryk, 2nd ed.
Chapter 2 free to download
Books for learning models AND software
• Pinheiro and Bates (2002) Mixed-effects models in S and S-PLUS
• Littell et al. (2006) SAS for mixed models, 2nd ed
• Gelman and Hill (2007) Data analysis using regression and multilevel/hierarchical models (mainly R and WinBUGS)
• Rabe-Hesketh and Skrondal (2008) Multilevel and longitudinal modeling using Stata, 2nd ed
A web course
• Detailed instructions on how to carry out a range of analyses in R, MLwiN and Stata
• It is free, but you will need to log on or register onto the course to view all these practicals: http://www.cmm.bris.ac.uk/lemma/course/view.php?id=13
• However, you can see short samples of these materials without registering at http://www.bristol.ac.uk/cmm/learning/module-samples/
Advanced books
• A. Skrondal & S. Rabe-Hesketh: unified framework for models with latent variables, including multilevel, factor, structural eq.
• H. Goldstein: now 4th edition!
• J. de Leeuw, E. Meijer (eds.)