Multilevel linear Santiago2012 - UniFI · Sample size requirements to fit multilevel models 10....


Introduction to multilevel analysis

Leonardo Grilli

Department of Statistics “G. Parenti” - University of Florence

Email: [email protected] Web: http://www.ds.unifi.it/grilli/

Teaching staff Erasmus mobility

Santiago de Compostela, 28 February – 1 March 2012

1

Outline

1. Introduction
2. ANOVA (fixed effects vs random effects)
3. Inference in random effects ANOVA
4. Basics of the two-level linear model – Case #1: a single covariate at level 1
5. Basics of the two-level linear model – Case #2: introduction of a covariate at level 2
6. Between, within and contextual effects
7. Inference in two-level models
8. Use of the residuals
9. Sample size requirements to fit multilevel models
10. Extensions of the hierarchical model
11. Software and books

2

Introduction

Multilevel structures Basic definitions NELS-88 example

A hierarchical structure

level 4: district
level 3: school 1, school 2
level 2: class 1, class 2 (in school 1); class 3, class 4 (in school 2)
level 1 (students): s1, s2, ..., s12

Remark: levels are numbered bottom-up

4

Hierarchical structures: Units within clusters

It can be the case that all the units and clusters are physical entities:
• pupil, class, school
• patient, doctor, hospital
• worker, firm
• individual, family, region
• interviewee, interviewer

Often the sampling design reflects the hierarchical structure (multi-stage sampling), but this need not be the case.

5

Hierarchical structures: Multiple responses

It can be the case that the bottom units (level 1) are different responses of a given statistical unit (level 2):
• Multivariate data
• Longitudinal data (panel, repeated measurements)

[Diagram: subject j (level 2) with responses resp1, resp2, resp3 (level 1)]

Remark: a hierarchical structure may combine multiple responses and clusters given by physical entities (e.g. for a questionnaire administered to students: item, student, school)

6


Multiple responses and missing data

Multivariate data:
• subject 1: item1, item2, item3
• subject 2: item1, item3 (item2 is a missing response)

Panel data:
• subject 1: wave1, wave2, wave3
• subject 2: wave1, wave2 (drop-out: wave3 is not observed)

Standard estimation methods for multilevel models allow for non-informative missing data (Little & Rubin’s MAR: Missing At Random)

7

Some hierarchical terms

                       | Cross-section,      | Cross-section,              | Longitudinal data
                       | univariate response | multivariate response       | (panel data)
-----------------------+---------------------+-----------------------------+----------------------------
Level 2 unit (cluster) | set of subjects     | subject                     | subject
Level 1 unit           | subject             | measurement, response, item | measurement, occasion, wave

Other names for
• Level 2: between, macro, cluster
• Level 1: within, micro

Warning: subject is a level 1 unit or a level 2 unit depending on the context

8

Cluster analysis vs multilevel analysis

In cluster analysis the hierarchical structure is unknown: discovering the clusters is precisely the aim of the analysis!

In multilevel analysis the hierarchical structure (number of clusters, cluster membership) is known a priori: the aim of the analysis is to understand the relationships within and between clusters

However, a multilevel model can be specified so as to perform a model-based cluster analysis on the clusters of the hierarchy, e.g. a 2-level model for students within schools can be used to build clusters of schools (this task requires specifying the random effects as having a discrete instead of a continuous distribution)

9

Which is the relevant structure?

Usually the phenomenon under study can be modelled through several alternative structures, e.g.
• Pupil, class
• Pupil, class, school
• Pupil, school
• Pupil, teacher
• Pupil, (school by quarter)

The structure to be used in the analysis depends on the aims of the research

Even if a complex structure may appear more realistic, for most research purposes a simple structure with 2 or 3 levels is enough

10

Different fields, different names…

• Design of experiments: variance components models
• Statistics: mixed models (Harville, 1977), hierarchical linear models (HLM)
• Econometrics: random coefficients models (Swamy 1972), random effects models for panel data
• Biostatistics: mixed models for repeated measures (Laird and Ware, 1982), random effects models
• Educational statistics: multilevel models (Cronbach 1976, Aitkin and Longford 1986)
• Sociology, demography, small area estimation, …

11

Types of variables

Level 1
• Example: male/female, grade

Level 2
• Global: feature of the cluster with no corresponding level 1 measure. Example: public/private school, number of teachers
• Compositional (or contextual): feature of the cluster obtained through aggregation of level 1 measures (summary of the features of the level 1 units). Example: average class size, proportion of females, average grade

12


Levels of the variables

$j = 1, \dots, J$ clusters (level 2 units)
$i = 1, \dots, n_j$ elementary (level 1) units in cluster $j$

In a two-level setting a level 1 variable has a double index, $X_{ij}$, while a level 2 variable has a single, level 2 index, $W_j$

A level 2 variable is by definition constant within clusters, so its variation is only between clusters

A level 1 variable has distinct values for the elementary units and, in general, its cluster mean changes from cluster to cluster, so its variation is both within clusters and between clusters:

$$ X_{ij} = \bar{X}_{.j} + (X_{ij} - \bar{X}_{.j}), \qquad var(X_{ij}) = var(\bar{X}_{.j}) + var(X_{ij} - \bar{X}_{.j}) $$

13

Relationships among levels

macro: level 2, e.g. school
micro: level 1, e.g. pupil

• macro-micro relationship: Z → Y
• adjusted macro-micro relationship: Z → Y, with X → Y also in the model
• cross-level interaction: Z modifies the relationship X → Y
• micro-macro relationship: X → Z

14

NELS-88 example

• Levels: pupils (level 1); schools (level 2) [schnum]
• Response variable Y [math]: score on a math test
• Level 1 covariate X [homework]: hours per week spent on math homework
• Level 2 covariate W [public]: binary indicator of public vs non-public school (a global level 2 variable)

We consider 10 handpicked schools from the NELS-88 data (Kreft and De Leeuw, Introducing Multilevel Modeling, Sage, 1998)

15

NELS-88 example: data

Here are some records (9 out of 260); each record refers to a pupil

     +-----------------------------------+
     | schnum   public   math   homework |
     |-----------------------------------|
126. |      6        1     42          2 |
127. |      6        1     47          1 |
128. |      6        1     47          3 |
129. |      6        1     51          1 |
130. |      6        1     53          1 |
131. |      6        1     44          1 |
132. |      7        0     62          4 |
133. |      7        0     68          5 |
134. |      7        0     56          5 |

16

NELS-88 example: summary statistics

There are 10 schools (level 2 units) of different size (unbalanced design)

Nr. of schools | Size
---------------+-----
             3 |   20
             2 |   21
             2 |   22
             1 |   23
             1 |   24
             1 |   67

The total number of pupils (level 1 units) is 260

Variable |  Obs    Mean   Std. Dev.   Min   Max
---------+--------------------------------------
math     |  260   51.30       11.14    31    71
homework |  260    2.02        1.55     0     7
public   |   10    0.10           -     0     1

17

Problems with standard models

Standard models, such as OLS regression, are not adequate for analysing hierarchical data

• Inaccurate modelling: unable to disentangle the contributions of the hierarchical levels
• Inaccurate inference:
  • Wrong sample size for the cluster variables (their sample size should be the number of clusters)
  • Dependence: the units of the same cluster are alike, with a positive within-cluster correlation, so the independence assumption of standard models is violated and the standard errors are biased (often underestimated, leading to type I error rates higher than the nominal level)

18


Which type of model?

Basic question: is the hierarchical structure (along with the relationships within and between levels and the associated correlation structure) of primary interest for the research?

• Yes, it is of primary interest: multilevel models
• No, it is merely a nuisance (e.g. the sampling design is multistage but interest is limited to individual level relationships): methods able to “correct” the standard errors, such as
  • Repeated measure methods like GEE
  • Robust (sandwich, Huber, White) covariance matrix of the estimators

19

Aggregated analysis

A solution to the problem of correlated observations is to aggregate the data at cluster level (by taking cluster means) and apply standard regression to the new dataset. However, this solution is harmful because of the following problems:

• Shift of meaning: the contextual variables obtained through aggregation refer to the clusters (not the level 1 units), so they cannot be used to investigate relationships at level 1
• Ecological fallacy (aggregation bias): relationships at level 2 ≠ relationships at level 1
• Interactions between levels: an aggregated analysis precludes the study of the relationships between levels
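The ecological fallacy can be illustrated with a small sketch. The data below are hypothetical and built by hand (they are not from the NELS-88 example): cluster means of x and y rise together, while inside each cluster y falls as x rises, so the aggregated regression and the within-cluster regression give slopes of opposite sign.

```python
import numpy as np

# Hypothetical toy data: 5 clusters with 5 units each.
# Cluster means of x and y rise together (between-cluster slope = +1),
# while inside each cluster y falls as x rises (within-cluster slope = -1).
deviations = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
cluster_ids = np.repeat(np.arange(5), deviations.size)
x = 10.0 * cluster_ids + np.tile(deviations, 5)
y = 10.0 * cluster_ids - np.tile(deviations, 5)

# Aggregated analysis: regress the cluster means of y on the cluster means of x.
x_bar = np.array([x[cluster_ids == j].mean() for j in range(5)])
y_bar = np.array([y[cluster_ids == j].mean() for j in range(5)])
between_slope = np.polyfit(x_bar, y_bar, 1)[0]

# Within-cluster analysis: regress cluster-centred y on cluster-centred x.
x_within = x - x_bar[cluster_ids]
y_within = y - y_bar[cluster_ids]
within_slope = np.polyfit(x_within, y_within, 1)[0]

print(f"between-cluster slope: {between_slope:.2f}")
print(f"within-cluster slope:  {within_slope:.2f}")
```

An analysis based only on the cluster means would report the positive slope and say nothing about the negative relationship among units of the same cluster.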

20

ANOVA (fixed effects vs random effects)

Fixed effects ANOVA

$$ y_{ij} = \mu + \alpha_j + e_{ij}, \qquad e_{ij} \overset{iid}{\sim} N(0, \sigma_e^2) $$

$j = 1, \dots, J$ clusters (in ANOVA terminology "levels of the factor")
$i = 1, \dots, n_j$ units in cluster $j$
$N = \sum_{j=1}^{J} n_j$ total sample size

$\mu$ is a parameter representing the overall mean

$\alpha_j$ is a parameter representing the deviation of the mean of the j-th cluster (level of the factor) from the overall mean ($J-1$ parameters)

…i.e. the simplest multilevel model

22

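Random effects ANOVA (RANOVA)

$$ y_{ij} = \mu + u_j + e_{ij} $$

$$ e_{ij} \overset{iid}{\sim} N(0, \sigma_e^2), \qquad u_j \overset{iid}{\sim} N(0, \sigma_u^2), \qquad e_{ij} \perp u_j \ \ \forall i, j $$

The $J-1$ cluster-specific parameters of fixed effects ANOVA are replaced by a single random variable $u_j$ that takes $J$ realizations, one for each cluster (level of the factor)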

23

Random effects ANOVA: data generating process

$$ y_{ij} = \mu + u_j + e_{ij} $$

Useful to think in terms of a two-stage data generation process:

1) sampling cluster j -> take a realization of $u_j$ from its distribution (so cluster j has a mean of $\mu + u_j$)

2) sampling unit i within cluster j -> take a realization of $e_{ij}$ from its distribution (so unit i of cluster j has value $\mu + u_j + e_{ij}$)

[Figure: distributions of the units of three clusters, centred at $\mu + u_1$, $\mu + u_2$ and $\mu + u_3$]
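The two-stage data generating process can be sketched as a short simulation. All numerical values (overall mean, standard deviations, number of clusters, cluster size, seed) are illustrative assumptions, not values taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice

mu = 50.0        # overall mean (illustrative value)
sigma_u = 5.0    # between-cluster standard deviation (illustrative value)
sigma_e = 8.0    # within-cluster standard deviation (illustrative value)
J, n = 200, 30   # number of clusters and units per cluster (balanced design)

# Stage 1: sampling cluster j -> one realization of u_j per cluster.
u = rng.normal(0.0, sigma_u, size=J)

# Stage 2: sampling unit i within cluster j -> one realization of e_ij per unit,
# then y_ij = mu + u_j + e_ij.
e = rng.normal(0.0, sigma_e, size=(J, n))
y = mu + u[:, None] + e

# Each observed cluster mean equals mu + u_j plus the average level 1 error.
cluster_means = y.mean(axis=1)
print(y.shape, round(float(cluster_means.var(ddof=1)), 2))
```

The variance of the simulated cluster means should be close to sigma_u**2 + sigma_e**2 / n rather than to sigma_u**2 alone, because the cluster means also carry averaged level 1 noise.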

24


Random effects ANOVA: variances and covariances

$$ y_{ij} = \mu + u_j + e_{ij} $$

$$ Var(y_{ij}) = \sigma_u^2 + \sigma_e^2 $$

$$ Cov(y_{ij}, y_{i'j'}) = \begin{cases} 0 & \text{if } j \neq j' \\ \sigma_u^2 & \text{if } j = j' \text{ and } i \neq i' \end{cases} $$

Variance of $y_{ij}$ decomposed in two components: cluster (between) level + individual (within) level

Observations belonging to the same cluster are positively correlated

Remark: the correlation is necessarily positive since it is generated by a shared latent variable $u_j$ (it is the same basic idea as in factor models, where $u_j$ is called a factor; indeed the GLLAMM class of Rabe-Hesketh and Skrondal includes both multilevel and factor models as special cases)

25

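Random effects ANOVA: covariance matrix

Example: $J = 2$, $n_1 = 2$, $n_2 = 3$

$$ Var(\mathbf{y}) = \begin{pmatrix}
\sigma_u^2 + \sigma_e^2 & \sigma_u^2 & & & \\
\sigma_u^2 & \sigma_u^2 + \sigma_e^2 & & & \\
 & & \sigma_u^2 + \sigma_e^2 & \sigma_u^2 & \sigma_u^2 \\
 & & \sigma_u^2 & \sigma_u^2 + \sigma_e^2 & \sigma_u^2 \\
 & & \sigma_u^2 & \sigma_u^2 & \sigma_u^2 + \sigma_e^2
\end{pmatrix} $$

Block diagonal structure (empty space means zero)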
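As a minimal sketch, the block diagonal covariance matrix of the random effects ANOVA model can be built numerically. The function name `ranova_covariance` and the variance values are illustrative assumptions; the cluster sizes are those of the example (J = 2, n1 = 2, n2 = 3).

```python
import numpy as np

def ranova_covariance(cluster_sizes, sigma_u2, sigma_e2):
    """Var(y) for y_ij = mu + u_j + e_ij, with units ordered cluster by cluster."""
    N = sum(cluster_sizes)
    V = np.zeros((N, N))  # zero covariance between units of different clusters
    start = 0
    for n_j in cluster_sizes:
        # Within a cluster: sigma_u2 everywhere, plus sigma_e2 on the diagonal.
        block = sigma_u2 * np.ones((n_j, n_j)) + sigma_e2 * np.eye(n_j)
        V[start:start + n_j, start:start + n_j] = block
        start += n_j
    return V

# Illustrative variance components: sigma_u2 = 1, sigma_e2 = 2.
V = ranova_covariance([2, 3], sigma_u2=1.0, sigma_e2=2.0)
print(V)
```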

26

Random effects ANOVA: intraclass correlation coefficient

$$ \rho = Corr(y_{ij}, y_{i'j}) = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2} = \frac{\text{cluster variance}}{\text{total variance}} \in [0, 1] $$

$\rho$ denotes the ICC (Intraclass Correlation Coefficient), also known as VPC (Variance Partitioning Coefficient)

$\rho$ is a measure of the degree of homogeneity of units belonging to the same cluster

The double nature of $\rho$ (correlation and variance ratio) does not hold in models with more than 2 levels

27

Random effects ANOVA: example with NELS-88 data

Parameter         | Estimate
------------------+---------
Intercept (=mean) |    48.87
Level 2 variance  |    30.54
Level 1 variance  |    72.24

Total variance = 30.54 + 72.24 = 102.78

ICC = 30.54 / 102.78 = 0.297

29.7% of the variance of math scores is due to the clustering of pupils into schools
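The arithmetic of this example can be reproduced with a few lines; the variable names are illustrative, and the two variance estimates are the ones reported in the table.

```python
# Variance component estimates reported for the NELS-88 random effects ANOVA.
sigma_u2 = 30.54  # level 2 (between-school) variance
sigma_e2 = 72.24  # level 1 (within-school) variance

total_variance = sigma_u2 + sigma_e2
icc = sigma_u2 / total_variance  # intraclass correlation coefficient

print(f"total variance = {total_variance:.2f}, ICC = {icc:.3f}")
```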

28

Random effects ANOVA: marginal vs conditional covariance

$$ y_{ij} = \mu + u_j + e_{ij} $$

For two units $i$ and $i'$ of the same cluster $j$ the responses are correlated because they share the same random effect $u_j$

Marginal covariance: $Cov(y_{ij}, y_{i'j}) = \sigma_u^2$

If we condition on $u_j$ the correlation disappears

Conditional covariance: $Cov(y_{ij}, y_{i'j} \mid u_j) = 0$

29

Fixed effects ANOVA (again): conditional covariance

Parameterization with J intercepts:

$$ y_{ij} = (\mu + \alpha_j) + e_{ij} = \mu_1 d_{1ij} + \mu_2 d_{2ij} + \dots + \mu_J d_{Jij} + e_{ij}, \qquad \mu_j = \mu + \alpha_j $$

$\mathbf{d} = (d_1, d_2, \dots, d_J)$ is the vector of the J dummy variables for the clusters

Ex. $J = 2$, $n_1 = 2$, $n_2 = 3$

d1  d2
 1   0
 1   0
 0   1
 0   1
 0   1

This model does not specify a correlation structure among the responses of the same cluster: indeed the covariance is always null. However, the model refers to the distribution of y conditional on $\mathbf{d}$, thus the covariance which is null is the conditional covariance

$$ Cov(y_{ij}, y_{i'j} \mid \mathbf{d}) = 0 $$

Since the model does not specify the distribution of $\mathbf{d}$, we cannot compute the marginal covariance, but (by analogy with the random effects version) we can conclude that the marginal covariance is positive, i.e.

$$ Cov(y_{ij}, y_{i'j}) > 0 $$

30


ANOVA: fixed or random effects?

Fixed effects ANOVA, $y_{ij} = \mu + \alpha_j + e_{ij}$ (the cluster effects are parameters): when the clusters (levels of the factor) are few and represent an exhaustive classification. Distributional assumptions are avoided, but it is impossible to generalize the results to a population of clusters

Random effects ANOVA, $y_{ij} = \mu + u_j + e_{ij}$ (the cluster effects are iid random variables): when the clusters (levels of the factor) are a sample from a population of clusters or, in any case, when the clusters are many. It gives a parsimonious description of the observed variability among clusters, plus generalizability

31

Inference in random effects ANOVA

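Random effects ANOVA: estimation of the mean /1

$$ y_{ij} = \mu + u_j + e_{ij}, \qquad Var(e_{ij}) = \sigma_e^2, \qquad Var(u_j) = \sigma_u^2 $$

Cluster mean:

$$ \bar{y}_{.j} = \mu + u_j + \bar{e}_{.j} $$

$$ E(\bar{y}_{.j}) = \mu, \qquad Var(\bar{y}_{.j}) = Var(u_j) + Var(\bar{e}_{.j}) = \sigma_u^2 + \frac{\sigma_e^2}{n_j} $$

Each cluster mean is an unbiased estimator of $\mu$, but it is inefficient

Idea: estimate $\mu$ by combining $\bar{y}_{.1}, \dots, \bar{y}_{.J}$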

33

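Random effects ANOVA: estimation of the mean /2

$$ y_{ij} = \mu + u_j + e_{ij}, \qquad Var(e_{ij}) = \sigma_e^2, \qquad Var(u_j) = \sigma_u^2 $$

$$ \hat{\mu} = \Big( \sum_{h=1}^{J} \omega_h \Big)^{-1} \sum_{j=1}^{J} \omega_j \, \bar{y}_{.j} $$

$$ \omega_j = \frac{1}{Var(\bar{y}_{.j})} = \frac{1}{\sigma_u^2 + \sigma_e^2 / n_j} \quad \text{is the precision of } \bar{y}_{.j} $$

• In the balanced case all the precisions are equal, so the best estimator can be computed directly (it is just the arithmetic mean of the cluster means)

• In the unbalanced case it is necessary to have an estimate of $\omega_j$, so the best estimator depends on the variance components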
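A minimal sketch of the precision-weighted estimator of the overall mean, assuming the variance components are known. Each cluster mean is weighted by the reciprocal of its variance, 1 / (sigma_u2 + sigma_e2 / n_j). The function name and all numbers are illustrative assumptions.

```python
import numpy as np

def precision_weighted_mean(cluster_means, cluster_sizes, sigma_u2, sigma_e2):
    """Combine cluster means with weights equal to their precisions."""
    cluster_means = np.asarray(cluster_means, dtype=float)
    cluster_sizes = np.asarray(cluster_sizes, dtype=float)
    # Precision of the j-th cluster mean: 1 / Var(cluster mean).
    weights = 1.0 / (sigma_u2 + sigma_e2 / cluster_sizes)
    return float(np.sum(weights * cluster_means) / np.sum(weights))

means = [40.0, 50.0, 60.0]  # hypothetical cluster means

# Balanced case: equal precisions, so the estimator is the plain average.
balanced = precision_weighted_mean(means, [20, 20, 20], sigma_u2=30.0, sigma_e2=70.0)

# Unbalanced case: larger clusters have more precise means and weigh more.
unbalanced = precision_weighted_mean(means, [5, 20, 80], sigma_u2=30.0, sigma_e2=70.0)

print(balanced, unbalanced)
```

In the unbalanced call the estimate moves towards the mean of the largest cluster, but only moderately: the term sigma_u2 in the denominator limits how much weight any single cluster can receive.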

34

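Random effects ANOVA: estimation of the variance components /1

$$ \sum_{j=1}^{J} \sum_{i=1}^{n_j} (y_{ij} - \bar{y}_{..})^2 = \sum_{j=1}^{J} \sum_{i=1}^{n_j} (y_{ij} - \bar{y}_{.j})^2 + \sum_{j=1}^{J} n_j (\bar{y}_{.j} - \bar{y}_{..})^2 $$

SST = SSW + SSB

For simplicity, we consider only the balanced case $n_j = n$

Two possible estimators of the variance components:

$$ S_W^2 = \frac{SSW}{N - J} = \frac{1}{N - J} \sum_{j=1}^{J} \sum_{i=1}^{n} (y_{ij} - \bar{y}_{.j})^2, \qquad S_B^2 = \frac{SSB}{n (J - 1)} = \frac{1}{J - 1} \sum_{j=1}^{J} (\bar{y}_{.j} - \bar{y}_{..})^2 $$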

35

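Random effects ANOVA: estimation of the variance components /2

$$ E(S_W^2) = \sigma_e^2, \qquad E(S_B^2) = \sigma_u^2 + \frac{\sigma_e^2}{n} $$

$S_B^2$ overestimates $\sigma_u^2$ since it actually estimates $Var(\bar{y}_{.j}) = \sigma_u^2 + \sigma_e^2 / n$

Not all the variability of the sample cluster means $\bar{y}_{.j}$ is due to the variability of the population cluster means $\mu + u_j$

An unbiased estimator of $\sigma_u^2$:

$$ \hat{\sigma}_u^2 = S_B^2 - \frac{\hat{\sigma}_e^2}{n}, \qquad \hat{\sigma}_e^2 = S_W^2 $$

Remark: $\hat{\sigma}_u^2$ can be negative! If $S_B^2 < \hat{\sigma}_e^2 / n$ then $\hat{\sigma}_u^2 = 0$

Remark: this fact underlies the classical ANOVA F test ($H_0: \sigma_u^2 = 0$)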
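The ANOVA estimators of the variance components in the balanced case can be sketched as follows: the within mean square estimates sigma_e2, and the variance of the cluster means minus sigma_e2 / n estimates sigma_u2, truncated at zero when negative. The function name and the toy data are illustrative assumptions.

```python
import numpy as np

def anova_variance_components(y):
    """ANOVA estimates for a balanced design; y has one row per cluster."""
    y = np.asarray(y, dtype=float)
    J, n = y.shape
    cluster_means = y.mean(axis=1)
    grand_mean = y.mean()
    # S_W^2 = SSW / (N - J), with N = J * n
    s2_within = np.sum((y - cluster_means[:, None]) ** 2) / (J * n - J)
    # S_B^2 = sample variance of the cluster means = SSB / (n * (J - 1))
    s2_between = np.sum((cluster_means - grand_mean) ** 2) / (J - 1)
    sigma_e2_hat = s2_within
    # Subtract the part of S_B^2 due to level 1 noise; truncate at zero.
    sigma_u2_hat = max(0.0, s2_between - s2_within / n)
    return float(sigma_u2_hat), float(sigma_e2_hat)

# Hypothetical data: 2 clusters of 2 units each.
sigma_u2_hat, sigma_e2_hat = anova_variance_components([[1, 3], [5, 7]])
print(sigma_u2_hat, sigma_e2_hat)

# Identical cluster means: the untruncated estimate would be negative.
truncated_u2, _ = anova_variance_components([[1, 9], [2, 8]])
print(truncated_u2)
```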

36


Joint estimation of fixed effects and variance-covariance parameters

To estimate the fixed parameters it is necessary to have an estimate of the variance-covariance parameters

However the converse is also true: to estimate the variance-covariance parameters it is necessary to have an estimate of the fixed parameters

The only exception is in strictly balanced designs (all clusters with the same size and the same matrix of covariates), where closed-form estimators of the variance components are available (see the textbook by Searle, McCulloch and Casella)

In general we have to rely upon iterative procedures

37

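Random effects ANOVA: prediction of the cluster-specific intercept $\beta_{0j}$

$$ y_{ij} = \mu + u_j + e_{ij} $$

$$ \beta_{0j} = \mu + u_j, \qquad u_j \sim N(0, \sigma_u^2), \qquad Var(\beta_{0j}) = \sigma_u^2 $$

Estimator 1, from the level 2 model:

$$ \hat{\mu} = \Big( \sum_{h=1}^{J} \omega_h \Big)^{-1} \sum_{j=1}^{J} \omega_j \, \bar{y}_{.j}, \qquad \omega_j = \frac{1}{\sigma_u^2 + \sigma_e^2 / n_j} $$

$\hat{\mu}$ is a common (biased) estimator of each $\beta_{0j}$

Estimator 2, from the level 1 model (sample mean as outcome):

$$ \bar{y}_{.j} = \beta_{0j} + \bar{e}_{.j}, \qquad \bar{e}_{.j} = \frac{1}{n_j} \sum_{i=1}^{n_j} e_{ij} \sim N(0, \sigma_e^2 / n_j) $$

$\bar{y}_{.j}$ is an (unbiased) estimator of $\beta_{0j}$ with variance $\sigma_e^2 / n_j$

Remark: here unbiasedness is evaluated conditional on $\beta_{0j}$ (i.e. hold $u_j$ fixed and calculate E(·) under repeated sampling of $e_{ij}$)

A linear combination of estimators 1 and 2 yields a new estimator that is better than both 1 and 2 in terms of MSE!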

38

BLUP

The Best Linear Unbiased Predictor (BLUP) of $\beta_{0j}$ is

$$ \hat{\beta}_{0j}^{EB} = \lambda_j \, \bar{y}_{.j} + (1 - \lambda_j) \, \hat{\mu} $$

• $\bar{y}_{.j}$: prediction using only the level 1 model (i.e. data from cluster j)
• $\hat{\mu}$: prediction using the level 2 model (i.e. data from all clusters)

The weight $\lambda_j$ (defined in the next slide) is a measure of the reliability of $\bar{y}_{.j}$ as an estimator of $\beta_{0j}$

The superscript EB stands for Empirical Bayes since $\hat{\beta}_{0j}^{EB}$ is also the posterior mean of $\beta_{0j}$ calculated by plugging in the ML estimates

39

The reliability coefficient

The $\lambda_j$ that appears in $\hat{\beta}_{0j}^{EB}$ is the reliability coefficient

$$ \lambda_j = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2 / n_j} $$

In psychometrics, letting i = item and j = individual, the parallel measurement model of classical test theory is

$$ y_{ij} = \mu + u_j + e_{ij} $$

• $y_{ij}$: observed measurement of individual j on item i
• $\mu + u_j$: true score of individual j
• $e_{ij}$: measurement error of individual j on item i
• $\bar{y}_{.j}$: observed score of individual j

$$ \text{reliability of } \bar{y}_{.j} = \frac{Var(u_j)}{Var(\bar{y}_{.j})} = \frac{\text{variance of true scores}}{\text{variance of observed scores}} $$

40

EB prediction and borrowing strength

$$ \hat{\beta}_{0j}^{EB} = \lambda_j \, \bar{y}_{.j} + (1 - \lambda_j) \, \hat{\mu} $$

• $\bar{y}_{.j}$: prediction using only data from cluster j
• $\hat{\mu}$: prediction using data from all clusters

$$ \lambda_j = \frac{1}{1 + \dfrac{1}{(\sigma_u^2 / \sigma_e^2) \, n_j}} $$

$\lambda_j$ is an increasing function of the between/within variance ratio and of the cluster size

When $\lambda_j$ rises, the EB prediction gets closer to the prediction based on data from the cluster under consideration

Borrowing strength: for a small cluster, the prediction of the intercept heavily exploits information from other clusters
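Borrowing strength can be illustrated numerically with a minimal sketch that assumes known variance components and a known overall mean. The reliability is computed as sigma_u2 / (sigma_u2 + sigma_e2 / n_j) and the EB prediction as the weighted average of the cluster mean and the overall mean. Function names and all numbers are hypothetical.

```python
def reliability(n_j, sigma_u2, sigma_e2):
    """Weight given to the cluster's own mean in the EB prediction."""
    return sigma_u2 / (sigma_u2 + sigma_e2 / n_j)

def eb_intercept(cluster_mean, n_j, mu_hat, sigma_u2, sigma_e2):
    """Shrink the cluster mean towards the overall mean mu_hat."""
    lam = reliability(n_j, sigma_u2, sigma_e2)
    return lam * cluster_mean + (1.0 - lam) * mu_hat

# Hypothetical setting: overall mean 50, two clusters with the same mean (60)
# but very different sizes.
sigma_u2, sigma_e2, mu_hat = 30.0, 70.0, 50.0
small = eb_intercept(60.0, n_j=5, mu_hat=mu_hat, sigma_u2=sigma_u2, sigma_e2=sigma_e2)
large = eb_intercept(60.0, n_j=80, mu_hat=mu_hat, sigma_u2=sigma_u2, sigma_e2=sigma_e2)

print(f"n_j = 5:  EB prediction = {small:.2f}")
print(f"n_j = 80: EB prediction = {large:.2f}")
```

The small cluster is pulled noticeably towards the overall mean, while the large cluster stays close to its own mean.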

41

Prediction of the random effects /1

Often we are interested in the value of the random effect, especially when it can be interpreted as a measure of effectiveness

The EB method implies that the EB prediction of a random effect, also called EB residual, is

$$ \hat{u}_j^{EB} = \hat{\beta}_{0j}^{EB} - \hat{\mu} = \hat{\lambda}_j \, (\bar{y}_{.j} - \hat{\mu}) $$

• $\hat{\lambda}_j$: shrinkage factor
• $\bar{y}_{.j} - \hat{\mu}$: OLS residual (also called ‘naive’ residual)

EB residuals are better than OLS residuals in terms of MSE

The amount of shrinkage depends on the cluster size; it may happen that the shrinkage is negligible for large clusters and substantial for small clusters (in effectiveness evaluation this fact causes some concern)

42


Prediction of the random effects /2

$$ y_{ij} = \mu + u_j + e_{ij} $$

The random effects $u_j$ are random variables with prior distribution $N(0, \sigma_u^2)$

The EB prediction is the mean of the posterior distribution with parameter estimates plugged in, i.e. data information (likelihood) combined with population information (prior)

In a linear model the posterior is Normal, so mean = mode

The mean of the posterior distribution is a value between 0 (the mean of the prior) and the mode of the likelihood

$$ \hat{u}_j^{OLS} = \bar{y}_{.j} - \hat{\mu}, \qquad \hat{u}_j^{EB} = \frac{\hat{\sigma}_u^2}{\hat{\sigma}_u^2 + \hat{\sigma}_e^2 / n_j} \, \hat{u}_j^{OLS} $$

43

Marginal sampling variance (diagnostic standard error)

In the RANOVA model, the sampling distribution of the EB prediction is

$$ \hat{u}_j^{EB} \sim N(0, \lambda_j \sigma_u^2) $$

The sampling variance is

$$ var(\hat{u}_j^{EB}) = \lambda_j \sigma_u^2 = \sigma_u^2 - (1 - \lambda_j) \sigma_u^2 = \sigma_u^2 - Var(u_j \mid \mathbf{y}) $$

Note: sampling variance < prior variance

Diagnostic standard error:

$$ \widehat{SD}(\hat{u}_j^{EB}) = \sqrt{\hat{\lambda}_j \hat{\sigma}_u^2} $$

It can be used for diagnostic purposes (e.g. to check if the EB prediction for a given cluster is anomalous)

44

Variance of prediction errors (comparative standard error)

In the linear random intercept model, the variance of prediction errors equals the posterior variance, i.e. the variance of $u_j$ given the data

$$ Var(\hat{u}_j^{EB} - u_j) = Var(u_j \mid \mathbf{y}) = (1 - \lambda_j) \sigma_u^2 $$

Note: posterior variance < prior variance

Comparative standard error:

$$ \widehat{SD}(\hat{u}_j^{EB} - u_j) = \sqrt{(1 - \hat{\lambda}_j) \hat{\sigma}_u^2} $$

It can be used for inferences on differences among random effects (see later)
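A short sketch contrasting the two standard errors of an EB prediction in the random intercept model. The diagnostic one is the square root of lambda_j * sigma_u2 and the comparative one is the square root of (1 - lambda_j) * sigma_u2, with lambda_j = sigma_u2 / (sigma_u2 + sigma_e2 / n_j). The function name and the numerical values are illustrative assumptions.

```python
import math

def eb_standard_errors(n_j, sigma_u2, sigma_e2):
    """Return (diagnostic SE, comparative SE) of the EB prediction of u_j."""
    lam = sigma_u2 / (sigma_u2 + sigma_e2 / n_j)
    diagnostic_se = math.sqrt(lam * sigma_u2)            # marginal sampling SD
    comparative_se = math.sqrt((1.0 - lam) * sigma_u2)   # SD of prediction error
    return diagnostic_se, comparative_se

# Hypothetical variance components, two cluster sizes.
for n_j in (5, 80):
    diag, comp = eb_standard_errors(n_j, sigma_u2=30.0, sigma_e2=70.0)
    print(f"n_j = {n_j:2d}: diagnostic SE = {diag:.2f}, comparative SE = {comp:.2f}")
```

The two variances always add up to the prior variance sigma_u2, so as the cluster size grows the comparative standard error shrinks towards zero while the diagnostic one approaches the prior standard deviation.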

45

Basics of the two-level linear model

Case #1: a single covariate at level 1

NELS-88 example: separate OLS analyses

One OLS regression for each of the 10 schools

How to perform an all-in-one analysis?

47

Example: school effectiveness

• Levels: pupils (level 1); schools (level 2)
• Response variable Y: score on the final test
• Explanatory variable (at level 1) X: score on the initial test

First consider a single school and assume a standard linear model:

$$ y_i = \beta_0 + \beta_1 x_i + e_i, \qquad e_i \overset{iid}{\sim} N(0, \sigma_e^2) $$

48


Comparing the schools

Comparison between schools A and B based on the progress of the pupils (value added)

[Figure: regression lines of schools A and B; horizontal axis X = score on initial test, vertical axis Y = score on final test]

• School A is more effective (higher predicted Y over the whole range of X)
• School A is more egalitarian (lower slope)

49

The two-level linear model (one covariate at level 1)

Sample of J schools (from a population of schools)

Level 1 model. Equation for the j-th school:

$$ y_{ij} = \beta_{0j} + \beta_{1j} x_{ij} + e_{ij}, \qquad e_{ij} \overset{iid}{\sim} N(0, \sigma_e^2) $$

Remark: each school has its own slope and intercept

50

The two-level linear model (one covariate at level 1)

Each school has a couple of “parameters” $(\beta_{0j}, \beta_{1j})$

Assumption: the “parameters” are iid random variables with a bivariate Normal distribution in the population of schools

$$ \begin{pmatrix} \beta_{0j} \\ \beta_{1j} \end{pmatrix} \overset{iid}{\sim} N \left( \begin{pmatrix} \gamma_{00} \\ \gamma_{10} \end{pmatrix}, \begin{pmatrix} \sigma_{u0}^2 & \sigma_{u01} \\ \sigma_{u01} & \sigma_{u1}^2 \end{pmatrix} \right) $$

$(\beta_{0j}, \beta_{1j})$ independent from $e_{ij}$

The Normal distribution is the “default” since it has nice properties and works well in many cases. Other choices are possible, such as a different continuous parametric family or an arbitrary discrete distribution

51

The two-level linear model (one covariate at level 1)

Model parameters

Fixed parameters
• $\gamma_{00}$: mean intercept
• $\gamma_{10}$: mean slope

Variance-covariance parameters (also called Random parameters even if they are fixed quantities: ‘random’ just means that they refer to the random part of the model)
• $\sigma_{u0}^2$: intercept variance
• $\sigma_{u1}^2$: slope variance
• $\sigma_{u01}$: slope-intercept covariance
• $\sigma_e^2$: residual variance (level 1)

52

The two-level linear model (one covariate at level 1)

Correlation between slopes and intercepts:

$$ corr(\beta_{0j}, \beta_{1j}) = \frac{\sigma_{u01}}{\sigma_{u0} \, \sigma_{u1}} $$

[Figure: example of negative correlation; axes $\beta_{0j}$ and $\beta_{1j}$]

53

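The two-level linear model (one covariate at level 1)

Level 1 model:

$$ y_{ij} = \beta_{0j} + \beta_{1j} x_{ij} + e_{ij} $$

Level 2 models:

$$ \beta_{0j} = \gamma_{00} + u_{0j}, \qquad \beta_{1j} = \gamma_{10} + u_{1j} $$

Combined model:

$$ y_{ij} = \underbrace{\gamma_{00} + \gamma_{10} x_{ij}}_{\text{Fixed part}} + \underbrace{u_{1j} x_{ij} + u_{0j} + e_{ij}}_{\text{Random part}} $$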

54


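The two-level linear model (one covariate at level 1)

Level 2 errors (random effects):

$$ u_{0j} = \beta_{0j} - \gamma_{00}, \qquad Var(u_{0j}) = \sigma_{u0}^2 $$

$$ u_{1j} = \beta_{1j} - \gamma_{10}, \qquad Var(u_{1j}) = \sigma_{u1}^2 $$

Random effect = unexplained deviation of the value of the “parameter” in the j-th cluster from the mean value in the population

The covariates may contribute to explaining the deviations (so reducing the corresponding variances)

Usually the distributional assumptions refer to the random effects (rather than to the random intercepts/slopes):

$$ \begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} \overset{iid}{\sim} N \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_{u0}^2 & \sigma_{u01} \\ \sigma_{u01} & \sigma_{u1}^2 \end{pmatrix} \right), \qquad \begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} \text{ indep. from } e_{ij} $$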

55

The two-level linear model: variance and covariances

The total error is $u_{0j} + u_{1j} x_{ij} + e_{ij}$, implying

• heteroscedasticity:

$$ Var(y_{ij} \mid x_{ij}) = \underbrace{\sigma_{u0}^2 + 2 \sigma_{u01} x_{ij} + \sigma_{u1}^2 x_{ij}^2}_{\text{Between-cluster variance}} + \underbrace{\sigma_e^2}_{\text{Within-cluster variance}} $$

• non-homogeneous correlation among the responses of the units of the same cluster:

$$ Cov(y_{ij}, y_{i'j} \mid x_{ij}, x_{i'j}) = \sigma_{u0}^2 + \sigma_{u01} (x_{ij} + x_{i'j}) + \sigma_{u1}^2 x_{ij} x_{i'j} $$

• no correlation among the responses of units of different clusters:

$$ Cov(y_{ij}, y_{i'j'} \mid x_{ij}, x_{i'j'}) = 0 \quad \text{for } j \neq j' $$

56

The two-level linear model: variance function

$$ Var(y_{ij} \mid x_{ij}) = \sigma_{u0}^2 + 2 \sigma_{u01} x_{ij} + \sigma_{u1}^2 x_{ij}^2 + \sigma_e^2 $$

The variance function is a parabola with minimum at $x_{ij} = -\sigma_{u01} / \sigma_{u1}^2$

[Figure: $Var(y_{ij} \mid x_{ij})$ plotted against $x_{ij}$, with the minimum marked at $-\sigma_{u01} / \sigma_{u1}^2$]

Depending on the range of x, in a given application the variance function can be descending, ascending or U-shaped, but never ∩-shaped!

Up to now we have assumed that level 1 errors are homoscedastic, but this assumption can be relaxed, so a ∩-shaped relationship can be captured by level 1 heteroscedasticity
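The variance function of the random slope model can be evaluated numerically to see where its minimum falls. The sketch below computes sigma_u0^2 + 2 sigma_u01 x + sigma_u1^2 x^2 + sigma_e^2 on a grid; the function name and the parameter values are illustrative assumptions.

```python
import numpy as np

def variance_function(x, s_u0_sq, s_u01, s_u1_sq, s_e_sq):
    """Var(y | x) implied by a random intercept and a random slope on x."""
    x = np.asarray(x, dtype=float)
    return s_u0_sq + 2.0 * s_u01 * x + s_u1_sq * x ** 2 + s_e_sq

# Hypothetical variance-covariance parameters (negative intercept-slope covariance).
params = dict(s_u0_sq=60.0, s_u01=-30.0, s_u1_sq=20.0, s_e_sq=40.0)

# Analytical minimum of the parabola: x = -sigma_u01 / sigma_u1^2.
x_min = -params["s_u01"] / params["s_u1_sq"]

# Numerical check on a grid covering a plausible covariate range.
grid = np.linspace(0.0, 7.0, 71)
values = variance_function(grid, **params)
x_grid_min = float(grid[np.argmin(values)])

print(x_min, x_grid_min, float(variance_function(x_min, **params)))
```

With a covariate observed only to the right of the minimum the variance function would look ascending, which is why its shape in an application depends on the observed range of x.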

57

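Two-level model in matrix notation

Example with a single level 1 covariate with random slope

$$ \begin{pmatrix} y_{1j} \\ \vdots \\ y_{n_j j} \end{pmatrix} = \begin{pmatrix} 1 & x_{1j} \\ \vdots & \vdots \\ 1 & x_{n_j j} \end{pmatrix} \begin{pmatrix} \gamma_{00} \\ \gamma_{10} \end{pmatrix} + \begin{pmatrix} 1 & x_{1j} \\ \vdots & \vdots \\ 1 & x_{n_j j} \end{pmatrix} \begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} + \begin{pmatrix} e_{1j} \\ \vdots \\ e_{n_j j} \end{pmatrix} $$

$$ \mathbf{y}_j = \mathbf{X}_j \boldsymbol{\beta} + \mathbf{Z}_j \mathbf{u}_j + \mathbf{e}_j $$

The matrix notation is rarely used, but it is needed for writing estimation algorithms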
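A minimal sketch of the matrix form for one cluster. Since the random effects and the level 1 errors are independent, the model implies a covariance matrix of the responses in cluster j equal to Z_j Σ_u Z_j' + σ_e² I. The covariate values and the variance parameters below are hypothetical.

```python
import numpy as np

# Hypothetical covariate values for the n_j = 4 units of cluster j.
x_j = np.array([0.0, 1.0, 2.0, 5.0])

X_j = np.column_stack([np.ones_like(x_j), x_j])  # design matrix of the fixed part
Z_j = X_j.copy()                                 # design matrix of the random part

# Illustrative covariance matrix of (u_0j, u_1j) and level 1 variance.
Sigma_u = np.array([[60.0, -30.0],
                    [-30.0, 20.0]])
sigma_e2 = 40.0

# Covariance matrix of y_j implied by y_j = X_j beta + Z_j u_j + e_j.
V_j = Z_j @ Sigma_u @ Z_j.T + sigma_e2 * np.eye(len(x_j))

print(np.round(V_j, 1))
```

The diagonal of V_j reproduces the heteroscedastic variance function of the model, and the off-diagonal entries differ from pair to pair, so there is no single within-cluster correlation.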

58

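Covariance structure of the two-level model

$$ y_{ij} = \beta_{0j} + \beta_{1j} x_{ij} + e_{ij}, \qquad \beta_{0j} = \gamma_{00} + u_{0j}, \qquad \beta_{1j} = \gamma_{10} + u_{1j} $$

$$ \begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} \overset{iid}{\sim} N \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \boldsymbol{\Sigma}_u \right), \qquad e_{ij} \sim N(0, \sigma_e^2) $$

Covariance matrix of the random effects:

$$ \boldsymbol{\Sigma}_u = \begin{pmatrix} \sigma_{u0}^2 & \sigma_{u01} \\ \sigma_{u01} & \sigma_{u1}^2 \end{pmatrix} $$

Some special cases of $\boldsymbol{\Sigma}_u$

• Standard (OLS) regression

• Random intercept

• Random (intercept and) slope

Remark: when $\sigma_{u0}^2$ or $\sigma_{u1}^2$ is null the covariance $\sigma_{u01}$ is null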

59

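Standard (OLS) regression model

Special case $\boldsymbol{\Sigma}_u = \mathbf{0}$

$$ y_{ij} = \gamma_{00} + \gamma_{10} x_{ij} + e_{ij} $$

The regression line is the same for all clusters

Fixed intercept and slope: standard regression model

• Homoscedasticity: $Var(y_{ij} \mid x_{ij}) = \sigma_e^2$
• No correlation even within clusters: $Cov(y_{ij}, y_{i'j} \mid x_{ij}, x_{i'j}) = 0$

[Figure: a single regression line in the (x, y) plane]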

60


NELS-88 example: OLS regression

Parameter         | Estimate
------------------+---------
Intercept         |    44.07
Homework          |     3.57
Residual variance |    93.73

Each school has the same intercept and slope (= same line)

61

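Random intercept model

Special case

$$ \boldsymbol{\Sigma}_u = \begin{pmatrix} \sigma_{u0}^2 & 0 \\ 0 & 0 \end{pmatrix} $$

$$ y_{ij} = \gamma_{00} + \gamma_{10} x_{ij} + u_{0j} + e_{ij} $$

The variance of the slope is null (and so is the slope-intercept covariance)

The variance of the intercept does not depend on X (i.e. centering X is irrelevant)

The regression lines are parallel, so the clusters can be ranked

• Homoscedasticity: $Var(y_{ij} \mid x_{ij}) = \sigma_{u0}^2 + \sigma_e^2$
• Equi-correlation within clusters: $Cov(y_{ij}, y_{i'j} \mid x_{ij}, x_{i'j}) = \sigma_{u0}^2$

[Figure: parallel cluster-specific regression lines in the (x, y) plane]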

62

NELS-88 example: random intercept model

Parameter                | Estimate
-------------------------+---------
Intercept                |    44.98
Homework                 |     2.21
Residual lev. 2 variance |    22.50
Residual lev. 1 variance |    64.26

• Intercept: mean intercept in the population of schools
• Homework: each school has this estimated slope

Total residual variance = 22.50 + 64.26 = 86.76

Residual ICC = 22.50 / 86.76 = 0.259

25.9% of the residual variance of math scores after adjusting for homework is due to the clustering of pupils into schools

63

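Random (intercept and) slope model

General case

$$ \boldsymbol{\Sigma}_u = \begin{pmatrix} \sigma_{u0}^2 & \sigma_{u01} \\ \sigma_{u01} & \sigma_{u1}^2 \end{pmatrix} $$

$$ y_{ij} = \gamma_{00} + \gamma_{10} x_{ij} + u_{1j} x_{ij} + u_{0j} + e_{ij} $$

The between-school variance is a quadratic function of X

Heterogeneous correlation within clusters (no unique residual ICC)

The intercept variance and the slope-intercept covariance depend on X since they refer to X=0, so it is useful to center X

When the origin of X is arbitrary (which is often the case), the covariance should not be constrained to be zero

Many crossing points, so the clusters cannot be ranked

[Figure: cluster-specific regression lines with different intercepts and slopes in the (x, y) plane]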

64

NELS-88 example: random slope model

Parameter                 | Estimate
--------------------------+---------
Intercept                 |    44.77
Homework                  |     2.05
Residual lev. 2 var/cov.  |
  Intercept var.          |    61.81
  Homework var.           |    19.98
  Intercept-Homework cov. |   -28.26
Residual lev. 1 variance  |    43.07

• Homework: mean slope in the population of schools
• Intercept-Homework cov.: it amounts to a correlation of -0.80

Here the ICC is meaningless
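The correlation between random intercepts and random slopes follows from the reported estimates as covariance / sqrt(intercept variance × slope variance). A quick check, with illustrative variable names:

```python
import math

# Level 2 variance-covariance estimates from the random slope model table.
intercept_var = 61.81
slope_var = 19.98
covariance = -28.26

correlation = covariance / math.sqrt(intercept_var * slope_var)
print(f"{correlation:.2f}")
```

The sign is negative because the covariance is negative: schools with a higher intercept tend to have a flatter homework slope.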

65

Random (intercept and) slope model

What is the unit of measure of the variances/covariances?

mY = unit of measure of Y (e.g. kilograms)

mX = unit of measure of X (e.g. metres)

• $\sigma_{u0}$ is expressed in mY (just as $\beta_0$)
• $\sigma_{u1}$ is expressed in mY/mX (just as $\beta_1$)
• $\sigma_{u01}$ is expressed in (mY)²/mX

Be careful in interpreting standard deviations, variances and covariances of the random effects

66


Random coefficient on a binary covariate

Usually a random coefficient refers to a continuous covariate (‘random slope’)

A binary covariate may also have a random coefficient, but the interpretation is different

Suppose d is a binary covariate (or dummy variable); then, since $d_{ij}^2 = d_{ij}$,

$$ Var(y_{ij} \mid d_{ij}) = \sigma_{u0}^2 + 2 \sigma_{u01} d_{ij} + \sigma_{u1}^2 d_{ij}^2 + \sigma_e^2 = \sigma_{u0}^2 + (2 \sigma_{u01} + \sigma_{u1}^2) \, d_{ij} + \sigma_e^2 $$

Thus the between-cluster variance is heteroscedastic:

$$ \sigma_{u0}^2 \ \text{ for } d_{ij} = 0, \qquad \sigma_{u0}^2 + 2 \sigma_{u01} + \sigma_{u1}^2 \ \text{ for } d_{ij} = 1 $$

67

Basics of the two-level linear model

Case #2: introduction of a covariate at level 2

The two-level linear model (one covariate at level 1 + one covariate at level 2)

Introduction of level 2 covariates:

level 2 covariates represent features of the clusters useful to define a model for the level 1 parameters $(\beta_{0j}, \beta_{1j})$ and so reduce the level 2 variances $(\sigma_{u0}^2, \sigma_{u1}^2)$

Example: W is a binary variable coded 1=public school; 0=private school

69

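The two-level linear model (one covariate at level 1 + one covariate at level 2)

Level 1 model:

$$ y_{ij} = \beta_{0j} + \beta_{1j} x_{ij} + e_{ij} $$

Level 2 models:

$$ \beta_{0j} = \gamma_{00} + \gamma_{01} w_j + u_{0j}, \qquad \beta_{1j} = \gamma_{10} + \gamma_{11} w_j + u_{1j} $$

Combined model:

$$ y_{ij} = \underbrace{\gamma_{00} + \gamma_{01} w_j + \gamma_{10} x_{ij} + \gamma_{11} w_j x_{ij}}_{\text{Fixed part}} + \underbrace{u_{0j} + u_{1j} x_{ij} + e_{ij}}_{\text{Random part}} $$

Here it becomes clear why the $\gamma$ have a double index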

70

The two-level linear model (one covariate at level 1 + one covariate at level 2)

Level 2 models:

$$ \beta_{0j} = \gamma_{00} + \gamma_{01} w_j + u_{0j}, \qquad \beta_{1j} = \gamma_{10} + \gamma_{11} w_j + u_{1j} $$

• $\gamma_{01}$: mean difference in intercept between public and private schools
• $\gamma_{11}$: mean difference in slope between public and private schools
• $u_{0j}$: deviation of school j from the corresponding mean intercept, with $Var(u_{0j}) = \sigma_{u0}^2$
• $u_{1j}$: deviation of school j from the corresponding mean slope, with $Var(u_{1j}) = \sigma_{u1}^2$

Remark: the distributional assumptions on the random effects are the same as before, but now the variances have a different meaning (recall: the variances are residual with respect to the model covariates)

71

The two-level linear model (one covariate at level 1 + one covariate at level 2)

In the combined model there is a cross-level interaction, $\gamma_{11} w_j x_{ij}$

It arises because the level 1 coefficient $\beta_{1j}$ depends on the level 2 covariate $w_j$

A multilevel model can be written in two alternative ways:

a) a single combined equation (like most software)

b) a system of hierarchical equations (like the software named HLM)

Those who use approach b) usually end up with a more complex model (notably, with more cross-level interactions)
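The equivalence between the hierarchical and the combined formulation can be verified numerically. The sketch below simulates one small dataset and builds the response in both ways; the sample sizes, the parameter values and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
J, n = 5, 4                                   # hypothetical: 5 clusters of 4 units
g00, g01, g10, g11 = 50.0, -10.0, 2.0, 1.0    # illustrative fixed parameters

w = rng.integers(0, 2, size=J).astype(float)  # level 2 covariate (e.g. public school)
u0 = rng.normal(0, 3, size=J)                 # random intercepts
u1 = rng.normal(0, 1, size=J)                 # random slopes
x = rng.normal(0, 1, size=(J, n))             # level 1 covariate
e = rng.normal(0, 2, size=(J, n))             # level 1 errors

# b) system of hierarchical equations
beta0 = g00 + g01 * w + u0
beta1 = g10 + g11 * w + u1
y_hier = beta0[:, None] + beta1[:, None] * x + e

# a) single combined equation: fixed part + random part
wj = w[:, None]
fixed_part = g00 + g01 * wj + g10 * x + g11 * wj * x
random_part = u0[:, None] + u1[:, None] * x + e
y_comb = fixed_part + random_part

print(np.allclose(y_hier, y_comb))
```

The cross-level interaction term `g11 * wj * x` appears in the combined equation only because `beta1` depends on `w` in the hierarchical one.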


Example: random slope model with the covariate public (NO cross-level interaction)

Parameter                  | Estimate
---------------------------+----------
Intercept                  |   58.06
Homework                   |    1.94
Public                     |  -14.65
Residual lev. 2 var/cov.   |
  Intercept var.           |   40.68
  Homework var.            |   21.68
  Intercept-Homework cov.  |  -29.16
Residual lev. 1 variance   |   42.95

Intercept: mean intercept of non-public schools

Homework: mean slope of schools (regardless of public/non-public)

Public: difference in the mean intercept (public vs. non-public)

The effect of homework is assumed to be the same for public and non-public schools

Example: random slope model with the covariate public AND cross-level interaction

Parameter                  | Estimate
---------------------------+----------
Intercept                  |   59.21
Homework                   |    1.09
Public                     |  -15.94
Homework*Public            |    0.95
Residual lev. 2 var/cov.   |
  Intercept var.           |   40.50
  Homework var.            |   21.58
  Intercept-Homework cov.  |  -29.02
Residual lev. 1 variance   |   42.95

Homework: mean slope of non-public schools

Homework*Public: difference in the mean slope (public vs. non-public)

The mean slope of homework in the population of schools is 1.09 for non-public schools and 1.09 + 0.95 = 2.04 for public schools

Within, between and contextual effects

• Slopes: between, within and total

• Centering the covariates

• The contextual effect

• The fixed effects model

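Three regression models

[Figure: scatter plot of Y against X (both axes from 0 to 8) with the Total, Within and Between regression lines]

 j | i | X_ij | X_.j | Y_ij | Y_.j
---+---+------+------+------+------
 1 | 1 |  1   |  2   |  5   |  6
 1 | 2 |  3   |  2   |  7   |  6
 2 | 1 |  2   |  3   |  4   |  5
 2 | 2 |  4   |  3   |  6   |  5
 3 | 1 |  3   |  4   |  3   |  4
 3 | 2 |  5   |  4   |  5   |  4
 4 | 1 |  4   |  5   |  2   |  3
 4 | 2 |  6   |  5   |  4   |  3
 5 | 1 |  5   |  6   |  1   |  2
 5 | 2 |  7   |  6   |  3   |  2

Example from Snijders & Bosker (2011)

Difference Between-Within: “Ecological fallacy”

Real data example: i = graduate, j = faculty, Y = employability, X = graduation mark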

Regression models for estimating within and between relationships

Total: $\hat{y}_{ij} = 5.33 - 0.33\, x_{ij}$

Between cluster means: $\hat{\bar{y}}_{.j} = 8.00 - 1.00\, \bar{x}_{.j}$

Within clusters: $\hat{y}_{ij} - \bar{y}_{.j} = 1.00\, (x_{ij} - \bar{x}_{.j})$

Multilevel: $\hat{y}_{ij} = 8.00 - 1.00\, \bar{x}_{.j} + 1.00\, (x_{ij} - \bar{x}_{.j})$

The multilevel regression model allows us to study between and within relationships at the same time
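The total, between and within regressions for the ten-observation example can be reproduced with NumPy. The data are those of the table in the slides; the helper name `ols_line` is only illustrative.

```python
import numpy as np

cluster = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5])
x = np.array([1, 3, 2, 4, 3, 5, 4, 6, 5, 7], dtype=float)
y = np.array([5, 7, 4, 6, 3, 5, 2, 4, 1, 3], dtype=float)

def ols_line(x, y):
    """Return (intercept, slope) of the least squares line of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return intercept, slope

# Cluster means, then expanded back to one value per observation.
ids = np.unique(cluster)
x_mean = np.array([x[cluster == j].mean() for j in ids])
y_mean = np.array([y[cluster == j].mean() for j in ids])
idx = np.searchsorted(ids, cluster)

total = ols_line(x, y)                       # regression on the raw data
between = ols_line(x_mean, y_mean)           # regression on the cluster means
dx, dy = x - x_mean[idx], y - y_mean[idx]    # deviations from the cluster means
within_slope = (dx @ dy) / (dx @ dx)         # regression through the origin

print("total   (intercept, slope):", np.round(total, 2))
print("between (intercept, slope):", np.round(between, 2))
print("within slope:", round(within_slope, 2))
```

The between slope is negative while the within slope is positive, and the total slope lies in between: this sign reversal is the ecological fallacy.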

Centering a covariate

A covariate can be centered w.r.t.

• a given constant, such as the grand mean: this affects the intercept and (in a random slope model) the intercept variance and the intercept-slope covariance

• the cluster mean (CM centering), so if the cluster means are different the centering varies from cluster to cluster: this affects the slope (total effect vs. within effect)

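The Cronbach model

Cronbach model: CM centering & cluster mean

$$y_{ij} = \gamma_{00} + \gamma_{10} (x_{ij} - \bar{x}_{.j}) + \gamma_{01} \bar{x}_{.j} + u_j + e_{ij}$$

$\gamma_{10}$: within slope

$\gamma_{01}$: between slope

$$\bar{y}_{.j} = \gamma_{00} + \gamma_{01} \bar{x}_{.j} + u_j + \bar{e}_{.j}$$

$$y_{ij} - \bar{y}_{.j} = \gamma_{10} (x_{ij} - \bar{x}_{.j}) + (e_{ij} - \bar{e}_{.j})$$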

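The contextual model

‘contextual’ model: no CM centering, but cluster mean

$$y_{ij} = \gamma^*_{00} + \gamma^*_{10} x_{ij} + \gamma^*_{01} \bar{x}_{.j} + u_j + e_{ij}$$

replacing $x_{ij}$ with $(x_{ij} - \bar{x}_{.j}) + \bar{x}_{.j}$ yields

$$y_{ij} = \gamma^*_{00} + \gamma^*_{10} (x_{ij} - \bar{x}_{.j}) + (\gamma^*_{10} + \gamma^*_{01}) \bar{x}_{.j} + u_j + e_{ij}$$

$\gamma^*_{10} = \gamma_{10}$ = within slope

$\gamma^*_{01} = \gamma_{01} - \gamma_{10}$ = between slope − within slope

(here $\gamma_{10}$ and $\gamma_{01}$ are the parameters of the Cronbach model, and the starred symbols denote the parameters of the contextual model)

Just a reparameterization of the Cronbach model!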

The contextual effect

$$\gamma_{01} - \gamma_{10} = \text{between slope} - \text{within slope}$$

In sociology and education it is known as the contextual effect

It is the additional effect of the school mean of X on Y that is not accounted for by the individual level X

Usually X is prior score or SES (Socio-Economic Status)

The estimate of the contextual effect of X will partially encompass the effects of all school level variables that are correlated with X, including

• peer influences

• school climate

• allocation of resources

• organizational and structural features of schools

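Interpreting the three effects: within

$$y_{ij} = \gamma_{00} + \gamma_{10} (x_{ij} - \bar{x}_{.j}) + \gamma_{01} \bar{x}_{.j} + u_j + e_{ij}$$

Let X = prior score; Y = final score

Sam has X = 80 and attends a school with sch_mean(X) = 70

Within effect: expected Y for Sam vs expected Y for another pupil with X = 81 and sch_mean(X) = 70

$$E(y_{ij} \mid x_{ij} = 81, \bar{x}_{.j} = 70) - E(y_{ij} \mid x_{ij} = 80, \bar{x}_{.j} = 70) = (\gamma_{00} + 11\gamma_{10} + 70\gamma_{01}) - (\gamma_{00} + 10\gamma_{10} + 70\gamma_{01}) = \gamma_{10}$$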

Interpreting the three effects: between

$$y_{ij} = \gamma_{00} + \gamma_{10} (x_{ij} - \bar{x}_{.j}) + \gamma_{01} \bar{x}_{.j} + u_j + e_{ij}$$

Let X = prior score; Y = final score

Sam has X = 80 and attends a school with sch_mean(X) = 70

Between effect: expected Y for Sam vs expected Y for another pupil with X = 81 and sch_mean(X) = 71

$$E(y_{ij} \mid x_{ij} = 81, \bar{x}_{.j} = 71) - E(y_{ij} \mid x_{ij} = 80, \bar{x}_{.j} = 70) = (\gamma_{00} + 10\gamma_{10} + 71\gamma_{01}) - (\gamma_{00} + 10\gamma_{10} + 70\gamma_{01}) = \gamma_{01}$$

Between effect: you increase the school mean score by 1, and you also increase the individual score by 1 in order to leave the deviation unchanged (the pupil to be compared with Sam has the same relative position, namely 10 points above the school mean)

Interpreting the three effects: contextual

$$y_{ij} = \gamma_{00} + \gamma_{10} (x_{ij} - \bar{x}_{.j}) + \gamma_{01} \bar{x}_{.j} + u_j + e_{ij}$$

Let X = prior score; Y = final score

Sam has X = 80 and attends a school with sch_mean(X) = 70

Contextual effect: expected Y for Sam vs expected Y for another pupil with X = 80 and sch_mean(X) = 71

$$E(y_{ij} \mid x_{ij} = 80, \bar{x}_{.j} = 71) - E(y_{ij} \mid x_{ij} = 80, \bar{x}_{.j} = 70) = (\gamma_{00} + 9\gamma_{10} + 71\gamma_{01}) - (\gamma_{00} + 10\gamma_{10} + 70\gamma_{01}) = \gamma_{01} - \gamma_{10}$$

Contextual effect: you increase the school mean score by 1 and leave the individual score unchanged (this changes the deviation: the pupil to be compared with Sam has a different relative position, namely 9 points above the school mean instead of 10)
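The three comparisons with Sam can be reproduced from the Cronbach model with a short Python sketch. The parameter values below are hypothetical; only the comparisons (X = 80 or 81, school mean 70 or 71) come from the slides.

```python
def expected_y(x, school_mean, g00, g10, g01):
    """E(y | x, school mean) under the Cronbach model (random terms have mean 0)."""
    return g00 + g10 * (x - school_mean) + g01 * school_mean

# Hypothetical parameters: intercept, within slope, between slope.
g00, g10, g01 = 5.0, 0.6, 0.9

sam = expected_y(80, 70, g00, g10, g01)
within = expected_y(81, 70, g00, g10, g01) - sam      # same school mean, X + 1
between = expected_y(81, 71, g00, g10, g01) - sam     # school mean + 1, same deviation
contextual = expected_y(80, 71, g00, g10, g01) - sam  # school mean + 1, same X

print(round(within, 6), round(between, 6), round(contextual, 6))
```

The three differences equal the within slope, the between slope and their difference (between minus within), respectively.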


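The raw covariate model

‘raw covariate’ model: no CM centering, no cluster mean

$$y_{ij} = \gamma_{00} + \gamma_{10} x_{ij} + u_j + e_{ij}$$

$$\bar{y}_{.j} = \gamma_{00} + \gamma_{10} \bar{x}_{.j} + u_j + \bar{e}_{.j}$$

$$y_{ij} - \bar{y}_{.j} = \gamma_{10} (x_{ij} - \bar{x}_{.j}) + (e_{ij} - \bar{e}_{.j})$$

This model implicitly assumes that the between and within slopes are identical!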

Some model specifications

1) $y_{ij} = \beta_{total}\, x_{ij}$

2) $y_{ij} = \beta_{within}\, x_{ij} + (\beta_{between} - \beta_{within})\, \bar{x}_{.j}$

3) $y_{ij} = \beta_{within}\, (x_{ij} - \bar{x}_{.j}) + \beta_{between}\, \bar{x}_{.j}$

4) $y_{ij} = \beta_{within}\, (x_{ij} - \bar{x}_{.j})$

Models 2) and 3) are statistically equivalent

Models 1 to 3 try to fully control for X, while model 4 controls only for the within effect of X

In most applications, within and between slopes are quite different → model 1 is wrong (though it is the most parsimonious model controlling for X)

In models with many covariates the parsimony principle may suggest disentangling the slopes only for the covariates of main interest

More details in the textbooks of Raudenbush & Bryk and Snijders & Bosker

References on centering and within/between/contextual effects

Snijders TAB and Bosker RJ (2011) Multilevel analysis: An introduction to basic and advanced multilevel modeling. 2nd ed. London: Sage.

Raudenbush SW and Bryk AS (2002) Hierarchical Linear Models. 2nd ed. Thousand Oaks: Sage.

Kreft IGG, de Leeuw J and Aiken L (1995) The effect of different forms of centering in hierarchical linear models. Multivariate Behavioral Research, 1-21.

Paccagnella O (2006) Centering or not centering in multilevel models: The role of the group mean and the assessment of group effects. Evaluation Review 30, 66-85.

The fixed effects model

$$y_{ij} = \alpha_j + \beta x_{ij} + e_{ij}$$

random effects $u_j$ replaced by parameters $\alpha_j$. Thus no distributional assumptions!

• No distributional assumptions on the cluster effects → no need to worry about homoscedasticity, normality, correlation between random effects and covariates

• The slope is not the total effect, but the within effect (in panel data the corresponding estimator is known as the fixed effects estimator): in fact, all the between variation is absorbed by the fixed effects → the covariates can only explain the within variation
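That the fixed effects slope is the within effect can be checked numerically: regressing y on cluster dummies plus x gives the same slope as regressing the cluster-mean-centered y on the cluster-mean-centered x. The sketch below uses simulated data with hypothetical sizes and parameter values.

```python
import numpy as np

rng = np.random.default_rng(1)
J, n = 6, 5                                  # hypothetical: 6 clusters of 5 units
cluster = np.repeat(np.arange(J), n)
alpha = rng.normal(0, 5, size=J)             # cluster effects
x = rng.normal(0, 1, size=J * n) + alpha[cluster]   # x correlated with the cluster effects
y = alpha[cluster] + 2.0 * x + rng.normal(0, 1, size=J * n)

# Fixed effects model: one dummy per cluster plus the covariate.
dummies = (cluster[:, None] == np.arange(J)[None, :]).astype(float)
design = np.column_stack([dummies, x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
beta_fixed_effects = coef[-1]

# Within estimator: regression on deviations from the cluster means.
x_mean = np.array([x[cluster == j].mean() for j in range(J)])[cluster]
y_mean = np.array([y[cluster == j].mean() for j in range(J)])[cluster]
dx, dy = x - x_mean, y - y_mean
beta_within = (dx @ dy) / (dx @ dx)

print(beta_fixed_effects, beta_within)
```

The two estimates coincide up to rounding error, even though x is correlated with the cluster effects.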

Fixed vs random effects

Random effects is the standard choice in most fields (Epidemiology, Sociology, Psychometrics, …), while in Econometrics the standard choice is fixed effects [e.g. Rivkin S.G., Hanushek E.A., Kain J.F. (2005) Teachers, schools, and academic achievement. Econometrica, 73, 417-458.]

Fixed effects have the merit of avoiding assumptions on the random effects, and they can be used even with very few clusters (e.g. 5 clusters). But they entail several limitations:

• Impossible to use level 2 covariates, a dramatic limitation in the (frequent) case where the research questions concern the effect of level 2 covariates!

• Loss of efficiency (since number of fixed effects = number of clusters)

• Inefficient estimation of cluster effects (for example, if a cluster has two units its fixed effect is estimated with just two observations)

Inference in two-level models


Parameter estimation

$$y_{ij} = \underbrace{\gamma_{00} + \gamma_{01} w_j + \gamma_{10} x_{ij} + \gamma_{11} w_j x_{ij}}_{\text{Fixed part}} + \underbrace{u_{0j} + u_{1j} x_{ij} + e_{ij}}_{\text{Random part}}$$

Maximum likelihood

• step 1: estimation of fixed parameters $\gamma_{00}, \gamma_{01}, \gamma_{10}, \gamma_{11}$ and variance-covariance parameters $\sigma^2_e, \sigma^2_{u0}, \sigma^2_{u1}, \sigma_{u01}$

• step 2: prediction of random effects $(u_{0j}, u_{1j},\ j = 1, \dots, J)$

Bayesian inference

• the parameters are random variables with a prior distribution

• parameters and random effects are all random variables

OLS estimation

What about estimating the fixed parameters using OLS (Ordinary Least Squares) or ML under the standard assumptions of the linear model?

$$y_{ij} = \gamma_{00} + \gamma_{01} w_j + \gamma_{10} x_{ij} + \gamma_{11} w_j x_{ij} + e^*_{ij}$$

The model is linear but the error term

$$e^*_{ij} = u_{0j} + u_{1j} x_{ij} + e_{ij}$$

violates the homoscedasticity and no-correlation assumptions

→ Inefficient estimators

→ Biased standard errors

OLS standard errors

Consider a random intercept model with a single covariate

$$y_{ij} = \alpha + \beta x_{ij} + u_j + e_{ij}$$

Fitting the model with OLS (i.e. omitting the random effects) leads to a wrong standard error (s.e.) for the covariate

• Level 2 covariate (purely between, i.e. constant within clusters) → the OLS s.e. is (substantially) too low

• Purely within level 1 covariate (it varies only within clusters) → the OLS s.e. is (slightly) too high

• Level 1 covariate with between variation (it varies both within and between clusters) → the OLS s.e. is the result of two opposite effects: the between part pushes it down, the within part pushes it up. In practice the OLS s.e. is nearly always too low
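The first case (a level 2 covariate) can be illustrated with a small Monte Carlo experiment: the standard error reported by OLS is compared with the actual sampling variability of the OLS slope. This is a sketch under stated assumptions: the number of clusters, the cluster size, the variances and the number of replications are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
J, n, reps = 30, 20, 500
var_u, var_e = 1.0, 1.0                      # hypothetical: intraclass correlation 0.5

w = np.repeat(rng.normal(size=J), n)         # level 2 covariate, constant within clusters
X = np.column_stack([np.ones(J * n), w])
XtX_inv = np.linalg.inv(X.T @ X)

slopes, naive_se = [], []
for _ in range(reps):
    u = np.repeat(rng.normal(0, np.sqrt(var_u), size=J), n)
    e = rng.normal(0, np.sqrt(var_e), size=J * n)
    y = 1.0 + 0.5 * w + u + e                # random intercept model
    b = XtX_inv @ X.T @ y                    # OLS ignoring the random effects
    resid = y - X @ b
    s2 = resid @ resid / (J * n - 2)
    slopes.append(b[1])
    naive_se.append(np.sqrt(s2 * XtX_inv[1, 1]))

empirical_sd = np.std(slopes, ddof=1)        # actual variability of the OLS slope
mean_naive_se = np.mean(naive_se)            # what OLS reports on average
print(f"empirical s.d. = {empirical_sd:.3f}, mean OLS s.e. = {mean_naive_se:.3f}")
```

In this setting the OLS standard error is several times smaller than the empirical standard deviation of the estimates, so tests based on it would reject far too often.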

Likelihood

The equation defining a random effects model includes the random effects, but they are not observable (so they cannot appear in the likelihood) → the random effects must be integrated out!

$f(y \mid u, \theta)$: distribution of responses, conditional on random effects and parameters

$p(u \mid \theta)$: distribution of random effects, conditional on parameters

Likelihood:

$$L(\theta) = \int f(y \mid u, \theta)\, p(u \mid \theta)\, du$$

Problem: the integral has an analytical solution only for conjugate distributions (e.g. Normal-Normal, Binomial-Beta, …)

Closed-form likelihood

Multiple random effects → multivariate distribution → the Normal distribution is preferable (and in fact it is the standard choice in applications)

Linear model: Normal-Normal (conjugate) → the integral has an analytical solution → closed-form likelihood

Non-linear model: e.g. Binomial-Normal (non-conjugate) → the likelihood must be evaluated through approximate integration methods

Maximum Likelihood

$$\hat{\theta}_{ML} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \int f(y \mid u, \theta)\, p(u \mid \theta)\, du$$

Once the ML estimates of fixed and random parameters have been obtained

→ Empirical Bayes estimation of the random effects

$$p(u \mid y, \hat{\theta}_{ML}) = \frac{f(y \mid u, \hat{\theta}_{ML})\, p(u \mid \hat{\theta}_{ML})}{\int f(y \mid u, \hat{\theta}_{ML})\, p(u \mid \hat{\theta}_{ML})\, du}$$

Bayesian inference

For Bayesian inference a prior distribution $p(\theta)$ is defined

and the joint posterior distribution is used

$$p(u, \theta \mid y) = \frac{f(y \mid u, \theta)\, p(u \mid \theta)\, p(\theta)}{\int\!\!\int f(y \mid u, \theta)\, p(u \mid \theta)\, p(\theta)\, du\, d\theta}$$

Inference on the parameters:

$$p(\theta \mid y) = \int p(\theta, u \mid y)\, du$$

Inference on the random effects:

$$p(u \mid y) = \int p(\theta, u \mid y)\, d\theta$$

Bayesian inference: pros and cons /1

Even for the linear model it is necessary to use approximate integration algorithms (e.g. Gibbs sampling)

The Bayesian approach has the usual disadvantages, whereas the main advantages here are

• Good estimates even for a small number of clusters

• In complex multilevel models the estimates properly account for all the uncertainty → Bayesian methods yield good estimates of the variance components and confidence intervals with appropriate coverage even in highly complex models, where ML methods show a poor performance

Bayesian inference: pros and cons /2

Seltzer, M.H., Wong W.H., and Bryk A.S. (1996) Bayesian Analysis in Applications of Hierarchical Models: Issues and Methods. Journal of Educational and Behavioral Statistics 21(2): 131-167.

Bian, Guarui (2002) Bayesian Estimates in a One-Way ANOVA Random Effects Model. Australian & New Zealand Journal of Statistics 44(1): 99-108. (simulation with 3 clusters of size 2)

Browne W.J., & Draper D. (2006). A comparison of Bayesian and likelihood-based methods for fitting multilevel models, Bayesian Analysis, 1, 473–514. http://ba.stat.cmu.edu/journal/2006/vol01/issue03/draper2.pdf

Hierarchical specification of the model (useful to write down the likelihood)

The random intercept model

$$y_{ij} = \alpha + \beta x_{ij} + u_j + e_{ij}$$

can also be written hierarchically (which is the standard way to specify multilevel GLMs)

1. $y_{ij} \mid x_{ij}, u_j \overset{ind}{\sim} N(\alpha + \beta x_{ij} + u_j,\ \sigma^2_e)$

2. $u_j \overset{iid}{\sim} N(0,\ \sigma^2_u)$

Independence in 1. stems from conditioning on the random effects

Remark: in the hierarchical specification the level 1 errors $e_{ij}$ are not written (only their variance)

Hierarchical construction of the likelihood for the random intercept model

The likelihood can be written in steps by exploiting the conditional independencies shown by the hierarchical formulation

$$L_j(\psi \mid u_j) = \prod_{i=1}^{n_j} L_{ij}(\psi \mid u_j) \qquad j\text{-th cluster conditional likelihood (conditional independence given } u_j)$$

$$L_j(\theta) = \int L_j(\psi \mid u_j)\, p(u_j \mid \tau)\, du_j \qquad j\text{-th cluster marginal likelihood}$$

$$L(\theta) = \prod_{j=1}^{J} L_j(\theta) \qquad \text{marginal likelihood (independent clusters)}$$

where $\theta = (\psi, \tau)$, with $\tau = \sigma^2_u$ the parameter of the random effects and $\psi = (\alpha, \beta, \sigma^2_e)$ all other parameters

$L_{ij}(\psi \mid u_j)$ = density of $N(\alpha + \beta x_{ij} + u_j,\ \sigma^2_e)$ for fixed $u_j$, evaluated at the observed $y_{ij}$
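The construction above can be followed step by step in code for a single cluster of the random intercept model: the conditional likelihood is a product over units, and the random effect is integrated out numerically on a grid. Since the linear model is Normal-Normal, the result can be compared with the closed form (a multivariate Normal density with covariance $\sigma^2_e I + \sigma^2_u 11^T$). This is a sketch: the function names, the data and the parameter values are hypothetical, and the plain grid sum stands in for the more refined integration methods used by software.

```python
import numpy as np

def normal_pdf(z, mean, var):
    return np.exp(-0.5 * (z - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def cluster_marginal_lik_numeric(y, x, alpha, beta, var_e, var_u, grid_size=4001):
    """Integrate the conditional likelihood over u_j on an equally spaced grid."""
    sd_u = np.sqrt(var_u)
    u = np.linspace(-10 * sd_u, 10 * sd_u, grid_size)
    du = u[1] - u[0]
    mean = alpha + beta * x[:, None] + u[None, :]
    # Conditional likelihood L_j(psi | u_j): product over the units of the cluster.
    cond = np.prod(normal_pdf(y[:, None], mean, var_e), axis=0)
    return np.sum(cond * normal_pdf(u, 0.0, var_u)) * du

def cluster_marginal_lik_closed(y, x, alpha, beta, var_e, var_u):
    """Closed form: y_j is multivariate Normal with covariance var_e*I + var_u*11'."""
    n = len(y)
    omega = var_e * np.eye(n) + var_u * np.ones((n, n))
    r = y - (alpha + beta * x)
    quad = r @ np.linalg.solve(omega, r)
    _, logdet = np.linalg.slogdet(omega)
    return np.exp(-0.5 * (n * np.log(2.0 * np.pi) + logdet + quad))

# One hypothetical cluster with three units.
y = np.array([1.2, 0.4, 2.1])
x = np.array([0.0, 1.0, 2.0])
params = dict(alpha=0.5, beta=0.3, var_e=1.5, var_u=0.5)

lik_numeric = cluster_marginal_lik_numeric(y, x, **params)
lik_closed = cluster_marginal_lik_closed(y, x, **params)
print(lik_numeric, lik_closed)
```

The full marginal likelihood would be the product of such cluster contributions over j = 1, …, J (in practice, the sum of their logarithms).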

FIML and REML

Full Information Maximum Likelihood (FIML)

• Full likelihood, joint estimation of fixed and random parameters

• Underestimates the random parameters since it treats the fixed parameters as known quantities (ignoring degrees of freedom)

Restricted Maximum Likelihood (REML)

• The random parameters are estimated by maximizing the restricted likelihood, i.e. the density of the residuals

• The estimators of the random parameters are approximately unbiased even in small samples

Degrees of freedom

To understand why FIML underestimates the random parameters, recall what happens in standard linear regression

$$y_i = \beta' X_i + e_i, \qquad Var(e_i) = \sigma^2, \qquad \hat{e}_i = y_i - \hat{\beta}' X_i$$

$$\hat{\sigma}^2_{FIML} = \frac{1}{n} \sum_i \hat{e}_i^2 \qquad \hat{\sigma}^2_{OLS} = \frac{1}{n-k} \sum_i \hat{e}_i^2 \qquad E(\hat{\sigma}^2_{OLS}) = \sigma^2$$

If k is large w.r.t. n, then $\hat{\sigma}^2_{FIML}$ is severely downward biased (it does not correct for the degrees of freedom lost in estimating $\beta$)

In a multilevel model the sample size which is relevant for the estimation of the level 2 variance is J, i.e. the number of clusters
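The downward bias of the ML variance estimator in standard regression is easy to see numerically, because the two estimators differ exactly by the factor (n − k)/n. The sketch below uses simulated data; n, k and the true variance are hypothetical, and k counts all the regression coefficients (intercept included).

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 20, 8                                 # hypothetical: k is large relative to n
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ rng.normal(size=k) + rng.normal(0, 2.0, size=n)   # true error variance = 4

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
rss = resid @ resid

sigma2_ml = rss / n            # ignores the degrees of freedom used to estimate beta
sigma2_ols = rss / (n - k)     # unbiased estimator

print(f"ML: {sigma2_ml:.3f}  OLS: {sigma2_ols:.3f}  ratio: {sigma2_ml / sigma2_ols:.3f}")
```

Here the ratio is (20 − 8)/20 = 0.6, so on average the ML estimator recovers only 60% of the true variance.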

FIML vs REML

In a two-level model, REML and FIML lead to:

• Similar estimates for the level 1 variance

• Discordant estimates for the parameters of the random effects if the number of clusters J is small (in such a case FIML estimates of variances are lower)

Unless the main aim is the estimation of the random parameters, FIML is preferred because:

• FIML estimators have a lower sampling variance (the comparison in terms of MSE is often in favour of FIML estimators)

• With FIML the LRT (Likelihood Ratio Test) can be used not only to test the random parameters, but also to test the fixed parameters

FIML algorithms

• Iterative Generalized Least Squares (Goldstein 1986)

• Fisher Scoring (Longford 1987)

• EM (developed by Dempster, Laird and Rubin for models with missing data, later applied to multilevel models since the random effects are, in a sense, missing data)

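Generalized Least Squares (GLS)

The available estimation procedures differ in the step for estimating the variance-covariance parameters, while the step for estimating the fixed parameters is always Generalized Least Squares (GLS)

$$\begin{pmatrix} \hat{\alpha} \\ \hat{\beta} \end{pmatrix} = \left( \sum_{j=1}^{J} X_j^T \Omega_j^{-1} X_j \right)^{-1} \sum_{j=1}^{J} X_j^T \Omega_j^{-1} y_j$$

where, for the random intercept model,

$$y_j = \begin{pmatrix} y_{1j} \\ \vdots \\ y_{n_j j} \end{pmatrix} \qquad X_j = \begin{pmatrix} 1 & x_{1j} \\ \vdots & \vdots \\ 1 & x_{n_j j} \end{pmatrix} \qquad \Omega_j = \begin{pmatrix} \sigma^2_u + \sigma^2_e & \cdots & \sigma^2_u \\ \vdots & \ddots & \vdots \\ \sigma^2_u & \cdots & \sigma^2_u + \sigma^2_e \end{pmatrix}$$

• Replacing the variances with consistent estimates leads to a feasible GLS

• Maximum likelihood estimators are an instance of feasible GLS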
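A minimal sketch of the GLS step for the random intercept model, with the variances treated as known, is given below. The function name is hypothetical; the data are the ten observations of the Snijders & Bosker example used earlier, chosen so that the result can be compared with the total and within slopes.

```python
import numpy as np

def gls_random_intercept(y_by_cluster, x_by_cluster, var_u, var_e):
    """GLS estimate of (intercept, slope) given the two variance components."""
    A = np.zeros((2, 2))
    b = np.zeros(2)
    for y_j, x_j in zip(y_by_cluster, x_by_cluster):
        n_j = len(y_j)
        X_j = np.column_stack([np.ones(n_j), x_j])
        omega_j = var_e * np.eye(n_j) + var_u * np.ones((n_j, n_j))
        omega_inv_X = np.linalg.solve(omega_j, X_j)   # Omega_j^{-1} X_j
        A += X_j.T @ omega_inv_X                      # X_j' Omega_j^{-1} X_j
        b += omega_inv_X.T @ y_j                      # X_j' Omega_j^{-1} y_j (Omega_j symmetric)
    return np.linalg.solve(A, b)

x_by_cluster = [np.array(v, dtype=float) for v in ([1, 3], [2, 4], [3, 5], [4, 6], [5, 7])]
y_by_cluster = [np.array(v, dtype=float) for v in ([5, 7], [4, 6], [3, 5], [2, 4], [1, 3])]

gls_no_clustering = gls_random_intercept(y_by_cluster, x_by_cluster, var_u=0.0, var_e=1.0)
gls_huge_var_u = gls_random_intercept(y_by_cluster, x_by_cluster, var_u=1e6, var_e=1.0)
print(np.round(gls_no_clustering, 2), np.round(gls_huge_var_u, 2))
```

With a level 2 variance of zero GLS reduces to OLS (the total regression), while with a very large level 2 variance the slope approaches the within slope. A feasible GLS would plug in estimates of the two variances.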

Properties of FIML estimators

Under mild regularity conditions FIML estimators have good asymptotic properties:

• Consistency

• Normality

• Efficiency

Remark: here asymptotic requires increasing the number of clusters (increasing the cluster sizes is not enough), so the number of clusters J is the key quantity for asymptotics

ML and missing data

Complete-case or ‘listwise’ analysis:

• Reduced power

• Inconsistent estimates unless missingness only depends on covariates (Missing Completely At Random, MCAR)

ML of a mixed model based on all available responses:

• Consistent estimates if missingness only depends on covariates and observed responses (Missing At Random, MAR) and the mixed model is correctly specified

• Inconsistent estimates if missingness depends on missing responses (Not Missing At Random, NMAR)

ML of a mixed model with explicit missingness models:

• Consistent estimates under correct specification of the missingness model (and of the ‘substantive’ model)

Hypothesis testing on a single fixed parameter

Null hypothesis: $H_0: \gamma_h = 0$

Wald test statistic:

$$T(\gamma_h) = \frac{\hat{\gamma}_h}{s.e.(\hat{\gamma}_h)} \qquad T(\gamma_h) \overset{approx}{\sim} t_{d.f.}$$

where d.f. is

• #(level 1 units) − #(covariates) − 1 if $\gamma_h$ is the coefficient of a covariate at level 1

• #(clusters) − #(covariates at level 2) − 1 if $\gamma_h$ is the coefficient of a covariate at level 2

A caveat: with few clusters the standard errors are usually underestimated → the test rejects the null hypothesis too often (i.e. the type I error rate is higher than the nominal level)

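Hypothesis testing on a set of fixed parameters

Null hypothesis: $H_0: C'\gamma = 0$ (C = matrix of contrasts with k rows)

Wald test statistic:

$$Q(C'\gamma) = (C'\hat{\gamma})^T\, \hat{\Sigma}_{C'\hat{\gamma}}^{-1}\, (C'\hat{\gamma}) \qquad Q(C'\gamma) \overset{approx}{\sim} \chi^2_k$$

where $\hat{\Sigma}_{C'\hat{\gamma}}$ is the estimated covariance matrix of $C'\hat{\gamma}$

Remark: with few clusters the F distribution is preferable

Alternative: LRT (asymptotically equivalent)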

Hypothesis testing on the random parameters

Hypothesis

$$H_0: \Sigma = \Sigma_0 \quad \text{vs.} \quad H_1: \Sigma = \Sigma_1$$

where $\Sigma_0$ is a restricted version of $\Sigma_1$ (e.g. some elements are constrained to 0)

Unless the number of clusters is huge, the Wald test should not be used since the sampling distributions of the estimators of the random parameters are highly asymmetric

LRT on a level 2 variance

D1 = deviance (−2 log L) of the unrestricted model

D0 = deviance (−2 log L) of the model with a variance set to 0

#(restrictions) = 1 + #(corresponding covariances)

$$D_0 - D_1 \overset{approx}{\sim} \begin{cases} 0 & \text{with prob. } 1/2 \\ \chi^2_{(\#restrictions)} & \text{with prob. } 1/2 \end{cases}$$

Practical rule: the p-value computed from the $\chi^2$ distribution must be halved! (otherwise the test is conservative, i.e. the actual probability of type I error is lower than $\alpha$)

Remark: with REML the LRT can be used only if the fixed part of the model is unchanged!
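The practical rule can be sketched in Python using only the standard library, since the upper tail of a chi-square has a closed form for 1 and 2 degrees of freedom (the cases of a random intercept variance, and of a random slope variance with one covariance). The function names and the two deviances are hypothetical.

```python
import math

def chi2_sf(x, df):
    """Upper tail probability of a chi-square; closed forms for df = 1 and 2 only."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("only df = 1 or 2 are handled in this sketch")

def lrt_level2_variance(dev_restricted, dev_unrestricted, n_restrictions):
    """Return the LRT statistic D0 - D1, the naive p-value and the halved p-value."""
    stat = max(dev_restricted - dev_unrestricted, 0.0)
    naive_p = chi2_sf(stat, n_restrictions)
    return stat, naive_p, naive_p / 2.0

# Hypothetical deviances of the restricted (D0) and unrestricted (D1) models.
stat, naive_p, halved_p = lrt_level2_variance(1000.00, 996.16, n_restrictions=1)
print(f"D0 - D1 = {stat:.2f}, naive p = {naive_p:.3f}, halved p = {halved_p:.3f}")
```

In this example the naive p-value sits at the 5% threshold, while the halved p-value is about 0.025, which shows how conservative the unadjusted test is.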

Effect of the covariates on the (residual) variances

Starting with a random effects ANOVA model

$$y_{ij} = \alpha + u_j + e_{ij} \qquad Var(u_j) = \sigma^2_u \qquad Var(e_{ij}) = \sigma^2_e$$

level 2 and level 1 covariates can be added:

• A level 2 covariate reduces (or leaves unchanged) the level 2 variance, but (being constant within each cluster) cannot affect the level 1 variance

• A level 1 covariate reduces (or leaves unchanged) the level 1 variance, but its effect on the level 2 variance is unpredictable

Why the estimated level 2 variance may increase /1

• a level 2 covariate by definition is purely between, i.e. varies only between clusters

• a level 1 covariate can be written as the sum $x_{ij} = \bar{x}_j + (x_{ij} - \bar{x}_j)$ of:

a purely between component $\bar{x}_j$ → reduces $\sigma^2_u$ and does not affect $\sigma^2_e$

a purely within component $(x_{ij} - \bar{x}_j)$ → reduces $\sigma^2_e$ and thus increases $\sigma^2_u$

A purely within level 1 covariate increases the estimate of $\sigma^2_u$

The effect can be illustrated with the ANOVA formula

$$\hat{\sigma}^2_u = S^2_B - \frac{\hat{\sigma}^2_e}{n}$$

This effect is strong only if the cluster size n is small

Usually a level 1 covariate varies both within and between → the effect on $\sigma^2_u$ is unpredictable (often reduces it)
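The ANOVA formula makes the mechanism easy to quantify. In the sketch below the between-cluster quantity $S^2_B$ is held fixed while a purely within covariate lowers the level 1 variance estimate; all the numbers are hypothetical.

```python
def anova_level2_variance(s2_between, var_e_hat, n):
    """ANOVA estimate of the level 2 variance for clusters of common size n."""
    return s2_between - var_e_hat / n

s2_between = 5.0                      # hypothetical, unaffected by a purely within covariate
var_e_before, var_e_after = 8.0, 6.0  # level 1 variance before/after adding the covariate

for n in (2, 50):
    before = anova_level2_variance(s2_between, var_e_before, n)
    after = anova_level2_variance(s2_between, var_e_after, n)
    print(f"n={n}: level 2 variance estimate goes from {before:.2f} to {after:.2f}")
```

With clusters of size 2 the estimate doubles (from 1 to 2), while with clusters of size 50 it barely moves (from 4.84 to 4.88).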

Why the estimated level 2 variance may increase /2

A rise in the estimated level 2 variance $\sigma^2_u$ can be due to:

• the addition of a level 1 covariate varying mainly (or exclusively) within clusters → this is a consequence of the estimation method (the model is assumed to be correctly specified) and it is negligible for large cluster sizes

• the inclusion of an endogenous covariate, i.e. a covariate correlated with the random effects (an incorrectly specified model)

Example (Longford). Consider a comparison of hospitals based on the response ‘death of the patient’ and suppose that the covariate ‘severity of the case’ is added; if hospitals that tend to treat more severe cases provide higher quality treatment, then after adjustment for severity the between-hospital variance will be much greater.

Effect of the covariates on the (residual) variances: general rule

In a multilevel model (two or more levels) a covariate at an arbitrary level:

• does not affect the variances at lower levels

• reduces (or leaves unchanged) the variances at the same level

• has an unpredictable effect on the variances at higher levels

Use of the residuals

• Diagnostics

• Inference for specific clusters

Diagnostics based on the residuals

Many residuals

• Level 1: $\hat{e}_{ij}$

• Level 2 (Empirical Bayes): $\hat{u}_{0j}, \hat{u}_{1j}, \dots$

Extension of the techniques used in standard regression

Purposes:

• Check the functional form and the distributional assumptions (normality, homoscedasticity, …)

• Look for influential units

Diagnostics based on the residuals

To check the normality assumption use a histogram or a Q-Q plot

Diagnostics based on the residuals

To locate anomalous units look at the standardized residuals, using the diagnostic standard errors

[Figure: plots of the standardized $\hat{u}_{0j}$ and of the standardized $\hat{u}_{1j}$]

Diagnostics based on the residuals

To look for misspecifications in the fixed part and/or heteroscedasticity, plot the residuals one by one against the predicted values of the response

Inference on the random effects

The level 2 residuals are predictions of the corresponding random effects $u_j$ → they can be used to make inference

Two relevant inferential questions:

• Is cluster j* significantly different from the mean? Since the mean is 0, the question is whether $u_{j^*} \neq 0$

• Is cluster j* significantly different from cluster j**? The question is whether $u_{j^*} \neq u_{j^{**}}$

Comparisons with the mean

Caterpillar plot of the residuals for comparing each cluster with the mean

[Figure: EB predictions of the random effects (vertical axis, from -1.0 to 1.2) against their ranks (horizontal axis, from 0 to 60), with error bars $\hat{u}_j \pm 1.96\, SE(\hat{u}_j)$]

The residuals are ordered and endowed with 95% confidence bars (± 1.96 times the comparative standard errors)

The width of the error bar of a given cluster depends on its size

There are many clusters significantly above or below the mean

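Warning: how to compare two means

A common misconception: believing that two quantities are significantly different at 5% only if their 95% intervals are disjoint

$$X \sim N(\mu_X, \sigma^2) \qquad Y \sim N(\mu_Y, \sigma^2)$$

with 95% intervals $X \pm 1.96\,\sigma$ and $Y \pm 1.96\,\sigma$

Assuming for simplicity that X and Y are independent

$$X - Y \sim N(\mu_X - \mu_Y,\ 2\sigma^2) \qquad \text{with 95% interval } (X - Y) \pm 1.96\sqrt{2}\,\sigma$$

X is significantly different from Y at the 5% level if and only if:

• the distance (in $\sigma$ units) between X and Y exceeds $1.96\sqrt{2} = 2.77$

• or, equivalently, the univariate intervals of half-width $2.77/2 = 1.39$ (in $\sigma$ units) are disjoint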
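The two constants can be reproduced with the Python standard library, under the same assumptions stated above (Normality, independence, common variance). The variable names are illustrative.

```python
import math
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975)           # 97.5th percentile of the standard Normal
distance_threshold = z * math.sqrt(2.0)   # minimum distance, in sigma units, for significance
half_width = distance_threshold / 2.0     # half-width of the bars for pairwise comparisons

print(round(z, 2), round(distance_threshold, 2), round(half_width, 2))
```

Bars of half-width 1.39 standard errors are therefore shorter than the usual 1.96 bars, which is why disjoint 95% intervals are a stricter requirement than a 5% test of the difference.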

Graph for pair-wise comparisons

Caterpillar plot of the residuals for comparing two clusters

[Figure: EB predictions of the random effects (vertical axis, from -1.0 to 1.2) against their ranks (horizontal axis, from 0 to 60), with error bars $\hat{u}_j \pm 1.39\, SE(\hat{u}_j)$]

The residuals are ordered and endowed with error bars (± 1.39 times the comparative standard errors). The value 1.39 stems from assumptions (Normal distribution, independence, same variance) which usually are not satisfied → the value 1.39 is an approximation and the significance level 5% is an average level

Here only a few comparisons are significant. The width of the error bar of a given cluster depends on its size → one way to increase the number of significant comparisons is to collect more data on each cluster

The only difference with the previous caterpillar plot is the width of the bars

Sample size requirements to fit multilevel models

How many clusters are needed?

A multilevel model requires a minimum number of clusters in order to get (approximately) unbiased estimates

The minimum depends on the type of model (linear vs non-linear, random intercept vs random slope), on the average size of the clusters and on the true parameter values

In the best situation (a simple linear random intercept model and a dataset with large clusters) the minimum is about 10 (but often we are far from the best situation!)

In case of few clusters → fixed effects model

An alternative solution to handle few clusters is to fit a random effects model with Bayesian methods, since they do not rely on asymptotics

Target of inference

The sample size requirements are different depending on the target of inference

The least demanding target is to get unbiased point estimates of regression coefficients (in favourable situations 10 clusters of size 2 may be enough)

More clusters (say 30 or 50) are needed for unbiased estimation of variance components and standard errors (especially the standard errors of the variance components)

The requirement is higher for models with random slopes and for non-linear models (e.g. binary responses)

Small clusters

The cluster size is less relevant than the number of clusters: clusters of size 2 (as in a two-wave panel) are usually ok for a linear random effects model (but not for a logistic random effects model)

Even a few clusters with a single unit are not harmful

However, small clusters worsen cluster-specific inferences (for example, the precision of Empirical Bayes predictions of random effects)

Moreover, data with small clusters carry limited information on the variance-covariance structure at level 2, which should be kept simple (for example, no random slopes)

Degrees of freedom at level 2

degrees of freedom at level 2 = number of clusters minus number of level 2 covariates

Given the number of level 2 covariates, the number of clusters should ensure enough degrees of freedom at level 2

For example, we cannot obtain good estimates with 20 clusters and 10 level 2 covariates

References on sample size /1

LINEAR:

Maas C.J.M. & Hox J.J. (2005) Sufficient sample sizes for multilevel modeling. Methodology, 1, 86-92.

Bell, B.A., Morgan, G.B., Schoeneberger, J.A., Loudermilk, B.L., Kromrey, J.D., & Ferron, J.M. (2010). Dancing the Sample Size Limbo with Mixed Models: How Low Can You Go? SAS Global Forum 2010, Paper 197-2010. http://support.sas.com/resources/papers/proceedings10/197-2010.pdf

LOGISTIC:

Moineddin R., Matheson F.I. and Glazier R.H. (2007) A simulation study of sample size for multilevel logistic regression models. BMC Medical Research Methodology, 7. [random slope model]

Austin P.C. (2010) Estimating multilevel logistic regression models when the number of clusters is low: A comparison of different statistical software procedures. International Journal of Biostatistics, 6(1), article 16.

Paccagnella O. (2011) Sample Size and Accuracy of Estimates in Multilevel Models. New Simulation Results. Methodology, 7, 111-120.

References on sample size /2

SMALL CLUSTERS:

Raudenbush SW (2008) Many small groups. In: J. de Leeuw, E. Meijer (eds.), Handbook of Multilevel Analysis, Springer.

OPTIMAL DESIGN:

Snijders TAB, Bosker RJ (1993) Standard errors and sample sizes for two-level research. Journal of Educational Statistics, 18:237–259.

Snijders TAB (2005) Power and Sample Size in Multilevel Linear Models. In: B.S. Everitt and D.C. Howell (eds.), Encyclopedia of Statistics in Behavioral Science. Vol. 3, 1570–1573. Chichester: Wiley. http://stat.gamma.rug.nl/PowerSampleSizeMultilevel.pdf

Cohen MP (1998) Determining sample sizes for surveys with data analyzed by hierarchical linear models. Journal of Official Statistics, 14:267–275.

Moerbeek M, Van Breukelen GJP, Berger MPF (2008) Optimal Designs for Multilevel Studies. In: J. de Leeuw, E. Meijer (eds.), Handbook of Multilevel Analysis, Springer.

Extensions of the hierarchical model

• Complex level 1 variation

• 3-level model

Extensions of the linear mixed model

• Many covariates at both levels: X1, X2, …, W1, W2, …

• Cluster means of the level 1 covariates

• Complex error structure:

At level 1: e.g. heteroscedasticity, $Var(e_{ij}) = \sigma^2 x_{ij}$

At level 2: e.g. many random slopes

• More than two hierarchical levels

But be aware that the imagination of the researchers “can easily outrun the capacity of the data, the computer, and current optimization techniques to provide robust estimates” (Di Prete & Forristal)

Complex level 1 variation (heteroscedasticity)

A model which allows the level 1 variance to depend on explanatory variables is called a complex level 1 variance model

Example: in an educational setting, it is often observed that boys vary more than girls in their attainment, i.e. there may be heteroscedasticity at the student level. Denoting with $d_{ij}$ the indicator for girl, the level 1 error is

$$e_{ij} = e_{1,ij}\, d_{ij} + e_{0,ij}\, (1 - d_{ij})$$

$$Var(e_{ij}) = \sigma^2_{e,1}\, d_{ij} + \sigma^2_{e,0}\, (1 - d_{ij})$$

Browne, W.J., Draper, D., Goldstein, H., & Rasbash, J. (2002). Bayesian and likelihood methods for fitting multilevel models with complex level-1 variation. Computational Statistics & Data Analysis, 39(2), 203-225.

The 3-level random intercept model

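$$y_{ijk} = \text{fixed part} + v_k + u_{jk} + e_{ijk}$$

$k = 1, \dots, K$: level 3 units (e.g. schools)
$j = 1, \dots, J_k$: level 2 units (e.g. classes)
$i = 1, \dots, I_{jk}$: level 1 units (e.g. pupils)

$$v_k \overset{iid}{\sim} N(0, \sigma^2_v), \qquad u_{jk} \overset{iid}{\sim} N(0, \sigma^2_u), \qquad e_{ijk} \overset{iid}{\sim} N(0, \sigma^2_e)$$

independence among levels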
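The data-generating process of the 3-level random intercept model can be sketched with a short simulation. This is only an illustration under simplifying assumptions: a balanced design (the same number of classes J in every school and the same number of pupils I in every class), a constant fixed part, and hypothetical standard deviations for the three independent random terms.

```python
import random
import statistics


def simulate_three_level(K, J, I, sigma_v, sigma_u, sigma_e, fixed=0.0, seed=1):
    """Simulate y_ijk = fixed + v_k + u_jk + e_ijk for a balanced design.

    K schools, J classes per school, I pupils per class (balance is an
    assumption of this sketch; the model itself allows J_k and I_jk to vary).
    Returns a list of (k, j, i, y) tuples.
    """
    rng = random.Random(seed)
    rows = []
    for k in range(K):
        v_k = rng.gauss(0.0, sigma_v)          # school effect (level 3)
        for j in range(J):
            u_jk = rng.gauss(0.0, sigma_u)     # class effect (level 2)
            for i in range(I):
                e_ijk = rng.gauss(0.0, sigma_e)  # pupil error (level 1)
                rows.append((k, j, i, fixed + v_k + u_jk + e_ijk))
    return rows


rows = simulate_three_level(K=200, J=5, I=10,
                            sigma_v=1.0, sigma_u=0.7, sigma_e=1.5)
total_var = statistics.pvariance([y for _, _, _, y in rows])
print("number of pupils:", len(rows))
print("empirical total variance:", round(total_var, 2))
```

Because the three random terms are independent, the total variance of the response is the sum of the three variance components (here 1.0 + 0.49 + 2.25 = 3.74), and the empirical value should be close to it.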


The 3-level random intercept model

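Two pupils of different schools:

$$\mathrm{corr}(y_{ijk}, y_{i'j'k'}) = 0$$

Two pupils of the same school but different classes:

$$\mathrm{corr}(y_{ijk}, y_{i'j'k}) = \frac{\sigma^2_v}{\sigma^2_v + \sigma^2_u + \sigma^2_e}$$

Two pupils of the same school and class:

$$\mathrm{corr}(y_{ijk}, y_{i'jk}) = \frac{\sigma^2_v + \sigma^2_u}{\sigma^2_v + \sigma^2_u + \sigma^2_e}$$

Variance Partition Coefficients (VPC):

$$\underbrace{\frac{\sigma^2_v}{\sigma^2_v + \sigma^2_u + \sigma^2_e}}_{\text{school level}} \qquad \underbrace{\frac{\sigma^2_u}{\sigma^2_v + \sigma^2_u + \sigma^2_e}}_{\text{class level}} \qquad \underbrace{\frac{\sigma^2_e}{\sigma^2_v + \sigma^2_u + \sigma^2_e}}_{\text{student level}}$$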
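As a small worked example, the correlations and variance partition coefficients of the 3-level random intercept model can be computed directly from the three variance components. The function name and the numerical values below are hypothetical and serve only to illustrate the arithmetic.

```python
def variance_partition(sigma2_v, sigma2_u, sigma2_e):
    """Return VPCs and intra-cluster correlations for a 3-level
    random intercept model, given the three variance components."""
    total = sigma2_v + sigma2_u + sigma2_e
    return {
        "vpc_school": sigma2_v / total,
        "vpc_class": sigma2_u / total,
        "vpc_student": sigma2_e / total,
        # two pupils of different schools share no random term
        "corr_diff_school": 0.0,
        # same school, different classes: only v_k is shared
        "corr_same_school_diff_class": sigma2_v / total,
        # same school and class: v_k and u_jk are both shared
        "corr_same_class": (sigma2_v + sigma2_u) / total,
    }


# Hypothetical variance components, chosen only for illustration
result = variance_partition(sigma2_v=1.0, sigma2_u=0.5, sigma2_e=2.5)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Note that the three VPCs sum to one, and that the correlation between two pupils of the same class is the sum of the school-level and class-level VPCs.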


Software & Books


Software for multilevel modelling

Specialized software
Procedures in general purpose software
Web resources:
  Centre for Multilevel Modelling: http://www.cmm.bristol.ac.uk/
  Multilevel Modeling Resources at UCLA: http://www.ats.ucla.edu/stat/mlm/


Specialized software

MLwiN (Goldstein)
HLM (Raudenbush)
SUPERMIX (Hedeker)
aML (Panis)


Procedures in general purpose software

Stata xt suite
R (packages ‘lme4’ and ‘MCMCglmm’)
SAS PROC MIXED and NLMIXED
SPSS
Stata gllamm (Rabe-Hesketh & Skrondal)
LISREL (Jöreskog)
Mplus (Muthén)
WinBUGS (for Bayesian analysis)

Also structural equations


Good introductory books

Snijders & Bosker, 2nd ed.

(to appear soon)

Hox, 2nd ed.

Raudenbush & Bryk, 2nd ed.

Chapter 2 free to download


Books for learning models AND software

Pinheiro and Bates (2002) Mixed-effects models in S and S-PLUS

Littell et al. (2006) SAS for mixed models, 2nd ed.

Gelman and Hill (2007) Data analysis using regression and multilevel/hierarchical models (mainly R and WinBUGS)

Rabe-Hesketh and Skrondal (2008) Multilevel and longitudinal modeling using Stata, 2nd ed.


A web course

Detailed instructions on how to carry out a range of analyses in R, MLwiN and Stata.

It is free, but you will need to log on or register for the course to view all these practicals: http://www.cmm.bris.ac.uk/lemma/course/view.php?id=13

However, you can see short samples of these materials without registering at http://www.bristol.ac.uk/cmm/learning/module-samples/


Advanced books

A. Skrondal & S. Rabe-Hesketh: unified framework for models with latent variables, including multilevel, factor and structural equation models.

H. Goldstein: now 4th edition!

J. de Leeuw, E. Meijer (eds.)
