Generalized Mixed-effects Models for Monitoring Cut-scores for Differences Between Raters, Procedures, and Time

Page 1

Yeow Meng Thum
Hye Sook Shin

UCLA Graduate School of Education & Information Studies
National Center for Research on Evaluation, Standards, and Student Testing (CRESST)

CRESST Conference 2004, Los Angeles

Page 2

Rationale

• Research shows that cut-scores vary as a function of many factors: raters, procedures, and time.

• How does one defend a particular cut-score? Averaging several values and using collateral information are the current options.

• High-stakes accountability hinges on the comparability of performance standards over time.

• Some method is required to monitor cut-scores for consistency across groups and over time (Green et al.).

Page 3

Purpose of Study

• An approach for estimating the impact of procedural factors, rater characteristics, and time.

• Monitoring the consistency of cut-scores across several groups.

Page 4

Transforming Judgments into Scale Scores

[Figure 1: Working with the Grade 3 SAT-9 mathematics scale. The probability curve maps the logit metric onto the scale-score metric; a cut-score of 0.633 logits corresponds to 619 scale-score points.]
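A minimal numeric sketch of the mapping in Figure 1, assuming a linear logit-to-scale transform. The constants `A` and `B` are illustrative, chosen only so that 0.633 logits lands near 619 scale-score points; the operational SAT-9 scaling constants are not given in the slides.

```python
from math import exp

# Hypothetical linear logit-to-scale transform: scale = A * theta + B.
# A and B are illustrative values chosen so that 0.633 logits maps to
# roughly 619 scale-score points; the actual SAT-9 constants differ.
A, B = 40.0, 593.68

def to_scale(theta_logit):
    """Map a cut-score in logits onto the reporting scale."""
    return A * theta_logit + B

def p_correct(theta, d):
    """Rasch-type probability that a candidate at ability theta
    answers an item of difficulty d correctly."""
    return 1.0 / (1.0 + exp(-(theta - d)))

cut_logit = 0.633
print(f"{cut_logit} logits -> {to_scale(cut_logit):.0f} scale-score points")
```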

Page 5

Performance Distribution for Four Urban Schools

[Figure 2: Grade 3 SAT-9 mathematics scale-score distributions for four schools, each panel marked at the 619-point cut-score: School A, 32% Proficient; School B, 70% Proficient; School C, 19% Proficient; School D, 32% Proficient.]

Page 6

Potential Impact of Revising a Cut-score

Table 1: Potential impact on school performance when the cut-score changes

                Revised cut-score (as fraction of SEM)
    School     -1    -0.5      0    +0.5     +1
    A         41%     37%    32%     29%    26%
    B         78%     75%    70%     67%    63%
    C         25%     23%    19%     15%    13%
    D         40%     36%    32%     28%    25%
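A sketch of the kind of computation behind Table 1, under a normal approximation to a school's score distribution. The school mean, SD, and the cut-score SEM below are hypothetical illustrative values, not quantities reported in the slides.

```python
from math import erf, sqrt

def pct_above(cut, mean, sd):
    """Fraction of a Normal(mean, sd) score distribution at or above cut."""
    z = (cut - mean) / sd
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

# Hypothetical school: mean 590, SD 55 on the scale-score metric, and a
# hypothetical cut-score SEM of 10 points (illustrative values only).
SEM = 10.0
for shift in (-1, -0.5, 0, 0.5, 1):
    cut = 619 + shift * SEM
    print(f"cut {cut:5.1f}: {pct_above(cut, 590.0, 55.0):4.0%} proficient")
```

As in Table 1, percent proficient falls monotonically as the cut-score is revised upward.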

Page 7

Data & Model

• Simulated data for a standard-setting study design: a randomized block confounded factorial design (Kirk, 1995).

• Factors of the standard-setting study:

  a. Rater dimensions (teacher, non-teacher, etc.)

  b. Procedural factors/treatments:

     1. Type of feedback (outcome or impact feedback, "yes" or "no", etc.)

     2. Item sampling in booklet (number of items, etc.)

     3. Type of task (a modified Angoff, a contrasting-groups approach, or the Bookmark method, etc.)

Page 8

Treating Binary Outcomes

Binary outcome:

    y_{ijt} = 1 for "pass", 0 for "fail"    (1)

("pass" if rater j thinks the passing candidate has a good chance of getting the ith item right in session t)

Logit link function:

    \eta_{ijt} = \ln\!\left[ \frac{p_{ijt}}{1 - p_{ijt}} \right]    (2)
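A small numeric sketch of the logit link in eq. (2) and its inverse, which carries rater judgments between the probability and log-odds scales.

```python
from math import log, exp

def logit(p):
    """Log-odds of a probability p (the link in eq. 2)."""
    return log(p / (1.0 - p))

def inv_logit(eta):
    """Inverse link: from log-odds back to a probability."""
    return 1.0 / (1.0 + exp(-eta))

# A rater judgment p_ijt = 0.75 ("a good chance of getting the item right")
# sits at about 1.10 on the logit scale.
print(f"logit(0.75) = {logit(0.75):.2f}")
```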

Page 9

IRT Model for Cut-score - I

Procedural factors impacting a rater's cut-scores:

    \kappa_{jt} = \sum_s \gamma_s S_{sjt} + \delta_{jt}    (3)

where \gamma_s is the fixed effect due to session characteristic S_{sjt}, and \delta_{jt} is a random effect that evolves over time (ROUND_{jt}) and is a function of rater characteristics X_{pj}.

Item response (IRT) model:

    \eta_{ijt} = \kappa_{jt} - d_{ijt}    (4)

Page 10

IRT Model for Cut-score - II

Estimating factors impacting a rater's cut-scores:

    \delta_{jt} = \beta_{0j} + \beta_{1j} ROUND_{jt}
    \beta_{0j} = \beta_{00} + \sum_p \beta_{0p} X_{pj} + u_{0j}    (5)
    \beta_{1j} = \beta_{10} + \sum_p \beta_{1p} X_{pj} + u_{1j}

where (u_{0j}, u_{1j}) are distributed bivariate normal with means (0, 0) and variance-covariance matrix

    T = [ \tau_{00}  \tau_{01} ]
        [ \tau_{10}  \tau_{11} ]
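The growth model in eq. (5) can be sketched as a small simulation. All parameter values below are illustrative placeholders (not the study's estimates), with a single teacher/non-teacher covariate and, for brevity, a diagonal T (\tau_{01} = 0).

```python
import random

random.seed(1)

# Illustrative fixed effects and variance components (hypothetical values):
b00, b0_teacher = 0.50, 0.10   # beta_00 and beta_0p for the teacher indicator
b10, b1_teacher = 0.05, -0.02  # beta_10 and beta_1p for the ROUND slope
sd_u0, sd_u1 = 0.30, 0.05      # sqrt(tau_00), sqrt(tau_11); tau_01 = 0 here

def simulate_rater(teacher, n_rounds=4):
    """Draw one rater's cut-scores delta_jt over rounds, per eq. (5)."""
    u0 = random.gauss(0.0, sd_u0)          # rater intercept deviation
    u1 = random.gauss(0.0, sd_u1)          # rater slope deviation
    beta0 = b00 + b0_teacher * teacher + u0
    beta1 = b10 + b1_teacher * teacher + u1
    return [beta0 + beta1 * t for t in range(n_rounds)]

for j in range(3):
    print([round(d, 3) for d in simulate_rater(teacher=j % 2)])
```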

Page 11

Likelihood

Conditional on \delta_j, y_j has probability

    f(y_j; \delta_j) = \prod_t \prod_i p_{ijt}^{y_{ijt}} [1 - p_{ijt}]^{(1 - y_{ijt})}    (6)

With prior distribution g(\delta_j; T), the conditional posterior of the rater random effects \delta_j is

    f(y_j; \delta_j) \, g(\delta_j; T) \,/\, h(y_j; \gamma, T)    (7)

where h(y_j; \gamma, T) = \int f(y_j; \delta_j) \, g(\delta_j; T) \, d\delta_j.

Joint marginal likelihood:

    L(\gamma, T; y) = \prod_j h(y_j; \gamma, T)    (8)
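A toy sketch of the marginal factor h(y_j) in eq. (8), for one rater and one session, using a scalar random effect (instead of the bivariate (u_0, u_1)) and simple Monte Carlo integration; the item difficulties and the N(0, 1) prior are hypothetical. Software such as SAS PROC NLMIXED performs this integration with adaptive quadrature instead.

```python
import random
from math import exp

random.seed(2)

def p_ijt(delta, d):
    """Success probability with eta = delta - d (eqs. 2 and 4)."""
    return 1.0 / (1.0 + exp(-(delta - d)))

def f_cond(y, delta, diffs):
    """f(y_j; delta_j): product of Bernoulli terms over items (eq. 6)."""
    out = 1.0
    for yi, d in zip(y, diffs):
        p = p_ijt(delta, d)
        out *= p if yi else (1.0 - p)
    return out

def h_marginal(y, diffs, mu=0.0, sd=1.0, draws=20000):
    """Monte Carlo estimate of h(y_j) = E_delta[f(y_j; delta)],
    integrating over delta ~ N(mu, sd^2)."""
    total = 0.0
    for _ in range(draws):
        total += f_cond(y, random.gauss(mu, sd), diffs)
    return total / draws

y = [1, 1, 0, 1]                 # hypothetical pass/fail judgments
diffs = [-0.5, 0.0, 0.8, 0.3]    # hypothetical item difficulties
print(f"h(y_j) is approximately {h_marginal(y, diffs):.4f}")
```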

Page 12

Multiple Studies: Consistency & Stability

Procedural factors impacting a rater's cut-scores for separate study g (g = 1, 2, ..., G):

    \kappa_{jt} = \sum_s \gamma_{sg} S_{sjt} + \delta_{jt}    (9)

where \gamma_{sg} is the fixed effect due to session characteristic S_{sjt} in study g, and \delta_{jt} is a random effect that evolves over time (ROUND_{jt}) and is a function of rater characteristics X_{pj}.

Group factors impacting a rater's severity:

    \delta_{jt} = \beta_{0j} + \beta_{1j} ROUND_{jt}
    \beta_{0j} = \beta_{00} + \sum_{g=1}^{G} \beta_{01g} GROUP_{gj} + \sum_p \beta_{0p} X_{pj} + u_{0j}    (10)
    \beta_{1j} = \beta_{10} + \sum_{g=1}^{G} \beta_{11g} GROUP_{gj} + \sum_p \beta_{1p} X_{pj} + u_{1j}

Page 13

Simulation (SAS PROC NLMIXED)

150 raters are randomly exposed over 4 rounds to a standard-setting exercise varying on 3 session factors:

Session Factor 1: Feedback type

Session Factor 2: Item targeting in booklet

Session Factor 3: Type of standard-setting task

Rater characteristic: Teacher vs. non-teacher

Change over round (time)
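The data-generation step of a simulation like this can be sketched as follows. This is a simplified toy version: session-factor effects are folded into the intercept, and every parameter value is an illustrative placeholder rather than one used in the study. Model fitting itself would then be done in software such as SAS PROC NLMIXED, as on the slide.

```python
import random
from math import exp

random.seed(4)

# Toy design: 150 raters x 4 rounds x 20 items of binary judgments.
N_RATERS, N_ROUNDS, N_ITEMS = 150, 4, 20
item_diff = [random.uniform(-1.5, 1.5) for _ in range(N_ITEMS)]  # hypothetical

rows = []
for j in range(N_RATERS):
    teacher = 1 if j < N_RATERS // 2 else 0          # rater characteristic
    u0, u1 = random.gauss(0, 0.3), random.gauss(0, 0.05)  # rater effects
    for t in range(N_ROUNDS):
        # Cut-score evolving over rounds, per eq. (5) (illustrative values).
        delta = (0.5 + 0.1 * teacher + u0) + (0.05 + u1) * t
        for i, d in enumerate(item_diff):
            p = 1.0 / (1.0 + exp(-(delta - d)))      # eqs. (2) and (4)
            rows.append((j, t, i, teacher, int(random.random() < p)))

print(len(rows), "rater-by-round-by-item records")
```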

Page 14

Selected Results

• The model (reasonably) recovers parameters within sampling uncertainty across 3 studies.

• The average cut-score (all teachers) for each rater group at the last round is not significantly different from 619, while the first-round results were significantly different.

• Results from the model for multiple studies are similarly encouraging.

Page 15

Suggestions

• Large-scale testing programs should monitor their cut-score estimates for consistency and stability.

• For a stable performance scale, estimates of cut-scores and factor effects should be replicable to a reasonable degree across groups and over time.

• The model in this paper can be adapted to actual data so that variation due to the relevant factors of the study can be verified and balanced out.