Page 1

A Critical Evaluation of Diagnostic Score Reporting: Some Theory and Applications

Sandip Sinharay, Gautam Puhan, and Shelby J. Haberman

Copyright 2009 by Educational Testing Service. All rights reserved. No reproduction in whole or in part is permitted without the express written permission of the copyright owner.

Paper presented at the Statistical and Applied Mathematical Sciences Institute, 9 July 2009.

Page 2

Few wish to assess others, fewer still wish to be assessed, but everyone wants to see the scores.

Paul W. Holland (2001)

Page 3

Few wish to assess others, fewer still wish to be assessed, but everyone wants to see the diagnostic scores.

Page 4

Outline

• Examples of diagnostic score reports
• Approaches to report diagnostic scores
• Problems with existing diagnostic scores in education
• A method to evaluate if diagnostic scores have added value
• Applications of the method to operational test data
• Conclusions and recommendations

Page 5

What Are Diagnostic Scores?

Diagnostic scores refer to scores on any meaningful cluster of items (subtests).

Typically, they refer to scores on content areas.

For example, on a test for prospective teachers of children, diagnostic scores are scores on the content areas Reading, Science, Social Studies, and Mathematics.

Page 6

Page 7

Page 8

Page 9

Subscores, Augmented Subscores, and Objective Performance Index

Subscores: Raw/percent scores on the subtests.

Augmented subscore (Wainer et al., 2001): A weighted average of the subscore of interest (e.g., reading) and the other subscores (e.g., science, social studies, and mathematics).

Objective Performance Index (Yen, 1987): A weighted average of (i) the observed subscore, and (ii) an estimate of the subscore based on the examinee’s overall test performance.

Page 10

Cognitive Diagnostic Models (CDM)

Assumptions:

• solving each test item requires one or more skills (Q matrix)
• each examinee has a latent ability parameter corresponding to each of the skills
• the probability of a score depends on the skills the item requires and the ability parameters

The ability estimates are the diagnostic scores.
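To make these assumptions concrete, here is a minimal sketch of the DINA model, one of the CDMs listed on the next slides, with a hypothetical Q matrix and made-up parameter values; in DINA the per-skill "ability parameters" are binary mastery indicators.

```python
import numpy as np

# Hypothetical Q matrix: rows = items, columns = skills.
# Q[j, k] = 1 if item j requires skill k.
Q = np.array([[1, 0],
              [0, 1],
              [1, 1]])

def dina_prob(alpha, q, slip, guess):
    """DINA model: an examinee answers correctly with probability
    1 - slip if he or she has mastered every skill the item requires,
    and with probability guess otherwise."""
    has_all_required = np.all(alpha >= q)
    return 1.0 - slip if has_all_required else guess

# Examinee who has mastered skill 1 but not skill 2 (binary mastery pattern).
alpha = np.array([1, 0])
for j, q in enumerate(Q):
    print(f"item {j}: P(correct) = {dina_prob(alpha, q, slip=0.1, guess=0.2)}")
```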

Page 11

Examples of CDMs

• Rule Space Method (RSM; Tatsuoka, 1983, 2009): an early attempt at diagnostic scoring.
• Attribute Hierarchy Method (Leighton, Gierl, & Hunka, 2004): extension of RSM.
• The DINA and NIDA models (Junker & Sijtsma, 2001).
• Multiple classification latent class model (Maris, 1999).
• General diagnostic model (GDM; von Davier, 2008).
• Reparameterized unified model (RUM; Hartz, 2002; Roussos et al., 2007).

Page 12

Examples of CDMs…Continued

• Bayesian networks (Almond et al., 2007).
• Multidimensional item response theory (de la Torre & Patz, 2005; Yao & Boughton, 2007).
• Multicomponent latent trait model (e.g., Embretson, 1997).
• The higher-order latent-trait model (de la Torre, 2005).
• The DINO and NIDO models.

Many excellent reviews of CDMs exist (e.g., Rupp & Templin, 2008; von Davier et al., 2008; DiBello, Roussos, & Stout, 2007).

Page 13

Is It Possible to Report High-quality Diagnostic Scores for Existing Educational Tests?

Standards 1.12, 2.1, 5.12, etc., of the Standards for Educational & Psychological Testing (1999) demand proof of adequate reliability, validity, and distinctness of diagnostic scores.

Page 14

Classical Test Theory

x = test score.

Partition x as x = xt + xe, where E(xe) = 0, Cov(xt, xe) = 0, V(xe) = σe², and V(xt) = σt².

Reliability = correlation between scores on a test and a parallel form of the test = ρ²(x, xt).

Validity measures the extent to which a test is doing the job it is supposed to do (=correlation between x and a criterion score y).
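A standard identity, implied by but not shown on the slide, connects this reliability to the variance components defined above:

$$\rho^2(x, x_t) = \frac{\mathrm{Cov}(x, x_t)^2}{\sigma^2(x)\,\sigma^2(x_t)} = \frac{\sigma_t^2}{\sigma_t^2 + \sigma_e^2},$$

since Cov(x, xt) = σt² and σ²(x) = σt² + σe² under the partition above.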

Page 15

Is It Possible to Report High-quality Diagnostic Scores?...Cont’d

Diagnostic scores on educational tests are most often based on few items but cover broad domains. As a result, they:

• have low reliability
• are highly correlated
• are outcomes of retrofitting

Page 16

Is It Possible to Report High-quality Diagnostic Scores?...Cont’d

Luecht, Gierl, Tan & Huff (2006): “Inherently unidimensional item and test information cannot be decomposed to produce useful multidimensional score profiles—no matter how well intentioned or which psychometric model is used to extract the information. Our obvious recommendation is not to try to extract something that is not there.”

Page 17

An Empirical Check of Reliability of Diagnostic Scores

Form X (120 items): Reading 30, Math 30, Social Studies 30, Science 30

Form A (40 items): Reading 10, Math 10, Social Studies 10, Science 10

Form B (40 items): Reading 10, Math 10, Social Studies 10, Science 10

Form C (40 items): Reading 10, Math 10, Social Studies 10, Science 10

Page 18

An Empirical Check of Reliability of Diagnostic Scores…Continued

Of the 6,035 examinees who scored 4 (1st quartile) or lower on Science on Form A, 49 percent scored higher than the 1st quartile on Science on Form B.

Of the 383 examinees who scored 8 (3rd quartile) on Math and 4 on Science on Form A, 32 percent had a Science score higher than or equal to their Math score on Form B.

r(Science A, Science B) = 0.48.

r(Science A, Total B) = 0.63.

Page 19

A Method Based on Classical Test Theory (Haberman, 2008)

Compute the PRMSE (proportional reduction in mean squared error) of:

• the subscore (= reliability)
• the total score

A subscore has added value over the total score only if the PRMSE of the subscore is larger than the PRMSE of the total score.

A subscore has added value ⇔ the subscore can be predicted better by the corresponding subscore on a parallel form than by the total score on the parallel form.

Page 20

A Method Based on Classical Test Theory…Continued

Subscore s = st + se; total score x = xt + xe.

PRMSE for the subscore = ρ²(s, st) = subscore reliability.

PRMSE for the total score = ρ²(x, st) = ρ²(x, xt) ρ²(xt, st).
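The check is easy to script from sample statistics. Below is a minimal sketch (illustrative names, not the authors' operational code) that takes the observed covariance matrix of the subscores and their reliabilities, forms the true-score covariance matrix under the usual assumption of uncorrelated errors across subtests, and returns both PRMSEs for every subscore.

```python
import numpy as np

def prmse_sub_and_total(C, rel):
    """PRMSEs for predicting each true subscore from (a) its observed
    subscore and (b) the observed total score.
    C:   observed covariance matrix of the subscores (k x k)
    rel: reliability of each subscore (length k)"""
    C = np.asarray(C, dtype=float)
    rel = np.asarray(rel, dtype=float)
    # True-score covariance matrix: reliabilities shrink the diagonal;
    # off-diagonal entries are unchanged because errors of different
    # subtests are assumed uncorrelated.
    T = C.copy()
    np.fill_diagonal(T, rel * np.diag(C))
    var_x = C.sum()              # variance of the total score x = sum of subscores
    cov_x_st = T.sum(axis=0)     # Cov(x, s_t) for every subscore
    prmse_sub = rel                                   # rho^2(s, s_t) = reliability
    prmse_total = cov_x_st**2 / (var_x * np.diag(T))  # rho^2(x, s_t)
    return prmse_sub, prmse_total

# Toy example: two subscores of reliability 0.70, observed correlation 0.6.
ps, pt = prmse_sub_and_total([[1.0, 0.6], [0.6, 1.0]], rel=[0.70, 0.70])
print(ps, pt)  # a subscore has added value only where ps > pt
```

In this toy example PRMSEtotal ≈ 0.75 exceeds the subscore reliability of 0.70, so neither short subscore would have added value.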

Page 21

A Method Based on Classical Test Theory…Continued

One can report a weighted average of the subscore and the total score (e.g., 0.4 × Reading + 0.2 × Total) if its PRMSE is large enough.

Special case of the augmented subscore (Wainer et al., 2001).

The computations need only simple summary statistics.
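The same summary statistics suffice for the weighted average: the best weights come from the normal equations for the linear predictor of the true subscore from the pair (subscore, total score). A sketch continuing the toy example above (again illustrative, not the authors' code):

```python
import numpy as np

def prmse_weighted(C, rel, j):
    """PRMSE of the best weighted average of subscore j and the total
    score as a predictor of the true subscore s_jt, found by solving
    the normal equations for the best linear predictor."""
    C = np.asarray(C, dtype=float)
    rel = np.asarray(rel, dtype=float)
    T = C.copy()
    np.fill_diagonal(T, rel * np.diag(C))
    cov_sx = C[j].sum()                        # Cov(s_j, x)
    S = np.array([[C[j, j], cov_sx],           # covariance of predictors (s_j, x)
                  [cov_sx, C.sum()]])
    c = np.array([T[j, j], T[j].sum()])        # covariances of predictors with s_jt
    beta = np.linalg.solve(S, c)               # weights of the weighted average
    return beta, (c @ beta) / T[j, j]          # PRMSE = explained var / V(s_jt)

beta, p = prmse_weighted([[1.0, 0.6], [0.6, 1.0]], rel=[0.70, 0.70], j=0)
print(beta, p)
```

For the toy covariance matrix above this gives PRMSEwtd ≈ 0.77, slightly above both PRMSEsub and PRMSEtotal, matching the pattern reported later in the talk.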

Page 22

What About Validity?

Reliability is an aspect of construct validity.

Recent work of Haberman (2008): A subscore that is not distinct or not reliable has limited value w.r.t. validity.

Thus, the method also examines whether the subscores have adequate validity (though additional validity studies are recommended).

Page 23

An Example: GRE Subject Biology

The PRMSEs for the subscore, total score, and weighted average:

Subscore         PRMSEsub (= reliability)   PRMSEtotal   PRMSEwtd
Cell. & Molec.   .89                        .78          .91
Organismal       .85                        .89          .91
Ecol. & Evol.    .87                        .79          .89

By the added-value criterion, the Cell. & Molec. and Ecol. & Evol. subscores have added value (PRMSEsub > PRMSEtotal), the Organismal subscore does not, and all three weighted averages have added value.

Page 24

Results from a Survey of Operational Data (Sinharay, 2009)

Test Name            # Subscores   Avg. Length   Avg. Reliability   Avg. Disatt. Corr.   # Subscores w/ Added Value   # Wtd. Avgs. w/ Added Value
Old SAT-V            3             26            0.79               0.95                 0                            1
Sch. Std. Prg: Eng   4             15            0.70               0.98                 0                            0
DSTP (8th gr. M)     4             19            0.77               1.00                 0                            0
Teachers of math.    3             16            0.62               0.95                 0                            0
Old SAT              2             69            0.92               0.76                 2                            2
Praxis Series™       4             25            0.72               0.78                 2                            4
SweSAT               5             24            0.78               0.69                 4                            5

Page 25

Percent of subscores with added value for different subscore length and average disattenuated correlation

Page 26

Percent of subscores with added value for different average subscore reliability and average disattenuated correlation

Page 27

Percent of weighted averages with added value

Page 28

Main Findings from the Survey of Operational Data

More than 50% of the tests had no subscore with added value.

Weighted averages had added value more often than subscores.

The subscores that had added value were

• based on a sufficient number of items (20+)

• sufficiently distinct from each other (disattenuated correlation less than 0.9)

Page 29

Reporting of Aggregate-level Subscores

To determine if aggregate-level subscores have added value, use an approach based on PRMSEs similar to that used for individual-level subscores.

The computation of the PRMSEs is a bit different and is based on between- and within-aggregation sums of squares.

Page 30

A Method Based on Classical Test Theory…Continued

s = sA + se; x = xA + xe, where sA and xA are the aggregate-level true scores and sav and xav denote scores averaged over the n examinees in an aggregate unit.

PRMSE for the aggregate-level subscore = ρ²(sav, sA) = σ²(sA) / [σ²(sA) + σ²(se)/n].

PRMSE for the aggregate-level total score = ρ²(xav, sA) = ρ²(xav, xA) ρ²(xA, sA).
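For a balanced layout (each aggregate unit contributing the same number n of examinees), σ²(sA) and σ²(se) can be estimated from the between- and within-aggregation mean squares, and the aggregate-level PRMSE simplifies to (MSbetween − MSwithin)/MSbetween. A hypothetical sketch, with illustrative names:

```python
import numpy as np

def aggregate_prmse(scores):
    """PRMSE of the aggregate-level mean subscore for a balanced layout.
    scores: 2-D array, rows = aggregate units, columns = the n examinees
    in each unit."""
    scores = np.asarray(scores, dtype=float)
    g, n = scores.shape
    unit_means = scores.mean(axis=1)
    ms_between = n * np.sum((unit_means - scores.mean()) ** 2) / (g - 1)
    ms_within = np.sum((scores - unit_means[:, None]) ** 2) / (g * (n - 1))
    # MS_between = sigma^2(s_e) + n * sigma^2(s_A), MS_within = sigma^2(s_e),
    # so sigma^2(s_A) / [sigma^2(s_A) + sigma^2(s_e)/n] reduces to:
    return (ms_between - ms_within) / ms_between
```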

Page 31

Reporting of Subscores Based on MIRT Models

Fit, using a stabilized Newton-Raphson method (Haberman, von Davier, & Lee, 2008), a MIRT model with item response function

Pj(θ) = exp(aj1θ1 + aj2θ2 + … + ajKθK − bj) / [1 + exp(aj1θ1 + aj2θ2 + … + ajKθK − bj)],

where θi corresponds to subscore i.
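The item response function is simply a logistic in a linear combination of the K abilities. A small sketch (the sign convention on bj is an assumption carried over from the reconstructed equation above):

```python
import numpy as np

def mirt_irf(theta, a, b):
    """Item response function above: probability of a correct response
    given the ability vector theta and item parameters (a, b)."""
    z = np.dot(a, theta) - b
    return 1.0 / (1.0 + np.exp(-z))

# Toy item loading mainly on skill 1 of a two-skill test.
print(mirt_irf(theta=np.array([0.5, -0.2]), a=np.array([1.2, 0.3]), b=0.4))
```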

Page 32

Reporting of Subscores Based on MIRT…Continued

The diagnostic scores are the posterior means of the ability parameters.

Calculate the proportional reduction in mean squared error:

PRMSEMIRT = 1 − E[(θi − E(θi | X))²] / E[(θi − E(θi))²].

Compare PRMSEMIRT to PRMSEwtd to examine whether MIRT does better than CTT.
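Since E[(θi − E(θi | X))²] = E[Var(θi | X)] and E[(θi − E(θi))²] = Var(θi), PRMSEMIRT can be approximated from posterior draws of θi, whatever software produces them. A hypothetical sketch:

```python
import numpy as np

def prmse_mirt(posterior_draws):
    """Approximate PRMSE_MIRT for one ability dimension.
    posterior_draws: array of shape (n_examinees, n_draws) holding
    posterior samples of theta_i, one row per examinee."""
    d = np.asarray(posterior_draws, dtype=float)
    avg_post_var = d.var(axis=1, ddof=1).mean()             # E[Var(theta_i | X)]
    var_theta = d.mean(axis=1).var(ddof=1) + avg_post_var   # Var(theta_i)
    return 1.0 - avg_post_var / var_theta
```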

Page 33

Results for Aggregate-level Subscores and MIRT-based Subscores

Aggregate-level subscores, just like individual-level subscores, rarely have added value.

The PRMSEMIRT is very close to PRMSEwtd for the several tests we looked at.

Page 34

Conclusions and Recommendations

Most of the existing diagnostic scores on educational tests lack quality.

Evidence of adequate reliability, validity, and distinctness of the diagnostic scores should be provided.

If a CDM is used, it should be demonstrated that the model parameters can be reliably estimated in a timely manner and that the model fits the data better than a simpler model.

Page 35

Conclusions and Recommendations…Continued

To report meaningful diagnostic scores for some tests, changing the structure by using assessment engineering practices (Luecht et al., 2006) may be necessary.

Alternatives: Scale anchoring (Beaton & Allen, 1992) and item mapping (Zwick et al., 2001).

Page 36

References for the Haberman method

Haberman (2008). Journal of Educational and Behavioral Statistics.

Sinharay, Haberman, & Puhan (2007). Educational Measurement: Issues and Practice.

Sinharay & Haberman (2008). Measurement.

Haberman, Sinharay, & Puhan (2009). British Journal of Mathematical and Statistical Psychology.

Page 37

References for the Haberman method…Continued

Puhan, Sinharay, Haberman, & Larkin (in press). Applied Measurement in Education.

Sinharay (2009). ETS Research Report.

Haberman & Sinharay (2009). ETS Research Report.

Sinharay, Puhan, & Haberman (2009). Invited presentation at the annual meeting of NCME.