Test Scaling and Value-Added Measurement · Stanford Achievement Test, and the National Assessment...

Working Paper 2008-23December 2008

Test Scaling and

Value-AddedMeasurement

Dale Ballou

IN COOPERATION WITH:LED BY

The NaTioNal CeNTer oN PerformaNCe iNCeNTives(NCPI) is charged by the federal government with exercising leader-ship on performance incentives in education. Established in 2006through a major research and development grant from the UnitedStates Department of Education’s Institute of Education Sciences(IES), NCPI conducts scientific, comprehensive, and independentstudies on the individual and institutional effects of performance in-centives in education. A signature activity of the center is the conductof two randomized field trials offering student achievement-relatedbonuses to teachers. e Center is committed to air and rigorousresearch in an effort to provide the field of education with reliableknowledge to guide policy and practice.

e Center is housed in the Learning Sciences Institute on thecampus of Vanderbilt University’s Peabody College. e Center’smanagement under the Learning Sciences Institute, along with theNational Center on School Choice, makes Vanderbilt the only highereducation institution to house two federal research and developmentcenters supported by the Institute of Education Services.

This working paper was supported by the National Center onPerformance Incentives, which is funded by the United StatesDepartment of Education's Institute of Education Sciences(R30SA06034). is paper was presented at the Wisconsin Centerfor Education Research's National Conference on Value-AddedModeling in March 2008. e author thanks Kurt Scheib and WarrenLangevin for their expert research assistance. e views expressed inthis paper do not necessarily reflect those of sponsoring agencies orindividuals acknowledged. Any errors remain the sole responsibilityof the authors.

Please visit www.performanceincentives.org to learn more aboutour program of research and recent publications.

Test Scaling and Value-Added Measurement:Dale BallouVanderbilt University

Abstract

As currently practiced, value-added assessment relies on a strong assump-tion about the scales used to measure student achievement, namely thatthese are interval scales, with equal-sized gains at all points on the scale rep-resenting the same increment of learning. Many of the metrics in whichtest results are expressed do not have this property (e.g., percentile ranks,normal curve equivalents). However, this property is claimed for the scalesobtained when tests are scored according to the Item Response eory.

is claim requires that examinees and test items constitute, in the ter-minology of representational measurement theory, a conjoint additivestructure. Unfortunately, it is difficult to confirm that this conditionis met. In addition, the internal structure of a conjoint additive systemmay not be the property we seek to represent in an achievement scale, assuggested by the lack of surface plausibility of many scales resulting fromthe application of IRT.

Methods of ordinal data analysis can be employed instead, on weakerassumptions. Instead of comparing mean achievement of a teacher’sstudents to the students of a (hypothetical) average instructor, ordinalanalysis asks what fraction of the former outperforms the latter. efeasibility of such methods for value-added analysis is demonstrated.It is seen that value-added estimates are sensitive to the choice of ordi-nal methods over conventional techniques.

Clearly, if IRT scores are an interval-scaled variable, ordinal methodsthrow away valuable information. Practitioners of value-added measure-ment should ask themselves, however, whether they are so confident ofthe metric properties of these scales that they are willing to attribute dif-ferences between regression-based estimates of value added and estimatesbased on ordinal analysis to the superiority of the former.

1

I. Introduction

Models currently used for value-added assessment of schools and teachers

require that the scale on which achievement is measured be one of equal units: the five-

point difference between scores of 15 and 20 must represent the same gain as the five-

point difference between 25 and 30. If it does not, we will end up drawing meaningless

conclusions about such matters as the average level of achievement, relative gains of

different groups of students, etc., in that the truth of these conclusions will depend on

arbitrary scaling decisions.

A scale that possesses this property is known as an interval scale. It is clear that a

simple number-right score is not an interval scale of achievement when test questions are

not of equal difficulty. The same is true of several popular methods of standardizing raw

test scores that also fail to account for the difficulty of test items, such as percentiles,

normal curve equivalents, or grade-level equivalents normed to a nationally-

representative sample.1

The search for measures of achievement that are independent of the particular

items included on a test led to the development in the 1950’s of Item Response Theory.

IRT is now used to score most of the best-known and most widely administered

achievement tests today, such as the CBT/McGraw-Hill Terra Nova series, the SAT, the

Stanford Achievement Test, and the National Assessment of Educational Progress. IRT

was regarded as a significant advance over earlier scaling methods for the following

reasons: (1) The score of an examinee is not dependent on the difficulty of the items on

the test, provided the test is not so easy that the examinee answers all items correctly or

1 These last methods also have the disadvantage of being dependent on the distribution of ability in the tested population or must rely on arbitrary assumptions about this distribution.

2

so hard that he misses them all; (2) An examinee’s score is not dependent on the

characteristics of the other students taking the same test; (3) Finally, according to some

psychometricians, an examinee’s score is an interval-scaled measure of ability2. This last

claim makes IRT scaling particularly appealing to those practicing value-added

assessment.

However, not all psychometricians share these views, and the literature contains

many confusing and contradictory statements about the properties of IRT scales.

Because many social scientists using test scores to evaluate educational institutions and

policies have little or no training in measurement theory, the first objective of this paper

is to review the issues. The next section describes IRT. It is followed by a summary of

the controversy over scale type in Section III. While in the right conditions, IRT yields

interval-scaled measures of achievement, these conditions are difficult to verify.

Moreover, IRT scales are often at odds with common sense notions about the effects of

schooling and the distribution of ability as students advance through school. I argue in

Section IV that we are rightly suspicious of IRT scales when we see such results. Section

V takes up the implications for value-added assessment, with particular attention to

methods of ordinal data analysis. Concluding remarks appear in Section VI.

2 The psychometric literature uses the term ability. Other social scientists might prefer achievement. It represents the student’s mastery of the domain of the test and should not be confused with innate ability as opposed to knowledge and skills acquired through education.

3

II. Item Response Theory and Ability Scales

In IRT, the probability that student i correctly answers test item j correctly is a

function of an examinee trait (conventionally termed ability) and one or more item

parameters.3 Thus

(1) Pij = Prob[h( i, j) > uij] = F( i, j),

where i is ability of person i, j is a characteristic (possibly vector-valued) of item j, and

uij is an idiosyncratic person-item interaction, as a result of which individuals of the same

level of ability need not answer a given item alike. The uij are taken to be independent

and identically distributed random errors. The function h expresses how ability and item

parameters interact. F is derived from h and assumptions about the distribution of uij.

Common assumptions are that the uij are normal or logistic. i and j are estimated by

maximum likelihood methods.

When there is a single item parameter, the assumption that the uij are logistic

gives rise to the one-parameter logistic model (also known as the Rasch model):

(2a) Pij = [1 + exp(-D( i- j))]-1,

in which D is an arbitrary scaling parameter, invariant over items, that can be chosen by

the practitioner. (If D = 1.7 it makes essentially no difference whether the model is

estimated as a logistic or normal ogive model. Alternatively, D is set to 1.) The estimate

of is known as the scale score. The scale score is the principal measure of performance

on the exam, although other measures derived from it, such as percentile ranks, may also

be reported.4 The item parameter is conventionally termed difficulty. The functional

3 There are many lucid expositions of IRT, including Hambleton and Swaminathan (1985), and Hambleton, Swaminathan, and Rogers (1991). 4 Another measure is the so-called “true score,” which is simply the expected number right = Pij( ) on a

4

form of (2a) implies that it is measured on the same scale as ability.

More elaborate models introduce additional item parameters, as in the two- and

three-parameter logistic models:

(2b) Pij = [1 + exp(- j( i- j))]-1

(2c) Pij = cj + (1-cj)[1 + exp(- j( i- j))]-1

(2b) contains a second item parameter, j, known as the item discrimination parameter

because it enters the derivative of Pij with respect to i (thereby determining how well the

item discriminates between examinees of different ability). In both (2a) and (2b), the

limit of Pij as i - is 0. This is not appropriate for tests where a student who knows

nothing at all can answer an item correctly by guessing. Accordingly, cj allows for a non-

zero asymptote and is conventionally termed the guessing parameter.

The plot of Pij against i is known as the item characteristic curve (ICC). Three

item characteristic curves using the 2-parameter logistic IRT model are depicted in Figure

1. Curve II differs from curve I by an increase in the difficulty parameter, holding

constant item discrimination. Curve III corresponds to an item with the same level of

difficulty as II but a doubling of the discrimination parameter. All three curves have a

lower asymptote of 0. Observe that all three curves are steepest where Pij = .5 and = .5

At this point the slope equals j.

In the IRT models in (2a) – (2c), and are underidentified. Modification of

each by an additive constant obviously leaves Pij unchanged. Multiplicative constants

can be offset by changes to (or absorbed in) the discrimination parameter. Because an

test comprising the universe of items. This suffers from the usual defect of a number-right score: the metric depends on the composition of the universe. 5 Because these models are additive in and , ability and difficulty are expressed on the same scale.

5

additive constant corresponds to a change in the origin of the scale, while a multiplicative

constant represents a change in the scale’s units (e.g., from meters to yards), many

psychometricians have concluded that and are measured on interval scales, which are

characterized by the same conventional choice of origin and unit (see Section III).

However, this opinion is by no means universal, nor is it firmly held. Many

psychometricians, including some who state that these are interval scales, also regard the

scale as arbitrary. Others caution that ability scales should be accorded no more than

ordinal significance. These conclusions appear to be derived from the following

considerations: (1) Because ability is a latent trait, it is impossible to verify by physical

means that all 1-unit increments in represent the same increase in ability. This (the

argument goes) confers an inherent indeterminacy on the scaling of any latent trait.6 (2)

Replacing with a monotonic transformation g( ) while making offsetting changes to the

function F yields a model that fits the data just as well.7 Thus the scale rests on

arbitrary assumptions regarding functional form.8 (3) The notion that measures the

6 “When the characteristic to be measured cannot be directly observed, claims of equal-interval properties with respect to that characteristic are not testable and are therefore meaningless.” (Zwick, 1992, p. 209) 7 An example is Lord’s (1975) transformation of the scale to eliminate correlation between item difficulty and item discrimination, in which i was replaced by the regression of j on j and higher powers of j

evaluated at j = i. The result was a modification of the three-parameter logistic model: P*ij = F*ij( ) = cj + (1-cj)[1 + exp(- j( -1( ( i)))- j)]-1

where ( ) is proportional to -.27 + 1.1694 + .2252 2 + .0286 3. As a nonlinear transformation of the original ability scale, the -scale differs from the -scale by more than a change of origin and unit. Lord saw this as no drawback: “There seems to be no firm basis for preferring the scale to the scale for measuring ability.” (Lord, 1975, p. 205) 8 “A long-standing source of dissatisfaction with number-right and percent-correct scores is that their distribution depends on item reliabilities and difficulties. Radically different distributions of true scores (expected percents correct) can be obtained for the same sample of examinees when they take different tests. The obvious restriction of such scores to their ordinal properties casts doubt upon their use for problems that require interval scale measurement, such as comparing individuals’ gains…Item response theory appeared to offer a general solution, since the same underlying scale could explain the different true-score distributions corresponding to any subset of items from a domain. But this line of reasoning runs from model to data, not from data to model as must be done in practice. Suppose that a given dataset can be explained in terms of a unidimensional IRT model with response curves of the form Fj( ). Corresponding to any

6

amount of something misconceives the entire enterprise. There is no single trait (call it

ability, achievement or what-have-you) to be quantified by this or any other means.9 As

such, the question of scale type is essentially meaningless: there are just various models

for reducing the dimensionality of the data, some more convenient than others.10 I will

refer to this view of IRT scales in the ensuing discussion as the operational perspective.

To summarize, there are some psychometricians who consider to be interval-

scaled, others who think it is ordinal, still others who regard the choice of scale as

arbitrary, even if it is an interval scale, and finally some who are unsure what it is.

Clearly it is disconcerting to find this divergence of views on a question of fundamental

importance to value-added assessment. Is the IRT ability trait measured on an interval

scale or not? Indeed, how does one tell?

III. IRT Models and Scale Type

In the natural sciences, measurement is the assignment of numbers to phenomena

in such a way that relations among the numbers represent empirically-given relations

continuous, strictly increasing function h there is an alternative model with curves F*j( ) = Fj(h( )) that fits the data in precisely the same manner (Lord, 1975). That a particular IRT model fits a dataset, therefore, is not sufficient grounds to claim scale properties stronger than ordinal.” (Mislevy, 1987, p. 248) 9 The claim that a particular unidimensional scaling method is right must be based on the assumption that achievement is unidimensional, that it can be linearly ordered, and that students can be located in this linear ordering independently of performance on a particular achievement test. However, no one has succeeded in identifying or defining a linearly ordered psychological trait in educational achievement, and no one has demonstrated that a particular measurement scale is linearly related to such a trait. A serious obstacle to the establishment of truly (externally verifiable) equal-interval achievement scales is the fact that achievement is multidimensional and qualitatively changing. The nature of what is being learned is constantly being modified. Use of an objective-based approach to achievement highlights the difficulties in hypothesizing and verifying a continuous achievement trait. The student is learning the names of letters of the alphabet one day, associating sounds to those letters another day, and attaching meaning to groups of letters a third day. How can such qualitative changes be hypothesized to fall so many units apart on one particular trait?” (Yen, 1986, p. 311-12) 10 “It is important for educators and test developers to acknowledge that until the achievement traits are much more adequately defined, it is not possible to develop measurement scales that are linearly related to such traits. In fact, it appears impossible to provide such trait definition. Test users are therefore left to use other criteria to choose the best scale for a particular application; choosing the right scale is not an option.” (Yen, 1989, p. 314)

7

among the phenomena. Technically, there is a homomorphism between the empirical

relations among objects in the world and numerical relations on the scale. One such

relation is order: if objects can be ordered with respect to the amount of some attribute,

that order needs to be reflected in increasing (or decreasing) scale values. However,

order is not the only attribute to matter. If objects A and B can be combined (or

concatenated) in such a way that in combination they possess the same amount of some

attribute (length, mass) as object C, a scale for that attribute needs to reflect the results of

that operation. Thus, in an additive representation, the scaled value of the attribute in A

plus the scaled value of the attribute in B equals the scaled value assigned to C.

The importance of convention relative to empirical phenomena turns out to be the

key to scale type. At one end of the spectrum are scales in which there is no role for

convention, at the other scales that are entirely conventional. Scales of the first type are

called absolute. An example is counting: one is not free to change units if the

information to be conveyed by the scale is the number of discrete objects under

observation.11 At the other end of the spectrum are nominal scales, where the number

assigned to an object is no more than a label and the information conveyed could just as

well be represented by a non-numerical symbol. Scales for most physical quantities, such

as length and mass, have a degree of freedom for the conventional choice of a unit. Such

scales are known as “ratio scales” because the ratio of two lengths or two weights is

invariant to the choice of unit: ratios are convention-free. If there is no natural zero, so

11 It should be clear that scale type is as much a matter of how numbers are interpreted as of formal relations among the items being scaled. For example, considered purely from a technical standpoint, number right on a test can be regarded as an absolute scale. Like any scale that counts items (e.g., the score of a football game), number right does not admit of a change of units without a loss of information. However, when test items are not of equal difficulty, number right cannot be regarded as a measure of achievement that generalizes beyond performance on the particular test in question, and as such is not an

8

that the origin of a scale is also determined conventionally, ratios are no longer

convention-free magnitudes. However, ratios of intervals are invariant under change of

unit and change of origin. Such scales are therefore known as interval scales: they are

characterized by two degrees of freedom. Between interval scales and nominal scales lie

ordinal scales. Any increasing function of an ordinal scale conveys the same

information.12

With respect to psychological variables, there is less agreement about the nature

of measurement. There appear to be three prevalent views (Hand, 1996). On one view

(sometimes called “classical measurement”), it is simply assumed that the psychological

variable of interest exists and that there is a ratio (or at least interval) scale on which it is

measurable. The task of measurement is to discover those values. This appears to be the

view of some psychometricians with respect to IRT scales. On a second view, a

psychological variable exists only by virtue of its presence in some model. The model

effectively defines the variable and, when the model is fit to data, provides a means of

determining the scaled value of the variable. This has been called “operational

measurement” and is compatible with the operational perspective on IRT scales described

above, in which IRT models are devices for reducing the dimensionality of data. Two

different models may both contain the term “ability.” There is no basis for deciding

between them on grounds that one better represents “true ability”—each is just more or

less useful for the purposes to which they are put. Finally, there is a third view that holds

absolute scale, a ratio scale, or an interval scale. Except in special circumstances it is not even an ordinal scale.12 Scale type is therefore closely related to the notion of an admissible scale transformation (Stevens, 1946). An admissible transformation preserves the empirical information in the original scale by altering only those elements that are purely conventional. For ratio scales, these are the similarity transformations, corresponding to a choice of unit (for example, substituting centimeters for meters). In the case of interval scales, the admissible transformations consist of affine transformations, g= f(a) + .

9

that measurement of psychological variables is essentially the same as measurement in

the physical sciences—the view sketched at the start of this section, known as

representational measurement. Psychological attributes are postulated to explain

empirically-given relationships (such as the pattern of examinees’ responses to the items

on a test). It is the structure of those relationships that determines the properties of scales

that measure these attributes. Some relationships are so lacking in structure that the

attributes may not be scalable at all—the most we can do is to name them. In other cases

it may be possible to say that there is a larger quantity of some attribute in one person

than another, supporting ordinal scaling. In still other cases there may be sufficient

structure to permit interval-scaled comparisons: the difference in the amount of the

attribute between A and B equals the difference between C and D.

We can aspire to resolve disputes about scale type only if the third of these views

is correct. Classical measurement simply assumes an answer, whereas the question of

scale type is either meaningless or of no importance in operational measurement.

Virtually all discussions of scale type nod in the direction of representational

measurement by invoking Stevens’ classification of admissible transformations. To the

extent that a case can be made that IRT scales are interval scales, it has to be made in

terms of representational measurement theory.

The argument that and are interval-scaled is found in the analysis of conjoint

additive structures. We begin by assuming that the Pij are given—or, more precisely,

that we are given the equivalence classes comprising examinees and items with the same

Pij. Let A1={a, b,…} denote the set of examinees and A2 = {p, q, …} the set of test

items, and let represent the order induced on A1xA2 by Pij. That is, (a,p) (b,q) if Pap

10

Pbq. The sets A1, A2, and the relation are known as an empirical relational system.

If several exacting conditions are met, requiring that the relations between A1 and A2

exhibit still more structure than the ordering of equivalence classes, the resulting

empirical relational system is called a “conjoint additive structure” and the following can

be proved: (1) there are functions 1 and 2 mapping the elements of these sets into the

real numbers (i.e., examinees and items can be scaled); (2) the relation ordering

examinee-item pairs can be represented by an additive function of their scaled values;

that is, (a,p) (b,q) 1(a) + 2(p) 1(b) + 2(q); (3) only affine transformations of

1 and 2 preserve this representation. These transformations correspond to a change of

origin and a change of units; hence, 1 and 2 are interval scales (Krantz et al., 1971).

There are two critical steps in the proof. First, from the relation ordering

examinee-item pairs we must be able to derive relations ordering the elements of each set

separately. For this we require a monotonicity condition (also known as independence):

if Paq > Pbq for some item q, then Par > Pbr for all r. An analogous condition holds for

items: if item q is harder for one examinee to answer, it is harder for all examinees.

Monotonicity establishes orderings of A1 and A2 separately. Without it, not even ordinal

scales could be established for examinees and items.

To obtain interval scales we need further conditions, as illustrated in Figure 2,

which depicts three “indifference curves” (literally: isoprobability curves) over

examinees and items: i.e., three equivalence classes determined by the Pij. Examinees

and items are arrayed along their respective axes according to the induced order on each

set, but no significance should be attached to the distance between a pair of examinees or

a pair of items: the axes are unscaled apart from the ordering of examinees and items.

11

However, note that there is an additional structure to the equivalence classes in Figure 2.

They exhibit a property known as equal spacing: as we move over one examinee and

down one item, we remain on the same indifference curve. Equal spacing is probably the

simplest of all conjoint additive structures, but it is not a necessary condition for conjoint

additivity.

To derive an equal-interval metric, we establish that the distance between one pair

of examinees is equal to the distance between another pair by relating both to a common

interval on the item axis. As shown in Figure 3, the interval between examinees a and b

can be said to equal the interval between examinees b and c in that it takes the same

increase on the item scale (from q to r) to offset both. Thus the item interval qr functions

as a common benchmark for defining a sequence of equal-unit intervals on the examinee

axis. The scaled value of any individual i is then obtained by counting the number of

such intervals in a sequence from some arbitrarily chosen origin to person i: ab, bc, cd ...

This scaling of the heretofore unscaled axes renders the equiprobability contours linear

(and, of course, parallel).

In order that the conclusion ab = bc = cd = … not be contradicted by other

relations in the data, it must be the case that the same conclusion follows no matter which

interval on the item axis is selected as the benchmark. A like condition must hold for

intervals on the examinee axis to serve as a metric for items. In addition to these

consistency conditions, other technical conditions must be met when equal spacing does

not obtain.

To summarize, conjoint measurement scales a sequence of intervals in one factor

by using differences of the complementary factor as a metric. By moving across

12

indifference curves, these sequences can be concatenated to measure the difference

between any two examinees (or items) with respect to the latent factor they embody.

Because the measurement of intervals reduces to counting steps in a sequence, there is an

essential additivity to this empirical structure, on the basis of which we obtain additive

representations of examinee and item traits.

Conjoint additivity uses only the Pij equivalence classes and not the values of Pij to

determine the 1 and 2 scales. The final step from conjoint additivity to IRT requires a

positive increasing transformation F from the positive reals to the interval [0,1], such that

F( 1(a) + 2(p)) = Pap. Because any increasing transformation of 1(a) + 2(p) preserves

the representation of A1xA2 by 1 and 2, it is clear that a function F satisfying this

condition can be found.

Of the three IRT models, only the one-parameter model is consistent with this

simple conjoint additive structure. The two- and three-parameter IRT models, in which

item discrimination varies, violate the monotonicity assumption. This is easily seen with

the aid of Figure 1. The ICC labeled II represents an item with discrimination parameter

= 1, while the curve labeled III has a discrimination of 2. The ICC cross where equals

the common value of the difficulty parameter. An individual whose lies to the left of

this intersection finds item II more difficult than item III; an individual whose is to the

right of the intersection ranks the items the other way around. Because different

individuals produce inconsistent rankings of items, the ranking of examinee-item pairs on

the basis of Pij does not yield orderings of examinees and items and the derivation of the

scales breaks down. Indeed, it is alleged that because only the one-parameter IRT

model produces a consistent ordering of items for all examinees (ICCs in the Rasch

13

model never cross), only the in a one-parameter model can be considered an interval-

scaled measure of ability (Wright, 1999).

This goes too far. A straightforward modification of conjoint additivity

accommodates structures with a third factor that enters multiplicatively, such as the IRT

discrimination parameter. The relevant theorem appears in Krantz et al. (1971, p. 348).

The empirical relations between examinees and items are termed a polynomial conjoint

structure. The extra conditions on examinees and items ensure that we can obtain

separate representations by discrimination classes in the manner just described. When

these conditions are met, we find that there are functions 1, 2, and 3 such that (a,p)

(b,q) 3(p)[ 1(a) + 2(p)] 3(q)[ 1(b) + 2(q)]. 1 and 2 are unique to linear

transformations and 3 is unique to a similarity transformation. Obviously this

representation has the structure of the two-parameter IRT model.13

To summarize: if the empirical relational system is one of conjoint additivity and

the function F correctly specifies the relationship between i- i and the Pij, the IRT

measure of latent ability, , and the IRT difficulty parameter, , are interval-scaled

13 This representation does not include anything that corresponds to the guessing parameter in the three-parameter IRT. While I have not seen an analysis of such a conjoint measurement structure, extending polynomial conjoint measurement to include the three-parameter IRT model would proceed similarly to the extension of conjoint additivity to cover the two-parameter IRT model. For sets of items with the same discrimination parameter, the two-parameter IRT model reduces to the one-parameter model. Proof of scale properties for the two-parameter model involves selecting one such set of items and scaling examinees with respect to it. Because the choice of items is arbitrary, the resulting scale is unique only up to the change of units that would result from the selection of an alternative set with a different value of the discrimination parameter. (For details, see Krantz et al., 1971, Chapter 7.) Incorporating a guessing parameter in the structure would involve following the same logic. Examinees would be scaled using a subset of the data (i.e., for a particular choice of discrimination and guessing parameters.) As with the polynomial conjoint structure, the fact that another choice could have been made introduces a degree of freedom into the representation, though in the case of the guessing parameter, this degree of freedom affects only the function mapping ( 3(q)[ 1(b) + 2(q)]) to Pij and not the relations among 1, 2, and 3. Theproperties of the 1 and 2 would therefore be precisely those established for the polynomial conjoint structure, inasmuch as these properties depend on relations between examinees and items and not on the function mapping equivalence classes to numerical values of Pij.

14

variables. If the empirical relational system is a polynomial conjoint structure and F is

again correct (e.g., the two-parameter logistic IRT model, or possibly the three-parameter

model, if extra conditions entailing a lower asymptote on Pij are met), and are again

interval-scaled.

Notwithstanding the fact that these propositions were proved in the 1960’s, one

continues to find a wide range of opinions about the properties of IRT scales in the

psychometric community (as we have seen). At least some of that diversity of opinion is

due to the following three misconceptions about scales. (1) Because arbitrary monotonic

transformations of and can be shown to fit the data equally well, and cannot be

interval-scaled. At best, they are ordinal variables. (2) Because is interval-scaled, no

scale of achievement related to by anything other than an affine transformation can be

an interval scale. In particular, if = g( ), where g is monotonic but not affine, the

scale is ordinal. (3) Using an IRT model (or specifically the Rasch model) to scale a test

produces an interval scale of ability. Each of these beliefs is wrong. Before quitting this

discussion of representational measurement theory, we need to understand why.

The first of these views derives from Stevens’ stress on the role of admissible

transformations in determining scale type. The problem is that Stevens’ formulation of

the matter fails to make clear just what makes a transformation “admissible.” There is a

sense that information must not be lost when a scale is transformed—but precisely what

information? All transformations rest on the implicit assumption that there is something

that we can alter freely—something, in other words, that is not “information,” at least not

information we care about. Stevens’ criterion for determining scale type remains empty

until this something is specified.

15

Misconception (1) rests on the following argument: we can replace with g( )

for arbitrary monotonic function g and still fit the data (the Pij) equally well, provided we

make an offsetting change to the function F relating and to P. An example appeared

in footnote 7 above, in Lord’s modification of the three-parameter logistic model:

P*ij = F*ij( ) = cj + (1-cj)[1 + exp(- j( -1( ( i)))- j)]-1

Because this model fits the data as well as the original three-parameter logistic model (as

it must, being mathematically equivalent), it is argued that and ( ) contain the same

information about ability. Because the function is not affine but an arbitrary

monotonic function, the conclusion is drawn that and must both be ordinal scales.

The flaw in this argument is the supposition that the only information that matters

is the order over equivalence classes of examinees and items induced by the Pij. But the

proof from conjoint additivity does not conclude that and are interval scales merely

because these mappings from examinees and items to real numbers preserves order

among the Pij. The empirical relational system between classes of examinees and items

contains more structure than the ordering of equivalence classes. It is this additional

structure, as illustrated by the example of the equally spaced conjoint structure above,

that underlies the claim that certain mappings from examinees and items are interval

scales. An arbitrary monotonic transformation of these scales no longer reflects the

relations holding among the scaled items and examinees shown in Figure 3. Such a

transformation loses the information that the distance between examinees a and b equals

the distance between b and c in the sense that both offset the same substitution of one pair

of test items for another. Preserving that information restricts the class of admissible

transformations to affine functions.

16

The preceding comment notwithstanding, it does not follow that there cannot be

another way of assigning numbers to examinees, on the basis of some other property,

producing a new scale = h( ) for a non-affine function h that can also be regarded as an

interval-scaled measure of ability. Clearly and cannot be interval scales of the same

property of examinees. (In the language of representational measurement theory, they

cannot represent the same empirical relational system.) That we might have reason to

regard both as measures of achievement is due to the vagueness and imprecision with

which the term achievement is used, not just in ordinary discourse, but in social science

research.

I illustrate with an example from economics. Consider a firm that employs thirty

workers on thirty machines. Workers are rotated among machines on a daily basis. The

only information the firm has on the productivity of either factor of production is the

daily output of each worker-machine combination. Suppose, for purposes of rewarding

employees or scheduling machines for replacement, the firm decides it needs measures of

the productivity of individual workers and machines. In other words, it wants to scale

these heretofore unscaled entities. (The parallel with testing should be obvious.) A

measurement theorist is called in who observes that the data satisfy the conditions of a

conjoint additive structure, inasmuch as workers and machines can be scaled such that the

resulting isoquants in worker/machine space are linear and parallel. The measurement

theorist confidently announces that these scaled values represent worker and machine

productivity, up to an arbitrary choice of unit of measurement and origin.

Some time later the firm calls in a production engineer, who prowls around the

shop floor with a stopwatch for a week, and then reports the following discovery: the

17

output of a worker/machine pair is a simple function of the downtime of the machine

(beyond a worker’s control) and the amount of time the worker is goofing off. That is,

output Qwm = w m, where w is the proportion of time worker w is attending to his work,

and m,is the proportion of time machine m is running properly. There is no difference

between workers otherwise: during their time on task with a functioning machine, all are

equally productive. The engineer therefore proposes w as the natural (and ratio-scaled)

measure of employee productivity.

It is then noted that the engineer’s productivity measure is not a linear

transformation of the measurement theorist’s measure, in that the two variables differ by

more than a choice of origin and unit. (Indeed, the measurement theorist’s w = ln w.)

Yet each expert swears his measure has at least interval-scale properties, and each is

right. The two measures capture different properties of the relations between workers

and machines. The engineer doesn’t care that his metric ignores the information in the

conjoint additive structure, because he relies on an alternative empirical relational system

(one that includes the position of hands on a stopwatch) to scale the entities on the shop

floor. Both experts have produced interval scales, and it is up to the firm to decide which

scale captures properties of the relations between workers and machines that most matter

to it.

This example shows why the second of the three misconceptions cited above is

false. The parallel with testing is maintained if there is some other empirical relational

system into which examinees and test items fit, affording an alternative metric.

Suggestive examples are found in the education production function literature, in the

form of back-of-the-envelope calculations of the benefits of various educational

18

interventions (e.g., the dollar value of higher achievement associated with smaller class

sizes). Whether this can be done with sufficient rigor to provide an interval scale for

measuring student achievement (and whether it is desirable to construct one along these

lines, if feasible) is a question to which I return below in Section V.

Finally, the notion that the use of an IRT model (or a particular IRT model, such

as the Rasch model) confers interval-scale status on the resulting places the cart before

the horse: it overlooks the requirement that the empirical relational system be a conjoint

additive or polynomial conjoint structure. Applying an IRT model willy-nilly to

achievement test data does not of itself confer any particular properties on the scale score

metric.14

IV. Achievement Scales in Practice

To summarize the argument of the preceding section, under stringent conditions,

an IRT measure of ability can be shown to be an interval-scaled variable. But there are

two important caveats. First, do the data meet these stringent conditions? Second, might

there be some other set of relations holding between examinees and test items that

provides an alternative, and perhaps more satisfactory, basis for constructing an

achievement measure with interval (or even ratio) scale properties? I take up these

questions in turn.

It is highly unlikely that real test data meet the exacting definition of a conjoint

structure. At best, they come close, though it is not easy to say how close. The theory

14 Though an obvious point, this is not always appreciated by researchers. Consider the following statement in a report of the Consortium on Chicago School Research. Having rescaled the Iowa Test of Basic Skills using a Rasch IRT model, the authors claim to have produced a metric with interval scale properties: “A third major advantage of the Rasch equating is that, in theory, it produces a ‘linear test score metric.’ This is an important prerequisite in studies of quantitative change. This allows us to compare directly the gains of individual students or schools that start at different places on the test score metric.” (Bryk et al., 1998).

19

places restrictions on the equivalence classes of A1 x A2 (examinees crossed with items),

but these classes are not given to us. Instead, the data consist of answers to test items,

often binary indicators of whether the answer was correct or not. From these data the

membership of the equivalence classes must be inferred before one can ascertain whether

these classes can be represented by a set of linear, parallel isoprobability curves. Because

the theory stipulates restrictions that hold for every examinee, while the amount of data

per examinee is small, there is little power to reject these hypotheses despite anomalies in

the data. Many restrictions might be accepted that would be deemed invalid if the actual

Pij were revealed.

IRT assumes that conditional on and , the Pij are independent across items and

examinees. Although correlated response probabilities attest to the existence of

additional latent traits that affect performance, unidimensional models are fit to the data

anyway. Sometimes it is clear even without statistical tests that the model does not fit the

data. Consider multiple choice exams, where the lower asymptote on the probability of a

correct response is not zero. The lower asymptote is more important for low ability than

high ability examinees. A conjoint structure for these data must take into account the

lower asymptote as another factor determining Pij or the scaled values of examinee ability

will be wrong. Indeed, if responses to a multiple-choice exam are tested to see whether

the definition of conjoint additivity is met, data that appear to meet this definition will no

longer do so once guessing is factored in unless all items are equally “guessable.”

Even when the data do fit the model, the extent to which they have been selected

for just this reason should be kept in mind. Indeed, it is not clear that the word data is the

right one in this context, as the tests are designed by test makers who decide first on a

20

scaling model and then strive to ensure that their test items meet the assumptions of that

model.15 Even if they are wholly successful in this endeavor, the “data” represent a

selected, even massaged, slice of reality and not a world of brute facts.

Thus, notwithstanding the support given by representational measurement theory to IRT

methods, IRT can fail to produce satisfactory interval scales. Verifying that the

conditions of the theory are met is difficult even for test makers, let alone statisticians and

behavioral scientists who use test scores for value added assessments.

As the worker-machine example shows, even if test data satisfy the conditions of

a conjoint structure, there might be some other scale with a claim to measure what we

mean by achievement. IRT scales have the peculiarity that the increase in ability

required to raise the probability of a correct response by any fixed amount is independent

of the difficulty of the question. That is, raising the probability of answering a very

difficult question from .1 to .9 takes the same additional knowledge as it does to raise the

probability of answering a very simple item from .1 to .9. (Observe that in this argument,

.1 and .9 can be replaced by any other numbers one likes, for example .0001 and .9999.)

That it takes the same increase in ability to master a hard task as an easy one follows

directly from conjoint additivity (specifically, the parallel equiprobability contours that

result when items and examinees are scaled to represent conjointness), but it may be

difficult for many readers to square this notion with other ideas they entertain about

achievement, based on the use of the term in other contexts: how long it takes to

15 For example, in the Rasch model items have a common discrimination parameter (their item characteristic curves do not cross). Test makers using the Rasch model plot ICC’s and discard items that violate this assumption. IRT also assumes that conditional on , probabilities of a correct response are independent over examinees. Violations of this assumption lead items to be discarded on grounds of potential bias. This is probably a good idea, but it further weakens the sense in which we are dealing here with the fit between a model and “data.”

21

accomplish these two tasks, how hard instructors must work, the extent to which a

student who has mastered the more difficult item is in a position to tackle a variety of

other tasks and problems, compared to the student who has mastered the easier item, and

so forth. In short, we may find ourselves in the position of the production engineer in the

worker-machine example: taking into account the other information we possess, we

might find the scale derived from conjoint additivity lacking, notwithstanding its

impeccable pedigree on purely formal grounds as an interval scale.

At this point it may be useful to examine some of the scales that have been

produced using IRT methods. If larger increments of ability (as measured by an

alternative metric grounded in some of the aforementioned phenomena) are in fact

required to produce the same change in Pij as questions become more difficult, IRT

scaling will compress the high end of the scale, diminishing mean gains between the

upper grades and reducing the variance in achievement. Evidence that something of this

kind occurs is found in developmental scales used to measure student growth across

grades. Students at different grade levels are typically given different exams, but by

including a sufficient number of overlapping items on forms at adjacent grade levels,

performance on one test can be linked to performance on another test, facilitating the

creation of a single scale of ability spanning multiple grade levels.

Table 1 below, reproduced from Yen (1986), displays scores for the norming

sample for the Comprehensive Test of Basic Skills (CTBS), Form Q, developed by

CTB/McGraw-Hill and scaled using IRT methods for the first time in 1981.16 In both

16 Prior to 1981 the CTBS was scaled using an older method known as Thurstone scaling. Although it has been claimed that Thurstone scaling produces scores on an interval scale, there is no basis for this claim comparable to the proofs provided for conjoint measurement structures in representational measurement theory: scale properties are not based on the structure of empirical relations holding among test items and

22

subjects, mean growth between grades drops dramatically, and almost monotonically,

between the lowest elementary grades and secondary grades. In addition, the standard

deviation of scale scores declines as the mean score rises.

CTB/McGraw-Hill has since superseded the CTBS with the Terra Nova series.

The decline in between-grade gains seen in the CTBS norming sample is still evident,

though less pronounced, in Terra Nova results for Mississippi and New Mexico, as

shown in Table 2.17 Mean gains in all subjects tend to decrease with grade level, though

there is a break in the pattern between grades 7 and 8.

Between-grade gains can be affected by the differences in the content of tests and

by linking error. It is particularly instructive, therefore, to see the patterns that emerge

when there are no test forms specific to a grade level and no linking error. Northwest

Evaluation Association uses computer-adaptive testing in which items are drawn from a

single, large item bank. Results for all examinees in mathematics in the fall of 2005 are

presented in Table 3. As in the previous tables, we again find that between-grade

differences decline with grade level. Within-grade variance in scores is stable (reading)

or increases moderately (mathematics).

Because this is at base a dispute about how to use words, we need to be careful in

discussing these phenomena. If the test data in fact exhibit a conjoint structure (let us

concede the point for now), the IRT is an interval-scaled variable. Yet this scale

commits us to the conclusion that the variance of reading ability is no greater among high

school students than second-graders. Most of us, I suspect, would respond that this scale

examinees, but on ad hoc assumptions about the distribution of ability in the population tested. 17 Data are from www.SchoolData.org maintained by the American Institutes of Research. New Mexico and Mississippi were two of a handful of states in these data that reported mean scale scores for vertically-linked tests produced by test-makers known to employ IRT scaling methods.

23

fails to capture something about the word ability (or achievement) that causes us to recoil

from such a conclusion.

Readers wondering whether their own pretheoretical notions of these terms accord

with the usage implied by IRT are invited to consider the sample of mathematics test

items displayed in Figure 4, taken from the Northwest Evaluation Association web site.

The items are drawn from a larger chart providing examples of test questions at eleven

different levels of difficulty. The two selections represent items scored at the 171-180

level and the 241-250 level, respectively.18 Consider the following question: if student

A is given the items in the first set, and student B the items in the second set, and if

initially each student is able to answer only 2 of 7 items correctly, which student will

have to learn more mathematics in order to answer all 7 items correctly? Student A has

basics of addition and subtraction to learn, as well (perhaps) as how to read simple ch

and diagrams. None of the required calculations are taxing: all could be done by

counting on one’s fingers. By contrast, student B must make up deficits in several of th

following areas: decimal notation, fractions, factoring of polynomials, solving algebraic

equations in one unknown, solid geometry, reading box plots, calculating percenta

The calculations required are more demanding. However, the correct answer, according

to IRT, is that both require the same increase in mathematics ability.

arts

e

ges.

20

19,

18 According to the data in Table 3, the average second grader tested in 2005 would have answered slightly more than half the questions in the first set correctly, while the average ninth grader would have responded correctly to slightly less than half the questions in the second set. 19 Strictly speaking, this is the correct answer only if items have equal discrimination parameters, that is, the one-parameter IRT model fits the data. In fact, NWEA does use the one-parameter model, and has gone to considerable lengths to verify that the items meet the assumptions of that model. 12 When I put this question to faculty and graduate students of my department in the School of Education at Vanderbilt, 13 of the 108 respondents chose A, 47 chose B, 15 said the amounts were equal, and 33 said the answer was indeterminate. Obviously this was not a scientifically-conducted survey, nor is it clear just what respondents meant by their answers. Conversations with some revealed that they converted the phrase “more mathematics” into something more readily quantified, such as the amount of time a student

24

One might conclude from this evidence that psychometricians should not attempt

to construct over-arching developmental scales for mathematics and reading ability that

span so many age levels. But other concerns are also raised. Between-grade gains begin

to decline at the lowest grade levels in Tables 1-3: gains between third and fourth grades

are markedly lower than gains between second and third. Between-grade comparisons

may matter little for value-added assessment, if most instructors, particularly in the

elementary grades, teach only one grade level. However, there are also implications for

within-grade assessments. If third graders are not in fact learning less than second

graders—if instead IRT methods have compressed the true scale—then the true gains of

higher-achieving students within these grades are understated.21

V. Options for Value-Added Assessment

What options are available to the practitioner who wants to conduct value-added

assessment but is unwilling to accept at face value claims that the IRT ability trait is

measured on an interval scale? Broadly speaking, there are three available courses of

action.

a. Use the scale anyway.

would need to acquire these skills. Persons who said the answer was indeterminate may have meant it was indeterminate in principle (“more mathematics” is meaningless) or simply that that answer couldn’t be determined from the information given. Nonetheless, it is striking how few gave the psychometrically correct response.

21 The phenomena of decreasing gains and diminishing or constant variance with advancing grade level has been treated in the psychometric literature under the heading scale shrinkage. While a number of explanations have been advanced, these explanations typically assume there is a true IRT model that fits the data (with ability, perhaps, multidimensional rather than unidimensional), and that various problems (e.g. a failure to specify the true model, the small number of items on the test, changing test reliability within or across grades, ceiling and floor effects) prevent practitioners from recovering the true values of .Psychometricians have disagreed about the extent to which these explanations account for the phenomena in question. (For notable contributions to this literature, see Yen, 1985; Camilli, 1988; Camilli, Yamamoto, and Wang, 1993; Yen and Burket, 1997.) By contrast, the point I am making here is that even in the absence of these problems, IRT scales are likely to exhibit compression at the high end when compared with pretheoretical notions of achievement grounded in a wider set of empirical relations.

25

b. Choose another measure of achievement with an interval-scaled metric.

c. Adopt analytical methods suited to ordinal data.

In this section I argue that neither a. nor b. is an attractive option. I then go on to

demonstrate the feasibility of c.

a. Using the scale.

Even if the IRT ability scale possesses at best ordinal significance, one might

continue to use it if reasonable transformations of all yield essentially the same

estimates of value added. (Compare the claim that statistics calculated from ordinal

variables are generally robust to all but the most grotesque transformations of the original

scale.22) The practice of many social scientists who are aware that achievement scales

may not be of the interval type but proceed with value-added assessment anyway

suggests that this view may be widespread.

Unfortunately, to test whether reasonable transformations of yield essentially

the same measures of value added requires some sense of what is reasonable. Absent

that, there may be a tendency for researchers conducting sensitivity analyses to decide

that the alternatives that are reasonable are those that leave their original estimates largely

intact. The compression of scales displayed in Tables 1-3 suggests one possibility:

assume an achievement scale in which average gains are equal across grades. I have

examined the consequences of adopting this alternative scaling for data from a sample of

districts in a Southern state.23 Data are from mathematics tests administered in grades 2-

8 during the 2005-06 school year. The tests were scaled using the one-parameter IRT

22 Cliff (1996) attributes this remark to Abelson and Tukey (1959), but it does not appear in that paper.23 Anonymity has been promised to both the state and test maker. Data are available for only a portion of the state. In addition, because between-grade growth is central to the analysis, only those districts that tested at least 90 percent of students in each grade are included. The final sample comprised 98,760

26

model. Results were similar to those we have seen in Tables 2-4, with near-monotonic

declines in between-grade gains. A transformation = g( ) was sought that would

equalize between-grade gains for the median student.24 It turned out that this could be

closely approximated by a quadratic function increasing over the range of observed

scores. Figure 5 depicts the relationship between the transformed and original scaled

values for the median examinee. While one might object to the scale on the grounds

that median student gains are not equal across grades, tapering off as students approach

adolescence, Figure 5 shows that the transformation from to is not driven by the

upper grades—essentially the same curve would be found if data points for grades 7 and

8 were dropped. Moreover, g( ) exhibits only a modest departure from linearity. It

seems unlikely that many researchers would regard it as a grotesque transformation of the

data.

Nonetheless, the consequences of this transformation for the distribution of

growth are pronounced (Table 4). At each grade level, I have calculated the change in

(alternatively, ) required to remain at the 10th, 25th, 50th, 75th, and 90th percentiles of the

achievement distribution when advancing to the next grade. The absolute changes in

and that meet this criterion are affected by the magnitude of the median student growth

(and are therefore scale dependent, i.e., they depend on choice of units). However,

relative changes are invariant to the choice of units. Accordingly, column 1, top panel,

presents the ratio of 25 to 10 (the change required to remain at the 25th percentile over

students in grades 2 through 8 during the 2005-06 school year. 24 To accomplish this, the original median scores were replaced with a new series in which the between-grade gain was set to the overall sample median gain across all grades. This left unchanged the second grade values but altered the values in subsequent grades, as shown in Figure 5. A quadratic function of was then fit to the new series. The fit, as shown, was exceedingly close, though on the resulting scale, median gains can vary by ±.5 points.

27

the change required to stay at the 10th percentile). Comparable ratios for the 50th, 75th,

and 95th percentiles appear in the other columns. The lower panel contains the same

ratios for the scale.

In the original scale, differences in growth at various points of the distribution are

not pronounced. The ratios are greatest in the fifth and sixth grades, but even here, there

is not much difference between a student at the median and one at the 90th percentile. By

contrast, the ratios are much greater using the transformed scale and increase

monotonically as we move to the right, from 25/ 10 to 90/ 10. While the

direction of the change is what we would expect on the basis of the preceding

discussion—less compression at the high end of the -scale compared to the -scale—the

magnitude of the difference is surprising. The impact on value-added assessment

depends on how students are distributed over schools and teachers. Clearly, changes of

the magnitude shown in Table 4 can make a great difference to teacher value added when

that distribution is not uniform.

b. Changing the scale.

Clearly nothing is gained by substituting percentiles, normal curve equivalents, or

standardized scores for . While all are used in the research literature, no one claims that

they are interval-scaled. However, two approaches merit discussion, not because they

work any better, but because the view seems to be gaining currency that they do, or

could.

The first, which I will refer to as binning, assigns every examinee to a group

defined by prior achievement (e.g., deciles of the distribution of prior scores). Gain

scores are normalized by mean gains within bins, either through dividing by the mean

28

gain or standardizing with respect to the mean and standard deviation within each bin.

The normalized gain scores are then used as raw data for further analysis, which could

include value-added assessment of schools and teachers. Examples of this approach are

Springer (2008) and Hanushek et al. ( 2005). In the latter study, binning is explicitly

motivated by concern that the metric in which test results are reported does not represent

true gains uniformly at all points of the achievement distribution.25

Binning does not solve the problem of scale dependence. Binning merely

declares that a normalized gain in one bin is to count the same as one that takes the same

normalized value in another bin. The declaration does not ensure that these two gains are

equal when measured on the true scale (if such a thing exists). Rather than solving the

problem, binning is simply the substitution of a particular transformation for the scale.

The notion that binning represents a solution may derive from the fact that

normalizing within bins removes much of the effect of any prior transformation of scale

(for example, the substitution of the scale for the scale). Differences between bins

have no effect on value-added measures. Only differences within bins matter (though

these are still affected by the transformation whenever the slope of g( ) = at a bin mean

differs from 1). Hence, the results of the binned analysis are less sensitive to prior

transformations of the original scale. This may have led some to believe that scale no

longer matters as much. This is not so, given that the normalized-within-bin scores are

themselves just another transformation of . Moreover, if invariance of this kind were

25 A practice similar to binning has been used in the Educational Value-Added Assessment System (EVAAS) of the SAS Institute, wherein gain scores are divided by the gain required to keep an examinee at the same percentile of the post-test distribution that he occupied in the pre-test distribution (Ballou, 2005). Thus, for exams like the Iowa Test of Basic Skills that exhibit increasing variance at higher grade levels, the transformation pulls up gains of examinees whose pretest scores were below the mean and reduces gains of examinees whose pretest scores were above the mean. There is no reason why this should be regarded as superior to using the original scale.

29

the sole desideratum, percentiles could be used in place of . Percentiles are invariant to

any increasing transformation of , but that does not make percentiles an interval scale of

achievement.

Suppose we decide that the problem of finding interval scales for academic

achievement is intractable. We still might be able to conduct value-added assessment if

achievement—by whatever metric—could be mapped into other variables whose

measurement poses no such difficulties. This mapping could go backward to inputs or

forward to outcomes. In the former, academic achievement would be related to

measurable inputs required to produce that achievement. Thus, instead of worrying

whether one student’s 5 point gain was really the equal of another student’s 5 point gain,

we would concern ourselves with measuring the educational inputs required to produce

either of these gains. If those inputs (e.g., teacher time) turn out to be equal, then for all

purposes that matter, the two gains are equivalent. If those inputs turn out not to be

equal, the teacher or school that has produced the gain requiring the greater inputs has

contributed more and should be so recognized by value-added analysis.

Forward mapping treats test scores as an intermediate output. Value-added

assessment would proceed by tying scores to long-term consequences.

“An important methodological issue…is the problem of choosing the correct metric with which to measure academic growth. Because the metric issue is so perplexing, almost all researchers simply use the particular test at their disposal, without questioning how the test’s metric affects the results…The only solution I see to the problem of determining whether gains from different points on a scale are equivalent is to associate a particular test with an outcome we want to predict (say, educational attainment or earnings), estimate the functional form of this relationship, and then use this functional form to assess the magnitude of gains. For example, if test scores are linearly related to years of schooling, then gains of 50 points can be considered equal, regardless of the starting point. If the log of scores is linearly related to years of schooling, however, then a gain of 50 points from a lower initial score is worth more than a gain of 50 points from a higher initial score. This ‘solution’ is, of course, very unsatisfactory, because the

30

functional form of the relationship between test scores and outcomes undoubtedly varies across outcomes.” (Phillips, 2000, p. 127)

As the final sentence of this passage suggests, we are very far from being able to

carry out either of these programs. Only some educational inputs are easily quantified.

For such inputs as the clarity of a teacher’s explanations or the capacity to inspire

students, the challenges to quantification are at least as great as for academic

achievement. Indeed, the low predictive power of those inputs that are easily quantified

is largely responsible for the current interest in value-added assessment.

The practical difficulty mentioned in the last sentence of the quoted passage is not

the only problem facing the forward mapping of test scores to long-term educational

outcomes. Given the variation in test results in Tables 1-3, it would almost certainly be

the case that the functional form of the relationship between test scores and outcomes

would vary across tests as well as outcomes. The introduction of each new test would

require additional analysis to determine how scores on its metric were related to long-

term objectives like educational attainment and earnings. In many cases, the data for

such analysis would not be available for years to come, if ever. In the interim we would

have to make do with very imperfect efforts to equate the new tests with tests already in

use (for which we would hope this mapping had already been done).

These are the technical issues. There is in addition the difficult normative

question of how to value various outcomes for different students in order to assign a

unique social value to each i. It is not obvious how we would come by these weights.

Even if future earnings were the only outcome that mattered, we would require the

relative values of a marginal dollar of future income for all examinees (whose future

incomes are, of course, unknown at the time the assessment is done). Attaching a price to

31

the non-pecuniary benefits of an education is still more difficult. Education is a

transformative enterprise. The ex ante value placed on acquiring an appreciation of great

literature is doubtless very different from the ex post value. In these circumstances there

is likely to be a great discrepancy between the compensating and equivalent variations

associated with a given educational investment. Accordingly, the weights in our index

would have to reflect the values of “society” rather than the still-unformed persons to be

educated.

It is not clear that we should subject decisions about education to this kind of

utilitarian calculus, as it fails to respect the autonomy of individuals. There is a strong

tradition in our polity of regarding educational opportunity as a right. Individuals have a

claim on educational resources not because distributing resources in this manner

maximizes a social welfare function, but because they are entitled to the chance to realize

their potential as individuals. If we take this seriously, then the notion that teachers and

schools are to be evaluated by converting test scores into the outputs that matter and

weighting these outputs according to their social value is wrong-headed. The point of

education is to provide students with skills and knowledge that as autonomous persons

they can make of what they will. Teachers should therefore be judged on how

successfully they equip their students with these tools—regardless of anyone’s views of

the merits of the final purposes, within wide limits, for which students use them.

c. Analyzing test scores as ordinal data

The final option for researchers uncertain of the metric properties of ability scales

is to treat such scales as ordinal, thus foregoing any analysis based on the distance

between two scores. On the assumption that scales contain valid ordinal information

32

about examinees, statistics based on the direction of this distance remain meaningful.

There are a variety of closely-related statistics of this kind (known generally as measures

of concordance/discordance) employed in the analysis of ordinal data. In this discussion

I will not attempt to identify a particular approach as best. Rather, my objective is to

demonstrate the feasibility of such methods for value-added assessment.

Suppose we want to compare the achievement of teacher A’s class to the

achievement of other students at the same grade level in the school system. If A has n

students, and teachers elsewhere in the system have m students, there are nm possible

pairwise comparisons of achievement. As only ordinal statements are meaningful, each

pair is examined to determine which student has the higher score. If it is A’s student, we

count this as one in A’s favor (+1); if it is the student from elsewhere in the system, we

count this as one against A (-1). Ties count as zeros. The sum of these counts, divided

by the number of pairs, is known as Somers’ d statistic. Somers’ d can be considered an

estimate of the difference in two probabilities: the probability that a randomly selected

student from A’s class outperforms a randomly selected student from elsewhere in the

system, less the probability that A’s student scores below the outsider.

This procedure suffers from the obvious defect that no adjustment has been made

for other influences on achievement. In conventional value-added analyses, this might be

accomplished through the introduction of prior test scores as covariates in a regression

model or by conditioning on prior scores in some other manner. An analogous procedure

in the ordinal framework would be to divide students into groups on the basis of one or

more prior test scores, compute separate values of Somers’ d by group, and aggregate the

resulting statistics using the share of students in each group as weights. In principle there

33

is no reason to restrict the information used to define groups to test scores. Any student

characteristic could be used to define a group. Only data limitations prevent the

construction of ever-finer groups.

Results from an application of this approach are presented Table 5. Because the

data used in the previous example do not contain teacher identifiers, for this application I

have used a different data set provided by a single, large district that contains student-

teacher links. Two sets of value-added estimates are presented for teachers of fifth grade

mathematics. The first is the weighted Somers’ d statistic, based on pairwise

comparisons of each teacher’s students to other fifth graders in the district. To control for

prior achievement, students were grouped by decile of the fourth grade mathematics

score. Students without fourth grade scores were dropped from the analysis. The second

value-added measure is obtained from a regression model in which fifth grade scores

were regressed on fourth grade scores and a dummy variable for the teacher in question.

Separate regressions were run for each teacher so that each teacher was compared to a

hypothetical counterpart representing the average of the rest of the district, preserving the

parallel with the Somers’ d statistic. The coefficients on the dummy variables represent

teachers’ value added. Statistical significance was assessed using the conventional t

statistics in the case of the regression analysis and jackknifed standard errors in the case

of the ordinal analysis.

How much difference does it make to teacher to be evaluated by the one method

rather than the other? The hypothesis that teachers are ranked the same by both methods

is rejected by the Wilcoxon signed rank test (p = .0078). The proportion of statistically

significant estimates is higher using the ordinal measure (which is less sensitive to

34

noisiness in test scores): 86 of the 237 teachers are significant by this measure, compared

to 64 by the other. The maximum discrepancy in ranks is 229 positions (out of 237

teachers in all). In ten percent of cases, the discrepancy in ranks is 45 positions or more.

Depending on the uses to which value-added assessment is put, the question of

greatest importance to teachers may be whether they fall at one end of the distribution or

the other. There are 59 teachers in each quartile of the distribution. The two measures

agree in 47 cases on the teachers in the top quartile, and in 48 cases on teachers in the

bottom quartile. Thus, if falling in the top quartile qualifies a teacher for a reward, 24

teachers (more than a third of the number of awardees) will qualify or not depending on

which measure is used. A comparable figure applies to teachers placing in the bottom

quartile, if that event is used to determine sanctions. In a small number of cases (3), the

effect of using one measure rather than the other is great enough to move a teacher from

the top quartile to the bottom quartile.

In principle it is possible to condition on multiple variables (additional prior test

scores, student demographic characteristics or SES) by defining groups as functions of

several covariates. In practice this is apt to exceed the capacity of the data. Consider, for

example, a data set containing prior test scores in two subjects, plus indicators of race and

participation in the free and reduced-price lunch program. If groups are defined by

deciles of the two test scores plus two binary indicators, the total number of groups is 400

(10x10x2x2). In a district of moderate size, there could be many cells with only one

observation and therefore no matching pair.

A two-stage method circumvents this difficulty. In the first stage, multiple

covariates are used to predict Prob(Yi>Yj) – Prob(Yj<Yi) for all pairs of students i and j,

35

where Y is the dependent variable of ultimate interest (e.g., end-of-year test scores in the

current year). The prediction is of the form iˆ = wk(Xki-Xkj) or wk sgn(Xki-Xkj),

depending on whether the Xk are themselves interval-scaled or ordinal. The wk are

weights that reflect how informative the different covariates are about sgn(Yi-Yj). (For

details, see Cliff, 1996.)

In the second stage, the n students of one teacher are compared to the m students

elsewhere in the same system, using iˆ (or grouped values of iˆ ) as the covariate. iˆ is

therefore analogous to a propensity score: it is a summary measure of the effects of the

stage-one covariates on the probability that a student outranks other students, before

controlling for teachers. The resulting measure of a teacher’s value added is based on

her students’ performance relative to this prediction (in the ordinal sense, of course).

Given that many achievement tests are now administered to thousands of students

throughout a state, it is worth noting that all the data can be used in the first stage to form

an ordinal prediction based on prior achievement, demographics, and SES, while

continuing to rely on within-district comparisons in the second stage for the final

measures of value added.26

VI. Conclusion

Are IRT ability traits measured on an interval scale? It seems hazardous to

assume so. Whether examinees and test items constitute a conjoint structure depends on

the make-up of equivalence classes defined by the Pij, but those are not given. Statistical

testing can reveal whether the data are strongly inconsistent with this hypothesis, but

moderate departures from the conjoint structure almost certainly go undetected.

26 Within district comparisons are preferred, given the impact of curriculum and other district policies on

36

Moreover, even if these conditions hold in the norming samples used by test-makers to

calibrate item difficulties, this provides no assurance they will hold in the population of

students to whom the test is finally administered.

End users of the data, such as practitioners of value-added assessment, typically

have no access to item-level data to test these assumptions themselves. Moreover, even if

the assumptions are met, conjoint additivity may not capture everything we want in a

scale of achievement. It would seem wise, then, to check the plausibility of the resulting

scales. On this count IRT ability scales often do poorly. Gain scores frequently fall from

one grade to the next. While some of this may reflect adolescents’ declining interest in

academic achievement, the patterns set in as early as third grade and the drop between

second and third grade is often the largest. In addition, IRT ability often shows a

diminishing or constant variance from lower to higher grades.

What, then, is the practitioner of value-added assessment to do? It is no good

hoping that the choice of scale makes little difference to estimates of value added. We

have seen that reasonable transformations of the scale can have a substantial effect on

relative gains across the distribution of achievement. No other scales with superior

metric properties are at hand. We can, however, use methods of ordinal data analysis on

the assumption that IRT scales (or any of their monotonic transformations) at least permit

us to rank students.

Ordinal analysis changes the question we ask in value-added assessment. Instead

of measuring mean achievement of a teacher’s students vis-à-vis the students of a

(hypothetical) average instructor, we ask what fraction of the former outperform the

latter. In ordinal analysis, as in regression-based methods, it is possible to control for

achievement.

37

other influences in order to isolate the teacher or school’s contribution. Clearly, if is an

interval-scaled variable, ordinal methods throw away valuable information. Practitioners

should ask themselves, however, whether they are so confident of the metric properties of

scales that they are willing to attribute differences between conventional estimates of

value added and estimates based on ordinal analysis to the superiority of the former.

Ordinal methods have other advantages. They are likely to be more robust to

measurement error in test scores and to various model misspecifications (though the

question of robustness is a complicated one; see Cliff, 1996). The question they answer

may be a more sensible way to evaluate educators, given that it attaches more value to

spreading gains over a wider number of students, compared to larger but more

concentrated gains. However, this paper has considered ordinal methods from one

standpoint only—that of finding appropriate value-added models when test scores are not

expressed on an interval scale. Numerous questions have been raised about the

assumptions and methods of more conventional regression-based analyses. Some of

these concerns can be addressed through modifications of those models. It remains to be

seen whether the same concerns arise with respect to ordinal methods and, if so, how

readily they can be accommodated within the ordinal framework.

38

References

Abelson, Robert, and John Tukey. 1959. Efficient conversion of nonmetric information into metric information. Proceedings of the Social Statistics Section, American Statistical Association, 226-230.

Ballou, Dale. 2005. “Value-Added Assessment: Lessons from Tennessee.” Lissitz, Robert, ed., Value Added Models in Education: Theory and Applications. Maple Grove, MN: JAI Press.

Bryk, Anthony et al. 1998. Academic Productivity of Chicago Public Elementary Schools. Chicago: Consortium on Chicago School Research.

Camilli, Gregory. 1988. “Scale Shrinkage and the Estimation of Latent Distribution Parameters.” Journal of Educational Statistics, 13(3), 227-241.

Camilli, Gregory, Kentaro Yamamoto and Ming-wei Wang. 1993. “Scale Shrinkage in Vertical Equating.” Applied Psychological Measurement, 17(4), 379-388.

Cliff, Norman. 1996. Ordinal Methods for Behavioral Data Analysis. Mahwah, NJ:Lawrence Erlbaum.

Hambleton, Ronald and H. Swaminathan. 1985. Item Response Theory: Principles and Applications. Boston: Kluwer-Nijhoff.

Hambleton, Ronald, H. Swaminathan and Jane Rogers. 1991. Fundamentals of Item Response Theory. Newbury Park: Sage.

Hand, D.H. 1996. “Statistics and the Theory of Measurement.” Journal of the Royal Statistical Society. Series A (Statistics in Society). 159(3), 445-492.

Hanushek, Eric A. et al. 2005. Charter School Quality and Parental Decision-Making with School Choice. Cambridge, MA: NBER Working Paper 11252.

Krantz, David H., et al. 1971. Foundations of Measurement. Vol. 1, Additive and Polynomial Representations. NY: Academic Press.

Lord, Frederic. 1975. “The ‘Ability’ Scale in Item Characteristic Curve Theory.”Psychometrika, 40(2), 205-217.

Mislevy, Robert. 1987. “Recent Developments in Item Response Theory with Implications for Teacher Certification.” Review of Research in Education, 14.Washington DC: American Educational Research Association.

Northwest Evaluation Association. 2008. Math_Single.pdf. Retrieved February 20, 2008, from www.nwea.org/assessments/ritcharts.asp.

39

Phillips, Meredith. 2000. “Understanding Ethnic Differences in Academic Achievement: Empirical Lessons.” Grissmer and Ross, eds., Analytic Issues in the Assessment of Student Achievement. Washington DC: US Department of Education, National Center for Education Statistics.

Springer, Matthew G. 2008. Accountability Incentives: Do Schools Practice Educational Triage? Education Next. (Winter), 75-79.

Stevens, S.S. 1946. “On the Theory of Scales of Measurement.” Science, 103, 677-680.

Wright, Benjamin D. 1999. “Fundamental Measurement for Psychology.” Embretson and Hershberger, eds., The New Rules of Measurement. Mahwah, NJ: Lawrence Erlbaum.

Yen, Wendy. 1985. “Increasing Item Complexity: A Possible Cause of Scale Shrinkage for Unidimensional Item Response Theory.” Psychometrika, 50(4), 399-410.

Yen, Wendy. 1986. “The Choice of Scale for Educational Measurement: An IRT Perspective.” Journal of Educational Measurement, 23(4), 299-325.

Yen, Wendy and George R. Burket. 1997. “Comparison of Item Response Theory and Thurstone Methods of Vertical Scaling.” Journal of Educational Measurement, 34(4), 293-313.

Zwick, Rebecca. 1992. “Statistical and Psychometric Issues in the Measurement of Educational Achievement Trends: Examples from the National Assessment of Educational Progress.” Journal of Educational Statistics, 17(2), 205-218.

40

Table 1: Scale Scores for Comprehensive Test of Basic Skills, 1981 Norming Sample

Reading/VocabularyMathematics Computation

Grades MeanScore

Std.Dev.

MeanBetween-Grade Change Mean Score Std. Dev.

MeanBetween-GradeChange

1 488 85 ---- 390 158 --- 2 579 78 91 576 77 1863 622 65 43 643 44 674 652 60 30 676 35 335 678 59 26 699 24 236 697 59 19 713 20 147 711 57 14 721 23 68 724 54 13 728 23 79 741 52 17 736 17 8

10 758 52 17 739 16 311 768 53 10 741 18 212 773 55 5 741 20 0

Source: Yen (1986).

41

Table 2: Mean Between-Grade Differences, Terra Nova

Subject_Year 2nd to 3rd 3rd to 4th 4th to 5th 5th to 6th 6th to 7th 7th to 8th 8th to 9th

Mississippi, 2001 Language Arts 30.2 20.4 17.8 6.9 9.9 10.9 ---- Reading 23.8 21.5 15.5 10.5 9.2 12.4 ---- Math 47.3 24.7 19.0 21.7 10.7 17.0 ----

New Mexico, 2003 Language_Arts_2003 ---- 14.8 12.6 5.1 3.7 4.5 8.3Reading_2003 ---- 14.4 13.8 5.0 4.7 11.4 4.9Math_2003 ---- 21.1 14.6 19.0 5.5 16.5 5.4Science_2003 ---- 20.2 13.9 9.0 11.6 11.9 7.1History/SS_2003 ---- 13.3 6.2 11.2 11.5 3.6 3.6

Source: Author’s calculations from school-level data posted on www.SchoolData.org, maintained by American Institute of Research. Data for each school are weighted by enrollment and aggregated to the state level.

42

Table 3: Scale Scores, Northwest Evaluation Association, Fall, 2005

Reading Mathematics Grade Mean Std Change Mean Std Change

2 175.57 16.22 ---- 179.02 11.81 ---- 3 190.31 15.56 14.74 192.96 12.06 13.954 199.79 14.95 9.48 203.81 12.80 10.855 206.65 14.60 6.86 212.35 13.92 8.536 211.49 14.76 4.84 218.79 15.00 6.447 215.44 14.82 3.96 224.59 15.99 5.808 219.01 14.76 3.56 229.38 16.79 4.799 220.93 15.28 1.92 231.76 17.42 2.38

Source: Author’s calculations from data provided by NWEA.

43

Table 4: Effect of Scale Transformations on Between-Grade Growth

Relative to students at the 10th percentile, growth by students at the:

25th percentile Median 75th percentile 90th percentile

Original Scale2nd to 3rd 0.97 1.06 1.03 0.953rd to 4th 1.10 1.03 1.16 1.344th to 5th 0.96 1.15 1.35 1.235th to 6th 1.61 2.12 2.17 2.246th to 7th 1.21 1.39 1.43 1.627th to 8th 1.16 1.17 1.16 1.13

Transformed Scale 2nd to 3rd 2.63 5.06 6.83 7.903rd to 4th 1.69 2.10 2.93 3.954th to 5th 1.30 1.94 2.74 2.875th to 6th 2.12 3.48 4.25 4.916th to 7th 1.60 2.29 2.79 3.507th to 8th 1.51 1.89 2.17 2.35

Data: Mathematics test results furnished to author from a Southern state, 2005-06.

44

Table 5: Comparison of Conventional and Ordinal Measures of Teacher Value Added

Total number of teachers 237Wilcoxon signed ranks test, p-value 0.0078Maximum discrepancy in ranks 229Absolute discrepancy in ranks, 90th percentile 45Number of statistically significant teacher effects, fixed effect estimate 64Number of statistically significant teacher effects,ordinal estimate 86Number of significant effects by both estimates 55Number ranked in the top quartile 59

Number ranked in the top quartile by both estimates 47

Number ranked in the bottom quartile by both estimates 48Number ranked above the median by one estimate, below the median by the other 14

Number ranked in the bottom quartile by one, top quartile by the other 3

Data: 5th grade mathematics teachers, large Southern district, 2005-06, author’s calculations.

45

Figure 1: Item Characteristic Curves

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Ability

Prob

abili

ty o

f cor

rect

resp

onse

I

II

III

46

Figure 2: A Conjoint Additive Structure, Prior to Scaling

Examinees

Test

Item

s

48

Figure 4: Math Items, Two Hypothetical Students

49

Source: Northwest Evaluation Association RIT Charts, Sample Mathematics Items, 2008.

matthew G. springerDirectorNational Center on Performance Incentives

Assistant Professor of Public Policyand Education

Vanderbilt University’s Peabody College

Dale BallouAssociate Professor of Public Policy

and EducationVanderbilt University’s Peabody College

leonard BradleyLecturer in EducationVanderbilt University’s Peabody College

Timothy C. CaboniAssociate Dean for Professional Education

and External RelationsAssociate Professor of the Practice in

Public Policy and Higher EducationVanderbilt University’s Peabody College

mark ehlertResearch Assistant ProfessorUniversity of Missouri – Columbia

Bonnie Ghosh-DastidarStatisticianThe RAND Corporation

Timothy J. GronbergProfessor of EconomicsTexas A&M University

James W. GuthrieSenior FellowGeorge W. Bush Institute

ProfessorSouthern Methodist University

laura hamiltonSenior Behavioral ScientistRAND Corporation

Janet s. hansenVice President and Director of

Education StudiesCommittee for Economic Development

Chris hullemanAssistant ProfessorJames Madison University

Brian a. JacobWalter H. Annenberg Professor of

Education PolicyGerald R. Ford School of Public Policy

University of Michigan

Dennis W. JansenProfessor of EconomicsTexas A&M University

Cory KoedelAssistant Professor of EconomicsUniversity of Missouri-Columbia

vi-Nhuan leBehavioral ScientistRAND Corporation

Jessica l. lewisResearch AssociateNational Center on Performance Incentives

J.r. lockwoodSenior StatisticianRAND Corporation

Daniel f. mcCaffreySenior StatisticianPNC Chair in Policy AnalysisRAND Corporation

Patrick J. mcewanAssociate Professor of EconomicsWhitehead Associate Professor

of Critical ThoughtWellesley College

shawn NiProfessor of Economics and Adjunct

Professor of StatisticsUniversity of Missouri-Columbia

michael J. PodgurskyProfessor of EconomicsUniversity of Missouri-Columbia

Brian m. stecherSenior Social ScientistRAND Corporation

lori l. TaylorAssociate ProfessorTexas A&M University

Faculty and Research Affiliates

E X A M I N I N G P E R F O R M A N C E I N C E N T I V E SI N E D U C AT I O N

National Center on Performance incentivesvanderbilt University Peabody College

Peabody #43230 appleton PlaceNashville, TN 37203

(615) 322-5538www.performanceincentives.org

Test Scaling and Value-Added Measurement · Stanford Achievement Test, and the National Assessment...

Documents

Transcript of Test Scaling and Value-Added Measurement · Stanford Achievement Test, and the National Assessment...