Test Scaling and Value-Added Measurement · Stanford Achievement Test, and the National Assessment...
Transcript of Test Scaling and Value-Added Measurement · Stanford Achievement Test, and the National Assessment...
Working Paper 2008-23December 2008
Test Scaling and
Value-AddedMeasurement
Dale Ballou
IN COOPERATION WITH:LED BY
The NaTioNal CeNTer oN PerformaNCe iNCeNTives(NCPI) is charged by the federal government with exercising leader-ship on performance incentives in education. Established in 2006through a major research and development grant from the UnitedStates Department of Education’s Institute of Education Sciences(IES), NCPI conducts scientific, comprehensive, and independentstudies on the individual and institutional effects of performance in-centives in education. A signature activity of the center is the conductof two randomized field trials offering student achievement-relatedbonuses to teachers. e Center is committed to air and rigorousresearch in an effort to provide the field of education with reliableknowledge to guide policy and practice.
e Center is housed in the Learning Sciences Institute on thecampus of Vanderbilt University’s Peabody College. e Center’smanagement under the Learning Sciences Institute, along with theNational Center on School Choice, makes Vanderbilt the only highereducation institution to house two federal research and developmentcenters supported by the Institute of Education Services.
This working paper was supported by the National Center onPerformance Incentives, which is funded by the United StatesDepartment of Education's Institute of Education Sciences(R30SA06034). is paper was presented at the Wisconsin Centerfor Education Research's National Conference on Value-AddedModeling in March 2008. e author thanks Kurt Scheib and WarrenLangevin for their expert research assistance. e views expressed inthis paper do not necessarily reflect those of sponsoring agencies orindividuals acknowledged. Any errors remain the sole responsibilityof the authors.
Please visit www.performanceincentives.org to learn more aboutour program of research and recent publications.
Test Scaling and Value-Added Measurement:Dale BallouVanderbilt University
Abstract
As currently practiced, value-added assessment relies on a strong assump-tion about the scales used to measure student achievement, namely thatthese are interval scales, with equal-sized gains at all points on the scale rep-resenting the same increment of learning. Many of the metrics in whichtest results are expressed do not have this property (e.g., percentile ranks,normal curve equivalents). However, this property is claimed for the scalesobtained when tests are scored according to the Item Response eory.
is claim requires that examinees and test items constitute, in the ter-minology of representational measurement theory, a conjoint additivestructure. Unfortunately, it is difficult to confirm that this conditionis met. In addition, the internal structure of a conjoint additive systemmay not be the property we seek to represent in an achievement scale, assuggested by the lack of surface plausibility of many scales resulting fromthe application of IRT.
Methods of ordinal data analysis can be employed instead, on weakerassumptions. Instead of comparing mean achievement of a teacher’sstudents to the students of a (hypothetical) average instructor, ordinalanalysis asks what fraction of the former outperforms the latter. efeasibility of such methods for value-added analysis is demonstrated.It is seen that value-added estimates are sensitive to the choice of ordi-nal methods over conventional techniques.
Clearly, if IRT scores are an interval-scaled variable, ordinal methodsthrow away valuable information. Practitioners of value-added measure-ment should ask themselves, however, whether they are so confident ofthe metric properties of these scales that they are willing to attribute dif-ferences between regression-based estimates of value added and estimatesbased on ordinal analysis to the superiority of the former.
1
I. Introduction
Models currently used for value-added assessment of schools and teachers
require that the scale on which achievement is measured be one of equal units: the five-
point difference between scores of 15 and 20 must represent the same gain as the five-
point difference between 25 and 30. If it does not, we will end up drawing meaningless
conclusions about such matters as the average level of achievement, relative gains of
different groups of students, etc., in that the truth of these conclusions will depend on
arbitrary scaling decisions.
A scale that possesses this property is known as an interval scale. It is clear that a
simple number-right score is not an interval scale of achievement when test questions are
not of equal difficulty. The same is true of several popular methods of standardizing raw
test scores that also fail to account for the difficulty of test items, such as percentiles,
normal curve equivalents, or grade-level equivalents normed to a nationally-
representative sample.1
The search for measures of achievement that are independent of the particular
items included on a test led to the development in the 1950’s of Item Response Theory.
IRT is now used to score most of the best-known and most widely administered
achievement tests today, such as the CBT/McGraw-Hill Terra Nova series, the SAT, the
Stanford Achievement Test, and the National Assessment of Educational Progress. IRT
was regarded as a significant advance over earlier scaling methods for the following
reasons: (1) The score of an examinee is not dependent on the difficulty of the items on
the test, provided the test is not so easy that the examinee answers all items correctly or
1 These last methods also have the disadvantage of being dependent on the distribution of ability in the tested population or must rely on arbitrary assumptions about this distribution.
2
so hard that he misses them all; (2) An examinee’s score is not dependent on the
characteristics of the other students taking the same test; (3) Finally, according to some
psychometricians, an examinee’s score is an interval-scaled measure of ability2. This last
claim makes IRT scaling particularly appealing to those practicing value-added
assessment.
However, not all psychometricians share these views, and the literature contains
many confusing and contradictory statements about the properties of IRT scales.
Because many social scientists using test scores to evaluate educational institutions and
policies have little or no training in measurement theory, the first objective of this paper
is to review the issues. The next section describes IRT. It is followed by a summary of
the controversy over scale type in Section III. While in the right conditions, IRT yields
interval-scaled measures of achievement, these conditions are difficult to verify.
Moreover, IRT scales are often at odds with common sense notions about the effects of
schooling and the distribution of ability as students advance through school. I argue in
Section IV that we are rightly suspicious of IRT scales when we see such results. Section
V takes up the implications for value-added assessment, with particular attention to
methods of ordinal data analysis. Concluding remarks appear in Section VI.
2 The psychometric literature uses the term ability. Other social scientists might prefer achievement. It represents the student’s mastery of the domain of the test and should not be confused with innate ability as opposed to knowledge and skills acquired through education.
3
II. Item Response Theory and Ability Scales
In IRT, the probability that student i correctly answers test item j correctly is a
function of an examinee trait (conventionally termed ability) and one or more item
parameters.3 Thus
(1) Pij = Prob[h( i, j) > uij] = F( i, j),
where i is ability of person i, j is a characteristic (possibly vector-valued) of item j, and
uij is an idiosyncratic person-item interaction, as a result of which individuals of the same
level of ability need not answer a given item alike. The uij are taken to be independent
and identically distributed random errors. The function h expresses how ability and item
parameters interact. F is derived from h and assumptions about the distribution of uij.
Common assumptions are that the uij are normal or logistic. i and j are estimated by
maximum likelihood methods.
When there is a single item parameter, the assumption that the uij are logistic
gives rise to the one-parameter logistic model (also known as the Rasch model):
(2a) Pij = [1 + exp(-D( i- j))]-1,
in which D is an arbitrary scaling parameter, invariant over items, that can be chosen by
the practitioner. (If D = 1.7 it makes essentially no difference whether the model is
estimated as a logistic or normal ogive model. Alternatively, D is set to 1.) The estimate
of is known as the scale score. The scale score is the principal measure of performance
on the exam, although other measures derived from it, such as percentile ranks, may also
be reported.4 The item parameter is conventionally termed difficulty. The functional
3 There are many lucid expositions of IRT, including Hambleton and Swaminathan (1985), and Hambleton, Swaminathan, and Rogers (1991). 4 Another measure is the so-called “true score,” which is simply the expected number right = Pij( ) on a
4
form of (2a) implies that it is measured on the same scale as ability.
More elaborate models introduce additional item parameters, as in the two- and
three-parameter logistic models:
(2b) Pij = [1 + exp(- j( i- j))]-1
(2c) Pij = cj + (1-cj)[1 + exp(- j( i- j))]-1
(2b) contains a second item parameter, j, known as the item discrimination parameter
because it enters the derivative of Pij with respect to i (thereby determining how well the
item discriminates between examinees of different ability). In both (2a) and (2b), the
limit of Pij as i - is 0. This is not appropriate for tests where a student who knows
nothing at all can answer an item correctly by guessing. Accordingly, cj allows for a non-
zero asymptote and is conventionally termed the guessing parameter.
The plot of Pij against i is known as the item characteristic curve (ICC). Three
item characteristic curves using the 2-parameter logistic IRT model are depicted in Figure
1. Curve II differs from curve I by an increase in the difficulty parameter, holding
constant item discrimination. Curve III corresponds to an item with the same level of
difficulty as II but a doubling of the discrimination parameter. All three curves have a
lower asymptote of 0. Observe that all three curves are steepest where Pij = .5 and = .5
At this point the slope equals j.
In the IRT models in (2a) – (2c), and are underidentified. Modification of
each by an additive constant obviously leaves Pij unchanged. Multiplicative constants
can be offset by changes to (or absorbed in) the discrimination parameter. Because an
test comprising the universe of items. This suffers from the usual defect of a number-right score: the metric depends on the composition of the universe. 5 Because these models are additive in and , ability and difficulty are expressed on the same scale.
5
additive constant corresponds to a change in the origin of the scale, while a multiplicative
constant represents a change in the scale’s units (e.g., from meters to yards), many
psychometricians have concluded that and are measured on interval scales, which are
characterized by the same conventional choice of origin and unit (see Section III).
However, this opinion is by no means universal, nor is it firmly held. Many
psychometricians, including some who state that these are interval scales, also regard the
scale as arbitrary. Others caution that ability scales should be accorded no more than
ordinal significance. These conclusions appear to be derived from the following
considerations: (1) Because ability is a latent trait, it is impossible to verify by physical
means that all 1-unit increments in represent the same increase in ability. This (the
argument goes) confers an inherent indeterminacy on the scaling of any latent trait.6 (2)
Replacing with a monotonic transformation g( ) while making offsetting changes to the
function F yields a model that fits the data just as well.7 Thus the scale rests on
arbitrary assumptions regarding functional form.8 (3) The notion that measures the
6 “When the characteristic to be measured cannot be directly observed, claims of equal-interval properties with respect to that characteristic are not testable and are therefore meaningless.” (Zwick, 1992, p. 209) 7 An example is Lord’s (1975) transformation of the scale to eliminate correlation between item difficulty and item discrimination, in which i was replaced by the regression of j on j and higher powers of j
evaluated at j = i. The result was a modification of the three-parameter logistic model: P*ij = F*ij( ) = cj + (1-cj)[1 + exp(- j( -1( ( i)))- j)]-1
where ( ) is proportional to -.27 + 1.1694 + .2252 2 + .0286 3. As a nonlinear transformation of the original ability scale, the -scale differs from the -scale by more than a change of origin and unit. Lord saw this as no drawback: “There seems to be no firm basis for preferring the scale to the scale for measuring ability.” (Lord, 1975, p. 205) 8 “A long-standing source of dissatisfaction with number-right and percent-correct scores is that their distribution depends on item reliabilities and difficulties. Radically different distributions of true scores (expected percents correct) can be obtained for the same sample of examinees when they take different tests. The obvious restriction of such scores to their ordinal properties casts doubt upon their use for problems that require interval scale measurement, such as comparing individuals’ gains…Item response theory appeared to offer a general solution, since the same underlying scale could explain the different true-score distributions corresponding to any subset of items from a domain. But this line of reasoning runs from model to data, not from data to model as must be done in practice. Suppose that a given dataset can be explained in terms of a unidimensional IRT model with response curves of the form Fj( ). Corresponding to any
6
amount of something misconceives the entire enterprise. There is no single trait (call it
ability, achievement or what-have-you) to be quantified by this or any other means.9 As
such, the question of scale type is essentially meaningless: there are just various models
for reducing the dimensionality of the data, some more convenient than others.10 I will
refer to this view of IRT scales in the ensuing discussion as the operational perspective.
To summarize, there are some psychometricians who consider to be interval-
scaled, others who think it is ordinal, still others who regard the choice of scale as
arbitrary, even if it is an interval scale, and finally some who are unsure what it is.
Clearly it is disconcerting to find this divergence of views on a question of fundamental
importance to value-added assessment. Is the IRT ability trait measured on an interval
scale or not? Indeed, how does one tell?
III. IRT Models and Scale Type
In the natural sciences, measurement is the assignment of numbers to phenomena
in such a way that relations among the numbers represent empirically-given relations
continuous, strictly increasing function h there is an alternative model with curves F*j( ) = Fj(h( )) that fits the data in precisely the same manner (Lord, 1975). That a particular IRT model fits a dataset, therefore, is not sufficient grounds to claim scale properties stronger than ordinal.” (Mislevy, 1987, p. 248) 9 The claim that a particular unidimensional scaling method is right must be based on the assumption that achievement is unidimensional, that it can be linearly ordered, and that students can be located in this linear ordering independently of performance on a particular achievement test. However, no one has succeeded in identifying or defining a linearly ordered psychological trait in educational achievement, and no one has demonstrated that a particular measurement scale is linearly related to such a trait. A serious obstacle to the establishment of truly (externally verifiable) equal-interval achievement scales is the fact that achievement is multidimensional and qualitatively changing. The nature of what is being learned is constantly being modified. Use of an objective-based approach to achievement highlights the difficulties in hypothesizing and verifying a continuous achievement trait. The student is learning the names of letters of the alphabet one day, associating sounds to those letters another day, and attaching meaning to groups of letters a third day. How can such qualitative changes be hypothesized to fall so many units apart on one particular trait?” (Yen, 1986, p. 311-12) 10 “It is important for educators and test developers to acknowledge that until the achievement traits are much more adequately defined, it is not possible to develop measurement scales that are linearly related to such traits. In fact, it appears impossible to provide such trait definition. Test users are therefore left to use other criteria to choose the best scale for a particular application; choosing the right scale is not an option.” (Yen, 1989, p. 314)
7
among the phenomena. Technically, there is a homomorphism between the empirical
relations among objects in the world and numerical relations on the scale. One such
relation is order: if objects can be ordered with respect to the amount of some attribute,
that order needs to be reflected in increasing (or decreasing) scale values. However,
order is not the only attribute to matter. If objects A and B can be combined (or
concatenated) in such a way that in combination they possess the same amount of some
attribute (length, mass) as object C, a scale for that attribute needs to reflect the results of
that operation. Thus, in an additive representation, the scaled value of the attribute in A
plus the scaled value of the attribute in B equals the scaled value assigned to C.
The importance of convention relative to empirical phenomena turns out to be the
key to scale type. At one end of the spectrum are scales in which there is no role for
convention, at the other scales that are entirely conventional. Scales of the first type are
called absolute. An example is counting: one is not free to change units if the
information to be conveyed by the scale is the number of discrete objects under
observation.11 At the other end of the spectrum are nominal scales, where the number
assigned to an object is no more than a label and the information conveyed could just as
well be represented by a non-numerical symbol. Scales for most physical quantities, such
as length and mass, have a degree of freedom for the conventional choice of a unit. Such
scales are known as “ratio scales” because the ratio of two lengths or two weights is
invariant to the choice of unit: ratios are convention-free. If there is no natural zero, so
11 It should be clear that scale type is as much a matter of how numbers are interpreted as of formal relations among the items being scaled. For example, considered purely from a technical standpoint, number right on a test can be regarded as an absolute scale. Like any scale that counts items (e.g., the score of a football game), number right does not admit of a change of units without a loss of information. However, when test items are not of equal difficulty, number right cannot be regarded as a measure of achievement that generalizes beyond performance on the particular test in question, and as such is not an
8
that the origin of a scale is also determined conventionally, ratios are no longer
convention-free magnitudes. However, ratios of intervals are invariant under change of
unit and change of origin. Such scales are therefore known as interval scales: they are
characterized by two degrees of freedom. Between interval scales and nominal scales lie
ordinal scales. Any increasing function of an ordinal scale conveys the same
information.12
With respect to psychological variables, there is less agreement about the nature
of measurement. There appear to be three prevalent views (Hand, 1996). On one view
(sometimes called “classical measurement”), it is simply assumed that the psychological
variable of interest exists and that there is a ratio (or at least interval) scale on which it is
measurable. The task of measurement is to discover those values. This appears to be the
view of some psychometricians with respect to IRT scales. On a second view, a
psychological variable exists only by virtue of its presence in some model. The model
effectively defines the variable and, when the model is fit to data, provides a means of
determining the scaled value of the variable. This has been called “operational
measurement” and is compatible with the operational perspective on IRT scales described
above, in which IRT models are devices for reducing the dimensionality of data. Two
different models may both contain the term “ability.” There is no basis for deciding
between them on grounds that one better represents “true ability”—each is just more or
less useful for the purposes to which they are put. Finally, there is a third view that holds
absolute scale, a ratio scale, or an interval scale. Except in special circumstances it is not even an ordinal scale.12 Scale type is therefore closely related to the notion of an admissible scale transformation (Stevens, 1946). An admissible transformation preserves the empirical information in the original scale by altering only those elements that are purely conventional. For ratio scales, these are the similarity transformations, corresponding to a choice of unit (for example, substituting centimeters for meters). In the case of interval scales, the admissible transformations consist of affine transformations, g= f(a) + .
9
that measurement of psychological variables is essentially the same as measurement in
the physical sciences—the view sketched at the start of this section, known as
representational measurement. Psychological attributes are postulated to explain
empirically-given relationships (such as the pattern of examinees’ responses to the items
on a test). It is the structure of those relationships that determines the properties of scales
that measure these attributes. Some relationships are so lacking in structure that the
attributes may not be scalable at all—the most we can do is to name them. In other cases
it may be possible to say that there is a larger quantity of some attribute in one person
than another, supporting ordinal scaling. In still other cases there may be sufficient
structure to permit interval-scaled comparisons: the difference in the amount of the
attribute between A and B equals the difference between C and D.
We can aspire to resolve disputes about scale type only if the third of these views
is correct. Classical measurement simply assumes an answer, whereas the question of
scale type is either meaningless or of no importance in operational measurement.
Virtually all discussions of scale type nod in the direction of representational
measurement by invoking Stevens’ classification of admissible transformations. To the
extent that a case can be made that IRT scales are interval scales, it has to be made in
terms of representational measurement theory.
The argument that and are interval-scaled is found in the analysis of conjoint
additive structures. We begin by assuming that the Pij are given—or, more precisely,
that we are given the equivalence classes comprising examinees and items with the same
Pij. Let A1={a, b,…} denote the set of examinees and A2 = {p, q, …} the set of test
items, and let represent the order induced on A1xA2 by Pij. That is, (a,p) (b,q) if Pap
10
Pbq. The sets A1, A2, and the relation are known as an empirical relational system.
If several exacting conditions are met, requiring that the relations between A1 and A2
exhibit still more structure than the ordering of equivalence classes, the resulting
empirical relational system is called a “conjoint additive structure” and the following can
be proved: (1) there are functions 1 and 2 mapping the elements of these sets into the
real numbers (i.e., examinees and items can be scaled); (2) the relation ordering
examinee-item pairs can be represented by an additive function of their scaled values;
that is, (a,p) (b,q) 1(a) + 2(p) 1(b) + 2(q); (3) only affine transformations of
1 and 2 preserve this representation. These transformations correspond to a change of
origin and a change of units; hence, 1 and 2 are interval scales (Krantz et al., 1971).
There are two critical steps in the proof. First, from the relation ordering
examinee-item pairs we must be able to derive relations ordering the elements of each set
separately. For this we require a monotonicity condition (also known as independence):
if Paq > Pbq for some item q, then Par > Pbr for all r. An analogous condition holds for
items: if item q is harder for one examinee to answer, it is harder for all examinees.
Monotonicity establishes orderings of A1 and A2 separately. Without it, not even ordinal
scales could be established for examinees and items.
To obtain interval scales we need further conditions, as illustrated in Figure 2,
which depicts three “indifference curves” (literally: isoprobability curves) over
examinees and items: i.e., three equivalence classes determined by the Pij. Examinees
and items are arrayed along their respective axes according to the induced order on each
set, but no significance should be attached to the distance between a pair of examinees or
a pair of items: the axes are unscaled apart from the ordering of examinees and items.
11
However, note that there is an additional structure to the equivalence classes in Figure 2.
They exhibit a property known as equal spacing: as we move over one examinee and
down one item, we remain on the same indifference curve. Equal spacing is probably the
simplest of all conjoint additive structures, but it is not a necessary condition for conjoint
additivity.
To derive an equal-interval metric, we establish that the distance between one pair
of examinees is equal to the distance between another pair by relating both to a common
interval on the item axis. As shown in Figure 3, the interval between examinees a and b
can be said to equal the interval between examinees b and c in that it takes the same
increase on the item scale (from q to r) to offset both. Thus the item interval qr functions
as a common benchmark for defining a sequence of equal-unit intervals on the examinee
axis. The scaled value of any individual i is then obtained by counting the number of
such intervals in a sequence from some arbitrarily chosen origin to person i: ab, bc, cd ...
This scaling of the heretofore unscaled axes renders the equiprobability contours linear
(and, of course, parallel).
In order that the conclusion ab = bc = cd = … not be contradicted by other
relations in the data, it must be the case that the same conclusion follows no matter which
interval on the item axis is selected as the benchmark. A like condition must hold for
intervals on the examinee axis to serve as a metric for items. In addition to these
consistency conditions, other technical conditions must be met when equal spacing does
not obtain.
To summarize, conjoint measurement scales a sequence of intervals in one factor
by using differences of the complementary factor as a metric. By moving across
12
indifference curves, these sequences can be concatenated to measure the difference
between any two examinees (or items) with respect to the latent factor they embody.
Because the measurement of intervals reduces to counting steps in a sequence, there is an
essential additivity to this empirical structure, on the basis of which we obtain additive
representations of examinee and item traits.
Conjoint additivity uses only the Pij equivalence classes and not the values of Pij to
determine the 1 and 2 scales. The final step from conjoint additivity to IRT requires a
positive increasing transformation F from the positive reals to the interval [0,1], such that
F( 1(a) + 2(p)) = Pap. Because any increasing transformation of 1(a) + 2(p) preserves
the representation of A1xA2 by 1 and 2, it is clear that a function F satisfying this
condition can be found.
Of the three IRT models, only the one-parameter model is consistent with this
simple conjoint additive structure. The two- and three-parameter IRT models, in which
item discrimination varies, violate the monotonicity assumption. This is easily seen with
the aid of Figure 1. The ICC labeled II represents an item with discrimination parameter
= 1, while the curve labeled III has a discrimination of 2. The ICC cross where equals
the common value of the difficulty parameter. An individual whose lies to the left of
this intersection finds item II more difficult than item III; an individual whose is to the
right of the intersection ranks the items the other way around. Because different
individuals produce inconsistent rankings of items, the ranking of examinee-item pairs on
the basis of Pij does not yield orderings of examinees and items and the derivation of the
scales breaks down. Indeed, it is alleged that because only the one-parameter IRT
model produces a consistent ordering of items for all examinees (ICCs in the Rasch
13
model never cross), only the in a one-parameter model can be considered an interval-
scaled measure of ability (Wright, 1999).
This goes too far. A straightforward modification of conjoint additivity
accommodates structures with a third factor that enters multiplicatively, such as the IRT
discrimination parameter. The relevant theorem appears in Krantz et al. (1971, p. 348).
The empirical relations between examinees and items are termed a polynomial conjoint
structure. The extra conditions on examinees and items ensure that we can obtain
separate representations by discrimination classes in the manner just described. When
these conditions are met, we find that there are functions 1, 2, and 3 such that (a,p)
(b,q) 3(p)[ 1(a) + 2(p)] 3(q)[ 1(b) + 2(q)]. 1 and 2 are unique to linear
transformations and 3 is unique to a similarity transformation. Obviously this
representation has the structure of the two-parameter IRT model.13
To summarize: if the empirical relational system is one of conjoint additivity and
the function F correctly specifies the relationship between i- i and the Pij, the IRT
measure of latent ability, , and the IRT difficulty parameter, , are interval-scaled
13 This representation does not include anything that corresponds to the guessing parameter in the three-parameter IRT. While I have not seen an analysis of such a conjoint measurement structure, extending polynomial conjoint measurement to include the three-parameter IRT model would proceed similarly to the extension of conjoint additivity to cover the two-parameter IRT model. For sets of items with the same discrimination parameter, the two-parameter IRT model reduces to the one-parameter model. Proof of scale properties for the two-parameter model involves selecting one such set of items and scaling examinees with respect to it. Because the choice of items is arbitrary, the resulting scale is unique only up to the change of units that would result from the selection of an alternative set with a different value of the discrimination parameter. (For details, see Krantz et al., 1971, Chapter 7.) Incorporating a guessing parameter in the structure would involve following the same logic. Examinees would be scaled using a subset of the data (i.e., for a particular choice of discrimination and guessing parameters.) As with the polynomial conjoint structure, the fact that another choice could have been made introduces a degree of freedom into the representation, though in the case of the guessing parameter, this degree of freedom affects only the function mapping ( 3(q)[ 1(b) + 2(q)]) to Pij and not the relations among 1, 2, and 3. Theproperties of the 1 and 2 would therefore be precisely those established for the polynomial conjoint structure, inasmuch as these properties depend on relations between examinees and items and not on the function mapping equivalence classes to numerical values of Pij.
14
variables. If the empirical relational system is a polynomial conjoint structure and F is
again correct (e.g., the two-parameter logistic IRT model, or possibly the three-parameter
model, if extra conditions entailing a lower asymptote on Pij are met), and are again
interval-scaled.
Notwithstanding the fact that these propositions were proved in the 1960’s, one
continues to find a wide range of opinions about the properties of IRT scales in the
psychometric community (as we have seen). At least some of that diversity of opinion is
due to the following three misconceptions about scales. (1) Because arbitrary monotonic
transformations of and can be shown to fit the data equally well, and cannot be
interval-scaled. At best, they are ordinal variables. (2) Because is interval-scaled, no
scale of achievement related to by anything other than an affine transformation can be
an interval scale. In particular, if = g( ), where g is monotonic but not affine, the
scale is ordinal. (3) Using an IRT model (or specifically the Rasch model) to scale a test
produces an interval scale of ability. Each of these beliefs is wrong. Before quitting this
discussion of representational measurement theory, we need to understand why.
The first of these views derives from Stevens’ stress on the role of admissible
transformations in determining scale type. The problem is that Stevens’ formulation of
the matter fails to make clear just what makes a transformation “admissible.” There is a
sense that information must not be lost when a scale is transformed—but precisely what
information? All transformations rest on the implicit assumption that there is something
that we can alter freely—something, in other words, that is not “information,” at least not
information we care about. Stevens’ criterion for determining scale type remains empty
until this something is specified.
15
Misconception (1) rests on the following argument: we can replace with g( )
for arbitrary monotonic function g and still fit the data (the Pij) equally well, provided we
make an offsetting change to the function F relating and to P. An example appeared
in footnote 7 above, in Lord’s modification of the three-parameter logistic model:
P*ij = F*ij( ) = cj + (1-cj)[1 + exp(- j( -1( ( i)))- j)]-1
Because this model fits the data as well as the original three-parameter logistic model (as
it must, being mathematically equivalent), it is argued that and ( ) contain the same
information about ability. Because the function is not affine but an arbitrary
monotonic function, the conclusion is drawn that and must both be ordinal scales.
The flaw in this argument is the supposition that the only information that matters
is the order over equivalence classes of examinees and items induced by the Pij. But the
proof from conjoint additivity does not conclude that and are interval scales merely
because these mappings from examinees and items to real numbers preserves order
among the Pij. The empirical relational system between classes of examinees and items
contains more structure than the ordering of equivalence classes. It is this additional
structure, as illustrated by the example of the equally spaced conjoint structure above,
that underlies the claim that certain mappings from examinees and items are interval
scales. An arbitrary monotonic transformation of these scales no longer reflects the
relations holding among the scaled items and examinees shown in Figure 3. Such a
transformation loses the information that the distance between examinees a and b equals
the distance between b and c in the sense that both offset the same substitution of one pair
of test items for another. Preserving that information restricts the class of admissible
transformations to affine functions.
16
The preceding comment notwithstanding, it does not follow that there cannot be
another way of assigning numbers to examinees, on the basis of some other property,
producing a new scale = h( ) for a non-affine function h that can also be regarded as an
interval-scaled measure of ability. Clearly and cannot be interval scales of the same
property of examinees. (In the language of representational measurement theory, they
cannot represent the same empirical relational system.) That we might have reason to
regard both as measures of achievement is due to the vagueness and imprecision with
which the term achievement is used, not just in ordinary discourse, but in social science
research.
I illustrate with an example from economics. Consider a firm that employs thirty
workers on thirty machines. Workers are rotated among machines on a daily basis. The
only information the firm has on the productivity of either factor of production is the
daily output of each worker-machine combination. Suppose, for purposes of rewarding
employees or scheduling machines for replacement, the firm decides it needs measures of
the productivity of individual workers and machines. In other words, it wants to scale
these heretofore unscaled entities. (The parallel with testing should be obvious.) A
measurement theorist is called in who observes that the data satisfy the conditions of a
conjoint additive structure, inasmuch as workers and machines can be scaled such that the
resulting isoquants in worker/machine space are linear and parallel. The measurement
theorist confidently announces that these scaled values represent worker and machine
productivity, up to an arbitrary choice of unit of measurement and origin.
Some time later the firm calls in a production engineer, who prowls around the
shop floor with a stopwatch for a week, and then reports the following discovery: the
17
output of a worker/machine pair is a simple function of the downtime of the machine
(beyond a worker’s control) and the amount of time the worker is goofing off. That is,
output Qwm = w m, where w is the proportion of time worker w is attending to his work,
and m,is the proportion of time machine m is running properly. There is no difference
between workers otherwise: during their time on task with a functioning machine, all are
equally productive. The engineer therefore proposes w as the natural (and ratio-scaled)
measure of employee productivity.
It is then noted that the engineer’s productivity measure is not a linear
transformation of the measurement theorist’s measure, in that the two variables differ by
more than a choice of origin and unit. (Indeed, the measurement theorist’s w = ln w.)
Yet each expert swears his measure has at least interval-scale properties, and each is
right. The two measures capture different properties of the relations between workers
and machines. The engineer doesn’t care that his metric ignores the information in the
conjoint additive structure, because he relies on an alternative empirical relational system
(one that includes the position of hands on a stopwatch) to scale the entities on the shop
floor. Both experts have produced interval scales, and it is up to the firm to decide which
scale captures properties of the relations between workers and machines that most matter
to it.
This example shows why the second of the three misconceptions cited above is
false. The parallel with testing is maintained if there is some other empirical relational
system into which examinees and test items fit, affording an alternative metric.
Suggestive examples are found in the education production function literature, in the
form of back-of-the-envelope calculations of the benefits of various educational
18
interventions (e.g., the dollar value of higher achievement associated with smaller class
sizes). Whether this can be done with sufficient rigor to provide an interval scale for
measuring student achievement (and whether it is desirable to construct one along these
lines, if feasible) is a question to which I return below in Section V.
Finally, the notion that the use of an IRT model (or a particular IRT model, such
as the Rasch model) confers interval-scale status on the resulting places the cart before
the horse: it overlooks the requirement that the empirical relational system be a conjoint
additive or polynomial conjoint structure. Applying an IRT model willy-nilly to
achievement test data does not of itself confer any particular properties on the scale score
metric.14
IV. Achievement Scales in Practice
To summarize the argument of the preceding section, under stringent conditions,
an IRT measure of ability can be shown to be an interval-scaled variable. But there are
two important caveats. First, do the data meet these stringent conditions? Second, might
there be some other set of relations holding between examinees and test items that
provides an alternative, and perhaps more satisfactory, basis for constructing an
achievement measure with interval (or even ratio) scale properties? I take up these
questions in turn.
It is highly unlikely that real test data meet the exacting definition of a conjoint
structure. At best, they come close, though it is not easy to say how close. The theory
14 Though an obvious point, this is not always appreciated by researchers. Consider the following statement in a report of the Consortium on Chicago School Research. Having rescaled the Iowa Test of Basic Skills using a Rasch IRT model, the authors claim to have produced a metric with interval scale properties: “A third major advantage of the Rasch equating is that, in theory, it produces a ‘linear test score metric.’ This is an important prerequisite in studies of quantitative change. This allows us to compare directly the gains of individual students or schools that start at different places on the test score metric.” (Bryk et al., 1998).
19
places restrictions on the equivalence classes of A1 x A2 (examinees crossed with items),
but these classes are not given to us. Instead, the data consist of answers to test items,
often binary indicators of whether the answer was correct or not. From these data the
membership of the equivalence classes must be inferred before one can ascertain whether
these classes can be represented by a set of linear, parallel isoprobability curves. Because
the theory stipulates restrictions that hold for every examinee, while the amount of data
per examinee is small, there is little power to reject these hypotheses despite anomalies in
the data. Many restrictions might be accepted that would be deemed invalid if the actual
Pij were revealed.
IRT assumes that conditional on and , the Pij are independent across items and
examinees. Although correlated response probabilities attest to the existence of
additional latent traits that affect performance, unidimensional models are fit to the data
anyway. Sometimes it is clear even without statistical tests that the model does not fit the
data. Consider multiple choice exams, where the lower asymptote on the probability of a
correct response is not zero. The lower asymptote is more important for low ability than
high ability examinees. A conjoint structure for these data must take into account the
lower asymptote as another factor determining Pij or the scaled values of examinee ability
will be wrong. Indeed, if responses to a multiple-choice exam are tested to see whether
the definition of conjoint additivity is met, data that appear to meet this definition will no
longer do so once guessing is factored in unless all items are equally “guessable.”
Even when the data do fit the model, the extent to which they have been selected
for just this reason should be kept in mind. Indeed, it is not clear that the word data is the
right one in this context, as the tests are designed by test makers who decide first on a
20
scaling model and then strive to ensure that their test items meet the assumptions of that
model.15 Even if they are wholly successful in this endeavor, the “data” represent a
selected, even massaged, slice of reality and not a world of brute facts.
Thus, notwithstanding the support given by representational measurement theory to IRT
methods, IRT can fail to produce satisfactory interval scales. Verifying that the
conditions of the theory are met is difficult even for test makers, let alone statisticians and
behavioral scientists who use test scores for value added assessments.
As the worker-machine example shows, even if test data satisfy the conditions of
a conjoint structure, there might be some other scale with a claim to measure what we
mean by achievement. IRT scales have the peculiarity that the increase in ability
required to raise the probability of a correct response by any fixed amount is independent
of the difficulty of the question. That is, raising the probability of answering a very
difficult question from .1 to .9 takes the same additional knowledge as it does to raise the
probability of answering a very simple item from .1 to .9. (Observe that in this argument,
.1 and .9 can be replaced by any other numbers one likes, for example .0001 and .9999.)
That it takes the same increase in ability to master a hard task as an easy one follows
directly from conjoint additivity (specifically, the parallel equiprobability contours that
result when items and examinees are scaled to represent conjointness), but it may be
difficult for many readers to square this notion with other ideas they entertain about
achievement, based on the use of the term in other contexts: how long it takes to
15 For example, in the Rasch model items have a common discrimination parameter (their item characteristic curves do not cross). Test makers using the Rasch model plot ICC’s and discard items that violate this assumption. IRT also assumes that conditional on , probabilities of a correct response are independent over examinees. Violations of this assumption lead items to be discarded on grounds of potential bias. This is probably a good idea, but it further weakens the sense in which we are dealing here with the fit between a model and “data.”
21
accomplish these two tasks, how hard instructors must work, the extent to which a
student who has mastered the more difficult item is in a position to tackle a variety of
other tasks and problems, compared to the student who has mastered the easier item, and
so forth. In short, we may find ourselves in the position of the production engineer in the
worker-machine example: taking into account the other information we possess, we
might find the scale derived from conjoint additivity lacking, notwithstanding its
impeccable pedigree on purely formal grounds as an interval scale.
At this point it may be useful to examine some of the scales that have been
produced using IRT methods. If larger increments of ability (as measured by an
alternative metric grounded in some of the aforementioned phenomena) are in fact
required to produce the same change in Pij as questions become more difficult, IRT
scaling will compress the high end of the scale, diminishing mean gains between the
upper grades and reducing the variance in achievement. Evidence that something of this
kind occurs is found in developmental scales used to measure student growth across
grades. Students at different grade levels are typically given different exams, but by
including a sufficient number of overlapping items on forms at adjacent grade levels,
performance on one test can be linked to performance on another test, facilitating the
creation of a single scale of ability spanning multiple grade levels.
Table 1 below, reproduced from Yen (1986), displays scores for the norming
sample for the Comprehensive Test of Basic Skills (CTBS), Form Q, developed by
CTB/McGraw-Hill and scaled using IRT methods for the first time in 1981.16 In both
16 Prior to 1981 the CTBS was scaled using an older method known as Thurstone scaling. Although it has been claimed that Thurstone scaling produces scores on an interval scale, there is no basis for this claim comparable to the proofs provided for conjoint measurement structures in representational measurement theory: scale properties are not based on the structure of empirical relations holding among test items and
22
subjects, mean growth between grades drops dramatically, and almost monotonically,
between the lowest elementary grades and secondary grades. In addition, the standard
deviation of scale scores declines as the mean score rises.
CTB/McGraw-Hill has since superseded the CTBS with the Terra Nova series.
The decline in between-grade gains seen in the CTBS norming sample is still evident,
though less pronounced, in Terra Nova results for Mississippi and New Mexico, as
shown in Table 2.17 Mean gains in all subjects tend to decrease with grade level, though
there is a break in the pattern between grades 7 and 8.
Between-grade gains can be affected by the differences in the content of tests and
by linking error. It is particularly instructive, therefore, to see the patterns that emerge
when there are no test forms specific to a grade level and no linking error. Northwest
Evaluation Association uses computer-adaptive testing in which items are drawn from a
single, large item bank. Results for all examinees in mathematics in the fall of 2005 are
presented in Table 3. As in the previous tables, we again find that between-grade
differences decline with grade level. Within-grade variance in scores is stable (reading)
or increases moderately (mathematics).
Because this is at base a dispute about how to use words, we need to be careful in
discussing these phenomena. If the test data in fact exhibit a conjoint structure (let us
concede the point for now), the IRT is an interval-scaled variable. Yet this scale
commits us to the conclusion that the variance of reading ability is no greater among high
school students than second-graders. Most of us, I suspect, would respond that this scale
examinees, but on ad hoc assumptions about the distribution of ability in the population tested. 17 Data are from www.SchoolData.org maintained by the American Institutes of Research. New Mexico and Mississippi were two of a handful of states in these data that reported mean scale scores for vertically-linked tests produced by test-makers known to employ IRT scaling methods.
23
fails to capture something about the word ability (or achievement) that causes us to recoil
from such a conclusion.
Readers wondering whether their own pretheoretical notions of these terms accord
with the usage implied by IRT are invited to consider the sample of mathematics test
items displayed in Figure 4, taken from the Northwest Evaluation Association web site.
The items are drawn from a larger chart providing examples of test questions at eleven
different levels of difficulty. The two selections represent items scored at the 171-180
level and the 241-250 level, respectively.18 Consider the following question: if student
A is given the items in the first set, and student B the items in the second set, and if
initially each student is able to answer only 2 of 7 items correctly, which student will
have to learn more mathematics in order to answer all 7 items correctly? Student A has
basics of addition and subtraction to learn, as well (perhaps) as how to read simple ch
and diagrams. None of the required calculations are taxing: all could be done by
counting on one’s fingers. By contrast, student B must make up deficits in several of th
following areas: decimal notation, fractions, factoring of polynomials, solving algebraic
equations in one unknown, solid geometry, reading box plots, calculating percenta
The calculations required are more demanding. However, the correct answer, according
to IRT, is that both require the same increase in mathematics ability.
arts
e
ges.
20
19,
18 According to the data in Table 3, the average second grader tested in 2005 would have answered slightly more than half the questions in the first set correctly, while the average ninth grader would have responded correctly to slightly less than half the questions in the second set. 19 Strictly speaking, this is the correct answer only if items have equal discrimination parameters, that is, the one-parameter IRT model fits the data. In fact, NWEA does use the one-parameter model, and has gone to considerable lengths to verify that the items meet the assumptions of that model. 12 When I put this question to faculty and graduate students of my department in the School of Education at Vanderbilt, 13 of the 108 respondents chose A, 47 chose B, 15 said the amounts were equal, and 33 said the answer was indeterminate. Obviously this was not a scientifically-conducted survey, nor is it clear just what respondents meant by their answers. Conversations with some revealed that they converted the phrase “more mathematics” into something more readily quantified, such as the amount of time a student
24
One might conclude from this evidence that psychometricians should not attempt
to construct over-arching developmental scales for mathematics and reading ability that
span so many age levels. But other concerns are also raised. Between-grade gains begin
to decline at the lowest grade levels in Tables 1-3: gains between third and fourth grades
are markedly lower than gains between second and third. Between-grade comparisons
may matter little for value-added assessment, if most instructors, particularly in the
elementary grades, teach only one grade level. However, there are also implications for
within-grade assessments. If third graders are not in fact learning less than second
graders—if instead IRT methods have compressed the true scale—then the true gains of
higher-achieving students within these grades are understated.21
V. Options for Value-Added Assessment
What options are available to the practitioner who wants to conduct value-added
assessment but is unwilling to accept at face value claims that the IRT ability trait is
measured on an interval scale? Broadly speaking, there are three available courses of
action.
a. Use the scale anyway.
would need to acquire these skills. Persons who said the answer was indeterminate may have meant it was indeterminate in principle (“more mathematics” is meaningless) or simply that that answer couldn’t be determined from the information given. Nonetheless, it is striking how few gave the psychometrically correct response.
21 The phenomena of decreasing gains and diminishing or constant variance with advancing grade level has been treated in the psychometric literature under the heading scale shrinkage. While a number of explanations have been advanced, these explanations typically assume there is a true IRT model that fits the data (with ability, perhaps, multidimensional rather than unidimensional), and that various problems (e.g. a failure to specify the true model, the small number of items on the test, changing test reliability within or across grades, ceiling and floor effects) prevent practitioners from recovering the true values of .Psychometricians have disagreed about the extent to which these explanations account for the phenomena in question. (For notable contributions to this literature, see Yen, 1985; Camilli, 1988; Camilli, Yamamoto, and Wang, 1993; Yen and Burket, 1997.) By contrast, the point I am making here is that even in the absence of these problems, IRT scales are likely to exhibit compression at the high end when compared with pretheoretical notions of achievement grounded in a wider set of empirical relations.
25
b. Choose another measure of achievement with an interval-scaled metric.
c. Adopt analytical methods suited to ordinal data.
In this section I argue that neither a. nor b. is an attractive option. I then go on to
demonstrate the feasibility of c.
a. Using the scale.
Even if the IRT ability scale possesses at best ordinal significance, one might
continue to use it if reasonable transformations of all yield essentially the same
estimates of value added. (Compare the claim that statistics calculated from ordinal
variables are generally robust to all but the most grotesque transformations of the original
scale.22) The practice of many social scientists who are aware that achievement scales
may not be of the interval type but proceed with value-added assessment anyway
suggests that this view may be widespread.
Unfortunately, to test whether reasonable transformations of yield essentially
the same measures of value added requires some sense of what is reasonable. Absent
that, there may be a tendency for researchers conducting sensitivity analyses to decide
that the alternatives that are reasonable are those that leave their original estimates largely
intact. The compression of scales displayed in Tables 1-3 suggests one possibility:
assume an achievement scale in which average gains are equal across grades. I have
examined the consequences of adopting this alternative scaling for data from a sample of
districts in a Southern state.23 Data are from mathematics tests administered in grades 2-
8 during the 2005-06 school year. The tests were scaled using the one-parameter IRT
22 Cliff (1996) attributes this remark to Abelson and Tukey (1959), but it does not appear in that paper.23 Anonymity has been promised to both the state and test maker. Data are available for only a portion of the state. In addition, because between-grade growth is central to the analysis, only those districts that tested at least 90 percent of students in each grade are included. The final sample comprised 98,760
26
model. Results were similar to those we have seen in Tables 2-4, with near-monotonic
declines in between-grade gains. A transformation = g( ) was sought that would
equalize between-grade gains for the median student.24 It turned out that this could be
closely approximated by a quadratic function increasing over the range of observed
scores. Figure 5 depicts the relationship between the transformed and original scaled
values for the median examinee. While one might object to the scale on the grounds
that median student gains are not equal across grades, tapering off as students approach
adolescence, Figure 5 shows that the transformation from to is not driven by the
upper grades—essentially the same curve would be found if data points for grades 7 and
8 were dropped. Moreover, g( ) exhibits only a modest departure from linearity. It
seems unlikely that many researchers would regard it as a grotesque transformation of the
data.
Nonetheless, the consequences of this transformation for the distribution of
growth are pronounced (Table 4). At each grade level, I have calculated the change in
(alternatively, ) required to remain at the 10th, 25th, 50th, 75th, and 90th percentiles of the
achievement distribution when advancing to the next grade. The absolute changes in
and that meet this criterion are affected by the magnitude of the median student growth
(and are therefore scale dependent, i.e., they depend on choice of units). However,
relative changes are invariant to the choice of units. Accordingly, column 1, top panel,
presents the ratio of 25 to 10 (the change required to remain at the 25th percentile over
students in grades 2 through 8 during the 2005-06 school year. 24 To accomplish this, the original median scores were replaced with a new series in which the between-grade gain was set to the overall sample median gain across all grades. This left unchanged the second grade values but altered the values in subsequent grades, as shown in Figure 5. A quadratic function of was then fit to the new series. The fit, as shown, was exceedingly close, though on the resulting scale, median gains can vary by ±.5 points.
27
the change required to stay at the 10th percentile). Comparable ratios for the 50th, 75th,
and 95th percentiles appear in the other columns. The lower panel contains the same
ratios for the scale.
In the original scale, differences in growth at various points of the distribution are
not pronounced. The ratios are greatest in the fifth and sixth grades, but even here, there
is not much difference between a student at the median and one at the 90th percentile. By
contrast, the ratios are much greater using the transformed scale and increase
monotonically as we move to the right, from 25/ 10 to 90/ 10. While the
direction of the change is what we would expect on the basis of the preceding
discussion—less compression at the high end of the -scale compared to the -scale—the
magnitude of the difference is surprising. The impact on value-added assessment
depends on how students are distributed over schools and teachers. Clearly, changes of
the magnitude shown in Table 4 can make a great difference to teacher value added when
that distribution is not uniform.
b. Changing the scale.
Clearly nothing is gained by substituting percentiles, normal curve equivalents, or
standardized scores for . While all are used in the research literature, no one claims that
they are interval-scaled. However, two approaches merit discussion, not because they
work any better, but because the view seems to be gaining currency that they do, or
could.
The first, which I will refer to as binning, assigns every examinee to a group
defined by prior achievement (e.g., deciles of the distribution of prior scores). Gain
scores are normalized by mean gains within bins, either through dividing by the mean
28
gain or standardizing with respect to the mean and standard deviation within each bin.
The normalized gain scores are then used as raw data for further analysis, which could
include value-added assessment of schools and teachers. Examples of this approach are
Springer (2008) and Hanushek et al. ( 2005). In the latter study, binning is explicitly
motivated by concern that the metric in which test results are reported does not represent
true gains uniformly at all points of the achievement distribution.25
Binning does not solve the problem of scale dependence. Binning merely
declares that a normalized gain in one bin is to count the same as one that takes the same
normalized value in another bin. The declaration does not ensure that these two gains are
equal when measured on the true scale (if such a thing exists). Rather than solving the
problem, binning is simply the substitution of a particular transformation for the scale.
The notion that binning represents a solution may derive from the fact that
normalizing within bins removes much of the effect of any prior transformation of scale
(for example, the substitution of the scale for the scale). Differences between bins
have no effect on value-added measures. Only differences within bins matter (though
these are still affected by the transformation whenever the slope of g( ) = at a bin mean
differs from 1). Hence, the results of the binned analysis are less sensitive to prior
transformations of the original scale. This may have led some to believe that scale no
longer matters as much. This is not so, given that the normalized-within-bin scores are
themselves just another transformation of . Moreover, if invariance of this kind were
25 A practice similar to binning has been used in the Educational Value-Added Assessment System (EVAAS) of the SAS Institute, wherein gain scores are divided by the gain required to keep an examinee at the same percentile of the post-test distribution that he occupied in the pre-test distribution (Ballou, 2005). Thus, for exams like the Iowa Test of Basic Skills that exhibit increasing variance at higher grade levels, the transformation pulls up gains of examinees whose pretest scores were below the mean and reduces gains of examinees whose pretest scores were above the mean. There is no reason why this should be regarded as superior to using the original scale.
29
the sole desideratum, percentiles could be used in place of . Percentiles are invariant to
any increasing transformation of , but that does not make percentiles an interval scale of
achievement.
Suppose we decide that the problem of finding interval scales for academic
achievement is intractable. We still might be able to conduct value-added assessment if
achievement—by whatever metric—could be mapped into other variables whose
measurement poses no such difficulties. This mapping could go backward to inputs or
forward to outcomes. In the former, academic achievement would be related to
measurable inputs required to produce that achievement. Thus, instead of worrying
whether one student’s 5 point gain was really the equal of another student’s 5 point gain,
we would concern ourselves with measuring the educational inputs required to produce
either of these gains. If those inputs (e.g., teacher time) turn out to be equal, then for all
purposes that matter, the two gains are equivalent. If those inputs turn out not to be
equal, the teacher or school that has produced the gain requiring the greater inputs has
contributed more and should be so recognized by value-added analysis.
Forward mapping treats test scores as an intermediate output. Value-added
assessment would proceed by tying scores to long-term consequences.
“An important methodological issue…is the problem of choosing the correct metric with which to measure academic growth. Because the metric issue is so perplexing, almost all researchers simply use the particular test at their disposal, without questioning how the test’s metric affects the results…The only solution I see to the problem of determining whether gains from different points on a scale are equivalent is to associate a particular test with an outcome we want to predict (say, educational attainment or earnings), estimate the functional form of this relationship, and then use this functional form to assess the magnitude of gains. For example, if test scores are linearly related to years of schooling, then gains of 50 points can be considered equal, regardless of the starting point. If the log of scores is linearly related to years of schooling, however, then a gain of 50 points from a lower initial score is worth more than a gain of 50 points from a higher initial score. This ‘solution’ is, of course, very unsatisfactory, because the
30
functional form of the relationship between test scores and outcomes undoubtedly varies across outcomes.” (Phillips, 2000, p. 127)
As the final sentence of this passage suggests, we are very far from being able to
carry out either of these programs. Only some educational inputs are easily quantified.
For such inputs as the clarity of a teacher’s explanations or the capacity to inspire
students, the challenges to quantification are at least as great as for academic
achievement. Indeed, the low predictive power of those inputs that are easily quantified
is largely responsible for the current interest in value-added assessment.
The practical difficulty mentioned in the last sentence of the quoted passage is not
the only problem facing the forward mapping of test scores to long-term educational
outcomes. Given the variation in test results in Tables 1-3, it would almost certainly be
the case that the functional form of the relationship between test scores and outcomes
would vary across tests as well as outcomes. The introduction of each new test would
require additional analysis to determine how scores on its metric were related to long-
term objectives like educational attainment and earnings. In many cases, the data for
such analysis would not be available for years to come, if ever. In the interim we would
have to make do with very imperfect efforts to equate the new tests with tests already in
use (for which we would hope this mapping had already been done).
These are the technical issues. There is in addition the difficult normative
question of how to value various outcomes for different students in order to assign a
unique social value to each i. It is not obvious how we would come by these weights.
Even if future earnings were the only outcome that mattered, we would require the
relative values of a marginal dollar of future income for all examinees (whose future
incomes are, of course, unknown at the time the assessment is done). Attaching a price to
31
the non-pecuniary benefits of an education is still more difficult. Education is a
transformative enterprise. The ex ante value placed on acquiring an appreciation of great
literature is doubtless very different from the ex post value. In these circumstances there
is likely to be a great discrepancy between the compensating and equivalent variations
associated with a given educational investment. Accordingly, the weights in our index
would have to reflect the values of “society” rather than the still-unformed persons to be
educated.
It is not clear that we should subject decisions about education to this kind of
utilitarian calculus, as it fails to respect the autonomy of individuals. There is a strong
tradition in our polity of regarding educational opportunity as a right. Individuals have a
claim on educational resources not because distributing resources in this manner
maximizes a social welfare function, but because they are entitled to the chance to realize
their potential as individuals. If we take this seriously, then the notion that teachers and
schools are to be evaluated by converting test scores into the outputs that matter and
weighting these outputs according to their social value is wrong-headed. The point of
education is to provide students with skills and knowledge that as autonomous persons
they can make of what they will. Teachers should therefore be judged on how
successfully they equip their students with these tools—regardless of anyone’s views of
the merits of the final purposes, within wide limits, for which students use them.
c. Analyzing test scores as ordinal data
The final option for researchers uncertain of the metric properties of ability scales
is to treat such scales as ordinal, thus foregoing any analysis based on the distance
between two scores. On the assumption that scales contain valid ordinal information
32
about examinees, statistics based on the direction of this distance remain meaningful.
There are a variety of closely-related statistics of this kind (known generally as measures
of concordance/discordance) employed in the analysis of ordinal data. In this discussion
I will not attempt to identify a particular approach as best. Rather, my objective is to
demonstrate the feasibility of such methods for value-added assessment.
Suppose we want to compare the achievement of teacher A’s class to the
achievement of other students at the same grade level in the school system. If A has n
students, and teachers elsewhere in the system have m students, there are nm possible
pairwise comparisons of achievement. As only ordinal statements are meaningful, each
pair is examined to determine which student has the higher score. If it is A’s student, we
count this as one in A’s favor (+1); if it is the student from elsewhere in the system, we
count this as one against A (-1). Ties count as zeros. The sum of these counts, divided
by the number of pairs, is known as Somers’ d statistic. Somers’ d can be considered an
estimate of the difference in two probabilities: the probability that a randomly selected
student from A’s class outperforms a randomly selected student from elsewhere in the
system, less the probability that A’s student scores below the outsider.
This procedure suffers from the obvious defect that no adjustment has been made
for other influences on achievement. In conventional value-added analyses, this might be
accomplished through the introduction of prior test scores as covariates in a regression
model or by conditioning on prior scores in some other manner. An analogous procedure
in the ordinal framework would be to divide students into groups on the basis of one or
more prior test scores, compute separate values of Somers’ d by group, and aggregate the
resulting statistics using the share of students in each group as weights. In principle there
33
is no reason to restrict the information used to define groups to test scores. Any student
characteristic could be used to define a group. Only data limitations prevent the
construction of ever-finer groups.
Results from an application of this approach are presented Table 5. Because the
data used in the previous example do not contain teacher identifiers, for this application I
have used a different data set provided by a single, large district that contains student-
teacher links. Two sets of value-added estimates are presented for teachers of fifth grade
mathematics. The first is the weighted Somers’ d statistic, based on pairwise
comparisons of each teacher’s students to other fifth graders in the district. To control for
prior achievement, students were grouped by decile of the fourth grade mathematics
score. Students without fourth grade scores were dropped from the analysis. The second
value-added measure is obtained from a regression model in which fifth grade scores
were regressed on fourth grade scores and a dummy variable for the teacher in question.
Separate regressions were run for each teacher so that each teacher was compared to a
hypothetical counterpart representing the average of the rest of the district, preserving the
parallel with the Somers’ d statistic. The coefficients on the dummy variables represent
teachers’ value added. Statistical significance was assessed using the conventional t
statistics in the case of the regression analysis and jackknifed standard errors in the case
of the ordinal analysis.
How much difference does it make to teacher to be evaluated by the one method
rather than the other? The hypothesis that teachers are ranked the same by both methods
is rejected by the Wilcoxon signed rank test (p = .0078). The proportion of statistically
significant estimates is higher using the ordinal measure (which is less sensitive to
34
noisiness in test scores): 86 of the 237 teachers are significant by this measure, compared
to 64 by the other. The maximum discrepancy in ranks is 229 positions (out of 237
teachers in all). In ten percent of cases, the discrepancy in ranks is 45 positions or more.
Depending on the uses to which value-added assessment is put, the question of
greatest importance to teachers may be whether they fall at one end of the distribution or
the other. There are 59 teachers in each quartile of the distribution. The two measures
agree in 47 cases on the teachers in the top quartile, and in 48 cases on teachers in the
bottom quartile. Thus, if falling in the top quartile qualifies a teacher for a reward, 24
teachers (more than a third of the number of awardees) will qualify or not depending on
which measure is used. A comparable figure applies to teachers placing in the bottom
quartile, if that event is used to determine sanctions. In a small number of cases (3), the
effect of using one measure rather than the other is great enough to move a teacher from
the top quartile to the bottom quartile.
In principle it is possible to condition on multiple variables (additional prior test
scores, student demographic characteristics or SES) by defining groups as functions of
several covariates. In practice this is apt to exceed the capacity of the data. Consider, for
example, a data set containing prior test scores in two subjects, plus indicators of race and
participation in the free and reduced-price lunch program. If groups are defined by
deciles of the two test scores plus two binary indicators, the total number of groups is 400
(10x10x2x2). In a district of moderate size, there could be many cells with only one
observation and therefore no matching pair.
A two-stage method circumvents this difficulty. In the first stage, multiple
covariates are used to predict Prob(Yi>Yj) – Prob(Yj<Yi) for all pairs of students i and j,
35
where Y is the dependent variable of ultimate interest (e.g., end-of-year test scores in the
current year). The prediction is of the form iˆ = wk(Xki-Xkj) or wk sgn(Xki-Xkj),
depending on whether the Xk are themselves interval-scaled or ordinal. The wk are
weights that reflect how informative the different covariates are about sgn(Yi-Yj). (For
details, see Cliff, 1996.)
In the second stage, the n students of one teacher are compared to the m students
elsewhere in the same system, using iˆ (or grouped values of iˆ ) as the covariate. iˆ is
therefore analogous to a propensity score: it is a summary measure of the effects of the
stage-one covariates on the probability that a student outranks other students, before
controlling for teachers. The resulting measure of a teacher’s value added is based on
her students’ performance relative to this prediction (in the ordinal sense, of course).
Given that many achievement tests are now administered to thousands of students
throughout a state, it is worth noting that all the data can be used in the first stage to form
an ordinal prediction based on prior achievement, demographics, and SES, while
continuing to rely on within-district comparisons in the second stage for the final
measures of value added.26
VI. Conclusion
Are IRT ability traits measured on an interval scale? It seems hazardous to
assume so. Whether examinees and test items constitute a conjoint structure depends on
the make-up of equivalence classes defined by the Pij, but those are not given. Statistical
testing can reveal whether the data are strongly inconsistent with this hypothesis, but
moderate departures from the conjoint structure almost certainly go undetected.
26 Within district comparisons are preferred, given the impact of curriculum and other district policies on
36
Moreover, even if these conditions hold in the norming samples used by test-makers to
calibrate item difficulties, this provides no assurance they will hold in the population of
students to whom the test is finally administered.
End users of the data, such as practitioners of value-added assessment, typically
have no access to item-level data to test these assumptions themselves. Moreover, even if
the assumptions are met, conjoint additivity may not capture everything we want in a
scale of achievement. It would seem wise, then, to check the plausibility of the resulting
scales. On this count IRT ability scales often do poorly. Gain scores frequently fall from
one grade to the next. While some of this may reflect adolescents’ declining interest in
academic achievement, the patterns set in as early as third grade and the drop between
second and third grade is often the largest. In addition, IRT ability often shows a
diminishing or constant variance from lower to higher grades.
What, then, is the practitioner of value-added assessment to do? It is no good
hoping that the choice of scale makes little difference to estimates of value added. We
have seen that reasonable transformations of the scale can have a substantial effect on
relative gains across the distribution of achievement. No other scales with superior
metric properties are at hand. We can, however, use methods of ordinal data analysis on
the assumption that IRT scales (or any of their monotonic transformations) at least permit
us to rank students.
Ordinal analysis changes the question we ask in value-added assessment. Instead
of measuring mean achievement of a teacher’s students vis-à-vis the students of a
(hypothetical) average instructor, we ask what fraction of the former outperform the
latter. In ordinal analysis, as in regression-based methods, it is possible to control for
achievement.
37
other influences in order to isolate the teacher or school’s contribution. Clearly, if is an
interval-scaled variable, ordinal methods throw away valuable information. Practitioners
should ask themselves, however, whether they are so confident of the metric properties of
scales that they are willing to attribute differences between conventional estimates of
value added and estimates based on ordinal analysis to the superiority of the former.
Ordinal methods have other advantages. They are likely to be more robust to
measurement error in test scores and to various model misspecifications (though the
question of robustness is a complicated one; see Cliff, 1996). The question they answer
may be a more sensible way to evaluate educators, given that it attaches more value to
spreading gains over a wider number of students, compared to larger but more
concentrated gains. However, this paper has considered ordinal methods from one
standpoint only—that of finding appropriate value-added models when test scores are not
expressed on an interval scale. Numerous questions have been raised about the
assumptions and methods of more conventional regression-based analyses. Some of
these concerns can be addressed through modifications of those models. It remains to be
seen whether the same concerns arise with respect to ordinal methods and, if so, how
readily they can be accommodated within the ordinal framework.
38
References
Abelson, Robert, and John Tukey. 1959. Efficient conversion of nonmetric information into metric information. Proceedings of the Social Statistics Section, American Statistical Association, 226-230.
Ballou, Dale. 2005. “Value-Added Assessment: Lessons from Tennessee.” Lissitz, Robert, ed., Value Added Models in Education: Theory and Applications. Maple Grove, MN: JAI Press.
Bryk, Anthony et al. 1998. Academic Productivity of Chicago Public Elementary Schools. Chicago: Consortium on Chicago School Research.
Camilli, Gregory. 1988. “Scale Shrinkage and the Estimation of Latent Distribution Parameters.” Journal of Educational Statistics, 13(3), 227-241.
Camilli, Gregory, Kentaro Yamamoto and Ming-wei Wang. 1993. “Scale Shrinkage in Vertical Equating.” Applied Psychological Measurement, 17(4), 379-388.
Cliff, Norman. 1996. Ordinal Methods for Behavioral Data Analysis. Mahwah, NJ:Lawrence Erlbaum.
Hambleton, Ronald and H. Swaminathan. 1985. Item Response Theory: Principles and Applications. Boston: Kluwer-Nijhoff.
Hambleton, Ronald, H. Swaminathan and Jane Rogers. 1991. Fundamentals of Item Response Theory. Newbury Park: Sage.
Hand, D.H. 1996. “Statistics and the Theory of Measurement.” Journal of the Royal Statistical Society. Series A (Statistics in Society). 159(3), 445-492.
Hanushek, Eric A. et al. 2005. Charter School Quality and Parental Decision-Making with School Choice. Cambridge, MA: NBER Working Paper 11252.
Krantz, David H., et al. 1971. Foundations of Measurement. Vol. 1, Additive and Polynomial Representations. NY: Academic Press.
Lord, Frederic. 1975. “The ‘Ability’ Scale in Item Characteristic Curve Theory.”Psychometrika, 40(2), 205-217.
Mislevy, Robert. 1987. “Recent Developments in Item Response Theory with Implications for Teacher Certification.” Review of Research in Education, 14.Washington DC: American Educational Research Association.
Northwest Evaluation Association. 2008. Math_Single.pdf. Retrieved February 20, 2008, from www.nwea.org/assessments/ritcharts.asp.
39
Phillips, Meredith. 2000. “Understanding Ethnic Differences in Academic Achievement: Empirical Lessons.” Grissmer and Ross, eds., Analytic Issues in the Assessment of Student Achievement. Washington DC: US Department of Education, National Center for Education Statistics.
Springer, Matthew G. 2008. Accountability Incentives: Do Schools Practice Educational Triage? Education Next. (Winter), 75-79.
Stevens, S.S. 1946. “On the Theory of Scales of Measurement.” Science, 103, 677-680.
Wright, Benjamin D. 1999. “Fundamental Measurement for Psychology.” Embretson and Hershberger, eds., The New Rules of Measurement. Mahwah, NJ: Lawrence Erlbaum.
Yen, Wendy. 1985. “Increasing Item Complexity: A Possible Cause of Scale Shrinkage for Unidimensional Item Response Theory.” Psychometrika, 50(4), 399-410.
Yen, Wendy. 1986. “The Choice of Scale for Educational Measurement: An IRT Perspective.” Journal of Educational Measurement, 23(4), 299-325.
Yen, Wendy and George R. Burket. 1997. “Comparison of Item Response Theory and Thurstone Methods of Vertical Scaling.” Journal of Educational Measurement, 34(4), 293-313.
Zwick, Rebecca. 1992. “Statistical and Psychometric Issues in the Measurement of Educational Achievement Trends: Examples from the National Assessment of Educational Progress.” Journal of Educational Statistics, 17(2), 205-218.
40
Table 1: Scale Scores for Comprehensive Test of Basic Skills, 1981 Norming Sample
Reading/VocabularyMathematics Computation
Grades MeanScore
Std.Dev.
MeanBetween-Grade Change Mean Score Std. Dev.
MeanBetween-GradeChange
1 488 85 ---- 390 158 --- 2 579 78 91 576 77 1863 622 65 43 643 44 674 652 60 30 676 35 335 678 59 26 699 24 236 697 59 19 713 20 147 711 57 14 721 23 68 724 54 13 728 23 79 741 52 17 736 17 8
10 758 52 17 739 16 311 768 53 10 741 18 212 773 55 5 741 20 0
Source: Yen (1986).
41
Table 2: Mean Between-Grade Differences, Terra Nova
Subject_Year 2nd to 3rd 3rd to 4th 4th to 5th 5th to 6th 6th to 7th 7th to 8th 8th to 9th
Mississippi, 2001 Language Arts 30.2 20.4 17.8 6.9 9.9 10.9 ---- Reading 23.8 21.5 15.5 10.5 9.2 12.4 ---- Math 47.3 24.7 19.0 21.7 10.7 17.0 ----
New Mexico, 2003 Language_Arts_2003 ---- 14.8 12.6 5.1 3.7 4.5 8.3Reading_2003 ---- 14.4 13.8 5.0 4.7 11.4 4.9Math_2003 ---- 21.1 14.6 19.0 5.5 16.5 5.4Science_2003 ---- 20.2 13.9 9.0 11.6 11.9 7.1History/SS_2003 ---- 13.3 6.2 11.2 11.5 3.6 3.6
Source: Author’s calculations from school-level data posted on www.SchoolData.org, maintained by American Institute of Research. Data for each school are weighted by enrollment and aggregated to the state level.
42
Table 3: Scale Scores, Northwest Evaluation Association, Fall, 2005
Reading Mathematics Grade Mean Std Change Mean Std Change
2 175.57 16.22 ---- 179.02 11.81 ---- 3 190.31 15.56 14.74 192.96 12.06 13.954 199.79 14.95 9.48 203.81 12.80 10.855 206.65 14.60 6.86 212.35 13.92 8.536 211.49 14.76 4.84 218.79 15.00 6.447 215.44 14.82 3.96 224.59 15.99 5.808 219.01 14.76 3.56 229.38 16.79 4.799 220.93 15.28 1.92 231.76 17.42 2.38
Source: Author’s calculations from data provided by NWEA.
43
Table 4: Effect of Scale Transformations on Between-Grade Growth
Relative to students at the 10th percentile, growth by students at the:
25th percentile Median 75th percentile 90th percentile
Original Scale2nd to 3rd 0.97 1.06 1.03 0.953rd to 4th 1.10 1.03 1.16 1.344th to 5th 0.96 1.15 1.35 1.235th to 6th 1.61 2.12 2.17 2.246th to 7th 1.21 1.39 1.43 1.627th to 8th 1.16 1.17 1.16 1.13
Transformed Scale 2nd to 3rd 2.63 5.06 6.83 7.903rd to 4th 1.69 2.10 2.93 3.954th to 5th 1.30 1.94 2.74 2.875th to 6th 2.12 3.48 4.25 4.916th to 7th 1.60 2.29 2.79 3.507th to 8th 1.51 1.89 2.17 2.35
Data: Mathematics test results furnished to author from a Southern state, 2005-06.
44
Table 5: Comparison of Conventional and Ordinal Measures of Teacher Value Added
Total number of teachers 237Wilcoxon signed ranks test, p-value 0.0078Maximum discrepancy in ranks 229Absolute discrepancy in ranks, 90th percentile 45Number of statistically significant teacher effects, fixed effect estimate 64Number of statistically significant teacher effects,ordinal estimate 86Number of significant effects by both estimates 55Number ranked in the top quartile 59
Number ranked in the top quartile by both estimates 47
Number ranked in the bottom quartile by both estimates 48Number ranked above the median by one estimate, below the median by the other 14
Number ranked in the bottom quartile by one, top quartile by the other 3
Data: 5th grade mathematics teachers, large Southern district, 2005-06, author’s calculations.
45
Figure 1: Item Characteristic Curves
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Ability
Prob
abili
ty o
f cor
rect
resp
onse
I
II
III
46
Figure 2: A Conjoint Additive Structure, Prior to Scaling
Examinees
Test
Item
s
47
48
Figure 4: Math Items, Two Hypothetical Students
49
Source: Northwest Evaluation Association RIT Charts, Sample Mathematics Items, 2008.
50
matthew G. springerDirectorNational Center on Performance Incentives
Assistant Professor of Public Policyand Education
Vanderbilt University’s Peabody College
Dale BallouAssociate Professor of Public Policy
and EducationVanderbilt University’s Peabody College
leonard BradleyLecturer in EducationVanderbilt University’s Peabody College
Timothy C. CaboniAssociate Dean for Professional Education
and External RelationsAssociate Professor of the Practice in
Public Policy and Higher EducationVanderbilt University’s Peabody College
mark ehlertResearch Assistant ProfessorUniversity of Missouri – Columbia
Bonnie Ghosh-DastidarStatisticianThe RAND Corporation
Timothy J. GronbergProfessor of EconomicsTexas A&M University
James W. GuthrieSenior FellowGeorge W. Bush Institute
ProfessorSouthern Methodist University
laura hamiltonSenior Behavioral ScientistRAND Corporation
Janet s. hansenVice President and Director of
Education StudiesCommittee for Economic Development
Chris hullemanAssistant ProfessorJames Madison University
Brian a. JacobWalter H. Annenberg Professor of
Education PolicyGerald R. Ford School of Public Policy
University of Michigan
Dennis W. JansenProfessor of EconomicsTexas A&M University
Cory KoedelAssistant Professor of EconomicsUniversity of Missouri-Columbia
vi-Nhuan leBehavioral ScientistRAND Corporation
Jessica l. lewisResearch AssociateNational Center on Performance Incentives
J.r. lockwoodSenior StatisticianRAND Corporation
Daniel f. mcCaffreySenior StatisticianPNC Chair in Policy AnalysisRAND Corporation
Patrick J. mcewanAssociate Professor of EconomicsWhitehead Associate Professor
of Critical ThoughtWellesley College
shawn NiProfessor of Economics and Adjunct
Professor of StatisticsUniversity of Missouri-Columbia
michael J. PodgurskyProfessor of EconomicsUniversity of Missouri-Columbia
Brian m. stecherSenior Social ScientistRAND Corporation
lori l. TaylorAssociate ProfessorTexas A&M University
Faculty and Research Affiliates
E X A M I N I N G P E R F O R M A N C E I N C E N T I V E SI N E D U C AT I O N
National Center on Performance incentivesvanderbilt University Peabody College
Peabody #43230 appleton PlaceNashville, TN 37203
(615) 322-5538www.performanceincentives.org