Running head: Assessing equating transformations

Statistical assessment of estimated transformations in observed-score equating

Abstract

Equating methods make use of an appropriate transformation function to map the scores of

one test form into the scale of another so that scores are comparable and can be used

interchangeably. The equating literature shows that the ways of judging the success of an

equating (i.e., the score transformation) might differ depending on the adopted framework.

Rather than targeting different parts of the equating process and aiming to evaluate the

process from different aspects, this paper views the equating transformation as a standard

statistical estimator and discusses how this estimator should be assessed in an equating

framework. For the kernel equating framework, a numerical illustration shows the potentials

of viewing the equating transformation as a statistical estimator as opposed to assessing it

using equating-specific criteria. A discussion on how this approach can be used to compare

other equating estimators from different frameworks is also included.

Keywords: equating transformation estimates, statistical evaluation, equating-specific

evaluation criteria, simulation

Acknowledgement

The research in this article by Marie Wiberg was funded by the Swedish Research Council

grant 2014-578. Jorge González was partially funded by the FONDECYT grant 1150233.

This paper has been published. Please cite it as: Wiberg, M., & González, J. (2016). Statistical assessment of estimated transformations in observed-score equating. Journal of Educational Measurement, 53(1), 106-125.

Statistical assessment of estimated transformations in observed-score equating

Equating methods are statistical tools used to ensure that scores from different test forms

are comparable and can be used interchangeably. The comparability is obtained using an

appropriate transformation function that maps the scores of one test form into the scale of the

other. To set up the problem, we let X and Y be random variables denoting the scores from test

forms X and Y with cumulative distribution functions (cdfs) $F_X(x)$ and $F_Y(y)$, respectively. We

assume that scores on X are to be transformed to the Y scale. A general transformation

function for the comparison of any two samples or distribution of random variables can be

defined as

$$\phi_Y(x) = F_Y^{-1}(F_X(x)) \qquad (1)$$

(Wilk & Gnanadesikan, 1968). In the equating literature, this function is known as the

equipercentile transformation (e.g., Braun & Holland, 1982), and it has been shown that all

equating transformations are particular cases of it. Different equating frameworks lead to

parametric, semiparametric, and nonparametric estimators of ϕ (González & von Davier,

2013). Examples of such frameworks include traditional equating methods (e.g. Kolen &

Brennan, 2014), observed-score kernel equating methods (von Davier, Holland and Thayer,

2004; von Davier, 2013), item response theory methods (Lord, 1980; Kolen & Brennan,

2014), local equating methods (van der Linden, 2011), or combinations of these as in Wiberg,

van der Linden, and von Davier (2014). Because of the large number of possible equating

transformations, we need statistical tools to assess which equating estimators should be used

in different situations.
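As a minimal sketch of Equation 1, the transformation can be evaluated directly whenever both cdfs are available in closed form. The two normal score distributions below are hypothetical and chosen only for illustration; real test scores are discrete, which is why the frameworks listed above differ in how they estimate the two cdfs.

```python
from statistics import NormalDist


def equipercentile(x, dist_x, dist_y):
    """Map a score x on form X to the Y scale: phi_Y(x) = F_Y^{-1}(F_X(x))."""
    return dist_y.inv_cdf(dist_x.cdf(x))


# Hypothetical continuous score distributions (assumptions, not from the paper).
form_x = NormalDist(mu=10.0, sigma=3.0)
form_y = NormalDist(mu=12.0, sigma=4.0)

# A score one standard deviation above the mean on X maps to the
# corresponding point of the Y distribution.
equated = equipercentile(13.0, form_x, form_y)
```

For two normal distributions the equipercentile function reduces to a linear function of x, which makes the result easy to check by hand.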

Although it might appear evident from this setup that the main object of inference is the

equating transformation ϕ, equating methods within their respective frameworks have

typically been evaluated using what we call equating-specific evaluation measures. For

example, in kernel equating, one measure that has been used is the percent relative error

(PRE), which essentially compares the moments in the observed and the equated score

distributions (see, e.g., von Davier et al., 2004; Jiang, von Davier, & Chen, 2012). In contrast,
studies using traditional equating methods have reported the so-called “difference that
matters” (DTM), which was originally defined as any difference between equated scores and
scale scores that is larger than half of a reported score unit (Dorans & Feigenbaum, 1994).
Equating-specific evaluation measures target different parts of the equating process (e.g.,
moments of a distribution) and aim to evaluate it based on different aspects.

Another usual practice for the evaluation of equating transformations is the use of

summary indices (e.g., Han, Kolen, & Pohlmann, 1997). These indices make use of a

particular equating to be the standard against which other equatings are compared. They

measure discrepancies between equivalent scores for two different equating methods. We call

these measures equating evaluation criteria. For a summary of traditional equating evaluation

criteria and how they have been implemented, see Harris and Crouse (1993) and Kolen and

Brennan (2014). Harris and Crouse (1993) pointed out that there is no single criterion that is

preferable, but rather several can be used. Although their review is now quite old, their

conclusions are still true today as noted in Kolen and Brennan (2014, p. 325).

The viewpoint taken in this paper is that the equating transformation is the parameter of

interest for making statistical inferences and that the estimations of the transformation need to

be statistically assessed using measures such as bias, standard errors (SE), mean square error

(MSE), and root mean square error (RMSE) as would be standard for any evaluation of an

estimator of an unknown parameter. Although the use of these measures in equating has been

criticized due to the fact that a true value of the parameter is needed for the comparison

(Skaggs & Lissitz, 1986), in this paper it is explicitly shown how it is possible to define a true

equating transformation against which the estimated equating transformation can be

evaluated. Another important viewpoint proposed in this paper is that if we want to compare

different equating transformations we might need to examine several scenarios as the

definition of what constitutes a true equating function may differ depending on the choice of

the equating method.

The aim of this paper is to propose how to perform a statistical evaluation of equating

transformation parameter estimates within an observed-score equating framework. Thus, this

paper is an important step towards viewing the equating transformation as a statistical

estimator, which enriches the statistical theory viewpoint of equating (in line with von Davier,
2011a, 2011b; van der Linden, 2011; González & von Davier, 2013). Although we will give

general directions on how to handle a statistical assessment of equating transformations in

various observed-score equating frameworks, the emphasis will be on kernel equating. This

choice is mainly based on the fact that in this case a "true" value of the equating

transformation (i.e., the parameter of interest) can easily be defined. Contributions in this

paper include a new evaluation measure, a novel way to use PREs when assessing equating

transformations, and the observation that several competing scenarios should be examined if

we want to make a fair comparison of equating transformations.

The structure of the paper is as follows. We first briefly summarize the kernel equating

framework, including a description of equating-specific evaluation measures within this

approach. In the next sections, the general strategy for the statistical evaluation of equating

transformation parameter estimates is presented and discussed when used within three

different equating frameworks, including kernel equating. Measures of statistical evaluation

criteria to evaluate test scores are then described, and this is followed by a numerical

illustration. The final section contains some concluding remarks.

The kernel equating framework

Let $x_j$ and $y_k$ denote the possible score values that the random variables X and Y can
take, with $j = 1, \ldots, J$ and $k = 1, \ldots, K$, respectively. For test takers randomly selected from the
target population T scoring $X = x_j$ and $Y = y_k$, we can define the score probabilities
$r_j = \Pr\{X = x_j; T\}$ and $s_k = \Pr\{Y = y_k; T\}$, respectively. Kernel equating is an observed-score

framework whose goal is to find an optimal equating transformation between the test scores X

and Y for a target population T, and the process consists of five steps: 1) presmoothing, 2)

estimation of score probabilities, 3) continuization, 4) equating, and 5) the calculation of the

standard errors of equating (von Davier et al., 2004; von Davier, 2013). The equating

transformation, obtained in step 4, is defined as

$$\hat{\phi}_Y(x) = \phi_Y(x; \hat{r}, \hat{s}) = F_{h_Y}^{-1}(F_{h_X}(x; \hat{r}); \hat{s}) = \hat{F}_{h_Y}^{-1}(\hat{F}_{h_X}(x)), \qquad (2)$$

where $\hat{r}$ and $\hat{s}$ are vectors of the estimated score probabilities (step 2) obtained from the
estimated score distributions (step 1), and $h_X$ and $h_Y$ are bandwidth parameters controlling
the degree of smoothness for the continuization (step 3). A Gaussian kernel is usually utilized
for continuization of the cdf, and for X this is defined as

$$F_{h_X}(x) = \sum_j r_j \Phi(R_{jX}(x)), \qquad (3)$$

where $\Phi(\cdot)$ denotes the standard normal cdf,

$$R_{jX}(x) = \frac{x - a_X x_j - (1 - a_X)\mu_{XT}}{a_X h_X},$$

$\mu_{XT} = \sum_j x_j r_j$, $\sigma_{XT}^2 = \sum_j (x_j - \mu_{XT})^2 r_j$, and $a_X^2 = \sigma_{XT}^2 / (\sigma_{XT}^2 + h_X^2)$.

Other choices besides the Gaussian kernel are also possible (e.g., a logistic or a uniform

kernel). The most common way to find an optimal bandwidth is to minimize a penalty

function comprising two components. One component accounts for the distance between the

estimated score probabilities and the corresponding estimated density function, and the other

acts as a smoothness penalty term that avoids rapid fluctuations in the approximated density

(for details, see von Davier, 2011). Alternatives to the penalty function approach have been

described in Häggström and Wiberg (2014), Liang and von Davier (2014), and Andersson and

von Davier (2014).
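The continuization in Equation 3 and the equating step in Equation 2 can be sketched in a few lines. This is only an illustration under stated assumptions: the score probabilities are hypothetical binomial probabilities, the bandwidths are fixed at an arbitrary value of 0.6 rather than chosen by a penalty function, and the inverse cdf is obtained by plain bisection. The function names are ours and do not correspond to any equating package.

```python
import math


def normal_cdf(z):
    """Standard normal cdf Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def kernel_cdf(x, scores, probs, h):
    """Gaussian-kernel continuized cdf: F_h(x) = sum_j r_j * Phi(R_j(x))."""
    mu = sum(s * p for s, p in zip(scores, probs))
    var = sum((s - mu) ** 2 * p for s, p in zip(scores, probs))
    a = math.sqrt(var / (var + h ** 2))
    return sum(
        p * normal_cdf((x - a * s - (1.0 - a) * mu) / (a * h))
        for s, p in zip(scores, probs)
    )


def kernel_equate(x, x_scores, r, h_x, y_scores, s, h_y, tol=1e-10):
    """phi_Y(x) = F_hY^{-1}(F_hX(x)), inverting F_hY numerically by bisection."""
    target = kernel_cdf(x, x_scores, r, h_x)
    lo, hi = min(y_scores) - 50.0, max(y_scores) + 50.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kernel_cdf(mid, y_scores, s, h_y) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def binomial_pmf(n_items, p):
    """Hypothetical score probabilities used only to have something to equate."""
    return [math.comb(n_items, j) * p ** j * (1 - p) ** (n_items - j)
            for j in range(n_items + 1)]


scores = list(range(21))          # possible scores on a 20-item form
r = binomial_pmf(20, 0.5)         # assumed probabilities for form X
s = binomial_pmf(20, 0.6)         # assumed probabilities for form Y (easier form)
equated_scores = [kernel_equate(x, scores, r, 0.6, scores, s, 0.6) for x in scores]
```

Because form Y is easier in this made-up setup, every X score is mapped to a higher value on the Y scale.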

The (asymptotic) standard error of equating (SEE) due to random sampling from the

target population T is defined as:

$$\mathrm{SEE}_Y(x) = \hat{\sigma}_Y(x) = \sqrt{\mathrm{Var}(\hat{\phi}_Y(x))}, \qquad (4)$$

where $\hat{\phi}_Y(x)$ is defined in Equation 2 and the delta method is used to calculate the variance.

Interestingly, the definition of SEE in Equation 4 is equivalent to that given in statistical

inference for an estimated parameter of interest (in this case, the equating transformation).

However, as will be seen later in this paper, the mathematical form of the quantity under

the square root sign varies according to the adopted equating framework. In this sense, the

SEE could also be considered as an equating-specific evaluation measure.

Equating-specific evaluation measures within the kernel-equating framework

The PRE has been used for evaluation in the kernel equating framework. If we denote the

$p$th moment of the distribution of test scores Y and the equated scores $\phi_Y(X)$ as
$\mu_p(Y) = \sum_k (y_k)^p s_k$ and $\mu_p(\phi_Y(X)) = \sum_j (\phi_Y(x_j))^p r_j$, respectively, then the PRE is defined
as

$$\mathrm{PRE}(p) = 100 \, \frac{\mu_p(\phi_Y(X)) - \mu_p(Y)}{\mu_p(Y)} \qquad (5)$$

(von Davier et al., 2004). Another commonly used equating-specific measure not only in

kernel equating, but also in traditional methods of equating, is the DTM. DTM has been used

as a criterion to decide between two transformations that differ in some respect. Originally,

this meant any differences between equated scores and scale scores that are larger than half of

a reported score unit. In this paper, DTM will be used in connection with the measure of

differences of two equating estimators, which is defined formally in the numerical illustration

section.
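To make Equation 5 concrete, the PRE can be computed from the score values, the score probabilities, and an equating function. The small five-point score distributions and the two candidate transformations below are invented for illustration only.

```python
def moment(values, probs, p):
    """p-th moment of a discrete distribution: sum of value**p * probability."""
    return sum(v ** p * w for v, w in zip(values, probs))


def pre(p, x_scores, r, y_scores, s, phi):
    """Percent relative error in the p-th moment (Equation 5)."""
    equated = [phi(x) for x in x_scores]
    mu_y = moment(y_scores, s, p)
    return 100.0 * (moment(equated, r, p) - mu_y) / mu_y


# Hypothetical toy distributions: Y scores are exactly 2 * X + 1.
x_scores = [0, 1, 2, 3, 4]
y_scores = [1, 3, 5, 7, 9]
r = [0.1, 0.2, 0.4, 0.2, 0.1]
s = [0.1, 0.2, 0.4, 0.2, 0.1]

pre_exact = [pre(p, x_scores, r, y_scores, s, lambda x: 2 * x + 1) for p in range(1, 11)]
pre_identity = pre(1, x_scores, r, y_scores, s, lambda x: x)
```

With the exact transformation every moment is reproduced, so all ten PRE values are zero; leaving the X scores untransformed understates the mean of Y by 60 percent in this toy setup.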

These two evaluations share the common feature that they are especially developed to

evaluate an equating, although they handle different aspects of the performed equating. These

evaluations are quite different from general statistical measures used to assess different

estimators because they do not rely on comparing a true value with an estimated value of a

parameter of interest. It should, however, be noted that statistical indices such as root mean

square difference (RMSD), mean absolute difference (MAD), and mean signed difference

(MSD) have been used previously to evaluate an equating (Harris & Crouse, 1993), although

not in the way that will be implemented here. As mentioned before, we will define a true

equating transformation against which the estimated equating transformation will be

evaluated. Bias and MSE have also been used to evaluate equating functions, sometimes with

several replications as noted in Kolen and Brennan (2014, p. 311). They do not, however,

emphasize the comparison of an estimated equating with its true equating but rather compare

a criterion equating against an estimated equating function. Additionally, previous studies

have typically used large populations from which replicated samples have been drawn and

used to calculate some of the above-mentioned measures. This study differs from previous

studies in that a probability model is utilized to generate replicated data so that Monte Carlo

methods can be used to calculate the evaluation measures, as is usually done in statistical

assessment studies.

Statistical evaluation of equating transformation parameter estimates

Because a point estimator is defined as any function of the sample, measures of the

quality of an estimator have been developed in order to choose between different estimators.

Most of these measures are based on the deviation of the estimated value of a parameter from

its true value. We will describe three of these measures when using the equating

transformation as the unknown parameter of interest. It is important to emphasize that the test

scores X and Y are random variables and that the equating transformation, as for any other

estimator in statistics, is just a mathematical rule that becomes a random variable when

observed data (i.e., the realizations of random variables) are used to obtain an estimation. It is

thus necessary to use expectations to evaluate the quality of this estimator.

Let $\hat{\phi}(x)$ denote an estimator of $\phi(x)$. The definitions of bias, MSE, and RMSE are well
known and are shown here explicitly for the case of equating transformations:

$$\mathrm{Bias}(\hat{\phi}) = E_{f_{\phi}}[\hat{\phi}] - \phi, \qquad (6)$$

$$\mathrm{MSE}(\hat{\phi}) = E_{f_{\phi}}[(\hat{\phi} - \phi)^2], \text{ and} \qquad (7)$$

$$\mathrm{RMSE}(\hat{\phi}) = \sqrt{E_{f_{\phi}}[(\hat{\phi} - \phi)^2]}, \qquad (8)$$

where the expectation is taken over the distribution of the equating function estimates, $f_{\phi}$. For

similar definitions within the local equating framework, see Wiberg and van der Linden

(2011). Because these expectations are not generally available in closed form, randomly

generated score data are used to calculate them in practice using Monte Carlo simulation.

Likewise, the definition for the SE is

$$\mathrm{SE}(\hat{\phi}_Y(x)) = \sqrt{\mathrm{Var}(\hat{\phi}_Y(x))}, \qquad (9)$$

and, as mentioned before, it coincides exactly with the definition of the SEE. Note that the SE can
be obtained from the MSE and the bias because the MSE decomposes into the squared bias
plus the variance.
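For completeness, the relation between the SE, the MSE, and the bias can be written out as the standard bias-variance decomposition. The derivation below is textbook material and uses only the quantities defined in Equations 6 to 9; the cross term vanishes because $E[\hat{\phi} - E[\hat{\phi}]] = 0$.

```latex
\begin{align*}
\mathrm{MSE}(\hat{\phi})
  &= E\big[(\hat{\phi} - \phi)^2\big] \\
  &= E\big[(\hat{\phi} - E[\hat{\phi}])^2\big]
     + 2\,(E[\hat{\phi}] - \phi)\,E\big[\hat{\phi} - E[\hat{\phi}]\big]
     + (E[\hat{\phi}] - \phi)^2 \\
  &= \mathrm{Var}(\hat{\phi}) + \mathrm{Bias}(\hat{\phi})^2 ,
\end{align*}
\qquad\text{so that}\qquad
\mathrm{SE}(\hat{\phi}) = \sqrt{\mathrm{MSE}(\hat{\phi}) - \mathrm{Bias}(\hat{\phi})^2}.
```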

As pointed out earlier, there are also equating evaluation criteria or statistical indices that

have been used to evaluate equatings. Three commonly used indices are MSD, MAD, and

RMSD (Han, Kolen & Pohlmann, 1997; Harris and Crouse, 1993). These measures have been

shown to be particularly useful for comparing two different equating estimators within a

framework (e.g., one estimator that uses a Gaussian kernel and the other using a logistic

kernel within the kernel equating framework). Harris and Crouse (1993) pointed out that most

of these summary measures have been (and will continue to be) used due to their prominence

in the literature and not based on their applicability in a particular situation. We will explore

modified versions of these indices to compare a true equating transformation against an

estimated equating transformation, in line with the adopted viewpoint that the equating

transformation is a standard statistical estimator.

Assessing $\hat{\phi}(x)$ within an observed-score equating framework

In order to use any of the statistical evaluation measures described in Equations 6–9 and

the modified versions of the equating evaluation criteria, we need to define a true equating

transformation. A (unique) true equating transformation is not known, except for the case of

simulations. Here, an explicit probability model will be used to generate score data so that the

calculation of true equated values is possible. Depending on the adopted framework, the

mathematical form of the transformation ϕ might change. Whichever framework is adopted,

the true value of ϕ is obtained for the equating framework using a particular definition of

equating. Here, Braun and Holland’s (1982) definition stated in Equation 1 is used. As long as

the interest is to assess the equating transformation within an observed-score framework, this

should not yield too many difficulties because the mathematical form of the true parameter

and the estimator is the same.

Once the true transformation is defined, simulated data are used to evaluate different

estimators. Simulations are typically used to approximate the sampling distribution of the

estimator under a particular set of conditions. Examining different situations through

simulations is very useful when assessing the behavior of different estimators before they are

applied to real data.

Next, the statistical assessment of $\hat{\phi}(x)$ will be discussed for the three observed-score

equating frameworks of kernel equating, item response theory (IRT) observed-score equating,

and local equating. For all of these cases, the assessment of the equating transformation

follows the same procedure:

1. Create simulated score data for the two test versions from a known probability model.

2. Obtain the true equating transformation.

3. Estimate the equating transformations for the different scenarios or different estimators $\hat{\phi}$ of interest.

Although the assessment strategy is common for all methods, both the mathematical

definition of the equating transformation $\phi(x)$ and the data-generating mechanism used to

generate replicated data will differ.
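The three-step procedure can be sketched with a deliberately simple stand-in. The normal score models, the sample size, the number of replications, and the use of a linear equating estimator are all assumptions made for this illustration; they are not the kernel, IRT, or local equating estimators discussed in this paper.

```python
from statistics import NormalDist, mean, stdev

# Known probability model used to generate data (hypothetical values).
TRUE_X = NormalDist(mu=10.0, sigma=3.0)
TRUE_Y = NormalDist(mu=12.0, sigma=4.0)


def true_phi(x):
    """Step 2: true transformation F_Y^{-1}(F_X(x)) under the generating model."""
    return TRUE_Y.inv_cdf(TRUE_X.cdf(x))


def linear_estimate(sample_x, sample_y):
    """Step 3: a simple stand-in estimator built from sample means and SDs."""
    mx, sx = mean(sample_x), stdev(sample_x)
    my, sy = mean(sample_y), stdev(sample_y)
    return lambda x: my + sy / sx * (x - mx)


def simulate(x, n_reps=200, n=500):
    """Step 1 and step 3 repeated: draw data, estimate, and store phi_hat(x)."""
    estimates = []
    for rep in range(n_reps):
        sample_x = TRUE_X.samples(n, seed=rep)
        sample_y = TRUE_Y.samples(n, seed=10_000 + rep)
        estimates.append(linear_estimate(sample_x, sample_y)(x))
    return estimates


estimates = simulate(x=13.0)
bias_at_13 = mean(estimates) - true_phi(13.0)
```

Because the generating model and the estimator have the same (linear) form here, the estimated bias should be close to zero; a misspecified estimator would show up as a systematic deviation from the true value.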

In order to make a fair comparison that does not favor any of the assessed estimators, the

estimated transformations will be compared against the true equating transformation as

defined using one of the competing models. This means that if two transformations are

compared, two sets of true equated values will be generated and compared with the two

assessed estimators. In this way, it is possible to check robustness to model misspecification,
for example, when the estimation includes a certain feature that the model used to produce
the true equated values does not have.

Assessing $\hat{\phi}(x)$ within observed-score kernel equating

To assess the estimated equating transformation $\hat{\phi}(x)$ within the kernel equating framework,
we need to define a true equating transformation $\phi(x)$. The true transformation will depend

on what we are interested in examining, for example, different bandwidth selection methods,

different kernels, or how the equating transformation behaves if we have a symmetric or a

skewed test score distribution. Because the equating transformation might depend on more

than one parameter, one has the flexibility to decide which parameter should be fixed and

which should be estimated when defining the true equating transformation. For the

assessment of the equating transformation, the three steps outlined above are followed. We

define the estimated equating transformation as

$$\hat{\phi}_Y(x) = \phi_Y(x; F_{h_X}, F_{h_Y}, h_X, h_Y, \hat{r}, \hat{s}) = \hat{F}_{h_Y}^{-1}(\hat{F}_{h_X}(x)), \qquad (10)$$

where $h_X$, $h_Y$, $\hat{r}$, and $\hat{s}$ are defined as above, and $F_{h_X}$ and $F_{h_Y}$ are used to make explicit the
dependence of ϕ on the kernel used. Suppose the interest is in assessing the use of different
kernels. Because using different kernels has an impact on the estimation of the bandwidths
$h_X$ and $h_Y$, these values need to be estimated in each replication for the different scenarios.

The same is true for the score probabilities r and s. If the interest instead is to compare

different bandwidth selection methods, one could let the score probabilities be fixed and

estimate the bandwidths using different bandwidth selection methods. These are just two

examples of possible comparisons, and the first will be illustrated later in this paper.

Additionally, in order to make a fair comparison that does not favor any of the assessed

estimators, the estimated transformations will be compared against the true equating

transformation as defined using either of the competing models (e.g. models defined using

different kernels).

Assessing $\hat{\phi}(x)$ within IRT observed-score equating

In IRT observed-score equating, an IRT model is used to produce an estimated distribution of

observed number-correct scores on each test form, and these are then used to equate scores

using equipercentile methods. The equating transformation is defined as in Equation 1 where

the involved cdfs are defined as

$$F_Z(z) = \int F(z \mid \theta)\, g(\theta)\, d\theta, \quad Z = X, Y,$$

where $g(\theta)$ is the distribution of θ (Kolen & Brennan, 2014). Possible scenarios that can be
evaluated include different ability distributions $g(\theta)$, different IRT models, and different
ways to obtain the conditional distributions of scores $F(X \mid \theta)$ and $F(Y \mid \theta)$ (González,
Wiberg, & von Davier, in press). For the assessment to be fair, if the interest is in comparing

equating transformations obtained from using either the two (2PL) or the three (3PL)

parameter logistic IRT models, one needs to examine two possible true equating

transformations provided that the data are suitable for being modeled with both of these IRT

models. The true equating transformation will thus consist of Equation 1 with either the 2PL

or the 3PL. Also, for the assessment to be fair, we need to simulate data following both a 2PL

and a 3PL model.
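The marginal cdf used in IRT observed-score equating can be approximated numerically. The sketch below is one possible route, not necessarily the one used in the cited work: it assumes a 2PL model, obtains the conditional number-correct distribution with the Lord-Wingersky recursion, takes g(θ) to be standard normal, and replaces the integral with a sum over an evenly spaced grid. The item parameters are hypothetical.

```python
import math


def p_correct_2pl(theta, a, b):
    """2PL probability of a correct response for discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def conditional_score_dist(theta, items):
    """P(score = z | theta) for z = 0..n, built item by item (Lord-Wingersky recursion)."""
    dist = [1.0]
    for a, b in items:
        p = p_correct_2pl(theta, a, b)
        new = [0.0] * (len(dist) + 1)
        for z, prob in enumerate(dist):
            new[z] += prob * (1.0 - p)      # item answered incorrectly
            new[z + 1] += prob * p          # item answered correctly
        dist = new
    return dist


def marginal_score_cdf(items, n_nodes=81, lo=-6.0, hi=6.0):
    """Approximate F(z) = integral of F(z | theta) g(theta) d theta with g = N(0, 1)."""
    step = (hi - lo) / (n_nodes - 1)
    nodes = [lo + i * step for i in range(n_nodes)]
    weights = [math.exp(-0.5 * t * t) for t in nodes]
    total = sum(weights)
    weights = [w / total for w in weights]   # normalized quadrature weights
    pmf = [0.0] * (len(items) + 1)
    for t, w in zip(nodes, weights):
        for z, prob in enumerate(conditional_score_dist(t, items)):
            pmf[z] += w * prob
    cdf, running = [], 0.0
    for prob in pmf:
        running += prob
        cdf.append(running)
    return cdf


# Hypothetical (a, b) item parameters for a four-item form.
items_x = [(1.0, -1.0), (1.2, 0.0), (0.8, 0.5), (1.5, 1.0)]
cdf_x = marginal_score_cdf(items_x)
```

Repeating the computation for form Y and plugging both cdfs into Equation 1 (after a suitable treatment of their discreteness) gives the IRT observed-score equating transformation.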

Assessing $\hat{\phi}(x)$ within local equating

A more recent equating framework is local equating (van der Linden, 2011; van der

Linden & Wiberg, 2010; Wiberg et al., 2014). Instead of using the marginal distributions of

scores, this method utilizes the conditional distributions of scores given ability or any simple

classification of it, and this leads to a family of equating transformations of the form

$$\phi(x; \theta) = F_{Y \mid \theta}^{-1}(F_{X \mid \theta}(x)), \quad \theta \in \mathbb{R}. \qquad (11)$$

Assessing $\hat{\phi}(x) = \phi(x; \hat{\theta})$ within local equating will be similar to IRT observed-score

equating if we use the local equating methods that rely on the assumption that the data fit an

IRT model. We can then proceed and obtain estimates of the equating transformations

similarly as in IRT observed-score equating. Although it is possible to use various evaluation

measures, the assessment has typically been based on bias measures (e.g., van der Linden &
Wiberg, 2010; Wiberg et al., 2014). In this framework, possible scenarios that can be
evaluated are different estimation methods for θ, different IRT models, and different ways to

obtain the conditional distributions.

Measures of statistical assessment when equating test scores

In order to practically evaluate the statistical measures in Equations 6-9, the Monte Carlo

method will be used with replicated data generated from a known probability model. For each

assessment measure, the true and estimated equated values are compared for each test score.

Let $x_i$ denote a specific test score, where $i = 0, \ldots, n$ and n is the number of possible score
values. The simplest evaluation measure is the bias, which for an equated value $\phi_Y(x_i)$ over
1000 replications, where each replicate is denoted by $l$, is defined as

$$\mathrm{Bias}(\hat{\phi}_Y(x_i)) = \frac{1}{1000} \sum_{l=1}^{1000} (\hat{\phi}_Y^{(l)}(x_i) - \phi_Y(x_i)), \qquad (12)$$

followed by the MSE defined as

$$\mathrm{MSE}(\hat{\phi}_Y(x_i)) = \frac{1}{1000} \sum_{l=1}^{1000} (\hat{\phi}_Y^{(l)}(x_i) - \phi_Y(x_i))^2, \qquad (13)$$

and the RMSE defined as

$$\mathrm{RMSE}(\hat{\phi}_Y(x_i)) = \sqrt{\frac{1}{1000} \sum_{l=1}^{1000} (\hat{\phi}_Y^{(l)}(x_i) - \phi_Y(x_i))^2}, \qquad (14)$$

where $\hat{\phi}_Y^{(l)}(x_i)$ is the estimated equated score for the $l$th replication. As mentioned before,
$\mathrm{SE}(\hat{\phi}_Y(x_i))$ can be calculated by subtracting the squared bias from the MSE and taking the square
root. We will also use modifications of the previously mentioned indices (MSD, MAD, and
RMSD) that compare a true equating transformation against an estimated equating

transformation using the full set of replications. For that aim, we adjusted the indices to be

used with replications and defined them as the following average measures:

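$$\mathrm{AMSD}(\phi_Y) = \frac{1}{1000} \sum_{l=1}^{1000} \mathrm{MSD}_l, \qquad (15)$$

$$\mathrm{AMAD}(\phi_Y) = \frac{1}{1000} \sum_{l=1}^{1000} \mathrm{MAD}_l, \text{ and} \qquad (16)$$

$$\mathrm{ARMSD}(\phi_Y) = \frac{1}{1000} \sum_{l=1}^{1000} \mathrm{RMSD}_l, \qquad (17)$$

where, for the estimate obtained in each replication,

$$\mathrm{MSD} = \frac{1}{n} \sum_{i=0}^{n} [\hat{\phi}_Y(x_i) - \phi_Y(x_i)],$$

$$\mathrm{MAD} = \frac{1}{n} \sum_{i=0}^{n} |\hat{\phi}_Y(x_i) - \phi_Y(x_i)|, \text{ and}$$

$$\mathrm{RMSD} = \sqrt{\frac{1}{n} \sum_{i=0}^{n} [\hat{\phi}_Y(x_i) - \phi_Y(x_i)]^2}.$$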

These definitions are in line with how Chen (2012) used these measures, although no

formulas were given in that study. Note that MSD, MAD, and RMSD produce a single

number that corresponds to an average over the total number of scores. We have also explored

the possibility of redefining these measures in order to use the full range of score values rather

than averaging over them. In this case, we started from the versions of these measures given

in Han et al. (1997) and extended them into average at each score point using the replicated

data. A proposed measure that we will use is the average point absolute difference (APAD),

and this is defined as

$$\mathrm{APAD}(\phi_Y(x_i)) = \frac{1}{1000} \sum_{l=1}^{1000} |\hat{\phi}_Y^{(l)}(x_i) - \phi_Y(x_i)|. \qquad (18)$$

Note that although it is also possible to define an average point signed difference (APSD), the

resulting formula becomes mathematically equivalent to the bias (Equation 12) and thus is

excluded here. Also, an average point root square difference (APRSD) where the absolute

value function in APAD is exchanged with a square root and the difference is raised to the

power two is mathematically equivalent to the formal definition of APAD and is excluded

here.
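The Monte Carlo measures in Equations 12 to 18 can be sketched as two small functions operating on a matrix of replicated estimates. The layout (one row per replication, one column per score point) and the tiny data set are assumptions made for illustration. The averaged indices divide by the number of score points supplied, which is our reading of the normalizing constant.

```python
import math


def pointwise_measures(estimates, truth):
    """Bias, MSE, RMSE, SE, and APAD at each score point.

    estimates[l][i] is the estimated equated score in replication l at score x_i;
    truth[i] is the true equated score at x_i.
    """
    n_reps = len(estimates)
    results = []
    for i, true_value in enumerate(truth):
        errors = [rep[i] - true_value for rep in estimates]
        bias = sum(errors) / n_reps
        mse = sum(e ** 2 for e in errors) / n_reps
        results.append({
            "bias": bias,
            "mse": mse,
            "rmse": math.sqrt(mse),
            # SE from the decomposition MSE = bias^2 + variance.
            "se": math.sqrt(max(mse - bias ** 2, 0.0)),
            "apad": sum(abs(e) for e in errors) / n_reps,
        })
    return results


def average_measures(estimates, truth):
    """AMSD, AMAD, and ARMSD: per-replication indices averaged over replications."""
    n_reps, n_points = len(estimates), len(truth)
    amsd = amad = armsd = 0.0
    for rep in estimates:
        diffs = [est - true_value for est, true_value in zip(rep, truth)]
        amsd += sum(diffs) / n_points
        amad += sum(abs(d) for d in diffs) / n_points
        armsd += math.sqrt(sum(d ** 2 for d in diffs) / n_points)
    return {"amsd": amsd / n_reps, "amad": amad / n_reps, "armsd": armsd / n_reps}


# Invented example: four replications, two score points.
truth = [1.0, 2.0]
estimates = [[1.5, 2.0], [0.5, 2.0], [1.5, 3.0], [0.5, 3.0]]
pointwise = pointwise_measures(estimates, truth)
averaged = average_measures(estimates, truth)
```

In this example the first score point is unbiased but variable, while the second has a bias of 0.5; both have the same APAD, which illustrates why the measures are reported side by side.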

Numerical illustration

In order to illustrate the statistical assessment of the equating transformations, simulated

data were used with a large number of replications to evaluate kernel equating using the

Gaussian and logistic kernels. The definition of the equating transformation using a standard
logistic kernel is similar to that used for the Gaussian kernel in Equation 3, but with $\Phi(\cdot)$ replaced
by

$$K(v) = \frac{1}{1 + \exp(-v)},$$

where the random variable V has a mean of 0 and a variance of $\sigma_V^2 = \pi^2 / 3$ (Lee & von Davier,

2011). A comparison between these two kernels using SEE and cumulants was presented in

Lee and von Davier (2011) where data from chapter 7 in von Davier et al. (2004) are used

with the equivalent group design. For the numerical illustration, the reported values of $r$, $s$,
$h_X$, and $h_Y$ in these studies were used as true parameter values to obtain true equated scores.
The score data came from two 20-item test forms X and Y where M = 1455 and N = 1453 test
takers were administered test forms X and Y, respectively. For the true parameter values of
the Gaussian kernel, the optimal bandwidths were $h_X = 0.622$ and $h_Y = 0.579$, and for the
logistic kernel the optimal bandwidths were $h_X = 0.512$ and $h_Y = 0.446$. Given the true
parameter values for $r_j$ and $s_k$, which are given in Table 7.4 (von Davier et al., 2004), 1000
instances of score frequencies $\mathbf{n} = (n_1, \ldots, n_J)$ and $\mathbf{m} = (m_1, \ldots, m_K)$ were generated from the
multinomial distributions $\mathrm{Mult}(M; r_1, \ldots, r_J)$ and $\mathrm{Mult}(N; s_1, \ldots, s_K)$, with the same sample
sizes as in the original data. For each replication, we estimated $h_X$ and $h_Y$ as well as the $r_j$
and $s_k$ values. The first step in kernel equating is presmoothing. For simplicity, we used the

same loglinear model that originated from the "true" score probabilities parameters to
presmooth the simulated data as was given in von Davier et al. (2004):

$$\log(r_j) = \alpha_r + \sum_{i=1}^{T_r} \beta_{ri} (x_j)^i \quad \text{and} \quad \log(s_k) = \alpha_s + \sum_{i=1}^{T_s} \beta_{si} (y_k)^i,$$

with $T_r = 2$ and $T_s = 3$. Other possibilities are to either automatically select the loglinear

models (using e.g. Akaike Information Criterion (AIC) or Bayesian Information Criterion

(BIC), a chi square test, or a likelihood ratio test) or to skip the presmoothing step and argue

that the same mistake is made in all replications. This could also have been done here,

although we have chosen to include the presmoothing step for illustrative purposes because it is

part of the five steps of kernel equating.
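The data-generating step described above amounts to repeated multinomial sampling. The sketch below uses the sample size M = 1455 and the 1000 replications reported in the text, but the score probabilities are hypothetical placeholders, because the values from Table 7.4 are not reproduced here. The raw proportions computed at the end are only a stand-in for the presmoothed estimates used in the illustration.

```python
import math
import random
from collections import Counter


def simulate_frequencies(probs, sample_size, n_reps, seed=1):
    """Draw n_reps vectors of score frequencies from Mult(sample_size; probs)."""
    rng = random.Random(seed)
    categories = range(len(probs))
    replications = []
    for _ in range(n_reps):
        draws = rng.choices(categories, weights=probs, k=sample_size)
        counts = Counter(draws)
        replications.append([counts.get(j, 0) for j in categories])
    return replications


# Placeholder "true" score probabilities for a 20-item form (assumption).
true_r = [math.comb(20, j) * 0.55 ** j * 0.45 ** (20 - j) for j in range(21)]

M = 1455
freq_x = simulate_frequencies(true_r, sample_size=M, n_reps=1000)

# Raw (unsmoothed) estimates of r_j in the first replication.
r_hat_first = [count / M for count in freq_x[0]]
```

The same function would be called a second time with the probabilities and sample size for form Y.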

From the generated data, we estimated equated values for both the Gaussian kernel and

the logistic kernel. The estimated equating values in both of these scenarios were then

compared with the true equated values from both the Gaussian and logistic kernel to yield a

total of four comparisons or scenarios. This way it is possible to check the robustness of

misspecified models, for example, those that use a kernel in the estimation that is different

from the one used to produce the true equated values. The four different scenarios were

examined with all of the previously described statistical evaluation measures (bias, MSE,

RMSE, AMSD, AMAD, ARMSD, APAD) given in Equations 12–18. Additionally, adjusted

versions of the PRE and SEE that use the full set of replications as well as an overall criterion

of equating differences (EDIFF) were used and are described in the next subsection. To

perform the numerical illustration, either of the R packages kequate (Andersson, Bränberg, &

Wiberg, 2013) or SNSequate (González, 2014) could be used. In this case we used

SNSequate.

Adjusting the equating-specific evaluation measures

To take advantage of the full set of replications, we adjusted the previously described

equating measures PRE and SEE. To adjust SEE, for each score value $x_i$, the SEE is obtained
by calculating the average over 1000 replications as

$$\mathrm{SEE}_Y^{a}(x_i) = \frac{1}{1000} \sum_{l=1}^{1000} \mathrm{SEE}_Y^{(l)}(x_i), \qquad (19)$$

where $\mathrm{SEE}_Y(x_i)$ is defined as in Equation 4. Similarly, to adjust the PRE we average over
replications and define $\mathrm{PRE}^{a}$ as

$$\mathrm{PRE}^{a}(p) = \frac{1}{1000} \sum_{l=1}^{1000} \mathrm{PRE}^{(l)}(p), \qquad (20)$$

where $\mathrm{PRE}^{(l)}(p)$ is the value of PRE as defined in Equation 5 for the $l$th replicate. Finally, the
DTM, or modifications of it, has been used in several equating studies (e.g., Yang, Bontya, &
Moses, 2011; Brossman, 2010). Here, DTM is used as a criterion to judge whether the
differences between two equating estimators, as measured by EDIFF, matter. EDIFF is defined as

$$\mathrm{EDIFF} = \frac{1}{1000} \sum_{l=1}^{1000} |\hat{\phi}_{Y_1}^{(l)}(x_i) - \hat{\phi}_{Y_2}^{(l)}(x_i)|, \qquad (21)$$

where $\hat{\phi}_{Y_1}$ and $\hat{\phi}_{Y_2}$ represent two different equating estimators (Liang & von Davier, 2014).
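A short sketch of how EDIFF could be computed and compared with the DTM follows. The two sets of replicated estimates are invented, and the threshold of 0.5 assumes that scores are reported in units of 1, so that half a reported score unit equals 0.5.

```python
def ediff(estimates_1, estimates_2):
    """EDIFF at each score point: mean absolute difference between two
    estimators over replications (rows are replications, columns are scores)."""
    n_reps = len(estimates_1)
    n_points = len(estimates_1[0])
    return [
        sum(abs(estimates_1[l][i] - estimates_2[l][i]) for l in range(n_reps)) / n_reps
        for i in range(n_points)
    ]


DTM = 0.5  # half a reported score unit, assuming unit-spaced reported scores

# Invented estimates from two estimators: two replications, two score points.
estimator_1 = [[10.0, 11.2], [10.2, 11.0]]
estimator_2 = [[10.1, 12.0], [10.1, 12.0]]

ediff_values = ediff(estimator_1, estimator_2)
exceeds_dtm = [value > DTM for value in ediff_values]
```

In this toy case the two estimators agree closely at the first score point but differ by more than the DTM at the second.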

Results

The bias, MSE, and RMSE for the different equated values for the Gaussian and logistic

kernels using either of them as the true equating transformation are shown in Figures 1 and 2.

The suffixes G and L in these figures indicate which kernel (Gaussian or logistic) was used to

obtain true equated values, and the words gauss and logis indicate the kernel used for the

estimations. From Figure 1 it is evident that the bias at the endpoint is larger for misspecified

models (i.e., gauss.L and logis.G) as expected, and the bias also differs depending on which

kernel is used. It should also be noted that the size of the bias is slightly different depending

on whether the Gaussian or the logistic kernel is used as the true equating transformation.

At the lower end of the score scale (up to score 7), the bias for gauss.G appears to be smaller, and
the bias curves of the two correctly specified models (i.e., gauss.G and logis.L) converge from the
middle to the upper part of the score range.

Figure 1: Bias for equated values over 1000 replications. G and L indicate that data were

generated using the Gaussian and logistic models, respectively.

The MSE, shown in the left panel of Figure 2, was almost identical in the four scenarios,

except for slight differences at the extremes of the score scale. This is not surprising because

both the bias and the SE (right part of Figure 3) only displayed small differences across the

score scale. The RMSE also yielded the same pattern for all scenarios and was never larger

than .30 for any of the scores.
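
The link between bias, SE, and MSE that underlies this observation can be checked numerically. The sketch below is an illustration under stated assumptions: the five equated values and the true value are hypothetical, and the SE is computed as the standard deviation across replications with divisor n, which may differ from the exact definition used earlier in the paper.

```python
from math import sqrt
from statistics import fmean, pvariance


def pointwise_summary(estimates, true_value):
    """Bias, SE, MSE and RMSE of replicated estimates at one score value."""
    bias = fmean(estimates) - true_value
    se = sqrt(pvariance(estimates))  # spread across replications (divisor n)
    mse = fmean((e - true_value) ** 2 for e in estimates)
    return {"bias": bias, "se": se, "mse": mse, "rmse": sqrt(mse)}


# Hypothetical equated values at a single score from five replications.
estimates = [10.1, 9.8, 10.3, 9.9, 10.2]
summary = pointwise_summary(estimates, true_value=10.0)

# With the divisor-n variance, MSE = bias^2 + SE^2 up to rounding error.
gap = summary["mse"] - (summary["bias"] ** 2 + summary["se"] ** 2)
```

Because the MSE is the sum of the squared bias and the squared SE, small differences in both components across scenarios necessarily give nearly identical MSE and RMSE curves.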

Figure 2. MSE to the left and RMSE to the right for equated values (over 1000 replications).

G and L indicate that data were generated under the Gaussian and logistic model,

respectively.

The results for the average loss measures are shown in Table 1. From Table 1 it can be

seen that values of AMSD in all scenarios were close to zero when comparing Gaussian and

logistic kernels, and this indicated a small average loss regardless of the kernel that was used.

In all cases, the values for misspecified models were larger than for the correctly specified

models. The AMAD values indicated that the Gaussian model performed better because it

outperformed the logistic model when the model was correctly specified and it was also more

robust when the model was misspecified. The values of ARMSD were all very similar across

simulation scenarios.

Table 1. Average loss measures for the four examined scenarios.

Measure    gauss.G     logis.G     gauss.L     logis.L
AMSD      −0.00031     0.00361    −0.00419    −0.00026
AMAD       0.15866     0.16074     0.16040     0.15923
ARMSD      0.18147     0.18296     0.18283     0.18135

Note. The suffixes (G and L) indicate which kernel (Gaussian or logistic) was used to obtain true equated values, whereas the words gauss and logis indicate the kernel used for the estimations.

The results for the proposed APAD are given in the left part of Figure 3. There is clear

consistency of the curves when correctly specified models are evaluated across the full range

of scores, whereas the opposite holds for misspecified models, where none of the models

appear to be robust, particularly at the extremes of the score scale.

The SE in the right part of Figure 3 yielded very similar results regardless of the kernel

used. In order to compare the results of the SE obtained using our approach to the SEE

reported in Lee and von Davier (2011), we have added the SEE reported using both a Gaussian

(gauss.b) and a logistic (logis.b) kernel. Interestingly, the SE plot displays essentially the

same results as those shown in Lee and von Davier (2011) (Figure 10.3) for the SEE even

though they did not use replicates. This means that the simulations support the analytical

results. Note that we can only compare the results of the SE with the SEE in Lee and von

Davier (2011), as they did not use any of the other measures proposed here in their evaluation.

It is notable that the SEE curves in Figure 3 are almost identical to the RMSE curve shown in

Figure 2. This might be due to the fact that for the former, the average of 1000 replicates used

as a criterion was very close to the true equating value in the latter.

Figure 3. APAD (over 1000 replications) for the four examined scenarios (left panel) and SE

(over 1000 replications) using either a logistic or a Gaussian kernel together with the

obtained results by Lee and von Davier (2011) (right panel). G and L indicate that data were

generated under the Gaussian and logistic model, respectively.

The results in Figure 3 are in line with what Figure 4 shows for the EDIFF measure.

Although none of the differences are larger than a DTM, it can be seen that the largest

differences between the Gaussian and logistic model occur at the extremes of the score scale.

Figure 5 shows plots for the adjusted SEE (left panel) and PRE (right panel). The

adjusted SEE gives results almost identical to those reported in Lee and von Davier (2011).

The plot of the adjusted PRE values for the first 10 moments of the distributions shows that

the average PRE is lower if a Gaussian kernel instead of a logistic kernel is used, at least from

the third moment on. Note that, because $\mu_p(Y)$ in Equation 5 is the same regardless of the

estimated values, there are only two curves in the right panel of Figure 5 compared to Figures

1, 2, and 3 where four curves, one for each scenario, are shown.

Figure 4. The equating difference (EDIFF) over the different test scores (over 1000

replications) when comparing the Gaussian kernel against the logistic kernel.

Figure 5. The adjusted SEE (over 1000 replications) and the SEE from Lee and von Davier

(2011) when comparing the Gaussian kernel against the logistic kernel (left panel); and the

average PRE (over 1000 replications) using either a logistic or a Gaussian kernel (the right

panel).

Concluding remarks

This paper was motivated by the insight that test scores are random variables and thus

an equating transformation is nothing more than a statistical estimator. This implies that the

obtained equating transformation, and thus the equated scores, should be treated as estimates

of a statistical estimator.

The aim of this paper was to propose a statistical evaluation of equating transformation

parameter estimates within an observed-score equating framework. The proposed approach to

assessing equating transformations has interesting features. First, it offers a statistical

perspective on equating that allows the use of general statistical tools such as probability

models used for replications. Second, it emphasizes that it could be fairer to use several true

equating models instead of just one. Third, it gives us the possibility to use statistical

evaluation methods that are already developed.

If we want to assess an equating transformation we propose to proceed as when working

with a standard statistical inference problem. Our approach differs from other equating

assessment studies in that simulated data from a known probability model are used for

evaluations. This way, it is possible to assess different components of the equating

transformations (e.g., different kernels) as well as to examine different equating methods in

different (simulated) situations (e.g., it allows us to determine which equating method is most

suitable if the score distributions are symmetric or non-symmetric). These ideas were

explicitly illustrated for kernel equating using a multinomial probability distribution and

discussed for being implemented within either the IRT observed score or local equating

frameworks. In all of these cases, an explicit probability model (i.e., the multinomial and the

underlying Bernoulli model for binary IRT data) was available. A more challenging part is to

assess the equating transformation within equipercentile observed-score equating because one

has to decide which probability model can be used to generate the data. If we use a

multinomial distribution, we already have discrete test scores, but if we decide to use, for

example, a normal distribution we need to first discretize the scores. In any case, it is unclear

which probability model better represents the true situation. When parametric inference is

impossible or requires complicated formulas for the calculation of standard errors, the

bootstrap (Efron & Tibshirani, 1993) would be a valuable alternative.
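
As a rough illustration of the bootstrap alternative, the following Python sketch computes a bootstrap standard error of an equated value. Several choices are assumptions made for brevity: linear equating stands in for the kernel estimators studied in this paper, an equivalent-groups design is assumed, and the function names, the simulated scores, and the number of bootstrap samples are hypothetical.

```python
import random
from statistics import fmean, pstdev


def linear_equate(x, scores_x, scores_y):
    """Linear equating of score x from form X to the scale of form Y."""
    slope = pstdev(scores_y) / pstdev(scores_x)
    return fmean(scores_y) + slope * (x - fmean(scores_x))


def bootstrap_se(x, scores_x, scores_y, n_boot=200, seed=1):
    """Bootstrap standard error of the equated value at score x."""
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_boot):
        # Resample each group with replacement (equivalent-groups design).
        boot_x = rng.choices(scores_x, k=len(scores_x))
        boot_y = rng.choices(scores_y, k=len(scores_y))
        replicates.append(linear_equate(x, boot_x, boot_y))
    return pstdev(replicates)


# Hypothetical observed scores on two 20-item forms.
data_rng = random.Random(0)
scores_x = [data_rng.randint(0, 20) for _ in range(300)]
scores_y = [data_rng.randint(0, 20) for _ in range(300)]

se_at_10 = bootstrap_se(10, scores_x, scores_y)
```

The same resampling loop applies to any equating estimator: only `linear_equate` would be replaced by the estimator of interest.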

The comparison of the standard errors obtained using the proposed approach with those

that would be obtained using bootstrap standard errors is a topic of future research. In the

future one should elaborate on comparing different approaches to equating in different

situations. In these future situations the analytical standard error of equating may not always

be available and thus a comparison with bootstrap standard errors will be of great importance.

All the evaluation measures used in this paper consider a point-wise comparison of each

score value in the score scale. In this sense, the equating transformation is being considered as

a multivariate functional parameter $\boldsymbol{\varphi}(\mathbf{x}) = (\varphi(0), \varphi(1), \ldots, \varphi(n))'$ so that each equated score is a

component in the vector equating parameter. If one is interested in globally assessing the

equating transformation, measures such as the MSE and Bias should be reformulated in terms

of $\boldsymbol{\varphi}(\mathbf{x})$. This multivariate setting has been considered in Rijmen, Qu, and von Davier (2011)

and Andersson and Wiberg (2016) where the equating function is treated as a multivariate

parameter and the asymptotic covariance matrices of equating functions are derived for

hypothesis testing. Investigating the implications of this approach on the proposed evaluation

method described in this paper is a subject of future research.

In this paper, only simulations were used but the outlined approach can also be used with

real data. If there is not an explicit probability model to generate data, a statistical model (e.g.,

a polynomial loglinear model or an IRT model) can be fit to the real data. Provided that the

model assumptions are fulfilled, the best possible model is chosen using a goodness of fit test,

AIC or BIC, and it is considered as the true model. The true parameters that are needed to

obtain the true equated scores can be obtained from this true model, and these are then used to

obtain sample replicates from the large population of test data. This approach has been used in

some studies as mentioned in Kolen and Brennan (2014, p.313) and references therein.
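
The replication step described above can be sketched as follows. This is a minimal illustration, not the procedure of any cited study: the fitted score probabilities are hypothetical placeholders for the output of whichever model was selected, and the model-fitting step itself is not shown.

```python
import random
from collections import Counter


def draw_replicates(fitted_probs, sample_size, n_replicates, seed=1):
    """Draw score-frequency vectors from a fitted score distribution.

    fitted_probs[j] is the fitted probability of score j under the model
    that is treated as the true model.
    """
    rng = random.Random(seed)
    scores = range(len(fitted_probs))
    replicates = []
    for _ in range(n_replicates):
        # One multinomial sample of test takers' scores.
        sample = rng.choices(scores, weights=fitted_probs, k=sample_size)
        counts = Counter(sample)
        replicates.append([counts[j] for j in scores])
    return replicates


# Hypothetical fitted probabilities for a 5-point score scale.
fitted_probs = [0.1, 0.2, 0.4, 0.2, 0.1]
replicates = draw_replicates(fitted_probs, sample_size=500, n_replicates=10)
```

Each replicate would then be passed through the equating estimator, and the resulting equated values compared with the true values implied by the fitted model.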

The focus in this paper has been on how to assess the equating transformation within an

observed-score equating framework. An area for future research is how to make comparisons

of equating transformations between different equating frameworks. Comparing estimators across

different equating frameworks is, however, more complicated and difficult, if not impossible,

because the true value of the equating parameter differs between frameworks. This problem is

not shared with regular statistics where the parameter that indexes a probability distribution is

a unique abstract element with no specific mathematical form. How to proceed with a fair

comparison in this setting will depend on which frameworks the methods come from. If we

use, for example, IRT observed-score equating or local equating, the data need to fit an IRT

model. But does that automatically mean that this requirement favors the IRT observed-score

framework or the related local equating over, for example, kernel equating or equipercentile

equating? If we want to compare either of the estimators in the IRT observed-score

framework with the kernel equating framework, we have to decide which probability model is

the fairest to generate data from. One could argue, for example, that to generate data with a

multinomial model will suit kernel equating better. Alternatively, such a decision might be

avoided if we use two equating models and compare the resulting scenarios of using two true

models, as we did in the illustration. The same issues will arise when more than two equating

transformations from different frameworks are to be compared because there might be a

number of possible candidates for the true equating transformation making the comparison

quite complicated. A recent reflection related to this is given in Chen (2012) who declared

that observed differences between equating methods can come both from the framework used

for equating and from the particular equating estimator. Thus, an important point of our

approach is to include all possible scenarios when comparing competing equating

transformations to assure that none are favored over any others. Chen (2012) concluded that

more research is needed to determine the best equating practice and that there is a need to find

a good and practical criterion to identify the most appropriate equating method. This paper

can be seen as one step in that direction. However, we still have to work with the two most

important questions of how one should define the true $\varphi(x)$ and how one should generate

data to avoid an unfair comparison of methods. These questions are ongoing challenges as

new methods are proposed and new comparisons are examined.

References

Andersson, B., Bränberg, K., & Wiberg, M. (2013). Performing the kernel method of test

equating using the package kequate. Journal of Statistical Software, 55, 1-25.

Andersson, B., & von Davier, A. A. (2014). Improving the Bandwidth Selection in Kernel

Equating. Journal of Educational Measurement, 51(3), 223-238.

Andersson, B. & Wiberg, M. (2016). Item response theory observed-score kernel equating.

Manuscript submitted for publication.

Braun, H. I. & Holland, P. W. (1982). Observed-score test equating: A mathematical analysis

of some ETS equating procedures. In P. W. Holland & D. B. Rubin (Eds.), Test equating

(pp. 9-49). New York: Academic Press.

Brossman, B. G. (2010). Observed score and true score equating procedures for

multidimensional item response theory. Doctoral thesis, University of Iowa.

Chen, H. (2012). A comparison between linear IRT observed-score equating and Levine

observed-score equating under the generalized kernel equating framework. Journal of

Educational Measurement, 49(3), 269-284.

Dorans, N. J., & Feigenbaum, M. D. (1994). Equating issues engendered by changes to the

SAT and PSAT/NMSQT (ETS Research Memorandum No. RM-94-10). Princeton, NJ:

ETS.

Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. Chapman & Hall.

González, J. (2014). SNSequate: Standard and Nonstandard Statistical Models and Methods

for Test Equating. Journal of Statistical Software, 59(7), 1-30.

González, J. & von Davier, M. (2013). Statistical models and inference for the true equating

transformation in the context of local equating. Journal of Educational Measurement,

50(3), 315-320.

González, J., Wiberg, M., & von Davier, A. A. (in press). A note on the Poisson's binomial

distribution in item response theory. Applied Psychological Measurement.

Han, T., Kolen, M.J., & Pohlmann, J. (1997). A comparison among true- and observed-score

equating and traditional equipercentile equating. Applied Measurement in Education, 10,

105-121.

Harris, D. J. & Crouse, J. D. (1993). A study of criteria used in equating. Applied

Measurement in Education, 6, 195-240.

Häggström, J. & Wiberg, M. (2014). Optimal bandwidth in observed-score kernel equating.

Journal of Educational Measurement, 51(2), 201-211.

Jiang, Y., von Davier, A. A., & Chen, H. (2012). Evaluating equating results: percent relative

error for chained kernel equating. Journal of Educational Measurement, 49(1), 39-58.

Kolen, M. J., & Brennan, R. L. (2014). Test equating, scaling and linking: methods and

practices. (3rd ed.). New York: Springer.

Lee, Y.-H. & von Davier, A. A. (2011). Equating through alternative kernels. In A. A. von

Davier (Ed.), Statistical models for equating, scaling, and linking. Chapter 10: pp. 159-

173. New York: Springer.

Liang, T., & von Davier, A. A. (2014). Cross-validation: an alternative bandwidth-selection

method in kernel equating. Applied Psychological Measurement, 38(4), 281-295.

Lord, F. M. (1980). Applications of item response theory to practical testing problems.

Hillsdale, NJ: Erlbaum.

Lord, F. M., & Wingersky, M. S. (1984). Comparison of IRT true-score and equipercentile

observed-score equatings. Applied Psychological Measurement, 8, 452-461.

R Core Development Team (2014). R: A language and Environment for Statistical

Computing. Vienna, Austria: R Foundation for statistical Computing. ISBN 3-900051-07-

0.

Rijmen, F., Qu, Y., and von Davier, A. A. (2011). Hypothesis testing of equating differences

in the kernel equating framework. In A. A. von Davier (Ed.) Statistical Models for Test

Equating, Scaling, and Linking. Chapter 19: pages 317–326. New York: Springer.

Skaggs, G & Lissitz, R. W. (1986). IRT Test Equating: Relevant Issues and a Review of

Recent Research. Review of Educational Research, 56(4), 495-529.

van der Linden, W. J. (2011). Local observed-score equating. In A. A. von Davier (Ed.),

Statistical models for equating, scaling, and linking. Chapter 13: pp. 201-223. New York:

Springer.

van der Linden, W. J., & Wiberg, M. (2010). Local observed-score equating with anchor-test

designs. Applied Psychological Measurement, 34, 620–640.

von Davier, A. A. (2011a). Statistical models for test equating, scaling, and linking. New

York: Springer.

von Davier, A. A. (2011b). A statistical perspective on equating test scores. In A. A. von

Davier (Ed.) Statistical models for test equating, scaling, and linking. Chapter 1: pp.1-17.

New York: Springer.

von Davier, A. A. (2013). Observed-score equating: An overview. Psychometrika. 78(4), 605-

623.

von Davier, A. A., Holland, P. W., & Thayer, D. T. (2004). The kernel method of test

equating. New York: Springer.

Yang, W-L., Bontya, A. M., and Moses, T. P. (2011). Repeater effects on score equating for a

graduate admissions exam. Research Report ETS RR-11-17.

Wiberg, M. & van der Linden, W. J. (2011). Linear local observed-score equating. Journal of

Educational Measurement, 48, 229-254.

Wiberg, M., van der Linden, W. J., & von Davier, A. A. (2014). Local observed-score kernel

equating. Journal of Educational Measurement, 51(1), 57-74.

Wilk, M. B., & Gnanadesikan, R. (1968). Probability plotting methods for the analysis of

data. Biometrika. 55, 1-17.
