
Evaluation in Education. 1982, Vol. 5, pp. 97-118 Printed in Great Britain. All rights reserved

CHAPTER 1

CRITERION-REFERENCED MEASUREMENT: ITS MAIN APPLICATIONS, PROBLEMS AND FINDINGS

Wim J. van der Linden

Technische Hogeschool Twente, Postbus 217, 7500 AE Enschede, The Netherlands

ABSTRACT

The need for criterion-referenced measurements has mainly arisen from the introduction of instructional programs organized according to modern principles from educational technology. Some of these programs are discussed, and it is indicated for what purposes criterion-referenced measurements are used. Three main problems of criterion-referenced measurement are distinguished: The problem of criterion-referenced scoring and score interpretation, the problem of criterion-referenced item and test analysis, and the problem of mastery testing. For each of these problems a variety of solutions and approaches have been suggested. It is the purpose of the paper to provide an overview of these and to introduce the reader to the original literature.

The recent rise of a number of new learning strategies has basically changed the meaning of measurement in education and made new demands on the construction, scoring, and analysis of educational tests. Educational measurements satisfying these demands are usually called criterion-referenced, while traditional measurements are often known as norm-referenced.

The common feature of these learning strategies is their objective-based character. All lead to instructional programs being set up and executed according to well-defined, clear-cut learning objectives. The organizational measures taken to realize these objectives differ, however. For example, in learning for mastery, one of the most popular new developments (Block, 1971b; Bloom, 1968; Bloom, Hastings, & Madaus, 1971, chap. 3), the following is done: First, the learning objectives are kept fixed during the implementation of the program. Second, the program is divided into a sequence of small learning units, and students are allowed to proceed to the next unit only after they have mastered the preceding one. Third, end-of-unit tests are administered providing students and instructors with quick feedback. One of the principal uses of these tests is to separate students who master the unit from those who do not (mastery testing). Fourth, when a student does not master the unit, he is given corrective learning materials or remedial teaching. Fifth, as extra learning time is needed for going through these materials and instruction, mastery learning allows some differentiation in tempo. There is no differentiation in level, however, since the learning objectives are kept fixed and equal for all students. (For these five properties of mastery learning and a further explanation thereof, see Warries, 1977a.)

The context in which mastery learning seeks to realize its goal is regular group instruction. In each unit all students receive the same instruction until the end-of-unit test. Although they have much in common with mastery learning, individualized instructional systems like the Pittsburgh Individually Prescribed Instruction (IPI) (Glaser, 1968) and Flanagan's Program for Learning According to Needs (PLAN) (Flanagan, 1967) differ in this respect. These systems allow students to reach the learning objectives in different ways. Each student is provided with his own route through the instructional units and with learning materials matching his entering behavior and aptitudes. Individualized instruction has mainly been motivated by the viewpoint underlying Aptitude Treatment Interaction (ATI) research (Cronbach & Snow, 1977), namely that subjects can react differently to treatments and that treatments which are best on the average may therefore be worst in individual cases. The methodology needed to test whether ATI's are present has been reviewed by Cronbach and Snow (1977, chaps. 1-4) and Plomp (1977). Decision rules for assigning students to treatments are given in van der Linden (1980d, 1981b) and Mijn (1980).

The most far-reaching mode of individualization takes place in Computer-Aided Instruction (CAI) (Atkinson, 1968; Suppes, 1966; Suppes, Smith, & Beard, 1975). In CAI the instruction is interactive and each next step depends on the student's response to the preceding one, thus creating a system in which the student is, within given boundaries, choosing his own individual instruction.

Compared with traditional educational approaches, one of the most conspicuous properties of the testing programs inherent in these modern learning strategies is their frequency of testing. At several points of time, tests are involved for several purposes. Beginning-of-unit tests describe the entering behavior and skills of students who are about to start with the unit. When the unit itself covers a sequence of objectives, scores on these tests can be used for deciding where to place students in this sequence. When individualization is based on ATI research, aptitude tests are usually administered to assign students to those treatments that promise best results.

Once the student has taken up the unit, intermediate tests can be administered for monitoring his learning. The principal use of these tests is to describe relevant aspects of the learning process, enabling students and instructor to re-adjust it when necessary. This intermediate use of tests is known as formative evaluation.

Tests are also used as end-of-unit tests for describing the level of mastery the student has attained when he has completed the unit. If the information from these tests serves as a basis for deciding whether the student has mastered the unit and can be moved up to the next unit, the testing procedure is known as mastery testing. End-of-unit tests may be used for diagnostic purposes as well; then they indicate the objectives for which the student's performances are poor, and which part of the unit he has to cover again.

In all these modes of testing, scores are used only for instructional purposes. In particular, they are not used for grading students (summative evaluation). For that purpose a separate test is ordinarily administered at the end of the program, covering the content of the units but ignoring the unit test scores given earlier.

For a further review of testing procedures in modern learning strategies, refer to Glaser and Nitko (1971), Hambleton (1974), Nitko and Hsu (1974), and Warries (1979b).

As already indicated, the introduction of such strategies as learning for mastery and individualized instruction has led to a change in the use and the interpretation of test scores. The modes of testing outlined above are all concerned with a behavioral description of students. By doing so, it is possible to control the learning process and to make optimal decisions, for example, on the placement of students in instructional units and their end-of-unit mastery level. Traditional testing procedures are, however, better suited for differentiating between subjects, and mostly serve as an instrument for (fixed-quota) selection. The psychometric analysis of these tests is generally adapted to this use.

The attempts to remove traditional assumptions from educational testing and to replace them by assumptions better adjusted to the use of tests in modern learning strategies arose in the beginning of the sixties. As a result the term "criterion-referenced measurement" was coined. Elsewhere (van der Linden, 1979) we have given a review in which three main problems of criterion-referenced measurement are distinguished, each first signaled in a different "historical" paper. These problems are: The problem of criterion-referenced scoring and score interpretation, the problem of criterion-referenced item and test analysis, and the problem of mastery decisions. Following are short introductions to these three problems.

CRITERION-REFERENCED SCORING AND SCORE INTERPRETATION

Glaser (1963), in his paper on instructional technology and the measurement of learning outcomes, confronted two possible uses of educational tests and their areas of application. The first is that tests can supply norm-referenced measurements. In norm-referenced measurement the performances of subjects are scored and interpreted with respect to each other. As the name indicates, there is always a norm group, and the interest is in the relative standing of the subjects to be tested in this group. This finds expression in such scoring methods as percentile scores, normalized scores, and age equivalents. Tests are constructed such that the relative positions of subjects come out as reliably as possible. An outstanding example of an area where norm-referenced measurements are needed is testing for selection (e.g., of applicants for a job). In such applications the test must be maximally differentiating in order to enable the employer to select the best applicants.

The second use is that tests can supply criterion-referenced measurements.

In criterion-referenced measurement the interest is not in using test scores for ranking subjects on the continuum measured by the test, but in carefully specifying the behavioral referents (the "criterion") pertaining to scores or points along this continuum. Measurements are norm-referenced when they indicate how much better or worse the performances of individual subjects are compared with those of other subjects in the norm group; they are criterion-referenced when they indicate what performances a subject with a given score is able to produce, and what his behavioral repertory is, without any reference to scores of other subjects. This descriptive use of test scores is needed in the testing programs of the learning strategies mentioned earlier. For a further clarification of the distinction between norm-referenced and criterion-referenced test score interpretations, we refer to Block (1971a), Carver (1974), Ebel (1962, 1971), Flanagan (1950), Glaser and Klaus (1962), Glaser and Nitko (1971), Glass (1978), Hambleton, Swaminathan, Algina and Coulson (1978), Linn (1980), Nitko (1980), and Popham (1978).

How to establish a relation between test scores and behavioral referents is what can be called the problem of criterion-referenced score interpretation. We have also called this the problem of local validity, since Glaser (1963) goes beyond the classical validity problem and does not ask for the validity of a test, which is the classical question, but for the validity of the interpretation of test scores. The validity of the test must, as it were, be locally specifiable for points on the continuum underlying the test (van der Linden, 1979).

Several answers have been proposed to the problem of criterion-referenced score interpretation, some of which will be mentioned here. One is a constructive approach based on the idea of referencing test scores to a domain of tasks (e.g., Hively, 1974; Hively, Maxwell, Rabehl, Sension, & Lundin, 1973; Hively, Patterson, & Page, 1968; Osburn, 1968). In this approach, the test is a random sample from a domain of tasks (or may be conceived of as such) defined by a clear-cut learning objective, and test scores are interpreted with respect to this domain. Domain-referenced testing usually involves the use of binomial test models and has been the most popular way of criterion-referencing so far. Another approach is an empirical method in which the congruence between items and objectives is determined by assessing how sensitive the items are to objective-based instruction. The objectives are then used to interpret the test performances. An example of this approach is Cox and Vargas' (1966) pretest-posttest method, which has led to a multitude of variants and modifications (for a review, see Berk, 1980b; van der Linden, 1981a). A third approach is subjective and uses content-matter specialists to judge the congruence between items and learning objectives (Hambleton, 1980; Rovinelli & Hambleton, 1977). Yet another approach has been followed by Cox and Graham (1966), who conceived the continuum to be referenced as a Guttman scale and used scalogram analysis to scale items pertaining to a sequence of learning objectives. By so doing, they were able to predict for points along the scale which arithmetical skills the students in their example were able to perform. Wright and Stone (1979, chap. 5) used the Rasch model to link points on the continuum underlying the test to behaviors. The crucial difference between scalogram analysis and this approach is that the former assumes Guttman item characteristic curves and allows only deterministic statements about behaviors, whereas use of the Rasch model entails probabilistic interpretations.
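The first, domain-referenced approach lends itself to a small numerical sketch. Under the binomial test model, the number-correct score on a test of n items randomly sampled from the domain follows a binomial distribution whose parameter is the student's domain score; the figures below are hypothetical and serve only as illustration.

```python
from math import comb

def domain_score_estimate(correct, n_items):
    """Point estimate of the domain score: the proportion of tasks in
    the whole domain the student is expected to perform correctly."""
    return correct / n_items

def binomial_pmf(correct, n_items, domain_score):
    """Probability of exactly `correct` right answers on n_items items
    randomly sampled from the domain, given the true domain score."""
    return (comb(n_items, correct)
            * domain_score ** correct
            * (1 - domain_score) ** (n_items - correct))

# Hypothetical student: 16 of 20 sampled items correct.
print(domain_score_estimate(16, 20))         # 0.8
print(round(binomial_pmf(16, 20, 0.8), 4))   # 0.2182
```

The estimate refers to the whole task domain, not to the particular 20 items drawn, which is exactly what gives the score its criterion-referenced interpretation.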

The above five approaches were mentioned because they have been most popular so far or indicate promising developments. For a more exhaustive review, we refer to Nitko (1980).

CRITERION-REFERENCED ITEM AND TEST ANALYSIS

Popham and Husek's (1969) paper is a second milestone in the history of criterion-referenced measurement. It goes beyond Glaser's paper in that it extends the distinction between norm-referenced and criterion-referenced measurement to item and test analysis. According to Popham and Husek, the two types of measurement differ so basically that classical models and procedures for item and test analysis are inadequate for criterion-referenced measurement and their outcomes sometimes even misleading.

The key word in the analysis of items and tests for norm-referenced measurement, with its emphasis on differentiating between subjects, is variance. Classical models and procedures rely on the presence of a large amount of variability of scores. In criterion-referenced measurement, however, this condition will seldom be fulfilled because it is not the variability of scores which is critical but their relation to a criterion. In this connection, Popham and Husek refer to classical reliability analysis. Consistency, both internal and temporal, is a desirable property not only of norm-referenced but of criterion-referenced measurements as well. Nevertheless, classical reliability coefficients are variance-dependent and thus often low for criterion-referenced measurements. For this reason, they plead for new models and procedures for item and test analysis. These must be models and procedures which, more than classical test theory, make allowance for the specific requirements criterion-referenced measurement imposes on item construction, test composition, scoring, and score interpretation.

Popham and Husek's plea has led to a variety of proposals. A proposal to adapt classical test theory for use with criterion-referenced measurement is due to Livingston (1972a). He argued that in criterion-referenced measurement we are no longer interested in estimating the deviations of individual true scores from the mean but in estimates of deviations from the cut-off score. Central in his approach is a criterion-referenced reliability coefficient defined as the ratio of true to observed score variance about the cut-off score. Livingston's proposal stimulated others to contribute (Harris, 1972; Lovett, 1978; Shavelson, Block, & Ravitch, 1972; see also Livingston, 1972b, 1972c, 1973), and has had the merit of making a larger public aware of the issue of criterion-referenced test analysis.
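Taking variance about the cut-off score rather than the mean can be shown in a short computation. The sketch below uses the standard form of Livingston's coefficient together with the classical identities (true score variance equals reliability times observed variance; true and observed means coincide); all numbers are hypothetical.

```python
def livingston_k2(reliability, var_obs, mean_obs, cutoff):
    """Livingston's criterion-referenced reliability coefficient: the
    ratio of true to observed score variance, both taken about the
    cut-off score instead of the mean."""
    shift = (mean_obs - cutoff) ** 2
    true_var = reliability * var_obs  # classical identity
    return (true_var + shift) / (var_obs + shift)

# With the cut-off at the mean, the coefficient reduces to classical
# reliability; moving the cut-off away from the mean raises it.
print(livingston_k2(0.70, 25.0, 30.0, 30.0))            # 0.7
print(round(livingston_k2(0.70, 25.0, 30.0, 24.0), 3))  # 0.877
```

The second value illustrates the property critics later seized on: the coefficient can be high merely because the cut-off lies far from the group mean, even when the test itself is not very reliable.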


As noted above, random sampling of tests from item domains has been the most popular way of criterion-referencing test scores. However, when tests are sampled and the concern is with domain scores and not with true test scores, classical test theory does not hold (unless all items in the domain have equal difficulty). Brennan and Kane (1977a) proposed to use generalizability theory when domain sampling takes place and transformed Livingston's reliability coefficient into what they called an index of dependability. At the same time they presented a modification of this index which can be interpreted as a signal-noise ratio for domain-referenced measurements, while later two general agreement coefficients were given from which these indices can be derived as special instances (Brennan & Kane, 1977b; Kane & Brennan, 1980). A summary of these developments is given in Brennan (1980).

In the foregoing approaches the emphasis is on developing test theory reflecting the reliability or dependability with which deviations of individual true or domain scores from the cut-off score can be estimated. It can be argued, however, that when criterion-referenced tests are used for mastery decisions the concern should not be so much with these deviations as with the "reliability" and "validity" of the decisions. Since approaches along this line belong to the content of the next section, we will postpone a discussion of their results until then.

The first proposal of a criterion-referenced item analysis was made by Cox and Vargas (1966). As already indicated in the preceding section, their pretest-posttest method of item validation is based on the idea that criterion-referenced items ought to be sensitive to objective-based instruction. It measures this sensitivity by the difference in item p-value before and after instruction. The rationale of the method is discussed in Coulson and Hambleton (1974), Cox (1971), Edmonston and Randall (1972), Hambleton and Gorth (1971), Henrysson and Wedman (1973), Millman (1974), Roudabush (1973), Rovinelli and Hambleton (1977), and Wedman (1973, 1974a, 1974b). Slightly different approaches can be found in Brennan and Stolurow (1971), Harris (1976), Herbig (1975, 1976), Kosecoff and Klein (1974), and Roudabush (1973). Popham (1971) offers a chi-squared test for detecting items with atypical differences in p-value. Saupe's (1966) correlation between item and total change score is often considered the counterpart of the norm-referenced item-test correlation.
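The pretest-posttest index is simple enough to state directly. In the hypothetical sketch below, an item's sensitivity to instruction is the gain in its p-value (proportion of students answering correctly) from pretest to posttest; values near 1 point to an instruction-sensitive item, values near 0 or below to an insensitive one.

```python
def item_sensitivity(pre_correct, post_correct, n_students):
    """Cox and Vargas' pretest-posttest index: the difference between
    the item p-value after and before objective-based instruction."""
    return post_correct / n_students - pre_correct / n_students

# Hypothetical item: 8 of 40 students correct before instruction,
# 34 of 40 correct after it.
print(round(item_sensitivity(8, 34, 40), 2))  # 0.65
```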

A critical review of item analysis based on the pretest-posttest method is given in van der Linden (1981a), who also offers an alternative derived from latent trait theory. (See also van Naerssen, 1977a, 1977b.) Other reviews of criterion-referenced item analysis procedures are given in Berk (1980b) and Hambleton (1980). For reviews of test analysis procedures, we refer to Berk (1980a), Hambleton, Swaminathan, Algina, and Coulson (1978), and Linn (1979).

Popham and Husek's diagnosis of the role of test score variance in the analysis of criterion-referenced measurement has been a hotly-debated issue (Haladyna, 1974a, 1974b; Millman & Popham, 1974; Simon, 1969; Woodson, 1974a, 1974b). Our own opinion is that, although it has given rise to many worthwhile developments and comes close to a fundamental problem in test theory, it is erroneous as far as the classical test model is concerned. Apart from the requirement of finite variance, this model does not contain any assumption with respect to score variance. (A full statement of these assumptions can be found in, e.g., Lord & Novick, 1968, section 3.1.) Low variability of test scores in criterion-referenced measurement can therefore never invalidate the classical test model. Latent trait theory has called attention to the more fundamental problem of population-dependent item and test parameters and offers models in which these are replaced by parameters that are not only variance-independent but independent of any distributional characteristic (e.g., Birnbaum, 1968; Lord, 1980; Wright & Stone, 1979; for an informal introduction, see van der Linden, 1978b). In fact, Popham and Husek's paper alludes to this problem but wrongly claims it as an exclusive problem of criterion-referenced measurement.

MASTERY DECISIONS

Hambleton and Novick (1973) were the first to introduce a decision-theoretic approach to criterion-referenced measurement. In modern learning strategies, such as the ones briefly reviewed above, the ultimate use of criterion-referenced measurements is mostly not measuring students but making instructional decisions. Hambleton and Novick compare the use of norm-referenced and criterion-referenced measurements with the concepts of fixed-quota and quota-free selection (Cronbach & Gleser, 1965).

Norm-referenced tests are needed to differentiate between subjects in order to select a predetermined number of best performing persons, regardless of their actual level of performance (fixed-quota selection). Criterion-referenced tests are mostly used to select all persons exceeding a performance level fixed in advance, regardless of their actual number (quota-free selection). With criterion-referenced tests, this fixed level of performance is known as the mastery score, and selection with this score as mastery testing.

It is important to note that in mastery testing there is thus only one point on the criterion-referenced continuum that counts: the mastery score. This score divides the continuum into mastery and non-mastery areas.

There is an alternative conception of mastery which is not based on a continuum model, as the former is, but on a state model. In this conception mastery and non-mastery are viewed as two latent states, each characterized by a different set of probabilities of a successful response to the test items. There is no state between mastery and non-mastery, and it is not necessary to set a mastery score, as with continuum models. By fitting the model to test data it is left to nature to define who is a master and who is not. Relevant references to the literature on state models are Bergan, Cancelli, and Luiten (1980), Besel (1973), Dayton and Macready (1976, 1980), Emrick (1971), Emrick and Adams (1969), Macready and Dayton (1977, 1980a, 1980b), Reulecke (1977a, 1977b), van der Linden (1980c, 1981c, 1981d, 1981e), Wilcox (1979a), and Wilcox and Harris (1977). For a discussion of the differences between continuum and state models, we refer to Meskauskas (1976) and van der Linden (1978a).
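A minimal sketch of such a two-state model, with all parameter values hypothetical: each latent state carries its own success probability per item, the likelihood of a response pattern is a mixture over the two states, and Bayes' rule then yields the posterior probability that the student is a master.

```python
def state_likelihood(pattern, success_probs):
    """Probability of an item-response pattern (1 = correct, 0 = wrong)
    given the per-item success probabilities of one latent state."""
    prob = 1.0
    for answer, p in zip(pattern, success_probs):
        prob *= p if answer else (1.0 - p)
    return prob

def posterior_mastery(pattern, prior_master, master_probs, nonmaster_probs):
    """Posterior probability of the mastery state given the pattern,
    by Bayes' rule over the two latent states."""
    joint_master = prior_master * state_likelihood(pattern, master_probs)
    joint_non = (1.0 - prior_master) * state_likelihood(pattern, nonmaster_probs)
    return joint_master / (joint_master + joint_non)

# Hypothetical four-item test: masters succeed with probability 0.9 per
# item, non-masters with 0.3; 60% of the population are masters.
post = posterior_mastery((1, 1, 1, 0), 0.6, [0.9] * 4, [0.3] * 4)
print(round(post, 3))  # 0.853
```

In practice the state parameters are estimated from the test data (e.g., by maximum likelihood, as in the models of Macready and Dayton), and a student is classified into the state with the larger posterior probability; no mastery score on a continuum enters the procedure.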

Though mastery is defined using true scores or states, mastery decisions are made with test scores containing measurement errors. Due to this, decision errors can be made, and a student can be wrongly classified as a master (false positive errors) or a non-master (false negative errors). An important mastery testing problem is how to choose a cut-off score on the test so that decisions are made as optimally as possible. A second problem is the psychometric analysis of mastery decisions. Classical psychometric analysis views tests as instruments for making measurements, and this point of view has pervaded its models and procedures. When tests are used for making decisions, however, this point of view is no longer correct.

Hambleton and Novick (1973) proposed the use of Bayesian decision theory to optimize the cut-off score on the test. Important in applying decision theory is the choice of the loss function representing the "seriousness" of the decision outcomes on a numerical scale. Several have been used so far: threshold loss, linear loss, and normal ogive loss functions (for a comparison, see van der Linden, 1980a). The threshold loss function is used, e.g., in Hambleton and Novick (1973), Huynh (1976), and Mellenbergh, Koppelaar, and van der Linden (1977). Livingston (1975) and van der Linden and Mellenbergh (1977) show how semi-linear and linear loss functions, respectively, can be used to select optimal cut-off scores. The use of normal ogive loss functions is suggested by Novick and Lindley (1978).
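Under threshold loss the optimization takes a simple form, sketched below with hypothetical numbers: pass a student at a given score whenever the expected loss of passing (the false-positive loss weighted by the posterior probability of non-mastery) does not exceed the expected loss of failing; with a posterior mastery probability that is nondecreasing in the score, the optimal cut-off is the smallest score for which this holds.

```python
def threshold_loss_cutoff(posterior_by_score, loss_fp, loss_fn):
    """Smallest observed score at which passing is optimal under
    threshold loss.  posterior_by_score[x] is P(master | score x),
    assumed nondecreasing in x."""
    for score, p_master in enumerate(posterior_by_score):
        # expected loss of passing vs. expected loss of failing
        if loss_fp * (1.0 - p_master) <= loss_fn * p_master:
            return score
    return len(posterior_by_score)  # passing is never optimal

# Hypothetical posteriors for scores 0..5 on a five-item test; a false
# positive is considered twice as costly as a false negative.
posteriors = [0.05, 0.15, 0.35, 0.60, 0.85, 0.97]
print(threshold_loss_cutoff(posteriors, loss_fp=2.0, loss_fn=1.0))  # 4
```

With these loss ratios, passing becomes optimal as soon as the posterior mastery probability exceeds 2/3, which first happens at a score of 4.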

All these applications of decision theory to the optimal cut-off score problem can be used with a Bayesian as well as an empirical Bayes interpretation. Fully Bayesian approaches are presented and discussed in Hambleton, Hutten, and Swaminathan (1976), Lewis, Wang, and Novick (1975), and Swaminathan, Hambleton, and Algina (1975).

It is also possible to base the making of mastery decisions on the Neyman-Pearson theory of testing statistical hypotheses. Then no explicit loss functions are chosen, but the consequences of false positive and false negative errors are evaluated by determining the sizes of the type I and type II errors of testing the hypothesis of mastery. Approaches along this line, all based on binomial error models, can be found in Fhaner (1974), Klauer (1972), Kriewall (1972), Millman (1973), van den Brink and Koele (1980), and Wilcox (1976).
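For a binomial error model the two error sizes can be computed directly; the sketch below uses hypothetical domain scores for a master and a non-master. The type I error is the probability that a true master fails to reach the cut-off, the type II error the probability that a non-master does reach it.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for a binomial(n, p) number-correct score."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def mastery_error_rates(cutoff, n_items, pi_master, pi_nonmaster):
    """Type I and type II error sizes for the rule 'declare mastery
    when the number-correct score is at least cutoff', under binomial
    error models for a master and a non-master."""
    type_1 = binom_cdf(cutoff - 1, n_items, pi_master)           # master fails
    type_2 = 1.0 - binom_cdf(cutoff - 1, n_items, pi_nonmaster)  # non-master passes
    return type_1, type_2

# Hypothetical: 10 items, cut-off 8; masters have domain score 0.9,
# non-masters 0.6.
alpha, beta = mastery_error_rates(8, 10, 0.9, 0.6)
print(round(alpha, 4), round(beta, 4))  # 0.0702 0.1673
```

Raising the cut-off or lengthening the test trades the two error sizes off against each other, which is exactly the choice the Neyman-Pearson formulation makes explicit.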

A minimax solution to the optimal cut-off score problem, which resembles the (empirical) Bayes approach in that the losses must be specified explicitly but does not assume the availability of (subjective) information about true scores, is presented in Huynh (1980a).

The problem of the analysis of mastery decisions was also addressed by Hambleton and Novick (1973). They proposed to determine the reliability of mastery decisions by assessing the consistency of decisions from test-retest or parallel test administrations using the coefficient of agreement. Similarly, an independently measured criterion could be used for the determination of the validity of mastery decisions. Swaminathan, Hambleton, and Algina (1974) suggested replacing this coefficient, which had already been proposed for this purpose by Carver (1970), by the chance-corrected coefficient kappa of Cohen (1960). These two coefficients presuppose the availability of two test administrations. Both Huynh (1976) and Subkoviak (1976) presented a single-administration method of estimating the coefficients, derived using the beta-binomial model and an equivalent set of assumptions, respectively. Another single-administration method was given by Marshall and Haertel (1976). All these methods have been extensively examined and compared to each other: Algina and Noe (1978), Berk (1980a), Divgi (1980), Huynh (1978, 1979), Huynh and Saunders (1980), Marshall and Serlin (1979), Peng and Subkoviak (1980), Subkoviak (1978, 1980), Traub and Rowley (1980), and Wilcox (1979e).
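Both coefficients can be computed from a 2x2 table of mastery decisions on two administrations; the counts below are hypothetical. The coefficient of agreement is the proportion of students classified the same way twice, and Cohen's kappa corrects it for the agreement expected by chance from the marginal proportions.

```python
def decision_consistency(table):
    """Coefficient of agreement p0 and Cohen's kappa for a 2x2 table
    of counts: rows = decision (master, non-master) on administration
    1, columns = the same decision on administration 2."""
    n = sum(sum(row) for row in table)
    p0 = (table[0][0] + table[1][1]) / n
    row_master = sum(table[0]) / n
    col_master = (table[0][0] + table[1][0]) / n
    # agreement expected by chance from the marginals
    p_chance = row_master * col_master + (1 - row_master) * (1 - col_master)
    kappa = (p0 - p_chance) / (1 - p_chance)
    return p0, kappa

# Hypothetical counts: 50 master/master, 10 master/non-master,
# 8 non-master/master, 32 non-master/non-master.
p0, kappa = decision_consistency([[50, 10], [8, 32]])
print(round(p0, 3), round(kappa, 3))  # 0.82 0.628
```

The gap between the two values shows what the chance correction does: of the raw 82% agreement, a sizable part would arise even from independent classifications with the same marginals.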

The idea that the "reliability" of decisions can be determined via their consistency goes back to the assumption that classical test theory holds for decisions just as it does for measurements. Mellenbergh and van der Linden (1979) have shown that this assumption is incorrect and that test-retest or parallel-test consistency of decisions does not necessarily reflect their accuracy. They recommend the use of coefficient delta (van der Linden & Mellenbergh, 1978), which indicates how optimal the decisions actually made are with respect to the true mastery and non-mastery states. Comparable approaches have been undertaken by de Gruijter (1978), Livingston and Wingersky (1979), and Wilcox (1978a).

Also relevant to the issue of mastery setting with continuum models are techniques for setting mastery scores. These techniques all somehow translate the learning objectives into a mastery score on the true-score continuum underlying the test. Hence, they should be distinguished from the decision-theoretic approaches mentioned above which, once the mastery score on the continuum has been set, indicate how to optimally select the cut-off score on the test. Several techniques of setting mastery scores have been proposed. Reviews of these techniques are given, e.g., in Glass (1978), Hambleton (1980), Jaeger (1979), and Shepard (1979, 1980). An approach accounting for possible uncertainty in setting standards is given in de Gruijter (1980).

CONCLUDING REMARKS

In this paper we have presented an overview of the way criterion-referenced measurement is used in modern instructional programs, what its main problem areas are, and how these have been approached. The purpose of the paper was to summarize developments and results and to provide an introduction to the original literature.

It should be noted, however, that not all aspects of criterion-referenced measurement have been considered. For example, we did not refer to developments in the area of criterion-referenced item writing (Millman, 1980; Popham, 1980; Roid & Haladyna, 1980), developing guidelines for evaluating criterion-referenced tests and their manuals (Hambleton & Eignor, 1978), estimating domain scores and binomial model parameters (Hambleton, Hutten, & Swaminathan, 1976; Jackson, 1972; Novick, Lewis, & Jackson, 1972; Wilcox, 1978b, 1979d), determining the length of a mastery test (Fhaner, 1974; Hsu, 1980; Klauer, 1972; Kriewall, 1972; Millman, 1973; Novick & Lewis, 1974; van den Brink & Koele, 1980; van der Linden, 1980b; Wilcox, 1976, 1979b, 1980a, 1980b), and estimating proportions of misclassification in mastery testing (Huynh, 1980b; Koppelaar, van der Linden, & Mellenbergh, 1977; Mellenbergh, Koppelaar, & van der Linden, 1977; Subkoviak & Wilcox, 1978; Wilcox, 1977, 1979c). Several of these topics are also dealt with in the review by Hambleton et al. (1978) referred to earlier and in the recent Applied Psychological Measurement (1980) special issue on criterion-referenced measurement.

This paper has thus not touched on all aspects of criterion-referenced measurement. Yet, we hope it has given an impression of the directions the field is taking. Three "historical" papers have guided us in exploring the field, and we were able to observe that a great variety of solutions have been formulated in response to these papers. Most of these solutions promise important improvements of educational testing technology and, thereby, of educational practice.

REFERENCES

Algina, J. & Noe, M.J. A study of the accuracy of Subkoviak's single administration estimate of the coefficient of agreement using two true-score estimates. Journal of Educational Measurement, 15, 101-110, 1978.

Applied Psychological Measurement. Methodological contributions to criterion-referenced testing technology, 4(4), 1980.

Atkinson, R.C. Computer based instruction in initial reading. In Proceedings of the 1967 Invitational Conference on Testing Problems. Princeton, N.J.: Educational Testing Service, 1968.

Bergan, J.R., Cancelli, A.A., & Luiten, J.W. Mastery assessment with latent class and quasi-independence models representing homogeneous item domains. Journal of Educational Statistics, 5, 65-82, 1980.

Berk, R.A. A consumers' guide to criterion-referenced test reliability. Journal of Educational Measurement, 17, 323-350, 1980a.

Berk, R.A. Item analysis. In R.A. Berk (Ed.), Criterion-referenced measurement: The state of the art. Baltimore, MD: The Johns Hopkins University Press, 1980b.


Besel, R. Using group performance to interpret individual responses to criterion-referenced tests. Paper presented at the Annual Meeting of the American Educational Research Association, New Orleans, Louisiana, 1973. (EDRS No. ED 076 658)

Birnbaum, A. Some latent trait models and their use in inferring an examinee's ability. In F.M. Lord & M.R. Novick, Statistical theories of mental test scores. Reading, Mass.: Addison-Wesley, 1968.

Block, J.H. Criterion-referenced measurements: Potential. School Review, 79, 289-298, 1971a.

Block, J.H. (Ed.) Mastery Learning: Theory and Practice. New York: Holt, Rinehart and Winston, 1971b.

Bloom, B.S. Learning for mastery. Evaluation Comment, 1 (2), 1968.

Bloom, B.S., Hastings, J.T., & Madaus, G.F. Handbook on formative and summative evaluation of student learning. New York, N.Y.: McGraw-Hill, 1971.

Brennan, R.L. Applications of generalizability theory. In R.A. Berk (Ed.), Criterion-referenced measurement: The state of the art. Baltimore, MD: The Johns Hopkins University Press, 1980.

Brennan, R.L. & Kane, M.T. An index of dependability for mastery tests. Journal of Educational Measurement, 14, 227-289, 1977a.

Brennan, R.L. & Kane, M.T. Signal/noise ratios for domain-referenced tests. Psychometrika, 42, 609-625, 1977b.

Brennan, R.L. & Stolurow, L.M. An empirical decision process for formative evaluation. Research Memorandum No. 4. Harvard CAI Laboratory, Cambridge, Mass., 1971. (EDRS No. ED 048 343)

Carver, R.P. Two dimensions of tests: Psychometric and edumetric. American Psychologist, 29, 512-518, 1974.

Cohen, J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46, 1960.

Coulson, D.B. & Hambleton, R.K. On the validation of criterion-referenced tests designed to measure individual mastery. Paper presented at the Annual Meeting of the American Psychological Association, New Orleans, 1974.

Cox, R.C. Evaluative aspects of criterion-referenced measures. In W.J. Popham (Ed.), Criterion-referenced measurement: An introduction. Englewood Cliffs, N.J.: Educational Technology Publications, 1971.


Cox, R.C. & Graham, G.T. The development of a sequentially scaled achievement test. Journal of Educational Measurement, 3, 147-150, 1966.

Cox, R.C. & Vargas, J.S. A comparison of item selection techniques for norm-referenced and criterion-referenced tests. Paper presented at the Annual Meeting of the National Council on Measurement in Education, Chicago, Illinois, 1966. (EDRS No. ED 010 517)

Cronbach, L.J. & Gleser, G.C. Psychological tests and personnel decisions (2nd Ed.). Urbana, Ill.: University of Illinois Press, 1965.

Cronbach, L.J. & Snow, R.E. Aptitudes and instructional methods: A handbook for research on interactions. New York: Irvington Publishers, Inc., 1977.

Dayton, C.M. & Macready, G.B. A probabilistic model for validation of behavioral hierarchies. Psychometrika, 41, 189-204, 1976.

Dayton, C.M. & Macready, G.B. A scaling model with response errors and intrinsically unscalable respondents. Psychometrika, 45, 343-356, 1980.

de Gruijter, D.N.M. A criterion-referenced point biserial correlation coefficient. Tijdschrift voor Onderwijsresearch, 3, 257-261, 1978.

de Gruijter, D.N.M. Accounting for uncertainty in performance standards. Paper presented at the Fourth International Symposium on Educational Testing, Antwerp, Belgium, June 24-27, 1980.

Divgi, D.R. Group dependency of some reliability indices for mastery tests. Applied Psychological Measurement, 4, 213-218, 1980.

Ebel, R.L. Content standard test scores. Educational and Psychological Measurement, 22, 15-25, 1962.

Ebel, R.L. Criterion-referenced measurements: Limitations. School Review, 79, 282-288, 1971.

Edmonston, L.P. & Randall, R.L. A model for estimating the reliability and validity of criterion-referenced measures. Paper presented at the Annual Meeting of the American Educational Research Association, Chicago, Illinois, 1972. (EDRS No. ED 965 591)

Emrick, J.A. An evaluation model for mastery testing. Journal of Educational Measurement, 8, 321-326, 1971.

Emrick, J.A. & Adams, F.N. An evaluation model for individualized instruction (Report RC 2674). Yorktown Hts., N.Y.: IBM Thomas J. Watson Research Center, 1969.

Fhaner, S. Item sampling and decision-making in achievement testing. British Journal of Mathematical and Statistical Psychology, 27, 172-175, 1974.


Flanagan, J.C. Units, scores and norms. In E.F. Lindquist (Ed.), Educational Measurement. Washington, D.C.: American Council on Education, 1951.

Flanagan, J.C. Functional education for the seventies. Phi Delta Kappan, 49, 27-32, 1967.

Glaser, R. Instructional technology and the measurement of learning outcomes: Some questions. American Psychologist, 18, 519-521, 1963.

Glaser, R. Adapting the elementary school curriculum to individual performance. In Proceedings of the 1967 Invitational Conference on Testing Problems. Princeton, N.J.: Educational Testing Service, 1968.

Glaser, R. & Cox, R.C. Criterion-referenced testing for the measurement of educational outcomes. In R.A. Weisgerber (Ed.), Instructional process and media innovation. Chicago: Rand McNally, 1968.

Glaser, R. & Klaus, D.H. Proficiency measurement: Assessing human performance. In R. Gagne (Ed.), Psychological principles in system development. New York: Holt, Rinehart, & Winston, 1962.

Glaser, R. & Nitko, A.J. Measurement in learning and instruction. In R.L. Thorndike (Ed.), Educational Measurement. Washington, D.C.: American Council on Education, 1971.

Glass, G.V. Standards and criteria. Journal of Educational Measurement, 15, 237-261, 1978.

Haladyna, T.M. Effects of different examples on item and test characteristics of criterion-referenced tests. Journal of Educational Measurement, 11, 93-99, 1974a.

Haladyna, T.M. An investigation of full- and subscale reliabilities of criterion-referenced tests. Paper presented at the Annual Meeting of the American Educational Research Association, Chicago, Illinois, 1974b. (EDRS No. ED 091 435)

Hambleton, R.K. Testing and decision-making procedures for selected individualized instructional programs. Review of Educational Research, 44, 371-400, 1974.

Hambleton, R.K. Test score validity and standard-setting methods. In R.A. Berk (Ed.), Criterion-referenced measurement: The state of the art. Baltimore, MD: The Johns Hopkins University Press, 1980.

Hambleton, R.K. & Eignor, D.R. Guidelines for evaluating criterion-referenced tests and test manuals. Journal of Educational Measurement, 15, 321-327, 1978.


Hambleton, R.K. & Gorth, W.P. Criterion-referenced testing: Issues and applications (Report No. TR-13). Amherst: University of Massachusetts, School of Education, 1971. (EDRS No. ED 060 025)

Hambleton, R.K., Hutten, L.R., & Swaminathan, H. A comparison of several methods for assessing student mastery in objectives-based instructional programs. Journal of Experimental Education, 45, 57-64, 1976.

Hambleton, R.K. & Novick, M.R. Toward an integration of theory and method for criterion-referenced tests. Journal of Educational Measurement, 10, 159-170, 1973.

Hambleton, R.K., Swaminathan, H., Algina, J., & Coulson, D.B. Criterion-referenced testing and measurement: A review of technical issues and developments. Review of Educational Research, 48, 1-47, 1978.

Harris, C.W. An interpretation of Livingston's reliability coefficient for criterion-referenced tests. Journal of Educational Measurement, 9, 27-29, 1972.

Harris, N.D.C. An index of effectiveness for criterion-referenced items used in pre-tests and post-tests. Programmed Learning and Educational Technology, 11, 125-132, 1974.

Henrysson, S. & Wedman, I. Some problems in construction and evaluation of criterion-referenced tests. Scandinavian Journal of Educational Research, 18, 1-12, 1973.

Herbig, M. Zur Vortest-Nachtest-Validierung lehrzielorientierter Tests. Zeitschrift für Erziehungswissenschaftliche Forschung, 9, 112-126, 1975.

Herbig, M. Item analysis by use in pretests and posttests: A comparison of different coefficients. Programmed Learning and Educational Technology, 13, 49-54, 1976.

Hively, W. Introduction to domain-referenced testing. Educational Technology, 14, 5-10, 1974.

Hively, W., Maxwell, G., Rabehl, G., Sension, D., & Lundin, S. Domain-referenced curriculum evaluation: A technical handbook and a case study from the Minnemast Project (CSE Monograph Series in Evaluation No. 1). Los Angeles: Center for the Study of Evaluation, 1973.

Hively, W., Patterson, H.L., & Page, S.H. A universe-defined system of arithmetic achievement tests. Journal of Educational Measurement, 5, 275-290, 1968.

Hsu, L.M. Determination of the number of items and passing score in a mastery test. Educational and Psychological Measurement, 40, 709-714, 1980.


Huynh, H. On the reliability of decisions in domain-referenced testing. Journal of Educational Measurement, 13, 253-264, 1976a.

Huynh, H. Statistical considerations of mastery scores. Psychometrika, 41, 65-79, 1976b.

Huynh, H. Reliability of multiple classifications. Psychometrika, 43, 317-326, 1978.

Huynh, H. Statistical inference for two reliability indices in mastery testing based on the beta-binomial model. Journal of Educational Statistics, 4, 231-246, 1979.

Huynh, H. A nonrandomized minimax solution for passing scores in the binomial error model. Psychometrika, 45, 167-182, 1980a.

Huynh, H. Statistical inference for false positive and false negative error rates in mastery testing. Psychometrika, 45, 107-120, 1980b.

Huynh, H. & Saunders, J.C. Accuracy of two procedures for estimating reliability of mastery tests. Journal of Educational Measurement, 17, 351-358, 1980.

Jackson, P.H. Simple approximations in the estimation of many parameters. British Journal of Mathematical and Statistical Psychology, 25, 213-229, 1972.

Jaeger, R.M. Measurement consequences of selected standard-setting models. In M.A. Bunda and J.R. Sanders (Eds.), Practices and problems in competency-based measurement. Washington, D.C.: National Council on Measurement in Education, 1979.

Kane, M.T. & Brennan, R.L. Agreement coefficients as indices of dependability for domain-referenced tests. Applied Psychological Measurement, 4, 105-126, 1980.

Klauer, K.J. Zur Theorie und Praxis des binomialen Modells lehrzielorientierter Tests. In K.J. Klauer, R. Fricke, M. Herbig, H. Rupprecht, & F. Schott, Lehrzielorientierte Tests. Düsseldorf: Schwann, 1972.

Koppelaar, H., van der Linden, W.J., & Mellenbergh, G.J. A computer program for classification proportions in dichotomous decisions based on dichotomously scored items. Tijdschrift voor Onderwijsresearch, 2, 32-37, 1977.

Kosecoff, J.B. & Klein, S.P. Instructional sensitivity statistics appropriate for objective-based test items (CSE Report No. 91). Los Angeles, Cal.: University of California, Center for the Study of Evaluation, 1974.


Kriewall, T.E. Aspects and applications of criterion-referenced tests. Illinois School Research, 9, 5-18, 1972.

Lewis, C., Wang, M., & Novick, M.R. Marginal distributions for the estimation of proportions in m groups. Psychometrika, 40, 63-75, 1975.

Linn, R.L. Issues of reliability in measurement for competency-based programs. In M.A. Bunda and J.R. Sanders (Eds.), Practices and problems in competency-based measurement. Washington, D.C.: National Council on Measurement in Education, 1979.

Linn, R.L. Issues of validity for criterion-referenced measures. Applied Psychological Measurement, 4, 547-561, 1980.

Livingston, S.A. Criterion-referenced applications of classical test theory. Journal of Educational Measurement, 9, 13-26, 1972a.

Livingston, S.A. A reply to Harris' "An interpretation of Livingston's reliability coefficient for criterion-referenced tests". Journal of Educational Measurement, 9, 31, 1972b.

Livingston, S.A. Reply to Shavelson, Block, and Ravitch's "Criterion-referenced testing: Comments on reliability". Journal of Educational Measurement, 9, 139-140, 1972c.

Livingston, S.A. A note on the interpretation of the criterion-referenced reliability coefficient. Journal of Educational Measurement, 10, 311, 1973.

Livingston, S.A. A utility-based approach to the evaluation of pass/fail testing decision procedures (COPA-75-01). Princeton, N.J.: Center for Occupational and Professional Assessment, Educational Testing Service, 1975.

Livingston, S.A. & Wingersky, M. Assessing the reliability of tests used to make pass/fail decisions. Journal of Educational Measurement, 16, 247-260, 1979.

Lord, F.M. Applications of item response theory to practical testing problems. Hillsdale, New Jersey: Lawrence Erlbaum Associates, 1980.

Lord, F.M. & Novick, M.R. Statistical theories of mental test scores. Reading, Mass.: Addison-Wesley, 1968.

Lovett, H.T. Criterion-referenced reliability estimated by ANOVA. Educational and Psychological Measurement, 37, 21-29, 1977.

Macready, G.B. & Dayton, C.M. The use of probabilistic models in the assessment of mastery. Journal of Educational Statistics, 2, 99-120, 1977.


Macready, G.B. & Dayton, C.M. The nature and use of state mastery models. Applied Psychological Measurement, 4, 493-516, 1980a.

Macready, G.B. & Dayton, C.M. A two-stage conditional estimation procedure for unrestricted latent class models. Journal of Educational Statistics, 5, 129-156, 1980b.

Marshall, J.L. & Haertel, E.H. A single-administration reliability index for criterion-referenced tests: The mean split-half coefficient of agreement. Paper presented at the Annual Meeting of the American Educational Research Association, Washington, D.C., 1975.

Marshall, J.L. & Serlin, R.C. Characteristics of four mastery test reliability indices: Influence of distribution shape and cutting score. Paper presented at the Annual Meeting of the American Educational Research Association, San Francisco, California, April 8-12, 1979.

Mellenbergh, G.J., Koppelaar, H., & van der Linden, W.J. Dichotomous decisions based on dichotomously scored items: A case study. Statistica Neerlandica, 31, 161-169, 1977.

Mellenbergh, G.J. & van der Linden, W.J. The internal and external optimality of decisions based on tests. Applied Psychological Measurement, 3, 257-273, 1979.

Meskauskas, J.A. Evaluation models for criterion-referenced testing: Views regarding mastery and standard-setting. Review of Educational Research, 46, 133-158, 1976.

Millman, J. Passing scores and test lengths for domain-referenced measures. Review of Educational Research, 43, 205-216, 1973.

Millman, J. Criterion-referenced measurement. In W.J. Popham (Ed.), Evaluation in education. Berkeley, California: McCutchan, 1974.

Millman, J. Computer-based item generation. In R.A. Berk (Ed.), Criterion-referenced measurement: The state of the art. Baltimore, MD: The Johns Hopkins University Press, 1980.

Millman, J. & Popham, W.J. The issue of item and test variance for criterion-referenced tests: A clarification. Journal of Educational Measurement, 11, 137-138, 1974.

Nitko, A.J. Distinguishing the many varieties of criterion-referenced tests. Review of Educational Research, 50, 461-485, 1980.

Nitko, A.J. & Hsu, T.L. Using domain-referenced tests for student placement, diagnosis and attainment in a system of adaptive, individualized instruction. Educational Technology, 14, 48-54, 1974.


Novick, M.R. & Lewis, C. Prescribing test length for criterion-referenced measurement. In C.W. Harris, M.C. Alkin, & W.J. Popham, Problems in criterion-referenced measurement (CSE Monograph Series in Evaluation, No. 3). Los Angeles: Center for the Study of Evaluation, University of California, 1974.

Novick, M.R., Lewis, C., & Jackson, P.H. The estimation of proportions in m groups. Psychometrika, 38, 19-46, 1972.

Novick, M.R. & Lindley, D.V. The use of more realistic utility functions in educational applications. Journal of Educational Measurement, 15, 181-191, 1978.

Osburn, H.G. Item sampling for achievement testing. Educational and Psychological Measurement, 28, 95-104, 1968.

Peng, C.-Y.J. & Subkoviak, M.J. A note on Huynh's normal approximation for estimating criterion-referenced reliability. Journal of Educational Measurement, 17, 359-368, 1980.

Plomp, Tj. Enkele methodologische en statistische aspecten van ATI-onderzoek. In Aptitude Treatment Interaction (VOR Publikatie nr. 5). Amsterdam: Vereniging voor Onderwijsresearch, 1977.

Popham, W.J. Indices of adequacy for criterion-referenced test items. In W.J. Popham (Ed.), Criterion-referenced measurement: An introduction. Englewood Cliffs, N.J.: Educational Technology Publications, 1971.

Popham, W.J. Criterion-referenced measurement. Englewood Cliffs, N.J.: Prentice-Hall, 1978.

Popham, W.J. Domain specification strategies. In R.A. Berk (Ed.), Criterion-referenced measurement: The state of the art. Baltimore, MD: The Johns Hopkins University Press, 1980.

Popham, W.J. & Husek, T.R. Implications of criterion-referenced measurement. Journal of Educational Measurement, 6, 1-9, 1969.

Reulecke, W.A. Ein Modell für die kriteriumorientierte Testauswertung. Zeitschrift für Empirische Pädagogik, 1, 49-72, 1977a.

Reulecke, W.A. A statistical analysis of deterministic theories. In H. Spada & W.F. Kempf (Eds.), Structural models of thinking and learning. Bern, Switzerland: Hans Huber Publishers, 1977b.

Roid, G. & Haladyna, T. The emergence of an item-writing technology. Review of Educational Research, 50, 293-315, 1980.

Roudabush, G.E. Item selection for criterion-referenced tests. Paper presented at the Annual Meeting of the American Educational Research Association, New Orleans, Louisiana, 1973. (EDRS No. ED 074 147)


Rovinelli, R.J. & Hambleton, R.K. On the use of content specialists in the assessment of criterion-referenced test item validity. Tijdschrift voor Onderwijsresearch, 2, 49-60, 1977.

Saupe, J.L. Selecting items to measure change. Journal of Educational Measurement, 3, 223-228, 1966.

Shavelson, R., Block, J., & Ravitch, M. Criterion-referenced testing: Comments on reliability. Journal of Educational Measurement, 9, 133-138, 1972.

Shepard, L.A. Setting standards. In M.A. Bunda and J.R. Sanders (Eds.), Practices and problems in competency-based measurement. Washington, D.C.: National Council on Measurement in Education, 1979.

Shepard, L. Standard setting issues and methods. Applied Psychological Measurement, 4, 447-467, 1980.

Simon, G.B. Comments on "Implications of criterion-referenced measurement". Journal of Educational Measurement, 6, 259-260, 1969.

Subkoviak, M.J. Estimating reliability from a single administration of a criterion-referenced test. Journal of Educational Measurement, 13, 265-276, 1976.

Subkoviak, M.J. Empirical investigation for estimating reliability of mastery tests. Journal of Educational Measurement, 15, 111-116, 1978.

Subkoviak, M.J. Decision-consistency approaches. In R.A. Berk (Ed.), Criterion-referenced measurement: The state of the art. Baltimore, MD: The Johns Hopkins University Press, 1980.

Subkoviak, M.J. & Wilcox, R. Estimating the probability of correct classification in mastery testing. Paper presented at the Annual Meeting of the American Educational Research Association, Toronto, 1978.

Suppes, P. The use of computers in education. Scientific American, 215, 206-221, 1966.

Suppes, P., Smith, R., & Beard, M. University-level computer-assisted instruction at Stanford: 1975 (Technical Report No. 265). Stanford, Cal.: Institute for Mathematical Studies in the Social Sciences, Stanford University, 1975.

Swaminathan, H., Hambleton, R.K., & Algina, J. Reliability of criterion-referenced tests: A decision-theoretic formulation. Journal of Educational Measurement, 11, 263-267, 1974.

Swaminathan, H., Hambleton, R.K., & Algina, J. A Bayesian decision-theoretic procedure for use with criterion-referenced tests. Journal of Educational Measurement, 12, 87-98, 1975.


Traub, R.E. & Rowley, G.L. Reliability of test scores and decisions. Applied Psychological Measurement, 4, 517-545, 1980.

van den Brink, W.P. & Koele, P. Item sampling, guessing, and decision-making in achievement testing. British Journal of Mathematical and Statistical Psychology, 33, 104-108, 1980.

van der Linden, W.J. Forgetting, guessing, and mastery: The Macready and Dayton models revisited and compared with a latent trait approach. Journal of Educational Statistics, 3, 305-318, 1978a.

van der Linden, W.J. Het klassieke testmodel, latente trek modellen en evaluatie-onderzoek (VOR Publikatie nr. 7). Amsterdam, The Netherlands: Vereniging voor Onderwijsresearch, 1978b.

van der Linden, W.J. Criteriumgeorienteerd toetsen. In E. Warries (Ed.), Beheersingsleren een leerstrategie. Groningen, The Netherlands: Wolters-Noordhoff, 1979.

van der Linden, W.J. Decision models for use with criterion-referenced tests. Applied Psychological Measurement, 4, 469-492, 1980a.

van der Linden, W.J. Passing score and length of a mastery test. Paper presented at the 11th European Mathematical Psychology Group Meeting, Liege, Belgium, September 3-4, 1980b.

van der Linden, W.J. Simple estimates for the simple latent class mastery testing model. Paper presented at the Fourth International Symposium on Educational Testing, Antwerp, Belgium, June 24-27, 1980c.

van der Linden, W.J. Statistical aspects of optimal treatment assignment. Paper presented at the 1980 European Meeting of the Psychometric Society, Groningen, The Netherlands, June 19-21, 1980d. (EDRS No. ED 194 565)

van der Linden, W.J. A latent trait look at pretest-posttest validation of criterion-referenced test items. Review of Educational Research, 51, 379-402, 1981a.

van der Linden, W.J. Using aptitude measurements for the optimal assignment of subjects to treatments with and without mastery scores. Psychometrika, 46, in press, 1981b.

van der Linden, W.J. Estimating the parameters of Emrick's mastery testing model. Applied Psychological Measurement, 5, 517-530, 1981c.

van der Linden, W.J. On the estimation of the proportion of masters in criterion-referenced measurement. Submitted for publication, 1981d.

van der Linden, W.J. The use of moment estimators for mixtures of two binomials with one known success parameter. Submitted for publication, 1981e.


van der L inden, W.J. ~, Mel lenbergh, G.J. Optimal cu t t i ng scores using a l inear loss func t ion . Applied Psychological Measurement, 1, 593-599, 1977.

van der Linden, W.J. E, Mel lenbergh, G.J. Coeff ic ients for tests from a decision theoret ic point of view. Applied Psychological Measurement, 2, 119-134, 1978.

van Naerssen, R.F. Lokale betrouwbaarheid: begr ip en operat ional isat ie. Ti jdschr i f t voor Onderwijsresearch, 2, 111-119, 1977a.

van Naerssen, R.F. i temkarakter is t ieken. 1977b.

Grafieken voor de schat t ing van de hel l ing van Ti jdschr i f t voor Onderwijsresearch, 2, 193-201,

V i jn , P. Prior information in l inear models (Thes is ) . Groningen, The Nether lands: R i j ksun ivers i te i t Groningen, 1980.

Warries, E. Beheersingsleren als een nieuwe leerstrategie: In le id ing. In E. Warries (Ed . ) , Beheerslngsleren een leerstrategie. Groningen, The Nether lands: Wolters-Noordhoff , 1979a.

Warries, E. Het meten van de graad van beheersing. In E. Warries (Ed.), Beheersingsleren een leerstrategie. Groningen, The Netherlands: Wolters-Noordhoff, 1979b.

Wedman, I. Theoretical problems in construction of criterion-referenced tests (Educational Report Umeå No. 3). Umeå, Sweden: Umeå University and Umeå School of Education, 1973.

Wedman, I. On the evaluation of criterion-referenced tests. In H.F. Crombag & D.N.M. de Gruijter (Eds.), Contemporary issues in educational testing. The Hague, The Netherlands: Mouton, 1974a.

Wedman, I. Reliability, validity, and discrimination measures for criterion-referenced tests (Educational Reports Umeå No. 4). Umeå, Sweden: Umeå University and Umeå School of Education, 1974b.

Wilcox, R.R. A note on the length and passing score of a mastery test. Journal of Educational Statistics, 1, 359-364, 1976.

Wilcox, R.R. Estimating the likelihood of false-positive and false-negative decisions in mastery testing: An empirical Bayes approach. Journal of Educational Statistics, 2, 289-307, 1977.

Wilcox, R.R. A note on decision theoretic coefficients for tests. Applied Psychological Measurement, 2, 609-613, 1978a.

Wilcox, R.R. Estimating the true score in the compound binomial error model. Psychometrika, 43, 245-258, 1978b.

Wilcox, R.R. Achievement tests and latent structure models. British Journal of Mathematical and Statistical Psychology, 32, 61-71, 1979a.

Wilcox, R.R. Applying ranking and selection techniques to determine the length of a mastery test. Educational and Psychological Measurement, 39, 12-22, 1979b.

Wilcox, R.R. Comparing examinees to a control. Psychometrika, 44, 55-68, 1979c.

Wilcox, R.R. Estimating the parameters of the beta-binomial distribution. Educational and Psychological Measurement, 39, 527-536, 1979d.

Wilcox, R.R. Prediction analysis and the reliability of a mastery test. Educational and Psychological Measurement, 39, 825-839, 1979e.

Wilcox, R.R. An approach to measuring the achievement or proficiency of an examinee. Applied Psychological Measurement, 4, 241-251, 1980a.

Wilcox, R.R. Determining the length of a criterion-referenced test. Applied Psychological Measurement, 4, 425-446, 1980b.

Wilcox, R.R. & Harris, C.W. On Emrick's "An evaluation model for mastery testing". Journal of Educational Measurement, 14, 215-218, 1977.

Woodson, M.I.C.E. The issue of item and test variance for criterion-referenced tests. Journal of Educational Measurement, 11, 63-64, 1974a.

Woodson, M.I.C.E. The issue of item and test variance for criterion-referenced tests. Journal of Educational Measurement, 11, 139-140, 1974b.

Wright, B.D. & Stone, M.H. Best test design: A handbook for Rasch Measurement. Chicago, Ill.: MESA Press, 1979.