
Familiar Strangers: Criterion-Referenced Measures in Communication Disorders

Rebecca J. McCauley
University of Vermont, Burlington

It is the third week of school and you have scheduled your first session with George, a 7-year-old boy who is meeting with little success in the early stages of learning to read following a history of language impairment, including severe phonologic difficulties dating from preschool. Studying his file, there is no question in your mind that almost any of the norm-referenced tests you might use to reevaluate his speech and language skills will confirm his overall difference from his peers. In fact, you believe many of these tests will provide you with insight into some of the strengths and weaknesses of his communication.

At least as urgent a need as the continuing documentation of his problems, however, is your desire to know what gains he has made on the behavioral objectives targeted when he was last seen and where to begin your efforts to help him. The questions that press their way to the front of your mind are myriad: What is his intelligibility in the classroom? Which grammatical morphemes continue to be missing from his expressive language? What simplification processes characterize his speech errors? What is his understanding of vocabulary that will be used in his second-grade class? What understanding does he have

ABSTRACT: Although frequently used in the assessment and treatment of communication disorders, criterion-referenced measures are often not well understood, making them both familiar and alien—thus, familiar strangers. This article is designed to better acquaint test users with the characteristics associated with the use and evaluation of criterion-referenced measures, particularly as they differ from norm-referenced measures. Guidelines are proposed for the evaluation and selection of standardized criterion-referenced measures as well as for the development and ongoing evaluation of informal criterion-referenced measures.

KEY WORDS: criterion-referenced measurement, informal measures, test development, test evaluation

concerning the sound structure of words and syllables that he can draw on for reading?

Fortunately, as you contemplate the clinical questions posed by George, you will have methods in mind to answer them, many of them involving the use of probes of your own construction. Despite their pervasiveness in your repertoire of clinical methods, and despite the skill with which you use them, you may only vaguely be aware that such measures can be described as "criterion-referenced," and even less aware of a body of research and writing designed to aid you in their development and use.

Table 1 provides specific examples of a variety of such measures for school-age children. Although many of these measures are not referred to by their proponents as "criterion-referenced," and although several of the measures listed would almost certainly fail to fit within more narrow definitions of the term or be considered well-developed criterion-referenced measures (cf. Nitko, 1983), all of these measures are "criterion-referenced" in that they are frequently interpreted in terms of established performance levels. One purpose of this article, therefore, is to call to the reader's attention some of the basic properties of such measures—"familiar strangers"—that one may use daily yet have little opportunity to learn about.

Criterion-referenced measures are described as "familiar" here because they are probably more commonly used by speech-language pathologists than almost any other type of measure (Muma, Pierce, & Muma, 1983). Despite their ubiquity, however, such measures may also be the least understood of those we use—hence the label "strangers" to convey the sense of their mystery. Because norm-referenced tests have historically garnered more attention (e.g., Brown, 1989; McCauley & Swisher, 1984a, 1984b), the importance of criterion-referenced measures and the principles guiding their use frequently have been unappreciated. In this article, therefore, the intent is to review the basic characteristics of criterion-referenced measures (especially as these contrast with norm-referenced measures) and outline a set of

122 LANGUAGE, SPEECH, AND HEARING SERVICES IN SCHOOLS • Vol. 27 • 0161-1461/96/2702-0122 • © American Speech-Language-Hearing Association


Table 1. Examples from a variety of developmental communication disorders of measures that fit within a broad definition of criterion-referencing.

Hearing impairment
• Pure tone threshold testing (Carhart & Jerger, 1959)
• Speech Recognition Threshold (American Speech-Language-Hearing Association, 1988)
• Phonetic Level Speech Evaluation (Ling, 1976)

Fluency disorders
• Percentage of syllables stuttered
• Proportion of disfluency types
• Stuttering Severity Instrument for Children and Adults (SSI-3) (Riley, 1994)

Voice disorders
• Maximum phonation time (Colton & Casper, 1990)
• Vocal intensity (Colton & Casper, 1990)
• Mean fundamental frequency (Colton & Casper, 1990)

Phonological disorders
• Profile of phonological process use (Khan & Lewis, 1986)
• Percentage of consonants correct (Shriberg & Kwiatkowski, 1982)
• Percentage of intelligible words (Weiss, Gordon, & Lillywhite, 1987)

Language disorders and language-learning disabilities
• Language sample analyses (Lahey, 1988)
• Mean length of utterance (Brown, 1973)
• Decoding Skills Test (Richardson & DiBenedetto, 1985)

guidelines for clinicians to keep in mind as they use such measures.

BACKGROUND

Historically, some of the obscurity surrounding criterion-referenced tests undoubtedly stems from their recent evolution—the term criterion-referencing having been coined only a little more than 30 years ago (Glaser, 1963; Glaser & Klaus, 1962). At that time, the idea of an alternative to norm-referencing grew out of the frustration test users experienced with the adequacy of norm-referenced tests for some purposes. Although such measures were often satisfactory for the purpose of identifying children as sufficiently different from their peers to require help in a particular area, they were far less useful for the purpose of planning the nature of that help. In particular, when it was realized that achievement tests might ideally reflect the extent to which an individual had mastered a particular body of knowledge or achieved a particular level of skill attainment (Glaser, 1963), theorists began to consider modifications in test development and interpretation that were, at times, at odds with procedures recommended for norm-referenced instruments (Carver, 1974; Gronlund, 1982). In this new approach to test development, for example, the performances of groups of individuals who were like those for whom the test was developed remained of interest as a means of establishing performance standards, rather than as a group of scores against which future test takers were to be compared (as would be the case if norm-referencing were used).

In the 1970s and '80s, there was a proliferation of criterion-referenced tests and of scholarly activity devoted to the implementation of criterion-referencing, particularly in educational applications (Nitko, 1983). Much of this work, however, took place independently under a variety of names, including domain-referenced testing, objectives-referenced testing, competency-based testing, and mastery testing (Berk, 1984a). In 1985, the influential Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education [AERA, APA, & NCME], 1985) recognized the growing stature of such tests, which were defined as follows: "Criterion-referenced test. A test that allows its users to make score interpretations in relation to a functional performance standard, as distinguished from those interpretations that are made in relation to the performance of others" (p. 90).

The Standards also recognized "domain-referenced tests," where a domain-referenced test was defined as "a test that allows users to estimate the amount of a specified content domain that an individual has learned. Domains may be based on sets of instructional objectives, for example" (AERA, APA, & NCME, 1985, p. 91). For many authors, however, the distinction between these closely related concepts is not maintained, and criterion-referenced testing is used as the cover term. In the interests of simplicity, it will be used that way here.

In the late 1980s and early '90s, criterion-referencing appears to have been superseded as the compelling cutting edge of educational and psychological testing. In particular, developments such as computerized interactive and adaptive testing, as well as portfolio assessment, appear to have assumed that position. Nevertheless, criterion-referencing continues to receive attention as an invaluable mode of test score interpretation (e.g., Cizek, 1993; Millman & Greene, 1989; Salvia & Ysseldyke, 1991).

Currently, in the field of speech-language pathology, the value of criterion-referencing is frequently affirmed in texts dealing with diagnostic methods (Haynes, Pindzola, & Emerick, 1992; Newhoff & Leonard, 1983) and in texts describing assessment and intervention for specific categories of communication disorder (Lahey, 1988; Nelson, 1993; Reed, 1994; Wallach & Butler, 1994). These sources generally recommend that criterion-referenced measures be used in two instances: (a) when the norms that might be used to provide a norm-referenced interpretation of test performance are unavailable or inappropriate, or (b) when information concerning specific skills or behaviors of the test taker is required by the clinical question being posed.

The first of these two instances occurs frequently, when one tests a student who speaks a dialect or has a cultural background that differs from that of the normative sample (Taylor & Anderson, 1988; Terrell, Arensberg, & Rosa, 1992). It also occurs when one tests a student with mental retardation (Snyder-McLean & McLean, 1988), because neither age nor "mental age" can be expected to yield a meaningful basis for selecting a truly relevant normative group for such children (Salvia & Ysseldyke, 1991).

The second instance in which criterion-referenced measurement tends to be recommended occurs when specific information concerning behaviors or knowledge is required—as it is at many points during the natural course of a clinical relationship. During diagnosis of a developmental communication disorder, for example, an appropriate criterion-referenced measure can help support a specific diagnosis or provide a focus for treatment planning robust enough to guide the selection of treatment tasks, stimuli, and settings (Hutchinson, 1988; McCauley & Swisher, 1984b). Similarly, as treatment progresses, information concerning specific patterns of skill or knowledge acquisition and generalization will facilitate planning by highlighting areas in which change is or is not occurring through the use of criterion-referenced pre- and posttesting. Recently, Tomblin (1994) proposed the term "client-referenced" to refer to clinical measures in which the criterion relates to the client's own past performance on the measure of interest. For the purpose of examining changes over time, criterion-referencing is applied frequently in the form of the assessment component of behavioral objectives.

ESSENTIAL CHARACTERISTICS OF CRITERION-REFERENCED MEASURES

As is evident from Table 1, both formal, or standardized, tests and informal clinician-developed measures can be criterion-referenced. In fact, currently in speech-language pathology, criterion-referencing is still far more common in informal than in formal measures. Returning to the case of George, for example, at least two of the five questions raised concerning his communication (i.e., those dealing with intelligibility and classroom vocabulary) would require the use of an informal criterion-referenced measure. Use of informal criterion-referenced measures places an especially great burden on the clinician to consider their adequacy because they will almost always lack the documentation typically supplied by test developers for standardized measures.

Another point that should be evident from examination of this table is that some of the measures listed as criterion-referenced can also be used with norms and thus be considered norm-referenced as well. For example, if the Khan-Lewis Phonological Analysis (Khan & Lewis, 1986) were to be used to address George's use of simplification processes, his overall score could first be compared against the performance of a group of same-age peers to confirm that he has a phonological impairment (a norm-referenced interpretation of the test performance) and then could be used to plan treatment when the most frequently occurring processes are selected as targets (a criterion-referenced interpretation of test performance). In fact, norm- and criterion-referencing are often, and probably best, considered as modes of interpretation rather than mutually exclusive categories of tests (AERA, APA, & NCME, 1985; Nitko, 1989). However, it is rare that a single instrument can support both modes of interpretation with equal validity. Should a test developer attempt to devise an instrument with such capabilities, the volume of evidence required to suggest validity for both purposes would be immense.

A comparison of norm- and criterion-referencing for standardized measures may prove helpful in highlighting the specific features that make an instrument that is good as a criterion-referenced measure less likely to be good as a norm-referenced measure and vice versa. Table 2 summarizes four features of criterion- versus norm-referenced measures that are often described in discussions of this distinction (Carver, 1974; Gronlund, 1982).

First, whereas norm-referenced measures are designed to rank individuals in terms of relative performance, criterion-referenced measures are designed to distinguish between performance levels through the specification of knowledge the test taker has or of tasks the test taker is able to perform successfully. This difference leads to somewhat incompatible bases for the selection of items and scores used to summarize test performance. It also leads to the comparison of a test taker's test score, not to a group of scores obtained by a normative sample, as is the case in norm-referenced interpretations, but to a specific score or set of scores, which are alternatively called cutoffs, criterion levels, or performance standards.

Selection of a specific score or scores to differentiate performance levels, for example, between those scores seen as adequate or indicative of "mastery" and those seen as inadequate or indicative of "nonmastery," is one of the thorniest problems facing developers of criterion-referenced instruments (Berk, 1984a; Cizek, 1993; Nitko, 1983; Shepard, 1984). It also presents a point of possible confusion, because test developers often use empirical information concerning group performances to guide their selection of cutoffs, thereby blurring the concepts of norms versus standards unless it is understood that norm-referencing occurs when an individual's performance is primarily interpreted in relation to the performance of others.

Table 2. Comparison of norm- and criterion-referencing.

Norm-referencing
• Fundamental purpose is to rank individuals
• Test planning addresses a broad content area
• Items are chosen to distinguish among individuals
• Performance can be summarized meaningfully using percentile or standard scores

Criterion-referencing
• Fundamental purpose is to distinguish specific levels of performance
• Test planning addresses a clearly specified domain
• Items are chosen to cover content domain
• Performance can be summarized meaningfully using raw scores



Interpretation of pure-tone threshold testing provides a good example of a criterion-referenced measure for which standards were established using normative data. Although normative populations were used to establish the standards on which pure-tone testing is based, the way an individual's performance is interpreted is not through direct reference to the hearing performance of others (e.g., John scores at the 30th percentile for hearing at 2,000 Hz), but through reference to a standard (e.g., John shows a hearing sensitivity of 50 dB HL at 2,000 Hz).
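That style of interpretation can be sketched in a few lines of Python. The severity boundaries below follow one commonly used audiometric classification and are purely illustrative; the article itself specifies no such categories, and the function name is invented:

```python
# Sketch of a criterion-referenced interpretation of pure-tone results:
# a measured threshold is compared to fixed standards, not to peers.
# Category boundaries follow one commonly used audiometric severity
# scheme and are illustrative only (not taken from this article).

def hearing_category(threshold_db_hl):
    """Label a pure-tone threshold (in dB HL) against fixed standards."""
    if threshold_db_hl <= 25:
        return "within normal limits"
    if threshold_db_hl <= 40:
        return "mild loss"
    if threshold_db_hl <= 55:
        return "moderate loss"
    if threshold_db_hl <= 70:
        return "moderately severe loss"
    if threshold_db_hl <= 90:
        return "severe loss"
    return "profound loss"

# John's 50 dB HL threshold at 2,000 Hz is interpreted against the
# standard itself, with no reference to how other listeners performed.
print(hearing_category(50))  # → moderate loss
```

Note that no percentile appears anywhere: the normative data that once set the boundaries are invisible at the moment of interpretation.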

Second, whereas norm-referenced tests usually contain items intended to sample broadly from the skill, ability, or behavior being assessed, criterion-referenced tests contain items that more thoroughly cover a smaller domain. This means that from the first stages of item preparation, test content is chosen in a way that samples more or less narrowly from the domain being tested.

There are numerous schemes for moving from the inferences to be made to decisions concerning the way in which items will be posed, the nature of expected responses, and the way in which they will be evaluated (Nitko, 1983; Popham, 1984). Many of them involve the development of very detailed specifications before the creation of items to ensure thorough coverage of the skill area to be tested, with the level of detail required falling well above that required for norm-referenced measures.

For example, in the item plan guiding the development of just one subtest, the authors of the Decoding Skills Test (Richardson & DiBenedetto, 1985) developed items that varied all of the following characteristics: mono- versus polysyllabic structure, single versus multiple consonants, short versus long vowels versus vowel digraphs, and real versus nonsense words. From this example, it should be evident that attempts to produce a criterion-referenced measure for language performance will necessarily be focused on only a very small area of language, or will need to be incredibly, perhaps impossibly, lengthy.
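The crossed characteristics just listed define an item specification grid. The sketch below is a hypothetical illustration, not the test authors' actual development procedure; it simply shows how quickly even a few crossed dimensions multiply the number of item cells to be covered:

```python
from itertools import product

# The four crossed item characteristics named in the Decoding Skills Test
# example; the grid itself is an illustrative sketch only.
syllable_structure = ["monosyllabic", "polysyllabic"]
consonants = ["single consonant", "multiple consonants"]
vowel_type = ["short vowel", "long vowel", "vowel digraph"]
word_type = ["real word", "nonsense word"]

# Each combination defines one cell of the item specification grid,
# and thorough domain coverage requires items for every cell.
specifications = list(product(syllable_structure, consonants,
                              vowel_type, word_type))

print(len(specifications))  # → 24 cells (2 x 2 x 3 x 2)
```

Twenty-four cells for a single subtest of a single decoding skill makes concrete why a criterion-referenced language measure must either stay narrow or become impossibly long.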

Third, test items in norm-referenced tests are typically chosen on the basis of their ability to discriminate among individuals, whereas test items in criterion-referenced tests are chosen to discriminate differences in performance. For both types of tests during item tryout, items are administered to a group of test takers who are similar to those for whom the test is intended. Then, during item analysis, the group's performance on each item is scrutinized.

For a norm-referenced test, the test developer would choose for the final version of the test those items on which test takers performed variably. Thus, any items that were passed by all test takers or failed by all test takers would be rejected because they did not discriminate among individuals. In contrast, for a criterion-referenced test, the test developer would choose for the final version of the test those items that were always passed by individuals (called masters) who had acquired, or mastered, a particular skill and always failed by those (called nonmasters) who had not mastered that same skill. This choice would be made regardless of how many test takers in the item tryout had failed or passed the individual item.
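The contrast between the two selection rules can be sketched with invented tryout data; the response lists, group sizes, and the 0.5 discrimination threshold below are all hypothetical illustrations, not values from any published test:

```python
# Hypothetical item-tryout responses for one item (1 = pass, 0 = fail).
masters = [1, 1, 1, 1, 1]      # tryout takers known to have mastered the skill
nonmasters = [0, 0, 0, 1, 0]   # tryout takers known not to have mastered it

def pass_rate(responses):
    return sum(responses) / len(responses)

# Norm-referenced selection keeps items on which performance varies,
# rejecting items everyone passed or everyone failed.
overall = pass_rate(masters + nonmasters)
keep_for_norm_referencing = 0.0 < overall < 1.0

# Criterion-referenced selection keeps items that separate masters from
# nonmasters, regardless of the overall pass rate.
discrimination = pass_rate(masters) - pass_rate(nonmasters)
keep_for_criterion_referencing = discrimination >= 0.5  # illustrative cutoff

print(discrimination)  # → 0.8
```

An item passed by every master and every nonmaster alike would survive neither rule; an item passed by all masters and failed by all nonmasters would be rejected under norm-referencing only if nearly everyone in the tryout happened to be on the same side.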

An example of item selection for a criterion-referenced instrument can be drawn from the Oral Speech Mechanism Screening Examination (St. Louis & Ruscello, 1987), in which numerous items were included despite preliminary data suggesting that almost all of the pilot group passed those items. The inclusion of those items was justified by their importance to a complete picture of the unimpaired function of the oral-speech mechanism.

Finally, because the score we use to summarize a performance reflects the way in which we are interpreting it, norm-referenced tests more often use types of scores that reflect the test taker's score relative to those of other test takers—namely, percentiles, standard scores, and so forth. In contrast, criterion-referenced tests frequently make use of raw scores or percentages that are compared to a specific cutoff or criterion level—both for formal and informal measures.

In communication disorders, decisions regarding behavioral objectives, which are often stated in terms of informal criteria of speech or language behavior, serve as excellent examples of the use of raw scores or percentages as cutoff scores for decision making. Often, such cutoff scores are determined by the arbitrary selection of a level that is viewed as reasonably, but not overly, strict—80%, for example (Eger, Mient, & Cushman, 1986). This approach has been termed the "feet on the desk" procedure (Nitko, 1983) because the selection process is guided by the kind of cogitation that can occur with one's feet comfortably planted on the desk, rather than that occurring after the brow-moistening accumulation of empirical evidence as input to the decision-making process.
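A minimal sketch of such a cutoff decision, with invented probe counts and a hypothetical helper name, might look like this:

```python
# Minimal sketch of a "feet on the desk" mastery decision: a raw percentage
# from a clinician-made probe is compared to an arbitrarily chosen cutoff
# (80%, as in the text). The probe counts and function name are invented.

def meets_criterion(correct, total, cutoff=0.80):
    """Return the percentage correct and whether it reaches the cutoff."""
    pct = correct / total
    return pct, pct >= cutoff

pct, mastered = meets_criterion(correct=17, total=20)
print(f"{pct:.0%} correct, criterion {'met' if mastered else 'not met'}")
# → 85% correct, criterion met
```

Note that the decision rests entirely on the raw percentage and the chosen standard; no normative sample is consulted at any point.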

In summary, then, although criterion-referenced and norm-referenced measures share some important commonalities, they present distinct challenges to their developers and subsequent users. Whereas much has been written explaining the means by which norm-referenced test developers and test purchasers can better ensure the quality of such measures, little comparable guidance has been provided for criterion-referenced measures in the field of communication disorders. In the next two sections of this article, guidelines are offered for the review and development of criterion-referenced measures.

GUIDELINES FOR STANDARDIZED CRITERION-REFERENCED MEASURES

Despite the significant differences between norm-referenced and criterion-referenced measures described above, the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1985) provides guidelines, or standards, for test development and use that encompass both types of measures. Most standards are stated in a general manner, thereby permitting differences in technical implementation, depending on the type of measure. Others are clearly more relevant for one of the two types of measures. The intermingling of standards for both types is a tangible outgrowth of the Standards' focus on particular instances of test use and interpretation as the point at which decisions regarding psychometric adequacy are ultimately relevant (Messick, 1989).



The specific guidelines offered in Table 3 were developed with the intent of preserving the Standards' unitary approach to measurement while acknowledging the unique and emerging methods used in the development of standardized criterion-referenced instruments. They represent only a limited subset of characteristics thought to be important to the evaluation of criterion-referenced measures. (Readers are encouraged to refer to the Standards [AERA, APA, & NCME, 1985], Berk [1984b], and Nitko [1983] for more comprehensive guidelines.)

In this section, each guideline is offered in terms of relevant recommendations from the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1985), limitations resulting when deficiencies are noted, and means by which the test user can recognize a test's fulfillment of the guideline. In addition, ways in which these guidelines differ from those relevant to tests supporting norm-referenced interpretations will be described.

(1) Clear Definition of the Test Domain

The specifications used in constructing items or selecting observations and in designing the test instrument as a whole should be stated clearly. The definition of a universe or domain that is used for constructing or selecting items should be described. (AERA, APA, & NCME, 1985, p. 25)

Because the meaningfulness of a criterion-referenced interpretation rests on the comparison of a sample of the test taker's behaviors to a specified standard of performance, it is crucial that the test user be aware of how broadly or narrowly the content domain has been defined during the test's development. Without detailed knowledge of how items were selected or written, the test user may misconstrue the test taker's competencies and thus undermine the value of the test's use. Thus, for example, the author of a measure designed to examine phonological process use in a child's speech should include an account of the possible frequency of occurrence of a given process for the stimuli used, as well as an account of the phonetic context in which each process was probed. This information should be provided because both factors are known to influence apparent patterns of process use (McReynolds & Elbert, 1981; Stoel-Gammon & Dunn, 1985).

In order to be assured that the test domain has been clearly defined, the test user can look for a clear description of the area of knowledge or behavior being assessed. That description should state the tasks to be completed, the plan

Table 3. Guidelines for the evaluation of standardized criterion-referenced measures.

1. Clear definition of the test domain
2. Evidence of validity
3. Evidence of reliability
4. Careful description of the test takers used in studies of reliability and validity
5. Detailed description of test administration
6. Detailed description of test user qualifications

used to guide item construction, and the methods used to evaluate the appropriateness of included items. As part of a suitable plan, users can expect to find clear delineations of whether items are interchangeable or are organized along a continuum based on difficulty or complexity, degree of proficiency, or level of development (Nitko, 1983).

With the increasing integration of language with curricular concerns for school-age clients, curriculum-based language assessment has been suggested as providing the framework for the specification of informal, often observational, measures (e.g., Nelson, 1989). Regardless of their source, clear item specifications enable a test's user to decide about the appropriateness of the measure for use in answering a specific clinical question concerning a given client.

Information concerning test content and the methods used in item development and selection is also relevant when norm-referenced interpretations of test performance are planned. However, the need for more detailed delineation of how the content domain was conceptualized and sampled is heightened for a test that will be used to choose treatment materials or methods of intervention.

(2) Evidence of Validity for the Test User's Intended Application

"Evidence of validity should be presented for the major types of inferences for which the use of a test is recommended. A rationale should be provided to support the particular mix of evidence presented for the intended uses" (AERA, APA, & NCME, 1985, p. 13).

When a test taker's performance is to be interpreted in terms of having met or not met a particular performance standard, the ultimate concern is that of validity, where validity refers to the ability of a measure to be used meaningfully for a given purpose. Evidence of validity has traditionally been marshaled under the headings of content, criterion-related, and construct validity, which correspond to overlapping (rather than alternative) empirical and logical approaches taken during a measure's development and use.

Because the ultimate determination of validity must be made with the testing purpose and test taker in mind, what constitutes adequate evidence will vary considerably with variations in testing purposes and test takers. For example, evidence supporting the use of a measure to help identify where to begin treatment for a child with difficulties in phonological processing will be quite different from that supporting the use of the same measure to help identify that a child is likely to possess precursory skills for learning to read. Fuller discussions of the complexity involved in validity assessments for criterion-referenced measures are available in Hambleton (1984).

Some of the types of evidence suggesting validity for the use of a criterion-referenced measure can be illustrated using the hypothetical example of a criterion-referenced language scale contemplated for use in the description of language skills of a young student who speaks a nonstandard dialect. In such a case, the description of the test's design and development under the first guideline offered above would be particularly important. In addition,

126 LANGUAGE, SPEECH, AND HEARING SERVICES IN SCHOOLS • Vol. 27, April 1996


evidence that experts in the domain of interest guided or agreed with the inclusion and ordering of specific tasks would provide further evidence that the measure would be likely to function as intended, that is, that it has content validity for this particular application. Although it is often claimed that such information is sufficient to document validity for a criterion-referenced instrument, other types of information are indeed quite pertinent to that conclusion (Hambleton, 1984).

Additional sources of relevant validity evidence for the hypothetical case described above would be the results of studies showing that children for whom there were no concerns regarding language were successful in performing tasks at their age level, or that children who failed at their age level were unable to respond correctly to the tested content using other, validated measures. Such studies may be conducted before item analysis during the test's development (Berk, 1984a), during studies designed to establish performance standards (Shepard, 1984), or after the test's completion during validation studies (Hambleton, 1984). Also crucial for this application of the language measure would be data showing that passing performances were obtained by children who are known to have normal language skills and who are from the same minority group as the child to be tested.

Readers will recognize the kinds of evidence described above as quite similar to those discussed for norm-referenced interpretations (e.g., McCauley & Swisher, 1984a). One difference, however, that may also be evident is that, whereas the decision to be defended in norm-referenced interpretation is one of relative ranking compared to a particular normative group, the decision to be defended in criterion-referenced interpretations is often a dichotomous one, frequently a "pass-fail" or "mastery-nonmastery" decision as to whether an individual's performance met or did not meet a specified performance level. Thus, in order for validity to be supported, the specific criterion level used as a cutoff, as well as the measure as a whole, must be positively evaluated (Shepard, 1984).

(3) Evidence of Reliability

"Typically test developers and publishers have primary responsibility for obtaining and reporting evidence concerning reliability and errors of measurement adequate for intended purposes" (AERA, APA, & NCME, 1985, p. 19).

In the evaluation of reliability for a criterion-referenced measure or interpretation, the desired characteristic can take a variety of forms: consistency in decision outcome (e.g., mastery/nonmastery), consistency in distance from the cutoff score, or consistency in obtained scores when closely related versions of a measure are administered (Berk, 1984a). Which of these forms is seen as most appropriate will depend on the application to which the measure will be put and the orientation of the test author. Measures used to describe observed patterns of consistency may be labeled agreement indices or reliability coefficients and are almost overwhelming in number (Berk, 1984a; Feldt & Brennan, 1989; Subkoviak, 1984).
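To make the idea of decision consistency concrete, the sketch below (not from the article; all scores are hypothetical) computes two widely used quantities for mastery/nonmastery classifications made on two closely related forms of a measure: the raw proportion of examinees classified the same way on both forms, and a chance-corrected version of that proportion (Cohen's kappa, one of the many indices surveyed by Subkoviak, 1984).

```python
# Illustrative sketch: decision-consistency indices for a criterion-
# referenced measure scored against a fixed cutoff. All data hypothetical.

def classify(scores, cutoff):
    """Convert raw scores to mastery (True) / nonmastery (False) decisions."""
    return [s >= cutoff for s in scores]

def agreement_index(form_a, form_b, cutoff):
    """Proportion of examinees given the same decision on both forms."""
    a, b = classify(form_a, cutoff), classify(form_b, cutoff)
    return sum(x == y for x, y in zip(a, b)) / len(a)

def kappa(form_a, form_b, cutoff):
    """Chance-corrected agreement (Cohen's kappa) between the two decisions."""
    a, b = classify(form_a, cutoff), classify(form_b, cutoff)
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n  # proportion passing each form
    p_chance = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (p_obs - p_chance) / (1 - p_chance)

# Hypothetical scores for 8 children on two parallel 20-item probes,
# with mastery defined as 16 or more items correct.
form_a = [18, 12, 16, 9, 17, 14, 19, 8]
form_b = [17, 11, 15, 10, 18, 16, 19, 7]
print(agreement_index(form_a, form_b, cutoff=16))  # 0.75
print(kappa(form_a, form_b, cutoff=16))            # 0.5
```

Note that both indices describe only the consistency of the pass-fail decision, not its correctness; a high value says nothing about whether the cutoff itself is defensible.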

Without information regarding consistency, the test user may mistakenly classify an individual as requiring further attention in a specific area or, conversely, as having mastered a specific area when those inferences are not valid. For speech-language pathologists, this may result in the selection of inappropriate treatment objectives or inappropriate timing of movement along a continuum of tasks or stimuli.

As an example, consider the use of a criterion-referenced measure to determine whether or not a child possesses the phonological processing skills required in early reading acquisition. The test user would want to know that the child's passing or failing such a measure would be consistent regardless of the exact items used, the identity of the evaluator, and so on. Evidence that relevant groups of children were consistently classified by the measure would help provide that assurance. However, given the still emerging nature of criterion-referenced measures relevant to communication disorders, such evidence may not be offered. Instead, reliability coefficients based on methods traditionally used for norm-referenced measures may be used (e.g., Richardson & DiBenedetto, 1985; St. Louis & Ruscello, 1987).

Thus, reliability evidence for criterion-referenced measures may sometimes take a form that is neither familiar nor straightforward, given expectations derived from years of confidently evaluating the reliability of norm-referenced measures. At least one expectation, however, can be shared for both types of measures: namely, that a wealth of evidence supporting high reliability counts for nothing without evidence supporting validity. Stated more specifically in terms relevant to criterion-referencing, a classification decision will mean nothing if it is consistent but incorrect.

(4) Careful Description of the Test Takers Used in Studies of Reliability and Validity

"The composition of the validation sample should be described in as much detail as is practicable. Available data on selective factors that might reasonably be expected to influence validity should be described" (AERA, APA, & NCME, 1985, p. 14). "The procedures that are used to obtain samples of individuals, groups, or observations for the purpose of estimating reliabilities and standard errors of measurement, as well as the nature of the population, should be described" (AERA, APA, & NCME, 1985, p. 20).

Although the content of this guideline was touched on under the heading of validity above, isolating it as a separate guideline is meant to further emphasize the specificity required in clinicians' judgments of test adequacy. The relevance of this guideline to criterion-referenced measures might easily be overlooked because score interpretation does not hinge on a comparison of the child's score to the scores of a specified group, as it does for norm-referenced measures. Nonetheless, an appreciation of ways in which the child to be tested may differ from those on whom the test was developed can inform judgments regarding the appropriateness of the measure.


Further, such information can enrich the inferences to be drawn should the test user decide that the test sample is sufficiently similar to the child to warrant its use.

Let's return to the example used to discuss implementation of the validity guideline: a language scale to be used with a child who speaks a dialect other than that for which the measure was initially designed. If information provided by the test author suggests that the measure had been studied using appropriately large groups of children, including those from the child's minority group, then relatively straightforward use of the measure for description of the child's language skills can follow. Alternatively, if it appears that the test had not been studied for such children, the clinician may find no more acceptable alternative measures and so may decide to use the instrument, but include appropriate cautionary statements in its interpretation and supplement its use with an informal measure designed to explore apparent strengths and weaknesses suggested by the standardized measure. Without information provided by the test author concerning the sample used in validation studies, however, the test user is not forewarned of the need for caution.

This guideline is, of course, quite relevant to both norm-referenced and criterion-referenced measures, but may be more commonly followed for norm-referenced measures because of the centrality of comparisons between the test taker and the groups studied during the test's development.

(5) Detailed Description of Test Administration

"The directions for test administration should be presented with sufficient clarity and emphasis so that it is possible to approximate for others the administrative conditions under which the norms and the data on reliability and validity were obtained" (AERA, APA, & NCME, 1985, p. 29).

The reference to norms in the quotation given above will be irrelevant for criterion-referenced measures that are not also norm-referenced. Otherwise, however, this recommendation is vitally important for the appropriate use of standardized criterion-referenced measures. Knowledge of and compliance with recommended procedures during test administration increase the likelihood that the measure will function as intended.

(6) Description of Test User Qualifications

"Test manuals should identify any special qualifications that are required to administer a test and to interpret it properly. Statements of user qualifications should identify the specific training, certification, or experience needed" (AERA, APA, & NCME, 1985, p. 36).

The importance of this guideline increases as the complexity of the response required from the test taker, or of the scoring and test interpretation processes used by the clinician, increases. The availability of such information allows potential test users to determine whether they are currently qualified to use the instrument and what actions (e.g., training, referral) will be needed if they are not. Thus, adequate description of the user's qualifications will contribute to a measure's appropriate use.

As an example, one might consider the use of the Stuttering Severity Instrument for Children and Adults (Riley, 1994) to help describe a specific child's stuttering. Knowledge of the specific (e.g., experience with the specific instrument) and general qualifications expected of examiners (e.g., previous experience assessing children with disfluencies) will help ensure that test users do not fail to recognize disfluencies and thereby underestimate the severity of the child's problem because of their own inexperience.

As with the previous guidelines provided here, this guideline applies to both norm-referenced and criterion-referenced measures. In the case of norm-referenced measures, knowledge of expected test user qualifications helps preserve the integrity of the normative comparison, whereas for criterion-referenced measures, it preserves the integrity of the comparison to a performance criterion.

In summary, the six guidelines offered for the evaluation and selection of standardized criterion-referenced measures are closely related to those likely to be offered for similar purposes in relation to standardized norm-referenced instruments. With regard to some guidelines (e.g., that related to the need for clearly defined administration procedures), only small differences exist in the kinds of information being sought for the two kinds of measures; with regard to others (e.g., that related to reliability), an abundance of quite different, alternative methods exists for the newer, criterion-referenced measures, yet no clear front-runners have emerged to serve as the basis for explicit recommendations.

SUGGESTIONS FOR THE DEVELOPMENT AND USE OF INFORMAL CRITERION-REFERENCED INSTRUMENTS

In preceding sections, little of the proffered guidance relates to the informal criterion-referenced measures needed to answer many of the types of questions posed in clinical practice. Despite the need for such measures to be consistent (i.e., reliable) and useful for the purposes to which they are put (i.e., valid), they are rarely discussed in terms of reliability and validity. The Standards (AERA, APA, & NCME, 1985) indicate that although the guidelines can be "usefully applied to the full range of assessment techniques" (p. 4), less rigor can realistically be expected for measures that can be characterized as informal.

Recommendations regarding the appropriate use of informal measures are most often encountered in texts dealing with specific communication disorders (e.g., Bernthal & Bankson, 1993) and in research articles in which a set of tasks or items is described in considerable detail along with specific theoretical justifications for its use (e.g., Lewis & Freebairn, 1992). Such locations are desirable because they promote combined discussions of


what to measure and how to measure it within larger discussions of the nature of the disorder or how its damaging effects may be ameliorated or circumvented. Rarely, however, are informal criterion-referenced measures discussed within frameworks encouraging explicit attention to the promotion of their reliability and validity, with a few notable exceptions (e.g., Cole, Mills, & Dale, 1989; Leonard, Prutting, Perozzi, & Berkley, 1978; Vetter, 1988).

Vetter (1988) describes a process to be followed when the clinician has determined that alternative standardized measures are inappropriate or absent. In Figure 1, that process is illustrated using an example from George, the 7-year-old boy described at the beginning of this article. Whereas clinicians may wish to adopt this process fully only for those informal measures they are likely to use repeatedly, it articulates steps in decision making that can benefit the development of any informal measure.

Five major steps are concerned with the design of the measure:

1. identification of the specific question to be answered concerning the client;

2. selection of stimulus items that cover the desired content, are relevant to that content, and are of appropriate difficulty for the client;

3. simultaneous identification of expected, desirable responses that can reasonably be executed by the client and reliably scored by the clinician;

4. formulation of instructions likely to be understood by the client; and

5. development of decision-making guidelines, including performance guidelines.

In the case of George, the clinical question concerned his use of the grammatical morpheme -ing. A set of stimuli was chosen that consisted of 20 pictures illustrating actions that could be described using the targeted form in combination with a variety of verbs likely to be in George's vocabulary. Sentences were selected as the desired response because the clinician believed that George was capable of producing sentences of that complexity. The specific instruction that was used, "What's this person doing?", was designed to establish the need for the -ing form. Sixteen correct items were selected as the criterion for passing on the somewhat arbitrary reasoning that 80% correct production would suggest a relatively well-established form in George's production.
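The decision-making guideline for George's probe amounts to a simple arithmetic rule, sketched below (an illustration only; the function name and response coding are hypothetical, and the 16-of-20 criterion is the one described above).

```python
# Illustrative sketch of the probe's decision rule: pass if at least 16 of
# the 20 pictured actions are described with a correct V + -ing form
# (i.e., 80% correct production).

def probe_decision(responses, criterion=16):
    """responses: one boolean per picture (True = correct -ing production)."""
    correct = sum(responses)
    return "pass" if correct >= criterion else "continue targeting -ing"

# Hypothetical administration: the child produces the form on 13 of 20 items.
responses = [True] * 13 + [False] * 7
print(probe_decision(responses))  # continue targeting -ing
```

Separating the decision rule from the stimuli in this way also makes it easy to record the probe's essential features for later readministration, as the final step of Vetter's process recommends.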

As illustrated in Figure 1, once the measure is administered and interpreted according to the predetermined design, additional steps depend on the measure's success. If the measure failed to serve its intended purpose, the clinician would alter some aspect of the procedure's design and use it in its revised form. On the other hand, if it appeared to perform as desired, the clinician would record its essential features for later use. In George's case, for example, he might produce a reasonable description of the picture but fail to use the targeted form, thus requiring either a modification of task instructions or the addition of a small number of training items. Alternatively, if implementation of the procedure were to suggest that George was able to follow

Figure 1. Steps in the development of an informal, criterion-referenced procedure.

1. Formulate the specific clinical question: Does George use the morpheme -ing correctly in conversation?
2. Select stimulus items: 20 pictures showing action.
3. Identify desired response: Sentences with V + -ing.
4. Formulate instructions: "Tell me what the person's doing."
5. Develop decision-making guidelines: Pass if 16 or more sentences are correct.
6. Administer the informal procedure.
7. Evaluate procedure effectiveness: if not effective, revise preliminary steps as needed; if effective, record the procedure for later use.

Adapted from Vetter, D. K. (1988). Designing informal assessment procedures. In D. E. Yoder & R. D. Kent (Eds.), Decision making in speech-language pathology (pp. 192-193). Philadelphia: B. C. Decker. Used with permission.

the instructions, but performed poorly in his use of -ing, the clinician could incorporate that morpheme in treatment goals and save materials used for this probe for later use with other children, or with George after treatment had been undertaken using other materials.

Although the techniques proposed by Vetter (1988) are arguably cumbersome for fast-paced clinical lives, they reflect a method for thinking about criterion-referenced measures that addresses many basic psychometric concerns. Thus, they can serve not only as recommendations for procedures to be used in the development of informal measures, but also as a summary of the guidelines provided in the earlier section of this article devoted to standardized criterion-referenced measures.


CONCLUSIONS

It is hoped that readers will conclude their reading of this article with a sense of having become somewhat more familiar with criterion-referenced measures, both by realizing what they already knew about such measures through use and familiarity with basic concepts underlying test use and by being exposed to discussions of their unique characteristics, use, and development. Repeated comparisons of norm- and criterion-referencing were intended to emphasize the intrinsic similarities of these approaches to score interpretation while acknowledging the incompatibilities that make their simultaneous application unwieldy, though not impossible.

Although the major thrust of this article was an educational one, at least one clinical need of a noneducational variety suggests itself. The relatively small number of standardized measures among those included in Table 1 was intended to reflect their relative scarcity; indeed, standardized measures are lacking for many applications. Although deficiencies persist in the content knowledge required to support the development of such measures for many applications, continuing developments in the study of normal and abnormal development of communication skills (e.g., Smit, 1993; Windsor, 1994) are paving the way for future development of appropriate criterion-referenced measures in many areas.

In addition, a recent trend in educational and psychological testing (e.g., Bennett, 1993; Gitomer, 1993) involves a turning away from a dominant focus on the paper-and-pencil tests that have for so long provided a major focus for psychometricians in favor of the constructed-response, or performance, measures more typical of communication measures. This trend may do much to supplement the technical knowledge needed for improvements to take place in our description of what children with speech-language deficits do and do not do in their efforts to communicate.

ACKNOWLEDGMENTS

The author thanks Chris Murphy, Mary Houghton, and Elena Plante, as well as several anonymous reviewers, for their help with this manuscript. A somewhat belated thanks is also extended to Ronald Berk and Darrell Sabers, who some time ago introduced me to the concept of criterion-referencing. Sabers' taped lectures, in particular, provided me with desert travel time that was far more enjoyable than even that provided by country music or talk radio.

REFERENCES

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1985). Standards for educational and psychological testing. Washington, DC: American Psychological Association.

American Speech-Language-Hearing Association. (1988, March). Guidelines for determining threshold level for speech. Asha, 30, 85-89.

Bennett, R. E. (1993). On the meanings of constructed response. In R. E. Bennett & W. C. Ward (Eds.), Construction versus choice in cognitive measurement: Issues in constructed response, performance testing, and portfolio assessment (pp. 1-27). Hillsdale, NJ: Lawrence Erlbaum Associates.

Berk, R. A. (1984a). Conducting the item analysis. In R. A. Berk (Ed.), A guide to criterion-referenced test construction (pp. 97-143). Baltimore, MD: Johns Hopkins University Press.

Berk, R. A. (Ed.). (1984b). A guide to criterion-referenced test construction. Baltimore, MD: Johns Hopkins University Press.

Bernthal, J. E., & Bankson, N. W. (1993). Articulation and phonological disorders (3rd ed.). Englewood Cliffs, NJ: Prentice-Hall.

Brown, J. R. (1989). The truth about scores children achieve on tests. Language, Speech, and Hearing Services in Schools, 20, 366-371.

Brown, R. (1973). A first language: The early stages. Cambridge, MA: Harvard University Press.

Carhart, R., & Jerger, J. F. (1959). Preferred method for clinical determination of pure-tone thresholds. Journal of Speech and Hearing Disorders, 24, 330-345.

Carver, R. P. (1974). Two dimensions of tests: Psychometric and edumetric. American Psychologist, 29, 512-518.

Cizek, G. J. (1993). Reconsidering standards and criteria. Journal of Educational Measurement, 30, 93-106.

Cole, K. N., Mills, P. E., & Dale, P. S. (1989). Examination of test-retest and split-half reliability for measures derived from language samples of young handicapped children. Language, Speech, and Hearing Services in Schools, 20, 259-267.

Colton, R. H., & Casper, J. K. (1990). Understanding voice problems: A physiological perspective for diagnosis and treatment. Baltimore, MD: Williams & Wilkins.

Eger, D., Mient, M. G., & Cushman, B. B. (1986, May). When is enough enough?: Articulation therapy dismissal considerations in the public schools. Asha, 28, 23-25.

Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105-146). New York: American Council on Education and Macmillan Publishing Co.

Gitomer, D. H. (1993). Performance assessment and educational measurement. In R. E. Bennett & W. C. Ward (Eds.), Construction versus choice in cognitive measurement: Issues in constructed response, performance testing, and portfolio assessment (pp. 241-263). Hillsdale, NJ: Lawrence Erlbaum Associates.

Glaser, R. (1963). Instructional technology and the measurement of learning outcomes. American Psychologist, 18, 519-521.

Glaser, R., & Klaus, D. J. (1962). Proficiency measurement: Assessing human performance. In R. Gagne (Ed.), Psychological principles in systems development (pp. 419-476). New York: Holt, Rinehart, & Winston.

Gronlund, N. E. (1982). Constructing achievement tests. Englewood Cliffs, NJ: Prentice-Hall.

Hambleton, R. K. (1984). Validating the test scores. In R. A. Berk (Ed.), A guide to criterion-referenced test construction (pp. 199-223). Baltimore, MD: Johns Hopkins University Press.

Haynes, W. O., Pindzola, R. H., & Emerick, L. L. (1992). Diagnosis and evaluation in speech pathology (4th ed.). Englewood Cliffs, NJ: Prentice-Hall.

Hutchinson, T. (1988, Spring). Coming changes in speech and language testing. Hearsay: Journal of the Ohio Speech and Hearing Association, 10-13, 27.


Khan, L., & Lewis, N. (1986). Khan-Lewis Phonological Analysis. Austin, TX: Pro-Ed.

Lahey, M. (1988). Language disorders and language development. New York: Macmillan.

Leonard, L., Prutting, C. A., Perozzi, J. A., & Berkley, R. K. (1978, May). Nonstandardized approaches to the assessment of language behaviors. Asha, 20, 371-379.

Lewis, B. A., & Freebairn, L. (1992). Residual effects of preschool phonology disorders in grade school, adolescence, and adulthood. Journal of Speech and Hearing Research, 35, 819-831.

Ling, D. (1976). Speech and the hearing-impaired child: Theory and practice. Washington, DC: Alexander Graham Bell Association for the Deaf.

McCauley, R. J., & Swisher, L. (1984a). Psychometric review of language and articulation tests for preschool children. Journal of Speech and Hearing Disorders, 49, 34-42.

McCauley, R. J., & Swisher, L. (1984b). Use and misuse of norm-referenced tests in clinical assessment: A hypothetical case. Journal of Speech and Hearing Disorders, 49, 338-348.

McReynolds, L. V., & Elbert, M. (1981). Criteria for phonological process analysis. Journal of Speech and Hearing Disorders, 46, 197-204.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: American Council on Education and Macmillan Publishing Co.

Millman, J., & Greene, J. (1989). The specification and development of tests of achievement and ability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 335-366). New York: American Council on Education and Macmillan Publishing Co.

Muma, J., Pierce, S., & Muma, D. (1983, June). Language training in speech-language pathology: Substantive domains. Asha, 25, 35-40.

Nelson, N. W. (1989). Curriculum-based language assessment and intervention. Language, Speech, and Hearing Services in Schools, 20, 170-184.

Nelson, N. W. (1993). Childhood language disorders in context: Infancy through adolescence. New York: Merrill.

Newhoff, M., & Leonard, L. B. (1983). Diagnosis of developmental language disorders. In I. J. Meitus & B. Weinberg (Eds.), Diagnosis in speech-language pathology (pp. 71-112). Baltimore, MD: University Park Press.

Nitko, A. M. (1983). Criterion-referenced interpretations. In A. M. Nitko (Ed.), Educational tests and measurement: An introduction (pp. 442-463). New York: Harcourt Brace Jovanovich.

Nitko, A. M. (1989). Designing tests that are integrated with instruction. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 447-474). New York: American Council on Education and Macmillan Publishing Co.

Popham, W. J. (1984). Specifying the domain of content or behaviors. In R. A. Berk (Ed.), A guide to criterion-referenced test construction (pp. 29-48). Baltimore, MD: Johns Hopkins University Press.

Reed, V. A. (1994). An introduction to children with language disorders (2nd ed.). New York: Merrill.

Richardson, E., & DiBenedetto, B. (1985). Decoding Skills Test. Parkton, MD: York Press.

Riley, G. D. (1994). Stuttering Severity Instrument for Children and Adults (3rd ed.). Austin, TX: Pro-Ed.

Salvia, J., & Ysseldyke, J. E. (1991). Assessment (5th ed.). Boston: Houghton Mifflin.

Shepard, L. A. (1984). Setting performance standards. In R. A. Berk (Ed.), A guide to criterion-referenced test construction (pp. 169-198). Baltimore, MD: Johns Hopkins University Press.

Shriberg, L., & Kwiatkowski, J. (1982). Phonological disorders III: A procedure for assessing severity of involvement. Journal of Speech and Hearing Disorders, 47, 256-270.

Smit, A. B. (1993). Phonologic error distributions in the Iowa-Nebraska articulation norms project: Word-initial consonant clusters. Journal of Speech and Hearing Research, 36, 931-947.

Snyder-McLean, L., & McLean, J. E. (1988). Sociocommunicative competence in the severely/profoundly handicapped child: Assessment. In D. E. Yoder & R. D. Kent (Eds.), Decision making in speech-language pathology (pp. 64-65). Philadelphia: B. C. Decker.

St. Louis, K. O., & Ruscello, D. M. (1987). Oral Speech Mechanism Screening Examination (Revised). Austin, TX: Pro-Ed.

Stoel-Gammon, C., & Dunn, C. (1985). Normal and disordered phonology in children. Baltimore, MD: University Park Press.

Subkoviak, M. J. (1984). Estimating the reliability of mastery-nonmastery classifications. In R. A. Berk (Ed.), A guide to criterion-referenced test construction (pp. 267-291). Baltimore, MD: Johns Hopkins University Press.

Taylor, O. L., & Anderson, N. B. (1988). Communication behaviors that vary from standard norms: Assessment. In D. E. Yoder & R. D. Kent (Eds.), Decision making in speech-language pathology (pp. 84-85). Philadelphia: B. C. Decker.

Terrell, S. L., Arensberg, K., & Rosa, M. (1992). Parent-child comparative analysis: A criterion-referenced method for the nondiscriminatory assessment of a child who spoke a relatively uncommon dialect of English. Language, Speech, and Hearing Services in Schools, 23, 34-42.

Tomblin, J. B. (1994). Perspectives on diagnosis. In J. B. Tomblin, H. L. Morris, & D. C. Spriestersbach (Eds.), Diagnosis in speech-language pathology (pp. 1-28). San Diego: Singular Publishing Group.

Vetter, D. K. (1988). Designing informal assessment procedures. In D. E. Yoder & R. D. Kent (Eds.), Decision making in speech-language pathology (pp. 192-193). Philadelphia: B. C. Decker.

Wallach, G. P., & Butler, K. G. (1994). Language learning disabilities in school-age children and adolescents: Some principles and applications. New York: Macmillan.

Weiss, C. E., Gordon, M. E., & Lillywhite, H. S. (1987). Articulatory and phonologic disorders. Baltimore, MD: Williams & Wilkins.

Windsor, J. (1994). Children's comprehension and production of derivational suffixes. Journal of Speech and Hearing Research, 37, 408-417.

Received April 21, 1994
Accepted November X

Contact author: Rebecca J. McCauley, Department of Communication Science, University of Vermont, Allen House, 461 Main Street, Burlington, VT 05405-0010.
