
CALICO Journal, Volume 12 Number 1 37

TESTING COMPUTER ASSISTED LANGUAGE TESTING: TOWARDS A CHECKLIST FOR CALT

José Noijons, Dutch National Institute for Educational Measurement (Cito), The Netherlands

ABSTRACT

Much computer assisted language learning (CALL) material that includes tests and exercises looks attractive enough but is clearly lacking in terms of validation: the possibilities of the computer and the inventiveness of the programmers mainly determine the format of tests and exercises, causing possible harm to a fair assessment of pupils' language abilities.

This article begins with a definition of computer assisted language testing (CALT), followed by a discussion of the various processes involved. Both advantages and disadvantages of CALT are outlined. Psychometric aspects of computer adaptive testing are then discussed. Issues of validity and reliability in CALT are acknowledged.

A table of factors in CALT distinguishes between test content and the mechanics of taking a test, before, during and after a test. The various factors are examined and comprise a table for developing a CALT checklist. The article ends with a call for professional testers and developers of educational software to work together in developing CALT.

KEYWORDS

Assessment, computer assisted language testing, computer adaptive testing, item banking, item response theory, test factors, test functions, test objectives, item types, checklists.


INTRODUCTION

The literature on CALL has paid relatively little attention to computer assisted language testing (CALT), although the computer has played a role in language testing for more than 20 years. In the Netherlands Cito, the Dutch National Institute for Educational Measurement, has used computers to score the multiple choice reading tests of hundreds of thousands of secondary school students since 1967. Computers have assisted in research studies on validation and reliability, and for some ten years, have been indispensable in calibration and establishing the equivalency of language tests. In the Dutch situation the quality of language tests would have been considerably lower if computers had not been available.

WHAT IS CALT?

Computers may have been important in language testing, yet their role was limited: only a relatively small group of professional language testers made use of computers in the production and validation of language tests. When we today refer to CALT we must think of a wider phenomenon to be defined as follows:

CALT is an integrated procedure in which language performance is elicited and assessed with the help of a computer.

In its purest form three integrated procedures can be distinguished, which relate to the following processes:

1. generating the test

2. interaction with candidate

3. evaluation of responses

However, a strict separation of these processes is not quite possible: in many cases evaluation of responses will run parallel to or be integrated into the interaction with a testee.

The first procedure, generating the test, may imply that the computer itself generates assignments (as in some cloze tests). But it may also refer to a process in which the computer selects a number of items from an item bank, randomly or following some selection procedure. In the latter case another process has already taken place: a number of items have been collected and calibrated for use in the construction of a test. We shall look into the generating procedures in more detail later.

Modern CALT research focuses much more attention on the second procedure, interaction with the candidate. In on-line interaction between computer and testee, differences between CALT and traditional (pencil and paper) tests are clearest. All sorts of serious problems may arise here, because relatively new processes are involved which may interfere with the testing of the language skill or proficiency concerned. The development of procedures for these processes demands more careful attention.

When there is not an on-line interaction between computer and candidate, we may still speak of CALT, certainly if the further processing of data is done with the help of computers (cf. the use of machine-scorable answer-sheets or the use of other devices that store a testee's responses). However, when we speak of CALT in this context we imply on-line interaction.

The third procedure relates to the evaluation of responses. In many CALT programs, a candidate's data may already have been evaluated during the preceding process, but in this third procedure all data are called up for a final evaluation of the complete response. Depending on the test format and the interaction, these data may relate to a testee's responses, the time used, or routing, to give a few examples. These data may be combined with earlier input concerning pass/fail scores, earlier performances of a candidate, performances of comparable candidates, etc. It is not strictly necessary in CALT for a test to produce a final assessment on screen or on paper, but it is important that a testee's data be made available for such a final assessment: after all, testing is about measuring and assessment.

ADVANTAGES OF CALT

There is no doubt that CALT offers advantages over traditional testing. Yet some of the advantages mentioned below seem to be theoretical rather than practical. Computers may offer possibilities that language testers have not thought of. To simply adopt all these possibilities without paying close attention to their didactic and pedagogical implications would be unwise.


Time

Because the computer can measure the time a testee takes to do each assignment, the criterion "time needed" can be used to make a distinction between candidates, but only if such a distinction is relevant. It may be of use in testing reading skills, but only if testees have been trained in speed reading. Measurement of time needed may be useful in tests of the efficiency of the learning process. Time needed in itself is not always an indication of reading proficiency. A slow reader is not necessarily a poorer reader.
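The per-assignment timing described above can be sketched in a few lines. This is a minimal illustration, not taken from any particular testing program; the two callbacks are hypothetical placeholders for the program's own presentation and input routines.

```python
import time

def timed_response(present_item, get_answer):
    """Present one assignment and record how long the testee takes.

    present_item and get_answer are hypothetical callbacks supplied by
    the testing program; only the timing logic is shown here."""
    present_item()
    start = time.monotonic()          # monotonic clock: safe for elapsed time
    answer = get_answer()
    elapsed = time.monotonic() - start
    return answer, elapsed
```

The elapsed time would be stored alongside the answer and interpreted only when, as argued above, speed is actually relevant to the skill being tested.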

Routing

The computer can register a candidate's route through a test. How often does s/he go back to an assignment, how often does s/he correct his answers, when does s/he ask for help? The answers to these questions may be of use if we are interested in a testee's problem-solving strategies. Also, when testing preliminary skills such as morphology or syntax during the learning process, such registration of a candidate's route may be useful. However, if we use CALT to test reading proficiency or listening comprehension as final objectives, help and feedback will be less obvious features. Information about routing would not necessarily be very useful.
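Registering a route of this kind amounts to keeping an ordered event log. The sketch below is an illustration only; the event names ('back', 'correction', 'help') are assumptions, not taken from any existing CALT program.

```python
from collections import Counter

class RouteLog:
    """Minimal sketch of registering a candidate's route through a test."""

    def __init__(self):
        self.events = []              # ordered (event, item_no) pairs

    def record(self, event, item_no):
        """Log one routing event, e.g. 'back', 'correction', 'help'."""
        self.events.append((event, item_no))

    def summary(self):
        """Count how often each kind of event occurred."""
        return Counter(event for event, _ in self.events)
```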

Storage

The computer can have available large amounts of information stored in a well-organized way: a candidate may have easy access to this information. Testees may use this information when being tested in writing proficiency and literature.

Psychometrics

The computer is very fast in performing psychometric calculations that make it easy to put together tests with the right (or with equal) difficulty indices or discrimination parameters. A continuous evaluation of the suitability of items for a given population is possible. Computers can compute which assignment (from an item bank) would fit best with the candidate's measured ability so far.


Multimedia

A computer can be the central facility in a multimedia environment: it can control devices such as printers, sound cards, videodisc players and CD-ROM. The computer may offer interesting possibilities in language testing: writing tests may offer stimuli in a realistic context and listening tests may gain authenticity. Also, literature may be tested with relevant visual or aural material in this way. However, the on-board facilities of the computer screen are limited, can be tiring and are sometimes counterproductive. Visual stimuli may cause misunderstandings, which are easy to correct in exercises but will cause problems in testing, especially if the tester is not available for help.

Standardization

As a general rule the reliability of tests depends much on standardization: tests administered in identical circumstances tend to give more reliable results. CALT facilitates standardization, not only as far as the actual assignments are concerned, but also regarding the test format as a whole. CALT can provide testees with a standard introduction which is consistent and independent of the individual tester.

DISADVANTAGES OF CALT

CALT cannot always equal classical testing methods, let alone improve on them. Problems arise in the following fields:

Expertise

A computer itself cannot really evaluate. It can compare a testee's input with input provided by the tester beforehand, but very often this comparison is too limited to render a judgement without a human assessor intervening. CALT may be a help to assessors, but most judgements will have to be given by a human tester personally. Also, a computer may have much information ready, yet it can only make use of it to the extent that it was programmed to do so.


To produce good CALT programs a test developer must have the following areas of expertise:

a) knowledge of language and language proficiency

b) expertise in testing

c) insight into testing programs, existing programs and programs that are to be developed.

Given the fact that many teachers and software developers have limited expertise in (conventional) testing, it comes as no surprise that there is so little CALT of good quality.

Technology

A computer does not understand or read very well, although its speech is reasonably understandable. Its presentation of text leaves much to be desired. Therefore we can only make limited use of these functions. The fact that open input is not always possible will invalidate some CALT. A testee's response tends to be reduced to the touch of a limited number of keys. In behavioral terms, there is quite some distance between a candidate's response and the ability that is to be measured.

Authoring programs or systems tend to offer testers little scope for tests. A tester may feel tied or restricted by such programs or systems, and the testee will not appreciate being presented with tests on a variety of subjects in an unvarying format. Authoring languages are much more versatile, but they will often require from test developers the expertise associated with specialist programmers. In practice, authoring languages are reduced to authoring systems or programs so that testers are able to use them.

Implementation

CALT calls for rather sophisticated hardware. Such hardware is increasingly available, but not necessarily to language teachers, and it is only seldom suitable for testing situations. What is needed is an infrastructure that is useful for both teaching and testing. Such an infrastructure is very costly, and with the rapid superannuation of hardware, it is a questionable investment.


There is some software for CALT, but it is not always easy for teachers to fit ready-made testing programs into everyday classroom procedures. CALT is not always as efficient as it claims to be. Teachers are prepared to spend much of their budgets buying these programs, but they will find that authoring languages/systems require much time to master and then use to develop CALT. Thus time pressure may be a hindrance to the implementation of CALT.

THE DEVELOPMENT OF CALT

In developing CALT (or any other type of test for that matter) it should be borne in mind that tests must meet two important criteria:

• validity: the test must measure what the test developer has meant it to measure;

• reliability: the test must yield consistent and reproducible scores.

In this article we shall only refer to CALT-specific problems of validity and reliability. There are many more standards to comply with in reliable and valid testing, but inclusion of those would be beyond the scope of this article.

VALIDITY

It is in matters of validity that most problems with CALT arise, although there is no formal reason that the construct of CALT should produce less valid tests than that of more traditional testing, as is pointed out by Green (1988). In many computer based tests, it seems that the test developer started from a valid objective, yet the limitations of a program, system or language, or the tester's own limitations, have resulted in a test that has little to do with the testing objective s/he had in mind.

Also, we must always ask ourselves if CALT is the most valid method of testing a certain language skill or proficiency. If more valid tests are available, should not these be preferred? For example, in reading comprehension, CALT is at a disadvantage compared to many conventional pencil and paper tests. It can be fatiguing to read from a screen, large sections of a text cannot be viewed simultaneously, and browsing is complicated. Thus, there is the danger that optical stamina and memory will influence CALT reading comprehension results.

Next, each and every assignment or item should be valid in itself. If we take a computer generated cloze test in which every nth word is to be deleted, the result may be that only one type of word is deleted. One may wonder whether or not the test developer had this in mind: if s/he wished to test his students' knowledge of conjunctions, for example, s/he should have deleted the conjunctions. In fact, this could have been done in a printed format just as easily. It is true that some programs allow testers to adapt the choice of deletions, but then the procedure is made less efficient. There are even programs with large data banks that will recognize such parts of speech as conjunctions, so that teachers are tempted to produce cloze tests because of the relative ease with which the computer can produce them. However, professional language testers hesitate to use the (traditional type of) cloze tests, because there is some doubt about their validity. They are perceived as highly artificial and "untestlike" tasks by many test takers (Bachman 1990).
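The nth-word deletion procedure questioned above is simple enough to sketch, and the sketch makes the validity problem visible: the deletions fall wherever the count lands, so the tester has no control over which word classes end up being tested. This is an illustrative implementation, not code from any cloze-generating program.

```python
def nth_word_cloze(text, n):
    """Delete every nth word, the classical cloze procedure.

    Returns the blanked text plus a dict of the deleted words, so one
    can inspect which word classes the mechanical count happened to hit."""
    words = text.split()
    deleted = {}
    for i in range(n - 1, len(words), n):   # positions n, 2n, 3n, ...
        deleted[i] = words[i]
        words[i] = "____"
    return " ".join(words), deleted
```

Running it on a short sentence with n = 4 deletes a preposition and a verb; nothing guarantees the deletions match the test developer's objective.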

Lastly, each item in the test should measure the same proficiency or skill. Preferably, subtests should be used for the various subskills that can be distinguished in a skill or proficiency. If items in a test do not all measure the same (sub)skill, is there a guarantee that the various (sub)skills are aspects of one overall skill? Also, care must be taken that the selection of items in the test is a valid representation of the (sub)skills involved. Thus tests that make use of item banks must have valid selection procedures.

RELIABILITY

Concerning reliability, the degree to which a test yields consistent and reproducible scores, CALT may have clear advantages over classical methods of testing, certainly if the latter involve nonobjective methods of scoring. Procedures in pencil and paper tests where testees have to mark their answers on an answer-sheet may produce unpredictable results. Machines may not be able to read certain sheets, and processing by testing personnel may introduce human errors. Also, the answering process in CALT may be much more controlled, so that double answers or inadvertent skipping of answers can be prevented.


Yet CALT is not a guarantee for reliable testing. Some questions remain:

• Can each test be reproduced, and is a record made of the actual assignments a candidate has been given?

• Will candidates have the same scores if tested under identical circumstances?

• How foolproof is the test? If other media are incorporated, do these procedures always guarantee unimpeded testing?

• Does the test function independently of outside circumstances? Does the test stand up to varying circumstances that have nothing to do with what is being tested (changes in localities, background noise, lighting)?

COMPUTERIZED ADAPTIVE TESTING

Some of the advantages of CALT mentioned above have resulted in the production of interactive, tailored tests, where the test adapts itself to the measured ability of the candidate. Such testing is usually referred to as CAT, Computerized Adaptive Testing. We shall use the acronym CAT for computerized adaptive language testing as well, so as to avoid confusion with the more general term CALT for any type of computerized language test.

The special feature of CAT consists in the fact that the testee is presented with an assignment dependent on his responses to preceding assignments. If a candidate gives a correct response, s/he is given a more difficult assignment; if s/he gives a wrong response, s/he is given a simpler assignment. A variation of this is where it is not the ability only that determines the choice of the next assignment, but the character of the testee's response. The test may therefore also be content driven: a particular response to an assignment calls for a new assignment that is in line with the preceding one.

Dependent on the format of the test, the candidate's ability will be known after s/he has responded to a given number of assignments of comparable difficulty. Such tests can vary in length, depending on the consistency of the candidate's response behavior. The testee is usually given a short introductory test (with a standard number of questions) to make a provisional estimation of his ability. After this, the actual CAT procedure starts. CAT procedures usually work on the theory that a candidate's observed behavior (his response) is based on a latent ability (trait). Tests that have been calibrated with the help of this theory measure the position of a candidate's ability on a continuum of that ability (from low to high ability). This supposition is also basic to the so-called item response theory (IRT). For each item in a collection of items (an item bank, for example) that all measure the same ability, we can determine the area on the ability continuum in which such an item measures this ability most accurately (gives most information). See Weiss (1983).
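The adaptive loop described above can be sketched crudely. The sketch below is an assumption-laden illustration, not a description of any real CAT system: it selects, at each step, the unused item whose calibrated difficulty lies closest to the current ability estimate (under the Rasch model, the item giving most information), and nudges the estimate up after a correct response and down after a wrong one. A real CAT would re-estimate ability with maximum-likelihood IRT procedures rather than a fixed step.

```python
def adaptive_test(items, respond, rounds=10, step=0.5):
    """Crude sketch of a CAT loop.

    items   -- dict mapping item id to calibrated difficulty (logits)
    respond -- callback: item id -> True (correct) / False (wrong)

    A correct response raises the ability estimate and so selects a
    harder next item; a wrong response does the opposite."""
    ability = 0.0                      # provisional estimate
    unused = dict(items)
    for _ in range(min(rounds, len(items))):
        # item with difficulty nearest the estimate gives most information
        item = min(unused, key=lambda i: abs(unused[i] - ability))
        del unused[item]
        ability += step if respond(item) else -step
    return ability
```

With four items of difficulties -1, 0, 1 and 2, a candidate who answers everything correctly is walked up through ever harder items and ends with a high estimate, while a candidate who answers everything wrongly is walked down.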

CAT has a number of advantages over other types of traditional and CALT tests. Larson (1987) and Alderson (1986) discuss a number of them:

• reduction of testing time: CAT needs fewer items in a test.

• reduction of test frustration: because CAT adapts itself to a candidate's ability, the candidate will not be asked very difficult or oversimple questions.

• immediate feedback: the testee (or the tester) will learn about a testee's ability immediately after the test is over.

• absence of time pressure: the candidate can go through a test at his or her ownpace.

• fewer testing personnel: programs are self-supporting.

• greater test security: each test is unique, fraud is almost excluded.

• simple pretesting: new items can easily be added to the item bank and be pretested without interfering with a candidate's test results.

• easy removal of faulty items: the format and the actual appearance of the test need not be impaired when faulty items are removed.

CAT has some disadvantages as well:

• Large-scale calibration of items is necessary. Depending on the test, some 200 observations per item are required at Cito, the Dutch National Institute for Educational Measurement.

• Large item banks (with calibrated items) are necessary. Depending on the scales that are used, banks of some 1,500 items are no exception.

The following paragraphs will treat CALT and CAT together, as most of the discussion is equally applicable to both. In some cases CAT may call for extra procedures, which will be indicated.

THE DISTINCTION BETWEEN EXERCISE AND TEST

Although this article is about testing, it is interesting to see that in CALL the distinction between an exercise (an assignment meant to develop a skill or proficiency) and a test (an assignment meant as a measuring instrument) is disappearing, probably because the efficiency of exercises increases when there is continuous and immediate assessment. A computer is capable of such "interim" assessment, and in some cases, the computer may be able to present candidates with those assignments that fit best with the measured ability of a candidate. Yet differences between tests and exercises remain.

The criteria used in the evaluation of a performance in an exercise are often closely linked to the design of the exercise. Judgments only pertain to a testee's ability to do the exercise and not to the language skill or proficiency that the exercise hopes to develop. This may be the case in some drills and exercises with little variation: the candidate responds on the basis of the analogy between assignments. This problem may occur in tests as well, in which case it seriously invalidates a test.

Feedback and help differ considerably in exercises and tests. A test usually has little or no feedback, and help is only given to make test procedures run smoothly. In conventional (pencil and paper) tests, there is a tradition of unfriendliness or even hostility towards testees. CALL exercises with their learner friendly interfaces may have a positive influence on the development of CALT.

Feedback in CALT may be useful if a testee's performance is hampered because of the testing situation and not because of limited proficiency. But there will remain a fundamental difference between tests and exercises in this field. In an exercise, feedback and help are meant to develop an ability or skill, whereas in tests, they serve to help gain an insight into the ability or skill that has been achieved.


TEST FACTORS

When a choice must be made from existing CALT programs, or when CALT is to be developed, a number of fundamental questions must be asked as to the function of tests and their objectives, on the basis of a systematic review of factors in testing. Existing checklists for CALL tend to be focused on the interaction with the testee and the user interface; little attention is paid to content and evaluation of responses. On the basis of the distinction between factors in testing by Alessi (1985), a table of factors in CALT has been drawn up (see Figure 1).

As will be noted, a distinction has been made between test content and taking the test, that is between content and mechanics. Also, three stages within the testing process are distinguished: before, during and after the interaction. It is quite possible that for particular tests such a strict separation of processes cannot be made, yet for the sake of clarity this setup has been preferred.


SOME EXPLANATORY NOTES TO THE TABLE OF FACTORS IN CALT

Test Content, Function/Purpose

Which educational objective is being served? Four basic decisions may be distinguished in testing:

• decisions as to the mastery of subject material: is the candidate ready for the following lesson?

• decisions as to placement: which route is best for a candidate to reach a given objective?

• decisions as to classification: which route and which objectives fit a candidate best?

• decisions as to selection: can the candidate be admitted to a particular course?

Figure 2 may help to clarify the above.


When determining the function of a test, both in CALT and in non-CALT, basically one of these four decisions is at stake. Which decision is to be taken on the basis of the test results?

Probably the simplest type of CALT to produce is the mastery test because:

• test content is easy to define as learning objectives tend to be specific;

• the test taker's responses are easy to predict because of the above;

• a meaningful and useful assessment can be given.

CALT can be of great use in diagnostic and placement tests, because:

• such tests are meant for individual candidates;

• assessment need not be of an absolute or final character.

CALT can also be useful in self-assessment, again because of the individual nature of such tests. CALT lends itself less readily to achievement and/or proficiency testing, especially when integrated skills are to be tested. One of the main problems is that such testing often has a final and decisive character: the candidate's future may be at stake. Is the computer capable of such (automated) assessment? We must also bear in mind that in this type of testing, candidates will be nervous because of the importance of the test. CALT in this case must be absolutely error-free.

Test Objective

Given a particular function or purpose, a test covers one or more test objectives. It would be beyond the scope of this article to describe how test objectives are generated. In general they will parallel the learning objectives that have been formulated in an earlier phase. The two most important constituent parts of a test objective are content and observed behavior. The latter may cause problems in CALT. Such behavior is often limited and may therefore give too little information about a candidate's mastery of a particular skill or of his proficiency in a particular field. The opposite may also occur, as in classical tests of oral proficiency: a testee's behavior is so complex that it is very difficult to assess. The point is that we should formulate test objectives in such a way that on the one hand they fit into CALT, but on the other hand are a valid reflection of realistic language behavior. When more test objectives have to be incorporated into one test, we should create a number of subtests and, for each objective, choose the most suitable procedure, item format, etc. Scores can also be weighted more easily in this format. Finally, we should not hesitate to ask whether, given a particular test objective, CALT is the most suitable test method to be used.

Test Length

Up to a certain point, reliability may increase when a test is longer. CALT can offer many items efficiently in a short time. However, working before a screen is fatiguing. This may be another reason to create subtests: the candidate will thus have time to relax at regular intervals. Also, a test should not contain too many items of the same type accompanying different test objectives, as this may cause rather careless responses. We may come across this problem in tests produced within an authoring program. (See also Time.)

Generating Items

With clearly defined test objectives, such as vocabulary, automatic generating of items can be of great help. But in most cases we would prefer an item bank in which items have been included according to a clear paradigm and in which items are regularly checked for their suitability. When an item bank has been set up systematically, it is quite possible to have the computer select the items automatically. Such systems are currently widely used. (Cito has a number of them, even though they are not usually interactive.)

The size of an item bank is determined by the function or purpose of the actual tests that are composed from the bank. If that function is mastery or diagnostic, the item bank may remain rather small. If the (proficiency or achievement) test has a social effect, that is, if the test is used to determine whether a candidate has passed an (entrance) examination, the item bank must be much larger. One reason is that candidates should not become familiar with the items. In CAT, the item banks should be fairly large if the test is to adapt to a testee's ability properly. When item banks are to be used in more than one test, it is important that the item bank have a lucid structure showing how items are connected with content area and objectives.
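Selection from a bank structured by objective, as described above, can be sketched as a filter followed by a random draw. The bank layout (a list of records with 'objective' and 'difficulty' fields) is an assumption made for illustration; it is not the structure of any particular item-banking system.

```python
import random

def draw_items(bank, objective, k, seed=None):
    """Sketch of selecting k items for one test objective from a bank.

    bank -- list of dicts, each with at least an 'objective' key
            (hypothetical layout, chosen for this illustration)."""
    pool = [item for item in bank if item["objective"] == objective]
    if len(pool) < k:
        raise ValueError("item bank too small for this objective")
    rng = random.Random(seed)          # seed makes the test reproducible
    return rng.sample(pool, k)         # draw k distinct items
```

A reproducible draw (fixed seed) also answers one of the reliability questions raised earlier: the actual assignments a candidate was given can be reconstructed.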


A good item bank can be a great help in schools and other educational institutions. Where such item banks are available (e.g. from testing institutes like Educational Testing Service or Cito), teachers are able to test flexibly. They are free to use those items that fit in with their lessons. Yet they can be sure that the scores of their students and those of students elsewhere can be compared. A good item bank thus enhances the reliability and validity of tests.

Item Type

Remarkably, language teachers familiar with CALL have fewer objections to multiple choice questions than most teachers (and others), who often condemn MC in tests of language skills. It is understandable that CALL and CALT often use the multiple choice format, principally because open input is notoriously difficult to process.

Yet the frequent use of multiple choice questions is not without danger. For example, it appears that true/false and yes/no questions are somewhat suspect (Grosse 1985): the choices yes and true are significantly more attractive than their counterparts. This may be due to the fact that candidates prefer the "truth" when in doubt. Also, it is difficult to produce untrue statements. It is better to distinguish between things that have or have not been said, or things that have or have not been proven.

The construction of multiple choice questions is complicated: very often items contain distracters that are not (completely) wrong, and correct answers may be incomplete; alternatives may differ too much from each other in sentence construction or vocabulary. Often alternatives do not relate to the question properly. It is no surprise that the construction of multiple choice questions is a specialty for a small group of test developers.

CALT offers many advantages over classical testing as far as the presentation of multiple choice questions is concerned:

• alternatives can be presented in random order;

• alternatives can be presented one after the other, which makes comparisons between alternatives more difficult.


It should be possible to go back to earlier questions to change answers, as is possible in classical testing. The exception of course is CAT, where such a possibility would conflict with the test format. (See also Test Length.)

Feedback

As we saw above, the purpose of feedback is different for a test and an exercise. In testing, feedback can often be restricted to procedural instructions (when a candidate does not follow the correct procedures). In the case of tests like diagnostic tests, feedback on content may be useful. Feedback on scores and procedural help will be addressed later.

Time

CALT offers facilities, among which the very exact registration of the number of correct responses is the most important. In some countries such as the USA, power tests in language testing exist, but they are not a very common phenomenon. According to Green (1988), assessment is more reliable if the number of items is fixed and the time is measured than the other way round. At Cito, paradigms for the relation speed/precision have been developed for other proficiencies.

Most candidates will actually perform less well under time pressure, so if time is not a relevant factor in a proficiency test, it is advisable to allow a candidate ample time. As mentioned previously, CALT can be fatiguing. When candidates are forced to look at the screen continuously, test time must not be too long.

Registration of Responses

We need data to assess. As the collection of data in CALT is relatively easy, there is a risk that irrelevant information will be stored. Sometimes data have to be kept secret, and candidates should not have access to them. Registration of data makes it possible to check items for their suitability in an item bank, to analyze group scores on a test to determine the level of a group, or to evaluate the quality of a test.
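One such analysis — computing each item's facility value, the proportion of candidates answering it correctly, from the registered responses — can be sketched as follows (a simplified illustration; this is not Cito's actual procedure):

```python
def item_facilities(responses):
    """responses: one list of 0/1 item scores per candidate.

    Returns, for each item, the proportion of candidates who answered it
    correctly -- a basic indicator of item difficulty for item-bank checks.
    """
    n_candidates = len(responses)
    n_items = len(responses[0])
    return [sum(candidate[i] for candidate in responses) / n_candidates
            for i in range(n_items)]

# Three candidates, four items: items 1 and 4 were answered correctly
# by everyone; items 2 and 3 by only one candidate in three.
scores = [[1, 0, 1, 1],
          [1, 1, 0, 1],
          [1, 0, 0, 1]]
facilities = item_facilities(scores)
```

Items with extreme facility values (near 0 or 1) discriminate poorly and would be candidates for revision before entering an item bank.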


Evaluation

As documented earlier, it is doubtful if meaningful evaluation is always possible without the intervention of a human tester. It depends largely on the function of the test and its objectives. The more a test approaches the character of an exercise, the more the evaluation can be automated. However, when an evaluation has to be based on a number of performances on different tests, it may be a problem to do without a human evaluator.

It is important to realize that a survey of scores is not in itself an evaluation. Of course, an assessment of correct and incorrect answers has occurred, but such scores are not meaningful by themselves. A score of 55% correct answers does not mean that a candidate has "passed" a test. In the case of mastery testing, pass/fail scores will usually need to be much higher. It is clear that we must indicate beforehand what the pass/fail score should be, so that the computer can calculate how the candidate's score relates to this.
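Once the cut-off has been fixed beforehand, the computation the computer has to carry out is straightforward; a minimal sketch (the 80% mastery cut-off here is an invented example, not a figure from the article):

```python
def evaluate(correct, total, cutoff):
    """Relate a candidate's raw score to a predetermined pass/fail cut-off.

    Returns the proportion correct and whether it meets the cut-off.
    """
    proportion = correct / total
    return proportion, proportion >= cutoff

# 55% correct does not mean "passed" under a hypothetical 80% mastery cut-off
proportion, passed = evaluate(correct=33, total=60, cutoff=0.80)
```

The essential point is that `cutoff` is set by the test developer before administration; the program only reports how the score relates to it.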

If evaluation is not automated, the method of data registration determines whether a valid and reliable evaluation is possible. CALT does offer possibilities to improve evaluation in language testing in this field.

Presentation of Test Results to Candidate

The more interactive CALT is, the more likely that a candidate will expect some statement about his/her scores on the screen. The candidate may see a score and a pass/fail decision on the screen. It depends on the type of test whether that feedback is sufficient. In diagnostic testing, some feedback has to be given (but "you should work harder" would not be very helpful). When evaluation is not automated, databases can be a useful tool. They force a tester to process data systematically and to present and store information in an orderly way.


TAKING THE TEST

Entrance to the Test

Data that should be registered are:

• candidate's ID

• test number/version

• candidate's response to each item
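These registered data could be grouped in a simple record structure; the sketch below is hypothetical, with illustrative field names:

```python
from dataclasses import dataclass, field

@dataclass
class TestRecord:
    """Minimal registration record: who took which test, and their answers."""
    candidate_id: str
    test_version: str
    responses: dict = field(default_factory=dict)  # item number -> response

    def register(self, item, response):
        self.responses[item] = response

record = TestRecord(candidate_id="C042", test_version="listening-2A")
record.register(1, "b")
record.register(2, "true")
```

Keeping the three kinds of data in one record makes it easy both to resume after a breakdown and to restrict access to the complete file afterwards.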

There should be a procedure to prevent candidates from taking the wrong test. In many cases a tester hands the candidate the appropriate diskette, but a check should still be made. CALT is sensitive to fraud as far as access is concerned. If much is at stake, precautions should be taken to make sure that candidate X is X and not a stand-in.

Test Instructions

The candidate should get clear information about the performance that is expected. S/He should also know what facilities there are (a help key, for example) and what restrictions s/he has to observe (no note-taking). Finally, s/he should be told what is to be tested and how much time is at his/her disposal.

Examples of Items

The candidate should be presented with an example of the sort of items s/he will encounter in the test. S/He will thus learn how to handle the computer and the test format. A test should not start before the candidate has actually stated that s/he understands all the procedures.

Check of Test

Many testing programs are foolproof, yet it is of great importance that the tester (and preferably an expert outsider) makes a trial run through the test.

Fraud

When a candidate cheats during the test, it must be possible to stop the test without losing the responses given so far.


Breakdowns

In the case of equipment breakdowns, the responses given thus far should not be lost. There should be clear instructions on what to do in case of breakdowns, so that the candidate will not be frustrated. It should be possible to continue the test at the point where the trouble started. Breakdowns will obviously occur more frequently when candidates have not been instructed properly.

Feedback

There should be feedback on those actions that are not permitted in a test, such as incorrect use of (function) keys or the space bar. There is nothing more frustrating for a candidate than a beeping computer that does not do what the candidate wants.

End of Test

When time is a factor in the testing, the candidate should know how much time he or she has left (how many questions are left). Also, candidates should be informed about what to do when the test has ended.

Storage of Data

After the test has ended, a candidate should not have access to the test data without permission. Many knowledgeable candidates may gain improper access to these data if precautions are not taken.

Printing of Data

It is advisable to have the relevant data printed at once (possibly for the testee as well). This is a good way of having a copy of all the data in case something irreversible happens to the (hard) disk.

Checklists

As was mentioned earlier, there are many checklists for CALL that may be of great use in CALT. It is usually the user interface that is given most attention in these checklists, and the interaction with the candidate is of great importance. However, there may be a need for a checklist that pays more attention to aspects of testing. Such a checklist, based on the table of factors in CALT, is now being developed at Cito. The checklist will have two functions; it may be used:

• to check whether existing programs address the points being evaluated;

• to make sure that in new CALT, attention will be paid to the points mentioned.
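A checklist of this kind lends itself to a simple machine-readable form serving both functions: auditing an existing program and guiding a new one. The sketch below is hypothetical; the checklist entries are examples drawn from the factors discussed above, not items from the Cito checklist:

```python
# Each entry: (factor, yes/no question to be answered for a given program)
CALT_CHECKLIST = [
    ("feedback", "Does the program give feedback on actions that are not permitted?"),
    ("breakdowns", "Are responses preserved if the equipment breaks down?"),
    ("fraud", "Can the test be stopped without losing responses given so far?"),
    ("storage", "Is access to stored test data restricted after the test?"),
]

def audit(program_answers):
    """program_answers: dict mapping factor -> bool.

    Returns the factors the program fails to address (missing counts as unmet).
    """
    return [factor for factor, _question in CALT_CHECKLIST
            if not program_answers.get(factor, False)]

# A program that handles feedback and breakdowns but not the other factors:
unmet = audit({"feedback": True, "breakdowns": True, "fraud": False})
```

Running such an audit over an existing program yields the list of testing aspects still to be addressed.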

In fact, CALT would profit if professional testers and software developers worked together in creating new CALT. If (authoring) programs were to contain testing modules that had been approved by acknowledged test developers or testing organizations, users (testers, teachers and testees) would be sure that newly created CALT would come up to the standards that have been set for educational testing and that have been taken for granted in conventional testing.

SUGGESTED READINGS

Bunderson, C. Victor, Dillon K. Inouye, and James B. Olsen (1988). "The Four Generations of Computerized Educational Measurement." Educational Measurement, edited by R. L. Linn. The American Council on Education. New York: Macmillan.

Guidelines for Computer-based Tests and Interpretations (1989). Washington: American Psychological Association.

Dunkel, Patricia (Ed.) (1991). Computer-assisted Language Learning and Testing: Research Issues and Practice. New York: Newbury House.

______. (1991). "Computerized Testing of Nonparticipatory L2 Listening Comprehension Proficiency: An ESL Prototype Development Effort." The Modern Language Journal 75, 1, 64-73.

Garrett, Nina (1991). "Technology in the Service of Language Learning: Trends and Issues." The Modern Language Journal 75, 1, 74-101.

Kearsley, G. (1986). Authoring: A Guide to the Design of Instructional Software. Reading, MA: Addison-Wesley.

Weiss, David J., et al. (1983). New Horizons in Testing: Latent Trait Test Theory and Computerized Adaptive Testing. New York: Academic Press.

Zettersten, Arne (1986). New Technologies in Language Learning. Oxford: Pergamon Press.


REFERENCES

Alderson, J. Charles (1986). "Computers in Language Testing." Computers in English Language and Research, edited by G. Leech and C. Candlin. London: Longman.

Alessi, Stephen M., and Stanley R. Trollip (1985). Computer-based Instruction. Englewood Cliffs, NJ: Prentice Hall.

Bachman, Lyle F. (1990). Fundamental Considerations in Language Testing. Oxford: Oxford University Press.

Green, Bert F. (1988). "Construct Validity of Computer-based Tests." Test Validity, edited by Howard Wainer and Henry I. Braun. Hillsdale, NJ: Lawrence Erlbaum.

Grosse, Martin E., et al. (1985). "Validity and Reliability of True-False Tests." Educational and Psychological Measurement 45, 1.

Larson, Jerry W. (1987). "S-CAPE: A Spanish Computerized Adaptive Placement Exam." Modern Media, II.

van Weeren, J. (1988). "On Determining the Function and Quality of Language Tests." Evaluation and Testing in the Learning and Teaching of Languages for Communication. Strasbourg: Council of Europe.

BIODATA

José J. E. A. Noijons is a staff member in the languages department of Cito, the Dutch National Institute for Educational Measurement. He has been involved in the development of tests of speaking, writing and listening comprehension. His current work in CALL is in interactive testing of listening and speaking and in electronic item banking (for examinations in literature). Among future projects is the development of a computer adaptive test of listening comprehension.

AUTHOR'S ADDRESS

Cito
P.O. Box 1034
6801 MG Arnhem
The Netherlands

Phone: +31 85 521447
Fax: +31 85 521356
E-mail: [email protected]