Placement Testing - Language Testing Resources Website


An English language placement test: issues in reliability and validity

Glenn Fulcher
English Language Institute, University of Surrey

This report describes a reliability and validity study of the placement test which is used at the University of Surrey as a means of identifying students who may require English language support whilst studying at undergraduate or postgraduate levels. The English Language Institute is charged with testing all incoming students, irrespective of their primary language or subject specialization. Fair and accurate assessment of student abilities, and referring individuals to appropriate language support courses (the in-sessional programme), is an essential support service to all academic departments. The goal of placement testing is to reduce to an absolute minimum the number of students who may face problems or even fail their academic degrees because of poor language ability or study skills. This study looks at the administrative and logistic constraints upon what can be done, and assesses the usefulness of the placement test developed within this context.

I Introduction

It has recently been noted that although placement testing is probably one of the most widespread uses of tests within institutions, there is relatively little research literature relating to the reliability and validity of such measures (Wall, Clapham and Alderson, 1994). Publications which deal with placement tests frequently provide qualitative assessments of instruments (Goodbody, 1993), or are concerned with the placement of linguistic minority students in programmes which are not related to language teaching (Schmitz and DelMas, 1990; Truman, 1992). Wall, Clapham and Alderson (1994) offer one of the few empirical studies of a placement test which is designed to screen students entering a British university for deficiencies in English language skills, which might impede their progress in undergraduate or postgraduate studies. They investigate face validity (student perceptions of the test), content validity (asking whether tutors thought that test content represented programme content), construct validity (through a correlational study), concurrent validity (with self-assessment, and the assessment of tutors) and reliability. The main problem they discovered in attempting to validate the University of Lancaster placement test was in finding appropriate external criteria to conduct a concurrent study.

Language Testing 1997 14 (2) 113-138  0265-5322(97)LT127OA  © 1997 Arnold


The methodology used to ascertain the reliability and validity of the test in this study is similar to that of the Lancaster study in many ways, and is described in detail below. However, this study does attempt to expand on their methodology for the evaluation of placement tests within a university context in four areas. These are:

- the use of pooled judgements in establishing cut scores for placement;
- the use of additional statistical means to analyse test data;
- the consideration of the need to develop parallel forms; and
- the use of student questionnaires to investigate face validity.

II Background

The English Language Institute (ELI) has a number of roles within the University of Surrey, ranging from its support function in providing English language and study skills courses for students at the university, English for Academic Purposes (EAP) for overseas students who intend to study in an English-medium tertiary institution, and an MA in linguistics (TESOL) for teachers of English, in distance-learning mode. As part of its service function related to the in-sessional programme, the ELI is charged by the university to administer an English language test to all students entering the university on taught courses (undergraduate and postgraduate, English as primary language speakers, and speakers of other languages). The purpose of the test is to identify students whose lack of language skills or ability to communicate may cause problems in their academic work with their departments.

A major constraint in assessing students is that, together with organization (handing out and collecting papers, giving verbal instructions), the ELI must complete testing within one hour for each test administration. This means that the test cannot exceed 45 minutes, and must therefore be restricted in its length and content. Secondly, all scripts must be processed within five days of the day of the test, and results disseminated. In 1994, 1619 students took the test. The number of markers available, and the number of scripts which each can process in this time frame, also dictates that much of the paper be objectively scored. However, no language test can use only one response format, or sample only one construct, if it is to be considered valid on even prima facie grounds (APA, 1985: 73; 75). For this reason, a writing component is included.

The placement test acts as a screening device to reduce the number of students who attend an oral interview. It is those who are invited to come to the ELI for an oral interview who have the greater probability of requiring English language support. These students are 'referred' for oral testing. A small number of students are informed after the interview that they do not need to attend the in-sessional courses. It is acknowledged (see Section VI below) that all tests contain error, and it is during the interview procedure that we attempt to identify students who have been misclassified by the test. Most students are referred to a programme, although attendance is optional. Lists of all referred students are sent to heads of department, and later in the year heads are also sent an update of student attendance at courses, and a qualitative assessment of individual student progress for those who did attend.

III Test design

1 Test format

Prior to October 1994, the test contained one piece of writing, and was marked subjectively. It became clear that the ELI did not know whether this was a reliable or valid placement instrument, but it was considered inappropriate on other grounds. A single essay title may be biased in favour of some students and against others, and any single task of this nature is unlikely to elicit an adequate sample upon which decisions can be made (APA, 1985: 75; Upshur, 1971: 47; Van Weeren, 1981: 57; Shohamy, Reves and Bejerano, 1986; Shohamy, 1983; 1988; 1990). A new test was devised, with the following format:

Section 1: Essay 1: descriptive (no choice of essay title).
           Essay 2: argumentative (one title to be selected from three options).
Section 2: Structure of English (10 items).
Section 3: Reading comprehension (8 items).

Essay titles for Section 1 were screened carefully to avoid bias in favour of, or against, students from any particular discipline, or culture. Essay 1 is of a general nature, whilst the titles for Essay 2 are an attempt to reflect the interests of the three faculties within the university: Human Studies, Engineering and Science. In Section 2, item content was selected for known difficulty in the in-sessional programme grammar revision course. The context of the items is general university life, on the grounds that subject-specific contexts may, once again, disadvantage some subsections of the test-taking population. In Section 3, six texts were selected, for their general academic interest. Of these, three were drawn from the humanities, and three from the sciences. However, the journals from which the texts were taken are 'popular' in that the passages are written for the interested (intelligent) layperson, not for the specialist.

2 Pilot study

The new format was piloted during the summer of 1994, using 67 students attending the ELI Summer School English for Academic Purposes (EAP) programme. All items in Sections 2 and 3 of the test were studied using classical item analysis. Only items with a facility index between 0.3 and 0.8 were retained, and items with a discrimination index of less than 0.3 were either abandoned, or rewritten and piloted for a second time. The point biserial correlation for all remaining items was above 0.4. Four pilot versions of the test were originally written, and 75% of all questions discarded, leaving one operational test with items drawn from all four pilot tests. This is a reminder that, no matter how experienced one may be in test development, there is always a need for pretesting all items before tests become operational, and decisions are taken on the basis of test results.
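The classical item statistics used in this screening can be computed directly from a scored response matrix. The sketch below is not the author's analysis; it simply illustrates, in Python, a facility index, an upper/lower-group discrimination index and a point-biserial correlation, applied to an invented miniature response matrix, with the retention thresholds reported above.

```python
import math

def item_analysis(responses):
    """responses: list of test-takers, each a list of 0/1 item scores."""
    n_people = len(responses)
    n_items = len(responses[0])
    totals = [sum(person) for person in responses]
    mean_t = sum(totals) / n_people
    sd_t = math.sqrt(sum((t - mean_t) ** 2 for t in totals) / n_people)
    ranked = sorted(range(n_people), key=lambda i: totals[i])
    k = max(1, round(n_people * 0.27))            # size of upper/lower groups
    lower, upper = ranked[:k], ranked[-k:]
    stats = []
    for j in range(n_items):
        scores = [person[j] for person in responses]
        p = sum(scores) / n_people                # facility index
        d = (sum(responses[i][j] for i in upper)
             - sum(responses[i][j] for i in lower)) / k   # discrimination index
        if 0 < p < 1 and sd_t > 0:                # point-biserial correlation
            mean_correct = sum(t for t, s in zip(totals, scores) if s) / sum(scores)
            r_pbis = (mean_correct - mean_t) / sd_t * math.sqrt(p / (1 - p))
        else:
            r_pbis = 0.0
        stats.append((p, d, r_pbis))
    return stats

# Retention rules reported for the pilot: 0.3 <= facility <= 0.8,
# discrimination >= 0.3, point-biserial above 0.4.
demo = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 1, 1]]
for i, (p, d, r) in enumerate(item_analysis(demo), start=1):
    keep = 0.3 <= p <= 0.8 and d >= 0.3 and r > 0.4
    print(f"item {i}: facility={p:.2f}  discrimination={d:.2f}  r_pbis={r:.2f}  keep={keep}")
```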

In Section 1, writing samples were collected, graded by six tutors, and features of performance at each level of a rating scale established. Prototypical samples from each band level were then used in the training of raters prior to the operationalization of the test in October 1994.

IV Method

The final version of the test was used operationally with the university's entire intake on taught (undergraduate or postgraduate) courses in October 1994. The main studies were all carried out using this population.

1 Reliability

In order to establish estimates of the reliability and validity of the placement test, a number of approaches were taken. In the investigation of reliability, correlation coefficients, means and standard deviations (inter- and intrarater reliability), were established for rating patterns on Section 1 of the test. For Sections 2 and 3, it was decided to fit a logistic model. The reason for this decision was essentially because of the need, within a university setting, to have multiple forms of a test, and for the interpretation of scores on these forms to be comparable.

This implies that forms must be equated in some way. Using a logistic model, it is possible to create multiple parallel forms, using 'anchor' items from previous form(s) which can be used to calibrate new items in later forms, even if this is difficult for short tests. In the first attempt to fit a Rasch model to the data, it became clear that there were two distinct populations within the total test-taking population. These distinct populations are those whose primary language is English, and those whose primary language is other than English. Consider Table 1, which is a simple chi-square (χ²) test of significance of primary language classification and referrals, on the basis of the test scores (for a discussion of cut scores for referrals, see Section VI below). Using Yates' correction for a 2×2 grid, χ² = 516.61, a highly significant result, indicating that these are certainly separate populations. It is not possible, unfortunately, to isolate differences between referrals or test scores between learners whose primary language is not English, because of the small n size of many of the primary languages represented by the 614 overseas students tested.

The question which arose, therefore, was on which population to standardize the objective components of the test. It was finally decided to standardize on the group which did not have English as a primary language, as this is the population which is more likely to require English language support. Those speakers of English as a primary language (73 in this sample) whose score profiles are similar to those of the referrals of the norming population will still be identified by the test, and remedial action can be taken.

The Rasch model allows the test developer to calibrate both item difficulty and learner ability to the same scale (Crocker and Algina, 1986: 340-41), measured in logits. This was done using the program RASCAL (Assessment Systems Corporation, 1994) for Sections 2 and 3 of the test separately. This was done as there was no theoretical reason to suspect that the combined scores of Sections 2 and 3 would represent a unidimensional scale. The disadvantage of this approach, however, is that Section 2 consists of only ten items, and Section 3 of eight items. Although n = 614, the low number of items inevitably reduces test reliability. This could not be avoided, because of the time constraints as described above.

Table 1 χ² table to compare primary and nonprimary English language speakers to test whether they belong to the same test-taking population

                               Not referred   Referred   Total
Non-English primary language    252            362         614
English primary language        932             73        1005
Total                          1184            435        1619
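For readers who wish to check the figure, the chi-square statistic with Yates' correction can be recomputed from the contingency table. The cell counts below are reconstructed from the reported marginals (614 and 1005 by language group; 435 referrals and 1184 nonreferrals), so the result is close to, though not necessarily identical to, the published 516.61.

```python
def yates_chi_square(table):
    """Chi-square for a 2x2 table with Yates' continuity correction."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (abs(observed - expected) - 0.5) ** 2 / expected  # Yates: subtract 0.5
    return chi2

# Rows: non-English / English primary language; columns: not referred / referred.
observed = [[252, 362], [932, 73]]
print(round(yates_chi_square(observed), 2))   # roughly 516, a highly significant result
```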


2 Validity

a Correlation and principal components analysis: Construct validity was assessed using correlation, and a principal components analysis. Each of the sections of the test was designed to measure different aspects of English language proficiency, and so the factorial structure of the test is an issue.

b Analysis of cut scores across referred and nonreferred students: Cut scores for the test as a whole, and each section of the test, were established after the pilot study. Scores were considered in relation to lecturers' assessments of whether students with certain profiles were ready for study in a tertiary-medium institution. Cut scores were established using this 'pooled judgements' technique (Popham, 1978: 165). Groups of essays at a range of scores were presented randomly to lecturers who were asked to decide whether this was an acceptable piece of work for undergraduate or postgraduate work in the university. Discussion was acceptable, and the cut point for each essay established at the point where the highest agreement in making dichotomous judgements was reached. However, the results require empirical investigation to ensure that the cut scores are in fact leading to appropriate decisions on whether to refer students to language support programmes.
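One simple way to operationalize this pooled-judgements procedure is sketched below; the lecturers' dichotomous decisions are invented for illustration, since the individual judgements are not published.

```python
def pooled_cut_score(judgements, threshold=0.5):
    """judgements: {band: [0/1 acceptability decisions from all lecturers]}.
    Returns the lowest band at which the pooled judgement is 'acceptable'."""
    for band in sorted(judgements):
        agreement = sum(judgements[band]) / len(judgements[band])
        print(f"band {band}: {agreement:.0%} judged acceptable")
        if agreement > threshold:
            return band
    return None

# Invented decisions from four lecturers over sample essays at bands 3-7.
demo = {3: [0, 0, 0, 0], 4: [0, 0, 1, 0], 5: [0, 1, 1, 0], 6: [1, 1, 1, 1], 7: [1, 1, 1, 1]}
print("cut score:", pooled_cut_score(demo))   # with these invented data the cut falls at band 6
```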

c Concurrent validity: Thirty-three students from the overseas student population taking the test had recently taken the Test of English as a Foreign Language (TOEFL). Although this number is small, it is nevertheless possible to begin to investigate the relationship between this test and a university placement test. As more data are gathered in future years, it should be possible to establish stable concurrent validity statistics with a number of major tests which are currently used for entrance purposes.

d Content validity: Content validity was investigated by requesting three subject specialists to comment on questions set in the placement test. One was drawn from each of the three faculties within the university: Human Studies, Engineering and Physics. Only one informant suggested that the test was not content valid in one particular field.

e Feedback from students: It is clearly important to obtain qualitative feedback from students on the operation of the test. All tests have consequences for the test-takers and for the institutions which base decisions on scores. If the test is not perceived to be fair by the test-takers and score users, the role of the placement test within the institution is compromised.


Table 2 Inter-rater reliability - Pearson product moment correlations

          Rater 1   Rater 2   Rater 3
Rater 2   0.92
Rater 3   0.87      0.94
Rater 4   0.75      0.83      0.93

Including a qualitative study of this nature therefore relates not to the technical qualities of a testing instrument, or to accurate decision-making, but to one aspect of the social consequences of testing for the institution.

V Reliability

Reliability of subtests was initially calculated during the pilot study, and recalculated during the first operational testing. The following figures relate to the operational version of the test, with n = 614.

1 Section 1: writing

To establish the reliability of the assessment of writing samples, 20 essays were selected from the population, and were marked by four tutors. After a period of two weeks, the tutors were then asked to re-mark a subset of six samples. This allows the calculation of inter- and intrarater reliability. Table 2 shows the results of the inter-rater reliability study. It can be seen that the correlation between raters is well within acceptable reliability ranges for this type of test, with an average Pearson product moment correlation of 0.87. The average band awarded was 5.46, with a standard deviation of 1.21. Table 3 shows that the four raters did not differ significantly from this in their individual grade profiles.

In the intrarater reliability study, the average reliability coefficient was 0.69, somewhat lower than the inter-rater coefficients, but still not so low as to cause undue worry. This figure indicated a need for further rater training before the second operational testing session, in October 1995.

Table 3 Inter-rater reliability - Variation in means and standard deviations

       Rater 1   Rater 2   Rater 3   Rater 4   Average
Mean   5.50      5.50      5.50      5.33      5.46
SD     1.00      1.64      1.64      1.03      1.21


The intrarater reliability coefficients were: rater 1: 0.83; rater 2: 0.57; rater 3: 0.68; and rater 4: 0.70 (Pearson product moment correlations).
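Both the inter- and intrarater coefficients reported here are ordinary Pearson product-moment correlations over the same scripts. A minimal sketch, with invented band scores for one rater's two markings:

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

rater_1_first  = [5, 6, 4, 7, 5, 3]   # first marking of six scripts (invented bands)
rater_1_second = [5, 7, 4, 6, 5, 4]   # re-marking two weeks later
print(round(pearson(rater_1_first, rater_1_second), 2))   # an intra-rater coefficient
```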

2 Section 2: structure

Rasch scaling was conducted using RASCAL (Assessment Systems Corporation, 1994). Table 4 shows the questions in order of difficulty, from the easiest to the most difficult, in logits, together with the standard error associated with the difficulty estimate, and the χ² fit statistic.

From Table 4 it can be seen that there are three misfitting items. That is, they do not meet the criterion of unidimensionality in this subtest, and must therefore be removed, and replaced by other items in this form of the test. It is instructive to return to misfitting items to attempt to provide a linguistic rationale for why they misfit. In this case, it is particularly enlightening to look at item 5, because of the very large misfit statistic. Item 5 is:

She's always ______ other students on her course, even though she is very busy herself.

(a) help
(b) helped
(c) used to help
(d) helping

With hindsight, it appears obvious that there are two possible keys to this item, but this was not spotted during pretesting and item revision. Such mistakes highlight the importance of thorough pretesting and post hoc analysis of all test items, on tests where important decisions are being made.

Table 4 Rasch analysis of Section 2

Item   Difficulty   Standard error   χ² fit
7      -1.370       0.129            -
2      -1.085       0.120            14.279
1      -0.770       0.111            5.137
6      -0.668       0.109            9.298
8      -            0.105            14.466 misfit
10     -            0.099            - misfit
9      -            0.096            9.597
3      -            0.093            8.055
4      -            0.093            8.055
5      -            0.124            45.036 misfit


It cannot be emphasized enough that however experienced an item writer or test designer someone may be, it is not enough to rely on 'eyeballing' tests or test items.

With the test centred on item difficulty, mean item difficulty was 0.00, and the standard deviation of difficulty 1.26. Average ability was 0.89, with a standard deviation of 1.15. The test characteristic curve for Section 2 is presented in Figure 1. This plots estimated proportion of test items correct as a function of the ability of the student on the latent trait (structure of English), and may be interpreted as a nonlinear regression curve for relating raw scores to the latent trait.
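The test characteristic curve can be generated directly from the Rasch model: the expected proportion correct at a given ability is the mean of the item response probabilities. The sketch below uses the four difficulties legible in Table 4 and fills the remainder with purely illustrative values.

```python
import math

def rasch_p(theta, b):
    """Probability of a correct response for ability theta and item difficulty b (logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def expected_proportion_correct(theta, difficulties):
    """One point on the test characteristic curve."""
    return sum(rasch_p(theta, b) for b in difficulties) / len(difficulties)

# First four values from Table 4; the remaining six are illustrative only.
difficulties = [-1.370, -1.085, -0.770, -0.668, -0.2, 0.1, 0.5, 0.9, 1.4, 2.0]
for theta in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"theta={theta:+.1f}  expected proportion correct="
          f"{expected_proportion_correct(theta, difficulties):.2f}")
```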

When using a Rasch model, it is possible to look at reliability in two ways. First, we may calculate a reliability coefficient which is the equivalent of KR20 (an estimate of internal consistency) used in classical test analysis and, secondly, we may talk about test information. The reliability coefficient may be understood as the degree to which the test characteristic curve in Figure 1 may be relied upon as a translation of raw scores to latent trait scores. This section of the test contains only 10 items, a decision taken because of time constraints as discussed above. As reliability is related to test length, although it was hoped that reliability coefficients would be adequate, they were not expected to be very high. The reliability coefficient for Section 2 was 0.63, which is below what would be required for a high-stakes test. However, this may be reasonable for a placement test of this size.

Figure 1 Test characteristic curve for Section 2 (structure)


The amount of information which a test provides differs according to region on the latent trait scale. Where the test characteristic curve is steepest, there is more discrimination amongst test-takers, and hence the test is providing greater information. Test information for Section 2 of this test is presented in Figure 2. Figure 2 indicates that Section 2 of the test is providing the most information from -1.0 to 0.5 on the latent trait scale, which is precisely what was required of this placement test. It is not necessary to discriminate finely between students at the higher end of the latent continuum. Similarly, below a certain ability level, fine discrimination is not necessary. What is crucial in this kind of testing is providing the maximum amount of information around the region of the scale where decisions are being made regarding whether students below this score should attend English support programmes. This result is therefore welcomed, as the test does appear to be providing the most reliable information at precisely the point at which it is required.

We may conclude, with some certainty, that for a subtest with only ten items, Section 2 is providing adequate information for the purposes to which the test is being put.
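Test information under the Rasch model is equally simple to compute: each item contributes p(1 - p) at a given ability, so information peaks where items are well matched to the test-takers. A sketch, reusing the illustrative difficulties from the previous example:

```python
import math

def rasch_p(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def test_information(theta, difficulties):
    """Fisher information of the whole subtest at ability theta."""
    return sum(p * (1 - p) for p in (rasch_p(theta, b) for b in difficulties))

# First four values from Table 4; the remaining six are illustrative only.
difficulties = [-1.370, -1.085, -0.770, -0.668, -0.2, 0.1, 0.5, 0.9, 1.4, 2.0]
for theta in (-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0):
    print(f"theta={theta:+.1f}  information={test_information(theta, difficulties):.2f}")
```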

3 Section 3: reading comprehension

Table 5 shows the results of the Rasch analysis for Section 3. Only item 17 was found to misfit, and this item was therefore removed from this form of the test.

Figure 2 Test information curve for Section 2 (structure)


Table 5 Rasch analysis of Section 3

Item   Difficulty   Standard error   χ² fit
11     -2.252       0.127            10.190
14     -1.501       0.108            10.459
12     -0.730       0.097            4.660
16     -0.260       0.094            3.983
17      0.016       0.093            14.989 misfit
18      1.436       0.106            5.032
13      1.569       0.109            9.180
15      1.723       0.112            8.240

With the test centred on item difficulty, mean item difficulty was 0.00, and the standard deviation of difficulty 1.48. Average ability was 0.01, with a standard deviation of 1.22. The test characteristic curve for Section 3 is presented in Figure 3. The curve in Figure 3 is not as steep as that in Figure 2, but this is only to be expected, with only eight questions in the subtest. It is difficult to achieve reasonable discrimination with so few items in a subtest. Similarly, the reliability coefficient (equivalent of KR20) is only 0.59. The test information curve in Figure 4 is much flatter than the curve in Figure 2. In an ideal world, the test should be lengthened, as the discussion of this subtest in Section VI shows.

The test has been found to be reasonably reliable for its purpose.

Figure 3 Test characteristic curve for Section 3 (reading)


Figure 4 Test information curve for Section 3 (reading)

However, the reliability coefficients do not meet the levels which would be required for a high-stakes test, and this appears to be a function of test length. The relationship between reliability and test length may be described as:

n = [rttd(1 - rtt)] / [rtt(1 - rttd)]

where n is the number of times the test length must be increased with items or raters to achieve the desired reliability, rttd is the desired level of reliability and rtt is the present observed reliability of the test. For Section 2, to achieve a reliability coefficient of 0.80, it would be necessary to increase test length by 2.34 (28 questions), and for Section 3, 2.77 (22 questions). This would essentially double the administration time of the test!
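The prophecy formula is easily applied to the two objective subtests; a minimal sketch using the reported reliabilities:

```python
def spearman_brown_factor(r_observed, r_desired):
    """Factor by which test length must grow to reach the desired reliability."""
    return (r_desired * (1 - r_observed)) / (r_observed * (1 - r_desired))

for name, r in (("Section 2 (structure)", 0.63), ("Section 3 (reading)", 0.59)):
    n = spearman_brown_factor(r, 0.80)
    print(f"{name}: lengthen by a factor of {n:.2f} to reach 0.80")
# -> factors of about 2.35 and 2.78, in line with the 2.34 and 2.77 reported.
```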

Compromises need to be made between desired levels of test reliability and the lack of time for testing within a large educational organization. Such decisions cannot be taken by the test designers alone, but need to be discussed at all levels within the institution in relation to general testing policy, including the assessment of the consequences of the unreliability which the institution is prepared to tolerate.


VI Validity

1 Correlational and principal components analysis

The test was designed to measure three separate abilities: writing, English structure and reading. If this is actually the case, the correlation coefficients between the three sections should be modest, and any attempt to factor the matrix should show that subtests load on different factors. Table 6 provides the correlation matrix in the bottom triangle, and the reliability coefficients of the subtests in the diagonal. The upper-right triangle presents the correlation coefficients corrected for attenuation (taking unreliability of measures into account). Although all are significant (to be expected, in subtests which are all language related), variance overlap is not so high that we would wish to challenge the view that each subtest is tapping some unique ability as well as a common ability. It should also be noted that each of the two writing tasks also appears to be tapping different skills to some degree.
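The correction for attenuation used in the upper triangle of Table 6 divides each observed correlation by the square root of the product of the two measures' reliabilities. A one-line sketch, using the Writing task 2 and Structure cell as an example:

```python
import math

def disattenuate(r_xy, rel_x, rel_y):
    """Correlation corrected for attenuation due to unreliability of both measures."""
    return r_xy / math.sqrt(rel_x * rel_y)

# Observed r between Writing task 2 and Structure, with their reliabilities.
print(round(disattenuate(0.45, 0.87, 0.63), 2))   # -> 0.61, as in Table 6
```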

The correlation matrix was subjected to principal components analysis (PCA), in order to discover if the tasks were loading on different factors. Eigenvalues for the extraction of factors were set at 0.6, on the assumption that some of the factors in which we are interested are indeed, as Farhady (1983: 18-19) argues, less than unity. Three factors were extracted, and rotated using the Varimax technique. The results are presented in Table 7. The three-factor solution is clearly interpretable in relation to the correlation matrix. Factor 1 is a writing factor, factor 2 is a reading factor and factor 3 is a structure factor, which the argumentative essay also loads on. However, it must be stressed that this is not sufficient evidence upon which to claim construct validity. Exploratory analysis of this type is suggestive of further avenues for investigating construct validity, but can never be sufficient to meet the requirement for convergent and divergent evidence.

Table 6 Correlation matrix showing association between tasks

                 Writing task 1   Writing task 2   Structure   Reading
Writing task 1   0.87             0.49             0.36        0.44
Writing task 2   0.43             0.87             0.61        0.42
Structure        0.27             0.45             0.63        0.44
Reading          0.17             0.30             0.27        0.59


Table 7 Varimax-rotated factor solution

                                              Factor 1   Factor 2   Factor 3
Writing 1                                     0.950      0.049      0.099
Writing 2                                     0.552      0.210      0.579
Structure                                     0.087      0.110      0.942
Reading                                       0.087      0.983      0.145
Per cent of total variance explained
by rotated components                         30.572     25.615     31.363

2 Analysis of cut scores across referred and nonreferred students

Cut scores for referrals were decided on the basis of evidence from the pilot study, and implemented in an operational use of the test. It was therefore imperative that the judgements made be considered in the light of real outcomes. Tables 8 and 9 give the descriptive statistics for nonreferrals and referrals respectively, by task and total test score. The first thing to ensure is that the cut scores established in the pilot study did in fact divide the test-taking population into two statistically distinct groups, and this is shown to be the case in Table 10 for Sections 1 and 2 of the test, and the total test score.

Table 8 Descriptive statistics for nonreferrals by task and total

                 Writing 1   Writing 2   Structure-L   Reading-L   Total (raw)
No. cases        1184        1184        1029          1159        1184
Minimum          0           0           -0.64         -2.77       0
Maximum          9           9           2.95          2.74        33
Range            9           9           3.59          5.51        33
Mean             6.92        6.85        1.85          0.63        26.57
SD               2.14        1.01        0.88          1.10        3.25
Standard error   0.06        0.03        0.03          0.03        0.10

Table 9 Descriptive statistics for referrals by task and total

                 Writing 1   Writing 2   Structure-L   Reading-L   Total (raw)
No. cases        435         435         408           416         435
Minimum          0           0           -2.67         -2.77       2
Maximum          7           9           2.95          2.74        29
Range            7           9           5.62          5.51        27
Mean             5.31        4.92        0.88          -0.35       20.24
SD               1.24        1.66        1.21          1.13        4.13
Standard error   0.06        0.08        0.06          0.06        0.20


Table 10 Probability of real differences on subtests between referrals and nonreferrals with 1 df (Bartlett test for homogeneity of group variances)

Subtest   Writing 1   Writing 2   Structure-L   Reading-L   Total (raw)
χ²        154.8       178.6       67.38         0.52        38.73
p         0.000       0.000       0.000         0.471       0.000

Table 11 Referrals by scores on Writing 1

x    0    1   2    3    4     5     6     7     8    9    O    Total
y   10    2   2   13   36   144   196    32     0    0    0      435
n    0    0   0    0    4    30   241   709   169   21   10     1184
T   10    2   2   13   40   174   437   741   169   21   10     1619

Section 3, the reading subtest, fails to discriminate adequately between referrals and nonreferrals. Although Figure 4 shows that information is being provided at the appropriate point on the scale, this is not enough to make decisions. It is clear that in the case of this subtest, the use of only eight items seriously affects the validity of the subtest for its intended purpose.

Showing that there is a statistical difference between referrals and nonreferrals in Sections 1 and 2 of the test does not, in itself, tell us that the cut score has been adequately placed. In the following four tables (Tables 11-14) are the numbers of students who are referred (y), the number not referred (n) by raw score (x), and the total number of students with each score, for the two writing tasks, structure and reading, respectively.

Table 12 Referrals by scores on Writing 2

x    0   1    2    3    4    5     6     7     8     9     O    Total
y    9   2   12   16   29   39    77    98    80    56    17      435
n    0   0    0    0    1   10    74   230   375   338   156     1184
T    9   2   12   16   30   49   151   328   455   394   173     1619

Table 13 Referrals by scores on Structure

x    0   1   2    3    4     5     6     7     8    9    O    Total
y   32   0   6    8   47   151   168    22     1    0    0      435
n    0   0   0    2    1    23   255   694   178   19   12     1184
T   32   0   6   10   48   174   423   716   179   19   12     1619


Table 14 Referrals by scores on Reading

x    0    1    2     3     4     5     6    7    8    O    Total
y   16   23   69   109   111    76    25    4    2    0      435
n    0   10   40   151   274   361   230   92   15   11     1184
T   16   33  109   260   385   437   255   96   17   11     1619

Students who did not answer a question are counted as 'omitting' (O). Score ranges where numbers are highlighted in bold represent the areas in which there is potential for errors of judgement to be made when making referrals.

The cut score for tasks 1 and 2 (Section 1) was 6; for Section 2, a raw score of 6 (or ability level 0.38); and for Section 3, a raw score of 4 (or ability level 0.02). The overall cut score was 22. Raters were asked to consider the evidence from each individual section in making a decision regarding referral. For example, if the overall score was 22+, and on only one of the tasks did the score fall below the cut point, then they were probably not to refer the student. This accounts for some of the nonreferrals in band 5 on both the writing tasks, a score of 5 on Section 2 and a score of 2 or 3 on Section 3, as well as those who were referred within a number of bands above the cut score.

In the case of Sections 2 and 3, however, we are able to calculate the standard error of the cut score, because of the methodology which has been used. The standard error associated with an ability level of 0.02 is 0.852, meaning that we can be 95% confident that a scaled cut score of 0.02 is 0.02 ± 1.67, or anywhere from -1.65 to 1.69. That is a raw score from 3 to 7. The standard error increases as one moves further from the mean! In Section 3, the standard error associated with an ability level of 0.38 is 0.727, which means that we can be 95% confident that a score of 0.38 is 0.38 ± 1.43, somewhere between -1.05 and 1.81, or in raw score terms, anywhere from 3 to 6. However, in practice, we can see that the potential for error is much higher in Section 3 (Table 14) than in any other subtest, as it fails to discriminate between students as a result of its length.
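The confidence bands quoted here are simply the cut score plus or minus 1.96 standard errors on the logit scale; a small sketch reproducing them (small discrepancies with the quoted bounds are rounding):

```python
def confidence_band(cut, se, z=1.96):
    """95% band around a cut score expressed in logits."""
    return cut - z * se, cut + z * se

for label, cut, se in (("cut at ability 0.02, SE 0.852", 0.02, 0.852),
                       ("cut at ability 0.38, SE 0.727", 0.38, 0.727)):
    lo, hi = confidence_band(cut, se)
    print(f"{label}: {lo:+.2f} to {hi:+.2f}")
# -> roughly -1.65 to +1.69 and -1.05 to +1.81, close to the values quoted in the text.
```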

It is only when these calculations are made that the reliability coefficients take on meaning. It highlights the fact that decision-making on cut scores involves error, and that sensitive interpretation of scores on all tasks is required before referring or not referring a student to an English language support programme. However, this evidence lends support to the current practice of interviewing all referred students, to ensure that some students do not needlessly attend the in-sessional programme if it is not necessary. It also highlights the importance of increasing test length when possible, within the administrative and logistic constraints of large organizations.

One further point does need to be made. There is as yet no way of discovering whether nonreferrals have mistakenly been so categorized, unless they self-refer, or are referred by their subject tutors. These numbers should, however, be low, as the policy of the ELI is: if in doubt, refer. It is easier to correct such an error at a later stage.

3 Concurrent validity

Concurrent validity was investigated by considering the association of scores on the placement test, with scores on the TOEFL. A total of 33 students had taken the TOEFL test immediately prior to arrival at the university, and this proximity allows some degree of comparison between the results of the two tests. Table 15 gives the correlation coefficients between the TOEFL scores of the students, the total score on the placement test and each of the components of the placement test.

From this strength of association between the total placement test score and TOEFL, the best prediction of the placement test score is:

Placement test = -34.73 + 0.1 (TOEFL)

Using the cut score established for the placement test, an estimated TOEFL score of 555 would be required before a student could follow a university course without the need for additional English language support. However, a correlation of 0.64 is clearly not high enough to be able to take these kinds of decisions without the use of the placement test itself. This was confirmed by an examination of the scatter plot of the scores of the 33 students, which revealed two outliers with scores of 510 and 577 on TOEFL, and placement test scores of 12 and 9 respectively. Once these students are removed, the TOEFL-TOTAL correlation is only 0.56, providing the best prediction at:

Placement test = -5.4 + 0.6 (TOEFL)

Table 15 Correlation between TOEFL and the university placement test

Placement test   Section 1   Section 2   Section 3     Section 4
                 (writing)   (writing)   (structure)   (reading)

0.34 (n/s) 0.64


The new estimate for the lowest TOEFL score required to follow a course of study without English language support would be 499, which appears to be inappropriate on experiential grounds. In conclusion to this section, only moderate association was discovered between the placement test and the TOEFL. This is most likely associated with the small sample size, but may also be a factor of the difference in content and purpose between the two tests. Nevertheless, it is hoped that concurrent validity may be further investigated as additional data are gathered over a number of years, producing a much larger sample of students from whom estimates may be made.

4 Content validity

Content validity was investigated by asking three informants, one each from the Faculties of Human Studies, Engineering and Physics, to 'eyeball' test items. In only one case did an informant feel particularly strongly about a test item, which is given here in full:

Text
Hints of a revolution in superconductivity have been found by French researchers who claim to have discovered a substance that loses all electrical resistance at only 20 degrees Celsius below zero. Superconductors today need to be cooled to -135°C. Superconductors are materials whose resistance to an electrical current disappears completely. Superconducting magnets could make trains levitate and drive ships, and superconducting cables could produce a much more efficient national grid network. Superconductivity was first noted in metals cooled to within a few degrees of absolute zero (-273°C). Then in 1986 ceramics that could superconduct at much higher temperatures caused a storm in the scientific community, holding out the hope that other materials would superconduct at room temperatures. Until now, the best efforts have succeeded only in producing a material that superconducts when cooled with liquid nitrogen - which is commercially useful.

Question
Research into superconductivity is being undertaken because
(a) the use of liquid nitrogen is not practical in non-commercial applications
(b) the railway systems of the world wish to levitate their trains
(c) ceramics is a current topic of discussion within the French scientific community
(d) superconductors only operate at low temperatures

From the pilot study, the item statistics given in Table 16 were obtained, and the item included as a result. However, the expert judge from the Faculty of Physics argued that this item would be unfair to anyone taking the test with scientific training. The reason for undertaking research into superconductivity, he argued, was a combination of (a), (b) and (c). Proposition (d), which he recognized as the required answer, is true, but without the fact that liquid nitrogen is impractical (a), it would not matter that (d) was true. Further, if proposition (b) about the industrial applications were not important, then there would be no motivation (other than academic interest, which was not given as an option) to do the research.


Table 16 Item statistics for the superconductivity item (pilot study)


The main problem with using option (d) as the key, for this informant, was that although the text indicates that it is true, there is no evidence to suggest that it is the prime motive for the research interest. Most scientists are not interested in developing high-temperature superconductors just for the sake of having something that operates at high temperatures, but because they wish to levitate trains, and the like.

Although it is possible for language testers and applied linguists to write good items which are relevant to the disciplines of the students who are taking placement tests, this exercise has provided clear evidence that it is important for subject specialists to be consulted regarding the way in which they would interpret texts as 'insider' readers. Failure to do this may lead to building bias into test items at the construction phase, which may not be detected in the pretesting phase without conducting time-consuming differential item functioning studies.

5 Feedback from students

After the tests had taken place, a questionnaire was circulated to all students who had been referred for English language support. Of the 435 questionnaires distributed, 71 were completed and returned. This is a response rate of 16.32%, indicating that the following results must be treated with some caution.

The questionnaire requested students to indicate, first, whether they thought the in-sessional placement test was a 'fair' test of their ability to operate in English in an academic context, measured on a five-point Likert scale, and secondly whether they had attended any of the English courses within the in-sessional programme at the English Language Institute. There were open-ended prompts to discover why students thought the test was 'unfair', if they did, how it could be improved, and to find out why referred students had not attended an English course, if they had not. The questionnaire was circulated to all referred students two weeks after the date of the placement test. By this time, referred students were expected to have enrolled in appropriate courses, and so the results of the questionnaire could be compared directly with course attendance.

a Perception of 'fairness': Table 17 indicates that, generally, students did perceive the in-sessional placement test to be fair, with a mean score of 3.1 on the Likert scale. The raw scores are presented in Table 18, to show that of the 71 respondents, one individual did not answer the question, and only 14 thought that the test was either 'very unfair' or 'unfair'.

Table 17 Student perception of test fairness

n    Minimum   Maximum   Mean   SD     Standard error
71   1         5         3.1    0.92   0.11

Table 18 Responses to the question

No response   Very unfair   Unfair   Fair   Quite fair   Very fair   Total
1             4             10       35     17           4           71

Of the 14 respondents who thought that the test was unfair, five provided no reason for their view. The responses for the other nine students were as follows:

Student 1 (Greek): More time was needed to complete the test.
Student 2 (British): The testing environment was poor: the room was too hot and there was no air conditioning.
Student 3 (Bosnian): The test should have been based on materials which students had already studied.
Student 4 (British): More time was needed to complete the test, and the questions were 'ambiguous'.
Student 5 (Japanese): There should be a listening component.
Student 6 (French): The use of multiple-choice questions should be avoided, as these are not capable of testing proficiency in English.
Student 7 (Norwegian): The questions were ambiguous. The student also commented that the questionnaire was badly designed, and he could not understand any of the questions.
Student 8 (Japanese): The invigilator turned up very late on the day of the test, and test administration was poor.
Student 9 (Singaporean): There should be a vocabulary component.

It is clear from these responses that dissatisfaction was, for the most part, not a function of the test itself, but of the restrictions which are imposed on the testing process by the time and facilities available for testing. The complaint of 'ambiguity' in the questions can be taken as a comment on the distracters in the multiple-choice questions. In a number of the questions the distracters operated extremely well, and some students who got the answers incorrect complained bitterly.

Students who reported that they found the test fair, quite fair or very fair, also echoed some of these views. Thirteen students expressed the need for more time to answer the questions, but at least half of these also thought that the test should be longer too. In particular, three students requested additional writing tasks, five students requested a speaking test (which is practically impossible for the entire test-taking population) and four students requested a listening test. One student wished to have a subject-specific test module (business). The only other observation which was echoed was that of the poor test-taking environment, particularly the lack of ventilation in rooms allocated for the test. There is some concern over the appropriateness of rooms allocated for testing, particularly for those students whose primary language is not English, as it is well known that environmental conditions can have a negative impact on test scores (Ascher, 1990), and those who supply rooms should be made aware of this for future administrations.

b Attendance on the in-sessional programme: The response to the question of whether students had attended English language programmes was a simple yes/no choice. The results are compiled in Table 19 by their response to the first question (Q1 = perception of fairness of the test). A statistical link cannot be tested, because the number of entries in cells is less than the proportion which would generally be considered acceptable to provide reliable p values.

Of the 20 students who were referred to English language support, did not attend an English programme and replied to the questionnaire, 18 provided reasons for nonattendance:

Table 19 The number of students attending English language programmes by response to the question on the perceived fairness of the placement test

Response to Q1   Did not attend   Did attend   Total
No response       1                0              1
Very unfair       3 (75%)          1              4
Unfair            2 (20%)          8             10
Fair              6 (17%)         29             35
Quite fair        6 (35%)         11             17
Very fair         2 (50%)          2              4
Total            20               51             71


- Not enough time because of subject-specific study commitments (eight students).
- Claimed they did not know about the in-sessional programme (three students).
- After the oral interview, they were not required to attend (three students).
- Does not know how to use the library, and spends all free time looking for subject-specific materials (one student).
- Attending English classes is a waste of time (three students).

In the last case, there appear to be two different attitudes. In the case of one student whose primary language was English, he was 'offended' at the suggestion that he may need additional support. In the other two cases, the number of years of English as a second language study was seen to 'exempt' them from the need for further language study. One student wrote: 'I don't think because is useful except essay writing. I spent too many years learning English.'

Attendance at in-sessional programme courses is generally very high, and the number of students referring themselves is higher than the number of referrals who do not attend. Total attendances for the in-sessional programme in the 1994/95 academic year reached 511 students. For an optional programme, this seems to be a reasonably satisfactory state of affairs.

VII Discussion

1 Logistic and administrational constraints

Within any large institution there will be logistic and administrational constraints. What is not often recognized, however, is that these constraints lead to limitations on testing, which have a direct impact on reliability of score interpretation. It is therefore important that institutions make policy decisions which take this relationship into account. It should never be assumed that decisions taken by those in administration do not have academic impact. In the case of the University of Surrey, a practical compromise has been reached in the form of a two-tier testing system, with the less precise placement test screening out students from the oral interview stage of the process. This has clearly had a negative impact upon the usefulness of the reading subtest, which needs to be lengthened.

2 The need for pretesting

This study has highlighted the importance of rigorously pretesting all items for inclusion on placement tests. Even with trialling, some unsuitable items will survive. Without pretesting and post hoc analysis, no institution could be sure that its placement tests were providing better information than tossing a coin 1619 times. It is important that institutions such as universities know the estimates of error in their placement procedures, in fairness to the students, and to improve screening and language support provision.

3 Equating test forms

The methodology used in this study will allow forms of the test to be equated, using anchor items. This means that test results will be comparable from year to year, and the English Language Institute will be able to monitor the English language ability, as defined by the test, of incoming students in the future. This will allow it to alert the university authorities to sudden or gradual changes in English language support requirements. It is very common for teachers and students to make comparisons across tests or test formats. In many cases this is not possible, but when forms are equated, it is both possible and beneficial.
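A common-item approach of the kind described can be sketched very simply under the Rasch model: the new form's item difficulties are shifted by the mean difference observed on the shared anchor items. This is only one simple form of linking; the difficulty values below are illustrative, not taken from the operational forms.

```python
def linking_constant(anchor_old, anchor_new):
    """Mean-shift linking constant from the anchor items' difficulties on each form."""
    diffs = [old - new for old, new in zip(anchor_old, anchor_new)]
    return sum(diffs) / len(diffs)

def equate(new_form_difficulties, shift):
    """Place the new form's difficulties on the old form's scale."""
    return [b + shift for b in new_form_difficulties]

anchor_old = [-1.10, -0.20, 0.45, 1.30]      # anchor items calibrated on form A (illustrative)
anchor_new = [-1.35, -0.40, 0.20, 1.05]      # the same items calibrated on form B (illustrative)
shift = linking_constant(anchor_old, anchor_new)
print("linking constant:", round(shift, 2))  # form B's scale sits about 0.24 logits low
print([round(b, 2) for b in equate([-0.80, 0.10, 0.90], shift)])
```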

Section 1 tasks cannot be equated in the same way as Sections 2 and 3. In the case of Section 1 it will be necessary to link writing prompts using expert judgements, ensuring that prototypical answers awarded at each band of the rating scale are available for rater training. From form to form, much more care will have to be taken with the interpretation of Section 1, even though its reliability in the form discussed here is higher than Sections 2 or 3. This type of equating is essentially judgmental, and one of the less exacting forms of linkage.

In the case of Sections 2 and 3, a strong form of test equating may be attempted, involving strong statistical linking.

4 Future research

In the next academic year, a further form of the test will be produced, and equated with the first form, using a single-group design with concurrent validation (Petersen, Kolen and Hoover, 1989: 256). In principle, this could be done each year until enough forms have been developed to ensure test security even if one form were compromised. As scores on all forms would be strictly comparable (Linn, 1993: 85) it would not matter which form students take. This would add to test security, while maintaining score interpretability. However, this project may be somewhat ambitious. Equating tests is much more difficult with short tests than longer tests, as the burden of information provided by each individual item is much higher in shorter tests (Linn, 1993: 88).


If, however, we are able to produce multiple forms, it also becomes possible to conduct score gain studies. It is a truism that in British universities there is no evidence to suggest that after a certain amount of time on any given programme, a student will 'improve' by 'x', whatever the unit 'x' is in terms of ability. Indeed, score gain studies, even if conducted, would be meaningless, without a clear understanding of what 'x' is. In this case, the unit of measurement is the scale which has been established in the October 1994 study. This is arbitrary, but person and item free (Masters, 1990). As such, score gain studies may be conducted with new groups of students over different timescales, following different courses. If successful, this would allow the English Language Institute to say to a department that a particular student would require (given error) from 'a' to 'b' months of language tuition to reach a level at which he or she would, with p probability, be able to cope with an academic course with (or without) English language support.

This kind of information is not currently available, but in principle could be, as the result of a careful development of research within the context of specific language programmes of large educational institutions.

VIII Conclusion

We are aware that there are other approaches to developing placement tests, many of which are effective where the numbers of students taking the test are small and the constraints in local administration allow the use of lengthy instruments which include oral components (Paltridge, 1992). Each approach must be sensitive to the constraints imposed by, and the information requirements of, unique institutions. Similarly, when conducting score gain studies in the future, criterion-based approaches such as that suggested by Brown (1989) will be considered, especially if multiple forms of the current test are available. We nevertheless believe that the current test fulfils its purpose as well as can be expected within the context described in this article.

The methodology used in developing the evaluation of this placement test was based on Wall, Clapham and Alderson (1994), as the first article in the field of language testing to address the issue of the assessment of placement instruments in any depth. However, the methodology used in this study includes additional aspects which would appear to be worthy of investigation in the particular context of placement testing. It is hoped that the approach adopted, building on Wall, Clapham and Alderson's pioneering work, may be of use to others in the evaluation of their placement tests.


References

APA 1985: Standards for educational and psychological testing. Washington, DC: American Psychological Association.
Ascher, C. 1990: Assessing bilingual students for placement and instruction. ERIC Digest 65. New York: ERIC Clearinghouse on Urban Education.
Assessment Systems Corporation 1994: RASCAL 3.5, Rasch analysis program. Minnesota, MN: ASC.
Brown, J.D. 1989: Improving ESL placement tests using two perspectives. TESOL Quarterly 23, 65-83.
Crocker, L. and Algina, J. 1986: Introduction to classical and modern test theory. Chicago, IL: Holt, Rinehart & Winston.
Farhady, H. 1983: On the plausibility of the unitary language proficiency factor. In Oller, J.W., editor, Issues in language testing, Rowley, MA: Newbury House, 11-28.
Goodbody, M.W. 1993: Letting the students choose: a placement procedure for a pre-sessional course. In Blue, G.M., editor, Language, learning and success: studying through English, London: Modern English Publications and the British Council, 49-57.
Linn, R.L. 1993: Linking results of distinct assessments. Applied Measurement in Education 6, 83-102.
Masters, G.N. 1990: Psychometric aspects of individual measurement. In de Jong, H.A.L. and Stevenson, D.K., editors, Individualizing the assessment of language abilities, Clevedon: Multilingual Matters, 56-70.
Paltridge, B. 1992: EAP placement testing: an integrated approach. English for Specific Purposes 11, 243-68.
Petersen, N.S., Kolen, M.J. and Hoover, H.D. 1989: Scaling, norming and equating. In Linn, R.L., editor, Educational measurement, New York: National Council on Measurement in Education and Macmillan, 221-62.
Popham, W.J. 1978: Criterion-referenced measurement. Englewood Cliffs, NJ: Prentice-Hall.
Schmitz, C.C. and DelMas, R.C. 1990: Determining the validity of placement exams for developmental college curricula. Applied Measurement in Education 4, 37-52.
Shohamy, E. 1983: The stability of oral proficiency assessment in the oral interview procedure. Language Learning 33, 527-40.
- 1988: A proposed framework for testing the oral language of second/foreign language learners. Studies in Second Language Acquisition 10, 165-79.
- 1990: Language testing priorities: a different perspective. Foreign Language Annals 23, 185-94.
Shohamy, E., Reves, T. and Bejarano, Y. 1986: Introducing a new comprehensive test of oral proficiency. English Language Teaching Journal 40, 212-20.
Truman, W.L. 1992: College placement testing. AMATYC Review 13, 58-64.
Upshur, J.A. 1971: Objective evaluation of oral proficiency in the ESOL classroom. TESOL Quarterly 5, 47-59.
van Weeren, J. 1981: Testing oral proficiency in everyday situations. In Klein-Braley, C. and Stevenson, D.K., editors, Practice and problems in language testing. Volume 1, Frankfurt: Bern, 54-59.
Wall, D., Clapham, C. and Alderson, J.C. 1994: Evaluating a placement test. Language Testing 11, 321-44.