
The Challenges of Creating a Valid and Reliable Speaking Test as Part of a Communicative English Program

David Jeffrey

David Jeffrey is an instructor in the Communicative English Program at the Niigata University of International and Information Studies, Japan. He has experience both in socio-economic development in South Africa and in teaching English in Japan.


    Abstract

This paper describes the challenging evolution and present form of the speaking test, which is the backbone of the Communicative English Program (CEP) of the Niigata University of International and Information Studies (NUIS), a private university in Japan, and discusses how valid and reliable this test appears to be. CEP is a semi-intensive, skills-based teaching program, founded in 2000. It is part of the Department of Information Culture of NUIS.

One of the biggest challenges in setting up the CEP teaching program was the need to create a speaking test that could accurately measure the fluency criteria of communication (content and communication strategies), as well as the accuracy criteria (grammar, vocabulary and pronunciation). At the same time, it had to be practical and make optimal use of resources.

The CEP speaking test is described in detail, using illustrative examples from the test itself wherever possible. Attention is also given to the use of interrater reliability correlations as a measure of the consistency between the examiners while applying the testing criteria.

    Finally, the usefulness of reflecting on both the evolution and present forms of testing procedures

    is considered in terms of its potential contribution to both the professional development of

    teachers and their teaching programs.

    The Origins of the CEP Speaking Test

    The origins of the CEP speaking test can be traced back to 1999, when the coordinator of CEP,

    Hadley, was a teacher at the Nagaoka National College of Technology, in Japan. He, together

    with his co-worker Mort, created a speaking test (the forerunner of the CEP speaking test) to

assess the oral proficiency of their learners in terms of their ability to use English as a natural communicative skill. Their primary concern was to find out what their learners could do, rather than what they knew (Hadley and Mort, 1999).

    As a result, the speaking test that emerged at this time was one that measured mainly the fluency

    aspects of conversation (or the skills of making meaningful conversation), as well as the

    accuracy aspects (such as vocabulary and grammatical correctness) that are also considered an

important part of conversation. The speaking test thus gave primary attention to the fluency aspects of conversation, and secondary attention to the accuracy aspects. This had the effect of making the examining process more subjective in nature. They consequently became concerned about its internal reliability, particularly with regard to the examining process and the need to maximise interrater reliability.

    Interrater reliability measures the consistency between different examiners. Hadley and Mort

    (1999, p. 2) described it as:

the degree of correlation between two or more examiners, with the goal of determining whether they are using the same set of criteria when testing the oral proficiency of their learners.


The Speaking Test Evaluation Sheet used in these early days can be seen in Figure 1 below. Please note the different weightings applied to the testing categories, which give higher priority to communicative ability and fluency, and lower priority to features of accuracy.


Figure 1 - Speaking Test Evaluation Sheet, as Used at the Nagaoka National College of Technology in 1999

Communicative Ability: Includes length of utterances, flexibility to speakers of differing levels, complexity of responses. (Multiply by 6)
0 / 1 / 2 / 3 / 4 / 5 = ____ x 6 = ____

Fluency: Appropriate speed, pauses and discourse strategies. (Multiply by 4)
0 / 1 / 2 / 3 / 4 / 5 = ____ x 4 = ____

Vocabulary: Did the student use a wide variety of words and phrases, or use new vocabulary used in class? (Multiply by 3)
0 / 1 / 2 / 3 / 4 / 5 = ____ x 3 = ____

Non-verbal Strategies: Did the student supplement oral communication with appropriate gestures, eye contact and body language? (Multiply by 3)
0 / 1 / 2 / 3 / 4 / 5 = ____ x 3 = ____

Grammar: How accurate and appropriate was the student's grammar? (Multiply by 2)
0 / 1 / 2 / 3 / 4 / 5 = ____ x 2 = ____

Pronunciation: Was effort made to use correct intonation, or was the accent a barrier to communication? (Multiply by 2)
0 / 1 / 2 / 3 / 4 / 5 = ____ x 2 = ____
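To make the weighting concrete: with every band at its maximum of 5, the multipliers 6 + 4 + 3 + 3 + 2 + 2 = 20 yield a top score of 100. Below is a minimal Python sketch of this combination; the function and category names are ours for illustration, not part of the original test materials.

```python
# Category weights from the 1999 evaluation sheet in Figure 1.
WEIGHTS = {
    "communicative_ability": 6,
    "fluency": 4,
    "vocabulary": 3,
    "non_verbal_strategies": 3,
    "grammar": 2,
    "pronunciation": 2,
}

def weighted_score(bands: dict[str, int]) -> int:
    """Combine 0-5 band scores into a 0-100 total using the Figure 1 weights."""
    for category, band in bands.items():
        if not 0 <= band <= 5:
            raise ValueError(f"band for {category} must be 0-5, got {band}")
    return sum(weight * bands[category] for category, weight in WEIGHTS.items())

# Example: a mid-range performance.
print(weighted_score({
    "communicative_ability": 4,  # 6 x 4 = 24
    "fluency": 3,                # 4 x 3 = 12
    "vocabulary": 3,             # 3 x 3 = 9
    "non_verbal_strategies": 4,  # 3 x 4 = 12
    "grammar": 2,                # 2 x 2 = 4
    "pronunciation": 3,          # 2 x 3 = 6
}))  # prints 67
```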

It is important to stress that the aim at this time was not to achieve exactly the same results between the examiners, but rather to achieve similar results that were fairly high. Indeed, authorities that Hadley and Mort referred to, such as Heaton (1997, p. 164, in Hadley and Mort, p. 5), note that the internal reliabilities of objective tests (such as multiple-choice tests) are inherently higher than those of subjective tests (such as oral tests), and therefore what may not be considered an acceptable level for a multiple-choice test may be acceptable for an oral test. Heaton adds that a moderate level of internal reliability is in fact desirable for an oral test, because such tests rely on many uncontrolled variables within natural communicative expression, rather than the direct questions and discrete answers required by objective tests.

The split-half method was used to check the internal reliability of the speaking test at this time. It involved dividing the test into two nearly equal parts, correlating the scores for the two parts, and adjusting this coefficient using the Spearman-Brown Prophecy Formula, which, according to Henning (1987, p. 197), is used to:

adjust estimates of reliability to coincide with changes in the numbers of items or independent raters in a test.
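For a test split into two halves, the Spearman-Brown adjustment takes the half-test correlation r and estimates the full-test reliability as 2r / (1 + r). A minimal Python sketch of the procedure follows; the scores and helper name are hypothetical, not data from Hadley and Mort's study.

```python
from statistics import correlation  # Pearson's r; available in Python 3.10+

def split_half_reliability(half_a: list[float], half_b: list[float]) -> float:
    """Correlate the two half-test scores, then apply the Spearman-Brown
    Prophecy Formula to estimate the reliability of the full-length test."""
    r_half = correlation(half_a, half_b)
    return (2 * r_half) / (1 + r_half)

# Hypothetical half-test scores for eight learners.
half_a = [12, 15, 9, 18, 14, 11, 16, 13]
half_b = [11, 14, 10, 17, 15, 10, 15, 12]
print(round(split_half_reliability(half_a, half_b), 2))
```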

Hadley and Mort (1999, pp. 3-5) were disappointed with the results of their interrater reliability testing: a correlation of only 0.54 was achieved, which was lower than the desired level even when taking into consideration that speaking tests are subjective in nature. It suggested that they needed to become more aligned with each other in their common understanding of the examining criteria. They note two possible reasons for this: firstly, a general lack of confidence and feelings of distraction amongst the two examiners; secondly, the scoring bands and their meanings needed to be more explicit.

Because Hadley and Mort's (1999, p. 9) scoring bands and their meanings were somewhat inexplicit, one examiner used a basic criterion of:


    would a native speaker, who is unaccustomed with Japanese speech patterns and

    mannerisms, be able to understand this student?

    On the other hand, the second examiner used a basic criterion of:

based upon my experience of living in Japan for eight years, can I understand what this student is trying to say?

    They concluded that they had not been explicit enough in their basic pedagogic criteria for rating

    learners, and worked out a middle ground between the two, which was stated as:

    will a native speaker of English, who is sincerely open to communicating with Japanese, be

    able to understand what the learner is trying to say, even though he or she is mostly

    unaccustomed with Japanese mannerisms and speech patterns?

They also concluded that the results had helped them come to some important decisions to apply to forthcoming oral testing, particularly with respect to the examiners endeavouring to understand each other's pedagogical stance in order to improve interrater reliability. These considerations laid the basis of the CEP speaking test.

    Defining What We Wanted in a Speaking Test

    Although we had some philosophical and practical background as a foundation to the CEP

speaking test in the beginning, thanks to the work already carried out by Hadley and Mort at the Nagaoka National College of Technology, there was still much to consider in refining the

    speaking test to specifically meet the requirements of CEP and to make it as valid and reliable as

    possible, especially in terms of interrater reliability.

    Our starting point was to consider exactly what it was we wanted to achieve, and to work

    towards that goal. We began by going back to basics in considering the importance of testing,

    particularly oral testing in a communicative program.

Definitions of tests were considered as one starting point, such as Bachman's (1990, p. 20), who defines a language test as:

a measurement instrument designed to elicit a specific sample of an individual's behaviour (and) quantifies characteristics of individuals according to explicit procedures.

As well as Underhill's (1987, p. 7), who refers to a speaking test as:

a repeatable procedure in which the learner speaks and is assessed on the basis of what he says.

    Weir (1990, p. 7) says:


in testing communicative language ability we are evaluating samples of performance, in certain specific contexts of use, created under particular test constraints, for what they can tell us about a candidate's communicative capacity or language ability.

We found going back to the theoretical basics of speaking tests very beneficial as a common starting point from which to advance.

This process wasn't easy, and it was time-consuming, especially given the many demands that designing and implementing a program simultaneously placed on us. In retrospect, it is no wonder that speaking tests are generally considered a necessary evil by many teachers and learners alike. However, they remain an indispensable means of providing teachers and learners with information about the teaching and learning process.

    Hughes (1989, p. 1) says:

Many language teachers harbour a deep mistrust of tests and of testers... this mistrust is frequently well-founded. It cannot be denied that a great deal of testing is of very poor quality. Too often language tests have a harmful effect on teaching and learning; and too often they fail to measure accurately whatever it is they have intended to measure.

We were aware of this potential shortcoming. Despite our frustrations, we put in as much effort as possible to make the CEP speaking test valid and reliable, so that the testing would be done well and would lessen the potential mistrust that learners and teachers might harbour towards it, especially given the centrality of the speaking test within the CEP program.

    The Importance of Reliability and Validity

Reliability and validity were the central concepts around which we worked to create the CEP speaking test, but what is specifically meant by reliability and validity, and why are they important?

Reliability and validity are interrelated and depend on many aspects of a test. In a broad sense, Henning

    (1987, p. 198) defines validity as:

    the extent to which a test measures the ability or knowledge that it is purported to measure.


    He defines reliability as:

the consistency of the scores obtainable from a test. It is usually an estimate on a scale of zero to one of the likelihood that the test would rank testees in the same order from one administration to another proximate one (p. 198).

Reliability is therefore concerned with whether a test gives consistent results, as Underhill (1987, p. 9) says:

If the same learners are tested on two or three occasions, do they get the same score each time?

    Validity, on the other hand, is concerned with whether a test measures what it is supposed to.

    Many important aspects of tests have a bearing on validity and reliability, and some worth

    mentioning here include backwash effects, face validity, content validity and construct validity.

Hughes (1989, p. 1) states:

    the effect of testing on teaching and learning is known as backwash.

Backwash effects can be positive or negative; they have a positive effect if they motivate both teachers and learners to prepare for the tests. Related to this is the importance of considering the potential forward wash effects of tests that take place at the beginning of teaching cycles, which motivate learners to learn and perform better for future tests (Hunt, 1998, p. 68).

    Henning (1987, p. 192) defines face validity as:

a subjective impression, usually on the part of examinees, of the extent to which the test and its format fulfils the intended purpose of measurement.

    Face validity is closely associated with content validity, defined by Henning (1987, p. 190) as:

usually a non-empirical expert judgement of the extent to which the content of a test is comprehensive and representative of the content domain purported to be measured by the test.

Face and content validity therefore refer to the extent to which the test is recognisable as a fair test by learners, who thereby perform to their ability as a result. Tests that lack face and content validity cause negative backwash effects and result in student underperformance, as well as the results being contested by both teachers and learners.

    Henning (1987, p. 190) defines construct validity as:

    the validity of the constructs measured by a test.


Construct validity is related to content validity, in that it is concerned with the contents of the test and their wider context. Construct validity thus refers to whether the test shares the philosophy of the teaching program of which it is a part, and it can be measured by both statistical and intuitive methods, according to Underhill (1987, p. 106), who adds that:

construct validity is not an easy idea to work with... to reduce it to its simplest statement it says: does the test match your views on language learning? In practice, there may be little difference between construct and content validity.

These were just some of the many things that needed a good deal of consideration whilst creating the CEP speaking test.

    CEP and its Speaking Test Today

Almost three years ago, when CEP was founded, it was clear that its teaching philosophy would be communicative. What was meant by communicative and how this philosophy would be

    reflected in our teaching and testing methodology was also a matter of much contemplation.

In the first year of CEP, the co-ordinator and the two instructors of CEP were engaged in considerable innovation in syllabus design and implementation, which included extensive lesson planning and the creation of the CEP intranet server for the administrative files, a system that has proved to be extremely convenient, and where the final results of the speaking tests are stored. Our CEP website was also made at that time, and can be viewed at http://www.nuis.ac.jp/~hadley/cepweb/cep/. The first year of CEP was like building and sailing a ship on a rough sea.

Now, after intense innovation, followed by consolidation and refinement, CEP has become a semi-intensive, skills-based, English as an International Language (EIL) program. Small classes of 22 learners are streamed into six distinct levels of language proficiency, and meet once a day for 45 minutes from Monday to Friday, studying courses that focus on oral communication, listening and reading skills. CEP consists of 8 teaching cycles (4 for each semester), and each cycle lasts 3 weeks. The first 2 weeks of each cycle are devoted to classroom activities, and the last week is devoted to testing activities. Although a listening test is also undertaken during the last week, it is the speaking test that is the most important and the most challenging for both the examiners and the learners. There are thus 8 speaking tests in an academic year.

Oral communication skills are considered to be the most important part of CEP, and thus the CEP speaking test has been, and still is, given considerable attention. It is true to say that the CEP speaking test has become the backbone of the CEP program, although listening and reading tests are also undertaken. The learners consider the speaking tests to be the most demanding of the tests, and, given the communicative philosophy of CEP, the most accurate reflection of how they are managing in the program.

It should also be stressed that the tradition of considering the fluency aspects of conversation as most important has been sustained, with a significant amount of time and effort being devoted to creating the present form of the CEP speaking test and making it as valid and reliable

as possible. Accuracy is regarded in CEP as attention to, and familiarity with, aspects of form, whereas fluency is regarded as a skill (as automated knowledge). CEP recognises that too much attention to accuracy jeopardises fluency, and thus diverts the bulk of attention away from accuracy by focusing on meaning. CEP thus aims to encourage learners to concentrate more on what they are saying, and less on how they are saying it.

Learners in CEP are therefore taught to express matters that are important to them and their lives, focussing on Japanese issues as they relate to the international setting. The learners are

    encouraged to learn how to confidently and effectively communicate their concerns, cultural

    viewpoints and personal interests by taking ownership of English and using it as a means ofmeaningful interchange with people of other countries, and to relate what it means to be Japanese

    in a positive way to others in the world community. CEP thus wants its learners to learn how to

    authentically express who they are as Japanese, in English, and be able to relate who they are and

    why they think the way they do to people of other cultures.

New Interchange: English for International Communication (NIC) Levels 1, 2 and 3 (Richards, 1998) are the base texts used for the homework, listening and speaking activities. Although the NIC units are followed in the order prescribed by their writers, the sequencing of lessons within these units is determined by the policy of CEP to move from accuracy (grammar homework) to fluency (conversation), with the listening activities bridging the two.

Hence, the approach is to begin with activity-based homework activities, as a form of consciousness-raising. Homework checking takes about 5 minutes, and listening activities another 10 minutes, so that the remaining bulk of 30 minutes can be given to fluency-based conversational activities. In the conversational activities, learners are encouraged to place most emphasis on fluency (as opposed to accuracy), and conversational content and strategy, as well as physical gestures and eye contact, play important roles. Learners are taught how to open and close conversations, introduce and develop topics, and understand and use common useful expressions as well as idiomatic phrases in the classroom. The speaking test then checks whether they have internalised what they have done in the classroom, and whether they can take ownership of what they have learned and use it as a skill.

To reinforce the need to practice speaking as much as possible during the classroom activities, learners receive points in the form of plastic coins during their classroom speaking activities, which they cash in at the end of class. The points are recorded and contribute towards their final semester score. This encourages the learners to talk. They are awarded points primarily for conversational effort, for making an effort to communicate meaningfully, and points are not taken away for mistakes. For example, if the teacher asks a question during a listening exercise and a student attempts to answer the question, that student will receive a participation point, which counts towards the year-end grade, irrespective of whether the question is answered correctly or not.

In addition to measuring mainly the fluency aspects of oral communicative ability that are encouraged in the classroom, the speaking test gives a definite purpose to the homework and classroom activities of each cycle. Without doing the homework, learners find themselves unable to cope adequately with the listening and speaking activities in the classroom (given the


environmental issues and learning are:

Discuss the environmental problems of Japan
Discuss what can be done to stop the illegal dumping of rubbish
Discuss ways to help abandoned animals
Discuss good ways to improve your English skills
Discuss some new things you would like to learn to do
Discuss the reasons why you study English

One of the three learners is asked to choose a number from one to six (the number of cards in the continually shuffled pack). Without showing the corresponding card to the learners, the examiner reads the question slowly and clearly to the learners twice, and the learners then think about the question for ten seconds (this makes the task more directed and allows the weaker learners some thinking time). The learners must then discuss this question for three minutes. The examiners listen and give each student a grade (at no point in this process may examiners clarify the meaning of any word in the question, and the conversation is exclusively between the three learners). Their scores are entered into the examiners' computers, onto prepared master files containing the formulas to calculate each learner's score in accordance with the weightings of the testing categories. The learners' final scores are an average of the three examiners' grades. After the three minutes is up, the learners are asked to stop. A new group of learners is called into the room, and the process continues until all the learners have been tested.
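The master-file arithmetic itself is simple. A minimal sketch of the kind of calculation involved is shown below, using the weightings of the grading procedure described later (content x 8, communication strategies x 6, grammar and vocabulary x 3, pronunciation x 3); the function and variable names are ours, not CEP's actual spreadsheet formulas.

```python
from statistics import mean

# Category weights: at the maximum band of 5, 5 * (8 + 6 + 3 + 3) = 100.
WEIGHTS = (8, 6, 3, 3)  # content, communication, grammar/vocabulary, pronunciation

def examiner_total(bands: tuple[int, int, int, int]) -> int:
    """Weighted 0-100 total for one examiner's four 0-5 band scores."""
    return sum(w * b for w, b in zip(WEIGHTS, bands))

def final_score(examiner_bands: list[tuple[int, int, int, int]]) -> float:
    """A learner's final score: the average of the examiners' weighted totals."""
    return mean(examiner_total(bands) for bands in examiner_bands)

# Example: three examiners grade the same learner.
print(final_score([(4, 4, 3, 3), (4, 3, 3, 3), (5, 4, 3, 2)]))  # about 73.7
```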

The learners are given their results in their next class on Monday. The rapid feedback is possible by virtue of the scoring process being computerised, and is beneficial to the learners in that their memories of the tests are still fresh, so they can easily recall what they did and reflect on the scores they received.

Score Sheets

The scores of the CEP speaking test are based on two score sheets comprising rating bands with assessment criteria for the examiners. The sheets were created by the examiners in mutual agreement, and thus reflect their combined pedagogical stance.

There are two rating band sheets in CEP, one for the higher proficiency levels of classes A, B and C, and the other for the lower proficiency levels of classes D, E and F. This is to make the process as fair as possible for the learners, especially the lower proficiency learners. The different levels allow learners to undergo class work and testing under conditions that are neither too easy nor too difficult given their current proficiency levels.

Despite the different levels, attention is given to the reliability of the bands so that they do not overlap, are sufficiently described, are as free as possible from subjective and impressionistic elements entering into the evaluation, and give the examiners every chance of entering similar scores.

To make the process fair, the upper limit for the higher proficiency levels (A, B and C) does not require a student to have the ability of a native English speaker, but an ability similar to a


Japanese person who has spent three to five weeks abroad in a summer home stay program. The requirements for the lower proficiency levels (D, E and F) are 3 bands lower than the A, B and C bands, with the upper limit being an ability to converse on a simple level. Although "on a simple level" is subject to many interpretations, the examiners undergo extensive discussion during the norming sessions to be sure that a mutual understanding of what it means is reached and internalised. A simplified version of the score sheets can be seen in Figure 2.

Figure 2 - Simplified Version of Score Sheets (used in the examination room by the examiners)

CEP ABC Assessment Criteria

Band 5
Fluency (Content of Contributions): Offers many details or examples; offers valid and pertinent reasons and opinions; able to converse on topic without struggle.
Fluency (Communication Strategies): Very actively engages others; uses gestures and maintains eye contact skilfully.
Accuracy (Grammar & Vocabulary): Uses/understands complex grammatical structures often; uses/understands new and sophisticated vocabulary often.
Accuracy (Pronunciation): Native-like accent; natural rhythm and intonation.

Band 4
Fluency (Content of Contributions): Offers a few details or examples; offers reasons and opinions; able to converse on topic with minimal struggle.
Fluency (Communication Strategies): Often actively engages others; uses gestures and maintains eye contact appropriately.
Accuracy (Grammar & Vocabulary): Uses/understands complex grammatical structures with a few mistakes; uses/understands some new and sophisticated vocabulary.
Accuracy (Pronunciation): Non-native accent with very few mispronunciations; natural rhythm and intonation with slight mistakes.

Band 3
Fluency (Content of Contributions): Offers a few details or examples; offers simple reasons or opinions; able to converse on topic with struggle.
Fluency (Communication Strategies): Sporadically active in engaging others; uses few gestures and maintains some eye contact.
Accuracy (Grammar & Vocabulary): Uses/understands less complex grammatical structures; uses/understands basic topical vocabulary.
Accuracy (Pronunciation): Non-native accent with some mispronunciations; non-native rhythm and intonation.

Band 2
Fluency (Content of Contributions): Offers details or examples when asked; offers reasons or opinions when asked; greatly struggles on topic.
Fluency (Communication Strategies): Somewhat passive and is sometimes engaged by others; uses few gestures and maintains little eye contact.
Accuracy (Grammar & Vocabulary): Uses/understands simple grammatical structures; uses/understands some basic topical vocabulary.
Accuracy (Pronunciation): Non-native accent with many mispronunciations; non-native rhythm and intonation interferes with comprehension.

Band 1
Fluency (Content of Contributions): Struggles with details or examples when asked; struggles with reasons or opinions when asked; converses mainly on unrelated topic.
Fluency (Communication Strategies): Passive and is usually engaged by others; uses almost no gestures nor eye contact.
Accuracy (Grammar & Vocabulary): Uses/understands simple grammatical structures with difficulty; uses/understands some basic topical vocabulary with difficulty.
Accuracy (Pronunciation): Non-native accent with many mispronunciations causing difficulty in understanding; non-native rhythm and intonation causing difficulty in comprehension.

Band 0
Fluency (Content of Contributions): Struggles greatly with details or examples when asked; struggles greatly with reasons or opinions when asked; converses only on unrelated topic.
Fluency (Communication Strategies): Very passive; uses neither gestures nor eye contact.
Accuracy (Grammar & Vocabulary): Uses/understands simple grammatical structures with great difficulty and numerous errors; uses some basic topical vocabulary with great difficulty.
Accuracy (Pronunciation): Non-native accent with many mispronunciations causing great difficulty in understanding; non-native rhythm and intonation interferes greatly with comprehension.


CEP DEF Assessment Criteria

Band 5
Fluency (Content of Contributions): Offers simple details or examples; offers simple reasons or opinions; able to adequately converse on topic.
Fluency (Communication Strategies): Very actively engages others; uses gestures and maintains eye contact appropriately.
Accuracy (Grammar & Vocabulary): Uses/understands simple grammatical structures well; uses/understands new vocabulary well.
Accuracy (Pronunciation): Non-native accent with some mispronunciations; non-native rhythm and intonation.

Band 4
Fluency (Content of Contributions): Offers limited details or examples; offers very simple reasons or opinions; converses on topic with a little struggle.
Fluency (Communication Strategies): Active in engaging others; gestures and maintains eye contact.
Accuracy (Grammar & Vocabulary): Uses/understands simple grammatical structures with a few mistakes; uses/understands some new vocabulary.
Accuracy (Pronunciation): Non-native accent with several mispronunciations; non-native rhythm and intonation with slight mistakes.

Band 3
Fluency (Content of Contributions): Offers few details or examples; offers few reasons or opinions; converses on topic with struggle, and tends to wander.
Fluency (Communication Strategies): Sporadically active in engaging others; uses few gestures and maintains some eye contact.
Accuracy (Grammar & Vocabulary): Uses/understands simple grammatical structures with errors; uses/understands basic topical vocabulary.
Accuracy (Pronunciation): Non-native accent with numerous mispronunciations; non-native rhythm and intonation with several mistakes.

Band 2
Fluency (Content of Contributions): Offers few details or examples when asked; offers limited reasons or opinions when asked; struggles greatly to converse on topic and shifts to unrelated topics.
Fluency (Communication Strategies): Somewhat passive and is sometimes engaged by others; uses almost no gestures and maintains little eye contact.
Accuracy (Grammar & Vocabulary): Uses/understands simple grammatical structures with numerous errors; uses/understands some basic topical vocabulary.
Accuracy (Pronunciation): Non-native accent is heavy and interferes with comprehension; non-native rhythm and intonation interferes with comprehension.

Band 1
Fluency (Content of Contributions): Struggles greatly with details or examples when asked; struggles greatly with reasons or opinions when asked; able only to converse on unrelated topic.
Fluency (Communication Strategies): Very passive and is engaged by others; uses neither gestures nor eye contact.
Accuracy (Grammar & Vocabulary): Uses/understands simple grammatical structures with great difficulty and numerous errors; uses/understands some basic topical vocabulary with great difficulty.
Accuracy (Pronunciation): Non-native accent with many mispronunciations causing great difficulty in understanding; non-native rhythm and intonation interferes greatly with comprehension.

Band 0
Fluency (Content of Contributions): Offers no details or examples when asked; offers no reasons or opinions when asked; total breakdown.
Fluency (Communication Strategies): Doesn't participate; total communication breakdown.
Accuracy (Grammar & Vocabulary): Doesn't use simple grammatical structures; doesn't understand basic vocabulary.
Accuracy (Pronunciation): Non-native accent breaks communication; non-native rhythm and intonation stops the conversation.


    Grading Procedure

During the speaking tests, the learners are graded in terms of content of conversation (40%), communication strategies (30%), grammar and vocabulary (15%) and pronunciation (15%). It should be noted that a larger portion of the grade is allocated to the content of conversation and the communication strategies (fluency-based activities) than to grammar and vocabulary and pronunciation (accuracy-based activities). Content of conversation relates to the ability to converse on a topic with some detail, giving reasons for opinions, while communication strategies relate to starting and closing conversations, responding to questions and soliciting information, which includes gestures and eye contact. A simplified copy of the grading procedure can be seen below in Figure 3.

Figure 3 - Simplified Version of Grading Procedure

Content (40%, score x 8); Communication and Participation (30%, score x 6); Vocabulary and Grammar (15%, score x 3); Pronunciation (15%, score x 3). Each category is scored from 5 (good) to 1 (bad).

A score of 5 (good) means:
Content: Speaks a lot; gives examples; explains why.
Communication and Participation: Active; uses gestures; looks at partner while speaking.
Vocabulary and Grammar: Uses new words from textbook; uses grammar from textbook.
Pronunciation: Understandable.

A score of 1 (bad) means:
Content: Speaks a little; no examples; no reasons why.
Communication and Participation: Not active (passive); few gestures; doesn't look at partner when speaking.
Vocabulary and Grammar: Uses high school level English; doesn't use new words or grammar from textbook.
Pronunciation: Very strong accent; hard to understand.

Total = (Content ___ x 8 = ___) + (Communication and Participation ___ x 6 = ___) + (Vocabulary and Grammar ___ x 3 = ___) + (Pronunciation ___ x 3 = ___)


The learners are reminded at the beginning of each cycle that the questions in the speaking tests will be based on the activities they will cover in the classroom. It is felt that the weighting of scores, in terms of their distribution between fluency and accuracy, reflects the required balance in CEP between knowledge (accuracy) and skills (fluency).

The examiners use laptop computers and record the test results in formatted files. Their scores are combined after the testing so that the learners' final grades are the average of the three examiners' scores.

    Norming Procedure

    The CEP examiners undergo a norming procedure on a regular basis (once a month, at the end of

    each cycle, just prior to the speaking tests).

The norming process is taken seriously, given the importance for the CEP examiners of understanding and internalising the common testing standards in order for the speaking tests to be as fair and reliable as possible. To this end, it is also important that the examiners base their scores on the learners' performances in the test itself, and not on how they might be expected to perform based on performance in the classroom. Some learners converse competently in the classroom, yet perform poorly in the speaking tests, while others are the opposite; the CEP examiners make much effort to examine only what happens in the test, regardless of their subjective opinions of the learners' abilities.

During the norming sessions, the rating band sheets (a more detailed version than those used in the classroom) are studied closely, and the terms and concepts therein are discussed to make sure that common understandings are reached. Tape excerpts of learners taking the speaking tests are then watched, each examiner assigns a grade accordingly, and the examiners then discuss the grades they gave. If the scores of the examiners vary by no more than 8 percentage points from each other, it is considered that an adequate level of norming has been reached. The goal of this practice is not that the examiners will give exactly the same score to learners each and every time, but rather that the same standards are understood and applied.
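The 8-point agreement check is easy to express programmatically. Here is a minimal Python sketch (a hypothetical helper, not CEP's actual tooling), assuming the examiners' scores for a learner are on the 0-100 percentage scale:

```python
from itertools import combinations

def norming_reached(scores: list[float], tolerance: float = 8.0) -> bool:
    """True if no pair of examiners' scores for the same learner differs
    by more than `tolerance` percentage points."""
    return all(abs(a - b) <= tolerance for a, b in combinations(scores, 2))

print(norming_reached([72, 76, 79]))  # True: the largest gap is 7 points
print(norming_reached([72, 76, 83]))  # False: 83 - 72 = 11 points
```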

The learners are also exposed to the speaking test standards, so that they understand the requirements. A copy of the examiners' rating sheet written in Japanese and a simplified version of the examiners' rating sheet written in English are given to each student at the beginning of the academic year, and the students are asked to grade a group of learners shown to them on a video undergoing a speaking test. After the learners discuss the scores they gave, the teachers tell them what score each student on the video actually received from the examiners, and the reasons why. Learners are also given additional handouts to prepare them for the tests. All effort is made to emphasise that conversational activities in the classroom are linked to the tests, not only by topic but also by the required conversational authenticity and fluency that learners practice in the classroom.


    Each student then prepares four questions based on the topics of the two NIC units covered in the

    previous two weeks as homework, and brings them to class for the review day prior to the

speaking tests. They practice these in the classroom in groups of three for three minutes at a time, as a final preparation for the test.

    How Valid and Reliable is the CEP Speaking Test?

It seems that the CEP speaking test, in its current form, has come a long way from its forerunner, and has a fairly high degree of validity and reliability for the following reasons:

- The different weightings for accuracy and fluency in the scoring sheets, with most emphasis on fluency, which remain unchanged throughout the year;
- The content of the scoring sheets also remaining the same throughout the year, in terms of terminological contents and mathematical formulae;
- The explicit nature of the terminological content of the scoring sheets, created with input from all the examiners and reflecting a combined understanding of their pedagogical stance;
- The homogeneity of the learners in terms of race, age, academic status, and socio-economic and academic background;
- Their consistency with the philosophy of CEP;
- The consistency in application, directly after the first two weeks in each cycle;
- The consistency and thoroughness of the norming procedure by the examiners, to maximise interrater reliability;
- The number of examiners, three in total;
- The process of making learners aware of the purpose of the speaking test through their exposure to the norming process in class just prior to the test;
- The practicing of three-minute conversations in rotation in the class just prior to the speaking test, simulating the test with questions created by the learners;
- The fairly large contribution of the speaking tests to the final CEP grade;
- The consistent and fairly high correlations achieved despite a new examiner joining the team and having undergone only one intensive norming session (see below);
- The rapid feedback of results, helping learners relate their scores to their performances more easily; and
- The backwash and forward wash effects that motivate the learners to continue to practice in the classroom in preparation for the speaking tests, thus internalising the link between classroom practice and test performance.

Interrater Reliability Correlations to Establish the Validity and Reliability of the CEP Speaking Test


Interrater reliability testing is consistently carried out in CEP, but the split-half method that was used in the forerunner of the speaking test has not been applied again, mainly because of problems encountered with the method when it was used prior to the time of CEP. Hadley and Mort (1999, p. 50) noted:

although the split-half method is used with success with many more objective test designs, it is not certain if our test instrument can be measured objectively. We suspect that this instrument may be more organic in nature, and cannot be easily separated into different parts.

The alternative of using the correlation formula provided by the Microsoft Excel software package (which applies a simple regression method, correlated using the Pearson r correlation coefficient) has proven to be more convenient and accurate in CEP than the split-half method.

It is easy to apply, given that the scores are entered into computers and stored on the CEP intranet, and this has been done for the previous two years (2001 and 2002). The Microsoft Excel software package makes a regression analysis possible between two variables, and thereby allows interrater correlations to be made between two examiners at a time. CEP had its first speaking tests for the current academic year on May 16th and 17th. For the first time, three examiners were used instead of two (the greater the number of examiners, the better the interrater reliability, and thereby the better the internal reliability). The interrater correlation results between the three examiners for this test are shown in Table 1.

Table 1: Interrater Reliability Correlations of the Three CEP Examiners

Correlation between Examiner A and C: 0.87
Correlation between Examiner A and B: 0.89
Correlation between Examiner C and B: 0.87
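The same pairwise coefficients can be reproduced outside Excel. A minimal Python sketch of the Pearson r computation for two examiners' score columns follows; the scores here are hypothetical, not the actual CEP data behind Table 1.

```python
from statistics import correlation  # Pearson's r; available in Python 3.10+

# Hypothetical 0-100 speaking-test scores from two examiners,
# for the same six learners in the same order.
examiner_a = [74, 62, 88, 55, 70, 81]
examiner_b = [78, 60, 85, 59, 68, 84]

print(round(correlation(examiner_a, examiner_b), 2))  # pairwise interrater r
```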

These correlations are encouraging, in that they reveal that the examiners are not giving exactly the same scores (as that would render a correlation of 1.00), and neither are the scores too different from each other (as that would render a correlation of less than 0.50). They are also slightly higher than those of the previous two years (2001 and 2002), when two examiners were used.

An interesting observation is that, while examiners B and C have been in the CEP program for two consecutive years, examiner A joined recently, in April 2002, and underwent only one norming session with the other examiners prior to the testing. The correlations thus also suggest that the CEP speaking test has a fair degree of internal reliability, in that it renders similar outcomes irrespective of who is examining the test, and this also suggests that the norming procedure is working well in practice. Since this test, two more speaking tests have been undertaken, and the interrater correlations have continued to be at acceptable levels.

Detailed comparisons of interrater reliability correlations from cycle to cycle (test to test) have not been made to date, but this would be a useful exercise, as it would give some idea of the changing impact of the norming procedure over time. Similar trends could be established for the learners' scores from cycle to cycle. Such comparisons could yield interesting results, particularly in an effort to measure the relationship over time between aspects of reliability and


    validity (for example, comparisons between interrater correlations and backwash and forward

wash effects through changes in the learners' scores). Although this is beyond the scope of this

    paper, it could be a subject of interesting academic research in the future.

    Conclusion

We have found it very useful to consider the aspects of validity and reliability in the creation of the CEP speaking test. It has made us look at testing, especially oral testing, in a more critical way, and to be more aware of the need for validity and reliability, especially through the norming procedure to achieve acceptable levels of interrater reliability. It made us realise that there is a lot more to oral testing than we initially envisaged, and that maintaining an oral test in good form needs constant attention. It is not an objective test that, once created, can be filed away and taken out only at times of use. The speaking test has consequently become a living part of CEP, in that all classroom activity is given a specific purpose, and the examiners undergo norming procedures before each speaking test is administered. Before going through this rigorous process, I often questioned the need to be so thorough and precise concerning all aspects of the speaking test, particularly during the discussions in the norming sessions, and sometimes felt that we overdid it at the expense of other things that needed to be done. In hindsight, I have come to realise the relevance of what we were doing, and appreciate the product that I helped create all the more. We feel that our experience has equipped us with a better understanding of the complex dynamics of oral testing, and that it will certainly benefit our future professional development.

I would therefore recommend that all serious ESL teachers who have not yet closely considered the validity and reliability of their speaking tests do so, as the insight gained would be very beneficial to their professional development as well.


    References

Bachman, L. F. (1990). Fundamental Considerations in Language Testing. Oxford University Press.

Hadley, G. & Mort, J. (1999). An Investigation of Interrater Reliability in Oral Testing. Nagaoka National College of Technology Journal, 35(2), 45-51. Retrieved from http://www.nuis.ac.jp/~hadley/publication/interrater/reliability.htm

    Heaton, J. B. (1997). Writing English Language Tests. New York: Longman.

Henning, G. (1987). A Guide to Language Testing: Development, Evaluation, Research. Heinle & Heinle Publishers.

Hughes, A. (1989). Testing for Language Teachers. Cambridge University Press.

Hunt, D. (1998). Designing a Reading Comprehension Test for Oral English Classes. The Shizuoka Gakuen College Review Journal, 11, 61-80.

Richards, J. C. (1998). New Interchange 1: English for International Communication. Cambridge University Press.

    Richards, J. C. (1998). New Interchange 2: English for International Communication.

    Cambridge University Press.

    Richards, J. C. (1998). New Interchange 3: English for International Communication.

    Cambridge University Press.

    Underhill, N. (1987). Testing Spoken Language. Cambridge University Press.

    Weir, C. J. (1990). Communicative Language Testing. Prentice Hall International.
