7/28/2019 The Challenges of Creating a Valid and Reliable Speaking Test as Part of a Communicative English Program
The Challenges of Creating a Valid and Reliable Speaking Test as Part of a Communicative English Program

David Jeffrey

David Jeffrey is an instructor in the Communicative English Program at the Niigata University of International and Information Studies, Japan. He has experience both in socio-economic development in South Africa and in teaching English in Japan.
Abstract

This paper describes the challenging evolution and present form of the speaking test that is the backbone of the Communicative English Program (CEP) of the Niigata University of International and Information Studies (NUIS), a private university in Japan, and discusses how valid and reliable this test appears to be. CEP, founded in 2000, is a semi-intensive, skills-based teaching program within the Department of Information Culture of NUIS.

One of the biggest challenges in setting up the CEP teaching program was the need to create a speaking test that could accurately measure the fluency criteria of communication (content and communication strategies) as well as the accuracy criteria (grammar, vocabulary and pronunciation), while remaining practical and making the best use of limited resources.

The CEP speaking test is described in detail, using illustrative examples from the test itself wherever possible. Attention is also given to the use of interrater reliability correlations as a measure of the consistency between the examiners in applying the testing criteria.

Finally, the usefulness of reflecting on both the evolution and present forms of testing procedures is considered in terms of its potential contribution to the professional development of teachers and their teaching programs.
The Origins of the CEP Speaking Test
The origins of the CEP speaking test can be traced back to 1999, when the coordinator of CEP, Hadley, was a teacher at the Nagaoka National College of Technology in Japan. Together with his co-worker Mort, he created a speaking test (the forerunner of the CEP speaking test) to assess the oral proficiency of their learners in terms of their ability to use English as a natural communicative skill. Their primary concern was to find out what their learners could do, rather than what they knew (Hadley and Mort, 1999).

As a result, the speaking test that emerged at this time measured mainly the fluency aspects of conversation (the skills of making meaningful conversation), as well as the accuracy aspects (such as vocabulary and grammatical correctness) that are also considered an important part of conversation. The test thus gave primary attention to the fluency aspects of conversation and secondary attention to the accuracy aspects, which made the examining process more subjective in nature. The authors consequently became concerned about its internal reliability, particularly with respect to the examining process and the necessity of maximising interrater reliability.
Interrater reliability measures the consistency between different examiners. Hadley and Mort (1999, p. 2) described it as:

the degree of correlation between two or more examiners, with the goal of determining whether they are using the same set of criteria when testing the oral proficiency of their learners.
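The idea of interrater reliability as a correlation can be sketched in a few lines. The example below computes a Pearson correlation between two examiners' scores for the same learners; the score lists are hypothetical, and the plain-Python `pearson` helper is an illustration, not part of any CEP tooling:

```python
# Minimal sketch: interrater reliability as the Pearson correlation
# between two examiners' scores for the same group of learners.

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical band scores (0-5) given by two examiners to eight learners.
examiner_a = [3, 4, 2, 5, 3, 4, 1, 4]
examiner_b = [3, 5, 2, 4, 3, 4, 2, 5]
print(round(pearson(examiner_a, examiner_b), 2))
```

A coefficient near 1 would suggest the examiners rank learners similarly, i.e. that they are applying a shared set of criteria.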
The Speaking Test Evaluation Sheet used in these early days can be seen in Figure 1 below. Please note the different weightings applied to the testing categories, which give higher priority to communicative ability and fluency, and lower priority to features of accuracy.
Figure 1 - Speaking Test Evaluation Sheet, as Used at the Nagaoka National College of Technology in 1999

Communicative Ability - Includes length of utterances, flexibility to speakers of differing levels, and complexity of responses. (Multiply by 6)
0 / 1 / 2 / 3 / 4 / 5 = ____ x 6 = ____

Fluency - Appropriate speed, pauses and discourse strategies. (Multiply by 4)
0 / 1 / 2 / 3 / 4 / 5 = ____ x 4 = ____

Vocabulary - Did the student use a wide variety of words and phrases, or use new vocabulary used in class? (Multiply by 3)
0 / 1 / 2 / 3 / 4 / 5 = ____ x 3 = ____

Non-verbal Strategies - Did the student supplement oral communication with appropriate gestures, eye contact and body language? (Multiply by 3)
0 / 1 / 2 / 3 / 4 / 5 = ____ x 3 = ____

Grammar - How accurate and appropriate was the student's grammar? (Multiply by 2)
0 / 1 / 2 / 3 / 4 / 5 = ____ x 2 = ____

Pronunciation - Was effort made to use correct intonation, or was the accent a barrier to communication? (Multiply by 2)
0 / 1 / 2 / 3 / 4 / 5 = ____ x 2 = ____
It is important to stress that the aim at this time was not to achieve exactly the same results between the examiners, but rather to achieve similar results that were fairly high. Indeed, theorists that Hadley and Mort referred to, such as Heaton (1997, p. 164, in Hadley and Mort, p. 5), note that the internal reliabilities of objective tests (such as multiple-choice tests) are inherently higher than those of subjective tests (such as oral tests), and therefore what may not be considered an acceptable level for a multiple-choice test may be acceptable for an oral test. Heaton adds that a moderate level of internal reliability is in fact desirable for an oral test, because such tests rely on many uncontrolled variables within natural communicative expression, rather than the direct questions and discrete answers required by objective tests.
The split-half method was used to check the internal reliability of the speaking test at this time. It involved dividing the test into two nearly equal parts, correlating the scores for the two parts, and adjusting this coefficient using the Spearman-Brown Prophecy Formula, which, according to Henning (1987, p. 197), is used to:

adjust estimates of reliability to coincide with changes in the numbers of items or independent raters in a test.
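The split-half procedure described above can be sketched as follows. The Spearman-Brown adjustment below is the standard general form (reliability projected for a test n times as long); the half-test coefficient of 0.6 is purely hypothetical and is not a figure from the paper:

```python
# Sketch of the split-half check: correlate the two half-test scores, then
# project the coefficient up to full-test length with the Spearman-Brown
# Prophecy Formula: r_adjusted = n * r / (1 + (n - 1) * r).

def spearman_brown(r, length_factor=2):
    """Adjust a reliability estimate for a test length_factor times as long.

    With length_factor=2 this is the classic split-half correction:
    the full test is twice as long as either half.
    """
    return (length_factor * r) / (1 + (length_factor - 1) * r)

r_half = 0.6  # hypothetical correlation between the two half-test scores
print(round(spearman_brown(r_half), 2))  # estimated full-test reliability
```

Because a longer test samples more behaviour, the adjusted coefficient is always at least as high as the raw half-test correlation.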
Hadley and Mort (1999, pp. 3-5) were disappointed with the results of their interrater reliability testing, because a measure of only 0.54 was achieved, which was lower than the desired level even taking into consideration that speaking tests are subjective in nature. It suggested that they needed to become more aligned with each other in their common understanding of issues while internalising their examining criteria. They note two possible reasons for this: firstly, a general lack of confidence and feelings of distraction amongst the two examiners; secondly, the scoring bands and their meanings needed to be more explicit.

Because Hadley and Mort's (1999, p. 9) scoring bands and their meanings were somewhat inexplicit, one examiner used a basic criterion of:
would a native speaker, who is unaccustomed with Japanese speech patterns and mannerisms, be able to understand this student?

On the other hand, the second examiner used a basic criterion of:

based upon my experience of living in Japan for eight years, can I understand what this student is trying to say?
They concluded that they had not been explicit enough in their basic pedagogic criteria for rating learners, and worked out a middle ground between the two, which was stated as:

will a native speaker of English, who is sincerely open to communicating with Japanese, be able to understand what the learner is trying to say, even though he or she is mostly unaccustomed with Japanese mannerisms and speech patterns?

They also concluded that the results had helped them come to some important decisions to apply to forthcoming oral testing, particularly with respect to the examiners endeavouring to understand each other's pedagogical stance in order to improve interrater reliability. These considerations laid the basis of the CEP speaking test.
Defining What We Wanted in a Speaking Test
Although we had some philosophical and practical background as a foundation for the CEP speaking test in the beginning, thanks to the work already carried out by Hadley and Mort at the Nagaoka National College of Technology, there was still much to consider in refining the speaking test to meet the specific requirements of CEP and to make it as valid and reliable as possible, especially in terms of interrater reliability.

Our starting point was to consider exactly what it was we wanted to achieve, and to work towards that goal. We began by going back to basics in considering the importance of testing, particularly oral testing in a communicative program.
Definitions of tests were one starting point, such as that of Bachman (1990, p. 20), who defines a language test as:

a measurement instrument designed to elicit a specific sample of an individual's behaviour (and) quantifies characteristics of individuals according to explicit procedures.

As well as Underhill (1987, p. 7), who refers to a speaking test as:

a repeatable procedure in which the learner speaks and is assessed on the basis of what he says.

Weir (1990, p. 7) says:
in testing communicative language ability we are evaluating samples of performance, in certain specific contexts of use, created under particular test constraints, for what they can tell us about a candidate's communicative capacity or language ability.

We found going back to the theoretical basics of speaking tests very beneficial as a common starting point from which to advance.

This process wasn't easy, and it was time-consuming, especially given the many demands that designing and implementing a program simultaneously placed on us. In retrospect, it is no wonder that speaking tests are generally considered a necessary evil by many teachers and learners alike. However, they remain an indispensable means of providing teachers and learners with information about the teaching and learning process.

Hughes (1989, p. 1) says:

Many language teachers harbour a deep mistrust of tests and of testers... this mistrust is frequently well-founded. It cannot be denied that a great deal of testing is of very poor quality. Too often language tests have a harmful effect on teaching and learning; and too often they fail to measure accurately whatever it is they have intended to measure.

We were aware of this potential shortcoming. Despite our frustrations, we put as much effort as possible into making the CEP speaking test as valid and reliable as we could, so that the testing would be done well and lessen the potential mistrust that learners and teachers might harbour towards it, especially given the centrality of the speaking test within the CEP program.
The Importance of Reliability and Validity
Reliability and validity were the central concepts around which we worked to create the CEP speaking test, but what is specifically meant by reliability and validity, and why are they important?

Reliability and validity are interrelated and depend on many aspects of a test. In a broad sense, Henning (1987, p. 198) defines validity as:

the extent to which a test measures the ability or knowledge that it is purported to measure.
He defines reliability as:

the consistency of the scores obtainable from a test. It is usually an estimate on a scale of zero to one of the likelihood that the test would rank testees in the same order from one administration to another proximate one (p. 198).

Reliability is therefore concerned with whether a test gives consistent results. As Underhill (1987, p. 9) says:

If the same learners are tested on two or three occasions, do they get the same score each time?

Validity, on the other hand, is concerned with whether a test measures what it is supposed to.
Many important aspects of tests have a bearing on validity and reliability; those worth mentioning here include backwash effects, face validity, content validity and construct validity.

Hughes (1989, p. 1) states:

the effect of testing on teaching and learning is known as backwash.

Backwash effects can be positive or negative; they are positive when they motivate both teachers and learners to prepare for the tests. Related to this is the importance of considering the potential forward wash effects of tests that take place at the beginning of teaching cycles, which motivate learners to learn and perform better for future tests (Hunt, 1998, p. 68).
Henning (1987, p. 192) defines face validity as:

a subjective impression, usually on the part of examinees, of the extent to which the test and its format fulfils the intended purpose of measurement.

Face validity is closely associated with content validity, defined by Henning (1987, p. 190) as:

usually a non-empirical expert judgement of the extent to which the content of a test is comprehensive and representative of the content domain purported to be measured by the test.

Face and content validity therefore refer to the extent to which the test is recognisable as a fair test by learners, who thereby perform to their ability. Tests that lack face and content validity cause negative backwash effects and result in learner underperformance, as well as in the results being contested by both teachers and learners.

Henning (1987, p. 190) defines construct validity as:

the validity of the constructs measured by a test.
Construct validity is related to content validity, in that it is concerned with the contents of the test and their wider context. Construct validity thus refers to whether the test shares the philosophy of the teaching program of which it is a part, and it can be measured by both statistical and intuitive methods, according to Underhill (1987, p. 106), who adds that:

construct validity is not an easy idea to work with... to reduce it to its simplest statement it says: does the test match your views on language learning? In practice, there may be little difference between construct and content validity.

These were just some of the many things that needed a good deal of consideration while creating the CEP speaking test.
CEP and its Speaking Test Today
Almost three years ago, when CEP was founded, it was clear that its teaching philosophy would be communicative. What was meant by communicative, and how this philosophy would be reflected in our teaching and testing methodology, was a matter of much contemplation.

In the first year of CEP, the coordinator and the two instructors were engaged in considerable innovation in syllabus design and implementation, which included extensive lesson planning and the creation of the CEP intranet server for the administrative files, a system that has proved extremely convenient and where the final results of the speaking tests are stored. Our CEP website was also made at that time, and can be viewed at http://www.nuis.ac.jp/~hadley/cepweb/cep/. The first year of CEP was like building and sailing a ship on a rough sea.
Now, after intense innovation, followed by consolidation and refinement, CEP has become a semi-intensive, skills-based English as an International Language (EIL) program. Small classes of 22 learners are streamed into six distinct levels of language proficiency, and meet once a day for 45 minutes from Monday to Friday to study courses that focus on oral communication, listening and reading skills. CEP consists of 8 teaching cycles (4 per semester), each lasting 3 weeks. The first 2 weeks of each cycle are devoted to classroom activities, and the last week to testing activities. Although a listening test is also undertaken during the last week, it is the speaking test that is the most important and the most challenging for both the examiners and the learners. There are thus 8 speaking tests in an academic year.

Oral communication skills are considered the most important part of CEP, and thus the CEP speaking test has been, and still is, given considerable attention. It is true to say that the speaking test has become the backbone of the CEP program, although listening and reading tests are also undertaken. The speaking tests are considered by the learners to be the most demanding of the tests, and the most accurate reflection of how they are managing in CEP, given its communicative philosophy.

It should also be stressed that the tradition of considering the fluency aspects of conversation most important has been sustained, with a significant amount of time and effort devoted to creating the present form of the CEP speaking test and making it as valid and reliable
as possible. Accuracy is regarded in CEP as attention to, and familiarity with, aspects of form, whereas fluency is regarded as a skill (as automated knowledge). CEP recognises that too much attention to accuracy jeopardises fluency, and thus diverts the bulk of attention away from accuracy by focusing on meaning. CEP thus aims to encourage learners to concentrate more on what they are saying, and less on how they are saying it.

Learners in CEP are therefore taught to express matters that are important to them and their lives, focussing on Japanese issues as they relate to the international setting. The learners are encouraged to learn how to confidently and effectively communicate their concerns, cultural viewpoints and personal interests by taking ownership of English and using it as a means of meaningful interchange with people of other countries, and to relate what it means to be Japanese in a positive way to others in the world community. CEP thus wants its learners to learn how to authentically express who they are as Japanese, in English, and to be able to relate who they are, and why they think the way they do, to people of other cultures.
The New Interchange: English for International Communication (NIC) Levels 1, 2 and 3 (Richards, 1998) are the base texts used for the homework, listening and speaking activities. Although the NIC units are followed in the order prescribed by their writers, the sequencing of lessons within these units is determined by the policy of CEP to move from accuracy (grammar homework) to fluency (conversation), with the listening activities bridging the two.

Hence, the approach is to begin with activity-based homework activities, as a form of consciousness-raising. Homework checking takes about 5 minutes, and listening activities another 10 minutes, so that the remaining bulk of 30 minutes can be given to fluency-based conversational activities. In the conversational activities, learners are encouraged to place most emphasis on fluency (as opposed to accuracy), and conversational content and strategy, as well as physical gestures and eye contact, play important roles. Learners are taught how to open and close conversations, introduce and develop topics, and understand and use common useful expressions as well as idiomatic phrases in the classroom. The speaking test then checks whether they have internalised what they have done in the classroom, and whether they can take ownership of what they have learned and use it as a skill.
To reinforce the need to practise speaking as much as possible during the classroom activities, learners receive points in the form of plastic coins during their classroom speaking activities, which they cash in at the end of class. The points are recorded and contribute towards their final semester score. This encourages the learners to talk: they are awarded points primarily for conversational effort, for making an effort to communicate meaningfully, and points are not taken away for mistakes. For example, if the teacher asks a question during a listening exercise and a student attempts to answer it, that student receives a participation point, which counts towards the year-end grade, irrespective of whether the question is answered correctly or not.
In addition to measuring mainly the fluency aspects of the oral communicative ability that is encouraged in the classroom, the speaking test gives a definite purpose to the homework and classroom activities of each cycle. Without doing the homework, learners find themselves unable to cope adequately with the listening and speaking activities in the classroom (given the
environmental issues and learning are:
Discuss the environmental problems of Japan
Discuss what can be done to stop the illegal dumping of rubbish
Discuss ways to help abandoned animals
Discuss good ways to improve your English skills
Discuss some new things you would like to learn to do
Discuss the reasons why you study English
One of the three learners is asked to choose a number from one to six (the number of cards in the continually shuffled pack). Without showing the corresponding card to the learners, the examiner reads the question slowly and clearly to the learners twice, and the learners then think about the question for ten seconds (this makes the task more directed and allows the weaker learners some thinking time). The learners must then discuss this question for three minutes. The examiners listen and give each student a grade (at no point in this process may examiners clarify the meaning of any word in the question, and the conversation is exclusively between the three learners). Their scores are entered into the examiners' computers, onto prepared master files with formulas that calculate each learner's score in accordance with the weightings of the testing categories. The learners' final scores are an average of the three examiners' grades. After the three minutes are up, the learners are asked to stop, a new group of learners is called into the room, and the process continues until all the learners have been tested.
The learners are given their results in their next class on Monday. This rapid feedback is possible because the scoring process is computerised, and it benefits the learners in that their memories of the tests are still fresh: they can easily recall what they did and reflect on what they scored.
Score Sheets

The scores of the CEP speaking test are based on two score sheets comprised of rating bands consisting of assessment criteria for the examiners. The sheets were created by the examiners in mutual agreement, thus reflecting their combined pedagogical stance.

There are two rating band sheets in CEP: one for the higher proficiency levels of classes A, B and C, and the other for the lower proficiency levels of classes D, E and F. This is to make the process as fair as possible for the learners, especially the lower proficiency learners. The different levels allow learners to undergo class work and testing under conditions that are neither too easy nor too difficult for their current proficiency levels.

Despite the different levels, attention is given to the reliability of the bands, so that they do not overlap, are sufficiently described, are as free as possible from subjective and impressionistic elements entering into the evaluation, and afford as much possibility as possible for the examiners to enter similar scores.

To make the process fair, the upper limit for the higher proficiency levels (A, B and C) does not require a student to have the ability of a native English speaker, but an ability similar to a
Japanese person who has spent three to five weeks abroad in a summer home stay program. The requirements for the lower proficiency levels (D, E and F) are 3 bands lower than the A, B and C bands, with the upper limit being an ability to converse on a simple level. Although on a simple level is subject to many interpretations, the examiners undergo extensive discussion during the norming sessions to be sure that a mutual understanding of what it means is reached and internalised. A simplified version of the score sheets can be seen in Figure 2.
Figure 2 - Simplified Version of Score Sheets (used in the examination room by the examiners)

CEP ABC Assessment Criteria

Band 5
Fluency (Content of Contributions): Offers many details or examples; offers valid and pertinent reasons and opinions; able to converse on topic without struggle.
Fluency (Communication Strategies): Very actively engages others; uses gestures and maintains eye contact skilfully.
Accuracy (Grammar & Vocabulary): Uses/understands complex grammatical structures often; uses/understands new and sophisticated vocabulary often.
Accuracy (Pronunciation): Native-like accent; natural rhythm and intonation.

Band 4
Fluency (Content of Contributions): Offers a few details or examples; offers reasons and opinions; able to converse on topic with minimal struggle.
Fluency (Communication Strategies): Often actively engages others; uses gestures and maintains eye contact appropriately.
Accuracy (Grammar & Vocabulary): Uses/understands complex grammatical structures with a few mistakes; uses/understands some new and sophisticated vocabulary.
Accuracy (Pronunciation): Non-native accent with very few mispronunciations; natural rhythm and intonation with slight mistakes.

Band 3
Fluency (Content of Contributions): Offers a few details or examples; offers simple reasons or opinions; able to converse on topic with struggle.
Fluency (Communication Strategies): Sporadically active in engaging others; uses few gestures and maintains some eye contact.
Accuracy (Grammar & Vocabulary): Uses/understands less complex grammatical structures; uses/understands basic topical vocabulary.
Accuracy (Pronunciation): Non-native accent with some mispronunciations; non-native rhythm and intonation.

Band 2
Fluency (Content of Contributions): Offers details or examples when asked; offers reasons or opinions when asked; greatly struggles on topic.
Fluency (Communication Strategies): Somewhat passive and is sometimes engaged by others; uses few gestures and maintains little eye contact.
Accuracy (Grammar & Vocabulary): Uses/understands simple grammatical structures; uses/understands some basic topical vocabulary.
Accuracy (Pronunciation): Non-native accent with many mispronunciations; non-native rhythm and intonation interferes with comprehension.

Band 1
Fluency (Content of Contributions): Struggles with details or examples when asked; struggles with reasons or opinions when asked; converses mainly on unrelated topic.
Fluency (Communication Strategies): Passive and is usually engaged by others; uses almost no gestures or eye contact.
Accuracy (Grammar & Vocabulary): Uses/understands simple grammatical structures with difficulty; uses/understands some basic topical vocabulary with difficulty.
Accuracy (Pronunciation): Non-native accent with many mispronunciations causing difficulty in understanding; non-native rhythm and intonation causing difficulty in comprehension.

Band 0
Fluency (Content of Contributions): Struggles greatly with details or examples when asked; struggles greatly with reasons or opinions when asked; converses only on unrelated topic.
Fluency (Communication Strategies): Very passive; uses neither gestures nor eye contact.
Accuracy (Grammar & Vocabulary): Uses/understands simple grammatical structures with great difficulty and numerous errors; uses some basic topical vocabulary with great difficulty.
Accuracy (Pronunciation): Non-native accent with many mispronunciations causing great difficulty in understanding; non-native rhythm and intonation interferes greatly with comprehension.
CEP DEF Assessment Criteria

Band 5
Fluency (Content of Contributions): Offers simple details or examples; offers simple reasons or opinions; able to adequately converse on topic.
Fluency (Communication Strategies): Very actively engages others; uses gestures and maintains eye contact appropriately.
Accuracy (Grammar & Vocabulary): Uses/understands simple grammatical structures well; uses/understands new vocabulary well.
Accuracy (Pronunciation): Non-native accent with some mispronunciations; non-native rhythm and intonation.

Band 4
Fluency (Content of Contributions): Offers limited details or examples; offers very simple reasons or opinions; converses on topic with a little struggle.
Fluency (Communication Strategies): Active in engaging others; uses gestures and maintains eye contact.
Accuracy (Grammar & Vocabulary): Uses/understands simple grammatical structures with a few mistakes; uses/understands some new vocabulary.
Accuracy (Pronunciation): Non-native accent with several mispronunciations; non-native rhythm and intonation with slight mistakes.

Band 3
Fluency (Content of Contributions): Offers few details or examples; offers few reasons or opinions; converses on topic with struggle, and tends to wander.
Fluency (Communication Strategies): Sporadically active in engaging others; uses few gestures and maintains some eye contact.
Accuracy (Grammar & Vocabulary): Uses/understands simple grammatical structures with errors; uses/understands basic topical vocabulary.
Accuracy (Pronunciation): Non-native accent with numerous mispronunciations; non-native rhythm and intonation with several mistakes.

Band 2
Fluency (Content of Contributions): Offers few details or examples when asked; offers limited reasons or opinions when asked; struggles greatly to converse on topic and shifts to unrelated topics.
Fluency (Communication Strategies): Somewhat passive and is sometimes engaged by others; uses almost no gestures and maintains little eye contact.
Accuracy (Grammar & Vocabulary): Uses/understands simple grammatical structures with numerous errors; uses/understands some basic topical vocabulary.
Accuracy (Pronunciation): Non-native accent is heavy and interferes with comprehension; non-native rhythm and intonation interferes with comprehension.

Band 1
Fluency (Content of Contributions): Struggles greatly with details or examples when asked; struggles greatly with reasons or opinions when asked; able only to converse on unrelated topic.
Fluency (Communication Strategies): Very passive and is engaged by others; uses neither gestures nor eye contact.
Accuracy (Grammar & Vocabulary): Uses/understands simple grammatical structures with great difficulty and numerous errors; uses/understands some basic topical vocabulary with great difficulty.
Accuracy (Pronunciation): Non-native accent with many mispronunciations causing great difficulty in understanding; non-native rhythm and intonation interferes greatly with comprehension.

Band 0
Fluency (Content of Contributions): Offers no details or examples when asked; offers no reasons or opinions when asked; total breakdown.
Fluency (Communication Strategies): Doesn't participate; total communication breakdown.
Accuracy (Grammar & Vocabulary): Doesn't use simple grammatical structures; doesn't understand basic vocabulary.
Accuracy (Pronunciation): Non-native accent breaks communication; non-native rhythm and intonation stops the conversation.
Grading Procedure
During the speaking tests, the learners are graded in terms of content of conversation (40%), communication strategies (30%), grammar and vocabulary (15%) and pronunciation (15%). It should be noted that a larger portion of the grade is allocated to the content of conversation and the communication strategies (fluency-based criteria) than to grammar and vocabulary and pronunciation (accuracy-based criteria). Content of conversation relates to the ability to converse on a topic with some detail, giving reasons for opinions, while communication strategies relate to starting and closing conversations, responding to questions and soliciting information, including the use of gestures and eye contact. A simplified copy of the grading procedure can be seen below in Figure 3.
Figure 3 - Simplified Version of Grading Procedure

Content (40%): Speaks a lot; gives examples; explains why. / Speaks a little; no examples; no reasons why.
Communication and Participation (30%): Active; uses gestures; looks at partner while speaking. / Not active (passive); few gestures; doesn't look at partner when speaking.
Vocabulary and Grammar (15%): Uses new words and grammar from textbook. / Uses high-school-level English; doesn't use new words or grammar from textbook.
Pronunciation (15%): Understandable. / Very strong accent; hard to understand.

Each category is scored on a scale from 5 (good) to 1 (bad); in each category, the first description corresponds to a 5 and the second to a 1.

Score = (Content ___ x 8) + (Communication ___ x 6) + (Vocabulary/Grammar ___ x 3) + (Pronunciation ___ x 3)
The learners are reminded at the beginning of each cycle that the questions in the speaking tests will be based on the activities they will cover in the classroom. It is felt that the weighting of scores, in terms of their distribution between fluency and accuracy, reflects the required balance in CEP between required knowledge (accuracy) and skills (fluency).

The examiners use laptop computers and record the test results in formatted files. Their scores are combined after the testing so that the learners' final grades are the average of the three examiners' scores.
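The weighting and averaging arithmetic described above can be sketched as follows. This is an illustrative reconstruction only: the dictionary keys and function names are our own labels, not part of CEP's actual software. Multiplying each 1-5 criterion score by 8, 6, 3 and 3 respectively converts the scores into the 40/30/15/15 percentage weighting.

```python
# Illustrative sketch of the CEP grading arithmetic (labels are hypothetical).
WEIGHTS = {
    "content": 8,             # 5 * 8 = 40% maximum
    "communication": 6,       # 5 * 6 = 30% maximum
    "grammar_vocabulary": 3,  # 5 * 3 = 15% maximum
    "pronunciation": 3,       # 5 * 3 = 15% maximum
}

def examiner_grade(scores):
    """Weighted percentage for a single examiner's 1-5 criterion scores."""
    return sum(scores[criterion] * weight for criterion, weight in WEIGHTS.items())

def final_grade(all_examiner_scores):
    """A learner's final grade: the mean of the examiners' percentages."""
    grades = [examiner_grade(s) for s in all_examiner_scores]
    return sum(grades) / len(grades)
```

For example, a learner scoring 4, 3, 5 and 4 on the four criteria would receive 4x8 + 3x6 + 5x3 + 4x3 = 77% from that examiner.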
Norming Procedure
The CEP examiners undergo a norming procedure on a regular basis (once a month, at the end of
each cycle, just prior to the speaking tests).
The norming process is taken seriously, given the importance for the CEP examiners to understand and internalise the common testing standards in order for the speaking tests to be as fair and reliable as possible. To this end, it is also important that the examiners base their scores on the learners' performances in the test itself, and not on how they might be expected to perform based on their performance in the classroom. Some learners converse competently in the classroom yet perform poorly in the speaking tests, while for others the opposite is true, but the CEP examiners make every effort to examine only what happens in the test, regardless of their subjective opinions of the learners' abilities.
During the norming sessions, the rating band sheets (a more detailed version than those used in the classroom) are studied closely, and the terms and concepts therein are discussed to make sure that common understandings are reached. Then tape excerpts of learners taking the speaking tests are watched and each examiner assigns a grade accordingly. The examiners then discuss the grades they gave. If the scores of the examiners vary by no more than 8 percentage points from each other, it is considered that an adequate level of norming has been reached. The goal of this practice is not that the examiners will give exactly the same score to learners each and every time, but rather that the same standards are understood and applied.
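The 8-point tolerance rule amounts to a simple range check, which can be sketched as follows (the function name is ours, for illustration only):

```python
def adequately_normed(scores, tolerance=8):
    """True if no pair of examiners' scores for the same recorded
    performance differs by more than `tolerance` percentage points."""
    return max(scores) - min(scores) <= tolerance
```

For instance, scores of 72, 78 and 80 span exactly 8 points and would pass, while scores of 70 and 80 would call for further discussion.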
The learners are also exposed to the speaking test standards, so that they understand the requirements. A copy of the examiners' rating sheet written in Japanese and a simplified version of the examiners' rating sheet written in English are given to each student at the beginning of the academic year, and they are asked to grade a group of learners shown to them on a video undergoing a speaking test. After the learners discuss the scores they gave, the teachers tell them what score each student actually received from the examiners, and the reasons why. Learners are also given additional handouts to prepare them for the tests. Every effort is made to emphasise that conversational activities in the classroom are linked to the tests not only by topic but also by the required conversational authenticity and fluency that learners practice in the classroom.
Each student then prepares four questions based on the topics of the two NIC units covered in the
previous two weeks as homework, and brings them to class for the review day prior to the
speaking tests. They practice these in the classroom in groups of three for three minutes at atime, as a final preparation for the test.
How Valid and Reliable is the CEP Speaking Test?
It seems that the CEP speaking test, in its current form, has come a long way from its forerunner, and has a fairly high degree of validity and reliability for the following reasons:

- The different weightings for accuracy and fluency in the scoring sheets, with most emphasis on fluency, which remain unchanged throughout the year;
- The content of the scoring sheets also remaining the same throughout the year, in terms of terminological contents and mathematical formulae;
- The explicit nature of the terminological content of the scoring sheets, created with input from all the examiners and reflecting a combined understanding of their pedagogical stance;
- The homogeneity of the learners in terms of race, age, academic status, and socio-economic and academic background;
- Their consistency with the philosophy of CEP;
- The consistency in application, directly after the first two weeks in each cycle;
- The consistency and thoroughness of the norming procedure by the examiners, to maximise interrater reliability;
- The number of examiners, three in total;
- The process of making learners aware of the purpose of the speaking test through their exposure to the norming process in class just prior to the test;
- The practicing of three-minute conversations in rotation in the class just prior to the speaking test, simulating the test with questions created by the learners;
- The fairly large contribution of the speaking tests to the final CEP grade;
- The consistent and fairly high correlations achieved despite a new examiner joining the team and having undergone only one intensive norming session (see below);
- The rapid feedback of results, helping learners relate their scores to their performances more easily; and
- The backwash and forward-wash effects that motivate the learners to continue practicing in the classroom in preparation for the speaking tests, thus internalising the link between classroom practice and test performance.
Interrater Reliability Correlations to Establish the Validity and Reliability of theCEP Speaking Test
Interrater reliability testing is consistently carried out in CEP, but the split-half method that was
used in the forerunner of the speaking test has not been applied again, mainly because of
problems encountered with the method when it was used prior to the time of CEP. Hadley andMort (1999 p. 50) noted:
although the split-half method is used with success with many more objective test designs, it is not certain if our test instrument can be measured objectively. We suspect that this instrument may be more organic in nature, and cannot be easily separated into different parts.
The alternative of using the correlation formula provided by the Microsoft Excel software package (which applies a simple regression method, the results of which are then correlated using the Pearson r correlation coefficient), as opposed to the split-half method, has proven more convenient and accurate in CEP.
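For readers without Excel at hand, the underlying calculation is the standard Pearson product-moment correlation. A minimal sketch, assuming each examiner's scores are held in a plain list aligned by learner:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two examiners' aligned
    score lists: the same figure Excel's correlation function returns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

Identical rankings yield 1.00; values around 0.87 to 0.89, as observed in CEP, indicate strong but not perfect agreement between examiners.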
It is easy to apply, given that the scores are entered into computers and stored on the CEP intranet program, and this has been done for the previous two years (2001 and 2002). The Microsoft Excel software package makes a regression analysis possible between two variables, and thereby allows interrater correlations to be made between two examiners at a time. CEP had its first speaking tests for the current academic year on May 16th and 17th. For the first time, three examiners were used instead of two (the more examiners, the better the interrater reliability, and thereby the better the internal reliability). The interrater correlation results between the three examiners for this test are shown in Table 1.
Table 1: Interrater Reliability Correlations of the Three CEP Examiners

Correlation between Examiners A and C: 0.87
Correlation between Examiners A and B: 0.89
Correlation between Examiners C and B: 0.87
These correlations are encouraging, in that they reveal that the examiners are not giving exactly the same scores (as that would render a correlation of 1.00), and neither are the scores too different from each other (as that would render a correlation of less than 0.50). They are also slightly higher than those of the previous two years (2001 and 2002), when two examiners were used.
An interesting observation is that, while examiners B and C have been in the CEP program for two consecutive years, examiner A joined only recently, in April 2002, and underwent only one norming session with the other examiners prior to the testing. The correlations thus also suggest that the CEP speaking test has a fair degree of internal reliability, in that it renders similar outcomes irrespective of who is examining, and this also suggests that the norming procedure is working well in practice. Since this test, two more speaking tests have been undertaken and the interrater correlations have continued to be at acceptable levels.
Detailed comparisons of interrater reliability correlations from cycle to cycle (test to test) have not been made to date, but this would be a useful exercise, as it would give some idea of the changing impact of the norming procedure over time. Similar trends could be established for the learners' scores from cycle to cycle. Such comparisons could yield interesting results, particularly in an effort to measure the relationship over time between aspects of reliability and
validity (for example, comparisons between interrater correlations and backwash and forward-wash effects through changes in the learners' scores). Although this is beyond the scope of this paper, it could be the subject of interesting academic research in the future.
Conclusion
We have found it very useful to consider the aspects of validity and reliability in the creation of the CEP speaking tests. It has made us look at testing, especially oral testing, in a more critical way, and to be more aware of the need for validity and reliability, especially through the norming procedure to achieve acceptable levels of interrater reliability. It made us realise that there is a lot more to oral testing than we initially envisaged, and that maintaining an oral test in good form needs constant attention. It is not an objective test that, once created, can be filed away and taken out only at times of use. The speaking test has consequently become a living part of CEP, in that all classroom activity is given a specific purpose, and the examiners undergo norming procedures before each speaking test is administered. Before going through this rigorous process, I often questioned the need to be so thorough and precise concerning all aspects of the speaking test, particularly during the discussions in the norming sessions, and sometimes felt that we overdid it at the expense of other things that needed to be done. In hindsight, I have come to realise the relevance of what we were doing, and appreciate the product that I helped create all the more. We feel that our experience has equipped us with a better understanding of the complex dynamics of oral testing, and it will certainly benefit our future professional development.

I would therefore recommend that all serious ESL teachers who have not yet closely considered the validity and reliability of their speaking tests do so, as the insight gained would be very beneficial to their professional development as well.
References
Bachman, L. F. (1990). Fundamental Considerations in Language Testing. Oxford University Press.
Hadley, G., & Mort, J. (1999). An Investigation of Interrater Reliability in Oral Testing. Nagoya National College of Technology Journal, 35(2), 45-51. Retrieved from http://www.nuis.ac.jp/~hadley/publication/interrater/reliability.htm
Heaton, J. B. (1997). Writing English Language Tests. New York: Longman.
Henning, G. (1987). A Guide to Language Testing: Development, Evaluation, Research.Heinle & Heinle Publishers.
Hughes, A. (1989). Testing for Language Teachers. Cambridge University Press.
Hunt, D. (1998). Designing a Reading Comprehension Test for Oral English Classes. The Shizuoka Gakuen College Review Journal, 11, 61-80.
Richards, J. C. (1998). New Interchange 1: English for International Communication.Cambridge University Press.
Richards, J. C. (1998). New Interchange 2: English for International Communication.
Cambridge University Press.
Richards, J. C. (1998). New Interchange 3: English for International Communication.
Cambridge University Press.
Underhill, N. (1987). Testing Spoken Language. Cambridge University Press.
Weir, C. J. (1990). Communicative Language Testing. Prentice Hall International.