
The Challenges of Creating a Valid and Reliable Speaking Test as Part of a Communicative English Program

David Jeffrey

David Jeffrey is an instructor in the Communicative English Program at the Niigata University of International and Information Studies, Japan. He has experience both in socio-economic development in South Africa and in teaching English in Japan.


    Abstract

This paper describes the challenging evolution and present form of the speaking test, which is the backbone of the Communicative English Program (CEP) of the Niigata University of International and Information Studies (NUIS), a private university in Japan, and discusses how valid and reliable this test appears to be. CEP is a semi-intensive, skills-based teaching program, founded in 2000. It is part of the Department of Information Culture of NUIS.

One of the biggest challenges in setting up the CEP teaching program was the need to create a speaking test that could accurately measure the fluency criteria of communication (content and communication strategies), as well as the accuracy criteria (grammar, vocabulary and pronunciation). At the same time, it had to be practical and make optimal use of resources.

The CEP speaking test is described in detail, using illustrative examples from the test itself wherever possible. Attention is also given to the use of interrater reliability correlations as a measure of the consistency between the examiners while applying the testing criteria.

    Finally, the usefulness of reflecting on both the evolution and present forms of testing procedures

    is considered in terms of its potential contribution to both the professional development of

    teachers and their teaching programs.

    The Origins of the CEP Speaking Test

    The origins of the CEP speaking test can be traced back to 1999, when the coordinator of CEP,

    Hadley, was a teacher at the Nagaoka National College of Technology, in Japan. He, together

    with his co-worker Mort, created a speaking test (the forerunner of the CEP speaking test) to

assess the oral proficiency of their learners in terms of their ability to use English as a natural communicative skill. Their primary concern was to find out what their learners could do, rather than what they knew (Hadley and Mort, 1999).

    As a result, the speaking test that emerged at this time was one that measured mainly the fluency

    aspects of conversation (or the skills of making meaningful conversation), as well as the

    accuracy aspects (such as vocabulary and grammatical correctness) that are also considered an

important part of conversation. The speaking test thus gave primary attention to the fluency aspects of conversation, and secondary attention to the accuracy aspects. This had the effect of making the examining process more subjective in nature. They consequently became concerned about its internal reliability, particularly with regard to the examining process and the need to maximise interrater reliability.

    Interrater reliability measures the consistency between different examiners. Hadley and Mort

    (1999, p. 2) described it as:

the degree of correlation between two or more examiners, with the goal of determining whether they are using the same set of criteria when testing the oral proficiency of their learners.


The Speaking Test Evaluation Sheet used in these early days can be seen in Figure 1 below. Please note the different weightings applied to the testing categories, which give higher priority to communicative ability and fluency, and lower priority to features of accuracy.


Figure 1 - Speaking Test Evaluation Sheet, as Used at the Nagaoka National College of Technology in 1999

Communicative Ability: Includes length of utterances, flexibility to speakers of differing levels, complexity of responses. (Multiply by 6)
0 / 1 / 2 / 3 / 4 / 5 = ____ x 6 = ____

Fluency: Appropriate speed, pauses and discourse strategies. (Multiply by 4)
0 / 1 / 2 / 3 / 4 / 5 = ____ x 4 = ____

Vocabulary: Did the student use a wide variety of words and phrases, or use new vocabulary used in class? (Multiply by 3)
0 / 1 / 2 / 3 / 4 / 5 = ____ x 3 = ____

Non-verbal Strategies: Did the student supplement oral communication with appropriate gestures, eye contact and body language? (Multiply by 3)
0 / 1 / 2 / 3 / 4 / 5 = ____ x 3 = ____

Grammar: How accurate and appropriate was the student's grammar? (Multiply by 2)
0 / 1 / 2 / 3 / 4 / 5 = ____ x 2 = ____

Pronunciation: Was effort made to use correct intonation, or was the accent a barrier to communication? (Multiply by 2)
0 / 1 / 2 / 3 / 4 / 5 = ____ x 2 = ____
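To make the weighting concrete: with every band at its maximum of 5, the multipliers 6 + 4 + 3 + 3 + 2 + 2 = 20 yield a top score of 100. Below is a minimal Python sketch of this combination; the function and category names are ours for illustration, not part of the original test materials.

```python
# Category weights from the 1999 evaluation sheet in Figure 1.
WEIGHTS = {
    "communicative_ability": 6,
    "fluency": 4,
    "vocabulary": 3,
    "non_verbal_strategies": 3,
    "grammar": 2,
    "pronunciation": 2,
}

def weighted_score(bands: dict[str, int]) -> int:
    """Combine 0-5 band scores into a 0-100 total using the Figure 1 weights."""
    for category, band in bands.items():
        if not 0 <= band <= 5:
            raise ValueError(f"band for {category} must be 0-5, got {band}")
    return sum(weight * bands[category] for category, weight in WEIGHTS.items())

# Example: a mid-range performance.
print(weighted_score({
    "communicative_ability": 4,  # 6 x 4 = 24
    "fluency": 3,                # 4 x 3 = 12
    "vocabulary": 3,             # 3 x 3 = 9
    "non_verbal_strategies": 4,  # 3 x 4 = 12
    "grammar": 2,                # 2 x 2 = 4
    "pronunciation": 3,          # 2 x 3 = 6
}))  # prints 67
```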

It is important to stress that the aim at this time was not to achieve exactly the same results between the examiners, but rather to achieve similar results that were fairly high. Indeed, authorities that Hadley and Mort referred to, such as Heaton (1997, p. 164, in Hadley and Mort, p. 5), note that the internal reliabilities of objective tests (such as multiple-choice tests) are inherently higher than those of subjective tests (such as oral tests), and therefore what may not be considered an acceptable level for a multiple-choice test may be acceptable for an oral test. Heaton adds that a moderate level of internal reliability is in fact desirable for an oral test, because such tests rely on many uncontrolled variables within natural communicative expression, rather than the direct questions and discrete answers required by objective tests.

The split-half method was used to check the internal reliability of the speaking test at this time. It involved dividing the test into two nearly equal parts, correlating the scores for the two parts, and adjusting this coefficient using the Spearman-Brown Prophecy Formula, which, according to Henning (1987, p. 197), is used to:

adjust estimates of reliability to coincide with changes in the numbers of items or independent raters in a test.
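For a test split into two halves, the Spearman-Brown adjustment takes the half-test correlation r and estimates the full-test reliability as 2r / (1 + r). A minimal Python sketch of the procedure follows; the scores and helper name are hypothetical, not data from Hadley and Mort's study.

```python
from statistics import correlation  # Pearson's r; available in Python 3.10+

def split_half_reliability(half_a: list[float], half_b: list[float]) -> float:
    """Correlate the two half-test scores, then apply the Spearman-Brown
    Prophecy Formula to estimate the reliability of the full-length test."""
    r_half = correlation(half_a, half_b)
    return (2 * r_half) / (1 + r_half)

# Hypothetical half-test scores for eight learners.
half_a = [12, 15, 9, 18, 14, 11, 16, 13]
half_b = [11, 14, 10, 17, 15, 10, 15, 12]
print(round(split_half_reliability(half_a, half_b), 2))
```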

Hadley and Mort (1999, pp. 3-5) were disappointed with the results of their interrater reliability testing: a correlation of only 0.54 was achieved, which was lower than the desired level even when taking into consideration that speaking tests are subjective in nature. It suggested that they needed to become more aligned with each other in their common understanding of the examining criteria. They note two possible reasons for this: firstly, a general lack of confidence and feelings of distraction amongst the two examiners; secondly, the scoring bands and their meanings needed to be more explicit.

Because Hadley and Mort's (1999, p. 9) scoring bands and their meanings were somewhat inexplicit, one examiner used a basic criterion of:


    would a native speaker, who is unaccustomed with Japanese speech patterns and

    mannerisms, be able to understand this student?

    On the other hand, the second examiner used a basic criterion of:

based upon my experience of living in Japan for eight years, can I understand what this student is trying to say?

    They concluded that they had not been explicit enough in their basic pedagogic criteria for rating

    learners, and worked out a middle ground between the two, which was stated as:

    will a native speaker of English, who is sincerely open to communicating with Japanese, be

    able to understand what the learner is trying to say, even though he or she is mostly

    unaccustomed with Japanese mannerisms and speech patterns?

They also concluded that the results had helped them come to some important decisions to apply to forthcoming oral testing, particularly with respect to the examiners endeavouring to understand each other's pedagogical stance in order to improve interrater reliability. These considerations laid the basis of the CEP speaking test.

    Defining What We Wanted in a Speaking Test

    Although we had some philosophical and practical background as a foundation to the CEP

speaking test in the beginning, thanks to the work already carried out by Hadley and Mort at the Nagaoka National College of Technology, there was still much to consider in refining the

    speaking test to specifically meet the requirements of CEP and to make it as valid and reliable as

    possible, especially in terms of interrater reliability.

    Our starting point was to consider exactly what it was we wanted to achieve, and to work

    towards that goal. We began by going back to basics in considering the importance of testing,

    particularly oral testing in a communicative program.

Definitions of tests were considered as one starting point, such as Bachman's (1990, p. 20), who defines a language test as:

a measurement instrument designed to elicit a specific sample of an individual's behaviour (and) quantifies characteristics of individuals according to explicit procedures.

As well as Underhill's (1987, p. 7), who refers to a speaking test as:

a repeatable procedure in which the learner speaks and is assessed on the basis of what he says.

    Weir (1990, p. 7) says:


in testing communicative language ability we are evaluating samples of performance, in certain specific contexts of use, created under particular test constraints, for what they can tell us about a candidate's communicative capacity or language ability.

We found going back to the theoretical basics of speaking tests very beneficial as a common starting point from which to advance.

This process wasn't easy, and it was time-consuming, especially given the many demands that designing and implementing a program simultaneously placed on us. In retrospect, it is no wonder that speaking tests are generally considered a necessary evil by many teachers and learners alike. However, they remain an indispensable means of providing teachers and learners with information about the teaching and learning process.

    Hughes (1989, p. 1) says:

Many language teachers harbour a deep mistrust of tests and of testers... this mistrust is frequently well-founded. It cannot be denied that a great deal of testing is of very poor quality. Too often language tests have a harmful effect on teaching and learning; and too often they fail to measure accurately whatever it is they have intended to measure.

We were aware of this potential shortcoming. Despite our frustrations, we put in as much effort as possible to make the CEP speaking test valid and reliable, so that the testing would be done well and would lessen the potential mistrust that learners and teachers might harbour towards it, especially given the centrality of the speaking test within the CEP program.

    The Importance of Reliability and Validity

Reliability and validity were the central concepts around which we worked to create the CEP speaking test, but what is specifically meant by reliability and validity, and why are they important?

Reliability and validity are interrelated and depend on many aspects of a test. In a broad sense, Henning

    (1987, p. 198) defines validity as:

    the extent to which a test measures the ability or knowledge that it is purported to measure.


    He defines reliability as:

the consistency of the scores obtainable from a test. It is usually an estimate on a scale of zero to one of the likelihood that the test would rank testees in the same order from one administration to another proximate one (p. 198).

Reliability is therefore concerned with whether a test gives consistent results, as Underhill (1987, p. 9) says:

If the same learners are tested on two or three occasions, do they get the same score each time?

    Validity, on the other hand, is concerned with whether a test measures what it is supposed to.

    Many important aspects of tests have a bearing on validity and reliability, and some worth

    mentioning here include backwash effects, face validity, content validity and construct validity.

Hughes (1989, p. 1) states:

    the effect of testing on teaching and learning is known as backwash.

Backwash effects can be positive or negative; they have a positive effect if they motivate both teachers and learners to prepare for the tests. Related to this is the importance of considering the potential forward wash effects of tests that take place at the beginning of teaching cycles, which motivate learners to learn and perform better for future tests (Hunt, 1998, p. 68).

    Henning (1987, p. 192) defines face validity as:

a subjective impression, usually on the part of examinees, of the extent to which the test and its format fulfils the intended purpose of measurement.

    Face validity is closely associated with content validity, defined by Henning (1987, p. 190) as:

usually a non-empirical expert judgement of the extent to which the content of a test is comprehensive and representative of the content domain purported to be measured by the test.

Face and content validity therefore refer to the extent to which the test is recognisable as a fair test by learners, who thereby perform to their ability as a result. Tests that lack face and content validity cause negative backwash effects and result in student underperformance, as well as the results being contested by both teachers and learners.

    Henning (1987, p. 190) defines construct validity as:

    the validity of the constructs measured by a test.


Construct validity is related to content validity, in that it is concerned with the contents of the test and their wider context. Construct validity thus refers to whether the test shares the philosophy of the teaching program of which it is a part, and it can be measured by both statistical and intuitive methods, according to Underhill (1987, p. 106), who adds that:

construct validity is not an easy idea to work with... to reduce it to its simplest statement it says: does the test match your views on language learning? In practice, there may be little difference between construct and content validity.

These were just some of the many things that needed a good deal of consideration whilst creating the CEP speaking test.

    CEP and its Speaking Test Today

Almost three years ago, when CEP was founded, it was clear that its teaching philosophy would be communicative. What was meant by communicative and how this philosophy would be

    reflected in our teaching and testing methodology was also a matter of much contemplation.

In the first year of CEP, the co-ordinator and the two instructors of CEP were engaged in considerable innovation in syllabus design and implementation, which included extensive lesson planning and the creation of the CEP intranet server for the administrative files, a system that has proved to be extremely convenient, and where the final results of the speaking tests are stored. Our CEP website was also made at that time, and can be viewed at http://www.nuis.ac.jp/~hadley/cepweb/cep/. The first year of CEP was like building and sailing a ship on a rough sea.

Now, after intense innovation, followed by consolidation and refinement, CEP has become a semi-intensive, skills-based, English as an International Language (EIL) program. Small classes of 22 learners are streamed into six distinct levels of language proficiency, and meet once a day for 45 minutes from Monday to Friday, studying courses that focus on oral communication, listening and reading skills. CEP consists of 8 teaching cycles (4 for each semester), and each cycle lasts 3 weeks. The first 2 weeks of each cycle are devoted to classroom activities, and the last week is devoted to testing activities. Although a listening test is also undertaken during the last week, it is the speaking test that is the most important and the most challenging for both the examiners and the learners. There are thus 8 speaking tests in an academic year.

Oral communication skills are considered to be the most important part of CEP, and thus the CEP speaking test has been, and still is, given considerable attention. It is true to say that the CEP speaking test has become the backbone of the CEP program, although listening and reading tests are also undertaken. The learners consider the speaking tests to be the most demanding of the tests, and, given the communicative philosophy of CEP, the most accurate reflection of how they are managing in the program.

It should also be stressed that the tradition of considering the fluency aspects of conversation as most important has been sustained, with a significant amount of time and effort being devoted to creating the present form of the CEP speaking test and making it as valid and reliable

as possible. Accuracy is regarded in CEP as attention to, and familiarity with, aspects of form, whereas fluency is regarded as a skill (as automated knowledge). CEP recognises that too much attention to accuracy jeopardises fluency, and thus diverts the bulk of attention away from accuracy by focusing on meaning. CEP thus aims to encourage learners to concentrate more on what they are saying, and less on how they are saying it.

Learners in CEP are therefore taught to express matters that are important to them and their lives, focussing on Japanese issues as they relate to the international setting. The learners are

    encouraged to learn how to confidently and effectively communicate their concerns, cultural

    viewpoints and personal interests by taking ownership of English and using it as a means ofmeaningful interchange with people of other countries, and to relate what it means to be Japanese

    in a positive way to others in the world community. CEP thus wants its learners to learn how to

    authentically express who they are as Japanese, in English, and be able to relate who they are and

    why they think the way they do to people of other cultures.

New Interchange: English for International Communication (NIC) Levels 1, 2 and 3 (Richards, 1998) are the base texts used for the homework, listening and speaking activities. Although the NIC units are followed in the order prescribed by their writers, the sequencing of lessons within these units is determined by the policy of CEP to move from accuracy (grammar homework) to fluency (conversation), with the listening activities bridging the two.

Hence, the approach is to begin with activity-based homework activities, as a form of consciousness-raising. Homework checking takes about 5 minutes, and listening activities another 10 minutes, so that the remaining bulk of 30 minutes can be given to fluency-based conversational activities. In the conversational activities, learners are encouraged to place most emphasis on fluency (as opposed to accuracy), and conversational content and strategy, as well as physical gestures and eye contact, play important roles. Learners are taught how to open and close conversations, introduce and develop topics, and understand and use common useful expressions as well as idiomatic phrases in the classroom. The speaking test then checks whether they have internalised what they have done in the classroom, and whether they can take ownership of what they have learned and use it as a skill.

To reinforce the need to practice speaking as much as possible during the classroom activities, learners receive points in the form of plastic coins during their classroom speaking activities, which they cash in at the end of class. The points are recorded and contribute towards their final semester score. This encourages the learners to talk. They are awarded points primarily for conversational effort, for making an effort to communicate meaningfully, and points are not taken away for mistakes. For example, if the teacher asks a question during a listening exercise and a student attempts to answer the question, that student will receive a participation point, which counts towards the year-end grade, irrespective of whether the question is answered correctly or not.

In addition to measuring mainly the fluency aspects of oral communicative ability that are encouraged in the classroom, the speaking test gives a definite purpose to the homework and classroom activities of each cycle. Without doing the homework, learners find themselves unable to cope adequately with the listening and speaking activities in the classroom (given the


environmental issues and learning are:

Discuss the environmental problems of Japan
Discuss what can be done to stop the illegal dumping of rubbish
Discuss ways to help abandoned animals
Discuss good ways to improve your English skills
Discuss some new things you would like to learn to do
Discuss the reasons why you study English

One of the three learners is asked to choose a number from one to six (the number of cards in the continually shuffled pack). Without showing the corresponding card to the learners, the examiner reads the question slowly and clearly to the learners twice, and the learners then think about the question for ten seconds (this makes the task more directed and allows the weaker learners some thinking time). The learners must then discuss this question for three minutes. The examiners listen and give each student a grade (at no point in this process may examiners clarify the meaning of any word in the question, and the conversation is exclusively between the three learners). Their scores are entered into the examiners' computers, onto prepared master files containing the formulas to calculate each learner's score in accordance with the weightings of the testing categories. The learners' final scores are an average of the three examiners' grades. After the three minutes is up, the learners are asked to stop. A new group of learners is called into the room, and the process continues until all the learners have been tested.
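The master-file arithmetic itself is simple. A minimal sketch of the kind of calculation involved is shown below, using the weightings of the grading procedure described later (content x 8, communication strategies x 6, grammar and vocabulary x 3, pronunciation x 3); the function and variable names are ours, not CEP's actual spreadsheet formulas.

```python
from statistics import mean

# Category weights: at the maximum band of 5, 5 * (8 + 6 + 3 + 3) = 100.
WEIGHTS = (8, 6, 3, 3)  # content, communication, grammar/vocabulary, pronunciation

def examiner_total(bands: tuple[int, int, int, int]) -> int:
    """Weighted 0-100 total for one examiner's four 0-5 band scores."""
    return sum(w * b for w, b in zip(WEIGHTS, bands))

def final_score(examiner_bands: list[tuple[int, int, int, int]]) -> float:
    """A learner's final score: the average of the examiners' weighted totals."""
    return mean(examiner_total(bands) for bands in examiner_bands)

# Example: three examiners grade the same learner.
print(final_score([(4, 4, 3, 3), (4, 3, 3, 3), (5, 4, 3, 2)]))  # about 73.7
```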

The learners are given their results in their next class on Monday. The rapid feedback is possible by virtue of the scoring process being computerised, and is beneficial to the learners in that their memories of the tests are still fresh, so they can easily recall what they did and reflect on the scores they received.

Score Sheets

The scores of the CEP speaking test are based on two score sheets comprising rating bands with assessment criteria for the examiners. The sheets were created by the examiners in mutual agreement, and thus reflect their combined pedagogical stance.

There are two rating band sheets in CEP, one for the higher proficiency levels of classes A, B and C, and the other for the lower proficiency levels of classes D, E and F. This is to make the process as fair as possible for the learners, especially the lower proficiency learners. The different levels allow learners to undergo class work and testing under conditions that are neither too easy nor too difficult given their current proficiency levels.

Despite the different levels, attention is given to the reliability of the bands so that they do not overlap, are sufficiently described, are as free as possible from subjective and impressionistic elements entering into the evaluation, and give the examiners every chance of entering similar scores.

To make the process fair, the upper limit for the higher proficiency levels (A, B and C) does not require a student to have the ability of a native English speaker, but an ability similar to a


Japanese person who has spent three to five weeks abroad in a summer home stay program. The requirements for the lower proficiency levels (D, E and F) are 3 bands lower than the A, B and C bands, with the upper limit being an ability to converse on a simple level. Although "on a simple level" is subject to many interpretations, the examiners undergo extensive discussion during the norming sessions to be sure that a mutual understanding of what it means is reached and internalised. A simplified version of the score sheets can be seen in Figure 2.

Figure 2 - Simplified Version of Score Sheets (used in the examination room by the examiners)

CEP ABC Assessment Criteria

Band 5
Fluency (Content of Contributions): Offers many details or examples; offers valid and pertinent reasons and opinions; able to converse on topic without struggle.
Fluency (Communication Strategies): Very actively engages others; uses gestures and maintains eye contact skilfully.
Accuracy (Grammar & Vocabulary): Uses/understands complex grammatical structures often; uses/understands new and sophisticated vocabulary often.
Accuracy (Pronunciation): Native-like accent; natural rhythm and intonation.

Band 4
Fluency (Content of Contributions): Offers a few details or examples; offers reasons and opinions; able to converse on topic with minimal struggle.
Fluency (Communication Strategies): Often actively engages others; uses gestures and maintains eye contact appropriately.
Accuracy (Grammar & Vocabulary): Uses/understands complex grammatical structures with a few mistakes; uses/understands some new and sophisticated vocabulary.
Accuracy (Pronunciation): Non-native accent with very few mispronunciations; natural rhythm and intonation with slight mistakes.

Band 3
Fluency (Content of Contributions): Offers a few details or examples; offers simple reasons or opinions; able to converse on topic with struggle.
Fluency (Communication Strategies): Sporadically active in engaging others; uses few gestures and maintains some eye contact.
Accuracy (Grammar & Vocabulary): Uses/understands less complex grammatical structures; uses/understands basic topical vocabulary.
Accuracy (Pronunciation): Non-native accent with some mispronunciations; non-native rhythm and intonation.

Band 2
Fluency (Content of Contributions): Offers details or examples when asked; offers reasons or opinions when asked; greatly struggles on topic.
Fluency (Communication Strategies): Somewhat passive and is sometimes engaged by others; uses few gestures and maintains little eye contact.
Accuracy (Grammar & Vocabulary): Uses/understands simple grammatical structures; uses/understands some basic topical vocabulary.
Accuracy (Pronunciation): Non-native accent with many mispronunciations; non-native rhythm and intonation interferes with comprehension.

Band 1
Fluency (Content of Contributions): Struggles with details or examples when asked; struggles with reasons or opinions when asked; converses mainly on unrelated topic.
Fluency (Communication Strategies): Passive and is usually engaged by others; uses almost no gestures nor eye contact.
Accuracy (Grammar & Vocabulary): Uses/understands simple grammatical structures with difficulty; uses/understands some basic topical vocabulary with difficulty.
Accuracy (Pronunciation): Non-native accent with many mispronunciations causing difficulty in understanding; non-native rhythm and intonation causing difficulty in comprehension.

Band 0
Fluency (Content of Contributions): Struggles greatly with details or examples when asked; struggles greatly with reasons or opinions when asked; converses only on unrelated topic.
Fluency (Communication Strategies): Very passive; uses neither gestures nor eye contact.
Accuracy (Grammar & Vocabulary): Uses/understands simple grammatical structures with great difficulty and numerous errors; uses some basic topical vocabulary with great difficulty.
Accuracy (Pronunciation): Non-native accent with many mispronunciations causing great difficulty in understanding; non-native rhythm and intonation interferes greatly with comprehension.


CEP DEF Assessment Criteria

Band 5
Fluency (Content of Contributions): Offers simple details or examples; offers simple reasons or opinions; able to adequately converse on topic.
Fluency (Communication Strategies): Very actively engages others; uses gestures and maintains eye contact appropriately.
Accuracy (Grammar & Vocabulary): Uses/understands simple grammatical structures well; uses/understands new vocabulary well.
Accuracy (Pronunciation): Non-native accent with some mispronunciations; non-native rhythm and intonation.

Band 4
Fluency (Content of Contributions): Offers limited details or examples; offers very simple reasons or opinions; converses on topic with a little struggle.
Fluency (Communication Strategies): Active in engaging others; gestures and maintains eye contact.
Accuracy (Grammar & Vocabulary): Uses/understands simple grammatical structures with a few mistakes; uses/understands some new vocabulary.
Accuracy (Pronunciation): Non-native accent with several mispronunciations; non-native rhythm and intonation with slight mistakes.

Band 3
Fluency (Content of Contributions): Offers few details or examples; offers few reasons or opinions; converses on topic with struggle, and tends to wander.
Fluency (Communication Strategies): Sporadically active in engaging others; uses few gestures and maintains some eye contact.
Accuracy (Grammar & Vocabulary): Uses/understands simple grammatical structures with errors; uses/understands basic topical vocabulary.
Accuracy (Pronunciation): Non-native accent with numerous mispronunciations; non-native rhythm and intonation with several mistakes.

Band 2
Fluency (Content of Contributions): Offers few details or examples when asked; offers limited reasons or opinions when asked; struggles greatly to converse on topic and shifts to unrelated topics.
Fluency (Communication Strategies): Somewhat passive and is sometimes engaged by others; uses almost no gestures and maintains little eye contact.
Accuracy (Grammar & Vocabulary): Uses/understands simple grammatical structures with numerous errors; uses/understands some basic topical vocabulary.
Accuracy (Pronunciation): Non-native accent is heavy and interferes with comprehension; non-native rhythm and intonation interferes with comprehension.

Band 1
Fluency (Content of Contributions): Struggles greatly with details or examples when asked; struggles greatly with reasons or opinions when asked; able only to converse on unrelated topic.
Fluency (Communication Strategies): Very passive and is engaged by others; uses neither gestures nor eye contact.
Accuracy (Grammar & Vocabulary): Uses/understands simple grammatical structures with great difficulty and numerous errors; uses/understands some basic topical vocabulary with great difficulty.
Accuracy (Pronunciation): Non-native accent with many mispronunciations causing great difficulty in understanding; non-native rhythm and intonation interferes greatly with comprehension.

Band 0
Fluency (Content of Contributions): Offers no details or examples when asked; offers no reasons or opinions when asked; total breakdown.
Fluency (Communication Strategies): Doesn't participate; total communication breakdown.
Accuracy (Grammar & Vocabulary): Doesn't use simple grammatical structures; doesn't understand basic vocabulary.
Accuracy (Pronunciation): Non-native accent breaks communication; non-native rhythm and intonation stops the conversation.


    Grading Procedure

During the speaking tests, the learners are graded in terms of content of conversation (40%), communication strategies (30%), grammar and vocabulary (15%) and pronunciation (15%). It should be noted that a larger portion of the grade is allocated to the content of conversation and the communication strategies (fluency-based activities) than to grammar and vocabulary and pronunciation (accuracy-based activities). Content of conversation relates to the ability to converse on a topic with some detail, giving reasons for opinions, while communication strategies relate to starting and closing conversations, responding to questions and soliciting information, which includes gestures and eye contact. A simplified copy of the grading procedure can be seen below in Figure 3.

Figure 3 - Simplified Version of Grading Procedure

Content (40%, score x 8); Communication and Participation (30%, score x 6); Vocabulary and Grammar (15%, score x 3); Pronunciation (15%, score x 3). Each category is scored from 5 (good) to 1 (bad).

A score of 5 (good) means:
Content: Speaks a lot; gives examples; explains why.
Communication and Participation: Active; uses gestures; looks at partner while speaking.
Vocabulary and Grammar: Uses new words from textbook; uses grammar from textbook.
Pronunciation: Understandable.

A score of 1 (bad) means:
Content: Speaks a little; no examples; no reasons why.
Communication and Participation: Not active (passive); few gestures; doesn't look at partner when speaking.
Vocabulary and Grammar: Uses high school level English; doesn't use new words or grammar from textbook.
Pronunciation: Very strong accent; hard to understand.

Total = (Content ___ x 8 = ___) + (Communication and Participation ___ x 6 = ___) + (Vocabulary and Grammar ___ x 3 = ___) + (Pronunciation ___ x 3 = ___)


The learners are reminded at the beginning of each cycle that the questions in the speaking tests will be based on the activities they will cover in the classroom. It is felt that the weighting of scores, in terms of their distribution between fluency and accuracy, reflects the required balance in CEP between knowledge (accuracy) and skills (fluency).

The examiners use laptop computers and record the test results in formatted files. Their scores are combined after the testing so that the learners' final grades are the average of the three examiners' scores.

    Norming Procedure

    The CEP examiners undergo a norming procedure on a regular basis (once a month, at the end of

    each cycle, just prior to the speaking tests).

The norming process is taken seriously, given the importance for the CEP examiners of understanding and internalising the common testing standards in order for the speaking tests to be as fair and reliable as possible. To this end, it is also important that the examiners base their scores on the learners' performances in the test itself, and not on how they might be expected to perform based on performance in the classroom. Some learners converse competently in the classroom, yet perform poorly in the speaking tests, while others are the opposite; the CEP examiners make much effort to examine only what happens in the test, regardless of their subjective opinions of the learners' abilities.

During the norming sessions, the rating band sheets (a more detailed version than those used in the classroom) are studied closely, and the terms and concepts therein are discussed to make sure that common understandings are reached. Tape excerpts of learners taking the speaking tests are then watched, each examiner assigns a grade accordingly, and the examiners then discuss the grades they gave. If the scores of the examiners vary by no more than 8 percentage points from each other, it is considered that an adequate level of norming has been reached. The goal of this practice is not that the examiners will give exactly the same score to learners each and every time, but rather that the same standards are understood and applied.
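The 8-point agreement check is easy to express programmatically. Here is a minimal Python sketch (a hypothetical helper, not CEP's actual tooling), assuming the examiners' scores for a learner are on the 0-100 percentage scale:

```python
from itertools import combinations

def norming_reached(scores: list[float], tolerance: float = 8.0) -> bool:
    """True if no pair of examiners' scores for the same learner differs
    by more than `tolerance` percentage points."""
    return all(abs(a - b) <= tolerance for a, b in combinations(scores, 2))

print(norming_reached([72, 76, 79]))  # True: the largest gap is 7 points
print(norming_reached([72, 76, 83]))  # False: 83 - 72 = 11 points
```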

The learners are also exposed to the speaking test standards, so that they understand the requirements. A copy of the examiners' rating sheet written in Japanese and a simplified version of the examiners' rating sheet written in English are given to each student at the beginning of the academic year, and the students are asked to grade a group of learners shown to them on a video undergoing a speaking test. After the learners discuss the scores they gave, the teachers tell them what score each student on the video actually received from the examiners, and the reasons why. Learners are also given additional handouts to prepare them for the tests. All effort is made to emphasise that conversational activities in the classroom are linked to the tests, not only by topic but also by the required conversational authenticity and fluency that learners practice in the classroom.


    Each student then prepares four questions based on the topics of the two NIC units covered in the

    previous two weeks as homework, and brings them to class for the review day prior to the

speaking tests. They practice these in the classroom in groups of three for three minutes at a time, as a final preparation for the test.

    How Valid and Reliable is the CEP Speaking Test?

It seems that the CEP speaking test, in its current form, has come a long way from its forerunner, and has a fairly high degree of validity and reliability for the following reasons:

- The different weightings for accuracy and fluency in the scoring sheets, with most emphasis on fluency, which remain unchanged throughout the year;
- The content of the scoring sheets also remaining the same throughout the year, in terms of terminological contents and mathematical formulae;
- The explicit nature of the terminological content of the scoring sheets, created with input from all the examiners and reflecting a combined understanding of their pedagogical stance;
- The homogeneity of the learners in terms of race, age, academic status, and socio-economic and academic background;
- Their consistency with the philosophy of CEP;
- The consistency in application, directly after the first two weeks in each cycle;
- The consistency and thoroughness of the norming procedure by the examiners, to maximise interrater reliability;
- The number of examiners, three in total;
- The process of making learners aware of the purpose of the speaking test through their exposure to the norming process in class just prior to the test;
- The practicing of three-minute conversations in rotation in the class just prior to the speaking test, simulating the test with questions created by the learners;
- The fairly large contribution of the speaking tests to the final CEP grade;
- The consistent and fairly high correlations achieved despite a new examiner joining the team and having undergone only one intensive norming session (see below);
- The rapid feedback of results, helping learners relate their scores to their performances more easily; and
- The backwash and forward wash effects that motivate the learners to continue to practice in the classroom in preparation for the speaking tests, thus internalising the link between classroom practice and test performance.

Interrater Reliability Correlations to Establish the Validity and Reliability of the CEP Speaking Test


Interrater reliability testing is consistently carried out in CEP, but the split-half method that was used in the forerunner of the speaking test has not been applied again, mainly because of problems encountered with the method when it was used prior to the time of CEP. Hadley and Mort (1999, p. 50) noted:

although the split-half method is used with success with many more objective test designs, it is not certain if our test instrument can be measured objectively. We suspect that this instrument may be more organic in nature, and cannot be easily separated into different parts.

The alternative of using the correlation formula provided by the Microsoft Excel software package (which applies a simple regression method, correlated using the Pearson r correlation coefficient) has proven to be more convenient and accurate in CEP than the split-half method.

It is easy to apply, given that the scores are entered into computers and stored on the CEP intranet, and this has been done for the previous two years (2001 and 2002). The Microsoft Excel software package makes a regression analysis possible between two variables, and thereby allows interrater correlations to be made between two examiners at a time. CEP had its first speaking tests for the current academic year on May 16th and 17th. For the first time, three examiners were used instead of two (the greater the number of examiners, the better the interrater reliability, and thereby the better the internal reliability). The interrater correlation results between the three examiners for this test are shown in Table 1.

Table 1: Interrater Reliability Correlations of the Three CEP Examiners

Correlation between Examiner A and C: 0.87
Correlation between Examiner A and B: 0.89
Correlation between Examiner C and B: 0.87
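The same pairwise coefficients can be reproduced outside Excel. A minimal Python sketch of the Pearson r computation for two examiners' score columns follows; the scores here are hypothetical, not the actual CEP data behind Table 1.

```python
from statistics import correlation  # Pearson's r; available in Python 3.10+

# Hypothetical 0-100 speaking-test scores from two examiners,
# for the same six learners in the same order.
examiner_a = [74, 62, 88, 55, 70, 81]
examiner_b = [78, 60, 85, 59, 68, 84]

print(round(correlation(examiner_a, examiner_b), 2))  # pairwise interrater r
```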

These correlations are encouraging, in that they reveal that the examiners are not giving exactly the same scores (as that would render a correlation of 1.00), and neither are the scores too different from each other (as that would render a correlation of less than 0.50). They are also slightly higher than those of the previous two years (2001 and 2002), when two examiners were used.

An interesting observation is that, while examiners B and C have been in the CEP program for two consecutive years, examiner A joined recently, in April 2002, and underwent only one norming session with the other examiners prior to the testing. The correlations thus also suggest that the CEP speaking test has a fair degree of internal reliability, in that it renders similar outcomes irrespective of who is examining the test, and this also suggests that the norming procedure is working well in practice. Since this test, two more speaking tests have been undertaken, and the interrater correlations have continued to be at acceptable levels.

Detailed comparisons of interrater reliability correlations from cycle to cycle (test to test) have not been made to date, but this would be a useful exercise, as it would give some idea of the changing impact of the norming procedure over time. Similar trends could be established for the learners' scores from cycle to cycle. Such comparisons could yield interesting results, particularly in an effort to measure the relationship over time between aspects of reliability and


    validity (for example, comparisons between interrater correlations and backwash and forward

wash effects through changes in the learners' scores). Although this is beyond the scope of this

    paper, it could be a subject of interesting academic research in the future.

    Conclusion

We have found it very useful to consider the aspects of validity and reliability in the creation of the CEP speaking test. It has made us look at testing, especially oral testing, in a more critical way, and to be more aware of the need for validity and reliability, especially through the norming procedure to achieve acceptable levels of interrater reliability. It made us realise that there is a lot more to oral testing than we initially envisaged, and that maintaining an oral test in good form needs constant attention. It is not an objective test that, once created, can be filed away and taken out only at times of use. The speaking test has consequently become a living part of CEP, in that all classroom activity is given a specific purpose, and the examiners undergo norming procedures before each speaking test is administered. Before going through this rigorous process, I often questioned the need to be so thorough and precise concerning all aspects of the speaking test, particularly during the discussions in the norming sessions, and sometimes felt that we overdid it at the expense of other things that needed to be done. In hindsight, I have come to realise the relevance of what we were doing, and appreciate the product that I helped create all the more. We feel that our experience has equipped us with a better understanding of the complex dynamics of oral testing, and that it will certainly benefit our future professional development.

I would therefore recommend that all serious ESL teachers who have not yet closely considered the validity and reliability of their speaking tests do so, as the insight gained would be very beneficial to their professional development as well.


    References

Bachman, L. F. (1990). Fundamental Considerations in Language Testing. Oxford University Press.

Hadley, G. & Mort, J. (1999). An Investigation of Interrater Reliability in Oral Testing. Nagaoka National College of Technology Journal, 35(2), 45-51. Retrieved from http://www.nuis.ac.jp/~hadley/publication/interrater/reliability.htm

    Heaton, J. B. (1997). Writing English Language Tests. New York: Longman.

Henning, G. (1987). A Guide to Language Testing: Development, Evaluation, Research. Heinle & Heinle Publishers.

Hughes, A. (1989). Testing for Language Teachers. Cambridge University Press.

Hunt, D. (1998). Designing a Reading Comprehension Test for Oral English Classes. The Shizuoka Gakuen College Review Journal, 11, 61-80.

Richards, J. C. (1998). New Interchange 1: English for International Communication. Cambridge University Press.

    Richards, J. C. (1998). New Interchange 2: English for International Communication.

    Cambridge University Press.

    Richards, J. C. (1998). New Interchange 3: English for International Communication.

    Cambridge University Press.

    Underhill, N. (1987). Testing Spoken Language. Cambridge University Press.

    Weir, C. J. (1990). Communicative Language Testing. Prentice Hall International.
