
On Common Ground? How Raters Perceive Scoring Criteria in Oral Proficiency Testing

3rd Annual Conference of EALTA

Krakow, Poland, 19-21 May 2006

Thomas Eckes

TestDaF Institute, Hagen, Germany


Overview

1. Rater Variability

2. Rater Type Hypothesis

2.1 Questionnaire Study

2.2 Raters and Criteria

3. Facets Analysis

4. Cluster Analysis

5. Summary and Discussion

Rater variability

In Rater-Mediated Assessments, Rater Variability

contributes to variance in observed ratings that is associated with raters and not with examinees

obscures the construct being measured

threatens the validity and fairness of scores awarded to examinees

Rater variability

Rater variability comes in many forms, eg:

rater differences in severity or leniency

differences in the understanding and use of rating scale categories

differences in the kind of performance features raters attend to, or attach importance to

interactions between examinees, raters, and tasks

Rater variability

Approaches to Rater Variability in Oral Proficiency Testing

Many-facet Rasch measurement (eg Bonk & Ockey, 2003; Lumley & McNamara, 1995)

Generalizability theory (Bachman, Lynch & Mason, 1995; Lee, 2006)

Other scaling approaches, eg INDSCAL (Chalhoub-Deville, 1995), Grid-Technique/Thurstone-Scaling (Pollitt & Murray, 1996)

Rater variability

Approaches to Rater Variability in Oral Proficiency Testing contd

Discourse analysis (eg Brown, 2005; Meiron & Schick, 2000)

Verbal protocol analysis (Brown, Iwashita & McNamara, 2005)

Rater variability

Many-facet Rasch measurement studies

Study                            Assessment                         N Raters   Sep. Rel.
Bachman, Lynch, & Mason (1995)   Spanish (LAAS)                     15         .92
Brown (2005)                     IELTS Speaking Module              6          .64
Eckes (2005)                     TestDaF Speaking Section           31         .98
Lumley & McNamara (1995)         OET Speaking Section               13         .89
Lynch & McNamara (1998)          Speaking Skills Module (access:)   4          1.00

Rater variability

Two recent qualitative studies

Brown (2005) – IELTS Speaking Module

Interviewers varied along a number of dimensions (eg ways to deploy topics, elicitation techniques, interactional style)

Raters varied in their views of how interviewers’ behavior influenced examinees’ performance (with an impact on ratings)

Rater variability

Brown, Iwashita & McNamara (2005) – new TOEFL project speaking tasks

Raters attended to 4 general categories (ie linguistic resources, phonology, fluency, content)

Within each category, raters considered a range of specific performance features (eg linguistic resources: grammar, vocabulary, expression, textualization)

Some evidence for rater disagreement (eg reuse of input text, disfluency associated with repair)

Content as a major focus in EAP tasks

Rater variability

Conclusions From Prior Research

Rater variability is substantial, even among trained, experienced raters

Differences in the interpretation and use of scoring criteria are an important source of rater variability

Rater variability

Raters appear to have inbuilt perceptions of what is acceptable to them . . . . even the explicitness of the descriptors and the standardization that takes place in a training session cannot remove these differences.

A. Brown (1995, p. 13)

Rater types

Rater Type Hypothesis

Experienced raters fall into types (classes, clusters) that are characterized by distinctive patterns of criterion perception.

Rater types are organized into a rater taxonomy, that is, into a hierarchical classification system relating raters and criteria to one another.

Rater types

Rater Type Hypothesis contd

Previous research in a writing proficiency context (Eckes, 2006) provided support for the hypothesis

Extensions in the present research:

Speaking proficiency

Three perceptual dimensions: (1) Importance, (2) Ease-of-application, (3) Confidence

Rater types

Questionnaire Study

Background: The TestDaF

A large-scale and high-stakes test

Designed for foreign students applying for entry to an institution of higher education in Germany

Measures German language proficiency at an intermediate to high level

Examines the four language skills in separate sections

In each section, task and item content are closely related to the academic context

Rater types

Background contd: Speaking section

Performance-based test

SOPI format (indirect speaking test)

Multitrait scoring rubric

Tasks are presented orally from tape (or CD) and in print

Responses are recorded on a second tape (or CD)

Seven tasks at variable levels of difficulty (one warm-up task)

Range of common university situations (eg discussing with fellow students, describing a diagram during a tutorial, forming hypotheses)

Rater types

Participants

Speaking section sample: 53 raters (15 men, 38 women)

Mean age: 47.7 years (SD = 9.6)

81% with 10 or more years as a DaF teacher

83% with 4 or more years as a DaF examiner

Rater monitoring covered a 3-year period, including 11 TestDaF scoring sessions

Raters generally manifested high degrees of scoring proficiency

Rater types

No.  Criterion          Scale descriptor (top level)
1    Comprehensibility  can be understood phonetically without difficulty
2    Content            content is comprehensible in every respect
3    Vocabulary         vocabulary is wide
4    Correctness        errors seldom occur
5    Adequacy           choice of linguistic means is appropriate
6    Completeness       points are dealt with sufficiently
7    Description        important information is summarized logically
8    Discussion         advantages and disadvantages are discussed
9    Standpoint         own point of view is stated conclusively

Rater types

Ratings of criteria along 3 scales:

Importance

Ease-of-application

Confidence

4-point scales, eg importance scale:

1 - Less important

2 - Important

3 - Very important

4 - Extremely important

Facets analysis

Data analysis, Part I: Facets analysis

Research questions

Degree of variability in raters’ perceptions

Degree of variability in perceived criteria

Functioning of the 4-point rating scales

Two facets

Raters (53)

Criteria (9)

Program: FACETS (Version 3.59; Linacre, 2005)
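As an illustration of the kind of model a two-facet analysis with a 4-point scale rests on, the sketch below computes category probabilities under a rating scale formulation. The sign convention (rater measure minus criterion measure minus threshold) is the common textbook form and is an assumption here; the slides do not state the orientation or the exact model specification used in FACETS. All numeric values are hypothetical, not estimates from the study.

```python
import math


def category_probabilities(rater_measure, criterion_measure, thresholds):
    """Return P(category k), k = 1..K, under a rating scale model.

    Assumed form: ln(P_k / P_{k-1}) = rater_measure - criterion_measure - tau_k.
    `thresholds` holds tau_2..tau_K (K - 1 values); all values are in logits.
    """
    # Cumulative sums of (rater - criterion - tau) give the log-numerators.
    log_numerators = [0.0]
    for tau in thresholds:
        step = rater_measure - criterion_measure - tau
        log_numerators.append(log_numerators[-1] + step)
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_numerators)
    weights = [math.exp(value - peak) for value in log_numerators]
    total = sum(weights)
    return [weight / total for weight in weights]


def expected_rating(probabilities):
    """Expected category (1-based) given the category probabilities."""
    return sum(k * p for k, p in enumerate(probabilities, start=1))


# Hypothetical rater, criterion, and thresholds for a 4-point scale.
probs = category_probabilities(1.0, 0.0, [-1.5, 0.0, 1.5])
print([round(p, 3) for p in probs], round(expected_rating(probs), 2))
```

Under this convention, a rater with a higher measure is expected to choose higher categories for the same criterion.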

Variable Map – Importance

Overall model fit ok

Separation reliability: (a) Raters: .85, (b) Criteria: .94

Chi-square homogeneity statistic highly significant (p < .01)

[Variable map: logit scale from 4 to –3. The rater column runs from "Attaching High Importance" (top) to "Attaching Low Importance" (bottom); the criterion column runs from "Highly Important" (top) to "Less Important" (bottom); the scale column shows the 4-point rating scale.]
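The separation reliability reported for raters and criteria can be read as the share of observed variance in the measures that is not attributable to measurement error. The sketch below assumes the usual definition (observed variance minus mean error variance, divided by observed variance) and uses the population variance; FACETS may use a different variance estimator, and the input values are hypothetical.

```python
from statistics import mean, pvariance


def separation_reliability(measures, standard_errors):
    """Estimate separation reliability from logit measures and their SEs.

    Assumed definition: (observed variance - mean error variance) / observed
    variance, floored at 0 when error variance exceeds observed variance.
    """
    observed_var = pvariance(measures)
    if observed_var <= 0:
        return 0.0
    error_var = mean([se ** 2 for se in standard_errors])
    true_var = max(observed_var - error_var, 0.0)
    return true_var / observed_var


# Hypothetical measures (logits) and standard errors for three elements.
print(separation_reliability([-1.0, 0.0, 1.0], [0.5, 0.5, 0.5]))
```

Values close to 1 indicate that the elements of a facet are reliably separated along the logit scale; values close to 0 indicate that the differences are largely within measurement error.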

Variable Map – Ease-of-application

Chi-square homogeneity statistic highly significant (p < .01)

[Variable map: logit scale from 2 to –4. The rater column runs from "Experiencing Much Ease" (top) to "Experiencing Less Ease" (bottom); the criterion column runs from "Highly Easy" (top) to "Not So Easy" (bottom); the scale column shows the 4-point rating scale.]

Variable Map – Confidence

Chi-square homogeneity statistic highly significant (p < .01; only raters)

[Variable map: logit scale from 3 to –2. The rater column runs from "Highly Confident" (top) to "Less Confident" (bottom); the criterion column runs from "Used With Much Confidence" (top) to "Used With Less Confidence" (bottom); the scale column shows the 4-point rating scale.]

Facets analysis

Rater Fit – Importance

Fit category   Infit   Outfit
Fit okay       24      24
Misfit         11      10
Overfit        18      19

Note. Infit and Outfit are mean-square statistics. Misfit: Fit > 1.30. Overfit: Fit < 0.70. Number of raters = 53.
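The cut-offs in the note (misfit above 1.30, overfit below 0.70) translate directly into a classification rule. The sketch below applies that rule to a list of mean-square values and tallies the categories; the function names and the example values are hypothetical.

```python
from collections import Counter


def classify_fit(mean_square, lower=0.70, upper=1.30):
    """Classify a mean-square fit statistic using the cut-offs from the note."""
    if mean_square > upper:
        return "Misfit"
    if mean_square < lower:
        return "Overfit"
    return "Fit okay"


def tally_fit(mean_squares):
    """Count how many raters fall into each fit category."""
    return Counter(classify_fit(value) for value in mean_squares)


# Hypothetical infit mean-square values for five raters.
print(tally_fit([0.55, 0.95, 1.10, 1.45, 0.70]))
```

Values exactly at 0.70 or 1.30 count as "Fit okay" here, since the note defines misfit and overfit with strict inequalities.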

Facets analysis


Rater Fit – Ease-of-Application

Fit category   Infit   Outfit
Fit okay       22      21
Misfit         13      12
Overfit        18      20


Facets analysis


Rater Fit – Confidence

Fit category   Infit   Outfit
Fit okay       20      21
Misfit         13      12
Overfit        20      20


Facets analysis

Summary of Facets Analysis Findings

Raters showed significant differences in their perceptions of criteria

Raters showed substantial degrees of misfit/overfit, indicating rater heterogeneity in criterion perception (in line with the rater type hypothesis)

Criteria differed significantly along the importance and ease-of-application dimensions

The 4-point rating scales functioned effectively

Cluster analysis

Data analysis, Part II: Cluster analysis

Research questions

Are there distinctive patterns of raters’ criterion perception along importance, ease-of-application, and confidence dimensions?

How many different patterns, or rater types, can be distinguished?

What are the rater-criterion interrelations for each of these types?

Cluster analysis

Clustering Method

Error-variance approach (Eckes & Orlik, 1993; see also Everitt, Landau, & Leese, 2001)

Main objective: Joint hierarchical classification of two different sets of elements (ie raters and criteria)

Two-mode clusters with minimum internal heterogeneity

Preprocessing of the input data (ie columnwise duplication and reflection) to cluster high-rated criteria separately from low-rated criteria

Additionally: Overlapping clustering solution
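The preprocessing step (columnwise duplication and reflection) can be pictured as follows: each criterion column is kept as is and a mirrored copy is appended, so that a low rating on a criterion becomes a high value in its reflected column. The sketch assumes reflection on the 4-point scale is computed as scale minimum plus scale maximum minus the rating; the slides do not give the exact formula, and the function name and data are hypothetical.

```python
def duplicate_and_reflect(ratings, scale_min=1, scale_max=4):
    """Append a reflected copy of every column to a raters x criteria matrix.

    `ratings` is a list of rows (one per rater). Each output row holds the
    original ratings followed by their reflections, so it is twice as wide.
    Assumed reflection: scale_min + scale_max - rating.
    """
    reflected_rows = []
    for row in ratings:
        reflected = [scale_min + scale_max - value for value in row]
        reflected_rows.append(list(row) + reflected)
    return reflected_rows


# Hypothetical ratings of three criteria by two raters.
print(duplicate_and_reflect([[4, 1, 3], [2, 2, 4]]))
```

With the doubled matrix, a cluster that contains a reflected column corresponds to a criterion marked with a minus sign, that is, one the raters in that cluster rated low.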

Cluster analysis

Number of clusters

Stepsize criterion (increase in the cluster heterogeneity index)

Cluster cohesion index

Point-biserial correlation between input data and cluster membership
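The point-biserial correlation named above is the Pearson correlation between a continuous variable and a binary one. The sketch below computes it for a flat list of data values and a 0/1 membership indicator. How the membership indicator was defined for two-mode clusters in the study is not stated in the slides; coding cells inside a rater-by-criterion cluster as 1 and all other cells as 0 is an assumption, and the example data are hypothetical.

```python
import math


def point_biserial(values, membership):
    """Point-biserial correlation between values and a 0/1 membership vector."""
    if len(values) != len(membership):
        raise ValueError("values and membership must have the same length")
    inside = [v for v, m in zip(values, membership) if m == 1]
    outside = [v for v, m in zip(values, membership) if m == 0]
    if not inside or not outside:
        raise ValueError("both membership groups must be non-empty")
    n = len(values)
    grand_mean = sum(values) / n
    # Population standard deviation of all values.
    sd = math.sqrt(sum((v - grand_mean) ** 2 for v in values) / n)
    if sd == 0:
        raise ValueError("values must not be constant")
    p = len(inside) / n
    mean_diff = sum(inside) / len(inside) - sum(outside) / len(outside)
    return mean_diff / sd * math.sqrt(p * (1 - p))


# Hypothetical ratings and cluster membership for six matrix cells.
print(round(point_biserial([4, 4, 3, 1, 2, 1], [1, 1, 1, 0, 0, 0]), 3))
```

A higher correlation indicates that cells inside the cluster hold systematically higher values than cells outside it, which is one way to compare solutions with different numbers of clusters.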

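Two-Mode Clustering Solution – Importance

[Figure: two-mode hierarchical clustering tree with clusters labeled A to F; overlapping solution added. A minus sign (-) indicates less important. Each entry lists rater numbers, followed where applicable by the criteria clustered with them.]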

37, 32, 03, 41 correct.

18, 40 -compr.

47, 44, 39, 28, 27, 49, 20, 50 -compl. -descrip.

30, 22, 19, 11, 09, 05, 04, 02, 33 -discus.

24, 15, 14, 46, 31, 52, 23, 12, 07, 48 -standp. -vocab. -correct. -adequ.

02, 09, 31, 33, 49, 15, 23, 24, 27, 28, 44 -compl.

14, 15, 24, 31, 46, 07, 23 -adequ.

02, 11

06, 17, 24, 30, 33, 34, 49, 01, 04, 26, 29, 43, 45, 53 content

43, 36, 35, 25, 42, 26, 21, 13, 08, 01, 51, 17, 16, 10, 06, 38, 53, 45, 34, 29 descrip. compl. content vocab. standp. discus. adequ. compr.

02, 09, 15, 24, 31, 46, 07, 11

Cluster analysis

Rater Types – Importance

Criteria (table columns): Comprehensibility, Content, Vocabulary, Correctness, Adequacy, Completeness, Description, Discussion, Standpoint

Cluster (N)   Entries
A (20)        + + + + + + + +
B (18)        + +
C (13)
D (16)
E (16)
F (12)

Note. + extremely important, - less important, else in between. N = number of raters per cluster (overlapping solution).


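Two-Mode Clustering Solution – Ease-of-application

[Figure: two-mode hierarchical clustering tree with clusters labeled A to F. A minus sign (-) indicates less easy. Each entry lists rater numbers, followed where applicable by the criteria clustered with them.]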

27, 41, 40, 20, 10, 49 compl.

24, 14, 04, 37 -content

52, 22, 17, 34, 33 -standp. -adequ. -correct.

43, 39, 03, 18, 12, 07, 06, 05, 48, 26, 21, 01, 51, 36 -descrip. -vocab. -discus. -compl.

05, 16, 06

31, 30, 25, 19, 16, 32 -compr.

05, 06, 12, 26, 34, 36, 48

23, 53, 35, 11, 47, 50, 02, 45, 46, 42, 29, 13, 09, 08, 38, 28, 15, 44 descrip. correct. content vocab. standp. discus. adequ. compr.

05, 06, 16, 17, 26, 49

12, 14, 19, 25, 33, 35, 38, 42, 47

Cluster analysis

Rater Types – Ease-of-Application

Criteria (table columns): Comprehensibility, Content, Vocabulary, Correctness, Adequacy, Completeness, Description, Discussion, Standpoint

Cluster (N)   Entries
A (18)        + + + + + + + +
B (13)
C (15)        +
D (10)
E (8)
F (14)

Note. + extremely easy, - not so easy, else in between. N = number of raters per cluster (overlapping solution).


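Two-Mode Clustering Solution – Confidence

[Figure: two-mode hierarchical clustering tree with clusters labeled A to F. A minus sign (-) indicates less confident. Each entry lists rater numbers, followed where applicable by the criteria clustered with them.]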

39, 26, 34, 32 -correct. -vocab.

06, 37, 17, 01, 52, 48 -content -descrip. -compl.

19, 16, 05, 31 -compr.

48, 01, 06, 07, 12, 17, 18

34, 18, 20, 40 compl.

05, 16, 01, 31, 37

27, 22, 18, 10, 07, 46 -adequ.

24, 12, 42, 09, 45, 11, 03, 15, 44, 35, 41, 40, 53, 50, 02, 08, 23, 14, 49, 38, 30, 20, 13, 43, 28, 25, 47 descrip. content vocab. standp. discus. correct. compl. compr. adequ.

29, 51, 33, 21, 04, 36 -standp. -discus.

05, 16, 21, 33, 34

Cluster analysis

Rater Types – Confidence

Criteria (table columns): Comprehensibility, Content, Vocabulary, Correctness, Adequacy, Completeness, Description, Discussion, Standpoint

Cluster (N)   Entries
A (27)        + + + + + + + + +
B (11)
C (11)
D (6)
E (8)         +
F (11)

Note. + extremely confident, - less confident, else in between. N = number of raters per cluster (overlapping solution).

Summary and discussion

(1) Rater differences in criterion perceptions

Experienced raters differed significantly

in the general importance attached to criteria

in the overall ease of criterion application

in the confidence in using each criterion adequately

(2) Distinctive rater types

Raters formed markedly different types (thus supporting the Rater Type Hypothesis)

Types were characterized by different subsets of criteria

This was particularly pronounced along the importance and ease-of-application dimensions


(3) Implications for rater training

Focus on empirically derived rater types, their strengths and weaknesses

Use of behavior-driven (or bottom-up) training procedures (Lievens, 2001) to balance raters’ attention more evenly on criteria

Rater monitoring needs to address the effects of type-based rater training on both operational rating behavior and self-reports


(4) Implications for research

Interrelations with other indicators of rater variability (eg severity/leniency, halo, rating scale use)

Influence of rater background variables (eg personal background, professional training, work experience) on interpretation and use of criteria

Combination with qualitative research strategies (eg verbal protocol analysis; Green, 1998)

Construction of type-specific rating process models


References

Bachman, L. F., Lynch, B. K., & Mason, M. (1995). Investigating variability in tasks and rater judgements in a performance test of foreign language speaking. Language Testing, 12, 238-257.

Bonk, W. J., & Ockey, G. J. (2003). A many-facet Rasch analysis of the second language group oral discussion task. Language Testing, 20, 89-110.

Brown, A. (1995). The effect of rater variables in the development of an occupation-specific language performance test. Language Testing, 12, 1-15.

Brown, A. (2005). Interviewer variability in oral proficiency interviews. Frankfurt/Main: Peter Lang.

Brown, A., Iwashita, N., & McNamara, T. (2005). An examination of rater orientations and test-taker performance on English-for-academic-purposes speaking tasks (TOEFL Monograph Series, MS-29). Princeton, NJ: Educational Testing Service.

Chalhoub-Deville, M. (1995). Deriving oral assessment scales across different tests and rater groups. Language Testing, 12, 17-33.

Eckes, T. (2005). Examining rater effects in TestDaF writing and speaking performance assessments: A many-facet Rasch analysis. Language Assessment Quarterly, 2, 197-221.

Eckes, T. (2006). Rater types in writing performance assessments: A classification approach to rater variability. Manuscript submitted.

Eckes, T., & Orlik, P. (1993). An error variance approach to two-mode hierarchical clustering. Journal of Classification, 10, 51-74.

Everitt, B. S., Landau, S., & Leese, M. (2001). Cluster analysis (4th ed.). London: Arnold.

Lee, Y.-W. (2006). Dependability of scores for a new ESL speaking assessment consisting of integrated and independent tasks. Language Testing, 23, 131-166.

Lievens, F. (2001). Assessor training strategies and their effects on accuracy, interrater reliability, and discriminant validity. Journal of Applied Psychology, 86, 255-264.

Linacre, J. M. (2005). A user’s guide to FACETS: Rasch-model computer programs [Software manual]. Chicago: Winsteps.com.

Lumley, T., & McNamara, T. F. (1995). Rater characteristics and rater bias: Implications for training. Language Testing, 12, 54-71.

Lynch, B. K., & McNamara, T. F. (1998). Using G-theory and many-facet Rasch measurement in the development of performance assessments of the ESL speaking skills of immigrants. Language Testing, 15, 158-180.

Meiron, B. E., & Schick, L. S. (2000). Ratings, raters and test performance: An exploratory study. In A. J. Kunnan (Ed.), Fairness and validation in language assessment (pp. 153-174). Cambridge: Cambridge University Press.

Pollitt, A., & Murray, N. L. (1996). What raters really pay attention to. In M. Milanovic & N. Saville (Eds.), Performance testing, cognition and assessment: Selected papers from the 15th Language Testing Research Colloquium (Studies in language testing, Vol. 3, pp. 74-91). Cambridge: University of Cambridge Local Examinations Syndicate.
